Once again, a long time has passed since the last blog post. Why is that? Well, I've been a bit busy with work, the PHP Unconference in Hamburg, getting married, going to Helsinki for our honeymoon and all that kind of stuff. You know, the usual.

On the PHP-related side of this post, I've created yet another open source project which has already proven useful for the company I've been working with for the last year - namely NorthClick, which are the same guys who are behind Jimdo.com (the free website creator that will undoubtedly one day swallow up Weebly and MySpace, ahem).
While working on some really old code that provided a fulltext search feature, I was at one point incredibly pissed, er, rather unsatisfied, because said code resisted all attempts to debug it. This led to the decision to sit down on a rainy weekend and see if I could come up with something more useful and, most importantly, scalable. After about three hours of trial and error with Zend_Service_REST and Zend_Search_Lucene, I had a working prototype of a service-oriented fulltext index.
The basic idea was to decouple the indexing logic from the application logic, making the fulltext feature completely independent of the main application. The solution for this is a simple XML-based webservice that you throw your documents at, and that will allow querying the index later on.
At first I had no idea that a solution like this already existed, until I talked to Peter Petermann at last fall's International PHP Conference in Frankfurt. He pointed me to the Apache Solr project, which does pretty much the same thing, but is implemented in Java. Since I wasn't yet happy with the auto-generated XML response syntax and the webservice API of my own project, I decided to just steal, er, borrow Solr's ideas and make my API and XML syntax somewhat compatible with theirs, so switching between the two projects should be easy (in theory; I haven't tried it yet).
As a result of all this, Marjory was born and can now be found at Google Code.
Here's a quick example of how easily a document can be added to Marjory:
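$data = <<<EOD
<add catalog="default">
<doc src="http://my.website.tld/my/document.html" />
</add>
EOD;
// NOTE: the endpoint URL and the POST method are assumptions; the original call was lost from this listing.
$context = stream_context_create(array('http' => array(
    'method' => 'POST',
    'header' => 'Content-type: text/xml',
    'content' => $data)));
file_get_contents('http://marjory.example.com/rest/update', false, $context);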
Well, okay, this will only work for HTML documents located on a webserver. What if your application reads, say, a PDF, extracts text from it and wants to store that in the fulltext index? Just alter the XML:
<add catalog="default">
<doc uri="MyUniqueDocumentId">
<field name="title">Marjory: Search as a service</field>
<field name="abstract">An epic novel about full-text indexing in an SOA environment</field>
<field name="content">Lorem ipsum dolor sit amet... (to be continued)</field>
</doc>
</add>
People with Solr experience will find the above syntax a lot more familiar. The first example is just a convenient shortcut for indexing documents located on the web, which was a requirement for the NorthClick CMS.
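If your application assembles these fields at runtime, it is safer to generate the XML than to concatenate strings, since extracted text may well contain `&` or `<`. Here's a minimal sketch of how that could look using PHP's DOM extension. Note that `buildAddRequest()` is a hypothetical helper of mine, not part of Marjory; it only mirrors the `<add>`/`<doc>`/`<field>` structure shown above.

```php
<?php
// Hypothetical helper (not part of Marjory): builds an <add> request body
// from a document URI and an associative array of field name => value.
function buildAddRequest($uri, array $fields, $catalog = 'default')
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    $add = $dom->createElement('add');
    $add->setAttribute('catalog', $catalog);
    $dom->appendChild($add);

    $doc = $dom->createElement('doc');
    $doc->setAttribute('uri', $uri);
    $add->appendChild($doc);

    foreach ($fields as $name => $value) {
        $field = $dom->createElement('field');
        $field->setAttribute('name', (string) $name);
        // Text nodes are escaped on serialization, so &, < and > are safe here
        $field->appendChild($dom->createTextNode((string) $value));
        $doc->appendChild($field);
    }

    // Serialize only the <add> element, without the XML declaration
    return $dom->saveXML($add);
}

$data = buildAddRequest('MyUniqueDocumentId', array(
    'title'   => 'Marjory: Search as a service',
    'content' => 'Text extracted from a PDF, with <angle brackets> & ampersands',
));
echo $data, "\n";
```

The resulting string can be used as the request body in exactly the same way as the hand-written XML.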
And this is how documents can be searched once they've been indexed:
$xml = simplexml_load_file('http://marjory.example.com/rest/select?q=Marjory');
foreach ($xml->xpath('//doc') as $document) {
    printf("\nFound document: %s\n", (string) $document['uri']);
    foreach ($document->str as $field) {
        printf("Field %s contains value: %s\n", (string) $field['name'], (string) $field);
    }
}
Easy, right? There's more that can be done, like limiting the number of results from a query, specifying the fields returned or specifying the catalog to search in. Marjory also lets you write your own document parsers or adaptors for search engines other than Zend_Search_Lucene.
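To give a rough idea of what such a query might look like, here's a small sketch that assembles a select URL with extra options. Careful, though: the parameter names are assumptions. `rows` and `fl` are simply what Solr uses for the result limit and the field list, `catalog` is a guess based on the attribute name in the XML above, and `buildSelectUrl()` is a hypothetical helper. Check the Marjory documentation for the names it actually accepts.

```php
<?php
// Hypothetical helper: appends query options to the select endpoint.
// The option names used below ("rows", "fl", "catalog") are assumptions
// borrowed from Solr or guessed, not confirmed Marjory parameters.
function buildSelectUrl($baseUrl, $query, array $options = array())
{
    $params = array_merge(array('q' => $query), $options);
    // Pass '&' explicitly so the result does not depend on php.ini settings
    return $baseUrl . '?' . http_build_query($params, '', '&');
}

$url = buildSelectUrl('http://marjory.example.com/rest/select', 'Marjory', array(
    'rows'    => 10,               // assumed: maximum number of results
    'fl'      => 'title,abstract', // assumed: fields to return
    'catalog' => 'default',        // assumed: catalog to search in
));
echo $url, "\n";
```

The URL can then be handed to `simplexml_load_file()` just like in the search example above.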
The repository also contains code that shows how to use Marjory with Dropr, the message queueing system developed by fellow Jimdoers Sönke Rümpler and Boris Erdmann. Using a message queue instead of firing requests directly at the webservice will further improve your application's performance (no more waiting for the indexer to finish processing your document) and make the indexing process more fail-safe (should something go wrong during processing, Dropr will simply try again). If you haven't heard about Dropr yet, have a look - it has proven incredibly useful for distributed architectures like Jimdo.com.
If you're interested in how exactly Marjory works and what it can do for you, look at the "basics" section of the documentation or download the slides of the presentation I gave at the recent Unconference in Hamburg.