Zend_Search_Lucene Quick Start
Posted in PHP, Programming and Zend Framework on Tuesday, the 3rd of June, 2008.
Tagged: lamp, php and zendframework
I recently had a spontaneous urge to add a search form to my weblog - this one you're reading right now - and it seemed like a good opportunity to have a look at Zend_Search_Lucene.
I'm really impressed with the simplicity and power of the module. Sadly the documentation, whilst extensive, isn't particularly clear - so here's a quick overview of getting Zend_Search_Lucene up and running.
For the uninitiated, Apache Lucene is an open-source indexing and search tool written in Java, and Zend_Search_Lucene is a pure PHP 5 implementation of Lucene [1] that ships with Zend Framework.
Indexing
Before we can do any searching, we need to initialise an index. This is done through the Zend_Search_Lucene::create() method. Indexes are stored on disk, so we will need to create a directory which is readable and writeable by whichever user the script will run as. I've imaginatively called that /path/to/index for the purposes of this post.
Here's an example script which initialises the index, and adds three documents to it, ready for searching:
<?php
// Assumes Zend Framework is on the include_path.
require_once 'Zend/Search/Lucene.php';

// Create a new index in the given directory.
$index = Zend_Search_Lucene::create('/path/to/index/');

// Document 1
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::unIndexed('title', 'Item number 1'));
$doc->addField(Zend_Search_Lucene_Field::text('contents', 'cow elephant dog hamster'));
$index->addDocument($doc);

// Document 2
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::unIndexed('title', 'Item number 2'));
$doc->addField(Zend_Search_Lucene_Field::text('contents', 'cow aardvark dog hamster'));
$index->addDocument($doc);

// Document 3
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::unIndexed('title', 'Item number 3'));
$doc->addField(Zend_Search_Lucene_Field::text('contents', 'cow elephant dog esquilax elephant'));
$index->addDocument($doc);

// Write the pending changes to disk.
$index->commit();
It's important not to overlook that final call to commit() - nothing will work without that. The 'title' field is unIndexed as we won't be searching on it, merely displaying it in our list of results. The 'contents' field is text, and this will be indexed for searching.
Where you get your document data from is completely up to you. It might be an RSS feed, a website crawler or - as in my case - a tiny PHP cron script which queries the weblog table in my database.
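To make that a little more concrete, here's a minimal sketch of how rows from any of those sources might be turned into documents. The function names buildDocument() and indexRows(), and the 'title' and 'contents' array keys, are my own inventions for illustration rather than anything Zend_Search_Lucene provides; the fields mirror the ones used in the script above.

```php
<?php
// Assumes Zend Framework is on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Build one index document from a row of data.
 * The 'title' and 'contents' keys are hypothetical column names.
 */
function buildDocument(array $row)
{
    $doc = new Zend_Search_Lucene_Document();
    // Stored for display only, not searchable.
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('title', $row['title']));
    // Tokenised and indexed for searching.
    $doc->addField(Zend_Search_Lucene_Field::text('contents', $row['contents']));
    return $doc;
}

/**
 * Add every row to the given index and commit once at the end.
 * Returns the number of documents added.
 */
function indexRows($index, array $rows)
{
    $added = 0;
    foreach ($rows as $row) {
        $index->addDocument(buildDocument($row));
        $added++;
    }
    $index->commit();
    return $added;
}
```

With that in place, a cron script could fetch its rows however it likes and call indexRows(Zend_Search_Lucene::create('/path/to/index/'), $rows) to rebuild the index from scratch.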
Either way, that's our index created. Since an index is no use unless you query it, let's have a look at how we can do that.
Searching
Here's about the simplest search you can possibly do with Zend_Search_Lucene:
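<?php
// Assumes Zend Framework is on the include_path.
require_once 'Zend/Search/Lucene.php';

// Open the existing index created earlier.
$index = Zend_Search_Lucene::open('/path/to/index/');

// Find documents whose 'contents' field contains 'elephant'.
$results = $index->find('contents:elephant');

foreach ($results as $result) {
    echo $result->score, ' :: ', $result->title, "\n";
}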
The 'contents:elephant' query specifies that we wish to search for documents whose 'contents' field contains the term 'elephant'. That runs in a flash, and produces the following output:
0.61871843353823 :: Item number 3
0.5 :: Item number 1
As you can see, the two Zend_Search_Lucene_Document objects which contain the word 'elephant' are returned, ordered by descending 'score'. Item 3 contains the word twice, which is why it receives the highest score.
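If you're curious where those numbers come from, here's a rough sketch of the arithmetic. It assumes the classic Lucene default similarity (term frequency weight of sqrt(frequency), inverse document frequency of 1 + ln(numDocs / (docFreq + 1)), and a length norm of 1 / sqrt(number of terms in the field)), and it only covers a single-term query like ours. The approximateScore() function is purely illustrative and isn't part of Zend_Search_Lucene.

```php
<?php
/**
 * Approximate the score of a single-term query against one field value.
 * Assumes the classic Lucene default similarity; for a one-term query
 * the score reduces to tf * idf * lengthNorm.
 */
function approximateScore($contents, $term, $numDocs, $docFreq)
{
    // Naive whitespace tokenisation, good enough for this example.
    $terms = preg_split('/\s+/', strtolower(trim($contents)));
    $freq = count(array_keys($terms, strtolower($term), true));
    if ($freq === 0) {
        return 0.0;
    }

    $tf = sqrt($freq);                            // more occurrences, higher weight
    $idf = 1.0 + log($numDocs / ($docFreq + 1));  // rarer terms score higher
    $lengthNorm = 1.0 / sqrt(count($terms));      // shorter fields score higher

    return $tf * $idf * $lengthNorm;
}

// Three documents in the index, two of which contain 'elephant'.
printf("%.4f\n", approximateScore('cow elephant dog esquilax elephant', 'elephant', 3, 2)); // 0.6325
printf("%.4f\n", approximateScore('cow elephant dog hamster', 'elephant', 3, 2));           // 0.5000
```

Item 1 comes out at exactly 0.5, matching the real output. Item 3 comes out at roughly 0.6325 rather than the 0.6187 reported above; that gap is consistent with Lucene storing the length norm in a single byte, which would round 1 / sqrt(5) down to 0.4375 and give sqrt(2) x 0.4375 = 0.6187.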
Of course, there are far more features than I've even hinted at here, so I'll more than likely return to Zend_Search_Lucene in a further post looking at some of the more advanced stuff, but for now, that's your lot.
Footnotes
[1] Incidentally, the index files created by Zend_Search_Lucene are entirely compatible with those created by Apache Lucene, allowing the two implementations to interoperate happily, should the need arise.
Posted by Ciaran McNulty on Sunday, the 8th of June, 2008.
Out of interest, why index on a schedule rather than on an update?