Zend_Search_Lucene Quick Start

Posted in PHP, Programming and Zend Framework on Tuesday, the 3rd of June, 2008.


I recently had a spontaneous urge to add a search form to my weblog - this one you're reading right now - and it seemed like a good opportunity to have a look at Zend_Search_Lucene.

I'm really impressed with the simplicity and power of the module. Sadly the documentation, whilst extensive, isn't particularly clear - so here's a quick overview of getting Zend_Search_Lucene up and running.

For the uninitiated, Apache Lucene is an open-source indexing and search library written in Java, and Zend_Search_Lucene is the pure PHP 5 implementation of Lucene [1] that ships with Zend Framework.

Indexing

Before we can do any searching, we need to initialise an index. This is done through the Zend_Search_Lucene::create() method. Indexes are stored on disk, so we will need to create a directory which is readable and writeable by whichever user the script will run as. I've imaginatively called that /path/to/index for the purposes of this post.
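If you'd rather have the script check that directory than trust it, here's a minimal sketch using only PHP's built-in filesystem functions. The helper name ensureIndexDirectory() and its error messages are my own invention for illustration and are not part of Zend Framework; the demo path under the system temp directory is likewise just an assumption.

```php
<?php

/**
 * Hypothetical helper (not part of Zend Framework): make sure the index
 * directory exists and is readable and writable by the current user.
 */
function ensureIndexDirectory($path)
{
    // Create the directory (and any missing parents) if it is not there yet.
    if (!is_dir($path) && !mkdir($path, 0775, true)) {
        throw new RuntimeException("Could not create index directory: $path");
    }

    // The indexer needs both read and write access.
    if (!is_readable($path) || !is_writable($path)) {
        throw new RuntimeException("Index directory is not readable and writable: $path");
    }

    return $path;
}

// Illustrative usage with a throwaway path under the system temp directory.
$indexPath = ensureIndexDirectory(sys_get_temp_dir() . '/lucene-index-demo');
```

The returned path could then be handed to Zend_Search_Lucene::create() in place of a hard-coded string.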

Here's an example script which initialises the index, and adds three documents to it, ready for searching:

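<?php
 
// Assumes the Zend Framework library directory is on the include_path.
require_once 'Zend/Search/Lucene.php';
 
// Create a new index in the given directory.
$index = Zend_Search_Lucene::create('/path/to/index/');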
 
$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 1') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow elephant dog hamster') );
$index->addDocument($doc);
 
$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 2') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow aardvark dog hamster') );
$index->addDocument($doc);
 
$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 3') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow elephant dog esquilax elephant') );
$index->addDocument($doc);
 
$index->commit();

It's important not to overlook that final call to commit() - nothing will work without that. The 'title' field is unIndexed as we won't be searching on it, merely displaying it in our list of results. The 'contents' field is text, and this will be indexed for searching.

Where you get your document data from is completely up to you. It might be an RSS feed, a website crawler or - as in my case - a tiny PHP cron script which queries the weblog table in my database.
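To make the database case a little more concrete, here's a rough sketch of turning an array of rows into documents. The function name indexRows() and the column names 'url', 'title' and 'body' are assumptions made up for this example, and fetching the rows is left entirely to you. It also uses two further field types from Zend_Search_Lucene_Field: keyword (stored and indexed, but not tokenised) and unStored (indexed, but not kept in the index for display).

```php
<?php

// Assumes Zend Framework 1 is already loaded, for example via
// require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: add one document per row to an open index.
 *
 * @param object $index an index from Zend_Search_Lucene::create() or ::open()
 * @param array  $rows  rows with assumed keys 'url', 'title' and 'body'
 */
function indexRows($index, array $rows)
{
    foreach ($rows as $row) {
        $doc = new Zend_Search_Lucene_Document();

        // keyword: stored and indexed as a single untokenised value.
        $doc->addField(Zend_Search_Lucene_Field::keyword('url', $row['url']));

        // text: tokenised, indexed and stored, so it can be searched and shown.
        $doc->addField(Zend_Search_Lucene_Field::text('title', $row['title']));

        // unStored: tokenised and indexed, but not stored, which keeps the
        // index smaller when the full body is not needed in the results.
        $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $row['body']));

        $index->addDocument($doc);
    }

    // Write the pending changes to disk once, after all rows are added.
    $index->commit();
}
```

A cron script along the lines of mine would then boil down to running one query, passing the resulting rows to indexRows(), and exiting.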

Either way, that's our index created. Since an index is no use unless you query it, let's have a look at how we can do that.

Searching

Here's about the simplest search you can possibly do with Zend_Search_Lucene:

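<?php
 
// Assumes the Zend Framework library directory is on the include_path.
require_once 'Zend/Search/Lucene.php';
 
// Open the existing index created by the indexing script.
$index   = Zend_Search_Lucene::open('/path/to/index/');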
$results = $index->find('contents:elephant');
 
foreach ( $results as $result ) {
	echo $result->score, ' :: ', $result->title, "\n";
}

The 'contents:elephant' query specifies that we wish to search for documents whose 'contents' field contains the term 'elephant'. That runs in a flash, and produces the following output:

0.61871843353823 :: Item number 3
0.5 :: Item number 1

As you can see, the two Zend_Search_Lucene_Document objects which contain the word 'elephant' are returned, ordered by descending 'score'. Item 3 contains the word twice, which is why it receives the highest score.
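For the curious, those two numbers can be approximated by hand. What follows is a sketch under assumptions rather than a description of Zend_Search_Lucene's internals: I'm assuming the classic Lucene weighting, where term frequency contributes sqrt(occurrences), field length contributes 1 / sqrt(number of terms), and that length factor is stored with only a few bits of precision. I've left out inverse document frequency and query normalisation on the assumption that they cancel out for a single-term query against this tiny index. The function names are hypothetical.

```php
<?php

/**
 * Hypothetical helper: truncate a positive float to three mantissa bits,
 * imitating (by assumption) the lossy one-byte storage of the length factor.
 */
function quantizeNorm($x)
{
    $exponent = 0;
    $mantissa = $x;

    // Scale the value into the range [1, 2), tracking the power of two.
    while ($mantissa < 1.0) {
        $mantissa *= 2.0;
        $exponent--;
    }
    while ($mantissa >= 2.0) {
        $mantissa /= 2.0;
        $exponent++;
    }

    // Keep three fractional bits of the mantissa and drop the rest.
    $mantissa = floor($mantissa * 8.0) / 8.0;

    return $mantissa * pow(2.0, $exponent);
}

/**
 * Hypothetical helper: approximate score for a single-term query.
 */
function sketchScore($occurrences, $fieldLength)
{
    $termFrequency = sqrt($occurrences);
    $lengthNorm    = quantizeNorm(1.0 / sqrt($fieldLength));

    return $termFrequency * $lengthNorm;
}

echo sketchScore(2, 5), "\n"; // Item number 3: 'elephant' twice in 5 terms
echo sketchScore(1, 4), "\n"; // Item number 1: 'elephant' once in 4 terms
```

Under those assumptions the sketch comes out at roughly 0.6187 and 0.5, in line with the output above. It also suggests that Item 3's longer 'contents' field costs it a little: the second 'elephant' pushes the score up, while the extra term pulls it back down slightly.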

Of course, there are far more features than I've even hinted at here, so I'll more than likely return to Zend_Search_Lucene in a further post looking at some of the more advanced stuff, but for now, that's your lot.
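As one small taste before I go: the query language accepts several field:term clauses joined with boolean operators such as AND and OR. Here's a tiny sketch that builds such a query string. The helper buildFieldQuery() is hypothetical, and it assumes the terms are plain words with no special query characters that would need escaping.

```php
<?php

/**
 * Hypothetical helper: build a query such as
 * 'contents:elephant AND contents:dog' from a field name and plain terms.
 */
function buildFieldQuery($field, array $terms, $operator = 'AND')
{
    $clauses = array();
    foreach ($terms as $term) {
        // Each clause restricts one term to the given field.
        $clauses[] = $field . ':' . $term;
    }

    // Join the clauses with the boolean operator, padded by spaces.
    return implode(' ' . $operator . ' ', $clauses);
}

echo buildFieldQuery('contents', array('elephant', 'dog')), "\n";
echo buildFieldQuery('contents', array('aardvark', 'esquilax'), 'OR'), "\n";
```

The resulting string can be passed to $index->find() just like 'contents:elephant' was. Against the three sample documents, I'd expect the first query to match Items 1 and 3, and the second to match Items 2 and 3.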

Footnotes

[1] Incidentally, the index files created by Zend_Search_Lucene are entirely compatible with those created by Apache Lucene, allowing the two implementations to interoperate happily, should the need arise.

Comments

Posted by Ciaran McNulty on Sunday, the 8th of June, 2008.

Out of interest, why index on a schedule rather than on an update?

Posted by Simon Harris on Saturday, the 28th of June, 2008.

Absolutely no reason other than simplicity! I'm not using any kind of CMS as it stands, so there's not really anywhere to hook the indexer in.

If I were integrating Zend_Search_Lucene with a CMS I'd want to look at - as you say - triggering the indexer on an update event, and having it run asynchronously. The Zend Platform (of which, more later!) "job queue" looks quite neat for that kind of thing.

Posted by Clive on Friday, the 11th of July, 2008.

Short, sweet, simple and super - thank you for this. :-)

Posted by Zeno on Thursday, the 22nd of January, 2009.

It is very useful. Also take a look at this other article (Search engine indexing):

http://devzone.zend.com/node/view/id/91

Posted by Alok on Friday, the 10th of April, 2009.

great article concise and straight.

Posted by Cristobal on Tuesday, the 16th of February, 2010.

How long does Zend_Search_Lucene take to index? I'm indexing 10 documents of 100 KB each and it takes 2 minutes ... what if I index 1000 files?

Ty.

Posted by Prasad on Thursday, the 28th of February, 2013.

Very nice article. Very useful.
Thanks for sharing ..!!!

Posted by Doug on Wednesday, the 29th of October, 2014.

I know this article is very old now but still relevant I think.

The issue I'm having right now is: how can I build an index from an existing database? I have a table I wish to index which contains 2500+ rows. Is there an automated way to turn the entire table into an index?
I use Symfony2 and Doctrine if that helps at all.
