3 Jun 2008, 9:45 p.m.

Zend_Search_Lucene Quick Start

I recently had a spontaneous urge to add a search form to my weblog - this one you're reading right now - and it seemed like a good opportunity to have a look at Zend_Search_Lucene.

I'm really impressed with the simplicity and power of the module. Sadly the documentation, whilst extensive, isn't particularly clear - so here's a quick overview of getting Zend_Search_Lucene up and running.

For the uninitiated, Apache Lucene is an open-source indexing and search library written in Java, and Zend_Search_Lucene is the pure PHP 5 implementation of Lucene [1] that ships with Zend Framework.

Indexing

Before we can do any searching, we need to initialise an index. This is done through the Zend_Search_Lucene::create() method. Indexes are stored on disk, so we will need to create a directory which is readable and writeable by whichever user the script will run as. I've imaginatively called that /path/to/index for the purposes of this post.

Here's an example script which initialises the index, and adds three documents to it, ready for searching:


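<?php

// Assumes Zend Framework is on your include_path
require_once 'Zend/Search/Lucene.php';

// Create a new index in the given directory
$index = Zend_Search_Lucene::create('/path/to/index/');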

$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 1') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow elephant dog hamster') );
$index->addDocument($doc);

$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 2') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow aardvark dog hamster') );
$index->addDocument($doc);

$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 3') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow elephant dog esquilax elephant') );
$index->addDocument($doc);

$index->commit();

Don't overlook that final call to commit() - it writes the newly added documents out to the index on disk. The 'title' field is unIndexed: its value is stored with the document so that we can display it in our list of results, but we can't search on it. The 'contents' field is text, so it is tokenised and indexed for searching (and stored as well).

Where you get your document data from is completely up to you. It might be an RSS feed, a website crawler or - as in my case - a tiny PHP cron script which queries the weblog table in my database.
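As a rough sketch of that kind of cron script, here is one way to build the index from database rows. The connection details, the `weblog` table and its `id`, `title` and `body` columns are all hypothetical, so adjust them to your own schema. It uses unStored for the body text, which indexes the text without keeping a copy in the index.

```php
<?php

// Assumes Zend Framework is on your include_path
require_once 'Zend/Search/Lucene.php';

// Hypothetical connection details and schema: replace with your own
$pdo = new PDO('mysql:host=localhost;dbname=blog', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Start a new index on every run, then add one document per row
$index = Zend_Search_Lucene::create('/path/to/index/');

foreach ($pdo->query('SELECT id, title, body FROM weblog') as $row) {
	$doc = new Zend_Search_Lucene_Document();

	// Stored but not searchable: used to link back to the post.
	// Named 'postId' because search hits already expose their own 'id'.
	$doc->addField(
		Zend_Search_Lucene_Field::unIndexed('postId', (string) $row['id']) );
	$doc->addField(
		Zend_Search_Lucene_Field::unIndexed('title', $row['title']) );

	// Searchable but not stored: keeps the index small
	$doc->addField(
		Zend_Search_Lucene_Field::unStored('contents', $row['body']) );

	$index->addDocument($doc);
}

$index->commit();
```

Rebuilding from scratch keeps the script simple. For a large table you would probably want to update only the rows that have changed, which this sketch doesn't attempt.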

Either way, that's our index created. Since an index is no use unless you query it, let's have a look at how we can do that.

Searching

Here's about the simplest search you can possibly do with Zend_Search_Lucene:


<?php

// Assumes Zend Framework is on your include_path
require_once 'Zend/Search/Lucene.php';

$index   = Zend_Search_Lucene::open('/path/to/index/');
$results = $index->find('contents:elephant');

foreach ( $results as $result ) {
	echo $result->score, ' :: ', $result->title, "\n";
}

The 'contents:elephant' query specifies that we wish to search for documents whose 'contents' field contains the term 'elephant'. That runs in a flash, and produces the following output:

0.61871843353823 :: Item number 3
0.5 :: Item number 1

As you can see, the two documents which contain the word 'elephant' are returned (as Zend_Search_Lucene_Search_QueryHit objects, which give access to the stored fields), ordered by descending 'score'. Item 3 contains the word twice, which is why it receives the highest score.
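Those two scores can be reproduced by hand, assuming Zend_Search_Lucene's default similarity follows Apache Lucene's classic formula: a term-frequency factor of sqrt(frequency) multiplied by a field-length norm of 1/sqrt(number of terms). With three documents, two of which contain 'elephant', the inverse-document-frequency factor 1 + ln(3 / (2 + 1)) works out to exactly 1, so it drops out. The function names below are my own, and the stored norm of 0.4375 is an assumption based on Lucene keeping norms in a single byte, which rounds 0.447 down.

```php
<?php

// Term-frequency factor in Lucene's classic similarity (assumed here)
function tfFactor($freq) {
	return sqrt($freq);
}

// Field-length norm: shorter fields score higher (assumed here)
function lengthNorm($numTerms) {
	return 1.0 / sqrt($numTerms);
}

// Item 1: 'cow elephant dog hamster' - 4 terms, 'elephant' once
$item1 = tfFactor(1) * lengthNorm(4);

// Item 3: 'cow elephant dog esquilax elephant' - 5 terms, 'elephant' twice
$item3Exact = tfFactor(2) * lengthNorm(5);

// Assumption: the norm is stored in one byte, so 0.447... becomes 0.4375
$item3Stored = tfFactor(2) * 0.4375;

echo round($item1, 4), "\n";       // 0.5
echo round($item3Exact, 4), "\n";  // 0.6325
echo round($item3Stored, 4), "\n"; // 0.6187
```

The last figure matches the 0.6187... reported for Item 3, which suggests the assumptions hold for this index, but treat it as an illustration rather than a description of the library's internals.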

Of course, there are far more features than I've even hinted at here, so I'll more than likely return to Zend_Search_Lucene in a further post looking at some of the more advanced stuff, but for now, that's your lot.

Footnotes

[1] Incidentally, the index files created by Zend_Search_Lucene are binary compatible with those created by Apache Lucene (for the index format versions that Zend_Search_Lucene supports), allowing the two implementations to interoperate happily, should the need arise.

Posted by Simon at 01:53:00 PM
8 Jun 2008, 8:50 a.m.

Ciaran McNulty

Out of interest, why index on a schedule rather than on an update?

28 Jun 2008, 7:48 a.m.

Simon [ADMIN]

Absolutely no reason other than simplicity! I'm not using any kind of CMS as it stands, so there's not really anywhere to hook the indexer in.

If I were integrating Zend_Search_Lucene with a CMS I'd want to look at - as you say - triggering the indexer on an update event, and having it run asynchronously. The Zend Platform (of which, more later!) "job queue" looks quite neat for that kind of thing.

11 Jul 2008, 11:03 a.m.

Clive

Short, sweet, simple and super - thank you for this. :-)

22 Jan 2009, 11:21 p.m.

Zeno

This is very useful. Also take a look at this other article (Search engine indexing):

http://devzone.zend.com/node/view/id/91

10 Apr 2009, 8:04 a.m.

Alok

Great article, concise and straight to the point.

16 Feb 2010, 7:52 p.m.

Cristobal

How long does Zend_Search_Lucene take to index? I'm indexing 10 documents of 100 KB each and it takes 2 minutes... what if I need to index 1000 files?

Ty.

28 Feb 2013, 8:55 a.m.

Prasad

Very nice article. Very useful.
Thanks for sharing ..!!!

29 Oct 2014, 1:37 a.m.

Doug

I know this article is very old now, but it's still relevant, I think.

The issue I'm having right now is: how can I build an index from an existing database? I have a table I wish to index which contains 2500+ rows. Is there an automated way to turn the entire table into an index?
I use Symfony2 and Doctrine if that helps at all.