Zend_Search_Lucene Quick Start

Posted in PHP, Programming and Zend Framework on Tuesday, the 3rd of June, 2008.

I recently had a spontaneous urge to add a search form to my weblog - this one you're reading right now - and it seemed like a good opportunity to have a look at Zend_Search_Lucene.

I'm really impressed with the simplicity and power of the module. Sadly the documentation, whilst extensive, isn't particularly clear - so here's a quick overview of getting Zend_Search_Lucene up and running.

For the uninitiated, Apache Lucene is an open-source indexing and search library written in Java, and Zend_Search_Lucene is the pure PHP 5 implementation of Lucene [1] that ships with Zend Framework.

Indexing

Before we can do any searching, we need to initialise an index. This is done through the Zend_Search_Lucene::create() method. Indexes are stored on disk, so we will need to create a directory which is readable and writeable by whichever user the script will run as. I've imaginatively called that /path/to/index for the purposes of this post.

Here's an example script which initialises the index, and adds three documents to it, ready for searching:

<?php
 
$index = Zend_Search_Lucene::create('/path/to/index/');
 
$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 1') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow elephant dog hamster') );
$index->addDocument($doc);
 
$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 2') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow aardvark dog hamster') );
$index->addDocument($doc);
 
$doc = new Zend_Search_Lucene_Document();
$doc->addField( 
	Zend_Search_Lucene_Field::unIndexed(
		'title', 'Item number 3') );
$doc->addField( 
	Zend_Search_Lucene_Field::text(
		'contents', 'cow elephant dog esquilax elephant') );
$index->addDocument($doc);
 
$index->commit();

It's important not to overlook that final call to commit() - nothing will work without that. The 'title' field is unIndexed as we won't be searching on it, merely displaying it in our list of results. The 'contents' field is text, and this will be indexed for searching.
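
The other field types follow the same pattern. Here's a rough sketch, assuming Zend Framework 1 is on the include path, of a document that also uses keyword() for an exact-match identifier and unStored() for a body that should be searchable but not kept in the index. The helper name buildPostDocument(), the field names 'url' and 'body', and the sample values are all made up for illustration:

```php
<?php

// Assumes Zend Framework 1 is on the include path
require_once 'Zend/Search/Lucene.php';

/**
 * Build a document for one weblog post.
 * Hypothetical helper for illustration; not part of Zend Framework.
 */
function buildPostDocument($url, $title, $body)
{
    $doc = new Zend_Search_Lucene_Document();

    // keyword: stored and indexed, but not tokenized (exact-match lookups)
    $doc->addField(Zend_Search_Lucene_Field::keyword('url', $url));

    // unIndexed: stored for display only, never searched
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('title', $title));

    // unStored: tokenized and indexed, but the original text is not kept
    $doc->addField(Zend_Search_Lucene_Field::unStored('body', $body));

    return $doc;
}

$doc = buildPostDocument('/weblog/42', 'Item number 4', 'cow elephant dog');
echo implode(', ', $doc->getFieldNames()), "\n";
```

The trade-off with unStored() is that the field can be searched but cannot be read back from a hit, so anything you want to display in the results still needs a stored field such as 'title'.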

Where you get your document data from is completely up to you. It might be an RSS feed, a website crawler or - as in my case - a tiny PHP cron script which queries the weblog table in my database.
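
As a minimal sketch of that kind of cron script, assuming the rows have already been fetched into a plain array, the indexing loop might look something like this. The function name rebuildIndex() and the 'title'/'contents' array keys are my own invention for the example, and the database query itself is left out:

```php
<?php

// Assumes Zend Framework 1 is on the include path
require_once 'Zend/Search/Lucene.php';

/**
 * Rebuild the index from an array of posts and return the document count.
 * Each post is assumed to be an array with 'title' and 'contents' keys.
 * Hypothetical helper for illustration; not part of Zend Framework.
 */
function rebuildIndex($indexPath, array $posts)
{
    // create() starts a new index in the given directory
    $index = Zend_Search_Lucene::create($indexPath);

    foreach ($posts as $post) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(
            Zend_Search_Lucene_Field::unIndexed('title', $post['title']));
        $doc->addField(
            Zend_Search_Lucene_Field::text('contents', $post['contents']));
        $index->addDocument($doc);
    }

    $index->commit();

    return $index->numDocs();
}
```

The cron script would then call something like rebuildIndex('/path/to/index/', $rows), where $rows is whatever your query returned. Rebuilding from scratch on every run is the simplest approach for a small site; a larger one would probably want to update documents individually instead.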

Either way, that's our index created. Since an index is no use unless you query it, let's have a look at how we can do that.

Searching

Here's about the simplest search you can possibly do with Zend_Search_Lucene:

<?php
 
$index   = Zend_Search_Lucene::open('/path/to/index/');
$results = $index->find('contents:elephant');
 
foreach ( $results as $result ) {
	echo $result->score, ' :: ', $result->title, "\n";
}

The 'contents:elephant' query specifies that we wish to search for documents whose 'contents' field contains the term 'elephant'. That runs in a flash, and produces the following output:

0.61871843353823 :: Item number 3
0.5 :: Item number 1

As you can see, the two Zend_Search_Lucene_Document objects which contain the word 'elephant' are returned, ordered by descending 'score'. Item 3 contains the word twice, which is why it receives the highest score.
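
For a real search form the query string comes from the user rather than being hard-coded. Here's a hedged sketch of one way to handle that, assuming Zend Framework 1 is on the include path: the input is run through Zend_Search_Lucene_Search_QueryParser::parse() first, and any Zend_Search_Lucene_Exception is caught so a malformed query yields no results instead of an error page. The function name searchIndex() and the shape of the returned array are assumptions for the example:

```php
<?php

// Assumes Zend Framework 1 is on the include path
require_once 'Zend/Search/Lucene.php';

/**
 * Run a user-supplied query and return a list of title/score pairs.
 * Hypothetical helper for illustration; not part of Zend Framework.
 */
function searchIndex($indexPath, $userInput)
{
    $index = Zend_Search_Lucene::open($indexPath);

    try {
        // Parse the raw input into a query object before searching.
        // Depending on the framework version and its settings, the parser
        // may quietly tolerate bad syntax rather than throw.
        $query = Zend_Search_Lucene_Search_QueryParser::parse($userInput);
        $hits  = $index->find($query);
    } catch (Zend_Search_Lucene_Exception $e) {
        // Treat an unparseable or unsupported query as "no results"
        return array();
    }

    $rows = array();
    foreach ($hits as $hit) {
        // Stored fields such as 'title' are readable as properties of the hit
        $rows[] = array('title' => $hit->title, 'score' => $hit->score);
    }

    return $rows;
}
```

Called as searchIndex('/path/to/index/', 'contents:elephant'), this should return the same two items as the script above, in the same order.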

Of course, there are far more features than I've even hinted at here, so I'll more than likely return to Zend_Search_Lucene in a further post looking at some of the more advanced stuff, but for now, that's your lot.

Footnotes

[1] Incidentally, the index files created by Zend_Search_Lucene are entirely compatible with those created by Apache Lucene, allowing the two implementations to interoperate happily, should the need arise.

Comments

Posted by Ciaran McNulty on Sunday, the 8th of June, 2008.

Out of interest, why index on a schedule rather than on an update?

Posted by Simon Harris on Saturday, the 28th of June, 2008.

Absolutely no reason other than simplicity! I'm not using any kind of CMS as it stands, so there's not really anywhere to hook the indexer in.

If I were integrating Zend_Search_Lucene with a CMS I'd want to look at - as you say - triggering the indexer on an update event, and having it run asynchronously. The Zend Platform (of which, more later!) "job queue" looks quite neat for that kind of thing.

Posted by Clive on Friday, the 11th of July, 2008.

Short, sweet, simple and super - thank you for this. :-)

Posted by Zeno on Thursday, the 22nd of January, 2009.

This is very useful. Also take a look at this other article (Search engine indexing):

http://devzone.zend.com/node/view/id/91

Posted by Alok on Friday, the 10th of April, 2009.

Great article, concise and straight.

Posted by Cristobal on Tuesday, the 16th of February, 2010.

How long does Zend_Search_Lucene take to index? I'm indexing 10 documents of 100 KB each and it takes 2 minutes... what if I index 1000 files?

Ty.

Posted by manoj on Monday, the 4th of June, 2012.

Hi all,
I am using zend search lucene for searching sentences in a paragraph and below is the sample code I used to search.

FOR CREATING INDEX:
----------------------------------

$this->content="This enabled me to improve my presentation skills, experience taking a Q&A session and learn to work to deadlines.";


require_once '/home/project/mgh/lib/Zend/Search/Lucene.php';
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
$index = new Zend_Search_Lucene('/home/project/mgh/data/search_file/lucene.customer.index',true);
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::unIndexed('URL', $file1_path));
$doc->addField(Zend_Search_Lucene_Field::text('contents',$this->content));
$index->addDocument($doc);
$index->commit();

for searching
------------------

require_once('Zend/Search/Lucene.php');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
$index = new Zend_Search_Lucene('/home/project/mgh/data/search_file/lucene.customer.index');
Zend_Search_Lucene::getDefaultSearchField('contents');
$results = $index->find('contents:"improve my extra skills" ');
$this->count=count($results);

Searching for "improve my extra skills" returns a result count of 0.
The words 'improve', 'my' and 'skills' do match; only the word 'extra' does not.

How can I get a result count when some of the words match and others don't?

Please reply..
Thanks in advance.

Posted by Prasad on Thursday, the 28th of February, 2013.

Very nice article. Very useful.
Thanks for sharing ..!!!
