This post is an expansion on a response I wrote on Stack Overflow regarding Lucene Optimization. A common misstep for a developer who is implementing Lucene for the first time is to stuff an overabundance of data inside it, reducing performance. For someone who doesn't have a background in natural language processing or information retrieval, some data normalization steps may not come naturally. This post specifically regards Zend_Search_Lucene, but much of my advice can be abstracted across various languages to other Lucene ports, such as Lucene.NET.
Lucene is not a database.
It's an index. All field types other than UnIndexed are indexed very efficiently: each unique term is stored only once in the index, and when a search is made, the query is transformed in the same way so that it matches those unique terms. Anything that you're not indexing should be something you're going to use to recover information from a database. Because we are going to normalize our data, this will be basically anything we'll be recovering about the document to display in a search result. Relying on the database to populate these fields will decrease your index size (and thus increase your search speed) dramatically; on the search engine I work on every day, it decreased the size of the index by over 75%.

One easy way to implement this is to create fields for a table name and a primary key ID to be associated with each Zend_Search_Lucene_Document object that you add to your index. Then, for each result you want to display, you can simply instantiate the given object via its regular model. I would suggest that for any model type you will be displaying in your search, you should make sure that it implements an interface which properly displays the fields that you will be showing.
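A minimal sketch of the indexing side (`$index`, `$book`, and the field names `table_name` and `primary_key` are illustrative assumptions, not fixed conventions):

```php
// Assumes $index is an open Zend_Search_Lucene index and $book is a
// model object exposing getId(); names here are illustrative.
$doc = new Zend_Search_Lucene_Document();

// Stored but not indexed: used only to re-fetch the row at display time
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('table_name', 'books'));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('primary_key', $book->getId()));

$index->addDocument($doc);
```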

As an example, say you are running a book site, but are also selling, perhaps, book club subscriptions, e-books, and individual short stories, with each type having its own table with disparate field names. Each of these tables also maps to a specific class in your application. We want to make sure each search field -- say title, description, page url, and price -- is populated by the object we instantiate regardless of its actual class. The best solution for this is an interface that defines the required methods to be searchable. Let's call this interface My_Search_Searchable_Interface. If a class implements this interface, it will have to supply its own version of My_Search_Searchable_Interface::getName(), getPrice(), getPageUrl(), and getDescription(), each returning the proper string to populate that specific field regardless of the column's name in that table. With that issue addressed, there is no reason to store these values in Lucene, and we don't have to worry about which specific object we're working with, because we have planned for all of the applicable classes to function properly within the search context.
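As an illustrative sketch, the interface and one implementing class might look like this (the Book class and its column names are assumptions for the example, not part of any real schema):

```php
interface My_Search_Searchable_Interface
{
    public function getName();
    public function getDescription();
    public function getPageUrl();
    public function getPrice();
}

class Book implements My_Search_Searchable_Interface
{
    protected $row;

    public function __construct(array $row)
    {
        $this->row = $row;
    }

    // Each model maps its own column names onto the shared interface
    public function getName()        { return $this->row['book_title']; }
    public function getDescription() { return $this->row['synopsis']; }
    public function getPageUrl()     { return '/books/' . $this->row['slug']; }
    public function getPrice()       { return $this->row['list_price']; }
}
```

A subscription or short-story class would implement the same four methods against its own columns, so the search result view never needs to know which concrete class it's rendering.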

If the above approach isn't scalable for your data model, you can always create a database table that stores the necessary fields manually based on whatever objects you're piecing together on a single page. It's a fine enough solution, because all you would actually need is to create a single search result class to handle the above data across a variety of page types, assuming you have done the work to be able to properly extract the applicable data from each page.

By using the proper interface, we are able to index every piece of text that we want to be able to search over in a document via Zend_Search_Lucene_Field::UnStored(). This particular static method creates an indexed field of tokenized text with the least space consumption; it is so small because the original text is never stored in the index -- data goes in, but it doesn't come out. This isn't a problem, though, because we are going to normalize our data for even better processing and robustness. We are also now able to access the necessary data for any document hit we come across when searching: we read the table name and primary key ID fields, then make the appropriate query and instantiate the result either directly via the database adapter or via the appropriate model. Note that if you're using models, you may want to opt for a model field rather than a table name field.
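A hedged sketch of how the search side fits together (My_Model_Factory::load() is a hypothetical helper standing in for however your application instantiates a model from a table name and ID):

```php
// Index only the searchable text; display data stays in the database.
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $book->getDescription()));

// At search time, recover each result via its stored locator fields.
$results = array();
foreach ($index->find($query) as $hit) {
    // getDocument() returns the stored fields of the matched document
    $table = $hit->getDocument()->getFieldValue('table_name');
    $id    = $hit->getDocument()->getFieldValue('primary_key');
    // My_Model_Factory::load() is a hypothetical application-side helper
    $results[] = My_Model_Factory::load($table, $id);
}
```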

We now have a solution in place for a lean, fast, and robust search implementation, where we only store the data we need. The next step is of the utmost importance to robustness, but still has a value with respect to reduced index size and improved performance.

Normalize Your Data
This is accomplished in three steps: input filtering, stopword filtering, and stemming. We do these steps for a number of reasons. First, by normalizing our data, we reduce the size of the index by removing extraneous terms and characters. Not to sound like a broken record, but we should assume that steps that improve speed and reduce index size are good. Second, we make grammatically related variants of specific terms recoverable regardless of inflection. Imagine an index that couldn't find any matches for 'books' simply because each entry for a book exclusively used the singular term. Third, normalization improves the overall robustness of the index, increasing the probability that you return a pertinent result even with somewhat underspecified input.

Input Filtering
Input filtering is a step specifically associated with the search query, but you should keep in mind that anything you do to the data you're indexing, you should also do to the text that will be querying the index. It's a simple matter of proper matching and data integrity. The best way to make sure that the query input matches the format of the indexed text is to use the same filter chain, implemented with Zend_Filter. I would suggest the following filter chain for a beginner:

$filterChain = new Zend_Filter();
$filterChain->addFilter(new Zend_Filter_StripTags())      // no html tags, no malicious code
            ->addFilter(new Zend_Filter_Alnum(true))      // only alphanumeric text (and spaces)
            ->addFilter(new Zend_Filter_StringTrim())     // remove extra whitespace
            ->addFilter(new Zend_Filter_StringToLower()); // self-explanatory


If you're comfortable with regular expressions, Zend_Filter_PregReplace is also a viable option for more advanced string manipulation. I would suggest wrapping this filter chain in a custom filter class, because that makes it easy to reuse across various classes. Here's how that filter would look.

class My_Filter_Searchable implements Zend_Filter_Interface
{
    /**
     * @var Zend_Filter
     */
    protected $_filterChain;

    public function __construct()
    {
        // Build the chain once and reuse it on every call
        $this->_filterChain = new Zend_Filter();
        $this->_filterChain->addFilter(new Zend_Filter_StripTags())
                           ->addFilter(new Zend_Filter_Alnum(true))
                           ->addFilter(new Zend_Filter_StringTrim())
                           ->addFilter(new Zend_Filter_StringToLower());
    }

    public function filter($value)
    {
        return $this->_filterChain->filter($value);
    }
}


Whatever you do, once you have your filter chain established, you should apply it to your input, whether you're using Zend_Form or Zend_Filter_Input. I highly suggest Zend_Filter_Input if you aren't using Zend_Form, because you can instantiate it with any kind of data from $_POST or $_GET; there's no need to adopt Zend_Form if you aren't already using it and are only focusing on search. Also, don't forget that if you are using Zend_Form, Zend_Form_Element::getUnfilteredValue() will return the value of the text before any transformations. Just make sure to escape it in the view to prevent cross-site scripting or other malicious code.
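For instance, a minimal sketch of running the custom filter from above over raw query input with Zend_Filter_Input (the 'q' parameter name is an assumption):

```php
// Apply the same chain to the incoming search parameter
$filters = array('q' => array(new My_Filter_Searchable()));
$input   = new Zend_Filter_Input($filters, null, $_GET);

$query = $input->getEscaped('q'); // filtered, and escaped for safe output
```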

You will also want to use this filter on any text that you will be storing in the index. The easiest way to do that is via an analyzer. Analyzers in Lucene are responsible for tokenizing the text that you feed to them. In a sense, they already do a portion of the data normalization by themselves. For instance, depending on the analyzer you use, you can tell Lucene whether you want to differentiate between capitalizations, whether you want to include numbers, etc. You will want to extend an analyzer already provided by the Zend library. They've already done the heavy lifting for you, so why not? Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive is my analyzer of choice to extend, because it also indexes numbers -- not a default setting for Zend_Search_Lucene. To utilize your filter in this analyzer, you simply create the class as follows:

class My_Search_CustomAnalyzer
    extends Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive
{
    public function nextToken()
    {
        $token = parent::nextToken();
        if ($token === null) {
            return null; // end of the token stream
        }
        // Filters operate on strings, so filter the term text and rebuild the token
        $filter   = new My_Filter_Searchable();
        $newToken = new Zend_Search_Lucene_Analysis_Token(
            $filter->filter($token->getTermText()),
            $token->getStartOffset(), $token->getEndOffset());
        $newToken->setPositionIncrement($token->getPositionIncrement());
        return $newToken;
    }
}


Note that I'm using my custom filter. We'll be adding to that as we go along.

Remove Stopwords
Stopwords are words that are used frequently, but have little semantic value. A few key examples are:

  • 'the'
  • 'and'
  • 'what'
  • 'is'

You wouldn't find anything useful if you searched for just these terms, so why clutter your index with them? In fact, depending on the size of your index, if you let someone search for these terms, your result page may time out while Lucene pores over all of the viable matches in search of the top N results. Find a good list of stopwords, and remove them from any indexable field. As a thought experiment, imagine taking a walk across equally sized letters written on the sidewalk. How long would "The Good, The Bad, and the Ugly" take compared to "Good, Bad, Ugly"?

The best way to remove stopwords is to create another filter that will be responsible for eliminating these extraneous words from your text:

class My_Filter_Stopwords implements Zend_Filter_Interface
{
    /**
     * An array of stopwords mapping to true -- key lookups with isset()
     * are faster than in_array(). Make sure each entry is a lower-case string.
     **/
    public static $stopWords = array('the'  => true,
                                     'and'  => true,
                                     'what' => true,
                                     /**  ...  **/
                                     'is'   => true);

    public function filter($data)
    {
        if (is_string($data)) {
            $words = explode(' ', $data);
        } else if (is_array($data)) {
            $words = $data;
        } else {
            throw new Exception('Wrong data format in My_Filter_Stopwords');
        }

        $newWords = array();
        foreach ($words as $word) {
            // test if the key is set in the array -- faster than in_array()
            if (!isset(self::$stopWords[strtolower($word)])) {
                $newWords[] = $word;
            }
        }

        // preserve the input type: string in, string out
        return is_string($data) ? implode(' ', $newWords) : $newWords;
    }
}


There is at least one resource with a ready-made PHP array of stopwords. Another approach would be to use a static function that returns an array of stopwords, first checking whether a static variable has already been populated. If not, it could read a flat file or a database table and load the data into the class that way. This is handy if you need stopwords for features of your application other than search. You would now add this filter to My_Filter_Searchable right at the end of the filter chain.
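A sketch of that lazy-loading approach (the class name and file format, one word per line, are assumptions for the example):

```php
class My_Search_Stopwords
{
    protected static $words = null;

    /**
     * Returns the stopword lookup array, loading it from a flat file
     * (one word per line) on first use and caching it afterwards.
     */
    public static function getWords($path)
    {
        if (self::$words === null) {
            $list = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            // map each lower-cased word to true for fast isset() lookups
            self::$words = array_fill_keys(array_map('strtolower', $list), true);
        }
        return self::$words;
    }
}
```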

Stem Everything
As previously stated, stemming removes inflection from both verbs and nouns so that searches for 'running' will also match the term 'run', and searches for 'works' will also match the term 'worked'. It also reduces the amount of extraneous text characters being stored, reducing index size and speeding up performance.

There is a variety of stemmers available, but barring passing your text to a Python script implementing the Natural Language Toolkit, options are fairly limited in PHP. One easy-to-implement option is the Porter Stemmer, an extremely thorough but absolutely naive stemming algorithm. It doesn't get much better in PHP as far as an off-the-shelf implementation that doesn't require training data. Even from a language-agnostic standpoint, better-performing stemmers are generally probabilistic or lexicon-based, which pose problems as far as licensing, acquisition of training data, and oftentimes performance.

I won't write out the code for this step, other than to suggest one of two options with stemmer code I linked to above:

  1. Repurpose the given code to implement Zend_Filter_Interface. All it will need is a filter() method that calls its main stem() function on each word.
  2. Define a separate filter class that simply uses the PorterStemmer class in its filter() method.
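A minimal sketch of the second option, assuming the linked PorterStemmer class exposes a static Stem() method (as the common PHP port does; verify against the implementation you actually download):

```php
class My_Filter_Stemmer implements Zend_Filter_Interface
{
    public function filter($value)
    {
        // Stem each word; preserve the input type (string in, string out)
        $words   = is_array($value) ? $value : explode(' ', $value);
        $stemmed = array();
        foreach ($words as $word) {
            $stemmed[] = PorterStemmer::Stem($word); // assumed API
        }
        return is_array($value) ? $stemmed : implode(' ', $stemmed);
    }
}
```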

Then, of course, we would add this filter to My_Filter_Searchable right at the end, and it would be propagated to our custom analyzer.

Conclusion
Lucene is a super-optimized tf-idf information retrieval model. With the above steps, I have outlined a general implementation of Zend_Search_Lucene which leverages its intrinsic talents of speed and memory management over a large swath of textual data when trying to retrieve relevant documents given a query. Some other considerations for your implementation should be:

  • Custom Weighting: Are there certain page types that have a higher importance on your site? Consider both relevance and business requirements. Sometimes making more money can trump what has the best text match (but try not to be evil).
  • Query Hit Limits: If your website is books.com, and someone is searching for books, every single page may be a match. You can limit this by setting a result limit with Zend_Search_Lucene::setResultSetLimit().
  • Domain-Specific Stopwords: Another best practice is to identify these kinds of words and include them in your list of stopwords. It will improve your site's performance, but you will need a contingency plan for when someone actually does just search for 'books' on books.com.
  • Don't Waste Your Time on Irrelevant Results: Sometimes it's better to show some search results than none. Other times, you will have one or two actually relevant results and a handful of one-offs that aren't worth the processing time. One approach I take is to set two minimum relevance scores: if we don't find any offers with the first minimum, we'll check again with the second, just for additional robustness on the mid-to-long tail.
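The two-threshold idea from the last bullet can be sketched in plain PHP (the threshold values are illustrative; $hits can be any array of objects exposing a public $score property, as Zend_Search_Lucene_Search_QueryHit does):

```php
/**
 * Keep hits above a strict relevance score; if nothing qualifies,
 * retry with a looser fallback score before giving up entirely.
 */
function filterByRelevance(array $hits, $primaryMin = 0.5, $fallbackMin = 0.2)
{
    $keep = array();
    foreach ($hits as $hit) {
        if ($hit->score >= $primaryMin) {
            $keep[] = $hit;
        }
    }
    if (empty($keep)) {
        // Nothing cleared the strict bar; retry with the looser one
        foreach ($hits as $hit) {
            if ($hit->score >= $fallbackMin) {
                $keep[] = $hit;
            }
        }
    }
    return $keep;
}
```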

This has been my first blog post on the Zend Framework, and specifically the Zend_Search_Lucene module. Feel free to comment with any questions you might have, and if it's compelling enough, maybe you'll get a future blog post out of me from it.