Apache Solr is Lucene's next logical step, architecturally speaking. Before Solr, a developer using Lucene had to build search directly on top of a Lucene implementation, opening an index each time and serially running queries against it. These are costly and time-consuming operations, especially as your index grows. Wouldn't it be great to outsource all of that trouble to a separate, already-running instance of Lucene? That's a bit what Solr is like.

If you're a PHP developer working in the Zend Framework, you may be familiar with search in the context of Zend_Search_Lucene. Most of the Zend Lucene tips I've suggested in the past don't carry over to Solr. That's a good thing: a lot of what was missing in Zend_Search_Lucene is part and parcel of Solr as it comes out of its tarball. Text normalization is already taken care of, and the Porter stemmer works by default without additional finagling. Beyond this, Solr is still highly configurable. A peek in its conf folder shows a variety of easy tweaks you can make for your own needs:
- a list of customizable stopwords
- a list of correct spellings to override statistical spelling correction
- synonym grouping
- the option for result ordering for specific queries
- even a customizable list of words that should not be stemmed!
It also includes the schema file, responsible for building out the fields that Solr will use when interacting with documents.
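To make that concrete, here's a minimal sketch of what custom field definitions in schema.xml might look like. The field names here are hypothetical; 'text' is the catch-all field from the example schema, and copyField is how you funnel your fields into it:

```xml
<!-- hypothetical fields for a textbook search index -->
<field name="signature" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="false"/>

<!-- funnel the searchable fields into the catch-all 'text' field -->
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>
```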
So let's assume you have downloaded Apache Solr and managed to play with it successfully using the Apache Solr tutorial. Hopefully, schema.xml is clear enough that you have an idea of what fields you want for your own implementation, and how to add them to the catch-all 'text' field. If the pages you are indexing have a clear relationship with classes you have built or rows in specific database tables, it makes sense to leverage this uniqueness. Create a field in your schema for a signature that captures it. For instance, you may have a Textbook object with a row ID of 57 for the data you're pulling out of a database. Make its unique identifier 'Textbook_57'. That way, you can easily reinstantiate the object you originally used when creating the document you indexed.
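One way to manage these signatures is a pair of small helpers: one to build the ID from a class name and row ID, and one to split it back apart when a hit comes in. This is just a sketch; the naming scheme is whatever convention you settle on:

```php
<?php
// Build a Solr-friendly unique ID like 'Textbook_57' from a model
// class name and a database row ID.
function buildSignature($className, $rowId)
{
    return $className . '_' . $rowId;
}

// Split a signature back into its class name and row ID so the
// original object can be reinstantiated from a search hit.
function parseSignature($signature)
{
    $pos = strrpos($signature, '_');
    return array(
        'class' => substr($signature, 0, $pos),
        'id'    => (int) substr($signature, $pos + 1),
    );
}
```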
So how do we interact with Solr in our PHP web application? There are a number of ways to index and search for documents via Solr with PHP, but they all involve making a request to the instance of Solr currently running. So rather than reinvent the wheel, I suggest starting with Solr PHP Client. This is a great library because it follows Zend Framework design patterns pretty closely. That by itself makes it a sensible addition to any ZF-based web application.
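Under the hood, each of those operations is just an HTTP request against the running Solr servlet; Apache_Solr_Service wraps this for you. It's worth seeing how little is involved, so here's a sketch that builds the kind of select URL such a client issues (the host, port, and path are Solr's out-of-the-box defaults and are assumptions about your setup):

```php
<?php
// Build a Solr select URL the way an HTTP client library would.
// 'localhost:8983/solr' is where Solr runs out of the tarball.
function buildSelectUrl($query, $offset = 0, $limit = 10,
                        $host = 'localhost', $port = 8983, $path = '/solr')
{
    $params = array(
        'q'     => $query,
        'start' => $offset,
        'rows'  => $limit,
        'wt'    => 'json',  // ask for JSON so the response is easy to decode
    );
    return 'http://' . $host . ':' . $port . $path . '/select?'
        . http_build_query($params);
}
```

You could fetch the resulting URL with cURL or file_get_contents() and json_decode() the body, which is essentially the round trip the client library makes on every search.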
A couple of things to watch out for when using Solr PHP Client:
- By default, Apache_Solr_Response uses json_decode(). Your version of PHP may or may not have this function. If you're having trouble parsing the response you get, try replacing instances of json_decode() with Zend_Json::decode().
- If you're looking to use Solr's spell correction service (which I would highly suggest), Apache_Solr_Service does not currently have any support for this functionality. I've put in a ticket for a patch I wrote on the Google Code website. You can access it here.
- You'll get an exception if Solr doesn't send a response code of 200. Be sure to plan ahead for a bit of exception handling in your search application.
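That last point is worth planning for up front: a Solr restart or a malformed query shouldn't take your whole page down. Here's a minimal sketch of the pattern, where the callback stands in for whatever call you make through the client's search method:

```php
<?php
// Run a search callback and degrade gracefully if Solr is unreachable
// or returns a non-200 response (the client throws an Exception).
function trySearch($searchFn)
{
    try {
        return $searchFn();
    } catch (Exception $e) {
        // Log it and let the view fall back to an empty result set.
        error_log('Solr search failed: ' . $e->getMessage());
        return null;
    }
}
```

In the real application, the callback would close over your Apache_Solr_Service instance and query string; returning null gives the calling view a clean signal to show a "search is unavailable" message.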
At this point, you likely have a component in your site that used to do the brunt of the work: opening a Lucene index, searching it, and returning hits that you then use to display your results -- whether by pulling from the same fields you indexed or by instantiating objects via those fields and leveraging display methods you've written for them. Apache_Solr_Service replaces this functionality by creating an Apache_Solr_Response instance based on the answer it receives from Solr. The response instance has an attribute called response that contains an array keyed as 'documents'. The items in this array are the basic analogue to Zend_Search_Lucene 'hits'.
The approach that I take is to simply extract the informative unique ID I gave my document when indexing, and instantiate my objects using well planned-out interfaces on my model classes. That way I have freedom to reuse partials I've built out across the site. You can also use the Apache_Solr_Document objects that Apache_Solr_Response fleshes out while parsing. Their attributes will be the same as the fields you assigned when indexing.
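Putting those two ideas together, the display loop ends up looking something like the sketch below. The documents are stubbed with plain objects standing in for the library's document instances, and Textbook::find() is a hypothetical loader on your model class:

```php
<?php
// Hypothetical model with a static loader, standing in for your own
// well planned-out model interfaces.
class Textbook
{
    public $id;
    public static function find($id)
    {
        $t = new Textbook();
        $t->id = $id;
        return $t;
    }
}

// Turn search hits back into model objects using the signature field
// ('Textbook_57' style IDs assigned at indexing time).
function hydrateHits(array $docs)
{
    $models = array();
    foreach ($docs as $doc) {
        list($class, $id) = explode('_', $doc->signature, 2);
        if (class_exists($class)) {
            $models[] = call_user_func(array($class, 'find'), (int) $id);
        }
    }
    return $models;
}
```

From there, each hydrated model can hand itself to whatever partials you've already built for rendering it elsewhere on the site.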
It says a lot that Lucene and Solr have recently merged development projects. Even if you're confident that Lucene is fine for your solution, it definitely won't hurt to take a few hours and see what you might be missing. You may think your site has a bit too much momentum to make such a drastic change immediately. However, most of the work is already done for you between Solr itself and the PHP library I mentioned. All that's left is hooking it up to your site and creating a build process to index your site.
It was very quick to get my implementation up and running; I was indexing and querying within hours. My original indexing task was based on requesting various URLs and took close to six hours. The same job now takes only ten minutes, working directly with the objects responsible for creating those pages. Another big help was the Solr migration itself: Apache_Solr_Service supports adding documents in large batches. I took the switchover as an opportunity to completely overhaul how several components of the site interact with search. Search has never been better or faster, and I can't wait to deploy.