Upgrade to Solr 4.1 and Save Space

Solr 4.0 gives us the ability to update only specific fields of a document, given its ID. This is an awesome feature which opens up a lot of application-level capability for indexing that Solr just didn't use to have. Before, if a single document drew on nine or ten different facets of information, you had to invoke the services that populate all of those facets just to update one of them, for every document. Atomic updates allow us to split up the way we update documents, improving indexing performance and reducing load on data sources.
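To make that concrete, here is a minimal sketch of what an atomic update request might look like from Python. The unique key name (`id`), the field name `price`, the core name `collection1`, and the URL are all assumptions for illustration; adjust them to your own setup. The `set` modifier replaces the value of one field and leaves the rest of the document alone.

```python
import json
import urllib.request


def build_atomic_update(doc_id, changes):
    """Build the JSON body for an atomic update that sets the given fields.

    Assumes the schema's unique key field is named "id".
    """
    doc = {"id": doc_id}
    for field, value in changes.items():
        # {"set": value} tells Solr to replace only this field's value.
        doc[field] = {"set": value}
    return json.dumps([doc])


def send_update(body, base_url="http://localhost:8983/solr/collection1"):
    """POST the update to Solr. The base URL is a hypothetical local core."""
    req = urllib.request.Request(
        base_url + "/update?commit=true",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Only the payload is built here; call send_update(body) against a live core.
    body = build_atomic_update("doc1", {"price": 99})
    print(body)
```

Only the service that knows about `price` has to run here; the other facets of the document are never touched by the client.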

The big drawback here is that in order to take advantage of atomic updates, every single field in your index needs to be stored. Unstored fields will be lost during an atomic update. Therefore, reaping the benefits of atomic updates has heretofore meant either an increase in storage requirements as you set all your unstored fields to stored, or a reduction in the number of fields you keep for a given document. This is one of those seemingly insurmountable dilemmas that can keep the more exciting capabilities of Solr out of the reach of search engineers in resource-strapped orgs. No one likes to make these sorts of tradeoffs, because they inevitably tie one hand or the other behind your back.
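Before switching to atomic updates, it can help to find out which fields would be lost. The following is a rough sketch that scans a schema for `field` and `dynamicField` entries explicitly marked `stored="false"`. The sample schema is made up for illustration, and the check is deliberately simple: it does not resolve defaults inherited from a field type, and it does not account for fields that are only populated as copyField destinations.

```python
import xml.etree.ElementTree as ET


def unstored_fields(schema_xml):
    """Return names of fields explicitly declared with stored="false"."""
    root = ET.fromstring(schema_xml)
    names = []
    for tag in ("field", "dynamicField"):
        for el in root.iter(tag):
            if el.get("stored", "").lower() == "false":
                names.append(el.get("name"))
    return names


# Hypothetical schema fragment, not taken from a real deployment.
SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="body" type="text_general" indexed="true" stored="false"/>
    <dynamicField name="*_facet" type="string" indexed="true" stored="false"/>
  </fields>
</schema>"""

if __name__ == "__main__":
    for name in unstored_fields(SAMPLE_SCHEMA):
        print("would be lost on atomic update:", name)
```

To run it against a real index, read your own schema.xml into a string and pass it to `unstored_fields`.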

Fortunately for us, Apache announced the release of Solr 4.1 on January 20th. In my opinion, this is kind of a big deal, and exactly what we were hoping for. It appears that a lot of work has been done to ease the expanded storage needs that come with the atomic update paradigm. Stored fields are now compressed by default. This means that any field that you have to store (i.e. all of them, if you want atomic updates) takes up less space than it did with previous versions of Solr.

I created a test index of about 1 million documents, taking up 43 gigs using Solr 4.0. Pointing the Solr instance I had configured at the 4.1 libraries and restarting the application server resulted in an immediate storage improvement, down to 12 gigs.

The takeaway from this anecdote is obvious. This index is a little over a quarter of its previous size. I have a feeling that indices that will benefit most from this are those with a lot of dynamic fields, and probably a lot of sparsity for those fields.
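For anyone who wants the exact figures, the arithmetic is short. This snippet uses only the two sizes quoted above; the variable names are mine.

```python
before_gb = 43  # index size under Solr 4.0
after_gb = 12   # same index under Solr 4.1

ratio = after_gb / before_gb  # fraction of the original size that remains
reduction = 1 - ratio         # fraction of disk space saved

print(f"{ratio:.1%} of the original size, {reduction:.1%} saved")
```

That works out to roughly 28% of the original size, or about 72% of the space saved.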

I want to issue a huge thanks to the solr.pl crew, who originally pointed out this benefit of upgrading from Solr 4.0 to Solr 4.1. This is definitely going to make it a lot easier for people at high scale to embrace the atomic update paradigm. That paradigm is going to enable a lot of interesting distributed indexing ETL pipelines, and further position Solr as a leading NoSQL data storage solution.