Getting All Distinct Values From Solr

If you want to get all distinct values for a field, Solr has a great feature called faceting that does most of the work for you.

If you're working with a sufficiently large set of data, where the number of distinct values for a particular field is in the thousands or hundreds of thousands, getting all of the values can be hard.

The traditional approach is to paginate serially, using a fixed limit and a growing offset. This becomes a problem at any kind of reasonable scale. Because each request has to wait for the one before it, the latencies of all of your queries are laid end to end, which gives you an unnecessarily large worst-case running time.
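For reference, here is a minimal sketch of that serial loop. `facet.limit`, `facet.offset`, `facet.sort` and `facet.mincount` are Solr's standard facet parameters; the function names and the `fetch_page(offset, limit)` callable (which would send the request and return the facet values for one page) are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def facet_params(field: str, offset: int, limit: int) -> Dict[str, str]:
    """Query parameters for one page of facet values (no documents returned)."""
    return {
        "q": "*:*",
        "rows": "0",              # only the facet values are needed
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": "1",    # skip values with no matching documents
        "facet.sort": "index",    # stable ordering, so pages do not overlap
        "facet.limit": str(limit),
        "facet.offset": str(offset),
    }


def fetch_all_serially(
    fetch_page: Callable[[int, int], List[str]], limit: int
) -> List[str]:
    """Walk the pages one after another until a short page signals the end."""
    values: List[str] = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)  # hypothetical: one HTTP round trip
        values.extend(page)
        if len(page) < limit:
            return values
        offset += limit
```

Every pass through the loop blocks on one round trip, so 1,000 pages cost 1,000 latencies back to back.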

There are much smarter ways to do things these days, and each of them boils down to asynchronous processing. It's easy to process a paginated dataset asynchronously and still maintain its order, as long as you know the size of the data first.
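As a rough illustration of why a known size makes this easy, the sketch below issues every page request up front and stitches the pages back together in offset order. It assumes a hypothetical async `fetch_page(offset, limit)` coroutine; `max_in_flight` is likewise an invented knob to keep the number of simultaneous requests bounded.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all_concurrently(
    fetch_page: Callable[[int, int], Awaitable[List[str]]],
    total: int,
    limit: int,
    max_in_flight: int = 8,
) -> List[str]:
    """Fetch every page concurrently and return the values in page order."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(offset: int) -> List[str]:
        async with semaphore:
            return await fetch_page(offset, limit)

    # Knowing `total` is what lets us compute every offset before any reply.
    offsets = range(0, total, limit)
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which request finishes first.
    pages = await asyncio.gather(*(bounded(offset) for offset in offsets))
    return [value for page in pages for value in page]
```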

The problem then becomes determining, on the client side, the number of distinct values within a document set without first iterating over each value (or asking Solr for the max, which is complicated and often difficult to implement client-side). We need to make a series of educated guesses that should work faster than iterating serially.

I've created the following quick-and-dirty heuristic to get the maximum number of distinct values for a result set where you can easily determine the total number of documents:

Start with the total number of documents in the result set as your assumed number of distinct values (we'll call this the expected max). Ask Solr for facet values using the expected max as the facet offset and a sufficiently large number of values as your limit. If you don't find any values, cut the expected max in half and start over with that value as the expected max. If you do find values, and the number of values is as large as the limit requested, then set your expected max to 1.15 times its current value and start over.

Once you have retrieved a result set where the number of values returned is greater than zero but less than the limit provided, you can add the number of values returned to your expected max, and you now know the number of distinct values. You should manage to do this with far fewer requests than serial iteration needs, too.
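The heuristic can be sketched roughly as follows. `fetch_page(offset, limit)` is a hypothetical callable that returns the facet values Solr reports at that offset. The `max_attempts` cap and the special case for an offset of zero are additions of this sketch, not part of the heuristic as described: for some inputs the halving and growing steps can overshoot each other repeatedly, so the loop needs a way to give up.

```python
from typing import Callable, List


def estimate_distinct_count(
    fetch_page: Callable[[int, int], List[str]],
    total_docs: int,
    limit: int,
    max_attempts: int = 50,  # assumption: a guard against endless probing
) -> int:
    """Probe facet offsets until a partially filled page reveals the count."""
    expected_max = total_docs
    for _ in range(max_attempts):
        found = len(fetch_page(expected_max, limit))
        if 0 < found < limit:
            # A partial page is the last page: offset + its length is the total.
            return expected_max + found
        if found == 0:
            if expected_max == 0:
                return 0  # nothing even at offset 0: the field has no values
            expected_max //= 2  # overshot: halve and try again
        else:
            # Full page: grow by 15% (integer math), always by at least one.
            expected_max = max(expected_max + 1, expected_max * 115 // 100)
    raise RuntimeError("no partially filled page found within max_attempts")
```

With 1,000 distinct values, 10,000 documents and a limit of 100, this probes offsets 10000, 5000, 2500, 1250, 625, 718, 825 and 948, and the last probe returns 52 values, giving 948 + 52 = 1000 after eight requests instead of the eleven a serial walk needs to confirm the end.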

Once you have the max, you can retrieve all distinct values asynchronously client-side, because you can now derive all the arguments required to invoke or enqueue the request for each page up front, rather than discovering them one at a time as results come back from Solr.
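Deriving those arguments is simple arithmetic. A small sketch, with `page_arguments` as a hypothetical helper name, that turns the distinct count and a page size into one offset and limit pair per request:

```python
from typing import List, Tuple


def page_arguments(distinct_count: int, limit: int) -> List[Tuple[int, int]]:
    """Return one (facet.offset, facet.limit) pair per page, in page order."""
    return [
        # The last page only asks for the values that remain.
        (offset, min(limit, distinct_count - offset))
        for offset in range(0, distinct_count, limit)
    ]


print(page_arguments(250, 100))  # [(0, 100), (100, 100), (200, 50)]
```

Each pair can be handed to a worker or queued independently, and sorting the responses by offset restores the original order.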