Using NLP at High Scale for Recommendations

I've been doing a lot of work lately using data extracted from a version of Stanford CoreNLP that we scaled out massively in the cloud to power recommendations and ad retargeting. We used A/B testing and statistical significance measures to show that these recommendations improve engagement among our most important users.

To give a brief summary, let's say that you describe a community by its most common core concepts and action words. We know that those are great indicators of what a community is about. But it would be difficult to provide good recommendations using, for instance, collaborative filtering directly against those tokens, due to something known as the curse of dimensionality: with one dimension per distinct token, the data is so sparse that few communities overlap in any meaningful way. Even so, we can still use those values to power a recommendation platform.
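To make the sparsity problem concrete, here is a minimal sketch in plain Python. The community names and token lists are made up for illustration and do not come from our data. It builds one count vector per community over the shared vocabulary and shows that two related communities can share no tokens at all, which is one symptom of the problem described above.

```python
from collections import Counter

# Hypothetical communities, each described by its top concepts and action words.
communities = {
    "star_wars": ["jedi", "lightsaber", "force", "fight", "train"],
    "star_trek": ["starship", "captain", "warp", "explore", "command"],
    "cooking": ["recipe", "oven", "flour", "bake", "stir"],
}

# One dimension per distinct token across all communities.
vocabulary = sorted({tok for toks in communities.values() for tok in toks})


def token_vector(tokens):
    """Return a count vector over the full vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def shared_tokens(a, b):
    """Count the distinct tokens two communities have in common."""
    return len(set(communities[a]) & set(communities[b]))


vectors = {name: token_vector(toks) for name, toks in communities.items()}

# Fraction of zero entries in one community's vector.
nonzero = sum(1 for value in vectors["star_wars"] if value)
sparsity = 1 - nonzero / len(vocabulary)

print(len(vocabulary), "dimensions,", round(sparsity, 2), "of entries are zero")
print("shared tokens:", shared_tokens("star_wars", "star_trek"))
```

With only three toy communities the vectors are already mostly zeros, and the two science fiction communities have nothing in common at the token level. A real vocabulary has many thousands of dimensions, so the effect is far stronger.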

You can use unsupervised dimensionality reduction techniques like latent Dirichlet allocation (LDA) over that same data to identify important groupings of words. These groupings suggest latent, structural characteristics of a given community's core content. A document's weights across these groupings also give it coordinates in a much lower-dimensional Euclidean space, with one dimension per topic instead of one per token. With that data, you can use spatial similarity metrics to get interesting and useful recommendations. Better recommendations can increase engagement, and engagement is often tied to revenue.
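As a rough sketch of the similarity step, assume each community has already been reduced to a short vector of topic weights by an LDA model. The numbers below are invented, and cosine similarity is just one possible spatial metric; this post does not say which metric we used in production. The function names are hypothetical.

```python
import math

# Hypothetical topic weights per community, e.g. as produced by an LDA model.
topic_vectors = {
    "star_wars": [0.70, 0.25, 0.05],
    "star_trek": [0.65, 0.30, 0.05],
    "cooking": [0.05, 0.10, 0.85],
    "baking": [0.02, 0.08, 0.90],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(name, k=2):
    """Return the k communities most similar to `name`, best first."""
    target = topic_vectors[name]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in topic_vectors.items()
        if other != name
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


for other, score in recommend("star_wars"):
    print(other, round(score, 3))
```

Because every community now lives in the same small topic space, communities that share no tokens at all can still come out as close neighbors.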

Tools of the trade for this solution included distributed LDA from Gensim, cloud scaling with Boto, fast numerical processing in Python with NumPy and SciPy, and Stanford CoreNLP, which we have scaled in the cloud to serve as the foundation of our parsing and data extraction pipeline.

You can get more detail about this approach in the paper we submitted to ACM RecSys 2014 here.