Deep Learning on Long-Form UGC at Scale

Not too long ago, I was a senior software engineer at Wikia -- the other company run by Jimmy Wales. This was the company that took the MediaWiki platform that Wikipedia uses, and scaled it to create communities for countless interests.

I had just completed a stint improving Wikia's search functionality. What I really wanted to work on was finding ways to deliver interesting product features directly from the content provided by our user communities. This would allow me to apply my talents in machine learning and natural language processing to put a greater degree of intelligence behind the MediaWiki platform.

I led a very small team, and built a lot of fun technologies. Little made it to production, but we had some very interesting results and learned a great deal from what did. During this time, Grant Ingersoll, one of the authors of Taming Text, reached out to us to talk about how we were using Solr and other natural language processing technologies as part of the data processing pipeline we were building. This prompted us to write a chapter for a possible second edition of Taming Text. We got great feedback from Grant on the chapter, but after about two years, the project has stalled a bit.

With Grant's approval, I'm making this chapter freely available here for the first time: A High-Scale Deep Learning Pipeline for Identifying Similarities in Online Communities.

It's always fun to look back at projects like this and ask what I'd do differently. I probably should have used a Storm topology for the CoreNLP parsing component. That would have made things a bit more reliable (about as reliable as Storm, though, I guess). I also think a lot of the threaded functions should have gone into Celery or some other worker processing platform. Still, it was tons of fun to see how much we got done with just boto, the current literature, and tens of millions of English articles.

I'd also like to give a quick thank you to Murad Salahi, whose article "Teaching a Computer How To Read" provided much of the inspiration for what we ultimately accomplished at Wikia. Thanks as well to my co-authors John Kuner and Tristan Chong for their valuable contributions to the paper and the work it outlines.

If you want to take a deeper dive, most of the code for this project is still available online as part of Wikia's Data Science Toolkit project.