Maybe I don't advertise it much on my website, but I'm a total nerd about hip hop music. A proficient stalker may have noticed my over-analytical writings over on Rap Wiki. Anyone unacquainted might enjoy my contributions to pages like DJ Screw and Lil B. As with everything I'm passionate about, I tend to perform a fairly exhaustive breadth-first search on various information sources about the things I like and what they are related to until I accidentally find myself in the deep end of esotericism. As a result of this long-term obsession, I visit All Hip Hop for music news, World Star Hip Hop for new videos, DatPiff for the latest mixtapes, and, of course, MediaTakeOut for all the juicy gossip.
I'm always looking for ways to combine things that excite me. Considering the fact that I spend most of my work day bouncing around in my chair to one mixtape or another while coding, it was only a matter of time before I was hit with the following epiphany: I should make a language model for MediaTakeOut. MediaTakeOut is an interesting website, because its headlines have a lot of entertaining, edgy constructions that are relatively unique to its site. I wanted to use a language model to see whether the prevailing patterns I saw could actually be reconstructed using the laws of probability. Also, let's be honest, here: language models aren't really that useful, but they can be hilarious. The reason why they're funny is that they decontextualize the underlying qualities of a given textual corpus (thus generating grammatical nonsense). This is particularly telling when that corpus uniquely subsumes a specific author or domain. A lot of the headlines on MTO are already written to be funny or outrageous, so one could imagine what you might get from inserting a little bit of randomness into the mixture.
So, after a couple of nights' worth of work, I had written a MediaTakeOut Headline Generator, lovingly styled after the website it parodies. I open-sourced the code on GitHub under the project name MTO-ON-BLAST. Though maybe that's not a fair name. Putting someone or something "on blast" means giving it a hard time, as in "R&B singer Monica gets put ON BLAST". I used the term since those were the kind of entertaining constructions I was looking to generate in the language model. I would posit that the Chomsky Bot does a better job putting Chomsky on blast than my project does putting MTO on blast.
Anyhow, I'm really writing this blog post because I wanted to describe the components of the project, and how each worked. I first cut my teeth on real programming using Python in grad school, so this was an opportunity to play around with Python after years in PHP and a couple of months slogging through some intense Perl. Let's be honest, though: this isn't much more advanced than a graduate Computational Linguistics I project from an implementation perspective. I could have written my own language model rather than use an off-the-shelf implementation if I wanted to impress my thesis advisor, but I was really only looking to have some fun.
First off, I wrote a script called mto-scrape.py, which uses the lxml library to record headlines in the MTO archives by accessing elements in the DOM. We iterate through each page of the archive, scraping the page's HTML, parsing it into a DOM object, and pulling out headlines from appropriate DOM elements. I also ended up using this basic approach to scrape comics from Nothing Nice To Say, an old favorite webcomic of mine, for offline viewing. So the same approach is fairly reusable.
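The parsing half of that loop can be sketched roughly as follows. This is a minimal illustration rather than the actual mto-scrape.py: the sample markup, the "headline" class name, the XPath expression, and the function name extract_headlines are all assumptions made for the example, since the real archive markup isn't reproduced here.

```python
from lxml import html

# Hypothetical archive markup; the real MTO pages are structured differently.
SAMPLE_PAGE = """
<html><body>
  <div class="headline"><a href="/1.html">First Headline</a></div>
  <div class="headline"><a href="/2.html">Second Headline</a></div>
  <div class="sidebar"><a href="/ad.html">Not a headline</a></div>
</body></html>
"""

def extract_headlines(page_source):
    """Parse raw HTML into a DOM tree and return the headline link texts."""
    doc = html.fromstring(page_source)
    # This XPath is an assumption that only matches SAMPLE_PAGE above.
    links = doc.xpath('//div[@class="headline"]/a')
    return [link.text_content().strip() for link in links]

headlines = extract_headlines(SAMPLE_PAGE)
print(headlines)
```

In a real run, page_source would come from fetching each archive page in turn, and the extracted headlines would be appended to headlines.txt.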
Since the above script is relatively naive, I stripped escape character slashes and excessive whitespace using a command line script. You can diff headlines.txt and headlinesprepped.txt if you want to get a feel for the changes. I should probably add a bash script to automate this step, but I trust that anyone who actually wants to repeat my results would know enough to handle this rather quickly.
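For anyone who would rather script that cleanup in Python, here is a rough equivalent. It is a sketch of the two fixes described above, not the original command: it assumes the "escape character slashes" are backslashes, and the function name prep_headline is made up for illustration.

```python
import re

def prep_headline(raw):
    """Remove escape backslashes and squeeze runs of whitespace."""
    no_slashes = raw.replace("\\", "")  # assumes the escapes are backslashes
    return re.sub(r"\s+", " ", no_slashes).strip()

# A made-up scraped line with escaped quotes and stray spacing.
cleaned = prep_headline("  Rapper\\'s   NEW  video \\\"LEAKS\\\"  ")
print(cleaned)
```

Applied line by line to headlines.txt, something like this would produce a file in the spirit of headlinesprepped.txt.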
The next step is the script mto-analyze.py, which uses the Natural Language Toolkit to consume the prepped text file and store the necessary data to create the language model. It also displays the top 100 unigrams, bigrams, and trigrams for the corpus. For the uninitiated, these are N-grams, where N is the number of adjacent tokens to constitute a unique type. Consider the following tweet from Lil B:
I MAKE SURE TO TURN OFF MY LIGHTS WHEN IM NOT USING THEM TO SAVE THE ENERGY OF THE WORLD, WE ARE VERY LUCKY RIGHT NOW #BASED - Lil B
The first three unigrams would be: [I], [MAKE], [SURE]. The first three bigrams would be [I MAKE], [MAKE SURE], [SURE TO]. The first three trigrams would be [I MAKE SURE], [MAKE SURE TO], [SURE TO TURN]. These tokens are then stored as types, with individual counts associated with them. For instance, the unigram types [TO] and [THE] would both have a token count of two, since they are seen twice in this tweet. This means that they are more likely to appear than other words, from a probabilistic standpoint. This is, without getting too academic, the prevailing concept behind a language model. The more data that we have, the larger the constructions we can attempt to generalize against. With almost 37,000 headlines scraped from MTO, with an average of 16 words (or about 96 characters) per headline, it's relatively safe for us to attempt to build a trigram model. That's a model that uses information from the previous two words to select a relatively probable next word.
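The counting step can be reproduced in a few lines of plain Python. This sketch uses only the first clause of the tweet and naive whitespace splitting, which sidesteps punctuation handling. It illustrates the idea and is not the NLTK-based code in mto-analyze.py; the helper name ngrams is an assumption.

```python
from collections import Counter

TWEET = ("I MAKE SURE TO TURN OFF MY LIGHTS WHEN IM NOT USING THEM TO SAVE "
         "THE ENERGY OF THE WORLD")

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = TWEET.split()  # naive whitespace tokenization
unigram_counts = Counter(ngrams(tokens, 1))  # type -> token count
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])
# [('I', 'MAKE'), ('MAKE', 'SURE'), ('SURE', 'TO')]
print(unigram_counts[("TO",)], unigram_counts[("THE",)])
# 2 2
```

Calling most_common(100) on a Counter like this is one simple way to get a "top 100" listing of the kind the analyzer displays.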
Many language models use conditional probability based on the prior probability of observed events. Finding the conditional probability when constructing a trigram language model means answering the question: given the last two words provided, what is the probability distribution of all possible next words? We use the prior probability of observed trigrams in our headlinesprepped.txt file to inform this decision. This is what offers the illusion of grammaticality when using a relatively flat approach to sentence construction. By inserting randomness into which probability-weighted type is selected, we make each generated sentence variable, rather than just always generating the most probable sentence.
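To make that concrete, here is a toy trigram generator in plain Python. It is a sketch under loud assumptions: the three-sentence corpus is made up, the probabilities come from raw counts with no smoothing, and the names train, next_word_distribution, and generate are hypothetical. The actual project relies on NLTK's model instead.

```python
import random
from collections import Counter, defaultdict

# Made-up toy corpus; the real model was trained on the scraped headlines.
CORPUS = [
    "rapper gets put on blast",
    "singer gets put on blast",
    "rapper gets put on notice",
]

def train(sentences):
    """Map each two-word context to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)][c] += 1
    return model

def next_word_distribution(model, context):
    """Estimate P(word | context) from raw trigram counts (no smoothing)."""
    counts = model[context]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def generate(model, rng):
    """Sample a sentence word by word, weighting choices by trigram counts."""
    context, words = ("<s>", "<s>"), []
    while True:
        counts = model[context]
        word = rng.choices(list(counts), weights=list(counts.values()))[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)
        context = (context[1], word)

model = train(CORPUS)
dist = next_word_distribution(model, ("put", "on"))  # blast: 2/3, notice: 1/3
sentence = generate(model, random.Random(0))
print(dist)
print(sentence)
```

Even this tiny model can produce "singer gets put on notice", a sentence that never appears in the corpus, which is the same recombination effect that makes the generated headlines entertaining.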
The script mto-languagemodel.py uses the data stored from the analyze script to create a language model using a smoothed Bayesian approach available in NLTK. It spits out N sentences, which are then stored on my site's server. For performance purposes, I don't do any headline generation on the fly. Instead, I grab a random line out of a large flat file. When searching for a subject, I'm basically adding a grep step before pulling out the random line. A post-processing regular expression handles appropriate spacing for punctuation tokens, which are treated just like normal tokens in the language model.
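The serving side of that design can be approximated in a few lines. The sketch below is an assumption-heavy stand-in, not the site's code: the sample lines are invented, the function names fix_punctuation and random_headline are hypothetical, and the regular expression only covers the simple case of a stray space before common punctuation marks.

```python
import random
import re

# Stand-in for the flat file of pre-generated headlines (made-up examples).
GENERATED = [
    "rapper gets put on blast !",
    "singer spotted at the club , looking tired",
    "rapper drops new mixtape . . . and it is fire",
]

def fix_punctuation(text):
    """Remove the space the tokenizer left before punctuation tokens."""
    return re.sub(r"\s+([,.!?;:])", r"\1", text)

def random_headline(lines, query=None, rng=random):
    """Pick a random line, optionally keeping only lines that mention query."""
    if query:
        # The "grep step": a case-insensitive substring filter.
        lines = [line for line in lines if query.lower() in line.lower()]
    if not lines:
        return None
    return fix_punctuation(rng.choice(lines))

print(random_headline(GENERATED, "singer"))
# singer spotted at the club, looking tired
```

Generating everything ahead of time keeps each request down to a filter and a random pick, at the cost of a finite pool of headlines per search term.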
So there you have it. Now all you need to do is go to the generator and search for a particular term. The fun part about language models is how they help you discover the emergent characteristics of a given corpus in a more novel way than simple charts or graphs (like what the analyzer component provides). So under all the fun, this is some real computational linguistics in action.