Just Keep Learning

Jeff Atwood blew up the Internet today by making the statement that not everyone should learn to program. I think one of the biggest complaints is that his arguments read as elitist and exclusionary, whereas the real kernel of his argument is that not everyone needs to be able to program to be good at what they do. Truly, that’s a good point. But I also think it’s a totally valid point that anyone who wants to take the time to learn how to program should have that opportunity, and not be driven out by harsh attitudes.

At the same time, learning to program should require some real learning by doing that isn’t informed by some gold-star method or gamification. I am of the opinion that these approaches only develop a false sense of mastery among their users. This is the kind of pernicious marketing that Jeff Atwood is railing against. You’re not suddenly a PHP expert because you spent ten hours on Codecademy.

Learning to program encapsulates a lot of different skills. It’s an iterative process, and you should approach it as with any new skill of similar profundity: with care, interest, and a knowledge of your own limitations. Instead of assuming you can tackle any task related to your achievement badges, you should aim to constantly question whether what you’re doing now is adequate, performant, scalable, and extensible. This is a great habit to get into for any programmer, at any time. It’s the reason why I look at diffs before ever committing code, and it’s one reason why I get better at what I do as time goes on: I take the time to review my practices and look for better ones.

But this isn’t just good advice for programming as a skill. There are plenty of great skills that you could do professionally that you can learn at home. Look at the vast array of talents required in home improvement. There are certified individuals who have the requisite education and experience to build you a house, set up your wiring, install insulation, or configure your home network. But that doesn’t mean that we should leave all aspects of these skills exclusively to the pros. That’s where Atwood oversteps his bounds with respect to programming as well. While it’s not a good idea to wire your entire house if you don’t know what you’re doing, it’s well worth knowing how to switch out your outlets, or to reset your router if it’s acting up. Not everyone may need to have the programming chops to develop a database schema or roll their own MVC platform, sure. But that doesn’t mean that they might not get good use out of some text processing that a basic education in regular expressions with a lightweight interpreted language could lend.

Most of the real nuts-and-bolts stuff I’ve learned about software, I have learned on the job, by doing. I’ve probably aced more job interviews based on my knowledge of design patterns (which exclusively involved reading every design pattern listed on Wikipedia, and then trying to identify them in real code) than by talking about the technical aspects of my Master’s thesis.

The availability of information on the Internet means that the empowerment of passionate amateurs is a trend in all skilled professions. It shouldn’t be railed against; we should just be happy that it’s easier for people who want to learn to find resources that get them adequately situated. Sure, you can watch a YouTube video about how to use a smoker, but you’re not a pit master until you’ve made it a lifestyle. And there will always be that dichotomy for all hobbies or professions that require learning and practice.

I would even go so far as to say that some skilled professions such as law and medicine should start looking for ways to better incorporate this additional tier of semi-experts into the profession. The issue here, as programming has feared in the past, is that it is likely to drive down the low end of the market. But I think that software in general has become better for having people from different backgrounds with a passion for the practice, if not so much the college major, and the availability of new technology for these fields, as with programming, allows individuals to more easily stand on the shoulders of giants. But again, there will always be a difference between those who can identify the signs of a flu, and those who know what other symptoms to look for that could result in a more grave diagnosis.

Search V2 Now Live on Wikia!

I’m proud to announce that the search solution and interface I’ve been working on at Wikia is now live!

This introduces a new UI, grouped inter-wiki search for the global site, and improved relevance. If you’re curious about more details, please take a look at today’s search announcement as well as a long-term view of search at Wikia.

I’ve had a lot of fun developing this stuff, and I’m looking forward to our next big moves!

MTO ON BLAST: A language model for a gossip blog

Maybe I don’t advertise it much on my website, but I’m a total nerd about hip hop music. A proficient stalker may have noticed my over-analytical writings over on Rap Wiki. Anyone unacquainted might enjoy my contributions to pages like DJ Screw and Lil B. As with everything I’m passionate about, I tend to perform a fairly exhaustive breadth-first search on various information sources about the things I like and what they are related to until I accidentally find myself steeped in the deep end of esotericism. As a result of this long-term obsession, I visit All Hip Hop for music news, World Star Hip Hop for new videos, DatPiff for the latest mixtapes, and, of course, MediaTakeOut for all the juicy gossip.

I’m always looking for ways to combine things that excite me. Considering the fact that I spend most of my work day bouncing around in my chair to one mixtape or another while coding, it was only a matter of time before I was hit with the following epiphany: I should make a language model for MediaTakeOut. MediaTakeOut is an interesting website, because its headlines have a lot of entertaining, edgy constructions that are relatively unique to its site. I wanted to use a language model to see whether the prevailing patterns I saw could actually be reconstructed using the laws of probability. Also, let’s be honest, here: language models aren’t really that useful, but they can be hilarious. The reason why they’re funny is that they decontextualize the underlying qualities of a given textual corpus (thus generating grammatical nonsense). This is particularly telling when that corpus uniquely subsumes a specific author or domain. A lot of the headlines on MTO are already written to be funny or outrageous, so one could imagine what you might get from inserting a little bit of randomness into the mixture.

So after a couple of nights’ worth of work, I wrote a MediaTakeOut Headline Generator, lovingly styled after the website it parodies. I open-sourced the code on GitHub under the project name MTO-ON-BLAST. Though maybe that’s not a fair name. Putting someone or something “on blast” means to give it a hard time, such as “R&B singer Monica gets put ON BLAST”. I used the term since those were the kind of entertaining constructions I was looking to generate in the language model. I would posit that the Chomsky Bot does a better job putting Chomsky on blast than my project does putting MTO on blast.

Anyhow, I’m really writing this blog post because I wanted to describe the components of the project, and how each worked. I first cut my teeth on real programming using Python in grad school, so this was an opportunity to play around with Python after years in PHP and a couple of months slogging through some intense Perl. Let’s be honest, though: this isn’t much more advanced than a graduate Computational Linguistics I project from an implementation perspective. I could have written my own language model rather than use an off-the-shelf implementation if I wanted to impress my thesis advisor, but I was really only looking to have some fun.

First off, I wrote a script called mto-scrape.py, which uses the lxml library to record headlines in the MTO archives by accessing elements in the DOM. We iterate through each page of the archive, scraping the page’s HTML, parsing it into a DOM object, and pulling out headlines from appropriate DOM elements. I also ended up using this basic approach to scrape comics from Nothing Nice To Say, an old favorite webcomic of mine, for offline viewing. So the same approach is fairly reusable.
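As a rough illustration of that scrape-parse-extract loop, here is a minimal sketch. It swaps in Python’s standard-library html.parser for lxml so that it runs without extra dependencies, and the markup it looks for (an `a` tag with the class `headline`) is purely an assumption for the example, not MTO’s actual page structure.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text of every <a class="headline"> element (hypothetical markup)."""

    def __init__(self, tag="a", css_class="headline"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.headlines = []
        self._buffer = None  # list of text chunks while inside a matching tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.headlines.append("".join(self._buffer).strip())
            self._buffer = None


def extract_headlines(html):
    """Parse one archive page's HTML and return its headline strings."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


# A made-up archive page standing in for a fetched response body.
page = (
    '<div><a class="headline" href="#">First headline</a>'
    '<a href="#">nav</a>'
    '<a class="big headline">Second &amp; last</a></div>'
)
headlines = extract_headlines(page)
```

In a real scraper, the `page` string would come from fetching each archive URL in turn, and the extracted headlines would be appended to a text file.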

Since the above script is relatively naive, I stripped escape character slashes and excessive whitespace using a command line script. You can diff headlines.txt and headlinesprepped.txt if you want to get a feel for the changes. I should probably add a bash script to automate this step, but I trust that anyone who actually wants to repeat my results would know enough to handle this rather quickly.
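For anyone who would rather not do the cleanup by hand, the same prep step could be sketched in Python along these lines. The function name `prep_headline` is made up for this example, and the exact rules (drop every backslash, collapse runs of whitespace) are my assumption about what the command line step amounts to.

```python
import re


def prep_headline(raw):
    """Strip escape-character backslashes and squeeze excess whitespace."""
    no_slashes = raw.replace("\\", "")
    return re.sub(r"\s+", " ", no_slashes).strip()


# Hypothetical raw line, as it might come out of the scraper.
cleaned = prep_headline("  Monica\\'s   NEW  look \n")
```

Running this over every line of headlines.txt and writing the results out would produce something in the spirit of headlinesprepped.txt.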

The next step is the script mto-analyze.py, which uses the Natural Language Toolkit to consume the prepped text file and store the necessary data to create the language model. It also displays the top 100 unigrams, bigrams, and trigrams for the corpus. For the uninitiated, these are N-grams, where N is the number of adjacent tokens that constitute a unique type. Consider the following tweet from Lil B:

I MAKE SURE TO TURN OFF MY LIGHTS WHEN IM NOT USING THEM TO SAVE THE ENERGY OF THE WORLD, WE ARE VERY LUCKY RIGHT NOW #BASED – Lil B

The first three unigrams would be: [I], [MAKE], [SURE]. The first three bigrams would be [I MAKE], [MAKE SURE], [SURE TO]. The first three trigrams would be [I MAKE SURE], [MAKE SURE TO], [SURE TO TURN]. These tokens are then stored as types, with individual counts associated with them. For instance, the unigram types [TO] and [THE] would both have a token count of two, since they are seen twice in this tweet. This means that they are more likely to appear than other words, from a probabilistic standpoint. This is, without getting too academic, the prevailing concept behind a language model. The more data that we have, the larger the constructions we can attempt to generalize against. With almost 37,000 headlines scraped from MTO, with an average of 16 words (or about 96 characters) per headline, it’s relatively safe for us to attempt to build a trigram model. That’s a model that uses information from the previous two words to select a relatively probable next word.
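To make the counting concrete, here is a small sketch that pulls N-grams out of the tweet above using only the standard library. The `ngrams` helper is written for this example (it is not the NLTK function), and tokenizing by whitespace is a simplifying assumption, so “WORLD,” keeps its comma.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tweet = (
    "I MAKE SURE TO TURN OFF MY LIGHTS WHEN IM NOT USING THEM "
    "TO SAVE THE ENERGY OF THE WORLD, WE ARE VERY LUCKY RIGHT NOW #BASED"
)
tokens = tweet.split()  # naive whitespace tokenization

unigram_counts = Counter(ngrams(tokens, 1))
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])
print(trigrams[:3])
print(unigram_counts[("TO",)], unigram_counts[("THE",)])
```

The Counter plays the role of the type-to-count table: each distinct tuple is a type, and its value is the number of tokens seen for it.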

Many language models use conditional probability based on the prior probability of observed events. Finding the conditional probability when constructing a trigram language model would be answering the question: given the last two words provided, what is the probability distribution of all possible next words? We use the prior probability of observed trigrams in our headlinesprepped.txt file to inform this decision. This is what offers the illusion of grammaticality when using a relatively flat approach to sentence construction. By inserting randomness into which probability-weighted type is selected, we make each generated sentence variable, rather than just always generating the most probable sentence.
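Here is a minimal sketch of that idea: count trigrams, turn the counts for a two-word context into a probability distribution, and sample the next word in proportion to those counts. This is a plain unsmoothed maximum-likelihood model written for illustration, not the smoothed NLTK estimator the project uses, and the three-sentence corpus and the `<s>`/`</s>` boundary markers are made up for the example.

```python
import random
from collections import Counter, defaultdict


def build_trigram_model(sentences):
    """Map each two-word context to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)][c] += 1
    return model


def next_word_distribution(model, context):
    """P(next word | previous two words), estimated from raw counts."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def generate(model, rng, max_len=30):
    """Sample one sentence, choosing each word in proportion to its count."""
    context = ("<s>", "<s>")
    words = []
    while len(words) < max_len:
        counts = model[context]
        word = rng.choices(list(counts), weights=list(counts.values()), k=1)[0]
        if word == "</s>":
            break
        words.append(word)
        context = (context[1], word)
    return " ".join(words)


# Toy corpus standing in for the scraped headlines.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = build_trigram_model(corpus)
dist = next_word_distribution(model, ("the", "cat"))
sample = generate(model, random.Random(0))
```

With a corpus this small, every sampled sentence is just one of the training sentences; the entertaining recombinations only show up once many headlines share two-word contexts.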

The script mto-languagemodel.py uses the data stored from the analyze script to create a language model using a smoothed Bayesian approach available in NLTK. It spits out N sentences, which are then stored on my site’s server. For performance purposes, I don’t do any headline generation on the fly. Instead, I grab a random line out of a large flat file. When searching for a subject, I’m basically adding a grep step before pulling out the random line. A post-processing regular expression handles appropriate spacing for punctuation tokens, which are treated just like normal tokens in the language model.
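The serving side described above might look roughly like this in Python. Both function names are invented for the sketch, the case-insensitive substring match is my stand-in for the grep step, and the punctuation pattern is one plausible regular expression rather than the one the site actually runs.

```python
import random
import re


def pick_headline(lines, subject=None, rng=random):
    """Filter pre-generated lines by subject (the 'grep' step), then pick one at random."""
    if subject:
        lines = [line for line in lines if subject.lower() in line.lower()]
    return rng.choice(lines) if lines else None


def fix_punctuation_spacing(text):
    """Remove the space the tokenizer left before punctuation tokens."""
    return re.sub(r"\s+([,.!?;:])", r"\1", text)


# Made-up stand-ins for lines in the pre-generated flat file.
generated = [
    "WOW ! Monica gets put ON BLAST , again .",
    "Lil B saves THE ENERGY of the world !",
]
headline = fix_punctuation_spacing(pick_headline(generated, subject="monica"))
```

Because all of the expensive sampling happens offline, a request only costs a filter and a random choice over a list of strings.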

So there you have it. Now all you need to do is go to the generator and search for a particular term. The fun part about language models is how they help you discover the emergent characteristics of a given corpus in a more novel way than simple charts or graphs (like what the analyzer component provides). So under all the fun, this is some real computational linguistics in action.

The English Language: A Fractal of Bad Design

Written in response to PHP: A Fractal of Bad Design. English speakers: please try not to take this personally. You’re in awful company.

Preface

I’m cranky. I complain about a lot of things. There’s a lot in the world of natural language that I don’t like. Most natural languages were invented by complete amateurs who didn’t have half a clue what they were doing when they started joining noun with verb. Combine that with the Sapir-Whorf hypothesis, and you’ve got a whole gang of self-congratulating numbskulls who can’t even think past the foolish paradigms they’ve constructed to truly subject their puny minds to logical thinking.

This is not the same. English is an aberration. It’s not merely awkward to speak or ill-suited for what I want, or sub-optimal, or against my religion. I could tell you something I like about most languages I don’t speak, even though I have good reasons not to speak them. But English is the lone exception. Every time I try to compile a list of gripes about the English language, I get stuck in this depth-first search of discovering more and more appalling trivia. (Hence, fractal.)

English is an embarrassment, a blight upon my Broca’s area. It’s so broken, but so lauded by every empowered amateur who’s yet to learn anything else, as to be maddening. It has so few redeeming qualities that I would prefer to forget it exists at all.

But I’ve got to get this out of my system. So here’s one last try.

An Analogy (not really)

Say you were learning English for the first time, and someone told you that the rule for forming the past tense of a verb would be to add -ed to the end of a word. That’s a productive rule that applies to all persons and numbers. I walked; John slipped; they skipped; we flipped. Now try to apply that to be, do, buy, eat, etc. You think to yourself, “Hey, this is ridiculous! There are all these rules I have to learn that don’t work for all of the verbs I want to use the most. What a stupid language. How on Earth did the people speaking this make such a mess out of their most common verbs?”

Now in order to use English proficiently, you have to memorize all of these exceptions. Most of the time, you’re dealing in these exceptions, and not the rules. However, since you took the time to memorize the productive rules, you find that ultimately, you’re able to generalize for a large set of vocabulary you never even memorized. The more you practice, the better you get. Eventually, you’re able to explain the subtleties of some of these stranger subsets of rules. Some non-native experts can even explain to you why there are subsets of verbs that exhibit ablauting, and why certain subsets of Latin-inherited words exhibit prosody and tonality that violates iambic pentameter. And the fact that some native speakers can’t shouldn’t be too much of a surprise, either. Sometimes, it’s hard to articulate the hard stuff, but it doesn’t make them any worse at speaking English. It just confirms their status as an empowered amateur.

But in the end, it’s still English. And I would never speak English, because I’m better than that. I have a degree in communication sciences and am very active on ChatHub.

Stance

English is a mess. It’s a mess because long ago, a population of Celtic people were overrun by a few populations of Germanic people. Sometime thereafter, a portion of the island they were living on became occupied by the Roman Empire. Later still, in 1066 A.D., William the Conqueror effectively made England a French colony. For much of its history, English was a language of subservience, or a language in flux due to one or more events of occupation or contact.

English became a language of historical significance, and a tool with critical mass, through a perfect storm of events beginning in the 15th century and culminating in today. The discovery of coal reserves, the establishment of the British Empire, the Protestant Reformation, the colonization of the New World and India, and ultimately a scientific history that reaches from Newton through to Turing positioned English as a major international language for trade, science, and engineering.

This is terrible. The concept of linguistic relativity suggests that using a language subjects you to the subconscious paradigms expressed in it. The fact that English misses a number of aspects, tenses, person and number distinctions screams for a solution created by people who are too smart to use a natural language as their lingua franca.

What it really all boils down to is the fact that English is a great language for complete amateurs to learn. The overwhelming majority of people who speak English as their only language learned it as their first language and stopped there. Most of them are terrible at it and make egregious mistakes in its grammar and usage every day. Do we really want to continue to allow these groundlings of such low acumen to continue to pollute the linguistic landscape?

My position is thus:

  • English is full of surprises: ox -> oxen; be / was / am; the pronunciation of the word “subtle”
  • English is inconsistent: food vs. good; read (present) vs. read (past); one sheep two sheep…
  • English is flaky: English in Canada is like a completely different language from the English spoken in India or South Africa. And I’m still trying to figure out what the lady in the beginning of “Let Me Ride” from Dr. Dre is saying, but it’s allegedly also English.
  • English is opaque: I before E except after C doesn’t even freaking work most of the time

Don’t comment with these things

I’ve been in arguments about the English language a lot. I hear very generic counter-arguments for why we shouldn’t scrap English altogether:

  • Don’t tell me you were born speaking it. Learn something else. Yeesh.
  • Don’t tell me everybody’s using it. If everyone jumped off a cliff, would you? Yeah, I didn’t think so.
  • Don’t tell me it’s an international lingua franca for science, engineering, and trade. If it was so great, then why did they have to borrow the word lingua franca from Latin? And why would lingua franca mean “French language”? See, you peel back the onion, and you’re just left with more and more layers of absurdity. And tears.
  • Don’t tell me a subset of its rules were inherited from Latin and French. Really, what’s the point of speaking some weird wrapper around Latin when we can just speak Latin? Then we wouldn’t have to worry about appropriately converting noun declensions to their place in English (type safety). How else will anyone understand a Latin loan word, if we’re not properly declining the term into its ablative case? And the Germanic part is like Perl. And Perl is hard.
  • Don’t tell me that Shakespeare, Milton, Twain, or Meyer (yes, I went there) wrote in English. I’m aware! They could write in pictograms, for all I care. You’ll always find smart people who can overcome the shortcomings of their platform.
  • Ideally, don’t tell me anything. Hearing or reading too much English on any given day is literally enough to send me into a flying rage. So I wrote a simple script in Erlang that filters out all email messages not written in Esperanto, so odds are that I won’t even see any emails you send me.

Side observation: I loooooove Esperanto. It’s got a great spec; it’s completely logical; there’s this very smart committee of fantastically wealthy, extremely popular people who spend all of their waking hours monitoring its usage to prevent it from getting too illogical. I mean, look at George Soros. That guy was raised speaking Esperanto and he gets by just fine.

ENOUGH!

Have I made my point? Almost any sufficiently utilized system develops irregularities, particularly around areas of high frequency of interaction. It’s one of the reasons that we have irregular verbs and irregular plurals. PHP’s development, without any kind of an academically or professionally defined specification, is as organic as the growth of natural language. Only over the last decade has it stopped being a pidgin of C and Perl and entered a status as a well-used but poorly regarded creole. That’s exactly the status of English before the Globe Theatre or the first translation of the Bible into what we refer to as Modern English.

Specifications, or the lack thereof, do not define the utility of a language. Prestige does not define the value of a language. If you don’t like a language — either natural or programming — don’t use it, and leave it at that.

Could you imagine if I wrote this about the fundamental shortcomings of a language from a developing nation, or a low-prestige dialect of any language? Everyone would rightly call me a bigot. I’m suggesting that squawking about the shortcomings of a language without trying to offer solutions does roughly amount to technological bigotry. It’s not productive, and it’s not welcome. Participation and dialogue are always welcome. But you’re not opening a productive dialogue by denigrating the subject of discussion.

Search Haters Gonna Hate?

So I just had my morning derailed by some polemic about search over on the MSDN blog. Don’t worry: it was linked to from Hacker News; I wouldn’t normally go there of my own volition. Dr. James Whittaker, “a technology executive focused on making the web a better place for users and developers”, wrote an article called “Why I Hate Search”. Okay, James. Why do you hate search?

The word ‘search’ is a negative word. It fairly reeks of loss and effort. You lose your car keys and you search for them. Your pet runs away and you search for her. Having to search implies loss. It implies effort. Search is a means to an end. You search to rescue; you seek to find. There is little that is pleasant about the process itself. The only time to feel good about a search is when it ends, successfully.

With a heavy sigh, I continue reading. I slog through a subjective, nebulous paragraph espousing the author’s opinion on the connotations of the term “search” versus the term “find”. He cites “searching” for keys when you lose them and “searching” for pets when they run away as reasons why the term “search” is so awful. Let’s pause for a minute. The difference between calling an application a “search engine” versus a “find engine” is even more trivial than pragmatics or semantics; it is a marketing problem, and has no effect on the ultimate functionality of such a utility.

So beyond marketing issues, which sites like Bing and Ask have previously attempted to tackle (in my opinion, rather unsuccessfully), what is the problem with search? Dr. Whittaker feels that search is broken, in summation, for the following reasons:

  • Search engines serve up search results pages — they don’t send you directly to what you’re looking for
  • Search engines do this because their revenue is largely based on ad delivery hosted on the search results page
  • Large companies that are centered on a search application have no incentive to innovate beyond this paradigm, because doing so would cause a loss of revenue. This means we must look to outsiders for further innovation on search (e.g. Apple’s Siri)

This line of reasoning is patently absurd. This assumes that for most queries:

  1. There is only one correct page for any given query.
  2. The user knows exactly what result they want.
  3. The user is 100% right in that this result is appropriate both for himself or herself, as well as for all other users. Furthermore, they are right in that there are no results remotely as relevant as their desired best result for the same query.

I’m a little surprised that the kind of ambiguities that make serving a results page necessary escape the author. He is a self-proclaimed “former Googler, former professor and former startup founder”. Dr. Whittaker readily conflates search, which is as much about discovery as it is about navigation, with mind-reading. To his credit, I probably should have just stopped reading after he devoted more than one melodramatic paragraph to the semantics of the term “search”.

Here’s an excellent example of everything I find wrong with this article: after having some questions about the actual usage statistics in these contexts, I used Google to get some numbers on how many documents match these exact phrases. I used the verbatim tool and found numbers identical to the default setting, so let’s naively assume that it’s applicable across all of the hidden variables for every other user in the search engine.

“looking for my keys”: 313,000 results
“searching for my keys”: 104,000 results

“looking for my pet”: 4,460,000 results
“searching for my pet”: 538,000 results

I may not have combed through every result to make sure that they relate to the context of lost pets and keys, but at least I’m using real data to inform my opinion. Using this information, we can see that the exact straw men the author constructed don’t hold up under their own weight. If he had tried to be in any way experimental about his argument, he could have saved at least two people from writing an acerbic blog post. Dr. Whittaker makes the same mistake that the generative linguists have made for decades: basing what they believe to be objective arguments on intuition alone.

The fact that I was able to access this information through a search engine unravels Whittaker’s perceived intentions of the general purpose of search. Let’s not even get into the “discovery” use case, whereby an individual seeks to learn about a topic by accessing multiple relevant documents on an identical term (really, what is the one right page for “Napoleon” or “The Beatles”?). I think one example of what makes search a robust and useful platform in its current state is counterpoint enough.

In my use case, I made a search not in hopes of accessing a single page, but in search of information about the term itself, and its presence on the web — search metadata, if you will, that is extremely helpful in making decisions about future search behavior, and the importance of each document relative to the other. Google is not just a “finding” engine; it’s a tool for research, and a warehouse of knowledge. If you want to get a single page and skip the SERPs, just click the “I’m feeling lucky” button and quit complaining.

What I’m Up To Lately

So I’ve been a little quiet on this blog lately. I’ve been busy devoting a lot of my free time to some interesting contracting opportunities.

I’m currently working with Time Doctor, LLC, a company working on the sites TimeDoctor.com and Staff.com, as a retained technical advisor. It’s a great opportunity to work on different software in my spare time and continue honing my craft. I get the opportunity to analyze code quality and application architecture, and solve some interesting problems on a useful stack of productivity and hiring tools. I’d see myself continuing on in this capacity for the foreseeable future, since I really enjoy doing it in my free time.

Here and there, I’ve also been providing some additional time to Earmilk.com, a really cool cutting-edge music blog that’s got some interesting new features on the up and up. Keep an eye out for those changes to come into effect real soon.

My big announcement is that I will be changing my full-time employment from Software Architect at AetherQuest Solutions to Senior Software Engineer at Wikia. I’m making this change for a lot of reasons. Working at a top-50 website is definitely up there. And I’m really excited to get into Big Data again now that the technology has progressed a bit. I’m excited to use my search knowledge again, too, after spending so much time on other sorts of enterprise-level concerns in my current role.

I’ll continue strengthening my skills in application architecture, technical definition, and project management on the side as I work through some important issues with Global Workforce. This is important to me, because I don’t want to lose any of the momentum I’ve achieved over the last year and change.

I had a lot of great opportunities to learn and grow at AQS, and I’m really proud of what I accomplished there. But it’s about time I put the “language” back in “Language Hacker”!

SLA-Driven Development

A service-level agreement is a useful method for setting expectations on deliverability, and incentivizing quick turnaround. It’s also a great way to motivate a development team. SLAs encourage goal-setting and collective buy-in internally. They improve your users’ perception of your software and the people who maintain it. Best of all, they provide concrete, measurable criteria to evaluate the success of your team at a larger-scale level than issue-by-issue, or the regressions introduced in a given release. Whether your business model is software as a service or software as a product (free products included), you can easily use the fundamental components of an SLA to reap positive benefits. In this post, I will describe some guidelines for implementing an SLA in your team.

Novel Methodologies for Distributed Development Teams

I work in the Bay Area at a company based out of the Washington, DC area. I lead a development team with the manpower roughly divided between both offices. On top of that, we have certain stakeholders that work in other cities out of a home office. This isn’t unusual these days in tech. However, it puts a lot of strain on traditional methods of project management and release management. I’ve used it as an opportunity to implement some nifty methods that some might call agile to keep pushing our product forward in a rapid but guided manner.

The Highlighter Incident

When it comes to “The Real World”, one of my biggest learning experiences was getting fired from a temp job at a Verizon Wireless store the summer before grad school.

Future-Proof Your Database Change Log

Adding a change log to your database is the best way to make sure you’re working on a version of your web application that adequately reflects a given state of your product. However, when working with a branching-and-merging development environment, where two different developers may be working on a migration at the same time, we often encounter a race condition that can make it hard to keep development-level environments as pristine as those that only deploy stable code. Here, I outline a methodology that allows both the flexibility of distributed development as well as sane migration management in an application with an evolving database schema.