Last week, I decided to play around with some CSS3 that I can't use at work due to the usual issue with Internet Explorer. The site I ended up playing with is the SourceForge page for a parser I wrote as part of my master's thesis (Robust Methods for Automated Discourse Connective Argument Head Identification) over 2007 and 2008. MRG Utils takes sentence trees in the Penn Treebank II format and turns them into iterable, navigable Python classes. I used these classes to generate instances over which to run a maximum entropy ranker. The library was responsible not only for sniffing out rankable instances for a given decision, but also for extracting syntactic and semantic features to inform the model. In the process of playing with the website's front end, I revisited my old code, which was a bit of a walk down memory lane.
Just a little bit more background on why I built this library: this project started as a research assistantship under Jason Baldridge. It was attached to an NSF-funded project -- with Jason, Nick Asher, and Jonas Kuhn as principal investigators -- known as Discor. The project's two interdependent goals were to investigate methods for generating discourse structure and to determine dependencies between clauses within those structures. I personally investigated ways to link discourse connectives to the syntactic heads of the clauses they connect, inspired by a paper from Ben Wellner.
While structural discourse connectives were a fairly easy problem, adding adverbial discourse connectives to the mix brought a whole new dimension to it. There were not only syntactic issues to consider, but the whole gamut of issues involved in anaphora resolution as well. We ended up improving on Wellner's results by interpolating type-specific models, which better accounted for the differing behaviors of structural and adverbial connectives. If I had continued with this line of study, I probably would have ended up using Pascal Denis' anaphora resolution approach based on integer linear programming to further improve performance on discourse adverbials.
In order to accomplish the above task, I had to pair the Penn Treebank II with the Penn Discourse Treebank and extract data to create rankable instances for a maximum entropy ranker. To do this, I rolled my own Python-based tree parser that specifically handles the .mrg files found in the PTB2.
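To give a sense of what that kind of parser does, here is a minimal sketch of turning a PTB2-style bracketed string into a nested node structure. The names (`Node`, `parse_mrg`) and the simplified grammar are illustrative only, not MRG Utils' actual API, and this toy version skips PTB details like the outer empty-label wrapper and trace nodes.

```python
import re

class Node:
    """A toy tree node: a label plus either children or a terminal word."""
    def __init__(self, label, children=None, word=None):
        self.label = label               # syntactic category or POS tag
        self.children = children or []   # empty for terminal nodes
        self.word = word                 # surface token for terminals

    def __repr__(self):
        if self.word is not None:
            return f"({self.label} {self.word})"
        return f"({self.label} {' '.join(map(repr, self.children))})"

def parse_mrg(text):
    """Recursively parse one bracketed tree string into a Node."""
    # Tokenize into parens and runs of non-space, non-paren characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        if tokens[pos] not in ("(", ")"):
            # Terminal like "(NNP John)": consume the word and its ")".
            word = tokens[pos]
            pos += 2
            return Node(label, word=word)
        children = []
        while tokens[pos] == "(":
            children.append(parse())
        pos += 1  # consume this node's ")"
        return Node(label, children=children)

    return parse()

tree = parse_mrg("(S (NP (NNP John)) (VP (VBZ runs)))")
print(tree)  # prints (S (NP (NNP John)) (VP (VBZ runs)))
```

The recursive-descent shape mirrors the recursive structure of the bracketing itself, which is why this style of parser stays so short.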
This project represents several firsts for me. Specifically, it was the first large-scale, object-oriented development project I had taken on by myself. Back then I was a linguist first and a developer second, so it was great practice for improving my understanding of software architecture. But at this point, as with any code base from the past, I mostly look at it and see places where I could have done better.
Before doing the whole obligatory "here's what I would do now" bit, I want to take a minute to play up the good points of this library. The code is fairly effective at going from a straight text document to a data structure that can be easily analyzed and navigated for a variety of purposes. Non-terminal nodes have a great set of methods for aggregating or accessing children and siblings, which can be pretty nifty. It was also quite useful to be able to create Gorn addresses during object instantiation, especially since that is how the Penn Discourse Treebank is anchored to the data in the Penn Treebank II's .mrg files. I also appreciate how smart this implementation is about using the recursive nature of syntactic trees to build itself and populate its children: each parent object essentially drills straight to the bottom of its structure and instantiates itself by letting its children build it up.
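The Gorn-address idea can be sketched in a few lines. This is not MRG Utils' code, just an illustration under assumed names: a Gorn address identifies a node by the path of child indices from the root, so it can be assigned for free during the same recursive pass that builds the tree.

```python
class Node:
    """Toy node that assigns Gorn addresses while building recursively."""
    def __init__(self, label, children=(), gorn=()):
        self.label = label
        self.gorn = gorn  # e.g. (1, 0) = first child of the root's second child
        # Each child extends its parent's address with its own index,
        # so the whole tree gets addressed in one construction pass.
        self.children = [
            Node(child_label, child_kids, gorn + (i,))
            for i, (child_label, child_kids) in enumerate(children)
        ]

    def descendant(self, gorn):
        """Navigate to the node at a Gorn address relative to this node."""
        node = self
        for i in gorn:
            node = node.children[i]
        return node

# (label, children) tuples standing in for parsed .mrg input
spec = ("S", [("NP", [("NNP", [])]),
              ("VP", [("VBZ", []), ("NP", [])])])
root = Node(*spec)
print(root.descendant((1, 0)).label)  # prints VBZ
```

Since the PDTB annotates argument spans by Gorn address into the PTB2 trees, having those addresses materialized at instantiation time makes the two corpora line up with a simple lookup like `descendant`.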
My basic thoughts on how I would do it now, after years of experience, are the following simple notes:
I should have been less dependent on explode and implode functions, and should have spent more time getting the regular expressions right. It would have been worth it.
While this library did a good job with respect to speed (the real time bottleneck occurred during classification), could I have been more efficient? Maybe instead of going straight from .mrg file to Python class, I should have created an intermediate representation that explicitly described the dependencies between nodes and could be loaded more quickly.
As a web developer, my first instinct at this point would be to give a unique ID to every node on a tree -- terminal or not -- and then build out one-to-many relationships for each parent node. If I had done this with a database, I could have used a database abstraction layer to populate my objects each time. That might have made my life a little easier in terms of separating functions useful for feature extraction and navigation from functions needed to go from a flat file to an object nested within a hierarchical tree structure, itself nested within a document. But that would have involved populating a database with information MRG Utils was already extracting, so maybe it would have been repeating myself too much.
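For concreteness, here is roughly what that database layout would look like, sketched with Python's built-in sqlite3. Table and column names are hypothetical: every node gets a unique ID, and the one-to-many parent/child relationship lives in a `parent_id` foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),  -- NULL for the root
        label     TEXT NOT NULL,                -- category or POS tag
        word      TEXT,                         -- NULL for non-terminals
        child_idx INTEGER NOT NULL              -- position under parent
    );
""")

# (S (NP (NNP John)) (VP (VBZ runs))) flattened into rows
rows = [
    (1, None, "S",   None,   0),
    (2, 1,    "NP",  None,   0),
    (3, 2,    "NNP", "John", 0),
    (4, 1,    "VP",  None,   1),
    (5, 4,    "VBZ", "runs", 0),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)

# Navigation becomes a query instead of object traversal:
children = conn.execute(
    "SELECT label FROM node WHERE parent_id = ? ORDER BY child_idx", (1,)
).fetchall()
print(children)  # prints [('NP', ), ('VP', )] -- the root's children
```

An ORM-style abstraction layer could then hydrate node objects from these rows, keeping the flat-file parsing concerns entirely separate from navigation and feature extraction.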
So there are my thoughts. At this point, I'd be happy to say that this code does a good job of achieving its goal. Until Python becomes more interface-friendly, I don't really see any design patterns that I would rather have used. Unfortunately, I don't have much use for this code anymore: I no longer have access to the Penn Treebank, and I don't really have the time for academic projects. I released the code as open source shortly after finishing grad school, and I hope it someday serves a useful purpose for another computational linguist.
As with all small-niche software projects, I keep hoping it will help just one more person someday.
Since I posted it in the summer of 2008, MRG Utils has been downloaded from SourceForge 38 times. I'm guessing that most of those downloads came from crawlers of some kind, not humans. I have never been contacted with questions about the code. Either way, it was a rewarding project, and I hope someday to find the time to work on an open-source project that helps more programmers.