After the discussion generated by my last post, I've come to realize that the solution I have in place for analyzing millions of pages of XML parse data is both useful and reasonably performant. Because of this, I've decided to share the library I've been using with a broader audience.

Today I'm celebrating two events: my 30th birthday (woop woop), and the 1.0 release of the Python CoreNLP XML Library.

corenlp_xml is a Python library that provides a data model on top of Stanford CoreNLP's XML output. You can install it with pip. Under the hood, corenlp_xml uses lxml and lazy-loading techniques for high-performance querying and data access, and it uses NLTK's tree parsing to provide richer interaction with the S-expression parse of each sentence in the XML.
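To give a feel for the kind of access the library wraps, here is a minimal sketch of reading CoreNLP's XML output directly with lxml and lifting each sentence's S-expression parse into an NLTK tree. The file name and element paths are illustrative assumptions based on CoreNLP's standard XML layout; this is not corenlp_xml's own API, which handles these details for you.

```python
from lxml import etree
from nltk.tree import Tree

# Assumed input: a CoreNLP XML output file named document.xml
with open("document.xml", "rb") as f:
    root = etree.parse(f).getroot()

# Each <sentence> carries its constituency parse as an S-expression
# in a <parse> element; NLTK can turn that string into a Tree.
for sentence in root.findall(".//sentences/sentence"):
    parse_text = sentence.findtext("parse")
    if parse_text is not None:
        tree = Tree.fromstring(parse_text)
        print(tree.label(), " ".join(tree.leaves()))
```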

I've used corenlp_xml to solve the following problems at high scale:

  • Recursively identify all noun phrases for each sentence in a document (a minimal version of this is sketched after the list)
  • Cross-reference proper noun phrases with coreference mentions, getting a full mention count in a document for a given entity
  • Identify interactions between subject and object using dependency parse data
  • Get all semantic heads for a given document

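As a rough illustration of the first item, noun phrase extraction falls out naturally once the sentence parse is an NLTK tree: `subtrees()` walks the tree recursively, so nested noun phrases are picked up as well. The helper name and the hand-written parse string below are my own examples, not part of the library.

```python
from nltk.tree import Tree

def noun_phrases(tree):
    """Collect the surface text of every NP subtree, including nested ones."""
    return [" ".join(np.leaves())
            for np in tree.subtrees(lambda t: t.label() == "NP")]

# Hand-written parse string standing in for a CoreNLP <parse> element
parse = "(ROOT (S (NP (DT The) (NN library)) (VP (VBZ wraps) (NP (NNP CoreNLP) (NN output)))))"
print(noun_phrases(Tree.fromstring(parse)))
# ['The library', 'CoreNLP output']
```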
More information is available on corenlp_xml's ReadTheDocs page. This library should be a great help to anyone who wants the power and accuracy of CoreNLP's parsing output but wants to use Python's numerical computing ecosystem for further analytics, data science, or machine learning.

If you think the library could be useful to you, please feel free to contribute back to the project or vote for my upcoming Lucene/Solr Revolution talk.