I've recently been researching eXist-db for the use case of storing, indexing, and querying XML parse data output from the Stanford CoreNLP pipeline. I had high hopes, but after putting it through its paces, I can say that Exist (sorry, not buying into the silly spelling) needs a lot of improvements before it can address the needs of a high-scale, high-availability scenario. This post is intended to serve as a real-life use case, a list of grievances, and an appeal to the Exist community for more modern architecture and community management.
Update: A member of the community has reached out to me both in the comments here and on Twitter, which I find encouraging. I'll add some updates as I encounter them.
At first blush, Exist had a lot of potential. It supports XQuery, which allows you to query multiple collections of XML documents based on node text, XPath, or attribute values. This would be helpful in my specific use case because it would allow me to reason over, for instance, dependency parses: show me all documents where the subject of a sentence is Wolverine. From there, we could use that information to start reasoning about Wolverine's characteristics, directly from the XML output of our pipeline.
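To make the "subject is Wolverine" query concrete, here is a minimal single-document sketch in Python rather than XQuery. It assumes the CoreNLP XML layout in which each `sentence` holds a `dependencies` element of type `basic-dependencies` containing `dep` elements with `governor` and `dependent` children; the sample document and the `subject_actions` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed CoreNLP XML layout (not real pipeline output).
SAMPLE = """<root><document><sentences>
<sentence id="1">
  <dependencies type="basic-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">slashed</dependent></dep>
    <dep type="nsubj"><governor idx="2">slashed</governor><dependent idx="1">Wolverine</dependent></dep>
    <dep type="dobj"><governor idx="2">slashed</governor><dependent idx="3">Sabretooth</dependent></dep>
  </dependencies>
</sentence>
</sentences></document></root>"""


def subject_actions(xml_text, subject):
    """Return (sentence id, governor word) pairs where `subject` is an nsubj dependent."""
    root = ET.fromstring(xml_text)
    hits = []
    for sentence in root.iter("sentence"):
        for deps in sentence.iter("dependencies"):
            # Only read the basic parse so collapsed variants don't produce duplicates.
            if deps.get("type") != "basic-dependencies":
                continue
            for dep in deps.iter("dep"):
                if dep.get("type") == "nsubj" and dep.findtext("dependent") == subject:
                    hits.append((sentence.get("id"), dep.findtext("governor")))
    return hits


print(subject_actions(SAMPLE, "Wolverine"))  # [('1', 'slashed')]
```

This only handles one document at a time; the appeal of an XML database is running the same question across whole collections with an index behind it.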
We're currently storing XML files on Amazon S3, so I was also excited about the possibility of getting that core data managed through a legitimate database, with CRUD and indexing capabilities. Particularly, the ability to use Lucene-style indexes on different fields and attributes would make accessing parse data across documents faster, easier, and more robust than our current approaches.
Exist has been around since 2000, and won "Best XML Database of the Year" in 2006. With a history like that, I expected that it would be as reliable and scalable as something like Apache Solr -- another mature Java-based queryable document store currently positioning itself as a next-generation NoSQL database. You'd also think that, with a brand-new release candidate, the community would be active and easy to follow.
Kicking the Tires
To test Exist, I spun up an m2.4xlarge EC2 instance with Java 8 installed. First, I had to jump through some hoops on SourceForge to download the correct JAR file and then upload it to the server. I ran the installation JAR, which allowed me to configure JVM memory -- 50GB, with plenty left over for garbage collection given 64GB on the machine -- and cache memory -- 5GB, since we can still afford the space.
I initially ran into a lot of JVM out-of-memory errors, and was kind of awestruck. How could what I'm doing take up dozens of gigs of memory already? I eventually figured out that those configurations don't replicate over to the config file for the wrapper tool, which is used to turn the server into a service. So once I manually configured those settings, I was in business.
Starting from scratch, I was looking for three things: file upload speed, consistency/reliability, and ease of use. Exist didn't end up meeting my needs at the scale of tens of millions for any of these.
File upload speed was the big one. I was using a test set of about 50,000 documents with an average size of roughly 150KB. Each document took roughly 200ms to ingest. This sounds pretty fast at first blush, but since a given upload is being handled serially, we're still talking about five weeks (15 million x 0.2 seconds is roughly 35 days) to index just 15 million documents of that average size on one machine.
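For reference, the serial cost follows directly from the two figures quoted above. This is just back-of-the-envelope arithmetic; the helper name `serial_ingest_days` is mine, and it assumes the 200ms per-document time stays constant as the database grows.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def serial_ingest_days(doc_count, seconds_per_doc):
    """Days needed to ingest doc_count documents one after another."""
    return doc_count * seconds_per_doc / SECONDS_PER_DAY


# 15 million documents at 0.2 seconds each, uploaded serially.
days = serial_ingest_days(15_000_000, 0.2)
print(f"{days:.1f} days")  # 34.7 days
```

The estimate scales linearly, so every additional ten million documents adds a little over three weeks on a single serial connection.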
I made multiple attempts to force concurrency onto Exist from the client side -- by running multiple scripts, and by using multiple queries. The more documents you try to upload at once, the longer each individual upload takes. So that's off the table. I explored each of the six different ways to insert data that made sense. None of them were fast. None of them took advantage of concurrency within the application without a lot of forcing. My assumption is that uploading is not meant to be concurrent due to blocking I/O and indexing concerns. We'll talk about that later.
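A client-side concurrency test of this kind can be sketched as follows. This is an illustration, not the original scripts: `upload_all`, `timed_upload`, and `fake_upload` are hypothetical names, and `fake_upload` merely sleeps in place of a real request to the database.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_upload(upload, doc):
    """Run one upload and return how long it took, in seconds."""
    start = time.perf_counter()
    upload(doc)
    return time.perf_counter() - start


def upload_all(upload, docs, workers):
    """Upload docs through a thread pool; return (wall time, mean per-document latency)."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda doc: timed_upload(upload, doc), docs))
    wall = time.perf_counter() - wall_start
    return wall, sum(latencies) / len(latencies)


def fake_upload(doc):
    """Stand-in for a real upload call (assumption: swap in your own client code)."""
    time.sleep(0.01)


if __name__ == "__main__":
    docs = ["<root/>"] * 40
    for workers in (1, 2, 4, 8):
        wall, mean_latency = upload_all(fake_upload, docs, workers)
        print(f"workers={workers} wall={wall:.2f}s mean latency={mean_latency * 1000:.1f}ms")
```

Against a server that handles writes concurrently, wall time should drop as workers increase while mean latency stays roughly flat. What I saw instead was per-document latency climbing with the number of simultaneous uploads.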
As far as consistency and reliability go, my personal interaction with Exist has led me to consider it an unstable platform. I had to wipe the database files and start over several times through my experimentation. Especially when trying to throw a lot of documents at Exist at once, the database would often go into read-only mode for no reason, or quietly just stop working and need to be restarted.
Good luck if you have to hard-restart Exist, too. Any event that doesn't involve a graceful shutdown seems likely to make things go haywire due to leftover lock files. This initializes a recovery mode that may sit at 98% complete forever. You're actually better off just blowing away your data and re-ingesting it. So speed really matters here! To make matters worse, there's only master-slave support for replication, and no support for sharding/clustering.
Ease of use is another area of improvement for Exist. I had a lot of trouble finding answers to most of my questions. For anything that wasn't explicitly answered in the documentation, the only places to look were old plaintext forums. These were not a pleasure to wade through -- especially with all the "reply cruft" in each message. Considering all this, I was a little surprised to see that Exist does in fact have a Twitter account -- but not a very active one. I've been mentioning Exist for the last few months, and quite a bit, directly, over the last week, and haven't received any responses.
Exist is not presently suitable for a demanding, high-scale environment. As a single server instance, it cannot support a high-bandwidth insertion pipeline. And it does not support clustering in a manner that would enable the horizontal scaling needed to remediate this particular bottleneck. Basic interactions with the database even on the order of tens of thousands of documents run the risk of significant system instability and data loss. The user community is small and uses somewhat old-fashioned forms of communication. Questions about scalability are few and far between, and their answers even sparser.
How Exist Can Improve
The intention of this post isn't to decry the Exist community. It obviously has other priorities apart from its application as a server -- hence all the desktop support, for instance. But in order to be considered a real NoSQL database, in my opinion, Exist needs to support high scale. I chose Exist in the first place because it seemed to be the strongest choice for an XML database. This means that there is still a need for a high-scale XML database, and Exist can meet that need with some serious changes.
Mitigate Issues Related to Blocking I/O
Slowdowns happen when lots of documents are uploaded at once, even when using multiple connections or client scripts. This suggests blocking I/O. What I can tell from the logs is that these files are being written immediately.
This caused me to ask myself: what does Solr do differently from Exist that allows it to accept hundreds of long-form documents a second? Solr has a commit phase wherein it aggregates pending "upload" or "index" events and performs the actions involving blocking I/O -- that is, writing to the index -- all at once. If I'm throwing tens of gigs of memory at Exist, there's no reason why a commit step wouldn't be feasible. Allowing documents to stay in memory for a short period of time should significantly speed up uploading.
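The commit idea can be sketched in a few lines. This is a toy illustration of buffering writes and flushing them as a batch, not how Solr or Exist is actually implemented; `BufferedStore`, `write_batch`, and `max_pending` are hypothetical names.

```python
class BufferedStore:
    """Hold incoming documents in memory and write them out in batches."""

    def __init__(self, write_batch, max_pending=1000):
        self._write_batch = write_batch  # callable that does the blocking I/O
        self._max_pending = max_pending
        self._pending = []

    def add(self, doc_id, xml_text):
        """Accept a document without touching disk; commit once the buffer is full."""
        self._pending.append((doc_id, xml_text))
        if len(self._pending) >= self._max_pending:
            self.commit()

    def commit(self):
        """Flush everything pending in one blocking write; return the batch size."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self._write_batch(batch)
        return len(batch)


# Usage: collect each flushed batch in a list instead of writing to a real index.
flushed = []
store = BufferedStore(flushed.append, max_pending=3)
for i in range(7):
    store.add(f"doc-{i}", "<root/>")
store.commit()
print([len(batch) for batch in flushed])  # [3, 3, 1]
```

The trade-off is that anything still sitting in the buffer is lost if the process dies before a commit, which is why this approach is usually paired with a transaction log.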
Whether or not the above is addressed, Exist should allow for ZooKeeper-based clustering. I'm sort of imagining something similar to SolrCloud for tacking on Paxos-style clustering for a set of collections. This would at least start to help address problems related to indexing and querying with hundreds of thousands of collections with upwards of a million documents per collection.
Fix Stability Problems
Uploading something shouldn't kill the server. If the server must die, then personally, I would rather lose some data than all data. Adopting ZooKeeper and using a transaction log would probably go a long way in improving consistency over the current recovery mode rigmarole -- which doesn't really do all that much recovering, in my experience.
Get off SourceForge and Onto GitHub
I understand that SourceForge was a good choice even as recently as 2007. It's not anymore. It's my opinion that Exist's reliance on SourceForge as the canonical location for the software and community is turning off potential users and contributors. Exist should completely abandon SourceForge and fully commit to GitHub for downloads and contributions. They should start handling community interactions using GitHub Issues so that the content can be more easily consumed.
Deprecate the Mailing List
One of the problems with the community is that the forum/mailing list experience is a big turnoff compared to the modern engineering experience. I suggest keeping the mailing list archives online, but deprecating participation in favor of interactions on Stack Overflow. Creating a presence on Stack Overflow will not only make the questions and their answers easier to read and follow, but it will help better represent the existing community and garner interest from prospective participants.
Ditch the Capital X
This is only half-joking. For consistency's sake, you shouldn't be capitalizing the second character in the name of your software. It's pretty aggravating when trying to start and stop services, for instance, to always remember to capitalize the X because I can't tab-complete it once I've typed 'ex'. I know this isn't a big deal, but since the rest of the experience wasn't flawless, it just feels like one more aggravation trying to work with it. The whole thing seems like an unnecessary marketing move. We get it, guys -- it's for XML! There's an X in "exist"!
For better or worse, we're stuck with XML. XML is a big, verbose way of defining structured data, but there's clearly a need for it over things like JSON -- even today. So there is an absolute need for an effective way to index, query, and manage these documents across one or more collections. As more data is generated, we're going to need a stable, highly-available, performant way to manage it all as XML. Even as an intermediary format that we are extracting data from and pushing into another formalization, a server capable of high storage and efficient querying just sounds useful.
ExistDB is pretty much the only game in town with respect to its community and its configurability. But there are a lot of things that need to change -- and as soon as possible -- to make it a platform for growth over the next five years. Without taking any of the above suggestions, the odds are that Exist won't be ready for use in high-scale, high-demand environments any time soon, and it will be stuck in 2006 for many years to come. This would essentially mean the slow and inevitable death of the project, as data needs get bigger and bigger. I hope for the community's sake that they make the difficult decisions and do the hard work required to step ExistDB's game up.
My Next Steps
Originally, I wanted to use Exist as an intermediary between XML parses and the Neo4j graph database. But since Neo4j now has a super-fast built-in ETL solution, I'll still be able to quickly move from XML to a dependency graph with a quick intermediary Python script. But it's really too bad, because I was looking forward to using Exist for managing all of those transformations. While it's not capable of cross-document querying, I have written a library that is intended to serve as a data model for Stanford CoreNLP XML parse outputs. So I can still use that for document-by-document transformations. I'll just need to do all my querying on Neo4j!
Request for Comment
This was my last week or so struggling with getting Exist to meet my particular use case. Please feel free to comment here on anything I might have missed, or anything that might help me learn ways I still might be able to use Exist to meet my needs. I'm not writing off the project; I'm writing down my problems with it, to show that I made an informed decision to give up on using it for now.