As I begin writing this, I realize I am in some way contributing to all the noise and novelty around "NoSQL". As tempting as rewriting your website to use Cassandra may be, the chances that an objective cost-benefit analysis will support the idea are pretty small when your project is out to make money. Pragmatism is the key to using any tool to its fullest and most appropriate extent. You can still ride the "no wave" (with apologies to Lydia Lunch et al.) by implementing key-value stores piece by piece on your site. Focus on domains where their special optimizations can be used most efficiently. In this blog post, I'll be outlining Sesh, a wrapper to Zend_Session_Namespace which uses Redis to reduce the amount of RAM used within a session.

Scalability is a problem that never goes away. As terrible as premature optimization is, not having a plan for higher demand is, for the web developer, an even more pernicious evil. As a site attracts more and more users, the fundamental characteristics of traditional web development can start to feel cumbersome. In my opinion, that is why there's so much talk lately about the NoSQL movement. It's easy to blame the database, which is a slightly opaque but necessary evil to the traditional LAMP developer. Treating each database query as a detriment is a tempting fallacy: queries are easy to count, and sometimes hard to optimize, even for an average developer. As load increases on a site, improperly designed or scaled databases can reduce a site's responsiveness, but so can inefficient server-side code.


Sure, fast key-value stores are great. In fact, I'm going to spend a part of this post speaking in support of one. But that's no reason to dis your RDBMS. Relational databases have obvious value. If your data doesn't have a lot of intricate, multi-faceted relationships, then a flat key-value store works. On the other hand, if MySQL is at the core of your website, it may be foolhardy to think you could rewrite the whole thing to use Cassandra in one weekend in anticipation of Twitter-like traffic. More than likely, spending a weekend making sure your queries are properly optimized and your data is clean will actually improve your site. Let's face it: you're not going to get an overhaul like that done in the time you'd expect anyway.

Making sure you're not letting an alternate architecture become some kind of a 'grass is always greener' scenario is really the name of the game. Nothing is better at keeping you pragmatic than actually implementing new, exciting software on a small scale first, where it can really count.

The Problem
I recently had to work on a problem that used a lot of session data to decide whether or not to display something. Because it all addressed a particular feature, I wanted to keep everything in the same namespace. With the multitude of variables -- many somewhat similar in function -- associated with the display logic, it was becoming abundantly clear that I was going to need to be fairly verbose to keep track of everything my code was doing. So of course, I was working with variables such as $myNamespace->flagClickedPageviews to differentiate from $myNamespace->defaultPageviews, etc. Because this feature could display anywhere on the site, any user who accessed the site would immediately have some of these variables stored in their session.

During a code review, a colleague of mine brought up an excellent point: these variables will be stored for each session, meaning that each byte of a variable name will be multiplied by the number of users visiting the site on any given day. If that number is in the hundreds of thousands, then we're talking hundreds of kilobytes of RAM for every single character in a name. With more people, or more session variables, this will add up quickly.

Since I wanted to write clean, legible, sensible code, shortening the variable names themselves was not an appealing option. Zend_Session_Namespace is an excellent tool to use in this scenario, but one shortcoming is that it stores exactly what you feed to it as an attribute within the session. Optimally, it would be nice to have the best of both worlds: verbose attribute naming with minimal drain on RAM.

Sesh: The Solution
In order to address this issue, I wrote a quick wrapper to Zend_Session_Namespace called Sesh. The object of Sesh is to shorten session variables within a namespace behind the scenes from actual development. This way, the developer can sensibly name variables without worrying about memory consumption on the server. You can look at the code here: http://github.com/relwell/Sesh/blob/master/Session/Namespace.php

Sesh uses Redis, a fast key-value store, to look up single-character hashes for a key within a specific namespace.  It wraps around Zend_Session_Namespace. It assigns its variables to a namespace instance automagically, using the __get() and __set() methods.

For each namespace, Sesh creates an instance of Zend_Session_Namespace to wrap around. Each time a new variable is set, Redis is queried for a hash. If there is no hash, a new one is created. Redis keeps track of the number of namespace attributes created and uses that number to create a single ASCII-extended character using the chr() function.  This character becomes the attribute of the Zend_Session_Namespace instance, which then takes care of all of the actual work of storing it in the session -- as a single character.
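The lookup described above can be sketched in a few lines. This is not the Sesh source (which is PHP and talks to Redis through Rediska); it is a minimal Python illustration of the same idea. Everything named here is a stand-in I made up for the example: FakeRedis is an in-memory substitute for a Redis connection, a plain dict plays the role of the session, and the "sesh:..." key layout is an assumption, not the layout Sesh actually uses.

```python
class FakeRedis:
    """In-memory stand-in for the three Redis calls the sketch needs."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def incr(self, key):
        # Like Redis INCR: a missing key counts as 0 before incrementing.
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]


class ShortKeyNamespace:
    """Stores verbose attribute names under one-character session keys."""

    MAX_KEYS = 255  # chr(1) through chr(255); chr(0) is never handed out

    def __init__(self, name, store, session):
        # Use object.__setattr__ so internal fields skip our own __setattr__.
        object.__setattr__(self, "_name", name)
        object.__setattr__(self, "_store", store)
        object.__setattr__(self, "_space", session.setdefault(name, {}))

    def _short_key(self, attr, create):
        lookup = f"sesh:{self._name}:key:{attr}"  # hypothetical key layout
        short = self._store.get(lookup)
        if short is None and create:
            # The per-namespace counter decides which character comes next.
            count = self._store.incr(f"sesh:{self._name}:count")
            if count > self.MAX_KEYS:
                raise OverflowError("namespace has run out of one-character keys")
            short = chr(count)
            self._store.set(lookup, short)
        return short

    def __setattr__(self, attr, value):
        self._space[self._short_key(attr, create=True)] = value

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if attr.startswith("_"):
            raise AttributeError(attr)
        short = self._short_key(attr, create=False)
        if short is None or short not in self._space:
            raise AttributeError(attr)
        return self._space[short]


store = FakeRedis()
session = {}
ns = ShortKeyNamespace("promo", store, session)
ns.flagClickedPageviews = 3
ns.defaultPageviews = 10
# The session now holds the values under chr(1) and chr(2),
# while the verbose names live once in the shared store.
```

One difference worth noting: PHP's chr() returns a single byte, whereas Python's chr() returns a Unicode code point, so the byte-for-byte savings only carry over directly in the PHP version.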

So here are the theoretical savings: let's assume an average of eight characters for each namespace attribute. Replacing each with a single character saves seven characters. We'll say each namespace has about five attributes, and that maybe we'll be working with five namespaces on the entire site. That's just an assumption; there's no data to back this. We've saved a total of 7 * 5 * 5, or 175 characters per session. Each character costs at least one byte to store in RAM. At 500,000 users a day, each with their own session, that comes to about 87.5 million bytes, or roughly 85MB of RAM. If you're looking to run lean, there's definitely a reason to take the hour or so to implement this across your site.
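For anyone who wants to plug in their own numbers, the estimate is easy to reproduce. The sketch below is Python rather than PHP, and the function name estimate_savings is made up for this example; the inputs are the same guesses used above, not measurements.

```python
def estimate_savings(avg_name_len, attrs_per_namespace, namespaces, sessions):
    """Return (bytes saved per session, bytes saved across all sessions).

    Assumes one byte per character and that every name shrinks to one character.
    """
    saved_per_session = (avg_name_len - 1) * attrs_per_namespace * namespaces
    return saved_per_session, saved_per_session * sessions


per_session, total = estimate_savings(8, 5, 5, 500_000)
print(per_session)              # 175
print(total)                    # 87500000
print(total / 1_000_000)        # 87.5 (decimal megabytes)
print(round(total / 2**20, 1))  # 83.4 (binary megabytes)
```

Depending on whether you count a megabyte as 1,000,000 or 1,048,576 bytes, the figure lands a little above or a little below 85MB.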

Implementation
Here are some basic caveats before jumping into Sesh:
  • You will need to have Redis installed on your server.
  • You will need to be using Rediska and at least the Zend_Session code library to use this off the shelf.
  • By limiting everything to a single ASCII-extended character, the number of variables per namespace is theoretically limited to 255, with chr(0) being null. This has not been tested, however, so your feedback is totally welcome there.

Takeaways
Sesh is valuable mainly because it reduces your memory load when using sessions on your site. It does this by taking a lot of extra data that becomes distributed across your server for each user and reducing it to a single unique identifier. This identifier is quickly accessible in a single location. There are shortened 'tokens' in various reams of session data, but the more verbose 'type' is available in one place, as it were.

This is a natural extension of the strengths Redis and other fast key-value stores possess. Rediska in particular also has faculties for storing each user's session using Redis via its own save handler. By hashing to reduce memory, we are in some ways duplicating certain functions of a database, especially if we're using Rediska for sessions as well. However, because of the relative simplicity of its architecture, we are not bogged down by issues of indexing or primary keys. There are plenty of similarities between what we're doing and looking up a table row based on a unique string just so we can find its unique indexed ID for a separate query. That database lookup would be much slower, though.

A session namespace and its attributes are a good example of a data type where we are only looking up data from one 'column' at a time: first we get the session for that particular user, then we get the namespace from that particular session, and then, finally, we access the attribute of that namespace when reading or changing its stored variable. We never have to look up a user's session or a particular namespace based on the value of that variable. That's the first clue. For cases like this, when there are available optimizations in the form of data compression, abstracting another layer over key or value names may save you some memory in the long run. If you implement it right, you'll never notice it once it's in place.