Dynamic Data Science: Iterating Fast in the Cloud

When building data pipelines that are designed to analyze massive amounts of data, iterating is key. If running a process is measured in days, then a failed process or a buggy output can set you back without giving you much feedback to learn from. I've had to learn a lot along the way about devops approaches for cloud computing in the past couple of years. These have helped me iterate at a fast pace, while generating insights over millions of pages long-form user-generated content. Here are some methods that I've used at Wikia to successfully build out our parsing pipeline, language data extraction pipeline, and scientific computing models in the cloud, using Amazon Web Services.

Subdivide Data Processing Steps

When you've identified a portion of your pipeline that is time-intensive, the first step is the cordon it off as a bottleneck so that it can be treated separately. You can break steps apart easier if you use an extremely small data set compared to the size of the production data you're computing. Early on, you should only be concerned with managing the state of output as it goes from one state in the pipeline to the other.

Tool Manually First, Then Operationalize

In my opinion, a pipeline can be considered operationalized if you can turn on processing in one place and trust that each step will be handled automatically. Any data processing that you have to perform manually between start and finish is usually a step that you can script. Your initial goal should be to determine the minimum amount of configuration and logic needed for each machine at each step to accomplish the task at hand. For example, a pipeline-ready instance has all of its dependencies installed at or on startup, and can perform its processing step using a single script.

Use AMIs or Application Containers to Save on Configuration Time

A key component to Amazon's Elastic Compute Cloud service is the fact that you can preconfigure a machine and image it to your specifications. We deploy an image to mitigate high configuration time required for libraries like NumPy and SciPy. This configuration time translates into instance cycles spend doing something other than the task at hand -- computing your data. These images also usually also come preconfigured with a set of libraries we've written deployed to the home directory.

Chef or Docker are also suitable for maintaining your image configuration needs, but a pre-configured AMI can save you a few cycles if your use case remains stable. In particular, I use AMIs instead of these automation solutions because things like NumPy take so much time to install from a clean box. Even five minutes installation time can make an impact on the cost of computation in the long term when you're using massive horizontal scaling.

User Data Scripts over User Scripting

User data scripts are bash scripts that are run as root on an EC2 instance on instantiation. You can use these to control the environment that you've preconfigured remotely, once the machine is ready, instead of worrying about remote access. Ideally, you should never have to SSH into any node in your computing pipeline.

At Wikia, we use scripts that port data from within our network to the cloud. This script then activates a distributed cloud computing architecture built with general-purpose AMIs that allows us to react dynamically to the size of the data we need to compute. This allows us to use the same architecture for both event-driven analysis -- where we only care about data that has changed -- and full-scale analysis -- where we want to analyze tens of millions of documents, for example. The user scripts for these machines install the latest version of our software and then use that software to poll for new jobs to process.

Cache Aggressively

Avoid computing anything a second time by building up a strong caching layer. We use a caching layer that stores denormalized service responses in S3 for intermediary computations. We also have a scoped event model that allows us to identify when a single data point has changed and needs to be re-cached, and what computations that use that data point must be updated.

This is useful because many of our computations are too expensive to handle at request time. By denormalizing these immediately, we're paying a little bit in computing now so that client applications can be more consistent and reliable.

Another important secret to aggressive caching is only caching things that you're going to request more often than you're going to recompute. In other words, you'll be able to move faster if your use case can tolerate some staleness within your data. There are various solutions for each use case that strike their own balance between operational costs of on-demand computing instances and the value that timely data provides. For ours, we could tolerate the staleness, so we've save a bit of money by only recomputing things at certain intervals, despite how many times they may have changed.

Add Concurrency To Bottlenecks

Sometimes, this is as easy as using a threading or multiprocessing library within your language of choice. If you're already maxed on out RAM or CPU cores for a given machine, and you're using a top-of-the-line instance type, the next step is to request more instances. Using the knowledge you hopefully have from building out AMIs and firing off your processes with user data scripts, you should now have a solid idea of how to handle this issue. While not a necessity in many cases, consider using a concurrency framework to centrally manage your data processing from where the bottleneck has been detected.

I'd also suggest anticipating areas where there may be a need for scaling, and implementing queues early and often. This will make it easier for multiple instances to perform distributed computing without stepping on each other's toes. For post-hoc architecture, you can always use one machine to split up jobs between other machines it spawns off and monitors. This can be done easily by specifying what specific tasks or data points to work on using the user data scripts available in AWS.

Log Early, Log Often

Your processes should write to a log file, almost always in /var/log/. There are plenty of ways to consume these logs. For instance, one of the steps I use takes the logs that are written and uploads them to a log folder in S3. I've also written a few bash scripts that aggregate logs within a time window for the same logging script for multiple different instances. You can do a lot of cool stuff from your machine with just SSH.

So long as you're logging, I think that whatever simple solution you use to access your log data is better than not having feedback from your processes. There are also cloud logging services like Loggly that will handle this stuff for you. You could manage your own Logstash instance on EC2 -- but in order to do that securely, you're going to have to put everything behind a Virtual Private Cloud, which may be more trouble than it's worth.

Conclusion

Fast early iteration is the key to building a fast, maintainable, and extendible cloud architecture for large-scale processing and data science. These suggestions are pretty simple: focus on achieving the output you want; separate your steps into logical, self-contained components; operationalize those components so that they can be repeated easily; plan ahead to eliminate bottlenecks. But by following them, you put yourself in the position to deliver on the outputs you're trying to generate faster and more consistently. Timing is often everything when it comes to meeting business needs. As your data you're producing becomes increasingly interesting to more and more stakeholders (or your input data becomes larger), taking the steps above will help prevent future issues with scaling.