Archive for June, 2010

Why memcacheDB sucks (and it’s not just the server side)

Over the last week, at Delta Projects, I've been working on replacing MemcacheDB with Cassandra. We have run with MemcacheDB for a while, because we store data that we can't store client side and we need some way of storing this serverside. Since the data needs to be available really quickly and has to be persistent, MemcacheDB (a server that exposes more or less the same API as Memcache, but persists to disk) was put in place some time ago. This worked well for a while, but since we're expanding, the requirements changed for storing data server side. The initial requirements were pretty clear: we can get away with a key/value store, latency is important and the data should be persistent. However, since we're going to run with multiple data centers, the requirements have changed. The initial requirements are still valid, but when running over multiple locations, we would like to have all data available in all locations. Next to that, scaling MemcacheDB is hard. We ran with two MemcacheDB servers, where data was sharded, but the databases on both machines outgrew the physical memory in the servers, which caused a huge impact on performance. Since data is sharded and not replicated, you can't just pop in an extra box and lessen the load on the other servers. You have to rebalance manually. So we went on a quest for different server side storage solutions that would fulfill our requirements.

Long story short, we decided on Cassandra.

Cassandra is more than a key/value store. It's a distributed data store that uses Bigtable's ColumnFamily design, which will allow us to use our data differently than just storing and retrieving data by using keys; we can actually query the data. Anyway, the decision was made to move from MemcacheDB to Cassandra and we had to come up with a migration path. The path we envisioned would be:

  • Keep reading and writing from and to MemcacheDB but also write to Cassandra.
  • Stop writing to MemcacheDB and start reading from Cassandra first and have MemcacheDB as a fall-back if we can't read from Cassandra.
  • In the mean time copy data from MemcacheDB. Keys that are not in Cassandra will be copied, the others will be skipped
  • Disable reading from MemcacheDB.

This meant that we had to run with both Cassandra and MemcacheDB at the same time.

While investigating the use of Cassandra, we found out that the Ruby client for Cassandra doesn't deliver. Mainly because Cassandra uses Thrift and the Thrift client for Ruby isn't too good. However, Hector, a Java client for Cassandra, worked flawlessly. This, together with some other discoveries1 made us decide to move our application from MRI Ruby (the standard Ruby interpreter) to JRuby (a Ruby interpreter on the JVM). The nice thing about JRuby is that we were able to use best of both the Java and the Ruby world. Most Ruby code runs well in Ruby, but sometimes code has to be adjusted a little or sometimes it's better to use a JRuby version of a specific library; one that's more suited for running on the JVM.

So, we added Cassandra support to our application and moved from the standard Ruby memcache client to the JRuby version. The JRuby version is a small wrapper that exposes the same API as the Ruby version and wraps an existing Java Memcache client. However, after some testing, we found out that the way keys are distributed over the MemcacheDB servers is not the same in both implementations. Both implementations used different algorithms to determine which key should be placed on which server. The Java version provides the ability to choose between three different algorithms, while the Ruby version only provides one, but neither one is compatible with the others.

For getting our JRuby application to talk to our MemcacheDB servers, we had two options; patch the Java version so it behaved like the Ruby version, or just run with the Ruby version (JRuby is still Ruby, right?). Initially, we chose the latter option, which seemed to be the quickest fix. Just pop in the gem and you're done.

Right.

In the MRI Ruby version of our application, we used one MemcacheDB client per process (and then running with multiple processes), while in the JRuby version, we had to run with one client per Java thread. The standard Ruby version of the client is not very good with Java threads and apparently, TCP sockets are handled differently in Java then when using the MRI interpreter. Sockets would linger for a very long time and when the MemcacheDB server would be unresponsive for a while and we would try to reconnect, we would end up with a lot of established connections that were never going to be used, but also never closed. The JRuby version didn't seem to have all these problems and is (obviously) much better suited to run in the JVM; it uses nice socket pools and such.

Not to waste time, we started patching the JRuby client to mimic the behavior of it's Ruby counterpart, which worked out pretty well. We still have to test it a little bit more thoroughly, but the first tests look promising.

The last couple of days have been a head-breaking quest, but I think we've finally nailed it. We really need to move to Cassandra soon, since MemcacheDB is only going to give us a lot more headaches. At least I got to code some Java again and I learned quite a few things about MemcacheDB and how sharding actually works. I also learned that MemcacheDB is a bad choice. Implementing it is done fairly quickly (since you can just use a Memcache client), but it doesn't scale and apparently the sharding differs from one implementation to the other (what if you run multiple applications accessing MemcacheDB and all are written in different languages?). It's too bad that we spent a lot of time finding a solution for something that is going to be replaced very soon.

If you're writing a new application and need persistent cache or something other than a SQL database, check out alternatives like Cassandra or MongoDB.

1. We decided not to use Scribe for writing to HDFS, but rolled our own code using ActiveMQ, in-process.

Is it worth contributing to open source software?

Over the last couple of weeks, my colleagues and me at Delta Projects have been working on some extra functionality for Facebook's Scribe server. Scribe is a server that logs messages. It can do this to multiple different "stores" and does failover from one store to the other if one fails. In our case, we wanted Scribe to log messages from our ad servers to HDFS and in case of an HDFS failure, write to local disk and replay those logs, once HDFS becomes available again.
After some investigation, we noticed that Scribe was very static in the way file paths were constructed and we couldn't fulfill our requirements; we want to have our log files structured in a directory structure like /year/month/day/hour/adserver.log.
Here is where open source software is great: you have the code, so you can modify it yourself. So we did. We spent a couple of hours implementing dynamic file paths and then committed the feature to our own github fork of scribe. We posted the patch to the Scribe mailing list and spoke about it on IRC with some guys of Facebook and Twitter, but not much happened.
Another requirement is that we store our data compressed on HDFS, but Scribe doesn't come with compression. There is a patch for LZO from Twitter, but because of license issues, this patch will never make it into the main branch of Scribe. Because of this, we decided that we would either use FastLZ or BZip2 for our purposes. But again, Scribe has no built-in support for compression.
Since we already poked around in the Scribe code, we decided to implement compression ourselves. The problem with Scribe however is that it basically untested. There are a couple of tests written in PHP, but the don't cover the whole application. Next to that, the Scribe source (written in C++) is a mess. It's one big hack. No neat design patterns, a lot of dead code everywhere, classes with way too much responsibility, etc. So, for our implementation of compression, we figured that it would be nice not to hack it in there (even the LZO patch from Twitter seemed like a quick hack), but do it correctly, so that we wouldn't only introduce new functionality, but also clean up a lot and refractor ugly parts of the code. We extracted and abstracted a lot of code and put it in the right place (using a lot of practices that Uncle Bob preaches in "Clean Code"). And as icing on the cake, we threw in tests. We introduced Google Tests, a C++ testing framework and wrote tests for every piece of code that we wrote.
Then the day came that our work was finished, so we committed our changes to our github branch and posted our patch to the scribe mailing list. The patch was around 200k (including the tests), which is big and apparently, it scared the guys at Facebook, because after that, nothing happened. The only response was that the patch was too big and we we're asked if we could cut it up in smaller pieces, so reviewing the code would be easier. Unfortunately, we couldn't do this, since so many interweaved concepts were pulled apart and put into different places. We would have to retrace our steps and create a patch for every one of them, with having a working system after every step.
The end result is that we have a lot of (imho really nice) code, that probably nobody is going to use. I even think that we're not going to use it either, since it's in a branch of Scribe that is only maintained by us. If it would be in the main branch of Facebook's Scribe, the code would be maintained by "the community", but keeping it only in our branch will cause a divergence of our branch in the future and we don't want to maintain a spin-off of Scribe for ages to come.
I think it's great that companies like Facebook and Twitter open source a lot of software, but when there is no large community outside of those companies, adding features might not be worth it. Scribe only seems to be used (on a large scale) by Twitter and Facebook and whatever suits them will make it to the main branch. If only it was covered by tests, then the maintainers could have quickly verified our work, but now, there is no reason for Facebook to merge in our patch, so there is no need for allocating resources to review our changes. And again, forking the whole project and maintaining it isn't an option for us. We'll move on to other technology..