About a week ago, the Facebook Data team quietly released the Cassandra Project on Google Code. The Cassandra project has been described as a cross between Google's BigTable and Amazon's Dynamo storage systems. An overview of the project is available in the SIGMOD presentation on Cassandra at SlideShare. A summary of the salient aspects of the project follows.
The problem Cassandra is aimed at solving is one that plagues social networking sites or any other service that has lots of relationships between users and their data. In such services, data often needs to be denormalized to avoid doing lots of joins when performing queries. However, this means the system needs to deal with the increased write traffic that denormalization brings. At this point, if you're using a relational database, you realize you're pretty much breaking every major rule of relational database design. Google tackled this problem by coming up with BigTable. Facebook has followed their lead by developing Cassandra, which they admit is inspired by BigTable.
The Cassandra data model is fairly straightforward. The entire system is a giant table with lots of rows. Each row is identified by a unique key. Each row has a column family, which can be thought of as the schema for the row. A column family can contain thousands of columns, which are tuples of {name, value, timestamp}, and/or super columns, which are tuples of {name, column+} where column+ means one or more columns. This is very similar to the data model behind Google's BigTable.
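To make that concrete, here's a rough sketch in Python of what a single row might look like under this model. The row key, column family names, values and nesting are purely illustrative on my part and aren't Cassandra's actual API or storage format.

```python
# Rough Python sketch of the data model described above; all names and
# values are made up for illustration, not Cassandra's actual API.

user_row = {
    "row_key": "user:12345",
    "column_families": {
        # A simple column family: columns are {name, value, timestamp} tuples.
        "profile": [
            {"name": "display_name", "value": "Alice", "timestamp": 1212000000},
            {"name": "location", "value": "Palo Alto", "timestamp": 1212000001},
        ],
        # A column family holding super columns: each super column is a name
        # plus one or more regular columns.
        "friends": [
            {
                "name": "user:67890",  # super column name
                "columns": [
                    {"name": "since", "value": "2007-08-01", "timestamp": 1212000002},
                    {"name": "note", "value": "college roommate", "timestamp": 1212000003},
                ],
            },
        ],
    },
}
```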
As I mentioned earlier, denormalized data means you have to be able to handle a lot more writes than you would if storing data in a normalized relational database. Cassandra has several optimizations to make writes cheaper. When a write operation occurs, it doesn't immediately cause a write to the disk. Instead the record is updated in memory and the write operation is added to the commit log. Periodically the list of pending writes is processed and write operations are flushed to disk. As part of the flushing process the set of pending writes is analyzed and redundant writes are eliminated. Additionally, the writes are sorted so that the disk is written to sequentially, which significantly reduces seek overhead and the impact of random writes on the system. How important is avoiding disk seeks? It can make the difference between taking hours versus days to flush a hundred gigabytes of writes to disk. Disk is the new tape.
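Here's a toy sketch in Python of the kind of write path described above: each write appends to a commit log and updates an in-memory table, and a periodic flush keeps only the newest write per column and writes everything out in sorted order so the disk is touched sequentially. The file names and structures are made up for illustration and don't reflect Cassandra's actual implementation.

```python
import json

# In-memory table of pending writes: (row_key, column_name) -> (value, timestamp)
memtable = {}

def write(row_key, column_name, value, timestamp, log_path="commit.log"):
    # 1. Durably record the intent in an append-only commit log (cheap, sequential).
    with open(log_path, "a") as log:
        log.write(json.dumps([row_key, column_name, value, timestamp]) + "\n")
    # 2. Update the in-memory table, keeping only the newest write per column.
    key = (row_key, column_name)
    if key not in memtable or memtable[key][1] < timestamp:
        memtable[key] = (value, timestamp)

def flush(data_path="sstable.json"):
    # Sort pending writes by key so the data file is written sequentially,
    # then persist them and clear the in-memory table.
    rows = sorted(memtable.items())
    with open(data_path, "w") as out:
        for (row_key, column_name), (value, timestamp) in rows:
            out.write(json.dumps([row_key, column_name, value, timestamp]) + "\n")
    memtable.clear()
```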
Cassandra is described as "always writable", which means that a write operation always returns success even if there is an internal failure in the system. This is similar to the model exposed by Amazon's Dynamo, which has an eventual consistency model. From what I've read, it isn't clear how write operations that occur during an internal failure are reconciled and exposed to users of the system. I'm sure someone with more knowledge can chime in in the comments.
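For what it's worth, one common way systems like this reconcile divergent replicas is last-write-wins using the per-column timestamps mentioned above. The sketch below illustrates that idea in Python, but it's an assumption on my part rather than a description of what Cassandra actually does.

```python
# Illustrative last-write-wins reconciliation; an assumed approach, not
# necessarily Cassandra's actual mechanism. Each replica's view of a row is
# a dict: column_name -> (value, timestamp).

def reconcile(replica_a, replica_b):
    """Merge two replicas' views of a row, column by column; newest timestamp wins."""
    merged = dict(replica_a)
    for name, (value, timestamp) in replica_b.items():
        if name not in merged or merged[name][1] < timestamp:
            merged[name] = (value, timestamp)
    return merged

# Example: two replicas saw different writes while one was unreachable.
a = {"location": ("Palo Alto", 1212000001)}
b = {"location": ("Menlo Park", 1212000005), "note": ("moved", 1212000006)}
print(reconcile(a, b))  # the newer "Menlo Park" write and the extra column survive
```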
At first glance, this is a very nice addition to the world of Open Source software by the Facebook team. Kudos.
Found via James Hamilton.
PS: Is it me or is this the second significant instance of Facebook Open Sourcing a key infrastructure component "inspired" by Google internals?
Now Playing: Ray J - Gifts