Web 2.0 and Databases Part 1: Second Life: Like everybody else, we started with "One Database, All Hail The Central Database," and have subsequently been forced into clustering. However, we've eschewed all of the general-purpose cluster technologies (MySQL Cluster, various replication schemes) in favor of explicit data partitioning. So, we still have a central db that keeps track of where to find what data (per-user, for instance), and N additional dbs that do the heavy lifting. Our feeling is that this is ultimately far more scalable than black-box clustering.
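The quote gives no schema, but the architecture it describes -- a central directory database plus N partitioned databases doing the heavy lifting -- looks roughly like this minimal sketch. sqlite3 in-memory databases stand in for the real servers, and the table names are hypothetical:

```python
# Directory-based partitioning: a central db records which shard holds each
# user's data, and queries are routed to that shard.
import sqlite3

central = sqlite3.connect(":memory:")                        # "where is what data"
shards = {n: sqlite3.connect(":memory:") for n in range(4)}  # N heavy-lifting dbs

central.execute("CREATE TABLE user_shard (user_id INTEGER PRIMARY KEY, shard_id INTEGER)")
for shard in shards.values():
    shard.execute("CREATE TABLE inventory (user_id INTEGER, item TEXT)")

def assign_user(user_id):
    """Place a new user on a shard and record the mapping centrally."""
    shard_id = user_id % len(shards)          # any placement policy works here
    central.execute("INSERT INTO user_shard VALUES (?, ?)", (user_id, shard_id))
    return shard_id

def shard_for(user_id):
    """Look up the shard in the central directory, then talk to it directly."""
    row = central.execute(
        "SELECT shard_id FROM user_shard WHERE user_id = ?", (user_id,)).fetchone()
    return shards[row[0]]

assign_user(42)
shard_for(42).execute("INSERT INTO inventory VALUES (42, 'sword')")
print(shard_for(42).execute("SELECT item FROM inventory WHERE user_id = 42").fetchall())
```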
Database War Stories #2: bloglines and memeorandum: Bloglines has several data stores, only a couple of which are managed by "traditional" database tools (which in our case is Sleepycat). User information, including email address, password, and subscription data, is stored in one database. Feed information, including the name of the feed, the description of the feed, and the various URLs associated with the feed, is stored in another database. The vast majority of the data within Bloglines, however (the 1.4 billion blog posts we've archived since we went online), is stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines, somewhat like the system outlined in the Google File System paper, but much more specific to just our application. To round things out, we make extensive use of memcached to keep as much data in memory as possible and keep performance snappy.
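The memcached usage Bloglines describes is the standard cache-aside pattern; here is a minimal sketch, assuming a memcached daemon on localhost and the python-memcached client. The load_post_from_archive() helper is a hypothetical stand-in for their flat-file archive:

```python
# Cache-aside: check memory first, fall back to the slower store on a miss,
# then populate the cache so the next read is served from memory.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def load_post_from_archive(post_id):
    # placeholder for a read from the replicated flat-file post archive
    return {"id": post_id, "body": "..."}

def get_post(post_id, ttl=300):
    key = "post:%d" % post_id
    post = mc.get(key)
    if post is None:                  # cache miss: hit the backing store
        post = load_post_from_archive(post_id)
        mc.set(key, post, time=ttl)   # keep it in memory for subsequent reads
    return post
```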
Database War Stories #3: Flickr: tags are an interesting one. lots of the 'web 2.0' feature set doesn't fit well with traditional normalised db schema design. denormalization (or heavy caching) is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags. you can cache stuff that's slow to generate, but if it's so expensive to generate that you can't ever regenerate that view without pegging a whole database server, then it's not going to work (or you need dedicated servers to generate those views - some of our data views are calculated offline by dedicated processing clusters which save the results into mysql).
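A minimal sketch of the denormalization idea: keep a per-tag counter updated at write time, so the tag cloud is a single cheap read rather than a count over hundreds of millions of tagging rows at page-view time. sqlite3 stands in for MySQL here, and the schema is made up:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photo_tags (photo_id INTEGER, tag TEXT)")       # normalized facts
db.execute("CREATE TABLE tag_counts (tag TEXT PRIMARY KEY, n INTEGER)")  # denormalized view

def add_tag(photo_id, tag):
    db.execute("INSERT INTO photo_tags VALUES (?, ?)", (photo_id, tag))
    # keep the denormalized counter in step with every write
    db.execute("INSERT OR IGNORE INTO tag_counts VALUES (?, 0)", (tag,))
    db.execute("UPDATE tag_counts SET n = n + 1 WHERE tag = ?", (tag,))

def tag_cloud(limit=150):
    # the expensive aggregation never happens at read time
    return db.execute(
        "SELECT tag, n FROM tag_counts ORDER BY n DESC LIMIT ?", (limit,)).fetchall()

add_tag(1, "sunset"); add_tag(2, "sunset"); add_tag(2, "beach")
print(tag_cloud())
```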
Database War Stories #4: NASA World Wind: Flat files are used for quick response on the client side, while on the server side, SQL databases store imagery (and, soon to come, vector files). However, he admits that "using file stores, especially when a large number of files are present (millions), has proven to be fairly inconsistent across multiple OS and hardware platforms."
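A generic sketch of a client-side flat-file cache of the kind described here -- not World Wind's actual tile layout or protocol; the directory scheme and fetch_from_server() are hypothetical:

```python
# Imagery fetched from the server is written to local flat files, so later
# reads are served straight from disk without touching the network.
import os

CACHE_DIR = os.path.expanduser("~/.tile_cache")

def fetch_from_server(layer, level, x, y):
    # placeholder for an HTTP request to the server's SQL-backed imagery store
    return b"...jpeg bytes..."

def get_tile(layer, level, x, y):
    path = os.path.join(CACHE_DIR, layer, str(level), "%d_%d.jpg" % (x, y))
    if os.path.exists(path):                      # fast path: local flat file
        with open(path, "rb") as f:
            return f.read()
    data = fetch_from_server(layer, level, x, y)  # slow path: ask the server
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return data
```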
Database War Stories #5: craigslist: databases are good at doing some of the heavy lifting ("go sort this," "give me some of that"), but if your database gets hot you are in a world of trouble, so make sure you can cache stuff up front. Protect your db! You can only go so deep with a master -> slave configuration; at some point you're gonna need to break your data over several clusters. Craigslist will do this with our classified data sometime this year. Do not expect full-text indexing to work on a very large table.
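A minimal sketch of that advice -- a cache in front to protect the database, all writes going to the master, reads spread over the slaves. sqlite3 in-memory databases stand in for the MySQL master and replicas so the sketch runs anywhere, and the table is made up:

```python
import random
import sqlite3

master = sqlite3.connect(":memory:")
slaves = [sqlite3.connect(":memory:") for _ in range(2)]   # pretend replicas
cache = {}                                                 # stand-in for memcached

for c in [master] + slaves:
    c.execute("CREATE TABLE posting (id INTEGER PRIMARY KEY, title TEXT)")

def write(sql, args=()):
    master.execute(sql, args)        # writes only ever touch the master
    master.commit()
    for s in slaves:                 # stand-in for MySQL replication
        s.execute(sql, args)
        s.commit()

def read(sql, args=()):
    key = (sql, args)
    if key in cache:                 # protect the db: answer hot reads from memory
        return cache[key]
    rows = random.choice(slaves).execute(sql, args).fetchall()
    cache[key] = rows                # a real cache would also expire entries
    return rows

write("INSERT INTO posting VALUES (?, ?)", (1, "free sofa"))
print(read("SELECT title FROM posting WHERE id = ?", (1,)))
```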
Database War Stories #6: O'Reilly Research: The lessons:
- Pay attention to how data is organized: it matters for addressing performance issues, making the data understandable, making queries reliable (i.e., getting consistent results), and identifying data quality issues.
- When you have a lot of data, partitioning, usually by time, can make the data usable. Be thoughtful about your partitions; you may find it best to make asymmetrical partitions that reflect how users most often access the data. Also, if you don't write automated scripts to maintain your partitions, performance can deteriorate over time (see the sketch below).
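A minimal sketch of that partitioning lesson, under assumed table names: recent, heavily queried data gets fine-grained monthly partitions, older data is kept in coarser yearly ones, and a scheduled maintenance job pre-creates and prunes partitions. sqlite3 stands in for the real warehouse database:

```python
import datetime
import sqlite3

def partition_for(ts, now=None):
    """Route a row's timestamp to a partition name (asymmetric by age)."""
    now = now or datetime.date.today()
    if (now - ts.date()).days <= 365:
        return "events_%04d_%02d" % (ts.year, ts.month)   # hot data: monthly
    return "events_%04d" % ts.year                        # cold data: yearly

def maintain_partitions(conn, keep_years=5, now=None):
    """The automated script: create the current partition, prune expired ones."""
    now = now or datetime.date.today()
    current = partition_for(datetime.datetime(now.year, now.month, 1), now)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (ts TEXT, payload TEXT)" % current)
    conn.execute("DROP TABLE IF EXISTS events_%04d" % (now.year - keep_years))

conn = sqlite3.connect(":memory:")
maintain_partitions(conn)
print(partition_for(datetime.datetime(2005, 7, 14)))   # lands in a yearly bucket
```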
Database War Stories #7: Google File System and BigTable: Jeff wrote back briefly about BigTable: "Interesting discussion. I
don't have much to add. I've been working with a number of other people
here at Google on building a large-scale storage system for structured
and semi-structured data called BigTable. It's designed to scale to
hundreds or thousands of machines, and to make it easy to add more
machines to the system and have it automatically start taking advantage of those
resources without any reconfiguration. We don't have anything published
about it yet, but there's a public talk about BigTable that I gave at
the University of Washington last November, available on the web (try some searches for bigtable or view the talk)."
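Nothing was published at the time, but the public talk Jeff mentions (and the later BigTable paper) describe the data model as a sparse, sorted map from (row key, column, timestamp) to an uninterpreted value. This toy in-memory class is only meant to show that shape, not Google's implementation or API; the row key and column name below are made up:

```python
class ToyBigTable:
    """Toy model of a sorted map: (row, column, timestamp) -> value."""
    def __init__(self):
        self._cells = {}                      # (row, column) -> {timestamp: value}

    def put(self, row, column, value, timestamp):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        versions = self._cells.get((row, column), {})
        return versions[max(versions)] if versions else None   # newest version

    def scan(self, row_prefix):
        # rows are kept in sorted order, so prefix/range scans are natural
        for row, column in sorted(self._cells):
            if row.startswith(row_prefix):
                yield row, column, self.get(row, column)

t = ToyBigTable()
t.put("com.example/index", "contents:html", "<html>...</html>", timestamp=1)
print(list(t.scan("com.example")))
```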
Database War Stories #8: Findory and Amazon: On Findory, our traffic and crawl are much smaller than those of sites like Bloglines, but, even at our size, the system needs to be carefully architected to rapidly serve up fully personalized pages for each user, pages that change immediately after each new article is read.
Our read-only databases are flat files -- Berkeley DB to be specific -- and are replicated out to our webservers using our own replication management tools. This strategy gives us extremely fast access from the local filesystem. We make thousands of random accesses to this read-only data on each page serve; Berkeley DB offers the performance necessary to still serve our personalized pages rapidly under this load.
Our much smaller read-write data set, which includes information like each user's reading history, is stored in MySQL. MySQL MyISAM works very well for this type of non-critical data since speed is the primary concern and more sophisticated transactional support is not important.
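A minimal sketch of the read-only Berkeley DB pattern described here: the database is just a local file pushed out to every webserver, so each of the thousands of per-page lookups is a cheap local read with no network hop. This assumes the bsddb3 Python bindings; the file path and key scheme are made up:

```python
from bsddb3 import db

articles = db.DB()
# open the replicated local file read-only; args are filename, dbname, type, flags
articles.open("/var/data/articles.bdb", None, db.DB_BTREE, db.DB_RDONLY)

def get_article(article_id):
    """One of the many random reads made while assembling a personalized page."""
    return articles.get(("article:%d" % article_id).encode())
```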
Database War Stories #9 (finis): Brian Aker of MySQL Responds: Brian Aker of MySQL sent me a few email comments about this whole "war stories" thread, which I reproduce here. Highlight -- he says: "Reading through the comments you got on your blog entry, these users are hitting on the same design patterns. There are very common design patterns for how to scale a database, and few sites really turn out to be all that original. Everyone arrives at certain truths: flat files with multiple dimensions don't scale, you will need to partition your data in some manner, and in the end caching is a requirement."
I agree about the common design patterns, but I didn't hear that flat files don't scale. What I heard is that some very big sites are saying that traditional databases don't scale, and that the evolution isn't from flat files to SQL databases, but from flat files to sophisticated custom file systems. Brian acknowledges that SQL vendors haven't solved the problem, but doesn't seem to think that anyone else has either.