These are my notes from the keynote session MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets by Jeff Dean.
The talk was about the three pillars of Google's data storage and processing platform: GFS, BigTable, and MapReduce.
The developers at Google decided to build their own custom distributed file system because they felt that they had unique requirements. These requirements included running on large clusters of inexpensive commodity hardware where component failures are the norm rather than the exception, storing a modest number of huge multi-gigabyte files, workloads dominated by large streaming reads and sequential appends, and favoring high sustained bandwidth over low latency.
One benefit the developers of GFS had was that since it was an in-house application they could control the environment, the client applications and the libraries a lot better than in the off-the-shelf case.
There are two server types in the GFS system: master servers, which hold the file system metadata and keep track of which chunk servers hold each chunk of a file, and chunk servers, which store the actual file data as chunks (replicated across machines for fault tolerance) on their local disks.
There are currently over 200 GFS clusters at Google, some of which have over 5000 machines. They now have pools of tens of thousands of machines retrieving data from GFS clusters that run as large as 5 petabytes of storage with read/write throughput of over 40 gigabytes/second across the cluster.
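As a rough illustration of how the two server types cooperate, here is a minimal Python sketch of the GFS read path as described in the GFS paper: the client asks the master which chunk servers hold the relevant chunk, then fetches the bytes directly from one of them, keeping the master off the data path. The class and function names are hypothetical, not Google's actual client library, and the sketch ignores reads that span chunk boundaries.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS stores files as 64 MB chunks

class Master:
    """Holds only metadata: file name -> list of (chunk id, replica chunk servers)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table  # {filename: [(chunk_id, [chunk servers]), ...]}

    def locate(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        chunk_id, replicas = self.chunk_table[filename][chunk_index]
        return chunk_id, replicas

class ChunkServer:
    """Stores the actual chunk data on local disk (here, just a dict)."""
    def __init__(self, chunks):
        self.chunks = chunks  # {chunk_id: bytes}

    def read(self, chunk_id, start, length):
        return self.chunks[chunk_id][start:start + length]

def gfs_read(master, filename, offset, length):
    # 1. Ask the master for metadata only...
    chunk_id, replicas = master.locate(filename, offset)
    # 2. ...then stream the bytes straight from a chunk server replica.
    return replicas[0].read(chunk_id, offset % CHUNK_SIZE, length)

# Tiny illustrative setup: one file made of a single chunk on one chunk server.
server = ChunkServer({"chunk-0": b"<html>example page</html>"})
master = Master({"crawl/part-00001": [("chunk-0", [server])]})
print(gfs_read(master, "crawl/part-00001", 0, 6))  # b'<html>'
```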
At Google they do a lot of processing of very large amounts of data. In the old days, developers would have to write their own code to partition the large data sets, write checkpointing code to save intermediate results, handle failover when servers crashed, and so on, in addition to writing the business logic for the data processing they actually wanted to do, which could be something straightforward like counting the occurrence of words in various Web pages or grouping documents by content checksums. To reduce this duplication of effort and the complexity of performing data processing tasks, the decision was made to build a platform technology that everyone at Google could use and that handled all the generic tasks of working on very large data sets. So MapReduce was born.
MapReduce is an application programming interface for processing very large data sets. Application developers feed in a set of key/value pairs (e.g. {URL, HTML content} pairs), then use the map function to extract relevant information from each record, which produces a set of intermediate key/value pairs (e.g. a {word, 1} pair for each time a word is encountered), and finally the reduce function merges the intermediate values associated with the same key to produce the final output (e.g. {word, total count of occurrences} pairs).
A developer only has to write the map and reduce operations specific to their data set, which can be as little as 25 - 50 lines of code, while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce the I/O bandwidth consumed, providing system monitoring, and making the service scalable across hundreds to thousands of machines.
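To give a feel for how little a developer actually writes, here is a minimal single-machine Python sketch of the word-count example from the talk. The function names are just illustrative; the real system runs the same pair of user-supplied functions in parallel across thousands of machines.

```python
from collections import defaultdict

def map_fn(url, html_content):
    """Emit an intermediate {word, 1} pair for every word in the document."""
    for word in html_content.split():
        yield word, 1

def reduce_fn(word, counts):
    """Merge all intermediate values for the same key into a final count."""
    return word, sum(counts)

def map_reduce(records):
    intermediate = defaultdict(list)
    for key, value in records:                  # the input {URL, HTML content} pairs
        for word, count in map_fn(key, value):
            intermediate[word].append(count)    # group intermediate pairs by key
    return dict(reduce_fn(w, c) for w, c in intermediate.items())

print(map_reduce([("http://example.com", "to be or not to be")]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```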
Currently, almost every major product at Google uses MapReduce in some way. There are 6,000 MapReduce applications checked into the Google source tree, with hundreds of new applications that utilize it being written every month. To illustrate its ease of use, a graph of new MapReduce applications checked into the Google source tree over time shows a spike every summer as interns show up and create a flood of new MapReduce applications.
There are three server types in the MapReduce system: a master that assigns map and reduce tasks to worker machines and tracks their progress, map servers that read their slice of the input data and apply the user's map function to produce intermediate key/value pairs, and reduce servers that fetch the intermediate data, sort it by key, and apply the user's reduce function to produce the final output.
One of the main issues they have to deal with in the MapReduce system is the problem of stragglers. Stragglers are servers that run slower than expected for one reason or another. Sometimes stragglers are caused by hardware issues (e.g. a bad hard drive controller reduces I/O throughput) and sometimes a server is simply running too many complex jobs and has no spare CPU. To counter the effects of stragglers, they now assign the same task to multiple servers, which counterintuitively ends up making jobs finish quicker. Another clever optimization is that all data transferred between map and reduce servers is compressed; since the servers usually aren't CPU bound, the compression/decompression cost is a small price to pay for the bandwidth and I/O savings.
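A rough Python sketch of that backup-task idea, using a hypothetical run_task stand-in for a map or reduce task: the same task is handed to several workers and whichever copy finishes first wins, so a single straggler cannot hold up the whole job.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import random
import time

def run_task(task_id, worker_id):
    """Stand-in for a map or reduce task; occasionally a worker is a straggler."""
    time.sleep(random.choice([0.1, 0.1, 2.0]))  # most workers are fast, some are slow
    return f"task {task_id} finished by worker {worker_id}"

def run_with_backups(task_id, num_copies=3):
    """Schedule the same task on several workers and keep the fastest result."""
    with ThreadPoolExecutor(max_workers=num_copies) as pool:
        futures = [pool.submit(run_task, task_id, w) for w in range(num_copies)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()  # take the first copy to finish, ignore the rest

print(run_with_backups(42))
```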
After the creation of GFS, the need for structured and semi-structured storage that went beyond opaque files became clear. Examples of situations that could benefit from this included per-URL data such as crawl metadata and links, per-user data such as preferences and recent queries, and geographic data such as satellite imagery.
The system they needed would have to scale to storing billions of URLs, hundreds of terabytes of satellite imagery, preference data associated with hundreds of millions of users, and more. It was immediately obvious that this wasn't a task for an off-the-shelf commercial database system, both because of the scale requirements and because such a system would be prohibitively expensive even if it did exist. In addition, an off-the-shelf system would not be able to make optimizations based on the underlying GFS file system. Thus BigTable was born.
BigTable is not a relational database. It does not support joins, nor does it support rich SQL-like queries. Instead it is more like a multi-level map data structure. It is a large-scale, fault-tolerant, self-managing system with terabytes of memory and petabytes of storage space which can handle millions of reads/writes per second. BigTable is now used by over sixty Google products and projects as the platform for storing and retrieving structured data.
The BigTable data model is fairly straightforward: each data item is stored in a cell which can be accessed using its {row key, column key, timestamp}. The need for a timestamp came about because many Google services store and compare versions of the same data over time (e.g. the HTML content of a URL). The data for each row is stored in one or more tablets, which are actually sequences of 64KB blocks in a data format called SSTable.
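Since BigTable behaves like a multi-level map rather than a relational table, a toy Python sketch of the data model might look like the following. The class is purely illustrative and ignores tablets, SSTables, and persistence; it only shows cells addressed by {row key, column key, timestamp}, with reads returning the most recent version at or before a given timestamp.

```python
import time
from collections import defaultdict

class ToyBigTable:
    """Cells addressed by {row key, column key, timestamp}, newest version first."""
    def __init__(self):
        # row key -> column key -> list of (timestamp, value), kept newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        cell = self.rows[row][column]
        cell.append((ts, value))
        cell.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        for ts, value in self.rows[row][column]:
            if timestamp is None or ts <= timestamp:
                return value  # most recent version at or before the timestamp
        return None

table = ToyBigTable()
table.put("com.cnn.www", "contents", "<html>v1</html>", timestamp=1)
table.put("com.cnn.www", "contents", "<html>v2</html>", timestamp=2)
print(table.get("com.cnn.www", "contents"))               # newest version: v2
print(table.get("com.cnn.www", "contents", timestamp=1))  # older version: v1
```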
There are three primary server types of interest in the BigTable system: master servers, which assign tablets to tablet servers and keep track of where they are stored; tablet servers, which handle read and write requests for the tablets they serve and split tablets that grow too large; and lock servers, which provide a distributed lock service used for tasks such as master election and access control.
There are a number of optimizations which applications can take advantage of in BigTable. One example is the concept of locality groups. For example, some of the simple metadata associated with a particular URL that is typically accessed together (e.g. language, PageRank™, etc.) can be physically stored together by placing those columns in one locality group, while other columns (e.g. the page contents) are placed in a separate locality group. In addition, recent writes to a tablet are kept in an in-memory table until the machine is running out of memory, at which point the in-memory table's data is written to GFS as an SSTable and a new in-memory table is created. This process is called compaction. There are other types of compaction in which in-memory tables are merged with SSTables on disk to create an entirely new SSTable which is then stored in GFS.
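Here is a rough Python sketch of that write path and compaction idea, with made-up names and thresholds: writes go into an in-memory table until it gets too big, at which point it is frozen, written out as an immutable sorted run (standing in for an SSTable on GFS), and a fresh in-memory table is started. This is an illustration of the technique, not Google's actual code.

```python
class ToyTablet:
    """Illustrative memtable-plus-SSTable write path for a single tablet."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}              # recent writes, held in memory
        self.sstables = []              # immutable sorted runs "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._compaction()

    def _compaction(self):
        # Freeze the in-memory table as a sorted, immutable run and start a new one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                # check the newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):     # then newer runs before older ones
            for k, v in run:
                if k == key:
                    return v
        return None

t = ToyTablet()
for i in range(6):
    t.write(f"row{i}", f"value{i}")
print(len(t.sstables), t.read("row0"), t.read("row5"))  # one run flushed; row5 still in memory
```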
Although Google's infrastructure works well at the single-cluster level, there are a number of areas with room for improvement when operating across multiple clusters and data centers, including a single global namespace for all of their data, more automated migration and replication of data between data centers, and dealing with the consistency issues that wide-area replication brings.
[The conference was part recruiting event so some of the speakers ended their talks with a recruiting spiel - Dare]
Having access to lots of data and computing power is a geek playground. You can build cool, seemingly trivial apps on top of the data which turn out to be really useful, such as Google Trends or catching misspellings of "britney spears". Another example of the kind of app you can build when you have enough data is treating language translation as a statistical modeling problem, which turns out to be one of the most successful approaches to machine translation around.
Google hires smart people and lets them work in small teams of 3 to 5 people. They can get away with teams that small because they have the benefit of an infrastructure that takes care of all the hard problems, so developers can focus on building interesting, innovative apps.