These are my notes from the keynote session MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets by Jeff Dean.
The talk was about the three pillars of Google's data storage and processing platform: GFS, BigTable and MapReduce.
GFS
The developers at Google decided to build their own custom distributed
file system because they felt that they had unique requirements. These
requirements included:
- scalability to thousands of network nodes
- massive read/write bandwidth requirements
- the ability to handle large blocks of data which are gigabytes in size
- extremely efficient distribution of operations across nodes to reduce bottlenecks
One benefit the developers of GFS had was that, since it was an in-house application, they could control the environment, the client applications and the libraries far better than they could with an off-the-shelf system.
GFS Server Architecture
There are two server types in the GFS system.
Master servers
These keep the metadata on the various data files (stored in 64MB chunks) within the file system. Client applications talk to the master servers to perform metadata operations on files or to locate the chunk server that holds the actual bits on disk.
Chunk servers
These hold the actual bits on disk and can be considered to be dumb file servers. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Client applications retrieve data directly from chunk servers once a master server has directed them to the chunk server that contains the chunk they want.
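To make that division of labor concrete, here is a rough Python sketch (my own simplification, not the actual GFS client library or protocol) of how a read proceeds: only metadata flows through the master, while the bytes come straight from one of the replica chunk servers.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS stores files as 64MB chunks

class Master:
    """Holds only metadata: file name -> list of (chunk id, replica locations)."""
    def __init__(self):
        self.chunk_table = {}

    def locate(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        chunk_id, replicas = self.chunk_table[filename][chunk_index]
        return chunk_id, replicas

class ChunkServer:
    """Dumb storage of chunk bytes on local disk."""
    def __init__(self):
        self.chunks = {}  # chunk id -> bytes

    def read(self, chunk_id, start, length):
        return self.chunks[chunk_id][start:start + length]

def read_file(master, chunk_servers, filename, offset, length):
    chunk_id, replicas = master.locate(filename, offset)      # metadata RPC to the master
    server = chunk_servers[replicas[0]]                        # pick any of the 3 replicas
    return server.read(chunk_id, offset % CHUNK_SIZE, length)  # data flows from the chunk server only

# Tiny illustrative setup (all names made up)
master = Master()
master.chunk_table["/logs/web.0"] = [("chunk-17", ["cs1", "cs2", "cs3"])]
servers = {"cs1": ChunkServer(), "cs2": ChunkServer(), "cs3": ChunkServer()}
servers["cs1"].chunks["chunk-17"] = b"hello world" + b"\0" * 100
print(read_file(master, servers, "/logs/web.0", 0, 11))  # b'hello world'
```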
There are currently over 200 GFS clusters at Google, some of which have over 5000 machines. They now have pools of tens of thousands of machines retrieving data from GFS clusters that hold as much as 5 petabytes of storage, with read/write throughput of over 40 gigabytes/second across the cluster.
MapReduce
At Google they do a lot of processing of very large amounts of data. In the old days, developers would have to write their own code to partition the large data sets, checkpoint code and save intermediate results, handle failover in case of server crashes, and so on, in addition to writing the business logic for the actual data processing they wanted to do, which could be something straightforward like counting the occurrence of words in various Web pages or grouping documents by content checksums. The decision was made to reduce the duplication of effort and complexity of performing data processing tasks by building a platform technology that everyone at Google could use and which handled all the generic tasks of working on very large data sets. So MapReduce was born.
MapReduce is an application programming interface for processing very large data sets. Application developers feed in key/value pairs (e.g. a {URL, HTML content} pair), then the map function extracts relevant information from each record and produces a set of intermediate key/value pairs (e.g. a {word, 1} pair each time a word is encountered), and finally the reduce function merges the intermediate values associated with the same key to produce the final output (e.g. {word, total count of occurrences} pairs).
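Here is a rough Python sketch of that word counting example. The real interface is a C++ library and the framework, not the user, drives the shuffle between the two phases, so treat the run_mapreduce driver below as purely illustrative.

```python
from collections import defaultdict

def map_fn(url, html_content):
    """Emit an intermediate {word, 1} pair for every word in the page."""
    for word in html_content.split():
        yield word, 1

def reduce_fn(word, counts):
    """Merge all intermediate values that share the same key."""
    return word, sum(counts)

def run_mapreduce(records):
    intermediate = defaultdict(list)
    for url, content in records:                 # map phase
        for word, one in map_fn(url, content):
            intermediate[word].append(one)
    return dict(reduce_fn(w, c) for w, c in intermediate.items())  # reduce phase

print(run_mapreduce([("http://a", "the cat sat"), ("http://b", "the dog")]))
# {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}
```

The map and reduce functions are the only pieces a developer would actually write; everything else here stands in for the infrastructure described below.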
A developer only has to write the map and reduce operations specific to their data sets, which can be as little as 25 to 50 lines of code, while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce I/O bandwidth consumed, providing system monitoring, and making the service scalable across hundreds to thousands of machines.
Currently, almost every major product at Google uses MapReduce in some way. There are 6,000 MapReduce applications checked into the Google source tree, with hundreds of new applications that utilize it being written each month. To illustrate its ease of use, a graph of new MapReduce applications checked into the Google source tree over time shows a spike every summer as interns show up and create a flood of new MapReduce applications.
MapReduce Server Architecture
There are three server types in the MapReduce system.
Master server
This assigns user tasks to map and reduce servers and keeps track of the state of these tasks.
Map Servers
Accept user input, perform the map operation on it, and write the results to intermediate files.
Reduce Servers
Accept the intermediate files produced by map servers and perform the reduce operation on them.
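As a rough illustration of the master's bookkeeping (my own simplification, not Google's actual scheduler), conceptually it just hands out idle map tasks first, then hands out reduce tasks once every map task has written its intermediate files:

```python
IDLE, IN_PROGRESS, DONE = "idle", "in_progress", "done"

class MapReduceMaster:
    def __init__(self, input_splits, num_reducers):
        self.map_tasks = {i: IDLE for i in range(len(input_splits))}
        self.reduce_tasks = {i: IDLE for i in range(num_reducers)}

    def next_task(self):
        # Hand out any idle map task first.
        for task_id, state in self.map_tasks.items():
            if state == IDLE:
                self.map_tasks[task_id] = IN_PROGRESS
                return ("map", task_id)
        # Reduces only start once all map output (intermediate files) exists.
        if all(state == DONE for state in self.map_tasks.values()):
            for task_id, state in self.reduce_tasks.items():
                if state == IDLE:
                    self.reduce_tasks[task_id] = IN_PROGRESS
                    return ("reduce", task_id)
        return None  # nothing to hand out right now

    def task_done(self, kind, task_id):
        tasks = self.map_tasks if kind == "map" else self.reduce_tasks
        tasks[task_id] = DONE
```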
One of the main issues they have to deal with in the MapReduce system is the problem of stragglers. Stragglers are servers that run slower than expected for one reason or another. Sometimes stragglers are due to hardware issues (e.g. a bad hard drive controller causing reduced I/O throughput) and sometimes they simply result from the server running too many complex jobs which utilize too much CPU. To counter the effects of stragglers, they now assign the same job to multiple servers, which counterintuitively ends up making tasks finish quicker. Another clever optimization is that all data transferred between map and reduce servers is compressed; since the servers usually aren't CPU bound, compression/decompression costs are a small price to pay for bandwidth and I/O savings.
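The backup-task trick is easy to sketch. The snippet below (server names and timings are made up) just runs the same task on two servers and returns whichever copy finishes first, which is the essence of the idea:

```python
import concurrent.futures
import random
import time

def run_task(task_id, server):
    # Simulate a server that is occasionally a straggler (e.g. a bad disk controller).
    delay = random.choice([0.1, 0.1, 2.0])
    time.sleep(delay)
    return f"task {task_id} finished on {server} after {delay}s"

def run_with_backup(task_id, servers):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(servers))
    futures = [pool.submit(run_task, task_id, server) for server in servers]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)         # don't block on the slower copies
    return next(iter(done)).result()  # first copy to finish wins

print(run_with_backup(7, ["server-a", "server-b"]))
```

The duplicated work is wasted, but when only a handful of straggling tasks are holding up an entire job, paying for a few redundant copies is cheaper than waiting.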
BigTable
After the creation of GFS, the need
for structured and semi-structured storage that went beyond opaque files became clear.
Examples of situations that could benefit from this included
- associating metadata with a URL such as when it was crawled, its PageRank™, contents, links to it, etc
- associating data with a user such as the user's search history and preferences
- geographical data such as information about roads and satellite imagery
The system required would need to be able to scale to storing billions of URLs, hundreds of terabytes of satellite imagery, preferences associated with hundreds of millions of users, and more. It was immediately obvious that this wasn't a task for an off-the-shelf commercial database system due to the scale requirements and the fact that such a system would be prohibitively expensive even if it did exist. In addition, an off-the-shelf system would not be able to make optimizations based on the underlying GFS file system.
Thus BigTable was born.
BigTable is not a relational
database. It does not support joins nor does it support rich SQL-like queries. Instead
it is more like a multi-level map data structure. It is a large-scale, fault-tolerant, self-managing system with terabytes of memory and petabytes of storage space which can
handle millions of reads/writes per second.
BigTable is now used by
over sixty Google products and projects as the platform for storing and retrieving
structured data.
The BigTable data model is fairly straightforward: each data item is stored in a cell which can be accessed using its {row key, column key, timestamp}. The need for a timestamp came about because it was discovered that many Google services store and compare the same data over time (e.g. HTML content for a URL). The data for each row is stored in one or more tablets, which are actually sequences of 64KB blocks in a data format called SSTable.
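A toy version of that multi-level map (not the real BigTable API, just an illustration of the {row key, column key, timestamp} addressing) might look like this in Python:

```python
import time

class ToyBigTable:
    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column, timestamp=None):
        versions = self.rows[row][column]
        if timestamp is not None:
            return versions[timestamp]
        return versions[max(versions)]  # most recent version by default

table = ToyBigTable()
table.put("com.cnn.www", "contents:", "<html>...v1...</html>", timestamp=1)
table.put("com.cnn.www", "contents:", "<html>...v2...</html>", timestamp=2)
print(table.get("com.cnn.www", "contents:"))  # latest crawl of the page
```

Keeping multiple timestamped versions per cell is what lets a service like the crawler compare the HTML content of a URL across crawls.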
BigTable Server Architecture
There are three primary server types of interest in the
BigTable system.
Master servers
Assigns tablets to tablet servers, keeps track of where tablets are located and
redistributes tasks as needed.
Tablet servers
Handle read/write requests for tablets and split tablets when they exceed size limits
(usually 100MB - 200MB). If a tablet server fails, then 100 tablet servers each pick up 1 new tablet and the system recovers.
Lock servers
These are instances of the
Chubby distributed lock service.
Lots of actions within BigTable
require acquisition of locks including opening tablets for writing, ensuring that there
is no more than one active Master at a time, and access control checking.
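As a sketch of why a lock service helps here (this is my own toy stand-in for Chubby, not its actual interface), ensuring a single active master boils down to whoever grabs a well-known exclusive lock first:

```python
import threading

class ToyLockService:
    """Stand-in for a Chubby-like service: named, exclusive locks."""
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def try_acquire(self, name, owner):
        with self._guard:
            if name not in self._locks:
                self._locks[name] = owner
                return True
            return False

lock_service = ToyLockService()
for candidate in ["master-1", "master-2"]:
    if lock_service.try_acquire("/bigtable/master-lock", candidate):
        print(candidate, "is the active master")
    else:
        print(candidate, "stays on standby")
```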
There are a number of optimizations which applications can take advantage of in BigTable. One example is the concept of locality groups. Some of the simple metadata associated with a particular URL which is typically accessed together (e.g. language, PageRank™, etc) can be physically stored together by placing those columns in a locality group, while other columns (e.g. content) are kept in a separate locality group. In addition, tablet data is usually kept in memory until the machine is running out of memory, at which point the data is written to GFS as an SSTable and a new in-memory table is created. This process is called compaction. There are other types of compactions where in-memory tables are merged with SSTables on disk to create an entirely new SSTable which is then stored in GFS.
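Here is a toy sketch of the compaction idea (my simplification; real SSTables are immutable sorted files stored on GFS, not Python dicts): the in-memory table is dumped as a new SSTable when memory runs low, and a merging compaction later folds several SSTables into one.

```python
def flush_memtable(memtable, sstables_on_gfs):
    """Minor compaction: dump the sorted in-memory table as a new immutable SSTable."""
    sstables_on_gfs.append(dict(sorted(memtable.items())))
    memtable.clear()  # start a fresh in-memory table

def merging_compaction(sstables_on_gfs):
    """Merge existing SSTables into a single new SSTable (newer values win)."""
    merged = {}
    for sstable in sstables_on_gfs:  # oldest first
        merged.update(sstable)       # later writes overwrite earlier ones
    return [dict(sorted(merged.items()))]

memtable = {"row2": "b", "row1": "a"}
sstables = []
flush_memtable(memtable, sstables)
sstables = merging_compaction(sstables + [{"row1": "a2"}])
print(sstables)  # [{'row1': 'a2', 'row2': 'b'}]
```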
Current Challenges Facing Google's Infrastructure
Although Google's infrastructure works well at the single cluster level, there are a
number of areas with room for improvement including
- support for geo-distributed clusters
- single global namespace for all data since currently data is segregated by cluster
- more and better automated migration of data and computation
- lots of consistency issues when you couple wide area replication with network
partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or
due to some sort of outage).
Recruiting Sales Pitch
[The conference was part recruiting event so some of the speakers ended their talks
with a recruiting spiel - Dare]
Having access to lots of data and computing power is a geek playground. You can build cool, seemingly trivial apps on top of the data which turn out to be really useful, such as Google Trends and catching misspellings of "britney spears". Another example of the kind of app you can build when you have enough data is treating language translation as a statistical modeling problem, which turns out to be one of the most successful approaches around.
Google hires smart people and lets them work in small teams of 3 to 5 people. They can get
away with teams being that small because they have the benefit of an infrastructure that
takes care of all the hard problems so devs can focus on building interesting,
innovative apps.