These are my notes from the talk "Using MapReduce on Large Geographic Datasets" by Barry Brummit.

Most of this talk repeated material from the previous talk by Jeff Dean, including a lot of the same slides. My notes primarily contain material I felt was unique to this talk.

A common pattern across a lot of Google services is creating a lot of index files that point into the underlying data and loading them into memory to make lookups fast. This is also done by the Google Maps team, which has to handle massive amounts of data (e.g. there are over a hundred million roads in North America).
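
To make the index pattern concrete, here is a minimal sketch of my own (not something shown in the talk). It assumes a hypothetical data file of tab-separated `key<TAB>value` lines and keeps only a key-to-byte-offset map in memory, so a lookup is one seek plus one read. The function names and the file layout are assumptions for illustration.

```python
import io

def build_index(data):
    """Scan the data once and map each record's key to its byte offset."""
    index = {}
    data.seek(0)
    while True:
        offset = data.tell()
        line = data.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0].decode("utf-8")
        index[key] = offset
    return index

def lookup(data, index, key):
    """Jump straight to the record for `key` using the in-memory index."""
    data.seek(index[key])
    _, value = data.readline().rstrip(b"\n").split(b"\t", 1)
    return value.decode("utf-8")

# Hypothetical sample data: road name -> a lat/long point on that road.
roads = io.BytesIO(b"Main St\t47.61,-122.33\nOak Rd\t47.62,-122.35\n")
road_index = build_index(roads)
print(lookup(roads, road_index, "Oak Rd"))
```

The index is small relative to the data it points into, which is what makes holding it in memory practical.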

Below are examples of the kinds of problems the Google Maps team has used MapReduce to solve.

Locating all points that connect to a particular road
Input: List of roads and intersections
Map: Create pairs of connected points such as {road, intersection} or {road, road} pairs
Shuffle: Sort by key
Reduce: Get list of pairs with the same key
Output: A list of all the points that connect to a particular road
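
The road-connection job can be sketched as a single-process simulation. This is my own illustration, not code from the talk: it assumes the input arrives as a hypothetical `{intersection: [roads]}` dictionary, and the names `map_intersection`, `shuffle`, `reduce_road`, and `connected_points` are made up for the example.

```python
from collections import defaultdict
from itertools import permutations

def map_intersection(intersection, roads):
    """Map: emit {road, intersection} and {road, road} pairs, keyed by road."""
    for road in roads:
        yield road, intersection
    for road, other in permutations(roads, 2):
        yield road, other

def shuffle(pairs):
    """Shuffle: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_road(road, points):
    """Reduce: collapse all pairs sharing a road into one de-duplicated list."""
    return road, sorted(set(points))

def connected_points(intersections):
    """Run map, shuffle, and reduce in-process over {intersection: [roads]}."""
    pairs = (
        pair
        for intersection, roads in intersections.items()
        for pair in map_intersection(intersection, roads)
    )
    return dict(reduce_road(road, points) for road, points in shuffle(pairs))

# Hypothetical sample input: two intersections along Main St.
sample = {
    "Main & 1st": ["Main St", "1st Ave"],
    "Main & Oak": ["Main St", "Oak Rd"],
}
print(connected_points(sample)["Main St"])
```

On a real cluster the map and reduce calls would run on different machines and the shuffle would be a distributed sort; only the shape of the three stages carries over.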

Rendering Map Tiles
Input: Geographic feature list
Map: Emit each feature on a set of overlapping lat/long rectangles
Shuffle: Sort by key
Reduce: Emit tile using data for all enclosed features
Output: Rendered tiles
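
A rough sketch of the tile job follows, again my own and not from the talk. It assumes each feature is described by a hypothetical `(min_lat, min_lon, max_lat, max_lon)` bounding box, uses a simple non-overlapping grid of one-degree tiles for brevity, and stands in for actual rendering by returning the sorted list of features that fall on each tile.

```python
import math
from collections import defaultdict

TILE_DEG = 1.0  # hypothetical tile size, in degrees of latitude/longitude

def map_feature(name, bbox, tile_deg=TILE_DEG):
    """Map: emit (tile, feature) for every tile the bounding box touches."""
    min_lat, min_lon, max_lat, max_lon = bbox
    first_row, last_row = math.floor(min_lat / tile_deg), math.floor(max_lat / tile_deg)
    first_col, last_col = math.floor(min_lon / tile_deg), math.floor(max_lon / tile_deg)
    for row in range(first_row, last_row + 1):
        for col in range(first_col, last_col + 1):
            yield (row, col), name

def shuffle(pairs):
    """Shuffle: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_tile(tile, names):
    """Reduce: stand-in for rendering; lists what a renderer would draw."""
    return tile, sorted(names)

def render_tiles(features, tile_deg=TILE_DEG):
    """Run map, shuffle, and reduce in-process over {feature: bbox}."""
    pairs = (
        pair
        for name, bbox in features.items()
        for pair in map_feature(name, bbox, tile_deg)
    )
    return dict(reduce_tile(tile, names) for tile, names in shuffle(pairs))

# Hypothetical sample input: a park spanning two tiles and a small lake.
sample = {"park": (0.2, 0.2, 0.8, 1.5), "lake": (0.5, 1.1, 0.6, 1.2)}
print(render_tiles(sample))
```

A feature that straddles a tile boundary is emitted once per tile, so each reducer sees everything it needs to draw its tile without consulting its neighbors.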

Finding Nearest Gas Station to an Address within five miles
Input: Graph describing node network with all gas stations marked
Map: Search five mile radius of each gas station and mark distance to each node
Shuffle: Sort by key
Reduce: For each node, emit path and gas station with the shortest distance
Output: Graph marked with nearest gas station to each node
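
Here is a small sketch of how the gas station job could look, assuming an undirected road graph stored as a hypothetical `{node: {neighbor: miles}}` dictionary. The talk did not say which search algorithm is used; I have swapped in a distance-bounded Dijkstra search for the map step, and all function names are my own.

```python
import heapq
from collections import defaultdict

MAX_MILES = 5.0  # search radius from the talk's example

def map_station(graph, station, max_miles=MAX_MILES):
    """Map: bounded Dijkstra from one station; emit (node, (miles, station, path))."""
    best = {station: 0.0}
    heap = [(0.0, station, (station,))]
    while heap:
        dist, node, path = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        yield node, (dist, station, path)
        for neighbor, miles in graph.get(node, {}).items():
            new_dist = dist + miles
            if new_dist <= max_miles and new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor, path + (neighbor,)))

def reduce_node(node, candidates):
    """Reduce: keep the candidate with the shortest distance for this node."""
    return node, min(candidates)

def nearest_stations(graph, stations, max_miles=MAX_MILES):
    """Run map, shuffle (group by node), and reduce in-process."""
    grouped = defaultdict(list)
    for station in stations:
        for node, candidate in map_station(graph, station, max_miles):
            grouped[node].append(candidate)
    return dict(reduce_node(node, found) for node, found in sorted(grouped.items()))

# Hypothetical sample graph: a chain A-B-C-D-E-F with stations at A and D.
sample = {
    "A": {"B": 2.0},
    "B": {"A": 2.0, "C": 2.0},
    "C": {"B": 2.0, "D": 2.0},
    "D": {"C": 2.0, "E": 4.0},
    "E": {"D": 4.0, "F": 3.0},
    "F": {"E": 3.0},
}
print(nearest_stations(sample, ["A", "D"])["C"])
```

Each path runs from the station to the node; because the graph is assumed to be undirected, reversing it gives the route from the node to the station. Nodes farther than five miles from every station simply never appear in the output.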

When developers run into issues with a MapReduce job, they can debug them by running their MapReduce applications locally on their desktops.

Developers who would like to harness the power of a several hundred to several thousand node cluster but do not work at Google can try

Recruiting Sales Pitch

[The conference was part recruiting event so some of the speakers ended their talks with a recruiting spiel. - Dare]

The Google infrastructure is the product of Google's engineering culture, which has the following ten characteristics:

  1. Single source code repository for all Google code
  2. Developers can check in fixes for any Google product
  3. You can build any Google product in three steps (get, configure, make)
  4. Uniform coding standards across the company
  5. Mandatory code reviews before check-ins
  6. Pervasive unit testing
  7. Tests run nightly, emails sent to developers if there are any failures
  8. Powerful tools that are shared company-wide
  9. Rapid project cycles, developers change projects often, 20% time
  10. Peer driven review process, flat management hierarchy

Q&A

Q: Where are intermediate results from map operations stored?
A: In BigTable or GFS

Q: Can you use MapReduce incrementally? For example, when new roads are built in North America do we have to run MapReduce over the entire data set or can we only factor in the changed data?
A: Currently, you'll have to process the entire data stream again. However, this problem is the target of lots of active research at Google since it affects a lot of teams.