These are my notes from the talk Using MapReduce on Large Geographic Datasets by Barry Brummit.
Most of this talk repeated material from the previous talk by Jeff Dean, including many of the same slides. My notes primarily cover the material I felt was unique to this talk.
A common pattern across a lot of Google services is to build index files that point into the bulk data and to load those indexes into memory to make lookups fast. The Google Maps team does this too, since it has to handle massive amounts of data (e.g. there are over a hundred million roads in North America).
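The talk didn't show code, but the pattern is easy to sketch. Here is a minimal, hypothetical Python version: scan a large file of records once, keep only a key-to-byte-offset index in memory, and seek directly to a record on lookup. The newline-delimited JSON format and the "roads.dat" name are my own assumptions, not anything from the talk.

```python
import json

def build_index(path):
    """Scan a file of newline-delimited JSON records once,
    recording each record's byte offset by its key."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            record = json.loads(line)
            index[record["id"]] = offset
            offset += len(line)  # binary mode, so len() counts bytes
    return index

def lookup(path, index, key):
    """Seek straight to the record instead of scanning the file."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return json.loads(f.readline())

# index = build_index("roads.dat")   # built once, held in memory
# road = lookup("roads.dat", index, "I-90")
```

The payoff is that only the small index lives in memory while the bulk data stays on disk, and every lookup costs one seek instead of a scan.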
Below are examples of the kinds of problems the Google Maps team has used MapReduce to solve.
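As a concrete illustration of the shape of such a job (my own invented example, not one from the talk), here is a sketch where the map phase assigns each road segment to a map tile and emits a (tile, 1) pair, and the reduce phase sums the counts per tile. The 0.1-degree tile size and the record fields are assumptions.

```python
def map_road_segment(segment):
    """Map phase: emit (tile, 1) for the tile containing
    the segment's starting point."""
    tile = (int(segment["lat"] // 0.1), int(segment["lng"] // 0.1))
    yield tile, 1

def reduce_tile(tile, counts):
    """Reduce phase: sum the per-segment counts for one tile."""
    yield tile, sum(counts)
```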
When issues are encountered in a MapReduce job, developers can debug them by running the application locally on their desktops.
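A single-process driver like the hypothetical one below is enough to step through map and reduce logic in a debugger before shipping a job to a cluster; the "shuffle" between the phases is just an in-memory group-by-key.

```python
from collections import defaultdict

def run_local(map_fn, reduce_fn, records):
    """Run a MapReduce job in one process: map every record,
    group the intermediate pairs by key (the shuffle),
    then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

# Example: count records by a field, entirely in memory.
records = [{"kind": "highway"}, {"kind": "street"}, {"kind": "highway"}]
counts = run_local(
    lambda r: [(r["kind"], 1)],
    lambda k, vs: [(k, sum(vs))],
    records,
)
print(counts)  # {'highway': 2, 'street': 1}
```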
Developers who would like to harness the power of a cluster of several hundred to several thousand nodes but do not work at Google can try Hadoop, the open-source implementation of MapReduce.
The Google infrastructure is the product of Google's engineering culture, which has the following ten characteristics:
Q: Where are intermediate results from map operations stored? A: In BigTable or GFS.
Q: Can you use MapReduce incrementally? For example, when new roads are built in North America, do we have to run MapReduce over the entire data set, or can we factor in only the changed data? A: Currently, you have to process the entire data set again. However, this problem is the target of a lot of active research at Google, since it affects many teams.
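To make the question concrete: for the special case where the reduce is associative and commutative (like a count), previous outputs could in principle be merged with a reduce over only the new data. This is my own illustration of the kind of shortcut such research is after, not a technique described in the talk.

```python
def merge_counts(old_output, new_delta):
    """For a reduce that just sums, merge previously computed
    per-key counts with counts over only the newly added data,
    avoiding a re-run over the full data set."""
    merged = dict(old_output)
    for key, count in new_delta.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# old = {"tile_a": 120, "tile_b": 75}   # from the last full run
# delta = {"tile_b": 3, "tile_c": 1}    # new roads only
# merge_counts(old, delta)              # {'tile_a': 120, 'tile_b': 78, 'tile_c': 1}
```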