Greg Linden has a blog post entitled Google Personalized Search and Bigtable, where he writes:
One tidbit I found curious in the Google Bigtable paper was this hint about the internals of Google Personalized Search:
Personalized Search generates user profiles using a MapReduce over Bigtable. These user profiles are used to personalize live search results.
This appears to confirm that Google Personalized Search works by building high-level profiles of user interests from their past behavior.
I would guess it works by determining subject interests (e.g. sports, computers) and biasing all search results toward those categories. That would be similar to the old personalized search in Google Labs (which was based on Kaltix technology) where you had to explicitly specify that profile, but now the profile is generated implicitly using your search history.
My concern with this approach is that it does not focus on what you are doing right now, what you are trying to find, your current mission. Instead, it is a coarse-grained bias of all results toward what you generally seem to enjoy.
This problem is worse if the profiles are not updated in real time.
I totally disagree with Greg here on almost every point. Building a profile of a user's interests to improve their search results is totally different from improving their search results in real time. The former is personalized search, while the latter is more akin to clustering of search results. For example, if I search for "football", a search engine can either use the fact that I've searched for soccer-related terms in the past to bubble up the official website of the Fédération Internationale de Football Association (FIFA) ahead of the National Football League (NFL) website, or it can cluster the results so I see all the options. Ideally, it should do both. However, although it would be ideal for my profile to be built in real time (e.g. learning from my searches from five minutes ago as opposed to those from five days ago), that doesn't seem to me to be necessary for the feature to be beneficial to end users. This seems like one of those places where a good-enough solution based on offline processing is better than an over-engineered real-time one. Search is rarely about returning or reacting to real-time data anyway. :)
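To make the distinction concrete, here is a rough Python sketch of the "football" example. It is only an illustration under my own assumptions, not a description of how Google actually does it: the `TOPIC_OF_QUERY` lookup table, the function names, the scoring formula and the `boost` parameter are all made up for this post. `build_profile` stands in for the offline batch job, `personalize` biases results toward the profile at query time, and `cluster` groups results so every interpretation stays visible.

```python
from collections import Counter

# Hypothetical query-to-topic lookup (an assumption for this sketch);
# a real system would use some kind of classifier instead.
TOPIC_OF_QUERY = {
    "premier league table": "soccer",
    "world cup schedule": "soccer",
    "super bowl": "american_football",
}


def build_profile(search_history):
    """Offline step: reduce a search log to normalized topic weights."""
    counts = Counter(
        TOPIC_OF_QUERY[q] for q in search_history if q in TOPIC_OF_QUERY
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: n / total for topic, n in counts.items()}


def personalize(results, profile, boost=1.0):
    """Query-time step: re-rank (url, topic, base_score) tuples.

    Each base score is scaled up by the user's weight for that topic.
    """
    def score(result):
        _url, topic, base = result
        return base * (1.0 + boost * profile.get(topic, 0.0))

    return sorted(results, key=score, reverse=True)


def cluster(results):
    """Alternative: group result URLs by topic so all options are shown."""
    groups = {}
    for url, topic, _base in results:
        groups.setdefault(topic, []).append(url)
    return groups


# Illustrative data only: base scores are invented.
history = ["premier league table", "world cup schedule", "super bowl"]
results = [
    ("nfl.com", "american_football", 1.0),
    ("fifa.com", "soccer", 0.9),
]

profile = build_profile(history)          # computed offline, e.g. nightly
ranked = personalize(results, profile)    # applied to live results
grouped = cluster(results)                # no profile needed
```

With this history the soccer weight outweighs the small gap in base scores, so the FIFA result moves ahead of the NFL one, while an empty profile leaves the original order alone. Note that nothing in `personalize` cares whether the profile was rebuilt five minutes or five days ago, which is the point I'm making above.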
PS: I do think it's quite interesting to see how many Google applications are built on Bigtable and MapReduce. From the post Namespaced Extensions in Feeds, it looks like Google Reader is another example.