These are my notes from the talk Lessons in Building Scalable Systems by Reza Behforooz.
The Google Talk team have produced
multiple versions of their application. There is
- a desktop IM client which speaks the Jabber/XMPP protocol.
- a Web-based IM client that is integrated into GMail
- a Web-based IM client that is integrated into Orkut
- An IM widget which can be embedded in iGoogle or in any website that supports embedding Flash.
Google Talk Server Challenges
The team has had to deal with a significant set of challenges since the
service launched including
-
Support displaying online presence and sending messages for millions of
users. Peak traffic is in hundreds of thousands of queries per second with a
daily average of billions of messages handled by the system.
routing and application logic has to be applied to each message according to
the preferences of each user while keeping latency under 100ms.
handling surge of traffic from integration with Orkut and GMail.
-
ensuring in-order delivery of messages
-
needing an extensibile architecture which could support a variety of clients
Lessons
The most important lesson the Google Talk team learned is that you have to measure the right things. Questions like
"how many active users do you have" and "how many IM messages does the system
carry a day" may be good for evaluating marketshare but are not good questions
from an engineering perspective if one is trying to get insight into how the
system is performing.
Specifically, the biggest strain on the system actually turns out to be
displaying presence information. The formula for determining how many presence
notifications they send out is
total_number_of_connected_users * avg_buddy_list_size * avg_number_of_state_changes
Sometimes there are drastic jumps in these numbers. For example, integrating
with Orkut increased the average buddy list
size since people usually have more friends in a social networking service
than they have IM buddies.
Other lessons learned were
Slowly Ramp Up High Traffic Partners: To see what real world
usage patterns would look like when Google Talk was integrated with Orkut and GMail, both services added code to fetch
online presence from the Google Talk
servers to their pages that displayed a user's contacts without adding any UI
integration. This way the feature could be tested under real load without users
being aware that there were any problems if there were capacity problems. In
addition, the feature was rolled out to small groups of users at first (around
1%).
Dynamic Repartitioning: In general, it is a good idea to divide
user data across various servers (aka partitioning or sharding) to reduce
bottlenecks and spread out the load. However, the infrastructure should support
redistributing these partitions/shards without having to cause any downtime.
Add Abstractions that Hide System Complexity: Partner services
such as Orkut and GMail don't know which data centers contain the
Google Talk servers, how many servers
are in the Google Talk cluster and
are oblivious of when or how load balancing, repartitioning or failover occurs
in the Google Talk service.
Understand Semantics of Low Level Libraries: Sometimes low level
details can stick it to you. The Google Talk developers found out that using epoll worked better than the poll/select loop because
they have lots of open TCP conections but only a relatively small number of
them are active at any time.
Protect Against Operational Problems: Review logs and endeavor to smooth out spikes in activity graphs. Limit cascading problems by having logic
to back off from using busy or sick servers.
Any Scalable System is a Distributed System: Apply the lessons from the fallacies of distributed computing. Add fault tolerance to all your
components. Add profiling to live services and follow transactions as they
flow through the system (preferably in a non-intrusive manner). Collect metrics from services
for monitoring both for real time diagnosis and offline generation of reports.
Recommended Software Development Strategies
Compatibility is very important, so making sure deployed binaries are backwards
and forward compatible is always a good idea. Giving developers access to
live servers (ideally public beta servers not main production servers)
will encourage them to test and try out ideas quickly. It also gives them a
sense of empowerement. Developers end up making their systems easier to
deploy, configure, monitor, debug and maintain when they have a better idea
of the end to end process.
Building an experimentation platform
which allows you to empirically test the results of various changes to the
service is also recommended.