A few members of the Hotmail/Windows Live Mail team have been writing about scalability recently.
From the ACM Queue article A Conversation with Phil Smoot:
BF: Can you give us some sense of just how big Hotmail is and
what the challenges of dealing with something that size are?
PS: Hotmail is a service consisting of thousands of machines
and multiple petabytes of data. It executes billions of transactions over
hundreds of applications agglomerated over nine years—services that are built on
services that are built on services. Some of the challenges are keeping the site
running: namely dealing with abuse and spam; keeping an aggressive,
Internet-style pace of shipping features and functionality every three and six
months; and planning how to release complex changes over a set of multiple
releases.
QA is a challenge in the sense that mimicking Internet loads on our QA lab
machines is a hard engineering problem. The production site consists of hundreds
of services deployed over multiple years, and the QA lab is relatively small, so
re-creating a part of the environment or a particular issue in the QA lab in a
timely fashion is a hard problem. Manageability is a challenge in that you want
to keep your administrative headcount flat as you scale out the number of
machines.
BF: I have this sense that the challenges don’t scale
uniformly. In other words, are there certain scaling points where the problem
just looks completely different from how it looked before? Are there things that
are just fundamentally different about managing tens of thousands of systems
compared with managing thousands or hundreds?
PS: Sort of, but we tend to think that if you can manage five
servers you should be able to manage tens of thousands of servers and hundreds
of thousands of servers just by having everything fully automated, and that all
the automation hooks need to be built into the service from the get-go. Deployment
of bits is an example of code that needs to be automated. You don’t want your
administrators touching individual boxes making manual changes. But on the other
side, we have roll-out plans for deployment that smaller services probably would
not have to consider. For example, when we roll out a new version of a service
to the site, we don’t flip the whole site at once.
We do some staging, where we’ll validate the new version on a server and then
roll it out to 10 servers and then to 100 servers and then to 1,000
servers—until we get it across the site. This leads to another interesting
problem, which is versioning: the notion that you have to have multiple versions
of software running across the site at the same time. That is, version N and
N+1 clients need to be able to talk to version N and N+1 servers and N and N+1
data formats. That problem arises as you roll out new versions or as you try
different configurations or tunings across the site.
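The interview doesn't show how this versioning is actually handled at Hotmail, but the
gist is easy to sketch. The C# fragment below is purely illustrative (the MailboxRecord
type, the field names, and the semicolon-delimited payload format are all invented): a
parser that ignores keys it doesn't recognize and defaults fields that aren't present
lets a version N server and a version N+1 server read each other's data.

    using System;
    using System.Collections.Generic;

    // Hypothetical sketch: a version-tolerant record format in which version N
    // and version N+1 servers can both parse each other's payloads. Unknown keys
    // are ignored rather than treated as errors, and a field added in N+1 falls
    // back to a default when a version N payload omits it.
    class MailboxRecord
    {
        public string User;
        public long QuotaBytes;        // present since version N
        public string StorageCluster;  // added in version N+1; optional for N payloads

        public static MailboxRecord Parse(string payload)
        {
            var fields = new Dictionary<string, string>();
            foreach (string pair in payload.Split(';'))
            {
                int eq = pair.IndexOf('=');
                if (eq > 0)
                    fields[pair.Substring(0, eq)] = pair.Substring(eq + 1);
            }

            var record = new MailboxRecord();
            record.User = fields.ContainsKey("user") ? fields["user"] : "";
            record.QuotaBytes = fields.ContainsKey("quota") ? long.Parse(fields["quota"]) : 0;
            // A version N payload has no "cluster" key; default it rather than fail.
            record.StorageCluster = fields.ContainsKey("cluster") ? fields["cluster"] : "default";
            return record;
        }

        static void Main()
        {
            // A version N payload (no "cluster" field) and a version N+1 payload.
            var oldFormat = Parse("user=alice;quota=2147483648");
            var newFormat = Parse("user=bob;quota=2147483648;cluster=bay003");
            Console.WriteLine(oldFormat.StorageCluster); // "default"
            Console.WriteLine(newFormat.StorageCluster); // "bay003"
        }
    }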
Another hard problem is load balancing across the site. That is, ensuring
that user transactions and storage capacity are equally distributed over all the
nodes in the system without any particular set of nodes getting too hot.
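The article doesn't say how Hotmail actually places users on nodes, but as a toy
illustration of the problem, here is a hypothetical C# sketch that spreads users across
storage nodes with a simple hash. It distributes transactions roughly evenly, but it
ignores exactly the hard parts Smoot is alluding to: uneven storage consumption, hot
users, and adding capacity without reshuffling everyone.

    using System;

    // Hypothetical sketch only: hash a user name onto one of N storage nodes.
    // A real placement scheme also has to weight nodes by capacity, detect hot
    // spots, and move users without reshuffling the whole site.
    class NodeChooser
    {
        static int ChooseNode(string user, int nodeCount)
        {
            // Stable FNV-1a hash so the same user always maps to the same node.
            uint hash = 2166136261;
            foreach (char c in user.ToLowerInvariant())
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)(hash % (uint)nodeCount);
        }

        static void Main()
        {
            Console.WriteLine(ChooseNode("alice@hotmail.com", 1000));
            Console.WriteLine(ChooseNode("bob@hotmail.com", 1000));
        }
    }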
From the blog post entitled Issues with .NET Frameworks 2.0 by Walter Hsueh:
Our team is tackling the scale issues, delving deep into the CLR and
understanding its behavior. We've identified at least two issues in .NET
Frameworks 2.0 that are "low-hanging fruit", and are hunting for more.
1a) Regular Expressions can be very expensive. Certain (unintended and
intended) strings may cause RegExes to exhibit exponential behavior. We've
taken several hotfixes for this. RegExes are so handy, but devs really need to
understand how they work; we've been bitten by them more than once (the sketch
after this list shows the failure mode).
1b) Designing an AJAX-style browser application (like most engineering
problems) involves trading one problem for another. We can choose to shift the
application burden between the client and the server. In the case of RegExes, it
might make sense to move them to the client (where CPU can be freely used)
instead of having them run on the server (where you have to share). Windows Live
Mail made this tradeoff in one case.
2) Managed Thread Local Storage (TLS) is expensive. There is a global lock
in the Whidbey RTM implementation of Thread.GetData/Thread.SetData which causes
scalability issues. The recommendation is to use the [ThreadStatic] attribute on
static class variables instead (both approaches are sketched below). After the
change our RPS went up, our CPU % went down, context switches dropped by 50%,
and lock contention dropped by over 80%. Good stuff.
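Neither point comes with code in the original post, so here are two small illustrative
sketches (mine, not Walter's). First, catastrophic regex backtracking: the pattern and
input below are invented for illustration, and since .NET Framework 2.0 has no built-in
match timeout, the only real defense is writing patterns that can't backtrack
exponentially.

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    class RegexBacktracking
    {
        static void Main()
        {
            // Nested quantifiers plus a final mismatch force the engine to try an
            // exponential number of ways to split the 'a's between the two loops.
            // Each additional 'a' roughly doubles the running time.
            string pathological = "^(a+)+$";
            string input = new string('a', 25) + "!";   // no match: worst case

            var sw = Stopwatch.StartNew();
            bool matched = Regex.IsMatch(input, pathological);
            sw.Stop();
            Console.WriteLine("Matched={0} in {1} ms", matched, sw.ElapsedMilliseconds);

            // An equivalent pattern without nested quantifiers fails almost instantly.
            sw = Stopwatch.StartNew();
            matched = Regex.IsMatch(input, "^a+$");
            sw.Stop();
            Console.WriteLine("Matched={0} in {1} ms", matched, sw.ElapsedMilliseconds);
        }
    }

Second, the TLS point: Thread.GetData/Thread.SetData go through a data-slot lookup,
while a [ThreadStatic] static field gives each thread its own copy of the field with no
slot indirection. The class and field names are made up; this only shows the shape of
the change the post describes.

    using System;
    using System.Threading;

    class PerThreadCounters
    {
        // Slot-based TLS: Thread.GetData/SetData on a named slot. This is the
        // code path the Hotmail team found contended under load in the
        // Whidbey RTM implementation.
        static readonly LocalDataStoreSlot Slot =
            Thread.AllocateNamedDataSlot("requestCount");

        static void BumpViaSlot()
        {
            object current = Thread.GetData(Slot);
            int count = current == null ? 0 : (int)current;
            Thread.SetData(Slot, count + 1);
        }

        // [ThreadStatic] alternative: each thread sees its own copy of this
        // static field, with no slot lookup on the hot path.
        [ThreadStatic]
        static int requestCount;

        static void BumpViaThreadStatic()
        {
            requestCount++;
        }

        static void Main()
        {
            for (int i = 0; i < 1000; i++)
            {
                BumpViaSlot();
                BumpViaThreadStatic();
            }
            Console.WriteLine("slot={0}, threadstatic={1}",
                (int)Thread.GetData(Slot), requestCount);
        }
    }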
Our devs have also started migrating some of our services to Whidbey,
and they've found some interesting performance issues along the way. It would
probably be a good idea to pull together some sort of "lessons learned from
building mega-scale services on the .NET Framework" article.