Database normalization is a formal process of designing your database to eliminate redundant data, use space efficiently and reduce update errors. Anyone who has ever taken a database class has had it drummed into their head that a normalized database is the only way to go. This is true for the most part. However, there are certain scenarios where the benefits of database normalization are outweighed by its costs. Two of these scenarios are described below.
Immutable Data and Append-Only Scenarios
Pat Helland, an enterprise architect at Microsoft who just rejoined the company after a two-year stint at Amazon, has a blog post entitled Normalization is for Sissies, where he presents his slides from an internal Microsoft gathering on database topics. In his presentation, Pat argues that database normalization is unnecessary in situations where we are storing immutable data, such as financial transactions or a particular day's price list.
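To make the idea concrete, here is a minimal sketch of such immutable, append-only data; the table and column names are my own illustration, not Pat's:

```sql
-- Minimal sketch of append-only data: each day's price is a new row.
-- Rows are inserted once and never updated, so the update anomalies
-- that normalization protects against simply cannot occur.
CREATE TABLE daily_prices (
  item_id     INT           NOT NULL,
  price_date  DATE          NOT NULL,
  price       DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (item_id, price_date)
);

INSERT INTO daily_prices (item_id, price_date, price)
VALUES (42, '2007-08-23', 19.99);  -- appended, never modified
```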
When Multiple Joins are Needed to Produce a Commonly Accessed View
The biggest problem with normalization is that you end up with
multiple tables representing what is conceptually a single item. For
example, consider this normalized set of tables which represent a user
profile on a typical social networking site.
user table

| user_id | first_name | last_name | sex | hometown | relationship_status | interested_in | religious_views | political_views |
|---------|------------|-----------|-----|----------|---------------------|---------------|-----------------|-----------------|
| 12345 | John | Doe | Male | Atlanta, GA | married | women | (null) | (null) |

user_affiliations table

| user_id (foreign key) | affiliation_id (foreign key) |
|-----------------------|------------------------------|
| 12345 | 42 |
| 12345 | 598 |

affiliations table

| affiliation_id | description | member_count |
|----------------|-------------|--------------|
| 42 | Microsoft | 18,656 |
| 598 | Georgia Tech | 23,488 |

user_phone_numbers table

| user_id (foreign key) | phone_number | phone_type |
|-----------------------|--------------|------------|
| 12345 | 425-555-1203 | Home |
| 12345 | 425-555-6161 | Work |
| 12345 | 206-555-0932 | Cell |

user_screen_names table

| user_id (foreign key) | screen_name | im_service |
|-----------------------|-------------|------------|
| 12345 | geeknproud@example.com | AIM |
| 12345 | voip4life@example.org | Skype |

user_work_history table

| user_id (foreign key) | company_affiliation_id (foreign key) | company_name | job_title |
|-----------------------|---------------------------------------|--------------|-----------|
| 12345 | 42 | Microsoft | Program Manager |
| 12345 | 78 | i2 Technologies | Quality Assurance Engineer |
This is the kind of information you see on the average profile on Facebook. With the above design, it takes six SQL join operations to access and display the information about a single user. This makes rendering the profile page a fairly database-intensive operation, which is compounded by the fact that profile pages are the most popular pages on social networking sites.
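To make that concrete, here is a sketch of the query a profile page would need against the normalized schema above (the exact column choices are illustrative):

```sql
-- Sketch of a profile-page query over the normalized schema. Every piece
-- of the profile lives in a different table, so one page view pulls from
-- all six of them. In practice an application might issue separate
-- queries instead, to avoid the cross-product between the one-to-many
-- tables, but the join work is the same either way.
SELECT u.first_name, u.last_name, u.hometown,
       a.description AS affiliation,
       ph.phone_number, ph.phone_type,
       sn.screen_name, sn.im_service,
       wh.company_name, wh.job_title
FROM user u
LEFT JOIN user_affiliations  ua ON ua.user_id = u.user_id
LEFT JOIN affiliations       a  ON a.affiliation_id = ua.affiliation_id
LEFT JOIN user_phone_numbers ph ON ph.user_id = u.user_id
LEFT JOIN user_screen_names  sn ON sn.user_id = u.user_id
LEFT JOIN user_work_history  wh ON wh.user_id = u.user_id
WHERE u.user_id = 12345;
```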
The simplest way to fix this problem is to denormalize the database. Instead of having tables for the user's affiliations, phone numbers, IM addresses and so on, we can just place them in the user table as columns. The drawback with this approach is that there is now more wasted space (e.g. lots of college students will have null for their work_phone) and perhaps some redundant information (e.g. if we copy the description of each affiliation into an affiliation_name column for each user to avoid having to join with the affiliations table). However, given the very low cost of storage versus the improved performance of querying a single table rather than issuing SQL statements that operate across six tables for every operation, this is a small price to pay.
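For illustration, here is roughly what such a denormalized user table might look like; the exact columns are my own guesses at a flattened layout, not a prescribed design:

```sql
-- One possible denormalized layout: the joined tables collapsed into
-- columns on the user table. Column names here are illustrative.
CREATE TABLE user (
  user_id             INT PRIMARY KEY,
  first_name          VARCHAR(64),
  last_name           VARCHAR(64),
  sex                 VARCHAR(16),
  hometown            VARCHAR(128),
  relationship_status VARCHAR(32),
  interested_in       VARCHAR(32),
  religious_views     VARCHAR(64),
  political_views     VARCHAR(64),
  home_phone          VARCHAR(20),   -- often NULL (wasted space)
  work_phone          VARCHAR(20),   -- often NULL, e.g. for college students
  cell_phone          VARCHAR(20),
  aim_screen_name     VARCHAR(64),
  skype_screen_name   VARCHAR(64),
  affiliation_names   VARCHAR(255)   -- redundant copy of affiliations.description
);
```

The profile page now reads one row from one table instead of joining across six.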
As Joe Gregorio mentions in his blog post about the emergence of megadata,
a lot of the large Web companies such as Google, eBay and Amazon are
heavily into denormalizing their databases as well as eschewing
transactions when updating these databases to improve their
scalability.
Maybe normalization is for sissies…
UPDATE: Someone pointed out in the comments that denormalizing the affiliations table into the user table would mean the member_count would have to be updated in thousands of users' rows whenever a new member joined the group. This is obviously not the intent of denormalizing for performance reasons, since it replaces a bad problem with a worse one. Since an affiliation is a distinct concept from a user, it makes sense for it to have its own table. Replicating the names of the groups a user is affiliated with in the user table is a good performance optimization, although it does mean that the name has to be fixed up in thousands of rows if it ever changes. Since this is likely to happen very rarely, it is probably acceptable, especially if we schedule renames to be done by a cron job during off-peak hours, as sketched below. On the other hand, replicating the member count is just asking for trouble.
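Such a nightly fix-up might look something like the statement below. This assumes the replicated copies live in an affiliation_name column on the user_affiliations table, which is my own illustration rather than anything from the comments:

```sql
-- Hypothetical off-peak cron job: repair replicated group names that
-- have drifted from the canonical affiliations table (MySQL syntax).
UPDATE user_affiliations ua
JOIN affiliations a ON a.affiliation_id = ua.affiliation_id
SET ua.affiliation_name = a.description
WHERE ua.affiliation_name <> a.description;
```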
UPDATE 2: Lots of great comments here and on reddit indicate that I should have put more context around this post. Database denormalization is the kind of performance optimization that should be carried out as a last resort, after trying things like creating database indexes, using SQL views and implementing application-specific in-memory caching; a sketch of the first two follows below.
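```sql
-- Illustrative versions of the cheaper optimizations to try first: an
-- index on the join column and a view that packages the multi-table
-- join so the application issues one simple SELECT. Names are mine.
CREATE INDEX idx_user_affiliations_user_id ON user_affiliations (user_id);

CREATE VIEW user_profile AS
SELECT u.user_id, u.first_name, u.last_name,
       a.description AS affiliation_name
FROM user u
LEFT JOIN user_affiliations ua ON ua.user_id = u.user_id
LEFT JOIN affiliations      a  ON a.affiliation_id = ua.affiliation_id;

-- The profile page then reads:
-- SELECT * FROM user_profile WHERE user_id = 12345;
```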
However, if you hit massive scale and are dealing with millions of queries a day across hundreds of millions to billions of records, or have decided to go with database partitioning/sharding, then you will likely end up resorting to denormalization. A real-world example of this is the Flickr database back-end, whose details are described in Tim O'Reilly's Database War Stories #3: Flickr, which contains the following quotes:
"tags are an interesting one. lots of the 'web 2.0' feature set doesn't fit well with traditional normalised db schema design. denormalization (or heavy caching) is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags. you can cache stuff that's slow to generate, but if it's so expensive to generate that you can't ever regenerate that view without pegging a whole database server then it's not going to work (or you need dedicated servers to generate those views - some of our data views are calculated offline by dedicated processing clusters which save the results into mysql).

federating data also means denormalization is necessary - if we cut up data by user, where do we store data which relates to two users (such as a comment by one user on another user's photo). if we want to fetch it in the context of both users, then we need to store it in both shards, or scan every shard for one of the views (which doesn't scale). we store a lot of data twice, but then there's the issue of it going out of sync. we can avoid this to some extent with two-step transactions (open transaction 1, write commands, open transaction 2, write commands, commit 1st transaction if all is well, commit 2nd transaction if 1st committed) but there's still a chance for failure when a box goes down during the 1st commit.

we need new tools to check data consistency across multiple shards, move data around shards and so on - a lot of the flickr code infrastructure deals with ensuring data is consistent and well balanced and finding and repairing it when it's not."
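To make the two-step transaction concrete, here is a minimal sketch of the sequence described above; the table, columns and values are my own illustration, not Flickr's actual schema:

```sql
-- Sketch of a two-step cross-shard write: the same comment is stored on
-- both users' shards. Each shard is a separate MySQL server reached over
-- its own connection; the comments below mark which connection runs what.

-- on shard A (photo owner's shard):
START TRANSACTION;
INSERT INTO photo_comments (photo_id, owner_id, commenter_id, body)
VALUES (987, 12345, 67890, 'Nice shot!');

-- on shard B (commenter's shard):
START TRANSACTION;
INSERT INTO photo_comments (photo_id, owner_id, commenter_id, body)
VALUES (987, 12345, 67890, 'Nice shot!');

-- commit shard A first; only if that succeeds, commit shard B.
COMMIT;  -- on shard A
COMMIT;  -- on shard B; if this box dies before committing, the copies diverge
```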
The last part of that quote is also important to consider. Denormalization means that you are now likely to deal with data inconsistencies, because you are storing redundant copies of data and may not be able to update all copies of a column value simultaneously when it changes, for a variety of reasons. Having tools in your infrastructure to support fixing up data of this sort then becomes very important.
Now playing: Bow Wow - Outta My System (feat. T-Pain)