Sometime last week, Amazon soft launched Amazon SimpleDB, a hosted service for storing and querying structured data. This release plugged a hole in their hosted Web services offerings, which include the Amazon Simple Storage Service (S3) and the Amazon Elastic Compute Cloud (EC2). Amazon’s goal of becoming the “Web OS” upon which the next generation of Web startups builds came off as hollow when all they gave you was BLOB storage and hosted computation but not structured storage. With SimpleDB, they’re almost at the point where all the tools you need for building the next del.icio.us or Flickr can be provided by Amazon’s Web Services. The last bit they need to provide is actual Web hosting, so that developers don’t need to resort to absurd dynamic DNS hacks when interacting with their Amazon applications from the Web.

The Good: Commoditizing hosted services and getting people to think outside the relational database box

The data model of SimpleDB is remarkably similar to Google’s BigTable in that instead of having multiple tables and relations between them, you get one giant table which is accessed via the tuple of {row key, column key}. Although both SimpleDB and BigTable allow applications to store multiple values for a particular tuple, they do so in different ways. In BigTable, multiple values are additionally keyed by timestamp, so I can access data using tuples such as {“http://www.example.com”, “incoming_links”, “12-12-2007”}. In Amazon’s SimpleDB, I’d simply be able to store multiple values for a particular key pair, so I could access {“Dare Obasanjo”, “weblogs”} and it would return (“http://www.25hoursaday.com/weblog”, “http://blogs.msdn.com/dareobasanjo”, “http://carnage4life.spaces.live.com”).

Another similarity the two systems share is that there is no requirement that all “rows” in a table share the same schema, nor is there an explicit notion of declaring a schema. In SimpleDB, tables are called domains, rows are called items, and the columns are called attributes.
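
To make that concrete, here's a rough in-memory approximation of the data model (plain Python of my own, not Amazon's client library):

# A rough approximation of the SimpleDB data model: a domain maps item
# names to items, and each item maps attribute names to a *list* of
# string values. There is no schema anywhere.
domain = {}   # e.g. a "MyWeblogs" domain

def put_attribute(domain, item_name, attr_name, value):
    # every value is stored as a string; an attribute can hold many values
    item = domain.setdefault(item_name, {})
    item.setdefault(attr_name, []).append(str(value))

put_attribute(domain, "Dare Obasanjo", "weblogs", "http://www.25hoursaday.com/weblog")
put_attribute(domain, "Dare Obasanjo", "weblogs", "http://blogs.msdn.com/dareobasanjo")
put_attribute(domain, "Dare Obasanjo", "weblogs", "http://carnage4life.spaces.live.com")

# the {"Dare Obasanjo", "weblogs"} pair now yields all three URLs
print(domain["Dare Obasanjo"]["weblogs"])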

It is interesting to imagine how this system evolved. From experience, it is clear to everyone who has had to build a massive relational database that database joins kill performance. The longer you’ve dealt with massive data sets, the more you begin to fall in love with denormalizing your data so you can scale. Taken to its logical extreme, there’s nothing more denormalized than a single table. Even better, Amazon goes a step further by introducing multivalued columns, which means that SimpleDB isn’t even in First Normal Form, whereas we all learned in school that the minimum we should aspire to is Third Normal Form.
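
To see what that means in practice, here's a toy comparison of the two representations (my own illustration, not something from the SimpleDB docs):

# First Normal Form: one atomic value per column, so multiple weblogs
# mean multiple rows (or a separate weblogs table you join against).
weblogs_1nf = [
    ("Dare Obasanjo", "http://www.25hoursaday.com/weblog"),
    ("Dare Obasanjo", "http://blogs.msdn.com/dareobasanjo"),
]

# SimpleDB-style: one item, one multivalued attribute. No join needed,
# but also no way to declare "one value per field" because there is no schema.
weblogs_simpledb = {
    "Dare Obasanjo": {"weblogs": ["http://www.25hoursaday.com/weblog",
                                  "http://blogs.msdn.com/dareobasanjo"]}
}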

I think it is great to see more mainstream examples that challenge the traditional thinking of how to store, manage and manipulate large amounts of data.

I also think the pricing is very reasonable. If I were a startup founder, I’d strongly consider taking Amazon Web Services for a spin before going with a traditional LAMP or WISC approach.

The Bad: Eventual Consistency and Data Values are Weakly Typed

The documentation for the PutAttributes method has the following note

Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate GetAttributes or Query request (read) immediately after a DeleteAttributes or PutAttributes request (write) might not return the updated data.

This may or may not be a problem depending on your application. It may be OK for a del.icio.us-style application if it takes a few minutes before your tag updates are applied to a bookmark, but the same can’t be said for an application like Twitter. What would be useful for developers is if Amazon gave some more information about the delayed propagation, such as average latency during peak and off-peak hours.
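
If your application genuinely needs read-your-writes behavior, about the best you can do for now is poll and retry. Here's a minimal sketch of the idea, where sdb_put_attributes and sdb_get_attributes are hypothetical wrappers around the PutAttributes and GetAttributes calls, not a real library:

import time

def put_then_wait(domain, item, attr, value, timeout=10.0, interval=0.5):
    # Write, then poll until the new value shows up or we give up.
    # sdb_put_attributes / sdb_get_attributes are hypothetical wrappers
    # around SimpleDB's PutAttributes / GetAttributes operations.
    sdb_put_attributes(domain, item, {attr: value})
    deadline = time.time() + timeout
    while time.time() < deadline:
        values = sdb_get_attributes(domain, item).get(attr, [])
        if value in values:
            return True       # the write is visible on the copy we just read
        time.sleep(interval)  # not yet; eventual consistency at work
    return False              # still stale after `timeout` seconds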

There is another interesting note in the documentation of the Query method which states

 Lexicographical Comparison of Different Data Types

Amazon SimpleDB treats all entities as UTF-8 strings. Keep this in mind when storing and querying different data types, such as numbers or dates. Design clients to convert their data into an appropriate string format, so that query expressions return expected results.

The following are suggested methods for converting different data types into strings for proper lexicographical order enforcement:

  • Positive integers should be zero-padded to match the largest number of digits in your data set. For example, if the largest number you are planning to use in a range is 1,000,000, every number that you store in Amazon SimpleDB should be zero-padded to at least 7 digits. You would store 25 as 0000025, 4597 as 0004597, and so on.

  • Negative integers should be offset and turned into positive numbers and zero-padded. For example, if the smallest negative integer in your data set is -500, your application should add at least 500 to every number that you store. This ensures that every number is now positive and enables you to use the zero-padding technique.

  • To ensure proper lexicographical order, convert dates to the ISO 8601 format.

Note

Amazon SimpleDB provides utility functions within our sample libraries that help you perform these conversions in your application.

This is ghetto beyond belief. I should know ahead of time what the lowest number will be in my data set and add/subtract offsets from data values when inserting and retrieving them from SimpleDB? I need to know the largest number in my data set and zero pad to that length? Seriously, WTF?

It’s crazy just thinking about the kinds of bugs that could be introduced into applications because of these wacky semantics and the recommended hacks to get around them. Even if this is the underlying behavior of SimpleDB, Amazon should have fixed this up in an API layer above SimpleDB and exposed that, instead of providing ghetto helper functions in a handful of popular programming languages and then crossing their fingers hoping that no one hits this problem.
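
For what it's worth, the conversions themselves are only a few lines. Here's a quick sketch of my own (not Amazon's helper library), where the padding width and offset are assumptions you'd have to pick based on your own data set:

from datetime import datetime

# Encode numbers and dates as strings that sort correctly when compared
# lexicographically as UTF-8. PAD_WIDTH and OFFSET are assumptions: you
# have to know up front the largest value and the most negative value
# you will ever store.
PAD_WIDTH = 7    # enough digits for values up to 9,999,999
OFFSET = 500     # big enough to make the most negative value non-negative

def encode_int(n):
    return str(n + OFFSET).zfill(PAD_WIDTH)

def decode_int(s):
    return int(s) - OFFSET

def encode_date(dt):
    # ISO 8601 timestamps sort lexicographically in chronological order
    return dt.strftime("%Y-%m-%dT%H:%M:%S")

# "0000525" > "0000504" as strings, just as 25 > 4 as numbers
assert encode_int(25) > encode_int(4)
assert decode_int(encode_int(-499)) == -499
assert encode_date(datetime(2007, 12, 21)) < encode_date(datetime(2008, 1, 2))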

The Ugly: Web Interfaces that Claim to be RESTful but Aren’t

I’ve talked in the past about APIs that claim to be RESTful but aren’t, but Amazon’s takes the cake when it comes to egregious behavior. Again, from the documentation for the PutAttributes method we learn

Sample Request

The following example uses PutAttributes on Item123 which has attributes (Color=Blue), (Size=Med) and (Price=14.99) in MyDomain. If Item123 already had the Price attribute, this operation would replace the values for that attribute.

https://sdb.amazonaws.com/
?Action=PutAttributes
&Attribute.0.Name=Color&Attribute.0.Value=Blue
&Attribute.1.Name=Size&Attribute.1.Value=Med
&Attribute.2.Name=Price&Attribute.2.Value=14.99
&Attribute.2.Replace=true
&AWSAccessKeyId=[valid access key id]
&DomainName=MyDomain
&ItemName=Item123
&SignatureVersion=1
&Timestamp=2007-06-25T15%3A03%3A05-07%3A00
&Version=2007-11-07
&Signature=gabYTEXUgY%2Fdg817JBmj7HnuAA0%3D

Sample Response

<PutAttributesResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07">
  <ResponseMetadata>
    <RequestId>490206ce-8292-456c-a00f-61b335eb202b</RequestId>
    <BoxUsage>0.0000219907</BoxUsage>
  </ResponseMetadata>
</PutAttributesResponse>

Wow. A GET request with a parameter called Action which modifies data? What is this, 2005? I thought we all learned that GET requests which modify data are a bad idea after the Google Web Accelerator scare.

Of course, I'm not the only one who thinks this is ridonkulous. See similar comments from Stefan Tilkov, Joe Gregorio, and Steve Loughran. Methinks someone at Amazon needs to go read some guidelines on building RESTful Web services.

Bonus points to Subbu Allamaraju for refactoring the SimpleDB API into a truly RESTful Web service.
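
To illustrate the difference, here's roughly what updating an item could look like if the item were an addressable resource and PUT did the modification. The /domains/.../items/... URL structure and payload below are my own invention for illustration, not Subbu's actual design or anything Amazon ships:

import httplib  # http.client in newer versions of Python

# Hypothetical resource-oriented endpoint; this URL scheme is made up
# for illustration and is NOT part of the real SimpleDB API.
conn = httplib.HTTPSConnection("sdb.example.com")
body = ("<Attributes>"
        "<Attribute><Name>Price</Name><Value>14.99</Value><Replace>true</Replace></Attribute>"
        "</Attributes>")
# PUT (not GET) against the item resource is what modifies state
conn.request("PUT", "/domains/MyDomain/items/Item123", body,
             {"Content-Type": "application/xml"})
response = conn.getresponse()
print(response.status)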

Speaking of ridonkulous API trends, it seems the SimpleDB Query method follows the lead of the Google Base GData API in stuffing a SQL-like query language into the query string parameters of HTTP GET requests. I guess it is RESTful, but damn is it ugly.

Now playing: J. Holiday - Suffocate

Friday, 21 December 2007 19:53:01 (GMT Standard Time, UTC+00:00)
I agree with your characterization except for the latency issue, which I believe is an inherent characteristic of replication between data center nodes.

http://oakleafblog.blogspot.com/2007/12/linq-and-entity-framework-posts-for_20.html

--rj
Friday, 21 December 2007 21:37:41 (GMT Standard Time, UTC+00:00)
I thought this was an interesting article on using Amazon

http://open.blogs.nytimes.com/author/dgottfrid/
I.P Nichols
Saturday, 22 December 2007 00:43:06 (GMT Standard Time, UTC+00:00)
What would be less ugly than this approach? I mean, you need a complex query language if you are trying to implement a pseudo-database, and GET is the right method to use.
Bob McGrew
Sunday, 23 December 2007 16:43:10 (GMT Standard Time, UTC+00:00)
Bob, the issue (well, the main issue) with the GET here is not for querying. It's that GET is also used to *modify* data.
Anything on a GET request can be placed inside a link. And anything on a link can get called by search engines' crawlers, various pre-fetch mechanisms, or even a user clicking on a link. You certainly *don't* want data in your database to be changed by accident, and this is exactly what change via GET will get you.
Plus, well, if you need to POST to prevent this problem, you may as well have the format be something a little more structured than a long list of values. If the API knows to return structured data back, it should be able to accept it as well. But that's a different issue.
Yaron
Sunday, 30 December 2007 16:09:01 (GMT Standard Time, UTC+00:00)

I'm fine with challenging mainstream thinking if it actually leads to an improvement... but violating first normal form basically means you can't make any logical sense of the data outside of the context of the program accessing it. This makes SimpleDB no better than a hard-coded file format with an API and query engine. Which is useful, certainly, but a very niche use.

I'm also sure the companies that use SimpleDB will enjoy regularly exporting their data to do any sort of trend or aggregate analysis.

As for "From experience, it is clear that everyone who has had to build a massive relational database that database joins kill performance.", I think the echo chamber of "bloggers who build websites" skews this too much. There are a lot of Data Warehouses out there, and yes, while denormalization is useful, big joins are very common (and if tables are properly indexed / profiled, fairly fast in a parallel environment).

Similarly I've seen Oracle-backed websites that do subsecond multi-way joins, which they needed in this particular case because they required data update consistency AND couldn't deal with the latency from a replicated & denormalized table. So, YMMV; not everyone is a search engine or a social network .. ;)
