Recently I took a look at CouchDB because I saw it favorably mentioned by Sam Ruby, and when Sam says some technology is interesting, he’s always right. You can get the gist of CouchDB by reading the CouchDB Quick Overview and the CouchDB technical overview.
CouchDB is a distributed document-oriented database, which means it is designed to be a massively scalable way to store, query and manage documents. Two things that are interesting right off the bat are that the primary interface to CouchDB is a RESTful JSON API and that queries are performed by creating the equivalent of stored procedures in JavaScript which are then applied to each document in parallel. One thing that is not so interesting is that editing documents is lockless and utilizes optimistic concurrency, which means more work for clients.
As someone who designed and implemented an XML Database query language back in the day, this all seems strangely familiar.
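To make the "more work for clients" point concrete, here is a minimal sketch in Python of what optimistic concurrency asks of a caller. It is a toy in-memory store, not CouchDB's actual API: the names DocumentStore, ConflictError and update_with_retry are made up for illustration, and revisions are assumed to be simple integers.

```python
class ConflictError(Exception):
    """Raised when a write is based on a stale revision."""


class DocumentStore:
    """Toy in-memory store (hypothetical; not CouchDB's real interface)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # copy so callers cannot mutate stored state

    def put(self, doc_id, body, expected_rev=None):
        # No locks are taken: the write succeeds only if the caller saw
        # the latest revision (None means "I expect a brand new document").
        current_rev = self._docs.get(doc_id, (None, None))[0]
        if current_rev != expected_rev:
            raise ConflictError(
                f"{doc_id}: expected rev {expected_rev}, found {current_rev}"
            )
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


def update_with_retry(store, doc_id, mutate, max_attempts=5):
    """Read-modify-write loop: the extra work pushed onto the client."""
    for _ in range(max_attempts):
        rev, body = store.get(doc_id)
        mutate(body)
        try:
            return store.put(doc_id, body, expected_rev=rev)
        except ConflictError:
            continue  # someone else wrote first; re-read and try again
    raise ConflictError(f"{doc_id}: gave up after {max_attempts} attempts")
```

The store itself stays simple because it never blocks a writer. The price is that every client needs something like update_with_retry, or has to surface the conflict to the user.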
So far, I like what I’ve seen but there already seems to be a bunch of incorrect hype about the project which may damage its chances of success if it isn’t checked. Specifically I’m talking about Assaf Arkin’s post CouchDB: Thinking beyond the RDBMS which seems chock full of incorrect assertions and misleading information.
Assaf writes
This day, it happens to be CouchDB. And CouchDB on first look seems like the future of database without the weight that is SQL and write consistency.
CouchDB is a document-oriented database, which is nothing new [although focusing on JSON instead of XML makes it buzzword compliant], and it is definitely not a replacement for or evolution of relational databases. In fact, the CouchDB folks assert as much in their overview document.
Document-oriented databases work well for semi-structured data where each item is mostly independent and is often processed or retrieved in isolation. This describes a large category of Web applications which are primarily about documents which may link to each other but aren’t often processed or requested based on those links (e.g. blog posts, email inboxes, RSS feeds, etc). However there are also lots of Web applications that are about managing heavily structured, highly interrelated data (e.g. sites that heavily utilize tagging or social networking) where the document-centric model doesn’t quite fit.
Here’s where it gets interesting. There are no indexes. So your first option is knowing the name of the document you want to retrieve. The second is referencing it from another document. And remember, it’s JSON in/JSON out, with REST access all around, so relative URLs and you’re fine.
But that still doesn’t explain the lack of indexes. CouchDB has something better. It calls them views, but in fact those are computed tables. Computed using JavaScript. So you feed (reminder: JSON over REST) it a set of functions, and you get a set of queries for computed results coming out of these functions.
Again, this is a claim that is refuted by the actual CouchDB documentation. There are indexes, otherwise the system would be ridiculously slow since you would have to run the function and evaluate every single document in the database each time you ran one of these views (i.e. the equivalent of a full table scan). Assaf probably meant to say that there aren’t any relational database style indexes but…it isn’t a relational database so that isn’t a useful distinction to make.
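As a rough illustration of the difference, here is a small Python sketch of a view that is maintained as an index instead of being recomputed on every query. This is not how CouchDB is implemented, and the names View, on_write and posts_by_author are hypothetical. The point is only that the function runs once per document write, and queries are answered from stored results.

```python
from collections import defaultdict


class View:
    """Toy incrementally maintained view (illustrative names only)."""

    def __init__(self, map_fn):
        self._map_fn = map_fn            # doc -> iterable of (key, value)
        self._keys_by_doc = {}           # doc_id -> keys that doc emitted
        self._index = defaultdict(dict)  # key -> {doc_id: [values]}
        self.map_calls = 0               # counts how often map_fn runs

    def on_write(self, doc_id, doc):
        """Re-run the function only for the one document that changed."""
        for key in self._keys_by_doc.pop(doc_id, ()):
            self._index[key].pop(doc_id, None)  # drop the stale rows
        self.map_calls += 1
        keys = set()
        for key, value in self._map_fn(doc):
            self._index[key].setdefault(doc_id, []).append(value)
            keys.add(key)
        self._keys_by_doc[doc_id] = keys

    def query(self, key):
        """Answer from the index; map_fn is never called here."""
        rows = self._index.get(key, {})
        return sorted(v for values in rows.values() for v in values)


def posts_by_author(doc):
    """Example view function: emit (author, title) for blog posts."""
    if doc.get("type") == "post":
        yield doc["author"], doc["title"]
```

Without the stored index, every call to query would have to run posts_by_author over every document, which is the full table scan described above.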
I’m personally convinced that write consistency is the reason RDBMS are imploding under their own weight. Features like referential integrity, constraints and atomic updates are really important in the client-server world, but irrelevant in a world of services.
You can do all of that in the service. And you can do better if you replace write consistency with read consistency, making allowances for asynchronous updates, and using functional programming in your code instead of delegating to SQL.
I read these two paragraphs five or six times and they still seem like gibberish to me. Firstly, it seems silly to say that maintaining data consistency is important in the client-server world but irrelevant in the world of services. Secondly, “read consistency” and “write consistency” are not an either-or choice. They are both techniques used by database management systems, like Oracle, to present a deterministic and consistent experience when modifying, retrieving and manipulating large amounts of data.
In the world of online services, people are very aware of the CAP conjecture and often choose availability over data consistency, but it is a conscious decision. For example, it is more important for Amazon that their system is always available to users than it is that they never get an order wrong. See Pat Helland’s (ex-Amazon architect) example of how a business-centric approach to data consistency may shape one’s views from his post Memories, Guesses, and Apologies where he writes
#1 - The application has only a single replica and makes a "decision" to ship the widget on Wednesday. This "decision" is sent to the user.
#2 - The forklift pummels the widget to smithereens.
#3 - The application has no recourse but to apologize, informing the customer they can have another widget in one month (after the incoming shipment arrives).
#4 - Consider an alternate example with two replicas working independently. Replica-1 "decides" to ship the widget and sends that "decision" to User-1.
#5 - Independently, Replica-2 makes a "decision" to ship the last remaining widget to User-2.
#6 - Replica-2 informs Replica-1 of its "decision" to ship the last remaining widget to User-2.
#7 - Replica-1 realizes that they are in trouble... Bummer.
#8 - Replica-1 tells User-1 that he guessed wrong.
#9 - Note that the behavior experienced by the user in the first example is indistinguishable from the experience of user-1 in the second example.
Eventual Consistency and Crappy Computers
Business realities force apologies. To cope with these difficult realities, we need code and, frequently, we need human beings to apologize. It is essential that businesses have both code and people to manage these apologies.
…
We try too hard as an industry. Frequently, we build big and expensive datacenters and deploy big and expensive computers.
In many cases, comparable behavior can be achieved with a lot of crappy machines which cost less than the big expensive one.
The problem described by Pat isn’t a failure of relational databases vs. document-oriented ones as Assaf’s post would have us believe. It is the business reality that availability is more important than data consistency for certain classes of applications. A lot of the culture and technologies of the relational database world are about preserving data consistency [which is a good thing because I don’t want money going missing from my bank account because someone thought the importance of write consistency is overstated] while the culture around Web applications is about reaching scale cheaply while maintaining high availability in situations where the occurrence of data loss is unfortunate but not catastrophic (e.g. lost blog comments, mistagged photos, undelivered friend requests, etc).
Even then, most large scale Web applications that don’t utilize the relational database features that are meant to enforce data consistency (triggers, foreign keys, transactions, etc) still end up rolling their own app-specific solutions to handle data consistency problems. However, since these are tailored to their application, they can be more performant than the generic features which exist in a relational database.
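Pat’s widget example is a good picture of what such an app-specific solution might look like. The Python sketch below is only an illustration under my own assumptions: Replica and reconcile are hypothetical names, and the "first promise merged wins" rule is just one policy a business could choose. Each replica stays available and promises shipments from its local view of stock, and a later reconciliation step works out who gets an apology.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One independently operating copy of the order system (toy model)."""
    name: str
    stock: int                                    # this replica's local view
    promised: list = field(default_factory=list)  # users told "we will ship"

    def order(self, user):
        """Accept an order using only local state, so it is always available."""
        if self.stock > 0:
            self.stock -= 1
            self.promised.append(user)
            return True
        return False


def reconcile(actual_stock, replicas):
    """Merge promises in the order given; return (kept, apologies).

    Assumed policy: replicas listed first win, everyone past the real
    stock level gets an apology.
    """
    promises = [user for replica in replicas for user in replica.promised]
    return promises[:actual_stock], promises[actual_stock:]
```

With one widget actually on the shelf and both replicas believing they own it, reconcile(1, [replica_2, replica_1]) keeps User-2's order and puts User-1 on the apology list, which matches steps #4 through #8 in the quoted example. No foreign keys or transactions are involved, just code that knows what an oversold widget means to this particular business.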
For further reading, see an Overview of the Flickr Architecture.