A few weeks ago, Joshua Porter posted an excellent analysis of FriendFeed's user interface in his post Thoughts on the Friendfeed interface, where he provides this annotated screenshot

In addition to the screenshot, Joshua levels four key criticisms at FriendFeed's current design:

  • Too few items per screen
  • Secondary information clogs up each item
  • Difficult to scan content titles quickly
  • People who aren't my friends

The last item is my biggest pet peeve about FriendFeed and the reason I haven't been able to get into the service. FriendFeed goes out of its way to show me content from, and links to, people I don't know and haven't become friends with on the site. In the screenshot above, there are at least twice as many people Joshua isn't friends with showing up on the page as there are people he knows. Here are the three situations in which FriendFeed commonly shows non-friends and why they are bad:

  1. FriendFeed shows you content from friends of friends: This is a major social faux pas. It may sound like a cool viral feature, but showing me content from people I haven't subscribed to means I don't have control over who shows up in my feed, and it takes away from the intimacy of the site because I'm always seeing content from strangers.
  2. FriendFeed shows you who "liked" some content: Why should I care if some total stranger liked some blog post from a friend of mine? Again, this seems like a viral feature aimed at generating page views from users clicking on the people who liked an item in the feed, but it comes at the cost of visual clutter and a reduction in the intimacy of the service by putting strangers in your face.
  3. FriendFeed shows comments expanded by default in the feed: In the screenshot above, the comment thread for "Overnight Success and FriendFeed Needs" takes up space that could have been used to show another item from one of Joshua's friends. The question to ask is whether a bunch of comments from people Joshua may or may not know is more valuable than an update from one of his friends.

In fact, the majority of Joshua's remaining complaints, including secondary information causing visual clutter and too few items per screen, are a consequence of FriendFeed's decision to take multiple opportunities to push people you don't know in your face on the home page. The need to grow virally by encouraging connections between users is costing them by hampering their core user experience.

On the flip side, look at how Facebook has tried to address the issue of prompting users to grow their social graph without spamming the news feed with people you don't know

 

People often claim that activity streams make them feel like they are drowning in a river of noise. FriendFeed compounds this by drowning you in content from people you don't even know and never asked to receive content from in the first place.

Rule #1 of every activity stream experience is that users should feel in control of what content they get in their feed. Otherwise, the tendency to succumb to the feeling of "drowning" will be overwhelming.

Now Playing: Lupe Fiasco - Kick, Push


 

Categories: Social Software

Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. The justification for database sharding is that after a certain scale point it is cheaper and more feasible to scale a site horizontally by adding more machines than to grow it vertically by adding beefier servers.

Why Shard or Partition your Database?

Let's take Facebook.com as an example. In early 2004, the site was mostly used by Harvard students as a glorified online yearbook. You can imagine that the entire storage requirements and query load on the database could be handled by a single beefy server. Fast forward to 2008, where just the Facebook application related page views are about 14 billion a month (which translates to over 5,000 page views per second, each of which will require multiple backend queries to satisfy). Besides query load, with its attendant IOPS, CPU and memory costs, there's also storage capacity to consider. Today Facebook stores 40 billion physical files to represent about 10 billion photos, which is over a petabyte of storage. Even though the actual photo files are likely not in a relational database, their metadata, such as identifiers and locations, would still require a few terabytes of storage to represent in the database. Do you think the original database used by Facebook had terabytes of storage available just to store photo metadata?

At some point during the development of Facebook, they reached the physical capacity of their database server. The question then was whether to scale vertically by buying a more expensive, beefier server with more RAM, CPU horsepower, disk I/O and storage capacity, or to spread their data out across multiple relatively cheap database servers. In general, if your service has lots of rapidly changing data (i.e. lots of writes) or is sporadically queried by lots of users in a way which causes your working set not to fit in memory (i.e. lots of reads leading to lots of page faults and disk seeks), then your primary bottleneck will likely be I/O. This is typically the case with social media sites like Facebook, LinkedIn, Blogger, MySpace and even Flickr. In such cases, it is either prohibitively expensive or physically impossible to purchase a single server to handle the load on the site, so sharding the database provides excellent bang for the buck with regards to cost savings relative to the increased complexity of the system.

Now that we have an understanding of when and why one would shard a database, the next step is to consider how one would actually partition the data into individual shards. There are a number of options, each with its own tradeoffs, presented below.

How Sharding Changes your Application

In a well designed application, the primary change sharding adds to the core application code is that instead of code such as

using System.Configuration;   // ConfigurationSettings
using System.Data.Odbc;       // OdbcConnection, OdbcCommand, OdbcDataReader

//string connectionString = @"Driver={MySQL};SERVER=dbserver;DATABASE=CustomerDB;"; <-- should be in web.config
string connectionString = ConfigurationSettings.AppSettings["ConnectionInfo"];
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();
          
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId; 
OdbcDataReader reader = cmd.ExecuteReader(); 

the actual connection information for the database now depends on the data we are trying to store or access. So you'd have the following instead

string connectionString = GetDatabaseFor(customerId);          
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();
         
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId; 
OdbcDataReader reader = cmd.ExecuteReader(); 

the assumption here being that the GetDatabaseFor() method knows how to map a customer ID to a physical database location. For the most part everything else should remain the same unless the application uses sharding as a way to parallelize queries. 
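
To make this concrete, here is a minimal sketch of what GetDatabaseFor() might look like under the key/hash based partitioning scheme described later in this post. The server names, the connection string format and the hard-coded server list are assumptions made purely for illustration; a real implementation would more likely read them from configuration or a lookup service.

// Hypothetical sketch of GetDatabaseFor() using hash (modulo) based partitioning.
// Server names and the connection string format are invented for this example.
static readonly string[] shardServers = 
{
    "dbserver0", "dbserver1", "dbserver2", "dbserver3", "dbserver4",
    "dbserver5", "dbserver6", "dbserver7", "dbserver8", "dbserver9"
};

static string GetDatabaseFor(int customerId)
{
    // pick a shard by taking the customer ID modulo the number of database servers
    int shardIndex = customerId % shardServers.Length;
    return String.Format("Driver={{MySQL}};SERVER={0};DATABASE=CustomerDB;", shardServers[shardIndex]);
}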

A Look at Some Common Sharding Schemes

There are a number of different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are four of the most popular schemes used by various large scale Web applications today.

  1. Vertical Partitioning: A simple way to segment your application database is to move tables related to specific features to their own server. For example, placing user profile information on one database server, friend lists on another, and a third for user generated content like photos and blogs. The key benefit of this approach is that it is straightforward to implement and has a low impact on the application as a whole. The main problem with this approach is that if the site experiences additional growth then it may be necessary to further shard a feature specific database across multiple servers (e.g. handling metadata queries for 10 billion photos by 140 million users may be more than a single server can handle).

  2. Range Based Partitioning: In situations where the entire data set for a single feature or table still needs to be further subdivided across multiple servers, it is important to ensure that the data is split up in a predictable manner. One approach to ensuring this predictability is to split the data based on value ranges that occur within each entity. For example, splitting up sales transactions by the year they were created or assigning users to servers based on the first digit of their zip code. The main problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully then the sharding scheme leads to unbalanced servers. In the previous example, splitting up transactions by date means that the server with the current year gets a disproportionate amount of read and write traffic. Similarly, partitioning users based on their zip code assumes that your user base will be evenly distributed across the different zip codes, which fails to account for situations where your application is popular in a particular region and the fact that human populations vary across different zip codes.

  3. Key or Hash Based Partitioning: This is often a synonym for user based partitioning for Web 2.0 sites. With this approach, each entity has a value that can be used as input into a hash function whose output is used to determine which database server to use. A simplistic example is to consider a situation where you have ten database servers and your user IDs are numeric values incremented by 1 each time a new user is added. In this example, the hash function could perform a modulo operation on the user ID with the number ten and then pick a database server based on the remainder value. This approach should ensure a uniform allocation of data to each server. The key problem with this approach is that it effectively fixes your number of database servers, since adding new servers means changing the hash function, which without downtime is like being asked to change the tires on a moving car.

  4. Directory Based Partitioning: A loosely coupled approach to this problem is to create a lookup service which knows your current partitioning scheme and abstracts it away from the database access code (a sketch of this approach is shown below). This means the GetDatabaseFor() method actually hits a web service or a database which stores/returns the mapping between each entity key and the database server it resides on. This loosely coupled approach means you can perform tasks like adding servers to the database pool or changing your partitioning scheme without having to impact your application. Consider the previous example where there are ten servers and the hash function is a modulo operation. Let's say we want to add five database servers to the pool without incurring downtime. We can keep the existing hash function, add these servers to the pool and then run a script that copies data from the ten existing servers to the five new servers based on a new hash function implemented by performing the modulo operation on user IDs using the new server count of fifteen. Once the data is copied over (although this is tricky since users are always updating their data) the lookup service can change to using the new hash function without any of the calling applications being any wiser that their database pool just grew 50% and the database they went to for accessing John Doe's pictures five minutes ago is different from the one they are accessing now.
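
To make the difference between hash based and directory based partitioning concrete, below is a minimal sketch of what a directory based GetDatabaseFor() could look like. The ShardMap table, the "DirectoryDbConnectionInfo" setting and the absence of caching are assumptions made for illustration; in practice the mapping would be aggressively cached since it sits on the critical path of every request.

// Hypothetical sketch of directory based partitioning: the shard for each customer is
// looked up from a central directory database instead of being computed by a hash function.
static string GetDatabaseFor(int customerId)
{
    string directoryConnString = ConfigurationSettings.AppSettings["DirectoryDbConnectionInfo"];
    using (OdbcConnection conn = new OdbcConnection(directoryConnString))
    {
        conn.Open();
        OdbcCommand cmd = new OdbcCommand("SELECT ServerName FROM ShardMap WHERE CustomerID = ?", conn);
        cmd.Parameters.Add("@CustomerID", OdbcType.Int).Value = customerId;
        string serverName = (string) cmd.ExecuteScalar();
        return String.Format("Driver={{MySQL}};SERVER={0};DATABASE=CustomerDB;", serverName);
    }
}
// Rebalancing now means copying rows to new servers and updating the ShardMap table;
// none of the calling application code has to change.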

Problems Common to all Sharding Schemes

Once a database has been sharded, new constraints are placed on the operations that can be performed on the database. These constraints primarily center around the fact that operations across multiple tables or multiple rows in the same table will no longer run on the same server. Below are some of the constraints and additional complexities introduced by sharding:

  • Joins and Denormalization – Prior to sharding a database, any queries that require joins on multiple tables execute on a single server. Once a database has been sharded across multiple servers, it is often not feasible to perform joins that span database shards, both due to the performance cost of compiling data from multiple servers and the additional complexity of performing such cross-server joins in application code.

    A common workaround is to denormalize the database so that queries that previously required joins can be performed from a single table. For example, consider a photo site which has a database that contains a user_info table and a photos table. Comments a user has left on photos are stored in the photos table and reference the user's ID as a foreign key. So when you go to the user's profile it takes a join of the user_info and photos tables to show the user's recent comments. After sharding the database, it now takes querying two database servers to perform an operation that used to require hitting only one server. This performance hit can be avoided by denormalizing the database. In this case, a user's comments on photos could be stored in the same table or server as their user_info AND the photos table could also keep a copy of the comment (a sketch of such a dual write appears after this list). That way rendering a photo page and showing its comments only has to hit the server with the photos table, while rendering a user profile page with their recent comments only has to hit the server with the user_info table.

    Of course, the service now has to deal with all the perils of denormalization such as data inconsistency (e.g. user deletes a comment and the operation is successful against the user_info DB server but fails against the photos DB server because it was just rebooted after a critical security patch).

  • Referential integrity – As you can imagine, if there's a bad story around performing cross-shard queries, it is even worse when trying to enforce data integrity constraints such as foreign keys in a sharded database. Most relational database management systems do not support foreign keys across databases on different database servers. This means that applications that require referential integrity often have to enforce it in application code and run regular SQL jobs to clean up dangling references once they move to using database shards.

    Dealing with data inconsistency issues due to denormalization and lack of referential integrity can become a significant development cost to the service.

  • Rebalancing (Updated 1/21/2009) – In some cases, the sharding scheme chosen for a database has to be changed. This could happen because the sharding scheme was improperly chosen (e.g. partitioning users by zip code) or the application outgrows the database even after being sharded (e.g. too many requests being handled by the DB shard dedicated to photos, so more database servers are needed for handling photos). In such cases, the database shards will have to be rebalanced, which means changing the partitioning scheme AND moving all existing data to new locations. Doing this without incurring downtime is extremely difficult and not supported by any off-the-shelf tools today. Using a scheme like directory based partitioning does make rebalancing a more palatable experience, at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).
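
To illustrate what the denormalization workaround and its consistency perils look like in application code, here is a minimal sketch of the dual write described in the joins and denormalization bullet above. The GetUserDatabaseFor()/GetPhotoDatabaseFor() helpers, table names and column names are all hypothetical; the point is that the application, not the database, is now responsible for keeping both copies of a comment in sync.

// Hypothetical sketch of a denormalized, dual-write comment insert across two shards.
// Table, column and helper method names are invented for this example.
void AddComment(int userId, int photoId, string commentText)
{
    string userConnString  = GetUserDatabaseFor(userId);    // shard holding user_info and the user's copy of comments
    string photoConnString = GetPhotoDatabaseFor(photoId);  // shard holding photos and the photo's copy of comments

    // Write #1: the copy of the comment stored alongside the user's profile data
    using (OdbcConnection conn = new OdbcConnection(userConnString))
    {
        conn.Open();
        OdbcCommand cmd = new OdbcCommand(
            "INSERT INTO user_comments (UserID, PhotoID, CommentText) VALUES (?, ?, ?)", conn);
        cmd.Parameters.Add("@UserID", OdbcType.Int).Value = userId;
        cmd.Parameters.Add("@PhotoID", OdbcType.Int).Value = photoId;
        cmd.Parameters.Add("@CommentText", OdbcType.VarChar).Value = commentText;
        cmd.ExecuteNonQuery();
    }

    // Write #2: the copy of the comment stored alongside the photo. If this write fails
    // (e.g. the photos DB server was just rebooted) the two shards are now inconsistent;
    // since there is no cross-shard transaction to lean on, the application needs its own
    // retry or cleanup logic, which is exactly the development cost described above.
    using (OdbcConnection conn = new OdbcConnection(photoConnString))
    {
        conn.Open();
        OdbcCommand cmd = new OdbcCommand(
            "INSERT INTO photo_comments (UserID, PhotoID, CommentText) VALUES (?, ?, ?)", conn);
        cmd.Parameters.Add("@UserID", OdbcType.Int).Value = userId;
        cmd.Parameters.Add("@PhotoID", OdbcType.Int).Value = photoId;
        cmd.Parameters.Add("@CommentText", OdbcType.VarChar).Value = commentText;
        cmd.ExecuteNonQuery();
    }
}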

 


Now Playing: The Kinks - You Really Got Me


 

Categories: Web Development

Bill de hÓra has a blog post entitled Format Debt: what you can't say where he writes

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases.

I've always found this particular argument by RDF proponents to be suspect. When I complained about the lack of standards for representing rich media in Atom feeds, the thrust of my complaint was that you can't just plug a feed from Picasa into a service that understands how to process feeds from Zooomr without making changes to the service or the input feed.

RDF proponents often argue that if we all used RDF-based formats then, instead of having to change your code to support every new photo site's Atom feed with custom extensions, you could create a mapping from the format you don't understand to the one you do using something like the OWL Web Ontology Language. The problem with this argument is that there is already a declarative approach to mapping between XML data formats that doesn't require boiling the ocean by convincing everyone to switch to RDF: XSL Transformations (XSLT).

The key problem in both cases (i.e. mapping with OWL vs. mapping with XSLT) is that Picasa feeds still won't work with an app that understands Zooomr's feeds until some developer writes code. Thus we're really debating whether it is cheaper to have the developer write declarative mappings like OWL or XSLT instead of writing new parsing code in their language of choice.

In my experience, I've seen that creating a software system where you can drop in an XSLT, OWL or other declarative mapping document to deal with new data formats is cheaper and likely to be less error prone than having to alter parsing code written in C#, Python, Ruby or whatever. However, we don't need RDF or other Semantic Web technologies to build such a solution today. XSLT works just fine as a tool for solving exactly that problem.
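
For what it's worth, the plumbing for this kind of drop-in mapping already ships in the .NET Framework via System.Xml.Xsl. Here's a minimal sketch; the stylesheet and file names are made up for the example.

using System.Xml.Xsl;

// Hypothetical example: normalize a Picasa-style feed into the format the application
// already understands by applying a drop-in XSLT mapping, then hand the result to the
// existing feed parsing code unchanged. File names below are illustrative only.
XslCompiledTransform mapping = new XslCompiledTransform();
mapping.Load("picasa-to-zooomr.xslt");                        // the declarative mapping document
mapping.Transform("picasa-feed.xml", "normalized-feed.xml");  // no changes to parsing code required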

Now Playing: Lady GaGa & Colby O'Donis - Just Dance


 

Categories: Syndication Technology | XML

It looks like I'll be participating in two panels at the upcoming SXSW Interactive Festival. The descriptions of the panels are below

  1. Feed Me: Bite Size Info for a Hungry Internet

    In our fast-paced, information overload society, users are consuming shorter and more frequent content in the form of blogs, feeds and status messages. This panel will look at the social trends, as well as the technologies, that make feed-based communication possible. Led by Ari Steinberg, an engineering manager at Facebook who focuses on the development of News Feed.

  2. Post Standards: Creating Open Source Specs

    Many of the most interesting new formats on the web are being developed outside the traditional standards process; Microformats, OpenID, OAuth, OpenSocial, and originally Jabber — four out of five of these popular new specs haven't been standardized by the IETF, OASIS, or W3C. But real hackers are bringing their implementations to projects ranging from open source apps all the way up to the largest companies in the technology industry. While formal standards bodies still exist, their role is changing as open source communities are able to develop specifications, build working code, and promote it to the world. It isn't that these communities don't see the value in formal standardization, but rather that their needs are different from what formal standards bodies have traditionally offered. They care about ensuring that their technologies are freely implementable and are built and used by a diverse community where anyone can participate based on merit and not dollars. At OSCON last year, the Open Web Foundation was announced to create a new style of organization that helps these communities develop open specifications for the web. This panel brings together community leaders from these technologies to discuss the "why" behind the Open Web Foundation and how they see standards bodies needing to evolve to match lightweight community driven open specifications for the web.

If you'll be at SxSw and are a regular reader of my blog who would like to chat in person, feel free to swing by during one or both panels. I'd also be interested in what people who plan to attend either panel would like to get out of the experience. Let me know in the comments.

Now Playing: Estelle - American Boy (feat. Kanye West)


 

Categories: Trip Report

Angus Logan has the scoop

I’m in San Francisco at the 2008 Crunchie Awards and after ~ 350k votes were cast Ray Ozzie and David Treadwell accepted the award for Best Technology Innovation/Achievement on behalf of the Live Mesh team.

The Crunchies are an annual competition co-hosted by GigaOm, VentureBeat, Silicon Alley Insider, and TechCrunch which culminates in awards for the most compelling startup, internet and technology innovations.

Kudos to the Live Mesh folks on getting this award. I can't wait to see what 2009 brings for this product.

PS: I noticed from the TechCrunch post that Facebook Connect was the runner up. I have to give an extra special shout out to my friend Mike for being a key figure behind two of the most innovative technology products of 2008. Nice work man.


 

Categories: Windows Live