I spent the last few days hacking on a side project that I thought some of my readers might find interesting. You can find it at http://hottieornottie.cloudapp.net
I had several goals when embarking on this project.
After a few days of hacking I'm glad to say I've achieved every goal I wanted to get out of this experiment. I'd like to thank Matt Cutts for the initial idea on how to implement this and Kevin Marks for saving me from having to write a Twitter crawler by reminding me of Google's Social Graph API.
The search experiment provides four kinds of searches:
The search functionality with no options checked is exactly the same as search.twitter.com.
Checking "Search Near Me" finds all tweets posted by people who are within 30 miles of your geographical location (requires JavaScript). Your geographical location is determined from your IP address, while the geographical location of the tweets is determined from the location fields of the Twitter profiles of the authors. It's a nice way to find out what people in your area are thinking about local news.
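The distance check at the heart of this feature is simple once the coordinates are in hand. Here's a minimal sketch in Python (not the ASP.NET code the app actually runs); it assumes the IP address and profile locations have already been resolved to latitude/longitude pairs, and the field names are mine:

```python
import math

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two points."""
    earth_radius_miles = 3959.0
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

def near_me(tweets, my_lat, my_lon, radius_miles=30):
    """Keep only tweets whose author's location falls within the radius."""
    return [t for t in tweets
            if miles_between(my_lat, my_lon, t["lat"], t["lon"]) <= radius_miles]
```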
Checking "Sort By Follower Count" is my attempt to jump on the authority-based Twitter search bandwagon. I don't think it's very useful but it was easy to code. Follower counts are obtained via the Google Social Graph API.
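The sorting step itself is a one-liner once the counts are in hand. A sketch (the field name and the shape of the counts dictionary are my assumptions, not the actual code; fetching the counts from the Social Graph API isn't shown):

```python
def sort_by_follower_count(results, follower_counts):
    """Order search results by the author's follower count, highest first.
    `follower_counts` maps a Twitter username to a count; authors with no
    known count sort last."""
    return sorted(results,
                  key=lambda r: follower_counts.get(r["from_user"], 0),
                  reverse=True)
```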
Checking "Limit to People I Follow" requires you to also specify your user name, and then all search results are filtered to only return results from people you follow (requires JavaScript). This feature only works for a small subset of Twitter users that have been encountered by a crawler I wrote. The application is crawling Twitter friend lists as you read this, and anyone I follow should already have their friend list crawled. If it doesn't work for you, check back in a few days. It's been slow going since Twitter puts a 100-requests-per-hour cap on crawlers.
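A cap like that makes a client-side rate limiter the natural pattern for a crawler. Here's one possible sliding-window sketch (an illustration, not the crawler's actual code); the clock and sleep functions are injectable purely so the behavior can be tested without waiting an hour:

```python
import collections
import time

class HourlyRateLimiter:
    """Allows at most `limit` calls in any rolling `window` seconds,
    sleeping until a slot frees up when the cap is reached."""
    def __init__(self, limit=100, window=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.stamps = collections.deque()

    def acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            # Sleep until the oldest request ages out, then retry.
            self.sleep(self.window - (now - self.stamps[0]))
            return self.acquire()
        self.stamps.append(now)
```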
After building a small-scale application with Windows Azure, there are definitely a number of things I like about the experience. The number one thing I loved was the integrated deployment story with Visual Studio. I can build a regular ASP.NET application on my local machine that uses either cloud or local storage resources, and all it takes is a few mouse clicks to go from my code running on my machine to my code running on computers in Microsoft's data center, either in a staging environment or in production. The fact that the data access APIs are all RESTful makes it super easy to point the application at cloud storage or at local storage on your machine simply by changing some base URIs in a configuration file.
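To make the base-URI switch concrete, here's a toy sketch in Python with made-up config values (the real app uses .NET configuration files, so treat this purely as an illustration of the idea). The local endpoint shown is the development storage's blob endpoint; the account name in the cloud URI is a placeholder:

```python
from configparser import ConfigParser

# Hypothetical config fragments: the application code never hard-codes a
# storage endpoint, it just reads a base URI that points at either the
# local development storage or the cloud account.
LOCAL_CONFIG = """
[storage]
blob_base_uri = http://127.0.0.1:10000/devstoreaccount1
"""

CLOUD_CONFIG = """
[storage]
blob_base_uri = http://myaccount.blob.core.windows.net
"""

def blob_url(config_text, container, blob_name):
    """Resolve a blob's full URL from whichever config is active."""
    cfg = ConfigParser()
    cfg.read_string(config_text)
    return "%s/%s/%s" % (cfg["storage"]["blob_base_uri"], container, blob_name)
```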
Another aspect of Windows Azure that I thought was great is how easy it is to create background processing tasks. It was very straightforward to create a Web crawler that crawled Twitter to build a copy of its social graph by simply adding a "Worker Role" to my project. I've criticized Google App Engine in the past for not supporting the ability to create background tasks so it is nice to see this feature in Microsoft's platform as a service offering.
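At its core the worker-role pattern is a poll-process-delete loop over a queue. The sketch below uses an in-memory stand-in for the queue client (all names are mine); the important detail is that the message is only deleted after it has been handled, so a crash mid-processing leaves it in the queue:

```python
class InMemoryQueue:
    """Stand-in for a cloud queue client with get/delete semantics."""
    def __init__(self, items):
        self.items = list(items)
        self.deleted = []

    def get(self):
        return self.items[0] if self.items else None

    def delete(self, msg):
        self.items.remove(msg)
        self.deleted.append(msg)

def drain(queue, handle):
    """Worker-role style loop: pop a message, process it, and only then
    delete it from the queue."""
    while True:
        msg = queue.get()
        if msg is None:
            break  # a long-running worker would sleep here and poll again
        handle(msg)
        queue.delete(msg)
```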
The majority of my negative experiences were related to teething problems I'd associate with this being a technology preview that still needs polishing. I hit a rather frustrating bug where half the time I tried to run my application it would end up hanging and I'd have to try again after several minutes. There were also issues with the Visual Studio integration where removing or renaming parts of the project from the Visual Studio UI didn't modify all of the related configuration files, so the app was in a broken state until I mended it by hand. Documentation was another place where there is still a lot of work to do. My favorite head-scratching moment is that there is an x-ms-Metadata-ApproximateMessagesCount HTTP header which returns the approximate number of messages in a queue. It is unclear whether "approximate" here refers to the fact that messages in the queue have an "invisibility period" between when they are popped from the queue and when they are deleted, during which they can't be accessed, or whether it refers to some other heuristic that determines the size of the queue. Then there's the fact that the documentation says you need to have a partition key and row key for each entry you place in a table but doesn't really explain why or how you are supposed to pick these keys. In fact, the documentation currently makes it seem like the notion of partition keys is an example of unnecessarily surfacing implementation details of Windows Azure to developers in a way that leads to confusion and cargo cult programming.
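For what it's worth, here is one keying convention that seems plausible to me for tweet-like data; I'm guessing at the rationale since the docs don't spell it out, and both the entity shape and the constants below are mine:

```python
MAX_TICKS = 10 ** 19  # upper bound used to invert the timestamp

def tweet_entity(username, posted_at_ticks, text):
    """One plausible keying convention (my guess, not from the docs):
    partition on the author so one user's tweets stay together, and use an
    inverted zero-padded timestamp as the row key so that an ascending scan
    within the partition returns the newest tweets first."""
    return {
        "PartitionKey": username,
        "RowKey": str(MAX_TICKS - posted_at_ticks).zfill(20),
        "Text": text,
    }
```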
One missing piece is the lack of good tools for debugging your application once it is running in the cloud. When it is running on your local machine there is a nice viewer to keep an eye on the log output from your application, but once it is in the cloud, your only option is to have the logs dropped to some directory in the cloud and then run one of the code samples to access those logs from your local machine. Since this is a technology preview, it is expected that the tooling won't all be there yet, but it is a cumbersome process as it exists today. Besides accessing your debug output, there is also the matter of seeing what data your application is actually creating, retrieving and otherwise manipulating in storage. You can use SQL Server Management Studio to look at your data in Table Storage on your local machine, but there isn't a similar experience in the cloud. Neither blob nor queue storage has any off-the-shelf tools for inspecting their contents locally or in the cloud, so developers have to write custom code by hand. Perhaps this is somewhere the developer community can step up with some Open Source tools (e.g. David Aiken's Windows Azure Online Log Reader) or perhaps some commercial vendors will step in as they have in the case of Amazon's Web Services (e.g. RightScale)?
Outside of the polish issues and bugs, there was only one aspect of Windows Azure development I disliked: the structured data/relational schema development process. Windows Azure has a Table Storage API which provides a RESTful interface to a row-based data store similar in concept to Google's BigTable. Trying to program locally against this API is rather convoluted and requires writing your classes first, then running some object<->relational translation tools on your assemblies. This gripe is probably a consequence of my not being a big believer in the use of ORM tools, so having to first write objects before I can access my DB seems backwards to me. It may just be a matter of preference, since a lot of folks who use Rails, Django and various other ORM technologies seem fine with having primarily an object facade over their databases.
Update: Early on in my testing I got a "The requested operation is not implemented on the specified resource" error when trying out a batch query and incorrectly concluded that the Table Storage API did not support complex OR queries. It turns out that the problem was that I was doing a $filter query using the tolower function. Once I took out the tolower() call, it was straightforward to construct queries with a bunch of OR clauses so I could request multiple row keys at once.
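For anyone hitting the same wall, the shape of the working filter expression looks like this. The sketch builds the $filter string with a chain of RowKey OR clauses AND-ed with a single partition key (quoting/escaping of key values is elided):

```python
def multi_rowkey_filter(partition_key, row_keys):
    """Builds a $filter expression that fetches several rows from one
    partition in a single query by OR-ing together RowKey comparisons."""
    ors = " or ".join("(RowKey eq '%s')" % k for k in row_keys)
    return "(PartitionKey eq '%s') and (%s)" % (partition_key, ors)
```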
I'll file this under "documentation issues" since there is a list of unsupported LINQ query operators and unsupported LINQ comparison operators, but not a list of unsupported query expression functions in the Table Storage API documentation. Sorry about any confusion and thanks to Jamie Thomson for asking about this so I could clarify.
Besides the ORM issue, I felt that I was missing some storage capabilities when trying to build my application. One of the features I started building before going with the Google Social Graph API was a quick way to provide the follower counts for a batch of users. For example, I'd get 100 search results from the Twitter API and would then need to look up the follower counts of each user that showed up in the results for use in sorting. However there was no straightforward way to implement this lookup service in Windows Azure. Traditionally, I'd have used one of the following options
Neither of these options is possible given the three data structures that Windows Azure gives you. It could be that these missing pieces are intended to be provided by SQL Data Services, which I haven't taken a look at yet. If not, the lack of this functionality will stick in the craw of developers making the switch from traditional Web development platforms.
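The operation I needed is trivial to state; the pain was that none of the three storage types made the batched version cheap. A sketch of the desired shape (names are mine, and `counts_store` stands in for whatever keyed store holds the follower counts):

```python
def batch_follower_counts(usernames, counts_store):
    """Resolve follower counts for every author in a page of search
    results in one logical lookup; unknown users default to zero."""
    return {u: counts_store.get(u, 0) for u in usernames}
```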
Now Playing: Geto Boys - Gangsta (Put Me Down)