Google Base Data API vs. Astoria: Two Approaches to SQL-like Queries in a RESTful Protocol

July 13, 2007

@ 04:18 AM

I should probably start out by pointing out that the title of this post is a lie. By definition, RESTful protocols can not be truly SQL-like because they depend on Uniform Resource Identifiers (URIs aka URLs) for identifying resources. URIs on the Web are really just URLs and URLs are really just hierarchical paths to a particular resource similar to the paths on your local file system (e.g. /users/mark/bobapples, A:\Temp\car.jpeg). Fundamentally URIs identify a single resource or aset of resources. On the other hand, SQL is primarily about dealing with relational data which meansyou write queries that span multiple tables (i.e. resources). A syntax for addressing single resources (i.e. URLs/URIs) is fundamentally incompatible with a query language that operates over multiple resources. This was one ofthe primary reasons the W3C created XQuery even though we already had XPath.

That said, being able to perform sorting, filtering, and aggregate operations over a single set of resources via a URI is extremely useful and is a fundamental aspect of the Web today. As Sam Ruby points out in his blog post Etymology, a search results page is fundamentally RESTful even though its URI identifies a query as opposed to a specific resource or set of resources [although you could get meta and say it identifies the set of resources that meet your search criteria].

Both Google's Google Base data API and Microsoft's Project Astoria are RESTful protocols for performing sorting, filtering and aggregate operations similar to what you find in SQL over a hierarchical set of resources. What follows is an overview of the approaches taken by both protocols.

Filtering Results using Predicates (include supported operators and Google's full text option)

Although Astoria provides an abstraction over relational data, it does so in a way that supports the hierarchical nature of HTTP URIs. The primary resource also known as an entity set is placed at the root of the hierarchy (e.g. the set of allmy customers) and each relationship to another set of resources is treated as anotherlevel in the hierarchy (e.g. each customer's orders). Each step in the hierarchy can be filtered using a predicate. Below are some query URLs and the results they return

Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas
Results: All content areas in the Encarta online encyclopedia
Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas[name eq 'Geography']
Results: The content area whose name is 'Geography'
Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas[name eq 'Geography']/Articles
Results: All articles for the content area whose name is 'Geography'
Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas[name eq 'Geography']/Articles[Title eq 'Zimbabwe']
Results: The article with the title 'Zimbabwe' from the 'Geography' content area
Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas[name eq 'Geography']/Articles[761569370]
Results: The article from the 'Geography' content area with the ID 761569370.

Query: http://astoria.sandbox.live.com/northwind/northwind.rse/Customers[City eq 'London']/Orders[Freight gteq 100]
Results: All orders shipped to customers from the city of London who paid $100 or more in freight charges

Google Base does not treat the data within it as a hierarchy. Instead filters/predicates can be applied to one of two Atom feeds; http://www.google.com/base/feeds/snippets and http://www.google.com/base/feeds/items which represent all the items within Google Base and all the items a specific user has stored within Google Base respectively. The latter URL requires the HTTP request to be authenticated.

The first way one can filter results from a Google Base feed is by placing one or more categories as part of the path component. For example

Query: http://www.google.com/base/feeds/snippets/-/hotels
Results: All items from the 'hotels' category within Google Base

Query: http://www.google.com/base/feeds/snippets/-/jobs|personals
Results: All items from the 'jobs' or the 'personals' category within Google Base

Query: http://www.google.com/base/feeds/snippets/-/-recipes
Results: All items in Google Base except those from the 'recipes' category

The second way is to filter results in a Google Base feed is by performing a full text query using the q query parameter. For example,

Query: http://www.google.com/base/feeds/snippets?q=atlanta
Results: All items within Google Base that contain the string 'atlanta' in one of their fields

Query: http://www.google.com/base/feeds/snippets/-/hotels?q=atlanta
Results: All items from the 'hotels' category within Google Base that contain the string 'atlanta' in any of their fields

Query: http://www.google.com/base/feeds/snippets/-/hotels|housing?q=seattle|atlanta
Results: All items from the 'hotels' or 'housing' categories within Google Base that contain the string 'seattle' or 'atlanta' in any of their fields

Query: http://www.google.com/base/feeds/snippets/-/hotels?q=washington+-dc
Results: All items from the 'hotels' category within Google Base that contain the string 'washington' but not the string 'dc'

The final way to filter results from Google Base feed is by applying a predicate on a field of the item using the bq query parameter. For example

Query: http://www.google.com/base/feeds/snippets/-/hotels?bq=[location:seattle]
Results: All items from the 'hotels' category that have 'seattle' in their location field

Query: http://www.google.com/base/feeds/snippets/-/hotels?bq=[location:seattle]&q=ramada
Results: All items from the 'hotels' category that have 'seattle' in their location field and 'ramada' in any of their other fields

Query: http://www.google.com/base/feeds/snippets/-/-hotels?bq=[location:@"1 Microsoft Way, Redmond, WA, USA" + 5mi]
Results: All items from the 'hotels' category whose location is within 5 miles of "1 Microsoft Way, Redmond, WA, USA"

Query: http://www.google.com/base/feeds/snippets/-/products?q=zune&bq=[price(float USD)>=250.0 USD]
Results: All items from the 'products' category whose price is greater than $250.00 and have 'zune' in one of their fields

Supported Datatypes, Operators and Functions

As can be seen from the previous examples, both the Google Base data API and Astoria support operations on fields searching for string matches. Astoria supports the major SQL datatypes, a list of which can be obtained from the table describing the System.Data.SqlTypes namespace in the .NET Frameworks. The operations that can be performed on the various fields of an entity are the following comparisons

Operator	Description
eq	Equal
ne	Not equal
gt	Greater than
gteq	Greater than or equal
lt	Less than
lteq	Less than or equal

The list of datatypes supported by Google Base is provided in the Google Base data API documentation topic on Attribute Types. In addition to the comparison operators supported by Astoria, the Google Base data API also supports

Operator	Description
@"..." + Xmi	Convert string in quotes to a geocoded location and match anything that is within a radius of X miles/kilometers/meters around it depending on the unit of distance specified
name(type):X..Y	Test whether the value of the named attribute [and optional type] falls between X and Y (e.g. [event date range:2007-05-20..2007-05-25] matches all events which fall between both dates)
date-range << date-range	Test if the date on the right hand side is a subset of the date range on the left hand side
if boolean-expression then expression else expression	Works like an if...else statement in every programming language you've ever used.

In addition to these operators, it turns out that the Google Base data API also support a full blown expression language for use within predicates. This includes a library of over 20 functions from math functions like sin and cos to aggregation functions like sum and count as well as more esoteric functions like dist, exists and join. Below are some queries which use these operators and functions in action

Query: http://www.google.com/base/feeds/snippets?q=sale&bq=[item type:vehicles][location(location)]
&orderby=[x=location(location):neg(min(dist(x,@'Seattle,WA')))]
Results: All vehicles whose listing contain the text 'sale' and orders results by those that are geographically closest to the city of Seattle, WA

Query: http://www.google.com/base/feeds/snippets/-/events?bq=[event date range:2007-05-20..2007-05-25]
Results: All events that fall between May 20th 2007 and May 25th 2007

Sorting

Sorting query results is often necessary when working with large amounts of data. Both Google Base data API and Astoria provide a way to indicate that the results should be sorted based on one or more fields. In Astoria, sorting is done using the $orderby query parameter. For example,

Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas?$orderby=Name
Results: All areas in the Encarta encyclopedia sorted alphabetically by their Name

Query: http://astoria.sandbox.live.com/northwind/northwind.rse/Orders?$orderby=OrderDate desc
Results: All customer orders sorted in descending order by order date

Query: http://astoria.sandbox.live.com/northwind/northwind.rse/Orders?$orderby=RequiredDate,Freight
Results: All customer orders sorted by the required date and the cost of freight

The Google Base data API uses the orderby and sortorder query parameters to control sorting and sort order respectively. Examples are shown below

Query: http://www.google.com/base/feeds/snippets/-/jobs?q=program+manager&orderby=salary(int)
Results: All job listings containing the string 'program manager' sorted by the salary field

Query: http://www.google.com/base/feeds/snippets?q=sale&bq=[item type:vehicles][location(location)]
&orderby=[x=location(location):neg(min(dist(x,@'Cupertino,CA')))]
Results: All vehicles whose listing contain the text 'sale' and orders results by those that are geographically closest to the city of Seattle, WA

Query: http://www.google.com/base/feeds/snippets/-/housing?bq=[location:@"1 Microsoft Way, Redmond, WA, USA" + 5mi]&orderby=[x=bedrooms(int): if exists(x) then max(x) else 0]
Results: All items within the 'housing' category that are within 5 miles of Microsoft's headquarters sorted by number of bedrooms. For items that don't have a bedrooms element use the value 0 when sorting

Paging

When dealing with large numbers of items, it often isn't feasible to return all of them in a single XML document for a variety of reasons. Both the Google Base data API and Astoria provide mechanisms to retrieve results as multiple "pages".

In Astoria, this is done using a combination of the top and skip query parameters which indicate the number of items to return and what item to start the list from respectively. Examples below

Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas?$orderby=Name&$top=3
Results: All areas in the Encarta encyclopedia sorted alphabetically by their Name, restricted to only showing 3 items per page

Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas?$orderby=Name&$top=3&$skip=1
Results: All areas in the Encarta encyclopedia sorted alphabetically by their Name starting from the second item, restricted to only showing 3 items per page

The Google Base data API uses the max-results and start-index query parameters to indicate the number of items to return and what item to start the list from respectively. The default value of max-results is 25 while its maximum value is 250. The total number of results is emitted in the returned feed as the element openSearch:totalResults. Examples below

Query: http://www.google.com/base/feeds/snippets/-/hotels?bq=[location:seattle]&max-results=10
Results: All hotels within the seattle area, restricted to 10 results per page

Query: http://www.google.com/base/feeds/snippets/-/hotels?bq=[location:seattle]&max-results=50&start-index=100
Results: All hotels within the seattle area, restricted to 50 results per page starting from the hundredth result

Astoria Specific Features

Using links within items the describe relationships is a core aspect of a RESTful protocol and is utilized by Astoria to show the foreign key relationships between rows/entities in the data base. However it can be cumbersome to have to make multiple requests and follow every link to get all the content related to an item. For this reason, Astoria includes the $expand query parameter which automatically follows the links and retrieves the XML inline. Compare the following queries

Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas[Name eq 'History']/Articles[Title eq 'Civil War, American']
Results: The encyclopedia article on the American Civil War which has links to its Area, ArticleBody, Notes and RelatedArticles

Query: http://astoria.sandbox.live.com/encarta/encarta.rse/Areas[Name eq 'History']/Articles[Title eq 'Civil War, American']?$expand=ArticleBody,Notes,RelatedArticles
Results: The encyclopedia article on the American Civil War with its Area, ArticleBody and Notes shown inline as XML elements

GData Specific Features

Google Base has a notion of adjusted query results. When this feature is enabled, Google Base will automatically use spelling correction, stemming and other tricks to try and match results. For example, if you perform a search for the value "female" in the gender field of an item, the query adjustment engine will know to also match the value "f" in the gender field of any corresponding items. The query adjustment engine applies its heuristics on queries for field names, field values and item categories. The documentation is contradictory as to whether this feature is enabled by default or has to be specifically enabled by the user of the API.

Another interesting feature, is that the Google Base data API allows one to filter out repetitive results using a feature called "crowding". With this feature, limits can be placed on how many results that match a certain criteria should be returned. See the following examples for details

Query: http://www.google.com/base/feeds/snippets/-/restaurants?crowdby=cuisine(text):2
Results: Return all restuarants stored within Google Base but show no more than 2 per cusine type

NOTE: I actually couldn't get this feature to work using either the example queries from the documentation or queries I constructed. It is quite possible that this feature doesn't work but is so esoteric that no one has noticed.

Conclusion

In comparing both approaches there is a lot to like and dislike. I like the "expand" feature in Astoria as well as the fact that I can retrieve XML results from multiple paths of the hierarchy. However there does seem to be a paucity of operators and functions for better filtering of results.

From the Google Base data API, I love the "crowd" feature and having a full library of functions for performing tests within predicates. Also some of the operators such as the ones for finding results near a certain location are quite impressive although unnecessary for the majority of RESTful protocols out there. That said, I do think they went overboard on some of the features such as having if...else blocks within the URIs. I suspect that some of that complexity wouldn't have been needed if they just had hierarchies instead of a flat namespace that requires complex filtering to get anything out of it.

Categories: Web Development | XML Web Services

« Live Search for Mobile 2.0 is Available | Home | Facebook is the next Broadcast.com »

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson

Navigation for Google Base Data API vs. Astoria: Two Approaches to SQL-like Queries in a RESTful Protocol - Dare Obasanjo's weblog