A user of RSS Bandit recently forwarded me a discussion on the atom-syntax mailing list which criticized some of our design decisions. In an email in the thread entitled Reader 'updated' semantics Tim Bray wrote

On Jan 10, 2006, at 9:07 AM, James M Snell wrote:

In RSS there is definite confusion on what constitutes an update. In
Atom it is very clear. If <updated> changes, the item has been updated.
No controversy at all.

Indeed. There's a word for behavior of RssBandit and Sage: WRONG. Atom provides a 100% unambiguous way to say "this is the same entry, but it's been changed and the publisher thinks the change is significant." Software that chooses to hide this fact from users is broken - arguably dangerous to those who depend on receiving timely and accurate information - and should not be used. -Tim

People who write technology specifications often have good intentions, but unfortunately they frequently aren't implementers of the specs they create. This leads to disconnects between reality and what is actually in the spec.

The problem with updates to blog posts is straightforward. There are minor updates which don't warrant signalling to the user, such as typos being fixed (e.g. "12 of 13 miner survive mine collapse" changed to "12 of 13 miners survive mine collapse"), and those which do because they significantly change the story (e.g. "12 of 13 miners survive mine collapse" changed to "12 of 13 miners killed in mine collapse"). 

James Snell is right that it is ambiguous how to detect this in RSS but not in Atom due to the existence of the atom:updated element. The Atom spec states

The "atom:updated" element is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value.

On paper this sounds like it solves the problem, and on paper it does. However, for this to work correctly, weblog software now needs to include an option such as 'Indicate that this change is significant' when users edit posts. Without such an option, the software cannot correctly support the atom:updated element. Since I haven't found any mainstream tools that support this functionality, I haven't bothered to implement a feature which is more likely to annoy users than be useful, since many people edit their blog posts in ways that don't warrant alerting their readers.
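For what it's worth, the detection logic itself is trivial once publishers populate atom:updated honestly. A minimal sketch (the feed below is made up, and a real aggregator would persist the timestamps between polls):

```python
# Sketch: flag entries whose atom:updated changed since the last poll.
# Per the Atom spec, a changed atom:updated value is the publisher's
# signal that the modification is significant.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def significant_updates(feed_xml, last_seen):
    """Return the ids of entries whose atom:updated differs from the
    value recorded in last_seen (a dict of atom:id -> atom:updated)."""
    changed = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if entry_id in last_seen and last_seen[entry_id] != updated:
            changed.append(entry_id)
    return changed

feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:example.org,2006:1</id>
    <updated>2006-01-11T09:00:00Z</updated>
  </entry>
</feed>"""

# The stored timestamp differs, so the entry should be flagged.
print(significant_updates(feed, {"tag:example.org,2006:1": "2006-01-10T09:00:00Z"}))
```

The hard part isn't this comparison; it's that without an authoring-side option, publishers have no way to keep atom:updated honest in the first place.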

However I do plan to add features for indicating when posts have changed in unambiguous scenarios such as when new comments are added to a blog post of interest to the user. The question I have for our users is how would you like this indicated in the RSS Bandit user interface?


 

Categories: RSS Bandit

January 20, 2006
@ 02:27 AM

Richard Searle has a blog post entitled The SOAP retrieval anti-pattern where he writes

I have seen systems that use SOAP based Web Services only to implement data retrievals.

The concept is to provide a standardized mechanism for external systems to retrieve data from some master system that controls the data of interest. This has value in that it enforces a decoupling from the master system's data model. It can also be easier to manage and control than the alternative of allowing the consuming systems to directly query the master system's database tables.
...
The selection of a SOAP interface over a RESTful interface is also questionable. The SOAP interface has a few (generally one) parameters and then returns a large object. Such an interface with a single parameter has a trivial representation as a GET. A multi-parameter call can also be trivially mapped if the parameters define a conceptual hierarchy (e.g. the ids of a company and one of its employees).

Such a GET interface avoids all the complexities of SOAP, WSDL, etc. AJAX and XForm clients can trivially and directly use the interface. A browser can use XSLT to provide a human readable representation.

Performance can easily be boosted by interposing a web cache. Such optimization would probably occur essentially automatically since any significant site would already have caching. Such caching can be further enhanced by using the HTTP header timestamps to compare against the updated timestamps in the master system tables.

I agree 100%; web services that use SOAP solely for data retrieval are usually a sign that the designers of the service need to get a clue when it comes to building distributed applications for the Web.
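To make Searle's point concrete, here's a sketch of how such a retrieval call collapses into a plain GET. The company/employee hierarchy is his example; the /companies/.../employees/... URL scheme is invented for illustration:

```python
# Sketch: mapping a one- or two-parameter retrieval onto a hierarchical
# GET URL instead of a SOAP operation that takes the same parameters.
def retrieval_url(base, company_id, employee_id=None):
    """Map the conceptual hierarchy (company, then employee) onto a URL."""
    url = f"{base}/companies/{company_id}"
    if employee_id is not None:
        url += f"/employees/{employee_id}"
    return url

print(retrieval_url("http://example.com", 42))
print(retrieval_url("http://example.com", 42, 7))
```

Once retrievals are plain GETs, the caching Searle mentions comes almost for free: the service compares the If-Modified-Since header a cache sends against the row's updated timestamp in the master tables and answers 304 Not Modified when nothing changed.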

PS: I realize that my employer has been guilty of this in the past. In fact, we've been known to do this at MSN as well although at least we also provided RESTful interfaces to the service in that instance. ;)


 

Categories: XML Web Services

Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought on the appropriateness of using microformats over XML formats. The real-world test case I use when thinking about choosing microformats over XML is whether, instead of having an HTML web page for my blog and an Atom/RSS feed, I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. To me this seems like a gross hack, but I've seen lots of people comment on how this seems like a great idea to them. Given that I hadn't encountered universal disdain for this idea, I decided to explore further and look for technical arguments for and against both approaches.

I found quite a few discussions on how and why microformats came about in articles such as The Microformats Primer in Digital Web Magazine and Introduction to Microformats in the Microformats wiki. However, I hadn't seen many in-depth technical arguments for why they were better than XML formats until recently. 

In a comment in response to my Microformats vs. XML: Was the XML Vision Wrong? post, Mark Pilgrim wrote

Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons.

http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication

I had several basic arguments:

1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all.

2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon.

3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatedly steal content from full-content feeds and republish them (with ads) has only intensified this debate.

4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time.

I tend to agree with Mark's conclusions. The main issue with using microformats for syndication instead of RSS/Atom feeds is wasted bandwidth since web pages tend to contain more stuff than feeds and change more often.
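Mark's suggestion in point 4, encoding the machine-readable date in the title attribute of the human-readable one, is at least mechanical to consume. A sketch using made-up markup (the "published" class name follows microformat conventions):

```python
# Sketch: pull machine-readable dates out of
# <abbr class="published" title="..."> elements, per the
# microformats abbr pattern Mark describes.
from html.parser import HTMLParser

class DateExtractor(HTMLParser):
    """Collect the title attribute of every abbr marked 'published'."""
    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "abbr" and "published" in a.get("class", "").split():
            self.dates.append(a.get("title"))

page = ('<p>Posted <abbr class="published" '
        'title="2006-01-20T02:27:00Z">last Friday</abbr></p>')
extractor = DateExtractor()
extractor.feed(page)
print(extractor.dates)
```

The human reader sees "last Friday" while the aggregator gets an unambiguous timestamp, which is exactly the dual-audience trick microformats rely on.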

Norm Walsh raises a few other good points on the trade offs being made when choosing microformats over XML in his post Supporting Microformats where he writes

Microformats (and architectural forms, and all the other names under which this technique has been invented) take this one step further by standardizing some of these attribute values and possibly even some combination of element types and attribute values in one or more content models.

This technique has some stellar advantages: it's relatively easy to explain and the fallback is natural and obvious, new code can be written to use this “extra” information without any change being required to existing applications, they just ignore it.

Despite how compelling those advantages are, there are some pretty serious drawbacks associated with microformats as well. Adding hCalendar support to my itineraries page reinforced several of them.

  1. They're not very flexible. While I was able to add hCalendar to the overall itinerary page, I can't add it to the individual pages because they don't use the right markup. I'm not using <div> and <span> to markup the individual appointments, so I can't add hCalendar to them.

  2. I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.

  3. They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron (and other rule-based validation languages) can do it, but most of us are used to using grammar-based validation on a daily basis and we're likely to forget the extra step of running Schematron validation.

    It's interesting that RELAX NG can almost, but not quite, do it. RELAX NG has no difficulty distinguishing between two patterns based on an attribute value, but you can't use those two patterns in an interleave pattern. So the general case, where you want to say that the content of one of these special elements is “an <abbr> with class="dtstart" interleaved with an <abbr> with class="dtend" interleaved with…”, you're out of luck. If you can limit the content to something that doesn't require interleaving, you can use RELAX NG for your particular application, but most of the microformats I've seen use interleaving in the general case.

    Is validation really important? Well, I have well over a decade of experience with markup languages at this point and I was reminded just last week that I can't be relied upon to write a simple HTML document without markup errors if I don't validate it. If they can't be validated, they will often be incorrect.

The complexity of validating microformats isn't something I'd considered in my original investigation but is a valid point. As a developer of an RSS aggregator, I've found the existence of the Feed Validator to be an immense help in tracking down issues. Not having the luxury of being able to validate feeds would make building an aggregator a lot harder and a lot less fun. 
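Norm's point about rule-based validation is worth illustrating. A Schematron-style rule such as "every vevent must contain a dtstart" can't be expressed by a grammar keyed on element names, but it is easy as an ad-hoc check. The sketch below assumes events don't nest; the class names are hCalendar's, the rule set is a made-up minimal example:

```python
# Sketch: a Schematron-style rule check over microformat class values,
# something grammar-based validators (DTD, W3C XML Schema) can't express.
# Assumes vevents are not nested.
from html.parser import HTMLParser

class HCalendarChecker(HTMLParser):
    """Rule: every element with class 'vevent' must contain a 'dtstart'."""
    def __init__(self):
        super().__init__()
        self.events = []   # one flag per vevent: dtstart seen yet?

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "vevent" in classes:
            self.events.append(False)
        elif "dtstart" in classes and self.events:
            self.events[-1] = True

    def missing_dtstart(self):
        """Return the indexes of events that break the rule."""
        return [i for i, ok in enumerate(self.events) if not ok]

good = '<div class="vevent"><abbr class="dtstart" title="2006-01-20">Jan 20</abbr></div>'
bad = '<div class="vevent"><span class="summary">No start date</span></div>'
checker = HCalendarChecker()
checker.feed(good + bad)
print(checker.missing_dtstart())   # the second event (index 1) breaks the rule
```

Every such rule has to be hand-written and hand-run, which is Norm's complaint: there's no equivalent of pointing a stock validator at a DTD or schema and being done.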

I'll continue to pay attention to this discussion but for now microformats will remain in the "gross hack" bucket for me.


 

Categories: XML

January 18, 2006
@ 12:03 PM

Once people find out that they can use tools like ecto, Blogjet or W.Bloggar to manage their blog on MSN Spaces via the MetaWeblog API, they often ask me why we don't have something equivalent to the Flickr API so they can do the same for the photos they have in their space. 

My questions for folks out there is whether this is something you'd like to see? Do you want to be able to create, edit and delete photos and photo albums in your Space using desktop tools? If so, what kind of tools do you have in mind?

If you are a developer, what kind of API would you like to see? Should it use XML-RPC, SOAP or REST? Do you want a web service or a DLL?

Let me know what you think.


 

Categories: Windows Live | XML Web Services

A few weeks ago I wrote a blog post entitled Windows Live Fremont: A Social Marketplace about the upcoming social marketplace coming from Microsoft. Since then the project has been renamed to Windows Live Expo and the product team is now blogging.

The team blog is located at http://spaces.msn.com/members/teamexpo and they've already posted an entry addressing their most frequently asked question, "So when is it launching then?".


 

Categories: Windows Live

It's been about three years since I first started on RSS Bandit and it doesn't seem like I've run out of steam yet. Every release the application seems to become more popular and last month we finally broke 100,000 downloads in a single month. The time has come for me to start thinking about what I'd like to see in the next family of releases and elicit feedback from our users. The next release is codenamed Jubilee.

Below is a list of feature areas I'd like to see us work on over the next few months

  1. Extensibility Framework to Enable Richer Plugins: We currently use the IBlogExtension plugin mechanism which allows one to add new context menu items when right-clicking on an item in the list view.  I've used this to implement features such as the "Email  This" and "Post to del.icio.us" which ship with the default install. Torsten implemented "Send to OneNote" using this mechanism as well.

    The next step is to enable richer plugins so people can add their own menu items, toolbar buttons as well as processing steps for received feed items. Torsten used a prototype of this functionality to add Ad blocking features to RSS Bandit. I'd like to add weblog posting functionality using such a plugin model instead of making it a core part of the application since many of our users may just want a reader and not a weblog editor as well.

  2. Comment Watching: For many blogs such as Slashdot and Mini-Microsoft, the comments are often more interesting than the actual blog post. In the next version we'd like to make it easier to not only be updated when new blog posts appear in a blog you are interested in but also when new comments show up in a post you are interested in as well.

  3. Provide better support for podcasts and other rich media in feeds: With the growing popularity of podcasts, we plan to make it easier for users to discover and download rich media from their feeds. This doesn't just mean supporting downloading media files in the background but also supporting better ways of displaying rich media in our default views. Examples of what we have in mind can be taken from  the post Why should text have all the fun? in the Google Reader blog. We should have richer experiences for photo feeds, audio feeds and video feeds.

  4. Thumbs Up & Thumbs Down for Filtering and Suggesting New Feeds: A big problem with using a news aggregator is that it eventually leads to information overload. One tends to subscribe to feeds which produce lots of content, of which only a subset is of interest to the user. At the other extreme, users often find it difficult to find new content that matches their interests. Both of these problems can be solved by providing a mechanism which allows the user to rate feeds or entries; a thumbs up or thumbs down rating similar to what systems such as TiVo use today. This system can be used to highlight items of interest from subscribed feeds or to suggest new feeds using a suggestion service such as AmphetaRate.

  5. Applying search filters to the list view: In certain cases a user may want to perform the equivalent of a search on the items currently being displayed in the list view without resorting to an explicit search. An example is showing all the unread items in the list view. RSS Bandit should provide a way to apply filters to the items currently being displayed in the list view either by providing certain predefined filters or providing the option to apply search folder queries as filters.

These are just some of the ideas I've had. There are also the dozens of feature requests we've received from our users over the past couple of months which we'll use as fodder for ideas for the Jubilee release.


 

Categories: RSS Bandit

January 15, 2006
@ 08:08 PM

Dave Winer made the following insightful observation in a recent blog post

Jeremy Zawodny, who works at Yahoo, says that Google is Yahoo 2.0. Very clever, and there's a lot of truth to it, but watch out, that's not a very good place to be. That's how Microsoft came to dominate the PC software industry. By shipping (following the analogy) WordPerfect 2.0 (and WordStar, MacWrite and Multimate) and dBASE 2.0 (by acquiring FoxBase) and Lotus 2.0 (also known as Excel). It's better to produce your own 2.0s, as Microsoft's vanquished competitors would likely tell you.

Microsoft's corporate culture is very much about looking at an established market leader, then building a competing product which is (i) integrated with a family of Microsoft products and (ii) fixes some of the weaknesses in the competitor's offerings. The company even came up with the buzzword Integrated Innovation to describe some of these aspects of its corporate strategy. 

Going further, one could argue that when Microsoft does try to push disruptive new ideas, the lack of a competitor to focus on leads to floundering by the product teams involved. WinFS, Netdocs and even Hailstorm can be cited as examples of projects that floundered due to the lack of a competitive focus.

New employees to Microsoft are sometimes frustrated by this aspect of Microsoft's culture. For some it's hard to acknowledge that working at Microsoft isn't about building cool, new stuff but about building cooler versions of products offered by our competitors which integrate well with other Microsoft products. This ethos not only brought us Microsoft Office which Dave mentions in his post but also newer examples including XBox (a better Playstation), C# (a better Java) and MSN Spaces (a better TypePad/Blogger/LiveJournal). 

The main reason I'm writing this is so I don't have to keep explaining it to people, I can just give them a link to this blog post next time it comes up.


 

Categories: Life in the B0rg Cube

A few members of the Hotmail Windows Live Mail team have been doing some writing about scalability recently

From the ACM Queue article A Conversation with Phil Smoot

BF Can you give us some sense of just how big Hotmail is and what the challenges of dealing with something that size are?

PS Hotmail is a service consisting of thousands of machines and multiple petabytes of data. It executes billions of transactions over hundreds of applications agglomerated over nine years—services that are built on services that are built on services. Some of the challenges are keeping the site running: namely dealing with abuse and spam; keeping an aggressive, Internet-style pace of shipping features and functionality every three and six months; and planning how to release complex changes over a set of multiple releases.

QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.

BF I have this sense that the challenges don’t scale uniformly. In other words, are there certain scaling points where the problem just looks completely different from how it looked before? Are there things that are just fundamentally different about managing tens of thousands of systems compared with managing thousands or hundreds?

PS Sort of, but we tend to think that if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.

We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site.

Another hard problem is load balancing across the site. That is, ensuring that user transactions and storage capacity are equally distributed over all the nodes in the system without any particular set of nodes getting too hot.
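The staged roll-out Smoot describes is easy to sketch. The 1/10/100/1000 ring sizes are from the interview; the code structure itself is invented for illustration:

```python
# Sketch: compute the successive batches of servers to flip when rolling
# out a new version, widening the ring until the whole site is covered.
def rollout_rings(n_servers, ramp=(1, 10, 100, 1000)):
    """Return the sizes of each validation ring, ending with the rest."""
    rings, done = [], 0
    for size in ramp:
        if done + size >= n_servers:
            break
        rings.append(size)
        done += size
    rings.append(n_servers - done)   # everything that's left
    return rings

print(rollout_rings(5000))
```

One consequence, as Smoot notes, is versioning: while the rings advance, version N and N+1 clients, servers and data formats must all coexist and interoperate.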

From the blog post entitled Issues with .NET Frameworks 2.0 by Walter Hsueh

Our team is tackling the scale issues, delving deep into the CLR and understanding its behavior.  We've identified at least two issues in .NET Frameworks 2.0 that are "low-hanging fruit", and are hunting for more.

1a)  Regular Expressions can be very expensive.  Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior.  We've taken several hotfixes for this.  RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.

1b)  Designing an AJAX-style browser application (like most engineering problems) involves trading one problem for another.  We can choose to shift the application burden from the client onto the server.  In the case of RegExes, it might make sense to move them to the client (where CPU can be freely used) instead of having them run on the server (where you have to share).  WindowsLive Mail made this tradeoff in one case.

2)  Managed Thread Local Storage (TLS) is expensive.  There is a global lock in the Whidbey RTM implementation of Thread.GetData/Thread.SetData which causes scalability issues.  Recommendation is to use the [ThreadStatic] attribute on static class variables.  Our RPS went up, our CPU % went down, context switches dropped by 50%, and lock contentions dropped by over 80%.  Good stuff.
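The TLS point translates across runtimes. Here is a sketch in Python, where threading.local plays the role of a [ThreadStatic] field: each thread gets a private slot, with no shared lock taken on every access the way the Whidbey Thread.GetData/SetData implementation does:

```python
# Sketch: per-thread state via thread-local storage instead of a shared,
# lock-protected map keyed by thread id.
import threading

tls = threading.local()

def worker(results, i):
    tls.counter = 0            # each thread gets its own 'counter' slot
    for _ in range(1000):
        tls.counter += 1       # no cross-thread contention on this slot
    results[i] = tls.counter

results = {}
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.values()))
```

Each of the four threads counts to 1000 in its own slot, which is the behavior the [ThreadStatic] recommendation buys on the CLR without the global lock.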

Our devs have also started migrating some of our services to Whidbey and they've found some interesting performance issues as well. It would probably be a good idea to put together some sort of "lessons learned while building mega-scale services on the .NET Framework" article.


 

Categories: Windows Live

Recently we had some availability issues with MSN Spaces which have caused complaints from some of our loyal customers. Mike Torres addresses these issues in his post Performance & uptime, where he writes

One of the hardest parts about running a worldwide service with tens of millions of users is maintaining service performance and overall uptime.  As a matter of fact, a member of our team (Dare) had some thoughts about this not too long ago.  While we're constantly working towards 100% availability and providing the world's fastest service, sometimes we run into snags along the way that impact your experience with MSN Spaces.
 
That seems to have happened yesterday.  For the networking people out there, it turned out to be a problem with a load balancing device resulting in packet loss (hence the overall slowness of the site).  After some investigation, the team was able to determine the cause and restore the site back to normal.
 
Rest assured that as soon as the service slows down even a little bit, or it becomes more difficult to reach individual spaces, we're immediately aware of it here within our service operations center.  Within minutes we have people working hard to restore things to their normal speedy and reliable state.  Of course, sometimes it takes a little while to get things back to normal - but don't believe for a second that we aren't aware or concerned about the problem.  As a matter of fact, almost everyone on our team uses Spaces daily (surprise!) so we are just as frustrated as you are when things slow down.  So I'm personally sorry if you were frustrated yesterday - I know I was!  We are going to continue to do everything we can to minimize any impact on your experience...  most of the time we'll be successful and every once in a while we won't.  But it's our highest priority and you have a firm commitment from us to do so.

I'm glad to see us being more transparent about what's going on with our services. This is a good step.


 

Categories: MSN

Over a year ago, I wrote a blog post entitled SGML on the Web: A Failed Dream? where I asked whether the original vision of XML had failed. Below are excerpts from that post

The people who got together to produce the XML 1.0 recommendation were motivated to do this because they saw a need for SGML on the Web. Specifically, their discussions focused on two general areas:
  • Classes of software applications for which HTML was an inadequate information format
  • Aspects of the SGML standard itself that impeded SGML's acceptance as a widespread information technology

The first discussion established the need for SGML on the web. By articulating worthwhile, even mission-critical work that could be done on the web if there were a suitable information format, the SGML experts hoped to justify SGML on the web with some compelling business cases.

The second discussion raised the thornier issue of how to "fix" SGML so that it was suitable for the web.

And thus XML was born.
...
The W3C's attempts to get people to author XML directly on the Web have mostly failed, as can be seen from the dismal adoption rate of XHTML, and in fact many [including myself] have come to the conclusion that the benefits of adopting XHTML are too low, if not non-existent, compared to the costs. There was once an expectation that content producers would be able to place documents conformant to their own XML vocabularies on the Web, with display handled entirely by stylesheets, but this is yet to become widespread. In fact, at least one member of a W3C working group has called this a bad practice since it means that User Agents that aren't sophisticated enough to understand style sheets are left out in the cold.

Interestingly enough, although XML has not been as successful as its originators initially expected as a markup language for authoring documents on the Web, it has found significant success as the successor to the Comma Separated Value (CSV) file format. XML's primary usage on the Web, and even within internal networks, is for exchanging machine-generated, structured data between applications. Speculatively, the largest usage of XML on the Web today is RSS, and it conforms to this pattern.

These thoughts were recently rekindled when reading Tim Bray's post Don't Invent XML Languages, in which he argues that people should stop designing new XML formats and instead advocates the use of microformats for designing new data formats for the Web.

The vision behind microformats is completely different from the XML vision. The original XML inventors started with the premise that HTML is not expressive enough to describe every possible document type that would be exchanged on the Web. Proponents of microformats argue that one can layer additional semantics over HTML, and thus HTML is expressive enough to represent every possible document type that could be exchanged on the Web. I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. However, given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan. 


 

Categories: XML