As the author of a news reader that supports RSS and Atom, I often have to deal with feeds designed by the class of people Mark Pilgrim described as assholes in his post Why specs matter. These are people who
read specs with a fine-toothed comb, looking for loopholes, oversights, or simple typos. Then they write code that is meticulously spec-compliant, but useless. If someone yells at them for writing useless software, they smugly point to the sentence in the spec that clearly spells out how their horribly broken software is technically correct.
This is the first in a series of posts highlighting such feeds as examples of how not to design syndication feeds for a website. Feeds in this series will often be technically valid RSS or Atom feeds that nevertheless, for one or more reasons, cause unnecessary inconvenience to authors and users of news aggregators.
This week's gem is the Cafe con Leche RSS feed. Instead of pointing out what is wrong with this feed myself, I'll let its author do so. On September 24th Elliotte Rusty Harold wrote:
I've been spending a lot of time reviewing RSS readers lately, and overall they're a pretty poor lot. Latest example. Yesterday's Cafe con Leche feed contained this completely legal title element:
<title>I'm very pleased to announce the publication of XML in a Nutshell, 3rd edition by myself and W.
Scott Means, soon to be arriving at a fine bookseller near you.
</title>
Note the line break in the middle of the title content. This confused at least two RSS readers even though there's nothing wrong with it according to the RSS 0.92 spec. Other features from my RSS feeds that have caused problems in the past include long titles, a single URL that points to several stories, and not including more than one day's worth of news in a feed.
Elliotte is technically right: none of the RSS specs say that the <link> element in an RSS feed must be unique for each item, so he can reuse the same link for multiple items and still have a valid RSS feed. So why does this cause problems for RSS aggregators?
Consider the following RSS feed
<rss version="0.92">
<channel>
<title>Example RSS feed</title>
<link>http://www.example.com</link>
<description>This feed contains an example of how not to design an RSS feed</description>
<item>
<title>I am item 1</title>
<link>http://www.example.com/rssitem</link>
</item>
<item>
<title>I am item 2</title>
<link>http://www.example.com/rssitem</link>
</item>
</channel>
</rss>
Now consider the same feed fetched a few hours later
<rss version="0.92">
<channel>
<title>Example RSS feed</title>
<link>http://www.example.com</link>
<description>This feed contains an example of how not to design an RSS feed</description>
<item>
<title>I am item one</title>
<link>http://www.example.com/rssitem</link>
</item>
<item>
<title>I am item 3</title>
<link>http://www.example.com/rssitem</link>
</item>
<item>
<title>I am item 2</title>
<link>http://www.example.com/rssitem</link>
</item>
</channel>
</rss>
Now how does the RSS aggregator tell whether the item titled "I am item one" is the item previously titled "I am item 1" with a typo fixed, or a different item altogether? The simple answer is that it can't. A naive hack is to compare the content of the <description> element to see if it is the same, but that fails as soon as a typo is fixed or the content of the <description> is otherwise updated.
Every RSS aggregator has some sort of hack to deal with this problem. I describe them as hacks because there is no way for an aggregator to determine with 100% accuracy whether items with the same link and no guid are the same item with changed content or different items. This means the behavior of different aggregators with feeds such as the Cafe con Leche RSS feed is extremely inconsistent.
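To make the ambiguity concrete, here is a minimal Python sketch of one such hack: treating the (link, title) pair as an item's identity. It is only an illustration, not how any particular aggregator works. The helper names make_feed, item_keys, and new_items are hypothetical, and the feeds are trimmed versions of the two examples above.

```python
import xml.etree.ElementTree as ET

LINK = "http://www.example.com/rssitem"  # every item reuses this one link


def make_feed(titles):
    """Build a tiny RSS 0.92 feed whose items all share the same link."""
    items = "".join(
        f"<item><title>{t}</title><link>{LINK}</link></item>" for t in titles
    )
    return (
        '<rss version="0.92"><channel><title>Example RSS feed</title>'
        f"{items}</channel></rss>"
    )


def item_keys(feed_xml):
    """Return the (link, title) pairs used here as a stand-in for identity."""
    root = ET.fromstring(feed_xml)
    return {(i.findtext("link"), i.findtext("title")) for i in root.iter("item")}


def new_items(old_xml, new_xml):
    """Titles that this heuristic would report as new since the last fetch."""
    fresh = item_keys(new_xml) - item_keys(old_xml)
    return sorted(title for _link, title in fresh)


first_fetch = make_feed(["I am item 1", "I am item 2"])
second_fetch = make_feed(["I am item one", "I am item 3", "I am item 2"])

print(new_items(first_fetch, second_fetch))
# ['I am item 3', 'I am item one']
```

The edited title is reported as a brand new item alongside the one that really is new. Keying on the link alone is no better, since every item then collapses into a single entry.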
A solution to this problem is for Elliotte Rusty Harold to upgrade his RSS feed to RSS 2.0 and use guid elements to distinctly identify items.
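As a rough sketch of why guids help, the Python below matches items across two fetches by guid instead of by link or title. The guid values (item-1, item-2, item-3) and the helpers make_feed, titles_by_guid, and classify are made up for illustration; a real feed would use whatever stable identifiers its publisher chooses.

```python
import xml.etree.ElementTree as ET


def make_feed(entries):
    """Build a tiny RSS 2.0 feed from (guid, title) pairs sharing one link."""
    items = "".join(
        f"<item><title>{title}</title>"
        "<link>http://www.example.com/rssitem</link>"
        f'<guid isPermaLink="false">{guid}</guid></item>'
        for guid, title in entries
    )
    return (
        '<rss version="2.0"><channel><title>Example RSS feed</title>'
        f"{items}</channel></rss>"
    )


def titles_by_guid(feed_xml):
    """Map each item's guid to its current title."""
    root = ET.fromstring(feed_xml)
    return {i.findtext("guid"): i.findtext("title") for i in root.iter("item")}


def classify(old_xml, new_xml):
    """Return (guids of added items, guids of items whose title changed)."""
    old, new = titles_by_guid(old_xml), titles_by_guid(new_xml)
    added = sorted(g for g in new if g not in old)
    updated = sorted(g for g in new if g in old and new[g] != old[g])
    return added, updated


first_fetch = make_feed([("item-1", "I am item 1"), ("item-2", "I am item 2")])
second_fetch = make_feed(
    [("item-1", "I am item one"), ("item-3", "I am item 3"), ("item-2", "I am item 2")]
)

print(classify(first_fetch, second_fetch))
# (['item-3'], ['item-1'])
```

With a guid on every item, the retitled item is recognized as an update and only the third item is treated as new, even though all three still share one link.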