Misunderstanding XML and Other RSS Follies

March 16, 2004

@ 09:02 PM

I just noticed that Arve Bersvendsen has written a post entitled 11 ways to valid RSS, in which he states that he has seen 11 different ways of providing content in an RSS feed, namely:

Content in the description element

I have so far identified five different variants of content in the <description> element:

Plaintext as CDATA with HTML entities - Validate
HTML within CDATA - Validate
HTML escaped with entities - Validate
Plain text in CDATA - Validate
Plaintext with inline HTML using escaping - Validate

<content:encoded>

I have encountered and identified two different ways of using <content:encoded>:

Using entities - Validate
Using CDATA - Validate

XHTML content

Finally, I have encountered and identified four different ways in which people has specified XHTML content:

Using <xhtml:body> - Validate
Using <xhtml:div> - Validate
Using <body> with default namespace - Validate
Using <div> with default namespace - Validate

At first these seem like a lot, until you actually try to program against them using an XML parser. The first thing you notice then is that there is no difference between programming against CDATA and programming against escaped entities, since both are syntactic sugar. For example, the XML infoset and the data models compatible with it, such as the XPath data model, do not differentiate between character content written as character references, written in CDATA sections, or entered directly. So the following

    <test><![CDATA[ ]]>2</test>
    <test>&#160;2</test>
    <test> 2</test>

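are all equivalent, provided the space in the first and third examples is the non-breaking space (U+00A0) that &#160; refers to. More directly, if you loaded all three into an instance of System.Xml.XmlDocument and checked their InnerText property, they'd all return the same result. So this reduces Arve's first two elements to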

Content in the description element

I have so far identified ~~five~~ two different variants of content in the <description> element:

HTML
Plain text

<content:encoded>

I have encountered and identified ~~two different ways~~ one way of using <content:encoded>:

Containing escaped HTML content

If your code makes any distinctions other than these, it is a sign that you either (a) have misunderstood how to process RSS or (b) are using a crappy XML parser. When I first started working on RSS Bandit I too was confused by these distinctions, but after a while things became clearer. The only problem here is the description element, since you can't tell whether it is HTML or not without guessing. Since RSS Bandit always provides the content to an embedded web browser this isn't a problem, but I can see how it could be one for aggregators that don't know how to process HTML (although I've never seen one before).
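The CDATA versus escaped-entity equivalence described above can be checked with any conforming XML parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree rather than System.Xml.XmlDocument; the sample documents are made up for illustration, and the whitespace character is written explicitly as U+00A0.

```python
import xml.etree.ElementTree as ET

# Three serializations of the same character content:
# a non-breaking space (U+00A0) followed by "2".
documents = [
    "<test><![CDATA[\u00a0]]>2</test>",  # CDATA section
    "<test>&#160;2</test>",              # numeric character reference
    "<test>\u00a02</test>",              # character entered directly
]
texts = [ET.fromstring(doc).text for doc in documents]

# The same holds for an RSS-style <description> that carries HTML:
# escaped markup and markup inside CDATA yield the same string.
escaped = "<description>&lt;b&gt;Hello&lt;/b&gt; world</description>"
cdata = "<description><![CDATA[<b>Hello</b> world]]></description>"
html_texts = [ET.fromstring(escaped).text, ET.fromstring(cdata).text]

print(texts[0] == texts[1] == texts[2])
print(html_texts[0] == html_texts[1])
```

Once the parser has done its job, the application only ever sees the resulting string, which is why the serialization choice does not show up as a separate case in aggregator code.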

Another misunderstanding by Arve seems to be about how namespaces work in XML. A few years ago I wrote an article, XML Namespaces and How They Affect XPath and XSLT, in which I said

A qualified name, also known as a QName, is an XML name called the local name optionally preceded by another XML name called the prefix and a colon (':') character...The prefix of a qualified name must have been mapped to a namespace URI through an in-scope namespace declaration mapping the prefix to the namespace URI. A qualified name can be used as either an attribute or element name.

Although QNames are important mnemonic guides to determining what namespace the elements and attributes within a document are derived from, they are rarely important to XML aware processors. For example, the following three XML documents would be treated identically by a range of XML technologies including, of course, XML schema validators.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:complexType id="123" name="fooType"/>
    </xs:schema>

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
        <xsd:complexType id="123" name="fooType"/>
    </xsd:schema>

    <schema xmlns="http://www.w3.org/2001/XMLSchema">
        <complexType id="123" name="fooType"/>
    </schema>
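As a rough illustration of why the prefix does not matter, the sketch below parses the three schema documents with Python's xml.etree.ElementTree, which reports every element by its expanded name (namespace URI plus local name). The helper name expanded_names is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

# The three schema documents, differing only in the prefix chosen
# for the XML Schema namespace.
documents = [
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:complexType id="123" name="fooType"/></xs:schema>',
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:complexType id="123" name="fooType"/></xsd:schema>',
    '<schema xmlns="http://www.w3.org/2001/XMLSchema">'
    '<complexType id="123" name="fooType"/></schema>',
]


def expanded_names(xml_text):
    """Return the {namespace-uri}local-name of every element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


names = [expanded_names(doc) for doc in documents]
print(names[0] == names[1] == names[2])
```

Code that looks elements up by namespace URI and local name, rather than by the literal prefixed string, therefore handles all three spellings with a single code path.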

Bearing this information in mind, Arve's example is reduced to

XHTML content

Finally, I have encountered and identified ~~four~~ two different ways in which people has specified XHTML content:

Using <xhtml:body>
Using <xhtml:div>

Thus, with judicious use of an XML parser (which makes sense, since RSS is an XML format), Arve's list of eleven ways of providing content in RSS is whittled down to five. I assume Arve is unfamiliar with XML processing, which led to his initial confusion.
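To make the five remaining cases concrete, here is a hedged sketch of how an aggregator might pull content out of an item using Python's xml.etree.ElementTree. The function name item_content, the result labels, and the order in which the elements are tried are all assumptions made for illustration; they are not how RSS Bandit or any other aggregator necessarily does it.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def item_content(item):
    """Return (kind, value) for the first content form found in an <item>.

    The lookup order (XHTML, then content:encoded, then description)
    is an assumption of this sketch.
    """
    # XHTML content: <body> or <div> in the XHTML namespace, whatever the prefix.
    for local in ("body", "div"):
        node = item.find(f"{{{XHTML_NS}}}{local}")
        if node is not None:
            return ("xhtml", ET.tostring(node, encoding="unicode"))
    # <content:encoded>: escaped HTML, whether via entities or CDATA.
    encoded = item.findtext(f"{{{CONTENT_NS}}}encoded")
    if encoded is not None:
        return ("html", encoded)
    # <description>: HTML or plain text; the feed does not say which.
    description = item.findtext("description")
    if description is not None:
        return ("html-or-text", description)
    return ("none", "")


samples = [
    '<item><description><![CDATA[<p>hi</p>]]></description></item>',
    '<item xmlns:c="http://purl.org/rss/1.0/modules/content/">'
    '<c:encoded>&lt;p&gt;hi&lt;/p&gt;</c:encoded></item>',
    '<item><body xmlns="http://www.w3.org/1999/xhtml"><p>hi</p></body></item>',
    '<item xmlns:x="http://www.w3.org/1999/xhtml"><x:div><x:p>hi</x:p></x:div></item>',
]
results = [item_content(ET.fromstring(sample)) for sample in samples]
for kind, _ in results:
    print(kind)
```

The only branch that still involves guessing is the last one, which matches the point made earlier about the description element.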

NOTE: Before anyone bothers to point out that Atom somehow frees aggregator authors from this myriad of options, I'll point out that Atom has more ways of encoding content than these. Even ignoring the inconsequential differences in XML syntactic sugar (escaped tags vs. unescaped tags in CDATA sections), the various combinations of the <summary> and <content> elements, the mode attribute (escaped vs. xml) and MIME types (text/plain, text/html, application/xhtml+xml) more than double the number of variations possible in RSS.

Categories: XML


Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson
