I just noticed that Arve Bersvendsen has written a post entitled "11 ways to valid RSS", in which he states that he has seen 11 different ways of providing content in an RSS feed, namely:
Content in the description element
I have so far identified five different variants of content in the <description> element:
- Plaintext as CDATA with HTML entities - Validate
- HTML within CDATA - Validate
- HTML escaped with entities - Validate
- Plain text in CDATA - Validate
- Plaintext with inline HTML using escaping - Validate
<content:encoded>
I have encountered and identified two different ways of using <content:encoded>:
- Using entities - Validate
- Using CDATA - Validate
XHTML content
Finally, I have encountered and identified four different ways in which people have specified XHTML content:
- Using <xhtml:body> - Validate
- Using <xhtml:div> - Validate
- Using <body> with default namespace - Validate
- Using <div> with default namespace - Validate
At first this seems like a lot, until you actually try to program against it using an XML parser. The first thing you notice then is that there is no difference between programming against CDATA and programming against escaped entities, since both are just syntactic sugar for the same character data. For example, the XML infoset and the data models compatible with it, such as the XPath data model, do not differentiate between character content written as character references, written in CDATA sections or entered directly. So the following
<test><![CDATA[ ]]>2</test>
<test>&#32;2</test>
<test> 2</test>
are all equivalent. More directly, if you loaded all three into an instance of System.Xml.XmlDocument and checked their InnerText property, they'd all return the same result. This reduces Arve's first two sections to
Content in the description element
I have so far identified two different variants of content in the <description> element:
- HTML
- Plain text
<content:encoded>
I have encountered and identified one way of using <content:encoded>:
- Containing escaped HTML content
If your code makes any distinctions other than these, it is a sign that you either (a) have misunderstood how to process RSS or (b) are using a crappy XML parser. When I first started working on RSS Bandit I was also confused by these distinctions, but after a while things became clearer. The only real problem here is the description element, since you can't tell whether it contains HTML or plain text without guessing. Since RSS Bandit always hands the content to an embedded web browser, this isn't a problem for it, but I can see how it could be one for aggregators that don't know how to process HTML (although I've never seen one).
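The equivalence of CDATA sections, character references and literal text is easy to check with any conforming parser. The following is only a minimal sketch: it uses Python's xml.etree.ElementTree as a stand-in for System.Xml.XmlDocument, assumes the character reference in the <test> example is &#32; (a plain space), and uses a made-up <item> fragment with the RSS 1.0 content module namespace for the <content:encoded> case.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same character data: a CDATA section, a numeric
# character reference (assumed here to be "&#32;") and a literal space.
test_variants = [
    "<test><![CDATA[ ]]>2</test>",
    "<test>&#32;2</test>",
    "<test> 2</test>",
]
test_texts = [ET.fromstring(doc).text for doc in test_variants]

# Namespace URI of the RSS 1.0 content module.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def encoded_html(item_xml):
    """Return the character data of an item's <content:encoded> child."""
    item = ET.fromstring(item_xml)
    return item.findtext("{%s}encoded" % CONTENT_NS)


# Hypothetical feed items: the same HTML, once escaped with entities and
# once wrapped in a CDATA section.
item_template = (
    '<item xmlns:content="%s"><content:encoded>%s</content:encoded></item>'
)
with_entities = item_template % (
    CONTENT_NS,
    "&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;",
)
with_cdata = item_template % (
    CONTENT_NS,
    "<![CDATA[<p>Hello <b>world</b></p>]]>",
)

html_from_entities = encoded_html(with_entities)
html_from_cdata = encoded_html(with_cdata)

print(test_texts)
print(html_from_entities == html_from_cdata)
```

In both cases the application only ever sees the decoded string, so there is nothing for aggregator code to branch on.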
Another misunderstanding on Arve's part seems to be about how namespaces work in XML. A few years ago I wrote an article, XML Namespaces and How They Affect XPath and XSLT, in which I wrote:
A qualified name, also known as a QName, is an XML name called the local name optionally preceded by another XML name called the prefix and a colon (':') character...The prefix of a qualified name must have been mapped to a namespace URI through an in-scope namespace declaration mapping the prefix to the namespace URI. A qualified name can be used as either an attribute or element name.
Although QNames are important mnemonic guides to determining what namespace the elements and attributes within a document are derived from, they are rarely important to XML-aware processors. For example, the following three XML documents would be treated identically by a range of XML technologies including, of course, XML schema validators.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType id="123" name="fooType"/>
</xs:schema>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:complexType id="123" name="fooType"/>
</xsd:schema>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
<complexType id="123" name="fooType"/>
</schema>
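One way to see this is to ask a namespace-aware parser for the expanded name (namespace URI plus local name) of each element. The sketch below is only an illustration, assuming Python's xml.etree.ElementTree as the parser; the helper name expanded_names is made up for this example.

```python
import xml.etree.ElementTree as ET

# The three documents above: two different prefixes and a default namespace.
docs = [
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:complexType id="123" name="fooType"/>'
    "</xs:schema>",
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:complexType id="123" name="fooType"/>'
    "</xsd:schema>",
    '<schema xmlns="http://www.w3.org/2001/XMLSchema">'
    '<complexType id="123" name="fooType"/>'
    "</schema>",
]


def expanded_names(xml_text):
    """List each element's expanded name in document order.

    ElementTree reports element names as "{namespace-uri}local-name",
    so the prefix chosen by the document author never appears.
    """
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


names = [expanded_names(doc) for doc in docs]
same_names = names[0] == names[1] == names[2]

print(names[0])
print(same_names)
```

The prefix is resolved away during parsing, so code that matches on the expanded name handles all three spellings without special cases.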
Bearing this in mind, Arve's XHTML examples reduce to
XHTML content
Finally, I have encountered and identified two different ways in which people have specified XHTML content:
- Using <xhtml:body>
- Using <xhtml:div>
Thus, with judicious use of an XML parser (which makes sense, since RSS is an XML format), Arve's list of eleven ways of providing content in RSS is actually whittled down to five. I assume Arve is unfamiliar with XML processing, which led to his initial confusion.
NOTE: Before anyone starts pointing out that Atom somehow frees aggregator authors from this myriad of options, I'll point out that Atom has more ways of encoding content than these. Even ignoring the inconsequential differences in XML syntactic sugar (escaped tags vs. unescaped tags in CDATA sections), the various combinations of the <summary> and <content> elements, the mode attribute (escaped vs. xml) and MIME types (text/plain, text/html, application/xhtml+xml) more than double the number of variations possible in RSS.