February 13, 2004
@ 03:30 PM

Mark Pilgrim has a post entitled Determining the character encoding of a feed where he does good job of sumarizing what the various specs say about determining the character encoding of an XML document retrieved on the World Wide Web via HTTP. The only problem with his post is that although it is a fairly accurate description of what the specs say it definitely does not reflect reality. Specifically

According to RFC 3023..., if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. us-ascii.

So for this to work correctly it means that if the MIME type of an XML document is text/xml then the web server should look inside it before sending it over the wire and send the correct encoding or else the document will be interpreted incorrectly since it is highly likely that us-ascii is not the encoding of the XML document. In practice, most web servers do not do this. I have confirmed this by testing against both IIS and Apache.

Instead what happens is that an XML document is created by the user and dropped on the file system and the web server assumes it is text/xml which it most likely is and sends it as is without setting the charset in the content type header.   

A simple way to test this is to go to Rex Swain's HTTP Viewer and download the following documents from the W3 Schools page on XML encodings

  1. XML document in windows-1252 encoding
  2. XML document in ISO-8859-1 encoding
  3. XML document in UTF-8 encoding
  4. XML document in UTF-16 encoding

All files are sent with a content type of text/xml and no encoding specified in the charset parameter of the Content-Type HTTP header. According to RFC 3023 which Mark Pilgrim quoted in his article that clients should treat them as us-ascii. With the above examples this behavior would be wrong in all four cases.

The moral of this story is if you are writing an application that consumes XML using HTTP you should use the following rule of thumb for the forseeable future [slightly modified from Mark Pilgrim's post]  

According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml then the encoding is

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. the encoding given in the encoding attribute of the XML declaration within the document, or
  3. utf-8.

Some may argue that this discussion isn't relevant for news aggregators because they'll only consume XML documents whose MIME type application/atom+xml or application/rss+xml but again this ignores practice. In practice most web servers send back RSS feeds as text/xml, if you don't believe me test ten RSS feeds chosen at random using Rex Swain's HTTP Viewer  and see what MIME type the server claims they are.