A recent comment on the Groklaw blog entitled
Which Binary Key? claims that
one needs a "binary key" to consume XML produced by Microsoft Office 2003. Specifically, the post claims:
No_Axe speaks as if MS Office 12 had already been released and everyone was
using it. He assumes everyone knows the binary key is gone. Yet Microsoft is
saying that MS Office 12 is more or less a year away from release. So who really
knows when and if the binary key has been dropped? All i know is that MSXML 12
is not available today. And that MSXML 2003 has a binary key in the header of
every file.
...
So let me close with this last comment on the fabled “binary key”. In
March of 2005, when phase II of the ODF TC work was complete, and the
specification had been prepared for both OASIS and ISO ratification,
the ODF TC took up the issue of “compliance and conformance” testing.
Specifically, we decided to start work on a compliance testing suite
that would be useful for developers and application providers to
perfect their implementations of ODF. Guess who's XML file format was
the first test target? Right. And guess what the problem is with MSXML?
Right. It's the binary key. We can't do even a simple transformation
between MSXML and ODF!
As someone who's used the XML features of Excel and Word,
I know for a fact that you don't need a "binary key" to process the
files using traditional XML tools. Brian Jones, who works on a number
of the XML features in Office, has a post entitled The myth of the Binary Key
where
he mentions various parts of the Office XML formats that may confuse
one into thinking they are some sort of "binary key" such as namespace
URIs, processing instructions and Base64 encoded binary data. All of
these are standard aspects of XML which one typically doesn't see in
simple uses of the technology such as RSS feeds.
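To illustrate the point, here is a minimal sketch in Python that reads a Word 2003 style document with nothing but the standard library's XML parser. The fragment is hand-written and heavily simplified rather than saved from Word, and the element placement, the `hypothetical.bin` name and the `extract` helper are my own assumptions for the example. It contains a processing instruction, a namespace URI and a Base64 payload, and none of them needs any sort of key.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URI used by Word 2003 XML; the fragment below is a simplified,
# hand-written stand-in for a real document (an assumption for illustration).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    b'<?mso-application progid="Word.Document"?>\n'
    b'<w:wordDocument xmlns:w="' + W_NS.encode("ascii") + b'">\n'
    b'  <w:body><w:p><w:r><w:t>Hello, ODF</w:t></w:r></w:p></w:body>\n'
    b'  <w:binData w:name="hypothetical.bin">aGVsbG8=</w:binData>\n'
    b'</w:wordDocument>\n'
)


def extract(xml_bytes):
    """Return (root tag, text runs, decoded binary blobs) from the fragment."""
    # The processing instruction before the root element is simply skipped
    # by the parser; it is not an obstacle to reading the document.
    root = ET.fromstring(xml_bytes)
    ns = {"w": W_NS}
    # Namespaced elements are addressed through an ordinary prefix mapping.
    texts = [t.text for t in root.iterfind(".//w:t", ns)]
    # Embedded binary data is plain Base64 text inside an element.
    blobs = [base64.b64decode(b.text) for b in root.iterfind(".//w:binData", ns)]
    return root.tag, texts, blobs


if __name__ == "__main__":
    tag, texts, blobs = extract(SAMPLE)
    print(tag)
    print(texts)
    print(blobs)
```

Any namespace-aware XML parser should be able to do the same; Python is used here only because it keeps the example short.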
Since I used to work on the XML team, there is one thing I want
to add to Brian's list which often confuses people trying to
process XML: the Unicode byte order mark (BOM). This is often at the beginning of documents saved in UTF-16 or UTF-8 encoding on Windows. As the Wikipedia entry on BOMs states:
In UTF-16, a BOM is expressed as the
two-byte sequence FE FF at the beginning of
the encoded string, to indicate that the encoded characters that follow it use
big-endian byte order; or it is expressed as the byte sequence FF FE to indicate
little-endian order.
Whilst UTF-8 does not have byte order
issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of
Windows software
(including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy
use of text files for
configuration) this practice is not recommended, as it will interfere with
correct processing of important codes such as the hash-bang at the start of an interpreted script. It
may also interfere with source for programming languages that don't recognise
it. For example, gcc reports stray characters at the
beginning of a source file, and in PHP, if
output buffering is disabled, it has the subtle effect of causing the page to
start being sent to the browser, preventing custom headers from being specified
by the PHP script. The UTF-8 representation of the BOM is the byte sequence EF
BB BF, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to
handle UTF-8.
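The byte sequences described in that entry are easy to check for. Here is a minimal sketch in Python, assuming the file contents have already been read as raw bytes. The `split_bom` helper and the `BOMS` table are names I made up for the example, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# BOM byte sequences as defined in the standard codecs module.
# UTF-32 is left out on purpose (an assumption to keep the sketch short).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def split_bom(data):
    """Return (encoding name or None, data with any leading BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


if __name__ == "__main__":
    raw = codecs.BOM_UTF8 + b"<doc/>"
    encoding, payload = split_bom(raw)
    print(encoding, payload)
    # What a tool that assumes ISO-8859-1 would display for the UTF-8 BOM.
    print(codecs.BOM_UTF8.decode("latin-1"))
```

A tool that isn't expecting those leading bytes sees them as junk ahead of the XML declaration, which is easy to mistake for some kind of binary header.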
I wouldn't be surprised if the alleged "binary key" was just a byte
order mark which caused problems when trying to process the XML file
using tools that aren't Unicode savvy. I suspect some of the ODF folks who had
problems with the XML file would get some use out of Sam Ruby's Just
Use XML talk at this year's XML 2005 conference.