From XML 1.0 to the XML Infoset: A Useful Abstraction
A useful abstraction is one which simplifies the
details of a problem or its solution by providing a
palatable and consistent logical model. A more
important characteristic of a useful abstraction is
that it allows one to change the details of the
problem or implementation of its solution without
having to change the abstraction. This latter
characteristic is quite beneficial because it often
lends itself to extending the abstraction to solve
problems other than those originally imagined, and
it gives the underlying implementation
flexibility.
The XML Infoset is an abstract representation of
XML 1.0. It provides a simplified logical view of
XML and papers over certain details. The infoset
describes all the pertinent information contained
in an XML document without getting bogged down in
the differences between characters entered
directly, in CDATA sections or as entity
references, along with various other syntactic
minutiae. The infoset abstraction gives us two
things. First, it conclusively states which
information within an XML document is pertinent.
Second, it provides a starting point for mapping
non-XML data sources to XML data.
Given the following XML document:
<foo attr1="value1"
attr2='value2'
>me&amp;you</foo>
I can tell that the pertinent information is that
I have a document information item with an element
information item that has two attribute information
items and six child character information items.
Details like how much space is between attributes,
whether single or double quotes are used for
attributes, or the fact that the ampersand had to
be escaped are not significant information. Because
the infoset does not depend on the textual nature
of XML 1.0, it gives one a launch pad for creating
XML Infoset-compatible syntaxes for describing
structured data.
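A rough way to see this in practice, offered as a sketch rather than a conformance test: Python's xml.etree.ElementTree is not an Infoset implementation, but it exposes roughly the same facts (element name, attributes, character data). The second and third variant strings below are assumptions made up for illustration; only the first mirrors the document above.

```python
import xml.etree.ElementTree as ET

# Three serializations that differ only in syntax: quoting style,
# attribute order, whitespace inside the tag, and how "&" is written.
variants = [
    '<foo attr1="value1" attr2=\'value2\' >me&amp;you</foo>',
    "<foo attr2='value2' attr1='value1'><![CDATA[me&you]]></foo>",
    '<foo attr1="value1" attr2="value2">me&#38;you</foo>',
]

def summarize(xml_text):
    """Reduce a document to its name, attributes and characters."""
    root = ET.fromstring(xml_text)
    return (root.tag, dict(root.attrib), list(root.text))

summaries = [summarize(v) for v in variants]
all_equal = all(s == summaries[0] for s in summaries)

print(summaries[0])
print(all_equal)
```

If the parser behaves as expected, all three variants reduce to the same element name, the same two attributes and the same six characters.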
As long as mappings from one syntax to the XML
infoset exist, these alternate serializations of
the infoset can be processed using XML technologies
like XQuery, XPath and XML Schema. Proposals like
Don Park's SML or the various flavors of binary XML
only need to worry about being compatible with the
XML infoset, to the extent to which it defines
conformance, and not with the XML 1.0 syntax.
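As a sketch of what such a mapping might look like, the snippet below takes a made-up tuple notation (purely hypothetical, not SML or any binary XML proposal) and maps it onto an in-memory element tree. Python's ElementTree stands in for an infoset here, and its limited XPath support stands in for the XML technologies mentioned above. The element names and values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical non-XML notation: (name, attributes, children), where each
# child is either another such tuple or a plain string of character data.
record = ("library", {}, [
    ("book", {"year": "1999"}, [("title", {}, ["Weaving the Web"])]),
    ("book", {"year": "2001"}, [("title", {}, ["Example Title"])]),
])

def to_element(node):
    """Map the tuple notation onto an in-memory element tree."""
    name, attributes, children = node
    element = ET.Element(name, attributes)
    last_child = None
    for child in children:
        if isinstance(child, str):
            # Character data goes on .text before the first child element,
            # otherwise on the .tail of the preceding child element.
            if last_child is None:
                element.text = (element.text or "") + child
            else:
                last_child.tail = (last_child.tail or "") + child
        else:
            last_child = to_element(child)
            element.append(last_child)
    return element

root = to_element(record)

# The tree never existed as XML 1.0 text, yet a path query works on it.
titles = [t.text for t in root.findall("./book[@year='1999']/title")]
print(titles)
```

The point of the design is that the query only sees elements, attributes and characters; it has no way to tell which syntax they came from.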
From URLs & URNs to URI: A Step Backward
URLs are Uniform Resource Locators, described in
RFC 1738. According to the RFC:
URLs are used to `locate' resources, by
providing an abstract identification of the
resource location. Having located a
resource, a system may perform a variety
of operations on the resource, as might be
characterized by such words as `access',
`update', `replace', `find attributes'. In
general, only the `access' method needs to be
specified for any URL scheme.
URNs are Uniform Resource Names, described in
RFC 2141. According to the RFC:
Uniform Resource Names (URNs) are intended to
serve as persistent, location-independent,
resource identifiers and are designed to
make it easy to map other namespaces (which share
the properties of URNs) into URN-space.
Reading the above excerpts, it seems clear that
URLs and URNs are used in connection with
retrieving resources from a network. URLs tell you
where to fetch the resource, while URNs are the
name of the resource, from which you can then go
find its location. Basically, the difference
between URNs and URLs is the difference between
"The White House" and "The White House, 1600
Pennsylvania Avenue NW, Washington, DC 20500". At
first glance one can consider both URNs and URLs an
abstraction over IP addresses, DNS and all the
other gunk that goes on when one wants to grab
stuff off the network, be it web pages, music files
or images.
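The distinction shows up in the syntax itself. Here is a minimal sketch using Python's urllib.parse; the ISBN URN is an illustrative value chosen for this example, not one taken from either RFC.

```python
from urllib.parse import urlsplit

url = urlsplit("http://www.yahoo.com/")
urn = urlsplit("urn:isbn:0451450523")

# A URL carries the access scheme and where to go: a host and a path.
print(url.scheme, url.netloc, url.path)

# A URN carries no host at all; after "urn:" comes a namespace
# identifier and a namespace-specific string (RFC 2141 syntax).
namespace_id, _, specific = urn.path.partition(":")
print(urn.scheme, repr(urn.netloc), namespace_id, specific)
```

The URL names a host to contact, while the URN only names a thing within a naming scheme and leaves finding it to some other mechanism.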
However, there is a wrinkle that isn't obvious and
doesn't matter at the currently described level of
abstraction. The wrinkle is that the term
"resource", which litters both RFCs, isn't
rigorously defined, but since we are just talking
about grabbing files off a network we can assume
it refers to files on a network. That is, until
URIs enter the picture.
URIs are Uniform Resource Identifiers, described in
RFC 2396. According to the RFC:
A Uniform Resource Identifier (URI) is a compact
string of characters for identifying an
abstract or physical resource.
A resource can be anything that has identity.
Familiar examples include an electronic document,
an image, a service (e.g., "today's weather
report for Los Angeles"), and a collection of
other resources. Not all resources are network
"retrievable"; e.g., human beings,
corporations, and bound books in a library can
also be considered resources.
URIs are a merger of the syntax of URLs and URNs,
which seem to have been repurposed from their
original task of identifying and locating
network-retrievable documents to being more
readable versions of UUIDs, which can be used to
identify any person, place or thing regardless of
whether it is a file on the Internet or a feeling
in your heart.
This addition to the URN/URL abstraction seemed to
address some of the bits which may have been
considered to be leaky (if I enter
http://www.yahoo.com in my browser and it loads it
from its cache then the URL isn't acting as a
location but as an identifier). Others also saw
URIs as a way to serve people who needed
user-friendly UUIDs for use on the Web. So far I
have come into contact with URIs in two aspects of
my professional experience, and both have left a
bad taste in my mouth. Read on for details.
URIs in Action: XML Namespaces
The goal of the W3C's Namespaces in XML
recommendation was to create a mechanism by which
elements and attributes within an XML document
that come from different markup vocabularies could
be unambiguously identified and combined without
processing problems ensuing. To achieve this, XML
namespaces were invented. An XML namespace is a
collection of names, identified by a Uniform
Resource Identifier (URI) reference, which are used
in XML documents as element and attribute names.
Below is an example of a document that uses XML
namespaces:
<dare:foo
xmlns:dare="http://www.25hoursaday.com" />
The above document has a foo element that is from
the "http://www.25hoursaday.com" namespace. The
first thing people ask about namespaces in XML,
without fail, is "What is at the namespace
website?". Now, given that URIs are just glorified
UUIDs, the answer to this question is that
"http://www.25hoursaday.com" isn't necessarily a
website, just a pseudo-unique identifier.
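A small sketch of the "just an identifier" behavior, using Python's ElementTree as one example of a namespace-aware parser. The second and third documents are assumptions added for comparison; no network request is made at any point.

```python
import xml.etree.ElementTree as ET

NS = "http://www.25hoursaday.com"

# Same namespace name bound to two different prefixes.
a = ET.fromstring('<dare:foo xmlns:dare="http://www.25hoursaday.com" />')
b = ET.fromstring('<x:foo xmlns:x="http://www.25hoursaday.com" />')
# A trailing slash makes a different string, hence a different namespace.
c = ET.fromstring('<dare:foo xmlns:dare="http://www.25hoursaday.com/" />')

# ElementTree expands each name to "{namespace-name}local-name".
print(a.tag)

same_name = a.tag == b.tag == "{" + NS + "}foo"
different_name = a.tag != c.tag
print(same_name, different_name)
```

The prefix is irrelevant and the namespace name is compared as a plain string, which is why two URIs that would load the same page in a browser can still name two different namespaces.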
This answer has caused several thousand emails to
fly back and forth on various XML and W3C mailing
lists because of the utter confusion it causes.
Several thousand emails is not an exaggeration.
The archives for xml-uri@w3c.org show peak traffic
of almost a thousand mails a month, and lists like
XML-DEV usually have several hundred emails in the
threads where URIs and XML namespaces come up.
As I type this there is currently an active thread
on the WWW-TAG mailing list about
namespace documents and what constitutes a
"valid representation" of the abstract resource
that is an XML namespace. For those that have tons
of free time to read technical yet pointlessly
philosophical discussions, the threads are
here and
here.
URIs and the Semantic Web: Ambiguity²
One problem with URIs is that they don't uniquely
identify a single thing. Consider the following
hyperlinked statements:
Dare is a Georgia Tech alumnus.
Dare's website is valid XHTML.
In the above statements I use the URI
"http://www.25hoursaday.com" to identify both
myself and my web page. This is a bad thing for the
Semantic Web. If you read Aaron Swartz's excellent
primer on the Semantic Web you will notice where he
talks about RDF and its dependence on URIs,
specifically:
RDF gives you a way to make statements that
are machine-processable. Now the
computer can't actually "understand" what you
said, of course, but it can deal with it in a way
that makes it seem like it does. For example, I
could search the Web for all book reviews and
create an average rating for each book. Then, I
could put that information back on the Web.
Another website could take that information (the
list of book rating averages) and create a "Top
Ten Highest Rated Books" page.
RDF is really quite simple. An RDF statement
is a lot like a simple sentence, except that
almost all the words are URIs. Each RDF statement
has three parts: a subject, a predicate and an
object. Let's look at a simple RDF statement:
<http://aaronsw.com/>
<http://love.example.org/terms/reallyLikes>
<http://www.w3.org/People/Berners-Lee/Weaving/>
.
Can you guess what this says? The first URI is
the subject. In this instance, the subject is me.
The second URI is the predicate. It relates the
subject to the object. In this instance, the
predicate is "reallyLikes." The third URI is the
object. Here, the object is Tim Berners-Lee's
book "Weaving the Web." So the RDF statement
above says that I really like "Weaving the
Web."
Now consider changing his RDF example to:
<http://aaronsw.com/>
<http://love.example.org/terms/reallyLikes>
<http://www.25hoursaday.com/> .
Can you tell whether Aaron really likes my website
or me personally from the above RDF statement?
Neither can I. This inherent ambiguity is yet
another issue with the vision of the Semantic Web
and the current crop of Semantic Web technologies
that are overly dependent on URIs.
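To make the ambiguity concrete, here is a minimal sketch that models RDF statements as plain Python tuples rather than using a real RDF library. The two example.org predicate URIs and the objects attached to them are hypothetical, invented only to encode the two hyperlinked statements above.

```python
# Triples as plain (subject, predicate, object) tuples. The predicate URIs
# under example.org are hypothetical, invented for this sketch.
DARE = "http://www.25hoursaday.com/"

triples = [
    (DARE, "http://example.org/terms/alumnusOf", "http://www.gatech.edu/"),
    (DARE, "http://example.org/terms/conformsTo", "http://www.w3.org/TR/xhtml1/"),
    ("http://aaronsw.com/", "http://love.example.org/terms/reallyLikes", DARE),
]

def describe(uri, statements):
    """Collect every predicate asserted about one subject URI."""
    return sorted(p for s, p, o in statements if s == uri)

# Both the "person" claim and the "web page" claim land on the same node,
# so the object of reallyLikes cannot be resolved to one or the other.
about_dare = describe(DARE, triples)
print(about_dare)
```

A program consuming these statements sees one thing that is both a Georgia Tech alumnus and valid XHTML, and nothing in the data says which of the two Aaron really likes.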
Lessons Learned
Part of me feels there are several lessons to be
learned here, both from the problems caused by the
URI abstraction and from the potential problems
the XML Infoset could cause (a proliferation of
non-interoperable XML serialization formats), while
still embracing the benefits of useful
abstractions. However, I have to go to work, so
this early morning ramble will have to end here.
Disclaimer:
The above comments do not
represent the thoughts, intentions, plans or
strategies of my employer. They are solely my
opinion.