Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought on the appropriateness of using microformats
over XML formats. The real-world test case I use when thinking about
choosing microformats over XML is whether, rather than
having an HTML web page for my blog and an Atom/RSS feed, I should
instead have a single HTML page with <div class="rss:item">
or <h3 class="atom:title"> embedded in it. To me this seems like
a gross hack but I've seen lots of people comment on how this seems
like a great idea to them.
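To make the test case concrete, here is a minimal sketch of what a consumer of such a page would have to do in place of parsing a feed. The class names rss:item and atom:title come from the example above; the sample page, the TitleExtractor class and the extract_titles function are hypothetical names made up for illustration, and a real consumer would need to be far more tolerant of broken markup.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of every element whose class list contains a target name."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.titles = []
        self._capture_tag = None  # tag name of the element currently being captured
        self._depth = 0           # nesting depth of same-named tags inside it
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._capture_tag is not None:
            # Already inside a title; only track nesting of the same tag name.
            if tag == self._capture_tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._capture_tag = tag
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._capture_tag is not None and tag == self._capture_tag:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._buffer).strip())
                self._capture_tag = None

    def handle_data(self, data):
        if self._capture_tag is not None:
            self._buffer.append(data)


def extract_titles(html, target_class="atom:title"):
    parser = TitleExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical page: header and blogroll are the "fluff" a feed would omit.
SAMPLE_PAGE = """
<html><body>
<div id="header">My Blog <a href="/about">About</a></div>
<div class="rss:item">
  <h3 class="atom:title">First <em>post</em></h3>
  <p>Body text</p>
</div>
<div class="rss:item">
  <h3 class="entry atom:title">Second post</h3>
</div>
<div id="blogroll"><h3>Links</h3></div>
</body></html>
"""

print(extract_titles(SAMPLE_PAGE))
```

Note that the consumer still has to download and walk the header and blogroll markup just to find the two titles.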
Given that I hadn't encountered universal disdain for this idea, I
decided to explore further and look for technical arguments for and
against both approaches.
I found quite a few discussions of how and why microformats came about in articles such as The Microformats Primer in the Digital Web Magazine and Introduction to Microformats
in the Microformats wiki. However, I hadn't seen many in-depth technical
arguments for why they were better than XML formats until
recently.
In a comment responding to my post Microformats vs. XML: Was the XML Vision Wrong?, Mark Pilgrim wrote:
Before microformats had a home page, a blog, a wiki, a charismatic leader, and a
cool name, I was against using XHTML for syndication for a number of reasons.
http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication
I had several basic arguments:
1. XHTML-based syndication
required well-formed semantic XHTML with a particular structure, and was
therefore doomed to failure. My experience in the last 3+ years with both feed
parsing and microformats parsing has convinced me that this was incredibly naive
on my part. Microformats may be *easier* to accomplish with semantic XHTML (just
like accessibility is easier in many ways if you're using XHTML + CSS), but you
can embed structured data in really awful existing HTML markup, without
migrating to "semantic XHTML" at all.
2. Bandwidth. Feeds are generally
smaller than their corresponding HTML pages (even full content feeds), because
they don't contain any of the extra fluff that people put on web pages (headers,
footers, blogrolls, etc.) And feeds only change when actual content changes,
whereas web pages can change for any number of valid reasons that don't involve
changes to the content a feed consumer would be interested in. This is still
valid, and I don't see it going away anytime soon.
3. The
full-vs-partial content debate. Lots of people who publish full content on web
pages (including their home page) want to publish only partial content in feeds.
The rise of spam blogs that automatedly steal content from full-content feeds
and republish them (with ads) has only intensified this debate.
4. Edge
cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think
these can be handled by microformats or successfully ignored. For example,
machine-readable dates can be encoded in the title attribute of the
human-readable date. Hand-crafted summaries can be published on web pages and
marked up appropriately. Feed-only content can just be ignored; few people do it
and it goes against one of the core microformats principles that I now agree
with: if it's not human-readable in a browser, it's worthless or will become
worthless (out of sync) over time.
I tend to agree with Mark's conclusions. The main issue with using
microformats for syndication instead of RSS/Atom feeds is wasted
bandwidth since web pages tend to contain more stuff than feeds and
change more often.
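Mark's point about encoding machine-readable dates in the title attribute of the human-readable date can be sketched roughly as follows. This is only an illustration: the class name dtstart is borrowed from hCalendar and assumed here to mark the date, the ISO 8601 value in the title attribute is an assumption about how the publisher formats it, and DateExtractor and extract_dates are hypothetical names.

```python
from datetime import datetime
from html.parser import HTMLParser


class DateExtractor(HTMLParser):
    """Reads machine-readable dates from the title attribute of <abbr> elements."""

    def __init__(self, date_class):
        super().__init__()
        self.date_class = date_class
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        title = attr_map.get("title")
        if tag == "abbr" and self.date_class in classes and title:
            # Assumes the title holds an ISO 8601 timestamp without a zone suffix.
            self.dates.append(datetime.fromisoformat(title))


def extract_dates(html, date_class="dtstart"):
    parser = DateExtractor(date_class)
    parser.feed(html)
    parser.close()
    return parser.dates


# The visible text is for people; the title attribute is for programs.
SAMPLE = (
    '<p>Posted <abbr class="dtstart" title="2005-10-03T14:30:00">'
    "October 3rd, 2005</abbr></p>"
)

print(extract_dates(SAMPLE))
```

The human-readable text can be in any language or style, since the parser never looks at it.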
Norm Walsh raises a few other good points on the trade-offs being made when choosing microformats over XML in his post Supporting Microformats, where he writes:
Microformats (and architectural forms, and all the other
names under which this technique has been invented) take this one step
further by standardizing some of these attribute values and possibly
even some combination of element types and attribute values in one or
more content models.
This technique has some stellar advantages: it's relatively
easy to explain and the fallback is natural and obvious, new code can
be written to use this “extra” information without any change being
required to existing applications, they just ignore it.
Despite how compelling those advantages are, there are some
pretty serious drawbacks associated with microformats as well. Adding
hCalendar support to my itineraries page reinforced several of them.
- They're not very flexible. While I was able to add hCalendar
to the overall itinerary page, I can't add it to the individual pages
because they don't use the right markup. I'm not using <div> and <span>
to markup the individual appointments, so I can't add hCalendar to them.
- I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class
attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.
- They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron
(and other rule-based validation languages) can do it, but most of us
are used to using grammar-based validation on a daily basis and we're
likely to forget the extra step of running Schematron validation.
It's interesting that RELAX NG
can almost, but not quite, do it. RELAX NG has no difficulty
distinguishing between two patterns based on an attribute value, but
you can't use those two patterns in an interleave pattern. So the
general case, where you want to say that the content of one of these
special elements is “an <abbr> with class="dtstart" interleaved
with an <abbr> with class="dtend"
interleaved with…”, you're out of luck. If you can limit the content to
something that doesn't require interleaving, you can use RELAX NG for
your particular application, but most of the microformats I've seen use
interleaving in the general case.
Is validation really important? Well, I have well over a
decade of experience with markup languages at this point and I was
reminded just last week that I can't be relied upon to write a simple
HTML document without markup errors if I don't validate it. If they
can't be validated, they will often be incorrect.
The complexity of validating microformats isn't something I'd
considered in my original investigation but is a valid point. As a developer of an
RSS aggregator, I've found the existence of
the Feed Validator to be an
immense help in tracking down issues. Not having the luxury of being
able to validate feeds would make building an aggregator a lot harder
and a lot less fun.
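For a sense of what the rule-based alternative involves, here is a rough sketch of a check in the spirit of what Norm describes, written as plain code rather than in Schematron. The rule itself (every element with class vevent must contain descendants with classes dtstart and summary) is an assumption chosen for illustration and not a complete statement of the hCalendar rules; EventChecker and validate_events are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EventChecker(HTMLParser):
    """Checks that each root element contains descendants with the required classes."""

    def __init__(self, root_class, required):
        super().__init__()
        self.root_class = root_class
        self.required = set(required)
        self.errors = []
        self._stack = []  # open elements as (tag, seen); seen is None for non-roots

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        # Credit any required class to every enclosing root element.
        for _, seen in self._stack:
            if seen is not None:
                seen.update(classes & self.required)
        if tag in VOID_TAGS:
            return
        seen = set() if self.root_class in classes else None
        self._stack.append((tag, seen))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self._stack):
            return  # stray end tag; ignore it
        # Pop until the matching tag, which tolerates unclosed inner elements.
        while self._stack:
            open_tag, seen = self._stack.pop()
            self._report(seen)
            if open_tag == tag:
                break

    def _report(self, seen):
        if seen is None:
            return
        missing = self.required - seen
        if missing:
            self.errors.append(
                f"{self.root_class} is missing: {', '.join(sorted(missing))}"
            )

    def finish(self):
        # Report on any root elements that were never closed.
        while self._stack:
            self._report(self._stack.pop()[1])


def validate_events(html, root_class="vevent", required=("dtstart", "summary")):
    checker = EventChecker(root_class, required)
    checker.feed(html)
    checker.close()
    checker.finish()
    return checker.errors


GOOD = ('<div class="vevent"><span class="summary">Talk</span> '
        '<abbr class="dtstart" title="2005-10-03">Oct 3</abbr></div>')
BAD = ('<div class="vevent"><span class="summary">Talk</span> '
       '<abbr class="dtend" title="2005-10-04">Oct 4</abbr></div>')

print(validate_events(GOOD))
print(validate_events(BAD))
```

A check like this does not care what order the marked-up elements appear in, which is exactly the interleaving case that trips up grammar-based validation. The cost is that it is custom code that has to be written and run as a separate step.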
I'll continue to pay attention to this discussion, but for now microformats will remain in the "gross hack" bucket for me.