Can RDF really save us from data format proliferation?

January 12, 2009

@ 02:10 PM

Bill de hÓra has a blog post entitled Format Debt: what you can't say, where he writes:

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases.

I've always found this particular argument by RDF proponents to be suspect. When I complained about the lack of standards for representing rich media in Atom feeds, the thrust of the complaint was that you can't just plug a feed from Picasa into a service that understands how to process feeds from Zooomr without making changes to the service or the input feed.

RDF proponents often argue that if we all used RDF-based formats, then instead of having to change your code to support every new photo site's Atom feed with custom extensions, you could create a mapping from the format you don't understand to the one you do using something like the OWL Web Ontology Language. The problem with this argument is that there is already a declarative approach to mapping between XML data formats that doesn't require boiling the ocean by convincing everyone to switch to RDF: XSL Transformations (XSLT).

The key problem is that in both cases (i.e. mapping with OWL vs. mapping with XSLT) Picasa feeds still won't work with an app that understands Zooomr's feeds until some developer writes code. Thus we're really debating whether it is ~~better~~ cheaper to have the developer write declarative mappings like OWL or XSLT instead of writing new parsing code in their language of choice.

In my experience, creating a software system where you can drop in an XSLT, OWL or other declarative mapping document to deal with new data formats is cheaper and likely to be less error prone than having to alter parsing code written in C#, Python, Ruby or whatever. However, we don't need RDF or other Semantic Web technologies to build such a solution today. XSLT works just fine as a tool for solving exactly that problem.
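
As a rough illustration of the drop-in idea, the sketch below keeps one mapping per photo site as plain data and applies it at runtime, so supporting a new site means adding a mapping rather than editing parsing code. It uses JSON and Python's ElementTree as a stand-in for XSLT (the Python standard library has no XSLT processor). The site names, element names, and the `extract_entry` helper are all made up for the example and do not reflect any real feed format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping documents, one per photo site. In a real system these
# would live in separate files that can be added without touching the code.
MAPPINGS = json.loads("""
{
  "site-a": {"title": "title", "photo_url": "photoLink"},
  "site-b": {"title": "caption", "photo_url": "image/url"}
}
""")


def extract_entry(source, entry_xml):
    """Apply the mapping registered for `source` to one feed entry.

    Each mapping value is an ElementTree path relative to the entry element;
    each key is the field name the consuming service already understands.
    """
    mapping = MAPPINGS[source]
    entry = ET.fromstring(entry_xml)
    return {field: entry.findtext(path) for field, path in mapping.items()}


if __name__ == "__main__":
    sample = (
        "<entry><caption>Sunset</caption>"
        "<image><url>http://example.com/1.jpg</url></image></entry>"
    )
    print(extract_entry("site-b", sample))
```

Someone still has to write each mapping, which is the point of the argument above: the declarative document only changes where that work lives.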

Now Playing: Lady GaGa & Colby O'Donis - Just Dance

Categories: Syndication Technology | XML


Monday, 12 January 2009 15:31:55 (GMT Standard Time, UTC+00:00)

Parsing and formatting XML have been solved by a variety of solutions, including XSLT. XSL does not help in the modeling of what that XML represents. It would be a terrible format in that regard. RDF and OWL were designed specifically to solve the modeling problem (semantics), not the parsing/formatting problem. Once a standardized format like RDF is available to capture the meaning of the data in a portable way, the exporting of that meaning to any XML format is relatively trivial.

scott

Monday, 12 January 2009 16:01:05 (GMT Standard Time, UTC+00:00)

There are two distinct issues here. The first is solving the n*m problem, where n = format count and m = format producer and consumer count. The idea behind any common format or format mapping / shaping / semantic extraction is to reduce it to n*1 + 1*m, i.e. n+m.
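
To put rough numbers on the n*m versus n+m claim, here is a small sketch. The function names are mine, and it assumes the simplest reading: without a common format every format needs its own adapter for every producer or consumer, while with one each format and each party needs a single adapter to the hub.

```python
def adapters_pairwise(formats: int, parties: int) -> int:
    """Adapters needed when every party handles every format directly."""
    return formats * parties


def adapters_via_hub(formats: int, parties: int) -> int:
    """Adapters needed when everything maps to and from one common format."""
    return formats * 1 + 1 * parties


if __name__ == "__main__":
    for n, m in [(3, 4), (10, 20)]:
        print(f"n={n}, m={m}: pairwise={adapters_pairwise(n, m)}, "
              f"hub={adapters_via_hub(n, m)}")
```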

That's all well and good, but it has problems: lowest common denominator, semantic loss in conversion, inhibition of innovation (gated on lowest common denominator), barriers to entry (new producers want only to be concerned with their specifics, not a huge standard, while new consumers don't want to have to implement the world before being useful). Furthermore, the requirement to understand a meta-model before implementing the model itself is so large a barrier to non-specialists just trying to get their work done that it's highly unlikely to ever receive serious attention.

Official standards-body approaches are the other issue. Formats in an area of innovation act like a bubbling market; take-up of formats by consumers and producers determines the winner, until eventually network effects reduce the total number of formats to just a handful. The problem with trying to manage this market process via a committee is much like the problem of trying to implement socialism: the total information embodied in various choices of one format over another, which a free market reveals naturally, is not available to committees. Committees tend to be dominated by large market players which have various strategic and political objectives that may be quite distinct from, and indeed sometimes covertly opposed to, the average market desires for any given format.

The working programmer, at the end of the day, is either putting square pegs into square holes (in which case, no problem), or trying to put square pegs into round holes, and having to create an adapter to convert square pegs into round ones. An arbitrary selection of square or round as a standard doesn't necessarily help him for his specific needs; similarly, pointing at some generalized framework for describing the semantic meaning of square and round pegs respectively is far too abstract for him to get his job done efficiently - i.e. without investment whose cost exceeds the value of getting the original job done.

Barry Kelly

Monday, 12 January 2009 17:17:09 (GMT Standard Time, UTC+00:00)

Okay Dare, this one is near and dear to me - so I'll bite.

RDF is a knowledge modeling *syntax*. One of its notations happens to be XML (RDF/XML), but there are a few other notations as well that are actually less verbose and more readable than XML. RDF allows you to express knowledge in a machine interpretable way. For example, if I were to express the fact that you work at Microsoft in an RDF (Subject-Predicate-Object) way, I would write:

Subject: http://25hoursaday.com#Dare
Predicate: http://verbs.com/#works
Object: http://microsoft.com

Now, imagine I wanted to use a predicate of my own invention and I want that predicate to mean that you work *full-time*: not only that you work there, but that you work there at least 40 hours a week. I want machines to be able to interpret that you work there AND you're there at least 40 hours a week. So I create my own predicate (http://myowndomain.net#worksAtFullTime), then using OWL I can use *very granular logic constructs* to express that http://myowndomain.net#worksAtFullTime *inherits* from http://verbs.com/#works with the additional temporal constraints that a full-time employee has. Now, when an RDF parser comes along and wants to know where http://25hoursaday.com#Dare works (http://verbs.com/#works), it reads both the RDF document and the OWL document and *infers* that http://25hoursaday.com#Dare http://verbs.com/#works at http://microsoft.com.
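
Here is a minimal sketch of that inference in plain Python, with no RDF library involved. It assumes the "inherits" relationship is stated with rdfs:subPropertyOf (the RDF Schema term that OWL builds on) and it leaves out the 40-hours-a-week constraint entirely. The `infer_sub_properties` function and the tuple-based graph are my own illustration, not how any particular RDF toolkit works.

```python
WORKS = "http://verbs.com/#works"
WORKS_FULL_TIME = "http://myowndomain.net#worksAtFullTime"
SUB_PROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# The "RDF document" and the "OWL document" merged into one set of
# (subject, predicate, object) triples.
triples = {
    ("http://25hoursaday.com#Dare", WORKS_FULL_TIME, "http://microsoft.com"),
    (WORKS_FULL_TIME, SUB_PROPERTY_OF, WORKS),
}


def infer_sub_properties(graph):
    """Return the graph plus every triple implied by subPropertyOf.

    Rule applied until nothing new appears:
    if (p subPropertyOf q) and (s p o) are present, add (s q o).
    """
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for sub, rel, sup in list(inferred):
            if rel != SUB_PROPERTY_OF:
                continue
            for s, p, o in list(inferred):
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


if __name__ == "__main__":
    closure = infer_sub_properties(triples)
    answer = ("http://25hoursaday.com#Dare", WORKS, "http://microsoft.com")
    print(answer in closure)
```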

OWL allows for custom inferences and setting up machine interpretable object hierarchies (dogs are mammals, mammals are animals, animals consume oxygen).

Comparing XSLT to OWL is a really bad comparison. It's like comparing XML to UTF-8; it doesn't really even make sense.

So to answer your question Dare - RDF can *indeed* save us from data format proliferation and data format fragmentation. Bill de hÓra is dead on.

Josh Jonte

Monday, 12 January 2009 18:00:52 (GMT Standard Time, UTC+00:00)

Josh,
The question isn't how sophisticated a mapping a developer has to create. The issue I'm pointing out is that either way a developer has to create a mapping. Whether the mapping is a syntactic or semantic mapping is beside the point. However, as Barry pointed out, it seems true that performing a semantic mapping carries more cognitive overhead than doing a syntactic mapping.

Dare Obasanjo

Monday, 12 January 2009 19:12:11 (GMT Standard Time, UTC+00:00)

It seems to me one of the main differences is the granularity of this mapping. I'm not aware of anyone anywhere defining little XSLT fragments that map some individual element of one format to another. Yet that seems to be the idea behind decentralized OWL definitions – I could import tiny knowledge snippets, such as "what Google calls X is the same as what Yahoo! calls Y", and combine them to make more sense of the data I have.

That sounds reasonable to me. On some meta-level, though I'm not sure exactly how, it also seems related to the idea of having reusable things like <link rel= ... /> in more than one vocabulary.
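
A rough sketch of what combining such snippets could look like, in plain Python rather than OWL. Each snippet is just a pair of terms declared equivalent, possibly published by different parties, and a small union-find merges them into equivalence classes. The term names ("google:X", "yahoo:Y", "example:Z") and the `merge_equivalences` helper are hypothetical.

```python
def merge_equivalences(snippets):
    """Map every term to one canonical representative of its class.

    `snippets` is an iterable of (term_a, term_b) pairs, each asserting
    that the two terms mean the same thing.
    """
    parent = {}

    def find(term):
        # Follow parent links to the root, shortening the path as we go.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for left, right in snippets:
        parent[find(left)] = find(right)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), []).append(term)

    # Pick the alphabetically smallest member so the result is deterministic.
    return {term: min(members)
            for members in groups.values() for term in members}


if __name__ == "__main__":
    snippets = [
        ("google:X", "yahoo:Y"),   # published by one party
        ("yahoo:Y", "example:Z"),  # published independently by another
    ]
    canonical = merge_equivalences(snippets)
    # Nobody stated that google:X and example:Z match; it follows from the two snippets.
    print(canonical["google:X"] == canonical["example:Z"])
```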

Stefan Tilkov

Monday, 12 January 2009 20:05:00 (GMT Standard Time, UTC+00:00)

@Stefan: I like that explanation, but where do those little pieces of knowledge come from? The source of the data may not care to include those. Ultimately, it becomes an integration/mapping exercise for the consumer of the data, which needs to run each representation it receives through these snippets to make sense of the data. What kind of programming would that entail?

Subbu Allamaraju

Monday, 12 January 2009 21:04:45 (GMT Standard Time, UTC+00:00)

I see what you’re saying and I agree with you that you’ll always have to create a mapping. Anytime something is decentralized you need to create some kind of taxonomy that correlates one “thing” to another “thing”.
I guess it’s the thought of using XSLT as a taxonomy definition that I take issue with. Using XSLT you only create direct mappings from one node type to another, and then in your developer documentation (be it in HTML or the XSL document) you explain what those tags mean to other developers.

For example, let’s say I have a service that exposes contacts using the following constructs (I'm not sure if your comments allow tags, so I squared the brackets):

[contacts]
[contact type="person"]
[name]Dare Obasanjo[/name]
[email]dare@25hoursaday.com[/email]
[/contact]
[contact type="org"]
[name]Pizza Hut[/name]
[phone]2125551212[/phone]
[/contact]
[/contacts]

Now, you have a service that wants to massage that XML into your own custom format that looks like this:

[contacts]
[person]
[name]Dare Obasanjo[/name]
[communicationMediums]
[communicationMedium type="e-mail"]dare@25hoursaday.com[/communicationMedium]
[/communicationMediums]
[/person]
[organization]
[name]Pizza Hut[/name]
[communicationMediums]
[communicationMedium type="telephone"]2125551212[/communicationMedium]
[/communicationMediums]
[/organization]
[/contacts]

In order to use XSL, you are going to have to use all sorts of Turing-complete constructs and string parsing to map "email" to "e-mail" and "phone" to "telephone" and to emit the appropriate tag names. You will need several templates. The XSLT to accomplish this is a big hairy monster.
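
To make concrete what a syntactic mapping has to do here, this is a rough sketch in Python with ElementTree. It is not the XSLT itself, and it uses real angle brackets instead of the squared ones above. The two lookup tables and the `reshape_contacts` function are my own naming; the element and attribute names come from the two examples.

```python
import xml.etree.ElementTree as ET

SOURCE = """<contacts>
  <contact type="person">
    <name>Dare Obasanjo</name>
    <email>dare@25hoursaday.com</email>
  </contact>
  <contact type="org">
    <name>Pizza Hut</name>
    <phone>2125551212</phone>
  </contact>
</contacts>"""

# Source "type" attribute -> target element name.
CONTACT_TYPES = {"person": "person", "org": "organization"}
# Source child element name -> target "type" attribute value.
MEDIUM_TYPES = {"email": "e-mail", "phone": "telephone"}


def reshape_contacts(xml_text):
    """Rebuild the first contact format as the second one."""
    source_root = ET.fromstring(xml_text)
    target_root = ET.Element("contacts")
    for contact in source_root.findall("contact"):
        target = ET.SubElement(target_root, CONTACT_TYPES[contact.get("type")])
        ET.SubElement(target, "name").text = contact.findtext("name")
        mediums = ET.SubElement(target, "communicationMediums")
        for child in contact:
            if child.tag in MEDIUM_TYPES:
                medium = ET.SubElement(
                    mediums, "communicationMedium", type=MEDIUM_TYPES[child.tag]
                )
                medium.text = child.text
    return target_root


if __name__ == "__main__":
    print(ET.tostring(reshape_contacts(SOURCE), encoding="unicode"))
```

Note that the name changes live in two lookup tables, but the structural knowledge (what counts as a communication medium, where it nests) is baked into the function.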

Now, to do this in RDF and OWL, you would only need to create an OWL document defining that "person" and "organization" inherit from "contact" and that "phone" and "email" are each a "communicationMedium". With RDF and OWL there is no string parsing. Turing constructs are left out of the OWL designer's toolbox. Because the information is encoded in its most elementary form (S-P-O), all mappings are funneled down into predicate mappings.

I would agree with Barry: semantic mappings carry more cognitive overhead than syntactic mappings. But that mapping is what allows the machines to do the inferencing and reasoning. You're encoding the inferences and knowledge into the RDF and OWL documents so machines can leverage *your* cognition.

RDF allows for machine-reasoning. It's the difference between a human translator and Google Translate. Which would you rather have?

Josh Jonte

Monday, 12 January 2009 23:54:36 (GMT Standard Time, UTC+00:00)

RDF doesn't free the machine from having to have coupled semantic meaning; in that way it's little better than most specs.

Where it deviates from things like XML, in my mind, is that as a format it's humanly explorable. The graph form, the relationship mapping, is intrinsically a more discoverable and extensible form. It also invites interesting possibilities for "sets" of data that can span across distributed data sources, whereas I find XML to be more of the "here is your document" variety. Without hesitation, I'd say my proof-is-in-the-pudding argument for this is the lack of UPDATE/PATCH semantics for XML: if you want a knowledge pool, you have to bake it into your spec. Relationships and knowledge pools are a damned fine semantic to have, particularly at the base of your semantics, whereas XML is tailored towards hierarchical and authoritative data.

rektide

Wednesday, 14 January 2009 13:34:17 (GMT Standard Time, UTC+00:00)

I haven't used OWL, but I've used XSLT and I didn't like it at all.

First, XSLT is really hard to read, both due to syntax and due to the way it works. It doesn't help me modularize my code base. If I want to build a reusable, maintainable library of transformations I have to invent my own way of doing this, and it's not going to be the same one other people use (as far as they are aware of the issue in the first place).

Second, XSLT doesn't help in letting me guarantee the validity of transformation results. I can't write XSLT and then statically check the potential transformation results against a DTD or schema, or something else. Again I have to kludge around by inventing and consistently applying some specific restrictions on how I choose to write my XSLT, with no help from tools.

So from a code manageability viewpoint XSLT is a disaster. It's back to the 60s: close your eyes, code away, run, and pray. RDF technology is probably a little better in this respect.

Reinier Post


Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson
