Using the Right Tool for the Job
Kirk Allen Evans has a post on
using XSLT to convert CSV to XML. Although the
stylesheet shows off some cool XSLT tricks, it is
probably the most convoluted way to go about
solving this problem; it fails to utilize the
primary strengths of XSLT and instead highlights
its weaknesses.
XSLT is primarily designed for processing XML,
specifically converting XML input to XML output. It
deals very poorly with text processing, which is
evidenced by the large, complex stylesheet that is
needed to perform the equivalent of the following
lines of C# code:
using System;
using System.IO;
using System.Xml;

class Test {
    public static void Main(string[] args) {
        // Sample input: one CSV record per line of the verbatim string.
        string csv = @"1,5 main st, ,Cumming, GA, 30040,Kirk Evans
2,13 elm st, ,Anywhere, NJ, 07825,Bob Smith";

        StringWriter writer = new StringWriter();
        XmlTextWriter xmlWriter = new XmlTextWriter(writer);
        xmlWriter.Formatting = Formatting.Indented;

        xmlWriter.WriteStartElement("root");
        // One <row> per line, one <elem> per comma-separated field.
        foreach (string row in csv.Split(new char[] { '\n' })) {
            xmlWriter.WriteStartElement("row");
            foreach (string elem in row.Split(new char[] { ',' })) {
                // Trim() drops padding spaces and any stray '\r'.
                xmlWriter.WriteElementString("elem", elem.Trim());
            }
            xmlWriter.WriteEndElement(); // closes <row>
        }
        xmlWriter.WriteEndElement(); // closes <root>

        Console.WriteLine(writer.ToString());
    }
}
Both approaches produce the same results except
that the XSLT approach can be understood by a
handful of people while the above approach can be
understood by anyone with formal programming
experience and a rudimentary grasp of XML.
Of course, no discussion of materializing CSV as
XML is complete without a pointer to Chris Lovett's
XmlCsvReader, which allows one to read a CSV
file as if it were an XML document.
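To make the idea of reading CSV "as if it were XML" concrete, here is a minimal sketch, written in Python only for brevity. It is not XmlCsvReader's API: the function name csv_as_xml_events and the event tuples are hypothetical, invented for illustration, and the root/row/elem names are simply borrowed from the C# sample above.

```python
import csv
import io


def csv_as_xml_events(text, root="root", row="row", field="elem"):
    """Yield ("start", name), ("text", value) and ("end", name) tuples.

    Hypothetical illustration only: this mimics the shape of an XML pull
    reader over CSV input and is not XmlCsvReader's actual API.
    """
    yield ("start", root)
    for record in csv.reader(io.StringIO(text)):
        yield ("start", row)
        for value in record:
            yield ("start", field)
            yield ("text", value.strip())  # same trimming as the C# sample
            yield ("end", field)
        yield ("end", row)
    yield ("end", root)


if __name__ == "__main__":
    sample = (
        "1,5 main st, ,Cumming, GA, 30040,Kirk Evans\n"
        "2,13 elm st, ,Anywhere, NJ, 07825,Bob Smith"
    )
    for event in csv_as_xml_events(sample):
        print(event)
```

A consumer written against such an event stream never needs to know that the underlying data was comma-separated text, which is the appeal of the approach.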
Data Modelling With XML
Data modelling for XML formats is still mostly a
black art, unlike relational data modelling, which
has techniques like normalization that guide people
toward the right way to represent their data.
Christian Romney has a post on designing XML data
formats which shows his heart is in the right
place but whose conclusions I disagree with.
Basically Christian states given a choice between
<NumberOfRooms>424</NumberOfRooms>
<Rise>Mid</Rise>
<MaximumOccupancy>3 (3 adults/2 children)</MaximumOccupancy>
and
<hotelInformation>
  <item name="Number of Rooms">424</item>
  <item name="Rise">Mid</item>
  <item name="Maximum Occupancy">3 (3 adults/2 children)</item>
</hotelInformation>
one should pick the latter. There are many reasons
that the latter is a worse choice, from the
perspective of both human readability and
programmatic interaction. However, before I mention
those I'd like to look at the reason Christian
thinks the latter is a better format.
Christian states "The reason the second
implementation is *better* is that the addition of
an additional element (a new piece of information)
does not require any change to the XSL
stylesheet". This is true only because of the
design of his stylesheet. On the other hand, if he
used the following XSLT stylesheet
<xsl:template match="hotelInformation">
  <table>
    <xsl:for-each select="*">
      <tr>
        <td><xsl:value-of select="local-name()" /></td>
        <td><xsl:value-of select="." /></td>
      </tr>
    </xsl:for-each>
  </table>
</xsl:template>
additions to the former format would not require
any changes to the stylesheet. The only thing of
note about my stylesheet fragment is that I assumed
the former format has a root element named
hotelInformation, as the latter does (otherwise it
wouldn't be well-formed XML).
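The same "iterate over whatever children exist" idea carries over to ordinary code. The sketch below uses Python's standard ElementTree and assumes, as the stylesheet does, that the former format is wrapped in a hotelInformation root; the function name table_rows and the sample data are illustrative only.

```python
import xml.etree.ElementTree as ET

# The former format, wrapped in the assumed hotelInformation root element.
FORMER = """<hotelInformation>
  <NumberOfRooms>424</NumberOfRooms>
  <Rise>Mid</Rise>
  <MaximumOccupancy>3 (3 adults/2 children)</MaximumOccupancy>
</hotelInformation>"""


def table_rows(xml_text):
    """Return (local name, text) for every child of the root element."""
    root = ET.fromstring(xml_text)
    rows = []
    for child in root:
        # ElementTree writes namespaced tags as "{uri}local"; keep only the
        # local part, which is what local-name() returns in XSLT.
        local_name = child.tag.rpartition("}")[2]
        # Join all descendant text, like xsl:value-of select="."
        # (whitespace is stripped here purely for tidiness).
        rows.append((local_name, "".join(child.itertext()).strip()))
    return rows


if __name__ == "__main__":
    for name, value in table_rows(FORMER):
        print(name, "|", value)
```

Adding a new child element to the input produces a new row without touching table_rows, which is the same resistance to change the stylesheet fragment has.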
Given that both formats are similarly resistant to
change, for what reasons would one pick the former
over the latter? Let me count the ways. The former
is less verbose and less cluttered with redundant
information, making it easier to read; the former
format and its stylesheet can use XML namespaces
for extensibility with ease while the latter
cannot; and finally, when programmatically
processing XML it is more cumbersome to take
actions based on names stored in attributes than it
is if they are based on element names (this is
especially true with SAX).
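The SAX point can be illustrated with a rough sketch using Python's standard xml.sax module. The two handler classes, the helper rooms_from and the sample documents are hypothetical names made up for this comparison; both handlers extract the number of rooms, one from each format.

```python
import xml.sax
from xml.sax.handler import ContentHandler

FORMER = (b"<hotelInformation><NumberOfRooms>424</NumberOfRooms>"
          b"<Rise>Mid</Rise></hotelInformation>")
LATTER = (b'<hotelInformation><item name="Number of Rooms">424 </item>'
          b'<item name="Rise">Mid</item></hotelInformation>')


class ElementNameHandler(ContentHandler):
    """Former format: the element name alone identifies the field."""

    def __init__(self):
        super().__init__()
        self.rooms = None
        self._text = []

    def startElement(self, name, attrs):
        self._text = []  # start collecting text for this element

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "NumberOfRooms":
            self.rooms = int("".join(self._text))


class AttributeNameHandler(ContentHandler):
    """Latter format: every field is an <item>, so the handler must carry
    the name attribute from startElement over to endElement itself."""

    def __init__(self):
        super().__init__()
        self.rooms = None
        self._text = []
        self._item_name = None

    def startElement(self, name, attrs):
        self._text = []
        # endElement never sees attributes, so remember the name here.
        self._item_name = attrs.get("name") if name == "item" else None

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "item" and self._item_name == "Number of Rooms":
            self.rooms = int("".join(self._text))
        self._item_name = None


def rooms_from(xml_bytes, handler):
    """Parse xml_bytes with the given handler and return its rooms value."""
    xml.sax.parseString(xml_bytes, handler)
    return handler.rooms


if __name__ == "__main__":
    print(rooms_from(FORMER, ElementNameHandler()))
    print(rooms_from(LATTER, AttributeNameHandler()))
```

The second handler needs extra state only because the end-of-element callback receives the element name but not its attributes; with the former format the name passed to the callback is already enough to decide what to do.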
QNames in Content
One of the most aggravating problems that has
arisen from the existence of the Namespaces in XML
recommendation is the use of Qualified Names
(QNames) as Identifiers in Content. QNames in
content have proven to be very useful for XSLT and
W3C XML Schema, but they cause a number of problems
when processing such documents and in certain edge
cases. An example of such a problem is how to emit
attributes whose values are QNames into another XML
document. Gudge had that problem and found a
solution. A nice bit of XSLT hackery.
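Why copying such an attribute is a problem can be seen in a small sketch. This is not Gudge's XSLT solution; it is an illustration in Python using the standard ElementTree iterparse namespace events. The function name resolve_qname_attrs, the field element and the type attribute are hypothetical examples. The point is that the meaning of a value like xs:int depends on namespace declarations that live outside the attribute, and a parser will not carry them along for you.

```python
import io
import xml.etree.ElementTree as ET


def resolve_qname_attrs(xml_bytes, attr_name):
    """Resolve QName-valued attributes against in-scope namespace prefixes.

    Returns a list of (element tag, namespace URI or None, local name).
    """
    scopes = []   # in-scope (prefix, uri) declarations, innermost last
    results = []
    events = ("start-ns", "start", "end-ns")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            scopes.append(payload)      # payload is a (prefix, uri) tuple
        elif event == "end-ns":
            scopes.pop()                # a declaration went out of scope
        else:
            value = payload.get(attr_name)
            if value is None:
                continue
            prefix, _, local = value.rpartition(":")
            # Innermost declaration of the prefix wins; None if unbound.
            uri = next((u for p, u in reversed(scopes) if p == prefix), None)
            results.append((payload.tag, uri, local))
    return results


if __name__ == "__main__":
    original = (b'<root xmlns:xs="http://www.w3.org/2001/XMLSchema">'
                b'<field name="age" type="xs:int"/></root>')
    # The same attribute copied verbatim into a document that never
    # declares the xs prefix: the QName can no longer be resolved.
    copied = b'<copy><field name="age" type="xs:int"/></copy>'
    print(resolve_qname_attrs(original, "type"))
    print(resolve_qname_attrs(copied, "type"))
```

In the copied document the attribute value is byte-for-byte identical, yet its prefix is unbound, so whatever emits the attribute also has to emit a matching namespace declaration.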
--
Get yourself a News Aggregator and subscribe to my
RSS feed.
Disclaimer: The above comments do not represent the
thoughts, intentions, plans or strategies of my
employer. They are solely my opinion.