There are several different data models for XML, even within the W3C. Each of these data models has a different idea of what constitutes a node or, more generally, a significant item in an XML document. The XPath 1.0 data model has 7 node types (root, element, attribute, namespace, text, comment and processing instruction), which match the type and number of nodes in the XQuery data model, except that the root node is renamed to the document node to more accurately reflect the fact that it represents the entire XML document.

On the other hand, the W3C Document Object Model has 12 node types (document, element, attribute, text, comment, processing instruction, CDATA section, entity, entity reference, doctype, notation, and document fragment).

What tends to cause confusion is mixing data models, as is the case when performing XPath over the DOM. In such cases discrepancies between the data models may cause problems or lead to some confusion. The following example illustrates such a point of confusion:

using System;
using System.Xml;

class Test{


 public static void Main(string[] args){

   XmlDocument doc = new XmlDocument();
   doc.LoadXml("<root>Sam <![CDATA[ I ]]> Am</root>");

   Console.WriteLine(doc.SelectNodes("/root/child::node()").Count);

   foreach(XmlNode xn in doc.SelectNodes("/root/child::node()")){
     Console.WriteLine(xn.OuterXml);
   }

 }
}
Now the question is what should the output of the program be?
  A. 3
    Sam <![CDATA[ I ]]> Am

  B. 1
    Sam I Am

  C. 1
    Sam

Contrary to most expectations, the answer is C.

From a DOM perspective, answer A seems obvious because the root element does have three DOM nodes as children: a text node containing the string "Sam", the CDATA section, and another text node containing the string "Am". The problem with A is that the XPath data model does not have CDATA sections, so an XmlCDataSection instance cannot be returned by an XPath query.
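To see the DOM's view of the same document without involving XPath at all, here is a minimal sketch that walks the children of the root element directly. The class name DomChildren is just an illustrative choice; the document string is the one from the example above.

```csharp
using System;
using System.Xml;

class DomChildren {

  public static void Main(string[] args) {

    XmlDocument doc = new XmlDocument();
    doc.LoadXml("<root>Sam <![CDATA[ I ]]> Am</root>");

    // Enumerate the DOM children of <root> directly; no XPath is evaluated here.
    foreach (XmlNode child in doc.DocumentElement.ChildNodes) {
      // NodeType distinguishes Text from CDATA; Value is the raw character data.
      Console.WriteLine("{0}: [{1}]", child.NodeType, child.Value);
    }
  }
}
```

This should list three children: a Text node, a CDATA node, and another Text node, which is exactly the structure that makes answer A look so tempting.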

B seems like the logical answer because the XPath data model explicitly states that CDATA sections are removed and adjacent text nodes are merged. The problem with B is that the original document did not contain a text node containing the string "Sam I Am", so the XPath query would have to create a new node. Even worse, one wonders what happens when an attempt is made to access the ParentNode property of the returned XmlNode object. Should it point at the original root element in the DOM even though the newly created node is technically not one of its child nodes?

C is the compromise answer. It returns something that makes sense in the XPath data model (a text node) but acts only as a selection of a child node of the root element, without creating a brand new DOM node whose parentage is questionable.
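If you want to check the parentage question on whatever version of the framework you happen to be running, a rough sketch like the following compares the node returned by the XPath query against the DOM's own first child. The class name ParentCheck is made up for illustration, and the sketch deliberately makes no claim about what a given runtime prints.

```csharp
using System;
using System.Xml;

class ParentCheck {

  public static void Main(string[] args) {

    XmlDocument doc = new XmlDocument();
    doc.LoadXml("<root>Sam <![CDATA[ I ]]> Am</root>");

    // First node matched by the same XPath expression used in the example.
    XmlNode selected = doc.SelectSingleNode("/root/child::node()");

    // Is the result an existing DOM node, or something newly created?
    Console.WriteLine(Object.ReferenceEquals(selected, doc.DocumentElement.FirstChild));

    // Does its parent point back at the original root element?
    Console.WriteLine(Object.ReferenceEquals(selected.ParentNode, doc.DocumentElement));
  }
}
```

If both comparisons come back true, the query handed back a node that already lives in the DOM, which is the behavior answer C describes.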

I love my job. :)
 
