XML stands for eXtensible Markup Language. XML is a meta-markup language developed by the World Wide Web Consortium (W3C) to deal with a number of the shortcomings of HTML. As more and more functionality was added to HTML to account for the diverse needs of users of the Web, the language grew increasingly complex and unwieldy. The need for a way to create domain-specific markup languages that did not contain all the cruft of HTML became increasingly apparent, and XML was born.
The main difference between HTML and XML is that whereas in HTML the semantics and syntax of tags are fixed, in XML the author of the document is free to create tags whose syntax and semantics are specific to the target application. The semantics of a tag are not tied down but instead depend on the context of the application that processes the document. The other significant difference between HTML and XML is that an XML document must be well-formed (every element is closed, elements nest properly and attribute values are quoted).
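As a minimal sketch of what the well-formedness requirement means in practice, the Python snippet below uses the standard library's xml.etree.ElementTree parser to test two strings. The helper name is_well_formed and the sample strings are illustrative assumptions, not part of the original text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested and closed elements: well-formed.
good = '<p>Some <b>bold</b> text<br/></p>'
# HTML-style markup with an unclosed <br> and overlapping tags: not well-formed.
bad = '<p>Some <b>bold text<br></p></b>'

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

A lenient HTML parser would typically accept the second string, whereas an XML parser is required to reject it.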
Although the original purpose of XML was as a way to mark up content, it became clear that XML also provided a way to describe structured data thus making it important as a data storage and interchange format. XML provides many advantages as a data format over others, including:
Since XML is a way to describe structured data, there should be a means to specify the structure of an XML document. Document Type Definitions (DTDs) and XML Schemas are different mechanisms used to specify the valid elements that can occur in a document and the order in which they can occur, and to constrain certain aspects of these elements. An XML document that conforms to a DTD or schema is considered to be valid. Below is a listing of the different means of constraining the contents of an XML document.
SAMPLE XML FRAGMENT
<gatech_student gtnum="gt000x">
<name>George Burdell</name>
<age>21</age>
</gatech_student>
DTD FOR SAMPLE XML FRAGMENT
<!ELEMENT gatech_student (name, age)>
<!ATTLIST gatech_student gtnum CDATA #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
The DTD specifies that the gatech_student element has two child elements, name and age, that contain character data as well as a gtnum attribute that contains character data.
XDR FOR SAMPLE XML FRAGMENT
<Schema name="myschema" xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<ElementType name="age" dt:type="ui1" />
<ElementType name="name" dt:type="string" />
<AttributeType name="gtnum" dt:type="string" />
<ElementType name="gatech_student" order="seq">
<element type="name" minOccurs="1" maxOccurs="1"/>
<element type="age" minOccurs="1" maxOccurs="1"/>
<attribute type="gtnum" />
</ElementType>
</Schema>
The above schema specifies types for a name element that contains a string as its content, an age element that contains an unsigned integer value of size one byte (i.e. between 0 and 255), and a gtnum attribute that is a string value. It also specifies a gatech_student element that has one occurrence each of a name and an age element in sequence as well as a gtnum attribute.
XSD FOR SAMPLE XML FRAGMENT
<schema xmlns="http://www.w3.org/2001/XMLSchema" >
<element name="gatech_student">
<complexType>
<sequence>
<element name="name" type="string"/>
<element name="age" type="unsignedInt"/>
</sequence>
<attribute name="gtnum">
<simpleType>
<restriction base="string">
<pattern value="gt\d{3}[A-Za-z]{1}"/>
</restriction>
</simpleType>
</attribute>
</complexType>
</element>
</schema>
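As a rough illustration of the pattern facet on the gtnum attribute in the XSD above, the following Python sketch applies the same regular expression to a few candidate values. The helper name gtnum_matches is hypothetical, and Python's re module only approximates the XSD regular expression dialect; for this simple pattern the two are assumed to agree.

```python
import re

# Pattern copied from the XSD above. An XSD pattern must match the whole
# value, so re.fullmatch is used rather than re.search.
GTNUM_PATTERN = re.compile(r"gt\d{3}[A-Za-z]{1}")

def gtnum_matches(value: str) -> bool:
    """Return True if value satisfies the gtnum pattern facet."""
    return GTNUM_PATTERN.fullmatch(value) is not None

print(gtnum_matches("gt000x"))   # value from the sample fragment: True
print(gtnum_matches("gt00x"))    # only two digits: False
print(gtnum_matches("gt000xy"))  # trailing extra letter: False
```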
The above schema specifies a gatech_student element with a complex type (meaning it can have elements as children) that contains a name and an age element in sequence as well as a gtnum attribute. The name element must have a string as content, the age element must have an unsigned integer value, and the gtnum attribute must match a regular expression that matches the letters "gt" followed by 3 digits and a letter.
//emp[name="Fred"]/salary * 12
document("zoo.xml")//chapter[2 TO 5]//figure
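For readers without an XQuery processor at hand, the first path expression above, //emp[name="Fred"]/salary * 12, can be approximated in Python. This is only a sketch: the employee data is invented for illustration, and ElementTree supports a limited XPath subset, so the multiplication is done in Python rather than inside the path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical employee data; element names follow the path expression.
doc = ET.fromstring("""
<staff>
  <emp><name>Fred</name><salary>4000</salary></emp>
  <emp><name>Wilma</name><salary>5000</salary></emp>
</staff>
""")

# Select the salary of every emp whose name child is "Fred",
# then apply the "* 12" arithmetic outside the path expression.
annual = [float(s.text) * 12 for s in doc.findall(".//emp[name='Fred']/salary")]
print(annual)  # [48000.0]
```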
<emp empid = {$id}>
{$name}
{$job}
</emp>
Generate an <emp> element that has an "empid" attribute. The value of the attribute and the content of the element are specified by variables that are bound in other parts of the query.
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann"
AND $b/year = "1998"
RETURN $b/title
List the titles of books published by Morgan Kaufmann in 1998.
<big_publishers>
{
FOR $p IN distinct(document("bib.xml")//publisher)
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p
}
</big_publishers>
List the publishers who have published more than 100 books.
FOR $h IN //holding
RETURN
<holding>
{$h/title,
IF ($h/@type = "Journal")
THEN $h/editor
ELSE $h/author
}
</holding>
SORTBY (title)
Make a list of holdings, ordered by title. For journals, include the editor, and for all other holdings, include the author.
FOR $b IN //book
WHERE SOME $p IN $b//para SATISFIES
(contains($p, "sailing") AND contains($p, "windsurfing"))
RETURN $b/title
Find titles of books in which both sailing and windsurfing are mentioned in the same paragraph.
FOR $b IN //book
WHERE EVERY $p IN $b//para SATISFIES
contains($p, "sailing")
RETURN $b/title
Find titles of books in which sailing is mentioned in every paragraph.
NAMESPACE xsd = "http://www.w3.org/2001/XMLSchema"
DEFINE FUNCTION depth($e) RETURNS xsd:integer
{
# An empty element has depth 1
# Otherwise, add 1 to max depth of children
IF (empty($e/*)) THEN 1
ELSE max(depth($e/*)) + 1
}
depth(document("partlist.xml"))
Find the maximum depth of the document named "partlist.xml."
As was mentioned in the introduction, there is a dichotomy in how XML is used in industry. On one hand there is the document-centric model of XML, where XML is typically used as a means of creating semi-structured documents with irregular content that are meant for human consumption. An example of document-centric usage of XML is XHTML, which is the XML-based successor to HTML.
SAMPLE XHTML DOCUMENT
<html xmlns ="http://www.w3.org/1999/xhtml">
<head>
<title>Sample Web Page</title>
</head>
<body>
<h1>My Sample Web Page</h1>
<p> All XHTML documents must be well-formed and valid. </p>
<img src="http://www.example.com/sample.jpg" height ="50" width = "25"/>
<br />
<br />
</body>
</html>
The other primary usage of XML is in a data-centric model. In a data-centric model, XML is used as a storage or interchange format for data that is structured, appears in a regular order and is most likely to be machine processed instead of read by a human. In a data-centric model, the fact that the data is stored or transferred as XML is typically incidental, since it could be stored or transferred in a number of other formats which may or may not be better suited for the task depending on the data and how it is used. An example of a data-centric usage of XML is SOAP. SOAP is an XML-based protocol used for exchanging information in a decentralized, distributed environment. SOAP consists of three parts: an envelope that defines a framework for describing what is in a message and how to process it, a set of encoding rules for expressing instances of application-defined datatypes, and a convention for representing remote procedure calls and responses.
SAMPLE SOAP MESSAGE TAKEN FROM W3C SOAP 1.1 NOTE
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<m:GetLastTradePrice xmlns:m="Some-URI">
<symbol>DIS</symbol>
</m:GetLastTradePrice>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
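To show how such a message is typically machine processed, the following Python sketch parses the sample message above with the standard library and pulls out the requested symbol. The prefix names in the NS mapping are local assumptions chosen for this example; only the namespace URIs come from the message itself.

```python
import xml.etree.ElementTree as ET

SOAP_MESSAGE = """\
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
 <SOAP-ENV:Body>
  <m:GetLastTradePrice xmlns:m="Some-URI">
   <symbol>DIS</symbol>
  </m:GetLastTradePrice>
 </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
"""

# Prefixes here are local to this script; matching is done on namespace URIs.
NS = {"soap": "http://schemas.xmlsoap.org/soap/envelope/", "m": "Some-URI"}

envelope = ET.fromstring(SOAP_MESSAGE)
body = envelope.find("soap:Body", NS)
call = body.find("m:GetLastTradePrice", NS)
symbol = call.findtext("symbol")  # <symbol> is in no namespace

print(call.tag)  # {Some-URI}GetLastTradePrice
print(symbol)    # DIS
```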
In both models of XML usage, it is sometimes necessary to store the XML in some sort of repository or database that allows for more sophisticated storage and retrieval of the data, especially if the XML is to be accessed by multiple users. Below is a description of storage options based on which model of XML usage is required.
SAMPLE DB2 XML EXTENDER TABLE AND QUERY
CREATE TABLE mail_user (
user_name VARCHAR(20) NOT NULL PRIMARY KEY,
passwd VARCHAR(10),
mailbox XMLVARCHAR );
SELECT user_name FROM mail_user WHERE extractVarchar(mailbox, '/Mailbox/Inbox/Email/Subject') LIKE '%XML%'
The above query returns the names of all the users that have any email in their inbox that
contains the string "XML" in its subject. To improve the performance of the XPath query it is
necessary to index the mailbox XMLVARCHAR.
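The idea behind the query above, an XML document stored in a column and searched through a path-extraction function, can be emulated with Python's sqlite3 module. This is a sketch in SQLite, not DB2: the subjects function is a hypothetical stand-in for extractVarchar, the rows are invented, and the mailbox column is plain TEXT.

```python
import sqlite3
import xml.etree.ElementTree as ET

def subjects(mailbox_xml):
    """Hypothetical stand-in for extractVarchar: join all Inbox/Email/Subject texts."""
    root = ET.fromstring(mailbox_xml)  # root element is assumed to be <Mailbox>
    found = root.findall("./Inbox/Email/Subject")
    return "\n".join(s.text or "" for s in found)

conn = sqlite3.connect(":memory:")
conn.create_function("subjects", 1, subjects)  # expose the helper to SQL
conn.execute(
    "CREATE TABLE mail_user (user_name TEXT PRIMARY KEY, passwd TEXT, mailbox TEXT)"
)
conn.executemany("INSERT INTO mail_user VALUES (?, ?, ?)", [
    ("alice", "pw1",
     "<Mailbox><Inbox><Email><Subject>XML storage</Subject></Email></Inbox></Mailbox>"),
    ("bob", "pw2",
     "<Mailbox><Inbox><Email><Subject>Lunch</Subject></Email></Inbox></Mailbox>"),
])

rows = conn.execute(
    "SELECT user_name FROM mail_user "
    "WHERE subjects(mailbox) LIKE '%XML%' ORDER BY user_name"
).fetchall()
print(rows)  # [('alice',)]
```

In this emulation every stored document is parsed each time the query runs, which gives a sense of why indexing the XML column matters for performance.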
Oracle has completely integrated XML into its Oracle 9i database as well as the rest of its family of products. XML documents can be stored as whole documents in user-defined columns (of type XMLType or CLOB/BLOB), from which they can be extracted using XMLType functions such as Extract(), or they can be stored as decomposed XML documents in object-relational form, which can be reconstituted using the XML SQL Utility (XSU) or SQL functions and packages. For searching XML, Oracle provides Oracle Text, which can be used to index and search XML stored in VARCHAR2 or BLOB columns within a table via the CONTAINS and WITHIN operators used in conjunction with SQL SELECT queries. XMLType columns can be queried by selecting them through a programming interface (e.g. SQL, PL/SQL, C, or Java), by querying them directly using extract() and/or existsNode(), or by using Oracle Text operators to query the XML content. The extract() and existsNode() functions use XPath expressions for querying XML data. Oracle 9i also allows one to create relational views on XML documents stored in XMLType columns, which can then be queried using SQL. The columns in the view are mapped to XPath expressions that query the document in the XMLType column.
SAMPLE ORACLE 9i TABLE AND QUERY
CREATE TABLE mail_user(
user_name VARCHAR2(20),
passwd VARCHAR2(10),
mailbox SYS.XMLTYPE );
SELECT user_name FROM mail_user m WHERE m.mailbox.extract('/Mailbox/Inbox/Email/Subject/text()').getStringVal() like '%XML%'
The above query returns the names of all the users that have any email in their inbox that
contains the string "XML" in its subject. To improve the performance of the XPath query it is
necessary to index the mailbox XMLType.
Microsoft's SQL Server 2000 also supports XML operations being performed on relational data. XML data can be retrieved from relational tables using the FOR XML clause. The FOR XML clause has three modes: RAW, AUTO and EXPLICIT. RAW mode sends each row of data in the resultset back as an XML element named "row", with each column being an attribute of the "row" element. AUTO mode returns query results in a nested XML tree where each element returned is named after the table it was extracted from and each column is an attribute of the returned elements. The hierarchy is determined based on the order of the tables identified by the columns of the SELECT statement. With EXPLICIT mode the hierarchy of the XML returned is completely controlled by the query, which can be rather complex. SQL Server also provides the OPENXML clause, which provides a relational view on XML data. OPENXML allows XML documents placed in memory to be used as parameters to SQL statements or stored procedures. Thus OPENXML is used to query data from XML, join XML data with existing relational tables, and insert XML data into the database by "shredding" it into tables. W3C XML Schema can also be used to provide mappings between XML and relational structures. These mappings are called XML views and allow relational data in tables to be viewed as XML which can be queried using XPath.
The following people helped in reviewing and proofreading this paper: Dr. Sham Navathe, Kimbro Staken, Dmitri Alperovitch, Sam Collins, Omri Gazitt and Dennis Lu.