Spring Semester 2003
The world of XML is large and is expanding in unpredictable ways every minute, but we'll become familiar with the lay of the land in detail here. We also have a lot of territory to cover because XML is getting into the most amazing places, and in the most amazing ways, these days.
XML is a language defined by the World Wide Web Consortium (W3C, at www.w3c.org), the body that sets the standards for the Web. This set of notes is all about getting a solid overview of the language and how we can use it.
1. Markup Languages
The markup language that most people are familiar with today is HTML. HTML is used to create web pages by using a set of predefined tags that tell the browser what should be displayed and how. XML is similar to HTML, as both derive from SGML. SGML (Standard Generalized Markup Language) is a very general markup language with enormous capabilities. XML is an easier-to-use subset of SGML.
2. What Does XML Look Like?
Here's an example:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>

Here now is how this document works. I start with the XML declaration, which has the same form as a processing instruction: processing instructions start with <? and end with ?>. This one indicates that I am using XML version 1.0 and the UTF-8 character encoding, a compact encoding of Unicode built from 8-bit units. Also, when I add new sections of code they will be highlighted with a particular kind of blue (as illustrated) to indicate the actual lines of code that I am discussing.
    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>

Next, I create a new tag named <DOCUMENT>. You can use any name for a tag, not just DOCUMENT, as long as the name starts with a letter or an underscore (_), and as long as the following characters consist of letters, digits, underscores, dots (.), or hyphens, but no spaces. In XML, tags always start with < and end with >.
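To make the naming rule concrete, here is a minimal Java sketch. The class name TagNameCheck and the method isValidName are made up for this example, and the pattern encodes only the simplified rule stated above, restricted to ASCII letters (the full XML 1.0 rule admits more characters, so treat this as an approximation).

```java
import java.util.regex.Pattern;

public class TagNameCheck {
    // Simplified rule from these notes (an assumption, not the full XML 1.0 rule):
    // first a letter or underscore, then letters, digits, underscores, dots, or hyphens.
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_.\\-]*");

    // Returns true if the whole string matches the simplified naming rule.
    public static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "DOCUMENT", "_note", "item-2.a", "2nd", "my tag" };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidName(s));
        }
    }
}
```

The first three sample names pass, while "2nd" (starts with a digit) and "my tag" (contains a space) are rejected.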
XML documents are made up of XML elements. Much like in HTML, you create an XML element with an opening tag, such as <DOCUMENT>, followed by the element content (if any), such as text or other elements, and ending with the matching closing tag that starts with </, such as </DOCUMENT>. (We are simplifying things here, but not by much.) It's necessary to enclose the entire document, except for processing instructions, in one element, called the root element; that's the <DOCUMENT> element here.
    <DOCUMENT>
    </DOCUMENT>

Now I will add to this XML document a new element that I made up, <GREETING>, which encloses text content (in this case, Hello from XML), like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
    </DOCUMENT>

Next I can add a new element, <MESSAGE>, which also encloses text content, like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>

Now the <DOCUMENT> root element contains two elements: <GREETING> and <MESSAGE>. Each of the <GREETING> and <MESSAGE> elements also holds text itself. In this way, I've created a new XML document. Note however the similarity of this document with the following HTML page:
    <html>
        <body>
            <h1>
                Hello from HTML
            </h1>
            <p>
                Welcome to the wild and wooly world of HTML.
            </p>
        </body>
    </html>

The difference is that in the HTML document all the tags are predefined and a web browser would know how to handle them, while here we have just created the tags <DOCUMENT>, <GREETING>, and <MESSAGE> from thin air. How can we use an XML document like this one? What would a browser make of its tags?
3. What's So Great About XML?
XML is popular for many reasons and I will name a few here.
4. Well-Formed XML Documents
An XML document must be well-formed. Informally, that means the document must contain one or more elements, and one element, the root element, must contain all the other elements. Each element also must nest properly inside any enclosing elements.
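One way to see these rules in action is to hand a few strings to a parser and watch which ones it rejects. The sketch below is only an illustration: it uses the JAXP parser that ships with the JDK (rather than the Xerces classes used later in these notes), and the class name WellFormedCheck and the method isWellFormed are names I made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {
    // Returns true if the parser accepts the text, false if it reports a parse error.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so errors are thrown instead of printed.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;               // not well-formed
        } catch (Exception e) {
            throw new RuntimeException(e); // configuration or I/O trouble
        }
    }

    public static void main(String[] args) {
        // One root element, properly nested children.
        System.out.println(isWellFormed("<DOCUMENT><GREETING>Hi</GREETING></DOCUMENT>"));
        // Improper nesting: B is closed after A.
        System.out.println(isWellFormed("<A><B></A></B>"));
        // Two top-level elements, so no single root element.
        System.out.println(isWellFormed("<A/><B/>"));
    }
}
```

The first string satisfies both rules; the second breaks the nesting rule and the third breaks the single-root rule.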
5. Valid XML Documents
An XML document is valid if there is a document type definition (DTD) associated with it, and if the document complies with that DTD. We will not use DTDs here, but they are important.
I give an example below:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE DOCUMENT [
        <!ELEMENT DOCUMENT (GREETING, MESSAGE)>
        <!ELEMENT GREETING (#PCDATA)>
        <!ELEMENT MESSAGE (#PCDATA)>
    ]>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>

A document's DTD specifies the correct syntax of the document. DTDs can be stored in a separate file or in the document itself, using a <!DOCTYPE> declaration. In the example above the DTD indicates that you can have <GREETING> and <MESSAGE> elements inside a <DOCUMENT> element, that the <DOCUMENT> element is the root element, and that the <GREETING> and <MESSAGE> elements can hold text (parsed character data). A more powerful use of XML involves parsing an XML document to break it down into its component parts and then handling the resulting data yourself.
6. Parsing XML Yourself
Much XML programming is done in Java today, and we'll take advantage of that in this section. Here's a program that reads the greeting.xml file and extracts the text content of the <GREETING> element:

    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.Text;

    public class Two {
        public static void main(String[] args) {
            try {
                // parse the file into a tree of nodes
                DOMParser parser = new DOMParser();
                parser.parse("greeting.xml");
                Document doc = parser.getDocument();
                // walk the children of the root element
                for (Node node = doc.getDocumentElement().getFirstChild();
                     node != null;
                     node = node.getNextSibling()) {
                    if (node instanceof Element && node.getNodeName().equals("GREETING")) {
                        // print every text node directly inside <GREETING>
                        for (Node subnode = node.getFirstChild();
                             subnode != null;
                             subnode = subnode.getNextSibling()) {
                            if (subnode instanceof Text) {
                                System.out.println("***(" + subnode.getNodeValue() + ")***");
                            }
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
The W3C DOM specifies a way of treating a document as a tree of nodes. In this model, every discrete data item is a node, and child elements or enclosed text become subnodes.
Treating a document as a tree of nodes is one good way of handling XML documents in Java (although there are many other ways) because it makes it relatively easy to explicitly state which elements contain which other elements. The contained elements become subnodes of the container nodes.
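The container and subnode relationship can be sketched in a few lines of Java. This is an illustration under assumptions: it uses the JAXP parser bundled with the JDK and parses a string instead of a file, and the names TreeSketch and describe are mine.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TreeSketch {
    // For each element directly under the root, report its container,
    // its own name, and the text it encloses (with outer whitespace trimmed).
    public static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        StringBuilder sb = new StringBuilder();
        for (Node kid = root.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
            if (kid.getNodeType() == Node.ELEMENT_NODE) {
                sb.append(kid.getParentNode().getNodeName())
                  .append(" contains ")
                  .append(kid.getNodeName())
                  .append(": ")
                  .append(kid.getTextContent().trim())
                  .append("\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DOCUMENT>"
                   + "<GREETING> Hello from XML. </GREETING>"
                   + "<MESSAGE> Welcome. </MESSAGE>"
                   + "</DOCUMENT>";
        System.out.print(describe(xml));
    }
}
```

Each element node knows its container through getParentNode, and the enclosed text hangs below it as a text subnode, which is what getTextContent collects here.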
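Everything in a document becomes a node in this model: elements, element attributes, text, and so on. Here are the possible node types in the W3C DOM:

    Element
    Attribute
    Text
    CDATA section
    Entity reference
    Entity
    Processing instruction
    Comment
    Document
    Document type
    Document fragment
    Notation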
Let's go back to our original document:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>

Here's a picture of it:
And here are some programs that explore all nodes.
Note: the purpose of these programs is to help you become familiar with a kind of processing.
So run them all and try to understand them well.
Stage One: A Star for Each Node.

    burrowww.cs.indiana.edu% cat StageOne.java
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.Text;

    public class StageOne {
        public static void main(String[] args) {
            try {
                DOMParser parser = new DOMParser();
                parser.parse(args[0]);
                Document doc = parser.getDocument();
                Node node = doc.getDocumentElement().getFirstChild();
                while (node != null) {
                    display(node);
                    node = node.getNextSibling();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        public static void display(Node node) {
            System.out.println("*");
            Node subnode = node.getFirstChild();
            while (subnode != null) {
                display(subnode);
                subnode = subnode.getNextSibling();
            }
        }
    }
    burrowww.cs.indiana.edu% javac StageOne.java
    burrowww.cs.indiana.edu% cat greeting.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>
    burrowww.cs.indiana.edu% java StageOne greeting.xml
    *
    *
    *
    *
    *
    *
    *
    burrowww.cs.indiana.edu%

This part is the heart of it, really.
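To check your understanding of why Stage One prints seven stars, here is a small sketch that counts nodes with the same recursive walk instead of printing them. It is an illustration only: it uses the JDK's built-in JAXP parser on a string rather than Xerces on a file, and the names NodeCount, countBelow, and countBelowRoot are mine.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeCount {
    // Counts every node below the given node: each child, plus everything below it.
    public static int countBelow(Node node) {
        int count = 0;
        for (Node sub = node.getFirstChild(); sub != null; sub = sub.getNextSibling()) {
            count += 1 + countBelow(sub);
        }
        return count;
    }

    // Parses the text and counts the nodes below the root element.
    public static int countBelowRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return countBelow(doc.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DOCUMENT>\n"
                   + "  <GREETING>\n    Hello from XML.\n  </GREETING>\n"
                   + "  <MESSAGE>\n    Welcome.\n  </MESSAGE>\n"
                   + "</DOCUMENT>";
        System.out.println(countBelowRoot(xml));
    }
}
```

The root has five children (three whitespace text nodes around two elements), and each of the two elements holds one text node, which gives seven nodes in all.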
Stage Two: Adequate Indentation for Each Star.

    burrowww.cs.indiana.edu% cat greeting.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello from XML.
        </GREETING>
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
    </DOCUMENT>
    burrowww.cs.indiana.edu% cat StageTwo.java
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.Text;

    public class StageTwo {
        public static void main(String[] args) {
            try {
                DOMParser parser = new DOMParser();
                parser.parse(args[0]);
                Document doc = parser.getDocument(); // corresponds to all that was parsed
                Node node = doc.getDocumentElement().getFirstChild();
                while (node != null) {
                    display(0, node); // initial indentation is zero spaces
                    node = node.getNextSibling();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        public static void display(int indent, Node node) {
            for (int i = 0; i < indent; i++)
                System.out.print(" ");
            System.out.println("*");
            Node subnode = node.getFirstChild();
            while (subnode != null) {
                display(indent + 2, subnode); // indentation increases
                subnode = subnode.getNextSibling();
            }
        }
    }
    burrowww.cs.indiana.edu% javac StageTwo.java
    burrowww.cs.indiana.edu% java StageTwo greeting.xml
    *
    *
      *
    *
    *
      *
    *
    burrowww.cs.indiana.edu% cat customer.xml
    <?xml version="1.0" standalone="yes"?>
    <document>
        <customer>
            <name>
                <last_name>Bird</last_name>
                <first_name>Larry</first_name>
            </name>
            <date>October 15, 2001</date>
            <orders>
                <item>
                    <product>Tomatoes</product>
                    <number>8</number>
                    <price>$1.25</price>
                </item>
                <item>
                    <product>Oranges</product>
                    <number>24</number>
                    <price>$4.98</price>
                </item>
            </orders>
        </customer>
        <customer>
            <name>
                <last_name>Jordan</last_name>
                <first_name>Michael</first_name>
            </name>
            <date>October 20, 2001</date>
            <orders>
                <item>
                    <product>Bread</product>
                    <number>12</number>
                    <price>$14.95</price>
                </item>
                <item>
                    <product>Apples</product>
                    <number>6</number>
                    <price>$1.50</price>
                </item>
            </orders>
        </customer>
        <customer>
            <name>
                <last_name>Kukoc</last_name>
                <first_name>Tony</first_name>
            </name>
            <date>October 25, 2001</date>
            <orders>
                <item>
                    <product>Asparagus</product>
                    <number>12</number>
                    <price>$2.95</price>
                </item>
                <item>
                    <product>Lettuce</product>
                    <number>6</number>
                    <price>$11.50</price>
                </item>
            </orders>
        </customer>
    </document>
    burrowww.cs.indiana.edu% java StageTwo customer.xml
    *
    *
      *
      *
        *
        *
          *
        *
        *
          *
        *
      *
      *
        *
      *
      *
        *
        *
          *
          *
            *
          *
          *
            *
          *
          *
            *
          *
        *
        *
          *
          *
            *
          *
          *
            *
          *
          *
            *
          *
        *
      *
    *
    *
      *
      *
        *
        *
          *
        *
        *
          *
        *
      *
      *
        *
      *
      *
        *
        *
          *
          *
            *
          *
          *
            *
          *
          *
            *
          *
        *
        *
          *
          *
            *
          *
          *
            *
          *
          *
            *
          *
        *
      *
    *
    *
      *
      *
        *
        *
          *
        *
        *
          *
        *
      *
      *
        *
      *
      *
        *
        *
          *
          *
            *
          *
          *
            *
          *
          *
            *
          *
        *
        *
          *
          *
            *
          *
          *
            *
          *
          *
            *
          *
        *
      *
    *
    burrowww.cs.indiana.edu%

Stage Three: Printing the Type of Node.
    burrowww.cs.indiana.edu% cat greeting.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        What if I write in between the elements.
        <GREETING>
            Hello from XML.
        </GREETING>
        This line in between the greeting and the message.
        <MESSAGE>
            Welcome to the Wild and Wooly World of XML.
        </MESSAGE>
        This is right after the message.
    </DOCUMENT>
    burrowww.cs.indiana.edu% cat StageThree.java
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.Text;

    public class StageThree {
        public static void main(String[] args) {
            try {
                DOMParser parser = new DOMParser();
                parser.parse(args[0]);
                Document doc = parser.getDocument(); // corresponds to all that was parsed
                Node node = doc.getDocumentElement().getFirstChild();
                while (node != null) {
                    display(0, node); // initial indentation is zero spaces
                    node = node.getNextSibling();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        public static void display(int indent, Node node) {
            for (int i = 0; i < indent; i++)
                System.out.print(" ");
            switch (node.getNodeType()) {
                case Node.DOCUMENT_NODE:
                    System.out.println("Document Node");
                    break;
                case Node.ELEMENT_NODE:
                    System.out.println("Element Node: (" + node.getNodeName() + ")");
                    break;
                case Node.CDATA_SECTION_NODE:
                    System.out.println("CDATA Section Node");
                    break;
                case Node.TEXT_NODE:
                    System.out.println("Text Node: (" + node.getNodeValue() + ")");
                    break;
                case Node.PROCESSING_INSTRUCTION_NODE:
                    System.out.println("Processing Instruction Node");
                    break;
                default:
                    System.out.println("Unaccounted Type of Node");
                    break;
            }
            Node subnode = node.getFirstChild();
            while (subnode != null) {
                display(indent + 2, subnode); // indentation increases
                subnode = subnode.getNextSibling();
            }
        }
    }

This is exactly what we had before, except now we print the value for every text node, and we print the name for each element node. Also, greeting.xml has changed: it now has three added lines of text. Those lines land in text nodes that would be there regardless (holding only whitespace).
    burrowww.cs.indiana.edu% javac StageThree.java
    burrowww.cs.indiana.edu% java StageThree greeting.xml
    Text Node: ( What if I write in between the elements. )
    Element Node: (GREETING)
      Text Node: ( Hello from XML. )
    Text Node: ( This line in between the greeting and the message. )
    Element Node: (MESSAGE)
      Text Node: ( Welcome to the Wild and Wooly World of XML. )
    Text Node: ( This is right after the message. )
    burrowww.cs.indiana.edu%

Then, of course, there is the recursive aspect of display, which appears as early as Stage One.
Here's one final, but extremely important, note: you need xerces.jar from $CATALINA_HOME/common/lib in your CLASSPATH.