Spring Semester 2003


Lecture Notes Twenty-Five: An eXtensible Markup Language.
Welcome to the world of eXtensible Markup Language (XML). This set of notes is your guided tour.

The world of XML is large and is expanding in unpredictable ways every minute, but we'll become familiar with the lay of the land in detail here. We also have a lot of territory to cover because XML is getting into the most amazing places, and in the most amazing ways, these days.

XML is a language defined by the World Wide Web Consortium (W3C, at www.w3.org), the body that sets the standards for the Web. This set of notes is all about getting a solid overview of the language and how we can use it.

1. Markup Languages

The markup language that most people are familiar with today is HTML. HTML is used to create web pages by using a set of predefined tags that tell the browser what should be displayed and how. XML is similar to HTML, as both derive from SGML. SGML (Standard Generalized Markup Language) is a very general markup language with enormous capabilities. XML is an easier-to-use subset of SGML.

2. What Does XML Look Like?

Here's an example:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
Here now is how this document works.

I start with the XML declaration, which is written like a processing instruction. All XML processing instructions start with <? and end with ?>, as does this declaration, which indicates that I am using XML version 1.0 and the UTF-8 character encoding, a variable-width encoding of Unicode built from 8-bit units. Also, when I add new sections of code they will be highlighted with a particular kind of blue (as illustrated) to indicate the actual lines of code that I am discussing.

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
Next, I create a new tag named <DOCUMENT>. You can use any name for a tag, not just DOCUMENT, as long as the name starts with a letter or an underscore (_), and as long as the following characters consist of letters, digits, underscores, dots (.), or hyphens, but no spaces. In XML, tags always start with < and end with >.
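
The naming rule above can be tried out in a few lines of Java. This is only a sketch of the simplified rule as stated in these notes, and it assumes ASCII letters; the XML specification itself allows a wider set of characters. The class name NameCheck and the method isValidName are made up for illustration.

```java
import java.util.regex.Pattern;

public class NameCheck {
    // Simplified rule from the notes: the first character is a letter or an
    // underscore; the rest are letters, digits, underscores, dots, or hyphens.
    // Assumption: ASCII letters only (the XML specification allows more).
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_.-]*");

    public static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] candidates = { "DOCUMENT", "_note", "order-2.a", "2fast", "my tag" };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isValidName(candidate));
        }
    }
}
```

The first three candidates follow the rule; the last two break it (one starts with a digit, the other contains a space).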

XML documents are made up of XML elements. Much like in HTML, you create XML elements with an opening tag, such as <DOCUMENT>, followed by the element content (if any), such as text or other elements, and ending with the matching closing tag that starts with </, such as </DOCUMENT>. (We are simplifying things here, but not by much). It's necessary to enclose the entire document, except for processing instructions, in one element, called the root element - that's the <DOCUMENT> element here.

<DOCUMENT>



</DOCUMENT>
Now I will add to this XML document a new element that I made up, <GREETING>, which encloses text content (in this case, Hello from XML), like this:
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>




</DOCUMENT>
Next I can add a new element, <MESSAGE>, which also encloses text content, like this:
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
Now the <DOCUMENT> root element contains two elements: <GREETING> and <MESSAGE>. Each of the <GREETING> and <MESSAGE> elements also holds text itself. In this way, I've created a new XML document. Note however the similarity of this document with the following HTML page:
<html>
  <body>
    <h1>
      Hello from HTML
    </h1>
    <p>
      Welcome to the wild and wooly world of HTML. 
    </p>
  </body>
</html>
The difference is that in the HTML document all the tags are predefined and a web browser would know how to handle them, while here we have just created the tags <DOCUMENT>, <GREETING>, and <MESSAGE> from thin air. How can we use an XML document like this one? What would a browser make of its tags?

3. What's So Great About XML?

XML is popular for many reasons and I will name a few here.

  1. Easy Data Handling and Exchange: Proprietary data formats have become so complex that frequently one version of an application can't even read data from an earlier version of the same application. In XML, data and markup are stored as plain text that you can inspect and edit yourself.

  2. Customizing Markup Languages (for Custom Browsers): When you and a number of people agree on a markup language you can create customized browsers or applications to handle that language. Hundreds of such languages already are being standardized now. As an example the Chemical Markup Language lets you represent complex molecules graphically, very much as we have seen in the lab about installing applets.

  3. Self-Describing Data: Because you are free to choose your own tags, the tag names themselves can say what the enclosed data means.

  4. Structured and Integrated Data: In XML one can encode not only data, but also a definition of its structure. Thus (although we won't cover it here, now) XML documents can be constrained and validated. For now we only care for the documents to be well-formed.

4. Well-Formed XML Documents.

Informally, a well-formed document must contain one or more elements, and one element, the root element, must contain all the other elements. Each element also must nest inside any enclosing elements properly.
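
One way to see these rules in action is to hand a few strings to a parser and watch which ones it accepts. The sketch below assumes the JAXP parser that ships with the JDK (javax.xml.parsers), not the Xerces classes used later in these notes; the class name WellFormedCheck and the method isWellFormed are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    // Returns true when the parser accepts the text as well-formed XML.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        String good = "<DOCUMENT><GREETING>Hello from XML.</GREETING></DOCUMENT>";
        String badNesting = "<DOCUMENT><GREETING>Hello</DOCUMENT></GREETING>";
        String twoRoots = "<GREETING>Hello</GREETING><MESSAGE>Hi</MESSAGE>";
        System.out.println("good: " + isWellFormed(good));
        System.out.println("bad nesting: " + isWellFormed(badNesting));
        System.out.println("two roots: " + isWellFormed(twoRoots));
    }
}
```

The second string closes its tags in the wrong order and the third has no single root element, so only the first one is well-formed.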

5. Valid XML Documents

An XML document is valid if there is a document type definition (DTD) associated with it, and if the document complies with that DTD. We will not use DTDs here, but they are important.

I give an example below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOCUMENT [
    <!ELEMENT DOCUMENT (GREETING, MESSAGE)>
    <!ELEMENT GREETING (#PCDATA)>
    <!ELEMENT MESSAGE (#PCDATA)>
]>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
A document's DTD specifies the correct syntax of the document. DTDs can be stored in a separate file or in the document itself, using a <!DOCTYPE> declaration. In the example above the DTD indicates that you can have <GREETING> and <MESSAGE> elements inside a <DOCUMENT> element, that the <DOCUMENT> element is the root element, and that the <GREETING> and <MESSAGE> elements can hold text (parsed character data). The more powerful use of XML involves parsing an XML document to break it down into its component parts and then handling the resulting data yourself.
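
To see what complying with a DTD means in practice, here is a sketch that asks a validating parser to check two documents against the DTD from the example. It assumes the JAXP parser that ships with the JDK rather than the Xerces classes used later in these notes; the class name ValidityCheck and the method isValid are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ValidityCheck {
    // The internal DTD from the example above.
    public static final String DTD =
        "<!DOCTYPE DOCUMENT [\n"
        + "  <!ELEMENT DOCUMENT (GREETING, MESSAGE)>\n"
        + "  <!ELEMENT GREETING (#PCDATA)>\n"
        + "  <!ELEMENT MESSAGE (#PCDATA)>\n"
        + "]>\n";

    // Returns true when the document is well-formed and complies with its DTD.
    public static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask the parser to check against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity errors are only reported, so rethrow them
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or not valid
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        String inOrder = DTD
            + "<DOCUMENT><GREETING>Hello</GREETING><MESSAGE>Welcome</MESSAGE></DOCUMENT>";
        String swapped = DTD
            + "<DOCUMENT><MESSAGE>Welcome</MESSAGE><GREETING>Hello</GREETING></DOCUMENT>";
        System.out.println("in order: " + isValid(inOrder));
        System.out.println("swapped: " + isValid(swapped));
    }
}
```

Both documents are well-formed, but the second one puts <MESSAGE> before <GREETING>, which the content model (GREETING, MESSAGE) does not allow.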

6. Parsing XML Yourself.

Most XML programming is done in Java today, and we'll take advantage of that in this section. Here's a program that reads the greeting.xml file and extracts the text content of the <GREETING> element:

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class Two {
    public static void main(String[] args) {

        try {

            // Parse the file and get the resulting DOM document.
            DOMParser parser = new DOMParser();
            parser.parse("greeting.xml");
            Document doc = parser.getDocument();

            // Walk the children of the root element, looking for <GREETING>.
            for (Node node = doc.getDocumentElement().getFirstChild();
                 node != null;
                 node = node.getNextSibling()) {

                if (node instanceof Element &&
                    node.getNodeName().equals("GREETING")) {

                    // Print every text node directly inside <GREETING>.
                    for (Node subnode = node.getFirstChild();
                         subnode != null;
                         subnode = subnode.getNextSibling()) {

                        if (subnode instanceof Text) {
                            System.out.println("***(" +
                                               subnode.getNodeValue() +
                                               ")***");
                        }
                    }
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}
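
The program above needs the Xerces library on the classpath. As a sketch under a different assumption, the same walk can be done with the JAXP classes that ship with the JDK (javax.xml.parsers). This swaps in DocumentBuilder for DOMParser and is not the parser these notes use. The class name GreetingWithJaxp and the method greetingText are made up for illustration, and the XML is parsed from a string rather than from greeting.xml so that the example stands alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class GreetingWithJaxp {
    // Returns the text found directly inside the <GREETING> child of the root.
    public static String greetingText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder text = new StringBuilder();
            for (Node node = doc.getDocumentElement().getFirstChild();
                 node != null;
                 node = node.getNextSibling()) {
                if (node instanceof Element
                        && node.getNodeName().equals("GREETING")) {
                    for (Node subnode = node.getFirstChild();
                         subnode != null;
                         subnode = subnode.getNextSibling()) {
                        if (subnode instanceof Text) {
                            text.append(subnode.getNodeValue());
                        }
                    }
                }
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e); // parse or configuration problem
        }
    }

    public static void main(String[] args) {
        String xml = "<DOCUMENT><GREETING>Hello from XML.</GREETING>"
            + "<MESSAGE>Welcome to the Wild and Wooly World of XML.</MESSAGE>"
            + "</DOCUMENT>";
        System.out.println("***(" + greetingText(xml) + ")***");
    }
}
```

The DOM interfaces (Document, Element, Node, Text) are the same in both versions; only the way the parser is obtained differs.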

The W3C DOM specifies a way of treating a document as a tree of nodes. In this model, every discrete data item is a node, and child elements or enclosed text become subnodes.

Treating a document as a tree of nodes is one good way of handling XML documents in Java (although there are many other ways) because it makes it relatively easy to explicitly state which elements contain which other elements. The contained elements become subnodes of the container nodes.

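Everything in a document becomes a node in this model: elements, element attributes, text, and so on. The possible node types in the W3C DOM are: element, attribute, text, CDATA section, entity reference, entity, processing instruction, comment, document, document type, document fragment, and notation.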

Let's go back to our original document:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
Here's a picture of it:

And here are some programs that explore all nodes.

Note: the purpose of these programs is to help you become familiar with a kind of processing.

So run them all and try to understand them well.

Stage One: A Star for Each Node.

burrowww.cs.indiana.edu% cat StageOne.java
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class StageOne {
    public static void main(String[] args) {
        
        try {
            
            DOMParser parser = new DOMParser(); 
            parser.parse(args[0]); 
            Document doc = parser.getDocument(); 

            Node node = doc.getDocumentElement().getFirstChild(); 

            while (node != null) {
                display(node); 
                node = node.getNextSibling(); 
            }
            
        } catch (Exception e) {
            e.printStackTrace(); 
        }
        
    }
    
    public static void display(Node node) {
        System.out.println("*"); 
        Node subnode = node.getFirstChild(); 
        while (subnode != null) {
            display(subnode);
            subnode = subnode.getNextSibling(); 
        }
    }

}


burrowww.cs.indiana.edu% javac StageOne.java
burrowww.cs.indiana.edu% cat greeting.xml
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
burrowww.cs.indiana.edu% java StageOne greeting.xml
*
*
*
*
*
*
*
burrowww.cs.indiana.edu% 
The recursive display method is the heart of it, really.

Stage Two: Adequate Indentation for Each Star.

burrowww.cs.indiana.edu% cat greeting.xml
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    Hello from XML. 
  </GREETING>
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
</DOCUMENT>
burrowww.cs.indiana.edu% cat StageTwo.java
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class StageTwo {
    public static void main(String[] args) {
        
        try {
            
            DOMParser parser = new DOMParser(); 
            parser.parse(args[0]); 
            Document doc = parser.getDocument(); 
            // corresponds to all that was parsed

            Node node = doc.getDocumentElement().getFirstChild(); 

            while (node != null) {
                display(0, node); // initial indentation is zero spaces
                node = node.getNextSibling(); 
            }
            
        } catch (Exception e) {
            e.printStackTrace(); 
        }
        
    }
    
    public static void display(int indent, Node node) {
        for (int i = 0; i < indent; i++) System.out.print(" "); 
        System.out.println("*"); 
        Node subnode = node.getFirstChild(); 
        while (subnode != null) {
            display(indent + 2, subnode); // indentation increases 
            subnode = subnode.getNextSibling(); 
        }
    }

}


burrowww.cs.indiana.edu% javac StageTwo.java
burrowww.cs.indiana.edu% java StageTwo greeting.xml
*
*
  *
*
*
  *
*
burrowww.cs.indiana.edu% cat customer.xml
<?xml version="1.0" standalone="yes"?>
<document>
  <customer>
    <name>
      <last_name>Bird</last_name>
      <first_name>Larry</first_name>
    </name>
    <date>October 15, 2001</date>
    <orders>
      <item>
        <product>Tomatoes</product>
        <number>8</number>
        <price>$1.25</price>
      </item>
      <item>
        <product>Oranges</product>
        <number>24</number>
        <price>$4.98</price>
      </item>
    </orders>
  </customer>
  <customer>
    <name>
      <last_name>Jordan</last_name>
      <first_name>Michael</first_name>
    </name>
    <date>October 20, 2001</date>
    <orders>
      <item>
        <product>Bread</product>
        <number>12</number>
        <price>$14.95</price>
      </item>
      <item>
        <product>Apples</product>
        <number>6</number>
        <price>$1.50</price>
      </item>
    </orders>
  </customer>
  <customer>
    <name>
      <last_name>Kukoc</last_name>
      <first_name>Tony</first_name>
    </name>
    <date>October 25, 2001</date>
    <orders>
      <item>
        <product>Asparagus</product>
        <number>12</number>
        <price>$2.95</price>
      </item>
      <item>
        <product>Lettuce</product>
        <number>6</number>
        <price>$11.50</price>
      </item>
    </orders>
  </customer>
</document>
burrowww.cs.indiana.edu% java StageTwo customer.xml
*
*
  *
  *
    *
    *
      *
    *
    *
      *
    *
  *
  *
    *
  *
  *
    *
    *
      *
      *
        *
      *
      *
        *
      *
      *
        *
      *
    *
    *
      *
      *
        *
      *
      *
        *
      *
      *
        *
      *
    *
  *
*
*
  *
  *
    *
    *
      *
    *
    *
      *
    *
  *
  *
    *
  *
  *
    *
    *
      *
      *
        *
      *
      *
        *
      *
      *
        *
      *
    *
    *
      *
      *
        *
      *
      *
        *
      *
      *
        *
      *
    *
  *
*
*
  *
  *
    *
    *
      *
    *
    *
      *
    *
  *
  *
    *
  *
  *
    *
    *
      *
      *
        *
      *
      *
        *
      *
      *
        *
      *
    *
    *
      *
      *
        *
      *
      *
        *
      *
      *
        *
      *
    *
  *
*
burrowww.cs.indiana.edu% 
Stage Three: Printing the Type of Node.

burrowww.cs.indiana.edu% cat greeting.xml
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  What if I write in between the elements. 
  <GREETING>
    Hello from XML. 
  </GREETING>
  This line in between the greeting and the message. 
  <MESSAGE>
    Welcome to the Wild and Wooly World of XML. 
  </MESSAGE>
  This is right after the message. 
</DOCUMENT>
burrowww.cs.indiana.edu% cat StageThree.java
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class StageThree {
    public static void main(String[] args) {
        
        try {
            
            DOMParser parser = new DOMParser(); 
            parser.parse(args[0]); 
            Document doc = parser.getDocument(); 
            // corresponds to all that was parsed

            Node node = doc.getDocumentElement().getFirstChild(); 

            while (node != null) {
                display(0, node); // initial indentation is zero spaces
                node = node.getNextSibling(); 
            }
            
        } catch (Exception e) {
            e.printStackTrace(); 
        }
        
    }
    
    public static void display(int indent, Node node) {
        for (int i = 0; i < indent; i++) System.out.print(" "); 
        switch(node.getNodeType()) {
        case Node.DOCUMENT_NODE: 
            System.out.println("Document Node"); 
            break;
        case Node.ELEMENT_NODE: 
            System.out.println("Element Node: (" + node.getNodeName() + ")"); 
            break;
        case Node.CDATA_SECTION_NODE: 
            System.out.println("CDATA Section Node"); 
            break;
        case Node.TEXT_NODE: 
            System.out.println("Text Node: (" + node.getNodeValue() + ")"); 
            break;
        case Node.PROCESSING_INSTRUCTION_NODE: 
            System.out.println("Processing Instruction Node"); 
            break;
        default:
            System.out.println("Unaccounted Type of Node"); 
            break;
        }       
        Node subnode = node.getFirstChild(); 
        while (subnode != null) {
            display(indent + 2, subnode); // indentation increases 
            subnode = subnode.getNextSibling(); 
        }
    }

}
This is exactly what we had before, except now we print the value for every text node, and we print the name for each element node. The greeting.xml file has also changed: three lines of text were added between the elements. Those lines land in text nodes that would be there regardless (holding only whitespace).
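
The point about text nodes that exist regardless can be checked against the original greeting.xml, the one without any text between the elements. The sketch below assumes the JAXP parser that ships with the JDK instead of Xerces; the class name WhitespaceNodes and the method whitespaceOnlyChildren are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WhitespaceNodes {
    // The original greeting.xml, before any text was added between elements.
    public static final String GREETING_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<DOCUMENT>\n"
        + "  <GREETING>\n"
        + "    Hello from XML. \n"
        + "  </GREETING>\n"
        + "  <MESSAGE>\n"
        + "    Welcome to the Wild and Wooly World of XML. \n"
        + "  </MESSAGE>\n"
        + "</DOCUMENT>\n";

    // Counts the root's children that are text nodes holding only whitespace.
    public static int whitespaceOnlyChildren(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            int count = 0;
            for (Node child = root.getFirstChild();
                 child != null;
                 child = child.getNextSibling()) {
                if (child.getNodeType() == Node.TEXT_NODE
                        && child.getNodeValue().trim().isEmpty()) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException(e); // parse or configuration problem
        }
    }

    public static void main(String[] args) {
        System.out.println("whitespace-only text nodes under the root: "
            + whitespaceOnlyChildren(GREETING_XML));
    }
}
```

The line breaks and indentation before <GREETING>, between the two elements, and after </MESSAGE> each become a text node of their own, which is why the earlier stages print a star for them.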

burrowww.cs.indiana.edu% javac StageThree.java
burrowww.cs.indiana.edu% java StageThree greeting.xml
Text Node: (
  What if I write in between the elements. 
  )
Element Node: (GREETING)
  Text Node: (
    Hello from XML. 
  )
Text Node: (
  This line in between the greeting and the message.
  )
Element Node: (MESSAGE)
  Text Node: (
    Welcome to the Wild and Wooly World of XML. 
  )
Text Node: (
  This is right after the message.
)
burrowww.cs.indiana.edu% 
Then, of course, there is the recursive aspect of display, which appears as early as Stage One.
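
The same recursion can return a value instead of printing. As a sketch, the method below counts the nodes for which Stage One prints a star. It assumes the JAXP parser that ships with the JDK instead of Xerces; the class name NodeCount and its methods are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeCount {
    // The original greeting.xml used in Stage One.
    public static final String GREETING_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<DOCUMENT>\n"
        + "  <GREETING>\n"
        + "    Hello from XML. \n"
        + "  </GREETING>\n"
        + "  <MESSAGE>\n"
        + "    Welcome to the Wild and Wooly World of XML. \n"
        + "  </MESSAGE>\n"
        + "</DOCUMENT>\n";

    // Same shape as display: handle each child, then recurse into it.
    public static int countDescendants(Node node) {
        int count = 0;
        for (Node child = node.getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            count += 1 + countDescendants(child);
        }
        return count;
    }

    // Parses the text and counts every node below the root element.
    public static int countBelowRoot(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return countDescendants(root);
        } catch (Exception e) {
            throw new IllegalStateException(e); // parse or configuration problem
        }
    }

    public static void main(String[] args) {
        System.out.println(countBelowRoot(GREETING_XML));
    }
}
```

For the original greeting.xml the count is 7, matching the seven stars in the Stage One transcript: five children of the root, plus one text node inside each of the two elements.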

Here's one final, but extremely important note:


Last updated: Apr 17, 2003 by Adrian German for A348/A548