Introduction to XML

The eXtensible Markup Language (XML) is a specification published by the W3C (http://www.w3.org).  The W3C describes it as

"...a simple text-based format for representing structured information: documents, data, configuration, books, transactions, invoices, and much more. It was derived from an older standard format called SGML (ISO 8879), in order to be more suitable for Web use."

XML is related to HTML, the language used to code webpages, but is more general. Where HTML is used primarily to provide presentation "markup" for displaying a page of information, XML is used primarily to identify information. In essence, XML allows information (data) to be contextualized.

For instance, the following is easy for a spectroscopist to read and interpret:

254.0,0.1234

Generically, we can identify this as an X,Y data pair. If we are UV/Vis spectroscopists, we would probably assume that this is data from a UV spectrum where the absorbance at 254 nm is 0.1234. To a computer, however, this is just an ASCII character sequence.
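To make the point concrete, here is a small Python sketch of what a program has to work with when it is handed only the raw text. The variable names are illustrative.

```python
raw = "254.0,0.1234"

# To the computer the line is nothing more than a run of ASCII codes.
codes = list(raw.encode("ascii"))

# Splitting on the comma recovers two numbers, but nothing in the text
# says which one is a wavelength, which is an absorbance, or what the
# units are. That knowledge lives only in the reader's head.
x, y = (float(field) for field in raw.split(","))

print(codes[:5])
print(x, y)
```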

In order for computers to be able to record and search on this contextual information, we must find a way to encode it. XML does just that. Given that it is important to identify the information as an X,Y pair, that it has units, and that it is part of a UV spectrum, we might construct the following XML document.

xml
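As a minimal sketch of what such a document might look like, the Python snippet below holds the XML in a string and uses the standard library's xml.etree.ElementTree module to pull the contextual information back out. The tag and attribute names (spectrum, xypair, xaxis, yaxis, technique, unit) are assumptions chosen for illustration, not a standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag and attribute names, chosen only for illustration.
DOCUMENT = """\
<spectrum technique="UV/Vis">
  <xypair>
    <xaxis unit="nm">254.0</xaxis>
    <yaxis unit="AU">0.1234</yaxis>
  </xypair>
</spectrum>
"""

root = ET.fromstring(DOCUMENT)

# Because every value is tagged, a program can ask for it by name.
x = root.find("./xypair/xaxis")
y = root.find("./xypair/yaxis")

technique = root.get("technique")
wavelength = float(x.text)
absorbance = float(y.text)

summary = f"{technique}: {wavelength} {x.get('unit')} -> {absorbance} {y.get('unit')}"
print(summary)
```

The same two numbers are still there, but the program can now tell which is which and what units they carry.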

The simple example above shows the logic and features of XML.  Each piece of information is "tagged" with its contextual information by enclosing it in a word wrapped with < and >.  This basic structure is formally called an element, but for the sake of clarity in a scientific context we can call them tags (analogous to HTML).  Unlike HTML, tags in XML can be anything that the author wants them to be (with a few exceptions).  Tags can be nested to create a tree structure, with a nested tag being called a child of its parent. Finally, tags can have attributes as part of their definition, as seen in the axes tags.
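The nesting and attribute rules can also be seen by building a small tree programmatically. The sketch below uses Python's xml.etree.ElementTree. Every tag and attribute name in it is an arbitrary, illustrative choice, which is exactly the freedom XML gives the author.

```python
import xml.etree.ElementTree as ET

# All names below are illustrative; XML does not prescribe them.
spectrum = ET.Element("spectrum", {"technique": "UV/Vis"})  # root element
pair = ET.SubElement(spectrum, "xypair")                    # child of <spectrum>
x = ET.SubElement(pair, "xaxis", {"unit": "nm"})            # child of <xypair>
y = ET.SubElement(pair, "yaxis", {"unit": "AU"})            # sibling of <xaxis>
x.text = "254.0"
y.text = "0.1234"

# Serialize the tree back to text to see the tags and attributes.
serialized = ET.tostring(spectrum, encoding="unicode")
print(serialized)

# Iterating over an element yields its direct children, in order.
children_of_pair = [child.tag for child in pair]
print(children_of_pair)
```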

The flexibility of tag names is just one useful feature of XML.  An even more useful feature is the ability to define the structure of an XML document using an XML Schema.  A schema allows you to dictate what information goes where and to define the type of information: integer, decimal, string, etc.  Below is an example XML Schema for the document above.

xmlxsd
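As a hedged sketch of the idea, the fragment below describes a single hypothetical x-axis element whose unit attribute is restricted to a short list of values. The names xaxis and xUnitType and the allowed units are assumptions for illustration. Python's standard library does not include an XML Schema validator, so the code only reads the permitted values out of the schema and checks one attribute against them. Full validation would need a schema-aware parser.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"  # XML Schema namespace

# Hypothetical schema fragment covering only an <xaxis> element.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="xUnitType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="nm"/>
      <xs:enumeration value="cm-1"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="xaxis">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <xs:attribute name="unit" type="xUnitType" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

schema = ET.fromstring(SCHEMA)

# Collect the values the schema permits for the unit attribute.
allowed_units = [e.get("value") for e in schema.iter(XS + "enumeration")]


def unit_is_allowed(xaxis_xml: str) -> bool:
    """Check only the unit attribute; this is not full schema validation."""
    return ET.fromstring(xaxis_xml).get("unit") in allowed_units


print(allowed_units)
print(unit_is_allowed('<xaxis unit="nm">254.0</xaxis>'))
print(unit_is_allowed('<xaxis unit="furlong">254.0</xaxis>'))
```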

This very verbose "map" of an XML document provides tight control over the structure and content of an XML document.  An xs:complexType defines structure nested within an element, an xs:simpleType defines individual elements very specifically, and finally xs:element defines a basic XML element.  Notice the ability to allow only certain values for the unit of the x-axis, which is important if we are going to represent any XY-based scientific data.

For more information refer to the following sources or search the Internet for examples - there are many.