Elemental XML and XML Models

6 September 2005

[2008 note: I've left this here for historical purposes, but for me JSON has most of the modeling virtues of this idea, so I've mostly switched my own practices to that. -glenn]

[There is also a one-page introduction to this topic, prepared for a lightning talk at BarCampBoston, 3 June 06.]

Elemental XML defines a syntactically reduced subset of XML that is reasonably expressive, but mainly trivial to explain, write and parse. It is demonstrably rich enough for use in at least one real-world application (mine), and probably others by extension, but it is not designed to embody arbitrary complexity, and there are probably applications for which the aspects of XML that Elemental XML omits are, in fact, beneficial or necessary.

XML Modeling is an alternate, and again simplified, approach to defining the structure of an Elemental XML document without the use of an XML DTD or XML Schema (either of which is technically allowed, but discouraged on principle). XML Models are intended to represent the data-structure with sufficient rigor to allow useful computed validity, but sufficient clarity for human authors to easily assess a simple document's validity according to its model, and to easily produce a model from an example document and vice versa.

Elemental XML

Here is a very short Elemental XML document:

<?xml version="1.0"?>
<BlogEntry>
<ID>156</ID>
<Date>Tue, 30 Aug 2005 11:15:00 EDT</Date>
<Title>Cool idea for using a simplified subset of XML...</Title>
<Body>Check out <a href="http://www.furia.com?type=code&id=ElementalXML">this</a>.</Body>
</BlogEntry>

The document begins with an XML declaration, and thereafter consists simply of tag pairs that define elements. For those unfamiliar with XML/HTML/SGML markup, the first tag of each pair looks like <ElementName>, with a less-than symbol at the beginning, then the element name, then a greater-than symbol. The second tag of each pair is the same, but with a slash after the less-than symbol, like </ElementName>. An Elemental XML element name can use letters, numbers and the underscore (_), and case doesn't matter, so <ZipCode>, <Address_Line_1> and <Address_Line_2> are all fine tags, and <ZipCode> is the same as <zipcode>. Whatever comes between the two tags in the pair is considered to be the content of the element. An element's content may be either data (as with the elements ID, Date, Title and Body here), or other elements (as with BlogEntry here). If an element contains subelements, it makes no difference what order they appear in, so in a BlogEntry the ID can come first, or last, or between the Date and the Body, etc. The only special rule for the content of an element is that it can't contain any of the document's tags themselves.

To explain it the other way around, for people already familiar with XML: Elemental XML uses elements and subelements for everything, so there are no attributes and no empty tags. Each element must contain either text or subelements, never both. Only declared elements are parsed, though, and there are no entity declarations, so text can include arbitrary tag-looking syntax intended for post-parser processing stages (most obviously HTML intended for browser display, but also nested XML or other tag-variants) without any special escaping or preparation mandated by the Elemental XML container. An empty element and a missing element are equivalent; there is no special null value.
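The "parse only declared elements" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function name parse_elemental, the idea of passing in a set of lowercase declared names, and the dict-based tree are all invented for the example.

```python
import re

def parse_elemental(text, declared):
    """Parse Elemental XML, treating only names in `declared` as tags.

    `declared` is a non-empty set of lowercase element names (a
    hypothetical input; in practice it would come from an XML Model).
    Returns a list of root nodes, each a dict with "name", "text"
    and "children".
    """
    names = "|".join(re.escape(n) for n in declared)
    tag_re = re.compile(r"<(/?)(%s)>" % names, re.IGNORECASE)
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text)  # drop the XML declaration
    root = {"name": None, "text": "", "children": []}
    stack = [root]
    pos = 0
    for m in tag_re.finditer(text):
        stack[-1]["text"] += text[pos:m.start()]  # text before this tag
        pos = m.end()
        name = m.group(2).lower()  # case doesn't matter
        if not m.group(1):  # opening tag
            node = {"name": name, "text": "", "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:  # closing tag
            if stack[-1]["name"] != name:
                raise ValueError("unexpected closing tag: " + name)
            node = stack.pop()
            if node["children"]:
                if node["text"].strip():
                    raise ValueError("text mixed with subelements in " + name)
                node["text"] = ""  # only whitespace between subelements
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1]["name"])
    return root["children"]

doc = """<?xml version="1.0"?>
<BlogEntry>
<ID>156</ID>
<Body>Check out <a href="http://www.furia.com?type=code&id=ElementalXML">this</a>.</Body>
</BlogEntry>"""

entry = parse_elemental(doc, {"blogentry", "id", "body"})[0]
body_text = entry["children"][1]["text"]  # the <a> markup survives as plain text
```

Because the <a> tag is not in the declared set, the sketch never sees it as structure, which is the whole point of the rule.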

Text representation of defined tags themselves must still be escaped in some way to prevent ambiguity; Elemental XML doesn't stipulate a method, but the standard XML entity-escaping of < and > as &lt; and &gt; is an effective one. Note that due to full XML's more-complex rules for namespaces, entities and escaping, Elemental XML correctness does not guarantee, require or preclude XML well-formedness.
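As a small illustration of that escaping, here is a hedged Python sketch that escapes only the tags a document actually declares and leaves all other markup alone. Elemental XML does not prescribe this; the helper name escape_declared_tags and its declared argument are assumptions made for the example.

```python
import re

def escape_declared_tags(text, declared):
    """Escape < and > on declared tags only, so they can appear as text.

    `declared` is a set of element names (hypothetical input). Any other
    tag-looking syntax, such as HTML, is returned unchanged.
    """
    if not declared:
        return text  # nothing is declared, so nothing is ambiguous
    names = "|".join(re.escape(n) for n in declared)
    tag_re = re.compile(r"<(/?(?:%s))>" % names, re.IGNORECASE)
    return tag_re.sub(r"&lt;\1&gt;", text)

escaped = escape_declared_tags(
    "Put the headline in <Title>, not in <i>italics</i>.",
    {"title", "body"},
)
```

Here the <Title> mention is escaped because Title is a declared element, while the <i> pair passes through untouched.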

XML Models

An XML Model defines the structure in which an Elemental XML document represents a certain kind of information. For the most part, an XML Model is an Elemental XML document with all the data removed. The XML Model for the blog entry above would look like this:


<?xml version="1.0"?>
<BlogEntry>.
<ID>.</ID>
<Date>.</Date>
<Title>?</Title>
<Body>?</Body>
</BlogEntry>

In place of actual data, the first character inside each tag-pair is always one of the four symbols [ . ? + * ], which specify:

. there must be exactly one of this element
? there may be zero or one of this element
+ there must be one or more of this element
* there may be zero or more of this element

In the case of container elements, the symbol on the parent tag defines the rules for the container as a whole, and the symbols on the subelements define the rules for how often each may appear within each instance of the parent container.

The XML Model for the blog entry, then, declares that there must be one and only one <BlogEntry> element, which in turn must contain one and only one <ID> and one and only one <Date>, and may optionally contain a <Title> and/or a <Body>, but no more than one of each.

The XML Model must not contain anything except its tags, those four symbols, and any amount of space/tab/return whitespace (which is all semantically meaningless).
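Since a model is nothing but tags, symbols and whitespace, reading one is straightforward. The following Python sketch is one possible way to do it, not a definitive one: the function name parse_model and the nested-dict result, mapping each lowercase element name to a (symbol, children) pair, are assumptions made for the example.

```python
import re

# An opening tag immediately followed by its symbol, or a closing tag.
TOKEN = re.compile(r"\s*(?:<([A-Za-z0-9_]+)>([.?+*])|</([A-Za-z0-9_]+)>)")

def parse_model(text):
    """Return an XML Model as nested dicts: {name: (symbol, children)}."""
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text).rstrip()
    root = {}
    stack = [(None, root)]
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:  # anything besides tags, symbols and whitespace
            raise ValueError("unexpected content in model at offset %d" % pos)
        pos = m.end()
        if m.group(1):  # opening tag plus its symbol
            name, children = m.group(1).lower(), {}
            stack[-1][1][name] = (m.group(2), children)
            stack.append((name, children))
        else:  # closing tag
            name = m.group(3).lower()
            if stack[-1][0] != name:
                raise ValueError("unexpected closing tag: " + name)
            stack.pop()
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1][0])
    return root

blog_model = parse_model("""<?xml version="1.0"?>
<BlogEntry>.
<ID>.</ID>
<Date>.</Date>
<Title>?</Title>
<Body>?</Body>
</BlogEntry>""")
```

For the blog entry model this yields a single "blogentry" key whose symbol is "." and whose children are the four leaf elements with their own symbols.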

In the other direction, an XML Model is inherently a document template: copy it and plug in real text in place of the .?+* symbols.
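The template idea can be sketched mechanically too. This Python fragment is a rough illustration that only handles one occurrence of each text element; the function name fill_model and the values dictionary keyed by lowercase element name are assumptions made for the example.

```python
import re

# An opening tag and its symbol, optionally followed at once by the
# matching closing tag (which marks a text element, not a container).
SLOT = re.compile(r"<([A-Za-z0-9_]+)>[.?+*](</\1>)?", re.IGNORECASE)

def fill_model(model_text, values):
    """Replace each symbol in a model with real text from `values`."""
    def plug(m):
        name = m.group(1)
        if m.group(2) is None:  # container: just drop its symbol
            return "<%s>" % name
        return "<%s>%s</%s>" % (name, values.get(name.lower(), ""), name)
    return SLOT.sub(plug, model_text)

filled = fill_model(
    "<BlogEntry>.\n<ID>.</ID>\n<Title>?</Title>\n</BlogEntry>",
    {"id": "156", "title": "*starred*"},
)
```

An optional element with no supplied value comes out empty, which Elemental XML treats the same as leaving it out.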

Although the relative order of different elements is not significant, where multiples of a particular element are allowed, their order is assumed to be meaningful, and must be preserved by parsing applications.

An Elemental XML document does not explicitly specify an XML Model, and since Elemental XML ignores undeclared structure, an Elemental XML document's validity might be tested against multiple XML Models of varying specificity or scope. For example, the above blog entry could also be validated against this uselessly over-simplified XML Model:


<?xml version="1.0"?>
<BlogEntry>.</BlogEntry>

This leaves the ID, Date, Title and Body elements to be treated (for validation purposes) as text, or perhaps to be validated separately by individual XML Models.
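To make the validation rules concrete, here is a minimal Python sketch of checking a document against a model. It assumes both have already been parsed into hypothetical structures: the model as nested dicts of name to (symbol, children), and the document as lists of (name, content) pairs where content is either text or a nested list. It also reads "empty and missing are equivalent" as meaning that empty elements are not counted, and it ignores names the model does not declare at a given level; both are interpretations, not stated rules.

```python
from collections import Counter

# Minimum and maximum appearances for each symbol (None = no maximum).
LIMITS = {".": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

def validate(elements, model, path=""):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    # Assumption: empty elements count as missing, undeclared names are ignored.
    present = [(n, c) for n, c in elements if c and n in model]
    counts = Counter(n for n, _ in present)
    for name, (symbol, _) in model.items():
        low, high = LIMITS[symbol]
        found = counts[name]
        if found < low or (high is not None and found > high):
            errors.append("%s/%s: found %d, model says '%s'"
                          % (path, name, found, symbol))
    for name, content in present:
        submodel = model[name][1]
        if submodel:  # a container in the model
            if isinstance(content, str):
                errors.append("%s/%s: text where subelements are expected"
                              % (path, name))
            else:
                errors.extend(validate(content, submodel, path + "/" + name))
    return errors

full_model = {"blogentry": (".", {"id": (".", {}), "date": (".", {}),
                                  "title": ("?", {}), "body": ("?", {})})}
simple_model = {"blogentry": (".", {})}

good = [("blogentry", [("id", "156"), ("date", "Tue, 30 Aug 2005"),
                       ("title", "Cool idea")])]
bad = [("blogentry", [("date", "x"), ("title", "a"), ("title", "b")])]
as_text = [("blogentry", "<ID>156</ID><Date>Tue, 30 Aug 2005</Date>")]

good_errors = validate(good, full_model)
bad_errors = validate(bad, full_model)
simple_errors = validate(as_text, simple_model)
```

The same entry passes against the full model and, treated as plain text, against the over-simplified one, while the bad entry is reported for its missing ID and its second Title.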

FuriaDocument (example)

Here, with some content elided for sanity, is the actual Elemental XML document for issue 245 of The War Against Silence:

<?xml version="1.0"?>

<FuriaDocument>

<Metadata>
<Metatitle>Sloan, astrid, Anchormen, Promise Ring</Metatitle>
<Number>245</Number>
<Title>How September Came for Sinatra</Title>
<Date>7 October 99</Date>
</Metadata>

<Section>
<Type>entry</Type>
<Heading>Sloan: <i>Between the Bridges</i></Heading>

<Text>There are many valid reasons for being unhappy... A large fraction of the human population 
pursues existences that are vulnerable to severe disruption by forces either entirely or partially 
out of the victims' control, depending on how much allowance you're inclined to make for 
human nature and emotional inertia. "The news", most of the time, is a monotonous exercise in 
cataloguing the sources of justified human misery: earthquakes, train wrecks, itinerant plagues, 
esoteric tribal vendettas and general imminent apocalypse. Happiness is <i>possible</i> in the 
face of anything, but the level of Zen required for the serene appreciation of complete 
catastrophe is pretty advanced, and the difficulty of getting through to Amazon.com to order 
<i>Self-Esteem for Dummies</i> is probably not high on the current list of East Timor refugee 
complaints. For me personally, though, and probably also for you, given that you're sitting in 
front of a functioning computer reading music reviews, happiness cowers behind far-less-tangible 
barricades. Nothing ghastly and inexplicable has happened to anybody I love in weeks. I've never 
had a helicopter-deposited President shake my hand and assure me that the nation's thoughts are 
with me and the Red Cross will be here for as long as necessary. I've never been hungry, really. 
But our conceptions of happiness adjust themselves to our environments. Solipsism is depressing, 
and anything short of it exposes you to loss, so we are all indefensible. Distress adapts.</Text>

<Text>We are unhappy, then, however much of the time...</Text>

<Text>And this is why, I think, I reserve such intense joy...</Text>

<Text>Buoyant pop comes in many forms...</Text>
</Section>

<Section>
<Type>entry</Type>
<Heading>astrid: <i>Strange Weather Lately</i></Heading>

<Text>One of the perils of mail-ordering from web sites...</Text>

<Text>This isn't the first band I've discovered by mistake...</Text>
</Section>

<Section>
<Type>entry</Type>
<Heading>The Anchormen: <i>The Boy Who Cried Love</i></Heading>

<Text>Even astrid seem comparatively somber, though, in the company of <i>The Boy Who Cried 
Love</i>, the thirteen-song, twenty-two minute debut by...</Text>
</Section>

<Section>
<Type>entry</Type>
<Heading>The Promise Ring: <i>Very Emergency</i></Heading>

<Text>I haven't heard the first two records by...</Text>
</Section>

</FuriaDocument>

Note that since elements cannot contain mixtures of text and subelements, the <i> tags here are self-evidently not part of the Elemental XML structure, and thus do not need to be escaped.

Here is the corresponding XML Model, which defines some additional optional structure that this one document itself doesn't use:


<?xml version="1.0"?>

<FuriaDocument>.

<Metadata>.
<Metatitle>?</Metatitle>
<Number>?</Number>
<Title>?</Title>
<Date>?</Date>
<DatePub>?</DatePub>
<Width>?</Width>
<Tag>*</Tag>
</Metadata>

<Section>*
<Type>?</Type>
<Number>?</Number>
<Heading>*</Heading>
<CSVTable>?
  <CSV>.</CSV>
  <CGI>?</CGI>
  <SortField>?</SortField>
  <DisplayFlag>?</DisplayFlag>
  <FieldType>*</FieldType>
  <Filter>*</Filter>
</CSVTable>
<Text>*</Text>
</Section>

</FuriaDocument>

The model, then, declares a mandatory FuriaDocument root element. Within this is a mandatory Metadata element with several optional child elements, of which one (Tag) may appear in multiples. The bulk of the content then appears in any number (including zero) of Sections, each of which contains zero or one each of Type, Number and CSVTable, and any number of Headings and Texts. CSVTable, in turn, requires a CSV and allows various other things. Any element that doesn't contain other elements contains text (by definition).

Caveats and Recommendations

Elemental XML is best, obviously, for simple structures. It has no mechanism for independent declaration and reuse of complex subelements, for example, so an Elemental XML representation of something as free-form as HTML would be unwieldy. With the element/attribute space flattened into only elements, it's best to use unique tag-names to avoid ambiguity, rather than counting on the structure to distinguish between them.

Technically, however, the reuse of non-container elements is allowed. Reuse of container elements can thus be achieved by defining a reused container as a text element for the purposes of the top-level XML Model, and then re-validating the document against a separate XML Model of just the reused container and its substructure.

That is basically all. There are more things on Earth that you might want to do, but these are the ones Elemental XML understands and models. It is not a programming language, a fine-grained constraint filter, a formatting engine or a source of run-time default values. But for simple information, it doesn't need to be.

Demonstration Validator

A simple command-line validator, xmvalidate.pl, is available for use or as the basis for other validators. It is a Perl script, written and tested on Mac OS X 10.4.2 but possibly functional on other Perl-enabled systems. Download it, unzip it, and run it without parameters for instructions.

Version Notes

Renaming and minor revisions, 6 September 2005

I can't believe I didn't think of the obvious name for this the first time around. Renamed from the awkward "XML Simple" to the doubly descriptive "Elemental XML". Several minor edits to the text for clarity, and a new version of the validator to add support for multiple root elements in the documents to be validated.

Added the sample validator, 31 August 2005

I thought it was going to be harder.

Unreviewed draft 0.1, 30 August 2005

It is not clear that anybody other than me wants this, but I'm using it, so it seemed like a good idea to write the rules down before I forget them. And since I came to this out of my own irritation at the convolutions of the regular XML rules, conceivably there are other people similarly annoyed, and if those people exist, it is possible that some of them would benefit from a simplification and haven't devised their own, or would be happier with a random external endorsement of one.

Feedback of any sort is very welcome, and in the unlikely event that there is more interest than I'm anticipating, I will be happy to answer and revise and advise.

In particular, anybody who is the right combination of anal and bored is cheerfully encouraged to write a demonstration parser for applying an XML Model to an Elemental XML document. I'm hoping to resist the temptation to do this myself.