Stand-off Markup


Introduction

Markup is said to be standoff, or external, when the markup data is placed outside of the text it is meant to tag: before, after, or even outside it altogether. The markup therefore points to, rather than wraps, the relevant data content.

There are many reasons for externalizing markup: the source data might be read-only, we might lack write access to it, other incompatible markup might already be present in the data, and so on.

Within this document, there is no specific recommendation for the appropriate applications of standoff markup. Rather, this document provides a generic mechanism for expressing all kinds of markup as either internal or external to the content it refers to.

In fact, three major requirements for standoff markup as expressed here are that
  • any valid TEI markup can be either internal or external, depending only on the authors' needs,
  • the external markup can be internalized by applying the markup to the document content by either substituting the existing markup or adding to it, thereby forming a valid TEI document, and
  • the external markup itself specifies whether the internalized document is obtained by substituting the existing internal markup or by adding to it.

For clarity, we will use the following terms consistently:
  • the source document is the document to which the external markup refers; it can be XML, plain text, or multimedia;
  • internal markup is the markup already present in the source document, when that document is XML;
  • external or standoff refers to the document and the markup that are outside the source document and point to it;
  • internalizing is the action of creating a new XML document with the external markup around the content, either in addition to or instead of the original internal markup;
  • externalizing, analogously, is the act of splitting a plain XML document into a pair of documents: a standoff document containing some of the markup of the original document, and a plain XML document with the text content and whatever markup has not been extracted for the standoff document (or a plain text document if all markup has been extracted).

Addressing is a key issue in standoff markup, since it provides the mechanism for exactly identifying the content of the external elements. It is important to identify an addressing syntax that is at the same time complete (all positions in a document can be identified with the required detail), flexible (several different ways of addressing can be used for different purposes), and reliable (the correct source is identified with no ambiguities). The syntax needs to be capable of identifying arbitrary fragments within text-only sources, multimedia sources, and XML documents.

XInclude and XPointer

This specification makes use of XInclude as the overall inclusion mechanism (see note 1). XInclude relies on XPointer for the expression of the actual fragments of text to be internalized. Therefore this specification makes use of XPointers as the basic addressing mechanism. Although XInclude only requires support for the element() scheme of XPointer, this specification requires that all the W3C schemes of XPointer can be used for expressing addresses, and in particular it fully supports the xpointer() scheme.

XInclude is a W3C standard that specifies a syntax for including, within an XML document, data fragments located in other resources. Included resources can be either plain text or XML. XInclude instructions within an XML document are meant to be replaced by the resource pointed at by a URI, possibly including an XPointer that identifies the exact subresource to be included. Fallback content or resources can be specified in case the requested resource cannot be fetched. XInclude defines a namespace (http://www.w3.org/2001/XInclude), which we will usually prefix with xi:, and exactly two elements, <xi:include> and <xi:fallback>.

The <xi:include> element uses the href attribute to specify the location of the resource to be included; its value is a URI containing, if necessary, an XPointer. Additionally, it uses the parse attribute (whose valid values are text and xml) to specify whether the included content is plain text or an XML fragment, and the encoding attribute to provide a hint, when the included fragment is text, of the character encoding of the fragment. The <xi:fallback> element is required within an <xi:include> and specifies alternative content to be used when the external resource cannot be fetched for any reason.
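The attributes just described can be read off an <xi:include> element with any namespace-aware XML parser. The following is a minimal sketch in Python using only the standard library; the function name describe_include and the sample snippet are illustrative assumptions, not part of this specification.

```python
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"

# Hypothetical sample: one verse line whose content is included from Source.txt.
SNIPPET = """<l xmlns:xi="http://www.w3.org/2001/XInclude"><xi:include
    href="Source.txt#range(/.0,/.48)" parse="text" encoding="UTF-8">
  <xi:fallback>Verse not found</xi:fallback>
</xi:include></l>"""


def describe_include(include):
    """Return the parts of an <xi:include> element as a plain dict."""
    href = include.get("href", "")
    # Split the resource location from the optional fragment (the XPointer).
    resource, _, pointer = href.partition("#")
    fallback = include.find(f"{{{XI_NS}}}fallback")
    return {
        "resource": resource,
        "pointer": pointer or None,
        "parse": include.get("parse", "xml"),  # XInclude treats "xml" as the default
        "encoding": include.get("encoding"),
        "fallback": None if fallback is None else "".join(fallback.itertext()),
    }


include = ET.fromstring(SNIPPET).find(f"{{{XI_NS}}}include")
info = describe_include(include)
print(info)
```

The sketch only inspects the element; it does not fetch or include anything.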

XPointer is a W3C standard providing a generic syntax for expressing references into an XML document through the use of schemes, i.e., different syntaxes for expressing locations within an XML document. The current generation of the XPointer recommendations recommends only one such scheme, called element(), which can specify either bare names (elements with a specific xml:id attribute) or child sequences (numerical sequences specifying the path through the XML tree down to a specific subtree of XML content). Another scheme, xpointer(), has not yet been turned into a W3C recommendation, although it has been part of the XPointer drafts from the beginning. The xpointer() scheme defines two additional concepts, points and ranges, that can be used to specify sub-node fragments (e.g., a few words within a longer text node) or trans-node fragments (e.g., a longer passage spanning different branches of the overall XML tree).
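To make bare names and child sequences concrete, here is a small sketch of a resolver for the two simplest forms of element() pointer. It assumes that child-sequence steps are 1-based and count child elements only, and it does not handle the mixed form (a name followed by a child sequence). The function name resolve_element_pointer and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<text>
  <front><head>1755</head></front>
  <body>
    <l xml:id="l1">To make a prairie it takes a clover and one bee,</l>
    <l xml:id="l2">One clover, and a bee,</l>
  </body>
</text>"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_element_pointer(root, pointer):
    """Resolve a bare name ('l1') or a child sequence ('/1/2/1').

    Returns the matching element, or None when nothing matches.
    """
    if not pointer.startswith("/"):
        # Bare name: the element whose xml:id equals the pointer.
        for elem in root.iter():
            if elem.get(XML_ID) == pointer:
                return elem
        return None
    steps = [int(step) for step in pointer.strip("/").split("/")]
    # The first step selects among the document's children: only /1, the root.
    if steps[0] != 1:
        return None
    node = root
    for step in steps[1:]:
        children = list(node)  # child elements only, in document order
        if not 1 <= step <= len(children):
            return None
        node = children[step - 1]
    return node


root = ET.fromstring(DOC)
print(resolve_element_pointer(root, "/1/2/2").text)
print(resolve_element_pointer(root, "l1").text)
```

Points and ranges of the xpointer() scheme are deliberately out of scope here, since they address positions inside and across nodes rather than whole elements.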

Doing standoff markup in TEI

In this version of TEI, any element can be externalized by removing its content and putting in its place an XInclude <xi:include> element whose XPointer points to that content.

For instance the following portion of a TEI document:
<text>
  <front>
    <head>1755</head>
  </front>
  <body>
    <l>To make a prairie it takes a clover and one bee,</l>
    <l>One clover, and a bee,</l>
    <l>And revery.</l>
    <l>The revery alone will do,</l>
    <l>If bees are few.</l>
  </body>
</text>
can be externalized by placing the actual text in a separate document, and providing exactly the same markup with the <xi:include> elements:
Source.txt
To make a prairie it takes a clover and one bee,
One clover, and a bee,
And revery.
The revery alone will do,
If bees are few.

External.xml
<text xmlns:xi="http://www.w3.org/2001/XInclude">
  <front>
    <head>1755</head>
  </front>
  <body>
    <l><xi:include href="Source.txt#range( /.0, /.48)" parse="text">
      <xi:fallback>Verse not found</xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range( /.49, /.71)" parse="text">
      <xi:fallback>Verse not found</xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range( /.72, /.83)" parse="text">
      <xi:fallback>Verse not found</xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range( /.84,/.109)" parse="text">
      <xi:fallback>Verse not found</xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range(/.110,/.126)" parse="text">
      <xi:fallback>Verse not found</xi:fallback></xi:include></l>
  </body>
</text>
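The five ranges used in External.xml can be recomputed from Source.txt. The sketch below assumes that each verse ends with a single newline character and that the numbers in range(/.N,/.M) are zero-based character offsets, with the end offset exclusive; the helper name line_ranges is illustrative.

```python
SOURCE = (
    "To make a prairie it takes a clover and one bee,\n"
    "One clover, and a bee,\n"
    "And revery.\n"
    "The revery alone will do,\n"
    "If bees are few.\n"
)


def line_ranges(text):
    """Yield (start, end) character offsets of each line, newline excluded."""
    start = 0
    for line in text.splitlines(keepends=True):
        end = start + len(line.rstrip("\n"))
        yield start, end
        start += len(line)  # the next line begins after the newline


ranges = list(line_ranges(SOURCE))
for start, end in ranges:
    print(f"range(/.{start},/.{end})  {SOURCE[start:end]!r}")
```

Under these assumptions the offsets come out as 0-48, 49-71, 72-83, 84-109 and 110-126, matching the href values above.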

Please note that this specification requires the XInclude namespace declaration to be present in all cases. The <xi:fallback> element contains text or XML fragments to be placed in the document if the inclusion fails for any reason, for example because the external resource is inaccessible. This specification requires, with XInclude, that the <xi:fallback> element be present.

Nonetheless, there may well be cases where no relevant alternative content can be specified for an inclusion, and where failure to include an external resource should result not in a default inclusion but in an actual application error.

For this reason, this specification defines a new element, <error>, that is part of m.Incl. The overall purpose of the <error> element is to throw an exception in the processing of a TEI document. Its main use is within an <xi:fallback> element, to request that processing of the document be stopped whenever an inclusion fails because the resource is unavailable.

The previous example can be rephrased as follows to throw an error when the source document is unavailable:
<text xmlns:xi="http://www.w3.org/2001/XInclude">
  <front>
    <head>1755</head>
  </front>
  <body>
    <l><xi:include href="Source.txt#range( /.0, /.48)" parse="text">
      <xi:fallback><error msg="Verse not found"/></xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range( /.49, /.71)" parse="text">
      <xi:fallback><error msg="Verse not found"/></xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range( /.72, /.83)" parse="text">
      <xi:fallback><error msg="Verse not found"/></xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range( /.84,/.109)" parse="text">
      <xi:fallback><error msg="Verse not found"/></xi:fallback></xi:include></l>
    <l><xi:include href="Source.txt#range(/.110,/.126)" parse="text">
      <xi:fallback><error msg="Verse not found"/></xi:fallback></xi:include></l>
  </body>
</text>
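The difference between a plain fallback and an <error> fallback can be illustrated with a toy internalizer. This is only a sketch under stated assumptions, not a conforming XInclude processor: it handles parse="text" inclusions only, reads resources from an in-memory dict instead of fetching them, and interprets range(/.N,/.M) as zero-based character offsets. The names internalize, include_text and InclusionError are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"
# Assumed reading of the pointers used above: two character offsets.
RANGE = re.compile(r"range\(\s*/\.(\d+)\s*,\s*/\.(\d+)\s*\)")

SOURCE = (
    "To make a prairie it takes a clover and one bee,\n"
    "One clover, and a bee,\n"
    "And revery.\n"
    "The revery alone will do,\n"
    "If bees are few.\n"
)
LINE = (
    '<l xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="Source.txt#range( /.49, /.71)" parse="text">'
    '<xi:fallback><error msg="Verse not found"/></xi:fallback>'
    "</xi:include></l>"
)


class InclusionError(Exception):
    """Raised when an inclusion fails and its fallback holds <error>."""


def include_text(include, resources):
    """Return the text that replaces one <xi:include parse="text">."""
    resource, _, pointer = include.get("href", "").partition("#")
    match = RANGE.fullmatch(pointer)
    if resource in resources and match:
        start, end = map(int, match.groups())
        return resources[resource][start:end]
    # The inclusion failed: use the fallback, or stop if it asks for an error.
    fallback = include.find(XI + "fallback")
    if fallback is None:
        raise InclusionError("inclusion failed and no fallback was given")
    error = fallback.find("error")
    if error is not None:
        raise InclusionError(error.get("msg", "inclusion failed"))
    return fallback.text or ""


def internalize(root, resources):
    """Replace every text <xi:include> in the tree, in place."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != XI + "include":
                continue
            text = include_text(child, resources) + (child.tail or "")
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + text
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root


line = internalize(ET.fromstring(LINE), {"Source.txt": SOURCE})
print(line.text)
```

Calling internalize(ET.fromstring(LINE), {}) instead, with no resources available, raises InclusionError carrying the msg value, whereas a fallback holding plain text would simply be inserted.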

Well-formedness and validity of stand-off markup

This specification states that the whole source fragment identified by the XInclude element, together with any markup it contains, is inserted at the position specified. The internalized document is required to be well-formed, and it is the responsibility of the author of the external markup to ensure this. Inaccurate ranges, for instance, could easily produce ill-formed documents that cannot be processed further. Of course, ill-formed documents are a problem that can only arise when source documents are XML; plain text source documents will always yield well-formed documents.
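The following sketch shows how an inaccurate range over an XML source can break well-formedness. It treats the source as a string and cuts fragments out of it by character position, which is a simplification of what a real XPointer range does; the sample source and the helper is_well_formed are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SOURCE_XML = "<p>One <hi>clover</hi>, and a bee,</p>"


def is_well_formed(fragment):
    """Wrap a candidate fragment in <l> and report whether it parses."""
    try:
        ET.fromstring(f"<l>{fragment}</l>")
        return True
    except ET.ParseError:
        return False


start = SOURCE_XML.index("One")
# Accurate range: the whole content of <p>, with <hi> opened and closed.
good = SOURCE_XML[start:SOURCE_XML.index("</p>")]
# Inaccurate range: stops before </hi>, leaving <hi> unclosed.
bad = SOURCE_XML[start:SOURCE_XML.index("</hi>")]

print(is_well_formed(good), is_well_formed(bad))
```

A check of this kind can catch a faulty range early, before the internalized document is handed to further processing.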

Validity is always assessed on internalized documents. Thus an XML document containing <xi:include> and <xi:fallback> elements is never a valid TEI document. This specification therefore requires that validity be verified after all the <xi:include> elements have been resolved.
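A processing pipeline might guard its validation step with a check that no XInclude elements remain. The sketch below shows one way to do that; the function names are hypothetical and the validation itself is left out.

```python
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"


def unresolved_includes(root):
    """Return the href of every <xi:include> still present in the tree."""
    return [elem.get("href") for elem in root.iter(XI + "include")]


def ready_for_validation(root):
    """True once no XInclude elements remain, i.e. after internalization."""
    leftovers = list(root.iter(XI + "include")) + list(root.iter(XI + "fallback"))
    return not leftovers


external = ET.fromstring(
    '<l xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="Source.txt#range(/.72,/.83)" parse="text">'
    "<xi:fallback>Verse not found</xi:fallback></xi:include></l>"
)
internalized = ET.fromstring("<l>And revery.</l>")

print(unresolved_includes(external))
print(ready_for_validation(external), ready_for_validation(internalized))
```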

Including text and/or xml fragments

When the source text is plain text the overall form of the XPointer pointing to it is of minimal importance. The form of the XPointer matters considerably, on the other hand, when the source document is XML.

In this case, it is rather important to distinguish whether we intend to substitute the source XML with the new one, or just to add new markup to it. The XPointers used in the references can express both cases.

A simple way is to make sure to select only textual data in the XPointer. For instance, given the following document:
Source.xhtml
<html>
<body>
<div>To make a prairie it takes a <a href="clover.gif">clover</a> and one <a href="bee.gif">bee</a>,</div>
<div>One <a href="clover.gif">clover</a>, and a <a href="bee.gif">bee</a>,</div>
<div>And revery.</div>
<div>The revery alone will do,</div>
<div>If bees are few.</div>
</body>
</html>
the expression range(/1/2/1.0,/1/2/11.1) will select the whole poem, including its text, the <div> elements, and the hypertext links (NB: in XPointer whitespace-only text nodes count).
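The step numbers in that range can be checked by counting child nodes with a DOM parser that keeps whitespace-only text nodes. The sketch assumes that Source.xhtml is laid out with each element on its own line and no indentation, and that each step in /1/2/11 is a 1-based index over all child nodes; the helper child_at is illustrative.

```python
from xml.dom import minidom

SOURCE_XHTML = """<html>
<body>
<div>To make a prairie it takes a <a href="clover.gif">clover</a> and one <a href="bee.gif">bee</a>,</div>
<div>One <a href="clover.gif">clover</a>, and a <a href="bee.gif">bee</a>,</div>
<div>And revery.</div>
<div>The revery alone will do,</div>
<div>If bees are few.</div>
</body>
</html>"""


def child_at(node, step):
    """Return the step-th child node (1-based), counting text nodes too."""
    return node.childNodes[step - 1]


document = minidom.parseString(SOURCE_XHTML)
html = child_at(document, 1)   # /1
body = child_at(html, 2)       # /1/2: the first child is a whitespace text node
last = child_at(body, 11)      # /1/2/11: whitespace after the last <div>

print(body.tagName, len(body.childNodes), repr(last.data))
```

Under this layout <body> has eleven children (five <div> elements and six whitespace text nodes), so a range from /1/2/1.0 to /1/2/11.1 spans everything inside <body>.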

By contrast, the expressions xpointer(//text()/range-to(.)) and xpointer(string-range(//text(),"To")/range-to(//text(),"few.")) will only select the text of the poem, with no markup inside.

Thus, the following could be a valid standoff document for the Source.xhtml document:
External2.xml
<text>
  <front>
    <head>1755</head>
  </front>
  <body>
    <l content='Source.xhtml#xpointer(string-range(//div[1]/text(),"To")/range-to(//div[1]/text(),"bee,"))'/>
    <l content='Source.xhtml#xpointer(string-range(//div[2]/text(),"One")/range-to(//div[2]/text(),"bee,"))'/>
    <l content='Source.xhtml#xpointer(string-range(//div[3]/text(),"And")/range-to(//div[3]/text(),"."))'/>
    <l content='Source.xhtml#xpointer(string-range(//div[4]/text(),"The")/range-to(//div[4]/text(),","))'/>
    <l content='Source.xhtml#xpointer(string-range(//div[5]/text(),"If")/range-to(//div[5]/text(),"."))'/>
  </body>
</text>
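The text-only selection that these pointers aim at can be approximated without an xpointer() implementation by concatenating the text nodes inside each <div>. This is a sketch of the intended result rather than an evaluation of the pointers themselves; the function name verse_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE_XHTML = """<html>
<body>
<div>To make a prairie it takes a <a href="clover.gif">clover</a> and one <a href="bee.gif">bee</a>,</div>
<div>One <a href="clover.gif">clover</a>, and a <a href="bee.gif">bee</a>,</div>
<div>And revery.</div>
<div>The revery alone will do,</div>
<div>If bees are few.</div>
</body>
</html>"""


def verse_texts(xhtml):
    """Return the text of each <div>, dropping the <a> markup inside it."""
    body = ET.fromstring(xhtml).find("body")
    return ["".join(div.itertext()) for div in body.findall("div")]


verses = verse_texts(SOURCE_XHTML)
for verse in verses:
    print(verse)
```

Each string is what the corresponding <l> element would contain after internalization when only textual data is selected and the hypertext links of the source are left behind.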
Notes
1. The version on which this text is based is the W3C Candidate Recommendation dated 17 September 2002.

Last recorded change to this page: 2007-09-16