TEI: sow02

Introduction

This document gives a concise summary of the main arguments and issues considered by the TEI-SOM working group in the course of making its recommendations. It is intended to be particularly useful to the editors and the TEI Council in understanding the recommendations of the working group.

XPointer versus ID and Entity

The working group has decided to revise all of the hypertext features of the TEI so that URI references can be used in every place where IDs are currently specified. A URI reference is the syntax used in HTML to refer to a URI and a fragment identifier (in the form of an XPointer), separated from the rest of the URI reference by a '#' symbol.

The use of URI references (with XPointers in fragment identifiers) has several advantages over the methods in the current recommendations.

Unlike IDs, URI references are independent of parsing context as they can uniformly make reference to any data having a URI. ID/IDREF only work when all data is parsed at a single time. This can present obstacles to using the same external entity in several parsing contexts (e.g. when parsing a part of an unwieldy document like an encyclopedia or corpus).
While the current SGML-like approach is still possible in XML, URIs and fragment identifiers are now very much more commonly used and understood by implementers than unparsed external entities and strings referring to IDs in those entities. While IDREFs within a document are automatically validated by parsers, custom tools are needed to validate these sorts of external links. URI references are also independent of DTD declarations, which is convenient for users of other schema languages.
Reference to an element in a different document using the old mechanisms requires the use of two attributes used together in a TEI-specific way, while URI references are part of the standards defining the Internet and the web.
In the current TEI approach, references using IDs must use different mechanisms and attribute conventions than those referring to external elements. This creates more elements and attributes for a TEI user to understand.
Text and markup ranges are handled uniformly by XPointer, whereas in the current TEI approach, they create yet another variation in pointing when a range must be selected.
URI references need not be significantly more verbose than IDREFs. A URI reference like "#chap1" is interpreted as a reference to the element with the XML ID "chap1", in the document designated by an empty relative URI (the current document, according to the rules of XML Base).
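
The uniform treatment of local and remote references described above can be sketched as follows. This is a minimal illustration, not a conforming XPointer processor: the document names, the in-memory DOCS table, and the use of xml:id as the ID attribute are all assumptions made for the example.

```python
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical in-memory document store: URI -> XML source.
DOCS = {
    "current.xml": '<text><div xml:id="chap1"><p>One</p></div></text>',
    "other.xml": '<text><div xml:id="chap9"><p>Nine</p></div></text>',
}

def resolve(reference, current_uri="current.xml"):
    """Return the element a URI reference points to, or None if not found."""
    uri, fragment = urldefrag(reference)
    # An empty URI part designates the current document.
    root = ET.fromstring(DOCS[uri or current_uri])
    for elem in root.iter():
        if elem.get(XML_ID) == fragment:
            return elem
    return None
```

The same function handles "#chap1" and "other.xml#chap9", which is the point of the argument: one syntax and one lookup path, whether or not the target is in the current document.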

One significant disadvantage of URIs compared to ID/IDREF is that IDREF attributes are automatically checked for validity by a validating XML parser, providing a cheap and useful validity check for encoded documents. Because this may be very important to some projects, we believe that the ability to generate TEI schemas that use ID/IDREF instead of URI references might be a good option for non-linking uses of ID/IDREF in the TEI.
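
To make the cost of losing parser-level checking concrete, the following sketch shows the kind of small custom tool a project would need in order to recover the check for local references. The attribute name "target" and the use of xml:id are assumptions for illustration; a real project would list its own pointer attributes.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def dangling_pointers(xml_source, pointer_attrs=("target",)):
    """List local '#name' references that match no ID in the document."""
    root = ET.fromstring(xml_source)
    ids = {e.get(XML_ID) for e in root.iter() if e.get(XML_ID) is not None}
    missing = []
    for elem in root.iter():
        for name in pointer_attrs:
            # A pointer attribute may hold several whitespace-separated references.
            for ref in elem.get(name, "").split():
                if ref.startswith("#") and ref[1:] not in ids:
                    missing.append(ref)
    return missing
```

References to other documents are skipped here; checking them would require fetching and parsing each target, which is exactly the extra work that ID/IDREF validation avoids.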

The biggest migration/compatibility difficulty is that the syntax for XPointer ID references is not identical to that in an XML IDREF attribute. This implies that if we offer TEI users the option to stick with traditional use of ID/IDREF, the ODD processor will need to be able to accept a user option to create TEI schemas in which IDREF attributes will replace CDATA attributes intended to contain XPointers. Alternatively, users will be required to use the TEI extension mechanisms to override URI reference semantics for linking tags, in the case that ID/IDREF validation is needed.
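
The syntactic difference is small but real, as the following sketch of a conversion in both directions suggests. The function names are hypothetical, and the reverse direction deliberately rejects anything that is not a bare-name reference to the current document.

```python
def idrefs_to_uri_refs(value):
    """Convert an IDREF or IDREFS value ('n1 n2') to URI references ('#n1 #n2')."""
    return " ".join("#" + name for name in value.split())

def uri_refs_to_idrefs(value):
    """Convert local bare-name references back to IDREFS, or raise ValueError."""
    names = []
    for ref in value.split():
        # Only '#name' can be expressed as an IDREF; remote documents and
        # scheme-based pointers such as '#element(/1/2)' cannot.
        if not ref.startswith("#") or len(ref) == 1 or "(" in ref:
            raise ValueError(f"not a local ID reference: {ref!r}")
        names.append(ref[1:])
    return " ".join(names)
```

The asymmetry is the interesting part: every IDREF has a URI reference equivalent, but not every URI reference can be reduced to an IDREF.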

There is also an open question as to how large a scope such an option should have. Are all IDs and IDREFs in the TEI the same? The working group has come to feel that XPointer is the best pointing method for hypertext links and standoff markup, except in special cases where the use of IDs creates specific advantages. However, in smaller scale projects (like the encoding of single documents), there is the occasional need to use cross-references to link footnotes to body text, or to represent local discontinuous phenomena, like interrupted quotations. For such small scale projects, the need for a powerful pointer mechanism is small, and the loss of the automatic validity checking provided by validating parsers is a significant problem.

The group does not have unanimous agreement that ID and IDREF should be uniformly replaced by XPointer everywhere, all the time. While there is a great deal of attraction to the idea of adopting a uniform mechanism (and a belief that XPointer is probably the best such mechanism extant), we feel that users will need the ability to select either XPointer or ID/IDREF in the new schema generation process.

Extension/revision of XPointer schemes

There are some features of TEI Extended pointers that are not supported by the W3C's XPointer pointer scheme. The workgroup's original plan was to define a minimal extension of XPointer to encompass these features, and to declare that extension using the extension mechanisms defined in the relevant W3C standards. However, recent developments have introduced strong arguments that may change this plan.

The TEI has recently agreed to cooperate with the ISO in the definition of several aspects of interest in the creation of linguistic corpora, including the creation of tagging interchange standards for linguistic annotation. Because linguistic annotation requires robust pointing mechanisms, and given that the W3C is not currently doing any work on the elaboration of the XPointer mechanisms, the TEI has the opportunity to work through the ISO to define a standard mechanism for pointing and addressing in XML documents (as needed by the ISO). Some such activity is required, as the ISO cannot normatively cite a draft W3C recommendation in any case.

It can be expected that the work of creating input to the ISO to define a standard that would fill the hole where XPointer should be will delay the completion of this group; but it would also leave the TEI in a stronger position with respect to XML pointing, in that the TEI would then be using a clearly defined ISO standard rather than an orphaned W3C working draft. The additional work involved is clearly significant, but the work would also be accompanied by additional manpower, as the ISO work has a fairly high profile within the NLP community, as well as an extremely active group of implementors.

These ideas have not yet been discussed by the working group at large, only by Ide and Durand. However, the minutes of the working group meetings show that Vitali, at least, is interested in changing some aspects of XPointer.

Some potential changes to W3C Pointing schemes

Currently the W3C has defined three pointer schemes: bare names, "xmlns()" (which declares namespaces for pointer schemes), and "element()" (which provides abbreviated pointing by means of child numbers). There is a working draft of an "xpointer()" scheme, on which no work is currently being performed.
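
As an illustration of pointing by child numbers, the sketch below resolves a pure child sequence such as "/1/2" (the first child of the document root, then its second element child). It is a simplification under stated assumptions: it does not handle the form of the element scheme that begins with a name, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def resolve_child_sequence(root, sequence):
    """Resolve a child sequence like '/1/2' against the document element."""
    steps = [int(s) for s in sequence.strip("/").split("/")]
    # The first step counts element children of the document root; a
    # well-formed document has exactly one, the document element.
    if steps[0] != 1:
        return None
    node = root
    for n in steps[1:]:
        children = list(node)  # element children only, in document order
        if n < 1 or n > len(children):
            return None
        node = children[n - 1]
    return node
```

Child numbers are compact and need no IDs in the target, but they are fragile: inserting an element earlier in the document silently changes what "/1/2" selects.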

The XPointer scheme builds on XPath, and extends it to support character ranges and points between elements. This is done by extending XPath semantics with new types of object (point, range, and stringrange), and adding functions that return those new objects.

We propose a different path to extended pointing, based on the observation that the new data types are a poor fit to the XPath data model because none of the new data types are useful inputs to existing XPath operators, which are defined only in terms of Nodes and Nodesets. This means that the sequential path structure of XPath isn't productive for the new XPath operations since they are only useful when they occur as the last step of a path.

The alternative we propose is to use pointer schemes as the place to extend the data model of link endpoints. Since XPath is defined in terms of Nodes and Nodesets, we will preserve that design by creating a simple "xpath()" pointer scheme that returns only nodes or nodesets. Since Spans and Character Ranges are still needed data types for linking, those data types will be returned by new pointer schemes. It is obvious that the addressing of Spans also requires the selection of nodes in exactly the manner that XPath already allows, and that it does not make sense to duplicate the separate addressing functions already provided by the element and proposed xpath pointing schemes.

Therefore, the new schemes will be defined minimally, with the ability to recursively use any other xpointer scheme as an argument. Support for spans will be via a new "span()" scheme, that takes two arguments: a pointer scheme selecting the beginning of the span, and a pointer scheme selecting the end of the span. Not only does this separate support for spans from support for other types of addressing, but it enables applications to support pointer schemes like element() + span(), if full xpath addressing is not needed, even though spans are.
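
The recursive design can be sketched with a small parser. The concrete syntax used here, "span(start,end)" with a comma between the two pointers, is an assumption made for illustration; the proposal does not yet fix a syntax. Only span() arguments are parsed recursively, and the data of every other scheme is kept as an opaque string.

```python
def split_top_level(body):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])
    return parts

def parse_pointer(text):
    """Parse 'scheme(data)'; span() arguments are themselves pointers."""
    text = text.strip()
    open_paren = text.find("(")
    if open_paren <= 0 or not text.endswith(")"):
        raise ValueError(f"not a scheme-based pointer: {text!r}")
    name, data = text[:open_paren], text[open_paren + 1:-1]
    if name == "span":
        parts = split_top_level(data)
        if len(parts) != 2:
            raise ValueError("span() takes exactly two pointers")
        return ("span", parse_pointer(parts[0]), parse_pointer(parts[1]))
    return (name, data)
```

Because span() only delegates, an application that implements element() but not xpath() can still support spans, which is the separation of concerns the paragraph above argues for. The sketch ignores escaping of parentheses inside scheme data.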

Further details of this proposal are to be discussed on the list, and depend on the relationship with the ISO, since taking this path without the involvement of a formal standards group would be irresponsible.

Stand-off markup

Stand-off markup is the use of elements that point to their contents, rather than syntactically contain them. It is a required technique wherever a strictly hierarchical system of markup cannot adequately represent the structure of a text. This sort of markup is particularly common in databases representing the results of linguistic analyses, both because of the density of annotation and because of non-hierarchical phenomena such as ambiguity in parsing.

Currently, the group is of the opinion that XInclude provides adequate quoting support to enable stand-off markup to work well in the TEI. We intend to recommend appropriate practice in how to apply XInclude in TEI documents when stand-off markup is required. This continues our overall focus on reducing the amount of TEI-specific markup recommended in cases where other standards that solve the same problems already exist.

Another characteristic of XInclude is the fact that there are a number of implementations already available, thus making the tool situation more attractive than it would otherwise be. While complex standoff markup applications will still require specialized tools, the basic interpretation of stand-off markup to create views of consistent hierarchies in documents can be done with off-the-shelf software.

Validation

Validation creates some significant difficulties in conjunction with stand-off markup, and the use of XInclude does not solve these problems in itself. An unmodified validator may not be very useful when stand-off markup is in use, because many markup containment relationships are represented by pointers rather than by element containment. For the TEI, the issue is what form of validation to require of documents, some of which may use XInclude to create one or more virtual documents that may even share data. For example, in the case of a single file that contains a stand-off tagging of a base text encoded in XML, XML validation can sensibly be applied to at least three things:

The XML document that represents the base document could be validated according to its DTD (which might or might not be a TEI variant).
The XML document that represents the annotation could be validated. This document will contain XInclude tags, which must be made legal in the TEI (almost anywhere) if it is to be valid according to the TEI.
The XML document that represents the annotation could be validated after the expansion of the XInclude tags. In this case, the use of XInclude need not be part of the TEI DTD at all.

The working group is in agreement that TEI validation should be based on the last of these, validation after the expansion of XInclude tags. An issue that may need to be considered by the council is whether some conditions should be imposed so that a stand-off document must include more than the minimum TEI root element literally, rather than by inclusion.

The use of XInclude allows for this kind of expansion and validation. Because every stand-off stream needs to be expanded, validation may be expensive for some documents, if there are many stand-off streams for the same base document. Many existing schema processors have options to activate XInclude processing before schema validation. Since XInclude processing is an XML to XML transformation, it can practically be executed before validation, without additional specialized tools.
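
The expand-then-validate order can be sketched with Python's standard xml.etree.ElementInclude module. The document contents, the file name "base.xml", and the in-memory loader are assumptions for the example, and the final check is only a stand-in for a real schema validator. Note also that this module includes whole documents only; it does not interpret an xpointer attribute, which fuller stand-off use would need.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Hypothetical base text, kept in memory so the sketch is self-contained.
BASE_TEXT = "<s>Time flies like an arrow</s>"

# Hypothetical annotation document that quotes the base text by inclusion.
ANNOTATION = (
    '<TEI xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<text><body><xi:include href="base.xml"/></body></text>'
    "</TEI>"
)

def loader(href, parse, encoding=None):
    """Stand-in for fetching an included resource."""
    if href == "base.xml" and parse == "xml":
        return ET.fromstring(BASE_TEXT)
    raise OSError(f"cannot load {href!r}")

def expand(source):
    """Parse a document and expand its XInclude elements in place."""
    root = ET.fromstring(source)
    ElementInclude.include(root, loader)
    return root

def is_fully_expanded(root):
    """True if no XInclude elements remain, so ordinary validation can run."""
    return not any(e.tag == XI_INCLUDE for e in root.iter())
```

After expand(), the tree contains no XInclude elements, so a schema for the expanded form does not need to declare them anywhere.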

One can also imagine conditions beyond XML/SGML validity that might be worth checking in such documents -- overlap constraints between different views of a document, for instance. Custom software will still be required to check such conditions, as XInclude does not provide these sorts of facilities. This is not a change from the current situation.