Basic working decisions on pointing and linking
TEI SO W 09
28 Sep 2004

Introduction

This document defines the linking methods to be used in TEI P5, including in the creation of files that conform to the ODD tag-set. It addresses in particular the name of the linking attribute and the handling of IDREFS attributes, as requested by Council at its meeting of Fri 14 May.

Pointers and links in the TEI

XML documents represent hierarchical markup structures with elements all of whose contents are contiguous. As not all phenomena that the TEI Guidelines represent conform to that model, cross-references between XML elements are used to represent some textual and editorial structures. For purposes of the TEI we distinguish two applications of pointing between TEI elements. One is called linking. This is the now well-known notion of the user-selectable hypertext link, whose predecessors are the cross-references in print or manuscript codices. Links are created by authors or editors to indicate possible navigational options to a reader of an encoded text. Because a link refers to another document or another portion of the document within which it resides, a link must indicate at least one address, the location of another relevant textual object.

An example of a typical link is the use of a <ref> element to point to a cited portion of a document.
<ref target="#chap2">See chapter two</ref>

Cross-references and other explicitly authored hypertext links are not the only uses of addresses needed in TEI text encoding, however. Some XML cross-references are used to construct representations of document structures (like the continuations of elements interrupted by other elements) that do not fit the XML model of contiguous hierarchical containment. In P5, all such addresses, which in P4 are represented using either ID/IDREF attributes or TEI Extended Pointers, will be represented by the use of the standard URI reference mechanism used on the world wide web. (Note that use of the part attribute for this purpose is not affected, as it does not use a pointer.)

One typical example of the use of addresses to represent a phenomenon that is not a link is the use of the prev and next attributes to represent discontinuous elements:
<sp xml:id="king1" next="#king2" who="#King">Speak, knave or</sp>
<sp who="#Fool">What would you have me say?</sp>
<sp xml:id="king2" prev="#king1" who="#King">&mdash; or lose your
  tongue, but if it be not civil it will be forfeit in any case</sp>
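
The way a processor might reassemble such a discontinuous element can be sketched in Python. This is only an illustration, not a TEI-defined algorithm: the function name join_fragments, the wrapping <div>, and the shortened sample text are assumptions, and only same-document bare-name references are followed.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, hypothetical version of the example above, wrapped in a <div>.
SAMPLE = """<div>
<sp xml:id="king1" next="#king2" who="#King">Speak, knave or</sp>
<sp who="#Fool">What would you have me say?</sp>
<sp xml:id="king2" prev="#king1" who="#King">or lose your tongue</sp>
</div>"""


def join_fragments(root, first_id):
    """Follow next="#id" pointers from first_id and return the joined text."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    parts = []
    seen = set()  # guards against circular next chains
    current = by_id.get(first_id)
    while current is not None and current.get(XML_ID) not in seen:
        seen.add(current.get(XML_ID))
        parts.append("".join(current.itertext()).strip())
        nxt = current.get("next", "")
        # Only same-document bare-name references are handled in this sketch.
        current = by_id.get(nxt[1:]) if nxt.startswith("#") else None
    return " ".join(parts)


speech = join_fragments(ET.fromstring(SAMPLE), "king1")
print(speech)
```

The Fool's speech is skipped because nothing in the chain points to it; the two King fragments are joined in the order given by next.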

Because of the wide acceptance of URI-based addressing in all sorts of software systems and the weakening of support for the traditional ID/IDREF mechanism in XML schema languages, all TEI addresses use URI references as defined in the URI specification (RFC2396). URI references are the same mechanism used by HTML to create hypertext links, and are capable of addressing single XML resources identified by a URI. In addition to identifying a resource, a URI reference can indicate a sub-portion of that resource by means of an appropriate fragment identifier (the ‘Fragment-ID’: the portion of a URI reference following the first unescaped ‘#’ character). The details of the interpretation of the Fragment-ID depend on the Internet MIME-type of the resource identified by the URI.
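
As a rough illustration of this split between resource and Fragment-ID, the following Python sketch uses the standard library's urldefrag. The function name split_reference and the three labels it returns are assumptions made for the example; treating any fragment that contains '(' as scheme-based is a simplification.

```python
from urllib.parse import urldefrag


def split_reference(uri_ref):
    """Return (resource, fragment, kind) for a URI reference."""
    resource, fragment = urldefrag(uri_ref)
    if not fragment:
        kind = "whole resource"
    elif "(" in fragment:
        # Simplification: a parenthesis is taken to signal a pointer scheme.
        kind = "scheme-based pointer"
    else:
        kind = "bare name"
    return resource, fragment, kind


print(split_reference("#chap2"))
print(split_reference("http://foo.org/foo.xml#xpath(//p[5])"))
```

An empty resource part, as in the first call, means the reference is to be resolved relative to the document in which it occurs.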

The W3C has defined an extensible syntax for addressing within XML content, called the XPointer Framework. Within this framework, multiple addressing methods are distinguished by a named XPointer Scheme. The W3C has defined a useful but limited set of schemes for pointing within XML documents.

The TEI will define a MIME-Type for TEI documents. (The actual type will be determined during the application process, but is likely to be something like application/tei+xml.) In addition to providing useful protocol support for the publication of TEI documents on the web, the TEI Consortium is then free to define rules for the interpretation of the Fragment-ID portion of a URI reference to a TEI document. The TEI will use the W3C XPointer framework for its Fragment-ID syntax. The ability to define rules for xml/TEI documents allows the TEI to adopt the W3C's XPointer Framework, but to simultaneously use the syntax extension mechanisms of the framework to support additional pointing schemes which do not already exist, but are needed for TEI applications — for instance, arbitrary ranges within text, location by means of pointing languages like XPath, and perhaps location by means of regular expressions at some point in the future.

Each pointer scheme used in a Fragment-ID is identified by a scheme name. A single Fragment-ID may contain several alternative ways of addressing a sub-part of an XML resource. These variants are interpreted in order, and applications are free to ignore unknown schemes. Alternative (possibly less precise) location methods can therefore be given in any address, so that applications unable to interpret a given scheme (like a TEI extension) can use other means to resolve it. The TEI accepts the W3C's recommendations, as issued in the three XPointer Recommendations (the XPointer Framework, element(), and xmlns()), and the XLink recommendation.
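
The in-order interpretation of alternative pointer parts might be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the circumflex escaping of parentheses defined by the XPointer Framework is not handled, and "supported" is simply a set of scheme names the application claims to understand.

```python
def split_pointer_parts(fragment):
    """Split 'a(...) b(...)' into [(scheme, data), ...], honouring nested parentheses."""
    parts = []
    i, n = 0, len(fragment)
    while i < n:
        if fragment[i].isspace():
            i += 1
            continue
        open_at = fragment.find("(", i)
        if open_at == -1:
            raise ValueError("expected 'scheme(...)' at position %d" % i)
        depth, j = 1, open_at + 1
        while j < n and depth:
            if fragment[j] == "(":
                depth += 1
            elif fragment[j] == ")":
                depth -= 1
            j += 1
        if depth:
            raise ValueError("unbalanced parentheses")
        parts.append((fragment[i:open_at], fragment[open_at + 1:j - 1]))
        i = j
    return parts


def first_supported(fragment, supported):
    """Return the first (scheme, data) part whose scheme the application knows."""
    for scheme, data in split_pointer_parts(fragment):
        if scheme in supported:
            return scheme, data
    return None


fragment = "string-range(xpath(//div[2]/p[4]),20) element(/1/2/4)"
print(first_supported(fragment, {"element"}))
```

An application that knows only element() skips the string-range() part and falls back to the less precise address of the whole paragraph.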

Adoption of the URI Reference and its associated infrastructure offers significant advantages in terms of implementability, since many tools already exist for managing and using URIs. There are some implications for TEI markup as well.

Two critical W3C constructs will be adopted by the TEI. The xml:base attribute defined by the W3C (in XML Base) allows encoders to control the interpretation of relative URIs. This attribute will be available to TEI encoders globally.
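
The effect of xml:base on relative references can be illustrated with a short Python sketch. The document URI, the identifier SAXR, and the function name resolved_targets are hypothetical; the sketch assumes that each xml:base value is itself resolved against the base inherited from the parent element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical fragment: the first <div> overrides the base URI.
SAMPLE = """<text>
<div xml:base="../CO/"><ptr target="co.odd#COXR"/></div>
<div><ptr target="#SAXR"/></div>
</text>"""


def resolved_targets(element, base):
    """Yield (target, absolute URI) pairs, applying xml:base from ancestors."""
    base = urljoin(base, element.get(XML_BASE, ""))
    target = element.get("target")
    if target is not None:
        yield target, urljoin(base, target)
    for child in element:
        yield from resolved_targets(child, base)


pairs = list(resolved_targets(ET.fromstring(SAMPLE), "http://example.org/P5/SA/sa.odd"))
for target, absolute in pairs:
    print(target, "->", absolute)
```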

The other construct is the xml:id attribute currently being defined by the W3C (in xml:id). This global attribute allows schema-language-independent marking of XML ID attributes in the document instance. Since the simplest form of XPointer allowed in the XPointer framework is a bare name identifying an ID attribute in the XML file, all the capabilities of the current ID/IDREF-based pointing method will be supported, albeit with syntactic differences. Users who wish to retain the ID/IDREF mechanism used in P4 will have to use the TEI's customization mechanism to redefine the Address class. The TEI-C might provide a ready-made customization module for this purpose (perhaps easily selectable from Roma).

The use of URI references (with XPointer schemes in fragment identifiers) has several advantages:

Changes to Attributes

This section discusses the intended changes to the main pointing attributes of <ptr> and <ref> for P5, and reviews the rationale behind these changes.

The global xml:base attribute will support encoder control of the interpretation of URIs within documents. It will be added to tei.global.attributes.

The current global id attribute will be replaced by the xml:id attribute.

Two new data-types, tei.pointer and tei.pointers, will be created. The first is defined as xsd:anyURI, the latter as a list of them, something like list { xsd:anyURI+ } in the compact syntax, or the following in the XML syntax:
<list>
  <oneOrMore>
    <data type="anyURI"/>
  </oneOrMore>
</list>

Attributes that are declared in P4 as IDREF will become tei.pointer; those declared in P4 as IDREFS will become tei.pointers.

A new attribute, cref, will be added to <ptr> and <ref>. This attribute and the target attribute are mutually exclusive; that is, specifying both is an error. The value of cref is a canonical reference, which is to be transformed into an XPointer by applying the TEI algorithm for canonical references. (See SO W 08, currently being updated.)

The resp, crdate, targType, and targOrder attributes will be removed from the a.pointer class.

Rationale for Dropping resp, crdate, targType, and targOrder

resp and crdate in P4 provide a rudimentary way of assigning some metadata to the pointing element. We believe these should be dropped in the interests of keeping things as simple as possible, and under the ‘do it right or don't bother to do it at all’ doctrine. Perhaps the TEI would want to develop a ‘metadata for complicated encoding’ module sometime in the future.

targType provides a simple kind of pointer validation capability. We are not at all sure that anyone uses it in any meaningful way, especially since it is so easy to imagine situations where the little constraint it provides is not at all useful (e.g., it cannot say ‘should point to an element with TEIForm='div'’; and in many linguistic applications almost any element it would point at would be <seg> anyway).

targOrder specifies whether the order in which multiple targets are given is significant. This typically depends on the type of the link, and need not be separately specified. The significance of such orderings may itself differ in different uses of the same document. For instance, textual applications like parsers or speech synthesis may need to process links in a particular order, while information-retrieval or archiving applications could use any convenient ordering.

On Attribute Name

Before Council's Fri 14 May meeting in Ghent, the work group had not come to any consensus on the name of the pointing attribute known as target in P4. Since then, we have settled on leaving the attribute name as it is, target. The main competitor name was href, chosen to match the name used by HTML. This has since been rejected once it was realized that in TEI this attribute can point to multiple places (i.e., is of type tei.pointers), unlike the HTML attribute which can only point to one element.

Simple use of XPointer schemes

The simplest XPointer fragment points to an element labeled with an ID attribute. Because of the popularity of HTML, the syntax is defined so that a Fragment-ID without an explicit pointer scheme is interpreted as an IDREF. Thus, in the example above, each of the ID references is preceded by a '#' character. While these look like IDREFs with an extra character in front, they are actually relative URIs with empty paths, and thus indicate the base URI of the XML file within which they are contained.

When dealing with a document that is stored in multiple parts, best practice in pointing dictates that the relative links used to refer to IDs should include path information for all IDs stored in a separate file. In this way addresses will be correctly resolved when parsing and processing a single resource within a large document stored in several parts. For some large documents, relative URIs may indicate locations within several directories of a file system. While there are many management issues implicit in this practice, they are well addressed in many works on the management of HTML linking.

In the current draft of P5, all addresses refer by means of XML IDs, making the processing of a single chapter at a time difficult. To make chapter-by-chapter processing of the Guidelines easy, current IDREFs to a chapter of the Guidelines (like the chapter on linking, referred to as SA) would be modified if they appeared in another file of the Guidelines, so that:
<p>In section <ptr target="COXR"/> we introduced the simplest
  pointer elements, <gi>ptr</gi> and <gi>ref</gi>.</p>
would be replaced by a reference such as
<p>In section <ptr target="../CO/co.odd#COXR"/> we introduced the simplest
  pointer elements, <gi>ptr</gi> and <gi>ref</gi>.</p>
A reference like this can be resolved appropriately, regardless of whether or not the entire P5 document is being processed.
When a reference is made to an ID within the same file, the use of a relative URI specifying a filename is not necessary. So within the file co.odd itself the reference given in the examples in the previous paragraph would look like:
<p>In section <ptr target="#COXR"/> we introduced the simplest
  pointer elements, <gi>ptr</gi> and <gi>ref</gi>.</p>

This standard practice corresponds to that used in HTML, which also makes the linking and addressing features of the TEI familiar and easy to understand. Within large documents like the Guidelines themselves, the use of relative URIs in this way also makes it easier for authors and maintainers of the documents to resolve an address without searching or processing the entire document.

Replacing IDREFS

At Council's Fri 14 May meeting in Ghent concern was expressed over how to handle attributes that are declared as IDREFS in P4. Council was not enthusiastic about the WG recommendation of using curly-brace delimited URIs in an attribute value. Council asked that the SO WG consider the issues more closely, and identified three possible paths:
  • keep IDREFS
  • use child elements
  • separate URIs in a single attribute value
To this list the editors added the possibility of using XIndirect. Here is a summary of the WG's thoughts on this matter.

The WG feels that keeping IDREFS is, simply put, a bad idea. There seems to be little point in having one part of the TEI (that which in P4 uses IDREFs) use XPointers and another part (that which in P4 uses IDREFS) use an entirely different, older mechanism, even though RELAX NG has a compatibility mode. First of all, users would find it at least annoying, if not confusing. Secondly, different kinds of software would be needed to process similar kinds of links.

Using child elements to replace IDREFS attributes may make good sense in certain specific cases, particularly in those cases where a new element is being invented for P5 anyway; however, trying to change the 20 attributes in P4 that use IDREFS (5 of which are global) would be an enormous task that would change the very nature of the TEI encoding scheme, without much obvious gain.

Using multiple URIs separated by whitespace in a single attribute value seems to make the most sense. In all cases currently encoded in P4, and in lots of future cases, the URIs will in fact be nothing more than simple name fragment-IDs, i.e. things like #duck #quack #foo #bar. These are as easy to parse and process as IDREFS are. Even many URIs that point outside the current document will be as simple as doc2#foo doc4#bar, etc. URIs with internal whitespace require escaping. For projects where many complex URIs are needed, and space-separated attributes would be inappropriate, an extension using elements may be used.
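
How little processing a whitespace-separated value needs can be seen in the following Python sketch. The function name parse_pointers is an assumption, and the sketch does not validate each token as an xsd:anyURI; it only splits the value and separates each resource from its fragment.

```python
from urllib.parse import urldefrag


def parse_pointers(value):
    """Split a tei.pointers-style attribute value into (resource, fragment) pairs."""
    tokens = value.split()  # any run of XML whitespace separates two references
    if not tokens:
        raise ValueError("at least one URI reference is required")
    return [tuple(urldefrag(token)) for token in tokens]


print(parse_pointers("#duck #quack doc2#foo doc4#bar"))
```

Because a literal space always ends a token, a URI that contains whitespace has to be written with that whitespace escaped (for instance as %20).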

Proposed additions to W3C Pointing schemes

Overview

There are several features of TEI Extended Pointers that are not supported by the W3C's XPointer schemes. Currently the W3C has defined 3 pointer schemes: bare names (strings of name characters following the ‘#’, as in HTML), element() (which provides abbreviated pointing by means of child numbers), and xmlns() (which is used to declare namespaces for user-extended pointer schemes). While there is a W3C working draft of a more complete pointer scheme, supporting many but not all of the features of the TEI Extended Pointer system, there is no current or scheduled activity towards revising this draft or issuing it as a recommendation. Given a TEI MIME-Type, the TEI can define any additional addressing schemes needed, while maintaining full standards conformance with W3C recommendations and Internet RFCs.

To match the features of TEI extended pointers, the TEI will define seven new XPointer schemes: xpath(), xpath2(), range(), string-range(), match(), left(), and right(). These schemes overlap in functionality with the W3C's xpointer() scheme draft, but are individually much simpler. Each new scheme is either a completely new facility, or a reference to an existing standard which is adopted without modification.

The new TEI pointer schemes extend the data model of link endpoints beyond the XPath concept of the node. Since XPath provides a very useful way to address nodes and node-sets, we preserve those abilities by incorporating XPath as a scheme. Since spans and character ranges are needed data types for linking, those data types will be returned by new pointer schemes. Since the new forms of addressing also require the selection of nodes in exactly the manner that XPath already allows, it does not make sense to duplicate the separate addressing functions already provided by the element() and proposed xpath() pointing schemes.

Therefore, the new schemes are defined simply, but with the ability to recursively use any other XPointer scheme as an argument. Not only does this separate support for spans from support for other types of addressing, but it enables applications to support pointer schemes like element() + range(), if full XPath addressing is not needed, even though spans are.
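
The recursive structure can be made concrete with a small Python parser sketch. It is an assumption-laden illustration rather than a normative grammar: the set RECURSIVE lists the schemes this document describes as taking pointers as arguments, the data of every other scheme is kept as an opaque string, and string arguments that themselves contain commas or parentheses are not handled.

```python
# Schemes whose arguments may themselves be pointers (per this document).
RECURSIVE = {"left", "right", "range", "string-range", "match"}


def split_args(data):
    """Split scheme data at top-level commas, ignoring commas inside parentheses."""
    args, depth, start = [], 0, 0
    for i, ch in enumerate(data):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(data[start:i].strip())
            start = i + 1
    args.append(data[start:].strip())
    return args


def parse_pointer(text):
    """Return a nested (scheme, arguments) tree; literals and bare names stay strings."""
    text = text.strip()
    open_at = text.find("(")
    if open_at == -1 or not text.endswith(")"):
        return text  # bare name, number, or other literal
    scheme, data = text[:open_at], text[open_at + 1:-1]
    if scheme in RECURSIVE:
        return (scheme, [parse_pointer(arg) for arg in split_args(data)])
    return (scheme, data)  # e.g. xpath() or element(): data is left opaque


tree = parse_pointer("range(element(/1/2/3),xpath(/doc/chapter[4]))")
print(tree)
```

An application that supports only element() and range() can inspect such a tree and reject it as soon as it meets a scheme it does not implement.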

xpath(path)

The xpath() scheme locates a node within an XML Information Set. The single argument path is an XPath path as defined in the W3C XPath 1 Recommendation. The node resulting from evaluating the XPath is the reference of an address using the xpath() scheme.

xpath2(path)

The xpath2() scheme locates a node within an XML Information Set. The single argument path is an XPath 2.0 path as defined in the W3C XPath 2 Recommendation. The node resulting from evaluating the XPath is the reference of an address using the xpath2() scheme.

left(pointer) and right(pointer)

The left() scheme locates the point immediately preceding its argument; the right() scheme locates the point immediately following it. The pointer argument to left() or right() is a bare name or an XPointer pointer scheme, and is resolved according to its normal rules (see note 1). Because most pointer schemes return nodes or ranges rather than points, the following description lists the behavior of left() and right() for all three types of possible result.
  • A node: When pointer resolves to a node, the point designated is the point immediately preceding (left()) or following (right()) the node.
  • A range: When pointer resolves to a range, the point designated is the start (left()) or end (right()) of the range.
  • A point: When pointer resolves to a point, that point is the result. The pointer schemes left() and right() make no change when given a point as argument.

range(pointer1, pointer2)

The range() scheme locates a range between two locations in an XML information set. The two pointer arguments to range() locate the boundaries of the range by two points. The parameters pointer1 and pointer2 are XPointers themselves, and are resolved according to the rules specified in the definition of the pointer scheme they use (see note 2). Because most pointer schemes return nodes or ranges rather than points, the following description lists the behavior of range() for all three types of possible result.
  • A node: When pointer1 resolves to a node, the starting point of the range is the point immediately preceding the node. When pointer2 resolves to a node, the ending point of the range is the point immediately following the node. It is an error if the ending point precedes the starting point of a range.
  • A range: When pointer1 resolves to a range R, the starting point of the result range is the same as the starting point of R. When pointer2 resolves to a range R, the ending point of the result range is the ending point of R.
  • A point: When pointer1 resolves to a point, that point is the start of the range. When pointer2 resolves to a point, that point is the end of the range.
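
The case analysis for left(), right(), and range() can be summarised in a short Python sketch. The data model here is an assumption made for illustration: a point is an integer offset in document order, and a node or range is reduced to a Span holding the points at which it starts and ends. The names Span and make_range are hypothetical.

```python
from typing import NamedTuple, Union


class Span(NamedTuple):
    """A node or range, reduced to the points at which it starts and ends."""
    start: int
    end: int


Location = Union[int, Span]  # a bare int stands for a single point


def left(loc: Location) -> int:
    """Point preceding a node, start of a range, or the point itself."""
    return loc if isinstance(loc, int) else loc.start


def right(loc: Location) -> int:
    """Point following a node, end of a range, or the point itself."""
    return loc if isinstance(loc, int) else loc.end


def make_range(loc1: Location, loc2: Location) -> Span:
    """Range from the left edge of loc1 to the right edge of loc2."""
    start, end = left(loc1), right(loc2)
    if end < start:
        raise ValueError("the ending point precedes the starting point")
    return Span(start, end)


print(make_range(Span(3, 9), Span(12, 20)))
```

Under this model range(pointer1, pointer2) is simply the span from left(pointer1) to right(pointer2), which is why the three cases listed for range() mirror those listed for left() and right().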

string-range(pointer, offset, [length])

The string-range() scheme locates a range based on character positions. While string-range endpoints are points adjacent to character positions, they must be designated by the characters to which they are adjacent, in the same way that the nodes corresponding to XML elements are. This avoids ambiguity about which point between two characters is indicated when characters are interrupted by markup.

The pointer argument to string-range() designates a node or a range within which a string is to be located. No string range, even an empty one, can be defined by a string-range() if pointer has the empty string as string value. Every string-range is defined based on an ‘origin character’. The origin is numbered 0, and designates the first character of the string-value of pointer. The offset is a character index relative to the origin; the start of the resulting range is the position designated by the sum of the origin and offset.

If length is specified, the end of the range is at a point adjacent to the character designated by the origin added to length. If the offset is negative, or length is sufficiently large, a string-range can designate characters outside the string-value of the initial pointer. In this case, characters are located using the string-value of the entire document. It is also legal for length plus the origin to exceed the length of the string-value of the document by one, in order to accommodate ranges that include the last character of a document.

If length is not specified, it defaults to the value 1, and the string range contains one character. If it is specified as 0, the zero-length range is interpreted as the point immediately preceding the origin character.
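
One possible reading of these rules is sketched below in Python. Several things are assumptions rather than part of the definition: positions are expressed as gaps between characters in the string-value of the whole document, the pointer is described by the index of its first character (origin) and the length of its string-value (pointer_len), and the end of the range is taken to be the start plus length.

```python
def string_range(doc_text, origin, pointer_len, offset, length=1):
    """Return (start, end) gap positions in doc_text for a string-range.

    origin is the index in doc_text of the first character of the pointer's
    string-value; pointer_len is the length of that string-value.
    """
    if pointer_len == 0:
        raise ValueError("pointer has an empty string-value")
    if length < 0:
        raise ValueError("length must not be negative")
    start = origin + offset
    end = start + length  # assumed reading: the range covers `length` characters
    if start < 0 or end > len(doc_text):
        raise ValueError("range falls outside the document")
    return start, end


doc = "Speak, knave or lose your tongue"
start, end = string_range(doc, 7, 5, 0, 5)  # the pointer covers "knave"
print(doc[start:end])
```

A negative offset reaches back into the document before the pointer: with the same pointer, an offset of -7 and a length of 5 selects "Speak". A length of 0 yields start equal to end, that is, a single point.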

match(pointer, string[, index])

The match scheme designates the result of a literal match of the argument string within the string-value of the pointer argument. The result is a range from the first matching character to the last. It is an error if there is no matching string. A match may not extend outside the range corresponding to the string value of pointer.

The index argument is an integer greater than or equal to 1, specifying which match should be chosen when there is more than one match within the string-value of pointer. If no index is provided, the default value is 1, indicating the first match found.
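
A literal match of this kind might be sketched in Python as follows. The sketch assumes 1-origin counting of matches (following the statement that index is greater than or equal to 1), counts overlapping matches separately, and returns gap positions within the string-value of pointer; none of these choices is fixed by the text above.

```python
def match(text, needle, index=1):
    """Return (start, end) of the index-th literal match of needle in text."""
    if index < 1:
        raise ValueError("index must be greater than or equal to 1")
    if not needle:
        raise ValueError("the string to match must not be empty")
    pos = -1
    for _ in range(index):
        pos = text.find(needle, pos + 1)  # overlapping matches are counted
        if pos == -1:
            raise ValueError("no matching string")
    return pos, pos + len(needle)


print(match("to be or not to be", "be", 2))
```

Because the search runs only over the string-value passed in, a match can never extend outside the range corresponding to pointer.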

Question: should we use 1-origin addressing here? We use 0-origin addressing everywhere else, because it's easier to understand when addressing gaps in strings. It's less natural when counting matches. My user knowledge tells me to be inconsistent here, but my mathematical side rebels. This is not really a technical issue but rather a user-preference one, and so I'd love some guidance.

Examples & discussion

  • A pointer to an XML document stored at the TEI consortium web server (currently the list of errors reported and fixed in P4):
    http://www.tei-c.org/Drafts/edw77.xml
    This points to the entire XML document.
  • A pointer to (the HTML version of) this example:
    http://www.tei-c.org/Activities/SO/sow09.html#sample-example
  • A pointer to (the XML version of) the previous example, when the document containing the pointer shares the same URI path as the designated document:
    ./sow09.xml#sample-example
    or
    sow09.xml#sample-example
  • A simple XPath pointer to an anchor that is a whole node:
    http://foo.org/foo.xml#xpath(//p[5])
    This selects the 5th paragraph of http://foo.org/foo.xml (strictly, every <p> that is the fifth <p> child of its parent).
  • The following selects the character at offset 20 (the 21st character, since offsets count from 0) of the fourth paragraph of the second div of the XML document found at http://foo.org/foo.xml:
    http://foo.org/foo.xml#string-range(xpath(//div[2]/p[4]),20)
  • This could also be used with the element() scheme:
    http://foo.org/foo.xml#string-range(element(/2/2/4), 20)
    The second pointer will pick the same character of the indicated paragraph.
If you just want a range defined by nodes rather than characters you can do that in a few ways:
http://foo.org/foo.xml#range(element(/1/2/3),element(/1/4))
alternatively:
http://foo.org/foo.xml#range(element(/1/2/3),xpath(/doc/chapter[4]))
Notes
1. Bare names (xml:id references) are permitted as arguments to all TEI-defined XPointer pointer scheme parameters.
2. Bare names (xml:id references) are permitted as range() scheme parameters.

Last recorded change to this page: 2007-09-16  •  For corrections or updates, contact webmaster AT tei-c DOT org