TEI Extended Pointers: a brief tutorial


Lou Burnard

4 February 1997



This is a background document prepared originally for the information of the W3C XML working group. It provides an informal introduction to the methods proposed in the Text Encoding Initiative's published Guidelines (TEI P3) for the representation of inter- and intra-document links. Authoritative and complete information is provided in that work; in case of conflict between this document and TEI P3, this document is wrong.

Linking behaviour and semantics are proposed for a variety of elements in the Guidelines: these include independent links (<link>), alternative readings (<alt>), aggregations (<join>) and conventional footnotes (<note>), as well as the generic pointer elements discussed here. A brief tutorial introduction is given below. For completeness, this is followed by a brief discussion of some of the other linking mechanisms.

1 Extended Pointers

The TEI scheme defines two generic pointer elements which support both inter- and intra-document linkage: <xptr> and <xref>. The only difference between them is that the former is empty, while the latter can contain phrase-level elements or PCDATA. The content of an <xref> is typically a string indicating how the link is to be rendered at the source end.

These elements share the following attributes, which are used to specify the target of the cross reference or link:

doc
specifies the name of an entity (typically, but not exclusively, another SGML document) within which the required location is to be found, by default the current document.
from
specifies the start of the destination of the pointer as an expression in the TEI extended pointer syntax, by default the whole of the document indicated by the doc attribute.
to
specifies the endpoint of the destination of the pointer as an expression in the TEI extended pointer syntax, by default the same as the destination indicated by the from attribute.

An <xptr> (or <xref>) may point to the whole of some other entity simply by supplying its name as the value of the doc attribute:

 
<!ENTITY TEIP3 SYSTEM "http://elib.virginia.edu/TEI/">
<!-- ... -->
see <xref doc="TEIP3">The TEI Guidelines, passim</xref>

This example assumes that some system or public entity with the name TEIP3 has been declared.

The from attribute is used to specify a location within the document specified by the doc attribute, using the TEI extended pointer syntax. In this language, locations are defined as a series of steps, each one identifying some part of the document, often with respect to locations identified in a previous step. For example, you would point to the third sentence of the fourth paragraph of chapter two by selecting chapter two in the first step, the fourth paragraph in the second step, and the third sentence in the last step. A step can be defined in terms of the SGML tree (using such keywords as parent, descendant, preceding, etc.) or, more loosely, in terms of text patterns, word or character positions. You can also use a foreign (non-SGML) notation, or specify a location within a graphic in terms of its co-ordinate system.

The from and to attributes use the same notation. Each points to a location within the target document; the target of the extended pointer is the whole sequence beginning at the start of the location indicated by the from attribute, and running to the end of the location indicated by the to attribute.

The first step in a location path will often be to specify the identifier of some element within the target document, as in this example:

 
<xptr doc="TEIP3" from="id (SA)">
This selects the whole of whatever element bears the identifier SA within the entity TEIP3. If a finer-grained target is required, other steps might follow. The following keywords are available for you to specify other locations in terms of their relationship to this one:
child
elements (or pseudo-elements) contained by this one.
ancestor
elements (or pseudo-elements) which contain this one, directly or indirectly.
previous
elements (or pseudo-elements) with the same parent as this one but preceding it in the document.
next
elements (or pseudo-elements) with the same parent as this one and following it in the document.
preceding
elements (or pseudo-elements) in the document which occur before this one does, irrespective of their parents.
following
elements (or pseudo-elements) in the document which occur after this one does, irrespective of their parents.
In the above definitions and elsewhere, preceding and synonymous terms are to be understood as implying elements which would be encountered earlier when the document is processed correctly from beginning to end. The term pseudo-element is used for any string of PCDATA content occurring between SGML tags, which is not itself a complete SGML element, but forms part of one. The <p> element in the following example
 
<p>See <xref doc="TEIP3">The TEI Guidelines, passim</xref>
for a full discussion</p>
has three children: the second is an element (the <xref>) while the first and last are pseudo-elements (the pieces of content data containing the words see and for a full discussion respectively).

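Each of the above keywords implies a particular set of elements or pseudo-elements (the set of children, the set of ancestors, the set of previous siblings, etc.); to specify which of them you are pointing at, the keyword may optionally be followed by a parenthesized list containing: a number (or the keyword ALL) indicating which member or members of the set are required; optionally, the name of an element type to which the selection is restricted; and, optionally, one or more attribute name and value pairs which the selected element must bear.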

Continuing the above example, the following reference will select the third <p> element directly contained by whatever element has the identifier SA:

 
<xptr doc="TEIP3" from="id (SA) child (3 p)">

Note the difference between this and

 
<xptr doc="TEIP3" from="id (SA) child (3)">
which selects the third child of the element bearing the identifier SA, whatever it may be. If entity TEIP3 contained the following text:
 
<div id="SA"><head>Linking and Alignment</head>
<p id="Para1">Text of paragraph 1. </>
<p id="Para2">Text of paragraph <num>2</num>, which is rather short.</p>
<p id="Para3">Text of paragraph <num>3</num>, which is also rather short.</p></div>
the above <xptr> would reference the second paragraph above, because of the <head> element which is also a child element. Similarly, the following <xptr>
 
<xptr doc="TEIP3" from="id (Para3) child (3)">
points to the pseudo-element , which is also rather short. within the element with identifier Para3.
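
The child selections discussed above can be checked with a small Python sketch. It is an illustration only, not part of the Guidelines: it assumes the standard xml.etree.ElementTree parser, a copy of the sample in which the SGML short end-tag is written out as </p>, and two hypothetical helpers named children and child. Strings consisting only of white space between tags are ignored, an assumption made here so that the counts agree with the discussion above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div id="SA"><head>Linking and Alignment</head>
<p id="Para1">Text of paragraph 1. </p>
<p id="Para2">Text of paragraph <num>2</num>, which is rather short.</p>
<p id="Para3">Text of paragraph <num>3</num>, which is also rather short.</p></div>"""


def children(elem):
    """Return child elements and pseudo-elements (strings) in document order."""
    out = []
    if elem.text and elem.text.strip():
        out.append(elem.text)
    for sub in elem:
        out.append(sub)
        # Text following a child element is a pseudo-element of the parent.
        if sub.tail and sub.tail.strip():
            out.append(sub.tail)
    return out


def child(elem, n, gi=None):
    """Hypothetical helper for 'child (n)' and 'child (n gi)'; n counts from 1."""
    candidates = children(elem)
    if gi is not None:
        candidates = [c for c in candidates
                      if not isinstance(c, str) and c.tag == gi]
    return candidates[n - 1]


root = ET.fromstring(SAMPLE)
by_id = {e.get("id"): e for e in root.iter() if e.get("id")}

third_p = child(by_id["SA"], 3, "p")        # id (SA) child (3 p)
third_any = child(by_id["SA"], 3)           # id (SA) child (3)
third_in_para3 = child(by_id["Para3"], 3)   # id (Para3) child (3)
```

Here third_p is the paragraph with identifier Para3, third_any is the paragraph with identifier Para2 (because the <head> counts as the first child), and third_in_para3 is a string, i.e. a pseudo-element.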

Rungs of the same or different kinds can be combined as required. Assuming for example that the entity TEIP3 is in fact a reference to the SGML form of the TEI Guidelines, then the following reference will select section 14.2.2 of that publication in which (as it happens) the extended pointer syntax is formally defined:

 
For full details, see
<xref doc="TEIP3" from="id (SA) child (2 div2) child (2 div3)">
  TEI Extended pointer syntax definition
</xref>

Complex specifications are easily built using this syntax. For example, the following reference will select the most recent <head> element which carries an attribute lang with the value LAT, and which occurs before the start of the element with identifier SA:

 
<xptr doc="TEIP3" from="id (SA) preceding (1 head lang lat)">
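
As a rough illustration of how such a ladder breaks down into rungs, the following Python sketch splits the value of a from attribute into keywords and their parenthesized steps. The function name parse_ladder is hypothetical, and the sketch assumes that no step contains nested parentheses; it does not validate keywords or resolve anything against a document.

```python
import re

# One rung: a keyword followed by zero or more parenthesized steps.
# Assumption: steps contain no nested parentheses.
RUNG = re.compile(r"\s*([A-Za-z]+)((?:\s*\([^()]*\))*)")
STEP = re.compile(r"\(([^()]*)\)")


def parse_ladder(expr):
    """Split a location ladder into (KEYWORD, [step, ...]) pairs."""
    rungs = []
    pos = 0
    expr = expr.strip()
    while pos < len(expr):
        m = RUNG.match(expr, pos)
        if m is None:
            raise ValueError(f"cannot parse ladder at offset {pos}: {expr[pos:]!r}")
        keyword = m.group(1).upper()  # keywords may be given in either case
        steps = [s.split() for s in STEP.findall(m.group(2))]
        rungs.append((keyword, steps))
        pos = m.end()
    return rungs


ladder = parse_ladder("id (SA) preceding (1 head lang lat)")
```

For the pointer above, ladder holds two rungs: ID with the single step SA, and PRECEDING with the step 1 head lang lat split into its four parts.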

You can define the target of a link with respect to the location of the link itself, rather than with respect to the root of the document, by using the keyword HERE. For example,

 
<xptr from="HERE ancestor (1)"> 
points to the parent of the element within which it appears;
 
<xptr from="HERE ancestor (2)"> 
points to the grandparent of the element within which it appears and so on. As this example also shows, when no value is supplied for the doc attribute, the current document is assumed. The HERE keyword makes no sense except as the first rung of a location ladder.

2 Locating parts of element content

The TEI extended pointer syntax is most reliably used to locate particular SGML element (or pseudo-element) occurrences. In the TEI scheme any SGML element can bear an ID attribute, which (together with the tree location methods described above) means that this is less of a restriction than it might appear.

Sometimes however the target of a cross reference does not correspond with any particular feature of a text, and so may not be tagged as an element, or its position within the SGML document tree is not reliably known. If the desired target is simply a point in the current document, the easiest way to mark it is by introducing an <anchor> element at the appropriate spot. If the target is some sequence of words not otherwise tagged, the <seg> element may be introduced to mark them.

In the following (imaginary) example, <xref> elements have been used to represent points in this text which are to be linked in some way to other parts of it; in the first case to a point, and in the second, to a sequence of words:

 
Returning to <xref from="id (ABCD)">the point where I dozed off</xref>, I noticed that <xref from="id (EFGH)"> three words</xref> had been circled in red by a previous reader

This encoding requires that elements with the specified identifiers (ABCD and EFGH in this example) are to be found somewhere else in the current document. Assuming that no element already exists to carry these identifiers, the <anchor> and <seg> elements might be used:

 
  .... <anchor type="bookmark" id="ABCD"> ....
   ....<seg type="target" id="EFGH"> ... </seg> ...

An alternative approach, useful when identifiers or other markup cannot be introduced into the target document, is to use the string, token, or pattern location methods provided in the TEI extended pointer syntax by the following keywords:

token
selects a sequence of one or more white-space delimited tokens
str
selects a string of characters (not bytes)
pattern
selects the first place where the specified pattern is matched

These three methods should not be used to count across element boundaries: they are provided chiefly to locate fine detail within a given document element, where such points are not already explicitly marked up. The token and str methods are defined as behaving in exactly the same way as the HyTime dataloc method, with quanta token and str respectively. The syntax used to define pattern locations is very similar to the regular expression syntax used by most Unix systems.

Some examples follow:

 
<p>This <xptr from="HERE token (3 5)">is not a very good idea.
selects the three tokens a very good.
 
<p>This <xptr from="HERE str (3 5)">is not a very good idea.
selects the string no (i.e. space, n, o)
 
<p>This <xptr from="HERE pattern ([aeiou][aeiou])">is not a very good idea.
selects the first pair of adjacent vowels following the pointer, i.e. the string oo in good
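
The three selections just described can be reproduced with a short Python sketch. The helper names token, str_loc and pattern are hypothetical; the two numbers are read as first and last positions counted from 1, which is how the examples above behave, and Python's re module merely stands in for the TEI pattern syntax. The HyTime dataloc rules themselves are not modelled.

```python
import re

TEXT = "is not a very good idea."  # the content following the pointer


def token(text, first, last):
    """Select white-space delimited tokens first..last (1-based, inclusive).

    The selected tokens are rejoined with single spaces.
    """
    return " ".join(text.split()[first - 1:last])


def str_loc(text, first, last):
    """Select characters first..last (1-based, inclusive)."""
    return text[first - 1:last]


def pattern(text, regex):
    """Select the first match of regex, or None if nothing matches."""
    m = re.search(regex, text)
    return m.group(0) if m else None


tokens_3_5 = token(TEXT, 3, 5)                  # token (3 5)
chars_3_5 = str_loc(TEXT, 3, 5)                 # str (3 5)
first_pair = pattern(TEXT, "[aeiou][aeiou]")    # pattern ([aeiou][aeiou])
```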

Thus, assuming that the three words circled in red in the example above occurred at the start of the third paragraph in the chapter with identifier C5, a pointer like the following would point to them:

 
I noticed that <xref from="id (C5) child (3 p) token (1 3)"> three words</xref> had been circled in red by a previous reader

3 Locating sequences of element content

Often, the scope of a cross reference will be adequately defined by the from attribute. For some documents, however, it may be necessary to define both a starting and an ending location: if for example the target crosses SGML element boundaries, or involves several elements. As noted above, the to attribute is provided for this purpose. For example,

 
  <xptr doc="P1" from="id (xyz)" to="id (abc)">
is an extended pointer whose target is the part of the document P1 starting at the beginning of whatever element has identifier XYZ and ending at the end of whatever element in the same document has identifier ABC. Any elements in between are also included, irrespective of structure; the pointer is erroneous if the end of ABC precedes the start of XYZ.
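
The rule for combining the two attributes can be sketched in Python as follows. The representation is an assumption made for illustration: each resolved location is a pair of (start, end) offsets into the target document, and target_span is a hypothetical name.

```python
def target_span(from_loc, to_loc=None):
    """Combine two (start, end) offset pairs into the pointer's target span."""
    if to_loc is None:
        to_loc = from_loc  # 'to' defaults to the location given by 'from'
    start, end = from_loc[0], to_loc[1]
    if end < start:
        raise ValueError("end of 'to' location precedes start of 'from' location")
    return (start, end)


# Suppose xyz occupies offsets 10-20 and abc occupies offsets 40-55.
span = target_span((10, 20), (40, 55))
```

With these offsets the span runs from 10 to 55, taking in everything between the two elements; swapping the two locations raises an error, as the definition above requires.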

The keyword DITTO can be used to simplify the specification of a location ladder for a to attribute which differs only slightly from that already supplied in the from attribute. For example:

 
<xptr from="id (xyz) ancestor (1 div) pattern (Hegel)" to="ditto pattern (Marx)">
will find the sequence starting with the first occurrence of the string Hegel in the nearest <div> element containing the element with the identifier xyz, and ending with the first occurrence of the string Marx that follows it within the same element.

4 Other location methods

Three other location methods are defined in the TEI extended pointer syntax, specifically for non-SGML data and for HyTime conformant data. The SPACE keyword may be used to address locations in space; the FOREIGN keyword may be used to address locations in terms of some external notation not defined by the Guidelines; the HYQ keyword may be used to supply a HyTime query expression (this should be expressed using the SDQL query language which has now replaced the original HyTime Query Language). The FOREIGN and HYQ keywords are not further described here.

The SPACE keyword takes either two or three parameters. The first is a name for the co-ordinate system in use: it will typically be a string like 2D (two dimensional) or 3D (three dimensional). The second and third parameters consist of a list of numerical values giving co-ordinate values as measured along each dimension of a Cartesian space with all axes orthogonal. The number of values is equal to the number of axes (usually 2 or 3). If only the second parameter is given, the location indicated is a point in the co-ordinate system. For example:

 
<xptr from="SPACE (2d) (0 0)">
indicates the origin of a two dimensional space.

If the third parameter is present, the location indicated is the rectangular prism defined by treating corresponding items from the two lists as inclusive bounds along each dimension in turn. For example

 
<xptr from="SPACE (2d) (0 0) (1 1)">
indicates a single unit square tangent to the origin of a two dimensional space.
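
The difference between the two forms can be sketched in Python. The function names space_location and contains are hypothetical, and sorting each pair of bounds with min and max is an assumption made here so that the lists may be given in either order.

```python
def space_location(lower, upper=None):
    """Return per-axis inclusive (low, high) bounds for a SPACE location."""
    if upper is None:
        upper = lower  # a single list of values denotes a point
    if len(lower) != len(upper):
        raise ValueError("both lists need one value per axis")
    return [(min(a, b), max(a, b)) for a, b in zip(lower, upper)]


def contains(bounds, point):
    """True if point lies within the inclusive bounds on every axis."""
    return len(point) == len(bounds) and all(
        lo <= x <= hi for (lo, hi), x in zip(bounds, point)
    )


origin = space_location([0, 0])               # SPACE (2d) (0 0)
unit_square = space_location([0, 0], [1, 1])  # SPACE (2d) (0 0) (1 1)
```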

5 Attributes for linking elements

All TEI linking elements carry a number of general purpose attributes, listed here for completeness:

type
categorizes the pointer in some respect, using any convenient set of categories.
crDate
specifies when this pointer was made.
resp
specifies the creator of the pointer.
evaluate
specifies the intended meaning when the target of a pointer is itself a pointer.
targType
specifies the type (or types) of element to which this pointer may point.
targOrder
specifies whether or not the order in which elements are pointed at is to be respected (this is irrelevant in the case of <xptr> and <xref> elements)

The targType attribute may be used to add an additional semantic constraint to the linkage, by requiring that the GI (generic identifier, i.e. element name) of each element indicated match one of the names supplied, as in the following example:

 
this is discussed in <xref from="id (dspec)" targType="div1 p">the section on links</xref>
This reference should fail if the element with identifier dspec is not either a <div1> or a <p>. This constraint is not enforceable by SGML parsers, of course, but a TEI-aware application may choose to enforce it.
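
A TEI-aware application might apply such a check along the following lines. This is a sketch only: targ_type_ok is a hypothetical name, and comparing names without regard to case is an assumption about how the application folds SGML names.

```python
def targ_type_ok(gi, targ_type):
    """True if the target's element name appears in the targType list."""
    return gi.lower() in targ_type.lower().split()


ok_for_div1 = targ_type_ok("div1", "div1 p")
ok_for_note = targ_type_ok("note", "div1 p")
```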

The type attribute can be used to categorize the link represented by the pointer in any convenient way. The resp and crDate attributes may also be used to represent the person or agency responsible for making the link, and its date of creation, as in the following example:

 
    ...
   this is discussed in
   <xref type="navigator"
    resp="auto" crdate="950521" 
    from="id (dspec)" 
    targtype="div1 div2">
   the section on links</xref>

The evaluate attribute can take values all, one or none. Its purpose is to specify the intended significance of a link which points to a pointer. With evaluate=all, if the element pointed at is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found which is not a pointer. With evaluate=one, this evaluation process is carried out once only; with evaluate=none, it is not carried out at all.
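
The three values can be compared with a small Python sketch. It assumes, purely for illustration, that the pointers in a document are held in a dictionary mapping the identifier of each pointer to the identifier it points at; the function name resolve and the check for circular chains are additions made here and are not taken from the Guidelines.

```python
def resolve(target_id, pointers, evaluate="all"):
    """Follow pointer-to-pointer chains according to the evaluate attribute.

    pointers maps the id of each pointer element to the id it points at;
    any id absent from the mapping is treated as an ordinary element.
    """
    if evaluate == "none":
        return target_id
    if evaluate == "one":
        return pointers.get(target_id, target_id)
    if evaluate != "all":
        raise ValueError(f"unknown evaluate value: {evaluate!r}")
    seen = set()
    while target_id in pointers:
        if target_id in seen:
            raise ValueError("circular chain of pointers")
        seen.add(target_id)
        target_id = pointers[target_id]
    return target_id


# P1 points at P2, which is itself a pointer to the ordinary element S1.
POINTERS = {"P1": "P2", "P2": "S1"}
```

A link pointing at P1 therefore reaches S1 with evaluate=all, P2 with evaluate=one, and P1 itself with evaluate=none.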

6 Special purpose linking attributes

In any view of the TEI DTD which includes the Additional Tagset for Linking, a number of special purpose linking attributes become available, potentially on any defined element.

next
points to the next element in an aggregate.
prev
points to the previous element in an aggregate.
ana
points to an analysis of the element
corresp
points to an element corresponding with this element
synch
points to an element which is synchronous with this element
sameas
points to an element which is identical to this element
copyof
points to an element which is a distinct copy of this element

All of these attributes have declared values of either IDREF or IDREFS. If an <xptr> is required to carry one of the semantic roles listed above, it should be given an identifier which can then be specified as the value for the attribute concerned. For example, a linguistic analysis of the sentence "John loves Mary" might be encoded as follows:

 
<seg type="sentence" ana="SVO">
  <seg type="lex" ana="NP1">John</seg>
  <seg type="lex" ana="VVI">loves</seg>
  <seg type="lex" ana="NP1">Mary</seg>
</seg>
This encoding implies the existence elsewhere in the document of elements with identifiers SVO, NP1, and VVI where the significance of these particular codes is explained. Such elements might in fact be references to elements in some other document, as follows:
 
<xptr id="SVO" doc="synSpec" from="id (xsvo)">
<xptr id="NP1" doc="synSpec" from="id (xnp1)">
Here the implication is that there is an element in the external entity synSpec which carries an identifier xsvo and which provides the definition for the analysis concerned.

In the same way, the corresp (corresponding) attribute can be used to represent some form of correspondence between two elements within a document, or different documents. For example, in a multilingual text, it may be used to link translation equivalents, as in the following example

 
<seg lang="FRA" id="FR001" corresp="EN001">Jean aime Marie</seg>
<seg lang="ENG" id="EN001" corresp="FR001">John loves Mary</seg>

If, as is more likely, the French and English sentences are contained in two different documents, external pointers could be added to either document to refer to the other.

All the pointers so far discussed have been contextual, that is, one end of the link being represented is given by the location of the pointing element itself. The TEI also defines an independent linking pointer (in HyTime terms, an ilink), represented by the <link> element. The targets attribute of this element specifies the identifiers of two or more other elements which are to be linked, in some way defined by its type attribute. For example, the translation equivalence expressed by means of the corresp attribute above could equally well be represented as follows:

 
<seg lang="FRA" id="FR001">Jean aime Marie</seg>
<seg lang="ENG" id="EN001">John loves Mary</seg>
<link type="translation" targets="EN001 FR001">

This mechanism provides a convenient way of linking together sentences from different entities. Suppose that the English sentences are in an entity called ENtext and the French in one called FRtext. Since these are distinct SGML documents, we will use extended pointers to indicate each sentence, and express their alignment by means of an independent link:

 
<xptr id="EN1" doc="ENtext" from="id (S1)">
<xptr id="FR1" doc="FRtext" from="id (S1)">
<link type="translation" targets="EN1 FR1">

A <link> element whose targets are pointers is defined as linking the targets of those pointers.
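
This rule can be pictured with a minimal Python sketch. The table XPTRS is a hand-written stand-in for the two <xptr> elements above, and link_ends is a hypothetical name; identifiers that are not pointers are taken to be elements of the current document, shown here with a document of None.

```python
# Hypothetical table of the xptr elements above: id -> (doc, element id).
XPTRS = {
    "EN1": ("ENtext", "S1"),
    "FR1": ("FRtext", "S1"),
}


def link_ends(targets, xptrs):
    """Return the (doc, id) pair linked by each name in a targets attribute."""
    return [xptrs.get(name, (None, name)) for name in targets.split()]


ends = link_ends("EN1 FR1", XPTRS)
```

Here ends names the element S1 in ENtext and the element S1 in FRtext, rather than the two pointers themselves.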

Groups of pointers of similar types can be identified using the <linkGrp> element: all the links within such a group inherit a type value from their parent.

7 Implementations

A large subset of the TEI recommendations for extended pointers has been implemented in the Synex Viewport engine, and is consequently available to applications of this engine, such as Softquad's Panorama and Panorama Pro.

Here is the formal specification for the subset implemented by Synex. Note that the ten keywords given here in upper case can in fact be specified in upper or lower case, or a mixture:

 
locterm   ::= 'ROOT'              // default first rung
   |    'HERE'              // location of the xptr.
   |    'ID' '(' NAME ')'   // only one ID is allowed.
   |    'CHILD'     steps
   |    'ANCESTOR'  steps
   |    'PREVIOUS'  steps
   |    'NEXT'      steps
   |    'PRECEDING' steps
   |    'FOLLOWING' steps
   |    'DITTO'             // valid only in TO attribute.

steps     ::= '(' step ')'
   |    steps '(' step ')'
step      ::= instance
   |    instance element
   |    instance element avspecs
avspecs   ::= attribute value
   |    avspecs attribute value
instance  ::= 'ALL'
   |    NUMBER              // default sign is +
   |    '+' NUMBER
   |    '-' NUMBER
element   ::= NAME
   |    '#CDATA'
   |    '*'
attribute ::= NAME
   |    '*'
value     ::= LITERAL             // i.e. a quoted string.
   |    NAME                // As for attribute values in
   |    NUMBER              //   a document, NMTOKENs need not
   |    NUMTOKEN            //   be quoted.
   |    '#IMPLIED'          // No value specified, no default.
   |    '*'                 // Any value matches.
range     ::= NUMBER
   |    NUMBER NUMBER

You do not need to use the TEI DTD to take advantage of TEI extended pointers (though it may be a good idea to do so for other reasons). If you are using Panorama, you specify in your DTD or document that it should treat any element as a TEI extended pointer by supplying a processing instruction like the following:

 
<?TAGLINK foo "TEI-P3">
The element named (<foo> in the above example) must, of course, be defined in your DTD with attributes doc, from and to in the same way as the TEI elements <xptr> and <xref>.

A simple test file, demonstrating some of the features documented here, is available from the URL http://users.ox.ac.uk/~lou/wip/XRtest.sgm. You must have Panorama installed on your machine to read this file.