6 The TEI additional tag sets

Ten additional tag sets are defined by the current TEI proposals. These include tag sets for special application areas such as the orthographic transcription of speech, the detailed physical description of manuscript or print material, and the recording of an `electronic variorum' modelled on the traditional critical apparatus. Tag sets are also defined for the detailed documentation of contextual information needed by language corpora; for the detailed encoding of names and dates; for abstractions such as networks, graphs, or trees; and for mathematical formulae and tables.

In addition to these application-specific additional tag sets, some more general-purpose additional tag sets are defined for:

linking and alignment
analysis and interpretation
feature structure analysis

The tag set for linking and alignment extends the set of linking and pointing elements already defined in the TEI core to provide facilities for linking to arbitrary locations or spans of texts, whether or not these are in the current document, and whether or not the target is an SGML document. Mechanisms are included for recording the alignment or correspondence of parts of a text, for example in multilingual corpora, or for marking the alignment of audio or video with a transcription of it. As such, this tag set provides a usefully large subset of the facilities offered by the HyTime standard, but with a considerably simpler and more efficient interface. [See note 4]
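The idea of recording alignment between parts of texts can be sketched in a few lines of Python. This is only an illustration of the underlying data model, not TEI markup: the names `Span`, `Alignment`, `TEXTS`, and the two sample file names are all hypothetical, and character offsets stand in for whatever addressing scheme an encoder actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A stretch of text addressed by document name and character offsets."""
    doc: str    # hypothetical document identifier
    start: int  # inclusive offset
    end: int    # exclusive offset


@dataclass(frozen=True)
class Alignment:
    """Records that two or more spans correspond to one another."""
    spans: tuple
    kind: str = "translation"  # assumed label for the kind of correspondence


# Two tiny sample documents standing in for a multilingual corpus.
TEXTS = {
    "en.txt": "The cat sleeps. The dog barks.",
    "fr.txt": "Le chat dort. Le chien aboie.",
}

# The alignments live outside the texts and merely point into them.
ALIGNMENTS = [
    Alignment((Span("en.txt", 0, 15), Span("fr.txt", 0, 13))),
    Alignment((Span("en.txt", 16, 30), Span("fr.txt", 14, 29))),
]


def resolve(span):
    """Return the text a span points at."""
    return TEXTS[span.doc][span.start:span.end]


def aligned_text(alignment):
    """Return the corresponding passages of one alignment, in order."""
    return [resolve(s) for s in alignment.spans]
```

Because each span names its own document, the same structure covers targets inside or outside the current document; aligning audio or video with a transcription would need time offsets in place of character offsets.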

As noted above, a generic segmentation element is defined for the identification of textual spans appropriate to any analytic scheme. An out-of-line generic <interp> element may be used to link arbitrary text segments (which may be nested or discontinuous) with any user-defined set of attribute/value pair interpretations. Specific tags are also defined for the most common requirements of linguistic analysis such as identification and typing of morphemes, words, phrases, and sentences.
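The out-of-line approach can be pictured as follows. This is a rough Python sketch rather than TEI syntax: the segment identifiers, the `SEGMENTS` and `INTERPS` tables, and both helper functions are assumptions made for illustration. What it shows is that interpretations are stored apart from the text and refer to segments by identifier, so one interpretation may cover discontinuous segments and one segment may carry several interpretations.

```python
# Text segments, each with a hypothetical identifier.
SEGMENTS = {
    "s1": "Call me Ishmael.",
    "s2": "Some years ago",
    "s3": "I thought I would sail about a little",
}

# Each interpretation is a user-defined attribute/value pair plus the
# identifiers of the segments it applies to.
INTERPS = [
    {"type": "theme", "value": "identity", "targets": ["s1"]},
    {"type": "theme", "value": "voyage", "targets": ["s2", "s3"]},  # discontinuous
    {"type": "speech-act", "value": "imperative", "targets": ["s1"]},
]


def interpretations_of(seg_id):
    """List every (type, value) pair attached to one segment."""
    return [(i["type"], i["value"]) for i in INTERPS if seg_id in i["targets"]]


def segments_for(type_, value):
    """Return the text of every segment carrying a given interpretation."""
    found = []
    for interp in INTERPS:
        if interp["type"] == type_ and interp["value"] == value:
            found.extend(SEGMENTS[t] for t in interp["targets"])
    return found
```

Keeping the interpretations out of line means that a new analytic scheme can be added without touching the segmented text itself.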

A specialized tag set is also provided for the encoding of abstract interpretations of a text, either in parallel with it or embedded within it. This is based on the feature structure notation employed in theoretical linguistics, but has applications beyond linguistic theory. [See note 5]

Using this mechanism, encoders can define arbitrarily complex bundles or sets of features identified in a text, according to their own methodological bias. They may thus embed a whole range of interpretations of a text, whether linguistic, literary, or thematic, within that text in a controlled manner. The syntax defined by the Guidelines not only formalizes the way in which such features are encoded, but also provides for a detailed specification of legal feature/value pair combinations and of rules determining, for example, the implication of under-specified or defaulted features. This specification is known as a feature system declaration and is defined by an auxiliary tag set.
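The relationship between a feature structure and its feature system declaration can be illustrated with a small Python sketch. Nothing here reproduces the syntax of the Guidelines: the `DECLARATION` table, the feature names, and the functions `validate` and `apply_defaults` are hypothetical, and a real declaration can express much richer constraints than this flat list of legal values and defaults.

```python
# A hypothetical declaration: each feature lists its legal values and,
# optionally, the value to assume when the feature is left unspecified.
DECLARATION = {
    "category": {"values": {"noun", "verb"}, "default": None},
    "number": {"values": {"singular", "plural"}, "default": "singular"},
}


def validate(fs):
    """Return a list of problems found in a feature structure (a dict)."""
    errors = []
    for name, value in fs.items():
        if name not in DECLARATION:
            errors.append(f"undeclared feature: {name}")
        elif value not in DECLARATION[name]["values"]:
            errors.append(f"illegal value for {name}: {value}")
    return errors


def apply_defaults(fs):
    """Return a copy of fs with defaulted features filled in."""
    completed = dict(fs)
    for name, spec in DECLARATION.items():
        if name not in completed and spec["default"] is not None:
            completed[name] = spec["default"]
    return completed
```

In this sketch an under-specified structure such as `{"category": "noun"}` is valid as it stands, and `apply_defaults` makes explicit what the declaration implies about the features it omits.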

An additional tag set is also provided for the encoding of degrees of uncertainty or ambiguity in the encoding of a text. These tag sets exhibit in a particularly noticeable form one of the chief strengths of the TEI approach to encoding: it provides the encoder with a well-defined set of tools which can be used to make explicit his or her reading of a text. No claim to absolute authority is made by any encoder, nor ever should be; the TEI scheme merely allows encoders to ``come clean'' about what they have perceived in a text, to whatever degree of detail seems appropriate.

A user of the TEI scheme may combine as many or as few additional tag sets as suit his or her needs. The existence of tag sets for particular application areas in the current draft reflects, to some extent, accidents of history: no claim to systematic or encyclopaedic coverage is implied. Indeed, it is confidently expected that new tag sets will be added, and that their definition will form an important part of the continued work of this and successor projects.
