Corpus Applications



Purpose of Working Paper

The purpose of this working paper is to demonstrate how to apply the TEI, and especially the new recommendations of the working group on stand-off markup, to language corpora that are annotated with information describing any of a variety of linguistic properties of the data. Creators and users of linguistically-annotated corpora are typically computationally-oriented and XML-aware, and the corpora are created primarily for the purposes of processing rather than display. Current developments in XML open new possibilities for data representation that will be of direct benefit to this class of potential TEI user. We discuss the range of corpus representations that stand-off markup admits, together with the strengths and weaknesses of the various possibilities, describe good practice where a clear choice is possible, and give a range of worked examples that can be used as models.

Introduction

It is becoming common to annotate language corpora with a wide range of linguistic information, primarily to enable statistical processing of the data that feeds the development of language models used in a variety of language processing applications. Natural language processing (NLP) software can be applied to create annotations for linguistic data, analyzing text, speech, and data representing other modalities to determine specific linguistic attributes and associate them with the segments of the data to which they apply. NLP applications also use linguistic annotations to facilitate language understanding and generation. Encoding schemes for linguistic data must support both of these "views" on linguistic annotation, as well as integration of the two, to ensure maximal inter-operability.

The annotated linguistic data can consist solely of text, contain a single audio or video signal (usually with an accompanying time-aligned orthographic transcription), or contain multiple, synchronized signals representing different views on the language encounter. Annotations may comprise any of a variety of types of linguistic information, including morpho-syntactic specifications, syntactic and/or discourse structure, semantic and/or time- or event-related information, co-reference links, etc.; or, in the case of spoken data, phonetic, prosodic, gestural, etc. information. Annotation of linguistic data may involve multiple annotation steps, where annotations at lower linguistic levels serve as input in the determination of higher-level annotation categories (for example, morpho-syntactic tagging as input to syntactic analysis), so that annotation can be viewed as an incremental process. Depending on the application intended to use the annotations, lower-level annotations may or may not be preserved in a persistent format; the output of the annotation software may consist solely of higher-level annotations, even though lower-level analysis has been performed. For example, information extraction software typically performs the analysis required for annotation of various linguistic features and utilizes it internally to deliver the desired result, without preserving the annotation information.

Annotation of data for several "layers" of different linguistic features, or even multiple versions of the same type of information, presents considerable problems for its representation. In particular, annotations at different levels of analysis often do not conform to a single tree structure mapped onto the primary data. Even annotations of the same type may not use the same or compatible segmentations of the data (e.g., different morpho-syntactic analyses may treat differing, possibly discontiguous, spans of a text as a "word"). Even if a corpus annotated with multiple types of linguistic information were to form a well-formed tree structure, including those annotations in the primary data may complicate processing and render the data unwieldy.

General requirements

For the purposes of language processing, several desiderata for encoding linguistic data and its annotations have been identified by the language engineering community:

To meet these requirements, the growing practice among annotators of linguistic corpora is to separate annotations from primary data, using what has become known as "stand-off annotation". The basic notion behind the stand-off strategy is that primary data and its annotations exist in separate XML documents. Different annotation types (or variants of the same annotation type) are also contained in separate XML documents with their own DTD or schema. The result is a lattice of annotation documents pointing to a primary source or to intermediate annotation levels, where the links signify a semantic role rather than navigational options. That is, the links signify the locations where markup contained in a given annotation document would appear in the document to which it is linked. As such, the annotation information comprises "remote markup" which is virtually added to the document to which it is linked. This approach has several advantages for corpus-based research:

The concept of stand-off annotation was introduced in the Corpus Encoding Standard (CES) nearly a decade ago, and was first used to represent the MULTEXT Journal of the Commission (JOC) corpus, which consists of parallel aligned translations in six languages annotated for morpho-syntax. The same strategy was used to represent the MULTEXT-East parallel aligned corpus of Orwell's 1984, which exists in ten languages and is distributed by the TELRI project. Since then, the stand-off strategy has been adopted by numerous corpus creation projects (TIPSTER, MapTask, ANC) and developers of software for corpus manipulation (GATE, ATLAS). Most recently, the notion of stand-off annotation as the preferred representation for linguistically annotated corpora has been endorsed by ISO TC37 SC4, which is developing standard representation mechanisms for language resources.

Stand-off markup: what to encode where

The fundamental operative distinction for stand-off annotation is between the primary data (i.e., the representation of a linguistic act as a text, audio signal, transcription, etc.) and the annotation of that data, which provides information about its linguistic properties. Using this as a basis, we can distinguish two fundamental types of annotation activity:

In current practice, primary data typically contains segmentation markup for gross logical structure (i.e., markup identifying titles, chapter and section divisions, etc., down to the level of the paragraph). In some cases, sentence boundaries are also marked in primary data, but given that different processors may identify different spans as sentences, this is less frequently recommended. For documents considered to be read-only, or where it is otherwise desired to leave the primary data unaltered, segmentation information may be included in a separate annotation document specifying byte offsets into the primary document where a given segment begins and ends.
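For example, such a stand-off segmentation document might look like the following (a sketch only; the element names and the file name novel.txt are hypothetical, and the linking mechanism is discussed further below):

<!-- seg.xml: stand-off segmentation of the read-only file novel.txt -->
<segmentation xlink:href="novel.txt">
  <!-- each segment is identified by byte offsets into the primary data -->
  <seg id="seg1" type="p" from="0" to="312"/>
  <seg id="seg2" type="s" from="0" to="88"/>
  <seg id="seg3" type="s" from="89" to="312"/>
</segmentation>

Because the primary file is never modified, alternative segmentations (e.g., from a different sentence splitter) can coexist as separate documents of this kind.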

Various sub-paragraph elements such as names, dates, abbreviations etc. are commonly identified in the early stages of corpus processing (typically during segmentation into words, sentences, etc.), often because they are signaled by typography. Such elements may be marked in the primary data; however, given the recent work in the field of computational linguistics on the identification of so-called named entities, it is becoming more common to represent this type of annotation as stand-off as well because: (1) it is identified by processes entirely different from the segmentation process; and (2) there is considerable variation in the composition and identification of the entities among recognition systems.

Determining how annotation types are split up into different stand-off annotation documents depends on the way in which the annotations are created (e.g., by multiple annotators at different sites, each working on a specific annotation type), the requirements of the processing software that uses the annotations, and, in some cases, the linguistic view of the annotator. There is no hard and fast rule for splitting annotations, but where an annotation can be considered to include several annotation types, it is recommended that they be represented so that they are easily separable by automatic means.

[there could be a few examples here]
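As one possible example (a sketch; the file names and category values are hypothetical), morpho-syntactic and named-entity annotations over the same word-level segmentation can be kept in two separate documents:

<!-- pos.xml: morpho-syntactic annotation over words.xml -->
<posAnnotation>
  <seg xlink:href="words.xml#id(w1)">
    <f type="cat">NNP</f>
  </seg>
  ...
</posAnnotation>

<!-- ne.xml: named-entity annotation over the same words, produced by a different process -->
<neAnnotation>
  <seg xlink:href="words.xml#id(w1)">
    <f type="entity">person</f>
  </seg>
  ...
</neAnnotation>

Because each annotation type resides in its own document with its own DTD or schema, one layer (say, the named-entity annotation) can be replaced by the output of a different recognizer without touching the other layers.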

Association of stand-off annotation with primary data and/or other annotations

Annotation documents are linked to the primary data or other annotation documents using ...

[here is where David or someone has to jump in with the exact mechanisms]
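Pending that discussion, the examples in this working paper assume XLink/XPointer-style references: each stand-off element carries an xlink:href attribute whose value identifies an element or span in the target document, for example:

<seg xlink:href="words.xml#id(w12)"> ... </seg>
<seg xlink:href="text.xml#xpointer(substring(/p/s[1]/text(),1,4))"> ... </seg>

The first form points at an identified element in another document; the second addresses a character span directly, which is useful when the target document carries no markup at the desired granularity.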

Alignment Annotations

There are several cases where the "annotation" of primary data involves only the identification of corresponding segments/elements of two or more different documents:

In these cases, the annotation document contains only links associating locations in two or more documents. (We can think of this as a two-way link, where there may be multiple targets of one or both "ends" of the link.)

[here need some explanation of the linkage conventions to be used]
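As a minimal sketch, consistent with the parallel-text examples later in this paper, such a link might associate one segment in one document with a range of segments in another:

<link>
  <align xlink:href="doc1.xml#s1"/>
  <align xlink:href="doc2.xml#xpointer(id('s1')/range-to(id('s2')))"/>
</link>

The link itself carries no content; its semantics are simply that the targets of its children correspond to one another.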

[could include more information on the Data Category Registry and would probably have to say something (brief) about rdf]

Abstract formats for linguistic annotation

Traditionally in the field of computational linguistics, a choice has had to be made between site-specific formats that inter-operate with locally-developed software, and adherence to suggested "standard" formats that may not accommodate specific annotation needs. To address this problem, several generic, abstract encoding schemes intended to be capable of representing all or most annotation types have been developed (e.g., the TIGER scheme, annotation graphs). For example, TIGER (http://www.ims.uni-stuttgart.de/projekte/TIGER/), which provides graphical display and query facilities for syntax, uses this kind of representation. In the following example, representing the syntactic structure of one utterance from the Switchboard Corpus, the data forms a graph with a set of non-terminal syntactic constituents, starting with an identified root node, and edges leading down to terminals that include both part-of-speech tagged words and other symbols from the orthographic transcription.

<graph root="s6_500">
  <terminals>
    <t id="s6_1" word="But" pos="CC"/>
    <t id="s6_2" word="I" pos="PRP"/>
    <t id="s6_3" word="do" pos="VBP"/>
    <t id="s6_4" word="n't" pos="RB"/>
    <t id="s6_5" word="," pos=","/>
    <!-- N_S is a special end of utterance marker -->
    <t id="s6_6" word="N_S" pos="-DFL-"/>
  </terminals>
  <nonterminals>
    <nt id="s6_501" cat="NP">
      <edge idref="s6_2" label="--"/>
    </nt>
    <nt id="s6_502" cat="VP">
      <edge idref="s6_3" label="--"/>
      <edge idref="s6_4" label="--"/>
    </nt>
    <nt id="s6_500" cat="S">
      <edge idref="s6_1" label="--"/>
      <edge idref="s6_501" label="SBJ"/>
      <edge idref="s6_502" label="UNF"/>
      <edge idref="s6_5" label="--"/>
      <edge idref="s6_6" label="--"/>
    </nt>
  </nonterminals>
</graph>

The annotation graph formalism assumes stand-off markup; this is an example of an AG representation in XML:

<annotation>
  <arc><source id="0" offset="0"/><label att_1="P" att_2="h#"/><target id="1" offset="2360"/></arc>
  <arc><source id="1" offset="2360"/><label att_1="P" att_2="sh"/><target id="2" offset="3270"/></arc>
  <arc><source id="2" offset="3270"/><label att_1="P" att_2="iy"/><target id="3" offset="5200"/></arc>
  <arc><source id="1" offset="2360"/><label att_1="W" att_2="she"/><target id="3" offset="5200"/></arc>
  <arc><source id="3" offset="5200"/><label att_1="P" att_2="hv"/><target id="4" offset="6160"/></arc>
  <arc><source id="4" offset="6160"/><label att_1="P" att_2="ae"/><target id="5" offset="8720"/></arc>
  <arc><source id="5" offset="8720"/><label att_1="P" att_2="dcl"/><target id="6" offset="9680"/></arc>
  <arc><source id="3" offset="5200"/><label att_1="W" att_2="had"/><target id="6" offset="9680"/></arc>
  <arc><source id="6" offset="9680"/><label att_1="P" att_2="y"/><target id="7" offset="10173"/></arc>
  <arc><source id="7" offset="10173"/><label att_1="P" att_2="axr"/><target id="8" offset="11077"/></arc>
  <arc><source id="6" offset="9680"/><label att_1="W" att_2="your"/><target id="8" offset="11077"/></arc>
</annotation>

Formats such as these may be used for actual processing, or may be used as interchange formats for exchanging data among different processes. In the latter case, it is assumed that locally-developed formats can be transformed as needed into and out of more standard, abstract representations using XML transduction mechanisms such as XSLT. However, it is important to note that transduction between formats can only be performed if there is a uniform notion of the data model underlying both representations.
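As a sketch of such a transduction (the element names on both sides are hypothetical; <feedbackgaze> anticipates the MONITOR example later in this paper), an XSLT template might map a local, attribute-based code onto a more generic node/feature representation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- map a local, concise element onto a generic node with feature children -->
  <xsl:template match="feedbackgaze">
    <node id="{@id}" start="{@start}" end="{@end}">
      <f type="label"><xsl:value-of select="@label"/></f>
    </node>
  </xsl:template>
</xsl:stylesheet>

The inverse transformation is equally straightforward, which is what makes round-tripping between local and abstract formats practical, provided the two share a data model.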

Currently ISO TC37 SC4 (http://www.tc37sc4.org) is developing an abstract format for linguistic annotations that will serve the needs of the language engineering community. The fundamental premise upon which the work of this committee is based is that annotators will encode their data using locally-developed schemes that serve their needs for processing, conciseness, readability, etc., and transduction to the abstract format will be performed for interchange (and possibly also for processing, since we assume that many processors will adopt this representation). The primary consideration for any scheme that is to be transduced to the abstract format is that the user-defined format must conform to the data model defined for the ISO abstract format. The SC4 data model is built around a clear separation of the structure of annotations and their content, that is, the linguistic information the annotation provides. The model therefore combines a structural meta-model (an abstract structure shared by all documents of a given type, e.g. syntactic annotation) and a set of data categories associated with the various components of the structural meta-model.

The structural component of the data model is a directed-graph feature structure capable of referencing n-dimensional regions of primary data as well as other annotations. The choice of this model is supported by its almost universal use in defining general-purpose annotation formats, including the Generic Modeling Tool (GMT) (Ide and Romary, 2001, 2002), Annotation Graphs (Bird and Liberman, 2001), and schemes such as the TIGER graph representation above. A small inventory of logical operations over annotation structures can be specified, which define the model's abstract semantics. These operations allow for expressing the following relations among annotation fragments:

Hierarchical relations (parent-child, right-sibling) are typically left implicit in the structure of the XML encoding. [However, there will probably need to be a way to express these relations explicitly, just in case.]

The feature structure graph contains elementary structural nodes to which one or more data category/value pairs (or links to pre-existing categories) are attached, providing the semantics of the annotation.

Not surprisingly, the SC4 data model is a generalization of the TEI recommendations for segmentation and alignment (as well as feature structures). Therefore the TEI provides considerable support for an encoding format that is easily transduced to the more abstract SC4 format.

[we can provide examples of TEI encoded stuff here]
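As one such example (a sketch using the TEI feature-structure tagset), the morpho-syntactic content of the lemma example below could be expressed as:

<fs type="morphosyntax">
  <f name="cat"><sym value="noun"/></f>
  <f name="number"><sym value="pl"/></f>
  <f name="lemma"><str>man</str></f>
</fs>

An <fs> of this kind, attached to or pointing at a segment, maps directly onto the node-plus-data-category structure of the SC4 model.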

There are several additional points to be considered:

Avoiding tracing cycles

Stand-off markup allows one to represent arbitrary graph structure, but in practice, it can be useful to know which parts of the graph are likely to contain cycles that require more careful processing and which ones do not.

A single tree could be split over several files from top to bottom for the reasons given in II.2; multi-rooted trees are useful but contain no cycles at all, since they simply point downwards towards the signal/text. It would be useful to have some way of making this plain so that, as in the NITE data format, the main structure is the multi-rooted tree, expressed via children in XML files and one kind of link, and any additional graph structure required is expressed via another kind of link imposed on top. One reason this is useful is that trees are so common in real data that they account for the bulk of it. This is something we'd have to talk about, though, because if taken to its logical conclusion it means two types of links, not just (say) comments about these properties in the document type definition, and therefore could have implications for SO W 06. DD noted in May that the consensus seems to be to use link elements for the pointers that aren't meant to be traced in processing.
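A sketch of the distinction (element names hypothetical): structural pointers build the multi-rooted tree and are traced downwards during processing, while <link> elements carry additional graph structure that is not traced:

<move id="m1">
  <!-- structural pointers: part of the multi-rooted tree, traced during processing -->
  <child xlink:href="words.xml#id(w1)"/>
  <child xlink:href="words.xml#id(w2)"/>
  <!-- non-structural pointer: additional graph structure, not traced -->
  <link type="antecedent" xlink:href="moves.xml#id(m0)"/>
</move>

Under this convention a processor can follow <child> pointers freely without cycle detection, reserving the more careful handling for <link> elements.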

Using children versus attributes

NOTE: Standard advice about when to use children and when to use attributes says never to require parsing of attributes? Of course, IDREFS doesn't follow this... Also, if attributes come in related subsets, it can be better to use a tag that makes this structure clear.

Differentiating editorial comment from print?

NOTE: DD suggested this as an issue, but I don't know much about it. Is this the same as the element content issue?

Element content

In corpora consisting of textual data, the encoding is often designed so that the text (whether it is a written text or an orthographic transcription of some speech) can be recovered in some sensible order by concatenating the textual content of the elements within the document body. In this case, a simple filter can remove tags to reproduce the base text. Design for annotation in stand-off documents can follow a similar strategy by regarding the annotation information as the "primary data" for that particular stand-off document. While including this information as content rather than, say, in attributes is not mandatory, it can be useful to consider this distinction for processing purposes. For instance, consider a stand-off document including morpho-syntactic information and lemmas for the words in a corpus:

<seg type="lemma">
  <seg xlink:href="words.xml#id(1)">
    <f type="cat">determiner</f>
    <f type="lemma">the</f>
  </seg>
  <seg xlink:href="words.xml#id(2)">
    <f type="cat">noun</f>
    <f type="number">pl</f>
    <f type="lemma">man</f>
  </seg>
  <seg xlink:href="words.xml#id(3)">
    <f type="cat">verb</f>
    <f type="tense">past</f>
    <f type="lemma">buy</f>
  </seg>
  ...
</seg>

This may allow easier manipulation than placing the lemmas, part-of-speech categories, and other features in attribute values. The information is easily retrievable, requires no parsing of attribute values, and lends itself to processing with regular expressions.
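For comparison, the same information packed into attributes (a hypothetical sketch) forces an application to know every attribute name in advance and to parse attribute values wherever several features are compounded into one attribute:

<seg xlink:href="words.xml#id(2)" cat="noun" msd="number=pl" lemma="man"/>

Here the msd value would have to be split on "=" before the number feature could be queried, which is precisely the kind of attribute parsing the element-content design avoids.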

Corpus examples

A simple case: syntax

[Some problems here coming up with a TEI encoding: I used <f> for "feature"; also in some places I had a seg with no segment associated...]

For the text: Paul intends to leave IBM.

<seg id="s0">
  <f type="cat">S</f>
  <seg id="s1" xlink:href="xptr(substring(/p/s[1]/text(),1,4))">
    <f type="cat">NP</f>
    <f type="rel" head="s2">SBJ</f>
  </seg>
  <seg id="s2" xlink:href="xptr(substring(/p/s[1]/text(),6,7))">
    <f type="cat">VP</f>
  </seg>
  <seg id="s3">
    <f type="cat">S</f>
    <seg id="s4" ref="s1">
      <f type="rel" head="s6">SBJ</f>
      <seg id="s5" xlink:href="xptr(substring(/p/s[1]/text(),24,3))">
        <f type="cat">VP</f>
        <seg id="s6" xlink:href="xptr(substring(/p/s[1]/text(),18,5))">
          <f type="cat">VP</f>
          <seg id="s7" xlink:href="xptr(substring(/p/s[1]/text(),24,3))">
            <f type="cat">NP</f>
          </seg>
        </seg>
      </seg>
    </seg>
  </seg>
</seg>

Representing parallel texts

Examples from XCES pages; I have not taken out refs to XCES yet.

DOC1: <s id="p1s1">According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products.</s><s id="p1s2">Cola drink manufacturers in particular achieved above-average growth rates.</s><!-- ... -->

DOC2: <s id="p1s1">Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes.</s><s id="p1s2">En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.</s>

ALIGN DOC:

<linkGrp targType="s">
  <link>
    <align xlink:href="#p1s1"/>
    <align xlink:href="#p1s1"/>
  </link>
  <link>
    <align xlink:href="#p1s2"/>
    <align xlink:href="#p1s2"/>
  </link>
</linkGrp>

The order of the <align> elements within a <link> element is significant. Unless otherwise specified, the order is assumed to match the ordering of <translation> elements in the header. If a different ordering is required, the n attribute on the <translation> element and the n attribute on the <align> element can be used to explicitly link an <align> element with a specific translation. Many-to-one and many-to-many alignments can be represented by providing a range for the XPointer expression. N-to-zero alignments can be indicated by omitting one or more of the <align> elements and using the n attribute to specify which translations any remaining <align> elements refer to. Alternatively, the href attribute can be set to #xces:undefined to indicate that there is no translation for that fragment in that language.

Example:

header.xml:

<cesHeader version="2.3">
  ...
  <translations>
    <translation trans.loc="text-fr.xml" xml:lang="fr" wsd="ISO8859-1" n="1"/>
    <translation trans.loc="text-en.xml" xml:lang="en" wsd="ISO8859-1" n="2"/>
    <translation trans.loc="text-ro.xml" xml:lang="ro" wsd="ISO8859-1" n="3"/>
    <translation trans.loc="text-cz.xml" xml:lang="cz" wsd="ISO8859-1" n="4"/>
  </translations>
</cesHeader>

align.xml:

<cesAlign type="sent" version="1.6">
  <cesHeader xlink:href="header.xml"/>
  <linkList>
    <!-- sentence alignments -->
    <linkGrp domains="d1 d1 d1 d1" targType="s">
      <link>
        <!-- Same ordering as translation elements [fr,en,ro,cz] -->
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
      </link>
      <link>
        <!-- Reverse order [cz,ro,en,fr] -->
        <align n="4" xlink:href="#s2"/>
        <align n="3" xlink:href="#s2"/>
        <align n="2" xlink:href="#s2"/>
        <align n="1" xlink:href="#s2"/>
      </link>
      <link>
        <!-- No English translation [3ro,2cz,1fr] -->
        <align n="3" xlink:href="#xpointer(id('s3')/range-to(id('s5')))"/>
        <align n="4" xlink:href="#xpointer(id('s3')/range-to(id('s4')))"/>
        <align n="1" xlink:href="#s3"/>
      </link>
      <link>
        <!-- 3rd align is fr, the rest are taken in order of translation [1en,1ro,2fr,0cz] -->
        <align xlink:href="#s3"/>
        <align xlink:href="#s4"/>
        <align n="1" xlink:href="#xpointer(id('s4')/range-to(id('s5')))"/>
        <align xlink:href="#xces:undefined"/>
      </link>
      ...
    </linkGrp>
  </linkList>
</cesAlign>

(The rest is Jean's; not sure what to do with it.)

Pointing into a video: The MONITOR data set

NOTE: This data set includes a good reason for wanting to know where on the video frame a particular behaviour occurs: we can describe exactly the locations of the giver and follower gazes. We discussed allowing for this in SO W 04, but I'm not sure we know whether this is possible yet. Including it would probably require CC's help.

The data

This example is taken from a manipulation of the Map Task. In it, the route giver is told that there is a route follower in another room who can hear him, but cannot be heard. Before the trial, the route giver has eyetracking explained, and during recording, is himself subject to eyetracking. He is told that the route follower is also subject to eyetracking, and that a red square on the map shows where the route follower is looking. Figure 1 shows a video still from the resulting data; on it, the route giver's gaze is represented by a white circle.

Figure 1. Video still from the MONITOR Data Set

In reality, there is no route follower, and the red square representing his gaze is manipulated programmatically.

The annotations

The annotations for the MONITOR data set include the following types of information:

This data set was created on a split-site project, with different staff creating the different kinds of annotation from a shared set of videos. Here, the intention is to understand how non-verbal feedback affects the route giver's behaviour.

The representation

NOTE: The details of referencing into video are not correct below yet. I think the right idea is that there's one place that says where to find the relevant signal.

In this data set, each of the monologues has several orthogonal segmentations. Putting each type of coding in a separate stand-off file allows processes to consider them as complete, ordered segmentations of the data, as well as allowing staff to work on them simultaneously. Accordingly, for each monologue in the corpus, one might represent the relationship between giver and follower gaze in a single XML file with a root element and a flat set of codes, each of which expresses its type and how it relates to signal:

<feedback-gaze_stream id="18t-ft.fbg">
  <feedbackgaze label="not-feedback" start="0.00" end="12.12" duration="12.12" id="18t-ft.fbg.1"/>
  <feedbackgaze label="offscreen" start="12.12" end="12.32" duration="0.20" id="18t-ft.fbg.2"/>
  <feedbackgaze label="not-feedback" start="12.32" end="14.80" duration="2.48" id="18t-ft.fbg.3"/>
  <feedbackgaze label="offscreen" start="14.80" end="14.96" duration="0.16" id="18t-ft.fbg.4"/>
  <feedbackgaze label="not-feedback" start="14.96" end="15.20" duration="0.24" id="18t-ft.fbg.5"/>
  ...
</feedback-gaze_stream>

Similarly, one could use one file for the relationship between giver gaze and the route, and another for the relationship between "follower" gaze and the route. Using one element name for each type of code, and differentiating the codes themselves with an attribute value, has advantages for some query applications (such as the NITE XML Toolkit) that can load related stand-off files as one coherent graph-structured data set, since then it is possible to find all tags from a specific code set without resorting to a disjunction of tag names. However, when all processing is carried out on one XML file at a time, it is simpler to use one element name per code. Courtesy attributes such as duration may be useful to store if they are needed by onward processes that will have difficulty deriving them.
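For instance, under the one-element-name-per-code convention, the codes above might each become their own element (a sketch; the element names are illustrative), at the cost of needing a disjunction of tag names to retrieve the whole code set:

<not-feedback start="0.00" end="12.12" duration="12.12" id="18t-ft.fbg.1"/>
<offscreen start="12.12" end="12.32" duration="0.20" id="18t-ft.fbg.2"/>

Which convention is preferable thus depends on whether queries run within a single file or across the loaded multi-file data set.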

NOTE: Paragraphs here explaining how to represent the language data: one tree, but with references pulled out into a separate list pointing into the tree. This gives the tree a uniform structure (transaction parent of move parent of word; references are recursive and would disrupt this) and makes it easy to create a reference list; yet another file can be used to relate references together.

Metadata and headers

NOTE: There are issues about where to put generic information, like where to find a video/audio signal, and it's a good idea to have a list somewhere of all the files that go with a particular corpus. We've only touched on these issues briefly, but we need to give some advice or at least raise this part of the design as an issue.

Discussion

We recommend utilizing multiple files so as to maximize the correspondence between the XML structure and the data structure, since that is what is best for daily use on any particular corpus (unless one is relying solely on some technology that uses a graph-structure encoding). Fortunately, XML technology makes data transformation relatively easy, so it is possible to transform the data into one of the graph-structure encodings and back as needed, should those technologies prove useful.

