Corpus Applications



Purpose of Working Paper

The purpose of this working paper is to demonstrate how to apply the TEI, and especially the new recommendations of the working group on stand-off markup, to language corpora that are annotated with information describing any of a variety of linguistic properties of the data. Creators and users of linguistically-annotated corpora are typically computationally-oriented and XML-aware, and the corpora are created primarily for the purposes of processing rather than display. Current developments in XML open new possibilities for data representation that will be of direct benefit to this class of potential TEI user. We discuss the range of corpus representations that stand-off markup admits, together with the strengths and weaknesses of the various possibilities, describe good practice where a clear choice is possible, and give a range of worked examples that can be used as models.

Introduction

It is becoming common to annotate language corpora with a wide range of linguistic information, primarily to enable statistical processing of the data that feeds the development of language models used in a variety of language processing applications. Natural language processing (NLP) software can be applied to create annotations for linguistic data, analyzing text, speech, and data representing other modalities to determine specific linguistic attributes and associate them with the segments of the data to which they apply. NLP applications also use linguistic annotations to facilitate language understanding and generation. Encoding schemes for linguistic data must support both of these "views" on linguistic annotation, as well as integration of the two, to ensure maximal inter-operability.

The annotated linguistic data can consist solely of text, contain a single audio or video signal (usually with an accompanying time-aligned orthographic transcription), or contain multiple, synchronized signals representing different views on the language encounter. Annotations may comprise any of a variety of types of linguistic information, including morpho-syntactic specifications, syntactic and/or discourse structure, semantic and/or time- or event-related information, co-reference links, etc.; or, in the case of spoken data, phonetic, prosodic, gestural, etc. information. Annotation of linguistic data may involve multiple annotation steps, where annotations at lower linguistic levels serve as input in the determination of higher-level annotation categories (for example, morpho-syntactic tagging as input to syntactic analysis), so that annotation can be viewed as an incremental process. Depending on the application intended to use the annotations, lower-level annotations may or may not be preserved in a persistent format; the output of the annotation software may consist solely of higher-level annotations, even though lower-level analysis has been performed. For example, information extraction software typically performs the analysis required for annotation of various linguistic features and utilizes it internally to deliver the desired result, without preserving the annotation information.

Annotation of data for several "layers" of different linguistic features, or even multiple versions of the same type of information, presents considerable problems for its representation. In particular, annotations at different levels of analysis often do not conform to a single tree structure mapped onto the primary data. Even annotations of the same type may not use the same or compatible segmentations of the data (e.g., different morpho-syntactic analyses may treat differing, possibly discontiguous, spans of a text as a "word"). Even if a corpus annotated with multiple types of linguistic information were to form a well-formed tree structure, including those annotations in the primary data may complicate processing and render the data unwieldy.

General requirements

For the purposes of language processing, several desiderata for encoding linguistic data and its annotations have been identified by the language engineering community:

To meet these requirements, the growing practice among annotators of linguistic corpora is to separate annotations from primary data, using what has become known as "stand-off annotation". The basic notion behind the stand-off strategy is that primary data and its annotations exist in separate XML documents. Different annotation types (or variants of the same annotation type) are also contained in separate XML documents with their own DTD or schema. The result is a lattice of annotation documents pointing to a primary source or to intermediate annotation levels, where the links signify a semantic role rather than navigational options. That is, the links signify the locations where markup contained in a given annotation document would appear in the document to which it is linked. As such, the annotation information comprises "remote markup" which is virtually added to the document to which it is linked. This approach has several advantages for corpus-based research:

The concept of stand-off annotation was introduced in the Corpus Encoding Standard (CES) nearly a decade ago, and was first used to represent the MULTEXT Journal of the Commission (JOC) corpus, which consists of parallel aligned translations in six languages annotated for morpho-syntax. The same strategy was used to represent the MULTEXT-East parallel aligned corpus of Orwell's 1984, which exists in ten languages and is distributed by the TELRI project. Since then, the stand-off strategy has been adopted by numerous corpus creation projects (TIPSTER, MapTask, ANC) and developers of software for corpus manipulation (GATE, ATLAS). Most recently, the notion of stand-off annotation as the preferred representation for linguistically annotated corpora has been endorsed by ISO TC37 SC4, which is developing standard representation mechanisms for language resources.

Stand-off markup: what to encode where

The fundamental operative distinction for stand-off annotation is between the primary data (i.e., the representation of a linguistic act as a text, audio signal, transcription, etc.) and the annotation of that data, which provides information about its linguistic properties. Using this as a basis, we can distinguish two fundamental types of annotation activity:

In current practice, primary data typically contains segmentation markup for gross logical structure (i.e., markup identifying titles, chapter and section divisions, etc., down to the level of the paragraph). In some cases, sentence boundaries are also marked in primary data, but given that different processors may identify different spans as sentences, this is less frequently recommended. For documents considered to be read-only, or where it is otherwise desired to leave the primary data unaltered, segmentation information may be included in a separate annotation document specifying byte offsets into the primary document where a given segment begins and ends.
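For example, such a stand-off segmentation document might look like the following (a sketch only; the element names and the file name novel.txt are hypothetical, and the linking mechanism is discussed further below):

<!-- seg.xml: stand-off segmentation of the read-only file novel.txt -->
<segmentation xlink:href="novel.txt">
  <!-- each segment is identified by byte offsets into the primary data -->
  <seg id="seg1" type="p" from="0" to="312"/>
  <seg id="seg2" type="s" from="0" to="88"/>
  <seg id="seg3" type="s" from="89" to="312"/>
</segmentation>

Because the primary file is never modified, alternative segmentations (e.g., from a different sentence splitter) can coexist as separate documents of this kind.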

Various sub-paragraph elements such as names, dates, abbreviations etc. are commonly identified in the early stages of corpus processing (typically during segmentation into words, sentences, etc.), often because they are signaled by typography. Such elements may be marked in the primary data; however, given the recent work in the field of computational linguistics on the identification of so-called named entities, it is becoming more common to represent this type of annotation as stand-off as well because: (1) it is identified by processes entirely different from the segmentation process; and (2) there is considerable variation in the composition and identification of the entities among recognition systems.

Determining how annotation types are split up into different stand-off annotation documents depends on the way in which the annotations are created (e.g., by multiple annotators at different sites, each working on a specific annotation type), the requirements of the processing software that uses the annotations, and, in some cases, the linguistic view of the annotator. There is no hard and fast rule for splitting annotations, but where an annotation can be considered to include several annotation types, it is recommended that they be represented so that they are easily separable by automatic means.

[there could be a few examples here]
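As one possible example (a sketch; the file names and category values are hypothetical), morpho-syntactic and named-entity annotations over the same word-level segmentation can be kept in two separate documents:

<!-- pos.xml: morpho-syntactic annotation over words.xml -->
<posAnnotation>
  <seg xlink:href="words.xml#id(w1)">
    <f type="cat">NNP</f>
  </seg>
  ...
</posAnnotation>

<!-- ne.xml: named-entity annotation over the same words, produced by a different process -->
<neAnnotation>
  <seg xlink:href="words.xml#id(w1)">
    <f type="entity">person</f>
  </seg>
  ...
</neAnnotation>

Because each annotation type resides in its own document with its own DTD or schema, one layer (say, the named-entity annotation) can be replaced by the output of a different recognizer without touching the other layers.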

Association of stand-off annotation with primary data and/or other annotations

Annotation documents are linked to the primary data or other annotation documents using ...

[here is where David or someone has to jump in with the exact mechanisms]
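Pending that discussion, the examples in this working paper assume XLink/XPointer-style references: each stand-off element carries an xlink:href attribute whose value identifies an element or span in the target document, for example:

<seg xlink:href="words.xml#id(w12)"> ... </seg>
<seg xlink:href="text.xml#xpointer(substring(/p/s[1]/text(),1,4))"> ... </seg>

The first form points at an identified element in another document; the second addresses a character span directly, which is useful when the target document carries no markup at the desired granularity.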

Alignment Annotations

There are several cases where the "annotation" of primary data involves only the identification of corresponding segments/elements of two or more different documents:

In these cases, the annotation document contains only links associating locations in two or more documents. (We can think of this as a two-way link, where there may be multiple targets of one or both "ends" of the link.)

[here need some explanation of the linkage conventions to be used]
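As a minimal sketch, consistent with the parallel-text examples later in this paper, such a link might associate one segment in one document with a range of segments in another:

<link>
  <align xlink:href="doc1.xml#s1"/>
  <align xlink:href="doc2.xml#xpointer(id('s1')/range-to(id('s2')))"/>
</link>

The link itself carries no content; its semantics are simply that the targets of its children correspond to one another.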

[could include more information on the Data Category Registry and would probably have to say something (brief) about rdf]

Abstract formats for linguistic annotation

Traditionally in the field of computational linguistics, a choice has had to be made between site-specific formats that inter-operate with locally-developed software, and adherence to suggested "standard" formats that may not accommodate specific annotation needs. To address this problem, several generic, abstract encoding schemes intended to be capable of representing all or most annotation types have been developed (e.g., the TIGER scheme, annotation graphs). For example, TIGER (http://www.ims.uni-stuttgart.de/projekte/TIGER/), which provides graphical display and query facilities for syntax, uses this kind of representation. In the following example, representing the syntactic structure of one utterance from the Switchboard Corpus, the data forms a graph with a set of non-terminal syntactic constituents, starting with an identified root node, and edges leading down to terminals that include both part-of-speech tagged words and other symbols from the orthographic transcription.

<graph root="s6_500">
  <terminals>
    <t id="s6_1" word="But" pos="CC"/>
    <t id="s6_2" word="I" pos="PRP"/>
    <t id="s6_3" word="do" pos="VBP"/>
    <t id="s6_4" word="n't" pos="RB"/>
    <t id="s6_5" word="," pos=","/>
    <!-- N_S is a special end of utterance marker -->
    <t id="s6_6" word="N_S" pos="-DFL-"/>
  </terminals>
  <nonterminals>
    <nt id="s6_501" cat="NP">
      <edge idref="s6_2" label="--"/>
    </nt>
    <nt id="s6_502" cat="VP">
      <edge idref="s6_3" label="--"/>
      <edge idref="s6_4" label="--"/>
    </nt>
    <nt id="s6_500" cat="S">
      <edge idref="s6_1" label="--"/>
      <edge idref="s6_501" label="SBJ"/>
      <edge idref="s6_502" label="UNF"/>
      <edge idref="s6_5" label="--"/>
      <edge idref="s6_6" label="--"/>
    </nt>
  </nonterminals>
</graph>

The annotation graph formalism assumes stand-off markup; this is an example of an AG representation in XML:

<annotation>
  <arc><source id="0" offset="0"/><label att_1="P" att_2="h#"/><target id="1" offset="2360"/></arc>
  <arc><source id="1" offset="2360"/><label att_1="P" att_2="sh"/><target id="2" offset="3270"/></arc>
  <arc><source id="2" offset="3270"/><label att_1="P" att_2="iy"/><target id="3" offset="5200"/></arc>
  <arc><source id="1" offset="2360"/><label att_1="W" att_2="she"/><target id="3" offset="5200"/></arc>
  <arc><source id="3" offset="5200"/><label att_1="P" att_2="hv"/><target id="4" offset="6160"/></arc>
  <arc><source id="4" offset="6160"/><label att_1="P" att_2="ae"/><target id="5" offset="8720"/></arc>
  <arc><source id="5" offset="8720"/><label att_1="P" att_2="dcl"/><target id="6" offset="9680"/></arc>
  <arc><source id="3" offset="5200"/><label att_1="W" att_2="had"/><target id="6" offset="9680"/></arc>
  <arc><source id="6" offset="9680"/><label att_1="P" att_2="y"/><target id="7" offset="10173"/></arc>
  <arc><source id="7" offset="10173"/><label att_1="P" att_2="axr"/><target id="8" offset="11077"/></arc>
  <arc><source id="6" offset="9680"/><label att_1="W" att_2="your"/><target id="8" offset="11077"/></arc>
</annotation>

Formats such as these may be used for actual processing, or may be used as interchange formats for exchanging data among different processes. In the latter case, it is assumed that locally-developed formats can be transformed as needed into and out of more standard, abstract representations using XML transduction mechanisms such as XSLT. However, it is important to note that transduction between formats can only be performed if there is a uniform notion of the data model underlying both representations.
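As a sketch of such a transduction (the element names on both sides are hypothetical; <feedbackgaze> anticipates the MONITOR example later in this paper), an XSLT template might map a local, attribute-based code onto a more generic node/feature representation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- map a local, concise element onto a generic node with feature children -->
  <xsl:template match="feedbackgaze">
    <node id="{@id}" start="{@start}" end="{@end}">
      <f type="label"><xsl:value-of select="@label"/></f>
    </node>
  </xsl:template>
</xsl:stylesheet>

The inverse transformation is equally straightforward, which is what makes round-tripping between local and abstract formats practical, provided the two share a data model.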

Currently ISO TC37 SC4 (http://www.tc37sc4.org) is developing an abstract format for linguistic annotations that will serve the needs of the language engineering community. The fundamental premise upon which the work of this committee is based is that annotators will encode their data using locally-developed schemes that serve their needs for processing, conciseness, readability, etc., and transduction to the abstract format will be performed for interchange (and possibly also for processing, since we assume that many processors will adopt this representation). The primary consideration for any scheme that is to be transduced to the abstract format is that the user-defined format must conform to the data model defined for the ISO abstract format. The SC4 data model is built around a clear separation of the structure of annotations and their content, that is, the linguistic information the annotation provides. The model therefore combines a structural meta-model (an abstract structure shared by all documents of a given type, e.g. syntactic annotation) and a set of data categories associated with the various components of the structural meta-model.

The structural component of the data model is a directed-graph feature structure capable of referencing n-dimensional regions of primary data as well as other annotations. The choice of this model is supported by its almost universal use in defining general-purpose annotation formats, including the Generic Modeling Tool (GMT) (Ide and Romary, 2001, 2002), Annotation Graphs (Bird and Liberman, 2001), and schemes such as the TIGER graph representation above. A small inventory of logical operations over annotation structures can be specified, which define the model's abstract semantics. These operations allow for expressing the following relations among annotation fragments:

Hierarchical relations (parent-child, right-sibling) are typically left implicit in the structure of the XML encoding. [However, there will probably need to be a way to express these relations explicitly, just in case.]

The feature structure graph contains elementary structural nodes to which one or more data category/value pairs (or links to pre-existing categories) are attached, providing the semantics of the annotation.

Not surprisingly, the SC4 data model is a generalization of the TEI recommendations for segmentation and alignment (as well as feature structures). Therefore the TEI provides considerable support for an encoding format that is easily transduced to the more abstract SC4 format.

[we can provide examples of TEI encoded stuff here]
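As one such example (a sketch using the TEI feature-structure tagset), the morpho-syntactic content of the lemma example below could be expressed as:

<fs type="morphosyntax">
  <f name="cat"><sym value="noun"/></f>
  <f name="number"><sym value="pl"/></f>
  <f name="lemma"><str>man</str></f>
</fs>

An <fs> of this kind, attached to or pointing at a segment, maps directly onto the node-plus-data-category structure of the SC4 model.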

There are several additional points to be considered:

Avoiding tracing cycles

Stand-off markup allows one to represent arbitrary graph structure, but in practice, it can be useful to know which parts of the graph are likely to contain cycles that require more careful processing and which ones do not.

A single tree could be split over several files from top to bottom for the reasons given in II.2; multi-rooted trees are useful but contain no cycles at all, since they simply point downwards towards the signal/text. It would be useful to have some way of making this plain so that, as in the NITE data format, the main structure is the multi-rooted tree, expressed via children in XML files and one kind of link, and any additional graph structure required is expressed via another kind of link imposed on top. One reason this is useful is that trees are so common in real data that they account for the bulk of it. This is something we'd have to talk about, though, because if taken to its logical conclusion it means two types of links, not just (say) comments about these properties in the document type definition, and therefore could have implications for SO W 06. DD noted in May that the consensus seems to be to use link elements for the pointers that aren't meant to be traced in processing.
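A sketch of the distinction (element names hypothetical): structural pointers build the multi-rooted tree and are traced downwards during processing, while <link> elements carry additional graph structure that is not traced:

<move id="m1">
  <!-- structural pointers: part of the multi-rooted tree, traced during processing -->
  <child xlink:href="words.xml#id(w1)"/>
  <child xlink:href="words.xml#id(w2)"/>
  <!-- non-structural pointer: additional graph structure, not traced -->
  <link type="antecedent" xlink:href="moves.xml#id(m0)"/>
</move>

Under this convention a processor can follow <child> pointers freely without cycle detection, reserving the more careful handling for <link> elements.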

Using children versus attributes

NOTE: Standard advice about when to use children and when to use attributes says never to require parsing of attributes? Of course, IDREFS doesn't follow this... Also, if attributes come in related subsets, it can be better to use a tag that makes this structure clear.

Differentiating editorial comment from print?

NOTE: DD suggested this as an issue, but I don't know much about it. Is this the same as the element content issue?

Element content

In corpora consisting of textual data, the encoding is often designed so that the text (whether it is a written text or an orthographic transcription of some speech) can be recovered in some sensible order by concatenating the textual content of the elements within the document body. In this case, a simple filter can remove tags to reproduce the base text. Design for annotation in stand-off documents can follow a similar strategy by regarding the annotation information as the "primary data" for that particular stand-off document. While including this information as content rather than, say, in attributes is not mandatory, it can be useful to consider this distinction for processing purposes. For instance, consider a stand-off document including morpho-syntactic information and lemmas for the words in a corpus:

<seg type="lemma">
  <seg xlink:href="words.xml#id(1)">
    <f type="cat">determiner</f>
    <f type="lemma">the</f>
  </seg>
  <seg xlink:href="words.xml#id(2)">
    <f type="cat">noun</f>
    <f type="number">pl</f>
    <f type="lemma">man</f>
  </seg>
  <seg xlink:href="words.xml#id(3)">
    <f type="cat">verb</f>
    <f type="tense">past</f>
    <f type="lemma">buy</f>
  </seg>
  ...
</seg>

This may allow easier manipulation than placing the lemmas, part-of-speech categories, and other features in attribute values. The information is easily retrievable, requires no parsing of attribute values, and lends itself to processing with regular expressions.
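For comparison, the same information packed into attributes (a hypothetical sketch) forces an application to know every attribute name in advance and to parse attribute values wherever several features are compounded into one attribute:

<seg xlink:href="words.xml#id(2)" cat="noun" msd="number=pl" lemma="man"/>

Here the msd value would have to be split on "=" before the number feature could be queried, which is precisely the kind of attribute parsing the element-content design avoids.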

Corpus examples

A simple case: syntax

[Some problems here coming up with a TEI encoding: I used <f> for "feature"; also in some places I had a seg with no segment associated...]

For the text: Paul intends to leave IBM.

<seg id="s0">
  <f type="cat">S</f>
  <seg id="s1" xlink:href="xptr(substring(/p/s[1]/text(),1,4))">
    <f type="cat">NP</f>
    <f type="rel" head="s2">SBJ</f>
  </seg>
  <seg id="s2" xlink:href="xptr(substring(/p/s[1]/text(),6,7))">
    <f type="cat">VP</f>
  </seg>
  <seg id="s3">
    <f type="cat">S</f>
    <seg id="s4" ref="s1">
      <f type="rel" head="s6">SBJ</f>
      <seg id="s5" xlink:href="xptr(substring(/p/s[1]/text(),24,3))">
        <f type="cat">VP</f>
        <seg id="s6" xlink:href="xptr(substring(/p/s[1]/text(),18,5))">
          <f type="cat">VP</f>
          <seg id="s7" xlink:href="xptr(substring(/p/s[1]/text(),24,3))">
            <f type="cat">NP</f>
          </seg>
        </seg>
      </seg>
    </seg>
  </seg>
</seg>

Representing parallel texts

Examples from XCES pages; I have not taken out refs to XCES yet.

DOC1: <s id="p1s1">According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products.</s><s id="p1s2">Cola drink manufacturers in particular achieved above-average growth rates.</s><!-- ... -->

DOC2: <s id="p1s1">Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes.</s><s id="p1s2">En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.</s>

ALIGN DOC:

<linkGrp targType="s">
  <link>
    <align xlink:href="#p1s1"/>
    <align xlink:href="#p1s1"/>
  </link>
  <link>
    <align xlink:href="#p1s2"/>
    <align xlink:href="#p1s2"/>
  </link>
</linkGrp>

The order of the <align> elements within a <link> element is significant. Unless otherwise specified, the order is assumed to match the ordering of <translation> elements in the header. If a different ordering is required, the n attribute on the <translation> element and the n attribute on the <align> element can be used to explicitly link an <align> element with a specific translation. Many-to-one and many-to-many alignments can be represented by providing a range for the XPointer expression. N-to-zero alignments can be indicated by omitting one or more of the <align> elements and using the n attribute to specify which translations any remaining <align> elements refer to. Alternatively, the href attribute can be set to #xces:undefined to indicate that there is no translation for that fragment in that language.

Example:

header.xml:

<cesHeader version="2.3">
  ...
  <translations>
    <translation trans.loc="text-fr.xml" xml:lang="fr" wsd="ISO8859-1" n="1"/>
    <translation trans.loc="text-en.xml" xml:lang="en" wsd="ISO8859-1" n="2"/>
    <translation trans.loc="text-ro.xml" xml:lang="ro" wsd="ISO8859-1" n="3"/>
    <translation trans.loc="text-cz.xml" xml:lang="cz" wsd="ISO8859-1" n="4"/>
  </translations>
</cesHeader>

align.xml:

<cesAlign type="sent" version="1.6">
  <cesHeader xlink:href="header.xml"/>
  <linkList>
    <!-- sentence alignments -->
    <linkGrp domains="d1 d1 d1 d1" targType="s">
      <link>
        <!-- Same ordering as translation elements [fr,en,ro,cz] -->
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
      </link>
      <link>
        <!-- Reverse order [cz,ro,en,fr] -->
        <align n="4" xlink:href="#s2"/>
        <align n="3" xlink:href="#s2"/>
        <align n="2" xlink:href="#s2"/>
        <align n="1" xlink:href="#s2"/>
      </link>
      <link>
        <!-- No English translation [3ro,2cz,1fr] -->
        <align n="3" xlink:href="#xpointer(id('s3')/range-to(id('s5')))"/>
        <align n="4" xlink:href="#xpointer(id('s3')/range-to(id('s4')))"/>
        <align n="1" xlink:href="#s3"/>
      </link>
      <link>
        <!-- 3rd align is fr, the rest are taken in order of translation [1en,1ro,2fr,0cz] -->
        <align xlink:href="#s3"/>
        <align xlink:href="#s4"/>
        <align n="1" xlink:href="#xpointer(id('s4')/range-to(id('s5')))"/>
        <align xlink:href="#xces:undefined"/>
      </link>
      ...
    </linkGrp>
  </linkList>
</cesAlign>

(The rest is Jean's; not sure what to do with it.)

Pointing into a video: The MONITOR data set

NOTE: This data set includes a good reason for wanting to know where on the video frame a particular behaviour occurs: we can describe exactly the locations of the giver and follower gazes. We discussed allowing for this in SO W 04, but I'm not sure we know whether this is possible yet. Including it would probably require CC's help.

The data

This example is taken from a manipulation of the Map Task. In it, the route giver is told that there is a route follower in another room who can hear him, but cannot be heard. Before the trial, the route giver has eyetracking explained, and during recording, is himself subject to eyetracking. He is told that the route follower is also subject to eyetracking, and that a red square on the map shows where the route follower is looking. Figure 1 shows a video still from the resulting data; on it, the route giver's gaze is represented by a white circle.

Figure 1. Video still from the MONITOR Data Set

In reality, there is no route follower, and the red square representing his gaze is manipulated programmatically.

The annotations

The annotations for the MONITOR data set include the following types of information:

This data set was created on a split-site project, with different staff creating the different kinds of annotation from a shared set of videos. Here, the intention is to understand how non-verbal feedback affects the route giver's behaviour.

The representation

NOTE: The details of referencing into video are not correct below yet. I think the right idea is that there's one place that says where to find the relevant signal.

In this data set, each of the monologues has several orthogonal segmentations. Putting each type of coding in a separate stand-off file allows processes to consider them as complete, ordered segmentations of the data, as well as allowing staff to work on them simultaneously. Accordingly, for each monologue in the corpus, one might represent the relationship between giver and follower gaze in a single XML file with a root element and a flat set of codes, each of which expresses its type and how it relates to signal:

<feedback-gaze_stream id="18t-ft.fbg">
  <feedbackgaze label="not-feedback" start="0.00" end="12.12" duration="12.12" id="18t-ft.fbg.1"/>
  <feedbackgaze label="offscreen" start="12.12" end="12.32" duration="0.20" id="18t-ft.fbg.2"/>
  <feedbackgaze label="not-feedback" start="12.32" end="14.80" duration="2.48" id="18t-ft.fbg.3"/>
  <feedbackgaze label="offscreen" start="14.80" end="14.96" duration="0.16" id="18t-ft.fbg.4"/>
  <feedbackgaze label="not-feedback" start="14.96" end="15.20" duration="0.24" id="18t-ft.fbg.5"/>
  ...
</feedback-gaze_stream>

Similarly, one could use one file for the relationship between giver gaze and the route, and another for the relationship between "follower" gaze and the route. Using one element name for each type of code, and differentiating the codes themselves with an attribute value, has advantages for some query applications (such as the NITE XML Toolkit) that can load related stand-off files as one coherent graph-structured data set, since then it is possible to find all tags from a specific code set without resorting to a disjunction of tag names. However, when all processing is carried out on one XML file at a time, it is simpler to use one element name per code. Courtesy attributes such as duration may be useful to store if they are needed by onward processes that will have difficulty deriving them.
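For instance, under the one-element-name-per-code convention, the codes above might each become their own element (a sketch; the element names are illustrative), at the cost of needing a disjunction of tag names to retrieve the whole code set:

<not-feedback start="0.00" end="12.12" duration="12.12" id="18t-ft.fbg.1"/>
<offscreen start="12.12" end="12.32" duration="0.20" id="18t-ft.fbg.2"/>

Which convention is preferable thus depends on whether queries run within a single file or across the loaded multi-file data set.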

NOTE: Paragraphs here explaining how to represent the language data: one tree, but with references pulled out into a separate list pointing into the tree. This gives the tree a uniform structure (transaction parent of move parent of word; references are recursive and would disrupt this) and makes it easy to create a reference list; yet another file can be used to relate references together.

Metadata and headers

NOTE: There are issues about where to put generic information, like where to find a video/audio signal, and it's a good idea to have a list somewhere of all the files that go with a particular corpus. We've only touched on these issues briefly, but we need to give some advice or at least raise this part of the design as an issue.

Discussion

We recommend utilizing multiple files so as to maximize the correspondence between the XML structure and the data structure, since that is what is best for daily use on any particular corpus (unless one is relying solely on some technology that uses a graph-structure encoding). Fortunately, XML technology makes data transformation relatively easy, so it is possible to transform the data into one of the graph-structure encodings and back as needed, should those technologies prove useful.

