15 Simple Analytic Mechanisms

Part 4

Additional Tag Sets

This chapter describes a tag set for associating simple analyses and interpretations with text elements. We use the term analysis here to refer to any kind of semantic or syntactic interpretation which an encoder wishes to attach to all or part of a text. Examples discussed in this chapter include familiar linguistic categorizations (such as ``clause'', ``morpheme'', ``part-of-speech'' etc.) and characterizations of narrative structure (such as ``theme'', ``reconciliation'' etc.). The mechanisms presented in this chapter are simpler but less powerful than those described in chapter 16.

Section 15.1 introduces a tag set for characterizing text segments according to the familiar linguistic categories of sentence or s-unit, clause, phrase, word, morpheme, and character. These elements represent special cases of the generic <seg> element described in section 14.3.

Section 15.2 introduces an additional global attribute which allows passages of text to be associated with specialised SGML elements representing their interpretation. These `interpretative' elements (<span> and <interp>) are described in detail in section 15.3. They allow the encoder to specify an analysis as a series of names and associated values, [ see note 90 ] each such pair being linked to one or more stretches of text, either directly, in the case of spans, or indirectly, in the case of interpretations.

Finally, section 15.4 revisits the topic of linguistic analysis, and illustrates how these interpretative mechanisms may be used to associate simple linguistic analysis with text segments.

The following DTD fragments show the overall organization of the class of analytic elements discussed in the remainder of this chapter. File teiana2.ent defines the additional global attribute made available by this tag set.

<!-- 15:  Modifications to TEI class system for analysis      -->
<!-- ... declarations from section 15.2                       -->
<!--     (Global attribute for analysis)                      -->
<!--     go here ...                                          -->

File teiana2.dtd contains declarations for elements used to represent simple analyses or interpretations of portions of a text.

<!-- 15:  Simple analytic mechanisms                          -->
 <!-- Text Encoding Initiative: Guidelines for Electronic      -->
<!-- Text Encoding and Interchange. Document TEI P3, 1994.    -->

<!-- Copyright (c) 1994 ACH, ACL, ALLC. Permission to copy    -->
<!-- in any form is granted, provided this notice is          -->
<!-- included in all copies.                                  -->

<!-- These materials may not be altered; modifications to     -->
<!-- these DTDs should be performed as specified in the       -->
<!-- Guidelines in chapter "Modifying the TEI DTD."           -->

<!-- These materials subject to revision. Current versions    -->
<!-- are available from the Text Encoding Initiative.         -->

<!-- We declare the various elements, group by group.         -->

<!-- ... declarations from section 15.3                       -->
<!--     (Spans)                                              -->
<!--     go here ...                                          -->
<!-- ... declarations from section 15.1                       -->
<!--     (Linguistic Segment Categories)                      -->
<!--     go here ...                                          -->

This tag set is selected as described in section 3.3; in a document which uses the markup described in this chapter, the document type declaration should contain the following declaration of the entity TEI.analysis, or an equivalent one:

<!ENTITY % TEI.analysis 'INCLUDE'>

The entire document type declaration for a document using this additional tag set together with that for linking and alignment and the base tag set for prose might look like this:

<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [
    <!ENTITY % TEI.prose    'INCLUDE' >
    <!ENTITY % TEI.linking  'INCLUDE' >
    <!ENTITY % TEI.analysis 'INCLUDE' >
]>

15.1 Linguistic Segment Categories

In this section we introduce specialized linguistic segment category elements which may be used to represent the segmentation of a text into the traditional linguistic categories of sentence, clause, phrase, word, morpheme, and character.

<s> contains a sentence-like division of a text.
<cl> represents a grammatical clause.
<phr> represents a grammatical phrase.
<w> represents a grammatical (not necessarily orthographic) word. Attributes include:
- lemma identifies the word's lemma (dictionary entry form).
<m> represents a grammatical morpheme. Attributes include:
- baseform identifies the morpheme's base form.
<c> represents a character.

As members of the seg class, these elements share the following attributes:

type characterizes the type of segment.
function characterizes the function of the segment.

The <s> element may be used simply to segment a text end-to-end into a series of non-overlapping segments, referred to here and elsewhere as s-units, or sentences.

 
<p>
<s>Nineteen fifty-four, when I was eighteen years old,
is held to be a crucial turning point in the history of
the Afro-American  --- for the U.S.A. as a whole --- the
year segregation was outlawed by the U.S. Supreme Court.</s>
<s>It was also a crucial year for me because on June 18,
1954, I began serving  a sentence in state prison for
possession of marijuana.</s>

The <s> element may be thought of as providing an abbreviated version of the tag <seg type='s-unit'>, with the important additional proviso that (unlike <seg> elements) <s> elements may not be nested within each other. The type attribute of the <s> element corresponds to the subtype attribute on the <seg> element; that is, a tag <s type='xxx'> should be thought of as synonymous with a tag <seg type='s-unit' subtype='xxx'>. Similar considerations apply to the <cl> and <phr> elements, which can be thought of as short for <seg type=clause> and <seg type=phrase>, respectively.
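
The correspondence can be seen by writing the same s-unit both ways. This is a sketch only: the sentence is invented, and the value interrogative is an assumed example, not one defined by these Guidelines.

<!-- 'interrogative' is a hypothetical type value -->
<s type=interrogative>Who goes there?</s>

<!-- the equivalent encoding using the generic seg element -->
<seg type='s-unit' subtype=interrogative>Who goes there?</seg>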

The <s> element may be further subdivided into clauses, marked with the <cl> element, as in the following example:

<p>
<s><cl>It was about the beginning of September, 1664,
    <cl>that I, among the rest of my neighbours,
      heard in ordinary discourse
      <cl>that the plague was returned again
        in Holland;
      </cl>
    </cl>
  </cl>
  <cl>for it had been very violent there, and
    particularly at Amsterdam and Rotterdam,
    in the year 1663,</cl>
  <cl>whither,
    <cl>they say,</cl>
    it was brought,
    <cl>some said</cl>
    from Italy, others from the Levant,
    among some goods
    <cl>which were brought home
    by their Turkey fleet;</cl>
  </cl>
  <cl>others said it was brought from Candia;
    others from Cyprus.
  </cl>
</s>
<s><cl>It mattered not
    <cl>from whence it came;</cl>
  </cl>
  <cl>but all agreed
    <cl>it was come into Holland again.</cl>
  </cl>
</s>
</p>

Clauses may be further divided into <phr> elements in the same way. A text may be segmented directly into clauses, or into phrases, with no need to include segmentation at a higher level as well.
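
For instance, a passage might be segmented directly into phrases, with no enclosing <cl> or <s> elements. The phrase boundaries below are an illustrative assumption rather than a recommended analysis:

<p>
<!-- hypothetical segmentation into phrases only -->
<phr>It</phr>
<phr>mattered not</phr>
<phr>from whence</phr>
<phr>it</phr>
<phr>came;</phr>
</p>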

For verse texts, the overlapping of metrical and syntactic structure requires that special care be given to representing both using the SGML element hierarchy. One simple approach is to split the syntactic phrases into fragments when they cross verse boundaries, reuniting them with the part attribute:

<div type=stanza>
<l><cl part=i>Tweedledum and Tweedledee</cl></l>
<l><cl part=f>Agreed to have a battle;</cl></l>
<l><cl part=i>For Tweedledum said
  <cl part=i>Tweedledee</cl></cl></l>
<l><cl part=f><cl part=f>Had spoiled his
  nice new rattle.</cl></cl></l>
</div>
<div type=stanza>
<l><cl part=i>Just then flew down a monstrous crow,</cl></l>
<l><cl part=f>As black as a tar barrel;</cl></l>
<l><cl part=i>Which frightened both the heroes so,</cl></l>
<l><cl part=f><cl>They quite forgot their quarrel.</cl></cl></l>
</div>

Another approach is to use the next and prev attributes defined in the additional tag set for linking (chapter 14):

<div type=stanza>
<l>...
<l><cl id=c3 next=c5 part=i>For Tweedledum said
  <cl id=c4 next=c6 part=i>Tweedledee</cl></cl></l>
<l><cl id=c5 prev=c3 part=f>
  <cl id=c6 prev=c4 part=f>
   Had spoiled his nice new rattle.</cl></cl></l>
</div>

Other methods are also possible; for discussion, see chapter 31.

The type attribute on linguistic segment categories can be used to provide additional interpretative information about the category. The function attribute on the <cl> and <phr> elements can be used to provide additional information about the function of the category. Legal values for these two attributes are not defined by these Guidelines, but should be documented in the <segmentation> element of the <encodingDesc> element within the document's header. A general approach to the encoding of linguistic categories assigned to parts of a text is discussed in section 15.4 below.

Using traditional terminology, these attributes provide a convenient way of specifying, for example, that the clause `from whence it came' is a relative clause modifying another, or that the phrase `by the U.S. Supreme Court' is a prepositional post-modifier:

<cl>It mattered not
  <cl type='relative' function='clause modifier'>
   from whence it came;</cl></cl>

<phr type=NP>the year
  segregation</phr>
 <phr>was outlawed</phr>
 <phr type=PP function='postmodifier (agent)'>
  by the U.S. Supreme Court.</phr>
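
As noted above, the values used for type and function in examples like these should be documented in the header. A minimal sketch of such documentation follows. The wording of the description is invented for illustration, and the placement of <segmentation> within an <editorialDecl> element is assumed from the header chapter rather than stated in this section.

<encodingDesc>
 <editorialDecl>
  <segmentation>
   <!-- illustrative description only -->
   <p>Clauses are marked with cl and phrases with phr.
   The type attribute on phr takes values such as NP and PP;
   the function attribute describes the grammatical function
   of the segment in free text, for example postmodifier.</p>
  </segmentation>
 </editorialDecl>
</encodingDesc>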

Segmentation into clauses and phrases can, of course, be combined. To make such encodings easier to read, the segmentation tags can begin new lines, and be indented according to their degree of nesting, thus:

<p>
<cl type='finite declarative' function='independent'>
 <phr type=NP function='subject'>Nineteen fifty-four,
  <cl type='finite relative declarative'
    function='appositive'>when
   <phr type=NP function='subject'>I</phr>
   <phr type=VP function='predicate'>was eighteen
                    years old</phr>
  </cl>,
 </phr>
 <phr type=VP function='predicate'>
  <phr type=V function='main verb'>is held</phr>
  <phr type=NP function='complement'> <!-- ? -->
   <cl type='nonfinite' function='predicate nom.'>
    <phr type=V function='copula'>to be</phr>
    <phr type=NP function='predicate nom.'>
     a crucial turning point
     <phr type=PP function='postmodifier'>in
      <phr type=NP function='prep.obj.'>
       the history
       <phr type=PP function='postmodifier'>
        of the Afro-American</phr>
      </phr>
     </phr>
     ---
     <phr type=PP
        function='appositive postmodifier'>for
      <phr type=NP function='prep.obj.'>the U.S.A.
       <phr type=PP function='postmodifier'>
        as a whole</phr>
      </phr>
     </phr>
    </phr>
    ---
    <phr type=NP function='appositive pred.nom.'>
     the year
     <cl type='finite relative'
       function='adjectival'>
      <phr type=NP function='subject'>
       segregation</phr>
      <phr type=VP function='predicate'>
       <phr type=V function='main verb'>
        was outlawed</phr>
       <phr type=PP function='postmodifier'>
        by the U.S. Supreme Court</phr>
      </phr>
     </cl>
    </phr>
   </cl>
  </phr>
 </phr>
  .
</cl>
<cl type='finite declarative' function='independent'>
 <phr type=NP function='subject'>It</phr>
 <phr type=VP function='predicate'>
  <phr type=V function='main verb'>was</phr>
  also
  <phr type=NP function='predicate nom.'>
   a crucial year for me</phr>
 </phr>
 <cl type='finite declarative'
   function='dependent causative'>because
  <phr type=PP function='sentence adverb'>
   on June 18, 1954</phr>,
  <phr type=NP function='subject'>I</phr>
  <phr type=VP function='predicate'>
   <phr type=V function='main verb'>began serving</phr>
   <phr type=NP function='complement'>
    a sentence in state prison
    <phr type=PP function='complement'>
     for possession of marijuana
    </phr>
   </phr>
  </phr>
 </cl>
</cl>
.

This style of markup, however, introduces spurious new lines and blanks into the text, which could make restoring the text to its original layout problematic. If the original layout is important, the original line breaks and font shifts should be recorded using <lb> elements, the global rend attribute, etc.

The <w>, <m> and <c> elements are also identical in meaning to the <seg> element with a type attribute of ``w'', ``m'', or ``c'', and may occur wherever <seg> is permitted to occur. However, they have more restricted content models than does <seg>: for example, the <w> element can only contain <seg>, <w>, <m> and <c> elements, and parsed character data; the <m> element can only contain <seg> and <c> elements and parsed character data; the <c> element can only contain parsed character data, and should in fact only contain a single character. Consequently, while <m> et al. can be translated directly into typed <seg> elements, the reverse is not necessarily the case.

The restriction on the content of the <w> element in particular requires some care, especially in relation to other tags that one may think of as word level, but which are in fact defined as phrase level. Consider the problem of segmenting an occurrence of the <mentioned> element as a word.

<mentioned>grandiloquent</mentioned>

The first of the following two encodings is legitimate; the second is not, since the <mentioned> element is not part of the content model of the <w> element:

<!-- This is all right. -->
<mentioned><w>grandiloquent</w></mentioned>

<!-- This is NOT all right! -->
<w><mentioned>grandiloquent</mentioned></w>

On the other hand, both of the following encodings are legitimate:

<mentioned><phr>grandiloquent speech</phr></mentioned>

<phr><mentioned>grandiloquent speech</mentioned></phr>

The first encoding describes the citing of a phrase. The second describes a phrase which consists of something mentioned.

The <w> and <m> elements carry additional attributes which may be of use in many indexing or analytic applications. The lemma attribute may be used to specify the lemma, that is, the head or base form of an inflected verb or noun, for example:

<s lang=LA>
  <w lemma=timeo>timeo</w>
  <w lemma=danai>Danaos</w>
  <w lemma=et>et</w>
  <w lemma=donum>dona</w>
  <w lemma=fero>ferentes</w>
</s>

Similarly, the baseform attribute may be specified for the <m> element, to indicate the `base form' of a transformed morpheme:

<w type=adjective>
  <m type=prefix baseform=con>com</m>
  <m type=root>fort</m>
  <m type=suffix>able</m>
</w>

The <w>, <m> and <c> elements can be used together to give a fairly detailed low-level grammatical analysis of text. For example, consider the following segmentation of the English S-unit `I didn't do it'.

<w>I</w>
<w>
 <w>did</w>
 <m>n't</m>
 </w>
<w>do</w>
<w>it</w>
<c>.</c>

This segmentation, crude as it is, succeeds in representing the idea that `did' occurs as a word inside the word `didn't'. A further advantage of segmenting the text down to this level is that it becomes relatively simple to associate each such segment with a more detailed formal analysis. This matter is taken up in detail in section 15.4.
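
The lemma and baseform attributes can be added to a segmentation of this kind. In the following sketch the attribute values are assumptions made for illustration (in particular, treating `not' as the base form of `n't'):

<w>I</w>
<w>
 <!-- lemma and baseform values are illustrative assumptions -->
 <w lemma=do>did</w>
 <m baseform=not>n't</m>
 </w>
<w lemma=do>do</w>
<w>it</w>
<c>.</c>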

The <s>, <cl>, <phr>, <w>, <m>, and <c> elements are formally declared as follows:

<!-- 15.1:  Linguistic Segment Categories                     -->
<!ELEMENT s             - -  (%phrase.seq)        -(s)          >
<!ATTLIST s                  %a.global;
                             %a.seg;                            >
<!ELEMENT cl            - -  (%phrase.seq)                      >
<!ATTLIST cl                 %a.global;
                             %a.seg;                            >
<!ELEMENT phr           - -  (%phrase.seq)                      >
<!ATTLIST phr                %a.global;
                             %a.seg;                            >
<!ELEMENT w             - -  ((#PCDATA | seg | w | m | c)*)     >
<!ATTLIST w                  %a.global;
                             %a.seg;
          lemma              CDATA               #IMPLIED       >
<!ELEMENT m             - -  ((#PCDATA | seg | c)*)             >
<!ATTLIST m                  %a.global;
                             %a.seg;
          baseform           CDATA               #IMPLIED       >
<!ELEMENT c             - -  (#PCDATA)                          >
<!ATTLIST c                  %a.global;
                             %a.seg;                            >
<!-- This fragment is used in sec. 15                         -->

15.2 Global Attributes for Simple Analyses

When the tag set described by this chapter is selected, an additional attribute is defined for all elements:

ana indicates one or more elements containing interpretations of the element on which the ana attribute appears.

The ana attribute may be specified for any SGML element. Its effect is to associate the element with one or more others representing an analysis or interpretation of it. Its target should be one of the elements described in section 15.3 below, or some other interpretative element such as <note>, on which see section 6.8, or <fs>, on which see chapter 16.

The ana attribute is formally declared as follows:

<!-- 15.2:  Global attribute for analysis                     -->
<!ENTITY % a.analysis '
          ana                IDREFS              #IMPLIED'      >
<!-- This fragment is used in sec. 15                         -->

15.3 Spans and Interpretations

The simplest mechanisms for attaching analytic notes in some structured vocabulary to particular passages of text are provided by the empty <span> and <interp> elements, and their associated grouping elements <spanGrp> and <interpGrp>.

<span> associates an interpretative annotation directly with a span of text. Attributes include:
- value identifies the specific phenomenon being annotated.
- from specifies the beginning of the passage being annotated; if not accompanied by a to attribute, then specifies the entire passage.
- to specifies the end of the passage being annotated.
<spanGrp> collects together <span> tags.
<interp> provides for an interpretative annotation which can be linked to a span of text. Attributes include:
- value identifies the specific phenomenon being annotated.
<interpGrp> collects together <interp> tags.

These elements are all members of the class interpret, and thus share the following attributes:

resp indicates who is responsible for the interpretation.
type indicates what kind of phenomenon is being noted in the passage. Sample values include:
- theme identifies a theme in the passage.
- allusion identifies an allusion to another text.
- character identifies a character associated with the passage.
- (discourse type) specifies that the passage is of a particular discourse type.
- image identifies an image in the passage.
inst points to instances of the analysis or interpretation represented by the current element.

The type and value attributes of the <span> and <interp> elements may be used to associate an interpretive name, type, and value with a specific stretch (or span) of text. In the case of the <span> element, the span of text being annotated is indicated by values of the from and to attributes, the value of each being a pointer. If the optional to attribute is omitted, the span consists just of the element pointed at by the obligatory from attribute. In the case of <interp> (see below), the span is indicated by a pointer from a <link> element or some similar mechanism. Here is an example of the <span> element.

<p id=MQp1s2p114>
<s id=MQp1s2p114s1>There was certainly a definite point
at which the thing began.</s>

<s id=MQp1s2p114s2>It was not; then it was suddenly
inescapable, and nothing could have frightened it
away.</s>

<s id=MQp1s2p114s3>There was a slow integration,
during which she, and the little animals, and the moving
grasses, and the sun-warmed trees, and the slopes of
shivering silvery mealies, and the great dome of blue
light overhead, and the stones of earth under her feet,
became one, shuddering together in a dissolution
of dancing atoms.</s>

<s id=MQp1s2p114s4>She felt the rivers under the ground
forcing themselves painfully along her veins,
swelling them out in an unbearable pressure; her flesh
was the earth, and suffered growth like a ferment; and
her eyes stared, fixed like the eye of the sun.</s>

<s id=MQp1s2p114s5>Not for one second longer (if the
terms for time apply) could she have borne it; but then,
with a sudden movement forwards and out, the whole
process stopped; and <emph rend=italic>that</emph> was
<soCalled rend=dquo>the moment</soCalled> which it was
impossible to remember afterwards.</s>

<span resp=DTL
      value='the moment'
      from=MQp1s2p114s3
      to=MQp1s2p114s5>

<s id=MQp1s2p114s6>For during that space of time (which
was timeless) she understood quite finally her
smallness, the unimportance of humanity.</s>

<!-- ... -->
</p>

The <span> element may, as in this example, be placed in the text near the textual span it is associated with, or it may be placed outside the text enclosed within a <spanGrp> element as follows.

<spanGrp resp=DTL>
  <span value='the moment' from=MQp1s2p114s3 to=MQp1s2p114s5>
</spanGrp>

As may be seen, the type attribute may be omitted in order to associate a span of text simply with a descriptive name.
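
Where a type is supplied and the passage consists of a single element, only the from attribute is needed. In the following sketch the interpretation itself (the type theme with the value shown) is a hypothetical example and is not taken from any published analysis of this passage:

<!-- hypothetical annotation: one s-unit, identified by from alone -->
<span type=theme
      value='union with nature'
      from=MQp1s2p114s4>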

Spans may also be used to represent the structural divisions assigned to the narrative by an interpreter. Consider the following narrative:

Sigmund, the son of Volsung, was a king in Frankish country. Sinfiotli was the eldest of his sons, the second was Helgi, the third Hamund. Borghild, Sigmund's wife, had a brother named ---- But Sinfiotli, her stepson, and ---- both wooed the same woman and Sinfiotli killed him over it. [ see note 91 ] And when he came home, Borghild asked him to go away, but Sigmund offered her weregild, and she was obliged to accept it. At the funeral feast Borghild was serving beer. She took poison, a big drinking horn full, and brought it to Sinfiotli. When Sinfiotli looked into the horn, he saw that poison was in it, and said to Sigmund ``This drink is cloudy, old man.'' Sigmund took the horn and drank it off. It is said that Sigmund was hardy and that poison did him no harm, inside or out. And all his sons could tolerate poison on their skin. Borghild brought another horn to Sinfiotli, and asked him to drink, and everything happened as before. And a third time she brought him a horn, and reproachful words as well, if he didn't drink from it. He spoke again to Sigmund as before. He said ``Filter it through your mustache, son!'' Sinfiotli drank it off and at once fell dead.

Sigmund carried him a long way in his arms and came to a long, narrow fjord, and there was a small boat there and a man in it. He offered to ferry Sigmund over the fjord. But when Sigmund carried the body out to the boat, it was fully laden. The man said Sigmund should go around the fjord inland. The man pushed the boat out and then suddenly vanished.

King Sigmund lived a long time in Denmark in the kingdom of Borghild, after he married her. Then he went south to Frankish lands, to the kingdom he had there. Then he married Hiordis, the daughter of King Eylimi. Their son was Sigurd. King Sigmund fell in a battle with the sons of Hunding. And then Hiordis married Alf, the son of King Hialprec. Sigurd grew up there as a boy.

Sigmund and all his sons were tall and outstanding in their strength, their growth, their intelligence, and their accomplishments. But Sigurd was the most outstanding of all, and everyone who knows about the old days says he was the most outstanding of men and the noblest of all the warrior kings.

A structural analysis of this text, dividing it into narrative units in a pattern shared with other texts from the same literature, might look like this:

<p id=P1>
<s id=S1>Sigmund ... was a king in Frankish country.</s>
<s id=S2>Sinfiotli was the eldest of his sons.</s>
<s id=S3>Borghild, Sigmund's wife, had a brother ... </s>
<s id=S4a>But Sinfiotli ... wooed the same woman</s>
<s id=S4b>and Sinfiotli killed him over it.</s>
<s id=S5>And when he came home, ... she was obliged to accept it.</s>
<s id=S6>At the funeral feast Borghild was serving beer.</s>
<s id=S7>She took poison ... and brought it to Sinfiotli.</s>
<!-- ... -->
<s id=S17>Sinfiotli drank it off and at once fell dead.</s>
<anchor id=nil1>
<p id=P2>Sigmund carried him a long way in his arms ... </p>
<p id=P3>King Sigmund lived a long time in Denmark ... </p>
<p id=P4>Sigmund and all his sons were tall ... </p>

<!-- ... -->

<span resp=TMA type='structural unit' value='introduction'
      from=S1 to=S3                                             >
<span resp=TMA type='structural unit' value='conflict'
      from=S4a                                                  >
<span resp=TMA type='structural unit' value='climax'
      from=S4b                                                  >
<span resp=TMA type='structural unit' value='revenge'
      from=S5 to=S17                                            >
<span resp=TMA type='structural unit' value='reconciliation'
      from=nil1                                                 >
<span resp=TMA type='structural unit' value='aftermath'
      from=P2 to=P4                                             >

Note the use of an empty <anchor> element to provide a target for the `reconciliation' unit which is normally part of the narrative pattern but which is not realized in the text shown.

If groups of <span> elements with the same resp or type are used, as in this example, they may be grouped together inside a <spanGrp> element, with the values of the common attribute(s) inherited from the higher element, as follows.

<spanGrp resp=TMA type='structural unit'>
 <span value='introduction'   from=S1   to=S3 >
 <span value='conflict'       from=S4a        >
 <span value='climax'         from=S4b        >
 <span value='revenge'        from=S5   to=S17>
 <span value='reconciliation' from=nil1       >
 <span value='aftermath'      from=P2   to=P4 >
 </spanGrp>

The same analysis may be expressed with the <interp> element instead of the <span> element. The <interp> element provides attributes for recording an interpretive category and its value, as well as the identity of the interpreter, but does not itself indicate which passage of text is being interpreted; the same interpretive structures can thus be associated with many passages of the text. The association between text passages and <interp> elements must be made either by pointing from the text to the <interp> element with the ana attribute defined in section 15.2, or by pointing at both text and interpretation from a <link> element, as described in chapter 14.

To encode the first example above using <interp>, it is necessary to create a text element which contains (or corresponds to) the third, fourth, and fifth orthographic sentences (S-units) in the paragraph. This can be done either with the <seg> element, described in section 14.3, or the <join> element, described in section 14.7. The resulting element can then be associated with the <interp> element using the ana attribute described in section 15.2. We illustrate using the <seg> element.

<p id=MQp1s2p114>
<s id=MQp1s2p114s1>There was certainly a definite point ... </s>
<s id=MQp1s2p114s2>It was not; then it was suddenly inescapable ... </s>
<seg id=MQp1s2p114s3-5 ana=moment>
<s id=MQp1s2p114s3>There was a slow integration ... </s>
<s id=MQp1s2p114s4>She felt the rivers under the ground ... </s>
<s id=MQp1s2p114s5>Not for one second longer ... </s>
</seg>
<s id=MQp1s2p114s6>For during that space of time ... </s>
<!-- ... -->
</p>
<!-- ... -->
<interp id=moment resp=DTL value='the moment'>
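
Since the ana attribute is declared with the value IDREFS (section 15.2), one element may point to several interpretations at once. The following sketch adds a second, hypothetical <interp> element; the identifier dissolve and its value are invented for illustration:

<seg id=MQp1s2p114s3-5 ana='moment dissolve'>
<!-- ... -->
</seg>
<!-- ... -->
<interp id=moment   resp=DTL value='the moment'>
<!-- hypothetical second interpretation -->
<interp id=dissolve type=theme value='dissolution'>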

The second example above can be recoded using <interp> and <interpGrp> tags in a similar manner. The interpretation itself can be expressed in an <interpGrp> element, which would replace the <spanGrp> in the example shown above:

<interpGrp resp=TMA type='structural unit'>
 <interp id=intro    value='introduction'   >
 <interp id=conflict value='conflict'       >
 <interp id=climax   value='climax'         >
 <interp id=revenge  value='revenge'        >
 <interp id=reconcil value='reconciliation' >
 <interp id=afterm   value='aftermath'      >
</interpGrp>

This <interpGrp> element would be linked to the text either by means of the ana attribute, or by means of <link> elements. Using the ana attribute (on <seg> elements introduced specifically for this purpose), the text would be encoded as follows:

<p id=P1>
<seg id=S1-S3 ana=intro>
<s id=S1>Sigmund ... was a king in Frankish country.</s>
<s id=S2>Sinfiotli was the eldest of his sons.</s>
<s id=S3>Borghild, Sigmund's wife, had a brother ... </s>
</seg>
<s id=S4a ana=conflict>But Sinfiotli ... wooed the same woman</s>
<s id=S4b ana=climax>and Sinfiotli killed him over it.</s>
<seg id=S5-S17 ana=revenge>
<s id=S5>And when he came home, ... she was obliged to accept it.</s>
<s id=S6>At the funeral feast Borghild was serving beer.</s>
<!-- ... -->
<s id=S17>Sinfiotli drank it off and at once fell dead.</s>
</seg>
<anchor id=nil1 ana=reconcil>
<p id=P2>Sigmund carried him a long way in his arms ... </p>
<p id=P3>King Sigmund lived a long time in Denmark ... </p>
<p id=P4>Sigmund and all his sons were tall ... </p>
<!-- ... -->
<join id=P2-P4 targets='P2 P3 P4' ana=afterm>

The linkage may also be accomplished using a <linkGrp> element, whose content is a set of <link> elements which point to each interpretive element and its corresponding text unit. This method does not require the use of the ana attribute on the text units.

<linkGrp resp=TMA targFunc='text interpretation'>
  <link targets='intro    S1-S3'>
  <link targets='conflict S4a'>
  <link targets='climax   S4b'>
  <link targets='revenge  S5-S17'>
  <link targets='reconcil nil1'>
  <link targets='afterm   P2-P4'>
</linkGrp>

One obvious advantage of using <interp> rather than <span> elements for the Sigmund text is that the <interp> elements can be reused for marking up other texts in the same document, whereas the <span> elements cannot. Another is that the <interp> element can be used to provide interpretations for discontinuous text elements (represented by <join> elements). On the other hand, the use of <interp> elements may require the creation of special text elements not otherwise needed (e.g. the <seg> and the <join> in the revised encoding of the text), whereas the use of <span> elements does not.
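
The reuse of <interp> elements might look as follows. The second narrative and its identifiers (T2P1, T2S1 and so on) are hypothetical, and its text is elided; only the ana values refer to the <interp> elements declared above:

<!-- hypothetical second text in the same document -->
<p id=T2P1>
<seg id=T2S1-S2 ana=intro>
<s id=T2S1> ... </s>
<s id=T2S2> ... </s>
</seg>
<s id=T2S3 ana=conflict> ... </s>
<s id=T2S4 ana=climax> ... </s>
<!-- ... -->
</p>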

The formal declarations for the <span>, <spanGrp>, <interp> and <interpGrp> elements are:

<!-- 15.3:  Spans                                             -->
<!ELEMENT span          - O  EMPTY                              >
<!ATTLIST span               %a.global;
                             %a.interpret;
          to                 IDREF               #IMPLIED
          value              CDATA               #REQUIRED
          from               IDREF               #REQUIRED      >
<!ELEMENT spanGrp       - -  ((span)*)                          >
<!ATTLIST spanGrp            %a.global;
                             %a.interpret;                      >
<!ELEMENT interp        - O  EMPTY                              >
<!ATTLIST interp             %a.global;
                             %a.interpret;
          value              CDATA               #REQUIRED      >
<!ELEMENT interpGrp     - -  ((interp)*)                        >
<!ATTLIST interpGrp          %a.global;
                             %a.interpret;                      >
<!-- This fragment is used in sec. 15                         -->

15.4 Linguistic Annotation

By linguistic annotation we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in this chapter and in chapters 6, 7, and the various chapters of Part III (on base tag sets). The contextual properties of a TEI text are fully documented in the TEI Header, which is discussed in chapter 5, and in section 23.2.

Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.

The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the <interpretation> element within the encoding description of the TEI Header, as described in section 5.3.3 . Where different parts of a language corpus have used different annotation methods, the decls attribute may be used to indicate the fact, as further discussed in section 23.3 .
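
By way of illustration only, such a statement might take roughly the following form. The wording of the paragraph is invented for this sketch, and the placement of <interpretation> within an <editorialDecl> element is assumed here; section 5.3.3 remains the authority on where the element may appear.

```sgml
<encodingDesc>
  <editorialDecl>
    <interpretation>
      <!-- hypothetical wording: adapt to the method actually used -->
      <p>Word-class codes were assigned automatically to every
         token and then checked by hand.</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>
```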

As one example of such types of analysis, consider the following sentence, taken from the Lancaster/IBM Treebank Project. [ see note 92 ] ``The victim's friends told police that Krueger drove into the quarry and never surfaced.''

Our discussion focuses on the way that this sentence might be analysed using the Claws system developed at the University of Lancaster, but exactly the same principles may be applied to a wide variety of other systems. [ see note 93 ] Output from the system consists of a segmented and tokenized version of the text, in which word class codes have been associated with each token. For our example sentence, we might conveniently represent these codes using entity references: [ see note 94 ]

<s>The&AT victim&NN1;'s&GEN friends&NN2 told&VVD police&NN2
that&CST Krueger&NP1  drove&VVD into&II the&AT
quarry&NN1  and&CC  never&RR surfaced&VVD;.&PUN
</s>

The names used for these entity references have some significance for the human reader (AT for article, NN1 for singular noun, NN2 for plural noun, etc.), but their representation in the output from an SGML system processing the document may be adjusted, by varying the entity declarations, to suit the convenience of whatever analytic software is to be used. For example, if the SGML parser operating on this sentence uses a set of entity declarations in the following form, then the word class tags will simply disappear from the output.

<!ENTITY AT "">
<!ENTITY NN1 "">
<!ENTITY GEN "">
<!-- ... -->

Alternatively, suppose the entity set in use has the following pattern:

<!ENTITY AT  "[definite article]">
<!ENTITY NN1 "[singular noun]">
<!ENTITY GEN "[genitive suffix]">
<!-- ... -->

Then the sample sentence will be processed by an SGML-aware processor as if it began:

The[definite article] victim[singular noun]'s[genitive suffix] ...

It would be more useful if the replacement text for each entity were a code of some significance to a particular analysis program. If the codes are considered to be atomic, then one of the mechanisms based on the <interp> element described in section 15.3 is sufficient. If the codes are considered to be compositional (for example, if NN1 and NN2 are held to have something in common, namely their noun-ness, which they do not share with, say, VVD), then this compositionality may be most clearly expressed using a mechanism based on the <fs> element defined in chapter 16. For a detailed example, see section 16.10.

One such replacement for the word-class entity references above is a set of empty <ptr> elements bearing target attributes as described in section 6.6 . The required entity definitions would look as follows.

<!ENTITY   AT   "<ptr target=AT>" >
<!ENTITY   NN1  "<ptr target=NN1>">
<!ENTITY   GEN  "<ptr target=GEN>">
<!-- ... -->

Then the text would be expanded to read:

<s>The<ptr target=AT> victim<ptr target=NN1>'s<ptr target=GEN>
friends<ptr target=NN2> told<ptr target=VVD> police
<ptr target=NN2> that<ptr target=CST> Krueger<ptr target=NP1>
drove<ptr target=VVD> into<ptr target=II> the<ptr target=AT>
quarry<ptr target=NN1> and<ptr target=CC> never<ptr target=RR>
surfaced<ptr target=VVD>.<ptr target=PUN>
</s>

The <ptr> elements are designed to point to elements with unique identifiers. But we have yet to specify what those elements are. Suppose we say that they are <interp> elements whose values are the same as their identifiers. That is, we provide an <interpGrp> element as follows:

<interpGrp type='word classes'>
  <interp id=AT  value=AT>
  <interp id=NN1 value=NN1>
  <interp id=GEN value=GEN>
  <!-- ... -->
</interpGrp>

Although it is common practice, this (or any similar) method of relating text to interpretation is seriously flawed. The interpretations are related not to text elements, but to points in the text, namely those that are occupied by the <ptr> elements. In order to relate the interpretation to the appropriate text units, a uniform convention needs to be applied; for example, that an interpretation relates to all the text material preceding the <ptr> element that points to it, back to the immediately preceding <ptr>, or back to the <s> that delimits the S-unit containing that <ptr> element, whichever is nearer. While this convention works with texts that are marked up solely with <ptr> elements that point to interpretation elements, it does not work with texts with additional markup, for example <ptr> elements that are used for some other purpose. In addition, the convention fails for any markup in which interpretations are intended to be associated with nested text elements.
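
To illustrate the last point, suppose (hypothetically) that an encoder tried to use the same device to mark ``The victim's friends'' as a noun phrase, by adding a pointer to an assumed <interp> element with the identifier N. The following is a sketch of that attempt, not a recommended encoding:

```sgml
<s>The<ptr target=AT> victim<ptr target=NN1>'s<ptr target=GEN>
friends<ptr target=NN2><ptr target=N> told<ptr target=VVD>
<!-- ... -->
</s>
```

Under the convention just described, the pointer to N governs only the material between itself and the immediately preceding <ptr>, which here is empty. The phrase-level interpretation therefore cannot be attached to the three words and the genitive suffix it is meant to cover.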

None of these difficulties arise if the text is fully segmented, using the linguistic segment elements described in section 15.1 , and the ana attribute to point to the interpretations that are associated with each such segment, as follows:

<s type=sentence>
<w ana=AT >The</w>
<w ana=NN1>victim</w>
<m ana=GEN>'s</m>
<w ana=NN2>friends</w>
<w ana=VVD>told</w>
<w ana=NN2>police</w>
<w ana=CST>that</w>
<w ana=NP1>Krueger</w>
<w ana=VVD>drove</w>
<w ana=II >into</w>
<w ana=AT >the</w>
<w ana=NN1>quarry</w>
<w ana=CC >and</w>
<w ana=RR >never</w>
<w ana=VVD>surfaced</w>
<c ana=PUN>.</c>
</s>

Analysis into phrase and clause elements can be superimposed on the word and morpheme tagging in the preceding illustration. For example, Claws provides the following constituent analysis of the sample sentence (the word class codes have been deleted): ``[N [G The victim's G] friends N] [V told [N police N] [Fn that [N Krueger N] [V [V& drove [P into [N the quarry N]P]V&] and [V+ never surfaced V+]V]Fn]V]''

Treating the labels on the brackets as phrase or clause interpretations, this analysis of the structure of the example sentence can be combined with the word class analysis and represented as follows (the symbol V& , representing the first part of a coordinate phrase, has been replaced by V1 , and V+ , representing the second part, has been replaced by V2 ).

<s type=sentence>
<phr ana=N>
  <phr ana=G>
    <w ana=AT>The</w>
    <w ana=NN1>victim</w>
    <m ana=GEN>'s</m>
    </phr>
  <w ana=NN2>friends</w>
  </phr>
<phr ana=V>
  <w ana=VVD>told</w>
  <phr ana=N><w ana=NN2>police</w></phr>
  <cl ana=Fn>
    <w ana=CST>that</w>
    <phr ana=N><w ana=NP1>Krueger</w></phr>
    <phr ana=V>
      <phr ana=V1>
        <w ana=VVD>drove</w>
        <phr ana=P>
          <w ana=II>into</w>
          <phr ana=N>
            <w ana=AT>the</w>
            <w ana=NN1>quarry</w>
            </phr>
          </phr>
        </phr>
      <w ana=CC>and</w>
      <phr ana=V2>
        <w ana=RR>never</w>
        <w ana=VVD>surfaced</w>
        </phr>
      </phr>
    </cl>
  </phr>
<c ana=PUN>.</c>
</s>

A representation using the <linkGrp> element can be obtained by supplying each linguistic segment with its own id attribute, removing its ana attribute, and putting each segment-interpretation pair into a <link> element inside the <linkGrp> element.
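
A sketch of such a representation for the first few words of the sample sentence follows. The identifiers W1, W2, M1 and W3 are invented for this illustration, and each <link> is assumed to pair a segment with one of the <interp> elements declared above by naming both identifiers in its targets attribute; chapter 14 should be consulted for the definitive description of <link> and <linkGrp>.

```sgml
<s type=sentence>
<w id=W1>The</w>
<w id=W2>victim</w>
<m id=M1>'s</m>
<w id=W3>friends</w>
<!-- ... -->
</s>

<!-- each link associates one segment with one interpretation -->
<linkGrp type='word classes'>
  <link targets='W1 AT'>
  <link targets='W2 NN1'>
  <link targets='M1 GEN'>
  <link targets='W3 NN2'>
  <!-- ... -->
</linkGrp>
```

Because the associations are held outside the text, the segments themselves carry no analytic attributes, and several independent sets of links may refer to the same segments.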

Each linguistic segment so far discussed has been well-behaved with respect to the basic document hierarchy, having only a single parent. Moreover, the segmentation has been complete, in that each part of the text is accounted for by some segment at each level of analysis, without discontinuities or overlap. This state of affairs does not of course apply in all types of analysis, and these Guidelines provide a number of mechanisms to support the representation of discontinuities or multiple analyses. A brief overview of these facilities is provided in chapter 31; see also chapter 14. These mechanisms all depend to a greater or lesser degree on the ability to associate a unique identifier with any element in a TEI-conformant text, and then to specify that identifier as the target of a pointing element of some kind.

The mechanisms proposed in this chapter may also be used to encode analyses of an entirely different kind, for example discourse function. Here is an application of the span technique to record details of a sales transaction in a spoken text.

<u who=P1 id=U1>Can I have ten oranges and a kilo of
    bananas please?</u>
<u who=P2 id=U2>Yes, anything else?</u>
<u who=P1 id=U3>No thanks.</u>
<u who=P2 id=U4>That'll be dollar forty.</u>
<u who=P1 id=U5>Two dollars</u>
<u who=P2 id=U6>Sixty, eighty, two dollars. Thank you.</u>

<spanGrp type=transactions>
  <span from=U1 value='sale request'>
  <span from=U2 to=U3 value='sale compliance'>
  <span from=U4 value='sale'>
  <span from=U5 value='purchase'>
  <span from=U6 value='purchase closure'>
  </spanGrp>
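
For comparison, the same analysis might be sketched with <interp> elements and the ana attribute, as follows. The identifiers REQ and COMP are invented for this illustration, and only the first three utterances are shown.

```sgml
<u who=P1 id=U1 ana=REQ>Can I have ten oranges and a kilo of
    bananas please?</u>
<u who=P2 id=U2 ana=COMP>Yes, anything else?</u>
<u who=P1 id=U3 ana=COMP>No thanks.</u>
<!-- ... -->

<interpGrp type=transactions>
  <interp id=REQ  value='sale request'>
  <interp id=COMP value='sale compliance'>
  <!-- ... -->
</interpGrp>
```

In this form U2 and U3 each point separately to the same interpretation. The fact that they jointly make up a single ``sale compliance'' is no longer explicit, as it is with a <span> running from U2 to U3, unless an additional element is introduced to group them.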

For further discussion of the <u> (utterance) element and other elements recommended for transcriptions of spoken language, see chapter 11.