Date: Tue, 5 Dec 89 08:30:05 CST
Sender: Text Encoding Initiative - Text Analysis and Interpretation Committee
From: Gary Simons

To: TEI-ANA distribution list
From: Gary Simons
Date: December 4, 1989

Appended is the document which Terry Langendoen referred to as SSS in his recent mailing. It was originally distributed to a subset of the TEI-ANA committee along with a photocopy of a chapter of the IT (Interlinear Text) system documentation on which it was based. At Terry's request I have revised the original document slightly to make it suitable for posting on its own.

------------------------------------------------------------

Text Encoding Initiative
Committee on Text Analysis and Interpretation

November 1, 1989
Revised December 4, 1989

PROPOSED FRAMEWORK FOR ENCODING ANALYSIS AND
INTERPRETATION OF RUNNING TEXT

Gary F. Simons
Summer Institute of Linguistics
7500 W. Camp Wisdom Rd.
Dallas, TX 75236

UUCP: ...!texbell!txsil!gary
Internet: gary@txsil.lonestar.org

This document proposes a framework for encoding the linguistic analysis and interpretation of running text (as opposed to the encoding of isolated displays used as examples in articles or books). This presentation assumes familiarity with the text encoding used in the IT (Interlinear Text) system developed by the Summer Institute of Linguistics (SIL). The MS-DOS version of IT is documented in "How to Use IT: a guide to interlinear text processing," by Gary F. Simons and Larry Versaw (SIL, Dallas, TX, 1987), and the Macintosh version in "How to Use IT: interlinear text processing on the Macintosh," by Gary F. Simons and John V. Thomson (Linguist's Software, Edmonds, WA, 1988).

IT uses backslash markers to tag kinds of information in an analyzed text and uses column alignment to encode the relationships between the elements in different tagged fields. This document suggests how the same kinds of information could be encoded in SGML.

The IT documentation illustrates many kinds of information that can be encoded in analyzed texts. The focus in this document is not on the variety of things that can be encoded (and for which we would eventually need to develop tags), but on a basic strategy for encoding text analysis in SGML (particularly where there is simultaneously an analysis of the text at different levels of structure).

Most of the discussion below gives the justification for what I am proposing and gives examples to illustrate. There are some specific proposals, however. Where these occur, I have drawn attention to them by setting them apart in an indented paragraph. I have also given each a unique reference number to facilitate future discussion of the proposals.

In the discussion below I use some of the terminology of the IT system. The baseline text is the original text being analyzed. The analyses added to the text are called annotations.

Word-level analysis

I will begin with a simple example (from page 2-19 of the IT manual) which involves only word-level annotation:

    \ot Ma  nau ku rikia kini  eri  ne  kuu   ka  bula
    \rt Ma  nau ku rikia kini  'eri ne  ku'u  ka  bula
    \wg and I   I  saw   woman that who drank she be drunk

Here, \ot stands for `original transcription,' \rt for `retranscription,' and \wg for `word gloss.'

In the interlinear presentation of text, the associations between words and their annotations are achieved by means of explicit manipulation of format. In a data interchange standard we need to get away from that, for at least two reasons: (1) the actual formatting (such as the size of the spaces between words) is dependent on the size and style of the fonts that are eventually used, and (2) software could not understand the structure of an interlinear text (and the relationships among its elements) without having a special purpose parser. To remedy this, the interchange standard should explicitly encode the structure of the interlinear text in its descriptive markup.
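
As a minimal sketch of what such a special purpose parser involves, the following Python fragment recovers one record per word from the sample above by slicing every line at the column where each baseline word starts. The function name `align_columns`, the record layout, and the exact column spacing of the sample lines are assumptions made for illustration. They are not taken from the IT software.

```python
import re

# Sample from the text, with column spacing assumed for illustration.
SAMPLE = [
    r"\ot Ma  nau ku rikia kini  eri  ne  kuu   ka  bula",
    r"\rt Ma  nau ku rikia kini  'eri ne  ku'u  ka  bula",
    r"\wg and I   I  saw   woman that who drank she be drunk",
]


def align_columns(lines):
    """Group column-aligned fields into one record per baseline word.

    The first line is treated as the baseline; the start column of each
    of its words defines the column boundaries for every other line.
    """
    baseline = lines[0]
    # Start offsets of all non-blank runs; the first run is the marker itself.
    starts = [m.start() for m in re.finditer(r"\S+", baseline)][1:]
    ends = starts[1:] + [None]  # the last column runs to the end of the line

    records = []
    for start, end in zip(starts, ends):
        record = {}
        for line in lines:
            marker = line.split(None, 1)[0].lstrip("\\")
            record[marker] = line[start:end].strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for record in align_columns(SAMPLE):
        print(record)
```

Each record gathers a baseline word with its annotations. The grouping depends entirely on character columns, so it breaks as soon as the spacing changes, which is the fragility that explicit descriptive markup is meant to remove.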

This leads to the first proposal:

    1. A markup element, say ... , is needed to enclose a word and all of its annotations in an interlinear text. Within this element, the base word and each of the annotations are also marked with a descriptive tag.

I will assume that the DTD specifies that end tags are optional for these lowest level tags. Following this scheme, the above example (keeping the same markup tags) would be encoded as:

    Ma Ma and nau nau I ku ku I rikia rikia saw kini kini woman
    eri 'eri that ne ne who kuu ku'u drank ka ka she bula bula
    be drunk

In this example, a space has been put after each data item. It is my feeling that the standard should ignore any white space following a data item within a word element. This flexibility is needed, especially as analysis structures get more complex, in order to make interchange formats human readable. Thus,

    2. Within the word element, all spaces and newlines preceding a markup tag are superfluous.

Thus the above sample means exactly the same thing as the following in which all the superfluous spaces are omitted (thus producing a file that is highly compact but much less human readable):

    MaMaand naunauI kukuI rikiarikiasaw kinikiniwoman eri'erithat
    nenewho kuuku'udrank kakashe bulabulabe drunk

Likewise, the above two encodings are equivalent to the following in which extra spaces are added to achieve a very human readable format:

    Ma Ma and nau nau I ku ku I rikia rikia saw kini kini woman
    eri 'eri that ne ne who kuu ku'u drank ka ka she bula bula
    be drunk

The following is still another possible encoding of the very same information:

    Ma Ma and nau nau I ku ku I rikia rikia saw etc.

The examples below which involve morpheme analysis within word analysis demonstrate the necessity for being able to treat newlines as well as spaces as superfluous.

I further propose, as shown in the above samples, that:

    3. Within the word element, the baseline text element should have a markup tag just as do all the annotations.

This means that the word tag will always be followed immediately by the baseline tag. This appears redundant, and one could argue that the baseline word should be the content of the word tag. I prefer to have the baseline separately tagged for the following reasons:

(1) The word element is more like a record of a database than an element with running text; in this perspective, the baseline word is a field of information as are all the annotations.

(2) The baseline text may be drawn from any of a number of domains -- phonetic transcription, phonemic transcription, standard orthography, historical orthography -- which happen to be among the domains that are possible for annotations as well. This proposal ensures that elements are tagged consistently, whether they are base forms or annotations.

(3) If the baseline were not tagged, then the meaning of the word tag would be variable depending on the kind of representation being used for the baseline. With the baseline separately tagged, the word tag always means exactly the same thing, namely, "This is a bundle of information about a word."

Morpheme-level analysis

Now we shift to a more complex example involving morpheme-level analysis within the word-level analysis. The example is from Eskimo, a highly agglutinative language. In interlinear format (from page 2-33 of the IT manual) the example looks like this:

    \tx Akutchilighmik-uvva          uqaaqtullangniaqtunga.
    \at akut    -chi-ligh-mik  =uvva uqaaqtu   -llang-niaq-tunga
    \mr akutuq  -si -liq -mik  =uvva uqaaqtuq  -llak -niaq-tunga
    \mg icecream-RSL-GER -s.MOD=now  tell story-DUR  -INT -1s.I
    \wg about making Eskimo icecream I am going to tell a story

The field codes in this case are \tx for `baseline text,' \at for `allomorphic transcription' (that is, with morph cuts indicated in the surface form of the word), \mr for `morphemic representation' (that is, underlying forms), \mg for `morpheme gloss,' and \wg for `word gloss.'

In this example we have the elements of the baseline text and the word gloss lines working at the word level. But the middle three lines are further subdivided into a morpheme level of analysis. We could ignore this and treat everything as a word-level annotation, as in

    Akutchilighmik-uvva akut-chi-ligh-mik=uvva akutuq-si-liq-mik=uvva icecream-RSL-GER-s.MOD=now about making Eskimo icecream
    uqaaqtullangniaqtunga. uqaaqtu-llang-niaq-tunga uqaaqtuq-llak-niaq-tunga tell story-DUR-INT-1s.I I am going to tell a story

Note that in this representation we have preserved the morpheme break characters in the data so that the human reader can reconstruct the relationships between the parts of the related lines, but we have not encoded those relationships explicitly in the markup. Without a special purpose parser, the morpheme subalignments are lost to computer software.

To handle the morphemic substructure in the markup, we need an element like the word element above, but which works at the morpheme level and can be embedded inside word elements. Thus,

    4. A markup element, say ... , is needed to enclose a morpheme and all of its annotations in an interlinear text.

Proposal 2 above, about superfluous newlines and spaces, applies as well within the scope of the morpheme element. So does proposal 3, which in this case means that the first item within a morpheme element should have a descriptive tag.

To translate the above interlinear example into descriptive SGML markup requires one more step. We must realize that the three lines which will be expressed at the morpheme level inside morpheme elements function together as a single element at the word level. In the same way that \wg provides a gloss for the whole word, the set of three lines provides a morphemic analysis for the whole word. We thus introduce a tag for `morphemic analysis' as a word-level annotation in which to embed the morpheme-level analyses. The descriptive markup of the Eskimo example then becomes something like:

    Akutchilighmik-uvva akut akutuq icecream -chi -si -RSL -ligh -liq -GER -mik -mik -s.MOD =uvva =uvva =now about making Eskimo icecream
    uqaaqtullangniaqtunga. uqaaqtu uqaaqtuq tell story -llang -llak -DUR -niaq -niaq -INT -tunga -tunga -1s.I I am going to tell a story

Note that this is just one possible way to use indentations and newlines to format the encoded information. By proposal 2 above, many others would be possible.

Above word level

Text analysis must also go above the word level. For instance, the free translations of conventional interlinear text analysis are annotations at a sentence or clause level. One could also perform analysis and annotation at a phrase level or at a paragraph level. Markup can be generalized at any level by doing what we did to embed morpheme analysis inside word level analysis. Thus,

    5. For each level at which annotation is to be done, a markup tag (along the lines of the word and morpheme tags introduced above) should be created. If a lower level of annotation is to be embedded within it, then a markup tag to embed the lower level analysis must also be created.

The second sentence refers to a tag like the morphemic analysis tag, which had to be created to embed morpheme analyses inside of word analyses.

Consider this example (adapted from page 2-64 of the IT manual):

    \ref BARU u 11
    \tx "Hata    ma  haia,
    \wg vine sp. and putty nut
    \lt "Hata vine and putty nut,
    \ft "There is plenty of hata vine and putty nut around,
    \bi Hata is a vine used for stitching together the planks of canoes. The nut of the haia tree (Parinarium glaberrimum) contains a putty-like substance used for caulking canoes.

In this example, \ref stands for `reference,' \tx for the baseline text, \wg for `word gloss,' \lt for `literal translation,' \ft for `free translation,' and \bi for `background information.'

For the purposes of text glossing, this particular text was divided into units of convenient size for translation (often more than a single clause but less than a sentence). Since the division does not correspond to a linguistic level like sentence or clause, we will simply call it `unit' and use a corresponding tag for this unit.

Within the unit, we have word-level analysis. Thus we need a tag to embed the lower level of analysis. We will use a tag for `word-by-word analysis.' The above would then be marked up as:

    11
    "Hata vine sp. ma and haia putty nut
    "Hata vine and putty nut,
    "There is plenty of hata vine and putty nut around,
    Hata is a vine used for stitching together the planks of canoes. The nut of the haia tree (Parinarium glaberrimum) contains a putty-like substance used for caulking canoes.

If the words exhibited complex morphology, then a morphemic analysis element would be further embedded within the word element.

Where next?

My purpose here has been to suggest a possible framework for the markup of analysis and interpretation in texts. If this framework is adopted, then it seems that the major task of the committee would be to devise a standard set of tags for all the kinds of information that could be encoded in analyzed texts. The IT manual proposes more than 30, but many more are possible. The following are some of the difficulties I foresee along the way:

(1) This kind of text is not amenable to the once-for-all DTD in the same way that something like an academic book or article (a la the AAP tag set) has proven to be. Two reasons come to mind. First, there is not a closed tag set. The imaginative analyst could invent new ways of interpreting and analyzing text for which appropriate standardized markup tags have not been devised. Second, each analyzed text potentially has a unique DTD rather than all analyzed texts sharing the same DTD. In a particular analyzed text, the analyst makes a selection of the kinds of information that will be included; the analyst should then be consistent in following that model throughout. The analyst's selection thus would ideally define a unique DTD for the text which consistently enforces the analyst's intention. We could probably live with a once-for-all general-purpose DTD, but it would permit sloppy work in which each element of the text could be annotated with a different selection of fields.

(2) The whole scheme presented above is optimal for situations where there is a one-to-one correspondence between all the analysis items at the same level. What do we do when there is not? For instance, what happens if one word in the baseline text is a contraction that needs to be represented as two words in the next line of the analysis? Or, what if two words in the baseline text are a single lexeme which should be represented as a single item in a later line of analysis? This problem gets even more difficult when the items involved are discontinuous.

(3) The example above shows only a traditional single-tier approach to morphology. Encoding of multiple-tier analyses will be tricky because the associations between elements on different tiers are typically not one-to-one.

(4) It is unclear to me how encoding the interpretation and analysis of a running text (which is what this working paper considers) should relate to the encoding of linguistic examples in a book or article. It seems pretty obvious that the encoding of interlinear examples can be covered by the same mechanism. For other kinds of examples, such as syntactic tree diagrams or the multi-tier displays common in phonology, it is not so clear. It seems that it would be possible to force these into the above model (and indeed we probably want this for the case where these kinds of analysis are used in annotating a running text). However, it also seems that it might be better to have a separate encoding scheme optimized for these special kinds of displays.