Revised Workplan for the Text Analysis and Interpretation Committee
of the Text Encoding Initiative

Document Number: TEI AIR2

Department of Linguistics
University of Arizona
Tucson, AZ 85721

Version 2


CONTENTS

Goals of the Committee
Centrality of Linguistic and Stylistic Analysis
The Tasks to be Performed
   Survey of Existing Encoding Systems
   Subareas of Linguistic Analysis and Interpretation
      Phonology
      Morphology
      Syntax
      Discourse Analysis
      Semantics
   Dictionaries
Appendix A: Members of the Committee on Text Analysis and
   Interpretation


GOALS OF THE COMMITTEE

The goals of the Text Analysis and Interpretation Committee
(henceforth, the A&I Committee) are to determine what aspects of the
structure and interpretation of machine-readable texts it is now
feasible to encode and to recommend the form which that encoding
should take. Its recommendations should be applicable to any text in
any natural language that conforms to grammatical and stylistic
principles that the encoder can identify; in particular, the
recommendations should be applicable to most prose and poetry texts
written in a standard orthography, and to standardized transcriptions
of spoken language. In addition, because of the special status of
dictionaries of natural languages as repositories of information of
great importance for textual analysis and interpretation, the A&I
Committee will also make recommendations for the encoding of
machine-readable dictionaries.

CENTRALITY OF LINGUISTIC AND STYLISTIC ANALYSIS

The amount and variety of analysis that can be performed on and
interpretation that can be provided for even a very simple text are
enormous. The most fundamental aspects of the analysis and
interpretation of a text, however, are linguistic and stylistic in
nature, and other domain-specific analyses and interpretations (e.g.,
literary and philosophical ones) depend on the availability of the
results of linguistic and stylistic analysis. Hence, the first phase
of the A&I Committee's work will be devoted to determining what types
of linguistic and stylistic information should be encoded and to
developing encoding tags for representing that information.


THE TASKS TO BE PERFORMED

The A&I Committee will be organized into partially overlapping
subcommittees that will carry out the various tasks that need to be
done.

Survey of Existing Encoding Systems

A number of text-tagging systems currently exist which encode various
kinds of linguistic and stylistic information. Many members of the
committee either have been instrumental in developing one or another
of these systems or are knowledgeable in their use, including the
following.

   Name                  Affiliation
   ____                  ___________
   Robert Amsler         Bell Communications Research
   Bran Boguraev         IBM Research and Cambridge University
   Nicoletta Calzolari   Istituto di Linguistica Computazionale, Pisa
   Winfried Lenders      Institut für Kommunikationsforschung und
                         Phonetik, Universität Bonn
   Mitch Marcus          University of Pennsylvania
   Nelleke Oostdijk      TOSCA Workgroup, Universiteit Nijmegen
   Serge Perschke        EEC, Luxembourg
   Shana Poplack         University of Ottawa
   Gary Simons           Summer Institute of Linguistics Academic
                         Computing, Dallas
   Antonio Zampolli      Istituto di Linguistica Computazionale, Pisa

These members will determine what sorts of linguistic and stylistic
information have been successfully encoded by one or more of these
systems, how it has been done, and how it might be rendered in SGML.

Subareas of Linguistic Analysis and Interpretation

Linguistic analysis divides naturally into the following subareas
according to the type and size of textual units.

   * phonology
   * morphology
   * syntax
   * discourse analysis
   * semantics

For each of these areas, except the last two, subcommittees will
immediately be formed, each to be headed by a member of the A&I
Committee, with other members to be drawn from both the A&I Committee
and the list of those who have volunteered to serve on TEI
subcommittees. In certain cases, sub-subcommittees will also be formed
to deal with particular problems, some of which are noted below.
Others will be formed as the need arises. We discuss each of these
areas in turn.

Phonology:

Phonology encompasses the study of the sound patterns of natural
languages. The following two members of the A&I Committee, who are
highly regarded for their work on phonology, will serve on the
phonology subcommittee.

   Name               Affiliation
   ____               ___________
   Stephen Anderson   Johns Hopkins University
   William Poser      Stanford University

The fundamental units of phonology are the articulatory gestures and
their associated acoustic properties that, taken together, make up
the spoken language. These units combine in complex ways to form
phonological units of various sizes. The most familiar of these to
nonlinguists are the segmental units, or "phonemes", that correspond
roughly to the individual characters in alphabetic transcriptions; and
syllables, which typically consist of a vocalic phoneme (vowel) as
nucleus together with satellite consonantal phonemes. Larger units
include feet, consisting of an accented (stressed) syllable and
adjacent unaccented syllables, and phonological words and phrases.
Recommendations for the encoding of these and other phonological
entities, their properties and their relations with other linguistic
units, both phonological and nonphonological, will be the major task
of the phonology subcommittee.

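As a rough illustration of the nesting just described (phonemes within
syllables, syllables within feet), the following Python sketch models
the three levels as plain data structures. It is not an encoding
proposed by the subcommittee, and it is not SGML. The class names, the
field names and the segment labels for the sample word are all
assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    phonemes: List[str]     # segment labels, in order
    nucleus: int            # index into `phonemes` of the vocalic nucleus
    stressed: bool = False  # True for the accented syllable of a foot


@dataclass
class Foot:
    syllables: List[Syllable]

    def is_well_formed(self) -> bool:
        # Following the description above: exactly one accented
        # syllable, plus any number of unaccented ones.
        return sum(s.stressed for s in self.syllables) == 1

    def segments(self) -> List[str]:
        # Flatten the foot back into its sequence of phonemes.
        return [p for s in self.syllables for p in s.phonemes]


# Hypothetical analysis of English "happy" as one foot of two
# syllables; the segment labels are informal and illustrative only.
happy = Foot([
    Syllable(["HH", "AE"], nucleus=1, stressed=True),
    Syllable(["P", "IY"], nucleus=1),
])

print(happy.is_well_formed())
print(happy.segments())
```

Keeping the nucleus as an index rather than a separate field leaves
the segment sequence intact, so the same list can later be related to
the orthography or to other linguistic units.
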
Determining the relation between phonological entities and other
textual units may require the creation of a sub-subcommittee of the
phonology subcommittee. Among the problems that such a
sub-subcommittee could be asked to address are the following.

1. the relation between the phonemes of the language and the script
   in which the text is written

   For example, in English, the letter sequence _th_ is used to
   represent three different phonemes, in the words _thigh_, _thy_
   and _Thomas_. Moreover, it represents a sequence of two phonemes
   in the English word _hothouse_. The ability to distinguish among
   phonemes is necessary for a number of applications that many
   scholars would find useful, such as determining the relative
   frequency of various phonemes in extended text in a particular
   language or dialect.

2. the relation between phonological and poetic rhyme

3. the relation between metrical patterns in phonology and poetic
   analysis (scansion)

4. the representation of other poetically significant phonological
   patterns such as alliteration and assonance

5. the representation of sound symbolism

6. the relation between phonological words and phrases and
   syntactically defined words and phrases

   Problem 6 is one of several that the A&I Committee will face in
   which domains do not properly nest. Consider, for example, the
   well-known line in Figure 1.

      This is the cat that chased the rat that ate
      the cheese that lay in the house that Jack built.

      Figure 1

   The phonological phrasing of the example in Figure 1 is indicated
   by the slash marks in Figure 2.

      This is the cat / that chased the rat / that ate
      the cheese / that lay in the house that Jack built.

      Figure 2

   However, the syntactic phrasing of the example in Figure 1 is
   quite different. First, the phrases themselves are nested, rather
   than occurring serially as the phonological phrases presumably do.
   Second, the boundaries between the phrases occur at different
   points, as indicated by the parentheses in Figure 3.

      This is the (cat that chased the (rat that ate
      the (cheese that lay in the (house that Jack built)))).

      Figure 3

   Thus, the phonologically-defined phrase _that ate the cheese_ in
   Figure 2 does not correspond to any syntactically-defined phrase
   in Figure 3. Once the A&I Committee as a whole has a clear idea
   about how phonological and syntactic phrasing may be encoded in
   SGML, the matter of integrating them will be brought to the
   attention of the Metalanguage and Syntax Committee.

7. the relation between phonological structure and
   language-particular orthographic conventions, such as punctuation,
   font changes, etc.

Morphology:

Morphology is the study of word formation in natural languages.
Stephen Anderson of Johns Hopkins University, who has done pioneering
work in morphology, will head up this subcommittee. Among the tasks
that this subcommittee will undertake are the following.

1. delimiting the words of a text (lemmatization)

   Almost all machine-readable texts use spacing or some other
   delimiter to indicate word boundaries.
   However, it is appropriate to provide a tag set for explicitly
   indicating the function of spaces as word boundaries, since not
   every occurrence of a space in a text functions as a word
   boundary. Moreover, since in some texts not all words are
   explicitly delimited (indeed, none may be), it is necessary to
   provide for the tagging of word boundaries.

2. classifying the words of a text

   The assignment of the words of a text to word classes (parts of
   speech) is perhaps the most fundamental task that any system for
   encoding linguistic information must accomplish. The system of
   tagging that is developed must be flexible enough to permit
   multiple word-class assignments for ambiguous words and also to
   indicate the word class in case the context disambiguates it.

3. representing the internal structure of words

   This task includes tagging the morphologically significant parts
   of words, and annotating them according to their linguistic
   properties and relations. For example, consider the English word
   _rings_, which is analyzable either as the plural form of the noun
   _ring_ or as the third-person singular present-tense form of the
   verb _ring_. An adequate tagging system must provide the means for
   encoding both interpretations of this complex form, and for
   identifying the suffix _s_ in it as indicating plurality when
   attached to a noun and as indicating present tense associated with
   a third-person singular subject when attached to a verb. The task
   also includes representing the hierarchical structure of
   morphologically complex words, such as _unpacked_, which is
   analyzable either as the past-tense or past-participial form of
   the verb _unpack_ (which is itself morphologically complex), or as
   an adjective made up of the prefix _un_ attached to the
   past-participial form of the verb _pack_.

4. identifying the roots and dictionary entry forms of words in texts

   Inflected words, such as English _rings_ mentioned above in
   connection with task "3.
   representing the internal structure of words", contain the root
   _ring_, for which one may expect to find a dictionary entry. Since
   inflected forms of words are typically not entered as headwords in
   dictionaries, or are entered as such only for purposes of
   cross-reference, it is important to develop a tag set which would
   enable the encoder to identify the dictionary entry form of
   textually occurring words. In some cases, as for example with the
   English word _rang_, whose root form is _ring_, the root form is
   orthographically and phonologically distinct from the textually
   occurring form. If a machine-readable dictionary of the language
   is available along with the text, then information about the word
   (specifically the information contained in the dictionary entry
   for its root) would be accessible for further analysis of the
   text.

   Derived words, such as English _ringer_, often have dictionary
   entries separate from those of their roots; nevertheless, it is
   still useful to be able to tag their roots, since the study of the
   relations that hold between derived words and their roots is of
   considerable interest both to linguists and to other scholars.
   Finally, compound words, such as English _ringmaster_ and
   _ring finger_, consist of two or more roots (together possibly
   with other material). For such words, it is useful to be able to
   tag all of their roots.

5. relating orthographically and morphologically defined words

   This task divides into a number of subtasks, involving such
   problems as the analysis of compound words which may or may not
   appear in texts as separate words (cf. _ringmaster_ and
   _ring finger_ mentioned in connection with task "4. identifying
   the roots and dictionary entry forms"), and the analysis of
   contracted forms such as _she's_, which is a single orthographic
   word consisting of the parts _she_ and _'s_.
   The second of these parts is in turn related to the independently
   occurring word _is_ (itself an inflected form of the root _be_).
   The ability to encode orthographic (and phonological) words such
   as _she's_ as consisting of two parts which correspond to
   independent words is important for further syntactic and semantic
   analysis of larger expressions containing them. For example, in
   order to analyze a sentence like _She's leaving_ as consisting of
   the phrases _She_ and _'s leaving_, it is necessary to have access
   to the two morphologically significant parts of the word _She's_.

6. identifying special uses of words

7. identifying the speech register associated with a particular word
   or part of a word

8. identifying dialectal or usage variants of other words

9. identifying proper names, acronyms, abbreviations, etc.

   To some extent, the identification of proper names and acronyms
   also falls within the province of the syntax subcommittee, since
   many proper names are phrasal in form, and many acronyms actually
   represent phrases.

Syntax:

Syntax is the study of phrase formation in natural languages,
including sentences. The following members of the A&I Committee, who
are well known both as syntacticians and as computational linguists,
will be members of the syntax subcommittee.

   Name             Affiliation
   ____             ___________
   Mitch Marcus     University of Pennsylvania
   Hans Uszkoreit   Universität des Saarlandes

Mitch Marcus' contribution to the work of this subcommittee (and to
the work of several of the other subcommittees as well) is
particularly significant because of his current research project to
annotate millions of sentences with part-of-speech assignments,
skeletal syntactic parsings, intonational boundaries for spoken
language, and other forms of linguistic information that can be
encoded consistently and quickly.

The tasks of the syntax subcommittee include the following.

1.
   delimiting the phrases of a text

   Sentences are normally delimited in modern texts by initial
   capital letters and final punctuation. Since capitalization has a
   variety of functions in texts in most languages, its use as a
   sentence delimiter should be taggable.

2. representing the internal structure of phrases

   Among the types of phrases to be dealt with are clauses, including
   main and subordinate clauses; nominal phrases; adjectival phrases;
   and various kinds of modifying phrases.

3. distinguishing arguments and adjuncts

4. identifying idiomatic phrases

5. identifying figurative phrasing and phrases (the traditional
   figures of speech and figures of thought)

   The traditional figures of speech typically involve special
   phrasing possibilities such as inversion as illustrated by the
   example in Figure 4.

      Indictments do not a conviction make.

      Figure 4

   On the other hand, the traditional figures of thought, including
   simile and metaphor, typically involve the figurative
   interpretation of phrases, as illustrated by the example in
   Figure 5.

      You are my sunshine.

      Figure 5

6. representing "discontinuous" phrases

   Representing the structure of discontinuous phrases presents an
   interesting challenge to the committee, and will presumably
   require extensive use of the cross-referencing capabilities of
   SGML. Some cases of phrasal discontinuity in English are indicated
   by highlighting in the examples in Figure 6.

      a. You left a word out I wanted included.
                  ______     _________________
      b. We'll have to wait an hour and a half.
                            __      __________
      c. Can she sing?
         ___     ____
      d. I wonder who they think they are.
                  ___                 ___
      e. Advantage was taken of us.
         _________     ________

      Figure 6

7. identifying null phrases

   Example a in Figure 6 may be analyzed as lacking a "relative
   pronoun" between the words _out_ and _I_, and as having an empty
   phrase between the words _wanted_ and _included_, corresponding to
   the missing relative pronoun. The existence of null phrases is not
   uncontroversial in linguistics, but there are enough practitioners
   who countenance and make use of them that recommendations for
   encoding them should be made. Furthermore, even in the absence of
   theoretical justification for such elements, it would be useful to
   have the ability to encode them; for example, to enable someone to
   undertake a study of the relative frequency of explicit and
   "missing" relative pronouns in a particular text.

8. representing anaphoric relations between pronouns and their
   antecedents and between reduced phrases and their antecedents

   Anaphoric relations, such as those that hold between nominal
   phrases and pronominal forms, and between full phrases of whatever
   type and reduced phrases, are of great interest for both
   theoretical and practical purposes. The proper treatment of
   anaphoric relations has been at center stage in theoretical
   linguistic discussions for over a decade. During this same period,
   much effort has been made in developing natural-language
   processing systems to represent these relations, because texts
   cannot be understood properly without them. Indeed, when a passage
   is misunderstood, much of the time it is because the anaphoric
   relations the speaker or writer intended for it have not been
   correctly recovered. Typical anaphoric relations hold among the
   highlighted phrases in the examples in Figure 7.

      a. The women convinced one another that they were happy.
         _________           ___________      ____
      b. Joey rode on the carousel, but no one else did.
              ____________________                  ___

      Figure 7

   Example a in Figure 7 is in fact ambiguous, depending on whether
   _they_ is anaphorically connected to the phrase _The women_, and
   the markup recommendations that the A&I Committee develops must be
   able to represent these different possibilities.

9. identifying other grammatical dependencies among phrases, such as
   government, concord (agreement) and cross reference

   Grammatical concord among words and phrases is a familiar
   phenomenon of natural language; it holds in English between
   certain verb forms and their subjects, and between certain
   demonstrative adjectives and the nouns they modify, as illustrated
   by the highlighted phrases in the examples in Figure 8.

      a. A baby is sleeping.
         ______ __
      b. Babies are sleeping.
         ______ ___
      c. I like that shoe.
                ____ ____
      d. I like those shoes.
                _____ _____

      Figure 8

   The full encoding of concord patterns requires more than simply
   registering in examples like those in Figure 8 the grammatical
   number of the forms that agree. It also requires marking the fact
   that the form of the verb depends on the form of its subject, and
   that the form of the demonstrative adjective depends on the form
   of the noun it modifies.

10. representing ambiguous syntactic structures

   The problem of ambiguity arises at every level of the A&I
   Committee's work. However, it is perhaps the most severe in
   connection with syntactic representation. Consider the example
   given in Figure 9.

      The duck is ready to eat.

      Figure 9

   Conceptually, the simplest way to represent the two very distinct
   syntactic interpretations of this example is to provide two
   taggings for it, one for each interpretation. However, it may not
   be possible to do so effectively without repeating the sentence.
   Repeating the sentence, however, means altering the text, unless
   the repetition itself is tagged as not occurring in the text. But
   this gives primacy to the interpretation that is in the text. In
   case that interpretation is the one clearly intended by the
   author, this solution may not be problematic. (In such a case, in
   fact, the ambiguity can be ignored.) However, situations do arise
   where either the author's intentions about the interpretation of
   an ambiguous phrase are not clear (at least to the encoder) or the
   phrase is intended to be ambiguous. The problem of providing a
   satisfactory way of tagging ambiguous text is one that will have
   to be worked out by the A&I Committee working together with the
   Metalanguage and Syntax Committee.

Discourse Analysis:

Discourse analysis deals with the relations among the major phrases
(typically sentences) of a text. It will not be dealt with in the
first phase of the A&I Committee's work.

Semantics:

Semantics, which deals with the relation between the text and the
world, will also not be dealt with in this phase of the A&I
Committee's work, except where it comes up in connection with the
problems being dealt with by the morphology, syntax and dictionary
subcommittees. In later phases, as the A&I Committee examines
discipline-specific problems of encoding, issues in semantics will be
much more prominent than they are in this phase.

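Two problems raised above under phonology and syntax, the phrasings
that do not nest (Figures 2 and 3) and the ambiguous sentence that
should not be repeated (Figure 9), can both be pictured with
annotations that are kept apart from the text and point into it by
position. The Python sketch below illustrates that idea only. It is
not SGML and not a recommendation of the A&I Committee, which leaves
the solution to joint work with the Metalanguage and Syntax Committee.
The names Span, crosses, text_of and READINGS, and all labels, are
hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str   # illustrative label, not a proposed tag


def crosses(a: Span, b: Span) -> bool:
    """True if a and b overlap but neither contains the other."""
    return (a.start < b.start < a.end < b.end
            or b.start < a.start < b.end < a.end)


def text_of(tokens: List[str], span: Span) -> str:
    return " ".join(tokens[span.start:span.end])


# Figure 1, stored once as a token list.
FIG1 = ("This is the cat that chased the rat that ate the cheese "
        "that lay in the house that Jack built").split()

phon = Span(8, 12, "phonological-phrase")   # a phrase from Figure 2
synt = Span(11, 20, "syntactic-phrase")     # a phrase from Figure 3

print(text_of(FIG1, phon))
print(text_of(FIG1, synt))
print(crosses(phon, synt))

# Figure 9: one copy of the text, two alternative analyses over it.
FIG9 = "The duck is ready to eat .".split()
READINGS: Dict[str, List[Span]] = {
    "duck-eats": [Span(0, 2, "eater"), Span(5, 6, "verb")],
    "duck-is-eaten": [Span(0, 2, "thing-eaten"), Span(5, 6, "verb")],
}
for name, spans in READINGS.items():
    print(name, [(s.label, text_of(FIG9, s)) for s in spans])
```

Because neither reading is embedded in the text itself, neither is
given primacy, and the crossing check shows why the two phrasings of
Figure 1 cannot both be expressed as a single nested bracketing.
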
Dictionaries

The encoding of machine-readable dictionaries has been made a project
of the A&I Committee at the direction of the Steering Committee of the
Text Encoding Initiative. Accordingly, a dictionary subcommittee will
be set up, chaired by Robert Amsler; Bran Boguraev will also be a
member of this subcommittee. The basis for the work of this
subcommittee is the Dictionary Encoding Standard reported in _An
SGML-Based Standard for English Monolingual Dictionaries_ by Robert
Amsler and Frank Tompa, in _Fourth Annual Conference of the UW Centre
for the New Oxford English Dictionary: Information in Text.
Proceedings of the Conference, October 26-28, 1988, Waterloo, Canada_,
pp. 61-79. Since machine-readable dictionaries in principle can
contain much of the information that is needed to support text
analysis and interpretation, the development of compatible
recommendations for both dictionary and text encoding is highly
desirable.


Appendix A

Members of the Committee on Text Analysis and Interpretation

Names and addresses (postal and email) for the members of the Text
Analysis and Interpretation Committee of the Text Encoding Initiative.
Email addresses are given in internet form except as indicated.

Robert Amsler
   Bell Communications Research, 445 South St., MRE 2D-398,
   Morristown, NJ 07960-1961, USA, amsler@flash.bellcore.com

Stephen R. Anderson
   Cognitive Science Center (Ames Hall), The Johns Hopkins
   University, Baltimore, MD 21218 USA, anderson@sapir.cog.jhu.edu,
   anderson@cs.jhu.edu

Branimir Boguraev
   IBM T.J. Watson Research Center, P.O.
   Box 704, Yorktown Heights, NY 10598, USA, bkb@ibm.com

Nicoletta Calzolari
   Istituto di Linguistica Computazionale, Via Della Faggiola, Pisa,
   Italy, glottolo@icnucevm (bitnet)

Terence Langendoen (Committee Chair)
   Dept. of Linguistics, Douglass Bldg., Room 200E, University of
   Arizona, Tucson, AZ 85721, USA, langendt@arizvm1.ccit.arizona.edu,
   langendt@arizvm1 (bitnet)

Winfried Lenders
   Institut für Kommunikationsforschung und Phonetik, Universität
   Bonn, Poppelsdorfer Allee 47, D-5300 Bonn 1, Fed. Rep. of Germany,
   upk000@dbnrhrz1 (bitnet)

Mitch Marcus
   Dept. of Computer and Information Science, University of
   Pennsylvania, Philadelphia, PA 19104, USA,
   mitch@linc.cis.upenn.edu

Nelleke Oostdijk
   TOSCA Workgroup, Dept. of English, Universiteit Nijmegen, P.O. Box
   9103, 6500 HD Nijmegen, The Netherlands, u279103@hnykun11 (bitnet)

Serge Perschke
   Commission of the European Communities, DG XIII-B-3, Bâtiment Jean
   Monnet, Office B4-005, L-2920 Luxembourg

Shana Poplack
   Dept. of Linguistics, University of Ottawa, 78 Laurier Avenue
   East, Ottawa, ON K1N 6N5, Canada, sxpaf@uottawa (bitnet)

William Poser
   Dept. of Linguistics 2150, Stanford University, Stanford, CA
   94305, USA, poser@crystals.stanford.edu

Gary Simons
   SIL Academic Computing, 7500 W. Camp Wisdom Road, Dallas, TX
   75236, USA, convex!txsil!gary.uucp@arizona.edu

Hans Uszkoreit
   Universität des Saarlandes, Computerlinguistik, D-6500
   Saarbrücken 11, West Germany,
   hansu@sbsvax.informatik.uni-saarland.dbp.de

Antonio Zampolli
   Istituto di Linguistica Computazionale, Via Della Faggiola, Pisa,
   Italy, glottolo@icnucevm (bitnet)