.im gmlgdoc .sr docfile = &sysfnam. ;.sr docversion = 'Draft' .im teigml .* Document proper begins. .********************************************************************** .* Draft of a revised workplan for the .* Text Analysis and Interpretation Committee of the .* Text Encoding Initiative. .* Draft of May 17, 1989. .* mechanical changes ( Revised Workplan for the Text Analysis and Interpretation Committee <author>D. Terence Langendoen, Chair <date>May 17, 1989 <docnum>TEI AIR2 <address> <aline>Department of Linguistics <aline>University of Arizona <aline>Tucson, AZ 85721 <eaddress> <etitlep> <toc> <body> <h1 id=goals>Goals of the Committee <p>The goals of the Text Analysis and Interpretation Committee (henceforth, the A&I Committee) are to determine what aspects of the structure and interpretation of machine-readable texts it is now feasible to encode and to recommend the form which that encoding should take. Its recommendations should be applicable to any text in any natural language that conforms to grammatical and stylistic principles that the encoder can identify; in particular, the recommendations should be applicable to most prose and poetry texts written in a standard orthography, and to standardized transcriptions of spoken language. In addition, because of the special status of dictionaries of natural languages as repositories of information of great importance for textual analysis and interpretation, the A&I Committee will also make recommendations for the encoding of machine-readable dictionaries. <h2 id=central>Centrality of Linguistic and Stylistic Analysis <p>The amount and variety of analysis that can be performed on even a very simple text, and of interpretation that can be provided for it, are enormous. 
The most fundamental aspects of the analysis and interpretation of a text, however, are linguistic and stylistic in nature, and other domain-specific analyses and interpretations (e.g., literary and philosophical ones) depend on the availability of the results of linguistic and stylistic analysis. Hence, the first phase of the A&I Committee's work will be devoted to determining what types of linguistic and stylistic information should be encoded and to developing encoding tags for representing that information. <h2 id=tasks>The Tasks to be Performed <p>The A&I Committee will be organized into partially overlapping subcommittees that will carry out the various tasks that need to be done. <h3 id=survey>Survey of Existing Encoding Systems <p>A number of text-tagging systems currently exist which encode various kinds of linguistic and stylistic information. Many members of the committee either have been instrumental in developing one or another of these systems or are knowledgeable in their use, including the following. <dl> <dthd>Name <ddhd>Affiliation <dt>Robert Amsler <dd>Bell Communications Research <dt>Bran Boguraev <dd>IBM Research and Cambridge University <dt>Nicoletta Calzolari <dd>Istituto di Linguistica Computazionale, Pisa <dt>Winfried Lenders <dd>Institut fuer Kommunikationsforschung und Phonetik, Universitaet Bonn <dt>Mitch Marcus <dd>University of Pennsylvania <dt>Nelleke Oostdijk <dd>TOSCA Workgroup, Universiteit Nijmegen <dt>Serge Perschke <dd>EEC, Luxembourg <dt>Shana Poplack <dd>University of Ottawa <dt>Gary Simons <dd>Summer Institute of Linguistics Academic Computing, Dallas <dt>Antonio Zampolli <dd>Istituto di Linguistica Computazionale, Pisa <edl> <p>These members will determine what sorts of linguistic and stylistic information have been successfully encoded by one or more of these systems, how it has been done, and how it might be rendered in SGML. 
<h3 id=subarea>Subareas of Linguistic Analysis and Interpretation <p>Linguistic analysis divides naturally into the following subareas according to the type and size of textual units. <ul> <li>phonology <li>morphology <li>syntax <li>discourse analysis <li>semantics <eul> <p>For each of these areas, except the last two, subcommittees will immediately be formed, each to be headed by a member of the A&I Committee, with other members to be drawn from both the A&I Committee and the list of those who have volunteered to serve on TEI subcommittees. In certain cases, sub-subcommittees will also be formed to deal with particular problems, some of which are noted below. Others will be formed as the need arises. We discuss each of these areas in turn. <h4 id=phonol>Phonology <p>Phonology encompasses the study of the sound patterns of natural languages. The following two members of the A&I Committee, who are highly regarded for their work on phonology, will serve on the phonology subcommittee. <dl> <dthd>Name <ddhd>Affiliation <dt>Stephen Anderson <dd>Johns Hopkins University <dt>William Poser <dd>Stanford University <edl> <p>The fundamental units of phonology are the articulatory gestures and their associated acoustic properties that, taken together, comprise the spoken language. These units combine in complex ways to form phonological units of various sizes. The most familiar of these to nonlinguists are the segmental units, or &oq.phonemes&cq., that correspond roughly to the individual characters in alphabetic transcriptions; and syllables, which typically consist of a vocalic phoneme (vowel) as their nucleus and satellite consonantal phonemes. Larger units include feet, consisting of an accented (stressed) syllable and adjacent unaccented syllables, and phonological words and phrases. 
Recommendations for the encoding of these and other phonological entities, their properties and their relations with other linguistic units, both phonological and nonphonological, will be the major task of the phonology subcommittee. <p>Determining the relation between phonological entities and other textual units may require the creation of a sub-subcommittee of the phonology subcommittee. Among the problems that such a sub-subcommittee could be asked to address are the following. <ol> <li>the relation between the phonemes of the language and the script in which the text is written. <p>For example, in English, the letter sequence <hp1>th<ehp1> is used to represent three different phonemes, in the words <hp1>thigh<ehp1>, <hp1>thy<ehp1> and <hp1>Thomas<ehp1>. Moreover, it represents a sequence of two phonemes in the English word <hp1>hothouse<ehp1>. The ability to distinguish among phonemes is necessary for a number of applications that many scholars would find useful, such as determining the relative frequency of various phonemes in extended text in a particular language or dialect. <li>the relation between phonological and poetic rhyme <li>the relation between metrical patterns in phonology and poetic analysis (scansion) <li>the representation of other poetically significant phonological patterns such as alliteration and assonance <li>the representation of sound symbolism <li id=wandp>the relation between phonological words and phrases and syntactically defined words and phrases. <pc>Problem <liref refid=wandp> is one of several that the A&I Committee will face in which domains do not properly nest. Consider, for example, the well-known line in <figref refid=jack1>. <fig id=jack1 place=inline frame=box> This is the cat that chased the rat that ate the cheese that lay in the house that Jack built. <figcap> <efig> <pc>The phonological phrasing of the example in <figref refid=jack1 page=no> is indicated by the slash marks in <figref refid=jack2>. 
<fig id=jack2 place=inline frame=box> This is the cat / that chased the rat / that ate the cheese / that lay in the house that Jack built. <figcap> <efig> <pc>However, the syntactic phrasing of the example in <figref refid=jack1 page=no> is quite different. First, the phrases themselves are nested, rather than occurring serially as the phonological phrases presumably do. Second, the boundaries between the phrases occur at different points, as indicated by the parentheses in <figref refid=jack3>. <fig id=jack3 place=inline frame=box> This is the (cat that chased the (rat that ate the (cheese that lay in the (house that Jack built)))). <figcap> <efig> <pc>Thus, the phonologically-defined phrase <hp1>that ate the cheese<ehp1> in <figref refid=jack2 page=no> does not correspond to any syntactically-defined phrase in <figref refid=jack3 page=no>. Once the A&I Committee as a whole has a clear idea about how phonological and syntactic phrasing may be encoded in SGML, the matter of integrating them will be brought to the attention of the Metatheory and Syntax Committee. <li>the relation between phonological structure and language-particular orthographic conventions, such as punctuation, font changes, etc. <eol> <h4 id=morphol>Morphology <p>Morphology is the study of word formation in natural languages. Stephen Anderson of Johns Hopkins University, who has done pioneering work in morphology, will head up this subcommittee. Among the tasks that this subcommittee will undertake are the following. <ol> <li>delimiting the words of a text (lemmatization) <p>Almost all machine-readable texts use spacing or some other delimiter to indicate word boundaries. However, it is appropriate to provide a tag set for explicitly indicating the function of spaces as word boundaries, since not every occurrence of a space in a text functions as a word boundary. 
Moreover, since in some texts, not all words are explicitly delimited (indeed, none may be), it is necessary to provide for the tagging of word boundaries. <li id=classw>classifying the words of a text <p>The assignment of the words of a text into word classes (parts of speech) is perhaps the most fundamental task that any system for encoding linguistic information must accomplish. The system of tagging that is developed must be flexible enough to permit multiple word-class assignments for ambiguous words and also to indicate the word class in case the context disambiguates it. <li id=isw>representing the internal structure of words <p>This task includes tagging the morphologically significant parts of words, and annotating them according to their linguistic properties and relations. For example, consider the English word <hp1>rings<ehp1>, which is analyzable either as the plural form of the noun <hp1>ring<ehp1> or as the third-person singular present-tense form of the verb <hp1>ring<ehp1>. An adequate tagging system must provide the means for encoding both interpretations of this complex form, and for identifying the suffix <hp1>s<ehp1> in it as indicating plurality when attached to a noun and as indicating present tense associated with a third-person singular subject when attached to a verb. The task also includes representing the hierarchical structure of morphologically complex words, such as <hp1>unpacked<ehp1>, which is analyzable either as the past-tense or past-participial form of the verb <hp1>unpack<ehp1> (which is itself morphologically complex), or as an adjective made up of the prefix <hp1>un<ehp1> attached to the past-participial form of the verb <hp1>pack<ehp1>. <li id=rootw>identifying the roots and dictionary entry forms of words in texts <p>Inflected words, such as English <hp1>rings<ehp1> mentioned above in connection with task <liref refid=isw page=no>, contain the root <hp1>ring<ehp1>, for which one may expect to find a dictionary entry. 
Since inflected forms of words are typically not entered as headwords in dictionaries, or are entered as such only for purposes of cross-reference, it is important to develop a tag set which would enable the encoder to identify the dictionary entry form of textually occurring words. In some cases, as for example with the English word <hp1>rang<ehp1>, whose root form is <hp1>ring<ehp1>, the root form is orthographically and phonologically distinct from the textually occurring form. If a machine-readable dictionary of the language is available along with the text, then information about the word (specifically the information contained in the dictionary entry for its root) would be accessible for further analysis of the text. <p>Derived words, such as English <hp1>ringer<ehp1>, often occur as dictionary entries separate from their roots; nevertheless, it is still useful to be able to tag their roots, since the study of the relations that hold between derived words and their roots is of considerable interest both to linguists and other scholars. Finally, compound words, such as English <hp1>ringmaster<ehp1> and <hp1>ring finger<ehp1> consist of two or more roots (together possibly with other material). For such words, it is useful to be able to tag all of their roots. <li>relating orthographically and morphologically defined words <p>This task divides into a number of subtasks, involving such problems as the analysis of compound words which may or may not appear in texts as separate words (cf. <hp1>ringmaster<ehp1> and <hp1>ring finger<ehp1> mentioned in connection with task <liref refid=rootw>), and the analysis of contracted forms such as <hp1>she's<ehp1>, which is a single orthographic word consisting of the parts <hp1>she<ehp1> and <hp1>'s<ehp1>. 
The second of these parts is in turn related to the independently occurring word <hp1>is<ehp1> (itself an inflected form of the root <hp1>be<ehp1>). The ability to encode orthographic (and phonological) words such as <hp1>she's<ehp1> as consisting of two parts which correspond to independent words is important for further syntactic and semantic analysis of larger expressions containing them. For example, in order to analyze a sentence like <hp1>She's leaving<ehp1> as consisting of the phrases <hp1>She<ehp1> and <hp1>'s leaving<ehp1>, it is necessary to have access to the two morphologically significant parts of the word <hp1>She's<ehp1>. <li id=specw>identifying special uses of words <li id=registw>identifying the speech register associated with a particular word or part of a word <li id=variw>identifying dialectal or usage variants of other words <li id=propn>identifying proper names, acronyms, abbreviations, etc. <p>To some extent, the identification of proper names and acronyms also falls within the province of the syntax subcommittee, since many proper names are phrasal in form, and many acronyms actually represent phrases. <eol> <h4 id=syntax>Syntax <p>Syntax is the study of phrase formation in natural languages, including sentences. The following members of the A&I Committee, who are well known both as syntacticians and as computational linguists, will be members of the syntax subcommittee. <dl> <dthd>Name <ddhd>Affiliation <dt>Mitch Marcus <dd>University of Pennsylvania <dt>Hans Uszkoreit <dd>Universitaet des Saarlandes <edl> <p>Mitch Marcus' contribution to the work of this subcommittee (and to the work of several of the other subcommittees as well) is particularly significant because of his current research project .* [Note to MSMcQ] the following passage is lifted .* from Don Walker's note of 5/5/89 to TEIHEADS. .* Should it be quoted, and if so, how should it be referred to? 
to annotate millions of sentences with part of speech assignments, skeletal syntactic parsings, intonational boundaries for spoken language, and other forms of linguistic information that can be encoded consistently and quickly. .* end of quotation from DW. <p>The tasks of the syntax subcommittee include the following. <ol> <li id=delimp>delimiting the phrases of a text <p>Sentences are normally delimited in modern texts by initial capital letters and final punctuation. Since capitalization has a variety of functions in texts in most languages, its use as a sentence delimiter should be taggable. <li id=isp>representing the internal structure of phrases. <p>Among the types of phrases to be dealt with are clauses, including main and subordinate clauses; nominal phrases; adjectival phrases; and various kinds of modifying phrases. <li id=argadj>distinguishing arguments and adjuncts <li id=idiomp>identifying idiomatic phrases <li id=figurep>identifying figurative phrasing and phrases (the traditional figures of speech and figures of thought) <p>The traditional figures of speech typically involve special phrasing possibilities such as inversion as illustrated by the example in <figref refid=figsp>. <fig id=figsp frame=box place=inline> Indictments do not a conviction make. <figcap> <efig> <pc>On the other hand, the traditional figures of thought, including simile and metaphor, typically involve the figurative interpretation of phrases, as illustrated by the example in <figref refid=figth>. <fig id=figth frame=box place=inline> You are my sunshine. <figcap> <efig> <li id=discp>representing &oq.discontinuous&cq. phrases <p>Representing the structure of discontinuous phrases presents an interesting challenge to the committee, and will presumably require extensive use of the cross-referencing capabilities of SGML. Some cases of phrasal discontinuity in English are indicated by highlighting in the examples in <figref refid=discex>. 
<fig id=discex frame=box place=inline> <ol compact> <li id=wordout> You left <hp2>a word<ehp2> out <hp2>I wanted included<ehp2>. <li>We'll have to wait <hp2>an<ehp2> hour <hp2>and a half<ehp2>. <li><hp2>Can<ehp2> she <hp2>sing<ehp2>? <li id=iwonder>I wonder <hp2>who<ehp2> they think they <hp2>are<ehp2>. <li><hp2>Advantage<ehp2> was <hp2>taken of<ehp2> us. <eol> <figcap> <efig> <li id=nullp>identifying null phrases <p>The example in <liref refid=wordout page=no> in <figref refid=discex page=no> may be analyzed as lacking a &oq.relative pronoun&cq. between the words <hp1>out<ehp1> and <hp1>I<ehp1>, and as having an empty phrase between the words <hp1>wanted<ehp1> and <hp1>included<ehp1>, corresponding to the missing relative pronoun. The existence of null phrases is not uncontroversial in linguistics, but there are enough practitioners who countenance and make use of them, that recommendations for encoding them should be made. Furthermore, even in the absence of theoretical justification for such elements, it would be useful to have the ability to encode them; for example, to enable someone to undertake a study of the relative frequency of explicit and &oq.missing&cq. relative pronouns in a particular text. <li>representing anaphoric relations between pronouns and their antecedents and between reduced phrases and their antecedents <p>Anaphoric relations, such as those that hold between nominal phrases and pronominal forms, and between full phrases of whatever type and reduced phrases, are of great interest for both theoretical and practical purposes. The proper treatment of anaphoric relations has been at center stage in theoretical linguistic discussions for over a decade. During this same period, much effort has been made in developing natural-language processing systems to represent these relations, because texts cannot be understood properly without them. 
Indeed, when a passage is misunderstood, much of the time it is because the anaphoric relations the speaker or writer intended for it have not been correctly recovered. Typical anaphoric relations hold among the highlighted phrases in the examples in <figref refid=anaph>. <fig id=anaph frame=box place=inline> <ol compact> <li id=anaph1><hp2>The women<ehp2> convinced <hp2>one another<ehp2> that <hp2>they<ehp2> were happy. <li>Joey <hp2>rode on the carousel<ehp2>, but no one else <hp2>did<ehp2>. <eol> <figcap> <efig> <pc>The example in <liref refid=anaph1 page=no> in <figref refid=anaph page=no> is in fact ambiguous, depending on whether <hp1>they<ehp1> is anaphorically connected to the phrase <hp1>The women<ehp1>, and the markup recommendations that the A&I Committee develops must be able to represent these different possibilities. <li>identifying other grammatical dependencies among phrases, such as government, concord (agreement) and cross reference <p>Grammatical concord among words and phrases is a familiar phenomenon of natural language; it holds in English between certain verb forms and their subjects, and between certain demonstrative adjectives and the nouns they modify, as illustrated by the highlighted phrases in the examples in <figref refid=concord>. <fig id=concord place=inline frame=box> <ol compact> <li><hp2>A baby<ehp2> <hp3>is<ehp3> sleeping. <li><hp2>Babies<ehp2> <hp3>are<ehp3> sleeping. <li>I like <hp2>that<ehp2> <hp3>shoe<ehp3>. <li>I like <hp2>those<ehp2> <hp3>shoes<ehp3>. <eol> <figcap> <efig> <p>The full encoding of concord patterns requires more than simply registering in examples like those in <figref refid=concord page=no> the grammatical number of the forms that agree. It also requires marking the fact that the form of the verb depends on the form of its subject, and that the form of the demonstrative adjective depends on the form of the noun it modifies. 
<li>representing ambiguous syntactic structures <p>The problem of ambiguity arises at every level of the A&I Committee's work. However, it is perhaps the most severe in connection with syntactic representation. Consider the example given in <figref refid=duck>. <fig id=duck place=inline frame=box> The duck is ready to eat. <figcap> <efig> <p>Conceptually, the simplest way to represent the two very distinct syntactic interpretations of this example is to provide two taggings for it, one for each interpretation. However, it may not be possible to do so effectively without repeating the sentence. Repeating the sentence, however, means altering the text, unless the repetition itself is tagged as not occurring in the text. But this gives primacy to the interpretation that is in the text. In case that interpretation is the one clearly intended by the author, this solution may not be problematic. (In such a case, in fact, the ambiguity can be ignored.) However, situations do arise where either the author's intentions about the interpretation of an ambiguous phrase are not clear (at least to the encoder) or the phrase is intended to be ambiguous. The problem of providing a satisfactory way of tagging ambiguous text is one that will have to be worked out by the A&I Committee working together with the Metatheory and Syntax Committee. <eol> <h4 id=discour>Discourse Analysis <p>Discourse analysis deals with the relations among the major phrases (typically sentences) of a text. It will not be dealt with in the first phase of the A&I Committee's work. <h4 id=semant>Semantics <p>Semantics, which deals with the relation between the text and the world, will also not be dealt with in this phase of the A&I Committee's work, except where it comes up in connection with the problems being dealt with by the morphology, syntax and dictionary subcommittees. 
In later phases, as the A&I Committee examines discipline-specific problems of encoding, issues in semantics will be much more prominent than they are in this phase. <h3 id=dict>Dictionaries <p>The encoding of machine-readable dictionaries has been made a project of the A&I Committee at the direction of the Steering Committee of the Text Encoding Initiative. Accordingly, a dictionary subcommittee will be set up, chaired by Robert Amsler; Bran Boguraev will also be a member of this subcommittee. The basis for the work of this subcommittee is the Dictionary Encoding Standard reported in <cit>An SGML-based Standard for English Monolingual Dictionaries<ecit> by Robert Amsler and Frank Tompa that appeared in REF, 1988. .* Michael, I need your help for this reference. .* The xerox copy I have doesn't give the source. .* I assume it's the proceedings of last fall's .* Waterloo conference, but I don't know how to refer to it. Since machine-readable dictionaries in principle can contain much of the information that is needed to support text analysis and interpretation, the development of compatible recommendations for both dictionary and text encoding is highly desirable. <egdoc>