Organization and Work of the Text Analysis and Interpretation Committee


D. Terence Langendoen, Chair

TEI AIR1

February 9, 1989


Department of Linguistics
University of Arizona
Tucson, AZ 85721




The Work of the Text Analysis and Interpretation Committee

The Text Analysis and Interpretation Committee has two major tasks:

  1. to enable the results of linguistic analysis to be incorporated into machine-readable texts;
  2. to enable the results of literary and stylistic analysis to be incorporated into machine-readable texts.

Schedule of the Committee's Work

During the first two years of drafting, the committee will concern itself primarily with the first of these tasks. However, some aspects of the second task that overlap with the first, such as metrical analysis of poetry, will be undertaken during the initial phase. Conversely, certain aspects of the first task, particularly those that deal with more complex linguistic issues, such as the representation of very complex grammatical structures and semantic representation generally, will not be undertaken during the initial phase.

The committee will begin its work on the first task by surveying the most common features of existing text-tagging schemes for representing grammatical structure and by determining what linguistic information is most commonly encoded in the analysis and tagging of existing corpora. It will then begin work on drafting standards for the representation of this information. It may be expected that information at the lexical level will be the initial focus of this work, including the delimitation of words where these are not explicitly delimited in the text (say, by spacing); [1] word-class (part of speech) information; and morphological analysis (analysis of the internal structure of words, including the arrangement of prefixes, suffixes and stems, and their interpretation).

Encoding Linguistic Analysis and Interpretation

Limiting ourselves for the moment to the consideration of texts written in standard orthographies, [2] the linguistic analysis of text deals with objects of various sizes, which are associated with various types of information, and which are linked with each other in a variety of ways.

Characters and Null Elements

Characters

The smallest textual objects would appear to be the individual characters: blanks, letters, punctuation marks, digits and special symbols. The problem of the orthographic interpretation of these characters will be dealt with by the Committee on Text Representation. However, other aspects of the interpretation of these characters and their interrelations, especially their relation to pronunciation, fall within the scope of this committee's work. One way of looking at the committee's task in this area is that it will establish standards for encoding phoneme-grapheme correspondences in whatever language a text happens to be encoded in.

Null Elements

In addition, it will be useful for at least some applications to provide markup conventions for elements that do not appear at all in the text itself, such as orthographically null prefixes and suffixes on words, and null words and phrases, as in the following English sentence.

In this example, some scholars would want to note the absence of a "relative pronoun" between the words woman and they, the absence of an explicit phrase immediately following the word meet, and the absence of a suffix on the word want. The existence of null affixes, words and phrases is not uncontroversial in linguistics, but enough practitioners countenance and make use of them that standards for encoding them should be established. Furthermore, even in the absence of theoretical justification for such elements, it would be useful to have the ability to encode them; for example, to enable someone to undertake a study of the relative frequency of explicit and "missing" relative pronouns in a particular text.

Words

Initial Goals

Some aspects of the committee's work on the linguistic analysis and interpretation of words have already been discussed in Schedule of the Committee's Work. In addition to these aspects, the committee will develop standards for the representation of other significant aspects of the structure of words such as the following:

Later Goals

The committee will also begin to undertake certain more complex tasks, among them the following:

Compound Words

The problem of analyzing compound words is similar to that of analyzing morphologically complex words, discussed above in Initial Goals. However, certain aspects of the problem deserve special attention, such as distinguishing between orthographic words (strings not containing blanks that appear between blanks or certain punctuation marks) and "linguistically significant" words. Thus, for example, the contracted form we'll is a single orthographic word which corresponds to two linguistically significant words we and will. Moreover, when such a word occurs in a sentence, as in We'll leave, it may be analyzed as containing a phrase boundary occurring between the two parts of the orthographic word (i.e., it may be analyzed as containing the phrases We and 'll leave). Thus, the ability to decompose a word like we'll into two component words is necessary for further phrasal analysis. [5] On the other hand, the two-word sequence machine readable may be considered, at least for certain purposes, a single linguistically significant word, which happens to have two words as parts. [6] Whatever encoding scheme is developed for marking compound words of this sort, it must be designed to enable the user of the marked up text to decide what counts as a word. For example, someone who is simply interested in the number of word tokens in a text would want machine readable to count as two words, whereas someone who is interested in the grammatical structure of the text may want it to be considered as one word. [7]
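
The requirement that the user of the marked-up text decide what counts as a word can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a proposed encoding: the `Span` record, its field names and the `count_words` function are hypothetical, and the two examples are the ones discussed above (we'll and machine readable).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """A stretch of text recorded under both notions of 'word' (hypothetical record)."""
    orthographic: List[str]  # strings delimited by blanks or punctuation
    linguistic: List[str]    # linguistically significant words


def count_words(spans: List[Span], view: str) -> int:
    """Count word tokens under the view chosen by the user of the text."""
    if view not in ("orthographic", "linguistic"):
        raise ValueError("view must be 'orthographic' or 'linguistic'")
    return sum(len(getattr(span, view)) for span in spans)


# One orthographic word corresponding to two linguistically significant words.
contraction = Span(orthographic=["we'll"], linguistic=["we", "will"])
# Two orthographic words corresponding to one linguistically significant word.
compound = Span(orthographic=["machine", "readable"], linguistic=["machine readable"])

print(count_words([contraction], "orthographic"), count_words([contraction], "linguistic"))
print(count_words([compound], "orthographic"), count_words([compound], "linguistic"))
```

Because both segmentations are stored side by side, a simple token count and a grammatical analysis can draw on the same marked-up text without either one overriding the other.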

Subcategorization and Selection (Argument Structure)

For the analysis of the grammatical properties of words, the committee will be helped by the work of the subcommittee on dictionary encoding (see The Organization of the Text Analysis and Interpretation Committee), which will be drawing up standards for the encoding of machine-readable dictionaries, since much of the grammatical information that it will be useful to encode in association with a particular word in a text will also appear in the dictionary entry for that word. [8] The question of the proper representation of subcategorizational and selectional information is a matter of great interest and controversy in contemporary linguistics, so it will be essential that the committee develop standards that scholars of various persuasions will be able to work with comfortably. The importance of this kind of information for grammatical analysis and interpretation is such that work should begin on this program as soon as possible, though it is not expected that the Committee will be finished with this task until the end of its first phase of work.

Phrases

The development of markup standards for the analysis of phrases, including sentences, poses perhaps the greatest challenge to this committee. Determining the order of priority of projects in this area will demand careful consideration, and what is proposed here should simply be considered an initial go at the problem.

Initial Goals

Initially, the committee will undertake to develop standards in the following areas of phrasal encoding:

Analysis of Uncontroversial Phrase Types

Linguists agree for the most part on what constitutes what might be considered the gross structure of phrases, and the committee will attempt to develop markup conventions for the encoding of such structure. Among the types of phrases to be dealt with in the first phase of the committee's work are clauses, including main and subordinate clauses; nominal phrases; adjectival phrases; and various kinds of modifying phrases.

Analysis of Idiomatic Phrases

In addition, markup standards for idiomatic phrases will be established, for which, again, the help of the dictionary encoding subcommittee will be valuable. With these in place, the committee will have achieved standards for gross syntactic coverage in the first phase of its work.

Analysis of Discontinuous Phrases

In this phase, it will also be important for the committee to come to grips with the encoding of discontinuous phrases, such as those highlighted in the following English examples.

Presumably, to indicate the parts of a discontinuous phrase, it will at least be necessary to assign an attribute to a marker associated with one part of the phrase that points to the other part.
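
The pointing mechanism just described might be sketched as follows in Python. This is a rough illustration only: the identifiers `p1` and `p2`, the attribute name `next`, and the verb-particle pair looked ... up are assumptions made for the sake of the sketch and are not taken from the committee's examples.

```python
from typing import Dict, List, Optional

# Each marked part carries an identifier, its text, and a hypothetical "next"
# attribute holding the identifier of the following part of the same phrase
# (None for the last part).
PARTS: Dict[str, dict] = {
    "p1": {"text": "looked", "next": "p2"},
    "p2": {"text": "up", "next": None},
}


def collect_parts(parts: Dict[str, dict], start: str) -> List[str]:
    """Follow the pointers from the first part and return the texts in order."""
    collected: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            raise ValueError("pointer cycle at part " + current)
        seen.add(current)
        collected.append(parts[current]["text"])
        current = parts[current]["next"]
    return collected


print(collect_parts(PARTS, "p1"))
```

A single pointer on the first part is the minimum the text calls for; pointing in both directions would make it possible to recover the phrase starting from either part, at the cost of redundant markup.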

Later Goals

Other problems to be addressed by the committee at some point during the first two years of work are the following:

Anaphoric Relations

Anaphoric relations, such as those that hold between nominal phrases and pronominal forms, and between full phrases of whatever type and reduced phrases, are of great interest. These relations are illustrated by the following examples.

The encoding of such relations would constitute an important step in the direction of explicitly representing the semantic structure of a text. Furthermore, whatever approach is taken to the encoding of intrasentential anaphoric relations must be generalizable to multisentence contexts, as in the following example.

Concord Relations

Grammatical concord among words and phrases is a familiar phenomenon of natural language; it holds in English between certain verb forms and their subjects, and between certain demonstrative adjectives and the nouns they modify, as illustrated by the following examples.

Notice that the full encoding of concord patterns requires more than simply registering (in examples like the foregoing) the grammatical number of the forms that agree. It also requires marking the fact that the form of the verb depends on the form of its subject, and that the form of the demonstrative adjective depends on the form of the noun it modifies.
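
The difference between registering a feature and registering a dependency can be shown with a small Python sketch. Everything in it is an assumption made for illustration: the `Token` and `ConcordLink` records, the feature values `sg` and `pl`, and the phrase these dogs bark, which is not one of the committee's examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    form: str
    number: str  # grammatical number: "sg" or "pl"


@dataclass(frozen=True)
class ConcordLink:
    dependent: int   # index of the token whose form is determined
    controller: int  # index of the token that determines it
    feature: str     # name of the agreeing feature, here "number"


def violations(tokens: List[Token], links: List[ConcordLink]) -> List[ConcordLink]:
    """Return the links whose dependent does not match its controller."""
    return [
        link for link in links
        if getattr(tokens[link.dependent], link.feature)
        != getattr(tokens[link.controller], link.feature)
    ]


tokens = [Token("these", "pl"), Token("dogs", "pl"), Token("bark", "pl")]
links = [
    ConcordLink(dependent=0, controller=1, feature="number"),  # demonstrative depends on noun
    ConcordLink(dependent=2, controller=1, feature="number"),  # verb depends on subject
]
print(violations(tokens, links))
```

The `number` values alone record which forms happen to agree; the links record which form depends on which, which is the additional information the paragraph above asks for.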

Global Dependencies

Many aspects of the informational structure of a text are indicated by the occurrence of grammatical elements that occur typically at the periphery of clauses. In English, they tend to occur at the beginning of clauses, as in the following example.

The Problem of Representing Ambiguous Text

One of the major problems that this committee will have to deal with is the representation of lexical and syntactic ambiguity. [10] A familiar example of an English sentence (in telegraphic style) which involves both lexical and syntactic ambiguity is:

In this example, the word ship can be considered either a subject noun or an imperative verb. If it is taken to be a noun, then the word sails is a verb; otherwise it is a direct-object noun.

In a particular text, the intended interpretation of an ambiguous expression can often be determined with certainty, in which case only the intended interpretation need be marked, unless one had in mind an application (such as the study of ambiguity itself) for which the alternative markings would be needed. In many cases, however, two or more alternative analyses would have to be marked, which may necessitate the repetition of parts of the text in order that the alternatives can be represented. In this case, we would need markup not only to indicate the grammatical structure, but also to show that the text has been repeated.
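
One way to picture alternative analyses is the following Python sketch, which uses only the two words discussed above, ship and sails. It is an assumption-laden illustration, not the committee's scheme: instead of repeating the text, it records each alternative as a separate set of labels over the same words, and the names `ALTERNATIVES` and `compatible` are hypothetical.

```python
from typing import Dict, List

# Each alternative analysis assigns one label to every ambiguous word.
ALTERNATIVES: List[Dict[str, str]] = [
    {"ship": "noun (subject)", "sails": "verb"},
    {"ship": "verb (imperative)", "sails": "noun (direct object)"},
]


def compatible(choice: Dict[str, str]) -> List[Dict[str, str]]:
    """Return the alternatives that agree with the labels already chosen."""
    return [
        alt for alt in ALTERNATIVES
        if all(alt.get(word) == label for word, label in choice.items())
    ]


# Fixing the reading of "ship" also fixes the reading of "sails".
for alt in compatible({"ship": "noun (subject)"}):
    print(alt["sails"])
```

Keeping the alternatives as whole analyses, rather than as independent lists of labels per word, preserves the fact that the choice for one word constrains the choice for the other.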

Encoding Literary Analysis and Interpretation

Since the task of encoding the results of literary analysis will not be undertaken in the first phase of the work of this committee, no further discussion of this task is included here. It is intended, however, that an outline of the work to be done by this committee on the encoding of information of literary significance, and of semantic and stylistic structure, will be completed during the first phase.

The Organization of the Text Analysis and Interpretation Committee

The Committee will consist of specialists who have knowledge about existing schemes of encoding of the results of linguistic and literary analysis, together with specialists on the various aspects of textual analysis, especially linguistic analysis. Thus, persons who are familiar with phonological and metrical structure, lexical structure, and grammatical structure will be included. Because of the importance of the structure of the lexicon in relation to the linguistic analysis of texts, specialists in the use of machine-readable dictionaries will also be included in the committee. Subcommittees representing the major areas in which standards need to be developed will be established, including subcommittees on

As the committee's work progresses, other subcommittees and perhaps also sub-subcommittees will be established.

Notes

[1] Indeed, the interpretation of blanks and other characters between strings of letters as word boundaries falls squarely within the purview of this committee.

[2] The proper encoding of texts written in phonetic transcription is one of the tasks of the Committee on Text Representation. Further markup of such texts for other purposes, however, does fall within the purview of this committee.

[3] For example, the English verb form broken can be analyzed as containing the verb root broke, which in turn is related to the basic form break.

[4] For example, the English word unpacked is ambiguous in both structure and interpretation. On one interpretation, it is the past-tense or past-participial form of the verb stem unpack, itself made up of the prefix un and the verb root pack. With this structure, the word could be used to describe a suitcase from which the contents have been removed. On the other interpretation, it is an adjective derived by prefixing un to the past-participial form of the verb stem pack. With this structure, the word could be used to describe a suitcase into which no contents have yet been put.

[5] See Phrases for further discussion of phrasal analysis.

[6] Note that the suffix able, while morphologically attached to the verb stem read, is understood as if attached to machine read.

[7] The converse problem to that posed by compounds like machine readable is posed by compounds like bellydancer, which it might be useful to consider for certain purposes as two-word sequences.

[8] For example, the encoding of the word chased as a form of the transitive verb chase in a particular text corresponds to the fact that the entry in an English dictionary for chase marks it as usable as a transitive verb.

[9] As this example shows, the parts of an idiomatic phrase may be separated by other textual material.

[10] A simple example of lexical ambiguity is given in [click here].


15 Nov 1997