Department of Linguistics
University of Arizona
Tucson, AZ 85721
The Text Analysis and Interpretation Committee has two major tasks:
During the first two years of drafting, the committee will concern itself primarily with the first of these tasks. However, some aspects of the second task that overlap with the first, such as metrical analysis of poetry, will be undertaken during the initial phase. Moreover, certain aspects of the first task, particularly those that deal with more complex linguistic issues, such as the representation of very complex grammatical structures, and of semantic representation generally, will not be undertaken during the initial phase.
The committee will begin its work on the first task by surveying the most common features of existing text-tagging schemes for representing grammatical structure and by determining what linguistic information is most commonly encoded in the analysis and tagging of existing corpora. It will then begin work on drafting standards for the representation of this information. It may be expected that information at the lexical level will be the initial focus of this work, including the delimitation of words where these are not explicitly delimited in the text (say, by spacing); [1] word-class (part of speech) information; and morphological analysis (analysis of the internal structure of words, including the arrangement of prefixes, suffixes and stems, and their interpretation).
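The kind of morphological analysis described above can be sketched in Python as a nested structure of prefixes, roots, and suffixes. This is only an illustration under stated assumptions, not a proposed standard: the class names `Morph` and `Stem`, the `kind` and `category` labels, and the bracket notation are all hypothetical. The word analyzed is unpacked, whose two structures are described in note [4].

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Morph:
    """A single morph. The kind labels are assumptions made for this sketch."""
    form: str
    kind: str  # "prefix", "root", or "suffix"


@dataclass
class Stem:
    """A unit built from morphs and/or smaller stems."""
    category: str  # word class of the resulting unit (hypothetical labels)
    parts: List[Union["Stem", Morph]]


def spell(node: Union[Stem, Morph]) -> str:
    """Concatenate the forms of all morphs under a node, left to right."""
    if isinstance(node, Morph):
        return node.form
    return "".join(spell(part) for part in node.parts)


def bracket(node: Union[Stem, Morph]) -> str:
    """Render the internal structure of a node as a labelless bracketing."""
    if isinstance(node, Morph):
        return node.form
    return "[" + " ".join(bracket(part) for part in node.parts) + "]"


# Reading 1: past form of the verb stem "unpack" (un + pack), plus "ed".
verb_reading = Stem("verb", [
    Stem("verb", [Morph("un", "prefix"), Morph("pack", "root")]),
    Morph("ed", "suffix"),
])

# Reading 2: "un" prefixed to the past-participial form "packed".
adjective_reading = Stem("adjective", [
    Morph("un", "prefix"),
    Stem("participle", [Morph("pack", "root"), Morph("ed", "suffix")]),
])

print(bracket(verb_reading))       # [[un pack] ed]
print(bracket(adjective_reading))  # [un [pack ed]]
```

Both structures spell out the same orthographic word, so an encoding of this general kind would have to record the arrangement of the parts and not merely list them.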
Limiting ourselves for the moment to the consideration of texts written in standard orthographies, [2] the linguistic analysis of text deals with objects of various sizes, which are associated with various types of information, and which are linked with each other in a variety of ways.
The smallest textual objects would appear to be the individual characters: blanks, letters, punctuation marks, digits and special symbols. The problem of the orthographic interpretation of these characters will be dealt with by the Committee on Text Representation. However, other aspects of the interpretation of these characters and their interrelations, especially their relation to pronunciation, fall within the scope of this committee's work. One way of looking at the committee's task in this area is that it will establish standards for encoding phoneme-grapheme correspondences in whatever language a text happens to be encoded in.
In addition, it will be useful for at least some applications to provide markup conventions for elements that do not appear at all in the text itself, such as orthographically null prefixes and suffixes on words, and null words and phrases, as in the following English sentence.
Some aspects of the committee's work on the linguistic analysis and interpretation of words have already been discussed in Schedule of the Committee's Work. In addition to these aspects, the committee will develop standards for the representation of other significant aspects of the structure of words, such as the following:
The committee will also begin to undertake certain more complex tasks, among them the following:
The problem of analyzing compound words is similar to that of analyzing morphologically complex words, discussed above in Initial Goals. However, certain aspects of the problem deserve special attention, such as distinguishing between orthographic words (strings not containing blanks that appear between blanks or certain punctuation marks) and "linguistically significant" words. Thus, for example, the contracted form we'll is a single orthographic word which corresponds to two linguistically significant words we and will. Moreover, when such a word occurs in a sentence, as in We'll leave, it may be analyzed as containing a phrase boundary occurring between the two parts of the orthographic word (i.e., it may be analyzed as containing the phrases We and 'll leave). Thus, the ability to decompose a word like we'll into two component words is necessary for further phrasal analysis. [5] On the other hand, the two-word sequence machine readable may be considered, at least for certain purposes, a single linguistically significant word, which happens to have two words as parts. [6] Whatever encoding scheme is developed for marking compound words of this sort, it must be designed to enable the user of the marked up text to decide what counts as a word. For example, someone who is simply interested in the number of word tokens in a text would want machine readable to count as two words, whereas someone who is interested in the grammatical structure of the text may want it to be considered as one word. [7]
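The requirement that the user be able to decide what counts as a word can be sketched as follows. The sketch assumes, purely for illustration, that each marked-up unit records both the orthographic words it spans and the linguistically significant words it contains; the class `Unit` and both counting functions are hypothetical names, not part of any existing scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    """One marked-up unit of text (a hypothetical record for this sketch)."""
    orthographic: List[str]  # blank-delimited strings covered by the unit
    linguistic: List[str]    # linguistically significant words it contains


def count_orthographic(units: List[Unit]) -> int:
    """Count word tokens as blank-delimited strings."""
    return sum(len(unit.orthographic) for unit in units)


def count_linguistic(units: List[Unit]) -> int:
    """Count word tokens as linguistically significant words."""
    return sum(len(unit.linguistic) for unit in units)


# One orthographic word corresponding to two linguistic words.
contraction = Unit(orthographic=["we'll"], linguistic=["we", "will"])

# Two orthographic words forming one linguistic word.
compound = Unit(orthographic=["machine", "readable"],
                linguistic=["machine readable"])

print(count_orthographic([contraction]), count_linguistic([contraction]))  # 1 2
print(count_orthographic([compound]), count_linguistic([compound]))        # 2 1
```

Because both views are kept side by side, neither kind of user is forced to accept the other's definition of a word.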
For the analysis of the grammatical properties of words, the committee will be helped by the work of the subcommittee on dictionary encoding (see The Organization of the Text Analysis and Interpretation Committee), which will be drawing up standards for the encoding of machine-readable dictionaries, since much of the grammatical information that it will be useful to encode in association with a particular word in a text will also appear in the dictionary entry for that word. [8] The question of the proper representation of subcategorizational and selectional information is a matter of great interest and controversy in contemporary linguistics, so it will be essential that the committee develop standards that scholars of various persuasions will be able to work with comfortably. The importance of this kind of information for grammatical analysis and interpretation is such that work should begin on this program as soon as possible, though it is not expected that the Committee will be finished with this task until the end of its first phase of work.
The development of markup standards for the analysis of phrases, including sentences, poses perhaps the greatest challenge to this committee. Determining the order of priority of projects in this area will demand careful consideration, and what is proposed here should be considered only an initial attempt at the problem.
Initially, the committee will undertake to develop standards in the following areas of phrasal encoding:
Linguists agree for the most part on what constitutes what might be considered the gross structure of phrases, and the committee will attempt to develop markup conventions for the encoding of such structure. Among the types of phrases to be dealt with in the first phase of the committee's work are clauses, including main and subordinate clauses; nominal phrases; adjectival phrases; and various kinds of modifying phrases.
In addition, markup standards for idiomatic phrases will be established, for which, again, the help of the dictionary encoding subcommittee will be valuable. With this level of coverage, the committee will have established standards for gross syntactic structure within the first phase of its work.
In this phase, it will also be important for the committee to come to grips with the encoding of discontinuous phrases, such as those highlighted in the following English examples.
Presumably, to indicate the parts of a discontinuous phrase, it will at least be necessary to assign an attribute to a marker associated with one part of the phrase that points to the other part.
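The pointing mechanism suggested here can be sketched in Python. The sketch assumes that every marked segment carries an identifier and, optionally, an attribute naming the segment that continues the same phrase. The names `Segment`, `ident`, and `next_part` are hypothetical, and the phrase look ... up in look the word up is an invented illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A marked stretch of text (a hypothetical record for this sketch)."""
    ident: str                       # identifier unique within the text
    text: str
    next_part: Optional[str] = None  # identifier of the continuation, if any


def reassemble(segments: List[Segment], start: str) -> List[str]:
    """Follow next_part pointers from a starting segment and collect the parts."""
    by_id = {segment.ident: segment for segment in segments}
    parts: List[str] = []
    seen = set()
    current: Optional[str] = start
    # The seen set guards against accidental pointer cycles.
    while current is not None and current not in seen:
        seen.add(current)
        segment = by_id[current]
        parts.append(segment.text)
        current = segment.next_part
    return parts


segments = [
    Segment("s1", "look", next_part="s3"),  # first part points to the second
    Segment("s2", "the word"),              # intervening material
    Segment("s3", "up"),
]

print(reassemble(segments, "s1"))  # ['look', 'up']
```

A single pointer on the first part is the minimum needed; a fuller scheme might also point back from the second part to the first.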
Other problems to be addressed by the committee at some point during the first two years of work are the following:
Anaphoric relations, such as those that hold between nominal phrases and pronominal forms, and between full phrases of whatever type and reduced phrases, are of great interest. These relations are illustrated by the following examples.
The encoding of such relations would constitute an important step in the direction of explicitly representing the semantic structure of a text. Furthermore, whatever approach is taken to the encoding of intrasentential anaphoric relations must be generalizable to multisentence contexts, as in the following example.
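One way to meet the requirement of generalizability can be sketched in Python, on the assumption that phrase identifiers are unique across the whole text and not merely within one sentence. The sentences used here (The committee revised its draft. It was circulated.) are invented for the sketch and are not the example referred to above; the names `Phrase`, `antecedent`, and `resolve` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phrase:
    """A marked phrase (a hypothetical record for this sketch)."""
    ident: str                        # unique across the whole text
    sentence: int                     # index of the containing sentence
    text: str
    antecedent: Optional[str] = None  # identifier of the antecedent, if any


def resolve(phrases: List[Phrase], ident: str) -> Phrase:
    """Follow antecedent links until a phrase with no antecedent is reached."""
    by_id = {phrase.ident: phrase for phrase in phrases}
    current = by_id[ident]
    seen = {current.ident}  # guards against accidental cycles
    while current.antecedent is not None and current.antecedent not in seen:
        current = by_id[current.antecedent]
        seen.add(current.ident)
    return current


phrases = [
    Phrase("p1", 0, "the committee"),
    Phrase("p2", 0, "its", antecedent="p1"),  # link within one sentence
    Phrase("p3", 0, "its draft"),
    Phrase("p4", 1, "it", antecedent="p3"),   # link across sentences
]

print(resolve(phrases, "p2").text)  # the committee
print(resolve(phrases, "p4").text)  # its draft
```

Because the links refer to identifiers and not to positions within a sentence, the same lookup serves intrasentential and multisentence contexts alike.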
Grammatical concord among words and phrases is a familiar phenomenon of natural language; it holds in English between certain verb forms and their subjects, and between certain demonstrative adjectives and the nouns they modify, as illustrated by the following examples.
Many aspects of the informational structure of a text are indicated by the occurrence of grammatical elements that occur typically at the periphery of clauses. In English, they tend to occur at the beginning of clauses, as in the following example.
One of the major problems that this committee will have to deal with is the representation of lexical and syntactic ambiguity. [10] A familiar example of an English sentence (in telegraphic style) which involves both lexical and syntactic ambiguity is:
In this example, the word ship can be considered either a subject noun or an imperative verb. If it is taken to be a noun, then the word sails is a verb; otherwise it is a direct-object noun.
In a particular text, the intended interpretation of an ambiguous expression can often be determined with certainty, in which case only the intended interpretation need be marked, unless one has in mind an application (such as the study of ambiguity itself) for which the alternative markings would be needed. In many cases, however, two or more alternative analyses would have to be marked, which may necessitate the repetition of parts of the text in order that the alternatives can be represented. In such cases, we would need markup not only to indicate the grammatical structure, but also to show that the text has been repeated.
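The marking of alternative analyses can be sketched in Python. The sketch assumes that an ambiguous stretch of text is stored once, together with a list of competing readings, each of which repeats the words with its own labels. The names `Reading` and `Alternation` and the label vocabulary are hypothetical; only the words ship and sails and their two analyses are taken from the discussion above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Reading:
    """One complete analysis: (form, word class, grammatical function)."""
    tokens: List[Tuple[str, str, str]]


@dataclass
class Alternation:
    """A stretch of text with its competing analyses (hypothetical record)."""
    text: str
    readings: List[Reading]

    def is_consistent(self) -> bool:
        """Check that every reading repeats exactly the same words."""
        return all(
            " ".join(form for form, _, _ in reading.tokens) == self.text
            for reading in self.readings
        )


ship_sails = Alternation(
    text="ship sails",
    readings=[
        # ship as subject noun, sails as verb
        Reading([("ship", "noun", "subject"), ("sails", "verb", "predicate")]),
        # ship as imperative verb, sails as direct-object noun
        Reading([("ship", "verb", "imperative"), ("sails", "noun", "direct object")]),
    ],
)

print(len(ship_sails.readings))    # 2
print(ship_sails.is_consistent())  # True
```

Grouping the readings under one record is what signals that the repeated words are alternatives for a single stretch of text and not separate occurrences.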
Since the task of encoding the results of literary analysis will not be undertaken in the first phase of the work of this committee, no further discussion of this task is included here. It is intended, however, that an outline of the work to be done by this committee on the encoding of information of literary significance, and of semantic and stylistic structure, will be completed during the first phase.
The Committee will consist of specialists who have knowledge about existing schemes of encoding of the results of linguistic and literary analysis, together with specialists on the various aspects of textual analysis, especially linguistic analysis. Thus, persons who are familiar with phonological and metrical structure, lexical structure, and grammatical structure will be included. Because of the importance of the structure of the lexicon in relation to the linguistic analysis of texts, specialists in the use of machine-readable dictionaries will also be included in the committee. Subcommittees representing the major areas in which standards need to be developed will be established, including subcommittees on
[1]
Indeed, the interpretation of blanks and other characters
between strings of letters as word boundaries falls
squarely within the purview of this committee.
[2]
The proper encoding of texts written in phonetic
transcription is one of the tasks of the Committee on
Text Representation.
Further markup of such texts for other purposes,
however, does fall within the purview of this committee.
[3]
For example, the English verb form broken
can be analyzed as containing the verb root broke,
which in turn is related to the basic form break.
[4]
For example, the English word unpacked
is ambiguous in both structure and interpretation.
On one interpretation, it is the past-tense or
past-participial form of the verb stem unpack,
itself made up of the prefix un and the verb root
pack.
With this structure, the word could be
used to describe a suitcase from which the contents have been
removed.
On the other interpretation,
it is an adjective derived by
prefixing un to the
past-participial form of the verb stem pack.
With this structure, the word could be
used to describe a suitcase into which no contents have
yet been put.
[5]
See Phrases for further
discussion of phrasal analysis.
[6]
Note that the suffix able,
while morphologically attached to the verb stem
read, is understood as if
attached to machine read.
[7]
The converse problem to that posed by compounds
like machine readable is posed by
compounds like bellydancer, which
it might be useful to consider for certain purposes
as two-word sequences.
[8]
For example, the encoding of the
word chased as a form of the
transitive verb chase in a particular
text corresponds to the fact that
the entry in an English dictionary for chase
marks it as usable as a transitive verb.
[9]
As this example shows, the parts of an idiomatic
phrase may be separated by other textual material.
[10]
A simple example of lexical ambiguity is given in
[click here].