Date: Wed, 20 Dec 89 10:51:57 EST
Sender: Text Encoding Initiative - Text Analysis and Interpretation Committee
From: Beatrice Santorini
Subject: for your comments and suggestions

Text Encoding Initiative
Committee on Text Analysis and Interpretation
Subcommittee on Syntax

December 19, 1989

Beatrice Santorini
Department of Computer Science
University of Pennsylvania
Philadelphia, PA 19104
beatrice@unagi.cis.upenn.edu

The following is an expanded version of the very brief summary that I sent out on December 14 of the issues the Syntax Subcommittee of the Text Encoding Initiative has been considering. It draws considerably on Terry Langendoen's initial document, which was circulated in the summer. I first address some issues that are central to syntactic markup, and then briefly discuss some concerns that are more general and intersect with the concerns of other subcommittees. This document focuses on substantive issues; questions of SGML implementation are as yet outside its scope. Again, please send comments and suggestions for improvements to beatrice@unagi.cis.upenn.edu.

SPECIFIC ISSUES

o Delimiting the sentences and phrases of a text

The most elementary task of the Syntax Subcommittee is to provide a standard way of delimiting the basic units of syntactic investigation: sentences and their constituent phrases. Sentences are normally delimited in modern texts by initial capital letters and final punctuation. Since both capitalization and punctuation have a variety of functions in texts in most languages, we need to be able to explicitly tag their use as sentence delimiters.

As far as the encoding of constituent phrases is concerned, it is not clear to me that we can or should give a list of possible constituent phrases. And although we might recommend that proposed phrase structures conform to certain formal conventions (e.g., no crossing branches), it is not clear that we want to prohibit people from expressing unorthodox phrase structures.
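
As an illustration of the no-crossing-branches convention mentioned above, the following minimal Python sketch represents each phrase as a labelled span over word positions and reports pairs of spans that overlap without one containing the other. The `Span` type, the category labels, and the half-open indexing are assumptions made for this example only; they are not a proposed encoding, and nothing here depends on SGML.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A phrase covering words[start:end] (half-open); hypothetical type."""
    label: str
    start: int
    end: int


def crossing_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return pairs of spans that overlap without either containing the other."""
    crossings = []
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            overlap = a.start < b.end and b.start < a.end
            a_in_b = b.start <= a.start and a.end <= b.end
            b_in_a = a.start <= b.start and b.end <= a.end
            if overlap and not (a_in_b or b_in_a):
                crossings.append((a, b))
    return crossings


words = "I gave him a book".split()

# A conventional, properly nested bracketing (labels are illustrative).
nested = [Span("S", 0, 5), Span("VP", 1, 5), Span("NP", 3, 5)]

# The same bracketing plus an unorthodox phrase "him a" that crosses "a book".
crossed = nested + [Span("X", 2, 4)]
```

Because the function only reports crossings rather than rejecting them, an unorthodox structure can still be recorded and simply flagged for review.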

o Representing functional dependencies

In addition to delimiting linear sequences, we must be able to represent the relationships among them and their constituents. We will need to be able to identify: (a) heads of phrases, (b) the argument structure of those heads, (c) the syntactic realization of the arguments, and (d) any adjuncts associated with the heads. For instance, we must be able to distinguish between two sentences as in (1).

(1) a. I gave him a book.
    b. I consider him a fool.

In (1a), `him' and `a book' are both arguments of the verb, whereas in (1b), `a fool' is predicated of `him'. How exactly all of this information is to be represented in SGML remains to be worked out. Again, it is not clear to me what conventions regarding phrase structure we are in a position to recommend. For instance, the idea that a phrase needs a lexical head, while uncontroversial in many cases, is debatable in the case, say, of clauses.

o Representing discontinuous dependencies

Representing the structure of discontinuous dependencies is a relatively spectacular case of the more general issue of representing functional dependencies and will presumably require extensive use of the cross-referencing capabilities of SGML. Some cases of phrasal discontinuity in English are indicated by bracketing in the examples in (2).

(2) a. You left [a word] out [I wanted included].
    b. We'll have to wait [an] hour [and a half].
    c. [Can] she [sing]?
    d. I wonder [who] they think they [are].
    e. [Advantage] was [taken of] us.

This task will be particularly important in German and Dutch, which obey the verb-second constraint and in which the sentence-initial constituent is often an element of the predicate.

o Identifying null constituents

The example in (2a) may be analyzed as missing a relative pronoun between the words `out' and `I', and as having an empty phrase between the words `wanted' and `included', corresponding to the missing relative pronoun.
The existence of null constituents is not uncontroversial in linguistics, but there are enough practitioners who countenance and make use of them that recommendations for encoding them should be made. The simplest and most attractive recommendation would be that null constituents may not have any special syntactic features or properties. However, this recommendation would rule out most versions of Government-Binding theory, since the null element PRO, unlike other nominal elements, is obligatorily ungoverned. We might instead encourage people to state explicitly their assumptions concerning null constituents if they make use of them.

The possibility of representing null constituents is particularly important in those European languages that do not require overt subjects. In some languages, like German, missing subjects are restricted to impersonal contexts, as in (3), but in others, like Italian, Spanish or Greek, missing subjects are possible in the same contexts as subjects denoting human agents, as shown in (4).

(3) Sonntags   wurde getanzt.                  (German)
    Sunday-gen was   danced
    `There was dancing on Sunday.'

(4) a. (Maria) vuole mangiare una mela.        (Italian)
               wants eat      an  apple
    b. (Maria) quiere comer una manzana.       (Spanish)
               wants  eat   an  apple
    c. (Maria) theli na   fai  ena milo.       (Greek)
               wants that eats an  apple
    `Maria/(s)he wants to eat an apple.'

Even in the absence of theoretical justification for null constituents, it is useful to have the ability to encode them; for example, to enable someone to undertake a study of the relative frequency of explicit and missing relative pronouns in a particular text, or to investigate the conditions governing the use of the overt pronoun in the examples in (5).

(5) a. Select six medium-size apples. Wash (them) carefully, peel (them), core (them), slice (them) thinly and sprinkle (them) with sugar and cinnamon.
    b. Do (*you) sit down. Don't (you) sit down.
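
The frequency study mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: a null constituent is modelled as a token whose `form` is `None`, the `Token` type and the category label `"RelPro"` are hypothetical, and the second noun phrase is a constructed variant of (2a) with an overt relative pronoun rather than attested data.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    form: Optional[str]   # None marks a null (unpronounced) constituent
    category: str         # hypothetical category label


def surface(tokens: List[Token]) -> str:
    """Recover the text as written: null constituents contribute nothing."""
    return " ".join(t.form for t in tokens if t.form is not None)


def relative_pronoun_counts(tokens: List[Token]) -> Counter:
    """Count overt versus null relative pronouns."""
    counts: Counter = Counter()
    for tok in tokens:
        if tok.category == "RelPro":
            counts["null" if tok.form is None else "overt"] += 1
    return counts


# Example (2a), with a null relative pronoun between `out' and `I'.
example_2a = [
    Token("You", "Pro"), Token("left", "V"), Token("a", "Det"),
    Token("word", "N"), Token("out", "Prt"), Token(None, "RelPro"),
    Token("I", "Pro"), Token("wanted", "V"), Token("included", "V"),
]

# Constructed variant with an overt relative pronoun (not from any corpus).
variant = [
    Token("a", "Det"), Token("word", "N"), Token("which", "RelPro"),
    Token("I", "Pro"), Token("wanted", "V"), Token("included", "V"),
]

counts = relative_pronoun_counts(example_2a + variant)
```

Keeping the null element as a separate record with no form means that it can be counted or ignored without altering the text itself.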

o Representing anaphoric relations between pronouns and their antecedents and between reduced phrases and their antecedents

Anaphoric relations, such as those that hold between nominal phrases and pronominal forms, and between full phrases of whatever type and reduced phrases, are of great interest for both theoretical and practical purposes. The proper treatment of anaphoric relations has been at center stage in theoretical linguistic discussions for over a decade. During this same period, much effort has been made in developing natural-language processing systems to represent these relations, because texts cannot be understood properly without them. Indeed, when a passage is misunderstood, much of the time it is because the anaphoric relations the speaker or writer intended for it have not been correctly recovered. Typical anaphoric relations hold among the bracketed phrases in the examples in (6).

(6) a. [The women] convinced [one another] that [they] were happy.
    b. Joey [rode on the carousel], but no-one else [did].

The example in (6a) is in fact ambiguous, depending on whether `they' is anaphorically connected to the phrase `The women', and the markup recommendations that the A&I Committee develops must be able to represent the anaphoric alternatives that are available.

An important question is which representation of the text the possibilities of anaphora should be keyed to. For most purposes in syntax, it is probably sufficient to be able to associate each phrase in a sentence with a list of possible and impossible indices, but we also need to allow people to express anaphora with respect to the elements of a full discourse model.
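
The idea of associating each phrase with possible and impossible indices can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the `Phrase` type, the choice of index 1 for the women and index 2 for some other group in the discourse, and the decision to enumerate readings by taking one allowed index per phrase.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterator, List, NamedTuple


class Phrase(NamedTuple):
    text: str
    possible: FrozenSet[int]     # indices the phrase may bear
    impossible: FrozenSet[int]   # indices explicitly excluded


def readings(phrases: List[Phrase]) -> Iterator[Dict[str, int]]:
    """Yield every assignment of one allowed index to each phrase."""
    allowed = [sorted(p.possible - p.impossible) for p in phrases]
    for combo in product(*allowed):
        yield {p.text: idx for p, idx in zip(phrases, combo)}


# Example (6a). Index 1 = the women; index 2 = another group (assumed).
sentence_6a = [
    Phrase("The women", frozenset({1}), frozenset({2})),
    Phrase("one another", frozenset({1}), frozenset({2})),
    Phrase("they", frozenset({1, 2}), frozenset()),
]

all_readings = list(readings(sentence_6a))
```

Under these assumptions the sketch yields two readings for (6a), differing only in the index assigned to `they'. A full discourse model would need richer objects than integers.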

o Identifying other grammatical dependencies among phrases, such as government (often realized as overt case), concord (agreement) and cross reference

Grammatical concord among words and phrases is a familiar phenomenon of natural language; it holds in English between certain verb forms and their subjects, and between certain demonstrative adjectives and the nouns they modify, as illustrated by the bracketed phrases in the examples in (7).

(7) a. [The baby] [is] sleeping.
    b. [Babies] [are] sleeping.
    c. I like [that] [shoe].
    d. I like [those] [shoes].

The full encoding of concord patterns requires more than simply registering in examples like those in (7) the grammatical number of the forms that agree. It also requires marking the fact that the form of the verb depends on the form of its subject, and that the form of the demonstrative adjective depends on the form of the noun it modifies. Furthermore, while in general we might want to enforce subject-verb agreement, we must allow non-agreement and be able to mark it as exceptional in cases like (8).

(8) A bunch of papers were/*was lying all over the room.
    Twelve pages is/*are short for a thesis.
    There's/There are problems with every solution.

o Representing ambiguous syntactic structures

The problem of ambiguity arises at every level of the A&I Committee's work. However, it is perhaps the most severe in connection with syntactic representation. Consider the example given in (9).

(9) The duck is ready to eat.

Conceptually, the simplest way to represent the two very distinct syntactic interpretations of this example is to provide two taggings for it, one for each interpretation. However, it may not be possible to do so effectively without repeating the sentence. Repeating the sentence, though, means altering the text, so we need to be able to tag the repetition as not occurring in the text.
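
One way to picture two taggings of a single, unrepeated sentence is to store the words once and let each analysis refer to them by position. The Python sketch below illustrates this for (9); it is only an illustration, and the `Relation` type, the role labels, and the reading names are all hypothetical. How such cross-references would be expressed in SGML is a separate question.

```python
from typing import Dict, List, NamedTuple, Tuple


class Relation(NamedTuple):
    role: str        # hypothetical role label
    head: int        # position of the head word in TOKENS
    dependent: int   # position of the dependent word in TOKENS


# The text of (9) is stored exactly once.
TOKENS: Tuple[str, ...] = ("The", "duck", "is", "ready", "to", "eat")

# Each reading is a separate set of relations over the same positions.
ANALYSES: Dict[str, List[Relation]] = {
    "duck-eats": [Relation("subject", head=5, dependent=1)],
    "duck-is-eaten": [Relation("object", head=5, dependent=1)],
}


def describe(reading: str) -> List[str]:
    """Spell out the relations of one reading in words."""
    return [
        f"{TOKENS[r.dependent]} = {r.role} of {TOKENS[r.head]}"
        for r in ANALYSES[reading]
    ]
```

For instance, `describe("duck-eats")` and `describe("duck-is-eaten")` report different roles for `duck' while `TOKENS` itself is never duplicated.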

Two particularly important subcases of ambiguity phenomena concern ambiguities that arise (a) from the scope of conjunctions and (b) from different attachment possibilities of prepositional phrase modifiers. A further important case of ambiguity arises in connection with the common causative construction illustrated for German in (10).

(10) Er laesst den     Hans anrufen.
     he lets   the-acc Hans call-up

The construction in (10) is ambiguous between an active reading of the infinitival complement (`He is having Hans call up') and a more usual passive reading, in which the agent is not expressed (`He is having someone call up Hans').

o Mismatch and reanalysis phenomena

Mismatches between syntax and other levels of description are fairly common and must be handled easily. In addition to mismatches between syntax and prosody of the sort illustrated by `This is the house that Jack built', there are mismatches between syntax and morphology, as in the case of contractions and forms like `gonna' and `wanna'. In many European languages other than English, causative constructions involve syntax-morphology mismatches in which a causative verb and its complement behave for some purposes as if they were two words and for others as if they were one. From a notational point of view, these phenomena are all variants of the ambiguity problem, since in every case a single string is to be associated with more than one representation.

GENERAL ISSUES

o Underspecification

The encoding standards should not require providing a full, disambiguated structure, since it is often desirable to represent underspecified analyses. For instance, in phrases like `ten years ago', we might not want to be forced to specify the part of speech of `ago'. While it is perhaps best treated as a postposition, one might neither want to introduce the category of postposition in English nor (given its position in the phrase) to classify `ago' as a preposition.
Nevertheless, one might want to be able to treat the entire phrase as a prepositional phrase headed by `ago'. Cf. also `to dance the whole night through'.

A similar example concerns the collocation `near NP', where one might want to remain agnostic about the part of speech of `near'. On the one hand, it seems to be an adjective or adverb, since it can be modified by the degree adverb `very' and allows the comparative and superlative forms `nearer/nearest NP', but on the other hand, it is acting as the head of what seems to be a prepositional phrase, as evidenced by the fact that it can occur in the locative complement of `put'.

The above examples involve two related but distinct types of underspecification: (a) words that are not assigned parts of speech and (b) phrases of a specified type with categorially unspecified heads. We also need to be able to represent the converse of (b)---categorially unspecified phrases with categorially specified heads---in order to accommodate the so-called "small clause" analysis of examples as in (11).

(11) That makes me sad.
     I want him off the boat.

Under this analysis, the oblique noun phrase and the predicate phrase following it form a constituent whose head is the head of the predicate phrase but whose own categorial identity may remain unspecified, as shown in (12).

(12) That makes [? me [Adj sad]].
     I want [? him [[Prep off] the boat]].

o Tokenization issues

While proper tokenization falls mainly within the domain of morphology, tokenization issues involving the interpretation of hyphens and blankspace arise regularly in syntax as well. We need to allow the tagging of compound proper nouns like `New York' as units without having to assign separate parts of speech to their constituents.

Adverbial phrases and collocations frequently require the same solution. For instance, in (13), it is clear that `all but' is functioning as an adverbial phrase (cf.
`They virtually promised'), but it is not intuitively obvious what part of speech to assign to `but'---or `all', for that matter.

(13) They all but promised.

Other instances in which part-of-speech assignment by word seems out of place concern what might be called `compound conjunctions', like `much as', `so that' and `so long as' in English. Such compound conjunctions are also common in other European languages; cf. the various conjunctions ending in `que' in French.

Issues concerning tokenization overlap with ones concerning underspecification. For instance, not only might we want to omit part-of-speech information about `New' and `York' in `New York', but we might not want to be forced to specify other information normally required for phrases, such as what the head of the phrase is.

A related issue concerns the use of hyphens as markers of constituent grouping. Consider the phrase `the New York-New Jersey express'. Here, `New York' and `New Jersey' should be grouped together, and the hyphen indicates higher-level constituent grouping. Conversely, in an example like `the New-Orleans Times-Democrat', the hyphen indicates constituent grouping at the lower level.

There appear to be two alternative ways of handling this. We might treat hyphens that connect constituents at different levels of syntactic hierarchy as distinct items, rather like homonymous lexical items. That is, we might tag the hyphen in `the New York-New Jersey express' as a phrase level hyphen, but the ones in `the New-Orleans Times-Democrat' as sub-phrase level hyphens. A more flexible and realistic approach is probably to find a way of indicating the domain of a hyphen, much as we need to be able to indicate the scope of conjunctions. The default domain would be a word on each side, but other domains (say, up to a clause) would be possible.

The domain of a hyphen can be smaller than a word and need not be adjacent to it.
This is especially true of German, where conjoined phrases of the form `ac Conj bc' are regularly expressed in the `factored' form `a- Conj bc', even if `a' and `b' are bound morphemes. Similar examples are found in English, though not as frequently; cf. `mid- to late summer', where `mid-' is a bound morpheme.

Finally, we need to be able to deal easily with inconsistent usage regarding hyphenation. On the one hand, we often need to avoid altering the original text, but on the other hand, it is desirable to be able to map inconsistencies onto a unique normalized form (this is an issue that also arises in connection with typographical errors).

o Usability of marked-up documents

Given the number of features that even an analysis with only modest aims ends up encoding, a marked-up text file almost immediately becomes too complex for humans to use directly. This complexity is compounded by the bracketing overlap that results from the representation of syntactic ambiguity and of mismatches between different levels of analysis. In addition to providing standards of encoding, it is therefore crucial that we provide standards for accessing and modifying the various streams of a text file without going directly to the master copy. This is especially important when we consider that marked-up text in syntax more often than not illustrates a particular analysis or a point in an argument. Hence, we must be able to extract easily only those elements that are germane to the issue at hand.

We also need to consider providing means for checking the consistency of a marked-up text file. Since these questions are of general concern, perhaps an additional subcommittee needs to be set up to deal with them.