Minutes of Work Group AI1
Baltimore, 7-8 January 1991
C. M. Sperberg-McQueen

Document Number: TEI AI1M4
Draft January 17, 1991 (19:12:25)

Present: Stephen R. Anderson (SRA), Nicoletta Calzolari (NC), D. Terence Langendoen (TL), Geoffrey R. Sampson (GRS), Gary F. Simons (GFS), C. M. Sperberg-McQueen (MSM).

1 AGENDA

TL proposed the following agenda, which was adopted without change:

  1. review of lexical markup schemata (word class and morphology)
  2. clean up some technical problems in SGML definitions of linguistic tags
     a. underspecification (what does omission of a feature mean?)
     b. ambiguity and resolution of ambiguity
     c. degree of certainty (especially in ambiguity resolution)
     d. simplifications of TEI scheme for special grammars
     e. sample taggings of linguistic examples
     f. mixed-content models
     g. review DTD and mechanism of inclusion
     h. interface to work by committee TR (s unit, punctuation, etc.)

2 LEXICAL MARKUP

2.1 Feasibility and Goals

The committee began by considering the feasibility of defining feature sets for part of speech and morphological information in the style of existing commonly used schemes (Brown, LOB, et al.). GRS was of the opinion that consensus might be reached on feature sets for the major classes only, such that the features required for an individual language would be a subset of the whole.

For reference, MSM proposed two Venn diagrams, each showing two sets labeled L1 and L2, symbolic of an indeterminate number of sets L1 to Ln, which overlap in part. In one, a set U is defined as the union of L1 and L2 (and ... Ln). In the other, a set I is defined as the intersection of L1 and L2 (and ... Ln).(1) MSM inquired whether GRS meant that for the major categories, we might succeed in generating set U, but that for the minor categories we might only find set I. GRS replied that he did, but that the method in which one discovers distinctive word-class features was also important.
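
The two diagrams can be restated with ordinary set operations. The following is only an illustrative sketch: the labels L1 to L3 follow MSM's notation, but the feature names in each inventory are invented placeholders, not inventories the committee discussed, and the function names are hypothetical.

```python
def union_set(inventories):
    """Set U: every feature used by at least one language."""
    return set().union(*inventories.values())


def intersection_set(inventories):
    """Set I: only the features shared by every language."""
    sets = list(inventories.values())
    return set.intersection(*sets) if sets else set()


# Hypothetical inventories, for illustration only (not from the minutes).
inventories = {
    "L1": {"number", "case", "gender"},
    "L2": {"number", "case", "definiteness"},
    "L3": {"number", "gender", "animacy"},
}

U = union_set(inventories)
I = intersection_set(inventories)
print("U:", sorted(U))
print("I:", sorted(I))
```

With these placeholder inventories, I shrinks to the single shared feature while U grows with every language added, which is the contrast the two diagrams were meant to show.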

After discussion, the committee agreed that some set of common features for lexical markup could and should be devised. At the very least, the committee would arrive at a set I of features which a tagger could supplement with others (important to note for funding agencies and others that extension of the feature set is legal and expected); if it proves possible to generate a set U from which all features currently thought important for lexical markup could be generated, that would be attempted as well.

SRA stressed the importance of allowing individual encoders to vary the level of granularity in their tagging.

2.2 Mechanism for Boilerplate

The feature sets are to be made available to users as boilerplate, that is, pre-defined feature-value pairs which can be embedded within feature structures with entity references, and possibly pre-defined feature structures, which can be embedded within texts using entity references. MSM discussed the form to be taken by the invocation of such boilerplate, and the ways in which a user could override the predefined values.

A need was seen for two different files: one containing entity declarations for simple feature-value pairs, and one containing entity declarations for complex feature-value pairs and for feature structures. A difficulty with predefined feature structures is that they do not allow the specification of unique IDs for the individual feature structures in the running text.

NC asked whether it would not be simpler to use the SGML attribute declaration mechanism to define features and their possible values. This proposal was found to have some advantages, but on the whole to be less flexible than the use of the feature-structure tags, and so was not adopted.

2.3 Long-Term Goals for TEI Starter Set of Features

The committee defined its first goal as the definition of a set of features and their values, its second goal as the definition of a set of feature structures equivalent in information content to the tags of LOB and other commonly used annotation schemes, and its third goal as the formulation of recommended minimal feature sets for lexical markup.

In terms of the mechanism defined, this appeared to require a single file of feature-value pairs ("primitive grammatical features") and one file of feature structures ("compound grammatical features") for each language to be specified.

2.4 Languages Considered

After a discussion of nouns (for results, see document AI1 W2), GFS suggested that the committee specify a set of languages to take into account in its first pass at a general set of features, then consult authoritative grammars for those languages to see what features and values would be needed for them, and to determine the standard terminology, and then to take the set union of features and values.

The committee agreed to take into account the nine official languages of the EEC and Russian, and assigned the following members to check respected grammars of them for features missing from the analysis produced by the meeting and for values missing from the value sets specified at the meeting. In addition, TL will assign a graduate student the task of checking a grammar of each of the ten languages except English and include the results of those checks in his report.

  * Danish: SRA
  * Dutch: TL
  * English: GRS
  * French: MSM
  * German: MSM (originally assigned to GFS)
  * (Modern) Greek: NC
  * Italian: NC
  * Portuguese: GRS
  * Russian: SRA
  * Spanish: TL

    Work Group AI 1 (as specified) to consult grammars and revise AI1 W2 on feature sets
    Due: 31 January 1991

2.5 Features for Specific Word Classes

Much of the meeting was concerned with the development of feature sets for specific parts of speech: nouns, adjectives, verbs, etc. Results of this discussion are contained in working paper AI1 W2, "List of Common Morphological Features for Inclusion in TEI Starter Set of Grammatical-Annotation Tags," and will not be repeated here.

2.6 Features for Tokens and for Types (Lemmata)

During consideration of specific word classes, the committee repeatedly discussed the possibility of including or excluding information associated not with the specific occurrence of a word, but with the lemma itself. After much deliberation, the committee determined not to include, in the "starter set" being constructed, features fixed for a lemma once and for all. Nevertheless, in practice many grammatical annotations do include lemma-level information, and any division has implications for the markup of the lexicon. Some liaison with the work group on computational lexica will be necessary to ensure that the two groups do not work at cross purposes.

The definitions to be provided in the "starter set" are for word classes and for features which some language (of those considered) expresses morphologically. "Morphology" here is to be understood as including typographic features, so that the consistent capitalization of proper nouns in English, for example, allows (requires) the inclusion of a feature PROPER (or at least WORD-INITIAL-CAP) for nouns so marked.

2.7 Multiple Word-Class Assignments (Source Class, Usage Class)

For the not infrequent cases in which a word of one class is used with the function of another class (adjective-as-noun and participle-as-adjective are two very common examples), the committee decided to allow tagging according to the word class of the usage (in the examples: noun, adjective) or according to the source class (adjective, verb), at the discretion of the tagger. Arguments may be found in favor of either decision.

Where features associated with both classes are to be tagged, the committee decided to provide a universally available feature source-class, which would take as its value a feature structure valid for the source or lexical class of the word. The source-class feature should be embedded within a feature structure for the word class for the word's syntactic function.

Within the committee's "starter set", therefore, adjectives used as nouns may be tagged

  * as nouns (usage or functional class)
  * as adjectives (source or lexical class)
  * as nouns, with an embedded feature source-class whose value is a feature structure for adjectives

Explicitly considered and rejected was a corresponding universal feature function, which would allow the functional-class feature structure to be embedded within the source-class feature structure. If both feature structures are to be encoded, the committee decided, the outer structure should be for the functional class. How best to handle such double-tagging in the starter set remains uncertain, however.

As an example, the committee formulated the following feature structure for the word ihres 'her' in the German phrase die Titel ihres Buches 'the title of her book'.

    [category = adjective
     gender = neuter
     number = singular
     case = genitive
     +possessive
     -interrogative
     source-category = [category = pronoun
                        person = 3
                        number = singular
                        gender = feminine
                        case = n/a ] ]

3 SGML TECHNICAL MATTERS

At this point the committee divided into two subgroups on lexical markup (SRA, NC, and GRS) and SGML issues (TL, GFS, and MSM). The latter group considered underspecification, ambiguity and its resolution, the problem of mixed-content models in SGML, and the DTD mechanisms used for embedding linguistic analysis in a document. The former group continued the specification of feature sets and considered the relevance of work by the TR committee for lexical markup.

3.1 Underspecification

Five possible interpretations were found for the omission of a feature specification from a feature set:

  1. The feature value is unrestricted.
  2. The feature takes some default value.
  3. The feature value is unknown.
  4. The feature does not apply (has no value).
  5. No claim is made about the feature's value or applicability.

In order to disambiguate these, the following universal feature values were agreed on:

  ANY:      The word is compatible with (unifies with) all values of the feature.
  DEFAULT:  The feature takes some specific default value, which can be inferred by an analysis, but the value thus found is not stated here.
  ?:        The feature value is unknown.
  N/A:      The feature does not apply (has no value).
  NO CLAIM: No claim is made about the feature's value or applicability.

Simple omission of a feature from a structure may legitimately be used for the senses no claim, n/a, or ? but should not be used for any or default. The value no claim should not be used if it is known whether the feature is applicable or not (use ? or n/a instead).

3.2 Feature System Declaration

GFS pointed out that unknown values and inapplicable features could be unambiguously determined without use of ? and n/a values, if it were possible to specify what features are applicable under what circumstances. He also pointed out that such a specification would give specific interpretation to . For example, if the legitimate values for case are nominative, dative and accusative, then the interpretation of:

    nominative

is equivalent to that of:

    dative accusative

It appears that the minimum function required is

  1. a specification of legal feature names and legal values for them of the logical form

         F1 = (a | b | c | d | ...) [ x ]

     (where F1 is a feature name, a ... d are legal values, and x is the optional default value)

  2. a specification of what features (and what subset of their values) are applicable under what circumstances, of the logical form

         F1 [= v] --> F2 [= (a | b | ...)] [x] & F3 [= (d | e | ...)] [y] ...

     (where F1, F2, and F3 are feature names, a, b, d, e are legal values for F2 and F3 if F1 is present with the value v, and x and y are default values for F2 and F3 if F1 = v). The specification of v is optional; if omitted, the implication holds for any value specified for F1. The specification of the legal ranges and defaults for F2 and F3 is also optional; if omitted, they are taken from the global specifications of the first form of feature system declaration.(2)

Other additional specifications (global values, non-enumerated value types, etc.) may prove desirable as well. After discussion, GFS agreed to draft a document AI1 W3 on feature system declarations, covering at least the minimal specifications and any enhancements he finds useful and simple.

    GFS to draft AI1 W3 on Feature System Declarations
    Due: asap

3.3 DTD Mechanisms

The overall mechanism for inclusion of linguistic analysis in a document was reviewed. GFS proposed that the element (in all its spelling variations) be omitted in favor of a disjunction of the various types of material it may contain (, , , and ). This was agreed to.
Thus the element declaration of TEI P1 version 1.1: should be replaced by a corresponding parameter entity declaration: and the element declaration for should be modified to include the parameter entity ling.analysis, not the element , in its list of inclusion exceptions:

3.4 Mixed-content Models and Other Technicalities

Owing to SGML's parsing rules, mixed-content models such as that specified in TEI P1 version 1 for are interpreted in a way that prohibits white space from appearing in many places where it might be desired for clear presentation. The examples in TEI P1, as a result, are not legal SGML as they stand, but become legal if all excess white space is suppressed. The following changes were agreed to:

The element should be suppressed (it survives in TEI P1 version 1 only through editing errors) and in its place an element should be defined as part of the parameter entity f.value.simple. values should accept #PCDATA (or the parameter entity %broth -- i.e. parsed character data interspersed with optional phrase-level elements like emphasis or quotation) as their content.

The elements and should be suppressed. In their stead, the elements and should have an additional attribute defined for their name. It should accept character-data values and require no value, and thus may be defined:

At this point, GFS departed; TL and MSM continued the subgroup's work.

3.5 Ambiguity and Its Resolution

The current draft of TEI P1 is contradictory in its treatment of ambiguity resolution (as pointed out in Jan Hajic's comments). Two mechanisms appear: a path attribute which is said to appear on the element (but is not defined in the DTD) and an element which is used in examples as content for elements but is not defined. It was agreed:

  1. The path attribute should be renamed choice and added to the DTD.
  2. The choice attribute should take the declared value IDREFS.
  3. The choice attribute should be used on an element when the target attribute on the same points to an ambiguous analysis (one containing or constituted by an ).
  4. If the ambiguity is completely resolved, the choice attribute value should be the ID of the chosen analysis.
  5. If the ambiguity is partially resolved, the choice attribute value should be a list including the IDs of all analyses still considered possible.
  6. If the analysis has multiple disjunctions, the choice value should be a list including the IDs of all chosen interpretations in any order.
  7. Any disjunctions which have no daughters included in the choice value are unaffected by the disambiguation.

For example, consider the following feature structure:

    [ id = F8
      cat = verb
      OR: (id = F8o1
            [ id = F8a1 ... ]
            [ id = F8a2 ... ]
            [ id = F8a3
              OR: (id = F8a3o
                    [ id = F8a3a1 ... ]
                    [ id = F8a3a2 ... ] ) ] )
      OR: (id = F8o2
            [ id = F8a4 ... ]
            [ id = F8a5 ... ] )
      OR: (id = F8o3
            [ id = F8a6 ... ]
            [ id = F8a7 ... ] )
    ]

Or, drawn as a tree:

                      F8
                       |
   +---------+--------------+--------------+
   |         |              |              |
  cat       OR             OR             OR
   |      id=F8o1        id=F8o2        id=F8o3
  verb       |              |              |
        +----+----+      +--+--+        +--+--+
        |    |    |      |     |        |     |
      F8a1   |  F8a2   F8a4  F8a5     F8a6  F8a7
            OR
         id=F8a3o
             |
        +----+----+
        |         |
     F8a3a1    F8a3a2

To continue the example, the specification in an analysis would indicate that the disjunction F8a3o is resolved in favor of its left-hand child F8a3a1, and F8o2 is resolved in favor of F8a4. The disjunction F8o1 is partially resolved by the elimination of F8a1 (and with the disambiguation of F8a3o, F8o1 is now effectively an ambiguity between F8a3a1 and F8a2). F8o3 is unaffected and remains ambiguous.

As a second example, note that if the nested disjunction F8a3o were omitted from the choice value, leaving 'F8a3a1 F8a2 F8a4', then F8o1 would be completely disambiguated, because only one of its children is listed, namely F8a2. Since F8a3o would not be kept, its disambiguation would be purely of academic interest.
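
The two examples above can be checked mechanically. What follows is a minimal sketch, not part of the committee's proposal: it applies a choice value to the example tree using the flag, weed, and promote steps of the pruning algorithm set out in this section. The Node class and all function names are hypothetical; the tree follows the drawing (with F8a3o placed directly under F8o1); and the order of IDs in the first choice value is assumed, since rule 6 allows any order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ident: str
    kind: str = "fs"  # "fs" = feature structure, "alt" = disjunction (OR)
    children: list = field(default_factory=list)


def alt_daughters(node, under_alt=False, found=None):
    """Collect the IDs of all immediate daughters of disjunctions."""
    found = set() if found is None else found
    if under_alt:
        found.add(node.ident)
    for child in node.children:
        alt_daughters(child, node.kind == "alt", found)
    return found


def prune(node, chosen):
    """Top-down pruning: weed unflagged siblings, promote only daughters."""
    kids = node.children
    if node.kind == "alt" and any(k.ident in chosen for k in kids):
        # Some daughter is flagged, so the unflagged siblings are deleted.
        kids = [k for k in kids if k.ident in chosen]
    kids = [prune(k, chosen) for k in kids]
    if node.kind == "alt" and len(kids) == 1:
        return kids[0]  # the only remaining daughter replaces the disjunction
    return Node(node.ident, node.kind, kids)


def resolve(root, choice):
    """Apply a space-separated choice value (a list of IDs) to a tree."""
    chosen = set(choice.split())
    stray = chosen - alt_daughters(root)
    if stray:
        raise ValueError(f"not daughters of a disjunction: {sorted(stray)}")
    return prune(root, chosen)


def outline(node):
    """Render a tree as nested IDs, e.g. F8(F8o1(F8a1 F8a2))."""
    if not node.children:
        return node.ident
    return f"{node.ident}({' '.join(outline(c) for c in node.children)})"


def example_tree():
    """The F8 example, following the drawn tree (cat = verb omitted)."""
    def alt(ident, *kids):
        return Node(ident, "alt", list(kids))
    return Node("F8", "fs", [
        alt("F8o1", Node("F8a1"),
            alt("F8a3o", Node("F8a3a1"), Node("F8a3a2")), Node("F8a2")),
        alt("F8o2", Node("F8a4"), Node("F8a5")),
        alt("F8o3", Node("F8a6"), Node("F8a7")),
    ])


print(outline(resolve(example_tree(), "F8a3a1 F8a3o F8a2 F8a4")))
print(outline(resolve(example_tree(), "F8a3a1 F8a2 F8a4")))
```

Because the sketch works top-down, the second call discards F8a3o together with its daughters before F8a3a1 is ever examined, which matches the remark that its disambiguation is then of academic interest only.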

An application which wished to prune the tree of possible interpretations may follow the algorithm:

  1. flag all nodes listed in the choice attribute value. These should all be immediate daughters of an element; if not, the choice value is invalid.
  2. check the siblings of each flagged feature structure: if they are unflagged, then delete them.
  3. check each : if it has exactly one daughter remaining, then
     a. delete the itself, and
     b. promote the remaining daughter to the position of the mother.

In the first example given, flagging the chosen nodes leaves nodes F8a3a1, F8a3o, F8a2, and F8a4 flagged. Weeding the unflagged siblings of flagged analyses removes F8a1, F8a3a2, and F8a5 from the tree. Promotion of only daughters results in elimination of F8a3o and F8o2 from the tree. The pruned tree would look like this:

                      F8
                       |
   +---------+--------------+--------------+
   |         |              |              |
  cat       OR            F8a4            OR
   |      id=F8o1                       id=F8o3
  verb       |                             |
        +----+----+                     +--+--+
        |         |                     |     |
     F8a3a1     F8a2                  F8a6  F8a7

In the second example, with F8a3o omitted from the choice value, only F8a3a1, F8a2, and F8a4 would be flagged. F8a3a2, F8a1, F8a3o, and F8a5 would be removed as unflagged siblings of flagged analyses. F8o1 would be replaced by F8a2, F8o2 by F8a4. If the tree is processed top-down, some entire branches may be removed before they are processed: in this case, the removal of F8a3o would eliminate the need to search for siblings of F8a3a1. The pruned tree would look like this:

                      F8
                       |
   +---------+--------------+--------------+
   |         |              |              |
  cat      F8a2           F8a4            OR
   |                                    id=F8o3
  verb                                     |
                                        +--+--+
                                        |     |
                                      F8a6  F8a7

A similar mechanism may be desirable for the explicit disambiguation of representations using the and notation. TL will look into this.

Finally, it should be noted that the mechanism of explicitly marking disambiguation is particularly useful if a text is associated with a lexicon in which words are organized under lemmata, in which the various interpretations are grouped using .
If a lexical occurrence of a word is disambiguated by its context, then the foregoing mechanism provides a straightforward way of pointing to the intended interpretation.

3.6 Interface to TR Work

The existing TEI tags , , seem to be usable for the objects they describe and so no provision was found necessary for marking these features using the feature structure notation or for including feature structure notations for them in the "starter set".

4 UNFINISHED BUSINESS

The following agenda items were not taken up for lack of time.

  * degree of certainty in resolving ambiguity
  * simplifications for special grammars
  * sample taggings

-------------------------

(1) The drawings actually drawn used the terms F (drawn as the universe of discourse within which L1 and L2 exist) and F' (drawn as a subset of the intersection of L1 and L2), but later discussion introduced the terms U and I and defined them as the union and intersection of L1 ... Ln, respectively.

(2) N.B. the forms shown for the specifications are solely to illustrate the abstract syntax required: the first form requires a feature name, a range of values, and a default; the second a set of feature names, value ranges, and defaults. The concrete syntax used for the example was improvised for discussion and is not a proposal for the concrete syntax of the feature system declaration, which should have other material present (e.g. prose documentation) and should take the form of an SGML document. The writing system declaration already defined by the TEI gives a good example of what is needed.