Feature System Declarations and the Interpretation of Feature Structures

Gary F. Simons

Document Number: TEI AI1W3
Version 1: 28 January 1991
Version 2: 2 January 1992
Version 2.1: 28 January 1992

Abstract

Discusses problems involved in interpreting the meaning of underspecified feature (or field) structures in TEI conforming texts and then proposes that these be solved by requiring an external Feature System Declaration. The DTD for such a file is proposed and two sample FSDs, illustrating GPSG and systemic linguistics, are given.

Revision history

Version 2 reflects the shift agreed upon at the Myrdal meeting from to as a more general-purpose feature structure or field structure. Significant differences in this version are: change from values to tags for the predefined feature values, combination of and section into a single section, and inclusion of the full definition of in the FSD DTD with the assumption that validation of instances against the FSD will be done by a unification engine.

Version 2.1 changes the encoding of conditional defaults and of conditional and biconditional constraints. Another change to the DTD is the use of an exclusion on TEI.fsd to prevent in FSDs; see section 2. The details of the definition of are based on draft version 1.6 of the TEI.Ling DTD. The most notable difference here, from what was in version 2 of this paper, is the shift to , , from , , .

1. INTRODUCTION

1.1 The problem

In writing any particular feature structure, analysts typically omit some features that occur in other feature structures used in the analysis of that language. What does it mean when a particular feature is not specified? At least the following interpretations are possible:

- The value is unrestricted.
- The feature takes a default value supplied by rule.
- The feature does not apply in this context.
- The feature is known to apply but its value is unknown.
- No claim is made about the feature's value or applicability.
To avoid this ambiguity, the following empty tags for feature values are defined for TEI feature structures:

- The form is compatible (or unifies) with all possible values of the feature.
- The feature takes a specific default value which can be inferred by some general rule.
- The feature is not applicable in this context.
- The feature is known to be applicable, but its value is not known.
- No claim is made about the feature's value or applicability. (If it is known whether or not the feature is applicable, use or instead.)

But providing these special feature values still does not solve the problem of knowing how to interpret a given feature structure. The range of values possible for or is still not known. The specific feature value to assign for is still not known, nor is the range of values for a feature specification which uses the "domain=CMP" attribute to indicate that the allowed value is in the complement of the set of values specified. The inability to interpret a feature structure could also spring from a completely different kind of problem: the human reader of a feature structure might have no idea what the abbreviations for feature names and feature values stand for. Finally, even though and are available when needed, it is not parsimonious (and thus not typically desirable) to specify or whenever a feature value can be inferred by a general rule or is known to be irrelevant by a general rule.

1.2 The proposed solution

To solve these problems, the working group of the A & I committee is proposing that a "Feature System Declaration" mechanism be set up. In a Feature System Declaration (or FSD), the designer of the feature system declares its formal properties in an external file which is separate from the analyzed text files or dictionary files which might use that feature system.
Specifically, an FSD declares the names of the features that are used, the range of values allowed for each feature, rules for inferring default values, and constraints on the co-occurrence of features and values. The FSD functions further as a place for the analyst to provide prose description of the features and their possible values.

Though the original motivation was to support the interpretation of feature structures in linguistic analysis, the same FSD mechanism can be used with the more general interpretation of and as "field structure" and "field". In this case, an FSD is like the schema of a database; it serves to declare what fields are allowed in different types of field structures, what range of values they are allowed to have, and what it means when a particular field is not specified.

The TEI draft guidelines already employ the mechanism of using an auxiliary file (with a specialized DTD) to interpret the meaning of what is encoded in a TEI-conformant document. This is the mechanism of the Writing System Declaration (or WSD). For every unique value of the LANG attribute used in a text, there should be a matching WSD which documents how data in that language has been encoded, and thus how byte codes in content marked as being in that language are to be interpreted. The proposed FSD mechanism is therefore analogous to the WSD mechanism that already exists.

In order for application software to use such declarations to aid in automatic interpretation of encoded texts, or for human readers to find the appropriate declarations, there must be a formal link from the texts to the declarations. As far as I can tell, this is missing in TEI P1. A logical place for this would be in the of the . For instance, could formally identify the WSD for a given language, and could identify the FSD associated with a given text file.
The parallel to WSDs raises the question of whether we might want multiple FSDs associated with a single file, linked perhaps via the TYPE attribute of the feature (or field) structure. For instance, could indicate that the FSD for any is to be found in the named file.

2. THE FORMAL DEFINITION OF FEATURE SYSTEM DECLARATIONS

This section proposes a formal definition of Feature System Declarations in terms of an SGML document type definition (DTD). In developing this definition I have referred heavily to a recent landmark work dealing with feature structures, namely, Generalized Phrase Structure Grammar by Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag (Harvard University Press, 1985). I believe the formulation below to be adequate to handle the feature system they propose in the appendix for English.

A feature system declaration has a file header followed by declarations for every (that is, feature or field) used in the system, optionally followed by a specification of constraints (which are essentially co-occurrence restrictions between the features or fields). The definition of and elements, used within the declarations and constraints, is taken directly from the LING.DTD by including it as an entity reference. The one difference between what is allowed in the feature structures of an analyzed text and what is allowed in those of an FSD is that (cross-reference pointer) is not allowed in the latter. This is enforced by means of an exclusion on the content model of an FSD. That is,

We now elaborate each of these components in turn, illustrating each by extracts from the feature system for English proposed by Gazdar, Klein, Pullum, and Sag.

2.1 The FSD header

The FSD.header, in analogy to a TEI.header, gives the necessary bibliographic documentation on the FSD. This includes identification of the language and purpose of the feature system, who developed it and when, details of revision history, references to literature on which the analysis is based, and so on.
For the present, I define FSD.header as simply PCDATA. I leave its elaboration to the TEI editors, who are in a better position to make it consistent with the philosophy of TEI headers. Thus,

*** Note to TEI editors: This suggests that WSDs, as well, should have a header analogous to a TEI header, and very similar (if not identical) in definition to an FSD header. ***

The following is the FSD header for our sample based on the GPSG appendix:

This sample FSD does not describe a complete feature system. It is based on extracts from the feature system for English presented in the appendix (pages 245-247) of Generalized Phrase Structure Grammar, by Gazdar, Klein, Pullum, and Sag (Harvard University Press, 1985). This sample was encoded by Gary F. Simons (Summer Institute of Linguistics, Dallas, TX) on January 28, 1991. Revised January 1, 1992 for version 2 of FSD.DTD.

2.2 Feature (or field) declarations

The section of an FSD declares all of the features (or fields) that are allowed to occur in well-formed elements. Each feature is declared in an whose NAME attribute specifies the name of the feature (which matches the NAME attribute of the elements it declares). An has three parts: an optional prose description (which should explain what the feature and its values represent), an obligatory value range specification (which declares what values the feature is allowed to have), and an optional value default specification (which declares what default value should be supplied when the named does not appear in an ). The DTD fragment for feature declarations is as follows:

The content model for uses the parameter entity f.value, which is defined to be "%fs.family; | %val.family;". The latter two parameter entities are defined in LING1.DTD. In other words, the value range in an FSD can be anything from the linguistic analysis DTD that can be a feature (or field) value.
This includes embedded feature structures or terminal values (which include , , , and all of the predefined values described above in section 1).

The value for an in an in a TEI conforming text is valid if it unifies with the model specified in of the with the same name. The key to making this work is to use or as the range specification. Such an alternation specifies a list of 's or terminal values that are allowed values of the feature. The predefined values described above in section 1 are always allowed and need not be included in the range specification.

Under the rules of unification, or in an FSD with no value are defined to unify with any or , respectively. In the FSD, is permitted additional attributes MIN and MAX which specify minimum and maximum allowed values for VALUE under unification. with no content is defined to unify with any instance of . Similarly, is defined to unify with any . The TYPE attribute of is an important one for use in range specifications. Specifying in the range of a feature states that any of type X is an allowed value.

The specification of the value default may be unconditional or conditional. If unconditional, then a single feature value is specified; it is supplied as the default value in any which does not specify the feature. If conditional, then the default specification consists of one or more elements. Each of those has three elements inside it. The first is an instance of %fs.family which is unified with an instance of that does not specify the feature being declared. If the first element unifies with the , then the third element (which is an instance of %f.value) is the default value for the feature. An empty tag is required between the antecedent and consequent; it is there essentially to enhance human readability. If the first does not unify, then the second is tried, and so on. If no unifies, or if is not given at all, then the default is .
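The range check and default lookup described above can be sketched in Python. This is only an illustration under stated assumptions, not the FSD mechanism itself: feature structures are modeled as plain dictionaries; the names `ANY`, `NO_DEFAULT`, `matches`, `in_range`, and `default_for` are hypothetical; the feature names `PER` and `NUM` are invented for the example; and "unifies" is approximated by a one-directional match in which every feature of the pattern must be present in the target. Full unification is more permissive than this.

```python
from typing import Any, Dict, List, Sequence, Tuple

FS = Dict[str, Any]            # a feature structure modeled as a plain dict

ANY = "*any*"                  # hypothetical wildcard: compatible with any value
NO_DEFAULT = "*no-default*"    # hypothetical placeholder for the predefined
                               # value used when no default can be inferred


def matches(pattern: FS, target: FS) -> bool:
    """One-directional approximation of unification (an assumption of this
    sketch): every feature in `pattern` must occur in `target` with a
    compatible value."""
    for name, want in pattern.items():
        if name not in target:
            return False
        have = target[name]
        if want == ANY:
            continue
        if isinstance(want, dict):
            # Embedded feature structure: match recursively.
            if not (isinstance(have, dict) and matches(want, have)):
                return False
        elif want != have:
            return False
    return True


def in_range(value: Any, allowed: List[Any]) -> bool:
    """True if `value` is compatible with one alternative of the range."""
    for alt in allowed:
        if alt == ANY:
            return True
        if isinstance(alt, dict):
            if isinstance(value, dict) and matches(alt, value):
                return True
        elif alt == value:
            return True
    return False


def default_for(feature: str, fs: FS, unconditional: Any = None,
                conditionals: Sequence[Tuple[FS, Any]] = ()) -> Any:
    """Value of `feature` in `fs`, inferring a default when it is omitted."""
    if feature in fs:
        return fs[feature]
    # Conditional defaults are tried in order; the first antecedent that
    # matches supplies the value.
    for antecedent, consequent in conditionals:
        if matches(antecedent, fs):
            return consequent
    if unconditional is not None:
        return unconditional
    return NO_DEFAULT


# Illustrative declarations loosely modeled on the GPSG extracts.
CASE_RANGE = ["ACC", "NOM"]
AGR_RANGE = [{"PER": ANY, "NUM": ANY}]                     # PER, NUM: assumed names
COMP_DEFAULTS = [({"VFORM": "INF", "SUBJ": "+"}, "for")]   # cf. FSD 9
```

For instance, `default_for("COMP", {"VFORM": "INF", "SUBJ": "+"}, conditionals=COMP_DEFAULTS)` looks up the conditional default for an infinitival structure with a subject, and `in_range("ACC", CASE_RANGE)` checks a terminal value against a closed range.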
As an example, consider the following extract from Gazdar, Klein, Pullum, and Sag (1985:245):

    feature   value range
    INV       {+, -}
    CONJ      {and, both, but, either, neither, nor, or, NIL}
    COMP      {for, that, whether, if, NIL}
    CASE      {ACC, NOM}
    AGR       CAT
    PFORM     {to, by, for, ...}

where CAT stands for any category (where category in their system is equivalent to feature structure). In the sample encoding below I extend the illustration by specifying that the which is the value of AGR must have features for person and number. Since PFORM is specified as an open set, is used in the range specification below rather than .

Furthermore, consider the accompanying "feature specification defaults" extract from Gazdar, Klein, Pullum, and Sag (1985:246-247):

    FSD 1:  [-INV]
    FSD 2:  ~[CONJ]
    FSD 9:  [INF, +SUBJ] --> [COMP for]
    FSD 10: [+N, -V, BAR 2] <--> [ACC]

Some comments on notation:

~ means undefined (that is, not applicable). (Note that this is distinct from the NIL value for CONJ, which means that the phenomenon of conjunction is taking place but there is no surface conjunction.)

If a given feature value is in the range set of only one feature, then the feature name may be omitted, as with INF above, which is a value of the VFORM feature.

--> is the implication (if-then) operator of boolean logic.

In number 10, <--> represents biconditional implication. I am having trouble understanding what this means as a default specification. I am interpreting it as a normal implication for deducing the default value of CASE, with a simultaneous constraint that if ever [CASE ACC], then [+N, -V, BAR 2] must also be true.
These sample range and default specifications would be encoded as follows in an FSD:

inverted sentence
surface form of the conjunction
surface form of the complementizer
agreement for person and number
word form of a preposition

2.3 Feature (or field) structure constraints

The specification of constraints on valid feature (or field) structures consists of a sequence of conditional and biconditional tests. A particular feature structure is valid only if it meets all the constraints. The element encodes the conventional if-then conditional of boolean logic, which fails only if the antecedent is true and the consequent is false; otherwise, it succeeds. In feature structure constraints the antecedent and consequent are expressed as feature structures; they are considered true if they unify with the target feature structure. The element encodes the biconditional (if and only if) operation of boolean logic. It succeeds only when both antecedent and consequent are true, or both are false. The DTD fragment for feature structure constraints is as follows. Note that and use the empty tags and , respectively, to separate the antecedent and consequent. These are primarily for the sake of enhancing human readability.

Constraints are typically used to enforce co-occurrence restrictions among feature values. They can also be used to control the co-occurrence of named features or fields with particular types of . For instance, in an application that uses to model records in a bibliographic database, a constraint like the following could be used to ensure that all book records have a field for authors, year, date, and publisher, but not for journal. (By virtue of not being mentioned, all other fields declared in an would be optional in this type of record.)
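The truth conditions of the conditional and biconditional tests can be sketched in Python as follows. This is a minimal illustration under stated assumptions: feature structures are plain dictionaries; `ANY`, `ABSENT`, `matches`, `conditional`, `biconditional`, and `is_valid` are hypothetical names; "unifies" is approximated by a one-directional match; and the sample constraints paraphrase a few GPSG-style co-occurrence restrictions, with `VFORM` assumed as the feature that carries the value FIN.

```python
from typing import Any, Callable, Dict, List, Tuple

FS = Dict[str, Any]      # a feature structure modeled as a plain dict

ANY = "*any*"            # hypothetical wildcard: feature present, any value
ABSENT = "*absent*"      # hypothetical marker: feature must not be specified


def matches(pattern: FS, target: FS) -> bool:
    """One-directional approximation of unification (an assumption of this
    sketch): each pattern feature must be satisfied by the target."""
    for name, want in pattern.items():
        if want == ABSENT:
            if name in target:
                return False
            continue
        if name not in target:
            return False
        if want != ANY and want != target[name]:
            return False
    return True


def conditional(antecedent: FS, consequent: FS, fs: FS) -> bool:
    """If-then: fails only when the antecedent holds and the consequent does not."""
    return (not matches(antecedent, fs)) or matches(consequent, fs)


def biconditional(antecedent: FS, consequent: FS, fs: FS) -> bool:
    """If-and-only-if: succeeds when both sides hold or both fail."""
    return matches(antecedent, fs) == matches(consequent, fs)


Constraint = Tuple[Callable[[FS, FS, FS], bool], FS, FS]

# Illustrative constraints in the style of feature co-occurrence restrictions.
CONSTRAINTS: List[Constraint] = [
    (conditional,   {"INV": "+"}, {"AUX": "+", "VFORM": "FIN"}),
    (biconditional, {"BAR": "0"}, {"N": ANY, "V": ANY, "SUBCAT": ANY}),
    (conditional,   {"BAR": "1"}, {"SUBCAT": ABSENT}),
    # A mutual exclusion expressed as two conditionals.
    (conditional,   {"SLASH": ANY}, {"WH": ABSENT}),
    (conditional,   {"WH": ANY}, {"SLASH": ABSENT}),
]


def is_valid(fs: FS, constraints: List[Constraint] = CONSTRAINTS) -> bool:
    """A feature structure is valid only if it meets all the constraints."""
    return all(test(a, c, fs) for test, a, c in constraints)
```

Treating a constraint as a (test, antecedent, consequent) triple keeps the list declarative, which mirrors the idea that the FSD states constraints as data for a separate engine to evaluate.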
As a fuller example, consider the following "feature co-occurrence restrictions" extracted from Gazdar, Klein, Pullum, and Sag (1985:246-247):

    FCR 1:  [+INV] --> [+AUX, FIN]
    FCR 7:  [BAR 0] <--> [N] & [V] & [SUBCAT]
    FCR 8:  [BAR 1] --> ~[SUBCAT]
    FCR 20: ~([SLASH] & [WH])

FCR 20 is the only constraint of the 22 given by Gazdar et al. that is not expressed with an implication operator; I assume it to mean that if either SLASH or WH is defined, the other cannot be defined, and have thus translated it below into two conditionals. The above constraints would be encoded as follows in an FSD:

3. ANOTHER EXAMPLE

It turns out that feature system declarations work equally well to describe the networks of system delicacy used in systemic linguistics. In fact, systemic networks seem to offer a graphic notation for representing feature ranges and constraints. The following simple network is from R. A. Hudson, English Complex Sentences (North Holland, 1972), page 60:

                    /
                    |       | 1st-person
                    |       |
                    |  -->  | 2nd-person
                    |       |
     personal       |       | 3rd-person ---|        | masculine
     pronoun  -->  <                        |        |
                    |                       > -->    | feminine
                    |                       |        |
                    |       | singular -----|        | neuter
                    |  -->  |
                    |       | plural
                    \

This systemic network could be translated into a feature system declaration like the following:

This is a sample "toy" FSD based on a systemic network for the inflection of English personal pronouns in R. A. Hudson, English Complex Sentences (North Holland, 1972), page 60. It was encoded by Gary F. Simons, Summer Institute of Linguistics (Dallas, TX). The original version was produced on January 28, 1991. This is a revision produced January 28, 1992 to match version 2.1 of the FSD definition.

major syntactic category
subtype of pronoun
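The idea that a systemic network amounts to a set of feature ranges plus entry conditions can be sketched in Python. This is a hedged illustration, not the encoding given in the FSD: the system names `person`, `number`, and `gender` are assumptions (the network itself does not name its systems), as are the names `SYSTEMS` and `check`, and the entry condition for gender reflects one reading of the network, namely that the gender choice is entered only from the combination of 3rd-person and singular.

```python
from typing import Dict, List

# Each system lists its allowed choices (its value range) and the entry
# condition that must hold before the system applies. All names here are
# assumptions of this sketch.
SYSTEMS = {
    "person": {"choices": {"1st-person", "2nd-person", "3rd-person"},
               "entry": {}},
    "number": {"choices": {"singular", "plural"},
               "entry": {}},
    "gender": {"choices": {"masculine", "feminine", "neuter"},
               "entry": {"person": "3rd-person", "number": "singular"}},
}


def check(selection: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a pronoun's feature selection."""
    errors = []
    for name, system in SYSTEMS.items():
        # A system is entered when all of its entry conditions are selected.
        entered = all(selection.get(k) == v for k, v in system["entry"].items())
        chosen = selection.get(name)
        if entered and chosen not in system["choices"]:
            errors.append(f"{name}: expected one of {sorted(system['choices'])}")
        if not entered and chosen is not None:
            errors.append(f"{name}: not applicable for this selection")
    return errors
```

Under this reading, a description such as `{"person": "3rd-person", "number": "singular", "gender": "feminine"}` passes, while a gender choice on a plural or non-third-person pronoun is reported as not applicable.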