Minutes of the Meeting
Of the Analysis and Interpretation Committee
Held at the University of Arizona, Tucson
March 12-14, 1990

Lou Burnard
D. Terence Langendoen

Document Number: TEI AIM3
23 March 1990

Present: D. Terence Langendoen, chair (TL); Robert Amsler (RA); Lou Burnard (LB); Nicoletta Calzolari (NC); Robert Ingria (RI); Mitch Marcus (MM); William Poser (WP); Beatrice Santorini (BS); Gary Simons (GS); Michael Sperberg-McQueen (MSM)

Final, November 14, 1997

1. SYNTAX REPORT

MM circulated and talked through AI W6. He had begun by considering ways in which the immense range of syntactic representations currently used could be reformulated as directed acyclic graphs (DAGs), each node being potentially represented as a set of feature-value pairs. The meeting spent some time discussing the basic ideas, expressed in MM's paper as two BNF grammars, and exploring formalisms which might prove difficult to represent in this way (examples included parse forests, words absent from the input stream, ordering and other rendering variations). It was agreed that the model should be tested with a variety of formalisms and AI W18 was circulated as an example of the range.

It was agreed that, although probably inadequate for the direct representation of the linguistic theories underlying some formalisms, SGML could successfully model their functionality as presented typographically, and would thus be of immense value in interchange.

To demonstrate this, MM also presented a BNF grammar for tree structures, which MSM then reformulated in SGML (AI W6A). TL asked how this is to be related to the feature-structure model. The tree model provides special-purpose tags for "tree geometry", but the entire power of the feature-structure apparatus is available for specifying the internal structures of the nodes of the tree.

The problems raised by PP-attachment were discussed (e.g.
the ability to represent all the relations between the constituents of such sentences as "I saw the man in the park with the telescope"). Ken Church in his MIT Master's thesis originally proposed a compact method of representing these relations, the number of which grows exponentially with the number of prepositional phrases. The mechanism used in the PLNLP English Grammar developed by Karen Jensen at IBM for representing them was discussed, and it was agreed that the tree-markup specification would provide for a compact method of representing these relations.

Returning to the problem of the feature-structure specifications, WP pointed out that there was a need to distinguish e.g. disjunction from conjunction. TL noted the similar need to distinguish the contextually correct or only plausible reading. Some pointers are obligatorily co-referential (e.g. "John i saw himself i"); others ambiguous (e.g. "John i saw him i/j"); yet others disjunct (e.g. "John i saw him j/=i"). To deal with these and other features of lexical functional grammar, the notion of restrictional clauses handling boolean (AND, OR, and NOT) expressions was introduced.

Some concern was voiced that the representation chosen should not be anchored in presentation only: MM argued that the typographic (etc.) conventions used by linguists were chosen as comprehensible ways of representing their theories; to represent the former was a more reliable way of unifying the latter than any inevitably doomed attempt at grand unifying theories.

MM was asked to provide a further draft expressing these ideas, with extensions for Penelope style PP attachment rules. This would form the basis for section 8.1 of the Guidelines. [A new draft was posted to TEI-REP by BS on 20 March]

2. MORPHOLOGY REPORT

GS presented a new version of AI W12, in which sections 2 to 4 had been extensively revised. Like MM he had concentrated on features that were typographically marked.
The metalanguage he presented could be recast in SGML using simple rewrite rules to convert attributes into tags. WP noted that attribute values and names had to be unique within a DTD. Discussion of the theoretical part of the paper was postponed for the second cycle; it was regarded as a very useful basis for further work on extensions.

After a digression on the dictionary encoding example, discussion of morphological issues resumed with some consideration of bracketing paradoxes and a debate about the difficulty of incorporating a theory of morphological procedures (as proposed by Stephen Anderson) into the encoding system. MM asked for a set of exemplars against which to test the proposed formalism. NC suggested it would be better to concentrate on the comparatively simple procedures most widely used. It was agreed that the committee should try to formulate a set of general purpose tools, with the general purpose feature syntax proposed by MM as its basis.

GS to prepare a BNF for the interlinear gloss structure
Due: March 31

After some discussion of the morphology of the word "ungrammaticality", it was agreed that there was a need for many analyses, possibly orthogonal and non-unifiable, to be represented in parallel, with some sort of "impedance match". WP noted that this applied equally to phonological analyses. A set of sample morphological analyses, analogous to the syntactic analyses of W18, was needed: MM referred the meeting to a tutorial published in ACL Survey last year by Richard Sproat which included examples of K4 notation, and those of Anderson, Williams, Selkirk, Hankamer and Keyser & Roeper.

TL to review Sproat's tutorial and to produce exemplary morphological analyses
Due: 15 April

An attempt was made to analyse ways of presenting relations between the parts of a text; four ways were proposed: (1) wrap-around, (2) use of ID/IDREF, (3) implicit positional and (4) explicit positional.
Of these, (3) could be translated into (2) or (4), which were conceptually identical. (4) depended on deprecated facilities such as byte counts while (3) depended on unreliable features such as agreed tokenization rules. MSM pointed out that (1), (2) or (4) required that the target be marked, which might not always be possible (e.g. on CD-ROM), commenting however that the Text Representation Committee had recommended that all SGML elements should carry an explicit ID tag. TL noted that (3) did not involve segmentation of the input, (4) was local to the sentence and (2), being global, allowed for inter-sentence reference.

MM then presented a model for alignment, initially predicated on a unidirectional (input-output) model but expanded during discussion into a general purpose mechanism for mapping arbitrary segmentations onto each other in either direction. The discussion was written up by BS as W19; it was felt to provide an essential complement to the feature-level syntax. It was noted that semantically all parts of the alignment map were equivalent; it would be up to an application to check that co-referring pointers belonged to the same level of analysis.

GS presented W7 which proposed a simple "box" model for use in interlinear analyses. It was agreed that this was a useful simplification appropriate to many applications. MSM introduced the notion of the "pizza model" proposed by the Metalanguage and Syntax Issues Committee; it was clear that this was a good example of one kind of "topping" which would be much appreciated.

The theory-independent model proposed by MM, in which the theory was regarded as content, would be found disturbing by those who wanted to work in terms of their own ontology. MM remarked that SGML was too impoverished to support all possible linguistic models, and that a general approach must therefore be the committee's first responsibility.
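
[Editorial illustration, not part of the meeting record.] The ID-based alignment mechanism discussed above might be sketched roughly as follows. Everything named here is an assumption made for illustration: the Segment record, the level labels "orth" and "morph", the identifiers, and the particular division of "ungrammaticality" into three morphs. The sketch only shows that a set of links between marked targets can be read in either direction, and that checking levels of analysis is left to the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    ident: str  # plays the role of an SGML ID value (hypothetical naming)
    level: str  # level of analysis; the labels used below are assumptions
    text: str


def build_index(segments):
    """Map each identifier to its segment, rejecting duplicate identifiers."""
    index = {}
    for seg in segments:
        if seg.ident in index:
            raise ValueError(f"duplicate identifier: {seg.ident}")
        index[seg.ident] = seg
    return index


def build_alignment(links, index):
    """Return a table usable in either direction: identifier -> linked identifiers."""
    table = {}
    for left, right in links:
        for ident in (left, right):
            if ident not in index:
                raise KeyError(f"link points at an unmarked target: {ident}")
        table.setdefault(left, set()).add(right)
        table.setdefault(right, set()).add(left)
    return table


def offending_links(links, index, left_level, right_level):
    """Application-level check: list links whose ends are not at the expected levels."""
    return [
        (left, right)
        for left, right in links
        if index[left].level != left_level or index[right].level != right_level
    ]


# One orthographic word aligned with an illustrative (assumed) morph segmentation.
segments = [
    Segment("w1", "orth", "ungrammaticality"),
    Segment("m1", "morph", "un"),
    Segment("m2", "morph", "grammatical"),
    Segment("m3", "morph", "ity"),
]
links = [("w1", "m1"), ("w1", "m2"), ("w1", "m3")]

index = build_index(segments)
alignment = build_alignment(links, index)
```

Looking up alignment["w1"] gives the three morph identifiers and alignment["m2"] leads back to the word, so the same table serves both directions; offending_links reports any link that strays outside the two levels the application expects.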
Discussion of the economics of software development ensued, with RA proposing that a "Linguistica" program might some day be developed on the basis of the general mechanism, just as "Mathematica" was based on an agreed meta-mathematical theory.

There was some discussion of the need to produce a unified list of part of speech tags, which had been an original work item for the committee. MM asserted that the parts of speech used in LOB and Brown were non-unifiable, and that even if they were, no one would agree to it. TL proposed some features which might be used to categorise them but agreed that unification would be very difficult. MM suggested this work should be left to the second cycle. MSM noted that theory-specific mappings between tags used and a general theory could be stored as part of the document header. To simplify representation of the subtrees required by the general feature-structure mechanism, some sort of abbreviatory convention should be produced. An ad hoc subcommittee composed of MSM, NC, MM, BS and TL will work on this; LB suggested that the entity reference mechanism of SGML would prove entirely adequate to the task.

MSM gave a short overview of the facilities and syntax provided by an SGML DTD, covering the SGML declaration, minimization, content models, separators and connectors, and occurrence groups.

3. PHONOLOGY REPORT

WP summarised the needs of phonologists as "everything is dags and maps". He was confident that the general mechanisms so far proposed were adequate to the purpose. His discussion began by considering traditional phonological concerns, concentrating on character set issues, and the need for an IPA-like character set: MSM noted that the Text Representation Committee was working on this, citing an article on the computer coding of IPA (Eslin, JIPA(88)18.2).
WP believed that these could be extended to cater for paralinguistic phenomena such as ingressive air streams, but that transcription would pose serious problems; he also discussed the problems posed by mixed transcriptions, using extended standard orthography. Such cases should be documented in the document header in terms of a phonetic feature set. It was noted that the current IPA standard was due for revision at Kiel last year, but that the recommendations are not yet public; due to time constraints, any proposals in the TEI Guidelines would therefore be based on the current draft proposals.

WP discussed a second category of problems, to do with suspicious or partial information: e.g. dotted letters, gaps with some indication of length or cause, one of a small set of possibilities, robust but partial information as used heuristically in speech recognition. It was noted that many of these concerns, notably segment lattices from spectrographic analyses, were isomorphic with problems already addressed by those working on textual variation. Degree of confidence could be regarded as a feature/value pair.

At the highest level, WP stated that almost all representations of phonology used hierarchic constituent structures, some of which might or might not be aligned in parallel (e.g. prosodic and metrical structures). These seemed isomorphic to the bracketing paradoxes discussed earlier. It was noted however that analysis of digitized speech might introduce new problems, which would be deferred to the second cycle.

WP to produce some examples showing how existing phonological analyses could be expressed as DAGs with alignment maps.
Due: no date

4. DICTIONARY REPORT

RA presented a report on progress within the subgroup, set up originally to attempt a unification of the various dictionary formats. There had been meetings at Waterloo and at Oxford.
Work had been carried out on extending the original Amsler/Tompa proposals (W4) in two areas, etymology (summarised in W15) and multilingual dictionaries (W20). RA stressed that the aim was to agree on a common tag set, not to define an all purpose DTD. Some examples of marked up dictionary entries were circulated for discussion (W8). Future work would involve standardisation of such things as part of speech tags.

NC suggested that a clearer balance between prescription and description was needed. MSM commented that SGML supported broadly two different approaches to this: in the first, a DTD for a dictionary following the so-called "Waterloo" syntax, textual elements with some structure were allowed to float freely in a soup:

] >

In the second, the so-called "Belgian" syntax, all data elements whatever their structure were regarded as exceptions to a model containing only character data:

] >

Discussion of the possibilities offered by the latter model ensued.

Returning to the dictionary encoding initiative, the work being carried out within the ESPRIT project at Pisa and elsewhere as described in NC's paper was noted. She and RA agreed to work towards ways of merging their tagsets, at both the individual and grouping levels, with a view to producing a common DTD suited to monolingual and bilingual dictionaries and writing it up for inclusion in the Guidelines as section 7.6.

NC, RA to write up a common dictionary tagset and DTD
Due: April 15 1990

5. LEXICON REPORT

RI reviewed the survey he had produced for the Grossetto meeting in collaboration with Susan Warwick. It had categorised machine-readable lexica under some 12 headings; it was noted that categorial distinctions were hard to make in all cases. E.g. some systems treated paradigm as a lexical feature and others as an independent concept.
NC suggested the report should be expanded to take account of the current European feasibility study for standardisation of lexical and terminological resources. LB asked whether in fact people did exchange machine-readable lexica, and whether the aim was to preserve what had been used in the past or to facilitate integration in the future. TL pointed out that future machine-readable texts would certainly contain pointers into electronic lexica. RA suggested that detailed discussion might be premature or politically sensitive. MSM proposed that the topic should be written up for committee circulation only, and might be included in the Draft Guidelines for June 1991.

RI to redraft the report, possibly in SGML terms, with examples before and after the conversion.
Due: no date specified

6. SUMMARY

It was agreed that the committee would recommend four general purpose tools rather than specific tags for all the varieties of linguistic analysis. The tools were a general feature structure grammar, a tree structure grammar, a simple unit and annotation model and a general alignment mechanism. There was some discussion of the need for tree structured grammars as distinct from feature structure; it was agreed that they provided a useful subset of features, as did the unit/annotation model. An abbreviatory mechanism, as provided by entity references, would be a useful way of hiding the complexities of the full feature structure syntax.

TL, MM, BS to draft a section introducing the general purpose tools recommended by the committee
Due: 31 March

7. LIST OF PAPERS

The following working papers, presented at the meeting, are here listed with the numbers assigned to them by the editors. Those in machine-readable form should also be available from the TEI-ANA file server shortly.

AI W3: Plan of action for remaining work of the A&I Committee (Langendoen)
AI W4: Notes on Dictionary Encoding (Amsler/Tompa)
AI W6: Untitled (A generic proposal for representing syntactic structures; ms notes, Marcus)
AI W6A: Untitled (Amended BNF appendix to AI W6A)
AI W7: Where is morphology? (ms notes, Simons)
AI W7A: Untitled (Revised version of AI W7)
AI W8: Examples of dictionary encoding (Amsler)
AI W9: [Sample instantiations - phonology]
AI W10: [Sample instantiations - syntax]
AI W11: [Sample instantiations - morphology]
AI W12: A conceptual modelling language (Simons)
AI W14: Preliminary draft proposal (Calzolari)
AI W15: Encoding etymology (Abate)
AI W16: Notes on encoding of linguistic analyses (Langendoen)
AI W17: A&I for a simple example (Langendoen)
AI W18: Untitled (Examples of syntactic diagrams from Marcus)
AI W19: Alignment notes (Ms notes, Santorini)
AI W20: Towards an SGML DTD for bilingual dictionaries (Fought/Van Ess)