Analysis and Interpretation Committee
Minutes of the meeting held at Oxford University Press, 21 September 1989

Lou Burnard
10 October 1989

Document Number: TEI AI M1

Present: Robert Amsler (RA); Steve Anderson (SA); Bran Boguraev (BB); Lou Burnard (LB); Nicoletta Calzolari (NC); Nancy Ide (NI); Terry Langendoen (TL), chair; Winfried Lenders (WL); Nelleke Oostdijk (NO); Bill Poser (BP); Beatrice Santorini (BS), replacing Mitch Marcus; Gary Simons (GS); Michael Sperberg-McQueen (MSM); Donald Walker (DW); Antonio Zampolli (AZ)

Final 10 Oct 89

0 INTRODUCTORY BUSINESS

TL welcomed members of the committee and drew their attention to the agenda previously circulated.

Expenses

Committee members are reminded to send all receipts for travel to MSM at UIC. A maximum per diem of $20 is payable without receipts; with receipts, up to 50% of estimated travel costs would be reimbursed. (The ceilings are thus $650 for North Americans and 315 ECU for Europeans at this meeting. -Ed.) North Americans should send originals and would be reimbursed (up to 50%) in US dollars; Europeans may send copies and would be reimbursed in ECUs.

Timetable

Three meetings were scheduled for the current funding cycle. A draft of the committee's guidelines was due by March 1990. The next meeting could conveniently (except for some Europeans) be held at the time of the LSA and MLA meetings in Washington (Dec. 28-30). The final meeting could be held in Tucson at the end of February. MSM gave an overview of the TEI timetable; funding for the second cycle had already been requested of NEH and a decision would be made by May 1990.

Committee Brief

TL stressed that the committee's task in the current phase was more to decide what to mark up than exactly how to do it. He had had the opportunity of discussing SGML with the chair of the Metalanguage Committee (David Barnard) and saw no technical reason why it should not support the committee's requirements, however imperspicuously.
The committee's work would undoubtedly take SGML into corners it had not as yet visited, but the Metalanguage Committee would have the job of identifying and then supporting any extensions to the standard which might be required. There was no requirement that the committee adopt SGML for its own internal working: its job was to produce tagsets.

He likened the committee's role to that of the body which defined IPA: it had to define a universal language for text structure, capable of supporting any one of a number of possibly inconsistent or mutually exclusive theoretical frameworks. No unification of theories was expected of it, nor would it be responsible for implementing any part of its proposals. As with IPA again, the object was to define an interchange format rather than an application-specific one.

0 GENERAL DISCUSSION

RA drew a parallel between the committee's task and that of defining purely typographic markup. It should focus on tagsets rather than on theoretical discussions.

BP asked whether linguistic rules should not be tagged as such, since they would necessarily form a part of the content of some texts to be tagged. TL said that a full analysis of formulae as such should be deferred to the second cycle. The goal should be to tag such things adequately to ensure their correct representation, independent of any particular formatter. If two theories are disjoint with respect to a particular feature (traces for example), it should be possible but not obligatory to supply it. BP stressed the importance of making provision for less orthodox theories using entirely different representations. TL said subcommittees should aim to be catholic, identifying as many theories as possible.

NC asked whether when considering existing markup schemes, e.g. TOSCA or LOB, it would be necessary to identify conversion rules, grouping tags according to their function. TL said that such mappings were the responsibility of the Metalanguage Committee.
The main focus should be on linguistic issues and alphabetic texts.

BP asked about polylinguistic texts and non-Roman alphabets, for example Gardiner's Egyptian Grammar, or Japanese texts with embedded Chinese characters. TL replied that the Text Representation committee would address these problems, although the question of syntactic features linking parallel structures was within the A&I committee's brief. MSM said that the question of synchronised structures was a good example of an area where several committees would need to work, as were cross-reference and the markup of discontinuous text segments and of segments with unclear boundaries. He stressed the importance of good communication between the committee heads and the editors in this respect.

TL stated that linguistic markup should include all of semantics and pragmatics and, acknowledging the point of view of certain linguists, such as George Lakoff, that all domain-specifying categories are artificial, contended that the markup should make it possible to smear all these distinctions.

WL asked whether the intention was to provide a language-independent superset of tagsets, citing the MATER standard (ISO 6156) as one which had found the need to include a German-specific appendix. TL stated that his personal preference was for a superset. DW pointed out that no particular tagset would use all available tags and MSM that the Steering Committee had already decided tagsets should be extensible and renameable.

0 SUBCOMMITTEE WORK

Dictionary Encoding Subcommittee

RA reported on the Dictionary Encoding Standard worked out in collaboration with Frank Tompa at Bellcore, and circulated copies of their joint paper describing it. RA said that the paper needed many more examples and more descriptive text. He invited comments from the committee and indicated that he would circulate any comments received on extensions to it (also see "4. The Amsler/Tompa paper will be given a number").
RA is chair of the dictionary subcommittee; BB and NC are members. Two areas of activity were proposed for the subcommittee: one was to broaden the work to include non-English and multilingual dictionaries and the other to consider etymology. Frank Abate had volunteered to investigate the latter. Alain Pierrot was collaborating with NI on a DTD for Hachette's dictionaries. John Fought and Carol Van Ess-Dykema were developing a multilingual standard, with language-specific extensions for each language.

AZ asked whether this subcommittee would also deal with machine-tractable dictionaries, or electronic lexica. RA replied that there was considerable overlap of interests among the proposed membership. A brief discussion of the wisdom, or lack of it, of recoding lexica expressed in LISP in SGML ensued. AZ opined that it was organisationally preferable for this subcommittee to focus only on printed dictionaries, aiming at a neutral interchangeable format. Output from other subcommittees (morphology for example) would be useful at a later stage in the project.(1)

Other members proposed for the subcommittee were Susan Warwick (ISSCO, Geneva), Carol Van Ess-Dykema (NSA), John Fought (U Penn) and additional representatives from the groups at Bellcore, IBM, Pisa and IKP (Bonn). Communication between these and commercial publishers was important.

Phonetics/Phonology Subcommittee

The work of this group, chaired by Bill Poser, would to some extent overlap with that of the Text Representation committee. It would additionally address such issues as hesitation, intonation, and overlapping, and the correspondence between phonemes and graphemes, but would need to prioritize these carefully. AZ reminded the committee of the importance of supporting the needs of the speech synthesis community in this context. RA asked whether the dictionary encoding should attempt to define phonemic equivalences in IPA or something else: his view was that they should not.
After lunch, the following were suggested as possible members for this subcommittee: Ken Church and Mark Liberman (AT&T); Henry Thompson (Edinburgh); Jared Bernstein (SRI); Lauri Lamell (MIT); Janet Pierrehumbert (Northwestern); Brian MacWhinney (Carnegie-Mellon); Paul Roossin (IBM); Bob Mercer (IBM); Jan Svartvik (Lund); John McCarthy (U Massachusetts).

BP said that the work of the subcommittee should include "gestural stuff": its task was not to propose a Klatt-style "ARPAbet" but to make it possible for anyone who wished to use one to define such a code in a portable way. He asked where the kind of phonemic markup employed should be specified and whether its semantics should be specified with reference to e.g. IPA. MSM saw this as another area where this committee's work overlapped with that of another: the Text Documentation committee would provide a space into which declarations of this kind could be placed, but little more. If texts include application-specific data (for example F0 values), it was clearly necessary to provide portable ways of interpreting them.

Morphology Subcommittee

This subcommittee currently comprises Steve Anderson (chair), Winfried Lenders and Gary Simons. It will address such standard issues as the delimiting and classifying of words, aiming at generic solutions rather than value lists. SA suggested that many substantive categories such as dialectal or usage variants are not morphological but lexical information. The subcommittee's focus should be on the representation of the internal structure of words, recognising however that simply delimiting morphemes would be inadequate for discontinuous segments (e.g. in Arabic) or for the use of such tricks as ablaut in Germanic languages, or metathesis in Salish to render aspectual distinctions. SA suggested that the most promising line was to identify and generalise the relationships existing between different forms, regarding morphology as the internal syntax of words.
Members proposed for the subcommittee included: Martin Chodorow (Hunter C, CUNY); Richard Sproat (AT&T); Kimmo Koskiennemi (Helsinki); Lauri Karttunen (Xerox PARC); Jorge Hankamer (UC Santa Cruz); Mark Aronoff (SUNY Stony Brook); John McCarthy and Lisa Selkirk (U Massachusetts); Burghardt Schaeder (Siegen).

There was some discussion of the level of generalisation appropriate to the subcommittee's work. NC and AZ pointed out that for most people identifying the lexical item (lemma) appropriate to a surface form was of far more importance than its internal structure. AZ asked whether compound words would also be considered. SA replied that these were a special case of the general rule. TL said it was important to support different levels of analysis. RA suggested that some redundancy in the encoding would be a helpful way of supporting this. He also recommended that as much language-specific information as possible should be identified and shared amongst members of the subcommittee. BP remarked on the existence of many large corpora of Amerindian languages exhibiting many unusual features. RA recommended the use of consultants with expertise in these areas, mentioning David Nash for Australian Aboriginal languages.

Resuming the earlier discussion, MSM pointed out that the coding schemes used by existing tagged corpora often blurred lexical, syntactic and morphological distinctions. He felt that it was enough to identify places where a value could be recorded without attempting to unravel its semantics. RA noted that the DEI often specified alternative ways of encoding a given feature; GS that tags were treated isomorphically with data in the Brown Corpus. AZ was firmly of the opinion that a well defined set of values should be identified, for example, for part of speech, rather than an open ended set.
RA remarked that SGML gave us a better notation than that available to earlier projects which often needed to attach attribute values to every token because they lacked the notion of markup distributed throughout a text.

Syntax subcommittee

BS reported on behalf of Mitch Marcus who had been asked to form this subcommittee, together with herself, NO and Hans Uszkoreit. They felt that whatever was to be provided should be able to specify both ambiguous and hierarchic syntactic structures, to cope with a variety of reanalysis phenomena and other syntactic ambiguities. A single word (e.g. the Japanese causative) might require a bi-clausal analysis. Multiple simultaneous representations of a string might be needed, for example "(take advantage [of) John]." TL remarked that David Barnard had stated that such things could be managed by SGML. SA asked whether it was also capable of Postal-style arc-pair grammar.

On ambiguity, RA highlighted the need to distinguish the deliberately ambiguous from the merely vague, contrasting "the duck is ready to eat" with "light house keeping". NO asked whether idiomatic and figurative phrases belonged in this subcommittee: most present agreed with her that they did not. Idiomatic phrases formed a convenient unit, but were not in fact phrases. They also had multiple class membership. TL mentioned the need to support inheritance of properties within a hierarchy by placing attributes as high as possible in the tree: if tense is only marked as a property of verbs, it becomes difficult to deduce the tense of sentences.

The following people were suggested for this subcommittee: Annie Zaenen (Xerox PARC); James Pustejovsky (Brandeis); Geoffrey Leech (Lancaster); Geoffrey Sampson; Robin Fawcett (Cardiff); Beth Levin (Northwestern); Eric Wehrli (UCLA); Gerald Gazdar (Sussex); Don Hindle (AT&T); Elisabet Engdahl (Edinburgh).

0 ACTIONS

1.
Subcommittee chairs should co-opt people to their committees and produce interim reports by 1 November, 1989.

2. Subcommittee chairs were requested to save and precis for distribution all correspondence with potential members. TL offered the services of his office for assistance with TEI work, particularly in sending documents out to subcommittee members.

3. All working documents should be sent to TL who would assign numbers and post them on the TEI-ANA server.

4. The Amsler/Tompa paper will be given a number shortly, and comments were requested as soon as possible, particularly with respect to extending its scope to include polylingual and non-English dictionaries and to discussing the etymology problem.

5. The committee should agree the structure of an overall interim report. It was agreed to defer discourse analysis to the second cycle.

6. MSM requested that draft documents be distributed using some form of descriptive markup to simplify their later reuse. He and LB are working on a tagset for this purpose, but committee members should not await its appearance before putting finger to keyboard. These minutes reflect this (mostly LB's doing, but TL has lent a hand).

7. Minutes of the meeting to be distributed by the end of September 1989.

8. TL to try to contact Hans Uszkoreit.

9. NI to circulate details of the TEI workshop at the MLA meeting in Washington in December, so that our next committee meeting can be definitely scheduled (see "Remaining Meetings").

0 REMAINING MEETINGS

The next meeting will be held in Washington, DC in conjunction with LSA and MLA meetings, either 28 or 29 December, so as not to conflict with the TEI workshop at MLA.

The final meeting will be held in Tucson, AZ in late February or early March, 1990.
-------------------------
(1) Since the meeting, RA has suggested adding Robert Ingria as co-chair of the Dictionary subcommittee and extending its mandate to include the development of recommendations for encoding electronic lexica.