Computational Lexica Work Group
Minutes of Meeting Held at Berkeley
16 June 1991
Robert Ingria
Document Number: TEI AI6M2

The TEI working group on computational lexica held its second meeting on June 16, 1991 (Bloomsday), at the University of California, Berkeley, California. Present were Robert Ingria (chair), Nicoletta Calzolari, James Pustejovsky, and Susan Warwick-Armstrong. The meeting began with a reading of the first paragraph of _Ulysses_ by the Chair.

The first business item was a review of the previous meeting. As a result of this review, the following agenda was drawn up:

    I.   Review Previous Meeting
    II.  Review Related Activities
    III. List Sites and Systems to be Surveyed
    IV.  Discuss Parsing/Syntax Fields
    V.   Reformulate Objectives

A detailed discussion of each of these agenda items follows.

I. Review of Previous Meeting

After the minutes of the previous meeting were reviewed, the following items were handed out to the participants:

By the Chair:
    An SGML version of TEI document AI6 P1
    A text version of TEI document AI6 P1
    Discussion notes for the previous meeting
    The minutes of the previous meeting
    Discussion notes for this meeting
    Notes from the TEI Print Dictionary Group Meeting - April 3rd, 1991 (by Susan Warwick-Armstrong)
    An EMail discussion of SGML crystals
    A text version of TEI document AI1 W2
    The minutes of the DARPA Spoken Language Systems Steering Committee meeting establishing a Common Lexicon Working Group
    The kick-off message of the DARPA SLS Common Lexicon Working Group
    A brief note on the differences between the lexical requirements of spoken and written language systems
    ``Notes for Dictionary Encoding Initiative Workshop'' by Robert A. Amsler and Frank Wm. Tompa
    Descriptions of the CELEX English, German, and Dutch databases

By Nicoletta Calzolari:
    ``Standards for Syntactic Description'' by Peter Hellwig, Torsten Minkwitz, and Heinz-Detlev Koch
    ``An Architecture for Reusable Lexical Resources in a Multi-Theoretical Environment'' -- DOC-8 of the Eurotra-7 Study
    _Computational Model of the Dictionary Entry: Preliminary Report_ ACQUILEX Report ILC-ACQ-1-90

By Susan Warwick-Armstrong:
    ``Apply an experimental MT system to a realistic problem'' by Pierrette Bouillon and Katharina Boesefeldt
    ``A Practical Approach to Multiple Default Inheritance for Unification-Based Lexicons'' by Graham Russell, Afzal Ballim, John Carroll, and Susan Warwick-Armstrong

Other Enclosures:
    Report on Activities of SIGLEX, the Special Interest Group on the Lexicon
    The Consortium for Lexical Research

II. Review of Related Activities

To make the working group aware of activities that should be tracked, we discussed various related activities throughout the world.

North American activities:

1. Dictionary Encoding Initiative (DEI):
    Chairman - Robert Amsler (MITRE Corporation)
    Steering Committee:
        Martha Evens (Illinois Institute of Technology)
        Nancy Ide (Vassar College)
        Robert Ingria (BBN)
        Frank Tompa (University of Waterloo)
        Susan Warwick-Armstrong (ISSCO/University of Geneva)

    The DEI was envisioned by Amsler and Tompa as an organization for sharing lexical resources. Independently of the TEI, they produced an SGML standard for the exchange of MRDs. Subsequently, the DEI was subsumed under the TEI because of the similarity of goals. Recently, the DEI has shown signs of reviving and may look for funding. The relation between the DEI and the CLR (see below) and the ACL/DCI is unclear.

2. Consortium for Lexical Research (CLR):
    Located at New Mexico State University, Las Cruces, New Mexico.
    Director - Yorick Wilks (NMSU)
    ACL committee:
        Roy Byrd (IBM)
        Ralph Grishman (NYU)
        Mark Liberman (UPenn)
        Don Walker (Bellcore)

    The CLR has DARPA funding for its first three years and is to become self-supporting thereafter. Its goal is to provide to the whole NL research community resources now available only to a few. In particular, it will serve as a ``broker'' between lexical information providers (e.g. dictionary companies) and researchers. This organization grew out of a suggestion by Roy Byrd for the formation of a Lexical Consortium, first made at the February, 1989 DARPA Speech and Natural Language Workshop in Philadelphia.

3. ACL Special Interest Group on the Lexicon (SIGLEX):
    Head - James Pustejovsky (Brandeis)

    Pustejovsky is in the process of recruiting the following people for a steering committee:
        Robert Ingria (BBN)
        Beth Levin (Northwestern)
        Nicoletta Calzolari (CNR, Pisa)
        Susan Warwick-Armstrong (ISSCO)

    Robert Amsler is interested in organizing a workshop on lexical acquisition under its auspices at the next ACL meeting. There may also be a European meeting in San Marino under SIGLEX auspices.

4. DARPA Common Lexicon Committee:
    Head - Robert Ingria (BBN)
    Committee Members:
        John Bear (SRI)
        Jim Glass (MIT)
        Marcia Linebarger (Unisys)
        Mike Riley (AT&T)
        Robert Weide (CMU)

    This working group was formed to produce a common lexicon format for all the sites that are part of the DARPA spoken language research community, for the exchange of phonetic, syntactic, and possibly semantic information. See enclosure for more details.

5. The UPenn Treebank:
    Head - Mitch Marcus (UPenn)
    Steering Committee:
        Ken Church (AT&T)
        Jordan Cohen (IDA)
        Robert Ingria (BBN)
        Don Hindle (AT&T)
        Fred Jelinek (IBM)
        Mark Liberman (UPenn)

    This is a DARPA-funded project ($5 million) to produce a bracketed corpus of English and other languages, much like the British National Corpus or the Birmingham corpus.

6. ACL Data Collection Initiative (ACL/DCI):
    The ACL/DCI is a collection of texts made available to the NL research community. Among the texts is the first edition of the Collins English Dictionary in SGML format.

European activities:

Susan Warwick-Armstrong presented the following overview of funding in Europe, as a background to a presentation of the current European projects.

The European Community funds various projects through ``framework programs''. One of these is telecommunications; within telecommunications there is funding for ESPRIT (= European Strategic Program for Research in Information Technology). There are two types of ESPRIT funding: ESPRIT 2, with industrial partners; and ESPRIT BRA (= Basic Research Action), without industrial partners. Framework programs are funded for a fixed period of time; we are currently within framework 2, and moving to framework 3. Framework 3 will run for 2 or 3 years; there will be a new round of ESPRIT funding soon.

Eureka projects are not funded by the European Community but by the member nations, each of which contributes money directly.

Eurotra phase 1 funding ended last year and a new phase began, lasting 1.5 years.

ESPRIT has 4 sites in Brussels. Eurotra is independent and has 2 sites in Luxembourg.

7. Acquilex:
    This is an ESPRIT BRA for lexical acquisition from MRDs. The goal is to produce a common Lexical Knowledge Base for 4 languages: English, Italian, Dutch, and Spanish. There will be both monolingual and bilingual components; there are monolingual entries for all 4 languages, and some bilingual entries. There will be 2,000 entries. The project began in October, 1989 and concludes in May, 1992. Plans are underway for a follow-up to Acquilex.
    Acquilex involves 4 universities:
        U.K.: Cambridge (Ted Briscoe)
        Italy: Pisa (University and CNR) (Nicoletta Calzolari)
        The Netherlands: Amsterdam (Willem Meijs and Piek Vossen)
        Spain: Barcelona (Verdejo)
    Dublin will join later, to serve as a test bed for the data created by the other sites.

8. Multilex:
    This is an ESPRIT 2 project. The goal is to develop a common lexical format. There are 2 phases, each of 18 months, each involving 1 industrial partner:
        Phase 1: Bochard Cap Gemini (Paris) - develop standards specification
        Phase 2: Triumph-Adler (Germany) - apply standards
    Academic partners:
        University of Pisa
        Surrey University
    This project has just begun. It is intended to apply to many languages, for multiple applications.

9. Genelex (GENeral LEXicon):
    This is a Eureka project. It began in France and so far has dealt only with Romance languages. Participants include Gsi-ERLI (Paris: Marc Nossin), a software company (and competitor of Cap Gemini), IBM France, Bull, Hachette, and others. There are also partners in Italy (Pisa) and Spain. The goal is to develop a standard for lexicons for different applications. All results so far are private.

10. ET-7 (Eurotra-7):
    This is a feasibility study that grew out of the Grosseto workshop on the lexicon. It is a survey of lexical resources to see if it is possible to have a common lexicon. The project ends in 1 month. It will make proposals for the future direction of the field. A questionnaire has been sent to many sites; much material has been collected. It is co-ordinated by Ulrich Heid in Stuttgart.

11. ET-10:
    Part of the Eurotra effort is a continuing project to relate different formalisms. ET-7 and ET-10 fall under diversity funding, which has 3 times the funding allotted to specific projects. Diversity funding allows for non-EC funding. ET-10 involves lexicon development.

12. The Network of European Corpora:
    This is an EC project which starts soon, funded through Luxembourg.
    This is a large feasibility study for the design of a balanced corpus. The goal is to develop general software tools for linguistic annotation and knowledge extraction, including lexical knowledge. This is an 18 month project with 6 groups:
        Pisa: CNR - Antonio Zampolli (Head)
        Paris: TLF (Trésor de la Langue Française) - Bernard Quemada
        Birmingham:
        Leiden:
        Mannheim:
        Malaga: Alvar

Japanese activities:

13. EDR (Japanese Electronic Dictionary Research Institute):
    This was set up in 1986 to develop electronic dictionaries for Japanese and English. They are also developing a corpus (the EDR corpus) to assist dictionary work. This is a 9 year project; the budget should reach 14 billion yen (approximately $100 million) by the end of fiscal year 1994. There are different types of dictionaries, as follows:

    Word dictionaries: These are monolingual dictionaries for both English and Japanese. Each contains both general vocabulary and technical vocabulary. The goal is approximately 100,000 words for the dictionary of each language.

    Co-occurrence dictionaries: These are monolingual dictionaries for both English and Japanese; the goal is approximately 300,000 words for each language.

    Bilingual dictionaries: There are bilingual dictionaries for both English to Japanese and Japanese to English; the goal is approximately 300,000 entries in each dictionary.

    Concept dictionary: This contains a classification hierarchy and concept descriptions for approximately 400,000 concepts. Some concepts are common to the two languages, others are peculiar to one or the other.

    The EDR corpus: This contains 20 million sentences of Japanese and 20 million sentences of English. KWIC data is being developed for these 2 corpora. The total size is 15 gigabytes. The sources are newspapers and encyclopedia articles.
    Also, they have selected 500,000 sentences each from Japanese and from English to be analysed: analysis will be both syntactic (part of speech and parse trees) and semantic (associated concepts). (The parse trees are assigned automatically, then manually post-edited.) There will also be information about the logical structure of each analyzed document.

III. List of Sites and Systems to be Surveyed

In our previous meeting, we had discussed surveying systems in terms of the type of system (i.e. parsing, generation, machine-translation, etc.). Susan Warwick-Armstrong suggested that a better approach would be to survey systems in terms of the types of information they store:

    syntax - mostly parsing systems, with some generation (this includes parsing and generation as parts of machine translation systems)
    semantics - cuts across many different types of systems
    translation information - for machine translation

Since much of generation involves text-planning, it was decided to stay away from this aspect of generation and look only at the lexical roots of realization mechanisms.

Syntax:

Ingria's list from Grosseto is fine as far as it goes, but it is mainly U.S. based. The working group members produced the following list of systems to look into and people to contact, in European countries and elsewhere.

U.K.:
    Alvey - Graeme Ritchie (graeme@aipna.edinburgh.ac.uk)
    Ted Briscoe
    Claire Grover (did grammar)
    SRI-Cambridge - Steve Pulman (sgp@cam.sri.com)
    Edinburgh - several projects:
    Jo Calder
    Ewan Klein (ewan@epistemi.edinburgh.ac.uk@NSS.Cs.Ucl.AC.UK)
    Sussex - Gerald Gazdar (geraldg@cogs.sussex.ac.uk)
    Manchester ET - Jock McNaught (jock@ccl.umist.ac.uk)
    DATR - James Kilbury (kilbury@ddorud81.bitnet)

France:
    LADL (Paris VII) - Anne Abeille (abeille@zeta.ibp.fr)
    - Maurice Gross - a syntactic lexicon
    IRIT - Patrick Saint-Dizier (stdizier@irit.irit.fr)
    Gsi-ERLI - Marc Nossin (mn@gsierli.uucp)
    Misc: Dominique Estival (estival@divsun.unige.ch)
    Jean Veronis (VERONIS%VASSAR.BITNET@cunyvm.cuny.edu)
    Nancy Ide (ide%vassar.BITNET@cunyvm.cuny.edu)
    C. Fouquere (cf@lipn.univ-paris13.fr)
    BDLEX - (perennou@frcict81.bitnet)
    Cap Gemini Innovation - (bilange@csinn.uucp)
    IBM, Paris
    Edinburgh/French collaboration - Alan Black (awb@ed.ac.uk)

The Netherlands:
    Utrecht: Lexic project - Pim Van der Eyck (vandereyk@let.ruu.nl)
    Amsterdam - Remko Scha (scha@alf.let.uva.nl)
    Piek Vossen
    Willem Meijs (links@alf.let.uva.nl)
    Tilburg: Hap Kolb (KOLB%HTIKUB5.BITNET@CUNYVM.CUNY.EDU)
    Reusability issues - Craig Thiersch (THIERSCH%HTIKUB5.BITNET@CUNYVM.CUNY.EDU)

Sweden:
    Lexical database for Swedish
    Contact: Godrun Magnusdottir (usdgm@seguc21.bitnet)
    U. Sweden - (brodda@ling.su.se)
    - Anna Sagvall Hein (uduas@seudac21.bitnet)

Norway:
    TROLL (Trondheim)

Finland:
    PNLP, IBM Finland - Harri Jappinen
    Lauri Carlson (lcarlson@finuh.bitnet)
    Misc: Kimmo Koskenniemi (koskenniemi@finfun.bitnet)

Switzerland:
    ISSCO:
    GB-Parsing - Eric Wehrli (wehrli@uni2a.unige.ch, wehrli@cgeuge51.bitnet)
    ELU (Multi-word entries and lexical inheritance) - Graham Russell (russell@divsun.unige.ch)
    French unification grammar - Dominique Estival (estival@divsun.unige.ch)
    Misc: Michael Rosner
    Rod Johnson

Italy:
    Pisa: ATN parser
    IBM, PNLP Grammar
    Trento: Contact - Oliviero Stock (stock@irst.it)
    Alberto Lavelli (lavelli@irst.it)
    Venice: LFG - Rodolfo Delmonte (delmonte@iveuncc.bitnet)
    Salerno: Maurice Gross style work

Spain:
    PNLP, IBM, Spain, lexical database

Portugal:
    Misc: Luis Damas (luis@nccup.ctt.pt)

Israel:
    Contact: Shalom Lappin (lappin@ibm.com)

Austria:
    GWAI

Czechoslovakia:
    Jan Hajic - (hajic@cspuni12.bitnet)

Germany:
    Saarbruecken: Hans Uszkoreit (uszkoreit@coli.uni-sb.de)
    John Nerbonne (nerbonne@dfki.uni-sb.de)
    LFG: Klaus Netter (netter@dfki.uni-sb.de)
    Stephan Busemann (busemann@dfki.uni-sb.de)
    Wolfgang Wahlster - (wahlster@dfki.uni-sb.de)
    Manfred Pinkal (pinkal@coli.uni-sb.de)
    Stuttgart: Ulrich Heid (heid@informatik.uni-stuttgart.de)
    Remi Zajac (zajac@is.informatik.uni-stuttgart.dbp.de)
    Stefan Momma (momma@informatik.uni-stuttgart.de)
    Polygloss: polygloss@informatik.uni-stuttgart.dbp.de
    IBM, Stuttgart - Peter Bosch (bosch@ds0lilog.bitnet)
    IBM, Heidelberg - Brigitte Barnett
    Hubert Lehmann (leh@gysvmhdi.bitnet)
    Herbert Leass (leass@dhdibm1.bitnet)
    Peter Hellwig (c87@dhdurz1.bitnet)
    Bonn - IKP Center: Jan Brustkern (upk001@dbnrhrz1.earn)
    Willec
    Tuebingen - Erhard Hinrichs (hinrichs@mailserv.zdv.uni-tuebingen.de)
    GMD: Karin Haenelt (haenelt@darmstadt.gmd.dbp.de)

Japan:
    Jun-ichi Tsujii

Canada:
    Toronto: Graeme Hirst (gh@cs.toronto.edu)
    Simon Fraser: Dan Fass (fass@cs.sfu.ca)
    Nick Cercone (nick@cs.sfu.ca)
    Richard Kittredge

USA:
    Brandeis/CTI lexicon: James Pustejovsky (jamesp@chaos.cs.brandeis.edu)
    MIT: SLS: Jim Glass (jrg@goldilocks.lcs.mit.edu)
    IBM: Bran Boguraev (bkb@ibm.com)
    Ezra Black (BLACK@ibm.com)
    NYU: Tomek Strzalkowski (tomek@cs.nyu.edu)
    CMU: SLS: Robert Weide (weide@speech1.cs.cmu.edu)
    Unisys: Marcia Linebarger (marcia@prc.unisys.com)
    AT&T: SLS: Mike Riley (riley@research.att.com)
    Fidditch: Don Hindle (dmh@alice.att.com)
    Bellcore: Steve Abney (abney@arantu.bellcore.com)
    NMSU CRL: Yorick Wilks (yorick@nmsu.edu)
    Louise Guthrie (louise@NMSU.edu)
    ISI (MT and Generation): Ed Hovy (hovy@vaxa.isi.edu)
    HP Labs: Dan Flickinger (flickinger@hplabs.hp.com)
    SRI: John Bear (bear@ai.sri.com)
    Xerox PARC: LFG & MT: Annie Zaenen (zaenen.pa@xerox.com)
    Boeing (Washington): GPSG: Phil Harrison (phil@atc.boeing.com)

Semantics:

A subset of the institutions above.

Machine Translation:

EEC:
    Eurotra lexicon format: Tom Gerhardt (tom_gerhardt@eurokom.ie)

U.K.:
    UK-Eurotra - Doug Arnold (doug@eurotra.sx.ac.uk)

France:
    Eurotra-France: Laurence Danlos (laurence_danlos@eurkom.ie)

The Netherlands:
    Utrecht: Mimo - Herbert Ruessink (ruessink@let.ruu.nl)
    BSO DLT - dlt (sadler@dlt1.uucp)
    Philips/Rosetta - Jan Landsbergen (landsbrn@nla276.prl.philips.nl)

Denmark:
    Eurotra DK - Bente Maegaard (bente_maegaard@eurokom.ie)

Switzerland:
    Machine translation entries, etc. - Susan Warwick-Armstrong (susan@divsun.unige.ch)
    Avalanche report MT system - Pierrette Bouillon (pb@divsun.unige.ch)
    Grenoble - Christian Boitet
    Jean-Philippe Guilbaud (almand@frgeta11.bitnet)

Italy:
    Pisa: Eurotra Lexicon - Nicoletta Calzolari (glottolo@icnucevm.cnuce.cnr.it)

Spain:
    Eurotra group
    Siemens MT

Greece:
    Eurotra group - George Carayannis (head) (karayan@theseas.ntua.gr)

Czechoslovakia:
    Prague

Belgium:
    Metal-Leuven - siegeert@cs.kuleuven.ac.be

Germany:
    Siemens AI and MT (METAL) Groups - (metal@ztivax.siemens.com)
    German Eurotra - iai (cor@iai.uni-sb.de)
    Saarbruecken - CAT (old Eurotra format)
    Randy Sharp (randy@iai.uni-sb.de)
    Heinz-Dirk Luckhardt (dlu@rz.uni-sb.de)

Japan:
    Kyoto University - Makoto Nagao (nagao@nagao.kuee.kyoto-u.ac.jp)
    Electronic Research Laboratory - Toshio Yokoi (yokoi@edrrp.edr.co.jp)
    Misc - Jun-ichi Tsujii (tsujii@ccl.umist.ac.uk)
    Tomoyoshi Matsukawa (kddlab!edr5r.edr.co.jp!matsu@uunet.uu.net)

Canada:
    METEO - Pierre Isabelle (isabelle@ccrit.doc.ca)

USA:
    IBM: MT - Michael McCord (mccord@ibm.com)
    CMU: MT - Sergei Nirenburg (sergei@nl.cs.cmu.edu)
    Lori Levin
    University of Maryland - Bonnie Dorr (bonnie@umiacs.umd.edu)
    NMSU - Yorick Wilks (yorick@nmsu.edu)
    ISI (MT and Generation) - Ed Hovy (hovy@vaxa.isi.edu)
    Xerox PARC - LFG & MT: Annie Zaenen (zaenen.pa@xerox.com)
    Pan American Health Organization (PAHO) - SPANAM - Spanish/American System
    Muriel Vasconcellos
    (commercial)
    ALPS - (no longer maintained)
    Alan Melby (HRCMELBY@byuvm.bitnet, 8013773704@fax.uic.edu)
    Logos
    Metal
    U Texas: Scott Bennett
    Winfield S. Bennett (Bennett@I1.CC.untexas.edu)
    Systran

After compiling this list of sites and systems to be surveyed, we discussed the output of the survey. It should contain an explanation and justification of the structures for lexical information that we propose. It should contain examples of each point in action.

We then discussed what we should ask for in our survey.
We decided that rather than asking for specific words or specific types of entries, we should ask for particular types of words. We produced an initial sketch of the type of information to ask about for different categories:

    nouns of different types, with examples:
        entity nouns - book, apple
        relational nouns - speed, age, height, father, brother
        abstract nouns - courage, love
        mass nouns - wine, sand
        proper nouns - Jane, Europe
    verbs:
        subcategorization - transitive, intransitive, taking finite clause, taking infinitival clauses (Raising, control, etc.) etc.
        for Case-marking languages like German, the Case marked on the object, e.g. ``helfen'' (Dative) vs. ``sehen'' (Accusative)
    adjectives:
        semantically intersective vs. non-intersective
        representation of subcategorization
    prepositions:
        different types: Case-marking (such as ``of'' in ``pictures of John'') vs. relational (e.g. locatives)
    multi-word entries
    morphology:
        relation of base and inflected forms

Robert Ingria and James Pustejovsky are to produce a draft by June 28th, distribute it to the other members of the working group, and then send out the final version.

IV. Discussion of Parsing/Syntax Fields

The working group then discussed the notes for a common lexical entry for syntactic/parsing systems produced by Robert Ingria. A number of issues that his proposal left unaddressed were raised:

1. The use of re-entrancy, especially common in complex-feature based systems.
2. The lack of any explicit statement about the use of pointers (to other lexical entries).
3. The use of inheritance in complex-feature based systems, and how this affects the usability of information put into interchange format. (Inheritance can produce compact entries that are incomprehensible to those who do not know what is being inherited.)
4. The use of defaults, which raises similar issues.
5. The use of typing for features.
6. The use of internal vs. external forms in lexical entries.

V. Reformulation of Objectives

In conclusion, the members of the working group produced the following objectives:

1. By June 28th, Robert Ingria and James Pustejovsky should produce a draft of the survey to be sent out.
2. This draft is to be circulated to Nicoletta Calzolari and Susan Warwick-Armstrong for comments.
3. On July 10th, the survey letter should be sent out, with a suggested deadline of one month.
4. The working group will produce a synthesis of the information obtained, by the end of September, to present at the Steering Committee Meeting in Oxford on October 2.
5. The working group will produce a first draft of standards by the first week of November, to be discussed at a meeting in Europe.
6. The working group will produce a new draft by the end of December, for inclusion in the TEI report.
7. Nicoletta Calzolari will distribute the results of the Pisa survey of lexical resources and other similar materials, as they become available.