Date: Mon, 5 Mar 90 15:55:33 MET
Reply-To: Text Encoding Initiative - Text Analysis and Interpretation Committee, nicoletta
Sender: Text Encoding Initiative - Text Analysis and Interpretation Committee
From: nicoletta Terry Langendoen
To: Michael Sperberg-McQueen

Dear Colleagues,

I am circulating here our preliminary draft proposal for the description of a dictionary entry. The notes in this letter are only an introduction to clarify some points.

Preliminary Draft Proposal

The description of the computational model of the dictionary entry to be used within the Project will consist of two separate sections. (What we are sending now is a preliminary draft of the first one.)

The first is an explicit standardized representation of the content of the different dictionaries to be used within the Project. This representation is essential for the following reasons:

- it makes it possible to preserve the source text (the dictionaries) for further applications;
- it allows the exchange of data in a common format;
- it makes it possible to write parsers for the different dictionaries on the basis of a common format;
- it permits us to have uniform types of analyses of the different dictionaries;
- it makes it easier to design a common model for the Project Database Dictionary Entry, into which all the existing dictionary representations must be mapped, on the basis of the representations of existing data;
- it makes it easier to standardize the contents (the values) of some of the fields;
- it makes it possible to load the existing dictionaries into the common Project Database Dictionary.

The second part of the computational model will consist of a description of the common Project Database Dictionary Entry. This will be improved and/or augmented at the different stages of development of the database itself. This description will be more complex than that of section one, because it will also contain the specification of the database model (with types of access, indices, etc.).
The two sections together will constitute our integrated proposal for the Computational Model of a Dictionary Entry. In this preliminary draft we propose a language for representing the information present in the different dictionaries that we are analyzing and using in the Project (as was agreed in Pisa in December together with the Cambridge partners, and as has been communicated by Cambridge to the other partners). This proposal is based, at the moment, on the analysis of the dictionaries used here in Pisa - a monolingual, the Garzanti, and a bilingual, the Collins -, and is therefore in no way intended to be exhaustive as to the types of information that have to be represented, and therefore as to the set of tags. If the proposal is accepted by the other partners, the intention would be that they should then try to describe their dictionaries on the basis of the proposed set of tags, and where they find the tags insufficient, or not correct, or must in some way be changed, they should make additions, corrections, etc., if possible, within the same general framework. We should receive all the material before the 20 of February, in order to be able to prepare the final list of tags, descriptions, sets of values, templates, etc. On the basis of the information received from all the partners, decisions will then be made (as far as possible all together) as to the second part, i.e. the design of the Project Database Dictionary Entry Model. Our proposal consists of three parts: 1) A Dictionary Entry Template for each dictionary. We have prepared two of these templates, for the Garzanti and Collins. Each partner should add the compatible templates for all its dictionaries. The Template is a representation of the "maximal" entry in the dictionary, i.e. all possible fields should be foreseen in all possible positions. This Template should be useful for instance for the parser which will have to decode the dictionary entries. 
It is similar to that one designed by B.Boguraev for the Longman and Collins dictionaries. An Abbreviated Dictionary Entry Template has also been included with only the main GROUP_nodes (see 2a below), without their children. 2) A List and description of the semantics of all the Tags used in the Template. This list is not exhaustive, and will be augmented and corrected to include the representation of phenomena or data present in other dictionaries. The Tags are of two types: a) Node Tags, which do not have a value, but dominate a group of other tags. They have the string 'GROUP' in their name, and are indicated by the name followed by the 'equal' sign and by an informal description. We have e.g.: MORPH_GROUP = node grouping information... b) Attribute Tags, which take values. They are in lower-case, and are indicated by the name followed by a 'colon' and by a description of the possible values. We have e.g.: morph_label : takes as value any label.... The list of possible values for each Attribute-tag is given in the Appendix (see 3 below). 3) An Appendix which lists or defines all the possible values which each Attribute-tag can take in each different dictionary, i.e. its domain. We have not yet completed the lists for our dictionaries, but we have given at least some values for each attribute in order to give an idea of its content. In the future, at least for some of the Attributes, we could try to standardize the names of the values to be used within the Project. The other partners should let us know as soon as possible those parts which they feel should be immediately modified because they are "so general and basic" as to involve immediately also the work of the other partners. Otherwise we expect to receive not later than 20 of February: a) the Templates for all the dictionaries; b) the additions, corrections, etc., to the List of Tags; c) the sets of values for all the dictionaries for the Appendix. 
All of these should be sent in MRF, if possible already added (with some marks) to the files that we are circulating.

Some comments on various points:

- We have not yet completed the Templates with regard to information on optionality and obligatoriness. These are represented in the following way:

      (...)   = optional, and, if present, once only
      *       = 0 or more times
      +       = 1 or more times
      nothing = obligatory, once

  We must observe that almost all the fields are optional when representing a generic entry for a dictionary, because of the wide range of variations in standard dictionary entries. If an optional sign is present at a GROUP level, obligatory signs on its children are to be understood as follows: "if the GROUP is present, then this Tag is obligatory".

- The letters 'M' and 'B' at the left of some tags mean that these tags apply only to Monolingual or Bilingual dictionaries respectively.

- Hierarchies and dependencies (and therefore scope) are represented by the indentations in the Templates. They may be different in the different dictionaries. They are important for parsers. Should or could they also be represented in a descriptive written section?

- We hope that the definition of Groups as formed by certain sets of terminal tags (to which additions can obviously be made) will not constitute a problem for the description of other dictionaries. If it does, please let us know immediately.

- Attribute names ending with the string '_type' are used for information which is not explicitly present in written dictionaries, but which it should be possible to derive, even at an early stage, with automatic or semi-automatic procedures. More of these "type"-tags will be used in the final Common Entry, which will conform less to the source text, and in which more "derived" information will be present (e.g. Hypernyms).
- Attribute names ending with the string '_text' are used to preserve the data as it is represented in the source text, when the process of parsing the dictionary into the different fields results in the source text being no longer recoverable, or when a portion of the source text is not yet analyzed. This has been done because maintenance of the source text is considered important, also - but not only - because of the Project's connection with the TEI (Text Encoding Initiative). For example, further and more refined analyses may need to go back to it. - We realize that the addition of these 'text'- and 'type'- attributes may cause the list of tags and the templates to become rather complicated. Unfortunately, these and other complexities come out from the necessity of disambiguating and representing in an explicit and standardized form, for computational use, texts - in our case dictionaries - which are somewhat inconsistent, incoherent, and not systematic in their form of presenting information, because they are intended for human use. - For ease of understanding, at the moment we have used long tag-names. We can change them later. As a final comment, I must add that what we are doing in this first phase has to be seen also in the perspective of a cooperation with the TEI, as has been mentioned in our Proposal of Work. Therefore the result of our work will also be used as a contribution to the TEI, and we expect to have feedback from this group. Waiting for your comments, critiques, suggestions, etc. Best regards Nicoletta COMPUTATIONAL MODEL OF THE DICTIONARY ENTRY Preliminary Draft Proposal Nicoletta Calzolari, Catol Peters, Adriana Roventini SECTION I TEMPLATE FOR LEXICAL ENTRY IL NUOVO DIZIONARIO GARZANTI: ITALIAN MONOLINGUAL Tags at Dictionary Level: DICT_SOURCE Dict_name: Dict_notes: LANGUAGE Lang: PHONETIC TRANSCRIPTION IPA: Tags at Entry Level: ENTRY HEADWORD_GROUP Hdwd_type: Hdwd_form: (Hdwd_Homonym_no.): (VARIANT_GROUP)... 
(PHONETIC_GROUP) (Pronunc_text): Pronunciation: (Alternate_pronunc): (CROSS-REFERENCE_GROUP)... (ETYMOLOGY_GROUP) Etymology_text: (VARIANT_GROUP) (Variant_label): Variant_form: (PHONETIC_GROUP)... (GRAM_INF_GROUP)... +HOM_GROUP Hom_No.: (Hom_form): (Hom_compact): *(GRAM_INF_GROUP) (Gram_Inf_text): *(POS_GROUP) (POS): (subcat): (subtype): (gender): (number): (various): (MORPH_GROUP) (Morph_text): (Morph_label): +Morph_form: (VARIANT_GROUP)... (CROSS_REFERENCE_GROUP) (Xref_text): Xref_label: +Xref_entry: (Xref_extens): *(SENSE_GROUP) Sense_no.: (CROSS_REFERENCE_GROUP)... (DEFINITION_GROUP) (GRAM_INF_GROUP)... *(SEMANTIC_LABEL_GROUP) (Semantic_label_text): (Subject_code): (Semantic_code): (Register_code): (Usage_code): (Geographic_code): (CROSS_REFERENCE_GROUP)... *(DEF_GROUP) +Def_text: (SEMANTIC_LABEL_GROUP)... (Implicit_Xref): (Figure_ref): (CROSS_REFERENCE_GROUP)... (TAXONOMY_GROUP) Taxon_label: Taxon_text: *(EXAMPLE_GROUP) . +Ex_text: (SEMANTIC_LABEL_GROUP)... (Ex_explanation): *(MULTIWORD_GROUP) Mwd_label: Mwd_form: (SEMANTIC_LABEL_GROUP)... 
Mwd_explanation: (PROVERB_GROUP) Prov_label: +Prov_text: (Prov_explan): (SEMANTIC_RELATIONS_GROUP) (SYNONYM_GROUP) Syn_label: +Synonym: (ANTONYM_GROUP) Ant_label: +Antonym: ("ALTERATE"_GROUP) Alt_label: +"Alterate": (RUN-ON_GROUP) Run-On_type: Run-On_label: Run-On_form: ABBREVIATED SCHEMA OF THE LEXICAL ENTRY IN GARZANTI DICT_SOURCE LANGUAGE ENTRY HEADWORD_GROUP PHONETIC_GROUP ETYMOLOGY_GROUP VARIANT_GROUP HOM_GROUP GRAM_INF_GROUP POS_GROUP MORPH_GROUP CROSS_REFERENCE_GROUP SENSE_GROUP DEFINITION_GROUP SEMANTIC_LABEL_GROUP DEF_GROUP TAXONOMY_GROUP EXAMPLE_GROUP MULTIWORD_GROUP PROVERB_GROUP SEMANTIC_RELATIONS_GROUP SYNONYM_GROUP ANTONYM_GROUP "ALTERATE"_GROUP RUN-ON_GROUP TEMPLATE FOR LEXICAL ENTRY COLLINS BILINGUAL: ENGLISH/ITALIAN, ITALIAN/ENGLISH Tags at Dictionary Level: DICT_SOURCE Dict_name: Dict_notes: LANGUAGE L1: L2: Metalanguage: PHONETIC TRANSCRIPTION IPA: Notes: Tags at Entry Level: ENTRY HEADWORD_GROUP Hdwd_type: (Hdwd_text): Hdwd_form: (Hdwd_Homonym_no.): (PHONETIC_GROUP) (Pronunc_text): Pronunciation: *(Alternate_pronunc): (CROSS_REFERENCE_GROUP)... (VARIANT_GROUP) (Variant_label): Variant_form: (PHONETIC_GROUP)... (GRAM_INF_GROUP)... *(MORPH_GROUP)... (PHONETIC_GROUP)... (CROSS_REFERENCE_GROUP)... +HOM_GROUP Hom_no.: (Hom_form): (PHONETIC_GROUP)... (Hom_compact): (CROSS_REFERENCE_GROUP)... (MULTIWORD_GROUP) Mwd_label: +Mwd_form: (GRAM_INF_GROUP) (Gram_Inf_text): *(POS_GROUP) POS: (subcat): (subtype): (gender) (number): (various): (MORPH_GROUP) (Morph_text): *(Morph_label): +Morph_form: (PHONETIC_GROUP)... (AUX_GROUP) Aux_label: Aux_text: (CROSS_REFERENCE_GROUP)... *(SENSE_GROUP) Sense_no.: (Hom_form): (GRAM_INF_GROUP)... (COMPOUND_GROUP)... (TRANSLATION_GROUP) (SENSE_LABEL_GROUP) (Sense_Label_text): (GRAM_INF_GROUP)... 
*(SEMANTIC_LABEL_GROUP) (Subject_code): (Semantic_code): (Register_code): (Usage_code): (Geographic_code): *(SEMANTIC_INDICATOR_GROUP) (Semantic_Indicator_type): Semantic_Indicator_text: *(TRANS_GROUP) (Trans_type): (Trans_label): +Trans_text: (GRAM_INF_GROUP)... *(EXAMPLE_GROUP) +Ex_text: *(SENSE_LABEL_GROUP)... +EXAMPLE_TRANS_GROUP (Ex_Trans_type): (Ex_Trans_label): +Ex_Trans_text: (GRAM_INF_GROUP)... (CROSS_REFERENCE_GROUP) (Xref_text): Xref_label: Xref_entry: (Xref_entry_extens): (RUN-ON_GROUP) Run-On_type: Run-On_label: Run-On_form: ............. ABBREVIATED SCHEMA OF THE LEXICAL ENTRY IN COLLINS DICT_SOURCE LANGUAGE PHONETIC TRANSCRIPTION ENTRY HEADWORD_GROUP PHONETIC_GROUP VARIANT_GROUP HOM_GROUP MULTIWORD_GROUP GRAM_INF_GROUP POS_GROUP MORPH_GROUP AUX_GROUP SENSE_GROUP TRANSLATION_GROUP SENSE_LABEL_GROUP SEMANTIC_LABEL_GROUP SEMANTIC_INDICATOR_GROUP TRANS_GROUP EXAMPLE_GROUP EXAMPLE_TRANS_GROUP CROSS_REFERENCE_GROUP RUN-ON_GROUP EXPLANATION OF TAGS IN THE LEXICAL ENTRY TEMPLATES AND DESCRIPTION OF THE VALUES WHICH EACH ATTRIBUTE CAN ASSUME Tags at Dictionary Level are: DICT_SOURCE = node grouping information on the source of the lexical data. Dict_Name: the name of the source dictionary. Dict_Notes: contains information on the kind of lexical data contained in the source dictionary. For example the VanDale material originates from a historically oriented database, but is restricted to Contemporary Dutch. It includes words from spoken language but excludes dialect words. Entries are based on a "one-form-one-entry" principle. LANGUAGE = node grouping information on language of source dictionary. M Lang: language of source dictionary for monolingual dic- tionaries. B L1 : source language for bilingual dictionaries. B L2 : target language for bilingual dictionaries. e.g. for an Italian/English bilingual dictionary: in the Italian-English dataset, L1=Italian, L2=English; in the English-Italian dataset, L1=English, L2=Italian. 
B Metalanguage: language used as metalanguage. In bidirectional bilingual dictionaries this is normally L1 for each dataset, whereas in monodirectional bilinguals the metalanguage used is normally that of the intended user. For example, in our Collins bilingual, in the Italian-English dataset, this field takes as value Italian and, in the English-Italian dataset, the value is English.

PHONETIC TRANSCRIPTION

    IPA: Symbols of the International Phonetic Association; takes as values Yes/No.

    Notes: comments on particular characteristics of the phonetic transcription used by the source dictionary. For example, in Collins the phonetic transcription includes information on word stress.

Tags at Entry Level are:

ENTRY = node for entire lexical entry.

    Entry_Id: takes as value a number identifying the entry on the source tape in order to preserve the source order (e.g. when entries are added, deleted, split, etc., when loading the dictionary into the LDB shell, or when transferring information to other projects which have also used the same source tape).

HEADWORD_GROUP = node grouping information on headword.

    Hdwd_type: all "_type" tags contain information which is implicit in the entry and can be derived either manually or automatically. This tag indicates the particular kind of headword and can take as values, e.g. lemma, irregular word-form, suffix, prefix, abbreviation, etc. A list of possible labels (plus examples) is given in Appendix 1.

    Hdwd_text: this field is necessary to preserve the source text in at least two cases:
      a) headwords in which hyphenation information is included are transcribed here, exactly as they appear. For example, in the Collins Bilingual, the English-Italian dataset includes hyphenation information in the headword: "in.for.ma.tion".
      b) headwords in which morphological information is located in the headword position are transcribed here, exactly as they appear. For example, in the Collins Bilingual, the Italian-English dataset includes many headwords like "basico,a,ci,che": the whole string is entered in this field, the form assumed as the primary form ("basico" in the example here) will be entered as value in the Hdwd_form field and "a,ci,che" will be entered as value in the Morph_text field of the Morph_Group, which in this case will replace the Variant_Group.

    Hdwd_form: obligatory field; takes as value the dictionary citation form. Contains only the primary form for "complex" headwords (see Hdwd_text above). All signs which are additional to the actual graphic form of the headword must be removed, e.g. indications of hyphenation, stress, etc., whereas all signs which belong to the usual graphic form of the headword must be maintained, e.g. capitals, graphic accents, periods, spaces, etc. (for example, Cristo, wagon-lit, ab aeterno).

    Hdwd_Homonym_no.: usually takes as value a number (sometimes this appears in dictionaries as a superscript number) or any other sign used by the source dictionary to represent homonyms recorded as separate entries. Collins and Garzanti at times indicate both lexical and grammatical homonyms in this way, e.g. for "calcio" both give separate entries for 2 lexical homonyms (both nouns), whereas for "potere" separate entries are given for grammatical homographs (noun and verb).

HEADWORD_LABEL_GROUP = node grouping information related to the entry as a whole and not to specific senses.

    Freq_Inf: tag containing frequency information. The Van Dale dictionary tape, for example, has frequency labels which either refer to all senses or for which it is not known to which sense they apply.

PHONETIC_GROUP = node grouping phonetic information. This node can be repeated in different positions in the lexical entry.

    Pronunc_text: takes as value the entire pronunciation field as it appears in the dictionary, without any analysis. This field is used to preserve the source text, e.g.
    when more than one pronunciation is merged in a single string. For example, in Collins, for the entry "fuor(i)uscito", the two pronunciations are indicated in the same way as the headword, i.e. with the "i" between brackets, and will thus be recorded in two different Phonetic_groups, one for the headword, one for the variant.

    Pronunciation: takes as value the pronunciation information. When more than one pronunciation is associated with the headword or with any variant form, the first (assumed as the primary) pronunciation is entered here.

    Alternate_pronunc.: when more than one pronunciation is associated with the headword or any variant form, the pronunciations assumed as secondary pronunciations are entered here. For example, Garzanti gives "nesso [nes o ne's]"; in this case, the entire pronunciation is entered in the Pronunc_text field, "nes" is entered in the Pronunciation field and "ne's" is entered in the Alternate_Pronunc. field.

VARIANT_GROUP = node grouping information on variant forms of the headword. When inflected word-forms are recorded at headword level (see the example "basico, a, ci, che" in the Hdwd_text field above), the Morph_group node replaces the Variant_Group node.

    Variant_label: takes as value any label given in the source dictionary indicating the type of variant. The list of possible labels for our dictionaries is given in Appendix 1.

    Variant_form: takes as value the graphical form of the variant as it appears in the dictionary, e.g. Collins gives "sceptic, (Am) skeptic [......]"; "skeptic" will be entered in the Variant_form field; in the example of the entry "fuor(i)uscito" cited above, "fuoruscito" is entered as value in the Hdwd_form field and "fuoriuscito" is entered as value in this field.

HOM_GROUP = node for each homograph group within an entry (usually grammatical homographs). Homograph divisions are usually given for major POSs, but separate homographs are also often given for different subcategorizations of the same POS, e.g. vt, vi, vr, etc. When there is just one homograph group, it is not necessarily indicated explicitly in the dictionary, but is logically always present.

    Hom_no.: takes as values the number or label given in the dictionary, or the value NIL when there is no explicit number or label. The list of possible labels for our dictionaries is given in Appendix 1.

    Hom_form: this field is necessary whenever the form of the entry word at the homographic level differs from the form which appears at the headword level. For example, in the Italian-English dataset, Collins lists the reflexive form of verbs in a separate homograph division and indicates their graphic form; e.g. in the entry for "abbandonare", there are two Homograph groups for the transitive and the reflexive forms of the verb, as follows: 1 vt ... 2: ~rsi vr. In this case, Hom_Form takes the value "~rsi".

    Hom_compact: takes as value Yes; this field is present only when two or more primary POSs are listed under one Hom_No. It will probably be disambiguated to create two or more entries or homographs in the future. For example, Garzanti gives "ambulante agg. e s.m. e f."; in Collins, we have "Coreano, a [....] ag, sm, sm/f".

GRAM_INF_GROUP = node grouping morphological and syntactic information. This node can be repeated in many different positions in the lexical entry.

    Gram_Inf_text: field needed to preserve source text. For example, in Collins, for the headword "behove" we find "impers vt"; thus, in order to maintain the order, "impers vt" will be entered in the "Gram_Inf_text" field, "v" in the "POS" field, "t" in the "subcat" field and "impers" in the "subtype" field. It is also used for complex grammatical information which will need further analysis in the future.

POS_GROUP = node grouping all information on parts-of-speech.

    POS: takes as values the major parts-of-speech (verb, noun, etc.) as they are represented in each dictionary. The list of possible labels for our dictionaries is given in Appendix 1. In the future, a standard list will be defined for the project, with conversion tables for the individual lists of each source dictionary.

    subcat: takes as value any subcategorization information given for verbs, e.g. on transitivity, reflexivity, etc. The list of possible labels (plus examples) for our dictionaries is given in Appendix 1.

    subtype: takes as value any labels which regard more specific grammatical information for any POS. The list of possible labels (plus examples) for our dictionaries is given in Appendix 1.

    gender: takes as value any labels for nouns, pronouns and adjectives regarding gender. The list of possible labels (plus examples) for our dictionaries is given in Appendix 1.

    number: takes as value any labels for nouns, pronouns and adjectives regarding number. The list of possible labels (plus examples) for our dictionaries is given in Appendix 1.

    various: this field is used at the moment for "junk collection", i.e. all grammatical information which cannot be classified under any of the previous headings. For example, in Collins in the entry for "a", we find "...before vowel or silent h: an.."; at this stage we put this information in the various field.

    g-code: this field is used for codes such as the Longman formal grammatical codes which describe in detail the syntactic behaviour of headwords or their word senses.

MORPH_GROUP = node grouping information on inflection; it usually gives irregular inflected forms. As has been explained in the hdwd_text description, it has also been found necessary at times to use this node for morphological information located in the headword position; in this case, this node will be located in the place of the Variant_Group.
Morph_text: This field is necessary to preserve the source text and takes as value the morphological information as it is reported in the dictionary when it cannot be analyzed. For example, in Garzanti, we find "ciondolare ... v.intr.(io ciondolo ecc.)"; in this case "io ciondolo ecc." will be entered as value in this field. Morph_label: takes as value any labels indicating the type of morphological information given. The list of possible labels (plus examples) for our dictionaries is given in Appendix 1. Morph_stem: takes as value the "stem" of the entry lemma. For example, Dutch needs a specification of the stem-form for flection and derivation. Morph_form: takes as value inflected word forms or inflectional endings, as they appear in the dictionary. For example, in Garzanti, we find "doppiatore...s.m. (f.-trice)"; in this case, "f" will be entered as value in the Morph_label field and "-trice" in this field. AUX_GROUP = node grouping information on the auxiliary verb associated with a given homograph of the headword, according to its subcategorization. Aux_label: takes as value any label indicating the presence of an auxiliary verb. The list of possible labels (plus examples) for our dictionaries is given in Appendix 1. Aux_form: takes as value the auxiliary verb given. SENSE_GROUP = node grouping the information on each sense division of the headword. Usually contains labels, definitions and examples for monolingual dictionaries, and labels, translations and their examples for bilinguals. Sense_no.: takes as value the number, letter, etc., which is used in the entry to distinguish between the different word senses. For monosemous words the dictionary sense_no. may be implicit. The value in this case is NIL. The list of possible labels for our dictionaries is given in Appendix 1. 
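
As an illustration of how a parser might fill the MORPH_GROUP fields described above, the following minimal Python sketch splits a Garzanti-style bracketed string such as "(f.-trice)" into a Morph_label and a Morph_form, and falls back to Morph_text when the string cannot be analyzed. The function name, the regular expression and the dictionary layout are assumptions made for illustration only; they are not part of the proposal.

```python
import re

# Hypothetical pattern: a short label ending in a period, followed by an
# inflectional ending that starts with a hyphen, e.g. "f.-trice".
_LABEL_ENDING = re.compile(r"^(?P<label>[a-z]+)\.\s*(?P<form>-\S+)$")


def parse_morph(paren_text):
    """Return a MORPH_GROUP-like dict for the text found between brackets."""
    text = paren_text.strip().strip("()").strip()
    match = _LABEL_ENDING.match(text)
    if match:
        return {
            "Morph_label": match.group("label"),
            "Morph_form": [match.group("form")],  # '+Morph_form': one or more
        }
    # Not analyzable yet: keep the source string untouched in Morph_text.
    return {"Morph_text": text}


print(parse_morph("(f.-trice)"))
print(parse_morph("(io ciondolo ecc.)"))
```

With the two Garzanti examples quoted above, the first call yields the label "f" with the form "-trice", while the second keeps "io ciondolo ecc." in Morph_text for later analysis.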
M DEFINITION_GROUP = node grouping information on the meaning of the headword; normally contains semantic labels and definition(s). (For bilingual dictionaries, in this position we have the Translation_group.)

SENSE_LABEL_GROUP = node grouping explicit and implicit indications on the sense being defined or translated. (This node governs the two following nodes and is repeated in all the different positions where such information appears.)

    Sense_label_text: field needed to preserve source text for the sense labels, both with regard to the order of the labels and to handle information which can be found in this field in addition to the codes and will require further analysis. For example, in Collins, under the first sense division of the entry for "superior", we find the sense labels (Comm: goods, quality) which will eventually have to be disambiguated.

SEMANTIC_LABEL_GROUP = node grouping one or more explicit semantic labels attached to a word sense. For each of the Semantic_label types below, the list of possible labels for our dictionaries is given in the Appendix. Where possible, in the future, a standard list will be defined for the project, with conversion tables for the particular lists of each source dictionary.

    Semantic_label_text: field needed to preserve source text for semantic labels, both with regard to the order of the labels and to handle information which can be found in this field in addition to the codes and will require further analysis. For example, in Garzanti, after the definition of "barricarsi" we find "(anche fig.)" where the word "anche = also" is not a code.

    Subject_code: labels indicating particular domains. The values for each of our dictionaries are listed in Appendix 1.

    Semantic_code: labels indicating metaphorical, figurative usage, etc. The values for each of our dictionaries are listed in Appendix 1.

    Register_code: labels indicating colloquial, formal, informal, literary, poetical usage, etc. The values for each of our dictionaries are listed in Appendix 1.

    Usage_code: labels indicating archaic, old, rare usage, etc. The values for each of our dictionaries are listed in Appendix 1.

    Geographic_code: labels indicating dialectal, regional usage, etc. The values for each of our dictionaries are listed in Appendix 1.

B SEMANTIC_INDICATOR_GROUP = node grouping information on semantic indicators or constraints on the translations.

    Semantic_Indicator_type: identifies the particular kind of semantic indicator used, and can take values such as "near-synonym", "contextual", etc. The values for each of our dictionaries are listed in Appendix 1.

    Semantic_Indicator_text: takes as value the semantic indicator information as it appears in the source dictionary.

M DEF_GROUP = node grouping definitions regarding word senses.

    Def_text: takes as value the definitions as they appear in the source dictionary without any analysis.

    Implicit_Xref: takes as value a headword referred to in a definition and implicitly defined in this same definition. It may appear at the end of or within the Def_text field. For example, in Garzanti, we find "embolia.. s.f. ostruzione di ... causata da ... un corpo estraneo (embolo)"; in this case, "embolo" is entered as value in this field.

    Figure_ref: takes as value any reference in the dictionary source text to an illustration and will include the name or number of the illustration. The list of possible indicators for this attribute in our dictionaries is given in Appendix 1.

TAXONOMY_GROUP = node grouping information on typical taxonomies (usually for animals and plants), if explicitly given in the dictionary.

    Taxon_label: takes as value the label indicating the taxonomy type. The list of possible indicators for this attribute in our dictionaries is given in Appendix 1.

    Taxon_text: takes as value what is given in the source dictionary as taxonomy data.
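
The Implicit_Xref case described under DEF_GROUP can be sketched as follows. The snippet assumes, purely for illustration, that the implicitly defined headword is a single word in round brackets at the very end of the definition; the description above notes that it may also occur within the Def_text, which this sketch does not handle. The function name and the dictionary layout are hypothetical.

```python
import re

# Assumption: the implicit cross-reference is one word in round brackets
# at the end of the definition text.
_TRAILING_WORD = re.compile(r"\((?P<xref>[^\s()]+)\)\s*$")


def def_group(def_text):
    """Build a DEF_GROUP-like dict; Def_text is always kept unanalyzed."""
    group = {"Def_text": [def_text]}
    match = _TRAILING_WORD.search(def_text)
    if match:
        group["Implicit_Xref"] = match.group("xref")
    return group


# Definition of "embolia" as quoted above (the elisions are in the source).
print(def_group("ostruzione di ... causata da ... un corpo estraneo (embolo)"))
```

A multi-word bracket such as "(anche fig.)" is left alone by this pattern, but a one-word label in final position would be picked up by mistake, so a real parser would first have to check the bracketed string against the lists of semantic labels.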
B TRANSLATION_GROUP = node grouping all the information for each word sense, including translations, sense_labels, examples, etc. This node corresponds to the Definition_Group for monolingual dictionaries.

B TRANS_GROUP = node grouping the translations for each word-sense.

    Trans_type: indicates the particular kind of translation given when it is not a direct L2 equivalent, but e.g. an explanation, a cultural equivalent, etc. The values for our dictionaries are listed in Appendix 1.

    Trans_label: any label indicating explicitly the trans_type. The values for our dictionaries are listed in Appendix 1.

    Trans_text: takes as value the translation(s) as they appear in the source dictionary. Gram_Inf regarding the trans_text may appear at the end of or within the Trans_text; in the latter case the Trans_text will continue in a Trans_text_cont field.

EXAMPLE_GROUP = node grouping examples referring to a word sense.

    Ex_text: takes as value the example(s) as they appear in the source dictionary.

    Ex_explanation: takes as value any explanation which is associated with a particular example.

B EXAMPLE_TRANS_GROUP = node grouping information on the translation(s) of an example.

    Ex_Trans_type: indicates the particular kind of translation given for an example when it is not a direct translation of the example, but e.g. an explanation, a cultural equivalent, etc. The values for our dictionaries are listed in Appendix 1.

    Ex_trans_label: any label indicating explicitly the Ex_trans_type. The values for our dictionaries are listed in Appendix 1.

    Ex_Trans_text: takes as value the translation(s) of an example as it appears in the source dictionary. Gram_Inf regarding the Ex_Trans_text may appear at the end of or within the Ex_Trans_text; in the latter case the Trans_text will continue in a Trans_text_cont field.

PROVERB_GROUP = node grouping proverbs referring to a headword or to a word sense.

    Prov_label: takes as value the label indicating the presence of a proverb. The list of possible indicators for this attribute in our dictionaries is given in Appendix 1.

    Prov_text: takes as value any proverb(s).

    Prov_explanation: takes as value any explanations which are associated with a particular proverb.

SEMANTIC_RELATIONS_GROUP = node grouping information on particular explicit semantic relations.

SYNONYM_GROUP = groups any explicit information concerning synonyms of the headword or of a word sense.

    Syn_label: takes as value the label indicating the presence of synonym(s). The values for each of our dictionaries are listed in Appendix 1.

    Synonym: takes as value the synonym(s).

ANTONYM_GROUP = groups any explicit information concerning antonyms of the headword or of a word sense.

    Ant_label: takes as value the label indicating the presence of antonyms. The values for each of our dictionaries are listed in Appendix 1.

    Antonym: takes as value the antonym(s).

"ALTERATE"_GROUP = groups any explicit information concerning words such as diminutives, augmentatives, pejoratives, etc.

    Alt_label: takes as value the label indicating the presence of "alterates". The values for each of our dictionaries are listed in Appendix 1.

    "Alterate": takes as value explicit "altered forms" of the headword.

MULTIWORD_GROUP = node grouping explicit information on various types of multiple words, e.g. phrasal verbs, frozen expressions, etc., listed within the entry. This node can appear in different positions, depending on the dictionary. For example, in Collins, the whole node is given the same status as a homograph division; however, it can contain more than one compound form, each of which can have its own Gram_Inf, Sense_Labels and Trans_Groups. In Garzanti, this node contains compounds, idioms, fixed phrases, etc., and is located within the Sense_Group, after the Example_Group, and the whole node is repeated with its Semantic_Labels for each compound.

    Mwd_label: takes as value the label indicating the presence of some kind of compound.
    The list of possible labels (plus examples) for our dictionaries is given in Appendix 1.
    Mwd_form: takes as value the multi-word form(s) given in the dictionary.
    Mwd_explanation: takes as value any explanation which is associated with a particular Mwd_form.

ETYMOLOGY_GROUP = node grouping information on etymology.
    Etymology_text: takes as value the etymology information reported in the dictionary without any analysis.

CROSS-REFERENCE_GROUP = node grouping information on cross-references of various types. This node can appear at almost any level in the lexical entry.
    Xref_label: takes as value the label, sign, or word(s) used in the dictionaries to indicate that a cross-reference follows. The list of possible labels for our dictionaries is given in Appendix 1.
    Xref_text: takes as value the entire cross-reference information in order to preserve source text, when necessary. For example, Garzanti gives "lagrima... -> lacrima e deriv.", where the words "e deriv." (= and its derivatives) need further analysis.
    Xref_entry: takes as value the entry which is cross-referenced.
    Xref_entry_extens: takes as value any Homonym, Homograph and/or Sense nos. qualifying the referenced entry. For example, in Collins we find "bale [...] vt,vi see bale out 1,2(a)", where "see" is entered in the Xref_label field, "bale out" is entered in the Xref_entry field, and "1,2(a)" is entered in the Xref_entry_extens field.

RUN-ON_GROUP = node grouping run-ons to the headword. The main feature of such run-ons is formal, i.e. they are strictly related to the headword (e.g. derivatives formed from the headword by adding a suffix, phrasal verbs, compounds, etc.); they have not been given independent entry status in the particular dictionary under analysis but have been located at the end of the entry as a type of sub-entry. For Garzanti, the structure of this node is identical to that of the Headword_Group, beginning with the Hom_Group, etc. after the Run-On_form.
For Collins, the structure is the same as for Garzanti.
    Run-On_type: identifies the particular kind of run-on and can take values such as "suffix", "phrasal verb", etc. The values for each of our dictionaries are listed in Appendix 1.
    Run-On_label: takes as value the label, sign, or word(s) used in the dictionaries to indicate that a run-on follows. The list of possible labels for our dictionaries is given in Appendix 1.
    Run-On_form: takes as value the derived word, the suffix used to form the derived word, the phrasal verb, etc. as they appear in the dictionary. For example, Garzanti gives "esterno ... agg. ... //-mente avv. all'esterno ..."; "-mente" will be entered as value in this field.


APPENDIX 1

The Appendix will list those values for the various attributes in the Lexical Entry Template which are found in the machine-readable dictionaries used within the Project. The dictionaries which have been examined so far are "Il Nuovo Dizionario Garzanti", a monolingual dictionary for Italian; the "Collins Concise English-Italian, Italian-English Dictionary"; the "VanDale Groot Woordenboek Hedendaags Nederlands", a monolingual Dutch dictionary; and the VanDale bilingual Dutch/English dictionary. They are referred to here as the Garzanti, Collins, VanDale Monolingual and VanDale Bilingual dictionaries, respectively. The values for the remaining dictionaries used in the Project will be added by the other partners.

N.B. Not all of the lists have been completed so far; at this stage they should be taken as purely indicative. It will only be possible to have completely exhaustive lists for each dictionary when the first parsing and analysis stages have been completed.

Hdwd_type:
    Possible values:
        lemma (typical standard dictionary headword)
        word-form (inflected word-forms, usually irregular)
        abbreviation
        acronym
        suffix
        prefix
        proper noun
        contracted form (e.g. ain't)
        compound (e.g. bathing costume)
        ........

Variant_label:
    in Garzanti:
        raro = rare
        ......
    in Collins:
        Am = American
        Brit = British
        Scot = Scottish
        ......
    in VanDale: not present; all variants are less common than the Headword by default.

Hom_No.:
    in Garzanti:
        NIL  if there is only one homograph
        //   otherwise; however, when there is more than one homograph, Garzanti only marks them explicitly from the second on, e.g.
                 dopopranzo s.m.......// avv........
                 appiccicare v.tr...//v.intr..//-arsi v.rifl....
             Thus, the first Hom_No. value will be filled in automatically by a backtracking procedure. In a second phase, the // will be replaced by 1, 2, 3, etc.
    in Collins:
        NIL (if there is only one homograph)
        1, 2, 3, n ... (otherwise)
    in VanDale:
        NIL (if there is only one homograph)
        1, 2, 3, n ... (otherwise)

POS:
    in Garzanti:
        s. = noun
        v. = verb
        agg. = adjective
        avv. = adverb
        art. = article
        prep. = preposition
        pron. = pronoun
        inter. = interjection
        cong. = conjunction
    in Collins:
        Italian   English
        s         n
        v         v, vb
        ag        adj
        av        adv
        art       art
        prep      prep
        pron      pron
        escl      excl
        cong      conj
    in VanDale Monolingual:
        1 = noun
        3 = verb
        2 = adjective
        5 or 2 = adverb
        75 = article
        6 = preposition
        4 = pronoun
        9 = interjection
        8 = conjunction
        70, 72 = cardinal
        73 = ordinal
    in VanDale Bilingual:
        Dutch: the same as in the monolingual.
        English: has no coding.

subcat:
    in Garzanti:
        tr. = transitive, e.g. presumere v.tr.
                               ciancicare v.tr. e intr.
        intr. = intransitive, e.g. andare v.intr
        rifl. = reflexive?, e.g. prestare v.tr....// -arsi v.rifl.
                                 arrendersi v.rifl.
                                 arrampicare v.intr..., arrampicarsi v.rifl.pron.
    in Collins:
        t = transitive, e.g. incorporate vt
        i = intransitive, e.g. go vi
        r = reflexive, pronominal, reciprocal, e.g. arrendersi vr
    in VanDale Monolingual:
        for verbs:
            1 intransitive
            2 transitive
            3 reflexive

subtype:
    in Garzanti:
        articolata = preposition combined with definite article, e.g. del...prep.articolata
        indef. = indefinite, e.g. qualcosa...pron.indef.
        interr. = interrogative, e.g. quando...avv.interr.
        num.card. = cardinal number, e.g. quattrocento...agg.num.card.
        locuz. = locution, e.g.
            ab ovo...locuz.
        voce onom. = onomatopoeic word, e.g. ciac...voce onom.
        invar. = invariable, e.g. qualsivoglia...agg.indef.invar.
        di tempo = of time, e.g. dopodomani...avv di tempo
        ...........
    in Collins:
        impers = impersonal, e.g. be ... impers vb
        modal, e.g. will modal aux vb
        aux = auxiliary, e.g. will modal aux vb
        ..........
    in VanDale Monolingual:
        for pronouns:
            1 personal pronoun
            2 demonstrative pronoun
            3 possessive pronoun
            4 relative pronoun
            5 interrogative pronoun
            6 reflexive pronoun
            7 reciprocal pronoun
        for verbs:
            4 auxiliary verb
            5 copula
            6 impersonal verbs like "rain", "snow"

gender:
    in Garzanti:
        f.   e.g. abilitazione s.f.
        m.        abete s.m.
    in Collins:
        f    persecuzione sf
        m    sistema sm
    in VanDale Monolingual:
        1 masculine
        2 feminine
        3 neutral
        4 masc. and neut.
        5 fem. and neut.
        6 masc. and fem.
        7 masc., fem. and neut.
        8 neut. and 'male person'
        9 neut. and 'female person'

number:
    in Garzanti:
        sing.   e.g. io ..pron.....sing.
        pl.          forbici...s.f.pl.
    in Collins:
        sg      people sg ...
        pl      ..........informazioni fpl
        ...........
    in VanDale: not present

g-code:
    in Longman: ...........
    in VanDale: not explicitly present, perhaps extractable from the formalized examples.

Morph_label:
    in Garzanti:
        sing. = singular
        pl. = plural
        m. = masculine
        f. = feminine
        compar. = comparative
        superl. = superlative
        ......
    in Collins:
        sg = singular
        pl = plural
        m = masculine
        f = feminine
        comp = comparative
        superl = superlative
        pt = past tense
        pp = past participle
        ...............
    in VanDale Monolingual:
        mv.   plural
        enk.  singular

Aux_label:
    in Collins:
        aus = auxiliary
    in VanDale Monolingual:
        h.  auxiliary of perfect tense "hebben" ("have")
        i.  auxiliary of perfect tense "zijn" ("be")

Sense_No:
    in Garzanti:
        NIL (when there is only one word sense within a HOM_GROUP)
            e.g. quadrettato agg. suddiviso in quadrati.....
        a number from 1 on (when there is more than one word sense)
            e.g. quadrello..s.m. 1 mattonelle.... 2 [p.f.-a] (lett.) freccia....
    in Collins:
        NIL (if there is only one word sense)
        (a), (b), (c), and so on (otherwise)
    in VanDale:
        a number from 1 on

Subject_code:
    in Garzanti:
        aer. = aeronautics
        agr. = agriculture
        bot. = botany
        ........
        e.g. cima......4 (bot) infiorescenza....
    in Collins:
        Admin/Amm = Administration (for Eng-It and It-Eng sides)
        Chem/Chim = Chemistry (for Eng-It and It-Eng sides)
        .........
    in VanDale:
        cul. = culinair ("cooking")
        atrol. = astrology
        dipl. = diplomacy

Semantic_code:
    in Garzanti:
        fig. = figurative
        .....
    in Collins:
        fig   e.g. weak-kneed adj (fig) debole,...
    in VanDale:
        fig. = figurative

Register_code:
    in Garzanti:
        scherz. = humorous
        lett. = literary
    in Collins:
        frm = formal
        fam = informal or colloquial
        poet = literary or poetic usage
        ..........
    in VanDale:
        euf. = euphemistic
        iron. = ironic

Usage_Code:
    in Garzanti:
        ....
        rar. = rare usage, e.g. ibi s.m.invar. (rar.) -> ibis
    in Collins:
        old = old fashioned
        gen = generally, in most senses
    in VanDale:
        f = frequent

Geographic_code:
    in Garzanti:
        dial. = dialectal
        region. = regional
        .........
    in Collins:
        Scot = Scottish
        Brit = British
        Am = American
    in VanDale:
        AZN = common in the south of the Netherlands and in Belgium
        This is the only code used.

Semantic_Indicator_type:
    in Collins: the following types of semantic indicators can be derived:
        synonym
        hypernym
        contextual
        typical_subject
        typical_object
        ..........
    in VanDale: not present

Figure_ref:
    in Garzanti:
        ill. = illustration, e.g. embrione 1... 2....[ill.Frutti] 3......
    in Collins: no illustrations
    in VanDale: not present

Taxon_label:
    in Garzanti:
        fam. = family, e.g. mandorlo ...... (fam.Rosacee)
    in Collins: no taxonomies
    in VanDale: not present

Trans_type:
    in Collins:
        explanation, e.g. Befana sf ... (b) (personaggio) kind old woman who, according to legend, ...
        cultural equivalent, e.g. maturita' "=" G.C.E. A-levels

Trans_label:
    in Collins:
        "=" (see example above)

Ex_Trans_type:
    in Collins:
        explanation, e.g. liceo sm ... liceo classico/scientifico secondary school specializing in .....
        cultural equivalent, e.g. speaker n ... (d) (Brit Parliament): the Speaker "=" il Presidente della Camera dei deputati.

Ex_Trans_label:
    in Collins:
        "="

Prov_label:
    in Garzanti:
        prov. = proverb
        e.g. ciambella .... 1..../prov.: non tutte le ciambelle riescono col buco, non sempre si riesce...2.
    in VanDale:
        366 = a number reference to an external list of sayings which is in the book but not on the tape

Syn_label:
    in Garzanti:
        SIN.   e.g. cima 1. ..... SIN. sommita', vetta, .....
    in Collins: synonyms are not listed explicitly.
    in VanDale:
        SB

Ant_label:
    in Garzanti:
        CONTR. = antonym, e.g. chiuso ... CONTR. aperto
    in Collins: antonyms are not listed explicitly.
    in VanDale:
        AB

Alt_label:
    in Garzanti:
        DIM. = diminutive
        ACCR. = augmentative
        e.g. casa .. 1...SIN...., DIM. casina, casetta, ACCR. casona
    in Collins: "altered forms" are not listed explicitly.
    in VanDale: not present

Mwd_label:
    in Garzanti:
        /
    in Collins:
        cpd   (on the Eng-It side)
        :     (on the It-Eng side)
        e.g. sewing .. 1 n cucito 2 cpd: sewing machine n macchina da cucire
             guardia .. 1 sf ... 2: guardia del corpo bodyguard
    in VanDale: only indirectly derivable from the examples.

Xref_label:
    in Garzanti:
        ->   e.g. henna .. s.f. -> henn s.m.invar .....
    in Collins:
        =         e.g. pry vt (Am) = prise
        pt of          rode pt of ride
        pp of          ridden pp of ride
        abbr of        TV n (abbr di television)
        ..........
    in VanDale: not present

Run-On_Label:
    in Garzanti:
        //   e.g. frontale agg. ....// -mente avv. di fronte
    in Collins: "a solid lozenge" is used to indicate phrasal verbs at the end of an entry.
    in VanDale: not present
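
Two of the conventions described in this draft are procedural enough to illustrate with a short sketch: the automatic numbering of Garzanti homographs (the Hom_No. entry), and the division of a Collins cross-reference into Xref_label, Xref_entry and Xref_entry_extens. The Python fragment below is only an illustration under stated assumptions, not the parser the Project will use. The function names and the label list are hypothetical. The first function treats every "//" as a homograph boundary, although Garzanti also uses "//" to introduce run-ons; telling the two apart is not attempted here. The second assumes the extension is a final token beginning with a digit.

```python
import re

# Labels taken from the Collins examples quoted in this draft
# (illustrative only, not an exhaustive list).
COLLINS_XREF_LABELS = ("see", "=", "pt of", "pp of", "abbr of")


def number_homographs(entry_body, separator="//"):
    """Return (Hom_No, text) pairs for a Garzanti entry body.

    Hom_No is None (NIL) when there is a single homograph; otherwise
    the unmarked first homograph receives 1 and each "//" is replaced
    by the next number.
    Assumption: every separator marks a homograph, never a run-on.
    """
    parts = [part.strip() for part in entry_body.split(separator)]
    if len(parts) == 1:
        return [(None, parts[0])]
    return list(enumerate(parts, start=1))


def split_collins_xref(text):
    """Split a cross-reference string into the three Xref fields.

    Returns None when no known label starts the string.
    Assumption: the extension, if present, is the last token and
    begins with a digit, as in "1,2(a)".
    """
    text = text.strip()
    # Try longer labels first so that a longer label is never
    # shadowed by a shorter one that happens to be its prefix.
    for label in sorted(COLLINS_XREF_LABELS, key=len, reverse=True):
        if text.startswith(label + " "):
            rest = text[len(label):].strip()
            break
    else:
        return None
    match = re.match(
        r"(?P<entry>.+?)(?:\s+(?P<extens>\d[\d,()a-z]*))?$", rest
    )
    if match is None:
        return None
    return {
        "Xref_label": label,
        "Xref_entry": match.group("entry"),
        "Xref_entry_extens": match.group("extens"),
    }
```

Applied to the examples above, number_homographs("s.m. ... // avv. ...") numbers the two homographs of "dopopranzo" 1 and 2, and split_collins_xref("see bale out 1,2(a)") yields "see", "bale out" and "1,2(a)" in the three fields. A real parser would need the complete label lists for each dictionary, which can only be compiled once the first parsing and analysis stages have been completed.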