Report from the TEI Print Group Meeting - August 9-10, 1991, Pisa (Notes compiled by N. Ide and S. Warwick-Armstrong, Aug. 28, 1991) participants: Nicoletta Calzolari, Nancy Ide, Susan Warwick-Armstrong 0. Introduction The meeting was initially set up to do some practical work on bilingual dictionary entries. However, before looking at the information specific to translation, we found it desirable to review what was already specified for monolingual entries in the dictionary section of the guidelines and in a more recent report from the group working on morphological tags. After an initial review of the guidelines we drew up some sample entries in order to test where the text needed explanation and/or extensions. We then turned to information specific to bilingual entries and modified the schema we had developed to accommodate the translation information. What follows is a synthesis of the topics we discussed and the proposals we came up with during the meeting. 1. Review of existing TEI documents 1.1. Dictionary section (TEI P1, section 7.4) Though the dictionary section provides a starting point for coding printed entries, it is far from complete both in the explanations it provides and in the range of tags it has specified. With regard to the explanations provided, the three areas obviously missing are the references to other relevant sections in the guidelines, exemplification, and discussion of how to use the proposed tags. When we tried to use the guidelines to code some entries, the latter topic proved to be especially significant given the impossibility of specifying once and for all everything contained in every dictionary and thus the choices to be made when marking up a given entry. (Cf. section 2 for more discussion of this topic). 
One outstanding group of information which would merit some discussion in the dictionary section, as well as pointers to other relevant sections in the guidelines, is the grammatical information typically found in printed entries. It would be desirable to use one set of tags to account for the morphological and syntactic properties of words throughout the guidelines. There are, however, differences between the information provided in a dictionary and the properties ascribed to words occurring in texts. Printed entries provide information about all of the possible uses of a given word in rather specialized and often compacted forms and the codes may not directly correspond to common linguistic conventions. We also saw the need to more fully specify the XREF tagging, develop the use of attributes on FORM groups, and make a GRAM group. A general problem which arose in trying to use the dictionary guidelines was the lack of an overview section which would provide the skeleton (or basic structure) of an entry. Section 7.4.2 is an attempt to provide such an overview, but is somewhat too general and oversimplified. The guidelines should specify, insofar as it is possible, not only what tags belong to a group, but also, which groups can be embedded in other groups. Though the DTD will formally define the groupings, they should also be exemplified and discussed in the text. We do note from our recent experience, that this will not be an easy task, given the range of dictionary format conventions and the various types of information they contain. 1.2. Morphological Features (TEI AI1W2) We looked at the tags in the 'starter set' (TEI AI1W2) for morphological features, and we found that they essentially cover all the features found in printed dictionaries. However, we had the following comments: (1) the names of morphological features used here often have different names from those used in dictionaries. 
For example, linguists typically use the word 'category' whereas lexicographers typically use 'part of speech' or 'pos'. Question: should we provide a mechanism (or two tags) to use either name? The difference in tagging often stems from the fact that dictionary entries specify the potential realization of words, not their actual use in texts. We should work with the morphology group on this, for they seem to have only the "word in use" view.
(2) There are naming conflicts in more basic ways--the proposed 'form' tag for dictionaries, for example, has a different function from the 'form' attribute proposed for the morphology. Again, we should work with the morphology group.
(3) In some cases, an explanation of the attributes and values would have been useful. E.g., the use of the attribute 'unit' with values '+/-' -- does this refer to plural building? Does 'unit' mean divisible or not divisible? Does it replace the traditional 'mass/count' distinction?
(4) The 'reciprocal' attribute provided for verbs does not adequately cover the distinctions often made in French dictionaries (and perhaps others). In the 'Robert methodique', for example, verbs can be classified as 'pronominal reflexive' or 'pronominal reciprocal'. The current morphological scheme does not allow this. Dictionaries also code verbs as 'impersonal'.
(5) We see the need to be able to encode the auxiliary a verb takes, i.e. 'be' or 'have'; but 'auxiliary' is itself currently an attribute.
(6) We noted a missing feature for prepositions assigning case (e.g., for German).

2. General Issues

2.1 Goals

The TEI recommendations for encoding "print" dictionaries are at present (at least implicitly) intended to serve several different groups, whose different goals have strong repercussions for what must be included in the guidelines.
These groups include at least the following: (1) researchers in humanities disciplines interested in encoding existing historical dictionaries, and who therefore want to retain the exact (printed) form of the original (but who may also want to render some implicit or compressed information explicitly) (2) the lexicographer who is creating data which may be used in a specific printed dictionary (3) the publisher who would like to retain a database of lexical information which could potentially be used to generate different versions of a dictionary (e.g., a compressed version, one without examples, etc.) (4) the linguist who wants to have access to the information in existing print dictionaries, possibly in order to further process it There exist two (potentially conflicting) concerns here: (1) to retain in its exact form the printed text of a dictionary and (2) to represent the information in a usable (retrievable) form, possibly retaining information which is implicit or compressed in a printed dictionary. In some cases one application (e.g. encoding historical dictionaries whose form *and* content are a subject for research) may involve both concerns. 2.2 Issues arising from different goals The differences outlined above have an impact on the encoding. For example, (1) compressed information such as a rendering of a headword as "curieux, -ieuse" needs to be retained as such if the print form must be preserved, whereas the use of this information in a database would demand its expansion into two items, "curieux" and "curieuse". This is a simple example; a glance at any dictionary will show that all kinds of information is compressed this way. Another example is "nm" for "noun" "masculine"; probably dictionary editors, and certainly linguists, want to split this into two pieces of information. (2) a more serious problem is the varying treatment of information in print dictionaries. 
For example, when variant or inflected forms of a word exist, the lexicographer will typically choose one to serve as the key to the entry in which the definition text(s) appears and list the others within the same entry. However, in cases where the variant or inflected form is alphabetically distant from the key, it is often given in a separate entry. Related entries are treated in the same way. If the concern is only to reflect the printed form of a text, varying treatment of the same objects is not a problem. If the concern is to be able to access information according to its type, this can be a problem, especially if the information is to be in a database.
(3) there is much information that is implicit in dictionary entries and that it is highly desirable to retain for many applications. For example, usage notes come in several varieties: domain, geographical info, register, etc., but the type of note is not specified explicitly except in the prefatory matter of the dictionary, where a closed list of options is provided. Other information is implicit in dictionary entries by default; for instance, if not specified in a British dictionary, the geographical specification on any word is "British". For linguistic purposes at the very least, this information is essential to retain, even though it does not appear in the printed form.

2.3 Other issues

Another problem arises not from the conflict cited above but from our goal to accommodate the encoding of *any* existing print dictionary. The difficulties here are well known and arise from the fact that different dictionaries factor information in different ways. For example, in one dictionary, all senses of a given orthographic form with the same etymology will be grouped in a single entry, regardless of part of speech; whereas in another, different entries for the same orthographic form are given if the part of speech is different.
Within entries, information is also factored: in fact, the salient feature of dictionary entries is that information such as part of speech, form, etc. is factored out so as to apply to a number of senses. Factored information appears "outside" or above a grouping of senses, often at the head of the entry. In this way it is given only once. In addition, there is an "override" system in dictionary entries: it is very common to give exceptions for a specific sense when factored information does not apply. For example, one sense of the english word "heave" has a different past participle; dictionaries handle such things in something like the following way: heave (hi:v) vb. heaves, heaving, heaved... 5. (past tense and past participle hove) Nautical. a. to move or cause to move in a specified way ... b. (intr.) (of a vessel) to pitch or roll. ... The specification of a different past tense on sense 5 indicates that for this sense only, there is an override for the past tense form. Factoring is a result of the need to conserve space in print dictionaries, so that common information is not respecified. Conceivably, a database could duplicate information that is factored and thus eliminate the problem of representing factored information, but it is obviously desirable to factor in a database for space reasons as well. There is a semantics associated with factoring. That is, (1) factored information has a scope which must be defined (2) the presence of duplicate information (e.g. a past tense for a given sense when one which has been factored should apply) must be recognized as an override. Default information is similar; it too implies a semantics. It is not clear what the TEI should do here. Do we specify an encoding, and suggest a semantics? It seems imperative, if this scheme is to be of use. There is not a clear precedent in the current guidelines, although the encoding of trees in section 7 seems to imply a constituent structure, which is a sort of semantics. 
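The scope and override behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed TEI mechanism: the class name SubEntry, the method names add and resolve, and the feature keys ("orth", "pos", "past", "sn") are all hypothetical. A feature is looked up first on the sub-entry itself and then on each enclosing sub-entry in turn, so that a local respecification wins over factored information, as in the "heave" example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubEntry:
    """A (hypothetical) sub-entry: local features plus nested sub-entries."""
    features: Dict[str, str] = field(default_factory=dict)
    children: List["SubEntry"] = field(default_factory=list)
    parent: Optional["SubEntry"] = field(default=None, repr=False)

    def add(self, child: "SubEntry") -> "SubEntry":
        """Nest a sub-entry inside this one and return the nested sub-entry."""
        child.parent = self
        self.children.append(child)
        return child

    def resolve(self, name: str) -> Optional[str]:
        """Return the innermost value of a feature, searching outwards.

        A value given locally overrides a factored value; a value given
        only at an outer level has scope over everything nested inside.
        """
        node: Optional[SubEntry] = self
        while node is not None:
            if name in node.features:
                return node.features[name]
            node = node.parent
        return None


# The "heave" entry: the past form "heaved" is factored at the top level,
# and sense 5 overrides it with "hove".
heave = SubEntry({"orth": "heave", "pos": "vb.", "past": "heaved"})
sense_1 = heave.add(SubEntry({"sn": "1"}))
sense_5 = heave.add(SubEntry({"sn": "5", "past": "hove"}))
sense_5a = sense_5.add(SubEntry({"sn": "a"}))

print(sense_1.resolve("past"))   # factored value applies
print(sense_5a.resolve("past"))  # override on sense 5 applies
print(sense_5a.resolve("pos"))   # factored value applies
```

In this sketch the first call yields the factored form, the second yields the override given on sense 5, and the third falls through to the part of speech at the head of the entry. Resolving "sn" in the same way returns only the innermost number, which is why sense numbers need separate treatment (see section 6.2).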
2.4 Recommendations

We recommend that the TEI scheme attempt to accommodate all the goals outlined in section 2.1. That is, we can specify a scheme which can be used to represent the printed form of a dictionary only, but which provides additional mechanisms which can be used when a rendering and structuring of information for further processing is desired. To accomplish this the scheme must
(1) allow varying factorings of information. This implies a scheme wherein entities can be nested, and in which various pieces of information can appear at any level of the nesting.
(2) be flexible enough to allow the encoding of information that is implicit, and allow a means to distinguish implicit from explicit information. Thus we recommend that the scheme be developed in such a way that the information typically represented in dictionaries is "content" (within tags) and implied information is given in attributes.
(3) allow means to represent the printed form for any item, anywhere. We have already allowed for this in the current version, with the floating PFORM tag.

We also recommend that the dictionary guidelines contain the following sections:
(1) A discussion of distinctions between the varying goals in encoding dictionaries and their ramifications
(2) Examples of the same entry coded to serve different goals
(3) A discussion of a semantics for scoping, overriding, and defaults that can be applied to dictionaries tagged according to the scheme.
(4) Recommendations on which option to select for a "good" encoding
(5) Pointers to all other sections of the guidelines which contain tags potentially relevant for dictionaries, e.g., tags for names, hyphens, etc.

3. Tagging dictionary entries

3.1. The basic structure

In what follows, we propose a scheme for tagging entries based on a practical look at various dictionaries, including Dutch, English, French, German, Italian, and Spanish, mono- and bilingual entries.
This proposal gives the basic structure of an entry and a starting set of tags. Problematic cases of structuring and naming the information will be discussed; outstanding issues are summarized at the end of this section.

3.1.1. Monolingual dictionary entries

We argue for a view of the entry as comprising a group of objects, all of the same type. One might regard these objects as "senses" (although we have found this term not to be broad enough, as explained below). That is, any entry is basically a group of objects, each of which is associated with features such as an orthographic form, a pronunciation, a part of speech, an etymology, a morphology, a definition text, cross references, and examples, among others. Often, several senses share features--the most obvious is the sharing of an orthographic form and pronunciation. When information is shared, dictionaries typically "factor" the information, grouping senses with which the factored information is associated. For this reason dictionary entries are usually groups of senses sharing an orthographic form, and usually also a common etymology (as in the CED), or alternatively, a common part of speech (as in the LDOCE).

Therefore, in developing the guidelines scheme, a typical dictionary entry will be seen as consisting of a number of sub-entries (SE), each of which is associated with a group of features containing information about the written and spoken realization of the word(s) (FORM), morpho-syntactic properties of the entry (GRAM), a description (DESCRIP) (which may be a definition or some other text to help distinguish it from other words or phrases), example (EX) phrases in which the word(s) typically occur, restrictions on usage (USAGE), cross-references (XREF), and etymological (ETYM) properties. Sub-entries may be nested in order to show sub-groupings which share common information.
This allows common information to be factored, as it is in printed dictionaries, so that it appears only once even when it applies to several senses. This view of entries as comprising several (possibly nested) sub-entries also enables the representation of different factorings in different monolingual dictionaries. In this view one could assume there is no such thing as an entry but only groups of subentries. However, there are certain features which are associated with the entity we think of as an entry, which are not associated with sub-entries, most notably, KEY. We therefore define an ENTRY tag to group an entire entry. The SE tag can be used to group objects in entries on any basis--for example, change in category, extensions of the entry word in compounds and phrases, etc. Thus the SE tag essentially replaces the 'sense' grouping tag proposed in the guidelines. Note that through the use of attributes, sub-entries can be typed to indicate the basis of the grouping. They may also be numbered. When supplied via an attribute, the number would be one which did not appear explicitly in the text - see also the use of the SN tag for recording the explicit numbering found in entries. However, in some cases levels in a sense hierarchy are indicated in dictionaries through font changes, special symbols etc.-- see for instance the Hachette French. It should be noted that this scheme, which allows a hierarchical representation of dictionary data, does not preclude representing the entry in a "flat" non-nested form. The following is a skeletal outline of an entry, intended to show only how sub-entries can be nested, and how features specific to a given sub-entry can be specified.

<SE>
  <FORM> ... </FORM>           includes tags for orthographic form, pronunciation, etc.
  <GRAM> ... </GRAM>           includes tags for information on morphology and syntax
  <SE>
    <DESCRIP> ... </DESCRIP>   typically gives a definition
    <EX> ... </EX>             sample phrases containing the headword
    <XREF> ... </XREF>         cross-reference
  </SE>
  <SE>
    <DESCRIP> ... </DESCRIP>
    <EX> ... </EX>
  </SE>
  <XREF> ... </XREF>
</SE>

In this example there is one main sub-entry for which FORM and GRAM groups are given. There are two sub-entries nested within this overall one. In the first, there is a definition, an example, and a cross reference; in the second, a definition and example appear. Below, another cross reference appears. The FORM and GRAM groups have been factored; that is, they appear at the level of the most general sub-entry and thus apply to it and to each sub-entry nested within. The same is true for the last XREF in the example, which also appears at the outermost level. However, note that the first XREF appears within the first nested sub-entry and therefore applies to this sub-entry only. Similarly, the definitions and examples apply only to the sub-entry in which they appear. A full description of all tags appears in section 4, and examples of encoded entries appear at the end of the report.

3.1.2. Bilingual dictionary entries

Bilingual entries contain information similar to that in monolingual entries; however, the conventions adopted for specifying translation information, i.e. the explanations given for the source and target word(s), are quite different from those of monolingual dictionaries. The presentation of this information varies widely from one dictionary to another. In the scheme we propose, sub-entries are still the basic grouping elements. Within sub-entries, information specific to translation is grouped under the TRANSL tag, which can in turn contain information regarding restrictions or clarification on the word(s) to be translated (SOURCE) and the translation(s) (TARGET). Within the SOURCE and TARGET tags we often find grammatical information and various restrictions concerning the use of the word(s) in question, which are tagged with the tags used for monolingual dictionaries.
We have introduced the tag SEMANTIC_INDICATOR to tag the kind of information frequently found in bilingual dictionaries regarding semantic restrictions, information which differs from DOMAIN or other kinds of usage indicators typically found in monolingual dictionaries. (We could have called it 'hints', as the front matter of one dictionary described it.) However, there is no reason this tag could not be used for a monolingual dictionary, if the appropriate information appeared there. A schematic view of a bilingual entry: including orthographic and phonetic forms information on morphology and syntax explanations and restrictions on the source information concerning the translation more than one translation is often given with different restrictions on the source . . . . . . The major difference between the monolingual and bilingual structures is the introduction of the translation block, grouping the information specific to a given language under 'source' and 'target'. The use of similar structures for both mono- and bilingual entries is intended to help identify the information common to all lexical entries. Whereas the guidelines proposed one tag (DESCRIP) for either monolingual definitions or translations (sect 7.4.4), we propose a separate group for the translation information. Bilingual dictionaries typically do not attempt to define or explain the meaning of the headword; rather they assume the general meaning is known and simply restrict the semantic domain in order to give an appropriate translation. Representing bilingual information as essentially an extension to the monolingual entry is useful if dictionaries are to be merged. 4. Dictionary tags The following is a proposal for a somewhat extended tag set for mono- and bilingual dictionaries. We have attempted to cover the major types of information found in dictionaries. This should be seen as a starting set that will need further additions. 
We need to fill out this section, indicating attributes for each tag and the tags they can contain.

4.1. Grouping Tags

Grouping tags contain only other tags, between which there may in fact be content. Thus their function is only to group information in order to identify it as comprising an entity or whole. At present, we have determined that ENTRY is the outermost tag and cannot be nested. SE and FORM can be recursively nested, and GRAM can be nested in FORM. Any tag can appear within an SE at any level. We have not fully specified where the various tags can appear; this needs work. In particular we need to develop a DTD.

ENTRY   : top level tag for dictionary entry
SE      : sub-entry tag to group sections of the entry
        : this field may contain all of the following tags
        : sub-entries may be nested
FORM    : contains information on orthographic (ORTH) and phonetic (PHON) form
        : may also contain a GRAM, USAGE
GRAM    : contains morpho-syntactic information, e.g. POS, GENDER, NUMBER, CASE
DESCRIP : contains the definition text and other 'defining material'
        : may also contain usage notes on subject field, register etc.
EX      : contains sample phrases using the headword
        : may contain usage notes on subject field, register etc.
XREF    : indicates cross references of different types (synonym, antonym, etc.)
        : may also occur within DESCRIP and EX fields
ETYM    : contains etymological information; content tags not developed
TRANSL  : groups information about both languages in question
SOURCE  : contains information specific to the entry word(s)
        : may contain GRAM information and a SEMANTIC_INDICATOR field
TARGET  : contains the translation text
        : may contain GRAM information and a SEMANTIC_INDICATOR field
SEMANTIC_INDICATOR : typically contains usage notes and other descriptive text to limit the semantic domain

4.2. Simple tags

The following is an initial set of tags which essentially contain the portions of the printed text found in the entries; see also discussion section 5 on literal vs. interpreted content. Note that these tags may contain various attributes.

HDW    : to tag the headword(s) as occurring in the dictionary
HOM    : homograph number
SN     : sense or sub-entry number, for explicit numbering of sub-sections
ORTH   : orthographic representation of the word(s) in question
PRON   : pronunciation field
POS    : part of speech
GENDER : morphological gender (e.g. masc, fem, neuter)
NUMBER : number (e.g. sg, pl)
CASE   : case (e.g. nom, acc, dat, gen)
USAGE  : usage note from a finite set of features for register, geography, etc.
         note: this tag replaces the 'label' tag suggested in the guidelines
EGWORD : to tag the headword used in an example phrase (often inflected)
PFORM  : text as found in printed dictionary (may occur anywhere)
NOTE   : general tag for other information for which no tag was foreseen (or perhaps whose content could not be fully analyzed)
TEXT   : embedded under a group tag, this field would contain the printed text
or alternatively:
DESCRIP_TEXT : for definition
TR_TEXT      : for translation
XREF_TEXT    : for cross-references
...

Sample entries using these tags are given at the end of the paper.

4.3 Attributes

Attributes have not been fully worked out, but at points we determined some for certain tags, which are given below. Possible values are given in parentheses, where they can be specified.

ENTRY: id, key, type (main, prefix, suffix, related, etc.)
SE: type (sense, compound, phrase, derivative, idiom, collocation, inflected form, hom_gram, hom_lex), selevel (this would be a number indicating the depth of nesting)
FORM: type (compound, derivative, inflected, etc. -- this list needs work)
USAGE: type (semantic, domain, geographic, register, time, orth, figure)
SEMANTIC_INDICATOR: type (all of the usage types plus indicators on type of source context provided -- this needs some work to see how much can be formalized)

There are more, and this whole area needs work.
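As a rough illustration of how the tentative attribute lists in section 4.3 might be recorded for checking tagged entries, the following Python sketch stores them in a table. It is an assumption-laden sketch, not part of the proposal: the names ATTRIBUTES and check_attribute and the four status strings are hypothetical. Because the lists are explicitly open ("etc.", "needs work"), a value that is not listed is reported as "unlisted" rather than rejected.

```python
from typing import Dict, Optional, Set

# Usage types listed in section 4.3; SEMANTIC_INDICATOR reuses them.
USAGE_TYPES: Set[str] = {
    "semantic", "domain", "geographic", "register", "time", "orth", "figure",
}

# Tentative inventory: tag -> attribute -> listed values.
# None marks an attribute for which the report lists no values.
ATTRIBUTES: Dict[str, Dict[str, Optional[Set[str]]]] = {
    "ENTRY": {
        "id": None,
        "key": None,
        "type": {"main", "prefix", "suffix", "related"},
    },
    "SE": {
        "type": {
            "sense", "compound", "phrase", "derivative", "idiom",
            "collocation", "inflected form", "hom_gram", "hom_lex",
        },
        "selevel": None,  # a number giving the depth of nesting
    },
    "FORM": {"type": {"compound", "derivative", "inflected"}},
    "USAGE": {"type": set(USAGE_TYPES)},
    "SEMANTIC_INDICATOR": {"type": set(USAGE_TYPES)},  # plus further indicators
}


def check_attribute(tag: str, attr: str, value: str) -> str:
    """Classify an attribute value against the tentative inventory.

    Returns "unknown" if the tag or attribute is not in the inventory,
    "free" if no values are listed for the attribute, "listed" if the
    value is one of those given, and "unlisted" otherwise.
    """
    attrs = ATTRIBUTES.get(tag)
    if attrs is None or attr not in attrs:
        return "unknown"
    listed = attrs[attr]
    if listed is None:
        return "free"
    return "listed" if value in listed else "unlisted"


print(check_attribute("SE", "type", "idiom"))
print(check_attribute("USAGE", "type", "slang"))
```

A real check of this kind would more naturally be expressed in the DTD once it is developed; the table form is used here only because the value lists are still in flux.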
4.4 XREF

We developed a set of tags for XREFs, which are given as simple tags in the current guidelines. However, it is obvious that this should be a group. We have identified two sub-tags that can appear inside XREF: XREF_LABEL and XREF_PATH. XREF_LABEL contains the label indicating the type of the cross-reference (e.g., synonym, antonym, archaic use, etc.). XREF_PATH is itself a grouping tag, containing at a minimum HDW (the word in question), as well as possibly SN (when a sense is given). Here are some examples of XREF:

When a cross-reference appears in the middle of a definition text:
blah blah blah synthese blah blah blah

When a cross-reference appears as a block at the end of a sense or entry:
(1) V. %% actual label from dict Armillaire (sphere)
(2) glovie II

An xref with a long explanation in it:
blah blah blah harmonie blah blah blah

5. Literal vs. interpreted representation

Printed dictionary entries use a variety of abbreviatory conventions and compaction techniques for essentially two reasons: 1) to save space (i.e. limit the number of pages according to the specification of a given edition) and 2) to visually help the human reader quickly identify the relevant section of the entry (without having to scan masses of details). Once the section is found, the reader will need to interpret (or expand) the compacted information. For example, the entry containing the string 'spectateur, -trice NM,F' is a shorthand for the two strings 'spectateur NM' and 'spectatrice NF'. (Actually it is a shorthand for a structured element containing orthographic information as well as morpho-syntactic information.) The first string can be viewed as the explicit text, the two expanded strings as implicit or derived. The current guidelines (sect. 7.4.3) identify the case where the actual spelling of a form may be obscured by hyphenation, stress, etc. markers and recommend processing the string to represent the 'true orthography'.
They do not, however, suggest how this information could be rendered 'explicit' and recorded in the structure (as one might wish to do with German separable prefix verbs, where stress marks whether the prefix is separable or not, and a syllable boundary marks the end of the prefix and the point where the past participle infix 'ge' is inserted).

When annotating a printed entry (i.e. tagging the fields), there are three choices:
(1) include only the information as it appears in the printed text
    spectateur -trice
(2) include only "expanded" information
    spectateur spectatrice
(3) include both forms

The solution to (3) is not straightforward. The solution currently suggested (or implied) by the guidelines is to simply put every piece of printed text in a PFORM field and otherwise use a tagging schema that would represent the interpreted information. However, this may demand restructuring parts of the entry and it may not be clear how to handle the placement of PFORM. We should provide some examples of recommended practice. E.g.,

Code "NM" as
    noun masculine NM
Code "~en" in an entry for "take" as
    taken ~en
Code "buste ... (ANAT) chest; (: de femme) bust; (sculpture) bust" as
    ANAT de femme bust

Note that we have factored out the semantic marker ANAT over the second translation field and typed the information on the basis of the typographic conventions, i.e. the use of ':' means, "here is some context to situate when this translation should be used". It is clear that the generation of the expanded information in these and many other cases can be problematic. It should be noted, possibly in the guidelines, that we are not dealing with the issue of how to go about interpreting or processing dictionary entries, a topic not within our brief.

6.
A semantics for tagged entries 6.1 Scoping, overriding, and defaults In the previous sections we looked at what tags are necessary to account for individual pieces of information found in printed dictionary entries, at how to structure the information, and at potential problems regarding the implied (but not explicitly represented) content of the entries. One might then assume that given a list of tags with clear explanations, a list of possible attributes and their values, a DTD defining the structure of the tagging, and a set of suggestions on how to deal with problematic cases, that the guidelines would be complete. This is, however, not the case if one takes the view that interchange formats are meant to offer other users electronic access not only to the printed text but also to the contents of the text, including the (linguistic) relations between the parts. This might be interpreted as going beyond the purposes and possibilities of SGML and the guidelines. Nevertheless, it is a topic that must be considered if our tagging enterprise is going to be of any future use. This section will consider what the tags and structures 'mean' in view of using the information they contain for eg. making new dictionaries, extracting the information for some computational application, etc. SCOPE. It is clear that factored information in dictionary entries is intended to be applied over nested sub-entries. Therefore we need to introduce the notion of the scope of information over nested structures, which parallels the way in which variables have scope within nested procedures in block-structured computer languages. OVERRIDES. However, as discussed above in section 2.3, information specific to a given sub-entry often overrides the scoped information. Again, this is as in block-structured languages where a local variable with the same name as a global one overrides. DEFAULTS. 
There is much default information in dictionaries, often specified in the front matter of the dictionary (e.g., "unless otherwise specified, a verb is transitive..."). We need a mechanism to specify default information in the tagging, in tags at the outermost level of the dictionary (do we, by the way, need a DICT tag?). We did not deal with this. The semantics would specify that this information applies when not given explicitly.

To make this clearer, consider again the change of category example discussed above under issues of scope and defaults and assume the following structure assigned to a given entry. (These are oversimplified examples to illustrate the point.)

verb ... ... adj ... ... ...

In this example, the POS "verb" has scope over the sub-entries nested inside the first SE group, and "adj" applies to those in the second SE group. Without a scoping mechanism, the POS tag would have to be repeated under each SE to make it explicit which part of speech applies. The point here is that we assume a 'meaning' of the structure that is not given by the representation. Note that in the example above, assuming that the default for verbs is transitive, we can also assume this piece of information for the first SE group. Consider a second example:

verb ... adj ... ... ...

In this example, the first and third nested SE's have part of speech "verb" according to the scope rules. However, the second SE has part of speech "adj", since the respecification of this information overrides that given at the outer level. Printed dictionary entries are structured implicitly using these rules of scope and overriding, and so they seem essential to our encoding scheme. Furthermore, a database representation is likely to use a similar factoring to avoid redundancy. Although SGML cannot specify the semantics, it is assumed that the tagged content of dictionary entries will be disambiguated and accessible by a program. The processing issue is also addressed in section 7.4.4.
of the guidelines regarding the use of a nested or flat structure to represent senses. Though the example given above will also have implications for processing, it is meant, first of all, to illustrate the problem of interpreting a given structure. We would like to suggest that this issue be addressed in the guidelines. One possibility would be to suggest an additional document (or point to existing ones) which gives a semantics to the adopted notation.

6.2 Concatenation vs. override

The SN tag presents a problem in terms of the overriding described above. For example, consider the following:

    I 1 a ... ... ...

The information within the SN tags in embedded SEs does not override that at the outer level; rather, it should be concatenated to it. For example, the SN for the innermost SE is actually I.1.a. How should we handle this? One solution is to use an explicit numbering schema (e.g., =I.1.a); another is to give a "level" indicator, based on the implicit hierarchy, as an attribute on SE and let a processor generate the numbers. This topic is in need of further work.

7. Outstanding issues

The following is a list of issues that have not been adequately addressed within the dictionary print group. They will need detailed attention and may imply further changes to the proposal for the dictionary section of the guidelines.

(1) close consideration of the 'type' attribute on the FORM group. We determined that there should be a type attribute which takes values such as "compound", "variant", etc., but it was not clear whether this should be allowed to appear only on the FORM tag or also on an ORTH or a PRON tag, etc.
(2) representation of variants (alternative forms, alternative explanations, definitions and translations, etc.)
(3) representation of meta language (words and symbols used throughout the text; these may be explicit field separators or commentary within a portion of text)
(4) attributes (each of the simple and group tags should have a defined list of attributes with possible values, distinguishing finite from non-finite sets. We started this and made some tentative lists)
(5) determination of structural relations among tags, and optional and obligatory tags (DTD development)

8. Conclusion

This paper is a summary of the numerous topics addressed during the two-day meeting in Pisa. It is meant as a discussion document; comments are welcome on any of the issues raised here. It is hoped that the proposed tags and structures provide enough material for other members to test them on some printed entries, taking into account the discussion sections.

8.1 Sample tagged entries

Here is a simple entry (bold indicated in capitals, italics surrounded by single quotes) with a tagged version, followed by some discussion (keyed to line number).

    CANARY (k@'n#@ri) 'n., pl'. -NARIES. 1. a small finch of the Canary Islands and Azores: a popular cagebird noted for its singing. 2. CANARY YELLOW. A. a light yellow. B. ('as adj'): 'a canary-yellow car'. 3. a sweet wine similar to Madeira. 4. 'Arch'. a sweet wine from the Canary Islands. [C16: < OSp. 'canario' of ...]

     1
     2
     3 canary
     4 k@'n#@ri
     5 n
     6
     7
     8 canaries
     9 -naries
    10 pl
    11
    12 1
    13 a small finch of the Canary Islands and
    14 Azores: a popular cagebird noted for its singing
    15
    16 2
    17 a
    18
    19 canary yellow
    20
    21 a light yellow
    22
    23 b
    24 as adj
    25 a canary-yellow car
    26
    27
    28 3
    29 a sweet wine similar to Madeira
    30
    31 4
    32 Arch
    33 a sweet wine from the Canary Islands
    34
    35 ...
    36

1. The top-level sub-entry tag is included to indicate the scope of the following form features over the sub-entries.
5. Gives only 'pos' and not number; compare to 10., which gives only 'number' and not 'pos'. (By convention we know that the first form is singular.)
16. This subentry introduces a new 'headword' (or extension), i.e. canary yellow. The 'pos' given above is still implied for 2a, but not for 2b.
24. The change in 'pos' is given as a note; an 'intelligent' program might know that this is a new value for gram, but the syntax is that of a usage note. We have not represented the ':' following "(as adj)"; is it significant?
25. This is a sample phrase of 'canary yellow', not of the top form 'canary'. Perhaps 'canary-yellow' should be tagged as EGWORD, treating the hyphen like inflection.
28. This sense refers back to the top-level 'form', i.e. canary, and thus inherits everything from 'form' and 'gram'; the same holds for the last sense.
35. It is not clear whether the etymological information actually applies to the entire entry, or merely to the headword 'canary'.

Here is a simple bilingual entry (uppercase for bold, single quotes for italic).

    CANARY [k#'n@#ri] 1 'n' (A) ('bird') canari 'm', serin 'm'. (B) ('wine') vin 'm' des Canaries. 2 'cpd' ('also' CANARY YELLOW) ('de couleur') jaune serin 'inv', jaune canari 'inv'. ('Bot') CANARY GRASS alpiste 'm'; ('Geog') CANARY ISLANDS 'or' ISLES, CANARIES (iles 'fpl') Canaries 'fpl'; ('Bot') CANARY SEED millet 'm'.

     1
     2 canary
     3 1
     4 n
     5 a
     6
     7 bird
     8 canari
     9 m
    10
    11 serin
    12 m
    13
    14
    15 b
    16
    17 wine
    18 vin
    19 m
    20 des Canaries
    21
    22
    23
    24 2
    25 cpd
    26
    27 canary yellow
    28 also canary yellow
    29 de couleur
    30 jaune serin
    31 inv
    32 jaune canari
    33 inv
    34
    35
    36 Bot
    37 canary grass
    38 alpiste
    39 m
    40
    41
    42 Geog
    43 Canary Islands
    44 Isles, Canaries
    45 isles, fpl
    46 Canaries
    47 f pl
    48
    49
    50 Bot
    51 canary seed
    52 millet
    53 m
    54
    55
    56

1. The entire entry is grouped under a subentry to indicate the scope of the form field (as above, this level may be found unnecessary in this case).
7. The use of a semantic indicator seems justified here; one cannot really say that 'bird' and 'wine' (17.) are usage notes. The indicators in the compound section, however, are usage notes (as indicated by the capitalization of the words).
8. and 10. The two translations are each grouped in a separate translation field. This is not enough to distinguish translations separated by a comma (as in this case) from those separated, in other cases, by a semicolon.
18.-20. This is not a very satisfactory representation of the phrase and of the grammatical information associated with one word in the phrase.
25. It is not clear that compound should be considered a POS, though it has a similar status.
27. The phrase canary yellow is identified here as the ORTH; perhaps this should be PHRASE or CPD (varying types of expressions are translated in bilingual entries).
28. Note the use of PFORM to save the phrase (it is not clear what 'also' means here). The marking of italic and bold may not be standard.
30.-34. The different translations have been separated as in the previous subentry. A new tag is introduced here, i.e. INFL for inflection. The value 'inv' indicates that the phrase is invariable (does not agree in number or gender with the noun it is modifying). Alternatively, this could be just a note in GRAM.
43.-44. It is not clear how to represent this; compare with 28.

As these examples demonstrate, there are still many problems in tagging entries.
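
As a hedged illustration of the semantics discussed in sections 6.1 and 6.2, the following Python sketch shows one way a processor might interpret nested sub-entries: features such as part of speech are inherited by scope, overridden when respecified locally, and completed from dictionary-wide defaults, while sense numbers are concatenated rather than overridden. The class SE, the function resolve, the DEFAULTS table, and the feature names "pos" and "subcat" are assumptions made for this illustration only; they are not part of the proposed tag set or of any existing software.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical dictionary-wide defaults, standing in for front-matter
# statements such as "unless otherwise specified, a verb is transitive".
DEFAULTS = {"verb": {"subcat": "transitive"}}


@dataclass
class SE:
    """A sub-entry: optional sense number, local features, nested sub-entries."""
    sn: Optional[str] = None
    feats: dict = field(default_factory=dict)
    subs: list = field(default_factory=list)


def resolve(se, inherited=None, prefix=""):
    """Yield (full sense number, effective features) for se and its descendants."""
    # SCOPE and OVERRIDE: local values replace inherited ones of the same name.
    explicit = {**(inherited or {}), **se.feats}
    # DEFAULTS: consulted last, only for features still left unspecified.
    effective = {**DEFAULTS.get(explicit.get("pos"), {}), **explicit}
    # CONCATENATION: sense numbers accumulate down the hierarchy.
    number = ".".join(part for part in (prefix, se.sn) if part)
    yield number, effective
    for sub in se.subs:
        # Only explicit values are passed down, so a default attached to
        # "verb" does not leak into a sub-entry respecified as "adj".
        yield from resolve(sub, explicit, number)


# Second example of section 6.1: outer POS "verb", second nested SE is "adj".
entry = SE(feats={"pos": "verb"}, subs=[SE(), SE(feats={"pos": "adj"}), SE()])
results = list(resolve(entry))

# Example of section 6.2: nested sense numbers I, 1, a.
senses = SE(sn="I", subs=[SE(sn="1", subs=[SE(sn="a")])])
numbers = [number for number, _ in resolve(senses)]

for number, feats in results:
    print(repr(number), feats)
print(numbers)
```

Under these assumptions, the first and third nested sub-entries come out as transitive verbs, the second as an adjective with no transitivity value, and the innermost sense number as I.1.a. Passing only explicit values down the hierarchy is a design choice of this sketch; whether defaults should themselves be scoped is one of the questions left open above.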