<!-- Document TEI AI5 M4 "Minutes from Subgroup meeting" -->
Report from the TEI Print Group Meeting - August 9-10, 1991, Pisa
(Notes compiled by N. Ide and S. Warwick-Armstrong, Aug. 28, 1991)
 
Participants: Nicoletta Calzolari, Nancy Ide, Susan Warwick-Armstrong
 
 
0. Introduction
 
The meeting was initially set up to do some practical work on
bilingual dictionary entries. However, before looking at the
information specific to translation, we found it desirable to review
what was already specified for monolingual entries in the dictionary
section of the guidelines and in a more recent report from the group
working on morphological tags.
 
After an initial review of the guidelines we drew up some sample
entries in order to test where the text needed explanation and/or
extensions.  We then turned to information specific to bilingual
entries and modified the schema we had developed to accommodate the
translation information.
 
What follows is a synthesis of the topics we discussed and the
proposals we came up with during the meeting.
 
1. Review of existing TEI documents
 
1.1. Dictionary section (TEI P1, section 7.4)
 
Though the dictionary section provides a starting point for coding
printed entries, it is far from complete both in the explanations it
provides and in the range of tags it has specified.  With regard to
the explanations provided, the three areas obviously missing are the
references to other relevant sections in the guidelines,
exemplification, and discussion of how to use the proposed tags.  When
we tried to use the guidelines to code some entries, the latter topic
proved to be especially significant given the impossibility of
specifying once and for all everything contained in every dictionary
and thus the choices to be made when marking up a given entry.  (Cf.
section 2 for more discussion of this topic.)
 
One outstanding group of information which would merit some discussion
in the dictionary section, as well as pointers to other relevant
sections in the guidelines, is the grammatical information typically
found in printed entries. It would be desirable to use one set of tags
to account for the morphological and syntactic properties of words
throughout the guidelines.  There are, however, differences between
the information provided in a dictionary and the properties ascribed
to words occurring in texts.  Printed entries provide information
about all of the possible uses of a given word in rather specialized
and often compacted forms and the codes may not directly correspond to
common linguistic conventions.
 
We also saw the need to more fully specify the XREF tagging, develop
the use of attributes on FORM groups, and make a GRAM group.
 
A general problem which arose in trying to use the dictionary
guidelines was the lack of an overview section which would provide the
skeleton (or basic structure) of an entry. Section 7.4.2 is an attempt
to provide such an overview, but is somewhat too general and
oversimplified. The guidelines should specify, insofar as it is
possible, not only what tags belong to a group, but also, which groups
can be embedded in other groups. Though the DTD will formally
define the groupings, they should also be exemplified and discussed
in the text. We do note from our recent experience, that this will not
be an easy task, given the range of dictionary format conventions and
the various types of information they contain.
 
1.2. Morphological Features (TEI AI1W2)
 
We looked at the tags in the 'starter set' (TEI AI1W2) for
morphological features, and we found that they essentially cover all
the features found in printed dictionaries. However, we had the
following comments:
 
(1) The morphological features used here often have different
names from those used in dictionaries. For example, linguists
typically use the word 'category' whereas lexicographers typically use
'part of speech' or 'pos'. Question: should we provide a mechanism (or
two tags) to use either name? The difference in tagging often stems
from the fact that dictionary entries specify the potential
realization of words, not their actual use in texts. We should work
with morphology on this, for they seem to have only the "word in use"
view.
 
(2) Names also conflict in more basic ways: the proposed 'form' tag
for dictionaries, for example, has a different function from the
'form' attribute proposed for the morphology. Again, we should work
with morphology.
 
(3) In some cases, an explanation of the attributes and values would
have been useful.  E.g., the attribute 'unit' with values '+/-': does
this refer to plural formation?  Does 'unit' mean divisible or not
divisible?  Does it replace the traditional 'mass/count' distinction?
 
(4) The 'reciprocal' attribute provided for verbs does not adequately
cover the distinctions often made in French dictionaries (and perhaps
others).  In the 'Robert methodique', for example, verbs can be
classified as 'pronominal reflexive' or 'pronominal reciprocal'. The
current morphological scheme does not allow this. Dictionaries also
code verbs as 'impersonal'.
 
(5) We see the need to be able to encode the auxiliary a verb takes,
i.e. 'be' or 'have'; but 'auxiliary' is itself currently an attribute.
 
(6) We noted a missing feature for prepositions assigning case
(e.g. in German).
 
2. General Issues
 
2.1 Goals
 
The TEI recommendations for encoding "print" dictionaries are at
present (at least implicitly) intended to serve several different
groups, with different goals that have strong repercussions for what
must be included in the guidelines. These groups include at least the
following:
 
(1) researchers in humanities disciplines interested in encoding
existing historical dictionaries, and who therefore want to retain the
exact (printed) form of the original (but who may also want to render
some implicit or compressed information explicitly)
 
(2) the lexicographer who is creating data which may be used in a
specific printed dictionary
 
(3) the publisher who would like to retain a database of lexical
information which could potentially be used to generate different
versions of a dictionary (e.g., a compressed version, one without
examples, etc.)
 
(4) the linguist who wants to have access to the information in
existing print dictionaries, possibly in order to further process it.
 
There exist two (potentially conflicting) concerns here: (1) to retain
in its exact form the printed text of a dictionary and (2) to
represent the information in a usable (retrievable) form, possibly
retaining information which is implicit or compressed in a printed
dictionary. In some cases one application (e.g. encoding historical
dictionaries whose form *and* content are a subject for research) may
involve both concerns.
 
2.2 Issues arising from different goals
 
The differences outlined above have an impact on the encoding. For
example,
 
(1) compressed information such as a rendering of a headword as
"curieux, -ieuse" needs to be retained as such if the print form must
be preserved, whereas the use of this information in a database would
demand its expansion into two items, "curieux" and "curieuse". This is
a simple example; a glance at any dictionary will show that all kinds
of information are compressed this way. Another example is "nm" for
"noun" "masculine"; probably dictionary editors, and certainly
linguists, want to split this into two pieces of information.
 
(2) a more serious problem is the varying treatment of information in
print dictionaries. For example, when variant or inflected forms of a
word exist, the lexicographer will typically choose one to serve as
the key to the entry in which the definition text(s) appears and list
the others within the same entry. However, in cases where the variant
or inflected form is alphabetically distant from the key, it is often
given in a separate entry. Related entries are treated in the same
way. If the concern is only to reflect the printed form of a text,
varying treatment of the same objects is not a problem. If the concern
is to be able to access information according to its type, this can be
a problem, especially if the information is to be in a database.
 
(3) there is much information implicit in dictionary entries that it
is highly desirable to retain for many applications. For
example, usage notes come in several varieties: domain, geographical
info, register, etc. but the type of note is not specified explicitly
except in the prefatorial matter of the dictionary, where a closed
list of options is provided. Other information is implicit in
dictionary entries by default; for instance, if not specified in a
British dictionary, the geographical specification on any word is
"British". For linguistic purposes at the very least, this information
is essential to retain, even though it does not appear in the printed
form.
 
2.3 Other issues
 
Another problem arises not from the conflict cited above but from our
goal to accommodate the encoding of *any* existing print dictionary.
The difficulties here are well known and arise from the fact that
different dictionaries factor information in different ways. For
example, in one dictionary, all senses of a given orthographic form
with the same etymology will be grouped in a single entry, regardless
of part of speech; whereas in another, different entries for the same
orthographic form are given if the part of speech is different. Within
entries, information is also factored: in fact, the salient feature of
dictionary entries is that information such as part of speech, form,
etc. is factored out so as to apply to a number of senses. Factored
information appears "outside" or above a grouping of senses, often at
the head of the entry. In this way it is given only once.
 
In addition, there is an "override" system in dictionary entries: it
is very common to give exceptions for a specific sense when factored
information does not apply. For example, one sense of the English word
"heave" has a different past participle; dictionaries handle such
things in something like the following way:
 
heave (hi:v) vb. heaves, heaving, heaved...
5. (past tense and past participle hove) Nautical. a. to move
or cause to move in a specified way ... b. (intr.) (of a vessel) to
pitch or roll.  ...
 
The specification of a different past tense on sense 5 indicates that
for this sense only, there is an override for the past tense form.
Factoring is a result of the need to conserve space in print
dictionaries, so that common information is not respecified.
Conceivably, a database could duplicate information that is factored
and thus eliminate the problem of representing factored information,
but it is obviously desirable to factor in a database for space
reasons as well.
 
There is a semantics associated with factoring. That is,
 
(1) factored information has a scope which must be defined
 
(2) the presence of duplicate information (e.g. a past tense for a
given sense when one which has been factored should apply) must be
recognized as an override.
 
Default information is similar; it too implies a semantics.
 
It is not clear what the TEI should do here. Do we specify an
encoding, and suggest a semantics? It seems imperative, if this scheme
is to be of use. There is not a clear precedent in the current
guidelines, although the encoding of trees in section 7 seems to imply
a constituent structure, which is a sort of semantics.
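 
One possible reading of the scoping and override semantics discussed
in this section can be sketched in Python. This is an illustration
only: the key names ("geo", "pos", "past"), the three layers and the
helper effective() are assumptions made for the example, not part of
any TEI proposal. Each level of nesting is treated as a layer that is
consulted before the layers enclosing it.

```python
from collections import ChainMap

# Hypothetical layers for the "heave" entry quoted above.
# The key names are illustrative, not proposed tag names.
defaults = {"geo": "British"}                             # implied by the dictionary as a whole
entry = {"orth": "heave", "pos": "vb", "past": "heaved"}  # factored at entry level
sense_1 = {}                                              # a sense that states nothing itself
sense_5 = {"past": "hove", "usage": "Nautical"}           # given on sense 5 only


def effective(*layers):
    """Combine layers, innermost first; an inner value overrides an outer one."""
    return dict(ChainMap(*layers))


print(effective(sense_1, entry, defaults)["past"])  # heaved (factored value applies)
print(effective(sense_5, entry, defaults)["past"])  # hove   (local value overrides)
```

Under this reading an override needs no special marking: a value given
at an inner level simply wins over the factored or default one.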
 
2.4 Recommendations
 
We recommend that the TEI scheme attempt to accommodate all the goals
outlined in section 2.1. That is, we can specify a scheme which can be
used to represent the printed form of a dictionary only, but which
provides additional mechanisms which can be used when a rendering and
structuring of information for further processing is desired.
 
To accomplish this the scheme must
 
(1) allow varying factorings of information. This implies a scheme
wherein entities can be nested, and in which various pieces of
information can appear at any level of the nesting.
 
(2) be flexible enough to allow the encoding of information that is
implicit, and allow a means to distinguish implicit from explicit
information. Thus we recommend that the scheme be developed in such a
way that the information typically represented in dictionaries is
"content" (within tags) and implied information is given in
attributes.
 
(3) allow means to represent the printed form for any item, anywhere.
We have already allowed for this in the current version, with the
floating PFORM tag.
 
We also recommend that the dictionary guidelines contain the following
sections:
 
(1) A discussion of distinctions between the varying goals in encoding
dictionaries and their ramifications
 
(2) Examples of the same entry coded to serve different goals
 
(3) A discussion of a semantics for scoping, overriding, and defaults
that can be applied to dictionaries tagged according to the scheme.
 
(4) Recommendations on which option to select for a "good" encoding
 
(5) Pointers to all other sections of the guidelines which contain
tags potentially relevant for dictionaries, e.g., tags for names,
hyphens, etc.
 
3. Tagging dictionary entries
 
3.1. The basic structure
 
In what follows, we propose a scheme for tagging entries based on a
practical look at various dictionaries, including Dutch, English,
French, German, Italian, and Spanish, mono- and bilingual entries.
This proposal gives the basic structure of an entry and a starting set
of tags.  Problematic cases of structuring and naming the information
will be discussed; outstanding issues are summarized at the end of this
section.
 
3.1.1. Monolingual dictionary entries
 
We argue for a view of the entry as comprising a group of objects, all
of the same type. One might regard these objects as "senses" (although
we have found this term not to be broad enough, as explained below).
That is, any entry is basically a group of objects, each of which is
associated with features such as an orthographic form, a
pronunciation, a part of speech, an etymology, a morphology, a
definition text, cross references, and examples, among others. Often,
several senses share features--the most obvious is the sharing of an
orthographic form and pronunciation. When information is shared,
dictionaries typically "factor" the information, grouping senses with
which the factored information is associated. For this reason
dictionary entries are usually groups of senses sharing an
orthographic form, and usually also a common etymology (as in the
CED), OR alternatively, a common part of speech (as in the LDOCE).
 
Therefore, in developing the guidelines scheme, a typical dictionary
entry will be seen as consisting of a number of sub-entries (SE), each
of which is associated with a group of features containing information
about the written and spoken realization of the word(s) (FORM),
morpho-syntactic properties of the entry (GRAM), a description
(DESCRIP) (which may be a definition or some other text to help
distinguish it from other words or phrases), example (EX) phrases in
which the word(s) typically occur, restrictions on usage (USAGE),
cross-references (XREF), and etymological (ETYM) properties. Sub-
entries may be nested in order to show sub-groupings which share
common information. This allows common information to be factored, as
it is in printed dictionaries, so that it appears only once even when
it applies to several senses. This view of entries as comprising
several (possibly nested) sub-entries also enables the representation
of different factorings in different monolingual dictionaries.
 
In this view one could assume there is no such thing as an entry but
only groups of subentries. However, there are certain features which
are associated with the entity we think of as an entry, which are not
associated with sub-entries, most notably, KEY. We therefore define an
ENTRY tag to group an entire entry.
 
The SE tag can be used to group objects in entries on any basis--for
example, change in category, extensions of the entry word in compounds
and phrases, etc.  Thus the SE tag essentially replaces the 'sense'
grouping tag proposed in the guidelines. Note that through the use of
attributes, sub-entries can be typed to indicate the basis of the
grouping. They may also be numbered. When supplied via an attribute,
the number would be one which did not appear explicitly in the
text - see also the use of the SN tag for recording the explicit
numbering found in entries. However, in some cases levels in a sense
hierarchy are indicated in dictionaries through font changes, special
symbols etc.-- see for instance the Hachette French.
 
It should be noted that this scheme, which allows a hierarchical
representation of dictionary data, does not preclude representing the
entry in a "flat" non-nested form.
 
The following is a skeletal outline of an entry, intended to show only
how sub-entries can be nested, and how features specific to a given
sub-entry can be specified.
 
 
<ENTRY key = xxx>
  <SE>
     <FORM>  includes tags for orthographic form, pronunciation, etc.
     <GRAM>  includes tags for information on morphology and syntax
     <SE>
        <DESCRIP> typically gives a definition
        <EX>      sample phrases containing the headword
        <XREF>    cross-reference
     </SE>
     <SE>
        <DESCRIP>
        <EX>
     </SE>
     <XREF>
  </SE>
</ENTRY>
 
In this example there is one main sub-entry for which FORM and GRAM
groups are given. There are two sub-entries nested within this overall
one. In the first, there is a definition, an example, and a cross
reference; in the second, a definition and example appear. Below,
another cross reference appears. The FORM and GRAM groups have been
factored; that is, they appear at the level of the most general sub-
entry and thus apply to it and to each sub-entry nested within. The
same is true for the last XREF in the example, which also appears at
the outermost level. However, note that the first XREF appears within
the first nested sub-entry and therefore applies to this sub-entry
only. Similarly, the definitions and examples apply to only the sub-
entry in which they appear.
 
A full description of all tags appears in section 4, and examples of
encoded entries appear at the end of the report.
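 
As a rough illustration of the remark that the nested scheme does not
preclude a "flat" representation, the following Python sketch walks a
well-formed XML rendering of the skeleton above and lists each
sub-entry with its depth of nesting (compare the selevel attribute in
section 4.3). The empty-element XML syntax and the function name
flatten() are assumptions made for the example; they are not part of
the proposed scheme.

```python
import xml.etree.ElementTree as ET

# The skeletal entry above, rendered as well-formed XML (an assumption:
# the report's own sketch leaves most end tags out).
SKELETON = """
<ENTRY key="xxx">
  <SE>
    <FORM/>
    <GRAM/>
    <SE><DESCRIP/><EX/><XREF/></SE>
    <SE><DESCRIP/><EX/></SE>
    <XREF/>
  </SE>
</ENTRY>
"""


def flatten(node, depth=0):
    """Yield (depth, tags of non-SE children) for every SE, outermost first."""
    for se in node.findall("SE"):
        yield depth + 1, [child.tag for child in se if child.tag != "SE"]
        yield from flatten(se, depth + 1)


for depth, tags in flatten(ET.fromstring(SKELETON)):
    print(depth, tags)
```

The depth recorded for each sub-entry is enough to rebuild the nesting
from the flat list, provided the order of the sub-entries is kept.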
 
3.1.2. Bilingual dictionary entries
 
Bilingual entries contain information similar to that in the
monolingual; however, the conventions adopted for specifying
translation information, i.e. the explanations given for the source
and target word(s) are quite different from monolingual dictionaries.
The presentation of this information varies widely from one dictionary
to another.
 
In the scheme we propose, sub-entries are still the basic grouping
elements. Within sub-entries, information specific to translation is
grouped under the TRANSL tag, which can in turn contain information
regarding restrictions or clarification on the word(s) to be
translated (SOURCE) and the translation(s) (TARGET). Within the SOURCE
and TARGET tags we often find grammatical information and various
restrictions concerning the use of the word(s) in question, which is
tagged with the tags used for monolingual dictionaries.
 
We have introduced the tag SEMANTIC_INDICATOR to tag the kind of
information frequently found in bilingual dictionaries regarding
semantic restrictions, information which differs from DOMAIN or other
kinds of usage indicators typically found in monolingual dictionaries.
(We could have called it 'hints', as the front matter of one
dictionary described it.) However, there is no reason this tag could
not be used for a monolingual dictionary, if the appropriate
information appeared there.
 
A schematic view of a bilingual entry:
 
<ENTRY>
  <SE>
    <FORM>  including orthographic and phonetic forms
    <GRAM>  information on morphology and syntax
    <TRANSL>
       <SOURCE> explanations and restrictions on the source
          <GRAM>
          <SEMANTIC_INDICATOR>
       </SOURCE>
       <TARGET> information concerning the translation
          <TR_TEXT>
          <GRAM>
       </TARGET>
    </TRANSL>
    <TRANSL>    more than one translation is often given
       <SOURCE> with different restrictions on the source
       <TARGET>
    </TRANSL>
    . . .
  </SE>
  <SE>
     <FORM>
     <GRAM>
     <TRANSL>
        <SOURCE>
        <TARGET>
     </TRANSL>
  </SE>
  . . .
</ENTRY>
 
The major difference between the monolingual and bilingual structures
is the introduction of the translation block, grouping the information
specific to a given language under 'source' and 'target'.  The use of
similar structures for both mono- and bilingual entries is intended to
help identify the information common to all lexical entries.  Whereas
the guidelines proposed one tag (DESCRIP) for either monolingual
definitions or translations (sect 7.4.4), we propose a separate group
for the translation information.  Bilingual dictionaries typically do
not attempt to define or explain the meaning of the headword; rather
they assume the general meaning is known and simply restrict the
semantic domain in order to give an appropriate translation.
Representing bilingual information as essentially an extension to the
monolingual entry is useful if dictionaries are to be merged.
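 
To make the role of the translation block more concrete, the Python
sketch below reads a small bilingual sub-entry and pairs each
translation with the restrictions given on its source. It is a sketch
under stated assumptions: the entry is written as well-formed XML, the
French-English content is illustrative, and the function name
translations() is hypothetical. Only the tag names come from the
scheme proposed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed rendering of a bilingual sub-entry.
ENTRY = """
<ENTRY>
  <SE>
    <FORM><ORTH>buste</ORTH></FORM>
    <GRAM><POS>noun</POS></GRAM>
    <TRANSL>
      <SOURCE><SEMANTIC_INDICATOR type="domain">ANAT</SEMANTIC_INDICATOR></SOURCE>
      <TARGET><TR_TEXT>chest</TR_TEXT></TARGET>
    </TRANSL>
    <TRANSL>
      <SOURCE><SEMANTIC_INDICATOR type="context">de femme</SEMANTIC_INDICATOR></SOURCE>
      <TARGET><TR_TEXT>bust</TR_TEXT></TARGET>
    </TRANSL>
  </SE>
</ENTRY>
"""


def translations(entry):
    """Return (headword, source restrictions, target text) for each TRANSL."""
    rows = []
    for se in entry.iter("SE"):
        headword = se.findtext("FORM/ORTH")
        for transl in se.findall("TRANSL"):
            hints = [s.text for s in transl.findall("SOURCE/SEMANTIC_INDICATOR")]
            rows.append((headword, hints, transl.findtext("TARGET/TR_TEXT")))
    return rows


for row in translations(ET.fromstring(ENTRY)):
    print(row)
```

Because FORM and GRAM sit outside the TRANSL groups, the same code
would read the monolingual part of a merged entry unchanged.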
 
4. Dictionary tags
 
The following is a proposal for a somewhat extended tag set for mono-
and bilingual dictionaries. We have attempted to cover the major types
of information found in dictionaries. This should be seen as a
starting set that will need further additions.  We need to fill out
this section, indicating attributes for each tag and the tags they can
contain.
 
4.1. Grouping Tags
 
Grouping tags contain only other tags, between which there may in fact
be content. Thus their function is only to group information in order
to identify it as comprising an entity or whole.
 
At present, we have determined that ENTRY is the outermost tag and
cannot be nested. SE and FORM can be recursively nested, and GRAM can
be nested in FORM. Any tag can appear within an SE at any level. We
have not fully specified where the various tags can appear; this needs
work. In particular we need to develop a DTD.
 
ENTRY   : top level tag for dictionary entry
 
SE      : sub-entry tag to group sections of the entry
        : this field may contain all of the following tags
        : sub-entries may be nested
 
FORM    : contains information on orthographic (ORTH) and phonetic
          (PHON) form
        : may also contain a GRAM, USAGE
 
GRAM    : contains morpho-syntactic information, eg. POS, GENDER,
          NUMBER, CASE
 
DESCRIP : contains the definition text and other 'defining material'
        : may also contain usage notes on subject field, register etc.
 
EX      : contains sample phrases using the headword
        : may contain usage notes on subject field, register etc.
 
XREF    : indicates cross references of different types (synonym,
          antonym, etc.)
        : may also occur within DESCRIP and EX fields
 
ETYM    : contains etymological information; content tags not
          developed
 
 
TRANSL  : groups information about both languages in question
 
SOURCE  : contains information specific to the entry word(s)
        : may contain GRAM information and a SEMANTIC_INDICATOR field
 
TARGET  : contains the translation text
        : may contain GRAM information and a SEMANTIC_INDICATOR field
 
SEMANTIC_INDICATOR : typically contains usage notes and other
                     descriptive text to limit the semantic domain
 
4.2. Simple tags
 
The following is an initial set of tags which essentially contain the
portions of the printed text found in the entries; see also discussion
section 5 on literal vs. interpreted content.  Note that these tags
may contain various attributes.
 
HWD     : to tag the headword(s) as occurring in the dictionary
HOM     : homograph number
SN      : sense or sub-entry number for explicit numbering of
          sub-sections
ORTH    : orthographic representation of the word(s) in question
PRON    : pronunciation field
POS     : part of speech
GENDER  : morphological gender (eg. masc, fem, neuter)
NUMBER  : number (eg. sg, pl)
CASE    : case (eg. nom, acc, dat, gen)
USAGE   : usage note from a finite set of features for register,
          geography, etc.
          note: this tag replaces the 'label' tag suggested in the
          guidelines
EGWORD  : to tag the headword used in an example phrase (often
          inflected)
PFORM   : text as found in printed dictionary (may occur anywhere)
NOTE    : general tag for other information for which no tag was
          foreseen (or perhaps whose content could not be fully
          analyzed)
 
TEXT    : embedded under a group tag, this field would contain the
          printed text
 
or alternatively:
 
DESCRIP_TEXT  : for definition
TR_TEXT   : for translation
XREF_TEXT : for cross-references
    ...
 
Sample entries using these tags are given at the end of the paper.
 
4.3 Attributes
 
Attributes have not been fully worked out, but at points we determined
some for certain tags which are given below.
 
Possible values are given in parentheses, where they can be specified.
 
ENTRY: id, key, type (main, prefix, suffix, related, etc.)
 
SE:  type (sense, compound, phrase, derivative, idiom, collocation,
inflected form, hom_gram, hom_lex), selevel (this would be a number
indicating the depth of nesting)
 
FORM: type (compound, derivative, inflected, etc. -- this list needs
work)
 
USAGE: type (semantic, domain, geographic, register, time, orth,
figure)
 
SEMANTIC_INDICATOR: type (all of the usage types plus indicators on
        type of source context provided -- this needs some work to
        see how much can be formalized)
 
There are more, and this whole area needs work.
 
4.4 XREF
 
We developed a set of tags for XREFs, which are given as simple tags
in the current guidelines. However, it is obvious that this should be
a group.
 
We have identified two sub-tags that can appear inside XREF:
XREF_LABEL and XREF_PATH.
 
XREF_LABEL contains the label indicating the type of the
cross-reference (e.g., synonym, antonym, archaic use, etc.).
 
XREF_PATH is itself a grouping tag, containing at a minimum HWD (the
word in question), as well as possibly SN (when a sense is given).
 
Here are some examples of XREF:
 
When a cross-reference appears in the middle of a definition text:
 
<DESCRIP>
    <TEXT> blah blah blah
       <XREF><XREF_PATH><hwd>synthese</hwd> </XREF_PATH></XREF>
            blah blah blah</TEXT></DESCRIP>
 
When a cross-reference appears as a block at the end of a sense or
entry:
 
(1) <XREF>
      <xref_label> V. </xref_label>         %% actual label from dict
      <XREF_PATH><hwd>Armillaire (sphere) </hwd></XREF_PATH></XREF>
(2) <XREF>
       <xref_label> Ant </xref_label>
       <XREF_PATH refid=%%%>
          <hwd>glovie</hwd>
          <sn> II </sn> </XREF_PATH></XREF>
 
An xref with a long explanation in it:
 
<XREF>
  <xref_text>blah blah blah <XREF_PATH><hwd>harmonie</hwd></XREF_PATH>
      blah blah blah </xref_text></XREF>
 
 
5. Literal vs. interpreted representation
 
Printed dictionary entries use a variety of abbreviatory conventions
and compaction techniques for essentially two reasons:  1) to save
space (i.e. limit the number of pages according to the specification
of a given edition) and 2) to visually help the human reader quickly
identify the relevant section of the entry (without having to scan
masses of details).  Once found, the reader will need to interpret (or
expand) the compacted information.  For example, the entry containing
the string 'spectateur, -trice NM,F' is a shorthand for the two
strings 'spectateur NM' and 'spectatrice NF'.  (Actually it is a
shorthand for a structured element containing orthographic information
as well as morpho-syntactic information.)  The first string can be
viewed as the explicit text, the two expanded strings as implicit or
derived.  The current guidelines (sect.  7.4.3) identify the case
where the actual spelling of a form may be obscured by hyphenation,
stress, etc.  markers and recommend processing the string to represent
the 'true orthography'.  They do not, however, suggest how this
information could be rendered 'explicit' and recorded in the structure
(as one might wish to do with German separable prefix verbs where
stress marks whether the prefix is separable or not and syllable
boundary marks the end of the prefix and where the past participle
infix 'ge' is inserted).
 
When annotating a printed entry (i.e. tagging the fields), there are
three choices:
 
(1) include only the information as it appears in the printed text
        <orth> spectateur </orth>
        <orth> -trice </orth>
 
(2) include only "expanded" information
        <orth> spectateur </orth>
        <orth> spectatrice </orth>
 
(3) include both forms
 
The solution to (3) is not straightforward. The solution currently
suggested (or implied) by the guidelines is to simply put every piece
of printed text in a PFORM field and otherwise use a tagging schema
that would represent the interpreted information.  However, this may
demand restructuring parts of the entry and it may not be clear how to
handle the placement of PFORM. We should provide some examples of
recommended practice.
 
E.g.,
 
Code "NM" as
 
   <GRAM>
      <pos> noun </pos>
      <gender> masculine </gender>
      <pform> NM </pform>
   </GRAM>
 
Code "~en" in an entry for "take" as
 
   <EGWD> taken <PFORM> ~en </PFORM> </EGWD>
 
Code "buste ... (ANAT) chest; (: de femme) bust; (sculpture) bust" as
 
    <TRANSL>
       <SOURCE> <SEM_IND type=domain>ANAT</SEM_IND>
                <SEM_IND type=context>de femme</SEM_IND>
       </SOURCE>
       <TARGET> <TXT>bust</TXT> </TARGET>
    </TRANSL>
 
Note that we have factored out the semantic marker ANAT over the
second translation field and typed the information on the basis of
the typographic conventions, i.e. the use of ':' means, "here is
some context to situate when this translation should be used".
 
It is clear that the generation of the expanded information in these
and many other cases can be problematic. It should be noted, possibly
in the guidelines, that we are not dealing with the issue of how to go
about interpreting or processing dictionary entries, a topic not
within our brief.
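 
Purely to illustrate the gap between the literal and the interpreted
form, the 'spectateur, -trice' case from the start of this section can
be sketched in Python as follows. The alignment rule used here is an
assumed, naive heuristic and the function names are hypothetical;
nothing in this sketch is a recommended procedure for interpreting
entries. The placement of PFORM inside the element follows the
pattern of the "~en" example above.

```python
def expand_suffix(headword, compact):
    """Naively expand a compacted form such as '-trice' against its headword.

    Assumed heuristic: the suffix replaces the headword from the last
    occurrence of the suffix's first letter onward. Real dictionaries use
    many other conventions, so this is not a general solution.
    """
    suffix = compact.lstrip("-")
    if not suffix:
        raise ValueError("empty suffix")
    cut = headword.rfind(suffix[0])
    if cut == -1:
        raise ValueError(f"cannot align {compact!r} with {headword!r}")
    return headword[:cut] + suffix


def encode_both(headword, compact):
    """Return choice (3): the expanded form, with the printed text in PFORM."""
    expanded = expand_suffix(headword, compact)
    return f"<orth>{expanded}<pform>{compact}</pform></orth>"


print(expand_suffix("spectateur", "-trice"))  # spectatrice
print(expand_suffix("curieux", "-ieuse"))     # curieuse
print(encode_both("spectateur", "-trice"))
```

Even this small case needs a rule that is specific to one printing
convention, which is why the expansion is best recorded explicitly in
the encoding rather than left to be recomputed.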
 
6. A semantics for tagged entries
 
6.1 Scoping, overriding, and defaults
 
In the previous sections we looked at what tags are necessary to
account for individual pieces of information found in printed
dictionary entries, at how to structure the information, and at
potential problems regarding the implied (but not explicitly
represented) content of the entries. One might then assume that given
a list of tags with clear explanations, a list of possible attributes
and their values, a DTD defining the structure of the tagging, and a
set of suggestions on how to deal with problematic cases, that the
guidelines would be complete. This is, however, not the case if one
takes the view that interchange formats are meant to offer other users
electronic access not only to the printed text but also to the
contents of the text, including the (linguistic) relations between the
parts. This might be interpreted as going beyond the purposes and
possibilities of SGML and the guidelines. Nevertheless, it is a topic
that must be considered if our tagging enterprise is going to be of
any future use. This section will consider what the tags and
structures 'mean' in view of using the information they contain for
eg. making new dictionaries, extracting the information for some
computational application, etc.
 
SCOPE. It is clear that factored information in dictionary entries is
intended to be applied over nested sub-entries. Therefore we need to
introduce the notion of the scope of information over nested
structures, which parallels the way in which variables have scope
within nested procedures in block-structured computer languages.
 
OVERRIDES. However, as discussed above in section 2.3, information
specific to a given sub-entry often overrides the scoped information.
Again, this is as in block-structured languages, where a local
variable overrides a global one with the same name.
 
DEFAULTS. There is much default information in dictionaries, often
specified in the front matter of the dictionary (e.g., "unless
otherwise specified, a verb is transitive..."). We need a mechanism to
specify default information in the tagging, in tags at the outermost
level of the dictionary (incidentally, do we need a DICT tag?). We did
not deal with this at the meeting. The semantics would specify that a
default applies whenever the information is not given explicitly.
 
To make this clearer, consider again the change of category example
discussed above under issues of scope and defaults and assume the
following structure assigned to a given entry. (These are
oversimplified examples to illustrate the point.)
 
   <ENTRY>
      <SE>
 
        <POS> verb </POS>
        <SE>  ... </SE>
        <SE>  ... </SE>
      </SE>
 
      <SE>
        <POS> adj </POS>
        <SE> ... </SE>
        ...
      </SE>
      ...
 
In this example, the POS "verb" has scope over the sub-entries nested
inside the first SE group, and "adj" applies to those in the second SE
group.  Without a scoping mechanism, the POS tag would have to be
repeated under each SE to make it explicit which part of speech
applies.  The point here is that we assume a 'meaning' of the
structure that is not given by the representation.
 
Note that in the example above, assuming that the default for verbs is
transitive, we can also assume this piece of information for the
sub-entries in the first SE group.
 
Consider a second example:
 
   <ENTRY>
      <SE>
 
        <POS> verb </POS>
        <SE>  ... </SE>
        <SE>
          <POS> adj </POS>
           ...
        </SE>
        <SE> ... </SE>
        ...
      </SE>
 
In this example, the first and third nested SE's have part of speech
"verb" according to the scope rules. However, the second SE has part
of speech "adj",  since the respecification of this information
overrides that given at the outer level.
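
The scope, override, and default rules described above can be
sketched as a small program. The following Python is only an
illustration under stated assumptions: the SE class, the resolve
function, the feature names "pos" and "subcat", and the
DEFAULTS_BY_POS table are all hypothetical, and are not part of the
guidelines or of any proposal made at the meeting.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical dictionary-wide defaults, as might be stated in the front
# matter ("unless otherwise specified, a verb is transitive").
DEFAULTS_BY_POS = {"verb": {"subcat": "transitive"}}


@dataclass
class SE:
    """One (sub-)entry: the features stated on it, plus nested SEs."""
    features: Dict[str, str] = field(default_factory=dict)
    children: List["SE"] = field(default_factory=list)


def resolve(se, inherited=None):
    """Return a tree giving the effective features of every SE.

    Scope:    features stated on an SE apply to everything nested in it.
    Override: a feature restated on an inner SE replaces the outer value.
    Defaults: filled in last, only for features still unspecified.
    """
    # Inner values win over values inherited from the enclosing SE.
    explicit = {**(inherited or {}), **se.features}
    effective = dict(explicit)
    for key, value in DEFAULTS_BY_POS.get(explicit.get("pos"), {}).items():
        effective.setdefault(key, value)
    # Only explicit information is passed down, so that a default for
    # "verb" does not leak into a nested SE that overrides the POS.
    return {
        "features": effective,
        "children": [resolve(child, explicit) for child in se.children],
    }


# The second example above: an outer "verb" with one inner "adj" override.
entry = SE({"pos": "verb"}, [SE(), SE({"pos": "adj"}), SE()])
for child in resolve(entry)["children"]:
    print(child["features"])
# {'pos': 'verb', 'subcat': 'transitive'}
# {'pos': 'adj'}
# {'pos': 'verb', 'subcat': 'transitive'}
```

Passing only the explicitly stated features down the tree is one
possible design choice; whether defaults should themselves be
inherited is exactly the kind of question a semantics for the
notation would have to settle.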
 
Printed dictionary entries are structured implicitly using these rules
of scope and overriding, and so the rules seem essential to our
encoding scheme. Furthermore, a database representation is likely to use a
similar factoring to avoid redundancy. Although SGML cannot specify
the semantics, it is assumed that the tagged content of dictionary
entries will be disambiguated and accessible by a program. The
processing issue is also addressed in section 7.4.4 of the
guidelines regarding the use of a nested or flat structure to
represent senses.  Though the example given above will also have
implications for processing, it is meant, first of all, to illustrate
the problem of interpreting a given structure.  We would like to
suggest that this issue be addressed in the guidelines.  One
possibility would be to suggest an additional document (or point to
existing ones) which gives a semantics to the adopted notation.
 
6.2 Concatenation vs. override
 
The SN tag presents a problem in terms of the overriding described
above. For example, consider the following:
 
   <ENTRY>
      <SE>
         <SN>I</SN>
 
         <SE>
           <SN>1</SN>
           <SE>
              <SN>a</SN>  ... </SE>
           ...
           </SE>
        ...
      </SE>
 
The information within the SN tags in embedded SE's does not override
that at the outer level; rather, it should be concatenated to it. For
example, the SN for the innermost SE is actually I.1.a. How should we
handle this? One solution is to use an explicit numbering schema
(e.g., <SN>=I.1.a</SN>); another is to give a "level" indicator
based on the implicit hierarchy as an attribute on SE (e.g.,
<SE lev=2>) and let a processor generate the numbers. This topic is in
need of further work.
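
The second alternative, in which a processor generates the full
numbers, can be sketched as follows. This is a minimal Python
illustration, not a proposal: the NumberedSE class and the
full_sense_numbers function are hypothetical names, and the "." used
as separator is an assumption taken from the I.1.a example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NumberedSE:
    """A sub-entry with its local SN (if any) and nested sub-entries."""
    sn: Optional[str] = None
    children: List["NumberedSE"] = field(default_factory=list)


def full_sense_numbers(se, prefix=()):
    """Yield (full_number, level) for every SE that carries an SN.

    Unlike POS, an inner SN does not override the outer one: it is
    concatenated to it. The level is the depth in the SN hierarchy,
    i.e. what a 'lev' attribute on SE might record.
    """
    path = prefix + ((se.sn,) if se.sn is not None else ())
    if se.sn is not None:
        yield ".".join(path), len(path)
    for child in se.children:
        yield from full_sense_numbers(child, path)


# The structure shown above, with a few sibling senses added.
entry = NumberedSE("I", [
    NumberedSE("1", [NumberedSE("a"), NumberedSE("b")]),
    NumberedSE("2"),
])
for number, level in full_sense_numbers(entry):
    print(number, level)
# I 1
# I.1 2
# I.1.a 3
# I.1.b 3
# I.2 2
```

An SE without an SN (such as a top-level grouping SE) contributes
nothing to the number, so its sub-entries are numbered as if it were
not there.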
 
 
7.  Outstanding issues
 
The following is a list of issues that have not been adequately
addressed within the dictionary print group. They will need detailed
attention and may imply further changes to the proposal for the
dictionary section of the guidelines.
 
(1) close consideration of the 'type' attribute on the FORM group. We
determined that there should be a type attribute which takes values
such as "compound", "variant", etc., but it was not clear if this
should be allowed to appear only on the FORM tag or on an ORTH or a
PRON tag, etc.
 
(2) representation of variants (alternative forms, alternative
explanations, definitions and translations, etc.)
 
(3) representation of meta language (words and symbols used throughout
the text; these may be explicit field separators or commentary within
a portion of text)
 
(4) attributes (each of the simple and group tags should have a
defined list of attributes with possible values, distinguishing finite
from non-finite sets. We started this and made some tentative lists)
 
(5) determination of structural relations among tags, and optional and
obligatory tags (DTD development)
 
8. Conclusion
 
This paper is a summary of the numerous topics addressed during the
two-day meeting at Pisa.  It is meant as a discussion document;
comments are welcome on any of the issues raised here.  It is hoped
that the proposed tags and structures provide enough material for
other members to test them on some printed entries, taking into
account the discussion sections.
 
8.1 Sample tagged entries
 
Here is a simple entry (bold indicated in capitals, italics surrounded
by single quotes) with a tagged version, followed by some discussion
(keyed to the line numbers in the tagged version).
 
CANARY (k@'n#@ri) 'n., pl'. -NARIES. 1. a small finch of the Canary
Islands and Azores: a popular cagebird noted for its singing. 2.
CANARY YELLOW. A. a light yellow. B. ('as adj'): 'a canary-yellow
car'. 3. a sweet wine similar to Madeira. 4. 'Arch'. a sweet wine from
the Canary Islands. [C16: < OSp. 'canario' of ...]
 
<ENTRY type = main id=.. key=  >
1    <SE>
2       <FORM>
3          <ORTH>canary</ORTH>
4          <PRON type=..>k@'n#@ri</PRON>
5          <GRAM> <POS>n</POS> </GRAM>
6       </FORM>
7       <FORM>
8          <ORTH>canaries</ORTH>
9          <PFORM>-naries</PFORM>
10         <GRAM> <NUMBER>pl</NUMBER> </GRAM>
11      </FORM>
12      <SE> <SN>1</SN>
13         <DESCRIP> <TEXT>a small finch of the Canary Islands and
14            Azores: a popular cagebird noted for its singing </TEXT>
           </DESCRIP>
15      </SE>
16      <SE> <SN>2</SN>
17         <SE> <SN>a</SN>
18            <FORM>
19              <ORTH>canary yellow</ORTH>
20            </FORM>
21            <DESCRIP> <TEXT>a light yellow</TEXT> </DESCRIP>
22         </SE>
23         <SE> <SN>b</SN>
24            <USAGE type=gram>as adj</USAGE>
25            <EX>a canary-yellow car</EX>
26         </SE>
27      </SE>
28      <SE> <SN>3</SN>
29            <DESCRIP><TEXT>a sweet wine similar to Madeira </TEXT>
              </DESCRIP>
30      </SE>
31      <SE> <SN>4</SN>
32            <USAGE>Arch</USAGE>
33            <DESCRIP><TEXT>a sweet wine from the Canary Islands
                 </TEXT> </DESCRIP>
34      </SE>
35      <ETYM> ...</ETYM>
36    </SE>
</ENTRY>
 
1. The top level sub-entry tag is included to indicate the scope of
the following form features over the sub-entries.
 
5.  Gives only 'pos' and not number; compare to 10. which only gives
'number' and not 'pos'.  (By convention we know that the first form is
singular.)
 
16. This subentry introduces a new 'headword' (or extension), i.e.
canary yellow. The 'pos' given above is still implied for 2a, but not
for 2b.
 
24. The change in 'pos' is given as a note; an 'intelligent' program
might know that this is a new value for gram, but the syntax is that
of a usage note. We have not represented the ':' following "(as adj)";
is it significant?
 
25.  This is a sample phrase of 'canary yellow', not of the top form
'canary'. Perhaps 'canary-yellow' should be tagged as EGWORD, with
the hyphen treated like an inflection.
 
28. This sense refers back to the top-level 'form', i.e. canary and
thus inherits everything from 'form' and 'gram'; same for the last
sense.
 
35. It is not clear whether the etymological information actually
applies to the entire entry, or merely to the headword 'canary'.
 
Here is a simple bilingual entry (uppercase for bold, single quotes
for italic).
 
CANARY [k#'n@#ri] 1 'n' (A) ('bird') canari 'm', serin 'm'. (B)
('wine') vin 'm' des Canaries. 2 'cpd' ('also' CANARY YELLOW) ('de
couleur') jaune serin 'inv', jaune canari 'inv'. ('Bot') CANARY GRASS
alpiste 'm'; ('Geog') CANARY ISLANDS 'or' ISLES, CANARIES (iles 'fpl')
Canaries 'fpl'; ('Bot') CANARY SEED millet 'm'.
 
<ENTRY>
1   <SE>
2     <FORM> <ORTH>canary</ORTH> </FORM>
3     <SE> <SN>1</SN>
4         <GRAM> <POS>n</POS> </GRAM>
5         <SE> <SN>a</SN>
6               <TRANSL>
7                 <SOURCE> <SEM_IND>bird</SEM_IND> </SOURCE>
8                 <TARGET> <TEXT>canari</TEXT>
9                        <GRAM> <GENDER>m</GENDER> </GRAM> </TARGET>
10                <TARGET>
11                       <TEXT>serin</TEXT>
12                       <GRAM> <GENDER>m</GENDER> </GRAM> </TARGET>
13              </TRANSL>
14        </SE>
15        <SE> <SN>b</SN>
16              <TRANSL>
17                <SOURCE> <SEM_IND>wine</SEM_IND> </SOURCE>
18                <TARGET> <TEXT>vin</TEXT>
19                         <GRAM> <GENDER>m</GENDER></GRAM>
20                         <TEXT>des Canaries</TEXT> </TARGET>
21              </TRANSL>
22         </SE>
23     </SE>
24     <SE> <SN>2</SN>
25        <GRAM> <POS>cpd</POS> </GRAM>
26              <TRANSL>
27                <SOURCE> <FORM> <ORTH>canary yellow</ORTH>
28                         <PFORM> <IT>also</IT><B>canary
                                   yellow</B></PFORM></FORM>
29                   <SEM_IND>de couleur</SEM_IND> </SOURCE>
30                <TARGET> <TEXT>jaune serin</TEXT>
31                         <GRAM> <INFL>inv</INFL> </GRAM> </TARGET>
32                <TARGET> <TEXT>jaune canari</TEXT>
33                         <GRAM> <INFL>inv</INFL> </GRAM> </TARGET>
34              </TRANSL>
35              <TRANSL>
36                <SOURCE> <SEM_IND>Bot</SEM_IND>
37                         <FORM> <ORTH>canary grass</ORTH> </FORM> </SOURCE>
38                <TARGET> <TEXT>alpiste</TEXT>
39                         <GRAM> <GENDER>m</GENDER> </GRAM> </TARGET>
40              </TRANSL>
41              <TRANSL>
42                <SOURCE> <SEM_IND>Geog</SEM_IND>
43                         <FORM> <ORTH>Canary Islands</ORTH></FORM>
44                         <FORM> <ORTH>Isles, Canaries</ORTH>
45                                <NOTE>isles, fpl</NOTE> </FORM>
                  </SOURCE>
46                <TARGET> <TEXT>Canaries</TEXT>
47                         <GRAM> <GENDER>f</GENDER> <NUM>pl</NUM>
                           </GRAM> </TARGET>
48              </TRANSL>
49              <TRANSL>
50                <SOURCE> <SEM_IND>Bot</SEM_IND>
51                         <FORM> <ORTH>canary seed</ORTH> </FORM>
                  </SOURCE>
52                <TARGET> <TEXT>millet</TEXT>
53                         <GRAM> <GENDER>m</GENDER> </GRAM> </TARGET>
54              </TRANSL>
55      </SE>
56   </SE>
</ENTRY>
 
The entire entry is grouped under a subentry to indicate the scope of
the form field (as above, this level may be found unnecessary in this
case).
 
7. The use of a semantic indicator seems justified here; one cannot
really say that 'bird' and 'wine' (17.) are usage notes. The
indicators in the compound section, however, are usage notes (as
indicated by the capitalization of the words).
 
8. and 10. The two translations are each grouped in a separate
translation field. This is not enough to distinguish translations
separated by a comma (as in this case) from those separated, in other
cases, by a semicolon.
 
18.-20. This is not a very satisfactory representation of the phrase
and the grammatical information associated with one word in the phrase.
 
25. It is not clear that 'compound' should be considered a POS, though
it has a similar status.
 
27. The phrase canary yellow is identified here as the ORTH; perhaps
this should be PHRASE or CPD (varying types of expressions are
translated in bilingual entries).
 
28. Note the use of PFORM to save the phrase (not clear what 'also'
means here). The marking of italic and bold may not be standard.
 
30.-34. The different translations have been separated as in the
previous subentry.
 
A new tag is introduced here, i.e. INFL for inflection. The value
'inv' is to indicate that the phrase is invariable (does not agree in
number or gender with the noun it is modifying). Alternatively, this
could be just a note in GRAM.
 
43.-44. Not clear how to represent this; compare with 28.
 
As these examples demonstrate, there are still many problems in
tagging entries.