Date: Mon, 17 Feb 92 18:36 EDT
From: "NANCY M. IDE (914) 437 5988" <IDE@vassar.bitnet>
Subject: minutes of mitre
To: u35395@UICVM.BITNET
 
From:   IN%"amsler@starbase.MITRE.ORG"  9-FEB-1992 13:19:40.49
To:  Carol.Van.Ess.Dykema@nl.cs.cmu.edu,
GLOTTOLO%ICNUCEVM.CNUCE.CNR.IT@CUNYVM.CUNY.EDU, amsler@starbase.mitre.org,
fwtompa@waterloo.edu, ide@vassar.edu, ingria@bbn.com,
jjohn@apollo.lap.upenn.edu, louise@nmsu.edu, warwick%cui.unige.ch@relay.cs.net
CC:
Subj:   Emergency Sub-Group PDWG meeting
 

Date: Sun, 9 Feb 92 13:10:42 EST
From: amsler@starbase.MITRE.ORG
Subject: Emergency Sub-Group PDWG meeting
To: Carol.Van.Ess.Dykema@nl.cs.cmu.edu,
GLOTTOLO%ICNUCEVM.CNUCE.CNR.IT@CUNYVM.CUNY.EDU, amsler@starbase.mitre.org,
fwtompa@waterloo.edu, ide@vassar.edu, ingria@bbn.com,
jjohn@apollo.lap.upenn.edu, louise@nmsu.edu, warwick%cui.unige.ch@relay.cs.net
Message-id: <9202091810.AA09860@starbase.mitre.org>
 
This is a quick rundown on what went on, presented to solicit
(1) feedback from the people who were there
(2) opinions and reminders from those who weren't, when/where we
    may have undone things that seemed to be agreed to previously
    or forgotten things we should have remembered.
 
The intent is to send Michael Sperberg-McQueen and Lou Burnard some
report on this meeting ASAP, especially since we are petitioning for
a two-week grace period in which to rewrite materials for the P2 draft.
 
====
 
The meeting was held in McLean, Virginia, at MITRE Corp.'s Hayes
Building, 7525 Colshire Drive; it started at 9 AM and ran until
4 PM. Nancy Ide, Jean Veronis, John Fought, Carol Van Ess Dykema
and I, Robert Amsler, were in attendance.
 
The agenda was to discuss the state of the AI5 standard write-up and
the materials distributed by MSM. We quickly reached a consensus that
there were essentially three levels of standard being described.
The first was the "base tag" level. The second was the allowance of
"nesting" of base tags inside specific labeled tags. The third was
the "Grouping Tag" discussion, and in particular the idea of a generic
group tag which could bundle almost arbitrary subcollections of nesting
and base tags together for the purpose of making explicit the semantics
of inheritance of information.
 
We decided that the concept of the grouping tag had not been well explained
so far and that this was causing difficulty. Nancy Ide agreed to supply
a new description of the grouping tag to help alleviate this problem.
 
We then went about creating a new outline of the AI5 guidelines, since
none of us could determine what the guidelines actually looked like from
the fragmentary materials provided by MSM (Sorry, Michael, we couldn't
read the stuff well enough to know what the whole thing looked like).
 
The remainder of this message details the discussion of the standard
as we recalled it. Before that, however, our conclusion was that we'd
like two weeks in which to prepare and submit new prose and tag lists
for consideration. The new tag list is being prepared by Fought and
Van Ess Dykema, the new prose description by Nancy Ide. These will be
quickly distributed to the rest of the AI5 committee for discussion
and sent on to the editors in two weeks.
 
------
[[Comment by Amsler: Please supplement and correct the following ]]
 
The new outline was as follows:
 
Three MAIN sections,
 
I. Context/Introduction
II. Tags
III. Examples & DTD
 
In I.
  1. Introduction
  2. Issues in Dictionary Encoding
     2.1 Multiple Goals of Encoding
         (to cover the goals of being faithful to the "printed form"
          vs. being faithful to the "logical structure" of the
          material; recording "explicit" vs. "implicit" information;
          and recording "literal" vs. "interpreted" information)
          [Some concern here was expressed that too much had been
          deleted from the EUROLEX paper materials by LB such that
          there were serious problems in explaining that the intent
          of the person creating a representation of a print
          dictionary would force choices between these options that
          could not be reconciled in one encoding.]
     2.2 Multiple Levels of Encoding ("base", "nested" and
          "grouped" for semantic inheritance).
 
     A matrix was developed to explain these two axes, which
     may be visualized as follows:
 
                      ||             LEVELS
                      ||
                      ||    base tags  |  nested tags | "group" tag
            ==========================================================
             faithful ||               |              |  <not possible>
    G        to the   ||               |              |
    O        printed  ||               |              |
    A        book     ||               |              |
    L       ----------------------------------------------------------
    S      "augmented"||               |              |
                      ||               |              |
            ----------------------------------------------------------
           "altered"  ||               |              |
                      ||               |              |
            ----------------------------------------------------------
 
The vertical dimension "can" be contracted into only two levels, with
"augmented" and "altered" being considered the same. The idea was that
some changes to the dictionary, such as normalizing the abbreviation
used for "feminine" if the dictionary used all of "f.", "fem." and
"feminine" and expanding the elliptical forms of things like
"talk, -ed, -ing" to "talk, talked, talking" are really just
"augmentations" rather than alterations, whereas others such as
parceling out the British vs. American pronunciations between senses
which are listed as US only or Brit. only, when the pronunciations
were given once at the top as both applying to the entry, may be
considered "altering". (It was noted that SOME changes, such as
eliminating end-of-line hyphens, are not considered either
"augmentations" or "alterations", since these are artifacts of the
typographer's formatting rather than any content-based changes.)
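
To make the two "augmentation" examples concrete, here is a minimal
Python sketch of normalizing the printed gender abbreviations and of
expanding elliptical inflected forms. The mapping table and the function
names are hypothetical, not anything agreed at the meeting, and the
expansion simply appends each suffix to the first form, an assumption
that fails wherever the stem changes (e.g. doubled consonants).

```python
# Hypothetical table: printed gender labels mapped to one canonical form.
GENDER_CANON = {"f.": "f", "fem.": "f", "feminine": "f"}


def normalize_gender(printed):
    """Return the canonical label, or the printed text if it is not in the table."""
    return GENDER_CANON.get(printed.strip().lower(), printed)


def expand_elliptical(forms):
    """Expand 'talk, -ed, -ing' by appending each '-suffix' to the first form.

    Assumption: plain concatenation is enough; stem changes are not handled.
    """
    parts = [p.strip() for p in forms.split(",")]
    stem = parts[0]
    expanded = [stem + p[1:] if p.startswith("-") else p for p in parts]
    return ", ".join(expanded)


print(normalize_gender("fem."))
print(expand_elliptical("talk, -ed, -ing"))
```

Labels that are not in the table are passed through unchanged, so an
encoder can see what was left unnormalized.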
 
A discussion here about which of the following forms were allowed
was enlightening.
 
If value1 is the form directly printed in the dictionary, e.g. "fem."
or "fem" or "feminine" or "f" or "F", etc., and value2 is some
canonical form desired by the encoder, e.g. "f", then the options for
how to record this information include the following six cases:
 
 Legal
 or not (X)
 
  X    1. <tag att=value2></tag>        (no data, original lost)
  X    2. <tag att=value1></tag>        (no data, original preserved)
 
[[Comment by RA: Numbering on value in cases 1  and 2 reversed to be
 consistent with other cases]]
 
  X    3. <tag><att>value2</att></tag>  (only changed data kept)
       4. <tag><att>value1</att></tag>  (only original data kept)
 
       5. <tag att=value2><att>value1</att></tag> (both kept)
       6. <tag att=value1><att>value2</att></tag> (both kept)
 
Thus, cases 4, 5 and 6 were considered to be the ones we wanted to endorse.
This is a bit prescriptive, since it is quite possible in a preface to say
explicitly what one has done, but we felt 1, 2 and 3 were less appropriate.
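
One way to read the six cases is that an encoding was endorsed only when
the tag has data inside it AND the printed form survives somewhere
(either as the data or as the attribute). That rule is an inference from
the list, not something stated at the meeting; the Python sketch below,
with hypothetical names, only checks that it reproduces the marks above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Encoding:
    attribute: Optional[str]  # value of att=... on <tag>, or None if absent
    content: Optional[str]    # text inside <att>...</att>, or None if no data


def is_endorsed(enc, original):
    """Inferred rule: there must be data, and the printed form must be kept."""
    has_data = enc.content is not None
    keeps_original = original in (enc.attribute, enc.content)
    return has_data and keeps_original


ORIGINAL, CANONICAL = "fem.", "f"  # value1 and value2 in the list of cases

CASES = {
    1: Encoding(CANONICAL, None),
    2: Encoding(ORIGINAL, None),
    3: Encoding(None, CANONICAL),
    4: Encoding(None, ORIGINAL),
    5: Encoding(CANONICAL, ORIGINAL),
    6: Encoding(ORIGINAL, CANONICAL),
}

endorsed = [n for n, enc in CASES.items() if is_endorsed(enc, ORIGINAL)]
print(endorsed)
```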
 
The outline continued with Section II, Tags, which has three parts:
   3. Base Tags for a Dictionary Entry
   4. Nesting Tags for a Dictionary Entry
   5. The Abstract Group Tag for a Dictionary Entry
 
Section III contained two parts:
 
   6. Example Entries Tagged with Base, Nesting and Group tags
         - monolingual, bilingual
   7. DTD
 
----
 The production of this outline consumed most of the morning session.
We broke for lunch in the MITRE cafeteria, resuming shortly after 1 PM.
 
The afternoon session deepened the discussions of the topics of the
morning, but primarily focused on an effort to enumerate the base
and nesting tags.
 
The base tag list was found to contain (with some implicit ordering
for order of appearance in the entry):
 
 <hom>
 <pform> (back again after another debate on whether it is needed.
          The pform tag is basically for "monstrosities" in entries
          which present intermingled syllable, stress, pronunciation
          and/or orthographic forms, with things like centered dots
          used for syllables in the recording of headwords. This may
          mean pform is limited to the "faithful to printed book"
          row of the matrix.)
 <orth>
 <hwd> (headword, used to note the single word selected for the
        basis for the alphabetical order in which the entry appears,
        though this "could" be an attribute on the particular "orth")
 
[[Comment by RA: This latter option sounds better to me, i.e. an hwd is
merely an orth with the attribute type=hwd]]
 
 <pron>
 <sn>  (sense number)
 <text>
 <usage>
 <gender>
 <number> (e.g. singular/plural)
 <case>
 <pos> (part of speech)
 <egwd>
 <note>
 <xref>
 
[[Comment by RA: It strikes me we did nothing about morphology (i.e.
syllables), pronunciation (stress), or synonyms/antonyms]]
 
These "base" tags were nested inside the next level of tags, in many
cases such that some of the nesting tags could hardly be left out if
their base tags were used--or conversely, some of the nesting tags
were so essential that it was hard to imagine not recording them if
the base tags were used. This leads to a bit of a dilemma, in that
nesting seems to be a variable thing: <etym>, for example, seems quite
likely to be wanted, whereas the base tags inside it are a bit more
problematic to encode. <egwd> and <eg> seem always to be used together;
or else would <eg> be a "base" tag?
 
The nesting tags are:
 
<form>, which contains <pron>, <orth> and <pform>
<etym>, which contains the etymology
<desc>, which contains <text>
<gram>, which contains <gender>, <number>, <case> and <pos>
<sense>, which contains <sn>, <desc>, <usage>, <gram> as well
        as the base tags which <gram> contains (i.e. <gender>,
        <number>, <case> and <pos>).
<entry>, which contains all of the above and itself.
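
The list above can be written down as a containment table and checked
mechanically. The Python sketch below is only an illustration: the
function name is hypothetical, <etym> is given no children because none
were listed, and "all of the above" is assumed to mean every nesting tag
plus every tag those nesting tags contain.

```python
import xml.etree.ElementTree as ET

# Containment table transcribed from the list of nesting tags.
CONTAINS = {
    "form": {"pron", "orth", "pform"},
    "etym": set(),  # assumption: no base tags were listed for <etym>
    "desc": {"text"},
    "gram": {"gender", "number", "case", "pos"},
    "sense": {"sn", "desc", "usage", "gram",
              "gender", "number", "case", "pos"},
}
# Assumption: <entry> may hold every nesting tag, their contents, and itself.
CONTAINS["entry"] = set(CONTAINS) | set().union(*CONTAINS.values()) | {"entry"}


def violations(element):
    """Return (parent, child) tag pairs that the table does not allow."""
    found = []
    for child in element:
        if child.tag not in CONTAINS.get(element.tag, set()):
            found.append((element.tag, child.tag))
        found.extend(violations(child))
    return found


good = ET.fromstring(
    "<entry><form><orth>talk</orth></form>"
    "<sense><sn>1</sn><desc><text>to speak</text></desc></sense></entry>"
)
bad = ET.fromstring("<gram><form><orth>x</orth></form></gram>")

print(violations(good))
print(violations(bad))
```

Under this table a <form> inside a <sense> is reported as a violation,
which is exactly the kind of case the group tag is meant to cover.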
 
 
[[Comment by RA: The difficulty of being unable to specify the
attributes of a nesting tag seems to create a need for some special
forms of some tags, such as <main-entry> to be the unique, single
<entry> tag which contains all the material selected to appear at
one alphabetic position in the dictionary. ??]]
 
The meeting concluded with a new discussion of how the <group> tag was
actually intended to cover the strangeness of the <sense> tag
seemingly also being required to contain <form> at some times as well
as even containing other <entry> tags, whereas <gram> and <form> may
not contain each other.
 
The tree structure of
 
                                  SNS (i.e. Sense, a.k.a. "group")
                                   |
                   ---------------------------------
                   |   |     |     |    |     |    |
                  eg  etym  form  SNS  gram  desc xref
 
was offered, but not conclusively agreed to.
 
Another comment was that what we REALLY wanted was a 95% DTD for the
canonical dictionary entry and then a 5% DTD which allowed basically
everything else to happen. The assumption would be that the user used
the 95% "orthodox" DTD most of the time, but then occasionally noted
that some entries didn't fit that form and then used the "liberal"
"anything goes" DTD for the remaining entries. This reflects the
experience of the OUP folks with the OED encoding, which left them
feeling that there really was some sense of a real DTD in the OED, but
that some entries violated it, making it impossible to state one DTD
with any real character for the whole book.
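
The two-DTD idea amounts to trying the "orthodox" form first and falling
back to the "liberal" one only for entries that do not fit. A minimal
Python sketch follows; the tag set used as the orthodox check, the
function names and the sample entries are all hypothetical stand-ins for
real DTD validation.

```python
# Hypothetical stand-in for the "orthodox" DTD: the top-level tags it allows.
ORTHODOX_TAGS = {"form", "gram", "sense", "etym"}


def fits_orthodox(top_level_tags):
    """True if every top-level tag of the entry is allowed by the strict form."""
    return set(top_level_tags) <= ORTHODOX_TAGS


def assign_dtd(entries):
    """Map each headword to 'orthodox' if it fits, otherwise to 'liberal'."""
    return {
        headword: "orthodox" if fits_orthodox(tags) else "liberal"
        for headword, tags in entries.items()
    }


entries = {
    "talk": ["form", "gram", "sense"],
    "odd": ["form", "sense", "entry", "xref"],
}
assignment = assign_dtd(entries)
print(assignment)
```

Counting how many entries land in each group would show how close a
given dictionary comes to the 95%/5% split.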
 
Another thought which seemed important was the comment made by Jean
Veronis that we should maintain a collection of special examples that
illustrate the aberrations of dictionary encodings. In Lou Burnard's words,
a "zoo" of strange (lexical) creatures whose entries lead to all these
fine distinctions.  I like this idea a lot. It would seem to require some
means of classification of the entries to reveal why the contributor feels
that they are unusual, but perhaps that can be handled in comments initially.