Minutes of the Meeting
 
              Of the Analysis and Interpretation Committee
 
               Held at the University of Arizona, Tucson
 
                           March 12-14, 1990
 
 
                              Lou Burnard
                         D. Terence Langendoen
 
                       Document Number:  TEI AIM3
 
                             23 March 1990
 
Present:  D. Terence Langendoen, chair (TL); Robert Amsler (RA); Lou
Burnard (LB); Nicoletta Calzolari (NC); Robert Ingria (RI); Mitch Marcus
(MM); William Poser (WP); Beatrice Santorini (BS); Gary Simons (GS);
Michael Sperberg-McQueen (MSM)
 
                        Final, November 14, 1997
 
 
 
 
                           1.  SYNTAX REPORT
 
   MM circulated and talked through AI W6.  He had begun by considering
ways in which the immense range of syntactic representations currently
used could be reformulated as directed acyclic graphs (DAGs), each node
being potentially represented as a set of feature-value pairs.  The
meeting spent some time discussing the basic ideas, expressed in MM's
paper as two BNF grammars, and in exploring formalisms which might prove
difficult to represent in this way (examples included parse forests,
words absent from the input stream, ordering and other rendering varia-
tions).  It was agreed that the model should be tested with a variety of
formalisms, and AI W18 was circulated as an example of the range.  It
was agreed that, although probably inadequate for the direct representa-
tion of the linguistic theories underlying some formalisms, SGML could
successfully model their functionality as presented typographically, and
would thus be of immense value in interchange.
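
   As an illustration only (it is not taken from AI W6), the idea of a
directed acyclic graph whose nodes are sets of feature-value pairs can be
sketched in Python.  The class name, the feature names and the example
sentence fragment are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DAG node described by feature-value pairs (names are hypothetical)."""
    features: dict
    children: list = field(default_factory=list)


def is_acyclic(root):
    """Return True if no node below root can be reached from itself."""
    visiting, done = set(), set()

    def visit(node):
        if id(node) in done:
            return True
        if id(node) in visiting:      # we came back to a node on the current path
            return False
        visiting.add(id(node))
        ok = all(visit(child) for child in node.children)
        visiting.discard(id(node))
        done.add(id(node))
        return ok

    return visit(root)


# A node may have two parents (shared structure), which a tree cannot express.
noun = Node({"cat": "N", "form": "man"})
det = Node({"cat": "Det", "form": "the"})
np = Node({"cat": "NP", "number": "sg"}, [det, noun])
s = Node({"cat": "S"}, [np, noun])    # noun is shared by np and s
```

Here `is_acyclic(s)` holds even though `noun` is reachable by two paths;
only a path leading from a node back to itself would be rejected.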
 
   To demonstrate this, MM also presented a BNF grammar for tree struc-
tures, which MSM then reformulated in SGML (AI W6A).  TL asked how this
is to be related to the feature-structure model.  The tree model pro-
vides special-purpose tags for "tree geometry", but the entire power of
the feature-structure apparatus is available for specifying the internal
structures of the nodes of the tree.  The problems raised by PP-
attachment were discussed (e.g. the ability to represent all the rela-
tions between the constituents of such sentences as "I saw the man in
the park with the telescope").  Ken Church in his MIT Master's thesis
originally proposed a compact method of representing these relations,
the number of which grows exponentially with the number of prepositional
phrases.  The mechanism used in the PLNLP English Grammar developed by
Karen Jensen at IBM for representing them was discussed, and it was
agreed that the tree-markup specification would provide for a compact
method of representing these relations.
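
   The growth in the number of readings can be illustrated by a small
count.  The sketch assumes, as a simplification that is not taken from
Church's thesis or from PLNLP, that each new prepositional phrase may
attach to any constituent still open on the right edge of the structure,
starting with the verb and its object.  The function name is hypothetical.

```python
def count_attachments(n_pps, open_sites=2):
    """Count the attachment patterns for n_pps trailing prepositional phrases.

    open_sites is the number of constituents on the right edge that can
    still take a PP; initially the verb and its object noun phrase.
    """
    if n_pps == 0:
        return 1
    total = 0
    for site in range(open_sites):
        # Attaching at `site` closes every deeper site and opens one new
        # site: the noun phrase inside the PP just attached.
        total += count_attachments(n_pps - 1, site + 2)
    return total


# One PP gives 2 readings, two PPs give 5, three give 14, four give 42.
counts = [count_attachments(n) for n in range(1, 5)]
```

Under this assumption the example sentence, which has two prepositional
phrases, receives five readings, which is why a compact representation
is preferable to listing each tree separately.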
 
   Returning to the problem of the feature-structure specifications, WP
pointed out that there was a need to distinguish e.g. disjunction from
conjunction.  TL noted the similar need to distinguish the contextually
correct or only plausible reading.  Some pointers are obligatorily co-
referential (e.g. "John_i saw himself_i"); others ambiguous (e.g.
"John_i saw him_i/j"); yet others disjunct (e.g. "John_i saw him_j/=i").
To deal with these and other features of lexical functional grammar, the
notion of restrictional clauses handling boolean (AND, OR, and NOT)
expressions was introduced.
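
   A minimal sketch of how boolean restrictions over feature values
might be evaluated is given here.  The tuple encoding of the expressions
and the feature name "index" are assumptions made for illustration; the
meeting did not fix a notation.

```python
def holds(expr, fs):
    """Evaluate a restriction against a flat feature structure (a dict).

    expr is ("AND", e1, e2, ...), ("OR", e1, e2, ...), ("NOT", e), or a
    plain (feature, value) pair.  This encoding is hypothetical.
    """
    op = expr[0]
    if op == "AND":
        return all(holds(e, fs) for e in expr[1:])
    if op == "OR":
        return any(holds(e, fs) for e in expr[1:])
    if op == "NOT":
        return not holds(expr[1], fs)
    feature, value = expr
    return fs.get(feature) == value


# "John_i saw him_j/=i": the pronoun's index must not be i.
disjunct = ("NOT", ("index", "i"))
# "John_i saw him_i/j": the pronoun's index may be i or j.
ambiguous = ("OR", ("index", "i"), ("index", "j"))

him = {"cat": "Pron", "index": "j"}
```

With these definitions `holds(disjunct, him)` and `holds(ambiguous, him)`
are both true, while a pronoun indexed "i" fails the disjunct restriction.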
 
   Some concern was voiced that the representation chosen should not be
anchored in presentation only:  MM argued that the typographic (etc)
conventions used by linguists were chosen as comprehensible ways of rep-
resenting their theories; to represent the former was a more reliable
way of unifying the latter than any inevitably doomed attempt at grand
unifying theories.
 
   MM was asked to provide a further draft expressing these ideas, with
extensions for Penelope style PP attachment rules.  This would form the
basis for section 8.1 of the Guidelines.  [A new draft was posted to
TEI-REP by BS on 20 March.]
 
 
 
 
                         2.  MORPHOLOGY REPORT
 
   GS presented a new version of AI W12, in which sections 2 to 4 had
been extensively revised.  Like MM he had concentrated on features that
were typographically marked.  The metalanguage he presented could be
recast in SGML using simple rewrite rules to convert attributes into
tags.  WP noted that attribute values and names had to be unique within
a DTD.  Discussion of the theoretical part of the paper was postponed
for the second cycle; it was regarded as a very useful basis for further
work on extensions.  After a digression on the dictionary encoding exam-
ple, discussion of morphological issues resumed with some consideration
of bracketing paradoxes and a debate about the difficulty of incorporat-
ing a theory of morphological procedures (as proposed by Stephen Ander-
son) into the encoding system.  MM asked for a set of exemplars against
which to test the proposed formalism.  NC suggested it would be better
to concentrate on the comparatively simple procedures most widely used.
It was agreed that the committee should try to formulate a set of gener-
al purpose tools, with the general purpose feature syntax proposed by MM
as its basis.
 
      GS to prepare a BNF for the interlinear gloss structure
      Due:  March 31
 
   After some discussion of the morphology of the word "ungrammaticali-
ty", it was agreed that there was a need for many analyses, possibly
orthogonal and non-unifiable, to be represented in parallel, with some
sort of "impedance match".  WP noted that this applied equally to phono-
logical analyses.  A set of sample morphological analyses, analogous to
the syntactic analyses of W18, was needed:  MM referred the meeting to a
tutorial published in ACL Survey last year by Richard Sproat which
included examples of K4 notation, and those of Anderson, Williams, Sel-
kirk, Hankamer and Keyser & Roeper.
 
      TL to review Sproat's tutorial and to produce exemplary morpholog-
      ical analyses
      Due:  15 April
 
   An attempt was made to analyse ways of presenting relations between
the parts of a text; four ways were proposed:  (1) wrap-around, (2) use
of ID/IDREF, (3) implicit positional and (4) explicit positional.  Of
these, (3) could be translated into (2) or (4), which were conceptually
identical.  (4) depended on deprecated facilities such as byte counts
while (3) depended on unreliable features such as agreed tokenization
rules.  MSM pointed out that (1), (2) or (4) required that the target be
marked, which might not always be possible (e.g. on CD-ROM), commenting
however that the Text Representation Committee had recommended that all
SGML elements should carry an explicit ID tag.  TL noted that (3) did
not involve segmentation of the input, (4) was local to the sentence and
(2) being global allowed for inter-sentence reference.
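
   The translation of an implicit positional reference (3) into an
ID-based reference (2) can be sketched as follows.  The whitespace
tokenization rule, the "w" prefix and the function names are assumptions;
the sketch also shows why (3) depends on both parties agreeing the rule.

```python
def assign_ids(text, prefix="w"):
    """Tokenize on whitespace (an assumed rule) and give each token an ID."""
    return {f"{prefix}{n}": tok for n, tok in enumerate(text.split(), start=1)}


def positional_to_idref(position, prefix="w"):
    """Turn an implicit positional reference (1-based token number) into an ID."""
    return f"{prefix}{position}"


tokens = assign_ids("I saw the man in the park")
ref = positional_to_idref(4)   # "the fourth token" becomes the ID "w4"
target = tokens[ref]           # resolves to "man" under this tokenization
```

A different tokenization rule would silently change which token the
position 4 denotes, whereas the ID, once assigned, stays attached to its
target.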
 
   MM then presented a model for alignment, initially predicated on a
unidirectional (input-output) model but expanded during discussion into
a general purpose mechanism for mapping arbitrary segmentations onto
each other in either direction.  The discussion was written up by BS as
W19; it was felt to provide an essential complement to the feature-level
syntax.  It was noted that semantically all parts of the alignment map
were equivalent; it would be up to an application to check that co-
referring pointers belonged to the same level of analysis.
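
   As a rough sketch only (AI W19 is not reproduced here), an alignment
map that can be read in either direction might look like this in Python.
The class name, the unit IDs and the segmentation of the example word
are assumptions.

```python
from collections import defaultdict


class Alignment:
    """Symmetric map between the units of two segmentations (illustrative)."""

    def __init__(self, pairs):
        self.forward = defaultdict(set)
        self.backward = defaultdict(set)
        for a, b in pairs:
            self.forward[a].add(b)
            self.backward[b].add(a)

    def targets(self, unit):
        """Units of the second segmentation aligned with `unit`."""
        return sorted(self.forward.get(unit, ()))

    def sources(self, unit):
        """Units of the first segmentation aligned with `unit`."""
        return sorted(self.backward.get(unit, ()))


# One orthographic word aligned with four hypothetical morph units.
alignment = Alignment([("w1", "m1"), ("w1", "m2"), ("w1", "m3"), ("w1", "m4")])
```

The map itself does not record which level of analysis a unit belongs
to; as noted above, checking that is left to the application.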
 
   GS presented W7 which proposed a simple "box" model for use in inter-
linear analyses.  It was agreed that this was a useful simplification
appropriate to many applications.  MSM introduced the notion of the
"pizza model" proposed by the Metalanguage and Syntax Issues Committee;
it was clear that this was a good example of one kind of "topping" which
would be much appreciated.  The theory-independent model proposed by MM,
in which the theory was regarded as content, would be found disturbing
by those who wanted to work in terms of their own ontology.  MM remarked
that SGML was too impoverished to support all possible linguistic mod-
els, and that a general approach must therefore be the committee's first
responsibility.  Discussion of the economics of software development
ensued, with RA proposing that a "Linguistica" program might some day be
developed on the basis of the general mechanism, just as "Mathematica"
was based on an agreed meta-mathematical theory.
 
   There was some discussion of the need to produce a unified list of
part of speech tags, which had been an original work item for the com-
mittee.  MM asserted that the parts of speech used in LOB and Brown were
non-unifiable, and that even if they were, no-one would agree to it.  TL
proposed some features which might be used to categorise them but agreed
that unification would be very difficult.  MM suggested this work should
be left to the second cycle.  MSM noted that theory-specific mappings
between tags used and a general theory could be stored as part of the
document header.  To simplify representation of the subtrees required by
the general feature-structure mechanism, some sort of abbreviatory con-
vention should be produced.  An ad hoc subcommittee composed of MSM, NC,
MM, BS and TL will work on this; LB suggested that the entity reference
mechanism of SGML would prove entirely adequate to the task.
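
   The effect of such an abbreviatory convention can be sketched with a
simple expansion routine.  This is an illustration of the idea and not
an SGML parser; the entity name "NN" and the <f> element used in the
replacement text are hypothetical.

```python
import re


def expand_entities(text, entities):
    """Replace each &name; with its declared replacement text.

    References to undeclared entities are left untouched.
    """
    def substitute(match):
        return entities.get(match.group(1), match.group(0))

    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", substitute, text)


# A short tag standing for a longer feature-structure fragment.
entities = {"NN": '<f name="cat">noun</f><f name="number">sg</f>'}
expanded = expand_entities("<w>&NN;</w>", entities)
```

The encoder types only the short reference, while the full feature
structure remains recoverable from the declarations.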
 
   MSM gave a short overview of the facilities and syntax provided by an
SGML DTD, covering the SGML declaration, minimization, content models,
separators and connectors, and occurrence groups.
 
 
 
 
                          3.  PHONOLOGY REPORT
 
   WP summarised the needs of phonologists as "everything is dags and
maps".  He was confident that the general mechanisms so far proposed
were adequate to the purpose.  His discussion began by considering tra-
ditional phonological concerns, concentrating on character set issues,
and the need for an IPA-like character set:  MSM noted that the Text
Representation Committee was working on this, citing an article on the
computer coding of IPA (Esling, JIPA(88)18.2).  WP believed that these
could be extended to cater for paralinguistic phenomena such as ingres-
sive air streams, but that transcription would pose serious problems; he
also discussed the problems posed by mixed transcriptions, using extend-
ed standard orthography.  Such cases should be documented in the docu-
ment header in terms of a phonetic feature set.
 
   It was noted that the current IPA standard was due for revision at
Kiel last year, but that the recommendations are not yet public; due to
time constraints, any proposals in the TEI Guidelines would therefore be
based on the current draft proposals.
 
   WP discussed a second category of problems, to do with suspicious or
partial information:  e.g. dotted letters, gaps with some indication of
length or cause, one of a small set of possibilities, robust but partial
information as used heuristically in speech recognition.  It was noted
that many of these concerns, notably segment lattices from spectrograph-
ic analyses, were isomorphic with problems already addressed by those
working on textual variation.  Degree of confidence could be regarded as
a feature/value pair.
 
   At the highest level, WP stated that almost all representations of
phonology used hierarchic constituent structures, some of which might or
might not be aligned in parallel (e.g. prosodic and metrical struc-
tures).  These seemed isomorphic to the bracketing paradoxes discussed
earlier.  It was noted however that analysis of digitized speech might
introduce new problems, which would be deferred to the second cycle.
 
      WP to produce some examples showing how existing phonological
      analyses could be expressed as DAGs with alignment maps.
      Due:  no date
 
 
 
 
                         4.  DICTIONARY REPORT
 
   RA presented a report on progress within the subgroup, set up origi-
nally to attempt a unification of the various dictionary formats.  There
had been meetings at Waterloo and at Oxford.  Work had been carried out
on extending the original Amsler/Tompa proposals (W4) in two areas,
etymology (summarised in W15) and in multilingual dictionaries (W20).
RA stressed that the aim was to agree on a common tag set, not to
define an all purpose DTD.  Some examples of marked up dictionary
entries were circulated for discussion (W8).  Future work would involve
standardisation of such things as part of speech tags.
 
   NC suggested that a clearer balance between prescription and descrip-
tion was needed.  MSM commented that SGML supported broadly two differ-
ent approaches to this.  In the first, a DTD for a dictionary following
the so-called "Waterloo" syntax, textual elements with some structure
were allowed to float freely in a soup:
 
             <!doctype dict [
               <!element dict    - - ANY>
               <!element (A,C)   - - ANY>
               <!element (X,Y,Z) - - ANY>
               <!element B       - - (X, Y?, Z+) -(B)>
             ]>
 
In the second, the so-called "Belgian" syntax, all data elements
whatever their structure were regarded as exceptions to a model
containing only character data:
 
             <!doctype dict [
               <!element dict        - - (#PCDATA) +(A,B,C)>
               <!element (A,C,X,Y,Z) - - (#PCDATA)>
               <!element B           - - (X,Y?,Z+) -(B)>
             ]>
 
Discussion of the possibilities offered by the latter model ensued.
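
   Both fragments declare B with the content model (X, Y?, Z+): one X,
an optional Y, then one or more Z.  As an illustration only, that model
can be restated as a regular expression over the names of the child
elements.  The sketch ignores the exclusion -(B) and character data, and
the function name is hypothetical.

```python
import re

# (X, Y?, Z+) rewritten over child names, each followed by a single space.
B_MODEL = re.compile(r"X (Y )?(Z )+")


def valid_b(children):
    """Check a list of child element names against the model for B."""
    sequence = "".join(name + " " for name in children)
    return B_MODEL.fullmatch(sequence) is not None


ok = valid_b(["X", "Y", "Z", "Z"])     # X, optional Y, repeated Z: accepted
missing_z = valid_b(["X", "Y"])        # no Z at all: rejected
```

The same check applies under either syntax, since the two differ in how
the elements surrounding B are declared and not in the model for B.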
 
   Returning to the dictionary encoding initiative, the work being
carried out within the ESPRIT project at Pisa and elsewhere as described
in NC's paper was noted.  She and RA agreed to work towards ways of
merging their tagsets, at both the individual and grouping levels, with
a view to producing a common DTD suited to monolingual and bilingual
dictionaries and writing it up for inclusion in the Guidelines as
section 7.6.
 
      NC, RA to write up a common dictionary tagset and DTD
      Due:   April 15 1990
 
 
 
 
                           5.  LEXICON REPORT
 
   RI reviewed the survey he had produced for the Grosseto meeting in
collaboration with Susan Warwick.  It had categorised machine-readable
lexica under some 12 headings; it was noted that categorial distinctions
were hard to make in all cases.  E.g. some systems treated paradigm as a
lexical feature and others as an independent concept.  NC suggested the
report should be expanded to take account of the current European feasi-
bility study on standardising lexical and terminological resources.
LB asked whether in fact people did exchange machine-readable lexica,
and whether the aim was to preserve what had been used in the past or to
facilitate integration in the future.  TL pointed out that future
machine-readable texts would certainly contain pointers into electronic
lexica.  RA suggested that detailed discussion might be premature or
politically sensitive.  MSM proposed that the topic should be written up
for committee circulation only, and might be included in the Draft Guide-
lines for June 1991.
 
      RI to redraft the report, possibly in SGML terms, with examples
      before and after the conversion.
      Due:   no date specified
 
 
 
 
                              6.  SUMMARY
 
   It was agreed that the committee would recommend four general purpose
tools rather than specific tags for all the varieties of linguistic
analysis.  The tools were a general feature structure grammar, a tree
structure grammar, a simple unit and annotation model and a general
alignment mechanism.  There was some discussion of the need for tree
structured grammars as distinct from feature structure; it was agreed
that they provided a useful subset of features, as did the unit/
annotation model.  An abbreviatory mechanism, as provided by entity ref-
erences, would be a useful way of hiding the complexities of the full
feature structure syntax.
 
      TL, MM, BS to draft a section introducing the general purpose
      tools recommended by the committee
      Due:  31 March
 
 
 
 
                           7.  LIST OF PAPERS
 
The following working papers, presented at the meeting, are here listed
with the numbers assigned to them by the editors.  Those in machine-
readable form should also be available from the TEI-ANA file server
shortly.
 
AI W3:  Plan of action for remaining work of the A&I Committee (Langen-
  doen)
 
AI W4:  Notes on Dictionary Encoding (Amsler/Tompa)
 
AI W6:  Untitled (A generic proposal for representing syntactic struc-
  tures; ms notes, Marcus)
 
AI W6A:  Untitled (Amended BNF appendix to AI W6)
 
AI W7:  Where is morphology? (ms notes, Simons)
 
AI W7A:  Untitled (Revised version of AI W7)
 
AI W8:  Examples of dictionary encoding (Amsler)
 
AI W9:  [Sample instantiations - phonology]
 
AI W10:  [Sample instantiations - syntax]
 
AI W11:  [Sample instantiations - morphology]
 
AI W12:  A conceptual modelling language (Simons)
 
AI W14:  Preliminary draft proposal (Calzolari)
 
AI W15:  Encoding etymology (Abate)
 
AI W16:  Notes on encoding of linguistic analyses (Langendoen)
 
AI W17:  A&I for a simple example (Langendoen)
 
AI W18:  Untitled (Examples of syntactic diagrams from Marcus)
 
AI W19:  Alignment notes (Ms notes, Santorini)
 
AI W20:  Towards an SGML DTD for bilingual dictionaries (Fought/Van Ess)
 
 