Minutes of the Meeting of the Metalanguage Committee
    Held at the Commission of the European Communities, Luxembourg,
 
                          October 18-20, 1989
 
                      Document Number:  TEI ML M10
                              Lou Burnard
                             8 January 1990
 
Present: David Barnard (DB), chair; Lou Burnard (LB); Jean-Pierre
Gaspart (JPG); Lynne A. Price (LAP); Michael Sperberg-McQueen (MSM);
Nino Varile (NV).
                         Final, 8 January 1990
 
                                   0
 
                         INTRODUCTORY BUSINESS
 
Administrative matters
 
DB welcomed the committee. A new address list had been prepared
[attached to these minutes]; any further alterations should be sent to
DB as soon as possible. MSM outlined the limits on, and procedure for,
claiming reimbursement of expenses; it was noted that the situation
might change with new funding arrangements. DB asked all committee
members to inform him of their likely expenses well in advance of
meetings to enable him to administer the budget allocated to the
committee effectively.
 
 
Minutes of previous meeting [ML M 1]
 
'DRTD' (page 3) should read 'DTD'.
 
   LB confirmed that he and MSM were still working on a tagset for
internal TEI usage.
 
   The status of decisions made at the previous meeting was reviewed. On
non-substantive issues, the committee agreed to accept the Chair's rul-
ing. On substantive issues, it was agreed that a decision accepted by a
two-thirds majority of the whole committee should be binding. Members
who had not voted on a particular issue would be given a fixed period
after notification of the decision in which to inform the chair of any
disagreement; silence would be interpreted as consent. This led to a
brief discussion of absenteeism: it was agreed that two unexplained
absences from meetings would constitute resignation.
 
   It was agreed that document ML W1 had not yet been adopted by the com-
mittee: substantial revisions had been requested at Toronto which had
not yet been carried out. Most of these were subsumed into the discus-
sion of the replacement for this working paper, now renumbered as ML
W13.
 
Document numbering
 
A new document numbering scheme was proposed. All papers before the com-
mittee would be consecutively numbered, and each would bear a single
letter prefix (W for working papers, A for agenda lists, M for minutes).
There would be a standing agenda item at each meeting to check the sta-
tus of all documents currently before the committee and to note any new
ones. DB agreed to produce a new document list including all existing
and proposed papers [attached]. It was also agreed that documents circu-
lated outside the committee should be credited to the whole committee as
author or editor; individual authorship should be noted only within the
committee.
 
                                   0
 
                           STATEMENT OF WORK
 
The Committee's plan of work, as presented in TEI documents SC G10, ED
P2 and ML R1 (now renumbered ML W4), was reviewed.  DB stated that the
committee had two main responsibilities: firstly to advise the other
committees on optimal usage of SGML, and secondly to survey existing
encoding schemes with a view to recasting them. There was some discus-
sion of what was entailed by 'recasting': at one extreme it might simply
be a lookup table or an SGML short reference map; at the other a much
more complex scheme might be needed. Information loss should be avoida-
ble when going from one exogenous scheme into TEI, but would not be when
going from TEI to a comparatively impoverished scheme.  LAP asked wheth-
er software would be produced to do the conversions. DB said this was
not the committee's responsibility: it needed only to produce generic
specifications for transforming between a representative set of encoding
schemes and the tagsets and DTDs defined by the TEI. Political consider-
ations were involved in determining what would be an appropriate set to
consider, as well as technical issues. JPG noted that some candidate
encodings were very application specific and that hence some considera-
tion of the function of the markup was necessary, citing macro packages
such as LaTeX. It was agreed to produce a list of candidate encoding
schemes and to revise the Statement of Work.
 
      LB to produce categorised list of candidate markup schemes
      Document number:  ML W12
      Due:  19 Oct 89
 
      DB to revise statement of work
      Document number:  ML W4 (source), ML W11 (new doc)
      Due:  27 Oct 89
 
                                   0
 
                               GUIDELINES
 
It was noted that a paper setting out general Guidelines on the usage of
SGML features was urgently needed by the other committees. The existing
draft working paper (TEI ML W1, now replaced by ML W13) needed consider-
able revision and extension. A detailed discussion ensued.
 
 
General principles
 
JPG pointed out that different situations might require different recom-
mendations: in particular, features appropriate for capture or process-
ing might not be appropriate in interchange or storage. Although the
CALS and AAP standards had been proposed for interchange only, people
still used them for data capture, for which they were less suitable. It
was agreed to structure discussion by drawing up a matrix of SGML fea-
tures categorised by their suitability for data capture, interchange,
storage and processing.
 
   As a general principle, the committee felt that anything which could
be expressed using SGML should be. Similarly, documentation should make
clear that only those features which are defined using SGML could be
relied on for interchange purposes.
 
   DB proposed that the document should also address the importance of
using software to check the syntax of SGML encoded texts. It was recog-
nised that users of the TEI Guidelines might use many intermediate soft-
ware environments, but the committee agreed that DTDs developed for the
project, and documents claiming to be TEI-conformant, would have to be
validated by full SGML parsers, and agreed to caution the other committees
against under-estimating the complexity of this process. JPG and LAP
both pointed out that producing software capable of checking SGML syntax
correctly was a far from trivial task, and LAP agreed to draft a few
pages setting out the difficulties of doing so.
 
      LAP to draft a memorandum on the difficulties of SGML parseability
      Document number:  no number assigned
      Due:  no date assigned
 
      DB to produce new draft of Guidelines (with sections by others as
      noted elsewhere)
      Document number:  ML W13
      Due:  1 Nov 89
 
 
Discussion of specific features
 
SHORTREF and CONCUR:  JPG proposed that SHORTREF should be recommended
for use only in data capture; LAP that it was appropriate at all times,
and could lead to significant savings in storage costs as well as con-
venience in input. JPG noted that SHORTREF and SHORTTAG could apply only
to the base tagset, suggesting that their use by applications using CON-
CURrent dtds might be problematic.  MSM said that potentially any ele-
ment might need to CONCUR. DB agreed that the only alternative to using
CONCUR would be to define something else with a similar functionality.
LAP proposed that any problems the group identified in using the feature
should be passed back to WG8.
 
Attributes:  It was agreed that, although the two were formally
equivalent, attribute values could be used in preference to element
content in order to add information to a view. Even within a single view
it might still be difficult to decide what was content and what was
process-specific information.
 
      JPG to draft recommendations on the appropriate use of attributes
      Document number:  section of W13
      Due:  no date assigned
 
Inclusion/Exclusion Exceptions:  Recommendations on when these might
profitably be used were still needed.
 
      JPG to draft recommendations on the appropriate use of exceptions
      in content models
      Document number:  section of W13
      Due:  no date assigned
 
SUBDOC:  This feature might provide an alternative to some uses of the
CONCUR feature. It allowed an entity reference to be replaced by a sub-
document with a distinct environment, entirely replacing that of the
base document type, and with no easy way of communicating information
(eg the target of IDREF attributes) between the two environments.
How this feature relates to CAPACITY is not clear. Guidance on its usage
would be useful, particularly for the A&I Committee. LAP proposed that
entity references the text of which were subdocuments should appear only
as attribute values rather than as content, which met with general
agreement.
 
      NV to draft a paragraph on the use of SUBDOC
      Document number:  section of ML W13
      Due:  no date assigned
 
APPINFO:  MSM asked whether this might be an appropriate feature to use
as a means of providing aliases for tagnames and other GIs.  LAP said
parameter entities or short refs would be a better solution to this
problem. APPINFO merely provided commentary on the environment described
by the DTD in which it appeared, and not its application.
 
OMITTAG:  There was continued discussion on appropriate use of this min-
imization feature, and it was agreed that clear recommendations were not
easy to make. Part of the problem was that different recommendations
were appropriate for private and public use of documents, a distinction
which not all wished to make. LAP circulated some 'Notes on Markup Mini-
mization and Attributes' [document ML W9] which would be incorporated in
the draft for W13. Further information on the use of SHORTTAG was
requested.
 
      LAP to expand her notes to include recommendations on use of
      SHORTTAG
      Document number:  W13
      Due:  1 Nov 89
 
LINK:  JPG stated that LINK was the only mechanism provided by SGML for
relating different document types. As implemented by the SOBEMAP parser
the feature allowed the dtd designer to associate semantic actions with
any element, for example to define formatting. LAP was opposed to its
use on the grounds that SGML was intended to separate semantics from
markup, that it was currently the subject of some concern within WG8 and
that as currently defined it was defective in several respects. After
some discussion, it was agreed to defer decision on the use of this fea-
ture.
 
Concrete syntax:  It was agreed that working committees should not devi-
ate from the reference concrete syntax without strong motivation. To do
so would require transmission of a default SGML declaration with each
document instance. Concern was expressed over the default namelength of
the current syntax which was felt to be inadequate.
 
Quantity and capacity:  The existing defaults were reviewed briefly: it
was agreed that we would need to alter only namelength, for which a val-
ue of 128 was proposed as default, and possibly the level to which enti-
ties could be nested, for which 16 seemed a little low.
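
   The practical effect of the proposed change to namelength can be
illustrated with a minimal Python sketch. The default shown (NAMELEN 8)
is that of the reference quantity set in ISO 8879; the helper function
and the sample names are hypothetical and chosen only for illustration.

```python
# Reference quantity set default (ISO 8879) versus the value proposed here.
REFERENCE_NAMELEN = 8
PROPOSED_NAMELEN = 128


def names_too_long(names, namelen):
    """Return the names whose length exceeds the given NAMELEN quantity."""
    return [name for name in names if len(name) > namelen]


# Hypothetical generic identifiers, for illustration only.
sample_names = ["div", "para", "bibliography", "transcription"]

print(names_too_long(sample_names, REFERENCE_NAMELEN))
print(names_too_long(sample_names, PROPOSED_NAMELEN))
```

   Under the reference default the two longer names would be rejected;
under the proposed value none would be.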
 
Naming rules:  It was agreed to encourage consistency in naming rules as
far as possible. In particular the existing defaults for case sensitivi-
ty (sensitive in entity names only) should be adhered to and existing
defined entity names should be used. The character set used for names
should as far as possible be the same as that used for the document: it
was recognised that this might give rise to problems in documents using
8-bit character sets, and the committee asked the Text Representation
Committee to address the problem of representing names in such docu-
ments, to consider the SGML syntax status of each character as well as
its collating position etc. and to address the best way of defining
translations between identical sets of names in different languages. A
need for globally transforming names was identified.
 
   In discussion of the FORMAL feature, it was noted that no decision
had yet been taken as to whether the TEI scheme would be registered with
the relevant standards bodies.
 
      MSM to discuss with the steering committee whether formal regis-
      tration of the TEI Guidelines was intended
      Due:  no date assigned
 
   It was agreed that entity names should be used without a doctype
qualifier only if they had the same replacement value in all TEI doc-
types. The special case of entity references expanding to IGNORE or
INCLUDE when used with marked sections was noted: the editors would need
to maintain consistency in this context as there was no way of including
a doctype identifier in this case.
 
 
Conclusions
 
DB agreed to produce a new set of draft recommendations by November 1st.
Fuller justification would be left to a later date. The document would
be distributed to JPG and LAP by FAX, and by email to other members. Two
weeks would be allowed for comments. Committee members were asked to
acknowledge receipt of the draft immediately.
 
                                   0
 
                           SGML BIBLIOGRAPHY
 
Production of this was well advanced. Robin Cover had produced a sub-
stantial amount of information which was being merged with DB's current
file and would be distributed as a working paper in SGML form shortly.
 
      DB To distribute SGML bibliography
      Document number:  ML W14
      Due:  15 Nov 89
 
 
                                   0
 
                       INTRODUCTORY GUIDE TO SGML
 
There was some discussion of the need for a very elementary guide to
SGML including illustrative material relevant to the TEI's concerns. The
diversity of other committee members' backgrounds and expertise in for-
mal language theory was noted.  It was suggested that the chapter on
SGML in the FORMEX manual might be a suitable basis.
 
      LB to draft "Idiots' Guide" to SGML
      Document number:  ML W15
      Due:  1 Dec 89
 
      NV to check on copyright status of FORMEX Guide
      Document number:  ML W15
      Due:  no date assigned
 
                                   0
 
                       PROGRAMMERS' GUIDE TO SGML
 
MSM proposed that a document introducing the major concepts of SGML
aimed at document designers and programmers would also be useful, analo-
gous to Bryan's book but less biassed to publishing applications. LAP
suggested that our working papers would form a good basis for such a
text. DB agreed that ML W13 should cover all that was necessary for
those wishing to design TEI-conformant DTDs.
 
      LAP to report on the availability of relevant work done previously
      at Hewlett Packard
      Document number:  ML W16
      Due:  24 Oct 89
 
      LAP with MSM and David Durand as backup to consider writing a
      technical introduction to the use of the TEI Guidelines
      Document number:  ML W16
      Due:  no date assigned
 
                                   0
 
                         OTHER ENCODING SCHEMES
 
LB presented verbally an initial draft of ML W12. The Committee worked
through a long list of encoding schemes, categorising each as (a) one on
which work towards specifying a transformation would be carried out (b)
one on which such work was not judged necessary (c) one on which no fur-
ther work was currently planned. [The results are presented in the draft
of ML W12 appended to these minutes]  It was noted that Nancy Ide (NI)
had expressed willingness to work in this area and she was requested to
form a working group. NV noted that recoding schemes for a number of
word-processing systems were already required for the Eurotra project.
DB proposed that each scheme on which work was required should be con-
sidered on an ad hoc basis. It might be possible simply to specify an
SGML declaration and DTD for some; others might need simple string sub-
stitutions; for yet others general purpose tools for lexical analysis
such as YACC and Lex might be required. JPG felt that the issue should
not be oversimplified but the consensus was that simple tasks should be
addressed by simple tools. DB noted that there were already several vol-
unteers willing to work in this area and others might be forthcoming.
He undertook to ask NI and Frank Tompa (FT) to co-ordinate the work.
 
      NI with assistance from Frank Tompa To form working group with
      responsibility for addressing tasks specified in ML W12
      Document number:  ML W12
      Due:  no date assigned
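
   For the simplest cases mentioned in this section, "simple string
substitutions" driven by a lookup table, a conversion might look
something like the following Python sketch. The source-scheme codes and
the target tag names are entirely hypothetical; they are not taken from
any scheme listed in ML W12.

```python
# Hypothetical lookup table: source-scheme codes -> TEI-style tags.
# Neither the source codes nor the tag names come from any real scheme.
LOOKUP = {
    "\\i+": "[emph]",
    "\\i-": "[/emph]",
    "\\p": "[p]",
}


def recast(text, table=LOOKUP):
    """Apply each substitution in the table, longest code first."""
    for code in sorted(table, key=len, reverse=True):
        text = text.replace(code, table[code])
    return text


print(recast("\\pA \\i+simple\\i- case."))
```

   Schemes whose codes are context-dependent would presumably need the
more general lexical tools mentioned above rather than a flat table.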
 
                                   0
 
                            STYLISTIC GUIDE
 
LB suggested that a separate document summarising TEI 'house style' for
DTD-definers might be useful. This was agreed, though the difficulty of
formulating such rules was noted. It should for example summarise naming
conventions, recommend appropriate grouping levels for content models
and propose a standard ordering for DTD statements. DB suggested that
much of this material might be included in W13. MSM proposed that it
should be a separate document and suggested David Durand (DD) as a suit-
able author.
 
      DB to contact DD to discuss preparation of Stylistic Guidelines
      Document number:  ML W17
      Due:  no date assigned
 
                                   0
 
                              INPUT TO WG8
 
The committee agreed that a formal mechanism for conveying problems
identified with the current SGML standard to the relevant ISO working
groups was highly desirable. Two particular problems (namelength and
doctype qualifier on marked sections) had already been identified and it
seemed likely that there would be more. It was agreed that the steering
committee should be asked to convey detailed proposals for revisions to
ISO 8879, the five year review of which was due in 1991 and that the
Committee would attempt to draft such proposals.
 
      DB to ask the steering committee to communicate with WG8
      Due:  no date assigned
 
                                   0
 
                              NEXT MEETING
 
Planned dates for next meeting are 16-19 February 1990, in Chicago. Date
and place to be confirmed by 1 Dec 1989.
 
      MSM to confirm date and place of next meeting
      Due:  1 Dec 89
 
 
                                   0 

                  QUESTIONS RAISED BY OTHER COMMITTEES
 
Due to other commitments, DB had to leave the meeting after the second
day. The third day of the meeting was spent discussing in some detail
some specific problems raised by members of other working committees.
MSM characterised the problem areas on which guidance was needed as fol-
lows:
 
*   Arbitrary segments which did not seem to be really text elements,
    for example in discourse or stylistic analysis.
 
*   Discontinuous segments where elements were interrupted by other ele-
    ments, for example in morphological or conversational analyses.
 
*   Ambiguity, where more than one structural analysis might apply to a
    given text segment.
 
*   Overlap, where two or more overlapping structures were to be tagged
    across the same text.
 
*   Synchronous parallel structures.
 
*   Transcription mapping.
 
*   Cross references both within a document and outside it.
 
*   Vagueness, where the feature to be encoded could not be exactly
    localised.
 
 
Arbitrary segments
 
Examples quoted were: this is where the transcript is inaudible; this is
where he moved to the window. JPG suggested that each such segment
should be regarded as a separate automaton. CONCUR could only be used if
the number of different segment types was in principle bounded and quite
small. Alternatively he proposed empty tags marking either end of the
segment, with attributes to indicate their type. For example, Nino's
moves around the room might be  represented in a document with content
model of (#PCDATA | move) by elements tagged [(nino)move] ....
[/(nino)move] or by elements tagged [move person=nino start] ... [move
person=nino end]. It was noted that the first mechanism could be gener-
ated from the second, but that the second was more general. The advan-
tage of the first was that the parser could check that tags were proper-
ly matched etc. A third possibility was to create the total intersection
of all identifiable segments, subdivided at each segment boundary and
grouped at a higher level.
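
   The observation that the first form could be generated from the
second can be sketched in Python as follows. The sketch uses the
bracket notation of the examples above and assumes that boundary tags
always have exactly the shape [name person=value start] or
[name person=value end]; it does not check that the boundaries are
properly matched.

```python
import re

# Matches the empty boundary tags of the second form, e.g.
# [move person=nino start] and [move person=nino end].
BOUNDARY = re.compile(r"\[(\w+) person=(\w+) (start|end)\]")


def to_paired_tags(text):
    """Rewrite boundary tags as [(person)name] ... [/(person)name]."""
    def rewrite(match):
        name, person, edge = match.groups()
        slash = "" if edge == "start" else "/"
        return f"[{slash}({person}){name}]"
    return BOUNDARY.sub(rewrite, text)


sample = "[move person=nino start]He moved to the window.[move person=nino end]"
print(to_paired_tags(sample))
```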
 
 
Discontinuous segments
 
These could be handled by co-indexing using the id/idref mechanism of
SGML. To avoid the need to treat the first part of a discontinuous seg-
ment differently from the rest, it could be set up initially. The exam-
ple discussed was of discontinuous segments relating to a particular
topic.  Provided that the set of topics could be predefined, these could
be listed in a separate TOPICLIST element, each member of which could be
allocated an ID. Each occurrence of a topic in the text itself would
then be linked back to the appropriate element by an IDREF.
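
   A minimal Python sketch of this co-indexing arrangement is given
below. The topic IDs, topic names and text segments are hypothetical;
the dictionary stands in for the TOPICLIST element and the first member
of each pair for an IDREF attribute.

```python
# Hypothetical TOPICLIST: each topic is declared once and given an ID.
topiclist = {"t1": "weather", "t2": "travel"}

# Text segments in document order, each pointing back to a topic by IDREF.
segments = [
    ("t1", "It rained all week."),
    ("t2", "We took the train."),
    ("t1", "Then the sun came out."),
]


def collect_by_topic(topics, segs):
    """Reassemble discontinuous segments, rejecting unresolved IDREFs."""
    collected = {topic_id: [] for topic_id in topics}
    for idref, text in segs:
        if idref not in topics:
            raise ValueError(f"IDREF {idref!r} matches no declared ID")
        collected[idref].append(text)
    return collected


print(collect_by_topic(topiclist, segments)["t1"])
```

   Because every segment, including the first, refers to the predefined
list in the same way, no part of a discontinuous segment needs special
treatment.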
 
   It was agreed that this method was probably not appropriate for all
examples of discontinuity. For micro discontinuities such as those found
in morphology, a simpler solution might be to introduce some redundancy
into the text. For example, the lemmatised form KTB of the Arabic word
al-kaatib might be represented either as an attribute [word
root=KTB]al-kaatib [/word] or as an additional element [word] [root]KTB
[form]al-kaatib [/word]. The first was probably to be preferred, since
lemmatisation was a process applied to the text rather than a part of
its content.
 
Ambiguity:  Five mechanisms were outlined for dealing with ambiguities
such as 'I saw the man with the telescope'. The two parse trees for this
sentence - ((I) (saw((the man) with the telescope))) and ((I) ((saw (the
man)) with the telescope)) - could be represented independently, but
without repeating their points of overlap, by using CONCUR. Or they
could be repeated as alternative marked sections.
 
   The third possibility was to think of each parse tree as a graph, in
which the nodes were the boundaries between syntactic units and the arcs
the syntactic interpretation placed on them.  A sentence could then be
represented in SGML as an ordered list of words, each with its own ID.
The arcs of each syntactic graph for the sentence could then be repre-
sented by empty elements, using idrefs to point back to the words. A
simpler representation might be to use a special notation (which could
not therefore be checked by the SGML parser) for the parse tree value
and specify it as an attribute. The fifth possibility was to treat all
parse subtrees as arbitrary segments, as outlined above.
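
   The third (graph-based) possibility might be sketched in Python as
follows. The word IDs, the arc labels and the analysis names "A" and
"B" are hypothetical; each dictionary stands in for an empty element
whose attributes are IDREFs pointing back to the words. Only the arcs
on which the two analyses differ are shown.

```python
# Words of the sentence in order, each with its own ID.
words = [("w1", "I"), ("w2", "saw"), ("w3", "the"), ("w4", "man"),
         ("w5", "with"), ("w6", "the"), ("w7", "telescope")]

# Arcs stand in for empty elements with IDREF attributes; the labels and
# the two analysis names are hypothetical.
arcs = [
    {"analysis": "A", "label": "NP", "from": "w3", "to": "w7"},
    {"analysis": "B", "label": "NP", "from": "w3", "to": "w4"},
    {"analysis": "B", "label": "PP", "from": "w5", "to": "w7"},
]


def arc_text(arc, word_list):
    """Return the words spanned by an arc, resolving its two IDREFs."""
    ids = [word_id for word_id, _ in word_list]
    start, end = ids.index(arc["from"]), ids.index(arc["to"])
    return " ".join(form for _, form in word_list[start:end + 1])


for arc in arcs:
    print(arc["analysis"], arc["label"], arc_text(arc, words))
```

   The words themselves are stored once only; each competing analysis
adds arcs without repeating the text.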
 
   The committee felt that more specific examples would be needed before
clear recommendations could be given as to which of these mechanisms
should be preferred.
 
Overlap:  The example cited -- she took advantage of Joan -- seemed to
be an instance of arbitrary sectioning. The A&I Committee was requested
to provide precise illustration of the Japanese biclausal analysis
referred to in document TEI AI M1.
 
Cross references:  The id/idref mechanism of SGML seemed adequate in the
absence of more specific problems.
 
Synchronisation of multiple transcriptions:  JPG pointed out that so
long as order was preserved, parallelism between two synchronous struc-
tures could be implied. If order was not preserved, parallelism would
need to be explicitly tagged by using idrefs.
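
   The distinction can be illustrated with a short Python sketch. The
two transcriptions and the IDs are hypothetical; positional pairing
stands in for implied parallelism, and the lookup by ID for explicit
tagging with idrefs.

```python
# Two hypothetical parallel transcriptions of the same three units.
orthographic = ["good", "morning", "all"]
phonemic = ["gUd", "mO:nIN", "O:l"]

# Order preserved: alignment is implied by position.
implied = list(zip(orthographic, phonemic))

# Order not preserved: each unit carries an explicit reference (an IDREF
# in SGML terms) to its counterpart's ID.
ids = {"o1": "good", "o2": "morning", "o3": "all"}
shuffled = [("o3", "O:l"), ("o1", "gUd"), ("o2", "mO:nIN")]
explicit = [(ids[idref], form) for idref, form in shuffled]

print(implied)
print(sorted(explicit) == sorted(implied))
```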
 
Vagueness:  Various examples of vagueness were discussed inconclusively.
In most cases the mechanisms proposed for arbitrary and discontinuous
segments seemed adequate. Two different examples were requested, togeth-
er with some indication of the purpose for which such features might be
tagged.
 
 
CDATA
 
LAP gave a detailed clarification of the use of the CDATA keyword, as
declared content for an element, declared value for an attribute, as an
entity or as effective status keyword in a marked section. JPG described
the interaction between the use of CDATA and of marked sections. It was
agreed that CDATA and RCDATA should be avoided as declared element con-
tent, unless in a non-SGML environment or as a marked section status
keyword.
 
Processing Instructions
 
It was agreed that these should be avoided, except for a few specific
purposes such as returning the system date to a document instance, in
which case the value returned should be declared as SDATA to avoid
changing the parse state.
 