Minutes of the Meeting of the Metalanguage Committee
Held at the Commission of the European Communities, Luxembourg,
October 18-20, 1989

Document Number: TEI ML M10
Lou Burnard
8 January 1990

Present: David Barnard (DB), chair; Lou Burnard (LB); Jean-Pierre Gaspart (JPG); Lynne A. Price (LAP); Michael Sperberg-McQueen (MSM); Nino Varile (NV).

Final, 8 January 1990

0 INTRODUCTORY BUSINESS

Administrative matters

DB welcomed the committee. A new address list had been prepared [attached to these minutes]; any further alterations should be sent to DB as soon as possible.

MSM outlined the limits on and procedure for claiming reimbursement of expenses; it was noted that the situation may change with new funding arrangements. DB asked all committee members to inform him of their likely expenses well in advance of meetings to enable him to administer the budget allocated to the committee effectively.

Minutes of previous meeting [ML M 1]

'DRTD' (page 3) should read 'DTD'. LB confirmed that he and MSM were still working on a tagset for internal TEI usage.

The status of decisions made at the previous meeting was reviewed. On non-substantive issues, the committee agreed to accept the Chair's ruling. On substantive issues, it was agreed that a decision accepted by a two-thirds majority of the whole committee should be binding. Members who had not voted on a particular issue would be given a fixed period after notification of the decision in which to inform the chair of any disagreement; silence would be interpreted as consent. This led to a brief discussion of absenteeism: it was agreed that two unexplained absences from meetings would constitute resignation.

It was agreed that document MLW1 had not yet been adopted by the committee: substantial revisions had been requested at Toronto which had not yet been carried out. Most of these were subsumed into the discussion of the replacement for this working paper, now renumbered as ML W13.

Document numbering

A new document numbering scheme was proposed. All papers before the committee would be consecutively numbered, and each would bear a single letter prefix (W for working papers, A for agenda lists, M for minutes). There would be a standing agenda item at each meeting to check the status of all documents currently before the committee and to note any new ones. DB agreed to produce a new document list including all existing and proposed papers [attached]. It was also agreed that documents circulated outside the committee should be credited to the whole committee as author or editor; individual authorship should be noted only within the committee.

0 STATEMENT OF WORK

The Committee's plan of work, as presented in TEI documents SC G10, ED P2 and ML R1 (now renumbered MLW4), was reviewed. DB stated that the committee had two main responsibilities: first, to advise the other committees on optimal usage of SGML, and second, to survey existing encoding schemes with a view to recasting them. There was some discussion of what was entailed by 'recasting': at one extreme it might simply be a lookup table or an SGML short reference map; at the other a much more complex scheme might be needed. Information loss should be avoidable when going from an exogenous scheme into TEI, but would not be when going from TEI to a comparatively impoverished scheme.

LAP asked whether software would be produced to do the conversions. DB said this was not the committee's responsibility: it needed only to produce generic specifications for transforming between a representative set of encoding schemes and the tagsets and DTDs defined by the TEI. Political considerations were involved in determining what would be an appropriate set to consider, as well as technical issues. JPG noted that some candidate encodings were very application specific and that hence some consideration of the function of the markup was necessary, citing macro packages such as LaTeX.
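
The asymmetry in information loss noted above can be illustrated with a minimal sketch, assuming that 'recasting' is done with nothing more than a lookup table. The codes and tag names below are hypothetical; they are not taken from the TEI or from any existing encoding scheme.

```python
# Hypothetical lookup tables; none of these codes or tag names come from
# the TEI or from any real encoding scheme.

# Recasting from a comparatively impoverished scheme with only two codes.
POOR_TO_RICH = {
    "@I": "emph",      # italic text
    "@H": "head",      # heading
}

# Recasting from a richer tagset that distinguishes several uses of italics.
RICH_TO_POOR = {
    "emph": "@I",
    "foreign": "@I",   # collapses onto the same code as "emph"
    "title": "@I",     # likewise
    "head": "@H",
}


def recast(codes, table):
    """Translate a sequence of codes through a lookup table."""
    return [table[code] for code in codes]


def is_lossless(table):
    """A table loses no information if no two keys share a value."""
    return len(set(table.values())) == len(table)
```

Under these assumptions `is_lossless(POOR_TO_RICH)` holds while `is_lossless(RICH_TO_POOR)` does not: a text recast from the richer tagset into the poorer one and back again can no longer distinguish "foreign" or "title" from "emph".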

It was agreed to produce a list of candidate encoding schemes and to revise the Statement of Work.

    LB to produce categorised list of candidate markup schemes
    Document number: ML W12
    Due: 19 Oct 89

    DB to revise statement of work
    Document number: ML W4 (source), MLW11 (new doc)
    Due: 27 Oct 89

0 GUIDELINES

It was noted that a paper setting out general Guidelines on the usage of SGML features was urgently needed by the other committees. The existing draft working paper (TEI ML W1, now replaced by ML W13) needed considerable revision and extension. A detailed discussion ensued.

General principles

JPG pointed out that different situations might require different recommendations: in particular, features appropriate for capture or processing might not be appropriate in interchange or storage. Although the CALS and AAP standards had been proposed for interchange only, people still used them for data capture, for which they were less suitable. It was agreed to structure discussion by drawing up a matrix of SGML features categorised by their suitability for data capture, interchange, storage and processing.

As a general principle, the committee felt that anything which could be expressed using SGML should be. Similarly, documentation should make clear that only those features which are defined using SGML could be relied on for interchange purposes.

DB proposed that the document should also address the importance of using software to check the syntax of SGML encoded texts. It was recognised that users of the TEI Guidelines might use many intermediate software environments, but the committee agreed that DTDs developed for the project, and documents claiming to be TEI-conformant, would have to be validated by full SGML parsers, and agreed to caution the other committees against under-estimating the complexity of this process.
JPG and LAP both pointed out that producing software capable of checking SGML syntax correctly was a far from trivial task, and LAP agreed to draft a few pages setting out the difficulties of doing so.

    LAP to draft a memorandum on the difficulties of SGML parseability
    Document number: no number assigned
    Due: no date assigned

    DB to produce new draft of Guidelines (with sections by others as noted elsewhere)
    Document number: ML W13
    Due: 1 Nov 89

Discussion of specific features

SHORTREF and CONCUR: JPG proposed that SHORTREF should be recommended for use only in data capture; LAP that it was appropriate at all times, and could lead to significant savings in storage costs as well as convenience in input. JPG noted that SHORTREF and SHORTTAG could apply only to the base tagset, suggesting that their use by applications using CONCURrent dtds might be problematic. MSM said that potentially any element might need to CONCUR. DB agreed that the only alternative to using CONCUR would be to define something else with a similar functionality. LAP proposed that any problems the group identified in using the feature should be passed back to WG8.

Attributes: It was agreed that, although formally equivalent, attribute values could be used in preference to element content in order to add information to a view. Even within a single view it might still be difficult to decide what was content and what was process-specific information.

    JPG to draft recommendations on the appropriate use of attributes
    Document number: section of W13
    Due: no date assigned

Inclusion/Exclusion Exceptions: Recommendations on when these might profitably be used were still needed.

    JPG to draft recommendations on the appropriate use of exceptions in content models
    Document number: section of W13
    Due: no date assigned

SUBDOC: This feature might provide an alternative to some uses of the CONCUR feature.
It allowed an entity reference to be replaced by a subdocument with a distinct environment, entirely replacing that of the base document type, and with no easy way of communicating information (eg the target of IDREF attributes) between the two environments. How this feature relates to CAPACITY is not clear. Guidance on its usage would be useful, particularly for the A&I Committee. LAP proposed that entity references the text of which were subdocuments should appear only as attribute values rather than as content, which met with general agreement.

    NV to draft a paragraph on the use of SUBDOC
    Document number: section of ML W13
    Due: no date assigned

APPINFO: MSM asked whether this might be an appropriate feature to use as a means of providing aliases for tagnames and other GIs. LAP said parameter entities or short refs would be a better solution to this problem. APPINFO merely provided commentary on the environment described by the DTD in which it appeared, and not its application.

OMITTAG: There was continued discussion on appropriate use of this minimization feature, and it was agreed that clear recommendations were not easy to make. Part of the problem was that different recommendations were appropriate for private and public use of documents, a distinction which not all wished to make. LAP circulated some 'Notes on Markup Minimization and Attributes' [document ML W9] which would be incorporated in the draft for W13. Further information on the use of SHORTTAG was requested.

    LAP to expand her notes to include recommendations on use of SHORTTAG
    Document number: W13
    Due: 1 Nov 89

LINK: JPG stated that LINK was the only mechanism provided by SGML for relating different document types. As implemented by the SOBEMAP parser the feature allowed the dtd designer to associate semantic actions with any element, for example to define formatting.
LAP was opposed to its use on the grounds that SGML was intended to separate semantics from markup, that the feature was currently the subject of some concern within WG8, and that as currently defined it was defective in several respects. After some discussion, it was agreed to defer decision on the use of this feature.

Concrete syntax: It was agreed that working committees should not deviate from the reference concrete syntax without strong motivation. To do so would require transmission of a default SGML declaration with each document instance. Concern was expressed over the default namelength of the current syntax, which was felt to be inadequate.

Quantity and capacity: The existing defaults were reviewed briefly: it was agreed that we would need to alter only namelength, for which a value of 128 was proposed as default, and possibly the level to which entities could be nested, for which 16 seemed a little low.

Naming rules: It was agreed to encourage consistency in naming rules as far as possible. In particular the existing defaults for case sensitivity (sensitive in entity names only) should be adhered to and existing defined entity names should be used. The character set used for names should as far as possible be the same as that used for the document: it was recognised that this might give rise to problems in documents using 8-bit character sets, and the committee asked the Text Representation Committee to address the problem of representing names in such documents, to consider the SGML syntax status of each character as well as its collating position etc., and to address the best way of defining translations between identical sets of names in different languages. A need for globally transforming names was identified.

In discussion of the FORMAL feature, it was noted that no decision had yet been taken as to whether the TEI scheme would be registered with the relevant standards bodies.

    MSM to discuss with the steering committee whether formal registration of the TEI Guidelines was intended
    Due: no date assigned

It was agreed that entity names should be used without a doctype qualifier only if they had the same replacement value in all TEI doctypes. The special case of entity references expanding to IGNORE or INCLUDE when used with marked sections was noted: the editors would need to maintain consistency in this context as there was no way of including a doctype identifier in this case.

Conclusions

DB agreed to produce a new set of draft recommendations by November 1st. Fuller justification would be left to a later date. The document would be distributed to JPG and LAP by FAX, by email to other members. Two weeks would be allowed for comments. Committee members were asked to acknowledge receipt of the draft immediately.

0 SGML BIBLIOGRAPHY

Production of this was well advanced. Robin Cover had produced a substantial amount of information which was being merged with DB's current file and would be distributed as a working paper in SGML form shortly.

    DB to distribute SGML bibliography
    Document number: ML W14
    Due: 15 Nov 89

0 INTRODUCTORY GUIDE TO SGML

There was some discussion of the need for a very elementary guide to SGML including illustrative material relevant to the TEI's concerns. The diversity of other committee members' backgrounds and expertise in formal language theory was noted. It was suggested that the chapter on SGML in the FORMEX manual might be a suitable basis.

    LB to draft "Idiots' Guide" to SGML
    Document number: ML W15
    Due: 1 Dec 89

    NV to check on copyright status of FORMEX Guide
    Document number: ML W15
    Due: no date assigned

0 PROGRAMMERS' GUIDE TO SGML

MSM proposed that a document introducing the major concepts of SGML aimed at document designers and programmers would also be useful, analogous to Bryan's book but less biassed to publishing applications.
LAP suggested that our working papers would form a good basis for such a text. DB agreed that ML W13 should cover all that was necessary for those wishing to design TEI-conformant DTDs.

    LAP to report on the availability of relevant work done previously at Hewlett Packard
    Document number: ML W16
    Due: 24 Oct 89

    LAP with MSM and David Durand as backup to consider writing a technical introduction to the use of the TEI Guidelines
    Document number: ML W16
    Due: no date assigned

0 OTHER ENCODING SCHEMES

LB presented verbally an initial draft of ML W12. The Committee worked through a long list of encoding schemes, categorising each as (a) one on which work towards specifying a transformation would be carried out, (b) one on which such work was not judged necessary, or (c) one on which no further work was currently planned. [The results are presented in the draft of ML W12 appended to these minutes]

It was noted that Nancy Ide (NI) had expressed willingness to work in this area and she was requested to form a working group. NV noted that recoding schemes for a number of word-processing systems were already required for the Eurotra project.

DB proposed that each scheme on which work was required should be considered on an ad hoc basis. It might be possible simply to specify an SGML declaration and DTD for some; others might need simple string substitutions; for yet others general purpose tools for lexical analysis such as YACC and Lex might be required. JPG felt that the issue should not be oversimplified but the consensus was that simple tasks should be addressed by simple tools. DB noted that there were already several volunteers willing to work in this area and others might be forthcoming. He undertook to ask NI and FT to co-ordinate the work.
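
The step from simple string substitution to lexical analysis can be sketched as follows. This is an illustration only: the brace notation is a hypothetical source scheme invented for the example, the output tags are not TEI tags, and Python's `re` module stands in for a tool such as Lex. A bare closing brace does not say which span it closes, so a fixed substitution is not enough and the converter has to remember which spans are open.

```python
import re

# Hypothetical source scheme: "{i ...}" and "{b ...}" mark spans, and a
# bare "}" closes whichever span was opened most recently.
TOKEN = re.compile(r"\{(?P<open>[a-z]+) |(?P<close>\})|(?P<text>[^{}]+)")


def convert(source):
    """Rewrite the hypothetical brace scheme as paired SGML-style tags."""
    out = []
    stack = []  # names of spans that are still open
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        if match.group("open"):
            stack.append(match.group("open"))
            out.append(f"<{stack[-1]}>")
        elif match.group("close"):
            if not stack:
                raise ValueError(f"unmatched '}}' at offset {pos}")
            out.append(f"</{stack.pop()}>")
        else:
            out.append(match.group("text"))
        pos = match.end()
    if stack:
        raise ValueError(f"unclosed span: {stack[-1]}")
    return "".join(out)
```

For example, `convert("{i important} and {b {i nested}}")` yields `<i>important</i> and <b><i>nested</i></b>`, while a stray closing brace is reported as an error rather than silently copied through.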

    NI with assistance from Frank Tompa to form working group with responsibility for addressing tasks specified in ML W12
    Document number: ML W12
    Due: no date assigned

0 STYLISTIC GUIDE

LB suggested that a separate document summarising TEI 'house style' for DTD-definers might be useful. This was agreed, though the difficulty of formulating such rules was noted. It should for example summarise naming conventions, recommend appropriate grouping levels for content models and propose a standard ordering for DTD statements. DB suggested that much of this material might be included in W13. MSM proposed that it should be a separate document and suggested David Durand (DD) as a suitable author.

    DB to contact DD to discuss preparation of Stylistic Guidelines
    Document number: ML W17
    Due: no date assigned

0 INPUT TO WG8

The committee agreed that a formal mechanism for conveying problems identified with the current SGML standard to the relevant ISO working groups was highly desirable. Two particular problems (namelength and doctype qualifier on marked sections) had already been identified and it seemed likely that there would be more. It was agreed that the steering committee should be asked to convey detailed proposals for revisions to ISO 8879, the five year review of which was due in 1991, and that the Committee would attempt to draft such proposals.

    DB to ask the steering committee to communicate with WG8
    Due: no date assigned

0 NEXT MEETING

Planned dates for next meeting are 16-19 February 1990, in Chicago. Date and place to be confirmed by 1 Dec 1989.

    MSM to confirm date and time of next meeting
    Due: 1 Dec 89

0 QUESTIONS RAISED BY OTHER COMMITTEES

Owing to other commitments, DB had to leave the meeting after the second day. The third day of the meeting was spent discussing in some detail some specific problems raised by members of other working committees.
MSM characterised the problem areas on which guidance was needed as follows:

* Arbitrary segments which did not seem to be really text elements, for example in discourse or stylistic analysis.
* Discontinuous segments where elements were interrupted by other elements, for example in morphological or conversational analyses.
* Ambiguity, where more than one structural analysis might apply to a given text segment.
* Overlap, where two or more overlapping structures were to be tagged across the same text.
* Synchronous parallel structures.
* Transcription mapping.
* Cross references both within a document and outside it.
* Vagueness, where the feature to be encoded could not be exactly localised.

Arbitrary segments

Examples quoted were: this is where the transcript is inaudible; this is where he moved to the window. JPG suggested that each such segment should be regarded as a separate automaton. CONCUR could only be used if the number of different segment types was in principle bounded and quite small. Alternatively he proposed empty tags marking either end of the segment, with attributes to indicate their type. For example, Nino's moves around the room might be represented in a document with content model of (#PCDATA | move) by elements tagged [(nino)move] .... [/(nino)move] or by elements tagged [move person=nino start] ... [move person=nino end]. It was noted that the first mechanism could be generated from the second, but that the second was more general. The advantage of the first was that the parser could check that tags were properly matched etc. A third possibility was to create the total intersection of all identifiable segments, subdivided at each segment boundary and grouped at a higher level.

Discontinuous segments

These could be handled by co-indexing using the id/idref mechanism of SGML. To avoid the need to treat the first part of a discontinuous segment differently from the rest, it could be set up initially.
The example discussed was of discontinuous segments relating to a particular topic. Provided that the set of topics could be predefined, these could be listed in a separate TOPICLIST element, each member of which could be allocated an ID. Each occurrence of a topic in the text itself would then be linked back to the appropriate element by an IDREF.

It was agreed that this method was probably not appropriate for all examples of discontinuity. For micro discontinuities such as those found in morphology, a simpler solution might be to introduce some redundancy into the text. For example, the lemmatised form KTB of the Arabic word al-kaatib might be represented either as an attribute [word root=KTB]al-kaatib [/word] or as an additional element [word] [root]KTB [form]al-kaatib [/word]. The first was probably to be preferred, since lemmatisation was a process applied to the text rather than a part of its content.

Ambiguity: Five mechanisms were outlined for dealing with ambiguities such as 'I saw the man with the telescope'. The two parse trees for this sentence - ((I) (saw ((the man) with the telescope))) and ((I) ((saw (the man)) with the telescope)) - could be represented independently, but without repeating their points of overlap, by using CONCUR. Or they could be repeated as alternative marked sections. The third possibility was to think of each parse tree as a graph, in which the nodes were the boundaries between syntactic units and the arcs the syntactic interpretation placed on them. A sentence could then be represented in SGML as an ordered list of words, each with its own ID. The arcs of each syntactic graph for the sentence could then be represented by empty elements, using idrefs to point back to the words. A simpler representation might be to use a special notation (which could not therefore be checked by the SGML parser) for the parse tree value and specify it as an attribute.
The fifth possibility was to treat all parse subtrees as arbitrary segments, as outlined above. The committee felt that more specific examples would be needed before clear recommendations could be given as to which of these mechanisms should be preferred.

Overlap: The example cited -- she took advantage of Joan -- seemed to be an instance of arbitrary sectioning. The A&I Committee was requested to provide precise illustration of the Japanese biclausal analysis referred to in document TEI AI M1.

Cross references: The id/idref mechanism of SGML seemed adequate in the absence of more specific problems.

Synchronisation of multiple transcriptions: JPG pointed out that so long as order was preserved, parallelism between two synchronous structures could be implied. If order was not preserved, parallelism would need to be explicitly tagged by using idrefs.

Vagueness: Various examples of vagueness were discussed inconclusively. In most cases the mechanisms proposed for arbitrary and discontinuous segments seemed adequate. Two different examples were requested, together with some indication of the purpose for which such features might be tagged.

CDATA

LAP gave a detailed clarification of the use of the CDATA keyword, as declared content for an element, declared value for an attribute, as an entity or as effective status keyword in a marked section. JPG described the interaction between the use of CDATA and of marked sections. It was agreed that CDATA and RCDATA should be avoided as declared element content, unless in a non-SGML environment or as a marked section status keyword.

Processing Instructions

It was agreed that these should be avoided, except for a few specific purposes such as returning the system date to a document instance, in which case the value returned should be declared as SDATA to avoid changing the parse state.
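
The graph representation outlined under Ambiguity (an ordered list of words, each with its own ID, and arcs that point back to the words) can be sketched as below. This is a minimal illustration under stated assumptions: the IDs w1 to w7, the labels S, NP and VP, and all Python names are hypothetical, and the `Arc` records merely stand in for the empty elements with idref attributes described in the discussion.

```python
from dataclasses import dataclass

# Ordered list of words, each with its own ID (the IDs are hypothetical).
WORDS = {
    "w1": "I", "w2": "saw", "w3": "the", "w4": "man",
    "w5": "with", "w6": "the", "w7": "telescope",
}
ORDER = list(WORDS)  # dicts preserve insertion order


@dataclass(frozen=True)
class Arc:
    """Stands in for an empty element whose idrefs point back to words."""
    label: str   # syntactic interpretation placed on the span (assumed labels)
    first: str   # idref of the first word covered
    last: str    # idref of the last word covered


# ((I) (saw ((the man) with the telescope)))
READING_1 = {
    Arc("S", "w1", "w7"), Arc("NP", "w1", "w1"), Arc("VP", "w2", "w7"),
    Arc("NP", "w3", "w7"), Arc("NP", "w3", "w4"),
}
# ((I) ((saw (the man)) with the telescope))
READING_2 = {
    Arc("S", "w1", "w7"), Arc("NP", "w1", "w1"), Arc("VP", "w2", "w7"),
    Arc("VP", "w2", "w4"), Arc("NP", "w3", "w4"),
}


def text_of(arc):
    """Resolve an arc's idrefs and return the words it spans."""
    start, end = ORDER.index(arc.first), ORDER.index(arc.last)
    return " ".join(WORDS[i] for i in ORDER[start:end + 1])
```

The arcs in `READING_1 & READING_2` are common to both analyses and need be recorded only once; each reading then contributes a single arc of its own, spanning "the man with the telescope" in the first and "saw the man" in the second.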