[Note: Words noted in Italics in the printed version of the Minutes have been enclosed in *...* for this electronic version. - WP, TEI]

TEI AI7M12
==========

Meeting II of TEI AI7 Terminology, 1991-08-13/14          Version 1, 1991-09-01
at Oak Ridge National Laboratory in Oak Ridge, Tennessee.

In attendance:

     Dr. Alan K. Melby                  Chair
     Dr. Gerhard Budin                  Members
     Dr. Richard A. Strehlow
     Dr. Sue Ellen Wright
     Dr. Gregory Shreve                 Co-opted member
     Dr. Michael Sperberg-McQueen       TEI Editor
     Dr. Leland D. Wright               Observer

Special Guests:

     Dr. James Mason
     Thomas O. Tallant

The committee wishes to express its thanks to Dr. Strehlow and to Oak Ridge National Laboratory for their hospitality in making space available for these meetings.

Procedural Comment
-------------------

Whereas the minutes of the first meeting reflect the ongoing development of the reasoning of the WG, the minutes of this second meeting will serve to document the development of a specific set of procedures governing the AI7 TEI.TERM exchange format. The minutes themselves should be viewed as a kind of developmental guide to a series of sub-documents that will eventually form the core of a formal report. These sub-documents are identified in the minutes and cross-referenced wherever necessary. Readers are encouraged to move back and forth between these text segments in order to formulate an overall picture of the proposed exchange format.

AI7 Sub-documents (included in AI7W13)
------------------

     0. Introductory Document
     1. Terminology
     2. Basic Structure of the termEntry
     3. Inter- and Intra-termEntry Links
     4. termEntry Tags, Attributes and Attribute Values
     5. Writing System Declarations and the *lang* Attribute
     6. AI7 DTD
     7. Sample s

Review of the Minutes of the Cleveland Meeting
----------------------------------------------

The first item on the agenda consisted in examining the minutes of the previous meeting in Cleveland, Ohio (hereinafter called CLE Minutes).
The discussion items are listed succinctly below cross-referenced by page and paragraph number as they appear in the revised minutes (Version 4). It was noted that the footnotes included in the revised minutes of 1991-07-29 should be moved to an addendum section in order to distinguish them from the full text, which represents the actual content of the Cleveland meeting. Shreve noted that "SGML compatible" should read "can be easily convertible to SGML-conformant." See Terminology Section and Item 7 below. Page 3, Paragraph 4, l. 3 should read "AI6 on Computational Lexicons." Page 8, Paragraph 3, l. 2 should read "ISO 639, Part II." Page 9, header: the number "10" should be omitted. The minutes were approved as corrected. A revised version of the minutes will be provided to the TEI Chicago office. Other document-related comments ------------------------------- The request was made that TEI.Term (AI7) documents be specifically forwarded to the members of the dictionary committee. Sperberg-McQueen agreed to facilitate this request. Related Reports --------------- Dr. James Mason, convener of Working Group 8 on Text Description and Processing Languages, ISO/IEC JTC1, Subcommittee 18 for Text and Office Systems, reported that the SGML standard is entering the 5-year review process. Mason is a member of the Advanced Publications Technology Section of the Publications Division of Oak Ridge National Laboratory. He heads an SGML team of four. Together with Thomas O. Tallant, leader for SGML development and application, Mason arranged for a demonstration of SGML parsers in both the Macintosh and the PC environments. Sperberg-McQueen promised to provide the WG with a discussion paper by Wilson, "Rendering Hypertext Documents in SGML." Melby provided the WG with a copy of the document "Computational Model of the Dictionary Entry, Preliminary Report" authored by Calzolari, Peters and Roventini. 
The group expressed great interest in the progress of the dictionary WG and registered a desire to see concrete examples of dictionary entries to accompany the abstract templates provided in the cited paper. Primary Discussion Items ------------------------ The initial examination of the CLE Minutes produced the following list of issues for discussion: 1) The basic structure of the as evidenced on pp. 8 & 10 of the CLE Minutes 2) Normalization levels (degrees of normalization) 3) ... 4) Possible elimination of *xrefs* Inter- and intra-termEntry links 5) Openness vis-a-vis the tag set & procedures for "opening" the set vs. declaring most data categories to be attribute values 6) Floating vs. non-floating (unbound vs. bound) data categories 7) "Easily convertible to SGML-conformant" 8) TR1 & lang attribute (ISO 646). 9) The ISO language codes: use of 2 vs 3-letter codes (ISO 639) Item 7 ------ The locus of the discussion concerning "easily convertible" related to the long-standing use in some quarters of "SGML-compatible" to mean "easily convertible to SGML." Shreve correctly pointed out that "compatible" in standard computer parlance does indeed mean "fully conformant." The group agreed to deprecate the term "compatible" as ambiguous and to note preference for "conformant" over "conforming." No specific term was found for "easily convertible," which shall remain the approved formulation for describing this type of format. Items 3 and 9 ------------- The questions of front matter and whether to use 2 or 3 letter *lang* codes were left unresolved during this meeting. (See Addendum to these minutes.) Items 1-2 - Structure and Levels of Normalization -------------------------------------------------- Items 1-2 have been incorporated into "AI7 Sub-document 2: Basic Structure of the ". 
The discussion of the structure and the levels of normalization to be provided within the TEI environment centered on one basic question: Where should one place the major conversion effort, on the side of the importer or on the side of the exporter? Sperberg-McQueen argued convincingly that the interchange format must be highly normalized (which in essence means deeply nested) in order to facilitate maximum flexibility on the import side. While acknowledging this concern, the WG expressed its sociological concerns about the ultimate acceptance of the interchange format. It is the consensus of the group that the interchange format must also offer optimum flexibility on the export side so that potential users will not be put off by an interchange format that "doesn't look much like" their current local applications. As noted in Sub-document 2, these complementary concerns dictate the necessity for a multi-level interchange environment, wherein documents created in local applications can be imported to TEI.TERM in a so-called "flat" form that closely resembles the format of the source application and then subjected to iterative conversion passes to reorganize the to conform to the fully nested TEI.TERM model.

Item 4
-------

The item "inter- and intra-termEntry links" was originally included under the discussion of the basic structure, but it has been moved to combine with Item 4 because the two items address essentially the same concern. Sub-document 3 outlines the final position reached by the WG on inter- and intra-termEntry linkage.

Inter-termEntry linkage is solved easily by the id/xref/target capability built into TEI. By the same token, linkage within fully normalized nested s is solved by virtue of the rules of adjacency and basic embedding mechanisms. Designing a mechanism for intra-termEntry linkage in flat s was a more difficult task.
Sperberg-McQueen pointed out that of the two types defined by the committee (normalized and flat), the normalized can be readily processed in SGML, whereas the flat must utilize pointers that are currently not a part of SGML. The pointer method lacks existing names, constructs and hierarchies within SGML to support it. In the nested , "order" within the document is predictable within a certain set of controlled parameters, whereas in the flat , one only knows that the order of the elements is some permutation of the total number of elements included in that . (See the Addendum to these minutes for further comment on the order of data categories within the .) The essential factor in this argument is that the more constrained the interchange format, the more powerful it becomes as an import-export tool. The group is generally agreed that the flat represents a sort of "half-way house" solution, i.e. that flat s must be converted to nested s before they can be used to import data into other applications. The tagId/xref mechanism suggested in the CLE Minutes is not a viable solution because the TEI *xref* attribute is not used for internal referencing. The WG had intended that an element reference could be unique to the without being unique within the entire document. Sperberg-McQueen pointed out that this was inappropriate, that for linking purposes each element had to be "globally available and globally unique" within the document. Sperberg-McQueen also stated that the complex interlinking of elements within a is unique to AI7 and reflects a special kind of problem that poses the potential for problems with respect to SGML. 
Two basic identification principles emerged from the discussion: either all elements in a terminological information group (see "Subdocument 1, Terminology, and Subdocument 2, Basic Structure of the " for a discussion of the tag) would have to be identified with some sort of SET identifier so that discontiguous components of the set can be extracted or reassembled (EXTRACT principle), or all elements in a must receive an alpha-numerical identifier that will enable pointing from one location in a discontiguous to another (MOVE principle). It was also recognized that provision must be made for two types of referencing: pure adjacency (membership in a , direct reference to the ) and embedding (which generally implies a second level of association -- information used to modify an element which in turn is associated with the ). The group considered three specific proposals for defining pointers in the flat : 1) Shreve proposed the use of a number tree method, i.e. a system of domain style addressing to identify all elements making up the SET that constitutes the . The domain address would consist of three identifying strings in the pattern x.y.z, where x = the identifier y = the identifier z = the individual element identifier AI7 would not dictate the specific form that this system might take. For instance, the identifier of a given element might consist of the identifier, the identifier, and a sequentially or arbitrarily assigned sequence. Example: ------- *MEDAIDS12345.1.def1* would constitute a unique identifier for the first definition associated with 1 for the sample shown in Subdocument 3, page 3. 2) Sperberg-McQueen proposed the use of two attribute identifiers that would flag data elements according to: 1) the with which they are associated ( - accounts for adjacency) (SET concept) 2) the specific tag element with which they are associated ( - accounts for embedding) (MOVE concept) After much discussion, this was the system adopted as a final position. 
See "AI7 Subdocument 3: Links" for a complete discussion of this option.

3) Melby proposed the third option which involved using a single coordinate pair to link to data elements. In this system, each target element would have a unique name identified using the n= attribute, which could be referenced using two pointer attributes, coord (for adjacency) and depend (for embedding). This system is based on the MOVE concept.

The group analyzed the advantages and disadvantages of the three systems:

     Advantages                          Disadvantages

1.   Uses only one attribute             Imposes domain-style number tree

     No restrictions with respect        Requires full normalization to
     to numbers and characters           assign the number tree, which is
                                         not possible in the flat

     Easy to understand

2.   No "magic numbers"                  Requires two attributes

     Structurally simpler than           Potentially confusing because
     number tree                         both the SET (*group*) and MOVE
                                         (*depend*) mechanism exist in
     Easier to teach, therefore          the same TEI.TERM document
     more accessible to potential        fore the

     Can actually be applied to the
     non-nested structure without
     preconstructing the nested

3.   Uses only one mechanism             Reifies the normalized tree

     No restrictions on the element
     name

As noted, the consensus of the group was that there were more advantages to be gained from solution 2 than from the other options. In effect, solution 2 represents features of both 1 and 3 because it incorporates the SET (group) concept to accommodate adjacency and the MOVE (depend) concept to accommodate embedding.

Items 5 & 6
------------

The WG recognizes the concern that the actual tags included in the DTD must be limited to a basic level of abstraction and that the data categories included in the empirical study (Wright/Budin, AI7 W-11) should be recognized as generic identifiers (GIs) that will appear as attribute values of the *type=x* attribute in TEI tags.
By defining a limited basic set of tags in the DTD and publishing an open-ended Joint List of recommended attribute values, it will be possible to add attribute values as needed to satisfy the requirements of specific applications. The group agreed that the attribute values list would appear in a non-hierarchical form and that there would be no attempt made to restrict attribute values to specific tags because many attribute values tend to behave differently depending upon the structure of the system in which they are used.

L. Wright observed that the is composed essentially of three components: terms, descriptive material related to the terms, and administrative data. [Coincidentally, this observation conforms to Sager's three primary data categories.] The WG agreed in principle, noting that this simple list needs some expanding to utilize or accommodate several TEI characteristics. The resulting list included:

     terms
     otherForms (which are themselves terms)
     description (which refers to the term)
     floating description (which refers to some other element in the )
     administrative data
     specific TEI tags that can be incorporated into TEI.TERM

See "Subdocument 4: TEI Data Categories: Structure, Tags, Attributes and Attribute Values" for a detailed list of TEI tags and attribute values.

The committee must prepare a list of recognized attribute values with their definitions. The draft level of this list can be generated from the existing KSU data categories terminology file. The definitions contained in that file are strictly ad hoc formulations created for research purposes. The WG will have to subject them to a formal review and revision process. Agreement on the meaning of the attribute values is very important for purposes of interchange.
The list of attribute values is open-ended, but the addition of synonymous attribute values is to be discouraged at the interchange level; local applications remain completely free to use whatever data category names they deem fit. Export routines should account for aliasing to existing attribute values wherever possible.

Review of Assignments List from the Cleveland Meeting
-----------------------------------------------------

Assignments for June-July-August
--------------------------------

A. Write sample TEI.TERM documents using tentative      Wright & Budin
   tag set

Budin reported that a few sample TEI s had been generated and that he and Wright would continue this effort. Sample s will be included in "Sub-document 7".

B. Consider character sets and reversible               Melby
   transliteration
   Contact TR1

Item 8
-------

Melby reported that it was currently essential that documents be restricted to ISO 646 because of the prevalence of systems that use either exclusively lower ASCII or EBCDIC. He stressed that:

   * Transliteration must be reversible.
   * SGML % entities will probably be used.
   * UNICODE and ISO 10646 16 bit representation schemes for multilingual characters are interesting solutions for the future, but we must cope with existing systems for some time to come.

Melby noted that the problem of reversibility appears to be resolved for non-oriental languages. Melby will provide a more detailed summary of ISO 646 and the lang attribute, which will become "Sub-document 5".

C. Write formal DTD using tentative tag set             Shreve

Shreve reported on the DTD, which does parse. He will be revising the DTD to conform to decisions reached during this meeting. This DTD will become "Sub-document 6".

D. Test sample TEI.TERM documents and DTD using         Shreve
   an SGML parser.

See above.

E. Get feedback on sample files from many TDBs          Wright & Budin

Budin reported that he and Wright will contact users and operators of terminological databases (TDBs) to solicit sample s for the purpose of examining as many different system types as possible.

F. Invite numerous groups to create conversion          Wright & Budin
   software to convert sample files to and from
   TEI.TERM-conformant documents

Ditto above.

G. Refine tag set and DTD                               Shreve; WG

Shreve will revise the DTD as noted above. Wright will generate a list of tags, attributes and attribute values in conjunction with the writing of the Oak Ridge minutes. She and Shreve will supervise the creation of a draft list of tags, attributes and attribute values together with definitions, which should be ready by mid-October.

H. Write final report                                   Wright; WG

The committee identified the following reports and papers that must be completed, together with their respective deadlines:

1. Preliminary version of an AI7 status report to be
   circulated in Sub-Committee 3 of ISO Technical
   Committee 37 on Terminology (ISO TC 37/SC 3)          26 Aug.-1 Sept.

2. Minutes of Oak Ridge Meeting                          1 Sept.

3. Formal paper to be authored by Melby and Wright       25 Sept.

This paper is for presentation at the International Symposium on Terminology and Documentation in Specialized Communication to be held in Hull, Ontario 7-8 Oct (presentation by Wright). The same paper will form the basis for Melby's presentation at the ASLIB conference in London in November. There will be a book printed as the proceedings of the ASLIB conference, in which this paper will appear.

4. Section for TEI Guidelines 2, which will be
   basically identical to the Hull/ASLIB paper, with
   minor modifications to reflect venue-specific
   concerns.                                             25 Sept.

5. Preparation of a more polished proposal for
   TC 37/SC 3.                                           15-16 Nov.

This proposal can again be based on the ASLIB paper, and preliminary versions can be provided to important SC 3 members prior to the November SC3 meeting.
The precise form of the SC 3 presentation can be finalized during the November meeting in Vienna.

6. Formal petition to continue the work of AI7 within
   the framework of the third round of TEI activities    ??

I. Create TEI stationery header in both MS-DOS          Strehlow, Wright
   and Macintosh format

(Purpose: to avoid identification of AI7 with any of the respective members' parent organizations, which could impair the acceptance of the TEI format by a broad range of terminologists)

Strehlow reported on his design and presented the members with disks. The group registered its collective gratitude for his efforts.

Proposed plans for the Vienna meeting 1991-11-15/16
---------------------------------------------------

The next meeting of AI7 will take place in Vienna on 1991-11-15/16. Strehlow will be unable to attend, and Melby will petition TEI to cover expenses for Shreve in his stead. Melby will have to attend the first two days of a four-day meeting in Bergen, Norway on the official days of the AI7 meeting. Budin, Wright and Shreve will use the meeting time to review work on the various assignments and to prepare presentations for the TC 37/SC 3 meeting to be held in Vienna on 1991-11-18/19. Melby will join them on the evening of 11-19.

The group strategy for the SC 3 meeting will be to present the TEI project as an interchange format based on ISO 8879-1986 (SGML). In this regard, the SGML solution would exist as a parallel option to the existing MATER standard (ISO 6156), which is based in part on ISO 2709. The SGML format should not be touted as a replacement for ISO 6156, but as a second part designed to meet needs unaddressed by the MATER standard. There should be no effort to retain reference to MATER in the name of the SGML solution because there is no relationship between the actual title of that standard (with its reference to magnetic tape) and the electronic environment for which TEI.TERM is being designed.
Respectfully submitted, Sue Ellen Wright ****************************************************************************** Addendum to the Minutes of 1991-08-13/14 1. 2 and 3 letter language codes In telephone consultation Wright, Budin and Melby have agreed on the following position concerning ISO 639, parts 1 and 2. First of all, it must be noted that many systems exist that use language codes that differ from *either* the two or the three letter code. It cannot be the province of AI7 to suggest what local applications may do within the confines of their own specific working environments. However, Wright, Budin and Melby are agreed that exporters should structure their export routines so that the flat s produced by the initial conversion will reflect either the two or the three letter code as standardized in ISO 639 Part 1 or Part 2. Melby and Wright are agreed that AI7 should make a strong recommendation to TEI Working Group TR1 that all instances of language identification that occur in TR1 literature or in the WSD format itself should also conform to ISO 639. The rationale behind this recommendation is that Melby, Wright and Budin contend that TEI practice should make use of existing standards whenever possible so long as these standards do not conflict or interfere with the efficiency of the evolving TEI system. Because both the two and three letter codes are so widespread already, the three members of AI7 are of the opinion that this request can be made of TDB users and developers without inciting reluctance on their part to comply. 2. 
Reversibility of s

The following statement was included in the first version of the minutes:

     Sperberg-McQueen also pointed out that one could convert a flat to the universal interchange format (normalized, nested form) and back again with no or little loss of data or position, whereas even if AI7 designs a mechanism for converting a flat to the universal interchange format, it will be impossible to automatically reproduce the structure of the flat .

When Melby and Wright tried to reconstruct what was actually being said here, they decided that the statement was too inexact to codify in the minutes, although it does definitely lead to an interesting series of observations. Melby and Wright both doubt that it would be possible to reconstruct the original per se, even if it did resemble the nested form, because there is no precise imposition of order within the beyond the fact that the term must introduce the and certain elements have to be embedded in other elements.

Wright went on to point out, however, that on the import side the construction (or conceivably reconstruction) of prespecified order is absolutely necessary. She visualizes a generic export routine that would consist of a combination layout and filter template that would

     1) specify the order that certain data categories (tags, attributes and attribute values) would assume in the target local application format

     2) specify which set of data categories from the source system should be imported to the target local application format.

For instance, if one were exporting from a very rich system that specifies many specific data categories to a relatively simple one that uses only a limited number of categories, one would want first to filter out all those items for which there was no data category within the target system, or one might want to lump certain data categories together in a "note" or other "soup" type data category.
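
The combination layout and filter template described above might be sketched roughly as follows. This is an illustration only and not part of the AI7 proposal: it is written in Python, it assumes that a flat entry can be modelled as a simple list of (data category, value) pairs, and the function name apply_template, the category names and the sample values are all hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: a flat entry is modelled as (data category, value) pairs.
Entry = List[Tuple[str, str]]


def apply_template(entry: Entry, order: List[str],
                   lump_into: Optional[str] = None) -> Entry:
    """Reorder and filter a flat entry for a target application.

    order     -- data categories the target system accepts, in target order
    lump_into -- if given, categories unknown to the target are collected
                 into one field of this name; otherwise they are dropped
    """
    # Layout step: emit accepted categories in the order the target expects.
    result = [(cat, val) for wanted in order
              for cat, val in entry if cat == wanted]
    # Filter step: collect everything the target has no category for.
    leftovers = [(cat, val) for cat, val in entry if cat not in order]
    if lump_into is not None and leftovers:
        note = "; ".join(f"{cat}: {val}" for cat, val in leftovers)
        result.append((lump_into, note))
    return result


# Hypothetical entry exported from a "rich" source system.
rich_entry = [
    ("term", "sample term"),
    ("grammar", "noun"),
    ("definition", "sample definition"),
    ("source", "sample source"),
]

# The hypothetical target system knows only two categories plus a note.
print(apply_template(rich_entry, ["term", "definition"], lump_into="note"))
```

With lump_into left unset, the categories unknown to the target are simply filtered out; with it set, they are gathered into a single "soup" field in the sense described above.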
Once this initial conversion pass had created a document that closely parallels the structure of the target local application format, an actual proprietary conversion routine could be run in order to convert the data stream to the native mode truly compatible with the target system. 3. Language of a "tig" or a Language Section within a Wright observed in a FAX message following the Oak Ridge meeting that although there had been discussion of the use of the *lang* attribute to identify the language of a term or even to refer to specific elements within a , no decision had been made concerning conventions to identify the language of the itself. Nor has any system been identified for sorting all the information contained in a that could be defined as constituting a language section in the (assuming, for instance, that more than one term in a given language occurs in the ). Melby pointed out that although it has been noted that the lang attribute cannot be used as a target for pointing purposes, it can be used as a sorting attribute by a conversion routine that would assemble the components of a or of a language section. Nonetheless, it would be necessary for the sorting routine to "go out" to the WSD to find the actual language associated with the language code included in the elements of the . One expeditious way to simplify this problem might be to require that the first element in any WSD be either the two or the three letter language code followed by a .-digit (example: scr.1 for one of the two writing systems used to represent Serbo-Croatian) to indicate the specific character set used. Thus little time would be lost in referencing the WSD, but such a scenario would demand that the WSD in question be resident in the system at the time of conversion in order to implement such a search. Another option would be to create a *tigLang* attribute that would be generated by the import routine at the time s are assembled in the normalized . 
Although this method necessitates the assignment of an additional attribute, it has the advantage that all further processing could take place without having to reference the WSD, which in some cases may not even reside within the system at the time of further processing. A third option suggested by Melby would be to use the existing *languageCode* attribute, which is currently one of the WSD attributes, within the TEI.TERM . Wright observed that this might conceivably lead to problems in the event of inconsistencies, which Melby acknowledged as a potential difficulty, while noting that the problem of potential inconsistency exists in many other instances as well. Wright and Melby noted mutually that the inheritance properties that are native to TEI can be used to generate the *tigLang value*, assuming that the tag is appropriately identified by a lang attribute and implying by virtue of inheritance from below that the language of the is equivalent to the language of its term unless otherwise indicated. It is also possible that the language of an entire monolingual terminology document could be indicated in front matter and thus implied in every by virtue of inheritance from above. Any term elements that vary from the stated inheritance principles would have to be explicitly identified with their own *lang* attributes. Exporters will have to provide for the identification of the *tigLang* either explicitly or using one of the implicit methods described above. In fully normalized s, the *tigLang* attribute would have to be explicit. Respectfully submitted, Sue Ellen Wright