Comments on TEI AI7 W3 MicroMater: A proposed standard format ...

C. M. Sperberg-McQueen

Document Number: TEI AI7W3C1
May 29, 1991 (14:38:54)
Draft May 29, 1991 (14:38:54)

First of all, congratulations to your work group on this good paper; thank you very much for your hard work! These notes point out some minor issues which you are probably already planning to take into consideration, and mention some other work within the TEI which is relevant to yours.

1 FILES IN TEI INTERCHANGE FORMAT

SGML and TEI do not have the concept of a file. There are two basic units: the SGML document, which has an SGML declaration, a document type declaration, and a marked-up document content; and the entity, which may correspond to a file in some cases or to a string variable in others.

As part of TEI interchange, it is expected that users will 'pack' their SGML documents for exchange by (a) gathering all the entities that make up a document into a single file for transmission and (b) ensuring that the transmission package contains only characters which can be transmitted safely. (This packing is not yet formally described in any TEI documents; I'll send you a relevant ML document when the draft is completed and the committee agrees to distribution within the TEI.)

The implications for terminology exchange are not entirely clear. If people prefer to view their authority information, bibliographic data, etc. as logical parts of one big thing, then they might well prefer to say they are all logically part of one document. The document itself might have just a single top-level file like this:

     ]>
     ...
     &bib.data;
     &auth.data;
     &terms1;
     &terms2;
     &terms3;

If on the other hand people prefer to think of the authority information, bibliographic information, etc. as separate documents which each are part of a system, one could of course ship the data as a series of distinct documents.

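As a rough illustration of the single-document approach described above, a top-level file could declare each component as an external entity and then reference the entities in order. This is only a sketch under stated assumptions: the entity names are the ones used in this note, but the document type name 'termbase', the DTD file name, and the entity file names are hypothetical and are not taken from any TEI draft.

```sgml
<!-- Sketch only: the document type name, the DTD file name, and the
     entity file names are hypothetical placeholders. -->
<!DOCTYPE termbase SYSTEM "termbase.dtd" [
  <!ENTITY bib.data  SYSTEM "bib.sgm"    >
  <!ENTITY auth.data SYSTEM "auth.sgm"   >
  <!ENTITY terms1    SYSTEM "terms1.sgm" >
  <!ENTITY terms2    SYSTEM "terms2.sgm" >
  <!ENTITY terms3    SYSTEM "terms3.sgm" >
]>
<termbase>
&bib.data;
&auth.data;
&terms1;
&terms2;
&terms3;
</termbase>
```

Under the second approach, each of the five components would instead carry its own document type declaration and be shipped as a document in its own right.
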
The definition of a document is in the mind of the user; unless there is a compelling reason to recommend one way or the other, I think we should probably leave the choice between these two approaches to the user.

2 NAMES OF TAGS

The metalanguage committee has just finished a document on the naming of TEI tags and attributes; Wendy will send it to you shortly. Most of your names are fine and consistent with existing practice; the new document would require changes only where it recommends changes in existing practice.

The ML paper also specifies a unified treatment of case: lowercase all one-word names and one-letter-per-word abbreviations, uppercase only the first letters of words or partial words in multi-word names. (I mention this only for your reference not as a criticism of this paper-- there is no way for you to have seen this rule coming!)

3 CONCEPT-LEVEL INFORMATION

It looks as though concept ID and domain might be expressible as SGML attributes; the ID, at least, seems to me more like an attribute than like a subordinate element. If you have had discussions about this, I'd be interested in hearing what you think. (Of course this is really a question for this summer as you work on the DTD fragments. For now thinking of these as elements is fine.)

How likely is it that a given document might have concept IDs or domain specifications from more than one taxonomic system? (If a declaration is provided for specifying what system an ID or domain name is from, will it be appropriate on the document level or on the record level?)

4 CONSTRUCTION OF ENTRIES

The overall structure of entries seems very clear; I can write a context-free grammar for these records from the paper. Until, that is, I reach the terminology units and the descriptions. Which of these is the definition? (I use a postfix '*' to indicate 0 or more occurrences.)

     termAssignment ::= (termUnit, description*)*
     termAssignment ::= termUnit*, description*

The first says that term assignments contain arbitrary numbers of term-unit-plus-description-sequence pairs. The second says that term assignments contain a term unit sequence and then a description sequence. Perhaps I've misread, too, in assuming that the pair or the term unit is starred. Perhaps the rule should be:

     termAssignment ::= termUnit, description*

which is probably a better structure anyway.

5 SECONDARY FIELDS

Here the issue of attribute vs. element encoding occurs again; left to my own devices I would probably have defined entry status and language symbol as attributes. This has the pleasant side effect of allowing a convenient restriction on their values to items from a defined list. I'm not sure from your examples which way you are leaning; DATE is given as a tag under 'Freely repeatable data categories' but as an attribute in the next example.

In a few cases, I think you can use tags from the existing draft; Occam's razor says if you can, you probably should.

*  Cross references should almost certainly use the tag defined in P1 section 5.7; if it doesn't behave the way you need it to, it will have to be changed.

*  Notes can and should use the tag.

*  The administrative data you mention should probably be put into the pot with other similar items to produce some fairly general set of tags for administrative information. (It is relevant, for example, for all sorts of office documents, and may also be relevant to any large project where many people work on the same documents.)
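
To make the attribute encoding suggested in section 5 concrete, the following declarations sketch how entry status and language symbol could be declared as attributes, with the status restricted to a defined list of values. Everything here is an assumption for illustration: the element name 'entry', the attribute names 'status' and 'lang', the three status values, and the placeholder content model are hypothetical and do not come from the AI7 W3 paper or from any TEI draft.

```sgml
<!-- Sketch only: element name, attribute names, status values, and
     the placeholder content model are all hypothetical. -->
<!ELEMENT entry - - (#PCDATA) >
<!ATTLIST entry
          status (draft | approved | obsolete)  #IMPLIED
          lang   NAME                           #IMPLIED >
```

An instance would then read <entry status="approved" lang="en"> ... </entry>, and a validating parser would reject any status value outside the declared group. The same restriction is not available if the status is encoded as a subordinate element with character-data content.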