ML W1 C MS 1

Notes on draft syntax document "Guidelines for Writing SGML DTDs and Instances of SGML DTDs" (ML W1) - CMSMcQ, 7 Feb 89

0. General

I believe this document addresses some relevant areas, but it's not quite the kind of syntax document I believe we need. I would like to suggest some changes in focus, as well as in style and substance. Perhaps this draft should become a completely separate document for the use of those committee members, if any, deputed to write out DTDs formally, and the "basic syntax document" we're looking for just take over parts of this draft.

We need a document that defines as clearly and concisely as possible the legal syntax for our tags, with a little justification and, if we think it useful, a little discussion of design goals. Restrictions on DTDs are part of this, but probably should not constitute the main focus of discussion.

1. Introduction

Even with respect just to the syntax, I think this account of committee ML's responsibilities is skewed slightly. The fundamental, single responsibility of ML with regard to syntax is to define a syntax for the TEI encoding scheme. Period. That syntax should:

1 suffice for an intellectually serious modeling of texts in machine-readable form
2 conform to ISO 8879 if at all possible, in order that texts encoded with the TEI scheme will be processable by SGML processors developed by whomever, for whatever purposes
3 be as simple and regular as possible, consistent with the first two points, so that the scheme presents no needless obstacles to parsing and tokenizing

I believe adequacy for research (and this includes the human interface) must take precedence over the convenience of software developers. The syntax should be clearly defined, but I don't think the average linguist or literary scholar will give two hoots whether it's LALR(1) parseable or not, while they will give a lot more than two hoots if end tags are required even where their omission would be unambiguous.
It is the syntax of the TEI scheme and its requirements that the ML committee must focus on and which the syntax document must discuss, not SGML. Point 2 says they should be compatible, but they need not and probably won't be identical. SGML we can and should just take as a given; critiques are pointless. We can offer suggestions to ISO TC 8 if we want, but they don't belong in this document.

2.1 Use of Existing Software Tools

Am not sure why this section is present. We are not responsible for SGML, so why do we care whether it follows established conventions or not? Our own syntax may be another matter, though I'd like better reasons than the established conventions of computer science.

About machine processability -- I think the draft takes too protective an attitude toward the machine and the programmer. Yacc is a fine tool and I like it. But I'm not willing to concede that if a language can't be handled by yacc, it's not "amenable to processing by computers," which is what seems implied here. We can talk, if we want, about making sure the scheme is "easy to process". But not, in these terms, about "processable." As a denizen of an academic computer center, I find the identification of "computing" with "computer science" pernicious.

[I perceive that I diverge from NI on this section. She feels it needs to be extended and explained; I am inclined to eliminate it. If it is to be kept, read "software-generation tools" for "software tools". OCP is a tool, and it's software, but it's not the topic here.]

2.2 Some Implications of using computer-processable SGML

I'm not sure I agree with the concepts here. Or it may just be the wording that is throwing me off.
If we take as a given that machines and humans may prefer different views of the data, then either the machine view is the official view and the human views are produced by running the data through filters, or else the official syntax does allow the humanized view but we recommend, for pragmatic purposes, that the data be run through various filters (as later described) before interchange or before processing by some software, to ensure proper parsing.

I dislike the former approach (full-form the only legal form) because it means filters are going to be *required* for people to look at TEI-encoded files. And the people with the most need for such filtering (technical novices) will be those with the least prayer of finding them or building them. Also, on principle I prefer to favor the needs of humans over the needs of the machinery. (I write software myself, so I'm not just trying to throw all the problems in the other fellow's yard.)

I'd vote to delete the paragraph about computer-science expertise. It sounds rather as though we think humanists can't be trusted to work with sharp objects without supervision.

3 Guidelines

First para - no sources are cited above at all.

first point (no shortrefs, no omittag). I'm very skeptical here. In many people's view, short refs are a saving grace of SGML, making plausible a scheme otherwise far too elaborate for normal use. I understand from Barnard et al. in the current CHum that there is a nasty interaction between short refs and the CONCUR feature. But I believe that was fixed in the Amendment (or am I wrong?). I don't think shortrefs will be a big part of the TEI project, if only because I think it is better to exhibit the tags, not hide them. But I am not yet persuaded we should eschew them entirely.

OMITTAG seems to be an essential feature if a file is going to be readily typed or readily read.
And no matter how much we say one should never have to type naked tags on a keyboard, that is what many people will do, for a long time or until typical commercial word processors have an SGML-tag feature (a longer time).

In any case, I am inclined to say: if a useful feature of SGML makes things harder for current tokenizer- and parser-generators, then so much the worse for the generators. They are supposed to make my life easier, not I theirs.

I also don't completely understand the logic here: if a filter can be written to insert the omitted tags, why is it so hard for a processing program to infer them? And how will the filter know which tags to supply, if no DTD tells us which tags are omissible? And if we want to separate tag-inference from full parsing, why not simply recommend the use of a filter (specification given in an appendix) before submission of a text to a processor?

second point (avoid attributes). We have some discussion to do here. Obviously we can't distribute a fundamental design document that says attributes should be used in preference to model groups under many circumstances, together with a syntax document that says the opposite.

The sentence "Tags already provide a mechanism for specifying all hierarchical structure in a document" is correct, but does not lead to the given conclusion. Not all information is hierarchical: for information about a given element of a text which is not *part* of that element, attributes seem to me the most natural kind of convention. The title of a chapter is arguably a delimited, named part of its content. Its ordinal number, on the other hand, is an attribute of the entire chapter, not a special part of its content. If I write "Chapter 2" above the third chapter of my book, that doesn't make it the second chapter.
If some of my figures should have boxes around them (to signal the art department that a paste-in is required) and some should not (because no art is required), then "box=yes" and "box=no" are features, not portions, of the figures. And so on. It depends, I suppose, on what you mean by "must be specified outside of the hierarchical structure of the DTD" -- but if you believe the AAP tag set does things right, then we disagree fundamentally.

The argument about the difficulty of tokenizing and parsing attributes puzzles me, because you agree that one has to be ready to parse attributes anyway, if only for cross references. Well, if you've got to be able to parse attributes anyway, why not use them? Certainly it's a simple mechanical filter to take any attributes associated with a tag and re-write them as normal tags immediately following their parent, if one has to deal with a processor that can't handle attribute strings very well. In which case I don't see the advantage of prohibiting attributes except where absolutely necessary. I would like to know the opinions of other members of the syntax group on this point.

third point (inclusion/exclusion prohibition). OK, although it seems awfully inconvenient, when you get down into mutually nestable but not self-nesting text elements. (What can a paragraph contain? lists of various types, quotations of various types, figures and tables of various types, highlighted phrases of various types, notes of various types. What can a figure contain? Exactly the same things, except for more figures and tables. What can a quote contain? Exactly the same thing, including more notes. ... But I can't use an entity reference for the group and modify it with an exclusion? sigh ...) This prohibition, that is, seems diametrically opposed to the next rule (keep the model groups clearly legible).

modular design. I'd be happier if this were phrased in SGML terms (element declarations, model groups, ...)
rather than abstract formal-language terms. I don't know what you are counting as a "production" in the context of DTDs, or what SGML term corresponds to "subgrammars".

concur feature. This seems a good rule to begin with, but the examples need to be better phrased; a typical problem will involve two distinct types of structural information.

4 explicit value indicator, quoted values. This seems plausible to me; should single quotes be optional for attribute values containing no blanks? Are we prepared to allow double quotes instead? Open and close quotes? Guillemets?

white space. for "outside the angle brackets" read "inside the angle brackets", I think. This is obviously less problematic if attributes are minimized than if they are extensively used. I'm not sure I understand the reasoning; is white space so hard to handle? Also, in the last sentence for "encoding" read "processing" -- or else I don't understand the paragraph at all.

I'll try to make some more constructive suggestions, including alternative wordings, if we can agree on some of the substantive issues here.

Michael

---

Received: from QUCDN.BITNET by UICVM (Mailer R2.02) with BSMTP id 5371; Wed, 08 Feb 89 09:37:56 CST
Received: from QUCDN by QUCDN.BITNET (Mailer R2.02) with BSMTP id 7087; Wed, 08 Feb 89 10:29:14 EST
Received: from qusunl.qucis.queensu.ca by QUCDN.QUEENSU.CA ; Wed, 08 Feb 89 10:29:12 EST
Received: by qusunl.qucis.queensu.ca (3.2/SMI-3.2) id AA00975; Wed, 8 Feb 89 10:30:23 EST
Date: Wed, 8 Feb 89 10:30:23 EST
From: barnard@qucis.queensu.ca
Message-Id: <8902081530.AA00975@qusunl.qucis.queensu.ca>
To: Lou_Burnard, Michael_Sperberg-McQueen, Nancy_Ide, Sandra_Mamrak
Subject: Comments on SM's First Draft of SGML Guidelines

I've read the draft and MS-McQ's comments on it. In general, my reactions were similar to MS-McQ's. (you need a simpler set of initials, Michael!)

First, I'll transcribe my marginal notes from the draft itself.
"The Use of Existing Software Tools" seems too narrowly-focussed to me. I agree that there are tools that can determine the lexical and context-free syntactic structure (which is what I'd prefer to refer to, rather than "language", if we end up making comments along these lines at all), but do we need to say that in this document? I think it was while I was reading this section that I realized I wasn't really sure of the purpose of this document, or, perhaps more correctly, how widely it will be disseminated. In the second point under "Some Implications ..." I'm not sure this statement about computer science expertise is quite right as it stands. I agree that having this expertise available as part of TEI is good, but surely we can use pieces of software to check that DTDs are machine processable--we just feed a DTD to an SGML parser (which at least some of us have access to) and see what happens. I think the value of a computer science perspective is at a deeper level: perhaps to influence the TEI to use appropriate tools in appropriate, effective, efficient ways. Anyway, this objection is much too long relative to the sentence I'm objecting to. I do think, though, that saying it just the way it's said may not endear us to all of our collaborators in TEI. Short references and omitted tags: Hm...I think any implementation worth its salt will handle these, even though they're optional features. (We should probably agree not to make gratuitous criticisms of the SGML standard, but in this case it might be appropriate: I believe some of the required features are harder to implement than some of the optional ones; I suspect this means that implementors will not necessarily decide to draw a boundary between what they implement and what they do not implement exactly at the same place that the standard draws a boundary between required and optional features. 
This, in fact, is probably a result of the SGML designers NOT having had enough computer science expertise in their project!) And, even if SGML implementations do not allow these, it's surely easy to have special purpose filters that insert tags and expand references (at least, in all but pathological cases, I think).

Attributes: I think they're quite useful. Restricting the lexical structure seems fine, but establishing a bias against using them seems too limiting. They certainly require different processing, but it's no harder (is it?) than the other processing required, and you seem to allow them in some cases, so the software tools will have to deal with them in any case.

Inclusions and exclusions: I'm of two minds. These can be confusing, for sure, and I agree (if this is what the last sentence of the paragraph intends) that the equivalent "real BNF" form that arises from using them can be very large and complex. But I wonder if it would not be the case that most uses of them would be such that 1) the user really thinks in these terms, 2) the use is quite simple, and 3) the formal BNF equivalent is not too ugly.

Numeric constraints: Admittedly I work with grammars a great deal, but 20-30 productions and 4-5 levels of depth seems pretty simple to me. Book contains body contains section contains chapter contains subsection contains list contains figure; this is already 7 levels but it's not very confusing, is it? I really don't know whether I want to object to having numbers here at all, or just to numbers that seem to me quite small.

Multiple encodings: A thorny issue, as discussed in our CHum paper. The paragraph seems to imply it's simpler than it is. I think it's possible to use two different DTDs and two sets of tags, but CONCUR remains problematic, I think.

Now a couple of comments on MS-McQ's comments. I won't point out the several places where we agree, save in one or two instances where I'd like to emphasize the agreement.
Hoots, LALR, etc.: I agree absolutely. When we technical types tell less technical types that the technology can't do things that are clearly easy, they have a right to be upset.

Disliking full form as the only legal form: yes, same reason.

Lexers and parsers automatically generated making life easy: I think the real issue is that things that are obviously easy should not be disallowed. I believe a convincing argument can be made along the lines "LALR is good for you because the restrictions it imposes are restrictions that your limited brain power cannot often usefully transcend anyway" but I don't think this same paternalistic (and I do not mean it's bad--I make this argument to myself and to my students all the time) kind of argument can be successfully made wrt omitted tags, for example.

From IDE@VAS780.BITNET Wed Feb 1 19:57:01 1989
Received: from RELAY.CS.NET by bucatina.cis.ohio-state.edu (3.2/2.890120) id ; Wed, 1 Feb 89 19:56:45 EST
Message-Id: <8902020056.AA02878@bucatina.cis.ohio-state.edu>
Received: from [128.228.1.2] by RELAY.CS.NET id aa09326; 1 Feb 89 17:59 EST
Received: from VAS780.BITNET by CUNYVM.CUNY.EDU (IBM VM SMTP R1.1) with BSMTP id 6482; Wed, 01 Feb 89 12:56:57 EST
Date: 1 FEB 89 12:17-EST
From: IDE%VAS780.BITNET@cunyvm.cuny.edu
To: mamrak@cis.ohio-state.edu
Cc: IDE%vas780.bitnet@RELAY.CS.NET
Subject: SGML GUIDELINES
Status: RO

Sandra,

I am sorry to have taken so long to get back to you about the syntax document. Unfortunately, it would be ideal to modify it and distribute it to the steering committee for comment in order to finalize it *before* distributing it at the Feb. 18 meeting. I hope my delay hasn't rendered this impossible.

The document looks good in general and only needs, in my view, to be filled in in spots because people will be unaware of what is behind some of what you say, or will not have the expertise to understand what you say in the current form.
In the interests of time, I am going to give you my general comments, and if you have specific questions come back to me and I will elaborate.

Introduction: The first sentence will need some explanation and justification--that is, why have we decided to use SGML, why is it the obvious choice? Much of this can be drawn from existing TEI documents, but it should be made clear in the document (1) what we see the strengths of SGML to be (2) what we see as the potential drawbacks--which things are we aware of that SGML may not do well, or not do at all? (3) that we intend to examine SGML, knowing we could come up against unexpected problems, and what we intend to do in this event. Some of these questions are answered elsewhere in the document but could be outlined in the opening. This is the first thing people will ask about.

SGML requires DTD's but we *could*, in principle, develop tag sets without bothering to specify DTDs, could we not? If so we need to explain why we are concerning ourselves with DTDs.

Your reference to extended BNF notation will probably not be understood by all the audience. This should be explained briefly--the most important thing to make clear is what effect the constraints this notation implicitly imposes have on the structure of tag sets. You also treat this later on but the relation is not made explicit anywhere.

2nd paragraph: The term "TEI community" is not the right one for this context--"needs of computer-aided language analysis" and "undertaken in the field of computer-aided language analysis" might be better. It also seems worthwhile to make explicit the fact that the two goals of Committee 4 are not necessarily compatible, and how we seek to resolve conflicts that may arise from this.

Processing SGML with Computers: give the reference for the SGML standard in-line.

The Use of Existing Software: This section needs to be extended and explained in much greater depth. Both the topics of par.
1 and 2 are so familiar to you that I don't think you realize that most people within the TEI framework won't understand from this description. The first paragraph makes the major point that supports the approach to the problem that will be taken: it must be elaborated to constitute a viable argument. As it is, it is rather weakly stated, because you assume what you say is obvious. To humanities types, it isn't.

Some Implications...: The first par. here makes a big and important point. Emphasize it and elaborate on it. The second point is not entirely clear. Which kind of CS expertise? Where in the effort must this expertise exist?

Guidelines...:
item 2. note that it is the relation between two parts of a German verb *when they are not given contiguously within a sentence*.
item 3. explain inclusion and exclusion exceptions.
item 5. I may have a bad memory but I am not sure we decided that we would do anything that relies on the Concur feature. Am I wrong here? What happened to the idea of separate files etc.?

_________________

In general, I think that the paper needs elaboration and some connecting of ideas to make clear what the relations among parts are. Let me suggest this: make another pass at it as quickly as you can, then submit it to the entire steering committee for review. Michael and I are not too bad at modifying things for emphasis, connection, etc. (we have done a lot of this in working on the various proposals), and if you do the filling out first we can probably make any other minor stylistic/logical changes rather quickly. Maybe we can get this thing together by the 17th!

Thanks for your work. Again, come back to me if you want elaboration on any point.
Nancy

From mamrak Thu Feb 2 09:43:26 1989
Received: by bucatina.cis.ohio-state.edu (3.2/2.890120) id ; Thu, 2 Feb 89 09:42:07 EST
Date: Thu, 2 Feb 89 09:42:07 EST
From: Sandy Mamrak
Message-Id: <8902021442.AA03392@bucatina.cis.ohio-state.edu>
To: IDE%VAS780.BITNET@cunyvm.cuny.edu, mamrak@cis.ohio-state.edu
Subject: Re: SGML GUIDELINES
Cc: IDE%vas780.bitnet@RELAY.CS.NET, mamrak@cis.ohio-state.edu
Status: R

Nancy,

Thanks for your comments. They raise a major question in my mind that I believe must be answered before any revisions of the draft can be undertaken.

*** What is the target audience for the document? ***

OUR ASSUMED TARGET AUDIENCE

The focus of the draft is guidelines for writing DTD's and instances of DTD's (yes, you MUST have a DTD if you want to conform to sgml requirements; sgml requires not only that you specify a tag set, but also that you specify the rules for ordering the tags; hence the DTD). We assume that the readers of the guidelines would be those whose responsibility it is to write DTD's or to develop software environments for specifying instances of DTD's. Such people would minimally be already well-versed in sgml and optimally would have already written software to process it. This audience would likely have little interest in a justification of the choice of sgml, and should have no problems with any of the terminology or tightness of prose for which you are asking elaboration. In particular, phrases like `extended BNF', `inclusion and exclusion exceptions', and concepts like tokenization and parsing would be quite familiar.

YOUR (APPARENTLY) ASSUMED TARGET AUDIENCE

The readership that I infer from your comments is people from the humanities and linguistics who have little detailed knowledge of sgml and no experience with software development.

CONFLICT TO BE RESOLVED

I think we need to agree on a targeted audience before we proceed. Likely, there is a need for documents for both types of readers.
And likely, the documents will be quite different (i.e., not one document to serve both). For example, for the humanities/linguistic audience, a brief tutorial on sgml, an elaborate justification of its choice, a tutorial on issues that arise when doing machine-processing and a pointer to a guidelines document of the type we've already drafted is probably the right approach.

PROPOSAL FOR THE FEBRUARY 17TH MEETING

I can revise the draft based on your comments, but targeting only the sgml/computer science readership. This would involve some minimal clarifications (e.g., regarding DTD's and the use of CONCUR), but not much of the elaboration you request. I could do this in the next week or so and forward it to steering. As a preamble to the draft, I can explicitly identify the targeted readership, discuss the need for another document for alternate readers, and promise one for the future.

Please let me know how you think we should proceed. Perhaps a phone conversation is most appropriate at this point. . .
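
The "simple mechanical filter" mentioned in the notes of 7 Feb (take any attributes associated with a tag and re-write them as normal tags immediately following their parent) can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not a filter any of the correspondents wrote: the function name attributes_to_tags and both regular expressions are hypothetical, attribute values are assumed to be double-quoted, single-quoted, or bare tokens without blanks, and comments, marked sections, short references, and values containing angle brackets are not handled.

```python
import re

# Hypothetical pattern for a start tag such as <figure box="yes" n=3>.
# Group 1 is the element name, group 2 the raw attribute string.
# End tags (</...>) and declarations (<!...>) do not match, because the
# character after "<" must be a letter.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)((?:\s+[^<>]*)?)>")

# Hypothetical pattern for one name=value pair. The value may be
# double-quoted, single-quoted, or a bare token without blanks.
ATTRIBUTE = re.compile(
    r"""([A-Za-z][A-Za-z0-9.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'<>]+))"""
)


def attributes_to_tags(text):
    """Rewrite each attribute as a child element right after its parent tag."""

    def rewrite(match):
        name, raw_attrs = match.group(1), match.group(2)
        children = []
        for attr in ATTRIBUTE.finditer(raw_attrs):
            # Exactly one of the three value groups matched; the others are None.
            value = next(g for g in attr.groups()[1:] if g is not None)
            children.append(f"<{attr.group(1)}>{value}</{attr.group(1)}>")
        # Emit the bare start tag followed by one element per attribute.
        return f"<{name}>" + "".join(children)

    return START_TAG.sub(rewrite, text)


if __name__ == "__main__":
    print(attributes_to_tags('<figure box="yes">art</figure>'))
```

Under these assumptions, a start tag carrying box="yes" is rewritten as a bare start tag followed by a box element containing yes, and tags without attributes pass through unchanged. A filter meant for real use would presumably also need the DTD, so that generated element names do not collide with elements the DTD already declares.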