ML W1 C MS 1

Notes on draft syntax document "Guidelines for Writing SGML DTDs and
Instances of SGML DTDs" (ML W1) - CMSMcQ, 7 Feb 89

0.  General

I believe this document addresses some relevant areas, but it's not
quite the kind of syntax document I believe we need.  I would like to
suggest some changes in focus, as well as in style and substance.
Perhaps this draft should become a completely separate document for the
use of those committee members, if any, deputed to write out DTDs
formally, and the "basic syntax document" we're looking for should
simply take over parts of this draft.

We need a document that defines as clearly and concisely as possible the
legal syntax for our tags, with a little justification and, if we think
it useful, a little discussion of design goals.  Restrictions on DTDs
are part of this, but probably should not constitute the main focus of
discussion.

1.  Introduction

Even with respect just to the syntax, I think this account of committee
ML's responsibilities is skewed slightly.  The fundamental, single
responsibility of ML with regard to syntax is to define a syntax for the
TEI encoding scheme.  Period.  That syntax should:

    1 suffice for an intellectually serious modeling of texts
         in machine-readable form
    2 conform to ISO 8879 if at all possible, in order that texts
         encoded with the TEI scheme will be processable by SGML
         processors developed by whomever, for whatever purposes
    3 be as simple and regular as possible, consistent with the first
         two points, so that the scheme presents no needless obstacles
         to parsing and tokenizing

I believe adequacy for research (and this includes the human interface)
must take precedence over the convenience of software developers.  The
syntax should be clearly defined, but I don't think the average linguist
or literary scholar will give two hoots whether it's LALR(1) parseable
or not, while they will give a lot more than two hoots if end tags are
required even where their omission would be unambiguous.

It is the syntax of the TEI scheme and its requirements that the ML
committee must focus on and which the syntax document must discuss, not
SGML.  Point 2 says they should be compatible, but they need not and
probably won't be identical.  SGML we can and should just take as a
given; critiques are pointless.  We can offer suggestions to ISO TC 8
if we want, but they don't belong in this document.

2.1 Use of Existing Software Tools

Am not sure why this section is present.  We are not responsible for
SGML, so why do we care whether it follows established conventions or
not?  Our own syntax may be another matter, though I'd like better
reasons than the established conventions of computer science.

About machine processability -- I think the draft takes too protective
an attitude toward the machine and the programmer.  Yacc is a fine tool
and I like it.  But I'm not willing to concede that if a language can't
be handled by yacc, it's not "amenable to processing by computers,"
which is what seems implied here.  We can talk, if we want, about making
sure the scheme is "easy to process".  But not, in these terms, about
"processable."

As a denizen of an academic computer center, I find the identification
of "computing" with "computer science" pernicious.

[I perceive that I diverge from NI on this section.  She feels it needs
to be extended and explained; I am inclined to eliminate it.  If it
is to be kept, read "software-generation tools" for "software tools".
OCP is a tool, and it's software, but it's not the topic here.]

2.2 Some Implications of using computer-processable SGML

I'm not sure I agree with the concepts here.  Or it may just be the
wording that is throwing me off.  If we take as a given that machines
and humans may prefer different views of the data, then either the
machine view is the official view and the human views are produced by
running the data through filters, or else the official syntax does allow
the humanized view but we recommend, for pragmatic purposes, that the
data be run through various filters (as later described) before
interchange or before processing by some software, to ensure proper
parsing.

I dislike the former approach (full-form the only legal form) because it
means filters are going to be *required* for people to look at
TEI-encoded files.  And the people with the most need for such filtering
(technical novices) will be those with the least prayer of finding them
or building them.  Also, on principle I prefer to favor the needs of
humans over the needs of the machinery.  (I write software myself, so
I'm not just trying to throw all the problems in the other fellow's
yard.)

I'd vote to delete the paragraph about computer-science expertise.  It
sounds rather as though we think humanists can't be trusted to work with
sharp objects without supervision.

3 Guidelines

First para - no sources are cited above at all.

first point (no shortrefs, no omittag)

I'm very skeptical here.  In many people's view, short refs are a saving
grace of SGML, making plausible a scheme otherwise far too elaborate for
normal use.  I understand from Barnard et al. in the current CHum that
there is a nasty interaction between short refs and the CONCUR feature.
But I believe that was fixed in the Amendment (or am I wrong?).  I don't
think shortrefs will be a big part of the TEI project, if only because I
think it is better to exhibit the tags, not hide them.  But I am not
yet persuaded we should eschew them entirely.

OMITTAG seems to be an essential feature if a file is going to be
readily typed or readily read.  And no matter how much we say one should
never have to type naked tags on a keyboard, that is what many people
will do, for a long time or until typical commercial word processors
have an SGML-tag feature (a longer time).

In any case, I am inclined to say:  if a useful feature of SGML makes
things harder for current tokenizer- and parser-generators, then so much
the worse for the generators.  They are supposed to make my life easier,
not I theirs.

I also don't completely understand the logic here:  if a filter can be
written to insert the omitted tags, why is it so hard for a processing
program to infer them?  And how will the filter know which tags to
supply, if no DTD tells us which tags are omissible?  And if we want to
separate tag-inference from full parsing, why not simply recommend the
use of a filter (specification given in an appendix) before submission
of a text to a processor?
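
A minimal sketch in Python of the kind of filter in question may make
this concrete.  It is an illustration under stated assumptions, not a
conforming SGML processor:  the tables CONTENT and OMISSIBLE_END are
hypothetical stand-ins for what a DTD would declare, the element names
are invented, and attributes, short references, and omitted start tags
are ignored.  Note that the filter cannot work without those tables,
which is the point of the second question above.

```python
import re

# Hypothetical stand-in for DTD content models: permitted children per element.
CONTENT = {
    "list": {"item"},
    "item": {"p"},
    "p": set(),
}
# Hypothetical: elements whose end tags the DTD declares omissible.
OMISSIBLE_END = {"item", "p"}

# Simplified tag syntax: no attributes, no short references.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def insert_end_tags(text):
    """Return text with every omitted (and omissible) end tag written out."""
    out, stack, pos = [], [], 0

    def close_top():
        # Close the innermost open element, if the tables allow it.
        top = stack.pop()
        if top not in OMISSIBLE_END:
            raise ValueError("end tag for <%s> may not be omitted" % top)
        out.append("</%s>" % top)

    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        is_end, name = m.group(1) == "/", m.group(2)
        if is_end:
            # Close everything still open inside the element being ended.
            while stack and stack[-1] != name:
                close_top()
            if not stack:
                raise ValueError("no open <%s> to close" % name)
            stack.pop()
        else:
            # Close open elements that cannot contain the new one.
            while stack and name not in CONTENT.get(stack[-1], set()):
                close_top()
            stack.append(name)
        out.append(m.group(0))
    out.append(text[pos:])
    while stack:
        close_top()
    return "".join(out)
```

Given "<list><item><p>one<item><p>two</list>", the sketch supplies the
two missing </p> and </item> pairs.  The same stack-and-table logic is
what a processing program would need in order to infer the tags itself.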

second point (avoid attributes).  We have some discussion to do here.
Obviously we can't distribute a fundamental design document that says
attributes should be used in preference to model groups under many
circumstances, together with a syntax document that says the opposite.

The sentence "Tags already provide a mechanism for specifying all
hierarchical structure in a document" is correct, but does not lead to
the given conclusion.  Not all information is hierarchical:  for
information about a given element of a text which is not *part* of that
element, attributes seem to me the most natural kind of convention.
The title of a chapter is arguably a delimited, named part of its
content.  Its ordinal number, on the other hand, is an attribute of
the entire chapter, not a special part of its content.  If I write
"Chapter 2" above the third chapter of my book, that doesn't make it
the second chapter.  If some of my figures should have boxes around
them (to signal the art department that a paste-in is required) and
some should not (because no art is required), then "box=yes" and
"box=no" are features, not portions, of the figures.  And so on.
It depends, I suppose, on what you mean by "must be specified outside
of the hierarchical structure of the DTD" -- but if you believe the
AAP tag set does things right, then we disagree fundamentally.

The argument about the difficulty of tokenizing and parsing attributes
puzzles me, because you agree that one has to be ready to parse
attributes anyway, if only for cross references.  Well, if you've got to
be able to parse attributes anyway, why not use them?  Certainly it's a
simple mechanical filter to take any attributes associated with a tag
and re-write them as normal tags immediately following their parent, if
one has to deal with a processor that can't handle attribute strings
very well.  In which case I don't see the advantage of prohibiting
attributes except where absolutely necessary.
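
The attribute-to-tag rewrite can be sketched in a few lines of Python.
This is an illustration only:  the function name attributes_to_tags is
invented here, the patterns accept a simplified attribute syntax
(name=value, with the value in double quotes, single quotes, or
unquoted), and no check is made that an attribute name does not collide
with an element name.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
# Attribute value, capturing: double-quoted, single-quoted, or unquoted.
VALUE = r'''"([^"]*)"|'([^']*)'|([^\s>"']+)'''
# Same alternatives without capture groups, for matching a whole start tag.
VALUE_NC = r'''"[^"]*"|'[^']*'|[^\s>"']+'''

ATTR = re.compile(r"(%s)\s*=\s*(?:%s)" % (NAME, VALUE))
START_TAG = re.compile(
    r"<(%s)((?:\s+%s\s*=\s*(?:%s))*)\s*>" % (NAME, NAME, VALUE_NC)
)


def attributes_to_tags(text):
    """Rewrite each attribute as a child element placed right after its parent's start tag."""

    def rewrite(m):
        parts = ["<%s>" % m.group(1)]
        for a in ATTR.finditer(m.group(2)):
            # Exactly one of the three value groups is not None.
            value = next(g for g in a.groups()[1:] if g is not None)
            parts.append("<%s>%s</%s>" % (a.group(1), value, a.group(1)))
        return "".join(parts)

    return START_TAG.sub(rewrite, text)
```

Applied to <figure box="yes" id=f1>A map</figure>, the sketch yields
<figure><box>yes</box><id>f1</id>A map</figure>; end tags and start
tags without attributes pass through unchanged.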

I would like to know the opinions of other members of the syntax
group on this point.

third point (inclusion/exclusion prohibition).  OK, although it seems
awfully inconvenient, when you get down into mutually nestable but not
self-nesting text elements.  (What can a paragraph contain?  lists of
various types, quotations of various types, figures and tables of
various types, highlighted phrases of various types, notes of various
types.  What can a figure contain?  Exactly the same things, except for
more figures and tables.  What can a quote contain?  Exactly the same
thing, including more notes. ...  But I can't use an entity reference
for the group and modify it with an exclusion?  sigh ...) This
prohibition, that is, seems diametrically opposed to the next rule (keep
the model groups clearly legible).
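
To see what is at stake, here is a minimal Python sketch of one shared
group modified by exclusions.  The names PHRASE_LEVEL and DECLS are
hypothetical, the element names merely echo the parenthesis above, and
the sketch assumes, as I understand exclusion exceptions, that an
exclusion holds for everything nested inside the element that declares
it, not only for its immediate children.

```python
# Hypothetical shared group, playing the role of an entity reference
# used in several model groups.
PHRASE_LEVEL = frozenset({"list", "quote", "figure", "table", "hi", "note"})

# Hypothetical declarations: element -> (shared group, exclusions).
DECLS = {
    "p": (PHRASE_LEVEL, frozenset()),
    "figure": (PHRASE_LEVEL, frozenset({"figure", "table"})),
    "quote": (PHRASE_LEVEL, frozenset()),
}


def allowed_in_context(open_elements):
    """Elements permitted inside the innermost open element.

    open_elements lists the currently open elements, outermost first.
    Exclusions declared by any open element stay in force further down.
    """
    group, _ = DECLS[open_elements[-1]]
    excluded = set()
    for name in open_elements:
        excluded |= DECLS[name][1]
    return group - excluded
```

With exclusions, three declarations share one group.  Without them,
"quote inside figure" and "quote inside paragraph" need different model
groups, and the declarations multiply accordingly.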

modular design.  I'd be happier if this were phrased in SGML terms
(element declarations, model groups, ...) rather than abstract
formal-language terms.  I don't know what you are counting as a
"production" in the context of DTDs, or what SGML term corresponds to
"subgrammars".

concur feature.  This seems a good rule to begin with, but the examples
need to be better phrased; a typical problem will involve two distinct
types of structural information.

4 explicit value indicator, quoted values.  This seems plausible to me;
should single quotes be optional for attribute values containing no
blanks?  Are we prepared to allow double quotes instead?  Open and
close quotes?  Guillemets?

white space.  for "outside the angle brackets" read "inside the angle
brackets", I think.  This is obviously less problematic if attributes
are minimized than if they are extensively used.  I'm not sure I
understand the reasoning; is white space so hard to handle?  Also, in
the last sentence for "encoding" read "processing" -- or else I don't
understand the paragraph at all.

I'll try to make some more constructive suggestions, including
alternative wordings, if we can agree on some of the substantive issues
here.

Michael

---

Date: Wed, 8 Feb 89 10:30:23 EST
From: barnard@qucis.queensu.ca
Message-Id: <8902081530.AA00975@qusunl.qucis.queensu.ca>
To: Lou_Burnard<Lou@Vax1.Oxford.AC.UK>,
        Michael_Sperberg-McQueen<u18189@uicvm.bitnet>,
        Nancy_Ide<ide@vassar.bitnet>,
        Sandra_Mamrak<mamrak@bucatina.cis.ohio-state.edu>
Subject: Comments on SM's First Draft of SGML Guidelines

I've read the draft and MS-McQ's comments on it. In general, my reactions
were similar to MS-McQ's. (you need a simpler set of initials, Michael!)

First, I'll transcribe my marginal notes from the draft itself.

"The Use of Existing Software Tools" seems too narrowly-focussed to me. I
agree that there are tools that can determine the lexical and context-free
syntactic structure (which is what I'd prefer to refer to, rather than
"language", if we end up making comments along these lines at all), but do
we need to say that in this document? I think it was while I was reading
this section that I realized I wasn't really sure of the purpose of this
document, or, perhaps more correctly, how widely it will be disseminated.

In the second point under "Some Implications ..." I'm not sure this statement
about computer science expertise is quite right as it stands. I agree that
having this expertise available as part of TEI is good, but surely we can
use pieces of software to check that DTDs are machine processable--we just
feed a DTD to an SGML parser (which at least some of us have access to)
and see what happens. I think the value of a computer science perspective is
at a deeper level: perhaps to influence the TEI to use appropriate tools in
appropriate, effective, efficient ways. Anyway, this objection is much too
long relative to the sentence I'm objecting to. I do think, though, that
saying it just the way it's said may not endear us to all of our collaborators
in TEI.

Short references and omitted tags: Hm...I think any implementation worth
its salt will handle these, even though they're optional features. (We
should probably agree not to make gratuitous criticisms of the SGML standard,
but in this case it might be appropriate: I believe some of the required
features are harder to implement than some of the optional ones; I suspect
this means that implementors will not necessarily decide to draw a boundary
between what they implement and what they do not implement exactly at the
same place that the standard draws a boundary between required and optional
features. This, in fact, is probably a result of the SGML designers NOT
having had enough computer science expertise in their project!) And, even if
SGML implementations do not allow these, it's surely easy to have special
purpose filters that insert tags and expand references (at least, in all but
pathological cases, I think).

Attributes: I think they're quite useful. Restricting the lexical structure
seems fine, but establishing a bias against using them seems too limiting.
They certainly require different processing, but it's no harder (is it?)
than the other processing required, and you seem to allow them in some cases,
so the software tools will have to deal with them in any case.

Inclusions and exclusions: I'm of two minds. These can be confusing, for
sure, and I agree (if this is what the last sentence of the paragraph intends)
that the equivalent "real BNF" form that arises from using them can be very
large and complex. But I wonder if it would not be the case that most uses
of them would be such that 1) the user really thinks in these terms, 2) the
use is quite simple, and 3) the formal BNF equivalent is not too ugly.

Numeric constraints: Admittedly I work with grammars a great deal, but
20-30 productions and 4-5 levels of depth seems pretty simple to me. Book
contains body contains section contains chapter contains subsection contains
list contains figure; this is already 7 levels but it's not very confusing,
is it? I really don't know whether I want to object to having numbers here
at all, or just to numbers that seem to me quite small.
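
For what it's worth, numbers like these are easy to compute mechanically.
A minimal Python sketch, assuming a hypothetical table CONTAINS that
records only the containment chain just named, and assuming the table
has no recursive content (the function would not terminate on recursion):

```python
# Hypothetical containment table: element -> elements it may directly contain.
CONTAINS = {
    "book": ["body"],
    "body": ["section"],
    "section": ["chapter"],
    "chapter": ["subsection"],
    "subsection": ["list"],
    "list": ["figure"],
    "figure": [],
}


def depth(element):
    """Number of nesting levels from element down to its deepest descendant."""
    children = CONTAINS.get(element, [])
    return 1 + max((depth(child) for child in children), default=0)


def production_count():
    """Number of declared elements, counting one production per element."""
    return len(CONTAINS)
```

Here depth("book") gives the 7 levels counted above.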

Multiple encodings: A thorny issue, as discussed in our CHum paper. The
paragraph seems to imply it's simpler than it is. I think it's possible to
use two different DTDs and two sets of tags, but CONCUR remains problematic,
I think.


Now a couple of comments on MS-McQ's comments. I won't point out the several
places where we agree, save in one or two instances where I'd like to
emphasize the agreement.

Hoots, LALR, etc.: I agree absolutely. When we technical types tell less
technical types that the technology can't do things that are clearly easy,
they have a right to be upset.

Disliking full form as the only legal form: yes, same reason.

Lexers and parsers automatically generated making life easy: I think the
real issue is that things that are obviously easy should not be disallowed.
I believe a convincing argument can be made along the lines "LALR is good
for you because the restrictions it imposes are restrictions that your limited
brain power cannot often usefully transcend anyway" but I don't think this
same paternalistic (and I do not mean it's bad--I make this argument to myself
and to my students all the time) kind of argument can be successfully made
wrt omitted tags, for example.

From IDE@VAS780.BITNET Wed Feb  1 19:57:01 1989
Message-Id: <8902020056.AA02878@bucatina.cis.ohio-state.edu>
Date: 1 FEB 89 12:17-EST
From: IDE%VAS780.BITNET@cunyvm.cuny.edu
To: mamrak@cis.ohio-state.edu
Cc: IDE%vas780.bitnet@RELAY.CS.NET
Subject: SGML GUIDELINES

Sandra,

I am sorry to have taken so long to get back to you about the syntax document.
Ideally, we would modify it and distribute it to the steering
committee for comment in order to finalize it *before* distributing it at the
Feb. 18 meeting.  I hope my delay hasn't rendered this impossible.

The document looks good in general and only needs, in my view, to be
filled in in spots because people will be unaware of what is behind
some of what you say, or will not have the expertise to understand
what you say in the current form.  In the interests of time, I am
going to give you my general comments, and if you have specific
questions come back to me and I will elaborate.

Introduction:

The first sentence will need some explanation and justification--that is, why
have we decided to use SGML, why is it the obvious choice?  Much of this can be
drawn from existing TEI documents, but it should be made clear in the document
(1) what we see the strengths of SGML to be (2) what we see as the potential
drawbacks--which things are we aware of that SGML may not do well, or not do at
all? (3) that we intend to examine SGML, knowing we could come up against
unexpected problems, and what we intend to do in this event.  Some of these
questions are answered elsewhere in the document but could be outlined in the
opening.  This is the first thing people will ask about.

SGML requires DTD's but we *could*, in principle, develop tag sets without
bothering to specify DTDs, could we not?  If so we need to explain why we are
concerning ourselves with DTDs.

Your reference to extended BNF notation will probably not be understood by all
the audience.  This should be explained briefly--the most important thing to
make clear is what constraints this notation implicitly puts on the
structure of tag sets. You also treat this later on but the relation is not made
explicit anywhere.

2nd paragraph:  The term "TEI community" is not the right one for this
context--"needs of computer-aided language analysis" and "undertaken in the
field of computer-aided language analysis" might be better.

It also seems worthwhile to make explicit the fact that the two goals
of Committee 4 are not necessarily compatible, and how we seek to
resolve conflicts that may arise from this.

Processing SGML with Computers:  give the reference for the SGML
standard in-line

The Use of Existing Software:  This section needs to be extended and explained
in much greater depth.  Both the topics of par. 1 and 2 are so familiar to you
that I don't think you realize that most people within the TEI framework won't
understand them from this description. The first paragraph makes the major point that
supports the approach to the problem that will be taken: it must be elaborated
to constitute a viable argument.  As it is, it is rather weakly stated, because
you assume what you say is obvious.  To humanities types, it isn't.

Some Implications...:  The first par. here makes a big and important
point.  Emphasize it and elaborate on it.

The second point is not entirely clear.  Which kind of CS expertise?
Where in the effort must this expertise exist?

Guidelines...: item 2. note that it is the relation between two parts
of a German verb *when they are not given contiguously within a sentence*.

item 3. explain inclusion and exclusion exceptions.

item 5. I may have a bad memory but I am not sure we decided that we
would do anything that relies on the Concur feature.  Am I wrong here?
What happened to the idea of separate files etc.?

_________________

In general, I think that the paper needs elaboration and some connecting of
ideas to make clear what the relations among parts are. Let me suggest this:
make another pass at it as quickly as you can, then submit it to the entire
steering committee for review.  Michael and I are not too bad at modifying
things for emphasis, connection, etc. (we have done a lot of this in working on
the various proposals), and if you do the filling out first we can probably make
any other minor stylistic/logical changes rather quickly. Maybe we
can get this thing together by the 17th!

Thanks for your work.  Again, come back to me if you want elaboration
on any point.

Nancy

From mamrak Thu Feb  2 09:43:26 1989
Date: Thu, 2 Feb 89 09:42:07 EST
From: Sandy Mamrak <mamrak@cis.ohio-state.edu>
Message-Id: <8902021442.AA03392@bucatina.cis.ohio-state.edu>
To: IDE%VAS780.BITNET@cunyvm.cuny.edu, mamrak@cis.ohio-state.edu
Subject: Re:  SGML GUIDELINES
Cc: IDE%vas780.bitnet@RELAY.CS.NET, mamrak@cis.ohio-state.edu

Nancy,

Thanks for your comments.  They raise a major question in my mind that
I believe must be answered before any revisions of the draft can be
undertaken.

         *** What is the target audience for the document? ***

OUR ASSUMED TARGET AUDIENCE

The focus of the draft is guidelines for writing DTD's and instances
of DTD's (yes, you MUST have a DTD if you want to conform to sgml
requirements; sgml requires not only that you specify a tag set, but
also that you specify the rules for ordering the tags; hence the DTD).
We assume that the readers of the guidelines would be those whose
responsibility it is to write DTD's or to develop software
environments for specifying instances of DTD's.  Such people would
minimally be already well-versed in sgml and optimally would have
already written software to process it.   This audience would
likely have little interest in a justification of the choice of sgml, and
should have no problems with any of the terminology or tightness of
prose for which you are asking elaboration. In particular, phrases
like `extended BNF', `inclusion and exclusion exceptions', and
concepts like tokenization and parsing would be quite familiar.

YOUR (APPARENTLY) ASSUMED TARGET AUDIENCE

The readership that I infer from your comments is people from the
humanities and linguistics who have little detailed knowledge of sgml
and no experience with software development.

CONFLICT TO BE RESOLVED

I think we need to agree on a targeted audience before we proceed.
Likely, there is a need for documents for both types of readers. And
likely, the documents will be quite different (i.e., not one document
to serve both).  For example, for the humanities/linguistic audience,
a brief tutorial on sgml, an elaborate justification of its choice, a
tutorial on issues that arise when doing machine-processing and a
pointer to a guidelines document of the type we've already drafted is
probably the right approach.

PROPOSAL FOR THE FEBRUARY 17TH MEETING

I can revise the draft based on your comments, but targeting only the
sgml/computer science readership.  This would involve some minimal
clarifications (e.g., regarding DTD's and the use of CONCUR), but not
much of the elaboration you request.  I could do this in the next week
or so and forward it to steering.

As a preamble to the draft, I can explicitly identify the targeted
readership, discuss the need for another document for alternate
readers, and promise one for the future.

Please let me know how you think we should proceed.  Perhaps a phone
conversation is most appropriate at this point. . .