Guidelines for Writing SGML DTDs (Draft)

TEI ML W01 (Yet to be accepted by the full ML Committee)

Sandra A. Mamrak

May 8, 1989

1. Introduction
2. Tasks for Writing DTDs
3. Guidelines for Writing Instances of DTDs
4. Guidelines for DTD Writers
5. Software Development to Support Following the Guidelines
References

1. Introduction

The text encoding scheme formulated by the Text Encoding Initiative (TEI) will use the syntax of ISO 8879, 'Standard Generalized Markup Language', (SGML) [SGM86] as the basis for representing encoded documents, with the restrictions specified in this and later documents of the Initiative. No general introduction to SGML is attempted here. Tutorial materials for use in the TEI will be prepared separately.

SGML specifies both (1) rules for marking up text elements (delimiters to be used to signal markup tags, delimiters for 'entity references' and so on) and (2) rules for defining the tag sets to be used in marking up a document and the hierarchical relationships of individual tags (what tags, or more precisely what tag-delimited text elements, are allowed or required to appear within what text elements). The hierarchy of text elements and their tags is defined by a Document Type Definition, DTD. Each marked up document is an instance of a DTD.

The responsibilities of the Syntax and Metalanguage Committee of the TEI are (1) to ensure that the scheme developed by the TEI is powerful enough to meet the analysis needs of the text-research community, (2) to ensure that the specific recommendations of the TEI guidelines conform to SGML, and (3) to ensure that texts encoded according to the TEI guidelines (and therefore their DTDs) are easily processable by computer.

Other TEI Committees are responsible for defining concretely the requirements for computer-assisted textual research. Their work will show whether SGML suffices for those requirements or must be somehow extended or modified. At present, no extensions are seen to be necessary. It is the purpose of this paper to make recommendations that will ensure that the DTDs defined within the TEI project, and thus documents which follow them, are of the highest quality possible with respect to completeness and correctness. We are also concerned about issues that affect processing DTDs by computers.

Below we identify two distinct tasks for specifying DTDs. The purpose of the distinction is to identify the expertise and resources that must be available to execute each task successfully in the DTD definition process.

2. Tasks for Writing DTDs

The creation of DTDs must be undertaken as a cooperative effort by those who are expert in the application area for which the DTD is being defined, and those who are expert in the use of software tools and testing methods for specifications based on formal languages.

DTDs must enable the correct and useful encoding of information required by the text type, disciplinary domain, and particular processing task. Linguistic expertise, for example, will be required in writing DTDs for the encoding of linguistic analyses, and publishing experience is required to ensure that DTDs meet the needs of publishers. Though necessary, however, such domain-specific expertise is not sufficient to ensure the high quality of the DTDs defined.

Since DTDs are intended as detailed machine-processable specifications, they should also pass the tests we commonly apply to other such specifications, notably computer programs. Since they can become long and complex,[2] steps should be taken to control their complexity and keep them comprehensible. Their syntax should be verifiably correct, and their semantics should be tested to ensure that the DTDs define in fact what they are intended to define.

Software tools and testing methods have been established to deal with syntactic and semantic problems in computer programs. Compilers exist to verify syntax, and test data are run to raise confidence that the semantics are correct, i.e., that the program does what was intended by its author. These techniques are not foolproof, but they help considerably to eliminate all the obvious problems, and some of the more subtle problems, that arise when humans deal with complex objects.

If DTD definers do not have access to software tools and methods similar to those of the computer programmer, then the quality of their DTDs can be expected to be comparable to that of a computer program that has never been compiled or tested. The DTD is likely to be plagued not only by syntactic errors, but also by semantic ones.

From a syntactic point of view, DTDs are supposed to be written as grammars. As such, they consist of terminal (e.g., CDATA) and nonterminal (e.g., elements like frontmatter, body or backmatter) symbols and a set of rules for ordering the symbols. Ordering rules might specify that a symbol must occur only once, may occur zero or more times, must follow another specified symbol, and so on. All symbols should be reachable from the start symbol and all orderings must follow the rules. As a DTD becomes large, it is very difficult for humans to determine if all symbols are reachable, i.e., if there are any symbols that are disjoint from the grammar. It is also very difficult for humans to determine if the ordering rules are followed. For example, if the ordering is supposed to be unambiguous, it is almost impossible for a human to verify this condition as the number of rules becomes large.
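The reachability condition described above is the kind of check that is tedious by hand but mechanical for software. The following is a minimal sketch in Python, not part of any SGML tool: the element names and the `CONTENT` table are hypothetical, and each content model is reduced to the set of element names it mentions, ignoring order and occurrence indicators.

```python
from collections import deque

# Hypothetical element declarations: each element name maps to the set of
# element names mentioned in its content model. "appendix" is declared but
# never referenced, so it is disjoint from the grammar.
CONTENT = {
    "book": {"frontmatter", "body", "backmatter"},
    "frontmatter": {"title"},
    "body": {"chapter"},
    "chapter": {"title", "para"},
    "backmatter": {"para"},
    "title": set(),
    "para": set(),
    "appendix": {"para"},
}


def unreachable(content, start):
    """Return the declared elements that cannot be reached from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        # Visit every element named in the current element's content model.
        for child in content.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return set(content) - seen


print(unreachable(CONTENT, "book"))
```

A breadth-first walk is used only because it is short; any graph traversal would do. Checking that the ordering rules are unambiguous is a harder problem and is not attempted here.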

Semantically, DTDs are highly error-prone. SGML provides very powerful shorthand notations for specifying rules in a grammar (content models are one example). The purpose of these notations is to make the specification of both the DTDs themselves, and also instances of the DTDs, easier. Often, however, their effect may be to lead definers to write a specification that does not capture their intended meaning. Without some method of testing the DTD, it is very difficult for a DTD definer to know if the DTD really captures the document type that was intended.[3]

Thus, the successful definition of DTDs requires expertise both from the targeted application area and from the area of software tools and testing methods for formal language specification. Below we offer guidelines for writing DTDs and their instances.

3. Guidelines for Writing Instances of DTDs

In each case, we follow our recommendations with a detailed justification. The casual reader may skip these sections without loss of continuity.

Recommendation.

We recommend that instances of DTDs be developed using a software environment specifically designed for this purpose. This environment will enforce correct and complete specification, in addition to greatly easing the burden of tag entry. In the absence of such software, we recommend that 'informal' instances of DTDs be created in an electronic form and be submitted to the ML committee. The ML committee will use the most current software technology available to ensure proper, formal specification, in consultation with originators of instances of DTDs as necessary.

Detailed Justification. SGML-based software systems take as input a DTD specification and generate as output an editor for instances of DTDs. These editors support the easy specification of legal DTD instances, including presentation of the instance with both shortened and omitted tags. Such systems are already available (ArborText, SoftQuad and Software Exoterica, among others, distribute them), and more will surely appear in the near future. The overwhelming advantage of using this technology is that it not only makes it easy for the end-user to specify an instance of a DTD, but also guarantees that the resulting specification will be legal.

In the absence of this technology, those who write instances of DTDs in a local computing environment will be required to submit them to nonlocal personnel, i.e., the ML committee, with access to such software tools to determine if they are legal. This activity is analogous to writing a computer program and having to submit it elsewhere for compilation and subsequent debugging, either by the original author, or some other designated person.

It is important to note that, since those using limited technologies to create instances of DTDs will have to have them verified elsewhere in any case, their burden of entering tags can be greatly reduced. They can use informal methods for shortening or omitting tags that would be readily understood by another human who has to deal with 'debugging' the DTD instance, and not have to worry about the detailed and complex SGML rules (see below) for this activity. This will not totally eliminate the effort required for tag entry. However, the cost of tag entry must be weighed against the potential advantages to be gained by having manuscripts that are completely and correctly marked up.

4. Guidelines for DTD Writers

Before instances of DTDs may be specified, the DTDs themselves must be in place. The guidelines below govern the specification of DTDs.

Recommendation. We recommend that DTDs be developed using a software environment specifically designed for this purpose. This environment will enforce correct and complete specification. In the absence of such software, we recommend that 'informal' DTDs be created in an electronic form and be submitted to the ML committee. The ML committee will use the most current software technology available to ensure proper, formal specification, in consultation with originators of DTDs as necessary.

4.1 Use of Minimization in SGML

Recommendation. We recommend that the DTD definer use none of the minimization features of SGML: SHORTREF, DATATAG, OMITTAG, RANK or SHORTTAG. [4]

Detailed Justification. The design of SGML flounders when it extends the standard to include aspects of the specification of a DTD that are motivated primarily by a particular computer technology rather than by the general goals of descriptive, machine-independent markup. In particular, the minimization features have been included in the standard to ease the DTD markup burden, under the assumption that a particular kind of markup technology is in place.

The assumption is that the markup system will require the author of a document to enter all the descriptive information about the content of the document, i.e., tags and possibly attributes of tags, in addition to the content itself, completely manually, and that only one view of the document will be available, that of the markup actually entered by the author. This assumption is based on a 'typewriter' view of computer systems that might be developed to support the generation of DTDs. Basically a 'blank sheet', i.e., empty display screen, is envisioned as confronting the author, and a keyboard is envisioned as the sole source of input to lay characters, both markup and content itself, on the blank sheet.

Under this assumption of a particular type of computer technology for text-entry, the common goals of the minimization features are 1) to reduce the number of keystrokes required to be typed by the author, and 2) to aid in viewing the document without the 'intrusion' of many, possibly deeply-nested levels of, tags.

With the current existence of tag-driven, menu-based DTD-preparation systems like those being distributed commercially, SGML's assumptions about the relatively antiquated text-entry technology have already been demonstrated to be outmoded. In view of the emergence of this technology, we believe it is a design error to include the specification of these non-required features in DTD specifications. Moreover, these features are ill-conceived to meet the goals for which they were specified, even assuming the typewriter model of data entry (see [HEA88], for example).

We illustrate the complexity involved in using and specifying the OMITTAG feature below. The difficulties in using the other minimization features are similar.

The OMITTAG feature states that in some circumstances, an entire tag may be omitted from the marked-up document. There are four conditions that must prevail for a tag to be omissible. First, the tag must be omissible according to three SGML-specific rules. Second, the omission of the tag may not create an ambiguity.[5] Third, the minimization specification for this tag must allow omission. Finally, 'OMITTAG YES' must be specified. A somewhat simplified version of the SGML-specific rules is:

(1) An end-tag may be omitted if its element's content is followed by the start-tag of an element that can not occur within it.
(2) An end-tag may be omitted if it is followed by the end-tag of an element that contains it.
(3) A start-tag may be omitted if the element is contextually required and any other element types that could occur are contextually optional.

The author of a DTD is responsible for determining which tags are omissible and the minimization must be specified in the DTD. Determination of whether a tag is omissible depends not only on that element's declaration but also on where that element is used in other element declarations.

Complexities in Using the OMITTAG Feature.

OMITTAG specifications can be difficult to use if one does not understand the omission conditions completely. For example, consider the following DTD definition of a list:

<!--      ELEMENT MIN CONTENT             -->
<!ELEMENT list    - - (item)+             >
<!ELEMENT item    O O (#PCDATA, (list)*)  >

The string '- -' means that both the start-tag and end-tag are required for the element(s) being defined. The string '- O' (a hyphen followed by the letter O) means that the start-tag is required and that the end-tag may be omitted, and so on.

The following is a sample, fully marked-up list:

<list>
<item>First item</item>
<item>Second item</item>
<item>Last item</item>
</list>

To the inexperienced user, the specification may appear to indicate that all start-tags and all end-tags are omissible on the element 'item', leading to the following possible markup:

<list>
First item
Second item
Last item
</list>

A user with some experience may realize that this may not be the correct interpretation of the specification, since there is no way to unambiguously distinguish the items. A newline character has been used here, but newline characters may themselves appear in the contents of an item.

In fact, to correctly decide which 'item' tags are omissible, each decision about omission must be done on a tag-by-tag basis, applying the three conditions for omissibility in each case.

Using Rule 1 for tag omissibility, the list can be marked up as:

<list>
<item>First item
<item>Second item
<item>Last item</item>
</list>

and using Rule 2, this markup can be reduced to:

<list>
<item>First item
<item>Second item
<item>Last item
</list>

The specification does not indicate that all the item start-tags are omissible. By examining the declaration of list, we see that only the first item is contextually required. The other items are contextually optional. The notation implies that only the start-tag for the first item is omissible. Thus Rule 3 reduces the markup to:

<list>
First item
<item>Second item
<item>Last item
</list>

Note that this is not the only possibility for minimal markup. Another possible minimal, unambiguous markup is:

<list>
First item</item>
Second item</item>
Last item
</list>

However, this markup is not allowable because it does not meet the first condition for omissibility, i.e., it is not allowable by the three SGML-specific rules. In particular, it is not permissible to omit the start-tag on the second and third items.[6]
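The tag-by-tag application of the three rules to the flat list example can be sketched as a small Python function. This is an illustration only, under loud assumptions: the rules are hard-coded for the 'list' and 'item' declarations given earlier, nested lists are ignored, and the markup is assumed to be already split into a hypothetical list of tokens. A general implementation would have to consult the content models of the whole DTD.

```python
def minimize(tokens):
    """Drop the 'item' tags that Rules 1-3 allow to be omitted in a flat list."""
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Rule 1: the end-tag is followed by the start-tag of an element
        # (another item) that cannot occur directly within an item.
        if tok == "</item>" and nxt == "<item>":
            continue
        # Rule 2: the end-tag is followed by the end-tag of the containing list.
        if tok == "</item>" and nxt == "</list>":
            continue
        # Rule 3: only the first item in a list is contextually required.
        if tok == "<item>" and prev == "<list>":
            continue
        out.append(tok)
    return out


full = ["<list>",
        "<item>", "First item", "</item>",
        "<item>", "Second item", "</item>",
        "<item>", "Last item", "</item>",
        "</list>"]
print(minimize(full))
```

On the fully marked-up list, the sketch keeps the list tags, the three pieces of content, and the start-tags of the second and third items, which is the minimal markup reached above by applying Rules 1, 2 and 3 in turn.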

One other problem that may occur with the use of tag omissions is that modifications to existing text may be made incorrectly because insufficient attention is paid to tag rules. For example, if an author wanted to add an item to the beginning of the list above, a common tendency would be to generate the new, illegal list as follows, without concern for putting a begin-tag on the now second item:

<list>
New first item
First item
<item>Second item
<item>Last item
</list>

As can be seen by this example, determining the legal use of omissible tags can be a complex undertaking, involving knowledge of not only the immediate minimization specification for a given element, but the examination of several declarations in a given DTD, and careful consideration of the current instance being created or modified.

Complexity in Specifying Omissible Tags.

We now illustrate the complexity involved in specifying which tags may or may not be omitted. We also illustrate that in order to achieve the most intuitively pleasing omissions, it may be necessary to go back and change the element definitions in an already existing DTD.

We use the following mathematical formula as an example. The DTD element definition that we will be using is taken from [AAP86c]. The definition without the specification of minimization is:

<!--      ELEMENT CONTENT       -->
<!ELEMENT fr      (nu,de)       >
<!ELEMENT (nu,de) (%f-butxt;)*  >

This declaration says that the content of a fraction consists of a numerator followed by a denominator. The content of the numerator and the denominator are defined to be built-up text, which is defined elsewhere. Built-up text does include fractions.

Assume you are a specifier of a DTD who now wishes to embellish the element definition with the specification of which tags may be omitted. Your first impulse is likely to be to determine the minimum markup by examining instances of possible declarations of fractions. Rather quickly, you might discover that the string

<fr>numerator<de>denominator</fr>

represents minimal markup (using tag omission only, and not the SHORTTAG feature).

Next, you would attempt to add the minimization specification to the DTD. Upon examination, you would discover that it is not possible to specify this markup exactly, without rewriting the element definitions, since there is no way using the current specification to indicate that only the denominator begin-tag is required. So, you might rewrite the element definition, specifying the minimization suggested by the sample string:

<!--      ELEMENT MIN CONTENT       -->
<!ELEMENT fr      - - (nu,de)       >
<!ELEMENT nu      O O (%f-butxt;)*  >
<!ELEMENT de      - O (%f-butxt;)*  >

At this point, it may occur to you that there is another string that represents minimum markup, i.e.,

<fr>numerator</nu>denominator</fr>

, and that, likely, creators of instances of DTDs may try to use this minimization instead of the one specified. Such a minimization would be illegal if the specification above were the prevailing one.

So, you might attempt to determine a way to allow for both specifications. In fact, the following specification allows for both forms of minimization, if the creator of the instance applies the conditions for omissibility properly when declaring the DTD instance.

<!--      ELEMENT MIN CONTENT       -->
<!ELEMENT fr      - - (nu,de)       >
<!ELEMENT nu      O O (%f-butxt;)*  >
<!ELEMENT de      O O (%f-butxt;)*  >

By application of the first condition for omissibility, i.e., the three SGML-specific rules, the minimal markup

<fr>numerator denominator</fr>

, is allowable.[7] By application of the second condition, i.e., the markup cannot be ambiguous, it is discovered that some tag must be included between the numerator and the denominator. If we choose the numerator end-tag, the desired minimization is achieved.

You may now observe that the initial element declaration was okay, and change the element definitions once again to their original form, with the minimization specification added:

<!--      ELEMENT MIN CONTENT       -->
<!ELEMENT fr      - - (nu,de)       >
<!ELEMENT (nu,de) O O (%f-butxt;)*  >

This specification for tag minimization is comparable to the list example just discussed, and is unintuitive to the inexperienced creator of instances of DTDs. Again, the apparent ambiguity must be resolved in exactly the same way as for the list example, by methodically applying the conditions for omissibility when creating an actual instance of a fraction.

Also, we observe that the most logical action for a specifier of a DTD who wants to allow for maximum flexibility under the SGML minimization rules, is to simply specify all minimization as 'O O'. This approach relieves the DTD specifier of any complexity in the specification, but in turn simultaneously places a tremendously complex burden on the creator of an instance of a DTD.

In reality, the effective function of the minimization specification is to indicate, among those tags which may be omissible, which tags must be included. This is a strange function indeed, since such a specification of required tags is necessarily due to some personal, subjective judgments in the mind of the specifier of the DTD, having nothing to do with possible ambiguity. Such judgments are unlikely to be shared by creators of instances of DTDs, who will find these rules arbitrary and unintuitive (as they are), and thus difficult to apply.

This is a rather simple example, involving only two elements. As was observed above, DTDs typically involve hundreds of elements that interact in complex ways. Thus, the specification of omit tags can be expected in general to be considerably more complicated than that discussed here, and proportionately more error-prone.

Again we might consider the task of using the mechanism of tag omission, in addition to using the SHORTTAG feature, in the actual markup of an instance of a mathematical equation. Consider the following:

\begin{displaymath}
\frac{1+\frac{1}{1+\frac{1}{x}}}{x}
\end{displaymath}

A completely marked-up version of the above formula is:

<fr><nu>1+<fr><nu>1</nu><de>1+<fr><nu>1</nu>
<de>x</de></fr></de></fr></nu><de>x</de></fr>

Using tag omission only, the markup would be the following:

<fr>1+<fr>1</nu>1+<fr>1</nu>x</fr></fr></nu>x</fr>

The fraction end-tag is not omissible since a fraction may contain another fraction.

By adding empty end-tag minimization, the markup would be the following:

<fr>1+<fr>1</>1+<fr>1</>x</></></></></>x</></>

Without a doubt, this markup is considerably shorter than the fully marked-up version. The real question is: is it easier to generate and is it easier to read? In this example, the start-tags of the numerators and denominators are missing but are understood to be there. The placement of the empty end-tags is similar to matching right parentheses with left parentheses in an expression where the left parentheses (i.e., start-tags) are invisible. Authors of programming languages have long acknowledged the difficult problem of matching nested parentheses, but their case is considerably simpler than this one since they always have the explicit left parentheses. Other, select tags may be entered to better display the intended markup in this case, but that process is itself subject to error since the creator of an instance of a DTD must specify only those shortened versions of markup that are allowable both under SGML rules and according to the DTD.
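The bookkeeping that a reader must do mentally to match empty end-tags against invisible start-tags can be made explicit with a stack. The Python sketch below is an illustration only and is hard-coded for the fraction declarations used in this section: it assumes that the only tags appearing in the input are '<fr>' and the empty end-tag '</>', and that the first child of a fraction is always a numerator and the second a denominator. The function name `expand` and the tokenizing pattern are hypothetical and do not correspond to any SGML parser.

```python
import re


def expand(markup):
    """Restore omitted <nu>/<de> start-tags and name every empty end-tag."""
    stack = []  # entries are [element name, number of children closed so far]
    out = []
    for tok in re.findall(r"</>|<fr>|[^<]+", markup):
        if tok == "</>":
            # An empty end-tag closes the most recently opened element.
            name, _ = stack.pop()
            out.append("</" + name + ">")
            if stack:
                stack[-1][1] += 1
            continue
        # Data or a nested fraction directly inside <fr> implies an omitted
        # start-tag: <nu> for the first child, <de> for the second.
        if stack and stack[-1][0] == "fr":
            child = "nu" if stack[-1][1] == 0 else "de"
            stack.append([child, 0])
            out.append("<" + child + ">")
        if tok == "<fr>":
            stack.append(["fr", 0])
        out.append(tok)
    return "".join(out)


short = "<fr>1+<fr>1</>1+<fr>1</>x</></></></></>x</></>"
print(expand(short))
```

Applied to the empty end-tag markup above, the sketch reproduces the completely marked-up version of the formula (without the line break). That a stack and a per-fraction counter are needed even for this two-element case gives some measure of the effort demanded of a human reader.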

Summary.

The SGML features that support shortened or omitted tags were motivated by an outmoded computer technology. In general, their specification and use is very complex and highly error-prone. Correctness depends upon simultaneously considering many concurrent rules and many element declarations. In the worst case their use can dictate unnecessary and perhaps unnatural changes to the DTD element specification. These features also do not necessarily result in the easiest to generate, or the easiest to read, markup conventions, in direct contradiction to their intended purpose. Further, as will be discussed below, the use of the SHORTREF feature interferes with the correct use of another feature, i.e., CONCUR, that we recommend for use in the TEI.

4.2 Use of Attributes

Recommendation. We recommend that attributes not be used in the declaration of DTDs or their instances.

Detailed Justification. SGML provides two mechanisms for marking information in a DTD: tags and attributes. Tags are used to mark the beginning and end of each element that is defined in a DTD. Attributes are associated with particular elements. Their notation is different from that of tags. They cannot be decomposed hierarchically, and they can be restricted to contain only certain values.

The use of two mechanisms to mark information is necessarily more complex than the use of one. It has already been established that DTDs are complicated to define and use. Thus, two separate mechanisms can only be justified if there is considerable advantage to be gained from having both. We argue that there is no advantage to be gained, either semantically or syntactically, and that only tags, the better of the two mechanisms, should be used.

Semantic Issues

From a semantic point of view, there is no universal and unambiguous way to distinguish tags from attributes. At best, distinguishing criteria have been suggested that are application dependent. These criteria are not generally applicable across all domains and even lead to contradictory distinctions within a single domain.

For example, in the publishing domain, the concept of 'hierarchy' has been proposed as a distinguishing one for tags and attributes. All those pieces of information in a manuscript that are clearly hierarchical are to be marked using tags. All other information is to be marked using attributes. Thus, the manuscript components frontmatter, body and rearmatter would be tagged. But nonhierarchical items like references and citations would be marked using attributes.

The problem with the 'hierarchy' criterion is that the concept very quickly blurs as DTDs are defined in other domains. For example, suppose in the linguistics domain it is desired to define a DTD to mark parts of speech, like verb, noun, adverb and so on. Further, assume that for each part it is desired to mark information like gender, person, number, tense and so on. In this case, neither the parts of speech themselves, nor the information about them, is hierarchical in any clear sense.

Another concept that has been applied in the publishing domain to distinguish tags from attributes is that of 'content'. All those substantive pieces of information that will actually appear on a printed page, i.e., the content of a manuscript, are to be marked with tags. All else, e.g., formatting or rendering information, are to be marked using attributes. Note that using this criterion, references and citations would now be marked using tags, while using the hierarchy criterion they would be marked using attributes, even though both criteria stem from the same application.

The concept of 'content' blurs quickly, even within the publishing domain. Consider books on typesetting, like The TeXbook, in which examples arise like: use the following typesetting commands .... to produce .... To accommodate such publications one would have to use tags for the typesetting commands that are content. But attributes would be used for any typesetting commands that refer to formatting. Thus one would have tags and attributes capable of specifying the same information.

The concept of 'content' also quickly blurs as one moves outside the domain of publishing. For example, assume that an analysis is to be done on a table in a manuscript to evaluate visual cues that readers might use to distinguish items in a table. From the point of view of this analysis, column separators would be considered to be 'content'. However, suppose for the same table that one wishes to find the average of all the values in the columns of the table. From this point of view, column separators would be considered a nonsubstantive part of the table.

Syntactic Issues

From the syntactic point of view, tags are more powerful than attributes in the sense that tags can be further decomposed into subcomponents. So, from this point of view, they should always be chosen over attributes. On the other hand, certain values, or ranges of allowable values, can be associated with attributes, but not with tags. So, a case can be made for using attributes only if a case can be made that it is desirable to specify allowable values of certain strings when a DTD is defined.

In practice, the value of any terminal string in a DTD is strictly a matter of interest or concern only to a particular application that may be processing instances of the DTD. This is equally true for those strings that may be delimited using tags or those that may be delimited using the attribute mechanism. Since it is in the application that the interest originates, it should rightly be in the application that the necessary work to pursue the interest, i.e., specifying values and constraints on values, is done.

For example, consider the 'value' of zip code information. One application may wish to generate mailing lists only for those entries with postal zones indicating residence in Oregon and Washington. Another may wish to verify that all the zip codes are valid based on an official U.S. Postal Service listing of such codes. Or, alternatively, an application may search for all zip codes that are invalid based on such a listing. Still another may wish to analyze the postal codes in order to determine a geographic distribution of entries in the southern portion of the U.S. Each of these applications is interested in a constrained 'value' of the string that is used to represent the zip code. However there is no obvious way to impose constraints on the value of the zip code when the DTD is defined such that all of the interests of these varied applications would be served. Further, it would be a bad design policy to burden the DTD itself with a set of constraints that are of interest to only one or a few applications.

Or, consider the 'value' of a 'column separator' associated with a table element. Suppose a list of allowable values for the separator has been specified in the DTD. This list is likely to be based on some application for instances of the DTD, say a formatter. Now assume that someone is creating an instance of this DTD by marking an already existing manuscript. Suppose that the manuscript has a value for a column separator that does not occur on the list of allowable values. The definer is faced with the choice of illegal markup or no markup at all. On the other hand, if 'column separator' had been marked using tags, the definer would simply fill in the string corresponding to the true value of the column separator.

Even if a case could be made for assigning values in the DTD, the mechanism provided by SGML is flawed and would likely prove inadequate for the need. Attributes are declared in SGML using an attribute-declaration statement that specifies the element name to which the attribute applies, the attribute name, a 'declared value' for the attribute and a default value. The declared value is intended to specify the possible range of values from which the actual value of the attribute may be chosen.

The SGML standard has attempted to anticipate the possible range of values that would be used by definers of DTDs by restricting the 'declared value' parameter of the attribute declaration to a prespecified set of classes. For example, they can have a declared value of NAME which restricts the attribute value to a string in which the first character is a letter (say) and the rest are either letters, digits, a period or a hyphen. Or attributes can have a declared value of NAMES which restricts the value to a list of NAMEs. All of the prespecified classes allow attributes to take on only one type of value, or a list of one type of value.

The AAP in its DTD for mathematical formulas found these classes to be inadequate to fully specify the desired restrictions on some of the attributes they wanted to declare. For example, to describe the value of an attribute "column separators" for an array, they wished to specify a list of ordered pairs as the attribute value. The first element of the pair defines the column number and the second element defines the allowable separator, e.g., single line, double line, blank, and so on. Since the standard does not provide sufficient expressive power for this specification, the AAP specified its desired restriction in a lengthy comment following the formal declaration for the attribute (see [AAP86c], p. 49). Clearly, such a description is outside the standard, and no software that has been designed to validate or analyze this DTD will be able to deal with this type of declaration.
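The limitation can be seen by writing out the NAME and NAMES classes as they are described above. The Python sketch below follows only that simplified description (a letter followed by letters, digits, periods or hyphens); it is an assumption-laden illustration, not the standard's full definition, and it ignores name-length limits and any variant concrete syntax. The function names are hypothetical.

```python
import re

# Simplified NAME, as described in the text: a letter, then letters,
# digits, periods or hyphens. Not the standard's complete rule.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_name(value):
    """True if `value` is a single NAME under the simplified rule."""
    return NAME.fullmatch(value) is not None


def is_names(value):
    """True if `value` is a non-empty, space-separated list of NAMEs."""
    parts = value.split()
    return bool(parts) and all(is_name(p) for p in parts)


print(is_names("single double blank"))      # a flat list of one kind of value
print(is_names("(1,single) (2,double)"))    # a list of ordered pairs
```

A flat list of separator names passes the check, while a list of (column, separator) pairs does not, since parentheses and commas are not name characters. Any structure inside the value therefore has to be described in prose, outside the reach of validating software.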

Summary.

Two separate mechanisms for marking information in manuscripts are not needed unless each provides unique, necessary functionality. Tags are more powerful than attributes in that tags can be further decomposed. By this criterion, the tag mechanism should be chosen in favor of attributes. Attributes can be specified to assume only constrained values, while tags cannot. But constraints on values cannot be specified in a general, reasonable way when defining a DTD. Such constraint issues are rightly placed within the purview of the applications that need them. Attributes therefore serve no unique, necessary function; they further complicate an already complex task and should not be used.

4.3 Inclusion and Exclusion Exceptions

Recommendation. We recommend that the DTD definer avoid the use of inclusion and exclusion exceptions when defining DTDs.

Detailed Justification. Inclusion and exclusion exceptions in SGML are powerful shorthand notations. They indicate that a certain content model should be included or excluded over sets of elements in a DTD, rather than indicating this information on an element-by-element basis. For example, it is possible to use an inclusion exception to indicate that a figure can occur anywhere in an entire manuscript.

Our experience with these exception mechanisms is that they are so powerful that the DTD definer does not always readily understand the full implications of the exceptions that are specified. For example, in the case of a figure occurring anywhere, this implies that figures can occur in header lines, in footnotes, and in bibliographic citations. Typically, the DTD definer does not intend this. Further, if the definer does understand the implications of an inclusion exception and wants to limit its scope, then an exclusion exception can be used. A problem occurs when inclusion and exclusion exceptions begin to be piled atop one another so that finally there is little hope of understanding what is really legal in any given element.
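
A small sketch of how the two kinds of exception interact is given below. All element names are hypothetical and are assumptions made for illustration only. In SGML declaration syntax an inclusion exception is written +(...) and an exclusion exception -(...) after the content model.

```sgml
<!-- Hypothetical element names, for illustration only. -->

<!-- Inclusion: figure may now occur anywhere inside article, at any depth. -->
<!ELEMENT article  - - (title, section+)       +(figure)>
<!ELEMENT section  - - (head, (para | note)+)>

<!-- Exclusion: cancel the inherited inclusion inside note. -->
<!ELEMENT note     - - (para+)                 -(figure)>

<!ELEMENT (title | head | para | figure)  - - (#PCDATA)>
```

Under these declarations a figure is still allowed inside title, head and para, and even inside another figure, although no content model mentions it. To learn what is legal inside note, a reader must consult not only its own declaration but those of all its possible ancestors.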

4.4 Use of CONCUR

Recommendation. We recommend the use of the CONCUR feature.

Detailed Justification. The CONCUR feature of SGML provides a mechanism for more than one simultaneous markup of the same data stream. The main advantage of this mechanism is that it supports different structural views of the same document. Such views may be required by different analyses. It also supports the definition of simple, straightforward DTDs that are likely to be well understood by their definers and thus likely to completely and correctly capture their intentions. A potential disadvantage of using the CONCUR feature is that it may be difficult to process a DTD instance with multiple tag sets when the SHORTREF feature has been used [BAR88]. Since we are recommending that the SHORTREF feature not be used for storage of DTD instances, this disadvantage will not affect the TEI DTDs.
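
As a sketch of what concurrent markup looks like in an instance, the fragment below marks up the same text according to two document types, one logical and one physical. The document type names poem and volume, their element names, and the system identifiers are hypothetical, chosen for illustration only. Each tag names, in parentheses, the document type to which it belongs.

```sgml
<!-- Hypothetical document types, for illustration only. -->
<!DOCTYPE poem   SYSTEM "poem.dtd">
<!DOCTYPE volume SYSTEM "volume.dtd">
<(poem)poem><(volume)volume>
<(volume)page>
<(poem)line>First line of verse,</(poem)line>
<(poem)line>a second line that starts on one page
</(volume)page>
<(volume)page>
and ends on the next.</(poem)line>
</(volume)page>
</(volume)volume></(poem)poem>
```

The second line of verse crosses a page boundary, so neither hierarchy nests inside the other. Each DTD nevertheless stays simple, since it describes only its own view of the text. The CONCUR feature must be enabled in the SGML declaration for such an instance to be accepted.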

5. Software Development to Support Following the Guidelines

As can be inferred from the discussion above, many different tasks must be performed to support the use of these guidelines. A set of tutorial materials on SGML must be prepared. The TEI should be looking forward to obtaining funds to commission the acquisition or development of production-quality software to meet their needs.

A list of such software includes the following:

An electronic template for creating informal DTDs, for those specifiers not having access to software tools to assist in the formal specification.
A software environment for specifying DTDs that assists the specifier in creating a syntactically and semantically correct DTD. Ideally, this environment would be capable of automatically generating a structured-editor for creating instances of the DTDs specified.
A 'structured-editor' type software environment that assists the specifier of instances of a DTD in creating syntactically correct instances.

References

[AAP86]. Association of American Publishers, Standard for Electronic Manuscript Preparation and Markup, 1986. Electronic Manuscript Series.
[AAP86c]. Association of American Publishers, Markup of Mathematical Formulas, April, 1986. Electronic Manuscript Series.
[BAR88]. David Barnard, Ron Hayter, Maria Karababa, George Logan and John McFadden. "SGML-Based Markup for Literary Texts: Two Problems and Some Solutions." Computers and the Humanities, 22:265-276, 1988.
[FUR87s]. Richard Furuta, "Complexity in Structured Documents: User Interface Issues." In J.J.H. Miller, ed., PROTEXT IV: Proceedings of the Fourth International Conference on Text Processing Systems, pages 7-22. Boole Press, 1987.
[HEA88]. Jim Heath and Larry Welsch. Difficulties in parsing SGML. In ACM Conference on Document Processing Systems, pages 71-77, December, 1988. Santa Fe, New Mexico.
[KAE87]. M. J. Kaelbling. Braced Languages and a Model of Translation for Context-Free Strings: Theory and Practice. Ph.D. Thesis, The Ohio State University, 1987. Available as University Microfilms International, Inc., No. 8804059.
[SCH81]. G. M. Schneider and S. C. Bruell. Advanced Programming and Problem Solving with PASCAL. John Wiley and Sons, Inc., 1981.[8]
[SGM86]. ISO. Information processing - Text and office systems - Standard Generalized Markup Language (SGML), October, 1986. ISO 8879-1986 (E). [9]

Notes

[1] Julie Barnes from The Ohio State University contributed substantially to this manuscript. Michael J. Kaelbling, Nancy Ide and Michael Sperberg-McQueen were also particularly diligent in monitoring the progress of the manuscript and offering numerous helpful comments and suggestions.

[2] In order to compare the complexity of a DTD to a programming language, one document specification [AAP86c, AAP86] was rewritten in terms of BNF, using productions containing terminals and nonterminals. The grammar has 240 productions, 406 terminals and 230 nonterminals. Furthermore, a nonrecursive nesting level to reach a terminal from the root of the grammar can be as deep as 25. And some terminals and nonterminals are referenced in as many as 70 productions. Comparatively, a grammar for Pascal contains 95 productions, 67 terminals, 49 nonterminals, a nonrecursive nesting level of 15, and references to some nonterminals and terminals from 21 productions [SCH81]. Furuta has also recognized and enumerated similar complexities in [FUR87s].

[3] In the AAP documentation of their grammar for tables [AAP86], for example, the markup they use for their examples is not legal according to their formal specifications. In the formal specification they indicate that each cell in a table must contain a paragraph, but in the sample markup they do not include the required tags for paragraphs.

[4] There are eleven optional features that can be used when specifying DTDs. These include the five relating to minimization listed above, three relating to linking document types (SIMPLE, IMPLICIT and EXPLICIT), and three others (CONCUR, SUBDOC and FORMAL). An SGML document that uses no features is known as a 'minimal SGML document'. Currently, we would recommend that the DTD definer use minimal SGML, except for the CONCUR feature (see discussion of CONCUR below). We delay in making this the official recommendation because although there exists extensive experience informing a decision against the use of the minimization features, there is little experience with the others. Thus, we are open to considering the use of the three link features, and SUBDOC and FORMAL, if warranted in the TEI.

[5] The alert reader will notice the 'catch-all' nature of this second condition. In fact it was added to later versions of the standard, likely in response to the fact that the SGML-specific rules do allow for ambiguous markup.

[6] M. J. Kaelbling points out in [KAE87], p. 53, that it is not possible to achieve absolute minimal markup in SGML because the conditions under which an SGML start-tag is omissible are overly restrictive.

[7] This allowable markup is deduced by applying rule 2 above to the end-tags and rule 3 to the start-tags. It is ambiguous.

[8] The syntax diagrams in this book were used to derive a context-free grammar for Pascal, to get a feel for the numbers of terminals, nonterminals, productions, and so on. There were 67 terminals (keywords, symbols, digits, and characters), 49 nonterminals, 95 productions, a nesting level of 15 (which could go deeper), and references to IDENTIFIER from 21 locations.

[9] This document contains the formal definition of the Standard Generalized Markup Language. SGML is a set of rules for specifying the definition of the structure of a document. The specification is given as a formal grammar that is difficult for the inexperienced reader to follow. Several appendices are included to provide examples and further discussion. This document should be read by those with considerable experience with SGML applications who desire or need complete, unambiguous knowledge of the abstract syntax that drives these applications.

HTML generated 20 May 1998