5 Ensuring consistency

While a rose might smell just as sweet by any other name, every computer user knows that names intended for automatic processing must be spelled exactly and defined precisely. The human reader might tolerate paragraphs sometimes labelled <p>, sometimes labelled <para>, and sometimes not labelled at all, but the computer is less forgiving. Slightly less obviously perhaps, the user of an SGML-aware software system needs to know what elements have been defined for a given text (or group of texts) and what their possible contents are. He or she needs to know not just whether personal names should be tagged <propname> or <name>, but also in what contexts personal names may reasonably be expected to appear (for example, if something tagged as a name appears within a name, it is probably an error). He or she also needs to know what attribute names have been defined for particular elements and their legal values, and also what entity names should be used for particular symbols. The formal specification of these names and their usage is enshrined in a separate component, unique to SGML, known as a Document Type Definition or DTD.

5.1 Defining a document: the content model

A DTD performs a function analogous to that of a grammar: it formally defines what are the legal productions of a given markup language. Of course, DTDs can be as lax or as restrictive as any other kind of grammar: the designer of a DTD generally has to trade off generality of use against accuracy of error detection. The simplest kind of DTD would be one which did no more than specify a set of tag names, requiring only that every element tagged in a document use one of them. Such a DTD would of course be unable to detect errors such as <name>s occurring within <name>s or within <date>s, or to prohibit such errors as register entries appearing other than inside registers. Creating correctly encoded texts with such a DTD would be rather like trying to speak a foreign language with the aid of a lexicon of the language, but no idea of its syntax.

More usually however, the transcribers and creators of electronic texts wish to control how elements can meaningfully appear within a given class of texts, so that processors intended to act on them can do so more intelligently. The specification of what is legal within any one kind of textual component or element is known in SGML as its content model, because it provides a model for its content. Here, for example, is a part of the formal DTD for the register example given informally above.

       <!ELEMENT register - - (entry+)> 
       <!ELEMENT entry - o (name, trade, relation?)> 
       <!ELEMENT name - - (#PCDATA)>

These three lines are examples of SGML declarations: each defines or declares a name for an element and what its content should be. The details of the syntax need not detain us; note only that each declaration (like everything else in SGML) is explicitly delimited, in this case by a symbol marking the start of a declaration (the ``<!'') and one marking its end (the ``>''). The content model part of each declaration is given in parentheses at the end. Between the name of the element (``register'' in the first case) and the content model are two characters which specify whether or not both start- and end-tags are required to mark off occurrences of the element. The hyphen character indicates that a tag is required, the letter O that it is optional. Thus, in this example, <register>s and <name>s must have both start- and end-tags, whereas <entry>s can be specified using start-tags only.

The content model for register states that a <register> consists of one or more <entry>s; the plus sign indicates that the element before it can be repeated one or more times. Thus a register containing no entries, or one containing something other than an entry, would be regarded as an error by this DTD. The content model for an entry states that an <entry> must begin with a <name>, followed by a <trade> and then optionally by a <relation>. The commas between the components of this content model indicate that the elements must all appear in the order given. The question mark following ``relation'' indicates that this element need not be present. Thus, an entry with no name, or one where the trade preceded the name, would be regarded as erroneous by this DTD, whereas entries are equally acceptable with or without <relation>s. Finally, the content model for a <name> states that it may contain only text, that is, simply data with no embedded tags. (The word ``#PCDATA'' is a special SGML symbol standing for parsed character data -- which must be ``parsed'' because it may contain entity references as well as raw characters.)
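For readers who find it easier to think in program code, the behaviour of these content models can be imitated outside SGML. What follows is only a minimal sketch in Python, not a piece of SGML software. It assumes a content model made up solely of element names separated by commas, each optionally followed by ``?'', ``*'' or ``+'', and it rewrites such a model as a regular expression over the names of an element's children. The function names model_to_regex and is_valid are our own inventions for the purpose of illustration.

```python
import re


def model_to_regex(model):
    """Rewrite a comma-separated content model as a regular expression.

    Hypothetical helper: handles only names joined by commas, each with an
    optional ?, * or + occurrence indicator (no nested groups, no '|').
    """
    parts = []
    for token in model.strip("()").split(","):
        token = token.strip()
        indicator = token[-1] if token[-1] in "?*+" else ""
        name = token.rstrip("?*+")
        # Each child name is matched together with a trailing space.
        parts.append("(?:" + re.escape(name) + " )" + indicator)
    return re.compile("".join(parts))


def is_valid(model, children):
    """Check a list of child element names against a content model."""
    sequence = "".join(child + " " for child in children)
    return model_to_regex(model).fullmatch(sequence) is not None


REGISTER = "(entry+)"
ENTRY = "(name, trade, relation?)"

print(is_valid(ENTRY, ["name", "trade"]))              # True
print(is_valid(ENTRY, ["name", "trade", "relation"]))  # True
print(is_valid(ENTRY, ["trade", "name"]))              # False: wrong order
print(is_valid(REGISTER, []))                          # False: no entries
```

A real SGML parser does a great deal more than this (tag omission and entity references, for example, are ignored here), but the sketch shows why a content model can be checked mechanically: it is in effect a pattern which the sequence of child elements either matches or fails to match.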

SGML syntax allows for other variations, which will be necessary if we are to refine this model to reflect more accurately the probable content of register entries in the real world. We will begin by relaxing the restriction on the number of <trade>s an entry may contain:

        <!ELEMENT entry - o (name, trade*, relation?)>

The asterisk following the word ``trade'' indicates that an entry may contain zero or more <trade>s. An entry such as the following

         <entry>
            <name>John Smith</name>
            <trade>butcher</trade>
            <trade>baker</trade>
            <trade>candle-stick maker</trade>
         </entry>

would be legal according to this second definition, as would one like this:

          <entry><name>John Smith</name></entry>

Suppose however that entries are mixed, sometimes containing names and trades, sometimes only one or the other. One possible content model for this situation would be:

           <!ELEMENT entry - o ((name|trade|relation)+)>

The vertical bar symbol may be read as ``or''. This content model states that an <entry> must contain at least one component, which may be a <name>, a <trade> or a <relation>, and may contain more than one of any of them, in any order. (The inner set of parentheses is needed to indicate that the plus sign is to be applied to the whole group of alternated names). The following entries would all be legal according to this definition:

          <entry>
             <name>John Smith</name>
             <trade>Baker</trade>
             <trade>Chandler</trade>
          </entry>
          <entry>
             <relation>wife of the above</relation>
             <name>Mary Jones</name>
          </entry>
          <entry>
             <name>John Smith</name>
             <name>Henry Jones</name>
             <trade>smith</trade>
          </entry>

As the last example indicates, such laxity of definition may lead to difficulties of interpretation -- our syntax now cannot help us determine whether it is Smith or Jones who is the smith. But presumably in that respect we are being true to the source!
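The trade-off between generality and error detection can be made concrete by testing the same entries against each of the three content models proposed so far. The Python sketch below is an illustration only: the regular expressions are our own hand-made translations of the three declarations, and the name accepted_by is hypothetical.

```python
import re

# The three content models for <entry> discussed so far, each translated
# by hand (for illustration only) into a regular expression over a
# space-terminated list of child element names.
MODELS = {
    "(name, trade, relation?)": re.compile(r"name trade (?:relation )?"),
    "(name, trade*, relation?)": re.compile(r"name (?:trade )*(?:relation )?"),
    "((name|trade|relation)+)": re.compile(r"(?:(?:name|trade|relation) )+"),
}


def accepted_by(children):
    """Return the content models that accept this sequence of children."""
    sequence = "".join(child + " " for child in children)
    return [model for model, pattern in MODELS.items()
            if pattern.fullmatch(sequence)]


# Two names sharing one trade: only the laxest model accepts this.
print(accepted_by(["name", "name", "trade"]))

# A name on its own: rejected only by the strictest model.
print(accepted_by(["name"]))

# A relation preceding a name: again only the laxest model accepts it.
print(accepted_by(["relation", "name"]))
```

Each relaxation of the model admits more of the entries found in the source, and with them more entries that a stricter model would have flagged as probable mistakes.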

A more realistic situation would be that names, trades and relationships are found promiscuously intertwined in any order within any amount of other text. SGML offers two ways of dealing with this. One is simply to add #PCDATA as a fourth alternate to the example above, to give a declaration like this:

           <!ELEMENT entry - o ((name|trade|relation|#PCDATA)+)>

This is the solution generally preferred in the TEI Guidelines for the general case, where elements consist of small identifiable elements (names, trades etc.) swimming about in an arbitrary mixture or `primal soup'. Another approach achieving a similar effect is to use what is known as an inclusion exception, as in this example:

         <!ELEMENT entry - o (#PCDATA) +(name|trade|relation)>

Either of the above definitions states that items tagged as names, trades or relations may appear anywhere within an entry, an arbitrary number of times, interspersed with arbitrary sequences of character data. The following examples would all be regarded as legal according to both the above definitions:


    <entry>Also<name>John Smith</name></entry>
    <entry><name>John Smith</name>
           and<name>Adam Smith</name>
           <relation>his brother</relation>
    </entry>
    <entry>
      <relation>Cousin</relation>
      of the
      <trade>blacksmith</trade>
    </entry>
    <entry>
      Old<relation>Uncle</relation>
      <name>Tom Cobbley</name> and all
    </entry>

The difference between the two element declarations is that the second allows names, trades or relations to appear arbitrarily not only within entries, but also within anything that entries contain. Unless further modified, this definition allows (for example) names to occur within trades (or vice versa), or even within other names. Entries such as the following would be legal by the second definition, but not the first:

    <entry> 
      <name>John <trade>Smith</trade> Jones</name> 
    </entry>
    <entry> <relation>Brother of 
              <name>Henry
                     <trade>Potter</trade>
                    Jones
              </name>
            </relation>
    </entry>
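The difference between the two declarations can also be expressed as the difference between a check made at one level only and a check made recursively. The following Python sketch is an illustration under stated assumptions, not a description of how any SGML parser works: it assumes that <name>, <trade> and <relation> are themselves declared to contain only character data, it represents an element as a small Element object of our own devising, and it does not model the requirement that an entry have some content.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A hypothetical in-memory element: children are strings or Elements."""
    name: str
    children: list = field(default_factory=list)


SUB_ELEMENTS = {"name", "trade", "relation"}


def text_only(element):
    """True if the element contains character data and nothing else."""
    return all(isinstance(child, str) for child in element.children)


def valid_by_alternation(entry):
    """First declaration: sub-elements may appear directly inside <entry>
    only, and each of them must contain nothing but text."""
    return all(
        isinstance(child, str)
        or (child.name in SUB_ELEMENTS and text_only(child))
        for child in entry.children
    )


def valid_by_inclusion(element):
    """Second declaration: the inclusion exception applies to <entry> and
    to everything inside it, so the same check is repeated at every depth."""
    return all(
        isinstance(child, str)
        or (child.name in SUB_ELEMENTS and valid_by_inclusion(child))
        for child in element.children
    )


flat = Element("entry", ["Also", Element("name", ["John Smith"])])
nested = Element("entry", [
    Element("name", ["John ", Element("trade", ["Smith"]), " Jones"]),
])

print(valid_by_alternation(flat), valid_by_inclusion(flat))      # True True
print(valid_by_alternation(nested), valid_by_inclusion(nested))  # False True
```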

Of course, entries such as the following would also be acceptable by either definition:

    <entry>Any old string of characters you care to type in

To complete the content models for our simple register example, we need to define the sub-components of an entry. For the sake of argument, we will assume that <name>, <trade> and <relation> are to contain only text, with no further embedded elements. Since they all share the same content model, a single declaration will suffice:

        <!ELEMENT (name|trade|relation) - - (#PCDATA)>

5.2 Defining a document: attribute lists

We now turn to the definition of attributes for each of the elements discussed so far. As with elements, SGML requires that all attributes likely to be used within a document must be defined in advance. It also offers a variety of features to control the values that specific attributes may take.

An example attribute declaration might look like the following:

         <!ATTLIST name id     ID    #IMPLIED
                        key    CDATA #IMPLIED
                        type (personal|honorific) personal>

This declaration associates three different attributes with the element <name>. The first attribute is called id, the second key and the third type. Note that all the attributes for a given element are defined together in a single ATTLIST declaration. For each attribute named, two additional pieces of information are required for its declaration. The first, following the name of the attribute, defines what type of value may be supplied for it, while the second, following the type specification, indicates what value should be assumed if the attribute is not specified for any element occurrence.

In this example, three different kinds of value are specified. The id attribute may take as its value only a string [See note 10] which the SGML processor will treat as an identifier for the associated element: this is indicated by the keyword ID. A value for the id attribute need not be supplied, and if it is not, then a processor may take whatever default action it chooses; this is indicated by the keyword #IMPLIED.

The key attribute may take as its value any string of characters; this is indicated by the keyword CDATA. The SGML processor will not check the value of this attribute in any way, except of course that it may not contain any form of markup [See note 11]. Again, there is no requirement to supply a value for this attribute.

More exact checking is provided for in the case of the type attribute: here we have specified that only two values are legal for this attribute -- the type of a name must be marked as either personal or honorific. If no value is specified, the processor is to assume that the intended value is personal.
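The effect of this attribute list on any one occurrence of <name> can be sketched as a small lookup table. The Python below is an illustration only: the dictionary NAME_ATTLIST and the function resolve are hypothetical names of our own, the value None stands in for #IMPLIED, and the special checks that apply to ID values are not modelled.

```python
IMPLIED = None  # stands in for #IMPLIED: no value supplied, none assumed

# A hand-made rendering of the ATTLIST for <name>. "allowed" is None when
# the declaration does not enumerate the legal values.
NAME_ATTLIST = {
    "id": {"allowed": None, "default": IMPLIED},
    "key": {"allowed": None, "default": IMPLIED},
    "type": {"allowed": {"personal", "honorific"}, "default": "personal"},
}


def resolve(attlist, supplied):
    """Return the attribute values in force for one element occurrence."""
    for attribute in supplied:
        if attribute not in attlist:
            raise ValueError("undeclared attribute: " + attribute)
    resolved = {}
    for attribute, spec in attlist.items():
        # Use the supplied value if there is one, otherwise the default.
        value = supplied.get(attribute, spec["default"])
        allowed = spec["allowed"]
        if value is not None and allowed is not None and value not in allowed:
            raise ValueError("illegal value for " + attribute + ": " + value)
        resolved[attribute] = value
    return resolved


# Only key is supplied (the value SMITH1 is invented for this example):
# type falls back on its declared default, and id remains unspecified.
print(resolve(NAME_ATTLIST, {"key": "SMITH1"}))
# {'id': None, 'key': 'SMITH1', 'type': 'personal'}
```

A value outside the declared list, such as a type of ``royal'', makes resolve raise an error, which corresponds to the parser reporting an invalid attribute value.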

To illustrate some of the other possibilities for attribute specifications, we conclude with a declaration for the attribute list to be attached to the <relation> element:

     <!ATTLIST relation  target    IDREF  #REQUIRED
                         certainty NUMBER 0>

This states that the <relation> element may have two attributes, called target and certainty. The former takes as its value an IDREF, that is, a string which has been used as the value for an ID-type attribute somewhere within the current document. The SGML parser will check that the value actually identifies some other element, as is the case in our example. The keyword #REQUIRED specifies that a value must be supplied for this attribute on every occurrence of the element to which it is attached: no <relation> can exist in the document unless the element to which it points is identified in this way (the examples of relations given above are thus all illegal by this definition!). Finally, the certainty attribute is defined as taking a numeric value, which defaults to zero.
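The two checks implied by this declaration, that every <relation> carries a target and that every target matches an identifier declared somewhere in the document, can be sketched as follows. This is a minimal Python illustration, not the procedure of any actual SGML parser: the function check_relations, the representation of a document as a list of (element name, attributes) pairs, and the identifier values P1 and P9 are all assumptions made for the example.

```python
def check_relations(elements):
    """Report <relation>s whose target is missing or matches no ID.

    `elements` is a list of (element name, attribute dictionary) pairs,
    a hypothetical flattened view of the document.
    """
    # First pass: collect every identifier declared in the document.
    identifiers = {
        attributes["id"] for _, attributes in elements if "id" in attributes
    }

    # Second pass: test each <relation> against the collected identifiers.
    errors = []
    for name, attributes in elements:
        if name != "relation":
            continue
        target = attributes.get("target")
        if target is None:
            errors.append("relation lacks the required target attribute")
        elif target not in identifiers:
            errors.append("target " + target + " matches no ID")
    return errors


document = [
    ("name", {"id": "P1"}),
    ("relation", {"target": "P1"}),   # legal: P1 is declared above
    ("relation", {"target": "P9"}),   # error: nothing is identified as P9
    ("relation", {}),                 # error: #REQUIRED value not supplied
]

for message in check_relations(document):
    print(message)
```

Collecting the identifiers in a first pass means that a <relation> may point forwards as well as backwards in the document.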
