4 Elements and their content

The level of description at which texts are composed solely of entities in the SGML sense defined above is not, however, a very satisfactory one. Texts are not simply sequences of words, still less of bytes; they contain instances of objects, such as paragraphs, titles, names and dates, represented by such sequences. All markup schemes, to a greater or lesser extent, attempt to identify, distinguish and describe these components. A consideration of such schemes indicates at least three important aspects of textual objects which need to be recognised: their extent -- that is, the points in the textual stream at which object instances begin and end; their type -- that is, the category to which object instances are assigned; and their context -- that is, their relationship to other object instances within the document. SGML addresses each of these concerns: everything in an SGML document is delimited explicitly in some way; a document is decomposed into elements of a named type; and a kind of textual grammar can be defined.

4.1 A note on syntax

Most discussions of SGML mention, if only in passing, that the particular characters and conventions used to represent SGML markup in a particular document can be redefined. This is of course a necessary consequence of the fact that SGML is not itself a markup language, but a method of describing such languages. However, for simplicity's sake, this document will follow customary practice in using the reference concrete syntax to represent SGML markup. This rebarbative phrase is actually quite a precise description of what it denotes: it is a ``concrete'' syntax, because it represents by particular characters (the < > ! and other delimiters) distinctions required by the SGML model of how markup should be described (its ``abstract'' syntax); it is provided for ``reference'' purposes, as an example of one generally useful way of representing the constructs of the language.

The SGML reference concrete syntax has two great advantages over most other ways of making concrete a view of the abstract structure of a markup language: everything is delimited (bracketed) explicitly, and very few special characters are needed. As we have already seen, entity references are delimited explicitly by the ampersand character and the semicolon. [See note 7] In the same way, element occurrences within an SGML document are explicitly delimited in the reference concrete syntax by named tags. There are two kinds of tag: start-tags, which indicate the beginning of an element, and end-tags, which indicate its end. The tags themselves are delimited by special characters: ``<'' marks the beginning of a start-tag, ``</'' marks the beginning of an end-tag, and ``>'' marks the end of either kind of tag. Between these delimiters is given a name identifying the type of element delimited by the start- and end-tag pair. For example, an embedded name element in a text might be tagged as follows:

             Call me <name>Ishmael</name>.

This is by no means the only way of indicating the presence of an SGML element within a text; it is however the most explicit, and hence that into which other representations are most generally mapped.

4.2 Content models

As suggested earlier, the primary function of the start and end tags within a marked-up text is to indicate the extent of a particular object or textual component (the SGML term is element) within it. In addition, the tags identify the category or type of the element which they delimit, by supplying a name for it (``name'' in the example above). An SGML-aware processor can thus easily identify the start and end of all elements of a given type within a document -- it can identify all names, all sentences, all paragraphs (etc) and process them in a way appropriate for such objects.
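To make the idea of locating every element of a given type concrete, here is a minimal sketch in Python. It is not an SGML parser: it assumes that every element has both a start-tag and an end-tag, and that elements of the requested type are not nested inside one another. The function name find_elements is invented for this illustration.

```python
import re

def find_elements(text, element_type):
    """Return (start, end, content) for each element of the given type.

    Simplifying assumptions: tags are never omitted, and elements of
    this type do not nest within themselves.
    """
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(element_type)),
        re.DOTALL,
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]

sample = "Call me <name>Ishmael</name>."
for start, end, content in find_elements(sample, "name"):
    print(start, end, content)
```

The offsets returned correspond to the extent of each element, and the name passed in corresponds to its type; a real SGML-aware processor would of course obtain both from a proper parse rather than from pattern matching.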

The content of a document element of a particular type (that is, the portion between the start and end tags) may consist simply of running text, perhaps including entity references. More usually, it will contain other embedded document elements; occasionally it may have no content at all. The ability of SGML to specify rules about how elements can nest within other elements is one of its chief strengths and is discussed further below. Here we simply note that elements of one type typically contain elements of another: for example, a parish register consists of a mixture of birth, marriage and death records, each of which contains elements such as names, dates and details of an event. We might thus expect to find such records encoded in SGML with different tags for <birth>, <marriage> and <death> elements, within each of which might be found <name> and <date> elements. In exactly the same way, a document such as this one might be encoded as a series of <paper> elements, each of which begins with a <title>, followed optionally by an <abstract>, and at least one (and probably several) <section>s, each composed of <paragraph>s.
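The nesting rules described informally above for <paper> can be sketched as a small grammar over the names of child elements. The following Python fragment is only an illustration of the idea, not SGML's own notation for such rules: the table CONTENT_MODELS and the function conforms are invented here, and the rule for <section> assumes that ``composed of <paragraph>s'' means one or more paragraphs.

```python
import re

# Each rule is a regular expression over space-separated child element names.
# These rules paraphrase the prose description; they are assumptions made
# for illustration, not declarations taken from any real document type.
CONTENT_MODELS = {
    "paper": r"title( abstract)?( section)+",
    "section": r"paragraph( paragraph)*",
}

def conforms(element_type, child_types):
    """Check a sequence of child element names against the rule for its parent."""
    rule = CONTENT_MODELS[element_type]
    return re.fullmatch(rule, " ".join(child_types)) is not None

print(conforms("paper", ["title", "abstract", "section", "section"]))
print(conforms("paper", ["abstract", "title", "section"]))
```

The first call reports that the sequence is acceptable, the second that it is not, since the title must come first.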

An empty element (one which has no content) may seem like a contradiction: what use can it be simply to tag a specific point in a text, especially if there is no way of associating information with it? At the very least, it should be possible to supply a name or other identifier to distinguish one such empty point in a text from another. Fortunately, SGML does provide a mechanism for adding such ``extra-textual'' information to the elements of a text: that of attributes, discussed in the next section.

4.3 Attributes and cross-references

Like ``entity'', the word ``attribute'' has a specific technical sense when used in the SGML context, which differs somewhat from its sense when used in the database design context. An SGML attribute is a category of information associated with a particular type of element, but not contained within it. Attributes are associated with particular element occurrences by including their name and value within the start-tag for the element concerned. For example:

  Call me <name type=Biblical>Ishmael</name>.

Here ``type'' is the name of an attribute associated with any occurrence of the <name> element; ``Biblical'' is the value defined for this attribute in the case of the example <name> shown above. [See note 8]

Attributes are used for two related purposes: they enable an identifying number or name to be associated with a particular element occurrence within a text (which might otherwise be missing), and they enable additional information missing from a text to be added to it without violating its integrity.

As an example of the first usage, consider the page or folio numbering of a historical source. There is a sense in which the individual pages of a source might be regarded as distinct elements within it. This is not however generally the primary focus of interest for those using it: in most cases, the page number is of importance only as a means of documenting where the other elements of the text occur. Moreover, the page numbers may not appear at all in the original source. In such cases, a tag <pb> (for ``page break'') may conveniently be used to mark the point in the text at which a new page begins. An attribute (say, n for ``number'') would then provide a convenient means of indicating the number of the page just begun: thus

                  text of page 3 ends here 
                   <pb n=4> 
                 text of page 4 starts here
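As a rough sketch of how a program might use such empty <pb> elements, the Python fragment below divides a text into pages at each page break. It relies on pattern matching rather than a real SGML parser, and the function name pages is invented for the illustration. Because the number of the first page is not recorded by any tag in the sample, it is assumed to be supplied by the caller.

```python
import re

# Matches <pb n=4>, <pb n='4'> or <pb n="4">; only simple values are handled.
PB = re.compile(r"<pb\s+n=['\"]?(\w+)['\"]?\s*>")

def pages(text, first_page):
    """Map each page number to the text found on that page."""
    result = {}
    current = first_page   # number of the page on which the text begins
    position = 0
    for match in PB.finditer(text):
        result[current] = text[position:match.start()].strip()
        current = match.group(1)   # the n attribute names the page just begun
        position = match.end()
    result[current] = text[position:].strip()
    return result

sample = "text of page 3 ends here <pb n=4> text of page 4 starts here"
print(pages(sample, "3"))
```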

As an example of the second usage, consider the common need for normalisation in prosopographical studies. One way of achieving this might be to associate an attribute such as ``key'' with each occurrence of <name> elements in a text, the value of which would be a regularized and encoded form of the name, which could also serve as an identifying key in a database derived from the text. For example:

             <name key='SMITJ04'>Jack Smyth</name>
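One way to see how such a key might feed a database is to gather, for each key value, the different forms of the name that occur in the text. The Python sketch below does this by pattern matching, which is adequate only for simple cases like the one shown. The function name name_index and the second name in the sample (``John Smith'' with the same key) are invented for the illustration.

```python
import re
from collections import defaultdict

# Matches <name key=...>...</name>, with or without quotes around the value.
NAME = re.compile(r"<name\s+key=['\"]?(\w+)['\"]?\s*>(.*?)</name>", re.DOTALL)

def name_index(text):
    """Group the name forms found in the text under their normalised key."""
    index = defaultdict(set)
    for key, form in NAME.findall(text):
        index[key].add(form)
    return dict(index)

# The second occurrence is a hypothetical variant spelling, added for illustration.
sample = (
    "<name key='SMITJ04'>Jack Smyth</name> witnessed the deed; "
    "<name key='SMITJ04'>John Smith</name> signed it."
)
print(name_index(sample))
```

Both spellings are filed under the single key, while the text itself retains each form exactly as it was written.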

Attribute values may be defaulted, taken from a controlled list or specified freely, the only constraint being that they cannot contain markup.

The most common use for attributes in the TEI and other SGML schemes is not however to categorise element occurrences in this way, but to identify them. In the TEI scheme, for example, every element is defined to have an ID attribute, which supplies a unique identifier for that particular textual element within the text. This makes possible the encoding of links between individual elements of a text in a simple and economical way. This facility is very commonly used in document preparation systems (such as TeX or Scribe) in order to link cross-references (such as ``see section 3 above'') within a text with the sections of a text to which they refer, when the section number is not known or may be dynamic. In SGML, such a system is completely generalizable. For example, let us suppose that we wish to encode a register of names in which the following passage occurs:

          John Smith, baker.
          Mary Smith, seamstress, wife of the above.

In this example we have two <entry>s, each containing a <name> and a <trade>. The second entry however contains an additional clause which states a relationship between it and another element. We begin by tagging the elements so far identified: [See note 9]

     <register>
        <entry>
           <name>John Smith</name>
           <trade>baker</trade>
       </entry>
       <entry>
          <name>Mary Smith</name>
          <trade>seamstress</trade>
          <relation>wife of the above</relation>
       </entry>
  ...
    </register>

Clearly ``wife of the above'' is meaningless as a relation unless we have some way of pointing to the entry with which it is linked. Let us assume that the referent of ``the above'' is the whole of John Smith's entry rather than just the name within it; the assumption does not affect the argument. What is needed is some way of identifying that entry uniquely; that identifying number can then be supplied as the target of the relationship. In other words, we need an identifying attribute (call it ``id'') that can be attached to any <entry> and a pointer attribute (call it ``target'') which can be attached to any <relation>. Using these, and inventing an arbitrary value for the identifier, we can encode the link implicit in the above text as follows:

     <register>
        <entry id=E1234>
           <name>John Smith</name>
           <trade>baker</trade>
       </entry>
       <entry>
          <name>Mary Smith</name>
          <trade>seamstress</trade>
          <relation target=E1234>wife of the above</relation>
       </entry>
  ...
    </register>

Here we have allocated the arbitrary name or identifier ``E1234'' to the baker's entry. By supplying that same identifier as the value for the target attribute associated with the <relation> element of the seamstress's entry, we assert both the existence of the relationship itself and its target. This simple solution to a well-known problem has several attractive features, but perhaps the most attractive is that it makes explicit the fact that the target of the relationship is an interpretation brought to bear on the text by the encoder of it, leaving the text itself unchanged. Other attributes (say, ``certainty'' or ``authority'') may also be imagined which might carry additional interpretative information associated with the link.
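A program can follow such a link by first indexing the entries that carry an identifier and then looking up each target value in that index. The Python sketch below shows one way this might be done. Since Python's standard library has no SGML parser, it assumes the register has been rewritten in a form an XML parser accepts: attribute values are quoted and the elided entries are left out. The function name resolve_relations is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The register from the example above, adapted so that an XML parser accepts it
# (quoted attribute values, elided entries omitted). This adaptation is an
# assumption made for the sketch.
REGISTER = """
<register>
  <entry id="E1234">
    <name>John Smith</name>
    <trade>baker</trade>
  </entry>
  <entry>
    <name>Mary Smith</name>
    <trade>seamstress</trade>
    <relation target="E1234">wife of the above</relation>
  </entry>
</register>
"""

def resolve_relations(xml_text):
    """Return (name, relation text, name of the entry pointed to) for each link."""
    root = ET.fromstring(xml_text)
    # Index every entry that has an identifier.
    by_id = {e.get("id"): e for e in root.iter("entry") if e.get("id")}
    links = []
    for entry in root.iter("entry"):
        for relation in entry.iter("relation"):
            target = by_id.get(relation.get("target"))
            if target is not None:   # ignore pointers with no matching entry
                links.append(
                    (entry.findtext("name"), relation.text, target.findtext("name"))
                )
    return links

print(resolve_relations(REGISTER))
```

Note that the words ``wife of the above'' are reported exactly as they stand in the text; the link to John Smith's entry comes entirely from the attributes.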
