The SGML reference concrete syntax has two great advantages over most other ways of making concrete a view of the abstract structure of a markup language: everything is delimited (bracketed) explicitly, and very few special characters are needed. As we have already seen, entity references are delimited explicitly by the ampersand character and the semicolon. [See note 7] In the same way, element occurrences within an SGML document are explicitly delimited in the reference concrete syntax by named tags. There are two kinds of tag: start-tags, which indicate the beginning of an element, and end-tags, which indicate its end. The tags themselves are delimited by special characters: ``<'' to mark the beginning of a start-tag, and ``</'' to mark the beginning of an end-tag. In either case, the character ``>'' is used to indicate the end of a tag. Between these delimiters is given a name identifying the type of element delimited by the start- and end-tag pair. For example, an embedded name element in a text might be tagged as follows:
Call me <name>Ishmael</name>.
This is by
no means the only way of indicating the presence of an SGML element
within a text; it is however the most explicit, and hence that into
which other representations are most generally mapped.
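The delimiters just described can be illustrated with a minimal sketch in Python. It is not an SGML parser: it assumes the reference concrete syntax delimiters only, ignores entity references and every form of markup minimization, and the names TAG and tokenize are invented for the illustration.

```python
import re

# A start-tag opens with "<", an end-tag with "</"; both close with ">".
# Between the delimiters comes the element name (anything after it is skipped).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def tokenize(text):
    """Split text into ('start', name), ('end', name) and ('text', data) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            # Running text found between two tags (or before the first tag).
            tokens.append(("text", text[pos:match.start()]))
        kind = "end" if match.group(1) else "start"
        tokens.append((kind, match.group(2)))
        pos = match.end()
    if pos < len(text):
        # Running text left over after the last tag.
        tokens.append(("text", text[pos:]))
    return tokens

tokens = tokenize("Call me <name>Ishmael</name>.")
```

Applied to the example above, the sketch yields the running text ``Call me '', a start-tag for name, the content ``Ishmael'', an end-tag for name, and the final full stop.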
The content of a document element of a particular type (that is, the portion between the start and end tags) may consist simply of running text, perhaps including entity references. More usually, it will contain other embedded document elements; occasionally it may have no content at all. The ability of SGML to specify rules about how elements can nest within other elements is one of its chief strengths and is discussed further below. Here we simply note that elements of one type typically contain elements of another: for example, a parish register consists of a mixture of birth, marriage and death records, each of which contains elements such as names, dates and details of an event. We might thus expect to find such records encoded in SGML with different tags for <birth>, <marriage> and <death> elements, within each of which might be found <name> and <date> elements. In exactly the same way, a document such as this one might be encoded as a series of <paper> elements, each of which begins with a <title>, followed optionally by an <abstract>, and at least one (and probably several) <section>s, each composed of <paragraph>s.
An empty element (one which has no content) may seem like a contradiction: what use can it be simply to tag a specific point in a text, especially if there is no way of associating information with it? At the very least, it should be possible to supply a name or other identifier to distinguish one such empty point in a text from another. Fortunately, SGML does provide a mechanism for adding such ``extra-textual'' information to the elements of a text: that of attributes, discussed in the next section.
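An attribute is supplied within the start-tag of the element to which it applies, as a name and a value joined by an equals sign. For example:
Call me <name type=Biblical>Ishmael</name>.
Here ``type'' is the name of an attribute associated with any occurrence of the <name> element; ``Biblical'' is the value defined for this attribute in the case of the example <name> shown above. [See note 8]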
Attributes are used for two related purposes: they enable an identifying number or name, which might otherwise be missing, to be associated with a particular element occurrence within a text, and they enable additional information missing from a text to be added to it without violating its integrity.
As an example of the first usage, consider the page or folio numbering of a historical source. There is a sense in which the individual pages of a source might be regarded as distinct elements within it. This is not, however, generally the primary focus of interest for those using it: in most cases, the page number matters only as a means of documenting where the other elements of the text occur. Moreover, the page numbers may not appear at all in the original source. In such cases, a tag <pb> (for ``page break'') may conveniently be used to mark the point in the text at which a new page begins. An attribute (say, n for ``number'') would then provide a convenient means of indicating the number of the page just begun, thus:
text of page 3 ends here
<pb n=4>
text of page 4 starts here
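To suggest how software might use such a page-break marker, here is a minimal Python sketch that files running text under the page number announced by the most recent <pb>. It rests on several assumptions: Python's lenient html.parser module stands in for a real SGML parser (it happens to accept unquoted attribute values), the class name PageSplitter is invented, and the number of the page on which the transcription starts has to be supplied by hand.

```python
from html.parser import HTMLParser

class PageSplitter(HTMLParser):
    """Collect running text under the page number given by each <pb n=...>."""

    def __init__(self, first_page):
        super().__init__()
        self.current = first_page        # page on which the transcription starts
        self.pages = {first_page: ""}    # page number -> text found on that page

    def handle_starttag(self, tag, attrs):
        if tag == "pb":
            # The n attribute names the page that begins at this point.
            self.current = dict(attrs)["n"]
            self.pages.setdefault(self.current, "")

    def handle_data(self, data):
        self.pages[self.current] += data

splitter = PageSplitter(first_page="3")
splitter.feed("text of page 3 ends here\n<pb n=4>\ntext of page 4 starts here\n")
splitter.close()

# Trim the line breaks that surround the marker in the source.
pages = {number: text.strip() for number, text in splitter.pages.items()}
```

The empty <pb> element contributes nothing to the text itself; only its n attribute is used, to decide where each stretch of text belongs.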
As an example of the second usage, consider the common need for
normalisation in prosopographical studies. One way of achieving this
might be to associate an attribute such as ``key'' with each
occurrence of <name> elements in a text, the value of which
would be a regularized and encoded form of the name, which could also
serve as an identifying key in a database derived from the text. For
example:
<name key='SMITJ04'>Jack Smyth</name>
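The way such a key might serve to bring variant spellings together can be sketched in Python as follows. This is an illustration under assumptions, not a description of any particular system: html.parser again stands in for an SGML parser, the class name NameIndexer is invented, and the second spelling in the sample text is made up for the purpose.

```python
from collections import defaultdict
from html.parser import HTMLParser

class NameIndexer(HTMLParser):
    """Group the attested forms of each name under its regularized key."""

    def __init__(self):
        super().__init__()
        self.index = defaultdict(list)   # key -> name forms as written in the text
        self._key = None                 # key of the <name> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "name":
            self._key = dict(attrs).get("key")

    def handle_endtag(self, tag):
        if tag == "name":
            self._key = None

    def handle_data(self, data):
        # Only text inside a keyed <name> element is recorded.
        if self._key is not None:
            self.index[self._key].append(data)

indexer = NameIndexer()
# The second spelling is a hypothetical variant, invented for this sketch.
indexer.feed(
    "<name key='SMITJ04'>Jack Smyth</name> and, later, "
    "<name key='SMITJ04'>Jacke Smythe</name>"
)
indexer.close()
index = dict(indexer.index)
```

Both spellings end up listed under the single key SMITJ04, while the text itself keeps each name exactly as it was written.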
Attribute values may be defaulted, taken from a controlled list or specified freely, the only constraint being that they cannot contain markup.
The most common use for attributes in the TEI and other SGML schemes is not however to categorise element occurrences in this way, but to identify them. In the TEI scheme, for example, every element is defined to have an ID attribute, which supplies a unique identifier for that particular textual element within the text. This makes possible the encoding of links between individual elements of a text in a simple and economical way. This facility is very commonly used in document preparation systems (such as TeX or Scribe) in order to link cross-references (such as ``see section 3 above'') within a text with the sections of a text to which they refer, when the section number is not known or may be dynamic. In SGML, such a system is completely generalizable. For example, let us suppose that we wish to encode a register of names in which the following passage occurs:
John Smith, baker.
Mary Smith, seamstress, wife of the above.
In this example we have two <entry>s, each containing a
<name> and a <trade>. The second entry however
contains an additional clause which states a relationship between it and
another element. We begin by tagging the elements so far
identified: [See note 9]
<register>
<entry>
<name>John Smith</name>
<trade>baker</trade>
</entry>
<entry>
<name>Mary Smith</name>
<trade>seamstress</trade>
<relation>wife of the above</relation>
</entry>
...
</register>
Clearly ``wife of the above'' is meaningless as a relation unless we
have some way of pointing to the entry with which it is linked. Let us
assume that the referent of ``the above'' is the whole of John
Smith's entry rather than just the name within it; the assumption does
not affect the argument. What is needed is some way of identifying that
entry uniquely; that identifier can then be supplied as the
target of the relationship. In other words, we need an identifying
attribute (call it ``id'') that can be attached to any
<entry> and a pointer attribute (call it ``target'') which
can be attached to any <relation>. Using these, and inventing
an arbitrary value for the identifier, we can encode the link implicit
in the above text as follows:
<register>
<entry id=E1234>
<name>John Smith</name>
<trade>baker</trade>
</entry>
<entry>
<name>Mary Smith</name>
<trade>seamstress</trade>
<relation target=E1234>wife of the above</relation>
</entry>
...
</register>
Here we have allocated the arbitrary name or identifier ``E1234'' to
the baker's entry. By supplying that same identifier as the value for
the target attribute associated with the <relation> element of
the seamstress's entry, we assert both the existence of the relationship
itself and its target. This simple solution to a well-known
problem has several attractive features, but perhaps the most attractive
is that it makes explicit the fact that the target of the relationship
is an interpretation brought to bear on the text by the encoder of it,
leaving the text itself unchanged. Other attributes (say,
``certainty'' or ``authority'') may also be imagined which might
carry additional interpretative information associated with the link.
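As a rough illustration of how a program might follow such a link, the Python sketch below reads the register encoded above and looks up the entry to which the relation points. The assumptions are these: html.parser is used as a lenient stand-in for an SGML parser, the class name LinkResolver and its fields are invented, and the elided entries (``...'') are left out of the sample.

```python
from html.parser import HTMLParser

REGISTER = """
<register>
<entry id=E1234>
<name>John Smith</name>
<trade>baker</trade>
</entry>
<entry>
<name>Mary Smith</name>
<trade>seamstress</trade>
<relation target=E1234>wife of the above</relation>
</entry>
</register>
"""

class LinkResolver(HTMLParser):
    """Record each <entry> and index the identified ones by their id value."""

    def __init__(self):
        super().__init__()
        self.entries = []    # one dict per <entry>, in document order
        self.by_id = {}      # id value -> entry dict
        self._field = None   # child element currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "entry":
            entry = {"id": attrs.get("id")}
            self.entries.append(entry)
            if entry["id"] is not None:
                self.by_id[entry["id"]] = entry
        elif tag in ("name", "trade", "relation") and self.entries:
            self._field = tag
            self.entries[-1][tag] = ""
            if tag == "relation":
                # The pointer is kept separately from the words of the text.
                self.entries[-1]["target"] = attrs.get("target")

    def handle_endtag(self, tag):
        if tag == self._field:
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.entries[-1][self._field] += data

resolver = LinkResolver()
resolver.feed(REGISTER)
resolver.close()

mary = resolver.entries[1]
target = resolver.by_id[mary["target"]]   # follow target=E1234 to id=E1234
summary = f'{mary["name"]}: {mary["relation"]} -> {target["name"]}, {target["trade"]}'
```

The words ``wife of the above'' are kept exactly as they stand in the source; the link to John Smith's entry is carried entirely by the target attribute, which is where the encoder's interpretation is recorded.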