\documentstyle[11pt,titlepage]{article}
\frenchspacing
\def\tag#1{{\boldmath\bf #1}}
\def\smartitalicx{\ifx\next,\else\ifx\next-\else\ifx\next.\else\/\fi\fi\fi}
\def\smartitalic#1{{\it#1}\futurelet\next\smartitalicx}
\title{TEI EDW25\\ What is SGML and how does it help?}
\author{Lou Burnard}
\date{1 August 1991, revised 3 Oct 1991}
\begin{document} \maketitle
\begin{abstract}
SGML is an abbreviation for ``Standard Generalized Markup Language''.
This language, or rather metalanguage, was first defined by an
International Standard in 1986,\footnote{International Organization for
Standardization, \smartitalic{ISO 8879: Information processing --- Text and
office systems --- Standard Generalized Markup Language
(SGML),} ({[Geneva]}: ISO, 1986).} and is currently
undergoing its first five-year review. While some changes are likely, it
is certain that the standard will be with us for many years to come. As
a number of detailed technical descriptions of SGML are already
available,\footnote{Examples include Bryan (1988), van Herwijnen (1990) and
Goldfarb (1991).} this paper\footnote{Originally published as part of
Greenstein, D.I. (ed) \smartitalic{Modelling Historical Data}
(G\"{o}ttingen, Max-Planck-Inst. f\"{u}r Geschichte, 1991)} will briefly
describe the purpose and
scope of the standard, aiming to persuade the non-technically minded
reader that it has something to offer him or her.\end{abstract}
\section{What is SGML for?}
\par
The objectives of those who designed SGML were simple. Confronted with
an increasing number of so-called ``markup languages'' for electronic
texts, each more or less bound to a particular kind of processing or
even to a particular software package, they sought to define a single
language in which all such schemes could be re-expressed, so that the
essential information represented by such texts could be transferred
from one program or application to another. I therefore give below
a slightly more formal definition of what is meant by the term ``markup
language''. A universal language necessarily presupposes some basic
concepts or semantic primitives in which the notions of all other
languages can be expressed: the semantic primitives of SGML are simple
and few in number, and their definition forms the bulk of the rest of
this paper. I begin however with a few remarks about what SGML is
{\bf not}.
\par
Newcomers to SGML often think of it as a special case of the kind of
markup language with which they may be familiar. They expect it to
define a universal set of tags or to define exactly what tags mean, in
terms of how the items identified by tags are to be processed. But the
semantics of a markup language are precisely what SGML does not concern
itself with: it describes only the formal properties and inter-relation
of the components of a document. It does not tell you what it means to define
part of a text as belonging to some category (say, ``blort''); it
simply tells you how things-called-blort can legally appear within texts
--- whether they can be decomposed into ``blortettes'', or whether
more than one can appear at the start of a document, and so on.
Determining what a thing-called-blort actually may be is inextricably
entangled with how the text is to be processed, and the function of SGML
is to define the content of a document in terms that are entirely
independent of its processing.
\par
It follows from this that it is nonsensical to think of SGML as
a kind of text formatting system (although its origins can be
readily traced in the world of electronic text formatting), or as a
competitor for such languages as TeX or PostScript. These are languages
which define how ink (or its equivalent) is to be placed onto paper (or
its equivalent); they are not primarily concerned with the formal
structure of the language represented by those placements of
dark-on-light. SGML by contrast is decidedly unhelpful about how texts
are to be reproduced, since this is but one of the many applications for
which a text may be placed into electronic form. Its strength is that by
separating the notion of what the text actually {\bf is} from how the
text {\bf is rendered}, it makes possible the use of the same text by
many different kinds of processor.
\par
As a simple example, consider the headings used to introduce the
subdivisions of the present document. These need to be separated from
the body of the text so that they can be formatted in a particular way.
However, I have not yet decided how --- and it is more than likely
that those responsible for printing this text will prefer to format them
in some other way in any case. If therefore I use the facilities
available on my word processor for the display of headings --- say a
change of font size, a margin indent and a switch to bold font --- I
will not be helping the typesetter's task very much. Moreover, should I wish to
prepare a list of the subheadings in my text to serve as an index, I
will very probably find it quite difficult to distinguish occasions
where bold
font indicates headings from cases where it indicates (say) emphasized
phrases in the text. By the same token, I will find it difficult to
check that each subsection has one and only one subheading, that any
numbers included in the subheadings are in the right sequence and so
forth. And when, in the fullness of time, this text enters the great
database of late twentieth century prose, future linguists and
historians will have comparable difficulties in assessing the linguistic
properties of text used as subheadings as distinct from those of the
main text. If however, I simply tag each sub-heading as such, using some
unique string of codes to say ``here begins the text of a
subheading'' and some other to mark its end, then the same input text
can be used unchanged by any formatter, any indexing program and any
linguistic analysis program. Each one will be able to decide for itself
what it wants to do with the subheadings --- how it would like to
process them, if at all.
\par
While indexing the subheadings in a document of this nature is clearly
of somewhat limited importance, it should be apparent that the solution
proposed for that problem is an entirely generalisable one. Consider for
example any of the various kinds of historical source materials
described elsewhere in this volume.\footnote{See Greenstein, D.I.
(ed), \smartitalic{Modelling Historical Data}.}
Which is likely to be of more use in compiling a list of the names in an
electronic transcription of the records of an ecclesiastical court
--- a version in which the names are simply italicised (as are
for example Latin phrases, running titles, annotations etc.) or one in
which each name is marked off clearly by a tag such as \tag{$<$name$>$}?
Which is likely to be of more use in extracting statistical data for
input to a spreadsheet analysis of the average age of offenders ---
a version in which birth dates are clearly marked as such, perhaps
incorporating some normalised version of the date concerned, or one in
which all dates are simply intermingled with the running text?
\section{Markup and Markup Languages}
\par
The word ``markup'' was originally used to describe annotation or
other marks within a text intended to instruct a compositor or typist
how a particular passage should be printed or laid out. Examples,
familiar to proof readers and others, include wavy underlining to
indicate boldface, special symbols for passages to be omitted or printed
in a particular font and so forth. As the production of texts was
automated, the term was extended to cover all sorts of special ``markup
codes'' inserted into electronic texts to govern formatting,
printing, or other processing.
\par
Generalizing from that sense, we define markup, or (synonymously)
\smartitalic{encoding}, as any means of making explicit an interpretation
of a text. At a banal level, all printed texts are encoded in this
sense: punctuation marks, use of capitalization, disposition of letters
around the page, even the spaces between words, might be regarded as a
kind of markup, the function of which is to help the human reader
determine where one word ends and another begins, or how to identify
gross structural features such as headings or simple syntactic units
such as dependent clauses or sentences. Encoding a text for computer
processing is in principle, like transcribing a manuscript from
\smartitalic{scriptio continua}, a process of making explicit what is
conjectural or implicit, a process of directing the user as to how the
content of the text should be interpreted.
\par
A ``markup language'' may be no more than a loose set of markup
conventions used together for encoding texts. A markup language must
specify what markup is allowed and whereabouts, what markup is required,
how markup is to be distinguished from text, and what the markup means.
As noted above, SGML provides the means for doing the first three of
these only; it allows you to describe a markup language independently of
what the markup is intended to do. To understand and act upon the
markup, additional semantic information is needed, which will differ in
different situations. Documentation like that enshrined in the TEI's
\smartitalic{Guidelines}\footnote{ACH-ACL-ALLC \smartitalic{Guidelines
for the encoding and interchange of machine-readable texts}, edited by
C.M. Sperberg-McQueen and Lou Burnard (Chicago and Oxford, Text Encoding
Initiative, October, 1990)} provides such information. In just the same
way as one may be able to parse the syntactic structure of a Latin unseen
without having the least idea what it is about, so an SGML-aware
processor can analyze the structure of an SGML-encoded document with no
sense of its meaning. This independence is necessary, given the
open-ended nature of electronic textual applications. It does not, of
course, imply that the intentions of the encoder of a text are
unimportant or vacuous; only that they are formally distinct from the
encoding itself.
\par
Three basic concepts are fundamental to an understanding of all markup
languages, when described in SGML terms. These are the notions of a
markup \smartitalic{entity}, a markup \smartitalic{element}, with its
associated \smartitalic{attribute}s, and a \smartitalic{document type}.
At the most primitive level, texts are composed simply of
streams of symbols (characters or bytes of data, marks on a page,
graphics, etc.): these are known as \smartitalic{entities} in SGML. At a higher
level of abstraction, a text is composed of representations of objects
of various kinds, linguistically or functionally defined. Such objects
do not appear randomly within a text: coherence demands that particular
types of object appear in specifiable relationship to other objects
--- they may be included within each other, linked to each other by
reference or simply presented sequentially, for example. This level of
description sees texts as composed of structurally defined objects,
known as \smartitalic{elements} in SGML. The grammar defining how elements may
be legally combined in a particular class of texts is known as a
\smartitalic{document type}. [This view of the nature of text has been
nicely defined by DeRose (1990) as an ``ordered
hierarchy of content objects''.] These three fundamental concepts
together are, it is claimed, adequate to describe all the complexities
of marked-up texts, of whatever kind and for whatever purposes. Each is
discussed in turn in the next three sections.
\section{Entities}
\par The word ``entity'' is used in SGML rather differently from its use
elsewhere,
notably in database design methodology. An SGML entity is simply a named
bit of text, considered entirely independently of any structural or
categorical classification it might have. A document may be an SGML
entity, as may any arbitrary sequence of characters within it, or any
symbol it contains. The definition of an entity associates a name with a
particular string of bytes, which may be the representation of some
characters in a particular computer encoding or held in a system-defined
container of some kind (such as a file). Within an SGML document,
entities are represented by reference, using the defined name. This
mechanism has a number of important uses, specified further below,
primarily in making it possible to encode textual features such as
special characters or symbols which are unique to a particular
environment or application in a way that is independent of that
particular environment.
\par
Everyone who has communicated by electronic mail, or even tried to move
a file from a Macintosh computer to a PC, knows that some of the symbols
of which texts are composed are less portable than others. With the best
will in the world, computer manufacturers and standards bodies alike
will never be able to represent all the possible symbols occurring in
written texts in a single universally agreed code set, simply because
these symbols do not form a closed set: the task is as hopeless as that
of enumerating all the words in a language. Moreover, it is a fact of
life that different computing environments adopt different methods of
representing the same symbols, disregard entirely the existence of some
and insist on distinguishing others.
\par
A notorious consequence of this state of affairs is that the second
letter of Hans J\o rgen Marker's second name will look perfectly
satisfactory when typed on a keyboard here in Odense, but when
transmitted over the network to the UK will either be mysteriously
transmuted into a percent sign, or lost completely. How then is this
name to be stored in a computer file in such a way as to ensure that it
can be satisfactorily processed by any computer, not just those which
have the decency to be aware of the Danish national character set?
\par
Exactly the same problem arises, in a more acute form, when considering
the range of symbols likely to be required in transcribing manuscript texts
or spoken language. There is no computer character set in which the long
form of s is distinguished from the short, still less for distinguishing
ligatured forms of the same letters, or for representing all scribal
abbreviations, astrological symbols, non-vocalic grunts, pauses etc.
Nor, UNICODE notwithstanding, do I think it likely that there ever will
be.
\par
The SGML solution is to encode characters not available in the
particular character set used for document transmission\footnote{I do not
discuss here the possibility of using variant character sets within an
SGML document; though possible this does not of course solve the general
case.} by means of entity reference. If Hans J\o rgen is
represented as Hans J\&oslash;rgen, I can associate the unlovely
acronym \smartitalic{\&oslash;} with whatever particular stream
of bytes is
necessary on my computer to produce the slashed-o to which he is
entirely entitled.\footnote{The use of the ampersand and semicolon to
delimit the acronym is an example of the SGML \smartitalic {reference
concrete syntax}, discussed briefly below. It is not a necessary part of
the SGML solution; merely a conventional one.}
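\par
As a minimal sketch of how such a name comes to be bound to its
replacement, an entity is declared before it is first referenced. The
declaration below is modelled on the public entity sets published with
ISO 8879; the keyword SDATA marks the quoted string as system-specific
data, which each installation is expected to map onto its own
representation of the character. The surrounding comments are mine.
\begin{verbatim}
<!-- in the declarations that precede the document -->
<!ENTITY oslash SDATA "[oslash]" -- small o with slash -->

<!-- in the text itself -->
Hans J&oslash;rgen Marker
\end{verbatim}
Only the declaration need change when the text moves to a different
computing environment; the references within the text stay as they are.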
\par
Some have objected to the apparent verbosity of such mnemonics, by
comparison with the variety of encoding tricks or ad hoc solutions
customarily resorted to. The advantage of the entity reference solution
is simply that it forms part of a single and consistent convention,
comprehensible without resort to special purpose documentation (which is
generally absent). Sets of standard mnemonics for all the accented
letters and special symbols of modern European languages are to be found
in ISO 8879 and elsewhere.
\par
The same mechanism can be applied more widely for any stream of bytes to
be included within a document. The use of a single short
abbreviation for a much repeated or particularly complex phrase
is a simple way of ensuring
consistency and reducing effort; it is worth noting in this context that
the value of an entity reference can include other mark-up, such as tags
or other entity references, provided that any element opened within a
given entity is also closed within it.
This method has been adopted for example by the TEI committee
responsible for defining linguistic annotation. Another use is for
identifying objects which cannot be directly represented in a text, for
example non-textual entities like graphics or formulae. More mundane
uses are not difficult to identify.
\par
It should be stressed that entities have no structural properties: they
are simply shortcuts enabling an SGML-aware processor to substitute a
system-defined string of bytes for a name identified as such by special
SGML delimiters. As such they are merely a special (if well thought out)
way of doing the kinds of things which transcribers and encoders of text
have already been doing for many years.
\section{Elements and their content}
\par
The level of description at which texts are composed solely of entities
in the SGML sense defined above is not, however, a very satisfactory
one. All markup schemes, to a greater or lesser extent, attempt to
identify and to distinguish components of texts at a more ambitious
level of description. Texts are not simply sequences of words, still
less of bytes; they contain instances of objects, such as paragraphs,
titles, names, dates etc., represented by such sequences.
A consideration of such schemes indicates at least three
important aspects of textual objects which need to be recognised: their
\smartitalic{extent} --- that is, the points in the textual stream at
which object instances begin and end; their \smartitalic{type} ---
that is, the category to which object instances are assigned; and their
\smartitalic{context} --- that is, their relationship to other object
instances within the document. SGML addresses each of these concerns:
everything in an SGML document is delimited explicitly in some way; a
document is decomposed into elements of a named type; and a kind of
textual grammar can be defined.
\subsection{A note on syntax}
\par
Most discussions of SGML mention, if only in passing, that the particular
characters and conventions used to represent SGML markup in a particular
document can be redefined. This is of course a necessary consequence of
the fact that SGML is not itself a markup language, but a method of
describing such languages. However, in order to stay sane, this document
will follow customary practice in using the \smartitalic{reference concrete
syntax} to represent SGML markup. This rebarbative phrase is
actually quite a precise description of what it denotes: it is a
``concrete'' syntax, because it represents by particular characters
(the characters ``$>$'', ``$<$'', ``!'' and other delimiters)
distinctions
required by the SGML model of how markup should be described
(its ``abstract'' syntax); it is provided for ``reference'' purposes,
as an example of one generally useful way of representing the constructs
of the language.
\par
The SGML reference concrete syntax has two great advantages over most
other ways of making concrete a view of the abstract structure of a
markup language: everything is delimited (bracketed) explicitly, and
very few special characters are needed. As we have already seen, entity
references are delimited explicitly by the ampersand character and the
semicolon.\footnote{The semicolon is not in fact strictly necessary in
all situations: the end of an entity reference is signalled by the first
character encountered which cannot form part of a name. A space is
therefore sufficient; since however entity references are often found
within words, rather than between them, the semicolon is often
necessary to indicate where the entity name ends and the word within
which it is embedded resumes. } In the same way, element
occurrences within an SGML document are explicitly delimited in the
reference concrete syntax by named \smartitalic{tag}s. There are two
kinds of tag: \smartitalic{start-tag}s, which indicate the beginning of
an element, and \smartitalic{end-tag}s, which indicate its end. The tags
themselves are delimited by special characters: ``$<$'' to mark the
beginning of a start-tag, and ``$<$/'' to mark the beginning of an
end-tag. In either case, the character
``$>$'' is used to indicate the end of a tag. Between these delimiters is given
a name identifying the type of element delimited by the start and end tag pair.
For example, an embedded name element in a text might be tagged as
follows:
\begin{verbatim}
Call me <name>Ishmael</name>.
\end{verbatim}
This is by no means the only way of indicating the
presence of an SGML element within a text; it is however the most
explicit, and hence that into which other representations are most
generally mapped.
\subsection{Content models}
\par
As suggested earlier, the primary function of the start and end tags
within a marked-up text is to indicate the extent of a particular object
or textual component (the SGML term is \smartitalic{element}) within it.
In addition, the tags identify the category or type of the element which
they delimit, by supplying a name for it (``name'' in the example
above). An SGML-aware processor can thus easily identify the start and
end of all elements of a given type within a document --- it can
identify all names, all sentences, all paragraphs (etc) and process them
in a way appropriate for such objects.
\par The content of a document element of a particular type (that is, the
portion
between the start and end tags) may consist simply of running text,
perhaps including entity references. More usually, it will contain other
embedded document elements; occasionally it may have no content at all.
The ability of SGML to specify rules about how elements can nest within
other elements is one of its chief strengths and is discussed further
below. Here we simply note that elements of one type typically contain
elements of another: for example, a parish register consists of a
mixture of birth, marriage and death records, each of which contains
elements such as names, dates and details of an event. We might thus
expect to find such records encoded in SGML with different tags for
\tag{$<$birth$>$},
\tag{$<$marriage$>$} and \tag{$<$death$>$} elements, within each of which might
be found \tag{$<$name$>$} and \tag{$<$date$>$} elements. In exactly the
same way, a document such as this one might be encoded as a series of
\tag{$<$paper$>$} elements, each of which begins with a \tag{$<$title$>$},
followed optionally by an \tag{$<$abstract$>$}, and at least one (and
probably several) \tag{$<$section$>$}s, each composed of
\tag{$<$paragraph$>$}s.
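\par
The nesting just described might look as follows in the reference
concrete syntax. This is only a sketch: the tag names are those
suggested above, the text between the tags is abbreviated, and no
particular DTD is implied.
\begin{verbatim}
<paper>
<title>What is SGML and how does it help?</title>
<abstract>SGML is an abbreviation for ...</abstract>
<section>
<paragraph>The objectives of those who designed SGML ...</paragraph>
<paragraph>Newcomers to SGML often think of it as ...</paragraph>
</section>
</paper>
\end{verbatim}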
\par
An empty element (one which has no content) may seem like a
contradiction: what use can it be simply to tag a specific point in a
text, especially if there is no way of associating information with it?
At the very least, it should be possible to supply a name or other
identifier to distinguish one such empty point in a text from another.
Fortunately, SGML does provide a mechanism for adding such
``extra-textual'' information to the elements of a text: that of
\smartitalic{attributes}, discussed in the next section.
\subsection{Attributes and cross-references}
\par
Like ``entity'', the word ``attribute'' has a specific technical
sense when used in the SGML context, which differs somewhat from its
sense when used in the database design context. An SGML attribute is a
category of information associated with a particular type of element,
but not contained within it. Attributes are associated with particular
element occurrences by including their name and value within the
start-tag for the element concerned. For example:
\begin{verbatim}
Call me <name source="Biblical">Ishmael</name>.
\end{verbatim}
Here ``source'' is the name of an attribute associated with any
occurrence of the \tag{$<$name$>$} element; ``Biblical'' is the value
defined for this attribute in the case of the example \tag{$<$name$>$}
shown above.\footnote{Categorisation of this kind could of course
equally well be achieved by using a different tag --- say
\tag{$<$biblical.name$>$}. The decision as to whether to use an
attribute or a distinct element type is often a difficult one, involving
more detailed technical and design skill than it is appropriate to
consider here.}
\par
Attributes are used for two related purposes:
they enable an identifying number or name to be associated with a
particular element occurrence within a text (which might otherwise be
missing), and they enable additional information missing from a text to
be added to it without violating its integrity.
\par
As an example of the first usage, consider the page or folio numbering
of a historical source. There is a sense in which the individual pages
of a source might be regarded as distinct elements within it. This is
not however generally the primary focus of interest for those using it:
in most cases, the number of the page only is of importance as a means
of documenting where the other elements of the text occur. Moreover, the
page numbers may not appear at all in the original source. In such
cases, a tag \tag{$<$page.break$>$} may conveniently be used to mark the
point in the text at which a new page begins. An attribute (say,
``number'') would then provide a convenient means of indicating the
number of the page just begun: thus
\begin{verbatim}
text of page 3 ends here
<page.break number="4">
text of page 4 starts here
\end{verbatim}
As an example of the second usage, consider the common need for
normalisation in prosopographical studies. One way of achieving this
might be to associate an attribute such as ``normal'' with each
occurrence of \tag{$<$name$>$} elements in a text, the value of which
would be a normalised or encoded form of the name, which could also
serve as an identifying key in a database derived from the text. For
example:
\begin{verbatim}
<name normal="SmithJ">Jack Smyth</name>
\end{verbatim}
Attribute values may
be defaulted, taken from a controlled list or specified freely, the only
constraint being that they cannot contain markup.
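\par
These three possibilities can be sketched as attribute list
declarations, using the declaration syntax touched on in the section on
ensuring consistency below. The attribute names are those used in the
examples above; the list of permitted values for ``source'' and its
default are invented purely for illustration.
\begin{verbatim}
<!ATTLIST name
          source  (Biblical | classical | unknown)  unknown
          normal  CDATA                             #IMPLIED>
<!ELEMENT page.break  - O  EMPTY>
<!ATTLIST page.break
          number  NUMBER                            #REQUIRED>
\end{verbatim}
Here ``source'' must take one of the listed values and defaults to
``unknown'' when omitted; ``normal'' may hold any character string and
may be left out altogether; and ``number'' must always be supplied and
may consist only of digits. The keyword EMPTY declares
\tag{$<$page.break$>$} to be an element with no content.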
\par
The most common use for attributes in the TEI and other SGML schemes is
not however to categorise element occurrences in this way, but to
identify them. In the TEI scheme, for example, every element is defined
to have an ID attribute, which supplies a unique identifier for that
particular textual element within the text. This makes possible the
encoding of links between individual elements of a text in a simple and
economical way. This facility is very commonly used in document
preparation systems (such as TeX or Scribe) in order to link
cross-references (such as ``see section 3 above'') within a text with
the sections of a text to which they refer, when the section number is
not known or may be dynamic. In SGML, such a system is completely
generalisable. For example, let us suppose that we wish to encode a
register of names in which the following passage occurs:
\begin{verbatim}
John Smith, baker.
Mary Smith, seamstress, wife of the above.
\end{verbatim}
In this example we have two \tag{$<$entry$>$}s, each containing a
\tag{$<$name$>$} and a \tag{$<$trade$>$}. The second entry however
contains an additional clause which states a relationship between it and
another element. We begin by tagging the elements so far
identified:\footnote{For simplicity, I have omitted the end-tags in this
example; this is a legitimate abbreviatory convention in many
circumstances, as discussed further below.}
\begin{verbatim}
<entry><name>John Smith<trade>baker
<entry><name>Mary Smith<trade>seamstress
<relation>wife of the above
\end{verbatim}
Clearly ``wife of the above'' is meaningless as a relation unless we
have some way of pointing to the entry with which it is linked. Let us
assume that the referent of ``the above'' is the whole of John
Smith's entry rather than just the name within it; the assumption does
not affect the argument. What is needed is some way of identifying that
entry uniquely; that identifying number can then be supplied as the
target of the relationship. In other words, we need an identifying
attribute (call it ``id'') that can be attached to any
\tag{$<$entry$>$} and a pointer attribute (call it ``target'') which
can be attached to any \tag{$<$relation$>$}. Using these, and inventing
an arbitrary value for the identifier, we can encode the link implicit
in the above text as follows:
\begin{verbatim}
<entry id="E1234"><name>John Smith<trade>baker
<entry><name>Mary Smith<trade>seamstress
<relation target="E1234">wife of the above
\end{verbatim}
Here we have allocated the arbitrary name or identifier ``E1234'' to
the Baker's entry. By supplying that same identifier as the value for
the target attribute associated with the \tag{$<$relation$>$} element of
the Seamstress' entry, we assert not only the existence of the relationship
itself, but also its target. This simple solution to a well-known
problem has several attractive features, but perhaps the most attractive
is that it makes explicit the fact that the target of the relationship
is an interpretation brought to bear on the text by the encoder of it,
leaving the text itself unchanged. Other attributes (say,
``certainty'' or ``authority'') may also be imagined which might
carry additional interpretative information associated with the link.
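\par
SGML provides declared value types for exactly this purpose. As a
sketch, assuming the attribute names ``id'' and ``target'' chosen
above, the two attributes might be declared as follows:
\begin{verbatim}
<!ATTLIST entry     id      ID     #IMPLIED>
<!ATTLIST relation  target  IDREF  #REQUIRED>
\end{verbatim}
With declarations of this kind an SGML-aware processor can check that
no two elements in a document carry the same identifier, and that every
pointer refers to an identifier which actually occurs somewhere in
the document.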
\section{Ensuring consistency}
\par
While a rose might smell just as sweet by any other name, every computer
user knows that names intended for automatic processing must be spelled
exactly and defined precisely. The human reader might tolerate
paragraphs sometimes labelled \tag{$<$p$>$}, sometimes labelled
\tag{$<$para$>$}, and sometimes not labelled at all, but the computer is less
forgiving. Slightly less obviously perhaps,
the user of an SGML-aware software system needs to know what elements
have been defined for a given text (or group of texts) and what their
possible contents are. He or she needs to know not just whether personal
names should be tagged \tag{$<$propname$>$} or \tag{$<$name$>$}, but also
in what contexts personal names may reasonably be expected to appear
(for example, if something tagged as a name appears {\bf within} a
name, it is probably an error). He or she also needs to know what
attribute names have been defined for particular elements and their
legal values, and also what entity names should be used for particular
symbols. The formal specification of these names and their usage is
enshrined in a separate component, unique to SGML, known as a
\smartitalic{Document Type Definition} or DTD.
\subsection{Defining a document: the content model}
\par
A DTD performs a function analogous to that of a grammar: it formally
defines what are the legal productions of a given markup language. Of
course, DTDs can be as lax or as restrictive as any other kind of
grammar: the designer of a DTD generally has to trade off generality of
use with accuracy of error detection. The simplest kind of DTD would be
one which did no more than specify a set of tag names, requiring only
that every element tagged in a document use one of them. Such a DTD
would of course be unable to detect errors such as \tag{$<$name$>$}s
occurring within \tag{$<$name$>$}s or within \tag{$<$date$>$}s, nor to
prohibit such errors as register entries appearing other than inside
registers. Creating correctly encoded texts with such a DTD would be
rather like trying to speak a foreign language with the aid of a
lexicon of the language, but no idea of its syntax.
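\par
As a sketch, such a minimal DTD might consist of a single declaration.
The keyword ANY is SGML's way of saying that an element may contain
character data and any other declared element, in any order; the
particular list of names is assumed from the examples used in this
paper.
\begin{verbatim}
<!ELEMENT (register | entry | name | trade | relation | date)
          - -  ANY>
\end{verbatim}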
\par
More usually however, the transcribers and creators of electronic texts
wish to control how elements can meaningfully appear within a given
class of texts, so that processors intended to act on them can do so
more intelligently. The specification of what is legal within any one
kind of textual component or element is known in SGML as its
\smartitalic{content model}, because it provides a model for its content.
Here, for example, is a part of the formal DTD for the register example
given informally above.
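\begin{verbatim}
<!ELEMENT register  - -  (entry+)>
<!ELEMENT entry     - O  (name, trade?, relation?)>
<!ELEMENT name      - O  (#PCDATA)>
\end{verbatim}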
\par
These three lines are examples of SGML \smartitalic{declaration}s: each
defines or declares a name for an element and what its content should
be. The details of the syntax need not detain us; note only that each
declaration (like everything else in SGML) is explicitly delimited, in
this case by a symbol marking the start of a declaration (the ``$<$'')
and its end (the ``$>$''). The content model part of each declaration
is given in parentheses at the end. Between the name of the element
(``register'' in the first case) and the content model are two
characters which specify whether or not both start- and end-tags are
required to mark off occurrences of the element. The hyphen character
indicates that a tag is required, the letter O that it may be omitted.
Thus, in this example, \tag{$<$register$>$}s must have both a start and
an end tag, whereas \tag{$<$entry$>$}s and \tag{$<$name$>$}s can be
specified using start-tags only.
\par
The content model for register states that a \tag{$<$register$>$}
consists of one or more \tag{$<$entry$>$}s; the plus sign indicates that
the element before it can be repeated one or more times. Thus a register
containing no entries, or one containing something other than an entry,
would be regarded as an error by this DTD. The content model for an
entry states that a \tag{$<$entry$>$} must begin with a \tag{$<$name$>$},
followed by a \tag{$<$trade$>$} and then optionally by a
\tag{$<$relation$>$}. The commas between the components of this content
model indicate that the elements must all appear in the order given. The
question mark following the \tag{$<$relation$>$} indicates that this element
need not be present. Thus, an entry with no name, or one where the
trade preceded the name, would be regarded as erroneous by this
DTD, whereas entries with and without \tag{$<$relation$>$}s are
equally acceptable. Finally, the
content model for a \tag{$<$name$>$}
states that it may contain only text, that is, simply data with no
embedded tags. (The word ``\#PCDATA'' is a special SGML symbol
standing for \smartitalic{parsed character data} --- which must be
``parsed'' because it may contain entity references as well as raw
characters).
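\par
To see these declarations at work, consider the following sketch of a
complete, if tiny, document. It is illustrative only: the entries are
invented, and the fourth declaration (for \tag{$<$trade$>$} and
\tag{$<$relation$>$}) is an assumption made here so that the example is
self-contained. It also assumes a parser which permits tag omission.
\begin{verbatim}
<!DOCTYPE register [
<!ELEMENT register - - (entry+)>
<!ELEMENT entry    - O (name, trade, relation?)>
<!ELEMENT name     - O (#PCDATA)>
<!-- assumed for this sketch only -->
<!ELEMENT (trade | relation) - O (#PCDATA)>
]>
<register>
<entry><name>John Smith<trade>butcher
<entry><name>Mary Jones<trade>baker<relation>wife of the above
</register>
\end{verbatim}
Because the end-tags of entries, names, trades and relations are all
declared as omissible, a parser should be able to infer each of them
from the tag which follows. An entry written with its trade before its
name, by contrast, should be reported as an error.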
\par
SGML syntax allows for other variations, which will be necessary if we
are to refine this model to reflect more accurately the probable content
of register entries in the real world. We will begin by relaxing the
restriction on the number of \tag{$<$trade$>$}s an entry may contain:
\begin{verbatim}
<!ELEMENT entry    - O (name, trade*, relation?)>
\end{verbatim}
The asterisk following the word ``trade'' indicates that an entry
may contain zero or more \tag{$<$trade$>$}s. An entry such as the following
\begin{verbatim}
<entry><name>John Smith</name>
<trade>butcher</trade>
<trade>baker</trade>
<trade>candle-stick maker</trade>
</entry>
\end{verbatim}
would be legal according to this second definition, as would one like
this:
\begin{verbatim}
<entry><name>John Smith</name></entry>
\end{verbatim}
\par
Suppose however that entries are mixed, sometimes containing names and
trades, sometimes only one or the other. One possible content model for
this situation would be:
\begin{verbatim}
<!ELEMENT entry    - O ((name | trade | relation)+)>
\end{verbatim}
The vertical bar
symbol may be read as ``or''. This content model states that an
\tag{$<$entry$>$} must contain at least one component, which may be a
\tag{$<$name$>$}, a \tag{$<$trade$>$} or a \tag{$<$relation$>$}, and may
contain more than one of any of them, in any order. (The inner set of
parentheses is needed to indicate that the plus sign is to be applied to
the whole group of alternated names). The following entries would all be
legal according to this definition:
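\begin{verbatim}
<entry><name>John Smith</name></entry>
<entry><trade>Baker</trade><trade>Chandler</trade></entry>
<entry><relation>wife of the above</relation><name>Mary Jones</name></entry>
<entry><name>John Smith</name><name>Henry Jones</name><trade>smith</trade></entry>
\end{verbatim}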
As the last example indicates, such laxity of definition may lead to
difficulties of interpretation --- our syntax now cannot help us
determine whether it is Smith or Jones who is the smith. But presumably
in that respect we are being true to the source!
\par
A more realistic situation would be that names, trades and relationships
are found promiscuously intertwined in any order within any amount of
other text. SGML offers two ways of dealing with this. One is simply to
add \#PCDATA as a fourth alternate to the example above, to give a
declaration like this:
\begin{verbatim}
<!ELEMENT entry    - O ((#PCDATA | name | trade | relation)+)>
\end{verbatim}
This is the solution generally preferred in the TEI Guidelines for the
general case, in which an element consists of small identifiable elements
(names, trades etc.) swimming about in an arbitrary mixture or ``primal
soup''. Note that with this content model it is no longer possible to
leave out the end-tags of the embedded elements: without them a parser
cannot tell where a name ends and the surrounding text resumes.
Another approach achieving a similar effect is to
use what is known as an \smartitalic{inclusion exception}, as in this
example:
\begin{verbatim}
<!ELEMENT entry    - O (#PCDATA) +(name | trade | relation)>
\end{verbatim}
Either of the above definitions states that items tagged as names, trades or
relations
may appear anywhere within an entry, an arbitrary number of times,
interspersed with arbitrary sequences of character data. The following
example
would be regarded as legal according to both the above definitions:
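\begin{verbatim}
<entry>Also<name>John Smith</name></entry>
<entry><name>John Smith</name>
and<name>Adam Smith</name>
<relation>his brother</relation></entry>
<entry>Cousin of the<trade>blacksmith</trade></entry>
<entry>Old<relation>Uncle</relation>
<name>Tom Cobbley</name> and all</entry>
\end{verbatim}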
The difference between the two element declarations
is that the second allows names, trades, or relations to appear
arbitrarily not only within entries, but also within anything that
entries contain. Unless further modified, this definition allows (for
example) names to occur within trades (or vice versa), or even within
other names. Entries such as the following would be legal by the second
definition, but not the first:
\begin{verbatim}
<entry><name>John <trade>Smith</trade> Jones</name>
<relation>Brother of
<name>Henry
<trade>Potter</trade>
Jones</name></relation></entry>
\end{verbatim}
Of course, entries such as the following would also be acceptable by
either definition:
\begin{verbatim}
<entry>Any old string of characters you care to type in</entry>
\end{verbatim}
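\par
The qualification ``unless further modified'' can be made concrete. As a
sketch only (these declarations form no part of the register DTD
developed here), SGML also provides \smartitalic{exclusion exceptions},
written with a minus sign, which might be used to stop names nesting
inside names while keeping the inclusion on the entry:
\begin{verbatim}
<!ELEMENT entry - O (#PCDATA) +(name | trade | relation)>
<!ELEMENT name  - - (#PCDATA) -(name)>
\end{verbatim}
With this pair of declarations a \tag{$<$name$>$} may still contain a
\tag{$<$trade$>$}, by virtue of the inclusion on the entry, but a
\tag{$<$name$>$} appearing anywhere inside another \tag{$<$name$>$}
should be reported as an error.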
\par
To complete the content models for our simple register example, we need
to define the sub-components of an entry. For the sake of argument, we
will assume that \tag{$<$name$>$}, \tag{$<$trade$>$} and
\tag{$<$relation$>$} are to contain only text, with no further embedded
elements. Since they all share the same content model, a single
declaration will suffice:
\begin{verbatim}
<!ELEMENT (name | trade | relation) - - (#PCDATA)>
\end{verbatim}
\subsection{Defining a document: attribute lists}
\par
We now turn to the definition of attributes for each of the elements discussed
so far. As with elements, SGML requires that all attributes likely to be used
within a document must be defined in advance. It also offers a variety of
features
to control the values that specific attributes may take.
\par An example attribute declaration might look like the following:
\begin{verbatim}
<!ATTLIST name     id     ID                      #IMPLIED
                   norm   CDATA                   #IMPLIED
                   type   (personal | honorific)  personal>
\end{verbatim}
This declaration associates three different attributes with the element
\tag{$<$name$>$}. The first attribute is called ``id'', the second ``norm''
and the third ``type''. Note that all the attributes for a given
element are defined together in a single ATTLIST declaration. For each
attribute named, two additional pieces of information are required for
its declaration. The first, following the name of the attribute, defines
what type of value may be supplied for it, while the second, following
the type specification, indicates what value should be assumed if the
attribute is not specified for a given element occurrence.
\par
In this example, three different kinds of value are specified. The
``id'' attribute may take as its value only a string\footnote{More
exactly, a \smartitalic{name}, that is a sequence of alphanumeric
characters of which the first is a letter.} which the SGML
processor will treat as an identifier for the associated element: this
is indicated by the keyword ``ID''. An ID-value need not be supplied,
and if it is not, then a
processor may take whatever default action it chooses; this is indicated by the
keyword ``\#IMPLIED''.
\par
The ``norm'' attribute may take as its value any string of
characters; this is indicated by the keyword CDATA. The SGML processor
will not check the value of this attribute in any way, except of course
that it may not contain any form of markup.\footnote{It must also be
enclosed in quotation marks if it is not a name token.} Again,
there is no requirement to supply a value for this attribute.
\par
More exact checking is provided for in the case of the ``type''
attribute: here we have specified that only two values are legal for
this attribute --- the type of a name must be marked as either
``personal'' or ``honorific''. If no value is specified, the
processor is to assume that the intended value is ``personal''.
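\par
By way of illustration, start-tags using these three attributes might be
written as follows. This is a sketch only: the names and identifiers are
invented, and the unquoted values assume a parser which permits them
(the SHORTTAG feature).
\begin{verbatim}
<!-- id and norm supplied; type defaults to "personal" -->
<name id=P1 norm="Smith, John">John Smith</name>

<!-- no id or norm: both are #IMPLIED, so this is legal -->
<name type=honorific>His Worship the Mayor</name>

<!-- an error: "nickname" is not one of the declared values -->
<name type=nickname>Old Tom</name>
\end{verbatim}
The value of ``norm'' in the first tag is quoted because it contains a
comma and a space, and so is not a name token.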
\par
To illustrate some of the other possibilities for attribute
specifications, we conclude with a declaration for the attribute list to
be attached to the \tag{$<$relation$>$} element:
\begin{verbatim}
<!ATTLIST relation target     IDREF   #REQUIRED
                   certainty  NUMBER  0>
\end{verbatim}
This states that the relation element may have two attributes, called
``target'' and ``certainty''. The former takes as its value an
``IDREF'', that is a string which has been used as the value for
an ID-type attribute somewhere within the current document. The SGML
parser will check that the value actually identifies some other element,
as is the case in our example. The keyword ``\#REQUIRED'' means
additionally that no \tag{$<$relation$>$} can exist in the document
unless the element to which it points is specified in this way; this
keyword is used to specify that a value {\bf must} be supplied for
every occurrence of the attribute to which it is attached (the examples
of relations given above are thus all illegal by this definition!).
Finally, the ``certainty'' attribute is defined as taking a numeric
value, which defaults to zero.
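\par
Putting the pieces together, a complete small document using both
attribute lists might look like the following sketch. The entries and
identifiers are invented for illustration, and the element declarations
shown are one plausible combination of those discussed above rather
than a definitive DTD.
\begin{verbatim}
<!DOCTYPE register [
<!ELEMENT register - - (entry+)>
<!ELEMENT entry    - O (#PCDATA | name | trade | relation)+>
<!ELEMENT (name | trade | relation) - - (#PCDATA)>
<!ATTLIST name     id         ID                      #IMPLIED
                   norm       CDATA                   #IMPLIED
                   type       (personal | honorific)  personal>
<!ATTLIST relation target     IDREF                   #REQUIRED
                   certainty  NUMBER                  0>
]>
<register>
<entry><name id=P1 norm="Smith, John">John Smith</name>,
<trade>baker</trade></entry>
<entry><name id=P2>Mary Jones</name>,
<relation target=P1 certainty=1>wife of the above</relation></entry>
</register>
\end{verbatim}
Here the ``target'' value P1 matches the ``id'' of the first name, so
the reference can be resolved. A \tag{$<$relation$>$} with no
``target'', or with one naming an identifier used nowhere in the
document, should be reported as an error.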
\section{Why should I use a DTD?}
\par
It should be emphasized that the preceding brief discussion is by no
means comprehensive and is intended only to give a flavour of the kinds
of tools at the disposal of the SGML document designer. The interested
reader is referred to one of the introductory texts cited in the
bibliography for more comprehensive information. It should also be noted
that designing a DTD is not something which every user of an SGML system
needs to do afresh. On the contrary, it is the objective of endeavours
such as the Text Encoding Initiative to define general-purpose DTDs
which can be used for a wide variety of purposes, including those which
are the particular concern of historians.
\par Why, however, should anyone transcribing a historical source for analysis
care about the existence of such endeavours? How can a knowledge of SGML
help in understanding a text, in placing it into its proper context? It
should be relatively uncontentious that the mechanical drudgery of
transcribing, editing and reproducing texts is enormously simplified
by the uncoupling of the tasks of data {\bf interpretation} (tagging)
and data {\bf reproduction} (formatting). The former is an
essentially hermeneutical and scholarly act, while the latter is not.
Coombs, Renear et al. have persuasively argued that the proliferation of
sophisticated tools for desktop publishing has effectively seduced
scholars from their true vocation (Coombs 1987) and that SGML offers a
chance to regain the high ground.
\par Suppose that you have obtained, or created, a DTD which adequately
describes
the kind of source texts you wish to process. How will it be of use to
you? You will use it firstly as a means of checking that each
individual document you process conforms to the model you have defined.
As well as providing you with a diagnostic check on the accuracy of your
keyboarding, this will also provide what the Americans call a
\smartitalic{reality check} on your interpretation of the sources
concerned: you may be forced to confront aspects of your sources to
which an initial, possibly over-hasty, assessment has blinded you.
Moreover, because each part of an electronic text (or at least, each
part that has been tagged) is equally accessible, equally processable,
you can analyse the contents of your sources with a far higher degree of
accuracy and sophistication than either a simple transcription or a
database of derived observations would provide.
\par Historical sources do not belong to one individual, however. More perhaps
than other kinds of resource, they must be shareable, whether from
economic or ethical considerations. An electronic text encoded in SGML,
with its associated document type definition and appropriate
documentation, is a permanent asset, independent of time, place and
personality.
\pagebreak
\section{A brief bibliography}
\par
This bibliography lists some useful further reading for general
information about SGML and markup languages. A full and very detailed
SGML bibliography edited by Robin Cover and David T. Barnard
is also available as TEI document MLW14.
\begin{description}
\item[Bryan 88]
Bryan, Martin
\smartitalic{SGML: an Author's Guide to the Standard Generalized
Markup Language}
(Addison-Wesley, 1988)
{\small \smartitalic{Detailed text book giving full treatment of
the standard, but primarily from the publishing perspective.}}
\item[Coombs 87] Coombs, James H., et al.
``Markup Systems and the Future of Scholarly Text Processing''
\smartitalic{Communications of the ACM} Vol 30 no 11
(November, 1987) pp 933-47
{\small \smartitalic{Classic polemic in favour of descriptive over
procedural markup presented from the scholarly
perspective.}}
\item[DeRose 90] DeRose, Steven J., et al.
``What is text, really?''
\smartitalic{Journal of Computing in Higher Education}
Vol 1 no 2 (Winter, 1990)
\item[Goldfarb 91] Goldfarb, Charles
\smartitalic{The SGML Handbook}
                (Oxford University Press, 1991)
{\small \smartitalic{Authoritative and exhaustive presentation of all
aspects of ISO 8879, including annotated and cross
referenced full text of the standard itself.}}
\item[ISO 86]
                International Organization for Standardization
                \smartitalic{ISO 8879: Information processing - Text and office
                systems - Standard Generalized Markup Language (SGML)}
                (ISO, 1986)
{\small \smartitalic{Annexes A and B to the Standard provide a formal
but readable summary of its most important features.}}
\item[ISO 88]
                International Organization for Standardization
                \smartitalic{ISO/TR 9573: Information processing - SGML support
                facilities - Techniques for using SGML}
                (ISO, 1988)
{\small \smartitalic{Tutorial discussion of main features of the
standard with some interesting examples.}}
\item[van Herwijnen 90]
van Herwijnen, Eric
\smartitalic{Practical SGML}
(Kluwer, 1990)
{\small \smartitalic{Good introductory textbook with emphasis on how
SGML is currently being used.}}
\end{description}
\end{document}