Date:         Mon, 12 Jun 95 17:49:26 CDT
From:         "C. M. Sperberg-McQueen" <U35395@UICVM>
Organization: ACH/ACL/ALLC Text Encoding Initiative
Subject:      EDW49DR MEMO A1
To:           Lou Burnard <LOU@VAX.OX.AC.UK>,
              Wendy Plotkin <U49127@UICVM>
 
 
                 The Design of the TEI Encoding Scheme
 
 
                         C. M. Sperberg-McQueen
                              Lou Burnard
 
 
   This paper discusses the basic design of the encoding scheme
described by the Text Encoding Initiative's Guidelines for Electronic
Text Encoding and Interchange (TEI document number TEI P3, hereafter
simply P3 or the Guidelines).(1)  It reviews first the basic design
goals of the TEI project and their development during the course of the
project.  Next, it outlines some basic notions relevant for the design
of any markup language and uses those notions to describe the basic
structure of the TEI encoding scheme.  It also describes briefly the
"core" tag set defined in chapter 6 of P3, and the "default text
structure" defined in chapter 7 of that work.  The final section of the
paper attempts an evaluation of P3 in the light of its original design
goals, and outlines areas in which further work is still needed.
 
 
 
                                   1
 
                  DESIGN GOALS AND DEVELOPMENT PROCESS
 
 
1.1   Design Goals
 
   At the outset of its work, the overall goals of the TEI were defined
by the closing statement of the planning conference held in
Poughkeepsie, N.Y., in November, 1987; these "Poughkeepsie Principles"
are reproduced and discussed in Ide and Sperberg-McQueen, elsewhere in
this issue.  These goals were elaborated and interpreted in a series of
design documents (TEI ED P1, ED P2, and ED P3), which formulated
specific design goals for the work of specifying the TEI markup scheme.
The Guidelines, says document TEI ED P1, should:
 
*   suffice to represent the textual features needed for research
*   be simple, clear, and concrete
*   be easy for researchers to use without special-purpose software
*   allow the rigorous definition and efficient processing of texts
*   provide for user-defined extensions
*   conform to existing and emergent standards
 
We address each of these issues in more detail below.
 
 
1.2   Research Adequacy
 
   The TEI, as an undertaking of the research community, felt itself
responsible primarily to that community for creating encoding practices
adequate to research needs.  Since researchers also use commercial
software and publish their results, the practices and needs of
commercial software developers and publishers also needed to be
considered, but they were not to outweigh the needs of textual research.
In fact, commercial and research interests do not necessarily conflict.
Both are best served by an intellectually adequate analysis of textual
problems and their representation. Moreover, very few problems in the
research area lack analogues in commercial areas, though in research the
problems may occur more often, or in more extreme forms, and it may be
less possible to proceed without a full solution.  The interest taken by
many members of the commercial SGML community in the work of the TEI has
shown clearly that they are well aware of its potential relevance to
their work.
 
   Research work requires above all the ability to define rigorously
(i.e. precisely, unambiguously, and completely) both the textual objects
being encoded and the operations to be performed upon them.  Only a
rigorous scheme can achieve the generality required for research, while
at the same time making possible extensive automation of many
text-management tasks.  The TEI scheme addresses the need for a rigorous
description of textual objects, by using SGML to define a formal grammar
for TEI-encoded documents; it does not however attempt to define a set
of primitive operations on textual objects.
 
   Research work may also require relatively esoteric sets of tags for
marking up phenomena of interest in specialized areas.  This requirement
has led the TEI to develop tags for a variety of research interests not
served well by other extant encoding schemes.
 
   Because the TEI tag sets are intended to be usable by any researcher,
it has been necessary to exercise caution in defining such specialized
tag sets, to ensure that they avoid theoretical presuppositions which
would make them unacceptable to researchers in the specialty who do not
share those presuppositions.  It is not possible to define a tag set --
at least, not a useful one -- in a purely atheoretical way, but it is
possible, given a specific set of sufficiently explicit theoretical
approaches to a field, to define a tag set which allows adherents of
competing theories to encode the textual features they find of interest.
This problem is analogous to that of generating a common database schema
from a variety of views, all of which must be equally well supported.
In practice, there is often difficulty in formulating such a
"poly-theoretical" tag set, but the difficulty arises most frequently
either because one or more of the theories to be accommodated may not be
sufficiently explicit, or because the adherents of individual theories,
not content with a tag set which supports their view of the problem
domain, may insist also that other views be unencodable (i.e.
forbidden).
 
 
1.2.1   Simplicity and Concreteness
 
   A scheme must be clear, concrete, and easy to use in order to be
adopted by the research community.  It has naturally been difficult to
keep the TEI encoding scheme small, simple and easily explained in its
entirety while simultaneously including in it facilities for every
specialized area of research covered by P3.  A strenuous effort has
therefore been made to make the encoding modular, so that the parts of
P3 not of interest to a given researcher might safely be ignored.  The
goal of clarity and concreteness also led the TEI to specify a full SGML
DTD (document type definition), rather than contenting itself with
either a set of general recommendations as to how such a thing might
be constructed or an abstract model or meta-DTD for such a thing.  The
discipline of defining a concrete DTD, and testing it on real texts, has
had (we believe) a beneficial effect on the robustness and precision of
the TEI scheme, as well as enabling us to present scholars with a usable
product requiring no prior knowledge of DTD construction or the building
of abstract models.
 
   The pursuit of simplicity has led to two conflicting patterns in the
development of the TEI tag set.  First, the TEI has tried wherever
possible to apply Occam's Razor -- the principle, named for the medieval
philosopher William of Occam, of seeking the simplest possible solution
to any problem.  The formulation traditionally attributed to him is
"non sunt multiplicanda entia praeter necessitatem", which may be
translated "entities are not to be multiplied beyond necessity."  As
applied by the TEI, Occam's Razor
leads to the merger of similar elements into single elements, possibly
differentiated with a type attribute.  Thus where many markup languages
distinguish three or four types of list (e.g. with bullets, with
numbers, or unadorned) and three or more types of note (e.g. footnotes,
endnotes, and in-line block notes), the TEI defines a single <list> and
a single <note> element.  The systematic application of Occam's Razor
has significantly reduced the number of distinct elements of the TEI tag
set, compared with other markup schemes.
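 
   The merger can be pictured as a lookup from several scheme-specific
identifiers to one general element plus a type value.  The sketch below
is illustrative only: the source identifiers (ul, ol, footnote, and so
on) and the type values are hypothetical and are not taken from P3.

```python
# Hypothetical identifiers from some other markup language, each mapped to
# one general element plus a distinguishing "type" value (all names assumed).
MERGED = {
    "ul": ("list", "bulleted"),
    "ol": ("list", "ordered"),
    "sl": ("list", "simple"),
    "footnote": ("note", "foot"),
    "endnote": ("note", "end"),
}


def merge_tag(identifier: str) -> str:
    """Return a start-tag using the general element and a type attribute."""
    element, kind = MERGED[identifier]
    return f'<{element} type="{kind}">'


# Count distinct element names before and after the merger.
distinct_before = len(MERGED)
distinct_after = len({element for element, _ in MERGED.values()})
print(merge_tag("ol"), distinct_before, "->", distinct_after)
```

   Five distinct identifiers collapse into two elements here; the
distinctions survive in the type attribute rather than in the name.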
 
   A countervailing pressure, however, has led to some apparent
violations of Occam's principle.  In several instances, a very powerful
general notation which proved cumbersome in some common simple cases has
been supplemented by specialized tags which handle those common cases in
a simpler notation. Simplifying tags of this kind, being defined as
synonymous with the more cumbersome general notation, are not strictly
necessary.  For example, the general tags for text-critical apparatus
defined in chapter 19 of TEI P3 could certainly be used to encode all
forms of editorial intervention in a text, but simple editorial
emendations or normalizations can be recorded much more easily with
the dedicated tags defined for such purposes in section 6.5.  Occam's
Razor, strictly applied, would require the removal of all such
redundancy.
 
   Certainly, overuse of such redundant constructs (or "syntactic
sugar") can lead to a proliferation of elements and needless
complication of the markup language as a whole; in moderation, however,
such devices can ensure that the markup language handles simple cases
with a simple notation, reserving more complex notations for situations
which really require them.  In this and other ways, the TEI has sought
to devise a markup language which scales gracefully.  That is, the TEI
scheme can be used both for very light markup of texts and for extremely
dense markup encoding detailed analysis of texts on multiple linguistic
and interpretive levels.
 
 
1.2.2   Extensibility
 
   Since research necessarily involves the asking of questions that have
not been asked before, a research-oriented encoding scheme must also be
extensible.  Some measure of extensibility, then, was from the beginning
an absolute requirement for the TEI markup scheme, rather than a design
goal. The extent to which the various portions of the scheme might be
extended, and the degree to which the various possible kinds of
extensions can be easily created and integrated, do however form
important design goals.  TEI P3 attempts to support convenient extension
and modification of the TEI DTD by modularizing the DTD into base and
additional tag sets, by standardizing the interface among tag sets so as
to allow new modules to be introduced easily, and by the extensive use
of SGML parameter entities to allow the suppression of individual
declarations, the ability to rename elements, and the addition of new
elements to existing element classes, as described further below.
 
 
1.2.3   Compatibility with Standards
 
   Compatibility with "existing and emergent standards" and practice was
to be sought, but (as its rank suggests) not at the expense of the other
design goals.  From the start, the standards perceived as most relevant
to the work of the TEI were SGML and existing applications of SGML; the
TEI therefore decided that its Guidelines should, if possible, use the
formalisms of SGML, with the caveat that, if the needs of research
required constructs unavailable in SGML, research was to take precedence
over the standard.  It is a tribute to the expressive power and
generally good design of SGML that no extensions to SGML were in
practice found to be necessary.  P3 can thus require all TEI-encoded
documents to be conforming SGML documents; in addition, the TEI
Interchange Format adds some simple restrictions beyond those of SGML
proper, intended to ensure that documents in that format can be parsed
by software simpler than full SGML parsers.(2)
 
   The TEI was not committed to using the full range of constructs
available with SGML; a metalanguage committee was responsible for
assessing the guidelines' compatibility with commonly available
software.  That committee formulated a subset of SGML which used few
enough constructs to allow users without SGML-conformant software to
construct ad hoc processors for the TEI Interchange Format.
 
 
1.3   Additional Design Issues
 
   The design goals discussed so far relate for the most part to how the
TEI scheme would be defined. This section discusses some important
aspects of the scheme's coverage -- what it is intended to address.
 
 
1.3.1   Guidance for New Encodings -- and New Encoders
 
   The Poughkeepsie Principles mandated that the Guidelines should
simplify the task of encoding texts in machine-readable form for the
first time by making it unnecessary for projects creating such resources
to design an encoding scheme from scratch.  The Guidelines recommend a
standard minimum set of textual features, derived from (but not limited
by) common existing practice.  The widespread use of this tag set will,
we hope, contribute to an improvement in both the quality and the
re-usability of newly-created encoded texts.  In addition, the
specialized tag sets described in chapters 14 to 23 of TEI P3 should
make it easier for projects working in those specific discipline areas
to exchange data and results.
 
   The Guidelines are also intended to provide guidance for researchers
who are uncertain as to which textual features are likely to be of
maximal usefulness in an encoding.  They should reduce, not increase,
the perplexity of deciding what to encode.
 
   At present, only the full reference documentation of TEI P3 is
available.  Since it is the task of a reference manual to describe every
detail and arcane subtlety of a system, P3 is only imperfectly suited to
the role of a reassuring introduction to the TEI encoding scheme.  It is
therefore a matter of some priority for the TEI to produce a set of
introductory manuals for the Guidelines.  In an introduction, only a
portion of the TEI scheme will need to be introduced, and many details
can be postponed until they are needed; only such introductions will
make it possible to achieve the goal of providing suitable guidance to
the perplexed novice.
 
 
1.3.2   Interchange Format for Existing Material
 
   The Poughkeepsie Principles also directed that the Guidelines must be
suitable for the interchange of encodings among sites using different
schemes.  As a definition of an interchange format, the Guidelines will
be of assistance to data archives, their borrowers, and even to software
developers who can rely on this interchange format, whether as a
documented interface between their software and the textual data they
may import or export, or as a reference point with which their format
can be compared.
 
   For interchange, it must be possible to translate from any existing
scheme for text encoding into the TEI scheme without loss of
information.  All distinctions present in the original encoding must be
preserved.  Any conventions used in the original encoding but not made
explicit by its encoding format should be documented within the
interchange format.
 
   When the TEI scheme is used as an interchange format for pre-existing
encodings, only those textual features expressed explicitly in the
pre-existing encoding format can be converted into their TEI
equivalents.  If the original encoding lacks, for example, information
about paragraph breaks in the original source, so will the TEI version,
even though marking this particular feature is strongly recommended to
those creating new electronic texts.  As the final item in the
Poughkeepsie Principles makes clear, translation into the TEI scheme
should not be construed as requiring the addition of any new information
not present in the original encoding.
 
 
1.3.3   Prescription and Description
 
   An SGML document type definition specifies a set of formal rules
which define the set of "valid" documents.  The formal definition of
document validity is one of SGML's great practical strengths, because it
allows automatic mechanical checking for markup errors.  It also allows
the designer of the document type to make the expected structure of
documents much more explicit than would be possible without such a
formal specification.  As described in more detail in section 2 of this
paper, such a formal specification may be regarded as a "document
grammar", which accepts a certain subset of the set of all possible
documents.
 
   Grammars, however, may be used for two quite different purposes.  One
may use a grammar to prescribe the legal forms of some language.  The
formal grammar for the programming language Pascal, for example,
prescribes the syntactically legal forms of Pascal programs.  And
old-fashioned prescriptive grammars of natural languages like English
similarly attempted to prescribe certain grammatical constructions, and
proscribe others.  A prescriptive grammar is a recipe for the
construction of documents which conform to a specification expressed by
that grammar.
 
   SGML is frequently used for the specification of document grammars
which are prescriptive in just this way.  A publisher of technical
documentation, for example, may use an SGML DTD to prescribe a standard
structure for introductory manuals, and a different structure for user
manuals, and to enforce rules such as "There must never be an example
without a preceding paragraph introducing it and a paragraph following
it, which describes what the example means." The SGML parser cannot
enforce the prescribed contents of the introductory and following
paragraphs, but it can ensure that an object tagged as an <example> is
invariably preceded by an <introPara> and invariably followed by at
least one <discussionPara>.  Memoranda can be required to have a date, a
subject line, and a confidentiality status.  Dictionary entries can be
required to have an etymology section at the beginning.  And so on.
When a document fails to conform to such a prescriptive grammar, the
document typically has some flaw which requires correction.
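 
   The publisher's rule quoted above can be sketched as a content model
checked against a sequence of element names.  The names example,
introPara and discussionPara come from the text; the p element, the
regular-expression encoding, and the function name are assumptions made
for illustration.  A real SGML parser would work from the declarations
in the DTD rather than from a regular expression.

```python
import re

# Assumed content model: ordinary paragraphs (p) may be interleaved with
# groups of the form  introPara, example, discussionPara+ .
CONTENT_MODEL = re.compile(r"(p |introPara example (discussionPara )+)*")


def is_valid(children):
    """Check a sequence of child element names against the content model."""
    sequence = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(sequence) is not None


good = ["p", "introPara", "example", "discussionPara", "p"]
bad = ["p", "example", "discussionPara"]  # the example lacks its introPara
print(is_valid(good), is_valid(bad))
```

   As in the text, the check covers only the order of the elements; it
says nothing about whether the paragraphs actually introduce or explain
the example.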
 
   Grammars may also be used, however, to describe some set of
independently existing objects.  Since the objects already exist, the
function of the grammar is not to specify how they may be constructed
but to explain how they were constructed in the first place, or to
identify the differences between the objects in the set and the objects
outside the set.  It is characteristic of descriptive grammars that when
an object fails to conform to the grammar, the flaw is usually sought
not in the object itself, but in the grammar.  In descriptive
linguistics, grammars are used in this way to explain, or at least to
identify, differences between the sets of grammatical and ungrammatical
sentences in a language.  If a sentence fails to conform to the grammar,
but is felt acceptable by native speakers or occurs in a corpus, it is
not the sentence but the grammar which needs correction.
 
   Before the advent of the TEI, SGML had not been widely used for the
specification of descriptive document grammars for existing texts.  When
pre-existing texts were converted to SGML, the texts involved were
frequently reference works or technical documentation being converted by
the publishers or corporate authors.  Prescriptive grammars were used,
and discrepancies between the grammars and the documents were normally
resolved in favor of the grammars:  i.e., the documents were changed to
conform to the grammar.  In cases where no such changes were to be
contemplated -- as in the creation of the electronic version of the
first edition of the Oxford English Dictionary -- SGML-like encodings
have sometimes been used, but without the formulation of any DTD.  (See
further the description of the "Waterloo syntax" below in section 2.)
 
   The TEI thus explored new territory in attempting to formulate a
descriptive grammar for existing documents using the formalisms of SGML.
In the course of this exploration, it faced a critical problem which
often arises in the development of descriptive grammars.  When the
population being described by the grammar is sufficiently various and
complex, it is often the case that no grammar can readily be written
which exactly matches the population.  Either the grammar overgenerates
for the population, that is, it accepts some items which are not
actually present in the target population, or it undergenerates, that
is, it rejects as invalid some items which actually occur.  In a grammar
of English, overgeneration would mean accepting as grammatical some
sentences which no native speaker accepts as English, or which never
occur in practice; undergeneration would mean rejecting as ungrammatical
some sentences which actually occur in practice and are accepted by
native speakers.  In a document grammar, overgeneration means allowing
as a valid document some sequence of text which never occurs and can
never occur in the set of texts studied by researchers; undergeneration
means rejecting as invalid some actual existing document.
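 
   The two kinds of mismatch can be made concrete as set differences
between what a grammar accepts and what actually occurs.  The following
is a toy sketch: the two-element vocabulary, the tiny "population" of
documents, and both grammars are invented for illustration.

```python
from itertools import product

ALPHABET = ("head", "p")

# Toy universe: every sequence of one to three elements.
universe = {seq for n in range(1, 4) for seq in product(ALPHABET, repeat=n)}

# Toy population of documents that actually occur (assumed).
population = {("head", "p"), ("head", "p", "p"), ("p",), ("p", "p")}


def strict(doc):
    """Accept only a heading followed by one or more paragraphs."""
    return doc[0] == "head" and len(doc) > 1 and all(e == "p" for e in doc[1:])


def loose(doc):
    """Accept any sequence of headings and paragraphs."""
    return True


def report(grammar):
    """Return (overgenerated, undergenerated) documents for a grammar."""
    accepted = {doc for doc in universe if grammar(doc)}
    return accepted - population, population - accepted


over_strict, under_strict = report(strict)
over_loose, under_loose = report(loose)
print(len(over_strict), len(under_strict), len(over_loose), len(under_loose))
```

   The strict grammar rejects two documents that really occur, while
the loose grammar rejects none but accepts ten that never occur.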
 
   For the TEI, the target population includes, in principle, any
document written in any language during the entire span of written
history -- a population certainly various and complex enough to defy any
attempt at a rigidly prescriptive grammar.  In this situation, we chose
consciously to err on the side of overgeneration, rather than
undergeneration, whenever the choice presented itself.  An
overgenerating document grammar has the drawback that it accepts
nonsensical documents as valid ones; it will thus fail to catch some
errors of markup which a less generous grammar would detect
mechanically.  An undergenerating grammar, on the other hand, will
detect many markup errors which would slip past an overgenerating
grammar; unfortunately, it will also detect as markup errors some
constructs which are not markup errors at all but merely unusual
document structures not foreseen in the grammar.  An undergenerating
descriptive grammar thus behaves like a prescriptive grammar which has
strayed into the wrong arena.
 
   When an SGML document grammar
overgenerates, researchers who would prefer that the SGML parser perform
a tighter validation of their documents will be inconvenienced:  to
achieve the tighter validation, they will have to restrict the document
grammar to reduce or eliminate the overgeneration.  When it
undergenerates, the inconvenience will befall any researcher who is
working on a document with structures other than those foreseen in the
grammar:  to allow the document to be parsed, they will have to loosen
the document grammar to allow their document's eccentricities to be
accepted as legitimate.  Since the skills needed for modifying the
document grammar seem more likely to be found among researchers who want
to exploit SGML's document validation powers to the full than among
researchers who happen to be working with eccentric document structures,
it is clearly preferable for the TEI to err by overgenerating, rather
than by undergenerating.
 
   In order to minimize the baleful effects of
such overgeneration, the TEI tag sets sometimes define two alternative
forms of markup:  a somewhat prescriptive form for use when validation
is highly desired, and a very loose alternative form for use in
transcribing items which simply do not fit the more prescriptive form.
The elements <bibl> (loose) and <biblStruct> (prescriptive) for
bibliographic citations, and <entry> and <entryFree> for dictionary
entries, exhibit this approach.
 
 
1.3.4   Descriptive Markup
 
   In the TEI encoding scheme, descriptive markup has in general been
preferred to presentational markup.  TEI tags typically describe
structural or other fundamental textual features, independently of their
representation on the page.  In some cases, however -- e.g. for
codicologists, paleographers, or analytical bibliographers -- the
physical appearance of the original text carrier can be the primary
object of interest.  In others, there may be no consensus as to the
meaning of all aspects of the text's physical appearance; it must
therefore be possible to represent them as explicitly as possible,
without being forced to speculate as to their meaning.  For use in such
cases, the TEI defines elements (e.g. <hi>, for highlighted phrases of
any type) which simply record some salient fact about the appearance of
the source text, without requiring any overt interpretation on the part
of the encoder.  A great deal of work remains to be done, however, to
ensure that students of a text's transmission or physical presentation
can conveniently record the relevant information in a precise, readily
processable form.
 
 
1.4   Development Process
 
   The work of developing an encoding scheme consistent with these goals
devolved initially upon a set of four working committees.  Committee TD
had responsibility for issues of text documentation and produced the TEI
header described in chapter 5 of P3.  Committee TR, responsible for text
representation, produced the bulk of the material now in chapters 4, 6,
and 7.  Committee AI addressed issues of text analysis and
interpretation, producing the work now described in chapters 12, 15, and
16.  Committee ML, mentioned earlier, studied issues of metalanguage and
formal syntax; it was responsible for the usage of SGML in the TEI
scheme and provided formal definitions of the TEI Interchange Format and
of TEI conformance.  The committees ranged in size from seven to
fourteen members, with approximately equal representation from Europe
and North America.
 
   Following publication of the first draft of the Guidelines (TEI P1)
in July, 1990, the extension and revision of the work of committees TR
and AI was entrusted to about twenty small, specialized work groups,
each reporting either to TR or to AI; committees TD and ML, which had
somewhat more tightly circumscribed areas of responsibility, continued
their work without formal subcommittees.  These specialized work groups,
typically numbering three or four members, were charged with extending
or revising the recommendations of TEI P1 in some specific field in
which the members of the group had expertise. Areas addressed by the
work groups, many of which resulted in drafts for the appropriate
chapter of P3, were wide-ranging, including character set issues, text
criticism, hypertext, mathematical formulae, language corpora,
manuscripts, verse, drama and literary prose, spoken texts, literary and
historical studies, print dictionaries, computational lexica and
terminological databases.
 
 
 
                                   2
 
                    FEATURES, IDENTIFIERS, AND TAGS
 
 
2.1   Feature Sets, Identifier Sets, and Tag Sets
 
   A marked-up document is one in which specific textual features are
identified by tags or other mechanisms.  More abstractly, markup in
electronic text serves simply to predicate the existence of some quality
or feature in some specific passage of the text being marked up.  In
simple presentational markup, the feature involved may be simply that
the passage is intended to be printed in italics; in the analytic or
descriptive markup characteristic of SGML-based encoding schemes, the
features made explicit by markup tend to be slightly more abstract:  the
passage is a quotation or an emphatic phrase.
 
   We distinguish sharply between the textual feature marked and the
string of characters or other device used to mark it in the electronic
text.  That string of characters is conventionally referred to as a tag.
Formally, we regard a tag as a specific piece of markup in a text, which
signals the occurrence or location of a given textual feature.  We will
use the term identifier to refer to the string of characters itself,
without reference to any feature associated with it.(3)
 
   The TEI tag <figure>, the Formex tag <pic>, and the ISO "starter set"
tag <artwork>, for example, are used to signal occurrences of the same
textual feature, namely that at the given position in the document a
graphic of some sort is to be located.(4)  A page formatter might leave
space on the page for the artwork; a galley formatter might print a
marginal note calling attention to the expected artwork; a text database
might embed a special symbol indicating that the original document had
material not included in the database.
 
   Note that a tag must refer to a feature, but a feature need not be
tagged.  The same feature can be denoted by many identifiers in
different encoding schemes.  Thus in the TEI tag set, the element <div1>
may be used to identify occurrences of the feature chapter.  In other
tag sets the same feature might be tagged with <chapter>, <Kapitel>,
<chapître>, or <caput>.  We use the term feature precisely in
order to stress what would be common among all these tags and ignore
what varies (the name or identifier associated with the feature: in SGML
terms, its generic identifier or gi).  Users of the TEI Guidelines may
for example wish to abbreviate identifiers or translate them into
another language; the TEI provides mechanisms for such substitutions;
with them, the generic identifiers may be changed without altering
either the definition of the features themselves or the syntax which
governs their occurrences.
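 
   The effect of such a substitution can be sketched by rewriting the
identifiers in a small document instance.  This is an illustration of
the principle only: the TEI mechanism operates on the DTD, whereas the
sketch below rewrites tags in the text, and the substitution table and
function name are hypothetical.

```python
import re

# Hypothetical substitution table: original identifiers -> local names.
RENAMES = {"div1": "Kapitel", "p": "Absatz"}

# Matches the opening of a start-tag or end-tag and captures its identifier.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.\-]*)")


def rename_identifiers(text: str, table: dict) -> str:
    """Replace identifiers in start- and end-tags; leave content untouched."""
    return TAG.sub(
        lambda m: "<" + m.group(1) + table.get(m.group(2), m.group(2)), text
    )


doc = "<div1><p>Call me Ishmael.</p></div1>"
renamed = rename_identifiers(doc, RENAMES)
restored = rename_identifiers(renamed, {v: k for k, v in RENAMES.items()})
print(renamed)
```

   Only the names change: the nesting of the elements and the text they
contain are the same before and after, and the inverse table restores
the original.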
 
   The association of a given identifier with a given feature in a given
encoding system is an SGML element.(5)  Instances of a particular
element type are indicated by SGML start- and end-tags in the document,
and so in informal discussions elements are often referred to
metonymically as "tags".  In TEI documentation, the term element is
normally used to refer either to the abstract type denoted by a tag, or
to a particular instance of that type, while tag is used to refer to a
specific start-tag or end-tag in the document which delimits a given
element instance.(6) By definition each element associates a single
(though not necessarily an atomic) textual feature with a specific
identifier.  As just noted, different encoding schemes may readily mark
the same feature using different identifiers (the TEI, for example,
marks an International Standard Book Number with the tag
<idno type=isbn>, while ISO 12083 uses the tag <isbn> for the same
feature).  They may also
use the same identifier for quite different features; the TEI and the
Formex tag set both use the identifier body, the former for the body of
the text as distinct from the front or back matter, the latter for the
name of a corporate body or entity.
 
   Different encoding schemes may resemble each other in the sets of
features they mark, while differing completely in the sets of
identifiers they use, and vice versa.  Tag sets which mark the same set
of features, moreover, may differ in the document grammar they define
for those features, and thus in the sets of documents they accept.
 
   We can distinguish, for any given encoding scheme,
 
*   the set of features it marks
*   the set of identifiers it uses
*   the mapping from identifiers to features
*   the grammar it imposes on documents
 
Encoding schemes which mark the same set of features may be said to use
the same feature set; those which use the same set of identifiers have
the same identifier set or name set.  Those which additionally impose
the same formal restrictions on the allowable combination of features or
identifiers may be said to share the same feature grammar or identifier
grammar, respectively.  When two tag sets are identical in all four
characteristics above (feature set, identifier set, document grammar,
and mapping from identifiers to features), they have an identical
element grammar.  The element grammar defines the full set of rules and
conventions governing the use of a particular document type; in SGML it
is referred to as a document type definition or DTD.
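 
   The four characteristics can be modelled as a small data structure,
which makes the possible kinds of agreement between two schemes easy to
enumerate.  The class, its field names, and the two sample schemes are
assumptions made for this sketch; the grammar is reduced to a table of
permitted child features.

```python
from dataclasses import dataclass


@dataclass
class EncodingScheme:
    mapping: dict   # identifier -> feature
    grammar: dict   # feature -> tuple of features allowed as children

    @property
    def features(self):
        return set(self.mapping.values())

    @property
    def identifiers(self):
        return set(self.mapping)


def compare(a, b):
    """Report which of the four characteristics two schemes share."""
    return {
        "feature set": a.features == b.features,
        "identifier set": a.identifiers == b.identifiers,
        "mapping": a.mapping == b.mapping,
        "feature grammar": a.grammar == b.grammar,
    }


english = EncodingScheme({"div1": "chapter", "p": "paragraph"},
                         {"chapter": ("paragraph",)})
german = EncodingScheme({"Kapitel": "chapter", "Absatz": "paragraph"},
                        {"chapter": ("paragraph",)})
result = compare(english, german)
print(result)
```

   The two sample schemes share a feature set and a feature grammar but
differ in identifier set and mapping, the situation produced by merely
renaming identifiers.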
 
   Since SGML provides no formalisms for the declaration of semantics,
the formally expressible part of the DTD (i.e. the set of all applicable
element, attribute-list, and related declarations) captures only the
identifier grammar of an SGML-based encoding scheme (the document type
declaration, in SGML parlance).  The full scheme requires natural-language
documentation, of the sort provided in the TEI Guidelines, to qualify as
a document type definition.
 
   Comparing the identifiers used by two encoding schemes is of course
simpler than comparing the feature sets they mark, since textual
features are typically identified only by natural-language descriptions
which are difficult to compare mechanically.  Natural-language
descriptions are notorious for vagueness and ambiguity; their only
advantage seems to be that no other method of defining textual
features exists, or is likely to exist soon.  If, despite the
difficulties of documenting their meanings, the feature sets of two
encoding schemes can be identified, then translation from one scheme to
the other may be relatively simple.  If the feature sets and feature
grammars are identical, or if the source's feature grammar is a subset
of the target's feature grammar, then translation from one of the
associated tag sets to the other becomes very simple.
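
   The mapping from identifiers to features can be pictured as one
lookup table per scheme.  The following Python sketch is purely
illustrative: the feature labels are informal strings invented for the
purpose, each table holds only the identifiers mentioned above, and the
function name translate is our own.  It shows that translation runs
through the shared feature, not through the identifier.

```python
# Each scheme maps its identifiers to informal feature labels.  The labels
# are illustrative strings, not part of any of the three schemes.
TEI = {"idno type=isbn": "ISBN", "body": "body of the text"}
ISO_12083 = {"isbn": "ISBN"}
FORMEX = {"body": "name of a corporate body"}


def translate(identifier, source, target):
    """Return the target identifier marking the same feature, or None."""
    feature = source.get(identifier)
    if feature is None:
        return None
    for candidate, candidate_feature in target.items():
        if candidate_feature == feature:
            return candidate
    return None


print(translate("idno type=isbn", TEI, ISO_12083))  # isbn
print(translate("body", TEI, FORMEX))               # None
```

The second call returns no result even though both schemes use the
identifier body, because the two uses denote different features.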
 
   More formally, we define
 
feature:  a characteristic of some segment of text or of some location
  in a text, considered independently of any identifier used to denote
  it.  In SGML terms, a segment of text possessing the feature may be
  marked up as an SGML element, though the presence of the feature may
  also be indicated by other SGML constructs, as described below in the
  section on "Methods of Predication."
element:  a component of an encoded text corresponding to an
  occurrence of a specific textual feature.  An element has both a tag
  and a textual feature associated with it.
tag:  the use of a specific identifier to signal the boundaries of an
  element.
identifier:  the string of characters used in a tag to identify an
  element, considered as a string of characters and independent both of
  any particular instance of the tag used to mark the feature and of the
  feature itself.  The SGML equivalent term is generic identifier.
 
   We can also identify sets (that is, unordered collections) of
features, elements, and identifiers.
 
   To define the allowable relationships among elements and features,
document markup languages define syntaxes for the elements of the
language.  The formal syntax is necessarily defined in terms of the
identifiers used by the language; the semantics of the elements (their
links to specific features) are not usually specified formally.  If the
distinction between element and identifier is rigorously maintained,
then, formal syntaxes like those expressed by the element declarations
of an SGML DTD must be viewed strictly speaking as governing only
identifiers, not elements.  If we wish to speak of rules governing
elements rather than identifiers, we must include both the formal
declarations and the semantic specifications of the markup scheme: in
SGML terms, we must talk of the document type definition, rather than
its declaration.  This distinction is preserved in the following
discussion.
 
   To express the ways in which sets of features or elements may be
combined, we may define the following:
 
feature syntax:  a set of rules specifying the allowable combinations of
  feature occurrences within a given type of text, considered
  independently of the identifiers used to identify them
tag syntax, element syntax:  a set of rules defining the allowable
  sequences of tags and textual data in a markup language
DTD (document type definition):  a tag syntax defined according to the
  rules of SGML
identifier syntax:  a syntax governing the allowable combinations of
  identifiers in a markup language (without regard to their association
  with particular features)
 
   From any document type definition, one may derive a tag set, which is
precisely all and only the tags defined in the DTD.(7)  The element
syntax of a DTD is expressed by the content models of its constituent
elements. If we modify a DTD by changing the syntax rules expressed in
its content models, the tag set (and the associated set of identifiers)
and feature set remain unchanged. We may say that the tag set is what
remains constant when the syntax is changed without introducing new tags
or deleting old ones.
 
   We may change a DTD by translating all the identifiers into another
language (as is done with the labels for the ISO starter set in ISO TR
9573, for example).  The legal relationships among features remain the
same, the features described remain the same; the DTD changes only
because the identifiers used to define the syntax change.  The term
feature syntax is coined to denote what does not change if one
translates all the generic identifiers into another vocabulary.
 
 
2.2   Important Types of Document Grammar
 
   While for any given DTD or element syntax there exists a unique
corresponding tag set, the converse is not true: for any given tag set,
one may formulate a very large number of possible element syntaxes.
Several types of such grammar can be readily identified, which are
useful for discussion purposes even if they are not all often used in
practice.
 
   The first of these types of document grammar is of particular
importance, because it comes close to being a null grammar, imposing no
requirements on the document beyond those of SGML itself.  This is the
grammar which, for any given set of elements, requires merely that all
elements nest and allows any element to nest within any other.  We call
this the Waterloo grammar (or, in SGML, the Waterloo DTD), because its
most visible exponents in recent years have been located at the
University of Waterloo's Centre for the New Oxford English Dictionary,
which designed a tag set with this type of document grammar for the
encoding of the Oxford English Dictionary.(8)  In SGML terms, a pure
Waterloo DTD is one in which each element has an element declaration of
the form
 
      <!ELEMENT blort - - ANY>
 
For details of the element-declaration syntax, see Burnard, elsewhere in
this issue.  This example declares an element with a generic identifier
of blort.  The two minus signs indicate that neither the start-tag nor
the end-tag may be omitted (even if the omitted markup could be
unambiguously inferred).  The keyword ANY signifies that a blort can
contain a mix of character data and any element declared in the DTD, in
any sequence.(9)
 
   The obvious advantage of the Waterloo grammar is that it radically
simplifies the task of data capture and translation from other tag sets,
since all that is necessary is the identification and marking of textual
features, without concern for grammatical restrictions on their
repetition or nesting.  The obvious disadvantage is that, as a grammar,
it vastly overgenerates for its population.  The Waterloo DTD for the
Oxford English Dictionary will accept dictionary entries in any of the
many forms they take in the actual text, without complaining that some
are ill-formed.  This is a distinct advantage.  But this DTD will also
accept entries which are not merely eccentric, but purely nonsensical --
in which citations appear within sense numbers or sense numbers within
data elements, for example.  (In the OED project, such nonsensical
tagging was detected and controlled, but by means of specially written
checking routines rather than through the grammatical formalism of a
DTD.)
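
   What a Waterloo grammar does and does not check can be shown in a
few lines of Python.  This is a sketch under simplifying assumptions:
tags carry no attributes, no tag omission is used, and the identifiers
entry, sense, and quote are hypothetical stand-ins rather than the tags
actually used for the Oxford English Dictionary.

```python
import re

# Matches simple start- and end-tags without attributes, e.g. <sense>, </sense>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def waterloo_accepts(text, identifiers):
    """True if every tag uses a declared identifier and all elements nest."""
    open_elements = []
    for closing, gi in TAG.findall(text):
        if gi not in identifiers:
            return False
        if not closing:
            open_elements.append(gi)
        elif not open_elements or open_elements.pop() != gi:
            return False
    return not open_elements


IDENTIFIERS = {"entry", "sense", "quote"}
sensible = "<entry><sense>1.<quote>...</quote></sense></entry>"
nonsense = "<quote><entry><sense>1.</sense></entry></quote>"
overlap = "<entry><sense>1.</entry></sense>"
print(waterloo_accepts(sensible, IDENTIFIERS))  # True
print(waterloo_accepts(nonsense, IDENTIFIERS))  # True
print(waterloo_accepts(overlap, IDENTIFIERS))   # False
```

The nonsensical nesting is accepted just as readily as the sensible
one; only the failure to nest is rejected.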
 
   A second type of grammar provides a useful contrast to the extreme
permissiveness of the Waterloo grammar by prescribing a rigid
hierarchical relationship among the SGML elements it defines.  As an
example, consider the definition of the TEI header:
 
      <!ELEMENT teiHeader - - (fileDesc, encodingDesc?, profileDesc?,
                              revisionDesc?)                         >
 
This declaration specifies that a <teiHeader> element must invariably
begin with a <fileDesc> element, which may be followed by <encodingDesc>,
<profileDesc>, and <revisionDesc> elements.  Each of these three is
optional, but if they appear they must do so in the order given.
 
   This prescriptive form of content model is often found in the
definition of elements with clear internal structure; in TEI working
papers these are often referred to as crystals.  Crystals often
represent textual features which are themselves compounds of other,
usually simpler, features: lists, personal names, addresses, and
bibliographic citations may all be defined as crystals in a sufficiently
prescriptive tag set.  In SGML terms, a crystal is an element defined as
having element content rather than data content or mixed content: that
is, an element whose children are invariably other elements, rather than
strings of characters, possibly with other elements mixed in.  In the
TEI scheme, "pure" crystals are found almost exclusively in the TEI
header, since within the body of the text, it is not feasible to require
that encoders always capture the internal structure of items like
personal names, addresses, or bibliographic citations.
 
   The third type of grammar to be identified here combines the
flexibility of the Waterloo grammar with the notion of crystals, to
produce a document grammar which overgenerates slightly less than would
an equivalent Waterloo grammar.  This type, known to the authors as a
Belgian grammar in honor of Jean-Pierre Gaspart, who introduced us to
it, uses the SGML mechanism of inclusion exceptions to control the
appearance of crystals and their parts.  In a "pure" Belgian grammar,
 
*   crystals will be defined using some form of prescriptive grammar
*   overall text structure will typically be defined using a
    prescriptive grammar
*   elements which may directly contain text will have a content model
    of "(#PCDATA)"
*   top-level text-containing elements such as paragraphs will also have
    an inclusion exception which allows all crystals, but not their
    constituent parts, to appear anywhere within the text-containing
    element
 
The following simple DTD, for example, allows personal names and lists
to appear anywhere in any paragraph, but allows <surname>, <forename>,
and <item> to appear only within appropriate contexts:
 
     <!-- overall text structure:  belgiandoc is series of P -->
     <!ELEMENT belgiandoc - - (p+)                         >
     <!-- Text-container -->
     <!ELEMENT p          - - (#PCDATA) +(name | list)     >
 
     <!-- 'Crystals' -->
     <!ELEMENT list       - - (item+)                      >
     <!ELEMENT name       - - (forename+, surname)         >
 
     <!-- parts of crystals -->
     <!ELEMENT item       - - (#PCDATA)                    >
     <!ELEMENT forename   - - (#PCDATA)                    >
     <!ELEMENT surname    - - (#PCDATA)                    >
 
The declaration for the <p> element above demonstrates the power of a
Belgian syntax.  Its use of inclusion exceptions ensures that names and
lists can appear within any element within a paragraph, including within
names and lists; items and parts of names, however, can appear only
within the relevant enclosing crystal (i.e. within lists or names,
respectively).
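
   The interplay of content models and inclusion exceptions in the
Belgian DTD above can be mimicked by a small Python checker.  This is a
sketch, not an SGML parser: it assumes a document already reduced to a
tree of (identifier, children) pairs, it rewrites the content models by
hand as regular expressions, and it treats an included element as never
taking part in model matching, which is adequate here because no
element is both included and named in a content model.

```python
import re

# Content models of the Belgian DTD above, rewritten as regular expressions
# over the sequence of child names; "#" stands for a run of character data.
MODELS = {
    "belgiandoc": r"(p )+",
    "p": r"(# )*",
    "list": r"(item )+",
    "name": r"(forename )+surname ",
    "item": r"(# )*",
    "forename": r"(# )*",
    "surname": r"(# )*",
}
# Inclusion exceptions: elements allowed anywhere inside the named element.
INCLUSIONS = {"p": {"name", "list"}}


def valid(element, inherited=frozenset()):
    """Check an element, given as (identifier, children), against the DTD."""
    gi, children = element
    if gi not in MODELS:
        return False
    included = inherited | INCLUSIONS.get(gi, set())
    tokens = []
    for child in children:
        if isinstance(child, str):
            tokens.append("#")
        elif not valid(child, included):
            return False
        elif child[0] not in included:
            tokens.append(child[0])  # must be matched by the content model
    sequence = "".join(token + " " for token in tokens)
    return re.fullmatch(MODELS[gi], sequence) is not None


name = ("name", [("forename", ["Jean-Pierre"]), ("surname", ["Gaspart"])])
plain = ("belgiandoc", [("p", ["Thanks to ", name, "."])])
nested = ("belgiandoc", [("p", [("list", [("item", ["see ", name])])])])
stray = ("belgiandoc", [("p", ["Thanks to ", ("surname", ["Gaspart"])])])
print(valid(plain), valid(nested), valid(stray))  # True True False
```

The name inside a list item is accepted only because the inclusion on
the enclosing paragraph is inherited; the stray surname is rejected
because nothing outside a name licenses it.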
 
   As this example shows, Belgian constructs can be restricted to some
elements, allowing the rest of the document grammar to be defined using
other constructs.
 
   The final type of grammar which must be described uses declarations
of the form:
 
      <!ELEMENT blort - - (#PCDATA | foo | bar | fubar)* >
 
This construct, which we will call a starred alternation, can be used to
mimic the effects of Waterloo and Belgian grammars, but also allows
slightly easier control over nesting.(10)  By avoiding the use of
inclusion exceptions, it makes clearer precisely what sub-elements are
legal within elements of a given type.  Its primary disadvantage is that
it requires each phrase-level element to name explicitly all the other
phrase-level elements which may appear within it, thus exploding the
size of the DTD.  Compare, for example, the Belgian DTD above with this
equivalent DTD in starred-alternation form:
 
     <!ELEMENT starreddoc - - (p+)                         >
     <!-- Text-container -->
     <!ELEMENT p          - - (#PCDATA | name | list)*     >
     <!-- 'Crystals' -->
     <!ELEMENT list       - - (item+)                      >
     <!ELEMENT name       - - (forename+, surname)         >
     <!-- parts of crystals -->
     <!ELEMENT item       - - (#PCDATA | name | list)*     >
     <!ELEMENT forename   - - (#PCDATA | name | list)*     >
     <!ELEMENT surname    - - (#PCDATA | name | list)*     >
 
The growth in the grammar, small but perceptible here, may prove a
practical hindrance when, as in the main TEI DTD, the number of
phrase-level elements, which can occur within paragraphs or within other
phrase-level elements, rises above fifty.  Despite this fact, the
starred alternation is probably more common in existing standard DTDs
than the constructs of either the Belgian syntax or the Waterloo syntax.
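
   The growth can be estimated roughly.  The Python sketch below
generates a starred-alternation declaration and counts identifier
mentions under the simplifying assumption, ours and not the TEI's, that
each of n phrase-level elements may contain all n of them, itself
included.  Both function names are invented for the illustration.

```python
def starred_alternation(gi, phrase_level):
    """Build a starred-alternation declaration for one element."""
    alternatives = " | ".join(["#PCDATA", *phrase_level])
    return f"<!ELEMENT {gi} - - ({alternatives})* >"


def identifier_mentions(n):
    """Identifiers named across the content models of n phrase-level
    elements when each model must list all n of them explicitly."""
    return n * n


print(starred_alternation("item", ["name", "list"]))
print(identifier_mentions(2), identifier_mentions(50))  # 4 2500
```

Under that assumption the number of identifiers written out in content
models grows with the square of the number of phrase-level elements,
whereas a single inclusion exception names each of them once.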
 
   One advantage of the starred alternation version of this document
grammar over the Belgian version given above is that the content model
for each element describes its legal contents in full. In the Belgian
syntax, the legal contents of item are determined not only by the
content model for that element, but by the inclusions on the p element
given elsewhere.  If lists can occur both within and outside of
paragraph elements, then the legal contents of items may differ,
depending on location, in ways which are not immediately obvious, even
from a careful reading of the DTD.  The starred alternation makes it
possible to eschew inclusion and exclusion exceptions in a DTD, thus
ensuring that the legal contents of an element do not vary depending on
its context; this is important for the validation of SGML fragments in
isolation, and for the convenient decomposition of documents into
fragments suitable for retrieval and maintenance using standard database
techniques.
 
 
2.3   Methods of Predication
 
   If we view markup languages merely as means of predicating the
existence of particular qualities or features in particular passages or
locations of a text -- ignoring for the moment the other functions of a
markup language, such as ensuring system independence of data or
securing useful mechanisms for document management -- then it is useful
to ask exactly how we may legitimately infer, from a marked up text, the
existence of a given feature at a given location.  We assume in this
discussion that the feature in question is marked explicitly, ignoring
the problems of inferring the occurrence of features not marked from
those which have been marked.  Such inference is possible when the
implication relations of different features are made explicit, as may be
done in formal feature-system declarations (FSDs), but it is quite
difficult for the most part, and has not as far as we know been
undertaken for any markup language intended for general use.
 
   In a TEI-encoded text, the predication of a given feature at a given
location may be accomplished by any combination of several mechanisms.
For any location, the features predicated by the markup as applicable to
that location are signaled by:
 
*   the generic identifiers of open elements
*   attribute value specifications on open elements
*   the structural position of the location in the document (a TEI
    <head>, for example, marks a chapter title if and only if it is the
    first child of a text division which is itself marked as a chapter)
*   pointers to some open element from a <join> element, an encoded
    feature structure, note, <span>, or <interp> element
*   the position of the location between a location pointed to as the
    start, and a location pointed to as the end, of a passage pointed at
    by an encoded feature structure, note, or <span> element
*   an ana attribute on an open element pointing to a feature structure
    or interpretation element
*   the position of the location between two milestone elements (e.g.
    <pb>, <cb>, <milestone>, etc.)
 
Note that the pointers in each case may be indirect, by means of one or
more links in a chain of extended pointers.
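
   The simpler of these mechanisms lend themselves to direct
implementation.  The following Python sketch covers only three of them:
the generic identifiers of open elements, their attribute values, and
the most recent <pb> milestone.  It does not follow pointers.  It
assumes, for convenience, an XML-style rendering of the markup rather
than full SGML, and the sample document, the id values, and the
function name are all invented for the illustration.

```python
import io
import xml.etree.ElementTree as ET


def predications_at(document, target_id):
    """Collect what the markup predicates of the element with the given id."""
    open_elements = []   # (generic identifier, attributes) of open elements
    page = None          # n value of the most recent <pb> milestone
    source = io.BytesIO(document.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            open_elements.pop()
            continue
        if element.tag == "pb":
            page = element.get("n")
        open_elements.append((element.tag, dict(element.attrib)))
        if element.get("id") == target_id:
            return {
                "identifiers": [gi for gi, _ in open_elements],
                "attributes": [attrs for _, attrs in open_elements if attrs],
                "page": page,
            }
    return None


SAMPLE = (
    '<text><div type="chapter"><pb n="1"/><p>One.</p>'
    '<pb n="2"/><p>Two <name id="n1">Gaspart</name>.</p></div></text>'
)
result = predications_at(SAMPLE, "n1")
print(result["identifiers"])  # ['text', 'div', 'p', 'name']
print(result["page"])         # 2
```

Predications made by pointing, as with <join>, <span>, or the ana
attribute, would require a second pass that resolves each pointer to
its target before the features could be collected.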
 
   These methods for predicating features in a text are particularly
important for search and retrieval applications, which often involve
finding instances of particular features in a document; they are
relevant for other software as well, in cases where correct processing
depends on what features are present in a given passage.  The
implementation of software sensitive to generic identifiers, attribute
values, and structural location in the document is relatively simple and
thus quite common.  At the time TEI P3 was published, however, software
capable of treating virtual elements like <join>, or of following
predications across links like those given by the <link> element or by
the inst and ana attributes, remained relatively rare.  These markup
techniques are nevertheless widely used in the TEI Guidelines, primarily
because they make it possible, and indeed convenient, to represent
multiple possibly conflicting interpretations of the same passage,
without requiring that one be privileged vis a vis the others.
 
 
 
                                   3
 
                   THE TEI DOCUMENT TYPE DECLARATION
 
 
   Other contributions to the present volume describe in more detail
different aspects of the various parts of the TEI scheme. Here, we
provide a general overview of the DTD itself, and a very brief summary
of the textual features and default text structures defined by its core
components.
 
 
3.1   Core, Base, Additional, and Auxiliary Tag Sets
 
   TEI P3 describes several distinct document type definitions:  a
single main DTD for the transcription of texts, and several auxiliary
DTDs for the encoding of meta-information relevant to the transcription
of one or more texts.  The main DTD is described further below; the
auxiliary DTDs include:
 
*   the independent header, which documents the identity of a particular
    electronic text and its source, as further described below
*   the writing system declaration, which documents the alphabet or
    writing system used by a given text, and the character sets,
    transliteration schemes, and SGML entity sets used to record it in
    machine-readable form
*   the feature system declaration, which defines the set of features
    nameable within a given feature system, the domains over which their
    values may range, their default values, and co-occurrence
    constraints upon sets of feature values
*   the tag set documentation, which defines tags for systematic
    documentation of SGML tag sets
 
   Although formally it defines only a single document type, the TEI
main DTD may vary dramatically in different invocations, to reflect the
varying needs of the document and encoder.  By specifying a particular
view of the DTD, the user may tailor it to the needs of a particular
application in a way difficult or impossible with most other general
purpose DTDs so far developed.  The user of the TEI scheme is offered
the opportunity of building a DTD which matches his or her requirements,
but constrained to do so in a way that facilitates interchange.
 
   The main TEI DTD thus resembles a complex database schema from which
different views may be derived for different applications.  The main DTD
is defined as a set of tag sets, which may be used in a variety of
combinations.  Each view of the main DTD selects one or more of these
tag sets, thus making the tags in those sets available for use in the
encoding; different views of the main DTD differ because they select
different tag sets and thus make different elements available.  Three
types of tag sets are distinguished:
 
*   the core and header tag sets; these contain elements which must be
    available in any view of the main DTD
*   base tag sets; these define textual components specific to
    particular text types
*   additional tag sets; these define elements of interest for
    particular types of analysis or processing, which may conceivably be
    tagged in texts of any type
 
   This classification of tag sets reflects what is jocularly known as
the Chicago pizza model of DTD construction.  In Chicago, as elsewhere
in the U.S. (though not in Italy), all pizzas have some ingredients in
common (cheese and tomato sauce); the consumer may specify, however, a
choice of one style of crust (thin crust, deep-dish or pan, or stuffed)
and an arbitrary selection of toppings to be strewn on the top of the
pizza.  In the same way, the user of the TEI scheme constructs a view of
the TEI DTD by combining the core tag sets (which are always present),
exactly one "base" tag set, and any selection of "additional" tag sets,
which play the part of the toppings.
 
   The core tag set is described in more detail further below; it
includes elements which are specific neither to particular genres or
types of text, nor to particular types of application or research.
 
   The TEI header allows the provision of documentary or bibliographic
information about electronic texts.  Such information is essential for
any satisfactory interchange of texts coming from multiple sources, or
for which long term uses are envisaged.  As with software, leaving the
documentation of an electronic text to the last moment is a recipe for
disaster all too commonly followed.  The TEI header is one of the few
mandatory elements in a TEI document. It has four major divisions which
together provide a detailed framework for the documentation of:
 
*   the electronic document itself and the sources from which it was
    derived
*   the encoding system which has been applied
*   non-bibliographic aspects of the document (e.g. the demographic
    characteristics of its author and readers, its subject matter, or
    its classification in some scheme of text types)
*   the revision history of the electronic document
 
   The TEI header may vary widely in its size and complexity.  At one
extreme, an encoder may provide nothing more than a bibliographic
identification of the text.  At the other, encoders wishing to ensure
that their texts can be used for the widest range of applications will
want to provide the kind of detailed documentation most often found in a
user's manual.  Most headers will lie somewhere between these
extremes; textual corpora in particular will tend more to the latter
extreme.  A collection of TEI headers can also be regarded as a distinct
document, and an auxiliary DTD is provided to support interchange of
headers alone, for example between libraries or archives, as noted
above.
 
   The greater part of the TEI main DTD is taken up, however, not by the
core and header tag sets, but by the base and additional tag sets which
may be selected by the user in various combinations.
 
   To construct a view of the TEI DTD, the user must always choose one
of eight base tag sets.  Six of these are intended for documents which
are predominantly composed of one type of text; the other two are
provided for use with texts which combine these basic tag sets.
 
*   prose
*   verse
*   drama
*   transcriptions of spoken material
*   printed dictionaries
*   terminological data
*   a "general" base for anthologies, etc.
*   a "mixed" base for anarchic mixtures of text types
 
   In addition to one base tag set, the user of the TEI main DTD may
select any combination of the following additional tag sets; these
typically define elements relevant for particular types of processing,
or for particular types of research, or text components which may occur
in many different types of text, but are not so widespread as to belong
in the core tag set.  The additional tag sets define SGML elements for:
 
*   hypertext linking, text segmentation, and alignment of multiple
    texts or multiple passages
*   simple analysis and interpretation of the text
*   markup of feature structures
*   recording the certainty or uncertainty of markup and responsibility
    for markup
*   transcription of primary sources (principally manuscripts)
*   text-critical apparatus
*   detailed analysis of names and dates
*   graphs, networks, and trees
*   tables, formulae, and graphics
*   demographic information about authors or speakers, and
    text-classification elements, of types particularly useful in
    language corpora
 
Aspects of most of the above tag sets are described in more detail
elsewhere in this issue.
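
   The rule for assembling a view (core and header always, exactly one
base, any number of additional tag sets) is simple enough to state as a
check.  In the Python sketch below, the short labels for the tag sets
are informal names of our own, not the parameter-entity names used in
the DTD, and build_view is a hypothetical helper.

```python
CORE = ["core", "header"]
BASES = {"prose", "verse", "drama", "spoken", "dictionaries",
         "terminology", "general", "mixed"}


def build_view(bases, additional=()):
    """Return the tag sets of a view: core, exactly one base, any toppings."""
    bases = list(bases)
    if len(bases) != 1 or bases[0] not in BASES:
        raise ValueError("a view needs exactly one known base tag set")
    return CORE + bases + sorted(additional)


print(build_view(["prose"], ["linking"]))
# ['core', 'header', 'prose', 'linking']
```

A request for no base, or for two bases at once, raises an error; a
text that genuinely mixes types would select the "general" or "mixed"
base instead.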
 
 
3.2   Selection of Tag Sets
 
   The pizza model is implemented by means of parameter entities in the
TEI DTD, as further discussed below.  To illustrate the basic mechanism
we present here the outline of a TEI-conformant document in which the
base tag set for prose has been selected together with the additional
tag set for linking:
 
     <!DOCTYPE tei.2 SYSTEM "tei2.dtd" [
     <!ENTITY % TEI.prose "INCLUDE">
     <!ENTITY % TEI.linking "INCLUDE">
     ]>
     <tei.2>
      <!-- content of document here -->
     </tei.2>
 
As may be seen, the user selects a particular tag set by declaring a
parameter entity with a particular name as having the value "INCLUDE".
Because tag sets are selected explicitly by means of declarations within
the DTD subset, as shown above, any recipient of a TEI document can tell
which TEI tag sets are required to process it.  Any deviations or
modifications of the TEI definitions (for example, the renaming of
elements, or the addition of new ones) are made visible in a similarly
declarative manner.  In the following example, all modifications to the
TEI scheme are contained in the files project.ent and project.dtd:
 
     <!DOCTYPE tei.2 SYSTEM "tei2.dtd" [
     <!ENTITY % TEI.prose "INCLUDE">
     <!ENTITY % TEI.linking "INCLUDE">
     <!ENTITY % TEI.extensions.ent SYSTEM "project.ent">
     <!ENTITY % TEI.extensions.dtd SYSTEM "project.dtd">
     ]>
     <tei.2>
      <!-- content of document here -->
     </tei.2>
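
   Because the selections are plain declarations, a recipient can
recover them mechanically.  The Python sketch below scans the
declarations of a DTD subset with two regular expressions; it assumes
the declarations are written exactly in the style shown above (double
quotes, one declaration per entity) and is not a substitute for an SGML
parser.  The function name describe_subset is our own.

```python
import re

SELECTED = re.compile(r'<!ENTITY\s+%\s+(TEI\.[\w.]+)\s+"INCLUDE"\s*>')
EXTENSION = re.compile(
    r'<!ENTITY\s+%\s+(TEI\.extensions\.\w+)\s+SYSTEM\s+"([^"]+)"\s*>'
)


def describe_subset(subset):
    """Report the tag sets selected and the extension files named."""
    return {
        "tag_sets": SELECTED.findall(subset),
        "extensions": dict(EXTENSION.findall(subset)),
    }


SUBSET = """
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.linking "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "project.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "project.dtd">
"""
report = describe_subset(SUBSET)
print(report["tag_sets"])    # ['TEI.prose', 'TEI.linking']
print(report["extensions"])
```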
 
 
3.3   Implementation of the Pizza Model
 
   In implementing the pizza model so as to work with any conforming
SGML parser, several problems must be solved:
 
*   preventing logical or syntactic conflict among tag sets
*   embedding the correct files, i.e. only those which define the tag
    sets selected by the user
*   ensuring that the elements defined in each tag set appear in the
    appropriate content models in the effective DTD
*   allowing for user-defined extensions to the DTD
 
   The first problem has a relatively simple solution.  To avoid name
clashes among elements, and to ensure that tag sets may be used in any
combination, we adopt the simple rule that elements may in general
appear in only one tag set.  Special handling must be provided for the tags
of the default text structure, which are used by all bases.
 
   The second problem has a slightly more complex, but still rather
straightforward solution.  The individual tag sets are defined and
referred to by the main driver file of the TEI DTD, tei2.dtd, in such a
way that they will be embedded in the DTD if and only if the user has
selected them.  Simplified slightly, the general pattern is as follows:
 
     <!ENTITY % TEI.prose "IGNORE">
     <![ %TEI.prose [
     <!ENTITY % TEI.prose.dtd SYSTEM "teipros2.dtd">
     %TEI.prose.dtd;
     ]]>
 
     <!ENTITY % TEI.verse "IGNORE">
     <![ %TEI.verse [
     <!ENTITY % TEI.verse.dtd SYSTEM "teivers2.dtd">
     %TEI.verse.dtd;
     ]]>
 
     <!-- ... etc. -->
 
As can be seen, the entity containing each tag set's declarations is
declared and referred to within a conditional marked section controlled
by a parameter entity with a default value of IGNORE, which the user can
override to INCLUDE.  If the user has selected the tag set for prose,
for example, then upon interpretation of the parameter entities
TEI.prose and TEI.verse, the fragment above will resolve to the
following:
 
     <![ INCLUDE [
     <!ENTITY % TEI.prose.dtd SYSTEM "teipros2.dtd">
     %TEI.prose.dtd;
     ]]>
 
     <![ IGNORE [
     <!ENTITY % TEI.verse.dtd SYSTEM "teivers2.dtd">
     %TEI.verse.dtd;
     ]]>
 
     <!-- ... etc. -->
 
and then, upon parsing of the marked sections, the declaration of the
verse tag set will disappear entirely:
 
     <!ENTITY % TEI.prose.dtd SYSTEM "teipros2.dtd">
     %TEI.prose.dtd;
 
     <!-- ... etc. -->
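
   The two rules at work here (the first declaration of an entity
wins, and a marked section is kept only if its controlling entity has
the value INCLUDE) can be simulated in Python.  The sketch below is a
toy model using regular expressions: it handles only unnested marked
sections and literal entity values written as in the fragment above,
and the function name included_sections is our own.

```python
import re

TOKEN = re.compile(
    r'<!ENTITY\s+%\s+(?P<name>[\w.]+)\s+"(?P<value>[^"]*)"\s*>'
    r'|<!\[\s*%(?P<switch>[\w.]+);?\s*\[(?P<body>.*?)\]\]>',
    re.S,
)

DRIVER = """
<!ENTITY % TEI.prose "IGNORE">
<![ %TEI.prose [
<!ENTITY % TEI.prose.dtd SYSTEM "teipros2.dtd">
%TEI.prose.dtd;
]]>
<!ENTITY % TEI.verse "IGNORE">
<![ %TEI.verse [
<!ENTITY % TEI.verse.dtd SYSTEM "teivers2.dtd">
%TEI.verse.dtd;
]]>
"""


def included_sections(driver, user_subset=""):
    """Return the bodies of the marked sections that remain in force."""
    entities = {}
    included = []
    # The user's subset is read first, so its declarations take priority.
    for match in TOKEN.finditer(user_subset + "\n" + driver):
        if match.group("name"):
            entities.setdefault(match.group("name"), match.group("value"))
        elif entities.get(match.group("switch")) == "INCLUDE":
            included.append(match.group("body").strip())
    return included


kept = included_sections(DRIVER, '<!ENTITY % TEI.prose "INCLUDE">')
print(len(kept), "teipros2.dtd" in kept[0])  # 1 True
```

With no user declarations at all, every section falls back to its
default of IGNORE and nothing is embedded.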
 
   The third problem requires even more extensive use of parameter
entities and conditional marked sections, as well as a set of shared
conventions in the definition of content models.
 
   Each TEI base tag set determines the basic structure of all the
documents with which it is to be used.  More exactly, it defines the
constituents of <text> elements.  In practice, although means exist for
them to vary, all the TEI bases defined so far use the same "default"
text structure, described in detail below, which allows a text to
contain nested text divisions (chapters, sections, etc.).  The bases do
however differ greatly in the components which may appear within those
divisions:  the sections of a dictionary (for example) will contain
dictionary entries, while the scenes of a play will contain speeches and
stage directions, and the subdivisions of a speech transcript will
contain utterances and descriptions of non-verbal activity.  The sets of
components, moreover, may overlap:  prose paragraphs may occur within
conventional prose texts, and in (the front matter of) dictionaries, and
in plays.  To cater for this variety, the constituents of all divisions
of a TEI <text> element are defined indirectly in terms of a parameter
entity named component, which is in turn defined differently in each
base tag set.  Once more, marked sections are used to control which
declaration is effective, as shown in this simplified example:
 
 
     <!ENTITY % TEI.prose "IGNORE">
     <![ %TEI.prose [
     <!ENTITY % component "p | list | note" >
     ]]>
 
     <!ENTITY % TEI.dictionaries "IGNORE">
     <![ %TEI.dictionaries [
     <!ENTITY % component "entry | p | list | note" >
     ]]>
 
     <!ENTITY % TEI.spoken "IGNORE">
     <![ %TEI.spoken [
     <!ENTITY % component "u | vocal | pause | event | p | list | note" >
     ]]>
     <!-- ... etc. -->
 
     <!-- This definition is always in force -->
     <!-- Its effect is to define component.seq as one or more of -->
     <!-- whatever definition of component is currently in force -->
     <!ENTITY % component.seq "(%component)+">
 
   Within the body of the DTD, elements are defined using these
parameter entities, for example:
 
     <!ELEMENT div - - (head?, (%component.seq), div*) >
 
The user's selection of a base tag set "activates" the corresponding one
of these conditional entity declarations; the others are ignored.
 
   The components of a text are thus invariably defined by entities
whose values vary with the particular base in use.  All textual
divisions are defined with the same content model, which includes a
reference to the parameter entity component.seq; the value of this
parameter entity will however be different in different bases.  In this
way it is possible for the divisions of a text using the drama base (for
example) to consist of speeches and stage directions, while those of a
text using the dictionary base will consist of lexical entries.
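
   The effect of the indirection can be traced by expanding the
parameter-entity references by hand, or with a few lines of Python.
The sketch below uses the simplified values of component shown above;
the short keys prose, dictionaries, and spoken are informal labels for
the corresponding bases, and the helper names are our own.  It assumes
that the entity definitions are not circular.

```python
import re

COMPONENT = {
    "prose": "p | list | note",
    "dictionaries": "entry | p | list | note",
    "spoken": "u | vocal | pause | event | p | list | note",
}
REFERENCE = re.compile(r"%([\w.]+);?")


def expand(text, entities):
    """Replace parameter entity references until none remain."""
    while REFERENCE.search(text):
        text = REFERENCE.sub(lambda m: entities[m.group(1)], text)
    return text


def div_content_model(base):
    """Effective content model of <div> when the given base is selected."""
    entities = {
        "component": COMPONENT[base],
        "component.seq": "(%component)+",
    }
    return expand("(head?, (%component.seq), div*)", entities)


print(div_content_model("prose"))
# (head?, ((p | list | note)+), div*)
print(div_content_model("dictionaries"))
# (head?, ((entry | p | list | note)+), div*)
```

The declaration of <div> is written once; only the value substituted
for component differs from one base to the next.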
 
   In the distributed DTD, the definitions of component and other
parameter entities for use in content models are found in a separate
file, teiclas2.ent, which also defines the structure of the TEI
element-class system described below in more detail.
 
   The final problem of implementing the pizza model, ensuring easy
introduction of user-specified extensions to the DTD, is solved once
more by use of parameter entities.  Because most user modifications
will be effected by overriding the default declarations of parameter
entities, it is essential that any user-modified parameter-entity
declarations be processed before the default declarations are
encountered.  (In SGML, the first declaration of an entity is the only
one which counts.)
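
   The rule can be illustrated with the selection of a base tag set.
In the following sketch (the system identifier tei2.dtd is assumed
purely for illustration), the declaration in the DTD subset is read
first and is therefore binding; the declaration of the same entity
within the external DTD is read later and has no effect:

     <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
     <!-- Read first: this declaration is the one which counts -->
     <!ENTITY % TEI.prose "INCLUDE">
     ]>

     <!-- Later, within the external DTD: a second declaration -->
     <!-- of the same entity, which the parser ignores         -->
     <!ENTITY % TEI.prose "IGNORE">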
 
   In simple cases, it would be enough for the user to insert the
desired new parameter-entity declarations in the DTD subset of a
document; since the DTD subset is processed before the external DTD
entity, this would ensure the necessary priority of declaration.  In
practice, however, this would have several disadvantages.  A project
which consistently used the same modifications of the TEI DTD would have
to include all of the modifications in the DTD subset of every document;
this is inconvenient and would lead to confusion when the local
modifications were themselves revised.  Moreover, elements declared
within the DTD subset are unable to refer to the parameter entities
normally used in element declarations; because component, for example,
has not yet been declared, an element declaration in the DTD subset
cannot refer to that parameter entity anywhere within its content model.
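
   The difficulty may be sketched as follows; the element <sidebar> and
the system identifier are hypothetical, chosen only for illustration.
The element declaration is read before the external DTD, and its
reference to component therefore cannot be resolved:

     <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
     <!ENTITY % TEI.prose "INCLUDE">
     <!-- Error: component is declared only in the external DTD, -->
     <!-- which has not yet been read at this point              -->
     <!ELEMENT sidebar - - (head?, (%component;)+) >
     ]>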
 
   It is more convenient if user-supplied element declarations can refer
to the same parameter entities as are found in the standard TEI
declarations.  This is possible if they are encountered in the DTD not
before but after the TEI-supplied parameter-entity declarations.  The
TEI's method of embedding user-supplied declarations ensures precisely
this.  The main driver file of the TEI DTD embeds two entities, within
which local modifications can be conveniently enclosed.  First, before
declaring the standard TEI parameter entities, it embeds any local
modifications of those declarations:
 
     <!ENTITY % TEI.extensions.ent ''>
     %TEI.extensions.ent;
     <!-- ... default entity declarations ... -->
 
As shown, the default declaration of TEI.extensions.ent is the null
string, so by default the reference to that entity has no effect.  When
the user overrides that declaration, as shown in the previous section,
the user's modifications will be embedded before the TEI declarations of
standard parameter entities are encountered.
 
   After all standard parameter entities such as component have been
declared, the driver file embeds any local element declarations:
 
     <!ENTITY % TEI.extensions.dtd ''>
     %TEI.extensions.dtd;
 
Again, the default expansion of the entity is to the null string.
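
   A project might then package its local modifications as a pair of
files and name them in the DTD subset of each document.  In this sketch
the file names project.ent and project.dtd, like the system identifier
of the main DTD, are hypothetical:

     <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
     <!ENTITY % TEI.prose "INCLUDE">
     <!-- Overrides of TEI parameter entities; embedded before -->
     <!-- the default entity declarations                      -->
     <!ENTITY % TEI.extensions.ent SYSTEM "project.ent">
     <!-- New or changed element declarations; embedded after  -->
     <!-- component and the other standard entities exist      -->
     <!ENTITY % TEI.extensions.dtd SYSTEM "project.dtd">
     ]>

Only these declarations need be repeated in each document; the
modifications themselves are maintained in one place.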
 
 
 
                                   4
 
                 OTHER FORMAL ASPECTS OF THE TEI SCHEME
 
 
   Several other characteristics of the TEI DTD should be described
briefly here, as they may be relevant to understanding the other
articles in this issue, and may also be of interest to others developing
general-purpose SGML tag sets.  This section describes in particular
 
*   the global attributes defined for every TEI element
*   the TEI class system, which allows attributes and structural
    characteristics to be declared for classes of elements and inherited
    by all elements which are members of those classes
*   the full range of mechanisms for extension and modification of the
    TEI DTD
*   the notion of the derived DTD, which is important for the
    understanding of TEI conformance
 
 
4.1   Global Attributes
 
   Several attributes are so widely useful that they have been defined
as applying to every element in the TEI DTD.  These are:
 
id:  provides an SGML identifier for an element; SGML requires this
  identifier to be unique within the document, and it may therefore be
  used as the target of cross-references and other pointers
n:  provides a possibly non-unique number, label, or name for an
  element; this is particularly useful when it is desired to specify
  explicitly the number of a section, or of a list item, without
  transcribing the number as data content
lang:  specifies the language of the content of an element; this is an
  indirect pointer to a TEI Writing System Declaration, which identifies
  the natural language and writing system and the character set used to
  transcribe them
rend:  provides information about the rendering of an element where this
  is desirable
 
   The id and n attributes allow for the identification of any element
occurrence within a TEI-conformant text.  Elements carrying an id
attribute value may be the object of a link or cross-reference, or any
of the other re-structuring mechanisms proposed by the TEI for
circumventing the rigidly hierarchic structure of a simple SGML DTD.
The fact that the requirement for such links is usually unpredictable is
one reason for making this attribute global.
 
   Values on id attributes must be unique (their declared value is ID).
Values on the n attribute however need not be; they may be used to carry
a TEI canonical reference.  A method for defining the structure of such
canonical reference schemes is also provided, so that documents using it
can be processed automatically.
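
   A brief sketch of the two attributes in use follows; the identifier
ch3 and the content are invented for illustration.  The id value must
be unique, and serves as the target of the <ref>; the n value merely
labels the division:

     <div id=ch3 n='3' type='chapter'>
     <head>Chapter Three</head>
     <p> ... </p>
     </div>
     <!-- ... -->
     <p>As argued in <ref target=ch3>chapter 3</ref>, ... </p>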
 
   The lang attribute indicates the language, and hence the writing
system, applicable to the element's content, thus providing explicit
support for polyglot or multi-script texts.  If no value is given, that
of the element's direct parent is inherited.  (A number of TEI
attributes use inheritance of this type; in the DTD, it is indicated by
the use of the TEI-defined keyword %INHERITED to specify the attribute's
default value.)  The value of this attribute identifies a
special-purpose <language> element which documents the language in use,
optionally associating it with an external entity in which a formal
writing system declaration may be given.
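
   For example, a header might declare two languages, and the text
refer to them.  This is a minimal sketch: the identifiers eng and lat
are arbitrary, and the enclosing <langUsage> element is assumed here
from the TEI header:

     <langUsage>
     <language id=eng>English</language>
     <language id=lat>Latin</language>
     </langUsage>
     <!-- ... -->
     <p lang=eng>The motto <foreign lang=lat>festina lente</foreign>
     is printed on the title page; the surrounding words inherit
     the value eng from the paragraph.</p>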
 
   The TEI writing system declaration (WSD) attempts to help encoders
come to terms with a world in which, for one reason or another,
documents may not always use the same universal character set, whether
from ignorance, perversity, or the sheer impossibility of finding one
large enough to represent all the glyphs they contain.  It provides for
the systematic documentation of a writing system, in terms of existing
international or other standards, public or private entity sets, ad hoc
transliteration schemes or explicit definitions, as well as combinations
of all four.
 
   Finally, the global rend attribute may be used to give information
about the physical presentation of the text in the source, where this is
not otherwise given.  A default rendition may be specified for all
elements of a given type.  No specific set of values is defined for this
attribute in the current draft, though it is possible that some suitable
set of DSSSL primitives may be proposed in a later version.(11)
 
   It should be stressed that the rend attribute does not specify
procedurally how an element is to be formatted.  It does associate the
element with a particular form of presentation, but does so in a purely
declarative way.  Its normal meaning is that the element in question was
rendered as indicated in the source -- it says nothing in itself as to
how the element is to be rendered in later processing, except insofar as
it may be desired to mimic the approximate appearance of the source
edition.  In document production applications, where there is no
physical source, some users may be tempted to use the rend attribute to
guide a formatting process, but external style sheets will normally
provide a better method of handling this task.
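
   For example (the values italic and small caps are illustrative only,
since no set of values is prescribed, and the sentence is invented):

     <p>The word <hi rend='italic'>pizza</hi> is set in italics in
     the source, and the name <hi rend='small caps'>Cleland</hi>
     in small capitals.</p>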
 
 
4.2   Inheritance of Attributes from Element Classes
 
   Other attributes, while not so widely useful as those defined
globally, do apply to more than one element type, and need to be kept
consistent among all elements which use them.  In the TEI scheme, such
attributes are defined as belonging to a particular class of elements;
attributes of a class in turn are inherited by all members of that
class.
 
   For example, all TEI elements which represent links or associations
between one element and another do so using a common set of attributes.
To express this visibly in the DTD, all such elements are defined as
members of an attribute class, to which the TEI gives the name pointer.
Since SGML has no explicit concept of element classes, class inheritance
is implemented by parameter entities.  The attributes of a class are
defined in a parameter entity, the name of which is the string a.
followed by the name of the class itself.  Members of the class refer to
that parameter entity in declaring their own attributes.  In the
following simplified example, the element <ref> is a member of the class
pointer, while <xref> is a member of the subclass xPointer:
 
     <!ENTITY % a.pointer '
               type               CDATA               #IMPLIED
               resp               CDATA               #IMPLIED
               targType           NAMES               #IMPLIED
               evaluate           (all | one | none)  #IMPLIED'      >
     <!ENTITY % a.xPointer '      %a.pointer
               doc                ENTITY              #IMPLIED
               from               CDATA               "ROOT"
               to                 CDATA               "DITTO"'       >
     <!ATTLIST ref                %a.global
                                  %a.pointer
               target             IDREFS              #IMPLIED       >
     <!ATTLIST xref               %a.global
                                  %a.xPointer                        >
 
After expansion of the parameter entities, it is as though <ref> had
been declared with the attributes id, n, lang, rend, type, resp,
targType, evaluate, and target.  The element <xref> has further
inherited the attributes of the xPointer class.  It is as though the
following declarations had been in the DTD:
 
     <!ATTLIST ref
               id                 ID                  #IMPLIED
               n                  CDATA               #IMPLIED
               lang               IDREF               %INHERITED
               rend               CDATA               #IMPLIED
               type               CDATA               #IMPLIED
               resp               CDATA               #IMPLIED
               targType           NAMES               #IMPLIED
               evaluate           (all | one | none)  #IMPLIED
               target             IDREFS              #IMPLIED       >
     <!ATTLIST xref
               id                 ID                  #IMPLIED
               n                  CDATA               #IMPLIED
               lang               IDREF               %INHERITED
               rend               CDATA               #IMPLIED
               type               CDATA               #IMPLIED
               resp               CDATA               #IMPLIED
               targType           NAMES               #IMPLIED
               evaluate           (all | one | none)  #IMPLIED
               doc                ENTITY              #IMPLIED
               from               CDATA               "ROOT"
               to                 CDATA               "DITTO"        >
 
   As may be seen,
 
*   the global attributes themselves are declared as the attributes of a
    class named global -- all elements are members of this class.
*   the subclass xPointer inherits attributes from its superclass
    pointer, and thus its members do, too; inheritance is thus
    transitive, as it should be.
*   the attributes of a class can readily be modified by the user, who
    need merely provide an alternative declaration for the appropriate
    parameter entity.
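
   The last point may be sketched as follows.  A user wishing to add an
attribute to every member of the pointer class might place a
declaration like this one among the local modifications, so that it is
read before the default; the attribute checked is hypothetical:

     <!ENTITY % a.pointer '
               type               CDATA               #IMPLIED
               resp               CDATA               #IMPLIED
               targType           NAMES               #IMPLIED
               evaluate           (all | one | none)  #IMPLIED
               checked            CDATA               #IMPLIED'      >

Because a.xPointer refers to a.pointer, members of the subclass
acquire the new attribute as well.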
 
 
4.3   Model Classes
 
   Inherited class membership is also used in defining content models.
Members of a model class share the same structural properties:  that is,
they may appear at the same position within the SGML document structure.
For example, the class phrase includes all elements which can appear
within paragraphs but not between them, while the class chunk includes
all elements which can appear between but not within paragraphs (such as
paragraphs, for example).  A class inter is also defined, for elements
such as lists which can appear either within or between chunk elements.
Similarly, the class divtop contains all elements (headings, epigraphs,
etc.) which can appear at the start of a textual division.
 
   As well as these general purpose classes, some functional or semantic
classes are defined:  for example, all elements used to mark editorial
corrections or omissions are members of the class edit; elements
marking bibliographic citations, etc., are members of the class bibl,
and so on.
 
   Elements may of course be members of more than one class.  Classes
may have super- and sub-classes, and properties are inherited.  For
example, reflecting the needs of many TEI users to treat texts both as
documents and as input to databases, a sub-class of phrase called data
is defined to include "data-like" features such as names of persons,
places or organizations, numbers and dates, abbreviations and measures.
These behave in exactly the same way as phrase elements, and so the data
class is a sub-class of the phrase class.
 
   The formal definition of these classes in the SGML syntax used to
express the TEI scheme makes it possible for users of the scheme to
extend it in a simple and controlled way: new elements may be added into
existing classes, and existing elements renamed or undefined, without
any need for extensive revision of the TEI document type definitions.
Each class is defined by two parameter entities.  The first contains a
list of class members, in a form suitable for use within a content
model; it is given a name constructed by adding the prefix m. (m plus
full stop) to the name of the class itself, and is thus termed an m-dot
entity.  The second, termed an x-dot entity, is by default declared as
the empty string; the m-dot entity contains a reference to it, so that
its value may be overridden to add members to the class.  (N.B. the use
of m-dot and similar prefixes is a naming convention followed
consistently by the TEI; it is not an SGML construct, and for an SGML
processor, the m-dot and x-dot entities for a class are simply two
distinct parameter entities.  The formal relationship between them is
detectable only by TEI-aware software, not by generic SGML systems.)
 
   For example, the element class divtop is defined as follows:
 
     <!ENTITY % x.divtop "">
     <!ENTITY % m.divtop "%x.divtop head | byline | epigraph">
 
   To add a new element (say, <keywords>) to this class, enabling it to
appear anywhere in the content model that other members of the class do,
all that is needed is to re-define the x-dot entity within the DTD
subset:
 
     <!ENTITY % x.divtop "keywords |">
 
Note the trailing vertical bar, which is required.  As it happens, the
element <keywords> is already defined in the TEI scheme (within the
header); if it were not, an element declaration would also be necessary
here.
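
   With that redefinition in force, the reference to x.divtop inside
m.divtop is replaced by the new value.  Expanded by hand from the
simplified declarations above, the result is equivalent to:

     <!-- Effective value of m.divtop after the redefinition -->
     <!ENTITY % m.divtop "keywords | head | byline | epigraph">

Without the trailing vertical bar, the expansion would run keywords
and head together with no connector between them, and the content
model would be in error.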
 
   The m-dot entity is used in content models wherever members of the
class can appear.  The divtop class, for example, appears in the
declaration of text-division elements:
 
     <!ELEMENT div - - ((%m.divtop)*, (%component.seq), div*) >
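
   The same convention can express the sub-class relationship between
data and phrase described above.  The following sketch is much
simplified, and the member lists are abbreviated for illustration
rather than copied from the distributed DTD:

     <!ENTITY % x.data "">
     <!ENTITY % m.data "%x.data name | num | date | abbr | measure">
     <!ENTITY % x.phrase "">
     <!ENTITY % m.phrase "%x.phrase %m.data | emph | term | hi">

Every member of data is thus a member of phrase, and an element added
by way of x.data appears wherever either class is referenced.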
 
 
4.4   Extension and Modification Mechanisms
 
   The TEI takes the following steps to ensure that the TEI DTD can be
readily extended or modified where necessary; some have already been
described.
 
*   modularization of the DTD, to allow individual components, which are
    referred to by using external parameter entities, to be replaced by
    modified components
*   pre-defined entities for TEI extensions, embedded by the driver at
    appropriate points, as described above
*   the provision of x-dot entities for all model classes, to allow
    convenient addition of new elements to the model-class system
*   the expression of attribute classes using parameter entities,
    allowing them to be modified easily
*   use of parameter entities for common content models, to allow easy
    modification
*   use of parameter entities for the generic identifier of each
    element, so that elements may be renamed simply by overriding the
    default declaration of the element's n-dot entity
*   declaration of each element within a conditional marked section
    controlled by a parameter entity, to allow individual elements to be
    suppressed at will
 
The first few points have already been explained; the last two can be
illustrated with a simple example.  In the non-modifiable form of the
TEI DTD, the declarations for <ptr>, for example, are given in the
following form:
 
     <!ELEMENT ptr           - O  EMPTY                              >
     <!ATTLIST ptr                %a.global
                                  %a.pointer
               target             IDREFS              #REQUIRED      >
 
In the modifiable-form version of the same file, however, the
declarations take the following form (again, we simplify slightly):
 
     <!ENTITY % n.ptr 'ptr' >
     <!ENTITY % ptr 'INCLUDE' >
     <![ %ptr; [
     <!ELEMENT %n.ptr;       - O  EMPTY                              >
     <!ATTLIST %n.ptr;            %a.global
                                  %a.pointer
               target             IDREFS              #REQUIRED      >
     ]]>
 
The <ptr> element can be renamed <crossref> by simply redefining the
entity used for its generic identifier:
 
     <!ENTITY % n.ptr 'crossref'>
 
It can be suppressed entirely by redefining that used to control the
marked section within which it is declared:
 
     <!ENTITY % ptr 'IGNORE'>
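
   After the first of these redefinitions, the marked section shown
above is equivalent to the following declarations (expanded by hand
for illustration), and documents must use the new name; the target
value ch3 is invented:

     <!ELEMENT crossref      - O  EMPTY                              >
     <!ATTLIST crossref           %a.global
                                  %a.pointer
               target             IDREFS              #REQUIRED      >

     <!-- In the document instance: -->
     <p>See <crossref target=ch3> for details.</p>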
 
 
4.5   Derived DTDs
 
   The TEI DTDs may be used as they stand; it is expected, however, that
many users will modify them slightly, if only by selecting a particular
set of tag sets and suppressing the declaration of elements which they
expect never to tag in their documents.  The Oxford Text Archive, for
example, has for some time been distributing documents encoded with a
small subset of the TEI DTD, in which the main difference from the
standard DTDs has been the elimination of many elements of the core and
header.  The OTA DTD can be expressed purely as a set of modifications
of the standard TEI DTD, in the manner shown earlier, in which the
TEI.extensions.ent file contains a large number of declarations of the
following forms:
 
     <!ENTITY % sic  'IGNORE'>
     <!ENTITY % corr 'IGNORE'>
     <!-- etc. -->
     <!ENTITY % n.titleStmt 'titlStmt' >
     <!ENTITY % n.editionStmt 'edStmt' >
     <!ENTITY % n.publicationStmt 'pubStmt' >
     <!-- etc. -->
 
Rather than parse the entire TEI DTD each time a document needs to be
parsed, however, it is more convenient to define a free-standing OTA
DTD, which defines exactly the same element grammar as that of the TEI
DTD as modified by OTA; this derived DTD is smaller than the TEI DTD,
makes no reference to suppressed elements, and is thus more readily
comprehensible on its own terms than it would be if it had to be read
only in conjunction with the full TEI DTD.  Its disadvantage, from the
point of view of document interchange, is that it is difficult to
disentangle the parts which represent modifications of the TEI DTD from
the parts which are exactly the same; a recipient who wanted to know
exactly what changes would be required to process an OTA text with a
standard TEI-aware processor would have difficulty finding out.  It will
be most convenient, in general, if the derived DTD is provided in two
forms:  both as a free-standing DTD and as a pair of extension files
which are to be used as the TEI.extensions.ent and TEI.extensions.dtd
entities.
 
   Documents conforming to a derived DTD are also TEI conformant, as
long as the element grammar expressed by the derived DTD can also be
expressed by a set of legal modifications to the TEI DTD.  For ease of
maintenance, it will be convenient to maintain the DTD as a set of
modifications, and to generate the free-standing derived DTD from the
modified TEI DTD by means of a program, so that both forms express the
same element grammar.
 
   As pointed out in chapter 29 of TEI P3, any modification of a DTD
changes the set of documents accepted by that DTD.  The set of documents
accepted by a derived DTD may be:
 
*   a subset of the documents accepted by the unmodified TEI DTD
*   a superset of those documents
*   neither a subset nor a superset
 
If either of the first two cases applies, the changes may be said to be
"clean" modifications; clean modifications, however, are not a condition
of TEI conformance, and a combination of clean modifications is not
itself guaranteed clean.
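
   The modifications used as examples in this paper illustrate the
three cases.  The classification below is a sketch which assumes that
each change is made alone, and that a suppressed element is nowhere
required by a content model:

     <!-- Subset: documents using <sic> are no longer accepted, -->
     <!-- and no new documents are accepted                      -->
     <!ENTITY % sic 'IGNORE'>

     <!-- Superset: all old documents remain valid, and          -->
     <!-- <keywords> is now also allowed at the start of a       -->
     <!-- division                                               -->
     <!ENTITY % x.divtop "keywords |">

     <!-- Neither: documents using <ptr> are rejected, while     -->
     <!-- documents using <crossref> are newly accepted          -->
     <!ENTITY % n.ptr 'crossref'>

Applied together, the first two yield a DTD whose documents are neither
a subset nor a superset of those accepted by the unmodified TEI DTD,
although each is clean when applied alone.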
 
 
 
                                   5
 
                      ELEMENTS OF THE CORE TAG SET
 
 
   It remains to describe at least briefly the elements defined in the
core tag set and in the default text-structure tag set, which are not
described elsewhere in this volume.  The core tag set defines elements
common to many different types of text, which are commonly enough used
that it seemed inconvenient to separate them out into additional tag
sets.  The elements defined in the core can be grouped into:
 
*   those which can appear directly within a text division
    ("paragraph-level" elements -- to avoid a bias toward prose, P3
    defines these as component-level elements)
*   those which can appear at character, word, or phrase level, i.e.
    within paragraphs or other component-level elements -- and also
    within other phrase-level elements; these elements are members of
    the class phrase
*   those which can appear both at phrase-level and at component-level;
    these are members of the class inter
 
There are few documents which do not exhibit some of these features; and
none of these features is particularly restricted to any one kind of
document. In some cases, additional more specialized tag sets are
provided for those wishing to encode aspects of these features in more
detail (e.g. in chapter 14, which provides a more elaborate tag set for
cross references and hypertext linking, or chapter 20, which allows for
detailed analysis of the components of names, dates, and measures), but
the elements defined in this core are believed to be adequate for most
applications most of the time.
 
 
5.1   Paragraph-Level Elements
 
   The only items defined as strictly component-level elements by the
core tag set are:
 
*   paragraphs (tagged <p>)
*   verse lines (<l>)
*   verse stanzas or other line groups (<lg> -- an <lg> contains one or
    more <l> elements)
*   dramatic speeches (<sp>), which contain an optional speaker
    indication (<speaker>) and a series of paragraphs, lines, or line
    groups
 
Paragraphs are included in the core, rather than in the base tag set for
prose, because prose materials regularly appear in texts of all types,
including dictionaries and term banks, if only in the front and back
matter.  SGML elements for verse and dramatic speeches are included in
the core tag set, rather than in the base tag sets for verse and drama,
on similar grounds:  they appear frequently in texts of all sorts,
either directly, in the case of mixed-genre materials, or indirectly
through quotations and epigraphs.  Any attempt to segregate these
elements into the respective base tag sets would have rendered it so
frequently necessary to combine base tag sets that the utility of
separating the bases would have been seriously impaired.
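
   A brief sketch of these elements in use (the texts are familiar
ones, chosen only for illustration):

     <p>Paragraphs of prose are tagged thus.</p>

     <lg>
     <l>Tyger! Tyger! burning bright</l>
     <l>In the forests of the night,</l>
     </lg>

     <sp><speaker>Hamlet</speaker>
     <l>To be, or not to be, that is the question:</l>
     <l>Whether 'tis nobler in the mind to suffer</l>
     </sp>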
 
 
5.2   Phrase-level Elements
 
   The phrase-level elements of the TEI core serve a variety of uses.
They include:
 
*   typographically highlighted phrases, optionally distinguishing among
    emphasis (<emph>), technical terms (<term>), foreign words
    (<foreign>) or words otherwise regarded as linguistically distinct
    or anomalous (<distinct>), titles of books (<title>), etc.  When
    such distinctions are not to be made, the phrase may be tagged
    simply as "highlighted" (<hi>); the global rend attribute can be
    used in either case to specify how the phrase is highlighted in the
    source.
*   quoted phrases, optionally distinguishing quotation or direct
    discourse (<q> -- strictly speaking, <q> is an inter-level element),
    glosses (<gloss>), cited phrases (<mentioned>), "scare quotes"
    (<soCalled>), or not so distinguishing (<hi> again).
*   semantically distinct and important items like names (<name>),
    references to people, places, or things (<rs>, for 'referring
    string'), numbers and measures (<num> and <measure>), dates and
    times (<date>, <time>, <dateRange>, <timeRange>), abbreviations and
    their expansions (<abbr>, <expan>), etc.
*   simple editorial changes such as correction of apparent errors
    (<corr>) or faithful reproduction of errors (<sic>), regularization
    and normalization or the reproduction of unnormalized forms (<reg>
    or <orig>), additions (<add>), material transcribed despite being
    deleted (<del>), and omissions (<gap>); material not reliably
    transcribed because the original is illegible or inaudible may be
    marked <unclear>.
*   simple links and cross references (<ptr> and <ref>), providing basic
    hypertextual features; <ptr> is an empty element, used as in a
    document-production system which generates the appropriate cross
    reference; <ref> encloses a phrase which expresses a cross
    reference.
*   pre-existing or generated indexing (<index>)
*   simple or complex referencing systems, not necessarily dependent on
    the existing SGML structure (<pb>, <lb>, <cb> for page, line, and
    column breaks, <milestone> for other reference units)
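
   The following invented paragraph sketches several of these elements
together.  The attributes shown, including the sic attribute on <corr>
and the value attribute on <date>, are given as assumptions about the
core tag set rather than quoted from it:

     <p>The <term>base tag set</term> is introduced on the
     <pb n='12'>next page.  The printed text reads
     <corr sic='recieved'>received</corr>, marks
     <foreign>ad hoc</foreign> schemes in
     <hi rend='italic'>italics</hi>, and is dated
     <date value='1995-06-12'>12 June 1995</date> at
     <name type=place>Chicago</name>.</p>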
 
 
5.3   Inter-level Elements
 
   The core tag set defines a number of elements which can appear at
either the phrase or the component level:
 
*   annotation, whether pre-existing or added by the encoder (<note>);
    the place and type attributes can be used to distinguish notes by
    place of appearance in the source (footnotes, endnotes, marginal
    notes, etc.) or by type (source, parallel passage, etc.)
*   lists of all kinds (<list>, <item>, and <label> for item labels;
    <listBibl> for lists of bibliographic citations)
*   quotation, whether actual or pretended (e.g. dialog in a novel or
    narrative history); these are not always reliably distinguishable,
    and both may be tagged <q>.  Where it is possible and desirable to
    draw a distinction between the two, <q> should be used for dialog
    or direct discourse, and <quote> for actual quotations from other
    authors.  Attributed quotations may be associated with their
    bibliographic references by enclosing both within a <cit> element.
*   bibliographic citations, adequate for most commonly used
    bibliographic packages, in either a free or a tightly structured
    format (<bibl> for unstructured, <biblStruct> for tightly
    structured, bibliographic citations, <listBibl> for lists of
    bibliographic entries)
*   stage directions, which can occur either within or between speeches
    (<stage>)
*   embedded texts (tagged <text>)
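
   A sketch of some of these elements in use follows; the prose is
invented, the quotation is taken from this paper, and the attribute
value foot is merely one plausible value for place:

     <p>Three bases are mentioned:
     <list>
     <item>prose</item>
     <item>dictionaries</item>
     <item>spoken texts</item>
     </list>
     Others exist.<note place='foot'>See the list given in
     the Guidelines.</note></p>

     <cit>
     <quote>In SGML, the first declaration of an entity is the only
     one which counts.</quote>
     <bibl>Sperberg-McQueen and Burnard, <title>The Design of the
     TEI Encoding Scheme</title></bibl>
     </cit>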
 
 
 
                                   6

             ELEMENTS OF THE DEFAULT TEXT STRUCTURE TAG SET


   The default text structure, which is currently embedded by all TEI
base tag sets, divides each text into a body, surrounded by optional
front and back matter.  These three top-level units are tagged,
respectively, <front>, <body>, and <back>.


6.1   Text Body and Text Divisions
 
   While different text types exhibit a bewildering variety of names for
their component parts (chapters, sections, subsections, acts, scenes,
entries, parts, books, cantos, adventures, staves, fittes, etc. -- to
mention only English-language terms), nevertheless the components and
subcomponents of text typically behave in the same way whatever their
name:  typically each such is incomplete in itself, and typically
smaller ones nest within larger ones.  In the TEI scheme, therefore, all
such objects are regarded as the same fundamental kind of element, which
the TEI calls a <div> (or text division).
 
   A type attribute may be used to give the name associated with a
particular text division in the source; chapters may thus be tagged <div
type='chapter'>, cantos may be tagged <div type='canto'>, and so on.  In
this way, the encoder can distinguish different types of division.  It
will frequently be possible to distinguish them also by their
hierarchical position:  when plays are consistently divided into acts,
and acts into scenes, then the top-level division of a text body will
consistently be a <div type='act'> and the next level a <div
type='scene'>.  Like others of its kind in the TEI scheme, the type
attribute has no standard set of values, precisely because no such
standard set of terms exists, or is likely to exist.  A set of legal
values may however be defined for a given application, either in the TEI
Header or by a user-defined modification to the DTD.
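
   For a play divided into acts and scenes, the nesting might be
sketched as follows (content elided):

     <body>
     <div type='act' n='1'>
     <div type='scene' n='1'>
     <!-- speeches and stage directions -->
     </div>
     <div type='scene' n='2'>
     <!-- ... -->
     </div>
     </div>
     </body>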
 
   Following the example of such markup languages as Scribe, LaTeX, GML,
and the SGML tag set developed for the Association of American
Publishers (now ISO 12083), the TEI defines a set of text-division
elements each labeled with its hierarchical depth (<div0>, <div1>, etc.,
down to <div7>).  This style of tagging is intuitively clear to many
users, allows convenient markup omission techniques, and can sometimes
assist in the detection of errors in the markup.  Since such a style of
markup can make it hard to reuse text unless it is embedded at exactly
the same level in the hierarchy of text divisions, however, a second
style of markup is sometimes preferred; this too is provided by the TEI,
in the form of an un-numbered or "vanilla" <div>.  Either numbered or
un-numbered division elements may be used in TEI documents, but they may
not be mixed in the same <front>, <body>, or <back> element.
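
   In the numbered style, a play divided into acts and scenes might be
sketched thus (content elided); the level of each division is fixed by
its generic identifier, so that a scene cannot be moved to another
level without being retagged:

     <body>
     <div1 type='act' n='1'>
     <div2 type='scene' n='1'>
     <!-- ... -->
     </div2>
     <div2 type='scene' n='2'>
     <!-- ... -->
     </div2>
     </div1>
     </body>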
 
   In the normal case, the components of all divisions in a particular
base are homogeneous -- they all use the same value for component.seq.
However, the scheme also allows for two kinds of heterogeneity.  If the
general base is selected, together with two or more other bases, then
different divisions of a text may have different constituents, though
each division must itself be homogeneous.  A mixed base is also defined,
in which components from any selection of bases may be combined
promiscuously across division boundaries.  The way in which this is done
is beyond the scope of this article; full details are however given in
the appropriate chapter of the TEI Guidelines.
 
   At the beginning of each text division, a division title (tagged
<head>) may appear, as may several other items (<epigraph>, <argument>,
<byline>, <dateline>, <salute> -- the salutation at the opening of a
letter, or the signoff at its closing --, <signed> -- the signature of a
letter).  A number of these may also appear at the end of a division.
To avoid ambiguity in the document grammar, the ambiguous items
(<byline>, <dateline>, <salute>, and <signed>) must be enclosed within
the grouping elements <opener> or <closer>.
 
   For example:
 
     <div><head>A LETTER TO THE PUBLISHER
     Occasioned by the present
     Edition of the DUNCIAD</head>
     <p>It is with pleasure I hear that you have procured a
     correct edition of the <title>Dunciad</title>, ...
     <!-- ... -->
     <p>I shall conclude with remarking ...
     I am
     <closer>
     <salute>Your most humble Servant,</salute>
     <dateline><name type=place>St. James's</>
     <date>Dec. 22, 1728</date></dateline>
     <signed>William Cleland.</signed>
     </closer>
     </div>
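 
   The opening of a letter may be grouped in the same way with
<opener>.  The following sketch uses invented content purely for
illustration:

     <div type=letter>
     <opener>
     <dateline><name type=place>Oxford</name>,
     <date>12 June 1995</date></dateline>
     <salute>Dear Sir,</salute>
     </opener>
     <p>I write to thank you for ...
     </div>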
 
 
5.2   Front and Back Matter
 
   In general, front and back matter are tagged in TEI documents using
the same elements as the text body; a few specialized elements are
defined for structural features found only there.  No specialized
elements are defined for items like prefaces, acknowledgements, etc.,
since these may conveniently be tagged <div type='Preface'>, etc.  The
title page, however, is given a tag of its own (<titlePage>), and a
number of specialized elements are defined for it, to make it easier to
extract salient information like the title and author of the document in
question.
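 
   A minimal sketch of front matter along these lines might run as
follows.  The title-page elements shown (<docTitle>, <titlePart>, and
<docAuthor>) are only a small selection of those defined in P3, and the
title and author are invented for illustration:

     <front>
     <titlePage>
     <docTitle><titlePart type=main>An Essay on Markup</titlePart></docTitle>
     <docAuthor>A. N. Author</docAuthor>
     </titlePage>
     <div type='Preface'><head>Preface</head>
     <p> ...
     </div>
     </front>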
 
 
5.3   Groups
 
   Textual divisions pose no problems in many conventional texts.  In
some cases, however, it is impossible to find a satisfactory method of
handling the internal structure of a text, using only the elements
<text>, <body>, and <div>.  In encoding an anthology, for example, one
might well balk at tagging a poem or short story as a division of some
larger whole, ignoring the fact that it too is a text in its own right.
One could, of course, tag the entire anthology as a corpus of texts,
enclosing each item in the anthology within its own <tei.2> and <text>
elements; but only at the cost of losing the ability to regard the
entire collection as a text in its own right.  For anthologies prepared
by later editors, some encoders might find this cost acceptable, but it
is intolerable for poem cycles and other elaborately structured works.
 
   For cases of this sort, the TEI scheme defines an alternative to
<body> and <div>.  In an anthology, the front and back matter surround a
<group> element, which contains either a series of <text> elements, or a
series of smaller <group> elements, or some combination thereof.  A
group can have the same kind of header and trailer elements at its
beginning and end as can a text division; texts enclosed within a group
have the normal structure and can thus have their own front matter,
body, and back matter -- or may themselves be compound texts, containing
groups of their own.
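 
   In outline, an anthology encoded in this way has the following
shape.  This is a skeleton only:  the comments stand in for content
which a valid document would have to supply.

     <text>
     <front> <!-- front matter of the anthology as a whole --> </front>
     <group>
     <text>
     <front> <!-- front matter of the first poem, if any --> </front>
     <body> <!-- the first poem --> </body>
     </text>
     <text>
     <body> <!-- the second poem --> </body>
     </text>
     <group>
     <!-- a smaller group, itself containing <text> elements -->
     </group>
     </group>
     <back> <!-- back matter of the anthology as a whole --> </back>
     </text>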
 
   In addition, <text> is declared as a component-level element, so that
embedded texts quoted in full can be tagged as free-standing texts.
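 
   A sketch of such an embedded text, with invented content, might look
like this:

     <p>She unfolded the paper and read the verses aloud:</p>
     <text>
     <body>
     <lg>
     <l> ...
     <l> ...
     </lg>
     </body>
     </text>
     <p>When she had finished, no one spoke.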
 
 
 
                                   6
 
                               CONCLUSION
 
 
   It is too early to tell with certainty, and the authors are not the
judges called to decide, whether the TEI scheme has wholly justified all
the hopes placed in it at its inception.  But it is possible to make a
brief survey of the original design goals and see where the TEI now
stands in relation to them.
 
   We believe that the scheme defined in TEI P3 does suffice, thanks to
the generous work of scores of specialists, for the immediate needs of
the majority of scholars working with electronic text.  Without
user-defined extensions, it does not suffice, and no finite vocabulary
can ever suffice, for all the future needs of all researchers; by
providing extensive facilities for user extensions, however, we believe
the TEI does the best job possible of meeting the future needs of
research which cannot be foreseen today, as well as making the existing
DTD usable for those with specialized needs today.  The extensive
facilities for analysis and interpretation are designed to allow a great
deal of extension to the native semantics of the TEI tag sets to occur
without requiring any formal modification to the SGML DTD.
 
   The scheme is, however, far more complex and daunting than either of
us thought likely when the project began.  In part, this reflects the
project's efforts to meet the existing documented needs of researchers;
in part, no doubt, it reflects shortcomings in the editorial work for
which we must shoulder responsibility.  In our own defense we will offer
only the observation that a markup language truly capable of reflecting
in detail the results of research by scholars in very diverse fields
will necessarily take on some of the complexity
inherent in advanced work in those fields; a very simple markup scheme
will hardly provide the kind of support for research work which P3
provides.  We do hope to have kept the explanations in TEI P3 reasonably
clear, and we do claim that TEI P3 is reasonably concrete:  it is not a
meta-DTD but a real markup language which can be used without further
ado, and for every element there are concrete examples, almost always
from real texts.
 
   We can also certify that it is possible to use the TEI markup
language without special-purpose software; we have used earlier versions
of it ourselves for several years, with no editors or word processors
other than the standard ones on our systems.  The goal, of course, was
that the scheme should be not just "possible", but easy to use without
special-purpose software.  And here we must admit that while it has not
been very onerous to edit TEI documents in standard editors, the task is
made much easier by the use of SGML-aware editors.  And while it has
been possible to write our own software to process our own SGML
documents (the printed version of TEI P3 was generated with that
software), the task of processing SGML documents becomes incomparably
simpler when SGML-aware software is used.
 
   The final items in the initial list of design goals call for the TEI
scheme to allow user-defined extensions, and to conform to existing and
emergent standards.  The list of facilities given above for simplifying
the task of extending the TEI DTD has, we hope, persuaded the reader
that such extension is adequately catered to.  And the final goal has
been met by erecting the Guidelines on the foundations provided by SGML.
The task of conforming to emerging standards, of course, has no end, and
future revisions of the Guidelines will have to take explicit account of
HyTime, DSSSL, and other standards which accompany and extend SGML.
 
   Some work remains for the future.  Several existing chapters would
benefit from further elaboration; several missing areas (analytic
bibliography of early printed books, computational lexica, detailed
description of page layout and design, letters and memoranda, and
numerous others) should be included in some future revision.  The most
pressing need, however, remains that of developing simple introductory
manuals for teaching the Guidelines to practitioners and students -- a
task to which we now turn our attention, and that of the community as a
whole.
 
 
 
 
 
                                 NOTES
 
 
(1) Association for Computers and the Humanities (ACH), Association for
    Computational Linguistics (ACL), and Association for Literary and
    Linguistic Computing (ALLC), Guidelines for Electronic Text Encoding
    and Interchange, ed.  C. M.  Sperberg-McQueen and Lou Burnard
    (Chicago, Oxford:  Text Encoding Initiative, 1994).
 
(2) Although it requires SGML conformance of all TEI-encoded documents,
    the TEI encoding scheme is not strictly speaking a "conforming SGML
    application" within the definition given in ISO 8879, because of the
    additional restrictions it imposes on the SGML constructs to be used
    in the TEI Interchange Format, and because P3 sometimes replaces
    standard terminology used in ISO 8879 with terminology which was
    felt to be less stilted and thus, we hope, clearer.
 
(3) In Saussure's terminology, we are distinguishing here between the
    signified and the signifier.  The association we postulate between
    the tag and a specific feature, however, means that a tag, as we use
    the term, is not a signifier but a sign.  Since in common usage, the
    term tag is used both to refer to signifiers and to refer to signs,
    we use identifier when it is important to refer unambiguously to the
    signifier, and element (see below) when it is desired to refer
    unambiguously to the sign.
 
(4) For Formex, see C. Guittet, ed., Formex:  Formalized Exchange of
    Electronic Publications (Luxembourg: Office for Official
    Publications of the European Communities, 'New Technologies --
    Project Management' Department, 1984); for the starter set see ISO
    8879-1986 Information processing -- Text and office systems --
    Standard Generalized Markup Language (SGML) ([n.p.]:  ISO, 1986),
    Annex E.1, pp. 136-139.  The starter set is discussed and various
    national-language versions defined in ISO technical report ISO/TR
    9573-1988(E) Information processing--SGML support
    facilities--Techniques for using SGML.
 
(5) In our usage, an SGML element is thus a sign:  an association
    between a signifier (the identifier) and a signified (the feature).
 
(6) It is therefore perhaps rather inconsistent to refer to the sets of
    elements described in chapters 5 to 23 of P3 as tag sets, but the
    usage is now established.
 
(7) More rigorously, the formal portion of the DTD directly determines
    only a set of identifiers.  Taken in conjunction with the semantic
    rules of the encoding scheme, however, this set uniquely determines
    the tag set defined by the DTD.
 
(8) See, for example, Frank Wm. Tompa, "What is (Tagged) Text?" in
    Dictionaries in the Electronic Age:  Fifth Annual Conference of the
    UW Centre for the New Oxford English Dictionary (Oxford:  [n.p.],
    1989).
 
(9) A content model of ANY is defined as equivalent to a starred
    alternation of #PCDATA and all the elements declared in the DTD.
    For further discussion of starred alternations, see below.
 
(10) For example, in a DTD containing declarations for <foo>, <bar>,
     <fubar>, and <blort>, the declaration given here is the exact
     equivalent of the Waterloo-style declaration for <blort> given
     earlier.
 
(11) DSSSL, the Document Style Semantics and Specification Language, is
     a draft ISO standard for specifying the processing or layout of
     SGML documents; it is formally defined in ISO DIS 10179 --
     Information Technology--Text Composition--Document Style Semantics
     and Specification Language ([Geneva]:  ISO, 1994).
 
 
 
 
 
 
 
                              WORKS CITED
 
 
*   Association for Computers and the Humanities (ACH), Association for
    Computational Linguistics (ACL), and Association for Literary and
    Linguistic Computing (ALLC). Guidelines for Electronic Text Encoding
    and Interchange, ed.  C. M.  Sperberg-McQueen and Lou Burnard.
    Chicago, Oxford:  Text Encoding Initiative, 1994.
*   Guittet, C., ed. Formex:  Formalized Exchange of Electronic
    Publications.  Luxembourg: Office for Official Publications of the
    European Communities, 'New Technologies -- Project Management'
    Department, 1984.
*   International Organization for Standardization (ISO).  ISO 8879-1986
    Information processing -- Text and office systems -- Standard
    Generalized Markup Language (SGML).  [Geneva]:  ISO, 1986.
*   --.  ISO/TR 9573-1988(E) Information processing--SGML support
    facilities--Techniques for using SGML.  [Geneva]:  ISO, 1988.
*   --.  ISO DIS 10179 -- Information Technology -- Text Composition --
    Document Style Semantics and Specification Language.  [Geneva]:
    ISO, 1994.
*   Text Encoding Initiative.  TEI ED P1 "Design Principles for Text
    Encoding Guidelines."  [Chicago, Oxford]:  TEI, 1989.
*   --.  TEI ED P2 "Charges to the Working Committees."  [Chicago,
    Oxford]:  TEI, 1989.
*   --.  TEI ED P3 "Theoretical Stance and Resolution of Theory
    Conflict."  [Chicago, Oxford]:  TEI, 1989.
*   Tompa, Frank Wm. "What is (Tagged) Text?" In Dictionaries in the
    Electronic Age:  Fifth Annual Conference of the UW Centre for the
    New Oxford English Dictionary.  Oxford:  [n.p.], 1989.