<?xml version="1.0" encoding="utf-8"?>
<!--
Copyright TEI Consortium. 
Dual-licensed under CC-by and BSD2 licences 
See the file COPYING.txt for details.
$Date$
$Id$
-->


<?xml-model href="http://tei.oucs.ox.ac.uk/jenkins/job/TEIP5/lastSuccessfulBuild/artifact/P5/release/xml/tei/odd/p5.nvdl" type="application/xml" schematypens="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"?>

<div xmlns="http://www.tei-c.org/ns/1.0" type="div1" xml:id="SG">

<head>A Gentle Introduction to XML</head>

<p>The encoding scheme defined by these Guidelines is formulated as an
application of the Extensible Markup Language (XML) (<ptr target="#XMLREC"/>). XML is widely used for
the definition of device-independent, system-independent methods of
storing and processing texts in electronic form. It is now also the
interchange and communication format used by many applications on the
World Wide Web. In the present chapter we informally introduce some of
its basic concepts and attempt to explain to the reader encountering
them for the first time how and why they are used in the TEI
scheme. More detailed technical accounts of TEI practice in this
respect are provided in chapters <ptr target="#USE"/>, <ptr target="#ST"/>, and <ptr target="#TD"/> of these Guidelines.</p>

<p>Strictly speaking, XML is a <term>metalanguage</term>, that is, a
language used to describe other languages, in this case,
<term>markup</term> languages.  Historically, the word
<term>markup</term> has been used to describe annotation or other
marks within a text intended to instruct a compositor or typist how a
particular passage should be printed or laid out.  Examples include
wavy underlining to indicate boldface, special symbols for passages to
be omitted or printed in a particular font, and so forth.  As the
formatting and printing of texts was automated, the term was extended
to cover all sorts of special codes inserted into electronic texts to
govern formatting, printing, or other processing.</p>
<p>Generalizing from that sense, we define <term>markup</term>, or
(synonymously) <term>encoding</term>, as any means of making explicit
an interpretation of a text.  Of course, all printed texts are
implicitly encoded (or marked up) in this sense: punctuation marks,
capitalization, disposition of letters around the page, even
the spaces between words all might be regarded as a kind of markup, the
purpose of which is to help the human reader determine where one word
ends and another begins, or how to identify gross structural features
such as headings or simple syntactic units such as dependent clauses
or sentences.  Encoding a text for computer processing is, in
principle, like transcribing a manuscript from <foreign xml:lang="la">scriptio continua</foreign><note place="bottom">In the
<q>continuous writing</q> characteristic of manuscripts from the early
classical period, words are written continuously with no intervening
spaces or punctuation.</note>; it is a process of making explicit what is
conjectural or implicit, a process of directing the user as to how the
content of the text should be (or has been) interpreted.</p>

<p>By <term>markup language</term> we mean a set of markup conventions
used together for encoding texts.  A markup language must specify how
markup is to be distinguished from text, what markup is allowed, what
markup is required, and what the markup means.  XML provides the means
for doing the first three; documentation such as these Guidelines is
required for the last.</p>

<p>The present chapter attempts to give an informal introduction to
those parts of XML of which a proper understanding is necessary to
make best use of these Guidelines. The interested reader should also
consult one or more of the many excellent introductory textbooks and
web sites now available on the subject.<note place="bottom">New
textbooks and websites about XML appear at regular intervals and to
select any one of them would be invidious. <!-- A useful list of
pointers to introductory web sites is available from <ptr
target="http://www.xml.org/xml/resources_focus_beginnerguide.shtml"/>;-->
Some recommended online courses include <ptr
target="http://www.w3schools.com/xml/default.asp"/> and <ptr
  target="http://www.ibm.com/developerworks/training/ondemandskills/index.html"/>.</note></p>

<div type="div2" xml:id="SG11"><head>What's Special about XML?</head>

<p>XML has three highly distinctive advantages: 
<list rend="numbered">
<item>it places emphasis on descriptive rather than
procedural markup;</item>
<item>it distinguishes the concepts of syntactic correctness and of
<term>validity</term> with respect to a <term>document type definition</term>;</item>
<item>it is independent of any one hardware or software
system.</item></list> These three aspects are discussed briefly below,
and then in more depth in the remainder of this chapter.</p>

<p>XML is frequently compared with HTML, the language in which web
pages have generally been written, which
shares some of the above characteristics. Compared with HTML, however,
XML has some other important features:

<list rend="bulleted"><item>XML is <hi>extensible</hi>: it does not consist of a fixed set of
tags;</item>
<item>XML documents must be <hi>well-formed</hi> according to a defined
syntax;</item>
<item>an XML document can be formally <hi>validated</hi> against a schema of
some kind;</item>
<item>XML is concerned more with the meaning of data than with its
presentation.</item></list> </p>

<div type="div3" xml:id="SG111"><head>Descriptive Markup</head>

<p>In a descriptive markup system, the markup codes used do little
more than categorize parts of a document.  Markup codes such as
<tag>para</tag> or <code>\end{list}</code> simply identify a portion
of a document and assert of it that <q>the following item is a
paragraph</q>, or <q>this is the end of the most recently begun
list</q>, etc.  By contrast, a procedural markup system defines what
processing is to be carried out at particular points in a document:
<q>call procedure PARA with parameters 42, b, and x here</q> or <q>move
the left margin 2 quads left, move the right margin 2 quads right,
skip down one line, and go to the new left margin,</q> etc.  In XML,
the instructions needed to process a document for some particular
purpose (for example, to format it) are sharply distinguished from the
markup used to describe it.</p>
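
<p>As a minimal sketch of descriptive markup (the element names below
are invented for this illustration and are not those defined by these
Guidelines), the following fragment states only what each part of the
text is, and says nothing about how it should be processed:
<eg xml:space="preserve"><![CDATA[<section>
  <heading>Descriptive Markup</heading>
  <para>The markup identifies this as a paragraph.</para>
  <list>
    <item>the first item of a list</item>
    <item>the second item of a list</item>
  </list>
</section>]]></eg>
Whether the heading is to be printed in bold, or the list items
numbered, is left to whatever software processes the document.</p>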

<p>Usually, the markup or other information needed to process a
document will be maintained separately from the document itself,
typically in a distinct document called a <term>stylesheet</term>,
although a stylesheet may do much more than simply define the rendition
or visual appearance of a document.<note place="bottom">We do not here discuss in
any detail the ways that a stylesheet can be used or defined, nor do
we discuss the popular W3C Stylesheet Languages XSLT and CSS. See
further <ptr target="#XSL11"/>, <ptr target="#XSLT"/>, and 
<ptr target="#CSS1"/>. </note></p>

<p>When descriptive markup is used, the same document can
readily be processed in many different ways, using only those parts of
it which are considered relevant.  For example, a content analysis
program might disregard entirely the footnotes embedded in an
annotated text, while a formatting program might extract and collect
them all together for printing at the end of each chapter. Different
kinds of processing can be carried out with the same part of a file.
For example, one program might extract names of persons and places
from a document to create an index or database, while another,
operating on the same text, but using a different stylesheet, might
print names of persons and places in a distinctive typeface.</p>
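
<p>The following sketch illustrates the point. It assumes two element
names, <gi scheme="imaginary">nameOfPerson</gi> and <gi scheme="imaginary">nameOfPlace</gi>,
which are invented for this illustration only:
<eg xml:space="preserve"><![CDATA[<para><nameOfPerson>William Blake</nameOfPerson> lived for a time in
<nameOfPlace>Lambeth</nameOfPlace>.</para>]]></eg>
An indexing program could collect the content of every
<gi scheme="imaginary">nameOfPerson</gi> and <gi scheme="imaginary">nameOfPlace</gi>
element, while a formatting program could use the same tags to select
a distinctive typeface; the document itself is identical in both
cases.</p>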
</div>

<div type="div3" xml:id="SG112">
<head>Types of Document</head> 

<p>A second key aspect of XML is its notion of a <term>document
type</term>: documents are regarded as having types, just as other
objects processed by computers do.  The type of a document is formally
defined by its constituent parts and their structure.  The definition
of a <soCalled>report</soCalled>, for example, might be that it
consists of a <soCalled>title</soCalled> and possibly an
<soCalled>author</soCalled>, followed by an
<soCalled>abstract</soCalled> and a sequence of one or more
<soCalled>paragraphs</soCalled>.  Anything lacking a title, according
to this formal definition, would not formally be a report, and neither
would a sequence of paragraphs followed by an abstract, whatever other
report-like characteristics these might have for the human reader.</p>
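
<p>A document satisfying this informal definition might look like the
following sketch, in which the element names simply echo the
definition above and the content is invented for illustration:
<eg xml:space="preserve"><![CDATA[<report>
  <title>A Report on Something</title>
  <author>A. N. Author</author>
  <abstract>A one-sentence summary of the report.</abstract>
  <paragraph>The first paragraph of the report.</paragraph>
  <paragraph>The second paragraph of the report.</paragraph>
</report>]]></eg>
Moving the <gi scheme="imaginary">abstract</gi> after the paragraphs, or
deleting the <gi scheme="imaginary">title</gi>, would produce something
that is no longer a report according to the definition.</p>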

<p>If documents are of known types, a special-purpose program (called
a <term>parser</term>), once provided with an unambiguous definition
of a document type, can check that any document claiming to be of 
that type does in fact conform to the specification. A parser
can check that all elements specified for a particular
document type are present and no others, that they are combined in appropriate ways,
correctly ordered, and so forth.  More significantly, different
documents of the same type can be processed in a uniform way. Programs
can be written which take advantage of the knowledge encapsulated in
the document type information, and which can thus behave in a
more <soCalled>intelligent</soCalled> fashion.</p>
</div>

<div type="div3" xml:id="SG113"><head>Data Independence</head>

<p>A basic design goal of XML is to ensure that documents encoded
according to its provisions can move from one hardware and software
environment to another without loss of information. The two features
discussed so far both address this requirement at an abstract level;
the third feature addresses it at the level of the strings of data
characters that make up a document.  All XML documents,
whatever languages or writing systems they employ, use the same
underlying character encoding (that is, the same method of
representing as binary data those graphic forms making up a particular
writing system).<note place="bottom">See <title>Extensible Markup
Language (XML) 1.0</title>, available from <ptr target="http://www.w3.org/TR/REC-xml/"/>, Section 2.2
Characters.</note> This encoding is defined by an international
standard,<note place="bottom">ISO/IEC 10646-1993 <title>Information
Technology — Universal Multiple-Octet Coded Character Set</title>
(UCS)</note> which is implemented by a universal character set
maintained by an industry group called the Unicode Consortium, and
known as Unicode.<note place="bottom">See <ptr target="http://www.unicode.org/"/></note> Unicode 
provides a standardized way of representing any of the many thousands of
discrete symbols making up the world's writing systems, past and
present. </p>
<!--
<p>For technical and historical reasons which need not concern us, it
is sometimes necessary to translate texts encoded as Unicode into (or
out of) some other encoding system. XML uses a general purpose
<term>string substitution</term> mechanism for this purpose, inherited
from SGML (which predates the availability of Unicode). In simple
terms, this mechanism allows for the indirect representation of
arbitrary parts of a document (be they single characters, character
strings, or whole files) by means of predefined names, known as
<term>entity references</term>; see further <ptr
target="#SG-er"/>.</p>-->

<p>Most modern computing systems support Unicode directly; for
those which do not, XML provides a mechanism, known as
<term>character references</term>, for the indirect representation of
single characters by means of their character numbers; see further
<ptr target="#SG-er"/>.</p>
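
<p>As a brief illustration, each of the three lines in the following
sketch represents the same four-character word, ending in the
character é (U+00E9, LATIN SMALL LETTER E WITH ACUTE). The element name
<gi scheme="imaginary">word</gi> is invented for this example:
<eg xml:space="preserve"><![CDATA[<word>caf&#xE9;</word>  <!-- character number given in hexadecimal -->
<word>caf&#233;</word>  <!-- the same number given in decimal -->
<word>café</word>       <!-- the character entered directly -->]]></eg>
An XML processor treats all three forms as equivalent once the
document has been read.</p>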
</div>
</div>
<div type="div2" xml:id="SG12"><head>Textual Structures</head>
<p>A text is not an undifferentiated sequence of words, much less of
bytes. For different purposes, it may be divided into many different
units, of different types or sizes.  A prose text such as this one
might be divided into sections, chapters, paragraphs, and sentences.  A
verse text might be divided into cantos, stanzas, and lines.  Once
printed, sequences of prose and verse might be divided into volumes,
gatherings, and pages.</p>

<p>Structural units of this kind are most often used to identify
specific locations or refer to points within a text (<q>the third
sentence of the second paragraph in chapter ten</q>; <q>canto 10, line
1234</q>; <q>page 412</q>, etc.) but they may also be used to
subdivide a text into meaningful fragments for analytic purposes
(<q>is the average sentence length of section 2 different from that of
section 5?</q> <q>how many paragraphs separate each occurrence of the
word <mentioned>nature</mentioned>?  how many pages?</q>).
Other structural units are more clearly analytic, in that they
characterize a section of a text.  A dramatic text might regard each
speech by a different character as a unit of one kind, and stage
directions or pieces of action as units of another kind.  Such an
analysis is less useful for locating parts of the text (<q>the 93rd
speech by Horatio in Act 2</q>) than for facilitating comparisons
between the words used by one character and those of another, or those
used by the same character at different points of the play.</p>

<p>In a prose text one might similarly wish to regard as units of
different types passages in direct or indirect speech, passages
employing different stylistic registers (narrative, polemic,
commentary, argument, etc.), passages of different authorship and so
forth.  And for certain types of analysis (most notably textual
criticism) the physical appearance of one particular printed or
manuscript source may be of importance:  paradoxically, one may wish to
use descriptive markup to describe presentational features such as
typeface, line breaks, use of whitespace and so forth.</p>

<p>These textual structures overlap with one another in complex and
unpredictable ways.  Particularly when dealing with texts as
instantiated by paper technology, the reader needs to be aware of both
the physical organization of the book and the logical structure of the
work it contains.  Many great works (Sterne's <title>Tristram
Shandy</title> for example) cannot be fully appreciated without an
awareness of the interplay between narrative units (such as chapters
or paragraphs) and presentational ones (such as page divisions).  For
many types of research, the interplay among different levels
of analysis is crucial: the extent to which syntactic structure
and narrative structure mesh, or fail to mesh, for example, or the
extent to which phonological structures reflect morphology.</p></div>

<div type="div2" xml:id="SG13"><head>XML Structures</head>

<p>This section describes the simple and consistent mechanism for the
markup or identification of textual structure provided by XML.  It
also describes the methods XML provides for the expression of rules
defining how units of textual structure can meaningfully be combined
in a text.</p>

<div type="div3" xml:id="SG131"><head>Elements</head> 

<p>The technical term used in XML for a textual unit, viewed as a
structural component, is <term>element</term>.  Different types of
elements are given different names, but XML provides no way of
expressing the meaning of a particular type of element, other than its
relationship to other element types.  That is, all one can say about
an element called (say) <gi scheme="imaginary">blort</gi> is that instances of
it may (or may not) occur within elements of type <gi scheme="imaginary">farble</gi>, and
that it may (or may not) be decomposed into elements of type
<gi scheme="imaginary">blortette</gi>.  It should be stressed that XML is entirely
unconcerned with the <emph>semantics</emph> of textual elements,
because these are considered to be application dependent. It is up to
the creators of XML vocabularies (such as these Guidelines) to choose
intelligible element names and to define their
intended use in text markup.  That is the chief purpose of documents
such as the TEI Guidelines.  From the need to choose element names
indicative of function comes the technical term for the name of an
element type, which is <term>generic identifier</term>, or GI.</p>
<p>Within a marked-up text (a <term>document instance</term>), each
element must be explicitly marked or tagged in some way.  This is done
by inserting a tag at the beginning of the element (a
<term>start-tag</term>) and another at its end (an
<term>end-tag</term>). The start- and end-tag pair are used to bracket
off element occurrences within the running text, in rather the
same way as different types of parentheses or quotation marks are used
in conventional punctuation. For example, a quotation element in a
text might be tagged as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples">... Hippolyta's 
remarks <quote>This is the silliest stuff that ere I heard 
of!</quote> clearly indicate ...</egXML>

As this example shows, a start-tag takes the form <tag>quote</tag>,
where the opening angle bracket indicates the start of the start-tag,
<q>quote</q> is the generic identifier of the element that is being
delimited, and the closing angle bracket indicates the end of the start-tag.
An end-tag takes an identical form, except that the opening angle
bracket is followed by a solidus (slash) character, so that the
corresponding end-tag is <tag>/quote</tag>.<note place="bottom">Because
the opening angle bracket has this special function in an XML
document, special steps must be taken to use that character for other
purposes (for example, as the mathematical less-than operator); see
further section <ptr target="#SG-er"/>.</note> The material between
the start-tag and the end-tag (the string of words <q>This is the
silliest stuff that ere I heard of</q> in the example above) is known
as the <term>content</term> of the element. Sometimes there may be
nothing between the start-tag and the end-tag; in this case the two may
optionally be merged into a single composite tag with the
solidus at the end, like this: <tag>quote/</tag>. </p></div>

<div type="div3" xml:id="SG132"><head>Content Models: an Example</head>

<p>An element may be <term>empty</term>, that is, it may have no
content at all, or it may contain just a sequence of characters with
no other elements. Often, however, elements of one type will
be <term>embedded</term> (contained entirely) within elements of a
different type.</p>

<p>To illustrate this, we will consider a very simple structural
model. Let us assume that we wish to identify within an anthology
only poems, their headings, and the stanzas and lines of which they are
composed. In XML terms, our <term>document type</term> is the
<term>anthology</term>, and it consists of a series of
<term>poem</term>s. Each poem has embedded within it one element, a
<term>heading</term>, and several occurrences of another, a
<term>stanza</term>, each stanza having embedded within it a number of
<term>line</term> elements. Fully marked up, a text conforming to
this model might appear as follows:
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples"
       valid="false" source="#SG132-neg-1">
<anthology xmlns="">
  <poem>
    <heading>The SICK ROSE</heading>
    <stanza>
      <line>O Rose thou art sick.</line>
      <line>The invisible worm,</line>
      <line>That flies in the night</line>
      <line>In the howling storm:</line>
    </stanza>
    <stanza>
      <line>Has found out thy bed</line>
      <line>Of crimson joy:</line>
      <line>And his dark secret love</line>
      <line>Does thy life destroy.</line>
    </stanza>
  </poem>
  <!-- more poems go here -->
</anthology></egXML></p>
<p>It should be stressed that this example does <emph>not</emph> use
the names proposed for corresponding elements elsewhere
in these Guidelines: the above is thus <emph>not</emph> a valid TEI
document.<note place="bottom">The element names here have been chosen for
clarity of exposition; there is, however, a TEI element corresponding to
each, so that this example may be regarded as TEI-conformable in the
sense that this term is defined in <ptr target="#CF"/>.</note> It will, however, serve as an introduction to the basic notions of XML.
Whitespace and line breaks have been added to the example for the
sake of visual clarity only; they have no particular significance in the
XML encoding itself.  Also, the line
<eg><![CDATA[<!-- more poems go here -->]]></eg>
is an XML <term>comment</term> and is not treated as part of the text.</p>
<p>As it stands, the above example is what is known as a
<term>well-formed</term> XML document because it obeys the following simple rules:
<list rend="numbered">
<item>there is a single element
enclosing the whole document: this is known as the <term>root
element</term> (<gi scheme="imaginary">anthology</gi> in our case);</item>
<item>each element is completely contained by the root element,
or by an element that is so contained; elements do not partially
overlap one another;</item>
<item>a tag explicitly marks the start and end of each element.</item>
</list>
</p>
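<p>By way of contrast, the following sketch is <emph>not</emph>
well-formed. It assumes a hypothetical <gi scheme="imaginary">sentence</gi>
element, not part of our example, whose start-tag falls inside one
<gi scheme="imaginary">line</gi> and whose end-tag falls inside the
next, so that the two elements partially overlap:
<eg xml:space="preserve"><![CDATA[<stanza>
  <line>O Rose thou art sick. <sentence>The invisible worm,</line>
  <line>That flies in the night</sentence></line>
</stanza>]]></eg>
An XML parser will report an error on reaching the first
<tag>/line</tag>, because the <gi scheme="imaginary">sentence</gi>
element opened within that line has not yet been closed.</p>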
<p>A well-formed XML document can be processed in a number of useful
ways.  A simple indexing program could extract only the relevant text
elements in order to make a list of headings, first lines, or words
used in the poem text; a simple formatting program could insert blank
lines between stanzas, perhaps indenting the first line of each, or
inserting a stanza number.  Different parts of each poem could be
typeset in different ways.  A more ambitious analytic program could
relate the use of punctuation marks to stanzaic and metrical
divisions.<note place="bottom">Note that this simple example has not
addressed the problem of marking elements such as sentences
explicitly; the implications of this are discussed in section <ptr target="#SG152"/>.</note> Scholars wishing to see the implications of
changing the stanza or line divisions chosen by the editor of this
poem can do so simply by altering the position of the tags.  And of
course, the text as presented above can be transported from one
computer to another and processed by any program (or person) capable
of making sense of the tags embedded within it, with no need for the
sort of transformations and translations required by files
saved in one of the proprietary formats preferred by
most word-processing programs.</p>

<p>As we noted above, one of the attractions of XML is that it enables
us to make up our own names for the elements rather than requiring us
always to use names predefined by other agencies. Clearly, however, if
we wish to exchange our poems with others, or to include poems others
have marked up in our anthology, we will need to know a bit more about
the names used for the tags. The means that XML provides for this is
called a <term>namespace</term>. In our simple example, the tags just
contain a simple name. As we shall see, it is also possible to use
tags that include a <term>qualified name</term>, that is, a name with
an optional prefix identifying the set of names to which it
belongs. For example, we have defined an element <gi scheme="imaginary">line</gi> for the purpose of marking lines of
verse. Another person might, however, define an element called <gi scheme="imaginary">line</gi> for the purpose of marking typographic
lines, or drawn lines. Because of these different meanings, if we wish
to share data it will be necessary to distinguish the two <q>line</q>
components in our marked-up texts. This is achieved by including a
<term>namespace prefix</term> within the markup, for example like
this: <eg xml:space="preserve"><![CDATA[ <my:line>This is one of my lines</my:line> 
<!-- ... --> 
<yr:line>This is one of your lines</yr:line>]]></eg>

This feature is particularly important if we have different
definitions of what a <soCalled>line</soCalled> is, of course, but
there are many occasions when it is useful to distinguish groups of
tags belonging to different <soCalled>markup vocabularies</soCalled>;
we discuss this further below (<ptr target="#SGname"/>). One
particularly useful namespace prefix is predefined for XML: it is
<code>xml</code> and we will see examples of its use below.</p>
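
<p>Before a prefix can be used, it must be associated with a
namespace name. The following is a sketch only: the two namespace
names are invented for this illustration, and the declarations are
placed on the start-tag of the enclosing element:
<eg xml:space="preserve"><![CDATA[<anthology xmlns:my="http://www.example.org/ns/my"
           xmlns:yr="http://www.example.org/ns/yr">
  <my:line>This is one of my lines</my:line>
  <yr:line>This is one of your lines</yr:line>
</anthology>]]></eg>
The <code>xml</code> prefix is the exception: it may be used without
any such declaration.</p>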

<p>Namespaces allow us to represent the fact that a name belongs to a
group of names, but do not allow us to do much more by way of checking
the integrity or accuracy of our tagging. Simple well-formedness alone
is not enough for the full range of what might be useful in marking up
a document. It might well be useful if, in the process of preparing
our digital anthology, a computer system could check some basic rules
about how stanzas, lines, and headings can sensibly co-occur in a
document. It would be even more useful if the system could check that
stanzas are always tagged <tag>stanza</tag> and not occasionally
<tag>canto</tag> or <tag>Stanza</tag>.  An XML document that has been
checked against such rules and found to conform to them is technically
known as a <term>valid</term> document, and the ability to perform such
validation is one of the key advantages of using XML. To carry this
out, some way of formally
stating the criteria for successful validation is necessary: in XML
this formal statement is provided by an additional document known
as a <term>schema</term>.<note place="bottom">The older terms
<term>Document Type Declaration</term> and <term>Document Type
Definition</term>, both abbreviated as DTD, may also be
encountered. Throughout these Guidelines we use the term
<term>schema</term> for any kind of formal document grammar.</note>
</p>
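<p>The following sketch shows the kind of document that validation is
intended to catch. It is well-formed, but the tagging of its stanzas
is inconsistent in exactly the way just described:
<eg xml:space="preserve"><![CDATA[<anthology>
  <poem>
    <heading>The SICK ROSE</heading>
    <Stanza>
      <line>O Rose thou art sick.</line>
    </Stanza>
    <canto>
      <line>Has found out thy bed</line>
    </canto>
  </poem>
</anthology>]]></eg>
A check for well-formedness alone would accept this document; a
schema permitting only <gi scheme="imaginary">heading</gi> and
<gi scheme="imaginary">stanza</gi> elements within a
<gi scheme="imaginary">poem</gi> would reject both
<tag>Stanza</tag> and <tag>canto</tag>.</p>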
</div>

<div type="div3" xml:id="SG14"><head>Validating a Document's Structure</head>

<p>The design of a schema may be as lax or as restrictive as the
occasion warrants.  A balance must be struck between the convenience
of following simple rules and the complexity of handling real
texts. This is particularly the case when the rules being defined
relate to texts that already exist: the designer may have only the
haziest of notions as to an ancient text's original purpose or meaning
and hence find it very difficult to specify consistent rules about its
structure.  On the other hand, where a new text is being prepared to
an exact specification, for entry into a textual database
of some kind for example, the more precisely stated the rules, the better they can
be enforced.  Even in the case where an existing text is being marked
up, it may be beneficial to define a restrictive set of rules relating
to one particular view or hypothesis about the text—if only as a
means of testing the usefulness of that view or hypothesis.  A schema
designed for use by a small project or team is likely to take a
different position on such issues than one intended for use by a large
and possibly fragmented community. It is
important to remember that every schema results
from an interpretation of a text.  There is no single schema 
encompassing the absolute truth about any text, although it may
be convenient to privilege some schemas above others for particular
types of analysis.</p>

<p>XML is widely used in environments where uniformity of document
structure is a major desideratum.  In the production of technical
documentation, for example, it is of major importance that sections
and subsections should be properly nested, that cross-references
should be properly resolved and so forth.  In such situations,
documents are seen as raw material to match against predefined sets
of rules.  As discussed above, however, the use of simple rules can
also greatly simplify the task of tagging accurately elements of less
rigidly constrained texts.  By making these rules explicit, the
scholar reduces his or her own burdens in marking up and verifying the
electronic text, while also being forced to make explicit an
interpretation of the structure and significant particularities of the
text being encoded.</p>
</div>

<div type="div3" xml:id="SG141bis"><head>An Example Schema</head>
<p>A schema can be expressed in a number of different ways;
frequently-encountered methods include the Document Type Definition
(DTD) language which XML inherited from SGML; the XML Schema language
(<ptr target="http://www.w3.org/XML/Schema"/>) defined by the W3C; and
the RELAX NG language (<ptr target="http://relaxng.org/"/>) originally
developed within an OASIS Technical Committee and now an
ISO standard<note place="bottom">ISO/IEC FDIS 19757-2 Document
Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based
validation -- RELAX NG</note>.  In this chapter, and throughout these
Guidelines, we give examples using the <soCalled>compact
syntax</soCalled> of RELAX NG, but the specifications within these
Guidelines are expressed in a way that is largely independent of the
specific language in which a schema generated from them is
expressed.<note place="bottom">See further <ptr target="#TD"/> and <ptr target="#IM"/>. In practice, the only part of a TEI element
specification not expressed using TEI-defined syntax is the content
model for an element, which is expressed using the RELAX NG schema
language for reasons of processing convenience. RELAX NG uses its own
XML vocabulary to define content models, which is adopted by the TEI
for the same purpose. </note> Although we will use the RELAX NG compact syntax
for illustration in what follows, the reader should bear in mind that analogous
concepts are expressed differently in other schema languages. </p>
<p>The following schema might be used to validate our example poem:
<eg xml:space="preserve"><![CDATA[
anthology_p = element anthology { poem_p+ }
poem_p = element poem { heading_p?, stanza_p+ }
stanza_p = element stanza { line_p+ }
heading_p = element heading { text }
line_p = element line { text }
start = anthology_p
]]></eg>
</p>

<p>Note that this is not the only way in which a RELAX NG schema might
be written;<note place="bottom">For a good tutorial introduction to RELAX
NG, see <ptr target="#SG-BIBL-1"/>.</note> we have adopted this idiom, however, because it
matches that used throughout the rest of the Guidelines.  </p>
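
<p>For comparison, the same schema could be written in the XML syntax
of RELAX NG rather than in the compact syntax. The following is
offered as an illustrative sketch, and is not the form used elsewhere
in these Guidelines:
<eg xml:space="preserve"><![CDATA[<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="anthology_p"/>
  </start>
  <define name="anthology_p">
    <element name="anthology">
      <oneOrMore><ref name="poem_p"/></oneOrMore>
    </element>
  </define>
  <define name="poem_p">
    <element name="poem">
      <optional><ref name="heading_p"/></optional>
      <oneOrMore><ref name="stanza_p"/></oneOrMore>
    </element>
  </define>
  <define name="stanza_p">
    <element name="stanza">
      <oneOrMore><ref name="line_p"/></oneOrMore>
    </element>
  </define>
  <define name="heading_p">
    <element name="heading"><text/></element>
  </define>
  <define name="line_p">
    <element name="line"><text/></element>
  </define>
</grammar>]]></eg>
Here the <code>optional</code> and <code>oneOrMore</code> elements
correspond to the <code>?</code> and <code>+</code> of the compact
syntax, and each <code>define</code> element corresponds to one named
pattern.</p>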

<p>A RELAX NG schema expresses rules about the possible structure of a
document in terms of <term>patterns</term>; that is, it defines a
number of named patterns, each of which acts as a kind of template
against which an input document can be matched. The meaning of a
pattern is expressed in a schema by reference to other patterns, or to
a small number of built-in fundamental concepts, as we shall see. In
the example above, the word to the left of the equals sign is the
pattern's name, and the material following it declares a meaning for
the pattern. Patterns may also be of particular types; the ones that
interest us here are called <term>element patterns</term> and
<term>attribute patterns</term>. In this example we see definitions
for five element patterns. Note that we have used similar names for
the pattern and the element which the pattern describes: so, for
example, the line <code>anthology_p = element anthology { poem_p+ }</code> defines an element pattern called <code>anthology_p</code>,
the value of which defines an element called
<code>anthology</code>. These naming conventions are 
arbitrary; we could use the same name for the pattern as
for the element, since the two are syntactically quite distinct. The
name, or <term>generic identifier</term>, of the element follows the
word <q>element</q>, and the <term>content model</term> for the
element is given within the curly braces following that.  Each of
these parts is discussed further below.
</p>
<p>The last line of the schema above tells a RELAX NG validator which
element (or elements) in a document can be used as the root element:
in our case only <gi scheme="imaginary">anthology</gi>. This enables
the validator to detect whether a particular document is well-formed
but incomplete; it also simplifies the processing task by providing an
<q>entry point</q>.</p>
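
<p>As an illustration, the following document is well-formed, and
every element in it is declared by the schema above, but a validator
using that schema should reject it, because its root element is
<gi scheme="imaginary">poem</gi> rather than
<gi scheme="imaginary">anthology</gi>:
<eg xml:space="preserve"><![CDATA[<poem>
  <heading>The SICK ROSE</heading>
  <stanza>
    <line>O Rose thou art sick.</line>
  </stanza>
</poem>]]></eg>
If single poems were also to be accepted as complete documents, the
last line of the schema could instead read
<code>start = anthology_p | poem_p</code>; this alternative is our own
illustration and is not assumed in what follows.</p>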

<div type="div4" xml:id="SG141x"><head>Generic Identifier</head>
<p>Following the word <q>element</q> each pattern declaration gives
the generic identifier (often abbreviated to GI) of the element being
defined, for example <mentioned>poem</mentioned>,
<mentioned>heading</mentioned>, etc.  A GI may contain letters,
digits, hyphens, underscore characters, or full stops, but must begin
with a letter or an underscore.<note place="bottom">In XML, a single colon may also
appear in a GI, where it has a special significance related to the use
of <term>namespaces</term>, as further discussed in section <ptr target="#SGname"/>. The characters defined by Unicode as
<term>combining characters</term> and as <term>extenders</term> are
also permitted, as are logograms such as Chinese characters.</note> Uppercase and lowercase letters are quite
distinct: an element with the GI <gi scheme="imaginary">foo</gi> is
<emph>not</emph> the same as an element with the GI <gi scheme="imaginary">Foo</gi>; the root element of a TEI-conformant
document is <gi>TEI</gi>, <emph>not</emph> <gi scheme="imaginary">tei</gi>.</p></div>
<div type="div4" xml:id="SG143"><head>Content Model</head>
<p>The second part of each declaration, enclosed in curly braces, is
called the <term>content model</term> of the element being defined, because it
specifies what may legitimately be contained within it.  In RELAX NG,
the content model is defined in terms of other patterns, either by
embedding them, or (as in our examples above) by naming or referring
to them.  The RELAX NG compact syntax also uses a small number of
reserved words to identify other possible contents for an element, of which by
far the most commonly encountered is <code>text</code>, as in this
example: it means that the element being defined may contain any valid
character data, but no elements.  If an XML document is thought of as a structure
like a family tree, with a single ancestor at the top (in our case,
this would be <gi scheme="imaginary">anthology</gi>), then almost always, following the
branches of the tree downwards (for example, from <gi scheme="imaginary">anthology</gi>
to <gi scheme="imaginary">poem</gi> to <gi scheme="imaginary">stanza</gi> to <gi scheme="imaginary">line</gi> and
<gi scheme="imaginary">heading</gi>) will lead eventually to <code>text</code>. In our example,
<gi scheme="imaginary">heading</gi> and <gi scheme="imaginary">line</gi> are so defined, since their content
models say <code>text</code> only and name no embedded elements.</p></div>

<div type="div4" xml:id="SG144"><head>Occurrence Indicators</head>

<p>The declaration for <gi scheme="imaginary">stanza</gi> in the example above states
that a stanza consists of one or more lines.  It uses an
<term>occurrence indicator</term> (the plus sign) to indicate how many
times something matching the pattern <code>line_p</code> may be
repeated.  There are three occurrence indicators: the plus sign, the
question mark, and the asterisk or star.  The plus sign means that
the pattern must match at least once and may match repeatedly; the
question mark means that it may match at most once but is not
mandatory; the star means that the pattern concerned is not
mandatory, and may match any number of times. Thus, if the content model for
<gi scheme="imaginary">stanza</gi> were <code>{line_p*}</code>, stanzas with no lines
would be possible as well as those with one or more lines.  If it
were <code>{line_p?}</code>, again empty stanzas would be
countenanced, but no stanza could have more than a single line.  The
declaration for <gi scheme="imaginary">poem</gi> in the example above thus states that a
<gi scheme="imaginary">poem</gi> cannot have more than one heading, but may have none,
and that it must have at least one <gi scheme="imaginary">stanza</gi> and may have
several.  </p></div>

<div type="div4" xml:id="SG145"><head>Connectors</head>
<p>The content model <code>{heading_p?, stanza_p+}</code> contains more than
one component, and thus needs additionally to specify the order in
which these patterns (<code>heading_p</code> and <code>stanza_p</code>) may appear.
This ordering is determined by the <term>connector</term> (the comma)
used between its components.  The
comma connector indicates that the patterns concerned must appear in
the sequence given. Another commonly encountered connector is the
vertical bar, representing alternation. If the comma in this example were replaced by a vertical
bar, then a <gi scheme="imaginary">poem</gi> would consist of either a heading or just
stanzas—but not both!</p></div>
<div type="div4" xml:id="SG146"><head>Groups</head>
<p>In our example so far, the components of each content model have
been either single patterns or <code>text</code>.  It is quite
permissible, however, to define content models in which the components
are lists of patterns, combined by connectors.  Such lists may also be
modified by occurrence indicators and themselves combined by
connectors.  To demonstrate these facilities, let us expand our
example to include non-stanzaic types of verse.  For the sake of
demonstration, we will categorize poems as one of the following:
<term>stanzaic</term>, <term>couplets</term>, or <term>blank</term>
(or <term>stichic</term>).  A blank-verse poem consists simply of
lines (we ignore the possibility of verse paragraphs for the
moment),<note place="bottom">It will not have escaped the astute reader
that the fact that verse paragraphs need not start on a line boundary
seriously complicates the issue; see further section <ptr target="#SG152"/>.</note> so no additional elements need be defined
for it.  A couplet is defined as a <gi scheme="imaginary">firstLine</gi> followed by a
<gi scheme="imaginary">secondLine</gi>.  <eg><![CDATA[
couplet_p = element couplet {firstLine_p, secondLine_p}]]></eg>
<!-- Can't believe it's a good idea to use as an example -->
	<!-- something we would never do ourselves.              -->
	<!-- Why not define SONNET, OCTET, QUATRAIN, SESTET?     --></p>
<p>The patterns <code>firstLine_p</code> and <code>secondLine_p</code> define elements
<gi scheme="imaginary">firstLine</gi> and <gi scheme="imaginary">secondLine</gi> (which are distinguished to
enable studies of rhyme scheme, for example<note place="bottom">This is
however a rather artificial example; XPath, for example, provides ways of distinguishing
elements in an XML structure by their position without the need to
give them distinct names. </note> ); these will have exactly
the same content model as the existing <gi scheme="imaginary">line</gi> element. We will
therefore add the following two lines to our example schema: <eg><![CDATA[firstLine_p = element firstLine {text} 
secondLine_p = element secondLine {text}]]></eg>
Next, we can change the declaration for the <gi scheme="imaginary">poem</gi> element to
include all three possibilities: <eg xml:space="preserve"><![CDATA[poem_p = element poem
{ heading_p?, (stanza_p+ | couplet_p+ | line_p+) }]]></eg> That is, a poem
consists of an optional heading, followed by one or several stanzas,
or one or several couplets, or one or several lines.  Note the
difference between this declaration and the following: <eg xml:space="preserve"><![CDATA[poem_p = element poem
{heading_p?, (stanza_p | couplet_p | line_p)+ }]]></eg> The second version,
by applying the occurrence indicator to the group rather than to each
element within it, would allow a single poem to contain a mixture
of stanzas, couplets, and lines.</p>
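<p>To make the difference concrete, consider the following fragment,
offered purely as an illustration with its textual content elided. It
would be accepted by the second declaration but rejected by the first,
because it mixes a stanza with couplets inside a single poem:
<eg xml:space="preserve"><![CDATA[<poem>
  <heading> ... </heading>
  <stanza> ... </stanza>
  <couplet> ... </couplet>
  <couplet> ... </couplet>
</poem>]]></eg>
Under the first declaration, once a poem has begun with a stanza, only
further stanzas may follow.</p>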
<p>A group of this kind can contain <code>text</code> as well as named
elements: this combination, known as <term>mixed content</term>,
allows for elements in which the sub-components appear with
intervening stretches of character data. For example, if we wished to
mark place names wherever they appear inside our verse lines, then,
assuming we have also added a pattern for the
<gi scheme="imaginary">name</gi> element, we could change the definition for
<gi scheme="imaginary">line</gi> to <eg xml:space="preserve"><![CDATA[line_p = element
line { (text | name_p )* }]]></eg>
</p>
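<p>As a sketch of what this involves, the additional pattern assumed
above might be declared as follows (the name <code>name_p</code> and its
content model are our own assumptions):
<eg><![CDATA[name_p = element name {text}]]></eg>
The revised definition of <gi scheme="imaginary">line</gi> would then
accept an invented line such as the following, in which names and
stretches of character data alternate freely:
<eg><![CDATA[<line>The road from <name>Oxford</name> to <name>Bath</name> was long</line>]]></eg>
</p>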

<p>Some XML schema languages place no constraints on the way that
mixed content models may be defined, but in the XML DTD language, when
<code>text</code> appears with other elements in a content model, it
must always appear as the first option in an alternation; it may
appear once only, and in the outermost model group; and if the group
containing it is repeated, the star operator must be used. Although
these constraints do not apply to (for example) schemas expressed in
the RELAX NG language, all TEI content models currently obey them.</p>
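<p>For comparison, here is how the mixed content model just given for
<gi scheme="imaginary">line</gi> might be written in the XML DTD
language, where character data is represented by the keyword
<code>#PCDATA</code>. This declaration is shown only to illustrate the
constraints listed above: the keyword comes first in the alternation,
appears once, and the group is repeated with the star:
<eg><![CDATA[<!ELEMENT line (#PCDATA | name)*>]]></eg>
</p>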

<p>Quite complex models can easily be built up in this way, to match
the structural complexity of many types of text.  As a further
example, consider the case of stanzaic verse in which a refrain or
chorus appears.  Like a stanza, a refrain consists of repetitions of
the line element.  A refrain may appear at the start of the poem, as
an optional addition following any stanza, or both.  This could be
expressed by a pattern such as the following:
<eg xml:space="preserve"><![CDATA[refrain_p = element refrain {line_p+}
poem_p = element poem   {heading_p?, ( line_p+ | (refrain_p?, (stanza_p,
refrain_p?)+ )) }]]></eg>

That is, a poem consists of an optional heading, followed by either a
sequence of lines or an unnamed group, which starts with an optional
refrain and is followed by one or more occurrences of another group, each
member of which is composed of a stanza followed by an optional
refrain.  A sequence such as <mentioned>refrain - stanza - stanza -
refrain</mentioned> follows this pattern, as does the sequence
<mentioned>stanza - refrain - stanza - refrain</mentioned>.  The
sequence <mentioned>refrain - refrain - stanza - stanza</mentioned>
does not, however, and neither does the sequence <mentioned>stanza -
refrain - refrain - stanza</mentioned>. Among other conditions made
explicit by this content model are the requirements that at least one
stanza must appear in a poem, if it is not composed simply of lines,
and that if there is both a heading and a stanza they must appear in
that order.</p>
<p>Note that the apparent complexity of this model derives from the
constraints expressed informally above. A simpler model, such as
<eg xml:space="preserve"><![CDATA[poem_p =
element poem {heading_p?, (line_p | refrain_p | stanza_p)+ }]]></eg> would not enforce any of them, and would therefore permit
such anomalies as a poem consisting only of refrains, or an arbitrary mixture of
lines and refrains.</p>

<!-- not yet explained: 
  start=element
  list {}
  other things?
-->

</div>
</div>
</div>
<div type="div2" xml:id="SG152"><head>Complicating the Issue</head>
<p>In the simple cases described so far, we have assumed that one
can identify the immediate constituents of every element in a
textual structure.  A poem consists of stanzas, and an anthology
consists of poems.  Stanzas do not float around unattached to poems or
combined into some other unrelated element; a poem cannot contain an
anthology.  All the elements of a given document type may be arranged
into a hierarchic structure like a family tree, with a single
ancestor at one end and many children (mostly the elements containing
simple text) at the other.  For example, we could represent an anthology
containing two poems, the first of which contains two four-line
stanzas and the second a single stanza,  by a tree structure like the following figure:</p>
<figure><graphic width="80%" url="Images/xmlpic.png"/></figure>
<!-- graphic needs redoing, took out scaling looks much better(JC)  -->
<p>This graphic representation of the structure of an XML document is
close to the abstract model implicit in most XML processing
systems. Most such systems now use a standardized way of accessing
parts of an XML document called <term>XPath</term>.<note place="bottom">The official specification is at <ptr target="#XPATH"/>;  many
introductory tutorials are available in the XML references cited above
and elsewhere on the Web: good beginners' tutorials include <ptr target="http://www.w3schools.com/xsl/xpath_intro.asp"/> and <ptr target="http://www.zvon.org/xxl/XPathTutorial/"/>, the latter being
available in several languages.</note> XPath gives us a
non-graphical way of referring to any part of an XML document: for
example, we might refer to the last line of Blake's poem as
<code>/anthology/poem[1]/stanza[2]/line[4]</code>. The square brackets
here indicate a numerical selection: we are talking about the fourth
line in the second stanza of the first poem in the anthology. If we
left out all the square-bracketed selections, the corresponding XPath
expression would refer to all lines contained by stanzas contained by
poems contained by anthologies. An XPath expression can refer to any
collection of elements: for example, the expression
<code>/anthology/poem</code> refers to all poems in an anthology and
the expression <code>/anthology/poem/heading</code> refers to all their headings. </p>
<p>The solidus within an XPath expression behaves in much the same way
as the solidus or backslash in a filename specification: it
indicates that the item to the left directly contains the item to the
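right of it. In XPath it is also possible to indicate that any number
of other items may intervene by repeating the solidus. For example,
the XPath expression <code>/anthology/poem//line</code> will refer
to every line of each poem in the anthology, irrespective of
whether it is in a stanza. (Adding the selection <code>[1]</code> to
this expression would pick out the first line within each containing
element, such as each stanza, rather than the first line of each poem.)</p>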
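<p>By way of summary, the following expressions illustrate these
constructs as applied to the anthology structure shown in the figure.
The explanatory remarks use the comment syntax of XPath 2.0, which an
XPath 1.0 processor would not accept, and the last expression uses the
<code>descendant</code> axis, which is not otherwise discussed in this
chapter:
<eg xml:space="preserve"><![CDATA[/anthology/poem                      (: every poem in the anthology :)
/anthology/poem[2]/stanza/line       (: every line in any stanza of the second poem :)
/anthology/poem//line                (: every line at any depth within any poem :)
/anthology/poem/descendant::line[1]  (: the first line of each poem, at whatever depth :)]]></eg>
</p>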

<p>Clearly, there are many such trees that might be drawn to describe
the structure of this or other anthologies.  Some of them might be
representable as further subdivisions of this tree: for example, we
might subdivide the lines into individual words, since in our simple
example no word crosses a line boundary. Surprisingly perhaps, this
grossly simplified view of what text is (memorably termed an
<term>ordered hierarchy of content objects</term> (OHCO) view of text
by Renear <foreign>et al.</foreign><note place="bottom">See
<ptr target="#SG-BIBL-2"/>.</note>) turns out to
be very effective for a large number of purposes. It is not, however,
adequate for the full complexity of real textual structures, for which
more complex mechanisms need to be employed. There are many other
trees that might be drawn which do <emph>not</emph> fit within the
anthology model which we have presented so far.  We might, for
example, be interested in syntactic structures or other linguistic
constructs, which rarely respect the formal boundaries of verse.  Or,
to take a simpler example, we might want to represent the pagination
of different editions of the same text.</p>

<p>In the OHCO model of text, representation of cases where different
elements overlap so that several different trees may be identified in
the same document is generally problematic. All the elements marked
up in a document, no matter what namespace they belong to, must fit
within a single hierarchy. To represent overlapping structures,
therefore, a single hierarchy must be chosen, and the points at which
other hierarchies intersect with it marked. For example,
we might choose the verse structure as our primary hierarchy, and then
mark the pagination by means of empty elements inserted at the boundary
points between one page and the next. Or we could represent alternative
hierarchies by means of the pointing and linking mechanisms described
in chapter <ptr target="#SA"/> of the Guidelines. These mechanisms all
depend on the use of <term>attributes</term>, which may be used both to
identify particular elements within a document and to point to, link,
or align them into arbitrary structures.</p>
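<p>The first of these approaches might be sketched as follows, using a
hypothetical empty element <gi scheme="imaginary">pageBreak</gi> with a
hypothetical <att scheme="imaginary">n</att> attribute (both chosen
purely for illustration) to record that a new page begins in the middle
of a stanza:
<eg xml:space="preserve"><![CDATA[<stanza>
  <line> ... </line>
  <line> ... </line>
  <pageBreak n="2"/>
  <line> ... </line>
  <line> ... </line>
</stanza>]]></eg>
The stanza remains a single unbroken element in the verse hierarchy;
the page boundary is recorded as a point within it rather than as a
container. The content model for <gi scheme="imaginary">stanza</gi>
would of course need to be extended to permit the new element.</p>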


</div>
<div type="div2" xml:id="SG16"><head>Attributes</head>
<p>In the XML context, the word <mentioned>attribute</mentioned>, like
some other words, has a specific technical sense.  It is used to
describe information that is in some sense descriptive of a specific
element occurrence but not regarded as part of its content. For
example, you might wish to add a <att scheme="imaginary">status</att> attribute to
occurrences of some elements in a document to indicate their degree of
reliability, or to add an <att scheme="imaginary">identifier</att> attribute so that you
could refer to particular element occurrences from elsewhere within a
document.  Attributes are useful in precisely such circumstances.</p>
<p>Although different elements may have attributes with the same name
(for example, in the TEI scheme, every element is defined as having an
attribute named <att>n</att>), they are always regarded as
different, and may have different values assigned to them.  If an
element has been defined as having attributes, the attribute values
are supplied in the document instance as <term>attribute-value
pairs</term> inside the start-tag for the element occurrence. An
end-tag cannot contain an attribute-value specification, since it
would be redundant.</p>
<p>The order in which attribute-value pairs are supplied inside a tag
has no significance; they must, however, be separated by at least one
whitespace (blank, newline, or tab) character. The value part
must always be given inside matching quotation marks, either single or
double<note place="bottom">In the unlikely event that both kinds of quotation marks are needed within the
quoted string, either or both can also be presented in escaped form, using the
predefined character entities <code>&amp;apos;</code> or <code>&amp;quot;</code>.</note>.</p>
<p>For example: 
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <poem  xmlns="" xml:id="Poem1" status="draft"> ... </poem>
</egXML>
Here attribute values are being specified for two attributes
previously declared for the <gi scheme="imaginary">poem</gi> element: <att>xml:id</att>
and <att scheme="imaginary">status</att>. For the instance of a <gi scheme="imaginary">poem</gi> in this
example, represented here by an ellipsis, the <att>xml:id</att>
attribute has the value <val>Poem1</val> and the <att scheme="imaginary">status</att>
attribute has the value <val>draft</val>. An XML processor can use the
values of the attributes in any way it chooses; for example, a
<gi scheme="imaginary">poem</gi> in which the <att scheme="imaginary">status</att> attribute has the value
<val>draft</val> might be formatted differently from one in which the same
attribute has the value <val>revised</val>; another processor might use the
same attribute to determine whether or not poem elements are to be
processed at all.  The <att>xml:id</att> attribute is a slightly
special case in that, by convention, it is always used to supply a
unique value to identify a particular element occurrence, which may be
used for cross-reference purposes, as discussed further below (<ptr target="#SG-id"/>).</p>

<div type="div3" xml:id="SG-att"><head>Declaring Attributes</head>

<p>Attributes are declared in a schema in the same way as elements.
As well as specifying an attribute's name and the element to which it is to be
attached, it is possible to specify (within limits) what kind of value
is acceptable for an attribute.<!-- and a default value.--></p>

<p>In the compact syntax of RELAX NG, an attribute is defined by means
of an attribute pattern, like the following: 

<eg><![CDATA[att.status = attribute status {"draft" | "revised" | "published"}]]></eg>

This defines a new pattern, called <code>att.status</code>, whose
value is an attribute pattern defining an attribute named
<att>status</att>. Attribute names are subject to the same
restrictions as other names in XML; they need not, however, be unique across the
whole schema, but only within the list of attributes for a
given element.</p>

<p>A pattern defining the possible values for this attribute is given
within the curly braces, in just the same way as a content model is
given for an element pattern. In this case, the attribute's value must
be one of the strings presented explicitly above. </p>

<p>The attribute pattern definition must be included or referenced
within the definition for every element to which the attribute is
attached. We therefore modify the definition for the
<code>poem_p</code> pattern given above as follows:

<eg xml:space="preserve"><![CDATA[poem_p = element poem {att.status?, heading_p?, stanza_p+}
]]></eg>

In RELAX NG, an element pattern simply includes any attribute patterns
applicable to it along with its other constituents, as shown
above. Attribute patterns can also be grouped and alternated in the
same way as element patterns, though this particular feature is not
widely used in the TEI scheme, since it is not available to the same
extent in all schema languages. Because a question mark follows the
reference to the <code>att.status</code> pattern in our example, a
document in which the <att>status</att> attribute is not specified
will still be valid; without this occurrence indicator the
<att>status</att> attribute would be required.

</p>
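<p>With this definition in force, the first two of the following poems
(whose content is elided) would be accepted, and the third rejected,
since <val>awful</val> is not among the values listed in the
<code>att.status</code> pattern:
<eg xml:space="preserve"><![CDATA[<poem status="draft"> ... </poem>
<poem> ... </poem>
<poem status="awful"> ... </poem>]]></eg>
</p>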

<p>Instead of supplying a list of explicit values, an attribute
pattern can specify that the attribute must have a value of a
particular type, for example a text string, a numeric value, a
normalized date, etc. This is accomplished by supplying a pattern
that refers to a <term>datatype</term>. In the example above,
because a list of acceptable values is predefined, a validator can check
that every <gi scheme="imaginary">poem</gi> carrying a <att>status</att>
attribute has one of <val>draft</val>, <val>revised</val>,
or <val>published</val> as its value. By contrast, with a definition
such as <eg xml:space="preserve"><![CDATA[att.status =
attribute status {text}]]></eg> a validator would accept any
string of characters (<code>status="awful"</code>,
<code>status="awe-ful"</code>, or <code>status="12345678"</code>) as
valid for this attribute.  Sometimes, of course, the set of possible
values cannot be predefined.  Where it can, as in this case, it is
generally better to do so.</p>

<p>Schema languages vary widely in the extent to which they support
validation of attribute values. Some languages predefine a small set
of possibilities. Others allow the schema designer to use values from
a predefined <soCalled>library</soCalled> of possible datatypes, or to
add their own definitions, possibly of great complexity. A
<soCalled>datatype</soCalled> might be something fairly general (any
positive integer), something very specific or idiosyncratic (any
four-character string ending with "T"), or somewhere between the
two. In the RELAX NG schemas used by the TEI, general patterns have
been defined for about half a dozen datatypes (using the W3C Schema
<ref target="#XSD2">Datatype Library</ref>,
and discussed further in <ptr target="#DTYPES"/>). In addition to the two
possibilities already mentioned—plain text or an explicit list of
possible strings—other datatypes likely to be encountered include
the following:
<list type="gloss">
<label>boolean</label>
<item>values must be either true or false</item>
<label>numeric</label>
<item>values must represent a numeric quantity of some
kind</item>
<label>date</label>
<item>values must represent a possible date and time in some calendar</item>
</list></p>
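<p>In the RELAX NG compact syntax such datatypes are referenced by
name, with the prefix <code>xsd:</code> identifying the W3C Schema
datatype library. The following sketch is illustrative only: the
attribute names are hypothetical and form no part of our example
schema:
<eg xml:space="preserve"><![CDATA[att.checked = attribute checked {xsd:boolean}
att.lineCount = attribute lineCount {xsd:nonNegativeInteger}
att.composed = attribute composed {xsd:date}]]></eg>
</p>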
<p>Two further datatypes of particular usefulness in managing XML documents
are commonly known as <code>ID</code> (for identifier) and
<code>URI</code> (for Uniform Resource Identifier, loosely a pointer).
These are discussed in the next section.</p>
</div>
<div xml:id="SG-id"><head>Identifiers and Indicators</head>
<p>It is often necessary to refer to an occurrence of one textual
element from within another, an obvious example being phrases such as
<q>see note 6</q> or <q>as discussed in chapter 5</q>.  When a text is
being produced the actual numbers associated with the notes or
chapters may not be certain.  If we are using descriptive markup, such
things as page or chapter numbers, being entirely matters of
presentation, will not in any case be present in the marked-up text:
they will be assigned by whatever processor is operating on the text
(and may indeed differ in different applications).  XML therefore
predefines an attribute that may be used to provide any element
occurrence with a special identifier, a kind of label, which may be
used to refer to it from anywhere else: since it is defined in the XML
namespace, the name of this attribute is <att>xml:id</att> and it is
used throughout the TEI schema. Because it is intended to act as an
identifier, its values must be unique within a given document. The
cross-reference itself will be supplied by an element bearing an
attribute of a specific kind, which must also be declared in the
schema. </p>
<p>Suppose, for example, we wish to include a reference within
the notes on one poem that refers to another poem.  We will
first need to provide some way of attaching a label to each
poem:  this is easily done using the <att>xml:id</att> attribute.
Note that not every poem need carry an <att>xml:id</att> attribute and
the parser may safely ignore the lack of one in those that do not.
Only poems to which we intend to refer need use this attribute; for
each such poem we should now include in its start-tag some unique
identifier, for example:
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples" valid="false">
<poem xmlns="" xml:id="Rose"> ... </poem>

<poem xmlns="" xml:id="P40"> ... </poem>

<poem xmlns="" > ... </poem>
</egXML>
</p>
<p>Next we need to define a new element for the cross-reference
itself.  This will not have any content—it is only a
pointer—but it has an attribute, the value of which will be
the identifier of the element pointed at.  This is achieved by
the following definition:
<eg><![CDATA[poemRef_p = element poemRef {attribute target {xsd:anyURI}, empty}]]></eg>
</p>

<p>The <gi scheme="imaginary">poemRef</gi> element has no content, but a single attribute
called <att>target</att>.  The value of this attribute must be a
pointer or web reference of type <code>anyURI</code>;<note place="bottom">The word <q>anyURI</q> is a predefined name, used in
schema languages to mean that any <term>Uniform Resource
Identifier</term> (URI) may be supplied here. The accepted syntax for
URIs is an Internet Standard, defined in <ptr target="http://tools.ietf.org/html/rfc3986"/>. <code>anyURI</code>
is one of the <term>datatypes</term> defined by the W3C
Schema datatype library; in the RELAX NG compact syntax its name is
written with the prefix <code>xsd:</code>, which identifies that library.</note>
furthermore, because there is no indication of optionality on the
attribute pattern, it must be supplied on each occurrence—a
<gi scheme="imaginary">poemRef</gi> with no referent is an impossibility.</p>

<p>With these declarations in force, we can now encode a reference to
the poem whose <att>xml:id</att> attribute specifies that its
identifier is <val>Rose</val> as follows:
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples" valid="false">
Blake's poem on the sick rose 
<poemRef xmlns="" target='#Rose'/> ...</egXML>
<!--
In this example, we have used the special syntax defined by XML for
representing empty elements in which the end-tag and the start-tag are
combined into a single tag.  (XML also permits representation of empty
elements by an immediately adjacent start- and end-tag, thus <egXML
xmlns="http://www.tei-c.org/ns/Examples"><![CDATA[<poemRef
target="#Rose"></poemRef>]]></egXML>)
-->
</p>
<p>A processor may take any number of actions when it encounters a
link encoded in this way: a formatter might construct an exact page
and line reference for the location of the poem in the current
document and insert it, or just quote the poem's title or first lines.
A hypertext style processor might use this element as a signal to
activate a link to the poem being referred to, for example by
displaying it in a new window.  Note, however, that the purpose of the
XML markup is simply to indicate that a cross-reference exists: it
does not necessarily determine what the processor is to do with it.</p>

<p>The target of a URI can be located anywhere: it need not
be part of the same document, nor even located on the same computer
system. Equally, it can be a resource of any kind, not necessarily an
XML document or document fragment. It is thus a very convenient way of
including references to non-XML data such as image files within a
document. If, for example, we wished to include an illustration
containing a reproduction of Blake's original in our anthology, the
most appropriate method would probably be to define a new element
called (for the sake of argument) <gi scheme="imaginary">graphic</gi>
with a <att>url</att> attribute of datatype URI:
<eg xml:space="preserve"><![CDATA[graphic_p = element graphic {att.url, empty}
att.url = attribute url {xsd:anyURI}]]></eg> With these additions to the schema, we
can now represent the location of the illustration within our text
like this: <egXML xml:lang="en"
xmlns="http://www.tei-c.org/ns/Examples" valid="false"><poem xmlns=""><graphic
url="http://en.wikisource.org/wiki/Image:Blake_sick_rose.jpg"/>
</poem></egXML> By providing a location from which a reproduction of
the required image can be downloaded, this encoding makes it possible
for appropriate software to display the image as well as record
its existence. </p>
<p>Attributes form part of the structure of an XML document in the
same way as elements, and can therefore be accessed using XPath. For
example, to refer to all the poems in our anthology whose
<att>status</att> attribute has the value <val>draft</val>, we might
use an XPath such as <code>/anthology/poem[@status='draft']</code>. To
find the headings of all such poems, we would use the XPath
<code>/anthology/poem[@status='draft']/heading</code>. </p>
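<p>A few further expressions of the same kind are sketched below; they
assume only the <att>status</att> attribute declared earlier in this
section, and the remarks again use the comment syntax of XPath 2.0:
<eg xml:space="preserve"><![CDATA[/anthology/poem[@status]        (: poems on which a status attribute is supplied :)
/anthology/poem[not(@status)]   (: poems on which it is not supplied :)
/anthology/poem/@status         (: the status attributes themselves :)]]></eg>
</p>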
</div>
</div>

<div type="div2" xml:id="SG-oth"><head>Other Components of an XML Document</head> 

<p>In addition to the elements and attributes so far discussed, an XML
document can contain a few other formally distinct things.  An XML
document may contain references to predefined strings of data that a
validator must resolve before attempting to validate the document's
structure; these are called <term>entity references</term>. They may
be useful as a means of providing <soCalled>boilerplate</soCalled> text or
representing character data which cannot easily be keyboarded. An XML
document may also contain arbitrary signals or flags for use when the
document is processed in a particular way by some class of processor
(a common example in document production is the need to force a
formatter to start a new page at some specific point in a document);
such flags are called <term>processing instructions</term>.  And, as
noted earlier, an XML document may also contain instances of elements
taken from some other <term>namespace</term>. We discuss each of these
three cases in the rest of this section.</p>

<div type="div3" xml:id="SG-er"><head>Character References</head>

<p>As mentioned above, all XML documents use the same internal
character encoding. Since not all computer systems currently support
this encoding directly, a special syntax is defined
that can be used to represent individual characters from the Unicode
character set in a portable way by providing their numeric value, in
decimal or hexadecimal notation.</p>

<p>For example, the character <mentioned>é</mentioned> is represented
within an XML document as the Unicode character with hexadecimal value
<num value="233">00E9</num>. If such a document is being prepared on (or
exported to) a system using a different character set in which this
character is not available, it may instead be represented by the
character reference <code>&amp;#x00E9;</code> (the <code>x</code> indicating
that what follows is a hexadecimal value) or
<code>&amp;#0233;</code> (its decimal equivalent). References of this type do not need
to be predefined, since the underlying character encoding for XML is
always the same.</p>
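<p>For instance, taking as an illustration a line consisting of the
word <mentioned>café</mentioned>, each of the following three encodings
yields exactly the same content once the document has been parsed:
<eg xml:space="preserve"><![CDATA[<line>café</line>
<line>caf&#xE9;</line>
<line>caf&#233;</line>]]></eg>
</p>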

<p>To aid legibility, however, it is also possible to use a mnemonic
name (such as <code>eacute</code>) for such character references,
provided that each such name is mapped to the required Unicode value
by means of a construct known as an <term>entity
declaration</term>.  A reference to a named character entity always
takes the form of an ampersand, followed by the name, followed by a
semicolon. For example, an XML document containing the string
<q>T&amp;C</q> might be encoded as <code>T&amp;amp;C</code>.</p>

<p>There is a small set of such character
entity references that do not have to be declared because they form
part of the definition of XML. These include the names used for
characters such as the ampersand (<code>amp</code>) and the open angle
bracket or less-than sign (<code>lt</code>), which could not easily
otherwise be included in an XML document without ambiguity. Other
predeclared entity names are those for quotation marks (<code>quot</code>
and <code>apos</code> for double and single respectively), and for
completeness the closing angle bracket or greater-than sign
(<code>gt</code>).</p>
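
<p>A brief sketch may make this concrete. Assuming, again purely for
illustration, the <gi scheme="imaginary">line</gi> element of our anthology
and some invented content, the text <q>if x &lt; y &amp; y &lt; z</q> would
be entered as follows:
<eg><![CDATA[<line>if x &lt; y &amp; y &lt; z</line>]]></eg>
An XML processor reading this line passes the less-than signs and the
ampersand to the application as ordinary characters, rather than treating
them as the start of a tag or of an entity reference.</p>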

<p>For all other named character entities, a set of entity declarations
must be provided to an XML processor before the document referring to
them can be validated. The declaration itself uses a non-XML syntax
inherited from SGML; for example, to define an entity named
<ident type="ge">eacute</ident> with the replacement value é, the declaration
could have any of the following forms: <eg><![CDATA[<!ENTITY eacute
"é">]]></eg> or, using hexadecimal notation: <eg><![CDATA[<!ENTITY
eacute "&#xe9;">]]></eg> or, using decimal notation:
<eg><![CDATA[<!ENTITY eacute "&#233;">]]></eg>
 </p>

<p>Entities of this kind are useful also for <term>string
substitution</term> purposes, where the same text needs to be repeated
uniformly throughout a text. For example, if a declaration such as
<eg><![CDATA[<!ENTITY TEI "Text Encoding Initiative">]]></eg> is included with a document,
then references such as <code>&amp;TEI;</code> may be used within it,
each of which will be expanded in the same way and replaced by the
string <q>Text Encoding Initiative</q> before the text is validated.</p>
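
<p>The following sketch shows how such declarations and references might
fit together in a complete, if very small, document. It assumes that the
declarations are supplied by means of a <code>DOCTYPE</code> statement of the
kind described in <ptr target="#SG-ass1"/>; the element names are those of our
anthology, and the text of the line is invented for the example:
<eg xml:space="preserve"><![CDATA[<?xml version="1.0"?>
<!DOCTYPE anthology [
<!ENTITY eacute "&#xE9;">
<!ENTITY TEI "Text Encoding Initiative">
]>
<anthology>
  <poem>
    <line>A caf&eacute; song for the &TEI;</line>
  </poem>
</anthology>]]></eg>
After the entity references have been expanded, the content of the
<gi scheme="imaginary">line</gi> element is the string <q>A café song for the
Text Encoding Initiative</q>.</p>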


</div>

<div type="div3" xml:id="SG-pi"><head>Processing Instructions</head>

<p>Although one of the aims of using XML is to remove any information
specific to the processing of a document from the document itself, it
is occasionally very convenient to be able to include such information—if 
only so that it can be clearly distinguished from the
structure of the document. As suggested above, one common example is
the need, when processing an XML document for printed output, to
include a suggestion that the formatting processor might use to
determine where to begin a new page of output. Page-breaking decisions
are usually best made by the formatting engine alone, but there will
always be occasions when it may be necessary to override these. An XML
processing instruction inserted into the document is one very simple
and effective way of doing this without interfering with other aspects
of the markup.</p>
<p>Here is an example XML processing instruction: 
<eg><![CDATA[<?tex \newpage ?>]]></eg> It begins with
<code>&lt;?</code> and ends with
<code>?&gt;</code>. In between are two space-separated strings: by
convention, the first is the name of some processor (<code>tex</code>
in the above example) and the second is some data intended for the use
of that processor (in this case, the instruction to start a new
page). The only constraint placed by XML on the strings is that the
first one must be a valid XML name; the other can be any arbitrary
sequence of characters, not including the closing character-sequence
<code>?&gt;</code>.
</p>
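
<p>As a sketch of how such an instruction might appear in context, the
following fragment places it between two stanzas of a poem. The element
names are those of our anthology, and the elided content is not significant:
<eg><![CDATA[<poem>
  <stanza>
    <line>...</line>
  </stanza>
  <?tex \newpage ?>
  <stanza>
    <line>...</line>
  </stanza>
</poem>]]></eg>
The processing instruction forms no part of the element structure of the
document; a processor that does not recognize the name <code>tex</code>
would normally be expected simply to ignore it.</p>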

<p>A construct which looks like a processing instruction (but is not)
is the <term>XML declaration</term> which can be supplied at the
beginning of an XML document, for example: <eg><![CDATA[<?xml
version="1.0" encoding="iso-8859-1"?>]]></eg> The XML declaration
specifies the version number of the XML Recommendation applicable to
the document it introduces (in this case, version 1.0), and optionally
also the character encoding used to represent the Unicode characters
within it. By default an XML document uses the character encoding
UTF-8 or UTF-16; in this case, the Unicode characters of the document have
been mapped to the 8-bit character set known as ISO 8859-1; any
characters present in the document but not available in the target
character set will therefore need to be represented as character
references (<ptr target="#SG-er"/>). The XML declaration is purely
documentary, but if it is wrong many XML-aware processors will be
unable to process the associated text.</p>
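
<p>The following sketch, with invented content, shows the effect on a
document declared in this way. The character <mentioned>é</mentioned> is
available in ISO 8859-1 and so may be entered directly, whereas the dash
(Unicode hexadecimal value 2014) is not, and is therefore supplied as a
character reference:
<eg><![CDATA[<?xml version="1.0" encoding="iso-8859-1"?>
<anthology>
  <poem>
    <line>Café lights &#x2014; and then the dark</line>
  </poem>
</anthology>]]></eg></p>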

</div>
<div type="div3" xml:id="SGname"><head>Namespaces</head>

<p>A valid XML document necessarily specifies the schema in which its
constituent elements are defined. However, a well-formed XML document
is not required to specify its schema (indeed, it may not even have a
schema).  It would still be useful to indicate that the element names used
in it have some defined provenance. Furthermore, it might be desirable
to include in a document elements that are defined (possibly
differently) in different schemas. A cabinet-maker's schema might well
define an element called <gi scheme="imaginary">table</gi> with very different
characteristics from those of a documentalist's.</p>
<p>The concept of <term>namespace</term> was introduced into the XML
language as a means of addressing these and related problems. If the
markup of an XML document is thought of as an expression in some
language, then a namespace may be thought of as analogous to the
lexicon of that language. Just as a document can contain words taken
from different languages, so a well-formed XML document can include
elements taken from different namespaces. A namespace resembles a
schema in that we may say that a given
set of elements <soCalled>belongs to</soCalled> a given namespace, or
is <soCalled>defined by</soCalled> a given schema. However, a schema
is a set of element definitions, whereas a namespace is really only a
property of a collection of elements: the only tangible form it takes
in an XML document is its distinctive <term>prefix</term> and the
identifying <term>name</term> associated with it.</p>

<p>Suppose for example that we wish to extend our anthology to include
a complex diagram. We might start by considering whether or
not to extend our simple schema to include XML markup for such
features as arcs, polygons, and other graphical elements. XML can be
used to represent any kind of structure, not simply text, and there
are clear advantages to having our text and our diagrams all expressed
in the same way.</p>
<p>Fortunately we do not need to invent a schema for the
representation of graphical components such as diagrams; it already
exists in the shape of the Scalable Vector Graphics (SVG) language
defined by the W3C.<note place="bottom">The W3C Recommendation is
defined at <ptr target="http://www.w3.org/Graphics/SVG/"/>.</note> SVG
is a widely used and rich XML vocabulary for representing all kinds of
two-dimensional graphics; it is also well supported by existing
software. Using an SVG-aware drawing package, we can easily draw our
diagram and save it in XML format for inclusion within our
anthology. When we do so, we need to indicate that this part of
the document contains elements taken from the SVG namespace, if only
to ensure that processing software does not confuse our <gi scheme="imaginary">line</gi>
element with the SVG <gi scheme="imaginary">line</gi>, which means something quite
different. </p>
<p>An XML document need not specify any namespace: it is then said to
use the <soCalled>null</soCalled> namespace. Alternatively, the root element of a
document may supply a default namespace, understood to apply to all elements which have no namespace prefix. This is the function of the
<att scheme="XML">xmlns</att> attribute which provides a
unique name for the default namespace, in the form of a URI:
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples" valid="false">
<anthology  xmlns="">
<!-- anthology markup elements here -->
</anthology>
</egXML>
In exactly the same way, on the root element for each part of our
document which uses the SVG language, we might introduce the SVG
namespace name:
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples" valid="false">
<anthology  xmlns="">
<!-- anthology markup elements here -->
   <svg xmlns="http://www.w3.org/2000/svg">
<!-- SVG markup elements here -->
  </svg>
<!-- more anthology markup elements here -->
</anthology>
</egXML>
Although a namespace name usually uses the URI (Uniform Resource
Identifier) syntax, it is not treated as an online address
and an XML processor regards it just as a string, providing a longer name for the
namespace.</p>
<p>The <att scheme="XML">xmlns</att> attribute can also be used to associate a short prefix
name with the namespace it defines. This is very useful if we want to
mingle elements from different namespaces within the same document,
since the prefix can be attached to any element, overriding the
implicit namespace for itself (but not its children):
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples" valid="false">
<anthology  xmlns=""
	   xmlns:svg="http://www.w3.org/2000/svg">
  <!-- anthology markup elements here -->
   <svg:svg>
      <!-- SVG markup elements here -->
   </svg:svg>
  <!-- more anthology markup elements here -->
</anthology>
</egXML>
</p>
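<p>The following sketch illustrates the point about prefixes more fully.
The URI given here for the anthology namespace is a hypothetical one,
invented for this example, as is the content; the SVG element and attribute
names are those defined by SVG itself. Because a prefix applies only to the
element that carries it, every SVG element is prefixed individually, and the
unprefixed <gi scheme="imaginary">line</gi> remains in the default namespace:
<eg xml:space="preserve"><![CDATA[<anthology xmlns="http://www.example.org/ns/anthology"
           xmlns:svg="http://www.w3.org/2000/svg">
  <svg:svg>
    <!-- these elements belong to the SVG namespace -->
    <svg:circle cx="50" cy="50" r="40"/>
    <svg:line x1="0" y1="0" x2="100" y2="100"/>
  </svg:svg>
  <!-- this element belongs to the default (anthology) namespace -->
  <line>O Rose thou art sick</line>
</anthology>]]></eg>
A namespace-aware processor thus has no difficulty in distinguishing the
line of verse from the line drawn in the diagram.</p>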
<p>There is no limit on the number of namespaces that a document can
use. Provided that each is uniquely identified, an XML processor can
identify those that are relevant, and validate them appropriately. To
extend our example further, we might decide to add a linguistic
analysis to each of the poems, using a set of elements such as <gi scheme="imaginary">aux</gi>, <gi scheme="imaginary">adj</gi>, etc.,
derived from some pre-existing XML vocabulary for linguistic analysis.
<egXML xml:lang="en" xmlns="http://www.tei-c.org/ns/Examples" valid="false">
<anthology xmlns=""
           xmlns:gram="http://www.gram.org"
	   xmlns:svg="http://www.w3.org/2000/svg">
    <!-- anthology markup elements here -->
   <svg:svg>
      <!-- SVG markup elements here -->
   </svg:svg>  
<line>
      <gram:itj>O</gram:itj>
      <gram:nom>Rose</gram:nom>
      <gram:pron>thou</gram:pron>
      <gram:aux>art</gram:aux>
      <gram:adj>sick</gram:adj>
</line>
</anthology>
</egXML>
</p>

<div type="div3" xml:id="SG-ms"><head>Marked Sections</head>
<p>We mentioned above that the syntax of XML requires the encoder to
take special action if characters with a syntactic meaning in XML
(such as the left angle bracket or ampersand) are to be used in a
document to stand for themselves, rather than to signal the start of a
tag or an entity reference respectively. The predefined entities
<code>&amp;amp;</code>, <code>&amp;lt;</code>, and
<code>&amp;gt;</code> provide one method of dealing with this problem,
if the number of occurrences of such things is small. Other methods
may be considered when the number is large, as in an XML document like
the present Guidelines, which contains hundreds of examples of XML
markup. One is to label the XML examples as belonging to a different
namespace from that of the document itself, which is the approach
taken in the present Guidelines. Another and simpler approach is
provided by one of the features inherited by XML from its parent SGML:
the <soCalled>marked section</soCalled>. </p>
<p>A marked section is a block of text within an XML document
introduced by the characters <code>&lt;![CDATA[</code> and terminated
by the characters <code>]]&gt;</code>. Between these rather strange
brackets, markup recognition is turned off, and any tags or entity
references encountered are therefore treated as if they were plain
text. For example, when we come to write the users' manual for our
anthology, we may find ourselves often producing text like the
following:
<eg xml:space="preserve"><![CDATA[
Here is an example of the use of the <gi>line</gi> element:
<![CDATA[<line>....</line>]]]]><![CDATA[>
]]></eg>
</p>
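<p>For comparison, here is a sketch of how the same passage would have to
be entered without a marked section, using the predefined entities to
escape each delimiter of the embedded example individually:
<eg xml:space="preserve"><![CDATA[
Here is an example of the use of the <gi>line</gi> element:
&lt;line&gt;....&lt;/line&gt;
]]></eg>
Both forms deliver the same characters to an application; the marked
section simply becomes more convenient as the examples grow longer.</p>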

</div>
</div>
</div>

<div type="div2" xml:id="SG18"><head>Putting It All Together</head>

<p>In this chapter we have discussed most of the components of an XML
document and its associated schema. We have described informally 
how an XML document is represented, and also introduced one
way of representing the rules a RELAX NG validator might use to
validate it. In a working system, the following issues will also need
to be addressed:
<list rend="bulleted">
<item>how does a processor determine the schema (or schemas) that
should be used to validate a given XML document instance?</item>
<item>if a document contains entity references that must be processed before the
document can be validated, where are those entities defined?</item>
<item>an XML document instance may be stored in a number of different
operating system files; how should they be assembled together?</item>
<item>how does a processor determine which stylesheets it should use
when processing an XML document, or how to interpret any processing
instructions it contains?</item>
<item>how does a processor enforce more exact validation than simple
datatypes permit (for example of element content)?</item>
</list>
</p>

<p>Different schema languages and different XML processing systems
take very different positions on all of these topics, since none of
them is explicitly addressed in the XML specification itself. Consequently,
the best answer is likely to be specific to a particular software
environment and schema language. Since this chapter is concerned with
XML considered independently of its processing environment, we
address them only in summary here.</p> 

<div type="div3" xml:id="SG-ass1"><head>Associating Entity Definitions with a Document
Instance</head>
<p>In <ptr target="#SG-er"/> we introduced the syntax used for the
definition of named character entities such as <code>eacute</code>,
which XML inherited from SGML. Different schema languages vary in the
ways they make a collection of such definitions available
to an XML processor, but fortunately there is one method that all
current schema languages support.</p>
<p>Following the XML declaration (<ptr target="#SG-pi"/>), an XML document instance may also be prefixed with a
special <code>DOCTYPE</code> statement. This declarative statement has
been inherited by XML from SGML; in its full form it provides a large
number of facilities, but we are here concerned only with the small
subset of those facilities recognized by all schema languages.</p>
<p>Here is an example DOCTYPE statement which we might consider
prefixing to the final version of our anthology:
<eg xml:space="preserve"><![CDATA[<!DOCTYPE anthology [
<!ENTITY mdash "&#x2014;">
<!ENTITY legalese "This document is available under a Creative Commons
Share and Enjoy Licence">
]>]]></eg>
Any XML processor encountering this statement will use it to add the
two named entities it defines to those already predefined for
XML. Before the document instance itself is validated, any references
to these entities will be expanded to the character string
given. Thus, wherever in the document instance the string
<code>&amp;legalese;</code> appears, it will be replaced by the
formulation above. This makes life a little easier for those
keyboarding our anthology.<note place="bottom">And, indeed, for those
responsible for deciding the licensing conditions if they change their
minds later.</note> The word <code>anthology</code> following the string
DOCTYPE in this example is, of course, the name of the root element of
the document to which this declaration is prefixed; however, only an XML
DTD processor will take note of this fact. </p></div>

<div type="div3" xml:id="SG-assoc"><head>Associating a Document Instance with Its Schema</head>

<p>In the past, different schema languages adopted entirely different attitudes to
this question, leading to a variety of different methods of associating schemas with 
document instances. However, a W3C Working Group Note, 
  <title level="m">Associating Schemas with XML documents</title>, 
(<ptr target="http://www.w3.org/TR/xml-model/"/>) now provides a 
standardized method of doing this through the use of a processing instruction:

  <eg><![CDATA[<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng"?>]]></eg>

The <att scheme="HTML">href</att> <term>pseudo-attribute</term> points to the location of the schema. This is the only mandatory
  pseudo-attribute, but others can be added to give more information about the schema:
  
  <eg xml:space="preserve"><![CDATA[<?xml-model href="burgess.rng" 
                       title="Anthony Burgess Project Schema" 
                       schematypens="http://relaxng.org/ns/structure/1.0" 
                       type="application/xml"
                     ?>]]></eg>

See the XML Model WG Note for more information on the pseudo-attributes available and 
how to use them.</p>
  
  <p>A document instance may be valid according to many
different schemas, each appropriate to a different processing task. 
All of these may be expressed in the same way:

  <eg xml:space="preserve"><![CDATA[<?xml-model href="tei_tite.xsd" type="application/xml" ?>
<?xml-model href="checkNames.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron" ?>]]></eg>

This example includes a standard schema in XML Schema format, along with a Schematron schema which might be 
used for checking the format and linking of names.</p>


<p>Any modern XML processing software tool will provide
convenient methods of validating documents which are appropriate to the particular
schema language chosen. In the interests of maximizing portability of
document instances, they should contain as little processing-specific
information as possible.</p>

</div>
<div type="div3" xml:id="SG-mult"><head>Assembling Multiple Resources into a Single Document</head>
<p>As we have already indicated, a single XML document may be made up
of several different operating system files that need to be pulled
together by a processor before the whole document can be
validated. The XML DTD language defines a special kind of entity (a
<term>system entity</term>) that can be used to embed references to
whole files into a document for this purpose, in much the same way as
the character or string entities discussed in <ptr target="#SG-er"/>. Neither RELAX NG nor W3C Schema directly supports
this mechanism, however, and we do not
discuss it further here.</p>
<p>An alternative way of achieving the same effect is to use a special
kind of pointer element to refer to the resources that need to be
assembled, in exactly the same way as we proposed for the illustration
in our anthology. The W3C Recommendation <title>XML Inclusions
</title> (XInclude)<note place="bottom"><ptr target="http://www.w3.org/TR/xinclude/"/>.</note> defines a generic mechanism
for this purpose, which is supported by an increasing number of XML
processors.  </p></div>

<div type="div3" xml:id="SG-style"><head>Stylesheet Association and Processing</head>
<p>As mentioned above, the processing of an XML document will usually
involve the use of one or more stylesheets, often but not exclusively
to provide specific details of how the document should be displayed or
rendered. In general, there is no reason to associate a document
instance with any specific stylesheet and the schema languages we have
discussed so far do not therefore make any special provision for such
association. The association is made when the stylesheet processor is
invoked, and is thus entirely application-specific.</p>
<p>However, since one very common application for XML documents is to
serve them as browsable documents over the Web, the W3C has defined a
procedure and a syntax for associating a document instance with its
stylesheet (see <ptr target="http://www.w3.org/TR/xml-stylesheet/"/>). This Recommendation
allows a document to supply a link to a default stylesheet and also to
categorize the stylesheet according to its <term>MIME type</term>,
for example to indicate whether the stylesheet is written in CSS or
XSLT, using a specialized form of processing instruction.</p>
<p>Assuming therefore that we have made a CSS-conformant stylesheet
for our anthology and stored it in a file called
<code>anthology.css</code> which is available from the same location
as the anthology itself, we could make it available over the Web simply
by adding a processing instruction like the following to the anthology:
<eg><![CDATA[<?xml-stylesheet href="anthology.css"
type="text/css"?>]]></eg></p>
<p>Multiple stylesheets can be defined for the same document, and
options are available to specify how a web browser should select
amongst them. For example, if the document also contained a directive:
<eg><![CDATA[<?xml-stylesheet href="anthology_m.css"
type="text/css" media="handheld"?>]]></eg>a different stylesheet
called <code>anthology_m.css</code> could be used when rendering the
document on a handheld device such as a mobile phone.</p>
<p>Most modern web browsers support CSS (although the extent
of their implementation varies), and some of them
support XSLT.</p>
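<p>For a browser which does support XSLT, the association might be
sketched in just the same way. The file name <code>anthology.xsl</code> is
a hypothetical one, assumed for this example; the type value
<code>text/xsl</code> is the one widely recognized by browsers in practice,
although the media type formally registered for XSLT is
<code>application/xslt+xml</code>:
<eg><![CDATA[<?xml-stylesheet href="anthology.xsl"
type="text/xsl"?>]]></eg></p>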

<div type="div3" xml:id="SG-val"><head>Content Validation</head>

<p>As we noted above, most schema languages provide some degree of
datatype validation for attribute values (<ptr target="#SG-att"/>). They vary greatly in the validation facilities
they offer for the content of elements, other than the syntactic
constraints already discussed. Thus, while we may very easily check
that our <gi scheme="imaginary">stanza</gi> elements contain only <gi scheme="imaginary">line</gi> elements, we cannot easily check that <gi scheme="imaginary">line</gi> elements contain between five and 500
correctly-spelled English words, should we wish to constrain our
poetry in such a way. Also, because attributes and elements are
treated differently, it is difficult or impossible to express
co-occurrence constraints: for example, if the <att scheme="imaginary">status</att> of a poem is <val>draft</val> we might
wish to permit elements such as <gi scheme="imaginary">editorialQuery</gi> within its content, but not
otherwise.  </p>

<p>The XML DTD language offers very little beyond syntactic checking
of element content. By contrast, a major impetus behind the design and
development of the W3C schema language was the addition of a much more
general and powerful constraint language to the existing structural
constraints of XML DTDs. In RELAX NG the opposite approach was taken,
in that all datatype validation, whether of attributes or element
content, is regarded as external to the schema language. For
attributes, as we have seen, RELAX NG makes use of the W3C Schema
Datatype Library (but permits use of others). Because RELAX NG treats
both elements and attributes as special cases of patterns, the same
datatype validation facilities are available for element content as
for attribute values; it is unlike other schema languages in
this respect. In addition, for content validation,
a different component of DSDL known as Schematron can be used. Schematron
is a pattern matching (rather than a grammar-based) language, which
allows us to test the components of a document against
templates that express constraints such as those mentioned above.</p>
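
<p>As a minimal sketch of the way RELAX NG applies the same datatypes to
element content and to attribute values, consider the following pattern,
given here in the XML syntax of RELAX NG. The element
<gi scheme="imaginary">year</gi> and its attribute
<att scheme="imaginary">n</att> are hypothetical, and form no part of our
anthology schema; the datatypes are taken from the W3C Schema Datatype
Library:
<eg xml:space="preserve"><![CDATA[<element name="year"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- the attribute value must be a positive integer -->
  <attribute name="n">
    <data type="positiveInteger"/>
  </attribute>
  <!-- the element content must be a valid year -->
  <data type="gYear"/>
</element>]]></eg></p>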

<p>Like many other XML technologies, Schematron uses XPath to identify parts
of an XML document; in addition, it provides elements that describe
assertions to be tested and conditions which must be validated, as
well as elements to report the results of the test. </p>
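
<p>The co-occurrence constraint mentioned above might, for example, be
sketched in Schematron as follows. The sketch assumes that the elements of
our anthology are in no namespace, and that the
<att scheme="imaginary">status</att> attribute is carried by the
<gi scheme="imaginary">poem</gi> element; neither assumption is essential to
the technique:
<eg xml:space="preserve"><![CDATA[<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <!-- the rule is applied to every editorialQuery element -->
    <sch:rule context="editorialQuery">
      <!-- the assertion fails unless the enclosing poem is a draft -->
      <sch:assert test="ancestor::poem/@status = 'draft'">
        An editorialQuery is permitted only within a poem
        whose status is draft.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>]]></eg>
The <att scheme="imaginary">context</att> and
<att scheme="imaginary">test</att> attributes each contain an XPath
expression; when the test is not satisfied, the text of the assertion is
reported to the user.</p>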
</div>
</div></div></div>
