<?xml version="1.0"?> 
<TEI.2>
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Building TEI DTDs and Schemas on demand</title>
            <author>Sebastian Rahtz</author>
         </titleStmt>
         <editionStmt>
            <edition>
               <date>March 2003</date>
            </edition>
         </editionStmt>
         <publicationStmt>
            <authority>OUCS</authority>
            <address>
               <addrLine>sebastian.rahtz@oucs.ox.ac.uk</addrLine>
            </address>
         </publicationStmt>
         <sourceDesc>
            <p>This is the master version of an original document.</p>
         </sourceDesc>
      </fileDesc>

      <revisionDesc>
         <change>
            <date>$Date: 2003/05/05 $</date>
            <respStmt>
               <name>$Author: rahtz $</name>
            </respStmt>
            <item>$Revision: #1 $</item>
         </change>
      </revisionDesc>
   </teiHeader>
 <text>
<front>
 <titlePage> 
   <docTitle> 
     <titlePart>Building TEI DTDs and Schemas on demand</titlePart>
   </docTitle> 
   <docAuthor>Sebastian Rahtz</docAuthor> 
    <docDate>March 2003</docDate> 
 </titlePage> 
 <div type="abstract">
<p>The Text Encoding Initiative Guidelines provide generic but detailed
recommendations for the mark-up of electronic documents, in particular
texts from the literary and linguistic domains. The TEI Guidelines,
converted to XML in 2001, are maintained in a high-level markup which
mixes rules for element combination and content models with textual
documentation.  A project to convert this to use RelaxNG internally
was described at XML Europe 2002. Because the TEI is modular and
extensible, it is accompanied by a web application which assists the
user to define a subset and/or extension of the schema and creates an
ad hoc DTD. This paper describes a new version of the program, which
will enable users to generate DTDs, RelaxNG schemas, and W3C schemas on
demand according to their specification, along with instance documentation.</p>

<p>The application, named <ident>Roma</ident> (the previous DTD-only
incarnation was called Carthage), lets the user choose which TEI
modules are needed, and allows them to include or exclude elements
individually. It also supports the modification of existing elements,
and the definition of new elements, with appropriate changes to TEI
model classes. Standard components from other namespaces (SVG and
MathML) can be included.  Most of this can be done without commitment
to which of the output formats is desired.  At the end a
<soCalled>flattened</soCalled> schema or DTD is produced, containing
only the necessary elements.</p>
</div>
</front>

<body>
<div id="intro"><head>Introduction</head> 
<p>The Text Encoding Initiative's <title>Guidelines for electronic
text encoding and interchange</title> (<ptr type="cite"
target="TEIP3"/>) provide exhaustive recommendations for the encoding
of key features in literary and linguistic textual materials.  These
recommendations, instantiated by a modular XML-based architecture in
which DTD fragments and documentation are combined according to
user-specified requirements, are very effective and are widely adopted
in digital library, language engineering, and many other projects
(<xptr url="http://www.tei-c.org/Applications/"/>).
    </p>

<p>One of the projects (<xptr
url="http://www.tei-c.org/Activities/META/"/>) of the TEI's Technical
Council is to rewrite the Guidelines so that the underlying metalanguage
is independent of SGML or XML DTD language, allowing for automatic
generation of schemas, DTDs, or any future constraint languages. The
first stage of this work resulted in a set of RelaxNG (<xptr
url="http://www.relaxng.org/"/>) grammar files automatically
generated from the Guidelines (available from <xptr
url="http://www.tei-c.org/Schemas/RelaxNG/P4X/"/>).  This work was
described at XML Europe 2002 (<ptr type="cite" target="XMLEUR2002"/>),
so we give only a summary here, recapping as much of the explanation
as is needed to follow what comes next.</p>

<p>Manipulation of the TEI is possible because the TEI is not
maintained as DTD files, but in a literate programming (cf <ptr
type="cite" target="KnuthLP"/>) system which documents and describes
elements in a largely abstract manner, and describes their
interdependence using an independently-documented set of element
classes. This is probably best demonstrated by an example.  The
<gi>persName</gi> element is specified by the following markup:</p>
<p rend="eg"><![CDATA[<tagDoc id="PERSNAME" usage="opt"> 
 <gi>persName</gi>
 <name>personal name</name> 
 <desc>contains a proper noun or proper-noun
phrase referring to a person, possibly including any or all of the
person's forenames, surnames, honorifics, added names, etc.</desc>
  <attList>
    <attDef usage="mwa">
      <attName>type</attName>
      <desc>describes the personal name more fully using an open-ended
list  of words or phrases which help to indicate the function, e.g.
<q>married name</q>, <q>maiden name</q>,
<q>pen name</q>, <q>religious name</q>, etc.</desc>
      <datatype>CDATA</datatype>
      <valDesc>Any string of characters.</valDesc>
      <default>#IMPLIED</default>
    </attDef>
  </attList>
  <exemplum>...</exemplum>
  <remarks/>
  <part type="top" name="ND"/>
  <classes names="DEMOG NAMES DATA"/>
  <elemDecl> %om.RR; ( #PCDATA | %m.personPart;
                   | %m.phrase; | %m.Incl; )* </elemDecl>
  <ptr target="NDPER"/>
</tagDoc>]]></p>

<p>The key features here are
<list type="ordered">
<item>The general description of the purpose of the element,
including examples (in <gi>exemplum</gi>, the contents
of which are omitted here)</item>
<item>The list of attributes, specified using name, datatype, default
       etc</item>
<item>The module of the TEI to which <gi>persName</gi> belongs
       (<ident>ND</ident>, ie the module covering names and
       dates)</item>
<item>The classes to which this element contributes (<ident>DEMOG</ident>,
       <ident>NAMES</ident>, and  <ident>DATA</ident>)</item>
<item>The content model for the element; this is also expressed in 
terms of classes, using the DTD markup 
<code>%m.personPart;</code>&#8212;any elements which say they 
are members of the
<soCalled>personPart</soCalled> class are allowed here.</item>
     </list>
This information allows a processor to construct a DTD fragment 
for the element as follows:</p>
<p rend="eg"><![CDATA[<!ELEMENT persName ( #PCDATA | %m.personPart;
                   | %m.phrase; | %m.Incl; )* > 
<!ATTLIST %n.persName;
      %a.global;
      %a.names;
      type CDATA #IMPLIED
      TEIform CDATA 'persName'  >]]></p>
<p>Note here the addition of more attributes, from the classes of
which this element is a member. </p>
<p>The problem with the system described above is the dependence on 
explicit DTD content models, which are not amenable to processing
using standard XML tools. We therefore replace the <gi>elemDecl</gi>
with the following:</p>
<p rend="eg"><![CDATA[<elemDecl>
  <rng:zeroOrMore xmlns:rng="http://relaxng.org/ns/structure/1.0">
    <rng:choice>
      <rng:text/>
      <rng:ref name="m.personPart"/>
      <rng:ref name="m.phrase"/>
      <rng:ref name="m.Incl"/>
    </rng:choice>
  </rng:zeroOrMore>
</elemDecl>]]></p>
<p>This is much easier to analyze, and is (reasonably!) easy to turn
back into DTD markup if needed. A processor can now assemble all the
information needed to construct a complete RelaxNG grammar. </p>
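<p>To make the class mechanism concrete, the following is a minimal
sketch of how a class such as <ident>personPart</ident> might be
expressed as a RelaxNG pattern. The list of members is an assumption
made for illustration, not a copy of the generated TEI grammar; the
point is only that each member element forms one branch of a choice,
so that a reference to <ident>m.personPart</ident> admits any of
them.</p>
<p rend="eg"><![CDATA[<!-- illustrative sketch: the member names are assumed,
     not taken from the generated grammar -->
<define name="m.personPart"
        xmlns="http://relaxng.org/ns/structure/1.0">
  <choice>
    <ref name="surname"/>
    <ref name="forename"/>
    <ref name="roleName"/>
  </choice>
</define>]]></p>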

<p>The translation of the TEI Guidelines to use RelaxNG markup
to encode content models is fairly stable, and the challenge now
is to find ways of making use of the extra power provided by
schemas.</p>
</div>

<div id="roma">
<head>From Pizza Joint to Sushi Bar</head> 

<p>In addition to the class
system for maintaining relationships between elements, the TEI also
works on the basis of a set of mutually exclusive<note
place="foot">This statement is not <emph>entirely</emph> true.</note>
basic tag sets. The choice is between:
<list type="gloss">
<label>Prose</label>
<item>This tagset is suitable for most documents most of the
time</item>
<label>Verse</label>
<item>This tagset adds specialist tagging for metrical analysis,
rhyme-scheme etc to the basic verse markup already included in the
core</item>
<label>Drama</label>
<item>This tagset adds specialist tagging for cast lists, records of
first performance, etc. to the basic drama markup already included
in the core</item>
<label>Speech</label>
<item>This tagset replaces the basic structure by one suitable for
linguistic analysis of speech acts, etc.</item>
<label>Dictionaries</label>
<item>This tagset replaces the basic structure with one containing
detailed lexicographic features</item>
<label>Terminology</label>
<item>This tagset replaces the basic structure with one specific to
terminological databases</item>
</list>
A normal TEI document will start with one of these scenarios,
and then add modules from the following list:
<list type="gloss">
<label>Linking</label>
<item>Adds elements for hypertext linking, segmentation, and
alignment</item>
<label>Figures</label>
<item>Adds elements for encoding tables, pictures, and
formulae</item>
<label>Analysis</label>
<item>Adds elements for interpretation and simple linguistic
analyses</item>
<label>FS</label>
<item>Adds elements for feature structure analysis</item>
<label>Certainty</label>
<item>Adds elements for recording uncertainty and
responsibility</item>
<label>Transcription</label>
<item>Adds elements for the transcription of primary sources (e.g.
manuscripts)</item>
<label>Textcrit</label>
<item>Adds elements for text-critical apparatus</item>
<label>Names &amp; Dates</label>
<item>Adds elements for the detailed tagging of names and
dates</item>
<label>Nets</label>
<item>Adds elements for recording the abstract structure of
mathematical graphs, networks, and trees</item>
<label>Corpora</label>
<item>Adds specialised elements to the TEI-header for use with
language corpora</item>
</list>
It is important to understand that a user <emph>must</emph>
make some sort of choice: there is no one TEI DTD or schema
which is the default.

In addition, the TEI has a clear system for extending the tagset,
which again utilises the class system by allowing new elements to be
added to classes, and to refer to existing classes.<note
place="foot">Adding new classes is a more complex exercise, not for
the faint hearted</note></p> <p>How does a casual user make sense of
this complexity? It requires a good understanding of DTD or Schema
languages to manipulate the right parameter entities or pattern
definitions, so the TEI offers an interface for building customized
views of the system.  In the DTD-only release of the TEI, this is
done using a web form and a utility called <ident>carthago</ident>;
the job of this program is to <soCalled>compile</soCalled> DTDs,
expanding all parameter entity references and removing references
to elements which are not available.<note place="foot">Hence the name
<ident>carthago</ident>; it builds a list of elements which are not
needed, commenting as it goes <emph>haec delenda sunt</emph>, or
<emph>these must be destroyed</emph>, echoing Cato the Elder's repeated
admonition to the Roman Senate of <emph>Carthago delenda est</emph>.
Now, I hope, it is clear why the schema-based successor is called
      <ident>roma</ident>.</note></p>

<p>The web application is known as the TEI Pizza Chef, because it
allows the customer to choose what toppings they want for a particular
base. However, it has to leave most of the work to the user, by
creating a pair of skeleton DTD extension files which the user
downloads, edits, and uploads again. Editing these DTD files by hand
is error-prone, fairly forbidding, and cannot be used to modify
schemas. A revised system has therefore been built which attempts to
keep all the knowledge of DTD or schema in the application itself, and
simply asks the user to select options on web forms. This is fancifully
known as the TEI Sushi Bar, following the model of an endless choice
of clean, distinct options continually being presented to the user,
     rather than an oily mound of congealing cheese and tomato.
More precisely, the Sushi Bar is a web application running scripts
known generically as <ident>roma</ident>.
</p>

<p>Roma starts by asking the user to choose which base tagsets and
extra modules they require. There are two interfaces, one verbose 
(Figure <ptr target="roma1"/>)
and one for the expert (Figure <ptr target="roma1-expert"/>). 
There are also three choices to make:
<list type="ordered">
<item>The user must indicate what sort of output
is needed. The choice is between: <list>
<item>RelaxNG schema</item>
<item>compiled RelaxNG schema</item>
<item>compact RelaxNG schema</item>
<item>W3C  schema</item>
<item>compiled DTD</item>
     </list>
      </item>
<item>The user must say whether  they want to make
modifications to the elements in the selected tagset. 
 The choice is between: <list>
<item>Leave elements as they are</item>
<item>Configure elements, including them by default</item>
<item>Configure elements, excluding them by default</item>
     </list>
      </item>
<item>The user can say whether they want to add some new elements</item>
     </list>
The choices here affect the next stage.  Firstly, if a DTD is
requested, the user is allowed to choose some ISO entity sets to
include (Figure <ptr target="roma3-dtd"/>). Secondly, if element
configuration is requested, all the elements in the chosen tagsets are
listed, with radio buttons which allow the element to be included in
the result, or excluded (Figure <ptr target="roma2"/>). The links
in this table are to the documentation of each element on the TEI web site.
At this stage, the user can <emph>rename</emph> elements; the example shown
in Figure <ptr target="roma3"/> has
<gi>figure</gi> being renamed to <gi>graphic</gi>, and
<gi>figDesc</gi> to <gi>caption</gi>, while <gi>table</gi>,
<gi>row</gi>, <gi>cell</gi>, and <gi>formula</gi> are declared as
unwanted. We will see shortly how this is implemented.</p>

<p><figure id="roma1" width="5in" url="03-01-04-fig01.png"><head>Roma stage 1,
verbose mode</head>
     </figure>
<figure id="roma1-expert" width="5in"
      url="03-01-04-fig02.png"><head>Roma stage 1, expert mode</head>
     </figure>
<figure id="roma3-dtd" width="5in" url="03-01-04-fig03.png"><head>Roma
       stage 2, choosing entity sets</head>
     </figure>
<figure id="roma2" width="5in" url="03-01-04-fig04.png"><head>Roma stage 2, expert mode</head>
     </figure>
<figure id="roma3" url="03-01-04-fig05.png" width="5in"><head>Roma stage 2,
       renaming elements</head>
     </figure></p>

<p>In the second stage of Roma, there is also a set of general 
options which can be turned on and off:
<list type="ordered">
<item>Whether date elements should be validated against an ISO date
       format
(Schema only)</item>
<item>Whether <gi>xptr</gi>, <gi>xref</gi> and <gi>figure</gi>
       elements
should support a <ident>url</ident> attribute to identify external
resources<note place="foot">This is done using entities in
       <soCalled>traditional</soCalled> TEI.</note></item>
<item>Whether the standard extensions of the common subset of the TEI
       known
as <soCalled>TEI Lite</soCalled> should be activated</item>
<item>Whether the <gi>formula</gi> element should be redefined
to insist on content being expressed as MathML (Schema  only)</item>
<item>Whether the <gi>figure</gi> element should be redefined
to allow a content of SVG (Scalable Vector Graphics) elements
(Schema only)</item>
     </list>
After all these choices are made, the Submit button prompts the
user to download the resulting DTD or schema.    
</p>

<p>The look of the result depends on whether or not a compiled form has
     been selected. Given a simple set of choices, a RelaxNG grammar
could result as follows:</p>
<p rend="eg"><![CDATA[<grammar 
  xmlns="http://relaxng.org/ns/structure/1.0"
  xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" 
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<include href="http://www.tei-c.org/Schemas/RelaxNG/P4X/tei2.dtd.rng">

 <define name="TEI.prose"><ref name="INCLUDE"/></define>
 <define name="TEI.figures"><ref name="INCLUDE"/></define>

 <define name="formula"><notAllowed/></define>
 <define name="table"><notAllowed/></define>
 <define name="figDesc">
  <element name="caption">
   <ref name="c.figDesc"/>
  </element>
 </define>
 <define name="row"><notAllowed/></define>
 <define name="figure">
  <element name="graphic">
   <ref name="c.figure"/>
  </element>
 </define>
 <define name="cell"><notAllowed/></define>

<!-- overrides to make ISOdate a formal datatype -->
  <define name="ISO-date">
      <data type="date" 
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"/>
 </define>
</include>
</grammar>]]></p>
<p>There are some important points to note here:
<list type="ordered">
<item>The basic structure is 
<code>&lt;include
	href="http://www.tei-c.org/Schemas/RelaxNG/P4X/tei2.dtd.rng"&gt;</code> <emph> ... a set of redefinitions of standard TEI patterns ...</emph>
<code>&lt;/include&gt;</code></item>
<item>TEI modules are turned on by redefining a pattern, eg
<code>&lt;define name="TEI.figures"&gt;&lt;ref
	name="INCLUDE"/&gt;&lt;/define&gt;</code></item>
<item>In the same way, individual elements can be disallowed by
       setting
their definition to <gi>notAllowed</gi>, eg
<code>&lt;define
	name="table"&gt;&lt;notAllowed/&gt;&lt;/define&gt;</code></item>
<item>Elements are renamed by a redefinition of a pattern</item>
     </list>
The last point deserves some more explanation.
The original definition for <gi>figure</gi> is like this:</p>
<p rend="eg"><![CDATA[<define name="figure">
  <element name="figure">
   <ref name="c.figure"/>
  </element>
 </define>]]></p>
<p>that is, a pattern is defined called <ident>figure</ident>, which
defines an element called <gi>figure</gi>, with a content model
given in the pattern <ident>c.figure</ident>. By redefining 
<ident>figure</ident> as follows:</p>
<p rend="eg"><![CDATA[<define name="figure">
  <element name="graphic">
   <ref name="c.figure"/>
  </element>
 </define>]]></p> <p>we define an element called <gi>graphic</gi>,
which has the <emph>same content model</emph> as the old
<gi>figure</gi>, and is inside the pattern called
<ident>figure</ident>. This is what other definitions will refer to;
so anything which wants to include the <soCalled>figure</soCalled>
element will say <code>&lt;ref name="figure"/&gt;</code>, and it will not matter
that the actual element is renamed. The <emph>original</emph> name of
the element is preserved by an attribute called
<ident>TEIform</ident>, defined as <code><![CDATA[<attribute
name="TEIform" a:defaultValue="figure"> <text/>
</attribute>]]></code>, so it is easy to relate this changed setup to
the basic TEI. The renaming feature may be extended in future to
allow complete translations of the TEI element names to predefined
language sets, allowing the user to simply request "all elements in
     Spanish, please".</p>
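<p>As an illustration, a document fragment along the following lines
should then be accepted by the customized schema in place of the usual
<gi>figure</gi> and <gi>figDesc</gi> markup. This is a sketch only:
the caption text is invented, and it assumes that the content model
held in <ident>c.figure</ident> permits a description on its own. Note
also that <ident>a:defaultValue</ident> is a compatibility annotation,
so whether a <ident>TEIform</ident> attribute is actually supplied to
the application may depend on the processor in use.</p>
<p rend="eg"><![CDATA[<!-- instance markup using the renamed elements;
     the caption text is invented for illustration -->
<graphic>
  <caption>Plan of the harbour at Carthage</caption>
</graphic>]]></p>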

<p>If a compiled output is requested, then the skeleton
DTD or Schema will be put through a flattening process to
remove redundant elements and references to external files. This
has the advantage that a single file is produced, which considerably
aids portability, and the removal of unused elements can make it
much smaller.</p>
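<p>The effect of flattening on a schema can be sketched as follows.
The pattern name <ident>m.example</ident> is hypothetical, and the
real transform may well differ in detail; the sketch relies only on
the fact that in RelaxNG a <gi>notAllowed</gi> branch of a choice can
never match, so that dropping it does not change what the schema
accepts.</p>
<p rend="eg"><![CDATA[<!-- before flattening: table has been redefined
     as notAllowed (m.example is a hypothetical pattern) -->
<define name="m.example">
  <choice>
    <ref name="table"/>
    <ref name="figure"/>
  </choice>
</define>

<!-- after flattening: the dead branch has been dropped, and the
     definition of table is no longer needed -->
<define name="m.example">
  <ref name="figure"/>
</define>]]></p>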
<p>DTD flattening is performed by the existing <ident>carthago</ident>
application, and schema flattening is performed by an XSLT transform
     of a RelaxNG grammar. The other outputs (compact RelaxNG and W3C
     Schema) are done by calls to James Clark's <ident>trang</ident>
     program (<xptr url="http://www.thaiopensource.com/trang/"/>).
    </p>
<p>MathML and SVG inclusion are managed by simply
     <gi>include</gi>ing the relevant RelaxNG grammars,
each in their own namespace. </p>
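<p>A minimal sketch of what such an inclusion might look like for
MathML is given below. The file name <ident>mathml2.rng</ident> and
the pattern name <ident>mathml.math</ident> are assumptions made for
illustration, as is the simplification of omitting the attributes of
<gi>formula</gi>; the sketch further assumes that the MathML grammar
has no start pattern of its own. The grammar actually written by
<ident>roma</ident> may be organised differently.</p>
<p rend="eg"><![CDATA[<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <include href="http://www.tei-c.org/Schemas/RelaxNG/P4X/tei2.dtd.rng">
    <!-- formula may now contain only a MathML expression;
         attributes are omitted to keep the sketch short -->
    <define name="formula">
      <element name="formula">
        <ref name="mathml.math"/>
      </element>
    </define>
  </include>
  <!-- hypothetical MathML grammar, assumed to define the
       pattern mathml.math in the MathML namespace -->
  <include href="mathml2.rng"/>
</grammar>]]></p>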
</div>

<div id="more">
<head>Extending the TEI</head>
<p>We have so far seen examples of simply choosing subsets
of the TEI, or adding standard new features. What if we want to
add some elements? This may be for one of two reasons:
<list type="ordered">
<item>To add an element which is effectively a clone
of an existing element, perhaps with an assumed attribute value,
to make the text easier to edit and read. For example,
we could mark a set of exercise steps with <code><![CDATA[<list
	type='steps'>]]></code>, but it would be friendlier
to allow
<code><![CDATA[<steplist>]]></code>, even though the processing would
       be identical.</item>
<item>To add a new element to an existing class. For example,
the elements for describing an address do not include
anywhere to put a personal URL, so we want to add a new element
parallel to <gi>postCode</gi> and <gi>street</gi>.</item>
     </list>
If the user chooses to add elements, they are asked to decide which of
these two situations they want to address, and to give the element a
name and description.  In Figure <ptr target="roma4"/> we show the
addition of the <gi>homeurl</gi> element, in the
<ident>addrPart</ident> class. Of course, this assumes some
familiarity with the TEI class system (see section <ptr
target="classes"/> for a summary of the TEI classes), and the interface
is not yet friendly enough for someone completely new to the TEI. The
lists of elements and classes are, of course, derived dynamically from
the TEI Guidelines. </p>

<p><figure id="roma4" url="03-01-04-fig06.png" width="5in">
<head>Creating new elements</head></figure>
</p>
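<p>In RelaxNG terms, the addition shown in the figure might come out
roughly as the following pair of definitions, placed alongside the
<gi>include</gi> of the main TEI grammar rather than inside it. This
is a sketch under stated assumptions: the pattern name
<ident>m.addrPart</ident> is assumed by analogy with
<ident>m.personPart</ident>, and the content of <gi>homeurl</gi> is
taken to be plain text. Using <code>combine="choice"</code> merges the
new member with the existing definition of the class instead of
replacing it, so the existing members remain available.</p>
<p rend="eg"><![CDATA[<!-- add the new element to the addrPart class;
     the pattern name m.addrPart is an assumption -->
<define name="m.addrPart" combine="choice">
  <ref name="homeurl"/>
</define>

<!-- the new element itself; plain text content is an assumption -->
<define name="homeurl">
  <element name="homeurl">
    <text/>
  </element>
</define>]]></p>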

<p>There are three further facilities which <ident>Roma</ident> does not
yet provide:
<list type="ordered">
<item>Adding elements which do not simply follow the class system,
but have arbitrary content models and attribute lists. The problem
here is how to ask the user to specify the new material without
directly writing schema code.  It remains to be seen how many requests
we will receive for this feature.</item>
<item>Changing or limiting the content model of elements which
do not follow the class system fully. The correct answer to this may
be to revise the TEI so that all elements <emph>do</emph> use the
class system 100%, but in the short term this is unrealistic. It may
be possible to devise an interface for editing content models.</item>
<item>Adding entire classes to the TEI. This is a complex matter,
which it is unlikely we can provide in a simple web interface.</item></list>
    </p>

</div>

<div id="classes">
<head>TEI classes</head>
<p>Here is a list of the currently defined classes of the TEI system:
<list type="gloss">

<label>addrPart</label><item>groups elements which may constitute a postal or
other form of address.</item>



<label>agent</label><item>groups elements which contain names of individuals
or corporate bodies.</item>



<label>analysis</label><item>default declaration for class <term>analysis</term>:
when the additional tag set for simple analysis is not selected,
no attributes are defined for this class.</item>



<label>analysis</label><item>defines a set of attributes for associating specific analyses or
interpretations with appropriate portions of a text, which are enabled
for all elements when the
additional tag set for simple analysis is selected.</item>



<label>baseStandard</label><item>groups elements in a writing system which refer to some public or
private standard as part of the basis for the writing system
  declaration</item>



<label>bibl</label><item>groups elements containing a bibliographic
  description. </item>



<label>biblPart</label><item>groups elements which can appear only within bibliographic
citation elements.</item>



<label>binary</label><item>groups elements which express binary values in
  feature structures.</item>



<label>boolean</label><item>groups elements which express Boolean
values in feature structures.</item>



<label>chunk</label><item>groups elements which can occur between, but not
within, paragraphs and other chunks.</item>



<label>common</label><item>groups common chunk- and inter-level
  elements.</item>



<label>comp.dictionaries</label><item>groups those component-level elements which are unique to the base tag set
for dictionaries.</item>



<label>comp.drama</label><item>groups those component-level elements
  which are specific to performance texts.</item>



<label>comp.spoken</label><item>groups those elements
which appear at the component level in spoken texts only.</item>



<label>comp.terminology</label><item>groups component-level elements unique to the base tag set
for terminological data.</item>



<label>comp.verse</label><item>groups component level elements unique
  to the base tag set for verse.</item>



<label>complexVal</label><item>groups elements which express complex feature values in feature
structures.</item>



<label>data</label><item>groups phrase-level elements containing names, dates, numbers, measures,
and similar data. </item>



<label>date</label><item>groups elements containing date
  specifications.</item>



<label>declarable</label><item>groups elements which may be independently selected (using the special
purpose <ident>decls</ident> attribute) from a candidate list of declarations
within a TEI header.</item>



<label>declaring</label><item>groups elements which may be independently associated with a
particular declarable element within the header, thus overriding the
inherited default for that element.</item>



<label>demographic</label><item>groups elements describing demographic characteristics of the participants
in a linguistic interaction.</item>



<label>dictionaries</label><item>default declaration for class <term>dictionaries</term>:
when the base tag set for dictionaries is not selected,
no attributes are defined for this class.</item>



<label>dictionaries</label><item>defines a set of global attributes available on elements in the base
tag set for dictionaries.</item>



<label>dictionaryParts</label><item>groups all elements defined
  specifically for dictionaries.</item>



<label>dictionaryTopLevel</label><item>groups related parts of a dictionary entry forming a coherent
subdivision, for example a particular sense, homonym, etc.</item>



<label>divbot</label><item>groups elements which can occur at the end of a
text division; for example, trailer, byline, etc.</item>



<label>divn</label><item>defines a set of attributes common to all elements which
  behave in the same way as divisions.</item>



<label>divtop</label><item>groups elements which can occur at the start of any
division class element.</item>



<label>dramafront</label><item>groups elements which appear at the level of divisions within front or
back matter of performance texts only.</item>



<label>edit</label><item>defines a group of attributes common to the phrase-level
  elements used for simple editorial correction and
	transcription. </item>



<label>edit</label><item>groups phrase-level elements for simple editorial correction and
transcription. </item>



<label>editIncl</label><item>groups empty elements which perform a specifically editorial function, for
example by indicating the start of a 
span of text added, deleted, or missing in a source.</item>



<label>enjamb</label><item>groups elements bearing the <ident>enjamb</ident> attribute.</item>



<label>entries</label><item>groups the different styles of dictionary
  entries.</item>



<label>featureVal</label><item>groups elements which express feature
  values in feature structures.</item>



<label>fmchunk</label><item>groups elements which can occur as direct constituents
of front matter, when a full title page is not given. </item>



<label>formInfo</label><item>groups elements allowed within a <gi
  >form</gi> element in a dictionary.</item>



<label>formPointers</label><item>groups elements in the dictionary base which point at
orthographic or pronunciation forms of the headword.</item>



<label>fragmentary</label><item>groups elements which mark the beginning or ending of a fragmentary
manuscript or other witness.</item>



<label>front</label><item>groups elements which appear at the level of divisions within front or
back matter.</item>



<label>global</label><item>defines
a set of attributes available to all components of the writing system
  declaration.</item>



<label>global</label><item>defines a
set of attributes common to all elements in the TEI encoding
scheme.</item>



<label>gramInfo</label><item>groups those elements allowed within a
  <gi>gramGrp</gi> element in a dictionary.</item>



<label>hqinter</label><item>groups elements related to highlighting which can appear either
within or between chunk-level elements. </item>



<label>hqphrase</label><item>groups phrase-level elements related to
  highlighting. </item>


<label>Incl</label><item>groups empty elements which may appear at any
  point within a TEI text.</item>



<label>inter</label><item>groups
elements of the intermediate (inter-level) class: these elements can occur
both within and between paragraphs or other chunk-level
  elements.</item>



<label>interpret</label><item>defines the set of attributes common to
  this group of interpretative elements.</item>



<label>linking</label><item>default declaration for class <term>linking</term>:
when the additional tag set for linking is not selected,
no attributes are defined for this class.</item>



<label>linking</label><item>defines a set of attributes for hypertext and other linking,
which are enabled for all elements when the additional tag set for
linking is selected.</item>



<label>lists</label><item>groups
all list-like elements.</item>



<label>loc</label><item>groups elements used for purposes of location
  and reference</item>



<label>metadata</label><item>groups empty elements which describe the status of other elements, for
example by holding groups of links or of abstract interpretations, or
by providing indications of certainty etc., and which may appear at any
point in a document.</item>



<label>metrical</label><item>defines a set of attributes which certain elements may use to
represent metrical information. </item>



<label>morphInfo</label><item>groups elements which provide morphological information within
the dictionary tag set.</item>



<label>names</label><item>groups those elements which refer to named persons, places, organizations etc.
</item>



<label>notes</label><item>groups all note-like elements. </item>



<label>personPart</label><item>groups those elements which form part of a personal
name.</item>



<label>phrase.verse</label><item>groups phrase-level elements which
  may appear within verse only.</item>



<label>phrase</label><item>groups those elements which can occur at the level of individual
words or phrases.</item>



<label>placePart</label><item>groups those elements which form part of a place
name.</item>



<label>pointer</label><item>defines
a set of attributes used by all elements which point to other elements
by means of one or more <code>IDREF</code> values.</item>



<label>pointerGroup</label><item>defines a set of attributes common to
  all elements which enclose groups of pointer elements.</item>



<label>readings</label><item>defines a set of attributes common to all
  elements representing variant readings in text critical work.</item>



<label>refsys</label><item>groups milestone-style
elements used to represent reference systems.</item>



<label>seg</label><item>groups elements used for arbitrary
  segmentation. </item>



<label>sgmlKeywords</label><item>groups elements whose content is an SGML or XML identifier or tag of some sort
(generic identifier of an element type, name of an attribute,
  etc.).</item>



<label>singleVal</label><item>groups elements which express single
  feature values in feature structures.</item>



<label>stageDirection</label><item>groups elements containing specialized
stage directions defined in the additional tag set for performance
texts.</item>



<label>temporalExpr</label><item>groups component elements of temporal expressions involving
  dates and time, and defines an additional set of attributes common
  to them.</item>



<label>terminology</label><item>default declaration for class <term>terminology</term>:
when the base tag set for terminological data is not selected,
no attributes are defined for this class.</item>



<label>terminology</label><item>defines attributes for all elements in documents which use the
base tag set for terminological data.</item>



<label>terminologyInclusions</label><item>groups elements which may be included at any point within a
terminology entry.</item>



<label>terminologyMisc</label><item>groups elements which can appear together at various points in
terminological entries.</item>



<label>timed</label><item>defines a set of attributes common to those elements which have a
duration in time,  expressed either absolutely
or by reference to an alignment map.</item>



<label>tpParts</label><item>groups those elements which can occur as direct constituents
of a title page (<gi>docTitle</gi>, <gi>docAuthor</gi>,
<gi>docImprint</gi>, <gi>epigraph</gi>,
  etc.).</item>



<label>typed</label><item>defines a
set of attributes which can be used to classify or subclassify
certain elements in any way. </item>



<label>xPointer</label><item>defines a set of attributes used by all those elements which use the TEI extended pointer mechanism to
point at locations which have neither an SGML nor an XML ID.</item>



    </list>
    </p>
   </div>


<div id="conclusions"><head>Conclusions</head>

<p>The increasing power provided by schemas, and the stress on
modularity, argue in favour of moving towards (conceptual) <emph>two
stage validation</emph>. In the first phase, the important check is
that the document uses the right vocabulary, in our case meaning the
441 elements currently described by the TEI.  The structure here can
be quite loose. In the second phase, which can depend on individual
projects, validation can be a lot more precise, with detailed
datatyping and inter-dependency validation. For example, the basic
rule may say that a <gi>text</gi> must have an <gi>author</gi>,
<gi>title</gi> and <gi>date</gi>, but be agnostic about their order. A
particular project may wish to enforce a rule that they must occur in
a fixed order; or it may wish to be <emph>more</emph> limited than the
     base schema, and say that <gi>date</gi> is not permitted at all.
Thus a typical document may be checked once to ensure that it uses
TEI vocabulary and broad grammatical structure, and then checked again
     to make sure it talks the right dialect.</p>
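
<p>As a minimal sketch of the two stages, the <gi>text</gi> example
above might be expressed in RelaxNG roughly as follows. Both grammars
are hypothetical illustrations written for this paper, not extracts
from the TEI schemas: the content of each element is simplified to
plain text, and the second grammar shows only the fixed-order variant
of the project rule.</p>

<p><eg><![CDATA[
<!-- Stage one (hypothetical base rule): author, title and date
     are all required, but may occur in any order -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="text">
      <interleave>
        <element name="author"><text/></element>
        <element name="title"><text/></element>
        <element name="date"><text/></element>
      </interleave>
    </element>
  </start>
</grammar>

<!-- Stage two (hypothetical project rule): the same three
     children, but only in the order author, title, date -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="text">
      <element name="author"><text/></element>
      <element name="title"><text/></element>
      <element name="date"><text/></element>
    </element>
  </start>
</grammar>
]]></eg></p>

<p>Every document accepted by the second grammar is also accepted by
the first, so under these assumptions a document can be checked
against the loose grammar once and against the stricter project
grammar afterwards, for instance with a RelaxNG validator such as the
one in <ident>libxml2</ident>.</p>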

<p>The relevance of this work is that it shows a way forward for
XML users which does not involve low-level interaction with DTDs or
Schemas. Unlike graphical direct-manipulation tools such as XML Spy,
the Roma tool works at the level of the TEI class system. Together
with the support for other namespaces via schemas, these tools take
the TEI one step further on the road to a universal markup
language.</p>

   </div>
</body>

<back>
<div id="notes"><head>Notes and Acknowledgements</head>
<p>This work was carried out as part of 
the technical work programme of the Metalanguage Taskforce
(<xptr url="http://www.tei-c.org/Council/tcw03.html"/>)
of the TEI Council in 2003. It is still experimental and does
not form a formal part of the TEI.</p>
<p>I am grateful to Norm Walsh and Lou Burnard, and the
other members of the Taskforce, for stimulating discussion
on this and related subjects; I was also delighted to discover
Daniel Veillard's work on a new RelaxNG validator 
(now part of <ident>libxml2</ident>) while I was writing this
paper, and to have the chance of contributing towards debugging
the software with TEI examples.</p>
   </div>

<div>
<head>Bibliography</head>
<listBibl>

<bibl id="XMLEUR2002">
Sebastian Rahtz,
<title level="a">Converting to schema: the TEI and RelaxNG</title>,
paper presented at XML Europe 2002, Barcelona, May 2002.</bibl>

<bibl id="TEIP3">
Association for Computers and the Humanities, Association for
Computational Linguistics, and Association for Literary and Linguistic
Computing, <title level="m">Guidelines for Electronic Text Encoding
and Interchange (TEI P3)</title>.  Ed. C. M. Sperberg-McQueen and Lou
Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
     </bibl>

<bibl id="DOCBOOK">
N. Walsh and L. Muellner, <title>DocBook: The Definitive Guide</title>,
O'Reilly, Sebastopol, CA, USA, 1999.
 </bibl>

<bibl id="KnuthLP">
Donald E. Knuth, <title>Literate Programming</title>,
Stanford University Center for the Study of Language and Information
(CSLI Lecture Notes Number 27), Stanford, CA, USA, 1992.</bibl>

<bibl>C. M. Sperberg-McQueen and Lou Burnard, <title level="a">The
Design of the TEI Encoding Scheme</title>, in N. Ide and J. Veronis,
eds., <title level="m">The Text Encoding Initiative: Background and
Contexts</title>, special triple issue of <title level="j">Computers
and the Humanities</title>, 29:1, 1995, 17-39.</bibl>

</listBibl>
   </div>
</back>
</text>
</TEI.2>



