TEI META Task Force: Status Report [MEW01]


Choice of schema language for the TEI

The task force considered whether the content models of TEI elements should be expressed in the source TEI Guidelines in:
  1. XML DTD language as at present
  2. W3C Schema language
  3. OASIS Relax NG language
  4. A new notation of the TEI's devising
There was an almost instant preference for Relax NG, since it:
  • Uses XML syntax, enabling easy validation and analysis
  • Is very readable, and fairly easy to relate to DTD
  • Is well-implemented by different processors, and so immediately useable
  • Uses W3C schema datatyping
  • Seems likely to be included in the forthcoming ISO DSDL
  • Can be converted to W3C schema if needed
and it was therefore agreed to convert the TEI Guidelines so that element content models are represented in Relax NG syntax in its own namespace.

Sebastian Rahtz presented a paper at XML Europe 2002 on converting the TEI XML content models to Relax NG. This work, slightly refined, is the basis for an experimental version of TEI P5. A set of derived sample TEI schemas is available for immediate use.

Skeleton work plan for redesigning ODD

It is intended to suggest and implement changes to the ODD system in the following order:
  1. Clear up the details of <tagDoc>
  2. Revise (part 1) the TSD tagdoc and make it a standard ‘topping’
  3. Convert (part 1) the Guidelines to conform to that schema (*)
    1. Convert the <elemDecl> contents to RelaxNG schema (*)
    2. Convert attributes (where automatically possible) to use new datatyping scheme (*)
    3. Add new <entDoc> elements defining the datatypes (*)
    4. Examine and rework the <string> and <entDoc> elements to remove remaining SGML/XML material
    5. Rewrite and test the scripts which
      • generate schemas (*)
      • generate DTDs
      • generate HTML version of the Guidelines
      • generate PDF version of the Guidelines
    The items marked (*) have been completed.
  4. Clear up the details of higher-level class system
  5. Revise (part 2) the TSD tagdoc
  6. Convert (part 2) the Guidelines to conform to that schema

Work plan beyond ODD: towards P5

The following tasks need to be completed in order to create P5:
  1. Make corrections of known errors
  2. Assess all the attribute datatypes and decide whether:
    • A new datatype should be created (when more than 2 or 3 attributes have the same pattern)
    • An attribute which is now simple text should be reconsidered as a tokenized attribute
    • Extra facets should be added to further refine datatypes
  3. Assess elements to see whether those with plain text bodies can be datatyped
  4. Consider all element content models to decide whether they are too restrictive or too loose; consider whether some of the simplifying facilities available in RelaxNG (eg <interleave>) should be used.
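
The first check in task 2 (spotting enumerations shared by several attributes) can be partly automated. The following is only a minimal sketch: the function names, the threshold of three, and the sample element and attribute names are assumptions made for illustration and do not come from the task force's scripts. It treats two enumerations as the same pattern if they contain the same tokens in any order.

```python
from collections import Counter


def normalise(pattern):
    """Reduce an enumeration such as 'eq | ne' to a sorted tuple of tokens."""
    return tuple(sorted(token.strip() for token in pattern.split("|")))


def candidate_datatypes(attribute_patterns, threshold=3):
    """Return the patterns shared by at least `threshold` attributes.

    `attribute_patterns` maps (element, attribute) pairs to the enumeration
    currently declared for that attribute. The threshold is an assumption.
    """
    counts = Counter(normalise(p) for p in attribute_patterns.values())
    return {pattern: n for pattern, n in counts.items() if n >= threshold}


# Hypothetical sample data; the element and attribute names are placeholders.
sample = {
    ("elemA", "rel"): "eq | ne | gt | ge | lt | le",
    ("elemB", "rel"): "eq | ne | lt | le | gt | ge",
    ("elemC", "rel"): "eq|ne|gt|ge|lt|le",
    ("elemD", "medium"): "audio | video",
}
shared = candidate_datatypes(sample)
print(shared)
```

Ignoring token order is a design choice: it lets declarations that differ only in the order of their values count towards the same candidate datatype.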

Work on ODD markup

<attList>

We currently have a structure called <attList> which contains one or more <attDef> elements, which have <datatype> and <default> children. The <default> holds both things like ‘#IMPLIED’ and ‘m’, while <datatype> has a mixture of ‘CDATA’, ‘63’ and ‘%ISO-date;’. It is suggested that
  1. <attDef> has a boolean attribute ‘required’
  2. the <default> element should only be used to hold genuine default strings or tokens. It will be optional. Some notation will be needed to encompass ‘%INHERITED;’
  3. <datatype> has a mandatory ‘target’ attribute, which points to an <entDoc> defining the datatype. This gives us an extra abstract layer over XML schema datatypes. Most token choice attributes would be boiled down to genuine datatypes, so all of ‘Y|N’, ‘yes | no’ and ‘true|false’ would be <datatype target="BOOLEAN"> . In the <entDoc> , we expound on this and map to the relevant W3C Schema datatype (see section Datatyping in attributes). Where the choice is limited, eg ‘A | B’, it is recorded as a set of enumerated values, defined in the body of the <datatype> :
    <datatype target="TOKEN"> <rng:choice> <value>A</value> <value>B</value> </rng:choice> </datatype>
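
As an illustration of the proposed shape, the sketch below builds such an <attDef> with Python's standard xml.etree.ElementTree module. It is not the task force's tooling: the function name, the ‘ident’ attribute used to carry the attribute's name, and the sample values are assumptions; only the ‘required’ attribute, the ‘target’ attribute and the optional <default> come from the proposal above.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"  # the Relax NG namespace
ET.register_namespace("rng", RNG_NS)


def make_att_def(ident, target, required=False, default=None, values=()):
    """Build an <attDef> in the proposed form.

    'ident' is a hypothetical attribute name chosen for this sketch.
    """
    att_def = ET.Element(
        "attDef", {"ident": ident, "required": "true" if required else "false"}
    )
    datatype = ET.SubElement(att_def, "datatype", {"target": target})
    if values:
        # Enumerated values live in the body of <datatype>, as in the proposal.
        choice = ET.SubElement(datatype, f"{{{RNG_NS}}}choice")
        for value in values:
            ET.SubElement(choice, "value").text = value
    if default is not None:
        # <default> is optional and holds only a genuine default token.
        ET.SubElement(att_def, "default").text = default
    return att_def


att_def = make_att_def("status", "TOKEN", values=("A", "B"), default="A")
print(ET.tostring(att_def, encoding="unicode"))
```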

Datatyping in attributes

The task force is asked to use W3C Schema datatypes in the TEI as much as possible.

An analysis of all the current <datatype> values shows that they fall into four categories:
  1. Standard XML datatypes (ID, IDREFS, NMTOKENS, etc)
  2. Abstract datatypes linked to entities in the Guidelines (there are only 2 or 3 of these)
  3. Text with no conditions
  4. Text, but with a fixed set of possibilities
We can deal with the first of these easily; they all map into schema datatypes. The second is simply an indirection. The third will remain as text (but see below). It is suggested that the fourth should be split into:
  1. attributes where the range of possibilities fits a W3C datatype, or it makes sense to at least have a common set of values across the TEI
  2. attributes which really should have token values
However, it is likely that some attributes are mis-classified at present; some of those which are datatyped as free text should be tokenized, and some which are tokenized should be completely free text. It is important to separate out attributes which have completely arbitrary text from those where the text is tokenizable (see section Character encoding in attributes).
It is suggested that the system be rationalized so that all the existing <datatype> entries are replaced by pointers to one of the following datatypes:
Name Relax NG representation
ANYURI <rng:data type="anyURI"/>
BOOLEAN <rng:data type="boolean"/>
DATE <rng:data type="date"/>
DATETIME <rng:data type="dateTime"/>
DURATION <rng:data type="duration"/>
ENTITIES <rng:data type="ENTITIES"/>
ENTITY <rng:data type="ENTITY"/>
EXTPTR <rng:text/>
FLOAT <rng:data type="float"/>
FORMULA <rng:text/>
ID <rng:data type="ID"/>
IDREF <rng:data type="IDREF"/>
IDREFS <rng:data type="IDREFS"/>
LANGUAGE <rng:text/>
NAME <rng:data type="NCName"/>
NMTOKEN <rng:data type="NMTOKEN"/>
NMTOKENS <rng:data type="NMTOKENS"/>
SEX <rng:choice> <value>m</value> <value>f</value> <value>u</value> <value>x</value> </rng:choice>
TEXT <rng:text/>
TIME <rng:data type="time"/>
TOKEN <rng:empty/>
UBOOLEAN <rng:choice> <value>true</value> <value>false</value> <value>unknown</value> <value>unspecified</value> </rng:choice>

The first table in Appendix A (Figure 2) lists some current datatype values and how they map to the new scheme. The second (Figure 3) shows 180 attributes which can automatically be given non-text, non-token datatypes.
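
The mapping from existing declarations to the new datatype names is mechanical enough to sketch in code. The function below is an illustration only, not the conversion script actually used: its name and structure are assumptions, and the lookup tables are taken from rows of the first table in Appendix A. Anything not recognised falls back to TOKEN with its enumerated values.

```python
# Entity references and XML attribute types, as listed in Appendix A.
ENTITY_MAP = {
    "%ISO-date;": "DATE",
    "%extPtr;": "EXTPTR",
    "%formulaNotations;": "FORMULA",
}
XML_TYPES = {"ENTITIES", "ENTITY", "ID", "IDREF", "IDREFS",
             "NAME", "NMTOKEN", "NMTOKENS"}

# Enumerations that collapse to a named datatype (compared case-insensitively).
NAMED_SETS = [
    ("BOOLEAN", [{"y", "n"}, {"yes", "no"}, {"true", "false"}]),
    ("UBOOLEAN", [{"y", "n", "u"}, {"y", "n", "unspecified"}]),
    ("SEX", [{"m", "f", "u"}, {"m", "f", "u", "x"}]),
]


def map_datatype(declaration):
    """Return (new datatype name, enumerated values) for an old declaration."""
    declaration = declaration.strip()
    if declaration in ENTITY_MAP:
        return ENTITY_MAP[declaration], []
    if declaration in XML_TYPES:
        return declaration, []
    if declaration == "CDATA":
        return "TOKEN", []  # Appendix A maps CDATA to TOKEN
    tokens = [token.strip() for token in declaration.split("|")]
    key = {token.lower() for token in tokens}
    for name, forms in NAMED_SETS:
        if key in forms:
            return name, []
    return "TOKEN", tokens


for old in ["Y | N", "%ISO-date;", "all | one | none", "m | f | u | x"]:
    print(old, "->", map_datatype(old))
```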

Character encoding in attributes

The character encoding workgroup discussed how to deal with attributes which need to use the full range of characters (eg variations, and names). This task force agreed that the correct approach was to support an alternative notation, by which these attributes could optionally be recorded as elements if the TEI user wishes to use some form of character encoding not permitted in TEI attributes. TEI P5 will therefore:
  1. Record which attributes have the extended property of being representable as elements
  2. When making normal DTDs, only support the ‘traditional’ scheme of attributes
  3. Allow for special DTDs (from son-of-pizzachef) which support only the element alternative
  4. When making schemas, support both attribute and element forms
Processing applications (eg XSLT stylesheets) will have to decide whether to support both systems, or only one.
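
A processing application that chooses to support both forms might wrap the lookup in a small helper. The sketch below is written in Python rather than XSLT and rests on a loud assumption: that the element alternative is a child element bearing the same name as the attribute. The report does not specify how the element form will be named, and the <item> element and ‘label’ property are placeholders.

```python
import xml.etree.ElementTree as ET


def get_property(element, name):
    """Return the value of `name`, recorded as an attribute or a child element.

    Assumes (hypothetically) that the element form reuses the attribute's name.
    """
    if name in element.attrib:
        return element.get(name)
    child = element.find(name)
    if child is not None:
        # Gather all text, in case the element form carries nested markup.
        return "".join(child.itertext())
    return None


as_attribute = ET.fromstring('<item label="abc"/>')
as_element = ET.fromstring("<item><label>abc</label></item>")
print(get_property(as_attribute, "label"), get_property(as_element, "label"))
```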

There are over 300 attributes which currently have a text datatype; this includes a good many elements which have a type attribute. The TEI editors will have to decide which of these should be classified as ‘true text’ (see EDW79).

Namespaces and fragment inclusion

The task force is asked to consider how the following situations can be catered for:
  • Using fragments of another markup language in TEI XML
  • Using fragments of TEI in another markup language
The answer to both of these is XML namespaces. Two vocabularies can be combined, if the elements identify their namespace. Using schemas, it is easy to validate a document which goes off into different namespaces at various points; this is demonstrated in a TEI RelaxNG schema which redefines <formula> to have MathML elements as content. However, to demonstrate the other way round (fragments of TEI embedded in another XML vocabulary) would require assigning a namespace for the TEI. This could be a single namespace for all TEI elements, or a different one for each tagset. The task force considered that the latter would be an unnecessary complication for users, but that a namespace (perhaps ‘http://www.tei-c.org/P5’) for the TEI would be a good idea. However, there are two major problems with this, which have prevented the task force from implementing it:
  1. All existing TEI documents would be invalid, as they would be in an empty namespace. It would be a fairly small fix for each instance to add a namespace declaration to the root element, but that would make it fail with existing DTDs.
  2. All existing XML processing tools would fail to work with new documents; for instance, XSLT stylesheets which process a current (empty namespace) <TEI.2> would fail to identify the new <TEI.2 xmlns="http://www.tei-c.org/P5"> . It will be possible in XSLT 2.0 to write a stylesheet to work with both old and new TEI documents, but using XSLT 1.0 it will be much harder; all stylesheets will need a large rewrite.
This issue requires further investigation.
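
The second problem is easy to reproduce with any namespace-aware tool. The sketch below uses Python's standard xml.etree.ElementTree module instead of XSLT, purely for illustration; the namespace URI is the one floated above, not an adopted one, and the helper function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/P5"  # the URI suggested in this report, not final

old_doc = ET.fromstring("<TEI.2><text/></TEI.2>")
new_doc = ET.fromstring(f'<TEI.2 xmlns="{TEI_NS}"><text/></TEI.2>')


def find_text_legacy(root):
    """Lookup written for existing documents in the empty namespace."""
    return root.find("text")


def find_text_either(root):
    """Lookup that tries the empty namespace first, then the TEI namespace."""
    found = root.find("text")
    if found is None:
        found = root.find(f"{{{TEI_NS}}}text")
    return found


print(old_doc.tag, new_doc.tag)
print(find_text_legacy(new_doc) is None)      # the legacy lookup misses <text>
print(find_text_either(new_doc) is not None)  # the dual lookup finds it
```

Every lookup in existing code would need the same treatment, which is why the change amounts to a large rewrite and not a one-line fix.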

A replacement for the Pizza chef

This has not been discussed by the task force, but Sebastian Rahtz has written a paper on the subject for XML Europe 2003. This shows that it is possible to have a simple web application which generates RelaxNG schemas, W3C schemas, and XML DTDs, on demand; the prototype, Roma, works solely with the TEI class system, and provides a better interface to it than the Pizza Chef. There are, however, facilities which Roma does not yet provide:
  1. Adding elements which do not simply follow the class system, but have arbitrary content models and attribute lists. The problem here is how to ask the user to specify the new material without directly writing schema code. It remains to be seen how many requests we will receive for this feature.
  2. Changing or limiting the content model of elements which do not follow the class system fully. The correct answer to this may be to revise the TEI so that all elements do use the class system 100%, but in the short term this is unrealistic. It may be possible to devise an interface for editing content models.
  3. Adding entire classes to the TEI. This is a complex matter, which it is unlikely we can provide in a simple web interface.

Appendix A: Tables

Current datatypes and proposed replacements:
Current New datatype (values)
%ISO-date; DATE
%extPtr; EXTPTR
%formulaNotations; FORMULA
Y | N BOOLEAN
Y | N | U UBOOLEAN
YES | NO BOOLEAN
all | one | none TOKEN all, one, none
all | some | none TOKEN all, some, none
free | unknown | restricted TOKEN free, unknown, restricted
light | sound | prop | block TOKEN light, sound, prop, block
m | f | u SEX
m | f | u | x SEX
none | some | all TOKEN none, some, all
silent | tags TOKEN silent, tags
y | n | u UBOOLEAN
yes | no BOOLEAN
Y | N | I | M | F TOKEN Y, N, I, M, F
Y | N | U UBOOLEAN
Y | N | partial TOKEN Y, N, partial
Y | N BOOLEAN
Y | N BOOLEAN
a | m | j | s | u TOKEN a, m, j, s, u
am | pm | 24hour | descriptive TOKEN am, pm, 24hour, descriptive
audio | video TOKEN audio, video
closed | semi | open TOKEN closed, semi, open
composite | uniform TOKEN composite, uniform
data | rend | std | nonstd | unknown TOKEN data, rend, std, nonstd, unknown
eq | ne TOKEN eq, ne
eq | ne | gt | ge | lt | le TOKEN eq, ne, gt, ge, lt, le
eq | ne | lt | le | gt | ge TOKEN eq, ne, lt, le, gt, ge
eq | ne | sb | ns TOKEN eq, ne, sb, ns
eq | ne | sb | ns | lt | le | gt | ge TOKEN eq, ne, sb, ns, lt, le, gt, ge
excl | incl TOKEN excl, incl
fiction | fact | mixed | inapplicable TOKEN fiction, fact, mixed, inapplicable
high | medium | low | unknown TOKEN high, medium, low, unknown
horizontal | vertical TOKEN horizontal, vertical
initial | medial | final | unknown | complete TOKEN initial, medial, final, unknown, complete
internal | external TOKEN internal, external
int | real TOKEN int, real
lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other TOKEN lexical, punc, lexpunc, digit, space, DL, LD, dia, joiner, other
location-referenced | double-end-point | parallel-segmentation TOKEN location-referenced, double-end-point, parallel-segmentation
model | atts | both TOKEN model, atts, both
new | update TOKEN new, update
none | partial | complete | inapplicable TOKEN none, partial, complete, inapplicable
pe | ge TOKEN pe, ge
perc | real TOKEN perc, real
req | mwa | rec | rwa | opt TOKEN req, mwa, rec, rwa, opt
role | list TOKEN role, list
root | branches TOKEN root, branches
s | w | ws | sw | m | x TOKEN s, w, ws, sw, m, x
silent | tags TOKEN silent, tags
single | composite | frags | unknown TOKEN single, composite, frags, unknown
single | set | bag | list TOKEN single, set, bag, list
smooth | latching | overlap | pause TOKEN smooth, latching, overlap, pause
tei | iso | national | private | none TOKEN tei, iso, national, private, none
tempo | loud | pitch | tension | rhythm | voice TOKEN tempo, loud, pitch, tension, rhythm, voice
to | from | both | none TOKEN to, from, both, none
unit | set | bag | list TOKEN unit, set, bag, list
y | n | unspecified UBOOLEAN
y | n BOOLEAN
yes | abb | init TOKEN yes, abb, init
yes | no BOOLEAN
yes | no BOOLEAN
CDATA TOKEN
ENTITIES ENTITIES
ENTITY ENTITY
ID ID
IDREF IDREF
IDREFS IDREFS
NAME NAME
NMTOKEN NMTOKEN
NMTOKENS NMTOKENS
Attributes with datatypes assigned:
element attribute datatype
analysis ana IDREFS
declarable default BOOLEAN
declaring decls IDREFS
dictionaries location IDREF
dictionaries mergedin IDREF
dictionaries opt BOOLEAN
edit resp IDREF
formPointers target IDREF
global id ID
global id ID
global lang IDREF
interpret inst IDREFS
linking corresp IDREFS
linking synch IDREFS
linking sameAs IDREF
linking copyOf IDREF
linking next IDREF
linking prev IDREF
linking exclude IDREFS
linking select IDREFS
pointer targOrder UBOOLEAN
pointerGroup domains IDREFS
readings hand IDREF
TEIform TEIform NAME
terminology grpPtr IDREF
terminology depPtr IDREF
timed start IDREF
timed end IDREF
xPointer doc ENTITY
xPointer from EXTPTR
xPointer to EXTPTR
abbr resp IDREF
add resp IDREF
add hand IDREF
addSpan resp IDREF
addSpan hand IDREF
addSpan to IDREF
admin date DATE
alt targets IDREFS
app from IDREF
app to IDREF
arc from IDREF
arc to IDREF
att tei BOOLEAN
birth date DATE
catRef target IDREFS
catRef scheme IDREF
cell rows NONNEGATIVEINTEGER
cell cols NONNEGATIVEINTEGER
certainty target IDREFS
classCode scheme IDREF
damage resp IDREF
damage hand IDREF
date value DATE
del resp IDREF
del hand IDREF
delSpan resp IDREF
delSpan hand IDREF
delSpan to IDREF
distance exact UBOOLEAN
docDate value DATE
eLeaf value IDREF
eTree value IDREF
event who IDREF
event iterated UBOOLEAN
expan resp IDREF
f fVal IDREFS
fAlt mutExcl BOOLEAN
figure entity ENTITY
form codedCharSet IDREF
form entityStd ENTITIES
form entityLoc ENTITIES
formula notation FORMULA
fs feats IDREFS
fsdDecl fsd ENTITY
gap resp IDREF
gap hand IDREF
gi tei BOOLEAN
gloss target IDREF
graph order NONNEGATIVEINTEGER
graph size NONNEGATIVEINTEGER
handShift new IDREF
handShift old IDREF
handShift resp IDREF
iNode value IDREF
iNode children IDREFS
iNode parent IDREF
iNode ord BOOLEAN
iNode follow IDREF
iNode outDegree NONNEGATIVEINTEGER
join targets IDREFS
keywords scheme IDREF
kinesic who IDREF
kinesic iterated UBOOLEAN
language iso639 LANGUAGE
language wsd ENTITY
leaf value IDREF
leaf parent IDREF
leaf follow IDREF
link targets IDREFS
move who IDREFS
move perf IDREFS
msr value FLOAT
msr valueTo FLOAT
nbr value FLOAT
nbr valueTo FLOAT
node value IDREF
node adjTo IDREFS
node adjFrom IDREFS
node adj IDREFS
node inDegree NONNEGATIVEINTEGER
node outDegree NONNEGATIVEINTEGER
node degree NONNEGATIVEINTEGER
note anchored BOOLEAN
note target IDREFS
note targetEnd IDREFS
occupation scheme IDREF
occupation code IDREF
pause who IDREF
person sex SEX
personGrp sex SEX
ptr target IDREFS
q direct UBOOLEAN
rate value FLOAT
rate valueTo FLOAT
ref target IDREFS
relation active IDREFS
relation passive IDREFS
relation mutual BOOLEAN
respons target IDREFS
restore resp IDREF
restore hand IDREF
root value IDREF
root children IDREFS
root ord BOOLEAN
root outDegree NONNEGATIVEINTEGER
setting who IDREFS
shift who IDREF
socecStatus scheme IDREF
socecStatus code IDREF
sound discrete UBOOLEAN
sp who IDREFS
span from IDREF
span to IDREF
state length NONNEGATIVEINTEGER
step length NONNEGATIVEINTEGER
step from EXTPTR
step to EXTPTR
supplied hand IDREF
symbol terminal BOOLEAN
table rows NONNEGATIVEINTEGER
table cols NONNEGATIVEINTEGER
tag TEI BOOLEAN
tagUsage occurs NONNEGATIVEINTEGER
tagUsage ident NONNEGATIVEINTEGER
tagUsage render IDREF
tech perf IDREFS
teiHeader date.created DATE
teiHeader date.updated DATE
time value TIME
timeline origin IDREF
timeRange from TIME
timeRange to TIME
tree arity NONNEGATIVEINTEGER
tree order NONNEGATIVEINTEGER
triangle value IDREF
u who IDREFS
unclear hand IDREF
vAlt mutExcl BOOLEAN
vocal who IDREF
vocal iterated UBOOLEAN
when since IDREF
witDetail target IDREFS
writing who IDREF
writing script IDREF
writing gradual UBOOLEAN
writingSystemDeclaration date DATE

Last recorded change to this page: 2007-09-16  •  For corrections or updates, contact webmaster AT tei-c DOT org