TEI META Task Force: Status Report [MEW01]

Choice of schema language for the TEI
Skeleton work plan for redesigning ODD
Work plan beyond ODD: towards P5
Work on ODD markup
- attList
Datatyping in attributes
Character encoding in attributes
Namespaces and fragment inclusion
A replacement for the Pizza chef

Appendix A: Tables

Choice of schema language for the TEI

The task force considered whether the content models of TEI elements should be expressed in the source TEI Guidelines in:

XML DTD language as at present
W3C Schema language
OASIS Relax NG language
A new notation of the TEI's devising

There was an almost instant preference for Relax NG, since it:

Uses XML syntax, enabling easy validation and analysis
Is very readable, and fairly easy to relate to DTD
Is well-implemented by different processors, and so immediately useable
Uses W3C schema datatyping
Seems likely to be included in the forthcoming ISO DSDL
Can be converted to W3C schema if needed

and it was therefore agreed to convert the TEI Guidelines so that element content models are represented in Relax NG syntax in its own namespace.

Sebastian Rahtz presented a paper at XML Europe 2002 on how to convert the TEI XML content models to RelaxNG. This work, slightly refined, is the basis for an experimental version of TEI P5. A set of derived sample TEI schemas is available for immediate use.

Skeleton work plan for redesigning ODD

It is intended to suggest and implement changes to the ODD system in the following order:

Clear up the details of <tagDoc>
Revise (part 1) the TSD tagdoc and make it a standard ‘topping’
Convert (part 1) the Guidelines to conform to that schema (*)
  1. Convert the <elemDecl> contents to RelaxNG schema (*)
  2. Convert attributes (where automatically possible) to use new datatyping scheme (*)
  3. Add new <entDoc> elements defining the datatypes (*)
  4. Examine and rework the <string> and <entDoc> elements to remove remaining SGML/XML material
  5. Rewrite and test the scripts which
    - generate schemas (*)
    - generate DTDs
    - generate HTML version of the Guidelines
    - generate PDF version of the Guidelines
Clear up the details of higher-level class system
Revise (part 2) the TSD tagdoc
Convert (part 2) the Guidelines to conform to that schema

The items marked (*) have been completed.

Work plan beyond ODD: towards P5

The following tasks need to be completed in order to create P5:

Make corrections of known errors
Assess all the attribute datatypes and decide whether:
- A new datatype should be created (when more than 2 or 3 attributes have the same pattern)
- An attribute which is now simple text should be reconsidered as a tokenized attribute
- Extra facets should be added to further refine datatypes
Assess elements to see whether those with plain text bodies can be datatyped
Consider all element content models to decide whether they are too restrictive or too loose; consider whether some of the simplifying facilities available in RelaxNG (eg <interleave>) should be used.

Work on ODD markup

<attList>

We currently have a structure called <attList> which contains one or more <attDef> elements, which have <datatype> and <default> children. The <default> holds both things like ‘#IMPLIED’ and ‘m’, while <datatype> has a mixture of ‘CDATA’, ‘63’ and ‘%ISO-date;’. It is suggested that

<attDef> has a boolean attribute ‘required’
the <default> element should only be used to hold genuine default strings or tokens. It will be optional. Some notation will be needed to encompass ‘%INHERITED;’
<datatype> has a mandatory ‘target’ attribute, which points to an <entDoc> defining the datatype. This gives us an extra abstract layer over XML schema datatypes. Most token choice attributes would be boiled down to genuine datatypes, so all of ‘Y|N’, ‘yes | no’ and ‘true|false’ would be <datatype target="BOOLEAN"> . In the <entDoc> , we expound on this and map to the relevant W3C Schema datatype (see section Datatyping in attributes). Where the choice is limited, eg ‘A | B’, it is recorded as a set of enumerated values, defined in the body of the <datatype> :
<datatype target="TOKEN"> <rng:choice> <value>A</value> <value>B</value> </rng:choice> </datatype>
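The proposal above can be illustrated with a minimal sketch, under stated assumptions, of how a script might turn a revised <attDef> into a Relax NG attribute pattern. The ‘name’ attribute on <attDef>, the function names, and the idea that each <entDoc> datatype becomes a named Relax NG pattern referenced with <rng:ref> are assumptions made for illustration; they are not part of the agreed ODD markup.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"


def rng(tag: str) -> str:
    """Qualify a tag name with the Relax NG namespace."""
    return f"{{{RNG_NS}}}{tag}"


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def att_def_to_rng(att_def: ET.Element) -> ET.Element:
    """Build a Relax NG pattern for one <attDef> (hypothetical layout)."""
    # Assumption: the attribute's name is held in a 'name' attribute.
    attribute = ET.Element(rng("attribute"), {"name": att_def.get("name")})
    datatype = att_def.find("datatype")
    # Enumerated values, if any, are listed in the body of <datatype>.
    values = [el.text for el in datatype.iter() if local_name(el.tag) == "value"]
    if values:
        choice = ET.SubElement(attribute, rng("choice"))
        for value in values:
            ET.SubElement(choice, rng("value")).text = value
    else:
        # Assumption: every <entDoc> datatype is available as a named pattern.
        ET.SubElement(attribute, rng("ref"), {"name": datatype.get("target")})
    # The proposed boolean 'required' attribute decides optionality.
    if att_def.get("required") == "true":
        return attribute
    optional = ET.Element(rng("optional"))
    optional.append(attribute)
    return optional


SAMPLE = f"""
<attDef name="status" required="false" xmlns:rng="{RNG_NS}">
  <datatype target="TOKEN">
    <rng:choice><value>A</value><value>B</value></rng:choice>
  </datatype>
</attDef>
"""

pattern = att_def_to_rng(ET.fromstring(SAMPLE))
print(ET.tostring(pattern, encoding="unicode"))
```

An <attDef> with required="true" and an empty <datatype target="BOOLEAN"/> would come out of this sketch as a bare attribute pattern containing a reference to a pattern named BOOLEAN.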

Datatyping in attributes

The task force is asked to use W3C Schema datatypes in the TEI as much as possible.

An analysis of all the current <datatype> values shows that they fall into four categories:

Standard XML datatypes (ID, IDREFS, NMTOKENS, etc)
Abstract datatypes linked to entities in the Guidelines (there are only 2 or 3 of these)
Text with no conditions
Text, but with a fixed set of possibilities

We can deal with the first of these easily; they all map into schema datatypes. The second is simply an indirection. The third will remain as text (but see below). It is suggested that the fourth should be split into:

attributes where the range of possibilities fits a W3C datatype, or it makes sense to at least have a common set of values across the TEI
attributes which really should have token values

However, it is likely that some attributes are mis-classified at present; some of those which are datatyped as free text should be tokenized, and some which are tokenized should be completely free text. It is important to separate out attributes which have completely arbitrary text from those where the text is tokenizable (see section Character encoding in attributes).

It is suggested that the system be rationalized so that all the existing <datatype> entries are replaced by pointers to one of the following datatypes:

Name	Relax NG representation
ANYURI	<rng:data type="anyURI"/>
BOOLEAN	<rng:data type="boolean"/>
DATE	<rng:data type="date"/>
DATETIME	<rng:data type="dateTime"/>
DURATION	<rng:data type="duration"/>
ENTITIES	<rng:data type="ENTITIES"/>
ENTITY	<rng:data type="ENTITY"/>
EXTPTR	<rng:text/>
FLOAT	<rng:data type="float"/>
FORMULA	<rng:text/>
ID	<rng:data type="ID"/>
IDREF	<rng:data type="IDREF"/>
IDREFS	<rng:data type="IDREFS"/>
LANGUAGE	<rng:text/>
NAME	<rng:data type="NCName"/>
NMTOKEN	<rng:data type="NMTOKEN"/>
NMTOKENS	<rng:data type="NMTOKENS"/>
SEX	<rng:choice> <value>m</value> <value>f</value> <value>u</value> <value>x</value> </rng:choice>
TEXT	<rng:text/>
TIME	<rng:data type="time"/>
TOKEN	<rng:empty/>
UBOOLEAN	<rng:choice> <value>true</value> <value>false</value> <value>unknown</value> <value>unspecified</value> </rng:choice>

The first table in Appendix A (Figure 2) lists some current datatype values and how they map to the new scheme. The second (Figure 3) shows 180 attributes which can automatically be given non-text, non-token datatypes.
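As a rough illustration of the rationalized scheme, the sketch below generates a Relax NG grammar fragment containing one named pattern per datatype, for a handful of the names listed above. Emitting the datatypes as <rng:define> patterns, and the function and variable names, are assumptions made for this example; the actual scripts may organize the output differently.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
XSD_DATATYPES = "http://www.w3.org/2001/XMLSchema-datatypes"

# A few rows of the table: TEI datatype name -> W3C Schema datatype name.
SIMPLE = {"BOOLEAN": "boolean", "DATE": "date", "IDREFS": "IDREFS"}

# Datatypes which are represented as a closed list of values.
ENUMERATED = {
    "SEX": ["m", "f", "u", "x"],
    "UBOOLEAN": ["true", "false", "unknown", "unspecified"],
}


def rng(tag: str) -> str:
    """Qualify a tag name with the Relax NG namespace."""
    return f"{{{RNG_NS}}}{tag}"


def build_datatype_grammar() -> ET.Element:
    """Return an <rng:grammar> holding one <rng:define> per datatype."""
    grammar = ET.Element(rng("grammar"), {"datatypeLibrary": XSD_DATATYPES})
    for name, xsd_type in SIMPLE.items():
        define = ET.SubElement(grammar, rng("define"), {"name": name})
        ET.SubElement(define, rng("data"), {"type": xsd_type})
    for name, values in ENUMERATED.items():
        define = ET.SubElement(grammar, rng("define"), {"name": name})
        choice = ET.SubElement(define, rng("choice"))
        for value in values:
            ET.SubElement(choice, rng("value")).text = value
    return grammar


ET.register_namespace("rng", RNG_NS)
print(ET.tostring(build_datatype_grammar(), encoding="unicode"))
```

An attribute definition could then point at a datatype with <rng:ref name="BOOLEAN"/> rather than repeating the underlying W3C Schema type, which keeps the abstract layer in one place.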

Character encoding in attributes

The character encoding workgroup discussed how to deal with attributes which need to use the full range of characters (eg variations, and names). This task force agreed that the correct approach was to support an alternative notation, by which these attributes could optionally be recorded as elements if the TEI user wishes to use some form of character encoding not permitted in TEI attributes. TEI P5 will therefore:

Record which attributes have the extended property of being representable as elements
When making normal DTDs, only support the ‘traditional’ scheme of attributes
Allow for special DTDs (from son-of-pizzachef) which support only the element alternative
When making schemas, support both attribute and element forms

Processing applications (eg XSLT stylesheets) will have to decide whether to support both systems, or only one.
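For an application which chooses to support both systems, a minimal sketch of the lookup might be as follows. It assumes, purely for illustration, that the element alternative is a child element with the same name as the attribute; the task force has not fixed that convention, and the <entry> and ‘key’ names are invented.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_property(element: ET.Element, name: str) -> Optional[str]:
    """Read a property from its attribute form, else from its element form."""
    if name in element.attrib:
        return element.get(name)
    # Assumption: the element alternative is a child named like the attribute.
    child = element.find(name)
    if child is not None:
        return "".join(child.itertext())
    return None


# Hypothetical markup: the same information in attribute and element form.
as_attribute = ET.fromstring('<entry key="Muller"/>')
as_element = ET.fromstring("<entry><key>Müller</key></entry>")

print(get_property(as_attribute, "key"))
print(get_property(as_element, "key"))
```

When both forms are present this sketch prefers the attribute; a real processor would need a stated policy for that case.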

There are over 300 attributes which currently have a text datatype; this includes a good many elements which have a type attribute. The TEI editors will have to decide which of these should be classified as ‘true text’ (see EDW79).

Namespaces and fragment inclusion

The task force is asked to consider how two situations can be catered for:

Using fragments of another markup language in TEI XML
Using fragments of TEI in another markup language

The answer to both of these is XML namespaces. Two vocabularies can be combined if the elements identify their namespace. Using schemas, it is easy to validate a document which goes off into different namespaces at various points; this is demonstrated in a TEI RelaxNG schema which redefines <formula> to have MathML elements as content. However, demonstrating the other way round (fragments of TEI embedded in another XML vocabulary) would require assigning a namespace to the TEI. This could be a single namespace for all TEI elements, or a different one for each tagset. The task force considered that the latter would be an unnecessary complication for users, but that a namespace (perhaps ‘http://www.tei-c.org/P5’) for the TEI would be a good idea. However, there are two major problems with this, which have prevented the task force from implementing it:

All existing TEI documents would be invalid, as they would be in an empty namespace. It would be a fairly small fix for each instance to add a namespace declaration to the root element, but the document would then fail to validate against existing DTDs.
All existing XML processing tools would fail to work with new documents; for instance, XSLT stylesheets which process a current (empty namespace) <TEI.2> would fail to identify the new <TEI.2 xmlns="http://www.tei-c.org/P5"> . It will be possible in XSLT 2.0 to write a stylesheet to work with both old and new TEI documents, but using XSLT 1.0 it will be much harder; all stylesheets will need a large rewrite.
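The second problem is not specific to XSLT; any namespace-aware tool shows the same behaviour. The sketch below uses Python's standard ElementTree library, with the namespace URI floated above (not an agreed value), to show a name-based lookup failing on a namespaced document, and one possible workaround which matches on local names only.

```python
import xml.etree.ElementTree as ET

# The namespace suggested in the text; it has not been formally assigned.
TEI_NS = "http://www.tei-c.org/P5"

old_doc = ET.fromstring("<TEI.2><text><body><p>old</p></body></text></TEI.2>")
new_doc = ET.fromstring(
    f'<TEI.2 xmlns="{TEI_NS}"><text><body><p>new</p></body></text></TEI.2>'
)


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def paragraphs(root: ET.Element) -> list:
    """Collect <p> text regardless of namespace (works for old and new)."""
    return [el.text for el in root.iter() if local_name(el.tag) == "p"]


# A lookup written for empty-namespace documents finds nothing in the new one.
print(len(old_doc.findall(".//p")), len(new_doc.findall(".//p")))

# Matching on the local name handles both kinds of document.
print(paragraphs(old_doc), paragraphs(new_doc))
```

Ignoring namespaces in this way is a stopgap: it would also match a <p> from any other vocabulary embedded in the document, which is exactly the situation namespaces are meant to disambiguate.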

This issue requires further investigation.

A replacement for the Pizza chef

This has not been discussed by the task force, but Sebastian Rahtz has written a paper on the subject for XML Europe 2003. This shows that it is possible to have a simple web application which generates RelaxNG schemas, W3C schemas, and XML DTDs, on demand; the prototype, Roma, works solely with the TEI class system, and provides a better interface to it than the Pizza Chef. There are, however, facilities which Roma does not yet provide:

Adding elements which do not simply follow the class system, but have arbitrary content models and attribute lists. The problem here is how to ask the user to specify the new material without directly writing schema code. It remains to be seen how many requests we will receive for this feature.
Changing or limiting the content model of elements which do not follow the class system fully. The correct answer to this may be to revise the TEI so that all elements do use the class system 100%, but in the short term this is unrealistic. It may be possible to devise an interface for editing content models.
Adding entire classes to the TEI. This is a complex matter, which it is unlikely we can provide in a simple web interface.

Appendix A: Tables

Current datatypes and proposed replacements:

Current	New datatype	(values)
%ISO-date;	DATE
%extPtr;	EXTPTR
%formulaNotations;	FORMULA
Y | N	BOOLEAN
Y | N | U	UBOOLEAN
YES | NO	BOOLEAN
all | one | none	TOKEN	all, one, none
all | some | none	TOKEN	all, some, none
free | unknown | restricted	TOKEN	free, unknown, restricted
light | sound | prop | block	TOKEN	light, sound, prop, block
m | f | u	SEX
m | f | u | x	SEX
none | some | all	TOKEN
silent | tags	TOKEN
y | n | u	UBOOLEAN
yes | no	BOOLEAN
Y | N | I | M | F	TOKEN	Y, N, I, M, F
Y | N | U	UBOOLEAN
Y | N | partial	TOKEN	Y, N, partial
Y | N	BOOLEAN
Y | N	BOOLEAN
a | m | j | s | u	TOKEN	a, m, j, s, u
am | pm | 24hour | descriptive	TOKEN	am, pm, 24hour, descriptive
audio | video	TOKEN	audio, video
closed | semi | open	TOKEN	closed, semi, open
composite | uniform	TOKEN	composite, uniform
data | rend | std | nonstd | unknown	TOKEN	data, rend, std, nonstd, unknown
eq | ne	TOKEN	eq, ne
eq | ne | gt | ge | lt | le	TOKEN	eq, ne, gt, ge, lt, le
eq | ne | lt | le | gt | ge	TOKEN	eq, ne, lt, le, gt, ge
eq | ne | sb | ns	TOKEN	eq, ne, sb, ns
eq | ne | sb | ns | lt | le | gt | ge	TOKEN	eq, ne, sb, ns, lt, le, gt, ge
excl | incl	TOKEN	excl, incl
fiction | fact | mixed | inapplicable	TOKEN	fiction, fact, mixed, inapplicable
high | medium | low | unknown	TOKEN	high, medium, low, unknown
horizontal | vertical	TOKEN	horizontal, vertical
initial | medial | final | unknown | complete	TOKEN	initial, medial, final, unknown, complete
internal | external	TOKEN	internal, external
int | real	TOKEN	int, real
lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other	TOKEN	lexical, punc, lexpunc, digit, space, DL, LD, dia, joiner, other
location-referenced | double-end-point | parallel-segmentation	TOKEN	location-referenced, double-end-point, parallel-segmentation
model | atts | both	TOKEN	model, atts, both
new | update	TOKEN	new, update
none | partial | complete | inapplicable	TOKEN	none, partial, complete, inapplicable
pe | ge	TOKEN	pe, ge
perc | real	TOKEN	perc, real
req | mwa | rec | rwa | opt	TOKEN	req, mwa, rec, rwa, opt
role | list	TOKEN	role, list
root | branches	TOKEN	root, branches
s | w | ws | sw | m | x	TOKEN	s, w, ws, sw, m, x
silent | tags	TOKEN	silent, tags
single | composite | frags | unknown	TOKEN	single, composite, frags, unknown
single | set | bag | list	TOKEN	single, set, bag, list
smooth | latching | overlap | pause	TOKEN	smooth, latching, overlap, pause
tei | iso | national | private | none	TOKEN	tei, iso, national, private, none
tempo | loud | pitch | tension | rhythm | voice	TOKEN	tempo, loud, pitch, tension, rhythm, voice
to | from | both | none	TOKEN	to, from, both, none
unit | set | bag | list	TOKEN	unit, set, bag, list
y | n | unspecified	UBOOLEAN
y | n	BOOLEAN
yes | abb | init	TOKEN	yes, abb, init
yes | no	BOOLEAN
yes | no	BOOLEAN
CDATA	TOKEN
ENTITIES	ENTITIES
ENTITY	ENTITY
ID	ID
IDREF	IDREF
IDREFS	IDREFS
NAME	NAME
NMTOKEN	NMTOKEN
NMTOKENS	NMTOKENS
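The mapping in this table is regular enough to be automated. The sketch below is one possible reading of it rather than the conversion script actually used; the function name and the lookup sets are invented for the example, and token lists are assumed to be separated by a plain ‘|’.

```python
from typing import List, Tuple

# Parameter entities which map to a named datatype.
ENTITY_MAP = {
    "%ISO-date;": "DATE",
    "%extPtr;": "EXTPTR",
    "%formulaNotations;": "FORMULA",
}

# Standard XML attribute types which keep their own name.
XML_TYPES = {"ENTITIES", "ENTITY", "ID", "IDREF", "IDREFS",
             "NAME", "NMTOKEN", "NMTOKENS"}

# Token sets (compared case-insensitively) which collapse to one datatype.
BOOLEAN_SETS = {frozenset({"y", "n"}), frozenset({"yes", "no"})}
UBOOLEAN_SETS = {frozenset({"y", "n", "u"}), frozenset({"y", "n", "unspecified"})}
SEX_SETS = {frozenset({"m", "f", "u"}), frozenset({"m", "f", "u", "x"})}


def classify(declared: str) -> Tuple[str, List[str]]:
    """Map a current declared value to (new datatype, enumerated values)."""
    declared = declared.strip()
    if declared in ENTITY_MAP:
        return ENTITY_MAP[declared], []
    if declared in XML_TYPES:
        return declared, []
    if declared == "CDATA":
        return "TOKEN", []  # as recorded in the table
    tokens = [t.strip() for t in declared.split("|")]
    key = frozenset(t.lower() for t in tokens)
    if key in BOOLEAN_SETS:
        return "BOOLEAN", []
    if key in UBOOLEAN_SETS:
        return "UBOOLEAN", []
    if key in SEX_SETS:
        return "SEX", []
    # Anything else stays a TOKEN with its values enumerated.
    return "TOKEN", tokens


for row in ["Y | N", "y | n | unspecified", "m | f | u | x", "eq | ne", "%ISO-date;"]:
    print(row, "->", classify(row))
```

Comparing token sets rather than strings means that ‘Y | N’ and ‘y | n’ are treated alike, which matches the rows above; whether case differences should survive the conversion is a decision for the editors.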

Attributes with datatypes assigned:

element	attribute	datatype
analysis	ana	typeIDREFS
declarable	default	typeBOOLEAN
declaring	decls	typeIDREFS
dictionaries	location	typeIDREF
dictionaries	mergedin	typeIDREF
dictionaries	opt	typeBOOLEAN
edit	resp	typeIDREF
formPointers	target	typeIDREF
global	id	typeID
global	id	typeID
global	lang	typeIDREF
interpret	inst	typeIDREFS
linking	corresp	typeIDREFS
linking	synch	typeIDREFS
linking	sameAs	typeIDREF
linking	copyOf	typeIDREF
linking	next	typeIDREF
linking	prev	typeIDREF
linking	exclude	typeIDREFS
linking	select	typeIDREFS
pointer	targOrder	typeUBOOLEAN
pointerGroup	domains	typeIDREFS
readings	hand	typeIDREF
TEIform	TEIform	typeNAME
terminology	grpPtr	typeIDREF
terminology	depPtr	typeIDREF
timed	start	typeIDREF
timed	end	typeIDREF
xPointer	doc	typeENTITY
xPointer	from	typeEXTPTR
xPointer	to	typeEXTPTR
abbr	resp	typeIDREF
add	resp	typeIDREF
add	hand	typeIDREF
addSpan	resp	typeIDREF
addSpan	hand	typeIDREF
addSpan	to	typeIDREF
admin	date	typeDATE
alt	targets	typeIDREFS
app	from	typeIDREF
app	to	typeIDREF
arc	from	typeIDREF
arc	to	typeIDREF
att	tei	typeBOOLEAN
birth	date	typeDATE
catRef	target	typeIDREFS
catRef	scheme	typeIDREF
cell	rows	typeNONNEGATIVEINTEGER
cell	cols	typeNONNEGATIVEINTEGER
certainty	target	typeIDREFS
classCode	scheme	typeIDREF
damage	resp	typeIDREF
damage	hand	typeIDREF
date	value	typeDATE
del	resp	typeIDREF
del	hand	typeIDREF
delSpan	resp	typeIDREF
delSpan	hand	typeIDREF
delSpan	to	typeIDREF
distance	exact	typeUBOOLEAN
docDate	value	typeDATE
eLeaf	value	typeIDREF
eTree	value	typeIDREF
event	who	typeIDREF
event	iterated	typeUBOOLEAN
expan	resp	typeIDREF
f	fVal	typeIDREFS
fAlt	mutExcl	typeBOOLEAN
figure	entity	typeENTITY
form	codedCharSet	typeIDREF
form	entityStd	typeENTITIES
form	entityLoc	typeENTITIES
formula	notation	typeFORMULA
fs	feats	typeIDREFS
fsdDecl	fsd	typeENTITY
gap	resp	typeIDREF
gap	hand	typeIDREF
gi	tei	typeBOOLEAN
gloss	target	typeIDREF
graph	order	typeNONNEGATIVEINTEGER
graph	size	typeNONNEGATIVEINTEGER
handShift	new	typeIDREF
handShift	old	typeIDREF
handShift	resp	typeIDREF
iNode	value	typeIDREF
iNode	children	typeIDREFS
iNode	parent	typeIDREF
iNode	ord	typeBOOLEAN
iNode	follow	typeIDREF
iNode	outDegree	typeNONNEGATIVEINTEGER
join	targets	typeIDREFS
keywords	scheme	typeIDREF
kinesic	who	typeIDREF
kinesic	iterated	typeUBOOLEAN
language	iso639	typeLANGUAGE
language	wsd	typeENTITY
leaf	value	typeIDREF
leaf	parent	typeIDREF
leaf	follow	typeIDREF
link	targets	typeIDREFS
move	who	typeIDREFS
move	perf	typeIDREFS
msr	value	typeFLOAT
msr	valueTo	typeFLOAT
nbr	value	typeFLOAT
nbr	valueTo	typeFLOAT
node	value	typeIDREF
node	adjTo	typeIDREFS
node	adjFrom	typeIDREFS
node	adj	typeIDREFS
node	inDegree	typeNONNEGATIVEINTEGER
node	outDegree	typeNONNEGATIVEINTEGER
node	degree	typeNONNEGATIVEINTEGER
note	anchored	typeBOOLEAN
note	target	typeIDREFS
note	targetEnd	typeIDREFS
occupation	scheme	typeIDREF
occupation	code	typeIDREF
pause	who	typeIDREF
person	sex	typeSEX
personGrp	sex	typeSEX
ptr	target	typeIDREFS
q	direct	typeUBOOLEAN
rate	value	typeFLOAT
rate	valueTo	typeFLOAT
ref	target	typeIDREFS
relation	active	typeIDREFS
relation	passive	typeIDREFS
relation	mutual	typeBOOLEAN
respons	target	typeIDREFS
restore	resp	typeIDREF
restore	hand	typeIDREF
root	value	typeIDREF
root	children	typeIDREFS
root	ord	typeBOOLEAN
root	outDegree	typeNONNEGATIVEINTEGER
setting	who	typeIDREFS
shift	who	typeIDREF
socecStatus	scheme	typeIDREF
socecStatus	code	typeIDREF
sound	discrete	typeUBOOLEAN
sp	who	typeIDREFS
span	from	typeIDREF
span	to	typeIDREF
state	length	typeNONNEGATIVEINTEGER
step	length	typeNONNEGATIVEINTEGER
step	from	typeEXTPTR
step	to	typeEXTPTR
supplied	hand	typeIDREF
symbol	terminal	typeBOOLEAN
table	rows	typeNONNEGATIVEINTEGER
table	cols	typeNONNEGATIVEINTEGER
tag	TEI	typeBOOLEAN
tagUsage	occurs	typeNONNEGATIVEINTEGER
tagUsage	ident	typeNONNEGATIVEINTEGER
tagUsage	render	typeIDREF
tech	perf	typeIDREFS
teiHeader	date.created	typeDATE
teiHeader	date.updated	typeDATE
time	value	typeTIME
timeline	origin	typeIDREF
timeRange	from	typeTIME
timeRange	to	typeTIME
tree	arity	typeNONNEGATIVEINTEGER
tree	order	typeNONNEGATIVEINTEGER
triangle	value	typeIDREF
u	who	typeIDREFS
unclear	hand	typeIDREF
vAlt	mutExcl	typeBOOLEAN
vocal	who	typeIDREF
vocal	iterated	typeUBOOLEAN
when	since	typeIDREF
witDetail	target	typeIDREFS
writing	who	typeIDREF
writing	script	typeIDREF
writing	gradual	typeUBOOLEAN
writingSystemDeclaration	date	typeDATE

Last recorded change to this page: 2007-09-16