Building TEI DTDs and Schemas on demand

Sebastian Rahtz, March 2003

The Text Encoding Initiative Guidelines provide generic but detailed recommendations for the mark-up of electronic documents, in particular texts from the literary and linguistic domains. The TEI Guidelines, converted to XML in 2001, are maintained in a high-level markup which mixes element combination and content-model rules with text documentation. A project to convert this to use RelaxNG internally was described at XML Europe 2002. Because the TEI is modular and extensible, it is accompanied by a web application which helps the user define a subset and/or extension of the schema and creates an ad hoc DTD. This paper describes a new version of the program, which will enable users to generate DTDs, RelaxNG schemas, and W3C schemas on demand according to their specification, along with instance documentation.

The application, named Roma (the previous DTD-only incarnation was called Carthage), lets the user choose which TEI modules are needed, and allows them to include or exclude elements individually. It also supports the modification of existing elements, and the definition of new elements, with appropriate changes to TEI model classes. Standard components from other namespaces (SVG and MathML) can be included. Most of this can be done without committing to a particular output format. At the end a flattened schema or DTD is produced, containing only the necessary elements.

Introduction

The Text Encoding Initiative's Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard 1994) provide exhaustive recommendations for the encoding of key features in literary and linguistic textual materials. These recommendations, instantiated by a modular XML-based architecture in which DTD fragments and documentation are combined according to user-specified requirements, are very effective and are widely adopted in digital library, language engineering, and many other projects.

One of the projects of the TEI's Technical Council is to rewrite the Guidelines so that the underlying metalanguage is independent of SGML or XML DTD language, allowing for automatic generation of schemas, DTDs, or any future constraint languages. The first stage of this work resulted in a set of RelaxNG grammar files automatically generated from the Guidelines (available from ). This work was described at XML Europe 2002 (Rahtz 2002), so only a summary is given here, although some of the explanation has to be repeated.

Manipulation of the TEI is possible because the TEI is not maintained as DTD files, but in a literate programming (cf. Knuth 1992) system which documents and describes elements in a largely abstract manner, and describes their interdependence using an independently documented set of element classes. This is probably best demonstrated by an example. The persName element is specified by the following markup:

persName (personal name): contains a proper noun or proper-noun phrase referring to a person, possibly including any or all of the person's forenames, surnames, honorifics, added names, etc.
Attribute type: describes the personal name more fully using an open-ended list of words or phrases which help to indicate the function, e.g. married name, maiden name, pen name, religious name, etc. Datatype: CDATA. Values: any string of characters. Default: #IMPLIED.
...
Content model: %om.RR; ( #PCDATA | %m.personPart; | %m.phrase; | %m.Incl; )*

The key features here are: the general description of the purpose of the element, including examples (in exemplum, the contents of which are omitted here); the list of attributes, specified using name, datatype, default, etc.; the module of the TEI to which persName belongs (ND, i.e. the module covering names and dates); the classes to which this element contributes (DEMOG, NAMES, and DATA); and the content model for the element. The content model is also expressed in terms of classes, using the DTD markup %m.personPart;: any elements which say they are members of the personPart class are allowed here. This information allows a processor to construct a DTD fragment for the element as follows:

[The generated DTD fragment is not preserved in this copy of the text.]

Note here the addition of more attributes, from the classes of which this element is a member.
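The assembly step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ElementSpec structure, the dtd_fragment function, and the attributes attached to each class are hypothetical stand-ins, not the TEI's actual processor or its real class attribute lists.

```python
from dataclasses import dataclass, field


@dataclass
class ElementSpec:
    """Hypothetical, simplified view of one element specification."""
    name: str
    content: str                                      # DTD content model
    attributes: list = field(default_factory=list)    # (name, datatype, default)
    classes: list = field(default_factory=list)       # classes the element joins


# Illustrative only: attributes contributed by each class (not the real TEI lists).
CLASS_ATTRIBUTES = {
    "NAMES": [("key", "CDATA", "#IMPLIED")],
    "DEMOG": [],
    "DATA": [],
}


def dtd_fragment(spec, class_attributes):
    """Return ELEMENT and ATTLIST declarations for one element.

    The element's own attributes come first, followed by the attributes
    inherited from every class of which it is a member.
    """
    attrs = list(spec.attributes)
    for cls in spec.classes:
        attrs.extend(class_attributes.get(cls, []))
    lines = [f"<!ELEMENT {spec.name} {spec.content}>"]
    if attrs:
        lines.append(f"<!ATTLIST {spec.name}")
        lines.extend(f"  {name} {datatype} {default}" for name, datatype, default in attrs)
        lines.append(">")
    return "\n".join(lines)


pers_name = ElementSpec(
    name="persName",
    content="%om.RR; ( #PCDATA | %m.personPart; | %m.phrase; | %m.Incl; )*",
    attributes=[("type", "CDATA", "#IMPLIED")],
    classes=["DEMOG", "NAMES", "DATA"],
)
fragment = dtd_fragment(pers_name, CLASS_ATTRIBUTES)
print(fragment)
```

The point of the sketch is that the attribute list in the output is longer than the one in the element's own specification, because class membership contributes attributes.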

The problem with the system described above is the dependence on explicit DTD content models, which are not amenable to processing using standard XML tools. We therefore replace the elemDecl with the following:

[The RelaxNG replacement for the elemDecl is not preserved in this copy of the text.]

This is much easier to analyze, and is (reasonably!) easy to turn back into DTD markup if needed. A processor can now assemble all the information needed to construct a complete RelaxNG grammar.
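To illustrate why an XML encoding of the content model is easier to process, here is a minimal Python sketch that builds a RelaxNG pattern for the persName content model and then turns it back into DTD notation. It assumes that each TEI class is represented by a RelaxNG pattern with the same name as the DTD parameter entity (m.personPart and so on); that naming, and the helper functions, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
ET.register_namespace("", RNG)


def q(tag):
    """Qualify a RelaxNG element name with its namespace."""
    return f"{{{RNG}}}{tag}"


def content_model(class_names):
    """Build zeroOrMore/choice over text and one ref per class pattern."""
    zero_or_more = ET.Element(q("zeroOrMore"))
    choice = ET.SubElement(zero_or_more, q("choice"))
    ET.SubElement(choice, q("text"))
    for cls in class_names:
        ET.SubElement(choice, q("ref"), name=cls)
    return zero_or_more


def to_dtd(model):
    """Turn the pattern built above back into a DTD content model."""
    parts = []
    for child in model.find(q("choice")):
        if child.tag == q("text"):
            parts.append("#PCDATA")
        else:
            parts.append(f"%{child.get('name')};")
    return "( " + " | ".join(parts) + " )*"


model = content_model(["m.personPart", "m.phrase", "m.Incl"])
print(ET.tostring(model, encoding="unicode"))
print(to_dtd(model))
```

Because the pattern is ordinary XML, standard tools (here ElementTree, but equally XSLT) can walk it, whereas the DTD string would need a purpose-built parser.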

The translation of the TEI Guidelines to use RelaxNG markup to encode content models is fairly stable, and the challenge now is to find ways of making use of the extra power provided by schemas.

From Pizza Joint to Sushi Bar

In addition to the class system for maintaining relationships between elements, the TEI also works on the basis of a set of mutually exclusive basic tag sets (although this statement is not entirely true). The choice is between:

- Prose: this tagset is suitable for most documents most of the time
- Verse: this tagset adds specialist tagging for metrical analysis, rhyme-scheme, etc. to the basic verse markup already included in the core
- Drama: this tagset adds specialist tagging for cast lists, records of first performance, etc. to the basic drama markup already included in the core
- Speech: this tagset replaces the basic structure by one suitable for linguistic analysis of speech acts, etc.
- Dictionaries: this tagset replaces the basic structure with one containing detailed lexicographic features
- Terminology: this tagset replaces the basic structure with one specific to terminological databases

A normal TEI document will start with one of these scenarios, and then add modules from the following list:

- Linking: adds elements for hypertext linking, segmentation, and alignment
- Figures: adds elements for encoding tables, pictures, and formulae
- Analysis: adds elements for interpretation and simple linguistic analyses
- FS: adds elements for feature structure analysis
- Certainty: adds elements for recording uncertainty and responsibility
- Transcription: adds elements for the transcription of primary sources (e.g. manuscripts)
- Textcrit: adds elements for text-critical apparatus
- Names & Dates: adds elements for the detailed tagging of names and dates
- Nets: adds elements for recording the abstract structure of mathematical graphs, networks, and trees
- Corpora: adds specialised elements to the TEI header for use with language corpora

It is important to understand that a user must make some sort of choice; there is no one TEI DTD or schema which is the default.

In addition, the TEI has a clear system for extending the tagset, which again utilises the class system by allowing new elements to be added to classes, and to refer to existing classes. (Adding new classes is a more complex exercise, not for the faint-hearted.)

How does a casual user make sense of this complexity? It requires a good understanding of DTD or Schema languages to manipulate the right parameter entities or pattern definitions, so the TEI offers an interface for building customized views of the system. In the DTD-only release of the TEI, this is done using a web form and a utility called carthago; the job of this program was to compile DTDs, expanding all parameter entity references and removing references to elements which were not available. (Hence the name carthago: it builds a list of elements which are not needed, commenting as it goes haec delenda sunt, or "these must be destroyed", echoing Cato the Elder's repeated admonition to the Roman Senate, Carthago delenda est. Now, I hope, it is clear why the schema-based successor is called roma.)

The web application is known as the TEI Pizza Chef, because it allows the customer to choose what toppings they want for a particular base. However, it has to leave most of the work to the user, by creating a pair of skeleton DTD extension files which the user downloads, edits, and uploads again. Editing these DTD files by hand is error-prone and fairly forbidding, and the approach cannot be used to modify schemas. A revised system has therefore been built which attempts to keep all the knowledge of DTD or schema in the application itself, and simply asks the user to select options on web forms. This is fancifully known as the TEI Sushi Bar, following the model of an endless choice of clean, distinct options continually being presented to the user, instead of a rather oily mound of congealing cheese and tomato. More precisely, the Sushi Bar is a web application running scripts known generically as roma.

Roma starts by asking the user to choose which base tagsets and extra modules they require. There are two interfaces, one verbose (Figure 1) and one for the expert (Figure 2). There are also two important choices to make:

- The user must indicate what sort of output is needed. The choice is between: RelaxNG schema; compiled RelaxNG schema; compact RelaxNG schema; W3C schema; compiled DTD.
- The user must say whether they want to make modifications to the elements in the selected tagset. The choice is between: leave elements as they are; configure elements, including them by default; configure elements, excluding them by default.

In addition, the user can say whether they want to add some new elements.

The choices here affect the next stage. Firstly, if a DTD is requested, the user is allowed to choose some ISO entity sets to include (Figure 3). Secondly, if element configuration is requested, all the elements in the chosen tagsets are listed, with radio buttons which allow each element to be included in the result or excluded (Figure 4). The links in this table are to the documentation of each element on the TEI web site. At this stage, the user can rename elements; the example shown in Figure 5 has figure being renamed to graphic, and figDesc to caption, while table, row, cell, and formula are declared as unwanted. We will see shortly how this is implemented.

Figure 1: Roma stage 1, verbose mode

Figure 2: Roma stage 1, expert mode

Figure 3: Roma stage 2, choosing entity sets

Figure 4: Roma stage 2, expert mode

Figure 5: Roma stage 2, renaming elements

In the second stage of Roma, there is also a set of general options which can be turned on and off:

- Whether date elements should be validated against an ISO date format (Schema only)
- Whether xptr, xref and figure elements should support a url attribute to identify external resources (this is done using entities in traditional TEI)
- Whether the standard extensions of the common subset of the TEI known as TEI Lite should be activated
- Whether the formula element should be redefined to insist on content being expressed as MathML (Schema only)
- Whether the figure element should be redefined to allow a content of SVG (Scalable Vector Graphics) elements (Schema only)

After all these choices are made, the Submit button prompts the user to download the resulting DTD or schema.

The look of the result depends on whether or not a compiled form has been selected. Given a simple set of choices, a RelaxNG grammar could result as follows:

[The example RelaxNG grammar is not preserved in this copy of the text.]
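As a rough indication of the shape of such a grammar, the following Python sketch generates a wrapper from a set of user choices. The include href, the TEI.figures pattern, the INCLUDE reference, and the use of notAllowed are taken from the description in this paper; the TEI.prose pattern name, the wrapper_grammar function, and the overall layout are assumptions, so the output should not be read as Roma's exact result.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
TEI_GRAMMAR = "http://www.tei-c.org/Schemas/RelaxNG/P4X/tei2.dtd.rng"
ET.register_namespace("", RNG)


def q(tag):
    """Qualify a RelaxNG element name with its namespace."""
    return f"{{{RNG}}}{tag}"


def wrapper_grammar(modules, excluded):
    """Build a grammar that includes the full TEI and redefines some patterns."""
    grammar = ET.Element(q("grammar"))
    include = ET.SubElement(grammar, q("include"), href=TEI_GRAMMAR)
    # Switch each chosen module on by redefining its pattern.
    for module in modules:
        define = ET.SubElement(include, q("define"), name=module)
        ET.SubElement(define, q("ref"), name="INCLUDE")
    # Disallow each unwanted element.
    for element in excluded:
        define = ET.SubElement(include, q("define"), name=element)
        ET.SubElement(define, q("notAllowed"))
    return grammar


grammar = wrapper_grammar(
    modules=["TEI.prose", "TEI.figures"],   # TEI.prose is an assumed pattern name
    excluded=["table", "row", "cell", "formula"],
)
print(ET.tostring(grammar, encoding="unicode"))
```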

There are some important points to note here:

- The basic structure is <include href="http://www.tei-c.org/Schemas/RelaxNG/P4X/tei2.dtd.rng"> ... a set of redefinitions of standard TEI patterns ... </include>
- TEI modules are turned on by redefining a pattern, e.g. <define name="TEI.figures"><ref name="INCLUDE"/></define>
- In the same way, individual elements can be disallowed by setting their definition to notAllowed, e.g. <define name="table"><notAllowed/></define>
- Elements are renamed by a redefinition of a pattern

The last point deserves some more explanation. The original definition for figure is like this:

[The original definition of the figure pattern is not preserved in this copy of the text.]

that is, a pattern is defined called figure, which defines an element called figure, with a content model given in the pattern c.figure. By redefining figure as follows:

[The redefinition of the figure pattern is not preserved in this copy of the text.]

we define an element called graphic, which has the same content model as the old figure, and is inside the pattern called figure. This is what other definitions will refer to; so anything which wants to include the figure element will say <ref name="figure"/>, and it will not matter that the actual element is renamed. The original name of the element is preserved by an attribute called TEIform, so it is easy to relate this changed setup to the basic TEI. The renaming feature may be extended in future to allow complete translations of the TEI element names to predefined language sets, allowing the user to simply request "all elements in Spanish, please".
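The renaming mechanism can be reconstructed from the description above: the pattern keeps its name while the element it defines gets a new one. The Python sketch below applies that change to a definition written to match the prose (a pattern called figure, defining an element called figure, with content c.figure). The rename_element helper is hypothetical, and the TEIform attribute is left out because its definition is not available here.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
ET.register_namespace("", RNG)

# Reconstructed from the prose description, not copied from the TEI grammar.
ORIGINAL = """<define xmlns="http://relaxng.org/ns/structure/1.0" name="figure">
  <element name="figure">
    <ref name="c.figure"/>
  </element>
</define>"""


def rename_element(define, new_name):
    """Rename the element inside a pattern, leaving the pattern name alone."""
    element = define.find(f"{{{RNG}}}element")
    element.set("name", new_name)
    return define


define = rename_element(ET.fromstring(ORIGINAL), "graphic")
print(ET.tostring(define, encoding="unicode"))
```

Every other definition continues to refer to the pattern figure, so nothing else in the grammar has to change.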

If a compiled output is requested, then the skeleton DTD or Schema will be put through a flattening process to remove redundant elements and references to external files. This has the advantage that a single file is produced, which considerably aids portability, and the removal of unused elements can make it much smaller.
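The idea behind flattening can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the transform used by Roma: the toy grammar is invented, references to disallowed patterns are only dropped where they are alternatives inside a choice, and the resolution of include elements is not shown.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"

# Invented toy grammar: table has been disallowed, so row and cell become unreachable.
GRAMMAR = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="text"/></start>
  <define name="text"><element name="text"><zeroOrMore><choice>
    <ref name="p"/><ref name="table"/>
  </choice></zeroOrMore></element></define>
  <define name="p"><element name="p"><text/></element></define>
  <define name="table"><notAllowed/></define>
  <define name="row"><element name="row"><ref name="cell"/></element></define>
  <define name="cell"><element name="cell"><text/></element></define>
</grammar>"""


def q(tag):
    """Qualify a RelaxNG element name with its namespace."""
    return f"{{{RNG}}}{tag}"


def flatten(grammar):
    """Drop disallowed patterns and every definition unreachable from start."""
    defines = {d.get("name"): d for d in grammar.findall(q("define"))}
    dead = {name for name, d in defines.items()
            if len(d) == 1 and d.find(q("notAllowed")) is not None}
    # Remove alternatives that point at disallowed patterns.
    for choice in list(grammar.iter(q("choice"))):
        for ref in list(choice.findall(q("ref"))):
            if ref.get("name") in dead:
                choice.remove(ref)
    # Walk the references outwards from the start pattern.
    reachable, todo = set(), [grammar.find(q("start"))]
    while todo:
        node = todo.pop()
        for ref in node.iter(q("ref")):
            name = ref.get("name")
            if name in defines and name not in reachable:
                reachable.add(name)
                todo.append(defines[name])
    for name, define in defines.items():
        if name not in reachable:
            grammar.remove(define)
    return grammar


flat = flatten(ET.fromstring(GRAMMAR))
print(sorted(d.get("name") for d in flat.findall(q("define"))))
```

A production version would also need to handle a choice left empty by the pruning, and references to disallowed patterns that occur outside a choice.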

DTD flattening is performed by the existing carthago application, and schema flattening is performed by an XSLT transform of a RelaxNG grammar. The other outputs (compact RelaxNG and W3C Schema) are produced by calls to James Clark's trang program.
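A call to trang for the two remaining formats might be assembled as in the Python sketch below. The -I and -O options are trang's input and output format switches; the assumption that the program is installed under the name trang, the file names, and the trang_command helper are all illustrative, and the way Roma itself invokes trang is not documented here.

```python
# Output formats this paper attributes to trang:
# "rnc" is the RelaxNG compact syntax, "xsd" is W3C Schema.
TRANG_FORMATS = {"rnc", "xsd"}


def trang_command(source, target, output_format, executable="trang"):
    """Return the argument list for converting a RelaxNG XML grammar."""
    if output_format not in TRANG_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return [executable, "-I", "rng", "-O", output_format, source, target]


# Hypothetical file names; pass the list to subprocess.run to execute it.
command = trang_command("tei-custom.rng", "tei-custom.rnc", "rnc")
print(" ".join(command))
```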

MathML and SVG inclusion are managed by simply including the relevant RelaxNG grammars, each in its own namespace.

Extending the TEI

We have so far seen examples of simply choosing subsets of the TEI, or adding standard new features. What if we want to add some elements? This may be for one of two reasons:

- To add an element which is effectively a clone of an existing element, perhaps with an assumed attribute value, to make the text easier to edit and read. For example, we could mark a set of exercise steps with an existing general-purpose element and a distinguishing attribute value, but it would be friendlier to allow a dedicated element, even though the processing would be identical.
- To add a new element to an existing class. For example, the elements for describing an address do not include anywhere to put a personal URL, so we want to add a new element parallel to postCode and street.

If the user chooses to add elements, they are asked to decide which of these two situations they want to address, and to give the element a name and description. In Figure 6 we show the addition of the homeurl element, in the addrPart class. Of course, this assumes some familiarity with the TEI class system (see the section "TEI classes" for a summary), and the interface is not yet friendly enough for someone completely new to the TEI. The lists of elements and classes are, of course, derived dynamically from the TEI Guidelines.

Figure 6: Creating new elements
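The effect of adding an element to an existing class can be modelled in a few lines of Python. The dictionaries below are hypothetical and deliberately partial (only the two members of addrPart named above, and an assumed address element whose content is drawn from that class); they show the principle, not the real TEI membership.

```python
# Partial, illustrative class membership and content models.
classes = {"addrPart": ["street", "postCode"]}
content_models = {"address": ["addrPart"]}   # assumed: address holds addrPart members


def add_to_class(classes, class_name, element):
    """Register a new element as a member of an existing class."""
    if class_name not in classes:
        raise KeyError(f"unknown class: {class_name}")
    if element not in classes[class_name]:
        classes[class_name].append(element)


def allowed_children(element, content_models, classes):
    """Expand the classes named in a content model into element names."""
    names = []
    for class_name in content_models[element]:
        names.extend(classes[class_name])
    return names


before = allowed_children("address", content_models, classes)
add_to_class(classes, "addrPart", "homeurl")
after = allowed_children("address", content_models, classes)
print(before, after)
```

No content model is edited: every element that refers to the class picks up the new member automatically. In a RelaxNG grammar, an addition of this kind is typically expressed by a further define for the class pattern carrying combine="choice".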

There are three further facilities which Roma does not yet provide:

- Adding elements which do not simply follow the class system, but have arbitrary content models and attribute lists. The problem here is how to ask the user to specify the new material without directly writing schema code. It remains to be seen how many requests we will receive for this feature.
- Changing or limiting the content model of elements which do not follow the class system fully. The correct answer to this may be to revise the TEI so that all elements do use the class system 100%, but in the short term this is unrealistic. It may be possible to devise an interface for editing content models.
- Adding entire classes to the TEI. This is a complex matter, which it is unlikely we can provide in a simple web interface.

TEI classes

Here is a list of the currently defined classes of the TEI system:

- addrPart: groups elements which may constitute a postal or other form of address.
- agent: groups elements which contain names of individuals or corporate bodies.
- analysis: default declaration for class analysis: when the additional tag set for simple analysis is not selected, no attributes are defined for this class.
- analysis: defines a set of attributes for associating specific analyses or interpretations with appropriate portions of a text, which are enabled for all elements when the additional tag set for simple analysis is selected.
- baseStandard: groups elements in a writing system which refer to some public or private standard as part of the basis for the writing system declaration.
- bibl: groups elements containing a bibliographic description.
- biblPart: groups elements which can appear only within bibliographic citation elements.
- binary: elements which express binary values in feature structures.
- boolean: groups elements which express Boolean values in feature structures.
- chunk: groups elements which can occur between, but not within, paragraphs and other chunks.
- common: groups common chunk- and inter-level elements.
- comp.dictionaries: groups those component-level elements which are unique to the base tag set for dictionaries.
- comp.drama: groups those component-level elements which are specific to performance texts.
- comp.spoken: groups those elements which appear at the component level in spoken texts only.
- comp.terminology: groups component-level elements unique to the base tag set for terminological data.
- comp.verse: groups component-level elements unique to the base tag set for verse.
- complexVal: groups elements which express complex feature values in feature structures.
- data: groups phrase-level elements containing names, dates, numbers, measures, and similar data.
- date: groups elements containing a date specification.
- declarable: groups elements which may be independently selected (using the special purpose decls attribute) from a candidate list of declarations within a TEI header.
- declaring: groups elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element.
- demographic: groups elements describing demographic characteristics of the participants in a linguistic interaction.
- dictionaries: default declaration for class dictionaries: when the base tag set for dictionaries is not selected, no attributes are defined for this class.
- dictionaries: defines a set of global attributes available on elements in the base tag set for dictionaries.
- dictionaryParts: groups all elements defined specifically for dictionaries.
- dictionaryTopLevel: groups related parts of a dictionary entry forming a coherent subdivision, for example a particular sense, homonym, etc.
- divbot: groups elements which can occur at the end of a text division; for example, trailer, byline, etc.
- divn: defines a set of attributes common to all elements which behave in the same way as divisions.
- divtop: groups elements which can occur at the start of any division class element.
- dramafront: groups elements which appear at the level of divisions within front or back matter of performance texts only.
- edit: defines a group of attributes common to the phrase-level elements used for simple editorial correction and transcription.
- edit: groups phrase-level elements for simple editorial correction and transcription.
- editIncl: groups empty elements which perform a specifically editorial function, for example by indicating the start of a span of text added, deleted, or missing in a source.
- enjamb: groups elements bearing the enjamb attribute.
- entries: groups the different styles of dictionary entries.
- featureVal: groups elements which express feature values in feature structures.
- fmchunk: groups elements which can occur as direct constituents of front matter, when a full title page is not given.
- formInfo: groups elements allowed within a form element in a dictionary.
- formPointers: groups elements in the dictionary base which point at orthographic or pronunciation forms of the headword.
- fragmentary: groups elements which mark the beginning or ending of a fragmentary manuscript or other witness.
- front: groups elements which appear at the level of divisions within front or back matter.
- global: defines a set of attributes available to all components of the writing system declaration.
- global: defines a set of attributes common to all elements in the TEI encoding scheme.
- gramInfo: groups those elements allowed within a gramGrp element in a dictionary.
- hqinter: groups elements related to highlighting which can appear either within or between chunk-level elements.
- hqphrase: groups phrase-level elements related to highlighting.
- Incl: groups empty elements which may appear at any point within a TEI text.
- inter: groups elements of the intermediate (inter-level) class: these elements can occur both within and between paragraphs or other chunk-level elements.
- interpret: defines the set of attributes common to this group of interpretative elements.
- linking: default declaration for class linking: when the additional tag set for linking is not selected, no attributes are defined for this class.
- linking: defines a set of attributes for hypertext and other linking, which are enabled for all elements when the additional tag set for linking is selected.
- lists: groups all list-like elements.
- loc: groups elements used for purposes of location and reference.
- metadata: groups empty elements which describe the status of other elements, for example by holding groups of links or of abstract interpretations, or by providing indications of certainty etc., and which may appear at any point in a document.
- metrical: defines a set of attributes which certain elements may use to represent metrical information.
- morphInfo: groups elements which provide morphological information within the dictionary tag set.
- names: groups those elements which refer to named persons, places, organizations etc.
- notes: groups all note-like elements.
- personPart: groups those elements which form part of a personal name.
- phrase.verse: groups phrase-level elements which may appear within verse only.
- phrase: groups those elements which can occur at the level of individual words or phrases.
- placePart: groups those elements which form part of a place name.
- pointer: defines a set of attributes used by all elements which point to other elements by means of one or more IDREF values.
- pointerGroup: defines a set of attributes common to all elements which enclose groups of pointer elements.
- readings: defines a set of attributes common to all elements representing variant readings in text critical work.
- refsys: groups milestone-style elements used to represent reference systems.
- seg: groups elements used for arbitrary segmentation.
- sgmlKeywords: groups elements whose content is an SGML or XML identifier or tag of some sort (generic identifier of an element type, name of an attribute, etc.).
- singleVal: groups elements which express single feature values in feature structures.
- stageDirection: groups elements containing specialized stage directions defined in the additional tag set for performance texts.
- temporalExpr: groups component elements of temporal expressions involving dates and time, and defines an additional set of attributes common to them.
- terminology: default declaration for class terminology: when the base tag set for terminological data is not selected, no attributes are defined for this class.
- terminology: defines attributes for all elements in documents which use the base tag set for terminological data.
- terminologyInclusions: groups elements which may be included at any point within a terminology entry.
- terminologyMisc: groups elements which can appear together at various points in terminological entries.
- timed: defines a set of attributes common to those elements which have a duration in time, expressed either absolutely or by reference to an alignment map.
- tpParts: groups those elements which can occur as direct constituents of a title page (docTitle, docAuth, docImprint, epigraph, etc.).
- typed: defines a set of attributes which can be used to classify or subclassify certain elements in any way.
- xPointer: defines a set of attributes used by all those elements which use the TEI extended pointer mechanism to point at locations which have neither an SGML nor an XML ID.

Conclusions

The increasing power provided by schemas, and the stress on modularity, argue in favour of moving towards (conceptually) two-stage validation. In the first phase, the important check is that the document uses the right vocabulary, in our case meaning the 441 elements currently described by the TEI. The structure here can be quite loose. In the second phase, which can depend on individual projects, validation can be a lot more precise, with detailed datatyping and inter-dependency validation. For example, the basic rule may say that a text must have an author, title and date, but be agnostic about their order. A particular project may wish to enforce a rule that they must occur in a fixed order; or it may wish to be more limited than the base schema, and say that date is not permitted at all. Thus a typical document may be checked once to ensure that it uses TEI vocabulary and broad grammatical structure, and then checked again to make sure it talks the right dialect.
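The two phases can be made concrete with a toy Python sketch. It is only an illustration of the idea: in practice both checks would be carried out by a schema validator, the vocabulary here is a five-element stand-in for the TEI's 441, and the sample document and function names are invented.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the full TEI vocabulary.
VOCABULARY = {"text", "author", "title", "date", "p"}
REQUIRED = ("author", "title", "date")


def stage_one(doc, vocabulary, required):
    """Loose check: known vocabulary only, required children present in any order."""
    unknown = sorted({el.tag for el in doc.iter()} - vocabulary)
    missing = sorted(set(required) - {child.tag for child in doc})
    return unknown, missing


def stage_two(doc, order):
    """Project rule: the listed children must occur exactly in this order."""
    found = [child.tag for child in doc if child.tag in order]
    return found == list(order)


doc = ET.fromstring(
    "<text><title>Roma</title><author>Sebastian Rahtz</author>"
    "<date>2003</date><p>Body text.</p></text>"
)
loose = stage_one(doc, VOCABULARY, REQUIRED)
strict_author_first = stage_two(doc, ("author", "title", "date"))
strict_title_first = stage_two(doc, ("title", "author", "date"))
print(loose, strict_author_first, strict_title_first)
```

The same document passes the loose first check, fails a project that insists on author before title, and passes one that insists on title before author.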

The relevance of this work is that it shows a way forward for XML users which does not involve low-level interaction with DTDs or Schemas. Unlike the graphical direct-manipulation tools in, for example, XML Spy, the Roma tool works at the level of the TEI class system. Together with the support for other namespaces via schemas, these tools take the TEI one step further on the road to a universal markup language.

Notes and Acknowledgements

This work was carried out as part of the technical work programme of the Metalanguage Taskforce of the TEI Council in 2003. It is still experimental and does not form a formal part of the TEI.

I am grateful to Norm Walsh and Lou Burnard, and the other members of the Taskforce, for stimulating discussion on this and related subjects; I was also delighted to discover Daniel Veillard's work on a new RelaxNG validator (now part of libxml2) while I was writing this paper, and to have the chance of contributing towards debugging the software with TEI examples.

Bibliography

- Sebastian Rahtz, Converting to schema: the TEI and RelaxNG, paper presented at XML Europe 2002, Barcelona, May 2002.
- Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing, Guidelines for Electronic Text Encoding and Interchange (TEI P3), ed. C. M. Sperberg-McQueen and Lou Burnard, Chicago, Oxford: Text Encoding Initiative, 1994.
- N. Walsh and L. Muellner, DocBook: The Definitive Guide, O'Reilly, Sebastopol, CA, USA, 1999.
- Donald E. Knuth, Literate Programming, Stanford University Center for the Study of Language and Information (CSLI Lecture Notes Number 27), Stanford, CA, USA, 1992.
- C. M. Sperberg-McQueen and Lou Burnard, The Design of the TEI Encoding Scheme, in N. Ide and J. Veronis, eds., The Text Encoding Initiative: Background and Contexts, special triple issue of Computers and the Humanities, 29:1, 1995, 17-39.