MI W 03: Practical Guide to Migration of TEI Documents from SGML to XML


Introduction

This report contains practical recommendations for migrating TEI data from P3 SGML to P4 XML. It provides instructions for performing the conversions described in TEI MIW02: Strategic Considerations in Migration of TEI Documents from SGML to XML, and is written for the programmers or technical staff who will perform the actual conversion. We suggest that non-technical readers should read MIW02 first, so as to better understand the issues involved in the migration. The workflows described in this document are general and may be applied to all TEI projects, so readers interested in specific examples of scripts, batch conversion tools, and the like should consult the software tools folder and the migration case studies that comprise TEI MIW06: Migration Case Study Reports.

As described in the Areas of Migration section of MIW02, a data migration involves several distinct steps: converting document instances from SGML to XML, obtaining an XML DTD and converting any DTD extensions, and modifying the processing environment (including catalog files and applications such as parsers and editors) to accommodate XML. Because instance conversion is often the most substantial part of the migration process, the bulk of this report discusses that topic. Specifically, the first section presents a recommended workflow for instance conversion, while the second section discusses conversion tools. The third section discusses conversion of SDATA entities, which can be one of the trickier aspects of instance conversion. The final section of this report provides a tutorial in converting DTD extensions to XML.

Migration workflow

This section discusses recommended procedures for migrating your data from SGML to XML. It focuses mainly on a schematic workflow for individual document instances but also briefly addresses some other considerations related to your processing environment and DTDs.

Instance conversion workflow

There are four distinct steps for migrating document instances from SGML to XML. First, of course, the documents need to be converted. You will then probably need to normalize the case in your tags (i.e., be sure that the tags use proper capitalization), since XML is case-sensitive and your SGML may not be. You may also wish to format the files to make them easy to read. Finally, we strongly suggest that you develop procedures for checking your results for any unexpected bugs that may have been introduced during the migration process. These steps are discussed below.

SGML to XML conversion

There are various tools available for converting SGML documents to XML, some of which are discussed below. We currently recommend osx, since it is the best available tool. However, osx does not preserve non-significant whitespace in its output, so the resulting files may be difficult to read; you may wish to run them through a "pretty printing" program to format your XML files.

Correcting case and formatting your output

SGML is not necessarily case sensitive: the SGML declaration sets NAMECASE GENERAL to either YES or NO. If it is set to YES, then generic identifiers (i.e., element names), attribute names, and attribute values (that are tokens) do not need to follow the same case usage as the DTD, so <TEIHEADER>, <teiHeader>, or even <teIHEAder> will parse correctly. If it is set to NO, they will be case-sensitive and the parser will complain about incorrect capitalization. Similarly, the declaration sets NAMECASE ENTITY to either YES or NO, specifying case-sensitivity of entity names. The TEI SGML declaration set NAMECASE GENERAL to YES and NAMECASE ENTITY to NO, so unless a project specifically changed these settings, entity names will be in the correct case, but generic identifiers, attribute names, and attribute value tokens need not be.

XML, on the other hand, is case sensitive. Element names, attribute names, and attribute values that are not CDATA, NMTOKEN, or NMTOKENS must always have correct capitalization. If your SGML documents did not follow appropriate case usage or your XML conversion software did not preserve the correct case, the resulting XML documents will not be valid.

A normal but often unexpected side effect of instance conversion is the inclusion of default attributes from the DTD in the XML output. For instance, every TEI element has a TEIform attribute, for which the default value is the same as the generic identifier of the element. This attribute rarely needs to be specified in a typical TEI document. However, many conversion tools, including osx, cannot distinguish between defaulted attributes from the DTD and those explicitly defined in the document instance, and will include both in the XML output. Thus <P>Hello World!</P> will become <p TEIform="p">Hello World!</p>. The inclusion of default attributes will not affect validation or XML processing of the files, but it can significantly (and unnecessarily) increase their size, and make the files harder for humans to read.

You can use an XSLT stylesheet, such as tei2tei.xsl, to normalize case and clean up your XML files. It will convert TEI element and attribute names into their proper XML mixed case, remove default attributes such as TEIform, and format the output nicely. If you use osx for your conversion, tei2tei.xsl will probably prove useful (please see the discussion of post-processing tools, below, for more information). Otherwise, you can write your own stylesheet to correct any problems.
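To give a rough idea of what such a clean-up involves, here is a minimal Python sketch. It is not a substitute for tei2tei.xsl. The lookup table CANONICAL_NAMES is a small hypothetical excerpt that a project would have to fill in from its own DTD, and the function names are ours. Note also that Python's ElementTree parser rejects undeclared entity references, so this approach assumes that entities were expanded during conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a lookup table: lowercased name -> TEI mixed-case name.
CANONICAL_NAMES = {
    "tei.2": "TEI.2",
    "teiheader": "teiHeader",
    "filedesc": "fileDesc",
    "p": "p",
    "rend": "rend",
    "teiform": "TEIform",
}


def canonical(name):
    """Return the mixed-case form of a name, or the name unchanged if unknown."""
    return CANONICAL_NAMES.get(name.lower(), name)


def clean_tree(root):
    """Normalize name case and drop TEIform attributes that merely repeat the tag."""
    for elem in root.iter():
        elem.tag = canonical(elem.tag)
        cleaned = {}
        for attr, value in elem.attrib.items():
            attr = canonical(attr)
            # A TEIform equal to the element name is the DTD default; skip it.
            if attr == "TEIform" and value.lower() == elem.tag.lower():
                continue
            cleaned[attr] = value
        elem.attrib.clear()
        elem.attrib.update(cleaned)
    return root


root = clean_tree(ET.fromstring(
    '<TEI.2><P TEIFORM="P" REND="bold">Hello World!</P></TEI.2>'))
```

In this sketch a TEIform value that differs from the element name (as it would on a renamed element) is kept, since the code cannot tell whether it was supplied by the DTD or by the encoder.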

Checking your results

It is always a good idea to run a bug check after any data conversion. Certain basic syntax features will be checked by XML validation, but it is likely worthwhile to perform at least spot-checks on random pieces of your output. After ensuring that the output is well-formed XML and then valid TEI, it may be informative to compare the ESIS output from a parse of the original SGML with the ESIS output from a parse of the new XML. You may also want to run the output through XSLT stylesheets that verify certain types of data, format the XML so that the editors can check the output, or even generate HTML pages from the XML.
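The ESIS comparison can be partly automated. The following Python sketch assumes that both ESIS streams have already been produced by a parser such as nsgmls or onsgmls and saved as lists of lines, and that they follow the usual conventions of that output (attribute lines starting with "A" and preceding the "(" line of their element). It ignores case differences in names, unspecified attributes, and the defaulted TEIform attribute; the function names are our own.

```python
import difflib


def normalize_esis(lines):
    """Reduce an ESIS stream to a case-insensitive, comparable form."""
    result = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        command, rest = line[0], line[1:]
        if command in "()":
            # Element names: ignore case differences between the two parses.
            result.append(command + rest.lower())
        elif command == "A":
            name, _, remainder = rest.partition(" ")
            # Skip unspecified attributes and the defaulted TEIform.
            if remainder.startswith("IMPLIED") or name.lower() == "teiform":
                continue
            result.append("A" + name.lower() + " " + remainder)
        elif command == "-":
            result.append(line)
        # Other record types (C, ?, s, ...) are ignored in this sketch.
    return result


def esis_differences(sgml_esis, xml_esis):
    """Return unified-diff lines between two normalized ESIS streams."""
    return list(difflib.unified_diff(
        normalize_esis(sgml_esis), normalize_esis(xml_esis),
        fromfile="sgml", tofile="xml", lineterm=""))
```

An empty result suggests that the element structure and character data survived the conversion; any lines returned point to places worth inspecting by hand. Differences in how whitespace is reported may produce false alarms.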

Batch scripts

You may want to use a batch script to automate these conversion steps, especially if you are working with a large group of documents. For example, you might write a Perl or shell script that backs up a directory of SGML files, then runs each file through an osx conversion, pipes the results of that process through an XSLT processor using the tei2tei.xsl stylesheet, and validates the output before writing the XML file to a new directory. Depending on your particular platform, software, and documents, you may need to take additional steps to preserve entity references through the XSLT process. While it is impossible to include an all-purpose batch script here, a selection of sample scripts developed by the migration work group members is available on the Tools page.
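As one possible shape for such a script, here is a minimal Python sketch. It assumes that the input files end in .sgml, that tei2tei.xsl is in the working directory, and that xsltproc and xmllint serve as XSLT processor and validator; these choices are illustrative only, and the function names are hypothetical. Unlike the description above, it writes each file before validating it and simply reports the files that fail.

```python
import shutil
import subprocess
from pathlib import Path


def plan_conversion(sgml_file, out_dir, stylesheet="tei2tei.xsl"):
    """Return (raw_xml, final_xml, commands) for converting one SGML file."""
    sgml_file, out_dir = Path(sgml_file), Path(out_dir)
    raw_xml = out_dir / (sgml_file.stem + ".raw.xml")
    final_xml = out_dir / (sgml_file.stem + ".xml")
    commands = [
        # osx writes its XML to standard output; run_plan redirects it to raw_xml.
        ["osx", "--xml-output-option=preserve-case", str(sgml_file)],
        ["xsltproc", "--output", str(final_xml), stylesheet, str(raw_xml)],
        ["xmllint", "--valid", "--noout", str(final_xml)],
    ]
    return raw_xml, final_xml, commands


def run_plan(sgml_dir, backup_dir, out_dir):
    """Back up sgml_dir, then convert, clean up and validate every *.sgml file."""
    shutil.copytree(sgml_dir, backup_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failures = []
    for sgml_file in sorted(Path(sgml_dir).glob("*.sgml")):
        raw_xml, _, (osx_cmd, xslt_cmd, lint_cmd) = plan_conversion(sgml_file, out_dir)
        with open(raw_xml, "wb") as raw:
            # The osx exit status is not checked here; it may be non-zero on warnings.
            subprocess.run(osx_cmd, stdout=raw)
        for cmd in (xslt_cmd, lint_cmd):
            if subprocess.run(cmd).returncode != 0:
                failures.append((sgml_file.name, cmd[0]))
                break
    return failures
```

Keeping the command construction in plan_conversion separate from run_plan makes it easy to print and review the commands before running them on a large collection.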

Processing environment and DTDs

The processing environment refers to the software that you use to manipulate your documents. It includes editing tools, parsers, transformation engines, stylesheets, and catalogs. You and your staff should carefully consider what kind of tools will best suit your immediate migration needs as well as your project's long-term development. Tools are discussed below.

Catalogs

A catalog maps external identifiers to URI references. If you have an SGML catalog, you will need to convert it to XML syntax or write a new XML catalog. We will not address this task in any detail, but there is a detailed discussion of XML catalogs in the specifications from the OASIS Entity Resolution Technical Committee.
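For simple catalogs the conversion is largely mechanical. The following Python sketch handles only PUBLIC and SYSTEM entries whose identifiers are both quoted with double quotes, and silently skips everything else (comments, other keywords), which must then be reviewed by hand. The function name and the sample entry in the usage note are illustrative.

```python
import re
from xml.sax.saxutils import quoteattr

# One catalog entry per line: KEYWORD "identifier" "target"
ENTRY = re.compile(r'^\s*(PUBLIC|SYSTEM)\s+"([^"]*)"\s+"([^"]*)"\s*$')


def sgml_catalog_to_xml(text):
    """Translate simple PUBLIC/SYSTEM entries into an OASIS XML catalog."""
    lines = ['<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">']
    for raw in text.splitlines():
        match = ENTRY.match(raw)
        if match is None:
            continue  # comments and other keywords need manual review
        keyword, identifier, target = match.groups()
        if keyword == "PUBLIC":
            lines.append("  <public publicId=%s uri=%s/>"
                         % (quoteattr(identifier), quoteattr(target)))
        else:
            lines.append("  <system systemId=%s uri=%s/>"
                         % (quoteattr(identifier), quoteattr(target)))
    lines.append("</catalog>")
    return "\n".join(lines)
```

For example, an entry such as PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" becomes a public element carrying the same identifier in publicId and the file name in uri. Remember that the target must now be the XML version of the DTD.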

Character entity references

You may need to take special steps to convert your SGML character entity references to XML, especially if they do not expand to Unicode. Please see Handling SDATA entities in the conversion process, below.

The DTD

You must, of course, have an XML DTD for your new XML documents, so you will need either to migrate your SGML DTD or generate a new XML DTD. If you have been using an unmodified TEI DTD or one generated from Pizza Chef, it is a simple process.

  • If you used the SGML TEI Lite DTD (teilite.dtd), you can simply substitute the XML version (teixlite.dtd). The new DTD should be usable to validate your new XML files with no problem.
  • If you are using a flattened TEI SGML DTD generated from the Pizza Chef and are not using extensions, you can now go back to Pizza Chef and generate an XML-compliant DTD.
  • If you used an SGML DTD with extensions, you will need to migrate the extensions manually and then perhaps regenerate a flat DTD from the Pizza Chef. This is discussed in detail in Migrating TEI DTD extensions to XML, below.
  • If you made custom modifications to your SGML DTD by hand, you will have to redo those modifications in an XML DTD. You would be better off using the migration as a chance to review the modifications and determine whether or not they are still necessary. If you decide to keep them, you will want to recreate them using the DTD modification procedures outlined in chapter 29 of the TEI Guidelines, Modifying and Customizing the TEI DTD.

Instance Conversion: Tools

By far the most widely accepted tool in SGML to XML conversion is osx, based on James Clark's sx. 1 Besides being generally accepted, osx is free, readily available, open source software. Therefore, this section will concentrate on osx. Other tools are addressed later, but particular conversion issues using them are not included.

osx

General Information

James Clark originally wrote sx in C++ as a command-line tool and part of his SP package. His version is still available at http://www.jclark.com/sp/index.htm . However, he no longer actively maintains SP; OpenSP, the current recommended version, is maintained as part of the SourceForge OpenJade project. In this distribution, sx is called osx. This task force recommends at least version 1.5.1 of osx.

The osx tool converts only instances, not DTDs. Note that in some cases, such as when notations are declared, osx will generate an internal DTD subset in its output.

Comments

By default, osx does not preserve comments. However, there is an --xml-output-option=comment switch which does preserve comments.

Prolog

The osx tool will output an XML declaration (i.e., <?xml version="1.0" encoding="UTF-8"?>) if your encoding is UTF-8; other output encodings may be requested. When circumstances require the inclusion of an internal subset (on account of entity declarations, notation declarations, etc.), osx will include that as well, with the appropriate root element name, and, by default, no SYSTEM or PUBLIC identifier specified. The user may specify a --dtd_location=dtd-file option, in which case osx will use SYSTEM dtd-file as the external identifier in the DOCTYPE declaration, if one is being output.

Entity preservation

By default, osx resolves all entities (internal and external) and includes the processed result in its output. If the SGML input file includes references to many external entities, the default result will nevertheless be a single output file. The user may request that the file structure be preserved by specifying --xml-output-option=no-expand-external, in which case all included files will be converted and the appropriate entities will be declared and referenced. There is a corresponding --xml-output-option=no-expand-internal to request the preservation of internal entities and their declarations.

SDATA entities

XML does not support SDATA entities, and there is no widely accepted means of expressing them in XML. When asked to preserve internal entities, by default osx treats SDATA entities as general internal entities (i.e., simply replaces their declarations with the equivalent declaration of a general internal entity and preserves all references to them as in the original). The --xml-output-option=sdata-as-pis switch requests that osx instead replace their definitions with a general internal entity, the content of which is a processing instruction (<?sdataEntity entityName entityReplacementText ?>). If osx is expanding internal entities (replacing references to internal entities with the entity replacement text) and SDATA entities are referenced inside attribute values (where markup is not allowed), requesting that an SDATA entity be replaced with a processing instruction may result in output which is not well-formed, so this option should be used with caution. For more information about the issues involved in converting SDATA entities, see Handling SDATA entities in the conversion process, below.

Casing

By default, osx will convert all element names to uppercase. To avoid this incorrect behavior, the --xml-output-option=preserve-case option (available only in versions 1.5.1 and later) should always be specified.
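The options described in this section can be assembled programmatically. The Python sketch below uses only the option names given above; the function and parameter names are our own. As a deliberately conservative policy of this sketch (not a restriction imposed by osx), it refuses to combine sdata-as-pis with internal entity expansion, since that combination may yield output that is not well-formed.

```python
def osx_options(preserve_comments=False, keep_external=False,
                keep_internal=False, sdata_as_pis=False):
    """Return a list of osx command-line options for the requested behavior."""
    if sdata_as_pis and not keep_internal:
        # Expanded SDATA references inside attribute values would become
        # processing instructions there, which is not well-formed XML.
        raise ValueError("sdata-as-pis requested while internal entities are expanded")
    names = ["preserve-case"]  # recommended in every case
    if preserve_comments:
        names.append("comment")
    if keep_external:
        names.append("no-expand-external")
    if keep_internal:
        names.append("no-expand-internal")
    if sdata_as_pis:
        names.append("sdata-as-pis")
    return ["--xml-output-option=" + name for name in names]
```

The returned list can be placed between the program name and the input file name when building a command line, for example with Python's subprocess module.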

Post-processing

The osx tool will not preserve non-significant whitespace (e.g., space between elements) in its output. If you want your converted files to have neatly wrapped lines and indented elements for human readability, you should use another tool to reformat osx's output. Many XML editing tools will do this.

The widely used open source HTML Tidy application can wrap lines and indent elements. It accepts an --input-xml yes switch to specify that it should treat the input document as XML instead of HTML. Be warned that it will escape all entity references except the basic XML ones (amp, lt, gt, apos, quot), so "&foo;" will become "&amp;foo;". However, unlike xmllint, which is described below, Tidy will not break when it encounters an entity reference for which it has no definition. This allows you to run XML documents through Tidy without an accompanying DTD. An example command line asking Tidy to indent an XML document might be 2 :
tidy -indent --input-xml yes source.xml > out.xml
Another option is xmllint, which is part of Gnome's open source libxml and therefore pre-installed on all systems with Gnome. It has fewer formatting options than HTML Tidy, but it will not escape your entity references, and it understands XML catalogs to boot. Note that, unlike Tidy, xmllint will break when encountering a reference to an entity for which it does not have a definition, as will happen for some well-formed XML documents that are parsed without an accompanying DTD. An example command line asking xmllint to indent an XML document might be:
xmllint --format source.xml > out.xml

Sebastian Rahtz's tei2tei.xsl stylesheet will provide some other post-processing (case normalization and removal of attributes which have default values). 3 It will indent the output nicely, but it will also resolve entity references, which may not be a desired behavior. You could use Tidy to escape entity references first and then tei2tei.xsl to create output with entity references in their original state. Some projects may wish to develop their own stylesheets for post-processing. Note that not all systems will be able to apply an XSLT stylesheet to very large input documents, due to memory limitations.
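If you prefer to handle the escaping yourself, the idea can be sketched in a few lines of Python. The sketch hides entity references from the XSLT processor by escaping the ampersand and restores them afterwards. It is an assumption of this approach that the stylesheet passes text through unchanged, and that the original text contains no literal sequences such as "&amp;foo;", which the restoring step would wrongly turn into references. The function names are hypothetical.

```python
import re

PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")
ESCAPED_REF = re.compile(r"&amp;([A-Za-z_][\w.\-]*);")


def protect_entity_refs(xml_text):
    """Turn &foo; into &amp;foo; so an XSLT processor treats it as plain text."""
    def escape(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # the five basic XML entities stay as they are
        return "&amp;" + name + ";"
    return ENTITY_REF.sub(escape, xml_text)


def restore_entity_refs(xml_text):
    """Turn &amp;foo; back into &foo; after the transformation."""
    def unescape(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # e.g. keep a literal &amp;lt; untouched
        return "&" + name + ";"
    return ESCAPED_REF.sub(unescape, xml_text)
```

Numeric character references are left alone by both functions, because the patterns only match names.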

Other tools

Arbortext's Epic

Arbortext's Epic SGML editor will perform XML conversion, though the conversions cannot be batched and must be run by hand inside the editor.

n2x

n2x is an open source SGML to XML conversion tool written in Python. Instead of accepting SGML input, it expects the output of nsgmls (a James Clark SP tool — the same parser used by sx) or onsgmls (part of the OpenSP distribution), which is an ESIS stream. It converts that stream into XML. It reads an sdata.py file to map SDATA entities into hexadecimal Unicode characters, a useful alternative to osx's policy of resolving them as dictated by the DTD, which is often not what the user wants. As with osx, there is a runtime option not to resolve SDATA entities but instead preserve them as references. n2x is missing options that are available in osx, so while it is not a substitute for osx it may be a good alternative for some situations.

XMetaL

XMetaL, an XML editor, will convert individual SGML instances to XML, though there are some issues with the process. This tool is best used for smaller-scale conversion projects. For more information, go to XMetaL's Help menu, choose the "Contents" tab, then click "Working with files", "Saving a document", and "Switching between XML and SGML." XMetaL works best with TEI Lite or a DTD generated by the Pizza Chef; users have reported problems with other DTDs. By default, XMetaL will output all tags in upper case. You can modify the SGML declaration to prevent this or you can massage the output with tei2tei.xsl, which will normalize case. Unlike many conversion tools, XMetaL will not add defaulted attributes and it does preserve all entity references. You will need to edit the XML output document to remove the SGML internal subset.

Handling SDATA entities in the conversion process

General remarks

SDATA entities are "special entity references" which were available in SGML but do not exist in XML. This section will give some simple recommendations for handling them in the migration from SGML to XML.

In the SGML world, SDATA entities have been used mainly to provide a handle to characters that were not available in the coded character set used by a document instance. Accordingly, this use will be discussed in some detail and other cases will be briefly mentioned.

Public entity reference sets

A number of public entity reference sets have been published by the ISO as an informative appendix to the SGML standard. In these sets, each entity declaration usually takes a form similar to this:
<!ENTITY amacron SDATA "[amacron]">
This provides the parser with a string that is algorithmically derived from the entity name. Many SGML applications take this kind of string and map it to the information that the application needs to handle such a character.

In XML the document character set is specified as Unicode, which currently includes almost 100,000 characters. John Cowan prepared a list of SGML public entity mappings to Unicode, which is available via ftp at the Unicode website and from OASIS. For entity references in this list the conversion process is straightforward, but other cases may require more tweaking.

Conversion of SDATA entities representing characters that exist in Unicode

This is the simplest case. Usually, it will require replacing the value of the SDATA entity replacement with the appropriate Unicode value. The exact method for this will depend on the tool used for the conversion. If the conversion is done with osx, you can edit the intEntities.dtf file, which is auto-generated by the conversion process. The value of the Unicode code point could be expressed either as an encoded value, if the platform and editing environment supports it, or as a numeric character reference (NCR). The numeric value of an NCR can either be given as a decimal value, in which case it takes the form &#257;, or it can be given in hexadecimal notation as &#x101;. Since the latter is the value that is usually found in code charts (including the file mentioned above), it is in many cases the preferred one. The character in the example above would thus be represented as:
<!ENTITY amacron "&#x101;">
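When many declarations have to be rewritten in this way, the substitution can be scripted. The Python sketch below rewrites SDATA declarations of the simple form shown above into declarations with hexadecimal NCRs. The CODE_POINTS table is a one-entry hypothetical excerpt that would in practice be filled from a mapping list; declarations that contain embedded comments, or entities that are missing from the table, are left untouched for manual review.

```python
import re

# Hypothetical excerpt of an entity-name -> code point table; only the
# amacron value is taken from the example in the text.
CODE_POINTS = {"amacron": 0x101}

SDATA_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+SDATA\s+"[^"]*"\s*>')


def convert_sdata_declarations(dtd_text, code_points=CODE_POINTS):
    """Replace SDATA entity declarations with hexadecimal NCR declarations."""
    def replace(match):
        name = match.group(1)
        if name not in code_points:
            return match.group(0)  # leave unmapped entities for manual review
        return '<!ENTITY %s "&#x%X;">' % (name, code_points[name])
    return SDATA_DECL.sub(replace, dtd_text)
```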

Conversion of SDATA entities representing characters that do not exist in Unicode

This is the more general case, which fortunately does not occur too often. A strategy for handling this will first need to investigate whether the character in question is eligible to be added to Unicode. Many characters consisting of an alphabetical base character and some combining diacritical mark will not be added to Unicode, since these characters can be represented by a sequence of Unicode code points. In this case, the sequence of code points representing the character should be used as the value in the entity declaration.
<!ENTITY uumlold "u&#x364;">
The character in this example is an old form of u with umlaut, the umlaut represented as ‘e’ in smaller size above the ‘u’.
Another subcategory of this case is ligatures, which do not usually exist as predefined characters in Unicode. If the fact that characters did appear as ligatures needs to be preserved in the encoded text (a decision presumably made at the outset of an encoding project), the replacement could use the Unicode character &#x200D;, the ZERO WIDTH JOINER like this:
<!ENTITY fjlig "f&#x200d;j">
This produces an fj ligature. Another possibility would be to use markup to represent the fact that these two characters should be rendered as ligatures. The conversion process will be easier to conduct if none of these types of changes to the markup are involved, but a reconsideration of such cases after conversion is recommended.

It should also be noted that a significant number of characters have been added to Unicode since John Cowan's file was created. Specifically, a large number of mathematical symbols have been added, some of which did appear in the ISO public entity sets but did not have a mapping to Unicode when the mapping file was created. A good place to look up characters is the search interface at the Letter database. Another starting point for this type of search is the Unicode Code Charts.

If no Unicode representation can be found for a character, the remaining possibilities for this conversion are:
  1. Assign code points from the private use area (PUA)
  2. Use markup constructs to represent these characters
The TEI Guidelines have recommendations for handling this situation (Chapter 4 and Chapter 25). Unfortunately, the Guidelines are in the process of substantial revision in this area and the mechanism in P5 will likely be different from what is available in P4. An interim solution that will probably be easily adaptable to P5 is to use characters from the PUA. To retain the exchangeability of documents containing such characters, supporting declarations will also have to be made in the writing system declaration (WSD). The ‘qthorn’ character, a combination of the letters ‘q’ and ‘þ’ (found, e.g., in The Accession Speech and Prayer, 1558 by Elizabeth I of England abbreviating the word quoth) is not (as of this writing) available in Unicode. It could be represented as:
<!ENTITY qthorn "&#xE000;">
Note that this is a code point from the PUA in Unicode. The WSD for this document would then need to contain a declaration like this:
<writingSystemDeclaration name="-//TEI P4: 2002//NOTATION WSD for extended English//EN" date="2004-02-02" lang="eng">
  <language iso639="eng">Modern English</language>
  <script>Latin (with diacritics for loan words and some non-standard characters)</script>
  <direction chars="LR" lines="TB"/>
  <characters>
    <baseWsd name="-//TEI P2: 1993//NOTATION WSD for ISO 10646-1//EN" authority="tei"/>
    <exceptions>
      <character class="lexical">
        <form string="" entityLoc="qthorn" ucs-4="E000">
          <desc>a blending of 'q' and 'thorn', typically used for contracting 'quoth'</desc>
        </form>
      </character>
    </exceptions>
  </characters>
</writingSystemDeclaration>
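Since every entity that resolves to a PUA code point needs a supporting declaration of this kind, it can be useful to list such entities automatically. The Python sketch below extracts the numeric character references from an entity's replacement text and reports those that fall in a private use area, using the Unicode general category "Co" from Python's unicodedata module. The function names are our own.

```python
import re
import unicodedata

NCR = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def code_points(replacement_text):
    """Return the code points named by numeric character references in a value."""
    points = []
    for ref in NCR.findall(replacement_text):
        points.append(int(ref[1:], 16) if ref.startswith("x") else int(ref))
    return points


def private_use_points(replacement_text):
    """Return the referenced code points that lie in a private use area."""
    return [cp for cp in code_points(replacement_text)
            if unicodedata.category(chr(cp)) == "Co"]
```

Applied to the replacement texts used in this section, only the value for qthorn is flagged; the combining sequence for uumlold and the joiner in fjlig are ordinary Unicode characters and need no such declaration.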

Handling SDATA entities that do not represent characters

Besides being used to represent characters, SDATA entities have seen a variety of other, less frequent usages. Due to the variety of uses, which range from recording specific information during data capture or data conversion, to highly technical application-specific information, no general solution can be outlined here. In many cases, however, the conversion to general XML internal entities (which osx will do for you if --xml-output-option=no-expand-internal is specified) should be good enough.

Migrating TEI DTD extensions to XML

General remarks

This section is for projects that have modified the TEI DTD and want to migrate these modifications from SGML to XML (i.e., want to use the XML-based P4 DTD with equivalent modifications). We begin with some general remarks, then describe a sample DTD modification that covers the most important issues, then outline a recommended migration procedure and demonstrate the key steps using the example.

If the elements or content models that the TEI provide don't quite meet the requirements of your project, there is an official escape route: you can modify the DTD in a number of well-defined ways and your documents will remain ‘TEI conformant.’ This involves creating two extension files, setting some parameter entities, possibly defining new elements or redefining existing ones, and making these modifications known to the parser in the DTD subset at the beginning of the document.

Although the process is a lot simpler than it looks at first glance, many people have taken unofficial escape routes, especially the users of the TEI Lite DTD, who would have been required to first switch to a full TEI DTD before applying local extensions. It is admittedly simpler to just open your local copy of teilite.dtd and change a few lines. Only later will you find out why the TEI Guidelines strongly discourage this, and one of those moments could be the migration of your customized DTD to XML.

If you are in this situation now, there are three ways to proceed:
  1. Redo your modifications the official way for the P4 DTD, using extension files: find out what is changed in your local copy of the TEI Lite DTD, and create proper extension files for the TEI P4 DTD to the same effect. You will find it is probably easier than you expected, and any future migrations will be easier as well. You'll find useful advice for this process in section 29 of the Guidelines and in the rest of this section. This is the recommended procedure.
  2. Redo your modifications as before: find out what you changed in the SGML TEI Lite DTD, and apply the same changes to your local copy of the XML TEI Lite DTD. We do not advocate this procedure, but it is, of course, a practical possibility.
  3. Take a step back: are those modifications really still needed? Were they designed to work around a bug of TEI P3 and are no longer required for P4? (A list of bugs corrected in the P3:1999 edition is available in appendix C3.2 of the Guidelines.) Were they intended for a feature that was never used?
That being said, the rest of this section shall discuss migrating DTD extensions made using the official procedures. So: what types of TEI extensions exist and what is involved in migrating them from SGML to XML? The TEI P4 Guidelines list four kinds of modification (see section 29.1):
  1. Deletion of elements
  2. Renaming of elements
  3. Extension of classes
  4. Modification of content models or attribute lists
The first three are extremely easy, but the fourth item requires more detailed attention. For practical purposes, it can be subdivided into:
  • redefinition of attribute lists
  • modification of existing content models
  • definition and integration of new elements (i.e., hanging the new elements into the existing tree)

The following is a short list of some critical issues involved. In the following subsections, we will work through a fictitious example that covers most of these issues.

  1. Element and attribute name case is significant in XML.
  2. It is likely that some of the modifications in your existing P3 extension files involved copying (and then probably modifying) pieces of the TEI DTD files. You should check whether those DTD pieces have changed from P3 to P4.
  3. Some people have made modifications to work around problems in the TEI P3 DTD; if they are fixed in P4, the workaround could cause errors (a notorious example is <persName>).
  4. The SGML DTD syntax for element declarations requires two characters of ‘-’ or ‘O’ that indicate whether start and end tags are required or can be omitted. These indicators don't exist anymore in XML DTDs and your private DTD snippets need to be modified.
  5. The content model for XML elements is more restricted than for SGML elements. We won't go into fine detail, but the following two points deserve attention:
    • The only type of character data is PCDATA. You cannot define CDATA content to bypass the parser.
    • The inclusion exception and exclusion exception syntax does not exist in XML DTDs. In SGML, you could declare an element to be legal everywhere within element <X> and its children in a single line by using the inclusion exception syntax. This is not possible in XML. Instead, you have to add <X> to all content models individually.
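The purely syntactic part of points 4 and 5 can be sketched in Python as follows. The function strips the two tag omission indicators from a single SGML element declaration, replaces CDATA or RCDATA content with (#PCDATA), and drops inclusion or exclusion exceptions while returning a warning for each, because the affected content models still have to be edited by hand. The function name and the wording of the warnings are our own, and the patterns only cover declarations of the simple shapes used in this report.

```python
import re

ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+"
    r"(?P<model>.*?)\s*>", re.DOTALL)
EXCEPTION = re.compile(r"\s*[+-]\(([^)]*)\)\s*$")


def sgml_element_to_xml(declaration):
    """Return (xml_declaration, warnings) for one SGML element declaration."""
    match = ELEMENT_DECL.fullmatch(declaration.strip())
    if match is None:
        raise ValueError("not an SGML element declaration with omission indicators")
    model = match.group("model")
    warnings = []
    while True:
        exception = EXCEPTION.search(model)
        if exception is None:
            break
        sign = exception.group(0).lstrip()[0]
        kind = "inclusion" if sign == "+" else "exclusion"
        warnings.append("%s exception (%s) dropped; edit content models by hand"
                        % (kind, exception.group(1)))
        model = model[:exception.start()]
    if model in ("CDATA", "RCDATA"):
        warnings.append("%s content is not available in XML; using (#PCDATA)" % model)
        model = "(#PCDATA)"
    return "<!ELEMENT %s %s>" % (match.group("name"), model), warnings
```

For instance, a declaration such as `<!ELEMENT toDo - - CDATA >` is returned as `<!ELEMENT toDo (#PCDATA)>` together with one warning. The warnings matter more than the rewritten text: a dropped inclusion means the element is no longer allowed anywhere until it has been added to the relevant content models.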

A tutorial example

In this section we will do some simple TEI DTD modifications in SGML. This will then serve as a tutorial example for the migration to XML. While working on this example, the main problems in converting DTD extensions should be covered. Not everyone will need to take all of the steps treated here, and some needs might not be covered, but this should be an easy, hands-on starting point for most projects. 4

Let's assume that five years ago we wanted a TEI P3 DTD for prose that met the following hypothetical requirements:
  1. Personal names shall be marked with the <persName> tag (this requires TEI extensions for names and dates and a workaround for a bug in the P3 DTD).
  2. The <pb> element shall have an extra attribute, ‘imageUrl’, that contains a URL for an image of the page.
  3. There shall be a new element <ps> for the postscripts of letters, containing normal phrase level content.
  4. The elements <div1> and <div2> shall be renamed to <volume> and <letter> because our source material is a collection of letters organized that way, and we want to keep that structure and make it explicit.
  5. An element <toDo> shall be available everywhere in the text for editorial meta-comments on the ongoing encoding. Therefore, the content shall be CDATA to allow easy typing of element names and entities that are talked about in these notes. Content tags and entity references in CDATA are not recognized by the SGML parser.
These requirements can be cast into TEI SGML by creating two files, my_sgml.ent and my_sgml.dtd:
<!-- file "my_sgml.ent" -->
<!-- fix persName problem in TEI P3 -->
<!ENTITY % x.data 'persName |'>
<!-- suppress "pb" so it can be redefined (add attribute "imageUrl") -->
<!ENTITY % pb 'IGNORE'>
<!-- add new element "ps" to class for div-bottom -->
<!ENTITY % x.divbot 'ps |'>
<!-- rename "div1" to "volume" -->
<!ENTITY % n.div1 'volume'>
<!-- rename "div2" to "letter" -->
<!ENTITY % n.div2 'letter'>
<!-- suppress TEI.2 so it can be redefined (inclusion of "toDo") -->
<!ENTITY % TEI.2 'IGNORE'>

<!-- file "my_sgml.dtd" -->
<!-- modified copy of "pb" element: "imageUrl" added -->
<!ELEMENT %n.pb; - O EMPTY >
<!ATTLIST %n.pb;
    id        ID     #IMPLIED
    lang      IDREF  %INHERITED
    rend      CDATA  #IMPLIED
    ed        CDATA  #IMPLIED
    n         CDATA  #IMPLIED
    imageUrl  CDATA  #IMPLIED
    TEIform   CDATA  'pb' >
<!-- new element "ps" -->
<!ELEMENT PS - - (%paraContent) >
<!ATTLIST ps %a.global; >
<!-- new element "toDo" -->
<!ELEMENT toDo - - CDATA >
<!-- redefined TEI.2: added "toDo" as inclusion -->
<!ELEMENT %n.TEI.2; - O (%n.teiHeader;, %n.text;) +(toDo) >
<!ATTLIST %n.TEI.2;
    %a.global;
    TEIform   CDATA  'TEI.2' >
A sample document godot.sgml using these extensions would look like this:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose "INCLUDE" >
  <!ENTITY % TEI.names.dates "INCLUDE" >
  <!ENTITY % TEI.extensions.dtd SYSTEM "my_sgml.dtd">
  <!ENTITY % TEI.extensions.ent SYSTEM "my_sgml.ent">
]>
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Letter from Godot</title>
        <author>John Doe</author>
      </titleStmt>
      <publicationStmt>
        <publisher>published by the TEI</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>paper original lost after encoding</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <volume type="volume" n="1">
        <letter type="letter" n="1">
          <pb n="1" imageUrl="godot.tiff">
          <salute>Dearest <persName>Doug</persName>,</salute>
          <p>I will come really soon now.</p>
          <signed><persName>Godot</persName></signed>
          <ps>PS: <hi>Thanks</hi> for all the fish <toDo>is this &lt;hi&gt; really correct tagging?</toDo> </ps>
        </letter>
        <toDo>Encode next letter</toDo>
      </volume>
    </body>
  </text>
</TEI.2>

This example will be migrated to TEI P4 XML in the next two subsections.

Suggested migration procedure

Although the following step-by-step list may sound over-cautious, we recommend this approach, especially if you are new to the process. You are switching from P3 to P4, from SGML to XML DTDs, from SGML to XML documents, and from SGML to XML parsers all at the same time, and it can be difficult to find your way through the many potential pitfalls.

  1. Pick an interesting test document from your repository (or make up an example, as we did above) and make sure you can parse it as it is (in SGML form) against your current DTD setup (TEI P3 with your extensions).
  2. Set up a parallel SGML parser environment to parse against the TEI P4 DTD. Before you try to parse your sample file, make sure the parser works with very simple standard TEI files.
  3. Now try to parse your sample document against P4 (in SGML mode). Theoretically, this shouldn't be a problem, but you might encounter errors. Create an intermediate version of your SGML extension files to fix any problems (the most likely cause is that your extensions work around a bug in the P3 DTD that was repaired in P4).
  4. Create new XML extension files based on the SGML ones. We will do this for our example in the next section. If you want to continue supporting SGML documents, consider making the extension files ‘dual-use’ (i.e., compatible with XML and SGML, as described below) so you only need to maintain one set. If you had to make changes for the previous step (moving from P3 to P4 SGML), you have to decide whether the SGML part of your dual-use setup will be compatible with P3 or P4. Normally, you should be fine using P4 SGML, so this is the assumption in the remaining text.
  5. If you have created dual-use extension files, use them to parse your SGML document against P4 in SGML mode. Don't proceed until they are correct.
  6. Make sure that your XML parser is set up properly by parsing a minimal TEI XML document without extensions.
  7. Convert the test document to XML. A short fictional sample document will make this step easier and will avoid any extra confusion from errors arising in a large-scale text migration.
  8. Try to parse your converted test XML document. Errors at this stage may be a result of the document conversion or the migrated DTD extensions. Fix them all.
  9. If you have dual-use extensions, go back and try the SGML parse again.

Migrating the example DTD

We will now focus on rewriting the DTD extensions in XML, with the sample DTD modifications described earlier as a base. We will be creating files called my_xml.ent and my_xml.dtd.

Before we start, however, a strategic decision has to be made: Shall we burn bridges and support only XML in the future? The P4 DTD provides mechanisms to parse both XML and SGML and we can do the same for our customized DTD, if we need to support SGML in parallel for the time being. It takes a little more thought and effort, but in return you get the comfort of a safe transition period. The parameter entity TEI.XML gives you an important tool for accomplishing this: it is defined as INCLUDE in XML mode and as IGNORE otherwise, so that you can segregate XML and SGML entity and element definitions. DTD extensions that use this technique are called ‘dual-use’ extensions in this document. Our example demonstrates both dual-use and pure XML extensions. 5

One obvious syntactic difference between SGML and XML DTDs is the ‘omitted tag minimization parameters’ that appear as ‘-’ and ‘O’ in SGML element declarations to indicate whether start and end tags need to be present or not. They are superfluous in XML, where minimization is not allowed, and therefore are not used. The TEI P4 DTD provides and uses parameter entities om.RO and om.RR as a substitute. For SGML parsing, these entities expand to - O and - - respectively, and for XML parsing they expand to nothing. Elements that require only a start tag (mostly empty elements) use om.RO, and elements that require start and end tags use om.RR (non-empty elements should be defined that way). We can make use of this mechanism for our dual-use DTD extensions. 6

So let's go to work:

We suggested above that, before thinking about P4 XML, we should make sure that our document can be parsed with the P4 DTD in SGML mode. When doing this with our example, we get a frightening number of errors like
nsgmls:my_sgml.dtd:14:45:E: content model is ambiguous: when the current token is the 1st occurrence of "JOIN", both the 1st and 2nd occurrences of "PERSNAME" are possible.
Two occurrences of <persName> ? It turns out that the workaround that was necessary to use <persName> with the P3 DTD is no longer needed and causes trouble instead. So we can remove it from the working copy of our modification file my_sgml.ent. 7

Now let's move towards XML: if we first check for consistent case of the element and attribute names in our DTD, we discover that element <ps> was once written in uppercase and once in lowercase. We decide that lowercase shall be the correct spelling.
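For larger extension files, the case check can be automated. The following is a minimal sketch in Python, not part of the original tool set; the function name find_case_conflicts is hypothetical, and the script assumes that declarations name a single element directly (names given as parameter entity references such as %n.pb; and SGML name groups are deliberately ignored).

```python
import re
from collections import defaultdict

# Matches the name in declarations such as <!ELEMENT ps ...> or <!ATTLIST PS ...>.
# Names written as parameter entity references (e.g. %n.pb;) do not match.
DECL_NAME = re.compile(r"<!(?:ELEMENT|ATTLIST)\s+([A-Za-z][\w.\-]*)")

def find_case_conflicts(dtd_text):
    """Return declared names that occur in more than one upper/lowercase spelling."""
    spellings = defaultdict(set)
    for name in DECL_NAME.findall(dtd_text):
        spellings[name.lower()].add(name)
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}

sample = """
<!ELEMENT PS - - (%paraContent) >
<!ATTLIST ps %a.global; >
<!ELEMENT toDo - - CDATA >
"""
print(find_case_conflicts(sample))  # {'ps': ['PS', 'ps']}
```

The check only reports conflicts; deciding which spelling is correct remains a manual decision, and the document instances have to be checked against that decision as well.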

Some things are easy: the renaming of <div1> and <div2> to <volume> and <letter> , and the declaration of the <ps> element can remain untouched, except that the ‘- -’ in the definition of <ps> needs to be either removed or (if we aim at a dual-use DTD) replaced by ‘%om.RR;’.
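If there are many element declarations to adapt, this substitution could be scripted. The sketch below is an illustration under stated assumptions rather than a tested migration tool: the function name convert_minimization is hypothetical, only declarations whose start tag is required (‘- -’ and ‘- O’) are handled, and declarations that use name groups are left alone.

```python
import re

# "- -" corresponds to om.RR (start and end tag required),
# "- O" corresponds to om.RO (only the start tag required).
MINIMIZATION = {"-": "%om.RR;", "O": "%om.RO;"}

# An element declaration, its name, and the two minimization parameters.
ELEMENT_DECL = re.compile(r"(<!ELEMENT\s+\S+)\s+-\s+([-Oo])(?=\s)")

def convert_minimization(dtd_text, dual_use=True):
    """Replace the minimization parameters (dual use) or drop them (XML only)."""
    def replace(match):
        head = match.group(1)
        if dual_use:
            return head + " " + MINIMIZATION[match.group(2).upper()]
        return head
    return ELEMENT_DECL.sub(replace, dtd_text)

print(convert_minimization("<!ELEMENT ps - - (%paraContent;) >"))
print(convert_minimization("<!ELEMENT ps - - (%paraContent;) >", dual_use=False))
```

The first call yields the dual-use form with ‘%om.RR;’, the second the XML-only form with the parameters removed. Other syntax problems in the declaration still have to be corrected by hand.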

The dual-use decision comes up again with the <pb> tag. In XML, we can just write an ATTLIST containing only ‘imageUrl’ and it will be merged with the existing ATTLIST in the TEI DTD files; there is no need to suppress and copy the definition of <pb> . For continuing support of SGML, we have to suppress and redefine the element as before. We find it in the TEI DTD files (teicore2.dtd), copy the P4 definition and modify it, adding our extra attribute.

The most difficult problem is the <toDo> tag. For one, the content model CDATA needs to become #PCDATA, because XML does not support CDATA declared content for elements. This means that existing documents will most probably break, since markup characters inside <toDo> are now parsed, but there is no choice. One solution might be to turn all the <toDo> content into CDATA marked sections with an automated search and replace as part of the document conversion, or to escape the contained markup as entity references using a similar procedure.
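The second option, escaping the contained markup, might be sketched as follows in Python. This is an assumption-laden illustration: the function names are hypothetical, the <toDo> start tag is assumed to carry no attributes, and the content is assumed to be raw, unescaped text (running the script twice would escape the ampersands a second time).

```python
import re

# A toDo element without attributes; DOTALL lets the content span several lines.
TODO_ELEMENT = re.compile(r"(<toDo>)(.*?)(</toDo>)", re.DOTALL)

def escape_markup(text):
    """Replace markup characters with the entity references predefined in XML."""
    # "&" must be handled first so the references added below stay intact.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_todo_content(document):
    """Escape the content of every toDo element, leaving the rest untouched."""
    return TODO_ELEMENT.sub(
        lambda m: m.group(1) + escape_markup(m.group(2)) + m.group(3),
        document,
    )

print(escape_todo_content("<ps>PS <toDo>is this <hi> really correct tagging?</toDo></ps>"))
```

Only the text between the <toDo> tags is changed, so the surrounding markup, such as the enclosing <ps>, is preserved as it stands.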

Also, the simple way of allowing <toDo> everywhere is no longer possible. This could be a good occasion to check how that element is actually used in practice and where it is really needed. A compromise in our example could be to add it to the class ‘Incl’ that is part of every content model within <text> . Sometimes, a more complex redefinition of content models could be necessary. If that is your situation, you may want to consult the unofficial TEI document ED W 69, chapter 8 for in-depth coverage.

The first runs with the XML parser result in many warnings because of redefined parameter entities; this is normal. Some syntax correction is required where XML is stricter than SGML: we forgot a semicolon in a parameter entity reference; ‘%paraContent’ must not be enclosed in parentheses, whereas the ‘#PCDATA’ for <toDo> has to be. When you don't know how to get rid of an error, it can be useful to browse the TEI DTD files and compare them with your own usage.
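Missing semicolons are easy to overlook, because SGML accepts a parameter entity reference that is ended by a space or another non-name character. A small check along the following lines could list them before the first XML parse; it is a sketch only, the function name find_unterminated_references is hypothetical, and a percent sign inside a literal string would be reported as a false positive.

```python
import re

# "%" directly followed by a name that is not closed by ";".
# In a declaration such as <!ENTITY % x.data ...> the "%" is followed
# by whitespace, so declarations are not reported.
UNTERMINATED_PE = re.compile(r"%([A-Za-z][\w.\-]*)(?!;)(?![\w.\-])")

def find_unterminated_references(dtd_text):
    """Return (line number, entity name) for each reference lacking a semicolon."""
    problems = []
    for line_number, line in enumerate(dtd_text.splitlines(), start=1):
        for match in UNTERMINATED_PE.finditer(line):
            problems.append((line_number, match.group(1)))
    return problems

sample = "<!ELEMENT ps - - (%paraContent) >\n<!ATTLIST ps %a.global; >"
print(find_unterminated_references(sample))  # [(1, 'paraContent')]
```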

The cross-check of the dual-use version with the SGML parser exposes one small additional problem: the document now uses the character entities &lt; and &gt;, which are predefined in XML but not in SGML. Once discovered, this is easily fixed. In the example file, you will find a solution that looks a little complicated but works flawlessly with SGML and XML.

The reworked extension files in the XML-only form look like this:
<!-- file "my_xml.ent" -->
<!-- persName fix removed for TEI P4 -->
<!-- add new element "ps" to class for div-bottom -->
<!ENTITY % x.divbot 'ps |' >
<!-- rename "div1" and "div2" to "volume" and "letter" -->
<!ENTITY % n.div1 'volume' >
<!ENTITY % n.div2 'letter' >
<!-- make new element "toDo" available everywhere in "text" -->
<!ENTITY % x.Incl ' toDo |' >

<!-- file "my_xml.dtd" -->
<!-- additional attribute for "pb" element. -->
<!ATTLIST %n.pb; imageUrl CDATA #IMPLIED >
<!-- new element "ps" -->
<!ELEMENT ps %paraContent; >
<!ATTLIST ps %a.global; >
<!-- new element "toDo" -->
<!ELEMENT toDo (#PCDATA) >
In the dual-use form, the DTD extension file comes out a little longer (my_dual.ent is identical to my_xml.ent above):
<!-- file "my_dual.dtd" -->
<!-- modified copy of "pb" element: "imageUrl" added -->
<!ELEMENT %n.pb; %om.RO; EMPTY >
<!ATTLIST %n.pb;
  %a.global;
  ed CDATA #IMPLIED
  imageUrl CDATA #IMPLIED
  TEIform CDATA 'pb' >
<!-- new element "ps" -->
<!ELEMENT ps %om.RR; %paraContent; >
<!ATTLIST ps %a.global; >
<!-- new element "toDo" -->
<!ELEMENT toDo %om.RR; (#PCDATA) >
<!-- entity definitions for SGML, XML has them predefined. -->
<![%TEI.XML;[
  <!ENTITY % TEI.SGML 'IGNORE'>
]]>
<!ENTITY % TEI.SGML 'INCLUDE'>
<![%TEI.SGML;[
  <!ENTITY amp "&#038;" >
  <!ENTITY lt "&#060;" >
  <!ENTITY gt "&#062;" >
]]>
We can easily convert our short test document manually. All that needs to change is the initial XML declaration, the XML-specific parameter entity, the empty-tag syntax for <pb> and the escaping of the content of <toDo> . These steps could serve as models for automated conversion of large documents.
<?xml version="1.0" ?>
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.XML "INCLUDE" >
  <!ENTITY % TEI.prose "INCLUDE" >
  <!ENTITY % TEI.names.dates "INCLUDE" >
  <!ENTITY % TEI.extensions.dtd SYSTEM "my_dual.dtd">
  <!ENTITY % TEI.extensions.ent SYSTEM "my_dual.ent">
]>
... (header omitted) ...
  <text>
    <body>
      <volume n="1">
        <letter n="1">
          <pb n="1" imageUrl="godot.tiff"/>
          <salute>Dearest <persName>Doug</persName>,</salute>
          <p>I will come really soon now.</p>
          <signed><persName>Godot</persName></signed>
          <ps>PS: <hi>Thanks</hi> for all the fish
            <toDo>is this &lt;hi&gt; really correct tagging?</toDo>
          </ps>
        </letter>
        <toDo>Encode next letter</toDo>
      </volume>
    </body>
  </text>
</TEI.2>
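As a model for automating one of these steps, the change to empty-tag syntax might be scripted as in the following Python sketch. It is a simplification under stated assumptions: the function name make_empty_tags is hypothetical, the list of empty elements has to be supplied by you, element names are matched case-sensitively, and attribute values are assumed not to contain ‘<’ or ‘>’.

```python
import re

def make_empty_tags(document, empty_elements=("pb",)):
    """Rewrite start tags of the listed empty elements in XML empty-tag syntax."""
    names = "|".join(re.escape(name) for name in empty_elements)
    # A start tag of a listed element that does not already end in "/>".
    pattern = re.compile(rf"<((?:{names})\b[^<>]*?)\s*(?<!/)>")
    return pattern.sub(r"<\1/>", document)

print(make_empty_tags('<pb n="1" imageUrl="godot.tiff">'))
# <pb n="1" imageUrl="godot.tiff"/>
```

Tags that are already in empty-tag form are left unchanged, so the function can be run repeatedly on the same file without harm.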
Notes
1.
It is, in fact, so widely accepted that there are few other widely available general purpose conversion tools.
2.
These arguments are correct for the June 2003 version of Tidy; different versions may use different argument syntax.
3.
If you use this stylesheet, please note that by default the output will have no DOCTYPE declaration and will be UTF-8. If those settings are not appropriate, you must change the script's xsl:output, or edit the output.
4.
As an additional benefit, this small tutorial might induce people to do their modifications the proper way instead of hacking TEI Lite.
5.
The mechanism of using the TEI.XML parameter entity for dual-use extensions can also be used to select between character entity sets for SGML and XML; see Handling SDATA entities in the conversion process.
6.
If you feel that this discussion is too DTD-technical for you, don't worry. You probably don't need to understand the background if you follow the examples.
7.
A similar problem would have occurred if we had used x.globincl to implement our <toDo> element. Since the inclusion mechanism is gone, the globincl class is gone as well. Instead, TEI P4 has a new class Incl and elements that shall be available everywhere within <text> need to be added to x.Incl.