TEI MIW06: Migration Case Study Reports


Brown University Women Writers Project

About the Project

Started in 1988, the Brown University Women Writers Project is a long-term research project devoted to early modern women's writing and electronic text encoding. Our goal is to bring texts by pre-Victorian women writers out of the archive and make them accessible to a wide audience of teachers, students, scholars, and the general reader. We support research on women's writing, text encoding, and the role of electronic texts in teaching and scholarship.

Initially funded by the NEH, the WWP is now supported through licensing fees to Women Writers Online, the on-line face to our textbase, and through grant funding from the NEH and the Delmas Foundation. In the past, the WWP has also received grants from the Mellon Foundation and Apple Computer, Inc.

The WWP hires undergraduate and graduate students as encoders. They enter texts using GNU Emacs, Lennart Stafflin's psgml mode, and lots of home-grown macros on a Sun Unix system maintained by the University, and accessed via an SSH client on Macintosh desktop computers.

In addition to transient student encoders, the WWP has a permanent staff that has fluctuated from one to five, but has been steady at three for several years now. The WWP is part of Brown University's Computing and Information Services. The WWP is advised by a board of scholars, a subset of whom serve on an acquisition committee that decides which texts are next to be added to our textbase.

About the Collection

WWO (the on-line system available to subscribers) currently has over 220 texts, of which just over half have additional contextual materials funded by the Mellon Foundation as part of Renaissance Women Online. The total size of the XML source files for WWO is just over 50 million characters (just under 50 MiB). In addition to the texts available to subscribers, there are over a hundred more ‘under construction’, for a total of just over 97 million characters (just under 93 MiB). Currently we are concentrating on printed works, with an eye towards manuscripts in the future.

Nature of the Content

The collection includes works from every genre: political satire, drama, poetry, religious tracts, and even medical reference. All texts are in English (although there are plenty of phrases and quotations in other languages) and either written or translated by women.

Works from the period included (roughly 1400 to 1850) are often in black-letter; never have a nice, solid, scannable baseline; and often have pagination indicated by signatures, if at all. Several texts have tipped-in pages, errors in the printed page numbers, or publishing errors (such as incorrectly ordered sections). Furthermore, typographic errors and difficult-to-read or damaged photocopies are common.

The WWP includes a variety of different kinds of texts including monographs, broadsides, and analytic pieces. Typically, in cases where work by a woman is included in a larger work by men, only the front and back matter (e.g., title page and colophon) and those parts written by the woman are included.

While there are obviously no photographs in the works involved, there are many figures, typically woodcuts or engravings. These are currently captured only by a very short description in the encoding 1 , i.e. no scanned images.

Nature of the Encoding

In our SGML texts all characters were encoded in US-ASCII, with characters outside that set represented by general entity references, often to SDATA entities. In the XML version, all characters are encoded in the US-ASCII subset of ISO 10646 UTF-8, with characters outside that set represented by general entity references whose replacement text is, in most cases, a numeric character reference to the appropriate ISO 10646 code point. However, we do have some characters not in Unicode. We do not use the WSD (writing system declaration) mechanism.
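
For illustration, the shift from SDATA entities to entities that expand to numeric character references amounts to replacing a declaration of the first kind below with one of the second kind (the entity name is an example, not necessarily one the WWP declares):
<!-- SGML: an SDATA entity -->
<!ENTITY eacute  SDATA  "[eacute]" >
<!-- XML: the same entity, now expanding to a numeric character reference -->
<!ENTITY eacute  "&#x00E9;" >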

The WWP uses quite a detailed level of encoding; a level 5 encoding according to the TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices. For example, an average page break is indicated by
  • an <fw> element (renamed to <mw> ) with type=sig to indicate the signature as printed on the page;
  • an <fw> element (renamed to <mw> ) with type=catch to indicate the catchword (note that no further encoding takes place on a catchword; i.e., it is not encoded as a <persName> even if it is a person's name — this helps avoid over counting of such features, and ‘false positive’ hits for searches on names);
  • a <pb> element with an n attribute that indicates the ‘idealized’ (i.e., corrected) page number;
  • a <milestone> element, unit=sig with an n attribute that indicates the ‘idealized’ signature; and
  • an <fw> element (renamed to <mw> ) with type=pageNum to indicate the page number as printed on the page.
Here is an example of an encoded page break, along with the first line (in the encoded file) following the page break. Note that in this text the default rendition for <mw> is break(yes) align(right).
<mw rend="align(center)" type="sig">B3</mw>
<mw rend="break(no)slant(italic)" type="catch">Va&s;.</mw>
<pb n="6"/>
<milestone n="B3v" unit="sig"/>
<mw rend="place(outside)" type="pageNum">6</mw>
<sp who="WOWSr4"><speaker><abbr expan="Vasquez">Va&s;.</abbr></speaker>
Modified slightly from WWP TR00444

The WWP textbase uses a single DTD that includes the prose, drama, and verse base tag sets, and the additional tagsets for linking, transcription, names & dates, and figures. This DTD, called wwp-store, includes significant extensions, including complex manipulations of element classes.

The WWP also uses a smaller DTD for internal documentation. This DTD, called wwp-doc, uses only the prose base tag set, and the additional tagsets for linking and for figures. This DTD includes simple extensions, mostly to remove elements (like <pb> ) that make no sense in documents we are authoring.

Before migration the WWP files were nearly XML already. We have required all end tags since 1994; we have required that attributes be quoted since 1997; and we have been using case-sensitive encoding since 1999. The DTD extension files, however, were another story. There we made significant use of SGML features not permitted in XML, particularly inclusion & exclusion exceptions, the ampersand connector, and SDATA entities. At least we have been using an or-bar (instead of comma) in all attribute declared value groups since 1998.
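
For readers unfamiliar with these constructs, here is a sketch of the kinds of declaration involved (the names are illustrative, not taken from our extension files); none of the three can be expressed in an XML DTD:
<!-- an inclusion exception: notes and page breaks allowed anywhere inside -->
<!ELEMENT chapter    - -  (head, p+)  +(note | pb) >
<!-- an exclusion exception: no nested quotations -->
<!ELEMENT quotation  - -  (p+)  -(quotation) >
<!-- the ampersand connector: both parts required, in either order -->
<!ELEMENT imprint    - -  (publisher & pubDate) >
<!-- an SDATA entity -->
<!ENTITY  ouml  SDATA  "[ouml  ]" >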

Motivations for Migration

In addition to all the usual reasons of being able to use newer, better tools, there were two reasons for migrating:
  • enforces our own syntax: The WWP has, for years, had its own subset of SGML instance syntax that documents adhered to (core concrete syntax, all end tags present, all attribute values quoted, no occurrences of <, >, =, etc.). Most of our restrictions on SGML were codified in XML, so using an XML validator would allow us to ensure that our documents followed our local syntax conventions much more closely than an SGML validator would, reducing the need for additional syntax-checking software.
  • embarrassment: The WWP is a venerable and respected TEI encoding project, and all staff members have extensive experience in XML — but our own texts were in SGML using P3:1994, i.e. not even P3:1999 (which was for some time unfortunately called ‘P4 beta’), let alone P4. We were getting sick of ‘do what I say, not what I do’.

Barriers to Migration

The biggest impediment to our migration was inertia, particularly in the area of processing our texts for web delivery. We already had a working system, based on our SGML texts, for capturing, validating, machine-checking for common errors, printing for proofreading, proofreading, correcting, and transforming for delivery on the web via a commercial tool (DynaWeb); why change? That last step especially — the transformation from our source SGML into the form DynaWeb reads into its ‘book’ — is cumbersome and finicky, and not all of those who wrote the software are readily available to us.

Furthermore, I anticipated that while migrating our instances to XML would be easy, migrating the wwp-store DTD would be difficult.

About the Migration Samples

The samples available from the Migration Task Force's web site are a convenience sample that we had already given to TEI for its samples page.

Samples include prose, verse, and drama; interesting features include a letter with a postscript, an interesting <figure> , acrostics, and a complicated note.

Notes on the Conversion Process

There were many sets of files or processes that needed to be migrated. I will discuss three of them here:
  • migration of textbase files (instances of wwp-store)
  • migration of wwp-doc DTD extensions
  • migration of wwp-store DTD extensions
I will not discuss migration of the documentation files, as it was too trivial: each of the fewer than half dozen files conforming to wwp-doc was migrated by hand in under a minute by adding an XML declaration, changing the DOCTYPE declaration, changing ‘ & ’ to ‘ &amp; ’, and inserting a slash in the (extremely rare) empty element tags.

instances of wwp-store

By far the most difficult part of converting the hundreds of instance files from SGML to XML was writing a script to check a file out of the version control system (RCS) under one name (.sgm or .sgml), modify it, and check it back in under another (.xml), thus maintaining the history of revisions to each file despite the name change. 2 That script executed a Perl program (best described as a hack) to convert the syntax of a single instance file. The Perl program does not do any true SGML or XML parsing, but rather relies on the fact that the characters < and > do not occur anywhere in any WWP textbase files except as markup. Accordingly, other projects may find this program unhelpful, if not outright harmful.
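
In practice the visible change to a given file is small; for a purely illustrative fragment (not taken from an actual WWP text), the conversion amounts to rewriting empty-element tags like these:
SGML form:  <pb n="6">   <lb>   <milestone unit="sig" n="B3v">
XML form:   <pb n="6"/>  <lb/>  <milestone unit="sig" n="B3v"/>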

One perhaps minor (but nonetheless annoying) problem we have encountered in the migration of the instance files is due to our previous naming convention. In the SGML world, we gave complete SGML files (those with a DOCTYPE declaration and a complete <TEI.2> element) the extension ‘.sgml’; we gave sub-files (no DOCTYPE declaration, usually included in one or more complete files by a general entity reference) the extension ‘.sgm’. This allowed for easy differentiation, in particular when using command line tools to perform tasks, e.g. validation or printing, on a set of files. With both kinds of file now ending in ‘.xml’, that easy differentiation is lost.

wwp-doc DTD extensions

I followed the steps outlined in an early draft of miw03.html#index-div-id2653174 :
  1. Created uselessDoc.sgml as a P3 test document. Validates fine.
  2. Ensured that my P4 parsing environment works (it did).
    • In order to validate uselessDoc.sgml against P4(SGML), I had to create a new version of wwp-doc.dtd (which calls P3, new one calls P4). Also added
      <!-- ********* --> <!-- TEI flags --> <!-- ********* --> <!ENTITY % TEI.XML 'INCLUDE' >
    • Got 215 or so validation errors, all complaints about ambiguous content models in which both the 1st and 2nd occurrences of <divGen> could be matched.
    • Found and removed the following two lines from wwpdoc.ent:
      <!-- Changes to element classes (to fix oversights in TEI P3) --> <!ENTITY % x.front 'divGen |' >
    • Validates!
  3. Changed wwpdoc.ent and wwpdoc.dtd to be ‘dual-use’ (a sketch of a dual-use declaration follows this list); process:
    • changed omissibility indicators to %om.RR; (1 regexp search-and-replace did the trick);
    • changed ‘,’ to ‘|’ in attribute declared value name token group.
  4. Validated against P4 in SGML mode — valid!
  5. (Had been done previously.)
  6. Created uselessDoc.xml from uselessDoc.sgml (added XML declaration).
    • Added SYSTEM identifiers after the PUBLIC identifiers of the TEI.extensions.ent and .dtd declarations;
    • moved intra-declaration comments to be right-after-declaration comments;
    • changed all empty comment declarations (‘<!>’) to blank comment declarations (‘<!-- -->’);
    • added SYSTEM identifiers after the PUBLIC identifier of the WWPiso declaration;
    • complete overhaul of WWPiso: removed large number of declarations no longer used, and changed the remaining ones to be CDATA declarations of numeric character references, rather than SDATA;
    • removed declarations for formulaNotations and formulaContent from wwpdoc.ent, which were declared as CDATA (thus permitting the new default values of CDATA and ‘(#PCDATA)’ to take hold).
    VALID!
  7. To try in SGML mode, added
    <!ENTITY % TEI.XML 'IGNORE'>
    to subset of uselessDoc.sgml. Upon validation nsgmls complained about all the entities declared as ‘&#xNNNN;’; otherwise valid. Pizza page generated flattened DTD without a hitch.
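
To make ‘dual-use’ (step 3 above) concrete: after the change, an element declaration in the extension file looks something like the following, where the element name and content model are invented purely for illustration.
<!ELEMENT myNote  %om.RR;  (#PCDATA) >
<!ATTLIST myNote
          type  (editorial | authorial)  #IMPLIED >
When the DTD is processed in SGML mode the TEI DTD expands %om.RR; to the omissibility indicators ‘- -’; in XML mode it expands to nothing, and the ‘|’ separators in the declared-value group are legal in both worlds.
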
The entire process took the better part of a day (but likely would have taken less if I had not been taking notes for this report).

wwp-store DTD extensions

Whereas the wwp-doc extensions were relatively straightforward, quite similar to TEI Lite, and not that extensive (some 39 declarations other than comments and element selection), the wwp-store extensions are complicated, change several element classes of TEI, and are quite extensive (some 231 declarations other than comments and element selection). Not surprisingly, migration of the wwp-store DTD extensions has proved more difficult.

I attempted to follow the same steps as with wwp-doc ( miw03.html#index-div-id2653174 ), but had more difficulty; still, I successfully completed the first seven steps in a day or two. Step 8 — fix all the errors — is not quite complete yet. Furthermore, I was running into enough problems, and spending enough time on the migration, that I eventually stopped taking detailed notes, partly on the theory that anything left is very specific to this particular set of extensions and would not likely be helpful to others.

Here are some details of the migration process. Unless noted otherwise, validation is with nsgmls 1.3.1. Unless noted otherwise, text changes were made in Emacs; the notation qr/one/two/ is shorthand for M-x query-replace RET one RET two RET !, which replaces string ‘one’ with ‘two’; and qrr/three/four/ is shorthand for M-x query-replace-regexp RET three RET four RET !, which replaces the regular expression ‘three’ with ‘four’.
  1. Chose a test file and altered the DOCTYPE declaration in a copy of it so that instead of calling wwp-store.dtd (our driver DTD that in turn declares the various TEI parameter entities, including the extension files, and then calls the main TEI DTD) via FPI, it now calls the TEI (P3) DTD directly and has an appropriate internal subset.
  2. Tested that P4 parsing environment works against a tiny test file. If I recall correctly I had to increase GRPCNT. 3
    • Altered path in DOCTYPE to point to P4.
    • Since declaration of extension external entities has ‘P3’ in the FPI, redeclared them to plain SYSTEM identifiers for now.
    • 18,500 errors, all but last 17 in DTDs. 18,473 of them are due to ambiguous content models (occurrences of <persName> , <placeName> , <anchor> are the culprits in all cases). 2 are mistakes of mine that shouldn't have been here now (I had been jumping ahead). The instance errors are <text> undeclared, and one of <pb> , <milestone> , or <gap> in a spot it's not allowed.
    • Removed <persName> and <placeName> from x.data in wwpstore.ent.
    • Removed <anchor> from x.notes in wwpstore.ent.
    • Re-validated, whaddayaknow, 27 errors.
    • Removed part from ATTLIST of <seg> in wwpstore.dtd.
    • Fixed the 2 mistakes that I had inadvertently created earlier.
    • Changed ‘globincl’ to ‘Incl’ in both extension files.
    • Re-validated, now 47,464 errors. All but 3 are ambiguous content models with <addSpan> , <delSpan> , <gap> , <figure> , <advertisement> , <note> , <mw> , and <seg> as the culprits. And the three are <addSpan> , <delSpan> , <gap> occurring more than once in %m.Incl;, so I'll start there.
    • Remove <addSpan> , <delSpan> , <gap> from m.globedit in wwpstore.ent.
    • Re-validated, down to 20,581 errors. All of them are ambiguous content models with <figure> , <advertisement> , <note> , <mw> , and <seg> as the culprits (same list as before but without the 3 I fixed :-).
    • Removed <mw> , <figure> , and m.notes from m.Incl in wwpstore.ent.
    • Re-validated: no errors in the DTD! Only 2 errors in the instance. However, the encoding is correct, and the DTD is wrong. 4
    • Changed all omitted tag minimization parameters to TEI parameter entity references. Note that since I was very consistent in how I wrote the element declarations, I was able to do this in two normal search-and-replace commands (one for EMPTY, one for content models) and one leftover (declared content ANY) by hand.
    • Case is not a problem (we've been case-sensitive since 1999-09).
    • We have no CDATA content models.
    • Inserted a bunch of missing REFCs 5 (REFC is the reference close delimiter, the ‘;’ that ends an entity reference) using a single Emacs query-replace-regexp command. 6
  3. Valid against P4 SGML except for the two occurrences of <mw> in <div> .
  4. Created test_that_p4_sgml_works.xml with just prose; works.
  5. I do not currently have access to a system that has both a recent enough version of OpenSP and the TEI DTDs, so I converted bg.sgml to bg.xml using the Perl program mentioned above (which basically inserts ‘/’ before the closing ‘>’ of empty element tags so long as they don't span 3 or more lines), and then hand-tweaked the DOCTYPE.
    • Changed all empty comment declarations (‘<!>’) to empty comments (‘<!---->’).
    • Inserted SYSTEM identifiers for the 21 external entity declarations. Harder than it sounds; I had to find and obtain the proper XML ISO character entity sets first.
    • (At this point other obligations became pressing, and I stopped work on DTD extension migration for over a month.)
    • Made copies of the local character entity set files (wwpspec.ent and wwpgrk1.ent) in the right place, and changed the paths of the system identifiers in wwpstore.ent to match.
    • In those files, changed ‘CDATA’ to nothing globally.
    • Found XML versions of isogrk1–4 (available from the TEI or the W3C), moved them into the appropriate directory.
    • We have four ‘boilerplate’ files included into the <teiHeader> of each WWP textbase text via general entity references. For testing and migration purposes, I had redeclared these entities in the internal subset of my test files bg.sgml and bg.xml. However, xmllint was (probably appropriately) objecting to the lack of a SYSTEM identifier in the ‘original’ external declarations for these entities in the DTD extension file wwpstore.ent. Thus at this point I took the time to move over the four files, and update our CATALOG entries to point to them. Note that this is still an SGML format catalog, not an XML catalog.
    • Tried validating; found lots of problems:

      Via nsgmls: 201 errors, 110 warnings

      • 199 errors about characters not in document set. These are false errors caused by the underlying presumption that there is a 1-byte character set, I think.
      • 2 errors in the instance: same two <mw> s not permitted in <div> .
      • 68 warnings about #PCDATA in model groups
      • 6 warnings about declared values of attributes
      • 1 warning about an and group
      • 13 warnings about attribute values not a literal
      • 4 warnings about exclusions
      • 1 warning about inclusions
      • 17 warnings about missing REFCs

      Via xmllint: I'm not good at reading the error messages yet, but it looks like it quit after spitting out 8 error messages at the first occurrence of paraContent (which is also where the majority of the 68 warnings came from, I believe).

    • So, I performed quite a few fixes (an illustrative before-and-after for the attribute-declaration changes appears after this numbered list):
      • qr/(%paraContent;)/ %paraContent; /
      • qr/(%phrase.seq;)/ %phrase.seq; /
      • qr/(%specialPara;)/ %specialPara; /
      • qrr/%[A-Za-z0-9.-]+/\&;/ and qr/;;/;/ (yes, it's a hack)
      • qr/ NUMBER / NMTOKEN/
      • qr/ NUMBERS / NMTOKENS/
      • qr/ NAME / NMTOKEN/
      • qr/ NAMES / NMTOKENS/
      • Quoted a bunch of attribute values that should have been literals (I used regexp-search for ‘[A-CE-Za-z]$’ which worked well only because I have been consistent about keeping 1 attribute declaration per line with no whitespace at the end; avoiding ‘D’ allowed me to skip all the false positives of the ‘#IMPLIED’ lines, at the risk of a few false negatives if a true attribute ended in ‘D’ — none did).
      • Eliminated the entire section of our DTD extension that reproduced P3 element declarations but with required end-tags. Since the sole ampersand connector was in one of those content models, that error was thereby fixed.
      • Removed declarations for <opener> and <closer> , as the change we made (adding <respLine> , i.e. n.byline) is already present in P4.
      • Fixed declarations for <p> , <seg> , <titlePart> , as they used a reference to %paraContent; which is now a complete content model.
      • Fixed declaration for <text> .
      • I then went through the entire set of declarations, checking whether each source declaration had been changed from P3 to P4 and re-copying it over if need be.
      • Several more class-level changes were made in this period, which I failed to write down.

    At this point I tackled the <mw> not permitted in <div> problem. We used to add n.fw to the globincl class. To replicate this, we needed to add it to the m.Incl class. However, in SGML globincl was expressed as an inclusion exception (on <text> ), so the fact that n.fw appeared in phrase was not a problem; putting it in m.Incl, of course, results in ambiguous content models. I spent approximately 2 hours straightening this out; I did not record all the details, but it included removing n.fw from phrase and adding it to m.refsys.

  6. I have not attempted to validate files in SGML mode, despite having created dual-use DTDs.
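
As an illustration of the attribute-declaration changes noted under step 5 above (the element and attribute names here are invented, not actual WWP declarations), an SGML-era declaration such as the first of these becomes the second:
<!-- before: SGML-only declared values and an unquoted default -->
<!ATTLIST myLg
          cols   NUMBER          #IMPLIED
          place  (left | right)  left      >
<!-- after: XML-legal declared values and a quoted default -->
<!ATTLIST myLg
          cols   NMTOKEN         #IMPLIED
          place  (left | right)  "left"    >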

After this process, we now have DTDs that are valid XML according to xmllint and nearly valid (only characters not in the character set) according to nsgmls, but we still get 2 errors from the Pizza Chef. However, a considerable amount of work is still needed on the content model for <front> , as it is not what we want, due in part to the re-arrangement of how <front> is managed (the new fmchunk entity).

Summary and Evaluation

For the WWP, migrating our instances to XML syntax was trivial. Although migrating them into an XML environment was a bit more difficult, it was still pretty easy, and well worth the effort.

Migrating a small, clean (i.e., without much element class manipulation) set of DTD extensions also turned out to be very easy, and well worth the effort.

Migrating a massive, complicated set of DTD extensions has turned out to be very difficult. Projects in this category may need to do a more careful analysis of the benefits of migrating to XML, as the cost can be significant. Furthermore, unless the original author of the extensions or other TEI DTD expertise is available in-house, these projects should consider hiring outside expert assistance.

British National Corpus

About the Project

For approximately five years I have been saying that migrating the BNC from SGML to XML would be a trivial problem. My bluff was finally called when the Open University asked us to produce a 4 million word subset of the BNC in XML, sampled according to their criteria for use in a new grammar teaching course.

About the Collection

The British National Corpus (BNC) is a 100 million word snapshot of British English taken at the end of the 20th century. It contains 4130 distinct texts, sampled from a very wide range of materials both spoken and written.

Nature of the Content

The BNC includes samples of most written genres, including newspapers, novels, textbooks, ephemera, etc. It also contains transcripts of a very wide variety of spoken English, including informal conversation, radio broadcasts, meetings, lectures, consultations, etc.

Nature of the Encoding

All the material is in English.

In addition to the usual TEI structural tagging, the texts are segmented into sentence-like units and words; each word carries a POS (part of speech) code.

The BNC has its own DTD, using the TEI prose base, the corpus additional tagset, and a number of modifications to the basic TEI model, as further described in the Users Reference Guide. The most recent edition of this Guide includes a section on TEI conformance which explains in excruciating detail the TEI extension files used to define the BNC DTD.

The tagging makes heavy use of SGML minimization features, notably for part-of-speech (POS) coding. For example, here is a heading at the start of text A1l:
<head type=MAIN> <s n="1"><w VVG-AJ0>Ripping <w NN2>yarns <w CJC>and <w AJ0>moral <w NN2>minefields<c PUN>: <w NP0>Allan <w CJC>and <w NP0>Janet <w NP0>Ahlberg <w NN1-VVB>talk <w PRP>to <w NP0>Celia <w NP0>Dodd <w PRP>about <w DPS>their <w NN2>bestsellers <w PRP>for <w NN2>children </head>
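
After conversion the parser supplies the omitted end tags and attribute names from the DTD, so the same heading comes out roughly as follows. This is a sketch only: it assumes, purely for illustration, that the POS code on <w> is carried by an attribute named type (the real attribute name is whatever the BNC DTD declares), and most of the words are elided here.
<head type="MAIN"> <s n="1"><w type="VVG-AJ0">Ripping </w><w type="NN2">yarns </w><w type="CJC">and </w>...<w type="NN2">children </w></s> </head>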

Motivations for Migration

Desire to use new XML-based tools and to add to the way the corpus can be processed. For example, we can now offer an XSLT stylesheet to convert the texts to HTML for display.

Barriers to Migration

Laziness. No, cost, size, and complexity, chiefly.

About the Migration Samples

For the needs of the OU project, we selected texts totalling one million words for each of four subcorpora: demographically-sampled speech, newspapers, academic prose, and fiction.

The texts chosen were entirely typical of the rest of the BNC as far as format and tagging goes. The process of converting them was entirely automatic.

Notes on the Conversion Process

I spent a lot of time in March defining the process of migrating the DTD, and also drafted a brief document (now on the BNC website at http://www.natcorp.ox.ac.uk/migration.html ) on how to do this properly. What this document does not tell you is that in the end I had to fudge the BNC SGML DTD somewhat. After I had completed the work described in the migration document referred to, and delivered the first few sample XML files, I realised that the BNC suffered badly from what I will call the Underspecification Gotcha. The UG comes about when you have an SGML DTD which makes liberal use of default values for attributes, with declarations such as <!ATTLIST foo bar (ptrzbie|farble|wibble|zip) "zip">

Why is this a problem? Because when such a DTD is used to convert a document to XML, every occurrence of <foo> which does not specify a value for its bar attribute will be converted to read <foo bar="zip"> . And why is that a problem? Because after conversion there is no way to tell whether the encoder deliberately chose the default value or simply said nothing. Well, maybe the BNC project was exceptional in having encoders who did not realise that not specifying a value in some circumstances meant something different from not specifying a value in others. (But I doubt it).

In any case, I went back to the nice TEI-conformant SGML DTD, and mercilessly hacked it so that every defaulted attribute was given a default of #IMPLIED, even where its declared value was an explicit name-token group. So, for example, <!ATTLIST foo bar (yes|no|maybe) "maybe"> became <!ATTLIST foo bar (yes|no|maybe) #IMPLIED> passim.

The BNC has a typical TEI Corpus structure, with a corpus header that contains codebooks used to validate individual texts by means of the usual SGML IDREF/ID mechanism. The corpus header also contains declarations for two specific speakers who appear in many of the BNC spoken texts (PS00: the unknown participant, and PS01: the unknown group participant). To avoid the nuisance of having to include the corpus header (and then remove it again) when processing individual texts, I modified the SGML DTD so that all IDREF-valued attributes were changed to CDATA. Another, even more shameful, admission: a significant number of texts were found to be invalid against the intended XML DTD, although the application of minimization rules had made them appear to be valid against the SGML DTD. The reasons for this are murky, but reinforce in me the feeling that inclusions and minimization are indeed the lapses of good taste which I have long suspected them to be.
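
The IDREF-to-CDATA change itself is mechanical. For example (the attribute shown is illustrative rather than the BNC DTD's exact declaration), a declaration along the lines of the first of these was relaxed to the second:
<!ATTLIST u  who  IDREF  #IMPLIED >
<!ATTLIST u  who  CDATA  #IMPLIED >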

The BNC SGML files contain several thousand named character entity references. As delivered, the corpus provides declarations for these as SDATA entities declared explicitly. I used two sets of entity declarations: the one used in SGML land mapped each entity to a null string; the one used in XML land mapped each entity to the equivalent Unicode character number reference. This meant that I could use the exceptionally cool facility provided by OSX of retaining character entity references unconverted when parsing the SGML version, but also run the results through an XSLT transform which would produce meaningful Unicode character entity references.
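
Concretely, the paired declaration sets contain entries of roughly this form for each character entity (the entity name here is only an example):
<!-- SGML-land declaration: the entity maps to a null string -->
<!ENTITY eacute "" >
<!-- XML-land declaration: the entity maps to the equivalent Unicode character reference -->
<!ENTITY eacute "&#x00E9;" >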

Interestingly, I can report that the BNC contained entity declarations for the following five non-Unicode characters:
<!ENTITY formula "[formula]" >
<!ENTITY frac17 "[frac17]" >
<!ENTITY FRAC19 "[frac19]" >
<!ENTITY frac47 "[frac47]" >
<!ENTITY shilling "[shilling]" >
The first of these is a BNC stand-in for any non-transcribed formula. Its expansion should really be <gap desc="formula"/> . The last of these should maybe expand to the string /-. The others I am not sure what to do with: fortunately none of them appears in the texts in our current sample.
I wrote a shell script called xmlify which carries out the following steps:
  • extract a filename from the BNC user file identifier
  • produce a wrapper file which can be submitted to an SGML parser
  • run OSX on this file, with parameters which retain both internal and external entity references
  • run an XSLT transformation to ‘pretty print’ the XML file generated in the previous step, thus replacing the named character entity references by appropriate character number references;
  • (optionally) run another XSLT transformation to generate an HTML version of the XML file generated in the previous step
The stylesheet for pretty printing carries out the following transforms:
  • indent the output nicely
  • ensure that each <s> element starts on a new line
  • add an appropriate <change> element to the <revisionDesc> element in the document header
  • remove any TEIFORM attribute whose value is identical to the gi of the element carrying it (sketched below)
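
A minimal XSLT 1.0 sketch of the last of these transforms is given below. It is not the actual prettyprint stylesheet (which is available from the tools page), and it assumes the attribute appears as TEIform after conversion.
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- keep a TEIform attribute only when it differs from the element's name -->
  <xsl:template match="@TEIform">
    <xsl:if test="not(. = name(..))">
      <xsl:copy/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>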

The xmlify shell script and the XSLT prettyprint stylesheet are available on the task force's tools page; they are also available, together with the DTD files etc. that were used, in a single archive file.

Summary and Evaluation

This work took longer than expected, of course. It is not yet complete — I haven't tried running it over the whole BNC, though I have run it to produce the four million word OU sample. However, it was a satisfactory demonstration of the advantages of sticking to the standard TEI route. Problems that arose were almost all caused by deviation from the path of righteousness.

I have always maintained that converting the BNC to XML would be prohibitively expensive in diskspace. Well, diskspace is cheap, but here are some figures to go with the assertion.
corpus   files   SGML     XML      factor   SGMLzip   XMLzip   factor
Aca      30      15,192   27,772   1.83     3196      3740     1.17
Fic      25      15,576   29,164   1.87     3456      4004     1.16
Dem      29      21,320   39,028   1.83     3936      4956     1.26
News     94      15,220   27,584   1.81     3696      4240     1.15
This table gives for each of the four 1 million word samples of the BNC XML corpus the number of files it contains, their total size in KiB as uncompressed SGML and XML, and as ZIP archives. The increase in size when going from SGML to XML is far less significant (between 1.15 and 1.26) for the compressed files than it is for the uncompressed files (where the factor is a fairly steady 1.8), because of the repetitiveness of the XML encoding.

The MULTEXT-East 1984 Corpus

About the Project

The EU Copernicus MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages, http://nl.ijs.si/ME/) is a spin-off of the EU MULTEXT project. MULTEXT is working to develop standards, specifications, and tools for the encoding and processing of linguistic corpora in a wide variety of languages. MULTEXT-East, which ran from 1995 to 1997, developed language resources for six Central and Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene) and for English, the 'hub' language of the project. It also adapted existing tools and standards to these languages. The main results of the project were an annotated multilingual corpus and lexical resources for the seven languages. The most important resource turned out to be the parallel corpus of George Orwell's novel 1984, in the English original and its translations, heavily annotated with linguistic information.

MULTEXT-East resources have been used in a number of studies and experiments. In the course of such work, errors and inconsistencies were discovered in the MULTEXT-East specifications and data, most of which were subsequently corrected. But because this work was done at different sites and in different manners, the resources' encoding began to drift apart. The EU Copernicus project Concede (Consortium for Central European Dictionary Encoding), which ran from 1998 to 2000 and comprised many of the same partners as MULTEXT-East, offered the possibility of bringing the versions back on a common footing. Although Concede was primarily devoted to machine-readable dictionaries and lexical databases, one of its workpackages did consider the integration of its dictionary data with the MULTEXT-East corpus. In the scope of this workpackage, the corrected morphosyntactically annotated corpus was normalised and re-encoded. The Concede release of the MULTEXT-East resources contains the revised and expanded morphosyntactic specifications, the revised lexica, and the significantly corrected and re-encoded 1984 corpus. This edition, V2, is also freely available, under a research license, from http://nl.ijs.si/ME/V2/.

This report documents the SGML to XML conversion process of the Concede edition of the MULTEXT-East 1984 multilingual corpus.

About the Collection

The MULTEXT-East parallel corpus contains the novel 1984 in the original English and in six translations. The novel is approximately 100,000 words in length and is composed of four parts, each consisting of a number of chapters. The complete seven-language corpus contains 46,626 sentences, 618,879 word tokens, and 125,016 punctuation tokens.

The corpus file structure is rather complex; the files relevant for the TEI migration are the following, where xx = bg, cs, en, et, hu, ro, sl:
orwell.xml: the document file with the corpus header for and reference to ohdr-xx
ohdr-xx.tei: TEI text header for and reference to oana-xx
oana-xx.tei: text and annotations of '1984' in language xx
msd.tei: TEI document for morphosyntactic annotation (header for and reference to msd-flib and msd-fslib, further explained below)
msd-flib.tei: TEI feature library with attribute-value tables
msd-fslib.tei: TEI feature structure library with lexical MSDs
The full corpus is 35 MiB, or 4.5 MiB compressed.

The corpus is meant primarily as a dataset for the development and testing of language technology methods and tools. It has already been used to develop and test machine learning techniques for part-of-speech tagging, word alignment and word-sense disambiguation. While this research by computational linguistics and language technology professionals has resulted in a relatively large bibliography (a probably somewhat outdated list is given here), it has not, as far as I am aware, been used by a more general audience interested in the novel itself.

Nature of the Content

The novel itself is well known, and deals with an anti-utopian vision of a Stalinist society. An interesting feature from the linguistic perspective is that the novel contains a number of ‘Newspeak’ words (such as Miniluv, doublethink, plusgood, etc.) and, in fact, an Appendix giving the introduction to Newspeak. The interest comes in studying the translations of these non-words, and the varying strategies the translators have used to translate them. Also, any systems processing the text are almost guaranteed a certain number of unknown words.

Nature of the Encoding

The DTD is a parametrization of TEI and uses the following tagsets: TEI.prose, TEI.linking, TEI.analysis, TEI.fs and no local extensions. Character encoding is via ISO 8879:1986//ENTITIES.

The corpus is rooted in the <teiCorpus.2> element, which consists of the header and seven <TEI.2> elements, each one containing one translation of the novel. The corpus and component TEI headers are quite detailed, and include editorial and tags declarations and source and revision descriptions. Each translation is divided into parts and chapters (encoded as <div> elements), and these into paragraphs (encoded as <p> elements).

The most important aspect of the corpus is its linguistic annotations. Hand-validated sentence elements are marked with IDs and serve as the alignment segments. They contain the TEI.analysis word and punctuation tokens ( <w> and <c> ). Word tokens have two attributes, lemma and ana: the latter has as its IDREF value the morphosyntactic description (MSD) of the word in question. E.g.,
<w lemma="clock" ana="Ncnp">clocks</w>
where the MSD signifies "Noun common nominative plural".

MSD IDs are defined in a feature structure library, <fsLib> , contained in a dedicated <TEI.2> element. Each <fs> defines an MSD, specifies which languages it is appropriate for, and describes its decomposition into features. The features themselves are defined in a feature library, <fLib> , which assigns each feature value an attribute name and a value name.
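
In outline, and much simplified (the feature names and ids below are invented for illustration rather than copied from the corpus), the two libraries work together like this:
<fLib>
  <f id="N0"  name="category"><sym value="noun"/></f>
  <f id="N1c" name="type"><sym value="common"/></f>
  <f id="N2n" name="case"><sym value="nominative"/></f>
  <f id="N3p" name="number"><sym value="plural"/></f>
</fLib>
<fsLib>
  <fs id="Ncnp" feats="N0 N1c N2n N3p"/>
</fsLib>
Each <w> in the texts then points at an <fs> via its ana attribute, as in the example above.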

In producing the Concede version, special care was taken to make the SGML encoding XML-like, with quoted attributes and camelCased GIs. The XML flavour was, to an extent, enforced by a special SGML declaration for the corpus, but was otherwise maintained through instructions to the contributors of the language components.

Motivations for Migration

On the practical level, the aim was to enable XML processing, in particular XSLT. More generally, there is a need to keep the corpus abreast of current encoding practices.

Barriers to Migration

The only barrier was the effort required to implement the migration.

About the Migration Samples

Although the migration was eventually undertaken for the full corpus, we started with a sample for initial tests and distribution within the scope of the working group. This sample contains only the first chapter (37,014 of the corpus's 618,879 word tokens), but retains the structure of the full corpus and includes the complete ancillary files, such as the FS libraries that define the word-level syntactic tagset.

The sample, like the complete corpus, was not expected to present any great challenges to migration, except for the marked sections in the corpus document. If the entity %ONETEXT; was set to IGNORE, the corpus was processed as a whole (and hence defined the language and other IDs in the <teiCorpus.2> header). If it was set to INCLUDE, each language was taken to constitute its own SGML document (and hence defined the IDs in its <TEI.2> header).
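
The construct in question is the SGML marked section, controlled by the parameter entity %ONETEXT;. Schematically it looks like the following (the content shown here is invented, not the corpus's actual declarations); XML permits conditional sections only in the external DTD subset, so the construct could not survive migration unchanged:
<!ENTITY % ONETEXT "IGNORE" >
<![ %ONETEXT; [
  <!-- declarations used only when each language is processed as a document of its own -->
]]>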

Notes on the Conversion Process

As mentioned, we did not expect any real problems with the migration and did not experience many. The manner in which the conversion was performed is probably unorthodox: instead of using automatic tools for the conversion (such as osx), the emacs editor's simple search-and-replace function was considered a satisfactory and not too time-consuming method. This, of course, would not be an option with a project with a greater number of files and modifications. Below is a list of steps we had to perform to migrate from P3 SGML to P4 XML:

  1. Much to my embarrassment, the SGML sample itself was found to contain one SGML error in the file orwell.sgml: <tagUsage gi="c" occurs=""6117""> . This was corrected.
  2. One instance of an attribute value was found to be unquoted (file ohdr-en.tei): <availability status=restricted>. This was quoted, to conform to XML syntax.
  3. The empty elements used (actually this is only <sym> ) were converted to XML syntax.
  4. The marked section discussed above was dropped: it was assumed to be IGNORE and commented out.
  5. The last step of the conversion involved making a new TEI P4 XML DTD with PizzaChef.
  6. Finally, the P4 corpus was validated with the rxp validating parser.

Summary and Evaluation

The conversion process turned out to be quite simple, partially because the data was already encoded in an XML-like manner. Still, it was surprising that the corpus contained mistakes in the application of XML conventions. This leads to the (probably obvious) conclusion that only a validating parser can guarantee syntactic well-formedness.

As noted above, our simple conversion via the emacs editor would not be an option with a project that contains a greater number of files and modifications to be performed. However, in our case it was an acceptable tool.

It would be difficult to say how long the migration process took for the original sample, as effort was also spent answering the WG questionnaire and exploring the various options offered for implementing the conversion. But performing the subsequent migration on the whole MULTEXT-East corpus took only about two hours, and this includes updating the corpus headers.

The XML-ized Concede edition of the MULTEXT-East corpus, together with additional resources, should shortly be released on the MULTEXT-East Web site as V3.

Corpus of Middle English Prose and Verse

About the Project

This collection of Middle English texts, begun in 1993, was assembled from works contributed by University of Michigan faculty and from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus by the Humanities Text Initiative (HTI). The HTI is grateful for the permission of all contributors. All texts in the collection are valid SGML documents, tagged in conformance with the TEI Guidelines and converted to the TEI Lite DTD for wider use.

The HTI intends to develop the Corpus of Middle English Prose and Verse into an extensive and reliable collection of Middle English electronic texts, either by converting the texts ourselves or by negotiating access to other collections produced to specified high standards of accuracy. HTI wants the corpus to include all editions of Middle English texts used in the Middle English Dictionary and the more recent scholarly editions, which in some cases may have superseded them.

About the Collection

At present, sixty-one texts are publicly available and more than sixty others will be coming on-line soon. Currently, 30 MiB of data are online; the expanded collection will be 88 MiB. Texts vary in size from 23 KiB to 6 MiB and are encoded in SGML using a P3 TEI Lite-based DTD.

The collection's audience is made up of scholars, students, and general readers interested in Middle English prose and verse. It is part of the Middle English Compendium but can stand on its own. The Corpus is provided with a full array of search mechanisms, so that texts may be searched individually, in user-designated groups, or collectively. In 2002, users from around the world conducted 30,000 searches, 9,000 browses, and 320,000 text views.

Nature of the Content

The content is primarily electronic versions of nineteenth-century print publications, but includes some original translations and transcriptions. The content encompasses poetry, drama, prose fiction and non-fiction. Only five of the texts currently delivered have associated external image files. All of the new texts have page images associated with them through image references encoded in attributes in the page break elements.

The base language of all the texts is Middle English, but other languages are represented, including French, German, Greek, Latin, and Old English.

Nature of the Encoding

The level of markup encoding is level 4, ‘basic content analysis,’ as described in the Digital Library Federation's TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices. This level of encoding aspires to ‘create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text.’ However, the texts predate the TEI in Libraries Guidelines and have several inconsistencies, such as unnumbered nested <div> s in front- and backmatter elements and division numbering that begins at <div0> .

The SGML is normalized and so does not make use of tag minimization or unquoted attributes.

Motivations for Migration

The main motivation behind the migration was to be able to take advantage of the many XML and XML-related tools and technologies that are now available, including our own locally-developed digital library middleware, which will provide more functionality than the current online implementation. In addition, we were taking the opportunity to add more material to the collection and to harmonize encoding practices that had changed (intentionally and unintentionally) over time. Beyond the division-numbering practices mentioned above, in the earliest days of the HTI we had a number of text encoders who had worked on the Middle English Dictionary and had specialist interests in various aspects of the texts. Some taggers supplied missing text headings or corrected errors in editions with which they were familiar, using different encoding practices to do so.

Barriers to Migration

The largest barriers to migration were the encoding harmonization, which added unnecessary complications to an otherwise straightforward task, and the few unusual characters encountered in 19th-century publications of Middle English manuscripts. We did not find Unicode representations for all of the characters, some of which had locally invented character entities. These are displayed as [[entname]]. This is a continuing problem in encoding these older texts, where special characters are used and where even specialists are not in agreement about what exactly they mean. A recent example from EEBO is a G with two dots above, two dots below, and a dot on each side. It is clear in context that it is some unit of measurement, but a measure of how much?

About the Migration Samples

We chose the CME as our migration collection because it is one of our oldest locally-created collections; it contains a variety of genres, languages, and special characters; and it is small and publicly available (unlike the EEBO files we sent as examples to the migration team). In addition, we have a new batch of recently produced texts to add to the collection. As we will be adding to this collection after many years without growth, it seemed to be an ideal time to migrate.

Notes on the Conversion Process

We generally followed the procedures laid out in the TEI migration guidelines. We used sx for general SGML to XML conversion, a locally developed stylesheet for encoding harmonization (e.g., renumbering nested divisions to start with <div1> ), and the tei2tei.xsl stylesheet to normalize element and attribute names and cases. As much as possible, character entities in the SGML files were replaced by their Unicode numeric values. A handful of locally-created character entities had to be handled through creative substitution during the conversion process. We now have osx available, and it appears that non-Unicode entity issues could be avoided through use of this tool. This would have been helpful, especially as we are not yet prepared to index and search UTF-8 encoded files.
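
For illustration, the division renumbering mentioned above can be expressed in XSLT 1.0 along the following lines. This is only a sketch, assuming the divisions appear as lower-case div0, div1, and so on; it is not the stylesheet we actually used.
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity transform: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- div0 becomes div1, div1 becomes div2, and so on -->
  <xsl:template match="div0|div1|div2|div3|div4|div5|div6">
    <xsl:element name="div{substring-after(name(), 'div') + 1}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>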

Summary and Evaluation

Aside from the analysis of variant encoding practice and non-standard character entities, and the creation of strategies to deal with them, it was a straightforward process. We were helped by the fact that we already use sx and XSLT in current practice, and hindered by an over-ambitious plan of work.

Japanese Text Initiative, Electronic Text Center, University of Virginia Library

About the Project

The Electronic Text Center at the University of Virginia Library was founded in 1992. The Etext Center builds and maintains an on-line archive of tens of thousands of SGML and XML-encoded electronic texts; it is also a library service point that trains faculty, staff and students in the creation and analysis of electronic texts. In 1996, the Etext Center launched its Japanese Text Initiative (JTI). Founded in partnership with the University of Pittsburgh, the JTI uses an in-house team of specialists to create electronic versions of canonical Japanese literary texts.

About the Collection

The JTI currently contains several hundred titles, including Genji Monogatari, Manyoshu, and other large multi-volume works. This collection serves an international audience, including many visitors from Japan and the Far East; only a tiny percentage of visitors come to the site from the University of Virginia.

Nature of the Content

The JTI collection includes poetry, drama, memoirs, and prose fiction, ranging in date from the medieval period up through the mid-twentieth century. The structure of the texts varies dramatically — the basic divisional unit can be anything from a single line of poetry to an entire chapter of prose. The collection contains many lengthy multi-volume anthologies, which are often broken out into separate files for each volume (this was done to aid local processing and is something we'd eventually like to fix).

Nature of the Encoding

The JTI files are encoded in EUC, a Unix-based double-byte encoding scheme for Japanese characters. EUC is widely supported by web browsers and is also supported by OpenText, the search tool we currently use to index our data (neither OpenText nor XPat, its current iteration, supports Unicode). Many of the texts are actually keyboarded on Windows machines in Shift-JIS encoding but are converted to EUC when they are transferred to the Unix server for delivery.

The level of encoding is consistent with Level 4 as described in the TEI in Libraries Guidelines (TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices). In other words, we have encoded basic hierarchical and typographic features, but not all bibliographical or semantic features. In general, the prose works in the collection feature very light markup, while poetry and drama are encoded more thoroughly, with annotations, attributions, and more intricate structural features identified.

The JTI uses a straight TEI Lite DTD, with no extensions or modifications. In addition to the standard ISO character sets, the files use a lengthy catalog of SDATA entities to represent characters that are not present in the EUC code tables (mostly obsolete kanji appearing in the older texts).

As is generally the case at the Etext Center, the JTI has always produced very strict ‘XML-like’ SGML files — there's no tag minimization, attributes always appear in quotation marks, and elements are always in correct camelCase. None of the non-XML compliant SGML features (like SUBDOCS) are ever used.

Motivations for Migration

The University of Virginia Library is developing an XML-based integrated digital repository and the Etext Center will be expected to migrate all of its collections, including the JTI, into this new system. The new system will rely heavily on servlet-driven XSLT stylesheets for text dissemination, and although the indexing tool for this system has not yet been selected, it will most likely be incompatible with SGML. The migration process is especially useful for the JTI texts because it provides a good opportunity to convert the documents to Unicode, which contains a number of kanji not available in EUC and also allows us to combine Japanese characters with other non-Western character sets.

Barriers to Migration

Because the new digital repository is not yet in place, the transition will be slightly awkward and will probably require us to run SGML and XML systems simultaneously for some time. We are in fact running two parallel systems right now — an XML system for new materials that are being created in XML and a separate system for legacy data. Eventually the entire Etext Center collection of over 70,000 texts will have to be migrated, and it will be a challenge to find the resources to accomplish this without disrupting ongoing operations. The collections employ many different flavors of markup, including non-standard TEI-like materials that will be significantly harder to migrate due to unusual SGML-only features like inclusion exceptions.

In our current processing environment, texts are parsed when they are initially put online but are not often re-parsed after that point. Because we rely on non-validating tools like OpenText and Perl to deliver our texts, errors introduced by subsequent editing or other processes are sometimes not detected immediately. Thus we suspect that we will find a number of validation errors when we begin the conversion process, and because data that is flawed but usable in an SGML environment will break entirely in XML, the migration will probably necessitate substantial data cleanup. Obviously this cleanup process will be beneficial in its own right, but it will increase our migration costs in the short term.

About the Migration Samples

We selected the JTI texts for migration because they form a well-defined, reasonably sized subset of our holdings and they feature fairly typical markup. We also felt that these texts would benefit from a simultaneous conversion from EUC encoding to Unicode. The encoding scheme and the number of SDATA entities, including some entities representing non-Unicode characters, presented the biggest challenge.

Notes on the Conversion Process

We consulted the earlier SGML-XML migration guidelines on the TEI website, and used the batch script provided there as a starting point. We were initially unable to compile osx under AIX or Solaris, but once we finally managed to install it on a Windows machine using cygwin we transferred all our files over to a PC to begin the conversion. We made a few changes to the script (now reflected in MIW03) to take advantage of some newly enhanced osx features, and we had to tweak the tei2tei.xsl stylesheet to get it to remove default attributes properly. Then we began debugging by running files through the script one at a time.

By far, the biggest headache was wrestling with the EUC encoding. The current version of osx handles this encoding unpredictably. The standard switch for EUC, -beuc-jp, produces corrupted output, as does running the program with no -b switch at all. We were finally successful with the -beuc switch, which ironically triggers an error message yet produces correct EUC output. (We also tried running the files through a command-line EUC to UTF-8 converter and feeding the UTF-8 output into osx, and this caused osx to blow up with ‘non-SGML character’ errors. We assume that osx can handle Unicode input but were unable to make this work on our samples.)

Applying the stylesheet caused further encoding problems, as neither of the XSLT processors we tried (Saxon and XT) would accept EUC encoding. We finally added an EUC to UTF-8 conversion step to the batch script, just before the stylesheet process, and were able to get correct Unicode output.

Once we perfected the script for individual files, we modified it further for batch processing. We wrote a Perl wrapper to pop the declaration and internal subset off the SGML file and replace it with a modified version after processing. The Perl wrapper also concatenates the external entities file generated by osx after every run; when the batch is complete, it pipes these entities through sort and uniq to generate a single external file containing all the non-Unicode Japanese character entities. (Prior to converting the files, we identified all Japanese characters available in Unicode but not in EUC, and after the conversion we replaced those entity references with Unicode values. We still have no satisfactory way to render those characters not available in Unicode. We currently use small in-line GIF images but we may explore the possibility of using the Unicode PUA.)

We were unable to resolve the problem of attribute normalization. While we use a camelCase ID scheme to identify our files, osx maps all attribute values to lowercase.

Summary and Evaluation

Altogether, we spent about a day converting this sample of approximately 500 files, from installing the software to making the final modifications to the batch script. Running an XML parser on the output files revealed only a small handful of validation errors, which were easily fixed. We expect that we will have a much easier time converting our English language collections, because the encoding scheme will not be an issue (although our experience with the Japanese texts will come in handy when we convert our Chinese and other non-Western collections). In general, we found the migration process to be reasonably straightforward and well worth the effort.

The Thomas MacGreevy Archive

About the Project

The Thomas MacGreevy Archive was begun in 1996 as a long-term, interdisciplinary research project committed to exploring the intersections between traditional humanities research and digital technologies. It was started with grants from the Newman Scholarship Fund at University College Dublin and from Enterprise Ireland. It is published by the Institute for Advanced Technology in the Humanities (University of Virginia) and supported by the Maryland Institute for Technology in the Humanities (University of Maryland). The General Editor of the Archive is Susan Schreibman.

About the Collection

The Archive makes available the works of Thomas MacGreevy (1893–1967), Irish poet, art and literary critic, and Director of the National Gallery of Ireland (1950–63). To date, it has focused on publishing MacGreevy's published articles and books, all of which are out of print. It has also reprinted some 20 articles by contemporary critics on MacGreevy and his circle.

The Archive currently publishes some 350 texts encoded in SGML which are delivered via DynaWeb. Texts range from about 125 KiB (book-length texts) to around 6 KiB (review articles of a few paragraphs). The audience for the collection is scholars interested in Thomas MacGreevy, the subjects he wrote about (Irish art, literature and culture, modernist literature and art, for example), and the people he associated with (including Samuel Beckett, James Joyce, Wallace Stevens, W.B. and Jack Yeats).

Nature of the Content

To date, the genres of the texts we have encoded are monographs and periodical articles. The base language for a majority of the texts (about 95%) is English, but many other languages are encoded using TEI's <foreign> element, including French, German, Greek, Irish, Italian, Latin, and Spanish. We capture textual features such as original pagination, emphasised words (in bold and italic, for example), and typographic features (such as capitalised words and small caps). About 10% of the texts have images associated with them.

In addition to encoding pre-existing texts, Susan Schreibman has created a bibliography, which does not exist in any other form, of the complete writings by and about Thomas MacGreevy following the MacGreevy Archive DTD. This bibliography forms the basis of the Archive's ‘Browse’ mode.

Nature of the Encoding

According to the TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices, the level of encoding is scholarly. We not only capture the structural features of the texts, but also encode and regularise personal and association names. In addition, names are editorially supplied when appropriate. All the texts in the Archive have been encoded according to documentation developed specifically for the project (the documentation can be made available upon request).

The DTD the Archive uses is based on TEI's prose base tagset, and includes the following auxiliary tagsets: TEI.certainty, TEI.figures, TEI.linking, TEI.names.dates, TEI.textcrit, and TEI.transcr. It includes the following entity sets: ISOgreek, ISOLatin, ISOpub, and ISOpen.

Since project participants have always encoded using an SGML editor, such as XMetaL (and Author/Editor before that), the data was relatively XML-compliant in that we never utilised tag minimization or unquoted attributes.

Barriers to Migration

For us, the barrier to migration was not the actual conversion process; that was relatively easy because our encoding was so uniform. We spent the most time learning to use the tools and understanding which scripts were needed to preserve entity references, convert element names to camelCase, and so on. It will require more time, however, to complete the rest of the conversion process and migrate the data from an SGML publishing system (DynaWeb) to an XML-based one. This will require programming resources for remounting the Archive, designing new XSLT stylesheets, reimplementing the search and browse functionality, and adapting the HTML wrapper.

About the Migration Samples

We migrated the entire MacGreevy Archive, currently encoded in SGML and available through DynaWeb at IATH. There are approximately 100 more texts in various stages of encoding and proofing. These are still being encoded in SGML (so they can go live in DynaWeb when they are ready). When the new XML database is ready, we will run another batch conversion to turn the rest of the SGML texts into XML.

We did not face any particular challenges that scripting could not handle.

Notes on the Conversion Process

For the most part, we followed the procedures laid out in the earlier TEI Migration Committee Guidelines. The steps we took are outlined below; a consolidated per-file sketch follows the list:

  • Commented out the doctype and DTD declaration
  • Commented out entity declarations
  • Renamed entities by searching and replacing the ‘&’ with ‘||’
  • Used the following command to make element names lower case and do the actual conversion to XML: osx -xlower -xcomment -xempty -xndata
  • Used the following script to change element names to camelCase: java com.icl.saxon.StyleSheet $$.xml tei2tei.xsl
  • Restored entities by replacing the ‘||’ with ‘&’.
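
Put together, a per-file sketch of this sequence might look like the following (the file names are hypothetical, and the handling of the commented-out DOCTYPE and entity declarations, steps 1 and 2, is omitted):

  # Protect entity references so they survive the conversion (step 3).
  sed -e 's/&/||/g' letter01.sgm > letter01.protected.sgm
  # SGML to XML, lower-casing element names and closing empty elements (step 4).
  osx -xlower -xcomment -xempty -xndata letter01.protected.sgm > letter01.tmp.xml
  # Rename elements to TEI P4 camelCase with Saxon 6 and tei2tei.xsl (step 5).
  java com.icl.saxon.StyleSheet letter01.tmp.xml tei2tei.xsl > letter01.renamed.xml
  # Restore the entity references (step 6).
  sed -e 's/||/\&/g' letter01.renamed.xml > letter01.xml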

We also took the opportunity to rethink our DTD while creating a new XML DTD with the TEI Pizza Chef. We had, for example, modified our <p> element to allow it to appear after the <closer> tag in order to accommodate postscripts in letters. This modification is unnecessary in XML, since XSLT can easily reposition a <p> element with an attribute value of ‘postscript’ to follow the letter's closing information.

Summary and Evaluation

In our case, the actual conversion process was relatively painless. It took about five hours of my and a programmer's time. The vast majority of that time was spent in understanding how osx works and finding and testing the scripts for preserving entities, converting element names, etc. Once we understood the process, the actual conversion took about 45 minutes for 350 texts. However, migrating the documents to XML is only the beginning of the process. As mentioned earlier, the entire Archive must be migrated into a new XML-based document delivery system, which will take additional time and resources.

Documenting the American South
University of North Carolina at Chapel Hill

About the Project

Documenting the American South (DocSouth), an electronic publishing initiative of the University of North Carolina at Chapel Hill Library, was launched on UNC's Sunsite in October of 1996. Started as a modest effort to make publicly available the full text of a dozen heavily circulated slave narratives, it grew into a serious library electronic publishing endeavor. As of May 2004, the site comprises around 170,000 pages (ca. 1,300 titles) of digitized primary source materials related to Southern history, literature, and culture from the colonial period through the first decades of the 20th century. In addition, it includes hundreds of pages of supporting materials; over 12,000 images, including 100 World War I posters; and several oral history interviews and songs from the collections of the University of North Carolina at Chapel Hill and other libraries.

DocSouth supplies teachers, students, and researchers at every educational level with a wide array of titles they can use for reference, study, instruction, and research.

All the digitized texts receive individual full-level MARC catalog records, created through the OCLC system. These records are added to the UNC online catalog for each text as the project progresses and are available through the OCLC database and OCLC's WorldCat service. The Library is committed to the long-term maintenance of all materials digitized for DocSouth and to the unrestricted free access for all users to all parts of the project.

About the Collection

Currently, DocSouth includes seven projects:
  • First-Person Narratives of the American South documents the American South from the viewpoint of Southerners. It focuses on diaries, autobiographies, memoirs, travel accounts, and ex-slave narratives of relatively inaccessible populations: women, African Americans, enlisted men, laborers, and Native Americans.
  • Library of Southern Literature, Beginnings to 1920 is based on a list of the 100 most important works of Southern literature prepared by the late Robert Bain, Professor of English at the University.
  • North American Slave Narratives includes the most comprehensive collection of narratives of fugitive and former slaves published in broadsides, pamphlets, and book forms in English up to 1920, as well as many biographies of former slaves.
  • The Southern Homefront, 1861–1865 documents non-military aspects of Southern life during the Civil War, especially the unsuccessful attempt to create a viable nation state as evidenced in both public and private life.
  • The Church in the Southern Black Community, Beginnings to 1920 traces how Southern African Americans experienced and transformed Protestant Christianity into the central institution of community life.
  • The North Carolina Experience, Beginnings to 1940 is an ongoing digitization project that tells the story of North Carolina as seen through representative histories, descriptive accounts, institutional reports, fiction, and other writings.
  • North Carolinians and the Great War examines how World War I shaped the lives of different North Carolinians on the battlefield and on the home front. It also examines how state and federal government responded to wartime demands. The site focuses on the years of American involvement in the war (1917–19) as well as the war's legacy in the 1920s.
As of May 2004, DocSouth includes
  • 1,370 SGML files (*.sgml), with texts ranging from 8 KiB (broadsides, etc.) to 4 MiB (books of several hundred pages);
  • 14,900 HTML files (*.html);
  • 13,508 JPEG files (*.jpg); and
  • 19 audio files.

The total size of the collections is over 3,500,000 KiB (roughly 3.3 GiB).

We clearly underestimated our potential audience at the beginning of our endeavor. Having initially targeted college campuses, we were pleasantly surprised to discover that we were reaching a much larger universe of readers, ranging from elementary school students and home schoolers to Hollywood screenwriters and genealogists.

The DocSouth readership is already worldwide and still growing. The most meaningful indicators of its present usefulness and its future potential are the readers' comments and questions. They reveal a gratifying multiplicity of uses, especially by those who would not otherwise know that such resources even exist. Selected readers' comments and responses to DocSouth materials can be read in Keep Up the Good Work(s). Readers Comment on Documenting the American South, selected and with a preface by Dr. Joe A. Hewitt, Associate Provost for University Libraries. These comments are a powerful testimony to the service that digitized collections provide to millions of readers who could not otherwise use these resources without visiting the libraries, archives, museums, and repositories where they are stored and preserved.

Nature of the Content

DocSouth includes mainly 19th-century and early 20th-century published texts, with a few titles published in the 17th and 18th centuries. The most common genres include autobiographies, biographies, essays, travel accounts, poetry, diaries, letters, and memoirs.

Text files include all pictorial materials within a text, e.g., covers, spine, frontispiece, title page, verso, and illustrations. In addition, several manuscripts (diaries and letters) have page images associated with and linked to the transcriptions.

DocSouth also includes an ‘image-based’ component of 100 posters and several artifacts from the WWI period, as well as a small audio collection of oral history interviews and songs. We plan to expand these sections in the near future.

All the titles are in English. Occasional occurrences of other languages (mostly European) are marked with an appropriate ID in the <language> element.

Nature of the Encoding

All the texts are encoded with the P3 TEI Lite (version 1.6) DTD. No extensions have been used.

All the texts in the DocSouth collections followed in-house guidelines written specifically for Author/Editor, a commercially available SGML-aware editor 2 . These guidelines are based on recommendations for ‘basic content analysis’ (level 4) in the Digital Library Federation's TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices and contain detailed descriptions of all tags used in the DocSouth collections, along with several examples from the texts. All DocSouth encoded titles can be distributed as stand-alone texts.

These guidelines proved to be effective and produced a large collection of standardized texts. As a result, the legacy data has:
  1. all attribute values in quotation marks;
  2. element names in the correct case; and
  3. no unclosed tags.
The following encoding conventions are recorded, if appropriate, in the Editorial Declaration along with any ‘oddities’ (e.g., errors in original pagination, etc.):
  1. Original grammar, punctuation, and spelling have been preserved.
  2. Encountered typographical errors have been preserved, and appear in red type.
  3. All footnotes are inserted at the point of reference within paragraphs.
  4. Any hyphens occurring in line breaks have been removed, and the trailing part of a word has been joined to the preceding line.
  5. Indentation in lines has not been preserved.
  6. Running titles have not been preserved.
  7. The long s, which was used routinely in eighteenth-century English printing, but which looks like an f to today's reader, has been printed as an s in the text of this electronic edition.
  8. Catchwords on every page of the original have not been preserved.

In addition, we preserve such textual features as original pagination and emphasized words (e.g., bold, italic).

[Modification of the DTD.] As mentioned above, several manuscript materials (letters and diaries) have page images associated with and linked to the transcription. For this purpose, an ENTITY attribute was added to the <pb> element (examples are provided with the Migration Samples). This issue, i.e. linking page images to page transcriptions, has been raised many times in the TEI community and requires close attention on the part of the TEI-C.

Motivations for Migration

We had several reasons for migrating from SGML to XML:
  • A willingness and a need to revise current encoding and production practices. This is usually not an easy task in a production environment, but its importance should not be underestimated.
  • An absolute requirement to revisit earlier files, run them through a stricter validating parser, and correct validation errors.
  • And, most importantly, the ability to take full advantage of XML tools and technologies, both those already available and those to come.

Barriers to Migration

The conversion per se did not present a major issue, as we had implemented uniform encoding practices since 1997 and had a large body of legacy data. Several problems were encountered with texts encoded before 1997, mainly inconsistent encoding and the absence of strict validation tools. Beyond that, we needed extra time to learn how to use several new tools, to write scripts, and to make them work.

That said, I will mention barriers that many institutions encounter when launching any new initiative: time constraints, staffing issues, and a previously approved work plan facing a deadline. In our case, we had to work on several concurrent projects that could not be postponed, as they were funded by federal agencies and had to be finished on time. Finally, we already had in place a well-functioning, time-tested system that we would need to change, at least partially.

About the Migration Samples

Our sample files represent a wide range of genres and encoding challenges encountered in DocSouth collections:
  1. Texts with numerous letters or excerpts of articles incorporated within a narrative (in contrast with collections of letters themselves), which need to be encoded within <q> <text> <body> <div1 type="letter"> <opener>, etc.
  2. Large files that need to be presented in chunks (e.g., in chapters or other smaller divisions).
  3. Texts with hundreds of illustrations.
  4. Files with many tables and statistical data that raise formatting and presentation problems.
  5. Texts with marginal notes.
  6. Texts with several footnotes that we place inline within the relevant paragraph.
  7. Texts with a modified DTD, where an ENTITY attribute was added to the <pb> element in order to link a transcription to a page image in manuscripts.

Notes on the Conversion Process

The work on the conversion has been done in two stages.

In our first effort to migrate the legacy data, we followed the earlier draft of the recommendations that the migration group had produced. Though those recommendations included almost the same set of actions as the current version, the osx version available at the time did not include an -xndata switch. That forced us to work around the limitation with a series of additional steps in a shell script. As NDATA entity declarations need to be kept for unparsed entity references to resolve, we did this by adding a line to the conversion script:
grep '<!ENTITY' $xmlfile > entity
and reinserting its output into the new file when we added the doctype declaration. This put the list of entities (stored by the grep in a file named entity) between the brackets of the doctype declaration. It was a rough solution, but it worked for our needs at the time. However, the work was not finished then, as we needed to concentrate on meeting previously established deadlines for other projects.
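
A rough sketch of that workaround (the DOCTYPE and file names shown are illustrative, not the actual script):

  # Keep the entity declarations that the older osx discarded ...
  grep '<!ENTITY' $xmlfile > entity
  # ... and splice them into the internal subset of the new DOCTYPE,
  # then append the rest of the converted document.
  {
      echo '<!DOCTYPE TEI.2 SYSTEM "teixlite.dtd" ['
      cat entity
      echo ']>'
      cat $xmlfile.body
  } > $xmlfile.new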

The second and final conversion took place toward the end of 2003. As it happened, this forced delay in our effort to convert the DocSouth collections to XML worked to our advantage: by that time, a new and improved version of osx was available and the Migration WG had finalized its recommendations.

We wrote a shell script cnvrt.bash that included several steps of the conversion process outlined in the WG recommendations. The script can run either in batch mode or on individual files.

The script includes the following steps (a simplified batch-mode sketch follows the list):
  • call up a file
  • create a temporary SGML file that is submitted to a parser
  • run osx on the temporary SGML file and create a temporary XML file (see the osx man page for the switches): osx -xlower -xcomment -xempty -xndata -xno-expand-internal -xno-expand-external -lhttp://docsouth.unc.edu/dtds/teixlite.dtd $sgmltemp > $xmltemp
  • strip out the external and internal parameter entity declarations in the DOCTYPE prior to XML parsing: ex -s -S /public/html/docsouth/bin/fix-ent.ex $xmltemp
  • escape ampersands: sed -e 's/\&/||+|/g' < $xmltemp > $xmltemp.tmp
  • use @TEIform to normalize element names and then strip @TEIform: xsltproc /public/html/docsouth/bin/fix-teiform.xsl $xmltemp.tmp > $xmltemp.tmp2
  • run Sebastian Rahtz's tei2tei.xsl XSLT script to normalize element and attribute names and cases: xsltproc /public/html/docsouth/bin/tei2tei.xsl $xmltemp.tmp2 | sed -e '/^ *$/d' -e 's/||+|/\&/g;' > $xmltemp
  • pull out the DOCTYPE declaration from the temporary file: ex -s -S /public/html/docsouth/bin/strip.ex $xmltemp.tmp
  • concatenate the original DOCTYPE and the processed XML file: cat $xmltemp.tmp $xmltemp > $xmlname
  • clean up the DOCTYPE declaration: ex -s -S /public/html/docsouth/bin/dec.ex $xmlname
  • escape any ampersands not in entities to &amp;: ex -s -S /public/html/docsouth/bin/amp.ex $xmlname
  • parse the result with xmllint and send any errors to a log file: xmllint --valid --noout $xmlname &> $xmlname.error.log
  • if the log file is empty, use xmllint to clean up the DOCTYPE declaration and delete the empty log
  • remove all temporary files
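
For batch mode, the per-file steps above can be wrapped in a loop along the following lines (a simplified sketch, not the actual cnvrt.bash; the paths and the convert_one function standing in for the steps above are hypothetical):

  #!/bin/bash
  # Simplified batch driver: run the per-file conversion on every SGML
  # file in a directory and keep an error log only for files that fail.
  for sgmlfile in /path/to/sgml/*.sgml; do
      xmlname="${sgmlfile%.sgml}.xml"
      convert_one "$sgmlfile" "$xmlname"        # the steps listed above
      xmllint --valid --noout "$xmlname" 2> "$xmlname.error.log"
      [ -s "$xmlname.error.log" ] || rm -f "$xmlname.error.log"
  done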

Summary and Evaluation

In summary, the conversion process was straightforward in most cases and presented only the few problems already mentioned. However, it is not a completely trivial undertaking: it takes some time to learn the new tools and to make the scripts work.

In our case, 96 to 97% of the texts form a large collection of standardized texts encoded according to the DocSouth in-house encoding guidelines. The remaining 3–4% (approximately 50 texts) were encoded earlier (1994–1995) and were checked for conformance with existing in-house encoding practices before they were converted to XML. That process of checking for conformance, validating the old files, and correcting the validation errors we encountered took much more time than we had initially planned.

Finally, as discussed above, since our texts use a standard subset of the TEI Guidelines (P3 TEI Lite, version 1.6), we did not have to migrate our DTDs and could use the prepackaged TEI XML DTD (teixlite.dtd).

The Victorian Women Writers Project

About the Project

The Victorian Women Writers Project began in the spring of 1995 to produce highly accurate transcriptions of works by British women writers of the 19th century. The works, selected with the assistance of the Advisory Board, include anthologies, novels, political pamphlets, religious tracts, children's books, and volumes of poetry and verse drama. The project is supported by the Digital Library Program and the Library Electronic Text Resource Service (LETRS) at Indiana University. The project's General Editor is Perry Willett, who is currently Head of the Digital Library Production Services at the University of Michigan.

About the Collection

The collection comprises over 150 works by more than forty writers, including Mathilde Blind, Mary Elizabeth Braddon, Caroline Norton, and Lady Jane Wilde. The collection contains 187 files, totalling about 50 MiB in size.

The texts are encoded in SGML using the P3 TEI Lite DTD. The audience for the collection is scholars, students, and general readers interested in Victorian literature and culture.

Nature of the Content

The content is varied and represents a number of genres, including poetry, drama, fiction, and non-fiction prose. About forty percent of the texts have external image files associated with them. The base language of all the texts is English, but other languages, including French, German, Greek, and Italian, are represented.

Nature of the Encoding

The character encoding is ASCII; non-ASCII characters are represented by entities declared in the internal DTD subset, rather than in external entity files.

The level of markup encoding is level 4, "basic content analysis," as described in the TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices document (on-line at http://www.diglib.org/standards/tei.htm). This level of encoding aspires to "create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text."

The SGML is normalized and so does not make use of tag minimization or unquoted attributes.

Motivations for Migration

The main motivation behind the migration was to be able to take advantage of the many XML and XML-related tools and technologies that are now available.

Barriers to Migration

There were no significant barriers to migration.

About the Migration Samples

We chose the Victorian Women Writers project as a migration sample because it represents a moderately sized collection with a fair bit of variation in terms of genres. It is also our oldest self-produced collection.

Although the collection represents a number of different genres, it did not pose any particular challenges.

Notes on the Conversion Process

We followed the procedures laid out in the earlier TEI migration guidelines. We used osx for the general SGML to XML conversion, and the tei2tei.xsl stylesheet to convert element names to camelCase. Internal entity declarations from the SGML files were replaced in the XML files with parameter entities pointing to external entity files, whose standard entity declarations expand to XML-style numeric character references. For instance, &trade; expands to &#x2122;.

The conversion was not difficult, and after conversion and the insertion of a new DOCTYPE declaration, with pointers to an XML TEI Lite DTD and the appropriate external entity files, all the files validated. The validation process occasionally revealed erroneous markup, such as the unnecessary use of an entity reference for a common character like a colon or hyphen. These few errors were fixed manually.
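
For illustration, the per-file sequence amounts to something along the following lines (the file names are hypothetical, xsltproc stands in for whichever XSLT processor is at hand, and the osx switches shown are the ones used in the other reports in this document):

  # SGML to XML, with lower-cased element names and explicit empty elements.
  osx -xlower -xcomment -xempty -xndata vwwp-sample.sgm > vwwp-sample.tmp.xml
  # Normalize element names to TEI P4 camelCase.
  xsltproc tei2tei.xsl vwwp-sample.tmp.xml > vwwp-sample.xml
  # After the new DOCTYPE and entity-file pointers are inserted, validate.
  xmllint --valid --noout vwwp-sample.xml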

Summary and Evaluation

The process was very straightforward. It took one afternoon to convert 187 files. A fortunate byproduct of the conversion was the detection and correction of a few errors and instances of bad markup practice.

Thesaurus musicarum italicarum

About the Project

The Thesaurus musicarum italicarum is a corpus of digitized sources of Italian music theory dating from before the end of the eighteenth century. The emphasis lies on works originating in the sixteenth and early seventeenth century. The TMI was initiated in 1995 by the former Department of Computers and Humanities at Utrecht University (which merged into the Institute of Information and Computing Sciences in 1999). As an ongoing project, the TMI is now also sustained by a group of researchers from several European countries and the US. Due to lack of funding, work on TMI has slowed down since 2001, but we hope to reactivate the project in the not too distant future.

The TMI comes in two flavours, a website (TmiWeb) and a (small) series of CD-ROM publications. In TmiWeb, treatises are available in a straightforward presentation format. Mark-up is relatively simple and optimized for searching and browsing, and there are no facsimiles. Part of the content of TmiWeb is work in progress or, ideally, dynamic in character. The CD-ROM format is chosen for editions that have more elaborate and stable markup. There are more presentation options and, to allow verification, sources are also available in facsimile.

We employ DynaWeb 4.3 as a server for TmiWeb. The software used for our first CD-ROM was the Panorama Viewer; the second one uses DynaText 4.3.

About the Collection

Thirty treatises have been published in the TMI so far; the total file size of the published treatises is about 18 MiB of SGML. Some 20 more treatises are in various stages of preparation. The audience for the TMI consists primarily of students and researchers of music (history, theory, historical performance), as well as those studying or researching the history of science (especially mathematics and physics), the history of ideas, and languages.

Nature of the Content

Music treatises are generally prose monographs. Some are in verse, in dialogue form, or consist of a collection of letters. The basic language is Italian, with considerable regional influence (Tuscan, Venetian, Neapolitan). There are many terms and quotations in Latin and Greek, and some in Hebrew. The size of the treatises in the TMI ranges from 10 to 450 pages; both bigger and smaller examples are known to exist. Interesting features abound, including pagination errors, page headers from another book, wrong page order, errata lists with new mistakes, indices with entries pointing to non-existent pages, etc. More interesting still are the multiple editions of treatises: there are four editions of Pietro Aaron's Toscanello, encoded as one SGML instance, one of which contains numerous marginal notes and illustrations (!) by a sixteenth-century reader.

Treatises often contain non-textual material, notably:
  • Illustrations, such as portraits or depictions of musical instruments;
  • Diagrams, illustrating intervals, proportions or calculations, often containing text;
  • Music examples with or without text;
Music examples are transcribed in modern notation and made available as graphics files; MIDI files may be provided for the non-trivial examples.

Nature of the Encoding

The treatises are encoded in ISO 8859-1. Entity references are used for all other characters. The following standard character sets are used:
  • ISOdia
  • ISOnum
  • ISOpub
  • ISOlat2
  • TMIgrk (classical Greek)
Project-specific entities are defined in the file spchar.ent. These include:
  • symbols for Renaissance musical notation
  • abbreviations occurring in the text, such as ‘q́’ for ‘que’, and vowels with macrons to indicate a following ‘m’ or ‘n’

The level of encoding (as described in the Digital Library Federation's TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices) is 3, 4, or 5, depending on the treatise. Several treatises on the website clearly fall in level 3. On the other hand, treatises prepared for CD-ROM publication have extensive text-critical markup (variants, corrections, normalisation) and features such as personal names, cross-references, and citations of books and music.

TMI uses the full P3 DTD with the following tagsets:
  • TEI.prose
  • TEI.linking
  • TEI.transcr
  • TEI.textcrit
  • TEI.names.dates
  • TEI.figures
There are extension files for DTD and entities. These alter a number of content models, add attributes, and define a few new elements, such as <EXAMPLE> for music examples and <OMIT> for characters we want to suppress at some stage during processing (notably hyphens).

The texts are XML-like in several respects: tag minimization techniques are not used, and all attribute values are quoted. Element names, however, are usually not in camelCase; they tend to be either entirely lower case or entirely upper case. There are many references to SDATA entities.

Motivations for Migration

There are no urgent reasons to migrate soon. Less pressing reasons include:
  • Contributors more and more frequently submit their texts in XML, which currently must be converted to SGML (!);
  • The migration would be a good occasion to review our extensions and text-entry procedures;
  • The available tools, of course;
  • The department's research programme in Content Engineering: we may want to use the TMI as a testbed for XML-related research.

Barriers to Migration

From a technical viewpoint, the most important barrier is SDATA. Almost every treatise discusses or otherwise employs non-standard (Renaissance) musical symbols, often within paragraph content. Currently we use our own font for these symbols, most of which are lacking in Unicode. However, the principal barrier to migration is the position of the department. In the past, when the Department of Computers and Humanities was part of the Arts faculty, the production of digital resources was a core task. Since we have become part of Computing Science, these activities have become very peripheral. There are two ways out:
  • Find new funding;
  • Make the migration of TMI part of a core research project.
I'm working along both of these lines.

About the Migration Samples

The samples I submitted to the samples directory are Toscanello and Trattato... di tutti gli tuoni, both by Pietro Aaron, and Gioseffo Zarlino's Le istitutioni harmoniche. The Trattato... di tutti gli tuoni is pretty simple. The Toscanello is very complex, as it contains full text-critical markup, including variants from the four existing editions of the work. Le istitutioni harmoniche is simply big: full of illustrations, music examples, SDATA entity references, personal names, and links to other works, and generally rich in markup. For testing purposes, I've used only Aaron's Trattato delle mutationi, which is one of the simplest texts we have.

The main challenges are the SDATA entities and the DTD modifications.

Notes on the Conversion Process

Conversion was done under Windows XP, which made it impossible to use osx. I've used sx instead. For validation, I used nsgmls. The steps were as follows (a condensed sketch follows the list):
  • Validation against P3: the document was valid;
  • Validation against P4: many parsing errors;
  • Commented many things out in the extension files -> validation succeeded. This was possible mainly because the Trattato delle mutationi is very simple and hardly employs the DTD modifications;
  • SGML -> XML with sx -xlower -xcomment -xempty -xndata -bISO-8859-1. Issues here:
    • parsed external entities (files containing entity declarations that are common to all document instances) are included in the internal subset;
    • SDATA: I ‘fixed’ these by providing mock CDATA entities;
  • applied tei2tei.xsl. Problem: removal of the TEIform attribute;
  • validation with nsgmls succeeded. (nsgmls is not a very handy tool for this, as it needs a different set of environment variables for XML, and doesn't seem to be able to work with UTF-8. Is there a better command-line tool for validating XML?)
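
A condensed sketch of the conversion steps (the file names are hypothetical, and xsltproc is used here for the stylesheet step):

  # SGML to XML, keeping ISO 8859-1 as the encoding.
  sx -xlower -xcomment -xempty -xndata -bISO-8859-1 mutationi.sgm > mutationi.tmp.xml
  # Normalize element and attribute names with tei2tei.xsl.
  xsltproc tei2tei.xsl mutationi.tmp.xml > mutationi.xml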

Summary and Evaluation

The outcome of my migration experiments was not wholly satisfactory. The problem areas are:
  • DTD extensions (seem solvable);
  • Physical structure of materials is altered by resolving external parsed entities (can be fixed manually); and
  • SDATA (no satisfactory solution).
So long as there is no solution for the SDATA characters that are lacking in Unicode, XML migration is not likely to happen. When a solution is available, migration may involve some manual work, but given the size of the corpus this cannot be a fundamental objection.
Notes
1.
In <figDesc> of <figure>.
2.

In particular, the ultimate ci command in tb_conv.bash did not work using the script variable indirection method used for all the other commands. I never figured out why not.

Also note that I did not know about the rlog command at the time I started this script. If I had, I would probably have had the script take working files as arguments and used rlog -R to obtain the RCS file paths.

3.
The work on wwp-store took place months after the work on wwp-doc, so re-checking this was a sensible thing to do.
4.
The error is that <mw> (our renamed <fw>) cannot be a child of <div>.
5.
I.e., the entity reference close delimiter, ‘;’.
6.
I got away with
qrr/\(%[a-zA-Z][a-zA-Z0-9.-]+\)\([ )]\)/\1;\2/
if I recall correctly, because I knew that there were no 1-character parameter entities, and that every parameter entity ended with either semicolon (in which case it should not be changed), space (no tabs), or close-parenthesis.
2.
Guidelines for in-house DocSouth encoding practice are available at DocSouth and from the TEI Consortium site Teach Yourself TEI, under ‘Guides to Local Practice.’