TEI MI W 05 (draft)
TEI XML Migration Group Work Plan and Status Report


Overview

The goal of the workgroup is to recommend strategies, procedures, and tools for converting SGML TEI data to P4 XML. The workgroup consists of two subgroups: technical experts, who will recommend specific tools and procedures for data conversion, and repository representatives, who will test those tools and procedures on their own SGML data and document the results. The group will produce two reports (a strategic document and a technical document) and a series of case studies describing specific migration projects in detail.

Activity prior to the first meeting

Each repository group member was asked to identify a few SGML data samples from their holdings that present particular migration challenges. The samples, along with DTDs, associated extension files, and readme statements, were made available to the whole workgroup via an FTP site. At the same time, the technical group began using an artificial data sample to experiment with currently available migration strategies, using the TEI conversion FAQ as a starting point. In addition to the tools mentioned in the FAQ, the group discussed OpenSP, a currently maintained version of the SP toolkit whose sx converter has been modified to provide more options for entity handling, and considered recommending further modifications to it.

The technical group also developed a list of survey questions for TEI data managers. The survey queries managers about their encoding practices, especially with regard to SGML-specific features, as well as their attitudes toward and experience with XML migration. This survey was sent to the TEI-L list but prompted few initial responses.

First meeting: October 13-14, 2002

The official meeting minutes (TEI MI M 01) are publicly available on the TEI website.

The technical group began the meeting by reviewing the draft charge (TEI ED W72) and revising the objectives and the proposed timeline. During the course of the meeting, the group 1) defined the scope of the workgroup activities; 2) developed a plan to survey the TEI user community and elicit SGML data samples; 3) continued its discussion -- initiated via email before the meeting -- of technical recommendations and tools; and 4) developed a basic structure for the final reports, with each group member accepting responsibility for a section of the technical report.

Scope of activities

The group decided to focus primarily on strategies for migrating P3 SGML document instances to P4 XML, although the reports will provide some general discussion about migrating DTD extensions, catalog files, and the processing environment.

Advocacy is not an explicit part of the group's charge. However, the recommendations will point out the advantages of conversion to P4, particularly the fact that P3 is no longer supported.

While the group itself will not undertake any software development, it may express the need for new tools, or modifications to existing tools.

Survey

Because data samples from the workgroup members may not provide a broad enough range of current encoding practices, the group will need to survey the larger TEI user community. Since email questionnaires to TEI-L and other lists tend to be ignored, the group devised a more targeted approach:

  1. A master list of projects using the TEI will be assembled. Projects will be identified via the projects page on the TEI-C website, suggestions from the working group members, web searches, and queries to selected mailing lists.
  2. The workgroup will contact representatives from each project on the master list via email. This initial email will describe the workgroup and its objectives, pose a series of questions about the project's encoding practices (based on the survey that went out to the TEI-L list in October), and request a small data sample. Non-respondents will be contacted by phone and asked again for a reply.
  3. The survey questions and data samples will be divided up among the workgroup members and analyzed against a checklist designed to identify projects that pose interesting conversion challenges.
  4. The group members will follow up with a selected number of projects, depending on the number and nature of responses received. This follow-up contact might include a request for DTD extensions and other associated files, as well as questions about the processing environment and broader institutional practices.
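
The checklist in step 3 has not been drawn up yet. As a rough illustration only, part of such a check could be automated by scanning each data sample for SGML constructs that XML restricts, such as the comment forms noted under "Technical recommendations and tools". The function name, the labels, and the patterns below are assumptions made for this sketch, not an agreed checklist, and the regular expressions are heuristics rather than a real SGML parse.

```python
import re

# Hypothetical checklist entries: a label paired with a pattern for a
# construct that needs attention during SGML-to-XML conversion.
CHECKS = {
    # <!> is legal in SGML but not in XML.
    "empty comment declaration": re.compile(r"<!>"),
    # Counts every named reference, including ones XML predefines (&amp; etc.).
    "named entity reference": re.compile(r"&[A-Za-z][A-Za-z0-9.-]*;"),
}

# Matches a comment declaration up to the first "-->" that follows it.
COMMENT_DECL = re.compile(r"<!--(.*?)-->", re.DOTALL)


def scan_sample(text):
    """Count occurrences of each checklist construct in one data sample."""
    findings = {label: len(pattern.findall(text))
                for label, pattern in CHECKS.items()}
    # A "--" inside the declaration body suggests several comments in one
    # declaration, which XML does not allow.
    findings["multi-comment declaration"] = sum(
        1 for body in COMMENT_DECL.findall(text) if "--" in body
    )
    return findings


if __name__ == "__main__":
    sample = "<!> <!-- one -- -- two --> <p>caf&eacute;</p> <!-- fine -->"
    for label, count in scan_sample(sample).items():
        print(label, count)
```

Counts like these would only flag samples for closer manual review; they do not replace parsing the sample against its DTD.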

If the survey fails to generate much response, or the data samples provided are inadequate, the group may develop fabricated samples for testing purposes. The group will also attempt to locate survey information on DTD practices that was previously gathered by the TEI consortium.

Technical recommendations and tools

Following is a brief summary of specific topics discussed by the technical group.

  • Minimally invasive conversion: Data managers may desire a conversion process that preserves the appearance of the source documents -- that does not alter whitespace or insert defaulted attributes, for example. While the primary aim of conversion is to preserve the ESIS, the reports will address minimally invasive conversion as a possible goal and will recommend appropriate solutions, such as stylesheets to normalize whitespace and attributes.
  • External entities: References to external entities should be preserved (rather than automatically expanded) in the result document. As a more sophisticated alternative, the workgroup will recommend converting these entity references to XInclude elements.
  • Character entities: The recommendations will allow migrators to preserve character entity references, but will suggest that they be converted to characters or numeric character references, and will explain the need to move away from the ISO entity lists. Ideally, characters included in Unicode should be converted to their numeric character references. Characters not in Unicode, as well as glyphs with ambiguous or divergent meanings, should be converted according to one of the three methods spelled out in P4 4.2.1 (e.g. converted to CDATA, processing instructions, or markup); pros and cons of each approach will be covered in the strategic document.
  • Whitespace: Although parsed whitespace in the converted XML document must match the source document, automated conversion cannot guarantee preservation of formatting whitespace from the source. The report will provide suggestions for a post-conversion "pretty printing" process.
  • Comments: The group discussed the restrictions on comments in XML: comment declarations cannot contain more than one comment, comments can't appear inside other declarations, and empty comment declarations (<!>) are not allowed.
  • DTD conversion: The report will strongly urge that DTD modifications be made through the TEI extension mechanism, and will discuss the problems created by not doing so.
  • Tools: The technical report will describe what the available conversion tools can and cannot do, rather than advocate the use of particular tools. Nonetheless, the group assumes that sx/OpenSP is the de facto tool for basic SGML=>XML migration (research into other possible tools was conducted after the meeting and turned up few useful results, confirming this assumption).
  • Consultancy: The group discussed the desirability of a migration consultancy that could offer periodic workshops on specific topics, like the conversion of DTD extensions or SDATA entities.
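
The character entity recommendation above can be illustrated with a minimal sketch that rewrites named entity references as numeric character references. It is not a tool endorsed by the workgroup. The function name and the four-entry mapping table are assumptions for illustration; a real migration would load the full ISO entity sets a project uses. The sketch also assumes every reference ends with an explicit semicolon, and it does not skip comments or CDATA sections.

```python
import re

# Illustrative subset of ISO entity names mapped to Unicode code points.
ISO_TO_CODEPOINT = {
    "eacute": 0x00E9,  # ISOlat1
    "aelig": 0x00E6,   # ISOlat1
    "thorn": 0x00FE,   # ISOlat1
    "mdash": 0x2014,   # ISOpub
}

# The five entities predefined by XML are left as references.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def to_numeric_refs(text, table=ISO_TO_CODEPOINT):
    """Replace known named entity references with hexadecimal numeric
    character references. Returns the new text and a list of names that
    were not in the table and were therefore left unchanged."""
    unresolved = []

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        if name in table:
            return "&#x{:04X};".format(table[name])
        unresolved.append(name)
        return match.group(0)

    return ENTITY_REF.sub(replace, text), unresolved


if __name__ == "__main__":
    converted, missing = to_numeric_refs("caf&eacute; &mdash; &amp; &foo;")
    print(converted)
    print(missing)
```

Reporting unresolved names instead of guessing at them leaves characters that are not in Unicode, or whose meaning is ambiguous, to be handled by one of the methods in P4 4.2.1.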

Reports

The strategic report will discuss migration issues from a managerial perspective, with an emphasis on planning and decision-making. The technical report will describe the mechanics of conversion in fine detail; it will provide solutions to specific conversion problems as well as a recommended conversion workflow.

A tentative structure for the final reports has been established:

"Strategic Considerations in Migrating TEI documents from SGML to XML" [TEI MI W 02]

  • Challenges, opportunities, and motivation
  • Types or scope of migration
    • P3->P4
    • P4->P4
  • Areas of migration
    • document instances
    • DTD extensions
    • catalog files
    • processing environment
  • Levels of migration
    • easy conversion
    • "minimally invasive" conversion
    • conversion that maximizes XML tool usability
    • conversion that anticipates P5
  • Appendix: potential impact of future versions of the Guidelines on migration issues

"Practical Guide to the Migration of TEI Documents from SGML to XML" [TEI MI W 03]

  • Converting DTD extensions
  • Converting SDATA entities
  • Converting document instances
    • whitespace
    • comments
    • prolog
    • file structure (e.g. external entities)
  • Conversion tools
  • Recommended workflow

The technical subgroup will be responsible for drafting the technical report; the main sections of the report have already been assigned to individual members. The workgroup chair will draft a skeletal version of the strategic report, which will be fleshed out by the repository representatives.

Each repository representative will also write up a case study, based on his or her experience testing the group's draft recommendations for migration. A generic template will be provided for writing up these results.

Proposed future activities

In the next two months, the technical group will complete its draft report and circulate it to the repository group, who will test the report's recommendations on their own data and begin writing up the results according to the case study template. Also during this period, the workgroup will begin the survey process described above by compiling the master list of TEI projects and making initial contact with the project representatives.

The second workgroup meeting, which will include both the technical and repository subgroups, will take place in late January or early February 2003 at the University of Maryland. At the meeting, the repository group will present their case studies and suggest any necessary modifications or enhancements to the technical report. A portion of this meeting will also be devoted to developing a draft of the strategic report.

After the second meeting, the workgroup members will continue the survey by analyzing the query responses and data samples and requesting additional information when it is needed. The technical group members will revise their report based on both the survey results and the feedback from the repository group. The repository group will complete their case studies and continue to develop the strategic report.

The third and final workgroup meeting has not yet been planned, but will be scheduled to coincide with the spring meeting of the TEI Council if at all possible. According to the workgroup charge, this meeting is intended for the technical group only, but if funding permits, the repository group will be invited to attend as well. The attendees will finalize the two reports and discuss possible TEI migration efforts in the future.

Summary timeline of proposed future activities