Electronic Textual Editing: How and Why to Formalize your Markup [Patrick Durusau]



After even a brief perusal of the TEI Guidelines, a chapter on ‘formalizing’ markup may seem like one to read later, if at all. TEI markup is formal and complex to the point of intimidating new users with its array of choices and options for encoding. ‘Formalizing’ markup in this chapter refers to documenting the choices made by the encoder or encoders for recording particular features in a text.

Out of the welter of choices and options presented by the TEI Guidelines, encoding projects should document what markup options will be used by the project. Such documentation should include: 1. features of the text(s) to be encoded; 2. elements to be used for such features; 3. attributes for the elements used; and, 4. the range of values for each attribute that is allowed on any element. Markup makes explicit structures in a text usually (but not always) represented by formatting. Formalizing markup is the process of recording those markup choices. The process of formalizing markup will continue throughout the project, but in the early stages of any project it should be the major focus, before the task of imposing markup begins.

The process chosen for formalizing markup choices will vary from project to project, depending upon funding and other constraints. There is no project so small that it will not derive some benefit from choosing and following a process of formalizing the markup used by the project. Those benefits include: cost and time savings, ease of use, longevity, and lower error rates, among others.

A brief example demonstrates the reasons for a project to formalize its markup choices. The sample passage is typical of an academic journal and can be used to demonstrate to colleagues the need to record markup choices. The example is followed by an outline of the issues that can be uncovered by using this sample or others from a particular project. Finally, suggestions are made that can be adapted to particular projects to avoid the pitfalls of not formalizing markup choices.

Why markup decisions should be formalized

There are any number of abstract arguments that can be made for formalizing markup decisions: analysis of markup decisions by types of text, documentation of the work of a project, reuse of markup decisions for other projects. The most compelling arguments, however, are better demonstrated by example. Photocopy the following example, distribute it among the members of an encoding project, and ask them to mark what they would encode and how they would encode it:

In his discussion of the Eucharistic doctrine which is presented in The Babylonian Captivity of the Church (1520), Luther not only argues against the Catholic doctrine of the Eucharistic sacrifice but also against transubstantiation. Luther explicitly states that he was influenced on this issue by Peter D'Ailly, a Nominalist follower of William of Occam. D'Ailly and other Nominalists thought that transubstantiation was not the only possible way that Christ could be present in the Eucharist, since if God wished he could also make Christ present along with the substance of the bread and the wine. From a Nominalist perspective transubstantiation requires two actions of God, namely, the annihilation of the substance of the bread and the placement of Christ's body under the species. Therefore, consubstantiation would be a simpler account of Christ's Eucharistic presence. (Osborne, 64-65)

While the selection has been presented in block style for this work, it appeared originally as part of a single paragraph. After gathering responses, list all the features noticed and how many of the participants noticed the same features and, more importantly, suggested the same encoding method for the feature.

To start with a small part of the example, consider the rendering of ‘The Babylonian Captivity of the Church’ in the opening sentence. A possible first reaction is to simply mimic the presentation of the printed page, which may be all that is possible in some cases. If that reaction is followed, the encoder would use the <hi> element of TEI with an appropriate rend attribute to achieve that effect. Another likely reaction is that ‘The Babylonian Captivity of the Church’ is actually a title of some kind and that it should be encoded using the <title> element. Another encoder realizes that it is a title but wants to construct a reference to an online English translation of the work (http://www.ctsfw.edu/etext/luther/babylonian/babylonian.htm) using the <xref> element.
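The three reactions just described might produce encodings along the following lines (a sketch only; the rend value and the use of the target attribute to hold the URL are illustrative assumptions, not prescribed usage):

    <!-- mimic the presentation of the printed page -->
    <hi rend="italic">The Babylonian Captivity of the Church</hi>

    <!-- record the phrase as a title -->
    <title>The Babylonian Captivity of the Church</title>

    <!-- link the title to the online English translation -->
    <xref target="http://www.ctsfw.edu/etext/luther/babylonian/babylonian.htm"
      >The Babylonian Captivity of the Church</xref>

All three are legitimate TEI encodings of the same span of text, which is precisely the problem a project must resolve in advance.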

Assuming that the default is no markup at all, the following list contains (in order of appearance) portions of the text and possible markup options for each; it is by no means exhaustive:
  • Eucharistic <index>, <term> (3)
  • The Babylonian Captivity of the Church <hi>, <xref>, <title> (4)
    • Babylonian <index>, <term>, <placeName>, <settlement> (5)
    • Captivity <index>, <term> (3)
    • Church <index>, <name>, <ref> (4)
  • 1520 <date> (2)
  • Luther <index>, <name>, <persName> (4)
  • Catholic <index>, <name>, <ref> (4)
  • Eucharistic sacrifice <index>, <ref> (3)
    • Eucharistic <index>, <term>, <ref> (4)
    • sacrifice <index>, <term>, <ref> (4)
  • transubstantiation <index>, <term>, <ref> (4)
  • Luther <index>, <name>, <persName> (4)
  • Peter D'Ailly <index>, <name>, <persName> (4)
    • Peter <index>, <name>, <persName>, <foreName> (5)
    • D'Ailly <index>, <name>, <persName>, <surname> (5)
  • Nominalist <index>, <term>, <ref> (4)
  • William of Occam <index>, <name>, <persName>, <ref> (5)
    • William <index>, <name>, <persName>, <foreName> (5)
    • Occam <index>, <term>, <placeName>, <settlement> (5)
  • Peter D'Ailly <index>, <name>, <persName> (4)
    • Peter <index>, <name>, <persName>, <foreName> (5)
    • D'Ailly <index>, <name>, <persName>, <surname> (5)
  • Nominalists <index>, <ref>, <term> (4)
  • transubstantiation <index>, <term>, <ref> (4)
  • Christ <index>, <name>, <ref> (4)
  • Eucharist <index>, <term>, <ref> (4)
  • God <index>, <name>, <ref> (4)
  • Christ <index>, <name>, <ref> (4)
  • substance <index>, <term>, <ref> (4)
  • bread <index>, <term>, <ref> (4)
  • wine <index>, <term>, <ref> (4)
  • Nominalist <index>, <ref>, <term> (4)
  • God <index>, <name>, <ref> (4)
  • substance <index>, <term>, <ref> (4)
  • bread <index>, <term>, <ref> (4)
  • Christ's body <index>, <term> (3)
    • Christ's <index>, <name>, <ref> (4)
    • body <index>, <term>, <ref> (4)
  • species <index>, <term> (3)
  • consubstantiation <index>, <ref>, <term> (4)
  • Christ's <index>, <name>, <ref> (4)
  • Eucharistic presence <index>, <term> (3)
    • Eucharistic <index>, <term>, <ref> (4)
    • presence <index>, <term>, <ref> (4)

Markup options for grammatical or linguistic analysis, as well as attribute values on the various elements, have been omitted. Specialized markup that would record variant readings, overlapping hierarchies, editorial interventions in the text, elements peculiar to manuscripts, and the like has also been omitted. Despite all those omissions, the first sentence alone confronts the encoder with 4,423,680 possible encodings using the elements listed above (3 x 4 x 5 x 3 x 4 x 2 x 4 x 4 x 3 x 4 x 4 x 4). These combinations are not all unique, but they serve to illustrate how unlikely it is that any two encoders, or even the same encoder on different days, will make the same encoding decisions with any consistency without formal guidance.

The reader should bear in mind that this level of complexity is seen in an example text that is typical of a modern academic journal article. Manuscript witnesses, grammatical or linguistic analysis, critical editions, texts with overlapping hierarchies, drama and other more complex works exhibit a corresponding rise in the possible markup options.

The choices offered in each case for this passage are legitimate and would result in a valid TEI document. A TEI document may be valid, even though encoding choices have been made inconsistently by the encoder. Validation software gives no notice of inconsistent but valid encoding since it gauges only formal validity and not consistency of formally valid markup. Consistency in encoding is a judgment that must be made by the encoding team, even though, as detailed below, there are software scripts that may assist in that effort. If the encoding has been inconsistent, that will most often be seen during efforts to display or search the text.

Despite the emphasis on encoding texts properly, the first real test for any project is the display of the encoded text to department heads or funders. The stylesheet designer is responsible for creating the stylesheets that will convert or display the text and will test those stylesheets against samples from the project. Inconsistent encoding means that stylesheets that work with one document may or may not work with another. Given the range of possible encodings, it is very difficult for a stylesheet author to compensate for inconsistent markup. It should be assumed that such errors will first be noticed either at the public debut of the project or while demonstrating the final version to university administrators who are reviewing the project. It is difficult to explain why some portions of the text display properly while others do not, particularly if admitting inconsistent encoding practices is not an option. The problem can be passed off as a stylesheet problem, but matters are even worse when searching is attempted.

Assume that the stylesheet fails to display ‘The Babylonian Captivity of the Church’ as a title. Thinking that the stylesheet author has failed in some regard, the user attempts to search the document for titles of works cited. Oddly enough, far fewer titles appear than are known (or assumed) to occur in the text. The next search is for personal names known to occur in the text. Like the titles of works, some are found, some are not. The encoding team knows the information is in the file but the software is simply not finding it. Realizing that it is unlikely that both the stylesheet writer and the search engine failed on the same day, the encoders start looking at the encoded text.

One of the problems noticed early in their review is the differing encoding of titles and personal names in the text. Each encoder has a very legitimate argument for their choice, but the project now has several hundred texts that vary from each other in ways that are not easy to find. The stylesheet writer can adapt the stylesheets to some degree, but that will not help with searching inconsistently encoded files. Time, grant money, and energy have been spent on creating texts that can be displayed acceptably (not properly) only with difficulty, cannot be accessed accurately with markup-aware software, and will be difficult and expensive to fix. How would formalizing markup help avoid this result?

Formalizing markup: a process of discovery

The process of formalizing markup is actually a process of learning about a body of texts. Specialists know the text, but not from the standpoint of imposing markup from a fixed set of elements such as the TEI Guidelines. And it is not always obvious what TEI elements should be used to encode particular parts of a text, even if there is agreement on those features. The process of applying markup to a particular group of texts begins with ‘modeling’ the text(s) to be encoded.

Modeling a text is the process of making explicit what is understood implicitly by most readers. Few readers would not have recognized ‘The Babylonian Captivity of the Church’ in the passage cited above as somehow different from the surrounding text. As was seen from the analysis, however, it was possible to legitimately encode that portion of text in several ways. It is insufficient to rely upon readers reaching the same conclusion about that passage for markup purposes, hence the need to model the text. A model for that particular text could well decide that all titles appearing in the article should be encoded as references to online versions of the works whenever possible. That would allow for special display of the title in the running text as well as facilitating hypertext linking to the text itself.

Unfortunately for scholars (and others), the process of ‘modeling’ texts has rarely been treated outside of the technical markup literature. One of the more accessible attempts to address this important issue can be found in Developing SGML DTDs: From Text to Model to Markup by Eve Maler and Jeanne El Andaloussi. Despite its formidable title, chapter 4, ‘Document Type Needs Analysis’, and chapter 5, ‘Document Type Modeling and Specification’, can be used as guides to analyzing documents. These chapters are written for users who are developing a new DTD rather than using an existing one, but the same modeling considerations apply in either case. (Other treatments that are readily available include Structuring XML Documents by David Megginson and The XML and SGML Cookbook by Rick Jelliffe.)

Maler and El Andaloussi present a modeling language and sample forms for the analysis of any text. While the suggested process will be unfamiliar at first, scholars embarking on long-term encoding projects would be well advised to adapt both the modeling language and the process into written documentation for their projects. It will not only aid in the process of analysis but will also result in documentation of markup choices for an intellectual history of the project. The alternative is ad hoc encoding decisions and practices, or undocumented ones, which are almost as pernicious.

To model texts for an encoding project, a sample of the texts to be encoded should be chosen as a starting point. The sample should fairly represent the full range of materials that the project will be encoding. This sample set should be examined, every feature to be encoded marked, and the suggested encoding recorded. Since it is a sample of a larger body of material, it is almost certain that features will later appear in texts that were not discovered during this initial survey. Documenting encoding choices at the beginning of the project will allow new features to be encoded consistently with choices already made. The process of analysis and modeling is never quite finished until a project reaches the last text to be encoded; new features may be discovered even there. An important goal of any encoding project is to ensure consistent encoding of features in its texts, whether at the beginning or the end of the project.

When developing a proposed encoding for texts, scholars should be wary of presuming that some features are too ‘obvious’ to merit recording an encoding. Markup is the process of making both the obvious and the not-so-obvious structures in a text explicit rather than implied, so every feature that is to be encoded in the project should be noted, along with the appropriate markup. Even something as common as a paragraph should be noted and illustrated with examples.

Consider the case of paragraphs in late 19th- or early 20th-century grammars. Such works often have what appear to be numbered paragraphs followed by typographically distinct material, usually indented and set in a smaller typeface. Should both of the following blocks of material be encoded as paragraphs?

2. Among the cuneiform tablets from Tell el-Amarna, brought to light in 1887, the one which was the largest in the group happened to also be composed in an unknown language.1 Only the introductory paragraph, which takes up seven out of nearly 500 lines, was written in Akkadian. From that introduction, it was learned that the document was a letter addressed to Amenophis III by Tushratta, king of Mitanni. It was logical, therefore, at the time to assume that the rest of the letter was in the principal language of the Kingdom of Mitanni; the use of the term ‘Mitannian’ was the natural consequence of that assumption.

This name was employed by all the early students of the subject, including P. Jensen (cf. his articles in ZA 5 [1890] 166 ff., 6 [1891] 34ff., and 14 [1899] 173 ff.); L. Messerschmidt, Mitanni-Studien (MVGA 4 [1899] No. 4); F. Bork, Die Mitannisprache (MVAT 14 [1909] Nos. 1/2). Current usage restricts the term, as a rule, to the material in the non-Akkadian letter of Tushratta....

Upon examination, it appears that the second block offers additional reference material or remarks that explain or amplify the material in the preceding paragraph. The blocks do not appear to be block quotes of any sort, nor should they be encoded as such merely to capture the presentation in the text. The work also has footnotes, so it appears that the typographically distinct material is not some sort of inline footnote. The encoding decision is further complicated by a survey of the rest of the grammar, which reveals numbered paragraphs that appear entirely in this same format. (These examples are drawn from Introduction to Hurrian by E. A. Speiser, ASOR, 1941. Examples of numbered paragraphs that share the presentation of the second quote can be found at paragraphs 15, 21, 26 and 78. The example text is found in paragraph 2.) For some purposes, such as a quick review of the grammar, the user may wish to display only the ‘main’ text without these reference paragraphs appearing as well. Without facing the actual case, and for purposes of illustration only, a project could decide to encode the second block as a paragraph with a type attribute value of reference, as sketched below. The encoding team will face a number of such situations, but the primary rule should be: if a feature is to be encoded at all, that encoding must be recorded with examples of the feature to be encoded.
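Under that illustrative decision, the two blocks might be encoded along the following lines (a sketch only; whether the project's DTD permits a type attribute on <p>, and the use of n for the printed paragraph number, are assumptions made here for illustration):

    <p n="2">Among the cuneiform tablets from Tell el-Amarna, brought to
    light in 1887, the one which was the largest in the group happened to
    also be composed in an unknown language. ...</p>
    <!-- the typographically distinct block, kept as a paragraph but typed -->
    <p type="reference">This name was employed by all the early students
    of the subject, including P. Jensen ... Current usage restricts the
    term, as a rule, to the material in the non-Akkadian letter of
    Tushratta....</p>

A stylesheet could then suppress or render differently all paragraphs with type="reference" when only the ‘main’ text is wanted.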

Once all the sample texts have been marked with likely encodings, the participants should jointly review and decide on how features in the text should be encoded. It is very important that a clean version of the samples with an index of the features being encoded be maintained. This will serve as documentation of the encoding decisions as well as the basis for training materials for encoders who are hired to assist in the project.

Using The Babylonian Captivity of the Church example reproduced above, the encoding team would record (in part):
  • Place: encode using <term> and insert an <index> element. Use the index list for the form of name and the index level. Examples: Babylonian
  • Organization: encode using <term> and insert an <index> element. Use the index list for the form of name and the index level. Examples: Church
  • Title: if the work is available online, encode with <xref>, set the target attribute to needsTarget, and set the type attribute to title.
It should be noted that the use of needsTarget as an attribute value on the <xref> element for the title appearing in the article reflects a multi-pass encoding process, as sketched below. Such a process allows ‘rough’ encoding by part-time student assistants and refinement of that encoding by more skilled users later in the encoding process. Since encoding is a labor-intensive process, it is unlikely that any scholar or group of scholars will be able to personally encode all the texts for a project. In order to achieve an acceptable level of quality in encoding, it will be necessary to train encoders to insert the desired markup.
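In practice, the two passes might produce the following (a sketch; the placeholder value and type value follow the guide entry above, and the final target is the online translation cited earlier):

    <!-- first pass: a student assistant flags the title for later linking -->
    <xref target="needsTarget" type="title">The Babylonian Captivity
    of the Church</xref>

    <!-- second pass: a more skilled user supplies the actual target -->
    <xref target="http://www.ctsfw.edu/etext/luther/babylonian/babylonian.htm"
      type="title">The Babylonian Captivity of the Church</xref>

Because the placeholder value is fixed and documented, a simple search for needsTarget will find every title still awaiting refinement.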

Formalizing markup: a process of training

After the encoding team has reviewed sample texts and prepared a guide to what should be encoded and how, it is necessary to convey that information to the encoders who will actually be inserting markup into the texts. That training can be informal in smaller projects, but training of every encoder should be an explicit part of the schedule and funding of the encoding project. Not only is it unfair to expect encoders to be aware of uncommunicated encoding requirements, it will also produce poor quality work that must be corrected at a later stage of the project.

It is important in the training process that encoding guidelines be conveyed to encoders in an effective manner. For the project principals it might be sufficient to note that the bound form of the noun (in Akkadian) should be recorded in an attribute value, but that would hardly suffice for part-time undergraduate encoders. (This ignores, for the moment, that encoding at such a level would be the exclusive domain of a project specialist.) Encoding guides should be developed with the personnel hired to perform the encoding, to ensure that a common understanding has been reached on what is to be encoded and how it is to be encoded. The encoding guide should use terminology that encoders will readily recognize as likely to bear on any question they may have during the encoding process. To the extent possible, the encoding guide should contain photocopied reproductions of several examples of each feature in a text to be encoded. In addition to being an aid to the actual encoders, the process of developing such examples will uncover any misunderstandings of encoding decisions within the project. (Misunderstandings about terminology are common even on standards committees. That should be accepted as a fact of life; one goal of any encoding project is to uncover and resolve such misunderstandings.)

In general terms, an encoding guide is only as effective as the examples it contains. A guide that has no examples, or worse, incomplete or inconsistent examples, is perhaps worse than no encoding guide at all. The needs of projects and their encoders will vary, but at a minimum any encoding guide should cover every element that a particular encoder is expected to apply to the text. Note that such a guide need not, and probably should not, cover every element that will be used by the project. Each encoder needs to be trained to insert the elements that are their responsibility and not to attempt to correct or enter others. In terms of content, a minimal encoding guide would have, for every element:
  • Common name: a common name for the feature in the text (the one encoders would use)
  • Element name: the name of the element to be used by the encoder
  • Examples with no markup: examples of the feature in context with no markup
  • Examples with markup: the same examples of the feature in context with markup
Examples of text features should always appear both with and without markup. Examples with markup are by definition different from the source text being encoded. Providing examples without markup, perhaps even photocopies of examples when working from printed materials, will assist encoders in recognizing situations calling for markup.
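A minimal guide entry for personal names, drawing on the sample passage above, might read as follows (a sketch, not project-specific guidance):

    Common name: personal name
    Element name: <persName>
    Example with no markup:
      Luther explicitly states that he was influenced on this issue by
      Peter D'Ailly, a Nominalist follower of William of Occam.
    Example with markup:
      <persName>Luther</persName> explicitly states that he was influenced
      on this issue by <persName>Peter D'Ailly</persName>, a Nominalist
      follower of <persName>William of Occam</persName>.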

One effective training mechanism is to provide new encoders with the encoding guide and a non-encoded version of a text that has already been encoded and proofed for the project. The trainees can be taken through the main points of the encoding manual and then asked to encode the sample text. This allows the trainees to become familiar with the encoding manual and provides the project with a sample that can be automatically compared to a known text that has been properly encoded. There are a number of utilities for comparing XML files, and whenever possible such automatic comparison should be used to test encoding training against texts known to represent the encoding requirements of the project.

Despite the best efforts of the encoding team, encoders will encounter features in the texts that were not anticipated by the project. Encoders should be encouraged to bring such features to the attention of the project, even if it ultimately has to be decided, for consistency purposes, that the features cannot be encoded. Except in exceptionally clear cases of misreading of the encoding guide, all such questions should be recorded, along with the answer given to the encoder raising the question. Ad hoc decision making, like unrecorded encoding decisions, can lead to the same inconsistency in encoding that reduces the value of the project's work product.

Training new encoders should be viewed as only a part of the overall training process for the project. Experienced and new encoders should be brought together on a regular basis to discuss any new features that have been identified and to go over one or two recent texts. The quality of the encoded texts from the project depends upon the encoders sharing a common understanding of the markup and the texts upon which it is being imposed. Periodic review of the encoding guide, texts recently encoded, and any new encoding issues will help build and maintain the common level of understanding required for a successful project. That common understanding is particularly important for project leaders or more experienced users who ‘understand’ the required encoding better than the day-to-day encoding personnel. Such users are more likely to violate project norms and requirements because of their deeper understanding of the goals of the project. It should always be reinforced that conformance to the encoding guide is the rule for everyone encoding texts for the project. Inconsistency, even if due to the brilliance of a particular encoder, will diminish the overall usefulness of the project's result.

Formalizing markup: a process of validation

Validation of markup for a particular project is more than simply determining that the files are formally valid. Files must not only be formally valid, they must also comply with the encoding guide developed for the project. This is a particular problem with vendor-based encoding services, which do return ‘valid’ files, but files that may not comply with the markup requirements of a particular project.

One of the advantages of developing the encoding guide with examples is that it can be used in the construction of scripts to validate the encoding of particular features within a text. For example, a script can extract all the personal names that have been encoded in a particular way, as sketched below. That list can be scanned to determine whether anything other than personal names appears in it. More importantly, the list can be used to search entire files for occurrences of personal names that are not so encoded. Such a list can also be used to make sure that entries, such as index entries, use the terms specified for them. Such searching and validation would not be possible, or not easily so, without an encoding guide that sets forth requirements beyond validity for the various parts of the text.
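As a sketch of such a script, assuming the project can use XSLT for the task, the following stylesheet extracts the content of every <persName> element, one name per line, for review against the texts:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- print the content of each persName element on its own line -->
      <xsl:template match="persName">
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
      <!-- suppress all other text so only the extracted names appear -->
      <xsl:template match="text()"/>
    </xsl:stylesheet>

The resulting list can then be fed back into a search across the whole collection to find unencoded occurrences of the same names.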

The validation of files for compliance with the encoding project's requirements should be performed on every file. Records should be kept, by encoder, of every problem found and corrected, with those results communicated to the encoder along with an explanation of the error. This will provide feedback to the encoders and ensure that any problems with the encoding guide, the training, or the understanding of the encoding guide are uncovered and corrected before the project creates hundreds of files with unknown and possibly inconsistent encodings. The validation process can be largely automated, at least as far as error detection, and fills a much-needed quality assurance role for the project. Features that are likely sources of error, or that are likely to recur across a set of texts, such as personal or geographic names, terms, and titles, should be added to the validation process as they are identified.

Any number of scripting languages, from sed and awk to Perl, can be used to validate markup in files. There are also specialized searching programs, such as sgrep, which allow markup files to be searched for particular elements and their content. Encoding projects should develop scripts to validate the markup practices of the project before any texts are produced. The services of a scripting language programmer should therefore be included in the earliest planning stages of the project.

Formalizing markup: a process of review

Formalizing markup is more than modeling texts, recording markup decisions, training encoders, or even reviewing encoding work. It has all of those components, but treating each separately, or as a static step, will be insufficient to meet the needs of an encoding project. To formalize markup usefully, it is necessary to imbue the markup project itself with the habit of making explicit that which was implicit in the text.

One part of making markup explicit is making the markup process itself explicit. A periodic review of the encoding guide, the training materials, and the results of markup validation is a necessary step in that process. Such a review will uncover patterns of mis-encoding which, if they coincide with a shift to a new period of texts or type of material, may indicate problems with the encoding guide for the project. Such shifts may also indicate training problems or a lack of supervision or feedback. These problems will not resolve themselves and, to ensure a quality result from the project, should be addressed as soon as they become apparent.

Like the other suggested steps in the process of formalizing markup, this review should be periodic and should result in a written report of its findings. Beyond satisfying the project principals that the markup guidelines are being followed, such reports also lend credence to the project's goal of producing texts that meet a standard for consistent encoding.

Conclusion

Imposing markup on texts has the potential to create long-term resources that will benefit both students and scholars. Poorly done markup, however, will only consume valuable resources, result in materials that are seldom, if ever, used, and cause other scholars to avoid the benefits of properly used markup. Just as every scholar has an obligation to advance, and not retard, study in their field of interest, there is also an obligation to properly use the tools that advance the scholarly enterprise.

Markup is a tool that ranks with the printing press, the index, and the lexicon as a means of advancing scholarship. Like those other tools, markup is a means and not an end unto itself. It should be judged by its effectiveness in aiding scholars in their day-to-day work. To permit a fair judgment of its utility to scholars, markup must be done properly so that it can reach its full potential.

Formalizing markup using the techniques outlined here will result in encoded texts that serve the needs of a present-day project as well as those of future scholars who wish to use the texts. Formalized markup offers scholars the opportunity to contribute resources that will support future scholarship and research in their field. Like a monograph or article intended to advance a particular area of study, markup demands no less attention to detail and quality.

