Textual Criticism and the Text Encoding Initiative

C. M. Sperberg-McQueen

December 1994

MLA '94, San Diego

Session sponsored by Emerging Technologies Committee of MLA

In his essay "Das Kunstwerk im Zeitalter seiner technischen Reproduzierbarkeit," Walter Benjamin describes how art is affected by the development of new means of reproduction, which both allow existing art works to be copied and widely distributed and also create new forms of art dependent on mechanical reproduction, notably photography and film. [[In the essay, Benjamin identifies the characteristic difference between original art works and reproductions as the loss of what he calls the Here-and-Now, or the aura, of the original. Reproductions are not limited to a single here-and-now, they do not share the historical vicissitudes of their originals, they lack the original's chemical and physical reality, and therefore the concepts of authenticity and historical authority both become irrelevant and inapplicable to reproductions.]]

In the course of his essay, Benjamin makes a suggestive error. In brief, Benjamin believes that for any medium of technological reproduction, a copy is a copy is a copy. In a well-known observation, he writes: "Von der photographischen Platte zum Beispiel ist eine Vielfalt von Abzügen möglich; die Frage nach dem echten Abzug hat keinen Sinn."[1] This claim is of particular interest first of all because, unlike many claims in our field, it is empirically testable, and second because it proves false: some prints do appear to be, or at least are treated as, more authentic than others: the price of a print from Ansel Adams's studio, for example, is many times the price of a print from the same negative prepared by commercial remarketers. [2] Contrary to Benjamin's expectation, all prints are not created equal.

Three points are worth noting in this connection. First, aura is not an absolute but a variable quantity. Second, aura can exist even in a mechanical reproduction because even mechanical reproductions can vary. Third, what differentiates a print prepared by Adams's own studio from a mass-reproduced copy is not the art of the photographic exposure [[(after all, the two prints share the same negative)]] but that of the photographic printmaker and touchup artist. Two prints from the same negative can have unequal value because reproducible arts separate the creation of the matrix (negative) from the creation of the reproduction (print).

[[Oral only:]] Literature, too, depends on technologies of reproduction: namely, those of writing and printing. Like all mechanical reproductive processes, these are subject to variation, and their products may have more or less authenticity as a result, contrary to Benjamin's prediction. As with photography, moreover, the transmission of texts depends on arts and crafts allied with, but distinct from, that involved in the creation of what is transmitted. [[end oral only]]

[[All three of these points illuminate the condition of literature. In his essay, Benjamin mentions literature only in passing, and he stops just short of expressing overtly a commonly held view: namely, that the technological reproducibility of literature was accomplished with the introduction of printing. He writes: "Die ungeheuren Veränderungen, die der Druck, die technische Reproduzierbarkeit der Schrift, in der Literatur hervorgerufen hat, sind bekannt."[3] But despite the enormous changes introduced by printing, the technology which made literature reproducible was not printing, but the introduction of writing for literary texts. Oral literature has many of the characteristics Benjamin associates with the non-reproducible in art, though for obvious reasons it lacks those which apply only to physical artefacts. The characteristics associated by Benjamin with the technological reproducibility of art, by contrast, especially the detachment from the here-and-now of the original, are remarkably similar to some of those associated with the introduction of writing by, for example, Florian Coulmas, notably the distancing function and the reifying function.[4] ]]

[[For my current purposes it suffices to observe that the variability even of mechanical reproduction ensures that books and other written matter, whether printed or manuscript, retain some measure of uniqueness and thus of aura. This in turn has a number of implications for the study of texts, whether literary or not. Like other arts subject to technological reproduction, literature is transmitted by, and thus dependent on, but nevertheless logically distinct from, its physical carriers. Because those physical carriers inevitably vary among themselves, the question of authenticity does arise for literary works, despite Benjamin's claim that authenticity is not an issue when dealing with reproductions. Those who address the question of authenticity, we call textual critics. They, along with those scholars who study the reception and dissemination of a work, will necessarily be concerned with the individual histories of individual copies of the work, what Benjamin refers to as "die Geschichte, der es [d.i. das Kunstwerk] im Laufe seines Bestehens unterworfen gewesen ist."[5] ]]

[[At the same time, logical clarity requires a firm distinction between the creation of a text and the creation of a physical carrier for that text. Both may be part of the same social activity we call literature, but they may be performed at different times by different parties, and the same text may be preserved under wildly differing extra-textual conditions. The task of the author and those of the printer, papermaker, and bookbinder must be kept distinct, if we wish to have any clear understanding of literature as a social activity.]]

Benjamin, of course, is by no means alone in passing over in silence the uncomfortable fact that mechanical reproductions can vary, nor is it hard to find others who confuse an abstract art with its concrete means of transmission. His illusions or evasions about the stability of mechanical processes and the uniformity of their products are shared by the vast majority of our colleagues, who are, painful though it is to say it, made nervous by the basic facts of (textual) reproduction and the variation it brings about, and who consistently ignore or evade evidence of textual variation in order to speak about texts as if they were singular, unchanging objects, stable artefacts whose details are not open to question.

In this paper I want to discuss some of the more obvious issues raised by efforts to create electronic texts, and in particular electronic versions of scholarly editions. Benjamin's essay is particularly suggestive here, in the context of efforts to make literary (and non-literary) texts reproducible by new technological methods. I begin by making explicit some of my assumptions about the goals and requirements of electronic scholarly editions; in the second section I explain why my list of requirements says nothing about the choice of software for the preparation and use of scholarly editions. The third section will describe the work and results of the Text Encoding Initiative, a cooperative international project to develop and disseminate guidelines for the creation and interchange of electronic texts, and show how they relate to the requirements for electronic scholarly editions. In the concluding section, I will outline some of the implications of the TEI for electronic and printed scholarly editions, and some essential requirements for any future consensus on how to go about creating useful electronic scholarly editions.

1 Requirements for Electronic Editions

I begin by stating explicitly some assumptions I am making. Arguments could probably be made for some of these propositions, but I won't argue for them, lest I seem to be flogging a dead horse.

  1. Electronic scholarly editions are worth having. And therefore it is worth thinking about the form they should take. [[If present trends continue, we will have, within a decade or so, enough editions available in electronic form to constitute a virtual or digital library. If we want that digital library to be a usable one, now is the time to think about how it should be built.]]
  2. Electronic scholarly editions should be accessible to the broadest audience possible. They should not require a particular type of computer, or a particular piece of software: unnecessary technical barriers to their use should be avoided.
  3. Electronic scholarly editions should have relatively long lives: at least as long as printed editions. They should not become technically obsolete before they are intellectually obsolete.
  4. Printed scholarly editions have developed their current forms both to meet intellectual requirements and to adapt to the characteristics of print publication. Electronic editions must meet the same intellectual needs. There is no reason to abandon traditional intellectual requirements merely because we are using a different medium of publication.
  5. On the other hand, many conventions or requirements of traditional print editions reflect not the demands of readers or scholarship, but the difficulties of conveying complex information on printed pages without confusing or fatiguing the reader, or the financial exigencies of modern scholarly publishing. [6] Such requirements need not be taken over at all, and must not be taken over thoughtlessly, into electronic editions.
  6. Electronic publications can, if suitably encoded and suitably supported by software, present the same text in many forms: as clear text, as diplomatic transcript of one witness or another, as critical reconstruction of an authorial text, with or without critical apparatus of variants, and with or without annotations aimed at the textual scholar, the historian, the literary scholar, the linguist, the graduate student, or the undergraduate. They can provide many more types of index than printed editions typically do. And so electronic editions can, in principle, address a larger audience than single print editions. In this respect, they may face even higher intellectual requirements than print editions, which typically need not attempt to provide annotations for such diverse readers.
  7. Print editions without apparatus, without documentation of editorial principles, and without decent typesetting are not acceptable substitutes for scholarly editions. Electronic editions without apparatus, without documentation of editorial principles, and without decent provision for suitable display are equally unacceptable for serious scholarly work.
  8. As a consequence, we must reject out of hand proposals to create electronic scholarly editions in the style of Project Gutenberg, which objects in principle to the provision of apparatus, and almost never indicates the sources, let alone the principles which have governed the transcription, of its texts.[7]

In sum: I believe electronic scholarly editions must meet three fundamental requirements: accessibility without needless technical barriers to use; longevity; and intellectual integrity.

[[Oral only:]] In the case of text-critically aware scholarly editions, intellectual integrity demands at least:

[[In the case of text-critically aware scholarly editions, intellectual integrity demands at least that the edition meet the standard demands of textual criticism. It must at least contain a good or intellectually defensible text. The definition of good in this context varies with the editor, of course, and an appalling amount of ink continues to be spilt, at least in Anglo-American text criticism, in vain attempts to prove once and for all that there is only one plausible answer to the question What kind of text is good, what kind of edition should we produce?]]

[[The edition must also record text-critically relevant variants, so that specialists can evaluate the text and the work done in establishing it; the apparatus of variants must indicate in some way the degree of reliability with which the manuscript transmission allows the archetype of the text to be reconstructed. The entire transmission of the text must be surveyed and, where possible, explained. The French classicist Louis Havet puts these extended demands neatly: "Le fond de la méthode critique, ce n'est pas une appréciation immédiate des leçons connues; c'est une reconstitution historique de la transmission du texte, depuis les plus anciens manuscrits." "Tant qu'une faute reste inexpliquée, la bonne leçon demeure entachée d'un reste d'incertitude."[8] ]]

[[ Similar demands can be found in modern languages as well: "Der Herausgeber eines Textes hat über die gesamte erhaltene Textüberlieferung, noch schärfer formuliert: über jede Variante Rechenschaft zu geben." [9] ]]

[[At the same time, some selection from the wealth of material must usually be made, at least in printed editions. An absolutely complete apparatus, warns one author, will produce "giant compost heaps of variants of highly disparate quality" ("Riesenkomposthaufen von Lesarten unterschiedlichster Qualität"). In order to avoid overburdening the apparatus with variants irrelevant for the reliability of the reconstructed text, many editors include in the apparatus only `text-critically relevant' variants: that is, variants which, given the relationships thought to hold among the manuscripts, might conceivably derive from the archetype (the latest common ancestor of all extant manuscripts). Text-critical books and essays, even of the most abstruse theoretical type, devote incredible amounts of critical acumen to the task of selecting which variants to include in the apparatus and to making the apparatus more compact, easier to typeset, or easier to omit from a photomechanical reproduction.]]

[[Not all editions are selective in their apparatus, however; some include every non-orthographic variant reading of every witness, including variants clearly useless for the reconstruction of an archetypal text. This gives a clearer picture of the forms under which the text actually circulated, and is particularly useful to those tracing the reception of a work. Some editions, notably for relatively recent printed books, include orthographic variants as well. Graphetic variants (distinctions among different letter forms) are sometimes made in electronic editions for philological work, but I have never seen such distinctions in a printed edition of any work in any European language.]]

[[Beyond recording the variations in the text found in different sources, scholarly editions must record any disturbance in the normal physical character of the source (be it printed book, manuscript, or something else), as it can materially affect the reliability of the transcription offered in the edition. Many editions also indicate that some portions of the text are less reliable than others, sometimes using four or more levels of certainty.]]

[[Let us make this discussion more concrete. A survey of various sets of rules for the construction of critical apparatus and for the use of special signs in critical texts reveals that scholars have found the following features of texts significant enough to merit mention in the apparatus, or (in some cases) special signs in the text itself. An intellectually sound electronic text will need methods of indicating the occurrence of these phenomena:[10]

Beyond this, some editions use distinct symbols in the text to signal variants occurring in different families of manuscripts, or variants whose claim to authenticity is strong, very strong, weak, or very weak.]]

[[In addition to recording in some detail any disturbance in the normal physical characteristics of the source (such as damage or illegibility), the edition also needs to describe the non-disturbed characteristics of the witness. These include the writing material (paper, parchment, papyrus, etc.) and its normal preparation (for manuscripts, the pattern of prickings and rules, and which side of the sheet is the flesh side and which the hair side; for printed books the nature of the imposition, the font and leading used, etc.).]]

N.B. it is emphatically not sufficient to provide the full text of each witness to the text, without providing collations and information on their interrelationships. For this reason, if no other, it is not sufficient to represent witnesses to a text by means of page images of the witness. Such page images are very useful, but cannot by themselves provide insight into basic text-critical questions. How often does manuscript A differ from manuscript C in containing a markedly archaic word, where C has a less archaic near-synonym? How often and where do A and B agree against C? B and C against A? A and C against B? Page images provide no help at all answering questions of this kind.
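The agreement questions just posed can be answered mechanically once the witnesses have been collated, which is exactly what page images alone do not allow. The following is a minimal sketch in Python, not a description of any existing collation program: the `collation` table, its readings, and the function name `agreements_against` are all invented for illustration, and the sketch assumes exactly three witnesses that have already been aligned at each place of variation.

```python
from itertools import combinations

# Hypothetical collation: each row is one place of variation, giving
# the reading of witnesses A, B, and C there. The readings are made up.
collation = [
    {"A": "king", "B": "king", "C": "kyng"},
    {"A": "sword", "B": "swerd", "C": "swerd"},
    {"A": "maiden", "B": "maiden", "C": "damsel"},
    {"A": "land", "B": "land", "C": "land"},
]


def agreements_against(collation, witnesses=("A", "B", "C")):
    """Count, for each pair of witnesses, the places where the pair
    share a reading that the remaining witness does not have."""
    counts = {}
    for pair in combinations(witnesses, 2):
        # With three witnesses, exactly one is left outside each pair.
        (odd,) = [w for w in witnesses if w not in pair]
        counts[pair + (odd,)] = sum(
            1
            for row in collation
            if row[pair[0]] == row[pair[1]] != row[odd]
        )
    return counts


for (first, second, odd), n in agreements_against(collation).items():
    print(f"{first} and {second} agree against {odd}: {n}")
```

With the invented data above, A and B agree against C twice, B and C agree against A once, and A and C never agree against B. Nothing of the kind can be computed from page images until someone has transcribed and aligned the readings.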

2 Software is Not the Answer

Not included in my list of requirements for a successful electronic scholarly edition is a satisfactory choice of software for its preparation and use. Far more important than choice of software is the use of standard software-independent notations for the edition (such as those defined by the Text Encoding Initiative), which help ensure that the edition is not irrevocably tied to any particular piece of software. Standard notations, like the TEI's encoding scheme, provide a notation which is independent of specific hardware and software platforms, in terms of which the intellectual content of an edition can be formulated.

Of course, it is not logically necessary for an electronic edition to be formulated in a notation like that of the TEI. It might be conceived in terms of some particular piece of software, and released in a form designed to allow it to be used with that piece of software. Some electronic texts are now being published in Word-Cruncher format, to allow them to be used with the popular interactive concordance program; others are released in Folio Views form, or for use with some other software. Other materials, like the Beowulf Workstation developed by Pat Conner of West Virginia, are released in the form of Hypercard stacks.[11] Just as traditional printed editions are conceived in terms of a particular technology --- that of printing --- so these new editions are conceived in terms of the technology represented by the programs they use. Such editions have the advantage of being self-contained: they can be consulted at once, right out of the box, because they come with software for consulting them.

Like traditional printed editions, however, such software-centered editions have the disadvantage that, being conceived in terms of a particular technology, their use is limited to those with access to that technology. Print technology, for example, is inaccessible to the print-handicapped. Software technology is inaccessible to potential readers who use computers of the wrong make or model: Macintosh users cannot run Unix programs, and vice versa.

We should not try to build our digital library of scholarly editions around a particular piece of software, for at least three reasons. Software is

[[Software never appeals equally to all users. Computers provide tools for thought; we work with them intimately. Naturally, we conceive a liking, or a disliking, for this or that piece of software for intimate reasons we may not even grasp ourselves. If a digital library is to serve all readers, we need to leave the readers some leeway to choose software which they are comfortable with; we should not impose one piece of software on them all. The cultural heritage represented by editions should not be restricted to users of Macintoshes, or users of PCs, or users of Unix machines, still less to users of some one particular program on those machines. It should be accessible to all.]]

[[Perhaps more important, software is never adequate to all uses. Software is designed to enable users to accomplish a certain set of tasks; good software can accomplish a wide array of tasks, many of them not apparent in detail even to the designer and builder of the program. But it is expecting too much of a program to say that it must meet the needs, for today, tomorrow, and the infinite future, of every scholar who will ever need to study a scholarly edition. Different types of research require different types of software; if a digital library of scholarly editions --- or even a single scholarly edition --- is to be useful to a wide range of potential readers, it cannot be tied to the capabilities of a single piece of software.]]

Most important of all, software is short-lived. [[This is, of course, a relative statement. Computer software has a longer lifetime, by and large, than computer hardware. But compared to books --- in particular, compared to scholarly editions --- software lives out its life in but the twinkling of an eye.]]

[[As a student of Germanic philology, I spent my academic career reading texts seven hundred years old, and older. Even for the most important texts, scholarly editions are not very frequently undertaken, and if an edition of a standard Middle High German text is less than forty or fifty years old, it is tolerably new. For many important authors, such as Wolfram von Eschenbach, the standard texts are still those edited by Karl Lachmann a century and a half ago. Later editions have appeared, but have not dislodged the work of Lachmann. For documentary materials, the editions of the nineteenth century have not been superseded because the materials have generally not been re-edited.]]

[[In other fields, the material being studied may be more recent, but scholarly editions are still expected to last a few decades before being superseded --- indeed, in many cases documentary editions take more than a few decades to be completed. Only in unusual circumstances does anyone have the energy to begin a replacement edition before another few decades have passed.[12] ]]

[[In short,]] our libraries are full of current editions which are twenty-five, fifty, or one hundred years old, and of editions even older which continue to be consulted although no longer current.

Our computers, by contrast, rarely run any software which is even five or ten years old, and to find older software one must seek out some old-fashioned mainframe computer center, which may be running some antique program like the Computer Assisted Scholarly Editing toolkit, or the Oxford Concordance Program, which even in its second release presumably retains some code from the first edition, released an astonishing sixteen years ago. It is just imaginable that some troglodyte in the air-conditioned basement of a bank somewhere is still running Cobol programs written in the 1960s, on an unaltered version of the operating system OS/360, released thirty years ago this year. But programs more than thirty years old have only historical interest, and run, if at all, only as curiosities: [[ the square-root extraction routines sketched out for ENIAC by John von Neumann on train rides between Philadelphia and Aberdeen, or the programs written by Ada Lovelace for Charles Babbage's Analytical Engine, which might conceivably be running on simulators in computer science departments somewhere. A Macintosh-based simulation of OS/360 is, I believe, the only way it is now possible to run the landmark hypertext system FRESS, developed at Brown University by Andries van Dam in the early 1970s.]]

[[There are editorial projects currently underway which had published their first volume before FRESS was a gleam in Andries van Dam's eye, and which will probably not publish their final index volume until sometime after the last programmer has lost track of where that OS/360 simulator program went, and FRESS is nothing more than a gleam in the shining eyes of hypertext enthusiasts with long memories and a love of history.]]

Software is short-lived. If we want the first volume of a major scholarly edition to remain usable at least until the final volume of the edition is completed, then we must ensure that that first volume does not depend on some specific piece of software. No software available when we begin a major project is likely to survive, let alone remain the best available choice, even for as short a time as twenty or thirty years.

Software is necessary for electronic editions to be successful, but any electronic edition which is conceived and formulated in terms of one specific piece of software will necessarily appeal only to those who like that particular program; it will be useful only to those whose research falls comfortably within the paradigm supported by that program; and it will die when that program, or the hardware it is designed for, also dies. So we should not try to build scholarly editions around a particular piece of software.

It is not necessary, however, to abandon the notion of making electronic texts easy to use, by distributing them with software which most users will find useful. The materials created by the Perseus Project at Harvard University (a large collection of textual and graphical materials for the study of Hellenic civilization) were published in Hypercard form. The TEI Guidelines themselves are being released on CD-ROM with a browser called DynaText. That means that, like the software-specific editions I mentioned earlier, they can be consulted, right out of the box, using the software that comes with them.

The Perseus materials, however, and the TEI Guidelines themselves, differ from the products I mentioned earlier because they are not conceived primarily in terms of a specific piece of software. In each case, they are conceived in abstract terms, as a particular network of textual and other materials, which can be presented in a variety of ways. The Perseus materials are archived in SGML (the Standard Generalized Markup Language), and the distributed Hypercard-based materials are derived from the archival form by mechanical software-driven transformations. The TEI Guidelines are available in their native SGML (TEI-encoded) electronic form; they are also distributed in derivative forms: print, electronic form formatted for convenient viewing onscreen (much like a book from Project Gutenberg), and as a DynaText book. Independently of the TEI, others have used the original form of the Guidelines to create still other forms: a Toronto vendor of SGML software has included the Guidelines on a CD-ROM of SGML materials, where they can be consulted using a general-purpose SGML browser included with the CD-ROM. The library at the University of Virginia has put up a network server to allow searches of the Guidelines in their native form, and to deliver the results in the HTML (Hyper-Text Markup Language) tag set used on the World Wide Web.
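The principle of keeping one archival form and deriving delivery forms from it mechanically can be illustrated on a very small scale. The sketch below is not the Perseus or TEI production process. It assumes a tiny well-formed XML fragment standing in for the SGML archival form (Python's standard library parses XML, not SGML), and the element names `text`, `p`, and `emph`, like the function names `to_plain` and `to_html`, are chosen only for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical archival form: descriptive markup, no formatting commands.
SOURCE = """<text>
<p>The <emph>archival</emph> form is kept once.</p>
<p>Derived forms are generated from it.</p>
</text>"""


def to_plain(root):
    """Plain-text view: drop all markup, one paragraph per line."""
    return "\n".join("".join(p.itertext()) for p in root.iter("p"))


def to_html(root):
    """HTML view: paragraphs stay paragraphs, emph becomes <em>."""
    out = []
    for p in root.iter("p"):
        parts = [escape(p.text or "")]
        for child in p:
            inner = escape("".join(child.itertext()))
            parts.append(f"<em>{inner}</em>" if child.tag == "emph" else inner)
            # Text following the child element belongs to the paragraph.
            parts.append(escape(child.tail or ""))
        out.append("<p>" + "".join(parts) + "</p>")
    return "\n".join(out)


root = ET.fromstring(SOURCE)
print(to_plain(root))
print(to_html(root))
```

Both views are regenerated from the same source, so a new delivery format means writing one more transformation and leaves the archival text untouched.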

Because Perseus and the TEI Guidelines are not conceived in terms of a particular technology of delivering the text, they will not become unreadable when HyperCard, or NCSA Mosaic, or DynaText, are no longer current software. Software- and hardware-independence means they will survive the death of the software and hardware on which they were created. For scholarly editions, such longevity is essential. And that is why I say that the development and use of software-independent markup schemes like SGML and the TEI Guidelines will prove more important, in the long run, to the success of electronic scholarly editions, and to a digital library of such editions, than any single piece of software can.

3 The TEI Guidelines

In the previous section, I outlined some demands placed upon any serious representation of a text with a complex textual history. In the next, I will describe how the TEI Guidelines attempt to meet those demands, for representations of such texts in electronic form. First, however, I need to describe briefly the goals and background of the TEI encoding scheme. The TEI has been described in some detail in reports given at this conference over the past six years; I will limit myself here to noting just a few salient points.

The TEI has adopted, as the basis for its encoding scheme, a formal language called the Standard Generalized Markup Language (SGML).[13] Formally, the TEI scheme is expressed as a document type definition defined in SGML notation, which defines a large number of tags for marking up electronic documents. In practice, tags are used to control the processing of TEI documents, or their translation into other electronic formats.

The TEI groups its tags into tag sets, each containing tags needed for one particular type of text or one particular type of textual work, so users of the scheme can select the tags needed, and exclude the rest from view. Very few tags in the TEI encoding scheme are unconditionally required; a slightly larger number are recommended; the vast majority of tags mark strictly optional features which may be marked up or left unmarked at the option of the creator of the electronic text.

Tags of very wide relevance, which will be needed by virtually all users, are included in the `core' tag sets. These include tags for identifying the electronic text, its creators, and its sources, and for common textual phenomena like paragraphs, italicized phrases, names, editorial corrections, notes, and passages of verse or drama. Other tag sets handle particular types of text (prose, verse, drama, spoken material, dictionaries, term-banks) --- these are the base tag sets; in general, every text encoded with the TEI scheme will use one base tag set. Still other tag sets support specialized types of text processing or areas of research (hypertext linking, analysis and interpretation of the text, manuscript transcription, special documentation for language corpora, etc.); these additional tag sets may be used in any combination, and with any base tag set.

The TEI Guidelines do not represent the first concerted attempt since the invention of computers to develop a general-purpose scheme for representing texts in electronic form for purposes of research --- but they do represent the first such attempt

That the TEI did come to fruition is a testament to the urgency of the need felt by the research community for some such mechanism for the sharing and preservation of their electronic texts, and to the tenacious work of the scores of volunteers who served on TEI work groups and committees.

A reference manual for the TEI encoding scheme was published earlier this year, under the title Guidelines for Electronic Text Encoding and Interchange.[14] The TEI Guidelines encompass some 1300 pages of information, available in print and electronically. Introductory user manuals, tutorials, and similar ancillary materials are now being prepared, but for the immediate future it seems fair to predict that the TEI scheme will be adopted more commonly by large projects which need a systematic research-oriented encoding scheme and can invest time learning to use it, than by individual users who may not feel quite so much pressure to ensure the longevity of their work, and who may feel intimidated, rather than reassured, at the amount of detail contained in the reference manual.

[Abbreviate following?]

Because the TEI scheme is an application of SGML, TEI markup can be used with any software which conforms to the SGML standard [[and which has sufficiently large capacity to handle the TEI document type definitions.]] The document type definitions (DTDs) themselves, like the full text of the reference manual, are freely available over the network. This means that the TEI encoding scheme can be used in the creation, dissemination, and study of electronic texts immediately, without the TEI itself having had to expend any time or effort on software development. SGML editors (from the public-domain SGML mode for the popular Unix editor emacs, to state-of-the-art graphical interfaces for Macintosh and Windows users) can be used to create and edit TEI-conformant documents; SGML processing tools (from the public-domain parser sgmls to high-end fourth-generation language tools marketed commercially) can be used to process TEI-encoded material; SGML layout and composition tools can typeset TEI documents; SGML delivery tools and browsers can be used to disseminate TEI-conformant text in electronic form. Because SGML represents a major step forward for text processing, the number, quality, and variety of programs supporting SGML will continue to grow for the foreseeable future. The quality and variety of programs already available constitute one of the most persuasive demonstrations that the TEI made the right choice in selecting SGML as the basis for the TEI encoding scheme, and in concentrating on the definition of the encoding scheme itself, rather than on the development of software.

4 TEI and Textual Criticism

I said earlier that the three overarching goals for serious electronic editions are accessibility, longevity, and intellectual integrity. The TEI encoding scheme secures accessibility and longevity by providing a software- and hardware-independent notation for the creation of electronic texts of all kinds. Electronic resources created in TEI form will by definition be usable on any platform, with a wide variety of software; they will not become technologically obsolete when the software and hardware used to create them do.

The intellectual integrity of materials encoded with the TEI encoding scheme is harder to guarantee. With the TEI, as without it, integrity remains inescapably the responsibility of the creator of an edition; all that the TEI can do is to provide the mechanisms needed to allow textual critics to create intellectually serious electronic editions using the TEI encoding scheme. It is not possible, within the scope of this paper, to describe SGML markup and the TEI encoding scheme in any detail, but it may be useful to provide a short overview of the TEI tag sets for text-critical apparatus of variants and transcription of primary sources, and describe briefly some other relevant parts of the TEI scheme.

Where possible, the TEI based its recommendations on existing practice in the creation of electronic texts. In this area, however, there was not much to work from. A number of programs have been developed for collation of witnesses and for typesetting apparatus criticus. Most general-purpose schemes for the encoding of literary or historical texts, however, make no provision for text-critical variation or for recording physical characteristics of the source.[15]

The TEI provides tags for a variety of problems arising in textual criticism, most critically the encoding of an apparatus criticus, the registration of alterations and physical damage to the source, the indication of uncertain readings, and the association of arbitrarily complex annotation with any passage of text:

For historical reasons, the current version of the TEI encoding scheme is slightly better developed with regard to manuscript materials than with regard to printed matter; a number of manuscript specialists participated actively in the project, but we were unsuccessful in attempts to form a work group for analytical bibliography. As a result, the current version of the Guidelines limits itself, as regards the transcription of printed materials, to some fairly obvious basic requirements. Catchwords, signatures, page numbers, running heads and footers, and other material which does not fit into the ongoing stream of the text are often omitted from transcriptions, but are often crucial for text-critical work with an edition; the current TEI encoding scheme provides a single catch-all tag for these, called fw (for forme work). Gatherings may be recorded using the general-purpose milestone tag to mark the boundaries of the gatherings. No specific provision is currently made for imposition information; it is not clear what form such information needs to take, when it is recorded.
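
As an illustration only, the following sketch shows how forme work and a gathering boundary might be tagged, and how software could separate them from the running text. The element names fw and milestone are those just described; the sample text, the attribute values, the function names, and the XML-style empty-element syntax (chosen so that Python's standard parser can read the fragment) are assumptions made for this sketch, not quotations from the Guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription of part of a printed page. The text and the
# attribute values are invented for illustration.
SAMPLE = """
<p><milestone unit="gathering" n="B"/>
<fw type="header">The Tragedie of</fw>
<fw type="sig">B1</fw>
Who's there?
<fw type="catch">Nay</fw>
Nay, answer me.</p>
"""

def split_forme_work(xml_text):
    """Return (forme_work, running_text) for a transcribed fragment."""
    root = ET.fromstring(xml_text)
    forme_work = [(fw.get("type"), fw.text) for fw in root.iter("fw")]

    pieces = []

    def walk(el):
        # Keep an element's own text unless the element is forme work.
        if el.tag != "fw" and el.text:
            pieces.append(el.text)
        for child in el:
            walk(child)
            # Text following a child belongs to the running text.
            if child.tail:
                pieces.append(child.tail)

    walk(root)
    running_text = " ".join(" ".join(pieces).split())
    return forme_work, running_text

def gatherings(xml_text):
    """List the labels of gatherings marked with milestone tags."""
    root = ET.fromstring(xml_text)
    return [m.get("n") for m in root.iter("milestone")
            if m.get("unit") == "gathering"]
```

Because the catchword, signature, and running head are tagged rather than omitted, the same transcription can yield either a clean reading text or a list of forme work for bibliographical study.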

Page layout and typographic features of the source may be described at any level of detail desired, using the global rend (rendition) attribute, but the TEI currently provides no formal language for the description. This omission has a number of reasons. First, different portions of the community have very different requirements in this area: many creators of electronic text have no interest whatever in the typographic characterization of the source; others would like merely to record shifts into italics or bold fonts, without actually identifying the fonts used; still others will need or desire an extremely detailed description of the font and leading (or hand and size), and the layout of the page. The simplest way to serve all these different requirements is to provide a location for recording information on these matters, without restricting the form in which the information is to be recorded. This is what the TEI has done. The advantage of this approach is that it allows text critics and analytic bibliographers to record whatever information they wish to retain, about the typography of any printed or manuscript document; the disadvantage is that it provides no guidance to those trying to decide what features of the text are worth recording, and no way to standardize such information, even among people who are recording the same characteristics of the source.
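
The trade-off can be made concrete with a small sketch. The rend attribute is the one described above; the sample text, the particular rend values, and the function name are invented for illustration, and the fragment is written in XML-style syntax so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical transcription in which the encoder has described the
# typography freely. Note "italic" and "it": two spellings of what is
# presumably the same feature.
SAMPLE = """
<div>
<head rend="centered, 24pt roman caps">Chapter I</head>
<p rend="initial">It was <hi rend="italic">not</hi> a dark night, whatever
<hi rend="it">some</hi> may say.</p>
</div>
"""

def rend_values(xml_text):
    """Count each distinct rend value found in the fragment."""
    root = ET.fromstring(xml_text)
    return Counter(el.get("rend") for el in root.iter() if el.get("rend"))
```

Since the values are unconstrained strings, software can compare them only character by character: here "italic" and "it" are counted as two unrelated descriptions, which is the lack of standardization noted above.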

A second reason for the sparse treatment of typography and layout in the TEI Guidelines is political: the marketplace of text processing software is currently dominated by WYSIWYG word processing systems which are singularly ill suited to the needs of scholars transcribing original source materials, precisely because they focus exclusively on the appearance of the printed page. In order to distinguish the TEI clearly from such word processing systems, the original development effort concentrated on an approach to text which stresses its linguistic and literary structure, instead of its appearance on the page.

A third reason typography receives no formal treatment in the Guidelines is the most serious, at the moment. It is not at all obvious how best to describe the layout of a page in an electronic transcription of that page. For all their convenience and sophistication, conventional word processing systems limit themselves to trying to get a printer to recreate the page itself; they do a very poor job at expressing the fundamental rules governing a page. Further work requires a much deeper understanding of historical page design and typography than most word processor developers can be expected to acquire.

Despite a tradition of book design going back centuries, and despite the efforts of many devoted critics, no one could claim that there is a consensus on how to describe a printed page in detail. Some critics speak of the 'bibliographic codes' which form part of the publication of any work. But the term is, for now, still more a metaphor than a sober description. The characteristic of any code is that it is made up of a finite set of signs, which as Saussure teaches us are arbitrary linkages of signifier and signified. For artificial languages, the sets of signifiers and their meanings are given by the creator of the artificial language. For natural languages, dictionaries attempt to catalog the signifiers and their significance; grammars attempt to explain the rules for combining signs into utterances. We have nothing equivalent for the physical appearance of texts in books. Any serious attempt to record the bibliographic codes built into the book design and typography of a literary work must begin by specifying the set of signs to be distinguished. Is twenty-four-point type different from ten-point type as a bibliographic code? In most circumstances, yes. Is ten-point type from one type foundry different from ten-point type in the same face, produced by a different foundry? In most circumstances, no. What about ten- and eleven-point type? Ten and twelve? To specify a formal language for expressing significant differences of typographic treatment, we need to reach some agreement about what constitutes a significant difference --- what the minimal pairs are.

The natural constituencies for such work include text critics and analytical bibliographers; it is to be hoped that the TEI will be able, at some point in the near future, to charter a work group to explore the relevant issues and make recommendations for extending the TEI tag sets to handle relevant information better.

5 Conclusion

[[It remains to describe briefly some implications, for scholarly editors, of standards like SGML and the TEI Guidelines, and some basic requirements for our future work in creating electronic scholarly editions.]]

[[First, it is only fair to point out that the TEI Guidelines, like any standard notation, may prove dangerous. Like any notation, the TEI Guidelines inevitably make it easy to express certain kinds of ideas, and concomitantly harder to express other kinds of ideas, about the texts we encode in electronic form. Any notation carries with it the danger that it must favor certain habits of thought --- in the TEI's case, certain approaches to text --- at the expense of others. No one should use TEI markup without being aware of this danger --- any more than we should use the English language, or any other, without realizing that it favors the expression of certain kinds of ideas, and discourages the expression, and even the conception, of other ideas.]]

[[Such dangers cannot be evaded by any means short of eliminating the use of all notations or symbolic systems from thought --- which in turn would appear to mean they cannot be evaded by any method short of ceasing to think at all. The danger can be mitigated, or minimized, by facing it clearly and attempting to provide mechanisms for modifying, or extending, a notation to allow it to express new concepts conveniently. In the TEI's case, several steps have been taken to minimize the danger that the TEI Guidelines will have the baleful effect of standardizing the way scholars think:

Such methods mean the TEI scheme makes far more explicit provision for the views of dissenting scholars than any other method of creating electronic text ever formulated. They do not eliminate entirely the dangers which any notation poses for those who use it, but they do go far to ensure that the TEI Guidelines can be used by scholars of widely varying views as to the proper interpretation or study of the text.]]

[[On the positive side, standardization of notation does mean that it will be possible to develop software to handle a wider variety of text types and scholarly needs than can easily be handled without such standardization. Since the TEI encoding scheme is not specific to any one piece of software, many TEI-aware programs can be written, and the same TEI-encoded data can be used with any such program. TEI-aware software, in turn, should in principle be usable with any TEI-conformant text. Both text and software should thus be reusable in wider contexts than is the case without standard notations like SGML.]]

[[The highly structured nature of SGML markup makes it feasible to verify the structural soundness of markup mechanically, by means of software to parse and validate the SGML data. Such mechanical validation can detect a wide variety of typographic and conceptual errors in the input. Application programs can rely more confidently on the quality of their input data, which means they can be simpler and less expensive to write.]]
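
A rough idea of such mechanical checking can be given in a few lines. This is a sketch under stated assumptions, not a substitute for validation by an SGML parser against the TEI document type definitions: it checks only that tags nest properly and that each element appears inside a permitted parent. The table of permitted children is a hypothetical stand-in for a DTD, and the fragment uses XML-style syntax so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much simplified content model: which elements may occur
# directly inside which. A real DTD expresses far more than this.
ALLOWED_CHILDREN = {
    "p": {"app", "hi"},
    "app": {"lem", "rdg"},
    "lem": set(),
    "rdg": set(),
    "hi": set(),
}

def structural_errors(markup):
    """Return a list of structural problems found in the markup."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        # Tags are missing or improperly nested.
        return [f"not well-formed: {err}"]
    errors = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            errors.append(f"unknown element <{parent.tag}>")
            continue
        for child in parent:
            if child.tag not in allowed:
                errors.append(
                    f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors
```

A variant reading placed outside any apparatus entry, or an apparatus entry left unclosed, is reported before any application program sees the text.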

[[Equally important, the structure of SGML markup makes it possible for markup to become much more elaborate and subtle, without overwhelming the ability of either software or users to deal with the complexity. It will become feasible, as it has never been before, for the same textual material to be annotated in many different ways: a legal document might have a detailed diplomatic transcription of the text, linked to a scanned image of the original manuscript, with an analysis of the legal issues involved, links to information about the individuals named, and a detailed analysis of its linguistic structure at the phonological, morphological, and syntactic levels. Archaic spellings could be flagged with their modern equivalents, and unusual words could be glossed, so the same text could be used either for advanced study by specialists or to produce a new-spelling handout with basic annotations, for use in an undergraduate class. The specialist need not be distracted by the spelling modernizations and glosses intended for undergraduates, and the undergraduates need not be intimidated by the linguistic or historical apparatus: it is a simple matter for software to filter out and hide all the annotations intended for undergraduates, or to hide everything but the annotations for undergraduates.]]
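
The filtering described here can be sketched as follows. The note element and its type attribute are used in the manner of the TEI scheme, but the type values "gloss" and "ling", the sample line, and the function name are assumptions made for this illustration, and the fragment is written in XML-style syntax so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated line: glosses meant for undergraduates and a
# linguistic note meant for specialists, attached to the same text.
SAMPLE = (
    '<p>Whan that Aprill<note type="gloss">April</note> with his shoures'
    '<note type="ling">plural in -es</note> soote'
    '<note type="gloss">sweet</note></p>'
)

def render(xml_text, show):
    """Render the text, displaying only notes whose type is in `show`.

    This sketch handles only notes placed directly inside the paragraph,
    which is all the sample contains.
    """
    root = ET.fromstring(xml_text)
    out = [root.text or ""]
    for child in root:
        if child.tag == "note" and child.get("type") in show:
            out.append(f" [{child.text}]")
        # The text after a note is always part of the reading text.
        out.append(child.tail or "")
    return "".join(out)
```

Calling render(SAMPLE, {"gloss"}) yields the line with glosses only, render(SAMPLE, {"ling"}) the line with the linguistic note only, and render(SAMPLE, set()) the bare text; the encoded source is the same in all three cases.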

[[Such analysis and markup is not the work of a day, of course, and need not be added in a day. For centuries, scholars have built up layer upon layer of commentary and analysis of important texts. SGML markup will not change that practice; it merely makes it possible for many such layers of analysis and interpretation to coexist in the same electronic form. As we work with our texts over the years, we will gradually enrich them with layer after layer of annotation. Eventually the electronic representation of the text will become a more adequate representation of the text and our understanding of it than even the most elaborately annotated printed page --- because the electronic version can be directed to display or to conceal each layer in turn, so that the visual representation of the text can mirror our changing focus of interest as we study it.]]

[[Perhaps the most important implication of standardization of markup will be, however, that it will make it possible to integrate resources developed by separate projects into seamless electronic wholes. Currently, each electronic resource comes with its own software; to move from one to the other you must shut the first down and start the second up; if you are lucky you can simply close one window and open another, but even in the best of cases you must change context more or less completely. It is a bit as if the Works of Ben Jonson and those of William Shakespeare not only did not circulate but were kept in different rooms of the library (or in different buildings entirely), so that to consult both you had to walk from one room to the other, or one building to another. One reason our libraries no longer chain books to the shelves is that scholarship is materially easier when we can have both books out on our desk at the same time for direct comparison.]]

[[Integration of editions into a virtual digital library is the electronic equivalent of getting both volumes onto our desk at the same time. And adoption of a common language for electronic text encoding is the way to make such integration feasible.]]

I said at the outset that the three general goals of any electronic edition must be accessibility, longevity, and intellectual integrity. In the light of the intervening discussion, it is now possible to enunciate some basic principles which can help electronic scholarly editions achieve those goals.

Strive for software- and hardware-independence. Tying an edition to one particular hardware or software platform erects an unnecessary technological barrier to its use by users of other hardware and software, and ensures that the edition will die when the platform on which it depends dies.

Distinguish firmly between the intellectual requirements of the edition and the requirements for convenient use of the edition. Our notions of what constitutes convenient use of an edition vary too much from individual to individual, and will in any case change radically during the lifetime of any serious edition.

Create the edition in a software- and hardware-independent notation. Derive platform-specific versions of that archival form for distribution when and as necessary. Never confuse the edition itself with the temporary forms it takes for distribution and use. Otherwise, you fall into the same confusion as Benjamin, between the roles of the author and those of the printer and bookbinder.

Use documented, publicly defined, non-proprietary formats for the archival version of the edition. At this time, there is no serious alternative to SGML for this purpose. Use proprietary formats only for distribution versions.

Exploit the ability of the computer to provide multiple views of the same material. An electronic edition need not be limited to a single audience; much more readily than print editions, electronic editions can serve both general and specialized audiences. Where appropriate, provide the information needed to allow the user to read the text in new spelling or old, with or without the correction of obvious errors, with or without an apparatus of variants, including or excluding purely orthographic variants, with or without indications of text-critical, historical, literary, or linguistic commentary, with or without a literal or free translation into another language.
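
A minimal sketch of such multiple views, under stated assumptions: the element and attribute names (orig with reg, app, lem, and rdg with wit) follow the TEI tag sets for regularization and critical apparatus as I understand them, but the sample line, its spellings and readings, the witness sigla, and the function name are invented for illustration, and the fragment uses XML-style syntax so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical archival form of one line: original spellings carry their
# regularized equivalents, and one word carries a variant reading.
SAMPLE = (
    '<l>To <orig reg="be">bee</orig> or not to <orig reg="be">bee</orig>, '
    'that is the <app><lem wit="Q2">question</lem>'
    '<rdg wit="Q1">point</rdg></app></l>'
)

def view(xml_text, modern_spelling=False, witness=None):
    """Derive one reading text from the archival form.

    modern_spelling: use the regularized form instead of the original.
    witness: if given, prefer the reading of that witness to the lemma.
    """
    root = ET.fromstring(xml_text)
    out = [root.text or ""]
    for child in root:
        if child.tag == "orig":
            out.append(child.get("reg") if modern_spelling else child.text)
        elif child.tag == "app":
            chosen = child.find("lem")
            for rdg in child.findall("rdg"):
                if witness is not None and witness in rdg.get("wit", "").split():
                    chosen = rdg
            out.append(chosen.text)
        out.append(child.tail or "")
    return "".join(out)
```

The archival form is written once; old-spelling, new-spelling, and single-witness texts are all derived from it on request, so none of them need be mistaken for the edition itself.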

If as a community we ignore these principles, we will have electronic editions which do not capture the complexity and difficulty of textual transmission and textual variation, which run only on last year's (or the last decade's) machines, which impose on their readers a particular view of the material and brook no contradiction when they tell their readers what they ought to be interested in, which fall silent with every new revolution in computer hardware and software. If we attend to them, then we can hope to have editions which last longer than the machines on our desks, which can be used by a wide variety of users, from undergraduates to specialists, and which exploit the capacities of the electronic medium to render a full and useful accounting of the texts they represent, and the textual tradition to which we owe their preservation.


[1] "From a photographic plate it is possible to make a multitude of prints; there is no sense in seeking the authentic print." Walter Benjamin, Das Kunstwerk im Zeitalter seiner technischen Reproduzierbarkeit (Frankfurt a.M.: Suhrkamp, 1963), paragraph 4, p. 21 (my English).

[2] This is not an original observation; I owe it to an essayist in the New York Review of Books several years ago, whose essay I have not been able to relocate.

[3] "The enormous changes which printing, the technical reproducibility of writing, has called forth in literature, are well known." Benjamin, paragraph 1, p. 11. Not everyone distinguishes so carefully between literature and writing.

[4] Florian Coulmas, The Writing Systems of the World (Oxford: Blackwell, 1991), p. 12.

[5] "[T]he history, to which it [sc. the artwork] has been subject in the course of its existence." Benjamin, paragraph 2, pp. 13-14.

[6] The discussions of Thomas Tanselle may be taken as emblematic of this point: in his essay on critical apparatus, he devotes much more space to considering how to make it easy to produce a student or lay reader's edition by photographically reproducing selected pages of the scholarly edition, than to considering how to ensure that the apparatus actually meets the needs of working scholars. [REFERENCE TO BE SUPPLIED.] His purely typographic arguments need not apply to electronic editions, however, and computer-driven typesetting renders them irrelevant even to print publication. In a similar vein, cf. Peter L. Shillingsburg, Scholarly Editing in the Computer Age: Theory and Practice (Athens, Ga.: University of Georgia Press, 1986). Commenting on Tanselle's argument in favor of final authorial intention, which rests finally on the claim that "no critical text can reflect [...] multiple intentions simultaneously," Shillingsburg observes "The practical difficulty of presenting multiple intentions in a book is seen as more important than the conceptual insight that multiple intentions are operative." The print-driven nature of much of Tanselle's editorial theory is also visible, though less egregiously, in the general statements on textual criticism in Tanselle's A Rationale of Textual Criticism (Philadelphia: University of Pennsylvania Press, 1989).

[7] I say that Project Gutenberg objects in principle to the provision of apparatus, because it insists on distributing texts only in plain ASCII format, without any systematic explicit scheme of markup. Since the American Standard Code for Information Interchange (ASCII) makes no provision for textual apparatus, such apparatus can only be rendered in electronic form by means of some markup language such as the internal markup used by word processors like Word or WordPerfect, or the visible markup used by programs like LaTeX or Scribe, or the SGML markup defined by the TEI. By limiting itself to markup-free ASCII, Project Gutenberg systematically makes apparatus impossible to include in its texts.

[8] "The basis of critical method is not an immediate judgment on the known readings; it is a historical reconstruction of the transmission of the text, using the oldest manuscripts." "As long as any error remains unexplained, the good reading remains tainted with a residue of uncertainty." Louis Havet, Manuel de critique verbale appliquée aux textes latins (Paris: Hachette, 1911), para 17, p. 3; para 67, p. 12. (English mine.)

[9] "The editor of a text must account for the entire extant textual tradition, or more pointedly, for every single variant." Georg Steer, "Grundsätzliche Überlegungen und Vorschläge zur Rationalisierung des Lesartenapparats," in Kolloquium über Probleme altgermanistischer Editionen, Marbach am Neckar, 26. u. 27. April 1966, ed. Hugo Kuhn, Karl Stackmann, Dieter Wuttke (Wiesbaden: Franz Steiner Verlag, 1968), pp. 34-41, here p. 34.

[10] The items given are gleaned from a variety of sources, including: Paul Maas, Textkritik, 3. verb. und verm. Auflage (Leipzig: B. G. Teubner, 1957); André Bataille, Les Papyrus: Traité d'études byzantines II (Paris: PUF, 1955); [Louis Havet, rev. Jacques André], Règles et recommandations pour les éditions critiques (Série latine) (Paris: Société d'édition "Les belles lettres", 1972); [Louis Havet, rev. Jean Irigoin], Règles et recommandations pour les éditions critiques (Série grecque) (Paris: Société d'édition "Les belles lettres", 1972); and Georg Steer, "Grundsätzliche Überlegungen und Vorschläge zur Rationalisierung des Lesartenapparats," in Kolloquium über Probleme altgermanistischer Editionen, Marbach am Neckar, 26. u. 27. April 1966, ed. Hugo Kuhn, Karl Stackmann, Dieter Wuttke (Wiesbaden: Franz Steiner Verlag, 1968), pp. 34-41, who describes a symbol system designed in Zürich and used in Erwin Nestle's editions of the Greek New Testament.

[11] On the Beowulf Workstation, see Patrick W. Conner, "The Beowulf Workstation: One Model of Computer-Assisted Literary Pedagogy," Literary & Linguistic Computing 6.1 (1991): 50-58.

[12] The case of Friedrich Hölderlin, whose works experienced two major editions within twenty years, is a slightly scandalous exception, the discussion of which lends a certain frisson of excitement to graduate seminars on German literature even now, thirty years later.

[14] (Chicago, Oxford: ACH/ACL/ALLC Text Encoding Initiative, 1994). Available from the Text Encoding Initiative; further information may be obtained from the author.

[15] The Thesaurus Linguae Graecae, at the University of California at Irvine, for example, makes no attempt to transcribe the apparatus of its source editions. I am told by Theodore Brunner that this is primarily because at the beginning of the project, the task of devising a usable encoding scheme for apparatus seemed insuperable. Two exceptions to the general neglect of textual variation in general-purpose systems should be mentioned. The general-purpose package Tustep, developed by Wilhelm Ott at Tübingen, does have built-in support for both collation and critical apparatus, but in this as in other ways it is unusually well designed. Manfred Thaller, of the Max-Planck-Institut für Geschichte in Göttingen, has developed notations for text-critical variation in connection with his general-purpose historical workstation software Kleio. Together with various packages for version control in software development, these systems had a direct influence on the TEI's recommendations for text-critical apparatus.