Electronic Textual Editing: Collection and Preservation of an Electronic Edition [Marilyn Deegan, King's College London]
The world of electronic editing is a new one and it is going to take some time for standards and practices to settle down and become part of the established world of scholarship. As the editors point out in their introduction, the book has developed over 500 years, the print edition has been evolving over the past 150 years, while electronic tools have been used to assist in editing for perhaps 50 years (11). The reconceptualization of the edition as an innovative electronic artefact has happened only over the last 10-15 years, and this means that the librarians charged to acquire, deliver, and preserve this most important of scholarly tools are facing new challenges. This chapter discusses what the different issues are in preserving electronic editions, some of which derive from the newly conceptualized modes of instantiation of electronic text, and offers some suggestions to scholars that should enable them to take the necessary steps from the beginning of an editorial project in order to ensure that what is produced is preservable as far as it is possible.
As is shown by the essays in this volume, electronic editing offers a plenitude of materials that represent a work in all its different states of being. It also allows the situation of a work within a nexus of social, contextual, and historical materials, all of which contribute to the totality of its meaning. The electronic edition is itself another ‘version’ of the text, one that in many cases, especially when an archive approach is being taken, seeks to encompass all other versions. But it is another witness in the life of a text, not the final witness, and it needs to be preserved in some form as that witness. However, the fluidity of the medium of creation and delivery is in many ways antithetical to that preservation (this is discussed further below), preservation being an act of fixity, and therefore creating perhaps what the editor has been trying to avoid: the text frozen at a point in time. What are editors, publishers, and librarians to do with the conundrum of trying to preserve for scholars of tomorrow the fluid text of today? First of all, we look at how conventional printed editions are preserved by libraries, and then move on to discuss how this changes in the digital world.
Preserving traditional editions
Conventional printed editions present few problems to libraries in collecting and preserving them that they have not already been grappling with for many years. Such problems as print editions may present all derive from the physicality of the medium used to deliver them. They may, for instance, be old and rare, and in need of careful handling, controlled storage, conservation etc. They may be nineteenth century editions printed on acid paper which is disintegrating into yellow crumbs or may have pages coming loose from bindings. Editions presenting these problems can either be repaired, or their contents can be reformatted onto another medium: microfilm, photographic facsimile, new print volume. While reformatting can raise questions about the authenticity of the material object, it is preferable to the alternative, which might be total loss of the object and its valuable contents. Librarians are concerned mostly with the format and construction of the material object, the carrier of the content. They are not usually concerned about the content itself, other than to be able to describe it bibliographically and store it logically, so that users can access it. In the analogue world, at one level of abstraction, the physical format, the carrier, is the edition, and that is all that libraries and librarians need to know about in order to collect it and preserve it. They are concerned, in Sutherland’s terminology, with the vehicular rather than the incarnational form of the edition (Sutherland, 23).
In the analogue world the editor’s roles and responsibilities in relation to the longevity of the work are clear. Editors are little concerned with the format or composition of the carrier, unless they are involved to some degree in font choice, cover design etc. Most responsibilities of the editor end on the day of publication, by which time he or she may already be engaged in a new piece of work. For publishers, their responsibility is to produce a printed volume containing the edition, ensuring as far as possible the quality of the scholarship it represents. This is a complex process involving many different tasks and subtasks of selecting, advising, reviewing, revising, copy-editing, typesetting, proofing, designing, printing, marketing and dissemination. On-going responsibilities might include reprinting or publishing a new edition, but any role in the long-term survival of the product is minimal. When an edition is out of print, that usually means unobtainable from the publishers, and it is not for the editor or publisher to ensure continuing access: that rests with libraries, especially the copyright and major research libraries. Copyright libraries as a nation’s library of record receive a copy of every publication produced in a country in any one year, so a published edition which is out of print and unavailable anywhere else will almost certainly be locatable in a copyright library somewhere, even if it is more than a hundred years old. And that edition should be perfectly usable and accessible, even if created according to different principles than the user expects: it should be self-describing and contain all the explanation needed for its use.
Preserving editions in a digital context
In his foreword to this volume, G. Thomas Tanselle has suggested that the format in which books are delivered is irrelevant: ‘the use of the computer in editing does not change the questions, and the varying temperaments of editors will continue to result in editions of differing character’ (page ref). This may be true for editors, and it may also be true for users of editions, but it is not true for publishers producing or libraries collecting and preserving electronic editions; for them the changes are profound and far-reaching, and involve a complete rethink of the economics of producing and handling these works. Previously, these editions of ‘differing character’ would all have been treated in much the same way by both publishers and libraries. Even uniform series of texts with strict style guides, unchanging livery, immutable type styles could have very different kinds of editions within them. Glance at the Early English Text series for instance, which has existed for a century and a half and you will find very different contents underneath the brown covers: Zupitza’s facsimile of Beowulf, de Vriend’s edition of the Old English Herbarium, and Scragg’s edition of the Vercelli Book to name but a few are all totally different, as befits the nature of the source materials and changing editorial practices. But to the librarian these present no problems: they can all be lined up together on the shelves, a nice neat series, as uniform as a parade of chocolate soldiers. They can be catalogued, described, borrowed, read: the early volumes from the 1860s just as easily as the most recent. For the publisher, there may be issues of different sigla to represent, alternate page layouts, different fonts etc, but these are details which normally cause few problems, and entail no major rethinking of what the edition or the series actually are.
Tanselle goes on to say that editors working in the electronic world need to confront the same issues that editors have struggled with for 2500 years (page ref). Again, this is of course true, but if their work is to survive beyond even the next five years, serious thought must be given right at the start to how to create the edition in such a way that libraries can collect and preserve it alongside many other electronic and conventional editions for the benefit of scholarship in the long term, as well as for the users of today; this is not something that scholars have ever had to do before. And while the traditional questions are still being asked, the very plenitude of new possibilities in the electronic medium means that many new questions will be asked too. In an edition of Yeats’ poem “Innisfree,” would the editor want to include the early recording of the poet reading the work? Is this considered part of his output in relation to the poem? Would the song settings for “Innisfree” written later be considered as part of an edition? How would all these be linked together? What about including radio or film performances of plays in editions of dramatic texts? In this new world, the limits are not what will fit on the page or between the covers. The new constraints are time and money: with a sufficiency of both, the technology will allow us to go as far as our imagination will lead.
For a scholar contemplating embarking on an electronic textual edition, the first place to look for advice about creating a preservable edition is the CSE Guidelines printed above. Section V, “Electronic editions,” is a list of questions: if the scholar could answer ‘yes’ to all those that apply to his or her projected work, he or she would be well on the way to producing a durable resource. The rest of this chapter expands upon some of the issues raised by the Guidelines.
What are the issues?
Two of the key questions the Guidelines ask of potential editors are: ‘How important is permanence or fixity? How can these be obtained?’ and ‘Is there a possible benefit to openness and fluidity?’ These are crucial but in some ways contradictory points. Openness and fluidity are significant benefits that the digital world offers to editors, and many of the contributions in this volume discuss projects which have no final point of publication of the edition or any of its parts. Some of the editions are purely networked and it is anticipated that these will grow and change over time. Indeed, van Hulle suggests that publication is a 'freezing point' (105) and that the act of publication is an exercise in petrification (110). Van Hulle is talking not about edited texts, but about some of the fluid published texts of Samuel Beckett’s oeuvre, but this can apply to edited as well as authored texts. Fluidity in editing can cause serious problems at the same time as it conveys many benefits. As the editors of the current volume suggest, ‘the scholarly editor’s basic task is to present a reliable text: scholarly editions make clear what they promise and keep their promises’ (31). It is also vital to provide a stable text. Of course, there is a great deal of debate about what is meant by reliable or stable: what is meant here is that a reliable text is created according to some principles that are made explicit, and is based upon witnesses that can be accessed by others, and a stable text is fixed at some particular point in time in some known state, and then not changed later without those changes being explicitly recorded. This second requirement is much more difficult to ensure in an electronic context, especially in a networked environment.
One of the main points to stress about the fluidity of electronic editions is that it is both a strength and a weakness: it can be a strength for the editor who can adapt and change the edition as new information becomes available, but a weakness for the user who may not understand what changes have been made, and for the librarian who needs to deliver and preserve the materials: what version of something becomes the preservation version? Instability of citation is a critical problem; research and scholarship are based upon a fundamental principle of reproducibility. If an experiment is repeated, are the results the same? If a citation is followed by a scholar, does that lead to a stable referent, the same referent that every other scholar following that citation is led to? If these are not true then we do not have scholarship, we have anarchy, and it is an anarchy that is well-supported by the Internet.
Another key issue when planning electronic editions is to establish standards and working practices that may enable editions to be ‘interoperable’: that is, to exchange data at some level with other systems. These ‘other’ systems might be other editions, library catalogues, databases, dictionaries and thesauri, and many other kinds of relevant information sources, those that are currently in existence and also those that may appear in the future. Interoperability is very difficult to achieve, and it is something that the digital library world has been grappling with for some time. What is important for editors is not to ensure the interoperability of their editions, but to make editorial and technical decisions that do not preclude the possibility of libraries making the connections at a later date. This is probably the most that can be expected of editors at our current state of knowledge. 1
It is going to take some time for principles and practices to be established that will ensure the longevity of electronic scholarly editions, and those principles and practices will need to be much more precise in many ways than those that are needed for the production of conventional editions, elaborate and precise though these already are. The underlying scholarly practices are much the same, as many of the chapters in this volume make clear; it is the technical issues that need to be resolved. And that is difficult because at the moment electronic editing is more characterized by innovation, experimentation, and new developments than it is by established practices, and that is what makes it so exciting. Electronic scholarly editing is also caught up in the fast-moving world of hardware, software, applications, and standards, which change with dizzying speed. The editor is caught between taking advantage of all these new developments and trying to ensure that the work survives for the long term.
Preserving digital data
Before going on to discuss what the editor might do to ensure the longevity of editions, it might be useful to look at some of the ways in which digital data is being preserved by libraries and other memory institutions. 2 This is a new and changing field, and there are differing views on the best methods of preservation, especially in the preservation of fast-moving online data. There are some web sites, for instance, that not only change every day, they may change every minute: news sites, sport sites, etc. We may be tempted to think that electronic editions are unlikely to change this rapidly: a glance in the pages of this volume should disabuse us of this. Take for instance the editions of poetry discussed by Fraistat and Jones (chapter 9 above). They suggest that electronic editions of poetry might offer a ‘potentially infinite expansion of the traditional editorial apparatus’ (77) and that users might be able to have their own customized version of the edition, based on their behaviour while interacting with the materials (82). They also suggest that an important feature of online poetry could be the ‘contextural’ relation of multiple texts on the internet within hyperlinked clusters, where the poem interacts with many other kinds of information (74). For Crane, electronic editions are ‘protean’ and ‘able to adapt themselves to multiple needs and to recombine themselves in new configurations’ (207). In all of this experimentation and innovation, it is sometimes difficult to know what is at the core of a complex, ever-changing edition that can be collected and preserved by a library.
The preservation of digital data has two main components: preserving the integrity of the bits and bytes, and preserving the information which they represent. Preserving the first is a matter of preserving the physical media upon which they are recorded (CDs, DVDs, hard drives, tapes, disks etc), and moving the digital data to new formats if either there is suspicion that the media may be physically compromised in some way, or the means of accessing the material is in danger of becoming obsolete: CDs are reported to have a potential lifespan of hundreds of years, but CD drives will be obsolete long before this. This process is known as ‘refreshing’: the bit stream is not altered in any way, just moved to different media. As the software that created the information becomes obsolete, the information becomes more and more difficult to access unless a) it is stored in some future-proof format or b) it is reformatted. Reformatting means that the information needs to be migrated to new versions of hardware and software, and there is danger that the information could be changed or damaged in the process, especially if it is moved through more than one generation of the software, or to new software completely. If the reformatting route is taken for preserving digital data, it might be necessary to carry out this process many times in the life of a digital object as hardware and software change. Other methods that are being contemplated for digital preservation are: preserving the hardware and software, which could be expensive and problematic (though there are data archives that still maintain old card readers as there is still much information around on punch cards that it would be too expensive to move to another system); emulation: building hardware or software that can run obsolete programs and data; data archaeology: preserving and documenting the bits and leaving it to future scholars to work out what they mean. 
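The requirement that refreshing leave the bit stream unaltered can be checked mechanically. The following is a minimal sketch in Python, assuming that a cryptographic checksum (here SHA-256) is an acceptable test of bit-level identity; the function names and file paths are hypothetical and are not drawn from any library system discussed in this volume.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy a file to new storage and confirm the bit stream is unchanged.

    Hypothetical helper: returns the checksum so that it can be recorded
    alongside the file and re-verified at the next refresh.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"bit stream changed while copying {source}")
    return after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "old_media" / "poem.txt"
        old.parent.mkdir()
        old.write_bytes(b"I will arise and go now")
        new = Path(tmp) / "new_media" / "poem.txt"
        print(refresh(old, new))
```

A check of this kind covers only the first component of preservation, the integrity of the bits. It says nothing about whether the information they represent can still be interpreted once the software that created it has gone.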
All the potential methods of digital preservation have costs, and many of these costs are unknowable. National copyright libraries are currently grappling with these problems as they are increasingly collecting publications which now have no print equivalent.
A new approach to the preservation of complex digital data is being explored by the University of Virginia and Cornell University, together with other academic partners. This is the Fedora project (Flexible Extensible Digital Object Repository Architecture), one of a number of repository architectures that have been proposed for use in digital libraries. 3 The architecture of complex electronic editions has to be planned carefully for many reasons, preservation being only one of them, and some cognisance of repository architectures could be useful for editors. Fedora is of particular interest as it proposes some new ways of reasoning about digital data, based upon data objects and their behaviours, rather than upon the essential nature of the data. The Fedora work at Virginia was initially started as a practical response to the preservation of complex digital data being produced by research centres in the University and its library like IATH (the Institute for Advanced Technology in the Humanities), much of which derives from electronic editing projects like the Blake Archive and the Rossetti Archive. The Library was faced with the long-term support of objects in all media, including much born-digital data. No system existed to allow libraries to provide such mixed-media support, so the University established a digital library research and development group who soon realized the potential of the Fedora theoretic model originally proposed by Carl Lagoze and Sandra Payette at Cornell University to solve some problems of the interoperability of digital objects and repositories (Payette et al.). Fedora is in the process of development, but shows great promise in elucidating the problems and proposing some solutions to the long-term survival of complex editions.
There are a number of projects too which are looking at the problems of preserving information on web sites. The web is the largest and most prolific source of information that there has ever been, much of which could be lost if it is not actively preserved. Initiatives at a number of national libraries, concerned with potential loss of considerable portions of the national heritage now being produced in online form (the Library of Congress, the British Library, National Library of Australia, National Library of Sweden, among others), are looking closely at the long-term preservation of web sites. The volumes of data to be stored are enormous, so it is impossible to try to preserve every state of a web site as a library might, for instance, preserve every edition of a book. Some projects aim to harvest web sites every six months and preserve them for historic purposes. Many of the experiences in the preservation of web sites can offer insights into the preservation of networked editions. For instance, if the networked edition has many links to external sources of information outside the control of the editors, these are going to be highly vulnerable. Link checkers can automatically report that links are broken and information missing, but rarely can anything be done about this other than to remove the link. Another interesting web harvesting initiative is the Internet Archive ( http://www.archive.org ), which has taken an ‘almost complete snapshot of the World Wide Web every 60 days since 1996—that’s about two billion pages’ (Kahle, qtd. in Internet Archive: ), 4 and has created the “Wayback Machine” to give access to former states of web sites; this makes it possible ‘to surf more than 10 billion web pages.’ Try it on your favourite web site and see what problems this throws up.
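What a link checker reports, and how little it can do beyond reporting, can be seen in a minimal sketch. The Python below uses only the standard library; the function names are hypothetical, and treating any status of 400 or above (or no response at all) as ‘broken’ is an assumption made for illustration rather than a rule taken from any particular tool.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL, or 0 if it cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error such as 404.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # No answer at all: unknown host, timeout, malformed URL.
        return 0


def check_links(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, int]:
    """Map each broken link to the status code returned for it."""
    broken = {}
    for url in urls:
        status = fetch(url)
        if status == 0 or status >= 400:
            broken[url] = status
    return broken
```

Calling check_links on the list of external URLs cited in an edition returns only those that failed. Note that such a check cannot detect the subtler problem of a link that still resolves but now leads to different content.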
What are editors to do?
In perusing this volume it is obvious that there is a real tension between the new possibilities offered by the electronic edition and the need to preserve the scholarly record. As Eaves points out in discussing the Blake Archive: ‘We have no answer to the haunting question of where and how a project like this one will live out its useful life’ (157). The TEI itself was established to address many of the problems (7), but just using the TEI for preparing text is not going to be the answer: there are many more issues than what text encoding scheme to use. The electronic medium offers so much more than textual processing to the editor, as is clear from many contributions here. Each of these media has its own problems of handling and long-term storage. But even more problematic than the format, handling, markup, and storage of different media is the maintenance of the links between and within the media, and, with the networked edition, the links can be highly prolific and sometimes uncontrollable, if they are linking to resources external to the edition.
In the past there has been a threefold distinction made between data, programs, and interface that it might be useful to discuss here, though Fraistat and Jones warn us against ‘the venerable dichotomy that divides the world into opposing demesnes one focused on the front end ... and one on the back end’ (74); for them, editors (especially editors of poetry) in the digital world need to consider the appearance of texts on particular interfaces, which affects decisions about editing and markup. This is echoed by Kiernan (197) who opines that ‘providing an effective, easily understood, user interface (or GUI) is always an important issue in an electronic edition.’ Perhaps we need to expand the distinction to a fivefold one: data, metadata, links, programs, and interface. The first three of these contain the intellectual capital in an edition, the last two are (should be?) external. However important the programs used to create and deliver the edition and the interface through which it is accessed, scholars must always remember that these are likely to be the least durable part of any electronic edition, and plan for the design and formatting of their intellectual assets in such a way that they can be reused with different programs and interfaces. Easier said than done.
What do we mean here by data, metadata, and links? Data is the raw material deriving from the source itself and can be text, images, sound files, video etc. Metadata is added symbols that describe some features of the data, and links are strings of code that link pieces of data to other pieces of data—either internal to the source itself or external. Throughout this volume, the metadata that has been mostly of concern is textual markup, and in particular TEI markup, but there are other kinds of metadata that it can be useful for the editor to use in an electronic textual edition, some of which are discussed by Lavagnino above. In particular, it may be that the TEI is not the best metadata system to use for non-textual materials that form part of the textual edition. The TEI system is excellent for describing logical objects; there may be other metadata standards that are more useful for describing physical objects. Lavagnino suggests that Encoded Archival Description (EAD) might be better at describing highly structured collections of data that could together make up an edition (259), and METS (the Metadata Encoding and Transmission Standard) is being used a great deal by libraries in a digital preservation context for describing and encapsulating the structure of digital objects. 5
Creating preservable assets
There are a number of approaches that can be taken to the creation of editions that will survive for the long term. One relatively straightforward one is to produce a fixed edition on some stable medium at regular stages in the life of the edition. This is the approach that is being taken for the Cambridge Edition of the Works of Ben Jonson, described above by Gants. The plan is to produce simultaneously a six-volume traditional edition and a networked electronic edition, the second of which will grow and change as scholarship develops and which will ‘distribute editorial power among the users’ (86). This is an interesting approach as the reliable and stable text that it is so important to establish will be at the core of a conventional edition at the same time as being at the centre of a highly fluid, ‘rhizomatic’ network of hypertext links. Fixing the text in one form and therefore being assured of its stability can perhaps allow for more experimentation in the electronic medium. It will be interesting to see in what way and over what period the editions diverge, and at what point there is felt to be the need for a new printed edition, if ever. If there ever is a new printed edition, no doubt it will differ profoundly from the first, given its electronic ‘after life’.
While this volume is mostly concerned with electronic textual editions, such is the power of the electronic medium that textual editions can include any other electronic media, which all have to be created according to some standards and stored in a format that is non-proprietary and well-supported. Eaves points out that editors of multimedia editions such as those discussed here should be practising ‘double editing’, editing first in discrete, then in integrated media (155). This is a good approach to creating preservable assets, and it makes editors think very hard about what the different components of the edition are likely to be. It is also a valuable approach to editing in teams: different team members can prepare different parts of the edition which can be integrated later. Indeed, this perhaps provides the nucleus of a working method: conceive of the individual components of an edition first, work out how to capture, describe, and store them, and then work out what the links between them might be.
Applying data standards
Standards in this field are changing constantly, and one of the key benefits the TEI has brought to the world of textual editing is a standard that is durable yet flexible. Standards must change to reflect new knowledge as well as new technical possibilities, but they are only useful if they have a predictable path of change. Suggestions offered here are given as examples of how editors need to think about their work. Editing projects must create, encode, and describe their component materials according to the most robust standards available, and these standards will be different for the different media types. They must be compatible and interoperable within the project, and allow for the later possibility of interoperability with other resources.
For text, the standard that should always be used is the ASCII standard, with markup added that is also in ASCII; the markup can be embedded or offset (see below), and, while there has been a great deal of progress in the presentation of special characters through the Unicode standard, it is preferable that characters are encoded as entity references which can be displayed in Unicode rather than encoded as Unicode itself: for historical reasons, some characters have multiple representations, and there are some ancient and oriental languages that do not yet have full Unicode support. The TEI is of course the markup system of choice for most electronic editing projects starting today. One choice to be made is whether markup (TEI or any other kind) should be embedded in the text or ‘offset’ (this is sometimes called ‘stand-off’ markup): that is, stored in some separate file. The advantage of embedded markup is that the text file is complete in itself: when you have the file, you also have everything that is known about it, including (in a well-formed TEI file with header) all dates of revisions, records of people who have worked on it etc. However, TEI markup can be used to describe many different features of a textual object, which can result in a highly complex file. Some projects are now experimenting with offset markup, arguing that the text file can be kept in a more pristine state (thereby creating a more preservable textual object), and that different kinds of markup can then be kept in separate files, all of which can point to the original text file. The only markup that the text file would then hold is locational information that the offset files can use to point to the correct parts of the text. This approach is proposed by Berrie, Eggert, Tiffin, and Barwell above (197-202), as they feel that this allows the maintenance of the authenticity of a text file across different platforms (204), clearly a desideratum for the preservation of the texts.
This approach is also taken by the ARCHway project at the University of Kentucky, which is dealing with complex manuscript traditions that need markup to describe paleographic features and physical attributes of the manuscript (collation, damage, etc.). They separate out the markup for different features into a number of DTDs, which they find gives them a system which is clear, concise, and easy to maintain (Dekhtyar and Iacob).
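The principle of offset markup can be illustrated with a minimal sketch in Python. It assumes, purely for illustration, that annotations point into the text by character offsets and that each annotation belongs to a named layer standing for a separate file. The class, the layer names, and the ‘ink blot’ annotation are hypothetical, and this is not the mechanism used by either of the projects just mentioned.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character annotated
    end: int    # offset just past the last character annotated
    layer: str  # the separate file (layer) this annotation would live in
    label: str  # what is being asserted about the span


# The text itself carries no markup at all.
TEXT = "I will arise and go now, and go to Innisfree"

# Each layer could be stored, versioned, and preserved in its own file.
ANNOTATIONS = [
    Annotation(35, 44, "names", "place"),
    Annotation(7, 12, "damage", "ink blot"),  # invented example
]


def annotated_spans(
    text: str, annotations: List[Annotation], layer: str
) -> List[Tuple[str, str]]:
    """Return (label, text span) pairs for one layer of offset markup."""
    return [
        (a.label, text[a.start:a.end])
        for a in annotations
        if a.layer == layer
    ]


if __name__ == "__main__":
    print(annotated_spans(TEXT, ANNOTATIONS, "names"))
    print(annotated_spans(TEXT, ANNOTATIONS, "damage"))
```

The fragility of the scheme is also apparent: if a single character is added to or removed from the text, every offset after that point is wrong. This is one reason why the stability of the base text matters so much when markup is held separately.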
Image data should be captured at the best quality possible to reveal all significant information about the original, and then stored in a non-proprietary file format using only lossless compression (if compression is to be used at all). This will differ greatly depending on the source materials: modern printed documents might yield everything they have to offer scanned bitonally at 300 dpi and stored as (for example) TIFF images with lossless compression. These images can be stored for the long term and will yield up all the information necessary now and in the future. Complex and perhaps damaged manuscript materials will need different treatment: these may need to be captured with the best systems currently available, at the highest possible resolution, stored as archival masters (currently TIFFs with no compression), and then have lower quality surrogates available for access. Such original images scanned with current professional digital cameras (as at mid 2003) can be up to 350 MB each in size, which poses huge storage problems for projects and libraries. It is vital to consider right from the start what storage requirements a project has, and usually to double or even treble this to take account of all the extra files that will be produced along the way: delivery images, thumbnails, text files, metadata, etc. Think too about the networks that will have to move all this material around and the backup media that will also be needed.
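The arithmetic behind such planning is simple enough to sketch. The Python below assumes that the figure of 350 per image quoted above is in megabytes, that one gigabyte is taken as 1000 megabytes, and that the doubling or trebling is applied as a single multiplier; the function name and the example of a 500-page manuscript are hypothetical.

```python
def estimate_storage_gb(
    master_count: int,
    master_size_mb: float,
    overhead_factor: float = 3.0,
) -> float:
    """Estimate total project storage in gigabytes (1 GB = 1000 MB).

    overhead_factor multiplies the size of the archival masters to allow
    for delivery images, thumbnails, text files, metadata and so on.
    """
    if master_count < 0 or master_size_mb < 0:
        raise ValueError("counts and sizes must not be negative")
    if overhead_factor < 1:
        raise ValueError("overhead_factor must be at least 1")
    return master_count * master_size_mb * overhead_factor / 1000


if __name__ == "__main__":
    # A hypothetical 500-page manuscript captured at 350 MB per page.
    print(estimate_storage_gb(500, 350, overhead_factor=2.0))  # doubled
    print(estimate_storage_gb(500, 350, overhead_factor=3.0))  # trebled
```

On these assumptions the masters alone occupy 175 GB, and the working total lies between 350 GB and 525 GB before any backup copies are counted.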
Audio and video data pose even more problematic storage requirements, being more memory hungry than still images. Uncompressed high quality video, for instance, has file sizes of around 28 MB per second. Audio and video standards are currently moving very rapidly because of commercial developments in offering streaming audio and video over the internet. Any standards suggested here would be out of date immediately; it is vital that projects take the best expert advice that they can at the outset in order to establish formats and standards.
Another issue that greatly affects the long-term prospects of electronic editions is the naming conventions used. A complex editing project will likely produce many thousands of files that need to be kept track of. These files may also be kept in different locations, locations which may change over time. File-naming conventions should be devised and documented at the early stages of an editing project. It is often useful to have names assigned automatically as this can reduce human error and can make batch renaming simpler and more reliable if necessary later on. Bar codes, hashing functions, and databases can all be used for assigning filenames. File locations can be seriously problematic: the URL is notoriously prone to difficulties, and some other way of indicating locations should be found to avoid the dreaded ‘Error 404: FILE NOT FOUND’. Work is being done on alternatives to URLs: Uniform Resource Names (URN) identify a piece of information independent of its location: if the location changes, the information can still be found. One particular type of persistent identifier that has been adopted by a number of publishers is the Digital Object Identifier (DOI). DOIs are persistent names that link to some form of redirection, so that a digital object will have the same DOI throughout its life, wherever it may reside. DOIs need to be registered with an agency in order that the redirection process can operate, but they offer good long-term prospects for digital object naming. 6
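Both ideas, names assigned automatically by a hashing function and persistent names that are redirected to a current location, can be shown in a minimal Python sketch. Everything named here is hypothetical: the ‘ed-’ prefix, the sixteen-character truncation of the hash, and the NameRegistry class are illustrative assumptions, and the registry is a toy stand-in for the principle of redirection, not a description of how URN or DOI resolution is actually implemented.

```python
import hashlib
import json
from pathlib import PurePosixPath


def content_name(data: bytes, original_name: str, prefix: str = "ed") -> str:
    """Derive a file name from a hash of the content, keeping the extension.

    The same bytes always receive the same name, whoever assigns it.
    """
    digest = hashlib.sha256(data).hexdigest()[:16]
    suffix = PurePosixPath(original_name).suffix.lower()
    return f"{prefix}-{digest}{suffix}"


class NameRegistry:
    """Map persistent names to current locations (illustrative only)."""

    def __init__(self) -> None:
        self._locations = {}

    def register(self, name: str, location: str) -> None:
        self._locations[name] = location

    def move(self, name: str, new_location: str) -> None:
        # The name stays the same; only the record of where it lives changes.
        if name not in self._locations:
            raise KeyError(name)
        self._locations[name] = new_location

    def resolve(self, name: str) -> str:
        return self._locations[name]

    def to_json(self) -> str:
        """Serialize the registry so it can be documented and preserved."""
        return json.dumps(self._locations, indent=2, sort_keys=True)


if __name__ == "__main__":
    name = content_name(b"folio 1 recto", "Folio1r.TIF")
    registry = NameRegistry()
    registry.register(name, "/store/2003/" + name)
    registry.move(name, "/archive/masters/" + name)
    print(name, "->", registry.resolve(name))
```

Citations and links within the edition would then refer to the persistent name, so that moving a file means updating one registry entry rather than every reference to it.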
All of the suggestions above relate to standards and conventions to be used when producing the discrete components of an electronic edition, but the whole point of editing in this new format is the introduction of complexity in the interrelationships between the many sub-parts of the edition, and it is this complexity that is likely to be the most challenging problem in preserving electronic editions. What creates most of the complexity in editions is the exponential growth of numbers of links, and in editions of even relatively short works there can be millions of links created. Sometimes this complex interlinking is managed by programs and interfaces—the most vulnerable part of the edition. However, it is not necessary to rely on programs and interfaces to provide linking capability: the TEI specifies a number of mechanisms for the description and markup of links 7 and the World Wide Web Consortium (W3C) has defined the XML Linking Language (XLink) and the XML Pointer Language (XPointer) for the specification of complex links. 8 The problem with using these specifications is that there is more of a learning curve than when linking is done via pointing and clicking in various kinds of authoring software, but for long-term security, links must be separated from programs.
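What it means to separate links from programs can be suggested by a minimal Python sketch in which the links are held as plain data in an XML file of their own, which any later program can read. The element and attribute names (links, link, source, target, relation) and the identifiers in the example are invented for illustration; this is deliberately not TEI, XLink, or XPointer syntax, for which the specifications themselves should be consulted.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

Link = Tuple[str, str, str]  # (source, target, relation)


def build_link_file(links: Iterable[Link]) -> str:
    """Serialize links as a stand-alone XML document (hypothetical schema)."""
    root = ET.Element("links")
    for source, target, relation in links:
        ET.SubElement(
            root,
            "link",
            {"source": source, "target": target, "relation": relation},
        )
    return ET.tostring(root, encoding="unicode")


def read_link_file(xml_text: str) -> List[Link]:
    """Recover the links without any of the software that created them."""
    root = ET.fromstring(xml_text)
    return [
        (e.get("source"), e.get("target"), e.get("relation"))
        for e in root.findall("link")
    ]


if __name__ == "__main__":
    links = [
        ("poem.txt#line1", "recording.wav", "read aloud in"),
        ("poem.txt#line1", "ms-folio1r.tif", "transcribed from"),
    ]
    xml_text = build_link_file(links)
    print(xml_text)
    assert read_link_file(xml_text) == links
```

The point of the sketch is only that links recorded in this way survive the loss of the interface that first displayed them: a different program, written years later, needs nothing more than the file.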
Finally, the most important thing an editor can do to ensure the development of preservable editions is to consult widely with experts in digital preservation, metadata, digital file formats etc, and with other editors who have been grappling with the same problems. Anyone who has read this far in this volume is well-equipped to begin an editing project with most of the tools they need at their fingertips.