Electronic Textual Editing: Levels of transcription [M. J. Driscoll]

By levels of transcription is meant, essentially, how much of the information in the original document is included (or otherwise noted) by the transcriber in his or her transcription. W. W. Gregg's distinction between ‘substantives’, the actual words of the text, on the one hand, and, on the other, the ‘accidentals’, the surface features of the text such as spelling, punctuation, word division, and so on, is well established. Few editors, however, in particular those working with manuscript or early printed materials, are content simply to record the one and ignore the other. There is, moreover, a great range of ‘accidentals’ which one may or may not want to include in a transcription. At one end of the spectrum there are transcriptions which may be called strictly diplomatic, in which every feature which may reasonably be reproduced in print is retained. These features include not only spelling and punctuation, but also capitalization, word division and variant letter forms. The layout of the page is also retained, in terms of line-division, large initials, etc. Any abbreviations in the text will not be expanded, and, in the strictest diplomatic transcriptions, apparent slips of the pen will remain uncorrected.

Such editions are often so close to the originals as to be all but unreadable for those unfamiliar with early palaeographical or typographical conventions, or in any case no easier to read than the originals (compare the above figures, showing the same passage in Verner Dahlerup's 1880 edition of Ágrip af Noregs konunga sǫgum, and in the original manuscript: Copenhagen, Arnamagnæan Institute, AM 325 II 4to.) At the opposite end there are fully modernized transcriptions, where the substantives are retained but everything else is brought up to date, in some cases to such an extent as to make it questionable whether they are to be regarded as transcriptions at all. In between these two extremes a number of levels may be distinguished — ‘semi-diplomatic’, ‘semi-normalized’, etc. — depending on how the accidents of the original are dealt with. Practice varies greatly, however, and in some editorial traditions it may be common to normalize or regularize some features at one level while retaining others, and I find it more useful therefore to identify the individual features and the ways in which they can be treated by the transcriber.

Forms of letters: Variant letter forms — high and round s, for example — are often distinguished in transcriptions of manuscripts and early printed materials. Texts which are semi-diplomatic or semi-normalized will generally distinguish only between variant letter forms which are felt to have a basis in phonology distinctions, which the two forms of /s/, for example, do not. For statistical purposes, however, it may be desirable to transcribe (or register in some other way) such palaeographical or typographical distinctions even where they have no apparent basis in phonology.

Punctuation: It is standard practice in diplomatic transcriptions for the punctuation of the original to be reproduced, no matter how inconsistent it may appear by modern standards. The transcriber may try to reproduce the actual signs used in the original, or may chose to use the nearest modern equivalent. In antiquity, for example, it was common to place points at varying heights in order to represent different degrees of pause, in ascending order of importance; the transcriber may chose to use an ordinary full stop instead of a mid- or high-dot. Similarly, an ordinary semi-colon may be used for the ‘punctus elevatus’, the function of which was arguably the same. In many early texts, both hand-written and printed, marks of punctuation are both preceded and followed by white space; the transcriber may chose to disregard the space before the mark, while faithfully reproducing the mark itself. In some editorial traditions inverted commas (quotation marks) may be supplied to indicate direct speech, even while other features of the original punctuation are retained.

Capitalization: In a diplomatic transcription the capitalization of the original will normally be retained, although in some traditions proper names are given capital letters whether there are capitals in the original or not. In some cases it may be desirable to distinguish between large and small capitals, i.e. letters which have the shape of majuscules but the size of minuscules. In Old Norse manuscripts, for example, small capitals were frequently used to indicate geminate consonants; transcribing these as ordinary capitals would give a false impression of the use of capitals in the text, while replacing them with double lower-case letters would arguably involve a unacceptable degree of normalization. Some scholars would also recognize the existence of ‘enlarged minuscules’, i.e. letters which have the shape of minuscules but the size of majuscules.

Structure and layout: By structure is meant the division of the work into its constituent parts: prose works will normally comprise a number of chapters, which will contain sections and/or paragraphs; works in verse will be divisible into cantos or fits, these into stanzas, the stanzas into lines and so on. Other types of texts, letters, for example, will have their own structures. By layout is meant the arrangement of the text on the page. In medieval manuscripts and many early printed books there is rarely any connection between these two; a new chapter will not necessarily begin on a new page — except by accident. Poetry was frequently written out like prose, although sometimes the beginning of a new line or stanza would be indicated in some other way, through a sign or mark of punctuation. The more diplomatic a transcription is the more likely it is to favor the structure of the document, i.e. the layout, over the structure of the work. In strictly diplomatic transcriptions the line-, column- (if any) and page-boundaries of the original are reproduced in the transcription, while the structure of the work, if unmarked in the original, is left unmarked in the transcription. In less diplomatic transcriptions the structure of the work will be given priority, while the line-, column- and page-boundaries may indicated by means of a vertical line (single for line-boundaries, double for page boundaries, where both are indicated). In normalized and modernized texts the focus is entirely on the work itself, and there is generally no indication whatsoever of the layout of the original.

Abbreviations:The use of abbreviations, both to spare the scribe the labor of writing words which, due to their frequency generally or in a particular text, could easily be understood in an abbreviated form, and to save parchment or paper and ink, is a characteristic feature of ancient and most medieval vernacular manuscript traditions and early printed books. As an aid to the reader it is common in all but the strictest diplomatic transcriptions to expand abbreviations. When a word or phrase is abbreviated a number of letters is suppressed, and the expansion of the abbreviation thus involves supplying these letters. The letters so supplied are frequently marked in some way in the transcription, printed in italics or given in brackets, for example, but even in transcriptions which are otherwise fairly diplomatic abbreviations may be expanded silently, especially in cases where it can be argued that there is no doubt as to what they represent.

Corrections and emendations:In all but the most diplomatic of transcriptions obvious errors and omissions will be corrected — the sorts of things, it could be argued, which the scribe or compositor himself would have corrected had they been brought to his attention. An editor may also want to emend the text on the basis of readings from other witnesses, common sense or artistic inspiration, ‘correcting’ things in the text which the scribe or compositor would presumably not have regarded as in need of correction. These corrections and emendations can be marked in a variety of ways in printed editions; one common way is to place letters or words assumed to have been inadvertently omitted in angle brackets, while obvious misspellings are corrected and marked with an asterisk, the original form being given in a note. More extensive emendations are normally treated in the notes.

The decisions facing the producer of an electronic transcription are essentially the same as those facing the producer of any transcription — how much information to include? And the basis for making these decisions is much the same as well, in that the choices made will depend on factors such as the amount of time available for the job, and, not least, the intended use to which the transcription will be put. The great advantage of electronic texts, however, is that many decisions need not be made, in that the transcriber can include a wide range of information in the transcription but then chose how much of it to make available to the reader, or, better still, allow the reader to chose for him- or herself how much of it he or she wishes to see. From a single marked-up text, it should be possible, if one so desired, to produce screen or print copy at any level from strictly diplomatic to fully normalized. Such mark-up would by necessity need to be fairly complex, and would almost certainly require several layers of input. And this, indeed, is another great advantage of electronic transcription over traditional print transcription: one can return again and again to one's transcription, adding further levels of mark-up.

How one proceeds will be determined to some extent by whether one is starting from scratch or working with a text already in machine-readable form. In the latter case the most logical thing to do would be to mark up the text feature by feature, beginning with the overall structure and layout, for example, and then adding mark-up for abbreviations, variant letter forms, etc. If one is starting with a blank sheet, as it were, it would perhaps make more sense to mark up these various features as one goes, or at least as many of them as one can reasonably be expected to keep track of at any one time. In this case it might be preferable to transcribe the text as it comes, including the abbreviations, variant letter forms and so on first, and then add the structural mark-up.

As mentioned above, the text of the work and the physical object carrying that text have separate structural hierarchies, both of which need ideally to be encoded, even in the most basic of transcriptions. In order to represent the former it is recommended that the <div> element be used for the largest structural divisions in prose texts, with a type attribute to specify the nature of the division, chapter, section, etc. Paragraphs within these divisions can be tagged using <p> . Verse texts should be marked up using the tags <l> (for ‘line’) and <lg> (for ‘line-group’ i.e. a group of lines functioning as a formal unit), again with a type attribute to identify the type of unit, as e.g. stanza or couplet. Lines and line-groups can also be numbered and identified using the n and id attributes. For the structure of the physical document, it is recommended that empty ‘milestone’ elements be used, <pb/> , <cb/> and <lb/> , for page-, column- and line-boundaries respectively, which can also be numbered and provided with an id. (Note that <lb/> is an empty element used to indicate physical divisions in a printed book or manuscript, while <l> , which is not empty, is used for lines of verse.) Milestone tags should be placed at the beginning of the unit to which they refer. A blank space should therefore appear before, rather than after, the <lb/> tag, except where a word runs over the line-break, in which case there is no space. The way these elements are handled is determined by the style-sheet. Line-breaks can be made to display as such, or marked in some other way, for example with a vertical bar, or can be ignored altogether. Several different style-sheets can be applied to the same text to provide different views.

Variant letter forms — and indeed any ‘exotic’ (read non-English) characters — are encoded using entity references, as detailed in another chapter (REF). If one wished to distinguish between different allographs of a single letter or other palaeographical or typographical features for purposes of statistical analysis, one could define entity references for this purpose, &b1;, &b2; and so on. The same will hold true of small capitals and enlarged minuscules, should one wish to retain them.

Marking-up abbreviations and their expansions is one of the more problematic aspects of the transcription of primary sources. The theory is simple enough: one can use either the <abbr> or the <expan> element, depending on whether the abbreviation or its expansion is to be the preferred form. A strictly diplomatic transcription will strive to be as neutral as possible and normally give the unexpanded abbreviation, tagged <abbr> , while at most other levels the expanded form will be used, tagged <expan> . In theory it shouldn't matter which is used, since the possibility for giving the other form as an attribute value is always there. In practice, however, the situation is somewhat more complicated. To begin with, what, exactly, is ‘the abbreviation?’ Is it the mark, sign or letter (if there is any) that indicates that something has been suppressed, or is it the entire word? And similarly, is ‘the expansion’ only the letters which have been suppressed and have therefore been supplied by the transcriber, or is it, again, the whole word? A case could be made for distinguishing between abbreviations with a lexical reference (suspensions, contractions and a number of brevigraphs) and those with a graphemic reference (superscript letters and signs and the remainder of the brevigraphs). It strikes one as counter-intuitive to treat the former on anything other than the whole-word level, while treating the latter in the same way seems equally misconceived. To take a simple example, MS. is a common abbreviation for ‘manuscript’ (or rather ‘manuscriptum’). It is a suspension, or rather two, where the first letter of each part of the compound word is given and the rest omitted; the fact that it is an abbreviation is indicated by the full stop (which is frequently omitted), and by the fact that it is written upper case (which it frequently isn't). One might want to tag this as an abbreviation and indicate the expansion as an attribute value: <abbr expan="manuscript">MS.</abbr>. Alternatively, one could give the expanded form as the content of the element and the abbreviated form as the value of the attribute: <expan abbr="MS.">manuscript</expan>. To insist on M<expan abbr="">anu</expan>S<expan abbr=".">cript</expan> strikes me at least as patently absurd — in addition to producing the incorrectly expanded form ‘ManuScript’. But even if one is prepared to overlook this, what should one do with the form MSS. for ‘manuscripts’ (or ‘manuscripta’), where the second S is there only to indicate that the word is plural? The superscript nasal stroke, on the other hand, has a specific graphemic reference: it stands for the letter m or n. With a word such as fratrū, it seems more natural to encode the expansion as fratru<expan abbr="&bar;">m</expan>, rather than <expan abbr="fratru&bar;">fratrum</expan>. Doing so also makes it completely explicit which letters have been supplied, and makes it possible for these letters to be displayed in italics (or round brackets), in the manner of a traditional printed edition, something which many scholars still feel to be of importance.

One solution would be to use <abbr> to indicate abbreviations which are left unexpanded (although the expanded form can be given as an attribute value), but <expan> for the letters supplied by the transcriber. Those who really do wish to have their cake and eat it, i.e. derive both forms from a single mark-up, will probably need to find another solution, e.g. giving both the expanded and unexpanded forms side by side within some grouping element.

Alterations made to the text, whether, in the case of manuscripts, by the scribe, or in some later hand, can be encoded using <add> , for additions, and <del> , for deletions while editorial emendations should be encoded with the <sic> and <corr> tags. The latter pair, it has been argued, could also be used for editorial emendations on the grounds that there is, in essence, no difference between changes made to the text by a scribe or later reader and those made by the transcriber or editor of a scholarly edition. There is, however, or should be, no more fundamental distinction in textual scholarship than between what is physically present in the source and what is not. The act of recording what actually is in the source must therefore remain entirely separate from postulating what ought perhaps to have been there but for one reason or another isn't. The simplest way to maintain this distinction is to employ separate sets of elements for the two.

The elements <corr> and <sic> function as mirror images of each other, in the same way as <abbr> and <expan> , and the choice as to which to use is made on the same grounds. In a strict diplomatic transcription, one may wish only to indicate an incorrect or suspicious reading in the manuscript without attempting to correct it, or, in a normalized text, to emend an obvious error without indicating what the original reading was. The two can also be combined, with the one then acting as an attribute of the other. This is not the case with <del> and <add> , however, which cannot be said to mirror each other in the same way, because deletions and additions can obviously also, and probably equally frequently, be found in isolation.

Where a word has been supplied by the editor, the <supplied> tag can be used. It is customary in textual scholarship to distinguish between text now illegible or lost through damage but assumed originally to have been in the source and text assumed to have been inadvertently omitted by the scribe or compositor. This distinction can be indicated in the mark-up through the use of the reason attribute, the value being for example illegible in the former case and omitted in the latter. Where the reading of another witness supports the reconstruction it is possible to use the source attribute. The <supplied> element should only be used when the missing text can be reconstructed with a very high degree of certainty; when such is not the case <gap> can be used instead, with both a reason and an extent attribute.

Finally, there is the question of normalization/regularization. One can use the <reg> element to give regularized forms of variant or archaic spellings, or the <orig> element to indicate that a spelling is archaic or in some other way non-standard. As with several other of the element mentioned here, the two mirror each other, allowing for both the regularized and unregularized forms to be given, the one as the content of the element, the other as an attribute value. If the text has only a few irregular or archaic forms, it is an easy enough matter to tag them as one proceeds, but in a text with highly irregular or archaic orthography, it would be necessary to tag each word. This would probably be better done afterwards, and could be automated to an extent.

For the most part the elements discussed here are used to tag features in the original in a way analogous to the typographical conventions of printed editions. There are, of course, many other elements available within the TEI for tagging other things, dates, for example, or the names of persons, places and institutions, all of which are useful for search purposes and indexing. And why stop here? The TEI also provides mechanisms for associating any kind of semantic or syntactic analysis and interpretation which an encoder might wish to attach to a text, including such familiar linguistic categorizations as ‘clause’, ‘morpheme’, ‘part-of-speech’, etc. as well as characterizations of narrative structure, such as ‘theme’. There is, indeed, to all intents and purposes no limit to the information one can add to a text — apart, that is, from the limits of our own imagination. One thing is clear: the more one puts into a text, the more one can get out of it. If one is lucky one winds up even finding things one didn't know were there; if one is very lucky one finds things one didn't even know it was possible to look for.

Last recorded change to this page: 2007-10-31 • For corrections or updates, contact webmaster AT tei-c DOT org