<!-- document TR9 W4 "Diplomatic transcription of Modern Mss" -->

TEI TR9, TEI Work Group for Manuscripts

Diplomatic transcription of modern manuscripts, preliminary suggestions
and recommendations.

To be submitted to the TEI meeting in Myrdal, November 1991


Claus Huitfeldt, 8 November 1991

Draft: This text has not been reviewed by other members of the group,
and should not be taken as an expression of any official stance of TR9.

1 Background

The Text Encoding Initiative Work Group for Manuscripts had its first
meeting in Louvain-la-Neuve, 26-27 October 1991.  Present at this
meeting were: D. Buzzetti (Bologna), J. Hamesse (Louvain-la-Neuve), C.
Huitfeldt (Bergen), M. Sperberg-McQueen (Chicago).

There was consensus at the meeting that although it is probably possible
to agree on a recommended set of core tags and rules for the encoding of
manuscripts, it is almost certain that no such core set will suffice for
the encoding of any particular manuscript within any particular project.

The following two distinctions appeared to be crucial:

(1)  Diplomatic transcriptions of single witnesses vs.
     text-critical transcriptions of several witnesses.
(2)  Modern manuscripts vs. ancient and medieval
     manuscripts.

The work group decided to divide its work into three parts:  One
covering features common to all manuscripts, one concerned with modern
manuscripts, and one with ancient and medieval manuscripts. It was also
decided to seek cooperation with the work groups on Textual Criticism
and Printed Books. (Cf. the minutes from the TEI TR9 meeting 26-27
October 1991.)

2 General remarks

The present text was written with the intention of dealing exclusively
with what is of particular relevance to the diplomatic transcription of
modern manuscripts. Thus, what such transcriptions have in common with
transcription of

     (i)       printed matter,
     (ii)      ancient and medieval
               manuscripts,
     (iii)     text-critical work, and
     (iv)      analytic or interpretational
               encoding,

will not be covered.

However, I have not been able to live up to this intention in all
respects, and for several reasons. First, because I do not know the
other types of texts and traditions well enough to know what is common
and what is special. Second, because time has neither permitted a
coordination with Prof. Hamesse's work, which is to concentrate on core
manuscript features and ancient and medieval texts, nor a close study of
the other TEI Work Groups' recommendations. Third, because there are
some points at which I have become uncertain whether the above-mentioned
distinctions are applicable:

Ad (i), Manuscripts vs. printed matter: What about annotations in
printed books? What about manuscripts written on printed forms,
schemes, diaries with printed dates, titles, headers, pagination etc.?
And what about the abundant typewritten materials more or less heavily
annotated by hand?

In the following, read "modern manuscripts" for "manuscripts" unless
otherwise specified, and take into consideration the possible inclusion
of these types of annotated printed or typewritten texts.

Ad (ii), modern vs. ancient and medieval: Early printed books have many
features in common with manuscripts.

Ad (iii), Diplomatic vs. text-critical transcription: The transcription
of a single witness is normally assumed to be distinctly different from
the (text-critical) collation of several witnesses. Modern manuscripts
with insertions, substitutions, revisions, annotations etc. present
problems concerning variant readings which are partly different from,
but partly also similar to, those encountered in text-critical work.
(Cf. 10 below.) Besides, diplomatic transcription of single witnesses is
often a first step on the way to a text-critical version. And what about
the diplomatic transcription of a manuscript which has later emerged in
print? etc.

Ad (iv), Diplomatic vs. analytical or interpretational: First, the
present writer is skeptical about the distinction as such.  E.g., what
is it that makes us recognize a straight horizontal line partly below,
partly overstriking a line of text in a manuscript as an underlining,
and not as a deletion? I will not pursue the subject here, since it
brings us far away from the subject matter...  More important, there are
certain questions which are normally considered <sic>analytical or
interpretational</sic>, which take a different form in manuscripts than
in other texts, and which should not be neglected even in purely
diplomatic transcriptions. I am thinking of matters such as
abbreviations, encoding of elements below word level, certain kinds of
internal normalization of orthography etc. (Cf.  below.)

At the present stage I have not found it possible or desirable to be
specific about details such as names of tags, attributes, values, or
entity references. Besides, many of the considerations do not
necessarily call for new tags and attributes etc., but for different
application criteria. (This is for me one of the major problems with TEI
P1: a tag set alone can be used for almost anything (cf. Humpty-Dumpty);
the real issue is the application criteria. And these are often extremely
difficult to settle without becoming either too general or too specific
to be of interest.)

Since there has been no time for a general survey of actual encoding
schemes, I have simply described (though in many instances in a modified
form) some things I believe to be of general relevance from one
particular scheme that I know very well, the Wittgenstein Archives'
system MECS-WIT (which is not even SGML). No doubt, generalizations on
the basis of one specific scheme must have led me astray on a number of
points.

In sum, this is neither an attempt to produce anything final nor to
represent a consensus of the work group, but rather an attempt to
highlight some items and to suggest some possible recommendations for
discussion.

One final remark: The present writer is convinced that diplomatic
transcription in the computer age, which is also the age of computerized
facsimile, xerox machines, and other reprographic techniques, should not
aim at the exact reproduction of all paleographic/ typographic/
physical/ topological (?) features of the copy text.

In my eyes, a diplomatic transcription can only be a careful and
detailed representation of what is considered as such features, to the
extent that they are found significant by scholars concerned, and in a
form which supplies sufficiently detailed information about what all
scholars agree on, in order for them to make informed decisions on
matters about which they may disagree. Therefore, the following
recommendations are also made on the basis of general assumptions about
what everyone will in general agree on as significant and indisputable -
now, and for a long time to come.


2 Pages and pagination

In printed matter there is a certain regularity in the physical order of
appearance of the text which is often absent in manuscripts. Manuscript
books may contain insertions in running text on recto-pages written on
verso-pages, manuscript books may be written beginning on recto-pages
continuing on verso-pages, etc. Pagination is often inconsistent,
repagination may have resulted in several pagination numbers on each
page etc. At the same time, pages and page numbers are often important
and well established points of reference.

Recommendation: Texts should be transcribed according to their physical
order of appearance, but only to the extent that this does not obviously
lead to unnecessary (?) absurdity. Wherever necessitated by requirements
of cotextual consistency, this recommendation is overruled.

Therefore, since parts of text may have to be moved from their proper
physical to their proper cotextual position, it is of high importance
that folios, pages, and pagination are indicated for all parts of text.
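
As an illustration only (this is not taken from MECS-WIT or TEI P1, and
all names and page labels are hypothetical), a transcription could be
held as a list of text segments, each carrying its own page reference, so
that a segment moved to its cotextual position still records where it
physically stands. A minimal sketch in Python:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    page: str            # page or folio where the text physically stands (hypothetical labels)
    physical_order: int  # position in the physical order of appearance


# An addition written on the facing verso page (12v) belongs inside the
# running text of the recto page (13r). Segments are stored in cotextual order.
cotextual = [
    Segment("The running text begins here,", "13r", 2),
    Segment("with an addition from the facing page,", "12v", 1),
    Segment("and then continues.", "13r", 3),
]


def pages_in_cotextual_order(segments):
    """List the page reference of every segment in reading order."""
    return [s.page for s in segments]


def physical_sequence(segments):
    """Recover the physical order of appearance from the stored references."""
    return sorted(segments, key=lambda s: s.physical_order)
```

Because every segment keeps its own reference, moving text for the sake
of cotextual consistency loses no information about the physical layout.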


3 Lines, vertical and horizontal space

On the printed page there is always at least a minimal linearity in the
physical order of appearance of the text.  Also in columned or tabular
print, and even in print with interlinear text, there are at least clear
and systematic rules for the deviations from strict linearity, e.g.
there is seldom doubt as to what is interlinear and what not.  This is
not so in all manuscripts, where interlinear and marginal insertions,
annotations, deletions, substitutions and the like both make it unclear
what the physically linear order is, and force a cotextual order of
representation to deviate from the physically linear order, or even to
repeat several times in the encoded text what occurs only once in the
copy text.

In such cases, it appears pointless to insist on the encoding of line
endings, at least unless commentary has established lines as points of
reference.

However, there are exceptions - even in such texts, there are certain
line endings which are of quite clear significance, e.g. in centered
text, in certain kinds of columns, etc. This calls for a distinction
between "hard" and "soft" line endings in manuscripts, or rather, for
the encoding of "hard" line endings only.

Indentation, if found to be of significance, should be indicated. In
modern manuscripts written in a casual manner there is hardly any point
in measuring it in millimeters or inches. A more suitable unit of
measure may be average character width, percentage of page width, or the
like, and the values should be rounded to suitable intervals. The
same goes for other horizontal blank spaces, except that blank space in
a line should be sharply distinguished from indentations, which always
imply a line break. In tabular and columned text additional indication
of horizontal blank space will normally not be advisable.
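
The conversion from a physical measurement to such a rounded unit can be
sketched as follows. The function names, the default interval sizes and
the sample measurements are assumptions made for the sake of the example;
the recommendation itself does not fix any of them.

```python
def indentation_units(indent_mm, avg_char_width_mm, step=2):
    """Express an indentation in average character widths,
    rounded to the nearest multiple of `step` (assumed interval)."""
    if avg_char_width_mm <= 0:
        raise ValueError("average character width must be positive")
    chars = indent_mm / avg_char_width_mm
    return int(round(chars / step)) * step


def indentation_percent(indent_mm, page_width_mm, step=5):
    """Express an indentation as a percentage of page width,
    rounded to the nearest multiple of `step` (assumed interval)."""
    if page_width_mm <= 0:
        raise ValueError("page width must be positive")
    percent = 100 * indent_mm / page_width_mm
    return int(round(percent / step)) * step
```

With these assumed intervals, an indentation of 23 mm in a hand averaging
3 mm per character is recorded as 8 character widths, or as 10 per cent
of a page 210 mm wide.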

Vertical space should be handled with more care, since vertical blank
space often indicates textual subdivisions which are unclear and
therefore not captured by other codes.


4 Sections and sentences

Manuscripts often do not adhere to the general scheme of division
(front matter, main body, and back matter), each of which is subdivided
into its specific proper parts, each of which is finally subdivided into
paragraphs at the lowest level of division. The borderlines between such
parts of content may be fuzzy, their order of appearance may be
switched, etc. Many (most?) texts do not contain front or back matter,
parts, chapters and paragraphs at all - just a series of pages filled
with text and occasional, more or less arbitrary, vertical spaces.

This poses a problem since there is always a need for a system of
reference. Pages or folios supply one such system, but it mostly does
not coincide with natural cotextual units, and in addition we have
already seen (cf. 2 above) that parts of text may have to be "moved"
from or within their proper places within this system.

The div1, div2, div3... tags of TEI P1 are probably well suited, but
each project must carefully design its own application criteria. The
transcription of Wittgenstein is a lucky case: Wittgenstein was very
careful about where he put blank lines, so we define a section as a part
of text between blank lines. In this case, a section turns out to be
anything between one sentence and several pages, but mostly less than
one and very rarely more than five pages.
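
The blank-line criterion is simple enough to state as a procedure. The
following Python sketch is an illustration of the criterion only, not the
Wittgenstein Archives' actual implementation, and the function name is
hypothetical:

```python
def split_sections(lines):
    """Group transcribed lines into sections, where a section is any run of
    non-blank lines bounded by blank lines (or by the start or end of the text)."""
    sections, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the section in progress; consecutive
            # blank lines do not create empty sections.
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections
```

Other projects would have to replace the test for a blank line with
whatever application criterion suits their material.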

The TEI concept of orthographic sentence is also applicable to
manuscripts. However, one should be aware that uncertainty as to whether a
piece of text belongs to a sentence or not, or to which sentence, is
much more frequent in manuscripts than in other sources, because of the
number of insertions, deletions, false starts and the like.

Special tags are recommended for incomplete sentences and elements
within sentences which do not form well-formed parts of the sentences
within which they occur.


5 Readability

Unreadable text is not different in manuscripts from what it is in other
types of text, but since it is obviously more frequent and applies to
larger parts of text, it would be convenient to introduce tags to
indicate a number of unreadable words or lines.


6 Marks and lines in margins

This feature is probably not peculiar to manuscripts, but it is
mentioned here because it is very frequent and obviously significant in
certain manuscripts, and not mentioned in TEI P1.


7 Underlining etc.

Underlining is not a feature peculiar to manuscripts. But the fact that
underlinings may be canceled, added by another or a later hand, etc. is.
Cf. 14 below.  Shapes and style may also take other forms than in printed
text.


8 Deletions

I know of no good English word for the simple phenomenon I am thinking
of here - perhaps "overstruck text" is the right one. (The TEI del-tag
(TEI P1 p. 95) concerns editors' "deletions", so an additional tag is
required.)

This phenomenon probably exists only in manuscripts. It is sometimes
assumed that deleted text is of no interest and that it should simply be
left out. This assumption is strongly discouraged. Deleted text should
be transcribed along with non-deleted text. This poses some special
problems with deleted parts of words etc. - cf. 15 below.


9 Insertions

Again a feature which is not in itself unique to manuscripts - but its
frequency and forms are probably unique.

It is recommended that insertions be put in their proper cotextual (not
physical) positions, and that the following attributes be required:
position, decidability, marked/non-marked.

The first and last attributes are probably self-explanatory, while the
second requires some explanation: If the proper linear position of an
insertion is explicitly marked (by insertion marks, arrows, or the like)
in the copy text, the insertion is decided. If not, then if its proper
position may be inferred with certainty on the basis of cotextual
considerations, it is decidable. If neither, it is undecidable, and
should be placed in a linear position where it disturbs the rest of the
text as little as possible, yet close to its corresponding physical
location.
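
The three values of the decidability attribute follow from two yes/no
judgments made by the transcriber. A minimal sketch, in which the value
names and the function are hypothetical rather than part of any existing
tag set:

```python
from enum import Enum


class Decidability(Enum):
    DECIDED = "decided"
    DECIDABLE = "decidable"
    UNDECIDABLE = "undecidable"


def classify_insertion(position_marked, position_inferable):
    """Classify an insertion from two transcriber judgments:
    position_marked    - the copy text marks the position explicitly
                         (insertion marks, arrows, or the like);
    position_inferable - the position can be inferred with certainty
                         from cotextual considerations."""
    if position_marked:
        return Decidability.DECIDED
    if position_inferable:
        return Decidability.DECIDABLE
    return Decidability.UNDECIDABLE
```

Note that the explicit marking takes precedence: an insertion whose
position is both marked and inferable counts as decided.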


10 Substitutions, counterpositions, overwritings

These features are sometimes referred to collectively as alternative
readings, variant readings, or the like. Their structure is to a large
extent identical to the structure of variant readings dealt with in text
critical studies and/or collation of texts from different witnesses.

In manuscripts, however, there are variant readings within one and the
same witness, not only resulting from a second or even from a later
hand, but also from insertions, deletions, overwritings,
counterpositions, and combinations of such.

How best to represent these features syntactically within SGML is
problematic or at least complicated. Some methods have been
discussed in TEI P1 5.10.3-5. At the Wittgenstein Archives a variety of
different methods have been developed over the years. Some of these are
at least basically or partly identical to those discussed in TEI, some
seem to be widely different (but may also turn out to be basically
identical).

However, space and time forbid going into questions of syntactical
means of representation here, and this aspect of the problem is perhaps
outside the scope of the manuscript work group anyway. What IS important
at this stage is that variants or substitutions seem to take on forms in
manuscripts in which they do not exist in other types of text.

Irrespective of the syntactical means of representation, the encoding of
substitutions for manuscripts should provide for the indication of such
features as: Decidedness, priority, cotextual binding, hand, and status.
Features of individual elements in a substitution should be encoded
independently of their relation to the other elements.

Decidedness: A substitution is decided if there is a clear and
conclusive indication of preference for one element over all others. E.g.
two elements deleted, the third not deleted. A substitution is decidable
if such preference can be inferred on the basis of cotextual
considerations. E.g. two elements not grammatically well-formed parts of
the rest of the sentence, the third a grammatically well-formed part. A
substitution is undecidable if it is neither decided nor decidable. E.g.
none or all elements deleted, none or all elements well-formed parts of
the sentence.
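
The following Python sketch reduces the three cases to the examples just
given: exactly one element left undeleted counts as decided, and exactly
one grammatically well-formed element counts as decidable. This is a
deliberate simplification, since real indications of preference may take
other forms, and the names used are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Element:
    text: str
    deleted: bool = False      # struck out in the copy text
    well_formed: bool = True   # fits grammatically into the sentence


def decidedness(elements):
    """Return (status, preferred element or None) for one substitution."""
    surviving = [e for e in elements if not e.deleted]
    if len(surviving) == 1:
        # All alternatives but one are deleted: a conclusive indication.
        return "decided", surviving[0]
    fitting = [e for e in elements if e.well_formed]
    if len(fitting) == 1:
        # Preference can be inferred from the cotext.
        return "decidable", fitting[0]
    # None or all deleted, and none or several well-formed.
    return "undecidable", None
```

The well-formedness judgment is supplied by the transcriber; the code
only combines judgments that have already been made.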

Priority: Irrespective of the decidedness of a substitution, with any
number of elements higher than 2 the decidedness does not establish
exhaustively the order of preference. (E.g. that a
substitution-relationship between 3 elements is decided in favor of
element number 1, does not establish the order of priority between
number 2 and 3.)

Cotextual binding: An element in one substitution may be cotextually
bound to (presuppose or exclude) an element from some other
substitution.

Hand: A substitution may be the result of interference by a secunda
manus, or by the same hand at a later time, even without any particular
element being added by any other or later hand.

Status: Substitutions may be canceled, again by another or a later
hand, and even without any particular element being canceled.
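
Independently of any SGML syntax, the features listed above can be
pictured as a record attached to each substitution. The sketch below is
only one possible arrangement; the field names, the identifiers and the
sample readings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Alternative:
    ident: str
    text: str
    rank: Optional[int] = None   # priority: 1 = most preferred, None = unranked
    requires: Set[str] = field(default_factory=set)  # cotextual binding: presupposes
    excludes: Set[str] = field(default_factory=set)  # cotextual binding: rules out


@dataclass
class Substitution:
    alternatives: List[Alternative]
    decidedness: str = "undecidable"  # decided / decidable / undecidable
    hand: Optional[str] = None        # hand responsible for the substitution as a whole
    canceled: bool = False            # status of the substitution as a whole

    def by_priority(self):
        """Ranked alternatives in order of preference."""
        ranked = [a for a in self.alternatives if a.rank is not None]
        return sorted(ranked, key=lambda a: a.rank)


def consistent(chosen):
    """True if a selection of alternatives respects every cotextual binding."""
    ids = {a.ident for a in chosen}
    return all(a.requires <= ids and not (a.excludes & ids) for a in chosen)


# Two substitutions in one sentence: "he" presupposes "goes", "they" presupposes "go".
first = Substitution(
    [Alternative("a1", "he", rank=1, requires={"b1"}),
     Alternative("a2", "they", rank=2, requires={"b2"})],
    decidedness="decided", hand="H2")
second = Substitution([Alternative("b1", "goes"), Alternative("b2", "go")])
```

Hand and status are stored on the substitution rather than on its
elements, which reflects the point that a substitution may be supplied or
canceled without any single element being added or canceled.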


11 Abbreviations etc.

One does find kinds of abbreviations in manuscripts which do not occur
often in printed texts, but probably there are none in modern
manuscripts which will not also be represented in ancient or medieval
texts.


12 Emendation and normalization

This headline may seem inappropriate in a recommendation for diplomatic
transcription. However, there are certain emendations which are so
easily performed and do no harm to the overall goal of diplomatic
transcription, yet so strongly improve the possible usefulness of the
transcriptions also for other purposes (word search, lexical analysis),
that it would be wrong not to recommend them also for diplomatic
transcription.  Most details of this are common ground and do not
deserve special mention here (but certainly in a final version of TR9's
recommendations). The issue implies the acceptance and in some cases
even the requirement of transcriber's additions, substitutions,
deletions and counterpositions.


13 Punctuation

Punctuation is generally much more inconsistent in manuscripts than in
most other kinds of text. Since punctuation carries so much structural
information, which may otherwise be scarce in such texts, the
recommendation to encode the underlying feature instead of the
punctuation itself does not apply to manuscripts. Both kinds of marking
should be applied. However, a much more discriminative encoding scheme
than the one proposed so far must be introduced.

E.g., representing all points (full stops) with a full stop does not
suffice. Neither would there be any point (ha-ha) in substituting one
single entity reference, e.g. &fs; for all points. The function of a
point may be to indicate the end of a declarative sentence, an
abbreviation, an ordinal number, it may be a logical operator, and so
on. There should either be separate entity references for each of these
different functions, or one must find a more general way of solving the
problem by means of generic encoding. (The latter strategy has been
adopted by the Wittgenstein Archives.)
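
The first of these two options, one entity reference per function, can be
sketched as a simple lookup. The entity names below are invented for the
example and belong to no existing entity set; the function of each point
is assumed to have been determined by the transcriber beforehand.

```python
# Hypothetical entity references, one per function of the point.
POINT_ENTITIES = {
    "sentence_end": "&fsdecl;",   # end of a declarative sentence
    "abbreviation": "&fsabbr;",
    "ordinal": "&fsord;",
    "logical_operator": "&fslog;",
}


def encode(tokens):
    """Replace each point by the entity reference for its function.
    `tokens` is a list of (text, function) pairs; function is None
    for everything that is not a point."""
    out = []
    for text, function in tokens:
        if text == "." and function is not None:
            out.append(POINT_ENTITIES[function])
        else:
            out.append(text)
    return "".join(out)


tokens = [
    ("Cf", None), (".", "abbreviation"),
    (" p", None), (".", "abbreviation"),
    (" 95", None), (".", "sentence_end"),
]
```

Here `encode(tokens)` yields `Cf&fsabbr; p&fsabbr; 95&fsdecl;`, so that
the same written sign is kept apart according to what it does in the
text.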


14 Cancellations, second or later hand

The following attributes seem to be applicable to a large number of tags
and to be peculiar to manuscripts.

Cancellation: Features like underlining, indentation, deletions,
substitutions, counterpositions, and overwritings may all be canceled in
various different ways and by different hands. To make things worse,
cancellations may themselves be canceled, again in several different
ways and by different hands.

Second or later hand: Not only text elements, but also features such as
underlining, indentation, deletions, substitutions, counterpositions,
and overwritings may be supplied by a different hand, or by the same
hand at a later time. The same goes for cancellations, which may be
regarded as features of attributes, although it is unclear how this
should be represented.
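
One conceivable representation, offered purely as a basis for discussion,
is to let each cancellation carry its own hand and, optionally, a further
cancellation. The sketch assumes that a canceled cancellation leaves the
original feature in force; that assumption, like all the names used, is
not part of any existing scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cancellation:
    hand: str
    canceled_by: Optional["Cancellation"] = None

    def in_force(self):
        # A cancellation holds unless it has itself been validly canceled.
        return self.canceled_by is None or not self.canceled_by.in_force()


@dataclass
class Feature:
    kind: str   # e.g. underlining, deletion, substitution
    hand: str
    cancellation: Optional[Cancellation] = None

    def in_force(self):
        # ASSUMPTION: the feature stands if it was never canceled,
        # or if its cancellation was later canceled.
        return self.cancellation is None or not self.cancellation.in_force()


# An underlining by hand H1, canceled by H2, whose cancellation H1 then canceled.
restored = Feature("underlining", "H1", Cancellation("H2", Cancellation("H1")))
struck = Feature("underlining", "H1", Cancellation("H2"))
```

Under the stated assumption, `restored.in_force()` is true and
`struck.in_force()` is false, while the full history of hands remains
available in both cases.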


15 Word delimiters, encoding below word-level

My impression is that blanks are regarded quite generally as word
delimiters in TEI. Already at the outset, this raises the question
whether a special code for typographical blanks should be introduced; or
whether blanks should be regarded as insignificant formatting
characters, i.e. exclusively as parts of the formatting of the encoded
text itself, and a separate encoding of beginning and end of words
introduced.

Encoding below word-level, i.e. insertion of tags inside words, must be
recommended and permitted without restriction.  Encoding below
character-level is not permitted (if only because it is difficult to see
how this should be possible).

Since the transcription of manuscripts involves the transcription of
deleted, substituted, and inserted text and of text by different hands,
they will contain strings which are incomplete words, or strings which
contain parts which do not form proper parts of the word represented by
the string.  This poses obvious problems for retrieval, lexical
analysis, etc.

Special tags should be introduced for the marking of incomplete words
and strings which occur inside but do not form proper parts of words.
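
To show why such marking matters for retrieval, the sketch below
represents one written word as a sequence of parts, each marked as plain,
deleted or inserted, and derives the forms a search tool might index. The
status labels and function names are hypothetical and do not reproduce
any existing tag set.

```python
# The writer first wrote "recieve", then struck out "ie" and inserted "ei".
# Each part is (text, status); status None means plain text.
word = [("rec", None), ("ie", "deleted"), ("ei", "inserted"), ("ve", None)]


def final_form(parts):
    """The word as it stands after all changes: deleted parts are skipped."""
    return "".join(text for text, status in parts if status != "deleted")


def first_form(parts):
    """The word as first written: inserted parts are skipped."""
    return "".join(text for text, status in parts if status != "inserted")


def fragments(parts, status):
    """The strings inside the word that carry a given status."""
    return [text for text, s in parts if s == status]
```

A word search can then be run against `final_form(word)`, while the
deleted and inserted strings remain in the transcription without being
mistaken for words in their own right.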


16 Presentation vs. underlying feature

Some words have already been said on this under 13 above.

The general recommendation of TEI P1 is to encode the underlying
feature rather than its outer appearance.  Concerning diplomatic
transcription this must of course be quite generally discouraged.

Perhaps the most important reason why this must be discouraged is that
SGML does not allow tagging of attribute values, so that representing
parts of text as attribute values prohibits a proper representation of
them.

As touched upon earlier, the present writer is not at all convinced that
the distinction between underlying feature and outer appearance is clear
enough to be generally applicable.  But this issue is too broad to be
pursued here, and has nothing to do with manuscripts per se.

The End.