.* TEI Document No: EDW7c
.* Title: Anglo-Saxon Texts and the Text Encoding Initiative
.* Drafted:
.* Revised: 9 May 90 for Kalamazoo
.* 9 Jul 91 for Stony Brook (in great haste). Abandoned
.* Kalamazoo text entirely (despite haste) and
.* wrote new paper from ISAS outline. The
.* document number is thus no longer appropriate.
.* 10 Jul 91 print style and similar minor changes
.*
.im teigmlp1 ;.* Use GMLPAPER or GMLGUIDE (or -MLA)
.sr docfile = &sysfnam. ;.sr docversion = quiet
.* Document proper begins.
The Text Encoding Initiative is sponsored by the Association for
Computers and the Humanities, the Association for Computational
Linguistics, and the Association for Literary and Linguistic Computing.
First I would like to acknowledge the financial support of the
NEH, the EEC, and the Mellon Foundation for the work of the Text
Encoding Initiative, from which these remarks come, and the
intellectual contribution of Lou Burnard (whose examples I have
pilfered shamelessly) and the members of the TEI working committees,
who however should not be held responsible.
Let us begin with a painful but simple fact, which we have to
come to grips with if we want to use computers to work with texts.
Texts cannot be put into computers.
Neither can numbers. Computers
can contain and operate on patterns of electronic charges, but not
numbers, and not texts.
As a result, computers never process anyone's data. They process
representations of our data. We know of representations that they are
inevitably partial, inevitably interested, and inevitably reveal their
authors' conscious and unconscious judgments and biases. We know that
they obscure what they do not reveal, and that without them nothing can
be revealed at all. In designing representations of texts inside
computers, we should reveal what is relevant, and obscure only what we
think negligible.
As we work more intimately with computers, we want the electronic
texts we use to help us in our work, to make easy for us the kinds of
work we do with them. As we work more intimately with computers, we
will do the kinds of things that our electronic texts make easy for us
to do. Tools always shape the hand that wields them, technology always
shapes the minds that use it. Reason enough to think about what forms
electronic texts should take.
The representation of a text within a computer inevitably expresses
an opinion about what is important in that text. It is a theory of
that one text. The design of an adequate general-purpose
representation of texts in machine-readable form
embodies in turn a thesis about the kinds of things that
are or can be important in texts.
This afternoon I want to talk about the recommendations of the Text
Encoding Initiative for the encoding of texts, with particular
reference to Anglo-Saxon texts. I hope to persuade those of you not
already convinced that the theory of texts implicit in the TEI scheme
provides a suitable basis for expressing those facts about texts and
their history which are of particular concern to Anglo-Saxonists, so
that, when you are next involved in creating machine-readable texts,
whether for yourself or for larger cooperative research projects, you
will be willing to examine the TEI recommendations in greater detail
and to use them, so far as they are appropriate to your task.
At appropriate moments I will point out some important basic facts
about the TEI method of encoding and its theoretical presuppositions,
but I will be attempting here not a general theoretical discussion of
the TEI but a concrete introduction to some of the specific methods
recommended by the TEI for problems which concern Anglo-Saxonists.
Unfortunately, a scheme like that of the TEI cannot be presented fully
in twenty minutes, so we will be skating very quickly over several
topics which deserve much more detailed attention; I can only recommend
that any of you interested in further details contact me privately or
attempt to attend one of the public workshops on the TEI scheme which
we are able to hold from time to time.
First some background on the TEI as a project.
The Text Encoding Initiative is a cooperative undertaking of the
Association for Computers and the Humanities (ACH), the Association for
Computational Linguistics (ACL), and the Association for Literary and
Linguistic Computing (ALLC), to formulate and disseminate guidelines
for the encoding and interchange of machine-readable texts intended for
literary, linguistic, historical, or other textual research. The
chaotic diversity of encoding schemes now in use for such texts makes
it difficult to move texts from one software program to others, and
researchers who exchange texts with others lose valuable time
deciphering the texts and converting them into their local encoding
scheme. The primary goal of the Text Encoding Initiative is to provide
explicit guidelines which define a text format suitable for data
interchange and data analysis; the format should be hardware- and
software-independent, rigorous in its definition of textual objects,
easy to use, and compatible with existing standards.
The TEI grew out of a planning meeting held in 1987; in 1988, with
NEH funding, its work began in earnest; we expect to complete a stable
version of the TEI guidelines for text encoding and interchange in
summer 1992.
The concrete work of the TEI is carried out by several working
committees and work groups, whose work is coordinated by two editors.
Representatives from approximately twenty learned societies and
professional associations whose members are actively concerned with the
encoding of machine-readable literary and linguistic material are
serving on an Advisory Board, which ensures that a wide range of
expertise can be brought to bear on the problems.
The concrete result of the TEI is to be a markup scheme or markup
language: a set of conventions for making explicit, in machine-readable
form, those features of a text which are of interest for research.
A number of subsidiary problems may be identified, but we will
consider them to be mostly technical challenges which need not concern
us in detail here. One must, for example, be able to apply markup both
to spans of text with beginning and end, and to points in the text which
have location but no extent or scope. One must be able to distinguish
markup from text. And so on.
Many people have invented notations for text markup to use in
humanistic research; most of these have been concerned primarily with
solving their own particular problems and have understandably exploited
the peculiarities of their problems to achieve concise and convenient
notations. Since, however, the peculiarities of the texts studied in
one project, and of the work done there, are by definition not shared
by all texts and all projects, most existing notational schemes are ill
suited to general adoption for computational work in the humanities or
in any other field.
Unlike most earlier efforts in this field, the Text Encoding
Initiative has not attempted to invent a syntax or notation for text
markup from scratch. We have adopted a scheme called the Standard
Generalized Markup Language, or SGML, as the basis for TEI markup.
SGML is defined by an international standard (ISO 8879). SGML is not
(despite its name) a markup language. That is, it does
not define specifically what markup may or should go into a
machine-readable text. Instead, it is a meta-language, within which
one may specify markup languages which do address the concrete issues
we've mentioned already. The TEI's most concrete product will be an
SGML specification of a set of tags to use in marking up
machine-readable text. SGML is designed to give the designers of
markup languages great freedom in identifying the features of texts
which they wish to mark up, but to constrain them in various other ways
so as to ensure that all markup languages defined with SGML share
certain desirable properties. Some characteristics of the TEI
recommendations are mandated by SGML's constraints; others are not. If
you are not already familiar with SGML, you need hardly worry about the
distinction; for those who are, I will attempt to be consistent in
attributing given features of the scheme either to SGML or to the TEI
itself.
In SGML markup, a passage of text which one wishes to mark (let us
say, because it is a paragraph or because it is a quotation) is marked
at beginning and end with tags. A paragraph, for which the TEI uses
the tag name p, would be marked with angle-bracket, p,
close-angle-bracket (<p>) at its beginning, and angle-bracket, slash,
p, close-angle-bracket (</p>) at its end. A quotation is marked in
the same way with the tag name q; its start tag may also carry
attributes recording further information: type=inline indicates this
is a run-on, or inline, quotation, not a block or display quotation;
the notation lang='OE' indicates the language of the text quoted.
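A short fragment marked up in this style might look like the
following; the Old English half-line (Beowulf 11b) is supplied here
purely for illustration:

```sgml
<!-- illustrative fragment; tag names p and q as described above -->
<p>The Beowulf poet sums up Scyld Scefing in a single half-line:
<q type=inline lang='OE'>þæt wæs god cyning</q>.</p>
```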
For purposes of this paper, we have covered all the SGML we need to: tags are enclosed in angle brackets, and themselves enclose the passage of text they mark. Attributes of a text passage may also be marked on its start tag. The names of tags, the names of attributes, and the possible values of given attributes, must all be declared so that an SGML-aware program can check the markup to ensure it is correct. Those who wish for further guidance are encouraged to look at the appropriate chapter of the TEI guidelines, or at any of the several books on SGML now available.
Probably the first question most of us asked when we started to
think about using computers for our work was "How do I get it to
print (later: to display) eth and thorn and aesc and so on?" This
problem has probably consumed more hours by more computing humanists
than any other problem of similarly slight intellectual interest.
Since SGML is not intended to require any specific type of hardware,
it does not attack this problem by defining a new character set for
work with texts. The TEI, similarly hoping for device-independent
encoding, similarly contents itself with documenting existing
character sets rather than prescribing a new one.
When texts are being exchanged between machines which do not share
the same character set, however, even the use of a proper standard
character set on one machine or both does not guarantee the successful
transmission of the text. Characters like eth and thorn are
particularly susceptible to loss or misinterpretation, because they are
not included in the American Standard Code for Information Interchange
(ASCII) character set.
In order to allow eths and thorns to pass unscathed through
communications links which know them not, we must represent them using
other characters which can be transferred successfully between
machines. Ditto if we wish to represent them in a machine which does
not have them in its character set. One long-standing method assigns
some little-used character as a substitute for the eth or thorn: e.g.
the commercial-at sign for eth and the dollar sign for thorn. This is
simple and convenient for typing, but one rapidly runs out of
little-used characters and the result is hideous to anyone not inured
to it by long hours of exposure. A second common method is to use
multiple-character codes, often beginning with an escape
character (so-called), so that \t
can mean thorn and \d
can mean eth.
SGML's solution to this universal problem is to use multiple-byte
codes, but to delimit them on both left and right, instead of just on
the left.
This allows one to recognize the end of the
special code even when the originator of the file has unaccountably but
all too predictably neglected to provide a table of the character codes
used. Such special character codes are a special case of SGML
entities. (The entity mechanism is also used for storing frequently
repeated boilerplate text or for including separate files in a
document.)
SGML entities can be defined by the creator of a document, so
one is not limited to those defined in advance by the TEI or by
national or international standards bodies such as
ISO (the International Organization for Standardization).
ISO has, however, already defined a number of standard entity sets,
including one (Added Latin 1) which covers the characters needed for
Old English.
To represent a lower-case thorn using an SGML entity reference, one
writes an ampersand (marking what follows as an entity reference rather
than as literal text), the name of the character desired (in this case,
thorn --- if we wanted uppercase we would write THORN), and a
semicolon: &thorn;. Similarly for eth (&eth; and &ETH;) and aesc
(&aelig; and &AElig;). All SGML entity references take this form:
ampersand, name, semicolon. Using standard ISO names, every thorn,
eth, and aesc in the quotation given earlier would thus appear as an
entity reference.
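For instance, an illustrative half-line of Old English (Beowulf 11b,
not the quotation from the earlier example) written entirely in this
ASCII-safe notation:

```sgml
<!-- entity names from the ISO Added Latin 1 set -->
<q type=inline lang='OE'>&thorn;&aelig;t w&aelig;s god cyning</q>
```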
It is a rare text which has no internal formal subdivisions, and a rare system of subdivisions which does not entail textual units at different levels: book, chapter, verse; act, scene, speech; poem, canto, stanza, line. One of the main intellectual advantages of SGML markup over other schemes is its explicit support for the definition of such hierarchies, and its ability to deal with the complex forms of hierarchies which arise in practice.
The TEI provides a series of tags for blocks or sections of text.
Many schemes give such tags names like chapter, section, subsection,
and so on. Unfortunately, it is only within a fairly restricted subset
of modern prose that a series of units like this can be taken
seriously, and even there it's problematic. While many texts do follow
this sequence of units, some texts have their chapters within their
sections, and not vice versa. To avoid the confusion of having to call
one's textual units by the wrong names, the TEI gives them all the same
name: div (for division), and records what the source itself calls
each unit in an attribute.
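A sketch of how such neutral divisions nest. The nesting, not the tag
name, carries the hierarchy; the type attribute records what the
source calls each unit. The element names follow the numbered-division
style of the draft Guidelines, and the structure shown is illustrative:

```sgml
<div1 type='book' n='1'>
  <div2 type='chapter' n='1'>
    <div3 type='verse' n='1'>In principio ...</div3>
  </div2>
</div1>
```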
Many markup schemes make very heavy weather of annotation: one
mechanism for footnotes, a different one for end-notes, a third (if one
is lucky) for marginal notes, and very occasionally a method for
generating multiple levels of apparatus. The TEI, by the rigorous
application of Occam's Razor, simplifies this area substantially by
decreeing that all annotation may be tagged with the same tag, note;
the distinctions among kinds of annotation are carried by attributes.
For example:
Et nihil mihi deerit.<note>Ex quot enim me in terram suo ...</note>
Et nihil mihi deerit.<note>De necessarijs vitæ. Primum ...</note>
Quotation and cross-reference are not strictly speaking medieval
inventions, but medieval texts make constant and elaborate use of
both. As with annotation, cross-references are handled by the TEI
with a single tag, in this case ref, whose target attribute names the
passage referred to.
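A sketch of a cross-reference encoded this way; the target value is an
illustrative identifier assumed to be declared on the passage referred
to:

```sgml
<!-- 'ps23' is an invented identifier for this sketch -->
On this point see <ref target='ps23'>the discussion of Psalm 23</ref>.
```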
For historical research, as for onomastics, one will wish to be able
to extract the personal and place names --- all the proper nouns, in
fact --- from a text. Since this cannot be done reliably by mechanical
means, it may well be worth while, in a text one will work with
repeatedly, to mark the names explicitly, so that they can be
identified mechanically later. The TEI provides a single tag for this
purpose: name, with a type attribute to distinguish personal names,
place names, and the rest.
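A sketch of name markup, with the type attribute separating persons
from places (the sentence itself is illustrative):

```sgml
<name type='person'>Hro&eth;gar</name> built his hall at
<name type='place'>Heorot</name>.
```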
Textual variation poses one of the trickiest questions for any markup scheme, and one of the most important for students of older materials. A number of different methods for encoding variants exist, devised for the most part in connection with collation or other text-critical software. Their main drawback is that by and large they allow for the markup of variants only in preparation for producing a critical text or printing a text with apparatus. But we wish to be able to do, with all the manuscript versions, the same things we do with a single base text: make concordances, do stylistic and linguistic studies, examine the text in a hypertext system, etc. So we wish to bring the markup for textual variants into the same scheme as other markup, so we need not isolate the text-critical work from other work.
The current draft of the TEI guidelines defines several methods of marking textual variants. In some methods, the variants are embedded in the base text at the beginning or the end of the passage they vary; in others, they are stored separately in the document, or even in a separate document (thus enabling the compilation of textual apparatus for texts on CD-ROM, where one cannot insert the variants into the base text). In one method, there is no privileged base text at all; merely a sequence of segments with variants and segments without variants, from which it is trivial to generate the text for any witness.
The current intention is to reduce the variety of possible methods to a more manageable set in the next version of the guidelines, but at the moment we still do not know which notations really prove most convenient for practical work and which can safely be omitted.
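A rough sketch of one embedded notation for variants, the one without
a privileged base text, in which the readings of each witness are
grouped at the point of variation; the witness sigla and the readings
are invented for illustration:

```sgml
<!-- witnesses A and B, and the variant readings, are invented -->
&thorn;&aelig;t w&aelig;s
<app>
  <rdg wit='A'>god</rdg>
  <rdg wit='B'>grim</rdg>
</app>
cyning
```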
Any markup scheme to be used with manuscript materials must be able, for
some applications, to encode fairly detailed physical descriptions of
the manuscript source(s) of the text. At the very least, it must be
possible to record page breaks, folio numbers, and the like; for these
purposes the TEI provides page break, column break, and line break tags,
as well as a general-purpose milestone tag for boundaries of other
kinds.
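Because these tags mark points in the text rather than spans, they are
empty elements; the folio numbering shown here is illustrative:

```sgml
<!-- folio number on the n attribute; values illustrative -->
<pb n='129r'>
<lb>Hw&aelig;t we Gardena in geardagum
<lb>&thorn;eodcyninga &thorn;rym gefrunon
```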
Detailed codicological, paleographic, or similar information really ought to be markable, but we have not yet been able to elicit from the codicologists and paleographers any detailed description of what information they need to be able to mark. This is a topic for ongoing research.
Texts are abstract cultural and linguistic objects. They are preserved by physical media, hence the necessity of recording, as just noted, important details of the medium. But the texts themselves are linguistic objects, and it is not surprising that they form the focus of intensive linguistic and philological study. Since, as with the proper nouns we discussed earlier, it is not practicable to perform linguistic analyses purely mechanically on the fly, it is useful to be able to perform them (mechanically, if we want, or manually, or in some combination) and record the results for later retrieval.
Linguistic annotation, however, can easily swamp any text by its sheer volume. It is important, for this reason, to allow the analysis to be stored separately, and not necessarily in the source text itself, whose words may otherwise be lost behind the grammatical codes. The grammatical interpretation of a crux or ambiguity, moreover, frequently remains a matter of dispute; it is useful to be able to record multiple analyses of the same material, whether they come from one scholar or several. Fortunately, these two requirements are easily combined: if we store linguistic analyses separately from the main text (as for some methods of storing variants) and point into the main text from the linguistic annotation, it is simple to handle multiple analyses: they need only point at the same material.
The TEI's responsible committee on linguistic analysis and
interpretation has developed an extremely powerful and elegant method
for expressing linguistic analyses and linking them with the surface
text. The method, based on the feature-structure notation developed in
modern phonology but also used in some schools of syntactic analysis,
makes no presuppositions about the substantive linguistic theory to be
applied. No restrictions are made as to what parts of speech may be
assigned; it is not even assumed that "part of speech" is a category
one will wish to describe. (There is a "starter set" of linguistic
features, but no obligation to use it.) In fact,
the structure is so free of confining linguistic content that one might
well envision using it for neo-grammarian analyses and modern
transformational analyses side by side.
As an example of the linguistic markup as currently defined, consider
the much-discussed crux at line 63 of The Seafarer. The common
interpretation emends the manuscript reading waelweg to hwaelweg,
"whale-road", i.e. "sea", but some (e.g. Smithers 1957, "The Meaning
of The Seafarer and The Wanderer") retain waelweg and gloss it as
"road of the slain", which would place the poem in an eschatological
tradition and give it rather a different cast. By taking seriously the
goal of allowing the expression, in markup, of these literary
interpretations, one sees at once that linguistic tagging is a
necessary prerequisite of any literary tagset. This may be more
obvious to medievalists than to modernists, because medievalists are
less likely to fool themselves into thinking the linguistic
understanding of the text is non-problematic.
The two competing interpretations might be tagged in TEI markup as in the example below; for details of the interpretation of these tags, see the full TEI Guidelines.
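A sketch of such markup, recording the two readings as parallel
feature structures; the identifiers and feature names are invented for
illustration, and the real inventory of elements is defined in the
Guidelines:

```sgml
<!-- identifiers and feature names invented for this sketch -->
<fs id='seaf63a' type='interpretation'>
  <f name='lemma'><str>hw&aelig;lweg</str></f>
  <f name='gloss'><str>whale-road, i.e. sea</str></f>
</fs>
<fs id='seaf63b' type='interpretation'>
  <f name='lemma'><str>w&aelig;lweg</str></f>
  <f name='gloss'><str>road of the slain</str></f>
</fs>
```

Both structures would point at the same word in the surface text, as
described above for multiple analyses.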
Any notation is designed to make it easier to think about certain types of objects, by making it easier to name them. We invent new notations in order to exploit the Sapir-Whorf phenomenon for our own purposes. This means, of course, that an ill-chosen or ill-designed notation can become an intellectual burden to us, making it harder to do the work we want to do. Like any other notation, the markup language devised by the TEI has the potential to affect what its users do, by changing what they find convenient.
If the TEI scheme is understood as an endorsement of certain ways of looking at texts and a condemnation of others --- well, in a way, that is no more than can be said of any notation and any markup scheme. It is what I said to you at the beginning of my remarks: the markup of a text reflects a theory of that text; a markup language reflects a theory of texts in general.
If the TEI scheme is taken too seriously and applied
indiscriminately, it will, like any theory or notation applied
inappropriately, have serious consequences for the quality of
scholarship.
The scheme is, however, extensible: a text can remain TEI-conformant (if one attaches any value to that label) while using a set of tags quite different from those pre-defined by the TEI, so that local needs and local theories can still find expression.
Because SGML has an explicitly defined formal syntax, and the TEI provides explicit formal definitions of the tags to be used and the ways they can interact with each other, it is possible to check, formally and automatically, whether a document conforms to the formal syntax of SGML and the TEI or not. Such checking, performed on documents created by hand, commonly throws up a number of typographic errors and mistakes in the coding --- errors one might or might not have caught with more vigilant proof-reading. By restricting the form of document markup, we have made whole classes of errors (like many typos in the names of tags) automatically detectable.
No program can check to see that all tags are used correctly in the sense of being applied only where appropriate in the copy text. But the more errors we can detect mechanically, the easier will be the proofreading, and the more likely we are to notice remaining errors. This has the pleasant effect that we can have greater faith in the quality and regularity of our texts and their markup.
Such regularity and predictability in our texts means our software may be simpler, because it is less likely to have to trap errors in the markup. It means we can spend more time on the intellectual quality of the work we are doing, and less on the mechanics. And thus it means we can make more effective use of our always limited computational resources.