Remarks on the TEI and Electronic Archives
Modern Language Association Convention
TEI ED W33
C. M. Sperberg-McQueen
29 December 1992

[This running text was constructed after the fact from the author's notes. It attempts to be faithful to the presentation as given, but some transitions may differ from those of the oral presentation, and the list at the end includes some items omitted under time pressure. Editorial notes providing context which are not part of the text itself have been added within square brackets. -CMSMcQ]

The organizers have asked me to give a progress report on the Text Encoding Initiative (TEI); first of all I will describe, for those of you who do not already know, and review for the rest of you, the broad outlines of the TEI's goal and organization. Then I want to speak in a little more detail about what it is we are trying to do and how we are going about it. Third, I'll consider the TEI in the light of the overall topic for the panel: electronic archives.

The Text Encoding Initiative is an international project to develop and disseminate guidelines for the encoding and interchange of machine-readable, that is electronic, texts for use in research. Note two salient points in that description: the TEI is a project of the research community, and the needs of research and researchers are paramount in our work --- not those of software developers, hardware vendors, paper or electronic publishers, or airframe manufacturers. (This is not to claim that the interests of all these groups diverge profoundly, but only to explain that when they do diverge superficially, the TEI pays more attention to research than to other possible applications of the markup language we are developing.)
The TEI is sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing. As described in your handout [Document TEI ED J17], we have been funded by the National Endowment for the Humanities, the Andrew W. Mellon Foundation, Directorate General XIII of the Commission of the European Communities, and the Social Science and Humanities Research Council of Canada, to all of whom thanks. Further substantial funding comes in the form of invisible subsidies provided by the University of Illinois at Chicago, the Oxford University Computing Service, and the other institutions where participants are based. The most important subsidy of all, however, is the donation of time and effort by members of the research community who volunteer to serve on the working committees and work groups of the TEI --- a group which includes, I am pleased to say, all of my fellow panelists [viz. Susan Hockey, Ian Lancashire, and Elaine Brennan].

The organization of the TEI is described briefly in the handout; if you have questions I'll be glad to answer them afterwards.

The TEI began with a planning conference in November 1987, full five years ago, and published its first draft proposals, with the document number TEI P1, in the summer of 1990. We are now in the throes of producing the second draft of the TEI Guidelines, with the document number TEI P2 (for "proposal number 2"). Some chapters have been released, including those on the transcription of spoken material, characters and character representation, the TEI Header, core elements available in all TEI document types, prose texts, terminological data, and the formal grammar of the TEI subset of SGML, the Standard Generalized Markup Language.
We hope to release the remaining chapters in the course of this coming spring, and then after a very brief period of revision, produce the third version of the Guidelines, TEI P3, and submit it to the Advisory Board for endorsement around the middle of the year.

So much for the externals of the TEI. What are we trying to do, and how?

Machine-readable texts have been used for humanities research for a little over forty years --- slightly longer than computers have been commercially available. Throughout that time, computer-assisted projects of text analysis have taken roughly the same form:

* the text to be analysed is first recorded in electronic form
* then the analysis itself is performed.

People have been observing in print for at least thirty years that when multiple projects are interested in the same text, there should be no need to repeat the first step (the encoding in electronic form) for each project: once the machine-readable text is created, it can be used for many different analyses without further encoding work. And so people have been urging for at least thirty years that TEXTS BE SHARED. We have been only moderately successful, however, in this program of text sharing; why?

In the first place, some people don't want to share their texts. If I went to all the pain and trouble of creating this electronic text, and am about to do my analysis on it, why should I give it to you? You'll run off, perform your own analysis on it, publish first maybe, and get all the glory. But I need the glory myself, because I'm coming up for tenure, promotion, a raise. This is an understandable departure from the norm in a profession noted in general for its altruism [laughter], but notice one important feature of this line of argument: the implicit claim that relative to the analysis, the task of creating the electronic text is large and onerous.
In other words, we really would be saving time and trouble overall, if we could find a way to make it easier and more common to share our texts.

A second reason for our failure to achieve widespread text sharing is that when we do use each other's texts, we discover that we don't always understand them, because the methods used to encode the texts are so often idiosyncratic. This results in part from the newness of the medium. Faced with the task of representing a text in electronic form, without established conventions for the result, scholars find themselves in an Edenic position. Like Adam and Eve, we get to give something a name, and have the name we give be the name of that thing. If we say that an asterisk marks an italic word, and a percent sign precedes and follows a personal name, and an at sign marks a place name, then that is what those things mean. The blankness of the slate gives us a kind of euphoric power, and that power is understandably slightly intoxicating.

The result is that over the last forty years virtually every scholar who has created an electronic text has used the opportunity to invent a new language for encoding the text. Electronic texts thus are, and have always been, in the position of humankind after the Tower of Babel. And the result has been pretty much what the Yahweh of Genesis had in mind: our cooperation has been hindered and delayed by the needless misunderstandings and the pointless work of translating among different systems of signs, makework that would be unnecessary if we had an accepted common language for use in electronic texts.

Now, we have two distinct difficulties in using each other's texts. When I get a text from you, first of all, I may not understand what all the special marks in it mean. If you have invented your own language, your own system of signs, that is, I may find that your text contains signifiers which are opaque to me because I don't know their significance.
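[Editorial note: the contrast between a private notation and a documented common one can be made concrete. The home-grown convention below is the one imagined in the text (asterisks for italics, percent signs for personal names, at signs for place names); the SGML rendering is an illustrative sketch, with element names chosen in the spirit of the TEI drafts rather than quoted from them:

  <!-- One encoder's private convention: -->
  The %Venerable Bede% wrote at @Jarrow@; his *Historia* survives.

  <!-- The same content in explicit, documented SGML markup
       (element names illustrative only): -->
  <p>The <name type="person">Venerable Bede</name> wrote at
  <name type="place">Jarrow</name>; his
  <hi rend="italic">Historia</hi> survives.</p>

The asterisks and percent signs mean nothing to a reader who lacks the encoder's private key; the tagged version carries its signifieds on its face, and its grammar can be validated mechanically against a document type definition. -ed.]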
The second difficulty is that once I do understand your signs, I may find that the signifieds of your text don't tell me what I want to know. It's good that I now understand that the at-sign means a place name, but if I'm not interested in place names, but rather in the use of the dative case (which you have not marked in the text), then your text may not be as much use to me as I may have hoped before I knew what it all meant.

The TEI Guidelines provide tools to address both of these difficulties, but the second is soluble only within very restricted bounds. Without violating the autonomy of the individual researcher, it is impossible to tell each other that we all have to be interested in the dative case, instead of in place names. Within limits, however, a tenuous consensus can be formed regarding some minimum set of textual features which everyone, or almost everyone, regards as being of at least potential interest. No one should hope for too much from this consensus, however; the simple political fact is that very few features seem useful to absolutely everyone. Thus, I would not recommend to anyone that they should encode a text recording only the features that the universal consensus regards as useful. Almost no one would be happy with such a text: everyone regards other features as desirable, though we can reach no agreement as to what those other features are.

The first difficulty, that of understanding what it is the encoder is saying about a text, can be solved much more satisfactorily. The TEI will provide a large, thoroughly defined set of signs ('tags' is the technical term) for use in marking up texts, and the current draft of the Guidelines will suffice for virtually all the signifieds which workers with electronic text now record in their texts. By using this set of documented signs, we cannot guarantee that we will find the encoding work of others useful or interesting, but we can at least see to it that we understand what they are saying.
Because such a vocabulary of tags must necessarily be rather large, almost no one will be interested in using every item in it. The first task of the encoder who uses TEI markup will therefore be to make a selection among the signs defined in the scheme, and to begin making local policy decisions as to how those signs are to be used. The TEI provides, in the TEI header, a place to record those policy decisions, so that later users of the text can know what was done when the text was created.

By providing a common public vocabulary for text markup, we will have taken one major step toward making electronic archives as important and useful as I think they ought to be, but only one step. What other steps are required?

* First of all, we must as a community make a serious commitment to allowing reuse of our electronic texts. This will require either a massive upsurge in the incidence of altruism [chuckles] or much stronger conventions for the citation of electronic texts, and giving credit for the creation of electronic materials, both in bibliographic practice and at promotion, tenure, and salary time.

* Second, we must cultivate a strict distinction between the format of our data and the software with which we manipulate it, because software is short-lived, but our texts are, or should be, long-lived. Our paper archives are full of documents fifteen or twenty years old, or 150 to 200 years old, or even 1500 or 2000 years old. But I cannot think of a single piece of software I can run which was written even 100 years ago. To allow our texts to survive, we must separate them firmly from the evanescent software we use to work on them. SGML and other standards encourage such a distinction, but proprietary products typically obscure it: in some operating systems, every document is tied, at the operating system level, to a single application --- precisely the wrong approach, from this point of view.
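[Editorial note: a TEI header of the kind described above might be sketched as follows. This is a simplified, hypothetical fragment; the element names follow the published drafts in spirit but are abbreviated here, and the content is invented for illustration:

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample text: an electronic transcription</title>
      </titleStmt>
      <sourceDesc>
        <p>Details of the printed source would be recorded here.</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <editorialDecl>
        <p>Personal and place names are tagged throughout;
           grammatical case is not marked.</p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>

The encoding description is the crucial part for text sharing: it is where the local policy decisions --- which signs were selected, and how they were applied --- are recorded for later users of the text. -ed.]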
* Third, we need to cultivate software, in order to make the texts contained in our archives more useful in our work.

* Next, we need to achieve some de facto agreement on a manageably small set of data formats for use in our archives. As an editor of the TEI Guidelines, I would of course prefer to see the TEI scheme among that set of de facto standard formats, but whether that comes to pass or not, electronic archives will not be able to function successfully without restricting themselves to a manageable set of formats.

* Finally, we need if possible to come to a richer consensus about the ways in which we encode texts: we should try to move beyond an agreement on syntax and achieve more unity on the specific features of text which are widely useful. Such a consensus will make the TEI less of a merely syntactic convention and more of a real common language.

The TEI's contribution to the success of electronic archives will, I hope, be that it provides us with a common language, to allow us to escape our post-Babel confusion. As the list just concluded makes clear, such a common language is not all we need. But as the Yahweh of Genesis says: If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. [Gen. 11:6, NIV]