Remarks on the TEI and Electronic Archives
Modern Language Association Convention
TEI ED W33
C. M. Sperberg-McQueen
29 December 1992

[This running text was constructed after the fact from the author's notes. It attempts to be faithful to the presentation as given, but some transitions may differ from those of the oral presentation, and the list at the end includes some items omitted under time pressure. Editorial notes providing context which are not part of the text itself have been added within square brackets. -CMSMcQ]

The organizers have asked me to give a progress report on the Text Encoding Initiative (TEI); first of all I will describe, for those of you who do not already know, and review for the rest of you, the broad outlines of the TEI's goal and organization. Then I want to speak in a little more detail about what it is we are trying to do and how we are going about it. Third, I'll consider the TEI in the light of the overall topic for the panel: electronic archives.

The Text Encoding Initiative is an international project to develop and disseminate guidelines for the encoding and interchange of machine-readable, that is electronic, texts for use in research. Note two salient points in that description: the TEI is a project of the research community, and the needs of research and researchers are paramount in our work --- not those of software developers, hardware vendors, paper or electronic publishers, or airframe manufacturers. (This is not to claim that the interests of all these groups diverge profoundly, but only to explain that when they do diverge superficially, the TEI pays more attention to research than to other possible applications of the markup language we are developing.)
The TEI is sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing. As described in your handout [Document TEI ED J17], we have been funded by the National Endowment for the Humanities, the Andrew W. Mellon Foundation, Directorate General XIII of the Commission of the European Communities, and the Social Science and Humanities Research Council of Canada, to all of whom thanks. Further substantial funding comes in the form of invisible subsidies provided by the University of Illinois at Chicago, the Oxford University Computing Service, and the other institutions where participants are based. The most important subsidy of all, however, is the donation of time and effort by members of the research community who volunteer to serve on the working committees and work groups of the TEI --- a group which includes, I am pleased to say, all of my fellow panelists [viz. Susan Hockey, Ian Lancashire, and Elaine Brennan].

The organization of the TEI is described briefly in the handout; if you have questions I'll be glad to answer them afterwards.

The TEI began with a planning conference in November 1987, full five years ago, and published its first draft proposals, with the document number TEI P1, in the summer of 1990. We are now in the throes of producing the second draft of the TEI Guidelines, with the document number TEI P2 (for "proposal number 2"). Some chapters have been released, including those on the transcription of spoken material, characters and character representation, the TEI Header, core elements available in all TEI document types, prose texts, terminological data, and the formal grammar of the TEI subset of SGML, the Standard Generalized Markup Language.
We hope to release the remaining chapters in the course of this coming spring, and then after a very brief period of revision, produce the third version of the Guidelines, TEI P3, and submit it to the Advisory Board for endorsement around the middle of the year.

So much for the externals of the TEI. What are we trying to do, and how?

Machine-readable texts have been used for humanities research for a little over forty years --- slightly longer than computers have been commercially available. Throughout that time, computer-assisted projects of text analysis have taken roughly the same form:

* the text to be analysed is first recorded in electronic form
* then the analysis itself is performed.

People have been observing in print for at least thirty years that when multiple projects are interested in the same text, there should be no need to repeat the first step (the encoding in electronic form) for each project: once the machine-readable text is created, it can be used for many different analyses without further encoding work. And so people have been urging for at least thirty years that TEXTS BE SHARED. We have been only moderately successful, however, in this program of text sharing; why?

In the first place, some people don't want to share their texts. If I went to all the pain and trouble of creating this electronic text, and am about to do my analysis on it, why should I give it to you? You'll run off, perform your own analysis on it, publish first maybe, and get all the glory. But I need the glory myself, because I'm coming up for tenure, promotion, a raise. This is an understandable departure from the norm in a profession noted in general for its altruism [laughter], but notice one important feature of this line of argument: the implicit claim that relative to the analysis, the task of creating the electronic text is large and onerous.
In other words, we really would be saving time and trouble overall, if we could find a way to make it easier and more common to share our texts.

A second reason for our failure to achieve widespread text sharing is that when we do use each other's texts, we discover that we don't always understand them, because the methods used to encode the texts are so often idiosyncratic. This results in part from the newness of the medium. Faced with the task of representing a text in electronic form, without established conventions for the result, scholars find themselves in an Edenic position. Like Adam and Eve, we get to give something a name, and have the name we give be the name of that thing. If we say that an asterisk marks an italic word, and a percent sign precedes and follows a personal name, and an at sign marks a place name, then that is what those things mean. The blankness of the slate gives us a kind of euphoric power, and that power is understandably slightly intoxicating.

The result is that over the last forty years virtually every scholar who has created an electronic text has used the opportunity to invent a new language for encoding the text. Electronic texts thus are, and have always been, in the position of humankind after the Tower of Babel. And the result has been pretty much what the Yahweh of Genesis had in mind: our cooperation has been hindered and delayed by the needless misunderstandings and the pointless work of translating among different systems of signs, makework that would be unnecessary if we had an accepted common language for use in electronic texts.

Now, we have two distinct difficulties in using each other's texts. When I get a text from you, first of all, I may not understand what all the special marks in it mean. If you have invented your own language, your own system of signs, that is, I may find that your text contains signifiers which are opaque to me because I don't know their significance.
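[Editorial note: the contrast between a private notation and a documented common one can be made concrete. The home-grown convention below is the one imagined in the text (asterisks for italics, percent signs for personal names, at signs for place names); the SGML rendering is an illustrative sketch, with element names chosen in the spirit of the TEI drafts rather than quoted from them:

  <!-- One encoder's private convention: -->
  The %Venerable Bede% wrote at @Jarrow@; his *Historia* survives.

  <!-- The same content in explicit, documented SGML markup
       (element names illustrative only): -->
  <p>The <name type="person">Venerable Bede</name> wrote at
  <name type="place">Jarrow</name>; his
  <hi rend="italic">Historia</hi> survives.</p>

The asterisks and percent signs mean nothing to a reader who lacks the encoder's private key; the tagged version carries its signifieds on its face, and its grammar can be validated mechanically against a document type definition. -ed.]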
The second difficulty is that once I do understand your signs, I may find that the signifieds of your text don't tell me what I want to know. It's good that I now understand that the at-sign means a place name, but if I'm not interested in place names, but rather in the use of the dative case (which you have not marked in the text), then your text may not be as much use to me as I may have hoped before I knew what it all meant.

The TEI Guidelines provide tools to address both of these difficulties, but the second is soluble only within very restricted bounds. Without violating the autonomy of the individual researcher, it is impossible to tell each other that we all have to be interested in the dative case, instead of in place names. Within limits, however, a tenuous consensus can be formed regarding some minimum set of textual features which everyone, or almost everyone, regards as being of at least potential interest. No one should hope for too much from this consensus, however; the simple political fact is that very few features seem useful to absolutely everyone. Thus, I would not recommend to anyone that they should encode a text recording only the features that the universal consensus regards as useful. Almost no one would be happy with such a text: everyone regards other features as desirable, though we can reach no agreement as to what those other features are.

The first difficulty, that of understanding what it is the encoder is saying about a text, can be solved much more satisfactorily. The TEI will provide a large, thoroughly defined set of signs ('tags' is the technical term) for use in marking up texts, and the current draft of the Guidelines will suffice for virtually all the signifieds which workers with electronic text now record in their texts. By using this set of documented signs, we cannot guarantee that we will find the encoding work of others useful or interesting, but we can at least see to it that we understand what they are saying.
Because such a vocabulary of tags must necessarily be rather large, almost no one will be interested in using every item in it. The first task of the encoder who uses TEI markup will therefore be to make a selection among the signs defined in the scheme, and to begin making local policy decisions as to how those signs are to be used. The TEI provides, in the TEI header, a place to record those policy decisions, so that later users of the text can know what was done when the text was created.

By providing a common public vocabulary for text markup, we will have taken one major step toward making electronic archives as important and useful as I think they ought to be, but only one step. What other steps are required?

* First of all, we must as a community make a serious commitment to allowing reuse of our electronic texts. This will require either a massive upsurge in the incidence of altruism [chuckles] or much stronger conventions for the citation of electronic texts, and giving credit for the creation of electronic materials, both in bibliographic practice and at promotion, tenure, and salary time.

* Second, we must cultivate a strict distinction between the format of our data and the software with which we manipulate it, because software is short-lived, but our texts are, or should be, long-lived. Our paper archives are full of documents fifteen or twenty years old, or 150 to 200 years old, or even 1500 or 2000 years old. But I cannot think of a single piece of software I can run which was written even 100 years ago. To allow our texts to survive, we must separate them firmly from the evanescent software we use to work on them. SGML and other standards encourage such a distinction, but proprietary products typically obscure it: in some operating systems, every document is tied, at the operating system level, to a single application --- precisely the wrong approach, from this point of view.
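[Editorial note: a TEI header of the kind described above might be sketched as follows. This is a simplified, hypothetical fragment; the element names follow the published drafts in spirit but are abbreviated here, and the content is invented for illustration:

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample text: an electronic transcription</title>
      </titleStmt>
      <sourceDesc>
        <p>Details of the printed source would be recorded here.</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <editorialDecl>
        <p>Personal and place names are tagged throughout;
           grammatical case is not marked.</p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>

The encoding description is the crucial part for text sharing: it is where the local policy decisions --- which signs were selected, and how they were applied --- are recorded for later users of the text. -ed.]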
* Third, we need to cultivate software, in order to make the texts contained in our archives more useful in our work.

* Next, we need to achieve some de facto agreement on a manageably small set of data formats for use in our archives. As an editor of the TEI Guidelines, I would of course prefer to see the TEI scheme among that set of de facto standard formats, but whether that comes to pass or not, electronic archives will not be able to function successfully without restricting themselves to a manageable set of formats.

* Finally, we need if possible to come to a richer consensus about the ways in which we encode texts: we should try to move beyond an agreement on syntax and achieve more unity on the specific features of text which are widely useful. Such a consensus will make the TEI less of a merely syntactic convention and more of a real common language.

The TEI's contribution to the success of electronic archives will, I hope, be that it provides us with a common language, to allow us to escape our post-Babel confusion. As the list just concluded makes clear, such a common language is not all we need. But as the Yahweh of Genesis says: If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. [Gen. 11:6, NIV]