The ACH/ACL/ALLC Text Encoding Initiative: An Overview

Susan Hockey

Document No: TEI J 16
Draft March 21, 1996 (20:44:47)

The Text Encoding Initiative (TEI) is a major international project, sponsored jointly by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). Its task is to develop and disseminate guidelines for the interchange of machine-readable texts among researchers, and to make recommendations for the encoding of new texts.

Funding of approximately $1,000,000 has been provided, over the six years of the TEI's development work, by the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Union (as it is now called), and the Andrew W. Mellon Foundation. The TEI has also received substantial indirect support from the host institutions of participants in the project.

1 BACKGROUND

1.1 The Need for a Common Text-Encoding Scheme

Textual data has been analysed and otherwise manipulated by computer for over thirty years, but until now there has not been any common encoding format for scholarly machine-readable texts. Scholars have developed many different encoding schemes for representing non-standard characters, footnotes, marginalia, text-critical apparatus, for marking the logical divisions of the text (e.g. book, chapter, verse) or the analytic or interpretive information relevant to the text (e.g. syntactic, morphological, or semantic analysis) as well as for documenting the source of an encoded text and the nature of the recording.

None of the existing encoding schemes has been able to gain acceptance as a standard.
Most reflect the research interests of their originators and are applicable in only one subject area. Some were created to serve the needs of large-scale projects such as the Thesaurus Linguae Graecae at Irvine or the Responsa Project at Bar-Ilan (Israel), or as specifications for input to text-analysis software such as OCP or WatCon. Others were devised by single individuals for their own projects. None is sufficiently flexible or generalizable to apply to the encoding of textual materials across the full spectrum of applications and research interests. The practice followed in encoding existing texts is often poorly documented, and much time has been wasted writing software to translate from one format to another.

A common text encoding scheme developed for the needs of scholarly research would eliminate or minimize many of these problems and would also provide a text format to be used as a starting point by developers of new software.

1.2 Planning meeting

The TEI arose out of a planning conference convened by ACH at Vassar College, Poughkeepsie, New York on 12-13 November 1987. Prior to that there had been a number of meetings at which the problem was noted but no firm commitment given to finding any solution. At the Vassar conference thirty-one experts from universities, learned societies, and text archives from North America, Europe, Israel, and Japan discussed the desirability, feasibility, and basic principles of a common set of guidelines for machine-readable text encoding. The firm consensus at this meeting was that such a common framework was necessary and feasible, and the group agreed on several basic principles to govern the scope and the organization of a set of guidelines for encoding textual materials. These principles are set out in Appendix A and have become known as the "Poughkeepsie Principles".

Three key factors contributed to this consensus:

* More is known now about the problems of text encoding than at the time of previous attempts.
* Even though funding limitations and a compressed timetable made it impossible to invite everyone with a major potential contribution to the effort, the Vassar conference succeeded in bringing together more representatives of key organizations and active research centres than had ever met in one place before to discuss these problems.

* The recently developed Standard Generalized Markup Language (SGML), defined by the international standard ISO 8879, appears to provide an invaluable tool for developing a simple, flexible, extensible encoding scheme capable of satisfying the widely varying needs of textual researchers.

The newly achieved consensus also reflected the growing urgency of the need as more and more machine-readable texts are being created and exchanged. At Vassar, the status quo was several times described as 'chaos'.

2 OBJECTIVES OF THE TEI

At Vassar three overall objectives were defined for the project:

1. To specify a common interchange format for machine-readable texts.

2. To provide a set of recommendations for encoding new textual materials. The recommendations would specify both what features are to be encoded and how those features are to be represented.

3. To document the major existing encoding schemes, and develop a metalanguage in which to describe them.

3 ORGANIZATION OF THE PROJECT

3.1 Steering Committee

Following the Vassar conference ACL and ALLC joined with ACH to form a Steering Committee for the project with two representatives from each association. The Steering Committee is responsible for raising funds and overseeing the project. It meets approximately every three months, using electronic mail to conduct its business between meetings. The position of Chair of the Steering Committee rotates between the associations.

The first task of the Steering Committee was to establish a structure for the project and raise funds to carry it out.
Initial funding was obtained from the NEH for two years beginning in June 1988 (first cycle), with the anticipation of a further two-year funding period. The project structure outlined below was in place by the end of 1988.

3.2 Participating Organizations -- Advisory Board

Fifteen scholarly organizations participate in the TEI by representation on the project's Advisory Board. Members of the Advisory Board act as links between their organizations and the working committees and circulate progress reports and interim drafts within their organizations. The Advisory Board met in February 1989 to approve the project workplan and again in June 1993 to review the Guidelines.

3.3 Editors

Two editors co-ordinate the work and are responsible for drafting the TEI Guidelines. They ensure the compatibility of the activities of the TEI committees, and review and edit the final document, integrating the products of the various committees into a single coherent whole, as well as dealing with day-to-day enquiries and acting as a TEI secretariat. The editors are based in Chicago and Oxford.

3.4 Working Committees

The TEI Guidelines were developed by volunteers from the research community. Following a distinction which was first discussed at the planning meeting,(1) four working committees were set up, with membership drawn from a broad cross section of the scholarly community.

3.5 Committee on Text Documentation

This committee, with expertise in librarianship and archive management, was charged with dealing with the problems of labeling a text with in-file encoding of cataloguing information about the electronic text itself, its source and the relationship between the two.

3.6 Committee on Text Representation

This committee dealt with the problems of representing in machine-readable form: character sets, the logical structure of a text, layout and other features physically represented in the source material.
It first produced recommendations for ways of dealing with character sets and laid ground rules for encoding textual features common to most types of prose text.

3.7 Committee on Text Analysis and Interpretation

This committee was asked to provide discipline-specific sets of tags appropriate to the analytic procedure favoured by that discipline, but in such a way as to permit their extension and generalization to other disciplines. Its initial focus was on linguistic analysis and it has developed a number of powerful theory-independent mechanisms for encoding analytic features. Encoding of dictionaries is also dealt with by this committee, through a subgroup which was able to build on previous work.

3.8 Committee for Metalanguage and Syntax Issues

This committee, with expertise in formal language theory, rapidly decided that SGML was the best choice of an encoding language for the TEI. It has produced recommendations about how SGML should best be used, suggested modifications in SGML that might be needed to accommodate the TEI, and addressed the problems of conversion between the TEI and other encoding schemes.

4 DEVELOPMENT OF THE FIRST DRAFT OF THE TEI GUIDELINES

The work of the first cycle was to develop the first draft of the TEI Guidelines with input from all the Working Committees. The editors first prepared documents for the committees outlining their brief in more detail. Most of the committee work was then done by electronic mail with some face-to-face meetings to finalize the recommendations.

At the end of the first development cycle the editors were able to put together the first draft of the TEI Guidelines (TEI P1, hereinafter "P1"),(2) a document of almost 300 pages that incorporates the recommendations of the Working Committees. This document was made available publicly in summer 1990 at the end of the first cycle, and has been superseded by TEI P3 (see next section). Section 7 describes the rationale behind the Guidelines in more detail.

5 SECOND DEVELOPMENT CYCLE

Substantial funding was obtained for the second cycle, beginning in June 1990 and this enabled the Steering Committee to consider more options on how to organize the work for this period. In the first draft of the Guidelines some topics are only sketched out; others need revision in the light of comments, and some are omitted completely. The major objectives during the second funding cycle were to test the guidelines and extend their scope and coverage in the light of user comments. A final document was published in May 1994 at the end of the second funding cycle.(3)

5.1 Work Groups

The work of the first cycle concentrated on the core requirements of the TEI Guidelines, that is encoding those features which are found in most text types. In the second cycle specific areas were addressed in more detail by a number of small but tightly-focused work groups which the TEI set up. The work groups made recommendations in these areas, either directly where an area was already well-defined, or indirectly by sketching out a problem domain and proposing other work groups which needed to be set up within it. Each work group was given a specific charge and worked to a specified deadline. In all, about a dozen such groups were set up, including: character sets, text criticism, hypertext and hypermedia, mathematical formulae and tables, language corpora, physical description of manuscripts, analytic bibliography, general linguistics, spoken texts, literary studies, historical studies, machine-readable dictionaries and computational lexica.

Each group was formally assigned to one of the two major working committees of the TEI, depending on whether its work was primarily concerned with Text Representation or Text Analysis and Interpretation. These two committees then reviewed and endorsed the findings of each work group.

5.2 Affiliated Projects

The TEI set up arrangements with a number of projects which are engaged in the creation or maintenance of machine-readable texts. The projects were chosen to be representative of different discipline areas. They encoded samples of their own data according to the Guidelines and reported back to the TEI on the results of these tests. Each project was assigned a TEI consultant to advise on the encoding and to assist in the preparation of the feedback report.

Workshops for the affiliated projects were held in July 1991. The workshops provided an opportunity for affiliated project staff to learn more about the TEI Guidelines and for consultants to become familiar with the work of the projects to which they were assigned.

6 SYNTAX OF THE SCHEME -- SGML

No final decision about the syntactic basis for the new text encoding scheme was made at the planning conference. All present agreed that if possible the syntax should be borrowed from some existing scheme. The syntax must be relatively simple, capable of expressing the fine distinctions and occasionally complex overlapping hierarchical structures required by textual material, and must allow for user-defined extensions to the pre-defined set of tags. The most obvious candidate was the Standard Generalized Markup Language (SGML). It was decided to begin work on the syntax with SGML as a model, and abandon SGML only if it proved inadequate for the needs of research. SGML was soon shown to meet the requirements of the TEI. The one problem area is the need to handle multiple conflicting hierarchies, which SGML addresses through an optional feature of the language that is not widely supported by SGML software.

SGML is not itself an encoding scheme but a framework within which encoding schemes (tag sets) may be developed. Because multiple tag sets may be used in the same texts, SGML-based encoding schemes can easily support a diversity of opinions about the basic text features to be tagged.
SGML is device-independent and is supported by software products from an increasing number of vendors. It can handle any natural language for which an electronic representation exists and is by design application-independent. This means that it enables the same text base to be used for e.g. word processing and research-oriented analytic functions.

7 THE TEI GUIDELINES

The function of the Guidelines is to specify both what features are to be encoded and how those features are to be represented. Because of the diversity of texts, no simple set of absolute requirements can be applicable to all texts and purposes, but the encoding of a certain minimum number of features is highly recommended. In many types of textual analysis, the textual features studied vary widely depending upon the theoretical orientation of the researcher. Research is needed to define sets of textual features relevant to specific disciplines or text types (e.g. lexical analysis, textual criticism, thematic study, etc.), and to provide techniques for representing them. The TEI approach to dealing with a wide variety of text types is to attempt to define a comparatively small number of features which all texts share, and to allow for these to be used in combination with user-definable sets of more specialized features.

The TEI has followed the Poughkeepsie Principles in the development of the Guidelines. These have been amplified in TEI ED P1, "Design Principles for Text Encoding Guidelines", the main points of which are summarized here.

The Guidelines should meet the following design goals:

1. suffice to represent the textual features needed for research

2. be simple, clear and concrete

3. be easy for researchers to use without special-purpose software

4. allow the rigorous definition and efficient processing of texts

5. provide for user-extensions

6. conform to existing and emergent standards.

The Guidelines should contain explicit recommendations for

1. the encoding of new texts (which textual features should be captured, as a minimum, and how they should be marked)

2. the addition of new information or corrections to existing encodings

3. the interchange of existing encodings

4. archival documentation of encodings

5. text and encoding documentation for purposes of bibliographic control

6. documentation of selected markup schemes in terms of the recommended scheme.

The Guidelines should support work in any discipline based on textual material, concentrating initially on encoding the textual features of interest to those disciplines most commonly using computers: lexicography; thematic, metrical and stylistic studies; historical editing; computational branches of linguistics; and content analysis. Discipline-related work has concentrated first on linguistic issues and was later extended to literary and historical studies.

Following the basic philosophy of SGML and the need for reusability, the Guidelines are based on descriptive rather than procedural markup. Where possible the encoding tags describe structural or other fundamental text features independently of their representation. However it is recognized that in some disciplines the physical appearance of the original text is important either because it is the primary object of interest or because there is no consensus on the meaning of its appearance. Therefore tags for appearance features are also provided, with the recommendation that the primary purpose of the markup should not be viewed simply as a faithful reproduction of the appearance of the original.

It should be stressed that as experience is gained with the application of the Guidelines in practice, the Guidelines will be refined and extended. Since the needs of researchers continually change, no encoding scheme for researchers can ever be absolutely complete or definitive.

Further detail on the content of the Guidelines can be found in Appendix B.
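
The distinction between descriptive and procedural markup drawn in this section can be illustrated with a short sketch. It is not taken from the Guidelines: the element names (p, title, emph), the sample sentence, and the two style tables are assumptions made for illustration, and XML syntax parsed with Python's standard library stands in for SGML. The same encoded text is rendered in two ways and also queried for analysis, without the encoding itself changing.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: tags say what a span *is*, not how it should look.
# Element names are illustrative only; XML stands in for SGML here.
SAMPLE = (
    "<p>She read <title>Middlemarch</title> "
    "and found it <emph>very</emph> long.</p>"
)

# Two independent (hypothetical) presentation policies for the same text.
PLAIN_STYLE = {"title": ("_", "_"), "emph": ("*", "*")}
HTML_STYLE = {"title": ("<i>", "</i>"), "emph": ("<b>", "</b>")}


def render(elem, style):
    """Render an element and its children using the given style table."""
    before, after = style.get(elem.tag, ("", ""))
    parts = [before, elem.text or ""]
    for child in elem:
        parts.append(render(child, style))
        parts.append(child.tail or "")  # text following the child element
    parts.append(after)
    return "".join(parts)


def spans(elem, tag):
    """Return the text content of every element with the given tag."""
    return ["".join(e.itertext()) for e in elem.iter(tag)]


doc = ET.fromstring(SAMPLE)
print(render(doc, PLAIN_STYLE))  # She read _Middlemarch_ and found it *very* long.
print(render(doc, HTML_STYLE))   # She read <i>Middlemarch</i> and found it <b>very</b> long.
print(spans(doc, "title"))       # ['Middlemarch']
```

Because the tags record what each span is rather than how it looks, a third presentation, or an analytic pass with no presentation at all, would need only a new style table or query, not a re-encoding of the text.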

8 INFORMATION ABOUT THE TEI

The TEI maintains an electronic bulletin board (TEI-L) on which news about the TEI is posted and which provides a forum for detailed discussion of the TEI Guidelines. To subscribe, send an electronic mail message containing only the line SUBSCRIBE TEI-L to LISTSERV@UICVM (bitnet) or LISTSERV@UICVM.UIC.EDU (internet).

The TEI also keeps a document register at Chicago, which holds information about all TEI-related working papers, reports and publications. Wherever possible, we try to make sure that finalized reports of general interest are made available publicly. Documents may be obtained using the fileserver mechanism of the LISTSERV. To find out what is currently available, send a message to LISTSERV@UICVM (bitnet) or LISTSERV@UICVM.UIC.EDU (internet) containing the line GET TEI-L FILELIST.

9 THE FUTURE

Now that the technical specification of the TEI Guidelines has been established, the focus of the TEI will turn more to usability, user education and general promotion of the Guidelines.

9.1 Publications

The TEI encoding scheme is defined formally by the Guidelines themselves, but it is clear that a single large reference manual is not enough for practical needs. There is a need for a brief introductory guide, setting out the basic TEI framework and philosophy, as well as for tutorial material, and for examples of TEI-encoded texts. At this writing, work is proceeding on such introductory and tutorial material.

9.2 Feedback

During their development, the Guidelines were made available in draft in an admittedly incomplete state in the hope of stimulating the widest possible discussion of their proposals. Now that they have been formally published, the TEI continues to seek comment and suggestions. The TEI is committed to respond to and summarize all comments on the proposals.

9.3 Maintenance

The Sponsoring Organizations have agreed to provide a mechanism for revising and extending the Guidelines after their first publication, based on experience with the Guidelines in practice. This joint maintenance mechanism will be responsible for updates and revisions.

It may prove appropriate, after approval and publication of the Guidelines, for the Sponsoring Organizations to seek adoption of the Guidelines as national or international standards.

9.4 Training Materials

The TEI has held several training workshops. These workshops have provided the basis of information to be used as training materials and also have acted as another means of obtaining feedback from users. As time passes the content of these training materials will stabilize. It is hoped that there will be more people available to do the training and that some self-teaching material will be available.

It has already become apparent that a different approach to training may be required for the different types of users. What is easy for the computer scientist to understand is not necessarily as easy for the humanities scholar and vice versa. The TEI Guidelines are important for both and so the method of putting across the information may need to be tailored to the type of user.

9.5 Software

The need for software to handle TEI SGML is becoming more pressing. The requirement seems to be for systems which behave like well-known packages, but can also take advantage of the new capabilities offered by SGML. As a first step, we need programs to translate from the TEI encoding scheme to those required by the application programs we use, and back in the other direction. For writing one's own software, the community needs generally available routines which can read and understand TEI documents and which can be built into software individuals or projects develop for themselves or others (TEI parsers).
Equally important for the usability of the encoding scheme in the community at large will be TEI-aware data-entry software, e.g. editors and word processors which can exploit the rich text structure provided by SGML, simple routines to allow TEI tags to be entered into a text with a keystroke or two, and other tools to help make new texts in the form recommended by the TEI. Approximations to some of these are already available, but there is still a real need to provide such tools cheaply on the machines which humanities scholars most often use, i.e. IBM PCs and Macintoshes, and in such a way that they can be incorporated into software which is already in use.

9.6 Further Documents

The Metalanguage Committee has also prepared a more explicit definition of the notion 'TEI conformance', which has been requested by those who would like to develop TEI-conformant software. Some of the TEI working papers are also suitable for publication.

9.7 Conversion of Existing Archives

It is unreasonable to expect existing archives to use the new scheme for new texts, which they will need to keep consistent with their existing holdings, but the text archivists at the Vassar conference nevertheless saw a pressing need for a common format for data interchange. Such a format would allow each archive to write software with which to convert its present data format into the common interchange format before sending texts to other users, and convert incoming texts from the common interchange format into the local format.

At some later time, once the Guidelines are more widely used, it is likely that some archives will want to convert to the TEI scheme. The work of the Metalanguage Committee in providing a formal specification of existing schemes and of the TEI scheme should provide the groundwork for this.

10 THE ULTIMATE GOAL: USER ACCEPTANCE

No standard can be imposed.
The success of the TEI will depend on the ultimate acceptance of the Guidelines by the community they are intended to serve. Involving users as much as possible in their development has been the first stage in this effort.

The TEI Steering Committee is now formulating plans for the future, once the June 1992 version is in place. The activities outlined in the previous section are some of the things we will consider. Suggestions for others will be very welcome.

11 ACKNOWLEDGEMENTS

Much of the material in this overview has been taken from other TEI documents, most of which were originally written by the editors Michael Sperberg-McQueen and Lou Burnard.

This work has been made possible with the generous financial support provided by the US National Endowment for the Humanities, DG XIII of the Commission of the European Communities and the Andrew W. Mellon Foundation. The TEI is also grateful for the substantial indirect support from the host institutions of participants in the project.

12 APPENDIX A

12.1 Closing Statement of the Vassar Planning Conference

12.1.1 The Preparation of Text Encoding Guidelines

1. The guidelines are intended to provide a standard format for data interchange in humanities research.

2. The guidelines are also intended to suggest principles for the encoding of texts in the same format.

3. The guidelines should
   a. define a recommended syntax for the format,
   b. define a metalanguage for the description of text-encoding schemes,
   c. describe the new format and representative existing schemes both in that metalanguage and in prose.

4. The guidelines should propose sets of coding conventions suited for various applications.

5. The guidelines should include a minimal set of conventions for encoding new texts in the format.

6. The guidelines are to be drafted by committees on
   a. text documentation
   b. text representation
   c. text interpretation and analysis
   d. metalanguage definition and description of existing and proposed schemes,
   coordinated by a steering committee of representatives of the principal sponsoring organizations.

7. Compatibility with existing standards will be maintained as far as possible.

8. A number of large text archives have agreed in principle to support the guidelines in their function as an interchange format. We encourage funding agencies to support development of tools to facilitate this interchange.

9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.

13 APPENDIX B

13.1 Overview of the Contents of the TEI Guidelines

The following abbreviated table of contents gives some idea of the topics covered by the Guidelines.

* About these Guidelines
* A Gentle Introduction to SGML
* Structure of the TEI Document Type Definition
* Characters and Character Sets
* The TEI Header
* Elements Available in All TEI Documents
* Default Text Structure
* Base Tag Set for Prose
* Base Tag Set for Verse
* Base Tag Set for Drama
* Transcriptions of Speech
* Print Dictionaries
* Terminological Databases
* Linking, Segmentation, and Alignment
* Simple Analytic Mechanisms
* Feature Structures
* Certainty and Responsibility
* Transcription of Primary Sources
* Critical Apparatus
* Names and Dates
* Graphs, Networks, and Trees
* Tables, Formulae, and Graphics
* Language Corpora
* The Independent Header
* Writing System Declaration
* Feature System Declaration
* Tag Set Documentation
* Conformance
* Modifying the TEI DTD
* Rules for Interchange
* Multiple Hierarchies
* Algorithm for Recognizing Canonical References
* Alphabetical Reference List of Tags and Attributes
  - Element Classes
  - Entities
  - Elements
* Obtaining the TEI DTD
* Obtaining TEI WSDs
* Sample Tag Set Documentation
* Formal Grammar for the TEI-Interchange-Format Subset of SGML

14 APPENDIX C

14.1 TEI Steering Committee

* Susan Armstrong-Warwick (ACL)
  ISSCO, University of Geneva
  54 route des Acacias
  CH-1227 Geneva, Switzerland
  susan@divsun.unige.ch
  (+41 22) 705-7113
  (+41 22) 300-1086 fax

* David T. Barnard (ACH)
  Computing & Information Science
  Queen's University
  Kingston, ON K7L 3N6 Canada
  David.Barnard@queensu.ca
  (+1 613) 545-6056
  (+1 613) 545-6513
  http://www.qucis.queensu.ca/~barnard/info.html

* Susan Hockey (ALLC)
  Center for Electronic Texts in the Humanities
  169 College Avenue
  New Brunswick, N.J. 08903 USA
  hockey@rci.rutgers.edu
  (+1 908) 932-1384
  (+1 908) 932-1386 Fax
  http://www.ceth.rutgers.edu

* Nancy M. Ide (ACH)
  Department of Computer Science
  Box 520
  Vassar College
  Poughkeepsie, NY 12601 USA
  ide@vaxsar.vassar.edu
  (+1 914) 437-5988
  (+1 914) 437-7187 Fax
  http://www.cs.vassar.edu/faculty/ide.html

* Judith Klavans (ACL)
  40 South Drive
  Hastings on Hudson, NY 10706 USA
  klavans@cs.columbia.edu
  (+1 914) 478-5737
  http://www.cs.columbia.edu/~klavans/home.html

* Antonio Zampolli (ALLC)
  Istituto di Linguistica Computazionale
  Via Della Faggiola 32
  I-56100 Pisa, Italy
  glottolo@icnucevm.cnuce.cnr.it
  glottolo@icnucevm.bitnet
  (+39 50) 560481
  (+39 50) 589055 Fax

14.2 TEI Editors

* C. M. Sperberg-McQueen
  Computer Center (M/C 135)
  University of Illinois at Chicago
  1940 W. Taylor St., Room 124
  Chicago, IL 60612-7352, USA
  u35395@uicvm.uic.edu
  u35395@uicvm.bitnet
  (+1 312) 413-0317
  (+1 312) 996-6834 fax
  http://www-tei.uic.edu/orgs/tei

* Lou Burnard
  Oxford University Computing Services
  13 Banbury Rd, Oxford OX2 6NN, England
  lou@vax.ox.ac.uk
  (+44 1865) 273238, 273200
  (+44 1865) 273275 Fax
  http://info.ox.ac.uk/~archive

---------------------------------

(1) See Appendix A item 6.

(2) C.M. Sperberg-McQueen and Lou Burnard, ed., Guidelines for Electronic Text Encoding and Interchange (TEI P1). Chicago and Oxford: Text Encoding Initiative, May 1990.

(3) C.M. Sperberg-McQueen and Lou Burnard, ed., Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago and Oxford: Text Encoding Initiative, May 1994.