TEI EDW26

An Introduction to the Text Encoding Initiative

Lou Burnard

6 Aug 1991 (Revised, May 1992)


Abstract
--------

This paper [1] outlines the origins and goals of the Text Encoding Initiative, an international research project which aims to provide guidelines for the standardisation of electronic texts in research. It begins with a re-statement of the goals of the project, placed in the current research context, and then gives a brief description of its nature and scope, with attention to its potential for the handling of the full diversity of textual materials likely to be of interest to social historians.


1. What is the Text Encoding Initiative?
----------------------------------------

The Text Encoding Initiative is an international research project, the aim of which is to develop and to disseminate guidelines for the encoding and interchange of machine-readable texts. It is sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). The project is funded by the U.S. National Endowment for the Humanities, DG XIII of the Commission of the European Communities, the Canadian Social Science and Humanities Research Council and the Andrew W. Mellon Foundation. Equally important has been the donation of time and expertise by the many members of the international research community who have served on the TEI's various Working Committees and Working Groups.

The project originated in a conference held at Vassar College in the winter of 1987. [2] The conference was organised by the ACH with funding from NEH. Thirty-one invited scholars, representatives of scholarly organizations and archive directors from North America, Europe and the Middle East participated in two days of intense discussion from which emerged a consensus, expressed in the "Poughkeepsie Principles" discussed below. The success of this conference resulted in the establishment of the Text Encoding Initiative as a co-operative venture sponsored jointly by the ACH, the ACL and the ALLC and directed by a steering committee comprising two representatives from each of these associations.

The TEI Steering Committee formulated a work plan to achieve the goals proposed at the Vassar Conference, and then approached fifteen leading scholarly organizations in North America and Europe to nominate representatives to an Advisory Board. This Board met for the first time in Chicago in February 1989, approved the work plan with some modifications and nominated some members to the four working committees of the TEI. It will meet again in 1992 to review the final product of the TEI.

During the first funding cycle of the TEI (June 1988-June 1990), work was carried out in four large working committees, with membership drawn from Europe and North America, and from a wide range of disciplines. The members of the largest committee, for example, had backgrounds in computer science, textual editing, theology, software development, linguistics, philosophy and history, and were from Norway, Belgium, Germany, Spain and the UK as well as the USA. Each of the four committees proposed a variety of recommendations in distinct areas of text encoding practice. These found expression in the first draft of the TEI Guidelines, a large (300 page) report which was widely distributed in Europe, North America and elsewhere in November 1990.

As noted above, however, the present article will not attempt to summarise these Recommendations [3] themselves, as a number of summary articles have already appeared. [4]

During the second TEI funding cycle (June 1990-June 1992) the initial recommendations are being reviewed and extended by about a dozen different specialist working groups. The membership of these is, again, as broadly based as possible, while the topics they are currently addressing include character sets; textual criticism; hypermedia; mathematical formulae and tables; language corpora and spoken texts; physical description of printed and manuscript sources; formal characteristics of literary genres in prose, drama and verse; general linguistics; electronic dictionaries, lexica and terminology banks; and last, but by no means least, historical sources. Each work group is charged with the production of a set of recommendations for change to the Guidelines and with the testing of their provisions in the Group's chosen field. These reports will form one of three sources of input to the final revision of the current draft proposals, for review during the coming winter.

In the summer of 1991, a series of TEI Workshops was organised, at Tempe, Oxford and Providence. These attracted a large number of vociferous discussants, from a wide variety of backgrounds, including professional software developers, librarians and computer scientists as well as researchers and teachers from most of the disciplines in the humanities. The emphasis here was on using the Guidelines as they stood with currently available software and on putting their proposals to the test of real data.

Similarly pragmatic concerns underlie the second source of modifications to the current draft: the experiences of the "affiliated projects". Some fifteen major research projects, on both sides of the Atlantic, have signed affiliation agreements with the TEI which involve, amongst other things, the testing of the Guidelines against realistically sized samples of the various textual resources which each project is engaged in creating, and a detailed report on the problems encountered and solutions found. These projects include major linguistic corpus building ventures such as the British National Corpus and the ACL Data Collection Initiative; projects engaged in the creation of large general purpose historical or literary corpora such as the Brown University Women Writers Project or Harvard University's Perseus Project; as well as smaller individual projects aiming to produce scholarly electronic editions of major authors such as Middleton, Milton or Nietzsche.

The final source of comment and revision is the public at large, which has reacted with surprising and occasionally embarrassing enthusiasm to the original draft proposals. Over a hundred sets of individual comments and criticisms, some quite detailed, have already been received, and it is hoped that this public discussion will continue during the remainder of the project's lifetime. It has often been remarked that standards cannot be enforced by fiat: they must be accepted voluntarily if they are to achieve any permanent standing. To that end, the TEI has always been anxious to stimulate informed discussion of its proposals in as many and as diverse forums as there are listeners willing to hear.

The task of co-ordinating the working groups and committees, and of combining their drafts for publication, is carried out by two editors, one European and one American, while the project as a whole is managed by the steering committee. The final deliverables of the project, which will include a substantial reference manual and a number of tutorial guides, will be presented to the Advisory Board in the spring of 1992, in time for final publication at the end of the second funding cycle in June of that year.


2. TEI Design Goals
-------------------

As mentioned above, the basic design goals of the TEI Guidelines were determined by the results of a planning conference held at Vassar College in Poughkeepsie, New York, at the outset of the project. That planning conference agreed on the following statement of principles:

1) The guidelines are intended to provide a standard format for data interchange in humanities research.

2) The guidelines are also intended to suggest principles for the encoding of texts in the same format.

3) The guidelines should
   3.1) define a recommended syntax for the format,
   3.2) define a metalanguage for the description of text-encoding schemes,
   3.3) describe the new format and representative existing schemes both in that metalanguage and in prose.

4) The guidelines should propose sets of coding conventions suited for various applications.

5) The guidelines should include a minimal set of conventions for encoding new texts in the format.

6) The guidelines are to be drafted by committees on text documentation; text representation; text interpretation and analysis; metalanguage definition and description of existing and proposed schemes. These committees will be coordinated by a steering committee of representatives of the principal sponsoring organizations.

7) Compatibility with existing standards will be maintained as far as possible.

8) A number of large text archives have agreed in principle to support the guidelines in their function as an interchange format. We encourage funding agencies to support development of tools to facilitate this interchange.

9) Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.

The mandate of creating a common interchange format requires the specification of a specific markup syntax as well as the definition of a large predefined tag set and the provision of mechanisms for extending the markup scheme. The mandate to provide guidance for new text encodings ("suggest principles for text encoding") requires that recommendations be made as to what textual features should be recorded in various situations. The TEI Guidelines were thus, from the start, concerned both with `how' an encoding should be preserved for interchange purposes and with `what' should in fact be encoded, a point to which I return below.

As well as attempting to balance the membership of the original committees geographically and by discipline, an attempt was made to focus the work of each committee on areas where substantial agreement existed among the different parties, rather than to revive or rehearse well recognised disagreements. The guiding principle was to facilitate consensus rather than controversy. Inevitably, this approach runs the risk of pleasing no-one, by being void of substantial content or by refusing to take a stand on any issue.

In practice, however, the reception of the Guidelines shows that these dangers were safely avoided. Moreover, despite the diversity of backgrounds of those involved, a pleasing discovery was the number of fundamentally identical encoding problems to be found in apparently different types of material. The task of encoding (for example) the different strata of a manuscript tradition poses problems formally identical to those of encoding in parallel multiple linguistic analyses of a given sentence. The task of representing hypertextual links in a document is strikingly similar to that of representing literary allusions or cross references. The problem of preserving the physical appearance of a text together with multiple interpretations derived from it is common to almost every text-based discipline.


3. The need for interchange
---------------------------

The goal of the TEI is to develop and disseminate a set of Guidelines for the interchange of machine-readable texts among researchers, so as to allow easier and more efficient sharing of resources for textual computing and natural language processing. It needs to be stressed at the outset that the phrase "machine-readable texts" is to be understood in the widest possible sense.

What, for example, do historians do with texts which is fundamentally different from what linguists or social science researchers or literary analysts do? In each case, the same problems arise: an existing source must be represented accurately, a set of interpretations about its component parts must be specified, and some processing carried out. Sometimes the processing will focus more on representation than on analysis (as in the production of a new edition); at other times, the focus will be on a particular kind of analysis (as in the extraction of statistical data or the construction of a database). However, as I and others have argued elsewhere, [5] a crucial benefit of computer-aided research is the way in which it supports both approaches. The creator of an appropriately marked-up electronic text can have her cake and eat it too. Standardisation efforts which do not address the issue of a neutral, processing-independent view of textual objects, but instead focus on standards in particular application areas, thus fall somewhat short of realising the full potential of electronic texts.

Of course, in a world of unlimited resources, there would be no particular need for interchange standards at all: each creator of electronic texts could work in glorious isolation. However, in several areas of the research community, notably the expanding field of natural language processing, a need for interchangeable electronic resources is already widely recognised, on both economic and methodological grounds. Economically, it is widely accepted that the heavy cost of creating such resources as language corpora and electronic lexica can only be justified if the resources can be re-used by many projects. [6] Methodologically, the repeatability of research results, which forms an essential aspect of any empirical research, can best be guaranteed by the continued availability of data sets for secondary analysis.

Standardisation is easily achieved where there is a broad consensus about the kinds of data to be processed and the particular software packages to be used (as has been, for example, the case for many years in social science survey research).
It is less simple where essentially identical kinds of data resources (such as textual corpora) contain matter of interest to distinct research communities characterised by an immense variety of theoretical positions and methods. Yet the same conceptual model of what texts actually are should be applicable to all. This can only be achieved if standardisation is expressed at a sufficiently general level to allow for the widest variety of views.

The TEI arose from a perceived need within one, comparatively small, research community: that concerned with the encoding and manipulation of purely textual data for purposes of descriptive or corpus linguistics, stylistic analysis, textual editing and other forms of what is broadly called `Literary and Linguistic Computing' (LLC). There has recently been an interesting convergence between the needs and abilities of that community and those of the somewhat larger body of researchers concerned with the computational analysis of natural language (NLP), whether for natural language understanding, generation or translation systems. Straddling the two communities are those concerned with the creation of better, objectively derived, models of language in use, whose methods have transformed current practices in lexicography and language teaching. What links all of these researchers is the need to process large amounts of textual data in a wide variety of different styles. What the TEI offers them, and others, is a model for the standardisation of textual data resources for interchange.

It is helpful, when considering standardisation of electronic resources, to distinguish the objects of standardisation (the `what') from the particular representation recommended for them (the `how'). Like other standardisation efforts, the TEI Guidelines include both recommendations about which textual features should be distinguished when encoding texts from scratch, if the resulting text is to be of maximal usefulness to the research community, and recommendations of specific practices for representing those features. The `how' chosen by the TEI is based on the international standard Standard Generalized Markup Language (SGML), an informal introduction to which is provided elsewhere in this volume. The `what' is rather more difficult to summarise in a short document of this nature, but some general remarks and a few specific examples are provided below.

Distinguishing these two aspects of standardisation is particularly important for electronic resources, because of the ease with which their representations may be changed. What is sometimes forgotten is that ease of conversion is crucially dependent on the prior existence of an agreed set of distinctions. The TEI attempts to provide such an agreed set of distinctions, by proposing an abstract data model consisting of those features for which a consensus can be reached as to their importance in a wide range of automatic analyses. To represent these features, particular software systems may use entirely different representation schemata: different representations will be appropriate for different hardware environments, for different software packages, for archival storage and in particular for the exchange of data across particular networks.


4. Structure and interpretation: the TEI topoi
----------------------------------------------

I suggested above that the primary function of markup was to make explicit an interpretation of a text.
Any standardisation effort such as the TEI must therefore at some time grasp the nettle of deciding which interpretations are to be favoured over others. To put it another way, the TEI must at least attempt to address the question of which aspects or features of a text should be made explicit by its markup. For some scholars, this is a simple issue. There are some features of a text which are "obvious" and "objective" -- examples usually include major structural subdivisions such as chapters or verse lines or entries in a charter. There are others which are equally obviously "pure interpretation" -- such as whether or not a passage in a prose text belongs to some stylistic category, or is in a foreign language, or is a personal name. As this last list perhaps indicates, for the present writer this is a far from clear-cut distinction. In almost every kind of material, and especially in the kinds of materials studied by historians, there is a continuum of categorisations, from things about which almost everyone will agree almost all of the time, down to things which almost no-one will identify in the same way ever.

The TEI therefore adopts a liberal policy. It proposes for consideration a set of categories which wide consultation has demonstrated to be of use to a broad consensus of researchers. It proposes ways in which instances of those categories may be marked up (as discussed in the last section). Researchers in agreement as to the use of the categories so defined can thus interchange texts, or (if you wish) interpreted texts. They can do so, moreover, in a format which allows the disentangling of the interpretation from the text stream, or its enrichment in a controlled way. No claim is made as to the feasibility or desirability of making such interpretations in a given case -- all that the TEI can or does offer is a way of making explicit what has been done. The remainder of this paper discusses some concrete instances of the kinds of textual feature which typify the current TEI proposals.


4.1. The structure of a TEI text
--------------------------------

4.1.1. The TEI header
---------------------

All TEI-conformant texts contain (a) a "TEI header" and (b) the transcription of the text proper. The TEI header provides information analogous to that provided by the title page of a printed text. It contains a description of the machine-readable text, a description of the way it has been encoded, and a revision history; these are delimited by the <fileDesc>, the <encodingDesc> and the <revisionDesc> tags, respectively. The first of these identifies the electronic text as an object in its own right, independent of its source or sources (which must however be documented within it). The second supplies details of the particular encoding practices or variations which characterise the text, for example any special codebooks or other values used within the body of the text and descriptions of the referencing scheme or editorial principles applied. The header is, perhaps surprisingly, the `only' part of a TEI text which is mandatory.

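By way of illustration only, the skeleton of such a header might look something like the following; the element names and content shown here are indicative rather than definitive, and do not attempt to reproduce the draft proposals exactly:

    <teiHeader>
     <fileDesc>
      <titleStmt>
       <title>A Provincial Diary of 1848: machine-readable transcription</title>
      </titleStmt>
      <sourceDesc>
       <bibl>Transcribed from the only surviving manuscript copy</bibl>
      </sourceDesc>
     </fileDesc>
     <encodingDesc>
      <p>Abbreviations have been silently expanded; the page breaks of the
         source are recorded with milestone tags.</p>
     </encodingDesc>
     <revisionDesc>
      <p>First proofread version, May 1992</p>
     </revisionDesc>
    </teiHeader>

Even so minimal a header serves to identify the electronic text, the source from which it derives and the principles on which it was transcribed.
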
4.1.2. Marking Divisions within a Text
--------------------------------------

The TEI recommendations categorise document elements as either "structural" or "floating". Structural elements are constrained as to where they may appear in a document; for example a <head> or heading may not appear in the middle of a <p> (paragraph). Floating elements, as the name suggests, are less constrained and may appear almost anywhere in a text: examples include <note> or <q>. Intermediate between the two categories are so-called "crystals": these are floating features the contents of which have an inherent structure, for example <list> or <bibl> elements.

4.1.2.1. Structural features
----------------------------

The current recommendations define a general purpose hierarchic structure, which has been found to be suitable for a very large (perhaps surprisingly large) variety of textual sources. In this, a text is divided into an optional <front>, a <body> and an optional <back>. The body of a text may be a series of paragraphs (marked with the <p> tag), or it may be divided into chapters, sections, subsections, etc. In the latter case, the <body> is divided into a series of elements known generically as "div"s. The largest subdivision of a given text is tagged <div1>, the next smallest <div2> and so on. Written prose texts may also be further subdivided into <p>s (paragraphs). For verse texts, metrical lines are tagged with the <l> tag.

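By way of illustration, the skeleton of a prose text divided into books and chapters might be encoded along the following lines; the "type" and "n" attributes are shown merely to suggest how such divisions might be labelled, and the sample content is of course invented:

    <text>
     <body>
      <div1 type="book" n="1">
       <head>Book the First</head>
       <div2 type="chapter" n="1">
        <head>In which our subject is introduced</head>
        <p>The first paragraph of the first chapter ...</p>
        <p>The second paragraph ...</p>
       </div2>
       <div2 type="chapter" n="2">
        <head>In which matters become more complicated</head>
        <p>And so forth ...</p>
       </div2>
      </div1>
     </body>
    </text>

A <front> containing (say) a title page and preface, and a <back> containing indexes or appendixes, would precede and follow the <body> in the same way.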

4.1.2.2. Floating features
--------------------------

As mentioned above, the current Guidelines propose names and definitions for a wide variety of floating features. Examples include <head> for titles and captions (not properly floating, since they are generally tied to a particular structural element); <q> for quoted matter and direct speech; <list> for lists and <item> for the items within them; <note> for footnotes etc.; <corr> for editorial corrections of the original source made by the encoder; and, optionally, a variety of lexically `awkward' items such as <abbr>eviations, <name>s, <num>s, <date>s and <time>s, <bibl> for bibliographic or other citations, <address> for street addresses and <foreign> for non-English words or phrases.

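A single paragraph of running prose might thus, if the encoder chooses, be marked up somewhat as follows; the passage itself is invented, and the density of tagging shown is a matter of choice rather than of requirement:

    <p>On <date>12 March 1776</date> the widow <name>Hannah Smith</name>
    deposed that her late husband, formerly a <abbr>Lieut.</abbr> of
    marines, had been heard to say <q>the King shall enjoy his own
    again</q>, a remark which the clerk recorded only
    <foreign>pro forma</foreign>.<note>The deposition survives only in a
    later copy.</note></p>

Which, if any, of these features are worth distinguishing will naturally vary from one project to another.
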
4.2. Reference scheme
---------------------

The advantage of using a single hierarchic scheme as outlined above is that a referencing scheme based on it can be automatically generated. For example, a given paragraph will acquire a number indicating its sequence within the enclosing <div>, itself identified by its number within any enclosing <div> above it, and ultimately within the enclosing <text>. Thus, the value "T98.1.9/12" might identify the 12th paragraph in chapter 9 of book 1 of the text with number T98.

To complement this kind of internal referencing system, the Guidelines provide two distinct methods of marking other reference schemes, such as page and line numbers. The hierarchy of volume, page, and line can be neatly expressed with a concurrent markup stream separate from the main markup hierarchy (see P1 section 5.6); for data entry purposes, however, the simpler scheme we describe here may be more convenient. After data entry, this markup can be transformed mechanically into that required for a concurrent markup hierarchy, if that is supported by the software in use. Page breaks, column breaks, and line breaks may be marked with empty "milestone" elements: that is, tags such as <pb> or <lb> which mark a single point in the text, not a span of text, and therefore have no corresponding end-tags. Such tags may have an "n" attribute to supply explicitly the number of the page, column, or line beginning at the tag, or may give only the number of the first if subsequent ones can be calculated automatically. This mechanism also allows the pagination etc. of more than one edition to be specified, by using an "ed" attribute.

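A fragment of transcription using these milestone tags might therefore look something like this; the page and line numbers, and the edition identifier, refer of course to an imaginary source:

    <pb n="42" ed="1848">
    <p>The lines at the foot of the page are numbered as they occur in
    <lb n="30">the source, and the transcription continues without
    <lb n="31">interruption on to
    <pb n="43" ed="1848"><lb n="1">the following page.</p>
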
4.3. Descriptive vs presentational markup
-----------------------------------------

A matter of considerable controversy (and associated misunderstanding) has been the question of whether or not aspects of a text directly related to its physical appearance can or should be marked up. For some researchers, and in many applications, typographic features such as lineation or font are of little or no importance. For others, they are the very subject of interest. Because SGML focuses attention on "describing" a text, rather than attempting to simulate its appearance, the TEI recommendations have proposed that where it is possible to identify a structural (or floating) feature by its function, then that is what should be primarily tagged. This does not however mean that they provide no support for cases where the exact purpose of some distinctly-rendered part of a text cannot be determined. It is recognised that in many cases it may be neither desirable nor possible to interpret changes of rendering in this way.

A global attribute "rendition" may be specified for every tag in the TEI scheme, the value of which is a user-specified string descriptive of the way that the current element is rendered in the source being transcribed. [7] In most cases, a change in rendering and a change of element coincide: this mechanism therefore reduces the amount of tagging from what would be required if a separate set of tags were used for rendering. Further reduction in tagging is provided by the fact that the default value for a "rendition" attribute is that of the immediately surrounding element (if any). In cases where a renditional change is not associated with any discernible element, a special tag may be used, the sole function of which is to carry the "rendition" attribute. No recommendations about the form of value to be supplied for rendition attributes have yet been made: these are the subject of current work in two working groups. Similar considerations apply to the use of quotation marks and quoted passages within a text.

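As an illustration, a phrase whose function can be identified and a typographically distinct phrase whose function cannot might be recorded somewhat as follows; the tag names, and in particular the form of the rendition values, are shown here for purposes of illustration only, for the reasons just given:

    <p>She was <emph rendition="italics">not</emph> present at the
    hearing; against this entry the register adds, in a later hand,
    <hi rendition="red ink">vacat</hi>.</p>

The first of these records an interpretation of the source (emphasis) as well as its appearance; the second records the appearance alone.
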
4.4. Scope and coverage of P1
-----------------------------

As an example of the scope and range of facilities which SGML can support, I close with a brief summary of the full contents of the current draft and a more detailed description of a few of the more specialised kinds of textual features for which tags are already proposed in the draft Guidelines. It should be stressed that the first draft of the Guidelines, despite its weighty appearance (nearly 300 pages of closely printed A4), is very much a discussion paper and far from being complete or definitive. Some characteristics of the TEI approach are however already discernible which are unlikely to change. One is a focus on the encoding of the content of a text, rather than its appearance -- as discussed above, this is also a characteristic of SGML. Another is the rigorous application of Occam's razor: the TEI approach to the immense variety of text types in the real world is to attempt to define a comparatively small number of features which all texts share, and to allow for these to be used in combination with user-definable sets of more specialised features. [8]

The current draft has eight main sections, which are briefly summarized below.

Chapter 1 outlines the purpose and scope of the TEI scheme. As outlined above, its main goals are both to facilitate data interchange and to provide guidance for those creating new texts. The desiderata of simplicity, clarity, formal rigour, sufficient power for research purposes, conformance to international standards, and independence of software, hardware or application alike are stressed.

Chapter 2 provides a gentle introduction to the basic concepts of SGML and also contains some more technical information about the ways in which the TEI scheme uses the standard.

Chapter 3 addresses the problems of character encoding and translation in a world dominated by the rival claims of ASCII and EBCDIC. If the goal is to provide machine-independent support for all writing systems of all languages, these problems are far from trivial. The specific recommendations made are that only a subset of the ISO 646 character set (sometimes known as ASCII) can currently be relied on for data interchange, and that this should be extended either by using the entity reference mechanism provided by SGML or by using transliteration schemes. It proposes a powerful but economical way of documenting such transliteration schemes by a formal Writing System Declaration.

Chapter 4 contains recommendations for in-file documentation of electronic texts adequate to the bibliographic needs of researchers, data archivists and librarians. It recommends that a special header be added to each file to perform a function analogous to that of the title page of a non-electronic text, and proposes sets of tags for information about the file itself, the source from which it was derived and how it was encoded.

Chapter 5, the largest chapter, attempts to define a set of general-purpose structural and floating tags for continuous prose texts. Its basic idea of text as an ordered hierarchy of objects, within which floating features and crystals may appear, was discussed above. This chapter of the Guidelines also proposes tags for features such as lists, notes, names, abbreviations, numbers, foreign or emphasised phrases, cross references, and hypertextual links. Sections deal with the kinds of textual element commonly found in front and back matter of printed texts, title pages etc. Other sections discuss ways of encoding textual variation and critical apparatus, and of recording the rendering of arbitrary textual fragments within this overall framework. There is also some discussion of different ways of maintaining multiple referencing schemes within the same text.

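One of the mechanisms proposed for such cross references and hypertextual links relies on the ID/IDREF facility native to SGML: an element may carry a unique identifier, to which other elements may then point. A simple internal cross reference might be encoded roughly as follows (the pointing tag and attribute names shown are indicative only):

    <p id="P12">The terms of the settlement are rehearsed at length in
    what follows.</p>
    <!-- ... intervening text ... -->
    <p id="P40">As was noted <ref target="P12">above</ref>, the terms of
    the settlement were never in fact fulfilled.</p>
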
Chapter 6 outlines a number of theory-independent mechanisms for representing all kinds of linguistic analyses of running text. It is probably the most daunting chapter for the non-specialist reader, though much of its content is of very wide relevance. It argues that most, if not all, linguistic analyses can be represented as bundles of named, value-bearing `feature structures', which may be nested and grouped into sets or lists. It proposes ways of supporting multiple and independently aligned analyses, chiefly by means of the ID/IDREF pointer mechanism native to SGML. It also contains some tagsets for such commonly occurring formalisms as tree structures and parts of speech.

Chapter 7 considers in more detail particular aspects of some specific types of text. The text-types discussed in this draft are: language corpora and collections; verse, drama, and narrative; dictionaries; and office documents. In each case, an overview of the problems specific to these types of discourse is given, with some preliminary proposals for tags appropriate to them. This chapter is one that will be considerably revised and extended over the coming months, as its initial proposals are firmed up and as its scope is extended to other types of text.

Chapter 8 outlines a method by which the current Guidelines may be modified and extended, largely by introducing indirection into the Document Type Definitions (DTDs), the formal SGML specifications for the TEI encoding scheme. Extension and modification of the TEI proposals are both expected and intended: supporting them is an important design goal, and the final form of the Guidelines will facilitate them.

Preliminary versions of a number of technical appendixes are provided in the current draft. These include annotated examples illustrating the application of the TEI encoding scheme to a wide range of texts, formal SGML document type declarations (DTDs) for all the tags and groups of tags defined in the TEI scheme, and code pages for some commonly used character sets. Later drafts will extend and improve these initial versions considerably, and will also contain an alphabetical reference section with a summary of each tag, its attributes, its usage, and an example of its use, as well as full Writing System Declarations for a range of commonly used alphabets.

Space precludes an exhaustive discussion of the various tags and associated features suggested by the current TEI draft proposals. Further proposals from the specialist working groups currently discussing extensions in a wide range of subject areas will be included in the final TEI report in a year's time. However, it is hoped that enough detail has been provided to give some indication of the general ideas underlying the scheme.


Notes
-----

[1] First published in Greenstein, D.I., ed., Modelling Historical Data (Göttingen: Max-Planck-Institut für Geschichte, 1991).

[2] See "Report of Workshop on Encoding Principles", Literary and Linguistic Computing 3.2 (1988).

[3] ACH-ACL-ALLC Guidelines for the encoding and interchange of machine-readable texts, edited by Lou Burnard and C.M. Sperberg-McQueen (Chicago and Oxford: Text Encoding Initiative, October 1990), hereafter "Guidelines" or "P1".

[4] Examples include Humanistiske Data 3-90; ACH Newsletter 12 (3-4); EPSIG News 3 (3); SGML Users Group Newsletter 18; ACLS Newsletter 2/4; and elsewhere. A fuller summary is to appear as Burnard, "The TEI: a progress report", in Proceedings of the 11th ICAME Conference, Berlin, 1990, ed. G. Leitner. Encoding principles of the TEI are discussed in Sperberg-McQueen, "Texts in the electronic age: textual study and text encoding, with examples from medieval texts", Literary and Linguistic Computing 6.1 (1991).

[5] "Primary to Secondary", in Peter Denley and Deian Hopkins, eds., History and Computing (Manchester: Manchester University Press, 1987).

[6] A recent EUROTRA-funded study reported on this and other aspects of `re-usability of lexical resources' in considerable detail; its findings are equally relevant to other disciplines. See U. Heid, "Eurotra-7: Feasibility and Project Definition Study on the Reusability of Lexical and Terminological Resources in Computerised Applications", January 1991.

[7] This implies, of course, that the markup describes a single source.

[8] This has been termed the "pizza model", by contrast with either the "table d'hôte" or the "à la carte" models. A choice of a small number of bases is offered, each of which may be combined with a large number of toppings.