TEI ED P1: Design Principles for Text Encoding Guidelines

14 December 1988, rev. 9 January 1990

Abstract

This document defines the basic design goals and working principles for the text encoding guidelines to be created by the Text Encoding Initiative.

It extends the principles enunciated by the Poughkeepsie Planning Conference of November 1987 (see TEI document no. TEI PCP1) to questions of detail not covered there, and provides basic interpretations of the clauses of the Poughkeepsie Principles.

Introduction

The Text Encoding Initiative is a cooperative undertaking of the textual research community to formulate and disseminate guidelines for the encoding and interchange of machine-readable texts intended for literary, linguistic, historical, or other textual research. It is sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). A number of other learned societies and professional associations support the project by their participation in the Initiative's Advisory Board. The project is funded in part by the U.S. National Endowment for the Humanities.

The primary goal of the Text Encoding Initiative is to provide explicit guidelines which define a text format suitable for data interchange and data analysis; the format should be hardware and software independent, rigorous in its definition of textual objects, easy to use, and compatible with existing standards. The Standard Generalized Markup Language (SGML) is expected to provide an adequate basis for the guidelines.

This document attempts to set out the fundamental principles upon which the work of the Text Encoding Initiative is to proceed. In it, the guidelines for text encoding and text interchange to be formulated by the Initiative are referred to simply as “the guidelines”; the encoding scheme specified by the guidelines is referred to as “the TEI scheme” to distinguish it from other encoding schemes extant or prospective. “Encoding scheme” and “markup scheme” are here used interchangeably; the term “tag set,” which conveys a different sense, is sometimes also used since the expectation is that the TEI markup scheme will consist largely of a set of SGML tags together with an account of their interrelationships and meanings.

The Poughkeepsie Principles

Closing Statement of Vassar Conference
The Preparation of Text Encoding Guidelines
Poughkeepsie, New York: 13 November 1987

1. The guidelines are intended to provide a standard format for data interchange in humanities research.

2. The guidelines are also intended to suggest principles for the encoding of texts in the same format.

3. The guidelines should
   a. define a recommended syntax for the format,
   b. define a metalanguage for the description of text-encoding schemes,
   c. describe the new format and representative existing schemes both in that metalanguage and in prose.

4. The guidelines should propose sets of coding conventions suited for various applications.

5. The guidelines should include a minimal set of conventions for encoding new texts in the format.

6. The guidelines are to be drafted by committees on
   a. text documentation
   b. text representation
   c. text interpretation and analysis
   d. metalanguage definition and description of existing and proposed schemes,
   coordinated by a steering committee of representatives of the principal sponsoring organizations.

7. Compatibility with existing standards will be maintained as far as possible.

8. A number of large text archives have agreed in principle to support the guidelines in their function as an interchange format. We encourage funding agencies to support development of tools to facilitate this interchange.

9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.

The principles agreed upon at the Poughkeepsie Planning Conference are expounded in more detail and supplemented with other material in the sections which follow.

Purpose of the Guidelines

Guidance for New Encodings

Points 2, 4, and 5 of the Poughkeepsie Principles mandate that the guidelines should simplify the tasks facing projects to encode new texts in machine-readable form by making it unnecessary for such projects to design an encoding scheme from scratch. By recommending a standard minimum set of textual features commonly found useful, the guidelines should help raise the quality and ensure the re-usability of new encodings; sets of special-purpose tags for specific research disciplines should make it easier for independent projects in those disciplines to exchange data and results.

To help those who encode new texts, the guidelines must provide guidance for researchers who might otherwise be perplexed by the complications of machine-readable texts and so encode unnecessary textual features at the cost of omitting features that later prove more useful. The guidelines should reduce, not increase, the perplexity of deciding what to encode.

Common Interchange Format

Principles 1, 4, 8, and 9 direct that the guidelines must be suitable for the interchange of encodings among sites using different schemes. This should be of assistance to data archives, their borrowers, and even to software developers who can rely on this interchange format as a documented interface between their software and its textual data.

For interchange, it must be possible to translate from any existing scheme for text encoding into the TEI scheme without loss of information. All distinctions present in the original encoding must be preserved. Additionally, conventions used in the original encoding should be documentable within the interchange format.
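The no-loss requirement can be illustrated with a minimal sketch in Python. It assumes a hypothetical legacy scheme: the codes (`$I`, `$B`, `$T`) and the target tag names are invented for illustration and come from neither the guidelines nor any real scheme. Under these assumptions, translation preserves every distinction only if no two source codes are mapped to the same target tag, which is also what makes the translation reversible.

```python
# Hypothetical legacy codes mapped to hypothetical target tag names.
LOSSLESS = {"$I": "italic", "$B": "bold", "$T": "title"}
# Two distinct source codes collapse into one tag, so a distinction is lost.
LOSSY = {"$I": "highlighted", "$B": "highlighted", "$T": "title"}


def preserves_distinctions(mapping):
    """True if no two source codes are translated to the same target tag."""
    return len(set(mapping.values())) == len(mapping)


def invert(mapping):
    """Build the reverse mapping; refuse if the forward mapping loses information."""
    if not preserves_distinctions(mapping):
        raise ValueError("mapping merges distinct source codes")
    return {tag: code for code, tag in mapping.items()}


def round_trip(codes, mapping):
    """Translate a sequence of codes into the target scheme and back again."""
    back = invert(mapping)
    return [back[mapping[code]] for code in codes]
```

A mapping such as `LOSSY` would still produce valid output, but the original encoding could no longer be recovered from it; the check above is one way a translation table might be vetted before use.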

When the TEI scheme is used as an interchange format for pre-existing encodings, the recommendations for minimum tagging mandated by Principle 5 do not apply. As Principle 9 makes clear, translation into the TEI scheme should not be construed as requiring the addition of any new information not present in the original encoding.

Documentation of Existing Markup Schemes

The guidelines will include descriptions of selected existing markup schemes contrasted with the TEI scheme. This will help clarify the new scheme for those familiar with existing schemes; it should also assist users confronted with data encoded in existing schemes, and those who must translate encodings from one scheme to another.

In addition to an informal description in prose, the guidelines will also provide a formal description of selected existing schemes in a metalanguage to be prepared for the purpose. The use of a formal metalanguage will, it is hoped, encourage rigorous documentation of existing schemes, and may make possible the automatic production of software to translate data from formally documented schemes into the TEI scheme.
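What "automatic production of software" from a formal description might look like can be sketched as follows. The sketch is only illustrative: the description format (a list of start delimiter, end delimiter, and feature name), the legacy delimiters `*` and `#`, and the tag names `title` and `emph` are all assumptions made here, not part of the metalanguage, which had not yet been defined.

```python
import re

# Hypothetical formal description of a legacy scheme:
# (start delimiter, end delimiter, feature name in the target scheme).
LEGACY_DESCRIPTION = [
    ("*", "*", "title"),
    ("#", "#", "emph"),
]


def make_translator(description):
    """Generate a translation function from a declarative scheme description."""
    rules = [
        (re.compile(re.escape(start) + r"(.*?)" + re.escape(end)), feature)
        for start, end, feature in description
    ]

    def translate(text):
        for pattern, feature in rules:
            # Bind the feature name now so each rule emits its own tag.
            text = pattern.sub(
                lambda m, f=feature: "<%s>%s</%s>" % (f, m.group(1), f), text
            )
        return text

    return translate


translate = make_translator(LEGACY_DESCRIPTION)
```

With these assumptions, `translate("The *Iliad* is #very# old.")` yields the same sentence with the title and the emphasized word enclosed in start- and end-tags. The point of the design is that only the description changes from scheme to scheme; the generating code stays the same.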

Design Goals

The following design goals are to govern the choices to be made by the working committees in drafting the guidelines. Higher-ranked goals should count more than lower-ranked goals. The guidelines should

1. suffice to represent the textual features needed for research
2. be simple, clear, and concrete
3. be easy for researchers to use without special-purpose software
4. allow the rigorous definition and efficient processing of texts
5. provide for user-defined extensions
6. conform to existing and emergent standards

For the most part, the design goals are self-explanatory, but some commentary is in order.

The TEI, as an undertaking of the research community, is responsible primarily to that community for creating encoding practices adequate to research needs. Since researchers do use commercial software and publish their results, the practices and needs of commercial software developers and publishers must also be considered, but they are not to outweigh the needs of textual research.[1]

Research work requires above all the ability to define rigorously (i.e. precisely, unambiguously, and completely) both the textual objects being encoded and the operations to be performed upon them. Only a rigorous scheme can achieve the generality required for research. A rigorously defined encoding scheme can also allow many text-management tasks to be automated.

For a scheme to be adopted by the research community, it must be clear, concrete, and easy to use. Otherwise, it will simply be ignored.

Since research necessarily involves the asking of questions that have not been asked before, a research-oriented encoding scheme must also be extensible. Some measure of extensibility, then, is an absolute requirement for the TEI markup scheme. As a design goal, “extensibility” refers not to this absolute requirement (a solution for which we can take as a given), but to the ease with which various portions of the scheme can be fitted with extensions and the ease with which various possible kinds of extensions can be created.

“Compatibility with existing standards and practice” is to be sought, but (as its rank suggests) not at the expense of the other design goals. The standards most relevant to this goal are SGML and existing applications of SGML, as well as the standards now being developed for page description and similar applications.[2] The Text Encoding Initiative will develop a conforming SGML application, if it can meet the needs of researchers by doing so. Where research needs require constructs unavailable with SGML, however, research must take precedence over the standard.

The Initiative is not committed to using the full range of constructs available with SGML; the metalanguage committee is responsible for assessing the guidelines' compatibility with commonly available software.

Scope of the Guidelines

Activities Covered

The guidelines should contain explicit recommendations for

- the encoding of new texts (which textual features should be captured, as a minimum, and how they should be marked)
- the addition of new information or corrections to existing encodings
- the interchange of existing encodings
- archival documentation of encodings
- text and encoding documentation for purposes of bibliographic control
- documentation of selected markup schemes in terms of the recommended scheme

Types of Research

Ultimately, the guidelines should support work in any discipline based on textual material. Pragmatically, this goal exceeds anyone's capacity right now. The first published version of the guidelines should provide explicit guidance for encoding the textual features of interest to those disciplines most commonly using machine assistance: the more computationally oriented branches of linguistics; lexicography; thematic, metrical, and stylistic studies; historical editing; content analysis. Discipline-related work will concentrate first on linguistic issues. Later drafting will extend the guidelines to problems specific to literary studies, historical research, and other textual disciplines.

Text Types

The guidelines should provide explicit guidance for the text types most commonly encountered in textual research; esoteric genres may be left for user-defined extensions. The first drafting cycle will concentrate on simple forms (simple nonfiction, unillustrated prose narrative, poetry, plays) and basic reference works (notably monolingual and multilingual dictionaries). Some attention will be paid to all text types represented in the major linguistic corpora.

Languages and Scripts

The goal of the Initiative is to devise and document encoding methods appropriate to every language used officially or studied extensively with machine assistance in Europe and North America. Since character-encoding problems grow progressively more complex and progressively more dependent on hardware configurations as the script diverges from the left-to-right alphabetic pattern of English, the Initiative will attempt first to address the basic problems of left-to-right alphabetic languages, progressing from there to right-to-left and other monodirectional scripts, and on to multidirectional, multi-script texts and texts in non-alphabetic languages.

Content of the Guidelines

Level of detail

The guidelines will make specific concrete recommendations for delimiters, separators, tags for use in encoding and interchange, codes for “special” characters, methods of declaring extensions, and methods of describing existing markup schemes. Additionally, specific descriptions will be provided for the syntax and semantics of a small set of existing schemes.

The recommendations for interchange among users will take into account the character sets now in use in academic computing environments and the problems of inter-machine transfer.[3] At the same time, the guidelines will be device independent and will neither rely upon nor discuss specific hardware or software.
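Note 3 defines the interchange character set as the set of characters translated consistently and reversibly by all translation programs in common use. A minimal sketch of that definition in Python follows. It assumes, purely for illustration, that two EBCDIC code pages shipped with Python (`cp037` and `cp500`) stand in for the translation tables of two sites; the tables actually in use at any given installation may differ.

```python
# Assumption: these two EBCDIC code pages stand in for the translation
# tables used by different sites; real installations may use others.
CODE_PAGES = ["cp037", "cp500"]


def interchange_set(candidates, code_pages=CODE_PAGES):
    """Return the characters every table translates consistently and reversibly."""
    safe = set()
    for ch in candidates:
        try:
            encoded = {ch.encode(cp) for cp in code_pages}
        except UnicodeEncodeError:
            continue  # at least one table cannot represent the character
        if len(encoded) != 1:
            continue  # the tables disagree on the byte value: not consistent
        byte = next(iter(encoded))
        if all(byte.decode(cp) == ch for cp in code_pages):
            safe.add(ch)  # every table maps the byte back to the character
    return safe


# Printable seven-bit ASCII as the candidate repertoire.
ASCII_PRINTABLE = [chr(code) for code in range(0x20, 0x7F)]
SAFE = interchange_set(ASCII_PRINTABLE)
```

Letters and digits survive under both tables, while characters such as the square brackets do not, because the two code pages place them at different byte values. Adding further tables to `CODE_PAGES` can only shrink the resulting set.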

Form

The guidelines will to the extent possible take the form of sets of SGML tags and attributes. Examples will be provided. Formal document type definitions will be prepared if they promise to make the guidelines appreciably clearer or more useful.

Structure of the Guidelines

Kernel / Additions

The guidelines will provide for

- the encoding of the text itself
- the documentation of the text's source
- the documentation of the encoding itself and its peculiarities

The provisions of the guidelines can be divided into a central core or kernel of principles and tags applicable to all texts or to the great majority of texts (“general-purpose tags”) and various sets of tags or encoding conventions applicable to texts in specific languages or scripts, texts of specific text types, or texts encoded for specific disciplinary purposes (“special-purpose tags”). Within each set of tags, distinctions can be made between recommended and optional practices, but only general-purpose tags will be recommended for all texts. Each set of tags devised for a specific language, script, text type or discipline may itself comprise a kernel of common tags and one or more sets of optional tags extending the kernel. When these sets of tags are used consistently in groups, the encoding practice of individual texts can be described by listing the tag sets used (e.g. “encoded with basic set of general-purpose tags plus level 3 of metrical and level B of lexical tags”).
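The idea of describing an encoding by listing the tag sets used can be sketched as a small data structure. Everything named below is hypothetical: the tag-set names, the level labels, and the tags themselves are placeholders, and the sketch further assumes that levels within a special-purpose set are cumulative (level 3 includes levels 1 and 2), which the text above does not state.

```python
# All tag-set names, level labels, and tags are hypothetical placeholders.
GENERAL_PURPOSE = {
    "basic": {"text", "body", "div", "p"},
}
SPECIAL_PURPOSE = {
    "metrical": {1: {"line"}, 2: {"stanza"}, 3: {"foot"}},
    "lexical": {"A": {"entry"}, "B": {"sense"}},
}


def tags_for(general, special):
    """Return the tags permitted by a declared combination of tag sets.

    general: name of the general-purpose set used.
    special: mapping from special-purpose set name to the highest level used.
    Assumption: levels are cumulative in sorted order.
    """
    tags = set(GENERAL_PURPOSE[general])
    for set_name, top_level in special.items():
        levels = SPECIAL_PURPOSE[set_name]
        if top_level not in levels:
            raise KeyError("unknown level %r in set %r" % (top_level, set_name))
        for level in sorted(levels):
            tags |= levels[level]
            if level == top_level:
                break
    return tags


# "basic set of general-purpose tags plus level 3 of metrical
# and level B of lexical tags"
PROFILE_TAGS = tags_for("basic", {"metrical": 3, "lexical": "B"})
```

A declaration of this kind is compact enough to travel with the text, and a receiving site could use it to decide in advance whether its software recognizes every tag the text may contain.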

Draft Table of Contents

A draft table of contents for the guidelines follows:

1 Principles of Text Encoding
  1.1 Why Markup Is Necessary at All (A brief discussion about functions of descriptive markup, why it is not presentational, etc.)
  1.2 The Advantages of Standardized Markup
2 About These Guidelines
  2.1 Intended Applications (Database/retrieval/analysis as well as printing and formatting. Research community rather than commercial. Relevance to language industries.)
  2.2 Design Principles (How features are defined and described in the Guidelines.)
  2.3 Structure of the Guidelines (Base features, optional features, "boxes" and their base and optional features or "levels of description". Document prolog and document body.)
3 SGML Markup
  3.1 Principles and Definitions (Introduction to SGML: tags, elements, content models, document type declarations. Alternatives to DTDs. SGML declarations in general.)
  3.2 SGML Declarations for the TEI Guidelines (Description of SGML features used in TEI Guidelines, text of formal SGML Declaration for TEI texts.)
  3.3 Non-SGML Declarations for TEI Texts (How to declare what tags you have used. How to declare use of specific levels of description, substitution of tag names, etc. Cross-reference to later sections for details of declarations for pre-defined material; cross-reference to chapter 9 for full details on declaring modifications and extensions.)
4 Characters and Character Sets
  4.1 Principles and Definitions (Characters, character sets, character repertoires. What is ASCII. Seven-bit and eight-bit ASCII. Standard ways of extending ASCII. Vendor-specific non-standard extended ASCIIs. SGML-supported character sets. EBCDIC. IBM PC character set. Macintosh, Mac extensions by other vendors. Adobe Postscript. Transliteration schemes. Entity references.)
  4.2 Recommendations (For character sets, names of ISO sets and EBCDIC code pages. Possibly include recommended transliterations, sample USEMAP and CHARSET declarations, etc.? For entities, simply list recommended name, description and appearance of various special characters. Possibly relegate detailed lists and code-pages to appendices.)
    4.2.1 Recommended Character Sets
    4.2.2 Recommended Entity Names
    4.2.3 Declaring New Character Sets or Character Entities
5 Bibliographic Control of Electronic Texts
  5.1 Principles and Definitions
  5.2 Recommended Features and Tags (Bibliographic identification of machine-readable text. Bibliographic identification of source text(s). Documenting changes to the source text during pre-editing or data entry. Documenting changes to the machine-readable text.)
  5.3 Correspondence between Recommended Tags and MARC fields
6 Features Common to All Texts
  6.1 Principles and Definitions
  6.2 Recommended Features and Tags
    6.2.1 Basic Text Structure (Front matter, body, back matter, chapters, sections, etc. down to paragraph level.)
    6.2.2 Non-structural Text Segments (Features below paragraph level, including highlighting, emphasis, quotation, index entries, special layout, language and script, illustrations ...)
    6.2.3 Figures and Tables
    6.2.4 Bibliographic References
    6.2.5 Critical Apparatus
    6.2.6 Parallel Texts
    6.2.7 Cross Reference and Textual Links
7 Features for Specific Text Types
  7.1 Principles and Definitions
  7.2 Recommended Features and Tags
    7.2.1 Mixed Corpora
    7.2.2 Literary Texts
    7.2.3 Technical and Scientific Texts
    7.2.4 Historical Documents
    7.2.5 Dictionaries and Lexica
    7.2.6 Transcripts of Spoken Texts
8 Analytic and Interpretive Features
  8.1 Principles and Definitions
  8.2 Recommended Features and Tags
    8.2.1 Syntactic Features
    8.2.2 Morphological Features
    8.2.3 Phonological Features
    8.2.4 Lexical Features
9 Extending the Guidelines
  9.1 Modifying the Guidelines (Substituting short forms or different names for tags or attributes. Using a tag with a different meaning. Changing legal attribute values. Restricting where tags can occur; allowing tags to occur in new places; doing away with syntactic restrictions altogether.)
  9.2 Defining Additional Features (Adding new tags: defining where they can occur, what can occur inside them, and what they mean. Defining new attributes for old or new tags.)
  9.3 Worked Example
10 Full Alphabetical List of Features (A summary for each recommended tag, giving name, definition, description, associated features and brief example of usage)
11 Translation Table (shows equivalent name in each EC language for every name listed in sections 3.2.2 and 9. Omit in 1990?)
12 Use of the Guidelines for Document Interchange (Portability. Different needs of document capture, storage, processing, and interchange. Reducing danger of character set confusions during document interchange.)

Appendices
A How the Guidelines were Developed (brief note about TEI structure and history)
B Mapping from the Guidelines to Other Encoding Schemes (Translating into the Guidelines from existing encoding schemes. Translating back out.)
C Examples of Tagged Texts

The editors, in consultation with the committee heads and the Steering Committee, will have primary responsibility for sections 1, 2, 12, and A (Principles of Text Encoding, About the Guidelines, Document Interchange, and History of the TEI). They will assemble section 10 from the work of the committees.

The Committee on Text Documentation will have primary responsibility for section 5 (Bibliographic Control) and will consult with the Committee on Metalanguage and Syntax Issues on the overall organization of section 3.3 (Non-SGML Declarations). They will advise the Text Representation committee on section 6.2.4 (Bibliographic References).

The Committee on Text Representation will have primary responsibility for sections 4 (Character Sets), 6 (Features Common to All Text Types), and 7.2.1 through 7.2.4 (Corpora, Literature, Technical and Scientific Documents, Historical Documents), as well as any other sections inserted into section 7 on further text types. They should consult with the Committee on Text Documentation on section 6.2.4 (Bibliographic References) and will partially determine the content of declarations relevant to their subject domain (section 3.3).

The Committee on Text Analysis and Interpretation will have primary responsibility for sections 7.2.5 and 7.2.6 (Dictionaries, Spoken Texts) and section 8 (Analytic and Interpretive Tags). They will contribute also to section 3.3 (Non-SGML Declarations).

The Committee on Metalanguage and Syntax Issues will have primary responsibility for sections 3, 9, and B (SGML, Extensions, and Mapping to Other Schemes). They will collaborate with the Committee on Text Documentation on section 3.3 (Non-SGML Declarations).

All four working committees will contribute to sections 10 (Full List of Features) and C (Examples).

Prescription and Description

Compliance with the guidelines is necessarily a voluntary matter; use of the term “requirement” in connection with the guidelines must therefore not be misconstrued. Within the context of the guidelines, however, the committees will be able to specify a mix of requirements, recommended practices, optional features and practices, required choices among defined alternatives (“electives”), and possible user-defined extensions. Provision will also be made for documentation of user-specified deviations from the recommendations of the guidelines.

Specific Design Issues

Nomenclature

For clarity, each generic identifier (“tag name”) should be a full natural-language word or phrase in English, French, or Latin. Working committees may if they wish define full sets of tag names in more than one language.

To avoid collisions, abbreviations must be set centrally; working committees may recommend abbreviations but should not rely on them.
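Why abbreviations need a single point of control can be shown with a minimal sketch of a central registry. The class, its method names, and the sample abbreviations are hypothetical; the guidelines do not prescribe any such mechanism.

```python
class AbbreviationRegistry:
    """Hypothetical central registry that rejects colliding abbreviations."""

    def __init__(self):
        self._full_names = {}

    def register(self, abbreviation, full_name):
        """Record an abbreviation; fail if it already stands for another name."""
        existing = self._full_names.get(abbreviation)
        if existing is not None and existing != full_name:
            raise ValueError(
                "%r already abbreviates %r" % (abbreviation, existing)
            )
        self._full_names[abbreviation] = full_name

    def expand(self, abbreviation):
        """Return the full tag name an abbreviation stands for."""
        return self._full_names[abbreviation]


registry = AbbreviationRegistry()
registry.register("p", "paragraph")
```

If one committee proposed `p` for "paragraph" and another proposed `p` for "page", the second registration would be refused, which is the collision that setting abbreviations centrally is meant to prevent.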

Attributes vs. Elements

Any SGML tag set including elements with attributes can be rewritten as a tag set including no attributes, only elements (one for each original element or attribute). This makes it unnecessary to decide whether a given textual feature should be expressed as a tag or an attribute, and simplifies the design process. It also simplifies the processing of the text stream. For these reasons, some SGML applications (e.g. the AAP tag set and the SGML-like processors of the Centre for the New Oxford English Dictionary at the University of Waterloo) eschew attributes entirely.

Attributes, on the other hand, express more clearly than separate element tags the association of information units in a text. An element called “chapter” with attributes of “number” and “title” clearly displays the one-to-one relationship holding among chapter, chapter number, and chapter title. A set of three separate elements, on the other hand, obscures the one-to-one relationship for the human reader. This is true even if the document type definition enforces a one-to-one relation among the three elements. The use of attributes also permits a restriction to be made on the content of a particular feature (i.e. the legal values of an attribute) which may be useful in some situations. The tradeoffs between simplicity of parsing and clarity of notation are still being explored; no decision has yet been reached.
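The rewriting described above, from a tag set with attributes to one with elements only, can be sketched in a few lines of Python. The sketch makes several assumptions: it uses XML syntax and the standard `xml.etree.ElementTree` module as a stand-in for an SGML parser, the `chapter` element with `number` and `title` attributes is taken from the example in the text, and the `p` element is added for illustration. It also assumes that no attribute shares its name with an existing child element.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(elem):
    """Return a copy of elem in which every attribute has become a child element."""
    out = ET.Element(elem.tag)
    last = None
    for name, value in elem.attrib.items():
        last = ET.SubElement(out, name)
        last.text = value
    # Text that preceded the first original child now follows the new children.
    if last is None:
        out.text = elem.text
    else:
        last.tail = elem.text
    for child in elem:
        converted = attributes_to_elements(child)
        converted.tail = child.tail
        out.append(converted)
    return out


SOURCE = '<chapter number="3" title="Design Goals"><p>Text of the chapter.</p></chapter>'
RESULT = ET.tostring(attributes_to_elements(ET.fromstring(SOURCE)), encoding="unicode")
```

In the converted form, `number` and `title` appear as siblings of `p` inside `chapter`. Nothing in the notation itself now says that a chapter has exactly one number and one title; that constraint would have to be stated in the document type definition, which is the loss of clarity discussed above.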

Character Sets and Transliteration

Standard transliterations will be documented where available. Standard character sets (including those registered under ISO registration procedures) will be documented if they are useful.

Descriptive Markup

Descriptive markup will be preferred to procedural markup. The tags should typically describe structural or other fundamental textual features, independently of their representation on the page. In some cases, however, the physical appearance of the original text carrier is the primary object of interest. In others, there may be no consensus as to the meaning of all aspects of the text's physical appearance, which must therefore be represented as explicitly as possible. In neither case however should the primary purpose of the markup be viewed simply as the reproduction of the appearance of the original.
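The practical consequence of preferring descriptive markup can be illustrated with a minimal sketch in Python. The tag names `emph` and `title`, the rendering conventions, and the sample sentence are all assumptions made for illustration. Because the tags say what a passage is rather than how it should look, the same encoded text can drive both a rendering and an analysis.

```python
import re

# Sample text with hypothetical descriptive tags.
TEXT = "She said it was <emph>not</emph> a <title>Guide</title>."

# One possible rendering: a mapping from feature to (opening, closing) marks.
# Other processes are free to render the same features differently.
PRINT_STYLE = {"emph": ("*", "*"), "title": ("_", "_")}


def render(text, style):
    """Replace each descriptive tag with the marks a given style assigns to it."""
    def mark(match):
        is_end_tag, name = match.group(1), match.group(2)
        opening, closing = style[name]
        return closing if is_end_tag else opening

    return re.sub(r"<(/?)(\w+)>", mark, text)


def list_titles(text):
    """Use the same markup for analysis: collect every passage tagged as a title."""
    return re.findall(r"<title>(.*?)</title>", text)
```

Had the text been marked up procedurally (for example, "switch to italics here"), `list_titles` could not tell titles from any other italicized passage; the descriptive encoding keeps that distinction available to every later process.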

Notes

[1] Commercial and research interests do not, in any case, always conflict. Both are best served by an intellectually adequate analysis of textual problems and their representation. Very few problems in the research area lack analogues in commercial areas, even though in research the problems may occur more often and more forcefully.

[2] Also relevant, but less difficult to accommodate, are the national and international standards for data interchange, character sets, character names, etc., and the standards governing library cataloguing, dataset description, transliterations, etc.

[3] This means that problems of ASCII-EBCDIC translation, and the limitations of the ASCII-EBCDIC translations in common use, will be specifically addressed; the interchange character set should be the set of all characters consistently and reversibly translated by all such translation programs. The recently developed 190-character extensions to ASCII and EBCDIC will also be discussed.