Date: Mon, 12 Jun 95 17:49:26 CDT
From: "C. M. Sperberg-McQueen"
Organization: ACH/ACL/ALLC Text Encoding Initiative
Subject: EDW49DR MEMO A1
To: Lou Burnard, Wendy Plotkin

The Design of the TEI Encoding Scheme

C. M. Sperberg-McQueen
Lou Burnard

This paper discusses the basic design of the encoding scheme described by the Text Encoding Initiative's Guidelines for Electronic Text Encoding and Interchange (TEI document number TEI P3, hereafter simply P3 or the Guidelines).(1) It reviews first the basic design goals of the TEI project and their development during the course of the project. Next, it outlines some basic notions relevant for the design of any markup language and uses those notions to describe the basic structure of the TEI encoding scheme. It also describes briefly the "core" tag set defined in chapter 6 of P3, and the "default text structure" defined in chapter 7 of that work. The final section of the paper attempts an evaluation of P3 in the light of its original design goals, and outlines areas in which further work is still needed.

1 DESIGN GOALS AND DEVELOPMENT PROCESS

1.1 Design Goals

At the outset of its work, the overall goals of the TEI were defined by the closing statement of the planning conference held in Poughkeepsie, N.Y., in November, 1987; these "Poughkeepsie Principles" are reproduced and discussed in Ide and Sperberg-McQueen, elsewhere in this issue. These goals were elaborated and interpreted in a series of design documents (TEI ED P1, ED P2, and ED P3), which formulated specific design goals for the work of specifying the TEI markup scheme.
The Guidelines, says document TEI ED P1, should:

* suffice to represent the textual features needed for research
* be simple, clear, and concrete
* be easy for researchers to use without special-purpose software
* allow the rigorous definition and efficient processing of texts
* provide for user-defined extensions
* conform to existing and emergent standards

We address each of these issues in more detail below.

1.2 Research Adequacy

The TEI, as an undertaking of the research community, felt itself responsible primarily to that community for creating encoding practices adequate to research needs. Since researchers also use commercial software and publish their results, the practices and needs of commercial software developers and publishers also needed to be considered, but they were not to outweigh the needs of textual research.

In fact, commercial and research interests do not necessarily conflict. Both are best served by an intellectually adequate analysis of textual problems and their representation. Moreover, very few problems in the research area lack analogues in commercial areas, though in research the problems may occur more often, or in more extreme forms, and it may be less possible to proceed without a full solution. The interest taken by many members of the commercial SGML community in the work of the TEI has shown clearly that they are well aware of its potential relevance to their work.

Research work requires above all the ability to define rigorously (i.e. precisely, unambiguously, and completely) both the textual objects being encoded and the operations to be performed upon them. Only a rigorous scheme can achieve the generality required for research, while at the same time making possible extensive automation of many text-management tasks.
The TEI scheme addresses the need for a rigorous description of textual objects, by using SGML to define a formal grammar for TEI-encoded documents; it does not however attempt to define a set of primitive operations on textual objects. Research work may also require relatively esoteric sets of tags for marking up phenomena of interest in specialized areas. This requirement has led the TEI to develop tags for a variety of research interests not served well by other extant encoding schemes. Because the TEI tag sets are intended to be usable by any researcher, it has been necessary to exercise caution in defining such specialized tag sets, to ensure that they avoid theoretical presuppositions which would make them unacceptable to researchers in the specialty who do not share those presuppositions. It is not possible to define a tag set -- at least, not a useful one -- in a purely atheoretical way, but it is possible, given a specific set of sufficiently explicit theoretical approaches to a field, to define a tag set which allows adherents of competing theories to encode the textual features they find of interest. This problem is analogous to that of generating a common database schema from a variety of views, all of which must be equally well supported. In practice, there is often difficulty in formulating such a "poly-theoretical" tag set, but the difficulty arises most frequently either because one or more of the theories to be accommodated may not be sufficiently explicit, or because the adherents of individual theories, not content with a tag set which supports their view of the problem domain, may insist also that other views be unencodable (i.e. forbidden). 1.2.1 Simplicity and Concreteness A scheme must be clear, concrete, and easy to use in order to be adopted by the research community. 
It has naturally been difficult to keep the TEI encoding scheme small, simple and easily explained in its entirety while simultaneously including in it facilities for every specialized area of research covered by P3. A strenuous effort has therefore been made to make the encoding modular, so that the parts of P3 not of interest to a given researcher might safely be ignored.

The goal of clarity and concreteness also led the TEI to specify a full SGML DTD (document type definition), rather than contenting itself with either a set of general recommendations as to how such a thing might be constructed or an abstract model or meta-DTD for such a thing. The discipline of defining a concrete DTD, and testing it on real texts, has had (we believe) a beneficial effect on the robustness and precision of the TEI scheme, as well as enabling us to present scholars with a usable product requiring no prior knowledge of DTD construction or the building of abstract models.

The pursuit of simplicity has led to two conflicting patterns in the development of the TEI tag set. First, the TEI has tried wherever possible to apply Occam's Razor -- the principle, named for the medieval philosopher William of Occam, of seeking the simplest possible solution to any problem. Occam's formulation was "non sunt multiplicanda entia praeter necessitatem", which may be loosely translated "plurality should not be assumed without necessity." As applied by the TEI, Occam's Razor leads to the merger of similar elements into single elements, possibly differentiated with a type attribute. Thus where many markup languages distinguish three or four types of list (e.g. with bullets, with numbers, or unadorned) and three or more types of note (e.g. footnotes, endnotes, and in-line block notes), the TEI defines a single <list> and a single <note> element. The systematic application of Occam's Razor has significantly reduced the number of distinct elements of the TEI tag set, compared with other markup schemes.
A countervailing pressure, however, has led to some apparent violations of Occam's principle. In several instances, a very powerful general notation which proved cumbersome in some common simple cases has been supplemented by specialized tags which handle those common cases in a simpler notation. Simplifying tags of this kind, being defined as synonymous with the more cumbersome general notation, are not strictly necessary. For example, the general tags for text-critical apparatus defined in chapter 19 of TEI P3 could certainly be used to encode all forms of editorial intervention in a text, but simple editorial emendations or normalizations can be recorded much more simply by using the simpler tags defined for such purposes in section 6.5. Occam's Razor, strictly applied, would require the removal of all such redundancy. Certainly, excessive use of such redundant constructs (or "syntactic sugar") can lead to an excessive number of elements and excessive complication of the markup language as a whole; in moderation, however, such devices can ensure that the markup language handles simple cases with a simple notation, reserving more complex notations for situations which really require them. In this and other ways, the TEI has sought to devise a markup language which scales gracefully. That is, the TEI scheme can be used both for very light markup of texts and for extremely dense markup encoding detailed analysis of texts on multiple linguistic and interpretive levels. 1.2.2 Extensibility Since research necessarily involves the asking of questions that have not been asked before, a research-oriented encoding scheme must also be extensible. Some measure of extensibility, then, was from the beginning an absolute requirement for the TEI markup scheme, rather than a design goal. 
The extent to which the various portions of the scheme might be extended, and the degree to which the various possible kinds of extensions can be easily created and integrated, do however form important design goals. TEI P3 attempts to support convenient extension and modification of the TEI DTD by modularizing the DTD into base and additional tag sets, by standardizing the interface among tag sets so as to allow new modules to be introduced easily, and by the extensive use of SGML parameter entities to allow the suppression of individual declarations, the ability to rename elements, and the addition of new elements to existing element classes, as described further below. 1.2.3 Compatibility with Standards Compatibility with "existing and emergent standards" and practice was to be sought, but (as its rank suggests) not at the expense of the other design goals. From the start, the standards perceived as most relevant to the work of the TEI were SGML and existing applications of SGML; the TEI therefore decided that its Guidelines should, if possible, use the formalisms of SGML, with the caveat that, if the needs of research required constructs unavailable in SGML, research was to take precedence over the standard. It is a tribute to the expressive power and generally good design of SGML that no extensions to SGML were in practice found to be necessary. P3 can thus require all TEI-encoded documents to be conforming SGML documents; in addition, the TEI Interchange Format adds some simple restrictions beyond those of SGML proper, intended to ensure that documents in that format can be parsed by software simpler than full SGML parsers.(2) The TEI was not committed to using the full range of constructs available with SGML; a metalanguage committee was responsible for assessing the guidelines' compatibility with commonly available software. 
That committee formulated a subset of SGML which used few enough constructs to allow users without SGML-conformant software to construct ad hoc processors for the TEI Interchange Format. 1.3 Additional Design Issues The design goals discussed so far relate for the most part to how the TEI scheme would be defined. This section discusses some important aspects of the scheme's coverage -- what it is intended to address. 1.3.1 Guidance for New Encodings -- and New Encoders The Poughkeepsie Principles mandated that the Guidelines should simplify the task of encoding texts in machine-readable form for the first time by making it unnecessary for projects creating such resources to design an encoding scheme from scratch. The Guidelines recommend a standard minimum set of textual features, derived from (but not limited by) common existing practice. The widespread use of this tag set will, we hope, contribute to an improvement in both the quality and the re-usability of newly-created encoded texts. In addition, the specialized tag sets described in chapters 14 to 23 of TEI P3 should make it easier for projects working in those specific discipline areas to exchange data and results. The Guidelines are also intended to provide guidance for researchers who are uncertain as to which textual features are likely to be of maximal usefulness in an encoding. They should reduce, not increase, the perplexity of deciding what to encode. At present, only the full reference documentation of TEI P3 is available. Since it is the task of a reference manual to describe every detail and arcane subtlety of a system, P3 is only imperfectly suited to the role of a reassuring introduction to the TEI encoding scheme. It is therefore a matter of some priority for the TEI to produce a set of introductory manuals for the Guidelines. 
In an introduction, only a portion of the TEI scheme will need to be introduced, and many details can be postponed until they are needed; only such introductions will make it possible to achieve the goal of providing suitable guidance to the perplexed novice. 1.3.2 Interchange Format for Existing Material The Poughkeepsie Principles also directed that the Guidelines must be suitable for the interchange of encodings among sites using different schemes. As a definition of an interchange format, the Guidelines will be of assistance to data archives, their borrowers, and even to software developers who can rely on this interchange format, whether as a documented interface between their software and the textual data they may import or export, or as a reference point with which their format can be compared. For interchange, it must be possible to translate from any existing scheme for text encoding into the TEI scheme without loss of information. All distinctions present in the original encoding must be preserved. Any conventions used in the original encoding but not made explicit by its encoding format should be documented within the interchange format. When the TEI scheme is used as an interchange format for pre-existing encodings, only those textual features expressed explicitly in the pre-existing encoding format can be converted into their TEI equivalents. If the original encoding lacks, for example, information about paragraph breaks in the original source, so will the TEI version, even though marking this particular feature is strongly recommended to those creating new electronic texts. As the final item in the Poughkeepsie Principles makes clear, translation into the TEI scheme should not be construed as requiring the addition of any new information not present in the original encoding. 1.3.3 Prescription and Description An SGML document type definition specifies a set of formal rules which define the set of "valid" documents. 
The formal definition of document validity is one of SGML's great practical strengths, because it allows automatic mechanical checking for markup errors. It also allows the designer of the document type to make the expected structure of documents much more explicit than would be possible without such a formal specification. As described in more detail in section 2 of this paper, such a formal specification may be regarded as a "document grammar", which accepts a certain subset of the set of all possible documents.

Grammars, however, may be used for two quite different purposes. One may use a grammar to prescribe the legal forms of some language. The formal grammar for the programming language Pascal, for example, prescribes the syntactically legal forms of Pascal programs. And old-fashioned prescriptive grammars of natural languages like English similarly attempted to prescribe certain grammatical constructions, and proscribe others. A prescriptive grammar is a recipe for the construction of documents which conform to a specification expressed by that grammar.

SGML is frequently used for the specification of document grammars which are prescriptive in just this way. A publisher of technical documentation, for example, may use an SGML DTD to prescribe a standard structure for introductory manuals, and a different structure for user manuals, and to enforce rules such as "There must never be an example without a preceding paragraph introducing it and a paragraph following it, which describes what the example means." The SGML parser cannot enforce the prescribed contents of the introductory and following paragraphs, but it can ensure that an object tagged as an example is invariably preceded by an element tagged as its introduction and invariably followed by at least one paragraph element. Memoranda can be required to have a date, a subject line, and a confidentiality status. Dictionary entries can be required to have an etymology section at the beginning. And so on.
When a document fails to conform to such a prescriptive grammar, the document typically has some flaw which requires correction. Grammars may also be used, however, to describe some set of independently existing objects. Since the objects already exist, the function of the grammar is not to specify how they may be constructed but to explain how they were constructed in the first place, or to identify the differences between the objects in the set and the objects outside the set. It is characteristic of descriptive grammars that when an object fails to conform to the grammar, the flaw is usually sought not in the object itself, but in the grammar. In descriptive linguistics, grammars are used in this way to explain, or at least to identify, differences between the sets of grammatical and ungrammatical sentences in a language. If a sentence fails to conform to the grammar, but is felt acceptable by native speakers or occurs in a corpus, it is not the sentence but the grammar which needs correction. Before the advent of the TEI, SGML had not been widely used for the specification of descriptive document grammars for existing texts. When pre-existing texts were converted to SGML, the texts involved were frequently reference works or technical documentation being converted by the publishers or corporate authors. Prescriptive grammars were used, and discrepancies between the grammars and the documents were normally resolved in favor of the grammars: i.e., the documents were changed to conform to the grammar. In cases where no such changes were to be contemplated -- as in the creation of the electronic version of the first edition of the Oxford English Dictionary -- SGML-like encodings have sometimes been used, but without the formulation of any DTD. (See further the description of the "Waterloo syntax" below in section 2.) The TEI thus explored new territory in attempting to formulate a descriptive grammar for existing documents using the formalisms of SGML. 
In the course of this exploration, it faced a critical problem which often arises in the development of descriptive grammars. When the population being described by the grammar is sufficiently various and complex, it is often the case that no grammar can readily be written which exactly matches the population. Either the grammar overgenerates for the population, that is, it accepts some items which are not actually present in the target population, or it undergenerates, that is, it rejects as invalid some items which actually occur. In a grammar of English, overgeneration would mean accepting as grammatical some sentences which no native speaker accepts as English, or which never occur in practice; undergeneration would mean rejecting as ungrammatical some sentences which actually occur in practice and are accepted by native speakers. In a document grammar, overgeneration means allowing as a valid document some sequence of text which never occurs and can never occur in the set of texts studied by researchers; undergeneration means rejecting as invalid some actual existing document. For the TEI, the target population includes, in principle, any document written in any language during the entire span of written history -- a population certainly various and complex enough to defy any attempt at a rigidly prescriptive grammar. In this situation, we chose consciously to err on the side of overgeneration, rather than undergeneration, whenever the choice presented itself. An overgenerating document grammar has the drawback that it accepts nonsensical documents as valid ones; it will thus fail to catch some errors of markup which a less generous grammar would detect mechanically. An undergenerating grammar, on the other hand, will detect many markup errors which would slip past an overgenerating grammar; unfortunately, it will also detect as markup errors some constructs which are not markup errors at all but merely unusual document structures not foreseen in the grammar. 
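The trade-off between overgeneration and undergeneration can be made concrete with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the five-document "population", the reduction of each document to a sequence of top-level element names, and the two regular expressions standing in for a tight and a loose document grammar. It is not drawn from the TEI DTDs.

```python
import re

# Hypothetical population: each document is reduced to the sequence of its
# top-level elements. The last one is unusual but is assumed to really occur.
POPULATION = [
    "front body back",
    "body",
    "front body",
    "body back",
    "body front",
]

# Tight grammar: optional front matter, one body, optional back matter.
TIGHT = re.compile(r"(front )?body( back)?")

# Loose grammar: any non-empty sequence of the three elements.
LOOSE = re.compile(r"(front|body|back)( (front|body|back))*")


def rejected(grammar, documents):
    """Return the documents the grammar refuses to accept."""
    return [doc for doc in documents if grammar.fullmatch(doc) is None]


# Undergeneration: the tight grammar rejects a document that really exists.
print(rejected(TIGHT, POPULATION))

# Overgeneration: the loose grammar accepts a sequence no real text exhibits.
print(LOOSE.fullmatch("back back back") is not None)
```

In this toy setting the tight grammar flags the unusual document as an error, while the loose grammar accepts the whole population at the price of also accepting nonsense.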
An undergenerating descriptive grammar thus behaves like a prescriptive grammar which has strayed into the wrong arena.

When an SGML document grammar overgenerates, researchers who would prefer that the SGML parser perform a tighter validation of their documents will be inconvenienced: to achieve the tighter validation, they will have to restrict the document grammar to reduce or eliminate the overgeneration. When it undergenerates, the inconvenience will befall any researcher who is working on a document with structures other than those foreseen in the grammar: to allow the document to be parsed, they will have to loosen the document grammar to allow their document's eccentricities to be accepted as legitimate. Since the skills needed for modifying the document grammar seem more likely to be found among researchers who want to exploit SGML's document validation powers to the full than among researchers who happen to be working with eccentric document structures, it is clearly preferable for the TEI to err by overgenerating, rather than by undergenerating.

In order to minimize the baleful effects of such overgeneration, the TEI tag sets sometimes define two alternative forms of markup: a somewhat prescriptive form for use when validation is highly desired, and a very loose alternative form for use in transcribing items which simply do not fit the more prescriptive form. The elements <bibl> (loose) and <biblStruct> (prescriptive) for bibliographic citations, and <entryFree> (loose) and <entry> (prescriptive) for dictionary entries, exhibit this approach.

1.3.4 Descriptive Markup

In the TEI encoding scheme, descriptive markup has in general been preferred to presentational markup. TEI tags typically describe structural or other fundamental textual features, independently of their representation on the page. In some cases, however (e.g. for codicologists, paleographers, or analytical bibliographers), the physical appearance of the original text carrier can be the primary object of interest.
In others, there may be no consensus as to the meaning of all aspects of the text's physical appearance; it must therefore be possible to represent them as explicitly as possible, without being forced to speculate as to their meaning. For use in such cases, the TEI defines elements (e.g. <hi>, for highlighted phrases of any type) which simply record some salient fact about the appearance of the source text, without requiring any overt interpretation on the part of the encoder. A great deal of work remains to be done, however, to ensure that students of a text's transmission or physical presentation can conveniently record the relevant information in a precise, readily processable form.

1.4 Development Process

The work of developing an encoding scheme consistent with these goals devolved initially upon a set of four working committees. Committee TD had responsibility for issues of text documentation and produced the TEI header described in chapter 5 of P3. Committee TR, responsible for text representation, produced the bulk of the material now in chapters 4, 6, and 7. Committee AI addressed issues of text analysis and interpretation, producing the work now described in chapters 12, 15, and 16. Committee ML, mentioned earlier, studied issues of metalanguage and formal syntax; it was responsible for the usage of SGML in the TEI scheme and provided formal definitions of the TEI Interchange Format and of TEI conformance. The committees ranged in size from seven to fourteen members, with approximately equal representation from Europe and North America.

Following publication of the first draft of the Guidelines (TEI P1) in July, 1990, the extension and revision of the work of committees TR and AI was entrusted to about twenty small, specialized work groups, each reporting either to TR or to AI; committees TD and ML, which had somewhat more tightly circumscribed areas of responsibility, continued their work without formal subcommittees.
These specialized work groups, typically numbering three or four members, were charged with extending or revising the recommendations of TEI P1 in some specific field in which the members of the group had expertise. Areas addressed by the work groups, many of which resulted in drafts for the appropriate chapter of P3, were wide-ranging, including character set issues, text criticism, hypertext, mathematical formulae, language corpora, manuscripts, verse, drama and literary prose, spoken texts, literary and historical studies, print dictionaries, computational lexica and terminological databases. 2 FEATURES, IDENTIFIERS, AND TAGS 2.1 Feature Sets, Identifier Sets, and Tag Sets A marked-up document is one in which specific textual features are identified by tags or other mechanisms. More abstractly, markup in electronic text serves simply to predicate the existence of some quality or feature in some specific passage of the text being marked up. In simple presentational markup, the feature involved may be simply that the passage is intended to be printed in italics; in the analytic or descriptive markup characteristic of SGML-based encoding schemes, the features made explicit by markup tend to be slightly more abstract: the passage is a quotation or an emphatic phrase. We distinguish sharply between the textual feature marked and the string of characters or other device used to mark it in the electronic text. That string of characters is conventionally referred to as a tag. Formally, we regard a tag as a specific piece of markup in a text, which signals the occurrence or location of a given textual feature. We will use the term identifier to refer to the string of characters itself, without reference to any feature associated with it.(3) The TEI tag
, the Formex tag , and the ISO "starter set" tag , for example, are used to signal occurrences of the same textual feature, namely that at the given position in the document a graphic of some sort is to be located.(4) A page formatter might leave space on the page for the artwork; a galley formatter might print a marginal note calling attention to the expected artwork; a text database might embed a special symbol indicating that the original document had material not included in the database. Note that a tag must refer to a feature, but a feature need not be tagged. The same feature can be denoted by many identifiers in different encoding schemes. Thus in the TEI tag set, the element may be used to identify occurrences of the feature chapter. In other tag sets the same feature might be tagged with , , , or . We use the term feature precisely in order to stress what would be common among all these tags and ignore what varies (the name or identifier associated with the feature: in SGML terms, its generic identifier or gi). Users of the TEI Guidelines may for example wish to abbreviate identifiers or translate them into another language; the TEI provides mechanisms for such substitutions; with them, the generic identifiers may be changed without altering either the definition of the features themselves or the syntax which governs their occurrences. The association of a given identifier with a given feature in a given encoding system is an SGML element.(5) Instances of a particular element type are indicated by SGML start- and end-tags in the document, and so in informal discussions elements are often referred to metonymically as "tags". 
In TEI documentation, the term element is normally used to refer either to the abstract type denoted by a tag, or to a particular instance of that type, while tag is used to refer to a specific start-tag or end-tag in the document which delimits a given element instance.(6)

By definition each element associates a single (though not necessarily an atomic) textual feature with a specific identifier. As just noted, different encoding schemes may readily mark the same feature using different identifiers (the TEI, for example, marks an International Standard Book Number with the tag <idno type=isbn>, while ISO 12083 uses the tag <isbn> for the same feature). They may also use the same identifier for quite different features; the TEI and the Formex tag set both use the identifier body, the former for the body of the text as distinct from the front or back matter, the latter for the name of a corporate body or entity. Different encoding schemes may resemble each other in the sets of features they mark, while differing completely in the sets of identifiers they use, and vice versa. Tag sets which mark the same set of features, moreover, may differ in the document grammar they define for those features, and thus in the sets of documents they accept.

We can distinguish, for any given encoding scheme,

* the set of features it marks
* the set of identifiers it uses
* the mapping from identifiers to features
* the grammar it imposes on documents

Encoding schemes which mark the same set of features may be said to use the same feature set; those which use the same set of identifiers have the same identifier set or name set. Those which additionally impose the same formal restrictions on the allowable combination of features or identifiers may be said to share the same feature grammar or identifier grammar, respectively.
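The distinction between feature sets, identifier sets, and the mapping between them can be sketched in a few lines. The sketch below is illustrative only: each scheme is reduced to a hypothetical dictionary from identifier to feature, holding just the ISBN and "body" examples mentioned above, and the feature labels and function names are invented for the purpose.

```python
# Hypothetical, drastically simplified schemes: identifier -> feature marked.
# Only the examples mentioned in the text are included.
TEI = {"idno type=isbn": "ISBN", "body": "body of the text"}
ISO_12083 = {"isbn": "ISBN"}
FORMEX = {"body": "name of a corporate body"}


def shared_features(a, b):
    """Features marked by both schemes, whatever identifiers they use."""
    return set(a.values()) & set(b.values())


def shared_identifiers(a, b):
    """Identifiers used by both schemes, whatever features they denote."""
    return set(a) & set(b)


def conflicting_identifiers(a, b):
    """Identifiers that both schemes use, but for different features."""
    return {name for name in shared_identifiers(a, b) if a[name] != b[name]}


def translation_table(source, target):
    """Map source identifiers to target identifiers for the same feature."""
    by_feature = {feature: name for name, feature in target.items()}
    return {
        name: by_feature[feature]
        for name, feature in source.items()
        if feature in by_feature
    }


print(shared_features(TEI, ISO_12083))
print(conflicting_identifiers(TEI, FORMEX))
print(translation_table(TEI, ISO_12083))
```

Under these assumptions, translating between schemes is a matter of looking up features rather than matching identifiers, which is why agreement on the feature set matters more than agreement on names.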
When two tag sets are identical in all four characteristics above (feature set, identifier set, document grammar, and mapping from identifiers to features), they have an identical element grammar, which defines the full set of rules and conventions governing the use of a particular document type, referred to in SGML as a document type definition or DTD. Since SGML provides no formalisms for the declaration of semantics, the formally expressible part of the DTD (i.e. the set of all applicable element, attribute-list, and related declarations) captures only the identifier grammar of an SGML-based encoding scheme, the document type declaration in SGML parlance. The full scheme requires natural-language documentation, of the sort provided in the TEI Guidelines, to qualify as a document type definition. Comparing the identifiers used by two encoding schemes is of course simpler than comparing the feature sets they mark, since textual features are typically identified only by natural-language descriptions which are difficult to compare mechanically. Natural-language descriptions are notorious for vagueness and ambiguity; their only advantage appears to be that no other method of defining textual features appears to exist, or is likely to exist soon. If despite the difficulties of documenting their meanings, the feature sets of two encoding schemes can be identified, then translation from one scheme to the other may be relatively simple. If the feature sets and feature grammars are identical, or if the source's feature grammar is a subset of the target's feature grammar, then translation from one of the associated tag sets to the other becomes very simple. More formally, we define feature: a characteristic of some segment of text or of some location in a text, considered independently of any identifier used to denote it. 
In SGML terms, a segment of text possessing the feature may be marked up as an SGML element, though the presence of the feature may also be indicated by other SGML constructs, as described below in the section on "Methods of Predication." element: a component of an encoded text corresponding with an occurrence of a specific textual feature. An element has both a tag and a textual feature associated with it. tag: the use of a specific identifier to signal the boundaries of an element. identifier: the string of characters used in a tag to identify an element, considered as a string of characters and independent both of any particular instance of the tag used to mark the feature and of the feature itself. The SGML equivalent term is generic identifier. We can also identify sets (that is, unordered collections) of features, elements, and identifiers. To define the allowable relationships among elements and features, document markup languages define syntaxes for the elements of the language. The formal syntax is necessarily defined in terms of the identifiers used by the language; the semantics of the elements (their links to specific features) are not usually specified formally. If the distinction between element and identifier is rigorously maintained, then, formal syntaxes like those expressed by the element declarations of an SGML DTD must be viewed strictly speaking as governing only identifiers, not elements. If we wish to speak of rules governing elements rather than identifiers, we must include both the formal declarations and the semantic specifications of the markup scheme: in SGML terms, we must talk of the document type definition, rather than its declaration. This distinction is preserved in the following discussion. 
To express the ways in which sets of features or elements may be combined, we may define the following: feature syntax: a set of rules specifying the allowable combinations of feature occurrences within a given type of text, considered independently of the identifiers used to identify them tag syntax, element syntax: a set of rules defining the allowable sequences of tags and textual data in a markup language DTD (document type definition): a tag syntax defined according to the rules of SGML identifier syntax: a syntax governing the allowable combinations of identifiers in a markup language (without regard to their association with particular features) From any document type definition, one may derive a tag set, which is precisely all and only the tags defined in the DTD.(7) The element syntax of a DTD is expressed by the content models of its constituent elements. If we modify a DTD by changing the syntax rules expressed in its content models, the tag set (and the associated set of identifiers) and feature set remain unchanged. We may say that the tag set is what remains constant when the syntax is changed without introducing new tags or deleting old ones. We may change a DTD by translating all the identifiers into another language (as is done with the labels for the ISO starter set in ISO TR 9573, for example). The legal relationships among features remain the same, the features described remain the same; the DTD changes only because the identifiers used to define the syntax change. The term feature syntax is coined to denote what does not change if one translates all the generic identifiers into another vocabulary. 2.2 Important Types of Document Grammar While for any given DTD or element syntax there exists a unique corresponding tag set, the converse is not true: for any given tag set, one may formulate a very large number of possible element syntaxes. 
Several types of such grammar can be readily identified, which are useful for discussion purposes even if they are not all often used in practice. The first of these types of document grammar is of particular importance, because it comes close to being a null grammar, imposing no requirements on the document beyond those of SGML itself. This is the grammar which, for any given set of elements, requires merely that all elements nest and allows any element to nest within any other. We call this the Waterloo grammar (or, in SGML, the Waterloo DTD), because its most visible exponents in recent years have been located at the University of Waterloo's Centre for the New Oxford English Dictionary, which designed a tag set with this type of document grammar for the encoding of the Oxford English Dictionary.(8) In SGML terms, a pure Waterloo DTD is one in which each element has an element declaration of the form

    <!ELEMENT blort - - ANY>

For details of the element-declaration syntax, see Burnard, elsewhere in this issue. This example declares an element with a generic identifier of blort. The two minus signs indicate that neither the start-tag nor the end-tag may be omitted (even if the omitted markup could be unambiguously inferred). The keyword ANY signifies that a blort can contain a mix of character data and any element declared in the DTD, in any sequence.(9) The obvious advantage of the Waterloo grammar is that it radically simplifies the task of data capture and translation from other tag sets, since all that is necessary is the identification and marking of textual features, without concern for grammatical restrictions on their repetition or nesting. The obvious disadvantage is that, as a grammar, it vastly overgenerates for its population. The Waterloo DTD for the Oxford English Dictionary will accept dictionary entries in any of the many forms they take in the actual text, without complaining that some are ill-formed. This is a distinct advantage.
But this DTD will also accept entries which are not merely eccentric, but purely nonsensical -- in which citations appear within sense numbers or sense numbers within data elements, for example. (In the OED project, such nonsensical tagging was detected and controlled, but by means of specially written checking routines rather than through the grammatical formalism of a DTD.) A second type of grammar provides a useful contrast to the extreme permissiveness of the Waterloo grammar by prescribing a rigid hierarchical relationship among the SGML elements it defines. As an example, consider the definition of the TEI header: This declaration specifies that a <teiHeader> element must invariably begin with a <fileDesc> element, which may be followed by <encodingDesc>, <profileDesc>, and <revisionDesc> elements. Each of these three is optional, but if they appear they must do so in sequence. This prescriptive form of content model is often found in the definition of elements with clear internal structure; in TEI working papers these are often referred to as crystals. Crystals often represent textual features which are themselves compounds of other, usually simpler, features: lists, personal names, addresses, and bibliographic citations may all be defined as crystals in a sufficiently prescriptive tag set. In SGML terms, a crystal is an element defined as having element content rather than data content or mixed content: that is, an element whose children are invariably other elements, rather than strings of characters, possibly with other elements mixed in. In the TEI scheme, "pure" crystals are found almost exclusively in the TEI header, since within the body of the text, it is not feasible to require that encoders always capture the internal structure of items like personal names, addresses, or bibliographic citations.
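The contrast between a Waterloo grammar and a crystal can be made concrete with a small sketch. The Python below is an illustration only, not part of the TEI scheme: the helper names are hypothetical, the model notation handles only comma-separated sequences with optional members, and the header element names are used as we understand TEI P3 to name them.

```python
import re

def compile_model(model):
    """Turn a tiny sequence model such as 'a, b?, c?' into a regex over child names."""
    parts = []
    for token in (t.strip() for t in model.split(",")):
        optional = token.endswith("?")
        name = token.rstrip("?")
        piece = "(?:" + re.escape(name) + " )"
        parts.append(piece + ("?" if optional else ""))
    return re.compile("".join(parts))

def valid_crystal(children, model):
    """True if the ordered list of child element names matches the sequence model."""
    text = "".join(child + " " for child in children)
    return compile_model(model).fullmatch(text) is not None

def valid_waterloo(children, declared):
    """A Waterloo (ANY) model accepts any sequence of declared elements."""
    return all(child in declared for child in children)

# Assumed names for the header and its four divisions (illustrative).
HEADER_MODEL = "fileDesc, encodingDesc?, profileDesc?, revisionDesc?"
DECLARED = {"fileDesc", "encodingDesc", "profileDesc", "revisionDesc"}

print(valid_crystal(["fileDesc", "profileDesc"], HEADER_MODEL))   # in sequence
print(valid_crystal(["profileDesc", "fileDesc"], HEADER_MODEL))   # out of sequence
print(valid_waterloo(["profileDesc", "fileDesc"], DECLARED))      # ANY does not care
```

The prescriptive model rejects the out-of-sequence header that the Waterloo model accepts, which is exactly the overgeneration discussed above.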
The third type of grammar to be identified here combines the flexibility of the Waterloo grammar with the notion of crystals, to produce a document grammar which overgenerates slightly less than would an equivalent Waterloo grammar. This type, known to the authors as a Belgian grammar in honor of Jean-Pierre Gaspart, who introduced us to it, uses the SGML mechanism of inclusion exceptions to control the appearance of crystals and their parts. In a "pure" Belgian grammar,
* crystals will be defined using some form of prescriptive grammar
* overall text structure will typically be defined using a prescriptive grammar
* elements which may directly contain text will have a content model of "(#PCDATA)"
* top-level text-containing elements such as paragraphs will also have an inclusion exception which allows all crystals, but not their constituent parts, to appear anywhere within the text-containing element

The following simple DTD, for example, allows personal names and lists to appear anywhere in any paragraph, but allows list items and the parts of names to appear only within appropriate contexts:

The declaration for the <p> element above demonstrates the power of a Belgian syntax. Its use of inclusion exceptions ensures that names and lists can appear within any element within a paragraph, including within names and lists; items and parts of names, however, can appear only within the relevant enclosing crystal (i.e. within lists or names, respectively). As this example shows, Belgian constructs can be restricted to some elements, allowing the rest of the document grammar to be defined using other constructs.

The final type of grammar which must be described uses declarations of the form:

This construct, which we will call a starred alternation, can be used to mimic the effects of Waterloo and Belgian grammars, but also allows slightly easier control over nesting.(10) By avoiding the use of inclusion exceptions, it makes clearer precisely what sub-elements are legal within elements of a given type. Its primary disadvantage is that it requires each phrase-level element to name explicitly all the other phrase-level elements which may appear within it, thus exploding the size of the DTD. Compare, for example, the Belgian DTD above with this equivalent DTD in starred-alternation form:

The growth in the grammar, small but perceptible here, may prove a practical hindrance when, as in the main TEI DTD, the number of phrase-level elements, which can occur within paragraphs or within other phrase-level elements, rises above fifty. Despite this fact, the starred alternation is probably more common in existing standard DTDs than the constructs of either the Belgian syntax or the Waterloo syntax. One advantage of the starred alternation version of this document grammar over the Belgian version given above is that the content model for each element describes its legal contents in full. In the Belgian syntax, the legal contents of item are determined not only by the content model for that element, but by the inclusions on the p element given elsewhere.
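The context dependence introduced by inclusion exceptions can be sketched in a few lines of Python. This is a toy model under stated assumptions: the tables are hypothetical stand-ins for a Belgian DTD of the kind described, the names forename and surname are invented for the parts of a name, and character data is ignored.

```python
# Content models: the elements each element admits directly (besides character data).
CONTENT = {
    "p": set(),
    "list": {"item"},
    "item": set(),
    "name": {"forename", "surname"},   # hypothetical parts of a name
    "forename": set(),
    "surname": set(),
}

# Inclusion exceptions: elements allowed anywhere inside the declaring element.
INCLUSIONS = {"p": {"name", "list"}}

def allowed_children(open_elements):
    """Children permitted in the innermost open element, given all its open ancestors."""
    current = open_elements[-1]
    allowed = set(CONTENT[current])
    for element in open_elements:
        # An inclusion applies to the declaring element and at any depth below it.
        allowed |= INCLUSIONS.get(element, set())
    return allowed

print(sorted(allowed_children(["p", "list", "item"])))   # item inside a paragraph
print(sorted(allowed_children(["list", "item"])))        # item outside any paragraph
```

The same item element admits names and lists in the first case and nothing in the second, although its own content model never changes. A starred alternation would instead list those children in the content model of item itself.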
If lists can occur both within and outside of paragraph elements, then the legal contents of items may differ, depending on location, in ways which are not immediately obvious, even from a careful reading of the DTD. The starred alternation makes it possible to eschew inclusion and exclusion exceptions in a DTD, thus ensuring that the legal contents of an element do not vary depending on its context; this is important for the validation of SGML fragments in isolation, and for the convenient decomposition of documents into fragments suitable for retrieval and maintenance using standard database techniques. 2.3 Methods of Predication If we view markup languages merely as means of predicating the existence of particular qualities or features in particular passages or locations of a text -- ignoring for the moment the other functions of a markup language, such as ensuring system independence of data or securing useful mechanisms for document management -- then it is useful to ask exactly how we may legitimately infer, from a marked up text, the existence of a given feature at a given location. We assume in this discussion that the feature in question is marked explicitly, ignoring the problems of inferring the occurrence of features not marked from those which have been marked. Such inference is possible when the implication relations of different features are made explicit, as may be done in formal feature-system declarations (FSDs), but it is quite difficult for the most part, and has not as far as we know been undertaken for any markup language intended for general use. In a TEI-encoded text, the predication of a given feature at a given location may be accomplished by any combination of several mechanisms. 
For any location, the features predicated by the markup as applicable to that location are signaled by: * the generic identifiers of open elements * attribute value specifications on open elements * the structural position of the location in the document (a TEI , for example, marks a chapter title if and only if it is the first child of a text division which is itself marked as a chapter) * pointers to some open element from a element, an encoded feature structure, note, , or element * the position of the location between a location pointed to as the start, and a location pointed to as the end, of a passage pointed at by an encoded feature structure, note, or element * an ana attribute on an open element pointing to a feature structure or interpretation element * the position of the location between two milestone elements (e.g. , , , etc.) Note that the pointers in each case may be indirect, by means of one or more links in a chain of extended pointers. These methods for predicating features in a text are particularly important for search and retrieval applications, which often involve finding instances of particular features in a document; they are relevant for other software as well, in cases where correct processing depends on what features are present in a given passage. The implementation of software sensitive to generic identifiers, attribute values, and structural location in the document is relatively simple and thus quite common. At the time TEI P3 was published, however, software capable of treating virtual elements like , or of following predications across links like those given by the element or by the inst and ana attributes, remained relatively rare. These markup techniques are nevertheless widely used in the TEI Guidelines, primarily because they make it possible, and indeed convenient, to represent multiple possibly conflicting interpretations of the same passage, without requiring that one be privileged vis a vis the others. 
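The simpler methods of predication (open generic identifiers, attribute values on open elements, and position after a milestone) can be sketched as follows. This is a minimal illustration, assuming a hypothetical event-stream representation of a parsed document; the element names in the sample are illustrative, and pointer-based predication is deliberately left out.

```python
def features_at(events, target):
    """Return what the markup predicates of the text event at index `target`.

    `events` is a list of tuples:
      ("start", gi, attrs)      start-tag with generic identifier and attribute dict
      ("end", gi)               end-tag
      ("milestone", gi, attrs)  empty element marking a boundary
      ("text", string)          character data
    """
    open_elements = []   # stack of (gi, attrs) for currently open elements
    milestones = {}      # most recent milestone seen for each generic identifier
    for index, event in enumerate(events):
        kind = event[0]
        if kind == "start":
            open_elements.append((event[1], event[2]))
        elif kind == "end":
            open_elements.pop()
        elif kind == "milestone":
            milestones[event[1]] = event[2]
        elif kind == "text" and index == target:
            return {
                "identifiers": [gi for gi, _ in open_elements],
                "attributes": [(gi, key, value)
                               for gi, attrs in open_elements
                               for key, value in attrs.items()],
                "milestones": dict(milestones),
            }
    raise ValueError("target is not a text event")

EVENTS = [
    ("start", "div", {"type": "chapter"}),
    ("milestone", "pb", {"n": "12"}),
    ("start", "p", {}),
    ("text", "Call me "),
    ("start", "name", {"type": "person"}),
    ("text", "Ishmael"),
    ("end", "name"),
    ("end", "p"),
    ("end", "div"),
]

print(features_at(EVENTS, 5))
```

Software of this kind needs only a stack and a table, which is one reason such processing is common; following pointers and reconstructing virtual elements requires access to the whole document rather than a single pass.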
3 THE TEI DOCUMENT TYPE DECLARATION Other contributions to the present volume describe in more detail different aspects of the various parts of the TEI scheme. Here, we provide a general overview of the DTD itself, and a very brief summary of the textual features and default text structures defined by its core components. 3.1 Core, Base, Additional, and Auxiliary Tag Sets TEI P3 describes several distinct document type definitions: a single main DTD for the transcription of texts, and several auxiliary DTDs for the encoding of meta-information relevant to the transcription of one or more texts. The main DTD is described further below; the auxiliary DTDs include: * the independent header, which documents the identity of a particular electronic text and its source, as further described below. * the writing system declaration, which documents the alphabet or writing system used by a given text, and the character sets, transliteration schemes, and SGML entity sets used to record it in machine-readable form * the feature system declaration, which defines the set of features nameable within a given feature system, the domains over which their values may range, their default values, and co-occurrence constraints upon sets of feature values * the tag set documentation, which defines tags for systematic documentation of SGML tag sets Although formally it defines only a single document type, the TEI main DTD may vary dramatically in different invocations, to reflect the varying needs of the document and encoder. By specifying a particular view of the DTD, the user may tailor it to the needs of a particular application in a way difficult or impossible with most other general purpose DTDs so far developed. The user of the TEI scheme is offered the opportunity of building a DTD which matches his or her requirements, but constrained to do so in a way that facilitates interchange. 
The main TEI DTD thus resembles a complex database schema from which different views may be derived for different applications. The main DTD is defined as a set of tag sets, which may be used in a variety of combinations. Each view of the main DTD selects one or more of these tag sets, thus making the tags in those sets available for use in the encoding; different views of the main DTD differ because they select different tag sets and thus make different elements available. Three types of tag sets are distinguished: * the core and header tag sets; these include elements which must be included in any view of the main DTD * base tag sets; these define textual components specific to particular text types * additional tag sets; these define elements of interest for particular types of analysis or processing, which may conceivably be tagged in texts of any type This classification of tag sets reflects what is jocularly known as the Chicago pizza model of DTD construction. In Chicago, as elsewhere in the U.S. (though not in Italy), all pizzas have some ingredients in common (cheese and tomato sauce); the consumer may specify, however, a choice of one style of crust (thin crust, deep-dish or pan, or stuffed) and an arbitrary selection of toppings to be strewn on the top of the pizza. In the same way, the user of the TEI scheme constructs a view of the TEI DTD by combining the core tag sets (which are always present), exactly one "base" tag set, and any selection of "additional" tag sets, which play the part of the toppings. The core tag set is described in more detail further below; it includes elements which are specific neither to particular genres or types of text, nor to particular types of application or research. The TEI header allows the provision of documentary or bibliographic information about electronic texts. Such information is essential for any satisfactory interchange of texts coming from multiple sources, or for which long term uses are envisaged. 
As with software, leaving the documentation of an electronic text to the last moment is a recipe for disaster all too commonly followed. The TEI header is one of the few mandatory elements in a TEI document. It has four major divisions which together provide a detailed framework for the documentation of: * the electronic document itself and the sources from which it was derived * the encoding system which has been applied * non-bibliographic aspects of the document (e.g. the demographic characteristics of its author and readers, its subject matter, or its classification in some scheme of text types) * the revision history of the electronic document The TEI header may vary widely in its size and complexity. At one extreme, an encoder may provide nothing more than a bibliographic identification of the text. At the other, encoders wishing to ensure that their texts can be used for the widest range of applications, will want to provide the kind of detailed documentation most often found in a detailed user's manual. Most headers will lie somewhere between these extremes; textual corpora in particular will tend more to the latter extreme. A collection of TEI headers can also be regarded as a distinct document, and an auxiliary DTD is provided to support interchange of headers alone, for example between libraries or archives, as noted above. The greater part of the TEI main DTD is taken up, however, not by the core and header tag sets, but by the base and additional tag sets which may be selected by the user in various combinations. To construct a view of the TEI DTD, the user must always choose one of eight base tag sets. Six of these are intended for documents which are predominantly composed of one type of text; the other two are provided for use with texts which combine these basic tag sets. * prose * verse * drama * transcriptions of spoken material * printed dictionaries * terminological data * a "general" base for anthologies, etc. 
* a "mixed" base for anarchic mixtures of text types In addition to one base tag set, the user of the TEI main DTD may select any combination of the following additional tag sets; these typically define elements relevant for particular types of processing, or for particular types of research, or text components which may occur in many different types of text, but are not so widespread as to belong in the core tag set. The additional tag sets define SGML elements for: * hypertext linking, text segmentation, and alignment of multiple texts or multiple passages * simple analysis and interpretation of the text * markup of feature structures * recording the certainty or uncertainty of markup and responsibility for markup * transcription of primary sources (principally manuscripts) * text-critical apparatus * detailed analysis of names and dates * graphs, networks, and trees * tables, formulae, and graphics * demographic information about authors or speakers, and text-classification elements, of types particularly useful in language corpora Aspects of most of the above tag sets are described in more detail elsewhere in this issue. 3.2 Selection of Tag Sets The pizza model is implemented by means of parameter entities in the TEI DTD, as further discussed below. To illustrate the basic mechanism we present here the outline of a TEI-conformant document in which the base tag set for prose has been selected together with the additional tag set for linking: ]> As may be seen, the user selects a particular tag set by declaring a parameter entity with a particular name as having the value "INCLUDE". Because tag sets are selected explicitly by means of declarations within the DTD subset, as shown above, any recipient of a TEI document can tell which TEI tag sets are required to process it. Any deviations or modifications of the TEI definitions (for example, the renaming of elements, or the addition of new ones) are made visible in a similarly declarative manner. 
In the following example, all modifications to the TEI scheme are contained in the files project.ent and project.dtd: ]>

3.3 Implementation of the Pizza Model

In implementing the pizza model so as to work with any conforming SGML parser, several problems must be solved:
* preventing logical or syntactic conflict among tag sets
* embedding the correct files, i.e. only those which define the tag sets selected by the user
* ensuring that the elements defined in each tag set appear in the appropriate content models in the effective DTD
* allowing for user-defined extensions to the DTD

The first problem has a relatively simple solution. To avoid name clashes among elements, and to ensure that tag sets may be used in any combination, we adopt the simple rule that elements may in general appear in only one tag set. Special handling must be provided for the tags of the default text structure, which are used by all bases.

The second problem has a slightly more complex, but still rather straightforward solution. The individual tag sets are defined and referred to by the main driver file of the TEI DTD, tei2.dtd, in such a way that they will be embedded in the DTD if and only if the user has selected them. Simplified slightly, the general pattern followed is as follows: %TEI.prose.dtd; ]]> %TEI.verse.dtd; ]]>

As can be seen, the entity containing each tag set's declarations is declared and referred to within a conditional marked section controlled by a parameter entity with a default value of IGNORE, which the user can override to INCLUDE.
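The effect of conditional marked sections can be imitated with a short Python sketch. It is an illustration under simplifying assumptions, not an SGML parser: the driver text is a reduced stand-in for the real driver file, the placeholder strings stand for whole tag-set files, and nested marked sections are not handled.

```python
import re

# Default keyword for each controlling parameter entity, as in the driver file.
DEFAULTS = {"TEI.prose": "IGNORE", "TEI.verse": "IGNORE"}

# Reduced stand-in for the driver: one conditional marked section per tag set.
DRIVER = """\
<![ %TEI.prose; [ prose declarations ]]>
<![ %TEI.verse; [ verse declarations ]]>
"""

SECTION = re.compile(r"<!\[\s*%([\w.]+);\s*\[(.*?)\]\]>", re.S)

def resolve(driver, user_declarations):
    """Expand the controlling entities, keep INCLUDE sections, drop IGNORE ones."""
    entities = dict(DEFAULTS)
    entities.update(user_declarations)   # models the user's declaration taking priority

    def replace(match):
        keyword = entities[match.group(1)]
        return match.group(2).strip() if keyword == "INCLUDE" else ""

    kept = SECTION.sub(replace, driver)
    return [line for line in kept.splitlines() if line.strip()]

print(resolve(DRIVER, {"TEI.prose": "INCLUDE"}))
print(resolve(DRIVER, {}))
```

With the prose entity overridden to INCLUDE only the prose declarations survive; with no overrides nothing is embedded at all.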
If the user has selected the tag set for prose, for example, then upon interpretation of the parameter entities TEI.prose and TEI.verse, the fragment above will resolve to the following: %TEI.prose.dtd; ]]> %TEI.verse.dtd; ]]> and then, upon parsing of the marked sections, the declaration of the verse tag set will disappear entirely: %TEI.prose.dtd; The third problem requires even more extensive use of parameter entities and conditional marked sections, as well as a set of shared conventions in the definition of content models. Each TEI base tag set determines the basic structure of all the documents with which it is to be used. More exactly, it defines the constituents of elements. In practice, so far, though means exist for them to vary, all the TEI bases defined use the same "default" text structure, described in detail below, which allows a text to contain nested text divisions (chapters, sections, etc.). The bases do however differ greatly in the components which may appear within those divisions: the sections of a dictionary (for example) will contain dictionary entries, while the scenes of a play will contain speeches and stage directions, and the subdivisions of a speech transcript will contain utterances and descriptions of non-verbal activity. The sets of components, moreover, may overlap: prose paragraphs may occur within conventional prose texts, and in (the front matter of) dictionaries, and in plays. To cater for this variety, the constituents of all divisions of a TEI element are defined indirectly in terms of a parameter entity named component, which is in turn defined differently in each base tag set. 
Once more, marked sections are used to control which declaration is effective, as shown in this simplified example: ]]> ]]> ]]> Within the body of the DTD, elements are defined using these parameter entities, for example: When a base tag set is selected, one or the other of the optional entity declarations will be "activated" by the user's selection of a base DTD. The components of a text are thus invariably defined by entities whose values vary with the particular base in use. All textual divisions are defined with the same content model, which includes a reference to the parameter entity component.seq; the value of this parameter entity will however be different in different bases. In this way it is possible for the divisions of a text using the drama base (for example) to consist of speeches and stage directions, while those of a text using the dictionary base will consist of lexical entries. In the distributed DTD, the definitions of component and other parameter entities for use in content models are found in a separate file, teiclas2.ent, which also defines the structure of the TEI element-class system described below in more detail. The final problem of implementing the pizza model, ensuring easy introduction of user-specified extensions to the DTD, is accomplished once more by use of parameter entities. Because most user modifications will be effected by overriding the default declarations of parameter entities, it is essential that any user-modified parameter-entity declarations be processed before the default declarations are encountered. (In SGML, the first declaration of an entity is the only one which counts.) In simple cases, it would be enough for the user to insert the desired new parameter-entity declarations in the DTD subset of a document; since the DTD subset is processed before the external DTD entity, this would ensure the necessary priority of declaration. In practice, however, this would have several disadvantages. 
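The rule that only the first declaration of an entity counts, and the resulting importance of processing order, can be sketched as follows. The function and variable names are hypothetical; the sketch models nothing more than a symbol table filled in order of encounter.

```python
def effective_entities(*declaration_groups):
    """Process groups of (name, value) declarations in order.

    The first declaration of each name wins; later ones are silently ignored.
    """
    table = {}
    for group in declaration_groups:
        for name, value in group:
            table.setdefault(name, value)
    return table

# Declarations in the document's DTD subset are read before the external DTD.
dtd_subset = [("TEI.prose", "INCLUDE")]
tei_defaults = [("TEI.prose", "IGNORE"), ("TEI.verse", "IGNORE")]

print(effective_entities(dtd_subset, tei_defaults))   # user override is effective
print(effective_entities(tei_defaults, dtd_subset))   # too late: defaults already set
```

Reversing the order makes the user's declaration ineffective, which is why user modifications must be encountered before the default declarations.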
A project which consistently used the same modifications of the TEI DTD would have to include all of the modifications in the DTD subset of every document; this is inconvenient and would lead to confusion when the local modifications were themselves revised. Moreover, elements declared within the DTD subset are unable to refer to the parameter entities normally used in element declarations; because component, for example, has not yet been declared, an element declaration in the DTD subset cannot refer to that parameter entity anywhere within its content model. It is more convenient if user-supplied element declarations can refer to the same parameter entities as are found in the standard TEI declarations. This is possible if they are encountered in the DTD not before but after the TEI-supplied parameter-entity declarations. The TEI's method of embedding user-supplied defaults ensures precisely this. The main driver file of the TEI DTD embeds two entities, within which local modifications can be conveniently enclosed. First, before declaring the standard TEI parameter entities, it embeds any local modifications of those declarations: %TEI.extensions.ent; As shown, the default declaration of TEI.extensions.ent is the null string, so by default the reference to that entity has no effect. When the user overrides that declaration, as shown in the previous section, the user's modifications will be embedded before the TEI declarations of standard parameter entities are encountered. After all standard parameter entities such as component have been declared, the driver file embeds any local element declarations: %TEI.extensions.dtd; Again, the default expansion of the entity is to the null string. 4 OTHER FORMAL ASPECTS OF THE TEI SCHEME Several other characteristics of the TEI DTD should be described briefly here, as they may be relevant to understanding the other articles in this issue, and may also be of interest to others developing general-purpose SGML tag sets. 
This section describes in particular
* the global attributes defined for every TEI element
* the TEI class system, which allows attributes and structural characteristics to be declared for classes of elements and inherited by all elements which are members of those classes
* the full range of mechanisms for extension and modification of the TEI DTD
* the notion of the derived DTD, which is important for the understanding of TEI conformance

4.1 Global Attributes

Several attributes are so widely useful that they have been defined as applying to every element in the TEI DTD. These are:

id: provides an SGML identifier for an element; SGML requires this identifier to be unique within the document, and it may therefore be used as the target of cross-references and other pointers

n: provides a possibly non-unique number, label, or name for an element; this is particularly useful when it is desired to specify explicitly the number of a section, or of a list item, without transcribing the number as data content

lang: specifies the language of the content of an element; this is an indirect pointer to a TEI Writing System Declaration, which identifies the natural language and writing system and the character set used to transcribe them

rend: provides information about the rendering of an element where this is desirable

The id and n attributes allow for the identification of any element occurrence within a TEI-conformant text. Elements carrying an id attribute value may be the object of a link or cross-reference, or any of the other re-structuring mechanisms proposed by the TEI for circumventing the rigidly hierarchic structure of a simple SGML DTD. The fact that the requirement for such links is usually unpredictable is one reason for making this attribute global. Values on id attributes must be unique (their declared value is ID). Values on the n attribute however need not be; they may be used to carry a TEI canonical reference.
A method for defining the structure of such canonical reference schemes is also provided, so that documents using it can be processed automatically. The lang attribute indicates the language, and hence the writing system, applicable to the element's content, thus providing explicit support for polyglot or multi-script texts. If no value is given, that of the element's direct parent is inherited. (A number of TEI attributes use inheritance of this type; in the DTD, it is indicated by the use of the TEI-defined keyword %INHERITED to specify the attribute's default value). The value of this attribute identifies a special purpose element which documents the language in use, optionally associating it with an external entity in which a formal writing system declaration may be given. The TEI writing system declaration (WSD) attempts to help encoders come to terms with a world in which, for one reason or another, documents may not always use the same universal character set, whether from ignorance, perversity, or the sheer impossibility of finding one large enough to represent all the glyphs they contain. It provides for the systematic documentation of a writing system, in terms of existing international or other standards, public or private entity sets, ad hoc transliteration schemes or explicit definitions, as well as combinations of all four. Finally, the global rend attribute may be used to give information about the physical presentation of the text in the source, where this is not otherwise given. A default rendition may be specified for all elements of a given type. No specific set of values is defined for this attribute in the current draft, though it is possible that some suitable set of DSSSL primitives may be proposed in a later version.(11) It should be stressed that the rend attribute does not specify procedurally how an element is to be formatted. It does associate the element with a particular form of presentation, but does so in a purely declarative way.
Its normal meaning is that the element in question was rendered as indicated in the source -- it says nothing in itself as to how the element is to be rendered in later processing, except insofar as it may be desired to mimic the approximate appearance of the source edition. In document production applications, where there is no physical source, some users may be tempted to use the rend attribute to guide a formatting process, but external style sheets will normally provide a better method of handling this task. 4.2 Inheritance of Attributes from Element Classes Other attributes, while not so widely useful as those defined globally, do apply to more than one element type, and need to be kept consistent among all elements which use them. In the TEI scheme, such attributes are defined as belonging to a particular class of elements; attributes of a class in turn are inherited by all members of that class. For example, all TEI elements which represent links or associations between one element and another do so using a common set of attributes. To express this visibly in the DTD, all such elements are defined as members of an attribute class, to which the TEI gives the name pointer. Since SGML has no explicit concept of element classes, class inheritance is implemented by parameter entities. The attributes of a class are defined in a parameter entity, the name of which is the string a. followed by the name of the class itself. Members of the class refer to that parameter entity in declaring their own attributes. In the following simplified example, the element is a member of the class pointer, while is a member of the subclass xPointer: After expansion of the parameter entities, it is as though had been declared with the attributes id, n, lang, rend, type, resp, targType, evaluate, and target. The element has further inherited the attributes of the xPointer class. 
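The way class attributes accumulate through parameter-entity expansion can be sketched in Python. The attribute lists for the global and pointer classes follow the enumeration given in the text; the attributes assigned to xPointer are an assumption made for illustration, and the function names are hypothetical.

```python
# Attributes contributed by each class (the role played by the a. parameter entities).
CLASS_ATTRS = {
    "global": ["id", "n", "lang", "rend"],
    "pointer": ["type", "resp", "targType", "evaluate"],
    "xPointer": ["doc", "from", "to"],   # assumed for illustration
}

# Subclass relations: a subclass inherits everything its superclass defines.
SUPERCLASS = {"xPointer": "pointer"}

def class_attributes(name):
    """Attributes of a class, including those inherited from its superclasses."""
    inherited = class_attributes(SUPERCLASS[name]) if name in SUPERCLASS else []
    return inherited + CLASS_ATTRS[name]

def element_attributes(classes, own):
    """Effective attribute list of an element: global, then class, then its own."""
    attrs = list(CLASS_ATTRS["global"])
    for name in classes:
        attrs += class_attributes(name)
    return attrs + list(own)

# A member of class pointer which declares one attribute of its own.
print(element_attributes(["pointer"], ["target"]))
# A member of subclass xPointer with no attributes of its own.
print(element_attributes(["xPointer"], []))
```

Changing one entry in the class table changes every member element at once, which corresponds to the user overriding a single parameter entity in the DTD.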
In other words, the attribute list of each element behaves as if every inherited attribute had been written out in full. Note that:

* the global attributes themselves are declared as the attributes of a class named global -- all elements are members of this class.
* the subclass xPointer inherits attributes from its superclass pointer, and thus its members do, too; inheritance is thus transitive, as it should be.
* the attributes of a class can readily be modified by the user, who need merely provide an alternative declaration for the appropriate parameter entity.

4.3 Model Classes

Inherited class membership is also used in defining content models. Members of a model class share the same structural properties: that is, they may appear at the same position within the SGML document structure. For example, the class phrase includes all elements which can appear within paragraphs but not between them, while the class chunk includes all elements which can appear between but not within paragraphs (such as paragraphs, for example). A class inter is also defined, for elements such as lists which can appear either within or between chunk elements. Similarly, the class divtop contains all elements (headings, epigraphs, etc.) which can appear at the start of a textual division.

As well as these general purpose classes, some functional or semantic classes are defined: for example, all elements used to mark editorial corrections or omissions are members of the class edit; elements marking bibliographic citations, etc., are all members of the class bibl, and so on. Elements may of course be members of more than one class.

Classes may have super- and sub-classes, and properties are inherited. For example, reflecting the needs of many TEI users to treat texts both as documents and as input to databases, a sub-class of phrase called data is defined to include "data-like" features such as names of persons, places or organizations, numbers and dates, abbreviations and measures.
These behave in exactly the same way as phrase elements, and so the data class is a sub-class of the phrase class.

The formal definition of these classes in the SGML syntax used to express the TEI scheme makes it possible for users of the scheme to extend it in a simple and controlled way: new elements may be added into existing classes, and existing elements renamed or undefined, without any need for extensive revision of the TEI document type definitions.

Each class is defined by two parameter entities. The first contains a list of class members, in a form suitable for use within a content model; it is given a name constructed by adding the prefix m. (m plus full stop) to the name of the class itself, and is thus termed an m-dot. The m-dot entity also contains a reference to the second, an x-dot entity, which by default is declared as the empty string. (N.B. the use of m-dot and similar prefixes is a naming convention followed consistently by the TEI; it is not an SGML construct, and for an SGML processor, the m-dot and x-dot entities for a class are simply two distinct parameter entities. The formal relationship between them is detectable only by TEI-aware software, not by generic SGML systems.)

To add a new element to a class such as divtop, enabling it to appear anywhere in a content model that other members of the class do, all that is needed is to re-define the x-dot entity for that class within the DTD subset. If the new element is not already defined in the TEI scheme, an element declaration is also necessary. The m-dot entity is used in content models wherever members of the class can appear; the m-dot entity for divtop, for example, appears in the declarations of the text-division elements.

4.4 Extension and Modification Mechanisms

The TEI takes the following steps to ensure that the TEI DTD can be readily extended or modified where necessary; some have already been described.
* modularization of the DTD, to allow individual components, which are referred to by means of external parameter entities, to be replaced by modified components
* pre-defined entities for TEI extensions, embedded by the driver at appropriate points, as described above
* the provision of x-dot entities for all model classes, to allow convenient addition of new elements to the model-class system
* the expression of attribute classes using parameter entities, allowing them to be modified easily
* use of parameter entities for common content models, to allow easy modification
* use of parameter entities for the generic identifier of each element, so that elements may be renamed simply by overriding the default declaration of the element's n-dot entity
* declaration of each element within a conditional marked section controlled by a parameter entity, to allow individual elements to be suppressed at will

The first few points have already been explained; the last two work as follows. In the non-modifiable form of the TEI DTD, the declarations for an element name that element directly. In the modifiable form of the same file, the generic identifier is instead supplied by a reference to the element's n-dot entity, and the declarations are enclosed in a marked section whose status is supplied by a second parameter entity. The element can thus be renamed simply by redefining the entity used for its generic identifier, and it can be suppressed entirely by redefining the entity used to control the marked section within which it is declared.

4.5 Derived DTDs

The TEI DTDs may be used as they stand; it is expected, however, that many users will modify them slightly, if only by selecting a particular set of tag sets and suppressing the declaration of elements which they expect never to tag in their documents. The Oxford Text Archive, for example, has for some time been distributing documents encoded with a small subset of the TEI DTD, in which the main difference from the standard DTDs has been the elimination of many elements of the core and header.
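A subset or extension of this kind relies on the mechanisms listed in section 4.4, which can be sketched as follows. The sketch is illustrative and is not taken from the TEI DTD: the member list of m.divtop is abbreviated, the content models are reduced to the bare minimum, the choice of <p> and <argument> as examples is arbitrary, and blurb is a hypothetical new element. It relies on the SGML rule that the first declaration of an entity prevails, so that overrides must be read before the defaults.

```sgml
<!-- Part 1: overrides supplied by the user, for example in the
     entity TEI.extensions.ent; read before the defaults. -->
<!ENTITY % x.divtop 'blurb |' >  <!-- add hypothetical blurb to class divtop -->
<!ENTITY % n.p      'para'    >  <!-- rename p to para                       -->
<!ENTITY % argument 'IGNORE'  >  <!-- suppress argument entirely             -->

<!-- Part 2: defaults, as a modifiable DTD might declare them.
     Declarations of entities already declared above are ignored. -->
<!ENTITY % x.divtop ''        >
<!ENTITY % m.divtop '%x.divtop; argument | epigraph | head' >
<!ENTITY % n.p      'p'       >
<!ENTITY % p        'INCLUDE' >
<!ENTITY % argument 'INCLUDE' >

<!-- Each element sits in a marked section controlled by an entity,
     and takes its generic identifier from its n-dot entity. -->
<![ %p; [
<!ELEMENT %n.p;    - O (#PCDATA) >
]]>

<![ %argument; [
<!ELEMENT argument - - (#PCDATA) >
]]>

<!-- A genuinely new element needs a declaration of its own. -->
<!ELEMENT blurb    - - (#PCDATA) >

<!-- The m-dot entity is used wherever class members may appear. -->
<!ELEMENT div      - - ((%m.divtop;)*, (%n.p;)+) >
```

With these overrides in force, the division element accepts <blurb> among its opening elements, paragraphs are tagged <para>, and <argument> is no longer declared, although its name still appears in the content model.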
The OTA DTD can be expressed purely as a set of modifications of the standard TEI DTD, in the manner shown earlier, in which the TEI.extensions.ent file contains a large number of declarations of the kinds described in section 4.4. Rather than parse the entire TEI DTD each time a document needs to be parsed, however, it is more convenient to define a free-standing OTA DTD, which defines exactly the same element grammar as that of the TEI DTD as modified by OTA; this derived DTD is smaller than the TEI DTD, makes no reference to suppressed elements, and is thus more readily comprehensible on its own terms than it would be if it had to be read only in conjunction with the full TEI DTD. Its disadvantage, from the point of view of document interchange, is that it is difficult to disentangle the parts which represent modifications of the TEI DTD from the parts which are exactly the same; a recipient who wanted to know exactly what changes would be required to process an OTA text with a standard TEI-aware processor would have difficulty finding out.

It will be most convenient, in general, if the derived DTD is provided in two forms: both as a free-standing DTD and as a pair of extension files which are to be used as the TEI.extensions.ent and TEI.extensions.dtd entities. Documents conforming to a derived DTD are also TEI conformant, as long as the element grammar expressed by the derived DTD can also be expressed by a set of legal modifications to the TEI DTD. For ease of maintenance, it will be convenient to maintain the DTD as a set of modifications, and to generate the free-standing derived DTD by program from the modified TEI DTD, so that the two express the same element grammar.

As pointed out in chapter 29 of TEI P3, any modification of a DTD changes the set of documents accepted by that DTD.
The set of documents accepted by a derived DTD may be:

* a subset of the documents accepted by the unmodified TEI DTD
* a superset of those documents
* neither a subset nor a superset

If either of the first two cases applies, the changes may be said to be "clean" modifications; clean modifications, however, are not a condition of TEI conformance, and a combination of clean modifications is not itself guaranteed clean.

5 ELEMENTS OF THE CORE TAG SET

It remains to describe at least briefly the elements defined in the core tag set and in the default text-structure tag set, which are not described elsewhere in this volume. The core tag set defines elements common to many different types of text, which are commonly enough used that it seemed inconvenient to separate them out into additional tag sets. The elements defined in the core can be grouped into:

* those which can appear directly within a text division ("paragraph-level" elements -- to avoid a bias toward prose, P3 defines these as component-level elements)
* those which can appear at character, word, or phrase level, i.e. within paragraphs or other component-level elements -- and also within other phrase-level elements; these elements are members of the class phrase
* those which can appear both at phrase-level and at component-level; these are members of the class inter

There are few documents which do not exhibit some of these features; and none of these features is particularly restricted to any one kind of document. In some cases, additional more specialized tag sets are provided for those wishing to encode aspects of these features in more detail (e.g. in chapter 14, which provides a more elaborate tag set for cross references and hypertext linking, or chapter 20, which allows for detailed analysis of the components of names, dates, and measures), but the elements defined in this core are believed to be adequate for most applications most of the time.
5.1 Paragraph-Level Elements

The only items defined as strictly component-level elements by the core tag set are:

* paragraphs (tagged <p>)
* verse lines (<l>)
* verse stanzas or other line groups (<lg> -- an <lg> contains one or more <l> elements)
* dramatic speeches (<sp>), which contain an optional speaker indication (<speaker>) and a series of paragraphs, lines, or line groups

Paragraphs are included in the core, rather than in the base tag set for prose, because prose materials regularly appear in texts of all types, including dictionaries and term banks, if only in the front and back matter. SGML elements for verse and dramatic speeches are included in the core tag set, rather than in the base tag sets for verse and drama, on similar grounds: they appear frequently in texts of all sorts, either directly, in the case of mixed-genre materials, or indirectly through quotations and epigraphs. Any attempt to segregate these elements into the respective base tag sets would have rendered it so frequently necessary to combine base tag sets that the utility of separating the bases would have been seriously impaired.

5.2 Phrase-level Elements

The phrase-level elements of the TEI core serve a variety of uses. They include:

* typographically highlighted phrases, optionally distinguishing among emphasis (<emph>), technical terms (<term>), foreign words (<foreign>) or words otherwise regarded as linguistically distinct or anomalous (<distinct>), titles of books (<title>), etc. When such distinctions are not to be made, the phrase may be tagged simply as "highlighted" (<hi>); the global rend attribute can be used in either case to specify how the phrase is highlighted in the source.
* quoted phrases, optionally distinguishing quotation or direct discourse (<q> -- strictly speaking, <q> is an inter-level element), glosses (<gloss>), cited phrases (<mentioned>), "scare quotes" (<soCalled>), or not so distinguishing (<hi> again);
* semantically distinct and important items like names (<name>), references to people, places, or things (<rs>, for 'referring string'), numbers and measures (<num> and <measure>), dates and times (<date>, <time>, <dateRange>, <timeRange>), abbreviations and their expansions (<abbr>, <expan>), etc.
* simple editorial changes such as correction of apparent errors (<corr>) or faithful reproduction of errors (<sic>), regularization and normalization or the reproduction of unnormalized forms (<reg> or <orig>), additions (<add>), material transcribed despite being deleted (<del>), and omissions (<gap>); material not reliably transcribed because the original is illegible or inaudible may be marked <unclear>.
* simple links and cross references (<ptr> and <ref>), providing basic hypertextual features; <ptr> is an empty element, used as in a document-production system which generates the appropriate cross reference; <ref> encloses a phrase which expresses a cross reference.
* pre-existing or generated indexing (<index>)
* simple or complex referencing systems, not necessarily dependent on the existing SGML structure (<pb>, <lb>, <cb> for page, line, and column breaks, <milestone> for other reference units)

5.3 Inter-level Elements

The core tag set defines a number of elements which can appear at either the phrase or the component level:

* annotation, whether pre-existing or added by the encoder (<note>); the place and type attributes can be used to distinguish notes by place of appearance in the source (footnotes, endnotes, marginal notes, etc.) or by type (source, parallel passage, etc.)
* lists of all kinds (<list>, <item>, and <label> for item labels; <listBibl> for lists of bibliographic citations)
* quotation, whether actual or pretended (e.g. dialog in a novel or narrative history); these are not always reliably distinguishable and may both be tagged <q>. Where it is possible and desirable to draw a distinction between the two, <q> should be used for dialog or direct discourse, and <quote> for actual quotations from other authors. Attributed quotations may be associated with their bibliographic references by enclosing both within a <cit> element.
* bibliographic citations, adequate for most commonly used bibliographic packages, in either a free or a tightly structured format (<bibl> for unstructured, <biblStruct> for tightly structured, bibliographic citations, <listBibl> for lists of bibliographic entries)
* stage directions, which can occur either within or between speeches (<stage>)
* embedded texts (tagged <text>)

5 ELEMENTS OF THE DEFAULT TEXT STRUCTURE TAG SET

The default text structure, which is currently embedded by all TEI base tag sets, divides each text into a body, surrounded by optional front and back matter. These three top-level units are tagged, respectively, <front>, <body>, and <back>.

5.1 Text Body and Text Divisions

While different text types exhibit a bewildering variety of names for their component parts (chapters, sections, subsections, acts, scenes, entries, parts, books, cantos, adventures, staves, fittes, etc. -- to mention only English-language terms), nevertheless the components and subcomponents of text typically behave in the same way whatever their name: typically each such unit is incomplete in itself, and typically smaller ones nest within larger ones. In the TEI scheme, therefore, all such objects are regarded as the same fundamental kind of element, which the TEI calls a <div> (or text division).
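The structure described so far might be sketched as follows for the opening of a play. This is an illustration only, not an example drawn from the Guidelines: the sample content is placeholder text loosely based on the opening of Hamlet, the wording of the stage direction is invented, and front and back matter are indicated only by comments.

```sgml
<text>
 <!-- optional front matter would appear here -->
 <body>
  <div>
   <head>Act I</head>
   <div>
    <head>Scene I</head>
    <!-- a stage direction between speeches -->
    <stage>Enter two sentinels.</stage>
    <sp><speaker>Bernardo</speaker>
     <l>Who's there?</l></sp>
    <sp><speaker>Francisco</speaker>
     <l>Nay, answer me: stand, and unfold yourself.</l></sp>
   </div>
  </div>
 </body>
 <!-- optional back matter would appear here -->
</text>
```

The outer and inner divisions are the same kind of element; only their nesting distinguishes the act from the scene.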
A type attribute may be used to give the name associated with a particular text division in the source; chapters may thus be tagged <div type='chapter'>, cantos may be tagged <div type='canto'>, and so on. In this way, the encoder can distinguish different types of division. It will frequently be possible to distinguish them also by their hierarchical position: when plays are consistently divided into acts, and acts into scenes, then the top-level division of a text body will consistently be a <div type='act'> and the next level a <div type='scene'>. Like others of its kind in the TEI scheme, the type attribute has no standard set of values, precisely because no such standard set of terms exists, or is likely to exist. A set of legal values may however be defined for a given application, either in the TEI Header or by a user-defined modification to the DTD.

Following the example of such markup languages as Scribe, LaTeX, GML, and the SGML tag set developed for the Association of American Publishers (now ISO 12083), the TEI defines a set of text-division elements each labeled with its hierarchical depth (<div0>, <div1>, etc., down to <div7>). This style of tagging is intuitively clear to many users, allows convenient markup omission techniques, and can sometimes assist in the detection of errors in the markup. Since such a style of markup can make it hard to reuse text unless it is embedded at exactly the same level in the hierarchy of text divisions, however, a second style of markup is sometimes preferred; this too is provided by the TEI, in the form of an un-numbered or "vanilla" <div>. Either numbered or un-numbered division elements may be used in TEI documents, but they may not be mixed in the same <front>, <body>, or <back> element.

In the normal case, the components of all divisions in a particular base are homogeneous -- they all use the same value for component.seq. However, the scheme also allows for two kinds of heterogeneity.
If the general base is selected, together with two or more other bases, then different divisions of a text may have different constituents, though each division must itself be homogeneous. A mixed base is also defined, in which components from any selection of bases may be combined promiscuously across division boundaries. The way in which this is done is beyond the scope of this article; full details are however given in the appropriate chapter of the TEI Guidelines.

At the beginning of each text division, a division title (tagged <head>) may appear, as may several other items (<epigraph>, <argument>, <byline>, <dateline>, <salute> -- the salutation at the opening of a letter, or the signoff at its closing --, <signed> -- the signature of a letter). A number of these may also appear at the end of a division. To avoid ambiguity in the document grammar, the ambiguous items (<byline>, <dateline>, <salute>, and <signed>) must be enclosed within the grouping elements <opener> or <closer>. For example:

    <div><head>A LETTER TO THE PUBLISHER Occasioned by the present
    Edition of the DUNCIAD</head>
    <p>It is with pleasure I hear that you have procured a correct
    edition of the <title>Dunciad</title>, ... </p>
    <p>I shall conclude with remarking ... </p>
    <closer>
    <salute>I am Your most humble Servant,</salute>
    <dateline>St. James's Dec. 22, 1728</dateline>
    <signed>William Cleland.</signed>
    </closer>
    </div>

5.2 Front and Back Matter

In general front and back matter are tagged in TEI documents using the same elements as the text body; a few specialized elements are defined for structures found only there. No specialized elements are defined for items like prefaces, acknowledgements, etc., since these may conveniently be tagged as text divisions of the appropriate type (<div type='preface'>, etc.). The title page, however, is given a tag of its own (<titlePage>), and a number of specialized elements are defined for it, to make it easier to extract salient information like the title and author of the document in question.

5.3 Groups

Textual divisions pose no problems in many conventional texts. In some cases, however, it is impossible to find a satisfactory method of handling the internal structure of a text using only the structural elements described so far. In encoding an anthology, for example, one might well balk at tagging a poem or short story as a division of some larger whole, ignoring the fact that it too is a text in its own right. One could, of course, tag the entire anthology as a corpus of texts, treating each item in the anthology as a separate text with its own enclosing elements; but only at the cost of losing the ability to regard the entire collection as a text in its own right. For anthologies prepared by later editors, some encoders might find this cost acceptable, but it is intolerable for poem cycles and other elaborately structured works.

For cases of this sort, the TEI scheme defines an alternative to <body> and <div>. In an anthology, the front and back matter surround a <group> element, which contains either a series of <text> elements, or a series of smaller <group> elements, or some combination thereof. A group can have the same kind of header and trailer elements at its beginning and end as can a text division; texts enclosed within a group have the normal structure and can thus have their own front matter, body, and back matter -- or may themselves be compound texts, containing groups of their own. In addition, <text> is declared as a component-level element, so that embedded texts quoted in full can be tagged as free-standing texts.

6 CONCLUSION

It is too early to tell with certainty, and the authors are not the judges called to decide, whether the TEI scheme has wholly justified all the hopes placed in it at its inception. But it is possible to make a brief survey of the original design goals and see where the TEI now stands in relation to them.

We believe that the scheme defined in TEI P3 does suffice, thanks to the generous work of scores of specialists, for the immediate needs of the majority of scholars working with electronic text. Without user-defined extensions, it does not suffice, and no finite vocabulary can ever suffice, for all the future needs of all researchers; by providing extensive facilities for user extensions, however, we believe the TEI does the best job possible of meeting the future needs of research which cannot be foreseen today, as well as making the existing DTD usable for those with specialized needs today. The extensive facilities for analysis and interpretation are designed to allow a great deal of extension to the native semantics of the TEI tag sets to occur without requiring any formal modification to the SGML DTD.

The scheme is, however, far more complex and daunting than either of us thought likely when the project began.
In part, this reflects the project's efforts to meet the existing documented needs of researchers; in part, no doubt, it reflects shortcomings in the editorial work for which we must shoulder responsibility. In our own defense we will offer only the observation that a markup language truly capable of reflecting in detail the results of scholars and researchers working in very diverse fields will necessarily take on some of the complexity inherent in advanced work in those fields; a very simple markup scheme will hardly provide the kind of support for research work which P3 provides. We do hope to have kept the explanations in TEI P3 reasonably clear, and we do claim that TEI P3 is reasonably concrete: it is not a meta-DTD but a real markup language which can be used without further ado, and for every element there are concrete examples, almost always from real texts.

We can also certify that it is possible to use the TEI markup language without special-purpose software; we have used earlier versions of it ourselves for several years, with no editors or word processors other than the standard ones on our systems. The goal, of course, was that the scheme should be not just "possible", but easy to use without special-purpose software. And here we must admit that while it has not been very onerous to edit TEI documents in standard editors, the task is made much easier by the use of SGML-aware editors. And while it has been possible to write our own software to process our own SGML documents (with which the printed version of TEI P3 was generated), the task of processing SGML documents becomes incomparably simpler when SGML software is used.

The final items in the initial list of design goals call for the TEI scheme to allow user-defined extensions, and to conform to existing and emergent standards.
The list of facilities given above for simplifying the task of extending the TEI DTD has, we hope, persuaded the reader that such extension is adequately provided for. And the final goal has been met by erecting the Guidelines on the foundations provided by SGML. The task of conforming to emerging standards, of course, has no end, and future revisions of the Guidelines will have to take explicit account of HyTime, DSSSL, and other standards which accompany and extend SGML.

Some work remains for the future. Several existing chapters would benefit from further elaboration; several missing areas (analytic bibliography of early printed books, computational lexica, detailed description of page layout and design, letters and memoranda, and numerous others) should be included in some future revision. The most pressing need, however, remains that of developing simple introductory manuals for teaching the Guidelines to practitioners and students -- a task to which we now turn our attention, and that of the community as a whole.

------------------------------------------------------------------------

NOTES

(1) Association for Computers and the Humanities (ACH), Association for Computational Linguistics (ACL), and Association for Literary and Linguistic Computing (ALLC), Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard (Chicago, Oxford: Text Encoding Initiative, 1994).
(2) Although it requires SGML conformance of all TEI-encoded documents, the TEI encoding scheme is not strictly speaking a "conforming SGML application" within the definition given in ISO 8879, because of the additional restrictions it imposes on the SGML constructs to be used in the TEI Interchange Format, and because P3 sometimes replaces standard terminology used in ISO 8879 with terminology which was felt to be less stilted and thus, we hope, clearer.

(3) In Saussure's terminology, we are distinguishing here between the signified and the signifier. The association we postulate between the tag and a specific feature, however, means that a tag, as we use the term, is not a signifier but a sign. Since in common usage, the term tag is used both to refer to signifiers and to refer to signs, we use identifier when it is important to refer unambiguously to the signifier, and element (see below) when it is desired to refer unambiguously to the sign.

(4) For Formex, see C. Guittet, ed., Formex: Formalized Exchange of Electronic Publications (Luxembourg: Office for Official Publications of the European Communities, 'New Technologies -- Project Management' Department, 1984); for the starter set see ISO 8879-1986 Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML) ([n.p.]: ISO, 1986), Annex E.1, pp. 136-139. The starter set is discussed and various national-language versions defined in ISO technical report ISO/TR 9573-1988(E) Information processing -- SGML support facilities -- Techniques for using SGML.

(5) In our usage, an SGML element is thus a sign: an association between a signifier (the identifier) and a signified (the feature).

(6) It is therefore perhaps rather inconsistent to refer to the sets of elements described in chapters 5 to 23 of P3 as tag sets, but the usage is now established.

(7) More rigorously, the formal portion of the DTD directly determines only a set of identifiers.
Taken in conjunction with the semantic rules of the encoding scheme, however, this set uniquely determines the tag set defined by the DTD.

(8) See, for example, Frank Wm. Tompa, "What is (Tagged) Text?" in Dictionaries in the Electronic Age: Fifth Annual Conference of the UW Centre for the New Oxford English Dictionary (Oxford: [n.p.], 1989).

(9) A content model of ANY is defined as equivalent to a starred alternation of #PCDATA and all the elements declared in the DTD. For further discussion of starred alternations, see below.

(10) For example, in a DTD containing declarations for those elements alone, the declaration given here is the exact equivalent of the Waterloo-style declaration given earlier.

(11) DSSSL, the Document Style Semantics and Specification Language, is a draft ISO standard for specifying the processing or layout of SGML documents; it is formally defined in ISO DIS 10179 -- Information Technology -- Text Composition -- Document Style Semantics and Specification Language ([Geneva]: ISO, 1994).

------------------------------------------------------------------------

WORKS CITED

* Association for Computers and the Humanities (ACH), Association for Computational Linguistics (ACL), and Association for Literary and Linguistic Computing (ALLC). Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
* Guittet, C., ed. Formex: Formalized Exchange of Electronic Publications. Luxembourg: Office for Official Publications of the European Communities, 'New Technologies -- Project Management' Department, 1984.
* International Organization for Standardization (ISO). ISO 8879-1986 Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML). [Geneva]: ISO, 1986.
* --. ISO/TR 9573-1988(E) Information processing -- SGML support facilities -- Techniques for using SGML. [Geneva]: ISO, 1988.
* --. ISO DIS 10179 -- Information Technology -- Text Composition -- Document Style Semantics and Specification Language. [Geneva]: ISO, 1994.
* Text Encoding Initiative. TEI ED P1 "Design Principles for Text Encoding Guidelines." [Chicago, Oxford]: TEI, 1989.
* --. TEI ED P2 "Charges to the Working Committees." [Chicago, Oxford]: TEI, 1989.
* --. TEI ED P3 "Theoretical Stance and Resolution of Theory Conflict." [Chicago, Oxford]: TEI, 1989.
* Tompa, Frank Wm. "What is (Tagged) Text?" In Dictionaries in the Electronic Age: Fifth Annual Conference of the UW Centre for the New Oxford English Dictionary. Oxford: [n.p.], 1989.