Proposal for a Character Encoding Standard for the Text Encoding Initiative

TEI TRW7 - Draft

by Steven J. DeRose
7500 W. Camp Wisdom Rd.
Dallas, TX 75236 USA
texbell!txsil!steved.uucp

Last updated October, 1989

Table of Contents

   Introduction
   Practical Constraints
   Encoding Options
      Language Indication
         As an element
         As an attribute
         As entities
         As content
      Character Encoding
   Unresolved Issues
   A Proposal
      Language Indication
      Character Encoding
      Character Encoding Declaration
   Appendix 1: Some relevant standards
      ISO 8879: SGML
      ISO 646: 7-bit Character Set
      ISO 2022: Code Extensions
      Xerox Characters
      EBCDIC
   Appendix 2: Fundamental axioms
      Encoding is not synonymous with rendering or keyboarding
      Language is the fundamental organizing principle
      The data should be readable
   Bibliography

Introduction

This document presents some issues involved in, and proposes a very preliminary standard for, the encoding of characters. It is written for the Text Representation committee of the Text Encoding Initiative (TEI).[1] This document does not deal with higher levels, such as the structural units of documents; nor with lower levels, such as which byte comes first in the 'word' on particular hardware. It also does not discuss how sorting, hyphenation, and other process-specific operations are to be defined.

There are several main parts to this document:

1) an introduction, with a discussion of certain practical constraints;

2) a discussion of the theoretically possible encoding options for (a) the language of each portion of a text, and (b) the characters appropriate to each language used;

3) a proposed standard; rather than attempting to specify a single universal character encoding system, it specifies a way of documenting particular coding schemes; it also suggests establishing a system for registering new conforming character encodings;

4) an appendix which briefly describes some of the salient standards now in use;

5) an appendix by Gary Simons and Steven DeRose, which discusses some fundamental axioms; and

6) a brief bibliography.

Practical Constraints

As discussed in Appendix 2, it is desirable that document portions reflect their actual language, rather than merely being encoded so as to allow generation of recognizable graphical shapes. This constraint fits well with the overall TEI commitment to SGML and a descriptive approach, since many (though not all) changes of language apply to structural document elements which are independently motivated and likely to require tagging. Also, a desideratum of interchange is readability across systems. With this in mind:

A text is termed minimally human readable if none of the encoded information becomes invisible when the file is displayed on a standard terminal or printer.

"Standard" should be understood as including all commonly used computer hardware and software.[2] Put simply, a transportable file should not have any invisible characters when you put it on a machine and use whatever passes for the "type" command, with a commonplace terminal or display. Although the file may not look pretty when displayed that way, at least no information has been lost. A minimally human readable text is likely to be transportable from system to system, but custom software may be required for translation or display before the text will seem meaningful.
A slightly stronger requirement is therefore preferable:

A text is termed portably human readable if none of the encoded information units appears conceptually different when the file is displayed on a standard terminal or printer.

By "conceptually" different, I mean that the information unit should appear to be "the same" to the average reader at each end. For example, the a in any Roman font is the "same thing," whereas a and & are not. The significance of this requirement is in emphasizing that a conforming file does not use non-standard codes, such as are available on many systems to print line graphics, suits for card games, sigla, etc. The corresponding numeric codes may not be strictly invisible on many systems, but they fail to appear conceptually comparable across systems. Seeing one character displayed as some quite different character after transporting a file is not as bad as seeing nothing; but it is still quite bad. Consider the following equivalents:

   Originating system        Receiving system
a) ü                         (picture of telephone)
b) &uuml;                    &uuml;

Clearly the lengthy encoding in (b), though it uses more space, is far more readable across systems than the arbitrary character code reflected in (a), which obviously works perfectly for the originator, but is meaningless to the receiver, appearing as an image or printer's dingbat which happens to occupy the same numeric location in some font owned by the receiving system.

To see the true advantage of more readable encoding, it is important to remember that rendering is independent of encoding. Someone using umlauts a great deal will have some local rendering system that can generate the ideal shape, and so will move any text to be used extensively into the local encoding. This will likely be true however the text was originally encoded, and whatever the local encoding may be. In either case (a) or case (b), some special software will be required to do this (unless, of course, all fonts and display software become consistent, an implausible expectation). Because of this, a primary goal of an interchange encoding should be to make such translations easy. Both that heavy user, and the person who sees one umlaut per year (and that on a teletype!), can read "&uuml;" (one reasonable encoding of u with umlaut as an SGML entity reference) and know exactly what is meant by the originator. However, if a proprietary/nonstandard encoding is used, only those with the same system, or a comparably capable one with special translation software, can even perceive what is intended; anyone else is lost among invisible or bizarre characters.

Another advantage of a common, universally understood, mnemonic encoding is that less software is required for translating in and out of local formats. Given n local encodings, a translator is needed to get in and out of each coding from the standard: thus 2n distinct conversions. If every one of the n is to be convertible directly to every other, then n x (n-1) (roughly n squared) conversions are needed instead, constituting a much greater problem.

This proposal thus recommends that encoded text be (a) machine readable, in the sense that it can be interchanged without loss of information, and (b) portably human readable, in the sense just defined. If a file is to be used only on the originating system, any encoding works: one need only assign random unique codes of any length to any set of units of the writing system, inform the software, and the job is done.
This fact, unfortunately, has led to a proliferation of encoding schemes, differing not only from one hardware platform to another, but even among individual document processing programs on a single platform. If files are to be portable, then the first requirement is that they can be transmitted without information loss to other systems. In practical terms, this means that a standard which cannot be safely transmitted through computer networks or between machines from various vendors is unacceptable.

The practical difficulty, however, is that a great many texts exist which have been prepared in one or another special-purpose format; specifically, in formats which are not portably or even minimally human readable. The "extended" characters of the IBM PC and Apple Macintosh character sets are widely used, though dissimilar and not in accord with ISO or other sanctioned standards. A key problem arises in trying to make such diverse data available and useful to a wide range of users, when individual scholars often lack the time and money, and sometimes lack the expertise, required to bring their data into conformance with a single standard. Thus, it is probably necessary to define a standard with levels of conformance, and to encourage the development of effective data-translation utilities.

Encoding Options

I now present some of the issues involved in designing a character encoding standard, and the options available within SGML. In general, the scope of the standard is to be as follows:

- Multilingual documents, not merely documents in various single languages, must be accommodated.
- The particular complexities of Asian and some other writing systems will not be dealt with at this time, though it is hoped that they will be included later.
- The encoding must be SGML conforming.

Two sub-problems must be dealt with in encoding characters: 1) the fact that certain text elements or portions of elements are in particular languages must be indicated in the text; 2) the meaning or interpretation of the numeric character codes used for each language must be defined.

Language Indication

I will take as axiomatic that any portion of a text can appear in any language. Some documents have the mixture of languages as an integral part, for example multilingual dictionaries and commentaries on works in another language than the commentator's. Most other texts also have insertions in various languages, for example foreign names in bibliographies or in-text citations, and stock foreign phrases such as per se.[3]

Given an SGML structured text, then, it must be possible to label any portion of the text as being in a particular language. The capability for inheritance, which follows naturally from SGML's hierarchical structure, applies well to language: an embedded element should be assumed to be in the same language as its containing element unless specified otherwise.

Only a few kinds of object can exist in an SGML file; the fundamental ones are: elements, attributes, entities, and content. These are therefore the mechanisms available for specifying the languages of portions of a document. The respective merits of encoding language via each of these methods will now be discussed.

As an element

The element method involves defining one or more elements which can appear anywhere, and which define content as being in a particular language. A set of elements could be defined, one for each language: <english>, <french>, . . ., <greek>.
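The following sketch shows what such tag-per-language markup might look like; the tag names are purely illustrative and not part of this proposal:

   <!-- Sketch only: one hypothetical element per language -->
   <p><greek>en arch hn o logos</greek></p>
   <p><english>In the beginning was the Word</english></p>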
There are several major conceptual problems with this "extensional" method:

1) The number of tags is needlessly increased, complicating the SGML Document Type Definitions (DTDs) and placing a greater capacity load on parsing programs.

2) The many tags, although they share one conceptual function, are not syntactically connected to each other. That is to say, the central, salient notion "language" never appears, but only particular instances from which one can infer it.

3) This method cannot be extended for new languages without modifying the SGML DTD to define new tags. Since this is a task beyond the scope (or beyond the software) of many users, and since one cannot hope to include all potentially desired languages or writing systems in a standard, this is a major drawback.

4) Likewise, this method cannot be extended easily to provide for quasi-languages such as mathematical or linguistic notations, new or unusual writing systems for known languages, etc.

Alternatively, a more "intensional" method uses a single element, with an attribute naming the language of each particular instance. Specifying a language could then be limited to this one element. An example is:

   <lang lang="greek">en arch hn o logos</lang>

The attribute, whose value is the language in use, has been given the same name as the element on which it appears. This overcomes the human problem of forgetting which name is which, and it poses no syntactic problems or ambiguities to an SGML parser.

Either element method requires that the element(s) be able to appear anywhere within a document. There is no standard provision for making this possible, but a good approximation can be made by defining <lang> as an SGML "inclusion" at the highest level (see ISO 1986b, section B.11.1). This allows <lang> to appear within the scope of any other elements that occur within the highest element (the element for the document as a whole).

The conceptual drawbacks of the element method are:

1) Most of the <lang> elements will share exactly the scope of some other element, making the tagging less clear than it might be if the language were associated with that element per se. For example, an entire paragraph may be inserted in a second language, as in the following two examples (between which the choice is, unfortunately, arbitrary):

   <p><lang lang="greek">en arch hn o logos</lang></p>

   <lang lang="greek"><p>en arch hn o logos</p></lang>
2) As reflected by the previous problem, the <lang> element is not generally a structural portion of the document, as are paragraphs, block quotes, etc. Although <lang> may be an element in its own right, very frequently language is properly a property of some other, pre-existing structural element.

3) In bilingual dictionaries and many other highly structured documents a worse problem also arises: particular elements will always or nearly always be in particular languages. For example, in a Greek-English lexicon the head words will always be Greek, while definition text (exclusive of quotation elements and other marked phenomena) will always be English. This leads to redundant tagging if a full-fledged <lang> tag must occur in every element, or even in every element of all but one language.

As an attribute

Rather than including an SGML element for every document portion for which language must be specified, an attribute may be defined (with a name such as, say, lg) that can occur on any element, and that specifies the language of the element (plus any contained elements which do not specify another language). For example:

   <p lg="greek">en arch hn o logos</p>
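As a sketch only (the element name, declared value, and default language shown here are hypothetical, not part of this proposal), a DTD fragment along the following lines could declare such an attribute and supply a default value for it:

   <!-- Sketch: declaring an lg attribute with a DTD-supplied default -->
   <!ELEMENT p  - -  (#PCDATA)>
   <!ATTLIST p  lg   CDATA  "english">

   <!-- In a document instance, the default is overridden only where needed: -->
   <p>In the beginning was the Word.</p>
   <p lg="greek">en arch hn o logos</p>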

This general approach has several advantages:

1) The entire structural element has its language as a property. This is more realistic than an element's merely happening to contain another element, which has a language property. This distinction also makes it unnecessary to define separate tags for standard elements differentiated only by language (for example, one paragraph tag for English and another for Greek).

2) A default value for the attribute can be supplied in the SGML DTD on some or all elements, indicating what language that element is expected to be in. Yet this default can be overridden when desired for specific instances of the element, by explicit specification of the attribute value.

3) A list of known languages can be enforced, if desired, by defining the attribute to have a constrained value-set. However, it is also possible (and in my view preferable) to allow any string as the value of this attribute; with this method no modification of the SGML DTD is required when adding new languages and writing systems.

4) Because the SGML DTD can define a default value for the lg attribute of the highest-level element, monolingual documents in the default language for that DTD require no coding for language. And because of inheritance, monolingual documents in any other language require only one specification of language, on the top element.

5) Conceptually, the language of a document element is a property of that element, rather than being a structural unit in its own right, much as the make and color are properties rather than structural units of a car.

The main problems with the attribute method are:

1) SGML provides no syntax for defining a "universal attribute," i.e., one that can appear on all elements. The "inclusion" feature serves almost this purpose for elements, but to be universal an attribute must be explicitly declared for each element. Fortunately the average user will not define DTDs, and so this redundant task need only be done by those implementing DTDs.

2) If language can only be specified as an attribute of elements, then minor insertions (such as the per se example already given) must be promoted to the status of elements. For this purpose a generally applicable phrase-level element could be introduced, or the <lang> tag described above could be retained in addition to the universal lg attribute.

As entities

Entities generally represent the inclusion or insertion of other material, such as figures, special characters, external files, etc. Thus they are not appropriate for the specification of the language of document portions. Also, entities are point events, not relating to the hierarchical document structure. An example of indicating language via entities would be:

   &Greek;en arch hn o logos
   &English;

Using this method would make it impossible to constrain or verify the scope of language changes, because entities do not have scope. Also, this method cannot express uniform relationships between particular elements and their languages (e.g., that head-word elements in a dictionary DTD might always be in a particular language). Also, the same several problems arise as with the "extensional" element method (when a different element is defined for each language).

As content

Language changes are not content; therefore they should not be coded as content. They could only be so coded by merely inserting some unique string to mean "change language." For example:

   **%!Greek!%**en arch hn o logos
   **%!English!%**

The string would best be composed of IRV characters, or else it too would be subject to loss during transmission across systems. Even so, the problems are severe. First, all of the problems of encoding as entities arise, plus the problem that an SGML parser or application would not be aware of the language indications. Further, the interpretation of whatever string(s) might be chosen would be entirely arbitrary, and hence limited to the original application. And finally, the string chosen becomes unusable to express data.

Character Encoding

A great temptation when considering character encodings is to immediately make a chart of what bits represent what shapes, and to argue about whether to mark things with particular escape sequences, multi-byte codes, "universal" character sets, etc. But before any of these possibilities can be considered thoughtfully, the underlying choices and trade-offs must be made clear. The fundamental tradeoff in defining a character encoding standard for TEI is between the desire to include as many texts as possible, and the desire to make those texts usable by as wide a range of people as possible. The fact that authors may be reluctant to invest time to convert their work to standard form must also be considered.

The ISO 646 IRV character standard provides the least common denominator of coding schemes (it might be more familiarly, though less precisely, described as matching the printable subset of ASCII; see Appendix 1). It includes the 26 letters A-Z in upper and lower case; the digits; and certain typical punctuation, not including distinct open and close quotes, diacritical marks, and such. If only IRV characters are used, a text is portably human readable; that is, readers on many systems can expect to see the same conceptual image for each internal code when using a simplistic display method such as a "type" command.[4] Whenever characters beyond those of IRV are required, users will differ in their encoding choices, and readability will then suffer.[5]

Several general methods are commonly used to encode characters that go beyond ISO 646. Most of these approaches fall into three general classes:

- Changing the symbolic meaning of numeric codes within the IRV range (first the national use characters, then others).
- Encoding single information units by multiple characters from a more restricted base set (preferably IRV).
- Making a wider range of numeric values permissible.

Some examples of more specific approaches include:

- Adding symbols in the non-graphical portions of the ISO 646 table (e.g., control characters).
- Adding symbols in the numeric range 128-255 (e.g., each computer manufacturer's definition of 8-bit "ASCII").
- Attempting an all-inclusive symbol set (e.g., 2-byte proposals that assign a place to every "distinct" symbol).
- Using multiple fonts, and font-change codes to fit in extra symbols (e.g., MS Windows' approach to Hebrew).
- Using undelimited sequences of standard characters to encode non-standard characters (e.g., using ":u" to represent a u-umlaut in electronic mail).
- Using delimited sequences of standard characters to encode non-standard characters (e.g., entities such as "&alpha;" in SGML).

If "national use" or other IRV characters are used for non-IRV meanings, the receiver of the file will generally not see what the originator saw. However, he or she will see some perceptible graphical character.
If this is the only change, and it is documented, then translation for local use will be relatively easy (being, for example, within the scope of the "change" command in most word processing programs). If sequences of IRV characters are used to represent non-IRV characters, then at least the original criterion of machine readability can be assured. If the sequences are mnemonic, portable human readability can also be achieved. If additional numeric codes are used, even machine readability is quickly lost. For example, much software reserves control codes for control functions, making their use as graphical characters impossible; some hardware and software products strip off the 8th bit, making any characters with codes above 127 unusable. The wider the range of numeric codes used, the less portable the data will be.

Unresolved Issues

Many issues are left undecided by this discussion of options and requirements. There is also so much variability in current practice that it is probably impossible to impose complete order. Some of the issues not discussed above, but which impinge on the choice of character encoding, are:

1) how sorting is to be accomplished; this is particularly complex when compound characters, characters to be ignored or equated, etc., are considered;

2) how hyphenation-, word-, and line-breaking places are to be determined;

3) how encoding relates to hardware limitations (for example, is it better to show the contextual variants of an Arabic letter as different symbols on a low-end display, as mnemonics, or as a single transliteration symbol).

The following list recommends methods of encoding some common phenomena which also have not been discussed:

1) accents should be encoded separately from the characters they modify (e.g., a´ is preferable to á);

2) accents should follow the character to which they apply (this ordering is largely arbitrary; but TEI should probably recommend a preferred order);

3) the primary SGML delimiters (& and <) should be encoded in all cases by their standard entity references (&amp; and &lt;);

4) character encoding specifications should use one encoding for each information unit, rather than one for each graphical shape (e.g., Greek sigma should be a single value).

A Proposal

At this time, this should be considered preliminary; comments are solicited from all readers before it is finalized and submitted to the Text Encoding Initiative for formal consideration.

Language Indication

Given the options and considerations discussed above, the following method is proposed as the TEI standard method for specifying the different languages and writing systems for portions of documents:

1) All DTDs will provide an optional attribute named lg for every element. A default language name may be provided if appropriate for some or all elements, at the discretion of the DTD designer(s). If no default value, or a default of "" (the empty string), is defined for the attribute, and if an element instance in a document also provides no value for the lg attribute, then the element shall be interpreted as being in the same language as its immediately containing element.

2) If no default value for the lg attribute is specified for the top-level document element, then English (?) will be assumed.

3) All DTDs will provide a tag <lang>, with the required attribute lg, which is an inclusion of the top-level document element (i.e., it can appear anywhere).

4) Because of this method, SGML application implementors are advised that it is necessary to support layout (e.g., font choice) which is based not only upon element type, but also on attribute value. A full-featured system should also be able to determine keyboard and display mappings on this basis.

Character Encoding

Given the criteria discussed above, the following recommendations for a TEI character standard are proposed:

1) Any file using characters that are not standard under ISO 646 must declare every information unit used, by providing one of the following encoding declarations:

   a) Reference to another existing ISO- or TEI-registered character standard.
   b) A declaration of the information unit codings, in the form described below.
   c) Reference as in (a), with exceptions declared as in (b).

2) Any number of distinct encoding methods may occur in a single document, but the encoding method may change only by virtue of the lg attribute proposed above. No escape or shift sequences are permitted within the data. An encoding declaration must be provided for every value of the lg attribute used, and must identify which value it is for, as defined in the description of the declaration document (see below).

3) The TEI character conformance level of a file is determined by the range of numeric codes the file contains, as defined in the following table. The higher the conformance level number, the more portable the file can be expected to be:

   L4) Optimal conformance. Uses only ISO 646 non-control, non-national-use characters. Control characters may appear as needed for record separation, etc., but are not part of the data content, and have no intended graphic representation. Such a file should move without change through current computer networks, using standard translation tools at non-ISO 646 (e.g., EBCDIC) sites.

   L3) IRV National conformance. Identical to optimal conformance, except adds the national-use characters. Should transfer without information loss except to or through EBCDIC systems, such as across multi-vendor networks, national boundaries, etc. National-use characters may not appear the same, although no information is lost per se.

   L2) 8-bit conformance. Defines (and declares!) characters with numeric codes between 144 and 255 (i.e., the "upper half," excluding those codes corresponding to the control codes of the lower half). Such files will likely suffer extreme data loss on multi-vendor networks, and will appear drastically different on the default displays of differing vendors. The actual information loss with specific hardware, such as EBCDIC-based or strictly 7-bit hardware, makes this level of conformity (or any lower level) highly undesirable, even though it is commonly used, such as with the major personal computer vendors' character sets.

   L1) 8-bit minimum conformance. Adds numeric codes 128-143, but not 00-31 or 127; this is even less likely to survive transfers than 8-bit conformant data; it is, however, a commonly-used method.

   L0) Non-conformance. Any text using codes 00-31 or 127 for graphical character data, or any text for which a conforming declaration is not provided, is non-conforming.

4) Any encoding that uses control codes for graphic data or that uses numeric codes greater than 255 (e.g., multiple-byte character sets) is not TEI conforming at this time.

5) TEI will register standard encodings, provide a name by which such encodings may be unambiguously referenced, and make available copies of the registered encodings.[6]

Only files conforming at levels L4 and L3 are likely to survive transmission across most computers, networks, and software environments.
Level L3 is fairly safe, and is commonly used in the free-form environment of electronic mail networks; only a few characters are likely to be lost, and they are well-known. At level L2, half the distinct characters will become meaningless on the majority of systems other than the originator's own model, and many of those characters will be lost in transmission through a network.[7]

Character Encoding Declaration

A Character Encoding Declaration Document, or CEDD, must be encoded strictly in ISO 646 IRV. TEI should develop a formal DTD for the declaration document itself. It is desirable that a CEDD be easily machine-parsable, so that character encoding translation utilities can be developed which merely read the CEDD, and then are able to translate documents which conform to the CEDD.

A CEDD must be provided for each language or writing system used, and must include:

1) The language specifier, which is the value for the lg attribute to be used in an SGML document to indicate elements which are in the language being declared.

2) The language description, which describes for human readers the conventional name(s) for the language and/or writing system being used.

3) The base character encoding, which specifies either an ISO or national standard for character encoding, or "NONE", which indicates that all codes are defined within this CEDD.

The declaration then includes, for each encoding unit:

1) The string of characters that unambiguously represents the unit; this will often be a single character, but strings are permissible: for example, "ll" as one way of encoding the Spanish letter ll.

2) A descriptive name for the unit, preferably including the name by which speakers of the language in question refer to it. Terse or unclear descriptions are to be avoided.

3) (optional) A description of any interactions of this unit with others for purposes of display; for example, diacritic placement may be described here.

4) (optional) An integer indicating (relative to all other declared units, and not necessarily unique) the placement of the unit in the standard sort order for the language. Any non-positive value shall be taken to mean that the unit is to be ignored in sorting.[8]

Appendix 1: Some relevant standards

ISO 8879: SGML

ISO 8879, Information Processing - Text and office systems - Standard Generalized Markup Language (SGML), defines a syntax for declaring the structure of a markup system. An SGML "Document Type Definition" specifies a set of tags to mark the scope of the structural elements applicable to a class of documents, plus a set of structural rules that defines the permissible relationships of those elements (e.g., paragraphs within chapters, but not vice versa).

SGML does not say much about character sets, except to provide "entity reference" names by which non-standard characters may be accessed. Even these names, however, neither (a) are sufficient for the requirements of all languages, nor (b) are ideally chosen even so far as they go. SGML reserves several characters for its own use; all can be re-defined, but doing so is not advisable since not all SGML processors support features beyond the minimal "reference concrete syntax." ISO 8879 does not, however, appear to rule out values above 127. The reference concrete syntax rules out codes 0-31, 127, and 255, and declares ISO 646 as the basic character set thereafter (see section 14 of ISO 8879, pages 52-54).
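As an illustration only (the replacement text shown is a purely local choice, not something any standard prescribes), an entity reference can be declared once and then used wherever the character is needed; a receiving system substitutes whatever rendering it can manage:

   <!-- Sketch: a local declaration for a u-with-umlaut entity -->
   <!ENTITY uuml "ue">   <!-- replacement text chosen to suit the local system -->

   <!-- In the document instance: -->
   M&uuml;ller           <!-- resolves to "Mueller" under the declaration above -->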
ISO 646: 7-bit Character Set

ISO 646, 7-bit Coded Character Set for Information Interchange, defines a standard set of character codes upon which many other standards have since been built. It defines the meanings of numeric codes 0-127 (only); of these codes:

- 00-31 are control characters, which are reserved for various machine-control functions, not for graphical data.
- 32-126 have graphic representations (though space may also be regarded as a control character).
- 127 is a control character intended for deleting data, or filling time in data transmission, and is thus not a data character.

ISO 646 specifies several codes that can freely be assigned alternative graphical representations, yielding ISO 646-conforming national character standards. If the default values of these variable characters are used, then the encoding is called the International Reference Version, or IRV. The most widely known variant is ASCII, the American national version; it differs from IRV only by having "$" in place of the "neutral currency indicator" character. Characters that may change from one national variant to another include: pound-sign vs. number-sign; dollar-sign vs. currency-sign; and IRV's at-sign, square and curly brackets, reverse solidus (or backslash), circumflex, grave accent, vertical bar, and tilde/overline.

ISO 2022: Code Extensions

ISO 2022 "specifies methods of extending the 7-bit code, remaining in a 7-bit environment or increasing to an 8-bit environment" (ISO 2022, p. 1). These extensions retain codes 00-31 as control characters, though they may be given new meanings. They also retain the following 96 character positions as graphic characters (including the usual space and delete values), likewise with potentially new meanings. A further extension allows 8-bit character sets. The lower half of an 8-bit set conforms to the 7-bit standard. The upper half is analogous: the first 32 values are reserved for control functions (but with "transmission control" meanings excluded); the remainder are available for graphic characters.

For languages that require more than 96 or 192 graphic symbols, ISO 2022 also defines "sets of graphic characters with multiple-byte representation" (p. 15); these cannot be effectively handled by everyday hardware, software, and even programming languages. The standard also provides for code extension via escape sequences which switch between character sets; this has the disadvantages of failing to be mnemonic, or even minimally human readable, and of encoding font-changes when it is language-changes which are salient, and these usually correlate with structural elements. It is worth noting that hardware vendors, particularly personal computer manufacturers, have not conformed to ISO 2022. IBM and Apple personal computers have influential character sets, but both define graphic characters for reserved control codes.

Xerox Characters

The Xerox Character Code Standard (Xerox 1987) uses 7- and 8-bit character sets which conform to ISO 2022, but with a different escaping system to change between character sets. Though the method is somewhat more perspicuous than that of ISO 2022, it retains basically the same problems of opacity.

EBCDIC

EBCDIC, used on IBM mainframe computers, is the only significant character standard that is not at least fundamentally based on ISO 646. It uses entirely different codings. However, there is a well-defined, one-to-one mapping between a subset of the ISO 646 symbols and a subset of the EBCDIC symbols.
The vast majority of systems, whether EBCDIC-based or not, have EBCDIC/ISO 646 translation utilities readily available; for those which do not, the translation program is simple to write. Therefore, files that limit themselves to the set of characters common to EBCDIC and IRV can be transferred between EBCDIC and non-EBCDIC systems without loss of information. This property is crucial for use of mixed-vendor networks, such as Bitnet.

Appendix 2: Fundamental axioms

by Gary F. Simons and Steven J. DeRose

Encoding is not synonymous with rendering or keyboarding

Becker (1984, 1987) provides a clear discussion of the range of problems to be dealt with in handling the world's various writing systems. He defines character encoding as the scheme for storing textual information on machine-readable media. A basic principle is that an encoding scheme should consistently represent the same information unit as the same encoded character. Rendering is then the process of converting the encoded information into a correct graphic form for display. Early computers performed only a mechanical mapping for display or printing, in which each encoded unit produced a particular ink or phosphor shape, regardless of context, language, or anything else. However, a more sophisticated mapping can and should be performed at display time, using methods now well understood (e.g., finite-state automata). For instance, in encoding Arabic (where letters vary depending on their position in the word), computer-readable text should encode the letters of the alphabet and not the particular positional variants.

Keyboarding is similarly distinct from encoding. The encoded text need not look anything like the sequence of keys that was pressed to type the text. Some text-entry systems for Chinese allow the user to type a phonetic representation, yet store a single code for each ideograph. Conversely, many programs provide macros for entering lengthy sequences of information units with a single keystroke. Another common example is using a "dead" key in combination with another key to enter a single unit into the text for a language that needs more than the 26 Roman letters. Various programs and individuals have strong preferences about how things should be typed; but this fact need not prejudice how information is stored, or compromise the portability of text.

Recognizing that encoding is distinct from rendering and keyboarding has a number of implications:

1) A document interchange standard is a standard of encoding, and thus need not address rendering and keyboarding at all, except to ensure that these can be done algorithmically.

2) Because the keyboarding-to-encoding and encoding-to-rendering mappings can be performed algorithmically, the actual encoding scheme can vary arbitrarily so long as the mapping relationships remain unambiguous. Thus the set of internal numeric codes is relatively unimportant, and encoding via transliteration is acceptable, as is using a unique sequence of characters to encode a single information unit (as in digraphs).

3) Because not all languages have an established set of information units, and because current encoding practices do not handle them consistently, an interchange standard must provide a means to describe and document a set of information units and their representations in a particular encoding.
Language is the fundamental organizing principle

An oft-repeated theme in discussions of character encoding standards is the idea of developing a "universal character set," often using multiple-byte character codes, that would include all the characters needed for encoding all languages (e.g., Anderson 1984). If hardware and software vendors could all agree, then this approach would ensure that texts would always be rendered with "appropriate" graphic symbols. However, it incorrectly focuses on graphical representation rather than information content, and this leads to several fatal flaws.

First, when the focus is on physical shape or appearance, it is difficult to define the set of "distinct" characters. Is Spanish ll distinct, or is it just two l's? If graphic shape is the criterion of distinction, why don't we distinguish the "a" characters of different fonts with separate codes? More practically, do we distinguish characters which are only sometimes written differently (Greek medial, final, and lunate sigmas; Arabic letters in general, etc.)?

Second, the encoded characters of different languages, even if they may have precisely the same rendering on the screen, are in fact different information units. For instance, the character sequence die is a very different thing in English than it is in German. To a scholar examining texts with English and German intermixed, it is important that the text encoding differentiate between English die and German die. The same principle applies to sorting and similar tasks: for instance, in an English dictionary ch comes between ce and ci; in a Spanish dictionary ch follows cu. The character c is a different information unit depending on whether it is in English text or Spanish text. Similarly, vowels with umlaut are sorted differently in different languages, despite identical appearance.

Therefore encoding the notion of language per se is crucial to addressing the character encoding problem in multilingual text. Every piece of textual information stored in a computer is expressed in some language; the encoding scheme for texts must account for that fact. A given piece of text in one language may contain parts expressed in another language, and so on (such as when an English essay quotes a German paragraph which discusses some Greek words). Encoding this language information in text is crucial for a text encoding standard since the identity of the language governs so many computational processes like keyboarding, rendering, retrieval, sorting, and hyphenation (Simons, in press).

A key insight embodied in a descriptive markup system like SGML (ISO 1986b) is that a document is not merely a string of characters (as some word processing software views it); rather, a document is a structured hierarchy of text elements (Coombs, Renear, and DeRose 1987). Similarly, multilingual text is not merely a sequence of characters from a multilingual character set (or, equivalently, from monolingual character sets with interspersed codes to switch sets); rather, a multilingual text must be viewed as a hierarchy of text elements, each in a language (or perhaps languages).

The data should be readable

An interchange standard should make data readable by people other than the originator. This requires that the data be transferable without loss from the originator's machine to the receiver's, and that the receiver's display be correctly interpretable.
The first criterion is relatively easy to fulfill: if every byte is copied verbatim from one machine to another, or if every distinct byte code is translated by a one-to-one (invertible) function, then no data has been lost per se. Everyday software accomplishes this for many character codes, though not for all. But this is not enough: the result may yet be uninterpretable for reasons such as:

- different character-generator ROMs in a terminal;
- lack of the same display font in a workstation;
- different rendering software (e.g., a word-processor that hides certain characters);
- mixing of graphical and non-graphical information expressed in a non-standard fashion (such as graphic or control characters, escape sequences, etc. used as procedural or presentational markup, or graphical characters used for control functions).

For the purposes of the TEI we need to facilitate human readability over the widest range of hardware and software. But since it will never be the case that all hardware and software share all fonts, keyboard conventions, etc., we cannot build a standard upon a large set of unusual graphical shapes and numerical codes. Since the shapes are so language-dependent, it does not seem to be an issue that can be addressed in a language-universal way. Therefore, readability must be achieved at a more abstract level. Given these concerns, a minimum requirement for human readability is that none of the encoded information becomes invisible when the file is displayed on a standard terminal screen or printer.

Bibliography

Anderson, Lloyd B. 1984. "Multilingual Text Processing in a Two-byte Code." Proceedings of Coling84, Stanford University, 2-6 July 1984: 1-4. Morristown, NJ: Association for Computational Linguistics.

Association for History and Computing. 1989. "Report of the Task Group for the Standardization and Exchange of Historical Data Bases."

Becker, Joseph D. 1984. "Multilingual word processing." Scientific American 251(1): 96-107.

_____. 1987. "Arabic Word Processing." Communications of the Association for Computing Machinery 30(7): 600-610.

Clews, John. 1988. Language Automation Worldwide: The Development of Character Set Standards. British Library R&D report 5962. Harrogate, N. Yorkshire: SESAME Computer Projects.

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery 30(11): 933-947.

ISO. 1983. Information processing - ISO 7-bit coded character set for information interchange, ISO 646-1983. Second edition. Geneva: International Organization for Standardization.

_____. 1986a. Information processing - ISO 7-bit and 8-bit coded character sets - Code extension techniques, ISO 2022-1986. Third edition. Geneva: International Organization for Standardization.

_____. 1986b. Information Processing - Text and office systems - Standard Generalized Markup Language (SGML), ISO 8879-1986. Geneva: International Organization for Standardization.

Simons, Gary F. In press. "The computational complexity of writing systems." The Fifteenth LACUS Forum 1988. Lake Bluff, IL: Linguistic Association of Canada and the United States.

Xerox Corporation. 1987. Character Code Standard. XNSS 058710. Sunnyvale, CA: Xerox Systems Institute.

Notes

[1] TEI "is a major international project to develop guidelines for the preparation of machine-readable texts for scholarly research and for the interchange of such texts" (TEI introductory brochure, p. 1).
The Text Representation committee is chaired by Stig Johansson. The author is co-ordinating efforts related to character encoding, and wishes to express his appreciation to the other members of the TEI committee on Text Representation (especially Elli Mylonas and Robin Cover), a number of correspondents, and Gary Simons, for numerous helpful comments and suggestions.

[2] In fairness to EBCDIC, one may assume that very rudimentary translation software is supported, such as is invisibly applied as files pass through mixed-vendor computer networks. This poses relatively few problems.

[3] Stock phrases may at first appear to be an insignificant case, but they do need to be marked. One simple reason is that manuals of style make their italicization dependent upon whether they are common enough to be considered native, or whether they are still considered foreign. Also, when such phrases are borrowed from languages other than Latin, it is unlikely that their entire character repertoire will be available in the standard character set.

[4] EBCDIC (see Appendix 1) remains an important exception (even though IRV/EBCDIC translation utilities are ubiquitous): not quite all of the IRV characters can safely be transferred. However, all those that ISO 646 does not designate as redefinable for "national use" can be expected to transfer safely to EBCDIC and back.

[5] Along with users, I here include hardware and software engineers, font designers, network administrators, and any others whose decisions affect the mapping of characters between systems.

[6] The proposal of a registration mechanism requires further discussion and expansion if it is to stand.

[7] The reader will recognize that software exists that can encode text at any of these levels in order to get it through networks without information loss. Such software re-codes any non-IRV characters with reserved sequences of IRV characters. Examining such a file will, however, quickly convince the reader that it does not fit any plausible criteria of human readability.

[8] This approach to specifying sorting is only a beginning, and neglects the more complex (and interesting) cases.