
4 Characters and Character Sets

Part 2

Core Tags and General Rules

Computer systems vary greatly in the sets of characters they make available for use in electronic documents. This variety enables users with widely different needs to find computer systems suitable to their work, but it also complicates the interchange of documents among systems; hence the need for a chapter on this topic in these Guidelines.

Four character-set problems arise for the encoder of electronic texts:

  1. selecting a character set to use in creating, processing, or storing the electronic text
  2. preparing documents for interchange so that the characters within them are not corrupted in transit
  3. encoding characters which are not provided by the character set available on the computer system in use
  4. signaling shifts from one character set to another, e.g. from the Latin alphabet to Greek and back, or to a special symbol character set and back

This chapter describes the recommended solutions to these problems, in enough detail to satisfy the needs of most users. More detail and more technical information can be found in chapters 28, 30, 25, and 37.

4.1 Local Character Sets

No single character set is prescribed for use in TEI-encoded documents. Users may use any character set available to them, subject to the character set restrictions imposed by the SGML declaration. It is recommended that the character set used be documented by one or more writing system declarations (WSD), on which see below. In most cases, a predefined writing system declaration should be suitable. For writing system declarations provided as part of the TEI Guidelines, see chapter 37.

In general, it is most convenient to use a character set readily available on one's computer system, though for special purposes it may be preferable to customize the character set using software specialized for the purpose. Whether to use the usual character set or create a custom set depends on the documents being encoded, the staff support and tools available for customizing the character set, the user's technical facility, etc. A choice must be made between the perceived relative convenience of living with an existing character set and the effort of modifying it to suit one's documents more closely. Care should be taken, in particular where multi-byte character sets are used, that the syntactic distinctions required by an SGML processor are preserved. Where a suitable character set is defined by a national or international standard, local customization should implement the standard rather than inventing yet another non-standard character set. The choice must be made by each individual according to individual circumstances; no general recommendations are made here as to whether locally customized character sets should be used. In principle, however, for local processing, encoders should use whatever character set they find convenient.

4.1.1 Characters Available Locally

When the characters in a text exist in the local character set, the appropriate character codes should be used to represent them. Virtually all computer systems provide at least the 52 letters of the standard Latin alphabet, the ten decimal digits, and some basic punctuation marks.

Other characters, such as Latin characters with diacritics (e.g. ä or é) or non-Latin characters (e.g. Greek, Hebrew, Arabic, Cyrillic) are less universally provided. Oriental scripts pose particular problems because of the size of their character repertoires. If the local character set provides an `ä', however, there is normally no reason not to use it where that character appears in the text, unless the electronic text is to be moved frequently among machines, in which case one may wish to restrict the electronic text to characters known to translate well among machines. (For more information on moving characters among machines, see section 4.3 below.)

As noted above, full use of a local character set may require that the SGML declaration be modified to define all the characters used as legal SGML characters.

4.1.2 Characters Not Available Locally

Characters not available in the local character set should usually be encoded using SGML entity references. In SGML terms, an entity (described in more detail in section 2.7) is any named string of characters, from a single character to an entire system file. Entities are included in SGML documents by entity references. Lists of suggested names for all the characters and printers' symbols used by most modern European languages have been published by ISO and others. [see note 26]

For example, the standard entity name for the character `ä' is `auml'; a reference to the entity gives the name of the entity, preceded by an ampersand and followed by a semicolon. [see note 27] For example, consider the following German sentence: ``Trotz dieser langen Tradition sekundäranalytischer Ansätze wird man die Bilanz tatsächlich durchgeführter sekundäranalytischer Arbeiten aber in wesentlichen Punkten als unbefriedigend empfinden müssen.'' Using the standard names for a-umlaut and u-umlaut, one could transcribe this sentence thus:

 Trotz dieser langen Tradition sekund&auml;ranalytischer
 Ans&auml;tze wird man die Bilanz tats&auml;chlich
 durchgef&uuml;hrter sekund&auml;ranalytischer Arbeiten
 aber in wesentlichen Punkten als unbefriedigend
 empfinden m&uuml;ssen.
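As an illustration only, the substitution of entity references for characters missing from the local character set can be sketched in Python. The function name `to_entity_references` and the table `ENTITY_NAMES` are hypothetical and form no part of these Guidelines; the table covers only a handful of the ISO Added Latin 1 names, and a real conversion would need the full entity set.

```python
# Hypothetical table: a few characters and their ISO Added Latin 1 entity names.
ENTITY_NAMES = {
    "\u00e4": "auml",    # a with umlaut
    "\u00f6": "ouml",    # o with umlaut
    "\u00fc": "uuml",    # u with umlaut
    "\u00e9": "eacute",  # e with acute accent
    "\u00df": "szlig",   # German sharp s
}


def to_entity_references(text):
    """Return text with each character in the table written as &name;."""
    out = []
    for ch in text:
        if ch in ENTITY_NAMES:
            out.append("&" + ENTITY_NAMES[ch] + ";")
        elif ord(ch) < 128:
            # Plain 7-bit characters are passed through unchanged.
            out.append(ch)
        else:
            # No name known for this character: fail loudly rather than guess.
            raise ValueError("no entity name known for %r" % ch)
    return "".join(out)


print(to_entity_references("tats\u00e4chlich durchgef\u00fchrter"))
```

Raising an error for unmapped characters, rather than dropping them silently, is a design choice of this sketch: it keeps the encoder aware of every character for which an entity still has to be chosen or declared.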

As noted above, standard entity names have been defined for most characters used by languages written in the Latin alphabet, and for some other scripts, including Greek, Cyrillic, Coptic, and the International Phonetic Alphabet. A useful subset of these may be found in chapter 37.

Before an entity can be referred to, it must be declared. Standard public entity names can be declared en masse, by including within the DTD subset of the SGML document a reference to the standard public entity which declares them. The German document quoted above, for example, would have the following lines, or their equivalent, in its DTD subset:

 <!ENTITY % ISOLat1
        PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
 %ISOLat1;

Where no standard entity name exists, or where the standard name is felt unsuitable for some reason, the encoder may declare non-standard entities, using the normal SGML syntax. The replacement string for such entities may vary for different purposes, as in the following extended example.

In transcribing a manuscript, it might be desirable to distinguish among three distinct forms of the letter `r'. In the transcript, each of these forms will be encoded by an entity reference, for example: &r;, &r2;, and &r3;. Entity declarations must then be provided, within the DTD subset of the SGML document, to define these entities and specify a substitute string.

One possible set of declarations would be as follows:

 <!ENTITY r  'r[1]'><!-- most common form of 'r'  -->
 <!ENTITY r2 'r[2]'><!-- secondary form of 'r'    -->
 <!ENTITY r3 'r[3]'><!-- third form of 'r'        -->

The expansions shown above will simply flag each occurrence with a number in brackets to indicate which form of `r' appears in the manuscript.

Another (imaginary) set of declarations might use some printer-specific codes to produce visually distinct versions of the three letters in output:

 <!ENTITY so 'I'><!-- shift to alternate printer font -->
 <!ENTITY si 'I'> <!-- shift to basic printer font -->
 <!ENTITY r  'r'>             <!-- most common form of 'r'  -->
 <!ENTITY r2 '&so;r&si;'>     <!-- secondary form of 'r'    -->
 <!ENTITY r3 '&so;q&si;'>     <!-- third form of 'r'        -->

Here the replacement strings for each entity contain character references, for example &#27;, indicating the decimal value (or code point) of the character to be generated by the SGML processor. The assumption is that the sequence of bytes generated will drive a printer to produce a distinctive form of the letter. Such printer instructions are of their nature highly machine- and software-dependent; by allowing them to be confined to a single place (the entity declarations), SGML makes it easier to adapt such device-specific information to new environments.

Alternatively, the declarations can give all three forms the same appearance in the output:

 <!ENTITY r  'r'><!-- most common form of 'r'  -->
 <!ENTITY r2 'r'><!-- secondary form of 'r'    -->
 <!ENTITY r3 'r'><!-- third form of 'r'        -->

And finally, to ensure that the SGML output uses the same entity references for these characters as does the SGML input, one could use the following declarations.

 <!ENTITY r  CDATA '&r;' ><!-- most common form of 'r'  -->
 <!ENTITY r2 CDATA '&r2;'><!-- secondary form of 'r'    -->
 <!ENTITY r3 CDATA '&r3;'><!-- third form of 'r'        -->
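The effect of swapping one set of declarations for another, while the transcript itself stays unchanged, can be imitated with a short Python sketch. This is not an SGML parser: the function `expand`, the sample transcript, and the dictionary names are all hypothetical, and the sketch deliberately does not re-scan replacement strings, so nested references such as those in the printer-code example are not handled.

```python
import re

# Matches an entity reference such as &r2; (ampersand, name, semicolon).
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, declarations):
    """Replace each declared entity reference by its replacement string.

    Simplifications: replacement strings are not re-scanned, and
    references to undeclared entities are left exactly as written.
    """
    def substitute(match):
        name = match.group(1)
        return declarations.get(name, match.group(0))

    return REFERENCE.sub(substitute, text)


# A made-up word containing two of the three forms of 'r'.
transcript = "b&r;othe&r2;"

flagged = {"r": "r[1]", "r2": "r[2]", "r3": "r[3]"}  # numbered variants
uniform = {"r": "r", "r2": "r", "r3": "r"}           # one appearance for all

print(expand(transcript, flagged))
print(expand(transcript, uniform))
```

The transcript is written once; only the dictionary passed to `expand` changes, which mirrors the way the entity declarations confine presentation decisions to a single place.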

Wherever locally defined entities are used for the representation of characters in the text, whether to record presentational variants as in this example or to represent special symbols not present in standard character sets or public entity sets, it is recommended that the entities be documented in a writing system declaration; for details see chapter 25.

For transcriptions in scripts not supported by the local character set, entity references may prove unwieldy. In such cases, it is also possible to transliterate the material from its original script into the script of the local character set; like a customized local character set, a transliteration scheme should be documented with a writing system declaration. To avoid information loss, a reversible transliteration scheme (i.e. one in which it is possible to reconstruct the original writing from the transliteration) should be preferred; wherever possible, standard schemes should be preferred to ad hoc schemes. [see note 28]

For example, using the Beta code transcription developed for ancient Greek by the Thesaurus Linguæ Græcæ, [see note 29] one would transcribe the start of the Iliad of Homer thus:

 <l>*MH=NIN A)/EIDE QEA\ *PHLHI+A/DEW *)AXILH=OS
 OU)LOME/NHN, H(\ MURI/' *)AXAIOI=S A)/LGE' E)/QHKE,

Some standard transliteration schemes for commonly encoded languages are included in chapter 37.
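What reversibility means in practice can be shown with a small Python sketch covering only the 24 base letters of Beta code. It is a simplification made for illustration: diacritics, the `*` prefix for capitals, and the final form of sigma are not interpreted, and the names `from_beta` and `to_beta` are hypothetical. The Greek letters are generated from their Unicode code points so that no look-alike Latin letters can slip into the table.

```python
# Lowercase Greek alpha..omega (U+03B1..U+03C9), skipping final sigma (U+03C2).
GREEK = [chr(cp) for cp in range(0x03B1, 0x03CA) if cp != 0x03C2]

# Beta code letters in Greek alphabetical order (Q = theta, C = xi, X = chi, ...).
BETA = "ABGDEZHQIKLMNCOPRSTUFXYW"

BETA_TO_GREEK = dict(zip(BETA, GREEK))
GREEK_TO_BETA = {greek: beta for beta, greek in BETA_TO_GREEK.items()}

# The scheme is reversible only if no two letters share a transliteration.
assert len(BETA_TO_GREEK) == len(GREEK_TO_BETA) == 24


def from_beta(text):
    """Turn Beta code base letters into Greek; other characters pass through."""
    return "".join(BETA_TO_GREEK.get(ch, ch) for ch in text)


def to_beta(text):
    """Turn Greek base letters into Beta code; other characters pass through."""
    return "".join(GREEK_TO_BETA.get(ch, ch) for ch in text)


sample = "QEA"
print(from_beta(sample))
print(to_beta(from_beta(sample)) == sample)
```

Because the table is one-to-one, converting to Greek and back reproduces the input exactly, which is the property the recommendation above asks of a transliteration scheme.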

4.2 Shifting Among Character Sets

Many documents contain material from more than one language: loan words, quotations from foreign languages, etc. Since languages use a variety of writing systems, which in turn use a variety of character repertoires, shifts in language frequently go hand in hand with shifts in character repertoire and writing system. Since language change is frequently of importance in processing a document, even when no character set shift is needed, the encoding scheme defined here provides a global attribute lang to make it possible to mark language shifts explicitly. This attribute may also be used to trigger character-set shifting by application programs.

Some languages use more than one writing system. For example, some Slavic languages may be written either in the Latin or in the Cyrillic alphabet; some Turkic languages in Cyrillic, Latin, or Arabic script. In such cases, each writing system must be treated separately, as a separate `language'. Each distinct value of the lang attribute, therefore, represents a single natural language, a single writing system, and a single coded character set.

Each value used for the lang attribute must correspond to a writing system declaration suitable for the character set and writing system being used. The values should where possible be taken from the standard two- and three-letter language codes defined by the international standard ISO 639. [see note 30] When more than one writing system is used for the same language in the same document, suffixes should be added to the values from ISO 639. A selection of standard language codes, as well as a number of standard writing system declarations, are provided in chapter 37; other writing system declarations may be provided locally.

Like any global attribute, the lang attribute may be used on any element in the SGML document. To mark a technical term, for example, as being in a particular language, one may simply specify the appropriate language on the <term> tag (for which see 13):

<p lang=EN> ...
But then only will there be good ground of hope for the
further advance of knowledge, when there shall be received
and gathered together into natural history a variety of
experiments, which are of no use in themselves, but simply
serve to discover causes and axioms, which I call
<term lang=LA>Experimenta lucifera</term>, experiments of
<term>light</term>, to distinguish them from those which I
call <term lang=LA>fructifera</term>, experiments of
<term>fruit</term>.
<p>Now experiments of this kind have one admirable
property and condition:  they never miss or fail. ...

The form in which materials in different writing systems are processed or displayed is not specified by these Guidelines; if appropriate characters are not available locally, application software may choose to display the material in an appropriate transliteration, as a series of entity references, or in other forms. If the local system requires explicit escape sequences or locking-shift control functions to signal shifts to alternate character sets, these escape sequences may be embedded directly in the content of the appropriate element or may be supplied by the application software upon recognition of the language shift. However, escape sequences are vulnerable to loss or misinterpretation when documents are sent from site to site. For this reason, the method of embedding them directly in the document, while allowed within TEI-conformant interchange, is not recommended for general use. [see note 31]

4.3 Character Set Problems in Interchange

Electronic texts may be exchanged over electronic networks, through exchange of magnetic media (e.g. disk or tape), or by other means. In every case except the transmission of storage media from one machine to another machine of the same hardware type running the same operating system and using the same coded character set, the characters are subject to translation and interpretation, and hence to misinterpretation and corruption, by utility software working somewhere on the interchange path. Network gateways, tape-reading software, and disk utilities routinely translate from one character set to another before passing the data on. If the utility errs in identifying the character set, or if several utilities translate back and forth among character sets using non-reversible translations, the chances are good that characters will be garbled and information lost.

At this time (1994), it is recommended that in texts subject to `blind' interchange (that is, interchange between parties who do not or cannot make explicit agreements over the character set to be used in interchange) no characters should be used other than those in the international reference version (IRV) of ISO 646. [see note 32] Other characters should be represented by SGML entity references. For scripts (such as those of East Asia) which require a character repertoire larger than 255 characters, regional proposals for character and entity sets to be used in non-negotiated interchange must be developed and published internationally.

In some cases, even the characters in ISO 646 IRV are corrupted in transit; where interchange takes place over links not reliable for the full IRV character set, the characters subject to misinterpretation and corruption should be replaced by standard entities. At present, the characters least susceptible to loss or misinterpretation in transit among systems are those shown below, which represent a subset of the characters in IRV and may thus be called the ISO 646 subset.

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
" % & ' ( ) * + , - . / : ; < = > ? _

In interchange over any transmission link, the transmitted document should contain only those characters which safely survive transmission over the link; others should be represented with entity references, or with transliterations, as described above.
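A check of this kind is easy to automate. The following Python sketch, offered as an illustration rather than a prescribed procedure, reports which characters of a text fall outside the ISO 646 subset listed above. The function name `unsafe_characters` is hypothetical, and treating the space and the line break as safe is an assumption of the sketch, since the list above does not mention them.

```python
import string

# The nineteen punctuation marks of the ISO 646 subset listed above.
PUNCTUATION = "\"%&'()*+,-./:;<=>?_"

# Letters, digits, and the listed punctuation marks.
SAFE = set(string.ascii_letters + string.digits + PUNCTUATION)

# Assumption of this sketch: spaces and line breaks also travel safely.
SAFE |= {" ", "\n"}


def unsafe_characters(text):
    """Return the distinct characters outside the subset, in order of first use."""
    found = []
    for ch in text:
        if ch not in SAFE and ch not in found:
            found.append(ch)
    return found


print(unsafe_characters("empfinden m&uuml;ssen."))
print(unsafe_characters("empfinden m\u00fcssen!"))
```

A text that yields an empty list uses only the subset; any character reported would need to be replaced by an entity reference or a transliteration before blind interchange over an unreliable link.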

In blind interchange by means of magnetic or optical media, it is recommended that the document be encoded using some well documented and widely used standard character set.

In blind interchange over networks, it is recommended that the transmitted document contain only characters known to travel safely over the networks involved. In the most general case, those characters are the ISO 646 subset given above.

The problem of character data loss in document interchange is in principle a temporary one; eventually, it should disappear as improvements are made in network handling of character data and as more networks support larger character sets such as ISO 10646. [see note 33] The need for backward compatibility with older systems is likely, however, to ensure that care must be taken in blind interchange for the foreseeable future.

For further discussion of negotiated and non-negotiated interchange, see chapter 30.

4.4 The Writing System Declaration

Each language and writing system used in a document should be documented in a Writing System Declaration (WSD), which specifies:

The characters available in a writing system may be specified in the WSD for that writing system in one or more of the following ways:

Individual characters within a WSD are formally declared, where necessary, by providing the following information:

The writing system declaration is one of a set of auxiliary documents which provide documentation relevant to the processing of TEI texts. Auxiliary documents are themselves SGML documents, for which document type declarations are provided. The DTD for the Writing System Declaration is discussed in detail in chapter 25. Standard Writing System Declarations are provided in chapter 37.

