Outline of Content/Issues in Chapter 4 of TEI P4

Introduction
Languages and Character Sets: Preface materials
A simple character encoding model
Entry and display of characters
- Character input and entity references
- Transliteration schemes
Code shifting
The Writing System Declaration

Introduction

The following is an outline of Chapter 4, Languages and Character Sets, of the P4 release of the TEI Guidelines. It does not attempt to explain the concepts of that chapter; it is a summary listing of the issues the chapter covers, for use in reviewing its proposed revision. As such, it presumes knowledge of both Chapter 4 and related TEI information.

Each section of Chapter 4 of P4 is treated separately to allow readers to orient themselves with regard to the text.

Languages and Character Sets: Preface materials

Chapter 4 opens with a note and introductory materials on character sets:

A note on changes to the chapter; more changes are likely.
Computer systems vary in the characters available, which affects interchange; hence this chapter.
In the absence of a universal character set, lists the problems facing users (not all of them interchange problems).
Two-paragraph summary of the Unicode Consortium; the problems listed no longer obtain for a majority of users.
For users with character sets not covered by Unicode, the simplest methods may not apply.
Will go on to discuss an informal modern character encoding model, the benefits of Unicode, and recommendations for its use with XML. Also discusses legacy SGML data.

A simple character encoding model

Section 4.1 contains: 4.1.1 Some definitions; 4.1.2 Characters and glyphs; 4.1.3 Characters and their encoding; 4.1.4 Character semantics; and 4.1.5 Characters from the Private Use Area.

Some definitions

This section disentangles the term character by offering informal definitions of the following terms:

abstract character
character repertoire
glyph
coded character set
encoded character
character encoding

There is also discussion of the character vs. glyph distinction in historical documents.
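
To keep the terms above apart while reviewing, a minimal Python sketch may help. It is not taken from Chapter 4; the sample character and variable names are chosen here purely for illustration. It shows one abstract character, its code point in a coded character set (Unicode), and two different character encodings of that same code point.

```python
# Illustrative only: one abstract character, one code point, two encodings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (sample character, an assumption)

code_point = ord(ch)                   # position in the coded character set
utf8_bytes = ch.encode("utf-8")        # one character encoding of that code point
utf16_bytes = ch.encode("utf-16-be")   # another encoding of the same code point

print(f"U+{code_point:04X}", utf8_bytes.hex(), utf16_bytes.hex())
# prints: U+00E9 c3a9 00e9
```

The glyph is absent from the sketch on purpose: how the character is drawn depends on the font and rendering system, not on the encoded data.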

Characters and glyphs

This section offers standard as well as non-standard definitions for the following terms:

character (ISO 10646)
glyph (ISO 10646)
character (Unicode)
glyph (Unicode)
glyph image (Unicode)
grapheme (linguistics)
allograph (linguistics)

This section also notes:

Glyphs in Unicode standard are an aid to the reader and are not normative.
Discusses allograph and glyph.
Distinction between character and glyph is crucial for encoding texts, with reference to standard treatment of the model for that distinction.
Notes distinction between graphic appearance and the content represented by a character. Systems rely on the latter for searching.
Representing the glyphs used in a document instance requires either use of the proper Unicode code point or treatment at the markup level. The former requires documentation in the header or WSD; this does not guarantee proper display, but gives some chance of the intent being discovered.
There is no guidance on encoding glyph variation.

Characters and their encoding

History of how there was once no universal character set, whereas XML now has one. Apart from repeating advice, such as not relying upon glyph appearance in a particular Unicode font, the section offers little information not found elsewhere.

Character semantics

Brief discussion of the Unicode Character Database, which is defined as containing character properties and internal mappings, not semantics.

This section contains:

Pointers to the Unicode Character Database and a partial listing of properties.
Discussion of compatibility characters, but fails to note that such characters are no longer being accepted as proposals.
Brief treatment of normalization, recommends Normalization Form C.
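
The points above can be tried out with Python's standard `unicodedata` module, which exposes a subset of the Unicode Character Database. This is only a sketch: the sample strings and the helper name `describe` are assumptions made here, not material from Chapter 4.

```python
import unicodedata

def describe(text):
    """Return (code point, UCD name, general category) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
            for c in text]

decomposed = "e\u0301"   # base letter followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # Normalization Form C

print(describe(decomposed))   # two characters: a letter (Ll) and a mark (Mn)
print(describe(composed))     # one precomposed character

ligature = "\ufb01"           # LATIN SMALL LIGATURE FI, a compatibility character
print(unicodedata.normalize("NFC", ligature) == ligature)  # NFC keeps the ligature
print(unicodedata.normalize("NFKC", ligature))             # NFKC folds it to "fi"
```

Normalizing to Form C before comparing or indexing text means the precomposed and decomposed spellings of the same character are treated alike.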

Characters from the Private Use Area

Covers the private use area (PUA) of Unicode noting:

General information on PUA
Discourages the use of PUA for interchange
Notes local use may be convenient
Provides concrete example of using PUA
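
Because PUA characters are discouraged for interchange, a reviewer may want to check a text for them before exchanging it. The following is a minimal sketch, not the example given in the chapter; the function name `private_use_chars` and the sample string are hypothetical. It relies on the Unicode general category `Co` (private use), as reported by Python's `unicodedata` module.

```python
import unicodedata

def private_use_chars(text):
    """List (offset, code point) for every private-use character in text."""
    return [(i, f"U+{ord(c):04X}")
            for i, c in enumerate(text)
            if unicodedata.category(c) == "Co"]

sample = "a\ue000b"           # U+E000 is the first code point of the BMP private use area
print(private_use_chars(sample))
# prints: [(1, 'U+E000')]
```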

Entry and display of characters

Introduction to the entry of Unicode data using non-Unicode-aware software systems.

Character input and entity references

Issues covered:

Direct entry can be aided by shortcuts, but these are not covered since they depend on the local software.
Entity references can be used in lieu of direct entry, but must be declared before being used.
Entities unnecessary for common characters, but may be useful for representing characters that are not yet standardized but that the encoder wishes to distinguish.
Use of arbitrary entity names allows a different expansion for each entity.
Example of the three forms of r being declared and then used.
Noted that the three forms of r can be associated with different Unicode characters.
Multiple mapping works only if the font supports the various glyphs desired; a processing instruction can be used.
If rendering is not an issue and the text is for computational purposes only, entities can be used to tag the different forms explicitly.
Notes that WSD should be used with locally defined entities.
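
The declare-then-use pattern summarized above can be sketched as follows. This is not the chapter's own example: the entity names `r1`, `r2` and `r3` and the code points they expand to are assumptions chosen for illustration. The entities are declared in the internal DTD subset, and Python's standard `xml.etree.ElementTree` parser expands them on parsing.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity names; each form of r expands to a different character.
doc = """<!DOCTYPE p [
  <!ENTITY r1 "r">
  <!ENTITY r2 "&#xA75B;">
  <!ENTITY r3 "&#x027C;">
]>
<p>fo&r1; fo&r2; fo&r3;</p>"""

text = ET.fromstring(doc).text   # the parser replaces each reference with its expansion
print([f"U+{ord(c):04X}" for c in text if c not in "fo "])
# prints: ['U+0072', 'U+A75B', 'U+027C']
```

Changing only the declarations, for instance mapping all three entities to plain r, changes the expansion everywhere without touching the transcription itself.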

Transliteration schemes

The use of transliteration schemes and the need for reversible transliteration are noted. Beta code (TLG) is given as an illustration.

Noted that in an XML context there is no reason for transliteration other than convenience of data entry. Such schemes should be documented in a WSD. Transliteration is deprecated unless there is no alternative.
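
What "reversible" means here can be shown with a toy mapping. The sketch below covers only eight unaccented letters in the style of Beta code and is in no way a full implementation of that scheme; the table and function names are hypothetical. A scheme is reversible when the mapping is one-to-one, so that it can be inverted without loss.

```python
# Toy subset of a Beta-code-like mapping: Latin capitals to Greek lowercase.
TO_GREEK = {
    "A": "\u03b1",  # alpha
    "B": "\u03b2",  # beta
    "G": "\u03b3",  # gamma
    "D": "\u03b4",  # delta
    "E": "\u03b5",  # epsilon
    "L": "\u03bb",  # lambda
    "N": "\u03bd",  # nu
    "O": "\u03bf",  # omicron
}
TO_LATIN = {greek: latin for latin, greek in TO_GREEK.items()}

# One-to-one check: inverting the table must not merge any entries.
assert len(TO_LATIN) == len(TO_GREEK)

def transliterate(text, table):
    """Map every character through table; unmapped characters raise KeyError."""
    return "".join(table[c] for c in text)

greek = transliterate("LOGON", TO_GREEK)
print(greek, transliterate(greek, TO_LATIN))
```

Raising an error on unmapped characters, rather than passing them through, keeps the round trip honest: nothing can slip into the output that the inverse table cannot map back.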

Code shifting

General treatment of the issue of shift in language within the speech of a single speaker. Covers the following issues:

Languages that use more than one writing system need more than one lang attribute value, since each value represents both a single natural language and a single writing system.
Each value of the lang attribute corresponds to the identifier of a <language> element found in the <langUsage> element in the TEI header. The language element may reference a writing system declaration. References ISO 639, et al.
Documents use of lang attribute with example.
May have more than one language element. Examples given.
Inheritance of lang attribute noted.
Glyph appearance not specified in the Guidelines.
Reference to declaring a writing system declaration.
Use of xml:lang noted as inappropriate for language shifts.
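
The relationship between lang values, <language> identifiers, and inheritance can be sketched in Python. The fragment below is a heavily simplified, hypothetical P4-style document (it omits required header content and would not validate), and the function name `effective_langs` is an assumption. The sketch resolves the inherited lang value for each element and checks that every value used matches a declared <language> identifier.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical P4-style fragment; not valid against the TEI DTD.
sample = """<TEI.2 lang="en">
  <teiHeader><profileDesc><langUsage>
    <language id="en">English</language>
    <language id="fr">French</language>
  </langUsage></profileDesc></teiHeader>
  <text><body>
    <p>She said <foreign lang="fr">bonjour</foreign> and left.</p>
  </body></text>
</TEI.2>"""

def effective_langs(elem, inherited=None, out=None):
    """Collect (tag, lang) pairs; an element without lang inherits its parent's."""
    out = [] if out is None else out
    lang = elem.get("lang", inherited)
    out.append((elem.tag, lang))
    for child in elem:
        effective_langs(child, lang, out)
    return out

root = ET.fromstring(sample)
declared = {language.get("id") for language in root.iter("language")}
pairs = effective_langs(root)
resolved = dict(pairs)

print(resolved["p"], resolved["foreign"])            # prints: en fr
print({lang for _, lang in pairs} <= declared)       # every value used is declared
```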

The Writing System Declaration

Omitted from this summary since it is being dropped in P5.

Last recorded change to this page: 2007-09-16 • For corrections or updates, contact webmaster AT tei-c DOT org