CE W 03: Use cases for the Character Encoding Extensions



Overview

In order to get a handle on the requirements and extent of a mechanism to extend the character encoding of a TEI document, this document collects use cases. Some of these are taken from actual existing projects and practice; others are constructed by the editors and contributors of this document.

In the following listing, each heading includes information about which language is to be extended, what purpose the extension has, and which project, if any, is actually using it. What follows is a prose description of the use case. The concluding paragraph attempts an evaluation, including any problems this approach might have for text encoders.[1]

Chinese: Adding characters to the document character set (CBETA)

The Chinese Buddhist Electronic Text Association (CBETA) is compiling an electronic version of the Chinese Buddhist Canon. This is a large-scale undertaking; so far, digitized versions of 56 volumes of printed text, or about 80 million characters, have been produced. The project uses an adapted version of the TEI DTD and makes the master XML files, as well as other derived formats, available on the Web and on CD-ROM. More information (for readers of Chinese) is available at www.cbeta.org.

CBETA uses Big5, a coded character set mostly used in Taiwan, as the base document character set. Characters that are not in Big5 are identified in the XML texts with a general entity reference of the type &CBnnnnn;, where nnnnn stands for the internal serial number of the character; this number is also a key to a database table where information about these characters is maintained. The information about these characters in the database includes:
  • mappings to other, larger encoding schemes (Unicode and the Mojikyo Font Institute),
  • information about "normalized" versions of the character,
  • a glyph expression that describes the character using other characters and a small set of operators,
  • radical, stroke count, four-corner number and other properties,
  • dictionary references,
  • readings of the character, if known.
Currently, the database of non-Big5 characters has more than 13000 entries. Only a relatively small number (around 1000) can be found in Unicode as of version 3.0, although this might increase now with version 3.2.
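The database-backed scheme described above can be sketched as a simple lookup table keyed by the serial number in the entity reference. All field names and values below are hypothetical illustrations, not the actual CBETA schema:

```python
# Hypothetical sketch of one record in the non-Big5 character database.
# Field names and values are illustrative only, not the actual CBETA schema.
gaiji_db = {
    "CB00006": {
        "unicode": None,           # no mapping in Unicode 3.0 for this character
        "mojikyo": "M021123",      # Mojikyo Font Institute number
        "normalized": "CB00123",   # key of a "normalized" variant (invented)
        "glyph_expr": "[a/b]",     # composition from other characters and operators
        "radical": 75,             # invented property values
        "strokes": 10,
        "readings": ["mou"],       # invented reading
    },
}

def lookup(ref):
    """Resolve an entity reference like '&CB00006;' to its database record."""
    return gaiji_db[ref.strip("&;")]

print(lookup("&CB00006;")["mojikyo"])   # M021123
```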

Each of the texts includes an entity replacement table that gives expansions of the entity references. In the early stages of the project, the database was used to generate these tables directly before parsing of the text. Depending on the purpose of the parsing (for example, the target format), different expansions were used.

Thus, for example, a text containing &CB00001; could at one time have had an associated table with something like
<!ENTITY CB00001 SYSTEM "/images/CB00001.GIF" >
or
<!ENTITY CB00001 '<FONT face="xx">&#x4e36;</FONT>' >
or even
<!ENTITY CB00001 'A+B' >
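The per-target table generation described above could be sketched as follows; the function, the target names and the record fields are hypothetical, reconstructed from the examples:

```python
# Sketch: emit a different <!ENTITY ...> declaration for the same character
# depending on the target format. Names and targets are hypothetical.
def entity_decl(cb_id, record, target):
    if target == "image":
        # external entity pointing at a glyph image
        return '<!ENTITY %s SYSTEM "/images/%s.GIF" >' % (cb_id, cb_id)
    if target == "html":
        # internal entity with inline markup as replacement text
        return "<!ENTITY %s '<FONT face=\"xx\">&#x%s;</FONT>' >" % (
            cb_id, record["unicode"])
    raise ValueError("unknown target: %s" % target)

print(entity_decl("CB00001", {"unicode": "4e36"}, "image"))
# <!ENTITY CB00001 SYSTEM "/images/CB00001.GIF" >
```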
This turned out to be cumbersome, so we later added an element to which the entity will always expand, with some of the information from the database supplied as attribute values:
<!ENTITY CB00006 "<gaiji cb='CB00006' des='[(¤ý*¥¨)/¤ì]' nor='íª' mojikyo='M021123' mofont='Mojikyo M104' mochar='6E82'/>" >

Our conversion scripts will then access the information they need and use that.
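A conversion script along these lines might read the attributes of the gaiji element directly; the attribute names come from the entity declaration above, but the rendering produced here is an invented example, not CBETA's actual output:

```python
import xml.etree.ElementTree as ET

# Parse a <gaiji/> element like the one in the entity declaration above
# and pick out the information a particular target format needs.
elem = ET.fromstring(
    "<gaiji cb='CB00006' mojikyo='M021123' mofont='Mojikyo M104' mochar='6E82'/>"
)

def expand_via_mojikyo(gaiji):
    """Render via the Mojikyo font; the output format is invented."""
    return '<span style="font-family: %s">&#x%s;</span>' % (
        gaiji.get("mofont"), gaiji.get("mochar"))

print(expand_via_mojikyo(elem))
# <span style="font-family: Mojikyo M104">&#x6E82;</span>
```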

In this case, there is much undocumented information (or information not presented in a coherent and machine-readable form), and a lot of the information is implicit in the business logic of the various scripts used. This is the kind of information that would need to move to some future WSD or similar extension mechanism to be usable according to a coherent model.

Old Norse: Providing a common set of additional glyphs and ligatures (Menota)

The Menota project has not only produced some excellent Guidelines for the encoding of Old Norse (see http://www.hit.uib.no/menota/guidelines/), but also gave birth to the ‘Medieval Unicode Font Initiative’, which tries to enumerate and systematize the encoding units needed for the transcription of Old Norse medieval manuscripts.

The following table shows the categories that are currently proposed in the ‘Medieval Unicode Font Initiative’ (see http://www.hit.uib.no/mufi/proposal/proposal-v1.html):
No.  Name of range                          Inventory  Allocated span
1    Mixed script characters                       19  E000–E0FF
2    Precomposed diacritical characters           183  E100–E1FF
3    Small capitals                                19  E200–E2FF
4    Enlarged minuscules                           28  E300–E3FF
5    Ligatures                                     15  E400–E4FF
6    Punctuation marks                              4  E500–E5FF
7    Base line abbreviation marks                  15  E600–E6FF
8    Combining abbreviation marks                  11  E700–E7FF
9    Precomposed abbreviated characters             8  E800–E8FF
10   Superscript (interlinear) characters          22  E900–E9FF
11   Metrical symbols                              12  EA00–EAFF
12   Critical and epigraphical signs                4  EB00–EBFF
Total number of characters included in this proposal: 340

A closer look at the tables reveals that many of the encoded units can be represented in Unicode with ligature joiners, combining diacritics and the like. Others, such as small capitals, enlarged minuscules and superscript characters, are a combination of a specific rendering requirement and a character; it seems to me that these features might be better represented with markup.

This project defines a large number of PUA code points, many of which could be seen as syntactic sugar for features already expressible with Unicode combining characters and markup. To accommodate this kind of usage (which might very well be considered a user requirement), the WSD-NG will have to record the mapping between these shortcuts and the standard Unicode representation. Furthermore, the issue of how to decide whether to use specialized characters or markup should be raised.
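A record of such a mapping could look like the following sketch, which expands invented PUA assignments either into standard combining sequences or into markup. The PUA code points loosely follow the MUFI range layout but are not actual MUFI assignments, and the element name is likewise invented:

```python
import unicodedata

# Invented PUA assignments, loosely following the MUFI range layout:
# one "precomposed diacritical" character and one "small capital".
COMBINING = {"\uE100": "e\u0304"}                   # e + COMBINING MACRON
MARKUP = {"\uE200": '<hi rend="smallcap">a</hi>'}   # small-capital a as markup

def expand(text):
    """Replace project PUA code points with their standard representation."""
    out = []
    for ch in text:
        if ch in COMBINING:
            out.append(unicodedata.normalize("NFC", COMBINING[ch]))
        elif ch in MARKUP:
            out.append(MARKUP[ch])
        else:
            out.append(ch)
    return "".join(out)

print(expand("s\uE100r"))   # NFC composes e + macron into U+0113: sēr
```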

[not language specific] Precombined alternatives for characters expressible only as combined forms

As a generalization of the Menota case, it could be said that most display engines and operating systems still assume a 1:1 mapping between code point and glyph in a font. While Unicode considers combining characters together with a base character a sufficient definition, in practice it is often required, or at least desirable, to have a precomposed form of a character. The TEI WSD-NG should therefore also be able to provide the mapping between the canonical form of the Unicode Standard and a project-specific precombined form (although in such cases the canonical Unicode form should be used for interchange).
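One way to record such a mapping is a pair of translations: canonical Unicode for interchange, the precombined PUA form only for display. The PUA code point below is invented; n + COMBINING MACRON is chosen because Unicode offers no precomposed form for that combination:

```python
import unicodedata

# n + COMBINING MACRON has no precomposed Unicode character, so a project
# might assign a PUA code point for display. U+E010 here is invented.
DISPLAY = {"n\u0304": "\uE010"}
INTERCHANGE = {pua: seq for seq, pua in DISPLAY.items()}

def for_display(text):
    """Swap canonical combining sequences for the project's precombined forms."""
    text = unicodedata.normalize("NFD", text)
    for seq, pua in DISPLAY.items():
        text = text.replace(seq, pua)
    return text

def for_interchange(text):
    """Restore the canonical Unicode form before exchanging the text."""
    for pua, seq in INTERCHANGE.items():
        text = text.replace(pua, seq)
    return unicodedata.normalize("NFC", text)

# Round trip: display form goes out, canonical form comes back unchanged.
assert for_interchange(for_display("an\u0304a")) == "an\u0304a"
```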

Notes
1.
At this moment (2002-08-26), there are only 2 use cases, but we are actively looking for more.

Last recorded change to this page: 2007-09-16  •  For corrections or updates, contact webmaster AT tei-c DOT org