CE W 03: Use cases for the Character Encoding Extensions



Overview

In order to get a handle on the requirements and extent of a mechanism to extend the character encoding of a TEI document, this document collects use cases. Some of these are taken from actual existing projects and practice; others are constructed by the editors and contributors of this document.

In the following listing, each heading includes information about which language is to be extended, what purpose the extension has, and which project, if any, is actually using it. What follows is a prose description of the use case. The concluding paragraph attempts an evaluation, including any problems this approach might have for text encoders.[1]

Chinese: Adding characters to the document character set (CBETA)

The Chinese Buddhist Electronic Text Association (CBETA) is compiling an electronic version of the Chinese Buddhist Canon. This is a large-scale undertaking; so far, digitized versions of 56 volumes of printed text, or about 80 million characters, have been produced. The project uses an adapted version of the TEI DTD and makes the master XML files, as well as other derived formats, available on the Web and on CD-ROM. More information (for readers of Chinese) is available at www.cbeta.org.

CBETA uses Big5, a coded character set mostly used in Taiwan, as the base document character set. Characters that are not in Big5 are identified in the XML texts with a general entity reference of the type &CBnnnnn;, where nnnnn stands for the internal serial number of the character; this number is also a key to a database table where information about these characters is maintained. The information about these characters in the database includes:
  • mappings to other, larger encoding schemes (Unicode and the Mojikyo Font Institute),
  • information about "normalized" versions of the character,
  • a glyph expression that describes the character using other characters and a small set of operators,
  • radical, stroke count, four-corner number and other properties,
  • dictionary references,
  • readings of the character, if known.
Currently, the database of non-Big5 characters has more than 13000 entries. Only a relatively small number (around 1000) can be found in Unicode as of version 3.0, although this might increase now with version 3.2.
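The database-backed scheme described above can be sketched as a simple lookup table keyed by the serial number in the entity reference. All field names and values below are hypothetical illustrations, not the actual CBETA schema:

```python
# Hypothetical sketch of one record in the non-Big5 character database.
# Field names and values are illustrative only, not the actual CBETA schema.
gaiji_db = {
    "CB00006": {
        "unicode": None,           # no mapping in Unicode 3.0 for this character
        "mojikyo": "M021123",      # Mojikyo Font Institute number
        "normalized": "CB00123",   # key of a "normalized" variant (invented)
        "glyph_expr": "[a/b]",     # composition from other characters and operators
        "radical": 75,             # invented property values
        "strokes": 10,
        "readings": ["mou"],       # invented reading
    },
}

def lookup(ref):
    """Resolve an entity reference like '&CB00006;' to its database record."""
    return gaiji_db[ref.strip("&;")]

print(lookup("&CB00006;")["mojikyo"])   # M021123
```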

Each of the texts includes an entity replacement table that gives expansions of the entity references. In the early stages of the project, the database was used to generate these tables directly before parsing of the text. Depending on the purpose of the parsing (for example, the target format), different expansions were used.

Thus, for example, a text containing &CB00001; could at one time have had an associated table with something like
<!ENTITY CB00001 SYSTEM "/images/CB00001.GIF" >
or
<!ENTITY CB00001 '<FONT face="xx">&#x4e36;</FONT>' >
or even
<!ENTITY CB00001 'A+B' >
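The per-target table generation described above could be sketched as follows; the function, the target names and the record fields are hypothetical, reconstructed from the examples:

```python
# Sketch: emit a different <!ENTITY ...> declaration for the same character
# depending on the target format. Names and targets are hypothetical.
def entity_decl(cb_id, record, target):
    if target == "image":
        # external entity pointing at a glyph image
        return '<!ENTITY %s SYSTEM "/images/%s.GIF" >' % (cb_id, cb_id)
    if target == "html":
        # internal entity with inline markup as replacement text
        return "<!ENTITY %s '<FONT face=\"xx\">&#x%s;</FONT>' >" % (
            cb_id, record["unicode"])
    raise ValueError("unknown target: %s" % target)

print(entity_decl("CB00001", {"unicode": "4e36"}, "image"))
# <!ENTITY CB00001 SYSTEM "/images/CB00001.GIF" >
```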
This turned out to be cumbersome, so we later added an element to which the entity will always expand, with some of the information from the database supplied as attribute values:
<!ENTITY CB00006 "<gaiji cb='CB00006' des='[(¤ý*¥¨)/¤ì]' nor='íª' mojikyo='M021123' mofont='Mojikyo M104' mochar='6E82'/>" >

Our conversion scripts will then access the information they need and use that.
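A conversion script along these lines might read the attributes of the gaiji element directly; the attribute names come from the entity declaration above, but the rendering produced here is an invented example, not CBETA's actual output:

```python
import xml.etree.ElementTree as ET

# Parse a <gaiji/> element like the one in the entity declaration above
# and pick out the information a particular target format needs.
elem = ET.fromstring(
    "<gaiji cb='CB00006' mojikyo='M021123' mofont='Mojikyo M104' mochar='6E82'/>"
)

def expand_via_mojikyo(gaiji):
    """Render via the Mojikyo font; the output format is invented."""
    return '<span style="font-family: %s">&#x%s;</span>' % (
        gaiji.get("mofont"), gaiji.get("mochar"))

print(expand_via_mojikyo(elem))
# <span style="font-family: Mojikyo M104">&#x6E82;</span>
```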

In this case, there is much undocumented information (or information not presented in a coherent and machine-readable form), and a lot of the information is implicit in the business logic of the various scripts used. This is the kind of information that would need to move to some future WSD or similar extension mechanism to be usable according to a coherent model.

Old Norse: Providing a common set of additional glyphs and ligatures (Menota)

The Menota project has not only produced some excellent Guidelines for the encoding of Old Norse (see http://www.hit.uib.no/menota/guidelines/), but also gave birth to the ‘Medieval Unicode Font Initiative’, which tries to enumerate and systematize the encoding units needed for the transcription of Old Norse medieval manuscripts.

The following table shows the categories that are currently proposed in the ‘Medieval Unicode Font Initiative’ (see http://www.hit.uib.no/mufi/proposal/proposal-v1.html):
No.  Name of range                          Inventory  Allocated span
1    Mixed script characters                       19  E000–E0FF
2    Precomposed diacritical characters           183  E100–E1FF
3    Small capitals                                19  E200–E2FF
4    Enlarged minuscules                           28  E300–E3FF
5    Ligatures                                     15  E400–E4FF
6    Punctuation marks                              4  E500–E5FF
7    Base line abbreviation marks                  15  E600–E6FF
8    Combining abbreviation marks                  11  E700–E7FF
9    Precomposed abbreviated characters             8  E800–E8FF
10   Superscript (interlinear) characters          22  E900–E9FF
11   Metrical symbols                              12  EA00–EAFF
12   Critical and epigraphical signs                4  EB00–EBFF
Total number of characters included in this proposal: 340

A closer look at the tables reveals that many of the encoded units can be represented in Unicode with ligature joiners, combining diacritics and the like. Others, such as small capitals, enlarged minuscules and superscript characters, are a combination of a specific rendering requirement and a character; it seems to me that these features might be better represented with markup.

This project defines a large number of PUA code points, many of which could be seen as syntactic sugar for features already expressible with Unicode combining characters and markup. To accommodate this kind of usage (which might very well be considered a user requirement), the WSD-NG will have to record the mapping between these shortcuts and the standard Unicode representation. Furthermore, the issue of how to decide whether to use specialized characters or markup should be raised.
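A record of such a mapping could look like the following sketch, which expands invented PUA assignments either into standard combining sequences or into markup. The PUA code points loosely follow the MUFI range layout but are not actual MUFI assignments, and the element name is likewise invented:

```python
import unicodedata

# Invented PUA assignments, loosely following the MUFI range layout:
# one "precomposed diacritical" character and one "small capital".
COMBINING = {"\uE100": "e\u0304"}                   # e + COMBINING MACRON
MARKUP = {"\uE200": '<hi rend="smallcap">a</hi>'}   # small-capital a as markup

def expand(text):
    """Replace project PUA code points with their standard representation."""
    out = []
    for ch in text:
        if ch in COMBINING:
            out.append(unicodedata.normalize("NFC", COMBINING[ch]))
        elif ch in MARKUP:
            out.append(MARKUP[ch])
        else:
            out.append(ch)
    return "".join(out)

print(expand("s\uE100r"))   # NFC composes e + macron into U+0113: sēr
```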

[not language specific] Precombined alternatives for characters expressible only as combined forms

As a generalization of the Menota case, it could be said that most display engines and operating systems still assume a 1:1 mapping between code point and glyph in a font. While Unicode considers combining characters together with a base character a sufficient definition, in practice it is often required, or at least desirable, to have a precomposed form of a character. The TEI WSD-NG should therefore also be able to provide the mapping between the canonical form of the Unicode Standard and a project-specific precombined form (although in such cases the canonical Unicode form should be used for interchange).
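One way to record such a mapping is a pair of translations: canonical Unicode for interchange, the precombined PUA form only for display. The PUA code point below is invented; n + COMBINING MACRON is chosen because Unicode offers no precomposed form for that combination:

```python
import unicodedata

# n + COMBINING MACRON has no precomposed Unicode character, so a project
# might assign a PUA code point for display. U+E010 here is invented.
DISPLAY = {"n\u0304": "\uE010"}
INTERCHANGE = {pua: seq for seq, pua in DISPLAY.items()}

def for_display(text):
    """Swap canonical combining sequences for the project's precombined forms."""
    text = unicodedata.normalize("NFD", text)
    for seq, pua in DISPLAY.items():
        text = text.replace(seq, pua)
    return text

def for_interchange(text):
    """Restore the canonical Unicode form before exchanging the text."""
    for pua, seq in INTERCHANGE.items():
        text = text.replace(pua, seq)
    return unicodedata.normalize("NFC", text)

# Round trip: display form goes out, canonical form comes back unchanged.
assert for_interchange(for_display("an\u0304a")) == "an\u0304a"
```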

Notes
1.
At this moment (2002-08-26), there are only 2 use cases, but we are actively looking for more.

Last recorded change to this page: 2007-09-16  •  For corrections or updates, contact webmaster AT tei-c DOT org