CE W 06: Representation of non-standard characters and glyphs


Overview

Text encoders sometimes encounter situations in which the repertoire of characters and glyphs available in published standards does not seem sufficient to convey the material to be encoded. These Guidelines provide a mechanism for dealing with such situations, which is outlined in this chapter.

If encoders encounter some graphical unit in a document they want to render electronically, the first question that needs to be asked is: ‘Is this a character?’ To determine whether a particular graphical unit is a character or not, see Terminology and key concepts.

If the unit is indeed determined to be a character, the next question is, ‘Has this character been encoded already?’ In order to determine if a character has been encoded, users should follow a number of steps:
  • Check the Unicode website (www.unicode.org, first reading the webpage "Where is my Character?" http://unicode.org/standard/where/, then the code charts). Alternatively, users can check the latest published version of The Unicode Standard, though the website is often more up to date than the printed version and should be consulted in preference.

    The pictures (‘glyphs’) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines contained below under Annotating Characters.

  • Check the Proposed New Characters webpage (http://unicode.org/alloc/Pipeline.html) to see if the character is in line for approval.
  • Ask on the Unicode email list to determine if a proposal is pending, or whether this is indeed a new character (or if this is not a character at all, in which case it would not be eligible for addition to the Unicode standard).

Since there are now close to 100000 characters in Unicode, chances are good that what you need is already there, but it might not be easy to find, since it might have a different name in Unicode. If a first search fails, try other sites, for example http://www.eki.ee/letter, which also allows searches based on scripts and languages.

An encoded character may be precomposed or it may be formed from base characters and combining diacritical marks. Either will suffice for a character to be "found" as an encoded character.
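The distinction between precomposed characters and combining sequences can be checked programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module, whose answers reflect whichever Unicode version ships with the interpreter rather than the latest code charts. The helper names `is_encoded` and `describe` are invented for this illustration.

```python
import unicodedata

# General categories that mean "no encoded character here":
# Cn = unassigned, Co = private use, Cs = surrogate.
NOT_ENCODED = {"Cn", "Co", "Cs"}


def is_encoded(text: str) -> bool:
    """Return True if every code point in `text` is an assigned,
    non-private-use character in this Python's Unicode database."""
    return all(unicodedata.category(ch) not in NOT_ENCODED for ch in text)


def describe(text: str) -> list[str]:
    """List the code points and character names that make up `text`."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


precomposed = "\u00e9"      # e with acute as a single code point
sequence = "e\u0301"        # base letter + COMBINING ACUTE ACCENT
no_precomposed = "q\u0303"  # q + COMBINING TILDE

# NFC composes a sequence wherever a precomposed form exists ...
assert unicodedata.normalize("NFC", sequence) == precomposed
# ... and leaves it as two code points where none exists.
assert len(unicodedata.normalize("NFC", no_precomposed)) == 2

for sample in (precomposed, sequence, no_precomposed, "\ue000"):
    print(describe(sample), is_encoded(sample))
```

All three forms of accented letter count as "found"; only the last sample, a Private Use Area codepoint, does not.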

If this first question has been considered and no suitable form has been found in such a repertoire, the next question will be: ‘Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character?’ If the character is determined to be missing from Unicode, it would be helpful to submit the new character for inclusion (see http://unicode.org/pending/proposals.html).

These guidelines will try to help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called annotation of a character, while the second case will be called adding of a new character. How to handle graphical units that represent variants will be discussed below under Annotating characters, while the problem of representing new characters will be dealt with under the section Adding new characters.

While there is some overlap between these requirements, separate, specialized markup constructs have been created for each of these cases, as explained in the section "Markup constructs for representing non-standard characters" below. The following sections then discuss how to apply them to the problems at hand, covering annotation in "Annotating characters" and finally creation in "Adding new characters".

Markup constructs for representing non-standard characters

The ‘TEI WSD-NG’ provides a mechanism to declare characters in addition to those that are available in the document character set. [Note: In most cases, the document character set will be Unicode. XML does however also allow a document to be in a subset of Unicode. In these cases the extensions declared by the ‘TEI WSD-NG’ might in fact be characters of Unicode, but outside of the document's subset.] Functionally, the ‘TEI WSD-NG’ is part of the TEI header, but for larger document collections it might be more convenient to maintain it separately and include it with the standard XML provisions.

The main function of the ‘TEI WSD-NG’ is to provide attributes for a character and optionally a handle to this character, if there is not already one. The list of attributes for characters is modelled on those in the Unicode Character Database, which distinguishes normative and informative character properties. Apart from that, additional attributes can be given. Since the list of properties will vary with different versions of The Unicode Standard, there might not be an exact correspondence with the list of properties defined in these Guidelines. If additional properties are required, they may be added under <addProp>.

The element <charDesc> contains a list of either <char> elements, each of which describes a character, or <glyph> elements, each of which provides a glyph and some additional information. Optionally, it can also hold a <desc> element with a general description and information pertaining to all characters or glyphs for which information is given.

The <char> element for adding new characters to the document character set

A <char> has a required attribute id, which holds an identifier for this character. The identifier can be freely chosen, but has to follow the restrictions on identifier names in XML, so it cannot start with a digit. A <char> can contain the following elements:
  • <charName> (required) A name to identify the character. For characters of non-ideographic scripts, a name following the conventions for Unicode names should be chosen. For ideographic scripts, an Ideographic Description Sequence (IDS) as described in Chapter 10.1 of The Unicode Standard is recommended where possible. These recommendations are given in an attempt to make blind interchange as successful as possible. Projects working in the same or neighbouring fields are well advised to coordinate and publish their lists of <charName>s in order to make data exchange even more successful. If an entity reference is used, a corresponding <addProp> element should be used to record this.
  • <normProp> (required) This is an empty element which takes a number of properties as its attribute values. More information about the normative character properties in Unicode can be found in The Unicode Standard, Version 3.0 (Addison-Wesley, p. 73, Table 4-1).

    Attribute list

    • ucs This gives the codepoint assigned to the character, if such an assignment is used.
    • general-category The general category (described in The Unicode Standard 4.5) is an assignment to some major classes and subclasses of characters. The value of this property has to be selected from the list of predeclared values. The default value is "Lo", which means "Letter, other". Please make sure the appropriate value for this attribute is provided, for example "Ll" for a lowercase letter.
    • canonical-combining-class This property exists for characters that are not used independently, but in combination with other characters. It records a class for these characters, which is used to determine how characters interact typographically. For more information, see The Unicode Standard 4.2.
    • directional-category All Unicode characters possess a directional type, which governs the application of the algorithm for bi-directional behaviour. The default for this category as defined in these Guidelines is "L", which means "Left-to-Right".
    • character-decomposition-mapping This is used to determine the relationship to other character(s). The Unicode Standard contains a list of tags used for this purpose.
    • numeric-value The numeric value (in decimal notation) of a character that expresses any kind of numeric value.
    • mirrored The mirrored character property is used to properly render characters such as U+0028, LEFT PARENTHESIS, independent of the text direction.
  • <infProp> (optional) A set of additional, informative properties is given for Unicode characters in The Unicode Standard. If encoders want to provide such properties, they should go here and use the same naming conventions as in The Unicode Standard.
  • <addProp> (optional) This element can hold a list of <prop> elements that give additional character properties. These properties do not parallel properties given in The Unicode Standard.
  • <desc> (optional) A prose description of the character. The type attribute can be used to categorize these descriptions.
  • <mapping> (optional, multiple occurrences possible) This element can contain one or more characters that have some kind of relationship to this character. The type of relationship is expressed with the type attribute on <mapping>. The <c> elements themselves can point to either another <char> or <glyph> element, or contain a character that is intended to be the target of this mapping. This could be used, among other things, to point to lowercase or uppercase equivalents of this character.

    Attribute list

    • type (required) The type of mapping. The typology used can be further explained in a suitable section of the encoding description in the header.
  • <glyphImg> (optional) This points to a place where a glyph image of this character can be found or might even contain a glyph description inline, for example in SVG. Several <glyphImg> can be given, for example for glyph images of different resolution, or different types of image data.

    Attribute list

    • type (optional) The type attribute can be used to record the data type of the image, for example using MIME types as described in RFC 2046 of the Internet Engineering Task Force.
  • <note> (optional) Any type of additional noteworthy information that would not be suitable to be contained in <desc>.
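To make the structure of a <char> declaration concrete, here is a minimal sketch of how a program might read one. It assumes Python's standard `xml.etree.ElementTree`, which does not read a DTD, so the two defaults stated in the attribute list above ("Lo" and "L") are applied by hand. The function name `read_char` and the sample declaration are hypothetical and invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Defaults as stated in the attribute list above; ElementTree does not
# process a DTD, so they are supplied explicitly here.
NORM_PROP_DEFAULTS = {"general-category": "Lo", "directional-category": "L"}


def read_char(char_xml: str) -> dict:
    """Parse one <char> declaration into a plain dictionary."""
    char = ET.fromstring(char_xml)
    name = char.findtext("charName")
    norm = char.find("normProp")
    if name is None or norm is None:
        raise ValueError("<charName> and <normProp> are required")
    # Explicit attribute values override the defaults.
    props = {**NORM_PROP_DEFAULTS, **norm.attrib}
    extra = {p.get("name"): (p.text or "") for p in char.iterfind("addProp/prop")}
    return {
        "id": char.get("id"),
        "name": name.strip(),
        "normProp": props,
        "addProp": extra,
    }


# Hypothetical declaration, invented for this illustration.
SAMPLE = """
<char id="exampleChar">
  <charName>LATIN SMALL LETTER EXAMPLE</charName>
  <normProp general-category="Ll"/>
  <addProp><prop name="entity">exampleChar</prop></addProp>
</char>
"""

info = read_char(SAMPLE)
print(info)
```

Because directional-category is not given in the sample, the sketch falls back to "L", while the explicit general-category "Ll" replaces the default "Lo".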

The <glyph> element for specifying how a character appears in the document

The <glyph> element is used to annotate a character, most often by providing a specific glyph that shows how a character appeared in the original document. A Unicode codepoint refers to a very general abstract character, which can be rendered with a huge number of possible representations. The <glyph> element is provided for cases where the encoder wants to point to a specific glyph out of all possible glyphs. It takes the following elements as its content:
  • <glyphName> (optional) A name to identify the glyph. The name should follow the same conventions as the <charName> above.
  • <addProp> (optional) This element can hold a list of <prop> elements that give additional character properties. These properties do not parallel properties given in The Unicode Standard.
  • <desc> (optional) A prose description of the character. The type attribute can be used to categorize these descriptions.
  • <mapping> (optional, multiple occurrences possible) This element can contain one or more characters that have some kind of relationship to this character. The type of relationship is expressed with the type attribute on <mapping>. The <c> elements themselves can point to either another <char> or <glyph> element, or contain a character that is intended to be the target of this mapping. This could be used, among other things, to point to lowercase or uppercase equivalents of this character.

    Attribute list

    • type (required) The type of mapping. If the character described in the current <glyph> element is a pre-modern form, this could for example be set to "modern" to indicate the equivalent modern character. The typology used can be further explained in a suitable section of the encoding description in the header.
  • <glyphImg> (optional) This points to a place where a glyph image of this character can be found or might even contain a glyph description inline, for example in SVG. Several <glyphImg> can be given, for example for glyph images of different resolution, or different types of image data.

    Attribute list

    • type (optional) The type attribute can be used to record the data type of the image, for example using MIME types as described in RFC 2046 of the Internet Engineering Task Force.
  • <note> (optional) Any type of additional noteworthy information that would not be suitable to be contained in <desc>.

The DTD fragment

Here is the DTD fragment for the ‘TEI WSD-NG’ corresponding to the above description:
<!-- charDesc DTD Fragment -->
<!-- character encoding extension entities -->
<!ENTITY % teiHeader 'IGNORE' >
<!ELEMENT teiHeader %om.RR; (fileDesc, charDesc?, encodingDesc*, profileDesc*, revisionDesc?)>

<!-- charDesc DTD Fragment
     drafted by CW [2002-12-08]
     last changed Time-stamp: "2003-04-03 15:59:03 JST chris"
     purpose: Hold declaration for additional characters -->

<!ELEMENT charDesc (desc?, (char | glyph)*) >

<!ELEMENT glyph (glyphName?, infProp?, addProp?, desc?, mapping*, glyphImg*, note?) >
<!ATTLIST glyph
    id     ID    #REQUIRED
    target CDATA #IMPLIED >
<!ELEMENT glyphName (#PCDATA)>
<!-- name of the character (should be built similar to Unicode character names) -->

<!ELEMENT char (charName, normProp, infProp?, addProp?, desc?, mapping*, glyphImg*, note?) >
<!ATTLIST char
    id     ID    #REQUIRED
    target CDATA #IMPLIED >
<!ELEMENT charName (#PCDATA)>
<!-- name of the character (should be built similar to Unicode character names) -->

<!ELEMENT normProp EMPTY >
<!-- Unicode normative properties -->
<!ATTLIST normProp
    ucs CDATA #IMPLIED
    general-category ( Lu | Ll | Lt | Lm | Lo | Mn | Mc | Me | Nd | Nl | No | Pc | Pd | Ps | Pe | Pi | Pf | Po | Sm | Sc | Sk | So | Zs | Zl | Zp | Cc | Cf | Cs | Co | Cn ) "Lo"
    canonical-combining-class CDATA #IMPLIED
    directional-category (L | LRE | LRO | R | AL | RLE | RLO | PDF | EN | ES | ET | AN | CS | NSM | BN | B | S | WS | ON) "L"
    character-decomposition-mapping CDATA #IMPLIED
    numeric-value CDATA #IMPLIED
    mirrored CDATA #IMPLIED >

<!ELEMENT infProp EMPTY >
<!ATTLIST infProp
    east-asian-width ( W | FW | N | HW | A | Neutral ) "A" >

<!ELEMENT addProp (prop)* >
<!ATTLIST addProp >
<!ELEMENT prop (#PCDATA) >
<!ATTLIST prop
    name CDATA #REQUIRED >

<!-- this might point to a font/glyphindex, SVG, or any other kind of image -->
<!ELEMENT glyphImg ANY >
<!ATTLIST glyphImg
    type CDATA #IMPLIED >

<!ELEMENT desc (#PCDATA | p)* >
<!ATTLIST desc
    type CDATA #IMPLIED >

<!ELEMENT mapping (#PCDATA | c)* >
<!-- the way things are currently defined, this means that any <c> occurring here will need to have an IDREF that points somewhere. This might be unwanted, but how do we solve this? -->
<!ATTLIST mapping
    type CDATA #REQUIRED >

<!-- while <char> is the definition of character properties, <c> is the reference to such a character. URI does not exist as a data type, so we use CDATA as a workaround -->
<!-- <!ELEMENT c (#PCDATA)> -->
<!ATTLIST c
    ref CDATA #REQUIRED>

Annotating characters

Annotating a character is necessary where a distinction based on only some aspects of a character (such as its graphical appearance) is desired. In a manuscript, for example, where three distinctly different types of the letter "r" can be recognized, it might be useful to distinguish them for exact rendering of the distinct forms [Note: If a facsimile is needed, a digital image of the manuscript page, linked to an encoded version of the text, is the recommended approach; see [FIXME: reference to the appropriate location in P5]. However, this does not allow for any arguments that might be based on the analysis of the distribution of such different forms, for which character annotation as described here provides a solution. It should be kept in mind, however, that any kind of text encoding is an abstraction and interpretation of the text at hand and will not be useful in reproducing an exact facsimile of a manuscript.], while retaining their "r"ness. For this purpose, two distinct <glyph> elements, one for each of the two special forms, can be defined as follows:
<charDesc>
  <glyph id="r1">
    <glyphName>LATIN SMALL LETTER R WITH A FUNNY STROKE</glyphName>
    <addProp><prop name="entity">r1</prop></addProp>
    <glyphImg>
      <!-- some image of the glyph goes here -->
    </glyphImg>
  </glyph>
  <glyph id="r2">
    <glyphName>LATIN SMALL LETTER R WITH TWO FUNNY STROKES</glyphName>
    <addProp><prop name="entity">r2</prop></addProp>
    <glyphImg>
      <!-- some image of the glyph goes here -->
    </glyphImg>
  </glyph>
</charDesc>
With these definitions in place, occurrences of these two special "r"s in the text can be annotated using the element <c> :
<p>Wo<c ref="#r1">r</c>ds in this manusc<c ref="#r2">r</c>ipt are sometimes written in a funny way.</p>

As can be seen in this example, the <glyph> element pointed to from the <c> element will be interpreted as an annotation on the content of the element <c>. It is thus possible to use this mechanism to indicate ligatures [Note: While technically this could be used to indicate abbreviations, within the framework of these Guidelines, it is recommended practice to employ the <abbr> element, see .]. With this markup in place, it will be possible to write programs to analyze the distribution of the different letters "r", produce ‘faithful’ renderings that use the original glyph, but also to produce normalized versions by simply ignoring the annotation pointed to by the element <c>. To make this kind of processing more efficient, the "type" attribute on <c> can be used, with an enumeration of different types and their usage documented in the TEI header.
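The two kinds of processing mentioned here, counting the distribution of annotated forms and producing a normalized text, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree`, takes the sample paragraph from above as input, and the function names `glyph_distribution` and `normalized_text` are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The annotated paragraph from the example above.
PARAGRAPH = (
    '<p>Wo<c ref="#r1">r</c>ds in this manusc<c ref="#r2">r</c>ipt '
    'are sometimes written in a funny way.</p>'
)


def glyph_distribution(elem: ET.Element) -> Counter:
    """Count how often each annotation target is referenced by a <c>."""
    return Counter(c.get("ref") for c in elem.iter("c"))


def normalized_text(elem: ET.Element) -> str:
    """Keep the content of every <c> but ignore the annotation it points to."""
    return "".join(elem.itertext())


p = ET.fromstring(PARAGRAPH)
print(glyph_distribution(p))
print(normalized_text(p))
```

A ‘faithful’ rendering would instead look up each ref value in the <charDesc> and substitute the glyph image recorded there.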

As can be seen from the above example, words interspersed with <c> elements for different types of "r" become difficult to read. An alternative and functionally identical way to express the same information is to use entity references and expand them in the DTD subset. The previous example could thus be rewritten like this:
<p>Wo&r1;ds in this manusc&r2;ipt are sometimes written in a funny way.</p>
The corresponding entity declaration would then be:
<!ENTITY r1 '<c ref="#r1">r</c>' >
<!ENTITY r2 '<c ref="#r2">r</c>' >

Since this mechanism employs markup objects to provide a link between a character in the document and some annotation on that character, it cannot be used in places where such markup constructs are not allowed, e.g. in attribute values.

Adding new characters

The creation of additional characters for use in text encoding is similar to annotating an existing character. The same element <c> is used to provide a link from the character instance in the text to the character definition in the document header (or elsewhere). The main difference is that the <c> element now points to a <char> element. Also, the content of this element could be empty. The element <c> could however also hold a codepoint from the Private Use Area (PUA) of The Unicode Standard, which is an area set aside for the very purpose of privately adding new characters to a document. Recommendations on how to assign such PUA characters are given in the following section.

Under certain circumstances, Han characters can be written within a circle. While this could be considered simply a facet of the rendering, it can also be considered a new, derived character, which will be in many ways similar to the original, non-circled character, but has a distinct rendering. The following example will provide the necessary markup to encode such an encircled character.

<charDesc>
  <char id="U4EBA-circled">
    <charName>CIRCLED IDEOGRAPH 4EBA</charName>
    <!-- IDS can not be used here, since a circle is not amongst the characters permitted by the grammar for IDS (cf. TUS 3.0, p. 269) -->
    <normProp ucs="" character-decomposition-mapping="circle"/>
    <!-- we do not assign a pua value -->
    <addProp>
      <prop name="daikanwa">36</prop>
    </addProp>
    <mapping type="standard">
      <c ref="U4EBA">人</c>
    </mapping>
  </char>
</charDesc>
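The link from a <c> element in the text to its <char> declaration could be followed by a program roughly as shown here. This is a sketch under assumptions: it uses Python's standard `xml.etree.ElementTree`, the text passage in `TEXT` is hypothetical, and the helper names `build_index`, `resolve` and `fallback` are invented for this illustration. The declaration is an abbreviated copy of the example above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the declaration shown above.
CHAR_DESC = """
<charDesc>
  <char id="U4EBA-circled">
    <charName>CIRCLED IDEOGRAPH 4EBA</charName>
    <normProp ucs="" character-decomposition-mapping="circle"/>
    <mapping type="standard"><c ref="U4EBA">人</c></mapping>
  </char>
</charDesc>
"""

# Hypothetical text passage that uses the declared character.
TEXT = '<p>A circled character: <c ref="#U4EBA-circled"/></p>'


def build_index(char_desc: ET.Element) -> dict:
    """Map each declared id to its <char> or <glyph> element."""
    return {e.get("id"): e for e in char_desc if e.tag in ("char", "glyph")}


def resolve(c: ET.Element, index: dict) -> ET.Element:
    """Follow a '#id' reference from a <c> element to its declaration."""
    ref = c.get("ref", "")
    if not ref.startswith("#") or ref[1:] not in index:
        raise KeyError(f"unresolved reference: {ref!r}")
    return index[ref[1:]]


def fallback(decl: ET.Element, mapping_type: str = "standard"):
    """Return the text of the first <mapping> of the given type, if any."""
    for m in decl.iterfind("mapping"):
        if m.get("type") == mapping_type:
            return "".join(m.itertext()).strip()
    return None


index = build_index(ET.fromstring(CHAR_DESC))
for c in ET.fromstring(TEXT).iter("c"):
    decl = resolve(c, index)
    print(decl.findtext("charName"), "->", fallback(decl))
```

The "standard" mapping gives a processor something to display when it cannot render the circled form itself.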

How to use codepoints from the Private Use Area

The developers of the Universal Character Set have set aside an area of the codespace for the private use of software vendors, user groups or individuals. As of this writing (Unicode 4.0), there are around 137000 codepoints available in this area, which should be enough for most needs. No codepoint assignments will be made to this area by standards bodies and only some very basic default properties have been assigned (which will be overwritten where necessary by the mechanism outlined in this chapter). Therefore, in contrast to all other codepoints of the UCS, PUA codepoints should not be used directly in documents intended for blind interchange. Instead of using PUA codepoints directly in the document content, entity references should be used. This will make it easier for receiving parties to find out what PUA characters are used in a document and where possible codepoint clashes with local use on the receiving side occur.

This mechanism is rather weak in cases where DOM trees or parsed XML fragments are exchanged, which might be increasingly the case. The best an application can do here is to treat any occurrence of a PUA character only in the context of the local document and use the properties provided through the <char> element as a handle to the character in other contexts.
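A receiving party that wants to check incoming text for directly used PUA codepoints could do so along the following lines. This is a minimal sketch in Python; the three ranges are the Private Use ranges of the Unicode codespace, while the function names `is_pua` and `find_pua` are invented for this illustration.

```python
# The three Private Use ranges of the Unicode codespace.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_pua(ch: str) -> bool:
    """Return True if the single character `ch` is a private-use codepoint."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_pua(text: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+XXXX') for every PUA codepoint found in `text`."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_pua(ch)]


# Total number of private-use codepoints across the three ranges.
total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
print(total)
print(find_pua("abc\ue000def\U000f0001"))
```

The computed total of 137468 matches the figure of around 137000 codepoints given above.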


Last recorded change to this page: 2007-09-16