<?xml version="1.0" encoding="utf-8"?>
<!--
Copyright TEI Consortium. 
Dual-licensed under CC-by and BSD2 licences 
See the file COPYING.txt for details.
$Date: 2012-10-03 17:09:11 +0100 (Wed, 03 Oct 2012) $
$Id: WD-NonStandardCharacters.xml 10900 2012-10-03 16:09:11Z rahtz $
-->
<div xmlns="http://www.tei-c.org/ns/1.0" type="div1" xml:id="WD" n="25">
    <head>Representation of Non-standard Characters and Glyphs</head>
    <p>Despite the availability of Unicode, text encoders still
    sometimes find that the published repertoire of available
    characters is inadequate to their needs. This is particularly the
    case when dealing with ancient languages, for which encoding
    standards do not yet exist, or where an encoder wishes to
    represent variant forms of a character or <term>glyphs</term>.
    The module defined by this chapter provides a mechanism to satisfy
    that need, while retaining compatibility with standards.
</p>
<div type="div2" xml:id="WDNE"><head>Is Your Journey Really Necessary?</head>
<p>When encoders encounter some graphical unit in a document which is
to be represented electronically, the first issue to be resolved
should be <q>Is this really a different character?</q> To determine
whether a particular graphical unit <emph>is</emph> a character or
not, see <ptr target="#D4-42"/>. </p>
    <p>If the unit is indeed determined to be a character, the next
    question should be <q>Has this character been encoded already?</q>
    To determine whether a character has already been encoded,
    encoders should take the following steps:
<list type="ordered">
      <item><p>Check the Unicode
      web site at <ptr target="http://www.unicode.org"/>, in particular the page <ref target="http://unicode.org/standard/where/">"Where is my
      Character?"</ref>, and the associated character code charts.
      Alternatively, users can check the latest published version of
      <title>The Unicode Standard</title> (<ref target="#CH-BIBL-3">Unicode Consortium (2006)</ref>), though the web site is
      often more up to date than the printed version, and should be
      checked for preference.</p> 
<p>The pictures (<soCalled>glyphs</soCalled>) in the Unicode code
charts are only meant to be representative, not definitive. If a
specific form of an already encoded character is required for a
project, refer to the guidelines contained below under <ref target="#D25-30">Annotating Characters</ref>. Remember that your
encoded document may be rendered on a system which has different fonts
from yours: if the specific form of a character is important to you,
then you should document it. </p></item>
      <item>Check the Proposed New Characters web page (<ptr target="http://unicode.org/alloc/Pipeline.html"/>) to see whether
      the character is in line for approval.</item>

<item>Ask on the Unicode email list (<ptr target="http://www.unicode.org/consortium/distlist.html"/>) to
see whether  a proposal is pending, or to determine whether this
character is considered eligible for addition
to the Unicode Standard.  </item>

</list> </p>
    <p>Since there are now well over 100,000 characters in Unicode,
    chances are good that what you need is already there, but it might
    not be easy to find, since it might have a different name in
    Unicode.  Look again, this time at other sites, for example <ptr target="http://www.eki.ee/letter"/>, which also provide searches
    based on scripts and languages. Take care, however, that all the
    properties of what seems to be a relevant character are consistent
    with those of the character you are looking for. For example, if
    your character is definitely a digit, but the properties of the
    best match you can find for it say that it is a letter, you may
    have a character not yet defined in Unicode.  </p>
<p>In general, it is advisable to avoid Unicode characters generally
described as presentation forms.<note place="bottom">Specifically,
characters in the Unicode blocks Alphabetic Presentation Forms, Arabic
Presentation Forms-A, Arabic Presentation Forms-B, Letterlike Symbols,
and Number Forms.</note> However, if the character you are looking for
is being used in a notation (rather than as part of the orthography of
a language) then it is quite acceptable to select characters from the
Mathematical Operators block, provided that they have the appropriate
properties (i.e. <code>So</code>: Symbol, Other; or <code>Sm</code>:
Symbol, Math).</p>
    <p>An encoded character may be precomposed or it may be formed
    from base characters and combining diacritical marks. Either will
    suffice for a character to be "found" as an encoded character. </p>
<p>If there are several possible Unicode characters to choose amongst,
it is good practice to consult other colleagues and practitioners to
see whether a consensus has emerged in favour of one or other of
them. </p>
<p>If, however, no suitable form of your character seems to exist,
    the next question will be: <q>Does the graphical unit in question
    represent a variant form of a known character, or does it
    represent a completely unencoded character?</q> If the character
    is determined to be missing from the Unicode Standard, it would be helpful to
    submit the new character for inclusion (see <ptr target="http://unicode.org/pending/proposals.html"/>).  </p>
    <p>These guidelines will help you proceed once you have
     identified a given graphical unit as either a variant or an
     unencoded character. Determining this will require knowledge of
     the contents of the document that you have. The first case will
     be called <emph>annotation</emph> of a character, while the
     second case will be called <emph>adding</emph> of a new
     character. How to handle graphical units that represent variants
     will be discussed below (<ptr target="#D25-30"/>),
     while the problem of representing new characters will be dealt
     with in section <ptr target="#D25-40"/>.    </p>
    <p>While there is some overlap between these requirements,
    distinct specialized markup constructs have been created for each
    of these cases as explained in section <ptr target="#D25-20"/>
    below.  The following sections will then proceed to discuss how to
    apply them to the problems at hand, discussing annotation of
    existing characters in section <ptr target="#D25-30"/> and finally
    creation of new ones in <ptr target="#D25-40"/>.</p>
   </div>
   <div type="div2" xml:id="D25-20">
<!--
    <head>Markup constructs for representing non-standard characters</head>
    <p>The gaiji module provides a mechanism to declare characters
    additional to those available from the document character set. XML
    allows for a document (or document component) to declare its
    encoding, thus restricting the characters that can be encoded
    directly within it without using numeric character references. For
    example, an XML document which begins <code>&lt;?xml version="1.0"
    encoding="iso-8859-1"?&gt;</code> can include non-ISO-8859-1
    characters only by representing them as numeric character
    references. In such a case, it might be convenient to declare as
    additional characters some characters already defined by
    the Unicode Standard. Generally speaking, however, the document character set
    will be Unicode, and this mechanism will be needed only for
    characters not defined by the Unicode Standard. </p>
-->
<head>Markup Constructs for Representation of Characters and Glyphs</head>
<p>An XML document can, in principle, contain any defined Unicode
character. The standard allows these characters to be represented
either directly, using an appropriate encoding (UTF-8 by default), or
indirectly by means of numeric character references (NCR), such as
<code>&amp;#196;</code> (A-umlaut). The encoder can also restrict the
range of characters which are represented directly in a document (or
part of it) by adding a suitable encoding declaration. For example, if
a document begins with the declaration <code>&lt;?xml version="1.0"
encoding="iso-8859-1"?&gt;</code>, any Unicode characters which are not
in the ISO-8859-1 character set must be represented by NCRs. </p>
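<p>As a minimal sketch of indirect representation, the following
fragment assumes a document encoded in ISO-8859-1 which needs to
include a Greek small letter alpha (U+03B1), a character chosen here
purely for illustration. Because that character lies outside the
declared encoding, it is supplied as a hexadecimal NCR; the decimal
form <code>&amp;#945;</code> would be equivalent:
<eg><![CDATA[<?xml version="1.0" encoding="iso-8859-1"?>
<p>The angle &#x3B1; is measured in degrees.</p>]]></eg>
</p>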
     <p>The <mentioned>gaiji</mentioned> module defined by this
     chapter adds a further way of representing specific characters
     and glyphs in a document. (Gaiji is from Japanese <seg
     xml:lang='ja'>外字</seg>, meaning <gloss>external
     characters</gloss>.) This allows the encoder to distinguish
     characters and glyphs which Unicode regards as identical, to add
     new nonstandard characters or glyphs, and to represent Unicode
     characters not available in the document encoding by an
     alternative means.</p>
    <p>The mechanism provided here consists functionally of two parts:
     <list type="ordered">
      <item>an element <gi>g</gi>, which serves as a proxy for new
       characters or glyphs</item>
      <item>elements <gi>char</gi> and <gi>glyph</gi>, providing information about such characters or glyphs; these elements are stored in the
       <gi>charDecl</gi> element in the header.</item>
     </list>
    </p>
<p>When the gaiji module is included in a schema, the
<gi>charDecl</gi> element is added to the <ident type="class">model.encodingDescPart</ident>
class, and the <gi>g</gi> element is added to the <ident type="class">model.gLike</ident> class. These
elements and their components are documented in the rest of this
section. </p>
<p>The Unicode Character Database records properties for every character
defined by the Unicode Standard, and knowledge of these properties is
usually built into text processing systems.  If the character
represented by the <gi>g</gi> element does not exist in Unicode at
all, its properties are not available. If the character represented is
an existing Unicode character, but is not available in the document
character set recognized by a given text processing system, it may
also be convenient to have access to its properties in the same way.
The <gi>char</gi> element makes it possible to store properties
for use by such applications in a standard way.</p>
<p> The list of attributes (properties) for characters is modelled on
those in the Unicode Character Database, which distinguishes
<term>normative</term> and <term>informative</term> character
properties. Additional, non-Unicode, properties may also be supplied.
Since the list of properties will vary with different versions of the
Unicode Standard, there may not be an exact correspondence between
them and the list of properties defined in these Guidelines.</p>

<p>Usage examples for these elements are given below at <ptr target="#D25-30"/> and <ptr target="#D25-40"/>.  The gaiji module
itself is formally defined in section <ptr target="#WSD-DEF"/>
below. It declares the following additional elements:
<specList>
<specDesc key="charDecl"/>
<specDesc key="g" atts="ref"/>
</specList>
The <gi>charDecl</gi> element is a member of the class <ident
type="class">model.encodingDescPart</ident>, and thus becomes
available within <gi>encodingDesc</gi> when this module is included in
a schema.  The <gi>g</gi> element is the only member of the class
<ident type="class">model.gLike</ident>: this class is referenced as
an alternative to plain text in almost every element which contains
plain text, thus permitting the <gi>g</gi> element also to appear at
such places when this module is included in a schema.
</p>
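<p>The relationship between the two parts might be sketched as
follows. The identifier <code>eg-r-long</code>, the glyph name, and
the sample text are invented for illustration only; the point is
simply that the declaration sits inside <gi>encodingDesc</gi> in the
header, while <gi>g</gi> appears in running text and points to the
declaration by means of its <att>ref</att> attribute:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><encodingDesc>
 <charDecl>
  <!-- hypothetical glyph declaration, stored in the header -->
  <glyph xml:id="eg-r-long">
   <glyphName>LATIN SMALL LETTER R WITH LONG DESCENDER</glyphName>
  </glyph>
 </charDecl>
</encodingDesc>
<!-- elsewhere, in the text -->
<p>a long-tailed <g ref="#eg-r-long">r</g> in running prose</p>
</egXML>
</p>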
<p>The following elements may appear within a <gi>charDecl</gi>
     element:
<specList>
<specDesc key="desc"/>
<specDesc key="char"/>
<specDesc key="glyph"/>
     </specList>
    </p>
<p>The <gi>char</gi> and <gi>glyph</gi> elements have similar contents
and are used in similar ways, but their functions are different. The
<gi>char</gi> element is provided to define a character which is not
available in the current document character set, for whatever reason,
as stated above. The <gi>glyph</gi> element is used to annotate a
character that has already been defined somewhere (either in the
document character set, or through a <gi>char</gi> element) by
providing a specific glyph that shows how a character appeared in the
original document. This is necessary since Unicode code points refer
not to a single, specific glyph shape of a character, but rather to a
set of glyphs, any of which may be used to render the code point in
question; in some cases they can differ considerably.</p>
<p>The <gi>glyph</gi> element is provided for cases where the encoder
wants to specify a specific glyph (or family of glyphs) out of all
possible glyphs. Unfortunately, due to the way Unicode has been
defined, there are cases where several glyphs that logically belong
together have been given separate code points, especially in the blocks
defining East Asian characters. In such cases, <gi>glyph</gi> elements
can also be used to express the view that these apparently distinct
characters are to be regarded as instances of the same character (see
further <ptr target="#D25-30"/>).</p>
<p>The Unicode Standard recommends naming conventions which should be
followed strictly where the intention is to annotate an existing
Unicode character, and which may also be used as a model when
creating new names for characters or glyphs<note>It should be noted,
however, that this naming convention cannot meaningfully be applied to
East Asian characters; the typical Unicode descriptions for these
characters take the form <q>CJK Unified Ideograph
<code>U+4E00</code></q>, where <code>U+4E00</code> is simply the
Unicode code point value of the character in question.  In cases where
no Unicode code point exists, there is little hope of finding a name
that helps to identify the character. Names should therefore be
constructed in a way meaningful to local practice, for example by
using a reference number from a well-known character dictionary or a
project-specific serial number.</note>. For convenience of processing,
the following distinct elements are proposed for naming characters and
glyphs:
<specList>
<specDesc key="charName"/>
<specDesc key="glyphName"/>
</specList></p>
<p>Within both <gi>char</gi> and <gi>glyph</gi>, the following elements are available:
<specList>
<specDesc key="gloss"/>
<specDesc key="charProp"/>
<specDesc key="desc"/>
<specDesc key="mapping"/>
<specDesc key="figure"/>
<specDesc key="note"/>

</specList>
</p>

<p>Four of these elements (<gi>gloss</gi>, <gi>desc</gi>,
<gi>figure</gi>, and <gi>note</gi>) are defined by other TEI
modules, and their usage here is no different from their usage
elsewhere. The <gi>figure</gi> element, however, is used here only to
link to an image of the character or glyph under discussion, or to
contain a representation of it in SVG. The <gi>figure</gi> element may
contain more than one <gi>graphic</gi>
element, for example to provide images with different
resolution, or in different formats, or may itself be repeated. As
elsewhere, the <att>mimeType</att> attribute
of <gi>graphic</gi> should be used to specify
the format of the image.</p>
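<p>One possible arrangement, given here only as a sketch, is shown
below. The identifier, glyph name, and file names are placeholders
invented for this illustration; the values of <att>mimeType</att> are
the registered media types for PNG and SVG images:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><charDecl>
 <glyph xml:id="eg-d-looped">
  <glyphName>LATIN SMALL LETTER D WITH LOOPED ASCENDER</glyphName>
  <figure>
   <!-- the same glyph at two resolutions, and as a vector image -->
   <graphic url="d-looped-lowres.png" mimeType="image/png"/>
   <graphic url="d-looped-hires.png" mimeType="image/png"/>
   <graphic url="d-looped.svg" mimeType="image/svg+xml"/>
  </figure>
 </glyph>
</charDecl>
</egXML>
</p>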

<p>The <gi>mapping</gi> element is similar to the standard TEI
<gi>equiv</gi> element. While the latter is used to express
correspondence relationships between TEI concepts or elements and
those in other systems or ontologies, the former is used to express
any kind of relationship between the character or glyph under
discussion and characters or glyphs defined elsewhere. It may contain
any Unicode character, or a <gi>g</gi> element linked to some other
<gi>char</gi> or <gi>glyph</gi> element, if, for example, the
intention is to express an association between two non-standard
characters. The type of association is indicated by the
<att>type</att> attribute, which may take such values as
<code>exact</code> for exact equivalences, <code>uppercase</code> for
uppercase equivalences, <code>lowercase</code> for lowercase
equivalences, <code>standard</code> for standardized forms, and
<code>simplified</code> for simplified characters, etc., as in the
following example: <egXML xmlns="http://www.tei-c.org/ns/Examples"><charDecl>
<char xml:id="aenl">
<charName>LATIN LETTER ENLARGED SMALL A</charName>
<charProp>
<localName>entity</localName>
<value>aenl</value>
</charProp>
<mapping type="standard">a</mapping>
</char>
</charDecl>
</egXML> 
</p>
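<p>Where the target of a mapping is itself a non-standard character,
the content of <gi>mapping</gi> may be a <gi>g</gi> element rather
than a Unicode character. The following sketch assumes a further
variant of the enlarged small a declared above; the identifier
<code>eg-aenl-flourished</code>, the glyph name, and the choice of
<code>standard</code> as the type of association are illustrative
assumptions only:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><charDecl>
<glyph xml:id="eg-aenl-flourished">
<glyphName>LATIN LETTER ENLARGED SMALL A WITH FLOURISH</glyphName>
<!-- the mapping points to the char element declared as "aenl" -->
<mapping type="standard"><g ref="#aenl"/></mapping>
</glyph>
</charDecl>
</egXML>
</p>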
<p>The <gi>mapping</gi> element may also be used to represent a mapping of the
character or (more likely) glyph under discussion onto a character
from the Unicode Private Use Area, as in this example:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><charDecl>
<glyph xml:id="z103">
<glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
<mapping type="standard">Z</mapping>
<mapping type="PUA">U+E304</mapping>
</glyph>
</charDecl>
</egXML> 
</p>
<p>A more precise documentation of the properties of any character or
glyph may be supplied using the generic <gi>charProp</gi> element
described in the next section. Despite its name, this element may be
used for either characters or glyphs.
</p>
<div type="div3" xml:id="ucsprops"><head>Character Properties</head>
<p>The Unicode Standard documents <soCalled>ideal</soCalled>
characters, defined by reference to a number of
<term>properties</term> (or attribute-value pairs) which they are said
to possess. For example, a lowercase letter is said to have the value
<code>Ll</code> for the property <code>general-category</code>. The
Standard distinguishes between <term>normative</term> properties
(i.e. properties which form part of the definition of a given
character), and <term>informative</term> or <term>additional</term>
properties which are not normative. It also allows for the addition of
new properties, and (in some circumstances) alteration of the values
currently assigned to certain properties. When making such
modifications, great care should be taken not to override standard
informative properties for characters which already exist in the Unicode
Standard, as documented in <ref target="#CH-eg-02">Freytag (2006)</ref>.</p>
<p>The <gi>charProp</gi> element allows an encoder to supply information
about a character or glyph. Where the information concerned relates to
a property which has already been identified in the Unicode Standard, encoders are
urged to use the appropriate Unicode property name. </p>
<p>The following elements are used to record character properties:
<specList>
<specDesc key="unicodeName"/>
<specDesc key="localName"/>
<specDesc key="value"/>
</specList>
For each property, the encoder must supply either a
<gi>unicodeName</gi> or a <gi>localName</gi>, followed by a
<gi>value</gi>. </p>
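<p>For instance, the following sketch records one property known to
the Unicode Standard and one purely local property for the same
character. The character, its identifier, and the local property
<code>entity</code> with its value <code>kbar2</code> are all
assumptions made for the purpose of illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><char xml:id="eg-k-barred">
 <charName>LATIN SMALL LETTER K WITH DOUBLE BAR</charName>
 <!-- a property named as in the Unicode Standard -->
 <charProp>
  <unicodeName>general-category</unicodeName>
  <value>Ll</value>
 </charProp>
 <!-- a project-specific property -->
 <charProp>
  <localName>entity</localName>
  <value>kbar2</value>
 </charProp>
</char>
</egXML>
</p>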
<p>For convenience, we list here some of the normative character
properties and their values. For full information, refer to chapter 4 of
<title>The Unicode Standard</title>, or the online documentation of the Unicode Character Database.
<list type="gloss">
	 <label>general-category</label> <item>The general
	  category (described in the Unicode Standard, chapter 4, section 5) assigns each
	  character to a major class and subclass.  Suggested
	  values for this property are listed here:
<table>
<row><cell><code>Lu</code></cell><cell>Letter, uppercase</cell></row>
<row><cell><code>Ll</code></cell><cell>Letter, lowercase</cell></row>
<row><cell><code>Lt</code></cell><cell>Letter, titlecase</cell></row>
<row><cell><code>Lm</code></cell><cell>Letter, modifier</cell></row>
<row><cell><code>Lo</code></cell><cell>Letter, other</cell></row>
<row><cell><code>Mn</code></cell><cell>Mark, nonspacing</cell></row>
<row><cell><code>Mc</code></cell><cell>Mark, spacing combining</cell></row>
<row><cell><code>Me</code></cell><cell>Mark, enclosing</cell></row>
<row><cell><code>Nd</code></cell><cell>Number, decimal digit</cell></row>
<row><cell><code>Nl</code></cell><cell>Number, letter</cell></row>
<row><cell><code>No</code></cell><cell>Number, other</cell></row>
<row><cell><code>Pc</code></cell><cell>Punctuation, connector</cell></row>
<row><cell><code>Pd</code></cell><cell>Punctuation, dash</cell></row>
<row><cell><code>Ps</code></cell><cell>Punctuation, open</cell></row>
<row><cell><code>Pe</code></cell><cell>Punctuation, close</cell></row>
<row><cell><code>Pi</code></cell><cell>Punctuation, initial quote</cell></row>
<row><cell><code>Pf</code></cell><cell>Punctuation, final quote</cell></row>
<row><cell><code>Po</code></cell><cell>Punctuation, other</cell></row>
<row><cell><code>Sm</code></cell><cell>Symbol, math</cell></row>
<row><cell><code>Sc</code></cell><cell>Symbol, currency</cell></row>
<row><cell><code>Sk</code></cell><cell>Symbol, modifier</cell></row>
<row><cell><code>So</code></cell><cell>Symbol, other</cell></row>
<row><cell><code>Zs</code></cell><cell>Separator, space</cell></row>
<row><cell><code>Zl</code></cell><cell>Separator, line</cell></row>
<row><cell><code>Zp</code></cell><cell>Separator, paragraph</cell></row>
<row><cell><code>Cc</code></cell><cell>Other, control</cell></row>
<row><cell><code>Cf</code></cell><cell>Other, format</cell></row>
<row><cell><code>Cs</code></cell><cell>Other, surrogate</cell></row>
<row><cell><code>Co</code></cell><cell>Other, private use</cell></row>
<row><cell><code>Cn</code></cell><cell>Other, not assigned</cell></row>
</table>
	 </item>
	 <label>directional-category</label> 
<item>This property applies to all Unicode characters. It governs the
application of the algorithm for bi-directional behaviour, as further
specified in Unicode Standard Annex #9, <title>The Bidirectional
Algorithm</title>. The following 19 different values are currently
defined for this property in <ref target="#WD-bibl-01">Davis et al (2006)</ref>:
<table>
<row><cell><code>L</code></cell><cell>left to right</cell></row>
<row><cell><code>LRE</code></cell><cell>left to right embedding</cell></row>
<row><cell><code>LRO</code></cell><cell>left to right override</cell></row>
<row><cell><code>R</code></cell><cell>right to left</cell></row>
<row><cell><code>AL</code></cell><cell>right to left Arabic</cell></row>
<row><cell><code>RLE</code></cell><cell>right to left embedding</cell></row>
<row><cell><code>RLO</code></cell><cell>right to left override</cell></row>
<row><cell><code>PDF</code></cell><cell>Pop Directional Format</cell></row>
<row><cell><code>EN</code></cell><cell>European Number</cell></row>
<row><cell><code>ES</code></cell><cell>European Number Separator</cell></row>
<row><cell><code>ET</code></cell><cell>European Number Terminator</cell></row>
<row><cell><code>AN</code></cell><cell>Arabic Number</cell></row>
<row><cell><code>CS</code></cell><cell>Common Number Separator</cell></row>
<row><cell><code>NSM</code></cell><cell>Non-spacing Mark</cell></row>
<row><cell><code>BN</code></cell><cell>Boundary Neutral</cell></row>
<row><cell><code>B</code></cell><cell>Paragraph separator</cell></row>
<row><cell><code>S</code></cell><cell>Segment separator</cell></row>
<row><cell><code>WS</code></cell><cell>Whitespace</cell></row>
<row><cell><code>ON</code></cell><cell>Other neutrals</cell></row>
</table></item>
	 <label>canonical-combining-class</label> <item>This
	  property exists for characters that are not used
	  independently, but in combination with other characters, for
	  example combining diacritical marks such as accents.  It
	  records a class for these characters, which is used to
	  determine how they interact typographically and in what order
	  sequences of them are canonically arranged.  The following
	  values are defined in the Unicode Standard 5.0 (see <ref target="http://unicode.org/Public/UNIDATA/UCD.html#Canonical_Combining_Class_Values">Unicode
Character Database: Canonical Combining Class Values</ref>):
<table>
<row><cell><code>0</code></cell><cell>Spacing, split, enclosing, reordrant, and Tibetan subjoined </cell></row>
<row><cell><code>1</code></cell><cell>Overlays and interior </cell></row>
<row><cell><code>7</code></cell><cell>Nuktas </cell></row>
<row><cell><code>8</code></cell><cell>Hiragana/Katakana voicing marks </cell></row>
<row><cell><code>9</code></cell><cell>Viramas </cell></row>
<row><cell><code>10</code></cell><cell>Start of fixed position classes </cell></row>
<row><cell><code>199</code></cell><cell>End of fixed position classes </cell></row>
<row><cell><code>200</code></cell><cell>Below left attached </cell></row>
<row><cell><code>202</code></cell><cell>Below attached </cell></row>
<row><cell><code>204</code></cell><cell>Below right attached </cell></row>
<row><cell><code>208</code></cell><cell>Left attached (reordrant around single base character) </cell></row>
<row><cell><code>210</code></cell><cell>Right attached </cell></row>
<row><cell><code>212</code></cell><cell>Above left attached </cell></row>
<row><cell><code>214</code></cell><cell>Above attached </cell></row>
<row><cell><code>216</code></cell><cell>Above right attached </cell></row>
<row><cell><code>218</code></cell><cell>Below left </cell></row>
<row><cell><code>220</code></cell><cell>Below </cell></row>
<row><cell><code>222</code></cell><cell>Below right </cell></row>
<row><cell><code>224</code></cell><cell>Left (reordrant around single base character) </cell></row>
<row><cell><code>226</code></cell><cell>Right </cell></row>
<row><cell><code>228</code></cell><cell>Above left </cell></row>
<row><cell><code>230</code></cell><cell>Above </cell></row>
<row><cell><code>232</code></cell><cell>Above right </cell></row>
<row><cell><code>233</code></cell><cell>Double below </cell></row>
<row><cell><code>234</code></cell><cell>Double above </cell></row>
<row><cell><code>240</code></cell><cell>Below (iota subscript) </cell></row>
</table></item>
<label>character-decomposition-mapping</label> 
          <item>This property is defined for characters
	  which may be decomposed, for example to a canonical form
	  plus a typographic variation of some kind. For such characters the Unicode Standard specifies both
	  a decomposition type and a decomposition mapping
	  (i.e. another Unicode character to which this one may be
	  mapped in the way specified by the decomposition type). The
	  following types of mapping are defined in the Unicode Standard:
<table>
<row><cell><code>font</code></cell><cell>   A font variant (e.g. a blackletter form)</cell></row>
<row><cell><code>noBreak</code></cell><cell>   A no-break version of a space or hyphen</cell></row>
<row><cell><code>initial</code></cell><cell>   An initial presentation form (Arabic)</cell></row>
<row><cell><code>medial</code></cell><cell>   A medial presentation form (Arabic)</cell></row>
<row><cell><code>final</code></cell><cell>   A final presentation form (Arabic)</cell></row>
<row><cell><code>isolated</code></cell><cell>   An isolated presentation form (Arabic)</cell></row>
<row><cell><code>circle</code></cell><cell>   An encircled form</cell></row>
<row><cell><code>super</code></cell><cell>   A superscript form</cell></row>
<row><cell><code>sub</code></cell><cell>   A subscript form</cell></row>
<row><cell><code>vertical</code></cell><cell>   A vertical layout presentation form</cell></row>
<row><cell><code>wide</code></cell><cell>   A wide (or zenkaku) compatibility character</cell></row>
<row><cell><code>narrow</code></cell><cell>   A narrow (or hankaku) compatibility character</cell></row>
<row><cell><code>small</code></cell><cell>   A small variant form (CNS compatibility)</cell></row>
<row><cell><code>square</code></cell><cell>   A CJK squared font variant</cell></row>
<row><cell><code>fraction</code></cell><cell>   A vulgar fraction form</cell></row>
<row><cell><code>compat</code></cell><cell>   Otherwise-unspecified compatibility character</cell></row>
</table>
</item>
	 <label>numeric-value</label> <item>This property applies to
	 any character which expresses any kind of numeric value. Its
	 value is the intended value in decimal notation.</item>
<label>mirrored</label> <item>The mirrored
	 character property is used to render characters such
	 as U+0028, <code>LEFT PARENTHESIS</code>, correctly whatever
	 the text direction: it has the value <code>Y</code>
(character is mirrored) or <code>N</code> (character is not mirrored).</item>
	</list></p>
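<p>Several of the properties listed above might be combined in a
single declaration, as in the following sketch of a hypothetical
nonspacing mark placed above its base character. The identifier and
the character name are invented for illustration; the property values
are taken from the tables above:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><char xml:id="eg-comb-spiral-above">
 <charName>COMBINING SPIRAL ABOVE</charName>
 <charProp>
  <unicodeName>general-category</unicodeName>
  <value>Mn</value>
 </charProp>
 <charProp>
  <unicodeName>directional-category</unicodeName>
  <value>NSM</value>
 </charProp>
 <charProp>
  <unicodeName>canonical-combining-class</unicodeName>
  <value>230</value>
 </charProp>
 <charProp>
  <unicodeName>mirrored</unicodeName>
  <value>N</value>
 </charProp>
</char>
</egXML>
</p>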
<p>The Unicode Standard also defines a set of informative (but non-normative)
properties for Unicode characters.  If encoders want to provide such
properties, they may be included using the suggested Unicode name,
tagged using the <gi>unicodeName</gi> element. However, encoders may
also supply other locally-defined properties, which must be named using
the <gi>localName</gi> element to distinguish them. If a Unicode name
exists for a given property, it should however always be preferred to
a locally defined name. Locally defined names should be used only for properties
which are not specified by the Unicode Standard.</p>
</div>
   </div>
   <div type="div2" xml:id="D25-30">
    <head>Annotating Characters</head>
    <p>Annotation of a character becomes necessary when it is desired
    to distinguish it on the basis of certain aspects (typically, its
    graphical appearance) only.  In a manuscript, for example, where
    distinctly different forms of the letter "r" can be recognized, it
    might be useful to distinguish them for analytic purposes, quite
    distinct from the need to provide an accurate representation of the
    page. A digital facsimile, particularly one linked to a
    transcribed and encoded version of the text, will always provide a
    superior visual representation (for information on how to link a
    digital facsimile to a transcribed text see <ptr target="#PHFAX"/>), but cannot be used to support arguments based
    on the distribution of such different forms. Character annotation
    as described here provides a solution to this problem.<note place="bottom"> It should be kept in mind that any kind of text
    encoding is an abstraction and an interpretation of the text at
    hand, which will not necessarily be useful in reproducing an exact
    facsimile of the appearance of a manuscript.</note> </p>

<p>Assuming that we wish to distinguish the variant glyphs from the
standard representation for the character concerned, we will need to
define distinct <gi>glyph</gi> elements, one for each of the forms of
the letter we wish to distinguish: <egXML xmlns="http://www.tei-c.org/ns/Examples"><charDecl>
  <glyph xml:id="r1">
  <glyphName>LATIN SMALL LETTER R WITH ONE FUNNY STROKE</glyphName>
   <charProp>
      <localName>entity</localName>
       <value>r1</value>
   </charProp>
   <figure><graphic url="r1img.png"/></figure>
 </glyph>
  <glyph xml:id="r2">
  <glyphName>LATIN SMALL LETTER R WITH TWO FUNNY STROKES</glyphName>
   <charProp>
      <localName>entity</localName>
       <value>r2</value>
   </charProp>
   <figure><graphic url="r2img.png"/></figure>
 </glyph>
</charDecl> </egXML>
     With these definitions in place, occurrences of these two special
     "r"s in the text can be annotated using the element <gi>g</gi>:
     <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <p>Wo<g ref="#r1">r</g>ds in this 
      manusc<g ref="#r2">r</g>ipt are sometimes
      written in a funny way.</p> </egXML></p>
    <p>
     As can be seen in this example, the <gi>glyph</gi> element pointed
     to from the <gi>g</gi> element will be interpreted as an
     annotation on the content of the element <gi>g</gi>.  This mechanism
     can be used to represent common manuscript abbreviations or ligatures, as in the
     following examples:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><p> ... <g ref="#Filig">Fi</g>lthy riches...</p>
<!-- in the charDecl -->
  <glyph xml:id="Filig">
   <glyphName>LATIN UPPER F AND LATIN LOWER I LIGATURE</glyphName>  
   <figure><graphic url="Filig.png"/></figure>
 </glyph>
</egXML>
<egXML xmlns="http://www.tei-c.org/ns/Examples"><p> ... <abbr><g
ref="#per">per</g></abbr> ardua</p>
<!-- in the charDecl -->
  <glyph xml:id="per">
   <glyphName>LATIN ABBREVIATION PER</glyphName>  
   <figure><graphic url="per.png"/></figure>
 </glyph>
  
</egXML>
(In fact the Unicode Standard does provide a character, U+FB01
      <code>LATIN SMALL LIGATURE FI</code>, for the lowercase
      <code>fi</code> ligature; the encoder may however prefer not to
      use it in order to simplify other text processing operations,
      such as indexing.)    </p>
<p>With this
     markup in place, it will be possible to write programs to analyze
     the distribution of the different letters "r" as well as produce
     more <soCalled>faithful</soCalled> renderings of the original. It
     will also be possible to produce normalized versions by simply ignoring
     the annotation pointed to by the element <gi>g</gi>.  <!-- To make
     this kind of processing more efficient, the "type" attribute on
     <gi>g</gi> can be used, with an enumeration of different
     types and their usage documented in the TEIHeader.-->
    </p>
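    <p>The normalization just described can be sketched as an XSLT
    stylesheet. This is an illustration only, not a recommended
    implementation: it assumes XSLT 1.0, a document in the TEI
    namespace, and <gi>g</gi> elements which contain the normalized
    form as their content, as in the examples above. It copies the
    document unchanged except that each <gi>g</gi> element is replaced
    by its own text content, so that the pointer to the glyph
    annotation is dropped:
 <eg><![CDATA[<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Identity template: copy anything not matched below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Normalize: keep only the text content of g,
       ignoring the annotation it points to -->
  <xsl:template match="tei:g">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>]]></eg>
    Under the same assumptions, the distribution of a given form could
    be examined with an XPath expression such as
    <code>count(//tei:g[@ref='#r1'])</code>.</p>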
    <p>For brevity of encoding, it may be preferable to predefine
    internal entities such as the following:
 <eg><![CDATA[<!ENTITY r1 '<g ref="#r1">r</g>' >
<!ENTITY r2 '<g ref="#r2">r</g>' >]]></eg>
which would enable the same material to be encoded as follows:
     <eg><![CDATA[<p>Wo&r1;ds in this manusc&r2;ipt are 
      sometimes written in a funny way.</p> ]]></eg>
    </p>
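    <p>As a sketch only, and assuming that the document is read by a
    parser which processes the internal DTD subset, entity
    declarations of this kind could be placed in the document type
    declaration at the start of the file. Everything in the following
    skeleton other than the two declarations themselves is an
    assumption made for the purpose of illustration:
 <eg><![CDATA[<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TEI [
  <!-- shorthand for the two annotated forms of "r" -->
  <!ENTITY r1 '<g ref="#r1">r</g>' >
  <!ENTITY r2 '<g ref="#r2">r</g>' >
]>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- teiHeader, containing the charDecl which defines
       the glyphs r1 and r2, followed by the text -->
</TEI>]]></eg>
    Because the replacement text of an internal entity is parsed at
    the point where the reference occurs, the resulting <gi>g</gi>
    elements take on whatever namespace is in force at that point.</p>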
<p>The same technique may be used to represent particular
abbreviation marks as well as to represent other characters or
glyphs. For example, if we believe that the r-with-one-funny-stroke is
being used as an abbreviation for <code>receipt</code>, this might be
represented as follows:<eg><![CDATA[<abbr>&r1;</abbr>]]></eg></p>
<!-- should become a choice element some time -->


    <p>Note however that this technique employs markup objects to
    provide a link between a character in the document and some
    annotation on that character. Therefore, it cannot be used in
    places where such markup constructs are not allowed, notably in
    attribute values. 
    </p>

    <p>The need to use these constructs to annotate or define
    characters arises frequently in Chinese, Korean, and Japanese
    documents, which raise some specific issues.
    There are two slightly different versions of the
    problem. In the first case, because of the way Unicode is defined,
    there are occasions when more than one glyph is defined for a
    character. On such occasions, one might want to retain the
    character as used, but add information of which a
    normalizer (for search or indexing operations) could take
    advantage. To achieve this, we simply define
    within a <gi>charDecl</gi> element a <gi>glyph</gi> that has two
    <gi>mapping</gi> elements, as shown here:
    <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:lang="zh">
      <charDecl>
	<glyph xml:id="u8aaa">
	  <mapping type="Unicode">說</mapping>
	  <mapping type="standard">説</mapping>
	</glyph>
      </charDecl>
    </egXML>
    The first of these <gi>mapping</gi>s, of type <val>Unicode</val>,
    simply maps our glyph to the code point where Unicode defined it.
    The other one, of type <val>standard</val>, encodes the fact that
    in our view, this glyph is a variation of the standard character
    given in the content of the element. We could then use this
    <gi>glyph</gi> element's unique identifier <val>u8aaa</val> to
    refer to it from within a text as follows:
    <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:lang="zh">
      <g ref="#u8aaa">說</g>
    </egXML>
    </p>
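    <p>A normalizer of the kind mentioned above might be sketched in
    XSLT as follows. The stylesheet is an illustration under stated
    assumptions rather than a definitive implementation: it assumes
    XSLT 1.0, a <att>ref</att> value which points to a
    <gi>glyph</gi> in the same document, and a <gi>mapping</gi> of
    type <val>standard</val> which contains the standard character
    directly; the key name <code>glyph-by-id</code> is invented for
    this example.
 <eg><![CDATA[<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Index glyph definitions by their identifier -->
  <xsl:key name="glyph-by-id" match="tei:glyph" use="@xml:id"/>

  <!-- Identity template: copy anything not matched below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace g by the standard form recorded for its glyph,
       falling back to the content of g if none is recorded -->
  <xsl:template match="tei:g[starts-with(@ref, '#')]">
    <xsl:variable name="std"
        select="key('glyph-by-id', substring-after(@ref, '#'))
                /tei:mapping[@type='standard']"/>
    <xsl:choose>
      <xsl:when test="$std">
        <xsl:value-of select="$std"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>]]></eg>
    Applied to the example above, such a stylesheet would replace the
    <gi>g</gi> element by the content of the <gi>mapping</gi> of type
    <val>standard</val>, so that a search index built from the result
    would contain only the standard form.</p>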
    <p>A slightly different but related problem occurs when we have
    multiple variants, none of which has been defined in Unicode. In
    this case, we need to define one as a new character using
    <gi>char</gi>, and the others as glyphs using <gi>glyph</gi>:
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <charDecl>
	<char xml:id="newchar1">
	  <!-- more properties here -->
	</char>
	<glyph xml:id="varofnewchar1">
	  <!-- more properties here -->
	  <mapping type="standard"><g ref="#newchar1"/></mapping>
	</glyph>
      </charDecl>
    </egXML>
    The <gi>char</gi> element defines a new character, while the
    <gi>glyph</gi> element then defines a variant glyph of this newly
    defined character. Additional properties should be specified
    to make both of these identifiable.</p>
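    <p>As an illustration only, the two definitions might be made
    identifiable by supplying a name and a reference image for each.
    The names and image files used here are hypothetical, chosen
    simply to show where such information would go:
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <charDecl>
	<char xml:id="newchar1">
	  <charName>HYPOTHETICAL NEW CHARACTER ONE</charName>
	  <figure><graphic url="newchar1.png"/></figure>
	</char>
	<glyph xml:id="varofnewchar1">
	  <glyphName>HYPOTHETICAL NEW CHARACTER ONE, VARIANT FORM</glyphName>
	  <mapping type="standard"><g ref="#newchar1"/></mapping>
	  <figure><graphic url="varofnewchar1.png"/></figure>
	</glyph>
      </charDecl>
    </egXML>
    </p>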
  </div>

   <div type="div2" xml:id="D25-40">
    <head>Adding New Characters</head>
    <p>The creation of additional characters for use in text encoding
    is quite similar to the annotation of existing characters.  The
    same element <gi>g</gi> is used to provide a link from the
    character instance in the text to a character definition provided
    within the <gi>charDecl</gi> element. This character definition
    takes the form of a <gi>char</gi> element.  The element <gi>g</gi>
    itself will usually be empty, but could contain a code point from
    the Private Use Area (PUA) of the Unicode Standard, which is an
    area set aside for the very purpose of privately adding new
    characters to a document.  Recommendations on how to use such PUA
    characters are given in the following section.  </p>
<p>In some circumstances, it may be desirable to provide a single
precomposed form of a character that is encoded in Unicode only as a
sequence of code points. For example, in Medieval
Nordic material, a character looking like a lowercase letter Y with a
dot and an acute-accent above it may be encountered so frequently that
the encoder wishes to treat it as a single precomposed character with
a single coded value. In the
transcription concerned, the encoder enters this letter as
<code>&amp;ydotacute;</code>, which can be expanded in one of three ways
when the transcription is processed, depending on the mapping in force.
The entity reference might be
translated into the sequence of corresponding Unicode code points,
or into some locally-defined PUA character
(say <code>&amp;#xE0A4;</code>) for local
processing only. Both of these options have disadvantages: the first
loses the fact that the sequence of composed characters is regarded as
a single object; the second is not reliably portable.
The third and recommended
representation is therefore to use the <gi>g</gi> element provided by
the module described in this chapter: <egXML xml:lang="und" xmlns="http://www.tei-c.org/ns/Examples"><g
 ref="#ydotacute"/></egXML>. This makes it possible for the encoder to
provide useful documentation for the particular character or glyph so referenced:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><char xml:id="ydotacute">
       <charName>LATIN SMALL LETTER Y WITH DOT ABOVE AND
       ACUTE</charName>
     <charProp>
       <localName>entity</localName>
       <value>ydotacute</value>
     </charProp>
     <mapping type="composed">&amp;#x0079;&amp;#x0307;&amp;#x0301;</mapping>
     <mapping type="PUA">U+E0A4</mapping> </char></egXML> This
     definition specifies the mapping between this composed character
     and the individual Unicode-defined code points which make it
     up. It also supplies a single locally-defined property
     (<soCalled>entity</soCalled>) for the character concerned, the
     purpose of which is to supply a recommended character entity name
     for the character.
</p>
<p>Under certain circumstances, Chinese Han characters can be written
    within a circle.  Rather than considering this as simply an aspect of the
    rendering, an encoder may wish to treat such circled characters as
    entirely distinct  derived characters. For a given character
    (say that represented by the numeric-character reference <code>&amp;#x4EBA;</code>)
    the circled variant might conveniently be represented as
    <egXML xml:lang="und" xmlns="http://www.tei-c.org/ns/Examples"> <g ref="#U4EBA-circled"/></egXML>, which references a
    definition such as the following:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><char xml:id="U4EBA-circled">
  <charName>CIRCLED IDEOGRAPH</charName>
  <charProp>
    <unicodeName>character-decomposition-mapping</unicodeName>
    <value>circle</value>
  </charProp>
  <charProp>
   <localName>daikanwa</localName>
   <value>36</value>
  </charProp>
  <mapping type="standard">
   &amp;#x4EBA;
  </mapping>
  <mapping type="PUA">
   &amp;#xE000;
  </mapping>
 </char></egXML></p>
<p>In this example, the <soCalled>circled ideograph</soCalled>
character has been defined with two mappings, and with two
properties. The two properties are the Unicode-defined
character-decomposition which specifies that this is a circled
character, using the appropriate terminology (see <ptr target="#ucsprops"/> above) and a locally defined property known as
<soCalled>daikanwa</soCalled> <!-- whatever that means -->. The two
mappings indicate firstly that the standard form of this character is
the character <code>&amp;#x4EBA;</code>, and secondly that the
character used to represent this character locally is the PUA
character  <code>&amp;#xE000;</code>. For convenience of local
processing this PUA character may in fact appear  as content of
the <gi>g</gi> element. In general, however, the <gi>g</gi> element
will be empty.</p>
   </div>
   <div type="div2" xml:id="D25-50">
    <head>How to Use Code Points from the Private Use Area</head>
    <p>The developers of the Unicode Standard have set aside an
    area of the codespace for the private use of software vendors,
    user groups, or individuals.  As of this writing (Unicode 5.0),
    there are around 137,000 code points available in this area, which
    should be enough for most needs. No code point assignments will be made
    to this area by standards bodies, and only some very basic default
    properties have been assigned (which may be overridden where
    necessary by the mechanism outlined in this chapter). Therefore,
    unlike all other code points defined by the Unicode Standard, PUA code points should
    <emph>not</emph> be used directly in documents intended for blind interchange.
<!--    Instead of using PUA code points directly in the document content,
    entity references should be used.  This will make it easier for
    receiving parties to find out what PUA characters are used in a
    document and where possible code point clashes with local use on
    the receiving site might occur.--></p>
<p>In the two previous examples, we mentioned that the variant
characters concerned might well be assigned specific code points from
the PUA. This might, for example, facilitate the use of a particular
font which displays the desired character at this code point in the
local processing environment.  Since, however, this assignment would be
valid only on the local site, documents containing such code points are
unsuitable for blind interchange.  During the process of preparing
such documents for interchange, any PUA code points should be replaced
by an appropriate use of the <gi>g</gi> element, such as
<tag>g ref="#xxxx"</tag>, thus associating the character required
with the documentation of it provided by the referenced <gi>char</gi>
element.  The PUA character
used during the preparation of the document might be recorded in the
<gi>char</gi> element, as shown in the example in <ptr target="#D25-40"/>,
or retained as content of the <gi>g</gi> element. However, since there
is no requirement that the same PUA
character be used to represent it at the receiving site, and since it
may well be the case that this other site has already made an
assignment of some other character to the original PUA code point, it
is best practice to remove the locally-defined PUA character. It is to
be expected that a further translation into the
local processing environment at the receiving site will be necessary
to handle such characters, during which variant letters can be
converted to hitherto unused code points on the basis of the
information provided in the <gi>char</gi> element.</p>
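<p>The preparation for interchange described above might be sketched
as follows. This is an illustration under several assumptions, not a
definitive implementation: it requires an XSLT 2.0 processor, it
considers only the PUA range of the Basic Multilingual Plane (U+E000
to U+F8FF), and it assumes that each <gi>char</gi> element records its
local PUA character directly as the content of a <gi>mapping</gi> of
type <val>PUA</val>. The key name <code>char-by-pua</code> is invented
for this example.
<eg><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- Index char definitions by the PUA character they record -->
  <xsl:key name="char-by-pua" match="tei:char"
      use="normalize-space(tei:mapping[@type='PUA'])"/>

  <xsl:variable name="doc" select="/"/>

  <!-- Identity template: copy anything not matched below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Leave the character declarations themselves untouched -->
  <xsl:template match="tei:charDecl">
    <xsl:copy-of select="."/>
  </xsl:template>

  <!-- In running text, replace each documented PUA character
       by an empty g element pointing to its definition -->
  <xsl:template match="text()[not(ancestor::tei:g)]">
    <xsl:analyze-string select="." regex="[&#xE000;-&#xF8FF;]">
      <xsl:matching-substring>
        <xsl:variable name="def"
            select="key('char-by-pua', ., $doc)"/>
        <xsl:choose>
          <xsl:when test="$def">
            <g ref="#{$def[1]/@xml:id}"/>
          </xsl:when>
          <!-- undocumented PUA characters are left for review -->
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>]]></eg>
A PUA character for which no definition is found is copied through
unchanged in this sketch, so that it can be detected and documented
before the file is exchanged.</p>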
    <p>This mechanism is rather weak in cases where DOM trees or
    parsed XML fragments are exchanged, which may increasingly be the
    case.  The best an application can do here is to treat any
    occurrence of a PUA character only in the context of the local
    document and use the properties provided through the <gi>char</gi>
    element as a handle to the character in other contexts.  </p>
<p>In the fullness of time, a character may become standardized, and
thus assigned a specific code point outside the PUA. Encoders of documents
which use this mechanism must then at the least ensure that
the new code point is recorded within the relevant <gi>char</gi>
element; it will however normally be simpler to remove the
<gi>char</gi> element and replace all occurrences of <gi>g</gi>
elements which reference it by occurrences of the newly coded
character. </p>
   </div>
<div type="div2" xml:id="WSD-DEF"><head>Module Character and Glyph Documentation</head>
<p>The module described in this chapter makes available the following
components:
<moduleSpec xml:id="DWD" ident="gaiji">
<altIdent type="FPI">Character and Glyph
Documentation</altIdent><desc>Character and glyph documentation</desc>
<desc xml:lang="fr">Représentation des caractères et des glyphes non standard</desc>
  <desc xml:lang="zh-TW">文字與字體說明</desc>
<desc xml:lang="it">Documentazione di caratteri non standard e glifi</desc><desc xml:lang="pt">Documentação dos carateres</desc><desc xml:lang="ja">外字モジュール</desc></moduleSpec>

The selection and combination of modules to form a TEI schema is described in
<ptr target="#STIN"/>.
</p>
<specGrp>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/g.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/char.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/charName.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/charProp.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/charDecl.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/glyph.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/glyphName.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/localName.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/mapping.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/unicodeName.xml"/>
<include xmlns="http://www.w3.org/2003/XInclude" href="../../Specs/value.xml"/>
</specGrp>
</div>
</div>
