TEI P4 Beta: Reference Documentation

<form> (letter form) identifies one letter form taken by a particular character in a writing system declaration.
Attributes:
string gives the byte string used to encode the letter form in the text.
Datatype: CDATA
Values: any string of characters (often a single byte)
Default: #IMPLIED
Example:
<form string='a/'> <desc>lowercase Greek alpha with acute accent</desc> </form>

Note
If the character is encoded only using entity references, then the value of string should be '' (the empty string).

In coded character sets which use character-set shifting (e.g. JIS 0208), the string attribute should typically contain the required shift characters, in order to render the value unambiguous. In such a case, there is no expectation that every occurrence of the character will be immediately preceded by the shift sequence; processing software is responsible for understanding the shift mechanism and acting accordingly.

The same string value (other than the empty string) may not appear on more than one <form> element, unless each occurrence is associated with a different coded character set.
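The uniqueness rule above can be checked mechanically. The following is a minimal sketch, not part of the TEI tag set: it assumes the <form> attributes have already been read into plain dictionaries, and the function name duplicate_strings and the identifiers cs1 and cs2 are hypothetical.

```python
from collections import defaultdict


def duplicate_strings(forms):
    """Return {(string, codedCharSet): count} for values used by several forms.

    `forms` is a hypothetical list of dicts holding <form> attribute values;
    a missing key means the attribute was not specified.
    """
    counts = defaultdict(int)
    for form in forms:
        value = form.get("string")
        if not value:  # unspecified or empty string: exempt from the rule
            continue
        counts[(value, form.get("codedCharSet"))] += 1
    return {key: n for key, n in counts.items() if n > 1}


forms = [
    {"string": "a/", "codedCharSet": "cs1"},
    {"string": "a/", "codedCharSet": "cs2"},  # allowed: different coded character set
    {"string": "", "entityStd": "thorn"},
    {"string": "", "entityStd": "eth"},       # allowed: the empty string may repeat
    {"string": "b", "codedCharSet": "cs1"},
    {"string": "b", "codedCharSet": "cs1"},   # conflict: same string, same set
]
print(duplicate_strings(forms))
```

Two forms that share a string and both omit codedCharSet are reported as a conflict, since neither occurrence is tied to a distinct coded character set.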

codedCharSet (coded character set) specifies which base coded character set the string value occurs in.
Datatype: IDREF
Values: a reference to the SGML identifier of a <codedCharSet> element in the current writing system declaration.
Default: #IMPLIED
Example:

Note
If more than one <codedCharSet> is specified as a base component of the writing system declaration, then it is expected that character-set shifting is in use, as described in ISO 2022 or some equivalent. In this case, each <form> element which has a value for the string attribute should also identify, by means of the codedCharSet attribute, which coded character set actually contains the string in question. Proper shifting among character sets is the responsibility of the user.
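As an illustration of this expectation, the sketch below lists the forms that carry a string value but do not point at a declared coded character set. It is only a sketch under stated assumptions: the dictionaries, the function name forms_missing_char_set, and the identifiers jis208 and ascii are hypothetical, and an empty string is treated as "no value".

```python
def forms_missing_char_set(forms, char_set_ids):
    """Return the indexes of forms whose string value lacks a usable codedCharSet.

    `char_set_ids` holds the SGML identifiers of the <codedCharSet> elements
    in the writing system declaration (hypothetical input). The check applies
    only when more than one coded character set is declared.
    """
    if len(char_set_ids) <= 1:
        return []
    return [
        index
        for index, form in enumerate(forms)
        if form.get("string") and form.get("codedCharSet") not in char_set_ids
    ]


char_set_ids = {"jis208", "ascii"}
forms = [
    {"string": "a", "codedCharSet": "ascii"},
    {"string": "0!"},                      # string given, no codedCharSet
    {"string": "", "entityStd": "thorn"},  # entity-only form: nothing to check
]
print(forms_missing_char_set(forms, char_set_ids))
```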

entityStd (standard entity name) gives the name of one or more entities defined for this character form in some standard entity set(s).
Datatype: ENTITIES
Values: One or more valid SGML entity names declared in the document type definition of the WSD; the entity must also be included in an entity set mentioned in an <entitySet> declaration in the current writing system declaration or in some base writing system referred to by a <baseWsd> element.
Default: #IMPLIED
Example:
<form entityStd='thorn'> <desc>lowercase Old English/Icelandic thorn</desc> </form>

Note
If the same letter form is defined by more than one public entity set, more than one value may appear in this attribute.

The same entity name may not appear in the entityStd or entityLoc attributes of more than one <form> element.

entityLoc (local entity name) gives one or more entity names used locally for this character form.
Datatype: ENTITIES
Values: One or more valid SGML entity names declared in the document type definition of the WSD; the entity must also be included in an entity set mentioned in an <entitySet> declaration in the current writing system declaration or in some base writing system referred to by a <baseWsd> element.
Default: #IMPLIED
Example:
<form entityStd='thorn' entityLoc='t'> <desc>lowercase Old English/Icelandic thorn</desc> <note>The standard entity name is 'thorn'; the local entity 't' is used for brevity and legibility.</note> </form>

Note
The same entity name may not appear in the entityStd or entityLoc attributes of more than one <form> element.
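A check for this constraint might look like the following sketch. It assumes, hypothetically, that each <form> has been read into a dictionary and that the ENTITIES values are blank-separated lists of names; the function name shared_entity_names is not part of any TEI software.

```python
def shared_entity_names(forms):
    """Return {entity name: [form indexes]} for names claimed by several forms.

    Names from entityStd and entityLoc are pooled, because the rule covers
    both attributes together. A name repeated within one form counts once.
    """
    owners = {}
    for index, form in enumerate(forms):
        names = set()
        for attribute in ("entityStd", "entityLoc"):
            names.update(form.get(attribute, "").split())
        for name in sorted(names):
            owners.setdefault(name, []).append(index)
    return {name: used for name, used in owners.items() if len(used) > 1}


forms = [
    {"entityStd": "thorn", "entityLoc": "t"},
    {"entityStd": "eth"},
    {"entityLoc": "t th"},  # "t" is already used by the first form
]
print(shared_entity_names(forms))
```

Names are compared exactly as written, on the assumption that entity names are case-sensitive in the SGML declaration in use.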

ucs-4 (universal-character-set code) gives the position of the character form in the thirty-two bit `universal character set' defined by ISO 10646.
Datatype: CDATA
Values: one or more sets of two or four two-digit hexadecimal numbers giving a valid ISO 10646 code point for the character form; for legibility the two-digit hexadecimal numbers should be separated by hyphens. If more than one UCS-4 code is associated with a given character form, the codes should be given separated by blanks. If the character form is associated with a sequence of UCS-4 codes (e.g. a base character followed by one or more non-spacing diacritics), then the components of the sequence should be separated by '+'.
Default: #IMPLIED
Example:

Note
The same UCS-4 code (or sequence) may not appear within more than one <character> element within the writing system declaration. It may however appear on several forms of the same character.

Multiple UCS-4 codes can be given for a single character; this allows sequences treated as distinct by ISO 10646 to be documented as referring to a single `character' as defined by the WSD (e.g. `lowercase a-umlaut' and `lowercase a' plus `umlaut').

If a single UCS-4 code is to be treated as relating to two distinct `characters' as defined by the WSD (e.g. to reverse the effects of Han unification on some character), then one of the <character> elements should be associated with the UCS-4 code in the normal way, and the others should call attention to the relevant UCS-4 code by a comment in a <note> element.
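The value syntax described for ucs-4 (hyphen-separated hexadecimal pairs, '+' for sequences, blanks between alternatives) can be sketched as a small parser. This is an illustrative reading of the description, not a normative grammar: the function names parse_code and parse_ucs4 are hypothetical, hyphens are treated as required, and no blanks are allowed around '+'. The sample value is made up for the a-umlaut case mentioned above.

```python
import re

_HEX_PAIR = re.compile(r"[0-9A-Fa-f]{2}")


def parse_code(code):
    """Turn one code such as '00-E4' or '00-00-00-E4' into an integer."""
    pairs = code.split("-")
    if len(pairs) not in (2, 4) or not all(_HEX_PAIR.fullmatch(p) for p in pairs):
        raise ValueError(f"not a UCS-4 code: {code!r}")
    return int("".join(pairs), 16)


def parse_ucs4(value):
    """Parse a ucs-4 attribute value into a list of alternatives.

    Each alternative is a list of code points: one element for a single
    code, several for a '+'-joined sequence.
    """
    return [
        [parse_code(code) for code in alternative.split("+")]
        for alternative in value.split()
    ]


# Precomposed a-umlaut, or lowercase a followed by a combining diaeresis.
print(parse_ucs4("00-E4 00-61+03-08"))
```

A value that does not fit the pattern, such as one with an odd number of digits, raises ValueError rather than being silently accepted.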

afiicode (AFII code) gives one or more codes associated with this letter form by the Association for Font Information Interchange.
Datatype: CDATA
Values: any valid AFII identifier.
Default: #IMPLIED
Example:

Note
The AFII tables are designed as an inventory of glyphs, that is, identifiably distinct shapes, leaving differences of font design out of account: one character may be associated with several glyphs, and each glyph with items in several different fonts. Because the same glyph may be associated with more than one character (in some fonts, for example, the lowercase letter L and the digit 1 share the same glyph), the value of afiicode is used for informational purposes only and need not be unique within the writing system declaration.

Example

Note
The <form> element documents one form of a character; in most cases, there will be only one. If more than one form is given, in general, they are to be regarded as free variants of the character unless otherwise specified in the notes.

The distinction between <character> and <form> makes it possible to distinguish, in an encoding, among different letter forms (which may have historical, aesthetic, linguistic, or other significance) without having to claim that the different forms constitute different `characters' in any normal sense. (Using the technical terms occasionally encountered, the <form> element can be used to record each allograph of a given character or grapheme.) The concepts of `character' and `letter form', however, vary from analyst to analyst; the decision to treat a given set of forms as a single character or as a set of characters is not always obvious, and may require the application of considerable learning and judgement. The <note> element should be used to record the reasoning behind any particularly difficult decision.

Tagset: auxiliary tag set for writing system declarations
Class:
Filename: teiwsd2
Content: May contain one or more description elements, optionally one or more figure elements showing the character form in question, and optionally a series of notes.
Parents: character superentry
Children: desc extFigure figure note
Declaration:
<!ELEMENT form - O (desc+, (figure | extFigure)*, note*) >
<!ATTLIST form %a.global
          string       CDATA    #IMPLIED
          codedCharSet IDREF    #IMPLIED
          entityStd    ENTITIES #IMPLIED
          entityLoc    ENTITIES #IMPLIED
          ucs-4        CDATA    #IMPLIED
          afiicode     CDATA    #IMPLIED >
See 25.4.2
