CE W 05: Semantics for characters and linguistic features


Overview

This document attempts to enumerate features and categories that are needed for a writing system declaration. At this time (2002-08-26), it is still in a rather rough state.

Features in the ‘old’ TEI-WSD

This list enumerates some of the features that are available in the WSD as of P4. We need to decide which of them to retain for P5.
  • language
  • script
  • direction
  • characters

    exceptions; details for characters (form, desc) are listed here. 1

  • note

Features in Eric Albright's paper Design of an electronic method for describing writing systems

In chapter 6 of his thesis, Albright discusses the following ‘Elements of writing systems’:
  • 6.1. Linguistic elements

    Link to the linguistic description. Elements <linguistic-unit>, <sequence> (container for sequential units), containing <linguistic-unitRef>.

  • 6.2. Graphs

    discrete segments, <graph>, contains: <name>

  • 6.3. Graphemes

    graphemes, if such a thing exists: <grapheme>

  • 6.4. Writing system units

    higher level units: characters, syllables, words, phrases, sentences: <writing-unit> (do we need this formal description?)

  • 6.5. Classes

    <class>. Classes might be better built by enumerating the features on the graphs. But, admittedly, there is a level of abstract description that might be difficult to achieve otherwise.

  • 6.6. Computational units
  • 6.6.1. Key codes

    we do not worry about this.

  • 6.6.2. Coded units

    we do not worry about this.

  • 6.6.3. Glyphs

    This is meant to refer to the glyph index of a font. This is a low-level feature that is usually not available in text processing (where glyphs in fonts are addressed through cmap tables by character code).

  • 7. correspondence rules 2

Evaluation

The TEI WSD and Albright's EWSD have surprisingly little overlap. The TEI WSD is mainly a mechanism to define a mapping between various legacy coded character sets and the Universal Character Set (UCS). EWSD on the other hand starts with a tabula rasa and enumerates all information that it deems useful for the electronic processing of a writing system.

In the following sections, an attempt will be made to evaluate which information from these earlier attempts at defining a WSD should be retained for the ‘TEI WSD-NG’. This will be split into three parts:
  • the definition of a new character
  • definition of semantics for the characters
  • definition of linguistic properties for features of a writing system.

Definition of new characters

In the ‘TEI WSD-NG’ we will assume the document encoding to be the UCS, so in most cases no mapping is needed. There are, however, some special cases where such a mapping might still be required:
  • Only a subset (a legacy CCS like Big5) of the UCS is used. The ‘TEI WSD-NG’ will need to be able to map additional UCS characters (those that were not available in the subset but are in fact in the UCS).
  • The document uses an older version of the UCS. Characters that did not have a UCS mapping at the time the document was created might have such a mapping now. In such cases, one convenient way to make this known to XML processors is the WSD-NG.
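The second case can be pictured as a simple substitution table. The following Python sketch assumes, purely for illustration, that a document used a Private Use Area code point as a placeholder for a character that has since been assigned in the UCS. The table contents and the function name `remap_text` are hypothetical and not part of any WSD.

```python
from typing import Dict

# Hypothetical table: placeholder code point -> code point now assigned in the UCS.
# The PUA value 0xE000 is invented for this example; U+20000 lies in
# CJK Unified Ideographs Extension B, which was added after Unicode 3.0.
PLACEHOLDER_TO_UCS: Dict[int, int] = {
    0xE000: 0x20000,
}

def remap_text(text: str, table: Dict[int, int]) -> str:
    """Replace every character listed in `table`; leave all others untouched."""
    return text.translate(table)

sample = "A\ue000B"
remapped = remap_text(sample, PLACEHOLDER_TO_UCS)
```

A real declaration would carry such a table as data, so that an XML processor could apply it without knowing anything else about the document.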

The various strategies for defining a new character are the subject of working paper CE W 02. Here we will simply assume that such a mechanism is in place.

Semantics for characters

‘TEI WSD-NG’ will need a flexible way to define (for new characters defined with the above mechanism) or overlay (for existing characters) the semantics of a character. What kind of semantics do we need?

To answer this question, it might be useful first to look at the character semantics defined by The Unicode Standard. The Unicode Standard divides character semantics into two categories, normative and informative (see The Unicode Standard, Version 3.0, Addison-Wesley, Chapter 4). As of The Unicode Standard 3.0, the normative properties are listed as follows:

Normative Character Properties in Unicode (see The Unicode Standard, Version 3.0, Addison-Wesley, p. 73, Table 4-1).

  • Case
  • Combining Classes
  • Conjoining Jamo (1100–11FF)
  • Decomposition (Canonical and Compatibility)
  • Directionality
  • Jamo Short Name
  • Numeric Value
  • Private Use
  • Special Character Properties
  • Surrogate
  • Mirrored
  • Unicode Character Names
According to The Unicode Standard, Version 3.0, Addison-Wesley, Section 3.4, p. 42, Case, Numeric Value, Directionality and Mirrored are designated ‘simple character properties’.
The informative properties as of The Unicode Standard 3.0 are given as follows:

Informative Character Properties (see The Unicode Standard, Version 3.0, Addison-Wesley, p. 73, Table 4-2).

  • Case Mapping
  • Dashes
  • East Asian Width
  • Letters (Alphabetic and Ideographic)
  • Line Breaking
  • Mathematical Property
  • Spaces
  • Unicode 1.0 Names
In addition to these lists, there are also ‘special character properties’ as enumerated in The Unicode Standard, Version 3.0, Addison-Wesley, p. 47. These are:
  • Line boundary control
  • Hyphenation control
  • Fraction formatting
  • Special behavior with nonspacing marks
  • Double nonspacing marks
  • Joining
  • Bidirectional ordering
  • Alternate formatting
  • Syriac abbreviation
  • Indic dead-character formation
  • Mongolian variant selectors
  • Ideographic variation indication
  • Ideographic description
  • Interlinear annotation
  • Object replacement
  • Code conversion fallback
  • Byte order signature
Furthermore, there is the ‘General Category’, which is defined for all Unicode characters and assigns them to some general classes. This general category partly overlaps with the character properties listed above.
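To get a feeling for what these properties look like as data, several of them can be queried with Python's standard `unicodedata` module. This is only an illustrative sketch: the module reflects the Unicode version bundled with the interpreter rather than Version 3.0, it covers only some of the properties listed above, and the function name `describe` and the dictionary keys are our own choices.

```python
import unicodedata

def describe(char: str) -> dict:
    """Collect a few Unicode character properties for a single character."""
    return {
        "name": unicodedata.name(char, None),
        "general_category": unicodedata.category(char),
        "combining_class": unicodedata.combining(char),
        "directionality": unicodedata.bidirectional(char),
        "mirrored": bool(unicodedata.mirrored(char)),
        "decomposition": unicodedata.decomposition(char),
        "numeric_value": unicodedata.numeric(char, None),
        "east_asian_width": unicodedata.east_asian_width(char),
    }

# U+00E9 LATIN SMALL LETTER E WITH ACUTE has a canonical decomposition.
props = describe("\u00e9")
```

A WSD would need to be able to state values of this kind for newly defined characters, for which no such database entry exists.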

What strategy should be chosen to deal with Unicode character properties in ‘TEI WSD-NG’? It might be useful to hardcode the definition skeleton (that is, the key for the definition is predefined and only the value needs to be given) for some of the normative properties into the ‘TEI WSD-NG’, while others, such as the informative properties, might be better served by a freeform definition skeleton (that is, a key/value pair can be freely defined). Additionally, we will need to provide for the possibility of defining arbitrary additional properties. Do we need to make these syntactically different from Unicode Character Properties? It might be useful for a processor to be able to recognize Unicode Character Properties, but on the other hand a specific attribute value could also provide for this.
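The difference between the two kinds of definition skeleton can be sketched as a data structure. The Python class below is a hypothetical model, not a proposed syntax: the class name, the choice of which properties get predefined keys, and the `source` tag used to mark Unicode properties are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class CharacterSemantics:
    # Predefined skeleton: the keys are fixed, only the values are supplied.
    directionality: Optional[str] = None
    combining_class: Optional[int] = None
    mirrored: Optional[bool] = None
    # Freeform skeleton: key and value are both supplied by the declaration.
    # Entries are keyed by (source, key), so a processor can tell Unicode
    # properties (source "unicode") from locally defined ones.
    freeform: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def set_freeform(self, source: str, key: str, value: str) -> None:
        self.freeform[(source, key)] = value

    def unicode_properties(self) -> Dict[str, str]:
        """Return only the freeform entries tagged as Unicode properties."""
        return {k: v for (src, k), v in self.freeform.items() if src == "unicode"}

# Example values are invented for illustration.
sem = CharacterSemantics(directionality="L", mirrored=False)
sem.set_freeform("unicode", "East Asian Width", "W")
sem.set_freeform("local", "pronunciation", "shan1")
```

In this sketch the `source` tag plays the role of the specific attribute value mentioned above: Unicode and non-Unicode properties share one syntax, and a processor that cares about the distinction filters on the tag.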

In addition to the Unicode properties, a number of other properties are required frequently and should be given predefined definition skeletons:
  • Graphical appearance
  • Pronunciation
  • Name of a font and codepoint (UCS, possibly PUA) within the font that contains the character
  • Mappings to non-UCS or private character repertoires, or other reference systems
  • Standard orthographic form of the character(s)
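The idea of overlaying semantics on existing characters, together with the additional properties just listed, might be sketched as a lookup that consults the declaration first and falls back to the Unicode data otherwise. Everything here is hypothetical: the `OVERLAY` table, its keys and values, and the function name `lookup` are invented for the example, and only the General Category has a fallback.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical overlay: per-character properties declared in a WSD.
# U+5C71 is an existing ideograph; U+E000 stands for a newly defined
# character placed in the Private Use Area.
OVERLAY: Dict[str, Dict[str, str]] = {
    "\u5c71": {"pronunciation": "shan1"},
    "\ue000": {"general_category": "Lo", "standard_form": "\u5c71"},
}

def lookup(char: str, key: str) -> Optional[str]:
    """Return the declared value if present, else fall back to Unicode data."""
    declared = OVERLAY.get(char, {})
    if key in declared:
        return declared[key]
    if key == "general_category":
        return unicodedata.category(char)
    return None  # no declaration and no Unicode fallback for this key
```

With this table, `lookup("\ue000", "general_category")` yields the declared value rather than the private-use category that the Unicode data alone would report.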

Definition of linguistic features

This is largely covered in Albright's thesis. It might best be served by a third, separate module of the ‘TEI WSD-NG’ to be used only where needed.

Notes
1.
It seems there is no way to specify properties like case equivalents. Character classes can be assigned to the character element, but there is only a very limited set of possible classes.
2.
Do we need this for text encoding? Case relationships and collation sequences may be needed.

Last recorded change to this page: 2007-09-16