TEI Character Encoding WG: Language Identification [CEW09]

Language identification

Identifying the language in which a document, or a part of it, is written is a crucial requirement for many envisioned uses of an electronic document. The TEI therefore accommodates this need in the following way:

A global attribute lang is defined for all TEI elements. Its value identifies the language used.
The TEI Header has a section set aside for information about the languages used in a document; for details see 5.4.2 Language Usage.

The value of the attribute lang identifies the language using a coded value. For maximal compatibility with existing processes, modelling this value in the following way is recommended (this parallels the modelling of xml:lang):

The identifier for the language should be constructed as in RFC 3066 or its successor. This same identifier must be used to identify the <language> element in the TEI header.

The current draft of Tags for Identifying Languages proposes the following mechanism for constructing a language identifier (tag), as administered by the Internet Assigned Numbers Authority (IANA): the tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script and a region for this language, each possibly followed by a variant subtag.

The identifier consists of at least one ‘primary’ subtag, which may be followed by one or more ‘extended’ subtags.
Languages are identified by a language subtag, which may be a two-letter code taken from ISO 639-1 or a three-letter code taken from ISO 639-2.
ISO 639-2 reserves the codes 'qaa' through 'qtz' for private use. These codes should be used for non-registered language subtags.
A single-letter primary subtag "x" indicates that the whole language tag is for private use.
Extended language subtags must begin with the letter "s". They must follow the primary subtag and precede subtags that define other properties of the language. The order is significant.
Four-character subtags are interpreted as script identifiers taken from ISO 15924.
Region subtags can be either two letter country codes taken from ISO 3166 (with exceptions) or 3 digit codes from the UN Standard Country Codes for Statistical Use.
Variant subtags may follow any of the above, but must precede private use extensions.
Private use extensions are separated from the other subtags by the single letter subtag "x", which must be followed by at least one subtag. They might consist of several subtags separated with "-", but may not exceed a length of 32 characters.
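
The rules above can be sketched as a small parser. This is a minimal illustration in Python, not part of the TEI recommendation or of the draft itself: the function names parse_language_tag and is_private_use_language are hypothetical, any subtag that is not recognised as an extended language, script or region subtag is simply treated as a variant, and the 32-character limit on private use extensions is not checked.

```python
import re
from typing import Dict


def is_private_use_language(code: str) -> bool:
    """True if a three-letter code lies in the ISO 639-2 private use range qaa..qtz."""
    code = code.lower()
    return len(code) == 3 and code.isalpha() and "qaa" <= code <= "qtz"


def parse_language_tag(tag: str) -> Dict[str, object]:
    """Split a language tag into the parts named in the list above (simplified)."""
    subtags = tag.split("-")
    result = {"language": None, "extlang": [], "script": None,
              "region": None, "variants": [], "private": []}
    if any(not s for s in subtags):
        raise ValueError("empty subtag in %r" % tag)

    # A leading "x" marks the whole tag as private use.
    if subtags[0].lower() == "x":
        if len(subtags) < 2:
            raise ValueError('"x" must be followed by at least one subtag')
        result["private"] = subtags[1:]
        return result

    # Primary subtag: two letters (ISO 639-1) or three letters (ISO 639-2).
    if not re.fullmatch(r"[A-Za-z]{2,3}", subtags[0]):
        raise ValueError("bad primary subtag in %r" % tag)
    result["language"] = subtags[0].lower()
    i = 1

    # Extended language subtags: each is introduced by "s", as in zh-s-min.
    while i < len(subtags) and subtags[i].lower() == "s":
        if i + 1 >= len(subtags):
            raise ValueError('"s" must be followed by a subtag')
        result["extlang"].append(subtags[i + 1].lower())
        i += 2

    # Script: four letters (ISO 15924).
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        result["script"] = subtags[i].title()
        i += 1

    # Region: two letters (ISO 3166) or three digits (UN code).
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        result["region"] = subtags[i].upper()
        i += 1

    # Everything up to a single-letter "x" is treated as a variant (assumption).
    while i < len(subtags) and subtags[i].lower() != "x":
        if len(subtags[i]) == 1:
            raise ValueError("misplaced single-letter subtag in %r" % tag)
        result["variants"].append(subtags[i])
        i += 1

    # Private use extension: "x" followed by at least one subtag.
    if i < len(subtags):
        if i + 1 >= len(subtags):
            raise ValueError('"x" must be followed by at least one subtag')
        result["private"] = subtags[i + 1:]
    return result
```

For instance, parse_language_tag("zh-s-min-s-nan-Hant-CN") yields the language "zh", the extended language subtags "min" and "nan", the script "Hant" and the region "CN". The order checks are deliberately loose; a production validator would also consult the relevant code lists.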

Examples of language tags, mostly taken from RFC 3066

Simple language code
- de (German)
- ja (Japanese)
- zh (Chinese)
Language code plus Script code
- zh-Hant (Traditional Chinese)
- en-Latn (English written in Latin script)
- sr-Cyrl (Serbian written with Cyrillic script)
Language-Script-Region
- zh-Hans-CN (Simplified Chinese for the PRC)
- sr-Latn-891 (Serbian, Latin script, Serbia and Montenegro)
Language-Region
- zh-SG (Chinese for Singapore)
- de-DE (German for Germany)
Other
- zh-CN (Chinese in China, no script given)
- zh-Latn (Chinese transcribed in the Latin script)
Extended:
- de-CH-x-phonebook (phonebook collation for Swiss German)
- zh-s-min (Min sub-language of Chinese)
- zh-s-min-s-nan-Hant-CN (Southern variant of the Min sublanguage as used in China, written with traditional characters)
- zh-Latn-x-pinyin (Chinese transcribed in the Latin script using the Pinyin system)

It should be noted that the capitalization given here follows established convention (e.g. capital letters for country codes, small letters for language codes), but RFC 3066 does not ascribe any meaning to differences in capitalization.
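
Because capitalization carries no meaning, software comparing language tags should ignore case. The following Python sketch illustrates this; the helper names same_language_tag and conventional_case are hypothetical, and the recasing rules are only the conventions visible in the examples above (lowercase language, title-case four-letter script, uppercase two-letter region, lowercase everything else).

```python
def same_language_tag(a: str, b: str) -> bool:
    """Compare two language tags without regard to capitalization."""
    return a.lower() == b.lower()


def conventional_case(tag: str) -> str:
    """Rewrite a tag using the customary capitalization of each subtag."""
    parts = tag.split("-")
    out = []
    for index, part in enumerate(parts):
        if part.lower() == "x":
            # Private use: keep "x" and everything after it in lowercase.
            out.append("x")
            out.extend(p.lower() for p in parts[index + 1:])
            break
        if index == 0:
            out.append(part.lower())       # language code
        elif len(part) == 4 and part.isalpha():
            out.append(part.title())       # script code, e.g. Hant
        elif len(part) == 2 and part.isalpha():
            out.append(part.upper())       # country code, e.g. CN
        else:
            out.append(part.lower())       # anything else
    return "-".join(out)
```

With these helpers, same_language_tag("ZH-hans-cn", "zh-Hans-CN") holds, and conventional_case("ZH-hans-cn") gives "zh-Hans-CN".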

As can be seen, both RFC 3066 and ISO 639-2 provide extensions that can be employed by private convention. The constructs mentioned above can thus be used to generate identifiers for any language, past or present, used in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the <language> element of the TEI header, which might also provide a prose description of the language described by the language tag.
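
A consistency check between the tags used in a text and those documented in the header might look like the following Python sketch. It rests on several assumptions that are not fixed by this paper: the document is parsed without namespaces, the global attribute is spelled lang, and each <language> element carries its identifier in an attribute named ident. The sample document and the function name undeclared_languages are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative sample; the attribute names "lang" and "ident" are assumptions.
SAMPLE = """<TEI>
  <teiHeader>
    <profileDesc>
      <langUsage>
        <language ident="de">German</language>
        <language ident="zh-Latn-x-pinyin">Chinese, Pinyin transcription</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text lang="de">
    <body>
      <p>Text with <foreign lang="ZH-Latn-x-pinyin">hanyu</foreign>
         and <foreign lang="grc">logos</foreign>.</p>
    </body>
  </text>
</TEI>"""


def undeclared_languages(root: ET.Element) -> list:
    """Return the lang values used in the document but not declared in the header."""
    # Tags are compared in lowercase because capitalization is not significant.
    declared = {lang.get("ident", "").lower() for lang in root.iter("language")}
    used = {el.get("lang").lower() for el in root.iter() if el.get("lang")}
    return sorted(used - declared)


missing = undeclared_languages(ET.fromstring(SAMPLE))
```

In this sample, missing contains only "grc", since "de" and the Pinyin tag are both documented in the header.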

While language, region and script can be adequately identified using this mechanism, there is only very rough provision for expressing a dimension of time for the language of a document; the codes provided (e.g. "grc" for "Greek, Ancient (to 1453)" in ISO 639-2) might not reflect the periods appropriate for the text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in RFC 3066 and relate that to a <date> or <dateRange> in the corresponding <language> section of the TEI header.

Equivalences to language identifiers by other authorities can be given in the <language> section as well, but no formal mechanism for doing so has been defined.

The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the lang attribute, including all elements and all attributes where a language might apply. [Note: This will exclude all attributes where a non-textual data type has been specified, for example tokens, boolean values or predefined value lists.]
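
The inheritance described here can be sketched as a tree walk in which each element takes its own lang value if present and otherwise the value in force for its parent. The Python below is only an illustration: the function name effective_languages and the sample fragment are hypothetical, the attribute is assumed to be an un-namespaced lang, and the exclusion of non-textual attributes mentioned in the note is not modelled.

```python
import xml.etree.ElementTree as ET


def effective_languages(element, inherited=None):
    """Yield (element, language) pairs for an element and all its descendants."""
    # An element's own lang attribute overrides the value inherited from above.
    lang = element.get("lang", inherited)
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)


FRAGMENT = """<text lang="de">
  <p>Ein Wort: <foreign lang="ja">kotoba</foreign></p>
  <note>Eine Anmerkung</note>
</text>"""

resolved = [(el.tag, lang)
            for el, lang in effective_languages(ET.fromstring(FRAGMENT))]
```

Here <p> and <note> inherit "de" from <text>, while <foreign> overrides it with "ja" for its own subtree.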

Phillips, Addison; Davis, Mark. Tags for Identifying Languages. Internet Draft, 2004-04-08 (proposed revision of RFC 3066). http://xml.coverpages.org/draft-phillips-langtags-02a.txt
Cover, Robin. Language Identifiers in the Markup Context. http://xml.coverpages.org/languageIdentifiers.html
Bray, Tim; Paoli, Jean; Sperberg-McQueen, C. M.; Maler, Eve (Second Edition); Yergeau, Francois (Third Edition). Extensible Markup Language (XML) 1.0 (Third Edition). W3C Recommendation, 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/

Last recorded change to this page: 2007-09-16 • For corrections or updates, contact webmaster AT tei-c DOT org