Language and Script in TEI Documents

Language and Script in TEI Documents: CEW04 /PD (26 August 2002)

Author: Patrick Durusau

Introduction

With the adoption of XML as the metalanguage on which P5 of the TEI Guidelines will be based, the continued use of some features from the TEI P4 edition of the Guidelines has come into question. This particular paper addresses the issue surrounding the recording of language and script for a TEI document as well as the use of xml:lang in the future P5. To provide a common starting place, the operation of the prior lang attribute from P4 and xml:lang are compared before considering other issues.

lang (TEI attribute) versus xml:lang

P3 covers the TEI lang attribute in section 4.2, Shifting Among Character Sets, while P4 renumbers and renames the section (with some additional material) as 4.3, Code Shifting. The P3 version makes it clear that lang addressed shifting between character sets (a mapping in a WSD), a processing matter made necessary by the crazy quilt of character encodings at the time of P3. P4 introduces a notion of "code shifting" within a speech by a single speaker but does not define the relationship of that idea to the more traditional linkage of lang to the WSD. Code shifting is merely noted as grounds for possibly different processing; why lang is being pressed into service to mark a linguistic feature of a text (for which other elements exist) is left unexplained.

In any event, lang in both P3 and P4 is meant to be a link to a WSD (writing system declaration), whose primary function is to provide a mapping between a single natural language, a single writing system, and a single coded character set. It is suggested that the value of lang be drawn from ISO 639 or another list of language identifiers.
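The linkage described above can be pictured as a simple lookup. The following is only a minimal sketch in Python, not the WSD format itself: the class name, field names, identifiers, and the two sample declarations are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritingSystemDeclaration:
    """One declaration: one language, one writing system, one character set."""
    ident: str                # the ID that a lang attribute would point to
    language: str
    writing_system: str
    coded_character_set: str


# Hypothetical declarations, invented for this sketch only.
DECLARATIONS = {
    d.ident: d
    for d in (
        WritingSystemDeclaration("en", "English", "Latin alphabet", "ISO 646"),
        WritingSystemDeclaration("el", "Modern Greek", "Greek alphabet", "ISO 8859-7"),
    )
}


def resolve_lang(lang_value):
    """Follow a lang value, treated as an IDREF, to its declaration.

    Raises KeyError when the reference points at no declaration.
    """
    return DECLARATIONS[lang_value]


if __name__ == "__main__":
    print(resolve_lang("el"))
```

The point of the sketch is the shape of the dependency: under the IDREF model every lang value must resolve to a full declaration, whether or not the user needs anything beyond the language identifier.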

The xml:lang attribute in XML had a less ambitious goal than the lang attribute of the TEI:

2.12 Language Identification

In document processing, it is often useful to identify the natural or formal language 
in which the content is written. A special attribute named xml:lang may be inserted in 
documents to specify the language used in the contents and attribute values of any element 
in an XML document. In valid documents, this attribute, like any other, must be declared 
if it is used. The values of the attribute are language identifiers as defined by 
[IETF RFC 1766], Tags for the Identification of Languages, or its successor on the 
IETF Standards Track.

Note:

[IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], 
from two-letter country codes as defined by [ISO 3166], or from language identifiers registered 
with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor 
to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered 
by [ISO 639].

Thus, xml:lang does not attempt to cover the mapping of a natural language, writing system and character set, in part because it has a single character set, that of Unicode. The wisdom of ignoring the writing system of a text may be questioned but it certainly simplifies the required declaration.
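To make the behavior of xml:lang concrete, here is a minimal sketch in Python (standard library only) that computes the language in effect for each element. It relies on the XML Recommendation's rule that an xml:lang value applies to the element's content until overridden on a descendant. The sample document and the function name are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document, invented for this sketch.
SAMPLE = """<text xml:lang="en">
  <p>An English paragraph with a <foreign xml:lang="de">Zeitgeist</foreign> in it.</p>
  <p xml:lang="fr">Un paragraphe.</p>
</text>"""


def effective_languages(root):
    """Return (tag, language) pairs in document order.

    An element without its own xml:lang inherits the value in effect
    on its parent; the root inherits None.
    """
    result = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        result.append((elem.tag, lang))
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result


if __name__ == "__main__":
    for tag, lang in effective_languages(ET.fromstring(SAMPLE)):
        print(tag, lang)
```

Nothing in this resolution consults a writing system or a character set; the identifier is the whole declaration.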

Much has been made of the error (if error it was) of xml:lang applying to both element content and attribute values. This creates a problem for P5 only if P5 persists in declaring default values for attributes in English (xml:lang="en") rather than allowing users to define their own attribute values. Or perhaps better, P5 could give users a convenient mechanism for redefining attribute values where they wish to use xml:lang and so must translate large numbers of attribute values into another language. (Element and attribute names need not remain the TEI (English) defaults in order to have seamless interchange. XSLT transformations are one mechanism for making such interchange possible, so long as the mappings are clearly defined.)
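The idea of interchange through clearly defined mappings can be sketched briefly. The paragraph above names XSLT as one mechanism; the sketch below uses Python instead, purely for illustration, and the localized names in the two mapping tables are hypothetical and do not come from any TEI localization.

```python
import xml.etree.ElementTree as ET

# Hypothetical localized-name to TEI-default mappings (assumptions).
ELEMENT_MAP = {"absatz": "p", "fremd": "foreign"}
ATTRIBUTE_MAP = {"sprache": "lang"}


def to_interchange(root):
    """Rename localized element and attribute names in place.

    Names missing from the mapping tables are left unchanged.
    Returns the same root element for convenience.
    """
    for elem in root.iter():
        elem.tag = ELEMENT_MAP.get(elem.tag, elem.tag)
        renamed = {ATTRIBUTE_MAP.get(k, k): v for k, v in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        '<absatz sprache="de">Ein <fremd sprache="la">ad hoc</fremd> Beispiel</absatz>'
    )
    print(ET.tostring(to_interchange(doc), encoding="unicode"))
```

Because the mapping is a plain table, the reverse transformation is the same function with the tables inverted, which is what makes the interchange seamless as long as the tables are published.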

Note that in P3/P4 the lang attribute always refers to the content of the element and never to the values of its attributes. Since there are no attributes on attributes, one is faced with either the oddities of xml:lang or no mechanism at all for declaring the language of an attribute value in TEI.

Do we need lang?

The provocative answer is No! At least not as it currently operates in P3/P4. The mapping mechanism to which lang refers (the WSD) is both problematic and unnecessary in P5. It is problematic in that the writing system documentation falls far short of the minimum required to have useful information about a writing system (cf. Eric's thesis). It is unnecessary since character set shifting is no longer required, at least for XML documents.

As a practical matter, however, I suspect that we will see the lang attribute in P5 as well as in P10. So, given the reality that lang will persist, what practical use can it be given in P5? As far as element content is concerned, I think we should change its type from IDREF to simple CDATA and point users at the Ethnologue site (http://www.ethnologue.com) or other resources for language names. I suggest this as a way to divorce lang from the series of mappings, such as language and writing system, which its present mechanisms do not adequately support. In most cases, the simple language identifier will be sufficient for searching or processing of the text. (This also avoids the oddity of xml:lang's operation on attributes, should compliant software ever appear.)
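As a rough illustration of how little machinery a plain CDATA identifier needs, the following Python sketch searches a document for elements carrying a given lang value. The sample document, the lang values (written here as plain language names), and the function name are assumptions for illustration only; no values are taken from Ethnologue or any other registry.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with lang treated as a plain string (CDATA).
SAMPLE = """<text lang="English">
  <p>He greeted them with <foreign lang="Latin">pax vobiscum</foreign>.</p>
  <p lang="French">Un paragraphe.</p>
</text>"""


def find_by_language(root, identifier):
    """Return the elements whose own lang attribute equals identifier.

    The value is compared as an ordinary string; no declaration is
    looked up and no writing system information is consulted.
    """
    return [elem for elem in root.iter() if elem.get("lang") == identifier]


if __name__ == "__main__":
    for elem in find_by_language(ET.fromstring(SAMPLE), "Latin"):
        print(elem.tag, elem.text)
```

A search of this kind works whether or not any writing system declaration exists, which is the practical gain of dropping the IDREF requirement.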

Ethnologue: (http://www.ethnologue.com/)

As of 26 August 2002, the Ethnologue site has some 6,800 main entries and another 41,000 alternative names and dialects. When that coverage is contrasted with the rather meager lists mentioned in the XML standard, I think it is fairly clear which one would be better suited for the purposes of the TEI. The site is freely available on the web and I think that should be taken into account in terms of what resources are suggested to TEI users.

Should lang IDREF the writing system declaration?

Conceding that for one reason or another we will always have lang as an attribute, it should certainly be divorced from the notion of a writing system. The lang attribute should be used as a help to the reader/processor for searching, etc., but as Eric has rather convincingly shown, documenting a writing system is not something one would attempt with the current WSD mechanism. Part of our current difficulty is that we have accepted that mapping model without determining whether it is adequate to the task or even makes sense in the current context.

For example, if I were to attempt to document the writing system of the Middle Kingdom of Egypt, I would have to document changes in use over both time and geography across a long period. In addition to declaring whatever PUA mappings I would need, there would be thousands of mappings between "signs" (yes, being deliberately vague) and other information. To say that lang must call a generic mechanism that supports that sort of unique requirement is to impose an unreasonable burden on users who do not need that capacity. That is not to say that we should not design (or attempt to design) a separate documentation system for such projects, but let's not burden the casual use of lang with such a mechanism.