TEI Work Group on Character Encoding

CE W 04: Language and script identification

Author: Espen S. Ore
National Library of Norway, Oslo Division
Date: 2002-08-27

Overview

Language identification in TEI P4 is currently defined in the <teiHeader> and, further, in Chapter 4 under code switching.

Chapter 4 recommends that language codes be taken from the ISO 639 two- or three-letter code sets, and from the SIL Ethnologue where a language is missing from ISO 639.

Problems

Chapter 4 in TEI P4 suggests that finer divisions within a language than those available in ISO 639 can be made by adding a suffix to the ISO 639 code. It is not clear whether this is meant as an alternative to the SIL Ethnologue, or whether we have the following order of recommendations:

1. Use ISO 639
2. Use SIL Ethnologue
3. Use ISO 639 + self-invented suffix
4. Use SIL Ethnologue + self-invented suffix
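
If the four options are read as an ordered fallback (which is only one possible reading of Chapter 4), the choice could be sketched as below. The function name choose_code and the hyphen used to join code and suffix are assumptions made for illustration; P4 does not prescribe either.

```python
from typing import Optional


def choose_code(iso639: Optional[str],
                sil: Optional[str],
                suffix: Optional[str] = None) -> Optional[str]:
    """Pick a language code, reading the four options as a fallback chain.

    iso639 -- code from ISO 639, or None if the language is not listed there
    sil    -- code from the SIL Ethnologue, or None if not listed there
    suffix -- self-invented suffix, given only when a finer division is needed
    """
    # Options 1 and 2: prefer ISO 639, fall back to the SIL Ethnologue.
    base = iso639 or sil
    if base is None:
        return None  # neither authority list covers the language
    # Options 3 and 4: append the suffix to whichever base code was chosen.
    # The hyphen separator is an assumption, not something P4 specifies.
    if suffix:
        return base + "-" + suffix
    return base
```

Under this reading, choose_code("chu", "SLN") yields the ISO 639 code, choose_code(None, "SLN") falls back to the SIL code, and a suffix is only ever added to a code that already comes from one of the two lists.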

Language identification in Chapter 25 (the chapter describing the Writing System Declaration, WSD) uses an iso639 attribute, which is supposed to hold only values from ISO 639-1 or 639-2. There is no direct link from this attribute to the id attribute of a <language> element in the <langUsage> element of the <teiHeader>, although such an id would be a natural candidate for its value, because no formalism is suggested for languages that are not in ISO 639.

Alternative language identification

If we leave the WSD aside for the time being and look at the <language> elements in the <langUsage> element of the <teiHeader>, the following would be both clearer and more flexible than the current suggestions:

<langUsage>
   <language id="OBG-CYR">
      <langAuth name="ISO 639-1" code="cu"/>
      <langAuth name="ISO 639-2" code="chu"/>
      <langAuth name="SIL" code="SLN"/>
      <p>Old Bulgarian, written in Cyrillic script.</p>
   </language>
</langUsage>

(It is not clear where the language code "OBG" suggested in P4 comes from, since it is valid neither in SIL nor in ISO 639-2.)
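
As a rough illustration of how software might read such a declaration, the following Python sketch collects the authority codes for each <language> element, keyed by the project's own id. It assumes a well-formed copy of the sample above; the function name authority_codes is hypothetical, and <langAuth> is the element proposed in this paper, not part of P4.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample declaration, using the proposed <langAuth>.
SAMPLE = """<langUsage>
  <language id="OBG-CYR">
    <langAuth name="ISO 639-1" code="cu"/>
    <langAuth name="ISO 639-2" code="chu"/>
    <langAuth name="SIL" code="SLN"/>
    <p>Old Bulgarian, written in Cyrillic script.</p>
  </language>
</langUsage>"""


def authority_codes(lang_usage_xml):
    """Map each language id to a dict of {authority name: code}."""
    root = ET.fromstring(lang_usage_xml)
    table = {}
    for language in root.findall("language"):
        # One entry per <langAuth> child; any number of them may be present.
        codes = {auth.get("name"): auth.get("code")
                 for auth in language.findall("langAuth")}
        table[language.get("id")] = codes
    return table


CODES = authority_codes(SAMPLE)
```

Here CODES["OBG-CYR"]["SIL"] gives "SLN", so the freely chosen id can be translated into whichever authority list a given tool expects.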

Below is a full DTD example of how <language> and <langAuth> could be defined. I am basing this example on a DTD generated by the Pizza Chef on Aug. 27, 2002, using a mixed base with everything checked, so I have not tried to analyze which content or attribute entities this includes:

<!ELEMENT language 
	(#PCDATA | abbr | address | date | dateRange | dateStruct 
	| expan | geogName | lang | measure | name | num | orgName 
	| persName | placeName | rs | time | timeRange | timeStruct 
	| add | app | corr | damage | del | orig | reg | restore 
	| sic | space | supplied | unclear | oRef | oVar | pRef 
	| pVar | formula | fw | handShift | distinct | emph | foreign 
	| gloss | hi | mentioned | soCalled | term | title | ptr 
	| ref | xptr | xref | caesura | c | cl | m | phr | s | seg 
	| w | anchor | addSpan | delSpan | gap | alt | altGrp | 
	certainty | fLib | fs | fsLib | fvLib | index | interp | 
	interpGrp | join | joinGrp | link | linkGrp | respons | 
	span | spanGrp | timeline | cb | lb | milestone | pb | langAuth)* >


<!ATTLIST language 
	group CDATA #IMPLIED
	grpPtr IDREF #IMPLIED
	depend CDATA #IMPLIED
	depPtr IDREF #IMPLIED
	corresp IDREFS #IMPLIED
	synch IDREFS #IMPLIED
	sameAs IDREF #IMPLIED
	copyOf IDREF #IMPLIED
	next IDREF #IMPLIED
	prev IDREF #IMPLIED
	exclude IDREFS #IMPLIED
	select IDREFS #IMPLIED
	ana IDREFS #IMPLIED
	id ID #IMPLIED
	n CDATA #IMPLIED
	lang IDREF #IMPLIED
	rend CDATA #IMPLIED
	usage CDATA #IMPLIED
	TEIform CDATA "language" >

<!ELEMENT langAuth EMPTY>

<!ATTLIST langAuth
	group CDATA #IMPLIED
	grpPtr IDREF #IMPLIED
	depend CDATA #IMPLIED
	depPtr IDREF #IMPLIED
	corresp IDREFS #IMPLIED
	synch IDREFS #IMPLIED
	sameAs IDREF #IMPLIED
	copyOf IDREF #IMPLIED
	next IDREF #IMPLIED
	prev IDREF #IMPLIED
	exclude IDREFS #IMPLIED
	select IDREFS #IMPLIED
	ana IDREFS #IMPLIED
	id ID #IMPLIED
	n CDATA #IMPLIED
	lang IDREF #IMPLIED
	rend CDATA #IMPLIED
	name CDATA #IMPLIED
	code CDATA #IMPLIED
	TEIform CDATA "langAuth" >

This means that one can always add as many language codes from authority lists as one wishes, while the id attribute of the <language> element can be chosen freely, based on what is meaningful for a project.

The wsd attribute has been removed from the <language> element, since I have returned to my basic view that this kind of information belongs in the header. It could be given there as free text, or we could define some kind of simple structure or formalism. (The original WSD gave an impression of formalism and so looked as if it could be used for computer processing, but I do not think it ever really was.)

The lang attribute within various TEI text elements is useful not only for marking language switching but also for identifying text-parts as being in any given (defined) language. One very simple use for this is to generate separate search/index functions for different languages.
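
A minimal sketch of such a per-language index, in Python: it walks an element tree, lets each element inherit the lang value of its parent unless it declares its own, and groups the text by language. The ids "ENG" and "LAT" and the function name index_by_language are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_by_language(element, inherited=None, index=None):
    """Group text content by the value of the nearest lang attribute."""
    if index is None:
        index = defaultdict(list)
    # An element without its own lang attribute takes its parent's value.
    lang = element.get("lang", inherited)
    if element.text and element.text.strip():
        index[lang].append(element.text.strip())
    for child in element:
        index_by_language(child, lang, index)
        # Text following a child still belongs to the current element,
        # so it is filed under the current element's language.
        if child.tail and child.tail.strip():
            index[lang].append(child.tail.strip())
    return index


# "ENG" and "LAT" are hypothetical ids of <language> elements in the header.
PASSAGE = '<p lang="ENG">He said <foreign lang="LAT">carpe diem</foreign> and left.</p>'
INDEX = index_by_language(ET.fromstring(PASSAGE))
```

Each key of INDEX can then feed a separate search or index function, for instance one with Latin stemming and one with English.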

The xml:lang attribute

The current P4, chapter 4 says:

“Any XML document may use an additional attribute xml:lang, the value of which is the identifier of a language from ISO 639 or registered with IANA. According to the XML Recommendation, the scope of this attribute is ‘considered to apply to all attributes and contents of the element where it is specified, unless overriden with an instance of xml:lang on another element within that content.’ (XML Recommendation, 2.12). Since the TEI DTD defines a great number of CDATA attributes with predeclared content in English, xml:lang cannot be used by TEI documents as intended in the XML recommendation. The current version of these Guidelines does not recommend use of the xml:lang attribute as a means of indicating language shifts; the TEI global lang attribute should instead be used for this purpose. This recommendation will be reviewed at the next revision of these Guidelines.”
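
The scoping problem described in the quoted passage can be illustrated with a small Python sketch. It lists every attribute value and text string together with the xml:lang value whose scope it falls under. The function name strings_in_scope is hypothetical, and the explicit type="chapter" attribute merely stands in for the predeclared English attribute values the passage mentions, since a non-validating parser without the DTD does not see default values.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def strings_in_scope(element, inherited=None):
    """List (language, kind, string) for attribute values and text."""
    # xml:lang applies to the element where it is specified and to its
    # content, unless a descendant overrides it.
    lang = element.get(XML_LANG, inherited)
    found = []
    for name, value in element.attrib.items():
        if name != XML_LANG:
            # Attribute values fall within the scope of xml:lang as well.
            found.append((lang, "@" + name, value))
    if element.text and element.text.strip():
        found.append((lang, "text", element.text.strip()))
    for child in element:
        found.extend(strings_in_scope(child, lang))
    return found


DOC = '<div xml:lang="la" type="chapter"><p>Gallia est omnis divisa</p></div>'
SCOPED = strings_in_scope(ET.fromstring(DOC))
```

In this example the English keyword "chapter" ends up labelled as Latin, alongside the Latin text itself, which is the kind of mismatch the quoted passage points to.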

I suggest that this is changed to:

These Guidelines do not recommend use of the xml:lang attribute as a means of indicating language shifts; the TEI global lang attribute should instead be used for this purpose.