Naming Conventions

For TEI DTDs

Committee for Metalanguage and Syntax Issues

TEI ML W26

7 June 1991

1 Introduction
2 General markup decisions
3 Structure of Tag Names
4 Summary of Rules and Recommendations

1 Introduction

This document contains naming standards for the formulation of document type declarations (DTDs) in the work of the Text Encoding Initiative. It specifies explicit rules to be followed when creating tag and attribute names for the TEI. While most of the rules involve points that seem minor (and sometimes even arbitrary) when considered individually, the rules taken together create a clear, consistent, uniform style for names used in the guidelines. The contents of this document reflect a list of areas of consideration identified by the Committee on Metalanguage and Syntax; items have, however, been coalesced and rearranged for presentation. The draft guidelines have guided the creation of these rules by helping to detect regularities in current naming practice that need to be addressed, as well as irregularities that require further analysis.

2 General markup decisions

2.1 Case Insensitivity

The question of case sensitivity is a difficult one. The consistency imposed by case sensitive names has a basic appeal to the technical user or software designer. However, in practical work, particularly work (such as tagging a text) that involves extensive hand manipulation of a document, it is often difficult and pointless to ensure that case is always consistent. It is also clear that a large number of TEI users will not be technical specialists, will not perceive value in that consistency, and will be doing significant hand manipulation of texts. Given the triviality of making software case-insensitive, case will never be used to distinguish two tag names.
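The rule that case never distinguishes two tag names can be checked mechanically. The following is a minimal sketch in Python, assuming the tag names are available as a plain list of strings; the function name find_case_collisions is hypothetical and not part of any TEI software.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group names that become identical once case is ignored.

    Returns a list of groups (each a sorted list of spellings) in which
    two or more distinct spellings differ only in case.
    """
    groups = defaultdict(set)
    for name in names:
        # Fold to lower case so that "teiHeader" and "TeiHeader" share a key.
        groups[name.lower()].add(name)
    return [sorted(spellings) for spellings in groups.values() if len(spellings) > 1]


if __name__ == "__main__":
    # "teiHeader" and "TeiHeader" differ only in case, so they are reported.
    print(find_case_collisions(["p", "quotation", "teiHeader", "TeiHeader"]))
```

An empty result means that no two names in the list rely on case to be told apart.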

TEI documents and examples, however, should consistently use lower case for tag names, except that in names formed from phrases, the initial character of each word but the first should be uppercased. (Thus <p> or <quotation> but <partOfSpeech> and <teiHeader>.)

Entity names will need to be case sensitive, as both the standard entity sets and natural encoding styles require it. Attribute names should not depend on case, but attribute values are case sensitive, as they may record information which requires mixed case.

2.2 Language Bindings

Names should be natural-language words or phrases. Initial drafts of the guidelines provide a set of English-language names. Non-English tag names are only appropriate if the name used is a technical term from another language (e.g. French, Latin, or German), which is standard among Anglophone scholars. For instance, "enjambement" and "sandhi", though non-English words, are the standard terms in scholarly discussion and could be used in the English-language tag set.

As a matter of policy, the TEI will produce nine language-specific versions of the TEI tag set for the 1992 publication, covering at least the languages of the European Community. Similar restrictions will be observed for those tagsets as well, with appropriate modifications for the languages in question.

2.3 Character Sets and Encoding

This issue is being addressed by the character-set committee and should be orthogonal to the issues addressed here.

2.4 Abbreviation

Abbreviations are an extremely natural way to reduce the redundancy of frequent items; unfortunately, what is frequent for one user may be very infrequent for another, and vice versa. Therefore, we have taken a conservative approach to abbreviations. When used, abbreviations will be the sole names for a tag or attribute. There will be no variant names. In general, abbreviations will be avoided unless a tag is used very frequently. The threshold for frequency is the "1 per page rule." If a tag name would occur more than once per printed page of the source material, on average, then it may be abbreviated if appropriate. If a tag of wide application would appear more than 3-4 times per page, then a minimal abbreviation of 1-5 characters may be appropriate.
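The frequency thresholds can be expressed as a small decision function. This is a sketch only: the function name abbreviation_level and its return labels are hypothetical, and because the text gives the upper threshold as "3-4 times per page", the default cut-off of 4 is an assumption that callers may change. Whether a tag is "of wide application" remains an editorial judgment that the code does not attempt to capture.

```python
def abbreviation_level(occurrences, pages, minimal_threshold=4.0):
    """Suggest how far a tag name may be abbreviated.

    occurrences: how often the tag appears in the sample material.
    pages: number of printed pages in the sample.
    minimal_threshold: assumed reading of the "3-4 times per page" limit.
    """
    if pages <= 0:
        raise ValueError("pages must be positive")
    per_page = occurrences / pages
    if per_page > minimal_threshold:
        return "minimal"      # a very short abbreviation may be appropriate
    if per_page > 1.0:
        return "abbreviate"   # the "1 per page rule" is satisfied
    return "none"             # too infrequent; use the full name


if __name__ == "__main__":
    print(abbreviation_level(occurrences=150, pages=100))
```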

Abbreviations that are usual in a discipline should be preferred to new ones, and any abbreviation should be suggestive of the abbreviated word. If there are no traditional abbreviations, then the following rules are suggested for forming an abbreviation:

Abbreviate single words by taking the characters up to, but not including, the beginning of the second vowel cluster; e.g. "structure" becomes "struct", "attribute" becomes "attr", etc. (A vowel cluster is a series of adjacent vowels.)
Abbreviate phrases by abbreviating each single word of the phrase according to the preceding rule; e.g. "feature structure" becomes "featStruct".
Abbreviate phrases severely by taking only the first character of each word; e.g. "feature structure" becomes "fs".
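The three rules above can be sketched in Python as follows. The function names are hypothetical, and the code makes two assumptions the text does not settle: only a, e, i, o and u count as vowels, and a word with fewer than two vowel clusters is left unabbreviated.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant


def abbreviate_word(word):
    """Keep the characters before the start of the second vowel cluster."""
    in_cluster = False
    clusters_seen = 0
    for index, char in enumerate(word):
        if char.lower() in VOWELS:
            if not in_cluster:
                clusters_seen += 1
                if clusters_seen == 2:
                    return word[:index]
            in_cluster = True
        else:
            in_cluster = False
    # Fewer than two vowel clusters: nothing to cut (assumption).
    return word


def abbreviate_phrase(phrase):
    """Abbreviate each word, then join with an initial capital on all but the first."""
    parts = [abbreviate_word(word.lower()) for word in phrase.split()]
    return "".join(
        part if position == 0 else part[:1].upper() + part[1:]
        for position, part in enumerate(parts)
    )


def abbreviate_phrase_severely(phrase):
    """Take only the first character of each word, all in lower case."""
    return "".join(word[0].lower() for word in phrase.split())


if __name__ == "__main__":
    print(abbreviate_word("structure"))                     # struct
    print(abbreviate_phrase("feature structure"))           # featStruct
    print(abbreviate_phrase_severely("feature structure"))  # fs
```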

2.5 Translatability

In choosing names, it may be regarded as an advantage if a name or set of names has clear analogues in other European languages.

3 Structure of Tag Names

Many names are simple atoms with no internal structure. These names are relatively unproblematic. Other sets of names have essential relationships that could profitably be reflected by a naming convention. It seems clear that for sets of related tags some form of structured namespace can help to systematize the large numbers of names. We have identified three sorts of multi-part name, and three different syntaxes that are natural for these functions. Each of these will be considered in turn, and then the proper syntax for handling that case will be described.

This classification was created by simultaneously working forward from some initial intuitions about plausible tag name structures, and working backward from the kinds of multi-part name actually used in the guidelines. The names used in the draft guidelines were developed without the use of a preexisting explicit theory of name structure, and thus the forms that occur there are more likely to be based on practical requirements than an over-elaborated theory.

3.1 Multi-word Tag Names

The first reason for having multi-part names is to use multiple-word phrases as tag names. Concatenated strings of words are hard to read, and conventions that rely on capitalization of words within word strings are not enforceable, given the case insensitivity of tag names. The two traditional characters for word separation in compound names are the hyphen and the underscore. Underscore is not a legal name character in the reference concrete syntax of SGML, and the hyphen is apt to confuse most users familiar with conventional programming languages.

Therefore, "phrasal tag names" should be written together without separation of words. If the phrase is abbreviated using the initial letter of each word, all letters should be lowercase; otherwise all words (or word-fragments) but the first should be written with initial capital. Phrasal names are thus distinct from the hierarchically formed tag names described in the next section.
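As an illustration of the capitalization convention for unabbreviated phrasal names, the following Python sketch builds a tag name from a phrase. The function name phrasal_name is hypothetical; the sample phrases are taken from the tag names cited in section 2.1.

```python
def phrasal_name(phrase):
    """Join the words of a phrase into a single tag name.

    The first word is written in lower case; every later word is written
    with an initial capital and the remainder in lower case.
    """
    words = phrase.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    first, rest = words[0].lower(), words[1:]
    return first + "".join(word[:1].upper() + word[1:].lower() for word in rest)


if __name__ == "__main__":
    print(phrasal_name("part of speech"))  # partOfSpeech
    print(phrasal_name("TEI header"))      # teiHeader
```

Since tag names are case insensitive, this spelling is a presentation convention for documents and examples only; software should not rely on it to find word boundaries.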

3.2 Subcategorization of Tag Names

The second reason to have multi-part names is to create new subcategories of existing tags. This creation of subcategories is not necessarily intended to reflect a deep analysis of textual characteristics, but to provide an easily understood pragmatic mechanism for linking custom tags to existing tags that they seem to specialize. Cf., e.g., <list> and <list.citn> or <citn> and <citn.struct>. [1] This mechanism could be especially useful if it were consistently used for the cases that are now covered by "type" attributes.

In representing such relationships the name of the parent tag and the new subtype name will appear separated by a period, with the name of the parent tag on the left, and the name of the subtype on the right. This follows a general principle of structured names that was articulated early in the project, that the parts of multi-part names go from left to right with the more general elements to the left.
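A short Python sketch of this syntax follows. The helper names subcategory_name and split_hierarchy are hypothetical; the example names are the ones cited above.

```python
def subcategory_name(parent, subtype):
    """Form a hierarchical name: parent on the left, subtype on the right."""
    if not parent:
        raise ValueError("parent name must not be empty")
    if not subtype or "." in subtype:
        raise ValueError("subtype must be a single non-empty name part")
    return f"{parent}.{subtype}"


def split_hierarchy(name):
    """Return the parts of a name, ordered from most general to most specific."""
    return name.split(".")


if __name__ == "__main__":
    print(subcategory_name("list", "citn"))  # list.citn
    print(split_hierarchy("citn.struct"))    # ['citn', 'struct']
```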

3.3 Context Specification

It is desirable for tags to have names that describe their meaning without a great deal of surrounding context. As part of achieving that goal, tagnames that might occur in more than one context (such as "author") need to be disambiguated. The basic reason to qualify a name by a context is to prevent a conflict between the semantics or content model of names that serve different functions in the document. There are two kinds of decision that need to be made in doing this type of disambiguation: when to do it, and how to do it.

The need to qualify by context occurs frequently in the case of "crystals," self-contained units of markup with rigid internal structure. Whenever there is a choice between qualifying a name that occurs within a crystal, and one that does not occur within a crystal, the name that is contained in the crystal should be qualified. If both names occur within crystals, either one or both of the crystals can be qualified. If neither name occurs within a crystal, then one or both of the names should be reassigned, using either a different word, or a multi-word tag name as appropriate. Any crystal name may be qualified in the interests of clarity, whether or not a conflict exists for any name within the crystal.

Names within crystals should be qualified by prefixing the name of the crystal to all the tags that can occur within the crystal, with the prefix and tag name separated by a period. The use of prefixing rather than suffixing here serves two functions: it differentiates the cases of context-addition and subcategorization, and enforces the uniform, left-to-right, general-to-specific ordering desired in compound names. A pleasant but non-critical side effect is that qualified crystal names naturally sort near each other in alphabetical indexes.
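The prefixing rule can be sketched in Python as below. The function name qualify_crystal is hypothetical, and the crystal name "bibl" with components "author" and "title" is an invented example, not a set of names taken from the guidelines.

```python
def qualify_crystal(crystal, component_names):
    """Map each component name to its qualified form: crystal, period, component."""
    return {name: f"{crystal}.{name}" for name in component_names}


if __name__ == "__main__":
    # Hypothetical crystal "bibl" whose "author" tag would otherwise clash
    # with an "author" tag used elsewhere in the document.
    qualified = qualify_crystal("bibl", ["title", "author"])
    # The shared prefix makes the qualified names sort next to each other.
    for name in sorted(qualified.values()):
        print(name)
```

All components of the crystal are qualified together, which matches the requirement that a qualified crystal carry the prefix on every tag that can occur within it.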

Context qualification may be of use in the naming of attributes. If attributes representing the editorial history of a particular element are included, then those could be distinguished by a prefix to give names like "edition.editor", "edition.tagTheory", "layout.indentation", and so on.

3.4 Miscellaneous issues

Name collisions in different name spaces (e.g. attribute names, tag names) should be avoided. However, SGML keywords (e.g. ELEMENT, ATTRIBUTE) can be used again, as can values like YES/NO.

In general, use nouns for element names; avoid verbs. Adjectives may be used if no noun can be found. For attribute names, nouns and adjectives (including participles) should be preferred to finite verbs or infinitives. For attribute values any part of speech may be used; in practice, use anything which makes sense with the attribute name.

When creating elements that cross-reference each other, use the attribute name ID for all IDs. IDREFs should be represented by an attribute named target. If the notion of a target is completely inappropriate, as when representing co-referential relations that are not directed, another attribute name may be considered. It is not expected that there will be many such attribute names (if, indeed, there are any).

4 Summary of Rules and Recommendations

The following rules should be strictly applied in creating names in TEI document type declarations:

Tag and attribute names are case insensitive; no two such names may be identical except for case.
Entity names are case sensitive; for character entity names, the case conventions of ISO 8879 should be followed. For other entities, names of one word should be lowercase and phrasal names should uppercase the first character of each word but the first.
When forming abbreviations, use standard abbreviations where they exist; where they do not, use a uniform rule in forming the abbreviations.
Tags which mark subcategories of some parent category may be given hierarchically structured names: parent-category name on the left, subcategory name on the right, separated by a period.
Homonyms should be distinguished by renaming or by qualifying a crystal in which one of the homonyms appears.
If a crystal is qualified, all tags occurring within the crystal should be given qualified names (crystal name, period, component name).
All ID attributes should bear the name id. Unless radically inappropriate, all IDREF attributes should bear the name target.

The following recommendations should be applied where possible in generating names for TEI DTDs, and should govern usage in TEI documents and examples:

TEI documents and examples should give all one-word tag and attribute names in lowercase; phrasal names should uppercase the initial character of each word but the first.
Names should be natural-language words or phrases.
Avoid abbreviation except for very common items.
Where possible, avoid forming names from phrases.
Avoid collisions among names of different types.
Use nouns and adjectives for tag and attribute names; avoid verbs.

Notes

[1] N.B. these existing tags violate the abbreviation conventions given earlier and must be changed to agree with them.