                           Naming Conventions
 
                              For TEI DTDs
 
 
              Committee for Metalanguage and Syntax Issues
 
                      Document Number:  TEI MLW26
 
                              June 7, 1991
 
                        Version 2, June 7, 1991
 
 
 
                                   1
 
                              INTRODUCTION
 
 
   This document contains naming standards for the formulation of
document type declarations (DTDs) in the work of the Text Encoding
Initiative.  It specifies explicit rules for the selection of tag and
attribute names, which should be followed when creating names for the
TEI.  While most of the rules involve points that seem minor (and
sometimes even arbitrary) when considered individually, the sum of the
rules taken together creates a clear, consistent, uniform style for
names used in the guidelines.  The contents of this document reflect a
list of areas of consideration identified by the Committee on Metalan-
guage and Syntax; items have, however, been coalesced and rearranged for
presentation.  The draft guidelines have been used to guide creation of
these rules by helping to detect regularities in current naming practice
that need to be addressed, as well as to find irregularities that
require further analysis.
 
 
 
                                   2
 
                        GENERAL MARKUP DECISIONS
 
 
Case Insensitivity
 
   The question of case sensitivity is a difficult one.  The consistency
imposed by case sensitive names has a basic appeal to the technical user
or software designer.  However, in practical work, particularly work
(such as tagging a text) that involves extensive hand manipulation of a
document, it is often difficult and pointless to ensure that case is
always consistent.  It is also clear that a large number of TEI users
will not be technical specialists, will not perceive value in that con-
sistency, and will be doing significant hand manipulation of texts.
Given the triviality of making software case-insensitive, case will nev-
er be used to distinguish two tag names.
 
   TEI documents and examples, however, should consistently use lower
case for tag names, except that in names formed from phrases, the ini-
tial character of each word but the first should be uppercased.  (Thus
<p> or <quotation> but <partOfSpeech> and <teiHeader>.)
 
   Entity names will need to be case sensitive, as both the standard
entity sets and natural encoding styles require it.  Attribute names
should not depend on case, but attribute values are case sensitive, as
they may record information which requires mixed case.
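
   As a rough illustration of the rule that case is never used to
distinguish two tag names, the following Python sketch groups candidate
names that differ only in case.  The function name case_collisions and
the sample name "TEIheader" are hypothetical; they are not part of any
TEI software.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that are identical except for case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only the groups that hold more than one distinct spelling.
    return [sorted(set(g)) for g in groups.values() if len(set(g)) > 1]


# "TEIheader" is an invented name that collides with <teiHeader>.
tags = ["p", "quotation", "partOfSpeech", "teiHeader", "TEIheader"]
print(case_collisions(tags))
```

   A check of this kind would report the colliding pair so that one of
the two names can be changed before it enters a DTD.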
 
 
Language Bindings
 
   Names should be natural-language words or phrases.  Initial drafts of
the guidelines provide a set of English-language names.  Non-English tag
names are only appropriate if the name used is a technical term from
another language (e.g.  French, Latin, or German), which is standard
among Anglophone scholars.  For instance, "enjambement" and "sandhi",
though non-English words, are the standard terms in scholarly discussion
and could be used in the English-language tag set.
 
   As a matter of policy, the TEI will produce nine language-specific
versions of the TEI tag set for the 1992 publication, covering at least
the languages of the European Community.  Similar restrictions will be
observed for those tagsets as well, with appropriate modifications for
the languages in question.
 
 
Character Sets and Encoding
 
   This issue is being addressed by the character-set committee and
should be orthogonal to the issues addressed here.
 
 
Abbreviation
 
   Abbreviations are an extremely natural way to reduce the redundancy
of frequent items; unfortunately, what is frequent for one user may be
very infrequent for another, and vice versa.  Therefore, we have taken a
conservative approach to abbreviations.  When used, abbreviations will
be the sole names for a tag or attribute.  There will be no variant
names.  In general, abbreviations will be avoided unless a tag is used
very frequently.  The threshold for frequency is the "1 per page rule."
If a tag name would occur more than once per printed page of the source
material, on average, then it may be abbreviated if appropriate.  If a
tag of wide application would appear more than 3-4 times per page, then
a minimal abbreviation of 1-5 characters may be appropriate.
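
   The frequency thresholds can be sketched as a small Python function.
This is only one reading of the rule: the cutoff for a minimal
abbreviation is assumed to be "more than 3 per page" (the text says
3-4), the judgement of whether a tag is "of wide application" is left
to the caller, and the name abbreviation_advice is hypothetical.

```python
def abbreviation_advice(occurrences, pages):
    """Classify a tag by its average occurrences per printed page."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    per_page = occurrences / pages
    if per_page > 3:          # assumed cutoff; the text says "3-4"
        return "minimal"      # a 1-5 character abbreviation may be apt
    if per_page > 1:          # the "1 per page rule"
        return "abbreviate"   # an ordinary abbreviation is permitted
    return "spell out"
```

   For example, a tag occurring 150 times in a 100-page source averages
1.5 occurrences per page and so may be abbreviated.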
 
   Abbreviations that are usual in a discipline should be preferred to
new ones, and any abbreviation should be suggestive of the abbreviated
word.  If there are no traditional abbreviations, then the following
rules are suggested for forming an abbreviation:
 
*   abbreviate single words by taking characters up to the beginning of
    the second vowel cluster, exclusive; e.g. "structure" becomes
    "struct", "attribute" becomes "attr", etc.  (A vowel cluster is a
    series of adjacent vowels.)
*   abbreviate phrases by abbreviating each single word of the phrase
    according to the preceding rule; e.g. "feature structure" becomes
    "featStruct".
*   abbreviate phrases severely by taking the first character of each
    word (only); e.g. "feature structure" becomes "fs".
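
   The three rules above can be sketched in Python as follows.  The
function names are hypothetical, the letter "y" is assumed to count as
a consonant, and a word with fewer than two vowel clusters is assumed
to be left unabbreviated; none of these points is settled by the rules
themselves.

```python
import re

VOWELS = "aeiou"  # assumption: "y" is treated as a consonant


def abbreviate_word(word):
    """Keep characters up to, but excluding, the second vowel cluster."""
    clusters = list(re.finditer(f"[{VOWELS}]+", word.lower()))
    if len(clusters) < 2:
        return word  # assumption: nothing to cut, so keep the word whole
    return word[:clusters[1].start()]


def abbreviate_phrase(phrase):
    """Abbreviate each word, then join the pieces as a phrasal name."""
    parts = [abbreviate_word(w) for w in phrase.lower().split()]
    return parts[0] + "".join(p.capitalize() for p in parts[1:])


def abbreviate_severely(phrase):
    """Take only the first character of each word."""
    return "".join(w[0] for w in phrase.lower().split())


print(abbreviate_word("structure"))            # struct
print(abbreviate_word("attribute"))            # attr
print(abbreviate_phrase("feature structure"))  # featStruct
print(abbreviate_severely("feature structure"))  # fs
```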
 
 
Translatability
 
   In choosing names, it may be regarded as an advantage if a name or
set of names has clear analogues in other European languages.
 
 
 
                                   3
 
                         STRUCTURE OF TAG NAMES
 
 
   Many names are simple atoms with no internal structure.  These names
are relatively unproblematic.  Other sets of names have essential rela-
tionships that could profitably be reflected by a naming convention.  It
seems clear that for sets of related tags some form of structured name-
space can help to systematize the large numbers of names.  We have iden-
tified three sorts of multi-part name, and three different syntaxes that
are natural for these functions.  Each of these will be considered in
turn, and then the proper syntax for handling that case will be
described.
 
   This classification was created by simultaneously working forward
from some initial intuitions about plausible tag name structures, and
working backward from the kinds of multi-part name actually used in the
guidelines.  The names used in the draft guidelines were developed with-
out the use of a preexisting explicit theory of name structure, and thus
the forms that occur there are more likely to be based on practical
requirements than an over-elaborated theory.
 
 
Multi-word Tag Names
 
   The first reason for having multi-part names is to use multiple word
phrases as tagnames.  Concatenated strings of words are hard to read,
and conventions that rely on capitalization of words within word
strings are not enforceable, given the case insensitivity of tag names.
The two traditional characters for word-separation in compound names are
the hyphen and the underscore.  Underscore is not a legal name character
in the reference concrete syntax of SGML, and the hyphen is apt to con-
fuse most users familiar with conventional programming languages.
 
   Therefore, "phrasal tag names" should be written together without
separation of words.  If the phrase is abbreviated using the initial
letter of each word, all letters should be lowercase; otherwise all
words (or word-fragments) but the first should be written with initial
capital.  Phrasal names are thus distinct from the hierarchically formed
tag names described in the next section.
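
   A minimal Python sketch of this convention is given below.  The
function phrasal_name and its initials_only parameter are hypothetical
names introduced for illustration, and the input is assumed to be a
non-empty list of non-empty words.

```python
def phrasal_name(words, initials_only=False):
    """Join words (or word-fragments) into a single tag name."""
    words = [w.lower() for w in words]
    if initials_only:
        # Initial-letter abbreviations are written entirely in lowercase.
        return "".join(w[0] for w in words)
    # Otherwise every word but the first is given an initial capital.
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])


print(phrasal_name(["part", "of", "speech"]))                     # partOfSpeech
print(phrasal_name(["feature", "structure"], initials_only=True))  # fs
```

   Because tag names are case insensitive, the capitals produced here
are an aid to the reader only; software comparing names should ignore
them.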
 
 
Subcategorization of Tag Names
 
   The second reason to have multi-part names is to create new subcate-
gories of existing tags.  This creation of subcategories is not neces-
sarily intended to reflect a deep analysis of textual characteristics,
but to provide an easily understood pragmatic mechanism for linking cus-
tom tags to existing tags that they seem to specialize.  Cf., e.g.,
<list> and <list.citn> or <citn> and <citn.struct>.(1)  This mechanism
could be especially useful if it were consistently used for the cases
that are now covered by "type" attributes.
 
   In representing such relationships the name of the parent tag and the
new subtype name will appear separated by a period, with the name of the
parent tag on the left, and the name of the subtype on the right.  This
follows a general principle of structured names that was articulated
early in the project, that the parts of multi-part names go from left to
right with the more general elements to the left.
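
   The left-to-right, general-to-specific structure makes such names
easy to take apart mechanically.  The Python sketch below shows one way
to do so; the function names are hypothetical, and a name with more
than one period is assumed to be split at the first period only.

```python
def split_subcategory(name):
    """Return (parent, subtype); subtype is None for an unstructured name."""
    parent, sep, subtype = name.partition(".")
    return (parent, subtype) if sep else (parent, None)


def is_subcategory_of(name, parent):
    """True when `name` is `parent`, a period, and a non-empty subtype."""
    prefix = parent.lower() + "."  # tag names are case insensitive
    name = name.lower()
    return name.startswith(prefix) and len(name) > len(prefix)


print(split_subcategory("list.citn"))          # ('list', 'citn')
print(is_subcategory_of("citn.struct", "citn"))  # True
```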
 
 
Context Specification
 
   It is desirable for tags to have names that describe their meaning
without a great deal of surrounding context.  As part of achieving that
goal, tagnames that might occur in more than one context (such as
"author") need to be disambiguated.  The basic reason to qualify a name
by a context is to prevent a conflict between the semantics or content
model of names that serve different functions in the document.  There
are two kinds of decision that need to be made in doing this type of
disambiguation:  when to do it, and how to do it.
 
   The need to qualify by context occurs frequently in the case of
"crystals," self-contained units of markup with rigid internal struc-
ture.  Whenever there is a choice between qualifying a name that occurs
within a crystal, and one that does not occur within a crystal, the name
that is contained in the crystal should be qualified.  If both names
occur within crystals, either one or both of the crystals can be quali-
fied.  If neither name occurs within a crystal, then one or both of the
names should be reassigned, using either a different word, or a multi-
word tag name as appropriate.  Any crystal name may be qualified in the
interests of clarity, whether or not a conflict exists for any name
within the crystal.
 
   Names within crystals should be qualified by prefixing the name of
the crystal to all the tags that can occur within the crystal, with the
prefix and tag name separated by a period.  The use of prefixing rather
than suffixing here serves two functions:  it differentiates the cases
of context-addition and subcategorization, and enforces the uniform,
left-to-right, general-to-specific ordering desired in compound names.
A pleasant but non-critical side effect is that qualified crystal names
naturally sort near each other in alphabetical indexes.
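
   The following Python sketch illustrates prefix qualification and the
sorting side effect.  The choice of "citn" as a crystal, and of
"author", "title", and "date" as its components, is an assumption made
only for the sake of the example; qualify_crystal is a hypothetical
name.

```python
def qualify_crystal(crystal, components):
    """Prefix the crystal name to every tag that can occur inside it."""
    return {name: f"{crystal}.{name}" for name in components}


# Hypothetical crystal and component names, chosen for illustration.
qualified = qualify_crystal("citn", ["author", "title", "date"])

# Unqualified names from outside the crystal are mixed in to show that
# the qualified names still sort next to one another.
index = sorted(["author", "list", *qualified.values(), "title"])
print(index)
```

   In the sorted index the three qualified names fall together between
"author" and "list", while the unqualified "author" and "title" remain
available for use outside the crystal.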
 
   Context qualification may be of use in the naming of attributes.  If
attributes representing the editorial history of a particular element
are included, then those could be distinguished by a prefix to give
names like "edition.editor", "edition.tagTheory", "layout.indentation",
and so on.
 
 
Miscellaneous issues
 
   Name collisions in different name spaces (e.g.  attribute names, tag
names) should be avoided.  However, SGML keywords (e.g. ELEMENT,
ATTRIBUTE) can be used again, as can values like YES/NO.
 
   In general, use nouns for element names; avoid verbs.  Adjectives may
be used if no noun can be found.  For attribute names, nouns and
adjectives (including participles) should be preferred to finite or
infinitive verbs.  For attribute values any part of speech may be used;
use anything which makes sense with the attribute name.
 
   When creating elements that cross-reference each other, use the
attribute name ID for all IDs.  IDREFs should be represented by an
attribute named target.  If the notion of a target is completely inap-
propriate, as when representing co-referential relations that are not
directed, another attribute name may be considered.  It is not expected
that there will be many such attribute names (if, indeed, there are any).
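
   A check for this convention might be sketched in Python as below.
The representation of attributes as (name, declared type) pairs and the
function name check_reference_attributes are assumptions made for
illustration.  Since another name may be justified for an undirected
relation, the messages should be read as advisory.

```python
def check_reference_attributes(attributes):
    """Yield a message for each ID or IDREF attribute with another name.

    `attributes` is an iterable of (name, declared_type) pairs.
    """
    expected = {"ID": "id", "IDREF": "target"}
    for name, declared_type in attributes:
        want = expected.get(declared_type.upper())
        # Attribute names are case insensitive, so compare in lowercase.
        if want is not None and name.lower() != want:
            yield f"{declared_type} attribute '{name}' should be named '{want}'"


# Hypothetical declarations; only the second departs from the convention.
declared = [("id", "ID"), ("ref", "IDREF"), ("n", "CDATA")]
for message in check_reference_attributes(declared):
    print(message)
```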
 
 
 
                                   4
 
                  SUMMARY OF RULES AND RECOMMENDATIONS
 
 
   The following rules should be strictly applied in creating names in
TEI document type declarations:
 
1.    Tag and attribute names are case insensitive; no two such names
      may be identical except for case.
2.    Entity names are case sensitive; for character entity names, the
      case conventions of ISO 8879 should be followed.  For other enti-
      ties, names of one word should be lowercase and phrasal names
      should uppercase the first character of each word but the first.
3.    When forming abbreviations, use standard abbreviations where they
      exist; where they do not, use a uniform rule in forming the abbre-
      viations.
4.    Tags which mark subcategories of some parent category may be given
      hierarchically structured names:  parent-category name on the
      left, subcategory name on the right, separated by a period.
5.    Homonyms should be distinguished by renaming or by qualifying a
      crystal in which one of the homonyms appears.
6.    If a crystal is qualified, all tags occurring within the crystal
      should be given qualified names (crystal name, period, component
      name).
7.    All ID attributes should bear the name id.  Unless radically inap-
      propriate, all IDREF attributes should bear the name target.
 
   The following recommendations should be applied where possible in
generating names for TEI DTDs, and should govern usage in TEI documents
and examples:
 
1.    TEI documents and examples should give all one-word tag and attri-
      bute names in lowercase; phrasal names should uppercase the ini-
      tial character of each word but the first.
2.    Names should be natural-language words or phrases.
3.    Avoid abbreviation except for very common items.
4.    Where possible, avoid forming names from phrases.
5.    Avoid collisions among names of different types.
6.    Use nouns and adjectives for tag and attribute names; avoid verbs.
 
-------------------------
 
(1) N.B. these existing tags violate the abbreviation conventions given
    earlier and must be changed to agree with them.
 