Notes on Alternate Approaches to Attributes

TEI ML W05


C. M. Sperberg-McQueen

University of Illinois at Chicago

May 8, 1989


It may help clarify the discussion of attributes and other notations if we make some basic distinctions among approaches. I am not going to argue one way or another here on the attribute question, only to try to state as neutrally as I can the various alternatives. The clarity of my conception of our alternatives has benefited a great deal from electronic and telephonic discussion with SM, who should however not be held responsible for what follows.

Some Alternate Approaches

The discussion about attributes has produced, I think, five alternative approaches to the marking of information which might be thought of as attributive. (By and large, that is, I think, information which is one-to-one with occurrences of text elements of specified types.) To take two simple examples:

Silent Approach

One approach is to keep the major information and just ignore the other information: <page> and <fontshift>. This is always a possibility (make the user who thinks the extra information is worthwhile extend the tag set) but not a solution. Let us call this the silent method. (Or the Wittgenstein method, if you prefer.)

Attribute Approach

Another approach is to make one piece of information a text element and associate the others one-to-one with it as attributes:

  1.     <page side=recto>page content here ... </page>
        or
        <page side=verso>page content here ... </page>
    
  2.     <textseg type=italic size=12pt>italic text here ... </textseg>
        or
        <textseg type=bold size=10pt>bold text here ... </textseg>
    
This is what I've been saying ought to be an option, even though it's not always the best of all possible approaches.

Cartesian Approach

A third approach (that used in the Brown and LOB corpora for parts of speech, and suggested in an earlier draft of our syntax document) makes a tag for each combination of values of the attributes. (Possible, obviously, only when the attributes range across finite domains.)

    <recto>page content here ... </recto>
    and
    <verso>page content here ... </verso>
 
    <italic10>italic ten-point text here ... </italic10>
    <italic12> ... </italic12>
    <bold10>bold text here ... </bold10>
    <bold12> ... </bold12> etc.

Since this generates an attribute-free tag set by taking the Cartesian product of the attribute domains, we can call this the Cartesian product approach, or just the Cartesian method.

If we omit point size, this appears the usual approach for tagging font shifts: <italic> italic text here ... </italic>, <bold> bold text here ... </bold> etc.
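As a minimal sketch of how such a tag set arises, the following takes the Cartesian product of the attribute domains and forms one tag name per combination. The function name and the two value lists are assumptions made for illustration; they simply mirror the font example above.

```python
from itertools import product


def cartesian_tags(domains):
    """Return one tag name per combination of attribute values.

    `domains` holds one list of legal values per attribute; each tag
    name is the concatenation of one value from every domain.
    """
    return ["".join(combo) for combo in product(*domains)]


# Hypothetical domains mirroring the font example.
font_types = ["italic", "bold"]
font_sizes = ["10", "12"]

print(cartesian_tags([font_types, font_sizes]))
# ['italic10', 'italic12', 'bold10', 'bold12']
```

The number of tags is the product of the domain sizes, which is why the method is possible only for finite domains.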

Subordination Approach

The fourth approach we have so far generated writes all the information as text elements -- one can rewrite any tag plus attributes by keeping the parent tag and rewriting each attribute name as a tag (to guarantee uniqueness write the parent tag name plus the attribute name) and rewriting the value of the attribute as content following that tag. Thus:

    <page><side>recto</side> page content here </page>
 
    <textseg>
        <type>italic</type>
        <size>12pt</size>
        italic text here ...
    </textseg>
    or
    <textseg><type>bold</type><size>10pt</size>bold text here ...
    </textseg>

Since this method can rewrite attributes by subordinating them to their parent tag, let us call this the subordination approach.
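The rewriting rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is invented, the element is supplied as a tag name, a dictionary of attributes and a content string rather than parsed from SGML, and the period used to join parent and attribute names when `qualify` is set is my own choice, since no separator is prescribed above.

```python
def subordinate(tag, attributes, content, qualify=False):
    """Rewrite <tag a=v ...>content</tag> with each attribute as a child element.

    With qualify=True the child name is the parent tag name plus the
    attribute name (joined by a period here, which is an assumption).
    """
    parts = []
    for name, value in attributes.items():
        child = f"{tag}.{name}" if qualify else name
        parts.append(f"<{child}>{value}</{child}>")
    children = "".join(parts)
    return f"<{tag}>{children}{content}</{tag}>"


print(subordinate("page", {"side": "recto"}, " page content here "))
# <page><side>recto</side> page content here </page>
```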

Combination Approach

In some cases, a combination of the Cartesian product and subordination approaches can be used. For each attribute, construct either (a) a single text element, rewriting the attribute value as content, or (b) an array of text elements, one for each possible value of the attribute. The tags created by (b) can either be empty or have the same scope as the old parent element. In some cases the old parent text element can be eliminated and one of the new text elements promoted in its place; in others, the old parent should be kept. In our second example,

    <italic><size>12pt</size>
        italic text here ...
    </italic>
    or
    <bold><size>10pt</size>bold text here ... </bold>

(It isn't clear to me when the old parent text element is omissible and when not. For font shifts it seems omissible -- for parts of speech, not omissible. The parent can be omitted only if at least one candidate attribute has been expanded into an array of text elements, because only such tags can have their scope extended to that of the old parent.)

Since it sometimes results in eliminating the old parent, we could call this the Electra or the Oedipal approach. Since it doesn't always, thus appearing indecisive, let us call it the Hamlet or the Danish approach instead.

The markup for this approach can be made to look something like the feature notations of some generative grammars.
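One variant of the combination approach, in which a single attribute is expanded into an array of tags and promoted in place of the old parent while the remaining attributes are subordinated, might be sketched as follows. The function and parameter names are hypothetical, and the sketch does not cover the variants that keep the parent or that use empty tags.

```python
def danish(parent, attributes, content, promote=None):
    """Combine the Cartesian and subordination rewrites for one element.

    If `promote` names an attribute, its value becomes the enclosing tag
    and replaces the old parent; all other attributes become child
    elements. With promote=None the old parent is kept.
    """
    remaining = dict(attributes)
    if promote is None:
        outer = parent
    else:
        outer = remaining.pop(promote)
    children = "".join(
        f"<{name}>{value}</{name}>" for name, value in remaining.items()
    )
    return f"<{outer}>{children}{content}</{outer}>"


print(danish("textseg", {"type": "bold", "size": "10pt"},
             "bold text here ... ", promote="type"))
# <bold><size>10pt</size>bold text here ... </bold>
```

Called without `promote`, the function reduces to plain subordination.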

Characteristics of the Approaches

Some characteristics of these approaches seem worth noting, though they don't seem to point us unambiguously in one way or the other.

a. The Cartesian product approach erases the distinction between types of information (here, between "page" and "side", or among "fontshift", "font type" and "font size") and creates a population of related tags (related in their semantics and presumably in their syntactic behavior). The relationship among these related tags is implicit, but not explicit, in the syntactic restrictions of the DTD. (I am not sure whether it could be recognized formally or not.) The other approaches preserve the category distinctions more or less explicitly.

b. As seen in earlier discussion, the Cartesian product approach can succumb to combinatorial explosion.

c. The attribute, Cartesian product and Danish approaches (can) put information about legal attribute values into the document type definition: the attribute approach by using the attribute value declaration, the Cartesian and Danish approaches by providing tags only for some attribute values. If specification of legal attribute values doesn't belong in a DTD, none of these will be ideal. If it does, no others will.

d. All approaches except the silent approach allow documents to express the same information by their markup. They are extensionally (denotationally) equivalent, intensionally (connotationally) different.

e. The connotations of the approaches seem very clear in some instances, but we have found no universal consistency in them.

f. Given a list of n categories of information c_1 ... c_i ... c_n such that each category c_i ranges over a domain of k_i values, and each instance of any category is one-to-one with an instance of each of the others (i.e. they are suitable for treatment as attributes or as singly occurring constituents), the five approaches will produce different numbers of ELEMENT and ATTLIST declarations in the DTD.

In our examples,

g. The content models for the five approaches are related. If the silent method for a set of categories assigns the content model M to its tag, then the attribute method and the Cartesian product method will also use M as the content model for each tag. That is, if we take the content model for the silent-method tag and define a parameter entity M for it so the element can be defined:
        <!ELEMENT e - - %M; >
then the element declarations for the attribute and Cartesian methods will differ only in their element names. The content model of the subordination method will be (%SubCats;, %M;) where the entity SubCats is defined as the conjunction of optional appearances of each piece of subordinate information:
    <!ENTITY % SubCats "(c_1? & c_2? & ... & c_n? )" >
    <!ELEMENT e - - (%SubCats;, %M;) >
This content model preserves free order and makes all subordinate tags optional. If all attributes are required, the entity SubCats would be defined without the optional-occurrence indicators:
    <!ENTITY % SubCats "(c_1 & c_2 & ... & c_n )" >
If free ordering is immaterial, the ampersands can be replaced by commas, and the parsing will be much easier.

The parent tag (or its Oedipal replacement) in the Danish approach will have as its content model the concatenation (%DanishCats;, %M;), where the entity DanishCats is defined as the conjunction (%cat_1;? & %cat_2;? & ... & %cat_n;?) (or else the sequence or some other combination of the categories), and where each entity cat_i is defined, for subordinated attributes, as the name of the attribute:
    <!ENTITY % cat_i "c_i" >
and, for Cartesian expansions of attribute domains, as the alternation of all legal attribute values:
    <!ENTITY % cat_i "(AttrVal_{i,1} | AttrVal_{i,2} | ... | AttrVal_{i,k_i})" >
The overall declaration, then, looks something like this:
    <!ENTITY % cat_1 "c_1" >

    <!ENTITY % cat_2 "(AttrVal_{2,1} | AttrVal_{2,2} | ... | AttrVal_{2,k_2})" >

    <!ENTITY % cat_3 "(AttrVal_{3,1} | AttrVal_{3,2} | ... | AttrVal_{3,k_3})" >

    <!ENTITY % DanishCats "(%cat_1;? & %cat_2;? & %cat_3;?)" >

    <!ELEMENT e - - (%DanishCats;, %M;) >
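The variants of the SubCats entity discussed in item g (optional or required, free or fixed order) differ only in the connector and the occurrence indicator, which a small sketch can make explicit. The function names are hypothetical, and the category names passed in are placeholders for the schematic names used above.

```python
def subcats_model(names, required=False, ordered=False):
    """Build the model group for the subordinate categories.

    required=False appends the optional-occurrence indicator "?";
    ordered=True replaces the "&" connector by "," (fixed order).
    """
    connector = ", " if ordered else " & "
    indicator = "" if required else "?"
    return "(" + connector.join(name + indicator for name in names) + ")"


def subcats_entity(names, required=False, ordered=False):
    """Wrap the model group in a parameter entity declaration."""
    model = subcats_model(names, required=required, ordered=ordered)
    return f'<!ENTITY % SubCats "{model}" >'


print(subcats_entity(["type", "size"]))
# <!ENTITY % SubCats "(type? & size?)" >
print(subcats_entity(["type", "size"], required=True, ordered=True))
# <!ENTITY % SubCats "(type, size)" >
```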

h. The content models of possible parent text elements (text elements within which the tags in question can occur -- or content models in which the tag names do occur) will be related under the five approaches. Assume the DTD of the silent approach, and take the set of all content models which contain the silent approach's one tag name; let us call this set of content models the "parent content models" of the candidate tags in question.

The parent content models are identical under all methods except the Cartesian product method. In that method, new parent content models can be generated from the old by replacing the tag name of the silent method with a parenthesized disjunction of the tag names produced by the Cartesian method. Thus,

    <!ELEMENT volume - - (page)+ >
would be rewritten
    <!ELEMENT volume - - (recto | verso)+ >
and so forth.
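The replacement described in item h is mechanical enough to sketch. The function name is hypothetical, and the content model is treated as a plain string, not as a parsed model group; the lookarounds merely keep a name such as "page" from matching inside a longer name such as "pageref".

```python
import re


def rewrite_parent_model(model, old_tag, new_tags):
    """Replace `old_tag` in a content model by a parenthesized disjunction."""
    disjunction = "(" + " | ".join(new_tags) + ")"
    # Match old_tag only as a whole name (SGML names may contain
    # letters, digits, periods and hyphens).
    pattern = r"(?<![\w.-])" + re.escape(old_tag) + r"(?![\w.-])"
    return re.sub(pattern, lambda _match: disjunction, model)


print(rewrite_parent_model("(page)+", "page", ["recto", "verso"]))
# ((recto | verso))+
```

Applied literally, the rule leaves a redundant pair of parentheses around a tag name that stood alone in its group; the result is equivalent to the (recto | verso)+ shown above.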

i. If we attempt to measure the complexity of the resulting DTDs or documents, the result varies with the measure; no one approach is simplest or most complex by all measures. Possible measures include:

  1. number of statements in the DTD
  2. number of lines or tokens in the DTD
  3. number of ELEMENT types in the hierarchy (= number of text elements defined in the DTD)
  4. number of statement types in the DTD
  5. number of tokens per model group or per declaration (an attempt to measure the complexity of individual ELEMENT statements in the DTD)
  6. number of tags needed to express the information in the document
  7. number of tokens needed to express the information in the document
If

  n   = the number of categories (pieces of information)
  k_i = the number of legal values for category c_i
  P   = k_1 * k_2 * ... * k_n
  S   = k_1 + k_2 + ... + k_n
  M   = number of tokens in the content model of the candidate element in the silent approach
  Q   = number of occurrences of the candidate parent element in the silent, attribute, subordinate, or Danish approach in the content models of the DTD

(Note that we have P > S > n.) Then we have:

Arithmetic Summary

  Approach      E     A    AT   CM      R     TN   TT
  silent        1     0    0    M       Q     1    2
  attribute     1     1    n    M       Q     1    2n
  Cartesian     P     0    0    M*P     Q*P   1    2
  subordinate   n     0    0    M+n-1   Q     n    3n-1
  min. Danish   n-1   0    0    M+n-1   Q     n    3n-1
  max. Danish   S     0    0    M+S     Q     n    n+1

Where:

  E  = no. of ELEMENT statements
  A  = no. of ATTLIST statements
  AT = no. of attributes
  CM = no. of tokens in content models for these tags/attributes
  R  = no. of tokens in parent tags' content models (references)
  TN = no. of elements (nodes in the hierarchy) for an instance of this tuple of information, fully specified, in a marked-up text
  TT = no. of tokens for an instance of this tuple of information, fully specified, in a marked-up text (content of parent tag not counted). Start tags and end tags count one each; attribute names and values (and their equivalents) count one each.
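
To see how the measures diverge for concrete numbers, the formulas of the Arithmetic Summary can be evaluated directly. This is a sketch only: the function name and the sample values of k, M and Q are assumptions chosen for illustration, and the formulas are transcribed as I read them from the table, not derived independently.

```python
from math import prod


def arithmetic_summary(k, M, Q):
    """Evaluate the summary formulas for domain sizes k and token counts M, Q.

    Returns a dict mapping each approach to the tuple
    (E, A, AT, CM, R, TN, TT) as defined in the key above.
    """
    n = len(k)   # number of categories
    P = prod(k)  # product of the domain sizes
    S = sum(k)   # sum of the domain sizes
    return {
        "silent":      (1, 0, 0, M, Q, 1, 2),
        "attribute":   (1, 1, n, M, Q, 1, 2 * n),
        "Cartesian":   (P, 0, 0, M * P, Q * P, 1, 2),
        "subordinate": (n, 0, 0, M + n - 1, Q, n, 3 * n - 1),
        "min. Danish": (n - 1, 0, 0, M + n - 1, Q, n, 3 * n - 1),
        "max. Danish": (S, 0, 0, M + S, Q, n, n + 1),
    }


# Hypothetical values: three categories with 2, 3 and 4 legal values.
for approach, row in arithmetic_summary([2, 3, 4], M=5, Q=2).items():
    print(f"{approach:12s} {row}")
```

With these sample values the Cartesian method needs 24 ELEMENT statements against 9 for the maximal Danish method, while producing the fewest tokens per instance in the document.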


HTML generated 15 May 1998