It may help clarify the discussion of attributes and other notations if we make some basic distinctions among approaches. I am not going to argue one way or another here on the attribute question, only to try to state as neutrally as I can the various alternatives. The clarity of my conception of our alternatives has benefited a great deal from electronic and telephonic discussion with SM, who should however not be held responsible for what follows.
The discussion about attributes has produced, I think, five alternative approaches to the marking of information which might be thought of as attributive. (By and large, that is, I think, information which is one-to-one with occurrences of text elements of specified types.) To take two simple examples:
One approach is to keep the major information and just ignore the other information: <page> and <fontshift>. This is always a possibility (make the user who thinks the extra information is worthwhile extend the tag set) but not a solution. Let us call this the silent method. (Or the Wittgenstein method, if you prefer.)
Another approach is to make one piece of information a text element and associate the others one-to-one with it as attributes:
<page side=recto>page content here ... </page> or <page side=verso>page content here ... </page>
<textseg type=italic size=12pt>italic text here ... </textseg> or <textseg type=bold size=10pt>bold text here ... </textseg>
A third approach (that used in the Brown and LOB corpora for parts of speech, and suggested in an earlier draft of our syntax document) makes a tag for each combination of values of the attributes. (Possible, obviously, only when the attributes range across finite domains.)
<recto>page content here ... </recto> and <verso>page content here ... </verso>
<italic10>italic ten-point text here ... </italic10> <italic12> ... </italic12> <bold10>bold text here ... </bold10> <bold12> ... </bold12> etc.
Since this generates an attribute-free tag set by taking the Cartesian product of the attribute domains, we can call this the Cartesian product approach, or just the Cartesian method.
If we omit point size, this appears to be the usual approach for tagging font shifts: <italic> italic text here ... </italic>, <bold> bold text here ... </bold> etc.
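The construction of the Cartesian tag set can be sketched in a few lines of Python. This is only an illustration: the function name cartesian_tags and the convention of concatenating the attribute values to form a tag name are assumptions made here, chosen to reproduce the names italic10, italic12, bold10 and bold12 used in the example above.

```python
from itertools import product

def cartesian_tags(*domains):
    """Return one tag name per combination of attribute values.

    Each argument is the finite list of values of one attribute.
    The naming convention (simple concatenation) is an assumption.
    """
    return ["".join(combo) for combo in product(*domains)]

# Two font types and two point sizes give four tags.
font_tags = cartesian_tags(["italic", "bold"], ["10", "12"])
print(font_tags)  # ['italic10', 'italic12', 'bold10', 'bold12']
```

Adding a third value to either domain, or a third attribute, multiplies the size of the tag set rather than adding to it.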
The fourth approach we have so far generated writes all the information as text elements -- one can rewrite any tag plus attributes by keeping the parent tag and rewriting each attribute name as a tag (to guarantee uniqueness write the parent tag name plus the attribute name) and rewriting the value of the attribute as content following that tag. Thus:
<page><side>recto</side> page content here </page>
<textseg> <type>italic</type> <size>12pt</size> italic text here ... </textseg> or <textseg><type>bold</type><size>10pt</size>bold text here ... </textseg>
Since this method can rewrite attributes by subordinating them to their parent tag, let us call this the subordination approach.
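The rewriting rule just described is mechanical, and a minimal sketch may make that concrete. It uses Python's xml.etree.ElementTree, so the input must be XML with quoted attribute values rather than the unquoted SGML of the examples. The function name subordinate and its prefix_parent switch are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

def subordinate(elem, prefix_parent=False):
    """Rewrite each attribute of elem as a leading child element.

    The attribute value becomes the content of the new child. With
    prefix_parent=True the child is named parent tag + attribute name,
    the device suggested in the text for guaranteeing uniqueness.
    """
    new = ET.Element(elem.tag)
    children = []
    for name, value in elem.attrib.items():
        child = ET.Element(elem.tag + name if prefix_parent else name)
        child.text = value
        children.append(child)
    if children:
        # The original leading text now follows the last new child.
        children[-1].tail = elem.text
    else:
        new.text = elem.text
    new.extend(children)
    new.extend(list(elem))  # existing child elements are kept unchanged
    new.tail = elem.tail
    return new

src = ET.fromstring('<textseg type="italic" size="12pt">italic text here</textseg>')
out = ET.tostring(subordinate(src), encoding="unicode")
print(out)
# <textseg><type>italic</type><size>12pt</size>italic text here</textseg>
```

The parent tag survives and no information is lost, so the rewriting could in principle be reversed to recover the attribute form.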
In some cases, a combination of the Cartesian product and subordination approaches can be used. For each attribute, construct either (a) a single text element, rewriting the attribute value as content, or (b) an array of text elements, one for each possible value of the attribute. The tags created by (b) can either be empty or have the same scope as the old parent element. In some cases the old parent text element can be eliminated and one of the new text elements promoted in its place; in others, the old parent should be kept. In our second example,
<italic><size>12pt</size> italic text here ... </italic> or <bold><size>10pt</size>bold text here ... </bold>
(It isn't clear to me when the old parent text element is omissible and when not. For font shifts it seems omissible -- for parts of speech, not omissible. The parent can be omitted only if at least one candidate attribute has been expanded into an array of text elements, because only such tags can have their scope extended to that of the old parent.)
Since it sometimes results in eliminating the old parent, we could call this the Electra or the Oedipal approach. Since it doesn't always, thus appearing indecisive, let us call it the Hamlet or the Danish approach instead.
The markup for this approach can be made to look something like the feature notations of some generative grammars.
Some characteristics of these approaches seem worth noting, though they don't seem to point us unambiguously in one way or the other.
a. The Cartesian product approach erases the distinction between types of information (here, between "page" and "side", or among "fontshift", "font type" and "font size") and creates a population of related tags (related in their semantics and presumably in their syntactic behavior). The relationship among these related tags is implicit, but not explicit, in the syntactic restrictions of the DTD. (I am not sure whether it could be recognized formally or not.) The other approaches preserve the category distinctions more or less explicitly.
b. As seen in earlier discussion, the Cartesian product approach can succumb to combinatorial explosion.
c. The attribute, Cartesian product and Danish approaches (can) put information about legal attribute values into the document type definition: the attribute approach by using the attribute value declaration, the Cartesian and Danish approaches by providing tags only for some attribute values. If specification of legal attribute values doesn't belong in a DTD, none of these will be ideal. If it does, no others will.
d. All approaches except the silent approach allow documents to express the same information by their markup. They are extensionally (denotationally) equivalent, intensionally (connotationally) different.
e. The connotations of the approaches seem very clear in some instances, but we have found no universal consistency in them.
f. Given a list of n categories of information c_1 ... c_i ... c_n such that each category c_i ranges over a domain of k_i values, and each instance of any category is one-to-one with an instance of each of the others (i.e. they are suitable for treatment as attributes or as singly occurring constituents), the five approaches will produce different numbers of ELEMENT and ATTLIST declarations in the DTD.
P = k_1 * k_2 * ... * k_n
S = k_1 + k_2 + ... + k_n
Specifically, there are 2^n possible counts of attributes, defined by:
Count = (1 | 0) + (1 | k_2) + ... + (1 | k_n)
<mdecl>!ELEMENT textseg - - ANY </mdecl>
<mdecl>!ELEMENT textseg - - ANY </mdecl> <mdecl>!ATTLIST textseg style=(ro | it | bd | bi) size=(8pt | 10pt | 12pt) </mdecl>
<mdecl>!ELEMENT rom8 - - ANY </mdecl> <mdecl>!ELEMENT rom10 - - ANY </mdecl> <mdecl>!ELEMENT rom12 - - ANY </mdecl> <mdecl>!ELEMENT ital8 - - ANY </mdecl> <mdecl>!ELEMENT ital10 - - ANY </mdecl> <mdecl>!ELEMENT ital12 - - ANY </mdecl> <mdecl>!ELEMENT bold8 - - ANY </mdecl> <mdecl>!ELEMENT bold10 - - ANY </mdecl> <mdecl>!ELEMENT bold12 - - ANY </mdecl> <mdecl>!ELEMENT bdit8 - - ANY </mdecl> <mdecl>!ELEMENT bdit10 - - ANY </mdecl> <mdecl>!ELEMENT bdit12 - - ANY </mdecl>
<mdecl>!ELEMENT textseg - - (fontsize, fontstyle, #PCDATA) </mdecl> <mdecl>!ELEMENT fontsize - - (#PCDATA) </mdecl> <mdecl>!ELEMENT fontstyle - - (#PCDATA) </mdecl>
<mdecl>!ELEMENT fontsize - - ANY </mdecl> <mdecl>!ELEMENT fontstyle - - (#PCDATA) </mdecl>
<mdecl>!ELEMENT rom - - ANY </mdecl> <mdecl>!ELEMENT ital - - ANY </mdecl> <mdecl>!ELEMENT bold - - ANY </mdecl> <mdecl>!ELEMENT bdit - - ANY </mdecl> <mdecl>!ELEMENT pt8 - - ANY </mdecl> <mdecl>!ELEMENT pt10 - - ANY </mdecl> <mdecl>!ELEMENT pt12 - - ANY </mdecl>
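The quantities P and S can be checked against the sample declarations above. The ATTLIST gives four styles (ro, it, bd, bi) and three sizes (8pt, 10pt, 12pt); the Cartesian set rom8 ... bdit12 should then contain P elements, and the fully expanded set rom ... pt12 should contain S. The following is a small arithmetic sketch; the dictionary name domains is introduced only for this illustration.

```python
from math import prod

# Attribute domains taken from the sample ATTLIST declaration.
domains = {
    "style": ["ro", "it", "bd", "bi"],
    "size": ["8pt", "10pt", "12pt"],
}

k = [len(values) for values in domains.values()]  # [4, 3]
P = prod(k)  # size of the Cartesian product of the domains
S = sum(k)   # one tag per attribute value

print(P, S)  # 12 7
```

With only two small domains the gap between P and S is modest; it widens quickly as categories are added, which is the combinatorial explosion noted under point b.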
<mdecl>!ELEMENT e - - %M; </mdecl>
then the element declarations for the attribute and Cartesian methods will differ only in their element names. The content model of the subordination method will be (%SubCats;, %M;) where the entity SubCats is defined as the conjunction of optional appearances of each piece of subordinate information:
<mdecl>!ENTITY % SubCats "(c_1? & c_2? & ... & c_n?)" </mdecl>
<mdecl>!ELEMENT e - - (%SubCats;, %M;) </mdecl>
This content model preserves free order and makes all subordinate tags optional. If all attributes are required, the entity SubCats would be defined by:
<mdecl>!ENTITY % SubCats "(c_1 & c_2 & ... & c_n)" </mdecl>
If free ordering is immaterial, the ampersands can be replaced by commas, and the parsing will be much easier. The parent tag (or its Oedipal replacement) in the Danish approach will have as its content model the concatenation (%DanishCats;, %M;), where the entity DanishCats is defined as the conjunction (%cat_1;? & %cat_2;? & ... & %cat_n;?) (or else the sequence or some other combination of the categories), and where each entity cat_i is defined, for subordinated attributes, as the name of the attribute:
<mdecl>!ENTITY % cat_i "c_i" </mdecl>
and, for Cartesian expansions of attribute domains, as the alternation of all legal attribute values:
<mdecl>!ENTITY % cat_i "(AttrVal_{i,1} | AttrVal_{i,2} | ... | AttrVal_{i,k_i})" </mdecl>
The overall declaration, then, looks something like this:
<mdecl>!ENTITY % cat_1 "c_1" </mdecl>
<mdecl>!ENTITY % cat_2 "(AttrVal_{2,1} | AttrVal_{2,2} | ... | AttrVal_{2,k_2})" </mdecl>
<mdecl>!ENTITY % cat_3 "(AttrVal_{3,1} | AttrVal_{3,2} | ... | AttrVal_{3,k_3})" </mdecl>
<mdecl>!ENTITY % DanishCats "(%cat_1;? & %cat_2;? & %cat_3;?)" </mdecl>
<mdecl>!ELEMENT e - - (%DanishCats;, %M;) </mdecl>
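The content-model strings discussed here follow a regular pattern, so they can be generated rather than written by hand. The sketch below is illustrative only: the function names subcats and cat_entity and their parameters are hypothetical, and the element names fontsize, fontstyle, rom, ital, bold and bdit are borrowed from the sample declarations earlier in this section.

```python
def subcats(names, required=False, free_order=True):
    """Build the replacement text of the SubCats entity.

    required=False appends '?' to every name (all tags optional);
    free_order=True joins with '&', otherwise with ',' (fixed sequence).
    """
    occurrence = "" if required else "?"
    connector = " & " if free_order else ", "
    return "(" + connector.join(name + occurrence for name in names) + ")"

def cat_entity(name, values=None):
    """Replacement text for one category entity in the Danish approach.

    A subordinated attribute is just its name; a Cartesian expansion
    is the alternation of its legal values.
    """
    if values is None:
        return name
    return "(" + " | ".join(values) + ")"

print(subcats(["fontsize", "fontstyle"]))
# (fontsize? & fontstyle?)
print(subcats(["fontsize", "fontstyle"], required=True, free_order=False))
# (fontsize, fontstyle)
print(cat_entity("fontstyle", ["rom", "ital", "bold", "bdit"]))
# (rom | ital | bold | bdit)
```

The two switches correspond to the two choices named in the text: whether the subordinate information is optional, and whether free ordering matters.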
h. The content models of possible parent text elements (text elements within which the tags in question can occur -- or content models in which the tag names do occur) will be related under the five approaches. Assume the DTD of the silent approach and take the set of all content models which contain the silent approach's one tag name; let us call this set the "parent content models" of the candidate tags in question.
The parent content models are identical under all methods except the Cartesian product method. In that method, new parent content models can be generated from the old by replacing the tag name of the silent method with a parenthesized disjunction of the tag names produced by the Cartesian method. Thus,
<mdecl>!ELEMENT volume - - (page)+ </mdecl>
would be rewritten
<mdecl>!ELEMENT volume - - (recto | verso)+ </mdecl>
and so forth.
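The replacement of a tag name by a parenthesized disjunction can be done textually. The following sketch assumes content models are available as plain strings and that tag names consist of letters, digits, periods and hyphens; the function name rewrite_parent_model is hypothetical. Because the disjunction is always parenthesized, the result for the example carries one redundant pair of parentheses, which does not change its meaning.

```python
import re

def rewrite_parent_model(model, old_name, new_names):
    """Replace each whole-name occurrence of old_name in a content model
    with a parenthesized disjunction of new_names."""
    disjunction = "(" + " | ".join(new_names) + ")"
    # Match old_name only when it is not part of a longer name.
    pattern = r"(?<![\w.-])" + re.escape(old_name) + r"(?![\w.-])"
    return re.sub(pattern, lambda match: disjunction, model)

print(rewrite_parent_model("(page)+", "page", ["recto", "verso"]))
# ((recto | verso))+
```

The lookaround guards keep a longer name such as pageno from being rewritten along with page.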
i. If we attempt to measure complexity of resulting DTDs or documents, the result varies with the measure; no one approach is simplest or most complex by all measures.
Approach    | E   | A | AT | CM    | R   | TN | TT   |
silent      | 1   | 0 | 0  | M     | Q   | 1  | 2    |
attribute   | 1   | 1 | n  | M     | Q   | 1  | 2n   |
Cartesian   | P   | 0 | 0  | M*P   | Q*P | 1  | 2    |
subordinate | n   | 0 | 0  | M+n-1 | Q   | n  | 3n-1 |
min. Danish | n-1 | 0 | 0  | M+n-1 | Q   | n  | 3n-1 |
max. Danish | S   | 0 | 0  | M+S   | Q   | n  | n+1  |