MLW5 - Notes on alternate approaches to attributes

C. M. Sperberg-McQueen, 11 May 89


It may help us clarify our discussion if we make some basic distinctions
among approaches.  I am not going to argue one way or another here on
the attribute question, only to try to state as neutrally as I can the
various alternatives.  The clarity of my conception of our alternatives
has benefited a great deal from electronic and telephonic discussion
with SM, who should however not be held responsible for what follows.

The discussion about attributes has produced, I think, five alternative
approaches to the marking of information which might be thought of as
attributive.  (By and large, that is, I think, information which is
one-to-one with occurrences of text elements of specified types.)  To
take two simple examples:

    * LB's 'page (break)' and 'side of page (recto | verso)'
         (ignoring, for now, the problems of making page marks
         coexist with markup of the logical structure, which
         is important but not relevant to the attribute question)
    * text segments in fonts of specified size (e.g. 8 | 10 | 12 pt)
         and style (roman, italic, bold, bold italic)

1.  One approach is to keep the major information and just ignore the
other information:  <page> and <fontshift>.  This is always a
possibility (make the user who thinks the extra information is
worthwhile extend the tag set) but not a solution.  Let us call this the
silent method.  (Or the Wittgenstein method, if you prefer.)

2.  Another approach is to make one piece of information a text
element and associate the others one-to-one with it as attributes:

    <page side=recto>page content here ... </page>
    or
    <page side=verso>page content here ... </page>

    <textseg type=italic size=12pt>italic text here ... </textseg>
    or
    <textseg type=bold size=10pt>bold text here ... </textseg>

This is what I've been saying ought to be an option, even though it's
not always the best of all possible approaches.

3.  A third approach (that used in the Brown and LOB corpora for parts
of speech, and suggested in an earlier draft of our syntax document)
makes a tag for each combination of values of the attributes.
(Possible, obviously, only when the attributes range across finite
domains.)

    <recto>page content here ... </recto>
    and
    <verso>page content here ... </verso>

    <italic10>italic ten-point text here ... </italic10>
    <italic12> ... </italic12>
    <bold10>bold text here ... </bold10>
    <bold12> ... </bold12> etc.

Since this generates an attribute-free tag set by taking the Cartesian
product of the attribute domains, we can call this the Cartesian product
approach, or just the Cartesian method, if everyone agrees.

If we omit point size, this appears to be the usual approach for tagging
font shifts:  <italic> ... </italic>, <bold> ... </bold> etc.
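
As a minimal sketch of how the Cartesian tag set arises, the following
Python fragment forms one tag name per combination of attribute values.
The function name cartesian_tags and the abbreviated value names are
illustrative assumptions, not part of any proposal.

```python
from itertools import product

def cartesian_tags(*domains):
    """Return one tag name per combination of attribute values."""
    return ["".join(combo) for combo in product(*domains)]

# Hypothetical abbreviations for the style and size domains.
styles = ["rom", "ital", "bold", "bdit"]
sizes = ["8", "10", "12"]

tags = cartesian_tags(styles, sizes)
print(len(tags))
print(tags)
```

Each added attribute domain multiplies the length of the list, which is
why the method works only for finite (and small) domains.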

4.  The fourth approach we have so far generated writes all the
information as text elements.  One can rewrite any tag plus attributes
by keeping the parent tag, rewriting each attribute name as a tag (to
guarantee uniqueness, write the parent tag name plus the attribute
name), and rewriting the value of the attribute as content following
that tag.  Thus:

    <page><side>recto</side> page content here </page>

    <textseg>
        <type>italic</type>
        <size>12pt</size>
        italic text here ...
    </textseg>
    or
    <textseg><type>bold</type><size>10pt</size>bold text here ...
    </textseg>

Since this method can rewrite attributes by subordinating them to their
parent tag, let us call this the subordination approach.
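
The rewriting rule is mechanical enough to sketch in a few lines of
Python.  This is only an illustration under simple assumptions:  the
function name subordinate and its qualify switch are hypothetical, and
the element is given as a tag name, a mapping of attributes, and a
content string rather than parsed from a document.

```python
def subordinate(tag, attrs, content, qualify=False):
    """Rewrite a tag plus attributes as a parent tag with one child per attribute.

    With qualify=True each child is named parent + attribute, the device
    suggested above for guaranteeing unique tag names.
    """
    parts = []
    for name, value in attrs.items():
        child = tag + name if qualify else name
        # The attribute value becomes the content of the new child element.
        parts.append(f"<{child}>{value}</{child}>")
    return f"<{tag}>{''.join(parts)}{content}</{tag}>"

print(subordinate("page", {"side": "recto"}, " page content here "))
print(subordinate("textseg", {"type": "bold", "size": "10pt"},
                  "bold text here ...", qualify=True))
```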

5.  In some cases, a combination of the Cartesian product and
subordination approaches can be used.  For each attribute, construct
either (a) a single text element, rewriting the attribute value as
content, or (b) an array of text elements, one for each possible value
of the attribute.  The tags created by (b) can either be empty or
have the same scope as the old parent element.  In some cases the old
parent text element can be eliminated and one of the new text elements
promoted in its place; in others, the old parent should be kept.  In our
second example,

    <italic><size>12pt</size>
        italic text here ...
    </italic>
    or
    <bold><size>10pt</size>bold text here ... </bold>

(It isn't clear to me when the old parent text element is omissible and
when not.  For font shifts it seems omissible -- for parts of speech,
not omissible.  The parent can be omitted only if at least one candidate
attribute has been expanded into an array of text elements, because only
such tags can have their scope extended to that of the old parent.)

Since it sometimes results in eliminating the old parent, we could call
this the Electra or the Oedipal approach.  Since it doesn't always, thus
appearing indecisive, let us call it the Hamlet or the Danish approach
instead.

The markup for this approach can be made to look something like the
feature notations of some generative grammars.

-----

Some characteristics of these approaches seem worth noting, though they
don't seem to point us unambiguously in one way or the other.

a.  The Cartesian product approach erases the distinction between types
of information (here, between "page" and "side", or among "fontshift"
"font type" and "font size") and creates a population of related tags
(related in their semantics and presumably in their syntactic behavior).
The relationship among these related tags is implicit, but not explicit,
in the syntactic restrictions of the DTD.  (I am not sure whether it
could be recognized formally or not.)  The other approaches preserve
the category distinctions more or less explicitly.

b.  As seen in earlier discussion, the Cartesian product approach can
succumb to combinatorial explosion.
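
To give a rough sense of the growth, the following Python sketch
compares the product of the domain sizes (the Cartesian tag count) with
their sum.  The uniform domain size of 4 is an arbitrary assumption
chosen only for illustration.

```python
from math import prod

def cartesian_count(domain_sizes):
    """Number of tags the Cartesian product approach needs."""
    return prod(domain_sizes)

# Assume m attributes, each ranging over 4 values (illustrative only).
for m in range(1, 6):
    sizes = [4] * m
    print(m, cartesian_count(sizes), sum(sizes))
```

The product grows geometrically with the number of attributes, while
the sum grows only linearly.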

c.  The attribute, Cartesian product and Danish approaches (can) put
information about legal attribute values into the document type
definition:  the attribute approach by using the attribute value
declaration, the Cartesian and Danish approaches by providing tags only
for some attribute values.  If specification of legal attribute values
doesn't belong in a DTD, none of these will be ideal.
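
For the attribute approach, the declaration in question might be
generated as in the Python sketch below.  The helper name attlist is
hypothetical, and the choice of #IMPLIED as the default for every
attribute is an assumption made for the example, not a recommendation.

```python
def attlist(tag, attrs):
    """Emit one ATTLIST declaration with each attribute on its own line."""
    lines = [f"<!ATTLIST {tag}"]
    for name, values in attrs.items():
        # The enumerated group is what puts the legal values into the DTD.
        lines.append(f"    {name} ({' | '.join(values)}) #IMPLIED")
    return "\n".join(lines) + " >"

declaration = attlist("textseg", {
    "style": ["ro", "it", "bd", "bi"],
    "size": ["8pt", "10pt", "12pt"],
})
print(declaration)
```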

d.  All approaches except the silent approach allow documents to express
the same information by their markup.  They are extensionally
(denotationally) equivalent, intensionally (connotationally) different.

e.  The connotations of the approaches seem very clear in some
instances, but we have found no universal consistency in them.

f.  Given a list of n categories of information c(1) ... c(i) ... c(n)
such that category c(i) ranges over a domain of k(i) values, and each
instance of any category is one-to-one with an instance of each of the
others (i.e. they are suitable for treatment as attributes or as singly
occurring constituents), the five approaches will produce different
numbers of ELEMENT and ATTLIST declarations in the DTD.

    - the silent approach will yield one tag with no attributes
         (1 ELEMENT, 0 ATTLIST statements)
    - the attribute approach will yield one tag with n-1 attributes
         (1 ELEMENT, 1 ATTLIST statement with n lines, if we
         define each attribute name on a line by itself)
    - the Cartesian product approach will yield P tags and no attributes,
         where P = k(1) * k(2) * ... * k(n) (P ELEMENT, 0 ATTLIST
         statements)
    - the subordination approach will yield n tags with no attributes
         (n ELEMENT, 0 ATTLIST statements)
    - the Danish approach will yield no attributes, at least n tags if
         the parent is retained, at least n-1 if not, and at most S
         tags, where S = k(1) + k(2) + ... + k(n).  Specifically,
         there are 2**n possible combinations of choices (keep or drop
         the parent; subordinate or expand each attribute), each with
         a tag count defined by:
         Count = (1 | 0) + (1 | k(2)) + ... + (1 | k(n)).

In our examples,

    a.  n=2 (page break, page side)
         k(1) = 1
         k(2) = 2 (recto, verso)
    - silent approach yields 1:   pagebreak
    - attribute method yields 1:  pagebreak type=(recto | verso)
    - Cartesian product yields 2: recto
                                  verso
    - subordination yields 2:     pagebreak
                                  pagetype
    - Danish approach does not apply (either = Cartesian product or
         = subordination)

    b.  n=3 (font shift, font size, font style)
         k(1) = 1
         k(2) = 3 (8pt, 10pt, 12pt)
         k(3) = 4 (roman, italic, bold, bolditalic)

    - silent approach yields 1:   textseg
    - attribute method yields 1:  textseg
         with two attributes:          style=(ro | it | bd | bi)
                                       size=(8pt | 10pt | 12pt)
    - Cartesian method yields 12: rom8, rom10, rom12,
                                  ital8, ital10, ital12,
                                  bold8, bold10, bold12,
                                  bdit8, bdit10, bdit12
    - subordination yields 3:     textseg, fontsize, fontstyle
    - Danish approach yields between 2 and 8.
         Minimum set is           fontstyle, fontsize
         Maximum set is           rom, ital, bold, boldital,
                                  pt8, pt10, pt12,
                                  textsegment
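
The counts for example b can be checked mechanically.  The Python
sketch below applies the formulas of point f to a list of domain sizes
k, with k[0] = 1 standing for the parent category; the function name
tag_counts is an assumption of the sketch.  For the Danish approach it
enumerates all 2**n choices and reports the smallest and largest count.

```python
from itertools import product
from math import prod

def tag_counts(k):
    """Tag counts per approach for domain sizes k (k[0] is the parent)."""
    n = len(k)
    # Parent kept (1) or dropped (0); each attribute subordinated (1)
    # or expanded into one tag per value (k[i]).
    choices = [(1, 0)] + [(1, ki) for ki in k[1:]]
    danish = sorted({sum(combo) for combo in product(*choices)})
    return {
        "silent": 1,
        "attribute": 1,
        "cartesian": prod(k),
        "subordination": n,
        "danish": (danish[0], danish[-1]),
    }

counts = tag_counts([1, 3, 4])   # font shift, font size, font style
print(counts)
```

For the font example this reports 12 Cartesian tags, 3 subordination
tags, and a Danish range of 2 to 8, in agreement with the figures
given above.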

g.  The content models for the five approaches are related.  If the
silent method for a set of categories assigns the content model M to its
tag, then the attribute method will also use M as the content model for
its tag.  Each tag of the Cartesian product method will also use M for
its content model.

The content model of the subordination method will be (%SubCats, M)
where the entity SubCats is defined as:

    ( (c(1))? & (c(2))? & ... & (c(n))? )

or, to write it with fewer parentheses, (c1? & c2? & ... & cn?).  This
content model preserves free order and makes all subordinate tags
optional.  If all attributes are required, the entity SubCats would be
(c1 & c2 & ... & cn).  If free ordering is immaterial, the ampersands
can be replaced by commas, and the parsing will be much easier.
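
The variants of SubCats differ only in the connector and the occurrence
indicator, as the following Python sketch shows.  The function name
subcats and its two switches are hypothetical conveniences.

```python
def subcats(names, required=False, ordered=False):
    """Build the SubCats model group for the subordinated categories."""
    connector = ", " if ordered else " & "   # comma fixes the order
    suffix = "" if required else "?"          # "?" makes a tag optional
    return "(" + connector.join(name + suffix for name in names) + ")"

print(subcats(["type", "size"]))
print(subcats(["type", "size"], required=True))
print(subcats(["type", "size"], required=True, ordered=True))
```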

The content model for the Danish approach will be the concatenation
(%DanishCats, M), where the entity DanishCats is defined as (%cat1? &
%cat2? & ... & %catn?) or (%cat1, %cat2, ... , %catn) or some other
variation, and where each entity catX is defined either by
(NameOfAttribute) (for the subordinated attributes) or by (AttrVal1 |
AttrVal2 | ... | AttrValN) (for the Cartesian expansions of attribute
domains).
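
A comparable sketch for the Danish model follows, again in Python and
again with hypothetical names (cat, danish_model).  Here font style is
expanded into an array of tags and font size is subordinated; the
string "M" is left as a placeholder for the silent method's content
model, and the optional, free-order variant is chosen arbitrarily.

```python
def cat(name, values=None):
    """(name) for a subordinated attribute, (v1 | v2 | ...) for an expanded one."""
    if values is None:
        return f"({name})"
    return "(" + " | ".join(values) + ")"

def danish_model(cats, m):
    """Concatenate the optional, freely ordered categories with the model m."""
    optional = " & ".join(c + "?" for c in cats)
    return f"(({optional}), {m})"

style = cat("fontstyle", ["rom", "ital", "bold", "boldital"])
size = cat("fontsize")
print(danish_model([style, size], "M"))
```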

h.  The content models of possible parent text elements (text elements
within which the tags in question can occur -- or content models
in which the tag names do occur) will be related under the five
approaches.  Assume the DTD of the silent approach, and take the set of
all content models which contain the silent approach's one tag name.
Let us call this set the "parent content models" of the candidate tags
in question.

The parent content models are identical under all methods except the
Cartesian product method.  In that method, new parent content models
can be generated from the old by replacing the tag name of the silent
method with a parenthesized disjunction of the tag names produced by
the Cartesian method.  Thus,

    <!ELEMENT volume - - (page)+ >

would be rewritten

    <!ELEMENT volume - - (recto | verso)+ >

and so forth.
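
The replacement can be sketched as a textual substitution on the
content model.  The Python below is a simplification under stated
assumptions:  it works on the model group as a string, treats tag
names as plain words, and uses a hypothetical function name
rewrite_parent_model.  A real DTD processor would operate on parsed
declarations instead.

```python
import re

def rewrite_parent_model(model, old_tag, new_tags):
    """Replace old_tag in a content model by a disjunction of new_tags."""
    disjunction = "(" + " | ".join(new_tags) + ")"
    name = re.escape(old_tag)
    # Match "(old)" as a whole, so no doubled parentheses result,
    # or else the bare tag name wherever it stands as a word.
    pattern = r"\(\s*" + name + r"\s*\)|\b" + name + r"\b"
    return re.sub(pattern, lambda _: disjunction, model)

print(rewrite_parent_model("(page)+", "page", ["recto", "verso"]))
print(rewrite_parent_model("(title, page+)", "page", ["recto", "verso"]))
```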

i.  If we attempt to measure complexity of resulting DTDs or documents,
the result varies with the measure; no one approach is simplest or most
complex by all measures.

    1 number of statements in the DTD
    2 number of lines or tokens in the DTD
    3 number of ELEMENT types in the hierarchy (= number of text
         elements defined in the DTD)
    4 number of statement types in the DTD
    5 number of tokens per model group or per declaration
         (an attempt to measure the complexity of individual
         ELEMENT statements in the DTD)
    6 number of tags needed to express the information in the document
    7 number of tokens needed to express the information in the document

I won't work out all the arithmetic, but it's clear that the silent
method loses by 6-7 and does well on everything else, while the
Cartesian method wins by 6 and 7 but loses by 1-5.  The attribute and
subordination
methods, on which we seem to be centering the discussion, tie with
respect to some measures, and split the rest.  Even more obvious is that
different people will weight different measures differently.