MLW5 - Notes on alternate approaches to attributes
C. M. Sperberg-McQueen, 11 May 89

It may help us clarify our discussion if we make some basic distinctions among approaches. I am not going to argue one way or another here on the attribute question, only to try to state as neutrally as I can the various alternatives. The clarity of my conception of our alternatives has benefited a great deal from electronic and telephonic discussion with SM, who should however not be held responsible for what follows.

The discussion about attributes has produced, I think, five alternative approaches to the marking of information which might be thought of as attributive. (By and large, that is, I think, information which is one-to-one with occurrences of text elements of specified types.) To take two simple examples:

* LB's 'page (break)' and 'side of page (recto | verso)' (ignoring, for now, the problems of making page marks coexist with markup of the logical structure, which is important but not relevant to the attribute question)

* text segments in fonts of specified size (e.g. 8 | 10 | 12 pt) and style (roman, italic, bold, bold italic)

1. One approach is to keep the major information and just ignore the other information: and . This is always a possibility (make the user who thinks the extra information is worthwhile extend the tag set) but not a solution. Let us call this the silent method. (Or the Wittgenstein method, if you prefer.)

2. Another approach will be to make one piece of information a text element and associate the others one-to-one with it as attributes: page content here ... or page content here ... italic text here ... or bold text here ... This is what I've been saying ought to be an option, even though it's not always the best of all possible approaches.

3. A third approach (that used in the Brown and LOB corpora for parts of speech, and suggested in an earlier draft of our syntax document) makes a tag for each combination of values of the attributes.
(Possible, obviously, only when the attributes range across finite domains.) page content here ... and page content here ... italic ten-point text here ... ... bold text here ... ... etc. Since this generates an attribute-free tag set by taking the Cartesian product of the attribute domains, we can call this the Cartesian product approach, or just the Cartesian method, if everyone agrees. If we omit point size, this appears the usual approach for tagging font shifts: ... , ... etc.

4. The fourth approach we have so far generated writes all the information as text elements -- one can rewrite any tag plus attributes by keeping the parent tag and rewriting each attribute name as a tag (to guarantee uniqueness write the parent tag name plus the attribute name) and rewriting the value of the attribute as content following that tag. Thus: recto page content here italic 12pt italic text here ... or bold10ptbold text here ... Since this method can rewrite attributes by subordinating them to their parent tag, let us call this the subordination approach.

5. In some cases, a combination of the Cartesian product and subordination approaches can be used. For each attribute, construct either (a) a single text element, rewriting the attribute value as content, or (b) an array of text elements, one for each possible value of the attribute. The tags created by (b) can either be empty or have the same scope as the old parent element. In some cases the old parent text element can be eliminated and one of the new text elements promoted in its place; in others, the old parent should be kept. In our second example, 12pt italic text here ... or 10ptbold text here ...

(It isn't clear to me when the old parent text element is omissible and when not. For font shifts it seems omissible -- for parts of speech, not omissible.
The parent can be omitted only if at least one candidate attribute has been expanded into an array of text elements, because only such tags can have their scope extended to that of the old parent.)

Since it sometimes results in eliminating the old parent, we could call this the Electra or the Oedipal approach. Since it doesn't always, thus appearing indecisive, let us call it the Hamlet or the Danish approach instead. The markup for this approach can be made to look something like the feature notations of some generative grammars.

-----

Some characteristics of these approaches seem worth noting, though they don't seem to point us unambiguously in one way or the other.

a. The Cartesian product approach erases the distinction between types of information (here, between "page" and "side", or among "fontshift", "font type" and "font size") and creates a population of related tags (related in their semantics and presumably in their syntactic behavior). The relationship among these related tags is implicit, but not explicit, in the syntactic restrictions of the DTD. (I am not sure whether it could be recognized formally or not.) The other approaches preserve the category distinctions more or less explicitly.

b. As seen in earlier discussion, the Cartesian product approach can succumb to combinatorial explosion.

c. The attribute, Cartesian product and Danish approaches (can) put information about legal attribute values into the document type definition: the attribute approach by using the attribute value declaration, the Cartesian and Danish approaches by providing tags only for some attribute values. If specification of legal attribute values doesn't belong in a DTD, none of these will be ideal.

d. All approaches except the silent approach allow documents to express the same information by their markup. They are extensionally (denotationally) equivalent, intensionally (connotationally) different.

e.
The connotations of the approaches seem very clear in some instances, but we have found no universal consistency in them.

f. Given a list of n categories of information c(1) ... c(i) ... c(n) such that category c(i) ranges over a domain of k(i) values, and each instance of any category is one-to-one with an instance of each of the others (i.e. they are suitable for treatment as attributes or as singly occurring constituents), the five approaches will produce different numbers of ELEMENT and ATTLIST declarations in the DTD.

- the silent approach will yield one tag with no attributes (1 ELEMENT, 0 ATTLIST statements)
- the attribute approach will yield one tag with n-1 attributes (1 ELEMENT, 1 ATTLIST statement with n lines, if we define each attribute name on a line by itself)
- the Cartesian product approach will yield P tags and no attributes, where P = k(1) * k(2) * ... * k(n) (P ELEMENT, 0 ATTLIST statements)
- the subordination approach will yield n tags with no attributes (n ELEMENT, 0 ATTLIST statements)
- the Danish approach will yield no attributes, at least n tags if the parent is retained, at least n-1 if not, and at most S tags, where S = k(1) + k(2) + ... + k(n). Specifically, there are 2**n possible combinations of choices, each giving a count of tags defined by:
  Count = (1 | 0) + (1 | k(2)) + ... + (1 | k(n)).

In our examples,

a. n=2 (page break, page side)
   k(1) = 1
   k(2) = 2 (recto, verso)
- silent approach yields 1: pagebreak
- attribute method yields 1: pagebreak type=(recto | verso)
- Cartesian product yields 2: recto verso
- subordination yields 2: pagebreak pagetype
- Danish approach does not apply (either = Cartesian product or = subordination)

b.
n=3 (font shift, font size, font style)
   k(1) = 1
   k(2) = 3 (8pt, 10pt, 12pt)
   k(3) = 4 (roman, italic, bold, bolditalic)
- silent approach yields 1: textseg
- attribute method yields 1: textseg with two attributes:
  style=(ro | it | bd | bi)
  size=(8pt | 10pt | 12pt)
- Cartesian method yields 12: rom8, rom10, rom12, ital8, ital10, ital12, bold8, bold10, bold12, bdit8, bdit10, bdit12
- subordination yields 3: textseg, fontsize, fontstyle
- Danish approach yields between 2 and 8.
  Minimum set is fontstyle, fontsize.
  Maximum set is rom, ital, bold, boldital, pt8, pt10, pt12, textsegment.

g. The content models for the five approaches are related. If the silent method for a set of categories assigns the content model M to its tag, then the attribute method will also use M as the content model for its tag. Each tag of the Cartesian product method will also use M for its content model.

The content model of the subordination method will be (%SubCats, M) where the entity SubCats is defined as:
   ( (c(1))? & (c(2))? & ... & (c(n))? )
or, to write it with fewer parentheses,
   (c1? & c2? & ... & cn?).
This content model preserves free order and makes all subordinate tags optional. If all attributes are required, the entity SubCats would be (c1 & c2 & ... & cn). If free ordering is immaterial, the ampersands can be replaced by commas, and the parsing will be much easier.

The content model for the Danish approach will be the concatenation (%DanishCats, M), where the entity DanishCats is defined as
   (%cat1? & %cat2? & ... & %catn?)
or
   (%cat1, %cat2, ... , %catn)
or some other variation, and where each entity catX is defined either by (NameOfAttribute) (for the subordinated attributes) or by (AttrVal1 | AttrVal2 | ... | AttrValN) (for the Cartesian expansions of attribute domains).

h. The content models of possible parent text elements (text elements within which the tags in question can occur -- or content models in which the tag names do occur) will be related under the five approaches.
Assume the DTD of the silent approach and take the set of all content models which contain the silent approach's one tag name; let us call this set of content models the "parent content models" of the candidate tags in question. The parent content models are identical under all methods except the Cartesian product method. In that method, new parent content models can be generated from the old by replacing the tag name of the silent method with a parenthesized disjunction of the tag names produced by the Cartesian method. Thus, would be rewritten and so forth.

i. If we attempt to measure complexity of resulting DTDs or documents, the result varies with the measure; no one approach is simplest or most complex by all measures.

   1 number of statements in the DTD
   2 number of lines or tokens in the DTD
   3 number of ELEMENT types in the hierarchy (= number of text elements defined in the DTD)
   4 number of statement types in the DTD
   5 number of tokens per model group or per declaration (an attempt to measure the complexity of individual ELEMENT statements in the DTD)
   6 number of tags needed to express the information in the document
   7 number of tokens needed to express the information in the document

I won't work out all the arithmetic, but it's clear that the silent method loses by 6-7 and does well on everything else; the Cartesian method wins by 6 and 7 but loses by 1-5. The attribute and subordination methods, on which we seem to be centering the discussion, tie with respect to some measures, and split the rest. Even more obvious is that different people will weight different measures differently.
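
For anyone who does want to check part of the arithmetic, the tag counts given under point f can be reproduced with a short script. What follows is only a sketch under stated assumptions: the function name declaration_counts and the dictionary keys are invented here for illustration, the list passed in is k(1) ... k(n) with k(1) taken as the parent category, and the Danish figures follow the Count formula literally, without enforcing the restriction that the parent may be dropped only when some attribute has been expanded.

```python
from itertools import product
from math import prod


def declaration_counts(domain_sizes):
    """Tag counts per approach for categories with the given domain sizes.

    domain_sizes is [k(1), ..., k(n)]; k(1) belongs to the parent category.
    The name and the returned keys are hypothetical, chosen for this sketch.
    """
    n = len(domain_sizes)

    # Danish approach: the parent is kept (1) or dropped (0); every other
    # category is either subordinated (1 tag) or expanded (k(i) tags).
    options = [(1, 0)] + [(1, size) for size in domain_sizes[1:]]
    danish = sorted({sum(choice) for choice in product(*options)})

    return {
        "silent": 1,                      # one tag, no attributes
        "attribute": 1,                   # one tag ...
        "attribute_count": n - 1,         # ... carrying n-1 attributes
        "cartesian": prod(domain_sizes),  # P = k(1) * k(2) * ... * k(n)
        "subordination": n,               # one tag per category
        "danish_min": danish[0],
        "danish_max": danish[-1],         # S = k(1) + k(2) + ... + k(n)
    }


# The two examples from point f: page break / side, and font shift / size / style.
page = declaration_counts([1, 2])
font = declaration_counts([1, 3, 4])
print(page)
print(font)
```

For the font example this gives 12 tags for the Cartesian method, 3 for subordination, and a Danish range of 2 to 8, in agreement with the figures listed under example b.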