TEI MLW2 - On the elimination of attributes
CMSMcQ, 23 March 89

The draft Sandy circulated in mid-February has, I think, a better treatment of the question of attributes than we have had before. The guidelines for choosing between attributes and tags still don't persuade me, though, and the text sounds as though we thought the no-attributes approach were a viable one for the example given (verb tags). In reality, I think the example given is a very persuasive demonstration of why attributes are necessary -- not only for clarity and elegance in the tag set, but for processing. Here I try to show why I think so, working through a fairly obvious case of the same example.

If we wish to encode the traditional analysis of the surface grammar of a standard Latin text, ignoring complications like variant forms, deponents, periphrastic forms, participles, exceptions, and the study of grammar since about the fifth century, we might say:

1  Each word has a part of speech: noun, adjective, verb, adverb, pronoun, preposition, conjunction, or interjection. (I have the nasty feeling I'm leaving one out: there should be nine, but maybe I'm thinking of English, which adds articles.)
2  Each noun or adjective is distinguished by number, gender, case, and declension pattern.
3  Each verb form is distinguished by person, number, voice, tense, mood, and conjugation pattern.
4  Each pronoun is distinguished by person, number, and case. Third-person pronouns are additionally marked for gender and demonstrative class (unmarked, close, middle, distant).
5  Each conjunction is either coordinating or subordinating.
6  Each preposition takes a specific case, or takes either accusative or ablative; but we can omit that from our encoding because it does not vary from use to use (and when it does, we can look at the preposition's object).
7  Adverbs and interjections we can just tag with their part of speech.
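The category system of rules 1-7 can be summarized in tabular form; here is a sketch in present-day Python (purely illustrative -- the table and its names are my own shorthand, not part of any proposed tag set):

```python
# Hypothetical sketch: which categories distinguish the forms of each
# part of speech, per rules 1-7 above.
categories_for = {
    "noun":         ["number", "gender", "case", "declension"],
    "adjective":    ["number", "gender", "case", "declension"],
    "verb":         ["person", "number", "voice", "tense", "mood", "conjugation"],
    "pronoun":      ["person", "number", "case"],  # 3rd person adds gender, demonstrative class
    "conjunction":  ["type"],                      # coordinating or subordinating
    "preposition":  [],                            # case government omitted, per rule 6
    "adverb":       [],                            # part of speech only, per rule 7
    "interjection": [],                            # part of speech only, per rule 7
}
assert len(categories_for) == 8   # rule 1: eight parts of speech
```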
If we encode with attributes, our declaration will look something like this (if PCDATA is what these should include):

<!ELEMENT noun          - O  (#PCDATA) >
<!ELEMENT adjective     - O  (#PCDATA) >
<!ELEMENT verb          - O  (#PCDATA) >
<!ELEMENT adverb        - O  (#PCDATA) >
<!ELEMENT pronoun       - O  (#PCDATA) >
<!ELEMENT preposition   - O  (#PCDATA) >
<!ELEMENT conjunction   - O  (#PCDATA) >
<!ELEMENT interjection  - O  (#PCDATA) >
<!-- for each attribute, we allow omission if we haven't
     parsed it yet, by claiming the value is "implied"   -->
<!ATTLIST (noun & adjective)
          number  (sing | plur)                        #IMPLIED
          gender  (masc | fem | neut)                  #IMPLIED
          case    (nom | gen | dat | acc | abl | voc)  #IMPLIED
          decl    (1 | 2 | 3 | 4 | 5)                  #IMPLIED >
<!ATTLIST verb
          person  (1 | 2 | 3)                          #IMPLIED
          number  (sing | plur)                        #IMPLIED
          voice   (active | passive)                   #IMPLIED
          tense   (pres | imp | fut | presperf
                   | pluperf | futperf)                #IMPLIED
          mood    (ind | subj)                         #IMPLIED
          conj    (1 | 2 | 3 | 4)                      #IMPLIED
          -- this is illegal, since the name tokens 1-3 are already
             used for "person"; use (c1 | c2 | c3 | c4) instead -- >
<!ATTLIST pronoun
          number  (sing | plur)                        #IMPLIED
          person  (1 | 2 | 3)                          #IMPLIED
          gender  (masc | fem | neut)                  #IMPLIED
          case    (nom | gen | dat | acc | abl | voc)  #IMPLIED
          demons  (no | here | there | yonder)         #IMPLIED >
<!ATTLIST conjunction
          ctype   (coord | subord)                     #IMPLIED >
<exmp>

That makes forty-seven discrete values in twelve categories:

<fig>
 1  generic identifier      8 values
 2  number                  2
 3  gender                  3
 4  case                    6
 5  declension              5
 6  person                  3
 7  voice                   2
 8  tense                   6
 9  mood                    2
10  conjugation             4
11  demonstrative class     4
12  conjunction type        2
                        -----
    (total)                47
<efig>

The alternative simplifies from twelve categories to one, by eliminating attributes and moving their information into the tag.
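The forty-seven-value tally can be checked mechanically; a sketch in present-day Python, with the value lists transcribed from the attribute declarations above:

```python
# Hypothetical sketch: each category of the scheme, with its value list.
# The generic identifiers count as one category of eight values.
categories = {
    "generic identifier":  ["noun", "adjective", "verb", "adverb", "pronoun",
                            "preposition", "conjunction", "interjection"],
    "number":              ["sing", "plur"],
    "gender":              ["masc", "fem", "neut"],
    "case":                ["nom", "gen", "dat", "acc", "abl", "voc"],
    "declension":          ["1", "2", "3", "4", "5"],
    "person":              ["1", "2", "3"],
    "voice":               ["active", "passive"],
    "tense":               ["pres", "imp", "fut", "presperf", "pluperf", "futperf"],
    "mood":                ["ind", "subj"],
    "conjugation":         ["c1", "c2", "c3", "c4"],
    "demonstrative class": ["no", "here", "there", "yonder"],
    "conjunction type":    ["coord", "subord"],
}
assert len(categories) == 12                                  # twelve categories
assert sum(len(vals) for vals in categories.values()) == 47   # forty-seven values
```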
Since we need a separate tag for each combination of attributes, the count K of tags needed for each class of word is the product of the number of possible values for each attribute:

<fig>
K(adverbs)       = 1
K(prepositions)  = 1
K(interjections) = 1
K(conjunctions)  = 2
K(nouns)         = number * gender * case * declension
                 = 2 * 3 * 6 * 5                         = 180
K(adjectives)    = number * gender * case * declension
                 = 2 * 3 * 6 * 5                         = 180
K(pronouns)      = non-third-person + third-person
                 = person * number * case
                   + number * case * gender * demons
                 = 2 * 2 * 6  +  2 * 6 * 3 * 4
                 = 24 + 144                              = 168
K(verbs)         = person * number * voice * tense * mood * conj
                 = 3 * 2 * 2 * 6 * 2 * 4                 = 576
<efig>

This yields a total of 1109 discrete values in one category.

If we pushed this a little harder (adding the three other Indo-European cases, worrying about the twenty-five or so cases of Finnish or Hungarian, adding gender to verbs to handle Hebrew, adding aspect to verbs to handle Greek and Russian, adding a 'common' gender for modern Scandinavian and Dutch, adding 'objective' or 'oblique' as a value for modern English and other degenerate case systems), we could surely push it up to four or five thousand tags for morphology, without working very hard -- and without coming close enough to completeness for the scheme to be seriously usable for the languages we are pledged to support.

If we add the need for less-than-full specification of morphology (e.g. for cases where the analyst does not wish to decide, yet, whether a noun is accusative or dative), we have to generate not the set of all possible combinations of attribute values but the set of all possible combinations of values and missing values. Add one to each factor in the multiplications above. The results:

<fig>
K'(adverbs)       =    1
K'(prepositions)  =    1
K'(interjections) =    1
K'(conjunctions)  =    3
K'(nouns)         =  504
K'(adjectives)    =  504
K'(pronouns)      =  483
K'(verbs)         = 3780

Total: 5277 tags.
<efig>

This seems kind of unwieldy to me: too large not just for practical use in manual encoding, but for practical processing. Aesthetically, too, it's rather disappointing. It reminds me of a column in Jon Bentley's book <cit>Programming Pearls<ecit>:

<lq>Most programmers have seen them, and most good programmers realize they've written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs. I once saw a COBOL program whose guts were

    IF THISINPUT IS EQUAL TO 001 ADD 1 TO COUNT001.
    IF THISINPUT IS EQUAL TO 002 ADD 1 TO COUNT002.
    ...
    IF THISINPUT IS EQUAL TO 500 ADD 1 TO COUNT500.

[... The program] contained about 1600 lines of code: 500 to define the variables COUNT001 through COUNT500, the above 500 to do the counting, 500 to print out how many times each integer was found, and 100 miscellaneous statements. Think for a minute about how you could accomplish the same task with a program just a tiny fraction of the size [...]. <elq>

I haven't mentioned yet that a basic task in such work is to provide the dictionary form for any inflected token. Since there is no chance of providing that information in the tag, it has to be provided either as content or as an attribute value. It seems to me, intrinsically, to be an attribute value.

I'll stop here, rather than pound this any harder. If I seem a bit obdurate in my insistence that attributes be allowed, this kind of combinatorial explosion is one reason. It needs to be borne in mind that we have tagging of almost precisely this sort and level of detail for fairly substantial bodies of ancient Greek, Hebrew, and probably Latin. If the TEI scheme is to support existing taggings, what I have sketched above is a *minimal* apparatus.

It is fair to admit that one can get by with far fewer tags in English: the Brown corpus uses under 200.
But the failure to separate concepts like part-of-speech and inflectional information means the Brown scheme lacks all generality -- I can't use it, for example, to tag my texts in Middle High German.

My conclusion is that attributes are essential to serious tagging of literary and linguistic material. We would have to use them even if they did not help make clear, beautiful, compact tagging schemes.
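For the record, the combinatorial arithmetic above can be reproduced mechanically; a sketch in present-day Python, with the factor lists transcribed from the K and K' computations (the names are mine, chosen only for this check):

```python
# Hypothetical sketch: tag counts when every attribute combination
# becomes its own generic identifier.
def product(factors):
    n = 1
    for f in factors:
        n *= f
    return n

noun_factors = [2, 3, 6, 5]          # number, gender, case, declension
verb_factors = [3, 2, 2, 6, 2, 4]    # person, number, voice, tense, mood, conj
pron_non3    = [2, 2, 6]             # person (1st/2nd), number, case
pron_3       = [2, 6, 3, 4]          # number, case, gender, demons

K = (1 + 1 + 1 + 2                   # adverbs, prepositions, interjections, conjunctions
     + product(noun_factors)         # nouns:      180
     + product(noun_factors)         # adjectives: 180
     + product(pron_non3)            # pronouns:    24
     + product(pron_3)               #           + 144
     + product(verb_factors))        # verbs:      576
assert K == 1109

# Underspecified morphology: each category may also be left unmarked,
# so add one to each factor.
def loose(factors):
    return product([f + 1 for f in factors])

K_prime = (1 + 1 + 1 + 3
           + loose(noun_factors)     # nouns:      504
           + loose(noun_factors)     # adjectives: 504
           + loose(pron_non3)        # pronouns:    63
           + loose(pron_3)           #           + 420
           + loose(verb_factors))    # verbs:     3780
assert K_prime == 5277
```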