Tagging Parts of Speech <author>C. M. Sperberg-McQueen <docnum>TEI &docfile. <date>&docdate. </titlep> <abstract> A set of feature structures sufficient to express the analysis underlying the tags of the tagged LOB corpus is provided, with a set of entity names for tagging text. The method of construction, similar to that described in AIW21, "On Lexical Ambiguity," is such as to allow simple extension to other languages and other grammatical features. The possible meanings of underspecified analyses are described and discussed. <note>The paper is not now complete; it offers a full list of the basic grammatical categories in the LOB scheme, and full specification of the features of verbs and some other categories. Full definitions of the LOB tags in feature-structure notation are given only for verb tags. The work on non-lexical parts of speech, especially, does not agree with normal linguistic analysis and should be brought into line. <p> Much work remains to be done in this area; I believe it should proceed as follows: <ol> <li>assume the method described here and in AI W 21 for representing complex feature structures with simple entity references built up out of other entity references for simple feature-value pairs <li>develop a set of feature structures (ignoring for the moment the SGML formalism) for <q>standard average European</q> as commonly annotated in large corpora: number, gender, case, tense, etc. <li>ensure that the feature structures so developed are upward compatible with commonly used schemes like LOB, Brown, etc. That is, LOB, Brown and other common schemes should fall out of the TEI scheme as simplifications or as particular sets of values. <li>if consensus can be achieved, propose a specific set of tags using this standard-average-European feature set, for use in corpus annotation. <li>using the method assumed in point 1, create the required entity definitions for features and feature structures. 
Optionally provide DTD modifications for enforcing the standard average feature set. </ol> </note> </abstract> <toc> </frontm> <body> <h1>Introduction This paper describes one approach to expressing part-of-speech tags using the feature-structure markup proposed by the A&I committee. It takes as a given the part-of-speech classification of the LOB corpus and seeks an equivalent expression of that classification in TEI-conformant SGML. <note>It is intended that this paper eventually provide a full specification of the LOB tags in feature-structure notation. A partial specification, however, is enough to make clear the direction being suggested and to allow for comment. The paper is thus being distributed in a half-complete state.</note> The grammatical features now outlined in this paper include those required for a full treatment of LOB scheme's verb tags, and the grammatical features required for LOB's treatment of nouns, pronouns, conjunctions, numerals, and determiner-pronouns. Features required for adverbs, determiners and articles, adjectives, qualifiers, and WH-pronouns must still be added; this is a matter of transcription from the <q>naive</q> SGML form in which they have already been worked out. Full feature-structure definitions are given only for the LOB verb tags; similar definitions for the other categories remain to be formulated, which should be straightforward. Further work should attempt to extend these definitions to other classifications. <h1>Description of the LOB Tags We begin with a list of the tags used in the LOB markup. This is taken from <cit>The Tagged LOB Corpus: User's Manual</cit>, by Stig Johansson in collaboration with Eric Atwell, Roger Garside, and Geoffrey Leech (Bergen: Norwegian Computing Centre for the Humanities, 1986). 
<note>List to be added.</note>
<h2>LOB Verb Tags
For the moment, let's work with just the verb tags:
<dl>
<dt>BE <dd>the verb TO BE
<dt>BED <dd>the verb TO BE, past tense
<dt>BEDZ <dd>the verb TO BE, past tense, 3rd person singular
<dt>BEG <dd>the participle BEING
<dt>BEM <dd>am, 'm
<dt>BEN <dd>been
<dt>BER <dd>are, 're
<dt>BEZ <dd>is, 's
<dt>DO <dd>the verb DO as auxiliary
<dt>DOD <dd>did
<dt>DOZ <dd>does
<dt>HV <dd>have
<dt>HVD <dd>had (as past tense)
<dt>HVG <dd>having
<dt>HVN <dd>had (as past participle)
<dt>HVZ <dd>has
<dt>MD <dd>modal auxiliary verb
<dt>VB <dd>lexical verb
<dt>VBD <dd>lexical verb in past tense
<dt>VBG <dd>lexical verb, present participle
<dt>VBN <dd>lexical verb, past participle
<dt>VBZ <dd>lexical verb, third-person singular present tense
</dl>
<h2>Naive Transcription into SGML
The semantics of these 22 tags can be reduced to a few atomic notions; if we use conventional (traditional?) grammatical terms, we can arrange these tags in a (sparse) matrix along the following axes:
<dl>
<dt>lexical type <dd>lexical verb, BE, DO, HAVE, auxiliary, or modal
<dt>number <dd>singular, plural, or unmarked
<dt>person <dd>1st, 2nd, 3rd, unmarked, or not applicable (for participles)
<dt>tense <dd>present, past, future
</dl>
Because this is modern English, these axes are not truly orthogonal: <term>plural</term> occurs only for BER, and <term>3rd-person</term> and <term>singular</term> correlate strongly. An analysis having <emph>only</emph> modern English in mind might thus collapse these features for reasons of economy; I keep them separate because this traditional analysis is clear and commonly understood and because it can more readily be extended to historical forms of English and to other Indo-European languages. The full analysis will also be required in the pronoun system, in any case. The feature structures for the LOB tags for verbs can be built out of these primitive notions.
One straightforward approach would use the major category as a generic identifier and specify feature-value pairs using the attribute-value notation. The element and attribute declarations would look like this:
<xmp font=mono><![CDATA[
<!ELEMENT verb>
<!ATTLIST verb
   n   (sg, pl, ind)      ind      -- number: singular, plural, indefinite --
   p   (1, 2, 3, 0)       0        -- person: 1, 2, 3, or unmarked --
   pt  (part, nonpart)    nonpart  -- participle: yes or no --
   t   (pres, past, fut)  pres     -- tense: present, preterite, future --
   lex (lex, be, do, have, aux, mod) lex
                                   -- lexical, auxiliary (and which), or modal --
>
]]> </xmp>
So the various LOB verb tags could be specified thus:
<gl>
<gt>BE (be) <gd>verb lex=be
<gt>BED (were) <gd>verb lex=be t=past
<gt>BEDZ (was) <gd>verb lex=be t=past p=3 n=sg
<gt>BEG (being) <gd>verb lex=be t=pres pt=part
<gt>BEM (am, 'm) <gd>verb lex=be t=pres p=1 n=sg
<gt>BEN (been) <gd>verb lex=be t=past pt=part
<gt>BER (are, 're) <gd>verb lex=be t=pres n=pl
<gt>BEZ (is, 's) <gd>verb lex=be t=pres p=3 n=sg
<gt>DO (do) <gd>verb lex=do
<gt>DOD (did) <gd>verb lex=do t=past
<gt>DOZ (does) <gd>verb lex=do t=pres p=3 n=sg
<gt>HV (have) <gd>verb lex=have
<gt>HVD (had, 'd) <gd>verb lex=have t=past
<gt>HVG (having) <gd>verb lex=have t=pres pt=part
<gt>HVN (had (pp)) <gd>verb lex=have t=past pt=part
<gt>HVZ (has, 's) <gd>verb lex=have t=pres p=3 n=sg
<gt>MD (modal aux) <gd>verb lex=mod
<gt>VB (base verb) <gd>verb
<gt>VBD (past tense) <gd>verb t=past
<gt>VBG (present participle, gerund) <gd>verb pt=part t=pres
<gt>VBN (past participle) <gd>verb pt=part t=past
<gt>VBZ (3rd pers sg) <gd>verb p=3 n=sg
</gl>
In the notes which follow, this direct translation of category names and values into generic identifiers, attribute names, and attribute values will be called the <q>naive</q> approach.
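<p>The naive mapping listed above is in effect a lookup table from LOB tag to attribute-value pairs. The following Python sketch, which is only an illustration and no part of the LOB or TEI materials, holds a few rows of that table and renders the corresponding start-tag. The names <q>NAIVE_TAGS</q> and <q>render_start_tag</q> are invented for the example, and only a subset of the verb tags is shown.

```python
# A few rows of the naive tag list, copied from the text above.
# NAIVE_TAGS and render_start_tag are hypothetical names for illustration only.
NAIVE_TAGS = {
    "BE":   {"lex": "be"},
    "BED":  {"lex": "be", "t": "past"},
    "BEDZ": {"lex": "be", "t": "past", "p": "3", "n": "sg"},
    "BEZ":  {"lex": "be", "t": "pres", "p": "3", "n": "sg"},
    "VB":   {},
    "VBD":  {"t": "past"},
    "VBZ":  {"p": "3", "n": "sg"},
}


def render_start_tag(lob_tag):
    """Return the naive SGML start-tag for one LOB verb tag."""
    attrs = NAIVE_TAGS[lob_tag]
    # Attributes are emitted in the order given in the table.
    parts = ["verb"] + [f"{name}={value}" for name, value in attrs.items()]
    return "<" + " ".join(parts) + ">"


if __name__ == "__main__":
    for tag in ("BEDZ", "VBD", "VB"):
        print(tag, render_start_tag(tag))
```

A table of this kind makes the weakness of the naive approach visible: the meaning of each tag lives in the table, not in the markup, and attributes that are omitted silently take the defaults declared in the attribute list.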
<h2>Transcription into SGML Using Feature Structures
The naive SGML version of the LOB verb tags can be translated directly into the feature-structure notation devised by the A&I committee. The structure for BEZ, for example, might be expressed thus:
<xmp font=mono> <![ CDATA [
<f.struct> <f.struct.name>BEZ
<feature><f.name>category <f.struct>verb </feature>
<feature><f.name>lexical type <f.struct>copula </feature>
<feature><f.name>number <f.struct>singular </feature>
<feature><f.name>person <f.struct>3rd </feature>
<feature><f.name>tense <f.struct>present </feature>
</f.struct>
]]> </xmp>
while the feature structure for VBD might be somewhat simpler:
<xmp font=mono> <![ CDATA [
<f.struct> <f.struct.name>VBD
<feature><f.name>category <f.struct>verb </feature>
<feature><f.name>lexical type <f.struct>full verb</feature>
<feature><f.name>tense <f.struct>preterite</feature>
</f.struct>
]]> </xmp>
<h2>Interpretation of Missing Features
Here we encounter a minor conundrum. By leaving <term>number</term> and <term>person</term> unspecified, this rendition of VBD could conceivably be claiming any of the following:
<ol>
<li id=free >1. that <term>number</term> was either <term>singular</term> or <term>plural</term> or <term>unmarked</term>, those being the allowable values
<li id=default>2. that the word in question is not marked for number (so the feature defaults to the value <term>unmarked</term>)
<li id=unknown>3. that the feature has an unknown value (e.g. the analysis is not complete and may or may not be completed later)
<li id=inappli>4. that the feature does not apply here (i.e. the analysis is complete without it and cannot ever supply a value for this feature)
</ol>
It seems better to forbid the second interpretation (<liref refid=default>), and insist that failure to specify a value says <emph>nothing</emph> about the value: no defaulting mechanism is provided or allowed.
Similarly, the first interpretation (<liref refid=free>) can be forced by explicitly providing an OR of the various possible values over which the feature can range or by providing a value like <term>unmarked</term>, which may have a similar effect, as it does here. The final interpretation (<liref refid=inappli>) may be tempting, but it would be unenforceable by any SGML parser. Moreover it would be redundant to specify this interpretation, by silence or any other means, every time a feature was not mentioned. It would suffice to specify such information once in a grammar; I conclude that a grammar is where such claims belong, and that we can therefore eliminate the final interpretation. Inapplicable features will always be passed over in silence, but not all features passed over in silence need be interpreted as inapplicable.<fn>Alternatively, a value of <term>unknown</term> or <term>unspecified</term> could be required for all features. This would eliminate the ambiguity between feature values not yet analyzed and inapplicable features, but it also makes the provision of underspecified analyses much more cumbersome. I do not recommend it.</fn>
More properly, then, the tag VBD ought to be analyzed this way, specifying the value <term>unmarked</term> for both <term>person</term> and <term>number</term>.
<xmp font=mono> <![ CDATA [
<f.struct> <f.struct.name>VBD
<feature><f.name>category <f.struct>verb </feature>
<feature><f.name>lexical type <f.struct>full verb</feature>
<feature><f.name>number <f.struct>unmarked </feature>
<feature><f.name>person <f.struct>unmarked </feature>
<feature><f.name>tense <f.struct>preterite</feature>
</f.struct>
]]> </xmp>
Or we can be more explicit about the combinatorial possibilities, banning the value <term>unmarked</term> and restricting the values to 1st, 2nd, or 3rd person and singular or plural.
In this case we provide explicit alternations to show the range of possibilities: <xmp font=mono> <![ CDATA [ <f.struct> <f.struct.name>VBD <feature><f.name>category <f.struct>verb </feature> <feature><f.name>lexical type <f.struct>full verb</feature> <feature><f.name>number <f.s.OR><f.struct>singular</f.struct> <f.struct>plural </f.struct> </f.s.OR> </feature> <feature><f.name>person <f.s.OR><f.struct>1st</f.struct> <f.struct>2nd</f.struct> <f.struct>3rd</f.struct> </f.s.OR> </feature> <feature><f.name>tense <f.struct>preterite</feature> </f.struct> ]]> </xmp> <h1>Definitions of Primitive Grammatical Elements It seems clear that feature structures like those just described may conveniently be expressed by general entity references which occur within <tag>f.struct</tag> tags, or which themselves contain the <tag>f.struct</tag> tags. Thus in a running text one might have: <xmp font=mono> Wash <f.struct>&nn; </f.struct> sinks <f.struct>&vbz; </f.struct> . <f.struct>&punct.stop;</f.struct> </xmp> It also seems clear that such complex entity values are best built up from smaller primitive entity values, each describing one feature. This has the advantage of allowing all analyses which use a grammatical feature (e.g. <term>number</term>) to use the same definitions. In the remainder of this paper I will give the entity definitions required for the LOB tags and give some simple examples of their possible use. <h2>Major Categories The major categories (<q>parts of speech</q>) assumed by the LOB tagging can be treated as values of a feature called <term>category</term>. 
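<p>The layering proposed above, in which a tag-level entity is assembled from entities for single feature-value pairs, can be tried out with a small Python sketch of recursive entity expansion. It is an illustration only: the bracketed strings such as <q>[number=singular]</q> are a shorthand invented here in place of the full <tag>feature</tag> markup, the function name <q>expand</q> is hypothetical, and real SGML entity processing has rules (marked sections, parameter entities, omitted reference closes) that the sketch ignores. The composition of VBZ follows the definition given later in this paper.

```python
import re

# Shorthand stand-ins for a few entity declarations. The bracketed replacement
# text is an abbreviation invented for this sketch, not TEI notation.
ENTITIES = {
    "v": "[category=verb]",
    "sing": "[number=singular]",
    "p3": "[person=3rd]",
    "par.no": "[participle=-]",
    "present": "[tense=present]",
    "vb.lex": "&v; [AUX=-] [MOD=-]",
    "VBZ": "&vb.lex; &sing; &p3; &par.no; &present;",
}

# An entity reference: '&', a name of letters, digits, '.' or '-', then ';'.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities, active=()):
    """Replace every entity reference in text, recursively."""
    def replace(match):
        name = match.group(1)
        if name in active:
            # Guard against an entity that refers to itself, directly or not.
            raise ValueError(f"recursive entity reference: {name}")
        return expand(entities[name], entities, active + (name,))
    return ENTITY_REF.sub(replace, text)


if __name__ == "__main__":
    print(expand("&VBZ;", ENTITIES))
```

Because every tag that mentions <term>number</term> goes through the same primitive entity, a change to the definition of one primitive propagates to all the tags built on it, which is the advantage claimed for this construction.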
<xmp font=mono> <![ cdata [ <!ENTITY v "<feature><f.name> category </f.name> <f.struct> verb </f.struct> </feature>" > <!ENTITY adv "<feature><f.name> category </f.name> <f.struct> adverb </f.struct> </feature>" > <!ENTITY n "<feature><f.name> category </f.name> <f.struct> noun </f.struct> </feature>" > <!ENTITY pron "<feature><f.name> category </f.name> <f.struct> pronoun </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY conj "<feature><f.name> category </f.name> <f.struct> conjunction</f.struct> </feature>" > <!ENTITY num "<feature><f.name> category </f.name> <f.struct> numeral </f.struct> </feature>" >    <!ENTITY AB "<feature><f.name> category </f.name> <f.struct> determiner-pronoun</f.struct> </feature> <feature><f.name> position </f.name> <f.struct> pre-posed</f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [  <!ENTITY AP "<feature><f.name> category </f.name> <f.struct> determiner-pronoun</f.struct> </feature> <feature><f.name> position </f.name> <f.struct> post-posed</f.struct> </feature>" > <!ENTITY det "<feature><f.name> category </f.name> <f.struct> determiner </f.struct> </feature>" > <!ENTITY article "<feature><f.name> category </f.name> <f.struct> article </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY ex "<feature><f.name> category </f.name> <f.struct> existential THERE</f.struct> </feature>" > <!ENTITY prep "<feature><f.name> category </f.name> <f.struct> preposition </f.struct> </feature>" > <!ENTITY adj "<feature><f.name> category </f.name> <f.struct> adjective </f.struct> </feature>" > <!ENTITY qual "<feature><f.name> category </f.name> <f.struct> qualifier </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY to "<feature><f.name> category </f.name> <f.struct> infinitival TO </f.struct> </feature>" > <!ENTITY uh "<feature><f.name> category </f.name> <f.struct> interjection </f.struct> </feature>" > <!ENTITY wh "<feature><f.name> category </f.name> <f.struct> 
WH-determiner </f.struct> </feature>" > <!ENTITY not "<feature><f.name> category </f.name> <f.struct> NOT </f.struct> </feature>" > <!ENTITY letter "<feature><f.name> category </f.name> <f.struct> letter </f.struct> </feature>" > <!ENTITY punct "<feature><f.name> category </f.name> <f.struct> punctuation </f.struct> </feature>" > <!ENTITY formula "<feature><f.name> category </f.name> <f.struct> formula </f.struct> </feature>" > <!ENTITY foreign "<feature><f.name> category </f.name> <f.struct> foreign phrase </f.struct> </feature>" > ]]> </xmp> <h2>Lexical Subcategorizations Like most linguists, LOB distinguishes among subgroups of the major categories; these subcategorizations may be expressed in feature-structure notation this way: <note>This section is not complete for all categories.</note> <xmp font=mono> <![ cdata [          <!ENTITY vb.lex "&v; <feature><f.name>AUX</f.name><minus></feature> <feature><f.name>MOD</f.name><minus></feature>" >  <!ENTITY vb.mod "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><plus></feature>" >  <!ENTITY vb.be "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><minus></feature> <feature><f.name> Lexical item</f.name> <f.struct> be </f.struct> </feature>" > <!ENTITY vb.do "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><minus></feature> <feature><f.name> Lexical item</f.name> <f.struct> do </f.struct> </feature>" > <!ENTITY vb.have "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><minus></feature> <feature><f.name> Lexical item</f.name> <f.struct> have </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [                      <!ENTITY n.com "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><minus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv 
</f.name><minus> </feature>" >  <!ENTITY n.cap "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >    <!ENTITY n.proper "&n; <feature><f.name>proper </f.name><plus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY np.loc "&n; <feature><f.name>proper </f.name><plus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>locative </f.name><plus> </feature> <feature><f.name>title </f.name><minus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY np.title "&n; <feature><f.name>proper </f.name><plus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>locative </f.name><minus> </feature> <feature><f.name>title </f.name><plus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY n.cited "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><minus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><plus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY n.unit "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><minus> </feature> <feature><f.name>unit noun </f.name><plus> </feature> <feature><f.name>cited word </f.name><minus> </feature> 
<feature><f.name>noun-as-adv </f.name><minus> </feature>" >
<!ENTITY n.adverb "&n;
 <feature><f.name>proper </f.name><minus> </feature>
 <feature><f.name>capitalized </f.name><minus> </feature>
 <feature><f.name>unit noun </f.name><minus> </feature>
 <feature><f.name>cited word </f.name><minus> </feature>
 <feature><f.name>noun-as-adv </f.name><plus> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY adv.nom "&adv; <feature><f.name> adv.type </f.name> <f.struct> denominative </f.struct> </feature>" >
<!ENTITY adv.prep "&adv; <feature><f.name> adv.type </f.name> <f.struct> prepositional </f.struct> </feature>" >
<!ENTITY adv.part "&adv; <feature><f.name> adv.type </f.name> <f.struct> participial </f.struct> </feature>" >
<!ENTITY adv.com "&adv; <feature><f.name> adv.type </f.name> <f.struct> unmarked </f.struct> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY pro.nom "&pron; <feature><f.name> pron.type </f.name> <f.struct> nominal </f.struct> </feature>" >
<!ENTITY pro.det "&pron; <feature><f.name> pron.type </f.name> <f.struct> determiner </f.struct> </feature>" >
<!ENTITY pro.pers "&pron; <feature><f.name> pron.type </f.name> <f.struct> personal </f.struct> </feature>" >
<!ENTITY pro.refl "&pron; <feature><f.name> pron.type </f.name> <f.struct> reflexive </f.struct> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY CC "&conj; <feature><f.name>subordinating</f.name><minus> </feature>" >
<!ENTITY CS "&conj; <feature><f.name>subordinating</f.name><plus> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY num.card "&num; <feature><f.name>ordinal</f.name><minus> </feature>" >
<!ENTITY num.ord "&num; <feature><f.name>ordinal</f.name><plus> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY AB.qual "&AB; <feature><f.name> det.type </f.name> <f.struct> qualifier </f.struct> </feature>" >
<!ENTITY AB.quant "&AB; <feature><f.name> det.type </f.name> <f.struct> quantifier </f.struct> </feature>" >
]]>
</xmp> <xmp font=mono> <![ CDATA [    <!ENTITY jj.attr "<feature><f.name>attrib-only</f.name><plus> </feature>" > <!ENTITY jj.pred "<feature><f.name>attrib-only</f.name><minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [      <!ENTITY rel.yes "<feature><f.name>relative</f.name><plus> </feature>" > <!ENTITY rel.no "<feature><f.name>relative</f.name><minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [     <!ENTITY p.bang "<feature><f.name>character</f.name> <f.struct> ! </f.struct> </feature>" > <!ENTITY p.openbr "<feature><f.name>character</f.name> <f.struct> ( </f.struct> </feature>" > <!ENTITY p.closbr "<feature><f.name>character</f.name> <f.struct> ) </f.struct> </feature>" > <!ENTITY p.openq "<feature><f.name>character</f.name> <f.struct> &ldquo </f.struct> </feature>" > <!ENTITY p.closq "<feature><f.name>character</f.name> <f.struct> &rdquo </f.struct> </feature>" > <!ENTITY p.dash "<feature><f.name>character</f.name> <f.struct> &dash </f.struct> </feature>" > <!ENTITY p.comma "<feature><f.name>character</f.name> <f.struct> , </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY p.stop "<feature><f.name>character</f.name> <f.struct> . </f.struct> </feature>" > <!ENTITY p.ellips "<feature><f.name>character</f.name> <f.struct> &hellip </f.struct> </feature>" > <!ENTITY p.colon "<feature><f.name>character</f.name> <f.struct> : </f.struct> </feature>" > <!ENTITY p.semi "<feature><f.name>character</f.name> <f.struct> ; </f.struct> </feature>" > <!ENTITY p.query "<feature><f.name>character</f.name> <f.struct> ? </f.struct> </feature>" > ]]> </xmp> <h2>Number, Person, Case, Gender, and Other Grammatical Features Features of traditional grammar like number, gender, and case appear in many of the LOB tags. 
<note>This section is not complete for all categories.</note> <xmp font=mono> <![ cdata [    <!ENTITY sing "<feature><f.name> number </f.name> <f.struct> singular </f.struct> </feature>" > <!ENTITY plur "<feature><f.name> number </f.name> <f.struct> plural </f.struct> </feature>" > <!ENTITY num.no "<feature><f.name> number </f.name> <f.struct> unmarked </f.struct> </feature>" >     ]]> </xmp> <xmp font=mono> <![ CDATA [       <!ENTITY p1 "<feature><f.name> person </f.name> <f.struct> 1st </f.struct> </feature>" > <!ENTITY p2 "<feature><f.name> person </f.name> <f.struct> 2nd </f.struct> </feature>" > <!ENTITY p3 "<feature><f.name> person </f.name> <f.struct> 3rd </f.struct> </feature>" > <!ENTITY impers "<feature><f.name> person </f.name> <f.struct> none </f.struct> </feature>" >  <!ENTITY per.no "<feature><f.name> person </f.name> <f.struct> unmarked </f.struct> </feature>" > <!ENTITY partic "<feature><f.name> participle </f.name> <plus> </feature>" > <!ENTITY par.no "<feature><f.name> participle </f.name> <minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [        <!ENTITY present "<feature><f.name> tense </f.name> <f.struct> present </f.struct> </feature>" > <!ENTITY preterite "<feature><f.name> tense </f.name> <f.struct> preterite </f.struct> </feature>" >    <!ENTITY future "<feature><f.name> tense </f.name> <f.struct> future </f.struct> </feature>" > <!ENTITY presperf "<feature><f.name> tense </f.name> <f.struct> present </f.struct> </feature> <feature><f.name> perfective </f.name><plus> </feature>" > <!ENTITY pluperf "<feature><f.name> tense </f.name> <f.struct> preterite</f.struct> </feature> <feature><f.name> perfective </f.name><plus> </feature>" > <!ENTITY futperf "<feature><f.name> tense </f.name> <f.struct> future </f.struct> </feature> <feature><f.name> perfective </f.name><plus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [   <!ENTITY pos "<feature><f.name> degree </f.name> <f.struct> positive </f.struct> </feature>" > <!ENTITY comp 
"<feature><f.name> degree </f.name> <f.struct> comparative </f.struct> </feature>" > <!ENTITY sup "<feature><f.name> degree </f.name> <f.struct> superlative </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [       <!ENTITY nom "<feature><f.name> case </f.name> <f.struct> nominative </f.struct> </feature>" > <!ENTITY gen "<feature><f.name> case </f.name> <f.struct> genitive </f.struct> </feature>" >   <!ENTITY acc "<feature><f.name> case </f.name> <f.struct> accusative </f.struct> </feature>" > <!ENTITY case.no "<feature><f.name> case </f.name> <f.struct> unmarked </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [    <!ENTITY masc "<feature><f.name> gender </f.name> <f.struct> masculine </f.struct> </feature>" > <!ENTITY fem "<feature><f.name> gender </f.name> <f.struct> feminine </f.struct> </feature>" > <!ENTITY neut "<feature><f.name> gender </f.name> <f.struct> neuter </f.struct> </feature>" >    <!ENTITY common "<feature><f.name> gender </f.name> <f.s.OR> <f.struct> masculine </f.struct> <f.struct> feminine </f.struct> </f.s.OR> </feature>" > <!ENTITY gend.no "<feature><f.name> gender </f.name> <f.struct> unmarked </f.struct> </feature>" > ]]> </xmp> <h2>Miscellaneous Features Some other features are not recognizable as traditional grammatical notions. 
<note>This section is not complete for all categories.</note> <xmp font=mono> <![ cdata [     <!ENTITY num.one "<feature><f.name>unitary value</f.name><plus> </feature>" > <!ENTITY num.plur "<feature><f.name>unitary value</f.name><minus> </feature>" > <!ENTITY num.pair "<feature><f.name> count </f.name> <f.struct> 2 </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [     <!ENTITY conj.dbl "<feature><f.name>double-conj </f.name><plus> </feature>" > <!ENTITY c.dbl.no "<feature><f.name>double-conj </f.name><minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [   <!ENTITY pre.yes "<feature><f.name>preposable </f.name><plus> </feature>" > <!ENTITY pre.no "<feature><f.name>preposable </f.name><minus> </feature>" > <!ENTITY post.yes "<feature><f.name>postposable </f.name><plus> </feature>" > <!ENTITY post.no "<feature><f.name>postposable </f.name><minus> </feature>" > ]]> </xmp> <h1>Combinations of Primitives We can define the verbal tags of the LOB scheme fully as follows: <xmp font=mono> <![ cdata [  <!ENTITY VB "&vb.lex; &num.no; &per.no; &par.no; &present;" > <!ENTITY VBD "&vb.lex; &num.no; &per.no; &par.no; &preterite;" > <!ENTITY VBG "&vb.lex; &num.no; &partic; &present;" > <!ENTITY VBN "&vb.lex; &num.no; &partic; &preterite;" > <!ENTITY VBZ "&vb.lex; &sing; &p3; &par.no; &present;" > ]]> </xmp> <xmp font=mono> <![ CDATA [  <!ENTITY MD "&vb.mod; &num.no; &per.no; &par.no; &present;" >      <!ENTITY BE "&vb.be; &num.no; &per.no; &par.no; &present;" > <!ENTITY BED "&vb.be; &num.no; &per.no; &par.no; &preterite;" > <!ENTITY BEDZ "&vb.be; &sing; &p3; &par.no; &preterite;" > <!ENTITY BEG "&vb.be; &num.no; &per.no; &partic; &present;" > <!ENTITY BEM "&vb.be; &sing; &p1; &par.no; &present;" > <!ENTITY BEN "&vb.be; &num.no; &per.no; &partic; &preterite;" > <!ENTITY BER "&vb.be; &plur; &per.no; &par.no; &present;" > <!ENTITY BEZ "&vb.be; &sing; &p3; &par.no; &present;" > ]]> </xmp> <xmp font=mono> <![ CDATA [  <!ENTITY DO "&vb.do; &num.no; &per.no; 
&par.no; &present;" > <!ENTITY DOD "&vb.do; &num.no; &per.no; &par.no; &preterite;" > <!ENTITY DOZ "&vb.do; &sing; &p3; &par.no; &present;" > ]]> </xmp> Note that in VBD, DOD, and BED, the string <q><entity>num.no</entity> <entity>per.no</entity></q> says, correctly, that the verbs in question are not marked for person and number. In the case of VBD and DOD, however, this means <term>person</term> can be 1st, 2nd, or 3rd, and <term>number</term> can be singular or plural, in any combination; in the case of BED, it means that <term>person</term> and <term>number</term> can be any combination except 3rd-person singular. This is a simple fact of English grammar. Our choice of expression, modeled on the choices made in the LOB tag scheme, places the burden for handling this fact on the grammar and the application program; one could also change the definitions of these entities to make it explicit here. This facility effectively allows us to specify, in our entity declarations, just what we mean by a given part-of-speech classification, and thus represents an advantage over the naive approach presented earlier. All the other LOB tags can be similarly defined; completion of the definition is for now left to the reader as an exercise. </body> <appendix> <h1>Definition of all LOB tags in feature-structure notation </appendix>  <appendix> <h1>Summary of Features <h2>Binary Features <xmp> +/-AUX auxiliary /* verb */ +/-MOD modal /* verb */ +/-PROP proper /* noun */ +/-CAP capitalized /* noun, adjective */ +/-SUB subordinating /* conjunction */ +/-ORD ordinal /* number */ +/-PERF perfective /* tensed verbs */ +/-PART participle /* verbs */ +/-LOC locative term /* proper nouns */ +/-TITL title /* proper nouns */ +/-UNIT unit-term /* noun */ +/-CITE cited-word /* noun */ +/-ATTR attributive /* adjectives */ +/-PRED predicative /* adjectives -- redundant? */ +/-DBLC double-conj /* determiner/pronouns, and determiners */ +/-PRE preposable /* ? 
may precede its head */ +/-POST postposable /* ? may follow its head */ +/-PTCL particle /* ? adverb ?==? inverse of +/-takes-complement? */ +/-REL relative /* pronouns -- alternative to pron.type */ +/-PERS personal /* pronouns -- alternative to pron.type */ +/-REFL reflexive /* pronouns -- alternative to pron.type */ +/-WH WH-word /* pronouns, adverbs */ /* cross-category usages: */ +/-pseudo-adverb /* i.e. can appear in adverbial positions -- noun */ +/-pseudo-noun /* i.e. can appear in noun positions -- adverb */ +/-also-prep /* i.e. is also a preposition -- adverb */ +/-DET /* i.e. is a determiner -- pronoun */ +/-exnoun /* formed from a noun -- pronoun (anybody ...) */ </xmp> <h2>N-way Features <xmp> /* Base categories */ CAT category = verb | adverb | noun | pronoun | conjunction | number | determiner | article | THERE | preposition | adjective | qualifier | TO | interjection | [WH] | NOT | letter | punctuation | formula | foreign /* Sub-categorization */ LEX lexitem = (string) /* verbs */ CHAR character = (string) /* punctuation */ CNT count = (integer) /* numbers -- for pairs, ranges */ [ATYP adv.type = nominal | preposition | particle | unmarked ] [prefer binary +/-pseudo-noun +/-also.prep +/-ptcl ] [PTYP pron.type = nominal | determiner | personal | reflexive ] [prefer binary +/-exnoun +/-det +/-pers +/-refl +/-wh +/-rel ] DTYP det.type = qualifier | quantifier /* Categories of Traditional Grammar */ NUM number = singular | plural | unmarked PER person = 1st | 2nd | 3rd | none | unmarked TEN tense = present | preterite | future DEG degree = positive | comparative | superlative CASE case = nominative | genitive | accusative | unmarked GEN gender = masculine | feminine | neuter | unmarked [ | common ] </xmp> <h2>Cross-Category Groupings in LOB <xmp> noun-but-can-serve-as-adverb adverb-but-can-serve-as-noun (as prepositional object) adverb-or-preposition (RI) adverb-or-preposition-without-object (RP) </xmp> <h1>Definitions of LOB, Brown, and Lancaster Tags 
<h2>LOB tags <xmp> Summary of tags (with spaces inserted for clarity): 1 A B L pre-qualifier (quite, rather, such) 7.12 CAT=(DET|PRON), DTYP=QUALIFIER, +PRE 2 A B N pre-quantifier (all, half) 7.12 CAT=(DET|PRON), DTYP=QUANTIFIER, +PRE 3 A B X pre-quantifier/pronoun/double conjunction (both) CAT=(DET|PRON), DTYP=QUANTIFIER, +PRE, +DBLC 4 A P post-determiner/pronoun. CAT=(DET|PRON), +POST 5 A P $ other's CAT=(DET|PRON), +POST, CASE=GEN 6 A P S others CAT=(DET|PRON), +POST, NUM=PLURAL 7 A P S $ others' CAT=(DET|PRON), +POST, CASE=GEN, NUM=PLURAL 8 A T article, singular (a, an, every) 7.12 CAT=ARTICLE, NUM=SINGULAR 9 A T I article, sing or plural (the, no) 7.12 CAT=ARTICLE, NUM=UNMARKED 10 BE be CAT=VERB +AUX -MOD -PART TEN=PRES LEX=BE 11 BE D were CAT=VERB +AUX -MOD -PART TEN=PRET LEX=BE 12 BE D Z was CAT=VERB +AUX -MOD -PART TEN=PRET NUM=SING PER=3 LEX=BE 13 BE G being CAT=VERB +AUX -MOD +PART TEN=PRES LEX=BE 14 BE M am, 'm CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=1 LEX=BE 15 BE N been CAT=VERB +AUX -MOD +PART TEN=PRET LEX=BE 16 BE R are, 're CAT=VERB +AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD LEX=BE 17 BE Z is, 's CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=3 LEX=BE 18 CC coordinating conjunction (and, and/or, but, nor, only, or, yet) CAT=CONJ -SUB 19 CD 2, 3, two, three, hundred, thousand, dozen, zero - 7.17 20 CD $ cardinal + genitive 21 CD -CD hyphenated pair of cardinals 7.17 22 CD 1 one, 1 7.17 23 CD 1 $ one's 24 CD 1 S ones 25 CD S cardinal + plural (tens, millions, dozens, etc.) 26 CS subordinating conjunction (after, although, etc.) 
       7.14-15  CAT=CONJ +SUB
 27  DO  do  7.5
       CAT=VERB +AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD LEX=DO
 28  DO D  did
       CAT=VERB +AUX -MOD -PART TEN=PRET NUM=UNMKD PER=UNMKD LEX=DO
 29  DO Z  does
       CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=3 LEX=DO
 30  DT  singular determiner (another, each, that, this)  7.12
 31  DT $  singular determiner + genitive (another's)
 32  DT I  singular or plural determiner (any, enough, some)
 33  DT S  plural determiner (those, these)
 34  DT X  determiner/double conjunction (either, neither)  7.12
 35  EX  existential 'there'
 36  HV  have  7.5
       CAT=VERB +AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD LEX=HAVE
 37  HV D  had, 'd
       CAT=VERB +AUX -MOD -PART TEN=PRET NUM=UNMKD PER=UNMKD LEX=HAVE
 38  HV G  having
       CAT=VERB +AUX -MOD +PART TEN=PRES LEX=HAVE
 39  HV N  had (past participle)
       CAT=VERB +AUX -MOD +PART TEN=PRET LEX=HAVE
 40  HV Z  has, 's
       CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=3 LEX=HAVE
 41  IN  preposition (about, above, etc.)  7.13, 7.15
 42  JJ  adjective  7.3-4, 7.8-9, 7.11
 43  JJ B  attributive-only adjective (chief, main, entire, etc.)
 44  JJ R  comparative adjective  7.9, 7.11
 45  JJ T  superlative adjective  7.9, 7.11
 46  J NP  adj with word-initial capital (English, German, etc.)
 47  MD  modal auxiliary
       CAT=VERB +AUX +MOD TEN=PRES NUM=UNMKD PER=UNMKD
 48  N C  cited word  7.23
       NC  CAT=NOUN N=SING CASE=NOM -PROP -CAP -UNIT +CITE
 49  N N  noun, sg, common  7.4, 7.6, 7.7
       NN  CAT=NOUN N=SING CASE=NOM -PROP -CAP -UNIT -CITE
 50  N N $  noun, sg, common, + genitive  7.6
       NN$  CAT=NOUN N=SING CASE=GEN -PROP -CAP -UNIT -CITE
 51  N N P  noun, sg, common, with word-initial capital  7.7
       NNP  CAT=NOUN N=SING CASE=NOM -PROP +CAP -UNIT -CITE
 52  N N P $  noun, sg, common, with word-init cap and genitive
       NNP$  CAT=NOUN N=SING CASE=GEN -PROP +CAP -UNIT -CITE
 53  N N P S  noun, pl, common, with word-init cap
       NNPS  CAT=NOUN N=PLUR CASE=NOM -PROP +CAP -UNIT -CITE
 54  N N P S $  noun, pl, common, with word-init cap and genitive
       NNPS$  CAT=NOUN N=PLUR CASE=GEN -PROP +CAP -UNIT -CITE
 55  N N S  noun, pl, common  7.6, 7.7
       NNS  CAT=NOUN N=PLUR CASE=NOM -PROP -CAP -UNIT -CITE
 56  N N S $  noun, pl, common, + genitive
       NNS$  CAT=NOUN N=PLUR CASE=GEN -PROP -CAP -UNIT -CITE
 57  N N U  noun, abbrev unit of measurement (hr., lb., etc.)
       NNU  CAT=NOUN N=SING CASE=NOM -PROP -CAP +UNIT -CITE
 58  N N U S  noun, abbrev unit of measurement, pl (gns, yds, etc.)
       NNUS  CAT=NOUN N=PLUR CASE=NOM -PROP -CAP +UNIT -CITE
 59  N P  noun, sg, proper  7.7
       NP  CAT=NOUN N=SING CASE=NOM +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 60  N P $  noun, sg, proper, + genitive
       NP$  CAT=NOUN N=SING CASE=GEN +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 61  N P L  noun, sg, locative with word-initial cap (Abbey,
       NPL  CAT=NOUN N=SING CASE=NOM +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 62  N P L $  ditto + genitive
       NPL$  CAT=NOUN N=SING CASE=GEN +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 63  N P L S  noun, pl, locative with word-initial cap
       NPLS  CAT=NOUN N=PLUR CASE=NOM +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 64  N P L S $  ditto + genitive
       NPLS$  CAT=NOUN N=PLUR CASE=GEN +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 65  N P S  noun, pl, proper  7.7
       NPS  CAT=NOUN N=PLUR CASE=NOM +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 66  N P S $  noun, pl, proper, + genitive
       NPS$  CAT=NOUN N=PLUR CASE=GEN +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 67  N P T  noun, sg, titular with word-initial cap
       NPT  CAT=NOUN N=SING CASE=NOM +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 68  N P T $  noun, sg, titular, cap, + genitive
       NPT$  CAT=NOUN N=SING CASE=GEN +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 69  N P T S  noun, pl, titular, cap
       NPTS  CAT=NOUN N=PLUR CASE=NOM +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 70  N P T S $  noun, pl, titular, cap, + genitive
       NPTS$  CAT=NOUN N=PLUR CASE=GEN +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 71  N R  noun, sg, adverbial (Jan, Feb, east, today,
       NR  CAT=NOUN N=SING CASE=NOM -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 72  N R $  noun, sg, adverbial + genitive
       NR$  CAT=NOUN N=SING CASE=GEN -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 73  N R S  noun, pl, adverbial
       NRS  CAT=NOUN N=PLUR CASE=NOM -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 74  N R S $  noun, pl, adverbial + genitive
       NRS$  CAT=NOUN N=PLUR CASE=GEN -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 75  OD  ordinal (1st, 2nd, first, ...)
       7.17
 76  OD $  ordinal + genitive
 77  P N  nominal pron (anybody, anyone, anything; everybody,
 78  P N $  nominal pron + genitive
 79  P P $  poss determiner (my, your, etc.)  7.12
 80  P P $$  poss pron (mine, yours, etc.)
 81  P P 1 A  pers pron, 1st pers sing nom (I)
 82  P P 1 A S  pers pron, 1st pers plur nom (we)
 83  P P 1 O  pers pron, 1st pers sing acc (me)
 84  P P 1 O S  pers pron, 1st pers plur acc (us)
 85  P P 2  pers pron, 2nd pers (you, thou, thee, ye)
 86  P P 3  pers pron, 3rd pers sing nom + acc (it)
 87  P P 3 A  pers pron, 3rd pers sing nom (he, she)
 88  P P 3 A S  pers pron, 3rd pers plur nom (they)
 89  P P 3 O  pers pron, 3rd pers sing acc (him, her)
 90  P P 3 O S  pers pron, 3rd pers plur acc (them, 'em)
 91  P P L  refl pron, sg
 92  P P L S  refl pron, pl; reciprocal pron
 93  QL  qualifier (as, awfully, less, more, so, too, very, ...)
 94  QL P  post-qualifier (enough, indeed)
 95  R B  adverb  7.10-7.11
       CAT=ADV DEG=POS CASE=UNMKD -PSEUDO.NOUN -ALSO.PREP -WH
 96  R B $  adverb + genitive (else's)
       CAT=ADV CASE=GEN -PSEUDO.NOUN -ALSO.PREP -WH
 97  R B R  comparative adverb
       CAT=ADV DEG=COMP CASE=UNMKD -PSEUDO.NOUN -ALSO.PREP -WH
 98  R B T  superlative adverb
       CAT=ADV DEG=SUP CASE=UNMKD -PSEUDO.NOUN -ALSO.PREP -WH
 99  R I  adverb (homograph of preposition: below, near, ...)
       CAT=ADV DEG=POS CASE=UNMKD -PSEUDO.NOUN +ALSO.PREP -PTCL -WH
100  R N  nominal adverb (here, now, there, then)  7.10
       CAT=ADV DEG=POS CASE=UNMKD +PSEUDO.NOUN -ALSO.PREP -WH
101  R P  adverbial particle (back, down, off, ...)
       7.10, 7.13  CAT=ADV DEG=POS CASE=UNMKD -PSEUDO.NOUN +ALSO.PREP +PTCL -WH
102  TO  infinitival 'to'
       CAT=TO
103  UH  interjection
       CAT=INTERJECTION
104  VB  base form of verb (uninflected present tense, imper)
       CAT=VERB -AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD
105  VB D  past tense of verb  7.3
       CAT=VERB -AUX -MOD -PART TEN=PRET
106  VB G  present participle, gerund  7.4
       CAT=VERB -AUX -MOD +PART TEN=PRES
107  VB N  past participle  7.3
       CAT=VERB -AUX -MOD +PART TEN=PRET
108  VB Z  3d person sg
       CAT=VERB -AUX -MOD -PART TEN=PRES NUM=SING PER=3
109  W DT  WH-determiner (what, whatever, interrogative
110  W DT R  WH-determiner, relative (which)  7.16
111  W P  WH-pron, interrogative, nom+acc (who, whoever)
112  W P $  WH-pron, interrogative, gen (whose)
113  W P $ R  WH-pron, relative, gen (whose)
114  W P A  WH-pron, nom (whosoever)
115  W P O  WH-pron, interrogative, acc (whom, whomsoever)
116  W P O R  WH-pron, relative, acc (whom)
117  W P R  WH-pron, relative, nom+acc (that, relative who)  7.14,
118  W RB  WH-adverb (how, when, ...)  7.16
119  XNOT  'not'
120  ZZ  letter
121  !  exclamation mark
122  &FO  formula  7.22
123  &FW  foreign word  7.21
124  (  left bracket (round or square)
125  )  right bracket (round or square)
126  *'  begin quote (single or double)  2.6
127  **'  end quote (single or double)  2.6
128  *-  dash  7.24
129  ,  comma  7.24
130  .  full stop  7.24
131  ...  ellipsis
132  :  colon  7.24
133  ;  semicolon  7.24
134  ?  question mark
</xmp> </appendix> </gdoc>

.sr docfile = &sysfnam. ;.sr docversion = 'Draft' .im teigmlp1 .* Document proper begins. Tagging Parts of Speech <author>C. M. Sperberg-McQueen <docnum>TEI &docfile. <date>&docdate. </titlep> <abstract> A set of feature structures sufficient to express the analysis underlying the tags of the tagged LOB corpus is provided, with a set of entity names for tagging text. The method of construction, similar to that described in AIW21, "On Lexical Ambiguity," is such as to allow simple extension to other languages and other grammatical features. The possible meanings of underspecified analyses are described and discussed. <note>The paper is not now complete; it offers a full list of the basic grammatical categories in the LOB scheme, and full specification of the features of verbs and some other categories. Full definitions of the LOB tags in feature-structure notation are given only for verb tags. The work on non-lexical parts of speech, especially, does not agree with normal linguistic analysis and should be brought into line. <p> Much work remains to be done in this area; I believe it should proceed as follows: <ol> <li>assume the method described here and in AI W 21 for representing complex feature structures with simple entity references built up out of other entity references for simple feature-value pairs <li>develop a set of feature structures (ignoring for the moment the SGML formalism) for <q>standard average European</q> as commonly annotated in large corpora: number, gender, case, tense, etc. <li>ensure that the feature structures so developed are upward compatible with commonly used schemes like LOB, Brown, etc. That is, LOB, Brown and other common schemes should fall out of the TEI scheme as simplifications or as particular sets of values. <li>if consensus can be achieved, propose a specific set of tags using this standard-average-European feature set, for use in corpus annotation. 
<li>using the method assumed in point 1, create the required entity definitions for features and feature structures. Optionally provide DTD modifications for enforcing the standard average feature set. </ol> </note> </abstract> <toc> </frontm> <body> <h1>Introduction This paper describes one approach to expressing part-of-speech tags using the feature-structure markup proposed by the A&I committee. It takes as a given the part-of-speech classification of the LOB corpus and seeks an equivalent expression of that classification in TEI-conformant SGML. <note>It is intended that this paper eventually provide a full specification of the LOB tags in feature-structure notation. A partial specification, however, is enough to make clear the direction being suggested and to allow for comment. The paper is thus being distributed in a half-complete state.</note> The grammatical features now outlined in this paper include those required for a full treatment of LOB scheme's verb tags, and the grammatical features required for LOB's treatment of nouns, pronouns, conjunctions, numerals, and determiner-pronouns. Features required for adverbs, determiners and articles, adjectives, qualifiers, and WH-pronouns must still be added; this is a matter of transcription from the <q>naive</q> SGML form in which they have already been worked out. Full feature-structure definitions are given only for the LOB verb tags; similar definitions for the other categories remain to be formulated, which should be straightforward. Further work should attempt to extend these definitions to other classifications. <h1>Description of the LOB Tags We begin with a list of the tags used in the LOB markup. This is taken from <cit>The Tagged LOB Corpus: User's Manual</cit>, by Stig Johansson in collaboration with Eric Atwell, Roger Garside, and Geoffrey Leech (Bergen: Norwegian Computing Centre for the Humanities, 1986). 
<note>List to be added.</note>
<h2>LOB Verb Tags
For the moment, let's work with just the verb tags:
<dl>
<dt>BE <dd>the verb TO BE
<dt>BED <dd>the verb TO BE, past tense
<dt>BEDZ <dd>the verb TO BE, past tense, 3d person singular
<dt>BEG <dd>the participle BEING
<dt>BEM <dd>am, 'm
<dt>BEN <dd>been
<dt>BER <dd>are, 're
<dt>BEZ <dd>is, 's
<dt>DO <dd>the verb DO as auxiliary
<dt>DOD <dd>did
<dt>DOZ <dd>does
<dt>HV <dd>have
<dt>HVD <dd>had (as past tense)
<dt>HVG <dd>having
<dt>HVN <dd>had (as past participle)
<dt>HVZ <dd>has
<dt>MD <dd>modal auxiliary verb
<dt>VB <dd>lexical verb
<dt>VBD <dd>lexical verb in past tense
<dt>VBG <dd>lexical verb, present participle
<dt>VBN <dd>lexical verb, past participle
<dt>VBZ <dd>lexical verb, third-person singular present tense
</dl>
<h2>Naive Transcription into SGML
The semantics of these 22 tags can be reduced to a few atomic notions; if we use conventional (traditional?) grammatical terms, we can arrange these tags in a (sparse) matrix along the following axes:
<dl>
<dt>lexical type <dd>lexical verb, BE, DO, HAVE, auxiliary, or modal
<dt>number <dd>singular, plural, or unmarked
<dt>person <dd>1st, 2nd, 3rd, unmarked, or not applicable (for participles)
<dt>tense <dd>present, past, future
</dl>
Because this is modern English, these axes are not truly orthogonal: <term>plural</term> occurs only for BER, and <term>3rd-person</term> and <term>singular</term> correlate strongly. An analysis having <emph>only</emph> modern English in mind might thus collapse these features for reasons of economy; I keep them separate because this traditional analysis is clear and commonly understood and because it can more readily be extended to historical forms of English and to other Indo-European languages. The full analysis will also be required in the pronoun system, in any case. The feature structures for the LOB tags for verbs can be built out of these primitive notions.
One straightforward approach would use the major category as a generic identifier and specify feature-value pairs using the attribute-value notation. The element and attribute declarations would look like this:
<xmp font=mono><![CDATA[
<!ELEMENT verb - - (#PCDATA)
          -- content model assumed here: the tagged word itself -- >
<!ATTLIST verb
    n    (sg, pl, ind)      ind
         -- number: singular, plural, indefinite --
    p    (1, 2, 3, 0)       0
         -- person: 1, 2, 3, or unmarked --
    pt   (part, nonpart)    nonpart
         -- participle: yes or no --
    t    (pres, past, fut)  pres
         -- tense: present, preterite, future --
    lex  (lex, be, do, have, aux, mod)  lex
         -- lexical, auxiliary (and which), or modal --
>
]]> </xmp>
So the various LOB verb tags could be specified thus:
<gl>
<gt>BE (be) <gd>verb lex=be
<gt>BED (were) <gd>verb lex=be t=past
<gt>BEDZ (was) <gd>verb lex=be t=past p=3 n=sg
<gt>BEG (being) <gd>verb lex=be t=pres pt=part
<gt>BEM (am, 'm) <gd>verb lex=be t=pres p=1 n=sg
<gt>BEN (been) <gd>verb lex=be t=past pt=part
<gt>BER (are, 're) <gd>verb lex=be t=pres n=pl
<gt>BEZ (is, 's) <gd>verb lex=be t=pres p=3 n=sg
<gt>DO (do) <gd>verb lex=do
<gt>DOD (did) <gd>verb lex=do t=past
<gt>DOZ (does) <gd>verb lex=do t=pres p=3 n=sg
<gt>HV (have) <gd>verb lex=have
<gt>HVD (had, 'd) <gd>verb lex=have t=past
<gt>HVG (having) <gd>verb lex=have t=pres pt=part
<gt>HVN (had (pp)) <gd>verb lex=have t=past pt=part
<gt>HVZ (has, 's) <gd>verb lex=have t=pres p=3 n=sg
<gt>MD (modal aux) <gd>verb lex=mod
<gt>VB (base verb) <gd>verb
<gt>VBD (past tense) <gd>verb t=past
<gt>VBG (present participle, gerund) <gd>verb pt=part t=pres
<gt>VBN (past participle) <gd>verb pt=part t=past
<gt>VBZ (3d pers sg) <gd>verb p=3 n=sg
</gl>
In the notes which follow, this direct translation of category names and values into generic identifiers, attribute names, and attribute values will be called the <q>naive</q> approach.
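<p>
As an illustration only: the paper does not say how a <tag>verb</tag> element would be attached to the text. Assuming, hypothetically, that it simply encloses the word being tagged, three of the tags listed above (BEZ, VBD, and DOZ) might appear in a running text as follows. The sample words are invented for the example.
<xmp font=mono><![CDATA[
<verb lex=be t=pres p=3 n=sg>is</verb>
<verb t=past>washed</verb>
<verb lex=do t=pres p=3 n=sg>does</verb>
]]> </xmp>
In this style an attribute that is left out takes whatever default the attribute list declares for it, so the second example is implicitly a lexical verb.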
<h2>Transcription into SGML Using Feature Structures
The naive SGML version of the LOB verb tags can be translated directly into the feature-structure notation devised by the A&I committee. The structure for BEZ, for example, might be expressed thus:
<xmp font=mono> <![ CDATA [
<f.struct> <f.struct.name>BEZ
<feature><f.name>category <f.struct>verb </feature>
<feature><f.name>lexical type <f.struct>copula </feature>
<feature><f.name>number <f.struct>singular </feature>
<feature><f.name>person <f.struct>3rd </feature>
<feature><f.name>tense <f.struct>present </feature>
</f.struct>
]]> </xmp>
while the feature structure for VBD might be somewhat simpler:
<xmp font=mono> <![ CDATA [
<f.struct> <f.struct.name>VBD
<feature><f.name>category <f.struct>verb </feature>
<feature><f.name>lexical type <f.struct>full verb</feature>
<feature><f.name>tense <f.struct>preterite</feature>
</f.struct>
]]> </xmp>
<h2>Interpretation of Missing Features
Here we encounter a minor conundrum. By leaving <term>number</term> and <term>person</term> unspecified, this rendition of VBD could conceivably be claiming any of the following:
<ol>
<li id=free >1. that <term>number</term> was either <term>singular</term> or <term>plural</term> or <term>unmarked</term>, those being the allowable values
<li id=default>2. that the word in question is not marked for number (so the feature defaults to the value <term>unmarked</term>)
<li id=unknown>3. that the feature has an unknown value (e.g. the analysis is not complete and may or may not be completed later)
<li id=inappli>4. that the feature does not apply here (i.e. the analysis is complete without it and cannot ever supply a value for this feature)
</ol>
It seems better to forbid the second interpretation (<liref refid=default>), and insist that failure to specify a value says <emph>nothing</emph> about the value: no defaulting mechanism is provided or allowed.
Similarly, the first interpretation (<liref refid=free>) can be forced by explicitly providing an OR of the various possible values over which the feature can range or by providing a value like <term>unmarked</term>, which may have a similar effect, as it does here. The final interpretation (<liref refid=inappli>) may be tempting, but it would be unenforceable by any SGML parser. Moreover it would be redundant to specify this interpretation, by silence or any other means, every time a feature was not mentioned. It would suffice to specify such information once in a grammar; I conclude that a grammar is where such claims belong, and that we can therefore eliminate the final interpretation. Inapplicable features will always be passed over in silence, but not all features passed over in silence need be interpreted as inapplicable.<fn>Alternatively, a value of <term>unknown</term> or <term>unspecified</term> could be required for all features. This would eliminate the ambiguity between feature values not yet analyzed and inapplicable features, but it also makes the provision of underspecified analyses much more cumbersome. I do not recommend it.</fn>
More properly, then, the tag VBD ought to be analyzed this way, specifying the value <term>unmarked</term> for both <term>person</term> and <term>number</term>.
<xmp font=mono> <![ CDATA [
<f.struct> <f.struct.name>VBD
<feature><f.name>category <f.struct>verb </feature>
<feature><f.name>lexical type <f.struct>full verb</feature>
<feature><f.name>number <f.struct>unmarked </feature>
<feature><f.name>person <f.struct>unmarked </feature>
<feature><f.name>tense <f.struct>preterite</feature>
</f.struct>
]]> </xmp>
Or we can be more explicit about the combinatorial possibilities, banning the value <term>unmarked</term> and restricting the values to 1st, 2nd, or 3rd person and singular or plural.
In this case we provide explicit alternations to show the range of possibilities: <xmp font=mono> <![ CDATA [ <f.struct> <f.struct.name>VBD <feature><f.name>category <f.struct>verb </feature> <feature><f.name>lexical type <f.struct>full verb</feature> <feature><f.name>number <f.s.OR><f.struct>singular</f.struct> <f.struct>plural </f.struct> </f.s.OR> </feature> <feature><f.name>person <f.s.OR><f.struct>1st</f.struct> <f.struct>2nd</f.struct> <f.struct>3rd</f.struct> </f.s.OR> </feature> <feature><f.name>tense <f.struct>preterite</feature> </f.struct> ]]> </xmp> <h1>Definitions of Primitive Grammatical Elements It seems clear that feature structures like those just described may conveniently be expressed by general entity references which occur within <tag>f.struct</tag> tags, or which themselves contain the <tag>f.struct</tag> tags. Thus in a running text one might have: <xmp font=mono> Wash <f.struct>&nn; </f.struct> sinks <f.struct>&vbz; </f.struct> . <f.struct>&punct.stop;</f.struct> </xmp> It also seems clear that such complex entity values are best built up from smaller primitive entity values, each describing one feature. This has the advantage of allowing all analyses which use a grammatical feature (e.g. <term>number</term>) to use the same definitions. In the remainder of this paper I will give the entity definitions required for the LOB tags and give some simple examples of their possible use. <h2>Major Categories The major categories (<q>parts of speech</q>) assumed by the LOB tagging can be treated as values of a feature called <term>category</term>. 
<xmp font=mono> <![ cdata [ <!ENTITY v "<feature><f.name> category </f.name> <f.struct> verb </f.struct> </feature>" > <!ENTITY adv "<feature><f.name> category </f.name> <f.struct> adverb </f.struct> </feature>" > <!ENTITY n "<feature><f.name> category </f.name> <f.struct> noun </f.struct> </feature>" > <!ENTITY pron "<feature><f.name> category </f.name> <f.struct> pronoun </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY conj "<feature><f.name> category </f.name> <f.struct> conjunction</f.struct> </feature>" > <!ENTITY num "<feature><f.name> category </f.name> <f.struct> numeral </f.struct> </feature>" >    <!ENTITY AB "<feature><f.name> category </f.name> <f.struct> determiner-pronoun</f.struct> </feature> <feature><f.name> position </f.name> <f.struct> pre-posed</f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [  <!ENTITY AP "<feature><f.name> category </f.name> <f.struct> determiner-pronoun</f.struct> </feature> <feature><f.name> position </f.name> <f.struct> post-posed</f.struct> </feature>" > <!ENTITY det "<feature><f.name> category </f.name> <f.struct> determiner </f.struct> </feature>" > <!ENTITY article "<feature><f.name> category </f.name> <f.struct> article </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY ex "<feature><f.name> category </f.name> <f.struct> existential THERE</f.struct> </feature>" > <!ENTITY prep "<feature><f.name> category </f.name> <f.struct> preposition </f.struct> </feature>" > <!ENTITY adj "<feature><f.name> category </f.name> <f.struct> adjective </f.struct> </feature>" > <!ENTITY qual "<feature><f.name> category </f.name> <f.struct> qualifier </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY to "<feature><f.name> category </f.name> <f.struct> infinitival TO </f.struct> </feature>" > <!ENTITY uh "<feature><f.name> category </f.name> <f.struct> interjection </f.struct> </feature>" > <!ENTITY wh "<feature><f.name> category </f.name> <f.struct> 
WH-determiner </f.struct> </feature>" > <!ENTITY not "<feature><f.name> category </f.name> <f.struct> NOT </f.struct> </feature>" > <!ENTITY letter "<feature><f.name> category </f.name> <f.struct> letter </f.struct> </feature>" > <!ENTITY punct "<feature><f.name> category </f.name> <f.struct> punctuation </f.struct> </feature>" > <!ENTITY formula "<feature><f.name> category </f.name> <f.struct> formula </f.struct> </feature>" > <!ENTITY foreign "<feature><f.name> category </f.name> <f.struct> foreign phrase </f.struct> </feature>" > ]]> </xmp> <h2>Lexical Subcategorizations Like most linguists, LOB distinguishes among subgroups of the major categories; these subcategorizations may be expressed in feature-structure notation this way: <note>This section is not complete for all categories.</note> <xmp font=mono> <![ cdata [          <!ENTITY vb.lex "&v; <feature><f.name>AUX</f.name><minus></feature> <feature><f.name>MOD</f.name><minus></feature>" >  <!ENTITY vb.mod "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><plus></feature>" >  <!ENTITY vb.be "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><minus></feature> <feature><f.name> Lexical item</f.name> <f.struct> be </f.struct> </feature>" > <!ENTITY vb.do "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><minus></feature> <feature><f.name> Lexical item</f.name> <f.struct> do </f.struct> </feature>" > <!ENTITY vb.have "&v; <feature><f.name>AUX</f.name><plus></feature> <feature><f.name>MOD</f.name><minus></feature> <feature><f.name> Lexical item</f.name> <f.struct> have </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [                      <!ENTITY n.com "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><minus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv 
</f.name><minus> </feature>" >  <!ENTITY n.cap "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >    <!ENTITY n.proper "&n; <feature><f.name>proper </f.name><plus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY np.loc "&n; <feature><f.name>proper </f.name><plus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>locative </f.name><plus> </feature> <feature><f.name>title </f.name><minus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY np.title "&n; <feature><f.name>proper </f.name><plus> </feature> <feature><f.name>capitalized </f.name><plus> </feature> <feature><f.name>locative </f.name><minus> </feature> <feature><f.name>title </f.name><plus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><minus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY n.cited "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><minus> </feature> <feature><f.name>unit noun </f.name><minus> </feature> <feature><f.name>cited word </f.name><plus> </feature> <feature><f.name>noun-as-adv </f.name><minus> </feature>" >  <!ENTITY n.unit "&n; <feature><f.name>proper </f.name><minus> </feature> <feature><f.name>capitalized </f.name><minus> </feature> <feature><f.name>unit noun </f.name><plus> </feature> <feature><f.name>cited word </f.name><minus> </feature> 
<feature><f.name>noun-as-adv </f.name><minus> </feature>" >
<!ENTITY n.adverb "&n;
  <feature><f.name>proper </f.name><minus> </feature>
  <feature><f.name>capitalized </f.name><minus> </feature>
  <feature><f.name>unit noun </f.name><minus> </feature>
  <feature><f.name>cited word </f.name><minus> </feature>
  <feature><f.name>noun-as-adv </f.name><plus> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY adv.nom "&adv; <feature><f.name> adv.type </f.name> <f.struct> denominative </f.struct> </feature>" >
<!ENTITY adv.prep "&adv; <feature><f.name> adv.type </f.name> <f.struct> prepositional </f.struct> </feature>" >
<!ENTITY adv.part "&adv; <feature><f.name> adv.type </f.name> <f.struct> particle </f.struct> </feature>" >
<!ENTITY adv.com "&adv; <feature><f.name> adv.type </f.name> <f.struct> unmarked </f.struct> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY pro.nom "&pron; <feature><f.name> pron.type </f.name> <f.struct> nominal </f.struct> </feature>" >
<!ENTITY pro.det "&pron; <feature><f.name> pron.type </f.name> <f.struct> determiner </f.struct> </feature>" >
<!ENTITY pro.pers "&pron; <feature><f.name> pron.type </f.name> <f.struct> personal </f.struct> </feature>" >
<!ENTITY pro.refl "&pron; <feature><f.name> pron.type </f.name> <f.struct> reflexive </f.struct> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY CC "&conj; <feature><f.name>subordinating</f.name><minus> </feature>" >
<!ENTITY CS "&conj; <feature><f.name>subordinating</f.name><plus> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY num.card "&num; <feature><f.name>ordinal</f.name><minus> </feature>" >
<!ENTITY num.ord "&num; <feature><f.name>ordinal</f.name><plus> </feature>" >
]]> </xmp>
<xmp font=mono> <![ CDATA [
<!ENTITY AB.qual "&AB; <feature><f.name> det.type </f.name> <f.struct> qualifier </f.struct> </feature>" >
<!ENTITY AB.quant "&AB; <feature><f.name> det.type </f.name> <f.struct> quantifier </f.struct> </feature>" >
]]>
</xmp> <xmp font=mono> <![ CDATA [    <!ENTITY jj.attr "<feature><f.name>attrib-only</f.name><plus> </feature>" > <!ENTITY jj.pred "<feature><f.name>attrib-only</f.name><minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [      <!ENTITY rel.yes "<feature><f.name>relative</f.name><plus> </feature>" > <!ENTITY rel.no "<feature><f.name>relative</f.name><minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [     <!ENTITY p.bang "<feature><f.name>character</f.name> <f.struct> ! </f.struct> </feature>" > <!ENTITY p.openbr "<feature><f.name>character</f.name> <f.struct> ( </f.struct> </feature>" > <!ENTITY p.closbr "<feature><f.name>character</f.name> <f.struct> ) </f.struct> </feature>" > <!ENTITY p.openq "<feature><f.name>character</f.name> <f.struct> &ldquo </f.struct> </feature>" > <!ENTITY p.closq "<feature><f.name>character</f.name> <f.struct> &rdquo </f.struct> </feature>" > <!ENTITY p.dash "<feature><f.name>character</f.name> <f.struct> &dash </f.struct> </feature>" > <!ENTITY p.comma "<feature><f.name>character</f.name> <f.struct> , </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [ <!ENTITY p.stop "<feature><f.name>character</f.name> <f.struct> . </f.struct> </feature>" > <!ENTITY p.ellips "<feature><f.name>character</f.name> <f.struct> &hellip </f.struct> </feature>" > <!ENTITY p.colon "<feature><f.name>character</f.name> <f.struct> : </f.struct> </feature>" > <!ENTITY p.semi "<feature><f.name>character</f.name> <f.struct> ; </f.struct> </feature>" > <!ENTITY p.query "<feature><f.name>character</f.name> <f.struct> ? </f.struct> </feature>" > ]]> </xmp> <h2>Number, Person, Case, Gender, and Other Grammatical Features Features of traditional grammar like number, gender, and case appear in many of the LOB tags. 
<note>This section is not complete for all categories.</note> <xmp font=mono> <![ cdata [    <!ENTITY sing "<feature><f.name> number </f.name> <f.struct> singular </f.struct> </feature>" > <!ENTITY plur "<feature><f.name> number </f.name> <f.struct> plural </f.struct> </feature>" > <!ENTITY num.no "<feature><f.name> number </f.name> <f.struct> unmarked </f.struct> </feature>" >     ]]> </xmp> <xmp font=mono> <![ CDATA [       <!ENTITY p1 "<feature><f.name> person </f.name> <f.struct> 1st </f.struct> </feature>" > <!ENTITY p2 "<feature><f.name> person </f.name> <f.struct> 2nd </f.struct> </feature>" > <!ENTITY p3 "<feature><f.name> person </f.name> <f.struct> 3rd </f.struct> </feature>" > <!ENTITY impers "<feature><f.name> person </f.name> <f.struct> none </f.struct> </feature>" >  <!ENTITY per.no "<feature><f.name> person </f.name> <f.struct> unmarked </f.struct> </feature>" > <!ENTITY partic "<feature><f.name> participle </f.name> <plus> </feature>" > <!ENTITY par.no "<feature><f.name> participle </f.name> <minus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [        <!ENTITY present "<feature><f.name> tense </f.name> <f.struct> present </f.struct> </feature>" > <!ENTITY preterite "<feature><f.name> tense </f.name> <f.struct> preterite </f.struct> </feature>" >    <!ENTITY future "<feature><f.name> tense </f.name> <f.struct> future </f.struct> </feature>" > <!ENTITY presperf "<feature><f.name> tense </f.name> <f.struct> present </f.struct> </feature> <feature><f.name> perfective </f.name><plus> </feature>" > <!ENTITY pluperf "<feature><f.name> tense </f.name> <f.struct> preterite</f.struct> </feature> <feature><f.name> perfective </f.name><plus> </feature>" > <!ENTITY futperf "<feature><f.name> tense </f.name> <f.struct> future </f.struct> </feature> <feature><f.name> perfective </f.name><plus> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [   <!ENTITY pos "<feature><f.name> degree </f.name> <f.struct> positive </f.struct> </feature>" > <!ENTITY comp 
"<feature><f.name> degree </f.name> <f.struct> comparative </f.struct> </feature>" > <!ENTITY sup "<feature><f.name> degree </f.name> <f.struct> superlative </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [       <!ENTITY nom "<feature><f.name> case </f.name> <f.struct> nominative </f.struct> </feature>" > <!ENTITY gen "<feature><f.name> case </f.name> <f.struct> genitive </f.struct> </feature>" >   <!ENTITY acc "<feature><f.name> case </f.name> <f.struct> accusative </f.struct> </feature>" > <!ENTITY case.no "<feature><f.name> case </f.name> <f.struct> unmarked </f.struct> </feature>" > ]]> </xmp> <xmp font=mono> <![ CDATA [    <!ENTITY masc "<feature><f.name> gender </f.name> <f.struct> masculine </f.struct> </feature>" > <!ENTITY fem "<feature><f.name> gender </f.name> <f.struct> feminine </f.struct> </feature>" > <!ENTITY neut "<feature><f.name> gender </f.name> <f.struct> neuter </f.struct> </feature>" >    <!ENTITY common "<feature><f.name> gender </f.name> <f.s.OR> <f.struct> masculine </f.struct> <f.struct> feminine </f.struct> </f.s.OR> </feature>" > <!ENTITY gend.no "<feature><f.name> gender </f.name> <f.struct> unmarked </f.struct> </feature>" > ]]> </xmp> <h2>Miscellaneous Features Some other features are not recognizable as traditional grammatical notions. 
<note>This section is not complete for all categories.</note>
<xmp font=mono>
<![ cdata [
<!ENTITY num.one "<feature><f.name>unitary value</f.name><plus> </feature>" >
<!ENTITY num.plur "<feature><f.name>unitary value</f.name><minus> </feature>" >
<!ENTITY num.pair "<feature><f.name> count </f.name> <f.struct> 2 </f.struct> </feature>" >
]]>
</xmp>
<xmp font=mono>
<![ CDATA [
<!ENTITY conj.dbl "<feature><f.name>double-conj </f.name><plus> </feature>" >
<!ENTITY c.dbl.no "<feature><f.name>double-conj </f.name><minus> </feature>" >
]]>
</xmp>
<xmp font=mono>
<![ CDATA [
<!ENTITY pre.yes "<feature><f.name>preposable </f.name><plus> </feature>" >
<!ENTITY pre.no "<feature><f.name>preposable </f.name><minus> </feature>" >
<!ENTITY post.yes "<feature><f.name>postposable </f.name><plus> </feature>" >
<!ENTITY post.no "<feature><f.name>postposable </f.name><minus> </feature>" >
]]>
</xmp>
<h1>Combinations of Primitives
We can define the verbal tags of the LOB scheme fully as follows:
<xmp font=mono>
<![ cdata [
<!ENTITY VB "&vb.lex; &num.no; &per.no; &par.no; &present;" >
<!ENTITY VBD "&vb.lex; &num.no; &per.no; &par.no; &preterite;" >
<!ENTITY VBG "&vb.lex; &num.no; &partic; &present;" >
<!ENTITY VBN "&vb.lex; &num.no; &partic; &preterite;" >
<!ENTITY VBZ "&vb.lex; &sing; &p3; &par.no; &present;" >
]]>
</xmp>
<xmp font=mono>
<![ CDATA [
<!ENTITY MD "&vb.mod; &num.no; &per.no; &par.no; &present;" >
<!ENTITY BE "&vb.be; &num.no; &per.no; &par.no; &present;" >
<!ENTITY BED "&vb.be; &num.no; &per.no; &par.no; &preterite;" >
<!ENTITY BEDZ "&vb.be; &sing; &p3; &par.no; &preterite;" >
<!ENTITY BEG "&vb.be; &num.no; &per.no; &partic; &present;" >
<!ENTITY BEM "&vb.be; &sing; &p1; &par.no; &present;" >
<!ENTITY BEN "&vb.be; &num.no; &per.no; &partic; &preterite;" >
<!ENTITY BER "&vb.be; &plur; &per.no; &par.no; &present;" >
<!ENTITY BEZ "&vb.be; &sing; &p3; &par.no; &present;" >
]]>
</xmp>
<xmp font=mono>
<![ CDATA [
<!ENTITY DO "&vb.do; &num.no; &per.no; &par.no; &present;" >
<!ENTITY DOD "&vb.do; &num.no; &per.no; &par.no; &preterite;" >
<!ENTITY DOZ "&vb.do; &sing; &p3; &par.no; &present;" >
]]>
</xmp>
Note that in VBD, DOD, and BED, the string <q><entity>num.no</entity> <entity>per.no</entity></q> says, correctly, that the verbs in question are not marked for person and number. What this implies differs from tag to tag, however. In the case of VBD and DOD, it means that <term>person</term> can be 1st, 2nd, or 3rd, and <term>number</term> can be singular or plural, in any combination; in the case of BED, it means that <term>person</term> and <term>number</term> can be any combination except 3rd-person singular. This is a simple fact of English grammar. Our choice of expression, modeled on the choices made in the LOB tag scheme, places the burden for handling this fact on the grammar and the application program; one could instead change the definitions of these entities to make the restriction explicit here. The ability to redefine the entities in this way allows us to specify, in our entity declarations, just what we mean by a given part-of-speech classification, and thus represents an advantage over the naive approach presented earlier. All the other LOB tags can be defined similarly; completing the definitions is for now left to the reader as an exercise.
</body>
<appendix>
<h1>Definition of all LOB tags in feature-structure notation
</appendix>
<appendix>
<h1>Summary of Features
<h2>Binary Features
<xmp>
+/-AUX   auxiliary      /* verb */
+/-MOD   modal          /* verb */
+/-PROP  proper         /* noun */
+/-CAP   capitalized    /* noun, adjective */
+/-SUB   subordinating  /* conjunction */
+/-ORD   ordinal        /* number */
+/-PERF  perfective     /* tensed verbs */
+/-PART  participle     /* verbs */
+/-LOC   locative term  /* proper nouns */
+/-TITL  title          /* proper nouns */
+/-UNIT  unit-term      /* noun */
+/-CITE  cited-word     /* noun */
+/-ATTR  attributive    /* adjectives */
+/-PRED  predicative    /* adjectives -- redundant? */
+/-DBLC  double-conj    /* determiner/pronouns, and determiners */
+/-PRE   preposable     /* ? may precede its head */
+/-POST  postposable    /* ? may follow its head */
+/-PTCL  particle       /* ? adverb ?==? inverse of +/-takes-complement? */
+/-REL   relative       /* pronouns -- alternative to pron.type */
+/-PERS  personal       /* pronouns -- alternative to pron.type */
+/-REFL  reflexive      /* pronouns -- alternative to pron.type */
+/-WH    WH-word        /* pronouns, adverbs */

/* cross-category usages: */
+/-pseudo-adverb  /* i.e. can appear in adverbial positions -- noun */
+/-pseudo-noun    /* i.e. can appear in noun positions -- adverb */
+/-also-prep      /* i.e. is also a preposition -- adverb */
+/-DET            /* i.e. is a determiner -- pronoun */
+/-exnoun         /* formed from a noun -- pronoun (anybody ...) */
</xmp>
<h2>N-way Features
<xmp>
/* Base categories */
CAT   category  = verb | adverb | noun | pronoun | conjunction | number
                | determiner | article | THERE | preposition | adjective
                | qualifier | TO | interjection | [WH] | NOT | letter
                | punctuation | formula | foreign

/* Sub-categorization */
LEX   lexitem   = (string)   /* verbs */
CHAR  character = (string)   /* punctuation */
CNT   count     = (integer)  /* numbers -- for pairs, ranges */
[ATYP adv.type  = nominal | preposition | particle | unmarked ]
[prefer binary +/-pseudo-noun +/-also.prep +/-ptcl ]
[PTYP pron.type = nominal | determiner | personal | reflexive ]
[prefer binary +/-exnoun +/-det +/-pers +/-refl +/-wh +/-rel ]
DTYP  det.type  = qualifier | quantifier

/* Categories of Traditional Grammar */
NUM   number    = singular | plural | unmarked
PER   person    = 1st | 2nd | 3rd | none | unmarked
TEN   tense     = present | preterite | future
DEG   degree    = positive | comparative | superlative
CASE  case      = nominative | genitive | accusative | unmarked
GEN   gender    = masculine | feminine | neuter | unmarked [ | common ]
</xmp>
<h2>Cross-Category Groupings in LOB
<xmp>
noun-but-can-serve-as-adverb
adverb-but-can-serve-as-noun (as prepositional object)
adverb-or-preposition (RI)
adverb-or-preposition-without-object (RP)
</xmp>
<h1>Definitions of LOB, Brown, and Lancaster Tags
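<p>As a reading aid for the tag definitions in this appendix, the following sketch expands the entity <entity>VBZ</entity> by hand, using only the declarations given in the body of the paper. The reference to <entity>vb.lex</entity> is left unexpanded because its declaration is not reproduced here; that it corresponds to <q>CAT=VERB -AUX -MOD</q> in the summary notation is an assumption. The line breaks are added for legibility and are not part of the declared replacement text.
<xmp font=mono>
<![ CDATA [
<!-- as declared in the body of the paper: -->
<!ENTITY VBZ "&vb.lex; &sing; &p3; &par.no; &present;" >

<!-- replacement text once every reference except &vb.lex; is expanded: -->
&vb.lex;
<feature><f.name> number </f.name> <f.struct> singular </f.struct> </feature>
<feature><f.name> person </f.name> <f.struct> 3rd </f.struct> </feature>
<feature><f.name> participle </f.name> <minus> </feature>
<feature><f.name> tense </f.name> <f.struct> present </f.struct> </feature>
]]>
</xmp>
In the summary notation used for the LOB tags, the same four features read <q>NUM=SING PER=3 -PART TEN=PRES</q>, which agrees with the feature line given for VB Z.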
<h2>LOB tags
<xmp>
Summary of tags (with spaces inserted for clarity):

  1 A B L      pre-qualifier (quite, rather, such) 7.12
               CAT=(DET|PRON), DTYP=QUALIFIER, +PRE
  2 A B N      pre-quantifier (all, half) 7.12
               CAT=(DET|PRON), DTYP=QUANTIFIER, +PRE
  3 A B X      pre-quantifier/pronoun/double conjunction (both)
               CAT=(DET|PRON), DTYP=QUANTIFIER, +PRE, +DBLC
  4 A P        post-determiner/pronoun.
               CAT=(DET|PRON), +POST
  5 A P $      other's
               CAT=(DET|PRON), +POST, CASE=GEN
  6 A P S      others
               CAT=(DET|PRON), +POST, NUM=PLURAL
  7 A P S $    others'
               CAT=(DET|PRON), +POST, CASE=GEN, NUM=PLURAL
  8 A T        article, singular (a, an, every) 7.12
               CAT=ARTICLE, NUM=SINGULAR
  9 A T I      article, sing or plural (the, no) 7.12
               CAT=ARTICLE, NUM=UNMARKED
 10 BE         be
               CAT=VERB +AUX -MOD -PART TEN=PRES LEX=BE
 11 BE D       were
               CAT=VERB +AUX -MOD -PART TEN=PRET LEX=BE
 12 BE D Z     was
               CAT=VERB +AUX -MOD -PART TEN=PRET NUM=SING PER=3 LEX=BE
 13 BE G       being
               CAT=VERB +AUX -MOD +PART TEN=PRES LEX=BE
 14 BE M       am, 'm
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=1 LEX=BE
 15 BE N       been
               CAT=VERB +AUX -MOD +PART TEN=PRET LEX=BE
 16 BE R       are, 're
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD LEX=BE
 17 BE Z       is, 's
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=3 LEX=BE
 18 CC         coordinating conjunction (and, and/or, but, nor, only, or, yet)
               CAT=CONJ -SUB
 19 CD         2, 3, two, three, hundred, thousand, dozen, zero - 7.17
 20 CD $       cardinal + genitive
 21 CD -CD     hyphenated pair of cardinals 7.17
 22 CD 1       one, 1 7.17
 23 CD 1 $     one's
 24 CD 1 S     ones
 25 CD S       cardinal + plural (tens, millions, dozens, etc.)
 26 CS         subordinating conjunction (after, although, etc.) 7.14-15
               CAT=CONJ +SUB
 27 DO         do 7.5
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD LEX=DO
 28 DO D       did
               CAT=VERB +AUX -MOD -PART TEN=PRET NUM=UNMKD PER=UNMKD LEX=DO
 29 DO Z       does
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=3 LEX=DO
 30 DT         singular determiner (another, each, that, this) 7.12
 31 DT $       singular determiner + genitive (another's)
 32 DT I       singular or plural determiner (any, enough, some)
 33 DT S       plural determiner (those, these)
 34 DT X       determiner/double conjunction (either, neither) 7.12
 35 EX         existential 'there'
 36 HV         have 7.5
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD LEX=HAVE
 37 HV D       had, 'd
               CAT=VERB +AUX -MOD -PART TEN=PRET NUM=UNMKD PER=UNMKD LEX=HAVE
 38 HV G       having
               CAT=VERB +AUX -MOD +PART TEN=PRES LEX=HAVE
 39 HV N       had (past participle)
               CAT=VERB +AUX -MOD +PART TEN=PRET LEX=HAVE
 40 HV Z       has, 's
               CAT=VERB +AUX -MOD -PART TEN=PRES NUM=SING PER=3 LEX=HAVE
 41 IN         preposition (about, above, etc.) 7.13, 7.15
 42 JJ         adjective 7.3-4, 7.8-9, 7.11
 43 JJ B       attributive-only adjective (chief, main, entire, etc.)
 44 JJ R       comparative adjective 7.9, 7.11
 45 JJ T       superlative adjective 7.9, 7.11
 46 J NP       adj with word-initial capital (English, German, etc.)
 47 MD         modal auxiliary
               CAT=VERB +AUX +MOD TEN=PRES NUM=UNMKD PER=UNMKD
 48 N C        cited word 7.23
               NC CAT=NOUN N=SING CASE=NOM -PROP -CAP -UNIT +CITE
 49 N N        noun, sg, common 7.4, 7.6, 7.7
               NN CAT=NOUN N=SING CASE=NOM -PROP -CAP -UNIT -CITE
 50 N N $      noun, sg, common, + genitive 7.6
               NN$ CAT=NOUN N=SING CASE=GEN -PROP -CAP -UNIT -CITE
 51 N N P      noun, sg, common, with word-initial capital 7.7
               NNP CAT=NOUN N=SING CASE=NOM -PROP +CAP -UNIT -CITE
 52 N N P $    noun, sg, common, with word-init cap and genitive
               NNP$ CAT=NOUN N=SING CASE=GEN -PROP +CAP -UNIT -CITE
 53 N N P S    noun, pl, common, with word-init cap
               NNPS CAT=NOUN N=PLUR CASE=NOM -PROP +CAP -UNIT -CITE
 54 N N P S $  noun, pl, common, with word-init cap and genitive
               NNPS$ CAT=NOUN N=PLUR CASE=GEN -PROP +CAP -UNIT -CITE
 55 N N S      noun, pl, common 7.6, 7.7
               NNS CAT=NOUN N=PLUR CASE=NOM -PROP -CAP -UNIT -CITE
 56 N N S $    noun, pl, common, + genitive
               NNS$ CAT=NOUN N=PLUR CASE=GEN -PROP -CAP -UNIT -CITE
 57 N N U      noun, abbrev unit of measurement (hr., lb., etc.)
               NNU CAT=NOUN N=SING CASE=NOM -PROP -CAP +UNIT -CITE
 58 N N U S    noun, abbrev unit of measurement, pl (gns, yds, etc.)
               NNUS CAT=NOUN N=PLUR CASE=NOM -PROP -CAP +UNIT -CITE
 59 N P        noun, sg, proper 7.7
               NP CAT=NOUN N=SING CASE=NOM +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 60 N P $      noun, sg, proper, + genitive
               NP$ CAT=NOUN N=SING CASE=GEN +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 61 N P L      noun, sg, locative with word-initial cap (Abbey,
               NPL CAT=NOUN N=SING CASE=NOM +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 62 N P L $    ditto + genitive
               NPL$ CAT=NOUN N=SING CASE=GEN +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 63 N P L S    noun, pl, locative with word-initial cap
               NPLS CAT=NOUN N=PLUR CASE=NOM +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 64 N P L S $  ditto + genitive
               NPLS$ CAT=NOUN N=PLUR CASE=GEN +PROP +CAP -UNIT -CITE +LOC -TITL -PS.ADV
 65 N P S      noun, pl, proper 7.7
               NPS CAT=NOUN N=PLUR CASE=NOM +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 66 N P S $    noun, pl, proper, + genitive
               NPS$ CAT=NOUN N=PLUR CASE=GEN +PROP +CAP -UNIT -CITE -LOC -TITL -PS.ADV
 67 N P T      noun, sg, titular with word-initial cap
               NPT CAT=NOUN N=SING CASE=NOM +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 68 N P T $    noun, sg, titular, cap, + genitive
               NPT$ CAT=NOUN N=SING CASE=GEN +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 69 N P T S    noun, pl, titular, cap
               NPTS CAT=NOUN N=PLUR CASE=NOM +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 70 N P T S $  noun, pl, titular, cap, + genitive
               NPTS$ CAT=NOUN N=PLUR CASE=GEN +PROP +CAP -UNIT -CITE -LOC +TITL -PS.ADV
 71 N R        noun, sg, adverbial (Jan, Feb, east, today,
               NR CAT=NOUN N=SING CASE=NOM -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 72 N R $      noun, sg, adverbial + genitive
               NR$ CAT=NOUN N=SING CASE=GEN -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 73 N R S      noun, pl, adverbial
               NRS CAT=NOUN N=PLUR CASE=NOM -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 74 N R S $    noun, pl, adverbial + genitive
               NRS$ CAT=NOUN N=PLUR CASE=GEN -PROP -CAP -UNIT -CITE -LOC -TITL +PS.ADV
 75 OD         ordinal (1st, 2nd, first, ...) 7.17
 76 OD $       ordinal + genitive
 77 P N        nominal pron (anybody, anyone, anything; everybody,
 78 P N $      nominal pron + genitive
 79 P P $      poss determiner (my, your, etc.) 7.12
 80 P P $$     poss pron (mine, yours, etc.)
 81 P P 1 A    pers pron, 1st pers sing nom (I)
 82 P P 1 A S  pers pron, 1st pers plur nom (we)
 83 P P 1 O    pers pron, 1st pers sing acc (me)
 84 P P 1 O S  pers pron, 1st pers plur acc (us)
 85 P P 2      pers pron, 2nd pers (you, thou, thee, ye)
 86 P P 3      pers pron, 3rd pers sing nom + acc (it)
 87 P P 3 A    pers pron, 3rd pers sing nom (he, she)
 88 P P 3 A S  pers pron, 3rd pers plur nom (they)
 89 P P 3 O    pers pron, 3rd pers sing acc (him, her)
 90 P P 3 O S  pers pron, 3rd pers plur acc (them, 'em)
 91 P P L      refl pron, sg
 92 P P L S    refl pron, pl; reciprocal pron
 93 QL         qualifier (as, awfully, less, more, so, too, very, ...)
 94 QL P       post-qualifier (enough, indeed)
 95 R B        adverb 7.10-7.11
               CAT=ADV DEG=POS CASE=UNMKD -PSEUDO.NOUN -ALSO.PREP -WH
 96 R B $      adverb + genitive (else's)
               CAT=ADV CASE=GEN -PSEUDO.NOUN -ALSO.PREP -WH
 97 R B R      comparative adverb
               CAT=ADV DEG=COMP CASE=UNMKD -PSEUDO.NOUN -ALSO.PREP -WH
 98 R B T      superlative adverb
               CAT=ADV DEG=SUP CASE=UNMKD -PSEUDO.NOUN -ALSO.PREP -WH
 99 R I        adverb (homograph of preposition: below, near, ...)
               CAT=ADV DEG=POS CASE=UNMKD -PSEUDO.NOUN +ALSO.PREP -PTCL -WH
100 R N        nominal adverb (here, now, there, then) 7.10
               CAT=ADV DEG=POS CASE=UNMKD +PSEUDO.NOUN -ALSO.PREP -WH
101 R P        adverbial particle (back, down, off, ...) 7.10, 7.13
               CAT=ADV DEG=POS CASE=UNMKD -PSEUDO.NOUN +ALSO.PREP +PTCL -WH
102 TO         infinitival 'to'
               CAT=TO
103 UH         interjection
               CAT=INTERJECTION
104 VB         base form of verb (uninflected present tense, imper)
               CAT=VERB -AUX -MOD -PART TEN=PRES NUM=UNMKD PER=UNMKD
105 VB D       past tense of verb 7.3
               CAT=VERB -AUX -MOD -PART TEN=PRET
106 VB G       present participle, gerund 7.4
               CAT=VERB -AUX -MOD +PART TEN=PRES
107 VB N       past participle 7.3
               CAT=VERB -AUX -MOD +PART TEN=PRET
108 VB Z       3rd person sg
               CAT=VERB -AUX -MOD -PART TEN=PRES NUM=SING PER=3
109 W DT       WH-determiner (what, whatever, interrogative
110 W DT R     WH-determiner, relative (which) 7.16
111 W P        WH-pron, interrogative, nom+acc (who, whoever)
112 W P $      WH-pron, interrogative, gen (whose)
113 W P $ R    WH-pron, relative, gen (whose)
114 W P A      WH-pron, nom (whosoever)
115 W P O      WH-pron, interrogative, acc (whom, whomsoever)
116 W P O R    WH-pron, relative, acc (whom)
117 W P R      WH-pron, relative, nom+acc (that, relative who) 7.14,
118 W RB       WH-adverb (how, when, ...) 7.16
119 XNOT       'not'
120 ZZ         letter
121 !          exclamation mark
122 &FO        formula 7.22
123 &FW        foreign word 7.21
124 (          left bracket (round or square)
125 )          right bracket (round or square)
126 *'         begin quote (single or double) 2.6
127 **'        end quote (single or double) 2.6
128 *-         dash 7.24
129 ,          comma 7.24
130 .          full stop 7.24
131 ...        ellipsis
132 :          colon 7.24
133 ;          semicolon 7.24
134 ?          question mark
</xmp>
</appendix>
</gdoc>