Received: from NORUNIT.BITNET by UICVM (Mailer R2.03B) with BSMTP id 6542; Thu, 03 May 90 03:44:24 CDT
X-Delivery-Notice: SMTP MAIL FROM does not correspond to sender.
Received: from NORUNIT (SMTP) by NORUNIT.BITNET (Mailer R2.04) with BSMTP id 7483; Thu, 03 May 90 10:41:41 EDT
Received: from runix.runit.sintef.no by NORUNIT.sintef.no (IBM VM SMTP R1.2.2MX) with TCP; Thu, 03 May 90 10:41:39 EDT
Received: by runix.runit.sintef.no (norunix.EARN) (1.2/8.6) id AA07845; Thu, 3 May 90 10:39:39 +0200
Date: 3 May 90 10:37 +0100
From: Stig Johansson
To: "Michael Sperberg-McQueen 312 996-2477 -2981"
In-Reply-To: <9004300100.AA20758@dxmint.cern.ch>
Message-Id: <730*h_johansson@use.uio.uninett>
Subject: part-of-speech markup

Michael,

You have really put a lot of work into your study of the LOB tagging scheme. Notice, first of all, that the scheme was set up (modelled on the scheme used for the tagged Brown Corpus) for a project of automatic word class tagging. Some categories may look strange (among them various groups of determiners/pronouns). A good number of distinctions which one would like to have are not made, e.g. between determiner and pronoun for words like 'all' and 'much', 'this' and 'that', between auxiliary and other uses of 'be', 'have' and 'do' (did you notice, by the way, that 'doing' and 'done' are VBG and VBN, respectively - these forms are never used as auxiliaries), between different uses of 'it', different uses of base forms and '-ing' forms of verbs, etc.

The underlying thinking was to set up classes which could be automatically identified with a reasonable degree of success and/or would be useful in disambiguation of other words. And, obviously, classes which are important in an analysis of present-day English. No claims were made as to the more general applicability of the scheme. But I believe it may be worth some consideration anyway. And enough distinctions were made to cause us a lot of head-aches!

I will send you separately a paper discussing problems of overlap between classes: ambiguity (more than one tag is applicable; two interpretations), merger (more than one tag may be applicable; it does not matter which you choose), and gradience (fuzzy borderlines between classes; we do not know where to draw the line). There is certainly a need for something like 'unmarked', in the case of merger and inapplicability of distinctions. I would advise you to look at the major classes, as you have done, and not to fully analyse pronouns and determiners (where a lot could be done differently).

What is the point of using feature analysis as against units? Because you want to make use of the features in some way - how? One good thing which the feature analysis brings out is the relationship between tags. This is otherwise only shown (partly and less directly) by the choice of tag name. It may also be a good way of handling merger and gradience. A further argument for feature analysis is simplicity - but this surely does not apply in this case?

I am worried in general by the large amount of markup required to express the simplest things. Surely there will be tremendous problems in verifying the text, in spite of supporting software? I shudder to think of what one would have to do to check a long text. Checking the tagged LOB Corpus was difficult enough.

I know that you are busy with many things, so I will not write very much more now. Just a few points of detail.

Should you use the feature 'indefinite' in talking about verbs? It is usually associated with determiners and noun phrases.

BEDZ is both 1st and 3rd pers (I was, he was). BER is both sing and plur.

The verb 'be' can be an auxiliary, a copula verb (she is old), and an intransitive verb (she is in London).

What about types of WH-words?

'Like most linguists, LOB distinguishes...' (It is corporeal, but is it a linguist?)

'LOB distinguishes lexical verbs, auxiliaries, and modals.'
- Actually no distinction is made between uses of 'be', 'do', and 'have' as auxiliaries and lexical verbs. So they are not +AUX in the LOB scheme. (But it would have been nice to have made the distinction. Your feature analysis seems OK for auxiliary uses.)

'No compound tenses' - what about 'future'? When does 'future' apply?

'Tense' for participles is a bit hard to swallow. Are participles values of 'pers'? What about 'non-finite'?

'Van', 'de' and the like are not really independent proper nouns, but they can be part of naming expressions. In the tagged LOB Corpus they were tagged as 'foreign'. But we tagged NP in cases like: d'Alba, d'Oliviera. So we probably need non-cap proper nouns anyway.

Hope some of this may be of use to you. Too bad there is so little time...

Oslo 3 May 1990
Stig J
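
A few of the points of detail above (BEDZ covering both 1st and 3rd person, BER covering both singular and plural, and 'be' not being marked for auxiliary use) can be pictured as feature bundles in which a feature may carry several values or be left unmarked. The following is only a rough sketch under stated assumptions: the feature names (`pers`, `num`, `aux`), the value labels, and the helper functions are hypothetical illustrations, not the LOB tag definitions and not the notation of the feature analysis under discussion.

```python
# Hypothetical sketch: tags as feature bundles with multi-valued and
# unmarked features. Feature names and values are illustrative assumptions.

TAGS = {
    # 'was': both 1st and 3rd person (I was, he was)
    "BEDZ": {"pers": {1, 3}},
    # 'are': both singular and plural
    "BER": {"num": {"sing", "plur"}},
    # 'doing' / 'done': never used as auxiliaries
    "VBG": {"aux": {False}},
    "VBN": {"aux": {False}},
}


def is_marked(tag, feature):
    """True if the tag's bundle says anything at all about this feature."""
    return feature in TAGS[tag]


def allows(tag, feature, value):
    """True if the tag is compatible with feature=value.

    A feature that is absent from the bundle is treated as unmarked:
    the scheme draws no distinction there, so every value is compatible.
    """
    values = TAGS[tag].get(feature)
    return values is None or value in values


if __name__ == "__main__":
    print(allows("BEDZ", "pers", 1))    # 1st person is listed for BEDZ
    print(allows("BEDZ", "pers", 2))    # 2nd person is not listed
    print(is_marked("BEDZ", "aux"))     # auxiliary use is left unmarked
    print(allows("VBG", "aux", True))   # 'doing' is explicitly non-auxiliary
```

Leaving a feature out of the bundle, rather than assigning it a value, is one way to express 'unmarked' for merger and for distinctions that do not apply; whether that is the right treatment for gradience is a separate question that this sketch does not address.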