Date:         Wed, 20 Dec 89 10:51:57 EST
Sender:       Text Encoding Initiative - Text Analysis and Interpretation
              Committee <TEI-ANA@UICVM>
From:         Beatrice Santorini <beatrice@UNAGI.CIS.UPENN.EDU>
Subject:      for your comments and suggestions
 
Text Encoding Initiative
Committee on Text Analysis and Interpretation
Subcommittee on Syntax
 
December 19, 1989
 
Beatrice Santorini
Department of Computer Science
University of Pennsylvania
Philadelphia, PA  19104
beatrice@unagi.cis.upenn.edu
 
 
The following is an expanded version of the very brief summary that I
sent out on December 14 of the issues the Syntax Subcommittee of the
Text Encoding Initiative has been considering.  It draws considerably on
Terry Langendoen's initial document, which was circulated in the summer.
I first address some issues that are central to syntactic markup, and
then briefly discuss some more general concerns that intersect with
those of other subcommittees.  This document focuses on substantive
issues; questions of SGML implementation are as yet outside its scope.
Again, please send comments and suggestions for improvements to
beatrice@unagi.cis.upenn.edu.
 
 
 
SPECIFIC ISSUES
 
o  Delimiting the sentences and phrases of a text
 
The most elementary task of the Syntax Subcommittee is to provide a
standard way of delimiting the basic units of syntactic investigation:
sentences and their constituent phrases.  Sentences are normally
delimited in modern texts by initial capital letters and final
punctuation.  Since both capitalization and punctuation have a variety
of functions in texts in most languages, we need to be able to
explicitly tag their use as sentence delimiters.  As far as the
encoding of constituent phrases is concerned, it is not clear to me
that we can or should give a list of possible constituent phrases.  And
although we might recommend that proposed phrase structures conform to
certain formal conventions (e.g., no crossing branches), it is not
clear that we want to prohibit people from expressing unorthodox phrase
structures.
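
As a rough sketch of what checking the `no crossing branches'
convention might involve, the Python fragment below represents each
phrase as a labelled span over token positions and reports pairs of
spans that overlap without nesting.  The span representation, the
bracketing assigned to the sentence and the function names are all
assumptions made for illustration; none of this is a proposed encoding.

```python
from itertools import combinations

# Assumed representation: (label, start, end) over token positions,
# with `end` exclusive.  The bracketing is one illustrative analysis.
TOKENS = ["I", "gave", "him", "a", "book", "."]
PHRASES = [
    ("S", 0, 6),
    ("NP", 0, 1),
    ("VP", 1, 5),
    ("NP", 2, 3),
    ("NP", 3, 5),
]


def crosses(a, b):
    """True if two spans overlap without one containing the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def crossing_pairs(phrases):
    """Return every pair of phrases whose spans cross."""
    return [(a, b) for a, b in combinations(phrases, 2) if crosses(a, b)]
```

A checker of this kind could report crossing spans as a warning rather
than an error, which would leave room for unorthodox phrase structures.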
 
 
o  Representing functional dependencies
 
In addition to delimiting linear sequences, we must be able to represent
the relationships among them and their constituents.  We will need to be
able to identify:
 
   (a) heads of phrases,
   (b) the argument structure of those heads,
   (c) the syntactic realization of the arguments, and
   (d) any adjuncts associated with the heads.
 
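For instance, we must be able to distinguish between the two sentences
in (1), in which the same surface sequence of a verb followed by two
noun phrases corresponds to different argument structures.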
 
(1) a.  I gave him a book.
    b.  I consider him a fool.
 
How exactly all of this information is to be represented in SGML
remains to be worked out.
 
Again, it is not clear to me what conventions regarding phrase
structure we are in a position to recommend.  For instance, the idea
that a phrase needs a lexical head, while uncontroversial in many
cases, is debatable in the case, say, of clauses.
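
To make items (a)-(d) concrete, here is a minimal Python sketch of a
record that stores a head, its arguments with their syntactic
realization, and its adjuncts.  The class names, the role labels and
the particular analyses given to the two sentences in (1) are
assumptions for illustration only; in particular, leaving the category
of `him a fool' as `?' is just one possible choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    role: str          # illustrative role label, e.g. "subject"
    realization: str   # category realizing the argument; "?" = unspecified
    text: str


@dataclass
class HeadedPhrase:
    head: str
    arguments: List[Argument] = field(default_factory=list)
    adjuncts: List[str] = field(default_factory=list)


# (1a): three nominal arguments (an assumed analysis).
gave = HeadedPhrase("gave", [
    Argument("subject", "NP", "I"),
    Argument("indirect object", "NP", "him"),
    Argument("direct object", "NP", "a book"),
])

# (1b): a subject plus a single complement (an assumed analysis).
consider = HeadedPhrase("consider", [
    Argument("subject", "NP", "I"),
    Argument("complement", "?", "him a fool"),
])


def frame(phrase):
    """Summarize a head's arguments as (role, realization) pairs."""
    return [(a.role, a.realization) for a in phrase.arguments]
```

Comparing `frame(gave)` with `frame(consider)` then distinguishes the
two sentences even though their surface strings look alike.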
 
 
o  Representing discontinuous dependencies
 
Representing the structure of discontinuous dependencies is a relatively
spectacular case of the more general issue of representing functional
dependencies and will presumably require extensive use of the
cross-referencing capabilities of SGML.  Some cases of phrasal
discontinuity in English are indicated by bracketing in the examples in
(2).
 
(2) a.  You left [a word] out [I wanted included].
    b.  We'll have to wait [an] hour [and a half].
    c.  [Can] she [sing]?
    d.  I wonder [who] they think they [are].
    e.  [Advantage] was [taken of] us.
 
This task will be particularly important in German and Dutch, which
obey the verb-second constraint and in which the sentence-initial
constituent is often an element of the predicate.
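
The following Python sketch suggests how cross-referencing might tie
together the parts of a discontinuous phrase such as the one in (2a):
a single identifier points to several token spans.  The identifier
`np1', the dictionary layout and the function names are hypothetical
and stand in for whatever SGML mechanism is eventually chosen.

```python
TOKENS = ["You", "left", "a", "word", "out", "I", "wanted", "included", "."]

# Assumed layout: one identifier, one label, several (start, end) spans
# over token positions, with `end` exclusive.
CONSTITUENTS = {
    "np1": {"label": "NP", "parts": [(2, 4), (5, 8)]},
}


def text_of(cid):
    """Return the text of each part of the constituent, in order."""
    parts = sorted(CONSTITUENTS[cid]["parts"])
    return [" ".join(TOKENS[start:end]) for start, end in parts]


def is_discontinuous(cid):
    """True if some material intervenes between two parts."""
    parts = sorted(CONSTITUENTS[cid]["parts"])
    return any(prev_end < start
               for (_, prev_end), (start, _) in zip(parts, parts[1:]))
```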
 
 
o  Identifying null constituents
 
The example in (2a) may be analyzed as missing a relative pronoun between
the words `out' and `I', and as having an empty phrase between the words
`wanted' and `included', corresponding to the missing relative pronoun.
The existence of null constituents is not uncontroversial in
linguistics, but there are enough practitioners who countenance and
make use of them that recommendations for encoding them should be made.
The simplest and most attractive recommendation would be to prohibit
null constituents from having any special syntactic features or
properties.
However, this recommendation would rule out most versions of
Government-Binding theory since the null element PRO, unlike other
nominal elements, is obligatorily ungoverned.  We might encourage
people to explicitly state their assumptions concerning null
constituents if they make use of them.
 
The possibility of representing null constituents is particularly
important in those European languages that do not require overt
subjects.  In some languages, like German, missing subjects are
restricted to impersonal contexts, as in (3), but in others, like
Italian, Spanish or Greek, missing subjects are possible in the same
contexts as subjects denoting human agents, as shown in (4).
 
(3)     Sonntags   wurde getanzt.  (German)
        Sunday-gen was   danced
        `There was dancing on Sunday.'
 
(4) a.  (Maria) vuole mangiare una mela.  (Italian)
                wants eat      an  apple
 
    b.  (Maria) quiere comer una manzana.  (Spanish)
                wants  eat   an  apple
 
    c.  (Maria) theli na   fai  ena milo.  (Greek)
                wants that eats an  apple
        `Maria/(s)he wants to eat an apple.'
 
Even in the absence of theoretical justification for null constituents,
it is useful to have the ability to encode them; for example, to enable
someone to undertake a study of the relative frequency of explicit and
missing relative pronouns in a particular text, or to investigate the
conditions governing the use of the overt pronoun in the examples in
(5).
 
(5) a.  Select six medium-size apples.  Wash (them) carefully, peel
        (them), core (them), slice (them) thinly and sprinkle (them)
        with sugar and cinnamon.
 
    b.  Do (*you) sit down.
        Don't (you) sit down.
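
A frequency study of the kind just mentioned might be sketched in
Python as follows.  The record type, its field names and, above all,
the sample data are assumptions: the data are an invented rendering of
(5a) in which `them' is kept after `Wash' and `sprinkle' and omitted
elsewhere, not observations from a real text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ObjectSlot:
    verb: str
    pronoun: str
    overt: bool   # False marks an encoded null constituent


# Invented sample data, for illustration only.
SLOTS = [
    ObjectSlot("Wash", "them", True),
    ObjectSlot("peel", "them", False),
    ObjectSlot("core", "them", False),
    ObjectSlot("slice", "them", False),
    ObjectSlot("sprinkle", "them", True),
]


def overt_null_counts(slots):
    """Return (number of overt pronouns, number of null pronouns)."""
    counts = Counter("overt" if s.overt else "null" for s in slots)
    return counts["overt"], counts["null"]
```

The point is only that such a count is possible once the null positions
are encoded at all, whatever theoretical status one assigns to them.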
 
 
o  Representing anaphoric relations between pronouns and their
antecedents and between reduced phrases and their antecedents
 
Anaphoric relations, such as those that hold between nominal phrases
and pronominal forms, and between full phrases of whatever type and
reduced phrases, are of great interest for both theoretical and
practical purposes.  The proper treatment of anaphoric relations has
been at center stage in theoretical linguistic discussions for over a
decade.  During this same period, much effort has gone into developing
natural-language processing systems that represent these relations,
because texts cannot be understood properly without them.
Indeed, when a passage is misunderstood, much of the time it is because
the anaphoric relations the speaker or writer intended for it have not
been correctly recovered.  Typical anaphoric relations hold among the
bracketed phrases in the examples in (6).
 
(6) a.  [The women] convinced [one another] that [they] were happy.
    b.  Joey [rode on the carousel], but no-one else [did].
 
The example in (6a) is in fact ambiguous, depending on whether or not
`they' is anaphorically connected to the phrase `The women'.  The
markup recommendations that the A&I Committee develops must be able to
represent the anaphoric alternatives that are available.
 
An important question is which representation of the text the
possibilities of anaphora should be keyed to.  For most purposes in
syntax, it is
probably sufficient to be able to associate each phrase in a sentence
with a list of possible and impossible indices, but we also need to
allow people to express anaphora with respect to the elements of a full
discourse model.
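
As an illustration of associating phrases with possible and impossible
indices, the Python sketch below enumerates the readings of (6a).  The
tuple layout, the function name and the particular index values are
assumptions; index 2 is meant to stand for some referent outside the
sentence, which a fuller treatment would take from a discourse model.

```python
from itertools import product

# Assumed layout: (phrase, possible indices, impossible indices).
PHRASES = [
    ("The women", {1}, set()),
    ("one another", {1}, set()),
    ("they", {1, 2}, set()),
]


def readings(phrases):
    """Enumerate every assignment of an allowed index to each phrase."""
    names = [name for name, _, _ in phrases]
    choices = [sorted(possible - impossible)
               for _, possible, impossible in phrases]
    return [dict(zip(names, combo)) for combo in product(*choices)]
```

Here `readings(PHRASES)` yields two assignments, one in which `they'
shares an index with `The women' and one in which it does not.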
 
 
o  Identifying other grammatical dependencies among phrases, such as
government (often realized as overt case), concord (agreement) and cross
reference
 
Grammatical concord among words and phrases is a familiar phenomenon of
natural language; it holds in English between certain verb forms and
their subjects, and between certain demonstrative adjectives and the
nouns they modify, as illustrated by the bracketed phrases in the
examples in (7).
 
(7) a.  [The baby] [is] sleeping.
    b.  [Babies] [are] sleeping.
    c.  I like [that] [shoe].
    d.  I like [those] [shoes].
 
The full encoding of concord patterns requires more than simply
registering in examples like those in (7) the grammatical number of the
forms that agree.  It also requires marking the fact that the form of
the verb depends on the form of its subject, and that the form of the
demonstrative adjective depends on the form of the noun it modifies.
 
Furthermore, while we might in general want to enforce subject-verb
agreement, we must allow non-agreement and be able to mark it as
exceptional in cases like (8).
 
(8)     A bunch of papers were/*was lying all over the room.
        Twelve pages is/*are short for a thesis.
        There's/There are problems with every solution.
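
The sketch below, in Python, records a concord relation as a dependency
between a controlling form and a dependent form, with a flag for
non-agreement that the annotator regards as exceptional.  The class,
its field names and the number values assigned to the examples are
assumptions; treating `A bunch of papers' as formally singular is one
analysis among several, and the last example in the usage note is an
invented ill-formed string.

```python
from dataclasses import dataclass


@dataclass
class Concord:
    controller: str           # form that determines the agreement
    controller_number: str    # "sg" or "pl"
    target: str               # form whose shape depends on the controller
    target_number: str
    exceptional: bool = False  # annotator's mark for accepted non-agreement


def status(c):
    """Classify a concord relation."""
    if c.controller_number == c.target_number:
        return "agrees"
    return "exceptional" if c.exceptional else "mismatch"
```

For example, `status(Concord("The baby", "sg", "is", "sg"))` gives
`"agrees"`, `Concord("A bunch of papers", "sg", "were", "pl", True)` is
classified as `"exceptional"`, and an unflagged
`Concord("Babies", "pl", "is", "sg")` is reported as a `"mismatch"`.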
 
 
o  Representing ambiguous syntactic structures
 
The problem of ambiguity arises at every level of the A&I Committee's
work.  However, it is perhaps the most severe in connection with
syntactic representation.  Consider the example given in (9).
 
(9)     The duck is ready to eat.
 
Conceptually, the simplest way to represent the two very distinct
syntactic interpretations of this example is to provide two taggings
for it, one for each interpretation.  However, it may not be possible
to do so effectively without repeating the sentence.  Since repeating
the sentence means altering the text, we would then need to be able
to tag the repetition as not occurring in the text.
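
One conceivable way of supplying two taggings without repeating the
sentence is to keep the tokens once and let each analysis refer to them
by position.  The Python sketch below illustrates this for (9); the
analysis names, the role labels `eater' and `eaten' and the data layout
are all hypothetical, and whether SGML can express such an arrangement
conveniently is a separate question.

```python
# The text is stored exactly once.
TOKENS = ("The", "duck", "is", "ready", "to", "eat", ".")

# Assumed layout: each analysis maps role labels to a (start, end) token
# span, or to None when the role is not expressed in the sentence.
ANALYSES = {
    "duck-eats": {"eater": (0, 2), "eaten": None},
    "duck-is-eaten": {"eater": None, "eaten": (0, 2)},
}


def role_text(analysis, role):
    """Return the text filling a role under one analysis, or None."""
    span = ANALYSES[analysis][role]
    if span is None:
        return None
    start, end = span
    return " ".join(TOKENS[start:end])
```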
 
Two particularly important subcases of ambiguity phenomena concern
ambiguities that arise (a) from the scope of conjunctions and (b) from
different attachment possibilities of prepositional phrase modifiers.
A further important case of ambiguity arises in connection with the
common causative construction illustrated for German in (10).
 
(10)     Er laesst den     Hans anrufen.
         he lets   the-acc Hans call-up
 
The construction in (10) is ambiguous between an active reading of the
infinitival complement (`He is having Hans call up') and a more usual
passive reading, in which the agent is not expressed (`He is having
someone call up Hans').
 
 
o  Mismatch and reanalysis phenomena
 
Mismatches between syntax and other levels of description are fairly
common and must be easy to handle.  In addition to mismatches between
syntax and prosody of the sort illustrated by `This is the house that
Jack built', there are mismatches between syntax and morphology, as in
the case of contractions and forms like `gonna' and `wanna'.  In many
European languages other than English, causative constructions involve
syntax-morphology mismatches in which a causative verb and its
complement behave for some purposes as if they were two words and for
others as if they were one.  From a notational point of view, these
phenomena are all variants of the ambiguity problem, since in every
case a single string is to be associated with more than one
representation.
 
 
GENERAL ISSUES
 
o  Underspecification
 
The encoding standards should not require providing a full,
disambiguated structure, since it is often desirable to represent
underspecified analyses.  For instance, in phrases like `ten years
ago', we might not want to be forced to specify the part of speech of
`ago'.  While it is perhaps best treated as a postposition, one might
neither want to introduce the category of postposition in English nor
(given its position in the phrase) to classify `ago' as a preposition.
Nevertheless, one might want to be able to treat the entire phrase as a
prepositional phrase headed by `ago'.  Cf. also `to dance the whole
night through'.  A similar example concerns the collocation `near NP',
where one might want to remain agnostic about the part of speech of
`near'.  On the one hand, it seems to be an adjective or adverb, since
it can be modified by the degree adverb `very' and allows the
comparative and superlative forms `nearer/nearest NP', but on the other
hand, it is acting as the head of what seems to be a prepositional
phrase, as evidenced by the fact that it can occur in the locative
complement of `put'.
 
The above examples involve two related but distinct types of
underspecification: (a) words that are not assigned parts of speech and
(b) phrases of a specified type with categorially unspecified heads.
We also need to be able to represent the converse of (b)---categorially
unspecified phrases with categorially specified heads---in order to
accommodate the so-called "small clause" analysis of examples as in
(11).
 
(11)     That makes me sad.
         I want him off the boat.
 
Under this analysis, the oblique noun phrase and the predicate phrase
following it form a constituent whose head is the head of the predicate
phrase but whose own categorial identity may remain unspecified, as shown
in (12).
 
(12)     That makes [? me [Adj sad]].
         I want [? him [[Prep off] the boat]].
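
Both kinds of underspecification can be pictured with a node type whose
category is optional.  In the Python sketch below, which is only an
illustration, a missing category on a phrase is printed as `?' as in
(12), and a word without a category is printed bare; the class, its
fields and the printing convention are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    text: str = ""
    category: Optional[str] = None      # None = deliberately unspecified
    children: List["Node"] = field(default_factory=list)
    head: Optional["Node"] = None       # which child, if any, is the head


def bracket(node):
    """Render a node as a labelled bracketing."""
    if not node.children:
        if node.category is None:
            return node.text
        return "[%s %s]" % (node.category, node.text)
    label = node.category or "?"
    inner = " ".join(bracket(child) for child in node.children)
    return "[%s %s]" % (label, inner)


# Unspecified phrase with a specified head, as in the first line of (12).
sad = Node("sad", "Adj")
small_clause = Node(children=[Node("me"), sad], head=sad)

# Specified phrase with a categorially unspecified head.
ago = Node("ago")
ago_phrase = Node(category="PP",
                  children=[Node("ten"), Node("years"), ago], head=ago)
```

With these definitions `bracket(small_clause)` gives `[? me [Adj sad]]`
and `bracket(ago_phrase)` gives `[PP ten years ago]`.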
 
 
o  Tokenization issues
 
While proper tokenization falls mainly within the domain of morphology,
tokenization issues involving the interpretation of hyphens and
blank space arise regularly in syntax as well.
 
We need to allow the tagging of compound proper nouns like `New York'
as units without having to assign separate parts of speech to their
constituents.  Adverbial phrases and collocations frequently require
the same solution.  For instance, in (13), it is clear that `all but'
is functioning as an adverbial phrase (cf. `They virtually promised'),
but it is not intuitively obvious what part of speech to assign to
`but'---or `all', for that matter.
 
(13)     They all but promised.
 
Other instances in which part-of-speech assignment by word seems out of
place concern what might be called `compound conjunctions', like `much
as', `so that' and `so long as' in English.  Such compound conjunctions
are also common in other European languages; cf. the various
conjunctions ending in `que' in French.
 
Issues concerning tokenization overlap with ones concerning
underspecification.  For instance, not only might we want to omit
part-of-speech information about `New' and `York' in `New York', but we
might not want to be forced to specify other information normally
required for phrases, such as what the head of the phrase is.
 
A related issue concerns the use of hyphens as markers of constituent
grouping.  Consider the phrase `the New York-New Jersey express'.
Here, `New York' and `New Jersey' should be grouped together, and the
hyphen indicates higher-level constituent grouping.  Conversely, in an
example like `the New-Orleans Times-Democrat', the hyphen indicates
constituent grouping at the lower level.  There appear to be two
alternative ways of handling this.  We might treat hyphens that connect
constituents at different levels of syntactic hierarchy as distinct
items, rather like homonymous lexical items.  That is, we might tag the
hyphen in `the New York-New Jersey express' as a phrase level hyphen,
but the ones in `the New-Orleans Times-Democrat' as sub-phrase level
hyphens.  A more flexible and realistic approach is probably to find a
way of indicating the domain of a hyphen, much as we need to be able to
indicate the scope of conjunctions.  The default domain would be a word
on each side, but other domains (say, up to a clause) would be
possible.  The domain of a hyphen can be smaller than a word and need
not be adjacent to it.  This is especially true of German, where
conjoined phrases of the form `ac Conj bc' are regularly expressed in
the `factored' form `a- Conj bc', even if `a' and `b' are bound
morphemes.  Similar examples are found in English, though not as
frequently; cf. `mid- to late summer', where `mid-' is a bound
morpheme.
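
The notion of a hyphen's domain might be sketched in Python as follows,
with the domain given as a number of tokens on each side and a default
of one.  The function and its parameters are hypothetical, and the
sketch covers only adjacent, word-sized domains; the non-adjacent and
sub-word cases mentioned above would need a richer representation.

```python
TOKENS = ["the", "New", "York", "-", "New", "Jersey", "express"]


def hyphen_domain(tokens, i, left=1, right=1):
    """Return the two token groups joined by the hyphen at position i.

    `left` and `right` give the size of the domain on each side, in
    tokens; the default is one word on each side.
    """
    if tokens[i] != "-":
        raise ValueError("token %d is not a hyphen" % i)
    return tokens[i - left:i], tokens[i + 1:i + 1 + right]
```

With the default domain, `hyphen_domain(TOKENS, 3)` wrongly pairs `York'
with `New'; stating the domain explicitly, as in
`hyphen_domain(TOKENS, 3, left=2, right=2)`, groups `New York' with
`New Jersey'.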
 
Finally, we need to be able to deal easily with inconsistent usage
regarding hyphenation.  On the one hand, we often need to avoid
altering the original text, but on the other hand, it is desirable to
be able to map inconsistencies onto a unique normalized form (this is
an issue that also arises in connection with typographical errors).
 
o  Usability of marked-up documents
 
Given the number of features that even an analysis with only modest aims
ends up encoding, a marked-up text file almost immediately becomes too
complex for humans to use directly.  This complexity is compounded by
the bracketing overlap that results from the representation of
syntactic ambiguity and of mismatches between different levels of
analysis.  In addition to providing standards of encoding, it is
therefore crucial that we provide standards for accessing and modifying
the various streams of a text file without going directly to the master
copy.  This is especially important when we consider that marked-up
text in syntax more often than not illustrates a particular analysis or
a point in an argument.  Hence, we must easily be able to extract only
those elements that are germane to the issue at hand.  We also need to
consider providing means for checking the consistency of a marked-up
text file.  Since these questions are of general concern, perhaps an
additional subcommittee needs to be set up to deal with them.
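
To suggest what access by stream and consistency checking might look
like in the simplest case, the Python sketch below filters annotations
by the level of analysis they belong to and flags spans that fall
outside the text.  The layer names, labels, spans and function names
are invented for illustration and do not anticipate any actual
standard.

```python
# Assumed layout: each annotation names its layer, a label, and a
# (start, end) span over token positions, with `end` exclusive.
ANNOTATIONS = [
    {"layer": "syntax", "label": "NP", "span": (0, 2)},
    {"layer": "prosody", "label": "phrase", "span": (0, 4)},
    {"layer": "syntax", "label": "VP", "span": (2, 6)},
]


def extract(annotations, layer):
    """Return only the annotations belonging to one layer."""
    return [a for a in annotations if a["layer"] == layer]


def out_of_range(annotations, n_tokens):
    """A very simple consistency check: list spans that are empty,
    reversed, or extend beyond the end of the text."""
    return [a for a in annotations
            if not 0 <= a["span"][0] < a["span"][1] <= n_tokens]
```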