TEI TRW3
 
The encoding of sample text corpora (draft, version 1, part 1)
 
Stig Johansson 6 August 1989
 
1. Definition of text type
 
A corpus is a collection of texts put together in a principled way. We
may distinguish between:
 
    a. sample text corpora
    b. full text corpora
    c. monitor text corpora
 
While the first two are closed and limited in size, the third is
open-ended and expanding. A monitor corpus is, in fact, a text archive
or "language bank". Sample text corpora contain text extracts, full text
corpora consist of whole texts. Typical examples of the three types are
the Brown Corpus (a), the Leuven Drama Corpus (b), and the Birmingham
Collection of English Text (c).
 
A corpus may consist of raw text or it may be annotated ("tagged"). For
example, each word may be provided with a part-of-speech label. The
tagged Brown Corpus is an example of an annotated text corpus.
 
Note: I am here restricting the notion of a corpus to collections of
running text. The definition excludes word lists, concordances,
collections of citations, and the like.
 
2. Requirements for sample text corpora
 
Sample text corpora have usually been put together for the purposes of
linguistic (including stylistic) analysis. The extracts are typically
numerous and often of a specified length (for example, 500 texts of
2,000 words as in the Brown Corpus), and they typically represent a
range of text types. The main challenge for the corpus compiler is to
achieve uniformity of text encoding while staying as close as possible to
the original text.
 
The minimal requirements for a sample text corpus are that, besides the
text, it contains or is supplemented by:
 
 - text identification
 - a coding key
 - a reference system
 
In addition, the corpus should be supplemented by information on the
principles governing the selection of text extracts.
 
2.1 Text identification
 
The minimal information needed is: author, title, time and place of
publication, location of extract, copyright note. In addition, it may be
useful to include an indication of genre category and a note on
idiosyncratic features of the text (such as typographical errors,
idiosyncratic spelling, or the presence of dialect or foreign-language
quotations). If changes have been made with respect to the original text
(e.g. normalization of spelling or omission of parts of the text), this
must be recorded.
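
As an illustration only, the items listed above could be held in a small
record which also reports any minimal item left empty. The class name,
field names, and sample values below are assumptions made for this
sketch; they are not taken from any existing corpus format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextIdentification:
    # Minimal information listed in section 2.1.
    author: str
    title: str
    publication_date: str
    publication_place: str
    extract_location: str          # e.g. the page range of the extract
    copyright_note: str
    # Optional additions mentioned in the same section.
    genre: Optional[str] = None
    idiosyncrasies: List[str] = field(default_factory=list)
    editorial_changes: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of minimal fields that have been left empty."""
        required = ["author", "title", "publication_date",
                    "publication_place", "extract_location", "copyright_note"]
        return [name for name in required if not getattr(self, name).strip()]


# Hypothetical record with one minimal item left empty.
record = TextIdentification(
    author="(hypothetical author)",
    title="(hypothetical title)",
    publication_date="1961",
    publication_place="",
    extract_location="pp. 10-15",
    copyright_note="(hypothetical note)",
    editorial_changes=["spelling normalized"],
)
print(record.missing_fields())  # ['publication_place']
```

Recording editorial changes in the same record as the bibliographic
details keeps the declaration of changes together with the text it
applies to.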
 
2.2 Text encoding
 
The goal of the text encoding must be to produce a faithful
representation of the text with as little loss of information as
possible. Codes for special characters or symbols must be specified in a
coding key. If a character or symbol is left uncoded, there must be a
mechanism to show this.
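
The following is a minimal sketch of a coding key and of a mechanism for
showing uncoded characters. The codes themselves (such as "*e/") and the
marker "*?" are invented for the illustration and do not reproduce the
conventions of any particular corpus.

```python
from typing import Dict, List, Tuple

# Hypothetical coding key: each special character maps to an ASCII code.
CODING_KEY: Dict[str, str] = {
    "\u00e9": "*e/",   # e with acute accent
    "\u00fc": "*u:",   # u with diaeresis
    "\u00a3": "*PS",   # pound sign
}
UNCODED = "*?"  # hypothetical marker for a character missing from the key


def encode(text: str, key: Dict[str, str] = CODING_KEY) -> Tuple[str, List[str]]:
    """Replace non-ASCII characters by their codes and report uncoded ones."""
    out: List[str] = []
    uncoded: List[str] = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)            # plain ASCII is copied unchanged
        elif ch in key:
            out.append(key[ch])       # character covered by the coding key
        else:
            out.append(UNCODED)       # show explicitly that a code is missing
            uncoded.append(ch)
    return "".join(out), uncoded


encoded, uncoded = encode("caf\u00e9 f\u00fcr \u00a35 \u00f1")
print(encoded)        # caf*e/ f*u:r *PS5 *?
print(len(uncoded))   # 1
```

Returning the list of uncoded characters alongside the encoded text
makes it easy to extend the key before a text is released.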
 
Typographical errors in the original text may be faithfully reproduced
or they may be corrected. In the former case there must be a mechanism
for specifying the form in the original text, in the latter it may be
useful to insert a comment tag (corresponding to SIC in printed texts).
 
In general, there is a need for a mechanism for comment by the
electronic editor. In the LOB Corpus comment tags were used for a) text
identification, b) material left out in the machine-readable version
(figures, tables, maps, etc), c) missing end-quote marks, d) SIC
marking.
 
A lot can be done with a text provided only with codes for special
characters or symbols and some editorial comment. It may, however, be
useful to add some enhancements. For example, we may wish to insert tags
to identify particular sections of the text, either to study these or to
provide reference points for studies of other elements or features of
the text.
 
In a sample corpus consisting of a variety of text types it is not
possible to define a common overall structure, except in a very general
manner: text identification + text. There is, however, a range of common
features which it may be useful to mark. The discussion is limited to
written texts, for which there are established organisational
conventions.
 
2.2.1 Paragraph
 
Written texts of all kinds contain paragraphs. These are semantic units
marked off formally by indentation and/or by being separated from the
surrounding text by blank lines. Problems of identification arise
because indentation and blank lines are also used for other purposes,
e.g. with headings and list items. The identification is usually clear,
however, for the human editor (but see the discussion of lists below).
 
Paragraphs form part of larger units, such as chapters and sections.
They consist of a variable number of sentences. Note that there are
one-sentence paragraphs, e.g. in popular newspapers or in fictional
dialogue; these should be marked both as paragraphs and as sentences.
More exceptionally, there are paragraphs which form part of sentences,
e.g. in some types of legal texts and in the final sentence of Joyce's
ULYSSES.
 
The paragraph is a very useful unit, both for linguistic study and for
information retrieval. Hence it is important to mark it explicitly and
to resolve the ambiguity inherent in existing typographical conventions.
 
2.2.2 Sentence
 
Many users of machine-readable texts have measured characteristics of
sentences, such as their length or complexity. For many types of
studies, lexical as well as grammatical or stylistic, the sentence is
also a very useful point of reference. But it is notoriously hard to
define. Its essence is independence of the surroundings: structural,
phonological, and orthographic. In this context the sentence is best
defined orthographically. It opens with a capital letter and ends with a
major punctuation mark: a full stop, a question mark, or an exclamation
mark.
 
Sentences can be simple (consisting of a single clause), compound
(consisting of coordinated clauses), or complex (including subordinate
clauses). In addition, there are sentence fragments, consisting of
phrases or single words. The distinction between sentence and clause is
crucial. Though they often overlap, there is a characteristic
difference. A sentence is a unit of text. A clause is defined in terms
of its internal structure. Nevertheless, the proto-typical sentence has
the structure of a clause.
 
There are some problems in applying our definition, as we shall see
below. End punctuation is often missing with included sentences (e.g. in
quotations). On the other hand, a full stop does not necessarily
indicate the end of a sentence (as with enumerators and abbreviations).
Capitalization may be an uncertain guide (e.g. with capitalized list
items). In other words, our orthographic criterion must be used with
discretion. In cases of doubt it is preferable to use a sentence
division which maximizes the number of proto-typical sentences (with
clause structure) and reduces the number of sentence fragments.
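
The orthographic criterion, and the discretion needed in applying it,
can be sketched as follows. The abbreviation list and the treatment of
enumerators are assumptions made for the example; a real corpus would
need a much fuller list and human checking of doubtful cases.

```python
import re
from typing import List

# Hypothetical list of abbreviations whose full stop does not end a sentence.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text: str) -> List[str]:
    """Split after '.', '?' or '!' when whitespace and a capital letter follow."""
    sentences: List[str] = []
    start = 0
    for match in re.finditer(r"[.?!](?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1]
        # A full stop after an abbreviation, or after a one-letter or numeric
        # enumerator ("a.", "1."), is not treated as a sentence boundary.
        if last_token in ABBREVIATIONS or re.fullmatch(r"(\d+|[A-Za-z])\.", last_token):
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)    # whatever remains after the last boundary
    return sentences


sample = "Mr. Smith arrived. He was late! Was anyone surprised? No."
print(split_sentences(sample))
# ['Mr. Smith arrived.', 'He was late!', 'Was anyone surprised?', 'No.']
```

The sketch errs in predictable ways: a sentence ending in the pronoun
"I", for instance, would be mistaken for an enumerator, and "No." is
returned as a sentence fragment. Such cases are where the editor's
discretion is required.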
 
A particular problem with our orthographic definition is that it is not
applicable to spoken material. It may very well be that the analysis
into sentences is inapplicable in this case. As in the London-Lund
Corpus, it may be preferable to have a primary division into tone units.
Or it may be possible to apply the notion of a "macro-syntagm", as in
some Scandinavian analyses of spoken material (where the definition is
based on internal cohesion and external independence and is thus
reminiscent of the definition of the sentence in written material).
 
2.2.3 List
 
A list consists of a set of items arranged typographically in a parallel
manner and containing information presented by the author as equal on
some level (not necessarily in weight or importance, however). It is a
presentation device which makes it easier to focus on, comment on, or
refer to each item.
 
A list item is optionally introduced by an enumerator (a, b, c, ...; 1,
2, 3, ...; etc). The rest may be a single word or phrase, a sentence, a
sequence of sentences, or even a block containing paragraph division.
(Enumerators may also introduce chapters, sections, paragraphs -- for
example, in legal statutes and textbooks --, example sentences, etc.)
 
The coding of lists vs sentences and paragraphs is not straightforward.
Superficially a list item may look like a paragraph (because of
indentation and/or separation by blank lines). But a list is typically
part of a sentence, e.g. "This is due to the following factors: ...". On
the other hand, list items may very well contain sentences, and even
paragraphs. It is thus not possible to set up a simple hierarchical
structure. (This sort of relationship is, however, no surprise to the
linguist who is used to the notion of rank shift: for example, clauses
are made up of phrases, which in their turn may contain clauses.)
 
In addition to the complexities just mentioned, lists may contain lists,
as in:
 
"2.4.3 Optional features
All the optional features are needed to process FORMEX records, namely:
 - the document mark-up level features:
    - SHORTTAG
    - OMITTAG
    - SHORTREF
    - DATATAG
    - RANK
 - the document type features:
    - CONCUR
    - SUBDOC"
 
This is a section consisting of an enumerator, a heading, and one
paragraph.  The paragraph contains a single sentence with an embedded
list containing two list items each consisting of a list.
 
To handle these complexities, it is necessary to introduce different
levels of elements: sentence 1 (part of paragraph), sentence 2 (within
the domain of sentence 1), paragraph 1, paragraph 2, list 1, list 2,
etc.  Level 2 sentences are also needed in connection with quotation
(see below) and with example sentences in linguistic or philosophical
discourse.
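
One possible way of deriving the levels is to represent the text as a
tree and to count, for each element, how many elements of the same kind
enclose it. The sketch below applies this to the FORMEX example; the
class and function names are hypothetical, and the numbering scheme is
only one reading of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Element:
    kind: str                      # "paragraph", "sentence", "list" or "item"
    text: str = ""
    children: List["Element"] = field(default_factory=list)


def levels(node: Element,
           depth_by_kind: Optional[Dict[str, int]] = None
           ) -> Iterator[Tuple[str, int, str]]:
    """Yield (kind, level, text); the level counts nesting within one kind."""
    depth_by_kind = dict(depth_by_kind or {})   # copy, so siblings stay independent
    depth_by_kind[node.kind] = depth_by_kind.get(node.kind, 0) + 1
    yield node.kind, depth_by_kind[node.kind], node.text
    for child in node.children:
        yield from levels(child, depth_by_kind)


def item(text: str, *children: Element) -> Element:
    return Element("item", text, list(children))


# The FORMEX paragraph: one sentence with a list of two items,
# each of which consists of a further list.
features = Element("paragraph", children=[
    Element("sentence",
            "All the optional features are needed to process FORMEX records, namely:",
            [Element("list", children=[
                item("the document mark-up level features:",
                     Element("list", children=[
                         item(name) for name in
                         ["SHORTTAG", "OMITTAG", "SHORTREF", "DATATAG", "RANK"]])),
                item("the document type features:",
                     Element("list", children=[
                         item(name) for name in ["CONCUR", "SUBDOC"]])),
            ])]),
])

for kind, level, text in levels(features):
    if kind == "list":
        print(kind, level)
# list 1
# list 2
# list 2
```

Because the level is computed from the tree rather than stored, the same
representation serves for level 2 sentences inside quotations or
examples.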
 
While list items are typically part of sentences or can be divided into
sentences or paragraphs, this is not always the case. Bibliographies and
recipes in cookbooks are, for example, best treated as just containing
list items.
 
2.2.4 Heading
 
A heading is typically highlighted typographically and separated from
the surrounding text. It may be preceded by an enumerator (as above).
Headings may occur on different levels (chapter, section, sub-section,
etc).
 
Headings are characteristically (though not necessarily) different in
structure from ordinary sentences. Like enumerators they have a
predominantly organizational function; they organize the text rather
than refer to something extra-textual. There are good reasons for not
including them in the sentence category (or, at least, for treating them
as very special kinds of sentences). Similar status would apply to the
names of characters preceding speeches in a drama or to numbers of
stanzas in a poem.
 
By and large, headings are easy to identify. Special cases are summaries
of events or "tantalizers" at the beginning of a newspaper article or a
short story or quotations at the beginning of a chapter of a book. Like
headings, these may be highlighted typographically and separated from
the rest of the text. But they are ordinary sentences or paragraphs and
should be treated as special introductory elements in particular types
of texts.
 
2.2.5 Quotation
 
Quotations are identified by quotation marks or, if long, they may just
be separated from the rest of the text and possibly printed in smaller
type or with closer spacing. The relationship of quotations to sentences
and paragraphs is similar to that of lists. A quotation is typically
part of a sentence, e.g.: She said, "... ". The quotation may contain
sentences, paragraphs, and even headings (preceded by an enumerator).
As with lists, quotations may occur within quotations (normally
indicated by different types of quotation marks). The solution is again
to recognise different levels of elements.
 
As quotations are indicated typographically in a variety of ways, as
begin-quote and end-quote marks may look the same, and as the single
end-quote mark is identical to the apostrophe, it is preferable to use
explicit begin-quote and end-quote tags.
 
A simple example will illustrate the use of quotations in different
parts of superordinate sentences:
 
    Initial:   "I disagree completely," she said.
    Final:     She said, "I disagree completely."
    Medial:    She said, "I disagree completely," and left the room.
 
The three types were treated differently in the LOB Corpus, where the
notion of "included sentence" was introduced at a late stage to avoid an
unfortunate sentence division with quotation in medial position. But
they are best dealt with in the same way, with a level 1 sentence and an
included level 2 sentence.
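
The uniform treatment can be sketched with explicit tags. The tag names
<s1>, <s2>, <bq>, and <eq> are invented for this example; the sketch
handles only a single pair of double quotation marks and leaves the
comma or full stop inside the included sentence, as in the printed text.

```python
import re


def tag_quotation(sentence: str) -> str:
    """Mark the whole as a level 1 sentence and the quoted part as level 2."""
    # Replace each pair of double quotes by explicit begin-quote and
    # end-quote tags enclosing a level 2 sentence.
    tagged = re.sub(r'"([^"]*)"', r"<bq><s2>\1</s2><eq>", sentence)
    return f"<s1>{tagged}</s1>"


examples = {
    "initial": '"I disagree completely," she said.',
    "final": 'She said, "I disagree completely."',
    "medial": 'She said, "I disagree completely," and left the room.',
}
for position, sentence in examples.items():
    print(position, tag_quotation(sentence))
# initial <s1><bq><s2>I disagree completely,</s2><eq> she said.</s1>
# final <s1>She said, <bq><s2>I disagree completely.</s2><eq></s1>
# medial <s1>She said, <bq><s2>I disagree completely,</s2><eq> and left the room.</s1>
```

In all three positions the result is one level 1 sentence containing one
level 2 sentence, so no special case is needed for medial quotation.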
 
It may be desirable to distinguish between quotation and direct speech
in fiction. In dramas where everything except stage directions is direct
speech by default, there is of course no need for special marking.
 
Note, finally, that quotation marks are used for other purposes than
quotation, in particular as distancing devices (like "so-called") and
when used to mention linguistic forms (the word "to") or gloss their
meanings (in the sense "X"). In these cases, it is of course irrelevant
to use begin-quote and end-quote tags (but there may still be a case for
distinguishing single end-quote marks from the apostrophe).
 
2.2.6 Foreign-language material
 
It is desirable to tag foreign elements in predominantly monolingual
texts and to tag sections in each language in bi- or multilingual texts.
Problems of tagging arise in monolingual texts with single words and
phrases derived from foreign sources. There is a cline from elements
which are completely integrated into the receiving language and are no
longer perceived as foreign to those which are expressly said to be
foreign or are indicated as such by some typographical device (foreign
alphabet, italics, quotation marks, or capitalization). There may be
something to be said for the LOB Corpus practice of tagging only words
or phrases expressly said to be or indicated as foreign. This
practice means that the same word may be treated differently depending
upon the context and that even normally very foreign-sounding words may
be uncoded provided that the foreignness is not indicated or commented
upon in any way in the text.
 
2.2.7 Abbreviations
 
The tagging of abbreviations prevents confusion between the full stop as
an end-of-sentence marker and as an abbreviation marker. In the LOB Corpus
abbreviation coding was used regardless of whether a short form of a
word ended in a full stop or not (Mr., Mr). There are a number of
problems of demarcation, however; see the LOB Corpus manual.
 
If sentences are tagged explicitly, it may be unnecessary to tag
abbreviations (unless the object is to distinguish abbreviations from
ordinary vocabulary items).
 
2.2.8 Names
 
The tagging of names prevents confusion between capitalization that
marks names and capitalization that marks sentence openings. Note that
parts of names may have lower-case initials (de Gaulle). Forms may
simultaneously be names and abbreviations (e.g. initials). There are
problems of demarcation, as
initial capitalization extends beyond sentence openings and names. See
the types recognized in the tagged LOB Corpus.
 
If sentences are tagged explicitly, it may be unnecessary to tag names
(unless the object is to distinguish names from ordinary words or
perhaps to identify names for indexing purposes).
 
2.2.9 Emphasis
 
For most purposes it is not essential to tag the variety of
typographical shifts used in printed texts. Such marking rather clutters
up the text and makes it harder to process by computer. Mechanisms for
indicating significant typographical shifts should be provided, however,
especially shifts of emphasis as indicated by italics, boldface, italic
boldface, and underlining.
 
The great variety of highlighting in headlines and enumerators is best
left untagged (unless there is a shift within the heading).
 
2.2.10 Hyphen
 
Mechanisms must be provided for distinguishing hyphen and dash and for
distinguishing line-end hyphens from those which are integral parts of
words. The former should be removed or given special codes. Line-end
hyphens were removed in the LOB Corpus unless dictionaries showed that
hyphenation was normal or the word in question was found hyphenated
elsewhere in the text (this left some cases undecided, however).
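
A rough sketch of a rule of this kind is given below. It is modelled on
the description above, not on the actual LOB procedure; the function
name and the small "dictionary" set are assumptions, and the sketch
joins the two lines at each resolved hyphen, so the original lineation
is not preserved.

```python
import re
from typing import Set


def resolve_line_end_hyphens(text: str, dictionary_hyphenated: Set[str]) -> str:
    """Remove line-end hyphens unless the hyphenated form is attested."""
    # Hyphenated forms occurring within a line elsewhere in the same text.
    attested = {w.lower() for w in re.findall(r"\w+(?:-\w+)+", text)}

    def repl(match: re.Match) -> str:
        first, second = match.group(1), match.group(2)
        candidate = f"{first}-{second}"
        if candidate.lower() in dictionary_hyphenated or candidate.lower() in attested:
            return candidate          # hyphen is an integral part of the word
        return first + second         # hyphen was only a line-end hyphen

    # A word part, a hyphen at the end of a line, and the word part
    # that opens the following line.
    return re.sub(r"(\w+(?:-\w+)*)-[ \t]*\n[ \t]*(\w+)", repl, text)


sample = ("The co-operative was a well-\nknown insti-\ntution, "
          "and the co-\noperative grew.")
print(resolve_line_end_hyphens(sample, {"well-known"}))
# The co-operative was a well-known institution, and the co-operative grew.
```

A word that is neither in the dictionary nor hyphenated elsewhere is
always joined, which is the point at which undecided cases arise.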
 
A way of temporarily avoiding the problem in converting texts from
printed to machine-readable form is to preserve the lineation of the
original. But this may not always be possible. And any subsequent
processing of words presupposes that the hyphenation problem has been
dealt with.
 
There is a good case for recommending non-use of line-end hyphens in
machine-readable texts or for devising a special "word continuation"
marker.
 
Spacing is a simple way of distinguishing hyphen and dash; dashes could
be distinguished from hyphens by consistent use of surrounding spaces
(while hyphens are immediately preceded and/or followed by other
characters).  Alternatively, dashes could be marked as double hyphens,
as is sometimes done in written manuscripts.
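
Both conventions can be checked mechanically, as in the following
sketch. It assumes that a dash is either a hyphen with a space on each
side or a run of two or more hyphens; everything else is taken to be a
hyphen. The function name is hypothetical.

```python
import re
from typing import List


def classify_hyphens(text: str) -> List[str]:
    """Label each run of '-' in the text as 'dash' or 'hyphen'."""
    kinds: List[str] = []
    for match in re.finditer(r"-+", text):
        # Characters on either side; the text edges count as spaces.
        before = text[match.start() - 1] if match.start() > 0 else " "
        after = text[match.end()] if match.end() < len(text) else " "
        if len(match.group()) >= 2 or (before.isspace() and after.isspace()):
            kinds.append("dash")
        else:
            kinds.append("hyphen")
    return kinds


print(classify_hyphens("A well-known rule - or so it seems -- applies."))
# ['hyphen', 'dash', 'dash']
```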
 
2.2.11 Word
 
Like the sentence, the word is hard to define exactly. But there is good
consistency in word division in written material in English and many
other languages, and the word can therefore usefully be defined as a
sequence of alphanumeric characters surrounded by spaces, with
elimination of any initial or final punctuation marks (except final full
stops in abbreviations and initial and final apostrophes and hyphens:
'em, boys', livin', fourteenth-, -room, etc). There may be a case for
allowing words to contain punctuation marks, as in analyses of the LOB
Corpus in cases like: 2.1, 3,000, M.A.
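
The orthographic definition can be sketched as a simple procedure. The
abbreviation set is an assumption of the example, and the sketch cannot
tell an initial apostrophe from a single begin-quote mark, which is
exactly the ambiguity noted in the section on quotation.

```python
import string
from typing import List, Set

# Punctuation stripped from the edges of a word: everything except the
# apostrophe and the hyphen, which may be integral to the word.
STRIP = "".join(c for c in string.punctuation if c not in "'-")


def words(text: str, abbreviations: Set[str]) -> List[str]:
    """Return space-delimited words with edge punctuation removed."""
    result: List[str] = []
    for chunk in text.split():
        core = chunk.lstrip(STRIP)
        trimmed = core.rstrip(STRIP)
        # Restore a final full stop that belongs to a known abbreviation.
        if core[len(trimmed):].startswith(".") and trimmed + "." in abbreviations:
            trimmed += "."
        # Drop chunks without any alphanumeric character (e.g. a spaced dash).
        if any(c.isalnum() for c in trimmed):
            result.append(trimmed)
    return result


sample = "\"Tell 'em the boys' dog was livin' here,\" said Mr. Brown."
print(words(sample, {"Mr."}))
# ['Tell', "'em", 'the', "boys'", 'dog', 'was', "livin'", 'here', 'said', 'Mr.', 'Brown']
```

Since only the edges of each chunk are trimmed, forms with internal
punctuation such as 2.1 and 3,000 are left intact as single words.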
 
The simplest solution is to apply ordinary spacing conventions in
machine-readable texts, although this gives rise to some very strange
"words": A1, 2.1, 1951-52, etc. Some strange forms arise with
hyphenation (see also the examples in the paragraph above) of compounds
and complex premodifiers, as in: four-letter word, New York-born, etc.
 
For most purposes it is probably possible to live with the simple
orthographic definition given above. But any more sophisticated analysis
of words will require special coding. The tagging of enumerators,
abbreviations, and foreign words (cf above) takes care of part of the
problem. There is a need for a mechanism for indicating that a sequence
surrounded by spaces may be part of a word, as in: bi- or multilingual,
the ending ER. We need a mechanism for splitting up contracted forms, as
in: didn't, it's. And we need mechanisms for dealing with the problems
of hyphenation illustrated above. (A particular problem in English is
that genitives and some contracted forms may look alike, e.g.: John's =
John's (genitive), John is, John has.)
 
It is possible that some of these problems can only be meaningfully
solved if the text is subjected to more exhaustive linguistic tagging.
Note, for example, that the ending in group genitives really belongs to
a phrase rather than to the word it is attached to: the King of Sweden's
crown, an hour and a half's talk, John and Mary's car, etc.
 
2.2.12 Capitalization
 
In some early corpora which only had a limited character set available
the text was given in upper case and capitalization in the original text
was indicated by special codes. Where the character set includes upper
and lower case, it is natural to reproduce capitalization as given in
the original text. It may, however, be desirable to code sentences
and/or names (cf above) and thus distinguish between capitalization used
for different purposes.
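
As an illustration of the older practice, the sketch below restores
mixed case from an upper-case text in which a marker precedes every
letter that was capitalized in the original. The marker "*" is an
assumption of the example and is not the code of any particular corpus.

```python
def decode_capitals(coded: str, marker: str = "*") -> str:
    """Restore mixed case from upper-case text with capitalization markers."""
    out = []
    capitalize_next = False
    for ch in coded:
        if ch == marker:
            capitalize_next = True      # the next character was a capital
        elif capitalize_next:
            out.append(ch.upper())
            capitalize_next = False
        else:
            out.append(ch.lower())      # unmarked letters were lower case
    return "".join(out)


print(decode_capitals("*THE *BROWN *CORPUS IS TAGGED."))
# The Brown Corpus is tagged.
```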
 
If capitalization is coded or if changes are made in the distribution of
upper and lower case, this should of course be declared (like all
editing and coding decisions).
 
2.3 Reference system
 
To make it possible to use and refer to a machine-readable text, there
must be an explicit reference system. In converting printed text to
machine-readable form it is often essential to include markers of page
and line division (etc) in the machine-readable version. These can then
be used for reference purposes. In many cases it may be useful to keep
the same line division in the machine-readable version as in the
original text.
 
In a sample text corpus with a large number of texts represented it is
natural to introduce a new reference system. The system used in the LOB
Corpus, where each line is prefixed by a code for text category, text
number, and line number (e.g. A12 100), has been found very useful.
References to the original location of the texts are given in an
accompanying manual.
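
A reference system of this kind is easy to generate and to read back, as
the sketch below suggests. The exact layout of the prefix (single
spaces, two-digit text number) is an assumption based on the example
"A12 100" and may differ from the layout of the LOB files themselves.

```python
import re
from typing import List, Tuple


def add_references(category: str, text_number: int,
                   lines: List[str], start: int = 1) -> List[str]:
    """Prefix each line with text category, text number, and line number."""
    return [f"{category}{text_number:02d} {number} {line}"
            for number, line in enumerate(lines, start)]


def parse_reference(ref_line: str) -> Tuple[str, int, int, str]:
    """Split a prefixed line into (category, text number, line number, text)."""
    match = re.match(r"([A-Z])(\d+) (\d+) (.*)", ref_line)
    if match is None:
        raise ValueError(f"no reference prefix in: {ref_line!r}")
    category, text_number, line_number, text = match.groups()
    return category, int(text_number), int(line_number), text


prefixed = add_references("A", 12, ["first line", "second line"], start=100)
print(prefixed)                      # ['A12 100 first line', 'A12 101 second line']
print(parse_reference(prefixed[1]))  # ('A', 12, 101, 'second line')
```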
 
2.4 Concluding remarks
 
The principal consideration for the corpus compiler is to produce a
faithful representation of the text and to document sources, coding
conventions, and editorial decisions. In this paper I have mainly drawn
on my experiences from work with the LOB Corpus. The fact that this
corpus has been widely used is evidence that the practices used have
been successful. This does not mean, however, that these are practices
to be recommended for future corpus compilers. It is desirable to rely
less on a printed manual, to make available more information on sources
and coding in the corpus itself, and - best of all - to work for common
coding conventions which reduce the need to specify the coding decisions
for each individual corpus.
 
In this paper I have limited the discussion to the encoding of elements
for which there are established conventions in printed texts.  Building
on these conventions and refining them by mechanisms for more explicit
marking is probably a good way of arriving at generally accepted
guidelines for the encoding of machine-readable texts.
 
August, 1989
Stig Johansson
University of Oslo