<?xml version="1.0" encoding="utf-8"?>
<!--
Copyright TEI Consortium. 
Dual-licensed under CC-by and BSD2 licences 
See the file COPYING.txt for details.
$Date$
$Id$
-->


<?xml-model href="http://tei.oucs.ox.ac.uk/jenkins/job/TEIP5/lastSuccessfulBuild/artifact/P5/release/xml/tei/odd/p5.nvdl" type="application/xml" schematypens="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"?>

<div xmlns="http://www.tei-c.org/ns/1.0" type="div1" xml:id="CC" n="23"><head>Language Corpora</head>

<p>The term <term>language corpus</term> is used to mean a number of
rather different things.  It may refer simply to any collection of
linguistic data (for example, written, spoken, signed, or multimodal), although
many practitioners prefer to reserve it for collections which have
been organized or collected with a particular end in view, generally to
characterize a particular state or variety of one or more languages.
Because opinions as to the best method of achieving this goal differ,
various subcategories of corpora have also been identified.  For our
purposes however, the distinguishing characteristic of a corpus is that
its components have been selected or structured according to some
conscious set of design criteria.
 </p>
<p>These design criteria may be very simple and undemanding, or very
sophisticated.  A corpus may be intended to represent (in the
statistical sense) a particular linguistic variety or sublanguage, or
it may be intended to represent all aspects of some assumed
<soCalled>core</soCalled> language.  A corpus may be made up of whole
texts or of fragments or text samples.  It may be a
<soCalled>closed</soCalled> corpus, or an <soCalled>open</soCalled> or
<soCalled>monitor</soCalled> corpus, the composition of which may
change over time.  However, since an open corpus is of necessity
finite at any particular point in time, the only likely effect of its
expansibility from the encoding point of view may be some increased
difficulty in maintaining consistent encoding practices (see further
section <ptr target="#CCREC"/>). For simplicity, therefore, our
discussion largely concerns ways of encoding closed corpora, regarded
as single but composite texts.
 </p>
<p>Language corpora are regarded by these Guidelines as
<term>composite texts</term> rather than <term>unitary texts</term>
(on this distinction, see chapter <ptr target="#DS"/>).  This is
because although each discrete sample of language in a corpus clearly
has a claim to be considered as a text in its own right, it is also
regarded as a subdivision of some larger object, if only for
convenience of analysis.  Corpora share a number of characteristics
with other types of composite texts, including anthologies and
collections.  Most notably, different components of composite texts
may exhibit different structural properties (for example, some may be
composed of verse, and others of prose), thus potentially requiring
elements from different TEI modules.
 </p>
<p>Aside from these high-level structural differences, and possibly
differences of scale, the encoding of language corpora and the
encoding of individual texts present identical sets of problems.  Any
of the encoding techniques and elements presented in other chapters of
these Guidelines may therefore prove relevant to some aspect of corpus
encoding and may be used in corpora.  Therefore, we do not repeat here
the discussion of such fundamental matters as the representation of
multiple character sets (see chapter <ptr target="#CH"/>); nor do we
attempt to summarize the variety of elements provided for encoding
basic structural features such as quoted or highlighted phrases, 
cross-references, lists, notes, editorial changes and reference systems (see
chapter <ptr target="#CO"/>).  In addition to these general purpose
elements, these Guidelines offer a range of more specialized sets of
tags which may be of use in certain specialized corpora, for example
those consisting primarily of verse (chapter <ptr target="#VE"/>),
drama (chapter <ptr target="#DR"/>), transcriptions of spoken text
(chapter <ptr target="#TS"/>), etc.  Chapter <ptr target="#ST"/>
should be reviewed for details of how these and other components of
the Guidelines should be tailored to create a schema
appropriate to a given application.  In sum, it should not be assumed 
that only the matters specifically addressed in this chapter are of
importance for corpus creators.
 </p>
<p>This chapter does however include some other material
relevant to corpora and corpus-building, for which no other location
appeared suitable.  It begins with a review of the distinction between
unitary and composite texts, and of the different methods provided by
these Guidelines for representing composite texts of different kinds
(section <ptr target="#CCDEF"/>).  Section <ptr target="#CCAH"/> describes a
set of additional header elements provided for the documentation of
contextual information, of importance largely though not exclusively to
language corpora.  These elements constitute the additional module for
language corpora proper.  Section <ptr target="#CCAS"/> discusses a mechanism by which
individual parts of the TEI header may be associated with different
parts of a TEI-conformant text.  Section <ptr target="#CCAN"/> reviews
various methods of providing linguistic annotation in corpora, with some
specific examples of relevance to current practice in corpus
linguistics.  Finally, section <ptr target="#CCREC"/> provides some general
recommendations about the use of these Guidelines in the building of
large corpora.
</p>
<div type="div2" xml:id="CCDEF"><head>Varieties of Composite Text</head>

<p>Both unitary and composite texts may be encoded using these
Guidelines; composite texts, including corpora, will typically make
use of the following tags for their top-level organization.
<specList><specDesc key="teiCorpus"/><specDesc key="TEI"/><specDesc key="teiHeader" atts="type "/><specDesc key="text"/><specDesc key="group"/></specList> Full descriptions of these may be found in
chapter <ptr target="#HD"/> (for <gi>teiHeader</gi>), and chapter <ptr target="#DS"/> (for <gi>teiCorpus</gi>, <gi>TEI</gi>, <gi>text</gi> and
<gi>group</gi>); this section discusses their application to composite
texts in particular.
 </p>
<p>In these Guidelines, the word <term>text</term> refers to any stretch
of discourse, whether complete or incomplete, unitary or composite,
which the encoder chooses (perhaps merely for purposes of analytic
convenience) to regard as a unit.  The term <term>composite text</term>
refers to texts within which other texts appear; the following common
cases may be distinguished:
<list rend="bulleted">
<item>language corpora</item>
<item>collections or anthologies</item>
<item>poem cycles and epistolary works (novels or essays written
in the form of collections or series of letters)</item>
<item>otherwise unitary texts, within which one or more subordinate
texts are embedded</item></list>
The elements listed above may be combined to encode each of these
varieties of composite text in different ways.
 </p>
<p>In corpora, the component samples are clearly distinct texts, but the
systematic collection, standardized preparation, and common markup of
the corpus often make it useful to treat the entire corpus as a unit,
too.  Some corpora may become so well established as to be regarded as
texts in their own right; the Brown and LOB corpora are now close to
achieving this status.
 </p>
<p>The <gi>teiCorpus</gi> element is intended for the encoding of
language corpora, though it may also be useful in encoding newspapers,
electronic anthologies, and other disparate collections of material.
The individual samples in the corpus are encoded as separate
<gi>TEI</gi> elements, and the entire corpus is enclosed in a
<gi>teiCorpus</gi> element.  Each sample has the usual structure for
a <gi>TEI</gi> document, comprising a <gi>teiHeader</gi> followed by a
<gi>text</gi> element. The corpus, too, has a corpus-level
<gi>teiHeader</gi> element, in which the corpus as a whole, and any
encoding practices common to multiple samples, may be described. The overall
structure of a TEI-conformant corpus is thus:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><teiCorpus>
    <teiHeader>
    </teiHeader>
    <TEI>
         <teiHeader>  </teiHeader>
         <text>  </text>
    </TEI>
    <TEI>
         <teiHeader>  </teiHeader>
         <text>  </text>
    </TEI>
</teiCorpus></egXML></p>
<p>Header information which relates to the whole corpus rather than to
individual components of it should be factored out and included in the
<gi>teiHeader</gi> element prefixed to the whole.  This two-level
structure allows for contextual information to be specified at the
corpus level, at the individual text level, or at both.  Discussion of
the kinds of information which may thus be specified is provided
below, in section <ptr target="#CCAH"/>, as well as in chapter <ptr target="#HD"/>.  Information of this type should in general be
specified only once: a variety of methods are provided for associating
it with individual components of a corpus, as further described in
section <ptr target="#CCAS"/>.
 </p>
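<p>As a minimal sketch of this factoring (the titles and prose shown
are invented for illustration and are not taken from any real corpus),
a statement of editorial practice shared by every sample might be
given once, in the corpus header, while each text header records only
what is specific to its own sample:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><teiCorpus>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical corpus of newspaper extracts</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished; for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Sources are described in the header of each sample.</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <editorialDecl>
        <p>Spelling has not been normalized in any sample.</p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Sample 1: an editorial</title>
        </titleStmt>
        <publicationStmt>
          <p>Distributed as part of the corpus.</p>
        </publicationStmt>
        <sourceDesc>
          <p>Source details for sample 1 only.</p>
        </sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body>
        <p><!-- text of sample 1 --></p>
      </body>
    </text>
  </TEI>
  <!-- further TEI elements, one for each sample -->
</teiCorpus></egXML></p>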
<p>In some cases, the design of a corpus is reflected in its internal
structure.  For example, a corpus of newspaper extracts might be
arranged to combine all stories of one type (reportage, editorial,
reviews, etc.) into some higher-level grouping, possibly with sub-groups
for date, region, etc.  The <gi>teiCorpus</gi> element provides no
direct support for reflecting such internal corpus structure in the
markup:  it treats the corpus as an undifferentiated series of
components, each tagged <gi>TEI</gi>.
 </p>
<p>If it is essential to reflect a single permanent organization of a
corpus into sub- and sub-sub-corpora, then the corpus or the high-level
subcorpora may be encoded as composite texts, using the <gi>group</gi>
element described below and in section <ptr target="#DSGRP"/>.  The
mechanisms for corpus characterization described in this chapter,
however, are designed to reduce the need to do this.  Useful groupings
of components may easily be expressed using the text classification and
identification elements described in section <ptr target="#CCAHTD"/>,
and those for associating declarations with corpus components described
in section <ptr target="#CCAS"/>.  These methods also allow several
different methods of text grouping to co-exist, each to be used as
needed at different times.  This helps minimize the danger of
cross-classification and misclassification of samples, and helps
improve the flexibility with which parts of a corpus may be
characterized for different applications.
 </p>
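<p>For example, the newspaper corpus mentioned above might declare its
story types and regions once, as categories in taxonomies in the corpus
header, and then point to the relevant categories from the header of
each sample.  The following is a sketch only: the identifiers and
category descriptions are invented for illustration, and the elements
used are those described in sections <ptr target="#HD55"/> and <ptr target="#HD43"/>.
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><classDecl>
  <taxonomy xml:id="CC-eg-storyType">
    <category xml:id="CC-eg-reportage">
      <catDesc>Reportage</catDesc>
    </category>
    <category xml:id="CC-eg-editorial">
      <catDesc>Editorial</catDesc>
    </category>
  </taxonomy>
  <taxonomy xml:id="CC-eg-region">
    <category xml:id="CC-eg-national">
      <catDesc>National press</catDesc>
    </category>
    <category xml:id="CC-eg-regional">
      <catDesc>Regional press</catDesc>
    </category>
  </taxonomy>
</classDecl></egXML>
The header of an individual sample might then place that sample in both
groupings at once, without any change to the order or nesting of the
<gi>TEI</gi> elements in the corpus:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><profileDesc>
  <textClass>
    <catRef target="#CC-eg-editorial #CC-eg-regional"/>
  </textClass>
</profileDesc></egXML></p>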
<p>Anthologies and collections are often treated as texts in their own
right, if only for historical reasons.  In conventional publishing, at
least, anthologies are published as units, with single editorial
responsibility and common front and back matter which may need to be
included in their electronic encodings.  The texts collected in the
anthology, of course, may also need to be identifiable as distinct
individual objects for study.
 </p>
<p>Poem cycles, epistolary novels, and epistolary essays differ from
anthologies in that they are often written as single works, by single
authors, for single occasions; nevertheless, it can be useful to treat
their constituent parts as individual texts, as well as the cycle
itself.  Structurally, therefore, they may be treated in the same way
as anthologies: in both cases, the body of the text is composed
largely of other texts.
 </p>
<p>The <gi>group</gi> element is provided to simplify the encoding of
collections, anthologies, and cyclic works; as noted above, the
<gi>group</gi> element can also be used to record the potentially
complex internal structure of language corpora.  For a full description,
see chapter <ptr target="#DS"/>.
 </p>
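<p>A minimal sketch of an anthology encoded in this way is shown below;
the comments stand in for real content.  The front and back matter
common to the whole anthology belong to the outermost <gi>text</gi>,
while each collected work is a <gi>text</gi> in its own right within
the <gi>group</gi>:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><TEI>
  <teiHeader>
    <!-- header for the anthology as a whole -->
  </teiHeader>
  <text>
    <front>
      <!-- front matter for the anthology -->
    </front>
    <group>
      <text>
        <front>
          <!-- front matter for the first work, if any -->
        </front>
        <body>
          <!-- body of the first work -->
        </body>
      </text>
      <text>
        <body>
          <!-- body of the second work -->
        </body>
      </text>
    </group>
    <back>
      <!-- back matter for the anthology -->
    </back>
  </text>
</TEI></egXML></p>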
<p>Some composite texts, finally, are neither corpora, nor anthologies,
nor cyclic works:  they are otherwise unitary texts within which other
texts are embedded.  In general, they may be treated in the same way as
unitary texts, using the normal <gi>TEI</gi> and
<gi>body</gi> elements.  The embedded text itself may be encoded using
the <gi>text</gi> element, which may occur within quotations or between
paragraphs or other chunk-level elements inside the sections of a larger
text.  For further discussion, see chapter <ptr target="#DS"/>.
 </p>
<p>All composite texts share the characteristic that their different
component texts may be of structurally similar or dissimilar types.  If
all component texts can be encoded using the same module,
then no problem arises.  If, however, they require
different modules, then all of these must be included in the schema. This
process is described in more detail in section <ptr target="#STMA"/>.
</p></div>
<div type="div2" xml:id="CCAH"><head>Contextual Information</head>
<p>Contextual information is of particular importance for collections
or corpora composed of samples from a variety of different kinds of
text. Examples of such contextual information include:  the age, sex,
and geographical origins of participants in a language interaction, or
their socio-economic status; the cost and publication data of a
newspaper; the topic, register or factuality of an extract from a
textbook. Such information may be of the first importance, whether as
an organizing principle in creating a corpus (for example, to ensure
that the range of values in such a parameter is evenly represented
throughout the corpus, or represented proportionately to the population
being sampled), or as a selection criterion in analysing the corpus
(for example, to investigate the language usage of some particular
vector of social characteristics).</p>
<p>Such contextual information is potentially of equal importance for
unitary texts, and these Guidelines accordingly make no particular
distinction between the kinds of information which should be gathered
for unitary and for composite texts.  In either case, the information
should be recorded in the appropriate section of a TEI header, as
described in chapter <ptr target="#HD"/>.  In the case of language corpora,
such information may be gathered together in the overall corpus header,
or split across all the component texts of a corpus, in their individual
headers, or divided between the two.  The association between an
individual corpus text and the contextual information applicable to it
may be made in a number of ways, as further discussed in section <ptr target="#CCAS"/> below.</p>
<p>Chapter <ptr target="#HD"/>, which should be read in conjunction with
the present section, describes in full the range of elements available
for the encoding of information relating to the electronic file itself,
for example its bibliographic description and those of the source or
sources from which it was derived (see section <ptr target="#HD2"/>);
information about the encoding practices followed with the corpus, for
example its design principles, editorial practices, reference system,
etc.  (see section <ptr target="#HD5"/>); more detailed descriptive
information about the creation and content of the corpus, such as the
languages used within it and any descriptive classification system used
(see section <ptr target="#HD4"/>); and version information documenting any
changes made in the electronic text (see section <ptr target="#HD6"/>).</p>
<p>In addition to the elements defined by chapter <ptr target="#HD"/>,
several other elements can be used in the TEI header if the additional
module defined by this chapter is invoked.  These additional tags make
it possible to characterize the social or other situation within which a
language interaction takes place or is experienced, the physical setting
of a language interaction, and the participants in it.  Though this
information may be relevant to, and provided for, unitary texts as well
as for collections or corpora, it is more often recorded for the
components of systematically developed corpora than for isolated texts,
and thus this module is referred to as being <q>for language
corpora</q>.</p>

<p>When the module defined in this chapter is included in a schema, a
number of additional elements become available within the
<gi>profileDesc</gi> element of the TEI header (discussed in section
<ptr target="#HD4"/>).  <specList><specDesc key="textDesc"/><specDesc key="particDesc"/><specDesc key="settingDesc"/></specList> These
elements, members of the <ident type="class">model.profileDescPart</ident>, are discussed in the
remainder of the chapter.

<specGrp>







<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/textDesc.xml"/>












<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/particDesc.xml"/>












<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/settingDesc.xml"/>





</specGrp>
<specGrpRef target="#DCCAHTD"/>

<specGrpRef target="#DCCAHSE"/>
	</p>
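<p>In outline, and with comments standing in for the content of each
element, the three descriptions might appear together within a header
as follows.  This is a sketch of their position only; the content of
each element is discussed in the sections that follow.
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><teiHeader>
  <fileDesc>
    <!-- bibliographic description of the file -->
  </fileDesc>
  <profileDesc>
    <textDesc>
      <!-- situational parameters of the text -->
    </textDesc>
    <particDesc>
      <!-- participants in the interaction -->
    </particDesc>
    <settingDesc>
      <!-- setting or settings of the interaction -->
    </settingDesc>
  </profileDesc>
</teiHeader></egXML></p>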

<div type="div3" xml:id="CCAHTD"><head>The Text Description</head>
<p>The <gi>textDesc</gi> element provides a full description of the
situation within which a text was produced or experienced, and thus
characterizes it in a way relatively independent of any <foreign>a
priori</foreign> theory of text-types.  It is provided as an alternative
or a supplement to the common practice of categorizing texts by means of
descriptive taxonomies, which is fully described in sections <ptr target="#HD43"/> and <ptr target="#HD55"/>.  The description is
organized as a set of values and optional prose descriptions for
eight <term>situational parameters</term>, each represented by
one of the following elements:
<specList><specDesc key="channel" atts="mode"/><specDesc key="constitution" atts="type"/><specDesc key="derivation" atts="type"/><specDesc key="domain" atts="type"/><specDesc key="factuality" atts="type"/><specDesc key="interaction" atts="type active passive"/><specDesc key="preparedness" atts="type"/><specDesc key="purpose" atts="type degree"/></specList></p>
<p>These elements constitute a model class called <ident type="class">model.textDescPart</ident>; new parameters may be defined
by defining new elements and adding them to that class, as further
described in <ptr target="#MD"/>.</p>

<p>By default, a text description will contain each of the above
elements, supplied in the order specified. Except for the
<gi>purpose</gi> element, which may be repeated to indicate multiple
purposes, no element should appear more than once within a single text
description.  Each element may be empty, or may contain a brief
qualification or more detailed description of the value expressed by
its attributes.  It should be noted that some texts, in particular
literary ones, may resist unambiguous classification in some of these
dimensions; in such cases, the situational parameter in question
should be given the content <q>not applicable</q> or an equivalent
phrase.
	</p>
<p>Texts may be described along many dimensions, according to many
different taxonomies.  No generally accepted consensus as to how such
taxonomies should be defined has yet emerged, despite the best efforts
of many corpus linguists, text linguists, sociolinguists,
rhetoricians, and literary theorists over the years.  Rather than
attempting the task of proposing a single taxonomy of
<term>text-types</term> (or the equally impossible one of enumerating
all those which have been proposed previously), the closed set of
<term>situational parameters</term> described above can be used in
combination to supply useful distinguishing descriptive features of
individual texts, without insisting on a system of discrete high-level
text-types. Such text-types may however be used in combination with
the parameters proposed here, with the advantage that the internal
structure of each such text-type can be specified in terms of the
parameters proposed.  This approach has the following analytical
advantages:<note place="bottom">Schemes similar to that proposed here were developed
in the 1960s and 1970s by researchers such as Hymes, Halliday, and
Crystal and Davy, but have rarely been implemented; one notable
exception is the pioneering work on the Helsinki Diachronic Corpus
of English, on which see <ptr type="cit" target="#CC-BIBL-1"/>.</note>
<list rend="bulleted">
<item>it enables a relatively continuous characterization of texts (in
contrast to discrete categories based on type or topic)</item>
<item>it enables meaningful comparisons across corpora</item>
<item>it allows analysts to build and compare their own text-types
based on the particular parameters of interest to them</item>
<item>it is equally applicable to spoken, written, or signed texts</item></list></p>
<p>Two alternative approaches to the use of these parameters are
supported by these Guidelines.  One is to use pre-existing taxonomies
such as those used in subject classification or other types of text
categorization.
Such taxonomies may also be appropriate for the description of the
topics addressed by particular texts.  Elements for this purpose are
described in section <ptr target="#HD43"/>, and elements for defining or
declaring such classification schemes in section <ptr target="#HD55"/>.  A
second approach is to develop an application-specific set of
<term>feature structures</term> and an associated <term>feature system
declaration</term>, as described in
chapters <ptr target="#FS"/> and <ptr target="#FD"/>.
	</p>
<p>Where the organizing principles of a corpus or collection so permit,
it may be convenient to regard a particular set of values for the
situational parameters listed in this section as forming a
<term>text-type</term> in its own right; this may also be useful where
the same set of values applies to several texts within a corpus.  In
such a case, the set of text-types so defined should be regarded as a
<term>taxonomy</term>.  The mechanisms described in section <ptr target="#HD55"/> may be used to define hierarchic taxonomies of such
text-types, provided that the <gi>catDesc</gi> component of the
<gi>category</gi> element contains a <gi>textDesc</gi> element rather
than a prose description.  Particular texts may then be associated with
such definitions using the mechanisms described in section <ptr target="#HD43"/>.
	</p>
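<p>The following sketch shows one way in which such a text-type might
be defined.  The identifiers are invented for illustration, and the
parameter values are simply those used for the novel in the examples
given later in this section:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><taxonomy xml:id="CC-eg-textTypes">
  <category xml:id="CC-eg-novel">
    <catDesc>
      <textDesc n="novel">
        <channel mode="w"/>
        <constitution type="single"/>
        <derivation type="original"/>
        <domain type="art"/>
        <factuality type="fiction"/>
        <interaction type="none"/>
        <preparedness type="prepared"/>
        <purpose type="entertain" degree="high"/>
        <purpose type="inform" degree="medium"/>
      </textDesc>
    </catDesc>
  </category>
  <!-- further categories, one for each text-type -->
</taxonomy></egXML>
Each text of this type might then point to the definition from its own
header, rather than repeating the full description:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><textClass>
  <catRef target="#CC-eg-novel"/>
</textClass></egXML></p>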

<p>Using these situational parameters, an informal domestic
conversation might be characterized as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><textDesc n="Informal domestic conversation">
   <channel mode="s">informal face-to-face conversation</channel>
   <constitution type="single">each text represents a continuously
     recorded interaction among the specified participants
     </constitution>
   <derivation type="original">  </derivation>
   <domain type="domestic">plans for coming week, local affairs</domain>
   <factuality type="mixed">mostly factual, some jokes</factuality>
   <interaction type="complete" active="plural" passive="many">  </interaction>
   <preparedness type="spontaneous">  </preparedness>
   <purpose type="entertain" degree="high">  </purpose>
   <purpose type="inform" degree="medium"> </purpose>
</textDesc></egXML>
</p>
<p>The following example demonstrates how the same situational
parameters might be used to characterize a novel:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><textDesc n="novel">
   <channel mode="w">print; part issues</channel>
   <constitution type="single">     </constitution>
   <derivation type="original">     </derivation>
   <domain type="art">     </domain>
   <factuality type="fiction">     </factuality>
   <interaction type="none">     </interaction>
   <preparedness type="prepared">     </preparedness>
   <purpose type="entertain" degree="high">     </purpose>
   <purpose type="inform" degree="medium">  </purpose>
</textDesc></egXML>
<!-- need more examples -->
</p>
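<p>As a further illustration, a scripted radio news bulletin might be
characterized along the following lines.  This is a sketch only: the
attribute values chosen are assumptions about how an encoder might
classify such a text, and should be checked against the specifications
of the elements concerned.
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><textDesc n="Radio news bulletin">
   <channel mode="s">radio broadcast</channel>
   <constitution type="single">each text represents one complete
     bulletin</constitution>
   <derivation type="original"/>
   <domain type="public">national and international news</domain>
   <factuality type="fact"/>
   <interaction type="none" active="singular" passive="world"/>
   <preparedness type="scripted">read aloud from a prepared
     script</preparedness>
   <purpose type="inform" degree="high"/>
</textDesc></egXML></p>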


<specGrp xml:id="DCCAHTD" n="Text description">









<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/channel.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/constitution.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/derivation.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/domain.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/factuality.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/interaction.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/preparedness.xml"/>















<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/purpose.xml"/>






</specGrp>


</div>
<div type="div3" xml:id="CCAHPA"><head>The Participant Description</head>

<p>The <gi>particDesc</gi> element in the <gi>profileDesc</gi> element
provides additional information about the participants in a spoken
text or, where this is judged appropriate, the persons named or
depicted in a written text.  When the detailed elements provided by
the <ident type="module">namesdates</ident> module described in <ptr target="#ND"/> are included in a schema, this element can
contain detailed demographic or descriptive information about
individual speakers or groups of speakers, such as their names or
other personal characteristics. Individually identified persons may
also be identified by a code which can then be used elsewhere within the
encoded text, for example as the value of a <att>who</att>
attribute.</p>

<p>It should be noted that although the terms <term>speaker</term> or
<term>participant</term> are used throughout this section, it is
intended that the same mechanisms may be used to characterize fictional
personæ or <soCalled>voices</soCalled> within a written text, except
where otherwise stated.  For the purposes of analysis of language usage,
the information specified here should be equally applicable to
written, spoken, or signed texts.</p>
<p>The element <gi>particDesc</gi> contains a description of the
participants in an interaction, which may be supplied as
straightforward prose, possibly containing a list of names, encoded
using the usual <gi>list</gi> and <gi>name</gi> elements, or
alternatively using the more specific and detailed <gi>listPerson</gi>
element provided by the <ident type="module">namesdates</ident> module
described in <ptr target="#ND"/>.
</p>
<p>For example, a participant in a recorded conversation might be
described informally as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><particDesc xml:id="p2">
<p>Female informant, well-educated, born in Shropshire UK, 12 Jan
    1950, of unknown occupation. Speaks French fluently.
    Socio-Economic status B2 in the PEP classification scheme.</p>
</particDesc></egXML></p>

<p>Alternatively, when the <ident type="module">namesdates</ident> module
is included in a schema, information about the same participant
described above might be provided in a more structured way as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><person sex="2" age="mid">
  <birth when="1950-01-12">     
    <date>12 Jan 1950</date>
    <name type="place">Shropshire, UK</name>
  </birth>
  <langKnowledge tags="en fr">
    <langKnown level="first" tag="en">English</langKnown>
    <langKnown tag="fr">French</langKnown>
  </langKnowledge>
  <residence>Long term resident of Hull</residence>
  <education>University postgraduate</education>
  <occupation>Unknown</occupation>
  <socecStatus scheme="#pep" code="#b2"/>
</person></egXML></p>
<p>An identified character in a drama or a novel may also be regarded
as a participant in this sense, and encoded using
the same techniques:<note place="bottom">It is particularly useful to
define participants in a dramatic text in this way, since it enables the
<att>who</att> attribute to be used to link <gi>sp</gi> elements to
definitions for their speakers; see further section <ptr target="#DRSP"/>.</note>
<egXML xmlns="http://www.tei-c.org/ns/Examples"><particDesc>
  <p>The chief speaking characters in this novel are
  <list>
    <item xml:id="EMWOO"><name>Emma Woodhouse</name></item>
    <item xml:id="DARCY"><name>Mr Darcy</name></item>
    <!-- ... -->
  </list>
  </p>
</particDesc></egXML>
Here, the characters are simply listed without the detailed
structure which use of the <gi>listPerson</gi> element permits.
</p>
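<p>The following sketch illustrates how the codes identifying
participants might be used to link a transcription to the participant
description.  It assumes that both the
<ident type="module">namesdates</ident> module and the module for
transcriptions of speech described in chapter <ptr target="#TS"/> are
included in the schema; the identifiers, personal details, and
utterances are invented for illustration.
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><particDesc>
  <listPerson>
    <person xml:id="CC-eg-spkA" sex="2" age="mid">
      <occupation>Unknown</occupation>
    </person>
    <person xml:id="CC-eg-spkB" sex="1" age="mid">
      <occupation>Teacher</occupation>
    </person>
  </listPerson>
</particDesc></egXML>
Within the text, each utterance then points to the description of its
speaker by means of the <att>who</att> attribute:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><u who="#CC-eg-spkA">have you seen the paper</u>
<u who="#CC-eg-spkB">not yet</u></egXML></p>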

</div>
<div type="div3" xml:id="CCAHSE">
<head>The Setting Description</head> 
<p>The <gi>settingDesc</gi> element is used to describe the setting or
settings in which language interaction takes place.  It may contain a
prose description, analogous to a stage description at the start of a
play, stating in broad terms the locale, or a more detailed
description of a series of such settings.  </p>
<p>Each distinct setting is described by means of a <gi>setting</gi>
element. 
<specList>
<specDesc key="setting"/>
</specList>


Individual settings may be associated with particular participants (if,
for example, the participants are in different places) by means of the
optional <att>who</att> attribute, which this element inherits as a
member of the <ident type="class">att.ascribed</ident> class.  This attribute
identifies one or more individual participants or participant groups,
as discussed earlier in section <ptr target="#CCAHPA"/>.  If this
attribute is not specified, the setting details provided are assumed
to apply to all participants represented in the language
interaction. Note however that it is not possible to encode different
settings for the same participant: a participant is deemed to be a
person within a specific setting.</p>
<p>The <gi>setting</gi> element may contain either a prose description
or a selection of elements from the classes <ident type="class">model.nameLike.agent</ident>, <ident type="class">model.dateLike</ident>, or
<ident type="class">model.settingPart</ident>. By default, when the
module defined by this chapter is included in a schema, these classes thus
provide the following elements: 
<specList>
<specDesc key="name" />
<specDesc key="date"/>
<specDesc key="time"/>
<specDesc key="locale"/>
<specDesc key="activity"/>
</specList>
Additional more specific naming elements such as <gi>orgName</gi> or
<gi>persName</gi> may also be available if the
<ident type="module">namesdates</ident> module is also included in the schema.
</p>
<p>The following example demonstrates the kind of background information
often required to support transcriptions of language interactions, first
encoded as a simple prose narrative:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><settingDesc>
   <p>The time is early spring, 1989. P1 and P2 are playing on the rug
    of a suburban home in Bedford. P3 is doing the washing up at the
    sink. P4 (a radio announcer) is in a broadcasting studio in
    London.</p>
</settingDesc></egXML>
The same information might be represented more formally in the following
way:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><settingDesc>
   <setting who="#p1 #p2">
      <name type="city">Bedford</name>
      <name type="region">UK: South East</name>
      <date>early spring, 1989</date>
      <locale>rug of a suburban home</locale>
      <activity>playing</activity>
   </setting>
   <setting who="#p3">
      <name type="city">Bedford</name>
      <name type="region">UK: South East</name>
      <date>early spring, 1989</date>
      <locale>at the sink</locale>
      <activity>washing-up</activity>
   </setting>
   <setting who="#p4">
      <name type="place">London, UK</name>
      <time>unknown</time>
      <locale>broadcasting studio</locale>
      <activity>radio performance</activity>
   </setting>
</settingDesc></egXML></p>
<p>Again, a more detailed encoding for places is feasible if the
<ident type="module">namesdates</ident> module is included in the
schema. The above examples assume that only the
general purpose <gi>name</gi> element supplied in the core module is
available.
<specGrp xml:id="DCCAHSE" n="Setting description">
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/setting.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/locale.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/activity.xml"/>
</specGrp>
</p></div></div>
<div type="div2" xml:id="CCAS"><head>Associating Contextual
Information with a Text</head>
<p>This section discusses the association of the contextual information
held in the header with the individual elements making up a TEI text or
corpus.  Contextual information is held in elements of various kinds
within the TEI header, as discussed elsewhere in this chapter and in
chapter <ptr target="#HD"/>.  Here we consider what happens when different
parts of a document need to be associated with different contextual
information of the same type, for example when one part of a document
uses a different encoding practice from another, or where one part
relates to a different setting from another.  In such situations, there
will be more than one instance of a header element of the relevant type.
 </p>
<p>The TEI scheme allows for the following possibilities:
<list rend="bulleted">
<item>A given element may appear in the corpus header only, in the
header of one or more texts only, or in both places.</item>
<item>There may be multiple occurrences of certain elements in either
the corpus header or a text header.</item></list>
 </p>
<p>To simplify the exposition, we deal with these two possibilities
separately in what follows; however, they may be combined as
desired.
 </p>
<div type="div3" xml:id="CCAS1"><head>Combining Corpus and Text Headers</head>
<p>A TEI-conformant document may have more than one header only in the
case of a TEI corpus, which must have a header in its own right, as well
as the obligatory header for each text.  Every element specified in a
corpus header is understood as if it appeared within every text header
in the corpus.  An element specified in a text header but not in the
corpus header supplements the specification for that text alone.  If any
element is specified in both corpus and text headers, the corpus header
element is over-ridden for that text alone.
 </p>
<p>The <gi>titleStmt</gi> for a corpus text is understood to be
prefixed by the <gi>titleStmt</gi> given in the corpus header.  All
other optional elements of the <gi>fileDesc</gi> should be omitted from
an individual corpus text header unless they differ from those
specified in the corpus header.  All other header elements behave
identically, in the manner documented below.
This facility makes it possible to state once and for all in the corpus
header each piece of contextual information which is common to the whole
of the corpus, while still allowing for individual texts to vary from
this common denominator.
 </p>
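<p>As a minimal sketch of how titles combine, assume a corpus header and
a text header containing the following title statements (both titles are
invented for illustration, and the other components of each
<gi>fileDesc</gi> are elided):
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><teiCorpus>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical corpus of spoken English</title>
      </titleStmt>
      <!-- remainder of corpus file description -->
    </fileDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Text 1: conversation at breakfast</title>
        </titleStmt>
        <!-- remainder of file description for this text -->
      </fileDesc>
    </teiHeader>
    <text>
      <!-- first corpus text -->
    </text>
  </TEI>
</teiCorpus></egXML>
Under the rule just stated, the title statement of the first text would
be read as the corpus title followed by the title of the text itself.
 </p>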
<p>For example, the following schematic shows the structure of a corpus
comprising three texts, the first and last of which share the same
encoding description. The second one has its own encoding description.
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><teiCorpus>
  <teiHeader>
    <fileDesc><!-- corpus file description--></fileDesc>
    <encodingDesc>
        <!-- default encoding description -->
    </encodingDesc>
    <revisionDesc><!-- corpus revision description --></revisionDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc>
         <!-- file description for this corpus text --> 
       </fileDesc>
    </teiHeader>
    <text>
      <!-- first corpus text -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc>         
        <!-- file description for this corpus text --> 
      </fileDesc>
      <encodingDesc>
        <!-- encoding description for this corpus 
             text, over-riding the default  --> 
      </encodingDesc>
    </teiHeader>
    <text>
      <!-- second corpus text -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc>
        <!-- file description for third corpus text --> 
      </fileDesc>
    </teiHeader>
    <text>      
      <!-- third corpus text -->
    </text>
  </TEI>
</teiCorpus>
</egXML>
 </p></div>
<div type="div3" xml:id="CCAS2"><head>Declarable Elements</head>

<p>Certain of the elements which can appear within a TEI header are
known as <term>declarable elements</term>. These elements have in
common the fact that they may be linked explicitly with a particular
part of a text or corpus by means of a <att>decls</att> attribute on
that element. This linkage is used to over-ride the default
association between declarations in the header and a corpus or corpus
text. The only header elements which may be associated in this way are
those which would not otherwise be meaningfully repeatable.  </p>
<p>Declarable elements are all members of the class <ident type="class">att.declarable</ident>; the corresponding declaring
elements are all members of the class <ident type="class">att.declaring</ident>.
<specList>
<specDesc key="att.declarable" atts="default"/>
<specDesc key="att.declaring" atts="decls"/>
</specList></p>

<p>An alphabetically ordered list of declarable elements follows:

<specList>
<specDesc key="availability"/>
<specDesc key="bibl"/>
<specDesc key="biblFull"/>
<specDesc key="biblStruct"/>
<specDesc key="broadcast"/>
<specDesc key="correction"/>
<specDesc key="editorialDecl"/>
<specDesc key="equipment"/>
<specDesc key="hyphenation"/>
<specDesc key="interpretation"/>
<specDesc key="langUsage"/>
<specDesc key="listBibl"/>
<specDesc key="normalization"/>
<specDesc key="particDesc"/>
<specDesc key="projectDesc"/>
<specDesc key="quotation"/>
<specDesc key="recording"/>
<specDesc key="samplingDecl"/>
<specDesc key="scriptStmt"/>
<specDesc key="segmentation"/>
<specDesc key="sourceDesc"/>
<specDesc key="stdVals"/>
<specDesc key="textClass"/>
<specDesc key="textDesc"/>
<specDesc key="xenoData"/>
</specList>
All of the above elements may be multiply defined within a single
header, that is, there may be more than one instance of any declarable
element type at a given level.  When this occurs, the following rules
apply:
<list rend="bulleted">
<item>every declarable element must bear a unique identifier</item>
<item>for each different type of declarable element which occurs more
than once within the same parent element, exactly one element must be
specified as the default, by means of the <att>default</att> attribute
</item></list>
 </p>
<p>In the following example, an editorial declaration contains two
possible <gi>correction</gi> policies, one identified as
<val>CorPol1</val> and the other as <val>CorPol2</val>.  Since there
are two, one of them (in this case <val>CorPol1</val>) must be
specified as the default: <egXML xmlns="http://www.tei-c.org/ns/Examples"><editorialDecl>
   <correction xml:id="CorPol1" default="true">
      <p> ... </p>
   </correction>
   <correction xml:id="CorPol2">
      <p> ... </p>
   </correction>
   <normalization xml:id="n1">
      <p> ... </p>
      <p> ... </p>
   </normalization>
</editorialDecl></egXML> For texts associated with the header in which
this declaration appears, correction method <val>CorPol1</val> will be
assumed, unless they explicitly state otherwise.  Here is the
structure for a text which does state otherwise: <egXML xmlns="http://www.tei-c.org/ns/Examples"><text>
  <body>
    <div1 n="d1">  </div1>
    <div1 n="d2" decls="#CorPol2">  </div1>
    <div1 n="d3">  </div1>
  </body>
</text></egXML> In this case, the contents of the divisions <val>d1</val> and
<val>d3</val> will both use correction policy <val>CorPol1</val>, and those of
division <val>d2</val> will use correction policy <val>CorPol2</val>.
 </p>
<p>The <att>decls</att> attribute is defined for any element which is a
member of the class <ident type="class">att.declaring</ident>.  This includes the major
structural elements <gi>text</gi>, <gi>group</gi>, and <gi>div</gi>, as
well as smaller structural units, down to the level of paragraphs in
prose, individual utterances in spoken texts, and entries in
dictionaries.  However, TEI recommended practice is to limit the number
of multiple declarable elements used by a document as far as possible,
for simplicity and ease of processing.
 </p>
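<p>The following sketch, which reuses the identifiers <val>CorPol1</val>
and <val>CorPol2</val> from the example above, shows <att>decls</att>
supplied on a division and on a single paragraph within it; the
structure of the division is invented for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><body>
  <div decls="#CorPol2">
    <p><!-- uses CorPol2, as selected on the parent division --></p>
    <p decls="#CorPol1"><!-- this paragraph alone reverts to CorPol1 --></p>
    <p><!-- CorPol2 applies again --></p>
  </div>
</body></egXML>
Associations at this level of detail are permitted, but each one adds to
what processing software must keep track of, which is why sparing use is
recommended.
 </p>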
<p>The identifier or identifiers specified by the <att>decls</att>
attribute are subject to two further restrictions:
<list rend="bulleted">
<item>An identifier specifying an element which contains multiple
instances of one or more other elements should be interpreted as if it
explicitly identified the elements identified as the default in each
such set of repeated elements</item>
<item>Each element specified, explicitly or implicitly, by the list of
identifiers must be of a different kind.</item></list>
 </p>
<p>To demonstrate how these rules operate, we now expand our earlier
example slightly:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<encodingDesc>
  <editorialDecl xml:id="ED1" default="true">
    <correction xml:id="C1A" default="true">   
           <p> ... </p></correction>
    <correction xml:id="C1B">                 
          <p> ... </p></correction>
    <normalization xml:id="N1">
      <p> ... </p>
      <p> ... </p>
    </normalization>
  </editorialDecl>
  <editorialDecl xml:id="ED2">
    <correction xml:id="C2A" default="true">   <p> ... </p></correction>
    <correction xml:id="C2B">                 <p> ... </p></correction>
    <normalization xml:id="N2A">              <p> ... </p></normalization>
    <normalization xml:id="N2B" default="true"><p> ... </p></normalization>
  </editorialDecl>
</encodingDesc></egXML>
 </p>
<p>This encoding description now has two editorial declarations,
identified as <val>ED1</val> (the default) and <val>ED2</val>.  For texts not specifying
otherwise, <val>ED1</val> will apply.  If <val>ED1</val> applies, correction method <val>C1A</val> and
normalization method <val>N1</val> apply, since <val>C1A</val> is the specified default
and <val>N1</val> is the only normalization declared within <val>ED1</val>.  In the same
way, for a text specifying <att>decls</att> as <val>#ED2</val>, correction
<val>C2A</val> and normalization <val>N2B</val> will
apply.
 </p>
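<p>The selection of an editorial declaration might be sketched as
follows, assuming a hypothetical group of two texts associated with the
header shown above:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><group>
  <text>
    <!-- no decls attribute: ED1 applies, and with it C1A and N1 -->
  </text>
  <text decls="#ED2">
    <!-- ED2 applies, and with it its defaults C2A and N2B -->
  </text>
</group></egXML>
 </p>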
<p>A finer-grained approach is also possible.  A text might specify
<tag>text decls='#C2B #N2A'</tag> to <soCalled>mix and match</soCalled> declarations as
required.  A tag such as <tag>text decls='#ED1 #ED2'</tag> would
(obviously) be illegal, since it includes two elements of the same type;
a tag such as <tag>text decls='#ED2 #C1A'</tag> is also illegal, since in
this context <val>#ED2</val> is synonymous with the defaults for that
editorial declaration, namely <val>#C2A #N2B</val>, resulting in a list
that identifies two correction elements (<val>C1A</val> and <val>C2A</val>).
 </p></div>
<div type="div3" xml:id="CCAS3"><head>Summary</head>
<p>The rules determining which of the declarable elements are applicable
at any point may be summarized as follows:
<list rend="numbered">
<item>If there is a single occurrence of a given declarable
element in a corpus header, then it applies by default to all elements
within the corpus.</item>
<item>If there is a single occurrence of a given declarable
element in the text header, then it applies by default to all elements
of that text irrespective of the contents of the corpus header.</item>
<item>Where there are multiple occurrences of declarable elements
within either corpus or text header,
<list rend="bulleted">
<item>each must have a unique value specified as the value
of its <att>xml:id</att> attribute;</item>
<item>exactly one must bear a <att>default</att> attribute with
the value <val>true</val>.</item></list></item>
<item>It is a semantic error for an element to be associated
with more than one occurrence of any declarable element.</item>
<item>Selecting an element which contains multiple occurrences of a
given declarable element is semantically equivalent to selecting only
those contained elements which are specified as defaults.</item>
<item>An association made by one element applies by default
to all of its descendants.
	</item></list>
</p></div></div>
<div type="div2" xml:id="CCAN"><head>Linguistic Annotation of Corpora</head>
<p>Language corpora often include analytic encodings or annotations,
designed to support a variety of different views of language.  The
present Guidelines do not advocate any particular approach to linguistic
annotation (or <soCalled>tagging</soCalled>); instead a number of
general analytic facilities are provided which support the
representation of most forms of annotation in a standard and
self-documenting manner.  Analytic annotation is of importance in many
fields, not only in corpus linguistics, and is therefore discussed in
general terms elsewhere in the
Guidelines.<note place="bottom">See in particular chapters
<ptr target="#SA"/>, <ptr target="#AI"/>, and <ptr target="#FS"/>.</note>
The present section presents informally some particular applications of
these general mechanisms to the specific practice of corpus linguistics.</p>
<div type="div3" xml:id="CCAN1"><head>Levels of Analysis</head>
<p>By <term>linguistic annotation</term> we mean here any annotation
determined by an analysis of linguistic features of the text, excluding
as borderline cases both the formal structural properties of the text
(e.g. its division into chapters or paragraphs) and descriptive
information about its context (the circumstances of its production, its
genre, or medium).  The structural properties of any TEI-conformant text
should be represented using the structural elements discussed elsewhere
in these Guidelines, for example in  chapters <ptr target="#CO"/> and
<ptr target="#DS"/>.
The contextual
properties of a TEI text are fully documented in the TEI header, which
is discussed in chapter <ptr target="#HD"/>, and in section <ptr target="#CCAH"/> of the present chapter.
 </p>
<p>Other forms of linguistic annotation may be applied at a number of
levels in a text.  A code (such as a word-class or part-of-speech
code) may be associated with each word or token, or with groups of such
tokens, which may be continuous, discontinuous, or nested.  A code may
also be associated with relationships (such as cohesion) perceived as
existing between distinct parts of a text.  The codes themselves may
stand for discrete non-decomposable categories, or they may represent
highly articulated bundles of textual features.  Their function may be
to place the annotated part of the text somewhere within a narrowly
linguistic or discoursal domain of analysis, or within a more general
semantic field, or any combination drawn from these and other domains.
 </p>
<p>The manner by which such annotations are generated and attached to
the text may be entirely automatic, entirely manual, or a mixture.  The
ease and accuracy with which analysis may be automated may vary with the
level at which the annotation is attached.  The method employed should
be documented in the <gi>interpretation</gi> element within the encoding
description of the TEI header, as described in section <ptr target="#HD53"/>.  Where different parts of a corpus have used different
annotation methods, the <att>decls</att> attribute may be used to
indicate the fact, as further discussed in section <ptr target="#CCAS"/>.
 </p>
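<p>A minimal sketch of this practice follows; the identifiers
<val>IntAuto</val> and <val>IntChecked</val> and the wording of the two
declarations are invented for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><editorialDecl>
   <interpretation xml:id="IntAuto" default="true">
      <p>Word-class codes were assigned automatically and have not been
      checked.</p>
   </interpretation>
   <interpretation xml:id="IntChecked">
      <p>Word-class codes were assigned automatically and then corrected
      by hand.</p>
   </interpretation>
</editorialDecl></egXML>
A division annotated by the second method would then point to the
appropriate declaration:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><body>
  <div><!-- annotated automatically (the default) --></div>
  <div decls="#IntChecked"><!-- annotated automatically, then corrected --></div>
</body></egXML>
 </p>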
<p>An extended example of one form of linguistic analysis commonly
practised in corpus linguistics is given in section <ptr target="#AILA"/>.
</p>
</div></div>
<div type="div2" xml:id="CCREC"><head>Recommendations for the Encoding of Large Corpora</head>
<p>These Guidelines include proposals for the identification and
encoding of a far greater variety of textual features and
characteristics than is likely to be either feasible or desirable in
any one language corpus, however large and ambitious.  The reasoning
behind this catholic approach is further discussed in chapter <ptr target="#AB"/>.  For most large-scale corpus projects, it will therefore
be necessary to determine a subset of TEI recommended elements
appropriate to the anticipated needs of the project, using the
customization mechanisms discussed in chapter <ptr target="#MD"/>; these mechanisms include
the ability to exclude selected element types, add new element types,
and change the names of existing elements.  A discussion of the
implications of such changes for TEI conformance is provided in
chapter <ptr target="#CF"/>.
 </p>
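<p>A customization of this kind might be sketched as follows.  The schema
name <val>myCorpus</val>, the choice of modules, and the particular
elements excluded or renamed are all assumptions made for illustration,
not recommendations:
<egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible"><schemaSpec ident="myCorpus" start="teiCorpus TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="corpus"/>
  <!-- exclude selected element types from the spoken module -->
  <moduleRef key="spoken" except="kinesic vocal"/>
  <!-- change the name of an existing element -->
  <elementSpec ident="u" module="spoken" mode="change">
    <altIdent>utterance</altIdent>
  </elementSpec>
</schemaSpec></egXML>
 </p>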
<p>Because of the high cost of identifying and encoding many textual
features, and the difficulty in ensuring consistent practice across very
large corpora, encoders may find it convenient to divide the set of
elements to be encoded into the following four categories:
<list type="gloss"><label>required</label>
<item>texts included within the corpus will always
encode textual features in this category, should they exist in the
text.</item><label>recommended</label>
<item>textual features in this category will be
encoded wherever economically and practically feasible; where such
features are present but not encoded, a note to that effect should be
made in the header.</item><label>optional</label>
<item>textual features in this category may or may not
be encoded; no conclusion about the absence of such features can be
inferred from the absence of the corresponding element in a given
text.</item>
<label>proscribed</label>
<item>textual features in this category are deliberately not encoded; they may be
transcribed as unmarked-up text, or represented as <gi>gap</gi>
elements, or silently omitted, as appropriate.</item></list>
</p> 
</div>
<div><head>Module for Language Corpora</head>
<p>The module described in this chapter makes available the following components:

<moduleSpec xml:id="DCCHDR" ident="corpus">
<altIdent type="FPI">Metadata for Language Corpora</altIdent>
<desc>Corpus texts</desc>
<desc xml:lang="fr">Corpus linguistiques</desc>
<desc xml:lang="zh-TW">文集文本</desc>
<desc xml:lang="it">Corpus di testi</desc><desc xml:lang="pt">Textos do corpora</desc><desc xml:lang="ja">コーパスモジュール</desc></moduleSpec><!--publicID: -//TEI
P5//ELEMENTS Additional Element Set for Language Corpora//EN--> The
selection and combination of modules to form a TEI schema is described
in <ptr target="#STIN"/>.</p>
</div>

</div>
