Date:    Mon, 2 Oct 89 17:47 BST
From:    Lou Burnard
To:      U35395@UICVM
Subject: RE: tei-l

[This isn't what you were expecting yet, but I had to get it to Holland urgently so I sent it anyway. Any views? It's meant to be useful e.g. as an all-purpose abstract]

The Text Encoding Initiative: an application of SGML in humanistic research

Lou Burnard
Oxford University Computing Service

This paper describes the work of the Text Encoding Initiative, which aims to develop and promote a set of guidelines for the preparation and interchange of machine-readable texts, both for the use of the scholarly community in research applications and, more generally, for use by the language industries. The paper describes the overall structure, organisation and timescale of the project, outlining the tasks and problem areas currently being addressed within the Initiative's four working committees. It is argued that the TEI's requirements may call into question the adequacy of SGML as a generic descriptive markup system.

Goals of the project

Scholars in the humanities have been analysing literary and other texts by means of computers for over twenty years. Vast corpora of texts have been, and still are being, created solely for the purposes of scholarly analysis, rather than for publication in conventional book form. Yet no simple, universally agreed method of encoding (that is, of representing in computer-tractable form those features of a text other than the words of which it is composed) has yet been found. Instead, each major project has had to rediscover its own solutions to fundamentally similar problems, with consequent wastage of scarce resources.

At a planning conference organised by the ACH at Vassar College in 1987, representatives of over 30 major research organisations and societies agreed a comprehensive set of principles which underlie the current effort. It was proposed to formulate a common interchange format, which would enable the exchange and integration of existing textual resources, and to provide extensible sets of recommendations for the encoding of new texts. The proper suspicion with which the scholarly community regards standardisation was recognised by requiring of such recommendations not only extensibility but also the ability to support different theoretical models. The guidelines should be applicable to texts in any language and, in line with contemporary computing practice, independent of any particular hardware or software. Finally, recognising the reusability of machine-readable texts, the Planning Conference proposed that the Guidelines should as far as possible be application-independent. These last requirements pointed clearly to SGML as the most promising vehicle for the eventual formulation of the Guidelines, although reservations were (and continue to be) expressed as to its suitability.

Structure and timetable

The TEI is jointly funded by the US National Endowment for the Humanities and by the Commission of the European Community, and sponsored by three leading professional societies: the ACH, the ACL and the ALLC. It is managed by a steering committee, comprising two members from each of these societies, which appoints the members of its four working committees. The committees are responsible for producing tagsets in specified areas, which will then be combined to form the Guidelines by two editors.
Input to the task is also provided by the widest possible consultation with the scholarly community, both informally, through the normal channels of publication and electronic discussion, and more formally, by affiliation with major research projects currently engaged in the preparation of electronic texts. An Advisory Board, comprising representatives of 15 leading professional and learned associations, will review and endorse the publication of the initial draft Guidelines, which should be completed by June 1990. Funding is currently being sought for a second two-year cycle in which the draft Guidelines will be reviewed and extended, with a view to eventual progression to international standardisation.

Working Committees

The real work of the TEI is carried on in its working committees and the work groups these have organised. There are four committees: one on Text Documentation, one on Text Representation, one on Text Analysis and Interpretation, and one on Metalanguage and Syntax.

The purpose of the Text Documentation Committee is to define an SGML-based tagset adequate to document electronic texts. This tagset will describe the tags used in an electronic text, and also the relationship between the text and its non-electronic source or sources. The Committee is able to take advantage of substantial work already carried out by the library and archive community in establishing bibliographic methods appropriate to machine-readable datasets, and its work should be completed by the end of the first funding cycle.

The Text Representation Committee is charged with the larger task of identifying tagsets for all those structural features of running text for which typographic conventions already exist in printed texts. Workgroups have been set up to consider character-set representations, the structural elements of particular types of text (notably technical, historical, philosophical, legal and literary texts), and those of corpora or representative collections of text samples. It is expected that a set of common or core tags will emerge from this work, together with suggestions for recommended extensions.

By contrast, the Text Analysis and Interpretation Committee addresses structural aspects of text for which no agreed typographic conventions exist. Clearly this is an open-ended task: initially, therefore, the committee will focus on the specific area likely to be of most general benefit, that of identifying features of use in linguistic analyses. Three of its workgroups will define tagsets for use in phonological and phonetic analysis, in lexical and morphological analysis, and in syntactic analysis respectively; a fourth, which has already completed a draft proposal, will address the tagging of natural language dictionaries, and the relationship between such tagsets and those appropriate to tagging the structure of electronic lexica.

The fourth committee, on Metalanguage and Syntax, has a somewhat different role from the others. Rather than proposing tagsets, it is charged with identifying a subset of SGML features appropriate to the needs of the Guidelines, and with assisting in the task of defining formal DTDs derived from the tagsets where these are thought advisable. It will also provide assistance in translating between other existing encoding schemes and that finally proposed by the Guidelines. (Two schematic examples of the kind of tagset and DTD in question are sketched below.)
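By way of illustration only (the element names here are invented for this summary, and are not drawn from any committee draft), a short poem tagged with the kind of core structural tags the Text Representation Committee might propose could look like this:

    <poem id=P1>
    <title>The Sick Rose</title>
    <stanza>
    <line>O Rose thou art sick.</line>
    <line>The invisible worm,</line>
    <line>That flies in the night</line>
    <line>In the howling storm:</line>
    </stanza>
    </poem>

Note that the markup records what each feature is (a title, a stanza, a verse line), not how it happens to be laid out on any particular page.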
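A formal document type definition of the kind the Metalanguage and Syntax Committee would help to derive from such a tagset might then run along these lines (again purely schematic; the inclusion exception for note is added simply to show the mechanism mentioned in the next section):

    <!DOCTYPE poem [
    <!-- a poem is an optional title followed by one or more
         stanzas; the inclusion exception +(note) allows a note
         to appear anywhere within a poem -->
    <!ELEMENT poem   - - (title?, stanza+) +(note) >
    <!ELEMENT title  - - (#PCDATA) >
    <!ELEMENT stanza - - (line+) >
    <!ELEMENT line   - - (#PCDATA) >
    <!ELEMENT note   - - (#PCDATA) >
    <!-- every poem may carry a unique identifier -->
    <!ATTLIST poem   id ID #IMPLIED >
    ]>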
Current problem areas

The advocates of SGML have always argued that one of its greatest strengths is that its use requires the tagging of the intrinsic structural properties of texts rather than of accidental features such as their realisation on paper. The scholar analysing existing texts, however, may need to maintain a double focus: on the text as first realised, and on its underlying structural components. In early printed texts, for example, line divisions and page breaks may have a subtle and intimate effect on lexical features such as spelling, or on interpretation. Furthermore, when analysing verse or drama, there are additional formal structures, such as verse lines or dialogue, which interact unpredictably with syntactic structures. It is clear that any SGML-based markup for these purposes will need to make extensive use of the CONCUR feature (a schematic example is given in the postscript below), but there is as yet very little experience of the effectiveness of this feature outside the world of document publishing.

Another source of concern is the theoretical adequacy of SGML as a formal language. The existence of two parallel formalisms (attribute/value pairs and delimited elements) in the same system offends those who wish for rigour and simplicity above all else, and also makes it very difficult to decide when, for example, an attribute should be used and when it should not (see the second sketch in the postscript). The emphasis of SGML on hierarchic structures is belied by its support for inclusion and exclusion exceptions: this makes it difficult to determine when the subordination of one element to another is semantically significant and when it is not. And SGML provides few straightforward ways of handling the loosely defined, overlapping or discontinuous text elements which tend to be the rule rather than the exception in humanistic analysis of texts.

At present, very little commercially available software takes any notice of the SGML standard, and such as does has been aimed almost exclusively either at the high-end electronic or conventional publishing market or (increasingly) at the lucrative defence market. Interfaces between SGML-tagged texts and existing industry-standard formatters and word-processing systems are easy enough to cobble together, but few of the major players in the academic marketplace have as yet done more than express cautious interest in the prospect; interfaces between SGML-tagged texts and standard database systems, which present more of a challenge, have also barely been addressed. Yet this, surely, is where the real benefits of the TEI's work will be found.

Lou Burnard read English Literature at Oxford and has been working in research applications of computing methods in the humanities for the last ten years. He is founder and Director of the Oxford Text Archive and associate editor of the Text Encoding Initiative. His other responsibilities at Oxford University Computing Service include database and textbase design and construction, on which he has lectured and published extensively.
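P.S. Two schematic sketches of the problems mentioned above may be helpful; in both, the element and document type names are invented for illustration and come from no TEI draft. First, CONCUR: suppose a metrical view (V) and a syntactic view (S) of the same words are declared as two concurrent document types. Each tag is then qualified by the name of the view to which it belongs, and the two hierarchies may overlap freely:

    <(V)line><(S)s>Scorn not the Sonnet;</(S)s>
    <(S)s>Critic, you have frowned,</(V)line>
    <(V)line>Mindless of its just honours;</(S)s>
    <(S)s>with this key</(V)line> ...

The second sentence begins inside the first verse line and ends inside the second: neither hierarchy can be subordinated to the other, which is precisely the situation CONCUR exists to handle, and precisely the situation of which there is so little practical experience.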
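Second, the attribute/element duality: the same information can often be encoded with equal plausibility in either formalism, and SGML itself offers no guidance as to which to prefer:

    <!-- the speaker recorded as an attribute value ... -->
    <speech speaker=HAMLET>To be, or not to be ...</speech>

    <!-- ... or as a subordinate element -->
    <speech><speaker>Hamlet</speaker>To be, or not to be ...</speech>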