Date:    Mon, 2 Oct 89 17:47 BST
From:    Lou Burnard
To:      U35395@UICVM
Subject: RE: tei-l

[This isn't what you were expecting yet, but I had to get it to Holland urgently so I sent it anyway. Any views? It's meant to be useful e.g. as an all-purpose abstract]

The Text Encoding Initiative: an application of SGML in humanistic research

Lou Burnard
Oxford University Computing Service

This paper describes the work of the Text Encoding Initiative, which aims to develop and promote a set of guidelines for the preparation and interchange of machine-readable texts, both for the use of the scholarly community in research applications and, more generally, for use by the language industries. The paper describes the overall structure, organisation and timescale of the project, outlining the tasks and problem areas currently being addressed within the Initiative's four working committees. It is argued that the TEI's requirements may call into question the adequacy of SGML as a generic descriptive markup system.

Goals of the project

Scholars in the humanities have been analysing literary and other texts by means of computers for over twenty years. Vast corpora of texts have been, and still are being, created solely for the purposes of scholarly analysis, rather than for publication in conventional book form. Yet no simple, universally agreed method of encoding (that is, of representing in computer-tractable form those features of a text other than the words of which it is composed) has yet been found. Instead, each major project has had to rediscover its own solutions to fundamentally similar problems, with consequent wastage of scarce resources.

At a planning conference organised by the ACH at Vassar College in 1987, representatives of over 30 major research organisations and societies agreed a comprehensive set of principles which underlie the current effort. It was proposed to formulate a common interchange format, which would enable the exchange and integration of existing textual resources, and to provide extensible sets of recommendations for the encoding of new texts. The proper suspicion with which the scholarly community regards standardisation was recognised by requiring of such recommendations not only extensibility but also the ability to support different theoretical models. The guidelines should be applicable to texts in any language and, in line with contemporary computing practice, independent of any particular hardware or software. Finally, recognising the reusability of machine-readable texts, the Planning Conference proposed that the Guidelines should as far as possible be application-independent. These last requirements pointed clearly to SGML as the most promising vehicle for the eventual formulation of the Guidelines, although reservations were (and continue to be) expressed as to its suitability.

Structure and timetable

The TEI is jointly funded by the US National Endowment for the Humanities and by the Commission of the European Community, and sponsored by three leading professional societies: the ACH, the ACL and the ALLC. It is managed by a steering committee, comprising two members from each of these societies, which appoints the members of its four working committees. The committees are responsible for producing tagsets in specified areas, which will then be combined to form the Guidelines by two editors.
Input to the task is also provided by the widest possible consultation with the scholarly community, both informally, through the normal channels of publication and electronic discussion, and more formally, by affiliation with major research projects currently engaged in the preparation of electronic texts. An Advisory Board, comprising representatives of 15 leading professional and learned associations, will review and endorse the publication of the initial draft Guidelines, which should be completed by June 1990. Funding is currently being sought for a second two-year cycle in which the draft Guidelines will be reviewed and extended, with a view to eventual progression to international standardisation.

Working Committees

The real work of the TEI is carried on in its working committees and the work groups these have organised. There are four committees: one on Text Documentation, one on Text Representation, one on Text Analysis and Interpretation, and one on Metalanguage and Syntax.

The purpose of the Text Documentation Committee is to define an SGML-based tagset adequate to document electronic texts. This tagset will describe the tags used in an electronic text, and also the relationship between the text and its non-electronic source or sources. The Committee is able to take advantage of substantial work already carried out by the library and archive community in establishing bibliographic methods appropriate to machine-readable datasets, and its work should be completed by the end of the first funding cycle.

The Text Representation Committee is charged with the larger task of identifying tagsets for all those structural features of running text for which typographic conventions already exist in printed texts. Workgroups have been set up to consider character-set representations, the structural elements of particular types of text (notably technical, historical, philosophical, legal and literary texts), and those of corpora or representative collections of text samples. It is expected that a set of common or core tags will emerge from this work, together with suggestions for recommended extensions.

By contrast, the Text Analysis and Interpretation Committee addresses structural aspects of text for which no agreed typographic conventions exist. Clearly this is an open-ended task: initially, therefore, the committee will focus on the specific area likely to be of most general benefit, that of identifying features of use in linguistic analyses. Three of its workgroups will define tagsets for use in phonological and phonetic analysis, in lexical and morphological analysis, and in syntactic analysis respectively; a fourth, which has already completed a draft proposal, will address the tagging of natural language dictionaries, and the relationship between such tagsets and those appropriate to tagging the structure of electronic lexica.

The fourth committee, on Metalanguage and Syntax, has a somewhat different role from the others. Rather than proposing tagsets, it is charged with identifying a subset of SGML features appropriate to the needs of the Guidelines, and with assisting in the task of defining formal DTDs derived from the tagsets where these are thought advisable. It will also provide assistance in translating between other existing encoding schemes and that finally proposed by the Guidelines. (Two schematic examples of the kind of tagset and DTD in question are sketched below.)
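By way of illustration only (the element names here are invented for this summary, and are not drawn from any committee draft), a short poem tagged with the kind of core structural tags the Text Representation Committee might propose could look like this:

    <poem id=P1>
    <title>The Sick Rose</title>
    <stanza>
    <line>O Rose thou art sick.</line>
    <line>The invisible worm,</line>
    <line>That flies in the night</line>
    <line>In the howling storm:</line>
    </stanza>
    </poem>

Note that the markup records what each feature is (a title, a stanza, a verse line), not how it happens to be laid out on any particular page.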
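A formal document type definition of the kind the Metalanguage and Syntax Committee would help to derive from such a tagset might then run along these lines (again purely schematic; the inclusion exception for note is added simply to show the mechanism mentioned in the next section):

    <!DOCTYPE poem [
    <!-- a poem is an optional title followed by one or more
         stanzas; the inclusion exception +(note) allows a note
         to appear anywhere within a poem -->
    <!ELEMENT poem   - - (title?, stanza+) +(note) >
    <!ELEMENT title  - - (#PCDATA) >
    <!ELEMENT stanza - - (line+) >
    <!ELEMENT line   - - (#PCDATA) >
    <!ELEMENT note   - - (#PCDATA) >
    <!-- every poem may carry a unique identifier -->
    <!ATTLIST poem   id ID #IMPLIED >
    ]>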
Current problem areas

The advocates of SGML have always argued that one of its greatest strengths is that its use requires the tagging of the intrinsic structural properties of texts rather than of accidental features such as their realisation on paper. The scholar analysing existing texts, however, may need to maintain a double focus: on the text as first realised, and on its underlying structural components. In early printed texts, for example, line divisions and page breaks may have a subtle and intimate effect on lexical features such as spelling, or on interpretation. Furthermore, when analysing verse or drama, there are additional formal structures, such as verse lines or dialogue, which interact unpredictably with syntactic structures. It is clear that any SGML-based markup for these purposes will need to make extensive use of the CONCUR feature (a schematic example is given in the postscript below), but there is as yet very little experience of the effectiveness of this feature outside the world of document publishing.

Another source of concern is the theoretical adequacy of SGML as a formal language. The existence of two parallel formalisms (attribute/value pairs and delimited elements) in the same system offends those who wish for rigour and simplicity above all else, and also makes it very difficult to decide when, for example, an attribute should be used and when it should not (see the second sketch in the postscript). The emphasis of SGML on hierarchic structures is belied by its support for inclusion and exclusion exceptions: this makes it difficult to determine when the subordination of one element to another is semantically significant and when it is not. And SGML provides few straightforward ways of handling the loosely defined, overlapping or discontinuous text elements which tend to be the rule rather than the exception in humanistic analysis of texts.

At present, very little commercially available software takes any notice of the SGML standard, and such as does has been aimed almost exclusively either at the high-end electronic or conventional publishing market or (increasingly) at the lucrative defence market. Interfaces between SGML-tagged texts and existing industry-standard formatters and word-processing systems are easy enough to cobble together, but few of the major players in the academic marketplace have as yet done more than express cautious interest in the prospect; interfaces between SGML-tagged texts and standard database systems, which present more of a challenge, have also barely been addressed. Yet this, surely, is where the real benefits of the TEI's work will be found.

Lou Burnard read English Literature at Oxford and has been working in research applications of computing methods in the humanities for the last ten years. He is founder and Director of the Oxford Text Archive and associate editor of the Text Encoding Initiative. His other responsibilities at Oxford University Computing Service include database and textbase design and construction, on which he has lectured and published extensively.
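P.S. Two schematic sketches of the problems mentioned above may be helpful; in both, the element and document type names are invented for illustration and come from no TEI draft. First, CONCUR: suppose a metrical view (V) and a syntactic view (S) of the same words are declared as two concurrent document types. Each tag is then qualified by the name of the view to which it belongs, and the two hierarchies may overlap freely:

    <(V)line><(S)s>Scorn not the Sonnet;</(S)s>
    <(S)s>Critic, you have frowned,</(V)line>
    <(V)line>Mindless of its just honours;</(S)s>
    <(S)s>with this key</(V)line> ...

The second sentence begins inside the first verse line and ends inside the second: neither hierarchy can be subordinated to the other, which is precisely the situation CONCUR exists to handle, and precisely the situation of which there is so little practical experience.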
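Second, the attribute/element duality: the same information can often be encoded with equal plausibility in either formalism, and SGML itself offers no guidance as to which to prefer:

    <!-- the speaker recorded as an attribute value ... -->
    <speech speaker=HAMLET>To be, or not to be ...</speech>

    <!-- ... or as a subordinate element -->
    <speech><speaker>Hamlet</speaker>To be, or not to be ...</speech>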