2 What is the TEI?

The Text Encoding Initiative (TEI) is an international cooperative research effort, the goal of which is to define a set of generic Guidelines for the representation of textual materials in electronic form. The project was sponsored and organized by three leading professional associations in the field: the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC) and the Association for Computers and the Humanities (ACH). It has been funded throughout its five years of activities on both sides of the Atlantic: primarily by the US National Endowment for the Humanities and by the European Union 3rd Framework Programme for Linguistic Research and Engineering, but also with grants from the Mellon Foundation and from the Canadian Social Sciences and Humanities Research Council. Of equal significance has been the donation of time and expertise by the many members of the wider research community who have served on the TEI's Working Committees and Working Groups.

As its title suggests, the TEI is strongly interested in text. But this interest is by no means confined to the use of electronic text as a stage in the production of paper documents, and the word text should not be read too literally. The TEI is equally concerned with both textual and non-textual resources in electronic form, whether as constituents of a research database or components of non-paper publications.

Like the publishing industry, the research community has long realized that its stock in hand is not words on the page, but information, independent of any particular physical realization. As technology emerges which is genuinely adequate to the task of integrating text, graphics and audio into a seamless information-bearing vehicle, so the importance of that integrated vision becomes more apparent. By providing a description of information which is independent of realization or media, the TEI scheme, like other SGML-based approaches, enormously facilitates the construction and exploitation of multimedia technology.

In the same way, the texts with which language researchers are concerned are likely to be very heterogeneous. In the construction of language corpora such as the British National Corpus [See note 2], material as diverse as newspapers, books, office memoranda, playscripts, publicity brochures, letters and diaries, transcribed lectures and interviews, TV and radio broadcasts, and unscripted conversations is integrated into a single body of material. Research needs require that this integration be carried out with minimal loss of information and, at the same time, with minimal complexity; in any case, the resulting `text' is far removed from the conventional notion of a printed work.

Electronic texts are most obviously different from printed ones in that the former contain markup or encoding, which makes explicit various features of the text so that they can be efficiently processed. Printed texts adopt a variety of similarly motivated conventions (use of typeface, organization of the carrier medium, etc.), but these are not so readily processable as the tags of a formal markup scheme.
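The difference can be illustrated with a minimal sketch. The fragment below is hypothetical: its element and attribute names are merely in the style of a descriptive markup scheme and are not quoted from the Guidelines, and it is written in XML syntax rather than SGML only so that Python's standard library can parse it. It shows how features that print would signal by typeface alone become directly selectable once they are tagged.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded fragment. The element names (p, title, name) and the
# attributes (rend, type) are illustrative assumptions, not a quotation from
# the Guidelines, and XML syntax stands in for SGML here.
fragment = """
<p>The corpus includes <title rend="italic">The Times</title>,
letters by <name type="person">Jane Smith</name>, and
interviews recorded in <name type="place">Oxford</name>.</p>
""".strip()

root = ET.fromstring(fragment)

# Because each feature is tagged explicitly, it can be selected directly
# instead of being inferred from typography or capitalization.
people = [n.text for n in root.iter("name") if n.get("type") == "person"]
places = [n.text for n in root.iter("name") if n.get("type") == "place"]
titles = [t.text for t in root.iter("title")]

# The plain reading text remains recoverable by discarding the tags
# and normalizing whitespace.
plain = " ".join("".join(root.itertext()).split())

print(people, places, titles)
print(plain)
```

A printed page would distinguish the newspaper title only by italics, leaving a program to guess whether italic type marks a title, emphasis or a foreign word; the tagged version states the distinction outright.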

The goals of the TEI project initially had a dual focus: what textual features should be encoded (i.e. made explicit) in an electronic text, and how that encoding should be represented for loss-free, platform-independent interchange.

Early on in the project, the Standard Generalized Markup Language (SGML; ISO 8879) was chosen as the most appropriate vehicle for the Guidelines, initially on the purely pragmatic grounds that to create a comparably expressive and versatile formal language would be a major research project in itself. In the event, despite some frequently rehearsed inelegancies, SGML has proved entirely adequate to the needs of researchers and, after five years, is still increasing its domination of the software industry, with new product announcements coming every year. The TEI was thus able to focus its efforts on the expression, using SGML, of the set of textual features indicated as its first goal above.

The prime deliverable of the TEI project is a very large number (over 400) of textual feature definitions, expressed as SGML elements and attributes, with associated documentation and examples. These elements are grouped into tag sets of various kinds, as further discussed below, and together constitute a modular scheme which can be configured to provide hardware-, software-, and application-independent support for the encoding of all kinds of text in all languages and of all times.

The TEI tag sets are necessarily based on, but not limited by, existing encoding practices; they are designed to be both comprehensive and extensible. They are collectively documented in a substantial reference manual, the Guidelines for Electronic Text Encoding and Interchange, which appeared in May 1994 after five years of extensive development work. This 1400-page manual is published in both paper and electronic hypertext form and is also available over the Internet in a variety of formats. [See note 3]
