Part 5

Auxiliary Document Types

24 The Independent Header

Many libraries, text repositories, research sites and related institutions collect bibliographic and documentary information about machine-readable texts without necessarily collecting the texts themselves. Such institutions may thus want access to the header of a TEI document without its attached text in order to build catalogues, indexes and databases that can be used by people to locate relevant texts at remote locations, obtain full documentation about those texts, and learn how to obtain them. This chapter of the Guidelines describes a set of practices by which the headers of TEI documents can be extracted from those documents and exchanged as freestanding independent TEI documents. Headers exchanged independently of the documents they describe are called independent headers.

This chapter outlines practices recommended for encoders (especially those responsible for the documentation of text) when creating independent headers to be distributed, and specifies the set of recommended elements that should be included in the independent header. Of interest to librarian cataloguers who may receive independent headers from remote sites, it also discusses the relationship between the elements of TEI headers and MARC tags, in order to facilitate the cataloguing of these headers or the loading of independent headers into local MARC-based bibliographic databases. This chapter does not describe how to create a header. Guidance on the creation of headers and descriptions of each element in the header can be found in chapter 5.

24.1 Definition and Principles for Encoders

An independent header is a header extracted from a TEI text that can be exchanged as an independent document between libraries, archives, collections, projects, and individuals. The file description of the independent header (enclosed by the <fileDesc> element) can be used to generate bibliographic records. The profile description, encoding description, and revision history (encoded by the <profileDesc>, <encodingDesc>, and <revisionDesc> elements) can form part of a bibliographic description or, more appropriately, be used as an attached `codebook' for full documentation of the analysis of the text and how it was encoded. Thus, the independent header can serve as the primary means by which libraries, archives, related repositories, research projects, and individual researchers can obtain bibliographic, descriptive, and full documentary information on machine-readable texts that reside in remote locations.

The structure of an independent header is exactly the same as that of a <teiHeader> attached to a document, and can therefore be validated using the same document type definition (DTD). In practice, this means that a <teiHeader> and its DTD can be extracted from a TEI document and shipped to a receiving institution with little or no change. However, some fields that are listed as ``optional'' in the header are listed as ``recommended'' for the independent header. For this reason, this chapter should be consulted in connection with any plan to send headers as independent documents.

When deciding which information to include in the independent header, and the format or structure of that information, the following should be kept in mind:

The independent header should provide full bibliographic information on the encoded text, its source, where the text can be located, and any restrictions governing its use.

The independent header should contain useful information about the encoding of the text itself. In this regard, it is highly recommended that the encoding description be as complete as possible. The Guidelines do not require that the encoding description be included in the header (since some simple transcriptions of small items may not require it), but in practice the use of a header without an encoding description would be severely limited.

The independent header should be amenable to automatic processing, particularly for loading into databases and for the creation of publications, indexes, and finding aids, without undue editorial intervention on the part of the receiving institution. For this reason, two recommendations are made regarding the format or structure of the header: first, where there is a choice between a prose content model and one that contains a formal series of specialized elements, wherever possible and appropriate the specialized elements should be preferred to unstructured prose. For instance, the source description can contain either a free-prose citation (tagged <bibl> or even <p>) or a <biblStruct> element, which provides a more rigorous structure for the bibliographic information (see examples in section 6.10). The more structured <biblStruct> element is more suitable for automatic processing, and is therefore recommended over the less structured alternatives whenever the header is to be exchanged as an independent header. Second, with respect to corpora, information about each of the texts within a corpus should be included in the overall corpus-level <teiHeader>. That is, source information, editorial practices, encoding descriptions, and the like should be included in the relevant sections of the corpus <teiHeader>, with pointers to them from the headers of the individual texts included in the corpus. There are three reasons for this recommendation: first, the corpus-level header will contain the full array of bibliographic and documentary information for each of the texts in a corpus, and thus be of great benefit to remote users, who may have access only to the independent header; second, such a layout is easier for the coder to maintain than searching for information throughout a text; and third, generally speaking, this practice results in greater overall consistency, especially with respect to bibliographic citations.
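The first of these recommendations can be checked mechanically before a header is sent out. The following Python sketch is an illustration only: it assumes the header has already been normalized to well-formed XML (no SGML tag minimization), and the function name `unstructured_sources` is hypothetical. It reports the free-prose children (<bibl> or <p>) found directly inside any <sourceDesc>, which an encoder might then consider recasting as <biblStruct>.

```python
import xml.etree.ElementTree as ET

# Free-prose alternatives to <biblStruct> named in the text above.
FREE_PROSE_TAGS = ("bibl", "p")

def unstructured_sources(header_xml):
    """Return the tag names of free-prose children of each <sourceDesc>.

    Hypothetical helper: assumes the header is well-formed XML.
    """
    header = ET.fromstring(header_xml)
    flagged = []
    for source_desc in header.iter("sourceDesc"):
        for child in source_desc:
            if child.tag in FREE_PROSE_TAGS:
                flagged.append(child.tag)
    return flagged

sample = """<teiHeader><fileDesc>
  <sourceDesc><bibl>A free-prose citation</bibl></sourceDesc>
</fileDesc></teiHeader>"""
print(unstructured_sources(sample))
```

An empty list means that no free-prose citation was found; it does not by itself show that the header is complete.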

24.2 Required and Recommended Tags

The richness and size of the header reflect the diversity of uses to which electronic texts conforming to these Guidelines will be put. It is not intended, however, that all of the elements recommended in this chapter be present in every header. As described in section 5.6, the TEI header allows for the provision of a very large amount of information concerning the text itself, its source, encodings, and revisions as well as detailed descriptive information that can be used by researchers in analysing the text. The amount of encoding will depend on the nature and intended use of the text. At one extreme, an encoder may expect that the header will only provide bibliographic information about the text adequate to local needs. At the other, wishing to ensure that their texts can be used for the widest range of applications, encoders will want to document as explicitly as possible both bibliographic and descriptive information in such a way that no prior or ancillary knowledge about the text is needed in order to process it. The header, in the latter case, will be very full, approximating the kind of documentation often supplied in the form of a manual. Most texts will lie somewhere between these extremes; textual corpora in particular will tend toward the latter extreme.

The following is a list of the components of the header, in the order in which they are presented in chapter 5, together with an indication of their importance in constructing an independent header.

24.3 Header Elements and their Relationship to the MARC Record

This section offers some guidance to both cataloguers and bibliographic analysts who want to load TEI independent headers into a MARC-based retrieval system. Because there are variations in cataloguing practice across local sites, among bibliographic utilities (such as OCLC and RLIN), and differences in MARC usage in different countries, only tentative advice is possible. Note that the following examples are based on USMARC, not UNIMARC. [see note 125] UNIMARC offers cataloguers in different countries the opportunity to combine different national practices in a single MARC format, and is the preferred variety of MARC records for distribution across national boundaries. The implementation of UNIMARC, however, will be affected by local practice and by guidelines offered by the bibliographic utilities. Though UNIMARC is a stable format, the guidelines for its implementation are not sufficiently known or stabilized to be included in this chapter.

There are some major differences between the MARC record and the TEI header that will cause problems for librarians trying to map from the TEI independent header to the MARC record. The most important difference between the MARC record and the TEI header is the function of each. Despite the efforts and claims of some members of the library community, the MARC record remains fundamentally an electronic version of the catalogue card, with the limitations of its model. [see note 126] The catalogue card is a unitary record for a physical object containing complex bibliographic data of varying sorts. The catalogue card points to the physical object. The TEI header provides full bibliographic information (as would a card), as well as documentary non-bibliographic information that supports the analysis, either by humans or machines, of the electronic text documented by the header. Most of this analytical information, which is found in the profile description, encoding description, and revision history, has little direct provision for it in the MARC record, and if retained must be recorded as unstructured notes (5XX) fields. Notes fields usually do not have the structure to support machine retrieval and analysis, while properly formatted profile, encoding, and revision descriptions lend themselves to retrieval, can support machine processing (including analysis), and point directly to the electronic text attached to the header. Moreover, the electronic text points back to the relevant elements in the header.

Though this chapter offers some advice on where the profile, encoding, and revision descriptions might go in a MARC record, for practical reasons a repository might want to create a codebook from these divisions of the header, and create a MARC record from the file description only. The MARC record should contain a reference to the codebook.
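The division of labour just described (file description to the MARC record, everything else to a codebook) might be sketched as follows. This Python fragment is an illustration under stated assumptions: the header is well-formed XML, the function name `split_header` is hypothetical, and the <codebook> wrapper is an invented container for the example, not a TEI element.

```python
import xml.etree.ElementTree as ET

# Divisions of the header that the text suggests moving to a codebook.
CODEBOOK_PARTS = ("encodingDesc", "profileDesc", "revisionDesc")

def split_header(header_xml):
    """Return (file description, codebook) as two XML strings."""
    header = ET.fromstring(header_xml)
    file_desc = header.find("fileDesc")
    if file_desc is None:
        raise ValueError("header has no <fileDesc>")
    # Hypothetical wrapper element; not part of the TEI scheme.
    codebook = ET.Element("codebook")
    for name in CODEBOOK_PARTS:
        part = header.find(name)
        if part is not None:
            codebook.append(part)
    return (ET.tostring(file_desc, encoding="unicode"),
            ET.tostring(codebook, encoding="unicode"))
```

The first string would feed the MARC mapping discussed in the next section; the second could be stored separately and cited from the MARC record.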

Subfields (or delimiters) are conventionally indicated by the dollar sign.

24.4 MARC Fields for the File Description

Note that there is no provision for the `Main Entry' (or USMARC 1XX fields) in the TEI header. The main entry should be constructed, using appropriate name authority control, by the cataloguer from information derived from the header that indicates who is primarily responsible for the intellectual content of the work. There is an <author> tag, but the form of the name will have to be checked by a cataloguer before the main entry is constructed.

<titleStmt> corresponds to title and statement of responsibility fields in MARC, typically 240 (for uniform title) and 245 (for title proper).
<title> 240 $a (for uniform titles) or 245 $a fields. Put any subtitles in 24X $b. Insert the constant, ``[computer file]'' in the 24X $h gmd subfield.

The following elements belong in the 245 $c subfield (statement of responsibility):

<sponsor>
<funder>
<principal>

Example:

<titleStmt>
  <title>Two stories by Edgar Allan Poe: electronic
         version</title>
  <author>Poe, Edgar Allan (1809-1849)</author>
  <respStmt><resp>compiled by</resp>
  <name>James D. Benson</name></respStmt>
</titleStmt>

This might be tagged in MARC as:

245 Two stories by Edgar Allan Poe :$belectronic version ;
    compiled by $cJames D. Benson.
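A mapping of this kind could be automated roughly as follows. The Python sketch below is not a definitive conversion: it assumes a well-formed XML <titleStmt>, treats the text after the first colon in <title> as the subtitle (an assumption that will not hold for every title), handles only the first <respStmt>, and copies the punctuation of the example above, which local cataloguing practice may well replace. The names `squeeze` and `title_to_245` are hypothetical.

```python
import xml.etree.ElementTree as ET

def squeeze(text):
    """Collapse runs of whitespace (including line breaks) to single spaces."""
    return " ".join((text or "").split())

def title_to_245(title_stmt_xml):
    """Build a tentative 245 field from a <titleStmt> (hypothetical helper)."""
    stmt = ET.fromstring(title_stmt_xml)
    title = squeeze(stmt.findtext("title"))
    # Assumption: everything after the first colon is a subtitle for $b.
    main, colon, sub = title.partition(":")
    field = "245 " + main.strip()
    if colon:
        field += " :$b" + sub.strip()
    # Only the first <respStmt> is used in this sketch.
    resp_stmt = stmt.find("respStmt")
    if resp_stmt is not None:
        role = squeeze(resp_stmt.findtext("resp"))
        name = squeeze(resp_stmt.findtext("name"))
        field += " ; " + role + " $c" + name
    return field + "."

sample = """<titleStmt>
  <title>A sample text: electronic
         version</title>
  <respStmt><resp>compiled by</resp><name>J. Doe</name></respStmt>
</titleStmt>"""
print(title_to_245(sample))
```

The result still needs a cataloguer's review, in particular for the main entry and the form of any names.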

<edition> 250 $a
<name> 250 $b

Example:

<editionStmt><edition>
    Student's edition,
    <date>June 1987</date>
</edition><respStmt>
    <resp>New annotation by</resp>
    <name>George Brown</name>
</respStmt></editionStmt>

This might be tagged in MARC as:

250  $aStudent's edition, June 1987, new annotation by
     $bGeorge Brown.

<extent> The extent is analogous to the `Physical Description' MARC field. Fields 256 or 3XX, depending on local practice, are appropriate.
<date> 260 $c, and appropriate fixed fields.
<publisher>, <distributor>, or <authority> 260 $b
<pubPlace> 260 $a

Example:

<publicationStmt>
  <publisher>Columbia University Press</publisher>
  <pubPlace>New York</pubPlace>
  <date>1993</date>
</publicationStmt>

This may be tagged in MARC as:

260 $aNew York :$bColumbia University Press, $c1993.
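The mapping from <publicationStmt> to the 260 field is regular enough to sketch in code. The Python fragment below is an illustration, assuming a well-formed XML <publicationStmt> in which the place, agency, and date are all present; the function name `publication_to_260` is hypothetical, and the punctuation simply follows the example above.

```python
import xml.etree.ElementTree as ET

def publication_to_260(pub_stmt_xml):
    """Build a tentative 260 field from a <publicationStmt>."""
    stmt = ET.fromstring(pub_stmt_xml)
    place = (stmt.findtext("pubPlace") or "").strip()
    # <publisher>, <distributor>, and <authority> all map to $b;
    # the first one found is used.
    agency = ""
    for tag in ("publisher", "distributor", "authority"):
        agency = (stmt.findtext(tag) or "").strip()
        if agency:
            break
    date = (stmt.findtext("date") or "").strip()
    return f"260 $a{place} :$b{agency}, $c{date}."

sample = """<publicationStmt>
  <publisher>Columbia University Press</publisher>
  <pubPlace>New York</pubPlace>
  <date>1993</date>
</publicationStmt>"""
print(publication_to_260(sample))
# prints: 260 $aNew York :$bColumbia University Press, $c1993.
```

The fixed fields mentioned above for <date> are not handled here.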

Local practice will determine appropriate MARC fields for <address>, <idno>, and <availability>. Restrictions on access should normally be placed in the 506 field, while the place where an item may be ordered will be located in a local notes (590) field. If local practice warrants it, the address of the publisher should be indicated in the 260 field.

The series <title> and the <idno> should be placed in the appropriate 490 fields (series untraced), if series authority checking needs to be done. Further, because the TEI tags do not differentiate between name, conference, or title series, there is no simple mechanical method for determining which MARC tag (410, 411, etc.) should be used. Safe practice would be to load any series statements into 490 fields, and then to conduct authority work on those fields.

<notesStmt> These are usually reserved for general notes (500) fields.

The <sourceDesc> can be mapped to a `source of data' note (537 in RLIN MDF format) with the print constant ``Transcribed from:'' at the beginning of the note. The <biblStruct> itself can be mapped onto a 581 field (note on primary publication) using the ISBD format to separate each data element.

The <scriptStmt>, <recordingStmt>, <recording>, <equipment>, and <broadcast> elements do not easily map to existing MARC fields, and should be put into a local notes field (590), treating the TEI tag introducing each component as a print constant at the head of the field in order to facilitate future local processing and retrieval. Example:

<scriptStmt id=CNN12>
  <bibl><author>CNN Network News</author>
    <title>News Headlines</title>
    <date>12 Jun 1991</date>
  </bibl>
</scriptStmt>

This may be tagged in MARC thus:

590  <scriptStmt id=CNN12>
     <bibl><author>CNN Network News</author>
     <title>News Headlines</title>
     <date>12 Jun 1991</date></bibl>
     </scriptStmt>

Example:

<recordingStmt>
   <recording type=video dur="10 mins">
   <equipment><p>Recorded from FM radio to chrome
       tape</p></equipment>
   <broadcast>
      <bibl>
         <title>Britain's pleasure parade</title>
         <author>BBC Radio 4 FM</author>
         <editor role=interviewer>Robin Day</editor>
         <editor role=interviewee>Margaret Thatcher</editor>
         <series><title>The World Tonight</></series>
         <date>27 Nov 89</date>
       </bibl>
    </broadcast>
    </recording>
</recordingStmt>

This can be tagged in MARC as:

590 <recordingStmt>
    <recording type=video dur="10 mins">
    <equipment><p>Recorded from FM radio to chrome
         tape</p></equipment><broadcast>
    <bibl><title>Britain's pleasure parade</title>
    <author>BBC Radio 4 FM</author>
    <editor role=interviewer>Robin Day</editor>
    <editor role=interviewee>Margaret Thatcher</editor>
    <series><title>The World Tonight</></series>
    <date>27 Nov 89</date>
    </bibl>
    </broadcast>
    </recording>
    </recordingStmt>

24.5 MARC Fields for the Encoding Description

The <encodingDesc> element provides useful information documenting the relationship between an electronic text and the source or sources from which it was derived. The <projectDesc>, <samplingDecl>, <editorialDecl>, and <refsDecl> elements record the decisions taken, and the rationales for them, concerning the process and purposes of the project, how text was sampled, principles of editorial practice, and how canonical references are constructed. The 567 field (notes on methodology) appears to be the most appropriate for this sort of information, though this field is normally intended for methodologies characterizing the social sciences. Practically, it would be wise to transcribe the <projectDesc>, <editorialDecl>, <refsDecl>, and <classDecl> elements directly as one or more 567 fields without intervention, with the element name at the beginning of each field, and any TEI tags left intact. This may facilitate the use of any locally developed retrieval software.

Example:

<encodingDesc>
  <projectDesc><p>Texts were collected to illustrate the
    full range of twentieth-century spoken and written Swedish,
    written by native Swedish authors.</p></projectDesc>
  <samplingDecl><p>Sample of 2000 words taken from the
    beginning of the text.</p></samplingDecl>
  <editorialDecl>
       <interpretation><p>Errors in transcription controlled
         by using the SUC spell checker, v.2.4</p></interpretation>
  </editorialDecl>
</encodingDesc>

This may be tagged in MARC as:

567  <projectDesc><p>Texts were collected to illustrate the
      full range of twentieth-century spoken and written
      Swedish, written by native Swedish authors.</p>
     </projectDesc>
567  <samplingDecl><p>Sample of 2000 words taken from the
     beginning of the text.</p></samplingDecl>
567  <editorialDecl>
     <interpretation><p>Errors in transcription controlled
      by using the SUC spell checker, v.2.4</p>
     </interpretation>
     </editorialDecl>
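The transcription shown above, one 567 field per child of <encodingDesc> with the TEI tags left intact, can be sketched in Python. This is an illustration only: it assumes the encoding description is well-formed XML, the function name `encoding_to_567` is hypothetical, and line breaks inside each element are collapsed to single spaces rather than preserved.

```python
import xml.etree.ElementTree as ET

def encoding_to_567(encoding_desc_xml):
    """Return one 567 field per child element of <encodingDesc>."""
    desc = ET.fromstring(encoding_desc_xml)
    fields = []
    for child in desc:
        # Serialize the child with its tags intact, then collapse whitespace.
        body = " ".join(ET.tostring(child, encoding="unicode").split())
        fields.append("567 " + body)
    return fields

sample = """<encodingDesc>
  <projectDesc><p>Texts were collected.</p></projectDesc>
  <samplingDecl><p>Sample of 2000 words.</p></samplingDecl>
</encodingDesc>"""
for field in encoding_to_567(sample):
    print(field)
```

The same approach would serve for the locally defined notes fields (59X) suggested elsewhere in this chapter, by changing the field tag.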

24.6 MARC Fields for the Profile Description

The profile description is the most problematic element in the TEI header for librarian cataloguers, because it provides a detailed description of the non-bibliographic aspects of the text, specifically the languages and sublanguages used, the situation in which it was produced, and the participants and their setting. This information can be used for retrieval purposes or in machine-supported analysis of the text. The information can be loaded into a separate `codebook' and referenced by the MARC record. Little guidance can be offered on the appropriate MARC location for the elements that make up the profile description, except to suggest that if a site wants to load the profile description into a MARC record for archival and possibly retrieval purposes, then the contents of the profile description may be mapped into a locally-defined notes field (59X) with its TEI tags intact, as in the examples above.

24.7 MARC Fields for the Revision Description

The revision history (<revisionDesc>) logs all changes to a machine-readable file, whether or not these constitute a new edition of the file. Aside from the edition area of the MARC record, there are no MARC fields that deal specifically with changes of this sort. This information might be best included in a `codebook', rather than a MARC record. As before, the simplest way of approaching this problem is to include the material with its TEI tags intact as a locally-defined note (59X) in order to support future local processing.

24.8 Structure of the DTD for Independent Headers

The following document type definition is provided in file teishd2.dtd and constitutes the auxiliary DTD for independent headers as described in this chapter.

<!-- 24.8: File teishd2.dtd: Auxiliary DTD for Independent    -->
<!-- Header                                                   -->
<!-- Text Encoding Initiative: Guidelines for Electronic      -->
<!-- Text Encoding and Interchange. Document TEI P3, 1994.    -->

<!-- Copyright (c) 1994 ACH, ACL, ALLC. Permission to copy    -->
<!-- in any form is granted, provided this notice is          -->
<!-- included in all copies.                                  -->

<!-- These materials may not be altered; modifications to     -->
<!-- these DTDs should be performed as specified in the       -->
<!-- Guidelines in chapter "Modifying the TEI DTD."           -->

<!-- These materials subject to revision. Current versions    -->
<!-- are available from the Text Encoding Initiative.         -->

<!-- Embed entities for TEI generic identifiers.              -->

<!ENTITY % TEI.elementNames system 'teigis2.ent'                >
%TEI.elementNames;

<!-- Embed entities for TEI keywords.                         -->

<!ENTITY % TEI.keywords.ent system 'teikey2.ent'                >
%TEI.keywords.ent;

<!-- Define element classes for content models, shared        -->
<!-- attributes for element classes, and global attributes.   -->
<!-- (This all happens within the file teiclas2.ent.)         -->

<!ENTITY % TEI.elementClasses system 'teiclas2.ent'             >
%TEI.elementClasses;

<!-- Now declare the IHS element.                             -->

<!ELEMENT ihs           - O  (teiHeader+)                       >
<!ATTLIST ihs                %a.global;                         >

<!-- Finally, embed the TEI header and core tag sets.         -->

<!ENTITY % TEI.header.dtd system 'teihdr2.dtd'                  >
%TEI.header.dtd;
<!ENTITY % TEI.core.dtd system 'teicore2.dtd'                   >
%TEI.core.dtd;

The overall structure of a set of independent headers, encoded for interchange as a group, is thus:

<!DOCTYPE ihs PUBLIC "-//TEI P3//DTD Auxiliary Document Type:
        Independent TEI Header//EN"
       "teishd2.dtd">
<ihs>
<teiHeader>
  <fileDesc> ...     </fileDesc>
  <encodingDesc> ... </encodingDesc>
  <profileDesc> ...  </profileDesc>
  <revisionDesc> ... </revisionDesc>
</teiHeader>
<teiHeader>
  <fileDesc> ...     </fileDesc>
  <encodingDesc> ... </encodingDesc>
  <profileDesc> ...  </profileDesc>
  <revisionDesc> ... </revisionDesc>
</teiHeader>
<teiHeader> ... </teiHeader>
<!-- ... etc. -->
</ihs>

In practice, headers might be stored in separate operating system files, to reduce redundant storage requirements; in this case, the top-level file for a typical document might have the following structure:

<!DOCTYPE tei.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN"
                     "tei2.dtd" [
  <!ENTITY txt01 system 'text01.tei' >
  <!ENTITY hdr01 system 'text01.hdr' >
]>
<tei.2>
&hdr01;
&txt01;
</tei.2>

while that for a set of independent headers might have this structure:

<!DOCTYPE ihs PUBLIC "-//TEI P3//DTD Auxiliary Document Type:
        Independent TEI Header//EN"
       "teishd2.dtd" [
  <!ENTITY hdr01 system 'text01.hdr' >
  <!ENTITY hdr02 system 'text02.hdr' >
  <!ENTITY hdr03 system 'text03.hdr' >
  <!-- ... etc. -->
]>
<ihs>
&hdr01;
&hdr02;
&hdr03;
<!-- etc. -->
</ihs>
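A top-level file of this second kind lends itself to being generated rather than typed by hand. The Python sketch below is an illustration: the function name `ihs_driver` and the entity naming scheme (hdr01, hdr02, and so on, in list order) are assumptions modelled on the example above, and the header file names are supplied by the caller.

```python
def ihs_driver(header_files):
    """Return the text of a top-level file that embeds each header file.

    Hypothetical helper: entity names are generated as hdr01, hdr02, ...
    """
    names = [f"hdr{i:02d}" for i in range(1, len(header_files) + 1)]
    lines = [
        '<!DOCTYPE ihs PUBLIC "-//TEI P3//DTD Auxiliary Document Type:',
        '        Independent TEI Header//EN"',
        '       "teishd2.dtd" [',
    ]
    # One external entity declaration per header file.
    for name, path in zip(names, header_files):
        lines.append(f"  <!ENTITY {name} system '{path}' >")
    lines.append("]>")
    lines.append("<ihs>")
    # One entity reference per declared header.
    lines.extend(f"&{name};" for name in names)
    lines.append("</ihs>")
    return "\n".join(lines) + "\n"

print(ihs_driver(["text01.hdr", "text02.hdr"]))
```

Generating the declarations and the references from the same list keeps the two in step, so that no header is declared without being referenced.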