Core Elements in Text Representation: A Point of View

David Chesnutt

Scholars in the humanities are most often engaged in preparing monographs, journal articles, editions, and textual materials to be used in their classrooms. While most scholars today create "electronic" versions of these texts, they are accustomed to thinking only of what they must do in their particular word processing programs to achieve the desired format for their text. The levels of ambiguity and inconsistency in these author/editor texts do not seem to be of great concern to most scholars as long as the texts convey the information they wish to present.

Early in this decade, the Association of American Publishers (AAP) recognized the value of electronic manuscripts as a means of lowering typesetting costs. Major textbook publishers were the first to develop in-house systems designed to produce typesetting files. At about the same time, a number of scholarly editions in the States followed suit. Larger university presses did likewise, and many other presses (both private and public) experimented with electronic files produced by their authors. The results were mixed, but out of that experience came the AAP effort to standardize the coding of electronic files via an SGML markup scheme.

By and large, the AAP standard has had little impact to date. Among all the scholarly editions in history and literature that I am familiar with (probably most of the major projects in the States), none uses the AAP tagging scheme. Even among the publishers themselves, AAP has not "caught on." Part of this lack of enthusiasm probably stems from the AAP documentation itself, which is sometimes difficult to follow. But the major flaw of the AAP scheme from an end-user point of view is that it asks too much. It is too particularized--geared too much to the publisher who is the beneficiary of the potential savings; geared too little to the inexperienced academic.
Yet at the same time, the AAP coding does not provide a tagging scheme for one of the most elementary textual elements which commonly occurs in our texts--poetry. (Nor does it provide a tagging scheme for some of the more specialized problems in the publication of scholarly editions--letters and documents of various types, cancellations, interlineations, textual variants. But that's another issue because these features are not common to most texts in the humanities.)

The goals of the TEI are to provide a tagging scheme which will facilitate the interchange of text and which will allow tagging at various levels of sophistication. To accomplish those goals we must provide guidelines which will not intimidate the novice, yet which are comprehensive enough to accommodate the needs of scholars who wish to tag their data for sophisticated presentation or analysis. If the TEI is to gain the kind of widespread acceptance we hope it will, it seems to me that we must provide a simplified "cookbook" for the novice and a comprehensive reference manual for the more sophisticated applications. My concern here is with the cookbook and the ingredients necessary for a palatable stew.

Core Elements

Within any reading text, we tend to discriminate between the elements visually as a means of conveying information to the reader. More often than not, we use the conventions that printers have used for centuries. We center a heading, put it in larger type, make it boldface, separate it with vertical space above and below. In short, we emulate the conventions of the printed page. With an electronic text, we usually have a mix of the old with the new. For example, we might display the words we have marked for emphasis with a different background color instead of an italic font. Regardless of our media or our conventions, the important aspect is simply to mark or tag those elements in a way which others can easily recognize and adapt to their own uses.
The following outline includes some of the most common elements which influence the way we "see" a text.

body text
    titles/headings
    paragraph
    poetry
    lists/tables
    block quotation
        titles/headings
        paragraph
        poetry
        lists/tables
        block quotation
annotation
    titles/headings
    paragraph
    poetry
    lists/tables
    block quotation

Steve DeRose uses the term "text container" to describe some of these elements, which is a useful way of separating them from those elements that occur within the text (e.g., sections of text marked for emphasis, hyphen problems, occurrences of foreign languages, etc.). The internal elements are probably no less critical, but I want to confine my remarks here to the "containers" which seem to be most central or common to texts in the humanities.

One of the problems I see in providing tag sets for these core elements is the recurrence of the same elements in different settings. These elements pop up all over the place and are commonly input by our colleagues without any discrimination as to where they occur in the text. On the other hand, our current textual conventions of presentation in either print or electronic format require discrimination. In a printed work, the elements within the body text are typically set in larger type than those in the annotation. To achieve that effect using the current generation of commercial typesetting software requires explicit tagging. With the exception of paragraphs, all of the other elements have to have unique tag sets. A block quotation in the annotation cannot be tagged with the same tags used for a block quotation in the body text.

While I have no doubt that our colleagues could easily understand the necessity of explicit tagging, I don't think we would win many converts to a tagging scheme of that type. Most would simply throw up their hands. And quite frankly, if my own project did not have the software to insert the explicit tags required for typesetting, we would not be engaged in preparing typesetting files today.
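The burden of explicit tagging can be made concrete with a small sketch in Python. It is only an illustration under stated assumptions: the tag spellings (`text`, `blockquote`, `annotation`, and so on) and the dotted naming are hypothetical stand-ins for the containers in the outline, not tags from the AAP scheme or from any typesetting package.

```python
# Containers named in the outline; the short tag spellings below are
# hypothetical stand-ins, not tags from the AAP or TEI schemes.
ELEMENTS = ("heading", "paragraph", "poetry", "list", "blockquote")

# Each context is the chain of containers an element sits inside.
CONTEXTS = (
    ("text",),
    ("text", "blockquote"),
    ("annotation",),
)


def explicit_tag(context, element):
    """Return the context-specific tag name for one element."""
    if element == "paragraph":
        # Paragraphs are the stated exception: one tag serves every context.
        return "paragraph"
    return ".".join(context + (element,))


def all_explicit_tags(contexts=CONTEXTS, elements=ELEMENTS):
    """List every distinct tag the explicit scheme would need, in order."""
    tags = []
    for context in contexts:
        for element in elements:
            tag = explicit_tag(context, element)
            if tag not in tags:
                tags.append(tag)
    return tags


if __name__ == "__main__":
    tags = all_explicit_tags()
    print(len(ELEMENTS), "simple tags become", len(tags), "explicit tags")
    for tag in tags:
        print(tag)
```

With only three contexts, the five familiar element names grow to thirteen distinct tags, which suggests why a scheme of this kind would win few converts among authors.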
Unfortunately our software is designed only for the kinds of documents we commonly publish. But the principle inherent in our software is one worth thinking about because we use what might be called "context tags" to discriminate between textual elements that commonly occur in different circumstances. The use of context tags could provide a way of simplifying the tagging of core elements. For example, you might have a tagging structure that looked something like the outline below.

text

The supposition here is that the relationship of the subordinate elements to the text within which they occur is the same; hence, the same tags are used as identifiers. In the outline above, the tag functions as a context tag for its five subordinate elements; the first tag becomes in itself a context tag for its elements; and the tag also functions as a context tag. Another way of thinking about how context tags would work is to explode the implicit relationship of the tags.

becomes etc.

However, the explicit relationship does not have to be made unless there is a particular need such as typesetting or some type of textual analysis in which the various elements must be particularized.

I want to turn now to the question of tagging the core elements themselves. Paragraphs and block quotations are less complicated to work with in most instances than headings, tables, and poetry. Headings are generally used in a hierarchical fashion which reflects the organization of the body text. Tables are used to organize information in columnar fashion, with or without headings and labels. Poetry is complex because the format is often an integral part of the poet's presentation of his/her work and must be faithfully represented.

For the author who prepares a book or an article, I think we would be wise to adopt an interchange standard which requires only simple tagging. Let's take the case of titles and headings. From my point of view, the two are slices of the same pie.
Nevertheless, it's useful to call certain headings by what they are:
, etc. Authors can easily relate to these familiar terms. On the other hand, other types of headings do not lend themselves to precise naming. Hence, a descriptive term like , though commonly used in the print publishing world, does not have much meaning to most of our colleagues--especially when their texts contain a wide array of headings. At the authorial level, I think we should ask for no more than a simple tag. And I would make the same recommendation for lists, tables, and poetry. To distinguish between an enumerated list which can be treated as a single-column table and one which needs to be treated as a two-column table requires the kind of expertise that most of our colleagues don't have and probably won't ever be interested enough to develop. Tagging at that level is best left to those who import the texts and have a need for that kind of precision.

Though I have not touched on the core elements which occur within the bounds of these text "containers" because of time constraints, my viewpoint is much the same for the texts prepared by most of our colleagues. Until we have software to embed SGML tag sets that discriminate between italics used for emphasis and those used for book titles, the tagging should be left to those who have more experience as well as a need for that type of discrimination.

But think how far we will have come if we can convince our colleagues that even the minimal tagging of core elements is a worthwhile undertaking. As scholars in the humanities become more involved with electronic texts in the years to come, many will graduate to the levels of tagging we hope to eventually achieve because they will be driven by their own research and teaching needs. But for now, our basic guidelines for the inexperienced should require no more than a tagging of the fundamental elements in a simple and easily understood manner.

David R. Chesnutt
University of South Carolina