Core Elements in Text Representation:
A Point of View
 
David Chesnutt
 
    Scholars in the humanities are most often engaged in preparing
monographs, journal articles, editions, and textual materials to be used
in their classrooms.  While most scholars today create "electronic"
versions of these texts, they are accustomed to thinking only of what
they must do in their particular word processing programs to achieve the
desired format for their text.  The levels of ambiguity and
inconsistency in these author/editor texts do not seem to be of great
concern to most scholars as long as the texts convey the information
they wish to present.
    The Association of American Publishers recognized the value of
electronic manuscripts as a means of lowering typesetting costs early in
this decade. Major textbook publishers were the first to develop
in-house systems designed to produce typesetting files.  At about the
same time, a number of scholarly editions in the States did likewise.
Larger university presses followed suit, and many other presses (both
private and public) experimented with electronic files produced by their
authors.  The results were mixed but out of that experience came the AAP
effort to standardize the coding of electronic files via an SGML markup
scheme.
    By and large, the AAP standard has had little impact to date.  Among
all the scholarly editions in history and literature that I am familiar
with (probably most of the major projects in the States), none uses the
AAP tagging scheme.  Even among the publishers themselves, AAP has not
"caught on."  Part of this lack of enthusiasm probably stems from the AAP
documentation itself, which is sometimes difficult to follow.
    But the major flaw of the AAP scheme from an end-user point of view
is that it asks too much.  It is too particularized--geared too much to
the publisher who is the beneficiary of the potential savings; geared
too little to the inexperienced academic.  Yet at the same time, the AAP
coding does not provide a tagging scheme for one of the most elementary
textual elements which commonly occurs in our texts--poetry. (Nor does
it provide a tagging scheme for some of the more specialized problems in
the publication of scholarly editions--letters and documents of various
types, cancellations, interlineations, textual variants.  But that's
another issue because these features are not common to most texts in the
humanities.)
 
    The goals of the TEI are to provide a tagging scheme which will
facilitate the interchange of text and which will allow tagging at
various levels of sophistication.  To accomplish those goals we must
provide guidelines which will not intimidate the novice, yet which are
comprehensive enough to accommodate the needs of scholars who wish to tag
their data for sophisticated presentation or analysis.  If the TEI is to
gain the kind of widespread acceptance we hope it will, it seems to me
that we must provide a simplified "cookbook" for the novice and a
comprehensive reference manual for the more sophisticated applications.
My concern here is with the cookbook and the ingredients necessary for a
palatable stew.
 
Core Elements
    Within any reading text, we tend to discriminate between the
elements visually as a means of conveying information to the reader.
More often than not, we use the conventions that printers have used for
centuries.  We center a heading, put it in larger type, make it
boldface, separate it with vertical space above and below.  In short, we
emulate the conventions of the printed page.  With an electronic text,
we usually have a mix of the old with the new.  For example, we might
display the words we have marked for emphasis with a different
background color instead of an italic font.  Regardless of our media or
our conventions, the important aspect is simply to mark or tag those
elements in a way which others can easily recognize and adapt to their
own uses.
    The following outline includes some of the most common elements
which influence the way we "see" a text.
 
         body text
              titles/headings
              paragraph
              poetry
              lists/tables
              block quotation
                   titles/headings
                   paragraph
                   poetry
                   lists/tables
                   block quotation
         annotation
              titles/headings
              paragraph
              poetry
              lists/tables
              block quotation
 
    Steve DeRose uses the term "text container" to describe some of
these elements, which is a useful way of separating them from those
elements that occur within the text (e.g., sections of text marked for
emphasis, hyphen problems, occurrences of foreign languages, etc.).  The
internal elements are probably no less critical, but I want to confine
my remarks here to the "containers" which seem to be most central or
common to texts in the humanities.
    One of the problems I see in providing tag sets for these core
elements is the recurrence of the same elements in different settings.
These elements pop up all over the place and are commonly input by our
colleagues without any discrimination as to where they occur in the
text.  On the other hand, our current textual conventions of
presentation in either print or electronic format require
discrimination.  In a printed work, the elements within the body text
are typically set in larger type than those in the annotation.  To
achieve that effect using the current generation of commercial
typesetting software requires explicit tagging.  With the exception of
paragraphs, all of the other elements have to have unique tag sets.  A
block quotation in the annotation cannot be tagged with the same tags
used for a block quotation in the body text.
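
    The need for unique tag sets can be pictured as a lookup table in
which every combination of setting and element has its own entry.  The
sketch below is only an illustration: the tag names and the type sizes
are hypothetical, and no particular typesetting system is implied.

```python
# Hypothetical type sizes in points, one entry per explicit tag.
# The values are invented for illustration only.
TYPE_SIZE = {
    "body-head": 14,
    "body-blkquote": 9,
    "notes-head": 10,
    "notes-blkquote": 7,
}


def type_size(explicit_tag):
    """Return the type size for one explicitly tagged element."""
    try:
        return TYPE_SIZE[explicit_tag]
    except KeyError:
        # An element with no entry of its own cannot be typeset.
        raise KeyError(f"no typesetting rule for <{explicit_tag}>") from None


# The same kind of element needs a different rule in each setting.
print(type_size("body-blkquote"), type_size("notes-blkquote"))
```

    A single generic tag for block quotations would give such a table
no way to tell the two settings apart.
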
    While I have no doubt that our colleagues could easily understand
the necessity of explicit tagging, I don't think we would win many
converts to a tagging scheme of that type.  Most would simply throw up
their hands.  And quite frankly, if my own project did not have the
software to insert the explicit tags required for typesetting, we would
not be engaged in preparing typesetting files today.  Unfortunately our
software is designed only for the kinds of documents we commonly
publish.  But the principle inherent in our software is one worth
thinking about because we use what might be called "context tags" to
discriminate between textual elements that commonly occur in different
circumstances.
    The use of context tags could provide a way of simplifying the
tagging of core elements.  For example, you might have a tagging
structure that looked something like the outline below.
 
         <body> text
             <head>
             <pgh>
             <poetry>
             <list>
             <blkquote>
                   <head>
                   <pgh>
                   <poetry>
                   <list>
                   <blkquote>
         <notes>
              <head>
              <pgh>
              <poetry>
              <list>
              <blkquote>
 
    The supposition here is that the relationship of the subordinate
elements to the text within which they occur is the same; hence, the
same tags are used as identifiers.  In the outline above, the <body> tag
functions as a context tag for its five subordinate elements; the first
<blkquote> tag becomes in itself a context tag for its elements; and the
<notes> tag also functions as a context tag.  Another way of
thinking about how context tags would work is to explode the implicit
relationship of the tags.
 
         <body>
             <head>          becomes        <body-head>
             <pgh>                          <body-pgh>
             <poetry>                       <body-poetry>
             <list>                         <body-list>
             <blkquote>                     <body-blkquote>
                   <head>                        <body-blkquote-head>
                   <pgh>                        <body-blkquote-pgh>
                   <poetry>
                   <list>                             etc.
                   <blkquote>
 
    However, the explicit relationship does not have to be made unless
there is a particular need such as typesetting or some type of textual
analysis in which the various elements must be particularized.
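 
    When the explicit relationship is needed, software can derive it
from the context tags.  The following is a minimal sketch in Python,
assuming that every element is closed with a matching end tag (the
outlines above show only start tags) and that tag names are plain
lowercase words.  The function name and the sample text are
hypothetical.

```python
import re

# Matches a start tag such as <pgh> or an end tag such as </pgh>.
TAG = re.compile(r"<(/?)([a-z]+)>")


def explicit_tags(markup):
    """Return the explicit name of each element, in document order."""
    stack = []   # elements currently open, outermost first
    names = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2)
        if closing:
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
        else:
            stack.append(name)
            # The explicit name is the chain of enclosing context tags.
            names.append("-".join(stack))
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return names


sample = (
    "<body><head>Title</head><pgh>Text.</pgh>"
    "<blkquote><pgh>Quoted.</pgh></blkquote></body>"
    "<notes><blkquote><poetry>A line.</poetry></blkquote></notes>"
)
print(explicit_tags(sample))
```

    The author tags only <head>, <pgh>, <blkquote>, and so on; names
such as body-blkquote-pgh are generated on import and never typed by
hand.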
 
    I want to turn now to the question of tagging the core elements
themselves.  Paragraphs and block quotations are less complicated to
work with in most instances than headings, tables, and poetry.  Headings
are generally used in a hierarchical fashion which reflects the
organization of the body text.  Tables are used to organize information
in columnar fashion, with or without headings and labels.  Poetry is
complex because the format is often an integral part of the poet's
presentation of his/her work and must be faithfully represented.
    For the author who prepares a book or an article, I think we would
be wise to adopt an interchange standard which requires only simple
tagging.  Let's take the case of titles and headings.  From my point of
view, the two are slices of the same pie.  Nevertheless, it's useful to
call certain headings by what they are: <book title> <author> <chapter
title> <section title> <appendix title>, etc.  Authors can easily relate
to these familiar terms.  On the other hand, other types of headings do
not lend themselves to precise naming.  Hence, a descriptive term like
<head level 1>, though commonly used in the print publishing world, does
not have much meaning to most of our colleagues--especially when their
texts contain a wide array of headings.  At the authorial level, I think
we should ask for no more than a simple <heading> tag. And I would make
the same recommendation for lists, tables, and poetry. To distinguish
between an enumerated list which can be treated as a single-column
table and one which needs to be treated as a two-column table requires
the kind of expertise that most of our colleagues don't have and
probably won't ever be interested enough to develop.  Tagging at that
level is best left to those who import the texts and have a need for
that kind of precision.
    Though I have not touched on the core elements which occur within
the bounds of these text "containers" because of time constraints, my
viewpoint is much the same for the texts prepared by most of our
colleagues.  Until we have software to embed SGML tag sets that
discriminate between italics used for emphasis and those used for book
titles, the tagging should be left to those who have more experience as
well as a need for that type of discrimination.  But think how far we
will have come if we can convince our colleagues that even the minimal
tagging of core elements is a worthwhile undertaking.  As scholars in
the humanities become more involved with electronic texts in the years
to come, many will graduate to the levels of tagging we hope to
eventually achieve because they will be driven by their own research and
teaching needs.  But for now, our basic guidelines for the inexperienced
should require no more than a tagging of the fundamental elements in a
simple and easily understood manner.
 
                                            David R. Chesnutt
                                            University of South Carolina