Date:         Tue, 5 Dec 89 08:30:05 CST
Sender:       Text Encoding Initiative - Text Analysis and Interpretation
              Committee <TEI-ANA@UICVM>
From:         Gary Simons <gary@TXSIL.LONESTAR.ORG>
 
To:   TEI-ANA distribution list
From: Gary Simons
Date: December 4, 1989
 
Appended is the document which Terry Langendoen referred to
as SSS in his recent mailing.  It was originally distributed to
a subset of the TEI-ANA committee along with a photocopy of a
chapter of the IT (Interlinear Text) system documentation on
which it was based.  At Terry's request I have revised the
original document slightly to make it suitable for posting on
its own.
 
 
------------------------------------------------------------
 
 
Text Encoding Initiative
Committee on Text Analysis and Interpretation
 
November 1, 1989
Revised December 4, 1989
 
 
 
 
                 PROPOSED FRAMEWORK FOR ENCODING
 
           ANALYSIS AND INTERPRETATION OF RUNNING TEXT
 
 
                         Gary F. Simons
 
                 Summer Institute of Linguistics
                     7500 W. Camp Wisdom Rd.
                        Dallas, TX  75236
                  UUCP: ...!texbell!txsil!gary
                Internet: gary@txsil.lonestar.org
 
 
 
     This document proposes a framework for encoding the
linguistic analysis and interpretation of running text (as
opposed to the encoding of isolated displays used as examples in
articles or books).  This presentation assumes familiarity with
the text encoding used in the IT (Interlinear Text) system
developed by the Summer Institute of Linguistics (SIL).  The MS-
DOS version of IT is documented in "How to Use IT: a guide to
interlinear text processing," by Gary F. Simons and Larry Versaw
(SIL, Dallas, TX, 1987), and the Macintosh version in "How to Use
IT: interlinear text processing on the Macintosh," by Gary F.
Simons and John V. Thomson (Linguist's Software, Edmonds, WA,
1988).  IT uses backslash markers to tag kinds of information in
an analyzed text and uses column alignment to encode the
relationships between the elements in different tagged fields.
This document suggests how the same kinds of information could be
encoded in SGML.
 
     The IT documentation illustrates many kinds of information
that can be encoded in analyzed texts.  The focus in this
document is not on the variety of things that can be encoded (and
for which we would eventually need to develop tags), but on a
basic strategy for encoding text analysis in SGML (particularly
where there is simultaneously an analysis of the text at
different levels of structure).
 
     Most of the discussion below gives the justification for
what I am proposing and gives examples to illustrate.  There are
some specific proposals, however.  Where these occur, I have
drawn attention to them by setting them apart in an indented
paragraph.  I have also given each a unique reference number to
facilitate future discussion of the proposals.  In the discussion
below I use some of the terminology of the IT system.  The
baseline text is the original text being analyzed.  The analyses
added to the text are called annotations.
 
                       Word-level analysis
 
     I will begin with a simple example (from page 2-19 of the IT
manual) which involves only word-level annotation:
 
     \ot Ma  nau ku rikia kini  eri  ne  kuu   ka  bula
     \rt Ma  nau ku rikia kini  'eri ne  ku'u  ka  bula
     \wg and I   I  saw   woman that who drank she be drunk
 
Here, \ot stands for `original transcription,' \rt for
`retranscription,' and \wg for `word gloss.'
 
     In the interlinear presentation of text, the associations
between words and their annotations are achieved by means of
explicit manipulation of format.  In a data interchange standard
we need to get away from that, for at least two reasons: (1) the
actual formatting (such as the size of the spaces between words)
is dependent on the size and style of the fonts that are
eventually used, and (2) software could not understand the
structure of an interlinear text (and the relationships among its
elements) without having a special purpose parser.  To remedy
this, the interchange standard should explicitly encode the
structure of the interlinear text in its descriptive markup.
This leads to the first proposal:
 
     1. A markup element, say <w> ... </w>, is needed to
     enclose a word and all of its annotations in an
     interlinear text.
 
Within this element, the base word and each of the annotations
are also marked with a descriptive tag.  I will assume that the
DTD specifies that end tags are optional for these lowest level
tags.  Following this scheme, the above example (keeping the same
markup tags) would be encoded as:
 
     <w> <ot>Ma <rt>Ma <wg>and </w>
     <w> <ot>nau <rt>nau <wg>I </w>
     <w> <ot>ku <rt>ku <wg>I </w>
     <w> <ot>rikia <rt>rikia <wg>saw </w>
     <w> <ot>kini <rt>kini <wg>woman </w>
     <w> <ot>eri <rt>'eri <wg>that </w>
     <w> <ot>ne <rt>ne <wg>who </w>
     <w> <ot>kuu <rt>ku'u <wg>drank </w>
     <w> <ot>ka <rt>ka <wg>she </w>
     <w> <ot>bula <rt>bula <wg>be drunk </w>
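
     As a rough illustration of the conversion just described,
the following Python sketch turns column-aligned IT lines into
<w> elements.  It is not part of the IT system, and the function
names are hypothetical.  It assumes that each marker begins in
the first column, that the first line is the baseline, that no
baseline word contains a space, and that every annotation lies
within the columns spanned by its baseline word.

```python
def aligned_to_words(lines):
    """Split column-aligned IT lines into one record per baseline word."""
    fields = []
    for line in lines:
        # "\ot Ma  nau ..." -> tag "ot", aligned text "Ma  nau ..."
        marker, _, text = line.partition(" ")
        fields.append((marker.lstrip("\\"), text))
    baseline = fields[0][1]
    # A column starts wherever a baseline word starts.
    starts = [i for i, ch in enumerate(baseline)
              if ch != " " and (i == 0 or baseline[i - 1] == " ")]
    ends = starts[1:] + [None]  # the last column runs to end of line
    return [[(tag, text[start:end].strip()) for tag, text in fields]
            for start, end in zip(starts, ends)]


def words_to_sgml(words):
    """Emit one <w> element per record, omitting the optional end tags."""
    return "\n".join(
        "<w> " + " ".join(f"<{tag}>{value}" for tag, value in word) + " </w>"
        for word in words)


sample = [
    r"\ot Ma  nau ku rikia kini  eri  ne  kuu   ka  bula",
    r"\rt Ma  nau ku rikia kini  'eri ne  ku'u  ka  bula",
    r"\wg and I   I  saw   woman that who drank she be drunk",
]
print(words_to_sgml(aligned_to_words(sample)))
```

Slicing by the baseline columns (rather than splitting each line
on spaces) is what keeps a multi-word gloss such as "be drunk"
together as a single annotation.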
 
     In this example, a space has been put after each data item.
It is my feeling that the standard should ignore any white space
following a data item within a <w> element.  This flexibility is
needed, especially as analysis structures get more complex, in
order to make interchange formats human readable.  Thus,
 
     2. Within the <w> element, all spaces and newlines
     preceding a markup tag are superfluous.
 
The above sample therefore means exactly the same thing as the
following, in which all the superfluous spaces are omitted
(producing a file that is highly compact but much less human
readable):
 
     <w><ot>Ma<rt>Ma<wg>and</w>
     <w><ot>nau<rt>nau<wg>I</w>
     <w><ot>ku<rt>ku<wg>I</w>
     <w><ot>rikia<rt>rikia<wg>saw</w>
     <w><ot>kini<rt>kini<wg>woman</w>
     <w><ot>eri<rt>'eri<wg>that</w>
     <w><ot>ne<rt>ne<wg>who</w>
     <w><ot>kuu<rt>ku'u<wg>drank</w>
     <w><ot>ka<rt>ka<wg>she</w>
     <w><ot>bula<rt>bula<wg>be drunk</w>
 
Likewise, the above two encodings are equivalent to the following
in which extra spaces are added to achieve a very human readable
format:
 
     <w> <ot>Ma    <rt>Ma    <wg>and      </w>
     <w> <ot>nau   <rt>nau   <wg>I        </w>
     <w> <ot>ku    <rt>ku    <wg>I        </w>
     <w> <ot>rikia <rt>rikia <wg>saw      </w>
     <w> <ot>kini  <rt>kini  <wg>woman    </w>
     <w> <ot>eri   <rt>'eri  <wg>that     </w>
     <w> <ot>ne    <rt>ne    <wg>who      </w>
     <w> <ot>kuu   <rt>ku'u  <wg>drank    </w>
     <w> <ot>ka    <rt>ka    <wg>she      </w>
     <w> <ot>bula  <rt>bula  <wg>be drunk </w>
 
The following is still another possible encoding of the very same
information:
 
     <w>
        <ot>Ma
        <rt>Ma
        <wg>and </w>
     <w>
        <ot>nau
        <rt>nau
        <wg>I </w>
     <w>
        <ot>ku
        <rt>ku
        <wg>I </w>
     <w>
        <ot>rikia
        <rt>rikia
        <wg>saw </w>
     etc.
 
The examples below which involve morpheme analysis within word
analysis demonstrate the necessity for being able to treat
newlines as well as spaces as superfluous.
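
     The equivalence of these layouts can be checked
mechanically.  The Python sketch below is only an illustration
under stated assumptions: the name parse_words is hypothetical,
the input is assumed to be well formed, and the data are assumed
to contain no "<" characters.  It reads word-level markup with
omitted end tags and, following proposal 2, discards any spaces
and newlines that precede a markup tag.

```python
import re

# Matches a start tag such as <ot> or an end tag such as </w>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def parse_words(text):
    """Return one list of (tag, value) pairs for each <w> element."""
    words, current, field, pos = [], None, None, 0
    for match in TAG.finditer(text):
        data = text[pos:match.start()]
        if field is not None:
            # Proposal 2: white space preceding a tag is superfluous.
            current.append((field, data.rstrip()))
            field = None
        closing, name = match.group(1), match.group(2)
        if name == "w":
            if closing:
                words.append(current)
                current = None
            else:
                current = []
        elif not closing:
            field = name  # an open field is ended by the next tag
        pos = match.end()
    return words


compact = "<w><ot>kuu<rt>ku'u<wg>drank</w>"
padded = "<w> <ot>kuu   <rt>ku'u  <wg>drank    </w>"
stacked = "<w>\n   <ot>kuu\n   <rt>ku'u\n   <wg>drank </w>"
print(parse_words(compact) == parse_words(padded) == parse_words(stacked))
```

Only trailing white space is removed from each data item; white
space immediately after a tag is kept, since proposal 2 does not
declare it superfluous.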
 
     I further propose, as shown in the above samples, that:
 
     3. Within the <w> element, the baseline text element
     should have a markup tag just as do all the
     annotations.
 
This means that <w> will always be followed immediately by the
baseline tag.  This appears redundant, and one could argue that
the baseline word should be the content of the <w> tag.  I prefer
to have the baseline separately tagged for the following reasons:
(1) The <w> element is more like a record of a database than an
element with running text; in this perspective, the baseline word
is a field of information as are all the annotations.  (2) The
baseline text may be drawn from any of a number of domains --
phonetic transcription, phonemic transcription, standard
orthography, historical orthography -- which happen to be among
the domains that are possible for annotations as well.  This
proposal ensures that elements are tagged consistently, whether
they are base forms or annotations.  (3) If the baseline were not
tagged, then the meaning of <w> would be variable depending on
the kind of representation being used for the baseline.  With the
baseline separately tagged, <w> always means exactly the same
thing, namely, "This is a bundle of information about a word."
 
                     Morpheme-level analysis
 
     Now we shift to a more complex example involving morpheme-
level analysis within the word-level analysis.  The example is
from Eskimo, a highly agglutinative language.  In interlinear
format (from page 2-33 of the IT manual) the example looks like
this:
 
     \tx Akutchilighmik-uvva          uqaaqtullangniaqtunga.
     \at akut    -chi-ligh-mik  =uvva uqaaqtu   -llang-niaq-tunga
     \mr akutuq  -si -liq -mik  =uvva uqaaqtuq  -llak -niaq-tunga
     \mg icecream-RSL-GER -s.MOD=now  tell story-DUR  -INT -1s.I
     \wg about making Eskimo icecream I am going to tell a story
 
The field codes in this case are \tx for `baseline text,' \at for
`allomorphic transcription' (that is, with morph cuts indicated
in the surface form of the word), \mr for `morphemic
representation' (that is, underlying forms), \mg for `morpheme
gloss,' and \wg for `word gloss.'
 
     In this example we have the elements of the baseline text
and the word gloss lines working at the word level.  But the
middle three lines are further subdivided into a morpheme level
of analysis.  We could ignore this and treat everything as a
word-level annotation, as in
 
     <w> <tx>Akutchilighmik-uvva
         <at>akut-chi-ligh-mik=uvva
         <mr>akutuq-si-liq-mik=uvva
         <mg>icecream-RSL-GER-s.MOD=now
         <wg>about making Eskimo icecream </w>
 
     <w> <tx>uqaaqtullangniaqtunga.
         <at>uqaaqtu-llang-niaq-tunga
         <mr>uqaaqtuq-llak-niaq-tunga
         <mg>tell story-DUR-INT-1s.I
         <wg>I am going to tell a story </w>
 
Note that in this representation we have preserved the morpheme
break characters in the data so that the human reader can
reconstruct the relationships between the parts of the related
lines, but we have not encoded those relationships explicitly in
the markup.  Without a special purpose parser, the morpheme
subalignments are lost to computer software.
 
     To handle the morphemic substructure in the markup, we need
an element like <w> above, but which works at the morpheme level
and can be embedded inside <w> elements.  Thus,
 
     4. A markup element, say <m> ... </m>, is needed to
     enclose a morpheme and all of its annotations in an
     interlinear text.
 
Proposal 2 above, about superfluous newlines and spaces, applies
as well within the scope of the <m> element.  So does proposal 3,
which in this case means that the first item within a <m> element
should have a descriptive tag.
 
     To translate the above interlinear example into descriptive
SGML markup requires one more step.  We must realize that the
three lines which will be expressed at the morpheme level inside
<m> elements function together as a single element at the word
level.  In the same way that \wg provides a gloss for the whole
word, the set of three lines provides a morphemic analysis for
the whole word.  We thus introduce <ma> for `morphemic analysis'
as a word-level annotation in which to embed the morpheme-level
analyses.  The descriptive markup of the Eskimo example then
becomes something like:
 
     <w> <tx>Akutchilighmik-uvva
         <ma> <m> <at>akut <mr>akutuq <mg>icecream </m>
              <m> <at>-chi <mr>-si <mg>-RSL </m>
              <m> <at>-ligh <mr>-liq <mg>-GER </m>
              <m> <at>-mik <mr>-mik <mg>-s.MOD </m>
              <m> <at>=uvva <mr>=uvva <mg>=now </m> </ma>
         <wg>about making Eskimo icecream </w>
 
     <w> <tx>uqaaqtullangniaqtunga.
         <ma> <m> <at>uqaaqtu <mr>uqaaqtuq <mg>tell story </m>
              <m> <at>-llang <mr>-llak <mg>-DUR </m>
              <m> <at>-niaq <mr>-niaq <mg>-INT </m>
              <m> <at>-tunga <mr>-tunga <mg>-1s.I </m> </ma>
         <wg>I am going to tell a story </w>
 
Note that this is just one possible way to use indentations and
newlines to format the encoded information.  By proposal 2 above,
many others would be possible.
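
     The step from the flat word-level encoding to the nested
one can be sketched in Python as follows.  This is an
illustration only: the function names are hypothetical, and it
assumes that "-" and "=" occur in the morpheme-level lines only
as morpheme break characters and that every such line has the
same number of morphemes.

```python
import re

# A morpheme is an optional break character ("-" or "=") followed by
# everything up to the next break character.
MORPH = re.compile(r"[-=]?[^-=]+")


def split_morphs(form):
    """Cut a string at its break characters, keeping each break
    character attached to the morpheme that follows it."""
    return MORPH.findall(form)


def morphemic_analysis(fields):
    """Build an <ma> element from (tag, word-level string) pairs."""
    tags = [tag for tag, _ in fields]
    columns = [split_morphs(value) for _, value in fields]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("lines differ in their number of morphemes")
    items = []
    for morphs in zip(*columns):
        inner = " ".join(f"<{t}>{m}" for t, m in zip(tags, morphs))
        items.append(f"<m> {inner} </m>")
    return "<ma> " + "\n     ".join(items) + " </ma>"


print(morphemic_analysis([
    ("at", "uqaaqtu-llang-niaq-tunga"),
    ("mr", "uqaaqtuq-llak-niaq-tunga"),
    ("mg", "tell story-DUR-INT-1s.I"),
]))
```

The baseline line is deliberately not passed through this
function, since a hyphen there (as in Akutchilighmik-uvva) is
part of the text and not a marker of analysis.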
 
                        Above word level
 
     Text analysis must also go above the word level.  For
instance, the free translations of conventional interlinear text
analysis are annotations at a sentence or clause level.  One
could also perform analysis and annotation at a phrase level or
at a paragraph level.  Markup can be generalized to any level by
doing what we did to embed morpheme analysis inside word-level
analysis.  Thus,
 
     5. For each level at which annotation is to be done, a
     markup tag (along the lines of <w> and <m> introduced
     above) should be created.  If a lower level of
     annotation is to be embedded within it, then a markup
     tag to embed the lower level analysis must also be
     created.
 
The second sentence refers to a tag like <ma> which had to be
created to embed morpheme analyses inside of word analyses.
 
     Consider this example (adapted from page 2-64 of the IT
manual):
 
     \ref BARU u 11
     \tx "Hata     ma  haia,
     \wg  vine sp. and putty nut
 
     \lt "Hata vine and putty nut,
 
     \ft "There is plenty of hata vine and putty nut around,
 
     \bi Hata is a vine used for stitching together the planks
         of canoes.  The nut of the haia tree (Parinarium
         glaberrimum) contains a putty-like substance used for
         caulking canoes.
 
In this example, \ref stands for `reference,' \tx for the
baseline text, \wg for `word gloss,' \lt for `literal
translation,' \ft for `free translation,' and \bi for `background
information.'
 
     For the purposes of text glossing, this particular text was
divided into units of convenient size for translation (often more
than a single clause but less than a sentence).  Since the
division does not correspond to a linguistic level like sentence
or clause, we will simply call it `unit' and use <unit> as the
tag for this unit.  Within the unit, we have word-level analysis.
Thus we need a tag to embed the lower level of analysis.  We will
use <wa> for `word-by-word analysis.'  The above would then be
marked up as:
 
     <unit> <number>11
        <wa> <w> <tx>"Hata <wg>vine sp. </w>
             <w> <tx>ma <wg>and </w>
             <w> <tx>haia, <wg>putty nut </w> </wa>
        <lt>"Hata vine and putty nut,
        <ft>"There is plenty of hata vine and putty nut around,
        <bi>Hata is a vine used for stitching together the planks
            of canoes.  The nut of the haia tree (Parinarium
            glaberrimum) contains a putty-like substance used for
            caulking canoes.
     </unit>
 
If the words exhibited complex morphology, then a <ma> element
would be further embedded within the <wa> element.
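
     Because every level follows the same pattern, a single
recursive routine can write out an analysis of any depth.  The
Python sketch below is illustrative only: the function name and
the nested-list data model are assumptions made for this example.
An element whose content is a string is treated as a lowest-level
field with its end tag omitted; an element whose content is a
list is treated as an embedding element with an explicit end tag.

```python
def encode(tag, content, indent=0):
    """Serialize one element.

    `content` is either a string (a lowest-level field) or a list of
    (tag, content) pairs (an element that embeds other elements).
    """
    pad = " " * indent
    if isinstance(content, str):
        # Lowest-level field: the end tag is omitted.
        return f"{pad}<{tag}>{content}"
    lines = [f"{pad}<{tag}>"]
    for child_tag, child in content:
        lines.append(encode(child_tag, child, indent + 3))
    lines.append(f"{pad}</{tag}>")
    return "\n".join(lines)


unit = [
    ("number", "11"),
    ("wa", [
        ("w", [("tx", '"Hata'), ("wg", "vine sp.")]),
        ("w", [("tx", "ma"), ("wg", "and")]),
        ("w", [("tx", "haia,"), ("wg", "putty nut")]),
    ]),
    ("lt", '"Hata vine and putty nut,'),
    ("ft", '"There is plenty of hata vine and putty nut around,'),
]
print(encode("unit", unit))
```

The layout produced here (one tag per line) is only one of the
many that proposal 2 would allow, on the assumption that the
proposal is extended to the higher-level elements as well.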
 
                           Where next?
 
     My purpose here has been to suggest a possible framework for
the markup of analysis and interpretation in texts.  If this
framework is adopted, then it seems that the major task of the
committee would be to devise a standard set of tags for all the
kinds of information that could be encoded in analyzed texts.
The IT manual proposes more than 30, but many more are possible.
The following are some of the difficulties I foresee along the
way:
 
     (1) This kind of text is not amenable to the once-for-all
DTD in the same way that something like an academic book or
article (a la the AAP tag set) has proven to be.  Two reasons
come to mind.  First, there is not a closed tag set.  The
imaginative analyst could invent new ways of interpreting and
analyzing text for which appropriate standardized markup tags
have not been devised.  Second, each analyzed text potentially
has a unique DTD rather than all analyzed texts sharing the same
DTD.  In a particular analyzed text, the analyst makes a
selection of the kinds of information that will be included; the
analyst should then be consistent in following that model
throughout.  The analyst's selection thus would ideally define a
unique DTD for the text which consistently enforces the analyst's
intention.  We could probably live with a once-for-all general-
purpose DTD, but it would permit sloppy work in which each
element of the text could be annotated with a different selection
of fields.
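
     The kind of consistency that a text-specific DTD would
enforce can be approximated by a simple check.  The Python
sketch below is only a stand-in for such a DTD: the function
names are hypothetical, the records are assumed to be lists of
(tag, value) pairs, and the most common selection of fields is
assumed to be the one the analyst intended.

```python
from collections import Counter


def field_profile(record):
    """The ordered tuple of field tags used in one record."""
    return tuple(tag for tag, _ in record)


def inconsistent_records(records):
    """Indexes of records whose fields differ from the usual selection."""
    counts = Counter(field_profile(record) for record in records)
    if not counts:
        return []
    expected, _ = counts.most_common(1)[0]
    return [i for i, record in enumerate(records)
            if field_profile(record) != expected]


words = [
    [("ot", "Ma"), ("rt", "Ma"), ("wg", "and")],
    [("ot", "nau"), ("wg", "I")],  # the retranscription is missing
    [("ot", "ku"), ("rt", "ku"), ("wg", "I")],
]
print(inconsistent_records(words))
```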
 
     (2) The whole scheme presented above is optimal for
situations where there is a one-to-one correspondence between all
the analysis items at the same level.  What do we do when there
is not?  For instance, what happens if one word in the baseline
text is a contraction that needs to be represented as two words
in the next line of the analysis?  Or, what if two words in the
baseline text are a single lexeme which should be represented as
a single item in a later line of analysis?  This problem gets
even more difficult when the items involved are discontinuous.
 
     (3) The example above shows only a traditional single-tier
approach to morphology.  Encoding of multiple-tier analyses will
be tricky because the associations between elements on different
tiers are typically not one-to-one.
 
     (4) It is unclear to me how encoding the interpretation and
analysis of a running text (which is what this working paper
considers) should relate to the encoding of linguistic examples
in a book or article.  It seems pretty obvious that the encoding
of interlinear examples can be covered by the same mechanism.
For other kinds of examples, such as syntactic tree diagrams or
the multi-tier displays common in phonology, it is not so clear.
It seems that it would be possible to force these into the above
model (and indeed we probably want this for the case where these
kinds of analysis are used in annotating a running text).
However, it also seems that it might be better to have a separate
encoding scheme optimized for these special kinds of displays.