Comments on AIW7, "Where's Morphology?", and AIW7a, "A Revision of the
'Box' Proposal"
CMSMcQ, 1 May 1990
The basic proposal here is to allow running texts to be treated as
simple series of segments, each segment containing one or more levels of
content (typically one base and zero or more annotations). Since an
annotation might attach to another annotation, we can assert that each
annotation has some base, but not that only one base exists. AIW7 makes
explicit an obvious parallel to the presentation of tabular material,
treating each segment as a BOX containing a series of ROWs or BOXes.
This corresponds to the internal structure of a table in column-major
form: the table consists of a series of COLUMNs, each containing a
series of ROWs or other tables. AIW7 elides the TABLE and COLUMN levels
as not needing to be distinct: a ROW containing data is a cell, while a
ROW containing other BOXes or COLUMNs is clearly a table.
AIW7a renames the elements more descriptively: the segment as UNIT, and
the levels of content or ROWs as BASE and ANNOTATION. It proposes, in
BNF, a grammar which would result in this DTD fragment:
One immediate drawback to this formalization is that BASE is not typed.
Since a content level can be both an annotation and a base for another
level, we should allow BASEs to carry the same typing information that
ANNOTATIONs do. I therefore assume, as a modification, that BASE is typed
in the same way that ANNOTATION and UNIT are. If we have five levels, A
(the base), B and C (annotations on A), D (annotation on B), and E
(annotation on D), this grammar would give us, for one segment:
A-segment
B-segment
C-segment
A-segment
B-segment
D-segment
C-segment
A-segment
B-segment
D-segment
E-segment
C-segment
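The nesting described above can be sketched in Python. This is only an illustrative model, not the AIW7a DTD: the class name Level, its fields, and the function flatten are hypothetical names introduced here. It assumes, as proposed above, that every level carries a type whether it acts as a base or as an annotation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Level:
    """One content level in a unit (hypothetical model, not the AIW7a DTD)."""
    kind: str                      # level type, e.g. "A" or "B"
    content: str                   # text of this level for the segment
    annotations: List["Level"] = field(default_factory=list)  # levels based on this one


def flatten(level: Level, depth: int = 0) -> List[str]:
    """Walk the nesting depth-first, indenting each level under its base."""
    lines = ["  " * depth + level.content]
    for note in level.annotations:
        lines.extend(flatten(note, depth + 1))
    return lines


# A is the base; B and C annotate A; D annotates B; E annotates D.
e = Level("E", "E-segment")
d = Level("D", "D-segment", [e])
b = Level("B", "B-segment", [d])
c = Level("C", "C-segment")
unit = Level("A", "A-segment", [b, c])

print("\n".join(flatten(unit)))
```

Read depth-first, the levels come out in the order A, B, D, E, C, with each annotation indented under the level that serves as its base.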
We can conclude that AIW7a can handle multiple bases, by nesting.
Another possibility, without nesting, would eliminate the distinction
between base and annotation as elements, allowing each content level to
have an ID (desirable on other grounds, for cross-reference) and, if an
annotation, a pointer to the content level which forms its base. The
pointer attribute will be defined with a default value of #CURRENT, so
that we need only specify it once. (The pointer is to be interpreted by
an application program as a pointer to the transcription or annotation
level pointed at, not as a pointer just at the particular segment
involved.) Note that the base for a given level can change from segment
to segment (e.g. if the morphological segmentation is based now on the
first, and now on the second, of two orthographic transcripts of a
tape); in that case the pointer is simply specified again. The required
DTD is this:
Our five-level example can now look like this for the first unit:
A-segment
B-segment
C-segment
D-segment
E-segment
And like this for subsequent units:
A-segment
B-segment
C-segment
D-segment
E-segment
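How an application might resolve the base pointers in the flat form can be sketched in Python. Several things here are assumptions rather than part of the proposal: the function name resolve_bases, the tuple layout, and in particular the rule that the carried-over value is tracked separately for each level. Strict SGML tracks a #CURRENT default per element type, so this per-level reading corresponds either to one element type per level or to interpretation by the application.

```python
from typing import Dict, List, Optional, Tuple

# Each row is (level id, explicit base pointer or None, content).
# None means "not specified here": reuse the value last given for this level.
Row = Tuple[str, Optional[str], str]


def resolve_bases(units: List[List[Row]]) -> List[Dict[str, Optional[str]]]:
    """Return, for each unit, a mapping from level id to its effective base."""
    current: Dict[str, Optional[str]] = {}   # last explicit base seen per level
    resolved = []
    for unit in units:
        effective = {}
        for level_id, base, _content in unit:
            if base is not None:
                current[level_id] = base     # an explicit value becomes current
            effective[level_id] = current.get(level_id)  # None for a level with no base
        resolved.append(effective)
    return resolved


units = [
    # First unit: every pointer is given explicitly.
    [("A", None, "A-segment"), ("B", "A", "B-segment"), ("C", "A", "C-segment"),
     ("D", "B", "D-segment"), ("E", "D", "E-segment")],
    # Second unit: nothing specified, so the earlier values carry over.
    [("A", None, "A-segment"), ("B", None, "B-segment"), ("C", None, "C-segment"),
     ("D", None, "D-segment"), ("E", None, "E-segment")],
    # Third unit: D is rebased on C for this and later units.
    [("A", None, "A-segment"), ("B", None, "B-segment"), ("C", None, "C-segment"),
     ("D", "C", "D-segment"), ("E", None, "E-segment")],
]
resolved = resolve_bases(units)
print(resolved[2])
```

The third unit illustrates the case noted above in which the base for a level changes from one segment to the next: only the changed pointer needs to be given again.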
The example used in AIW1 would look like this:
Akutchilighmik-uvva
  akut
    akutuq
    icecream
  -chi
    -si
    RSL
  -ligh
    -liq
    GER
  -mik
    -mik
    s.MOD
  =uvva
    =uvva
    NOW
  about making Eskimo icecream
uqaaqtullangniaqtunga
  uqaaqtu
    uqaaqtuq
    tell story
  -llang
    -llak
    DUR
  -niaq
    -niaq
    INT
  -tunga
    -tunga
    1s.I
  I am going to tell a story
The free-text translation example used in AIW1 would look like this:
11
"Hata
  vine sp.
ma
  and
haia
  putty nut
Hata vine and putty nut,
There is plenty of hata vine and putty nut around,
Hata is a vine used for stitching together the planks of
canoes. The nut of the haia tree (Parinarium
glaberrimum) contains a putty-like substance used for
caulking canoes.
A row-major organization, equivalent to this column-major organization,
could be devised for those who would rather read across than down.
While plausible for data entry, this organization is not
recommended for data interchange and thus is not defined explicitly in
this draft. Readers who have an interest in a publicly defined
row-major notation for implicitly aligned material are invited to let us
know and to provide examples of material marked up in the desired
fashion.
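The equivalence of the two organizations can be illustrated with a short Python sketch. It does not define a row-major notation; it only shows that, when every unit has the same number of levels (an assumption made here), one arrangement is the transpose of the other. The function name to_row_major is hypothetical, and the data are the morpheme units of the first word in the AIW1 example.

```python
from typing import List, Sequence


def to_row_major(units: Sequence[Sequence[str]]) -> List[List[str]]:
    """Transpose units (columns) into levels (rows).

    Assumes every unit has the same number of levels.
    """
    if len({len(unit) for unit in units}) > 1:
        raise ValueError("units must all have the same number of levels")
    return [list(row) for row in zip(*units)]


# Column-major: one inner list per unit, read downward.
units = [
    ["akut", "akutuq", "icecream"],
    ["-chi", "-si", "RSL"],
    ["-ligh", "-liq", "GER"],
    ["-mik", "-mik", "s.MOD"],
    ["=uvva", "=uvva", "NOW"],
]

# Row-major: one inner list per level, read across.
rows = to_row_major(units)
for row in rows:
    print("  ".join(cell.ljust(8) for cell in row))
```

Applying the same transposition to the rows returns the original units, which is the sense in which the two organizations carry the same information.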