Comments on AIW7, "Where's Morphology?" and AIW7a, "A Revision of the 'Box' Proposal"

CMSMcQ, 1 May 1990

The basic proposal here is to allow running texts to be treated as simple series of segments, each segment containing one or more levels of content (typically one base and zero or more annotations). Since an annotation might attach to another annotation, we can assert that each annotation has some base, but not that only one base exists.

AIW7 makes explicit an obvious parallel to the presentation of tabular material, treating each segment as a BOX containing a series of ROWs or BOXes. This corresponds to the internal structure of a table in column-major form: the table consists of a series of COLUMNs, each containing a series of ROWs or other tables. AIW7 elides the TABLE and COLUMN levels as not needing to be distinct: a ROW containing data is a cell, while a ROW containing other BOXes or COLUMNs is clearly a table.

AIW7a renames the elements more descriptively: the segment as UNIT, and the levels of content or ROWs as BASE and ANNOTATION. It proposes, in BNF, a grammar which would result in this DTD fragment:

<![CDATA[
<!ELEMENT unit       - - (base, annotation+) >
<!ELEMENT base       - O (#PCDATA) >
<!ELEMENT annotation - O (#PCDATA | unit+) >
<!ATTLIST (unit, annotation)
          type       CDATA   #IMPLIED >
]]>

One immediate drawback to this formalization is that BASE is not typed. Since a content level can be both an annotation and a base for another level, we should allow BASEs to carry the same typing information that ANNOTATIONs do. So I assume as a modification that BASE is typed in the same way that ANNOTATION and UNIT are.
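The content model above, with the assumed modification that BASE carries a type, can be pictured as a small recursive data structure. The following Python sketch is for illustration only: the class names mirror the element names, but the classes themselves and the `is_valid` check are assumptions of this sketch and are not part of AIW7a.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Base:
    text: str
    type: Optional[str] = None  # assumed modification: BASE is typed too


@dataclass
class Annotation:
    # Mirrors (#PCDATA | unit+): character data, or one or more nested units.
    content: Union[str, List["Unit"]]
    type: Optional[str] = None


@dataclass
class Unit:
    base: Base
    annotations: List[Annotation] = field(default_factory=list)
    type: Optional[str] = None

    def is_valid(self) -> bool:
        """Check the content model (base, annotation+), recursively."""
        if not self.annotations:
            return False  # at least one annotation is required
        for ann in self.annotations:
            if isinstance(ann.content, list):
                # unit+ means the list may not be empty
                if not ann.content:
                    return False
                if not all(u.is_valid() for u in ann.content):
                    return False
        return True


# An annotation that is itself the base of a further annotation
# is expressed by nesting a unit inside the annotation.
example = Unit(
    Base("A-segment", type="orthographic"),
    [
        Annotation([Unit(Base("B-segment"), [Annotation("D-segment")])]),
        Annotation("C-segment", type="syntax"),
    ],
)
print(example.is_valid())
```

The nested `Unit` inside the first `Annotation` is what lets one content level serve as annotation and base at once.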
If we have five levels, namely A, B and C (both annotations on A), D (an annotation on B), and E (an annotation on D), this grammar would give us, for one segment:

<![CDATA[
<!-- first pass -->
<unit><base>A-segment
      <annotation>B-segment
      <annotation>C-segment
</unit>

<!-- second pass -->
<unit><base>A-segment
      <annotation>
         <unit><base>B-segment
               <annotation>D-segment
         </unit>
      <annotation>C-segment
</unit>

<!-- third pass -->
<unit><base type=orthographic>A-segment
      <annotation>
         <unit><base type=morphological>B-segment
               <annotation>
                  <unit><base type=morph.gloss>D-segment
                        <annotation type=morph.gloss.2>E-segment
                  </unit>
         </unit>
      <annotation type=syntax>C-segment
</unit>
]]>

We can conclude that AIW7a can handle multiple bases, by nesting.

Another possibility, without nesting, would eliminate the distinction between base and annotation as elements, allowing each content level to have an ID (desirable on other grounds, for cross-reference) and, if it is an annotation, a pointer to the content level which forms its base. The pointer will be defined with a default value of #CURRENT, so that we need only specify it once. (The pointer is to be interpreted by an application program as a pointer to the transcription or annotation level pointed at, not as a pointer just at the particular segment involved.) Note that the base for a given level can change from segment to segment (e.g. if the morphological segmentation is based now on the first, and now on the second, of two orthographic transcripts of a tape).
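To make the equivalence between the pointer form and the nested form concrete, here is a sketch of how an application might rebuild the hierarchy from base pointers alone. It uses the five levels A to E described above. The `Level` class, the functions `nest` and `outline`, and the short identifiers are assumptions made for illustration, and Python is used only as a convenient notation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Level:
    id: str
    text: str
    base: Optional[str] = None  # id of the level this one annotates, if any


def nest(levels: List[Level]) -> Dict[Optional[str], List[str]]:
    """Group level ids under the id of their base (None for a root level)."""
    known = {lv.id for lv in levels}
    children: Dict[Optional[str], List[str]] = {}
    for lv in levels:
        # A pointer must refer to a level that actually exists.
        if lv.base is not None and lv.base not in known:
            raise ValueError(f"{lv.id}: unknown base {lv.base}")
        children.setdefault(lv.base, []).append(lv.id)
    return children


def outline(children: Dict[Optional[str], List[str]],
            root: Optional[str] = None, depth: int = 0) -> List[str]:
    """Render the base/annotation hierarchy as indented lines."""
    lines: List[str] = []
    for child in children.get(root, []):
        lines.append("  " * depth + child)
        lines.extend(outline(children, child, depth + 1))
    return lines


levels = [
    Level("A", "A-segment"),
    Level("B", "B-segment", base="A"),
    Level("C", "C-segment", base="A"),
    Level("D", "D-segment", base="B"),
    Level("E", "E-segment", base="D"),
]
print("\n".join(outline(nest(levels))))
```

The indentation of the printed outline follows the same nesting as the third pass above, although the input is a flat list.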
The required DTD is this:

<![CDATA[
<!ELEMENT unit       - - (level+) >
<!ELEMENT level      - O (#PCDATA | unit+) >
<!ATTLIST (unit, level)
          type       CDATA   #IMPLIED
          id         ID      #IMPLIED
          base       IDREF   #CURRENT >
]]>

Our five-level example can now look like this for the first unit:

<![CDATA[
<unit>
<level id=seg.A1 type=orthog>A-segment
<level id=seg.B1 type=morpho base=seg.A1>B-segment
<level id=seg.C1 type=syntax base=seg.A1>C-segment
<level id=seg.D1 type=mgloss base=seg.B1>D-segment
<level id=seg.E1 type=mglos2 base=seg.D1>E-segment
</unit>
]]>

And like this for subsequent units:

<![CDATA[
<unit>
<level id=seg.A2>A-segment
<level id=seg.B2>B-segment
<level id=seg.C2>C-segment
<level id=seg.D2>D-segment
<level id=seg.E2>E-segment
</unit>
]]>

The example used in AIW1 would look like this:

<![CDATA[
<unit>
<level type=tx>Akutchilighmik-uvva
<level><unit>
       <level type=at>akut
       <level type=mr>akutuq
       <level type=mg>icecream
       </unit>
       <unit>
       <level type=at>-chi
       <level type=mr>-si
       <level type=mg>RSL
       </unit>
       <unit>
       <level type=at>-ligh
       <level type=mr>-liq
       <level type=mg>GER
       </unit>
       <unit>
       <level type=at>-mik
       <level type=mr>-mik
       <level type=mg>s.MOD
       </unit>
       <unit>
       <level type=at>=uvva
       <level type=mr>=uvva
       <level type=mg>NOW
       </unit>
<level type=wg>about making Eskimo icecream
</unit>

<unit>
<level type=tx>uqaaqtullangniaqtunga
<level><unit>
       <level type=at>uqaaqtu
       <level type=mr>uqaaqtuq
       <level type=mg>tell story
       </unit>
<level><unit>
       <level type=at>-llang
       <level type=mr>-llak
       <level type=mg>DUR
       </unit>
<level><unit>
       <level type=at>-niaq
       <level type=mr>-niaq
       <level type=mg>INT
       </unit>
<level><unit>
       <level type=at>-tunga
       <level type=mr>-tunga
       <level type=mg>1s.I
       </unit>
<level type=wg>I am going to tell a story
</unit>
]]>

The free-text translation example used in AIW1 would look like this:

<![CDATA[
<unit>
<level type=no>11
<level type=wa><unit>
               <level type=tx>"Hata
               <level type=wg>vine sp.
               </unit>
               <unit>
               <level type=tx>ma
               <level type=wg>and
               </unit>
               <unit>
               <level type=tx>haia
               <level type=wg>putty nut
               </unit>
<level type=lt>Hata vine and putty nut,
<level type=ft>There is plenty of hata vine and putty nut around,
<level type=bi>Hata is a vine used for stitching together the planks of canoes. The nut of the haia tree (Parinarium glaberrimum) contains a putty-like substance used for caulking canoes.
</unit>
]]>

A row-major organization, equivalent to this column-major organization, could be devised for those who would rather read across than down. While plausible for data entry, such an organization is not recommended for data interchange and thus is not defined explicitly in this draft. Readers who have an interest in a publicly defined row-major notation for implicitly aligned material are invited to let us know and to provide examples of material marked up in the desired fashion.
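For readers who want to see the two organizations side by side, the following sketch transposes column-major units into rows, one row per level type, padding each cell so that aligned segments line up when read across. It is illustrative only: the function name `to_rows`, the dictionary representation of a unit, and the plain-text padding are assumptions of this sketch, and no row-major notation is defined by this draft. The sample data are adapted from the word-level units of the free-text translation example above.

```python
from typing import Dict, List


def to_rows(units: List[Dict[str, str]], types: List[str]) -> List[str]:
    """Turn a series of units (columns) into one text row per level type."""
    # Each column is as wide as its longest cell across the requested types.
    widths = [max(len(u.get(t, "")) for t in types) for u in units]
    rows = []
    for t in types:
        cells = [u.get(t, "").ljust(w) for u, w in zip(units, widths)]
        rows.append(f"{t}: " + "  ".join(cells).rstrip())
    return rows


# Column-major input: one dictionary per unit, keyed by level type.
units = [
    {"tx": "Hata", "wg": "vine sp."},
    {"tx": "ma", "wg": "and"},
    {"tx": "haia", "wg": "putty nut"},
]

for row in to_rows(units, ["tx", "wg"]):
    print(row)
```

A layout of this kind relies on whitespace for alignment, which is one reason it may suit data entry better than interchange.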