Notes on SGML Solutions To Markup Problems

Document Number: TEI MLW18
May 9, 1991
3.1, May 9, 1991

This paper discusses some sample problems in the use of SGML which have come up in the course of the work of the Text Encoding Initiative, and presents a number of example solutions to each problem. These notes may be taken as representing the views of the committee as to appropriate uses of SGML mechanisms for logical problems; they do not necessarily reflect the views of the committee concerning the individual application areas. The discussion of linguistic examples, especially, focuses on markup problems and is not intended as a full linguistic analysis.

Some problems discussed here are also treated in the work of ANSI committee X3V1.8, Music in Information Processing Standards, especially in their work on hypertext and hypermedia documents. Their documents, notably their document X3V1.8/SD-7, "Journal of Development, ANSI Project X3.749-D, Hypermedia/Time-based Document Structuring Language (HyTime)," should be consulted by anyone working with hypertext problems (as should the documents of TEI work group TR3, Hypermedia). The examples and solutions described here are pedagogical in intent; divergences in this document from the recommendations of X3V1.8 or TR3 should not be taken as rejections of their recommendations.

1 MARKING ARBITRARY SEGMENTS

Problem: how do you mark text segments which are arbitrary both with respect to the primary (or any) hierarchical structure in the text, and also with respect to each other?

Examples:

1 the passage rendered illegible by water stains(1)
2 the passage rendered inaudible by a passing truck (e.g. in a transcript of an interview)
3 this is where N.V. moved to the window (e.g.
in a gestural transcription of a conversation)
4 the discussion of tariffs in these minutes where the topics are: tariff policy, foreign relations with Japan (including cultural exchange, economic cooperation, tariff policy, patent law), and wheat production (including a digression on wheat tariffs)

Solution 1: Concurrent Markup

If the number of such segment types is bounded and small, use CONCUR. For instance, example 3 shows a segment corresponding to a movement by one participant. By assigning one document type to each participant, we can mark each participant's moves using CONCUR. So NV's move to the window can be bounded by

<(nino)move desc='walking to window'> ...

without regard for the other structures in the text, after a DTD with declarations like

]>

This assumes the gestural transcription will mark MOVEs, GESTUREs, and possibly other things for which here the ellipsis stands in; also that no two gestures or moves overlap or nest. If gestures and moves may nest, the content models for the document type NINO should be revised appropriately: they may express a suitable hierarchy or be replaced by the content model ANY; alternatively, the gestural tags may all be included as inclusion exceptions on the document-type element NINO, thus: etc.

Similarly, legibility or audibility information can be readily accommodated by CONCUR with a "leg" or "aud" DTD segmenting the document into segments of a given legibility or audibility. The DTD might look something like this:

]>

The use of (#PCDATA) as a content model ensures that these tags cannot nest (which we assume would be meaningless) and the content model on AUD ensures that every portion of the document will be marked as clear, distorted, inaudible, or "dna" ("does not apply", for sections of the document to which audibility does not apply--document header, annotation, etc.). So the document could have, interspersed randomly among other tags, sequences like:

<(aud)dna> [document header ...]
<(aud)clear> ...
<(aud)inaudible cause='truck'> ...
<(aud)clear> ...
<(aud)distorted cause='volume overload'> ...
<(aud)clear> ...

Solution 2a: Single Empty Segment-Boundary Element

Where the number of types of segments is in principle unbounded (e.g. not only move and gesture but an indefinite number of further possibilities), a single tag may be used to mark the beginning or end of any segment, thus dividing the document into time-slices to be managed and grouped by higher-level application software.(2)

If we wish to transcribe, say, movement, location, and whether a participant is smoking, we can segment the text on these lines: if N.V. walks to the window, stands there, and after a few minutes lights a cigarette, returning to the table before putting it out, we could imagine a simple segmentation like this:

... passage A ...
... passage B ...
... passage F ...
... passage G ...

The application software is then responsible for linking each to its corresponding , by means of the identical value on the id attribute of the and the startid attribute (an IDREF attribute) on the tag, and treating the intervening text as though it were a single element. Here, passages A-C must be treated as though braced by <(nino)smoking> ... , and C-D and F-G as though single segments with NV's location as marked.

The document type declaration fragment required for tagging of this kind would be something like this:

]>

SGML would not verify that each tag had exactly one corresponding , nor that each event-start tag had an ID attribute and each event-end tag a STARTID attribute, but the ID/IDREF mechanism would ensure that each STARTID pointed at exactly one ID.

Solution 2b: Paired Segment-Boundary Elements

The method of the preceding section requires a clean distinction among tags: some mark the beginning of an event and must have an ID value, along with person and desc; others mark the end of the same event and should have only a startid value.
A third type marks the point of occurrence of an event without duration, for which person and desc values are logically required and an ID value optional. As was noted above, SGML is not in a position to enforce these constraints from the declarations given.

We may make the method more reliable, at the cost of two additional element types, by defining three distinct elements for the three types of event tags. In this revised method, a pair of and tags may be used to mark the beginning and end of any segment. Events which occur at a single point in time and have no distinct start and end may be represented by a third tag. The same sequence of events as that given above would be transcribed thus, in this method:

... passage F ... ... passage H ...
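
The linking that the application has to perform for paired boundary tags can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the TEI proposal: the Boundary class, the link_segments function, and the identifiers e1 and e2 are hypothetical, and the stream is a simplified stand-in for a parsed document in which text and boundary tags alternate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Boundary:
    """Hypothetical model of the three boundary tag types."""
    kind: str                      # "start", "end", or "point"
    id: Optional[str] = None       # expected on "start"
    startid: Optional[str] = None  # expected on "end"; names a start's id
    desc: str = ""


def link_segments(stream):
    """Pair each end boundary with its start and collect the text between.

    Returns {start id: (desc, [text pieces])}. Point events carry no
    duration and are skipped. Raises ValueError for the constraints that,
    per the text, SGML itself would not check.
    """
    open_segments = {}  # start id -> text pieces seen since the start
    descs = {}
    closed = {}
    for item in stream:
        if isinstance(item, str):
            # Text belongs to every segment that is currently open.
            for pieces in open_segments.values():
                pieces.append(item)
        elif item.kind == "start":
            if item.id is None or item.id in open_segments or item.id in closed:
                raise ValueError(f"start needs a fresh id: {item}")
            open_segments[item.id] = []
            descs[item.id] = item.desc
        elif item.kind == "end":
            if item.startid not in open_segments:
                raise ValueError(f"end without matching open start: {item}")
            closed[item.startid] = (descs[item.startid],
                                    open_segments.pop(item.startid))
    if open_segments:
        raise ValueError(f"unclosed starts: {sorted(open_segments)}")
    return closed


# Hypothetical stream: two overlapping events and one point event.
stream = [
    Boundary("start", id="e1", desc="smoking"), "A", "B",
    Boundary("start", id="e2", desc="at window"), "C",
    Boundary("end", startid="e1"), "D",
    Boundary("end", startid="e2"),
    Boundary("point", desc="knocks on wood"), "E",
]
result = link_segments(stream)
```

Because each piece of text is appended to every open segment, overlapping events need no special handling, which is the point of using boundary tags rather than nested elements.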

As before, the application is responsible for treating passages A-C, C-D, and F-H as units. The document type declaration fragment required for tagging of this kind would be something like this:

]>

Solution 3: Typed Segment-Boundary Delimiters

If the types of events form a closed set, a different segment-boundary element can be defined for each type of event. Like the tag of solution 2a, these segment-boundary tags would be empty.

To define distinct segment-boundary tags for moves and gestures, the DTD would include definitions like these:

N.V.'s walk to the window would be marked:

... ... ...

As in solution 2, application software would be responsible for linking the start and end tags (by means of the identical value for the ID and STARTID attributes). SGML would not verify that each START tag had exactly one END, nor that each START tag had an ID attribute and each END tag a STARTID attribute, but the ID/IDREF mechanism would ensure that each STARTID pointed at exactly one ID. Better SGML validation can be achieved, as before, by creating pairs of distinct segment-start and segment-end tags.(3)

Solution 4: Arbitrary Segments as Lists of Elements

A variant on solution 1 can be used to provide slightly better support for arbitrary segments in SGML. In this variant, an arbitrary segment of the text is defined as a set of elements of the text; where existing elements do not have precisely the right extension to define the desired segment, special segmentation elements are used to place boundaries in the correct positions in the text. This approach is defined in TEI P1, where the element is used to specify arbitrary segments of a text as sets of elements, using elements if necessary to divide the text into the desired chunks.
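
The idea of a segment as a set of elements can be illustrated with a small Python sketch. It does not reproduce the TEI P1 tag set; the Chunk class, the resolve_segment function, and the single-letter identifiers are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One element of the text, identified by a unique id (hypothetical)."""
    id: str
    text: str


def resolve_segment(chunks, member_ids):
    """Return the chunks named in member_ids, in document order.

    An identifier that matches no chunk raises KeyError, comparable to
    an IDREF value that points at no ID.
    """
    by_id = {c.id: c for c in chunks}
    if len(by_id) != len(chunks):
        raise ValueError("chunk ids must be unique")
    for ref in member_ids:
        if ref not in by_id:
            raise KeyError(ref)
    wanted = set(member_ids)
    return [c for c in chunks if c.id in wanted]


# Hypothetical document of six chunks; the segment skips chunk "c".
document = [Chunk(i, f"passage {i.upper()}") for i in "abcdef"]
walk = resolve_segment(document, ["b", "d", "e"])
segment_text = " / ".join(c.text for c in walk)
```

Nothing in the sketch requires the listed chunks to be adjacent, so a segment with a gap in it is described in the same way as a contiguous one.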
This approach provides a more declarative interpretation of arbitrary segments (in terms of a set of subtrees of the parse tree, rather than in terms of a specific processing model involving left-to-right scan of the document); it also automatically provides for discontiguous segments. Its disadvantage is in requiring out-of-line markup: the characteristics to be associated with a given arbitrary segment are specified in a separate element, e.g. an , and associated with the arbitrary segment only through an alignment map.

Using this method, N.V.'s walk to the window might be marked this way.(4) Comments are used to show more clearly what is happening; the information is carried by the tags, however, not the comments.

...
... text ... ~~... passage A ...~~ ~~... passage B1 ...~~

~~... passage B2 ...~~ ~~... passage C ... ~~... passage D1 ...~~~~

~~... passage D2 ...~~ ~~... passage E ...~~

... passage F ...

... passage G ...

... passage H ...

Nino sits at table.
Nino walks table to window.
Nino stands at window.
Nino smokes.
Nino knocks on wood.
Nino walks window to table.

Discussion

CONCUR is optimal for expressing orthogonal views of the document. Movement by participants in a conversation may be so viewed. Topic shift (ex. 4) is really not orthogonal and might require segment-terminus tags (solution 2). Solution 2 might also be preferred for data capture; a mechanical operation should be able to convert the resulting text to one using concurrent markup.

If the number of views (types of segment) is in principle bounded, prefer CONCUR. If the number of views is in principle unbounded, the event/time-slice technique must be used. In this case one tag (EVENT) will suffice and more should not be used.

2 MARKING DISCONTIGUOUS SEGMENTS

Problem: how do you mark a segment marked by a single feature, but which is discontiguous in the text?

Examples:

1 the words rendered illegible by the stain on the right hand side of this page
2 the finite verb "stellte vor" in the German sentence "Er stellte seine These den Kollegen hoffnungsvoll vor"
3 the discussion of tariffs in these parliamentary minutes (assuming that the discussion wanders back and forth from one topic to another)
4 the root KTB in the Arabic word "al-kaatib"

Solution 1: Co-indexing

Use co-indexing by means of the SGML ID/IDREF mechanism. If we wish, we can gather all the ID occurrences in other tags elsewhere, in a sort of register which might look like this (for problem example 3):

]>

... ... ... ... ... ... ... ... ... ... ...

An alternative form would use a single tag for both the ID and the IDREF attributes, using declarations like these (where LEG='legibility' and S='stain'):

and allowing document sequences like:

Random statistical quirk for the day: the word "no" appears 1344 times in the King James Bible, but the <(leg)s id='s23'>word "yes" appears only twice!
(Grep for<(leg)s segid='s23'> yourself if you don't believe me). At<(leg)s segid='s23'> first I thought this was just a hilarious<(leg)s segid='s23'> artifact of religious dogma, so I chec<(leg)s segid='s23'>ked Alice in Wonderland -- "yes" appears onl<(leg)s segid='s23'>y once! Curiouser and curiouser. Well it<(leg)s segid='s23'> turns out to be a property of English (yes<(leg)s segid='s23'>/no = .066 on average), and when you consider why this might be, it's undoubtedly due to the fact ...

(Humanist 3.769, Tue, 21 Nov 89, posting from mike@tome.media.mit.edu (Michael Hawley))

Note that one occurrence of S (here the first) must have an ID attribute, and the others an IDREF.

Of the two mechanisms described here, the former, with different tags for the head of the group and the various tails, is to be preferred.

Solution 2: Redundant Separate Storage

For micro-discontinuities like that in example 4, it might be simpler to introduce redundancy and store the discontiguous segment separately, e.g. with

al-kaatib

or

KTB
al-kaatib

Since KTB is analysis and not part of the text being lemmatized, the ML committee leaned toward the former solution (root as attribute, not element).

Solution 3: Alignment Mechanism

Another mechanism for marking discontiguous segments is the alignment map mechanism defined in chapter 6 of TEI P1 and described above as solution 4 for arbitrary segmentation.

3 HANDLING AMBIGUOUS CONTENT

Problem: how does one mark multiple analyses of the same content?

Examples:

1 the gross syntactic structure of the sentence "I saw the man with the telescope"
2 the pagination of the various editions of Shakespeare's Hamlet

Solution 1: Concurrent Markup

Use CONCUR and define a separate document type for each edition to be included. Assume that we wish to mark volume, page, and column numbers for some editions, volume and page numbers for others. The following DTD may be embedded for each edition; it assumes that any edition is composed of one or more volumes, each volume comprises a set of pages, and each page can contain character data, lines, or columns. Because different editions have different material, an OMITTED tag is provided to mark some contents as not being present in the edition.

This concurrent hierarchy is enabled as shown in the comments; the document contains (after the lines enabling the basic document hierarchy) the sequence of lines (assuming the DTD is stored under the system file identifier "plrefs.dtd"):

]>

which call the document type for page and line references and give it the name "La." If page and line numbers from more than one standard edition are to be marked, then the relevant lines may be repeated, each time using a different value for the document type and entity definition (where the example has "La").

Multiple editions of Hamlet might be tagged this way, using this mechanism:

]>
]>
]>
]>
]>
<(tei.1)tei.1><(f)f><(q1)q1><(q2)q2><(Ri)Ri>
<(f)omitted><(q1)omitted><(q2)omitted><(Ri)omitted>
<(tei.1)tei.header> ...
<(tei.1)text><(tei.1)body>
<(tei.1)div1 name='act' n='1'>
<(F)page n='g5a'>
<(Q1)page n='3'>
<(Q2)page n='[3]'>
<(Ri)page n='234'>
<(F)page n='g5b'>
<(Q2)page n='4'>
<(Ri)page n='235'>
<(F)page n='g5b'>

Solution 2: Redundant Storage of String

The string may be repeated with different markup each time. This is an obvious solution but causes problems for views other than the one in which the ambiguity is visible: they see only the repeated content, not the difference in tagging.

Solution 3: Out-of-Line Markup (Empty Elements)

The chart of this sentence may be represented with an empty element for each arc of the chart, with pointers to the endpoints of the arc. The DTD will have:

Where the tokens of the text are numbered 1-N, and the endpoints of the nodes are 0-N, node K follows token K of the text. (If validation of endpoints by the SGML parser is desired, then make these changes or additions to the DTD: and assign the SENTENCE ID to be the node before the first word.)

The text will have (using SHORTTAG to omit redundant attribute names), and using comments in the right margin to indicate selected phrases:

I saw the man with the telescope.

Obviously, the unambiguous arc information can be interspersed with the text, leaving the PARSE elements to group the competing analyses. This does complicate the DTD and the text.

Note: The solution described here is fundamentally similar to that offered by TEI P1's tags for linguistics analysis: out-of-line analysis linked to the analysed text by pointers implemented by SGML ID and IDREF attributes. Like this one, the notation allows multiple analyses of the same content; the intermingling of content and analysis is not contemplated, for simplicity's sake.

Solution 4: Special Notation

Use a special notation to express the parses more compactly, at the cost of losing validation by the SGML parser. Using a DTD like this:

We can have a text like this:

I saw the man with the telescope.

Or:

I saw the man with the telescope.
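
To make the chart representation of solution 3 concrete, the following Python sketch stores each arc as a pair of node numbers and recovers the phrase it spans, using the numbering described there (nodes 0-N, node K following token K). The Arc class, the span_text function, the category labels, and the parse names are assumptions made for the illustration; they are not the element or attribute names of the DTD.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    start: int  # node before the first token covered
    end: int    # node after the last token covered
    label: str  # hypothetical category label


TOKENS = "I saw the man with the telescope".split()


def span_text(arc, tokens=TOKENS):
    """Return the tokens covered by an arc.

    Node K follows token K, so an arc from node i to node j covers
    tokens i+1 .. j, which is the zero-based slice tokens[i:j].
    """
    if not 0 <= arc.start < arc.end <= len(tokens):
        raise ValueError(f"arc endpoints out of range: {arc}")
    return " ".join(tokens[arc.start:arc.end])


# Arcs common to both readings of the sentence.
SHARED = [Arc(0, 1, "NP"), Arc(1, 2, "V"), Arc(4, 7, "PP"),
          Arc(1, 7, "VP"), Arc(0, 7, "S")]

# Competing analyses, grouped the way the PARSE elements group them:
# only the extent of the object noun phrase differs.
PARSES = {
    "pp_modifies_verb": [Arc(2, 4, "NP")],
    "pp_modifies_noun": [Arc(2, 7, "NP")],
}

object_np = {name: span_text(arcs[0]) for name, arcs in PARSES.items()}
```

Keeping the shared arcs apart from the competing ones corresponds to interspersing the unambiguous arcs with the text and leaving only the disputed arcs inside the grouped analyses.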

Solution 5: Treat as Arbitrary Segments

Treat all parse subtrees as arbitrary segments using the techniques already outlined.

Commentary

Where local ambiguities are independent, leading to combinatorial explosion of overall ambiguity, concurrent markup is not wholly satisfactory, since it requires a separate markup stream for each overall interpretation of the ambiguity. Out-of-line markup in the style of solution 3 or the construct of TEI P1 is preferred in these cases.

Where there is no combinatorial explosion (as in the multiple paginations of classic works) and the different segmentations of the text do not interact, CONCUR is the preferred solution.

4 MARKING OVERLAPPING (E.G. BI-CLAUSAL ANALYSIS)

Problem: how does one mark text segments which can associate either left or right, as in "she (took advantage [of) Joan]" or "Broadway Hit or Miss?" or as in apo koinu constructions?

Solutions: these examples appear to be solved by the methods of arbitrary segmentation and by the out-of-line markup mechanisms described in chapter 6 of TEI P1 ( and ).

5 SYNCHRONOUS PARALLEL STRUCTURES AND TRANSCRIPTIONS

Problem: How do you mark the synchronization points of a set of parallel texts (e.g. texts of the Bible, or the nine language versions of EEC legislation, or phonetic, phonemic, and orthographic transcriptions of the same text)?

Examples:

1 parallel texts (translation equivalents)
2 parallel texts (manuscript variants or recensions)
3 phonemic and orthographic transcriptions of same content

Solution 1: Implicit Parallelism

So long as order is preserved, parallelism between synchronous structures can be implicit. The lowest level at which the parallelism is to be expressed contains a sequence of parallel versions. For example 3, the DTD might include:

(It is assumed that SEGMENT is small enough that all gross text structuring occurs above it in the hierarchy.) The text then is:

content ...
(phonemic transcription of 'the') the
(phonemic transcription of 'fat') fat
(phonemic transcription of 'cat') cat
content ...

This is the method used in the / tagging defined in chapter 6 of TEI P1 and demonstrated on translation equivalents in appendix A.6.3.

Solution 2: Explicit Synchronization Using Common Identifiers

Where sequence is not preserved, locations or segments must be given identifiers, and cross-references from one text to another must indicate the parallelism. E.g.

Tor nach Durchfahrt bitte zumachen!

and

Please close the gate after passing through!

This is the approach taken in synchronization through canonical references (see appendix A.6.1 of TEI P1); a more elaborated version of the same approach, allowing for one-to-many matching of segments, is found in the mechanism of TEI P1 chapter 6.

Solution 3: Explicit Synchronization with Many-to-one Linkages

Where the segments of parallel texts not only appear in varying orders but do not match one-to-one, the use of common identifiers to align the texts does not suffice. In this case, the mechanism of TEI P1 (described above as solution 4 for arbitrary segmentation) must be used. An example of the application of alignment maps to parallel texts may be found in appendix A.6.2 of TEI P1.

6 INTERNAL AND EXTERNAL CROSS-REFERENCES

Problem: how does one refer to locations elsewhere in the same or in a separate document?

Solution: ID/IDREF

Use the ID/IDREF mechanism. For external references, this will require application support, but the specification of an ID name with a (possibly system-dependent) document identity will uniquely point to a specific ID in any document. This is the basic mechanism specified in section 5.7 of TEI P1.

7 VAGUENESS OF LOCATION

Problem: How do you mark a segment or text element with "fuzzy" ends?

Examples:

1 the passage begins approximately here, but it is not certain exactly where
2 the passage begins somewhere between one point (a) and another (b) (e.g.
an echo of another text, which may begin and end gradually, providing a section which is certainly an echo, in the opinion of a tagger, surrounded by a penumbra which might or might not represent an echo)
3 the passage referred to by a marginal note which does not have a corresponding symbol in the text

Solution 1: PRECISION Attribute on Tag

Use a PRECISION=VAGUE attribute on the tag whose location is uncertain.

Solution 2: Double Tagging

Use double tagging, either with empty tags (as for arbitrary and overlapping segments) or with nested elements, so that one tag occurs at point (a) of example 2 and one at point (b). The nested elements could be separate elements, the inner representing the text segment where the text feature ("echo" in the example) is certainly present, the outer where it might be present. Alternatively, if the element can self-nest, the outer element could have the attribute STATUS=POSSIBLE and the inner STATUS=CERTAIN.

-------------------------

(1) Or more elaborately, the passage marked by straight lines in the left margin, the passage marked by wavy lines in the left margin, the passage underlined by hand with simple straight line, the passage underlined by hand with simple straight line which was later deleted by hand, etc., as in the transcriptions of Wittgenstein's manuscripts in the Norwegian Wittgenstein project. See Claus Huitfeldt and Viggo Rossvoer, The Norwegian Wittgenstein Project Report 1988 ([Bergen]: NAVFs EDB-Senter for Humanistisk Forskning / Norwegian Computing Centre for the Humanities, 1989), esp. pp. 201-236.

(2) This is the equivalent of the tag defined in drafts 1.0 and 1.1 of the guidelines for segmentation of text according to pagination of multiple editions. The tag differs, however, in assuming a simple single-level segmentation of the text: the values specified in any tag apply to all following text until the next marked as belonging to the same edition.
Hence no explicit end marker is needed and the ID/IDREF mechanism can be dispensed with.

(3) The TEI expects to recommend better support for such non-hierarchical start-end segment-tag pairs as an enhancement of SGML to be added during the next revision of ISO 8879. See document TEI ML W32 for TEI proposals for the revision of SGML.

(4) N.B. the feature structure tags here use a name attribute rather than a separate tag as in TEI P1 version 1.1; this change will be present in TEI P1 version 2.