.sr docfile = &sysfnam. ;.sr docversion = 'Draft';.im teigmlp1
.* Document proper begins.
<title>The Validated---or Violated?---Text: Issues in Specifying Document Structures
<author>C. M. Sperberg-McQueen
<address>ACH/ACL/ALLC Text Encoding Initiative
<aline>University of Illinois at Chicago
</address>
<docnum>TEI &docfile.
<date>&docdate.
</titlep>
<!>
<abstract>
The importance of formal validation of text structures is described. Various specifications of basic text structure are compared, and the mechanisms thus far developed for the formal specification of text structure are described. Problems arising in an attempt to use SGML to specify document structures formally are discussed.
</abstract>
<toc>
</frontm>
<!>
<body>
<h1>Introduction
<p>As mechanical processing of text becomes easier, it also becomes easier---and more important---to specify formally what a text is and to use that specification to ensure the validity of the data stream which represents the text in the machine. Validation has become more important because application software uses increasingly complex data structures for text representation, and because our mechanical processing can destroy or corrupt data with an efficiency and thoroughness that far exceed the wildest dreams of the most assiduous scholar working by hand. Validation has become easier because computer science has provided a rich set of data structures to use in representing texts and increasingly sophisticated notations for specifying the valid forms of those data structures.
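The sense in which a text's structure can be specified formally, and then validated mechanically, may be suggested by a deliberately miniature SGML document type definition (an illustration only, not drawn from the TEI tag sets): it declares that an anthology contains an optional title followed by one or more poems, a poem an optional title and one or more stanzas, and so on down the hierarchy.

```sgml
<!DOCTYPE anthology [
<!ELEMENT anthology    - - (title?, poem+)   >
<!ELEMENT poem         - - (title?, stanza+) >
<!ELEMENT stanza       - O (line+)           >
<!ELEMENT (title|line) - O (#PCDATA)         >
]>
```

Given such a declaration, a validating parser will mechanically reject a stanza containing no lines, or a poem nested inside a stanza; this mechanical rejection of ill-formed data streams is what is meant here by validation.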
This paper will report on some of the problems encountered in the work of the ACH/ACL/ALLC Text Encoding Initiative (TEI) in its efforts to create formally verifiable specifications of textual structures: the substantive questions of what the structural components of texts are; the methodological questions of choosing abstract structures with which to represent texts and formal notations with which to specify those abstract structures; and a number of concrete problems in the proper application of such abstract structures and formal notations to pre-existing texts of the sort studied by most textual scholars.
<p>Any text-encoding scheme must provide ways to represent the characters of a text, its basic structure, intrinsic features other than structure, and extrinsic information associated with the text by an annotator. I am here concerned not with the first of these, but only with the other three. As examples of current practice, the paper will take the international standard Office Document Architecture (ODA); the Standard Format Exchange Program (StanFEP) developed at the Max Planck Institute in Goettingen; some commonly used text-processing programs; and of course the Standard Generalized Markup Language (SGML) and some tag sets developed within it, notably that created by the TEI.
<p>Before moving into the technical portion of the talk, let me describe very briefly some basic assumptions I am making. I won't go into these in detail (that is another talk of its own), but they are perhaps worth making explicit:
<ol>
<li>Markup of a text reflects a theory of the text. A markup language reflects a theory of texts in general.
<li>One's understanding of texts is worth sharing (and thus worth expressing in markup).
<li>No finite markup language can be complete (and therefore any general-purpose markup language should be extensible).
<li>Texts are linguistic objects.
<li>Texts occur in / are realized by physical objects.
<li>Texts are fundamentally linear; texts are fundamentally hierarchical.
<li>Texts are fundamentally network structures, with the arcs defined by cross-references and other hypertextual links.
<li>Texts <emph>refer</emph> to objects in a real or fictive universe.
<li>Texts are cultural and therefore historical objects.
</ol>
<h1>Substantive Issues: what does belong in a text?
<p>All markup schemes must identify and classify the structural and other components of text; if we study them, we can attempt to describe the consensus and disagreements of current practice on a number of issues:
<h2>Basic text structure
<p>The top-level document structure is defined in current schemes in one of three ways:
<ul>
<li>front, body, back
<li>header, text (front, body, back)
<li>(collapsed hierarchy)
</ul>
<h3>Document header / profile
<h3>Front matter
<h3>Back matter
<h3>Document body
<h2>Crystals
<p>Crystals (the felicitous term is Steven DeRose's) are internally structured, free-floating units of text, such as figures, tables, or bibliographic citations.
<h3>Figures
<h3>Tables
<h3>Bibliographic references
<h2>Phrase-level attributes
<p>Within the paragraph, the text structure suddenly breaks down and we are confronted with the consistency of soup. This soup requires a rather elaborate specification: for the phrase-level tags, for the larger chunks (the crystals we have already seen, but also non-crystalline floating chunks of text like annotation), and for miscellaneous embeddable items like graphics. Let us take first the treatment of phrase-level attributes of text which appear freely in prose.
<h2>Typographic details, layout, processing
<p>Treatment of typographic details, layout, and other specific processing requirements.
<h2>Annotation
<h1>Questions of Method
<p>The methodological discussion will describe both the models of text visible in current practice and the methods of text definition and validation they make possible.
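The flavor of the more formal of these methods can be suggested in advance with a single structural rule (purely illustrative, not taken from any actual DTD): a chapter is a heading followed by one or more paragraphs, with notes permitted anywhere. Expressed first as a grammar rule and then as an SGML element declaration:

```sgml
<!-- as a regular right part grammar rule:
       chapter ::= heading paragraph+
     (free-floating notes expressible only by rewriting each production) -->
<!ELEMENT chapter - - (heading, p+) +(note) >
```

The inclusion exception +(note) allows note elements to occur anywhere within a chapter; such exception mechanisms have no direct analogue in conventional grammar notations.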
These models and methods range from wholly unstructured text strings punctuated by occasional processing instructions constrained only by specific implementation details, through varying levels of richness of construction and rigidity of structure, to arbitrarily complex structures with fully explicit definitions in formal languages. Some comparisons among the more formal of the specification methods will clarify the particular characteristics of SGML and of the TEI's SGML document type definitions (DTDs).
<h2>Models of Text
<h3>One Damn Thing after Another
<p>Strings, punctuated by events. E.g., WScript, TeX, troff.
<h3>Ill-defined Hierarchies
<p>WGML, Syspub, -ms macros.
<h3>Well-defined Hierarchies
<p>Word Cruncher, COCOA click-over rules, TLG, Scribe, LaTeX, SGML without pointers.
<h3>Arbitrarily Complex Graphs
<p>Hypertext systems, SGML with pointers.
<h2>Text Definition and Text Validation
<h3>Implicit hard-coded grammar
<p>WGML, LaTeX.
<h3>Explicit one-shot hard-coded grammar
<p>Daphne, Scribe.
<h3>Explicit revisable grammar
<p>BNF, regular right part (RRP) grammars.
<h3>SGML content models
<p>These differ from RRP grammars in their additional operators and in the SGML exception mechanisms.
<h2>Specific Design Issues
<p>Specific issues which have arisen in connection with the TEI's efforts to create SGML tag sets will be considered in the light of the foregoing discussion of current approaches to document structure.
These include:
<ul>
<li>the basic tension between prescriptive and descriptive specifications of document structure, and the risk that rigid formal document specifications may fail to match the chaotic reality of historical documents, leading to violations of the historical integrity of the texts studied
<li>the need to control the complexity of markup specifications by grouping tags into tag subsets which can be defined and understood independently of each other
<li>the need for arbitrary combinations of these specialized tag subsets to be usable together
<li>the need to allow flexibility in the document definition without losing all useful structure, and the related technical problem of specifying the formal characteristics of normal prose
<li>the need to define multiple hierarchies across a single document
<li>the need to distinguish document content from document markup, and the difficulty of specifying for all cases what should be encoded as content and what as markup
<li>difficulties raised for analysis of all kinds by the presence of annotation related to other types of analysis, and the complications which such analysis causes for the specification of multiple hierarchies
</ul>
<p>The current treatment of these problems in the TEI, the logic behind that treatment, the problems which remain unsolved, and the prospects for their solution will be discussed.
</body>
<!>
<backm>
<h1>Bibliography
<bl>
<bib>Gierl, Martin, Thomas Grotum, and Thomas Werner. <cit>Der Schritt von der Quelle zur historischen Datenbank: StanFEP: Ein Arbeitsbuch.</cit> Halbgraue Reihe zur historischen Fachinformatik, Serie A: Historische Quellenkunden, Bd. 6. St. Katharinen: Scripta Mercaturae Verlag, 1990.
<bib>Homann, Kathrin. <cit>StanFEP: Ein Programm zur freien Konvertierung von Daten.</cit> Halbgraue Reihe zur historischen Fachinformatik, Serie B: Softwarebeschreibungen, Bd. 6. St. Katharinen: Scripta Mercaturae Verlag, 1990.
<bib>Huitfeldt, Claus, and Viggo Rossvær.
<cit>The Norwegian Wittgenstein Project Report 1988.</cit> [Report 44 of the Norwegian Computing Centre for the Humanities.] [Oslo]: NAVFs EDB-Senter for Humanistisk Forskning, October 1989.
<bib>Knuth, Donald. <cit>The TeXbook.</cit> Reading: Addison-Wesley, 1984.
<bib>Lamport, Leslie. <cit>LaTeX: A Document Preparation System.</cit> Reading: Addison-Wesley, 1986.
</bl>
</backm>
</gdoc>