Date: Fri, 23 Feb 90 19:55:02 CST
From: ZRCC1001@EARN.SMUVM1
To: lou@UK.AC.OXFORD.VAX
Subject: part1 (part 1 of 2)

February 21, 1990
Draft, TEI-REP Working Paper
For TEI-TRR4, sections 6.9 - 6.10
*Supersedes draft "February 18, 1990" distributed on TEI-REP

Robin C. Cover
3909 Swiss Avenue
Dallas, TX 75204
(214) 296-1783/841-3657
BITNET: zrcc1001@smuvm1
INTERNET: robin@txsil.lonestar.org
UUCP: attctc!utafll!robin
UUCP: attctc!texbell!txsil.robin

ENCODING FOR TEXTUAL PARALLELS AND CRITICAL APPARATUS

INTRODUCTION

In this paper I discuss some proposals and partial solutions to the problems of encoding textual parallels, textual variation and related features. Encoding for textual parallels and textual variation will employ SGML-based (and perhaps non-SGML) constructs developed for other encoding problems, both within the domain of TEI-REP and within the TEI ANA group. The two inter-related topics specifically assigned to me under the current TEI-REP work agenda ("textual parallels" and "textual variants") are dependent upon and closely related to the decisions reached on "Reference Systems" (6.7; Mylonas) and "Cross-References and Text-Links" (TEI TRR4 6.11; DeRose). Likewise, Johansson's work on "Normalization Practices and Documenting Them" (TEI TRR4 7.2.1) bears similarity to the text-critical issue of machine collation and machine analysis of texts in different languages, scripts, transcription systems, orthographic strata, etc. Accordingly, I plan to revise the following sections for TEI TRR2 6.9-6.10 in light of other decisions and recommendations reached in the Oxford TEI-REP meetings.

I. TEXTUAL PARALLELS [revised material to be inserted at TEI TRR2 6.10]

A. DEFINITIONS

I use the term "(textual) parallels" to refer broadly to any two documents, regions of text within the same document, or sets of documents which literary/linguistic analysts would like to compare in some way because they share a significant degree of structure and/or content, or perhaps pedigree. The comparison may be quantitative (e.g., searching for an index of term-frequencies or co-occurrences) or simply visual (e.g., parallel displays which support synchronous scrolling of two or more documents). Provisionally, I think of "textual variants" as a special case of textual parallelism (though requiring much more annotation), in that they are parallel textual objects competing for antiquity, typological priority, authenticity and so forth. The issue of document versioning is tangent to textual criticism in that both involve attempts to reconstruct or record exact evolutionary stages and processes in the composition and transmission of texts. In the more general case, "textually" variant readings are simply parallel texts which contribute information to our understanding of the evolutionary history of an idealized (fictional) literary entity we think of as a literary "text."
Here are some examples of (sub-)documents I would call "parallel" texts:

*a literary allusion and its assumed/purported origin in the classic; a quotation and its contextual source
*a literary text and its translation (into another language, or into multiple languages)
*the same text viewed simultaneously in different scripts, within different orthographic strata or under different transcriptional systems [here the document views are envisioned as being generated from separate documents or parts of a database rather than through document filtering]
*texts which stem from a common source (e.g., the three New Testament synoptics, sharing a common origin in the hypothetical document Q)
*multiple (serial or synchronous) recensions of the same text, which nevertheless maintain a common identity as a certain literary document (e.g., the eight versions of Sargon's annalistic reports; multiple recensions of the flood epic in several Mesopotamian languages at several periods; the long and one-seventh-shorter versions of biblical Jeremiah)
*instances of oral formulaic poetry (line, stanza) which appear in different epic or liturgical compositions
*a literary text and running commentary on that text, including dedicated region(s) of textual commentary on the text or textual apparatus
*texts which bear some unclear textual relationship, but share a significant amount of content and/or structure (e.g., sections of biblical Samuel/Kings/Isa/Chron)
*sections of texts "reused" in other contexts (long sections of certain Psalms in prose texts narrating a liturgy)
*sections of texts excerpted wholesale and thereafter acquiring an independent transmission history (e.g., the biblical Odes)
*paraphrases or synopses (e.g., Josephus' Antiquities of the Jews, which sometimes tracks very closely with the biblical (Greek) Septuagint text)
*a lineated/versified text and its associated (sets of) cross-references to other texts or locations within the same text
*a "clean" text and alternate views of that text based upon its literary or linguistic annotations

B. LEVELS OF GRANULARITY

In order to provide encoding for such parallel texts or regions of texts, the encoder must begin by: (a) determining the level of granularity at which the text(s) are to be tagged, (b) designating names for those textual objects which will be matched as the parallels, and (c) devising a referencing scheme which allows one to unambiguously point to regions of "parallel" text delimited by the tagging (so that reference may be made to these units from external sources). The referencing scheme is critical because the parallels may not involve matched textual objects (e.g., "verses" in one edition with "verses" in another edition). The parallels may involve mappings between points and spans, or between spans of text very unlike in some characteristics.

The levels of granularity chosen of course depend upon the applications. I believe that for representing "parallel" structures (and "textual variants"), character-level encoding and everything higher must be supported. Certainly this will be required by linguists in the TEI ANA group, but we may imagine multiple uses for character-level encoding for "textual variation" and "parallels" as well. In text criticism, paleographic annotations at character level are often given by various typographic conventions.
For example, in Qumran publications there are 4-5 special sigla attached to characters (usually supralinear attachment) for indicating that the characters are of such-and-such varying degrees of legibility, or are suspended characters in a word, etc. In other cases, uncertain characters are interpreted in terms of a closed set of alternatives: "b/m" (written in Hebrew script) would mean "the character beth or mem -- we know it is one or the other, because both make sense and are paleographically possible, but we don't know which letter this is for sure." Or suppose we wish to study parallel displays in which adjacent orthographic strata are involved, and we want to see "level n+1" orthographic innovations in a color-coded scheme (e.g., in historic periods when quiescent consonants or semi-vowels are being introduced as vocalization markers). Or a literary critic may wish to selectively tag consonance or assonance using character-level or syllable-level annotation.

At a higher (but still sub-word) level, morpheme-level annotations are also necessary both for text criticism and for parallel analysis. Consider the case where a German student may be assisted in Greek language instruction by viewing parallel texts, say a Greek tragedy and a German translation, or an English student learning Hebrew uses synchronized parallel displays for Hebrew and English. In either instance, the student might tab through a sentence in either window and see equivalent (parallel) terms displayed in reverse video in both windows. But morpheme-level tagging would be required: the Hebrew "and/from/his/hellish abode" is written as one "word" (yet four or more "morphemes," as counted by Hebrew linguists) but as four words in English, so the color display of the single Hebrew word will show four color-coded regions matching the similarly color-coded four separate words in English. And German separable prepositions will map at morpheme level to single Greek terms. For text-critical work, variants often involve just differences in person, number, gender, mood, state (whatever) which are resolved in inflectional morphemes; since these kinds of variations need to be annotated for cataloging, it's clear that morpheme-level annotations (morpheme-level tagging) are necessary. Syllable-level marking and morpheme-level marking are necessary for making text-critical comparisons of cuneiform texts, for example, because of mixed orthographies in various scribal traditions, geographic regions, etc. (e.g., the neo-Assyrian scribes use far more logographic (ideographic) spellings with phonetic complements for certain words than Babylonian scribes copying the same texts with syllabic writing); for these languages, text-critical analysis will require not only orthographic normalization, but in some cases syllable- and morpheme-level tags. It is easy to recognize, of course, that linguistic analysis requires morpheme-level annotations, and probably multiple alternate morphemic representations based upon varying linguistic theories.

I have started with two examples of sub-word granularity for parallelism because it is not entirely clear that an SGML encoding should be used for this level of tagging/markup/linking. I have no personal opinion or preference on this matter.
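By way of illustration only -- the element name, attribute names and values below are invented, and constitute a sketch rather than a proposal -- the "b/m" case mentioned above might be captured with a character-level element along these lines:

=====================================================================
<!-- one damaged character, legible only as beth-or-mem: the element
     content gives a provisionally preferred reading, and the "alts"
     attribute carries the closed set of paleographically possible
     letters; all names here are illustrative -->
<c id=c14 legibility=partial alts="B M">B</c>
=====================================================================

Whether tagging at this density is tolerable is part of the overhead question raised immediately below.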
On the authority of two SGML experts, I am cautioned about two potential concerns: (a) the amount of overhead involved in SGML tagging at character- and morpheme-levels is extraordinary -- we can easily get more meta-data than data, which might be a problem; (b) character-level tagging using SGML has (apparently) not yet been adequately tested.

Here follows an example of an encoding of one single Hebrew word. I know it's monstrous: it evoked a gasp of "oy, veh" from a more qualified member of TEI-REP (I suppose not just because it's so ugly and redundant -- it fails to use legal minimizations -- but also because there is no DATA here yet except the marking of morphemes with reference id's). I have delimited only one of the 13 characters; I could also have added "syllableid," which would be useful in text criticism for cuneiform texts.

========= I wrote: ========

"Suppose I mark up a Hebrew text as follows, where (in my particular DTD) "verseid," "wordid" and "morphid" are required, but "charid" is optional (used mostly for citing Qumran readings, or in manuscript publications themselves). In this sample, I put a charid tag on the first char by way of example because I know there's a textual variant I'll want to talk about. But I use no minimizations yet. Suppose I do something like the following in setting up markers for the first word... (the "text", if you're looking for it, is the first word of II Samuel 22:43 (assuming Michigan-Claremont uses IRV chars which arrive intact at your end): W:)E$:XFAQ"73M, where "73" is the accent number from the standard Michigan-Claremont encoding scheme, and the word means "(thus) I beat them flat")

w: )E$:XFQ "73M....

(This example would be unbearably monstrous if charid's were assigned to each of the thirteen characters in the word.)

================

One obvious criticism of this approach is that much of the encoding can be generated (e.g., charid's) by a program; so why introduce this stuff into the text? Why not just let the application handle it, and use some non-SGML system for referencing? On the hypothesis that such encoding is at least legal, and would help satisfy our need for character- and morpheme-level annotations, it remains unclear to me whether this encoding would actually be useful in solving the problem of "parallel" and "textually variant" texts. If character- and morpheme-level units tagged with an SGML-based tagging system cannot be addressed (referenced) with SGML mechanisms (IDREF, IDREFS) for purposes of synchronization, then perhaps they are of lesser value; perhaps a non-SGML (applications-level) solution should be proposed. The issue of SGML-based referencing will be discussed more below.

Besides the fact that this markup is ugly, redundant, and can in part be programmatically generated, there are other criticisms: (a) current processors will gag and choke; (b) the file now defies the canon of "human-readable" SGML files. Possible answers to these objections: (a) "don't let processors (except batch processors working as filters) look at the monstrosity," or "processors are getting faster all the time"; and (b) "so what? (give up the myth of 'human-readable' SGML)."

A non-SGML solution which would be well-suited (?) to humans and computers alike would be to use a standard referencing system down to verse or line level, then some kind of regular-expression style "pattern match" to get the offsets. Of course, this is the method used by human editors and authors in traditional scholarship.
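Made explicit for machine use, such a reference might be carried in something like the following sketch; the element name, its attributes and the sample values are all invented for illustration, and nothing here is a settled proposal:

=====================================================================
<!-- a machine-resolvable pointer into another (possibly paper-only)
     edition: a canonical reference down to verse/line level, plus
     quoted context strings from which an application can recover
     the exact character offsets by pattern match; the element would
     be declared EMPTY, and every name here is illustrative -->
<extref ed="edition.id" ref="canonical.ref"
        pre="preceding context" match="relevant text"
        post="following context">
=====================================================================

The prose form of the same convention is familiar enough.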
We say, "as reported by Josephus in Antiquities XVI.187, '(preceding_context) relevant_text (following_context)' (Loeb/Marcus edition) blah blah..." For use of this convention in encoding, however, the rules would need to be much more rigorously defined and enforced (how to qualify second-occurrences of substrings; how to designate omission of characters/words with ellipsis). The advantage of this system is that it's instantly democratized because it's based upon unambiguous rules and applicable to paper editions as well as electronic editions. A scholar in a remote part of the world publishing a new text, or encoding an extant text could cross-reference any other "parallel" text at any level of granularity without knowing how others (unbekannt) were encoding those "parallel" texts. From the machine's point of view, the system is unambiguous and efficient because it uses character offset in the file. The downside would be that character-offsets (in machine languages) are vulnerable when files are changed in minor ways, and methods for updating offsets/indices or redundancy checks would place another burden on the application. C. SGML-BASED REFERENCING SYSTEM More will be said below about the kinds of annotations which need to be attached at character- morpheme- word- arbitrary_string- and other (linguistic) levels for encoding textual variation. Since parallel texts are just declared to be "matched" (but not HOW), the situation is easier. For the purpose of synchronizing textual regions for parallel display or analysis, the primary concern is whether/how the SGML-based encoding can be used, in connection with other factors, to help drive these applications. Some of the "other factors" will now be discussed. Most of these issues were discussed at length in my earlier TEI-REP paper (TEI TRW2, available in 2 parts on the UICVM listserver as TRW2PT1 SCRIPT and TRW2PT2 SCRIPT), and need not be rehearsed in detail here. 
*referencing discontinuous segments (morphemes, words) belonging to the same object (SGML id)
*referencing individual sub-elements, and the sequencing of those sub-elements, of such discontinuous segments
*referencing arbitrary strings and word-substrings (e.g., where word boundaries are in dispute, or have been misunderstood); we may think of arbitrary character-offset as one example, though for texts with syllabic, ideographic or mixed writing systems, other levels of arbitrary spans are possible
*multiple, overlapping, alternative hierarchies (without multiple DTD's and burdensome CONCUR overhead)
*synchronizing or normalizing incompatible external referencing schemes which APPEAR TO point to identical content, but do not; normalizing existing canonical referencing schemes (see especially "FACTOR 4" in TEI-TRW2)
*referencing textual elements within parallel streams of an interlinear text, /Partitur-Umschrift/ ("score") edition, or similar document, where the text streams *could* be conceived as or generated from independent documents, but in which the specific text-geography (presentation) of the printed edition is an essential aspect of the document's personality (see my earlier description of variations, TEI TRW2 "Factor 7")

Two general approaches for synchronization are obvious; in our CDWord hypertext project, we have used the first method: (1) one may place structure markers from text A as empty tags into all other documents which need to be set in synchrony with text A (multiple binary mappings), and (2) one may place artificially-devised "canonical" reference markers in all texts, with the help of ancillary tables which help resolve anomalous situations relative to the canonical standard text. Synchronizations may then be made between the artificial scheme and the individual referencing schemes used in traditional scholarship (incongruent systems being resolved at this level). Whether either of these methods is ideal, and how either would be best implemented with an SGML-based language, I will defer to others. Perhaps there are far better systems. I have some skepticism about both methods, based on difficulties we have encountered.

Against the first method: (a) the amount of overhead in meta-data will become burdensome for texts which are heavily cross-referenced, and (with CONCUR?) seems to imply multiple or overfull DTD's; (b) it fails to exploit the usefulness of currently existing canonical referencing schemes; (c) it may (?? -- not sure) not support the need to see different sections of the same document in parallel. However, in favor of the first method (multiple binary mappings) is the fact that some synchronizations, especially those at a very low level of granularity, appear to be required at the binary level. Imagine, for example, trying to map a dozen versions or translations of the Bible onto a common "canonical" scheme, based upon unique (SGML) id's for each morpheme in each version, with an ultimate goal of supporting color-coded "parallels" at morpheme level between any two versions or all versions. Binary mappings at this level appear feasible (though labor-intensive), but multiple synchronizations via a single canonical scheme seem hard. Maybe it would work, but I have some doubts (general linguists: please speak up). Multiple binary mappings sound like a lot of work, but currently the CATSS project at the University of Pennsylvania/Hebrew University, and the Fils d'Abraham Project (Maredsous), are making binary mappings between Bible versions, apparently with satisfactory results.
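In the simplest possible terms, the first method might look like the sketch below (the tag and attribute names are invented; only the general mechanism -- empty milestone tags carrying text A's reference id's inside text B -- is at issue):

=====================================================================
<!-- inside text B (say, a translation): empty <sync> milestones,
     declared EMPTY in the DTD, carry the id's of text A's verse
     divisions so that an application can align the two streams;
     element and attribute names are illustrative only -->
<p><sync aref="a.2sam.22.43">first verse of the translation ...
<sync aref="a.2sam.22.44">second verse of the translation ...</p>
=====================================================================

The second method would look the same on the surface, except that the milestones would carry id's from an artificially-devised canonical scheme rather than from text A itself.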
At a higher level of granularity (e.g., biblical "verse" level), using an idealized general referencing system may be feasible; it works tolerably well in our CDWord project for the purpose of visually aligning (synchronous scrolling) biblical texts, translations, commentaries and other "canonically-structured" documents in parallel displays. I do not know whether there may be more optimal SGML or non-SGML solutions to the referencing problem which avoid cluttering the text with unique id markers. I feel confident that other members on the TEI-REP subcommittee (those with professional training in formal language theory) will have valuable judgments about these markup issues; I cannot offer more than the general options I see on the surface. In response to a query, Steve DeRose briefly suggested using an open-ended canonical scheme with integers or whole numbers; presumably Steve (section 6.11, "Cross-References and Text-Links"), Elli Mylonas (section 6.7, "Reference Systems") and David Durand will offer better presentations of the options.

I do feel it would be wise to communicate with the TEI-ANA group about their solution(s) to the problem of character- and morpheme-level annotations. The encoding of text-linguistic features with character-level and morpheme-level annotations is inevitable: how does TEI-ANA propose to deal with id-markers? Another concern parallel to that of TEI-ANA is that in some textual-critical arenas (as with linguistic/literary annotations), the volume of text-critical annotation will become immense: where shall these annotations be located in the SGML file? It may become increasingly distasteful (to some??) to envision text-critical encoding (kilobytes per "word" in some cases) interlineated with "the text." It may be that "the text" should be kept free both of id markers and text-critical annotations (placing this information elsewhere in the document) so that our processors can (directly) read "the text" in real-time. Others might argue that the encoding is not meant to be used by applications directly (but only as data/document interchange), so never mind if the "words" of "the text" are separated by 100,000 characters of meta-data and annotation-data in the flat file. I have no opinion, except that I would like to purchase affordable SGML software for my 386-class machine and not have it choke on my data.

II. TEXTUAL VARIANTS [revised material to be inserted at TEI TRR2 6.9]

A. PRELIMINARIES

I suggested above that "textual variants" may be viewed as special cases of "textual parallels," though more complicated parallels. Such a view is the more reasonable if we believe that talking about a textual problem in a textual commentary (referring to the core data of lemma, variants and witnesses) or from any EXTERNAL locus is just as important as viewing textual variation from within a given document (that document's "lemma" as over against the readings of other witnesses cited only by their sigla). I think the former is the proper (or optimal) way to conceptualize "textual variation" anyway, but I incline even more strongly to it for pragmatic reasons. Textual parallels may be thought of as immediate candidates for hypertext even if the taxonomy of links (link types) is underdeveloped. We simply declare that loci A and B and C (where A, B, C may be points, spans, or discontinuous spans) are equivalent in some way, and let the application handle the expression of those links.
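In SGML terms, such a bare declaration of equivalence might amount to no more than the following sketch (the element name, attribute name and id values are invented for illustration):

=====================================================================
<!-- an empty linking element declaring three loci "parallel" and
     nothing more; "targets" would be declared IDREFS, so each token
     must match an id on some element -- workable within a single
     document, and precisely the open problem for references that
     must reach across documents -->
<parallel targets="loc.a loc.b loc.c">
=====================================================================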
For encoding textual variation, the taxonomy of links must (usually) be much richer: texts A, B, and C are not just formally equivalent, but are related in very specific ways. The complex network of inter-dependencies between parallel objects is one important factor in making the encoding of textual variation more demanding. I would be interested in the judgments of others about the relevance of hypertext (link types) to textual criticism; the paper of Steve DeRose ("Expanding the Notion of Links," Hypertext 1989 Conference) is suggestive, but I have been unable to talk to him about this in detail.

Based upon exchanges with other scholars interested in textual variation, I sense that each one's formulation of a model for encoding textual variation will strongly reflect the particular field of textual studies (modern literature? epigraphic texts? ancient texts?), the adequacy of printed critical editions in the field, and the particular text-critical theories each one embraces. These biases probably cannot be escaped, nor are they necessarily bad. I would suggest that for TEI encoding "standards" (recommendations) to be accepted in humanities areas, it will be necessary to create user manuals highly specific to sub-specialty areas. Scholarly needs and goals may be very different in specific areas, and the domain-experts in each literary field should be assisted in the refinement of prioritized goals in which encoding of text-critically-relevant information will play an important role. In some fields, standard editions may contain evidence from manuscripts that are badly and incompletely collated, so that encoding through programmatic re-collation would be the optimal effort. In other fields, the most fruitful historical results may come from intense study of scribal ductus and manuscript paleography which has hitherto been inadequately investigated. Other fields may be blessed with superb critical editions in which the encoding of critical apparatuses alone may yield a rich, complete text-critical database. The "manuals" for "encoding textual variants" should therefore reflect the variable situations and priorities in our respective literary fields.

At the same time, I feel it would be wise to develop as general a model as possible for the encoding of information germane to the text-critical enterprise. We may accept this on the strength of the obvious assumption that for standards purposes, general solutions are better than dedicated solutions (e.g., solutions which are matched to current applications, or the lack of applications). We are fortunate to have on the TEI-REP subcommittee and in its immediate orbit several authorities on critical editions and machine-collation:

*Wilhelm Ott (director of the Abteilung für literarische und dokumentarische Datenverarbeitung at the Zentrum für Datenverarbeitung, University of Tübingen; TUSTEP programs for text collation and textual editing)
*Susan Hockey and Lou Burnard (most recently with Peter Robinson in development of the COLLATE programs for Old Norse; cf. LLC 4/2, 99-105 and 4/3, 174-181)
*Manfred Thaller (whose representations for text-critical data I have not seen, unfortunately)
*Michael Sperberg-McQueen (superb control of the standard, combined with a background in letters; I have learned far less than I should have in several email exchanges; SGML-izing of EDMACS)
*Dominik Wujastyk (EDMACS, a modification of John Lavagnino's TEXTED.TEX macros, for formatting critical apparatuses)
*Robert Kraft and Emanuel Tov (CATSS project at the University of Pennsylvania/Hebrew University, Jerusalem)
*R.-Ferdinand Poswick (Secretary of AIBI and director of the multi-lingual biblical databank project Fils d'Abraham at the Centre Informatique et Bible, Maredsous)
*Claus Huitfeldt (Norwegian Wittgenstein Project)

I look forward to the contributions of these individuals on encoding text-critical data as TEI moves into its second cycle.

B. THE GOALS OF ENCODING TEXT-CRITICAL DATA

My assignment in TEI TRR4 (6.9) is specified as providing assistance on encoding of the "Critical Apparatus." Textual apparatuses have been used in printed books for several centuries, so the critical apparatus is undeniably an issue of concern for TEI-REP. In my earlier TEI-REP paper (TEI TRW2, "Factor 2"), I outlined several reasons why I (nevertheless) felt that a focus upon the "Critical Apparatus" was not the best approach to modeling the encoding of textual variation and textual transmission.

If the goal of the TEI-REP encoding recommendations is to provide mechanisms for the encoding of critical apparatuses exactly as they appear in printed manuscripts and editions, then we are faced with a broad task of surveying all kinds of critical apparatuses used in world literature (something I have been unable to do). But the volumes I have access to employ significantly different conventions in their critical apparatuses (some of which, like the "score" or /Partitur-Umschrift/, could possibly be handled as textual parallels; see TEI TRW2 "Factor 7," final paragraph). If the goal is to provide mechanisms for encoding text-critical information in critical apparatuses within a new standard sub-document type (which represents the "best" traditions of critical apparatuses, e.g., in the Oxford critical editions), then the task is easier. If the goal is to provide general mechanisms for recording knowledge/information germane to textual composition, textual evolution, recension and transmission, then we must determine whether and how optimally the encoding of the critical apparatus contributes to that process, and how to encode analysis not contained within the traditional critical apparatus. In any case, the goal involves more than simply the "markup" of a single document: it involves the encoding of complex relationships between elements of many documents. The analysis involved in the taxonomy of relationships appears to me tangent to the forms of literary and linguistic analysis being developed in the TEI-ANA group.

I briefly outline my recommendation here. I believe it is more important to focus on encoding KNOWLEDGE about textual readings and textual evolution (e.g., information which is traditionally contained in separate volumes or sections of textual commentary), with a view to the creation of a text-critical database. Much of this knowledge is not, traditionally, information coded in critical apparatuses: the fact that it could be so coded is less relevant than the fact that, for discernible reasons, it is not.
My perspective is that data/knowledge about textual variants and textual evolution is parallel to the matter of (encoded) linguistic annotations, which contribute to lexicons, grammars, corpus-linguistics databases and other linguistic tools. Thus, many of the critical editions I own have separate sections or associated companion volumes containing textual commentary: additional tables and lists of examples of orthographic practices, dialectal features, scribal proclivities, tell-tale recensional data, etc. Such data would be far more valuable if included as part of the text-critical database (enriched even more because exhaustive lists and tables, based upon annotations, could then be generated). In cases where text-critical, philological and other literary-critical data are mixed in the same commentary, it may be adequate or preferable to link (via our "textual parallel" mechanism) the text or textual apparatus to the commentary. On balance, I still judge that much of the information in question is most useful in a database, and that "critical apparatus" and "textual commentary" should not simply be encoded as separate subdocuments.

In light of several exchanges with Michael Sperberg-McQueen, I am prepared to believe that my perspective arises from experience with ancient oriental (especially, biblical) texts, and that it may be unrepresentative of the scholarly interests and goals in a majority of literary fields. For this reason, I relegate to the Appendix some additional arguments and illustrative data on this point, but I do not ask that everyone read them. In attempting to represent the interests of SBL/AOS/ASOR, I do feel it is necessary to document these concerns (which may be more germane to TEI cycle two).

C. ENRICHED ENCODING SCHEME

Here follows a (preliminary) list of features which I recommend be considered as standard (where applicable) for an enriched encoding scheme -- encoding which would be destined for a database, from which improved critical apparatuses may be printed, including expression-driven, user-specified critical apparatuses. For this purpose, I accept that we may differentiate "lemma" and "alternative readings," though it is unclear from a database point of view why a "lemma" deserves a privileged place or why its features would be any different than those of alternative readings. The distinction is useful in that traditional critical apparatuses often do represent "lemma" and "alternative readings" in different ways. Obviously, many of these features will pertain only to textual arenas where multiple languages and long traditions are involved (e.g., most sacred texts and other religious literature). This first list is a prosaic description of the most important desiderata, but it is followed by a list of features in more formal terms.

(1) exact identification of witness(es) offered as variands -- Variant readings may usefully be grouped together "in support" of a certain textual alternative reading, and especially, with implied or explicit top-level normalization for readings in various languages. However, this grouping should not be at the expense of other important underlying information, including the listing of actual witnesses, or at the expense of obscuring other details. They should not be grouped together unless every other relevant attribute (other than witness-id and textual_locus) is identical, or unless access to the differences in detail is provided for (language, identical orthography, complete (not restored or partial) readings, etc.).
If groupings of dissimilar witnesses are needed for convenience (as in the Goettingen LXX, for example), then a mechanism for viewing the underlying individual readings should be supported.

(2) declaration of the NAME of the canonical referencing scheme -- The DTD of the (app crit's) containing document will presumably include some identification of the traditional referencing scheme used (most relevant when similar/identical referencing schemes actually point to different content). In order to provide proper synchronization, the names of the referencing schemes of the alternative witnesses (which may not be available in electronic format) should be provided; such information will also be useful for machines when the texts of the alternative readings are machine-readable.

(3) exact identification of textual locus for each cited witness -- Just as the full identification of witnesses and "irrelevant" details about their readings are often suppressed, the exact locations are often not given. The notation in a siglum (viz., the presence of the siglum) usually implies: "go look THERE, you'll find it at the appropriate point." I consider this unsatisfactory, because machines will not be able to retrieve the context of the alternative reading merely from a siglum which implies "go look for it." Referencing system(s) for textual_locus are to be determined by TEI-REP, and included as properties of the alternative readings.

(4) the language in which the alternative readings occur

(5) the script, and/or orthographic level and/or transcriptional system -- Applications will have difficulty comparing alternative readings which occur in different languages, scripts, orthographic strata or transcriptional schemes. While the (app crit's) containing document will presumably contain these declarations in the DTD, the relevant information about the alternative readings must be supplied in the encoding of the variants. This requirement applies equally to scholarly emendations offered as part of textual reconstructions, restorations, or as primitive readings.

(6) the exact reading of a witness, not just a notation THAT it exists -- When alternative readings all occur in a single language (the same language as the lemma) this requirement may be obvious or unnecessary. The feature is more important (non-optional) in cases where alternate_readings occur in various languages, and the language/script/orthography of the alternative_reading is different from that of the lemma.

(7) encoding of linguistic/literary attributes of characters, morphemes, words, phrases, clauses (etc.) in the lemma and in the alternative readings -- Readings frequently vary in predictable (but text-critically important) ways: grammatical concord, scribal hyper-correction at other linguistic levels. The linguistic annotations on the readings (lemma and variands) will bear on their interpretation and will frequently be useful in classification of the reading for machine-collation & quantitative analysis.

(8) encoding of paleographic and similar information -- Such details will sometimes not be known, but encoding should be supported for publication of new texts and text fragments, and in cases where manuscript collation is used to verify readings. These notations would register information about restored readings, character-level annotations about degree of legibility, alternative paleographic interpretations, erasures, corrections, marginal, supralinear and sublinear (interlinear) readings, etc.
(9) encoding of known or assumed inter-dependencies between the alternative reading and the lemma, or between various alternative readings -- Such annotations are more appropriate when the readings can obviously be recognized as derivative from typologically-antecedent readings, or when some readings are (translations) in derivative languages, and so forth. More subtle evidence of genetic/stemmatic relationships may be shown by collation programs, of course; I envision here some standard kinds of dependencies (variously applicable to different textual arenas).

(10) a top-level normalized rendering (retroversion, transcription conversion, orthographic-stratum conversion) of the "readings" -- Some kind of normalization for lemma and alternative reading is usually implied in a standard critical apparatus, but it may be necessary or desirable to make this representation explicit to allow machine collation of all the readings together, or to allow linguists to make use of the text-critical database. (Perhaps this could always be inferred from the elements and the associated attribute lists, but I'm not sure.)

(11) translation of the lemma and alternate readings into modern language(s) -- This may be viewed as a concession, and even an inappropriate concession, to non-specialists. I feel it can be justified, and should be encouraged, to assist non-specialists, including inter-disciplinary scholarship, in surveying the specialists' field. Similarly, it seems unnecessary to use Latin as a single "standard" language when other international languages would make the information more accessible to persons who have a legitimate interest in the data (personal opinion).

(12) evaluation of the typological placement of each alternate reading (and the lemma) within its own "inner dimension" (language group, geographical region, time period, etc.) and within the larger scope of textual evolution -- This information would be germane in textual arenas which have long traditions, or multiple phases of textual evolution.

(13) evaluation/explanation of the alternate readings (and lemma) in terms of standard text-critical comments -- Examples (will vary across fields): the reading represents expansion, euphemistic rendering, paleographic confusion, dittography, haplography, aural error, rendering based upon a homographic root (polysemy), (hyper-)etymological translation, midrashic paraphrase, misplaced marginal/supralinear gloss, modernizing, secondary grammatical adjustment, other conflation, etc. I confess that some standard sigla make sense only on the convention that each witness (as a published, encoded document) has its own "lemma" as over against the rest of the universe's "alternative readings" (e.g., "textual plus," "textual minus," "different word order"), and these would have to be converted or interpreted when moving the annotations to a database.

Date: Fri, 23 Feb 90 19:55:11 CST
From: ZRCC1001@EARN.SMUVM1
To: lou@UK.AC.OXFORD.VAX
Subject: part2 (part 2 of 2)

D. SGML REPRESENTATION

I have only begun to work out some details in terms of an SGML-based encoding scheme.
I am increasingly pushed in the direction of this recommendation: that we provide basic guidelines, and examples (in a manual) but assume that the details of the feature list, including the designations of elements, attributes and nestings, will need to be filled out by domain experts in various fields of literature. Two other conclusions are clear: (a) the "standard" template, somewhat like a bibliographic template, will involve many optional features; (b) the templates will be sometimes very complicated, with multiple levels of inter-dependencies; they will involve discipline-specific or language specific mechanisms for expressing the inter-dependencies. It's often not clear to me whether to suggest elements or attributes for some features, so I welcome comments from other TEI-REP members and from the meta-language experts. The following crude scheme reflects a desire to be able to reference a text-critical situation from outside the text (from a commentary or the database point of view), and thus envisions an external referencing mechanism for both "lemma" and "alternative reading." It would be helpful to use the same general structure whether from the standpoint of internal or external referencing. The two objects seem to be the same textual objects except that some essential information about the lemma is automatically inherited from the document DTD which needs to be supplied for the alternative readings. I do not intend these names to be taken seriously, but as provisional handles that are clear. ============================================= ..... an element, (a) embedded within the text loc cit at some legal level of granularity [character/syllable/morpheme/word level], or (b) in an associated text-critical file sub-document, or (c) in a separate document type; it contains.. one , zero or more and zero or more ; one or more of OR is assumed. ..... open & end tags for the primary reading required element within ; may be omitted if the encoding is part of the containing document's app crit, and/or DTD (??) attributes of the would include: canonical_referencing_scheme_name: (optional; e.g., "uses the Goettingen Septuagint versification"; not necessary if this info is in the containing document's DTD) language: (required; means should be documented for indicating various kinds of bi-lingual documents; refinements to the 2-character language codes of ISO 639 should be recommended to account for regional and genre-dialects, etc.) 
text_locus: (required; using the canonical reference plus offset; using some specified referencing system (perhaps different options for different textual arenas & genre types) to designate the precise locus(offset), which may contain two or more discontinuous textual elements orthographic_stratum: (optional but recommended for texts which were written in different orthographic strata at different periods/locations; useful for artificial/hypothetical readings discussed in textual commentary or in emendations, where presumed orthographic systems inferred through historical linguistics) script: (optional, but required for languages which are written in more than one script) other_attributes: (optional; a host of other attributes about the witness, whether a printed edition, manuscript, tablet; these would be standard bibliographic data in many cases; date of publication; current museum location; date of discovery; provenience; date of the witness; name of archive; physical substance & medium (inscribed stone, papyrus, codex, clay tablet, fired brick, inscribed shard); literary genre & sub-genre (e.g., scripture text on a mezuza)) required, single element within the nested element of lemma, the content of text_locus language: language attribute if different than that specified for document orthographic_stratum: necessary if different than that specified or assumed for the document script: script attribute if different than that declared for document (e.g., Hebrew tetragrammaton spelled in archaic characters) ====================================================================== I have several uncertainties here: a) which should be attributes/tags b) how to signify parentage if the attributes are attached at several different hierarchical sub-levels within the lemma (for example, when "script" attribute changes for one of the three words of the lemma ====================================================================== , one or more optional nested elements of lemma; should provide for alternate translations of the lemma in a single language, or in multiple modern languages ll_encodings: mostly optional tags/attributes; examples: *character-level, morpheme-level, syllable-level, phrase-level (etc.) id markers *morphological descriptions (morpheme & word-level parse) *lexical descriptions (lexical lemmas) *syntactic tags *paleographic/orthographic annotations *other linguistic/literary annotations (all the ll_encodings potentially overlap with the TEI-ANA annotation database) optional nested element of lemma when appropriate to textual arena; this would be the normal mechanism used by the application for grouping all similar alternative readings of witnesses in different languages (with some better name! to designate the parent(s) and/or child(ren) in presumed translation or dependency stream; e.g., Hebrew >> translated by Greek/Syriac as xxxx; Greek << retroversion from Hebrew Vorlage XXXX; Armenian << mediated through Greek YYYY << from Hebrew ZZZZ; this could be a table of mappings relevant to the particular circumstances of this lemma (word, phrase level) optional nested element of lemma (which contains one or more elements of.... 
one or more of (e.g., name of scribal lapsus; unexplained corrupt reading; preferred reading; conflate reading (with demarcation & origin of sub-elements in conflation); the standard_tc_comment will be discipline specific) nested within any prose, probably allow this at any level of nesting (optional element of lemma as applicable to textual arena; genetic or stemmatic placement if known; justification for typological_placement as standard comment and/or freeform; would make use of data in conversion_table) .... the alternate readings will have the same features, in general, as the ... the emendations will have fewer of the features of the , but include one or more of a set of principled_reason justifying the emendation; some scholars prefer to put emended readings in separate categories, while others (and more appropriately in some fields) would place emended readings on equal par with ; this makes sense in textual arenas where retroversions and other language-equivalents amount to guesses anyway. ==================================== I do not have a strong opinion whether this text-critical information should normally be held at the close-tag of the text_locus or in some other file (sub-document) which is merged with the "text" prior to processing, or in some completely separate document (linked to the lemma containing document via (non-?) SGML link mechanisms. In any case the precise SGML expression of the file would just be a flattened out form once we decide which are tags and which are attributes. If a document has sparse, sporadic text-critical annotation, there would be nothing wrong with putting the data at the text_locus. I think I prefer the notion of holding text-critical data in a separate file if the amount of annotation is massive (more efficient use of processing using buffers?). E. OTHER SYSTEMS FOR SGML REPRESENTATION I may bring to Oxford some refinements/improvements on the above scheme, and welcome alternative proposals from others. Perhaps a couple examples. Michael has sent me a sample text with variants (with full DTD's) based upon SGML-ized EDMACS; I do not include it here, but hope Michael (and perhaps Dominik) will bring these samples and proposals. I especially solicit comments from Professor Ott. I hope also to hear from Professors Thaller and Huitfeldt in Oxford (who are reported to have schemes for encoding textual variation), and I have yet to analyze a sample of marked text from Bob Kraft. I look forward to seeing the detailed work of Peter Robinson (Hockey/Burnard) in connection with COLLATE and OCP productions using SGML for textual variation. Overall, I feel that the referencing scheme is the hardest part. The taxonomies for inter-dependencies can be worked out by domain experts, and we should be able to settle on the core terms/structures in Oxford. The motivation for a highly detailed and rich (but constrained) set of text-critical annotations, obviously, is in support of the richness of the database. I look to Lou Burnard for recommendations on database. =================================================================== APPENDIX The following sections of Appendix attempt to explain why I feel that a focus on the traditional "critical apparatus" is, at least in some textual arenas, less appropriate for encoding than a focus upon the total available body of text-critical knowledge. I admit the probability that this evaluation and its significance are of less moment in some literary fields than in others. The appendix has three parts: A-1. 
A reworked HUMANIST posting stating the case generally
A-2. A summary of positive and negative appraisals of the traditional critical apparatus
A-3. A worked example from the standard critical edition of the Hebrew Bible

=======================================

A-1. GENERAL STATEMENT INDICATING THAT/WHY I THINK ENCODING KNOWLEDGE ABOUT TEXTUAL VARIATION, AND ENCODING PRIMARY DOCUMENTS, IS A BETTER FOCUS FOR ENCODING THAN THE "CRITICAL APPARATUS" (REWORKED HUMANIST POSTING)

I feel a lot of work remains to be done before we are prepared to assess how we may best represent knowledge about "textual variation" (textual evolution, textual parallels) using SGML markup languages or other "portable" formalisms. In the simplest textual arenas, or in the event that someone wishes to represent in electronic format JUST what is visible on a printed page of a critical edition, the challenge may not be too difficult. Indeed, several schemes which can be expressed in an SGML language are currently in use by scholarly editing and text-processing systems. By "simple" textual arenas, I refer to: (a) cases in which all textual witnesses are written in the same language and the same "scripts" (= one level within a stratified orthographic system); (b) cases in which the witnesses can be seen in close genetic/stemmatic relationship, not as products of complex textual evolution through heavy recensional/editorial activity; (c) cases in which the number of witnesses and the amount of necessary textual commentary represent a small body of information; (d) cases in which one is not concerned about paleographic information and other character-level annotations or codicological information. In such cases, the traditional critical apparatus serves literary scholarship very well (the apparatus-region offers enough space for unambiguous presentation of the information), and the SGML-ish representations are fairly straightforward.

But I think the assumptions above may not pertain to the work of a significant number of humanities scholars. The goal of encoding "JUST what is visible on a printed page" (a traditional apparatus criticus, for example) might constitute an important and economical step in the creation of a text-critical database, if assumptions (b) and (c) and (d) were also germane. But when the textual data and published knowledge about that "textual" data become very rich, the standard critical apparatus represents (increasingly) a concession to the limitations of the traditional paper medium: both physical space (the amount of selectivity and condensation) and the ability of a reader to absorb (synthesize, evaluate) large amounts of textual information in complex relationships. In these more complex situations (biblical studies, for example), the paper app crit will contain a selection of the data, not all the data (excluding orthographic variants, for instance, which may be important for historical linguistics); it will indicate THAT a certain manuscript or manuscript tradition bears testimony to a certain reading, but will not indicate the steps of principled evaluation which were used to make this judgment (language retroversions, for example); it will tell you THAT a certain manuscript tradition (e.g., "Syriac/Ethiopic/Arabic/Aramaic/Coptic" in support of a certain variant of the Hebrew Bible) supports a given reading, but not which manuscripts exactly, or where, precisely (in machine-readable terms), these Syriac/Ethiopic/Arabic/Aramaic/Coptic readings may be found, or what expressions are actually used there.
Similarly, some editors (at some periods) were narrowly focused on particular aspects of textual history, textual variation and textual evolution, and systematically ignored the "irrelevant" and "trivial" variations which contributed nothing to their own interests. Searches for texts of highest antiquity or authenticity, for example, have often been given prominent and full representation in critical apparatuses: this priority serves well the interests of certain kinds of historical inquiry (search for the Urtext), but ignores data which are valuable for understanding later traditions which inadvertently or consciously "corrupted" the texts, perhaps for liturgical purposes. It is my opinion, then, that to model the "electronic critical editions" of the 21st century after paper editions would, in some cases, represent a short-sighted goal. We no longer have to exclude "minor" (sic!) orthographic variants (essential for some forms of historical or comparative linguistics) from the databank just because they would render a traditional app crit "unreadable" from the perspective of scholars disinterested in orthography. Rather than just "encoding" or "marking up" modern critical editions (a necessary or desirable step, perhaps), we need to think rather about representation of the knowledge about textual variation, held in critical editions, to be sure, but also in textual commentaries and in fully-encoded manuscripts (tablets, shards, lapidary inscriptions, papyri or other primary documents) which themselves constitute the primary data. In short: we need the encoding of ALL the human knowledge about physical texts, textual "variants" AND the scholarly judgments about processes of textual evolution. "Hypertext" and "SGML-based" encoding can then be put to work in applications software which allows us to study the text with multiple views, even hypothetical documents created with the aid of an SQL/FQL and the text critical database. We may then dispense with the static (sometimes overly selective, sometimes overfull, sometimes inaccurate) app crits and instead enjoy dynamic user-specified app crits containing particular classes of text-critical information we wish to see at a given moment; we may have several different app-crits on the screen, simultaneously. We will be able to do simulations and test hypotheses by dynamically querying hypothetical texts reconstructed from an FQL expression. It is also my judgment that we are quite a distance away from knowing how to encode knowledge about textual relationships in which inter dependencies are complex ("variants," recensions, parallels, allusions, quotations, evolutionary factors, hermeneutical- translational factors). But I think SGML embodies one indispensable ingredient in getting there: encouraging us to assign unique names to objects in our textual universe, and to other properties of text and textual relationships. Our conceptions about these textual (literary, linguistic) objects will inevitably prove to be crude approximations, but by coding our current understanding about them in syntactically- rigorous ways (using SGML based languages), we at least contribute to a legacy of preserving the text and our understanding of it. This conception of encoding that is self-documenting represents an advance upon the less thoughtful processes of antiquity (and in some modern conceptions of text), which were usually self-destructing. ======================================= A-2. 
A SUMMARY OF POSITIVE AND NEGATIVE APPRAISALS OF THE TRADITIONAL CRITICAL APPARATUS If the reader fails to feel any relevance of the tension I have articulated above, it may be because the traditional "critical apparatus" serves varying fields more or less well. I will promote here the argument (should be, "hypothesis") that generally speaking, the adequacy of a critical apparatus (*adequacy conceived in MODERN terms*) will be inversely proportional to the mass and complexity of relevant textual data. Here is an amazing situation: I have before me open copies of six critical editions (gathered off my shelves at random)... - Hesiod's Theogony (Oxford/Clarendon, 1966/1982) - Hesiod's Works and Days (Oxford/Clarendon, 1970/1983) - Greek New Testament (Nestle-Aland, 26th edition) - Hebrew Bible (BHS, 1967) - Ms Neophyti I, Aramaic Targum to Genesis (Madrid, 1968) - Annals of Assurbanipal, King of Assyria (Leipzig, 1916)... ... and each of the six critical editions contains about the same proportion of "critical apparatus" to "text" per page, though the Greek NT and Targum have slightly higher proportion of apparatus. Does this mean fortune has preserved for us an equivalent wealth of textual evidence for inclusion in these critical apparatuses? Hardly. It means that the scholarly convention called the "critical" edition typically contains pages in which "text" makes up 1/2 - 3/4 of the page, usually at the top, and the remainder of the page is free for "critical apparatus." If the two Bibles included a level of textual detail (data/perspicuity) and percentage of associated relevant facts equivalent to that contained in the Hesiod and Assurbanipal volumes, no human would be able to lift the book. We may also reflect on the fact that most of these volumes, as with most critical editions of literary texts, contain separate sections or companion volumes of textual commentary: why separate sections or companion volumes? Why separate sections in a critical edition volume, or companion volume, for textual commentary -- when the same subject matter is covered? Because the traditional critical apparatus is, at the same moment, both a useful and feeble (paper) convention. I qualify "paper" because in the age of hypertext we are no longer bound by some of the debilitating features of linear text on paper, and the textual commentary supplies one nice example. Positively: The critical apparatus is a powerful and useful scholarly convention, and (I suspect) will continue to be so for the future, even when reproduced on the computer display in character-for-character mimicry of the paper apparatus. We feel sure that encoding of text-critical information will enable scholars to electronically produce far more complete, accurate and informative critical apparatuses than in the past, whether for paper distribution or for use in programs. For textual traditions containing fewer than a dozen witnesses, the app crit may contain exhaustive inventory of textual variants (fully spelled out), including "mere" orthographic variants. For textual traditions with a wealth of evolutionary history, the app crit supplies an essential, digestible overview or summary of the "most important" text-critical issues. Negatively: and negatively in direct proportion to the wealth of textual tradition, the app crit is a feeble instrument in that it only provides a "summary" or "overview" of the textual issues that are most important in the editor's personal judgment. 
The fact that the textual commentary is isolated in a separate section, or in a companion volume, is a concession to this problem of linear (paper) text. Hypertext offers the power to zoom up and down through our choice of detail, at any layer; the textual commentary need not be tucked away somewhere else, and the data of the apparatus need not be an "overview" unless we choose that format. The syntax of the apparatus need not be cryptic and ambiguous (as is often the case in my field -- though this should not be tolerated). To the delight of many younger inter-disciplinary scholars, the language of the apparatus need not be Latin (ahem...where the scholarly world of the humanities seems intent on outdoing the evil of the medieval Church in keeping certain information...); the app crit may be in EVERY language, if desirable.

Consider, for example, that the app crit of the Hebrew Bible (mentioned in the group above) frequently cites among its witnesses the "Greek." But in traditional terms (the conventional amount of space allotted to the app crit), this "citation" can never be more than a pointer to the eclectic critical edition of the Greek text, which lives in some 18 separate volumes (still incomplete in the standard Goettingen LXX), which in turn is built up through the investigation of its Greek witnesses and daughter versions (Armenian, Arabic, Ethiopic, Coptic, Syriac, etc.), each of which has its own critical edition. Hypertext will allow us to traverse this path, if the exact paths are charted (rather than citing sigla of "traditions alluded to"), but gaining access to this data will not fit (sensibly) within the traditional conception of a critical apparatus. A traditional critical apparatus must be "manageable," must not take up more than a certain acceptable percentage of the page, and must be open to perusal for a synthetic view.

Or consider again: when a textual note in the Hebrew Bible says "&gothicQ + mult vrb" it means, "go look for a Qumran manuscript containing a reading relevant to this textual locus, and you'll see it adds lots of additional words." The editor's failure to cite the words may indicate that he thinks they are secondary, or it may indicate that the three additional sentences of Hebrew will not fit on the page. We may fault the editor for not being more specific (or the editorial board for wanting to print the Hebrew Bible in one volume), but within the constraints of the apparatus space there may indeed not be room for giving the Qumran reading. There is not space for giving dozens of other interesting variants.

In short, the negative features of the app crit (even if these are not theoretically necessary negatives) disappear with cheap/compact electronic storage and non-sequential access to the textual databank. The principle of selectivity in a critical apparatus (or for any encoded, annotated text) is essential, but the question is: WHOSE power to select? In a fully marked text (all manuscripts, text-critical annotations, linguistic annotations, bibliographic annotations, literary-critical annotations), 99% of what is "in" the text will be garbage at any one moment. The power of selectivity needs to be handed to the user. Scholars who are forced to work with inadequate critical editions should not be encouraged to "encode" those critical editions; they should be given guidelines and tools which make possible the encoding of information, ultimately designed for a (relational?)
database, from which they can create their own critical apparatuses and make queries of the sort that critical apparatuses were never intended to answer.

==================================
A-3.  A WORKED EXAMPLE FROM THE STANDARD CRITICAL EDITION OF THE HEBREW BIBLE

Here follows an example of the deficiency of the critical apparatus in cases where the principle of selectivity and the pressure for brevity work decidedly against the usefulness of the apparatus. Two rebuttals could be offered (by anyone patient enough to read this example): "it's an isolated, unrepresentative and extreme example," and "the editor should be shot." Neither is fair: I can find far worse examples (which I sometimes use in instructing students that ad fontes does not mean the critical apparatus!); and the editors were working under hopelessly unrealistic constraints: "(something like) the critical apparatus shall consume no more than 2 vertical inches at the bottom of the page..."

The standard edition of the Hebrew Bible (BHS) offers four textual notes on Deuteronomy 32:8, where the text reads (according to the translated Hebrew BHS text):

   When the Most High (Elyon) gave the nations their patrimony
   When he made a division for the sons of men
   He established the territories/boundaries of the people
   According to the number of the sons of Israel.

The fourth textual note on this verse in the BHS apparatus concerns the expression "sons of Israel." It is difficult to indicate with IRV characters what is happening here, but I will try in successive stages: (1) to represent the textual note; (2) to explain what is meant by the textual note; (3) to explain what is missing, wrong and inadequate from the perspective of a machine-readable version.

(1) REPRESENTING THE PHYSICAL TEXT (MIXED NOTATION FOR ECONOMY).

   ∥ d — d &gothicQ;&gothicG; (&smoothbreathing;αγγε´ λω&nu θε&ogr;&tilde) σ´&gothicL;Syh prb recte bny ℵ"l vel bny ℵ"lym &par

Comment: I do not mean this as an optimal or even legal SGML representation, but simply as an intelligible representation of surface/typographic features; it is stupid to use "direction" as an attribute, but I did not want to write rules; different representations of Hebrew in the app crit will use different directions of writing. In this textual note, the orthographic stratum changes twice within Hebrew clauses, once within a single Hebrew word.

(2) EXPLICATION OF THE PHYSICAL TEXT REPRESENTATION.

   &par       := indicates/delimits textual notes
   d — d      := the superscript d-d notation means that the textual note pertains to the text string in the main Hebrew text which is likewise delimited by superscript d-d in Roman characters
   &gothicQ   := Qumran
   &gothicG   := Old Greek
   (...)      := Greek text meaning "angels of god"
   σ´         := Symmachus (late Greek revision)
   &gothicL;  := Old Latin
   Syh        := Syrohexaplaric reading
   prb recte  := editor's evaluative comment in Latin
   bny ℵ"l    := first Hebrew reading which "might be" the correct retroverted/restored reading, and means "sons of El/god"; the use of "hebrewfont1" and "hebrewfont2" is to signify two different levels of orthography, one without vowels and one with vowels; the lemma needs to be represented (tagged) as belonging to a third stratum of the orthography, having accentuation, designation of spirant allophones, etc.,
      thus: b.:n"y71 yi&sin;rF'"75l (means: "sons of Israel")
   vel        := editor's comment in Latin
   bny ℵ"lym  := the second possible Hebrew reading which, according to the editor, might stand behind the readings of "Qumran," "Old Greek," "Symmachus," "Old Latin" and "Syrohexapla"

(3) WHAT IS INADEQUATE/WRONG ABOUT THE READING FROM A MACHINE'S POINT OF VIEW?

(a) The convention of mapping the textual note to the text string with "superscript d-d" works just fine for humans, but it is not a legal SGML representation. If we wish to place such textual data in the flat file loc. cit., a number of possibilities are open: flag the lemma at both ends with .... or whatever. I question whether we want to do this, since in some cases there will be literally pages of "textual_variant" data for each of several words in a sentence. An alternative is to find a qualified referencing system (perhaps using IDs and IDREFs), and to hold the text-critical data in a separate file. Note that when giving the Hebrew lemma, a third orthographic stratum (different from the two in the app crit) must be used as a language (script?) qualifier: the main text contains cantillation marks as well as consonants, matres and vowels. The application software will have to be smart enough to convert readings from one orthographic stratum to another in order to make comparisons.

(b) The siglum &gothicQ for "Qumran" is meant for humans. It means that somewhere, sometime, a manuscript was found in one of the Qumran (Dead Sea) caves which bears some relationship (yet to be made clear) to this text. But which cave? Which manuscript? Where was it published? Plate number? Line number? There are hundreds of published and unpublished Qumran manuscripts. What language (Hebrew, Aramaic, Greek -- all three languages are among the Qumran witnesses)? Is this reading in a biblical manuscript, or in a liturgical text, or quoted in a non-canonical apocalyptic work? If I look up the siglum &gothicQ in the introduction to the text, I am simply informed that "&gothicQ" refers to discoveries made at Chirbet Qumran and published in the series DJD (Discoveries in the Judaean Desert), 1960ff. Now, if I didn't know any better, I might hunt through all the DJD volumes to find this Qumran reading bearing on Deut 32:8, but I would not find it. The text is actually published (though not yet in a full editio princeps) in two journal articles, as I determine from other bibliographic research. When I read the articles and look at the only published photograph, I discover that the Qumran reading (bny 'lhym) is actually *neither* of the two alternatives offered by the editor, who proposed either "bny 'l" or "bny 'lym". The correct reading was published in 1959, though the editor of BHS did his work in 1972. Thus, the siglum &gothicQ is entirely misleading, a mere cipher alerting me to the existence of Qumran evidence for this text, which I have to find for myself and read for myself. The interpretation actually given in the app crit is wrong.

(c) The siglum &gothicG means that the "Old Greek" (as determined through careful sifting of hundreds of Greek manuscripts and daughter versions dependent upon the Old Greek) has a bearing on the text at this lemma. But since the Old Greek had no Hebrew reading at all, the citing of this siglum should be accompanied by an indication of what the Old Greek reading was, and whether its retroversion to Hebrew is assured, with what level of confidence, on what grounds, and with what precise Hebrew result.
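By way of illustration only: if the witnesses themselves were declared once as named objects, the cipher &gothicQ could point to a declaration which answers exactly the questions raised under (b), and the reading attributed to it could be written out and linked to that declaration. The element and attribute names below are invented for this sketch (they belong to no existing tag set), the entities are assumed to be declared elsewhere, and the bracketed content stands for data the encoder would have to supply:

   <witness id="W.Q" siglum="&gothicQ;" lang="HE" genre="[biblical MS?]">
     <ms>[cave, manuscript designation, plate and line numbers]</ms>
     <published>[pointers to the two journal articles of the
        preliminary publication]</published>
   </witness>

   <rdg id="Dt32.8.d.R1" wit="W.Q" orth="HE.consonantal">bny 'lhym</rdg>

The same mechanism would let the entry for &gothicG carry the Greek reading written out, an explicit retroversion into Hebrew, and a statement of the grounds and the level of confidence, which is exactly what (c) says is missing.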
It may be that the Greek reading which follows in parentheses constitutes part of that evidence; but there is no grammar telling us what "parentheses" means in this syntactic relationship, and a human would not know for sure what the parenthesized reading (&smoothbreathing;αγγε´ λω&nu θε&ogr;&tilde) "angels of god" means for this textual variant. Its presence between the Old Greek siglum and that of Symmachus (another Greek witness) suggests that it might pertain to the Old Greek tradition. So we now turn to the standard critical edition of the (Old) Greek Deuteronomy (the Goettingen Septuagint volume, 1977) to resolve the ambiguity. We find that the eclectic text of the Goettingen LXX reads "sons of god," not "angels of God." Careful study of the apparatus of the Goettingen LXX presents a confusing picture, so we turn to the companion volume of textual commentary which supports the Goettingen Septuagint (Text History of the Greek Deuteronomy, MSU XIII, 1978). There we find that the reading of the eclectic text in the critical edition is not attested in the extant Greek manuscripts themselves, but is inferred from the Armenian text (a daughter version of the Old Greek) and from a partial reading in one very prestigious Greek text (manuscript 848, dating to the middle of the first century B.C.E.). So it turns out, on further inspection, that the reading apparently attached to "Old Greek" in BHS is wrong, or at best entirely misleading. Again: the siglum served as a cipher alerting us that the Old Greek tradition has relevant testimony bearing on this Hebrew text, and we will have to find it and study it.

(d) The next siglum, σ´, says that the late Greek reviser Symmachus has a reading which reflects one of the two Hebrew alternatives suggested by the editor. But what is Symmachus' reading in Greek? On what grounds can it be claimed to support one of the two Hebrew readings which follow as the editor's proposals? We are not told: we must find a copy of Symmachus and make our own evaluation.

(e) The next siglum (&gothicL) tells us that the Old Latin supports a Vorlage like one of the two proposed Hebrew readings (or perhaps just a reading in support of the (now exposed) "Old Greek"), but we cannot tell for sure. In either case, can we trust this judgment? What does the Old Latin read (in Latin)? Is the Old Latin reading secure at this locus, and on what basis is it claimed as support? We must find the Old Latin ourselves and make the evaluation.

(f) The superscript "Syh" following &gothicL is curious for being superscripted (presumably a typographic error, but it demands investigation). In any case, we are not told what "the" Syriac reading is, in what manuscripts/authors it is found, or on what basis it can be claimed to support the Old Greek (as indirect evidence) or one of the Hebrew readings proposed by the editor.

(g) The two Hebrew readings proposed by the editor cannot be used directly by software, either (a) as direct substitutes for the lemma or (b) as alternatives to be compared to the lemma. The lemma carries full orthography for vocalization (vowels) and cantillation (tone, pitch), and thus embeds 4 or 5 distinct strata of Hebrew orthography in one particular script; the proposed Hebrew variants in this case use the same script, but contain mixed orthographies, with only partial vocalization and no cantillation.
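One cure for (g), anticipating the point made immediately below, is to make the orthographic stratum of every string explicit in the encoding rather than leaving it to be inferred from the typography. A minimal sketch, again with invented attribute names and with bracketed placeholders for the values the encoder would assign:

   <lem orth="HE.tiberian.full">b.:n"y71 yi&sin;rF'"75l</lem>
   <rdg orth="[stratum 1, as printed]">bny 'l</rdg>
   <rdg orth="[stratum 2, as printed]">bny 'lym</rdg>

Software could then refuse to compare strings of unlike strata, or apply a declared normalization (strip the accents, strip the vowels) before comparing, instead of guessing.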
One could propose that software absorb the burden of identifying transcriptional systems and orthographic strata and then generating normalizations, but I doubt this would be wise: if editors use mixed orthography within words, the encoding should identify this. Thus, the encoding would require that the scholar supply linguistic information which is only inferred (though easily, by humans) from the critical apparatus.

(h) Finally, the app crit supplies no information about the witnesses that support the lemma. All Masoretic texts? A majority? What about the Samaritan Pentateuch? And where is the editor's explanation for the lemma's reading "sons of Israel" in place of the alternative "sons of (the) god(s)"? What is the justification for the editor's judgment "prb recte"?

CONCLUSION

The app crit for the fourth textual note on BHS Deuteronomy 32:8 is more like a footnote, a pointer to external textual evidence, than a precise record of that evidence. The scholar must locate and evaluate several other sources to determine what the alternate readings are, what their inter-relationships are, and what credentials these readings have in support of the editor's two alternatives. From the perspective of a traditional critical apparatus, the major elements are indeed present: lemma, variant(s), witnesses. But essential information is missing: annotations to these witnesses and readings, expressed in terms of objects which can be pronounced, written out, classified and counted. An encoding of this information in a form useful for analysis would need to contain much more information.
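To close the worked example: here is a sketch (reusing the invented names from the earlier sketches, with bracketed placeholders wherever the printed apparatus itself withholds the data) of the kind of record this one textual note would need to become in order to be useful for analysis:

   <variant.unit id="BHS.Dt.32.8.d"
                 lemma.ref="[location of the d-d delimited string in the base text]">
     <lem wit="[witnesses supporting the lemma -- not given in BHS]"
          orth="HE.tiberian.full">b.:n"y71 yi&sin;rF'"75l</lem>
     <rdg wit="W.Q" lang="HE" orth="HE.consonantal">bny 'lhym</rdg>
     <rdg wit="W.G" lang="GR">
        [Old Greek reading written out]
        <retroversion lang="HE" confidence="[?]" grounds="[?]">
           [editor's proposed Hebrew]
        </retroversion>
     </rdg>
     <rdg wit="W.Sym W.OL W.Syh">
        [readings written out, each with the basis for the claim
         that it supports the proposed retroversion]
     </rdg>
     <judgment resp="editor">prb recte -- on what grounds?</judgment>
   </variant.unit>

Even as a sketch this is far richer than the printed note, and it is still only a beginning: the witness declarations, the bibliographic pointers, and the editor's reasoning would all have to stand behind it.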