Date: Fri, 25 Jan 91 23:19 CST
To: "Michael Sperberg-McQueen", Lou Burnard, N330004@UNIVSCVM
From: FORTIER@UOFMCC
Subject: TEI Guidelines Critique

THE TEI GUIDELINES (VERSION 1.1 10/90)
A CRITIQUE
by Paul A. Fortier

Limited Circulation Version
For the Literature Working Group

1. Perspective

A. The Poughkeepsie Principles

Three items of the Poughkeepsie principles statement are of particular relevance to scholars in literature, viz.

   2. The guidelines are also intended to suggest principles for the encoding of texts in the same format.

   5. The guidelines should include a minimal set of conventions for encoding new texts in the format.

   9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.

It is my opinion that these three principles are of particular importance to scholars in literature, and that they are not sufficiently reflected in the current version of the Guidelines. My reasons for this opinion will become clear in the rest of this report.

B. The Perspective of the Literature Scholar

The scholar in literature typically works with large amounts of data, since computer processing is used mainly when it is not practical to commit a text to memory. These scholars are concerned mainly with inputting texts as rapidly and at as reasonable a cost as possible, verifying them as effectively and cheaply as possible, and getting on as quickly as possible to the analytic work which was their reason for working with the machine.

Except when they are themselves generating a canonical text, literature scholars work with a specific edition of a text which is considered canonical in the sense that it is the one which is cited and quoted in serious professional work. Depending on the situation, this canonical text will be a critical edition, a prestigious edition, or a trade edition.
They will want to refer easily to pages and lines in this text.

Literature scholars are not interested in, in fact many object violently to, the prospect of obtaining texts which already contain - explicitly or implicitly - literary interpretations.

The present version of the Guidelines is not in harmony with this perspective. Some Examples:

p. 1 (1.1.1) The claim is made that this document will provide guidance to the neophyte both as to what to code and how to code it. Either the editors must take this claim seriously, or remove it from the document, for the present version constitutes a dangerous trap for the neophyte.

p. 3 (1.1.4) This section recommends full SGML tags to be used at the time of data entry. This recommendation implies doubling the volume of text to be entered. The added matter is very complex and, since it is a formalised and essentially arbitrary set of conventions, it will be very difficult and time-consuming to enter, and virtually impossible to verify without tripling or quadrupling the cost of verifying un-coded text.

p. 4 para 3. States that full tags need not be entered by hand, and vague allusion is made to macros or parsers; no examples are furnished, no names or references are furnished. Do such things exist? Is this just the technician's flip way of saying the humanist's problems are of no concern in her exalted domain? If macros and parsers exist, examples of both should be provided here and at least half of the examples in the rest of the document should show their use.

p. 15 (2.1.4) Recommends embedding a given interpretation into mark up at the time of data capture or conversion in the form of a DTD. This is absolute nonsense, based on neglect of the distinction between controversial and non-controversial tagging. To indicate that a given word is capitalised or italicised is non-controversial and might properly be called coding. To indicate that a given word is used in an ironic sense is open to question, controversial.
This is interpretation in the usual sense and is the domain of the scholar working on the completed text, not that of the coder inputting or converting the text. Such recommendations will alienate the vast majority of literature people working with computers.

p. 16 (2.1.4.2) Minimisation rules are a good idea. Examples (note the plural) should be provided.

p. 23, the example. Much too wordy, particularly the indication of change of page.

p. 28 (2.1.7) Entity Reference (string substitution). This is excellent. It must be stressed more, alluded to more, and shown frequently in examples.

p. 55 (4.1.4) Since most scholarly work in literature is based on a canonical text, in which pagination and lineation frequently vary with the PRINTING not just the edition, it is essential to identify the date of printing and the print shop in the header material of a machine readable file of a text based on a printed edition.

p. 62 I suggest putting print shop and date of printing between the information on the publication and that of the distribution.

p. 65 (4.5) The encoding declarations are of course the ideal place to put allusions to and/or explanations of the local coding conventions. Please stress this fact here.

p. 71 (5.1), para 5. Again, the ability to point to a unique place in the text of the original printed document is essential to the needs of literature scholars. This must be stressed here and shown in the examples.

p. 77 (5.2.5) Colophon -- not a term everyone can be expected to know. Note that the Pleiade edition shows this as front matter. Given the practical importance of the printing date and print shop information included here, I recommend that it be put at the beginning of the file, right after the publisher identification.

p. 77 (5.3.1) Given their importance for locating a quoted or identified passage, line breaks should be mentioned here and their importance stressed.

p. 93 (5.6) A strong recommendation to code page breaks: EXCELLENT.
Please put in as strong or a stronger recommendation to code line breaks, i.e. always put in unless there is a compelling reason not to do so.

p. 125 (5.11.2) Information about the layout of the edition input is crucial to the needs of most literature scholars. It MUST NOT be downplayed in this fashion.

p. 126. Similarly, the suggestion that lineation can somehow not be important in a text runs counter to the needs and practices of scholars of literature.

p. 178 (7.3.1.2) Even for rhyme of type "aa" French prosody recognizes three types: rime suffisante (not necessarily the same as assonance), rime pauvre, and rime riche. Perhaps this should also be taken into account. BETTER, given the range of languages to which the guidelines are to apply and the large number of prosodic systems in question, perhaps the Guidelines should not be so prescriptive.

p. 200 Putting tags, entities and redefinitions in a separate file for calling up by many texts is an excellent idea. Unfortunately the example is not at all clear, and makes this seem much more complex and confusing than it is or need be.

pp. 207-09. It is a trap for the unwary and an irritation to the experienced to show the suppression of typographical information (line breaks) in an extended example like this. The flip justification that the edition used wasn't very good makes things worse; poor editions should not be converted to machine-readable form! It was stated in conversation that this was done because of copyright problems. The editors should be capable of finding publishers willing to grant letters of permission to use their editions in short examples. The Guidelines should include a warning that copyright should be secured for texts input for commercial exploitation. But in fact the whole argument to suppress line breaks because of copyright is specious.

   A. Differences in lineation would not fool anyone and would be no defense for the person accused of producing a pirate edition.

   B.
   The Guidelines provide TECHNICAL rules for the encoding of text; it is not their function to solve problems in other domains, such as the LEGAL one, where the question of copyright lies. If the editors disagree, and feel capable of dealing with problems outside the narrow domain of the guidelines, they should at least turn their attention to more pressing problems: hunger and illiteracy in the developing world, AIDS, political instability in the Middle East, etc.

pp. 219-33 (A.6) I agree that in the case of the Bible the older and more authoritative method of identifying passages should prevail.

2. Coding Levels

The Guidelines recommend three levels of coding:

   1. Required in any TEI conformant document (e.g. Title, author, etc.)

   2. Required for interchange, but a more succinct local code is recommended (e.g. accented letters, non-roman alphabetics).

   3. Optional e.g. really.

It is not always easy to tell which is which from the present version of the document. This distinction must be made clear. The preferable method would be to separate out each type and group them in order: required, then required for interchange, then optional. An alternate method would be to tag each heading with a parenthetical indication of which class each tag or tag type belongs to. The optimum method would be to do both. Further comments on coding levels follow:

p. 1 (1.1.2) The text grudgingly allows for the use of simpler and less wordy codes in a local environment, which codes are to be translated into full SGML coding for interchange. This should be the RECOMMENDED approach. Examples of existing coding schemes upgradable to TEI level, taken from existing archives, should be given. Other examples (made up for the purpose) should be given. It must be made clear to the user that clean, clear and easy codes are to be the NORM for local use, and that the full codes are for interchange and possibly archive purposes only.

p. 4 para 5. Interchange format does not allow any tag reduction.
This is legitimate. But it MUST be made clearer that local minimization is encouraged, as long as automatic upgrading to full interchange codes is possible from the local code.

pp. 13-14 The examples are the perfect place to show a local code first, then the full interchange code.

pp. 45-52 (3.2) Character Sets. It MUST be made clear that this applies to interchange only. Local codes MUST be recommended and SHOWN which are easy to input and easy to use on a screen and printer of MAC, DOS and Mainframe machines (at least 2 sets of examples for each of the three). Preferably get some from existing databases and some from the various forms of 8859.

pp. 58-59 The examples provide an excellent opportunity to show both local codes and interchange codes.

pp. 82-83 (5.3.6, 5.3.7) It MUST be made clear that these very wordy and error-prone features are optional. Please try to cut down their length.

pp. 84-6 (5.3.8) List handling is excessively wordy and takes too much for granted. There must be an example of a simplified local code as well as the full interchange code here.

pp. 86-89 (5.3.11) Numbers: a perfect example here of a trap for the unwary. Only "may" on p. 87 shows that this extremely wordy coding is optional.

pp. 89-90 (5.4) This is a good idea but for a post-input markup. This fact must be made clear and encouraged. Mention that this is a relatively rare occurrence.

p. 93 (5.6.1) It is absolutely necessary to have an example here and to show both local and interchange formats.

p. 94 (5.6.1) It is absolutely necessary here to have an example and to show both local and interchange coding.

p. 97 (5.6.4) Seems to suggest only fully explicit coding in milestones. You really need to show brief local codes here, PLUS their expansion into interchange codes.

p. 103 (5.8.1) Explicit tagging of sentences. This is overkill.
This must be clearly indicated as optional, and another part needs to be added suggesting how to set up a local code permitting automatic conversion to this level of coding.

pp. 110 ff (5.10.3) The examples from pp. 110 through 117 are prime candidates for examples of both local and interchange codes.

p. 170 (7.2.1) The encoding declarations are an EXCELLENT idea and to be encouraged, perhaps made required. They also foster the definition of local standards which can be converted automatically into interchange format.

pp. 207-09 A perfect place for a two-step example, the first part showing local code, the second showing interchange code.

3. Coding Types

Here are discussed the two types of coding: Presentational (capital letters, line breaks, italics, etc.), and Descriptive (Proper noun, italics showing irony, stress or a foreign word, etc.)

My perspective is that coding (inputting or converting text) is not the same as interpreting. Descriptive coding as presented in the Guidelines is squarely in the domain of interpretation. Scholars do not want interpreted texts; they expect to do that job themselves. When possible scholars hire assistants to input texts, and do not expect these assistants to do the interpretation. This whole aspect needs to be brought into conformity with scholarly practice, otherwise the TEI standards will not be respected. Comments on details follow:

p. 12 (2.1.2) Direct quotation, indirect quotation, indirect discourse, free indirect discourse, authorial comment, description or narration -- all of these aspects of a text can blend one into another. Which is which is open to interpretation and debate. It is ludicrous to tag them as if such distinctions could be made once and for all.

p. 71 (5.1) Presentational mark up is allowed here, as well as descriptive. NO! Presentational mark up should be recommended, with descriptive at most recognized as possible if one wants to use it, but with warnings against it. The examples will have to be revised.
pp. 78-9 (5.3.2) This section is presented primarily in terms of descriptive mark up, which is wrong. The presentational should be recommended, if only because it avoids the excessive wordiness of the descriptive approach.

pp. 79-81 (5.3.3) Do NOT recommend tagging of features, just the opposite. Stick with the for open and close quotes, suggest something else for block quotes, e.g. . Remind the user that she can use open and close quotes or guillemets (other things for embedded quotes) for a local code and have a conversion program take care of the rest.

pp. 81-82 (5.3.4, 5.3.5) Perfect traps for the unwary. This is interpretation and dependent on time; it adds unnecessary work, confusion and possibility for error. Particularly true in the example on p. 82.

p. 103 (5.8.1) Explicit tagging of sentences. This takes for granted that such can be known, which is not the case for numerous poets, and even novelists since the 1930's (cf. Celine, Simon, etc. in French). Here is an excellent example of why descriptive coding is wrong.

p. 105 (last para) It is most questionable whether one should EVER remove an interpretable feature from a text and replace it by an interpretation. Not only does this make verification of the data impossible (it has to be re-interpreted, not proofread) but it involves the coder usurping the role of the scholar who does the interpretation.

p. 123 (5.11.1) Here representational mark up is presented as exceptional and extraordinary; earlier it was presented as a valid alternative. Consistent standards never hurt. More important, representational mark up should be the standard, with descriptive only an option which is allowed with cautions.

p. 124 (5.11.1) The example. What edition was used? What are the page and line boundaries? Or was this all made up too? This example is a perfect demonstration of the weakness of descriptive mark up: Is Anglice Latin or Italian or a representation of a personal pronunciation?
Are the italics quotes, emphasis or ironic? Let the coder code and leave the interpretation to the scholar.

p. 176 (7.3) First, according to certain schools of interpretation texts can and should be regarded in isolation, and it is not the place of the TEI to pass judgement on this question of literary theory. Second, presentational mark up is essential because the Guidelines deal with coding a text, not its interpretation. The role of a given textual feature is ALWAYS open to interpretation, so the function of a good coding scheme is to facilitate interpretation, not pre-empt it.

p. 214 (bottom) The Hamlet example. The stage type describes only the first half of the stage direction; this is the problem with descriptive tagging. Someone should try to reduce the wordiness of this tagging, particularly in the case of the speaker distinctions.

4. Other

This section contains comments that do not fit easily into the categories used above.

p. 76 (5.2.4) The distinction between legal and illegal forms is not clear.

p. 105 (5.8.2) Soft hyphens EXIST in source texts. Please suggest more clearly how to handle them when they occur.

pp. 110 ff. (5.10.3). Find a real text for a real example here. The imaginary and humorous one trivialises what is being done.

p. 129 (6.1) para 2. Trying to define forms with no reference to content is a mug's game. The whole concept of structure shows that form determines content and content determines form, in varying degrees according to the context, example and interpretative perspective of course. In other words, you must create unanimity among the community of scholars BEFORE you can define the forms they can use. Not a practicable enterprise.

p. 130 (6.1) The principle for linguistics (welcome all theoretical positions, favour none) is EXCELLENT. I recommend the same thing for literature; this is the basic premise of most of the preceding comments.

pp. 140-44 (6.2.4) Incredibly wordy and unreadable coding for linguistic features.
If the linguists consider this a good idea, more power to them. I recommend that we not get into this for literature texts.

p. 169 (7.1) "verse, drama and narrative". Narrative is used here in the sense of prose. Not all prose is narrative (cf. cookbooks, or the TEI Guidelines), and not even all literary prose is narrative (some is descriptive). If you are going to try to dictate, or even make suggestions, to scholars in literature, you must get the technical language right.

p. 181 (7.3.2.4) French texts of plays also show the date and place of the first production as well as the names of the actors. You should provide for this.

p. 181 (7.3.3) Use PROSE not narrative, to include the essay and free form creations (cf. Butor's works).
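
Illustration. Several of the comments in section 2 ask that a brief local code be shown together with its automatic expansion into full interchange coding. What follows is only a minimal sketch of that idea in Python, under stated assumptions: the local codes ({p12} for the start of page 12, | for a line break in the edition) and the verbose tags they expand to are hypothetical placeholders invented for this illustration, not codes taken from the Guidelines or from any existing archive.

```python
import re

# Hypothetical local codes (assumptions made for this sketch only):
#   {p12}  start of page 12 in the printed edition
#   |      line break in the printed edition
LOCAL_PAGE = re.compile(r"\{p(\d+)\}")
LOCAL_LINE = "|"

# Hypothetical verbose interchange-style tags (placeholders, not quoted
# from the Guidelines).
FULL_PAGE = re.compile(r'<pb n="(\d+)">')
FULL_LINE = "<lb>"


def expand_local_codes(text: str) -> str:
    """Expand the brief local codes into the verbose interchange-style tags."""
    text = LOCAL_PAGE.sub(r'<pb n="\1">', text)
    return text.replace(LOCAL_LINE, FULL_LINE)


def collapse_to_local(text: str) -> str:
    """Reverse the expansion, recovering the brief local codes."""
    text = FULL_PAGE.sub(r"{p\1}", text)
    return text.replace(FULL_LINE, LOCAL_LINE)


if __name__ == "__main__":
    local = "{p12}first line|second line"
    full = expand_local_codes(local)
    print(full)
    # The round trip shows that nothing is lost in either direction.
    print(collapse_to_local(full) == local)
```

The point of the round trip is that the coder types and proofreads only the short form, while the long form is produced mechanically when the text is sent elsewhere. A real scheme would also need an escape convention for texts that contain a literal | or { character, which this sketch does not handle.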