Literature Needs Survey Results

Paul Fortier
University of Manitoba

Document Number: TEI AI3W4
Version 1, January 22, 1991

In late November 1990 the TEI Working Group on Literature Texts sent out a survey to try to identify the needs of people working in literature with computers. Fifty-three responses have been received to date. The following document contains a summary of the responses.

To facilitate use of the data, some organizational principles were applied to it. First, responses were classified into two groups:

1. Experienced interdisciplinary scholars, generally defined as people working in literature who have input texts, used texts from elsewhere, processed them and, usually, published results based on their use of the computer. There were 40 responses from such people.

2. Others: people who are not literature scholars, not experienced with literature texts, etc. There were 14 responses from such people. These responses are not included in the following summary.

The questions from the Questionnaire are repeated, followed by a line of responses; the numbers indicate the number of respondents who checked each category; a category "Not Answered" is added where appropriate. After this line are found copies of all comments submitted by the first group of respondents. The numbers (1) to (40) permit the user to identify comments from a given respondent from one question to the next without revealing names.(1)

We wish, in transmitting this report, to thank all those who have given the committee their time and advice in this matter. We wish also to thank those who expressed solidarity with our work by sending in the form although literature is not their field, a number of whom appended a note to that effect.

If you wish to help us with further comments, please send them to me at FORTIER@UOFMCC.BITNET, or by paper post to Paul A. Fortier, Department of French and Spanish, University of Manitoba, Winnipeg, Man., R3T 2N2, CANADA.
Cordially,
Paul A. Fortier, for the Literature Working Group.


I. STANDARDS FOR LITERATURE TEXTS.

I would rate the importance of the indicated categories to standards for literature texts as follows:

Bibliographical Information

A. Bibliographical Information (Place, date of publication, edition, printing, etc.):

32_Essential 7_Important __Not important 1_Should not be included

Further comments:

(2) The TEI Guidelines, I thought, did an excellent job of setting standards for these.
(10) Documentation for early texts should make clear whether the spelling is old, or modernized to English or American standards.
(16) This information or some of the categories may be essential, depending on the case. For me, the date & edition.
(22) If facsimiles can be included in text-bases.
(26) Electronic texts that do not correspond to known printed editions make accurate citations impossible.
(28) That bibliographic information be given is essential. That it be absolutely standard in form is only important.
(34) Though I sometimes think it would be just as helpful if, for a given corpus, such details were included in a separate file.
(35) Without bibliographic information the text would be useless for any comparative study.
(38) Just place and date.

Formal Characteristics

B. Formal Characteristics (Chapters and sub-chapters, page and line breaks, stanza divisions, speakers in plays, stage directions, etc.):

35_Essential 5_Important __Not important __Should not be included

Further comments:

(2) Representation of the physical characteristics of the edition or MS used is very important.
(4) In order to cite lexical searches, a textual location is essential.
(10) In an archive, it is best to include all; the researcher can remove unwanted items.
(16) Chapter or story divisions are often essential for me.
(24) Rather a vague question: stage directions are not the same kind of formal characteristic as chapter divisions or line breaks.
(26) Plays and poems have standard reference systems for location but prose fiction does not. Here electronic identifications may have no hard copy equivalents.
(28) If you don't have the formal structure, you don't have the work.
(29) By no means do all texts involved in the study of literature merit the efforts of received texts in terms of the extent of encoded information. The proceedings of an historical society, for instance, may be quite useful in the study of particular literature, but they may not be worth the effort required to provide full tagging.
(33) This question seems to mix together several sorts of characteristics: structural divisions (chapters, lines in poetry, stanzas), pieces of text (stage directions), and typographical features which may have little relation to the work's form (most page breaks, line breaks in prose). Perhaps it would be more productive to ask two questions, about formal characteristics and typographical characteristics; the formal characteristics always need to be marked, but it's not always important to indicate things like line breaks in prose, depending on what you plan to do with the text.
(34) Line breaks seem frequently less helpful except in verse and for manuscripts, especially since the best search programs do their own counting.
(35) These elements are important if the text base is to be used for anything other than individual-item storage.
(36) With maximum flexibility for representing alternative punctuation, breaks, etc. whenever these are not authorial/editorial absolutes.
(39) Page and line breaks not important unless the first edition is a printed edition. Ancient and ms texts should not have the pagination of modern editions in them.

Grammatical Information

C.
Grammatical Information (Basic Form, Part of Speech, Inflection Identification, etc.):

5_Essential 15_Important 5_Not important 10_Should not be included 5_Not Answered

Further comments:

(2) Standards for this information need to be provided for scholars who are interested in it, but provision should be optional.
(4) While this would be nice, it would take forever! (I know from experience.) Furthermore, it would make the files extremely long.
(6) The inclusion of morphological information might be essential for some purposes for some texts, but (given how much work is involved in coding such information) I would not want to make this a standard feature.
(7) Not for lit. e-texts like those I usually use.
(8) Some scholars with lexicographical and syntactic interests will find such information central; literary analysis will generally not require such markup.
(10) Two versions should be held when possible; one with, the other without such information.
(11) Conservative treatment preferred, with full account of categories and decisions.
(16) These categories are essential for me, but I am not convinced that generic e-texts should be tagged this way unless by the experts on grammar & the author.
(19) Parts of Speech: A lot of variation in assigning their characteristics exists. Could be helpful if consistent.
(20) But not necessarily in the standard encoding. There could be a companion analytical apparatus.
(24) Rather depends on the application.
(26) I don't work currently in this area so I have no comment.
(27) For this and the following categories, I don't mean to suggest that every etext deposited in an archive should be required "by law" to encode such features, but I do mean that standard ways for coding these features should be developed and published by the TEI.
(28) All interpretation of linguistic objects (including literature) is based on linguistic understanding.
Without ways to distinguish the various interpretations of an ambiguity, we cannot handle much serious work.
(34) This is immediately a more contentious element, since the categorization is inevitably less "objective".
(35) While this is not important to the ways in which I see myself using an electronic text, it might be important to a linguist.
(36) Difficult. Emphasis must be on flexibility, and avoiding any hint of prescription which would encourage the tail to wag the dog. I'd be interested in anything that could relate this sort of tagging (of, e.g., syntactical/semantic ambiguities or surface v. deep grammatical structures) with methods of representing such tensions dynamically. Distinction between tagging WITHIN the text and in separate files which could be integrated with the text is presumably important here.
(37) I do use this, but this depends on the use to which the text is put; sometimes this information gets in the way.
(39) I really think that this should be done by the researcher, not by the text preparer.

Metrical Information

D. Metrical Information for Poetry:

3_Essential 23_Important 5_Not important 3_Should not be included 6_Not Answered

Further comments:

(1) Hard for me to say. To what extent can metrics be automatically determined? If poorly, then this is ess[ential].
(2) Here again, this information is useful and standards for it should be established, but it should not be required. I am somewhat more in favor of providing it as a matter of course, simply because it can generally be provided in a relatively easy fashion if you are thinking only of such things as line length and rhyme scheme. More detailed information (such as hiatus, synalepha, etc.) would normally be omitted.
(4) I'm not sure. For my purposes, it is not necessary.
(6) The inclusion of metrical information might be essential for some purposes for some texts, but (given how much work is involved in coding such information) I would not want to make this a standard feature.
(7) Depends totally on what is prepared for whom.
(8) Important but subjective. Required markup should be descriptive and (more or less) objective.
(10) Two versions should be held when possible; one with, the other without such information.
(18) This can get rather complicated, especially with mixed meters, but should still be done.
(21) Essential to me, but perhaps not to everyone.
(26) I don't work currently in this area so I have no comment.
(28) If this is not defined, it will just have to be invented. (Ditto, of course, if it's defined but not in a useful way.)
(33) The separation of this from categories C and E is, I hope, meant to suggest that it falls in between the categories of fact and interpretation. There exist a number of different theories of English prosody, and a standard can, at best, provide tags for some of these particular theories, none likely to be the last word. The situation for other languages may be simpler.
(34) Of less relevance to French than to stress-timed languages.
(35) Most folk can work this out for themselves.
(36) Ditto. A generally accepted standard for generally accepted metrical principles would cope with the majority of conventional poetic texts by conventional researchers, and a standard for such tagging would be valuable. As long as electronic text also encourages unconventional approaches with non-restricting guidelines, i.e. tagging which allows text/analysis interchange and which is conformant with SGML/TEI-sensitive programs but with the greatest freedom for researcher-defined categories within these constraints.
(39) But metrical analysis is often a matter of controversy. Can the text preparer distinguish agreed-upon from contentious issues?

Interpretative Information

E. Interpretative Information (e.g. Narrative vs.
expository passages, direct and indirect discourse, point of view, themes, images, allusions, etc.):

8_Essential 7_Important 4_Not important 20_Should not be included 2_Not Answered

Further comments:

(1) Because such "information" is so mortal, there should always be a version at least without this in it, and it should be last on the list of things to do. NB: it is this category of information that interests me the most!
(2) The problem with this category is that it becomes increasingly subjective. The more subjective it is, the less justification there is for including it as part of a standardized text; although, again, tags should be provided.
(4) The direct discourse should be clear from quotes. I presume you will include the normal punctuation of the text in addition to any extra material you may need.
(6) Many of these items are not objectively verifiable, and how such things are coded will vary from one researcher to another. The chances that a text marked up with these features would meet my needs are slim: even if I wanted such things marked, I would probably not like the way someone else had marked them.
(7) Doing work of critic; will it [be] reliable?
(8) If I were marking up etexts for my own analysis, I would certainly encode such features; however, given the subjective nature of most of these interpretations, I would be reluctant to endorse such markup as part of a minimal TEI standard.
(10) Two versions should be held when possible; one with, the other without such information.
(11) Conservative treatment preferred, with full account of decisions.
(13) Images, allusions, especially indications of foreign languages, geographical locations--these need to be coded.
(16) Most of these I consider my domain of research, but would of course be interested in evaluating & perhaps using the results of others. I would find the narrative discourse attributions extremely useful (e.g., 1st, 2nd, 3rd person).
(19) Too much dependent on interpretation.
(20) All these categories are potentially very useful, but once again would belong in a companion encoding.
(24) Some odd bed fellows here.
(26) If identified, this annotation can be of extraordinary value.
(28) I would like to be able to tag/distinguish:
     * narrative vs. exposition (and all the other distinctions suggested by Barthes in L'analyse structurale du récit)
     * direct and indirect discourse; point of view
     * themes, images, topics
     * division of text into scenic/narrative structural units (e.g. Propp's and competing analyses of tale structure)
     Divergent analyses must be allowed.
(31) For most of this stuff, the coding would be too subjective.
(34) Probably too contentious to systematize comprehensively, and (for example) a merely partial (or flawed) identification of themes or imagery can actively mislead.
(35) This kind of interjection is to be avoided at all costs. The text should remain as clear and simple as it can be.
(36) Ditto. Any coding should work to loosen the web of the text and encourage its multivalence to be exploited in a non-print medium. Exciting potential here, I'd have thought. How can coding facilitate the exposition of multiple levels of, say, "point of view" or the complexity of "themes", without constricting them? Can coding be sufficiently sensitive to maximize the examination of tensions between, say, overt levels of meaning, and, perhaps, covert or subverted/-sive levels caused by dislocations within the varying "points of view" (authorially intended, historically and culturally conditioned, skewed by time, class, gender, race, etc. or the sheer slipperiness of the signifiers themselves) or between such semantic levels and acoustic/semiotic/paralinguistic levels?
(38) Can this be optional--separate file.
(39) This is a mixed bag. Such material should be put in by the researcher, as a rule, not the mark-up people. But direct vs. indirect discourse vs. exposition (or whatever) can safely be marked up.
As to the rest, let's have guidelines for how a commentator might make his electronic commentary TEI-observant, but keep notes on images, illusions, etc. out of standard E-texts. However, where a quotation or paraphrase is from a known source, it should be tagged with a 'source of quotation' marker.

Important Items

Please specify important items in this category:

(19) I prefer "clean" text without extraneous tags other than identifiers.
(20) Direct and indirect discourse, images, point of view, allusions.
(24) Discourse levels (narrative, quoted speech, allusions).
(34) Some uniformity in the conventions used to indicate direct speech may be desirable (or alternatively may actually be at odds with an author's own practice).

Tags Needed

F. In order to do my work as I prefer, I need generally accepted tags for the following aspects of texts (Please be as specific as possible. This is not a test but an opportunity to express your wishes. Please number your "wishes" and rank them in descending order of preference.):

(1) 1. B [Formal characteristics]
    2. A [Bibliographical information]
    3. C [Grammatical information]
(2) 1. Editorial annotations
    2. Textual variants
    3. Foreword and afterword material (author's prologue, translator's prologue, dedicatory, poems dedicated to the author, legal license, religious approbation, table of contents), rubrics, titles, running heads, etc.
(3) 1. Tags should be given at the start of subdivisions so that words, sentences, etc., can be identified by position (i.e., by page in a standard printed edition, or by act and scene in plays, or by title in a collection of short stories or essays).
    2. If parallel variant text is given, it should not only be noted as such in a tag, but the tag should contain the length of the variant text so that a program can easily skip it in counts of the standard text.
(4) 1. line number or chapter
    2. language
    3. part of speech, if included
(5) 1. Page breaks
    2. Italics
    3. Foreign words
    4. Quotes
    5.
Poetry
    6. Parts of speech
(6) 1. Bibliographical identification
    2. Formal divisions
(7) 1. accents
    2. paragraph or other breaks
    3. ligated letters
(8) 1. Structural elements: act, scene, line for drama
    2. Speaker designations and stage directions
    3. Glosses for archaic or colloquial terms (as footnoted in many editions)
    4. Explanations for references (explanatory footnotes)
(9) 1. References (Book, chapter, line) but not SGML!
    2. Quotations (citations in French)
(10) 1. Speech prefixes
(11) 1. Location-markers (e.g. page, line, act, scene)
     2. Speaker-tags
     3. Tags for major homographic cases (e.g. to/to, that/that/that)
(12) 1. tags for placement, extent, and possibly other info about illustrations
     2. tags for hypertext and hypermedia (linking; filtering info)
     3. publishing history: i.e., if form, say, chapter divisions, changed since original or first publication, e.g. Dickens and other periodical authors then converted to book; this will be crucial when and if print texts become widely available in electronic and hypertext forms, which demand a different format.
(13) 1. Proper names and places
     2. indication of languages used in the text
(14) 1. Formal Characteristics
(16) 1. Lemma with disambiguation
     2. Frequency (including totals and coefficients)
     3. Pages of occurrence w/chapter divisions
     4. Narrative point of view (1st, 2nd, 3rd person; singular or pl.)
     5. Part of speech
     6. Thematic affiliation(s)
     7. Notations on errors in spelling, deletions, repetitions, etc.
     8. Non-textual information (e.g., illustrations)
(17) I prefer no tags at all and regard such as unwarranted editorial intervention.
(18) 1. Stressed syllables (Accent)--word accent (e.g. gíving) and lexical accent in monosyllables (e.g. give)
     2. End of syntactic phrase
(19) 1. Divisions: Chapters, paragraphs, speakers (plays)
     2. Spanish characters
     3. Titles and sub-titles
(21) 1. Act, Scene and line
     2. Speaker
     3. Part of speech, at least for homographs
     4.
Number of syllables in a line of verse
(22) 1. Parts of speech
     2. Speakers
     3. Poetic or rhetorical devices
(23) In order to make 'handy' concordances I want to know if a word in poetry is in a rhyming position.
(24) 1. [will think about these further]
(25) 1. formal characteristics listed above
     2. textual variants
     3. font information
(26) 1. before tagging I require a verified text--most O[xford] T[ext] A[rchive] texts I've used aren't.
     2. corresponding location(s) in standard edition(s)
     3. textual variants in MSS versions, printed editions, comments
     4. word and motif index, concordance, statistics, apparatus
(27) My views on this matter are embodied in Cheryl Fraser's thesis (which I co-supervised).
(28) 1. formal structure (I assume this is reasonably simple)
     2. manuscript variations in the text
     3. structural decomposition of text
     4. simple linguistic analysis (part of speech, morphological info)
     5. metrical information (number of syllables, placement of ictus, location of caesura, placement of linguistic stress)
     6. identification of phrases as 'formula', 'formulaic' etc. (ideally by different tests)
     7. manuscript abbreviations and their resolutions
(29) 1. titles
     2. authors
     3. sub-titles
(30) 1. Book
     2. Chapter
     3. Verse
(31) 1. Print text id (line numbers for poems, act and scene for plays)
     2. Edition/source/variant info
(32) 1. Because I work with medieval texts, marking up a text for variations in orthography (lemmatizing) would be very handy.
     2. Unique mark-up strings for runic characters (thorn, eth, etc.) would enable me to do global search and replace to insert my own versions.
     3. I usually search for vocabulary, especially synonyms and sometimes antonyms. Standard tags to indicate vocabulary relationships would be useful if I wanted to share my marked-up text.
     4. Tags for an editor's choice of type fonts and other typographical or manuscript features such as line breaks, rubrication, glosses, etc.
(33) Since I'm more interested for my own work in the formal and structural aspects of texts, the current standard generally seems to cover the things I need right now.
(34) 1. accented characters
     2. Parts of speech
     3. Inflections (plurality, tense, etc.)
(35) 1. type faces and fonts
     2. diacriticals
(36) I'm engaged at the moment in trying to use the computer to tease out textual dynamics which cannot easily be expressed in linear form. I'm using an authoring package to investigate this and am intrigued by the possibilities for textual analysis and for teaching. Markup is effected not by inserting tags around the word/phrase/morpheme or whatever at the same physical 'level' of text, but by selecting (on Mac) the w/p/m etc. and typing the coding into what might be seen as an underlying level. Many levels can be set up, allowing the word etc. to be coded according to different features at different levels, or putting in alternatives at the same level. To take a simple example of the latter--the OE compound 'wintercearu' (lit. 'winter-sorrow')--where the 1st element (surface level/conventional grammar--a noun) might be seen with varying degrees of validity with adverbial, adjectival and/or substantival function depending upon the relationships perceived between the two elements. Similarly, 'cearu' may be the surface grammatical subject of the clause, but at a deeper grammatical level may function as object or agent. All these glosses can be easily written in as alternative codes, and the combinations and permutations examined interactively within the wider context of the text. By coding these and many other features (rhythmical patterns, sound values, semantic resonances etc.)
into set levels behind the words, and then revealing the coding at the different levels, one can get some sort of visual representation of the text's rhythmical pulses/alliterative patterns/surface grammatical and semantic structures and the tensions between these and other levels of syntax and meaning, etc. This is very easy for the user--click/replace--and sympathetic to the way in which the mind works (not linear tags around, but levels behind and within the word). Indeed, the text can be seen to slip between levels on the screen. There's no way, however, in which such texts could be exchanged as they are entirely software/hardware dependent, and yet they are highly encoded. I just wonder whether more textual analysis won't be done using increasingly sophisticated '3-dimensional' approaches, with linear tagging very much the secondary process--the result rather than the initial parameters. So my main interest is in the potential ease of interface between such programs and standard-coded linear text, and how readily TEI guidelines apply here.
(37) 1. Metrical Information
     2. Grammatical
     3. Narrative vs. direct discourse
(38) I don't quite understand but
     1. genre
     2. original language of authorship
(39) 1. Clear and standard reference scheme to basic editions.
     2. Elimination of 'typographical artifacts' (page-breaks, line-breaks, non-semantic hyphens, etc.). If they are included, they must be tagged so I can strip them out easily.
     3. Speaker identity, change of speaker, quotation and indirect discourse.

Further Suggestions

G. Further suggestions to the Work Group on Literature Texts:

(3) Lines should never be longer than 80 characters and they should always end with ASCII 13 and 10.
(4) It would be nice to include translations as well as original works; perhaps reworkings also. I would particularly like to see a machine readable version of Berni's Rifacimento of Boiardo's Orlando Innamorato so that I could compare them side by side, for example.
(12) Mostly matters that obviously follow from my last suggestion.
(16) Providing information with tagged text on corpora which include said text would be very useful.
(18) Compile a library of "prosodic-phonological dictionaries" and/or sets of algorithms to generate prosodic-phonemic transcriptions from orthographic texts.
(19) Work with foreign language professional organizations. Some have guidelines already in place (Spanish).
(21) I need a program to transcribe verse into IPA symbols and to perform metrical analysis.
(24) I assume it is OK to forward the questionnaire to as many likely literary types as I can think of. There seem to be quite a lot of very familiar email addresses in the current list!
(25) Minimally tagged texts are better than no texts at all.
(29) Tags are essential for proper retrieval of information, apart from analysis of text.
(36) [This is more concerned with the sort of editions it would be nice to have in electronic archives]. For medieval literature, where oral delivery and non-author controlled dissemination are particularly at odds with the medium of print and authorial/editorial control, and where punctuation is imposed by the editor, it would be good to have tagging which brings out the fluidity inherent in such text. If an edition is being prepared, the editor, who will have deeply pondered textual cruces, alternative punctuation, etc., could tag the choices which will be used in the printed version, but leave the options embedded within the electronic text. E.g., if an editor were editing an Old English text, using, say, a package combining the nesting facilities as above with good standard editing facilities (which must be the next development), then, when deciding how to punctuate the text, they might put in punctuation choices as flexibly as possible at positions of ambiguity (in an either/or/or ... facility). To evolve towards a final version for printing they could then click the one preferred in each case.
In which case the other choices disappear from the surface version, which is fossilized in printed form, but remain underneath. If an interface could transfer the texts, with their underlying coding, to a standard format, then a version of the text could exist from which an infinite no. of slightly different versions/interpretations could be readily generated--one physical coded text/several possible output forms. The academic community would gain a dynamic text which would be so much more than the static printed version. We could benefit from the editor's particular expertise and insights, recovering some of the things that usually end up in the bin. (Printed version 100% manifest, all choices which were rejected 0% represented--even if the decision between alternatives was 51/49). Will an editor do this if this coding is a time-consuming (post-event) phase, rather than integral to the process of developing the text for publication? It seems to me that the very medium of electronic text prompts new insights and approaches, and this is more exciting than electronic text merely as conventional print-based text in another medium. How can the academic community benefit from the fruits of the labors of those who are not algorithm-proficient in these respects?--i.e. the importance of the interface again.


II. THE CURRENT VERSION OF THE STANDARDS (TEI P1 VERSION 1.1)

I would propose the following modifications to make them more appropriate to the needs of scholars of literature (Please list in descending order of preference):

(2) 1. A summary section of preceptive instructions for different text types, i.e., a cook book with a set of rules with the minimally acceptable tag set for a particular kind of text.
(3) See F and G above.
(7) In general I find them far too cumbersome for my own use. I would never have accepted them for the Dartmouth Project had they existed back when we started. They would have cost us lots of time and money we do not have.
Again, it is a question of what text is being prepared for what user. I can see the need for TEI, but hope TEI can see that some of us, especially those involved in free-standing data banks which do not distribute text, will want our own simple-as-possible mark-up.
(17) Less emphasis on tagging, editorial intervention, etc. and more attention to page and line references (absolutely necessary), page layout, etc.
(20) 1. basic forms of words
     2. textual subdivisions
     3. titles of subdivisions
     4. speakers in plays
     5. correspondents and dates in epistolary fiction
(22) 1. Reproducing spacing of poetry on the page
(24) 1. Give the editors each a fat cigar and a blonde
     2. And three weeks holiday in Mauritius
(28) 1. improved treatment of drama, poetry, prose fiction
     2. more specialized treatment of special literary forms (e.g. particular genres of lyric)
     3. treatment of literary analysis: narrative structure, themes, images, levels/types of discourse
(33) 1. Although metrical indications are important, it strikes me as both simpler and more important to provide ways of encoding the stanzaic structure of poems: right now there are provisions for indicating stanza breaks, indentation, and rhyme, but no way to indicate a pattern that combines these and is uniform throughout a poem.
     2. The encoding for dramatic texts looks like it's on the right track; but it needs fuller examples, and either an illustration or a rethinking of --I don't see the reason for doing this as an empty element, rather than as a structural element within a line.
(37) I'm sorry but I'm not familiar with the standards. I wonder if there is a way to hide codes one doesn't need when working with a text.
-------------------------
(1) In this version of this document, responses have also been formatted more uniformly for easier reading; typographic errors have been silently corrected, spelling has been unified, missing punctuation supplied, and various forms of emphasis typical of electronic mail have been rendered with italics. -Ed.