Daniel I Greenstein

Revised draft minutes of the Meeting of the AHC's Working Group to consider ACH-ACL-ALLC, Guidelines for the Encoding and Interchange of Machine Readable Texts, Draft Version 1.0, eds. CM Sperberg-McQueen & Lou Burnard (1990), held at Southampton University, 13-14 April 1991

<date>28 April 1991 <body>

Working Group members:
Daniel Greenstein (Convenor), Glasgow University
Manfred Thaller, Max-Planck, Gottingen
Hans-Jorgens Marker, Danish Data Archives
Ingo Kropac, Graz University
Jan Oldervoll, Tromso University
Peter Denley, Queen Mary Westfield College, London
Caroline Bourlet, CNRS, Paris
Donald Spaeth, CTICH, Glasgow University
Kevin Schurer, Cambridge Group for Population Studies

Documents tabled:
Daniel Greenstein, Draft comments on the Guidelines, with reference to the documents of modern political history
Donald Spaeth, Draft comments on the Guidelines, with reference to the documents of early modern history
Kevin Schurer, Draft comments on the Guidelines, with reference to census and census-like sources
Letters commenting on the Guidelines from Jan Oldervoll and Caroline Bourlet

Session 1: Historical sources, historical research, and the TEI draft proposals. General discussion. 5:00-7:30
Present: DIG, MT, H-JM, IK, JO, PD, DS
Apologies: CB, KS

The purpose of the session was to allow participants to air general comments and concerns about the TEI's proposed standards before the more specifically source-oriented discussions in sessions 2-5.

MT introduced discussion with the general comment that what historians are interested in is the facts that texts reveal, and that in many instances the appearance of a text is integral to its interpretation.
The problems alluded to are threefold: how to describe the appearance of texts, particularly manuscript texts; how to tag meanings derived from the appearance of text elements, particularly where their meanings are ambiguous; and software-related problems with parallel or concurrent tagging hierarchies used to represent texts' semantic and structural properties respectively.

IK agreed with the general problem and proposed to use the formalized language of diplomatics, which distinguishes between the external properties of a text - properties you can see and describe - and the internal properties - semantic contents. Criticized the Guidelines for blurring these distinctions.

H-JM agreed and added that machine-readable manuscript sources required visual representation of text elements described by the encoder.

MT: SGML is capable of tagging concurrent hierarchies, but is there (or will there be) software for analyses which draw simultaneously on information encoded in separate hierarchies? [Note p2 paragraph 1 of the draft report tabled by KS, where he indicates that a census manuscript ideally encoded would describe the appearance of the manuscript (especially important where information is provided in several hands), and the structure of the manuscript as perceived by the historian and as it is conducive to the analysis he/she wishes to conduct.]

DS, MT, IK: lengthy discussion about what to do with parallel or concurrent hierarchies; agreed that the Guidelines provide for their representation but give no indication about how one might analyze information encoded in them.

DS: SGML is something historians will use when there is software which is both easy to use and readily available. But doubts whether it will be possible to "represent" completely all the complexities of historical sources (especially given what has been said above about the ambiguities involved in manuscript sources and of the complex interaction between a source's appearance and its interpretation).
Wonders too how ready historians will be to incur the additional cost in a computer-aided historical project which TEI-conformant data entry will require. Doubts very seriously whether people using heavily structured sources (largely of the modern period) will ever use SGML, suggesting that what is perhaps more essential for their work is source-specific standards such as those recommended in Kevin Schurer's document.

MT identified a related problem: there is a difference between describing the external properties of a text in order to reproduce it in a printed form from the machine-readable version, and describing the external properties of a text in a way which will facilitate its interpretation and analysis. Provides an example of a manuscript document which comprises several lists laid out in rather irregular columns. In describing the external properties necessary to reproduce the document in machine-readable form, the encoder need not be sensitive to the problem of associating information in proximate columns. [A further example drawn from printed sources may be gleaned from the section on newspapers in DIG's tabled draft comments on the Guidelines.]

IK addressed the related issue of separating anything in the source from anything known about the source, and argued that the Guidelines, should they be implemented, would dangerously mix the two. Thus, the secondary analyst could never recover from the encoder's interpretation of specific external properties. A rigorous distinction between tags for external and internal text properties would facilitate the development of a parser which would enable a user to eliminate a whole category of tags which might interfere with interpretation.
As an example he proposes a string of characters in a medieval charter which are identified as a person's name followed by an unrecognized symbol, or a string of unrecognized symbols and recognized characters, which may or may not be associated with the name as an indication of title, place of residence/birth, or social status, and could mean any of these three things with equal probability. To assign the entire string a single tag (id=name) is suitable for reproducing the machine-readable document in printed form but is useless for interpreting the information contained within the string of symbols/characters.

JO wondered whether software-related problems were beyond the working group's remit. Recommended generic tags for each of the three probable meanings of the string of unrecognized symbols/characters.

IK: The problem is not one of generic tags but of linking the concept(s) that the encoder used when interpreting the unrecognized string of characters/symbols. Wants the tags to provide some indication of those concepts and of the probability that they are accurately applied in this instance. [Note that indications of probable accuracy or correctness are now widely used in statistical analyses of data gathered through ambiguous manuscript sources, as well as data compiled from several sources on the basis of record linkage which is conducted with different levels of certainty across different cases.]

MT: A plausibility attribute should be considered seriously by the TEI.

H-JM: It may be possible to standardize the tagged representation of plausibility but not to get researchers to use the same criteria when assigning probabilities or to adopt the same probability scales.
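The charter example and the proposed plausibility attribute can be illustrated with a minimal sketch. It is not a TEI or SGML facility: the element names `seg` and `interp`, the attribute names `concept` and `plaus`, and the 0-to-1 scale are all hypothetical choices made for illustration, and, as H-JM notes, different encoders might well adopt different scales.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Reading:
    """One candidate interpretation of an ambiguous string (hypothetical model)."""
    concept: str         # the concept the encoder applied, e.g. "title"
    plausibility: float  # encoder's estimate on an assumed 0..1 scale


def tag_ambiguous(text: str, readings: List[Reading]) -> str:
    """Wrap `text` once and record every candidate reading alongside it.

    The source string is kept untouched; the interpretations are attached
    as separate empty elements, so they can be stripped without loss.
    """
    for r in readings:
        if not 0.0 <= r.plausibility <= 1.0:
            raise ValueError(f"plausibility out of range: {r.plausibility}")
    interps = "".join(
        f"<interp concept={quoteattr(r.concept)} plaus=\"{r.plausibility:.2f}\"/>"
        for r in readings
    )
    return f"<seg>{interps}{escape(text)}</seg>"


# Three readings held to be equally probable, as in the charter example.
equal = [Reading(c, 1 / 3) for c in ("title", "place", "status")]
encoded = tag_ambiguous("XXX", equal)
```

Keeping the readings in separate elements, rather than folding them into a single name tag, reflects IK's request that a secondary analyst be able to discard the encoder's interpretation and return to the bare string.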
IK: Different kinds of information contained within a text will require different probability scales.

DS: It might be possible to use several coterminous codes which require no ranking [in the example provided above the unrecognized characters/symbols immediately after a person's name might be indicated as rank = XXX, social status = XXX, title = XXX], and gives the example of fields in a database which store multiple values, possibly arranged in a meaningful order, separated by a pre-assigned delimiter.

MT: This isn't allowed by SGML but should be. Recommends that the name example cited above could be solved with two attributes, tagname and tagcontent, where the relevant multiple values would be indicated in the tagname, eg: <startfield> tagname attribute=surname, origin, rank or nobility, tagcontent "actual item of text" </endfield>. Identified two "kinds" of standards: that represented by the TEI and that by Kevin Schurer's draft guidelines for encoding census and census-like sources. The latter is very specific. It provides very strict guidelines for historians using a specific kind of source but may prove difficult to generalize. The former offers very general standards which may prove difficult to apply in specific cases.

IK identified another problem with the Guidelines: their ease of use will influence the extent to which the proposals are taken up. For example, a project using 20,000 medieval charters, having to provide a header for each, might opt not to use the Guidelines.

MT indicated a general problem with devising standards independently of any consideration of the very different purposes to which such standards will be put, and that at the moment.

DS: SGML is something historians will use when there is software which is both easy to use and readily available.
But then wonders whether it is genuinely possible to "represent" the complexities of historical sources (especially given what has been said above about the ambiguities involved in manuscript sources and of the complex interaction between a source's appearance and its interpretation). Wonders too whether historians will really want to incur the additional cost in a computer-aided historical project which TEI-conformant data entry will require. Doubts very seriously whether people using heavily structured sources (largely of the modern period) will ever use SGML, suggesting that what is perhaps more essential for their work is source-specific standards such as those recommended in Kevin Schurer's document.

MT: Historians require a series of standards which provide guidance in treating crystals commonly encountered in a wide variety of sources which are used as the basis of computer-aided research. Such standards would, for example, recommend that historians not treat X as a widow of Y but treat X and Y as discrete entities which are somehow related, and would treat dates, names and places in an agreed fashion, etc.

After a discussion about these two kinds of standards it was agreed that standard treatment of specific crystals frequently encountered in historical sources would receive fuller treatment in a second workshop to be held at Graz University on 2-3 May 1991, while the problem of identifying the structural features common to distinctive document types used by historians should be the main thrust of the subsequent discussions in Southampton.

DIG suggested that an AHC position on standards would be useful. MT agreed and hopes this will come out of the AHC meeting in Odense, 28-30 August 1991.

DIG proposed altering the following day's agenda to give L Burnard, co-editor of the Guidelines, a chance to address some of the general concerns which had emerged during session 1.
General problems emerging from session 1 that the TEI might want to consider:

The problems inherent in manuscript sources are nowhere adequately addressed in the Guidelines. Some of the more significant ones are:

1) It is not possible with manuscript sources to list all the possible characters as with printed sources. Consequently, we need an open-ended set of glyphs which indicate unique text features which may or may not be significant characters/symbols, and/or the ability to describe such symbols/features.

2) Indicating or describing a non-standard symbol or character has two purposes which may be distinct: reproducing the machine-readable text in a printed form, and interpreting the semantic contents of the text. These two usages may require separate treatment, especially as the meaning of non-standard characters/symbols is often ambiguous. Tags which impart semantic meaning to non-standard characters/symbols need to make specific reference to the assumptions used in interpreting meaning and perhaps give some indication of the assumed probability that such meaning is in fact correct.

3) The problems posed by the distinctive external and internal properties of manuscript texts, and by the assumption-based connections between them, throw up a series of software-related questions, or at least point to the need for SGML parsers which will be able to draw upon information tagged within multiple hierarchies.

4) The use in the Guidelines of linguistic models of analysis makes the very important chapter on analytical and interpretive features unapproachable to the non-linguist.

5) Software considerations are also important as they will ultimately determine the extent to which the Guidelines are adopted.
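Point 3 asks for software able to draw on information tagged in more than one hierarchy at once. A minimal sketch of one way to think about this follows; it does not use SGML's own mechanism for concurrent markup but swaps in a simpler stand-off model, in which each hierarchy is a list of labelled character ranges over the same text. The sample string, the labels and the function names are all invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A labelled character range over the transcribed text (end is exclusive)."""
    start: int
    end: int
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two ranges share at least one character position."""
    return a.start < b.end and b.start < a.end


def cross_query(first: List[Span], second: List[Span]) -> List[Tuple[str, str]]:
    """Pair every span in one hierarchy with each span it overlaps in the other."""
    return [(a.label, b.label) for a in first for b in second if overlaps(a, b)]


# Invented sample text; offsets below refer to positions in this string.
text = "Johannes miles de Graz"

# External hierarchy: which hand wrote which stretch of text.
hands = [Span(0, 9, "hand A"), Span(9, 22, "hand B")]

# Internal hierarchy: the encoder's semantic reading of the same text.
meanings = [Span(0, 8, "name"), Span(9, 14, "rank"), Span(15, 22, "origin")]

# Which hand is responsible for each semantic element?
pairs = cross_query(meanings, hands)
```

Because neither list nests inside the other, the two descriptions stay independent, and either can be dropped or replaced without disturbing the other.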
Session 2: Further consideration of some general issues 9:30-11:00
Present: DIG, MT, H-JM, IK, JO, PD, DS
Apologies: CB, KS
Attending: LB

DIG introduced the session with a brief outline of the history working group and its place in relation to other TEI working groups, and with a synopsis of the problems which came to light in the previous evening's discussion. The problems were:
(1) the absence from the Guidelines of any discussion of what it is to define and describe a text;
(2) the possible need for greater distinction between tags used for marking up internal and external properties of a text;
(3) the need to associate concepts and probabilities with semantic tags, especially where they are used on the basis of a text's external properties;
(4) the analytical need to draw on information encoded in different concurrent or parallel hierarchies;
(5) the relationship between general standards for describing text features common to distinctive document types (as represented in the Guidelines) and specific standards for dealing with text features common to many sources and which present particular problems in historical analysis;
(6) whether a common tag set could be used to describe a text for reproduction and to describe a text for analytical purposes.

LB: Surprised at the concern expressed over the level of distinction between tags which represent internal and external properties respectively, because the TEI committee structure which helped to produce the Guidelines reflected just such a conceptual division. The general problem is that tags are oriented towards linguistic and not historical analysis, and that the relevant TEI committee did not provide for the level of description appropriate to manuscript sources.

IK insisted on a rigid distinction between tags used for external and internal properties respectively, so that the conceptual information brought to bear on the usage of the latter set of tags could be rigidly isolated.
But admitted to a grey area between texts' external and internal properties (eg certain characteristic features of a text that one can see and describe, for example symbols and heraldry in a medieval charter, often bear on the semantic contents of a text) and wondered whether, by allowing for a grey area, the attempt to preserve any sort of distinction necessarily fails.

LB: The problem can be handled with concurrent tagging hierarchies.

MT: SGML has a problem handling concurrent hierarchies; provides the following example: This is a text, of more than one scribe. Tagging the external properties of the example text would entail some tagging hierarchies which start at the beginning of the text and go through to its end (eg rendition=, begin line etc). Others, eg those which indicate a change of hand, are only relevant to a given portion of the text. The problem is that if we stick to the hierarchical tagging schema proposed in the Guidelines we are marking external properties of the text according to a model of text which assumes that text is normally linear and sequential, when in fact the model of texts appropriate to some analytical schema will not necessarily be so. His problem is that we need a set of tags for external properties which may be used independently or outside of any existing hierarchies.

LB: The example problem could be solved by segmenting the text with the <s> tag, claiming that at no one point could a text be attributed to more than one scribe.

PD disagrees because of the existence of glosses, corrections, etc where several hands are apparent. The general problem is that LB's solution works for linear sequential texts but that texts are not always sequential.

After some general discussion it is agreed that there may be a finite number of properties or attributes which need to be treated independently of any hierarchy, and that these may be identifiable. It is thought that this question might be considered at the next AHC workshop in Graz, 2-3 May 1991.
But the following problems remain: Can SGML represent text features which occur outside any one hierarchy or across several? What are the software ramifications? Could an SGML parser be developed which could handle these text features?

How to identify text elements which are not part of a standard character set, but which belong to and in a text, eg symbols in a medieval document which indicate days of the week, the end of the text, or are used as administrative seals? Discussion ensues in which it is agreed that: a finite but unknowable number of such symbols exist; their number is too large and their shape (or form) too varied to allow each one a unique tag; their meaning is often ambiguous; and consequently, their accurate description is important to the secondary analyst. SGML is too inflexible to allow for sufficient description of these symbols' form. It was also agreed that by concentrating on the symbols' meaning or function (eg indicates date or person's social status) rather than on their appearance, their number would be reduced to a more manageable size. The approach seemed to make sense, especially given the likelihood that visual images would eventually be linkable to encoded texts, thus allowing secondary analysts to see the symbol without having the original document. It was agreed that the problem might be referred to the working group which was to meet at Graz on 2-3 May.

MT: There must be a conceptual distinction between standard characters which make up a text and the images or symbols which may be the same size as standard characters and take up the same space as standard characters, but which are not in fact standard characters (refers to the example of a symbol used to indicate a date in a medieval document - an M with a little o above it and slightly to the right).

LB: Such a distinction is impossible to draw since there are occurrences of images which contain texts within them; refers to the frontispiece of Gulliver's Travels as an example.
MT agreed and brought up a further problem: that the model of an image which might inform its encoded description would inevitably be overlaid with conceptual or interpretive information [providing a nice example of how any rigid distinction between a text's external and internal properties must eventually break down]. This only reinforces the historian's need for tags which would allow the encoder to handle a symbol in one of three ways when it is encountered in the text: to treat it as a recognized symbol and use the appropriate tag; to treat it as a symbol for which there is a defined name; or to indicate that it is a new symbol not yet identified and then either to reject it (ie not encode it) or to go through the whole procedure of defining the symbol as an entity.

PD indicated the enormous ambiguity which necessarily blurs any distinction between a text's internal and external properties and, after some general discussion, it is agreed that such a rigid distinction cannot possibly be maintained in practice.

DS: Some of these problems might be overcome if the Guidelines allowed tags to indicate an element of uncertainty or probability. Thus, a symbol which might indicate a date could be tagged as something which may or may not be a date. Some general discussion about the problems of standardizing historians' use of certainty levels produces agreement that relevant guidelines could be produced.

MT indicated a further problem of crystals whose internal structure requires very precise description as it bears directly on interpretational/analytical issues. The Guidelines allow insufficient flexibility to document precisely the internal structure of such crystals; he offered the following example:

<field is name = surname, origin, rank> text...
</field>

LB attempted to work out the problem while the meeting broke for coffee (due to technical difficulties, DIG wiped the brilliant solution off the board before it had a chance to be fully discussed or even written down on paper).

General issues emerging from Session 2 that the TEI might want to consider:

1) Tags used to define a text's external properties don't provide for a level of description appropriate for historians, especially those working with manuscript sources.

2) There may be a finite number of properties or attributes of texts which need to be treated independently of any hierarchy. These might be identifiable.

3) Manuscript documents, in particular those of the medieval period, have a host of non-standard characters or symbols whose meaning is unclear and which cannot be adequately described with SGML; such description may not be possible anyway. It might be possible to tag non-standard symbols or characters according to their function in a text rather than their appearance, but further work was needed. Even if a manageable, functional typology of symbols could be constructed, it would still be desirable to allow for such symbols to be described in some detail (eg, looks like an M with a little o hovering above it and slightly to the right). This is especially important where the symbol's meaning is ambiguous. The existence of a functional typology of symbols would require tags which could represent the encoder's uncertainty (there is a high probability that the following unknown symbol represents a date). In practice, however, such a facility would only be used to advantage if standard levels of certainty could be agreed.

Session 3 11:30-1pm

It was agreed that the problem of crystals that are of particular importance to historians will receive further discussion in Graz, 2-3 May, and that sessions 3-4 should be given over to the identification of specific document types whose structure is not yet catered for by the Guidelines.
DS summarised his report on the problems raised by early modern documents, particularly depositions and other court records. MT suggested several other document types, and then suggested a system for classifying document types based on function. (I have the precise classifications in my notes, if you want them.) PD suggested that handbooks on diplomatics might provide a suitable classification system, but there was resistance to both systems proposed. It was unproductive to debate over which classification best described a source. It was finally agreed to produce a list of important document types without reference to an overall classification scheme.

After some discussion of the most appropriate typology with which to approach historical sources of distinctive types, the following list of representative sources was produced:

Charters: IK
Contract: IK (medieval), DIG (modern)
Will: MT
Parish register: JO
Poll Book: DIG
Court Record: DS
Newspaper: DIG
Inventory of inanimate objects: H-JM
Account book: H-JM
Census: JO
Records of deliberation (minutes, transcripts): PD (medieval), DIG (modern)
Private letter: PD (medieval), DIG (modern)
Historiography: IK
Transcripts of spoken matter: PD (medieval), DS (early modern), DIG (modern)
Statute: PD (medieval), DIG (modern)
Official correspondence: PD (medieval), DS (early modern), DIG (modern)
Inscriptions: MT
Images: MT
Composite document: H-JM
Diaries: DIG
Compilations: PD (medieval), DIG (modern)

General point (came up in discussion of document types): DS suggested that many historical sources should be regarded as lists rather than paragraphs, and provided an extract from a court book as an example. Any software tools for analysis would have to handle list processing.

In the course of the discussion Alan MacFarlane, Reconstructing Historical Communities, was mentioned as a useful guide to the problems inherent in at least some of these sources, and MT articulated his concern with the role of standards from the perspective of the historian.
The general approach to texts adopted by the TEI attempts to describe their characteristic structures but neglects: that the historian's fundamental purpose in using a particular type of source or document is analytical; and that the historian is likely to select from any document only that material which is relevant to a particular analysis rather than to record the whole corpus. Consequently, the structure and meaning of text features frequently encountered by historians in several types of sources, whose encoding has practical ramifications for historical analysis, require more urgent attention.

After some discussion the two distinctive approaches to standardization floated the previous evening are fleshed out. One approach, from the bottom up, starts with the so-called crystals or text features whose appearance and innate structure bear directly on interpretive and analytical issues. Such crystals don't necessarily exist in any one document type, nor can they be confined to any one tagging hierarchy. A second, top-down approach defines the overall appearance of particular document types. Agreed that the former will be given consideration in Graz and the latter here at Southampton, where it is adequately represented by the document types listed above.

Session 5: Where to now?
2-3:30

It was agreed that:

members of the working group would take responsibility for documenting as precisely as possible the characteristic features of the document types outlined above, and the assignments (as indicated above) were made;

that a report should be made for each document type of between 1 and 4 pages in length;

that the reports should make references wherever possible to publications in which methods of handling the source are considered, and to articles, books or on-going projects where the source is being produced in machine-readable form;

that the reports should indicate where the identified features are tied to particular parts of the document or might appear anywhere within it, whether features are always tied together with others, etc.;

that together the reports should address documents composed in several different languages and that examples of the documents themselves might be usefully presented in some cases;

that completed drafts of the reports should be sent to D Greenstein no later than 1 June 1991;

that the working group's deliberations should be presented in published form to the International AHC conference to be held in Odense 28-30 August and that the publication should consist of the following chapters:
1) Introduction MT
2) The work of the TEI and an introduction to SGML LB
3) SGML and the problems of computer-aided historical research (report on the general deliberations of the working group) DIG
4) The compiled document-type descriptions discussed above together with facsimile examples where appropriate

that the publication schedule should be as follows: reports to DIG by 1 June as indicated above; corrected proofs should be returned to contributors by 15 June; final manuscripts returned from contributors to Gottingen by 15 July; galley proofs are sent to printers (from Gottingen) on 1 August;

that the working group should attempt to assemble one day prior to the conference for further discussion of the published report;

that the publication will be distributed to interested parties and to everyone attending the AHC conference in Odense, and will be the subject of further deliberations at Odense;

that chapters 3 and 4 of the publication, along with the minutes of the meetings in Southampton, Graz and Odense, will form the basis of the report to the TEI.

The meeting was adjourned at 3:45pm.

DIG