Draft comments on ACH-ACL-ALLC, Guidelines for the Encoding and Interchange of Machine Readable Texts, edited by C.M. Sperberg-McQueen and Lou Burnard, Draft Version 1.0, 16 July 1990. With particular concentration on the documents of political history.
<date>10 Feb 1991 <body>
A working group of historians has been set up under the auspices of the Text Encoding Initiative to comment on the draft guidelines for the encoding and interchange of machine readable texts. Essentially, the group was asked to identify the characteristic features of different textual sources that historians encounter in their work, to indicate how these sources are used in research, and thus to inform attempts to devise encoding schemes for such texts using the mark-up language SGML. In other words, the members of the working group were asked: if you were given a machine-readable version of the sources with which you were most familiar, what features of the text would have to be tagged in order that the source would sustain the kind of analysis you normally make of it?

The following draft comments on the TEI proposals are prepared in six sections. The first offers a number of general suggestions. The second looks only very briefly at character sets and character set definitions. The third section discusses version control. The fourth discusses source descriptions, which would precede an encoded source and provide bibliographic information about it. The fifth section looks at features common to many texts, but only to elucidate those features given insufficient treatment in the draft guidelines. The final section tries to identify the structural features of several distinctive document types, or historical sources, which may need to be treated differently than the document types already identified by the TEI (including corpora, verse, rhyme, drama, various narrative structures, dictionaries, and office documents).
The treatment of sources is by no means exhaustive (it concentrates primarily on the sources of modern political history), nor is the draft complete. Further, other members of the working group are at present preparing their draft comments on the sources of medieval, demographic, economic, and early modern history. Other reports are forthcoming from historians experienced in devising encoding schemes for historical sources and in software creation. These will be circulated upon their receipt.

DI Greenstein
Lecturer, Modern History, Glasgow University
digger@glasgow.arts.sun1

I. General Comments.

The TEI is first and foremost interested in providing standards for encoding structural features of a variety of different text types. It is also interested that tagging or mark-up procedures prove sufficiently flexible to support the variety of analyses that anyone working ordinarily with texts might carry out. It is not, at this juncture, interested in software development, although very obvious pointers are being constructed for potential software developers. In light of this, two points are worth making.

Firstly, most, but certainly not all, computer-aided historical analysis of textual material is based on the entity-attribute model which is commonly used to describe the structure of standard database tables. It is worth stressing that this adherence to the entity-attribute model is not a reflection of the historical community's imprisonment within a software paradigm, but of the fact that the software paradigm adequately reflects the nature of much historical research.
The entity-attribute model merely reflects an oft-used research method by which the characteristics of given objects, individuals, institutions, or events are systematically researched, catalogued, and on occasion categorized so that the characteristics of individual entities within a defined domain of like entities (an elite, early modern English churches, the working, middle, and upper classes, strikes in France) might be contrasted and compared. The model also provides some means of assessing particular characteristics which might be said to be causal of given events (does ethnicity or social class determine voting behaviour? were men or women more likely to join evangelical religious associations during the Great Awakening?). The model may also underpin counterfactual and predictive analyses of given events or historical situations (would American economic development have occurred at the same rate without the railroad? would the American south have been economically better off without slavery?).

The importance of the entity-attribute model to computer-aided historical research cannot be overemphasized. Nor can the centrality of hierarchical and relational database technology to computer-aided historical research. This does not mean, of course, that computer-aided historical investigations are limited to highly structured sources such as census manuscripts, city directories, business accounts and parish registers, which may easily be transcribed into normal database tables. Collective biographies of political elites or analyses of mob behaviour in the nineteenth century have relied consistently on sources such as newspaper accounts, diaries, autobiographies, and legislative records, which in fact comprise running prose (one hesitates to use the term "unstructured" too loosely).
The historian can easily extract information from such sources to build up the entity-attribute relationships which are desired, often by preparing index cards or some equivalent for each entity involved in a particular study and filling out the index cards with the various attributes that are required wherever and whenever these are found in the source. Having structured data in this fashion, the leap from the index card to the database table is hardly a quantum one. Moreover, it is one which promises results which may be obtained through tried and tested techniques, and techniques which are becoming increasingly accessible to the non-computer-specialist through successive generations of database software.

It is also necessary to point out that computer-aided historical research projects are often launched with one end in view: the production of analytical results in the short term. Should secondary analysis be possible on the basis of datasets thus created, then that is an added bonus of any given research project, one which is more frequently available with the advent of more sophisticated DBMS technology. Rarely does the historian, working on limited research grants, find him- or herself in the position of being able to create machine-readable sources for the sole purpose of having them used by others.

The combination of these two principal features of computer-assisted historical research, its reliance on the entity-attribute relationship and its aim of producing analytical results in the short term, probably indicates that the historical community will not in the near future be the source of machine-readable sources encoded according to whatever standard. Having ventured this bold proposition, it is necessary to make several qualifications immediately. Firstly, it is likely that historians will form an enthusiastic "user-community" of textual sources which are coded according to standard procedures.
No one doubts the value or usefulness of encoding standards, especially as they will in time make available to computer-assisted investigation a wide range of sources hitherto out of bounds, if only due to the cost involved in their use. One must point out, however, that the enthusiasm with which encoded texts will be received by the historical community will be directly proportional to the extent to which the standard coding procedures will sustain analyses based on the entity-attribute relationship briefly described above.

Historians' use of machine-readable sources coded in accordance with the guidelines that are being developed will also be determined by the analytical software that will be available to them. When diaries of political elites, encoded according to the standards being developed by the TEI, can support network analyses which describe the extent of those elites' political and economic intercourse, and when such software is as accessible as, for example, SPSSX, then the historical community will be won over.

Software developments will also determine the rate and pace at which historians begin to follow the coding procedures being developed herein in preparing their own datasets. No one would doubt the usefulness of having entire corpora of elites' diaries in machine readable form, especially if analytical software could sustain the desired analyses of them. None the less, the advantages offered by computer access to the whole fruit which any source or set of sources offers will not outweigh the disadvantages of costly and time-consuming preparation.
So long as it is easier, cheaper and less time-consuming to use tried and tested methods of combing sources for discrete pieces of information deemed relevant to an analysis, such methods will not be readily supplanted, especially when no analytical gain is envisaged (note that the author is by no means trying to underplay the problems involved in selective treatment of textual sources, nor to ignore the array of problems associated with the use of highly structured DBMS for historical research). Again, software developments which await us in the future will likely determine the extent to which the historical community will embrace the standards being developed by this initiative.

From this brief discussion, the following lessons may, I think, be drawn. Firstly, the coding procedures being developed by the Initiative must be able to sustain the analytical treatment which historians will often want to give their machine-readable sources. Secondly, some thought must be given to the preparation of standard forms or style sheets which might be invoked while encoding a text, for example, when preparing its source definition. Such style sheets or forms could, for example, provide a checklist of tags which are required when describing (with bibliographic references) a source of the kind being encoded, and an additional checklist of tags which are considered desirable. Such style sheets would not, of course, be constraining. They could be edited and amended as necessary, but would facilitate encoding from the point of view of the user. Similarly, style sheets or forms could guide encoding declarations by identifying those text features that are normally associated with, and frequently tagged when marking up, documents similar to the document being encoded by the user.
Here too, such forms would serve more as guidance, helping the user think perhaps more rigorously about the source being encoded and the uses to which the encoded sources might be put by others if not by that particular user.

II. Characters and Character set definitions.

It is sufficient to say at this point that the TEI has appointed a working group to consider characters and character-set definitions. This working group will prepare and test writing system declarations for the languages of the European Community in their modern forms, modern Slavic, Hebrew, and Arabic. It will also draw up inventories and propose character sets or sub-sets for historical linguistic and philological work in the languages of the European Community, older Slavic, Hebrew and Arabic. Clearly, some work would need to be done as well on Greek (ancient and modern) if Greece is not considered within this view of the European Community. Those who have a professional interest in character sets not being addressed by this group and/or who would like to contribute to its work are urged to get in touch with Harry Gaylord (XXX supply e-mail).

Two points are, however, worth raising. Firstly, historians will essentially be interested in the character sets associated with all known languages past and present. Secondly, the TEI has been careful to provide for extension of the guidelines in many vital areas, including that of character set definition. It asks only that definitions for additional character sets pay attention to the draft guidelines and that character set definitions be submitted to the Initiative for consideration. Consequently, there should be few problems for those historians whose texts depend on character sets not being considered in this first round of development.

III. Bibliographic Control, encoding declarations and version control

The problems here are manifold if only because encoding declarations rely on the authors (or are they creators?)
of machine-readable texts identifying very precisely what text features were and were not marked up or identified when encoding. For example, when encoding Anne Frank's Diary, one might indicate a whole variety of features including new entries, new pages, new lines, new paragraphs, etc, but not where the original text was underlined or appeared in a bolder hand. Clearly in this example, any text in the original manuscript which was underlined or written in a "heavy hand", perhaps for emphasis, will be lost to a secondary user. If the initial encoder specifies what was not encoded (the following text features were not encoded - underlining and bold type... etc), secondary users will at least know the limitations of the machine-readable source.

The problem of course is that the initial encoder may not recognize or appreciate the importance of distinctive text features which may be of the greatest importance to other users. A postgraduate who creates a machine-readable version of a series of medieval wills may not be well enough trained, for example, to identify where subtle changes in penmanship might indicate a document's multiple authorship. In this case, the possible occurrence of multiple authorship would not be available to subsequent (secondary) users because it would not be indicated by the original encoder. In short, any machine-readable version of an historical source is itself a source, betraying as much about the research interests and skill of its author as the original manuscript or published document from which the machine-readable text is derived. Where the original encoder is able to specify which text features were not encoded, at least the secondary user can be made aware of their occurrence and then decide whether it is necessary to return to the original manuscript or published source.
It is, however, highly unlikely that the original encoder can ever be sufficiently aware of all of the text features in any document which might hold some interest for an investigator in the future.

A similar problem exists wherever interpretation or meaning is identified in a machine readable text. Here, the encoder's assumptions about the meaning of particular text features or text strings are imposed on the machine-readable source, though not necessarily in an irreversible way. If such texts are to be of any use to secondary analysts, they must include definitive documentation (a codebook in database terminology) about how the assumptions were made. In some cases, it may be necessary to include as part of the machine readable document a section indicating coding schemes. This might prove useful, for example, in a machine readable version of the Dictionary of National Biography or any other collective biography where text has not only been marked up to indicate that it refers to occupation but has also been coded for analytical purposes according, let us say, to the Registrar General's scheme for coding occupations. In such a case, the machine-readable version of the text would ideally contain a section indicating what coding scheme(s) was (were) used (Registrar General's etc). It should also ideally contain a straightforward codebook or index providing, for each occupational code used in the scheme, the full list of occupations found in the text to which that code was applied. The secondary analyst would thus have instant access to information about the coding scheme and to the assumptions that were made when applying it to individual text strings indicating, in this case, jobs. The example of a collective biography would likely contain several indices, one for each category of information that was identified and coded for analysis in the text.
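The codebook or index described above can be illustrated with a minimal Python sketch. It assumes the encoder has already marked up occupation strings and assigned each a code; the strings, the code labels and the function name `build_codebook` are all invented for illustration and are not taken from the Registrar General's scheme or from any TEI proposal.

```python
from collections import defaultdict

# Hypothetical tagged occurrences: (text string as found in the source, code assigned).
# Both the strings and the code labels are invented for illustration.
tagged_occupations = [
    ("barrister", "I"),
    ("physician", "I"),
    ("schoolmaster", "II"),
    ("barrister", "I"),        # a repeated occurrence is listed only once
    ("dock labourer", "V"),
]

def build_codebook(occurrences):
    """Return {code: sorted list of the distinct text strings given that code}."""
    codebook = defaultdict(set)
    for text, code in occurrences:
        codebook[code].add(text)
    return {code: sorted(strings) for code, strings in sorted(codebook.items())}

codebook = build_codebook(tagged_occupations)
for code, strings in codebook.items():
    print(code, ":", ", ".join(strings))
```

A secondary analyst reading such an index can see at a glance which text strings the original encoder chose to place under each code, and so can judge the assumptions behind the coding without re-reading the whole source.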
Other examples may be found in the codebooks which accompany datasets stored at data archives of interest to historians (eg ICPSR, ESRC).

Although highly or regularly structured texts such as collective biographies provide an obvious example of where codebooks or indices should ideally be provided, one could apply the same concept to other, less structured running texts. Take for example diaries or autobiographies which might be used to identify inter-connections between political elites. Here the encoder might wish to mark up "significant events" such as meetings between people, or attendance at club dinners. Here, too, it might prove useful not only to mark up the occurrence in the text of a significant event, but to provide interpretive information for analytical purposes. For example, one might want to code the events themselves by event type, eg dinner engagements, business meetings, religious gatherings. Where second persons (noted in the text as also attending the event), or third persons (those not actually at the event but either mentioned in the text in connection with the event, "At dinner we talked about Abraham Lincoln", or who might have had some causal relationship to the event happening, "I went to Abraham Lincoln's funeral"), are mentioned in the text, it might prove desirable to mark up their occurrence in the text and to encode it according to some scheme (second person = business associate, third person = public figure not known personally to the author). As in the case of the collective biography, a section of the machine-readable document would usefully be given over to identifying the kinds of interpretive information that were singled out for identification in the text and how they were categorized.
In this example, such a section would indicate that events are any occasion where the hero meets with other individuals, and that event types are categorized as follows (eg business meeting, social occasion, intimate occasion etc); an "also present" category simply gives the names of individuals with whom our hero met at various event types; identifications indicate the nature of the relationship between our hero and individuals with whom he/she met at various events. Here, it might be necessary to cross reference to other sections of the machine readable document from which one can infer the identification of the individual (more about this below). Such information would be available in the mark up of the machine readable text, but should be condensed and presented as well in indices where interpretive strategies and assumptions might be instantly apparent to the secondary analyst.

IV. Front matter (source description)

So far the discussion of "front matter" in the TEI draft proposals is limited too closely to what is appropriate to a book. In general, the historian will want or need to look at a far greater range of sources which require different bibliographic conventions. References to parliamentary papers are structured differently than those made to articles in learned journals, books, manuscripts etc. The mark-up used to define the source must therefore be flexible and comprehensive. Another problem, of course, is that bibliographic conventions differ from library to library, from country to country, and indeed from one historian to another. None the less, it might prove worthwhile identifying commonly used document types and specifying or recommending standard mechanisms of bibliographic reference to them. This section should be considered thoroughly by librarians, who will have a better indication of the full range of document types and the bibliographic references they require.
A very partial list of sources requiring different bibliographic formats is provided below:

Book (author(s), title, place and date of publication, publisher, number of pages including index, ISBN or other identifying number, place where book might be found (eg library - especially important for rare volumes), shelf mark at place where book might be found)

Contribution to a collection of essays by different authors (as for book except there would need to be an independent tag for editor(s))

Article in a periodical or journal (author, title, name of journal, volume, number, date of publication, page reference to article, place where journal might be found especially if rare, shelf mark at place where journal/periodical might be found)

Newspaper article (as for article in a periodical except it would require an additional tag to identify news services (eg Reuters) where articles are anonymous)

Unpublished diaries are frequently without titles and dates of publication. These should be supplied by the encoder (the diary of Josephine Bloggs, 1 May 1812 - 13 Dec 1854).

Unbound collections of poetry, essays, sermons and lectures should be treated as articles in a journal, only making reference to the folio etc in which the collection is contained in a way distinct from the reference to a published journal. Here too, individual items may be without a title and may need to be referred to by the first line. Tags for dates will need to distinguish the date of authorship from the date(s) of delivery (in the case of a sermon or lecture).

Manuscript sources generally would need tags identifying the archive where the document(s) is to be found, a general reference to the shelf, box, folio or other collection of documents in which the item might be found, and perhaps a more specific reference to folder or item number within box, folio, etc.
Author tags would need to take account of "fuzzy" authors ("attributed to"); dates would need to take account of fuzzy dates (eg c.1844, 1650-1701). A title tag will be necessary for those items which are without a title and which are referred to by their first line ("Begins. The Delegacy for Lodgings was established in 18......").

Letters, telegrams, memoranda and the like will require a tag to identify their author(s), intended recipients and their respective addresses, as well as the dates when the letter etc was sent and perhaps also when it was received.

Manuscript maps, graphs, charts, building plans and other drawings will also require appropriate tags indicating the form that the document takes and, where titles aren't readily apparent, a brief description made up by the encoder to describe the document (Map of Savannah, Georgia, c.1850).

The minutes of official bodies (including legislative and judicial bodies, transactions of professional, religious, and purely social associations) will often be preserved in volumes and may be identified in a manner similar to that used for journal articles. A tag will be necessary to identify the name of the body, which may be distinct from the name of the journal in which minutes and proceedings are kept. Documents contained within any given volume will normally be set out chronologically by session and so will require a date tag to indicate the session (March 13, 1891). This will normally be distinct from the date attributed to the volume (January 1, 1891-April 30, 1891). It will also prove important to distinguish between the different kinds of documents normally included in such volumes. These will include:

A record of the proceedings, eg the actual minutes. Here tags may be desirable to identify those in attendance at a given session and those absent, perhaps with reference to their status within the body (chairman, treasurer etc).
Legislation (acts, ordinances) being considered by the body at a particular session may also be treated as journal articles but will also include tags to indicate the authors of the legislation (who may only be identified by faction/organization or committee name), and tags for the authors of its various amendments, including titles of those amendments (eg Section IX. para 4), and perhaps even for signatories (eg with the US Constitution, Declaration of Independence). Where judicial decisions and legislation have dissenting opinions appended, they too will require author tags and reference to a title or section of legislation in which the dissenting opinion is to be found. Since legislation will often be considered, sent back to committee for revision, and then reconsidered (and printed in the volume of minutes each time it is brought before the body), some attempt must also be made to indicate which draft is being referred to, and, ideally, the ultimate fate of the legislation (passed by division 100-9 on a given date, or tabled indefinitely on...).

The minutes of official bodies will also contain supporting materials considered by each session. These may be in the form of letters, testimonials, and petitions (to be treated as manuscript letters or as journal articles as appropriate and related to session, and journal of official proceedings, and with a sufficient number of authorship tags to indicate not only authors but signatories, factions, committees, interest groups etc). Supporting materials may also come in the form of reports, graphs, tables of data, or pieces of legislation passed by other equivalent but different bodies.
Religious and political broadsheets, caricatures, cartoons, and tracts may also require a description made up by the encoder, as a title may not always be available (Picture of Theodore Roosevelt wearing slippers, smoking a cigar and carrying a big stick), and a description of the context in which the item was produced (Progressive party's campaign material, Presidential election of 1912).

When a machine-readable dataset is the source being encoded, there are additional problems in identifying the sources from which the database has been compiled. Bibliographic references, footnotes, and marginal commentary comprise at least in part the normal mechanisms for indicating where and how a text is composed from information gleaned from other sources. They provide signposts to documents, concepts, images etc which are external to the text but which have been integral in its composition. The same conventions do not exist for databases made up of information taken from several sources. Two possibilities are envisaged for identifying those external sources. A list of sources would provide the most general guidance. Preferably, it would give some more precise information as to which of the data in a database were gathered from any particular source (Dictionary of National Biography provided biographical information on people X, Y, Z,...n, minutes of the Court of Quarter Sessions for the County of Westmoreland, 1650-7 provided information on the following judicial cases etc). The database documented to the fullest extent would, of course, indicate the source for each datum, though this may of course be impracticable. In general, the TEI might consider the various efforts to devise standard data descriptions which have been undertaken since the mid-1970s.

The above discussion also seems to have ramifications for the section on body matter, which is so far also too limited to consideration of literary texts which are to some extent hierarchically arranged.
Manuscript collections, periodicals and transactions or proceedings of various official (or even non-official) bodies may simply be sequentially arranged, perhaps chronologically (as in the case of transactions/proceedings) or according to no apparent scheme whatsoever (folio collection of sermons). See the discussion below under VI. Features of Specific Text Types.

V. Features common to many text types.

The guidelines already identify and treat in considerable and sufficient detail a number of features (including "non-structural features", viz paragraphs and their contents, highlighting and related features, quotations and related features, foreign words and expressions, terms, cited words, and glosses, names, abbreviations, lists, notes, index entries, numbers and dates, other crystals; editorial comment; bibliographic citations and references; referencing systems; links and cross references; segmentation of prose and treatment of ambiguous punctuation; formulas, tables and figures; critical apparatus; typographical rendition and other appearance features). The following merely elucidates what the author envisages as problem areas.

A. Names.

There are a number of problems here. Firstly, names may identify entities (in the sense used in database construction) in which the analyst is interested. In such cases, it will often be desirable to ascribe to the entity characteristics (attributes) mentioned elsewhere in the text, so that a comprehensive and systematic characterization of any given entity may be had from the encoded text. Proper names may also refer to attributes which need to be connected somehow to entities which exist elsewhere in the text (perhaps nearby) and perhaps in the form of other proper names. The fact that John lives in New York may not be derived from text contained within any one sentence or even paragraph.
Proper names may be entities about which it is desirable to build up a comprehensive picture from the text through association with a variety of different attributes. At the same time, the same proper name may also be an attribute. New York is an attribute of John, who lives there, and may also be an entity about which it is desired to collect systematically characteristic information that exists elsewhere in the same text (is a grimy city with flowery insipid suburbs).

The same proper name may be used in a text to identify more than one entity. Rochester, Minnesota and Rochester, New York, might be referred to in the same text simply as Rochester. Identifying the occurrence simply as a place has very limited analytical value. The problem is even more likely to occur with people's names than with names of places.

Identifying the characteristics which relate to proper names is terribly important for another reason as well. The classic example comes from Thaller, who states that the actual geographical location of a particular place may only be made clear with reference to additional information, normally the date with which the place name is associated. The USA will describe very different geographical locations (or geographical locations of different dimensions) in 1789, 1804, 1849 and 1990. The same may be said of any proper name. In a text identifying several John Smiths (for example a manuscript census, city directory, collective biography, or telephone book), the person John Smith is insufficiently identified as being a John Smith who is distinct from other John Smiths until the name is associated with other extant information (date of birth, address, telephone number etc). The same may be said of proper names assigned to events. Carnival in New Orleans and Carnival in Basel are radically different events, while Pope's Day in the American colonies and Guy Fawkes Day in England, though referred to by different proper names, are in fact one and the same thing.
Similarly, a range of Democratic parties exists across the world, and each can only truly be interpreted when associated with place. Even within one country, the designation could have radically different meanings (eg George Wallace and John Kennedy were both Democrats but of very different political persuasion. The reference may only be made clear with reference to the geographical location of the party organization, eg Alabama and Massachusetts). It is also perhaps worth pointing out that proper names are not strictly limited to people and places. They may also be used for events (The Great Depression), holidays (the 4th of July, Christmas), associations (Women's Christian Temperance Union), religious and political bodies (Church of England, Socialist Workers' Party).

B. Numbers and dates

Numbers and dates present all sorts of problems, again vis-a-vis their interpretation. Thaller is again the standard reference point. A number can only have a given value when fixed in time and place. 100 US dollars in 1930 has an entirely different value than 100 dollars in 1841. In the latter case, the dollars themselves would need to be specified by place of issue, as notes issued from several independent banks would frequently be exchanged at a discount. Provisions for fuzzy dates must also include the uncertainty introduced (consciously or otherwise) by a text's original author. The autobiographical account which reads: "I left my home in 1950" may provide no greater certainty than that which reads: "It was around 1950, I think, when I left my home". Bibliographic citations would need to be extended considerably to take account of the bibliographic entries cited above.

C. Links and cross-references.

For analytical purposes, the historian will be concerned with this particular section as it seems to provide the means by which marked up texts may be made accessible to the kinds of analyses most likely to be conducted by the historical community. The following points should be mentioned.
Firstly, cross-referencing should allow for relational as well as hierarchical linkages. Secondly, categorizing information for analytical purposes will often take place on the basis of several cross-references between entities and attributes. Consequently, tags should be available which impart interpretive assumptions to a series of cross-references. In the autobiography referred to above, for example, it might be possible to build up social profiles of the author and a number of his associates on the basis of information scattered throughout the text. Cross-references connecting biographical characteristics to associate A1 would necessarily underlie any determination that A1 was a banker, was born in Philadelphia, held directorships on the X, Y, and Z philanthropic associations etc.
D. Tables, formulas and figures. These are terribly important to the historical community as well, as they are likely to crop up in a variety of sources. Information is presented in tabular form in sources as diverse as censuses (both published and manuscript), parish registers, newspapers (electoral, financial, social, sporting and other kinds of useful numeric data), scientific publications of historical interest, and official documents (committee reports, proceedings, etc). It is vital to the historical community that the values stored in the individual cells of any table (whether they are numeric or non-numeric) are accessible according to their position in the table. In other words, the row and column structure of the table needs to be preserved, as does the value in each cell. In addition, column descriptions, table titles, and any marginal comments, footnotes, or cross-references will need to be preserved. With a table marked up in such a way, the historian will be free to rework data that are presented in tables, and to combine them with data from other tables within the same text or from other texts.
A major problem is that tables are rarely presented in a text in the normalized way most suitable to analysis along the lines of the entity-attribute data model. Data stored as attributes in table columns may not adhere all that closely to strict rules of data-type definition. A value in any given cell, for example, might be stored alongside some qualifying remark or footnote reference which would need to be tagged separately and overlooked or corrected for in subsequent reworking of the tabular data. Similarly, table columns may include several values in each cell, listed one below another as follows:

Presidential electoral returns by party 1900-1920
(a = Democrat, b = Republican, c = other parties)

year    returns
1900    (a) 100,988
        (b) 150,765
        (c) 20,113
1904    (a) 123,456
        (b) 122,443
        (c) 8,765
...

Clearly, the normal and most desirable table structure would be as follows:

Presidential electoral returns by party 1900-1920

year    Democrat    Republican    Other
1900    100,988     150,765       20,113
1904    123,456     122,443       8,765
...

Ideally, the data presented in the text would be tagged so that they could yield the normal table structure.
Information which is eminently suited to tabular analysis and display is also often embedded in prose. Take for example an enumeration of electoral results which might be found in any newspaper: "In the first Ward, Republicans turned out in great number to return their favoured son, IM Boss. Boss received 100,000 votes over his Democratic opponent E Manual Labor. Labor said afterwards he wished he was a Republican. In the second ward..." Here too, SGML tags which would allow these data to be recombined into some normal table structure would be ideal.
With graphs, one would ideally like to have access to the data values underlying the graphical image (or estimates thereof), which could be presented in normal tabular form for subsequent analysis. Otherwise, the image itself is of the utmost importance.
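The reworking of the stacked electoral table into its normal form, as described above, might be sketched as follows. This is an illustration in Python under stated assumptions, not a feature of the draft guidelines: it assumes the encoder has captured each stacked cell as a text string and has recorded the table's legend, and the names `stacked`, `legend` and `normalize` are hypothetical.

```python
import re

# Stacked cells as printed: one cell per year holding several lettered returns.
stacked = {
    1900: "(a) 100,988 (b) 150,765 (c) 20,113",
    1904: "(a) 123,456 (b) 122,443 (c) 8,765",
}

# The legend printed with the table title.
legend = {"a": "Democrat", "b": "Republican", "c": "Other"}


def normalize(stacked_rows, legend):
    """Return one flat row per year with a named column for each party."""
    rows = []
    for year, cell in sorted(stacked_rows.items()):
        row = {"year": year}
        # Each return is a bracketed letter followed by a figure with commas.
        for key, figure in re.findall(r"\((\w)\)\s*([\d,]+)", cell):
            row[legend[key]] = int(figure.replace(",", ""))
        rows.append(row)
    return rows


table = normalize(stacked, legend)
for row in table:
    print(row)
```

The sketch works only because the legend and the position of each figure within its cell were preserved at encoding time, which is the substance of the requirement made above.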
For graphs, care will consequently need to be taken to preserve the scale of the axes, the precise dimensions of the graphical image itself, graph and axis titles, and the legends associated with graphed data ranges.
VI. Chapter 7 Features of specific text types:
Again, the draft guidelines address several specific text types including language and other corpora and collections; verse, drama and narrative; dictionaries; and office documents. This section will elucidate problems in some of these areas and attempt to define other text types with specific reference to the sources of modern political history. Censuses and census-like documents have been left to another member of the history working group and are not included herein.
A. Corpora and collections. Newspapers, periodicals and anthologies would, it seems to me, be covered by the guidelines for corpora as developed so far. The list, however, may be extended to include the transactions and proceedings of various societies (as discussed above under Minutes and proceedings of official organizations) and collections of manuscript sources (eg the letters of Franklin Delano Roosevelt, or the speeches of Abraham Lincoln). One needs to emphasize, however, that in the process of historical research one does not immerse oneself in corpora (whether they be all of the numbers of Oxford Magazine from 1914-70, or the speeches of Abraham Lincoln chronologically arranged from 1856-64) simply to collect facts. Instead, the corpora will often enhance what the national school curriculum for history refers to as empathy with the period or the subject matter. It is not infrequent, therefore, that one finds in the historian's notes entirely subjective comments such as: "the science area rarely intrudes into Oxford Magazine in this period. The journal is clearly written for and read by members of the university who are well versed in classics, theology and law".
Clearly, tags exist to enable the encoder to include such observations as variant readings or as pieces of interpretive information in the text. In order for the secondary user to have at least something akin to the experience of handling the original document itself, the text must be encoded so that both its structure (layout, etc) and its content are preserved. Some general discussion follows about the various anthologies that might be incorporated under this section.
A.1. Newspapers (including journals of professional societies, trade associations, social clubs, transactions of learned bodies etc). As with many corpora, the historian will attempt to read or at least sample from a large number of newspaper issues, say, for example, by reading every issue of a weekly newspaper for the period 1800-54, or by looking at articles pertaining to a particular subject over an equally long period. The two approaches are used for different purposes. Take for example a community study which concentrates on frontier towns in the period 1820-60, in this case on Rochester, New York. The investigator would in this case want to "read through" the relevant newspaper(s) simply to get a feel for the period, or even to discover particular subject areas which seem worthy of further research (for example, it might become apparent from reading through the newspapers for the period that revivals in the 1830s seemed to have a high public profile, at least insofar as this may be measured in the popular press; consequently, further research into religion on the frontier would be indicated).
To sustain this approach, a machine-readable version of a newspaper would need to be very explicit in tagging the structure of the text itself: its layout; the placement and dimensions of pictures, advertisements, drawings and other images; the use of bold type and of different fonts both in headlines and in the text of articles; the width, length and number of columns used for any one article; and an article's layout on one or several pages. Only then might it be possible for the researcher to realize, for example, that items pertaining to women, where they exist at all, are given only the slightest treatment toward the back of the newspaper, and always under the most insignificant headlines. Similarly, one might recognize the intrusion into the advertisements of specialist shops replacing those for general goods stores.
In the second case mentioned, one turns to a newspaper with a specific goal: to look at election or political reporting in the period X to Y, or at articles about municipal works and services. Here, the structure of the newspaper may prove a distraction. More essential would be quick access to articles which bear directly or indirectly on the subject under investigation. The investigator of politics on the frontier, for example, might want to read in their entirety the newspapers published in the several weeks before and the week immediately after particular dates when elections were known to have taken place. The analyst of municipal government would prefer that individual articles be accessible by their title, or by subject keywords included when encoding information about the various articles.
In both cases, it would be necessary to describe the size, paper quality and, in some cases, colour of each number in a series. These characteristics frequently change with editorship, ownership, and/or publisher, and may be significant. Moreover, one wants to preserve the possibility of more subjective characterization: "Under the editorship of Joe Bloggs, the paper seems to have gone more down-market, clearly targeting lower-income groups in its advertising, and less-informed readers through the elimination of sophisticated coverage of political, cultural and social events and the inclusion of tawdry human interest stories". Here, too, the encoder is introducing a perception, but one which might prove useful and which could indeed be checked by secondary analysts through the machine-readable text.
A2. The transactions of official bodies (including legislative records) should also be considered as corpora, only paying close attention to the source definition outlined above. The text will normally be arranged hierarchically. Each volume will comprise the transactions of the society between two specified dates. At the next level, the particular sessions or meetings of the body will be recorded chronologically between the two specified dates. Each session therefore forms the second level of the hierarchically arranged text. Each meeting will require tags for its date and the members present. At the third level of the hierarchy will be the minutes of the particular meeting or session, followed by the accompanying documents, reports, testimonials, petitions, legislation etc considered at that meeting. The supporting material will require headings, as indicated above, associating it with a particular session and with particular authors. It is only at this level that keywords indicating the subject matter will be of any use. Occasionally, transactions are laid out in two sections: the first presents the minutes of each meeting chronologically arranged within the period covered by the volume; the second presents the supporting documents considered at each meeting, also chronologically arranged. Indeed I have seen such minutes where the pagination of the minutes is in arabic and that of the supporting documents in roman numerals.
Should this format exist in the text it is worth preserving, since it is frequently only the supporting materials, and not the minutes of particular sessions, which will illuminate particular historical problems.
A3. Collections of unpublished, unbound manuscripts stored in boxes, in folio volumes or in folders present something of a problem where the individual items are only loosely related. The "letters of IM Famous" preserved in folio volumes clearly comprise a corpus and should be treated like those already discussed in the text. The "Box of material relating to political reform movements in Philadelphia, 1800-1900", however, may contain a wide range of different materials (letters, broadsheets, political tracts etc), some of it only vaguely related to the umbrella description provided for the corpus by the cataloguer. Here, it is perhaps best to treat items individually and to indicate as their location the manuscript collection of which they form a part.
B. The discussion above of source description for letters and telegrams should provide sufficient guidance for text features.
C. Broadsheets and pamphlet literature. These will need to be treated as newspapers and journals, with express reference made to the quality and appearance of the particular item as well as to its structure, layout, font size, placement of imagery etc. Only thus will the secondary analyst be able to remark that the "Methodists seemed more capable than the Baptists of producing high-quality and up-market evangelical material in the 1830s, indicating perhaps that they were recruiting 'better' clientele".
D. Diaries. The source description should indicate something about the diarist, filling in biographical details (at least vital statistics) where these are known. The overall quality and appearance of the diary will also provide useful additional information. The important sub-headings will be chronological, normally by day but occasionally by hour for the more assiduous diarist.
The text composed at any one time (noon on the 12th of July 1823) should be treated as a chapter of a book, with the date (and time) of entry indicated in a manner similar to that used for chapter titles. Subject keywords describing the contents of any one particular entry would also be useful at this level of documentation, as would brief descriptions indicating the apparent mood of the author ("seemed rather frenetic after his divorce of such and such a date" (cross-reference to relevant diary entry)).
E. The log books and diaries that businessmen kept to record transactions etc should also be treated as diaries. Within the daily entry there may, however, be subsections which should be treated and tagged as such. These might include, for example, the author's rather more ponderous prosaic musings as to the state of his/her affairs, itemized transactions of the day's business, statements of holdings in various accounts, stock lists etc. information indicating daily business, as distinct from
F. Collective biographies, encyclopaedias, reference books. These should be treated rather more as dictionaries than as formal narrative. As with dictionaries, the reference works mentioned above are composed of entries. Moreover, they use typographic conventions to denote structure. Collective biographies will use all of the source description conventions appropriate to a book. They will also require tags which define the group of individuals for whom information is collected, the criteria used for their inclusion (or, perhaps more interesting, the criteria used for the exclusion of others), who or what organization drew up those criteria, the manner in which information was collected, and an estimate of what proportion of the total population meeting the criteria is included.
Though collective biographies differ as to the details that they provide, tags for the following categories of information should be included (these categories are often indicated in the text with a variety of typographic conventions and/or abbreviated keywords): date of birth; place of birth; father's occupation; father's name, title etc; mother's name; mother's father's name; confessional status; school(s) attended; school societies or sporting teams; scholarships and prizes won; university(ies) attended; degree(s) obtained; university societies and/or sporting teams; professional/vocational qualifications; paid employment (with dates); unpaid employment (eg directorships of various companies, etc); public service (membership of Royal Commissions, unpaid political posts held etc); club and society membership; spouse's name, title, etc; spouse's father's name; children's names and sex (or number of children by sex, as children are sometimes mentioned only as 1 daughter, 2 sons); publications; honours; titles. In each case, the entry heading should consist of the individual's name and title. Collective biographies are often laid out in sections, for example by year or span of years, and alphabetically within years. Such works will normally include indexes in which names are arranged alphabetically and appropriate page references given. These need to be retained if the collective biography is to remain a useful source.
City directories. These might be treated as collective biographies with exactly the same headers (city directories are to this day selective and not inclusive). Each entry needs to be identified by names and titles, and tags are required for home address, work address, occupation, title of business, and telephone number. Where an entry refers to a business, the entry is identified by the business name or title and should provide tags for the names of the business's chairman, vice-chairman, president, vice-president, treasurer, and controller.
One would also expect to use extensive cross-referencing to link, for example, the entry of the individual who gives "The Steel Corp" as his place of work with the entry for "The Steel Corp" itself. As with newspapers, careful attention will have to be paid to documenting the structure of business advertisements, as these are often an indication of the relative size and importance of a given company.
G. City directories will often have appendices giving the officers/members of the most important public, financial, commercial, and industrial institutions of the locality covered by the directory. These need to be identified as subsections or even appendices. Each sub-section will have a title which refers to a civic institution (the city council, the Bank of the Northern Liberties, the Public Library, the Methodist Temperance Union), and often a brief description of the institution, for which the following tags will be necessary: date of foundation, institution's address(es), and capitalization (in the case of banks and insurance companies). The body of each sub-section will be the names of the institution's members/officers. These should be treated as entries in the main body of the city directory, with an additional tag for office (eg treasurer). More recent city directories (which came to be known as telephone directories and tend to lose occupational information for individuals after about 1920 in the US) are often laid out in two sections: one for individuals and one for businesses.
H. Poll books should be treated as city directories. The structure of poll books will vary. Some are laid out alphabetically, others alphabetically by political subdivision or district. Where sub-divisions are used they need to be identified. Entries will be defined by the names and title of the individual specified and will require the same tags as entries in city directories, plus tags for first, second, third, etc choice of candidates.
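By way of illustration only, a poll book entry of the kind just described might be represented as a directory entry extended with an ordered list of the voter's choices. The Python below is a hedged sketch: the class and field names are hypothetical and not drawn from the draft guidelines, the voters and ward are invented, and the candidate names are borrowed from the newspaper example given earlier in these comments.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DirectoryEntry:
    """Tags suggested above for an individual's entry in a city directory."""
    name: str
    title: Optional[str] = None
    home_address: Optional[str] = None
    work_address: Optional[str] = None
    occupation: Optional[str] = None


@dataclass
class PollBookEntry(DirectoryEntry):
    """A directory entry plus the district and the voter's ranked choices."""
    district: Optional[str] = None
    choices: List[str] = field(default_factory=list)  # first, second, third, ...


def first_choice_tally(entries):
    """Count the first choices recorded across a set of poll book entries."""
    tally = {}
    for entry in entries:
        if entry.choices:
            tally[entry.choices[0]] = tally.get(entry.choices[0], 0) + 1
    return tally


# Invented voters, for illustration only.
entries = [
    PollBookEntry("John Smith", occupation="weaver", district="First Ward",
                  choices=["IM Boss", "E Manual Labor"]),
    PollBookEntry("John Smith", occupation="banker", district="First Ward",
                  choices=["E Manual Labor"]),
    PollBookEntry("Jane Doe", district="First Ward", choices=["IM Boss"]),
]

tally = first_choice_tally(entries)
print(tally)
```

Because the poll book entry carries the same tags as a directory entry, the two sources could in principle be linked on name, address and occupation.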
The source description of a poll book must indicate the date of the election and the locality in which it took place. It would also usefully contain the names (and parties) of the contestants, and the result of the vote. Note that city directories and poll books will both have introductions composed principally of running prose, which should be treated like the introduction to a book.
DIG 10 Feb 91

The entity-attribute model merely reflects an oft-used research method by which the characteristics of given objects, individuals, institutions, or events are systematically researched, catalogued, and on occasion categorized, so that the characteristics of individual entities within a defined domain of like entities (an elite, early modern English churches, the working, middle, and upper classes, strikes in France) might be contrasted and compared. The model also provides some means of assessing particular characteristics which might be said to be causal of given events (does ethnicity or social class determine voting behaviour; were men or women more likely to join evangelical religious associations during the Great Awakening). The model may also underpin counterfactual and predictive analyses of given events or historical situations (would American economic development have occurred at the same rate without the railroad; would the American south have been economically better off without slavery).
The importance of the entity-attribute model to computer-aided historical research cannot be overemphasized. Nor can the centrality of hierarchical and relational database technology to computer-aided historical research. This does not mean, of course, that computer-aided historical investigations are limited to highly structured sources such as census manuscripts, city directories, business accounts and parish registers, which may easily be transcribed into normal database tables. Collective biographies of political elites or analyses of mob behaviour in the nineteenth century have relied consistently on sources such as newspaper accounts, diaries, autobiographies, and legislative records, which in fact comprise running prose (one hesitates to use the phrase unstructured too loosely).
The historian can easily extract information from such sources to build up the entity-attribute relationships which are desired, often by preparing index cards or some equivalent for each entity involved in a particular study and filling out the index cards with the various attributes that are required wherever and whenever these are found in the source. Once data have been structured in this fashion, the leap from the index card to the database table is hardly a quantum one. Moreover, it is one which promises results which may be obtained through tried and tested techniques, and techniques which are becoming increasingly accessible to the non-computer-specialist through successive generations of database software.
It is also necessary to point out that computer-aided historical research projects are often launched with one end in view: the production of analytical results in the short term. Should secondary analysis be possible on the basis of the datasets thus created, that is an added bonus of any given research project, one which is more frequently available with the advent of more sophisticated DBMS technology. Rarely does the historian, working on limited research grants, find him- or herself in the position of being able to create machine-readable sources for the sole purpose of having them used by others.
The combination of these two principal features of computer-assisted historical research, its reliance on the entity-attribute relationship and its aim of producing analytical results in the short term, probably indicates that the historical community will not in the near future be the source of machine-readable sources encoded according to whatever standard. Having ventured this bold proposition, it is necessary to make several qualifications immediately. Firstly, it is likely that historians will form an enthusiastic "user-community" of textual sources which are coded according to standard procedures.
No one doubts the value or usefulness of encoding standards, especially as they will in time make available to computer-assisted investigation a wide range of sources hitherto out of bounds, if only due to the cost involved in their use. One must point out, however, that the enthusiasm with which encoded texts will be received by the historical community will be directly proportional to the extent to which the standard coding procedures will sustain analyses based on the entity-attribute relationship briefly described above.
Historians' use of machine-readable sources coded in accordance with the guidelines that are being developed will also be determined by the analytical software that will be available to them. When diaries of political elites which are encoded according to the standards being developed by the TEI can support network analyses which will describe the extent of the elites' political and economic intercourse, and when such software is as accessible as, for example, SPSSX, then the historical community will be won over. Software developments will also determine the rate and pace at which historians begin to follow the coding procedures being developed herein in preparing their own datasets. No one would doubt the usefulness of having the entire corpora of elites' diaries in machine-readable form, especially if analytical software could sustain the desired analyses of them. None the less, the advantages offered by computer access to the whole fruit which any source or set of sources offers will not outweigh the disadvantages of costly and time-consuming preparation.
So long as it is easier, cheaper and less time-consuming to use tried and tested methods of combing sources for discrete pieces of information deemed relevant to an analysis, such methods will not be readily supplanted, especially when no analytical gain is envisaged (note that the author is by no means trying to underplay the problems involved in the selective treatment of textual sources, nor to ignore the array of problems associated with the use of highly structured DBMS for historical research). Again, future software developments will likely determine the extent to which the historical community will embrace the standards being developed by this initiative.
From this brief discussion, the following lessons may, I think, be drawn. Firstly, the coding procedures being developed by the Initiative must be able to sustain the analytical treatment which historians will often want to give their machine-readable sources. Secondly, some thought must be given to the preparation of standard forms or style sheets which might be invoked while encoding a text, for example when preparing its source definition. Such style sheets or forms could, for example, provide a checklist of the tags which need to be provided when describing (with bibliographic references) a source of the kind being encoded, and an additional checklist of tags which are considered desirable. Such style sheets would not, of course, be constraining. They could be edited and amended as necessary, but would facilitate encoding from the point of view of the user. Similarly, style sheets or forms could guide encoding declarations by identifying those text features that are normally associated with, and frequently tagged when marking up, documents similar to the document being encoded by the user.
Here too, such forms would provide more in the way of guidance, and would help the user think perhaps more rigorously about the source being encoded and the uses to which the encoded source might be put by others if not by that particular user.
II. Characters and Character set definitions.
It is sufficient to say at this point that the TEI has appointed a working group to consider characters and character-set definitions. This working group will prepare and test writing system declarations for the languages of the European Community in their modern forms, modern Slavic, Hebrew, and Arabic. It will also draw up an inventory and propose character sets or sub-sets for historical linguistic and philological work in the languages of the European Community, older Slavic, Hebrew and Arabic. Clearly, some work would need to be done as well on Greek (ancient and modern) if Greece is not considered within this view of the European Community. Those who have a professional interest in character sets not being addressed by this group and/or who would like to contribute to its work are urged to get in touch with Harry Gaylord (XXX supply e-mail).
Two points are, however, worth raising. Firstly, historians will essentially be interested in the character sets associated with all known languages past and present. Secondly, the TEI has been careful to provide for extension of the guidelines in many vital areas, including that of character set definition. It asks only that definitions for additional character sets pay attention to the draft guidelines and that character set definitions be submitted to the initiative for consideration. Consequently, there should be few problems for those historians whose texts depend on character sets not being considered in this first round of development.
III. Bibliographic Control, encoding declarations and version control
The problems here are manifold if only because encoding declarations rely on the authors (or are they creators?)
of machine-readable texts identifying very precisely what text features were and were not marked up or identified when encoding. For example, when encoding Anne Frank's Diary, one might indicate a whole variety of features including new entries, new page, new line, new paragraph, etc, but not where the original text was underlined or appeared in a bolder hand. Clearly in this example, any text in the original manuscript which was underlined or written in a "heavy hand", perhaps for emphasis, will be lost to a secondary user. If the initial encoder specifies what was not encoded ("the following text features were not encoded - underlining and bold type... etc"), secondary users will at least know the limitations of the machine-readable source.

The problem of course is that the initial encoder may not recognize or appreciate the importance of distinctive text features which may be of the greatest importance to other users. A postgraduate who creates a machine-readable version of a series of medieval wills may not be well enough trained, for example, to identify where subtle changes in penmanship might indicate a document's multiple authorship. In this case, the possible occurrence of multiple authorship would not be available to subsequent (secondary) users because it would not be indicated by the original encoder. In short, any machine-readable version of an historical source is itself a source, revealing as much about the research interests and skill of its author as about the original manuscript or published document from which the machine-readable text is derived. Where the original encoder is able to specify which text features were not encoded, at least the secondary user can be made aware of their occurrence and then decide whether it is necessary to return to the original manuscript or published source.
It is, however, highly unlikely that the original encoder can ever be sufficiently aware of all of the text features in any document which might hold some interest for an investigator in the future.

A similar problem exists wherever interpretation or meaning is identified in a machine-readable text. Here, the encoder's assumptions about the meaning of particular text features or text strings are imposed on the machine-readable source, though not necessarily in an irreversible way. If such texts are to be of any use to secondary analysts, they must include definitive documentation (a codebook in database terminology) about how the assumptions were made. In some cases, it may be necessary to include as part of the machine-readable document a section indicating coding schemes. This might prove useful, for example, in a machine-readable version of the Dictionary of National Biography or any other collective biography where text has not only been marked up to indicate that it refers to occupation but has also been coded for analytical purposes according, let us say, to the Registrar General's scheme for coding occupations. In such a case, the machine-readable version of the text would ideally contain a section indicating what coding scheme(s) was (were) used (Registrar General's etc). It should also ideally contain a straightforward codebook or index providing, for each occupational code used in the scheme, the full list of occupations found in the text to which that code was applied. The secondary analyst would thus have instant access to information about the coding scheme and to the assumptions that were made when applying it to individual text strings indicating, in this case, jobs. A collective biography would likely contain several such indices, one for each category of information that was identified and coded for analysis in the text.
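The codebook or index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the occupations and the class labels "I", "II" and "III" are hypothetical placeholders, not the Registrar General's actual scheme, and the function name build_codebook is invented for the example. It shows how, for each code, the full list of text strings to which that code was applied might be gathered for the secondary analyst.

```python
from collections import defaultdict

# Hypothetical coded occurrences: (text string as found in the source, code assigned).
# The codes are placeholders and do NOT reproduce any real occupational scheme.
coded_occurrences = [
    ("barrister", "I"),
    ("physician", "I"),
    ("schoolmaster", "II"),
    ("joiner", "III"),
    ("barrister", "I"),  # the same string may be coded many times in one text
]


def build_codebook(occurrences):
    """Return {code: sorted list of the distinct text strings given that code}."""
    book = defaultdict(set)
    for text_string, code in occurrences:
        book[code].add(text_string)
    return {code: sorted(strings) for code, strings in sorted(book.items())}


codebook = build_codebook(coded_occurrences)
for code, strings in codebook.items():
    print(code, ":", ", ".join(strings))
```

An index of this kind exposes the encoder's assumptions at a glance: a secondary analyst who disagrees with the placement of a given occupation can see exactly which text strings would be affected by recoding it.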
Other examples of codebooks may be found, for example, in those which accompany datasets stored at data archives of interest to historians (eg ICPSR, ESRC).

Although highly or regularly structured texts such as collective biographies provide an obvious example of where codebooks or indices should ideally be provided, one could apply the same concept to other, less structured running texts. Take for example diaries or autobiographies which might be used to identify inter-connections between political elites. Here the encoder might wish to mark up "significant events" such as meetings between people, or attendance at club dinners. Here, too, it might prove useful not only to mark up the occurrence in the text of a significant event, but to provide interpretive information for analytical purposes. For example, one might want to code the events themselves by event type, eg dinner engagements, business meetings, religious gatherings. Where second persons (noted in the text as also attending the event) or third persons (those not actually at the event but either mentioned in the text in connection with it, "At dinner we talked about Abraham Lincoln", or who might have had some causal relationship to the event happening, "I went to Abraham Lincoln's funeral") are mentioned in the text, it might prove desirable to mark up their occurrence in the text and to encode it according to some scheme (second person = business associate, third person = public figure not known personally to the author). As in the case of the collective biography, a section of the machine-readable document would usefully be given over to identifying the kinds of interpretive information that were singled out for identification in the text and how they were categorized.
In this example, such a section would indicate that events are any occasion where the hero meets with other individuals, and that event types are categorized as follows (eg business meeting, social occasion, intimate occasion etc); an "also present" category simply gives the names of individuals with whom our hero met at various event types; identifications indicate the nature of the relationship between our hero and the individuals with whom he/she met at various events. Here, it might be necessary to cross reference to other sections of the machine-readable document from which one can infer the identification of the individual (more about this below). Such information would be available in the mark-up of the machine-readable text, but should be condensed and presented as well in indices where interpretive strategies and assumptions might be instantly apparent to the secondary analyst.

IV. Front matter (source description)

So far the discussion of "front matter" in the TEI draft proposals is too narrowly limited to what is appropriate to a book. In general, the historian will want/need to look at a far greater range of sources which require different bibliographic conventions. References to parliamentary papers are structured differently than those made to articles in learned journals, books, manuscripts etc. The mark-up used to define the source must therefore be flexible and comprehensive. Another problem, of course, is that bibliographic conventions differ from library to library, from country to country, and of course, from one historian to another. None the less, it might prove worthwhile identifying commonly used document types and specifying or recommending standard mechanisms of bibliographic reference to them. This section should be considered thoroughly by librarians who will have a better indication of the full range of document types and the bibliographic references they require.
A very partial list of sources requiring different bibliographic formats is provided below:

Book (author(s), title, place and date of publication, publisher, number of pages including index, ISBN or other identifying number, place where book might be found (eg library - especially important for rare volumes), shelf mark at place where book might be found)

Contribution to a collection of essays by different authors (as for book except there would need to be an independent tag for editor(s))

Article in a periodical or journal (author, title, name of journal, volume, number, date of publication, page reference to article, place where journal might be found especially if rare, shelf mark at place where journal/periodical might be found)

Newspaper article (as for article in a periodical except it would require an additional tag to identify news services (eg Reuters) where articles are anonymous)

Unpublished diaries are frequently without titles and dates of publication. These should be supplied by the encoder (the diary of Josephine Bloggs, 1 May 1812 - 13 Dec 1854).

Unbound collections of poetry, essays, sermons and lectures should be treated as articles in a journal, only making reference to the folio etc in which the collection is contained in a way distinct from the reference to a published journal. Here too, individual items may be without a title and may need to be referred to by the first line. Tags for dates will need to distinguish the date of authorship from the date(s) of delivery (in the case of a sermon or lecture).

Manuscript sources generally would need tags identifying the archive where the document(s) is to be found, a general reference to the shelf, box, folio or other collection of documents in which the item might be found and perhaps including a more specific reference to folder or item number within box, folio, etc.
Author tags would need to take account of "fuzzy" authors ("attributed to"); dates would need to take account of fuzzy dates (eg c.1844, 1650-1701). A title tag will be necessary for those items which are without a title and which are referred to by their first line ("Begins. The Delegacy for Lodgings was established in 18......").

Letters, telegrams, memoranda and the like will require tags to identify their author(s), intended recipients and their respective addresses, as well as the dates when the letter etc was sent and perhaps also when it was received.

Manuscript maps, graphs, charts, building plans and other drawings will also require appropriate tags indicating the form that the document takes and, where titles aren't readily apparent, a brief description made up by the encoder to describe the document (Map of Savannah, Georgia, c.1850).

The minutes of official bodies (including legislative and judicial bodies, transactions of professional, religious, and purely social associations) will often be preserved in volumes and may be identified in a manner similar to that used for journal articles. A tag will be necessary to identify the name of the body, which may be distinct from the name of the journal in which minutes and proceedings are kept. Documents contained within any given volume will normally be set out chronologically by session and so will require a date tag to indicate the session (March 13, 1891). This will normally be distinct from the date attributed to the volume (January 1, 1891-April 30, 1891). It will also prove important to distinguish between the different kinds of documents normally included in such volumes. These will include:

A record of the proceedings, ie the actual minutes. Here tags may be desirable to identify those in attendance at a given session and those absent, perhaps with reference to their status within the body (chairman, treasurer etc).
Legislation (acts, ordinances) being considered by the body at a particular session may also be treated as journal articles but will also include tags to indicate the authors of the legislation (who may only be identified by faction/organization or committee name), and tags for the authors of its various amendments including titles of those amendments (eg Section IX. para 4), and perhaps even for signatories (eg with the US Constitution, Declaration of Independence). Where judicial decisions and legislation have dissenting opinions appended, they too will require author tags and reference to a title or section of legislation in which the dissenting opinion is to be found. Since legislation will often be considered, sent back to committee for revision, and then reconsidered (and printed in the volume of minutes each time it is brought before the body), some attempt must also be made to indicate which draft is being referred to, and, ideally, the ultimate fate of the legislation (passed by division 100-9 on a given date, or tabled indefinitely on...).

The minutes of official bodies will also contain supporting materials considered by each session. These may be in the form of letters, testimonials, and petitions (to be treated as manuscript letters or as journal articles as appropriate and related to session, and journal of official proceedings, and with a sufficient number of authorship tags to indicate not only authors but signatories, factions, committees, interest groups etc). Supporting materials may also come in the form of reports, graphs, tables of data, or pieces of legislation passed by other equivalent but different bodies.
Religious and political broadsheets, caricatures, cartoons, and tracts may also require a description made up by the encoder, as a title may not always be available (Picture of Theodore Roosevelt wearing slippers, smoking a cigar and carrying a big stick), and a description of the context in which the item was produced (Progressive party's campaign material, Presidential election of 1912).

When a machine-readable dataset is the source being encoded, there are additional problems in identifying the sources from which the database has been compiled. Bibliographic references, footnotes, and marginal commentary comprise at least in part the normal mechanisms for indicating where and how a text is composed from information gleaned from other sources. They provide signposts to documents, concepts, images etc which are external to the text but which have been integral in its composition. The same conventions do not exist for databases made up of information taken from several sources. Two possibilities are envisaged for identifying those external sources. A list of sources would provide the most general guidance. Preferably, it would give some more precise information as to which of the data in a database were gathered from any particular source (Dictionary of National Biography provided biographical information on people X, Y, Z,...n; minutes of the Court of Quarter Sessions for the County of Westmoreland, 1650-7 provided information on the following judicial cases etc). The database documented to the fullest extent would, of course, indicate the source for each datum, though this may of course be impracticable. In general, the TEI might consider the various efforts to devise standard data descriptions which have been undertaken since the mid-1970s.

The above discussion also seems to have ramifications for the section on body matter, which is so far also too limited to consideration of literary texts which are to some extent hierarchically arranged.
Manuscript collections, periodicals and transactions or proceedings of various official (or even non-official) bodies may simply be sequentially arranged, perhaps chronologically (as in the case of transactions/proceedings) or according to no apparent scheme whatsoever (folio collection of sermons). See the discussion below under VI. Features of Specific Text Types.

V. Features common to many text types.

The guidelines already identify and treat in considerable and sufficient detail a number of features (including "non-structural" features, viz paragraphs and their contents; highlighting and related features; quotations and related features; foreign words and expressions; terms, cited words, and glosses; names; abbreviations; lists; notes; index entries; numbers and dates; other crystals; editorial comment; bibliographic citations and references; referencing systems; links and cross references; segmentation of prose and treatment of ambiguous punctuation; formulas, tables and figures; critical apparatus; typographical rendition and other appearance features). The following merely elucidates what the author envisages as problem areas.

A. Names. There are a number of problems here. Firstly, names may identify entities (in the sense used in database construction) in which the analyst is interested. In such cases, it will often be desirable to ascribe to the entity characteristics (attributes) mentioned elsewhere in the text so that a comprehensive and systematic characterization of any given entity may be had from the encoded text. Proper names may also refer to attributes which need to be connected somehow to entities which exist elsewhere in the text (perhaps nearby) and perhaps in the form of other proper names. The fact that John lives in New York may not be derivable from text contained within any one sentence or even paragraph.
A proper name may be an entity about which it is desirable to build up a comprehensive picture from the text through association with a variety of different attributes. At the same time, the same proper name may also be an attribute. New York is an attribute of John who lives there and may also be an entity about which it is desired to systematically collect characteristic information that exists elsewhere in the same text (is a grimy city with flowery insipid suburbs).

The same proper name may be used in a text to identify more than one entity. Rochester, Minnesota and Rochester, New York, might be referred to in the same text simply as Rochester. Identifying the occurrence simply as a place has very limited analytical value. The problem is even more likely to occur with people's names than with names of places.

Identifying the characteristics which relate to proper names is terribly important for another reason as well. The classic example comes from Thaller, who states that the actual geographical location of a particular place may only be made clear with reference to additional information, normally the date with which the place name is associated. The USA will describe very different geographical locations (or geographical locations of different dimensions) in 1789, 1804, 1849 and 1990. The same may be said of any proper name. In a text identifying several John Smiths (for example a manuscript census, city directory, collective biography, or telephone book), the person John Smith is insufficiently identified as being a John Smith who is distinct from other John Smiths until the name is associated with other extant information (date of birth, address, telephone number etc). The same may be said of proper names assigned to events. Carnival in New Orleans and Carnival in Basel are radically different events, while Pope's Day in the American colonies and Guy Fawkes Day in England, though referred to by different proper names, are in fact one and the same thing.
Similarly, a range of Democratic parties exist across the world and can only truly be interpreted when associated with place. Even within one country, the designation could have radically different meanings (eg George Wallace and John Kennedy were both Democrats but of very different political persuasions; the reference may only be made clear with reference to the geographical location of the party organization, eg Alabama and Massachusetts).

It is also perhaps worth pointing out that proper names are not strictly limited to people and places. They may also be used for events (The Great Depression), holidays (the 4th of July, Christmas), associations (Women's Christian Temperance Union), religious and political bodies (Church of England, Socialist Workers' Party).

B. Numbers and dates present all sorts of problems, again vis a vis their interpretation. Thaller is again the standard reference point. A number can only have a given value when fixed in time and place. 100 US dollars in 1930 has an entirely different value than 100 dollars in 1841. In the latter case, the dollars themselves would need to be specified by place of issue, as notes issued from several independent banks would frequently be exchanged at a discount. Provisions for fuzzy dates must also include the uncertainty introduced (consciously or otherwise) by a text's original author. The autobiographical account which reads: "I left my home in 1950" may provide no greater certainty than that which reads: "It was around 1950, I think, when I left my home". Bibliographic citations would need to be extended considerably to take account of the bibliographic entries cited above.

C. Links and cross-references. For analytical purposes, the historian will be concerned with this particular section as it seems to provide the means by which marked-up texts may be made accessible to the kinds of analyses most likely to be conducted by the historical community. The following points should be mentioned.
Firstly, cross referencing should allow for relational as well as hierarchical linkages. Secondly, categorizing information for analytical purposes will often take place on the basis of several cross references between entities and attributes. Consequently, tags should be available which impart interpretive assumptions to a series of cross references. In the autobiography referred to above, for example, it might be possible to build up social profiles of the author and a number of his associates on the basis of information scattered throughout the text. Cross references connecting biographical characteristics to associate A1 would necessarily underlie any determination that A1 was a banker, was born in Philadelphia, held directorships on the X, Y, and Z philanthropic associations etc.

D. Tables, formulas and figures. These are terribly important to the historical community as well, as they are likely to crop up in a variety of sources. Information is presented in tabular form in sources as diverse as censuses (both published and manuscript), parish registers, newspapers (electoral, financial, social, sporting and other kinds of useful numeric data), scientific publications of historical interest, and official documents (in committee reports and proceedings, etc). It is vital to the historical community that the values stored in individual cells of any table (whether they are numeric or non-numeric) are accessible according to their position in the table. In other words, the row and column structure of the table needs to be preserved, as does the value in each cell. In addition, column descriptions, table titles, and any marginal comments, footnotes, or cross-references will need to be preserved. Marked up in such a way, the historian will be free to rework data that are presented in tables, combine them with data from other tables within the same text, or with data from other texts etc.
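The requirement that cell values remain accessible by their position can be illustrated with a small Python sketch. It is offered only as an assumption about what an encoded table might yield once read into software, not as anything the draft guidelines prescribe: the class EncodedTable and its method names are invented here, and the figures are the illustrative electoral returns used in this section's discussion of table layout.

```python
from dataclasses import dataclass, field


@dataclass
class EncodedTable:
    """Hypothetical in-memory form of a table recovered from a marked-up text."""
    title: str
    columns: list   # column descriptions, in source order
    rows: list      # each row is a list of cell values, in source order
    notes: dict = field(default_factory=dict)  # (row, col) -> footnote or marginal comment

    def cell(self, row, col):
        """Return the value at a zero-based row and column position."""
        return self.rows[row][col]

    def column(self, name):
        """Return every value in the column with the given description."""
        idx = self.columns.index(name)
        return [r[idx] for r in self.rows]


returns = EncodedTable(
    title="Presidential electoral returns by party 1900-1920",
    columns=["year", "Democrat", "Republican", "Other"],
    rows=[
        [1900, 100988, 150765, 20113],
        [1904, 123456, 122443, 8765],
    ],
)

print(returns.cell(1, 2))                # the Republican figure in the 1904 row
print(sum(returns.column("Democrat")))   # reworking the data: Democratic total over both years
```

Keeping footnotes and marginal comments in a separate mapping keyed by cell position is one possible way of preserving them without disturbing the cell values themselves.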
A major problem is that tables are rarely presented in a text in the normalized way most suitable to analysis along the entity-attribute data model. Data stored as attributes in table columns may not adhere all that closely to strict rules of data-type definition. A value in any given cell, for example, might be stored alongside some qualifying remark or footnote reference which would need to be tagged separately and overlooked or corrected for in subsequent re-working of the tabular data. Similarly, table columns may include several data types in each cell, listed one below another as follows:

    Presidential electoral returns by party 1900-1920
    (a = Democrat, b = Republican, c = other parties)

    year    returns
    1900    (a) 100,988
            (b) 150,765
            (c) 20,113
    1904    (a) 123,456
            (b) 122,443
            (c) 8,765
    ...

Clearly, the normal and most desirable table structure would be as follows:

    Presidential electoral returns by party 1900-1920

    year    Democrat    Republican    Other
    1900    100,988     150,765       20,113
    1904    123,456     122,443       8,765
    ...

Ideally, the data presented in the text would be tagged so that it could yield the normal table structure.

Information which is eminently suited to tabular analysis and display is also often embedded in prose when it would be more usable in tabular form. Take for example an enumeration of electoral results which might be found in any newspaper. "In the first Ward, Republicans turned out in great number to return their favoured son, IM Boss. Boss received 100,000 votes over his Democratic opponent E Manual Labor. Labor said afterwards he wished he was a Republican. In the second ward..." Here too, SGML tags which would allow for these data to be recombined into some normal table structure would be ideal.

With graphs, one would ideally like to have access to the data values underlying the graphical image (or estimates thereof) which could be presented in normal tabular form for subsequent analysis. Otherwise, the image itself is of the utmost importance.
Consequently, care will need to be taken to preserve the scale of graphical axes, the precise dimensions of the graphical image itself, graph and axes titles, and legends associated with graphed data ranges.

VI. Chapter 7 Features of specific text types:

Again, the draft guidelines address several specific text types including language and other corpora and collections; verse, drama and narrative; dictionaries; and office documents. This section will elucidate problems in some of these areas and attempt to define other text types with specific reference to the sources of modern political history. Censuses and census-like documents have been left to another member of the history working group and are not included herein.

A. Corpora and collections. Newspapers, periodicals and anthologies would, it seems to me, be covered by the guidelines for corpora as developed so far. The list, however, may be extended to include transactions and proceedings of various societies (as discussed above under minutes of official bodies) and collections of manuscript sources (eg the letters of Franklin Delano Roosevelt, or the Speeches of Abraham Lincoln). One needs to emphasize, however, that in the process of historical research, one does not simply immerse oneself in corpora (whether they be all of the numbers of Oxford Magazine from 1914-70, or the speeches of Abraham Lincoln chronologically arranged from 1856-64) to collect facts. Instead, the corpora will often enhance what the national school curriculum for history refers to as empathy with the period or the subject matter. It is not infrequent, therefore, that one finds in the historian's notes entirely subjective comments such as "the science area rarely intrudes into Oxford magazine in this period. The journal is clearly written for and read by members of the university who are well versed in classics, theology and law".
Clearly, tags exist to enable the encoder to include such observations as variant readings or as pieces of interpretive information in the text. In order for the secondary user to have at least something akin to the experience of handling the original document itself, the text must be encoded so that both its structure (layout, etc) and its content are preserved. Some general discussion follows about various anthologies that might be incorporated under this section.

A.1. Newspapers (including journals of professional societies, trade associations, social clubs, transactions of learned bodies etc). As with many corpora, the historian will attempt to read or at least sample from a large number of newspaper issues, say, for example, by reading every issue of a weekly newspaper for the period 1800-54, or by looking at articles pertaining to a particular subject over an equally long period. The two approaches are used for different purposes. Take for example a community study which concentrates on frontier towns in the period 1820-60, in this case, on Rochester, New York. The investigator would in this case want to "read through" the relevant newspaper(s) simply to get a feel for the period, even to discover particular subject areas which seem worthy of further research (for example, it might become apparent through reading the newspapers for the period that revivals in the 1830s seemed to have a high public profile, at least insofar as this may be measured in the popular press. Consequently, further research into religion on the frontier would be indicated).
To sustain this approach, a machine-readable version of a newspaper would need to be very explicit in tagging the structure of the text itself: its layout, the placement and dimensions of pictures, advertisements, drawings, images, the use of bold type and of different fonts both in headlines and in articles' text, the width, length and number of columns used for any one article, and an article's layout on one or several pages. Only then might it be possible for the researcher to realize, for example, that items pertaining to women, where they exist at all, are only given the slightest treatment toward the back of the newspaper, and always under the most insignificant headlines. Similarly, one might recognize the intrusion into the advertisements of specialist shops replacing those for general goods stores.

In the second case mentioned, one turns to a newspaper with a specific goal: to look at election or political reporting in the period X to Y, or at articles about municipal works and services. Here, the structure of the newspaper may prove a distraction. More essential would be quick access to articles which bear directly or indirectly on the subject under investigation. The investigator of politics on the frontier, for example, might want to read in their entirety the newspapers published in the several weeks before and the week immediately after particular dates when elections were known to have taken place. The analyst of municipal government would prefer that individual articles be accessible by their title, or by subject keywords included when encoding information about various articles.

In both cases, it would be necessary to describe the size, paper quality and in some cases, colour of each number in a series. These characteristics frequently change with editorship, ownership, and/or publisher and may be significant. Moreover, one wants to preserve the possibility of more subjective characterization: "Under the editorship of Joe Bloggs, the paper seems to have gone more down-market, clearly targeting lower-income groups in its advertising, and less-informed readers through the elimination of sophisticated coverage of political, cultural and social events, and the inclusion of tawdry human interest stories". Here, too, the encoder is introducing a perception, but one which might prove useful and which could indeed be checked by secondary analysts through the machine-readable text.

A2. The transactions of official bodies (including legislative records) should also be considered as corpora, only paying close attention to the source definition outlined above. The text will normally be arranged hierarchically. Each volume will comprise the transactions of the society between two specified dates. At the next level, the particular sessions or meetings of the body will be recorded chronologically between the two specified dates. Each session therefore forms the next or second level of the hierarchically arranged text. Each meeting will require tags for its date and members present. At the third level of the hierarchy will be the minutes of the particular meeting or session, followed by accompanying documents, reports, testimonials, petitions, legislation etc considered at that meeting. The supporting material will require headings as indicated above associating it with a particular session, and with particular authors. It is only at this level that keywords indicating the subject matter will be of any use whatsoever.

Occasionally, transactions are laid out in two sections: the first presents the minutes of each meeting chronologically arranged within the period covered by the volume; the second presents the supporting documents considered at each meeting, also chronologically arranged. Indeed I have seen such minutes where the pagination of the minutes is in arabic and that of the supporting documents in roman numerals.
Should this format exist in the text it is worth preserving, since it is frequently only the supporting materials, and not the minutes of particular sessions, which will illuminate particular historical problems.

A3. Collections of unpublished, unbound manuscripts stored in boxes, in folio volumes or in folders present something of a problem where the individual items are only loosely related. The "letters of IM Famous" preserved in folio volumes clearly comprise a corpus and should be treated like those already discussed in the text. The "Box of material relating to political reform movements in Philadelphia, 1800-1900", however, may contain a wide range of different materials, letters, broadsheets, political tracts etc, some of it only vaguely related to the umbrella description provided for the corpus by the cataloguer. Here, it is perhaps best to treat items individually and indicate as their location the manuscript collection of which they make up a part.

B. The discussion above of source description for letters and telegrams should provide sufficient guidance for text features.

C. Broadsheets and pamphlet literature. These will need to be treated as newspapers and journals, with express reference made to the quality and appearance of the particular item as well as to its structure (layout, font size, placement of imagery etc). Only thus will the secondary analyst be able to remark that the "Methodists seemed more capable than the Baptists of producing high-quality and up-market evangelical material in the 1830s, indicating perhaps that they were recruiting 'better' clientele".

D. Diaries. The source description should indicate something about the diarist, filling in biographical details (at least vital statistics) where these are known. The overall quality and appearance of the diary will also provide useful additional information. The important sub-headings will be chronological, normally by day but occasionally by hour for the more assiduous diarist.
The text composed at any one time (noon on the 12th of July 1823, for example) should be treated as a chapter of a book, with the date (and time) of entry indicated in a manner similar to that used for chapter titles. Subject keywords describing the contents of any one particular entry would also be useful at this level of documentation, as would brief descriptions indicating the apparent mood of the author ("seemed rather frenetic after his divorce of such and such a date", with a cross-reference to the relevant diary entry).

E. The log books and diaries that businessmen kept to record transactions etc should also be treated as diaries. Within the daily entry there may, however, be subsections which should be treated and tagged as such. These might include, for example, the author's rather more ponderous prosaic musings as to the state of his/her affairs, itemized transactions of the day's business, statements of holdings in various accounts, stock lists etc. information indicating daily business, as distinct from

F. Collective biographies, encyclopaedias, reference books. These should be treated rather more as dictionaries than as formal narrative. As with dictionaries, the reference works mentioned above are composed of entries. Moreover, they use typographic conventions to denote structure. Collective biographies will use all of the source description conventions appropriate to a book. They will also require tags which define: the group of individuals for whom information is collected; the criteria used for their inclusion (or, perhaps more interesting, the criteria used for the exclusion of others); who or what organization drew up those criteria; the manner in which information was collected; and an estimate of what proportion of the total population meeting those criteria is included.
Though collective biographies differ as to the details that they provide, tags for the following categories of information should be included (these categories are often indicated in the text with a variety of typographic conventions and/or abbreviated keywords): date of birth; place of birth; father's occupation; father's name, title etc; mother's name; mother's father's name; confessional status; school(s) attended; school societies or sporting teams; scholarships and prizes won; university(ies) attended; degree(s) obtained; university societies and/or sporting teams; professional/vocational qualifications; paid employment (with dates); unpaid employment (eg directorships of various companies, etc); public service (membership of Royal Commissions, unpaid political posts held etc); club and society membership; spouse's name, title, etc; spouse's father's name; children's names and sex (or number of children by sex, as sometimes children are only mentioned as 1 daughter, 2 sons); publications; honours; titles. In each case, the entry heading should consist of the individual's name and title. Collective biographies are often laid out in sections, for example by year or span of years, and alphabetically within years. Such works will normally include indexes in which names are arranged alphabetically and appropriate page references given. These need to be retained if the collective biography is to remain a useful source.

City directories. These might be treated as collective biographies with the same headers (city directories are to this day selective and not inclusive). Each entry needs to be identified by names and titles, and tags are required for home address, work address, occupation, title of business and telephone number. Where an entry refers to a business, the entry is identified by the business name or title and should provide tags for the names of the business's chairman, vice-chairman, president, vice-president, treasurer and controller.
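As a purely illustrative sketch, and not a proposal for actual TEI tags, the entry-level information listed above for an individual in a city directory might be held in a record like the one below and written out as simple SGML-style markup. The element names (entry, name, home_address and so on), the to_markup helper and the sample person are all hypothetical, chosen only for this example, and the rendering makes no attempt to escape special characters.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DirectoryEntry:
    """One city directory entry; field names are illustrative, not TEI tags."""
    name: str
    title: Optional[str] = None
    home_address: Optional[str] = None
    work_address: Optional[str] = None
    occupation: Optional[str] = None
    business_title: Optional[str] = None
    telephone: Optional[str] = None


def to_markup(entry: DirectoryEntry) -> str:
    """Render an entry as simple SGML-style markup, skipping absent fields."""
    parts = ["<entry>"]
    for f in fields(entry):
        value = getattr(entry, f.name)
        if value is not None:
            parts.append(f"  <{f.name}>{value}</{f.name}>")
    parts.append("</entry>")
    return "\n".join(parts)


# Invented sample data, for illustration only
example = DirectoryEntry(
    name="Smith, John",
    occupation="clerk",
    business_title="The Steel Corp",
)
print(to_markup(example))
```

Fields that a given directory does not record are simply left out of the output, which reflects the point that such works differ in the details they provide.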
One would also expect to use extensive cross-referencing to link, for example, the individual who puts "The Steel Corp" under his own entry as a place of work, and the entry for "The Steel Corp". As with newspapers, careful attention will have to be paid to documenting the structure of business advertisements, as these are often an indication of the relative size and importance of a given company.

G. City directories will often have appendices giving the officers/members of the most important public, financial, commercial, and industrial institutions of the locality covered by the directory. These need to be identified as subsections or even appendices. Each sub-section will have a title which will refer to a civic institution (the city council, the Bank of the Northern Liberties, the Public Library, the Methodist Temperance Union), and often a brief description of the institution, for which the following tags will be necessary: date of foundation, institution's address(es), capitalization (in the case of banks and insurance companies). The body of the sub-section will be the names of the institution's members/officers. These should be treated as entries in the main body of the city directory, with an additional tag for office (eg treasurer). More recent city directories (which became known as telephone directories and tend to lose occupational information for individuals after about 1920 in the US) are often laid out in two sections: one for individuals and one for businesses.

H. Poll books should be treated as city directories. The structure of poll books will vary. Some are laid out alphabetically, others alphabetically by political subdivision or district. Where sub-divisions are used they need to be identified. Entries will be defined by the names and title of the individual specified and will require the same tags as entries in city directories, plus tags for first, second, third, etc choice of candidates.
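The entry structure just described for poll books can be sketched roughly as follows: a voter record carrying directory-style fields plus an ordered list of choices, grouped by political subdivision where the book records one. The class name, field names, sample voters and candidates are all invented for illustration, and counting first choices is only one of the analyses a secondary analyst might wish to run on such a source.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PollBookEntry:
    """One voter's entry; names here are hypothetical, not TEI tags."""
    name: str
    occupation: Optional[str] = None
    district: Optional[str] = None  # political subdivision, if the book uses them
    choices: List[str] = field(default_factory=list)  # first, second, third ... choice


def first_choices_by_district(entries: List[PollBookEntry]) -> Dict[str, Counter]:
    """Count first-choice votes per candidate within each district."""
    tally: Dict[str, Counter] = defaultdict(Counter)
    for e in entries:
        if e.choices:  # skip entries that record no vote
            tally[e.district or "(undivided)"][e.choices[0]] += 1
    return dict(tally)


# Invented sample data, for illustration only
book = [
    PollBookEntry("Adams, T.", "weaver", "North Ward", ["Candidate A", "Candidate B"]),
    PollBookEntry("Brown, J.", "grocer", "North Ward", ["Candidate B"]),
    PollBookEntry("Clark, W.", "mason", "South Ward", ["Candidate A"]),
]
result = first_choices_by_district(book)
```

Keeping the choices as an ordered list, rather than as separate first, second and third fields, allows for poll books that record differing numbers of choices per voter.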
The source description of a poll book must indicate the date of the election and the locality in which it took place. It would also usefully contain the names (and parties) of the contestants, and the result of the vote. Note that city directories and poll books will both have introductions composed principally of running prose, and these should be treated as the introduction in a book.

DIG 10 Feb 91