Daniel I Greenstein
Draft description of some documents used in modern history
<date>20 May 91
<body>

1. Newspapers.

As with many corpora, the historian will attempt to read or at least sample from a large number of newspaper issues, for example by reading every issue of a weekly newspaper for the period 1800-54, or by looking at articles pertaining to a particular subject over an equally long period. The two approaches are used for different purposes.

Take for example a community study which concentrates on frontier towns in the period 1820-60, in this case Rochester, New York. The investigator might want to "read through" the relevant newspaper(s) simply to get a feel for the period, or even to discover particular subject areas which seem worthy of further research. For example, it might become apparent from reading the newspapers for the period that revivals in the 1830s had a high public profile, at least insofar as this may be measured in the popular press. Consequently, further research into religion on the frontier would be indicated.

To sustain this approach, a machine-readable version of a newspaper needs to be very explicit in tagging the structure of the text itself: its layout, the placement and dimensions of pictures, advertisements, drawings and images, the use of bold type and of different fonts both in headlines and in the text of articles, the width, length and number of columns used for any one article, and an article's layout on one or several pages. Only then might it be possible for the researcher to realize, for example, that items pertaining to women, where they exist at all, are given only the slightest treatment toward the back of the newspaper, and always under the most insignificant headlines. Similarly, one might recognize the intrusion into the advertisements of specialist shops replacing those for general goods stores.
In the second case mentioned, one turns to a newspaper with a specific goal: to look at election or political reporting in the period X to Y, or at articles about municipal works and services. Here, the structure of the newspaper may prove a distraction. More essential would be quick access to articles which bear directly or indirectly on the subject under investigation. The investigator of politics on the frontier, for example, might want to read in their entirety the newspapers published in the several weeks before and the week immediately after particular dates when elections were known to have taken place. The analyst of municipal government would prefer that individual articles be accessible by their title, or by subject keywords included when encoding information about various articles.

In both cases, it would be necessary to describe the size, paper quality and, in some cases, colour of each number in a series. These characteristics frequently change with editorship, ownership, and/or publisher and may be significant. Moreover, one wants to preserve the possibility of more subjective characterization: "Under the editorship of Joe Bloggs, the paper seems to have gone more down-market, clearly targeting lower-income groups in its advertising, and less-informed readers through the elimination of sophisticated coverage of political, cultural and social events, and the inclusion of tawdry human interest stories". Here, too, the encoder is introducing a perception, but one which might prove useful and which could indeed be checked by secondary analysts through the machine-readable text.

To meet the specifications outlined above, it is envisaged that a newspaper would contain in its header its title, number or volume as appropriate, date, place of publication and price.
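The issue-level header envisaged above, together with the election-window reading described for the investigator of frontier politics, might be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `IssueHeader`, the function `issues_around`, the field names, and all sample values (title, dates, price) are invented here and do not come from any existing encoding scheme.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class IssueHeader:
    """Hypothetical header for one number of a newspaper."""
    title: str
    issue_date: date
    place: str
    price: str
    number: Optional[str] = None
    volume: Optional[str] = None
    # Physical characteristics, which may change with editor or owner.
    size: Optional[str] = None
    paper_quality: Optional[str] = None
    colour: Optional[str] = None
    # Subjective characterizations supplied by the encoder.
    encoder_notes: List[str] = field(default_factory=list)


def issues_around(issues: List[IssueHeader], event: date,
                  weeks_before: int = 3, weeks_after: int = 1) -> List[IssueHeader]:
    """Return the issues published in a window around a known event date."""
    start = event - timedelta(weeks=weeks_before)
    end = event + timedelta(weeks=weeks_after)
    return [i for i in issues if start <= i.issue_date <= end]


# Ten invented weekly issues; none of these values describe a real newspaper.
issues = [
    IssueHeader(title="Example Gazette",
                issue_date=date(1832, 10, 1) + timedelta(weeks=n),
                place="Rochester, New York",
                price="2 cents",
                number=str(n + 1))
    for n in range(10)
]

# An invented election date, used only to exercise the window.
selected = issues_around(issues, date(1832, 11, 6))
print([i.number for i in selected])
```

The window sizes are parameters because "several weeks before" is left open in the description; three weeks is simply a default chosen for the sketch.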
Each article in a newspaper would contain in its header the usual material about author(s) and dateline, and some reference to the physical placement of the article, for example by providing some indication of the size and length of the headline (some run across several columns), and a list of text segments organized by page number, column number and line length. Thus, an article taking up part of two columns on the first page and an additional column on the 15th page might have in the header some reference indicating that the article consists of text on page 1, column 1, lines 13-65; page 1, column 2, lines 13-35; and page 15, column 6, lines 8-27. Where articles include pictures or graphical images, these should also be referenced at this point.

Lines of text might then be treated within a hierarchy consisting of page number at the top level, column number (proceeding always from left to right) at the next level, and finally line number. Pictorial images and other graphical material would also have headers giving the graphic's name (as in the case of a caption), artist or photographer (where relevant), and position (by defining the column and line position of the upper-left and lower-right hand corners of the image). Text contained within an image (for example in an advertisement for men's shoes) would then be located by page, column and line numbers, as with the text of an article.

Newspapers do not, alas, consist solely of articles and images. Thus, the header for any particular newspaper item would need to include a tag indicating the nature of the textual feature about to be described (eg article, review, editorial, letter to the editor, advertisement etc). Letters to the editor should follow the conventions set down for personal correspondence, with the addition of information in the letter's header giving its placement on the page.

Broadsheets, and pamphlet literature.
These will need to be treated as newspapers and journals, with express reference made to the quality and appearance of the particular item as well as to its structure, layout, font size, placement of imagery etc. Only thus will the secondary analyst be able to remark that the "Methodists seemed more capable than the Baptists of producing high-quality and up-market evangelical material in the 1830s, indicating perhaps that they were recruiting "better" clientele".

2. The transactions of official bodies

The transactions of official bodies (including legislative records) should also be considered as corpora, only paying close attention to the source definition as outlined in the section on citations above. The text will normally be arranged hierarchically. Each volume will comprise the transactions of the society between two specified dates. At the next level, the particular sessions or meetings of the body will be recorded chronologically between the two specified dates. Each session therefore forms the second level of the hierarchically arranged text. Each meeting will require tags for its date and members present. At the third level of the hierarchy will be the minutes of the particular meeting or session, followed by accompanying documents, reports, testimonials, petitions, legislation etc considered at that meeting.

For the minutes, tags may be desirable to identify those in attendance at a given session and those absent, perhaps with reference to their status within the body (chairman, treasurer etc). There may also be a category of those "attending", ie non-members of the association who were invited to attend to address a particular issue. Within a body of minutes, text is often associated with an individual's or individuals' names or initials, indicating their responsibility for the discussion summarized in the text. Wherever members are referred to by initials, either in the margin or in the text itself, they need to be identified <member=>.
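The hierarchy just described (volume, then session, then minutes and accompanying documents), together with the identification of members from initials, might be sketched as below. This is a minimal illustration only: the class names `Member`, `Session` and `Volume`, the function `resolve_initials`, and the sample names are all invented for the purpose, and real minutes would need far richer handling of names.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Member:
    name: str
    office: Optional[str] = None  # eg "chairman", "treasurer"

    @property
    def initials(self) -> str:
        # First letter of each word of the name, upper-cased.
        return "".join(part[0].upper() for part in self.name.split())


@dataclass
class Session:
    """Second level: one meeting, with its minutes and documents."""
    held_on: date
    present: List[Member] = field(default_factory=list)
    absent: List[Member] = field(default_factory=list)
    attending: List[str] = field(default_factory=list)  # invited non-members
    minutes: List[str] = field(default_factory=list)    # third level
    documents: List[str] = field(default_factory=list)  # third level


@dataclass
class Volume:
    """Top level: all transactions between two specified dates."""
    start: date
    end: date
    sessions: List[Session] = field(default_factory=list)

    def add_session(self, session: Session) -> None:
        if not (self.start <= session.held_on <= self.end):
            raise ValueError("session falls outside the dates of this volume")
        self.sessions.append(session)
        self.sessions.sort(key=lambda s: s.held_on)  # keep chronological order


def resolve_initials(initials: str, session: Session) -> List[Member]:
    """Return every member of the session whose initials match.

    A list is returned because initials may be ambiguous."""
    key = initials.replace(".", "").replace(" ", "").upper()
    return [m for m in session.present + session.absent if m.initials == key]


# Invented sample data.
volume = Volume(date(1835, 1, 1), date(1835, 12, 31))
meeting = Session(date(1835, 3, 2),
                  present=[Member("Ann Brown", "treasurer")],
                  absent=[Member("Charles Dean")])
volume.add_session(meeting)
matches = resolve_initials("A.B.", meeting)
print([m.name for m in matches])
```

Returning a list rather than a single member leaves the encoder to decide what to do when two members share the same initials.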
The text of a minute will also include specific action points which indicate decisions taken by the body, either agreeing, postponing or repudiating a proposed action. These action points may also need to be tagged. Votes on particular issues are one example of an action point. Here, the actual vote totals (where given) need to be tagged.

Supporting documents that are likely to accompany a minute include:

Legislation (acts, ordinances) being considered by the body at a particular session.

These should have headers containing title and author(s) where appropriate. The latter may only be identified by faction/organization or committee name. Amendments, sections and titles within a particular piece of legislation may also have a title and/or authors associated with them and these will need to be tagged, as will signatories (eg with the US Constitution or the Declaration of Independence). Where legislation has dissenting opinions appended, these too will require author and title tags. Since legislation will often be considered, sent back to committee for revision, and then reconsidered (and printed in the volume of minutes each time it is brought before the body), some attempt must also be made to indicate which draft is being referred to, the dates of other sessions where earlier generations of the legislation were considered, and ideally the ultimate fate of the legislation (passed by division 100-9 on a given date, or tabled indefinitely on...). In some transactions the information needed to link a given document to its occurrence elsewhere in a volume or volumes is given with reference marks, which should be treated as described under Sequential Numeric Information above.
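The linking of successive drafts of a piece of legislation to the sessions at which they were considered, and to its ultimate fate, might be sketched as follows. The class names `Consideration` and `Legislation`, the method names, and the sample title, committee and dates are all invented for illustration; only the vote of 100-9 echoes the example given above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Consideration:
    """One occasion on which a draft came before the body."""
    session_date: date
    draft: int                               # 1 = first draft, 2 = second ...
    outcome: str                             # eg "passed", "tabled"
    votes: Optional[Tuple[int, int]] = None  # (for, against), where given


@dataclass
class Legislation:
    title: str
    authors: List[str]  # may be a faction or committee name only
    history: List[Consideration] = field(default_factory=list)

    def earlier_sessions(self, draft: int) -> List[date]:
        """Dates of sessions at which earlier drafts were considered."""
        return sorted(c.session_date for c in self.history if c.draft < draft)

    def fate(self) -> Optional[Consideration]:
        """The most recent recorded consideration, if any."""
        if not self.history:
            return None
        return max(self.history, key=lambda c: c.session_date)


# Invented sample data.
bill = Legislation(
    title="An Ordinance (invented title)",
    authors=["Committee on Streets (invented)"],
    history=[
        Consideration(date(1835, 3, 2), 1, "sent back to committee"),
        Consideration(date(1835, 4, 6), 2, "sent back to committee"),
        Consideration(date(1835, 5, 4), 3, "passed", votes=(100, 9)),
    ],
)
print(bill.earlier_sessions(3))
print(bill.fate().outcome, bill.fate().votes)
```

Treating the fate as simply the latest recorded consideration is an assumption of the sketch; a volume may well end before the legislation's fate is settled.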
Other supporting documentation.

This may be in the form of letters, testimonials, and petitions (to be treated as manuscript letters or as journal articles as appropriate and related to session, and journal of official proceedings, and with a sufficient number of authorship tags to indicate not only authors but signatories, factions, committees, interest groups etc). Supporting materials may also come in the form of reports, graphs, tables of data, or pieces of legislation passed by other equivalent but different bodies.

Occasionally, the transactions of official bodies will be laid out in two sections: the first presents the minutes of each meeting chronologically arranged within the period covered by the volume; the second presents the supporting documents considered at each meeting, also chronologically arranged. Indeed I have seen such minutes where the pagination of the minutes is in arabic and that of the supporting documents in roman numerals. Should this format exist in the text it is worth preserving, since it is frequently only the supporting materials and not the minutes of particular sessions which will illuminate particular historical problems.

Supporting materials to the sessions of legislative or other official bodies have in their titles keywords which define the content of the materials, and the keywords themselves may be consistent. A brief survey of several such sources yielded the following keywords:

    Memorandum [from][by][etc]
    Communication [from][by][etc]
    Application [for][from]
    Protest [by][from]
    Petition [by][from]
    Observation [of]
    Schemes for
    Report by
    Acts
    Ordinance
    Decree
    Reply [of][from][etc]
    Brief
    Suggestion for
    Memorial on
    Testimonial of
    Statement of
    Appeal [from][of]
    Further to (any of the above)
    Draft (any of the above)
    Redraft of (any of the above)
    Interim (any of the above)

3. Collective biographies, encyclopaedia, reference books.

These should be treated rather more as dictionaries than as formal narrative.
As with dictionaries, the reference works mentioned above are composed of entries. Moreover, they use typographic conventions to denote the entries' internal structure. Collective biographies will use all of the source description conventions appropriate to a book. They will also require tags which define the group of individuals for whom information is collected, the criteria used for their inclusion (or, perhaps more interesting, criteria used for the exclusion of others), who or what organization drew up the criteria mentioned above, the manner in which information was collected, and an estimate of what proportion of the total population meeting the above-mentioned criteria is included.

Though collective biographies differ as to the details that they provide, tags for the following categories of information (some of which are discussed above as crystals under Date, Name, and Controlled Vocabulary) should be included. These categories are often indicated in the text with a variety of typographic conventions and/or abbreviated keywords: date of birth; place of birth; father's occupation; father's name, title etc; mother's name; mother's father's name; confessional status; school(s) attended; school societies or sporting teams; scholarships and prizes won; university(ies) attended; degree(s) obtained; university societies and/or sporting teams; professional/vocational qualifications; paid employment (with dates); unpaid employment (eg directorships of various companies, etc); public service (membership of Royal Commissions, unpaid political posts held etc); club and society membership; spouse's name, title, etc; spouse's father's name; children's names and sex (or number of children by sex, as sometimes children are only mentioned as 1 daughter, 2 sons); publications; honours; titles. In each case, the entry heading should consist of the individual's name and title.
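Because collective biographies differ in the details they provide, an entry might be represented with every category optional, as in the sketch below. Only a handful of the categories listed above are shown; the class name `BiographyEntry`, the field names, the helper `supplied_categories`, and the sample entry are all invented for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class BiographyEntry:
    heading: str  # the individual's name and title
    date_of_birth: Optional[str] = None
    place_of_birth: Optional[str] = None
    father_name: Optional[str] = None
    father_occupation: Optional[str] = None
    schools: List[str] = field(default_factory=list)
    universities: List[str] = field(default_factory=list)
    degrees: List[str] = field(default_factory=list)
    # Children may be named individually or only counted by sex.
    children_names: List[str] = field(default_factory=list)
    children_by_sex: Dict[str, int] = field(default_factory=dict)

    def number_of_children(self) -> Optional[int]:
        """Number of children, or None if the entry says nothing."""
        if self.children_names:
            return len(self.children_names)
        if self.children_by_sex:
            return sum(self.children_by_sex.values())
        return None


def supplied_categories(entry: BiographyEntry) -> List[str]:
    """Names of the categories for which this entry gives any information."""
    return [f.name for f in fields(entry)
            if f.name != "heading" and getattr(entry, f.name)]


# An invented entry recording children only as "1 daughter, 2 sons".
entry = BiographyEntry(heading="Sir John Example",
                       date_of_birth="1850",
                       children_by_sex={"daughter": 1, "son": 2})
print(entry.number_of_children())
print(supplied_categories(entry))
```

A list of the categories actually supplied makes it possible to compare the coverage of different collective biographies without assuming that a missing detail means a negative answer.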
Collective biographies are often laid out in sections, for example by year or span of years and alphabetically within years. Such works will normally include indexes where names are arranged alphabetically and appropriate page references given. These need to be retained if the collective biography is to remain a useful source.

City directories.

These might be treated as collective biographies with the exact same headers (city directories are to this day selective and not inclusive). Each entry needs to be identified by names and titles, and tags are required for home address, work address, occupation, title of business, and telephone number. Where an entry refers to a business, the entry is identified by the business name or title and should provide tags for the names of the business's chairman, vice-chairman, president, vice-president, treasurer, and controller. One would also expect to use extensive cross-referencing to link, for example, the individual who puts "The Steel Corp" under his own entry as a place of work, and the entry for "The Steel Corp". As with newspapers, careful attention will have to be paid to documenting the structure of business advertisements, as these are often an indication of the relative size and importance of a given company, or at least of the importance that the company's owner wants to impress upon the readership.

City directories will often have appendices giving the officers/members of the most important public, financial, commercial, and industrial institutions of the locality covered by the directory. These need to be identified as subsections or even appendices.
Each sub-section will have a title which will refer to a civic institution (the city council, the Bank of the Northern Liberties, the Public Library, the Methodist Temperance Union), and often a brief description of the institution, for which the following tags will be necessary: date of foundation, institution's address(es), and capitalization (in the case of banks and insurance companies). The body of the sub-sections will be the names of the institution's members/officers. These should be treated as entries in the main body of the city directory, with an additional tag for office (eg treasurer). More recently city directories have become known as telephone directories and tend to lose occupational information for individuals after about 1920 in the US. These are frequently laid out in two sections: one for individuals and one for businesses.

Poll books should be treated as city directories. The structure of poll books will vary. Some are laid out alphabetically, others alphabetically by political subdivision or district. Where sub-divisions are used they need to be identified. Entries will be defined by the names and title of the individual specified and will require the same tags as entries in city directories, plus tags for first, second, third, etc choice of candidates. The source description of a poll book must indicate the date of the election and the locality in which it took place. It would also usefully contain the names (and parties) of the contestants, and the result of the vote.

Note that city directories and poll books will both have introductions composed principally of running prose, and these should be treated as the introduction in a book.

4. Diaries.

The source description should indicate something about the diarist, filling in biographical details (at least vital statistics) where these are known. The overall quality and appearance of the diary will also provide useful additional information.
The important sub-headings will be chronological, normally by day but occasionally by hour for the more assiduous diarist. The text composed at any one time (noon on the 12th of July 1823) should be treated as a chapter of a book, with the date (and time) of entry indicated in a manner similar to that used for chapter titles. Subject keywords describing the contents of any one particular entry would also be useful at this level of documentation, as would brief descriptions indicating the apparent mood of the author ("seemed rather frenetic after his divorce of such and such a date" (cross-reference to relevant diary entry)).

The log books and diaries that businessmen kept to record transactions etc should also be treated as diaries. Within the daily entry, there may, however, be subsections which should be treated and tagged as such. These might include, for example, the author's rather more ponderous prosaic musings as to the state of his/her affairs, itemized transactions of the day's business, statements of holdings in various accounts, stock lists etc.

Version control

The problems here are manifold, if only because encoding declarations rely on the authors (or are they creators?) of machine-readable texts identifying very precisely what text features were and were not marked up or identified when encoding. For example, when encoding Anne Frank's Diary, one might indicate a whole variety of features including new entries, new page, new line, new paragraph, etc, but not where the original text was underlined or appeared in a bolder hand. Clearly in this example, any text in the original manuscript which was underlined or written in a "heavy hand", perhaps for emphasis, will be lost to a secondary user. If the initial encoder specifies what was not encoded (the following text features were not encoded - underlining and bold type... etc), secondary users will at least know the limitations of the machine-readable source.
The problem of course is that the initial encoder may not recognize or appreciate the importance of distinctive text features which may be of the greatest importance to other users. A postgraduate who creates a machine-readable version of a series of medieval wills may not be well enough trained, for example, to identify where subtle changes in penmanship might indicate a document's multiple authorship. In this case, the possible occurrence of multiple authorship would not be available to subsequent (secondary) users because it would not be indicated by the original encoder. In short, any machine-readable version of an historical source is itself a source, revealing as much about the research interests and skill of its author as about the original manuscript or published document from which the machine-readable text is derived. Where the original encoder is able to specify which text features were not encoded, at least the secondary user can be made aware of their occurrence and then decide whether it is necessary to return to the original manuscript or published source. It is, however, highly unlikely that the original encoder can ever be sufficiently aware of all of the text features in any document which might hold forth some interest to an investigator in the future.
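The distinction drawn above between features that were encoded, features declared as not encoded, and features the encoder never considered might be sketched as a simple encoding declaration. The class name `EncodingDeclaration`, its method `status`, and the feature labels are invented for illustration; the sample features follow the diary and medieval wills examples in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class EncodingDeclaration:
    """Hypothetical record of what an encoder did and did not mark up."""
    encoded: FrozenSet[str]
    not_encoded: FrozenSet[str]

    def status(self, feature: str) -> str:
        if feature in self.encoded:
            return "encoded"
        if feature in self.not_encoded:
            return "declared as not encoded"
        # The encoder said nothing: the feature may exist in the original
        # without the secondary user having any warning of it.
        return "undeclared"


declaration = EncodingDeclaration(
    encoded=frozenset({"new entry", "new page", "new line", "new paragraph"}),
    not_encoded=frozenset({"underlining", "heavy hand"}),
)

print(declaration.status("new page"))        # encoded
print(declaration.status("underlining"))     # declared as not encoded
print(declaration.status("change of hand"))  # undeclared
```

The third case is the troublesome one: a declaration can only warn the secondary user about features the original encoder recognized in the first place.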