Trip Report
Berkeley and Irvine, California, 8-13 March 1994

Confidential, not for further distribution.

Invited by Mary Kay Duggan in the School of Library and Information Science at Berkeley to give a talk / workshop on SGML and the TEI to her graduate class in computing in the humanities, I was also scheduled to give a public lecture on "Standards for Electronic Text: SGML and the TEI," and to repeat the workshop for a number of librarians who were unable to attend owing to a schedule conflict.

After a flight marked only by a forty-five minute delay in taking off, I was met at the San Francisco airport by Charles Faulhaber, the creator / organizer of an important bibliography of Old Spanish manuscripts and their secondary literature (PhiloBiblon) and an American collaborator with Francisco Marcos Marin on the creation of the Admyte CD-ROM collection of (currently) sixty-five early Spanish texts. We chatted about electronic bibliographies and the difficulties we have each encountered in designing adequate record structures for bibliographic data, or in finding adequately designed products. His most recent tale of woe: the engine used for PhiloBiblon, Advanced Revelation, will not search a database on a CD-ROM disk, because in the interests of data integrity the program was designed to insist on locking records while they are being read, and so cannot work on a drive to which it cannot write. This was discovered a week or so before the CD-ROM containing the data was mastered; the database was included on the disc, but it must be copied to a read/write disk before use.

After checking in at the Women's Faculty Club and eating lunch, I went to see Mary Kay Duggan, who after an introductory chat and tour of the School of Library and Information Science (SLIS) led me to the Moffitt Undergraduate Library, where I gave my public talk to an audience of between fifty and sixty people, most of whom stayed, after the forty-five (well, probably fifty) minute talk, for half an hour to ask questions.

I kept the talk mostly non-technical, focusing on the basic problems of representing texts in electronic form (beginning with my standard pedantic quibble about any attempt to "put texts into computers" and leading to the fundamental requirements for markup languages) and the problems of interchange (lack of altruism caused in part by failure to credit or reward creation of electronic resources, undocumented non-standard encoding schemes, and differences of interest among scholars). These topics led naturally to a discussion of the TEI, its goals, and its history, and to a discussion of SGML and why we chose it as the framework for our encoding scheme; as usual, I could not resist pointing out the vast differences in longevity between the texts humanists study and the systems we are now using to study them, since the short lifetime of our systems helps make much clearer to humanists why we must insist upon making our data representations system independent. Finally, I returned to the TEI and summarized the salient characteristics of our results (focus on logical not physical structure, resolute application of Occam's Razor, insistence upon extensibility, and the use of the pizza model to ensure modularity), concluding by quoting Yahweh's description of what people can do if they speak one language.

After the talk, both my hosts had other commitments, so I walked around Berkeley to find dinner and review my plans for the next day's workshop.
Because I had decided to use Franklin D. Roosevelt's first inaugural address as an example for group tagging, I needed to find a respectable printed edition of it, and so had an engaging time trying to talk my way into the stacks of the main library without having a stack pass. Having found there a printed edition (heavily marked by generations of students with marginalia, underlinings, etc.), I discovered to my chagrin that to buy a copy card I had to leave the stacks. When I had bought a card, my friend at the security desk had been relieved by a colleague; I decided not to push my luck, and gave up on the stack copy. Fortunately, the reference room had a desk copy of Henry Steele Commager's collection Documents of American History, which includes the speech, so I was able to go eat my supper with peace of mind, amusing myself by collating the Commager reprint with the Project Gutenberg version of the text, which has some hair-raising blunders. After dinner, for old times' sake I revisited the Durant Hotel, the Steering Committee's meeting place in 1991, where I drank a glass of single malt scotch and thought about Don Walker.

The next morning, I confirmed what I had only dimly suspected before: if you want to get some assistance in a strange institution, where you are not strictly entitled to it, do it at night. During the day, it's much easier for the person at the desk to find a supervisor, who is much better informed about the rules and much more willing to say no to you, be your smile as winsome and friendly as you like. So I never did get back into the stacks.

Instead, I went across to the microcomputer facility, where the previous afternoon the students on duty had advised Mary Kay Duggan to give me network access by opening an account for herself and giving me the password. The evening before, just before closing time, I had used this account to fetch a fresh copy of Roosevelt's speech from Project Gutenberg. In the cold light of morning, however, the staff waved me away like a leper, urging me earnestly to warn Prof. Duggan that she could get into VERY SERIOUS TROUBLE, lending her account to someone else that way. In vain I protested that it was the staff at this very center which had advised her to lend me her account, and indeed had created it for her for that express purpose. In vain I pointed out that I was giving a workshop that afternoon and all I needed was to fetch some files from my home machine. It was no use; it was daylight; the staff were adamant.

An attempt to try again at the computer lab in the undergraduate library failed, because the lab was inside the library, and I needed a pass. They said I could get a stack pass if I went to the main library, but I decided one trip around this particular circuit was enough. Fortunately, in the Balkanized organization of computing at Berkeley, the computer facility in SLIS (soon to be renamed the School of Information Management, I was told with some pleasure, which as a long-time lover of libraries I did not share) is entirely unconnected with the microcomputer lab which had already rejected me as a sinner that morning. The computer specialist at SLIS was even happy to see me, since she had not had a very clear description of my hardware needs for the workshop that afternoon. We set up the LCD display, and she logged me on to their LAN with a guest account, which worked fine until I hung the system by trying to interrupt a file transfer.
Eventually, I did get the files I needed (I was missing some files needed for the TEI DTD, which I wanted to use for the group tagging of the Roosevelt text in the workshop), but it took the remainder of the morning and I did not have a chance to test them.

At noon, a brown-bag lunch had been arranged for the discussion of information arcades, on which I had apparently become an instant expert on account of having mentioned, in my bio, that my current responsibilities at the UIC computer center include participation in the planning of UIC's version of this suddenly fashionable type of library service. Fortunately, those attending did not visibly resent it when I told them I was attending because I wanted to find out from them what THEY were doing.

The discussion was vigorous and revealed a deep cleft between the haves, who wondered why anything like an information arcade was really necessary, and the have-nots, who pointed out that without an information arcade, most humanities faculty would have no place on campus where they could see Mosaic or similar software in action: the Windows version of Mosaic cannot be run on any of the public PCs at Berkeley, I was told, either because it's not installed (in the centrally operated labs) or because the machines don't have enough memory (in the SLIS lab). The usual issues were aired, sometimes in new ways: Why do librarians have to show patrons how to use all this software? We don't show people how to read books, do we? Yes, we do: reference librarians regularly instruct readers in the use of the Science Citation Index and many other commonly used reference sources. But there are other reference sources for which the patrons are on their own; some bibliographies of older German literature, for instance, are approached even by specialists in the period only with trepidation and carefully written out instructions. The situation would become much better, of course, if we only had more money. And so on.

After the brown-bag, Charles Faulhaber and I met in the library so he could demo the Admyte CD-ROM to me. I got there first and followed his invitation to start the program and see if it made intuitive sense. I managed to crash it immediately by trying to call up the help screen (when I showed him this, he put his head in his hands and sighed deeply), and while I restarted it I tried to test the current TEI DTDs I had retrieved from Chicago just before lunch. He arrived just as I discovered I was missing yet another file, and we spent half an hour looking at the disc.

One of the two CD-ROMs is a text-only transcription of a number of medieval texts, mostly obtained from the Dictionary of the Old Spanish Language in Madison, using an encoding scheme whose documentation Charles had sent to the TEI several years ago. Some modifications have been introduced: the Spaniards who produced the CD-ROM refused to encode n-with-tilde as 'n~'; they have retained, however, the Madison / DOSL conventions for marking superscript letters and the like. The search engine provides more or less the usual facilities; a list of lemmata and forms has been built in, so that one can search in a single pass for all the many variant forms of a word like 'alguna' [some]. The proximity searching, however, works only for words on the same page; the hit count gives not the number of occurrences of the word but the number of pages on which it occurs; and when one asks to see the results, the screen is positioned at the beginning of the page rather than at the occurrence of the word.
These restrictions seem wilful in the text-only collection, but make much more sense when one is searching the other of the two CD-ROMs, which includes page images linked to the transcriptions. This collection amply demonstrates Claus Huitfeldt's argument that when page images are provided, the physical organization of the page can be treated in much less detail within the text-level encoding. It is also handy for checking suspect points of transcription: we found an apparent misspelling in the text, and were able to confirm, by examining the page image, that the transcription was indeed in error. The CD-ROM software does have some quirks, however: the Spanish developers aim mostly at the small-office market (legal data, etc.), and the software is not prepared to deal with the unexpected luxury of having both CD-ROM discs mounted at the same time, in a double-decker drive. So to switch from one drive to the other, one must leave Admyte, run a setup routine to change the configuration file, and restart the program. I didn't see anything in the Madison manuscript encodings for which chapter PH leaves TEI encoders unprepared, but it would be wise to check systematically before TEI P4.

The 'workshop' which formed the central reason for my visit was ostensibly organized for the benefit of Mary Kay Duggan's fifteen-member class on humanities computing, but thirty extra chairs had been placed in the room and slots made available for signup. Since the event lasted only three hours and I had the only machine in the room, it was more of a lecture with discussion than a workshop, but by calling it a workshop we did succeed in preparing people to participate actively. Borrowing heavily from the outline which Elaine Brennan and I are developing for the workshop we will hold at Urbana next month (similarly a talking-heads-plus-discussion event), I divided the afternoon into six topics, with two interludes of group activity:

1 Context-Setting. Why do you want to use e-text in the first place? What do you hope to accomplish? Distinction among text description languages, page description languages, and silicon microfilm. Do you care about markup at all? Answer to the last question: yes, you do. Everyone --- even Michael Hart, as it turns out! --- agrees that e-texts may need to include meta-information. Examination of Project Gutenberg edition of FDR's inaugural: half the file is taken up with self-identification, description of Project Gutenberg, appeal for funds, instructions on retrieving other texts and a current catalog, and a legal disclaimer. No way, however, for a program to distinguish this meta-information from FDR's text; a careless application of OCP would reveal that Roosevelt used the word 'etext' at least twenty times in his speech --- perhaps he needs to be given more credit for the emergence of modern technology. Also no indication of the source of the transcription. (A sketch of how even minimal markup makes the distinction mechanical follows the list of topics below.)

2 Document analysis. Why it's necessary, how to do it systematically, and two distinctions to keep in mind. The systematic plan is that used by ATLIS Consulting, and the two distinctions are between the useful and the merely true and between the archival maintenance form of one's information and the delivery or process-specific forms one uses to exploit the capabilities of particular systems.

Interlude: group document analysis of a sample text (Roosevelt).
I had had cold feet about this, since the text is formally very very simple: just a series of paragraphs, with no italics or bold, no dates, and only a couple of proper nouns. There seemed a risk that it would not provide enough matter for discussion. I was surprised, then, that the group named at least fifteen elements which might usefully be captured, including large quantities of non-textual metadata (date and place of delivery, source edition, etc.). 'Paragraph' was named only after four or five other elements. I had distributed copies of one page of the Dunciad, but we did not use it; in the interests of time, I said, we would leave it as an exercise for private study, and I would grade it if they sent me an analysis by email. (This was a joke, but at the end of the session they did ask me for my email address.)

3 Introduction to SGML: basic characteristics, what is a 'markup language', SGML delimiters, elements, tags, attributes, entities.

4 Introduction to the TEI: design goals, constraints on a general-purpose research-oriented scheme, architecture, tag sets.

Break.

Interlude: group tagging of Roosevelt's inaugural. Since I had not succeeded in getting the TEI DTDs to parse on my laptop, we used the old TINY DTD, which is still in use in the current tagged versions of P3; they were impressed (and so was I) to see that even in this ostensibly small and limited subset of the TEI DTD, the S tag and the SALUTE tag were nevertheless defined, so we could demonstrate how to mark orthographic sentences and Roosevelt's salutation of his predecessor and the Chief Justice at the beginning of his speech (roughly as in the sketch following the topic list). We did not get into any of the metadata, however, since the header is very primitive in the TINY tag set.

5 Technical Topics: the SGML DTD, syntax of declarations, notes on content models, Waterloo DTDs, attribute data types, NOTATION declarations. After this, a whirlwind discussion of the TEI's special innovations in SGML usage (pizza model, class system, facilities for simple renaming), which had to be faster than desirable because we were running out of time and had to get on to the final topic.

6 Practical Questions. When running a project, what problems do you face? Problems of data capture; software. Problems of data management, manipulation, and processing; software. Problems of information delivery; software. The latter provided an occasion to discuss HTML, which I praised as a good delivery mechanism but not an ideal markup scheme for maintenance. Final description of important network-based information sources.

Time having run very short, I did not even attempt to demonstrate Author/Editor or Intellitag.
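For the record, the general shape we were after in the two interludes can be suggested by a rough sketch like the following. It is illustrative only: the element names follow the general TEI pattern, but I have not tried to reproduce the TINY DTD or the full TEI header exactly, and the text is abbreviated.

   <!-- The meta-information lives in a header, where a program can
        recognize it and skip it or exploit it, as appropriate. -->
   <teiHeader>
     <fileDesc>
       <titleStmt>
         <title>First Inaugural Address</title>
         <author>Franklin D. Roosevelt</author>
       </titleStmt>
       <sourceDesc>
         <bibl>Text from Commager, Documents of American History;
               collated against the Project Gutenberg transcription.</bibl>
       </sourceDesc>
     </fileDesc>
   </teiHeader>
   <!-- The text proper follows, with its structure marked explicitly. -->
   <text>
     <body>
       <salute>President Hoover, Mr. Chief Justice, my friends:</salute>
       <p>
         <s>I am certain that my fellow Americans expect ...</s>
         <s>This is preeminently the time to speak the truth ...</s>
       </p>
     </body>
   </text>

A concordance program which understands even this much markup can be told to skip the header, so that Roosevelt need no longer be credited with the word 'etext'; and the source of the transcription has an obvious place to live.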
After the workshop, Mary Kay and I had a nice chat with a graduate student at SLIS who has been assigned the task of creating an SGML equivalent for the US-MARC record. The point is to be able to embed MARC records into SGML-encoded texts, with the constraint that the translation must be fully information-preserving in both directions. (The TEI header does not meet this constraint.) We talked a bit about some of the problems he sees, and the approach he wants to take in solving them, and I encouraged him to persevere, to keep us informed about his progress, and to get in touch with other people who may be working in the same area, or may wish to. I also planted the thought that if his MARC-equivalent SGML tags could function as an extension of the TEI header, it would be useful. That is, he should use as much of the existing header as he can, and extend it where he needs to, and the TEI should commit itself to considering adopting his extensions as a legal variant of the TEI header.

At dinner, I saw Daniel Pitti and some other librarians, who explained that the reason they had been unable to come to the workshop that afternoon was that they were finishing a three-day training session with Electronic Book Technologies. Daniel bounced in his seat with enthusiasm, since his finding-aid DTD had proved to work well with DynaText, and he now had a sample finding aid up in DynaText, for which he had written rudimentary style sheets during the EBT training.

The next morning, I repeated the workshop in a library 'early bird' session: these staff professional development seminars are held at 8:30 a.m. so that part of the time comes out of the normal working day and part comes from the staff members' own time. (Anything to soothe agitated legislators convinced that universities do nothing but waste money!) At such an hour, in this location, I had expected eight or ten dedicated SGMLers, but found that this session had actually been advertised rather widely, and that there were between thirty and fifty people in attendance, many from the library but some, I gathered, from elsewhere in the Bay Area. The early hour led to equipment difficulties (the equipment department was not open yet) and much to-and-fro in search of keys. So I restricted myself to overhead transparencies and did not use the laptop; this in turn saved time, since we could not do the group tagging. Apart from that and apart from the technical topics, which I dropped, we covered the same material as on the preceding day, although in less depth. Once again, the group document analysis of FDR elicited more meta-data elements than textual elements (this was, after all, a library crowd), and 'paragraph' was named only after date, place, author/speaker, and source edition had all been placed on the list.

After the session ended, Daniel Pitti showed me the DynaText version of the Duke finding aid, which enabled me to become a little clearer on what, exactly, he means by the term 'finding aid': it is typically a description, at the collection, subcollection, box, and sometimes item level, of the contents of some archival or manuscript collection. In the case of manuscript libraries, even a full item by item catalogue may be classed as a finding aid.
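In SGML terms, the kind of nesting involved might look very roughly like the following; I should stress that the element names here are invented for illustration and are not taken from his actual DTD.

   <!-- hypothetical tags, for illustration only -->
   <findaid>
     <collection>
       <head>Papers of So-and-So, 1890-1945</head>
       <subcollection>
         <head>Series 1: Correspondence</head>
         <box n="1">
           <item>Letter to A.B., 4 July 1902</item>
           <item>Diary, 1903</item>
         </box>
       </subcollection>
     </collection>
   </findaid>

A full item-by-item catalogue simply carries the nesting down to the lowest level throughout.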
We continued our discussion over lunch, and I proposed to him that he should consider formulating the finding-aid DTD as a TEI-compatible tag set (whether as a base tag set or an additional tag set was not immediately clear, but on consideration we leaned toward making it a base, since his current DTD has different -like elements with very distinct content models). He found this an interesting prospect, and wants to pursue it, so I gave him a reading list of the chapters most important for him to understand (CO, ST, DS, SA) and a copy of the draft tutorial (TEI U1). Political considerations may complicate the decision: the archival community is very nervous about the dangers of standardization, even when the standards are formulated by archivists. An alliance with non-archivists might cause mass phobic outbreaks. Also, he needs to meet a schedule, and wants to be sure this will save time, not cost time. The advantages of getting the TEI's analysis of running prose, hypertext linking, etc., for free, and of making possible a virtually seamless integration of finding aids with full text encodings of the objects they catalog, were obvious to him and to his colleague in the finding aids project, whose name I have lost track of.

After our lunch, Mary Kay Duggan drove me to the airport for a flight to Irvine, for meetings with Patrick Sinclair, Ted Brunner, and William Johnson about SGML, the TEI, and the electronic version of the Thesaurus Linguae Latinae. On the way, she mentioned a graduate student in the computer science department who is taking her course and had expressed an interest in acquiring the source code to TACT, so as to retrofit it with SGML awareness. When this was put to Ian Lancashire last week, however, he had reacted with a coolness or indifference which had rather puzzled the Californians. But truth to tell, the ardor of the graduate student had also rather cooled when he learned that the TACT source is 5 Mb of Modula-2, not C. We agreed that it would be a shame if this student's interest in creating SGML-aware software for humanistic research were blunted, and that he should be encouraged to do something useful. I mentioned that Modula-to-C translators do exist, and also that (given that TACT was designed for a COCOA-style, rather than an SGML-style view of text) perhaps a reimplementation of the core functionality might be easier than a retrofit. An alternative project which might prove appealing, I said, would be to generalize the existing tkWWW client, developed at MIT, which can browse and edit HTML files, so as to remove the restriction to HTML and handle arbitrary DTDs with appropriate style sheets. The student, it develops, has worked a lot with tcl (the Tool Command Language, which tkWWW uses as a macro language), so this might be relatively simple for him. We agreed to keep in touch.

The flight to Orange County involved being packed, with numerous other apprentice sardines, into a small metal cylinder which was lifted from the ground in Northern California, shaken gently for ninety minutes, and deposited back on the ground in Southern California. Better to have the shaking done in the air, perhaps, than on the ground. Patrick Sinclair met me at the Orange County airport and deposited me at my hotel; I had no obligations that evening and slept long and gratefully.

The next morning, Patrick Sinclair picked me up and we drove to the office of the Thesaurus Linguae Graecae project, on one edge of the university campus. There, we spoke for ninety minutes or so with Ted Brunner and William Johnson (Shirley Johnson being absent with a cold).
Ted Brunner founded the TLG in 1972, and over the decades has succeeded in shepherding it virtually to completion; the project has encoded about 69 million running words of Greek, comprising virtually every word of every extant text from the beginnings to the cutoff date (which escapes me). Some 'puddles' remain to be mopped up, mostly fragments and inscriptions, I believe, but they will add only a couple million words to the total. William Johnson has been associated with the project for a number of years, and I was told over lunch that the university has now given him a permanent position and that he will take over the project when Brunner retires, which he expects to do in about five years.

Ted Brunner began the meeting by observing that, with the encouragement of Susan Hockey, he had encouraged Sinclair to invite me to Irvine, since he felt that he should at least look at the TEI and consider it, even if he decided to use some other markup scheme.

We began with a brief summary of Pat Sinclair's project for an electronic version of the Thesaurus Linguae Latinae. Sinclair, a Latinist who has just completed a book about the use of sententiae and related rhetorical structures in Tacitus, spent a post-doctoral year in Munich about six years ago and has maintained ties with the TLL project there since. The TLL itself is a dictionary of Latin on historical principles, in the mold of Grimm and the OED and of comparable importance. The TLL was organized in 1894 and published beginning in 1900. The original schedule called for completion of the dictionary in 1915; so far five thousand pages or more have been published and they have reached P, the letter O having been completed in 1981. The letter N was skipped over for a while, because "the large number of articles with very extensive collections of material would have caused long delays in publication" (from the Introduction).

When he showed up in Munich with a laptop, it was, Sinclair says, to all appearances the first computer most of the staff had ever seen. In the meantime, however, computers, especially Macintoshes, have made inroads among the staff, and they even send material to the publisher, Teubner, on Macintosh disks. So when the idea of creating an electronic version of the dictionary arose, the project staff in Munich were not averse. In Irvine, Sinclair has set up a Center (Consortium?) for Latin Lexicography, which paid for a portion of my visit (the other portion being covered by the good offices of CETH), with the goal of creating such an electronic version of the TLL. One or two sample articles had been encoded by William Johnson, he said, and discussed with the TLL; next, the plan was to do a pilot project, encoding 100 columns (50 pages) of material, which he would like to finish in time for the conference celebrating the centenary of the TLL project. The data entry would be done by a service bureau, John Klein in Chicago, which also handles data entry for Dee Clayman's project encoding the Année Philologique.

Sinclair showed me a page of the TLL (the first of fifteen pages in the article for 'ordo' [order, system]), pointing out various typographic complexities including the use of italics, bold, expanded type (Sperrdruck), Greek, etc. William Johnson quickly said of course, it was not just font shifts they needed to mark, but structural information and the nature of the information. The typographic rendition, however, was often ambiguous: parentheses, for example, might enclose a side note on the manuscript attestation of a particular citation, an alternative form, a supplied word, etc. Bold parenthesis marks might mean something other than non-bold. Actually, Sinclair told me later, the use of bold face for parentheses reflects the editor's perception that a given section of the article is typographically complex enough that the reader needs help finding the matching parenthesis; structurally, it is random.
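The difference is easy to see in miniature. A purely typographic record of a parenthesized phrase, and a structural one, might look like this (the tags are TEI-flavored and the Latin is invented; the point is the contrast, not the details of chapter DI):

   <!-- typographic markup: faithful to the page, silent about meaning -->
   (<hi rend="bold">vel ordine</hi>)

   <!-- structural markup: says what the parenthesis is doing -->
   <note type="attestation">vel ordine</note>   <!-- a side note on the attestation -->
   <orth type="variant">ordine</orth>           <!-- or: an alternative form -->
   <supplied>ordine</supplied>                  <!-- or: a word supplied by the editor -->

Once the structure is explicit, the parentheses themselves, bold or not, can be left to a style sheet to regenerate.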
Brunner asked a number of sharp questions about the TEI and its potential use. Why had the OED project not used SGML? I explained the paradox of DTDs for historical texts, and argued that the existence of the TEI scheme changed the situation somewhat: the TEI had succeeded in the task of writing descriptive DTDs which can accept widely varying text structures, while still being somewhat more constraining than the Waterloo DTD. (Did anyone really think that a sense-number could contain an etymology, three definitions, and a related entry? No. But the Waterloo encoding scheme allows such a structure, even though it is likely to be a typo.)
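The contrast in constraint can be made concrete in a couple of lines of DTD syntax. These declarations are illustrative only; neither is quoted from the Waterloo DTD or from chapter DI.

   <!-- A very permissive content model: anything, in any order, any
        number of times. The parser will cheerfully accept a sense
        containing an etymology, three definitions, and a related entry. -->
   <!ELEMENT sense  - -  ((def | etym | cit | xr | sense)*) >

   <!-- A more constraining model: the etymology belongs to the entry,
        and a sense must begin with a definition, so the error above
        becomes a parse error instead of slipping through. -->
   <!ELEMENT entry  - -  (form, etym?, sense+) >
   <!ELEMENT sense  - -  (def, (cit | xr | sense)*) >

Either way the text can be captured; the question is how much nonsense the parser will catch on your behalf.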
How much would SGML increase the cost of such a project? I said I didn't know. So I didn't know how much, he continued, but did I agree it would raise the cost? I said no, I didn't know by personal experience one way or the other. At least one major consulting firm, I pointed out, claimed that doing a major project would be less expensive using SGML than using some other electronic form (meaning ATLIS Consulting, in their successful bid on the American Memory project at the Library of Congress). Oh, he said; I suppose that would be because of lower costs for V[erification] and C[orrection]. I said I believed that was the explanation. I stressed, however, that this was their claim, not mine, and that I had no intention of sitting in a room with Ted Brunner and pretending that I knew anything about cost estimation for data entry.

How much, he asked then, would SGML and TEI encoding add to the cost of the project by increasing the number of programmers needed? None that I could see, I said: if the project invented a new scheme, it would need technical people to design it and to write software for it. If it used some modification of the TLG Beta encoding scheme, it could draw on the reservoir of software and technical expertise at Irvine, but the Beta scheme would need substantial extension to handle dictionary entries, and existing software would have to be modified to handle the extensions; a programmer would still be needed. If the project used the TEI scheme, or some modification thereof, it would need technical people to assist in testing and possibly extending the dictionary tag set to handle TLL entries, and to handle SGML technicalities. On the surface, SGML and the TEI scheme appear more complex than some other schemes, e.g. an extension of the Beta encoding, so one might expect to need more technical help. However, using SGML in general and TEI encoding in particular would mean the project could exploit the software being developed elsewhere for TEI and SGML texts. This benefit would balance out the apparently greater complexity of TEI encoding. The high-level conceptual objects of SGML and the declarative nature of most SGML data management programs suggest that non-technical end users might be able to do more with SGML data than with other data, in the same way that SQL allows (some) end users to do things they could never do with a procedural DBMS interface. The electronic TLL project will need at least one programmer, regardless of its choice of markup language, simply because the complexity of the data will require it.

How many dictionaries had been encoded using TEI or SGML? I said I didn't know, but believed the number of dictionaries using chapter DI in its present form was zero, the number with TEI-influenced or TEI-similar encodings was probably between eight and twenty (counting on the Vassar/CNRS project to provide five to eight), and the number of other SGML dictionaries I could not guess.

William Johnson asked why there would be any benefit to using the TEI dictionary encoding, rather than the tag set developed at Waterloo for the OED. Wouldn't using the OED tag set ensure that the TLL could use the software developed for the OED? Certainly, I said, but since the OED software I knew was in fact both DTD-independent and often not even SGML-specific, the TLL could use Pat and Lector and many other OED programs pretty much regardless.

Over lunch, the sharp questioning continued; Johnson, in particular, wanted to know why the TEI Guidelines were not easier for practicing humanists to read (because they are a reference manual, not a tutorial), and why the TEI had not paid more attention to problems of software development --- for example, software needed indexes to get quick access to texts; why hadn't there been any effort to standardize a format for index structures? I said that most programmers I knew preferred to make their own choices among the available index structures, and that the decision to use or not to use indices really ought to be reserved to the developer and not imposed by standardization. (This was a flagrantly irrelevant reply, but it did get the topic shifted.)

After lunch, we returned to the TLG office, where Brunner invited me to check my email from his machine, saying he had things to do elsewhere; unfortunately, the connection broke several times and I didn't actually get anything read. Meanwhile, Brunner stood on the balcony outside, chain-smoking.

At two, a group of about fifteen or twenty people, half or so of them staff members of the TLG, found their way into the library of the TLG rooms to hear me talk about SGML and the TEI. I repeated the talk I had given at Berkeley on Tuesday, finding as I did so that it felt very odd to be describing the scholarly community's unwillingness to share electronic texts, in the rooms of a project which has distributed, for nominal fees, hundreds of copies of a seventy-million word corpus, created for no other purpose than sharing it with others. It was thus necessary to point out periodically that the TLG in some ways exemplified the points I was making, and in others formed a heroic exception to the general practice in the field. The TLG does share its texts; it does document its encoding scheme; it does use a non-procedural notation. When I said that an encoding scheme must provide methods for encoding analytic or interpretive information in a text, I observed that here, perhaps, Ted might begin to dissent from what I was saying. After I had finished, however, he smiled and said no, he had agreed with everything I had said.
If true, this would be good news.

On Saturday, Pat Sinclair and I discussed the TLL project over lunch. He is planning to prepare a major NEH funding proposal for this year's Tools deadline (1 September), and expressed an interest in active collaboration with the TEI. He asked how we should organize it. In general, I said, collaborative efforts like this one could be (a) included in a general TEI funding proposal, (b) included in a general proposal for our partner, or (c) formulated as a separate joint proposal. In all three cases, some portion of the money would flow through the institutional grantee of record to the other partner. In this case, it seemed to the two of us that the joint work we had in mind would most readily be incorporated into his general proposal; this simplifies the paperwork, at the cost of raising the overall price and increasing the risk of sticker shock for the reviewers. (If the TEI wishes to function as a subcontractor to the University of California in connection with this project, however, we must create a legal identity for the TEI as soon as possible. I believe that we should very soon investigate incorporation as a non-profit organization controlled by the three sponsoring organizations, and ensure that such incorporation will allow us the flexibility we require for the future.)

As possible deliverables of joint work, I suggested we consider:

- specification of a TEI tag set for historical dictionaries, as an extension to the current tag set for dictionaries; this breaks down into systematic testing against a stratified random sample of articles from the TLL, systematic document analysis using the TLL and other historical dictionaries, and documentation of the tag set in the form of a draft chapter for the Guidelines

- development of a tutorial introduction to SGML and the TEI (ca. 50 pp.), for makers of historical dictionaries (hard copy and/or electronic form)

- a markup manual (up to 150 or 200 pp.) suitable for use as a reference manual for day to day work within a project, so as to make it less frequently necessary to refer to the TEI Guidelines themselves; they still serve as a backup reference manual for cases where they are necessary

- one or more workshops on the topic of markup for historical dictionaries; from the funders' point of view, these are small invited conferences, while for the TEI they function as work group meetings

- one or more workshops on the general topic of historical dictionaries in electronic form, which allow the TLL project to describe its internal workings and its use of TEI markup to others interested in planning similar projects.

- software and software-related materials (specifications, prototypes, documentation, ...) relevant to the project, whether for end-user interrogation of the text or for project-internal processing. I noted the possible utility of system-integrator tools like the EBT System Integrator Toolkit or the SoftQuad Application Builder; I would like to use projects like this to persuade vendors to give the TEI copies of the software, either as a charity or at a stiff discount; the grant would have to cover the non-discounted portion of the cost, staff training, etc.

Sinclair seemed to think all of these deliverables could be good ideas; we agreed that if the TEI also succeeds in setting up a working partnership with Greg Crane's Greek Lexicon project, then the cost of things like workshops and work group meetings would be divided between the projects.

Sinclair explained that he does not think the data entry service can possibly capture the full logical structure of the entries in the TLL; it seems likely that they will be tagging mostly presentational information like font shifts, with the translation to the TEI tag set being performed in Irvine by the project proper, preferably by mechanical means. This would mean that this project is an excellent test bed for techniques of stepwise data refinement using SGML-to-SGML transformations and transformation-pipelining techniques.
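The kind of stepwise refinement in question can be sketched as follows; the tags in the first stage are invented to suggest what a keyboarding bureau might plausibly deliver, and the second stage gestures at the TEI dictionary tag set without quoting it exactly.

   <!-- stage 1: essentially font shifts, as keyed from the page -->
   <col n="...">
     <b>ordo</b>, <i>-inis</i> m. ...
   </col>

   <!-- stage 2: the same material, after one or more mechanical
        SGML-to-SGML passes, recast in structural terms -->
   <entry>
     <form><orth>ordo</orth>, -inis <gramGrp><gen>m.</gen></gramGrp></form>
     <sense n="I"> ... </sense>
   </entry>

Each pass in such a pipeline can be validated against a DTD for that stage, which is part of what makes the project attractive as a test bed.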
After lunch, we examined the entry for 'ordo' briefly again, and I described how the current draft of DI would tag various parts of the entry. He agreed to send me an electronic version of some not-too-complicated entry, and I agreed to tag it. We also agreed to meet again later, to discuss plans further, and I agreed to ask the executive committee to confirm that the TEI is willing to make the relevant commitments, which I shall do in a separate note.

I spent the rest of Saturday working on chapter SA and this trip report, and returned home without event on the Sunday.

Respectfully submitted,
C. M. Sperberg-McQueen