Proposal for Funding for

An Initiative to Formulate Guidelines for

the Encoding and Interchange of Machine-Readable Texts

[Copy of NEH Proposal, February, 1988. Please note that specific details of the timetable and funding are tentative and subject to ongoing revision. Detailed budget information has been omitted from this copy of the proposal.]

The Association for Computers and the Humanities
Nancy Ide, Vassar College
C. M. Sperberg-McQueen, University of Illinois at Chicago

The Association for Computational Linguistics
Robert Amsler, Bell Communications Research
Donald Walker, Bell Communications Research

The Association for Literary and Linguistic Computing
Susan Hockey, Oxford University
Antonio Zampolli, University of Pisa

[TEI SC G2]

February 1988

[This version of this document was created for public distribution in April 1988.]

Abstract

The Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing propose to continue a five-phase project to develop and promote guidelines for the preparation of machine-readable texts for scholarly research and for the interchange of such texts among research sites.
Phase 1 of the project (Planning and High-Level Design) was funded by the National Endowment for the Humanities with support from Vassar College; it is now complete. During that phase, a planning conference on Text Encoding Practices convened in November, 1987, at Vassar College to determine the feasibility and desirability of the proposed guidelines and to formulate aims and technical specifications for them. The sponsoring and other participating organizations also agreed upon an organizational structure to continue the work.
This proposal covers phases 2 through 4 (Detailed Design and Drafting; Revision; Review and Approval) of the project. During these phases, four working committees of approximately ten members each will study the problems associated with
  • the documentation of encoded texts,
  • the representation of texts at the typographic level,
  • the representation of scholarly analysis and interpretation in an encoded text, and
  • formal descriptions of the syntax of this and other encoding schemes.
An editor in chief and a consulting or associate editor will coordinate the work of these committees and ensure the coherence and clarity of the guidelines as a whole. A steering committee comprising representatives of the sponsoring organizations will supervise the project as a whole, and the results will be submitted to an advisory board of representatives selected by a larger group of participating organizations.
Funding is requested (1) to support the editorial staff and the heads of the committees, and (2) to subsidize meetings of the working committees, steering committee, and advisory board by paying travel and subsistence expenses for American participants. European funds will be sought to subsidize the participation of European scholars as committee heads and members of the working committees.
Following successful completion of this work, the guidelines will be published (phase 5 of the project) and the sponsoring organizations will undertake to maintain them by revision and expansion as necessary. No funding is requested for phase 5 in this proposal.


1. Rationale

1.1. The Need for a Common Text-Encoding Scheme

Before they can be studied with the aid of computers, texts must be encoded in machine-readable form. Standard data-processing practice furnishes convenient solutions for basic text-representation problems, but many texts of interest to scholarly research present difficulties not resolved by the industrial standards. Over the years, scholars have developed many different methods for:
  1. representing individual characters of a text not foreseen in industry practice (e.g. accented characters, special symbols, non-Roman alphabets); [1]
  2. reducing texts with footnotes, marginalia, text-critical apparatus or other complications into the single linear sequence assumed by most computer file systems;
  3. encoding the logical divisions of the text (e.g. book, chapter, verse);
  4. representing analytic or interpretive information relevant to the text (e.g. syntactic, morphological, or semantic analysis);
  5. documenting the source of an encoded text and the nature of the recording.
The collection of rules, techniques, or conventions used to solve these problems for a given text or set of texts is an “encoding scheme” or “format”.
Over the decades, scores—probably hundreds, in fact—of such encoding schemes have been developed from scratch or adapted from existing schemes and used to encode thousands of texts. Some were created to serve the needs of large-scale projects such as the Thesaurus Linguae Graecae at Irvine or the Responsa Project at Bar-Ilan (Israel); some were created as specifications for text input to text-analytic software such as COCOA (“word COunt and COncordance program on Atlas”), the Oxford Concordance Program, or the Waterloo Concordance Program (WatCon); and still others were developed to regularize large text archives such as those of the Treasury of the French Language at Nancy (TLF) and Chicago (ARTFL), the Institute of the German Language (IDS) at Mannheim, and the Institute for Computational Linguistics (ILC) at Pisa, each containing millions of words of texts. A great many schemes, however, were developed for individual projects working with smaller bodies of text. These texts may or may not have been deposited with a text archive, and their encoding scheme may or may not have been documented—so that even if they are accessible in machine-readable form they are not necessarily usable as they exist. Finally, there are the schemes developed by computer software developers not for research projects or text-analysis software but for commercial or academic word-processing (e.g. Runoff, Script, Scribe, nroff, troff, TeX).
Because of this multiplicity of encoding schemes, one might find the same information (e.g. a chapter division in a novel) encoded in any of the following ways, as well as in countless others: [2]
    |chap 1
    <C 1>Loomings
    \chapter
    \chapter[1]{Loomings}
    :h1.1.  Loomings
    MOBY001001Loomings
    |c1
    .chapter Loomings
    .cp;.sp 6 a;.ce .bd 1.  Loomings
    ~x
Scholars have realized for years that the multiplication of incompatible text formats wastes time, effort, and money. Because there is no common, generally recognized format, scholars must choose among the variety of mutually incompatible schemes, or develop new ones from scratch. Because existing schemes often reflect the research interests of their originators and the peculiarities of the texts they studied, scholars beginning new projects often find them ill-suited to their own needs and elect to develop new schemes, at great cost of time and effort. Because many texts are encoded for specific investigations and are made available to other researchers only as an afterthought, if at all, the practice followed in encoding them is often documented poorly or not at all. Later users of these texts must then decipher the encoding or ask the originator for more details, instead of proceeding to their own analysis. Even more important in the long run, the lack of a common encoding scheme causes wasteful duplication of effort: when scholars using one type of computer or one set of software are unable to use—or sometimes even to read—texts encoded with other software on other machines, they must keyboard or scan them again.
A common text encoding scheme developed for the needs of scholarly research could eliminate or minimize many of these problems. Scholars would not have to develop personal encoding schemes; documentation of the encoded text would be simpler; duplication of effort would be reduced. Wide acceptance of a common format would encourage software developers to accommodate that format, thus making it possible to use the same text with many different software systems—not possible now without substantial changes to the tagging within a text. Obviously, in a world of limited resources for textual research, both large text archives and individual scholars stand to gain from a standard or normalized practice.
No existing encoding scheme is likely to gain acceptance as a standard. Though existing research-oriented encoding formats serve the needs of the projects for which they were developed, none is sufficiently flexible or generalizable to apply to the encoding of textual materials across the full spectrum of applications and research interests. Some are outdated schemes, based on 80-column card input or other obsolete technologies: they rely on the sequence numbers of punched cards to provide structural information, or depend upon the characteristics of specific hardware devices. Other schemes are too intimately connected with the peculiarities of the text corpus they were designed for to be readily applicable to other texts or text types. They may assume, for example, that no structural hierarchy has more than three levels (so that book, chapter, verse may be tagged easily, but not poem, canto, stanza, verse), or that no encoding will require more than twenty different types of tags. Encoding schemes developed by the data-processing industry tend to lack both portability and the generality required for a research-oriented text format. They may make no provision for structural hierarchies at all, or place the burden for such niceties upon the user, who must write extensions to the software to produce the desired effects. Quite frequently, commercial schemes encode only the layout of the text on the page, forcibly confusing (for example) words italicized for emphasis, words in a foreign language, and words italicized as part of a bibliographic citation. And it must be admitted that some existing schemes, both commercial and academic, are obscure in operation, difficult to learn, and rebarbative in appearance.
Textual computing needs a single, easy-to-use, and flexible scheme, suitable for encoding all types of textual materials and accommodating a wide variety of scholarly research interests. No one scheme among those now existing, however, can be found to unite all these attributes. If textual scholarship is to have a single common format for machine-readable texts, it must be developed.

1.2. Previous Efforts

Scholars have tried in the past to create a standard form of text encoding, or a common format in which to exchange existing texts. Ten years ago, with the support of the National Endowment for the Humanities, a conference of North American experts convened in San Diego to discuss text encoding standards and related issues. In 1980, the European Science Foundation sponsored a meeting on the same issues in Pisa, in conjunction with a conference on computerized lexicography. Neither of these earlier efforts led to any substantial agreement, still less to a common format for texts in computer-readable form. At the San Diego meeting, many participants agreed in principle on the need for a common encoding scheme, but dissension concerning the details of the encoding scheme was so great that the project was ultimately abandoned. At the Pisa meeting, the participants concurred on the need to work for “normalization” of practice in lexical databases and text corpora, but were unable to agree to the term “standardization,” largely because some feared the loss of local decision-making responsibility. Some cooperative efforts among European centers have resulted from the Pisa meeting, but no common text format is being developed.

1.3. The Current Situation

The situation is more promising now. Earlier efforts underscored the need for a common encoding scheme, but failed to generate sufficient consensus on first principles. At the planning conference for the current effort, by contrast, over thirty representatives of universities, professional organizations and text archives agreed not just on the need for common practice, but also upon basic principles to govern the guidelines for encoding and exchange, and upon an organizational structure to continue the work. This consensus is the result of several key factors: first, as time passes, more is known about the problems of text encoding and basic principles become clearer. Second, even though funding limitations and a compressed timetable made it impossible to invite everyone with a major potential contribution to the effort, nevertheless the Vassar conference succeeded in bringing together more representatives of key organizations and active research centers than had ever met in one place before to discuss these problems. Third, the recently developed Standard Generalized Markup Language (SGML), defined by the international standard ISO 8879, appears to provide an invaluable tool for developing a simple, flexible, extensible encoding scheme capable of satisfying the widely varying needs of textual researchers. And finally, the newly achieved consensus also reflects the growing urgency of the need. At the San Diego and Pisa meetings, it was predicted that if the humanities computing community did not adopt common practices for text encoding, chaos would ensue. At the Vassar meeting, no one needed to predict chaos: it is, as several speakers observed, the status quo.
As more and more new practitioners enter the field of computer-assisted textual study, the chaos grows ever wilder. With the declining price of optical scanners for reading large volumes of text into electronic form, the number of research texts available electronically will grow geometrically. The increasing availability of CD-ROM for mass storage and the boom in projects to create CD-ROM for wide distribution of textual data will complicate matters further, since the texts on CD-ROM are frozen and cannot be reformatted to suit a different text-analysis program. [3] Several projects are already underway to encode and distribute massive data bases of texts via CD-ROM, each proceeding or planning to proceed with its own encoding scheme. Together, the scanner and the CD-ROM promise to aggravate the problem of anarchic encoding practices by several orders of magnitude within a very few years unless action can be taken soon.
The consensus on goals and methods achieved at the Vassar planning conference indicates that the text-computing community is sufficiently alarmed by the current and expected situation to set aside past differences and begin work immediately, in earnest, towards the establishment of common ground for text encoding and text interchange. The costs of failure are all around us, and grow higher as time passes; we believe the humanities research community must act as quickly as possible to build a consensus on the basis of the discussions and agreements made at Vassar, and to formulate a concrete set of guidelines for text encoding and text interchange, in the context of scholarly research and teaching.

1.4. Impact

The impact of this project on humanities computing will be substantial. At this time several projects to put massive amounts of text (including literary texts, bibliographies, and dictionaries) into machine-readable form are in the planning stages. [4] Many are intended to be made widely available on CD-ROM. Most of these projects are developing encoding schemes for their data completely in isolation from one another. Without guidelines to suggest a common encoding scheme, each project is thus certain to develop a scheme quite different from, and very likely incompatible with, those developed in all the other projects. This will be true even if all of these projects base their work on SGML, since SGML defines no specific encoding scheme but only a framework within which encoding schemes (“tag sets”) may be developed. Existing SGML applications like IBM GML, Waterloo GML, or the Association of American Publishers Electronic Manuscript Project tag set do not provide an adequate basis for normalization of practice, as they restrict themselves to the most common textual structures and do not cover the encoding needs of most texts intended for scholarly research.
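The distinction between framework and tag set merits a brief illustration. In the following sketch (purely hypothetical: the element names novel, chapter, heading, and p are invented here, and defining the actual tag set is precisely the task of the working committees), SGML's document type declaration defines the elements and their permitted structure, and the text is then tagged accordingly:
    <!DOCTYPE novel [
    <!ELEMENT novel   - - (chapter+)    >
    <!ELEMENT chapter - - (heading, p+) >
    <!ELEMENT heading - - (#PCDATA)     >
    <!ELEMENT p       - O (#PCDATA)     >
    ]>
    <novel>
    <chapter><heading>Loomings</heading>
    <p>Call me Ishmael. ...
    </chapter>
    </novel>
Two projects may thus both conform to SGML and yet define wholly different, mutually unintelligible tag sets; conformance to SGML alone guarantees nothing about interchangeability.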
If a common encoding scheme existed, the effort of creating an encoding format for specific projects would be minimized. Furthermore, the materials created by these projects would be in a uniform format, comprehensible to anyone familiar with the single, accepted encoding scheme. Even more important, we can assume that the existence of a common format will prompt software developers to accommodate it. [5] The materials created by projects over the next decade could therefore serve as input to as-yet undeveloped software designed for any number of text-analytic tasks. If both the creators of scholarly textual materials and software developers adopt a common encoding format, the texts will be widely usable, with no modifications save the addition of tags for specialized purposes, with any software package accommodating the common scheme. [6] The common scheme, which will provide a syntax as well as a tag set for almost all applications, will also facilitate specialized modifications when they are necessary.
In short, the development of a common encoding scheme for textual data will have profound effects on humanities computing, both by eliminating the need to create encoding schemes within specific projects, and by nurturing an environment where machine-readable texts will be usable, and distributable for use with, software which performs a wide variety of formatting and analytic tasks.

2. History and Current Status of the Project

The current project began in 1987 on the initiative of the Association for Computers and the Humanities, which proposed a five-phase project to prepare guidelines for text encoding and text interchange. (See Appendix A for description.) The first phase called for an international planning conference to discuss issues of content and feasibility and to work out organizational arrangements among the groups interested in participating in the effort. The later phases (discussed in more detail below) called for the drafting, circulation and revision, approval, and publication of the guidelines.
The planning conference, funded by the National Endowment for the Humanities, was held at Vassar College in Poughkeepsie, New York, on November 12 and 13, 1987. Thirty-one experts from universities, learned societies, and text archives in North America, Europe, Israel, and Japan met for intensive discussion of the desirability, feasibility, and basic principles of a common set of guidelines for machine-readable text encoding. (A list of participants is in Appendix B.) The firm consensus was that the textual computing community emphatically needs a common format for the interchange of existing data, and that individual scholars and projects alike need recommendations for minimal text encoding practices, as well as the facility to extend those minimal practices to cover special problems of interest to individual researchers. The group further agreed that such a common framework was not only necessary but feasible, and agreed, after vigorous discussion, on several basic principles to govern the scope and the organization of a set of guidelines for encoding textual materials.

2.1. Recommendations of the Planning Conference

After lengthy discussions of the basic purposes and functions of the guidelines, the syntax of the scheme they should propose, and the organization of the drafting process, the participants at the planning conference recorded their consensus in the following closing statement:
  1. The guidelines are intended to provide a standard format for data interchange in humanities research.
  2. The guidelines are also intended to suggest principles for the encoding of texts in the same format.
  3. The guidelines should
    1. define a recommended syntax for the format,
    2. define a metalanguage for the description of text-encoding schemes,
    3. describe the new format and representative existing schemes both in that metalanguage and in prose.
  4. The guidelines should propose sets of coding conventions suited for various applications.
  5. The guidelines should include a minimal set of conventions for encoding new texts in the format.
  6. The guidelines are to be drafted by committees on
    1. text documentation
    2. text representation
    3. text interpretation and analysis
    4. metalanguage definition and description of existing and proposed schemes,
    coordinated by a steering committee of representatives of the principal sponsoring organizations.
  7. Compatibility with existing standards will be maintained as far as possible.
  8. A number of large text archives have agreed in principle to support the guidelines in their function as an interchange format. We encourage funding agencies to support development of tools to facilitate this interchange.
  9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.
Three overall goals were thus defined for the guidelines:
  • To specify a common interchange format for machine readable texts.
  • To provide a set of recommendations for encoding new textual materials.
  • To document the major existing encoding schemes, and develop a metalanguage in which to describe them.
Consensus was also reached on a possible syntactic basis for the guidelines and the organization of the work.

2.2. Common Interchange Format for Machine-Readable Texts

Because some archives already hold large quantities of data, and even more because many have made substantial investments in locally developed analytic software keyed to the specific formats they have devised for their data, archivists generally have neither desire nor motive to convert their existing holdings to a new format for internal storage. Existing archives will not necessarily be able to use the new scheme even for new texts, which they will need to encode for consistency with their existing holdings. The text archivists at the Vassar conference nevertheless saw a pressing need for a common format for data interchange. Such a format would allow each archive to write software to convert its present data formats into the common interchange format before sending texts to other users, and to convert incoming texts from the common interchange format into the form around which the archive and its specialized local software are built. This procedure, requiring translation to and from only one lingua franca, would constitute a dramatic improvement over the Babel of current practice, which requires translation to and from many external formats.
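The arithmetic behind this improvement is easily stated. If N archives each maintain a distinct internal format, direct pairwise exchange requires in principle N(N-1) translation programs, one for each ordered pair of formats; exchange through a single interchange format requires only 2N, one converter into and one out of the common format at each site. For thirty sites, that is the difference between 870 programs and 60, and every format added to the pairwise scheme multiplies the burden further.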

2.3. Recommendations for Encoding New Textual Materials

The guidelines must also provide recommendations to those who are encoding texts for the first time and are not required to conform to an existing scheme, to assist them in deciding what textual features to encode and how to encode them. Inflexible requirements cannot be formulated because the varieties of both textual materials and research interests defy exhaustive classification or prescription. Still, recommended practices reflecting the consensus of informed scholarship can improve data compatibility and help ensure higher data quality in the future. Such recommendations, as noted already, will not in any sense be binding upon the practice of existing archives or commit them to the slow, costly conversion of their holdings into the new format. But as one participant in the planning conference observed, the texts already available in machine-readable form—hundreds or perhaps thousands of millions of words, all told—will represent only a drop in the ocean of texts that will be encoded within the next ten to fifteen years. Since many of those texts will be encoded by scholars or new research centers without an investment in any existing scheme, the new guidelines should explicitly recommend the encoding of specific minimal textual features about which the community of users can achieve consensus. [7]
Desirable as standardization is, the requirements of textual research vary with the researcher and the text. No simple set of absolute requirements can be applicable to all texts or purposes, and the idea of a set of absolute requirements was eventually set aside as being reductive and simplistic. The guidelines will recommend encoding the textual features commonly found useful in more than one kind of analysis, but even these commonly useful features will not be requirements of the scheme. With the scheme, as without it, individual scholars will be faced with the inescapable necessity to take responsibility for the useful encoding of their individual texts.
In addition to the minimal recommended basic encoding, the guidelines will define sets of textual features relevant to specific disciplines or text types, and provide techniques for representing them. These extensions to the basic encoding will be strictly optional and no recommendation that they be universally encoded will be implied. For scholars interested in a given type of textual problem, however (e.g. lexical analysis, textual criticism, thematic study, etc.), these extensions will provide a basis for compatible encoding of similar information by different projects, and thus for easy interchange of the encoded texts.
For most types of textual study, the extensions will present a formal syntax for representing salient textual features or analytic categories, without further complications. For example, the extension for metrical study would provide methods for encoding the number of syllables in a given line, the scansion(s) suggested for the line, the metrically significant features (e.g. stress, vowel length) of syllables, and so on. For manuscript criticism and the encoding of variant readings, methods must be developed for representing variants, relating them to the lemma in the text, classifying the variation, indicating the manuscripts, and so forth. For such extensions to be successful, they must be based on a thorough understanding of the intellectual problems involved, and their preparation is expected to involve significant research.
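As a purely illustrative sketch (the tag and attribute names line, syll, met, app, lem, rdg, and ms are invented here; the working committees will determine the actual conventions), a metrical extension might permit encodings such as:
    <line n="1" syll="10" met="x/x/x/x/x/">Shall I compare thee to a summer's day?</line>
and a text-critical extension might record a variant reading, its manuscript source, and its relation to the lemma along lines such as:
    <app><lem>forest</lem><rdg ms="B">forrest</rdg></app>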
In many types of textual analysis, of course, the textual features studied vary widely depending upon the theoretical orientation of the researcher. [8] For the most part, the guidelines we propose to develop are not the appropriate forum for airing theoretical issues, and the guidelines will perforce take an eclectic position, providing conventions for encoding whatever textual features the individual scholar does wish to consider. For a few types of analysis, those most frequently performed in literary and linguistic computing, it will be worthwhile to attempt more than eclecticism. In these cases an effort will be made to establish a dialogue among representatives of various schools and work for agreement on some recommended minimum content and a common polytheoretical basis for encoding, at least to the extent of making it easier to exchange data between projects working on different theoretical lines. (Such harmonization of diverse theoretical bases is of course not guaranteed to be successful, and will be attempted only where adjudged wise.)

2.4. Metalanguage and Documentation of Major Existing Encoding Schemes

A third important task is to document, in a single place and a single vocabulary, the major existing schemes for text encoding. Information of this type will have immense value both for those attempting to use and interpret texts encoded in one of these existing schemes, and for those familiar with existing schemes who want a concise account of their differences from the new format. Reliable knowledge of existing encoding schemes is essential, of course, for planning any interchange format, and will also assist in the formulation of recommended encoding practices.
Moreover, the computational linguists at the planning conference observed that if existing encoding schemes can be described not only in prose but in a formal language, it will be possible to generate, from the formal description of an encoding scheme, a program to translate texts from that scheme into the common interchange format. Because of the advantages of such automatically generated translation programs, the group as a whole agreed that a special committee should undertake to develop a “metalanguage” for the description of encoding schemes and to formulate descriptions in that metalanguage of the major existing schemes.
To develop the metalanguage, it will be necessary both to perform an extensive survey of current and past practices in scholarship and in computerized typesetting, [9] and to explore the formal properties of the markup conventions they use. After isolating, in an analytic phase, the minimal set of syntactic features required to describe the various existing text formats, the working party responsible for metalanguage development must synthesize a formal language to describe the syntactic and semantic features of the existing schemes and ensure that the syntax of the new encoding scheme is capable of expressing the distinctions and regularities expressed by the older schemes without information loss. Both of these phases will require significant knowledge of formal language theory; the task of creating a formal language with sufficient descriptive power may prove to be challenging and is expected to constitute an interesting research problem in this field.
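To make the idea concrete, consider the COCOA-style reference <C 1> shown earlier. A metalanguage description of that convention might, under the assumptions sketched here, take a grammar-like form such as:
    reference  ::= "<" category blank value ">"
    category   ::= letter+
    value      ::= (letter | digit)+
together with a statement of its meaning (a reference assigns the given value to the given category, e.g. chapter 1, until superseded by the next reference for that category). From descriptions of this kind, translation programs into the interchange format could be generated mechanically, much as parsers are generated from grammars by compiler-writing tools. This notation is invented here purely for illustration; designing the actual metalanguage is the task of the committee.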

2.5. Committee Organization

Subsequent discussions at the planning conference addressed the overall plan of work required to develop the guidelines.
Since the problems and aims of text encoding vary with the type of text encoded and with the discipline of the encoder, while many problems of a formal text-representation scheme cut across disciplinary borders, the division of labor for drafting a scheme cannot perfectly mirror the logical organization of the scheme itself. After clarifying these points, the planners agreed to divide the work on pragmatic grounds among four committees:
  • committee on text documentation
  • committee on text representation
  • committee on text interpretation and analysis
  • committee on metalanguage issues

2.6. Syntax of the Scheme

No final decision about the syntactic basis for the new text encoding scheme was made at the planning conference. All present agreed that if possible the syntax should be borrowed from some existing scheme, rather than created out of whole cloth. The syntax must be relatively simple, capable of expressing the fine distinctions and occasionally complex overlapping hierarchical structures required by textual material, and must allow for user-defined extensions to the pre-defined set of tags.
The most obvious candidate is the Standard Generalized Markup Language. SGML forms the basis for the recently developed markup scheme of the Association of American Publishers Electronic Manuscript Project, and a survey of encoding problems recently performed at Queen's University concluded that SGML offered a better basis for research-oriented text encoding than anything else currently available. [10] SGML and markup languages built in conformity with it have already proven flexible and powerful; while existing SGML applications do not provide markup tags for many textual features needed for research, most of what they do provide is also needed for research work. It would be pointless to develop again from scratch what has already been developed during years of study and testing. Some participants at the planning conference asked whether SGML, with its reliance on simple hierarchies of text structures, can handle the multiple conflicting hierarchies of textual research (e.g. canto, stanza, line versus poem, sentence, word). Those versed in SGML believed it would meet the needs of the case, and it was decided to begin work on the syntax with SGML as a model, abandoning it only if it proved inadequate for the needs of research.
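Two devices available in principle suggest how this need might be met. One, an optional feature of SGML itself (CONCUR), permits a document to be marked up against more than one document type concurrently. The other encodes one hierarchy with ordinary container elements and marks the boundaries of the second with empty "milestone" tags, which, having no content of their own, cannot conflict with the first hierarchy. In the following purely illustrative sketch (the names stanza, line, and sb are invented here), the verse hierarchy is primary and sentence boundaries are flagged with the empty element sb:
    <stanza>
    <line>Sentence one ends here. <sb n="2">Sentence two begins</line>
    <line>and runs on into the second line.</line>
    </stanza>
Neither device is without cost, and testing their adequacy against real texts would form part of the committees' work.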
The closing statement expresses this commitment as part of the general goal of compatibility with existing standards.

2.7. Sponsorship and Participation

During preparations for the Vassar conference, three major international organizations dedicated to the exploitation of modern technology in humanistic or textual disciplines tentatively agreed to sponsor the initiative jointly. These are:
  • the Association for Computers and the Humanities (ACH)
  • the Association for Computational Linguistics
  • the Association for Literary and Linguistic Computing
At the Vassar meeting, a temporary steering committee was formed, consisting of representatives of each of these organizations, to oversee the project until a permanent steering committee could be named by the executive councils of the sponsoring associations.
Provisional agreements have also been reached for a number of other organizations to participate in the work of the initiative. At this writing (January, 1988), these organizations include:
  • the American Historical Association (AHA)
  • the American Philological Association (APA)
  • the Association for Computing Machinery, Special Interest Group for Information Retrieval (ACM/SIGIR)
  • the Association for Documentary Editing (ADE)
  • the Association of American Publishers (AAP)
  • the Linguistic Society of America (LSA)
  • the Modern Language Association of America (MLA)
These organizations are mostly North American in their base, although several are international in membership. Equivalent European societies will also be invited to participate.
Appended to this proposal (as Appendix D) are draft memoranda of understanding describing the commitments of the sponsoring and participating organizations; these will be submitted to the governing bodies of the organizations, a list of which is attached in Appendix E.

3. Plan of Work

The further work of the project corresponds to phases 2 through 5 of the original plan:
  • Phase 2: Detailed Design and Drafting (lower-level overall design, followed by drafting)
  • Phase 3: Revision (public circulation of drafts, public comment, and revision)
  • Phase 4: Review and Approval
  • Phase 5: Publication and Maintenance
The tasks of each phase will be undertaken as appropriate by a steering committee of representatives from the sponsoring organizations, an advisory board of representatives from each participating organization, editors appointed by the steering committee, and drafting committees appointed by the steering committee upon the nomination of participating organizations.

3.1. Organization

3.1.1. Sponsoring Organizations

The sponsoring organizations (those applying for this grant) will undertake the project of developing and disseminating the guidelines for text encoding and text interchange. The terms of their cooperation in the project are defined by a “Memorandum of Understanding” (reproduced in Appendix D), which has been prepared by the steering committee and submitted to the sponsoring organizations for formal approval. The sponsoring organizations will exercise their responsibility for the project through the steering committee, which is composed of two representatives from each sponsoring organization.

3.1.2. Participating Organizations

The participating organizations include major organizations for literary, linguistic, and humanities research, computer science, and data processing. They endorse the idea of a common encoding scheme and interchange format and participate in an advisory capacity as to the content of the guidelines and their suitability for the needs of the organizations' members. The participating organizations will influence the guidelines both through the work of their members on drafting committees and subcommittees and through the participation of an official representative on the project's advisory board.

3.1.3. Steering Committee

The steering committee is appointed by the sponsoring organizations, each of which is represented by two members. It will oversee the project on behalf of the sponsoring groups, appoint and supervise the editors and committee heads, and appoint voting members of the working committees (who may be nominated by participating organizations or by the steering committee itself). The steering committee will meet eight times during the project, in addition to meeting twice with the advisory board. The editors will also attend these meetings, at which issues of design, strategy, and (where appropriate) funding will be considered. The steering committee has thus far met once, at the Istituto di Linguistica Computazionale in Pisa, in December of 1987. A second meeting is planned for March 12-13, 1988, in Morristown, New Jersey. The Association for Computers and the Humanities is represented on the steering committee by Prof. Nancy Ide of Vassar College and Dr. C. M. Sperberg-McQueen of the University of Illinois at Chicago. The Association for Computational Linguistics is represented by Dr. Donald Walker and Dr. Robert Amsler, both of Bell Communications Research, Morristown, New Jersey. The Association for Literary and Linguistic Computing is represented by Ms. Susan Hockey of Oxford University and Prof. Antonio Zampolli of the University of Pisa. Abbreviated vitae of the steering committee members are attached to this document as Appendix F.

3.1.4. Advisory Board

The advisory board will comprise the steering committee, the editors, and one representative of each participating organization. It will consider questions of basic principle and ensure that the interests of the participating organizations and their members are adequately addressed. Individual members of the advisory board will act as links between their organizations and the working committees and circulate progress reports and interim drafts within their organizations. Members of the advisory board will be named by the participating organizations in the manner each chooses.

3.1.5. Editors

The editor in chief, appointed by the steering committee and serving half time, will coordinate the day-to-day work of the project and act as its administrative head. The steering committee expects also to appoint an associate or consulting editor to share the editorial tasks. However, such an appointment depends upon the availability of qualified individuals, which cannot be determined at this time. The budget included in this proposal requests funding for the consulting editor at one quarter time. The editors will:
  1. ensure the proper circulation of documents among those working on the project, and perform other duties of a scientific secretariat;
  2. coordinate the work of the four working committees and their subcommittees, serving as liaison among the committees and between the committees and the steering committee;
  3. draft the documents describing the organization of committee work, in conjunction with the steering committee;
  4. draft the basic charge or list of responsibilities for each committee;
  5. receive from the committees the results of their work with all the relevant information, and ensure their compatibility with one another;
  6. review and edit the final document, integrating the work of the various committees into a single coherent whole and rewriting portions of the document as necessary so as to ensure consistency; subject to the guidance of the steering committee, the editors are responsible for the wording of the final documents;
  7. administer the paperwork of the initiative, and distribute travel funds and subsidies; and
  8. coordinate the effort with members of the advisory committee and the associations they represent, to ensure widespread participation and publicity.
In all of these tasks the editors will work under the supervision of the steering committee, and with its help. The editor in chief will be Dr. C. M. Sperberg-McQueen of the University of Illinois at Chicago. Dr. Sperberg-McQueen was unanimously selected as editor in chief by the other members of the steering committee. Centrally involved in the preparation of the initial proposal to NEH to fund the Vassar conference, he was also the primary author of two documents prepared and distributed as background for the meeting, one of which formed the basis of the design for the guidelines ultimately accepted by the participants. His leadership skills, exhibited in preparation for the meeting and in the discussions there, promise well for his coordination of the project. Since November, he has served as secretary of the steering committee. With a doctorate in comparative literature and years of experience in academic computing, Dr. Sperberg-McQueen is one of the still small group of scholars in this country with substantial backgrounds in both a humanities discipline and computing. This combination led to his being invited, in January, 1985, to become Princeton University's first consultant in humanities computing. While at Princeton, he worked on (among other things) the collection of textual data for users, the analysis of texts in French, German, and English, and the installation and use of concordance and text-analysis software. He provided technical support for an ambitious long-term project to encode the thousands of historical texts of the Cairo Geniza—a complex corpus of medieval and modern texts and fragments which documents centuries of life in the Hebrew and Islamic communities of Cairo. For the Geniza project Dr. Sperberg-McQueen helped develop software for Hebrew-Arabic-English text editing, and procedures for encoding and manipulating these complex texts. Dr. Sperberg-McQueen is currently a systems programmer at the University of Illinois at Chicago, where he supports the library automation system. He also provides technical support for faculty research projects, including ones on Greek epigraphy and Roman law. He is a member of the Executive Council of the Association for Computers and the Humanities and serves on the ACH Committee for Text Encoding Practices. A vita is attached in Appendix F.

3.1.6. Committee Heads

The steering committee will appoint heads for the four working committees, who will organize the work and meetings of the committees and help ensure the intellectual soundness of the committees' work. The committee heads will report regularly to the editors and steering committee, cooperate with the editors and other committee heads, and organize ad hoc subcommittees, as necessary, to deal with topics too specialized to be handled by the committee as a whole. They will be responsible for arranging meetings of their committees, drafting summary documents representing the work of the committee, and ensuring that the work of the committee meets the project schedule. The committee heads occupy a key position within this project: they oversee the development of the standard within a specified area, they draft the documents reflecting committee decisions, and they are in large part responsible for ensuring that the committee's work continues in a timely fashion. These activities will demand a substantial commitment of time to the project. Therefore, funds permitting, quarter time release from their other duties will be obtained for the committee heads.

3.1.7. Drafting Committees

Voting members of the four committees called for by the Vassar planning conference will be nominated by the participating organizations or the steering committee and appointed by the steering committee. The committees will consider specific problem areas and recommend specific text encoding practices to solve them. The deliberations and analyses of these committees constitute the crucial work of this project, and the organizational structure is set up to assist in that work as far as possible. The four committees and their areas of responsibility will be:
  • Committee on Text Documentation
    This committee will address problems of labeling an encoding so that its source text and other identifying characteristics are well documented. The needs of library cataloguing, archive documentation, end users, and processing programs will all be considered. The computational documentation of the file (in the form of declarations, etc.) may also be considered by this committee, but the major responsibility for the content of declarations will be borne by the committees on text representation and analysis and, for the syntax of the declarations, by the committee on metalanguage issues.
  • Committee on Text Representation
    This committee will address the problems of representing in machine-readable form (1) the physical aspects of a copy source, (2) all information explicitly present in the copy text on the physical or graphetic level, and (3) all the textual features (e.g. emphasis, words in other languages, basic text structure) conventionally represented by the typography of a printed edition, whether present in the copy text or added by the encoder or later analysts.
    Topics included in the field of this committee thus include the marking or encoding of:
    • quotations
    • mathematical formulas
    • figures, tables, and illustrations and their captions
    • hyphenations (including declaration of how hyphenation is treated)
    • punctuation
    • diacritics and 'special' character sets
    • change of language or alphabets
    • the conventional use, in a given encoding, of characters as alphabetics, punctuation, diacritics, or separators
    • topography or layout of the text
    • recto and verso, color of page, etc.
    • logical structure of a text (chapters, paragraphs, etc.)
    • conventional reference numbers for a text
    • lineation (on page, in column, in logical subdivision, etc.)
    • editorial additions, deletions, or corrections
    • editorial apparatus (apparatus criticus)
    • special problems of numismatic, epigraphic, or paleographic material
    • special problems posed by the physical realization of a genre (e.g. comic strips).
    Unlike those of other specific genres, the special problems of spoken texts and of dictionaries will not be taken up here but in the committee on text analysis and interpretation. It should be noted that the distinction between this committee and the next is not that between objective and subjective or interpretive information, since the typographic level of the text addressed by this committee can indicate specific editorial interpretations of the text by means of font, layout, or special punctuation.
    As noted, the voting members of the working committees will be formally appointed by the steering committee after the first meeting of the advisory board. At the Vassar planning conference, however, the following participants expressed an interest in working on the problems assigned to this committee:
    • Lou Burnard, Oxford University
    • David Chesnutt, University of South Carolina
    • Yaacov Choueka, Bar-Ilan University
    • Jacques Dendien, National Institute of the French Language
    • Paul Fortier, University of Manitoba
    • Randall Jones, Brigham Young University
    • Terence Langendoen, Graduate Center, City University of New York
    • Junichi Nakamura, Kyoto University
    • Wilhelm Ott, University of Tübingen
    • Eugenio Picchi, Institute for Computational Linguistics, Pisa
    • Jean Schumacher, CETEDOC, Louvain-la-Neuve
    • Paul Tombeur, CETEDOC, Louvain-la-Neuve
  • Committee on Text Analysis and Interpretation
    This committee will address problems of representing, in machine-readable form, the results of interpretive and analytic work by scholars. A full list of the possible areas of application that fall under this heading will be developed in the early stages of this project; a short list of examples includes
    • phonology
    • morphology
    • syntax
    • stylistics
    • metrics
    • information retrieval
    • thematic study
    • semantics
    • content analysis
    • lexicography.
    For pragmatic reasons, problems peculiar to some specific text types will also be handled here, wherever texts of that type are typically encoded primarily by scholars interested in a specific type of analysis. Most notably, transcripts of oral speech, dictionaries, and glossaries, which are of paramount interest to computational linguists and lexicographers, will be treated by this committee rather than the committee on text representation.
    The porous borderline between representation and interpretation of texts will require careful coordination between this committee and the preceding one on issues of content. The presentation of statistical summaries and certain kinds of textual-critical analysis are two of the more obvious borderline cases that must receive special attention.
    At the Vassar planning conference, the following participants expressed an interest in working on the problems assigned to this committee:
    • Lou Burnard, Oxford University
    • Roy Byrd, IBM Research
    • Nicoletta Calzolari, University of Pisa
    • David Chesnutt, University of South Carolina
    • Yaacov Choueka, Bar-Ilan University
    • Paul Fortier, University of Manitoba
    • Robert Kraft, University of Pennsylvania
    • Stig Johansson, University of Oslo
    • Ian Lancashire, University of Toronto
    • Terence Langendoen, Graduate Center, City University of New York
    • Penny Small, Rutgers University
    • Paul Tombeur, CETEDOC, Louvain-la-Neuve
  • Committee for Metalanguage Issues
    This committee is charged with several tasks:
    • They will examine existing encoding schemes, and in particular SGML and the Association of American Publishers' standard for electronic manuscript markup, to determine the degree to which the syntax of the markup scheme recommended by the guidelines can be compatible with, or an extension of, these schemes. Compatibility with the existing international standard is a major desideratum of this project; clearly, there is no need for our project to duplicate the work already done on SGML, the design of which has been carefully developed over the past several years.
    • They will develop a formal metalanguage for the description of encoding schemes, and formulate in it adequate descriptions of the major existing schemes. The purposes of this metalanguage are (1) to ensure that the encoding scheme proposed by the guidelines is compatible with existing schemes, in the sense that anything expressed in an existing scheme is translatable into the new scheme; and (2) to provide, through the metalanguage, a formal mechanism to simplify the design of programs to translate from existing schemes to the new scheme.
    • Because this group will have special skills in formal language theory, its members will also be active in formulating specifics of syntax for the new encoding scheme (for example, specifying the form to be taken by the declaration of new tags or multiple character sets) and ensuring its notational extensibility. As noted already, SGML and the Association of American Publishers tag set will be carefully considered as models for the syntax of the new scheme.
    At the Vassar planning conference, the following participants expressed an interest in working on the problems assigned to this committee:
    • David Barnard, Queen's University
    • Lou Burnard, Oxford University
    • Paul Fortier, University of Manitoba
    • Eugenio Picchi, Institute for Computational Linguistics, Pisa
    • Jean Schumacher, CETEDOC, Louvain-la-neuve
The four committees are expected to meet six times each during the project, with the exception of the committee on text documentation, which will meet only twice owing to its more narrowly circumscribed duties. Several face-to-face meetings are imperative if the work of the committees is to move forward: the discussions they make possible are essential to ensure that the guidelines address the needs of specific applications adequately and appropriately. Wide, vigorous discussion also helps ensure widespread acceptance of the guidelines after their completion. While we expect that between meetings much of the work of the committees will take place by means of electronic mail, the need for face-to-face exchanges cannot be overemphasized. The work of the Vassar meeting could not have been accomplished without the give-and-take of vigorous discussion, which typically involved the majority of the participants commenting and responding on any given point. The decisions facing the committees will often be difficult, indeed possibly controversial; broad discussion is the only reliable way to ensure that these decisions accurately reflect the needs of the research community.
Substantial subsidies for the work of these committees are not feasible; much of the work will have to be voluntary and unremunerated. Where possible, meetings will be held in conjunction with major conferences in the field, so that travel expenses are minimized. To encourage proper professional credit for the work performed in the committees, the steering committee will arrange, where possible, for the publication of working papers from the committees in the journals of the field. Thus far the editors of several scholarly journals associated with the sponsoring societies (Computers and the Humanities, Literary and Linguistic Computing, Computational Linguistics, and Linguistica Computazionale) have expressed a willingness to publish such papers, where appropriate. For other papers, the steering committee will issue a series of working papers, to be published in the manner of technical reports.

3.1.8. Subcommittees

Where the charge to a committee is broad enough to suggest or require it, the committee head may organize subcommittees to consider special problems, e.g. those of a specific discipline, or those of texts in a specific language or script. The membership of these subcommittees will be open to all volunteers; there is no expectation that subcommittee members will be voting members of the parent committee.

3.2. Phases of the Work

3.2.1. Lower-level Design

The first task in the actual preparation of the guidelines will be to extend the principles enunciated at the Vassar conference, and to prepare a more detailed design for the encoding scheme. The editors will prepare, in cooperation with the steering committee, a document describing the design goals and principles for the scheme; this basic design document will then be circulated to all participating organizations and other interested groups or individuals for comment, after which the steering committee and editors will revise it. While the basic design document is circulating for public comment, the committee on metalanguage issues will develop a preliminary syntax for the encoding scheme, in order to define the syntactic framework within which the content-oriented committees on text documentation, text representation, and text analysis will work. As stated above (sec. 2.6), it is expected that the syntax of the encoding scheme will conform in general with the international standard SGML, except where deviations may be necessary to accommodate the particular requirements of historical, linguistic, and literary textual analysis.

3.2.2. First Meeting of the Advisory Board

When the preliminary syntax has been completed, and the basic design document has been revised, the steering committee will convene an inaugural meeting of the Advisory Board. The basic design principles and syntax will be presented, discussed, and modified as necessary prior to their approval by the advisory board. The advisory board will also discuss the organization of the four drafting committees, so that the steering committee, in a meeting immediately following the advisory board's, can proceed with the constitution of the drafting committees.

3.2.3. Drafting

Once the basic design goals and syntactic framework have been set, the various drafting committees will begin their work. The editors and steering committee will by this time have prepared a formal charge or brief for each committee, delineating its field of responsibility and recommending attention to specific problem points. The committee heads will begin the committee work by drafting an analysis of the problem area, to serve as a starting point for discussion and further work. The detailed organization of the drafting process is best left to the committee heads, but it is expected that individual subcommittees will draft working papers on areas of specific concern, each working paper analyzing the textual features relevant to the area and proposing a set of SGML tags for encoding them. For such working papers to be successful, a great deal of earlier work must be reviewed and synthesized into a general theoretical analysis. Most of the committee work can reasonably be done by mail, whether conventional or electronic, but several committee meetings will be held, when possible in conjunction with major academic conferences. (As noted above in section 3.1.7, the committees are expected to meet six times each, except for that on text documentation, which will meet twice.) The editors will attend whenever possible, in the interests of coordinating the work of the various groups; since the work of many of the areas falling under different committees will obviously overlap, such coordination efforts are essential. Full reimbursement of participants' travel costs to committee meetings would be prohibitively expensive, but this proposal does budget funds for travel subsidies for members of working committees who would not otherwise be able to attend the meetings.
3.2.3.1. Electronic Conferencing
In addition to conventional and electronic mail, electronic conferencing will also be used to speed the drafting process. The University of Illinois at Chicago has agreed to sponsor one or more electronic conferences for the discussion of issues raised in the text encoding initiative. These conferences will be based in the academic network Bitnet and will be open to all interested parties. Experience in this project and others shows that such electronic discussions can materially aid in the preparation of complex documents.

3.2.4. Public Review and Comment

This phase overlaps with the preceding. As the working committees prepare their recommendations, interim drafts will be circulated for comment among the committee heads and the advisory board. The advisory board will be responsible for passing the drafts on to individuals or committees in the participating organizations who will be competent to comment on the substance of the draft. Drafts will also be posted on the electronic conferencing systems already mentioned, and comments will be solicited from participants in the conference. When a more nearly final form of the guidelines is ready, it too will be circulated to the advisory board and participating organizations and will be posted electronically. Additionally, the existence of a complete draft of the guidelines will be publicized by the sponsoring and participating organizations, and copies sent to all inquirers for comment.

3.2.5. Revision

The comments on the complete draft of the guidelines will be considered by the steering committee, editors, and working committees, and the guidelines will be revised accordingly.

3.2.6. Approval

The revised draft will then be submitted to the advisory board for discussion, final amendment, and approval. This final discussion will take place at a second meeting of the advisory board.

3.2.7. Publication

Following the approval of the guidelines by the advisory board, the sponsoring organizations will arrange for the publication of the guidelines, either in journals or in book form. Funds for this phase are not sought at the present time.

3.2.8. Maintenance

The sponsoring organizations have agreed to maintain a mechanism for revising and extending the guidelines after their first publication, based on experience with the guidelines in practice. This joint maintenance mechanism will be responsible for issuing supplemental interpretations of the guidelines and extensions of the guidelines to further problem areas. Revisions of the guidelines as a whole may also be undertaken as appropriate. It may prove appropriate, after approval and publication of the guidelines, for the sponsoring organizations to seek adoption of the guidelines as national or international standards. On the other hand, since the guidelines are expected to take the form of a tag set defined in accordance with the Standard Generalized Markup Language (already an international standard) and may possibly function as a compatible extension of the Association of American Publishers tag set, which is now being proposed as an American National Standard, such adoption may prove neither appropriate nor necessary. In any case, a formal description of the guidelines will be prepared, structured in the fashion conventional for standards documents.

3.3. Timetable

June, 1988: NEH grant begins. Informal discussions at ALLC meeting in Jerusalem.
June - August, 1988: Editors draft detailed design goals, overview of committee responsibilities, notes on committee procedures, and charges for the individual committees. Editors collect documentation for existing encoding schemes.
August, 1988: Steering committee meets, reviews design-goal and committee-organization documents, considers personnel issues, constitutes committee for preliminary syntax.
August - December, 1988: Syntax committee (nucleus of later metalanguage committee, possibly with additions) develops preliminary SGML-based syntax for encoding scheme.
December, 1988 - January, 1989: Advisory Board meets to approve detailed design goals and preliminary syntax and to discuss committee organization. (Meeting in North America.) Steering committee appoints heads of documentation, representation, and analysis committees, and makes first appointments of committee members. Target date for completion of first collection of documentation for existing schemes; copies of documentation obtained thus far distributed to committee heads.
January - March, 1989: Committee heads write initial analyses of their problem areas (`startup papers'), to serve as starting points for the work of their committees.
March, 1989: Steering committee meets to review progress and plan work ahead. (Meeting in Europe.)
March - October, 1989: Working committees and subcommittees meet to analyse their problem areas, document and discuss existing practice, and consider possible recommendations.
June, 1989: Steering committee meets with committee heads to review progress; working committees meet for working sessions at ICCH, Toronto.
October, 1989: Steering committee meets to review progress (meeting in Europe). Committees report in writing. Target date for completion of committees' preliminary analysis of their problem areas.
October, 1989 - February, 1990: Working committees meet to develop draft recommendations from their preliminary analyses.
February, 1990: Steering committee and committee heads meet to review progress and resolve problems (meeting in North America). First draft of each committee's recommendations expected by this time.
February - March, 1990: Editors prepare combined first draft of guidelines, which circulates to advisory board and other interested parties for examination and trials on texts.
March - December, 1990: Public considers first draft of guidelines and makes suggestions. Committees and subcommittees test their draft recommendations on sample corpora, exchange their examples, and revise their recommendations in light of their experience and of comments from the public.
June, 1990: Steering committee meets. (Meeting in North America.)
October, 1990: Steering committee meets (with committee heads) to review revisions and resolve problems.
January, 1991: Revised recommendations sent from committees to editors. Editors begin preparation of final document.
March, 1991: Steering committee meets to consider editors' draft of final document and suggest revisions.
April, 1991: Editors revise draft in accordance with steering committee recommendations. Fair copies are distributed to advisory board.
May - June, 1991: Advisory board meets to approve final version. Steering committee makes arrangements for publication. Sponsoring organizations institute a joint mechanism for maintenance and revision of the guidelines.

3.4. Concrete Results

Ultimately, this project will produce a single potentially large document which will:
  • define a format for encoded texts, into which texts prepared using other schemes can be translated,
  • define a formal metalanguage for the description of encoding schemes,
  • describe existing schemes (and the new scheme) formally in that metalanguage and informally in prose,
  • recommend the encoding of certain textual features as minimal practice in the encoding of new texts,
  • provide specific methods for encoding specific textual features known empirically to be commonly used, and
  • provide methods for users to encode features not already provided for, and for the formal definition of these extensions.
In the process of preparing this document, a number of working documents (mentioned above in the Timetable) will be prepared, to document the passage from one phase of the project to another. Some of these will be of transient interest, important primarily as preliminary records of decisions about the final guidelines. Others will be works of analysis expected to have permanent utility as expositions of the reasoning behind certain portions of the guidelines. These latter will be published either in the journals of the field or in the working papers of the project. In addition, it is expected that software of various types will be developed in connection with this project: conversion programs to translate files from other encoding schemes into the new format, extensions to various common formatting programs to allow them to process texts encoded in the new tag set, and possibly various types of analytic software. These will not be deliverables of this project in any strict sense, but it is reasonable to suppose that they will be produced only if this project proceeds.
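To make the proposed formal metalanguage more concrete: if, as expected, the metalanguage is based on SGML, the formal description of an encoding scheme might take the form of SGML document type declarations. The following fragment is purely illustrative; the element and attribute names (poem, stanza, line, n) are invented for this example and prejudge none of the committees' decisions.

   <!ELEMENT poem    - -  (title?, stanza+) >
   <!ELEMENT title   - O  (#PCDATA)         >
   <!ELEMENT stanza  - -  (line+)           >
   <!ELEMENT line    - O  (#PCDATA)         >
   <!ATTLIST line    n    NUMBER  #IMPLIED  >

Declarations of this kind state formally which elements a conforming text may contain, how they nest, and what attributes they may carry; an encoded text can thus be checked against the formal description mechanically.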

4. Notes on the Use of Automation Technology

4.1. Rationale

Computing machinery will be used in the conduct of this grant because
  • electronic mail speeds correspondence and the exchange of drafts among participants,
  • word processing systems make the production of large documents by groups of authors faster and more reliable,
  • electronic conferencing provides one of the best methods of publicizing this initiative among the interested community, and
  • some concrete implementation of the encoding scheme is required in order to test the recommendations of the working committees.
Computers are, in addition, part of the normal working conditions for the participants in the project, and are used in their daily work. The computational load on individual members of the working committees is unpredictable, and costs are expected to be borne by their host institutions. Funds are included for the editorial site in Chicago, which will handle most of the documents involved in the project, many of them in multiple versions; these funds will be contributed by the University of Illinois at Chicago.

4.2. Hardware

The editor in chief will perform his work on the academic computing facilities of the University of Illinois at Chicago; at the time of writing these are an IBM 3081 model K running VM/SP, and an IBM 3090 model 120 E running MVS/SP. In addition, the editor in chief will be provided by UIC with a personal computer, so that he can develop techniques for adapting microcomputer word processing software to the encoding scheme, and vice versa.

4.3. Software

Text processing software available at UIC and likely to be used for this project includes the VM System Product Editor (Xedit), Waterloo Script, Waterloo GML, TeX, and the Oxford Concordance Program. Although it is anticipated that software may be developed to process data encoded in the format developed by this project, such software development will take place in other projects which will have their own funding requirements. No formal software development is anticipated as part of the central activities under this grant.

4.4. Costs

Costs for the Chicago editorial site have been estimated using past billings on the machinery in question for work of about the same intensity in terms of machine time. It may be noted that the machine costs at UIC compare favorably with those in many comparable centers at institutions around the country.

5. Funding

The first phase of this project (Planning and High-Level Design) was funded by the National Endowment for the Humanities under grant RT-20880-87, with support also from Vassar College. The current proposal requests partial funding for the major portion of the project (drafting, revising, and approving the guidelines), principally for the participation of U.S. citizens. Funds for European participation in the meetings of the advisory board and working committees will be sought from European sources. Because the European involvement in this project dates only from last December, there is thus far little concrete progress to report in our search for European funds. The British Library is funding a one-year position at the Oxford University Computing Service to study problems associated with the encoding and use of texts in the Oxford Text Archive; we hope to draw upon the results of that study for this project. Some European research centers (e.g. the Istituto di Linguistica Computazionale in Pisa) have expressed a willingness to cooperate in the effort to develop the guidelines and to take part in the necessary fundamental research on the encoding-related problems of textual analysis. Some of these centers expect to be in a position to fund workshops or specific research projects relevant to the project, within the framework of their institutional activities and their specified research goals for the coming years. Preliminary discussions are now underway with the most likely sources of broader European funding, and one member of the Standing Committee on Humanities of the European Science Foundation is now preparing, in due form, a proposal that the ESF support some of the research needed for this project.

6. Budget

6.1. Justification

This section follows the budget summary form item by item, indicating the rationale for each expenditure and the method used to estimate each cost.

6.1.1. Salaries and Wages

The editors will coordinate the project, attend meetings of the working groups and of the steering committee, publicize the effort, assist in the work of the committees, and edit the text of the guidelines themselves. The editor in chief will devote 50% of his time to the project, and have ten hours a week of student clerical help for administrative and mechanical tasks. The consulting or associate editor will work quarter time on the project, attending committee meetings and helping the editor ensure the intellectual cohesion of the encoding scheme as a whole.

The editor in chief's salary is projected from the current level, with raises of five percent annually. The student clerical help is budgeted for ten hours per week, fifty weeks annually, at $5.00 per hour the first year, with five percent hourly raises the second and third years. The hourly rate is set slightly higher than average on the UIC campus in order to attract better qualified help; the five percent raises are desired in order to make it easier to retain the same student over the grant period, if possible.

The heads of the working committees on text representation, text analysis, and metalanguage are asked to make a substantial time commitment. They must analyze the problems of their area, organize subcommittees to report on specialized problems, run the meetings of their own committees, and take general responsibility for seeing that their committees perform their analysis work capably and produce useful recommendations. The active participation of many scholars in the committee work will be essential to ensure the lively discussion of basic principles and widespread support for the results without which this project must fail. It is equally essential to secure the vigorous participation of competent committee heads to channel the discussion and direct it to useful results. In order to enable the committee heads to devote serious effort to their responsibilities, it is important to give them time free of their normal burdens. Accordingly, we budget for 25% release time for the three committee heads named, over the central two years of the project. (The committee on text documentation has a much smaller task, and its head is expected to need no release time; if necessary, the editor in chief will head this committee.)

As noted above, the consulting or associate editor and the heads of the working committees have not yet been named. Their salaries are estimated at $37,500 on the assumption that they will probably be, on average, slightly senior to the editor in chief. It is anticipated that the consulting editor and two of the committee heads will be U.S. citizens; one committee head is expected to be a European funded from European sources. (An amended budget will be filed if these expectations prove incorrect.)

6.1.2. Fringe Benefits

The fringe benefit rate for the editor in chief is 13.126%; that for the student clerical help is 0.26%. These are the standard rates at UIC. Fringe benefits for the consulting editor and committee heads are calculated at 25%, which appears to be a more usual rate. [11]

6.1.3. Travel

The size of the travel budget (it is the largest single item in the budget) reflects our conviction that face-to-face discussion of the issues is necessary if the working committees are to create a consensus on the difficult technical issues they are expected to address. The editors must travel widely and regularly, both to encourage broad participation by discussing the project at conferences and to participate in the deliberations of the working committees. The steering committee must meet regularly to ensure that the editor and the working committees are making adequate progress in their work, and to hear reports from the committee heads and discuss issues with the committee heads as a group. In order to minimize travel costs overall, meetings will be held where possible in conjunction with conferences in the field of literary and linguistic computing; other meetings are expected to take place at or near important centers for literary and linguistic computing, so that consultations with scholars at those centers can be combined with the work of the meeting.

Travel costs are listed separately for the steering committee, the committee heads (for travel to steering committee meetings), the advisory board, the editors, and the working committees. Each group's travel is summarized on a single line, since the individual points of departure will be known only after the participating organizations have nominated, and the steering committee has constituted, the working committees. Specific destinations will be known as the schedule of conferences in the field becomes clearer; in what follows, it is assumed that the committee meetings, like the committee memberships, will be divided between North America and Europe, with slightly more meetings in North America, and slightly more participants from North America. Fares and subsistence costs are estimated as follows:
  • Air travel within North America is given an average cost of $500 round trip. We are advised that the average fare in the U.S. is more nearly $750, [12] but since most of these trips will be planned well in advance, the average fare should be somewhat lower. At most meetings one or more participants will be in their home city, which will further depress the average travel cost.
  • For steering committee meetings, air fare within the U.S. is estimated at $300, since the North American members of the steering committee are concentrated in the East and Midwest.
  • For travel within Europe, air travel is given an average fare of $400.
  • For travel from North America to Europe, or vice versa, air travel is given the average cost of $1100. [13]
  • Per diem expenses for U.S. travel are estimated at $102 ($80 for room, $22 for meals); for European travel they are estimated at $100 (room and meals combined), in accordance with the voucher approval policies of the state of Illinois.
  • Meetings are expected to last two days; participants crossing the Atlantic to attend the meeting are estimated to spend one and a half days each way in transit; other participants are estimated to spend half a day each way in transit. [14]
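On these assumptions, for example, a North American participant attending a two-day committee meeting in Europe would be budgeted at approximately $1100 in air fare plus five days' subsistence (two days of meetings and three days in transit) at $100 per day, or about $1600 in all; a domestic participant at a North American meeting would be budgeted at $500 plus three days at $102, or about $806.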
6.1.3.1. Steering Committee
The steering committee will meet as a group eight times over the course of the project, as well as twice with the advisory board. Four of the meetings will be in North America, four in Europe. They are allocated to the three years of the grant as follows:
  • 1988-89: one meeting in the U.S., two in Europe.
  • 1989-90: two meetings in the U.S., one in Europe.
  • 1990-91: one meeting in the U.S., one in Europe.
6.1.3.2. Committee Heads
In order to ensure the smooth and regular work of the project, the committee heads will report regularly to the steering committee on the progress of their committees and the problems they are encountering. Three meetings of the steering committee will be devoted to discussions with the committee heads; these are allocated one to each year of the project:
  • 1988-89: one meeting in the U.S.
  • 1989-90: one meeting in the U.S.
  • 1990-91: one meeting in Europe.
6.1.3.3. Advisory Board
The advisory board will meet twice, both times in North America. Based on the list of organizations being invited to participate, nine Americans and five Europeans are expected to constitute the group; with the addition of the steering committee, this makes thirteen Americans and seven Europeans. The first meeting will ratify the design principles agreed upon at the planning conference at Vassar; it will take place in the first year. The second meeting, in the third year of the grant, will approve the guidelines themselves.
6.1.3.4. Editors
The editors will attend as many meetings of the working committees as possible; in addition, they will attend the meetings of the steering committee, and travel to text archives and other centers of textual computing in order to confer with practitioners on details of existing encoding schemes and suggestions for the new scheme. Many of these trips will take the editors to conferences in the field, where they will render accounts of the progress and principles of the project to interested scholars. On average, the editors are expected to make about one trip per month, divided evenly between North America and Europe, with an average length of five days.
6.1.3.5. Working Committees
While it may be impossible to pay in full for every trip by every member of every working committee or subcommittee to attend every meeting of the group, nevertheless it is essential to provide as full a subsidy as possible. If the guidelines are to express the consensus of the textual computing community, they must be discussed very fully while they are drafted. While we expect that many scholars will be willing to contribute some of their own travel costs in order to help make sure the guidelines reflect the technical needs of their discipline, it is important to show, by financial support, how central the working committees are to the success of the project as a whole. The working committees are scheduled to complete their work in about two years; over this period the committee on text documentation is expected to have two meetings, and the committees on representation, analysis, and metalanguage six meetings each. (Adjustments will be made as necessary.) Of these twenty committee meetings, twelve are expected to be in North America, eight in Europe. For purposes of budgeting, the meetings are estimated as having ten participants each, six North Americans and four Europeans. In the budget, the meetings are allocated as follows to the three years of the grant:
  • 1988-89: two meetings in the U.S., two in Europe.
  • 1989-90: six meetings in the U.S., four in Europe.
  • 1990-91: four meetings in the U.S., two in Europe.

6.1.4. Supplies and Materials

Estimates for the supplies and materials needed for the project are based on the costs incurred by the editor in chief over the past year working at the same site. The monthly usage of mainframe computer resources, expressed in monetary terms, is estimated at about $800. (Over the past year the editor in chief has used about $1600 of resources per month.) Office supplies are estimated on the basis of the per capita expenditure of the host computer center.

The cost of a personal computer, to be provided by the host institution, is that of a typical model now available (IBM PS/2 Model 50) at the institutional price. The personal computer is required in order to run common microcomputer software and test its compatibility with the new encoding scheme, and to examine products designed for processing documents written in an SGML-compatible markup language. (Commercial monthly rental of a PC in Chicago runs about $250 per month for IBM PC/XT-compatible machines, about $400 per month for PS/2 machines. The PC is therefore budgeted as an outright purchase. At the conclusion of the grant, ownership of the PC will be retained by the host institution.) In order to provide a terminal connection for the student assistant, a modem for the campus network must be installed; for this the standard on-campus charge is that shown ($800). (No terminal rental or purchase is shown, because it is anticipated that the student will use the microcomputer as a terminal.)

6.1.5. Services

Monthly telephone and postal costs are hard to estimate with certainty but are expected to be rather high, given the need to keep in touch with as broad a spectrum of interested people as possible, not all of whom are accessible through electronic mail. The UIC computer center has agreed to pay telephone and postage charges as shown ($300 per month and $100 per month). The standard charge for printing at the host site is four cents per page (only laser printing is available). The estimate of 7500 pages per month is based on the experience of the editor in chief over the past year, and includes the printing of small runs of the working papers to be issued by the project as technical reports.

6.1.6. Subcontracts and Indirect Costs

The three applicant organizations will subcontract the administration of the grant to the University of Illinois at Chicago (UIC) and to other colleges and universities as appropriate. Release time for salaries for the editorial staff and committee heads will be subcontracted to the appropriate institutions; administration of travel funds and the central editorial site will be handled by UIC. In addition to the main project budget in this section, a budget for the subcontract with UIC is attached as Appendix G. As the committee heads and consulting editor have not yet been named, details of the subcontracts for their salaries remain to be worked out with their institutions, and no separate budget is included at this time.

The “Indirect Costs” section of the budget shows a rate of 38% on the funds allocated to the subcontract with UIC. This rate reflects the Indirect Cost Agreement of April 3, 1987, between UIC and the Office of Naval Research. However, the University of Illinois at Chicago has generously agreed to waive a portion of its federally negotiated indirect cost recovery rate on travel funds provided by NEH to this project. On these funds, a rate of 10% will be charged to the grant, and the other 28% of the indirect costs will be contributed by the university. This represents a waiver of approximately 60% of the total indirect costs normally recoverable by the university. The cost sharing column also shows UIC's contribution of the indirect costs on services provided by UIC without charge to the grant. For funds to be subcontracted to other institutions (specifically the salaries and fringe benefits of the committee heads and the consulting editor), an indirect cost rate of 45% is shown. This figure is obtained by averaging the overhead charges of a number of institutions [15] and rounding down. Half of this indirect cost figure is shown in the cost sharing column, as the other subcontractors will be expected to make financial contributions roughly proportionate to that made by UIC.

6.2. Budget Summary Sheets

[Note: Detailed budget information has been omitted from this copy of the NEH proposal.]

A. Funding Proposal for Phase 1 (Planning Conference)

Extracts

Introduction: the Need for Text Encoding Standards

The ability of computers to perform mechanical tasks reliably and quickly can be as useful in the manipulation of texts as in the manipulation of mathematical quantities. But before information can be manipulated by a computer, it must be represented in the computer. Whatever the nature of the information to be processed, the machine works by the fast manipulation of symbols which represent that information. And the quality of the result depends largely upon the quality of the machine's symbolic representation of the information to be worked with.

What Text `Encoding' Is

Scholars working with textual material, like those working on mathematical, physical, or chemical problems, must thus find appropriate ways of representing or encoding their data (in this case, their texts) in forms suitable for mechanical manipulation. Typically, a scheme for encoding texts must include:
  1. Methods for recording the individual characters of a text. For European languages written in the Latin alphabet, the encoding of most characters is given by standard practices of the data-processing industry, but special encoding conventions may be needed for:
    1. diacritics, such as those needed for French, German, Spanish, and many other languages;
    2. the special consonants and vowels needed for other languages, such as the Scandinavian languages, Hungarian, Polish, or Turkish, or the medieval forms of other languages (e.g. Old and Middle English);
    3. punctuation marks conventionally used in some languages, but not present in modern data-processing standards: Greek colon, the various manuscript symbols used for Latin et, etc.;
    4. special symbols like the signs of the zodiac, chemical and astrological symbols, or simple line drawings inserted into the text by the author;
    5. characters distinct in Western printed books, but not distinct in modern data-processing standards (opening and closing quotations, the various forms of the dash and hyphen, ligatures, etc.), where the distinctions are important for the research being undertaken.
    Some character representation must also be found for texts in other alphabets, for which there may be no industry standards: a character code may be devised from scratch, or adopted from existing sources, or the text may be represented in transliteration. Special problems arise when texts include mixtures of languages and alphabets, e.g. texts in Hebrew and Greek with footnotes in Latin, or text in Armenian with Russian footnotes and a Chinese glossary.
  2. Conventions for reducing texts to a single linear sequence wherever footnotes, text-critical apparatus, parallel columns of text (as in polyglot texts), or other complications make the linear sequence problematic.
  3. Methods for recording the logical and physical divisions of the source text: e.g. book, chapter, paragraph, sentence; play, act, scene, line; or volume, page, printed line, etc., so that passages found by a search can be located in a printed copy of the text. For analytical bibliography, exact details of type usage in early printed books must be encoded, including details of ligatures used, justification of lines, font use, inverted letters, printers' ornaments, and devices. For numismatic or epigraphic collections, information of this type might become equally complex.
  4. Methods for recording linguistic and literary information important for scholarship, whether explicitly marked in the text or extra-textual: author, date, genre, sentence boundaries, syntax, dialect, direct and indirect speech, use of archaisms, scansion of verse, stylistic features, allusions and quotations, etc. Here the variety of elements to be represented is as great as the variety of approaches among those who use machine-readable texts.
  5. Conventions for delimiting comments and other material in the machine-readable file which is not strictly part of the text being encoded.
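A small, purely hypothetical fragment may make these categories concrete. Every tag name below is invented for illustration and anticipates no decision of the planning conference:

   <!-- encoder's comment: not part of the text itself (item 5) -->
   <act n=3><scene n=1>                 <!-- logical divisions (item 3) -->
   <speaker>Hamlet</speaker>            <!-- scholarly information (item 4) -->
   <line n=56>To be, or not to be, that is the question:</line>

Character problems of the kind listed under item 1 are commonly handled by symbolic names for the characters in question (for instance, a code such as &eacute; standing for e with an acute accent), so that the text can be entered and exchanged using only the characters available on ordinary equipment.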
The limitations of early keypunches and printers led early on to elaborate encoding schemes for character data: asterisks to signal uppercase letters, numeric renditions of accents, transliteration conventions for Greek and Cyrillic, and so forth. More recently, the production of large text corpora in a single encoding scheme [16] and the development of standard concordance packages intended to handle wide varieties of text types [17] have led to greater concentration on conventions for representing the logical and physical divisions of a text, as well as to a greater awareness of the need for a standard set of practices in encoding texts for machine-aided analysis.
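The older conventions mentioned at the start of the preceding paragraph may be illustrated briefly; the details varied from project to project, and those shown here are merely representative. A title might be keyed as

   *hamlet, *prince of *denmark

with each asterisk marking the following letter as uppercase (yielding Hamlet, Prince of Denmark), and accented characters might be keyed as a letter followed by a digit identifying the accent (e1, say, for e with an acute accent, e2 for e with a grave).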

The Need for Standard Encoding Practices

For many years, scholars interested in the computer-aided analysis of texts were a small minority both within computing and within the humanities, and most projects began and ended in solitude. Into the 1960s and '70s, many researchers developed their own software and, still more often, their own (incompatible) systems of encoding, driven in part by the needs of their software and in part by the peculiarities of the individual texts they worked with.
Gradually, the community of potential users grew. Communication among those interested in the field improved. The potential uses of machine-readable texts were seen to extend beyond the production of printed concordances and to include literary and linguistic studies performed readily by machine but not readily or not at all by hand. As a result, interest grew in preserving machine-readable texts for later re-use, and the exchange of texts grew more frequent. This exchange, increasingly, is institutionalized in the form of large text archives, which hold and distribute copies of texts for computer-aided analysis, whether prepared elsewhere or at the archive itself. [18]
Text exchange is still hampered, however, by the many and inconsistent schemes used to encode the texts in the archives. Many machine-readable texts, originally prepared for personal or local use, lack any formal documentation of their status or organization: what the codes mean, who encoded the text, what edition was used as the source, etc. Existing text archives do valiant work trying to document what they have and improve what they acquire, but the rise of computer use among humanists promises a flood of new machine-readable texts in the near future which will overwhelm our current ability to document and assimilate variations in encoding practices. Humanists' use of computers, especially microcomputers, has increased manyfold over the last two or three years, and one of the first thoughts of most humanists, when beginning to move from word processing to computer-aided research, is to begin typing into the machine the text or texts they are working on at the moment.
Similar problems in the social sciences have been successfully addressed by formulating standard practices for data representation and documentation. These standards are supported and adhered to by most major social-science data archives and statistical analysis programs, so that social scientists who use computers in their work face a less chaotic world than do their humanist counterparts.
A single lucid, compendious set of recommendations for encoding textual data in machine-readable form is essential if the coming floods of machine-readable texts are to be usable by the entire scholarly community. The experience of the past four decades has laid an adequate groundwork for standardization: we now know many of the pitfalls to be avoided; the rise, in this decade, of new legions of computer users in the textual disciplines requires us to formalize that experience and point out those pitfalls now, so that those who create new texts—individual scholars working alone, and large archives engaged in mammoth projects, alike—can turn their minds to the problems that have not yet been solved, instead of (as the saying goes) re-inventing the wheel and running, as always, the risk of putting the axle in the wrong place.

Advantages of a Standard Practice for Text Encoding

It has become clear that in the textual disciplines, just as in the social sciences, secondary analysis will soon become—if it is not already—far more common than primary analysis. [19] Clear and complete documentation of the encoding scheme, ease of data exchange, and accessibility to more than one type of analysis package, will all become correspondingly important.
Guidelines for encoding textual data can be expected to benefit a variety of interested parties:
  1. Scholars encoding texts for the first time will be able to find guidance in choosing what information to encode, and what to omit. Without guidelines, those who encode texts must invent their own scheme, and those doing so for the first time may often overlook obvious requirements for useful encoding. A set of guidelines which would remind them of less obvious problems would thus help ensure a higher level of quality and usefulness in encoded texts.
  2. Scholars who encode in accordance with the guidelines will be able to exchange texts with others with less special effort to document the encoding scheme. Instead of describing every tag used and including a separate document to convey information (such as copy text or source and date of the encoded version itself) not contained in the text proper, text preparers would need merely to refer to the guidelines themselves and perhaps discuss briefly the peculiarities of the individual text. This would be a particular benefit in the administration of large archives of machine-readable texts.
  3. Scholars who acquire texts encoded by others will enjoy the converse benefit: the text they obtain will be easier to understand, because better documented. (Under current conditions, a great many texts circulate with no significant documentation at all, despite the best efforts of the archives to acquire and distribute documentation for their texts; and even if the texts use a common encoding scheme like the COCOA tags adopted for the Oxford Concordance Program, [20] the user must guess at the significance of the actual tags used, as the example following this list illustrates. Standardized text encoding practices would end this problem.)
  4. Developers of text-analysis software, who at present usually develop their own special encoding schemes, will (like the scholars encoding texts) be reminded of textual features that need to be accommodated in an encoding scheme. At present, software-determined encoding schemes have an alarming tendency to handle only the relatively restricted class of texts in which the software developers are conscious of an interest: nineteenth-century novels, for example, or Biblical texts. So a text encoding scheme could be expected, if it were followed, to result in better software development and more generally useful programs for textual analysis.
  5. Users of texts, who typically may perform a variety of analyses on a text, involving often a number of different programs, will have less work to edit the various texts they acquire into a common scheme, and will less frequently have to change that scheme when they move from one program to another for a different type of analysis.
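The problem noted in point 3 above may be illustrated with a small example. A text marked up with COCOA-style references might begin

   <T HAM> <A 3> <S 1>
   To be, or not to be, that is the question:

where the category letters are chosen freely by the encoder. Without separate documentation, a recipient can only guess whether <A 3> identifies the third act, a third author, or something else entirely; the letters shown here are, of course, only illustrative.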

The Need for Prompt Action

With every passing month, one hears of more plans for encoding and distribution of texts and whole text archives on CD-ROM, [21] and more investments of time and money are made in the chaotic wilderness of divergent text encoding schemes that marks the current state of affairs. Many of those involved in planning and executing text-encoding and text-distribution projects express the wish that some standard practice existed, and some archive directors are delaying major projects in the hope that some sort of international agreement on encoding schemes, or at least on a conceptual framework for the description of such schemes, can be reached soon. [22]
If guidelines for standard practices can be agreed upon soon, many of the projects still in the planning stage can still (and, it is to be hoped, will) encode their texts in accordance with the guidelines. If, however, there is no prospect of any guideline for common practice within the foreseeable future, these projects, and others, will have no choice but to proceed without guidelines, and a great chance for standardization and quality control will have been missed.

Inadequacy of Existing Schemes

Existing text tagging schemes may be divided into two classes: first, those associated directly with specific concordance software or large text-corpus projects (COCOA tags, the tagging scheme used by the Thesaurus Linguae Graecae, or the tags required by WatCon or ARRAS); and second, general-purpose schemes developed within the data-processing industry for text preparation, exchange, revision, and publication. Cutting across this division, one may also distinguish schemes which tag the content or structure of the text from those which specify its appearance on the page (or screen).
It would clearly be advantageous if some common encoding scheme were usable both for analytic research and for the preparation of printed versions (editions, typeset concordances, commentaries with extended quotations, lexica, glossaries, dictionaries) of textual material. This would make it easier to publish the results of textual research, make texts prepared for publication usable for research purposes, and ease the learning process for scholars who use the same computer resources both for writing (word processing) and for textual analysis. If one scheme, or two related schemes, could be used for both purposes, there would be less confusion about how to mark a given phenomenon in the text, and a greater likelihood that the scheme or schemes could be learned and used.
No one existing scheme, however, is adequate to both uses. Many members of the first (research-oriented) group suffer from inadequate generality, from arbitrary limits on the depth of hierarchical tagging of the logical or physical structure of the text, or from limits on the maximum number of distinct tags. All suffer from their isolation from the data-processing industry at large: that is, none of them, even the best conceived, can be used for anything other than concordance-making or similar analytic work. Within the second group (schemes oriented to office automation and machine-assisted publishing), many schemes require more technical sophistication than can realistically be expected of even experienced computer users, and no scheme provides adequate facilities for encoding all the attributes of a text that may be required for research.
Within the second group, two recently promulgated schemes for machine-readable text markup merit special mention here: [23] the Standard Generalized Markup Language (SGML), recently adopted by the ISO, and the SGML-style tag set developed at great effort and expense by the Association of American Publishers (AAP).
SGML represents an attempt to standardize the syntax of "text formatting" or "markup" languages [24] and to this end it defines both a "reference concrete syntax" that prescribes specific characters for specific functions and a generalized "abstract syntax" that allows different implementations to use different characters for the same functions. One of the greatest strengths of the SGML approach is thus its flexibility: it allows implementors to define their own sets of metacharacters within the same syntactic framework. For example, a tag is preceded, in the reference concrete syntax, by the '<' sign and followed by the '>' sign. In Waterloo GML, the tag must also be delimited by specific characters, but by default the tags are preceded by colons and followed by either a space or a period. The Waterloo implementation diverges from the reference concrete syntax but remains compatible with the abstract syntax.
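The point may be illustrated with the paragraph tag common to many GML tag sets:

   <p>This is a paragraph.        (SGML reference concrete syntax)
   :p.This is a paragraph.        (Waterloo GML concrete syntax)

The tag is the same in both cases; only the delimiting metacharacters differ.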
SGML does not define a specific set of tags with which to mark texts up. This is left to the individual implementation, and is expected to vary from application to application, [25] although in IBM mainframe sites one may find an informal standard centered around the specific tag sets provided by IBM's Document Content Facility GML and the similarly conceived Waterloo GML.
The generality of the SGML abstract syntax and the flexibility of the scheme by which tag sets may be constructed and extended recommend SGML as a potential medium for devising a common encoding scheme for texts used in humanities research and teaching. It is not wholly clear whether all levels of a text are equally well served by an SGML-based syntax: word-by-word tagging (e.g. of sentence functions, word classes, and other parsing information), at least, might be better achieved with a different scheme. SGML remains, however, a very promising development and one that should be carefully considered in developing guidelines for research-oriented text encoding.
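The concern about word-by-word tagging may be made concrete with a deliberately naive sketch, in which invented SGML tags carry word-class information:

   <w class=verb>is</w> <w class=art>the</w> <w class=noun>question</w>

Even in so small a sample, the markup overwhelms the three words it annotates; schemes designed specifically for interlinear or columnar annotation may prove more convenient at this level.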
The specific implementation of SGML by the Association of American Publishers merits, a fortiori, even greater scrutiny than SGML in the abstract. Efforts are underway to give this tag set the status of an American national standard, which would make it even more attractive. The AAP tag set does solve many of the problems facing anyone encoding texts for literary, historical, or linguistic analysis. It allows, for example, for a very fine-grained delineation of the logical structure of a document. It also provides detailed rules for representing characters not present in the coded character sets (ASCII and EBCDIC) of the data-processing industry, a clear and comprehensible syntax, and thoroughly worked-out methods of representing tabular, columnar, or other difficult layouts and even mathematical and chemical formulas in linear form. The format is thus extremely promising.
But although well conceived and well executed, the AAP tag set does not in its present form supply the broad range of tagging types needed to handle the information humanists need to encode with their texts. In particular, the publishers sought, for obvious reasons, to retain the greatest possible freedom for textual revision and to reserve as much freedom as possible to the book designer (not the author) in specifying the physical layout of the text. Thus the AAP tag set, like most formatting languages, provides for automatic numeration of chapters, pages, lists of numbered points, etc. Unlike most other formatting languages, it also severely restricts the degree to which the physical structure (type face, title formats, page breaks, etc.) of a text can be specified in tags. Such decisions are effectively reserved for the book designer, who specifies in a "style sheet" how the various logical components of a manuscript are to be realized in layout and type styles. These decisions make perfect sense for the preparation of manuscripts for book or electronic publication, which is the AAP's primary area of interest. In texts prepared for research, however, the numeration of pages, chapters, and numbered lists of points is given explicitly in the source, and ought to be indicated explicitly in the encoding: errors in transcription will be easier to find, and location information will be more useful, if it is. Similarly, description of the physical layout of the text, because it is less interpretive and thus less controversial, may be preferable, or at least easier, for research purposes than the interpretation of the layout in terms of a logical hierarchy of structural elements. And for many types of study (analytic bibliography for the modern period; codicology, papyrology, numismatics, and epigraphy for the older periods; and textual criticism for all periods) the description of the physical realization of the text is at least as important as its logical structure.
The promise of the AAP tag set for textual analysts, therefore, lies in its potential as the basis around which a solution might be constructed, rather than in its providing a solution ready-made. At the very least, the AAP tag set will require extension for enumerated textual components and for the physical description of the copy text, in order to suffice for some scholarly purposes. To provide an encoding scheme broad and general enough to support tagging across the full range of humanists' interests and concerns, even broader extensions are necessary. Such extensions would provide tags for items such as the following: syntactic, stylistic, metrical, morphological, and semantic features at the level of syllable, word, or phrase; structural parts of a text such as chapter, or act and scene; discontinuous and recurrent features such as speaker, stage direction, and direct or indirect quotation; and spans of text that do not begin with a whole word, such as lacunae, words divided between the recto and verso of a page, or the lineation of a printed text. It must also be made easy to mark specific points in a text (not only passages, which are marked for beginning and end) and to mark arbitrary spans of text. Since research by its nature is often concerned with features of texts not studied before by others (and thus not foreseen in any scheme for text encoding), the text-encoding guidelines should also specify means of defining user extensions that will be understandable and usable by others. Implementing tags such as these requires consideration of a variety of issues related to the humanistic analysis of texts, issues that have not been addressed in SGML or the AAP scheme simply because their concerns are more restricted.
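Purely by way of illustration, such extensions might include empty tags of the following kind (all names invented), recording the explicit page and line numbers of the copy text instead of generating numbers automatically, and marking points at which text has been lost:

   <pb n=42>                      <!-- page 42 begins in the copy text -->
   <lb n=7>                       <!-- line 7 of that page begins -->
   <lacuna extent="3 words">      <!-- point at which text is lost -->

Tags of this sort respect the SGML syntax while serving needs (explicit enumeration, physical description, point marking) that the AAP tag set does not now address.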
We conclude, therefore, that SGML and the AAP's implementation of SGML, while likely to provide a usable base for a common set of text-encoding guidelines, do not as presently constituted solve the problems of text encoding for historical, literary, or linguistic research. Because SGML appears to be increasingly accepted as a framework for representing texts in machine-readable form, and the AAP tag set seems likely to increase rapidly in importance and utility, it would be sensible to try to make any set of text-encoding guidelines compatible with the AAP tag set, or at least with the SGML syntax. We propose to make a concerted effort to make our text-encoding guidelines, where possible, compatible with appropriate existing schemes, of which SGML and AAP appear at present to be the most promising. The AAP has agreed to attend the planning conference and participate in the drafting, validation, and approval of the proposed text-encoding guidelines.

Phases of a Project to Develop a Text Encoding Standard

In order to achieve the advantages mentioned above, guidelines for text encoding practices must address the concerns of as many interests as possible: ease of migration from existing encoding schemes, compatibility with encoding schemes intended for similar uses, simplicity and ease of application, adequacy for the most commonly required text-analytic procedures, and ability to encode the information needed for less common but no less important specialized research applications. Excessive haste in formulating the guidelines will surely impair our ability to achieve these goals. We propose, therefore, a five-phase process for formulating a set of guidelines, with each phase issuing in written specifications and drafts which will be refined and expanded in the later phases. This proposal requests funding for the first phase; the later phases will be described in more detail in other funding proposals.
In the long run, the guidelines developed in this project should encompass the encoding of any sort of textual material, in any language, for virtually any scholarly purpose. For pragmatic reasons, we expect to limit the scope of the immediate effort in several ways. First, the guidelines will deal only with the problems of encoding texts for research or teaching purposes: preparation of 'electronic manuscripts' for electronic or paper publication is beyond their scope. (As noted above, compatibility with industry standards for manuscript publication is a desideratum, but only one of several.) The guidelines will emphatically not be concerned with physical storage methods for files in any medium. Further, the immediate effort will be concerned primarily, perhaps exclusively, with the encoding of texts in Latin-based alphabets. The problems of encoding texts in non-Latin alphabets are not qualitatively different from those of Latinate scripts, however, so most of the underlying principles of any encoding scheme would be transferable. Explicit, wholly adequate guidelines for handling Cyrillic, Hebrew, Arabic, and other non-Latin alphabets, however, will probably await later extensions of the guidelines, as will any attempt to deal adequately with non-alphabetic scripts. Finally, the needs of specialized disciplines (e.g. numismatics, epigraphy, paleography, and possibly some of the more arcane philological sub-disciplines) will be reserved for later investigation by special groups and handled in extensions to the guidelines. For the first version of the guidelines, we anticipate focusing on the needs of literary and linguistic research as commonly practiced, and those of text-oriented historical research (including documentary editing) which are similar in nature. In setting the basic design of the guidelines, every effort will be made to ensure that extensions to texts and textual research of other types will be feasible and straightforward.
The five phases foreseen for the project are:
  1. Planning and High-Level Design, to be performed by an international planning conference of text-archive directors, the formulators of existing schemes, and other parties that represent interests and perspectives that must be taken into account in the formulation of the guidelines. Also during this phase, the ACH and other cooperating groups will make arrangements for the working party needed in the next phase.
  2. Detailed Design and Drafting, to be performed by a smaller working party organized by cooperating groups during and after the planning conference. In this phase the actual guidelines will be developed.
  3. Revision, to be performed by the smaller working party on the basis of comments from members of the planning conference as well as any and all interested parties.
  4. Review and Approval, to be performed by representatives of interested organizations.
  5. Publication and Maintenance, to be performed by the Association for Computers and the Humanities. This phase includes the development of eventual extensions and revisions of the guidelines.

Planning and High-Level Design Phase: International Planning Conference

In this phase we propose to convene an international planning conference of European and North American text-archive directors and other interested parties, to discuss text encoding issues and suggest the general structure and approach of common guidelines for text encoding practices.
Also during this phase, the ACH will formally invite a number of organizations to cooperate in establishing text-encoding guidelines (specifically to participate in phases 2 through 5 by helping draft, revise, approve and support the guidelines) and the cooperating groups will together appoint a draft committee or working party to be responsible for the second and third phases of the work. [26] Most of these organizations will be represented informally at the planning conference. No formal representation of all groups is planned; however, if an organization not already represented at the planning conference feels the need of such representation, then a representative to the planning conference may be appointed (funds permitting).
The planning conference is described in more detail in the next section. Its task is to clarify fundamental issues and provide guidance for the later phases of the process. Drafting the guidelines themselves is not feasible in so short a time: the text of any set of guidelines will have to include substantial amounts of technical information and documentation on existing practices, so writing a draft will require substantial time from a number of individuals, and correspondence with a large number of individuals in a position to provide relevant technical information. The planning conference, accordingly, will not attempt to specify all the details of the guidelines, but only to discuss the advantages and problems of existing practices, clarify the basic issues, and settle the fundamental policy questions of scope, structure, and general approach, leaving details and technical points to be worked out by a smaller technical committee or 'working party' organized for that task.

Detailed Design and Drafting Phase: Working Group

In this phase a working party appointed by the cooperating groups will first publish the architecture or general plan developed at the planning conference, accept comments from interested parties, and revise the formal specification of the architecture.
Within the architecture or general shape of the text encoding guidelines specified by the planning conference (as revised), the smaller committee will work out the details of the guidelines and formulate a full draft on paper. While the workings of this committee are best left to its members to arrange, it seems clear that much of the work will have to be done by mail or by electronic network correspondence. At least one and probably two face-to-face working meetings will be necessary to allow for intensive discussion of technical issues.

Revision Phase: Publication, Comments and Revision Cycles

This phase comprises one or more cycles of publication, comment, and revision, in which the draft guidelines are circulated to interested parties and the public for comments and suggestions. The working party will be responsible for considering these suggestions and comments and incorporating them, or not, into the draft guidelines as appropriate.

Review and Approval Phase: Review Conference

The revised draft prepared by the working committee will be submitted, after public comment and revision, to the appropriate learned societies and professional bodies for their formal approval. The exact form this approval will take is a matter for discussion within the societies, but in our view an appropriate mechanism to ensure the widest discussion, and hence the widest acceptance, of the proposed standard would be to convene a second conference of representatives from all of these groups (and perhaps other interested bodies such as the American National Standards Institute) at a later stage, at which properly constituted representatives could vote on any controversial elements in the proposed guidelines. Funding for this second conference is not, however, sought at this point.

Publication and Maintenance Phase

After the draft is completed and approved, the Association for Computers and the Humanities will publish it and undertake to provide continuing maintenance and support for it, including making arrangements for periodic review and revision as needed. While we do not anticipate constant change, which would undermine the advantages of the guidelines, further experience with the guidelines could well lead to suggestions for modifications and improvements. The advancing sophistication both of our computer systems and of the textual research performed with their aid will, moreover, encourage regular extensions of the guidelines to cover new areas of research. As the initiator of this effort, the Association for Computers and the Humanities undertakes to maintain a technical committee to receive correspondence on the guidelines, assist interested parties with implementation and interpretation, and cooperate with other groups on future revisions and extensions.

Tasks and Functions of the Planning Conference

Tasks the Conference Should Accomplish

The conference must set the general framework for further work on the text encoding guidelines. The conference, therefore, must specify the essential structure of the final guidelines clearly and correctly enough to provide guidance to the working group, without being prematurely specific in technical details. Questions the conference must address include:
  1. Fundamental Questions:
    1. Under what conditions is a standard encoding scheme in fact feasible and desirable?
    2. Is there a set of basic principles in terms of which all or most existing encoding schemes can be described?
    3. If so, what are those principles?
    4. Which of those principles should be incorporated into the new guidelines?
    5. With what existing encoding schemes (e.g., SGML and the AAP scheme) should the new guidelines (ideally) be compatible?
  2. Scope of the Guidelines:
    1. What types of textual material should encoding guidelines cover? Continuous texts only, or also structured texts like dictionaries, glossaries, lexica, word lists, word frequency lists, parsing rules, corpora of random or non-random samples, commentaries on other texts, etc.?
    2. What types of analysis should an encoding scheme attempt to support? Stylistics, textual criticism, analytic bibliography, computational linguistics, thematic studies, metrics, authorship identification studies, commentary?
    3. What other activities (data exchange, archival storage, ...) should be addressed by the guidelines?
  3. Structure of the Guidelines:
    1. What form should the guidelines take? Specifically, what should the table of contents of the guidelines look like?
    2. How specific should the guidelines be? Should they specify or suggest specific tags with which to encode texts, or only a structural plan for the tags, with the specific lists of tags to be determined by those who encode the text? (That is, should the guidelines specify only how texts should be encoded, or also what features of the text ought ideally to be encoded?)
    3. If specific tags are specified, should the list be exhaustive or provide only a core of listed tags, with provisions for user-defined special-purpose extensions?
  4. Possible Content of the Guidelines:
    1. How extensive should the list (if any) of required or recommended tags be, and what types of information should they convey?
    2. What information (if any) about physical layout and appearance of the text (manuscript, edition) should be specified?
    3. What information (if any) about linguistic units (parsing, meaning, dictionary forms of words, etc.) should be tagged?
    4. What interpretive information (stylistic, metrical, rhetorical, narratological, etc.) should be tagged?
    5. Should the guidelines attempt to specify explicitly what binary codes are to be used for tags, require the user to do so, or specify a normal practice and allow deviations in individual cases?
    6. Should the guidelines include recommendations on how tags should be encoded, and how they should be re-defined?
    7. If specific binary codes are to be suggested or required, how should they be chosen?
  5. Organizational Issues:
    1. Who should be responsible for producing a draft of the guidelines?
    2. Once the working group has produced a draft, how should it be criticized, revised, and accepted? (Note: one possible outline of the further work is described elsewhere in this proposal, but details of the further proceeding are subject to change.)
    3. Who should be responsible for accepting the draft and encouraging its use?
    4. Who should be responsible for publishing, maintaining, and (as need arises) revising the guidelines?
Both to illuminate the questions just mentioned and to clarify current practice, representatives to the planning conference will be asked to describe any encoding schemes they have devised or use, indicating both how tags are marked in the text and specifically what features of the text the tags record. Descriptions will be requested in writing and distributed to all participants beforehand. One immediate product of the conference, therefore, will be a broad account of the various methods now in use for encoding texts for humanistic research. This account may be integrated with the eventual guidelines or published separately.
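By way of illustration only, consider how two encoding conventions might tag the same line of verse. Both conventions below are hypothetical composites invented for this example, not proposals of this initiative; the Python sketch merely contrasts a COCOA-style convention, in which bracketed references declare the author, title, and line number currently in force, with an SGML-style convention, in which start- and end-tags delimit structural units, and shows a first trivial step toward describing a scheme mechanically:

    # Illustrative sketch only: two hypothetical encoding conventions for
    # the same line of verse; neither is a scheme endorsed by this proposal.
    import re

    # COCOA-style: bracketed references declare a feature and its value.
    cocoa_style = ("<A SHAKESPEARE><T SONNET 18><L 1>"
                   "Shall I compare thee to a summer's day?")

    # SGML-style: start- and end-tags delimit structural units.
    sgml_style = ("<poem author='Shakespeare' title='Sonnet 18'>"
                  "<line n='1'>Shall I compare thee to a summer's day?</line>"
                  "</poem>")

    def extract_tags(text):
        """List every bracketed tag: a first step in describing a scheme."""
        return re.findall(r"<[^>]+>", text)

    print(extract_tags(cocoa_style))  # ['<A SHAKESPEARE>', '<T SONNET 18>', '<L 1>']
    print(extract_tags(sgml_style))

A full description of a scheme would go on to state what each tag means and where it may occur; the point here is only that the two conventions record comparable features of the text in visibly different syntax.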

Tasks the Conference Should Not Undertake

The first working conference should not attempt to decide all the questions of text encoding itself; the details of the draft guidelines should be left to the working party. Among the issues we specifically recommend the conference leave undecided:
  1. What specific tags should be included, and at what levels (required, recommended, optional) of the scheme.
  2. What binary codes should be used in the guidelines.
  3. If the working party is instructed to strive for compatibility with certain existing standards, what deviations from those standards might be necessary or allowable. (This determination is best left to the working party itself.)

B. Participants in the ACH Meeting on Text Encoding Practices

(Vassar Planning Conference)

[NOTE: Network addresses are from Bitnet]

C. Closing Statement of the Vassar Planning Conference

The Preparation of Text Encoding Guidelines

Poughkeepsie, New York
13 November 1987
  1. The guidelines are intended to provide a standard format for data interchange in humanities research.
  2. The guidelines are also intended to suggest principles for the encoding of texts in the same format.
  3. The guidelines should
    1. define a recommended syntax for the format,
    2. define a metalanguage for the description of text-encoding schemes,
    3. describe the new format and representative existing schemes both in that metalanguage and in prose.
  4. The guidelines should propose sets of coding conventions suited for various applications.
  5. The guidelines should include a minimal set of conventions for encoding new texts in the format.
  6. The guidelines are to be drafted by committees on
    1. text documentation
    2. text representation
    3. text interpretation and analysis
    4. metalanguage definition and description of existing and proposed schemes,
    coordinated by a steering committee of representatives of the principal sponsoring organizations.
  7. Compatibility with existing standards will be maintained as far as possible.
  8. A number of large text archives have agreed in principle to support the guidelines in their function as an interchange format. We encourage funding agencies to support development of tools to facilitate this interchange.
  9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format (a sketch of such a translation follows this statement). No requirements will be made for the addition of information not already coded in the texts.
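To make the kind of translation envisaged in point 9 concrete, here is a minimal sketch, assuming a hypothetical source convention in which <L n> marks line n; the target tag name 'line' is equally hypothetical, since the actual interchange syntax remains to be drafted by the working committees. The conversion is a mechanical rewriting of delimiters and tag names; it adds no information that the source text did not already encode:

    # Minimal sketch, assuming a hypothetical <L n> line-reference
    # convention in the source; the target tag name 'line' is likewise
    # invented here, pending the interchange syntax to be drafted.
    import re

    def translate_line_refs(text):
        """Rewrite COCOA-style <L n> references as SGML-style start-tags."""
        return re.sub(r"<L (\d+)>", r"<line n='\1'>", text)

    print(translate_line_refs("<L 1>Shall I compare thee to a summer's day?"))
    # prints: <line n='1'>Shall I compare thee to a summer's day?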

D. Draft Memoranda of Understanding

Memorandum of Understanding Among the Sponsoring Organizations

An agreement entered into by the undersigned, on the dates indicated, on behalf of the organizations named, to cooperate in developing, formulating, and disseminating guidelines for the encoding of texts in machine-readable form for research or teaching and a common interchange format for the exchange of literary and linguistic data.

Purpose

The organizations entering into this agreement ('sponsoring organizations') undertake to sponsor, as described herein, an initiative to draft, publish, and support guidelines for the encoding of texts in machine-readable form for research or teaching and a common interchange format for the exchange of literary and linguistic data.

Basic Principles

The basic principles governing the initiative and the resulting format shall be those enunciated in the closing statement of the conference sponsored by ACH at Vassar College in Poughkeepsie, New York, on November 12 and 13, 1987 (appended).

Organization of the Initiative

The initiative shall be guided by a steering committee comprising two voting representatives of each sponsoring organization. The steering committee shall appoint one or more editors, as it shall choose, to coordinate the actual drafting of the guidelines and interchange format. The steering committee shall also constitute working committees from the literary and linguistic computing community for the actual drafting and development work. The function of the steering committee will be to appoint the editor(s), to monitor the work of the editor(s) and drafting committees, to consider and settle policy questions with the editors, to present the text to the advisory board of participating organizations for approval, and to authorize the publication of the final text.

Commitments of the Sponsoring Organizations

The sponsoring organizations undertake:
  1. to appoint two members to the steering committee of the initiative.
  2. to encourage their members to contribute their technical expertise to the success of the initiative by serving on the drafting committees.
  3. to publicize the initiative in their newsletters and other organs of communication.
  4. to circulate drafts and partial drafts to their members for comment.
  5. to encourage publication of working papers of the initiative in their journals (if appropriate).
  6. to encourage the discussion of text encoding problems at meetings organized or sponsored by the organization (e.g. in special sessions).
  7. to contribute to the initiative such administrative services as are already routinely performed by the association (e.g. provision of mailing lists or mailing labels).
  8. to endorse the use of the guidelines and interchange format for the encoding of texts and the exchange of already encoded texts.
  9. to publish, in concert with the other sponsoring organizations, the final form of the guidelines and interchange format, and encourage their dissemination among the interested community of teachers and scholars.
  10. to establish and maintain some mechanism (e.g. a standing committee or bureau) for the continuing development of the guidelines and interchange format in cooperation with the other sponsoring organizations after completion of the first version. This mechanism shall be responsible for accepting comments and suggestions from the membership of the sponsoring organizations, monitoring the success of the recommendations, and considering revisions and extensions in the light of experience.
Signed:
For the Association for Computers and the Humanities:
For the Association for Computational Linguistics:
For the Association for Literary and Linguistic Computing:

Memorandum of Understanding with Participating Organizations

An agreement to define the terms of participation in the cooperative initiative for text encoding.

Purpose

The organizations entering into this agreement ('participating organizations') will participate (as described below) in an initiative to draft, publish, and support guidelines for the encoding of texts in machine-readable form for research and teaching and a common interchange format for the exchange of literary and linguistic data.

Basic Principles

The basic principles governing the initiative and the resulting format shall be those enunciated in the closing statement of the conference sponsored by ACH at Vassar College in Poughkeepsie, New York, on November 12 and 13, 1987 (appended).

Organization of the Initiative

The initiative shall be guided by a steering committee comprising two voting representatives of each of the three sponsoring organizations: the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The steering committee shall appoint one or more editors to coordinate the actual drafting of the guidelines and interchange format. The steering committee shall also constitute working committees from the literary and linguistic computing community for the actual drafting and development work.

Advisory Board

Each participating organization will name one representative to an advisory board. The advisory board will provide liaison between the steering committee and editor(s) and the membership of the participating organizations, to ensure:
  1. that the needs of the membership for encoding literary and linguistic data are adequately addressed in the guidelines and interchange format;
  2. that similar or related efforts within specific areas of the literary and linguistic computing community are made aware of the work of this initiative and, where appropriate, coordinated with it; and
  3. that the membership of participating organizations is made and kept aware of the work of this initiative and provided the opportunity to participate in it.

Commitments of the Participating Organizations

The participating organizations undertake:
  1. to appoint one member to the initiative's advisory board.
  2. to encourage their members to contribute their technical expertise to the success of the initiative by serving on the drafting committees.
  3. to publicize the initiative in their newsletters and other organs of communication.
  4. to circulate drafts and partial drafts to their members for comment.

Publication

The names of the participating organizations will be listed in the front matter of the final document describing the guidelines and interchange format, under the heading "Advisory Board of Participating Organizations."
Signed:

E. Organizations Invited to Participate

  1. the American Historical Association (AHA)
  2. the American Philological Association (APA)
  3. the Association for Computing Machinery, Special Interest Group for Information Retrieval (ACM/SIGIR)
  4. the Association for Documentary Editing (ADE)
  5. the Association for History and Computing (AHC)
  6. the Association of American Publishers (AAP)
  7. the Dictionary Society of North America (DSNA)
  8. the European Association for Lexicography (Euralex)
  9. the International Linguistic Association (AILA)
  10. the Joint Steering Committee for the Revision of the Anglo-American Cataloguing Rules (or some other representative of the library community)
  11. the Linguistic Society of America (LSA)
  12. the Modern Language Association of America (MLA)
  13. the Societas Linguistica Europaea (SLE)

F. Vitae of the Steering Committee

Robert A. Amsler

Artificial Intelligence and Information Science Research Group
Bell Communications Research
435 South Street
Morristown, NJ 07960
(201) 829-4278

Education

  • Ph.D., Computer Sciences/Information Science/Ethnosemantics, University of Texas, Austin, 1980.
  • M.S., Computer Sciences/Mathematics, Courant Institute of Mathematical Sciences, New York University, 1969.
  • B.A., with honors, Mathematics, Florida Atlantic University, Boca Raton, Florida, 1967.

Current Employment

  • 1984-Present: Research Computer Scientist, Artificial Intelligence and Information Science Research Group, Bell Communications Research.

Previous Professional Experience

  • 1983: Computer Scientist, Natural-Language and Knowledge-Resource Systems Group, Advanced Computer Systems Department, SRI International.
  • 1981-1982: Computer Scientist, Artificial Intelligence Center, SRI International.
  • 1976-1980: Computer Programmer, Linguistics Research Center.
  • summer/fall, 1976; summer 1974 - fall 1975: Computer Programmer, Computation Center, University of Texas at Austin.
  • summer, 1975: Research Assistant, University of Southern California, Information Sciences Institute, Marina del Rey, California.
  • 1971-1974: Research Assistant, Computer Sciences Dept., University of Texas at Austin.
  • 1969: Mathematician, Central Intelligence Agency, Scientific Applications Division, Washington, D.C.

Professional Societies

Association for Computational Linguistics, American Association for Artificial Intelligence, Association for Computing Machinery, Institute of Electrical and Electronics Engineers, American Association for the Advancement of Science.

Grants

  • February, 1979: As a graduate student (and programming manager of the Linguistics Research Center) at the University of Texas at Austin, wrote the proposal for and managed NSF Grant MCS77-01315, "Development of a Computational Methodology for Deriving Natural Language Semantic Structures from Machine-Readable Dictionaries," awarded by proxy to Winfred P. Lehmann and Robert F. Simmons (graduate students having been ruled ineligible by the University of Texas to submit grant proposals).
  • 1982: NSF “New Investigator's Grant” awarded to me while at SRI International's AI Center.

Professional Publications

Semantic Space: A Computer Technique for Modeling Connotative Meaning. M.S. thesis, N.Y.U., February 1969.

Modeling Dictionary Data (with Robert F. Simmons), in Directions in Artificial Intelligence: Natural Language Processing, ed. Ralph Grishman. Computer Science Report 7, Courant Institute of Mathematical Sciences, C.S. Dept., New York University, August 1975. Pp. 1-26.

The Structure of The Merriam-Webster Pocket Dictionary. Ph.D. dissertation. TR-164. C.S. Dept., University of Texas, Austin, December, 1980.

Inference Nets for Modeling Geoscience Reference Knowledge, (with Julie H. Bichteler and Jonathan Slocum). Proceedings of the 43rd ASIS Annual Meeting: Communicating Information, Anaheim, Calif. Oct. 5-10, 1980. Washington: American Society for Information Science, 1980.

Report to the Faculty Computer Committee from the Textual Applications Group (TAG). (with D. Richardson) University of Texas at Austin, September, 1980.

A Taxonomy for English Nouns and Verbs, Proceedings of the 19th Annual Meeting of the Association for Computational Linguistics, Stanford, California. June 29-July 1, 1981. Menlo Park, California: Association for Computational Linguistics, 1981. Pp. 133-138.

Computational Lexicology: A Research Program. In Proceedings of the 1982 National Computer Conference, Houston, Texas, June 7-10, 1982. Arlington: AFIPS Press. Pp. 657-663.

Natural Language Access to Structured Text (with Jerry R. Hobbs and Donald E. Walker). COLING-82: Proceedings of the Ninth International Conference on Computational Linguistics, Prague, July 5-10, 1982. (edited by Jan Horecky). Amsterdam: North-Holland, 1982, Pp. 127-132.

Computer-Assisted Compilation of a Nahuatl Dictionary, (with F. Karttunen) in Computers and the Humanities. 1984.

The Use of Machine-Readable Dictionaries in Sublanguage Analysis, (with D. E. Walker, describing the results of my 1982 NSF new investigator's grant) in Proceedings of NYU Workshop on Sublanguage Description and Processing, New York City, January 19-20, 1984. (Also appeared as chapter in Analyzing Language in Restricted Domains: Sublanguage Description and Processing, edited by Grishman and Kittredge.)

Machine-Readable Dictionaries, chapter in Annual Review of Information Science and Technology, Vol. 19, ed. by Martha E. Williams. White Plains: Knowledge Industry Publications, 1984.

Deriving Lexical Knowledge from Existing Machine-Readable Information Sources. Bellcore TM-ARH-008761 (02/18/87). Paper presented at the EEC-sponsored workshop Automating the Lexicon: Research and Practice in a Multilingual Environment, held 19-23 May, 1986 in Marina di Grosseto, Italy. To be published in the proceedings of the workshop (in progress, 1988).

Words and Worlds. Bellcore TM-ARH-009470 (06/09/87). Published in the Proceedings of the 3rd `Theoretical Issues in Natural Language Processing' (TINLAP3) Workshop at New Mexico State University at Las Cruces, NM. (Presentation available as videotape).

How Do I Turn This Book On? - Preparing Text for Access as a Computational Medium. Bellcore TM-ARH-010121 (09/10/87). Published in Proceedings of the 3rd Annual Conference of the University of Waterloo's Centre for the New Oxford English Dictionary (Uses of Large Text Databases) held on the Univ. of Waterloo campus in Waterloo, Ontario, Canada on Nov. 9-10, 1987.

Susan Margaret Hockey

Oxford University Computing Service,
13 Banbury Road,
Oxford,
OX2 6NN
England
Telephone: Oxford (0865) 273226
BITNET: SUSAN@VAX.OXFORD.AC.UK
ARPA etc: SUSAN%VAX.OXFORD.AC.UK @ UK.AC.UCL.CS.NSS

Present Position

Teaching Officer for Computing in the Arts and Section Manager, Computing in the Arts, Oxford University Computing Service

Education

1965-69 Lady Margaret Hall, Oxford (Mary Hammill Exhibitioner)
March 1967 Honour Moderations in Greek and Latin Literature: Class II
June 1969 Final Honour Schools in Oriental Studies (Egyptian with Akkadian): Class I

Positions Held

1969-75 Computer programmer at the Atlas Computer Laboratory, Chilton. (The Atlas Laboratory at that time provided computer facilities for university applications, where the application was not suitable for the university's own computer.)
1975- Teaching Officer for Computing in the Arts at Oxford University Computing Service. Over the past 12 years the post has developed into Manager of the Computing in the Arts section with a staff of 10 (of whom 5 are on permanent appointments and 5 fixed-term).
1979- Fellow (by Special Election) at St Cross College, Oxford. Director of Computing in the College.
1984 Visiting Distinguished Professor at the University of Alberta, Edmonton (January-February).

Professional Activities

1973- Founder member of the Association for Literary and Linguistic Computing and committee member for most of the period since its foundation.
1978 Founder member of the Association for Computers and the Humanities
1979-83 Member of the Executive Council of the Association for Computers and the Humanities
1979-83 Editor of the Bulletin of the Association for Literary and Linguistic Computing.
1984- Chairman of the ALLC.
1985- Member of the Editorial Committee for Literary and Linguistic Computing and of the program committee for the annual ALLC conferences.

Conference papers

Numerous conference papers, invited lectures, and workshops in Great Britain and abroad. I have also been a consultant/academic visitor to the Département d'Informatique at the Université de Montréal in 1979 and 1986.

Books

A Guide to Computer Applications in the Humanities, London: Duckworth, and Baltimore: Johns Hopkins, 1980; reprinted as Johns Hopkins paperback, 1984.

SNOBOL Programming for the Humanities, Clarendon Press, 1986.

Articles in Journals and Books

(with R.F. Churchhouse) The Use of an SC4020 for Output of a Concordance Program, p221-229 in The Computer in Literary and Linguistic Research, ed. R.A. Wisbey, Cambridge University Press, 1971.

A Concordance to the Poems of Hafiz with Output in Persian Characters, p291-306 in The Computer and Literary Studies, eds. A.J. Aitken, R.W. Bailey and N. Hamilton-Smith, Edinburgh University Press, 1973.

Input and Output of Non-standard Character Sets, ALLC Bulletin, 1, No 2 (1973), 32-37.

(with V. Shibayev) The Bilingual Analytical and Linguistic Concordance - BALCON, ALLC Bulletin, 3 (1975) 133-139.

(with Alan Jones and George Mandel) Indexing Hebrew Periodicals with the Aid of the FAMULUS Documentation System, p38-46 in The Computer in Literary and Linguistic Studies (Proceedings of the Third International Symposium), eds. Alan Jones and R.F. Churchhouse, Cardiff: University of Wales Press, 1976.

(with Gordon Appleton) A Course on the Use of Computers in Textual Analysis and Bibliography for the Board of Celtic Studies of the University of Wales, UMRCC Journal, 5, no 2 (1978), 25-28.

Colloquium on the Use of Computers in Textual Criticism; A Report, ALLC Bulletin, 6 (1978), 180-181.

Computing in the Humanities, ICL Technical Journal, Vol 1 Issue 3 (1979), 280-291.

(with Ian Marriott) The Oxford Concordance Project (OCP), series of four articles in ALLC Bulletin, 7 (1979), 35-43, 155-164, 268-275 and 8 (1980), 28-35.

(with Ian Marriott) OCP - The Oxford Concordance Project, p337-345 in Proceedings of the International Conference on Literary and Linguistic Computing, Israel, ed. Zvi Malachi, Tel-Aviv, 1979.

Report on ALLC Symposium at Easter 1980, ALLC Bulletin, 8 (1981), 268-270.

The Oxford Courses for Literary and Linguistic Computing, p175-182 in Computers in Literary and Linguistic Research: Proceedings of the VII ALLC International Symposium, eds. L. Cignoni and C. Peters, Pisa: Giardini Editori e Stampatori, 1983.

Report on the Twelfth International ALLC Conference - Nice, 5-8 June 1985, ALLC Bulletin, 13 (1985), 77-79.

Literature and the Computer at Oxford University, p53-75 in Literary Criticism and the Computer, eds Bernard Derval and Michel Lenoble, Montreal, 1985.

Some Future Developments for Arts Computing, University Computing, 7 (1985), 33-37.

Letter from Oxford, p16-26 in Today's Research: Tomorrow's Teaching: Conference on Computing in the Humanities, conference preprints, ed. Ian Lancashire, Toronto, April 1986.

OCR: The Kurzweil Data Entry Machine, Literary and Linguistic Computing, 1 (1986), 63-67. (This article is being translated into Japanese.)

Workshop on Teaching Computers and the Humanities Courses, Literary and Linguistic Computing, 1 (1986), 228-229.

Report on the Thirteenth ALLC Conference, Literary and Linguistic Computing, 1 (1986), 233-235.

An Historical Perspective, p20-30 in Information Technology in the Humanities: Tools, Techniques and Applications, ed. Sebastian Rahtz, Ellis Horwood, 1987.

SNOBOL in the Humanities: Keynote Address, p1-25 in ICEBOL 86, Proceedings of the 1986 International Conference on the Applications of SNOBOL and SPITBOL, ed. Eric Johnson, Dakota State College, Madison SD.

(with Jeremy Martin), The Oxford Concordance Program Version 2, Literary and Linguistic Computing, 2 (1987), 125-131.

Some Considerations in Providing an Academic Typesetting Service: Experiences at Oxford University, p91-102 in Studies in Honour of Roberto Busa S.J., ed. Antonio Zampolli, Pisa: Giardini Editori e Stampatori, 1987.

In Press

A Survey of Practical Aspects of Computer-Aided Maintenance and Processing of Natural Language Data, in Computational Linguistics. Ein internationales Handbuch zur computergestützten Sprachforschung und ihrer Anwendung, eds. I. Batori, W. Lenders and W. Putschke, to be published by Walter de Gruyter.

Review of Cynthia Spencer, Programming for the Liberal Arts, Rowman and Allenheld, 1985, to appear in Computers and the Humanities.

Publications Pending

Series editor for Oxford University Press for Oxford Studies on Computing and the Humanities. This series will consist of monographs on various applications of computers in humanities research.

(with Nancy Ide, Ian Lancashire, and Glyn Holmes) editor of a volume of state-of-the-art essays on computing and the humanities, to be published by University of Pennsylvania Press in 1989.

(with Robert Kraft), Computer Analysis of Ancient Texts: A Historical Survey, to appear in Aufstieg und Niedergang der Römischen Welt, Band II, 35.

Creating and Using Large Text Databases for Scholarly Research in the Humanities: Some Practical Issues, to appear in Festschrift for B. Quemada, ed. A. Zampolli.

Nancy M. Ide

Department of Computer Science
Vassar College
Poughkeepsie, New York 12601
(914)452-7000 ext. 2478
Bitnet: IDE@VASSAR

Education

  • B.S. Psychology, The Pennsylvania State University, 1972.
  • B.A. English, The Pennsylvania State University, 1972.
  • M.A. English, The Pennsylvania State University, 1976. Thesis: Thematic Structure in William Blake's `Proverbs of Hell.'
  • Ph.D. English with Computer Science minor (M.S. equivalent), The Pennsylvania State University, 1982. Dissertation topic: “Patterns of Imagery in William Blake's The Four Zoas.”
  • Visiting Scholar, Linguistics Summer Institute, 1986.

Current Employment

  • 1982-present: Assistant Professor of Computer Science and member of the Cognitive Science faculty, Vassar College.

Books

Pascal for the Humanities, Philadelphia: University of Pennsylvania Press, 1987.

Methodologies in Humanities Computing, co-edited with S. Hockey, G. Holmes, I. Lancashire, forthcoming from University of Pennsylvania Press, 1989.

Articles

Image Patterns and the Structure of William Blake's The Four Zoas, Blake: An Illustrated Quarterly, 20, 4 (Spring, 1987).

The Lexical Database in Semantic Studies, forthcoming in Linguistica Computazionale: Computational Lexicology and Lexicography, ed. A. Zampolli, 1988.

Computers and the Humanities Courses: Philosophical Bases and Approach, Computers and the Humanities, vol. 21, no. 3, 1987.

Semantic Studies and Computational Linguistics, forthcoming in Methodologies in Humanities Computing, ed. S. Hockey, G. Holmes, N. Ide, I. Lancashire, University of Pennsylvania Press, 1989.

Semantic Patterning: Time Series and Fourier Analysis, forthcoming in Festschrift in Honor of Etienne Evrard, ed. C. Delcourt, Liège, 1988.

The Computational Determination of Meaning in Literary Texts, in Computers and the Humanities: Today's Research, Tomorrow's Teaching, Toronto: University of Toronto, 1986.

Patterns of Imagery in Blake's The Four Zoas, Proceedings of the Twelfth International ALLC Conference (selected papers), ed. E. Brunet, Geneva: Slatkine, 1985.

Computers and the Humanities: Considerations for Course Design and Content, Proceedings of the Seventh International Conference on Computers and the Humanities (selected papers), ed. R. Jones, Dordrecht, The Netherlands: D. Reidel, 1987.

Documentation for the Non-scientist, ACM/SIGUCC Newsletter, Winter, 1981.

Papers

Numerous invited papers and other conference contributions in the fields of humanities computing and computer science education.

Professional Activities

  • President, Association for Computers and the Humanities, 1985-present.
  • Member, Liberal Arts Computer Science Consortium, 1984-present.
  • International Representative for the United States to the Association for Literary and Linguistic Computing, 1987-present.
  • International Representative for the Northeastern United States to the Association for Literary and Linguistic Computing, 1984-1987.
  • Editorial Advisory Board, The Humanities Computing Yearbook, ed. I. Lancashire and W. McCarty, Oxford University Press.
  • Organizer and Co-director, Meeting on Text Encoding Practices, funded by the National Endowment for the Humanities, Vassar College, November 1987.
  • Organizer and Director, Workshop on Computers and the Humanities Courses, sponsored by the Alfred P. Sloan Foundation and the Association for Computers and the Humanities, Vassar College, July 1986.
  • Member, Executive Council, Association for Computers and the Humanities, 1984-1985; nominating committee, 1984, 1985 (chairman).
  • Chair, Steering Committee for the Scholarly Research Text Initiative, 1987-present.
  • Chair, Association for Computers and the Humanities Committee on Text Encoding Practices, 1987-present.
  • Member, Steering Committee, National Educational Computer Conference, 1986-present.
  • Member, Steering Committee, Eastern Small College Computing Conference, 1985-87.
Member of: Association for Computational Linguistics, Association for Computers and the Humanities, Association for Computing Machinery, Association for Literary and Linguistic Computing, Linguistic Society of America, Modern Language Association.

Grants and Awards

  • National Endowment for the Humanities Grant for the Development of Text Encoding Guidelines, 1987.
  • Alfred P. Sloan Foundation grant for the Workshop on Teaching Computers and the Humanities Courses, 1985.

C. M. Sperberg-McQueen

Computer Center (M/C 135)
University of Illinois at Chicago
Box 6998
Chicago, Illinois 60680
(312) 386-3584 / 996-2477
Bitnet: U18189@UICVM

Education

1985 Ph.D., Comparative Literature, Stanford University. Dissertation: “An Analysis of Recent Work on Nibelungenlied Poetics.”
1982-83 Georg-August Universität zu Göttingen
1978-79 Université de Paris IV (Sorbonne)
1977 A.M., German Studies, Stanford University
1977 A.B., German Studies and Comparative Literature, with distinction, with Honors in Humanities and Honors in German Studies, Stanford University
1975-76 Rheinische Friedrich-Wilhelms-Universität Bonn, Freie Universität Berlin

Publications

In press Sigurdharkvidha in skamma and Brot af Sigurdharkvidhu. In Encyclopedia of Scandinavia in the Middle Ages, ed. Phillip Pulsiano. New York: Garland, [forthcoming].
In progress (With Prof. Joseph Harris) Bibliography of the Eddas.
1987 Review of Nancy M. Ide, Pascal for the Humanities (Philadelphia: Univ. of Pennsylvania Press, 1987). Computers and the Humanities 21 (1987): 261-64.
1985 The Legendary Form of Sigurdharkvidha in skamma. Arkiv för nordisk filologi 100 (1985): 16-40.
1979 Approaching `Mother Courage,' or, Who's Afraid of Bertolt B.? Stanford Honors Essays in Humanities, 22. Stanford, Ca., 1979.

Talks

1987 Issues of Scope and Flexibility in Designing Guidelines for Machine-Readable Text Encoding. Association for Computers and the Humanities, National Endowment for the Humanities: Meeting on Text Encoding Practices. Poughkeepsie, New York, 12 November 1987.
1987 “Concordances and Beyond: Data Base Techniques for Literary and Linguistic Research” Claremont Colleges, Mellon Foundation: Computer Applications to the Curriculum—Beyond the Quantitative (workshop for selected Claremont Colleges faculty). Claremont, California, 1-2 June 1987.
1987 “Providing Centralized Support for Humanities Computing.” International Conference on Computers and the Humanities. Columbia, S.C., 10 April 1987.
1987 “What Computers Ought to be Doing for Us: Three Daydreams.” Pomona College. Claremont, California, 10 February 1987.

Teaching

1980 Section Leader, Humanities 62 (supervised two discussion sections meeting weekly for two hours in survey of medieval and Renaissance thought and literature), Stanford University
1979-80 German 1 and 2 (first-year German), Stanford University

Honors and Fellowships

1982-83 Deutscher Akademischer Austauschdienst fellowship in Göttingen
1977-81 Departmental Fellow, Stanford University
1976 Phi Beta Kappa, Stanford University

Professional Activities

1987- Member, Executive Council, Association for Computers and the Humanities
1987 Panelist, Discussion on Instructional Software Development. IBM Seminar for Deans of Arts and Letters. Colorado Springs, Colorado, 8 April 1987.
1986 Panelist, Final Summary Session. Association for Computers and the Humanities, Sloan Foundation: Workshop on Teaching Computers-in-the-Humanities Courses. Vassar College, Poughkeepsie, New York, August 1986.
1986 Organizer and Chair, Round Table Discussion of Foreign Language Processing. IBM Advanced Education Projects Conference. San Diego, California, February 1986.
1986- Member, Association for Computers and the Humanities, Working Committee on Text Encoding Practices.
1985-86 Member, Executive Board, Northeast Association for Computers in the Humanities
1983 Editorial consultant (Nibelungenlied stanza, Middle High German metrics), revised edition of Princeton Encyclopedia of Poetics.
1982-86 Outside reader (German medieval literature), MLN.
1981-82 Editorial assistant, MLN 97.3 (1982).
1980- Extensive work developing computer-based tools for humanities teaching and research, teaching computer use to humanists, and consulting with humanists on computer applications in research and teaching.

Employment

1987- Research Programmer, University of Illinois at Chicago Computer Center
1985-86 Humanities Computing Specialist, Princeton University Computing Center (Research Services Group)

Donald E. Walker

Bell Communications Research
435 South Street, MRE 2A379
Morristown, NJ 07960
Telephone: (201) 829-4312
Arpanet: walker@flash.bellcore.com
Usenet: ucbvax!bellcore!walker

Specialized Professional Competence

Research and research management in computer and information science, focusing primarily on the intersection of computational linguistics, artificial intelligence, and information science. Particular interests in the development of natural language and knowledge resource systems, that is, the design and implementation of computer-based facilities that support accessing, organizing, and using knowledge and information. Extensive experience with natural language understanding systems (both text and speech), text processing systems, and personal file management systems with applications in medicine and law. Current activities concentrate on the lexicon, machine-readable dictionaries and other online reference works, and the management of massive document files.
Projects administered have been funded in excess of $10,000,000.

Professional Appointments

1984- District Manager, Artificial Intelligence and Information Science Research, Bell Communications Research, Morristown, New Jersey.
1983-84 Program Manager, Natural-Language and Knowledge-Resource Systems. SRI International, Menlo Park, California.
1971-83 Project Leader and Senior Research Linguist, SRI International, Menlo Park, California.
1962-71 Head, Language and Text Processing, The MITRE Corporation, Bedford, Massachusetts.
1961-62 Technical Staff, The MITRE Corporation.
1961-71 Visiting Scholar, Research Affiliate, and Guest, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts.
1960-61 Clinical Assistant Professor of Psychology, Department of Psychiatry, Baylor University College of Medicine, Houston, Texas.
1959-61 Research Associate, Department of Psychiatry, Baylor University College of Medicine, Houston, Texas. Principal Investigator, Public Health Service Grant on Language Variability in Psychiatric Patients.
1957-61 Research Psychologist (part time), Veterans Administration Hospital, Houston, Texas.
1955 Visiting Assistant Professor of Linguistics in the Linguistic Institute and Research Associate in the Office of the University Examiner (Summer), University of Chicago, Chicago, Illinois.
1953-1961 Assistant Professor of Psychology, Rice University, Houston, Texas.

Education

1945-1947 Deep Springs College, Deep Springs, California
1947-1952 University of Chicago, Chicago, Illinois; Psychology Ph.D. 1955; Thesis: The relation between creativity and selected test behaviors for mathematicians and chemists
1952-1953 Yale University, New Haven, Connecticut; Linguistics
1953 Indiana University (Summer), Bloomington, Indiana; Linguistics

Honors

Association for Computational Linguistics, Certificate of Recognition, 1987. American Association for Artificial Intelligence, Certificate of Recognition, 1984. Association for Computing Machinery National Lecturer, 1973-1974. Social Science Research Council Fellow, 1952-1953. Phi Beta Kappa, elected 1952. Sigma Xi, elected 1952. University Fellow and various scholarships, University of Chicago, 1947-1952. Deep Springs Scholarships, Deep Springs College, 1945-1947.

Major Professional Activities

  • Association for Computational Linguistics: Secretary-Treasurer, 1976-present; President, 1968; Vice President, 1967
  • International Joint Conferences on Artificial Intelligence: Secretary-Treasurer of the Board of Trustees and of the Conferences, 1977-present; Trustee, 1973-1977; Past General Chair, 1973 Conference; General Chair, 1971 Conference; Program Chair, 1969 Conference; Council Member, 1969-1973
  • International Federation for Documentation: Committee on Linguistics in Documentation, Chair, 1973-1980; Vice-Chair, 1981-present
  • American Association for Artificial Intelligence: Secretary-Treasurer, 1979-1983; Finance Committee, 1981-1984; member of AAAI organizing committee
  • American Federation of Information Processing Societies: Member, Board of Directors, 1967-1971, 1979-1983; Secretary, 1971-1972
  • American Society for Information Science: Chair, Special Interest Group on Automated Language Processing, 1970-1972
  • Association for Computing Machinery: National Lecturer, 1973-1974; Visiting Scientist Program, 1969-1970
  • National Academy of Sciences: U.S. National Committee for FID (International Federation for Documentation), Chair, 1979-1982; member, 1974-1982; NAS Delegate to the FID 1980 General Assembly; member, SYSTRAN Panel, Advisory Committee to the Air Force Systems Command, 1970-1972
  • National Science Foundation: Consultant, Office of Science Information Service, 1971; panelist and reviewer for proposals for Computer Science, Linguistics, Information Science and Technology, 1966-present
  • National Institutes of Health: Special Study Section, 1979
  • U.S. Air Force Scientific Advisory Board: Member, Ad Hoc Committee on Machine Translation, 1974
  • Center for Applied Linguistics: Advisory Committee, 1972
  • Houston Psychological Association: Secretary-Treasurer, 1957-1958; Executive Committee, 1959-1960

Editorial Boards

  • American Journal of Computational Linguistics: Managing Editor, 1976-present; Editorial Board, 1974-1977
  • Annual Review of Information Science and Technology: Advisory Board, 1983-1986
  • Artificial Intelligence, An International Journal: Editorial Board, 1968-present
  • Data and Knowledge Engineering: Associate Editor, 1985-present
  • Linguistica Computazionale: Editorial Board, 1983-present
  • Computers and the Humanities: Editorial Board, 1982-present
  • Encyclopedia of Artificial Intelligence: Editorial Board, 1983-1987
  • International Forum on Information and Documentation: Editorial Board, 1978-1982
  • Linguistic Calculation (Series sponsored by the Kval Institute for Information Science and published by D. Reidel Publishing Company): Editorial Board, 1982-present
  • Sprache und Datenverarbeitung: Editorial Board, 1977-present
  • Studies in Natural Language Processing (Series sponsored by the Association for Computational Linguistics and published by Cambridge University Press): Editorial Board, 1982-present

National Professional Societies

American Association for Artificial Intelligence, 1979-present. American Psychological Association, 1953-1972. American Society for Information Science, 1970-present. Association for Computers and the Humanities, 1979-present. Association for Computational Linguistics, 1965-present. Association for Computing Machinery, 1963-present. Association for Literary and Linguistic Computing, 1985-present. Dictionary Society of North America, 1985-present. Euralex, 1985-present. Linguistic Society of America, 1960-present. Society for the Study of Artificial Intelligence and Simulation of Behavior, 1979-present.

Publications and Major Reports

The Development of a Technique for Measuring Stimulus-Bound—Stimulus-Free Behavior. American Psychologist, 6, 365, 1951.

Consistent Characteristics in the Behavior of Creative Mathematicians and Chemists. American Psychologist, 7, 371, 1952.

The Relation Between Creativity and Selected Test Behaviors for Mathematicians and Chemists. Ph.D. Thesis, University of Chicago, 1955.

Psycholinguistics: A Survey of Theory and Research Problems. Edited by Charles E. Osgood and Thomas A. Sebeok. Indiana University Publications in Anthropology and Linguistics, Memoir 10. Waverly Press, Baltimore, 1954. Also Journal of Abnormal and Social Psychology, 1954, Supplement. Reissued by Indiana University Press, Bloomington, Indiana, 1965.

Parents of Schizophrenics, Neurotics, and Normals (with Seymour Fisher, Ina Boyd, and Diane Sheer). A.M.A. Archives of General Psychiatry, 1959, 1, 149-166.

The Brain and Behavior. Rice Institute Pamphlet, 1960, 47, 48-80.

Whittaker's `Postulates of Impotence' and Theory in Psychology (with Trenton W. Wann). Psychological Record, 1961, 11, 383-393.

The Interpretation of Data: Puberty Rites (with Edward Norbeck and Mimi Cohen). American Anthropologist, 1962, 64, 463-485.

The Structure of Languages for Man and Computer (with James M. Bartlett). SS-10, The MITRE Corporation, Bedford, Massachusetts, November 1962. Presented at the First Conference on Information System Sciences, The Homestead, Virginia, November 1962.

The Concept `Idiolect': Contrasting Conceptualizations in Linguistics. In Proceedings of the Ninth International Congress of Linguistics (Cambridge, Massachusetts, August 27-31, 1962). Edited by Horace G. Lunt. Mouton, The Hague, 1964. Pp. 556-561.

Information System Sciences (Ed., with Joseph Spiegel). Spartan Books, Washington, D.C., 1965.

English Preprocessor Manual (Ed.). SR-132, The MITRE Corporation, Bedford, Massachusetts, December 1964; revised, May, 1965.

The MITRE Syntactic Analysis Procedure for Transformational Grammars (with Arnold M. Zwicky, Joyce Friedman, and Barbara C. Hall). AFIPS Conference Proceedings: Fall Joint Computer Conference, 1965, 27, 317-326.

Recent Developments in the MITRE Analysis Procedure (with Paul G. Chapin, Michael L. Geis, and Louis N. Gross). MTP-11, The MITRE Corporation, Bedford, Massachusetts, June 1966. Presented at the Annual Meeting of the Association for Machine Translation and Computational Linguistics, Los Angeles, July 1966.

Information System Science and Technology (Ed.). Thompson Book Company, Washington, D.C., 1967.

Critique of papers by Silvio Ceccato and Sidney M. Newman. (International Symposium on Relational Factors in Classification, College Park, Maryland, June 1966.) Information Storage and Retrieval, 1967, 3, 216-218, 348-350.

SAFARI, an On-Line Text-Processing System. Proceedings of the American Documentation Institute, 1967, 4, 144-147.

Proceedings of the International Joint Conference on Artificial Intelligence (Ed., with Lewis M. Norton). Washington, D.C., 1969.

On-Line Computer Aids for Research in Linguistics (with Louis N. Gross). In Information Processing 68. Edited by A.J.H. Morrell. North-Holland, Amsterdam, 1969.

Computational Linguistic Techniques in an On-Line System for Textual Analysis. International Conference on Computational Linguistics: COLING 1969 (Sanga-Saby, Sweden, 1-4 September 1969). Preprint No. 63. KVAL, Stockholm, 1969.

Interactive Bibliographic Search: The User/Computer Interface (Ed.). AFIPS Press, Montvale, New Jersey, 1971.

Social Implications of Automatic Language Processing. In Research Trends in Computational Linguistics. Center for Applied Linguistics, Washington, D.C., 1972. Pp. 78-86.

Speech Understanding Research. Annual Report, Project 1526, Artificial Intelligence Center, Stanford Research Institute, Menlo Park, California, February 1973.

Automated Language Processing. In Annual Review of Information Science and Technology, Volume 8. Edited by Carlos A. Cuadra and Ann W. Luke. American Society for Information Science, Washington, D.C., 1973. Pp. 69-119.

Speech Understanding Through Syntactic and Semantic Analysis. IEEE Transactions on Computers, 1976, C-25, 432-439. Also in Advance Papers, Third International Joint Conference on Artificial Intelligence (Stanford, California, 20-23 August 1973). Stanford Research Institute, Menlo Park, California, 1973, 208-215.

Speech Understanding, Computational Linguistics, and Artificial Intelligence. In Computational and Mathematical Linguistics, Proceedings of the International Conference on Computational Linguistics (Pisa, Italy, 27 August-1 September 1973), Volume I. Edited by Antonio Zampolli and Nicoletta Calzolari. Casa Editrice Leo S. Olschki, Firenze, 1977. Pp. 725-740.

Speech Understanding Research. Annual Report, Project 1526, Artificial Intelligence Center, Stanford Research Institute, Menlo Park, California, May 1974.

The SRI Speech Understanding System. IEEE Transactions on Acoustics, Speech and Signal Processing, 1975, ASSP-23, 397-416. Also in Contributed Papers, IEEE Symposium on Speech Recognition (Carnegie-Mellon University, Pittsburgh, Pennsylvania, 15-19 April 1974). IEEE, New York, 1974, 32-37.

Speech Understanding Research (with William H. Paxton, Jane J. Robinson, Gary G. Hendrix, and Ann E. Robinson). Annual Report, Project 3804, Artificial Intelligence Center, Stanford Research Institute, Menlo Park, California, June 1975.

Progress in Speech Understanding Research at SRI. Proceedings of the Fourth International Congress of Applied Linguistics (Stuttgart, German Federal Republic, 25-30 August 1975). Edited by Gerhard Nickel. Hochschul Verlag, Stuttgart, 1976.

Artificial Intelligence and Language Processing: A Directory of Research Personnel. American Journal of Computational Linguistics, 1976, Microfiche 39.

Speech Understanding Research. Semiannual Technical Report, Project 4762, Artificial Intelligence Center, Stanford Research Institute, Menlo Park, California, June 1976.

Speech Understanding Research (Ed.). Final Technical Report, Project 4762, Artificial Intelligence Center, Stanford Research Institute, Menlo Park, California, October 1976.

An Overview of Speech Understanding Research at SRI (with Barbara J. Grosz, Gary G. Hendrix, William H. Paxton, Ann E. Robinson, Jane J. Robinson, and Jonathan Slocum).

Speech Understanding Systems: Report of a Steering Committee (with Mark F. Medress, Franklin S. Cooper, James W. Forgie, C. Cordell Green, Dennis H. Klatt, Michael H. O'Malley, Edward P. Neuburg, Allen Newell, D. Raj Reddy, H. Barry Ritea, June E. Shoup-Hummel, and William A. Woods). Artificial Intelligence, 1977, 9, 307-316; SIGART Newsletter, April 1977, 62, 4-8.

Procedures for Integrating Knowledge in a Speech Understanding System (with William H. Paxton, Barbara J. Grosz, Gary G. Hendrix, Ann E. Robinson, Jane J. Robinson, and Jonathan Slocum). Proceedings of the Fifth International Joint Conference on Artificial Intelligence (Cambridge, Massachusetts, 22-25 August 1977). Carnegie-Mellon University, Pittsburgh, Pennsylvania, 1977. Pp. 36-42.

Research on Speech Understanding and Related Areas at SRI. In Proceedings, Voice Technology for Interactive Real-Time Command/Control Systems Applications (NASA Ames Research Center, Moffett Field, California, 6-8 December 1977). Edited by Robert Breaux, Mike Curran, and Edward M. Huff. NASA Ames Research Center, Moffett Field, California, 1977. Pp. 45-63.

Natural Language in Information Science: Perspectives and Directions for Research (Ed., with Hans Karlgren and Martin Kay). Skriptor, Stockholm, 1977.

Introduction. In Natural Language in Information Science: Perspectives and Directions for Research. Edited by Donald E. Walker, Hans Karlgren, and Martin Kay. Skriptor, Stockholm, 1977. Pp. 3-5.

The Workshop and Its Results. In Natural Language in Information Science: Perspectives and Directions for Research. Edited by Donald E. Walker, Hans Karlgren, and Martin Kay. Skriptor, Stockholm, 1977. Pp. 7-18.

Understanding Spoken Language (Ed.). Elsevier North-Holland, New York, 1978.

Introduction and Overview. In Understanding Spoken Language. Edited by Donald E. Walker. Elsevier North-Holland, New York, 1978. Pp. 1-13.

Natural Language Access to a Melanoma Data Base (with Martin Epstein). In Proceedings of the Second Annual Symposium on Computer Applications in Medical Care, Washington, D.C., November 5-9, 1978. IEEE, New York, 1978. Pp. 320-325.

Information Retrieval on a Linguistic Basis. In Aspects of Automatized Text Processing. Edited by Sture Allen and Janos Petofi. Helmut Buske, Hamburg, 1979. Pp. 137-156.

SRI Research on Speech Understanding. In Trends in Speech Recognition. Edited by Wayne A. Lea. Prentice-Hall, Englewood Cliffs, New Jersey, 1980. Pp. 294-315.

Economic Impact of Research on Natural Language (with Hans Karlgren). Proceedings of FID Congress 1980, Copenhagen, Denmark, 25-28 August 1980. Also International Forum on Information and Documentation, 1981, 6, 7-10.

Natural Language Access to Medical Text (with Jerry R. Hobbs). Proceedings of the Fifth Annual Symposium on Computer Applications in Medical Care. IEEE, New York, 1981. Pp. 269-273.

The Organization and Use of Information: Contributions of Information Science, Computational Linguistics and Artificial Intelligence. Journal of the American Society for Information Science, 1981, 32, 347-363.

Organizing and Using Textual Information. 1982 Office Automation Conference Digest. Arlington, Virginia: AFIPS Press, 1982. Pp. 681-686.

Reflections on 20 Years of the ACL: An Introduction. Proceedings of the 20th Annual Meeting of the Association for Computational Linguistics. Menlo Park, California: Association for Computational Linguistics, 1982. Pp. 89-91.

A Society in Transition. Proceedings of the 20th Annual Meeting of the Association for Computational Linguistics. Menlo Park, California: Association for Computational Linguistics, 1982. Pp. 98-99.

Natural Language Access Systems and the Organization and Use of Information. COLING-82: Proceedings of the Ninth International Conference on Computational Linguistics, edited by Jan Horecky. Amsterdam: North-Holland, 1982. Pp. 407-412.

Natural Language Access to Structured Text (with Jerry R. Hobbs and Robert A. Amsler). COLING-82: Proceedings of the Ninth International Conference on Computational Linguistics, edited by Jan Horecky. Amsterdam: North-Holland, 1982. Pp. 127-132.

Sublanguages (with Richard Kittredge, Joan Bachenko, Ralph Grishman, and Ralph Weischedel). In Applied Computational Linguistics in Perspective: Proceedings of the Workshop, edited by Carroll Johnson and Joan Bachenko. American Journal of Computational Linguistics, 1982, 8, 79-82.

The Polytext System--A New Design for a Text Retrieval System (with Hans Karlgren). In Questions and Answers, edited by Ferenc Kiefer. Reidel, Dordrecht, Holland, 1983. Pp. 273-294.

Computational Strategies for Analyzing the Organization and Use of Information. In Knowledge Structure and Use: Implications for Synthesis and Interpretation, edited by Spencer A. Ward and Linda J. Reed. Temple University Press, Philadelphia, Pennsylvania, 1983. Pp. 229-284.

Contributor, Information Science Panel Report, Proceedings of the Information Technology Workshop (Leesburg, Virginia, 5-7 January 1983).

Automatic Segmentation and Analysis of Digital Data Streams (with Robert A. Amsler and Bernard Elspas). Final Report: Project Definition Phase, Project 5383. Computer Science and Technology Division, SRI International, Menlo Park, California, May 1983.

The Use of Machine-Readable Dictionaries in Sublanguage Analysis (with Robert A. Amsler). In Analyzing Language in Restricted Domains, edited by Ralph Grishman and Richard Kittredge. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1986. Pp. 69-83.

Knowledge Resource Tools for Information Access. Future Generations Computer Systems, 1986, 2, 161-171.

Knowledge Resource Tools for Accessing Large Text Files. In Machine Translation: Theoretical and Methodological Issues, edited by Sergei Nirenberg. Cambridge University Press, Cambridge, England, 1987. Pp. 247-261.

Distributed Expert-Based Information Systems: An Interdisciplinary Approach (with Nicholas Belkin and others). Information Processing and Management, 1987, 23:5, in press.

Automating the Lexicon: Research and Practice in a Multilingual Environment (Ed. with Antonio Zampolli and Nicoletta Calzolari). Submitted for publication in the Association for Computational Linguistics series, Studies in Natural Language Processing.

Introduction. In Automating the Lexicon: Research and Practice in a Multilingual Environment, edited by Donald E. Walker, Antonio Zampolli, and Nicoletta Calzolari. Submitted for publication in the Association for Computational Linguistics series, Studies in Natural Language Processing.

November 1987

Antonio Zampolli

Dipartimento di Linguistica
Università di Pisa
Via S. Maria 36
I-56100 Pisa
Italy
Tel. 050 (24773).
Istituto di Linguistica Computazionale
Via della Faggiola 32
I-56100 Pisa
Italy
Tel. 050 (502082).

Current Position

Full Professor of Mathematical Linguistics at the University of Pisa and Director of the Institute for Computational Linguistics of the Italian National Research Council (CNR).

Teaching Experience

1970-1976 Associate Professor of Mathematical Linguistics (“professore incaricato”) at the Faculty of Letters and Philosophy of the University of Pisa, Italy.
1975-1977 Full Professor of General Linguistics (“professore straordinario”) at the University of Genoa, Italy.
1978- Full Professor of Mathematical Linguistics (“professore ordinario”) at the Faculty of Letters and Philosophy at the University of Pisa, Italy.
1975-1978 Professor in Computational and Mathematical Linguistics at the School for Graduate Studies in Linguistic Sciences (“Scuola di Perfezionamento in Scienze Linguistiche”) at the Faculty of Letters and Philosophy of the University of Pisa, Italy.
1982- Director of the curriculum in Computational and Applied Linguistics for the Doctorate of Research (PhD) in Linguistics.
1970 Director of the International Summer School on Mathematical and Computational Linguistics “Linguistic and Literary Electronic Data Processing”.
1972-1977 Director of the II, III, and IV International Summer Schools on Mathematical and Computational Linguistics.
1982- Lecturer in Linguistics at the School of Graduate Studies in Phoniatrics and Logopedics of the Faculty of Medicine of the University of Pisa.

Scientific Activity in International Committees and Associations

  • ICCL (International Committee on Computational Linguistics): Vice President.
  • ALLC (Association for Literary and Linguistic Computing): Elected to the Directive Committee; ALLC representative for Italy; “Corresponding Editor” of the Bulletin; President of the Specialist Group for Italian texts; Member of the Specialist Group for International Standards for Data Storage in Machine-readable Form and for Networks. From 1983: President of the ALLC.
  • CIRPHO (Cercle International de Recherche Philosophique par Ordinateur): Vice President.
  • FID (Fédération Internationale de Documentation): Member of the “Linguistics in Documentation” ad hoc group.
  • AILA (Association Internationale de Linguistique Appliquée): Vice President; Editor of the Bulletin; Member of the Executive Committee; Co-president of the Scientific Commission for Applied Computational Linguistics; Member of the Committee for the International Summer Schools; Member of the Committee for the Scientific Commissions; Member of the Financial Committee.
  • ACH (Association for Computers and the Humanities): Vice President (pro tempore). From 1983: Italian representative of ACH.
  • EJCSC (European Joint Committee for Scientific Cooperation of the Council of Europe): Co-president of the group for the coordination of automatic text processing procedures.
  • CETIL (“Comité d'Experts pour le Transfert de l'Information entre Langues européennes” of the EC): Vice President; Member of the group for the Coordination of Computer-aided Translation.
  • CIDST (Comité pour l'Information et la Documentation Scientifique et Technique of the EC): Vice President of the ad hoc group for the Automatic Processing of Multilingual Information.
  • Member of the Scientific Council of the Index Thomisticus.
  • Member of the Committee of Experts of the Italian Ministry for Foreign Affairs for the teaching of Italian abroad.
1982- Member of the Ad hoc Group for lexicography of the European Science Foundation.
1982- Italian correspondent for INFOTERM (Vienna).
1983- Italian representative for ACPM (Committee of EUROTRA).
1983- Member of Executive Committee of EUROLEX.
1985- Italian delegate in the Commission of the European Community CGC 12 (Management and Coordination Advisory Committee - Linguistic Problems).
1985- Chairman of the ad hoc Task Force (for Machine Translation) of the CGC 12.
1985- Co-opted member responsible for the use of computers in the humanities in Europe for the Standing Committee for the Humanities of the European Science Foundation.

Organisation of International Meetings

1968 Chairman of the Organizing Committee of the “Colloque International sur le Dictionnaire latin de machine”.
1970 Chairman of the Organizing Committee of the “Colloque International L'élaboration électronique en lexicologie et en lexicographie” (Pisa).
1972 Member of the “Program Committee” for the “5th International Conference on Computational Linguistics”.
1973 General Coordinator of the 5th International Conference on Computational Linguistics (Pisa).
1974 Responsible for the “Computational Linguistics” section of the “4th International Congress of Applied Linguistics” (Stuttgart, 1975).
1974 Member of the “Program Committee” for the “6th International Conference on Computational Linguistics” (Montreal, 1975).
1974 Member of the Honorary Committee of the “2nd International Conference on Computers and the Humanities” (Los Angeles, 1975).
1975 Member of the Scientific Program Committee and Co-president of the “Computational Linguistics and Machine Translation” section of the “4th AILA World Congress” (Stuttgart).
1976 Member of the Scientific Program Committee for the section on “Lexicography and Quantitative Linguistics” of the “International Conference on Computational Linguistics” (Ottawa).
1978 Co-president of the Scientific Committee for the “Computational Linguistics” section of the “5th AILA World Congress” (Montreal).
1978 President of the “Scientific Program Committee” of the “International Conference on Computational Linguistics” (Bergen).
1980 Member of the “Scientific Program Committee” of the “International Conference on Computational Linguistics” (Tokyo).
1981 Responsible for the workshop “On the Possibilities and Limits of the Computer in producing and publishing Dictionaries”, organised for the European Science Foundation (Pisa).
1981 Member of the Organizing Committee of the Round Table on the Application of the Computer to Spanish (Pisa).
1982 Member of the Scientific Program Committee of the “International Conference on Computational Linguistics” (Prague).
1982 Chairman of the Organizing Committee and member of the Program Committee of the VII ALLC Symposium “Computers in Literary and Linguistic Research” (Pisa).
1983 Member of the “Scientific Programme Committee” of the ICCL (San Francisco).
From 1983 Co-chairman of the ALLC Symposia and Proceedings.
1984 Member for computational linguistics of the Scientific Committee of Eurolex II (Zurich 86).

Research Work

1960 Degree cum laude in Classics at the University of Padua with a thesis in glottology (Studies in linguistic statistics performed using an IBM system).
1961 NATO study grant at the “International Summer Institute on Mechanical Translation”.
1962 NATO study grant at the “International Summer Institute on Automatic Documentation”.
1960-1965 Research Director and Assistant to the Director at the “Centro per l'Automazione dell'Analisi Linguistica (CAAL) di Gallarate”.
From 1966 Consultant to the Accademia della Crusca, Opera del Vocabolario.
1967-1978 Director of the “Divisione Linguistica” of CNUCE (an institute of the Italian National Research Council).
From 1971 Director of the project “Vocabolario Italiano di macchina” (Italian Machine Dictionary).
1974 Member of the research group for “Moderne Tecnologie della documentazione bibliografica ed emerografica corrente per le Università e gli Enti di ricerca in Italia” of the Committee for Technological Research of CNR.
1974 Representative for Computational Linguistics on the CNR Scientific Committee for the project to Automate the Documentation of the House of Commons.
From 1978 Director of the “Istituto di Linguistica Computazionale (ILC)” of the Italian National Research Council (Pisa). The Institute has a staff of thirty, and the following research projects are currently in progress:
  • Procedures and tools for the automation of philological research;
  • Software and technologies for linguistic data processing;
  • Processing and management of textual archives and multifunctional linguistic databases;
  • Italian Machine Dictionary;
  • Automatic syntactic analysis of Italian;
  • Studies and textual analyses of present-day Italian;
  • Morphosyntactic analysis and lemmatisation of Spanish;
  • Knowledge representation language and semantic parser.
The Institute also collaborates in more than 50 text-processing projects under way in Italy and abroad which use the encoding standards and generalised procedures of the ILC, and it has a statutory mandate to maintain contacts, collaborations, and exchanges between Italy and other countries in the field of computer processing of literary and linguistic data.

Editorial Activity

Member of the Advisory Council of Computers and the Humanities (New York).
Member of the Editorial Advisory Board of TA-Information (Paris).
Member of the Editorial Advisory Board of ITL - Review of Applied Linguistics (Louvain).
Member of the Editorial Board of the ALLC Bulletin (Cambridge, U.K.).
Member of the “Comité de rédaction” of the Cahiers de Lexicologie (Besançon).
Editor of the AILA Bulletin.
Member of the Editorial Advisory Board of Multilingua (The Hague).
Director of the journal Linguistica Computazionale.

Conferences and Seminars

Invited speaker at institutes and research centres in: Amsterdam, Aarhus, Barcelona, Budapest, Brussels, Chicago, Copenhagen, Liège, Leuven, London, Ljubljana, Luxembourg, Madrid, Minneapolis, Moscow, Nancy, New York, Oviedo, Paris, Prague, Provo, Stanford, Stockholm, Varna, Vienna, etc.

G. Budget for Subcontract with the University of Illinois at Chicago

[Note: Detailed budget information has been omitted from this copy.]

Notes

[1] Or even the distinction between upper and lower case letters, which was at one time not preserved by standard data-processing industry practice.
[2] The various examples given include—but not in this order, since some forms apply to more than one program, and vice versa—references usable for processing with the ARchival Retrieval and Analysis System (ARRAS), COCOA (word COunt and COncordance on Atlas), IBM SCRIPT/DCF, the TeX macro package LaTeX, the Oxford Concordance Program, software designed for the Thesaurus Linguae Graecae, Waterloo SCRIPT, Waterloo GML, and Electronic Text Corporation's WordCruncher.
[3] It should be noted that CD-ROMs are at present usually encoded in proprietary formats unreadable by any program but the maker's; this reflects both the lack of any industry standard for the physical organization of CD-ROM and the desire of vendors to prevent unauthorized copying. In this environment, an encoding scheme of the type we propose has no function, since the data format is already known to the only program intended to read it. As efforts to develop a standard physical organization for CD-ROM progress, however, and data on CD-ROM, WORM disks, and other optical storage devices become accessible to generic disk-reading routines, the need for a standard, well-designed, and well-documented encoding scheme will resurface.
[4] Among the projects now in prospect are the Oxford project to create a computer database of pre-Restoration English drama, the Queen's University project to create an electronic library of Canadian literature, and the effort among computational linguists to organize a cooperative encoding of the classic Century Dictionary. Further along is a project organized by scholars in Toronto and Otago, New Zealand, to create a Tudor Textbase containing a range of written works from the sixteenth century. This corpus will form the basis of a glossary or short dictionary of early Tudor English. The work of entering the texts is already in progress at Otago and Toronto; Oxford University Press have expressed an interest in publishing the text base and lexicon in electronic form. The organizers have expressed interest in testing the proposed common text encoding format on their corpus.
[5] The developers of the Oxford Concordance Program have already announced their intention to support any common format created by this project. As noted below, it is also expected that other software compatible with the new scheme will be created, as offshoots of this project, as byproducts of the development work on this project, or independently.
[6] The prospect of compatibility of one text format with a wide variety of processing programs is one strong reason, apart from the intrinsic merits of its syntax, to explore SGML very carefully as a possible basis for the new encoding scheme. Because of the variety of existing and projected programs supporting SGML applications, SGML offers the prospect of uniting common word processing functions with research-oriented text encoding by means of a relatively simple common syntax of “markup tags” used to describe the logical structure of a document. Word processing programs can generate correctly formatted printed copies, in a variety of styles, of a text marked with such tags. Analytic programs built to handle SGML syntax could produce concordances, lists of lemmata, syntactic analysis or summaries of syntactic data, or other research tools from the same file.
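By way of illustration, a short passage marked up in this style might look as follows; the tag names are invented for the example only and do not anticipate the tag set the guidelines will define:

     <poem>
     <title>The Sick Rose</title>
     <stanza n=1>
     <line>O Rose thou art sick.</line>
     <line>The invisible worm,</line>
     </stanza>
     </poem>

A formatting program could use these tags to print the poem in any of several house styles, while an analytic program could use the very same tags to restrict a concordance to verse lines or to titles.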
[7] The recommendations will specify both what features are to be encoded (e.g., chapter divisions or the title of a text) and how those features are to be represented (by stipulating the use of specific tags or markup conventions).
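A single recommendation might thus take a shape like the following sketch, in which the tag names and the attribute are purely illustrative:

     <chapter n=12>
     <chaphead>... heading of the chapter ...</chaphead>
     ... text of the chapter ...
     </chapter>

Here the what is the chapter division and its heading; the how is the stipulated form of the tags themselves.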
[8] Thus one metrist may need to encode vowel length but be indifferent to the number of syllables in a line; another may be interested in lexicality of individual words and syllables as a possible metrical determinant but not care about vowel length. Theoretical divergences are even more obvious, and more profound, in syntactic study and lexicology.
[9] The automatic or semi-automated conversion of typesetting tapes to a more generally useful form is regarded as a key benefit of the metalanguage development effort. At present, many publishers routinely destroy the computer tapes used to set new books; even if the tapes are retained, they often do not contain the last typographic corrections (done by hand). And when publishers are willing to share the tapes with researchers, the researchers typically face a daunting task in understanding the encoding and rendering it useful.
[10] Cheryl A. Fraser, An Encoding Standard for Literary Documents, M.S. Thesis (Queen's University, Ontario), 1986. The work was performed under the direction of Prof. David T. Barnard.
[11] Our information is limited, but it shows a clear pattern. The rate at Oxford is 25%; at Vassar, 24.5%; at the University of California at Berkeley, 25% or 30%; at Princeton, 30%.
[12] By the Cliff Johnson Travel Agency in Oak Park, Illinois.
[13] Cliff Johnson Travel estimates the average fare at $1500, but as with the North American travel we assume that special fares will be available often enough to depress the average.
[14] These figures are based on observation of travel patterns at the Vassar conference and the later meeting of the steering committee in Pisa.
[15] Specifically: Oxford University (44%), Princeton University (67%), the University of California at Berkeley (47%), the University of Illinois at Chicago (38%), the University of North Carolina at Chapel Hill (43.5%), and the University of South Carolina (50%). Vassar College charges 72% but excludes fringe benefits; on a faculty salary, this is equivalent to an overhead charge of 58% figured on salary plus fringe benefits. The arithmetic average of these seven figures, counting the Vassar equivalent, is (44 + 67 + 47 + 38 + 43.5 + 50 + 58) / 7 ≈ 49.6%, but private institutions with high overhead rates are perhaps slightly overrepresented in the sample.
[16] Among pioneering text corpora one may mention the Thesaurus Linguae Graecae, the Brown Corpus of modern American English and its various supplements, and the Treasury of the French Language corpus of modern French materials.
[17] Early mainframe concordance programs to achieve wide distribution include the batch-oriented Oxford Concordance Program and WatCon (the Waterloo Concordance Program), joined later by the interactive program ARRAS (the ARchival Retrieval and Analysis System). More recently, microcomputer versions of these programs have been developed, as have other microcomputer-based concordance programs too numerous to name.
[18] The Oxford Text Archive, one of the first and largest of these archives, was founded in 1976. Other archives have appeared in many European countries, sometimes with specific linguistic or disciplinary limitations, sometimes without. While there is still no major general-purpose archive in the U.S., the American Philological Association does maintain a Repository of Machine-Readable Classical Texts, and a number of universities have begun to prepare and collect machine-readable texts.
[19] Secondary analysis is analysis by scholars or parties other than those who first entered the material into machine readable form, and for purposes other than those first envisaged. Primary analysis is analysis by the original owner of the data, for the originally conceived purposes. In textual studies, secondary analysts (e.g. lexicographers) often need to aggregate a variety of independently encoded shorter texts and collections into a larger corpus for linguistic research; this is feasible only if the editorial and encoding practices of the smaller texts are the same, or can be harmonized.
[20] "COCOA" stands for "word COunt and COncordance program on Atlas", the name of an early British program, which first introduced the flexible and simple style of tagging still called, in the Oxford documentation, "COCOA tags".
[21] At least two CD-ROM versions of the Thesaurus Linguae Graecae are planned: one to be distributed by the Brown University "Isocrates" project together with the text retrieval software developed by the Harvard Classics department, and one to be distributed with the Ibycus microcomputer developed by David Packard. In addition, the University of Pennsylvania has announced plans for yet another CD-ROM of ancient texts. Queen's University, together with several other Ontario universities, is now negotiating with a major publisher of Canadian material with the goal of producing a machine-readable version of the New Canadian Library (over 200 volumes), and possibly distributing it via CD-ROM. Commercial projects are becoming almost too numerous to keep track of.
[22] At Oxford, two major projects have already identified an urgent need for encoding standards. One of these will encode all the set texts used by undergraduates for honors courses in classics and modern languages; the other, a more ambitious undertaking, will encode all surviving early texts of pre-Restoration drama in English. The first project has now been funded and work will begin early in the spring, while the second is still seeking funding; they may be regarded as typical of those going on at many UK universities. The text archive project in Canadian literature at Queen's University is also dependent upon encoding standards which are at present being designed as a prerequisite to scanning the material itself.
[23] One could also mention others, such as the Xerox Interscript encoding, the IBM Document Interchange Architecture and Document Content Architecture, or the Office Document Architecture and Office Document Interchange Format standards promulgated by the International Organization for Standardization (ISO). They suffer by comparison with SGML: they appear too opaque in their construction and too distant from the readable text in their storage formats to achieve any significant following among humanists interested in encoding texts, and they will not be discussed here. A useful discussion of these schemes from the relevant point of view may be found in Cheryl A. Fraser, An Encoding Standard for Literary Documents, M.S. Thesis (Queen's University, Ontario), 1986, x + 212 lvs.
[24] Two prominent examples of "markup languages" are the IBM and Waterloo Generalized Markup Language (GML) products, based respectively on IBM's Document Composition Facility (DCF) Script and on the genetically related Waterloo Script. Similar programs, which may or may not use the same terminology and concepts, and which thus may or may not be regarded strictly as "markup languages," but which perform much the same function, include TeX (from the Greek letters tau, epsilon, chi; a public-domain typesetting language distributed by the American Mathematical Society), the TOPS-10 program Runoff, the Vax VMS program Scribe, and ROFF (a ubiquitous "runoff" or text-formatting program) and its descendants, most prominently the Unix utilities nroff and troff.
[25] Thus an emphasized or italicized phrase may be preceded in one GML by the tag "<ital>" and followed by the tag "</ital>", while another GML may accomplish the same thing with the tags ":hp1." and ":ehp1." ('highlighted phrase, style 1', and 'end highlighted phrase, style 1') or "<emph>" and "</emph>". The GMLs may also distinguish such phrases from the titles of works, which by convention are also italicized in print, but which are logically distinct in kind and may be represented by the tags "<ttl>" ... "</ttl>" or ":cit." ... ":ecit."
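Concretely, the same phrase might be tagged in two different GMLs as follows:

     ... the <ital>locus classicus</ital> of this view ...
     ... the :hp1.locus classicus:ehp1. of this view ...

Both encodings record the same logical fact about the phrase; only the surface syntax of the tags differs.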
[26] Details of the cooperative agreement remain to be worked out between ACH and the other interested groups, but are expected to include common representation on the working party and arrangements for final validation and approval of the guidelines which may supersede those sketched out later in this document.