AN SGML-BASED STANDARD FOR ENGLISH MONOLINGUAL DICTIONARIES

by

Robert A. Amsler
Bellcore
Morristown, NJ, USA

and

Frank Wm. Tompa
Bellcore & The Centre for the New Oxford English Dictionary
University of Waterloo
Waterloo, Ontario, Canada

ABSTRACT

This paper summarizes the progress to date on the development of an interchange standard for English monolingual machine-readable dictionaries. The need for such a standard has become urgent in the fields of computational lexicology, lexicography, and linguistics with the rapid dispersion of copies of several different machine-readable dictionaries and extensive research on the applications of such dictionaries.

*******************************************************************************

TABLE OF CONTENTS

1. The Diversity of Existing Machine-Readable Dictionaries
2. Standard Generalized Markup Language
3. The Dictionary Encoding Standard
   3.1. Content vs. Layout
   3.2. Accommodating Print-Based Conventions
   3.3. Logical Grouping
4. Universality of the Standard
REFERENCES

**************************************************************************

LIST OF FIGURES

Figure 3-1: W7 Entry for "apple" in Dictionary Standard Notation

**************************************************************************

1. The Diversity of Existing Machine-Readable Dictionaries

In addition to raw typesetting tapes, which presumably have existed since computers were used to typeset dictionaries, there are several formatted dictionaries in fairly common use in the computational lexicology community. The first dictionaries to be prepared for computational applications include the _Merriam-Webster Seventh Collegiate_ and the _Longman Dictionary of Contemporary English_, both of which are available in record-based formats as described by [Peterson 86] and [Michiels 82].
More recently, the _Oxford English Dictionary_ and the _Oxford Advanced Learner's Dictionary of Current English_ have become available in formats based on "descriptive document markup," which do not segment the text into more conventional records.

Unfortunately, the variation in formats among these dictionaries is so great that it has been virtually impossible to implement dictionary-independent application programs, much less programs that access more than one dictionary (e.g., to support comparative lexicology). To complicate matters, multiple machine-readable versions of each dictionary have been developed as researchers convert the dictionaries into alternative forms more suited to their own applications. Thus, corrections applied to the text of a dictionary by one researcher have typically been unavailable to others who supposedly use the same dictionary.

In an attempt to correct these problems, work began on developing a standard at the "Workshop on the Lexicon in a Computational and Theoretical Perspective" held during the Summer Linguistics Institute hosted by Stanford University in the summer of 1987. Because most computational lexicologists have used _W7_ or _LDOCE_, the initial goal of a dictionary standard was approached by assuming that the existing ad hoc conventions used for the two record-based formats could be modified to reach a new compatible standard which would then be extendible to all machine-readable dictionaries [Slocum 87]. However, subsequent to that meeting, it became clear that basing the standard on an existing internationally endorsed standard language for textual information would better serve the needs of the research community and perhaps influence the publishing community as well. That standard language, of course, is ISO-8879, the Standard Generalized Markup Language [SGML 86].

2. Standard Generalized Markup Language

SGML basically is a language in which to write standards for text (called applications).
SGML itself does not require any specific delimiters or tags be used; it merely sets forth a method by which an application of SGML can itself describe its particular choice of delimiters and terminology in a formal manner. Therefore, the decision to use SGML in no way prescribes what the representation of text must look like and any decision to endorse this application of SGML should be made independently of the decision to adopt some SGML-based application for representing machine-readable dictionaries.

Following the conventions used by most SGML applications, the contents of a document are tagged as follows:

[n.b. In the following example, the numbers/letters in the second line should be interpreted as subscripts of the letters above and to the left of them in the element description - ed. WP, 11 June 1992]

    < tid a =v  a =v  ... a =v > content < / tid >
           1  1  2  2      n  n

Each unit of text to be marked is surrounded by a pair of tags having an identifier (represented here by "tid") indicative of the role of the unit in the document (e.g., "paragraph", "headword"). Opening tags may contain attribute-value pairs, represented here by a_i=v_i (i being a subscript).

A second markup feature in SGML is the availability of "entity references," which are used to represent encoded units stored outside the document (e.g., graphics or separately described documents). The most frequent use of entity references, in practice, is for representing non-ASCII characters. In fact, Appendix D in the published SGML standard includes declarations of several "public character sets" for non-English alphabets, special mathematical symbols, standard characters used in publishing, etc. Entity references are typically encoded in SGML applications using the syntax "&ref;" in which "ref" is the name of the encoded unit.

One important application of SGML is that produced by the Association of American Publishers (AAP) for a few forms of published text (specifically, books, articles, and serials) [AAP 87].
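
The tagging and entity-reference conventions described in this section can be illustrated with a small Python sketch. It is only an illustration under our own assumptions: the function names, the attribute "n", and the quoting of attribute values are hypothetical choices, and the entity table maps the name "schwa" (which appears in Figure 3-1) to a Unicode character purely for display.

```python
import re


def element(tid, content, **attrs):
    """Wrap content in an opening and a closing tag with identifier tid."""
    # Attribute values are quoted here; that is an assumption of this sketch.
    attr_text = "".join(f' {name}="{value}"' for name, value in attrs.items())
    return f"<{tid}{attr_text}>{content}</{tid}>"


# An entity reference has the form "&ref;", where ref names the encoded unit.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, table):
    """Replace each known &ref; with its value; leave unknown ones untouched."""
    return ENTITY_REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)


# Hypothetical usage: a tagged headword and an expanded entity reference.
tagged = element("headword", "apple", n="1")
shown = expand_entities("ap&schwa;l", {"schwa": "\u0259"})
```

Leaving unknown references untouched, rather than raising an error, is a design choice of the sketch: it lets text pass through unchanged when a dictionary uses entity names that the table does not yet cover.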
Because dictionaries can be viewed as another form of published text, it would seem worthwhile to define a dictionary standard as an extension of the AAP standard. However, in the AAP's proposed standard there are certain assumptions about the nature of tags and tag names for several document classes, not all of which seem appropriate for the computational needs of lexicologists. Thus, where possible we have followed the AAP's guidelines and directions, but in some areas, the nature of the text being represented and the anticipated uses for that text preclude our proposed dictionary standard from being a simple extension of the AAP standard.

3. The Dictionary Encoding Standard

To accommodate the variety of content and presentation methods used in dictionaries, we have based the standard on many sources. The principal sources include two SGML-based dictionaries, the _OED_ as documented in [NewOED 87] and the undocumented _OALD_, each in an original and a new version; two record-based dictionaries, _W7_ as documented in [Peterson 86] and _LDOCE_ as documented in [Michiels 82]; and a new record-based format that emerged from the Stanford Workshop [Slocum 87].

In addition to using these dictionaries as evidence for consensus or diversity in developing the dictionary encoding standard, we have also attempted to accommodate them within the standard whenever they contained unique features. This has been done because the obvious second step to creating a dictionary encoding standard is to convert existing dictionary formats into their representations in the standard format. It is the intention of Bellcore to complete this rewrite process for all the dictionaries with which we are working.
This includes, in addition to the previously mentioned sources, a few dictionaries in their original typesetting formats (Bellcore has obtained copies of the _Macquarie Dictionary_, the _Collins English Dictionary_, and the _McGraw-Hill Dictionary of Scientific and Technical Terms_), and we hope to obtain additional machine-readable dictionaries with which to experiment in the future, producing standard formatted versions of them as a by-product.

Figure 3-1 shows the entry for "apple" from the _Merriam-Webster Seventh Collegiate Dictionary_. Within the (main entry) tag there are a number of major sections, each introduced by a capitalized tag name. These include, in this simple entry, (Etymology) and (Meaning) sections. Other aspects of the encoding will be discussed below.

Figure 3-1: W7 Entry for "apple" in Dictionary Standard Notation

    apple
    ap&schwa;l
    appel fr. æppel akin to apful apple abl˘ko
    the fleshy usu. rounded and red or yellow edible pome fruit of a tree Malus of the rose family
    an apple tree
    a fruit or other vegetable production suggestive of an apple

In the remainder of this paper, we will describe the major problems that arose in designing a dictionary standard, and we will present the solutions adopted. Where possible, we will relate the discussion to the encoding of the example entry for "apple."

3.1. Content vs. Layout

While the SGML standard provides the means to encode the content of a machine-readable dictionary, it is not intended to describe the layout of the printed book. This initially seems fine, as the layout is one of the most artificial aspects of a dictionary, largely just concerned with where the white space is placed around the content objects. However, there is a shadowy middle-ground between pure layout issues and content issues that is most easily understood in the context of specific editions of historically important dictionaries.
Scholars studying such works might, for example, be concerned with where breaks occurred across pages or columns. If there are several editions of such a dictionary, these characteristics might provide insights into the lexicographic decision-making of the editors and publishers or otherwise be needed to work with the data as an acceptable proxy for the printed pages of the original book. Thus, the ability to record in the dictionary encoding standard where line, column or page breaks occurred has been anticipated.

Within SGML there is provision for multiple hierarchies to describe a single document. Layout and content are good examples of two hierarchies for a document. That is, the beginnings and ends of the content items in a document do not typically coincide with the beginnings and ends of the pages, columns, paragraphs and lines. These two hierarchies are thus not nested. This is a very nasty representational problem as it complicates all computer processing of text.

To handle this problem we are proposing the following convenient SGML "hack". SGML allows comments to be present in a text if appropriately tagged. Such comments begin and end with a special symbol, which, following the AAP standard, we have selected to be `--'. If layout tags are created for pages, columns, lines, etc., then these can be inserted within a content tag set as comments that are harmlessly discarded when processing the content hierarchy of the dictionary entries. Likewise, if one simply reverses the interpretation of the non-comment (content) tags and these layout-comment tags, one can "see" the alternate layout hierarchy of the text with the content tags appearing as mere comments within that data.

3.2. Accommodating Print-Based Conventions

Publishers and editors have adopted various conventions to represent common features in their dictionaries.
In order to standardize the machine-readable versions, information idiosyncratically represented in dictionaries (especially as values taken from a closed set of abbreviations) will be represented via attributes. For example, the tag introducing the example entry for "apple" includes a part-of-speech attribute with the value "n-a", the abbreviation for "noun, often attributive", which will be taken from a standard table of designated abbreviations. Use of such tables will minimize cross-dictionary variation, while nevertheless allowing individual dictionaries to retain their own part of speech labels when formatting text based on the SGML codes.

A related problem, and one which arises even among multiple editions of a single dictionary, is that of representing information which is associated with lexical objects but without intrinsic meaning apart from the particular dictionary in which it appears (e.g. homograph or sense numbers). Again, attribute-value pairs will be used to record such information in such a way that it can be ignored when dealing solely with content yet accessed when required, e.g., for cross-reference resolution.

The machine-readable versions of dictionaries can also use attribute-values to achieve a separation of information that is typically interleaved in printed editions. For example, the <orth> tag contains the orthographic form (i.e., spelling and permissible hyphenation points). Most printed dictionaries record hyphenation (or occasionally syllabification) points intermixed with the characters of the form. Thus, if "apple" were written as "ap:ple", anyone searching for "apple" in the data would have to know what its hyphenation pattern was in order to find it. To avoid the complication of suppressing such marks when scanning the data, and to provide a more consistent recording convention among dictionaries, we adopt the technique used in the record-based form of _W7_: encode such points in terms of "character distance encoding."
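
As a rough illustration of character distance encoding, the following Python sketch treats each character of the code as the number of characters to skip before the next break point, so that the code "2" applied to "apple" yields "ap:ple". The function names are hypothetical, and the reading of the letters A-Z as the distances 10 through 35 is our assumption; the text gives only the alphabet (1-9,A-Z).

```python
import string

# Distance alphabet: '1'-'9', then 'A'-'Z'.
# Assumption of this sketch: 'A' stands for 10, 'B' for 11, and so on.
DISTANCE_CODES = "123456789" + string.ascii_uppercase


def decode_breaks(word, code, mark=":"):
    """Insert break marks into word at the positions described by code."""
    pieces = []
    pos = 0
    for ch in code:
        step = DISTANCE_CODES.index(ch) + 1  # characters to skip
        pieces.append(word[pos:pos + step])
        pos += step
    pieces.append(word[pos:])  # text after the last break point
    return mark.join(pieces)


def encode_breaks(marked, mark=":"):
    """Split a marked form into the plain word and its distance code."""
    parts = marked.split(mark)
    if any(not 1 <= len(p) <= len(DISTANCE_CODES) for p in parts[:-1]):
        raise ValueError("segment length outside the range of the code")
    word = "".join(parts)
    code = "".join(DISTANCE_CODES[len(p) - 1] for p in parts[:-1])
    return word, code
```

Keeping the plain word and the code separate means that a search for "apple" matches the stored form directly, while the printed form with its break marks can still be regenerated on demand.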
Character distance encoding is a general method used whenever there is information to be recorded about the position of marks inside content objects. Position information is recorded as a series of alphanumeric "character distances" (1-9,A-Z). The individual digits in this form of encoding give the number of characters to skip before the next hyphenation (or syllable) boundary. Thus, the attribute-value "hyph=2" indicates that "apple" has a hyphenation point after the first two characters, as "ap:ple". For "applesauce", the code would have been "hyph=23", which represents "ap:ple:sauce."

A similar solution is provided for recording information about pronunciation. The <P> tag gives the sole pronunciation, the stress, and syllable boundaries. Stress is recorded as another type of distance encoding, this time for syllables instead of characters. One letter will appear for each syllable indicating whether it had Primary, Secondary, or Unstressed status in the dictionary. Thus once again, although dictionaries denote stress with stress marks, these are idiosyncratic and we have chosen to regard them as solely a display characteristic of the published dictionary which is generated from the "syllable distance encoding."

3.3. Logical Grouping

The tags of the dictionary standard can be divided into three broad classes: first, those that delimit specific content categories of dictionaries, such as orthographies, pronunciations, etymologically related words (etymons), and definition texts; second, those that are common to all forms of unrestricted text, such as italicized words used for biological nomenclature, foreign expressions, or to delimit the names of people or places; and third, those which provide information about the structure of the dictionary entry by grouping together related content and text tags into units such as etymologies, meanings, or synonymy subsections of a dictionary entry.

One such tag is <F>, which permits the grouping of related orthographies and pronunciations within an entry. Thus, any given word may have multiple ways of being spelled or pronounced. For each such spelling there may be one or more acceptable associated pronunciations (or, vice versa, for each pronunciation there may be one or more acceptable spellings). The <F> tag is used to group the related pronunciations and spellings together. A grouping function is served by the <eu> and <es> tags in etymologies. At the highest levels of the dictionary, the (Etymology), (Meaning), and (Related Entry) tags group together the major components of the entry.

Components of an entry (especially senses) may be organized into hierarchical groups and numbered accordingly (using Roman, Arabic or alphabetic symbols). These groups often reflect grammatical or historical organizing principles deemed important to users' understanding of dictionary content, but grouping is sometimes simply adopted to reduce the repetition of labels in the printed book. It is important that the organizing principles of the dictionaries be preserved. However, there is no one organizing principle: some dictionaries group first by part of speech and then by historical development, others group first by historical development, and still others ignore historical development and group by subject only. Imposing one standard hierarchy on all dictionaries would lose the editors' intents, but accommodating all possible organizational structures in a standard results in unnecessarily idiosyncratic representations which will complicate document interchange and processing software.

In place of specific tags that denote hierarchical grouping (e.g., ), the standard provides an attribute ("sn") for the content tags of the elements of the group, having as its value the group number. In the example, the (sense) tags record the sense number as the value for attribute "sn".
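
One way to picture the effect of recording grouping in the "sn" attribute, rather than in nested grouping tags, is the following Python sketch. It assumes that an "sn" value is a space-separated list of group numbers (the appendix declares sn as NMTOKENS) and that each sense also carries a part-of-speech value. The Sense class, the function names, and the sample senses are hypothetical and are not taken from any dictionary.

```python
from dataclasses import dataclass


@dataclass
class Sense:
    sn: tuple   # group numbers read from the "sn" attribute
    pos: str    # part-of-speech value of the sense
    text: str   # definition text


def parse_sn(value):
    """Turn an sn attribute value such as 'I 2 a' into ('I', '2', 'a')."""
    return tuple(value.split())


def senses_with_pos(senses, pos):
    """Select senses by part of speech, ignoring how they are grouped."""
    return [s for s in senses if s.pos == pos]


def senses_under(senses, prefix):
    """Select every sense whose group numbers start with prefix."""
    prefix = tuple(prefix)
    return [s for s in senses if s.sn[:len(prefix)] == prefix]


# Invented sample data, stored as a flat list with no nesting.
SENSES = [
    Sense(parse_sn("I 1"), "n", "first sense"),
    Sense(parse_sn("I 2 a"), "n", "second sense, subsense a"),
    Sense(parse_sn("I 2 b"), "adj", "second sense, subsense b"),
    Sense(parse_sn("II 1"), "adj", "sense in a second major group"),
]

adjectival = senses_with_pos(SENSES, "adj")
second_group = senses_under(SENSES, ("I", "2"))
```

Because the senses are stored flat, the selection by part of speech never has to walk a hierarchy, while the group numbers still allow related senses to be recovered together.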
More generally, the value of the attribute is an ordered list of the group numbers giving the complete path from the root of the hierarchy. Thus, for example, the tag represents a typical sense number for the OED. As a consequence of adopting this convention, the components of nested structures can be more easily manipulated individually, while the nesting information is still preserved for identifying structurally-related components. Thus, for example, given part-of-speech attributes for the sense tags as well, a program can be written to extract all definitions for adjectival uses of a word from a dictionary regardless of whether part of speech, etymology, or domain is used as the basis of the sense hierarchy.

4. Universality of the Standard

There are three major problems in creating a standard to accommodate all dictionaries, even when restricted to English monolingual ones. The most obvious is to anticipate what might occur in a dictionary that is not among the sources at hand. We are confident that the wide variety of dictionaries at our disposal, in printed and machine-readable form, has broadened our definition of the standard to encompass the dominant characteristics of other dictionaries as well. As with all standards, however, we need to provide for extension mechanisms to allow for unpredicted tags, entity references, or values assumable by attributes.

The second problem is to accommodate the exceptions that lexicographers have included in existing dictionaries (whether intentional or not). For example, we have chosen to represent part of speech as an attribute that takes its value from a (to-be-determined) closed set. However, how should the information "collective noun taking a singular verb" be represented? Presumably as a special value in the set. But what then of "collective noun *often* taking a singular verb, except when..."?
In fact, dictionary editors rely on the option to use unconstrained text to describe many aspects of our highly flexible language. The standard must adopt a method for balancing the need for diverse modes of explanation against the computational need for consistency.

Thirdly, there is the problem of our own shortcomings in understanding the structure of a dictionary's entry, when the documentation of that structure is so sparse. In our work to date, this problem was apparent when designing the encoding for etymologies. In spite of well-written prefatory material in several dictionaries (esp. [OED 28, W3I 61, Century 11]), and extensive reference books about lexicography and computational linguistics (e.g. [Crystal 86, Landau 84]), we were unable to uncover a definitive description of the structure within a typical etymology (e.g., the meaning of punctuation symbols and the scope of language names). The solution we adopted is to include tags for "etymons" (the word forms, with language as an attribute), "etymological units" (eu, the equivalent of a lexical entry, including form, pronunciation, meaning, and so forth) and "etymological segments" (es, branches of a hypothetical universal etymology tree, including information about the relationships "rel" among the components). We must wait for other experts to help us determine whether or not this organization is adequate for the standard.

As a closing note, we observe that the real test of the standard cannot be accomplished until several dictionaries have been encoded successfully. We expect to begin such work in the near future.

REFERENCES

[AAP 87] _Standard for Electronic Manuscript Preparation and Markup_ Version 2.0 edition, AAP, Washington, D.C., 1987. Electronic Manuscript Project.

[Benbow 87] Timothy Benbow, Peter Carrington, Gayle Johannesen, Frank Wm. Tompa, and Edmund Weiner. _Report on the New Oxford English Dictionary Survey_ Technical Report OED-87-05, University of Waterloo Centre for the New Oxford Dictionary, Waterloo, Ontario, Canada, November, 1987.

[Boguraev and Levin 88] Boguraev, Branimir and Beth Levin. Machine-Readable Dictionaries: A Computational Linguistics Perspective In _Second Conference on Applied Natural Language Processing_. Association for Computational Linguistics, 9 February, 1988. Copies of overhead transparencies and bibliography to accompany tutorial presented at the conference.

[CED 79] Laurence Urdang (editor). _Collins English Dictionary_. Collins, London, 1979.

[Century 11] William Dwight Whitney (editor). _The Century Dictionary: An Encyclopedic Lexicon of the English Language_. The Century Company, New York, 1911.

[Crystal 86] Crystal, David. _A Dictionary of Linguistics and Phonetics_. Basil Blackwell, New York, 1986.

[DOSATT 86] Sybil P. Parker (editor). _McGraw-Hill Dictionary of Scientific and Technical Terms_. McGraw-Hill Book Company, New York, 1986.

[Kazman 86] Rick Kazman. Structuring the Text of the Oxford English Dictionary through Finite State Transduction. Master's thesis, Computer Science Department, University of Waterloo, 1986. Also Printed as Technical Report CS-86-20 of the Data Structuring Group.

[Landau 84] Sidney I. Landau. _Dictionaries: The Art and Craft of Lexicography_. Charles Scribner's Sons, New York, 1984.

[LDOCE 78] Paul Procter (editor). _Longman Dictionary of Contemporary English_. Longman Group UK Limited, Burnt Mill, Harlow, Essex, England, 1978.

[Macquarie 86] Arthur Delbridge (editor). _The Penguin Macquarie Dictionary_. Penguin Books, New York, 1986.

[Michiels 82] Archibald Michiels. _Exploiting a Large Dictionary Data Base_. PhD thesis, Universite de Liege, 1982.

[NewOED 87] Oxford University Press. _New Oxford English Dictionary Project -- Phase 1: Description of the Processed Text of the New OED_. Technical Report, Oxford University Press, Oxford, England, July, 1987.

[OALDCE 77] Hornby, A.S. (editor). _Oxford Advanced Learner's Dictionary of Current English_. Oxford University Press, Oxford, England, 1977.

[OED 28] Murray, James A.H. _The Oxford English Dictionary_. Oxford at the Clarendon Press, Oxford, England, 1928.

[Peterson 86] James L. Peterson. _Webster's Seventh New Collegiate Dictionary: A Computer-Readable File Format_. Technical Report, Microelectronics and Computer Technology Corporation (MCC), Austin, Texas, August 8th, 1986.

[Pullum&Ladusaw 86] Geoffrey Pullum and William A. Ladusaw. _Phonetic Symbol Guide_. University of Chicago Press, Chicago, 1986.

[SGML 86] _International Standard ISO 8879: Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML)_ First edition -- 1986-10-15 edition, ANSI, New York, NY, 1986. Ref. No. ISO 8879-1986 (E).

[Slocum 87] Jonathan Slocum. Dictionary Exchange Format: re-revised ``record oriented'' proposal. October, 1987. Personal communication to the authors as a follow-up to a 1-day dictionary standards meeting at MCC in Austin, Texas.

[W3I 61] Philip Babcock Gove (editor). _Webster's Third New International Dictionary of the English Language_. G. & C. Merriam Company, Springfield, Massachusetts, 1961.

[W7 67] Philip Babcock Gove (editor). _Webster's Seventh New Collegiate Dictionary_. G. & C. Merriam Company, Springfield, Massachusetts, 1967.

APPENDIX-PROPOSAL FOR DICTIONARY ENCODING

1. Presentation of proposed tags and attributes

This appendix contains a preliminary list of tags and attributes that could be used to represent the information included in several dictionaries. For each proposed tag, we first present a tag identifier and a definition or some usage notes to clarify the intent of the tag.
We then identify how the proposed content of the tag is encoded in current representations:

- For the _Oxford English Dictionary_, we give the corresponding tag name in the encoding currently used by the Oxford University Press followed by the tag name in the encoding currently used by the University of Waterloo. For example,

      [OED]:hwlem/HW

  means that Oxford has named its tag hwlem and Waterloo has chosen the identifier to be HW.

- Similarly, for the _Oxford Advanced Learner's Dictionary_, we give the tag name used in the tagged version available from the Oxford Text Archive followed by the tag name used in a trial version being distributed by the Oxford University Press.

- For _Webster's Seventh_, we give the record identifier for the corresponding information and we give the field number in cases where the complete record contains other data as well. This information is represented for the existing encoding used by many researchers and for the encoding proposed by Slocum, as in the following example:

      [W7]: text definition D:6 / text M:3

  which shows that the existing encoding of W7 includes the information in record D field 6 (entitled text definition) and Slocum has proposed it to be in record M field 3, which he calls text.

- Similarly, for _Longman's Dictionary_, we give the record number and field number used to encode the information. For example,

      [LDOCE]: 03,04:3, 10:4

  shows that the information corresponding to the proposed tag is represented in the LDOCE data in record 03, in record 04 field 3, and in record 10 field 4.

The last part of each explanation contains our proposed list of attributes, and shows for each its name, the intent of the attribute, the domain from which values are to be taken, and (where appropriate) a list of corresponding encodings in current machine-readable dictionaries.
For example,

    dial    regional or social dialectal variant = CDATA
            [OALD]/lab = CDATA
            [W7]/dial(L:5)

shows that the attribute dial will take arbitrary strings of characters (CDATA) as values, that the OALD available from the Oxford University Press uses an attribute lab for the same purpose, and that W7 uses field 5 in record L (named dial) for encoding the same information. If a proposed attribute is represented by a "tag" in either the OED or the OALD, the name of the tag is given in angle brackets; thus,

    [OALD] <pos> / ps

indicates that the OALD available from the Oxford Text Archive uses the tag identifier pos whereas the one available from the Oxford University Press uses an attribute name ps.

2. Layout and Content Hierarchies

Physical units (page, line, column, etc.)
    page
        attributes: left-guideword right-guideword num illustration column
Logical units (entry, headword, etc.)---See below.

3. Universal attributes

definition: attributes that can show up on any tag

Note: In many cases, attributes shown as single units might appear instead as lists; this has to be accommodated somehow.

    ed      edition/printing
            current usage:
                [W7]: ver# (H:7)
    by      author/scholar
    id      identification number = CDATA
            current usage:
                [OED]:id = NUMBER/---
                [LDOCE]: Serial No on 01 records = CHAR NUMBER

4. Logical Units

4.1. factotum

definition: [W3] an ornamental oversized capital letter used in printing

current usage:
    [LDOCE]: 20

4.2. ME -- main entry

definition: [W7] vocabulary entry---a word (as the noun "book"), hyphenated or open compound (as the verb "book-match" or the noun "book review"), word element (as the affix "pro-"), abbreviation (as "agt"), verbalized symbol (as "Na"), or term (as "man in the street") entered alphabetically in a dictionary for the purpose of definition or identification or expressly included as an inflectional form (as the noun "godlessness" or the adverb "globally") or related phrase (as "one for the book") run on at its base word and usu. set in a type (as boldface) readily distinguishable from that of the lightface running text which defines, explains, or identifies the entry.

current usage:
    [OED]: entry/E
    [OALD]: entry/ent
    [W7]: F/H
    [LDOCE]: 01

attributes:
    id      identification number = CDATA
            current usage:
                [OED] id=NUMBER/---
                [LDOCE] Serial No = CHAR NUMBER
    key     sort key to determine entry's placement = CDATA
            current usage:
                [OALD]---h = CDATA
                [W7]---/hdwrd(H:2)
    type    entry type = (main|xref|affix|abbr|suppl)
            current usage:
                [OED] xref=(t) and use of tags for suppl/del/com/etc.
                [OALD] type=(main|xref)/---
                [W7] Prefix/Suffix/Infix (F:4)/
                [LDOCE] entry form codes {S,A,R,N,Z} (02:3.1)
    status  word status=status-NAMES
            current usage:
                [OED] status= (obs|ali|spu|err)
    hom     homonym/homograph number = NUMBER
            current usage:
                [OED] hom=NUMBER
                [OALD] /hn = NUMBER
                [W7] homograph number (F:3)/hom# (H:3)
                [LDOCE] Homograph (02:2)
    pos     entry part of speech = pos-NAMES
            current usage:
                [OED] /
                [OALD] /---
                [W7] part of sp.,joiner,secondary part of speech (F:6-8)/cats (H:6)
                [LDOCE] POS (05:2)
    geo     geographic region (e.g., Australia)
    dom     subject domain (e.g., nuclear physics)
    regis   register (e.g., colloquial)
    time    currency or frequency (e.g., obsolete, rare)
    sem     semantic (e.g., figurative)
    gram    grammatical code (e.g., transitive)
    ---The next two features are included for compatibility with [LDOCE]---
    posf    [LDOCE] part of speech of first element of open compound (02:4.1)
    posl    [LDOCE] part of speech of last element of open compound (02:4.2)

4.3. F --- Forms (written/spoken)

definition: set of associated written and/or spoken forms of lexical items

note: existing dictionaries keep this implicitly by ordering the parts of entries.

Comment: Must this be elaborated to handle OED's entry for "be v." which includes and ?

current usage:
    [LDOCE]: 04 for variants

attributes:
    infl    inflectional use (e.g., pl, pt, pp, comp) = infl-NAMES
    dial    regional or social dialectal variant = CDATA
    hist    temporally restricted variant = CDATA

4.3.1. orth --- orthography

current usage:
    [OED]: distributed among hwlem, vf, blem, ilem, etc.
    [W7]: in F, V, R records / hdwrd, form in H, D, and F records
    [LDOCE]: in 01, 04, 05, 10

attributes:
    cap     capitalization convention (usu, sometimes, etc.) = freq-NAME
                [OALD]/cap = (t|f)
    dial    regional or social dialectal variant = CDATA
                [OALD]/lab = CDATA
                [W7]/dial (L:5)
    hist    temporally restricted variant = CDATA
                [OED] variant date ()
    pref    preference level (implicit, or "usu" vs. "also" etc.) = CDATA
                [W7] level (V:4)/level (V:5)
    type    (word|affix|phrase|alleged)

    note regarding word division: Some dictionaries (e.g., [LDOCE],[RH2]) indicate syllable boundaries, whereas others (e.g., [W7]) show only points where words can be broken at ends of lines (i.e., hyphenation points). An easy test is to check words which begin or end with single-letter syllables (e.g., aback, awash, any).

    syl     syllabification distance encoding (DIGIT|CHAR)*
                [LDOCE] embedded codes in headword
    hyph    hyphenation distance encoding (DIGIT|CHAR)*
                [OALD] embedded codes in headword
                [W7] hyph (F:5,R:3)/hyph (H:4,D:3,F:3)
    ---This feature is included for compatibility with [LDOCE]---
    form    headword form (no. of words, hyphens, etc.) (DIGIT|CHAR)
                [LDOCE] entry form codes {1-9,D,T,Q,C,Y,X,W,Z} (02:3.1)

4.3.2. P --- pronunciation

note: IPA vs. other encoding may be recorded using ed or by attributes.

current usage:
    [OED]: pr / PR
    [W7]: P / P
    [LDOCE]: 03, 04:3, 10:4

attributes:
    type    amount of pronunciation given = (whole,prefix,infix,suffix)
    dial    regional or social dialectal variant = CDATA
                [OALD] lab = CDATA
    hist    temporally restricted variant = CDATA
    pref    preference level (implicit, or "usu" vs. "also" etc.) = CDATA
    stress  stress syllable distance encoding = (P|S|U)*
                P=Primary Stress, S=Secondary Stress, U=Unstressed
    syl     syllable distance encoding (DIGIT|CHAR)*
                [LDOCE] embedded codes in headword

4.3.3. hwd --- headword

note: The headword is an artifact of existing machine-readable data records.
        It delimits that set of one or more (e.g. `A,a') orthographies placed
        together as the alphabetic basis of an entry. A headword tag is only
        needed if form and orthography tags do not convey the same information.
  current usage:
    [OED]:   hwlem / HW
    [OALD]:  hwlem / hw
    [W7]:    main entry (F:2) / orth (H:5)
    [LDOCE]: Headword (01:3)

4.4. M --- meaning
  definition: Aggregate of all senses. The M tag delimits the block of senses
              from the other types of information in the entry.
  current usage:
    [OED]:   signification (no longer tagged?)

4.4.1. S --- sense
  definition: definition (including the text, examples, and related matter)
  current usage:
    [OED]:   sen0..sen7 / S0..S7
    [OALD]:  sen / hsn
    [W7]:    D / M
    [LDOCE]: 08,18
  attributes:
    sn      sense number = NMTOKENS
            current usage:
              [OED]   num (repeated for each level descended)/<#>
              [OALD]  lab / sn
              [W7]    sense number, s. letter, s. subnumber (D:2-4)/sns (M:2)
              [LDOCE] definition no (07:2) --- doesn't identify subsenses
    pos     sense part of speech = pos-NAMES
            current usage:
              [OED]   /
              [OALD]  / ps
              [W7]    part of speech (D:5) / cats (M:3)
    geo     geographic region (e.g., Australia)
    dom     subject domain (e.g., nuclear physics)
    regis   register (e.g., colloquial)
    time    currency or frequency (e.g., obsolete, rare)
    sem     semantic (e.g., figurative)
    gram    grammatical code (e.g., transitive)
    ---The following attributes are used to encode the semantic restrictions
    encoded in LDOCE's box codes documented to be #7,9,10 (but occurring as
    #5,9 and 10)---
    nclass  noun class (e.g., slipstream can be used as "Gas")
    aclass  adjective class (e.g., lambent modifies nouns in nclass "Gas")
    vclass  verb class --- value is a list giving semantic restrictions on the
            nclass of a verb's subject, direct object, and indirect object

4.4.2.
deftext --- definition text
  usage note: typically in roman font
  current usage:
    [OED]:   ---
    [OALD]:  ---
    [W7]:    text definition D:6 / text M:3
    [LDOCE]: definition text 08:2, definition text continuation 18:2
  attributes:
    type    (also | esp | specif | broadly)

4.4.3. ex --- examples of form or usage
  current usage:
    [OED]:   quot / Q
    [OALD]:  ex / ex, gx
    [W7]:    --- / X
    [LDOCE]: not tagged
  attributes:
    type    classification of the example = (cited | invented)
            current usage: [not present in any MRD]
    geo     geographic region (e.g., Australia)
    dom     subject domain (e.g., nuclear physics)
    regis   register (e.g., colloquial)
    time    currency or frequency (e.g., obsolete, rare)
    sem     semantic (e.g., figurative)
    gram    grammatical code (e.g., transitive)

4.4.4. utext --- usage text

4.4.5. xr --- cross reference
  current usage:
    [OED]:   xra / XR
    [OALD]:  xra / xr
    [W7]:    cross reference X / K
    [LDOCE]: usage note text (09:3)
  attributes:
    hom     xref homonym/homograph number = NUMBER
            current usage:
              [OED]   hom = NUMBER
              [OALD]  / hn = NUMBER
              [W7]    homograph number (X:3) / H-sns (K:3)
              [LDOCE] ---
    sn      xref sense number = NMTOKENS
            current usage:
              [OED]   num (repeated for each level descended) /
              [OALD]  lab / sn
              [W7]    subscript (X:4) / H-sns (K:3)
              [LDOCE] ---
    pos     xref part of speech = pos-NAMES
            current usage:
              [OED]   /
              [OALD]  / ps
              [W7]    unneeded --- info carried in hom
              [LDOCE] ---
    type    (illus | see | also | ... | external)
            current usage:
              [OED]   ---
              [OALD]  --- / illus=y or type
              [W7]    (X:5) / type (K:1)
              [LDOCE] X-Ref (09:2)
    sub     sub-part or associated section referenced = CDATA
            current usage:
              [OED]   ---
              [OALD]  ---
              [W7]    secondary word (X:6) / secform (K:4)
              [LDOCE] ---

4.5.
RE --- related entry
  usage note: typically in bold appearing within an entry (including run-in,
              run-on, compounds, idioms, derivatives). An RE can contain any
              tag defined for an entry.
  current usage:
    [OED]:   sube/E
    [OALD]:  entry type=sub / cd or ip
    [W7]:    R / D or F
    [LDOCE]: 10, or phrases as untagged boldface in 08
  attributes:
    key     sort key to determine entry's placement = CDATA
            current usage:
              [W7]    --- / form (D:2,F:2)
    style   placement style = ( run-on | run-in )
            current usage: unused
    type    relation = ( root | deriv | idiom | compound | phrase )
            current usage:
              [OED]   --- / ---
              [OALD]  type = ( main | xref ) /
              [W7]    / D or F
              [LDOCE] ---
    ref     sense(s) to which this RE is related = NMTOKENS
            current usage: unused
    pos     entry part of speech = pos-NAMES
            current usage:
              [OED]   /
              [OALD]  / ps
              [W7]    pt of sp., joiner, 2nd pt of sp. (R:4-6) / cats (D or F:4)
              [LDOCE] POS (10:5)
    geo     geographic region (e.g., Australia)
    dom     subject domain (e.g., nuclear physics)
    regis   register (e.g., colloquial)
    time    currency or frequency (e.g., obsolete, rare)
    sem     semantic (e.g., figurative)
    gram    grammatical code (e.g., transitive)

4.6. synonyms & antonyms

4.7. E --- etymology
  current usage:
    [OED]:   etym / ET
    [OALD]:  not applicable
    [W7]:    E / E
    [LDOCE]: not applicable
  attributes:
    type    class of word formation = NAME
            note: (reflecting W7's ISV (= Intern. Scientific Vocab.), acronym,
                  biographical, geographical, borrowed word, affixation,
                  unknown origin, etc.)
            current usage: [---] not tagged in any MRD

4.7.1. epart --- etymology of one variant form of the entry
  note: (as in W7's among/amongst)

4.7.2. es --- etymological segment
  definition: a unit of derivation in an etymology

4.7.3. eu --- etymon unit
  definition: a package uniting an etymon with its lang, deftext, etc.

  note: The following are lowest-level tags found particularly in etymologies.
        To date these have not been tagged in any MRD (except for cf in the
        OED).

4.7.4.
etymon --- word, morpheme, or phrase cited in an etymology
  usage note: nearly always printed in italics
  attributes:
    lang    language of the etymon = lang-NAME
            usage note: lang attribute inherits its value from previous .
    type    (word | affix | phrase | alleged)
    gender  (M | F | N)

4.7.5. lang --- language = lang-NAME

4.7.6. rel --- relation = rel-NAME
  examples: a., fr., ad., <, :-

  Note: Previous tags , also typically appear in etymologies.
        The following additional tags might also play a useful role:

4.7.7. cert --- certainty
  examples: prob., ?

4.7.8. basis --- basis for etymologist's belief
  examples: `by folk etymology', `assumed', `according to'

4.8. Additional text tags

4.8.1. taxon --- taxonomic name
  usage note: typically in italics.
  attributes:
    lev     level = lev-NAMES e.g. (K|P|C|O|F|G|S|V)

4.8.2. wd --- anaphoric word
  usage note: typically represented by swung dash
  current usage:
    [OALD]  ---/h
  attributes:
    h       referenced word
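
As an illustration of how an application might consume the wd tag, the following Python fragment is a minimal sketch that replaces each anaphoric word with the word it stands for. The concrete syntax shown (a wd element with an optional h attribute, enclosing the printed placeholder), the function name expand_wd, and the fallback to the entry's headword when h is absent are assumptions made for this example only; none of them is specified by the standard as described above.

```python
import re

# Hypothetical concrete syntax (an assumption, not part of the standard text):
#   <wd h="word">~</wd>   anaphoric word with an explicit referenced word
#   <wd>~</wd>            anaphoric word referring to the entry's headword
WD_TAG = re.compile(r'<wd(?:\s+h="([^"]*)")?>.*?</wd>', re.IGNORECASE | re.DOTALL)


def expand_wd(text: str, headword: str) -> str:
    """Replace every wd element in text with the word it refers to."""

    def substitute(match: re.Match) -> str:
        referenced = match.group(1)  # value of h, or None if h is absent
        # Fall back to the headword when h is missing or empty (assumed default).
        return referenced or headword

    return WD_TAG.sub(substitute, text)


example = 'an <wd>~</wd> a day; <wd h="apple tree">~</wd> in bloom'
print(expand_wd(example, "apple"))
```

Resolving the placeholder at read time keeps the stored text close to the printed dictionary, while still allowing example sentences to be searched by their full wording.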