Notes on Features and Tags

C. M. Sperberg-McQueen
Lou Burnard

Document Number: TEI EDW5
February 2, 1990 (13:19:31)
Draft February 2, 1990 (13:19:31)

0. INTRODUCTION

A marked-up document is one in which specific textual features are identified by tags or other mechanisms. TEI working committees are charged with proposing lists of such textual features and the tags used to specify them. When comparing such lists it will be essential to group similar features, independent of the way in which they are identified, both to avoid redundancy and to make possible a tighter focus on related but distinct areas of concern. Use of a simple taxonomy of features will allow one both to assess the likelihood that two tags, purportedly for the same textual feature, also have the same semantics, and to focus on differences among tags grouped together by the taxonomy.

This paper therefore proposes a simple model for classifying textual features and their associated tags based on two structural characteristics of all textual features: their content and their location. It begins by describing the various possible relationships among tags, features, and syntaxes, continues with a theory of textual features, and concludes with a design for a database to hold information about specific tags in existing or new markup languages.

1. DEFINITIONS

We begin with the distinction between the tag and the feature it denotes. The Formex tag , for example, and the ISO "starter set" tag are both used to signal an occurrence of the same textual feature, namely that at this position in the document a graphic document element is to be located.(1) A page formatter might leave space on the page for the artwork; a galley formatter might print a marginal note calling attention to the expected artwork; a text database might embed a special symbol indicating that the original document had material not included in the database.
The Formex tag and the starter set tag similarly mark the same textual feature. Note that a tag must refer to a feature, but a feature need not be tagged. The same feature can be denoted by many tags. Thus in the ISO starter set, the tag

is used to identify occurrences of the feature chapter. In other tag sets the same feature might be tagged with , , , or . We use the term feature precisely in order to stress what would be common among all these tags and ignore what varies (the name).

Users of the TEI Guidelines may wish to abbreviate tag names or render them into another language. The TEI must provide mechanisms for such substitutions which allow the tag names to be changed without altering either the definition of the features themselves or the syntax which governs their occurrences.

More formally, we define

feature: a characteristic of some segment of text or of some location in a text, considered independently of any name used to denote it. In SGML terms, a segment of text possessing the feature may be tagged as a document element, though the presence of the feature may also be indicated by other SGML constructs.

tag: a component of some markup language used to identify an occurrence of a specific textual feature. A tag has both a tag name and a textual feature associated with it.

tag name: the string of characters used to identify a tag, considered as a string of characters and independent both of any particular instance of the tag used to mark the feature and of the feature itself. In SGML terms, the tag name is the generic identifier of an element.

We also identify sets (that is, unordered collections) of features, tags and names as follows:

feature set: a set of features, considered independently of the names used to denote them

tag set: a set of tags, i.e. features together with the names used to denote them in any one markup scheme, considered independently of any syntax defining their allowable interrelationships

name set: a set of tag names (e.g. all those related to a given tag in different markup schemes, or all those used in a given DTD)

To define the allowable relationships among tags and features, document markup languages define syntaxes for the tags of the language.
The formal syntax is necessarily defined in terms of the tag names of the language; the semantics of the tags (their links to specific features) are not usually specified formally. If the distinction between tag and tag name is rigorously maintained, then, formal syntaxes like DTDs must, strictly speaking, be viewed as governing only tag names, not tags. If we wish to speak of rules governing tags rather than tag names, we must include both the DTD and the semantic specifications of the markup scheme. We make this distinction in what follows, without wishing to insist that this degree of rigor is always necessary. We define:

feature syntax: a set of rules specifying the allowable combinations of feature occurrences within a given type of text, considered independently of the tag names used to identify them

tag syntax: a set of rules defining the allowable sequences of tags and textual data in a markup language

DTD (document type definition): a tag syntax defined according to the rules of SGML

From any document type definition, one may derive a tag set, which is precisely all and only the tags used in the DTD.(2) The tag syntax of a DTD is expressed by the content models it contains. If we modify a DTD by changing the syntax rules expressed in its content models, the tag set with its associated name set and feature set remains unchanged. The tag set, specifically, is what remains constant when the syntax is changed without introducing new tags or deleting old ones.

We may change a DTD by translating all the tag names into another language (as is done with the tag names for the ISO starter set in ISO TR 9573, for example). The tag syntax remains the same, the features described remain the same; the DTD changes only because the names used to define the syntax change. The term feature syntax is coined to denote what does not change if one translates all the tag names into another vocabulary.

2. FORMAL RELATIONSHIPS AMONG FEATURES, TAGS, AND SYNTAXES

Each tag corresponds with exactly one feature and one tag name, although each feature may have many tag names and each name may be used in different markup languages for many different features.(3) Each feature, tag, and tag name can appear in many feature sets, tag sets, or name sets respectively. Each set, of course, can contain many individual features, tags, or names.

A given tag syntax corresponds to exactly one name set (viz., the set of tag names used to define the tag syntax); in conjunction with the semantic rules of a language, a tag syntax also corresponds to exactly one tag set. Similarly a given feature syntax corresponds to exactly one feature set. Many tag syntaxes may however be defined for any given name set or tag set, just as many feature syntaxes may be postulated for any given feature set.

These relationships are summarized in the following diagram, in which the lines indicate meaningful relationships and the arrowheads their degree. A single headed arrow indicates that just one item of the type indicated is involved in the relationship; a double headed arrow that more than one item of the type indicated may be involved.

   feature         <------->>  tag          <<------->  tag name
      |                         |                          |
      |                         |                          |
   feature set     <------->>  tag set      <<------->  name set
      |                         |                          |
      |                         |                          |
   feature syntax  <------->>  tag syntax   <<------->  name syntax
                               (DTD with                (DTD without
                                semantics)               semantics)

Among the syntaxes corresponding to any given set of names or features, several classes may usefully be distinguished. At one extreme are syntaxes using the full power of SGML DTDs to restrict the legal sequences of tags and data in a document. We call this the class of fully restrictive syntaxes. At the other extreme is a syntax which places no restrictions on the combination of tags in a document except that they must properly nest and that each tag must have an end tag. We call this a Waterloo syntax. In SGML terms, a Waterloo syntax is derivable from any more restrictive DTD by

1. replacing the "omitted tag minimization" for each element with "- -" to prohibit omission of any tags, and

2. replacing the "content model" for each element with the keyword "ANY", which allows any combination of data and properly nested tags to occur within the element.(4)

Each element declaration, that is, must look like this:

     <!ELEMENT name - - ANY>

Intermediate classes of syntaxes may be defined, including e.g. the class of syntaxes which define element contents using only the declared content "EMPTY", the content model "(#PCDATA)", or the content model "ANY", or those which do not use inclusion or exclusion exceptions. Syntaxes which specify rules of greater expressive power than those provided by SGML may also be imagined. No further discussion of these intermediate syntaxes is provided here.

3. CLASSIFICATION OF TEXTUAL FEATURES

Tags are most usefully grouped according to the similarity of the features they denote. One approach might be to base a classification on semantic or functional properties of the features: features concerned with presentation, language, register, application area etc. Elaborating such a taxonomy would however be at once too complex and too controversial for our present purposes. We propose instead to classify each feature simply in terms of its position and use within a feature syntax, specifically the ways in which other features may nest within it (in SGML terms, its "content model") and the places where it may itself occur (its appearance in other features' content models). We distinguish these two characteristics as the structure and the binding of the feature, respectively.

3.1. Binding

A feature may either be bound to some specific location(s) in a document or else it may float and appear anywhere within running text in the document.
In the former case the feature is bound; in the latter it is unbound.(5) Some features may have one fixed location at which they are required, while retaining the ability to float and appear additionally elsewhere in the document. The feature date, for example, which may be required in the front matter of a document to show the document's date of composition, may also appear in free text wherever dates are used. Features of this third type, both bound and unbound, are termed bindable. Each may be resolved, if desired, into two distinct features, one bound and one unbound.

Note: Alternative terms: Anchored, floating, tethered. Moored, floating, anchored. Sedentary, migrant, and semi-nomadic. Bound, unbound, bindable. Bound, floating, semi-bound. Anchored, floating, floatable. Bound, free, and indentured.

For SGML-based markup languages, this distinction may be made formally on the basis of content models in the DTD as follows:

* if a tag name appears in content models only in an "or" group, "and" group, or sequence group with other tag names or with #PCDATA, then the feature denoted by that tag name is bound (in that feature syntax).

* if a tag name appears in content models only in an alternation with the keyword #PCDATA, then the feature denoted by that tag name is unbound (in that feature syntax).

* if a tag name appears in both types of content models (i.e. both within and outside of alternations including #PCDATA), then the feature denoted by that tag name is bindable (in that feature syntax).

Classification on a purely formal basis may be inconsistent from one markup language to another. A tag , marking calendar dates, for example, may be required (and thus bound) in a document's front matter, where it indicates the date of copyright or publication. Dates, of course, are not restricted a priori to such use: they may appear freely in running text.
If such running-text dates can also be tagged , then will be a bindable feature; if such occurrences of dates in running text are not accommodated, then must be classed as a bound feature. Such contradictions reveal underlying differences in the semantics of the tags and isolate a problem to be resolved in developing any new syntax.

Examples of bound features (taken from the features of Formex and the ISO starter set) are front matter, body, back matter, cell of a figure, glossary, document number, and table of contents. Examples of unbound features are numbered list (and all other list types), footnote, highlighted phrase, and reference to a figure (and all other reference types). Examples of bindable features are address (bound when in the front matter it applies to the author's address, unbound when in free text it marks other addresses), date, and title.

3.2. Internal Structure

The feature itself may be internally structured or unstructured. As a special case, we may note features whose contents have unstructured, but restricted, values; these may be called typed features, by analogy with typed variables in programming languages.
Formally, these correspond to restricted element-only content models (structured features), content models containing an alternation including #PCDATA (unstructured features), and content models containing only #PCDATA but which could be expressed by enumeration or some other data-type definition mechanism (typed features).(6)

Examples of structured features (from the list of examples given earlier) are front matter (comprises title page, table of contents, etc., usually in a strict order), body (comprises hierarchy of chapter, sections, etc.), glossary (comprises structured list of glossary entries), numbered list and all other list types (comprises sequences of list entries), date (comprises month, day of month, and year), and address (comprises multiple address lines, or in some definitions comprises street address, city, etc.). Examples of unstructured features are cell of a figure, title, footnote, and highlighted phrase, any of which can contain free text with other elements intermingled. Examples of typed features are document number, table of contents, reference to a figure and all other reference types (like table of contents, a special case of enumerable data types: empty elements).

4. DESIGN FOR A TAGS DATABASE

4.1. Column Descriptions

The salient information about any tag in any markup language may be stored in a database with the following fields (columns) in each record (row):

tagname (internal identifier): the tag name as used by the markup language, e.g. 'tbl'

full name: the tag name expanded to a natural-language word or phrase, e.g. 'table'

equivalent tags: tag names used in other languages for the same feature

Note: In a normal-form relational database, this information should go into a separate table with columns for feature name (with some canonical name for the feature), markup language, tag name, and comments.
In a hierarchical database, it should go either in a separate file with fields as just given, or in a repeating structure comprising fields for markup language, tag name, and comments.

feature: A description in prose of the use and meaning of the tag (or: a description of the feature marked by the tag)

binding: Is the feature bound, unbound, or bindable? Only these three terms should be used. (Or: yes, no, half)

structure: Is the feature internally structured, unstructured, or typed? Only these three terms (or: yes, no, typed) should be used. If typed or structured, the constraints on the contents of the element should be given in the data description column.

data description (content): what sort of content may occurrences of this feature have? If feature is typed or otherwise constrained, the domain of legal values for occurrences of the feature (e.g. "a code taken from ISO list of language abbreviations"). If feature is structured, specify what features may occur within occurrences of this feature. If feature is untyped, unstructured, and unconstrained, specify either "free text" (if other features may nest within this one) or "character data" (if nested features are not allowed).

usage: Is it optional or required?

arity: Can it be repeated? Note: If it can, consider whether repetitions are ordered, and if so, whether an additional feature (the ordering) is required.

parent tags: Where can this tag appear? "Free" for unbound features, a list of legal parent tags for bound or bindable features. E.g. for author in the ISO starter set, "in titlepage".

default content: What will be supplied if this element is omitted or included without content? E.g. in some implementations will default to "the current date (when document is processed)".

example: A short example, in context, of the use of the tag

Some additional fields may prove useful for database maintenance:

language: which markup language uses this tag?
source of information: short version of manual title and page number

date added to database: when was the record describing this tag created?

added by: who created this record?

date last updated: when was this record last modified?

updated by: who modified it?

4.2. SGML Tags for Tag Descriptions

These pieces of information can, of course, also be expressed in an SGML syntax using the following tags: (contains all the following, in sequence:) , , (repeatable, contains and ) , , , , , , , , , , , (includes and ), (optional; includes and ).

5. EXTENSION OF TAGS DESIGN TO OTHER MARKUP CONVENTIONS

As noted above, not all markup information is conveyed by SGML tags; some is conveyed by SGML attributes, by parameters to tags in non-SGML languages, by special symbolic names (corresponding to standard entity names in SGML) or by other conventions. To record a broader range of markup conventions in a database we must broaden the field descriptions above by understanding them to apply to whatever convention is being recorded, not just to conventions which correspond to SGML tags. SGML attributes, for example, will always be classed as bound and the generic identifier of the elements for which they are defined will be given under parent tags.

Two further pieces of information are also required:

tag class (or native class): What type of thing is this, in the classification used by the markup language being described? For SGML tag sets, the answers would be "tag", "attribute", "entity", "short reference", etc. For others, the preferred terms may be "command", "parameter", "control word", etc.

attributes: Just as the data description of a tag should contain a list of elements which are expected to occur within it, the attributes field should contain a list of attributes defined for the tag, if the tag is from an SGML markup language, or in other cases a list of parameters (or whatever term is used).
These can be expressed in SGML using the tag , which should be inserted in the list above after , and the tag , which should be inserted before .

-------------------------

(1) For Formex, see C. Guittet, ed., Formex: Formalized Exchange of Electronic Publications (Luxembourg: Office for Official Publications of the European Communities, 'New Technologies -- Project Management' Department, 1984); for the starter set see ISO 8879-1986 Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML) ([n.p.]: ISO, 1986), Annex E.1, pp. 136-139. The starter set is discussed and various national-language versions defined in ISO technical report ISO/TR 9573-1988(E) Information processing--SGML support facilities--Techniques for using SGML.

(2) More rigorously, the DTD directly determines only a name set: the set of tag names declared in the DTD. Taken in conjunction with the semantic rules of the encoding scheme, however, the name set uniquely determines a tag set, so within the context of a specific markup language, a DTD does uniquely determine a tag set.

(3) Examples of synonymous tags have already been given. Use of the same tag name for different features is rarer but does occur. Formex, for example, uses a tag to mark names of corporate bodies; the ISO starter set and other tag sets use the same tag name to mark the main body of a document. In our terms, Formex and the starter set are using the same tag name but not the same tag, because they assign different features to it.

(4) This process may be referred to informally as tompering with the DTD.

(5) In a tompered feature syntax, all features are unbound.

(6) In document TEI TRR7, Steven DeRose proposes the term crystals for structured features, but as he uses it the term appears also to apply to some typed features.
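
The formal test for binding given in section 3.1 can be sketched in Python as follows. This is a minimal sketch and not part of the original design: it assumes the content models of a DTD have already been analysed into a list of occurrences of each tag name, and the names Occurrence and classify_binding, as well as the element names "front" and "p", are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Occurrence:
    """One appearance of a tag name in some element's content model."""
    parent: str                  # element whose content model mentions the name
    in_pcdata_alternation: bool  # True if the name is an alternative to #PCDATA


def classify_binding(occurrences: Iterable[Occurrence]) -> str:
    """Apply the three rules of section 3.1 to one tag name."""
    occurrences = list(occurrences)
    if not occurrences:
        raise ValueError("tag name does not appear in any content model")
    floats = any(o.in_pcdata_alternation for o in occurrences)
    fixed = any(not o.in_pcdata_alternation for o in occurrences)
    if floats and fixed:
        return "bindable"   # appears both inside and outside #PCDATA alternations
    return "unbound" if floats else "bound"


# Hypothetical analysis of a date tag: required in the front matter,
# and also allowed in running text inside paragraphs.
date_occurrences = [
    Occurrence(parent="front", in_pcdata_alternation=False),
    Occurrence(parent="p", in_pcdata_alternation=True),
]
print(classify_binding(date_occurrences))  # bindable
```

Dropping the second occurrence from the list yields "bound", which mirrors the observation in section 3.1 that the same tag may be classed differently in two markup languages.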
Appendix A

Column Definitions for a TAGS Database in a DBMS

The TAGS database design described above has been implemented in Chicago in the Waterloo file management system WatFile. The column names, column widths, and data types are specified in the following WatFile file definition. Data types used are "L" (character data, left-aligned) and "YMD" (date in YY/MM/DD form). The column widths are by no means sacred and need not be respected in other implementations; the database for TEI tags, especially, will need wider columns for descriptions and examples. (WatFile has a system maximum of 79 characters per column.)

    %WATFILE/Plus V3.5 Saved 89/12/05 17:46:27
    define Tag Name = 16 L
    define Full Name = 32 L
    define Native Class = 12 L
    define Description = 72 L
    define Bound = 5 L
    define Structured = 5 L
    define Data Description = 72 L
    define Usage = 64 L
    define Arity = 64 L
    define Parents = 32 L
    define Attributes = 32 L
    define Default Content = 32 L
    define Example = 72 L
    define Source = 20 L
    define AddDt = 8 YMD
    define AddBy = 3 L
    define UpdDt = 8 YMD
    define UpdBy = 3 L
    title "FORMEX SGML 89/11/09"

Appendix B

Extract from Formex Tag Database

As illustration, the tag name, full name, binding, and structure columns for each record in the Formex tag database in Chicago are given below.

Tag Name  Full Name                         Bound  Structured
AB        abstract                          ?      no
ACCOMP    accompanying material             yes    no
AD        address                           yes    no
AF        affiliation                       yes    yes
AN        authority number                  yes    typed
BINDING   binding                           yes    typed
BLKn      document block                    yes    yes
BODY      corporate body                    ?      yes
CCF       CCF data                          ?      yes
CLASS     classification scheme notation    ?      typed
COLn      column heading in a table         yes    yes
CY        country                           yes    typed
DATE      date                              half   no
DIM       dimensions of the item            yes    no
ED        edition statement                 ?      no
EXPL      explanatory note                  no     no
FRAGMENT  page fragment                     yes    typed
HT        highlighted text                  no     no
ID        document identification number    ?      typed
ITM       item or figure in a table         yes    no
LOC       location                          yes    no
MAT       type of material                  ?      typed
MEDIUM    physical medium                   yes    typed
MEETING   meeting                           ?      yes
NA        name                              yes    no
NO        number of meeting                 yes    typed
NOTE      note                              yes    no
OT        other part of name                yes    no
PAGE      pagination defining a part        yes    typed
PART      part statement                    ?      no
PERSON    person                            ?      yes
PHYS      physical description              no     yes
PIC       picture                           no     typed
PIECES    number of pieces and designation  yes    no
PRICE     price of the item                 yes    typed
QT        quotation                         no     no
QUAL      qualifier                         yes    no
REF       bibliographic reference           no     no
ROWn      row heading in a table            yes    yes
SO        Agency at source of record        ?      typed
SUBJECT   description of subject            ?      typed
TARIFF    tariff                            ?      yes
TBL       table                             no     yes
TI        title                             no     no
WEIGHT    weight                            yes    typed

Appendix C

Sample SGML-tagged Tag Descriptions for Formex

The full database information for a few tags is given below, in SGML form.

Note: In the electronic form of this document, the SGML tag examples given below must remain uninterpreted. In a conformant SGML document, such uninterpreted data might be included in an external entity declared as CDATA, within a conditional section using the CDATA keyword, or within an element whose content is declared CDATA or RCDATA. It is not clear which of these possibilities will be preferred by the TEI Metalanguage committee. The technique used here is to refer to an external file containing the material to remain uninterpreted. As a result, two files are included in the electronic distribution of this document. You are now reading one of them. The other is to be embedded immediately following this note.

AB abstract tag A brief description of the contents of the item. ? no Free format. Attribute LA. optional repeatable for each language ? LA none Text of abstract ...
p.119 Formex Manual 89/11/13 VAM 89/11/30 MSM

ACCOMP accompanying material tag describes any material that accompanies the document yes no Free format optional not repeatable PHYS none folding map and three diskettes p.120 Formex Manual 89/11/13 VAM 89/11/30 MSM

AD address tag postal address of a corporate body or private address of a person yes no free format? optional not repeatable inside an AF, BODY, or PERSON group AF, BODY, PERSON none 5, rue du Commerce, L-2410 Luxembourg p.121 Formex Manual 89/11/13 VAM 89/11/30 MSM

AF affiliation tag name and/or address of organization to which a person is associated yes yes Contains AD, AN, CY, NA (either AN or NA mandatory) optional repeatable for each person or affiliation PERSON none Office for ...New Techn....... p.122 Formex Manual 89/11/13 VAM 89/11/30 MSM

AN authority number tag a unique number code assigned to a corporate body or group yes typed use AN if there is an assigned code; else use NA and name optional not repeatable inside a group AF, BODY, PERSON none (none given) p.123 Formex Manual 89/11/13 VAM 89/12/05 MSM

BINDING binding tag physical binding: stapled, hardcover, softcover ... yes typed Code taken from appendix D10 optional not repeatable PHYS none ... BR ... p.124 Formex Manual 89/11/13 VAM 89/11/30 MSM

BLKn document block tag block of text of the document itself, exclusive of secondary information yes yes free format; doc body divided into nested blocks. Attr. ID, LA, TYPE optional repeatable BLK(n-1) ID, LA, TYPE none Commission Regulation ...The ... p.125 Formex Manual 89/11/13 VAM 89/11/30 VAM

BODY corporate body tag corp. body responsible for the work ? yes Contains AD, AN, CY, LOC, NA (AN or NA is mandatory). Attr: RESP, ROLE mandatory when applicable repeatable for each corporate body free RESP, ROLE none Office for Official ... ... p.126 Formex Manual 89/11/13 VAM 89/11/30 VAM

CCF CCF data tag Delimits purely CCF data to be ignored by SGML ? yes CCF data optional repeatable free none (none given) p.127 Formex Manual 89/11/13 VAM 89/11/30 VAM

CLASS classification scheme notation tag notation assigned to item according to classification scheme ? typed use notation defined by classification scheme optional repeatable for each classification scheme ? SCHEME none 812.23 p.128 Formex Manual 89/11/13 VAM 89/11/30 VAM

COLn column heading in a table tag column heading in a table yes yes free format + seq. of lower-ranked columns. Attr: ID, DATA, LA mandatory repeatable TBL ID, DATA, LA none ...CountryDirect costsTotal ... p.129 Formex Manual 89/11/13 VAM 89/11/30 VAM

CURRENCY currency attribute the currency unit in which the amount is expressed yes yes a code taken from Appendix D11 mandatory not repeatable PRICE ECU 50 p.153 Formex Manual 89/11/13 VAM 89/12/05 MSM

CY country tag country where person, corporate body, or meeting is situated yes typed a code taken from Appendix D2, or free form. Attr: STD optional not repeatable AF, BODY, MEETING STD none Office ...5, rue du ...LU p.130 Formex Manual 89/11/13 VAM 89/12/05 MSM

DATA Type of data in column attribute nature of data in the column (data-type) -- will guide presentation yes no short name describing data: integer, decimal number, string, code ... optional not repeatable COLn data-type of next higher level (none given) p.129 Formex Manual 89/11/30 MSM 89/12/05 MSM

DATE date tag calendar date (of publication or other event) half no ISO std form, or free form. Attributes: LA, STD, TYPE mandatory repeatable for different dates MEETING LA, STD, TYPE none 1985000019850501 p.131 Formex Manual 89/11/13 VAM 89/12/05 MSM

DIM dimensions of the item tag size in centimeters yes no customary to enter only height or height times width optional not repeatable PHYS none ... ... 23 cm. ... p.132 Formex Manual 89/11/13 VAM 89/11/30 VAM

ED edition statement tag edition number and/or identifier for monograph or collection ? no As given in the item, abbreviating; incl. word 'edition'. Attr: LA mandatory repeatable for multiple or parallel edition statements free? LA none Braille edition p.133 Formex Manual 89/11/13 VAM 89/11/30 MSM
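
The Bound and Structured columns of the Formex extract use the short codes that section 4.1 allows in place of the full terms (yes, no, half for binding; yes, no, typed for structure). A minimal Python sketch of mapping those codes back onto the three-term vocabularies follows. The function and table names are hypothetical, and reading "?" as "not yet determined" is an assumption, since the paper does not define that code.

```python
from typing import Dict, Optional, Tuple

# Short codes permitted by section 4.1, mapped to the full terms.
BINDING_CODES: Dict[str, str] = {"yes": "bound", "no": "unbound", "half": "bindable"}
STRUCTURE_CODES: Dict[str, str] = {"yes": "structured", "no": "unstructured", "typed": "typed"}


def normalize(code: str, table: Dict[str, str]) -> Optional[str]:
    """Translate one column code; '?' is assumed to mean 'undetermined'."""
    code = code.strip().lower()
    if code == "?":
        return None
    if code not in table:
        raise ValueError(f"unrecognised code: {code!r}")
    return table[code]


def classify(bound_code: str, structured_code: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (binding, structure) for one row of the extract."""
    return (normalize(bound_code, BINDING_CODES),
            normalize(structured_code, STRUCTURE_CODES))


# Codes copied from the DATE and AB rows of the Appendix B extract.
print(classify("half", "no"))  # ('bindable', 'unstructured')
print(classify("?", "no"))     # (None, 'unstructured')
```

Rejecting unknown codes rather than passing them through keeps the two columns restricted to the vocabularies that section 4.1 prescribes.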