Languages for Recording Typographic Rendition in TEI Documents

C. M. Sperberg-McQueen
Lou Burnard

TEI ED W35
February 2, 1993 (11:58:17)

ABSTRACT

DRAFT. NOT FOR QUOTATION. NOT FOR CIRCULATION.

This paper discusses problems raised by the TEI global attribute rend and considers several possible approaches to the task of specifying a language in which its values may be expressed. Although no single language is likely to suffice for all applications, it proves possible to define small languages sufficient for fairly realistic projects.

Note: This paper is a draft for discussion, not for circulation outside the Text Encoding Initiative. (At the moment, it is not even intended for circulation outside the editorial staff.) It represents only the views of its authors -- and not necessarily even those -- and has not been endorsed by the TEI or by the entire editorial staff of the TEI.

Chapter 1 INTRODUCTION

The attribute rend, available on all elements in a TEI document, is intended to allow the user to specify, for any given instance of any element type, how the instance is presented, or rendered, in the source text. (An obvious alternative use allows the user to specify how the instance is to be rendered in a formatted version of the document produced by a formatter like TeX, troff, or Script, or by a browser like DynaText or BookManager. This alternative use is not -- and cannot be -- forbidden by the Guidelines, but is intentionally not described in the Guidelines, lest the imperative semantics of rend overshadow the declarative semantics.)

The text of the Guidelines, however, is silent on crucial issues regarding rend. No vocabulary of primitives is provided. This helps ensure that users of the Guidelines have the freedom to specify rendition at any desired or feasible level of detail.
On the other hand, no extensive examples are provided, either, and the Guidelines provide no guidance on the design of what amounts, after all, to a potentially complex language for expressing typographic concepts. Examples in the text suggest, implicitly, that such language design is feasible and should enable useful automatic processing of rend values, but the examples are mostly short and mostly evade any difficult questions, both in TEI P1 and in TEI P2. The treatment of quotation marks is a notable exception in both documents: it suggests specific keywords with identifiable meanings, and the discussion in P2 reflects an attempt to deal with the most obvious shortcomings of that in P1. Even here, however, there is more handwaving than specification, and we are far from showing that the rend attribute can actually be made to bear the burden being assigned to it in the Guidelines.

This paper is an attempt to begin coming to grips with the rend attribute. If we are successful, it may be possible to include sample specifications of languages for rend values in the Guidelines, or in accompanying material, both as examples of the kind of specification appropriate for serious work with rend and possibly as definitions of starter-sets, which can be used as is or extended for use by individual projects.

Our premise is that the set of legal values of rend, like that of legal values of the from and to attributes on the and elements, ought to be definable as a language (a "rend language"). Our attention should thus be directed to the syntax of the language and to its semantic primitives.

In order to keep the discussion relatively concrete, we begin by introducing several sample applications which seem relatively typical in the requirements they pose for the rend attribute. Then we briefly consider problems of syntax, and review the semantic primitives required by the sample applications.
Finally, we define a small set of requirements for a simple "toy" application, and define several sample languages for that set of toy requirements.

Chapter 2 SAMPLE APPLICATIONS AND THEIR REQUIREMENTS

The rend attribute is intended to allow the encoder to capture the salient information about the physical presentation of the text. Unfortunately, what is salient varies from project to project. For purely lexical work, many linguists feel free to ignore virtually all the presentational characteristics of their sources. For interpretive work, it is frequently useful to know that a given word or phrase is italicized or bolded; in new editions, most editors will attempt to preserve both such styling and at least the broad patterns of layout in their source text. Serious work on the physical transmission of the text, however, both by text critics and by analytic bibliographers, codicologists, and paleographers, will need even more detail in the description of the source. In some cases, such detail might be carried by photographic or digital facsimiles of the source, and not in the encoding, but researchers searching for patterns in large bodies of data are likely to need at least the possibility of representing their information in the encoded text as well as in a facsimile.

We assume that very detailed physical description of a source may require techniques beyond the use of rend, so we do not regard it as a failing in the languages sketched out here if they prove unable to handle the most exigent demands of analytic bibliography and codicology. For present purposes, we focus on the level of physical description associated with style sheets and copy editors, which bears a recognizable relation to the physical details but operates at a slightly more abstract level. We ignore, for now, many details of the physical variations offered by typesetting machines or by a hand press.
For present purposes, we will be pleased if we can formulate rend languages adequate to the needs of the following applications:

   * record the typographic presentation of articles in Webster's New Collegiate Dictionary in enough detail to allow it to be re-typeset in recognizably the "same" style.
   * record the appearance of older print matter (specifically, the essay Condition of Women) in enough detail to make it possible to print a new-spelling edition for teaching purposes which retains the same patterns of font shifts and layout (but not necessarily the same fonts exactly)
   * describe the intended presentation of TEI P2 in terms of the rendition associated with various element types

For the dictionary example, we work from the following description (in TEI AI5 D15, a draft of chapter DI) of the default rendition of various elements in Webster's New Collegiate Dictionary. This description was initially prepared purely for use within work group AI5, to aid the members in interpreting the examples from the dictionary; we can thus hope that it represents an unbiased list of the presentational features important to a lay reading of the text and omits details only because they are insignificant in the context, not because they are hard to formalize.

   * dictionary entries begin new line, left-offset one en.
   * related entries are run on, preceded by em-dash.
   * contents of always italic, 1 em space each side.
   * headword itself (contents of ) always bold.
   * spaces added around middot in transcription for legibility.
   * within , \...\ marks pronunciation.
   * sense number is always bold, preceded by one-em space, and followed by space + boldface colon + space. If sense number is null, colon appears anyway, but there is only one space before the colon. If the sense is immediately divided into two subsenses, as in network 1, sense 4 (i.e. if the sense number is immediately followed by another sense number), then no colon appears. N.B. in AI5 D15, sense numbers may be recorded either as values of n on the element or as contents of an element given as the first child of .
   * spaces between angle brackets and examples they surround added in transcription (to simplify SGML parsing)
   * The etymology (encoded ) appears in square brackets.
   * Language names ( elements) are roman, glosses are roman and have no quotation marks.
   * Cited words () are italic.
   * Cross-references are printed in small caps (here, encoded ).
   * The contents of are in small caps unless enclosed in , which is always roman.
   * In an example (), the citation at the end is preceded by an em-dash.
   * Citation text is in italics, in its entirety.

For the older-text example, we accept as binding the formatting information captured by the Brown Women Writers Project in their encoding of the text. Notably, we need concise descriptions of the following situations (as well as the more common problems noted above and below):

   1. dropped initial caps need to be marked; at least the identity of the letter, its height (and width), its vertical position relative to the rest of the text (i.e. how many baselines down its baseline occurs, how many lines of text are set to the right, rather than below, the cap), and if possible some indication of the character's font or ornamentation
   2. capitalization of words at the beginning of a chapter or section must be recorded; we desire to be able to specify that a given book always capitalizes one, two, three, or n words, or one full line of text; in the latter case, we desire to be able to say how many words are capitalized in a given case, at least in works where we are not otherwise recording line breaks.
   3. the common practice of italicizing all but the initial capital of a proper noun should be marked, preferably without requiring highlighting elements in mid-word.

For the third example, we accept as binding the formatting performed on the P2X-files of TEI P2 by the Waterloo Script macros used to produce the output files made available to the public. For present purposes, it is enough to note the following:

   * Chapter and section numbers (in the form n, n.n, n.n.n, etc.) are managed by the formatter, but the default number for a chapter or section is overridden by the n attribute value, if any, specified on the .
   * Parts (encoded ) begin with a single page labeled with the part number and name.
   * Chapters () start at the top of a page, with the phrase "Chapter n" centered in bold, and the chapter , centered in bold, below it. A 1/3 inch band of white space separates the title from the first paragraph.
   * Parts, chapters and the first two levels of sections within chapters are recorded in the table of contents.
   * Paragraphs are indented and preceded by a blank line; the first paragraph of a section, note, or list item, however, is not indented. The first paragraph of a note or list item is not preceded by a blank line.
   * The elements , , , and are printed in italics. Terms are put in the index; attributes are put in the index with the second-level phrase "(attribute)", unless the tei attribute has the value "no". If the context is already italic, all these elements should be printed in roman type.
   * The elements and are printed in quotation marks (initially double for each, though it would be preferable to use single for the latter, single if nested, double if nested again, etc.), unless a has a rend attribute of "display". N.B. this description assumes the rend language can interpret "display" either as a primitive or as a macro to be expanded. If neither is the case, we will be content if we can use an entity reference to an entity named display, or we may be willing to provide a distinct attribute named style for the distinction.
   * The elements and are printed bold (or in small caps, if LB's style is used), within angle brackets. The former are placed in the index, unless the attribute tei has the value "no"; the latter are not.
   * If it has a rend of "display" the element is printed as a block quote, with a narrower text measure (wider margins), preceded and followed by white space. Current formatting does not change the size of type, but it ought to be reduced slightly.
   * The element is printed either as a block note (like a block quotation preceded by the word "Note:" in bold), as a footnote, or as an end-note (either at end of chapter or at end of book, depending on settings), depending on the value of the place attribute: "inline", "foot", or "end".
   * The element is preceded and followed by white space and printed as an indented block, in a monospaced font, without text justification; line breaks of the text are respected.

Chapter 3 SEMANTIC PRIMITIVES

From the examples in the previous section, we can note the following requirements for rend languages:

   * element contents may be printed in different type styles; the change may involve family, style, weight, or leading
   * element contents may be preceded or followed by specific text (which may itself require specification of a type style, because it may be bold or italic)
   * rendition may depend on context (like the colon after a sense number)
   * rendition may depend on attribute values other than rend, as does the placement of notes
   * rendition may involve text formed from attribute values (e.g. n values) or from application-maintained variables (chapter numbers, section numbers, concatenations of chapter and section numbers)
   * rendition may include side effects (entries in index or table of contents)
   * rendition may affect the layout of the page: margins, measure, leading, indentation, offset, etc.

For the present, we ignore the problem of running headers and footers, pagination, etc.

By hypothesis, we know that default rendition is sometimes overridden by a local rendition specification. We have no mechanism, however, for specifying the default rendition, short of redefining the attribute rend for every element. Since, as things now stand with the TEI DTDs, this would involve redefining every element from scratch, this is not a viable option.
This could be solved by an ad hoc change to the TEI DTDs, like omitting rend from the global class (like TEIform) and providing a special parameter entity for the default rend value of each element type. For purposes of this paper, however, we assume it is solved by defining a new element for the TEI header (location as yet unspecified), as follows:

If the user is seriously expected to define a useful language for rend values, or declare the use of an existing one, the TEI header may need a new section, or the TEI may need a new auxiliary DTD for a Rendition Language Declaration (which might be generalized to an Auxiliary Language Declaration, to allow language definitions for other attributes as well).

3.1 Text Styling

Several of the examples involve printing the contents of an element in a particular font. We therefore assume we will need primitives to change the following aspects of the type face for the duration of the element:

   * type face (Times, Helvetica, Optima, Palatino, Bookman, ...)
   * style (roman, italic, slant, ...)
   * weight (normal, bold, demibold, light, ...)
   * size (9pt, 10pt, ...)
   * leading (11pt, 12pt, ...)

A rend language might allow the user to assign a single "font identifier" to one setting for all these attributes, and invoke it with a single call; for the moment, we assume instead that our rend languages will provide distinct primitives for each of these attributes.

Serious work in analytic bibliography may need finer distinctions in the attribute family (and possibly others) than other users, e.g. to distinguish different cuttings of the same design by different punch cutters, or by the same cutter at different dates. It might be preferable to consider characteristics like the identity of the punch cutter or the date of the cutting to be distinct attributes, to be set by distinct primitives in the rend language. Similar considerations apply to observations on the type's state of wear, etc.
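The idea in section 3.1 of distinct primitives, each changing one aspect of the type face for the duration of an element, can be illustrated with a minimal sketch in Python. This is only an illustration under stated assumptions, not a proposed implementation: the names TypeState, Formatter, open_element, and close_element are hypothetical, and the default values (Times, roman, 10 on 12 points) are chosen arbitrarily.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TypeState:
    """One setting of the five type attributes listed in section 3.1 (hypothetical)."""
    face: str = "Times"
    style: str = "roman"
    weight: str = "normal"
    size: float = 10.0      # points (assumed default)
    leading: float = 12.0   # points (assumed default)


class Formatter:
    """Keeps a stack of type states, one per open element (hypothetical)."""

    def __init__(self):
        self.stack = [TypeState()]

    @property
    def current(self):
        return self.stack[-1]

    def open_element(self, **changes):
        # Each keyword acts as one primitive: only the named attributes
        # change; all others are inherited from the enclosing element.
        self.stack.append(replace(self.current, **changes))

    def close_element(self):
        # Leaving the element restores the enclosing element's state.
        if len(self.stack) > 1:
            self.stack.pop()


fmt = Formatter()
fmt.open_element(style="italic")            # e.g. an italic element
fmt.open_element(weight="bold", size=9.0)   # nested element: bold, smaller
inner = fmt.current
fmt.close_element()
fmt.close_element()
```

Because each state is an immutable copy, closing an element needs no explicit undo step; a single "font identifier" facility could be layered on top by naming a complete TypeState.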
Some examples from the WWP text suggest the need for primitives which change the text styling for some portion of the element contents, but not for all:

   * italicize only the second character to the end
   * capitalize the first N tokens, or the first line
   * print the initial character as a dropped initial cap

These might of course be handled with elements marking the characters which must be rendered in a special style; we will count it as a strength of a rend language if it allows us to treat the use of a dropped initial cap or capitalization of initial words as a rendition characteristic of the paragraph, or better yet of the whose beginning is thus marked. See also the discussion of language extensions, below in section 3.3, "Side Effects, Flow of Control, and Extensions".

3.2 Layout

Several examples involve formatting the contents of the element as a text block, or (as Knuth might say) as a series of vertical blocks. For the moment, we assume that even in a complex book, the number of distinct shapes used for paragraphs, etc., is limited, and therefore assume a single semantic primitive for specifying the layout shape to be taken by the contents of an element. The examples given above mention only the following layout shapes:

   * indented paragraph (normal )
   * unindented paragraph (first in a )
   * offset paragraph (first line juts out at left) (dictionary )
   * paragraph with dropped initial capital

We may also assume the existence of run-on and break-off primitives, which cause an element to be treated as a continuation of the preceding layout block, or as the beginning of a new layout block, even when they would "normally" be treated otherwise.

3.3 Side Effects, Flow of Control, and Extensions

Several examples involve the creation of entries in a table of contents or an index. If we limit the rend attribute strictly to presentation of element contents in-line, these might be omitted as not belonging to rend. However, it is hard to see where else such behaviors can reasonably be described in a TEI document, so we will count it an advantage in a rend language to be able to specify side-effects like the creation of entries in a table of contents or an index; this can occur either through primitives for pre-specified operations like toc-entry and index-entry, or by a more general mechanism, in which tables of contents and indices are concrete cases of a more general problem.

Since styling of the type "capitalize initial letter, italicize the remainder" may occur in only a few texts, it may be better for a language not to provide a primitive for this style, but to allow its expression using simpler primitives. Since however such a style may be very common within those texts which use it, encoders may reasonably desire the convenience of expressing it with a single keyword. It thus seems desirable for any rend language to allow the user to define new keywords, using something like a macro facility or a procedure-call mechanism. It is not clear how

3.4 A Set of Toy Requirements

To simplify the discussion of the sample languages below in Chapter 5, "Sample Languages", we formulate a very short list of requirements -- short enough that we can hope to define a complete language meeting those requirements in a page or so of text.
The language must:

   * allow the specification of font family, style, weight, leading
   * allow the specification of strings to precede and follow the element content
   * allow the specification of conditional rendition dependent on position in the SGML tree or on values of other attributes
   * allow the specification of layout patterns (e.g. indent, offset, etc.)
   * allow assignment and reference to string and integer-valued variables to be maintained by the application, including variables describing the current rendition state (font, paragraph shape, etc.)

For the moment, we do not require the ability to specify special styles for portions of the element content, as with the WWP examples above.

For each language, we provide examples which demonstrate these features using the element defined above, defining default processing (as described above) for the elements , , , , , and .

Chapter 4 SYNTAX

We have carte blanche in defining a syntax for the language of rend attributes. At the moment, the following styles of syntax seem most plausible:

   Scheme-like: a syntax based on Scheme makes every expression in the language into a list; lists are delimited by parentheses, and their members delimited by white space. This appears likely to be the approach of DSSSL.

   Keyword-based: a keyword-based syntax might resemble the extended pointer language, the examples given in chapter 6 (Core) of P2, or the attribute value specification lists occurring in SGML start-tags.

   Functional: a syntax based on conventional functional notation might also allow infix operators for arithmetic and boolean expressions, as in Pascal, C, and similar languages. The values of rend might be expressions (in a declarative style language) or statement sequences (in a more procedural language).

   Z-like: a syntax based on first-order predicate calculus might resemble that of Z; the legal strings would presumably be predicates.

   SGML-based: rendition attributes might take the form of entries in a style-sheet document, whose syntax is defined by an SGML DTD. Values of rend might be tagged SGML text parseable using such a DTD, or they might be restricted to giving the name(s) of one or more applicable style-sheet entries.

It is not clear at first glance which of these approaches will bear best fruit; we assume the immediate need is for experimentation using all of these. Sample definitions of small languages using several of these styles are included below, with examples.

Chapter 5 SAMPLE LANGUAGES

This section describes and defines a number of small sample languages which address, sometimes incompletely, the requirements given above in section 3.4, "A Set of Toy Requirements." Definitions are given in the same BNF style used for chapter GR of TEI P2.
5.1 Procedural Description of the Sample Processing

To give ourselves a slightly more precise idea of the required processing than is possible with the descriptions above, we provide procedures written in an imaginary programming language (which however looks a lot like Rexx) which perform the desired processing. These procedures call on an imaginary function library whose details are left unspecified; it is hoped that the function names and comments will make clear to the reader what is happening.

The procedures assume the current formatting state is completely described by stacks called style, face, weight, size, leading, lmargin, and rmargin. To avoid lengthy explanations of the procedure-call interface, the procedures each show some work which ought really to be done by some central general routine.

For the EMPH element:

   /*****************************************/
   /* ProcessEmph: handle one EMPH element  */
   /*****************************************/
   ProcessEmph:
      /* prepare to push new values onto all the stacks */
      i = currentDepth           /* depth of open-element stack */
      j = i + 1                  /* depth of new element */

      /* if parent is italic, go roman; otherwise, go italic */
      if style.i = 'italic' then style.j = 'roman'
      else style.j = 'italic'

      /* values of all other stacks remain the same */
      face.j    = face.i
      weight.j  = weight.i
      size.j    = size.i
      leading.j = leading.i
      lmargin.j = lmargin.i
      rmargin.j = rmargin.i

      /* now increment the depth count */
      currentDepth = j
      drop i, j

      /* now process the element */
      call processContent

      /* at end of element, now decrement depth;
      ** all stacks will return to their old states */
      currentDepth = currentDepth - 1
      return

For the Q element:

   /*****************************************/
   /* ProcessQ: handle one Q element        */
   /*****************************************/
   ProcessQ:
      /* push new values onto all the stacks */
      call inheritFormat         /* assume everything is the same;
                                    we will change later what we
                                    need to change */
      i = currentDepth + 1       /* depth of newly opened element */
      currentDepth = i

      /* if rend = inline, do inline; if display, do block quote */
      if attvalue(rend,i) = 'inline' then do
         n = countAncestors('q') /* how many open Qs are there? */
         assert(n > 0)           /* we are *in* one Q now! */
         /* if at outer level, or two deep, use double quote;
            if at first nesting, or third, or ..., use single.
            (n // 2 is the remainder of n / 2; in Rexx, % is
            integer division) */
         if n // 2 = 1 then output(&ldq)
         else output(&lsq)
      end /* inline quote */
      else if attvalue(rend,i) = 'display' then do
         lmargin.i = lmargin.i + 5   /* indent on L */
         rmargin.i = rmargin.i - 5   /* indent on R */
         size.i    = size.i - 2      /* reduce size */
         leading.i = leading.i - 2   /* reduce leading */
         output(vskip)
      end /* display quote */
      else abort 'Unknown value for REND on Q element'

      /* now process the element */
      call processContent

      /* nested elements may have reused i and n, so
         recompute them before closing off the quote */
      i = currentDepth
      if attvalue(rend,i) = 'inline' then do
         n = countAncestors('q')
         if n // 2 = 1 then output(&rdq)
         else output(&rsq)
      end /* inline quote */
      else if attvalue(rend,i) = 'display' then do
         output(vskip)
      end /* display quote */

      currentDepth = currentDepth - 1
      return

For : [to be supplied].

For : [to be supplied].

For : [to be supplied].

For

: [to be supplied].

5.2 A Keyword-Based Language

This keyword-based language continues in the vein of the examples in TEI P2, chapter CO (Core), using keywords like STYLE and WEIGHT to signal font shifts, PRE and POST to indicate preceding and following strings, and keywords like INDEX or TOC to invoke a predefined set of side effects.

Note that we are unable to express the full default behavior of (which should shift to italic only if the environment is non-italic, otherwise to roman); or to note that the colon following the sense number is bold (we must rely on some global rule that says the preceding and following text matches, or does not match, the element content in style -- either way, it is sure to break in some cases).

We can ensure that display quotes are processed correctly either by defining DISPLAY as a keyword with appropriate meaning, or by defining an entity thus:

and tagging block quotes thus:

   As Aristotle says: ...

   But Plato ...

We have no way of defining the default behavior of the n attribute of .

A formal definition of this small keyword-based language is:

   // Keyword-based language
   value       ::= term
                 | value term
                 ;
   term        ::= 'PRE' strings
                 | 'POST' strings
                 | 'STYLE' typestyle
                 | 'FACE' typeface
                 | 'WEIGHT' typeweight
                 | 'SIZE' typesize
                 | 'LEAD' leading
                 | 'INDEX' ixterms
                 ;

   // Strings
   strings     ::= string
                 | strings string
                 ;
   string      ::= /* empty string */
                 | "'" characters "'"
                 | '"' characters '"'
                 | entity               // an entity name
                 | attref               // reference to an att value
                 | conditional          // special conditional keywords
                 | '*Content*'          // contents of element
                 | string '||' string   // explicit concatenation
                 ;
   entity      ::= NAME
                 ;
   attref      ::= '*' || NAME || '='   // value of attribute named
                 ;
   conditional ::= 'condvskip'          // vertical skip unless a bigger
                                        // skip has just been performed
                 | 'condindent'         // indent unless eldest child
                 | 'condspace'          // space (but not if adjacent to
                                        // a white space)
                 ;

   // Type styling
   typeface    ::= 'Times'
                 | 'Courier'
                 | 'Helvetica'
                 | 'Optima'
                 | 'Bookman'
                 | 'Century Schoolbook'
                 ;
   typestyle   ::= /* nothing */        // default: roman
                 | 'roman'
                 | 'italic'
                 ;
   typeweight  ::= /* nothing */        // default: normal weight
                 | 'bold'
                 | 'light'
                 | 'demi'
                 ;
   typesize    ::= NUMBER               // in points
                 ;
   leading     ::= NUMBER               // in points
                 ;

   // Side effects
   // ixterms: terms for an index entry (up to four levels)
   ixterms     ::= string
                 | string string
                 | string string string
                 | string string string string
                 ;

5.3 A Scheme-Like Language

5.4 An Expression Language

This expression language uses a conventional functional notation for the rend values. All expressions are string-valued, with implicit concatenation (as in Snobol). Side effects, including the setting of variables, are accomplished by functions which return the empty string, and thus can appear anywhere in the expression. Conditionals are handled using the normal if-then-else notation familiar from Algol.
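The evaluation model described for the expression language (string-valued expressions, concatenation, side-effect functions that return the empty string, and if-then-else conditionals) might be mimicked in Python roughly as follows. This is a sketch of the evaluation model only, not of the language's syntax, which this draft does not define; the names RendEnv, set_var, index_entry, att, concat, and render_att_name are all hypothetical. The example rule is the one given in Chapter 2 for attribute names: indexed with the second-level phrase "(attribute)" unless the tei attribute has the value "no".

```python
class RendEnv:
    """Application state visible to rend expressions (hypothetical)."""

    def __init__(self, attrs=None):
        self.vars = {}             # application-maintained variables
        self.attrs = attrs or {}   # attribute values of the current element
        self.index = []            # index entries collected as a side effect

    def set_var(self, name, value):
        # Side effect: set a variable; contributes nothing to the output.
        self.vars[name] = value
        return ""

    def index_entry(self, *terms):
        # Side effect: record an index entry; returns the empty string.
        self.index.append(terms)
        return ""

    def att(self, name):
        # Value of another attribute on the element, or "" if unspecified.
        return self.attrs.get(name, "")


def concat(*parts):
    """Stands in for the implicit concatenation of the expression language."""
    return "".join(parts)


def render_att_name(env, content):
    # "if tei = 'no' then '' else index-entry(...)", followed by the content.
    return concat(
        env.index_entry(content, "(attribute)") if env.att("tei") != "no" else "",
        env.set_var("style", "italic"),
        content,
    )


indexed = RendEnv()
out_indexed = render_att_name(indexed, "rend")

suppressed = RendEnv({"tei": "no"})
out_suppressed = render_att_name(suppressed, "rend")
```

Because the side-effect functions return the empty string, they can be placed anywhere in the concatenation without disturbing the rendered text, which is the property the section relies on.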
5.5 A Z-like Language

5.6 An SGML-coded Style Sheet Language