Received: by UICVM (Mailer R2.03B) id 5320; Thu, 11 Jan 90 05:19:30 CST
Date: Thu, 11 Jan 90 12:15:00 +0100
Reply-To: Text Encoding Initiative - Text Representation Committee list, Stig Johansson
Sender: Text Encoding Initiative - Text Representation Committee list
From: Stig Johansson
Subject: Working paper TEI TR W10 (draft)
To: Michael Sperberg-McQueen

TEI TR W10

Problems with punctuation marks (draft)

1. Punctuation marks in ISO 646

The following punctuation marks (taken in a rather wide sense) are available among the non-national characters of ISO 646:

    space ! " ' ( ) , . / : ; < > ?

(According to Wilhelm Ott, we should exclude the exclamation mark. Did all the other characters come through correctly?)

Characters which are not available can be represented by entity references or dealt with by other forms of markup (see further DeRose TEI TRW7).

2. Entity names

The following lists, which have been compiled from ISO 8879-1986(E) and Bryan (1988), may serve as a basis for further work. (I see in the Luxembourg minutes that 'entity names should be chosen from existing sets', so disregard my comments at the beginning of Section 3.)

2.1. Separators

    &period;   period (full stop)
    &comma;    comma
    &quest;    question mark
    &iquest;   inverted question mark
    &excl;     exclamation mark
    &iexcl;    inverted exclamation mark
    &colon;    colon
    &semi;     semicolon

There are special entity names for full stops used as decimal points and as indications of ellipsis:

    &middot;   middle (decimal) dot
    &hellip;   horizontal ellipsis (three dots)
    &mldr;     em leader (three dots)
    &nldr;     en leader (double baseline dot)

2.2. Includers

    &lpar;     left parenthesis (opening)
    &rpar;     right parenthesis (closing)
    &lsqb;     left square bracket (opening)
    &rsqb;     right square bracket (closing)
    &lang;     left angle bracket (opening)
    &rang;     right angle bracket (closing)
    &lcub;     left curly braces (opening)
    &rcub;     right curly braces (closing)

2.3.
Quotation marks

    &quot;     quotation mark (double, straight)
    &lsquo;    left (opening) single quotation mark
    &rsquo;    right (closing) single quotation mark
    &ldquo;    left (opening) double quotation mark
    &rdquo;    right (closing) double quotation mark
    &lsquor;   rising single quote, left (low)
    &rsquor;   rising single quote, right (high)
    &ldquor;   rising double quote, left (low)
    &rdquor;   rising double quote, right (high)
    &laquo;    left angle quotation mark (French opening guillemet)
    &raquo;    right angle quotation mark (French closing guillemet)
    &lsaquo;   left single angle quotation mark (embedded guillemet)
    &rsaquo;   right single angle quotation mark (embedded guillemet)

2.4. Spaces

    &emsp;     em space
    &ensp;     en space (1/2 em)
    &numsp;    digit space (width of a number)
    &emsp13;   1/3 em space
    &puncsp;   punctuation space (width of a comma)
    &nbsp;     no break (required) space

2.5. Dashes and hyphens

    &mdash;    em dash
    &ndash;    en dash
    &dash;     neutral dash/minus
    &hyphen;   hyphen
    &shy;      soft hyphen

2.6. Other

    &apos;     apostrophe
    &sol;      solidus (shilling stroke)
    &bsol;     backslash (reverse solidus)
    &verbar;   vertical bar
    &brvbar;   broken vertical bar
    &horbar;   horizontal bar
    &lowbar;   lowline (baseline rule)

3. Brief comment on entity names

Entity names should be transparent and consistently built up. If there are groups of related features, this should be reflected in the names. Some of the entity names given above could be made more explicit, e.g. &shy; = soft hyphen. It would seem preferable to indicate the main type first and then the particular type (deviating from ISO 8879 and Bryan, where bodies of entity names may have both prefixes and suffixes), e.g. &quot.ls; = left single quote, &hyph.s; = soft hyphen, &quest.i; = inverted question mark. To achieve greater transparency and consistency, it may be necessary to allow longer names.

Entity names are used where the character set is insufficient. Some of the commonest problems will be briefly discussed below.

4.1. Period (full stop)

The period is used to mark

    a) sentence endings,
    b) abbreviations,
    c) decimal point,
    d) enumerators in lists (1., a., etc),
    e) parts of list items (e.g. in bibliographies).
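Of these uses, the decimal point (c) is perhaps the easiest to separate mechanically. The following is only a minimal sketch, under two assumptions that are not part of the paper: that a period standing directly between two digits is a decimal point, and that the entity reference &middot; from section 2.1 is the chosen representation. The function name is hypothetical.

```python
import re

# Assumption: a period with a digit immediately before and after it is a
# decimal point. Lookarounds are used so the digits themselves are untouched.
_DECIMAL_POINT = re.compile(r"(?<=\d)\.(?=\d)")


def mark_decimal_points(text: str) -> str:
    """Replace decimal-point periods with the entity reference &middot;.

    Sentence-final periods, abbreviations and enumerators are left alone,
    because they are not flanked by digits on both sides.
    """
    return _DECIMAL_POINT.sub("&middot;", text)


if __name__ == "__main__":
    print(mark_decimal_points("It cost 3.50 in 1990."))
```

The rule is deliberately crude: numeric dates or section numbers written with periods (such as 4.1) would also be rewritten, so the output would still need checking by the editor.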
We have already seen that decimal points can be identified by entity names, if necessary. Tagging of abbreviations takes care of the second use listed above. To identify sentence endings I would like to propose the following mechanism:

We recognise a unit roughly corresponding to an orthographic sentence. Let us call it an S-unit. Each S-unit is marked by an empty element, optionally followed by a reference number. S-units are orthographic sentences (ending with a period, an exclamation mark, or a question mark) or other structurally independent forms (e.g. headings). From the notion of S-units we may want to exclude enumerators in lists, names of speakers in dramas, page references, and similar discourse organisers.

The text documentation section of a text should specify whether S-units are used or whether there is some other form of reference system. In either case, further details should be provided on the reference system.

Marking of S-units in this manner at the same time disambiguates the period and provides a consistent reference system (which we need anyway).

4.2. Question and exclamation marks

Question and exclamation marks almost always mark the end of sentences. But they may be used occasionally for other purposes, e.g. as a mid-sentence comment by the author (! - to express surprise or some other strong feeling, such cases can be distinguished from mid-sentence commenting ! and ? by the following quotation marks (or markup indicating quotations).

4.3. Quotation marks

The best way of dealing with quotation marks is probably to replace them by descriptive markup indicating begin-quote and end-quote, especially as quotations are not always marked by quotation marks (notably long quotations) and we need this form of markup anyway. Quotes within quotes can be handled by SGML (as far as I understand).

The main problem arises with cases where quotation marks are used for other purposes, e.g.
to give the title of an article, to gloss the meaning of a word, to indicate that a word is a technical term, or to use them as a distancing device (as in: she hated 'good' books). We need special tags for these uses, perhaps something like: title, gloss, term, so-called.

Quotation marks indicating direct speech should have special tags. Note, incidentally, that direct speech may be unmarked in printed texts or may be indicated by some other device, such as by a dash or by indentation. (In dramas there is no need to mark direct speech, as long as we tag stage directions and speaker attributions.)

4.4. Hyphen

In reproducing printed texts it is necessary to reach a decision on the representation of soft (line-end) hyphens. Where the lineation of the machine-readable text is different from the original (which is probably most often the case), the editor can either eliminate soft hyphens or replace them in some manner (by an entity reference or some other convention, e.g. hyphen followed by space). It does not matter which solution is chosen, as long as it is reported (in the section on text documentation).

Writers of original machine-readable texts should be advised not to use soft hyphens. (Or some other convention for word continuation should be devised.)

4.5. Apostrophe

Apostrophes must be distinguished from single quote marks. This is perhaps best done by using descriptive tagging for quotations. However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and plural forms (occasionally). Disambiguation of these uses belongs to the level of (linguistic) analysis and interpretation.

5. Final remarks

Using markup it is possible not only to express distinctions made in print but also to transcend limitations inherent in current conventions for printed texts. The greater distinctiveness carries a price, however: the text becomes bulkier and harder to use and produce in a direct way.
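The S-unit proposal of section 4.1 is one place where part of that production cost might be carried by a program. What follows is a rough sketch only. The paper does not fix the name of the empty S-unit element or the form of its reference number, so the tag shape `<s n=...>` is an assumption made purely for illustration, as is the function name.

```python
import re

# Assumption: an S-unit boundary is a period, exclamation mark or question
# mark followed by whitespace or by the end of the text.
_S_BOUNDARY = re.compile(r"([.!?])(\s+|$)")


def insert_s_units(text: str) -> str:
    """Insert a numbered S-unit tag after each candidate sentence ending.

    The tag form "<s n=N>" is hypothetical; the paper only says that an
    S-unit is marked by an empty element with an optional reference number.
    """
    count = 0

    def tag(match: re.Match) -> str:
        nonlocal count
        count += 1
        # Keep the punctuation mark and the following whitespace unchanged.
        return f"{match.group(1)}<s n={count}>{match.group(2)}"

    return _S_BOUNDARY.sub(tag, text)


if __name__ == "__main__":
    print(insert_s_units("He left. Did she? Yes!"))
```

A tagger of this kind will over-generate, for instance after abbreviations such as "Dr.", so its output is a proposal for the writer to correct rather than a finished segmentation.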
Software must be provided which assists the writer in putting in tags, e.g. inserting S-unit tags after periods, question marks, exclamation marks, and end tags for headings (leaving the writer the choice to delete them where they are not applicable). To take another example, quotation marks could be replaced by start and end tags for quotations (leaving the writer the choice to change the tags to 'title', 'gloss', 'term', 'so-called', etc). Similarly, there is a need for software which enables users to rearrange and display the text according to their needs. Finally, we need software which helps in checking the accuracy and consistency of marked-up text.

The age of print has led to great gains in clarity of text representation. We should build on conventions for printed texts, without being restricted by them. The ultimate aim is to achieve even greater clarity and efficiency of use and arrive at a text representation appropriate to the electronic age.

20 December 1989
Stig Johansson
University of Oslo