TEI CE M 01: TEI Character Set Work Group Meeting Minutes, 2002-07-23/24, Tübingen, Germany

Initials used for people present

DA Deb Anderson
SB Syd Bauman
MB Michael Beddow
DB David Birnbaum
BB Brian Bruya
LB Lou Burnard
PD Patrick Durusau
TM Tone Merete Bruvik
MT Masayuki Toyoshima
CW Christian Wittern

Note that TM and BB were only present during day 2, and that PD missed the first hour or two of day 1.

Initials used for people not present

EO Espen Ore

Meeting took place in ZDV, University of Tübingen, Tuesday 23 and Wednesday 24 July, 2002.

Introduction, administrative announcements
Review of relevant sections in P4
Review of use cases and current practice
Architectural issues: Processing model for TEI texts, modularization of WSD.
Closing remarks, thanks to the hosts.

Introduction, administrative announcements

SB taking minutes.

Thanks to local organizers.

Building we are in closes at 16:30, but we can stay later.¹

LB hands out P4 CDs. Note that paper insert still says ‘1999’, ignore. It's new.

CW suggests that when viewing Guidelines ² it would be good if reader could tell what text has been modified. General agreement. Editors agree in principle; they point out that there is a not-quite-yet-published errata list for P4. ³

Welcome to visitor MT.

Review of relevant sections in P4

Editors thankful to CW & CE-WG for effort put into P4.

CW asks if effort should be put into P4 or P5 (answer from LB, with SB concurring, is that except for errors and the like, we're done with P4, concentrate on P5.) The question then arises as to whether the requisite chapters in P5 should be rewrites of P4 or start afresh. LB answers either, hope there's a base in P4 that's usable.

CW asks if P5 will support SGML at all. LB responds there will be resistance to dropping SGML. But WG could recommend that P5 SGML projects use Unicode as the base character encoding.

SB points out SGMLers could still use P4, of course.

Note to XML migration WG to consider character set issues, particularly transcoding.

Briefly addressed problem of attaching WSD-NG to instance — where should WSD go?

in front of every document in subset or header;
in separate free-standing document & link to it either by explicit or implicit link;
use current external entity mechanism.

DB suggests mechanism to suck WSD into DTD view; editors point out issue is where WSD goes, not where its declaration goes.

SB asks should we work on end of Ch 25?

MB points out 25 not used often, but 4 is pedagogically wrongish, too much ‘how we got here from there’ stuff.

Editors point out audience for P4 was P3 users, but P5 can start with new user in mind.

MB states that in particular the reader is told about variants ⁴ too much too early.

DB suggests Unicode's ‘gentle intro to characters and glyphs’. CW asserts it is geared towards software developers, not encoders. DB says he has some experience with it with his students, and still recommends it.

LB points out that CH is unlike rest of Guidelines because it is introductory not a reference. LB recommends, analogous to SG, CH presents that introduction necessary to understand Unicode as you need to know it for TEI, but written as reference not as tutorial.

MT asks why not divide P5 into a reference and a separate supporting (tutorial, introduction?) part. In response PD asked what parts, if any, are not needed. SB thought the details of byte-order; LB pointed out that MB was claiming the history was not needed in a tutorial. CW suggested that the history is needed in the tutorial. ⁵

LB suggests WG thinks about what goes into P5 in place of CH. A complete rethink may be in order, stating principles up front (may want 2 chapters, one with how to, other with whys and wherefores).

CW points out that completely free-standing is less visible; LB that he wasn't excluding included & separate (like SG).

CW suggests that we further nail down principles underlying character sets (CH & WD) of P5, then find volunteers to draft.

On the strength of MB's volunteering for part of the task, and on CW's recommendation, the WG appointed MB to draft first an outline (by mid- to late August), and then a draft of the replacement(s) for CH and WD.

After MB expressed some trepidation, as he does not feel qualified to draft several of the subsections that would be involved, LB and others assured him that he was not being left responsible for the entirety — he should feel free to leave ‘this section to be written by someone else’ as the content of some of the subsections. LB noted that the subsection headings should still be included in the outline. ⁶

The next question discussed was how to divide up the various topics that need to be covered. MB points out that there is lots of stuff TEI has to do even if other bodies had done their jobs perfectly. LB suggests not dividing at drafting stage, but rather cover topics, and then go through draft.⁷

MB suggests avoiding clashes by paying attention to MSS WG work.

Responding to SB's query CW says CH needs to fill in gaps. Many projects answered ‘we use English, so we have no problem’. PD raised the question, if they don't think they have a problem, do they?

LB: need a section ‘why this chapter is needed’

DB suggests we pick up Lou's suggestion and outline what CH should look like, and then suck in parts from P4 as needed.

MB to draft new P4 during mid-Aug with PD & tei-chars help; DB suggests outline early

CW suggests we defer this discussion until later in meeting

Under ‘items we need to cover’, reviewed CW's presentation at Pisa that reviewed the Berkeley meeting.

CW says we will table until later.

Review of use cases and current practice

Besides TEI-L, requests were sent to unicode, unicore, linguist, and some other list. DA says that there was also a poor response from these lists, but what results we have indicate that some projects are using the PUA.

MB: Tagalog to use Unicode?

MENOTA has decided, as far as possible, to do diplomatic transcription at the character level. They have produced extensive documentation of how they plan to encode & document use of PUA for this. They have a system to generate entity names, and have created names even for characters that already have ISO names. LB considers this a mistake. MB says that we need to give guidance; LB that we need to warn against making this mistake.

DB says it's not just presentation & encoding (glyph & character), but two different levels of information. General agreement.

CW: We need to provide entity-less solutions for non-valid XML documents (which are by definition well-formed).

Ascertain: are there entities with the various schema flavors?

CW presents his use-case, <gaiji>.

The question was raised as to what is the difference between using <orig> with reg vs a WSD solution to normalized characters? CW: difference is that there are multiple modern versions.

PD asks if by doing this in WSD are we limiting future use.

DB points out if you don't make a distinction in the document, you won't be able to do this kind of normalization.

MT asks about using multiple WSDs; ...

Do we need more use cases? PD suggests looking at the encoding of various sites without eliciting a case from the project.

CW: we need to develop an inventory of 'em, if not an actual test suite, but we don't need to discuss them here & now.

Lou is interim TEI webmaster. CW & LB to have TEI site mirror CW's documents. E.g., this document will be CEM01. WG documents should be submitted in TEI P4 Lite XML.

Architectural issues: Processing model for TEI texts, modularization of WSD.

Issues related to the extension/modification of the document character set.

Problem: once a character code point is entered, there's no way to recover which of variant glyphs was used. (CW demonstrates with slide from his paper.)

PD points out that you can already do this with current WSD.

DB: 2 problems —

you point into font at the ‘right’ spot, but the glyph you want isn't there;
⁸

CW thinks LB's examples represent problems that should be handled in stylesheet.

DB points out that WSD does have mechanism to point into abstract glyph inventory. (AFII — although CW notes it probably will be nuked)

DB points out that as long as we handle many-to-one and one-to-many character to glyph cases, 1-1 will follow; also warns we need to come up with processing model that can actually do something ...

LB suggests nothing has changed, we need to create WSD-NG that is expressible in XML.

CW suggests 3 different levels

put character into document, i.e., being able to define characters
define character semantics
define linguistic properties

⁹

It was suggested that we will need two passes: special TEI parse either 1st or 2nd; perhaps WSD as XSLT stylesheet first. DB states that we will need, at a minimum, a triplet of information: unicode_char_point+, glyphs+, unique_entity_name
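DB's minimal triplet can be pictured as a small record. The Python sketch below is illustrative only: the class name `GaijiEntry`, its field names, and the sample glyph identifiers are assumptions made here, not anything the WG agreed on.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GaijiEntry:
    """One hypothetical WSD-NG record holding the triplet DB describes."""
    entity_name: str               # unique_entity_name
    code_points: Tuple[int, ...]   # unicode_char_point+ (one or more)
    glyphs: Tuple[str, ...]        # glyphs+ (one or more glyph identifiers)

    def __post_init__(self):
        # The '+' in the minutes is read here as "at least one".
        if not self.code_points:
            raise ValueError("at least one Unicode code point is required")
        if not self.glyphs:
            raise ValueError("at least one glyph is required")

    def text(self) -> str:
        """Return the character sequence this entry stands for."""
        return "".join(chr(cp) for cp in self.code_points)


# Invented sample: U+017F LATIN SMALL LETTER LONG S with two
# hypothetical glyph identifiers (one character, several glyphs).
long_s = GaijiEntry("slong", (0x017F,), ("glyph-slong-a", "glyph-slong-b"))
```

Allowing several code points and several glyphs per entry is what lets one record cover both the one-to-many and the many-to-one cases DB mentions.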

DB presents defining ‘<!ENTITY lou "GOLOOKITUP-lou">’ solution.

LB presents problem: <c attr-that-points-to-table-entry="it">ǲ</c>. ¹⁰

DB suggests that TEI could insist that if you need this level of character control you can't use <sic> / <corr>.¹¹

If we're going to go that way (<c> with attribute), CW says we should look at SVG & MathML carefully.

MT asks about composed characters — CW points out you can have multiple characters as content of <c>, and that the pointer can point to composed characters.
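A minimal sketch of how a processor might follow the pointer on <c>, assuming a hypothetical attribute named `ref` and a hypothetical table of <gaiji> entries carrying `id` and `glyph` attributes. None of these names were fixed at the meeting, and the glyph identifier is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <c ref="..."> points at a <gaiji id="..."> entry.
# \u01F2 is the character used in LB's example above.
SAMPLE = """
<text>
  <table>
    <gaiji id="it" glyph="glyph-Dz-titlecase"/>
  </table>
  <p>A <c ref="it">\u01F2</c> in running text.</p>
</text>
"""


def resolve_c_elements(xml_text):
    """Return (content, glyph) pairs for every <c> whose pointer resolves."""
    root = ET.fromstring(xml_text)
    # Build the lookup table from the <gaiji> entries first.
    table = {g.get("id"): g.get("glyph") for g in root.iter("gaiji")}
    pairs = []
    for c in root.iter("c"):
        glyph = table.get(c.get("ref"))
        if glyph is not None:
            # The content may be one character or several (composed sequences).
            pairs.append((c.text or "", glyph))
    return pairs


resolved = resolve_c_elements(SAMPLE)
```

Because the content of <c> is ordinary text, the same lookup works whether it holds a single character or a composed sequence; what it cannot do is appear inside an attribute value.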

WSD addressed problem of single characters -> multiple glyphs, multiple characters -> single glyph by overloading entity references.

LB points out that NDATA declared entities remain unparsed.

We have four possible proposed models for solution:

use NDATA unparsed entities (if it works — looks like no)
<c> element with pointer attribute to point to WSD data (problem: need to create non-attribute versions of <sic> / <corr>)
TEI-markup that flags character
generating entity tables from WSDs: more complicated parsing, but works in attribute values

Could use a driver file to combine them. Won't work with many XML processors that read whole document first.

Considered model 3 above in more detail, including using PI before DOCTYPE declaration.

Need to make sure that it is implementable.

DA: Unicode will be using variation selectors as a fall-back position.

Do we want to have a mechanism that duplicates creating new Unicode character or variation selector? DB says yes, but because of fringe cases not because the Unicode Consortium would be nasty about it.

CW thinks definitional and semantic properties should be on different levels.

SB: Extra elements we need to create to get around <sic> / <corr> and the WSD-like <teiHeader> elements could be an additional tagset that most TEI users won't need.

solutions in XML

DB presents his work of last night: proof-of-concept implementations of each of our 4 solutions in XSLT.

DB's numbering scheme is different than the one used above, so is reproduced here:

possible solutions

Solutions that use no elements (and thus can be used in attribute values):
- TEI-specific entity-like references
- Using PUA code point as an index into WSD-NG. ¹²
Solutions that require elements:
- The NDATA solution: declare the WSD as an unparsed NDATA entity
- The ‘replacement-in-the-instance’ solution: use the <c> element
- The ‘table-in-the-header’ solution.

Discussion of inability to put icky characters in attribute values. What about n or rend? Some feel it is quite reasonable to want icky characters there.

Multiple solutions? Requires that document exchangers be able to process each solution.

PUA? Have a PUA value for each <gaiji>¹³; processor has to read WSD first, of course.
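The two-pass idea behind the PUA approach can be sketched as follows. This is an illustration under stated assumptions: the table contents and the `gaiji-001` style identifiers are invented, and pass 1 (reading the WSD) is replaced by a hard-coded dictionary.

```python
import unicodedata

# Hypothetical WSD-NG content keyed by PUA code point. Pass 1 would read
# this from the WSD; it is hard-coded here for illustration.
WSD_TABLE = {
    0xE000: "gaiji-001",
    0xE001: "gaiji-002",
}


def annotate_pua(text, table):
    """Pass 2: pair each character with its WSD entry, or None.

    Raises KeyError for a private-use character the WSD does not declare.
    """
    result = []
    for ch in text:
        if unicodedata.category(ch) == "Co":   # "Co" = private use
            result.append((ch, table[ord(ch)]))
        else:
            result.append((ch, None))
    return result


annotated = annotate_pua("a\ue000b", WSD_TABLE)
```

Since a PUA character is just a character to the XML parser, this works in attribute values as well as in content, at the cost of requiring every recipient to have the table.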

things to do

DB to write CE W 02, XSLT proof-of-concept of possible solutions.

CW to write CE W 03, use cases.

Topics still to consider: xml:lang and declaration & description of languages.

DA to find out how others use PUA characters.

MT points out that Japanese dictionary group had used PUA approach but this year abandoned that approach and now just puts an image in. But LB points out this does not invalidate it for our purposes.

MB wants us to define the ‘base’ (I think he means non-WSD) method of general person encoding a non-Unicode character.

MB points out that xslt2 will provide writing and node-set conversion, either of which would enable two-pass processing.

LB sees four possible ways to handle the basic problem (of not having SDATA or SUBDOC):

Provide element alternatives to attributes (thus allowing for element-required solutions).
Tell users they can't have special characters in attributes.
Can do it in attributes, but have to parse it yourself, either by using the special TEI-markup entity-like reference (‘HI-LOU’ in DB's examples) or by using standard markup escape (&lt;).
Do nothing. So there!
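The third option, "parse it yourself", might look roughly like this. The `{{name}}` delimiter syntax is purely hypothetical (the minutes do not record the form DB's entity-like references took), and the lookup table is an invented stand-in for one generated from the WSD.

```python
import re

# Hypothetical delimiter syntax; not the form used in DB's examples,
# which the minutes do not record.
REFERENCE = re.compile(r"\{\{([A-Za-z][A-Za-z0-9.-]*)\}\}")

# Hypothetical table that would be generated from the WSD.
TABLE = {"slong": "\u017f"}


def expand_references(value, table):
    """Replace each {{name}} in an attribute value; leave unknown names as-is."""
    def substitute(match):
        return table.get(match.group(1), match.group(0))
    return REFERENCE.sub(substitute, value)


expanded = expand_references("Wa{{slong}}er {{unknown}}", TABLE)
```

An XML parser passes such a string through untouched, which is why it is legal inside attribute values; the price is that every application handling the document has to run this extra step itself.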

Issues related to the declaration/description etc. of the language(s) used in the document.

Other issues

LB raised question of one project trying to accept documents from two different other projects that use their own entity names and PUA code points — overlap (using same name or code point for same character) could occur with either or both, collisions (using same name or code point for different character) could occur with either or both. Thus recipient would have to exhaustively compare each value to all other values. Non ideal. Also have to be able to differentiate PUA from normal code points.
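The comparison LB describes can be sketched as below, using the definitions of overlap and collision given above. The function names and the sample tables are invented for illustration; the private-use ranges are those defined by Unicode.

```python
def compare_tables(a, b):
    """Compare two projects' {key: character description} tables.

    Keys may be entity names or PUA code points. Returns (overlaps, collisions):
    overlaps   - keys both projects use for the same character
    collisions - keys both projects use for different characters
    """
    shared = set(a) & set(b)
    overlaps = sorted(k for k in shared if a[k] == b[k])
    collisions = sorted(k for k in shared if a[k] != b[k])
    return overlaps, collisions


def is_pua(code_point):
    """True if the code point lies in one of Unicode's private-use ranges."""
    return (0xE000 <= code_point <= 0xF8FF
            or 0xF0000 <= code_point <= 0xFFFFD
            or 0x100000 <= code_point <= 0x10FFFD)


# Invented sample data keyed by PUA code point.
project_a = {0xE000: "long s with stroke", 0xE001: "insular g"}
project_b = {0xE000: "long s with stroke", 0xE001: "r rotunda",
             0xE002: "insular g"}

overlaps, collisions = compare_tables(project_a, project_b)
```

Note what this does not catch: the two projects' ‘insular g’ sits at different code points, and spotting that still needs the exhaustive value-by-value comparison the minutes complain about.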

Test of markup strategies

Future planning

EO (PD as fall-back) to write CE W 04, problem description of 4.2 — in P3 lang does everything (both natural language & writing system); but we also have xml:lang. Berkeley meeting resulted in decision to divide them. How you indicate what language, how you define it, and how you uncouple it from the WSD.

CW to write CE W 05, content of WSD and its <gaiji>.

Drafts due Mon 2002-08-26.

LB to get web posting easier before 2002-08-26.

Closing remarks, thanks to the hosts.

CW points out, with complete consensus, that meeting has run very smoothly despite no local organizers being on the work group. Good job U of Tübingers!

Notes

1. Later CW informed group we needed to be out by 18:00 or risk being locked in.

2. I believe he meant online HTML, but perhaps PDF as well.

3. I don't have exactly what LB said, but the upshot is that while doing a diff on the infoset and displaying updated text differently is not at all an unreasonable thing to do in general, it is likely to be something we can't actually do due to time & budget constraints. We will certainly publish the errata list, though. And on further consideration I think it might be reasonable to do something with the errata along the lines of what the annotated XML specification does.

4. Not sure whether he had meant glyph variants, character variants, or both, but nor do I think it matters at this point.

5. At this point MB chimed in with a brief monologue which I was not fast enough to record. MB?

6. This paragraph is out of chronological order, but I'm not sure where it belongs.

7. Deciding which topic goes where, I guess.

8. Either DB never got to second item, or I never wrote it down.

9. I have the text ‘Is entity resolution that a problem SB/MB’ and ‘ask DB ...’ in my notes, but embarrassingly have to admit I have no idea what they mean.

10. I believe the problem referred to here is that if we use the <c> element it can't go in an attribute value, e.g., the sic of <corr> or the corr of <sic>.

11. Interesting: the phrase ‘sic corr’ as most of us use it is deliberately ambiguous, meaning both ‘<gi>sic</gi> <gi>corr</gi>’ and ‘<gi>sic</gi> <ident type="attrName">corr</ident>’. Hmmm...

12. This solution was not actually addressed at this point chronologically, but was mentioned at some point, and I think may deserve some more serious attention, although, per DB's notes ‘This is a two-pass solution and requires promiscuous use of PUA.’

13. The work group has chosen to use the term ‘gaiji’, a Japanese word referring to a character not in the typesetter's type drawer (i.e., font), as the GI for the element used to describe a character in whatever solution is used at least as we discuss it now, if not for real in P5.

Last recorded change to this page: 2007-09-16 • For corrections or updates, contact webmaster AT tei-c DOT org