CEW07: Private use characters in XML


Abstract

This note discusses how the XML architecture can be modified to fix problems that prevent privately defined characters from being exchangeable. Modifications to the XML character set and to the parsing process are suggested. With minimal modification to existing software, these changes will provide XML users with functionality similar to what SDATA entities were used for in SGML, and will greatly increase the interoperability of XML documents containing privately defined characters.

Background

It is useful to first discuss the role that SDATA entities (specific character data entities, cf. Ch. Goldfarb, The SGML Handbook, p. 146.7) play in SGML and how they are used.

The main purpose of SDATA in SGML is to isolate system specific data entities from the more general non-system specific data entities. The idea was that only a small subset of entities would need attention during a transition of data from one data processing environment to another. Although this is a useful concept, it has not been retained in the transition to XML.

Usage of SDATA entities in SGML

In practice, SDATA entities have been most widely used in the SGML world to overcome limitations and incompatibilities of coded character sets, a facility that is painfully missing in XML. The SGML Handbook contains a long list of public entity declarations, in which SDATA entities are defined in this way:
<!ENTITY aacute SDATA "[aacute]" --=small a, acute accent -->
By itself this achieves little more than returning the name of the entity, enclosed in square brackets. However, the fact that the text came from an SDATA entity, and is thus likely to require additional treatment by the receiving application, is included in the information passed from the parser to the application, for example in the SGML Element Structure Information Set (ESIS, see Goldfarb, p. 588ff.). SGML applications would then usually use a system-specific mapping from the declarative names to a coded character set or some other representation. In contrast, if a CDATA entity had been declared instead of an SDATA entity, the application would receive no indication that the replacement text originated from an entity rather than being character data directly in the document content, and there would be no way to trigger any special treatment.
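The system-specific mapping described above can be sketched as follows. This is only an illustration in Python and not part of any SGML toolkit: the table `SDATA_TO_LOCAL` and the function `resolve_sdata` are hypothetical names, and the table holds just two sample entries.

```python
# Hypothetical system-specific table: SDATA replacement text -> local representation.
# A real system would carry one entry per entity in the published entity sets.
SDATA_TO_LOCAL = {
    "[aacute]": "\u00e1",  # small a, acute accent
    "[eacute]": "\u00e9",  # small e, acute accent
}


def resolve_sdata(replacement_text, table=SDATA_TO_LOCAL):
    """Return the local representation for an SDATA replacement text.

    Unknown names are returned unchanged, so the caller can see that
    this system has no mapping for them.
    """
    return table.get(replacement_text, replacement_text)


print(resolve_sdata("[aacute]"))
print(resolve_sdata("[unknown]"))
```

The point of the sketch is that the lookup can only happen because the parser reports the text as SDATA; with a CDATA entity the application would have no reason to consult the table.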

It should be noted, by the way, that this specific notation for SDATA expansion, although very often encountered, is in no way required or implied by SGML. In fact, SGML makes no assumptions about what kind of processing occurs on SDATA entities or how applications are supposed to find out about it. SDATA entities are application specific and cannot be interchanged meaningfully without either prior agreement or manual intervention. SDATA in SGML did not specify a complete mechanism for dealing with characters outside of a document's character set. That SDATA "somehow worked" in SGML is only because most applications looked to the published entity sets for guidance.

Does XML need SDATA or similar processing?

The working group charged with specifying the XML recommendation discussed the issue of SDATA in XML extensively in October and December 1996 (the archives of these discussions are publicly accessible at http://lists.w3.org/Archives/Public/w3c-sgml-wg/1996Oct and /1996Dec). While the problem that SDATA tried to solve was clearly recognized, it was felt that the mechanism was underspecified and could not immediately be ported to XML, so it was left on the agenda for future versions of XML to deal with. It was also felt that, with Unicode as the document character set, which had even set aside a code space for privately defined characters, there was enough in place to make up for this lost flexibility.

While the SDATA entities as specified in the SGML standard ISO 8879 do not suffice to cover the problem at hand completely, they do provide the application with:
  • A notification that some system-specific (special) entity is to be received
  • A label for that entity
This may not seem like much, but it gives a hook to any application that wants to deal with such entities, a hook that is no longer available in XML.
With the current XML infrastructure, where Unicode provides the character encoding layer and XML the markup layer, the closest we can get to a similar mechanism is to
  • Use characters from the private use area (PUA) for characters not in Unicode
  • Scan the stream of document characters for codepoints in the PUA and trigger some kind of processing from there.
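The scanning step in the second item can be sketched in a few lines of Python. The three ranges are the private use ranges defined by Unicode/ISO 10646; the names `PUA_RANGES`, `is_pua` and `find_pua` are chosen here for illustration only.

```python
# Private use ranges defined by Unicode / ISO 10646.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # private use area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # supplementary private use area A (plane 15)
    (0x100000, 0x10FFFD),  # supplementary private use area B (plane 16)
)


def is_pua(ch):
    """True if the single character ch is a private use codepoint."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_pua(text):
    """Return (offset, codepoint) pairs for every PUA character in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_pua(ch)]


for offset, cp in find_pua("a\ue000b\U000f0001"):
    print(f"offset {offset}: U+{cp:04X}")
```

Note that such a scan only locates the characters; it says nothing about what they mean, which is the gap discussed next.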

While the PUA characters could thus be seen as providing a similar facility, some problems remain. There is no way to tell an application what to do with a PUA character. Nor does this approach immediately allow meaningful labels to be associated with unknown characters, as was the case with SDATA entities. Another problem is that it imposes a significant overhead on the postprocessing of the data stream, which occurs completely outside the realm of XML processing, in an application layer beyond the markup. Most importantly, however, this solution does not easily allow for document interchange, where potentially conflicting assignments to PUA characters are bound to arise.

Assigned vs. private characters

Before proposing a solution to this problem, a fundamental flaw in the present XML recommendation should be pointed out. Unicode/ISO 10646 makes a strict distinction between, on the one hand, codepoints that have been assigned characters by the standards body (and with them a number of other properties that enable applications to handle these characters properly) and, on the other hand, codepoints that are reserved for private use, that is, for assignment by software vendors, text encoders, or others, where a universal assignment is not possible or not desired. XML does not in any way adopt that distinction! Although XML does have a concept of different character classes (name characters, etc.), as far as character data are concerned, there is no distinction at all between characters with properties assigned in a global scope and privately assigned characters.

The purpose of this note is to suggest a way to correct this fundamental flaw and at the same time introduce a general mechanism for handling PUA characters.

Suggested solution

A first step towards a solution for this problem could be to treat characters from the PUA differently from other character data. Instead of simply passing a PUA character along in the same way as universal characters, the parser would treat PUA characters as if they were a "special entity" and look for a declaration that matches the codepoint encountered. While the specifics would need some consideration, such a treatment seems appropriate both on an architectural level, since it reflects the fact that these two character classes have a different scope, and in a very practical sense, since it overcomes one of the few problems for data interchange that XML did not sufficiently solve. It would also be a considerable enhancement in usability, since PUA characters could be used without constraints in a local context while interoperability is retained. Furthermore, it will warn users who might inadvertently have inserted PUA characters into a document, unaware that their system has some assignments of PUA characters built in. 1

Now what will be in the declaration for a PUA codepoint? In the simplest case, it will be just the codepoint itself. In this case there will be no real difference between the parsed document tree and what it would look like currently (leaving aside, for the purpose of this discussion, that some additional nodes would still exist in the prolog). In fact, a statement in the document prolog or internal subset could trigger a mode of processing in which every PUA character has the declared value of itself, thus retaining some degree of backwards compatibility. This should, however, not be the default treatment of PUA characters.
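The lookup behaviour discussed so far can be sketched in Python as follows. Several points are assumptions made for the sake of the example and are not fixed by this note: an undeclared PUA character is reported as an error (`UndeclaredPUAError`), the backwards-compatible mode is modelled as an `identity` flag, and only the private use area of the Basic Multilingual Plane is checked to keep the sketch short.

```python
class UndeclaredPUAError(ValueError):
    """Raised for a PUA character without a declaration (assumed behaviour)."""


def is_bmp_pua(ch):
    """True for codepoints in the BMP private use area, U+E000..U+F8FF."""
    return 0xE000 <= ord(ch) <= 0xF8FF


def resolve_pua(text, declared, identity=False):
    """Replace each PUA character by its declared value.

    declared maps a PUA codepoint (int) to its replacement string.
    identity=True models the opt-in statement under which every
    undeclared PUA character has the declared value of itself.
    """
    out = []
    for ch in text:
        if not is_bmp_pua(ch):
            out.append(ch)                 # universal character: pass along
        elif ord(ch) in declared:
            out.append(declared[ord(ch)])  # declared PUA character
        elif identity:
            out.append(ch)                 # backwards-compatible mode
        else:
            raise UndeclaredPUAError(f"no declaration for U+{ord(ch):04X}")
    return "".join(out)


print(ascii(resolve_pua("x\ue000", {0xE000: "\uf877"})))
print(ascii(resolve_pua("x\ue001", {}, identity=True)))
```

Making the strict mode the default mirrors the recommendation above that identity mapping should not be the default treatment.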

In most cases, a receiving application will want to examine the PUA characters in an incoming document in order to treat them appropriately. While the issue of how to identify unencoded characters is thorny and outside the scope of this document, it would suffice in this context to provide a simple hook for such a treatment by supplying an identifying string for the character. A declaration for a PUA character will thus have an obligatory system-specific declared value (which in many cases will be another PUA codepoint, but not necessarily the same one) and, in addition, a public value. Obvious candidates for such a public identifier for characters would be character names constructed in the same way as the character names in the universal character set, Ideographic Description Sequences for Han characters, or URIs used in a way similar to XML namespaces. In order to make the specification as flexible as possible without sacrificing interoperability, using the NDATA mechanism to specify which type is used might be the best solution. 2

The values of these declarations will need to be available to the receiving application, while it might not be necessary to record the PUA value from the original document, since the parsed document will contain the new value.

Here are some examples:
  • PUA codepoint mapped to itself, with character name
    <!ENTITY #xe000 SYSTEM "&#xe000;" PUBLIC "LATIN CAPITAL LETTER A WITH TWO MACRON" NDATA CHARNAME>
  • PUA codepoint remapped to a different PUA codepoint, with URL
    <!ENTITY #xe000 SYSTEM "&#xf877;" PUBLIC "http://funnychars.com/a-with-two-macrons" NDATA CHARURL>
  • PUA codepoint remapped and annotated with an Ideographic Description Sequence
    <!ENTITY #xe000 SYSTEM "&#xe086;" PUBLIC "&#x2ff0;&#x6728;&#x72ac;" NDATA CHARIDS>
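To make the roles of the three values concrete, here is a minimal Python sketch of how a receiving application might hold such declarations and apply them. The class `PUADeclaration` and the function `apply_declarations` are hypothetical names; no existing parser API is implied, and the declaration syntax itself is only the proposal made in this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PUADeclaration:
    codepoint: int     # PUA codepoint as used in the incoming document
    system_value: str  # system-specific replacement (the SYSTEM value)
    public_id: str     # public identifier (the PUBLIC value)
    notation: str      # type of the public identifier (the NDATA value)


def apply_declarations(text, declarations):
    """Replace declared PUA characters and report what was replaced.

    Returns the rewritten text and a list of (offset, declaration)
    pairs that a receiving application could inspect.
    """
    by_codepoint = {d.codepoint: d for d in declarations}
    out, hits = [], []
    for offset, ch in enumerate(text):
        decl = by_codepoint.get(ord(ch))
        if decl is None:
            out.append(ch)
        else:
            out.append(decl.system_value)
            hits.append((offset, decl))
    return "".join(out), hits


# Second example from the list above: U+E000 remapped to U+F877,
# identified by a URL.
decl = PUADeclaration(
    0xE000, "\uf877", "http://funnychars.com/a-with-two-macrons", "CHARURL"
)
new_text, hits = apply_declarations("m\ue000n", [decl])
for offset, d in hits:
    print(offset, d.notation, d.public_id)
```

The parsed text carries only the new system value, while the public identifier and its type travel alongside it for the application to use.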

Processing chains

In processing chains that parse repeatedly (as opposed to working directly on the document tree), the declarations must be passed along to the next parsing process. Specifications such as XSL might need a way to write out declarations for PUA characters and to write statements governing the treatment of PUA characters to the internal subset of the document.
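Writing a declaration back out for the next parsing step could look like the following Python sketch. It emits the declaration syntax proposed in this note, which no existing XML parser accepts; the function names are hypothetical, and the sketch assumes that the public identifier contains no double quote characters.

```python
def char_refs(s):
    """Encode every character of s as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in s)


def write_declaration(codepoint, system_value, public_id, notation):
    """Serialize one PUA declaration in the syntax proposed in this note.

    Assumption: public_id contains no double quotes, so no escaping is done.
    """
    return (
        f'<!ENTITY #x{codepoint:x} SYSTEM "{char_refs(system_value)}" '
        f'PUBLIC "{public_id}" NDATA {notation}>'
    )


print(write_declaration(
    0xE000, "\uf877", "http://funnychars.com/a-with-two-macrons", "CHARURL"
))
```

A serializer in a processing chain would emit one such line per declaration into the internal subset before writing the document content.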

Conclusion

The mechanism proposed here will provide the means to solve the long-standing problem of the lack of portability of privately defined characters with only minimal changes to the XML architecture. This would bring significant gains for large user communities, especially in East Asia, where the need to exchange non-standard characters arises frequently, with very little sacrifice to existing solutions.

Notes
1.
Examples could be the Apple logo, or characters defined in the Hong Kong Special Administrative Region.
2.
The specification could however predefine NDATA values for the most common cases.

Last recorded change to this page: 2007-09-16  •  For corrections or updates, contact webmaster AT tei-c DOT org