Character entities and public entity sets
<report.number> TEI TR1 W4
<author>Harry E. Gaylord
<date>January 9, 1992
<version>1.0
<status>Public
<front>
<acknowledgements>
I want to acknowledge here the help of several people. John Esling and Alexandra Gaylord have helped very considerably in organising the IPA material. Elli Mylonas has advised on the Greek material. Alaa Eddine M. Ghoneim of IBM in Cairo has drawn up the inventory for the Arabic entity set.
</front>
<body>
<div0><head>Character entities
General entities in SGML are used for several purposes. Perhaps the most often cited one is &SGML;. It is defined as an entity in the following manner, either in the DTD or at the opening of the document that uses it:
<xmp>
<!ENTITY SGML "Standard Generalized Markup Language">
</xmp>
When it is referenced in a document instance with &SGML;, the entity is replaced by the whole phrase at that point in the text. This simply saves the drudgery of typing such long phrases repeatedly.

Character entities are a specific subset of the general class of entities. They are used especially where character codes are not available or are unsafe to use. Their advantage is that they are safe from corruption, yet they can be processed as if the characters themselves were there in the text.

Let's take an example of a problem which is often complained of in e-mail communication. I am sending a text discussing medieval French currency. I declare the following entity: <!ENTITY eacute "eacute">. Now in the paper I discuss the &eacute;cu (not to be confused with the modern theoretical European thing). This string will be able to pass back and forth through all sorts of ASCII to EBCDIC to ASCII translations on the network without corruption. That would not have happened if I had used the code for the first letter from my local computer's character set. The string can be received correctly by someone on an IBM mainframe using any brand of EBCDIC.
That person would receive the text intact and could either replace &eacute; with the appropriate character code for that machine or use it as is.

With SGML conformant software we can achieve much more with these character entities. If I make use of the public entity for this character, which is <!ENTITY eacute SDATA "[eacute ]" --=e with acute accent-->, and use &eacute; in my text, I can declare an alternative entity realization in my local character set. If my machine is an IBM PC using Code Page 437 or 850, I could declare <!ENTITY eacute SDATA "é">, where the literal holds the single code for this letter in those code pages (hex 82). If, on the other hand, I were using a machine that had the ISO 8859-1 character set, I would place in my DTD for local processing <!ENTITY eacute SDATA "é">, where the literal holds the ISO 8859-1 code (hex E9). Using the same file with SGML software, the &eacute;cu in the file would appear on both screens as the three letter word it is. No one would have to perform a replacement change in the file.

This can be extended one step further for a print formatter. The entity could be transformed into a processing instruction for the printer, but this is rather crude and not recommended. If the formatter can handle AFII registered glyph information, this could be placed in the quoted replacement text. AFII is an organisation which is setting up a registry of glyphs together with ISO. In the entity declaration the replacement text is labelled as SDATA (specific character data) because, as in the last two examples, this replacement is dependent on a particular system.
<div0><head>Entity names
The rules for entity names are that they contain only characters from the set within these brackets: [A-Za-z0-9.-], i.e. upper and lower case letters, digits, period, and hyphen. The entity reference opener delimiter is the ampersand; the normal closer delimiter is the semicolon. The first character in the data stream which is not a name character will also be taken as closing the reference. Therefore &eacute and &eacute, are valid entity references, followed by space and comma respectively.
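A minimal sketch of these closing rules follows. The declaration is the public one quoted earlier; the sample words are only illustrative assumptions.
<xmp>
<!-- in the DTD -->
<!ENTITY eacute SDATA "[eacute ]" --=e with acute accent-->
<!-- in the document instance -->
<!-- semicolon required: the next character, "c", is a name character -->
the &eacute;cu
<!-- semicolon omitted: a comma, then a space, end the name -->
caf&eacute, lyc&eacute and mus&eacute;e
</xmp>
Without the semicolon in the first case, the parser would read on and look for an entity named eacutecu. In the second case the comma and the space should not be consumed by the reference; they remain part of the text.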
In the Reference Concrete Syntax entity names are restricted to a maximum length of 8 characters.

ISO 8879:1986 includes a series of character entities (see below, Appendix 1). These are mostly taken from one character set (ISO 6937) and from those of the American Mathematical Society. There are also entity sets for Greek and Cyrillic. The convention in these is simply that each name is a string of not more than 6 characters, the last two of which ("gr" or "cy") denote the alphabet. Furthermore, capital and small letters are distinguished by the case of the characters denoting the letter, e.g. "OHgr" is capital omega, "ohgr" is small omega.

In drawing up an entity set for classical Greek here, I have had to add some more conventions. The most central is that if the entity is a letter with a single accent, the accent is indicated by a three letter string, e.g. &arougr; is Greek alpha with rough breathing mark. If the entity is a letter with multiple accents, one character denotes each accent, e.g. &arcigr; is Greek alpha with rough breathing, circumflex, and iota subscript. By doing this I have been able to remain within the 8 character limit for entity names. With some alphabets this may not be possible. It is also important to standardize the order of accents. In the Greek I have tried to go from beginning to end and from top to bottom. The resulting order is: rough or smooth breathing; acute, grave, or circumflex; dieresis; iota subscript.

The IPA phonetic entities form an exceptional group. People working on representing speech have been drawing up character sets for their own use for a considerable time. Until recently there has been little consensus on which symbols should be used for which speech phenomena, except within the International Phonetic Association. The IPA has developed a clear policy on most of this over the years. At its convention in Kiel in 1989 the IPA chart was revised. This revision involved agreeing on a preferred unique symbol for each phonetic/articulatory category.
In some cases variant symbols had until then been used by IPA members. Many more alternative symbols, of course, have been and will be used by others outside the IPA. The workgroup on computer coding of IPA symbols then assigned a name, in fact a three digit number, to each symbol on the chart. The number indicates the symbol's position on the IPA chart. Consonants are numbered in the 100's and 200's, vowels in the 300's, diacritics (i.e. supplemental information on consonants and vowels) in the 400's, and suprasegmentals in the 500's. These numbers have been used in drawing up our entity set.

The primary purpose of this entity set is interchange. By restricting it to the symbols on the chart alone, we avoid ambiguity even though different users may use alternative symbols in their local systems. In effect this gives the maximum freedom to each individual or group working in phonetics and the least ambiguity in transferring work among colleagues, who may use different conventions, or to publishers, who will have their own house styles for such things as pronunciation in dictionaries. The description of each entity includes its shape and its categorization in IPA usage.

The basic publications which form the basis of this entity set are "The IPA 1989 Kiel Convention Workgroup 9 report: Computer Coding of IPA Symbols and Computer Representation of Individual Languages", Journal of the International Phonetic Association 19 (1989), 81ff, and John Esling, "Computer Coding of the IPA: Supplementary Report", Journal of the International Phonetic Association 20 (1990), 22ff. In the latter article IPA numbers have also been assigned to older IPA symbols and to some non-IPA symbols which are not part of the current IPA system. These are identified by being numbered downward from 99 within the appropriate category, e.g. 499, 498, etc. For each of these symbols, a note in the phonetic wsd on TEI-L indicates which symbol it duplicates, if there is one.
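As an illustration of how the numbering carries over into entity names, the declarations below sketch two entries. The names IPA101 and IPA210 and their descriptions are taken from the writing system declaration samples later in this report; the bracketed SDATA strings are an assumption modelled on the ISO public sets, and the TEIipa file itself may differ.
<xmp>
<!-- 100's and 200's: consonants -->
<!ENTITY IPA101 SDATA "[IPA101]" --Lower-case P: voiceless bilabial plosive-->
<!ENTITY IPA210 SDATA "[IPA210]" --Cursive G: voiced velar plosive; for interchange use IPA110-->
</xmp>
A local system could then redeclare each entity with whatever code or glyph it uses for that symbol, in the same way as for eacute above.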
Geoffrey Pullum and William Ladusaw, Phonetic Symbol Guide, University of Chicago Press, 1986, is a useful book to consult about the meanings of IPA and non-IPA symbols.
<div0><head>Public entity sets
The non-mathematical public entity sets from ISO 8879:1986 are included here in appendix 1. They are also available from listserv@uiucvm as the separate files
<list>
<item>ISOlat1 Entities
<item>ISOlat2 Entities
<item>ISOdia Entities
<item>ISOgrk1 Entities
<item>ISOgrk2 Entities
<item>ISOcyr1 Entities
<item>ISOcyr2 Entities
<item>ISOnum Entities
<item>ISOpub Entities
</list>
Version 1 of each of the first four TEI public entity sets is in appendix 2. These are Classical Greek (to be used with ISOgrk1 and ISOgrk2), Basic Arabic (excluding ligatures), Coptic, and IPA phonetic. Paper copies of these can be ordered from TEI <address>University of Illinois at Chicago, Computer Center (M/C 135), Box 6998, Chicago IL 60680, U.S.A.</address> They are also available separately from listserv@uiucvm with the names
<list>
<item>TEIarb Entities
<item>TEIcop Entities
<item>TEIgrk Entities
<item>TEIipa Entities
</list>
<div0><head>External Entities
External entities can be used for public and local text. They have to be declared and referenced. ISO public entities are declared in a special way.
<xmp>
<!ENTITY % ISOlat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
</xmp>
This declares a parameter entity whose text is contained in the ISO standard 8879:1986. It contains a collection of entities, which is called Added Latin 1. The text is in English. It would be referenced by %ISOlat1;.

TEI public text can be declared in a slightly different manner.
<xmp>
<!ENTITY % TEIgrk PUBLIC "-//Text Encoding Initiative//ENTITIES Added Classical Greek//EN">
</xmp>
The leading minus sign means that while the text is public, it has no formal registration. If it did, the identifier would begin with a plus sign. If a file containing these entities is used locally, one can alternatively declare it as a local (system) entity.
<xmp>
<!ENTITY % TEIgrk SYSTEM "TEIgrk.ent">
</xmp>
Here the entity text is kept in the file TEIgrk.ent. It would be referenced with %TEIgrk;.

The external entities discussed so far are parameter entities, indicated by the percent sign instead of the ampersand. Referencing them makes their contents, i.e. a series of entity declarations in these examples, available. If, on the other hand, an external file is to be included at a given point in the text, it is declared as a general entity and referenced at that point.
<xmp>
<!ENTITY chap1 SYSTEM "chap1.bk">
&chap1;
</xmp>
<div0><head>Future Public Entity Sets
There are a number of noticeable gaps in these entity sets, and the TEI TR1 workgroup is working together with ISO to draw up additional supplements. We need help on what to include and would appreciate comments from scholars in the relevant fields. We will issue the new sets for public use in TEI and submit them for inclusion in ISO TR 9573. This Technical Report is entitled "Information processing - SGML support facilities - Techniques for using SGML". A new edition is being prepared in 16 parts. Part 14 is Public entity sets for Latin based alphabets, Part 15 is Public entity sets for non-Latin alphabets, and Part 16 is Public entity sets for ideograms. Dr Anders Berglund of the ISO central secretariat is the editor of this technical report and we are co-operating with him in this work. As we assemble additional entity sets we can issue them through TEI, make them publicly available on TEI-L, and submit them to ISO for inclusion in TR 9573. This is also what the American Mathematical Society did with its MathSci entity sets.

Currently work is being done on entity sets for Hebrew, Early Slavic (Cyrillic and Glagolitic), and Devanagari, and preliminary work on Japanese and Chinese.
If you would like to contribute to this work, please contact Harry Gaylord <address>e-mail internet: galiard@let.rug.nl post: Alfa Informatica, Faculty of Arts, Groningen University, POB 716, Groningen, The Netherlands NL 9700 AS.</address>
<div0><head>Writing System Declarations (WSD's)
Writing system declarations are designed to describe the encoding method used for TEI documents. They give structural information on the encoding system used in the documents, and information in attributes for revising the encoding. For characters not simply used at character code level, they provide the code used, the relevant public entity (if available), and the provisional coded character position in ISO DIS 10646 (as a hexadecimal value, if available). In addition, a description of the character or symbol is given. Two samples are included in appendix 3. These are for Classical Greek and IPA phonetic symbols. The Classical Greek wsd contains the encoding in the TLG Beta system.
<xmp>
<grapheme code='A(' entity.name="arougr" ucscode='1F10'>
<d.name>GREEK SMALL LETTER ALPHA WITH ROUGH </d.name></grapheme>
</xmp>
The phonetic wsd contains all the IPA symbols and not just the ones recommended for interchange. No code value is indicated because, for the present, this would in all likelihood be set locally. An entry for one of the interchange subset looks like this:
<xmp>
<grapheme entity.name='IPA101' ucscode="0070">
<d.name>IPA Lower-case P <note>voiceless bilabial plosive</note></d.name></grapheme>
</xmp>
The entity name is IPA101; its position in ISO DIS 10646 is hex 0070. The IPA name for this symbol is Lower-case P. Its phonetic/articulatory category is placed in a note. If, on the other hand, a symbol is not one of the interchange set, the corresponding member of that set is indicated in the note.
<xmp>
<grapheme entity.name='IPA210' ucscode="0067">
<d.name>IPA Cursive G <note> voiced velar plosive For interchange use IPA110</note></d.name></grapheme>
</xmp>
A few of the symbols have not yet been included in ISO DIS 10646. There are also a few which do not have an interchange equivalent. These have never been used by the IPA. They are in the n90's range.

Additional symbols for disturbed speech are currently being reviewed by the IPA and will be added to the public entity set and the wsd when they are approved. These will be assigned numbers in the 6nn range.
</body>
<back>
<appendix>
&ISOlat1; &ISOlat2; &ISOdia; &ISOgrk1; &ISOgrk2; &ISOcyr1; &ISOcyr2; &ISOnum; &ISOpub;
<appendix>
&TEIarb; &TEIcop; &TEIgrk; &TEIipa;
<appendix>
&TEIWSDgreek; &TEIWSDphonetic;
</appendix>
</back>
</tei.1>
