3 Entities

The word ``entity'' is used in SGML rather differently from its use elsewhere, notably in database design methodology. An SGML entity is simply a named bit of text, considered entirely independently of any structural or categorial classification it might have. A document may be an SGML entity, as may any arbitrary sequence of characters within it, or any symbol it contains. The definition of an entity associates a name with a particular string of bytes, which may be the representation of some characters in a particular computer encoding or held in a system-defined container of some kind (such as a file). Within an SGML document, entities are represented by reference, using the defined name. This mechanism has a number of important uses, specified further below, primarily in making it possible to encode textual features such as special characters or symbols which are unique to a particular environment or application in a way that is independent of that particular environment.

Everyone who has communicated by electronic mail, or even tried to move a file from a Macintosh computer to a PC, knows that some of the symbols of which texts are composed are less portable than others. With the best will in the world, computer manufacturers and standards bodies alike will never be able to represent all the possible symbols occurring in written texts in a single universally agreed code set, simply because these symbols do not form a closed set: the task is as hopeless as that of enumerating all the words in a language. Moreover, it is a fact of life that different computing environments adopt different methods of representing the same symbols, disregard entirely the existence of some and insist on distinguishing others.

A notorious consequence of this state of affairs is that some letters which appear perfectly normal when typed at a keyboard in Odense, Paris, or Tübingen, will either be mysteriously transmuted into a percent sign, or lost completely when transmitted over the network to the UK. How then are words to be stored in a computer file in such a way as to ensure that the file can be satisfactorily processed by any computer, not just those which have the decency to be aware of the Danish, French, or German national character sets?

Exactly the same problem arises, in a more acute form, when considering the range of symbols likely to be required in transcribing manuscript texts or spoken language. There is no computer character set in which the long form of s is distinguished from the short, still less one which distinguishes ligatured forms of the same letters, or represents all scribal abbreviations, astrological symbols, non-vocalic grunts, pauses etc. Nor, UNICODE notwithstanding, do I think it likely that there ever will be.

The SGML solution is to encode characters not available in the particular character set used for document transmission [See note 5] by means of entity reference. If Hans Jørgen is represented as Hans J&oslash;rgen, I can associate the unlovely acronym oslash with whatever particular stream of bytes is necessary on my computer to produce the slashed-o in the Danish national set. [See note 6]
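As a rough illustration of the substitution just described, the following Python sketch replaces references of the form &name; with whatever the local system defines for that name. It is not an SGML parser: the table LOCAL_ENTITIES and the function expand_references are hypothetical names, the replacement values are assumptions standing in for whatever stream of bytes a given system actually needs, and for simplicity the sketch only recognises references that end in a semicolon.

```python
import re

# Hypothetical per-system table: entity name -> local replacement.
# Another system could map the same names to entirely different values.
LOCAL_ENTITIES = {
    "oslash": "\u00f8",  # slashed o
    "aelig": "\u00e6",   # ae ligature
}

# Simplified reference syntax: '&', a name, then ';'.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_references(text, entities=LOCAL_ENTITIES):
    """Replace each known &name; with its local value; leave unknown ones untouched."""
    def substitute(match):
        name = match.group(1)
        return entities.get(name, match.group(0))
    return REFERENCE.sub(substitute, text)


print(expand_references("Hans J&oslash;rgen"))
```

Only the table changes from one environment to another; the transmitted text, which contains nothing but the name oslash and its delimiters, stays the same.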

Some have objected to the apparent verbosity of such mnemonics, by comparison with the variety of encoding tricks or ad hoc solutions customarily resorted to. The advantage of the entity reference solution is simply that it forms part of a single and consistent convention, comprehensible without resort to special-purpose documentation (which is generally absent). Sets of standard mnemonics for all the accented letters and special symbols of modern European languages are to be found in ISO 8879 and elsewhere.

The same mechanism can be applied more widely for any stream of bytes to be included within a document. The use of a single short abbreviation for a much repeated or particularly complex phrase is a simple way of ensuring consistency and reducing effort; it is worth noting in this context that the value of an entity can include other mark-up, such as tags or other entity references, provided that any element opened within a given entity is also closed within it. This method has been adopted, for example, by the TEI committee responsible for defining linguistic annotation. Another use is for identifying objects which cannot be directly represented in a text, for example non-textual entities like graphics or formulae. More mundane uses are not difficult to identify.
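The two points made above (entity values may themselves contain tags and further references, and any element opened inside an entity must be closed inside it) can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming SGML implementation: the entity names and values are invented, the functions is_balanced and expand are hypothetical, and the balance check assumes fully tagged text with no omitted end-tags.

```python
import re

REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

# Hypothetical entity table; names and values are invented for illustration.
ENTITIES = {
    "tei": "<abbr>TEI</abbr>",
    "teifull": "the Text Encoding Initiative (&tei;)",
    "broken": "<emph>never closed",
}


def is_balanced(value):
    """True if every element opened in value is also closed in it, in order."""
    open_elements = []
    for closing, name in TAG.findall(value):
        if not closing:
            open_elements.append(name)
        elif not open_elements or open_elements.pop() != name:
            return False
    return not open_elements


def expand(text, entities=ENTITIES, active=()):
    """Expand references recursively, rejecting unbalanced or self-referring entities."""
    def substitute(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # unknown reference: leave as written
        if name in active:
            raise ValueError(f"entity {name!r} refers to itself")
        value = entities[name]
        if not is_balanced(value):
            raise ValueError(f"entity {name!r} opens an element it does not close")
        return expand(value, entities, active + (name,))
    return REFERENCE.sub(substitute, text)


print(expand("Guidelines from &teifull;."))
```

A reference to the entity named broken raises an error in this sketch, because its value opens an element that it never closes.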

It should be stressed that entities have no structural properties: they are simply shortcuts enabling an SGML-aware processor to substitute a system-defined string of bytes for a name identified as such by special SGML delimiters. As such they are merely a special (if well thought out) way of doing the kinds of things which transcribers and encoders of text have already been doing for many years.

