2 Markup and Markup Languages

The word markup was originally used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples, familiar to proofreaders and others, include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the production of texts was automated, the term was extended to cover all sorts of special ``markup codes'' inserted into electronic texts to govern formatting, printing, or other processing.

Generalizing from that sene, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text. At a banal level, all printed texts are encoded in this sense: punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might all be regarded as a kind of markup, the function of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings, and syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is in principle, like transcribing a manuscript from scriptio continua, a process of making explicit what is conjectural or implicit. It is a process of directing the user as to how the content of the text should be interpreted.

A ``markup language'', may be no more than a loose set of markup conventions used together for encoding texts. A markup language must specify what markup is allowed and whereabouts, what markup is required, how markup is to be distinguished from text, and what the markup means. As noted above, SGML provides the means for doing the first three of these only; it allows you to describe a markup language independently of what the markup is intended to do. To understand and act upon the markup, additional semantic information is needed, which will differ in different situations. Documentation like that enshrined in the TEI's Guidelines provides such information. In just the same way as one may be able to parse the syntactic structure of a Latin unseen without having the least idea what it is about, so an SGML-aware processor can analyze the structure of an SGML-encoded document with no sense of its meaning. This independence is necessary, given the open-ended nature of electronic textual applications. It does not, of course, imply that the intentions of the encoder of a text are unimportant or vacuous; only that they are formally distinct from the encoding itself.

Three basic concepts are fundamental to an understanding of all markup languages, when described in SGML terms. These are the notions of a markup entity, a markup element, with its associated attributes, and a document type. At the most primitive level, texts are composed simply of streams of symbols (characters or bytes of data, marks on a page, graphics, etc.): these are known as entities in SGML. At a higher level of abstraction, a text is composed of representations of objects of various kinds, linguistically or functionally defined. Such objects do not appear randomly within a text: coherence demands that particular types of object appear in specifiable relationship to other objects -- they may be included within each other, linked to each other by reference or simply presented sequentially, for example. This level of description sees texts as composed of structurally defined objects, known as elements in SGML. The grammar defining how elements may legally be combined in a particular class of texts is known as a document type. (This view of the nature of text has been nicely defined by De Rose et al [See note 4] as an ``ordered hierarchy of content objects''.) These three fundamental concepts together are, it is claimed, adequate to describe all the complexities of marked-up texts, of whatever kind and for whatever purposes. Each is discussed in turn in the next three sections


Back to table of contents
On to next section
Back to previous section