1 What is SGML for?

The objectives of those who designed SGML were simple. Confronted with an increasing number of so-called ``markup languages'' for electronic texts, each more or less bound to a particular kind of processing or even to a particular software package, they sought to define a single language in which all such schemes could be re-expressed, so that the essential information represented by such texts could be transferred from one program or application to another. I begin therefore by giving a slightly more formal definition of what is meant by the term ``markup language''. A universal language necessarily presupposes some basic concepts or semantic primitives in which the notions of all other languages can be expressed: the semantic primitives of SGML are simple and few in number, and their definition forms the bulk of the rest of this paper. I begin however with a few remarks about what SGML is not.

Newcomers to SGML often think of it as a special case of the kind of markup language with which they may be familiar. They expect it to define a universal set of tags or to define exactly what tags mean, in terms of how the items identified by tags are to be processed. But the semantics of a markup language are precisely what SGML does not concern itself with: it describes only the formal properties and inter-relations of the components of a document. It does not tell you what it means to define part of a text as belonging to some category (say, ``blort''); it simply tells you how things-called-blort can legally appear within texts -- whether they can be decomposed into ``blortettes'', or whether more than one of them can appear at the start of a document, and so on. Determining what a thing-called-blort actually might be is inextricably entangled with how the text is to be processed, and the function of SGML is to define the content of a document in terms that are entirely independent of its processing.

It follows from this that it is nonsensical to think of SGML as a kind of text formatting system (although its origins can be readily traced in the world of electronic text formatting), or as a competitor for such languages as TeX or PostScript. These are languages which define how ink (or its equivalent) is to be placed onto paper (or its equivalent); they are not primarily concerned with the formal structure of the language represented by those placements of dark-on-light. SGML by contrast is decidedly unhelpful about how texts are to be reproduced, since this is but one of the many applications for which a text may be placed into electronic form. Its strength is that by separating the notion of what the text actually is from how the text is rendered, it makes possible the use of the same text by many different kinds of processor.

As a simple example, consider the headings used to introduce the subdivisions of the present document. These need to be distinguished from the body of the text so that they can be formatted in a particular way. However, I have not yet decided how -- and it is more than likely that those responsible for printing this text will prefer to format them in some other way in any case. If therefore I use the facilities available on my word processor for the display of headings -- say a change of font size, a margin indent and a switch to bold font -- I will not be helping the typesetter's task very much. Moreover, should I wish to prepare a list of the subheadings in my text to serve as an index, I will very probably find it quite difficult to distinguish occasions where bold font indicates headings from cases where it indicates (say) emphasized phrases in the text. By the same token, I will find it difficult to check that each subsection has one and only one subheading, that any numbers included in the subheadings are in the right sequence and so forth. And when, in the fullness of time, this text enters the great database of late twentieth century prose, future linguists and historians will have comparable difficulties in assessing the linguistic properties of text used as subheadings as distinct from those of the main text. If however, I simply tag each sub-heading as such, using some unique string of codes to say ``here begins the text of a subheading'' and some other to mark its end, then the same input text can be used unchanged by any formatter, any indexing program and any linguistic analysis program. Each one will be able to decide for itself what it wants to do with the subheadings -- how it would like to process them, if at all.

While indexing the subheadings in a document of this nature is clearly of somewhat limited importance, it should be apparent that the solution proposed for that problem is an entirely generalisable one. Consider historical source materials for example. Which is likely to be of more use in compiling a list of the names in an electronic transcription of the records of an ecclesiastical court -- a version in which the names are simply italicised (as are, for example, Latin phrases, running titles, annotations etc.) or one in which each name is marked off clearly by a tag such as <name>? Which is likely to be of more use in extracting statistical data for input to a spreadsheet analysis of the average age of offenders -- a version in which birth dates are clearly marked as such, perhaps incorporating some normalised version of the date concerned, or one in which all dates are simply intermingled with the running text?

Back to table of contents
On to next section