TEI AI2 W1: Working paper on spoken texts <date>October 1991
<!-- shouldn't all authors be removed anyway ? -->
<author>Stig Johansson, University of Oslo
<author>Lou Burnard, Oxford University Computing Services
<author>Jane Edwards, University of California at Berkeley
<author>And Rosta, University College London
<abstract> This paper discusses problems in representing speech in machine-readable form. After some brief introductory considerations (Sections 1-5), a survey is given of features marked in a selection of existing encoding schemes (Section 6), followed by proposals (Section 8) for encoding compatible with the draft guidelines of the Text Encoding Initiative (TEI). Appendix 1 contains example texts representing different encoding schemes. Appendix 2 gives a brief summary of the main features of some encoding schemes. Appendix 3 presents a DTD fragment and examples of texts encoded according to the conventions proposed. The discussion focuses on English but should be applicable more generally to the representation of speech.
</front><body>
<div1><head>The problem
<p> In the encoding of spoken texts we are faced with a double problem:
<list>
<item>there are no generally accepted conventions for the encoding of spoken texts in machine-readable form;
<item>there is a great deal of variation in the ways in which speech has been represented using the written medium (transcription).
</list>
We therefore have less to go by than in the encoding of written texts, where there are generally accepted conventions which we can build on in producing machine-readable versions.
<p> In addition to this basic problem, there are other special difficulties which apply to the encoding of speech. Speech varies along a large number of dimensions, many of which have no counterpart in writing (tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. The production and comprehension of speech are intimately bound up with the speech situation. A transcription must therefore also be able to capture contextual features. Moreover, there is an ethical problem in recording and making public what was produced in a private setting and intended for a limited audience (see further Section 2.2).
<div1><head>From speech to transcription and electronic record
<div2 id=2.1><head>The requirements of authenticity and computational tractability
<p> There is no such thing as a simple conversion of speech to a transcription. All transcription schemes make a selection of the features to be encoded, taking into account the intended uses of the material. The goal of an electronic representation is to provide a text which can be manipulated by computer to study the particular features which the researcher wants to focus on. At the same time, the text must reflect the original as accurately as possible. We can sum this up by saying that an electronic representation must strike a balance between two partially conflicting requirements: authenticity and computational tractability.
<p> A workable transcription must also be convenient to use (write and read) and reasonably easy to learn. Here we focus on a representation which makes possible a systematic marking of discourse phenomena while at the same time allowing researchers to display the data in any form they like.
We hope that this representation will also promote "insightful perception and classification of discourse phenomena" (Du Bois, forthcoming). <div2 id=2.2><head>The ethical requirement <p> Speech is typically addressed to a limited audience in a private setting, while writing is characteristically public (but we have of course public speech and private writing). The very act of recording speech represents an intrusion. If this intrusion is very obvious (e.g. with video recording), the communication may be disturbed. On the other hand, if the recording is too unobtrusive, it may be unethical (if the speakers are unaware of being recorded). <p> It is not only the recording event itself which is sensitive, but also the act of making the recording available outside the context where it originated. Transcribers have usually taken care to mask the identity of the speakers, e.g. by replacing names by special codes. In other words, the transcriber must also strike a balance between the requirement of authenticity and an ethical requirement. <div1 id=3><head>Who needs spoken machine-readable texts and for what purpose? <p> It follows from what has already been said that a transcription - and its electronic counterpart - will be different depending upon the questions it has been set up to answer. The following are some types of users who might be interested in spoken machine-readable texts: <list> <item>students of ethnology and oral history, who are mainly interested in content-based research; <item>lexicographers, who are mainly interested in studying word usage, collocations, and the like; <item>students of linguistics, who may be interested in research on phonology, morphology, lexis, syntax, and discourse; <item>dialectologists and sociolinguists, whose primary interest is in patterns of variation on different linguistic levels; <item>students of child language or second language acquisition, who are concerned with language development on different levels; <item>social scientists and psychologists interested in patterns of spoken discourse; <item>speech researchers, who need an empirical basis for setting up and testing systems of automatic linguistic analysis; <item>engineers concerned with the transmission of speech. </list> The first two categories are probably best served by a transcription which follows ordinary written conventions (perhaps enhanced by coding for key words or lemmas). The other groups generally need something which goes beyond "speech as writing". <div1 id=4><head>The delicacy of transcriptions <p> The texts in Appendix 1 are examples of different types of transcription, some existing in machine-readable form and others only occurring in the written medium. The range of features represented varies greatly as does the choice of symbols. <p> All the texts have some way of indicating speaker identity and dividing the text up into units (words and higher-level units). Words are most often reproduced orthographically, though sometimes with modified or distorted spelling to suggest pronunciation. Less often they are transcribed phonetically or phonemically (as in G and W). The higher-level units are orthographic sentences (as in A and B) or some sort of prosodic units (as in H and R). 
<p> The degree of detail given in the transcription may be extended in different directions (it is worth noting that in some cases there are different versions of the same text, differing according to the features represented or, more generally, in the delicacy of the transcription; see H, J, K, and Q). Some texts are edited in the direction of a written standard (e.g. A and B), others carefully code speaker overlap and preserve pauses, hesitations, repetitions, and other disfluencies (e.g. H, I, and N). Some texts contain no marking of phonological features (e.g. A and B), others are carefully coded for prosodic features like stress and intonation (e.g. H, I, and Q). The degree of detail given on the speech situation varies, as does the amount of information provided on non-verbal sounds, gestures, etc. Note the detailed coding of non-verbal communication and actions in the second example in V. <p> Two of the example texts in Appendix 1 consistently code paralinguistic features (H1, W). In some of the texts there is analytic coding which goes beyond a mere transcription of audible features, e.g. the part-of-speech coding in J3 and K3- 5 and various types of linguistic analysis of selected words in F and on the "dependent tier", indicated by lines opening with %, in R. In G there is an accompanying standard orthographic rendering, in W and X a running interpretive commentary, in Y and Z a detailed analysis in discourse terms. Here we are not concerned with analytic and interpretive coding, but rather with the establishment of a basic text which can later be enriched by such coding. <div1 id=5><head>Types of spoken texts <p> Before we go on to a detailed discussion of features encoded in our example texts, we shall briefly list some types of spoken texts. These are some important categories: <list> <item>face-to-face conversation <item>telephone conversation <item>interview <item>debate <item>commentary <item>demonstration <item>story-telling <item>public speech </list> In addition, we may distinguish between pure speech and various mixed forms: <list> <item>scripted speech (as in a broadcast, performances of a drama, or reading aloud) <item>texts spoken to be written (as in dictation) </list> Here we are mainly concerned with the most ubiquitous form of speech (and, indeed, of language), i.e. face-to-face interaction. If we can represent this prototypical form of speech, we may assume that the mechanisms suggested can be extended to deal with other forms of spoken material. <div1 id=6><head>Survey of features marked in existing schemes <p> The survey below makes frequent reference to the example texts in Appendix 1. No attempt is made to give a full description of each individual scheme (for a comparison of some schemes, see Appendix 2). The aim is rather to identify the sorts of features marked in existing schemes. Reference will be made both to electronic transcriptions and to transcriptions which only exist in the written medium. <div2 id=6.1><head>Text documentation <p>Speech is highly context-dependent. It is important to provide information on the speakers and the setting. This has been done in different ways. Note the opening paragraphs in A and B, the first lines in E, the header file in Q, and the header lines in R. In some cases information of this kind is kept separate from the machine-readable text file (see Appendix 2). 
<div2 id=6.2><head>Basic text units <p> While written texts are broken up into units of different sizes (such as chapters, sections, paragraphs, and orthographic sentences, or S-units) which we can build on in creating a machine-readable version, there are no such obvious divisions in speech. In dialogues it is natural to mark turns. As these may vary greatly in length, we also need some other sort of unit (which is needed in any case in monologues). <p> Most conventional transcriptions of speech (such as A and B) divide the text up into "sentences" in much the same way as in ordinary written texts, without, however, specifying the basis for the division. The closest we can get to such a specification is probably the "macrosyntagm" used in a corpus of spoken Swedish (cf. Loman 1972: 58ff.). A macrosyntagm is a grammatically cohesive unit which is not part of any larger grammatical construction. <p> Linguists often prefer a division based on prosody, e.g. the tone units in the London-Lund Corpus (I) and in the scheme set up for the Corpus of Spoken American English (Q). Others are sceptical of this sort of division and prefer pause- defined units (Brown et al. 1980: 46ff.). <p> In the new International Corpus of English project texts are divided into "text-units" to be used for reference purposes, and there is "no necessary connection between text unit division and any features inherent in the original text" (Rosta 1990). Text-units are used both in written and spoken texts. In written texts they often correspond to an orthographic sentence, in spoken texts to a turn. The primary criterion is length (around three lines or 240 characters or twenty words, not including tags). <p> On basic text units in different encoding schemes, see further Appendix 2. <div2 id=6.3><head>Reference system <p> If a text is to be used for scholarly purposes, it must be provided with a reference system which makes it possible to identify and refer to particular points in the text. Reference systems have been organized in different ways (see Appendix 2). In the London-Lund Corpus (I) a text is given an identification code and is divided into tone units, each with a number. Some spoken machine-readable texts (e.g. K1) do not seem to contain such a fine-grained reference system. <div2 id=6.4><head>Speaker attribution <p> In a spoken text with two or more participants we must have a mechanism for coding speaker identity. This is generally done by inserting a prefix before each speaker turn; see A, H, I1, J, L, N, O, etc. in Appendix 1 and the survey in Appendix 2 (in the London-Lund Corpus there is actually such a prefix before each tone unit; see I2). The prefix is generally a code; more information on the speakers may be given in the accompanying documentation. <p> Problems arise where a speaker's turn is interrupted (see how continuation has been coded in Appendix 2; note the examples in text I in Appendix 1) and, particularly, where there is simultaneous speech. <div2 id=6.5><head>Speaker overlap <p> In informal speech there will normally be a good deal of speaker overlap. This has been coded in a variety of ways, as shown in Appendix 2. In the London-Lund Corpus simultaneous speech is marked by the insertion of pairs of asterisks; see text I in Appendix 1. Dubois et al. (1990) instead use pairs of brackets; see Q. Both solutions involve separation of each speaker's contribution and linearization in the transcription. 
In some cases speaker overlap is shown by vertical alignment (alone or combined with some other kind of notation); see O, P, Q, T, and V.
<div2 id=6.6><head>Word form
<p> Words are rarely transcribed phonetically/phonemically in extended spoken texts; the only examples in Appendix 1 are G and W, neither of which exists in machine-readable form. There are several reasons why there are very few extensive texts with transcription of segmental features. Such transcription is very laborious and time-consuming. A detailed transcription of this kind makes the text difficult to use for purposes other than close phonological study. And the conversion to electronic form has been difficult because of restrictions on the character set.
<p> Orthographic transcriptions of speech reproduce words with their conventional spelling and separated by spaces (apart from conventionalized contractions). This is true not only of texts like A and B, but also of the much more sophisticated transcriptions in H, I, etc. In some cases, e.g. in E, P, and R, the transcription introduces spellings intended to suggest the way the words were pronounced (takin', an', y(a) know, gon (t)a be, etc.). Some basically orthographic transcriptions introduce phonetic symbols in special cases; see the remarks on quasi-lexical vocalizations in the next section.
<p> Simple word counts based on orthographic, phonemic, and phonetic transcriptions of the same text will give quite different results. Homographs are identical in an orthographic transcription, although they may be pronounced differently: row, that (as demonstrative pronoun vs. conjunction and relative pronoun), etc. Conversely, homophones may be written quite differently: two/too, so/sew/sow, etc. These are commonplace examples, but they show that even answers to seemingly straightforward questions (How many words are there in this text? How many different words are there?) depend upon choices made by the transcriber.
<p> On the treatment of word form in different encoding schemes, see further the comparison in Appendix 2.
<div2 id=6.7><head>Speech management
<p> A "speech as writing" transcription will normally edit away a lot of features which are essential for successful spoken communication. A comparison of texts like A and B versus H, I, L, and Q is instructive. The former contain little trace of speech-management phenomena such as pauses, hesitation markers, interrupted or repeated words, corrections and reformulations. Interactional devices asking for or providing feedback are less prominent. For content-based research they are irrelevant or even disruptive; for discourse analysis they are highly significant. The survey in Appendix 2 illustrates how such features have been handled. (On speech-management phenomena, see further Allwood et al. 1990.)
<p> Different strategies are used to render quasi-lexical vocalizations such as truncated words, hesitation markers, and interactional signals. The London-Lund Corpus introduces phonetic transcription in such cases (see examples in text I); other transcriptions use ordinary spelling (e.g. b- and uh in text Q, hmm and mmm in text R). Some schemes have introduced control lists for such forms; see Appendix 2.
<div2 id=6.8><head>Prosodic features
<p> The most obvious characteristic of speech, i.e. the fact that it is conveyed by the medium of sound, is often lost completely in the transcription.
The only traces in texts like A and B are the conventional contractions of function words (and the occasional use of italics or other means to indicate emphasis; see one example in B). Even texts produced for language studies generally reflect only a small part of the phonological features. The best-known computerized corpus of spoken English, the London-Lund Corpus (I), focuses on prosody: pauses, stress, and intonation. The same is true of the Lancaster/IBM Spoken English Corpus (K1) and the systems set up for the Corpus of Spoken American English (Q) and the Survey of English Usage (H) - which, incidentally, is the system which the transcription in the London-Lund Corpus is based on. There is a great deal of variation in the way pauses, stress, and intonation are marked, as shown by our example texts in Appendix 1 and the survey in Appendix 2. Note that some schemes have special codes for latching and lengthening. <div2 id=6.9><head>Paralinguistic features <p> Paralinguistic features (tempo, loudness, voice quality, etc.), which are less systematic than phonological features, have usually not been marked in transcriptions. Two of our example texts, however, include elaborate marking of such features (H1, W); neither of them has an exact counterpart in machine-readable form. It should be noted that the London-Lund Corpus is not coded for paralinguistic features, although the texts were originally transcribed according to the Survey of English Usage conventions (as in H). The simplification was made for a number of reasons, "partly practical and technical, partly linguistic" (Svartvik & Quirk 1980: 14). The omitted features were considered less central for prosodic and grammatical studies, which were the main concern of the corpus compilers. <p> For more information on the encoding of paralinguistic features, see Appendix 2. <div2 id=6.10><head>Non-verbal sounds <p> Non-verbal sounds such as laughter and coughing are generally noted, usually as a comment within brackets. See examples in texts B, E, H, L, P, and Q. Q has a special code for laughter (@), which may be repeated to indicate the number of pulses of laughter. The London-Lund Corpus may add a code to indicate length of the non-verbal sound, e.g. (. laughs), (- coughs), (-- giggle). Gumperz & Berenz (forthcoming) distinguish between such features as interruption and as overlay. See further Appendix 2. <div2 id=6.11><head>Kinesic features <p> On the edges of paralinguistic features and non-verbal sounds we find kinesic features, which involve the systematic use of facial expression and gesture to communicate meaning (gaze, facial expressions like smiling and frowning, gestures like nodding and pointing, posture, distance, and tactile contact). They raise severe problems of transcription and interpretation. Besides, they require access to video recordings. For some examples of how such features have been marked, see especially texts P and V. See also the survey in Appendix 2. <div2 id=6.12><head>Situational features <p> The high degree of context-dependence of speech makes it essential to record a variety of non-linguistic features. These include movements of the participants or other features in the situation which are essential for an understanding of the text. We have already drawn attention to the need for documentation on the speech situation (Section 6.1). It is also important to record changes in the course of the speech event, e.g. 
new speakers coming in, long silences, background noise disturbing the communication, and non-linguistic activities affecting what is said. Note some comments of this kind in %-lines in R and the marking of actions in U. Note also the situation descriptions in texts A and B. See further Appendix 2. <div2 id=6.13><head>Editorial comment <p> In our example texts editorial comment is indicated in a variety of ways, e.g. by additions within parentheses (such as "Expletive deleted" in A) or comments in %-lines in R. The London-Lund Corpus indicates uncertain transcription by double parentheses (see examples in I); there are also comments within parentheses like "gap in recording" or "ten seconds untranscribable". Other schemes indicate uncertain hearing in other ways. Note the elaborate coding of normalization in the International Corpus of English (J2). See further Appendix 2. <div2 id=6.14><head>Analytic coding <p> Though we are not concerned with analytic coding which goes beyond the establishment of a basic text, we shall draw attention to some conventions used in our example texts. Text F includes analytic coding after selected words. J3 has word- class markers before each word. K3-5 have word-class tags accompanying each word (in different ways). R includes speech act analysis in %-lines (%spa: ...). W and X have a running interpretive commentary. Y and Z are coded for discourse analysis. There is thus a great deal of variation in the features coded (not to speak of the choice of codes). <div2 id=6.15><head>Parallel representation <p> In order to convey a broad range of features, some schemes provide parallel representations of the texts. Note the phonetic, phonemic, and orthographic versions in text G and, in particular, the multi-layered coding in V and W. An adequate scheme for the encoding of spoken texts will no doubt need to provide mechanisms for parallel representation. <div1 id=7><head>Spoken texts and the Text Encoding Initiative <p> To what extent can spoken texts be accommodated within the Text Encoding Initiative? They were not dealt with in the current draft (Sperberg-McQueen & Burnard 1990), but many of the mechanisms which have been suggested for written texts can no doubt be adapted for spoken material. These include the ways of handling text documentation, reference systems, and editorial comment. Phonetic notation can be handled by the general mechanisms for character representation (though these will not be dealt with here, as they are the province of another working group). <p> In considering the encoding of face-to-face conversation, which is the main focus of interest here, we shall sometimes make a comparison with dramas, which are dealt with (briefly) in the current TEI draft. This literary genre should of course not be confused with ordinary face-to-face conversation, but we can look upon it as a stylized way of representing conversation, and some of the mechanisms for handling dramatic texts may be adapted for encoding genuine conversation. <div1 id=8><head>Proposals <p> The overall structure we envisage for a spoken text is given in Figure 1. In other words, a TEI-conformant text consists of a header and the text body. The latter consists of a timeline and one or more divisions (div); divisions are needed, for example, for debates, broadcasts, and other spoken discourse types where structural subdivisions can easily be identified. <p> The timeline is used to coordinate simultaneous phenomena. 
It has an attribute "units" and consists of a sequence of points, with the following attributes: id, elapsed, since.
<xmp><![ CDATA[ Example: <point id=p43 elapsed=13 since=p42> ]]></xmp>
This identifies a point "p43" which follows 13 units (as specified in the "units" attribute of the <tag>timeline</tag>) after "p42".
<p> Divisions consist of elements of the following kinds:
<list>
<item>utterances (u), which contain words (for more detail, see 8.2);
<item>vocals, which consist of other sorts of vocalizations (e.g. voiced pauses, back-channels, and non-lexical or quasi-lexical vocalizations of other kinds);
<item>pauses, which are marked by absence of vocalization;
<item>kinesic features, which include gestures and other non-vocal communicative phenomena;
<item>events, which are non-vocal and non-communicative;
<item>writing, which is needed where the spoken text includes embedded written elements (as in a lecture or television commercial).
</list>
<p>In addition, there may be embedded divisions. All of the elements mentioned have "start" and "end" attributes pointing to the timeline (there may also be attributes for "duration" and "units" expressing absolute time). If a "start" attribute is omitted, the element is assumed to follow the element that precedes it in the transcription. If an "end" attribute is omitted, the element is assumed to precede the element that follows it in the transcription. If "start" and "end" attributes are used, the order of elements within the text body is unrestricted.
<p> Events, kinesic features, vocalizations, and utterances form a hierarchy (though not in the SGML sense), as shown in Figure 2. Utterances are a type of vocalization. Vocalizations are a type of gesture. Gestures in their turn are a type of event. We can show the relationship using the features "eventive", "communicative", "anthropophonic" (for sounds produced by the human vocal apparatus), and "lexical":
<fig>
            eventive   communicative   anthropophonic   lexical
event          +             -                -            -
kinesic        +             +                -            -
vocal          +             +                +            -
utterance      +             +                +            +
</fig>
Needless to say, the differences are not always clear-cut. Among events we include actions like slamming the door, which may certainly be communicative. Vocals include coughing and sneezing, which are or may be involuntary noises. And there is a cline between utterances and vocals, as implied by our use above of the term "quasi-lexical vocalization". Individual scholars may differ in the way the borderlines are drawn, but we claim that the four element types are all necessary for an adequate representation of speech.
<p> There is another sort of hierarchy (in the SGML sense) defining the relationship between linguistic elements of different ranks, as shown in Figure 3. Texts consist of divisions, which are made up of utterances, which may in their turn be broken down into segments (s). See further Section 8.2.
<p> Before we go on to a more detailed discussion of the individual types of features (following the same order as in Section 6 above), we should briefly draw attention to a couple of other points illustrated in Figure 1. Note that pauses and vocals may occur between utterances as well as within utterances and segments. "Shift" is used to handle paralinguistic features (see Section 8.9) and "pointer" particularly to handle speech overlap (see Section 8.5).
<div2 id=8.1><head>Text documentation
<p> The overall TEI framework proposed for written texts can be adapted for spoken texts.
Accordingly, there is a header with three main sections identified by the tags <tag>file.description</tag>, <tag>encoding.declarations</tag>, and <tag>revision.history</tag>; cf. Figure 1. The content of the last two need little adaptation, while the first one will be quite different. With spoken texts there is no author; instead, there is recording personnel and one or more transcribers, and the real "authors" appear in a list of participants (as in the dramatis personae of a play). There is also (as in all electronic texts) an electronic editor, who may or may not be identical to the transcriber. There is no source publication as for printed texts (except for scripted speech); instead there is a recording event followed by a transcription stage. The following structure is suggested (see also Figures 4-8): <xmp><![ CDATA[ <file.description> <title.statement> <title>name of the text supplied by the electronic editor name of the electronic editor electronic editor (if applicable) ... ... ... size of file ... (if applicable) ... ... ... ... ... (if applicable) ... ... ... ... ... ... audio/video ... ... (for recordings of broadcasts) ... ... ... ... ... ... ...
...
... ...
... ... ... (if replacing name) ... ... ... ... ... ... ... ... ... ... ... (may be needed in case the participant is speaking on behalf of an organization) ... ... ... ... e.g. mother friend ... role in the interaction (the language used in the situation, or comments on language use) ... ... ... ... ... ... ... school class/audience etc. ... ... (The general principle is that can take any of the tags that occur under , provided that they apply to the whole group.) ... (repeated location needed with telephone conversation and other cases of distanced spoken communication; the "who" attribute identifies the relevant participant or participants) ... ... ... ... ... ... room/train/library/open air, etc. (repeated surroundings needed with distanced spoken communication) ... activities of the participants ... (if the recording is published) ...
(repeated statement needed where there are several recordings) ... ... (if applicable) ... transcriber ... ... ... ... ... (if the transcription is published) ... ... ... notes stating the language of the text, giving copyright information, specifying conditions of availability, etc. ]]>

The difference with respect to the current TEI heading for written texts is that "source description" is replaced by a "script statement" (structured like the heading for a written text; used where the recording is scripted), a "recording statement", and a "transcription statement". To stress the analogy with the heading for written texts, it is preferable to embed these under "source description" (see Figure 4b):

<![ CDATA[ <source.description> <script.statement>...</script.statement> <recording.statement>...</recording.statement> <transcription.statement>...</transcription.statement> </source.description> ]]>

Spoken texts also differ from written ones in that they often do not have a title. We suggest that the title of the electronic text should be lifted from the "script statement" (if there is a script), the "transcription statement" (if it has a title), or the "recording statement" (if it has a title). If not, the electronic editor should assign a title.

There is a problem in deciding how the responsibility for the text should be stated. The following may be involved: the electronic editor, the author of the script, the recording personnel, the transcriber, the speaker(s). The last four are naturally embedded in the "script", "recording", and "transcription" statements, as shown above. The electronic editor is naturally given in the "title statement" for the electronic text. A possible solution is that the person who is regarded as having the main responsibility for the text is lifted from the "script", "recording", or "transcription" statement and inserted before the electronic editor, analogously to the current practice for editions of written texts. Depending upon the nature of the material, the main responsibility for the text is then allocated to the author (where there is a script), the interviewer, the transcriber, or the speaker(s):

<![ CDATA[ <title.statement> <title>...</title> <statement.of.responsibility> <name>...</name> <role>author/interviewer/transcriber/speaker</role> <name>...</name> <role>electronic editor</role> </statement.of.responsibility> </title.statement> ]]> It is worth considering whether the same lifting procedure should also apply to electronic editions of written texts (at the cost of some redundancy of coding).

The most complicated part of the heading for a spoken text is the "recording statement". Note, in particular, the structure suggested for the participants and the setting. Each participant has an "id", which we can use to define his/her relationship to other participants, to state the time and location in cases of distanced communication, to assign utterances to particular participants, etc. The categories for participants should be coordinated with those needed in history and the social sciences (particularly those grouped together under "demographic information", while those placed under "situational information" are more speech-specific).

Some of the types of information provided for above by tags are probably better expressed as attributes. This applies, for example, to the sex of participants and participant groups:

<![ CDATA[ <participant id= sex=m/f> <participant.group id= size= members= sex=m/f/mixed> ]]> Incidentally, the "members" value in the last case is a list of ids of any participants who belong to the participant group (i.e. individual participants who also participate jointly). The value of "size" is a number. Another case where an attribute may be recommended is "time" (under "setting"). This should be coordinated with general TEI recommendations for the coding of date and time. The general principle should be that, when values for a parameter are predictable and can thus be constrained by the DTD, the parameter should be an attribute rather than a tag.

The "script statement" is analogous to an ordinary heading for a written text and requires no special machinery. The "transcription statement" is fairly simple. The complexity here is transferred to the statement of transcription principles (under "encoding declarations"). Note that there may be more than one script and more than one recording and transcription statement, each with an "id".

As pointed out above, we have focused on the "file description" of spoken texts. The other main text documentation sections will define the principles of encoding (transcription principles, reference system, etc.) and record changes in the electronic edition. Note that text classification goes under "encoding declarations". We do not address the problem of text classification, as it is the responsibility of the corpus work group.

Where a spoken text is part of a corpus, the information must be split up between the corpus header (what is common for all the texts in the corpus) and the header for each individual text; cf. Johansson (forthcoming) and the suggestions of the corpus work group.

Note, finally, that we do not suggest that all the slots in the header should be filled; we just provide a place for the different sorts of information, in case they are relevant. The amount of information which goes on the "electronic title page" will no doubt vary depending upon the type of material and the intended uses (and depending upon how much information is, in fact, available). In many cases it may be appropriate to give a description in prose, e.g. of recording information or the setting, rather than one broken down by tags. The following is suggested as an absolute minimum:
<list>
<item>specification of title and main responsibility for the text (under "title statement");
<item>recording time and circumstances of data gathering (under "recording statement");
<item>some information about the participants (under "recording statement");
<item>transcription principles (under "editorial principles" in the "encoding declarations" section).
</list>
<div2 id=8.2><head>Basic text units

We suggest the basic tag u (for utterance), referring to a stretch of speech preceded and followed by silence or a change of speaker. The u tag may have the following attributes:
<list>
<item>who=A1 (speaker) or 'A1 B1 C1' (several speakers or possible speakers)
<item>uncertain= (description of uncertainty, e.g. of speaker attribution)
<item>script= (if applicable, "id" of script)
<item>trans=smooth (smooth transition; default) or latching (noticeable lack of pause with respect to the previous utterance)
<item>n=1 2 3 ... (for number)
</list>

In addition, u has "start" and "end" attributes pointing to the timeline; cf. the beginning of Section 8. u may contain another u, but only where there is a change to/from a script or between scripts; the speaker value must be the same as for the matrix u.
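By way of illustration, the following constructed fragment (the speaker id, point ids, script id, and wording are invented for the example) shows an utterance with several of these attributes, including a nested u marking a switch into and out of a script by the same speaker:
<![ CDATA[ <u who=A1 n=7 start=p20 end=p23>as the minister put it <u who=A1 script=s1>no comment</u> and that was all we got</u> ]]>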

As utterances may vary greatly in length, we also need tags for lower-level units of different kinds: tone units (or intonational phrases), pause-defined units, macrosyntagms, or text-units defined solely for reference purposes (cf. Section 6.2). We suggest the use of the ordinary TEI s tag (for segment) in all these cases, with attributes for type, truncation, and number (as well as "start" and "end" attributes pointing to the timeline; cf. the beginning of Section 8):
<list>
<item>type=toneunit pauseunit macrosyntagm textunit
<item>trunc=no (default) yes
<item>number=1 2 3 ...
</list>
The interpretation of the segment types should be defined more exactly in the "encoding declarations" section of the file header.
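For example, an utterance divided into tone units might be encoded as follows (a constructed fragment; the speaker id, wording, and numbering are invented):
<![ CDATA[ <u who=B1 n=8><s type=toneunit n=8.1>well I saw him yesterday</s><s type=toneunit n=8.2>in the library</s></u> ]]>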

Exceptionally an s may cross utterance boundaries, e.g. where the addressee completes a macrosyntagm started by the first speaker (or a speaker continues after a back-channel from the addressee):

<![ CDATA[ <u who=A n=1><s type=macrosyntagm trunc=yes n=1.1>have you heard that John</s></u> <u who=B n=2><s type=macrosyntagm trunc=yes n=1.2>is back</s></u> ]]> In other words, we have two utterances, each consisting of a fragment of a macrosyntagm. The identity of the macrosyntagm is indicated by the "number" attribute of s.

Where more than one type of s is needed, we get problems with conflicting hierarchies. In these cases we must resort to milestone tags or concurrent markup. Concurrent markup will certainly be needed where the text is analysed in terms of turns and back-channels (although, most typically, a turn will be a u and a back-channel a vocal). This belongs to the area of discourse analysis, which is beyond the scope of our present paper. The same applies to a more detailed analysis of elements above u, where we have just suggested a general div element; see the beginning of Section 8.

Among linguistic units we also find writing; cf. the beginning of Section 8 and Figure 1. This contains a representation of an event in which written text appears. Typical cases are subtitles, captions, and credits in films and on TV, though overhead slides used in a lecture might also count. It has the same attributes as u and, in addition, an "incremental" attribute (with the values "yes" or "no"), specifying whether the writing appears bit by bit or all at once. If the "who" attribute is specified, it picks out a participant who generates or reveals the writing, e.g. a lecturer using an overhead slide or writing on the blackboard. writing may contain pointer, but only if the value of "incremental" is "yes". It should perhaps be allowed to contain all the tags used in written texts. For texts with extensive elements of writing there should be a corresponding "script statement". <div2 id=8.3><head>Reference system

The reference system is intimately connected with the choice of text units. Here we can make use of the "number" attributes of utterances and s-tags, refer to periods between points in the timeline, use milestone tags, or define a concurrent reference hierarchy: (ref)s.../(ref)s. The mechanism(s) should be declared in the file header, in the "encoding declarations" section. <div2 id=8.4><head>Speaker attribution

In dramas speaker attribution is indicated in a double manner in the current TEI scheme; in the first place, by the tag speaker which identifies speaker prefixes; secondly, by the "speaker" attribute of the tag speech. We suggest a "who" attribute for utterances (Section 8.2) and also for vocals (Section 8.7), kinesic features (Section 8.11), and events (Section 8.12). The value of the "who" attribute is the "id" given in the list of participants in the file header.

Where attribution is uncertain, this may be indicated by an "uncertain" attribute:

<![ CDATA[ <u who=A1 uncertain= > <u who='A1 B1 C1' uncertain= > ]]> The value of "uncertain" is a description of the uncertainty and optionally a statement of the cause of the uncertainty, the degree of uncertainty, and the identity of the transcriber. In the first case above, the probable speaker is A1, in the second A1, B1, or C1. Where the identity of the speaker is completely open, the "who" attribute takes the value "unknown".

If the utterance is the collective response of a group, e.g. the audience during a lecture or school children in a class, we can use the "id" of the participant group, as defined in the file header. This should be distinguished from overlapping identical responses of individual participants who do not otherwise act as a group in the interaction. In the latter case we recommend the normal mechanisms for speaker overlap (see Section 8.5). <div2 id=8.5><head>Speaker overlap

Where there is simultaneous speech, the contributions of each speaker are best separated and presented sequentially. Whole utterances which overlap are catered for by the "start" and "end" attributes of the elements (cf. the beginning of Section 8):

<![ CDATA[ <u who=A1 start=p10 end=p11>have you heard the news</u> <u who=B1 start=p12 end=p13>no</u> <u who=C1 start=p12 end=p13>no</u> ]]> More likely, there is partial overlap between the utterances. In these cases we must insert pointer tags specifying the start and end of the overlapping segments. pointer is an empty element, with an attribute pointing to the timeline. <![ CDATA[ Example (see Figure 9): <u who=A start=p1 end=p5>this<pointer time=p2>is<pointer time=p3>my<pointer time=p4>turn</u> <u who=B start=p2 end=p4>balderdash</u> <u who=C start=p3 end=p5>no<pointer time=p4>it's mine</u> ]]> In other words, the first speaker's "is my" overlaps with the second speaker's "balderdash", and the first speaker's "my turn" overlaps with the third speaker's "no it's mine". The overlap may occur in the middle of a word, as in this example (adapted from Gumperz & Berenz, forthcoming; see also Figure 10): <![ CDATA[ <u who=R start=p1 end=p3>you haven't been to the skill cen<pointer time=p2>ter</u> <u who=K start=p2 end=p4>no<pointer time=p3>I haven't</u> <u who=R start=p3 end=p5>so you have<pointer time=p4>n't seen the workshop there</u> ]]>

Overlap between u and vocal, kinesic, event is handled in a corresponding manner:

<![ CDATA[ <u who=A1 start=p1 end=p3>have you read Vanity<pointer time=p2>Fair</u> <u who=B1 start=p2 end=p3>yes</u> <kinesic who=C1 start=p2 end=p3 desc=nod> ]]> Overlap between instances of vocal, kinesic, and event is handled by their "start" and "end" attributes.

Overlap involving writing is dealt with in the same manner as for u. <div2 id=8.6><head>Word form

We shall only consider orthographic representations, as this is the form of transcription most often used in extended spoken texts (and as the problems of phonetic notation are the concern of the work group on character sets). Words will then be represented as in writing and will normally not present any difficulties.

If there are deviations from ordinary orthography suggesting how the words were pronounced (cf. Section 6.6), it is preferable to include a normalized form, using the mechanisms of editorial comment (see Section 8.13).
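For instance, a spelling like takin' might be accompanied by a normalized form along the lines of the norm tag discussed in Section 8.13 (a constructed example; the editor id NN is merely a placeholder):
<![ CDATA[ <norm ed=NN sic=takin'>taking</norm> ]]>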

Standard conventions for hyphenation, capitalization, names, and contractions are best used as in writing. If required, standard contractions may be "normalized" in the same way as idiosyncratic spellings (thus simplifying retrieval of word forms). Initial capitalization of text units is naturally used in a "speech as writing" text with ordinary punctuation, but is best avoided where the text is divided into prosodic units.

In a transcription which is to be prosodically marked it is essential to write numerical expressions in full, e.g. twenty- five dollars rather than $25 and five o'clock rather than 5:00.

The conventions for representing word forms should, like all other editorial decisions, be stated in the encoding.declarations section of the file header.

As regards truncated words and other types of quasi-lexical vocalizations, see the next section. <div2 id=8.7><head>Speech management

Depending upon the purpose of the study, the transcriber may edit away the disfluencies which are so typical of unplanned speech (truncated words or utterances, false starts, repetitions, voiced and silent pauses) or transcribe the text as closely as possible to the way it was produced. If the disfluencies are left in the text, it may be desirable to distinguish them by tags. These are some suggestions for the treatment of speech management phenomena:
<list>
<item>truncated segment - use the "truncation" attribute of s; see Section 8.2
<item>truncated word - write the letter or letter sequence; if the word is recognizable, tag trunc.word, with attributes for "editor" and "full"; if it is not, tag vocal; see further Section 8.13
<item>false start - tag false.start, with an "editor" attribute
<item>repetition - tag repetition, with an "editor" attribute
</list>
See further Section 8.13.
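The following constructed fragment (the wording, speaker id, and editor id NN are invented) illustrates how these tags might be applied within an utterance:
<![ CDATA[ <u who=A1 n=3>well <false.start ed=NN>it was</false.start> he was <repetition ed=NN>sort of</repetition> sort of <trunc.word ed=NN full=hesitating>hesit</trunc.word> hesitating</u> ]]>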

Silent pauses are represented as empty elements. They may be sisters of u or included within u or s; cf. Figure 1. The tag pause has attributes for "start", "end", "duration", and "units"; cf. the beginning of Section 8. In addition, there may be a "type" attribute, which can take values like: short, medium, long. Within u and s pauses can be represented by entity references; see further Section 8.8.
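A constructed illustration (the speaker ids, point ids, durations, and the entity name &pause; are all invented for the example; on entity references see Section 8.8) of a long pause between utterances and a brief pause within an utterance:
<![ CDATA[ <u who=A1 start=p1 end=p2>have you decided</u> <pause start=p2 end=p3 dur=4 units=secs type=long> <u who=B1 start=p3 end=p4>not &pause; really</u> ]]>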

Voiced pauses and other quasi-lexical vocalizations (e.g. back-channels) are tagged vocal. This is an empty element which occurs as a sister of u or within u or s; cf. Figure 1. It has the following attributes:
<list>
<item>who=A1 (speaker) or 'A1 B1 C1' (several speakers or possible speakers)
<item>uncertain= (description of uncertainty)
<item>script= (if scripted, "id" of script)
<item>type= (subclassification)
<item>iterated=yes no (single; default)
<item>desc= (verbal description)
<item>n=1 2 3 ... (number)
</list>
In addition, there are attributes for "start", "end", "duration", and "units"; cf. the beginning of Section 8.

It may be convenient to have lists of conventional forms for use as values of "desc". Examples (based on lists in existing encoding schemes; note that the list includes suggestions for non-verbal sounds; cf. Section 8.10): descriptive: burp, click, cough, exhale, giggle, gulp, inhale, laugh, sneeze, sniff, snort, sob, swallow, throat, yawn quasi-lexical: ah, aha, aw, eh, ehm, er, erm, hmm, huh, mm, mmhmm, oh, ooh, oops, phew, tsk, uh, uh-huh, uh-uh, um, urgh, yup Within u and s vocals can be represented by entity references; here we can make use of conventional forms like those listed above: &cough; &mm; etc.
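A constructed illustration (speaker ids and point ids invented) of vocal both as a sister of u and as an entity reference within u, drawing on the conventional forms listed above:
<![ CDATA[ <u who=A1 start=p5 end=p6>and then he just walked out</u> <vocal who=B1 start=p6 end=p7 desc=laugh iterated=yes> <u who=A1 start=p7 end=p8>no &uh; seriously he did</u> ]]>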

As already mentioned in passing, the borderline between u and vocal is far from clear-cut. Researchers may wish to draw quasi-lexical vocalizations within the bounds of u and treat them as words. This would agree with current encoding practice, where quasi-lexical vocalizations are typically represented as words and non-verbal sounds by descriptions within parentheses (cf. Appendix 2). As for all basic categories, the definition should be made clear in the "encoding declarations" section of the file header. <div2 id=8.8><head>Prosodic features

The marking of prosodic features is of paramount importance, as these are the ones which structure and organize the spoken message. In considering pauses in the last section we have already entered the area of prosody. Boundaries of tone units (or "intonational phrases") can be indicated by the s tag, as pointed out in Section 8.2.

The most difficult problem is finding a way of marking stress and pitch patterns. These cannot be represented as independent elements, as they are superimposed on words or word sequences. One solution is to reserve special characters for this purpose, as in a written prosodic transcription, and to define a set of entity references for different types of pause, stress, booster, tone, etc. We will not make any specific suggestions, as this is within the province of the working group on character sets. <div2 id=8.9><head>Paralinguistic features

These features characterize stretches of speech, not necessarily co-extensive with utterances or other text units. We suggest the milestone tag shift, indicating a change in a specific feature. shift may occur within the scope of u and s tags; cf. Figure 1. It has attributes for "time" (pointing to the timeline), "feature", and "new" (defining the change in the relevant feature). The following are some important paralinguistic features, with corresponding values for "new" (the suggestions are based on the Survey of English Usage transcription, which is in its turn inspired by musical notation):
<list>
<item>tempo: a = allegro (fast); aa = very fast; ac = accelerando (getting faster); l = lento (slow); ll = very slow; ral = rallentando (getting slower)
<item>loud (for loudness): f = forte (loud); ff = very loud; cr = crescendo (getting louder); p = piano (soft); pp = very soft; dim = diminuendo (getting softer)
<item>pitch (for pitch range): high = high pitch-range; low = low pitch-range; wide = wide pitch-range; narrow = narrow pitch-range; asc = ascending; desc = descending; mon = monotonous; scan = scandent, each succeeding syllable higher than the last, generally ending in a falling tone
<item>tension: sl = slurred; lax = lax, a little slurred; ten = tense; pr = very precise; st = staccato, every stressed syllable being doubly stressed; leg = legato, every syllable receiving more or less equal stress
<item>rhythm: rh = beatable rhythm; arh = arrhythmic, particularly halting; spr = spiky rising, with markedly higher unstressed syllables; spf = spiky falling, with markedly lower unstressed syllables; glr = glissando rising, like spiky rising but the unstressed syllables, usually several, also rise in pitch relative to each other; glf = glissando falling, like spiky falling but with the unstressed syllables also falling in pitch relative to each other
<item>voice (for voice quality): wh = whisper; br = breathy; hsk = husky; crk = creaky; fal = falsetto; res = resonant; gig = unvoiced laugh or giggle; lau = voiced laugh; trm = tremulous; sob = sobbing; yawn = yawning; sigh = sighing
</list>
The last group is reminiscent of the vocals dealt with at the end of Section 8.7. But here we are concerned with the sounds as overlay, not as individual units.

Shift is marked where there is a significant change in a feature. In all cases "new" can take the value "normal", indicating a change back to normal (for the relevant speaker). Lack of marking implies that there are no conspicuous shifts or that they are irrelevant for the particular transcription.
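A constructed example (the wording, speaker id, and point ids are invented) of shift marking a change to piano and back to normal within an utterance:
<![ CDATA[ <u who=A1 start=p11 end=p14>and then he told me <shift time=p12 feature=loud new=p>don't tell anyone <shift time=p13 feature=loud new=normal>but of course I did</u> ]]>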

Paralinguistic features are tied to the utterances of the individual participant (or participant group). A shift is valid for the utterances of the same speaker until there is a new shift tag for the same feature. Utterances by other speakers, which may have quite different paralinguistic qualities, may intervene. To prevent confusion, shift should perhaps have a "who" attribute, though it is strictly speaking redundant (as it is always embedded in an element with a "who" attribute).

Note that there may be shifts of several features in a single speaker with different values of the "time" attribute. In other words, different categories of paralinguistic features may occur independently of each other and overlap.

One problem with the recommendations above is that one would like to link the values for "feature" and "new". This could be done by replacing the general shift tag by specific tags with a "type" attribute, as in: temposhift type=ac, loudshift type=ff. Alternatively, it would be possible to replace the "feature" attribute of the shift tag by "tempo", "loud", etc. and let them take the "new" values, as in: shift tempo=ac, shift loud=ff. <div2 id=8.10><head>Non-verbal sounds

In our proposal we make no distinction between quasi-lexical phenomena and other vocalizations. Both are handled by vocal; cf. Section 8.7. <div2 id=8.11><head>Kinesic features

These can be dealt with in much the same way as the features taken up in the preceding section, i.e. by using an empty element, in this case kinesic, with the following attributes:
<list>
<item>who=A1 ("id" of the participant(s) involved) or 'A1 B1 C1'
<item>script= (if scripted, "id" of script)
<item>type= (subclassification)
<item>iterated=yes no (single; default)
<item>desc= (verbal description)
</list>
In addition, there are "start" and "end" attributes pointing to the timeline; cf. the beginning of Section 8.

The tag kinesic is a sister of u; cf. Figure 1. It handles the following types of features: facial expression (smile, frown, etc.), gesture (nod, head-shake, etc.), posture (tense, relaxed, etc.), pointing, gaze, applause, distance, tactile contact. With actions the "who" attribute picks out the active participant; there should perhaps also be a "target" attribute, to be used, for example, for the goal of the gazing or pointing (incidentally, "target" might also be needed as a possible attribute of u, in case one wants to specify that an utterance is addressed to a particular participant or participant group; cf. asides in a play).
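For illustration, a constructed fragment (speaker ids, point ids, and descriptions are invented) with a nod and a gaze accompanying a question:
<![ CDATA[ <u who=A1 start=p1 end=p2>do you agree</u> <kinesic who=B1 start=p2 end=p3 desc=nod iterated=yes> <kinesic who=C1 start=p2 end=p3 type=gaze desc='looks at A1'> ]]>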

Our recommendations for kinesic features present somewhat of a dilemma. Those who are not interested in gestures and the like might very well be content with event. On the other hand, our suggestions are probably insufficient for those who are particularly concerned with kinesic features. <div2 id=8.12><head>Situational features

These sorts of features are given in stage directions in dramas, for which the current TEI draft recommends the tag stage. We suggest a tag event, to be used for actions and changes in the speech situation. Like kinesic this tag is a sister of u; cf. Figure 1. It has the same attributes:
<list>
<item>who=A1 ("id" of participant(s) involved) or 'A1 B1 C1'
<item>script= (if scripted, "id" of script)
<item>type= (subclassification)
<item>iterated=yes no (single; default)
<item>desc= (verbal description)
</list>
In addition, there are "start" and "end" attributes pointing to the timeline; cf. the beginning of Section 8.
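A constructed example (speaker ids, point ids, and descriptions are invented):
<![ CDATA[ <u who=A1 start=p8 end=p9>could somebody get that</u> <event start=p8 end=p10 desc='telephone rings' iterated=yes> <event who=B1 start=p9 end=p10 desc='leaves the room'> ]]>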

With kinesic and situational features (as well as paralinguistic features) we have assumed that they are only marked where the transcriber considers them to be significant for the interpretation of the interaction. If we want a complete record throughout the interaction, the best solution is probably to have concurrent markup streams or to establish links to an audio or video recording (cf. Section 9). <div2 id=8.13><head>Editorial comment

An account of the editorial principles in general belongs in the header, under encoding.declarations. This takes up the type of transcription (orthographic, phonetic, etc.), transcription principles, definition of categories, policy with respect to punctuation, capitalization, etc. Here we are concerned with editorial comments needed at particular points in the text.

The current TEI draft defines the tags norm, sic, corr, del, and add. These may be used in spoken texts as well, as in:

<![ CDATA[ <norm ed=NN sic=an'>and</norm> <norm ed=NN sic=can't>can not</norm> ]]> This approach may be problematic, however, where the original text is heavily tagged. A possible solution is to use the approach outlined below for variants, with alt tags for the original and the emended text (and a "source" attribute associating the alternatives with the original speaker or the editor, using their individual "id").

If it is desirable to delete truncated words, repeated words, and false starts (cf. Section 8.7), we can use a del tag, with a "cause" attribute:

<![ CDATA[ <del ed=NN cause=trunc.word>s</del> see <del ed=NN cause=repetition>you you</del> you know <del ed=NN cause=false.start>it's</del> he's crazy ]]> To handle cases like "expletive deleted" in text A in Appendix 1, one would have to allow the del tag to be empty or use an editorial note (see below).

Spoken texts require special mechanisms for handling uncertain transcription. Uncertain speaker attribution was dealt with above; see Section 8.4. It is quite possible that "uncertain" should be allowed as a global attribute, providing for comments on uncertainty with respect to transcription (e.g.: the transcription of this u or s is uncertain) or classification (e.g.: this vocal is on the borderline of ...). A comment on uncertainty should include identification of the uncertainty and optionally a statement of the cause of the uncertainty, the degree of uncertainty, and the identity of the transcriber.
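If this were adopted, one might write, for example (a constructed illustration; the speaker id, wording, and transcriber initials are invented):
<![ CDATA[ <u who=A1 uncertain='transcription uncertain: background noise (NN)'>I think he said seven</u> ]]>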

Very often there will be uncertainty with respect to words or brief segments for which there is no tag to carry the "uncertain" attribute. For these cases we suggest the tag uncertain, with attributes for "transcriber" (where the value is the "id" given in the file header), "cause", and "degree" (high, low), as in:

<![ CDATA[ <uncertain transcr=NN cause= degree= >they're</uncertain> ]]> This tag occurs within the scope of u or s.

Alternative transcriptions can be identified by the tag var, used in recording variant readings in written texts, with attributes for "transcriber" and perhaps also "preference" (high, low, none) as in:

<![ CDATA[ <var> <alt transcr=NN pref= >they're</alt> <alt transcr=NN pref= >there're</alt> </var> ]]> A special type of variation is found where there is deviation from a script. This could be handled by alt tags with a "source" attribute ("id" of script vs. of speaker) or, alternatively, by an editorial note (see below).

For unintelligible words and segments we can use vocal, with the usual attributes (plus "uncertain"), as in:

<![CDATA[
<vocal who=A1 type=speech desc= uncertain= start= end= dur=3 units=sylls>
]]>
As a last resort, it is possible to insert an explanatory note, with an "editor" or "transcriber" attribute, as in:
<![CDATA[
<note transcr=NN>10 seconds untranscribable</note>
]]>
All the mechanisms for editorial comment should be coordinated with those for written texts.
<div2><head>Parallel representation

The issue of parallel representation naturally arises in the encoding of spoken texts, as shown by some of our example texts in Appendix 1 (particularly G and W). The main tool we suggest for parallel representation is the timeline, which we use to synchronize utterances, other types of vocalizations, pauses, paralinguistic features, kinesic features, and events. The timeline can also be used for multiple representation of utterances by a single speaker, as in:

<![CDATA[
<u>
<unit>
<level type=orth>
<pointer time=p1>cats <pointer time=p2>eat <pointer time=p3>mice <pointer time=p4>
</level>
<level type=phon>
<pointer time=p1>kats <pointer time=p2>i:t <pointer time=p3>mais <pointer time=p4>
</level>
</unit>
</u>
]]>
The pointers to the timeline establish the alignment between the two levels of representation.
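For completeness, the kind of timeline these pointers presuppose might look roughly as follows; the internal structure shown here (empty point elements carrying the "id" values p1-p4) is an assumption made for the purposes of illustration, since the timeline mechanism itself is defined at the beginning of Section 8:
<![CDATA[
<timeline>
<point id=p1>
<point id=p2>
<point id=p3>
<point id=p4>
</timeline>
]]>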

When aligning the recording itself with the transcription, the unit containing the parallel levels is the text body:

<![CDATA[
<text>
<timeline>...</timeline>
<unit>
<level type=transcription>...</level>
<level type=recording>...</level>
</unit>
</text>
]]>
These suggestions are preliminary and should be coordinated with the mechanisms proposed for linking different levels of linguistic analysis or parallel texts of a manuscript.
<div1><head>Transcription versus digitized recording

New technological developments make it possible to link a transcription computationally with a digitized recording. If this is done, we may require less of the transcription and can regard it as a scaffolding which gives access to particular points in the recording. The compilers of the Corpus of Spoken American English (cf. Du Bois et al. 1990) aim at producing material of this kind. Similarly, it should be possible to link a transcription with a video recording. The use of such material is, however, restricted by the ethical considerations which we drew attention to at the beginning of this paper (Section 2.2), and it seems premature at this stage to propose TEI encodings for such material (besides, it falls within the bounds of the TEI hypertext work group).
<div1><head>Suggestions for further work

Among the features taken up in this paper, some are more central than others. The minimal requirements for an encoding scheme for spoken machine-readable texts are mechanisms for coding: <list>
<item>word forms;
<item>basic text units;
<item>speaker attribution and overlap;
<item>a reference system;
<item>text documentation (cf. the end of Section 8.1).
</list>

Other basic requirements are mechanisms for coding prosodic features and editorial comment. One goal to strive for is some agreement on basic requirements for the encoding of spoken texts. We hope the present paper, and the comments it will provoke, will be a step towards that goal.

Among the issues which require more work we would like to single out: <list>
<item>phonological representation, both of prosodic and of segmental features (this is within the province of the working group on character sets);
<item>mechanisms for linking transcription, sound, and video, including possible parallels between the representation of speech and music (this is within the province of the hypertext work group and should be linked with the current HyTime proposals; cf. Goldfarb 1991);
<item>spoken discourse structure, including speech repair;
<item>in general: parallel representation, discontinuities, and concurrent hierarchies.
</list>

Last but not least, the proposals made in this paper need to be worked out further and formalized within the framework of SGML, to the extent that this is possible. As a first step, we provide a DTD fragment and brief text samples encoded according to our proposals; see Appendix 3. It is also highly desirable to work on the mapping between our proposals and major existing encoding schemes.
<div1><head>Spoken texts - a test case for the Text Encoding Initiative?

We can never get away from the fact that the encoding of speech, even the establishment of a basic text, involves a lot of subjective choice and interpretation. There is no blueprint for a spoken text; there is no one and only transcription. The most typical case is that a transcription develops cyclically, beginning with a rough draft, more detail being added as the study progresses.

It is no surprise that SGML can handle printed texts; after all, it was set up for this purpose. It remains to be seen whether the TEI application of SGML can be extended, in ways which satisfy the needs of users of spoken machine-readable texts, to handle the far more diffuse and shifting patterns of speech.
<div1><head>References
Allwood, J., J. Nivre, & E. Ahlsén. 1990. Speech Management - on the Non-written Life of Speech. Nordic Journal of Linguistics, 13, 3-48.
Atkinson, J.M. & J. Heritage (eds.). 1984. Structures of Social Action: Studies in Conversation Analysis. Cambridge: Cambridge University Press.
Autesserre, G. Perennou, & M. Rossi. 1991. Methodology for the Transcription and Labeling of a Speech Corpus. Journal of the International Phonetic Association, 19, 2-15.
Boase, S. 1990. London-Lund Corpus: Example Text and Transcription Guide. Survey of English Usage, University College London.
Brown, G., K.L. Currie, & J. Kenworthy. 1980. Questions of Intonation. London: Croom Helm.
Bruce, G. 1989. Report from the IPA Working Group on Suprasegmental Categories. Working Papers (Department of Linguistics and Phonetics, Lund University), 35, 25-40.
Bruce, G., forthcoming. Comments on the Paper by Jane Edwards. To appear in Svartvik (forthcoming).
Bruce, G. & P. Touati. 1990. On the Analysis of Prosody in Spontaneous Dialogue. Working Papers (Department of Linguistics and Phonetics, Lund University), 36, 37-55.
Chafe, W. 1980. The Pear Stories: Cognitive, Cultural and Linguistic Aspects of Narrative Production. Norwood, N.J.: Ablex.
Chafe, W., forthcoming. Prosodic and Functional Units of Language. To appear in Edwards & Lampert (forthcoming).
Crowdy, S. 1991. The Longman Approach to Spoken Corpus Design. Manuscript.
Du Bois, J.W., forthcoming. Transcription Design Principles for Spoken Discourse Research. To appear in IPrA Papers in Pragmatics (1991).
Du Bois, J.W., S. Schuetze-Coburn, D. Paolino, & S. Cumming. 1990. Discourse Transcription. Santa Barbara: University of California, Santa Barbara.
Du Bois, J.W., S. Schuetze-Coburn, D. Paolino, & S. Cumming, forthcoming. Outline of Discourse Transcription. To appear in Edwards & Lampert (forthcoming).
Edwards, J.A. 1989. Transcription and the New Functionalism: A Counterproposal to CHILDES' CHAT Conventions. Berkeley Cognitive Science Report 58. University of California, Berkeley.
Edwards, J., forthcoming. Design Principles in the Transcription of Spoken Discourse. To appear in Svartvik (forthcoming).
Edwards, J.A. & M.D. Lampert (eds.), forthcoming. Talking Language: Transcription and Coding of Spoken Discourse. Hillsdale, N.J.: Lawrence Erlbaum Associates, Inc.
Ehlich, K., forthcoming. HIAT - A Transcription System for Discourse Data. To appear in Edwards & Lampert (forthcoming).
Ehlich, K. & B. Switalla. 1976. Transkriptionssysteme: Eine exemplarische Übersicht. Studium Linguistik, 2, 78-105.
Ellmann, R. 1988. Oscar Wilde. New York: Knopf.
Esling, J.H. 1988. Computer Coding of IPA Symbols. Journal of the International Phonetic Association, 18, 99-106.
Faerch, C., K. Haastrup, & R. Phillipson. 1984. Learner Language and Language Learning. Copenhagen: Gyldendal.
Goldfarb, C.F. (ed.). 1991. Information Technology - Hypermedia/Time-based Structuring Language (HyTime). Committee draft international standard ISO/IEC CD 10744.
Gumperz, J.J. & N. Berenz, forthcoming. Transcribing Conversational Exchanges. To appear in Edwards & Lampert (forthcoming).
Hout, R. van. 1990. From Language Behaviour to Database: Some Comments on Plunkett's Paper. Nordic Journal of Linguistics, 13, 201-205.
International Phonetic Association. Report on the 1989 Kiel Convention. Journal of the International Phonetic Association, 19, 67-80.
Jassem, W. 1989. IPA Phonemic Transcription Using an IBM PC and Compatibles. Journal of the International Phonetic Association, 19, 16-23.
Johansson, S., forthcoming. Encoding a Corpus in Machine-readable Form. To appear in B.T.S. Atkins et al. (eds.), Computational Approaches to the Lexicon: An Overview. Oxford University Press.
Knowles, G. 1991. Prosodic Labelling: The Problem of Tone Group Boundaries. In S. Johansson & A.-B. Stenström (eds.), English Computer Corpora: Selected Papers and Research Guide. Berlin & New York: Mouton de Gruyter. 149-163.
Knowles, G. & L. Taylor. 1988. Manual of Information to Accompany the Lancaster Spoken English Corpus. Lancaster: Unit for Computer Research on the English Language, University of Lancaster.
Kytö, M. 1990. Introduction to the Use of the Helsinki Corpus: Diachronic and Dialectal. In Proceedings from the Stockholm Conference on the Use of Computers in Language Research and Teaching, September 7-9, 1989. Stockholm Papers in English Language and Literature 6. Stockholm: English Department, University of Stockholm. 41-56.
Labov, W. & D. Fanshel. 1977. Therapeutic Discourse: Psychotherapy as Conversation. New York: Academic Press.
Lanza, E. 1990. Language Mixing in Infant Bilingualism: A Sociolinguistic Perspective. Ph.D. thesis. Georgetown University, Washington, D.C.
Loman, B. 1982. Om talspråkets varianter. In B. Loman (ed.), Språk och samhälle. Lund: Lund University Press. 45-74.
MacWhinney, B. 1988. CHAT Manual. Pittsburgh, PA: Department of Psychology, Carnegie Mellon University.
MacWhinney, B. 1991. The CHILDES Project. Hillsdale, N.J.: Lawrence Erlbaum Associates, Inc.
Melchers, G. 1972. Studies in Yorkshire Dialects. Based on Recordings of 13 Dialect Speakers in the West Riding. Part II. Stockholm: English Department, University of Stockholm.
Ochs, E. 1979. Transcription as Theory. In E. Ochs & B. Schieffelin (eds.), Developmental Pragmatics. New York: Academic Press. 43-72.
Pedersen, L. & M.W. Madsen. 1989. Linguistic Geography in Wyoming. In W.A. Kretzschmar et al. (eds.), Computer Methods in Dialectology. Special issue of Journal of English Linguistics 22.1 (April 1989). 17-24.
Pittenger, R.E., C.F. Hockett, & J.J. Danehy. 1960. The First Five Minutes. A Sample of Microscopic Interview Analysis. Ithaca, N.Y.: Paul Martineau.
Plunkett, K. 1990. Computational Tools for Analysing Talk. Nordic Journal of Linguistics, 13, 187-199.
Rosta, A. 1990. The System of Preparation and Annotation of I.C.E. Texts. Appended to International Corpus of English, Newsletter 9 (ed. S. Greenbaum). University College London.
Sinclair, J.McH. & R.M. Coulthard. 1975. Towards an Analysis of Discourse: The English Used by Teachers and Pupils. London: Oxford University Press.
Sperberg-McQueen, C.M. & L. Burnard (eds.). 1990. Guidelines for the Encoding and Interchange of Machine-readable Texts. Draft version 1.0. Chicago & Oxford: Association for Computers and the Humanities / Association for Computational Linguistics / Association for Literary and Linguistic Computing.
Svartvik, J. (ed.). 1990. The London-Lund Corpus of Spoken English: Description and Research. Lund Studies in English 82. Lund: Lund University Press.
Svartvik, J. (ed.), forthcoming. Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin: Mouton de Gruyter.
Svartvik, J. & R. Quirk (eds.). 1980. A Corpus of English Conversation. Lund Studies in English 56. Lund: Lund University Press.
Terkel, S. 1975. Working. People Talk About What They Do All Day and How They Feel About What They Do. New York: Avon Books.
The White House Transcripts. Submission of Recorded Presidential Conversations to the Committee on the Judiciary of the House of Representatives by President Richard Nixon. By the New York Times Staff for The White House Transcripts. New York: Bantam Books. 1974.
Wells, J.C. 1987. Computer-coded Phonetic Transcription. Journal of the International Phonetic Association, 17, 94-114.
Wells, J.C. 1989. Computer-coded Phonemic Notation of Individual Languages of the European Community. Journal of the International Phonetic Association, 19, 31-54.