Text Encoding Initiative
Migrating the BNC into XML
Hidden away on the second of the two CDs on which the BNC World Edition is delivered, there is a folder called SGML. This is not installed by default, but you can easily copy it from the CD. It contains the SGML version of the BNC DTD, in expanded and unexpanded forms, and some sample driver files for manipulating BNC documents in an SGML environment. Its use and contents are described in some detail in the relevant chapter of the User Reference Guide.

I have now produced an equivalent set of files for those wishing to process the BNC in an XML context. You can download these files as a single zip archive from here. The rest of this document discusses how to use these files, and how to convert the BNC into XML.

Before reading on, you should unpack the bnc-xml.zip archive into a directory called XML. You will also need access to the following bits of software:
Converting the BNC DTD

The BNC DTD was designed as a customization of the TEI system (http://www.tei-c.org). Among the reasons why this turns out to have been a very good idea is that the task of converting the DTD from SGML to XML becomes almost entirely automatic, just as it is for any other TEI-conformant DTD. You can use the Pizza Chef to generate an XML DTD directly, or you can modify the driver files to invoke a TEI XML DTD. Here's how.

Using the Pizza Chef
You now have an XML-conformant version of the BNC DTD. To test that it works, open up the file driver-local.xml in your XML directory, using your favourite XML-aware editor. This driver file will use the DTD in the file bnc-xml.dtd to try to parse the corpus header (a corrected version of which is included in your XML directory in the file corphdr) along with two sample corpus files, ABC and KSV. If all is well, you should get only error messages complaining that these two files are missing, or not valid (or even well-formed) XML.

You now need to convert the BNC texts to XML. Proceed to section Converting BNC SGML files into XML!

Modifying the driver files

You don't need to read this section unless you want to know more about how the bits of a TEI DTD are put together, without using the Pizza Chef.

The TEI, as you are probably aware, is a modular system that allows you to build many DTDs from it. The Pizza Chef behaves rather like a compiler in that it transforms a set of declarations (each of which selects or modifies a TEI module) into a single non-revisable and customized set of declarations. You can see how this is done in more detail by looking at the way a driver file is constructed. Open the file driver1.sgm in your XML directory, using your favourite editor (i.e. emacs).
It looks like this:

    <!DOCTYPE bnc SYSTEM "http://www.hcu.ox.ac.uk/TEI/Guidelines/DTD/tei2.dtd" [
    <!ENTITY % TEI.prose "INCLUDE">
    <!ENTITY % TEI.spoken "INCLUDE">
    <!ENTITY % TEI.general "INCLUDE">
    <!ENTITY % TEI.analysis "INCLUDE">
    <!ENTITY % TEI.corpus "INCLUDE">
    <!ENTITY % TEI.extensions.ent SYSTEM "/home/BNC/SGML/bncMods.ent">
    <!ENTITY % TEI.extensions.dtd SYSTEM "/home/BNC/SGML/bncMods.dtd">
    <!ENTITY % BNCchars SYSTEM "/home/BNC/SGML/BNCchars.ent">
    %BNCchars;
    <!ENTITY corphdr SYSTEM "/home/BNC/Texts/corphdr">
    <!ENTITY text SYSTEM "/home/BNC/Texts/A/AB/ABC">
    ]>
    <bnc>
    &corphdr;
    &text;
    </bnc>

You now need to modify the SYSTEM declarations so that files are picked up from the right places on your local system, but the principle behind what is going on here should be clear.

To produce a similar driver file for use with the XML version of the TEI DTD, you will need to make a few changes in this file. Make a copy, renaming it driver.xml, and proceed as follows:
The SYSTEM identifier given for the underlying TEI DTD will work, but it may not continue to do so. The canonical URL is now http://www.tei-c.org/Guidelines/DTD/tei2.dtd: this will always invoke the most recent revision (currently P4X) of the TEI system. If you are running offline, or using a system that cannot cope with URLs as system identifiers (yes, they do exist), you may prefer to download and install the TEI system locally, as further described at http://www.tei-c.org.uk/Guidelines/DT.html, in which case you will be able to supply a local filename here.

Your driver.xml file should now look something like this:

    <!DOCTYPE bnc SYSTEM "http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
    <!ENTITY % TEI.XML "INCLUDE">
    <!ENTITY % TEI.prose "INCLUDE">
    <!ENTITY % TEI.spoken "INCLUDE">
    <!ENTITY % TEI.general "INCLUDE">
    <!ENTITY % TEI.analysis "INCLUDE">
    <!ENTITY % TEI.corpus "INCLUDE">
    <!ENTITY % TEI.extensions.ent SYSTEM "bncMods.ent">
    <!ENTITY % TEI.extensions.dtd SYSTEM "bncMods.dtd">
    <!ENTITY % BNCchars SYSTEM "BNCchars.ent">
    %BNCchars;
    <!ENTITY corphdr SYSTEM "corphdr">
    <!ENTITY ABCtext SYSTEM "ABC">
    <!ENTITY KSVtext SYSTEM "KSV">
    ]>
    <bnc>
    &corphdr;
    &ABCtext;
    &KSVtext;
    </bnc>

Try giving it to your favourite XML parser. All being well, you should see no errors, other than a series of complaints about the fact that the corpus texts ABC and KSV are not in XML. We will fix that in the next section.

Converting BNC SGML files into XML

At first glance, the task of converting SGML to XML looks pretty simple. You just need to add end-tags where they have been omitted, and quote attribute values where necessary: the sort of thing any self-respecting perl hacker can whip up a script for in the time it takes to boil the kettle for tea, right? Wrong. Don't even think about doing this without using the proper tools. Fortunately, there is some very good, very reliable software which will do the job properly, and it doesn't even cost you money.
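As a rough illustration of why a quick script falls short, the sketch below applies a naive attribute-quoting substitution to a tiny SGML fragment. The fragment is invented for this example and is not taken from the BNC; the function names `naive_quote` and `count` are likewise only illustrative. Quoting the attribute values is easy enough, but the omitted end-tags are still missing afterwards, and working out where they belong requires the DTD, which is what a real SGML parser uses.

```shell
#!/bin/bash
# Hypothetical SGML fragment (invented for illustration, not BNC data):
# attribute values are unquoted and the end-tags for <s> and <p> are omitted.
sgml='<p><s n=1>Hello<s n=2>World'

# Naive fix: wrap bare alphanumeric attribute values in double quotes.
naive_quote() {
    sed -E 's/=([A-Za-z0-9]+)/="\1"/g'
}

# Count occurrences of a fixed string read from standard input.
count() {
    grep -oF "$1" | wc -l | tr -d ' '
}

quoted=$(printf '%s\n' "$sgml" | naive_quote)
opens=$(printf '%s\n' "$quoted" | count '<s ')
closes=$(printf '%s\n' "$quoted" | count '</s>')

# The attributes are now quoted, but no end-tag has been supplied,
# so the result is still not well-formed XML.
printf '%s\n' "$quoted"
printf 'start-tags: %s, end-tags: %s\n' "$opens" "$closes"
```

Run as written, this prints the fragment with quoted attributes and reports two `<s>` start-tags against zero end-tags.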
You need OpenSP (based on SP by James Clark, itself based on SGMLS by James Clark, which was based on ARCSGML by Charles Goldfarb). OpenSP is maintained by the openjade project and contains a number of related utilities: an SGML or XML parser, a normaliser, and a utility called osx designed specifically for converting SGML documents to XML. As of release 1.5 of the OpenSP system, osx has several important enhancements which are missing in earlier releases (including some of the pre-releases of version 1.5).

The software is distributed in source form from SourceForge, and is easily built on most GNU/Linux platforms. It has also been built successfully on Mac OS X (the binaries are available here for download). Regrettably, we have not yet found a binary for the Microsoft Windows environment: if you're stuck with Windows, therefore, you will either have to run the system under Cygwin, or use an earlier release (which means doing without some of the more useful enhancements).

The documentation for osx needs to be read carefully. For most purposes, we recommend using the following options for conversion of BNC files:
To test this, try running the following command from within your XML directory:

    osx -xno-nl-in-tag -xno-expand-external -xno-expand-internal bnc.dec driver-local.sgm

All being well, this should produce the following output on your screen:

    <?xml version="1.0"?>
    <!DOCTYPE bnc [
    <!ENTITY % external-entities SYSTEM "extEntities.dtf">
    %external-entities;
    <!ENTITY % internal-entities SYSTEM "intEntities.dtf">
    %internal-entities;
    ]>
    <bnc TEIform="teiCorpus.2">&corphdr;&ABCtext;&KSVtext;</bnc>

and also create files corphdr.xml, ABC.xml, and KSV.xml. The .dtf files contain dummy entity declarations for the internal and external entities found when osx parsed your document. The .xml files contain XML versions of the corresponding SGML input files. To check that these are in fact valid XML, you should now run them against the XML DTD you made at the beginning of this page, by giving your XML parser the file driver-local.xml again.

So they're in XML, so what?

XML files can be readily reformatted and processed using an XSLT stylesheet to produce other kinds of XML, or other output. This is not the place to learn about XSLT, but just for fun we provide a few sample XSLT stylesheets:
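To use these stylesheets with individual files, you will need to prefix each file with a DOCTYPE statement. This is necessary because you retained the mnemonic names for character entity references, and you therefore need to include a DTD which declares the character entities. The easiest way of doing this might be to use a shell script like the following:

    #!/bin/bash
    # Usage: transmogrify FILE.xml STYLESHEET.xsl
    # Prefix FILE.xml with a DOCTYPE that declares the character entities,
    # then apply STYLESHEET.xsl to the result.
    # The temporary file is named after the process ID and is created in the
    # current directory so that the relative DTD names still resolve.
    tmp="$$"
    trap 'rm -f "$tmp"' EXIT
    echo "<!DOCTYPE bncDoc SYSTEM 'bnc-xml.dtd' [" > "$tmp"
    echo "<!ENTITY % chars SYSTEM 'bnc-xml-chars.dtd'> %chars; ]>" >> "$tmp"
    cat "$1" >> "$tmp"
    xsltproc "$2" "$tmp"

If this script is saved as transmogrify, then the command

    transmogrify ABC.xml pretty.xsl

will generate a pretty-printed version of the ABC text.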
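One caveat with the script above: in XML, a DOCTYPE declaration has to come after the XML declaration (`<?xml ...?>`) if the file has one. Whether your converted files start with such a declaration depends on how they were produced, so check your own output. The following is only a sketch of a variant that copes with both cases, assuming that any XML declaration occupies the whole first line of the file; the function names `add_doctype` and `print_doctype` are invented for this example.

```shell
#!/bin/bash
# Print the two-line DOCTYPE used by the transmogrify script.
print_doctype() {
    printf '%s\n' "<!DOCTYPE bncDoc SYSTEM 'bnc-xml.dtd' ["
    printf '%s\n' "<!ENTITY % chars SYSTEM 'bnc-xml-chars.dtd'> %chars; ]>"
}

# Write FILE to standard output with the DOCTYPE inserted.
# Assumption: an XML declaration, if present, fills the whole first line.
add_doctype() {
    local file=$1
    local first
    # Read only the first line; tolerate a file with no trailing newline.
    IFS= read -r first < "$file" || true
    case $first in
        '<?xml'*)
            # Keep the XML declaration first, then the DOCTYPE, then the rest.
            printf '%s\n' "$first"
            print_doctype
            tail -n +2 "$file"
            ;;
        *)
            # No XML declaration: the DOCTYPE can simply go first.
            print_doctype
            cat "$file"
            ;;
    esac
}
```

For example, `add_doctype ABC.xml > ABC-dt.xml` followed by `xsltproc pretty.xsl ABC-dt.xml` should behave like the transmogrify script. The output file is written to the same directory as the DTD files because the relative names in the DOCTYPE are normally resolved against the location of the document being parsed.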