TEI: British National Corpus Samples


The British National Corpus

This subdirectory contains some files demonstrating a particularly tricky SGML to XML conversion problem! It has been prepared for the use of the TEI Workgroup on SGML/XML Migration only; files herein are copyright and may not be re-distributed without permission. See the BNC website cited below for licensing conditions.

The British National Corpus (BNC) is a 100 million word corpus of modern British English, in which each distinct word has a part-of-speech (POS) code attached to it, as well as all the usual TEI flummery. This makes it Big. The DTD for the current release of the BNC is TEI compliant, inasmuch as
  • its DTD can be expressed as TEI plus a pair of TEI modification files
  • it comes with documentation explaining how the DTD has been derived from P3 (and also, confusingly, how it differs from that used by the original release of the BNC which predated publication of P3)

The documentation is included here in the original XML source and may also, more easily perhaps, be read at the BNC website starting from The BNC online user guide; in both cases there is a minor error, which I leave the discerning reader to discover.

This archive also contains:

The whole directory is also available as a single zip archive.

On my system, in this directory, typing either of the following lines
  nsgmls -s sgmldecl driver.sgm
  nsgmls -s -c TEIcatalog sgmldecl driver.sgm
satisfyingly produces no error messages, which indicates a successful parse. Your mileage may, as they say, vary.
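
For anyone who wants to repeat that check without watching the terminal, here is a minimal sketch of a wrapper script. It assumes nsgmls (from the SP package) is on your PATH and that you run it in this directory; the helper name check_silent is hypothetical and is not part of SP or of the BNC distribution. It treats a run as good only if the command exits with status 0 and prints nothing at all.

```shell
#!/bin/sh
# check_silent: hypothetical helper (not part of SP or the BNC files).
# Runs the given command and succeeds only if it exits 0 and prints nothing.
check_silent() {
    # With -s, nsgmls suppresses its normal output, so anything
    # captured here (stdout or stderr) is a diagnostic message.
    output=$("$@" 2>&1)
    status=$?
    if [ "$status" -eq 0 ] && [ -z "$output" ]; then
        echo "OK: $*"
        return 0
    fi
    echo "FAILED: $*"
    if [ -n "$output" ]; then
        printf '%s\n' "$output"
    fi
    return 1
}

# Run the two parses only when nsgmls and the sample files are present.
if command -v nsgmls >/dev/null 2>&1 && [ -f sgmldecl ] && [ -f driver.sgm ]; then
    check_silent nsgmls -s sgmldecl driver.sgm || true
    if [ -f TEIcatalog ]; then
        check_silent nsgmls -s -c TEIcatalog sgmldecl driver.sgm || true
    fi
fi
```

Checking both the exit status and the captured text is deliberately strict: a warning printed alongside a zero exit status is reported as a failure, so the messages can be inspected.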

Lou Burnard, on Guy Fawkes Day, 2002