Using XML in the Real World

What is XML for? How is it best used? What tools are available?

What is XML for?

- exchanging information
  - between people
  - between people and machines
  - between machines
- preserving information
  - without usage-dependency
  - without medium-dependency
  - independent of time, space, and language

Delivering information

XML is a good way of representing information. But how about

- delivering XML content on the web ... and on paper
- storing and managing XML documents ... and virtual documents

Can we get the best of both worlds?

What tools do we need?

- Appropriately expressive languages (e.g. TEI XML)
- Syntax-checking document creation tools (aka editors)
- Document transformation tools
- Document delivery tools
- Document storage and management tools
- Programming interfaces for a variety of languages

Generic languages

- DOM: Document Object Model Level 2
- XML Schema: description of structures and data types
- XPath: addressing parts of an XML document
- XSLT: transforming XML documents for use with XSL
- XSL: Extensible Stylesheet Language
- XLink: XML Linking Language
- XPointer: XML Pointer Language

... specialised (but generic) languages ...

- SVG: Scalable Vector Graphics
- MathML: Mathematical Markup Language
- RDF: Resource Description Framework
- SMIL: Synchronised Multimedia Integration Language

... etc etc etc

Document creation and editing

There's an ever-expanding choice of XML editing tools:

- Plain text editors, typing < and > by hand (e.g. Notepad)
- Customised plain text editors, with built-in tagging (e.g. Notetab)
- Customised programming editors (notably GNU Emacs)
- Word processors with XML add-ons (e.g. WordPerfect)
- Data-oriented XML editors (e.g. XML Spy)
- Document-oriented XML editors (e.g. XMetal)

And there's also the XML that gets generated without anyone noticing...

Document transformation tools

A stylesheet allows you to define how XML elements are to be transformed.

- Extensible Stylesheet Language Transformations (XSLT): a fully-featured transformation language
- Cascading Style Sheets (CSS): allows you to add formatting styles (only) to your document
- A variety of proprietary stylesheet languages also exists, tied to specific software
- Or you can use whatever software you like to map XML into something else (e.g. LaTeX, nroff, RTF, FrameMaker)

Transformation tools

- XSLT-based: many, but varying in implementation level; we currently recommend Saxon
- proprietary: legacy SGML systems like Balise, Omnimark; new scripting schemes like XML Script
- generic software: easier to develop with XML-aware libraries, written to a standard API such as DOM

Typical transformation jobs

- Render foo elements in italics
- Render foo elements within bar elements in italics
- Insert the text "Foo number" and the value of its number attribute in front of every foo
- Indent every p element by 1 em, except for the first one in a div
- Take the first head element inside each div and add it to a table of contents
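Two of these jobs can be sketched with generic software rather than a stylesheet. This is only a minimal illustration, assuming Python's standard xml.etree.ElementTree library; the sample document is invented, and "inside each div" is taken to mean a direct child.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element names follow the job list above.
SAMPLE = (
    "<text>"
    "<div><head>Introduction</head><p>Some <foo>marked</foo> text.</p></div>"
    "<div><head>Methods</head><head>Second head</head><p>More text.</p></div>"
    "</text>"
)

root = ET.fromstring(SAMPLE)

# Table of contents: the first head element that is a child of each div.
toc = [
    div.find("head").text
    for div in root.iter("div")
    if div.find("head") is not None
]

# Render foo elements in italics by renaming them to the HTML i element.
# The list() call fixes the set of elements before any tag is changed.
for foo in list(root.iter("foo")):
    foo.tag = "i"

first_p = ET.tostring(root.find("div/p"), encoding="unicode")

print(toc)
print(first_p)
```

A stylesheet would state the same rules declaratively; here the order of operations is spelled out by hand.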

Less obvious transformation jobs

- Count foo elements occurring within bar elements
- Sort all foo elements by the value of their which attribute, suppressing duplicates
- Display only foo elements whose which attribute has the same value as a bar element elsewhere
- Display every p element containing some string
- Display the parent element of every foo element, sorting them by the value of the which attribute on the last bar element they contain
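The first two of these jobs might look as follows in generic code. This is a hedged sketch, assuming Python's standard xml.etree.ElementTree; the sample data is invented, and the count assumes that bar elements are never nested inside one another.

```python
import xml.etree.ElementTree as ET

# Invented sample; foo, bar and which are the placeholder names used above.
SAMPLE = """<doc>
  <bar><foo which="b"/><foo which="a"/></bar>
  <bar><foo which="a"/></bar>
  <foo which="c"/>
</doc>"""

root = ET.fromstring(SAMPLE)

# Count foo elements occurring within bar elements (at any depth).
# Assumption: bar elements are not nested, so no foo is counted twice.
foo_in_bar = sum(len(list(bar.iter("foo"))) for bar in root.iter("bar"))

# Sort the values of the which attribute on all foo elements,
# suppressing duplicates by collecting them in a set first.
unique_which = sorted({foo.get("which") for foo in root.iter("foo")})

print(foo_in_bar)
print(unique_which)
```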

XML parsers and validators

Whether embedded or free-standing, validation is an integral part of XML document processing.

There are lots of products, both free and commercial:

- in Java, from Sun, Oracle, and IBM as well as individuals
- in C, embedded in Perl and various applications like Netscape
- in C++, from IBM
- something in more or less any language you like, from Python to Dylan
- ... plus all the existing SGML software
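As a small illustration of an embedded parser, the sketch below checks well-formedness only. It assumes Python's standard xml.etree.ElementTree, which does not validate against a DTD; the function name is_well_formed is invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; this is not DTD validation."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = is_well_formed("<div><p>ok</p></div>")
bad = is_well_formed("<div><p>unclosed</div>")

print(good, bad)
```

Full validation against a DTD or schema needs a validating parser, which is outside what this sketch shows.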

Processing strategies

An XML document is a serialized tree structure. How should it be processed?

There are three currently favoured approaches:

- event-based (e.g. SAX)
- tree-based (e.g. DOM)
- declarative or functional (e.g. XSLT)
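The difference between the first two approaches can be seen by counting p elements both ways. This is a minimal sketch, assuming Python's standard xml.sax and xml.dom.minidom modules; the sample document and the ParagraphCounter class are invented for illustration. The declarative approach is not shown, since the Python standard library has no XSLT processor.

```python
import xml.dom.minidom
import xml.sax

SAMPLE = "<div><head>Title</head><p>One</p><p>Two</p></div>"


class ParagraphCounter(xml.sax.ContentHandler):
    """Event-based: react to each start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "p":
            self.count += 1


# SAX: the document streams past the handler; no tree is kept in memory.
handler = ParagraphCounter()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
sax_count = handler.count

# DOM: the whole document is built as a tree first, then queried.
dom = xml.dom.minidom.parseString(SAMPLE)
dom_count = len(dom.getElementsByTagName("p"))

print(sax_count, dom_count)
```

The event-based version scales to documents too large to hold in memory; the tree-based version allows random access to any node.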

XML on the web

Eventually, all web user agents (browsers) will be XML aware! Until they are, we have to choose:

- transform XML to HTML on the server (statically)
- transform XML to HTML on the server (dynamically, using a servlet)
- render XML on the client using CSS, or dynamically with some kind of plugin

XML on the web: typical architecture

XML on paper

The combination of XML, XSLT and a good FO (formatting objects) engine could do away with the need for expensive proprietary DTP and word processing systems.

It hasn't happened yet, but it might...

Storage strategies

Data has to be stored somewhere. How should XML data be managed? There are several possibilities:

- as discrete XML documents
- within any convenient DBMS
- within an XML repository

XML documents

In the traditional docucentric world...

- information is stored in XML documents, somewhere, and in some form
- entities give some degree of modularity
- but there has to be centralized naming and management for version control, integrity, etc.

<!ENTITY doc1 SYSTEM "docs/frag1.xml">
<!ENTITY doc2 SYSTEM "docs/frag2.xml">

<?xml version="1.0" ?>
<!DOCTYPE theDoc SYSTEM "theDTD.dtd" [
  <!ENTITY % theDocList SYSTEM "theDocs.ent">
  %theDocList;
]>
<theDoc>
  &doc1;
  &doc2;
</theDoc>

The docucentric world

Good points:

- conceptually clear
- robust and portable

Less good points:

- everything must be an XML entity
- may appear inflexible or redundant

Virtual documents

Storage is a special kind of processing, like formatting, requiring a transformation in and out of some storage format. So we could

- store information in non-XML formats (optimized for specific functions, e.g. text retrieval or relational tables)
- recover all and only the information needed from the store, in the form of a dynamically-generated XML document/fragment

In an XML repository, access should be in XML terms; at present, there is usually a need for some mapping process.
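The idea of a virtual document can be sketched with a relational store and a small mapping function. This is only an illustration, assuming Python's standard sqlite3 and xml.etree.ElementTree modules; the para table, its columns, and the div_as_xml function are all invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical store: one row per paragraph, keyed by division and order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE para (div_id INTEGER, seq INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO para VALUES (?, ?, ?)",
    [(1, 1, "First paragraph"), (1, 2, "Second paragraph"), (2, 1, "Elsewhere")],
)


def div_as_xml(conn, div_id):
    """Generate an XML fragment for one division from its stored rows."""
    div = ET.Element("div", {"n": str(div_id)})
    rows = conn.execute(
        "SELECT content FROM para WHERE div_id = ? ORDER BY seq", (div_id,)
    )
    for (content,) in rows:
        ET.SubElement(div, "p").text = content
    return ET.tostring(div, encoding="unicode")


fragment = div_as_xml(conn, 1)
print(fragment)
```

Only the rows for the requested division are read, so the fragment contains all and only the information asked for.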

XML databases: the options

- Store some information as relations, and some as XML (e.g. ProtCem)
- Store the XML structure as relations but expose only an XML view (e.g. Phelix)
- Store and expose only XML (e.g. Meerkat and other RSS-based services)

DBMS or XML?

Do you have to choose?

The argument from history

- flat files gave way to network DBMS
- network DBMS gave way to relational
- will relational DBMS give way to OODBs?

Getting the best of both worlds

- DBMS are good at storing and managing relations
- the equivalent XML technologies are not yet mature
- but DBMS can be cajoled into presenting their contents in XML terms

Delivery strategies

- Our goal is fast and efficient access to any subtree of the docuverse, of any size
- XPath has an adequately rich semantics
- XSLT has an adequately rich syntax (we think)
- The rest is a Simple Matter of Programming...

Delivery strategies (today)

- Small-scale solution: use XSLT and XPath expressions
- Untidy solution: store XML in a conventional database and do textual search
- High-tech solution: pre-index all text in all elements, and provide a one-off front-end application
- Low-tech solution: use an XML-ified grep-like utility to search documents (LTXML tools)
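The small-scale and low-tech solutions can be approximated in a few lines. This is a rough sketch, assuming Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath; the sample document is invented, and the search shown is a stand-in for a grep-like utility, not the LTXML tools themselves.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
SAMPLE = """<doc>
  <div n="1"><p>XML on the web</p><p>XML on paper</p></div>
  <div n="2"><p>Storage strategies</p></div>
</doc>"""

root = ET.fromstring(SAMPLE)

# Address a subtree with a path expression:
# every p child of the div whose n attribute is "2".
in_div2 = [p.text for p in root.findall(".//div[@n='2']/p")]

# Grep-like search: every p element whose text content contains a string.
hits = [
    "".join(p.itertext())
    for p in root.iter("p")
    if "paper" in "".join(p.itertext())
]

print(in_div2)
print(hits)
```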

Development strategies

- XML began as a way of smuggling SGML onto the web...
- ... but seems to have taken over as the industry's driving force
- Where will XML take us in the next few years?
- What should we expect to be able to do?