TEI/ISO Joint Activity on Feature Structures: Meeting held 4-5 Nov 03 at ATILF, Université de Nancy 2.


Participants

  1. Kiyong Lee (Korea Univ, KATS)
  2. Gui-Hyun Hong (KATS)
  3. Eric de la Clergerie (INRIA)
  4. Lionel Clément (INRIA)
  5. Azim Roussonaly (LORIA)
  6. Harry Bunt (Tilburg)
  7. Laurent Romary (LORIA, TEI)
  8. Thierry Declerck (Saarbrücken)
  9. Lou Burnard (Oxford, TEI)
  10. Syd Bauman (Brown, TEI)
  11. Claude Roux (Xerox)
  12. Jean-Michèle Borde (AFNOR)
  13. Tony Hittema (AFNOR)
  14. Tomaz Erjavec (Ljubljana, TEI)

Introduction

Welcoming the group to its first face-to-face meeting, Laurent Romary noted that its work was being undertaken at a propitious moment, when the NLP community's increasing use of the feature structure formalism and consequent need for a stable interchange format coincided with the major revision of the TEI Guidelines leading to P5. He gave a brief recap of the ISO procedures to be followed and the current status of the work being undertaken: the New Work Item proposal submitted in February had been approved as a committee draft (ISO/CD 24610-1) in June, and a request for ballot/comments issued. A respectable number of comments on it had been received: these had been collated by the Activity leader, Kiyong Lee, and the business of the meeting was to respond to them and reach a consensus that could form the basis of a second draft able to progress to DIS stage. This document, which should be feature-complete, would then be sent for a final ballot to all ISO members. The group aimed to complete substantial work on the DIS at the next meeting, to be held in Jeju, Korea in February 2004.

Laurent noted that in the TEI Guidelines, feature structure representation and declaration were dealt with separately, in chapters FS and FD respectively. The current draft dealt with only the former; it was not yet clear whether a declarative vocabulary like that of the TEI Feature System Declaration was the most appropriate method of providing the latter. Hence, we were addressing only representation issues in this part of the draft ISO standard.

Responding to the request from the German delegation for a survey of existing implementations, it was agreed that such a survey should be undertaken for eventual inclusion as an informative annex to the Standard, or as a free-standing TEI document. This work would be undertaken by Thierry Declerck in collaboration with Lou Burnard, since the TEI was also keen to document implementations of all aspects of the Guidelines.

Lou gave a brief summary of the history of the TEI proposal, and asked what the future status of the standard would be with respect to the Guidelines as a whole. Would the ISO proposal become the only authoritative source of the standard, or should the TEI continue to maintain it as a part of the Guidelines? It was agreed that the TEI should continue to be responsible for the standard, and it should be maintained in parallel.

Lou also gave a brief description of the ODD ("One Document Does it all") XML vocabulary in which the Guidelines are maintained, which was currently undergoing major revision. ODD has several useful features for technical documentation, notably the ability to generate formal specifications in DTD or schema languages as well as documentation from a single XML source. The current ISO draft contained a substantial part (sections 1, 2, 3 and 4) which did not use this format, and a technical part (section 5) which was not currently in that format but had been derived from a document in an earlier form of it.

It was agreed to edit this TEI source to bring it into convergence with the revised form of the draft ISO standard, and to convert the remainder of the draft to a compatible XML format. Lou noted that in principle generation of the whole document in an ISO-approved format should be easily done by modification to existing stylesheets and associated document processing tools.

Kiyong Lee presented a brief overview of the comments received: they had come from eight individuals who were either members of the editorial committee or ISO experts, representing six different national bodies. Comments ranged widely between general remarks and detailed technical criticisms and suggestions as well as including many helpful editorial suggestions. A version of the draft document in which all the comments had been collated was circulated to all present, and formed the basis of ensuing discussion.

General comments

[JP: Hasida] We agreed that the semantics of feature structures could be represented in other ways, but the proposal was to start from the TEI formalism, since implementations for it already existed. Moreover, its more generic approach meant that documents using it could be readily converted to other more specific representations (for example, to transform <f name="foo"> to <foo>), whereas the reverse was not the case. It was agreed that this design rationale should be included in the introduction.
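
The one-way nature of this conversion can be illustrated with a minimal Python sketch. It assumes a simplified input in which each <f> element holds either plain text or a nested <fs>; the function name generic_to_specific and the sample features are hypothetical, and feature names are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET


def generic_to_specific(fs):
    """Convert <fs> with generic <f name="..."> children into an element
    whose children are named after the features themselves."""
    out = ET.Element("fs")
    for f in fs.findall("f"):          # direct <f> children only
        child = ET.SubElement(out, f.get("name"))
        nested = f.find("fs")
        if nested is not None:
            # A nested feature structure is converted recursively and its
            # converted children are attached under the feature's element.
            child.extend(generic_to_specific(nested))
        else:
            child.text = (f.text or "").strip()
    return out


generic = ET.fromstring(
    '<fs><f name="orth">dogs</f>'
    '<f name="agr"><fs><f name="num">pl</f></fs></f></fs>'
)
specific = generic_to_specific(generic)
print(ET.tostring(specific, encoding="unicode"))
# prints <fs><orth>dogs</orth><agr><num>pl</num></agr></fs>
```

Going the other way would require knowing in advance which element names denote features, which is the asymmetry noted above.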

On the question of using a schema to perform structural validation, it was agreed that the present document did not address the issue. Such validation could be done by a specific XML schema, or by a rule-based system such as the TEI Feature System Declaration (FSD): this issue would be addressed in Part 2, which should be briefly mentioned here.

On the relation of this proposal to other standards for Knowledge Representation or RDF, we agreed that the scope of the proposal should be clarified. This document is primarily intended to be used for annotation and representation of linguistic data, but is also applicable in other domains. The basic application area is the representation of feature-value pairs, which can be processed using efficient unification operations and algorithms derived from graph theory. There are many possible extensions of this core functionality, including knowledge representation, but it is not intended to address these explicitly.

[DE] As noted above, it is intended to produce a document which lists existing implementations derived from a new survey and also from the existing ACL NLP Registry.

[UK: Gillam] No reference implementation is currently proposed, as we do not believe it is usual or appropriate to provide one. We would however seek to provide test data, and point to existing implementations.

[DE 3] It was noted that the current draft included links and references to other documents, and we agreed that the document should be self-contained. Ways of automating cross references to external documents were envisaged for the revision of the ODD format, in line with other current technical work in the TEI on linking mechanisms.

Initial review of section 4

  1. Each part of the section should be numbered including the first; however, in these notes, we retain the numbering used in the discussion draft.
  2. [4.3.3] It was agreed that XML should be introduced as a third way of representing feature structures throughout section 4, and that a paragraph introducing the basics of the XML notation was needed.
  3. [4.4] Section should be renamed Structure sharing and should explain how re-entrancy or co-reference is represented in each of graph, matrix, and XML formats
  4. [4.5] Section should be renamed Multi-valued features, and expanded to include brief discussion of sets and bags as well as lists as feature values
  5. [4.6] Material from Annex B which discusses relations on feature structures (subsumption, membership) should be moved here: possibly simplified to cover only union and concatenation
  6. [4.7] Material from Annex B which discusses operations on feature structures (unification, generalization) should be moved here.
  7. [4.8] Type definition and type inheritance are aspects of the material currently addressed in the TEI FSD proposals. Part 1 of the standard would support expression of typed feature structures; however, expression of the type itself belongs in part 2. In rather the same way as an XML system must check the well-formedness of an XML document, but need not check its validity against an external schema, so a feature structure using the proposed formalism may assert that it is of some type without also specifying how that type-membership may be validated. The group had not yet reviewed the TEI FSD notation, but would do so when the work on Part 1 was complete.
  8. [4.9] The group felt there was little need to add a full formal semantics for feature structures -- specific aspects of the meaning of a given FS representation would be conveyed by a feature system declaration (or equivalent), while general concepts are well described elsewhere in the literature. An informal discussion should give some pointers to the relevant literature.
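
The difference between lists, sets and bags as feature values (item 4 above) comes down to whether order and repetition are significant. The following Python sketch is illustrative only: the function same_value and its organization parameter are hypothetical names, not part of the draft.

```python
from collections import Counter


def same_value(a, b, organization):
    """Compare two multi-valued feature values under a given organization."""
    if organization == "list":   # order and repetition both matter
        return list(a) == list(b)
    if organization == "bag":    # repetition matters, order does not
        return Counter(a) == Counter(b)
    if organization == "set":    # neither order nor repetition matters
        return set(a) == set(b)
    raise ValueError(f"unknown organization: {organization}")


x = ["nom", "acc", "acc"]
y = ["acc", "nom", "acc"]   # same members as x, different order
z = ["nom", "acc"]          # same members as x, no repetition

print(same_value(x, y, "list"), same_value(x, y, "bag"))
print(same_value(x, z, "bag"), same_value(x, z, "set"))
```

Here x and y differ as lists but are the same bag, while x and z differ as bags but are the same set.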

It was agreed that Eric, Tomaz, and Harry will provide material for Kiyong to include, with a view to producing a full revision by end of December 2003.

Review of sections 1 - 3

[1] Scope: The last sentence should be dropped, since it relates to Part 2 only.

[2] Normative References. This should mention only references that are normative, i.e. which are implicit in the normative proposals of this standard. There is no harm in including other references here however. Laurent agreed to check whether the ISO standards referenced (e.g. 8601) were the most recent versions.

[3] Terms and Definitions: The group worked through this list carefully, noting a number of small corrections, and also the following more detailed points.
  1. this section currently includes no terms from section 5: these need to be included
  2. terminology from ISO 10241 might be useful
  3. [3.1] "Attribute" has a very specific XML sense. It might be better to avoid the term and use "feature" instead. Specification of a feature with a particular value forms a feature-value pair which is a member of a feature structure. A "feature value specification" thus comprises a "feature name" and a "feature value".
  4. [3.3] "Boxed integer" need not be an integer: suggest "boxed label".
  5. [3.8] Simons' comment accepted (see 3.1 above)
  6. [3.10] Simons' comment accepted
  7. [3.12] Simons' comment is correct: note that the expansion of FSD in the TEI scheme is "feature system declaration"
  8. [3.17] Erjavec comment accepted: "the node on a directed graph which has no ancestor"
  9. [3.18] "Structure sharing: a feature structure in which one or more feature-values are shared or re-used".
  10. [3.19] The mathematical symbol is ambiguous and does not help the exposition.
  11. [3.20] "Tag" is also a very specific XML term and should be avoided. Since it is also very common in the literature, we should discuss it as a deprecated term, under the entry for "boxed label".
  12. [3.21] Should this definition be removed in favour of one for "feature structure type"?
  13. [3.25] There are two views associated with the notion of unification: one is set-theoretic and the other graph-theoretic. According to the first view, unification may be considered as the intersection of the feature specifications associated with two feature structures. According to the second view, it is formally understood to be the greatest lower bound of each feature specification associated with two feature structures.
  14. [3.26] Suggest changing ‘a list of values’ to ‘is multi-valued’
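
As an informal illustration of unification (item 13 above), the following Python sketch treats untyped feature structures as nested dictionaries with atomic leaves. It is a simplification under stated assumptions: it ignores types and structure sharing, and the names unify and UnificationFailure are hypothetical. The result carries the information of both inputs, and the operation fails only when two atomic values conflict.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Unify two feature structures given as nested dicts with atomic leaves."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)                      # start from the features of a
        for name, value in b.items():
            if name in result:
                # Feature present on both sides: unify the two values.
                result[name] = unify(result[name], value)
            else:
                result[name] = value          # feature only present in b
        return result
    if a == b:                                # identical atomic values
        return a
    raise UnificationFailure(f"conflict: {a!r} vs {b!r}")


left = {"cat": "np", "agr": {"num": "sg"}}
right = {"agr": {"per": "3"}}
print(unify(left, right))
# prints {'cat': 'np', 'agr': {'num': 'sg', 'per': '3'}}
```

Unifying the empty structure {} with any structure returns that structure unchanged, whereas unify({"num": "sg"}, {"num": "pl"}) raises UnificationFailure.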

Further review of section 4

  1. [4.8] A few remarks on feature libraries need to be included in 4.8
  2. [Clergerie re FSLite]. It was agreed that the current proposal should be revised so that the full complexity of structure was not exposed all at once. It would be a good idea to add a simple example in the introduction.
  3. [Overview section] Should be numbered. Its second paragraph should be dropped.
  4. [4.1] The examples should consistently present the same material: "orth" would be a better feature name here than "phon".
  5. [4.2] The mathematical symbols in the example should be represented using the proper Unicode characters.
  6. [4.2: Simons] If a feature is regarded as a function returning a single value, then it cannot return an alternation: agreed.
  7. [4.2: Erjavec] As noted above, some information about types and type hierarchy should be added in 4.8, although full treatment would have to wait for Part 2.
  8. [4.3.1] The footnote mentioning the possibility of cyclic graphs should warn of the implementation problems associated with their use.
  9. [4.4: FR] The comments on symmetry, renaming, and simplicity of notation seem more relevant to section 5, where the proposed method of representing re-entrant structures should be specified.
  10. [4.5] All comments were agreed.

Review of section 5

  1. Change title to "XML Representation of Feature Structures"; add revised version of the Overview section currently at the start of section 4.
  2. LB apologised for having been unable to find more linguistic examples: we agreed more were needed.
  3. [Clergerie] After discussion, we agreed that we should define a closed list of atomic types which could appear as content for the <f> element, and which could be validated by an XML schema. This implied that only values which corresponded with W3C Schema Datatypes should be used: more complex values might be specified as feature structures, but could then only be validated at the FSD level. As such, this level of validation would be useful only for untyped feature structures.
  4. [Clergerie] We discussed the phenomenon of sharing or re-entrancy: essentially, we agreed, this implied the need to be able to attach a label to any node, so that nodes sharing the same label would be identified as potentially unifiable. The TEI global N attribute was available to name or label any node. We also identified the following implementation rules:
    • if any node has the same label as any other, the meaning is that the two parts are shared and can meaningfully be unified
    • if a node uses the same label as any of its ancestors then the corresponding graph is cyclic
    • if a label appears only once, this is not an error but has no significance
    • labelling an empty node is meaningless, since the null feature structure can be unified with anything: unification fails only when there is conflict.
  5. [DE 19] This comment correctly identifies the intention behind the structure of the section: which is to provide a document grammar with examples, building up an understanding of the semantic hierarchy in the process. The comment does not explain why this is unsatisfactory, nor what kind of change in the structure is recommended. However, we do intend to review the structure and to simplify its presentation. We do not see the relevance of RDF here, since that uses a completely different notation.
  6. [JP/HA 3] As noted above, the alternative method proposed here (using XML elements which correspond with the feature names) is less powerful and extensible than the generic <f> element in the TEI scheme. We will discuss this alternative, however, and add a section on conversion between generic and specific representations.
  7. On atomic values, we agreed to replace <plus> and <minus> with <boolean> containing a valid XML Schema boolean representation (true, false, 1 or 0). We also noted that other underspecified atomic values must be represented by feature structures.
  8. The references to a.global should be explained in the Overview section which is to be added, as should the way in which the document is meant to be read. Simons' comment here seems to confuse the value itself with its use: the N attribute (for example) is needed globally if all nodes are potentially to be labelled. Gillam's comments don't provide any concrete proposals for change and seem rather inconclusive. We have commented on the relevance of RDF elsewhere.
  9. [5.2] We discussed the need for feature, feature structure, and feature value libraries and agreed that they helped reduce complexity. They enable applications to register value ranges for re-use by different features, feature value pairs for re-use in different feature systems, and feature structures in different applications. They are not however strictly essential to the notation, and their introduction at this point in its presentation is clearly confusing to many readers: we therefore plan to move the topic later in the document.
  10. [JP/HA 6] We don't understand how the macro element proposed here is intended to be used, nor why it is superior to the library mechanism proposed.
  11. [Simons] We agreed that the type attribute on <fsLib> had a different meaning from that on <fs> and should therefore have a different name. Elsewhere in TEI, "type" has the kind of general sense intended here, but amongst NLP practitioners it had the more specific sense in which it is being used on <fs>. We agreed that some other term such as "name" or "group" would be more appropriate here; or we could use the global N attribute, if we were willing to change the earlier decision to use this for labelling purposes.
  12. [Simons] The comment about whether or not the <fvLib> mechanism saves anything is well taken. It is however specific to the example in 5.2: if the example included other more complex values the situation would be different. This is in turn a consequence of the decision to introduce libraries before discussing other kinds of value, which we have agreed to reverse.
  13. [5.3] We identified the minimum set of datatypes as boolean, string, number and date. We agreed that there was no need for both <str> and <sym> and reaffirmed our decision that only datatypes which could be validated by a conformant XML schema processor should be regarded as primitive.
  14. [5.4] We agreed that the extended example here should be replaced by the more linguistically-relevant one provided by KL
  15. [5.5] We felt that a single element should be used to represent all multivalued objects, with an attribute to indicate whether they were organized as a list, set, bag etc. The name <vMult> was proposed for this purpose. It would carry an org attribute, and could also be empty.
  16. [5.6] We agreed that <falt> and <valt> were useful specializations of <alt>. We agreed that the mutExcl attribute was very unlikely to have any practical application and should be dropped.
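
The labelling rules listed under item 4 above can be sketched as a small checker. This is a rough Python illustration, not a proposed implementation: it assumes the label is carried by an n attribute on any element, the function name check_labels and the sample features are hypothetical, and the rule about empty nodes is not modelled.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_labels(root):
    """Classify the n labels in an element tree as shared, cyclic or single."""
    counts = Counter()
    cyclic = set()

    def walk(node, ancestor_labels):
        label = node.get("n")
        if label is not None:
            counts[label] += 1
            if label in ancestor_labels:
                cyclic.add(label)      # same label as an ancestor: cyclic graph
            ancestor_labels = ancestor_labels | {label}
        for child in node:
            walk(child, ancestor_labels)

    walk(root, frozenset())
    return {
        "shared": {l for l, c in counts.items() if c > 1},   # re-entrant nodes
        "cyclic": cyclic,
        "single": {l for l, c in counts.items() if c == 1},  # no significance
    }


doc = ET.fromstring(
    '<fs>'
    '<f name="subj" n="1"><fs><f name="num">sg</f></fs></f>'
    '<f name="head"><fs><f name="subj" n="1"/></fs></f>'
    '<f name="comp" n="2"><fs><f name="self" n="2"/></fs></f>'
    '<f name="mod" n="3"/>'
    '</fs>'
)
report = check_labels(doc)
```

For this sample, labels 1 and 2 are reported as shared, label 2 is additionally reported as cyclic because it recurs beneath itself, and label 3 is reported as occurring only once.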

Unresolved issues and future actions

Due to lack of time, we were unable to proceed much beyond section 5.6 of the current document. This left a number of minor comments unaddressed, but only one major issue (the practical usefulness of the rel attribute) undiscussed. We agreed that discussion of this issue should proceed as rapidly as possible on the email list, with a view to final resolution well in advance of the February meeting.

The following outline timetable was agreed for the next phase of the work:
  17 Nov: Minutes of meeting
  25 Nov: Compilation of editorial comments completed
  27 Nov: Compiled comment document to be circulated to expert group
  30 Dec: Final draft for sections 1, 2, 3 and 4 sent by KL to LB
  30 Dec: Final draft for section 5 sent by LB to KL
  10 Jan: Formatted complete draft sent by LB to KL

Last recorded change to this page: 2007-09-16  •  For corrections or updates, contact webmaster AT tei-c DOT org