TEI ED W41 Notes on Direct Interpretation of Syntax Trees C. M. Sperberg-McQueen (University of Illinois at Chicago) Anne Brueggemann-Klein (ABK) has shown how to create finite-state automata from regular expressions, by deriving a deterministic finite-state automaton (the Gluschkov automaton) from the syntax tree for the regular expression. This is extremely useful work. Its major drawback for SGML parsing is that SGML content models which include the '&' operator require translation into a canonical regular expression which does not use the '&' operator; since an '&'-expression with n elements can be satisfied in any of n! different ways, this translation leads to combinatorial explosion in the number of elements in the canonical regular expression (and thus in the syntax tree and the FSA derived from it). We hope to find a way to work directly from the syntax tree of the regular expression or SGML content model, in such a way as to simplify the handling of '&'. This will require an automaton slightly more complex than a standard FSA, which we will refer to as the `machine'. Consider the element declaration and a syntax tree derived from it, with ten nodes labeled 1 to 10. 1 , | +--------------+--------------+ | | | 2 + 3 * 4 ? | | | 5 b 6 , 7 c | +---------+---------+ | | | 8 d 9 e 10 f ABK derives from this an automaton of six states (labeled Qi, b, c, d, e, f), with an alphabet ¤b, c, d, e, f‡, an initial state Qi, a set of final states last = ¤b, f, c‡, and a transition function delta: 1 delta(Qi,b) = b 2 delta(b,b) = b delta(b,d) = d delta(b,c) = c 3 delta(d,e) = e 4 delta(e,f) = f 5 delta(f,c) = c 6 delta(c,c) = c If we interpret the syntax tree directly, we have a machine comprising ten nodes (one for each node of the syntax tree), each bearing some state variables. ABK generates the FSA by calculating, for each node of the syntax tree, the following variables: nullable : Boolean; /* can node be satisfied by null string? */ first : set of node; /* which nodes can be first? */ last : set of node; /* which nodes can be last? */ follow(n) : set of node; /* which nodes can be follow n within this expression? */ ABK's code for generating these variables can be paraphrased thus: select when label.n = '[emptyset]' then do nullable.n = false; first.n = '' last.n = '' end when label.n = '[nullstring]' then do nullable.n = true; first.n = '' last.n = '' end when label.n = '[literal]' then do nullable.n = false; first.n = value.n last.n = value.n val = value.n follow.n.val = '' end when label.n = '|' then do left = child.n.1; right = child.n.2 nullable.n = nullable.(left) or nullable(right) first.n = first.left first.right /* guaranteed disjoint */ last.n = last.left last.right /* guaranteed disjoint */ end when label.n = ',' then do left = child.n.1; right = child.n.2 nullable.n = nullable.(left) and nullable(right) ln = last.left do while ln <> '' parse var ln node ln val = value.ln follow.n.val = follow.left first.right /* disjoint */ end if nullable.left then first.n = first.left first.right else first.n = first.left if nullable.right then last.n = last.left last.right else last.n = last.right end when label.n = '*' then do nCh = child.n.1 nullable.n = true ln = last.nCh do while ln <> '' parse var ln nTemp ln val = value.ln follow.n.val = follow.child.val first.child end /* do ln */ first.n = first.nCh last.n = last.nCh end otherwise call error 'Unknown node label: ' label.n '!' end /* select */