Streaming Techniques for XML Processing – Part 5
Implementation of Serial Tree Transducers
In the last instalment of this weblog series I described a divide and conquer approach for processing huge XML documents. Huge XML documents often have a simple structure and consist of wide trees but with only small height. They contain lists with millions of serialized business objects. We used STX to implement an outer loop performing only local transformations and transforming buffered subtrees using XSLT.
So our aim is to combine the strengths of STX and XSLT: We only have to keep a subtree together with its parents in the main memory at one time. We use XSLT either to reuse existing transformations or to perform transformations that are too difficult to be done with a tree transducer. Now I want to put this strategy in a more formal background. If you are interested in slightly more formalism I refer to my working paper.
I hope this formalism is a first step towards generation of fast mapping programs that can be used in integration technologies. In the end we need a program that generates STX mapping programs from schema mappings by reusing existing XSL transformations. This would be in fact an interesting development project that could valuable contribution to XML community.
Serial Tree Transducers
XML documents can be modelled by trees and XML transformation by tree transformation. One model for tree transformations are tree transducers: finite automata that walk over trees and generate output. Because we want to use STX we restrict to top-down tree transducers. You can image these transducers as automata that do a simultanous translation. They walk through the XML document in a depth first order and apply a rule every time they visit an XML element. Applying a certain rule means that we insert a set of subtrees according to the current state and change the state of the transducer.
In the following we show some applications of tree transducers implemented as STX programs. The states information of the tree transducer corresponds to groups in STX and a single transducer rule corresponds to an STX template.
For those who don’t like automata theory I give some simple examples for serial tree transducers by showing typical applications.
The following transducer starts in state p and renames elements a into elements b:
\ \ \ \ \ \ \ <b> \ \ </b> \ \ \
Deepening XML Hierarchies
The following template works on elements a and inserts a child element b and copies the children of a under b.
\ \ \ \ \ \ \ <a> \ <b> \ \ </b> \ </a> \ \ \
Flattening XML Hierarchies
The following template removes elements a in a hierarchy:
\ \ \ \ \ \ \ \ \ \
In fact even skipping subtrees is possible but in the context of serial enhanced serial tree transducers it can be done more easily.
Enhancements of Serial Tree Transducers
The examples above use only a small fragment of STX. In the following we discuss two extensions: registers and arbitrary transformations on buffered subtrees that will be copied to the output stream.
In STX we can define global variables that can be changed during processing. We will use them as registers and define test expressions that allow us to modify the state of a rule, thereby registers can be either integers or strings.
The second enhancement allows transformation rules that can use subroutines. In STX we can implement these rules using the stx:buffer command that copies subtrees into a buffers that are used as input of an XSL transformation. This is shown by following example. We want to implement an enhanced serial tree transducer that works on child elements a or an root element root by copying the subtrees of the elements a after sorting the siblings. That means that the XML document on the left hand side will be copied to the element on the right hand side:
\ <root> <root> \ <a> <a> \ <z/> <w/> \ <y/> <y/> \ <w/> <z/> \ </a> </a> \ <a> <a> \ <b/> <b/> \ <a/> <a/> \ </a> </a> \ </root> </root> \
The following transformation defines rules for the element root and its children a. The subtrees of the elements a are copied to a buffer input that is processed using an XSL transformation and the result is copied to the output stream. This is done using following transformation:
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \
The XSL transformation sort.xsl performs the sorting operation:
\ \ \ \ \ \ \ \ \ \
In fact this example performs only a structural mapping but it is a good example how the tree transducer works as outer loop and uses XSLT to transform buffered subtrees. For realistic applications we can handle text content and attributes either as special nodes or modify the template rules of a transducer such that they can assign the text content and attributes of a source node to a target node. Note that we can also use registers to store the text content and copy it to a target node.
In my opinion there are two interesting questions:
- Can we find a way to recognize when schema mapping can realized with a (non-trivial) enhanced serial tree transducer? Non-trivial means that we don’t want to buffer the whole document in a buffer and apply a single XSLT program on it.
- Can we even find a program that generates a (non-trivial) enhanced tree transducer from a schema mapping if this is possible?
I consider these aspects as worth for further work because serial tree transducers are a very simple computation model that correspond to STX in a very natural way.