Streaming Techniques for XML Processing – Part 2
Advanced STX Techniques
In the first part of this blog series I introduced Streaming Transformations for XML (STX for short). Now I want to present a complex STX transformation that uses buffers to cache nodes that are needed later for output generation. The linear nature of serial transformations is the reason for their performance and efficiency, but it is also their greatest weakness. Buffering techniques are a way out of this problem.
In STX, buffers are temporary XML fragments that are declared with the instruction stx:buffer. In contrast to XSLT 2.0, you can’t access a buffer using a path language. But in fact you don’t need that kind of access because you can treat a buffer like an input stream and apply templates to it. Moreover, you can pass the content of buffers to external filters like XSLT, Schematron, SAX2 and XML Signature.
At the end of this blog I will mention some applications of the presented techniques as well as some STX best practices.
Running STX from the command line
Before I start, I want to answer a question I have been asked many times: “How can I run STX?” First you should download the latest version of Joost. You need at least Java 1.4 and the logging components that are described under the same URL.
A typical Joost call from the command line is java.exe -cp .\commons-discovery.jar;.\commons-logging.jar;.\joost.jar net.sf.joost.Main .\input.xml .\trafo.stx -o .\out.xml. If you need more help, please write to the Joost mailing list.
There are several ways to use the Java API of STX. You will find several listings in chapter 7 of Oliver Becker’s PhD thesis. You can use Joost through JAXP by setting the property javax.xml.transform.TransformerFactory to net.sf.joost.trax.TransformerFactoryImpl.
You can use StDB as a debugger for STX.
An Example from WIKIMEDIA
As a first example of buffering techniques, let me present an STX program from WIKIMEDIA that is used for processing huge WikiMedia XML files. It is a filter that copies all nodes and attributes but adds an attribute namespace to title elements if this namespace is declared by a namespace element, as explained here: if we read <namespace key="42">Foo</namespace> and then process <title>Foo:Bar</title>, we output <title namespace="42">Foo:Bar</title>.
The template that matches m:namespace copies the corresponding node into a buffer. The template that matches m:title evaluates the prefix (Foo in our example) and processes the buffer containing the namespace elements: if the content of a namespace element stored in the buffer equals this prefix, the value of its key attribute is assigned to the variable page-namespace. Then this variable is used to add an attribute namespace to the element title:
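To make the idea concrete outside STX, here is a minimal Python sketch of the same filter built on the standard xml.sax module. It is not the WIKIMEDIA program: the class name NamespaceTagger and the function tag_titles are made up for this illustration, a plain dictionary stands in for the STX buffer, and the sketch assumes that namespace and title elements contain text only.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class NamespaceTagger(xml.sax.ContentHandler):
    """Copy the input stream, adding namespace="<key>" to matching titles.

    Assumption: <namespace> and <title> contain text only (no child elements).
    """

    HELD_BACK = ("namespace", "title")

    def __init__(self, out):
        super().__init__()
        self.out = XMLGenerator(out, encoding="utf-8")
        self.buffer = {}   # stands in for the STX buffer: namespace name -> key
        self.held = None   # (tag, attributes) of the element being held back
        self.text = []

    def startElement(self, name, attrs):
        if name in self.HELD_BACK:
            # Delay the start tag: for <title> the attributes depend on the text.
            self.held = (name, dict(attrs.items()))
            self.text = []
        else:
            self.out.startElement(name, attrs)

    def characters(self, content):
        if self.held is not None:
            self.text.append(content)
        else:
            self.out.characters(content)

    def endElement(self, name):
        if self.held is None:
            self.out.endElement(name)
            return
        tag, attrs = self.held
        text = "".join(self.text)
        if tag == "namespace" and "key" in attrs:
            self.buffer[text] = attrs["key"]  # remember the declaration
        elif tag == "title":
            prefix, colon, _ = text.partition(":")
            if colon and prefix in self.buffer:
                attrs["namespace"] = self.buffer[prefix]
        # Now emit the element that was held back, possibly with the new attribute.
        self.out.startElement(tag, attrs)
        self.out.characters(text)
        self.out.endElement(tag)
        self.held = None


def tag_titles(xml_text):
    """Run the filter over a string and return the transformed XML."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), NamespaceTagger(out))
    return out.getvalue()
```

Like the STX version, the sketch keeps only the namespace declarations in memory and never builds a tree of the whole document; everything else is written to the output as soon as it is read.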
In the rest of this blog I will rewrite one of my own XSLT programs in STX. As we will see, this transformation turns out to be very complicated and far from efficient, so it is a perfect example that sometimes XSLT is the better choice. So feel free to decide: you can just read the summary at the end or analyse a really difficult example of an STX program.
Transformation of a BMEcat Document into an asXML-Representation of ABAP Data
In my SAP Heft XML-Datenaustausch in ABAP I discussed how to generate ABAP data from XML by creating the asXML representation. In listing 4.6 I transformed an XML document that conforms to the BMEcat standard. The input contains information about several meals that is transformed into tables of the SPFLI data model you know from the SAP documentation. I wrote an STX transformation that generates an XML document in asXML representation to bridge the gap between XML and ABAP:
You can find the input BMEcat document as well as an XSLT transformation that performs this task in my SAP Heft XML-Datenaustausch in ABAP and online under the same URL. If you don’t have access to it, here is a grid view of the XML document:
A grid view of the XML document we want to transform.
The document contains a list of articles together with detail information (BME:ARTICLE), a tree-like catalog structure (BME:CATALOG_GROUP_SYSTEM) and BME:ARTICLE_TO_CATALOGGROUP_MAP elements that link the articles to the catalog structure.
Here is an STX program that does the same as listing 4.6 in the SAP Heft. I suggest studying the XSLT program first and then the STX transformation.
One major difficulty of the transformation above is that we have to “group” the elements according to the data types SMEAL, SMACOURSE, SDESSERT and so on. Unfortunately the XML elements that carry this information do not appear in this order. Moreover, we need information from BME:ARTICLE_TO_CATALOGGROUP_MAP to decide whether an article is a starter, main course or dessert. In XSLT we could use xsl:for-each (or even xsl:for-each-group) to collect the elements we need. This is necessary because the CALL TRANSFORMATION command for XSLT doesn’t support appending to internal tables (in Simple Transformations we can use tt:assign to do it).
In fact, grouping elements is a difficult task for serial XML transformation languages because we usually have to access nodes that are not ancestors of the current node. In STX we can define buffers to solve this problem. In the transformation above I use stx:buffer name="article" and stx:buffer name="key" to define two buffers and the instruction stx:process-buffer to access them. In effect we store big parts of the document in buffers and generate the output by processing them with the templates of the groups smeal, smealt, sstarter and so on (each group generates the output for the corresponding SAP DDIC type).
Let’s have a closer look at the main template of the STX program. We define a template for the document root that is executed once when processing starts. We plan to read the whole document into two buffers, article and key. Once this is done, we generate the asXML output by writing literal result elements to the output stream:
We fill the two buffers by defining two groups of templates: key and article. Let’s discuss the group key first. The BMEcat document contains a list of articles together with a catalog structure. We need this catalog information to decide, for example, whether a meal is a starter or a main course. Therefore we read the BME:ARTICLE_TO_CATALOGGROUP_MAP elements and add elements named key to the buffer; each of these elements contains an article number and a catalog number. When we later process the articles stored in the buffer article, we use this information to access the catalog structure:
After this is done the buffer key contains elements like the following: <key article="127-0175" catalog="DT1030"/>. Note that the template that matches BME:ARTICLE_TO_CATALOGGROUP_MAP has the attribute visibility="global", so it serves as an entry point to this group of templates.
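How the key buffer fills up during a single pass can be sketched in Python with xml.sax. This is only an illustration, not the STX group itself: the child element names BME:ART_ID and BME:CATALOG_GROUP_ID, the namespace URI urn:example:bmecat and the names KeyBufferBuilder and collect_keys are assumptions made for the example, and a list of dictionaries plays the role of the buffer.

```python
import xml.sax


class KeyBufferBuilder(xml.sax.ContentHandler):
    """Collect one {article, catalog} entry per mapping element."""

    MAP = "BME:ARTICLE_TO_CATALOGGROUP_MAP"
    ART_ID = "BME:ART_ID"              # assumed child element name
    GROUP_ID = "BME:CATALOG_GROUP_ID"  # assumed child element name

    def __init__(self):
        super().__init__()
        self.keys = []         # plays the role of the STX buffer "key"
        self.inside_map = False
        self.field = None      # child element whose text is being read
        self.values = {}

    def startElement(self, name, attrs):
        if name == self.MAP:
            self.inside_map = True
            self.values = {}
        elif self.inside_map and name in (self.ART_ID, self.GROUP_ID):
            self.field = name
            self.values[name] = ""

    def characters(self, content):
        if self.field is not None:
            self.values[self.field] += content

    def endElement(self, name):
        if name == self.field:
            self.field = None
        elif name == self.MAP:
            # Corresponds to writing <key article="..." catalog="..."/>.
            self.keys.append({
                "article": self.values.get(self.ART_ID, "").strip(),
                "catalog": self.values.get(self.GROUP_ID, "").strip(),
            })
            self.inside_map = False


def collect_keys(xml_text):
    """Parse a document and return the collected key entries."""
    handler = KeyBufferBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.keys


SAMPLE = """<BME:BMECAT xmlns:BME="urn:example:bmecat">
  <BME:ARTICLE_TO_CATALOGGROUP_MAP>
    <BME:ART_ID>127-0175</BME:ART_ID>
    <BME:CATALOG_GROUP_ID>DT1030</BME:CATALOG_GROUP_ID>
  </BME:ARTICLE_TO_CATALOGGROUP_MAP>
</BME:BMECAT>"""

print(collect_keys(SAMPLE))
```

All other elements of the document are simply ignored by this handler, which mirrors the way a group of STX templates only reacts to the elements it matches.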
We define a second group of templates that processes the elements /BME:BMECAT/BME:T_NEW_CATALOG/BME:ARTICLE and writes their id (element BME:SUPPLIER_AID) and a long text (element BME:DESCRIPTION_LONG) into the buffer article:
After this is done the buffer article contains one article element for each article, carrying its id and its long text.
As mentioned above, the main part of the program processes the article elements using the STX instruction stx:process-siblings. For each table to fill (SMEAL, SMACOURSE, SDESSERT and so on) we follow the same strategy, so let’s look in detail at how it works for the group sdessert. For each article element in the buffer article we search the second buffer key for an element that has the same article number (attribute @article) by executing stx:process-buffer name="key". Then we test whether the catalog id starts with DT1030. If it does, we know that we have found a dessert and create its asXML representation. Here is the corresponding code:
In our example we generate the following output:
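As an illustration outside STX, the lookup strategy of the sdessert group described above can be sketched in Python. The two buffers are modelled as lists of dictionaries; the element names SDESSERT_TAB, SDESSERT, ID and TEXT, the function name build_desserts and the sample data (apart from the key 127-0175 / DT1030) are invented for this sketch and are not taken from the real transformation or the DDIC types.

```python
import xml.etree.ElementTree as ET


def build_desserts(articles, keys, prefix="DT1030"):
    """Return a table element with one row per article mapped to a dessert group."""
    table = ET.Element("SDESSERT_TAB")  # hypothetical table element name
    for article in articles:
        # Re-scan the whole key buffer for every article, as the STX program
        # does with stx:process-buffer name="key".
        for key in keys:
            if key["article"] == article["id"] and key["catalog"].startswith(prefix):
                row = ET.SubElement(table, "SDESSERT")            # hypothetical
                ET.SubElement(row, "ID").text = article["id"]      # hypothetical
                ET.SubElement(row, "TEXT").text = article["text"]  # hypothetical
                break
    return table


# Invented sample content for the two buffers.
ARTICLES = [
    {"id": "127-0175", "text": "Chocolate mousse"},
    {"id": "127-0001", "text": "Tomato soup"},
]
KEYS = [
    {"article": "127-0175", "catalog": "DT1030"},
    {"article": "127-0001", "catalog": "ST1010"},
]

print(ET.tostring(build_desserts(ARTICLES, KEYS), encoding="unicode"))
```

The nested loop makes the cost visible: every article triggers a complete pass over the key buffer, and the same work is repeated for each table that has to be filled.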
This blog covered an application of STX in data exchange.
Limitations of STX
In fact this transformation shows the limits of STX programs. We had to save big parts of the input document in buffers. This may be acceptable for small documents, but with larger ones we will run out of memory. And for smaller documents an XSLT program seems to be the better solution because it is shorter and we can use xsl:key to speed up the transformation.
Nevertheless buffering is an important STX technique and there are many more applications: think of the integration of external filters or of XSLT programs that transform only small parts of the document.
Note that Simple Transformations have a strictly linear nature during deserialization and there is no buffer instruction like stx:buffer. But in fact we can combine the strengths of both languages: we can use STX for more complex transformations that ST can’t perform and for the integration of, say, XML Signature. After that we can perform post-processing with a Simple Transformation to bridge the gap between XML and ABAP.
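Why a key helps can be shown with a small Python sketch: instead of scanning all key entries for every article, we build an index once and then answer each lookup directly, which is roughly what xsl:key offers in XSLT. The function names index_keys and is_dessert and the sample entries (apart from 127-0175 / DT1030) are made up for this illustration.

```python
from collections import defaultdict


def index_keys(keys):
    """Build the index once: article number -> list of catalog ids."""
    index = defaultdict(list)
    for key in keys:
        index[key["article"]].append(key["catalog"])
    return index


def is_dessert(index, article_id, prefix="DT1030"):
    """Answer one lookup without scanning all key entries again."""
    return any(catalog.startswith(prefix) for catalog in index.get(article_id, []))


# Invented sample entries.
KEY_ENTRIES = [
    {"article": "127-0175", "catalog": "DT1030"},
    {"article": "127-0001", "catalog": "ST1010"},
]

KEY_INDEX = index_keys(KEY_ENTRIES)
print(is_dessert(KEY_INDEX, "127-0175"), is_dessert(KEY_INDEX, "127-0001"))
```

The index still lives in memory, so it does not remove the memory limit discussed here; it only avoids the repeated passes over the buffer.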
As in the first part of this blog series, we grouped STX templates to define rules for different “parts” of an XML document. In each part we transform a set of nodes together with their children, and we control access to each group using the visibility attribute. If it helps, you can think of groups of templates as similar to the mode attribute of a template in XSLT.
To boldly go where no man has gone before…
In my opinion STX has enormous potential: just imagine graphical mapping tools that generate STX code wherever this is possible. With built-in integration of external filters like XSLT and XML Signature it could become part of a next-generation integration engine.