Streaming Techniques for XML Processing – Part 1
Transformation languages like XSLT are the most powerful techniques for processing XML documents. Unfortunately, XSLT is very resource-consuming, so you can’t use it to process mass data. That’s why serial transformation languages have been developed: they transform XML documents in linear time and keep only a small part of the document in main memory. Under ABAP, Simple Transformations (ST) are the most promising technique to speed up XML processing or to deal with huge XML documents. I discussed ST in my weblog series “XML processing in ABAP” (for example Part 6 – Advanced Simple Transformations) and in detail in XML-Datenaustausch in ABAP.
In this blog I will introduce STX (short for Streaming Transformations for XML), a general transformation language for serial XML processing that you can use under Java. STX works event-based; think of it as a kind of XSLT based on SAX2 concepts.
Streaming Transformations for XML
STX is a transformation language in XML syntax that works event-based. As in XSLT, we can define templates that represent rules. These rules are evaluated while the input XML document is processed in a linear way. There is a working draft of the STX specification, and there is Joost, an STX processor running under Java that implements most features of the specification.
I will introduce STX and give a very simple and a more complicated example of an STX program. If you want to understand these examples in detail or learn STX in depth, I suggest reading the article An Introduction to Streaming Transformations for XML. At the end I will discuss application areas of STX and topics I want to cover in the following parts of this weblog series.
STX compared to other Transformation Languages
Simple Transformations and STX are both serial and procedural transformation languages in XML syntax. But there are several differences:
- STX is a multi-purpose language that can be used to transform XML documents to XML documents as well as to general text streams.
- STX doesn’t support generation of ABAP data like ST does. You have to generate the canonical asXML representation.
- STX is not symmetric.
- STX is Turing complete.
- STX is rule-based.
Let me mention some differences between STX and XSLT:
- STX is a procedural programming language: we can assign variables many times and can perform loops.
- STX uses STXPath as query language. STXPath is derived from XPath 2.0 but can access only the ancestors of a certain node.
- STX is event-based: STX templates are executed in the specific order of the input stream.
- STX has no named templates but we can define procedures.
- There are no commands like xsl:for-each that can change the current node.
STX uses the XSLT 1.0 data model, not the one of XSLT 2.0. It supports some features of XSLT 2.0 like multiple output documents and text processing. In fact, stx:analyze-text is more powerful than the corresponding command in XSLT 2.0. If you are interested in a scientific investigation of STX, I suggest reading Oliver Becker’s PhD thesis.
The following transformation copies an XML document and renames all elements with name person to individual:
STX commands are defined in the namespace http://stx.sourceforge.net/2002/ns. The main command is stx:transform. Please note that we use the attribute pass-through="all" of the stx:transform command to copy the complete input XML document to the output stream. For the element person we define a special rule: if it occurs, we output an element named individual instead. We use stx:process-attributes to copy the attributes of the person element that is currently being processed. stx:process-children continues event processing.
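Since STX builds on SAX2 concepts, the same renaming can be approximated directly with a SAX filter. The following Java sketch is only an analogy, not the STX program and not part of Joost: the class RenamePersonFilter and its transform method are hypothetical names, and the sketch assumes an input document without namespace prefixes. Every event is passed through unchanged (comparable to pass-through="all"), except that start and end events for person are forwarded as individual.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper class for illustration; it is not part of STX or Joost.
class RenamePersonFilter extends XMLFilterImpl {

    RenamePersonFilter(XMLReader parent) {
        super(parent);
    }

    // Maps the element name "person" to "individual" and leaves all other names as they are.
    // Assumption: element names carry no namespace prefix.
    private static String rename(String name) {
        return "person".equals(name) ? "individual" : name;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // The attributes are forwarded unchanged, like stx:process-attributes.
        super.startElement(uri, rename(localName), rename(qName), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, rename(localName), rename(qName));
    }

    // Streams the input through the filter and serializes the resulting events.
    static String transform(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // A Transformer without a stylesheet acts as an identity serializer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        identity.transform(
                new SAXSource(new RenamePersonFilter(reader),
                        new InputSource(new StringReader(xml))),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<people><person id=\"1\">Maria</person></people>"));
    }
}
```

The filter never sees more than one event at a time, which is why its memory consumption does not grow with the size of the document. STX expresses the same idea declaratively with templates instead of handler methods.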
Consider the following example: an XML document contains a list of business partners and, for each business partner, a list of contract information:
We want to transform the document above into two separate CSV datasets. The first dataset contains the information about the business partners and the second the corresponding contracts. Both are linked by foreign keys that we have to calculate during the transformation. In fact, we transform the hierarchical structure of the XML document into a relational data model. Let’s look at the desired output:
Let me explain the structure. The number 4711 is an external parameter passed to the transformation. Then we have a counter for the person, which is used in the second dataset, too. The values F;Maria;Testperson;19660703;12345;Teststadt;Teststr. are defined in the elements person_name, birth_dttm and addr.
The second dataset contains another counter for the contract information (element contract). The attribute ende is optional, and we take the default value 99991231 if it is missing. The following transformation solves the task:
Let’s have a closer look. We define pass-through="none" to control the output: only literal result elements and output defined by stx:value-of will be copied to the output stream. stx:param defines an external parameter, and semikolon and crlf are two variables we use as constants for formatting the output. bpnr and contractnr are two counters we will use later. We define a template with the attribute match="/" that is executed at the beginning, when the document node is processed. Within that template we delete the content of the two datasets person.txt and contract.txt. Then we start the event loop and apply the templates of the group bpdaten to the elements in the input stream:
The group bpdaten contains a few variables and templates that assign the variables. Each template contains the command stx:process-children to suspend the processing of the current template by processing the children of the current node. Let me cite the STX-specification: “Using SAX2 terms: this instruction splits a template into two parts such that a SAX2 startElement event causes the execution of the first part and the corresponding SAX2 endElement event causes the execution of the second part. There must be always at most one stx:process-children instruction executed during the processing of a template.” The template that matches the bp elements increments the counter for the business partners:
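In plain SAX2 terms, the split described in the quotation can be sketched in Java as follows. This is an analogy and not the STX template itself: the handler class BpCounterHandler, the child element name and the flat input structure are assumptions made for illustration, while bp, bpnr and the external parameter 4711 are taken from the example. The code in startElement corresponds to the part of a template before stx:process-children, the code in endElement to the part after it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; element "name" is an assumed child of "bp".
class BpCounterHandler extends DefaultHandler {
    private final String param;                       // external parameter, e.g. "4711"
    private final List<String> lines = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private int bpnr = 0;                             // counter for business partners
    private String name = "";                         // variable assigned by a child element

    BpCounterHandler(String param) {
        this.param = param;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
        if ("bp".equals(qName)) {
            // First part of the "template": runs before the children are processed.
            bpnr++;
            name = "";
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            name = text.toString().trim();
        } else if ("bp".equals(qName)) {
            // Second part of the "template": runs after all children have been seen.
            lines.add(param + ";" + bpnr + ";" + name);
        }
    }

    // Parses the document once and returns one CSV line per bp element.
    static List<String> toLines(String xml, String param) throws Exception {
        BpCounterHandler handler = new BpCounterHandler(param);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bps><bp><name>Maria</name></bp><bp><name>Max</name></bp></bps>";
        toLines(xml, "4711").forEach(System.out::println);
    }
}
```

The counter is incremented when the start event arrives, but the output line can only be written at the end event, because the values of the child elements are not known earlier.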
There is one template we use to output the assigned variables. Please note that stx:result-document defines the output stream. Moreover, we use a procedure to reformat the date: stx:call-procedure name="format-date":
The procedure format-date is a child of the root element stx:transform and writes a formatted date to the output stream:
We define a second group contracts that is similar to the first one but works on the child elements of status. It contains variables that are filled by corresponding templates. If elements are optional, we assign default values to the corresponding variables. The attribute public="yes" in stx:template match="contract" public="yes" makes the template visible from the group bpdaten and declares this template as the entry point to the group:
Grouping with stx:group is a very important technique for writing more readable and more maintainable programs. In this program a group corresponds to an entity (business partners or contract information). The templates within a group correspond to the child elements of a certain element. While processing the child elements, we apply only templates within that group. In fact, we can define the visibility of each template of a group (global or local) and the visibility in a wider context, too.
Grouping templates makes STX programs more maintainable. If there is a “local” change in the XML documents we want to transform, we only need to make changes within one group. In Joost, grouping also increases performance because we reduce the number of templates that have to be checked for matching a certain element.
Let me emphasize that if you have to do XML processing, there are some tasks that can’t be solved under ABAP. In this case you shouldn’t be afraid to use Java to get access to very powerful tools like STX.
Obvious Applications of STX
Besides “local transformations” (renaming elements and attributes) there are obvious tasks for STX, for example:
- Splitting large XML documents into smaller ones.
- Doing preprocessing before using a Simple Transformation that bridges the gap between XML and ABAP.
- Doing postprocessing after using a Simple Transformation that bridges the gap between ABAP and XML.
The second example shows the most important techniques you need to create STX programs. But every serial transformation language has the same problem: we have only limited access to the XML document that is processed. In STX we can access the current element, its content and the values of its attributes as well as the content and attributes of its ancestors. For complex transformations we have to save elements into buffers to access them later on. In the next part of this weblog series I will implement an XSLT program from my SAP-Heft XML-Datenaustausch in ABAP in STX and perform a complex transformation.
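The limited context of a serial processor can be illustrated with a small Java sketch based on SAX. It is an assumption-laden analogy, not how Joost is implemented: the class AncestorPathHandler and the sample element names are hypothetical. The handler keeps only the chain of currently open elements, so at any point it knows the current element and its ancestors, but nothing about siblings that have already been closed or elements that are still to come.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: records the ancestor path of every element.
class AncestorPathHandler extends DefaultHandler {
    // Names of all currently open elements, outermost first.
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
        // Everything the handler knows about the document position is in "open".
        paths.add("/" + String.join("/", open));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Once an element is closed, it is forgotten unless it was buffered explicitly.
        open.removeLast();
    }

    // Parses the document once and returns the path of each element in document order.
    static List<String> paths(String xml) throws Exception {
        AncestorPathHandler handler = new AncestorPathHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }

    public static void main(String[] args) throws Exception {
        paths("<bps><bp><name>Maria</name><addr/></bp></bps>").forEach(System.out::println);
    }
}
```

The memory needed is proportional to the nesting depth of the document, not to its size. Anything beyond the ancestor chain has to be stored deliberately, which is the role buffers play in STX.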
Streaming Validations for XML
We exchange data to link electronic business processes. Often these processes place quality requirements on the data; otherwise we need (sometimes manual) postprocessing. We can use schema languages like W3C XML Schema to model XML documents in a formal language and to check documents against this formal specification, but often this is not enough. There are more powerful schema languages like Schematron that allow us to define, for example, numerical checks on an XML document. Unfortunately, most Schematron implementations rely on XSLT and have the same resource problem when dealing with mass data. We can solve this problem by defining a schema language that is similar to Schematron, that lets us define assertions about patterns in XML documents, and that can be implemented in STX.
In fact, I defined such a language and wrote an XSLT transformation that converts those assertions into an STX program. I will introduce that language in a future blog.