Streaming Validation for XML
In the first part of this weblog series I introduced STX and mentioned validation techniques beyond W3C XML Schema as an application. In this part I explain that application in detail, using more advanced STX techniques, and put it all together for an application in data exchange.
We do data exchange to link electronic business processes by making the data of one system available to another system. Usually we don’t want to accept any data – we only accept valid XML documents. Using a schema language like W3C XML Schema has several advantages:
- We have formal specifications that can be validated using standard software.
- The sender of an XML document can perform validation so that only valid documents will be exchanged.
But validation against a W3C XML Schema also has disadvantages:
- W3C XML Schema can’t express many checks; think of numerical checks, for example.
- An XML message can contain thousands of serialized business objects. Sometimes we don’t want to reject a huge XML message because a single business object is faulty.
- The error log produced by a validation is hard to interpret. We would like to have error codes that are readable or can be interpreted by computer programs.
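The last two points can be illustrated with a small Python sketch. It is not part of STV; the field names, the error code E001 and the net/tax/gross rule are hypothetical and only stand in for the kind of numerical check that is awkward to express in W3C XML Schema. Faulty business objects are reported with a machine-readable code while the remaining objects are still accepted.

```python
from dataclasses import dataclass


@dataclass
class ValidationError:
    code: str        # hypothetical machine-readable error code
    object_id: str   # identifies the offending business object
    message: str     # human-readable explanation


def validate_objects(objects):
    """Split business objects into accepted objects and error records."""
    accepted, errors = [], []
    for obj in objects:
        # Hypothetical numerical rule: net amount plus tax must equal gross.
        if obj["net"] + obj["tax"] != obj["gross"]:
            errors.append(
                ValidationError("E001", obj["id"], "net + tax does not equal gross")
            )
        else:
            accepted.append(obj)
    return accepted, errors


# Two illustrative business objects; the second one violates the rule.
objects = [
    {"id": "A", "net": 100, "tax": 19, "gross": 119},
    {"id": "B", "net": 100, "tax": 19, "gross": 120},
]
accepted, errors = validate_objects(objects)
```

Here only object B is rejected, and the error record carries a code that another program could evaluate.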
Validation languages like Schematron sometimes work better because we can code rules and assertions. Unfortunately most Schematron implementations rely on XSLT, so you can’t check huge XML documents with them. In this weblog I will present a self-made prototype of a validation language, STV (Streaming Validation for XML), that is based on STX, so I expect good performance. On the other hand, compared to Schematron it is less expressive. But combined with W3C XML Schema it is a powerful tool.
In the first part of this blog series I transformed an XML document that contains a list of business partners and contract information. Each contract has a number, a start date and an end date. Unfortunately W3C XML Schema can’t check whether the start date is less than or equal to the end date, or whether the contracts of each business partner overlap. That means that we want to issue error messages if either of the following two cases occurs:
The latter case is wrong because a missing attribute @ende means that we have an open end. Here is an XML document that contains those errors:
A brief look on STV
An STV transformation defines a set of rules that consist of assertions. An assertion can be coded with variables that have to be assigned first. Within a rule we can initialize buffers that can be appended to and processed. I created a schema for that language that you can download here:
A first STV example
I want to code the checks mentioned above in a formal language:
Let’s look at how it works. All STV commands have the namespace urn://svx/001, and the root element is svx:schema. We define rules with the command svx:rules. Each rule has two attributes, context and location. The first one defines an element which is evaluated together with its children to process the rule. The attribute location defines the place, i.e. an element which triggers the execution of the assertions.
Here is the part of the SVX-transformation that compares start and end date:
The rule contracts is defined within the element status and is executed when the element nr occurs. At each occurrence of the element begin_end_tmr the value of the attribute @beginn is assigned to a variable start using the command stv:let. The variable end is treated the same way, but it is assigned a default value if the attribute is missing. At the end of the element the assertion is evaluated, and an error message is issued if it does not hold.
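For readers who want to see the same logic outside STX, here is a minimal Python sketch of the start/end check. It is not the STV implementation. The nesting status / nr / begin_end_tmr in the sample, the default 9999-12-31 for an open end and the ISO date format are assumptions made for illustration. As with STV, the input is expected to be schema-valid, so @beginn is assumed to be always present.

```python
import io
import xml.etree.ElementTree as ET

OPEN_END = "9999-12-31"  # assumed default for a missing @ende (open end)


def check_start_before_end(source):
    """Stream over the document and report periods whose start is after the end."""
    errors = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "begin_end_tmr":
            start = elem.get("beginn")         # assumed to be present
            end = elem.get("ende", OPEN_END)   # default if @ende is missing
            # ISO dates (YYYY-MM-DD) order correctly when compared as strings.
            if start > end:
                errors.append(f"start {start} is after end {end}")
            elem.clear()  # free the element once it has been checked
    return errors


# Hypothetical sample document; the first period is invalid, the second is open-ended.
SAMPLE = b"""<status>
  <nr><begin_end_tmr beginn="2005-01-01" ende="2004-12-31"/></nr>
  <nr><begin_end_tmr beginn="2006-01-01"/></nr>
</status>"""

errors = check_start_before_end(io.BytesIO(SAMPLE))
```

The sketch mirrors the two variables and the single assertion of the rule: the values are assigned when the element is seen and the comparison runs once the element is complete.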
If we want to check whether date information of different elements nr overlap we need to introduce buffering techniques. We declare a buffer with:
We append date information to the buffer using the following commands:
We check whether these buffered dates overlap:
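To make the buffering idea concrete, here is a hedged Python sketch of the same technique (again not STV itself). The date ranges are collected in a buffer while the children of status are streamed, and the buffer is checked when status ends. The root element partners, the sample data, the open-end default and the assumption that end dates are inclusive are all mine.

```python
import io
import xml.etree.ElementTree as ET

OPEN_END = "9999-12-31"  # assumed default for a missing @ende (open end)


def find_overlaps(source):
    """Buffer the periods of each status element and report overlapping ones."""
    errors = []
    buffer = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "begin_end_tmr":
            # Append the period to the buffer; @beginn is assumed to be present.
            buffer.append((elem.get("beginn"), elem.get("ende", OPEN_END)))
        elif elem.tag == "status":
            # Sorted by start date, any overlap shows up between two neighbours,
            # so at least one error is reported whenever an overlap exists.
            buffer.sort()
            for (s1, e1), (s2, e2) in zip(buffer, buffer[1:]):
                if s2 <= e1:  # end dates are treated as inclusive
                    errors.append(f"{s1}..{e1} overlaps {s2}..{e2}")
            buffer = []   # start with an empty buffer for the next status
            elem.clear()  # release the processed subtree
    return errors


# Hypothetical sample: the first status has an open-ended period followed by
# another period (an overlap); the second status is fine.
SAMPLE = b"""<partners>
  <status>
    <nr><begin_end_tmr beginn="2000-01-01"/></nr>
    <nr><begin_end_tmr beginn="2005-01-01" ende="2005-12-31"/></nr>
  </status>
  <status>
    <nr><begin_end_tmr beginn="2000-01-01" ende="2000-12-31"/></nr>
    <nr><begin_end_tmr beginn="2001-01-01" ende="2001-12-31"/></nr>
  </status>
</partners>"""

errors = find_overlaps(io.BytesIO(SAMPLE))
```

Only the periods of one status element are held in memory at a time, which is what keeps this approach usable for large documents.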
An STV implementation
In fact it is longer compared to the short STV schema above. I suggest you analyse this program to learn how STV works. It is also a chance to improve your STX skills.
We can use STV to code checks that will be performed on a certain XML document. Using an XSLT 2.0 transformation we generate an STX program that performs those checks. In the example above we expect that this document is valid according to a certain schema; otherwise the generated program won’t produce correct results.
STV is a self-made tool and I think a lot of things can be done better:
- Up to now there is no STV specification.
- STV is hard to read.
- STV is too near to its implementation in STX. Perhaps we could find a more abstract syntax.
- An STV assertion can’t report the location of an XML element in the checked document because neither STX nor the XML Infoset supports that information.
- We could change the syntax so that it is closer to Schematron.
- We could integrate W3C XML Schema to gain a more understandable STV syntax. If we require that a document we want to check is valid according to a specific schema then we could simplify the STX syntax. The XSLT 2.0 transformation that produces the STX program could work on that specific W3C XML Schema and determine the locations for the checks. As a result STV would gain a more “rule-based” character.
I hope I will have time to work on this. Any help would be appreciated.
But there is one thing to mention: I think STX and XML streaming techniques in general are not well known in the XML community. The STX community is very small, and I think any help improving Joost and its counterpart in Perl would be appreciated, too.
Dealing with XML mass data is still a challenge. We have powerful streaming techniques in ABAP (I suggest you read this publication – the English version is coming soon) and in Java. I hope this weblog series helps you apply these Java techniques.