We exchange data to make the data of one system accessible to another system. To do so, we use standardized data exchange formats and frameworks for modeling business messages, and sometimes we have to create our own specifications for data exchange. But when is a specification good, and what does “good” mean?
The Business Process Expert's Point of View
If we want to answer the question of whether a certain specification for data exchange is good, we have to listen to what a business process expert has to say. The key question is: what is it good for? A specification is good if it supports our business processes. Therefore we have to ask the following questions:
- Process integration: What kind of processes do we want to link? What does the process model look like?
- What kind of documents will be exchanged? How do they relate to the business processes?
- Compliance: Do we have to take care of authentication?
- IT-based process optimization: If there is an existing data exchange, what can be done better?
- Quality of the data: Is there a problem? How does it affect the business processes, and what can be done about it?
The Technologist's Point of View
Once we have answered the questions above and figured out how they affect the data exchange scenario, we can go into detail:
- What about the transport layer?
- Data integration: Can we define a common model for the business objects we want to exchange?
- How do we model the business objects?
- Can we use existing standards?
Today most of these standards are XML-based. Because XML is just a generic syntax for hierarchical data structures, a lot of different standards emerged within a short time. Some of these standards are very flexible: you can think of them as frameworks that allow us to define encodings for new types of business objects. We encode these types using schema languages like DTD, W3C XML Schema or Schematron.
When I study XML-based specifications, I sometimes think that they could be done better. Let me tell you about some pitfalls:
- The authors have little experience using a schema language. In fact this problem is very easy to solve in a review.
- The authors have no experience with any eCommerce standard and translate their legacy message specification straight to XML. Of course this leads to trouble: the most common mistake is that the head-body pattern isn’t applied. In the worst case they add the fields of an EDIFACT UNB segment as attributes to the XML root element. Of course such a compact encoding was important once, when we used to exchange our data on disks that could hold only 360 KB, but today it is outdated.
- The authors reinvented the world one more time. This problem can be solved by using predefined data types. A typical example is currency units: there are standards for coding them. This problem is very easy to solve in a review if the reviewer knows the standards well.
- Semantic ambiguity: But what is it and how can we detect it?
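The point about predefined data types can be illustrated with a small check a reviewer might run. This is only a sketch under assumptions: the `Amount` element and its `currency` attribute are hypothetical names, and the code list is a short excerpt of ISO 4217 alphabetic codes rather than the full standard.

```python
import xml.etree.ElementTree as ET

# A small excerpt of ISO 4217 alphabetic codes; a real check would load the full list.
ISO_4217_EXCERPT = frozenset({"EUR", "USD", "GBP", "CHF", "JPY"})

# Hypothetical message: <Amount> and its "currency" attribute are illustrative names.
CURRENCY_SAMPLE = """\
<Invoice>
  <Amount currency="EUR">100.00</Amount>
  <Amount currency="Euro">20.00</Amount>
  <Amount currency="usd">5.00</Amount>
</Invoice>
"""


def invalid_currency_codes(xml_text, allowed=ISO_4217_EXCERPT):
    """Return the currency values that are not in the agreed code list, in document order."""
    root = ET.fromstring(xml_text)
    return [
        amount.get("currency")
        for amount in root.iter("Amount")
        if amount.get("currency") not in allowed
    ]


if __name__ == "__main__":
    print(invalid_currency_codes(CURRENCY_SAMPLE))
```

A home-grown free-text currency field would accept all three values above. Restricting the field to a standard code list makes the two deviating spellings visible immediately.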
Sometimes I read a specification and think that the people who created it just looked at the database schema of their system and transferred it to an XML schema language. Of course you can work this way, but don’t expect that someone else will understand it easily. The main problem is how to deal with semantics: because XML is just a syntax, you can’t expect XML markup and a schema language to solve every problem.
I think it would be good if we had some simple but necessary rules of dos and don’ts.
We can define semantics by linking a message in data exchange to a model (perhaps expressed in a schema language) or even to a conceptual meta-model, such as one described by an ontology. If we have a common conceptual model of business entities, we can ensure that we understand the data exchange scenario correctly, and we can verify that it is appropriate with respect to our business processes. Perhaps these formalized meta-models can be used by both technologists and business experts to specify a data exchange scenario.
Computer theorists also define semantics using constraints when they formalize data exchange and data integration. Given a source schema and a global schema (your specification for data exchange), they define assertions relating elements of the source schema to the global schema. These constraints can be formulated in mathematical logic, but I think that when you create your data exchange specification you will do it by intuition. Let me give you an example: when you write your specification you may think something like “each entity A in my database is related to XML element B in the XML schema”. But this is just a heuristic and can lead to trouble: deriving a specification from a database schema isn’t the same as object-oriented design of data types.
If you look at the sender and the receiver in a data exchange scenario, both have to solve the same problem: data structured under their own schema has to be translated to data under the global schema and back, and this has to be done as accurately as possible. But this view of data exchange doesn’t help us much, because usually the sender doesn’t know the constraints used by the receiver and vice versa. And even if we knew them, this wouldn’t be useful. The reason is that we don’t do data exchange for its own sake: we want to link business processes, and often these processes require additional constraints that guarantee the assertions needed for electronic business processes. In the following I will discuss a constraint that is trivial but that I consider important.
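To make the "entity A is related to element B" heuristic concrete, here is a minimal sketch of such an assertion written down explicitly instead of kept in one's head. Everything in it is hypothetical: the column names of the source row, the `Party` element of the global schema, and the mapping between them are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source schema: one row of a CUSTOMER table.
# Hypothetical global schema: a <Party> element with <Name> and <Country> children.
COLUMN_TO_ELEMENT = {"CUST_NAME": "Name", "CUST_CTRY": "Country"}


def to_global(row):
    """Translate a source row into an element of the (assumed) global schema."""
    party = ET.Element("Party", id=str(row["CUST_ID"]))
    for column, element_name in COLUMN_TO_ELEMENT.items():
        ET.SubElement(party, element_name).text = row[column]
    return party


def satisfies_assertion(row, party):
    """Check the assertion: every mapped column value appears in its target element."""
    return all(
        party.findtext(element_name) == row[column]
        for column, element_name in COLUMN_TO_ELEMENT.items()
    )


if __name__ == "__main__":
    row = {"CUST_ID": 42, "CUST_NAME": "ACME", "CUST_CTRY": "DE"}
    party = to_global(row)
    print(ET.tostring(party, encoding="unicode"))
    print(satisfies_assertion(row, party))
```

The sketch only states a one-to-one correspondence between columns and elements. It says nothing about what a `Party` means to the receiver, which is exactly the gap the heuristic leaves open.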
Uniqueness of Business Objects
Let me give a practical example. Consider the case that we want to exchange invoices, and the specification used for data exchange allows us to serialize a number of invoices within a single XML document. Whenever I was involved in such a scenario, the same questions came up:
- Is the sender allowed to send a certain invoice more than once?
- Is it allowed that a single XML document contains a certain invoice more than once?
- What happens in that case? Should I ignore it? And if not: which one should be taken?
- How can I detect that an invoice is contained more than once within a single document?
You think that these problems are easy to solve? Consider two cases. Experts say that an invoice can be identified by the tuple of sender, receiver, date and invoice number. What happens if you get two invoices that have the same sender, receiver, date and number but nothing else in common? Some people say we could solve this problem by adding GUIDs (globally unique identifiers). But what happens if an XML document contains two invoices with the same GUID, or two invoices with different GUIDs but no other differences?
These problems can be solved by adding constraints to the XML schema: we can define assertions that enforce uniqueness. This enforces the object identity of a business object in terms of object-oriented design. In my opinion this is crucial for most business processes.
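A receiver can also check such a uniqueness constraint at the application level. The following is a minimal sketch, assuming a hypothetical layout in which each `Invoice` element carries `sender`, `receiver`, `date` and `number` as attributes. In W3C XML Schema the same rule could be declared with an identity constraint such as `xs:unique` or `xs:key`; the code below only shows the check itself.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical message layout: element and attribute names are assumptions.
INVOICE_SAMPLE = """\
<Invoices>
  <Invoice sender="S1" receiver="R1" date="2024-01-15" number="1001"><Total>100.00</Total></Invoice>
  <Invoice sender="S1" receiver="R1" date="2024-01-15" number="1001"><Total>250.00</Total></Invoice>
  <Invoice sender="S1" receiver="R1" date="2024-01-16" number="1002"><Total>80.00</Total></Invoice>
</Invoices>
"""

KEY_FIELDS = ("sender", "receiver", "date", "number")


def find_duplicate_invoices(xml_text):
    """Return {key: [invoice elements]} for every key that occurs more than once."""
    root = ET.fromstring(xml_text)
    by_key = defaultdict(list)
    for invoice in root.iter("Invoice"):
        key = tuple(invoice.get(field) for field in KEY_FIELDS)
        by_key[key].append(invoice)
    return {key: group for key, group in by_key.items() if len(group) > 1}


def same_content(a, b):
    """Compare two subtrees by tag, attributes, trimmed text and children, in order."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(same_content(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    for key, group in find_duplicate_invoices(INVOICE_SAMPLE).items():
        identical = all(same_content(group[0], other) for other in group[1:])
        print(key, "identical copies" if identical else "conflicting content")
```

The check detects the conflict but deliberately does not resolve it. Whether an identical copy may be ignored, and which of two conflicting invoices wins, is a business decision that the specification has to state.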
If I study an XML-based specification for data exchange, I expect that I can easily identify all information belonging to a business object. In the simplest case I would expect that all of this information corresponds to one data type, like an xs:complexType, and that there are no references to other elements.
Moreover I would expect that a business object corresponds to a subtree within the XML document. If a business object uses composition and contains several other business objects I would expect that these business objects are contained in this subtree, too.
I consider this kind of simplicity very important if we deal with mass data: with serial XML transformation languages like STX or Simple Transformations it is easy to create local transformations. If a business object is composed of other business objects that are spread all over the document, it takes some effort to reassemble them. And there is another problem: sometimes I have to split up a single XML document containing mass data into smaller ones that can be transformed in parallel or even transferred to different target systems.
In this blog I was able to introduce two simple but powerful rules of thumb for designing XML-based specifications. I think there must be more, so I would be interested to hear your ideas.
The principles mentioned above are not restricted to XML-based specifications; in fact they apply in most other contexts in data exchange.
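When every business object is a self-contained subtree, splitting becomes almost mechanical. Here is a minimal sketch under assumptions: the `Invoices`/`Invoice`/`Position` names are hypothetical, the business objects are direct children of the root, and the document fits into memory. For real mass data a streaming parser would be the more realistic choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each <Invoice> carries its <Position> children inside its own subtree.
SPLIT_SAMPLE = """\
<Invoices batch="B1">
  <Invoice number="1"><Position>A</Position><Position>B</Position></Invoice>
  <Invoice number="2"><Position>C</Position></Invoice>
  <Invoice number="3"><Position>D</Position></Invoice>
  <Invoice number="4"><Position>E</Position></Invoice>
  <Invoice number="5"><Position>F</Position></Invoice>
</Invoices>
"""


def split_document(xml_text, object_tag, chunk_size):
    """Split one document into smaller ones with at most chunk_size business objects each."""
    root = ET.fromstring(xml_text)
    objects = root.findall(object_tag)  # direct children of the root only
    chunks = []
    for start in range(0, len(objects), chunk_size):
        # Each chunk repeats the root element and its attributes around a slice of objects.
        new_root = ET.Element(root.tag, dict(root.attrib))
        new_root.extend(objects[start:start + chunk_size])
        chunks.append(ET.tostring(new_root, encoding="unicode"))
    return chunks


if __name__ == "__main__":
    for chunk in split_document(SPLIT_SAMPLE, "Invoice", 2):
        print(chunk)
```

No reassembly step is needed because nothing an invoice depends on lives outside its subtree. If positions were stored elsewhere in the document and linked by reference, each chunk would first have to collect the referenced elements.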