Thoughts about Quality of XML based Specifications...

ttrapp · ‎05-18-2007

Some things seem so simple that we don’t spend much time questioning them. For example: how do we encode the day of the week in a specification for data exchange? For an XML Schema expert it is an easy exercise to define the code list as an enumeration data type:

<xs:simpleType name="day">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Mon"/>
    <xs:enumeration value="Tue"/>
    <xs:enumeration value="Wed"/>
    <xs:enumeration value="Thu"/>
    <xs:enumeration value="Fri"/>
    <xs:enumeration value="Sat"/>
    <xs:enumeration value="Sun"/>
  </xs:restriction>
</xs:simpleType>

But is it reasonable? There are doubts:

If a single code in a code list changes, we have to replace the XML schema. This is easy in an EAI scenario, but in a B2B scenario with many partners it can be difficult.
Some code lists are huge (think of medical classifications). As a consequence we get huge schemata, and validation against them becomes time-consuming.
And if only one code is wrong, should we really reject the whole document?
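One way to picture the alternative is to keep the code list outside the schema and report wrong codes instead of rejecting the document. The following is only a minimal sketch under assumptions: the payload, the element name day and the function find_invalid_codes are invented for illustration, and the code list is hard-coded where a real system would load it from somewhere else.

```python
import xml.etree.ElementTree as ET

# Hypothetical code list kept outside the schema, so that it can change
# without distributing a new XML schema to every partner.
DAY_CODES = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}


def find_invalid_codes(xml_text, element_name, allowed):
    """Return (position, value) for every element whose text is not an allowed code."""
    root = ET.fromstring(xml_text)
    problems = []
    for position, element in enumerate(root.iter(element_name), start=1):
        value = (element.text or "").strip()
        if value not in allowed:
            problems.append((position, value))
    return problems


# Hypothetical payload; the element names are invented for this example.
PAYLOAD = """
<schedule>
  <day>Mon</day>
  <day>Lun</day>
  <day>Fri</day>
</schedule>
"""

invalid = find_invalid_codes(PAYLOAD, "day", DAY_CODES)
print(invalid)  # the second <day> element carries a code that is not in the list
```

The document is still parsed as a whole; the caller decides whether one wrong code is a reason to reject it or only to flag it.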

But there are other difficulties. Every French patriot will explain to you that {‘Lun’, ‘Mar’, ‘Mer’, ‘Jeu’, ‘Ven’, ‘Sam’, ‘Dim’} is the better code list. This leads to the following general questions:

Which representation for a code list should we use?
Is my code list complete? Different users usually define different subsets of code lists.
Which representation of a code list do we need in a particular situation?
Are there other subsets that are more reasonable? Are we able to put those subsets together if necessary?
Suppose that standardizing different code lists is not possible because they coexist in different specifications. How can we define proper mappings between them? Can we define standardized mappings to translate between different representations of code lists?
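To make the questions about mappings and subsets a bit more concrete, here is a small sketch. It assumes that both day lists are ordered Monday to Sunday and that the French list ends with ‘Dim’; the names translate and is_subset are invented for this example and do not come from any standard.

```python
# Assumed three-letter abbreviations, both ordered Monday to Sunday.
ENGLISH_DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
FRENCH_DAYS = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]

# Pair the codes by position; this only works because both lists
# describe the same concepts in the same order.
EN_TO_FR = dict(zip(ENGLISH_DAYS, FRENCH_DAYS))
FR_TO_EN = {french: english for english, french in EN_TO_FR.items()}


def translate(code, mapping):
    """Translate one code; raise ValueError if the mapping has no entry for it."""
    try:
        return mapping[code]
    except KeyError:
        raise ValueError(f"no mapping defined for code {code!r}") from None


def is_subset(candidate, full_list):
    """Check that every code of a candidate subset occurs in the full code list."""
    return set(candidate) <= set(full_list)


WORKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]  # one possible user-defined subset

print(translate("Wed", EN_TO_FR))
print(translate("Sam", FR_TO_EN))
print(is_subset(WORKDAYS, ENGLISH_DAYS))
```

A positional mapping like this is the simplest case. As soon as two lists do not cover the same concepts, the mapping itself has to be defined and maintained explicitly.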

And what about code lists in classical, non-XML-based specifications? If you look at the specifications for data exchange between German compulsory health insurance and other institutions, you will notice that there are lots of proprietary code lists. We have to copy them from PDF documents in order to program proper mappings. There is no standard software we can use to scan the payload for incorrect codes.

In fact, it would be useful to have a formal definition of code lists, together with standard software that performs mappings or validates documents against those formal definitions. With XML technology we have the chance to do better.

Code Lists in XML based Specifications

Let’s take a look at some specifications for data exchange and study how they deal with code lists. I was surprised to find that simple data types are not as common as I had expected. Some standards chose a different approach:

FpML (Financial products Markup Language, a financial XML specification) decided not to define simple data types; instead it refers to external schemes containing code lists. UBL and MDDL chose a similar approach.
Health care specifications like HL7 messages or CDA (and its German adaptation SCIPHOX) use OIDs to identify code lists. OID is short for Object Identifier and has its origin outside XML. OIDs are unique identifiers, written as dot-separated numbers and registered under the ISO registration standard, that reference a specific object or object class: institutions, classifications or even records. The code lists identified by OIDs are not defined by schemata, and there is no way to validate them. We put the OID in the S attribute and the OID version in the SV attribute; the code itself is in the V attribute:
```
<fachgruppe V="060" S="1.2.276.0.76.5.114" SV="1.00"/>
```
The German standard eHealthData, which is similar to SCIPHOX, uses OIDs too, but delivers the code lists within an XML document. In fact, it has a very clever mechanism that ensures that the codes are unique within a code list and that codes in the payload are valid according to the code list specified in an XML schema. I recommend that everyone interested in advanced XML Schema design study this specification, especially the schema keytabs.xsd.
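Since a schema cannot check codes that are identified by an OID, such a check has to happen in application code. The sketch below shows the idea for the S/SV/V attributes of the SCIPHOX example. The registry CODE_LISTS and the function check_coded_element are assumptions made for illustration; the registry contains only the one code from the example, not the real content of that code list.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry keyed by (OID, version). Only the code from the
# example is listed; the real content of this code list is not reproduced.
CODE_LISTS = {
    ("1.2.276.0.76.5.114", "1.00"): {"060"},
}


def check_coded_element(xml_text):
    """Classify a coded element as 'valid', 'unknown code' or 'unknown code list'."""
    element = ET.fromstring(xml_text)
    key = (element.get("S"), element.get("SV"))
    codes = CODE_LISTS.get(key)
    if codes is None:
        return "unknown code list"
    return "valid" if element.get("V") in codes else "unknown code"


# The first element is the example from the text; the code 999 and the
# version 2.00 in the other two are made up to show the failure cases.
print(check_coded_element('<fachgruppe V="060" S="1.2.276.0.76.5.114" SV="1.00"/>'))
print(check_coded_element('<fachgruppe V="999" S="1.2.276.0.76.5.114" SV="1.00"/>'))
print(check_coded_element('<fachgruppe V="060" S="1.2.276.0.76.5.114" SV="2.00"/>'))
```

Keying the registry by OID and version together means that a new version of a code list can be added without touching the entries for older documents.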

The German eGovernment standard XMeld refers to external code lists, too:

<statusderwohnung>
  <tabelle>http://www.osci.de/xmeld11/spezifikation#schluesseltabelle.5</tabelle>
  <schluessel>0</schluessel>
</statusderwohnung>
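With this layout it is straightforward to pull the codes out of a document together with the code list they refer to. The following sketch groups every schluessel value by the URI in its sibling tabelle element. It assumes a document without XML namespaces, the wrapper element meldung is invented for the example, and collect_codes is a hypothetical helper, not part of XMeld.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_codes(xml_text):
    """Group (element name, code) pairs by the code list URI they refer to."""
    root = ET.fromstring(xml_text)
    codes_by_table = defaultdict(list)
    for element in root.iter():
        # findtext only looks at direct children, so an element qualifies
        # only if it contains both <tabelle> and <schluessel> itself.
        table = element.findtext("tabelle")
        key = element.findtext("schluessel")
        if table is not None and key is not None:
            codes_by_table[table.strip()].append((element.tag, key.strip()))
    return dict(codes_by_table)


# The wrapper element <meldung> is hypothetical; the inner element is the
# example from the text.
DOCUMENT = """
<meldung>
  <statusderwohnung>
    <tabelle>http://www.osci.de/xmeld11/spezifikation#schluesseltabelle.5</tabelle>
    <schluessel>0</schluessel>
  </statusderwohnung>
</meldung>
"""

result = collect_codes(DOCUMENT)
for table_uri, codes in result.items():
    print(table_uri, codes)
```

The result is a plain inventory of which codes claim to belong to which code list. Checking them still requires a machine-readable version of each list.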

Genericode

Up to now there are no standard techniques for validating XML documents against the external code lists mentioned above. However, the XML infrastructure offers tools for selecting codes from XML documents together with the code lists they belong to, which are identified by URIs. First, though, we need a generic method for describing code lists and their different representations. We also need a versioning mechanism and a method for defining subsets of code lists.

And this is exactly what OASIS has proposed: the committee draft of genericode 1.0, the OASIS Code List Representation XML format, has been released for public review; see http://www.genericode.org/.

This specification defines an abstract model in XML Schema and UML with three elements:

ColumnSet defines a set of columns and keys that can be reused in code list definitions,
CodeList defines a simple or derived code list,
CodeListSet defines a set of code list versions.
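The relationship between these three elements can be pictured with a few plain data classes. This is only a simplified sketch of the idea (columns, a key, rows, versions), not the genericode schema itself: all attribute and method names are assumptions, and the day names serve merely as sample data.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSet:
    """Reusable description of the columns of a code list and of its key."""
    columns: list  # column identifiers
    key: list      # identifiers of the columns that form the key


@dataclass
class CodeList:
    """One version of a code list; every row maps a column identifier to a value."""
    name: str
    version: str
    column_set: ColumnSet
    rows: list = field(default_factory=list)

    def key_of(self, row):
        return tuple(row[column] for column in self.column_set.key)

    def duplicate_keys(self):
        """Return the key values that occur more than once (empty if all are unique)."""
        seen, duplicates = set(), []
        for row in self.rows:
            key = self.key_of(row)
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return duplicates

    def subset(self, name, keep):
        """Derive a code list that keeps only the rows whose key is in keep."""
        rows = [row for row in self.rows if self.key_of(row) in keep]
        return CodeList(name, self.version, self.column_set, rows)


@dataclass
class CodeListSet:
    """Groups the versions of one code list."""
    name: str
    versions: dict = field(default_factory=dict)  # version string -> CodeList

    def add(self, code_list):
        self.versions[code_list.version] = code_list


DAY_COLUMNS = ColumnSet(columns=["code", "name"], key=["code"])
NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
days = CodeList("days", "1.0", DAY_COLUMNS,
                [{"code": name[:3], "name": name} for name in NAMES])
workdays = days.subset("workdays", {("Mon",), ("Tue",), ("Wed",), ("Thu",), ("Fri",)})

day_versions = CodeListSet("days")
day_versions.add(days)

print(days.duplicate_keys())
print([row["code"] for row in workdays.rows])
```

Separating the column set from the rows is what allows several code lists, or several versions of one list, to share the same structure.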

Draft versions of genericode are used by FpML. Here is an example that refers to USD as the currency unit:

USD

Summary

Enumerations defined as simple data types are not very useful for code lists in data exchange specifications; modern approaches refer to external schemes instead. Genericode provides a generic data model for code lists that borrows ideas from relational databases. It also supports versioning and changes.

The next step is to create an infrastructure for managing code lists, performing mappings based on code lists, and validating documents against code lists.