to Unicode or not to Unicode?
If you work for a company where all business is done in one language, English, chances are you will never come across Unicode.
The first time I read about Unicode was after a migration to ECC 6.0 (from a non-Unicode 4.7 system), when some of the file interfaces stopped working.
The second time I came across Unicode was when I found the SAPlink code that removes the Byte Order Mark character from the XML file. There must have been some reason for doing that, but I don't know what it is. See the details in the "SAPlink removing Byte Order Mark" section at the end of this post.
Now I was really puzzled and decided to research how well ABAP can work with XML and Unicode.
Research
My research goal was to develop an ABAP program that can:
- capture a "Hello" message in different languages
- save the messages in a Unicode XML file on the frontend
- display the message from the Unicode XML file in a given language or display all messages
I knew that computers need some help to speak a language other than English. Remembering how I once "taught" MS-DOS to speak my language, I started to explore.
I found a very informative collection of articles about "Characters and encodings" by Jukka "Yucca" Korpela.
Korpela writes:
(...) a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated. Sometimes one can guess the encoding, but data processing and transfer shouldn't be guesswork.
Nicely said! That was it. To avoid "the guesswork", XML has the encoding declaration.
For example,
<?xml version="1.0" encoding="UTF-8"?>
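The guesswork Korpela describes is easy to reproduce. The sketch below is written in Python rather than ABAP so that it runs anywhere; it decodes the same four octets under three assumed encodings. The octet values are chosen purely for illustration and do not come from the sample files.

```python
# The same four octets, read under three different encoding assumptions.
octets = bytes([0xD0, 0x9F, 0xD1, 0x80])

as_utf8 = octets.decode("utf-8")        # two octets per character here
as_latin1 = octets.decode("latin-1")    # always one octet per character
as_utf16 = octets.decode("utf-16-be")   # one octet pair per character here

for name, text in (("utf-8", as_utf8),
                   ("latin-1", as_latin1),
                   ("utf-16-be", as_utf16)):
    print(f"{name:10} -> {len(text)} character(s): {text!r}")
```

All three decodings succeed without an error, and nothing in the octets themselves says which one was intended. That is the information the encoding declaration supplies.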
The Byte Order Mark plays an important role too:
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark (...) This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
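As a rough illustration of how such an encoding signature can be recognised, here is a minimal Python sketch (not the ABAP XML processor's actual logic) that checks the first octets of a file against the UTF-8 and UTF-16 Byte Order Marks. The function name `sniff_bom` is hypothetical.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    # UTF-8 BOM is EF BB BF; UTF-16 BOM is FE FF (big-endian) or FF FE (little-endian).
    # UTF-32 is deliberately not handled in this sketch.
    for bom, name in ((codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_BE, "utf-16-be"),
                      (codecs.BOM_UTF16_LE, "utf-16-le")):
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

xml = '<?xml version="1.0"?><hello/>'
print(sniff_bom(codecs.BOM_UTF8 + xml.encode("utf-8")))
print(sniff_bom(codecs.BOM_UTF16_LE + xml.encode("utf-16-le")))
print(sniff_bom(xml.encode("utf-8")))  # no BOM: fall back to the declaration
```

When no Byte Order Mark is present, a processor still has the encoding declaration to go by, which is why the two mechanisms complement each other.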
The question I needed to answer now was:
Is it possible to open an XML file as binary and let the ABAP XML processor figure out what encoding to use?
After some coding the answer I found was:
Yes.
This program shows how to load and display an XML file. You can download a test UTF-8 or UTF-16 file or use one of your own.
Note that the exception handling is not done properly in the code below.
Say "Hello" program
The Say "Hello" program prompts you to enter a "Hello" message in your language.
Select "Update" for your message to be added to the file.
You can download a sample XML file in UTF-8 or UTF-16 format.
You can download a SAPlink installation of the program and related utility classes.
XML utility class
I have developed a class, ZCL_EVP_XML_UTILS, that has two methods, LOAD_IXML_DOC_FROM_FILE and SAVE_IXML_DOC_TO_FILE, to load XML files from and save them to the frontend.
When you save the XML you have to specify the encoding. When you load the XML no encoding is required, because it is defined by the XML encoding declaration.
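The asymmetry between saving and loading can be tried out with any XML parser that accepts raw bytes. The following is a minimal Python sketch, not a port of ZCL_EVP_XML_UTILS: `save_xml` and `load_xml` are hypothetical stand-ins for the two methods, and the element and attribute names (`messages`, `hello`, `lang`) are assumptions, since the layout of the sample files is not shown here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def save_xml(root, path, encoding):
    """Write the tree with an explicit encoding and a matching XML declaration."""
    ET.ElementTree(root).write(path, encoding=encoding, xml_declaration=True)

def load_xml(path):
    """Parse the file from raw bytes; no encoding argument is passed."""
    with open(path, "rb") as handle:
        return ET.fromstring(handle.read())

# Build a small document with "Hello" messages in several languages.
root = ET.Element("messages")
for lang, text in (("en", "Hello"), ("ru", "Привет"), ("de", "Grüß Gott")):
    ET.SubElement(root, "hello", {"lang": lang}).text = text

loaded = {}
with tempfile.TemporaryDirectory() as folder:
    for encoding in ("utf-8", "utf-16"):
        path = os.path.join(folder, f"hello-{encoding}.xml")
        save_xml(root, path, encoding)
        loaded[encoding] = [(e.get("lang"), e.text)
                            for e in load_xml(path).iter("hello")]

# Both files yield the same messages although their bytes differ.
print(loaded["utf-8"] == loaded["utf-16"])
```

Only the writer is told the encoding. The reader opens the file in binary mode and leaves the decision to the parser, which relies on the Byte Order Mark and the encoding declaration.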
Source Code
The live version of the Say "Hello" program and the XML utility class is hosted in the Google Code project under SVN. The Say "Hello" program is in the test directory.
Unicode!
Going back to my question from the beginning: to Unicode or not to Unicode?
A fellow ABAP-er asked me a few days ago, "How many bytes are in that string?" I replied with a question (not polite, I know ;-)): "What encoding are you using?"
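The counter-question is not just pedantry. A short Python sketch (the greeting is an arbitrary example) shows that the byte count of one and the same string depends entirely on the encoding:

```python
text = "Grüß Gott"

# Encode the same nine characters in four encodings and count the bytes.
sizes = {enc: len(text.encode(enc))
         for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")}

print(len(text), "characters")
for enc, size in sizes.items():
    print(f"{enc:10} {size} bytes")
```

The nine characters take 9 bytes in Latin-1, 11 in UTF-8, 18 in UTF-16 and 36 in UTF-32, so "how many bytes" has no answer until the encoding is named.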
After learning a lot about Unicode I really liked its universal concept.
My answer is definitely: to Unicode!
SAPlink removing Byte Order Mark
Here is what I found in revision 280 of the ZSAPLINK class:
Method CONVERTIXMLDOCTOSTRING shifts the string left until it starts with '<', which removes the Byte Order Mark character at the beginning of the XML:
while _tempString(1) <> '<'.
  shift _tempString left by 1 places.
endwhile.
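For readers who do not speak ABAP, the effect of that loop can be sketched in Python. This is an illustration of the behaviour, not SAPlink code, and the function name `strip_before_markup` is hypothetical.

```python
def strip_before_markup(text: str) -> str:
    """Drop every leading character until the text starts with '<'."""
    start = text.find("<")
    # Leave the text alone when there is no '<' at all; the ABAP loop
    # shown above has no such guard.
    return text if start < 0 else text[start:]

# U+FEFF is the Byte Order Mark as it appears once the bytes are decoded.
with_bom = "\ufeff<?xml version=\"1.0\"?><hello/>"
print(strip_before_markup(with_bom) == '<?xml version="1.0"?><hello/>')
```

Note that the approach removes anything in front of the first '<', not only a Byte Order Mark.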