to Unicode or not to Unicode?

If you work for a company where all business is done in one language, English, chances are you will never come across Unicode.

The first time I started reading about Unicode was after a migration to ECC 6.0 (from a non-Unicode 4.7), when some of the file interfaces stopped working.

The second time I came across Unicode was when I found the SAPlink code removing the Byte Order Mark character from an XML file. There must have been some reason for doing that, but I don’t know what it was. See the details in the section “SAPlink removing Byte Order Mark” at the end of this post.

Now I was really puzzled and decided to research how well ABAP can work with XML and Unicode.

Research

My research goal was to develop an ABAP program that can:

  • capture a “Hello” message in different languages
  • save the messages in a Unicode XML file on the frontend
  • display the message from the Unicode XML file in a given language or display all messages

I knew that computers need some help to speak a language other than English. Remembering how I once “taught” MS-DOS to speak my language, I started to explore.

I found a very informative collection of articles about “Characters and encodings” by Jukka “Yucca” Korpela.

Jukka “Yucca” Korpela writes:

(…) a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated. Sometimes one can guess the encoding, but data processing and transfer shouldn’t be guesswork.

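Nicely said! That was it. To avoid “the guesswork”, XML has the encoding declaration.

For example, a UTF-8 document would typically begin with a declaration such as:

  <?xml version="1.0" encoding="UTF-8"?>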

The Byte Order Mark plays an important role too:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark (…) This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
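
To make the idea of an “encoding signature” concrete, here is a minimal sketch in Python (not ABAP, and not taken from any SAP library). It only looks at the first bytes of a document and reports which Byte Order Mark, if any, is present. The function name sniff_bom is my own invention for this illustration, and the sketch covers only the UTF-8 and UTF-16 signatures mentioned in the quote above.

```python
import codecs

# BOM signatures for the two encodings the XML specification talks about.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


utf8_doc = codecs.BOM_UTF8 + "<hello/>".encode("utf-8")
utf16_doc = "<hello/>".encode("utf-16")  # plain "utf-16" prepends a BOM
plain_doc = b"<hello/>"                  # no BOM at all

for doc in (utf8_doc, utf16_doc, plain_doc):
    print(sniff_bom(doc))
```

A document without a BOM is not an error: in that case a parser falls back on the encoding declaration, or on UTF-8 when there is no declaration either.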

The question I needed to answer now was: Is it possible to open an XML file as binary and let the ABAP XML processor figure out what encoding to use?

After some coding the answer I found was: Yes.
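
The same behaviour can be observed outside ABAP. The following is a small Python sketch, not my ABAP program: it writes the same document once as UTF-8 and once as UTF-16, opens both files in binary mode, and hands the raw bytes to the standard library parser without saying which encoding was used. The element names messages and hello and the file names are assumptions made up for this example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document layout, used only for this illustration.
XML_TEMPLATE = (
    '<?xml version="1.0" encoding="{enc}"?>'
    '<messages><hello lang="ru">Привет</hello></messages>'
)


def read_hello(path: str) -> str:
    """Open the file as binary; no encoding is passed, the parser decides."""
    with open(path, "rb") as f:
        root = ET.parse(f).getroot()
    return root.find("hello").text


results = {}
with tempfile.TemporaryDirectory() as tmp:
    for enc in ("utf-8", "utf-16"):
        path = os.path.join(tmp, f"hello_{enc}.xml")
        with open(path, "wb") as f:
            # Plain "utf-16" makes Python prepend a Byte Order Mark.
            f.write(XML_TEMPLATE.format(enc=enc).encode(enc))
        results[enc] = read_hello(path)

print(results)
```

Both files yield the same text even though their bytes on disk differ, because the parser relies on the Byte Order Mark and the encoding declaration rather than on the caller.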

My ABAP test program shows how to load and display an XML file. You can download a test UTF-8 or UTF-16 file or use one of your own.

Note that the exception handling is not done properly in the code below.

Say “Hello” program

The Say “Hello” program prompts you to enter a “Hello” message in your language.
Select “Update” to add your message to the file.
You can download a sample XML file in UTF-8 or UTF-16 format.

You can download a SAPlink installation of the program and related utility classes.

XML utility class

I have developed a class, ZCL_EVP_XML_UTILS, with two methods, LOAD_IXML_DOC_FROM_FILE and SAVE_IXML_DOC_TO_FILE, that load XML files from and save them to the frontend.

When you save the XML you have to specify the encoding. When the XML is loaded the encoding is not required, because the parser determines it from the XML encoding declaration (and from the Byte Order Mark, if one is present).
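
This asymmetry between saving and loading is not specific to ABAP. As a rough illustration only (this is Python, not the ZCL_EVP_XML_UTILS class), the sketch below has a save function that cannot work without an encoding argument and a load function that takes nothing but bytes. The names save_xml and load_xml and the element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def save_xml(root: ET.Element, encoding: str) -> bytes:
    """Serialize to bytes; the caller has to choose the encoding."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding=encoding, xml_declaration=True)
    return buf.getvalue()


def load_xml(raw: bytes) -> ET.Element:
    """Parse bytes; the encoding is read from the document itself."""
    return ET.fromstring(raw)


# Hypothetical document with one greeting.
root = ET.Element("messages")
ET.SubElement(root, "hello", lang="ja").text = "こんにちは"

for enc in ("utf-8", "utf-16"):
    raw = save_xml(root, enc)
    # Show the size, the first bytes, and the text after a round trip.
    print(enc, len(raw), raw[:8], load_xml(raw).find("hello").text)
```

The serializer writes the declaration (and, for UTF-16, a Byte Order Mark) into the output, which is exactly the information the loader needs later.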

Source Code

The live version of the Say “Hello” program and the XML utility class is hosted here in the Google Code project under SVN. The Say “Hello” program is in the test directory.

Unicode!

Going back to my question from the beginning: to Unicode or not to Unicode?

A fellow ABAP-er asked me a few days ago, “How many bytes are in that string?” I replied with a question (not polite, I know ;-)): “What encoding are you using?”
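
Why the counter-question matters can be shown in a few lines of Python (used here purely for illustration). The helper byte_lengths is a made-up name; it encodes the same string in three Unicode encodings and counts the resulting bytes.

```python
def byte_lengths(text: str) -> dict:
    """Number of bytes the same string occupies in several encodings."""
    # The "-le" variants are used so that no Byte Order Mark is counted.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


for word in ("Hello", "Привет", "こんにちは"):
    print(word, len(word), byte_lengths(word))
```

“Hello” takes 5, 10 and 20 bytes; “Привет” takes 12, 12 and 24; “こんにちは” takes 15, 10 and 20. The number of characters alone does not tell you the number of bytes.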

After learning a lot about Unicode I really liked its universal concept.

My answer is definitely: to Unicode!


SAPlink removing Byte Order Mark

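Here is what I found in revision 280 of the ZSAPLINK class:

Method CONVERTIXMLDOCTOSTRING shifts the XML string to the left, one character at a time, until it starts with “<”. This removes a leading Byte Order Mark, along with anything else that precedes the first tag:

  " drop leading characters (such as the BOM) until the string starts with '<'
  while _tempString(1) <> '<'.
    shift _tempString left by 1 places.
  endwhile.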
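
For comparison, here is a rough Python rendering of that loop next to a narrower alternative that removes only the Byte Order Mark. This is my own sketch, not SAPlink code; both function names are invented, and the guard against an empty string is an addition that the ABAP fragment does not have.

```python
BOM = "\ufeff"  # U+FEFF, the Byte Order Mark as it appears after decoding


def strip_until_tag(text: str) -> str:
    """Like the loop above: drop characters until the text starts with '<'."""
    # Guard added in this sketch: stop on an empty string.
    while text and text[0] != "<":
        text = text[1:]
    return text


def strip_bom_only(text: str) -> str:
    """Narrower alternative: remove one leading BOM and nothing else."""
    return text[1:] if text.startswith(BOM) else text


with_bom = BOM + "<hello/>"
with_junk = "  junk<hello/>"

print(repr(strip_until_tag(with_bom)), repr(strip_bom_only(with_bom)))
print(repr(strip_until_tag(with_junk)), repr(strip_bom_only(with_junk)))
```

The two functions agree when the only leading character is the BOM. They differ when something else precedes the first tag: the loop silently discards it, while the narrower version leaves it in place for the parser to report.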

6 Comments

  1. Gregory Misiorek
    I found this website to be a good intro to Unicode characters and how they fit into the planes. What strikes me most is that Unicode is only a concept and only the encoding makes it real, and we have all seen different encodings whenever a string gets “recrypted” when copied across platforms.
        1. Gregory Misiorek

          I was trying to show that even though the standard for the U+0080 code point is a <control> character, Microsoft found it appropriate to “stick” the euro sign (normally U+20AC) there in some of their applications.

  2. Peter Norris
    I appreciated your logical approach to the subject.

    Unicode and its encodings can be a challenge, and your blog is helpful in understanding some of the complexities.

    The Unicode Consortium website is a useful resource when dealing with the subject.

    http://www.unicode.org/
