Unicode makes it possible to cover the languages of most people in the world with a single character set. It is widely used on the Internet, in office software, and in many other applications. Support for the Unicode character set is also a must if you do not want to set up separate IT systems for different regions of the world (which would then cause trouble when those systems need to communicate with each other).
This blog is about how to read and write Unicode files in ABAP. It is also important to be able to detect whether a file is encoded in Unicode. For this purpose, the ABAP class CL_ABAP_FILE_UTILITIES was shipped with recent support packages. But before getting into these details, let me recall some basics of Unicode.
The Unicode character set
The current Unicode Standard contains about 100,000 characters. Two bytes per character are therefore not enough, whereas four bytes per character would be a waste of resources. For this reason, there are several Unicode encoding schemes, which cater for different needs. In practice, the most important schemes are UTF-8 and UTF-16. Both can represent exactly the same set of characters.
- In UTF-16, the commonly used characters occupy two bytes each. This range of characters includes scripts such as Arabic, Cyrillic, Devanagari, Ethiopic, Hebrew, Thai and many others. It also includes the vast majority of characters used in China, Japan and Korea. However, part of the Hong Kong Supplementary Character Set is not included, nor are some mathematical and musical symbols or the scripts of some dead languages. These characters are represented by surrogate pairs, each consisting of two 16-bit code units (four bytes in total).
- In the UTF-8 encoding scheme, each character occupies between one and four bytes. A byte with a value less than 128 (hexadecimal 0x80) is a one-byte character. A byte with a value greater than 191 (0xBF) is the first byte of a character with more than one byte. The subsequent bytes belonging to such a character are in the range between 128 and 191 (inclusive). So it is not difficult to find out where a new character starts when processing some text.
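The byte ranges described above can be tried out with a few lines of code. The following is a minimal sketch in Python, used here only to illustrate the byte-level rule; the function name char_starts is made up for this example and has nothing to do with the ABAP file interface.

```python
def char_starts(data: bytes) -> list:
    """Return the offsets at which a new character begins in UTF-8 data."""
    starts = []
    for offset, byte in enumerate(data):
        # 0x00-0x7F: one-byte character.
        # 0xC0-0xFF: first byte of a multi-byte character.
        # 0x80-0xBF: continuation byte, so not the start of a character.
        if byte < 0x80 or byte > 0xBF:
            starts.append(offset)
    return starts


sample = "Děčín".encode("utf-8")   # 8 bytes for 5 characters
print(char_starts(sample))         # -> [0, 1, 3, 5, 7]
```

Because the rule looks at one byte at a time, it also works when processing starts in the middle of a text.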
When dealing with files, UTF-8 has some advantages over UTF-16.
- The code values less than 128 coincide with the well-known ASCII, so UTF-8 is in fact a superset of ASCII and can, to some degree, be processed by software which does not know anything about character sets.
- UTF-8 is independent of the byte order. In contrast, the two bytes of a 16-bit code unit of UTF-16 are stored in a hardware-dependent byte order. (This is the so-called endianness, which applies to any integer that needs more than one byte.)
- In many cases, UTF-8 saves space. This holds in particular for languages based on the Latin alphabet because they contain many ASCII characters, which need just one byte in UTF-8. For languages such as Greek and Russian, both UTF-8 and UTF-16 require two bytes per character. Most Chinese, Japanese and Korean characters however need three bytes in UTF-8 while two bytes are sufficient in UTF-16. This drawback may be outweighed by the fact that business data will in practice contain many ASCII characters, the simplest example being numbers.
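The space comparison in the last point is easy to check. This is a small Python sketch with sample strings chosen for this illustration; "utf-16-be" is used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return the number of bytes needed in UTF-8 and in UTF-16."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


samples = {
    "ASCII business data": "Invoice 4711",
    "Russian": "Москва",
    "Japanese": "東京",
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: {utf8_len} bytes in UTF-8, {utf16_len} bytes in UTF-16")
```

For the ASCII sample UTF-8 needs half the space, for the Russian sample both need the same, and for the Japanese sample UTF-8 needs three bytes per character against two in UTF-16.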
Let us look at two examples, the first one being the Swiss town of Genève, known in English as Geneva. The fourth letter of its name is called Latin small letter e with grave and is encoded as 0xE8 in the traditionally used ISO-8859-1 code page. In Unicode this character is the code point \u00E8. (We have borrowed the notation “\u” from other well-known programming languages.) The UTF-16 representation consists of the two bytes 0x00 0xE8, assuming big-endian byte order. The UTF-8 representation is 0xC3 0xA8.
The second example is the Czech town of Děčín. For Czech, the code page ISO-8859-2 has been used traditionally, which encodes the third character of Děčín as 0xE8. This is the same value we have already met in Genève. But of course, Unicode resolves this conflict. The letter č is encoded as \u010D. Its name is Latin small letter c with caron and its UTF-8 representation is 0xC4 0x8D. Děčín is represented in UTF-8 as 0x44 0xC4 0x9B 0xC4 0x8D 0xC3 0xAD 0x6E. (In the unlikely case that you do not see a c with a little v on top, your browser has a font problem; a Unicode font would solve it.)
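The byte values quoted for the two towns can be reproduced with any language that ships the relevant code pages. Here is a short Python sketch; the helper hex_bytes is a name invented for this example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as upper-case hex, separated by spaces."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


# The same byte value 0xE8 means different letters in the two ISO code pages.
print(hex_bytes("è", "iso8859_1"))   # Latin small letter e with grave
print(hex_bytes("č", "iso8859_2"))   # Latin small letter c with caron

# In Unicode the two letters have different code points and different bytes.
print(hex_bytes("è", "utf-16-be"), "|", hex_bytes("è", "utf-8"))
print(hex_bytes("č", "utf-16-be"), "|", hex_bytes("č", "utf-8"))
print(hex_bytes("Děčín", "utf-8"))
```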
For full details about Unicode, visit the Web site of the Unicode Consortium www.unicode.org.
The byte-order mark
When reading a file, you want to be able to distinguish easily between the big-endian and the little-endian version of UTF-16. For this reason, the first character in a file should be the byte-order mark, which is the character \uFEFF. So a big-endian UTF-16 file starts with the bytes 0xFE 0xFF, whereas a little-endian UTF-16 file has 0xFF 0xFE at the beginning. Since the byte-swapped value \uFFFE is not a valid character, the two byte orders cannot be confused.
Clearly, it would be nice to have an indicator for UTF-8 files, too. If you convert the byte-order mark \uFEFF to UTF-8, you get 0xEF 0xBB 0xBF. This byte sequence is well suited to do the job and is often called the UTF-8 byte-order mark, although this is slightly paradoxical because UTF-8 has only one byte order. The probability that a non-UTF-8 file has a UTF-8 byte-order mark at the beginning is very small. For example, in ISO-8859-1 these bytes are the characters ï»¿. No reasonable text will start with these characters.
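A hand-coded check for the three byte-order marks discussed here only needs to compare the first bytes of the file. The following Python sketch shows the idea; the function name detect_bom and its return values are assumptions made for this example, and UTF-32 is deliberately ignored.

```python
import codecs


def detect_bom(data: bytes) -> str:
    """Name the encoding indicated by a leading byte-order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):       # 0xEF 0xBB 0xBF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):   # 0xFE 0xFF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # 0xFF 0xFE
        return "utf-16-le"
    return "none"


print(detect_bom("\ufeffGenève".encode("utf-8")))
print(detect_bom(b"12345678"))
# Read as ISO-8859-1, the UTF-8 byte-order mark shows up as three odd characters.
print(codecs.BOM_UTF8.decode("iso8859_1"))
```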
Unicode in the ABAP file interface
The ABAP part of SAP’s application server is available in a Unicode and a non-Unicode version. The Unicode version uses UTF-16 internally. This is in line with both the Java and the Microsoft world. To see whether your SAP system is Unicode based, select “System → Status” from the menu, and you will find the information in the box “SAP System data”. The Unicode version became available with SAP_BASIS Release 6.20. In what follows we will use the terms Unicode system and non-Unicode system.
The ABAP file interface employs UTF-8 for Unicode files. The files are stored on the application server. Files are often an ad-hoc solution for communicating with some other software. Of course, capabilities to download files to the front-end PC are also available, but that would be an extra blog, so it is not covered here.
If you want to read a file that might be encoded in Unicode, you will want to check whether there is a byte-order mark at the beginning. In most cases, it is then best to skip the byte-order mark because you do not want to have it in your business data if, for example, the first item you want to read is an 8-digit customer number.
Before opening the file with the OPEN DATASET statement, the method CHECK_FOR_BOM of the class CL_ABAP_FILE_UTILITIES can be called to check whether a file begins with a byte-order mark. If the method returns BOM_UTF8 (this is a constant of the class CL_ABAP_FILE_UTILITIES), open the file as shown below.
    IF cl_abap_file_utilities=>check_for_bom( filename ) = cl_abap_file_utilities=>bom_utf8.
      OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 FOR INPUT AT POSITION 3.
    ELSE.
      ...
    ENDIF.
The specification AT POSITION 3 has the effect that the byte-order mark is skipped. The method CHECK_FOR_BOM became available with 6.20 SAP_BASIS support package 47 and 6.40 SAP_BASIS support package 10. Of course, it would also be possible to open the file in BINARY MODE and to implement a hand-coded check for the byte-order mark.
If the file has no byte-order mark at the beginning, it may nevertheless be encoded in UTF-8. To check this, the class CL_ABAP_FILE_UTILITIES has a method CHECK_UTF8 which returns one of the values ENCODING_UTF8, ENCODING_7BIT_ASCII, and ENCODING_OTHER. The third case will be addressed later in this blog. The caller can specify how many kilobytes are to be analyzed. If we read only a small portion of the file and get the result ENCODING_7BIT_ASCII, we are in a dilemma: The ASCII characters (their values are less than 128) are the common subset of UTF-8, the ISO-8859 code pages and many other code pages. If we want to be sure about the encoding of the file, we have to read the file until we find a non-ASCII character or the end of the file. To do this, the parameter ALL_IF_7BIT_ASCII has to be supplied with the value ABAP_TRUE when calling the method CHECK_UTF8.
The method CHECK_UTF8 became available with 6.20 SAP_BASIS support package 50 and 6.40 SAP_BASIS support package 12.
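The three-way classification and the dilemma with a partially read file can be illustrated outside ABAP. The Python sketch below is not how CHECK_UTF8 is implemented; the function classify_bytes, its parameter complete and the result strings are all invented for this illustration.

```python
import codecs


def classify_bytes(data: bytes, complete: bool = True) -> str:
    """Classify data as '7bit-ascii', 'utf-8' or 'other'.

    Pass complete=False if data is only the first portion of a file, so that
    a multi-byte character cut off at the end is not counted as an error.
    """
    if all(byte < 0x80 for byte in data):
        # Inconclusive if only part of the file was analyzed.
        return "7bit-ascii"
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data, final=complete)
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


print(classify_bytes(b"12345678"))
print(classify_bytes("Genève".encode("utf-8")))
print(classify_bytes("Genève".encode("iso8859_1")))
```

A result of "7bit-ascii" for a partial read says nothing about the rest of the file, which is why reading on until a non-ASCII character or the end of the file is the only way to be sure.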
So far we have dealt with reading files. If you write data into a file, it depends on the partner that will read the file whether a byte-order mark is helpful or even harmful. I do encourage everyone to make use of byte-order marks because this increases safety and simplicity when dealing with files. The class CL_ABAP_FILE_UTILITIES has a method CREATE_UTF8_FILE_WITH_BOM to create a file containing nothing but a UTF-8 byte-order mark. Subsequently, execute
OPEN DATASET filename FOR APPENDING IN TEXT MODE ENCODING UTF-8.
and then you can use the TRANSFER statement to write data.
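The two-step pattern (create a file that holds only the byte-order mark, then append the data) is not specific to ABAP. The following Python sketch mirrors it; the function names and the file name towns.txt are made up for this example and do not correspond to any SAP API.

```python
import codecs
import os
import tempfile


def write_bom_only(path: str) -> None:
    """Create a file that contains nothing but the UTF-8 byte-order mark."""
    with open(path, "wb") as handle:
        handle.write(codecs.BOM_UTF8)


def append_text(path: str, text: str) -> None:
    """Append UTF-8 encoded text without adding another byte-order mark."""
    with open(path, "a", encoding="utf-8", newline="") as handle:
        handle.write(text)


def read_text(path: str) -> str:
    """Read the file as UTF-8 and drop a leading byte-order mark, if present."""
    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "towns.txt")
    write_bom_only(path)
    append_text(path, "Děčín\n")
    with open(path, "rb") as handle:
        raw = handle.read()
    text = read_text(path)

print(raw[:3] == codecs.BOM_UTF8, repr(text))
```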
In the Release succeeding 6.40, the OPEN DATASET statement has the additions SKIPPING BYTE-ORDER MARK and WITH BYTE-ORDER MARK to be used when reading or writing files, respectively. By the way, this Release also introduced new options to specify the carriage return/linefeed handling explicitly. But this would be an extra blog.
Unfortunately, the OPEN DATASET statement does not support UTF-16. A workaround is to open the file in BINARY MODE and to use the classes CL_ABAP_CONV_IN_CE and CL_ABAP_CONV_OUT_CE.
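The idea behind this workaround is to read the raw bytes and let a converter interpret them according to the byte-order mark. The Python sketch below shows only this principle, not the use of the ABAP converter classes; the function name decode_utf16 is an assumption of this example.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 data read in binary mode, using its byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 byte-order mark found")


big_endian = codecs.BOM_UTF16_BE + "Genève".encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + "Genève".encode("utf-16-le")
print(decode_utf16(big_endian), decode_utf16(little_endian))

# A musical symbol outside the two-byte range is stored as a surrogate pair.
print("\U0001D11E".encode("utf-16-be").hex())
```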
More information on the ABAP file interface is available in the article “File I/O with ABAP — Problems, Workarounds, and Prudent Practices” which appeared in the SAP Professional Journal in November/December 2001. And, of course, there is the online help — just press the F1 key on the OPEN DATASET statement.
What can go wrong?
If you work with a Unicode system and the files you read and write are Unicode, everything is quite safe. Problems arise when textual data is converted from Unicode to a non-Unicode character set: The data might contain characters which do not exist in the target character set. On the other hand, when textual data is converted from non-Unicode to Unicode, the typical problem is that the software assumes the data to be encoded in a code page different from the one which was actually used. We will look at some examples in a minute.
The typical statement to open a non-Unicode file for writing looks like this:
    DATA msg TYPE string.
    OPEN DATASET filename FOR OUTPUT IN TEXT MODE ENCODING NON-UNICODE
      MESSAGE msg IGNORING CONVERSION ERRORS.
    IF sy-subrc <> 0.
      ...
First, note that MESSAGE msg is used. If the file cannot be opened, the variable msg will contain a message from the operating system indicating the reason, e.g., a non-existing directory. Even if you do not make any further use of the message msg, it is highly recommended to have it included in the OPEN statement because anyone who debugs the program can read the message and will find it much easier to identify the cause of any trouble.
Assume our system is a Unicode system. Which code page will be used if ENCODING NON-UNICODE is specified? It depends on the current language, i.e., the contents of SY-LANGU. At logon, SY-LANGU is set to the user’s language. Later on, the current language can be changed by calling SET LOCALE LANGUAGE langu. If the current language is Czech, the non-Unicode code page of the file will be the code page a non-Unicode system would use for Czech, i.e., ISO-8859-2. If the data in the system contains “Genève”, the file will contain “Gen#ve”. This is because the letter “è” does not exist in ISO-8859-2. Since IGNORING CONVERSION ERRORS was specified, non-existing characters are replaced by the number sign. (You can specify a different replacement character if you do not like the number sign.) If you do not specify IGNORING CONVERSION ERRORS, an exception will be thrown, which can be caught as follows:
    TRY.
        TRANSFER f TO filename.
      CATCH cx_sy_conversion_codepage.
        " Insert code to deal with conversion error.
    ENDTRY.
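The choice between a replacement character and an exception can be reproduced with any code page converter. Here is a Python sketch of both behaviours; the helper names are invented for this example, and the number sign is passed explicitly because Python's own default replacement is a question mark.

```python
def encode_with_replacement(text: str, encoding: str, replacement: str = "#") -> bytes:
    """Encode text, replacing characters missing from the target code page."""
    result = bytearray()
    for char in text:
        try:
            result += char.encode(encoding)
        except UnicodeEncodeError:
            result += replacement.encode(encoding)
    return bytes(result)


def can_encode(text: str, encoding: str) -> bool:
    """Report whether text converts to the code page without any loss."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(encode_with_replacement("Genève", "iso8859_2"))   # è is missing in ISO-8859-2
print(can_encode("Genève", "iso8859_2"), can_encode("Děčín", "iso8859_2"))
```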
Here is another recommendation: Do not call SET LOCALE LANGUAGE langu while the file is open. The effect would be that the next TRANSFER statement would convert the data into a different target code page than before, namely the non-Unicode code page associated with langu.
Let us turn to reading files. If the system is a Unicode system, conversion errors will not occur. Any non-Unicode data will simply be converted to Unicode. But this does not rule out every possibility of error. Assume that someone writes “Genève” into a file, using ISO-8859-1. If we open the file with ENCODING NON-UNICODE and the current language is Czech, the file will be assumed to be encoded in ISO-8859-2. Thus the byte 0xE8 is interpreted as the letter č, leading to “Genčve”. For any software system, it is hard to detect this error. Czech readers may excuse me, but why shouldn’t “Genčve” be a perfect name? (Only very few file systems have the possibility to tag files as being encoded in a particular code page.)
Next let us assume we have a non-Unicode system. Writing data into a UTF-8 file is quite safe. Some care has to be taken if the system supports more than one code page for its internal processing, i.e., the system is configured as an MDMP system. The general rule is that the current language should match the data being processed. The subtleties of MDMP systems would be an extra blog. They should be considered an obsolete technology.
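This silent misinterpretation is easy to reproduce: the bytes are written under one code page and read under another, and no error is raised. A short Python sketch follows, with a helper name chosen for this example.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one code page and decode the bytes in another."""
    return text.encode(written_as).decode(read_as)


# Written as ISO-8859-1, read as ISO-8859-2: no error, but a wrong letter.
print(reinterpret("Genève", "iso8859_1", "iso8859_2"))
```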
To make the picture complete: When reading a UTF-8 file into a non-Unicode system, the potential for errors is similar to the above case of creating a non-Unicode file in a Unicode system. Some characters may be non-existent in the target code page, and you can choose between a replacement character and an exception.
Front-End code pages and the legacy modes
There are some more pitfalls when dealing with non-Unicode files. However, ABAP’s OPEN DATASET offers some options to avoid them. If we open a file FOR OUTPUT IN TEXT MODE ENCODING NON-UNICODE while the current language is Polish and output “Śląskie” (the name of a province in Poland), this will be encoded in the ISO-8859-2 code page. If some software running on Microsoft Windows reads this file, it may display “¦l±skie”. The reason is that it may assume the encoding to be Microsoft’s Windows-1250 code page. The file will actually be written in this latter code page if we call
OPEN DATASET filename FOR OUTPUT IN LEGACY TEXT MODE CODE PAGE '1404' MESSAGE msg.
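The difference between the two code pages for this example can be inspected byte by byte. The following Python sketch uses Python's codec names iso8859_2 and cp1250 for the two code pages; the helper differing_letters is made up for this illustration.

```python
def differing_letters(letters: str, first: str, second: str) -> list:
    """List the letters whose byte values differ between two code pages."""
    return [char for char in letters if char.encode(first) != char.encode(second)]


name = "Śląskie"
iso_bytes = name.encode("iso8859_2")
windows_bytes = name.encode("cp1250")

print(iso_bytes.hex(" "), "|", windows_bytes.hex(" "))
# ISO-8859-2 bytes read by software that assumes Windows-1250:
print(iso_bytes.decode("cp1250"))
# Lower-case Polish letters that are affected by the mismatch:
print(differing_letters("ąćęłńóśźż", "iso8859_2", "cp1250"))
```

Only some of the Polish letters are affected, so a file with few of them may look correct at first sight.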
There is a function module SCP_CODEPAGE_BY_EXTERNAL_NAME which converts a name such as “Windows-1250” into an SAP-internal four-digit number such as “1404”. The Microsoft Windows code pages are also used by the SAP GUI for Windows (if you do not yet use Unicode), but they are not used for the data inside a non-Unicode SAP application server even if it runs on the Microsoft Windows operating system. Thus, we talk of front-end code pages. To obtain the front-end code page associated with a given language, call the function module NLS_GET_FRONTEND_CP.
For ISO-8859-1 (West European languages), this problem with the front-end code page does not show up to the same extent. Every character of ISO-8859-1 has the same code in the Windows-1252 code page, Microsoft’s counterpart of ISO-8859-1. But in the other direction, from Windows-1252 to ISO-8859-1, there are pitfalls. The Windows code page has some characters which do not exist in the ISO code page. Code page charts can be found for example at http://czyborra.com/charsets/codepages.html.
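The relation between the two code pages can be verified by comparing them byte value by byte value. This Python sketch does so for the printable ranges, using Python's codec names iso8859_1 and cp1252; the helper names are invented for this example.

```python
def decode_or_none(byte_value: int, encoding: str):
    """Decode a single byte, or return None if the code page leaves it undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def same_character(byte_value: int) -> bool:
    """Check whether a byte means the same character in both code pages."""
    iso = decode_or_none(byte_value, "iso8859_1")
    return iso is not None and iso == decode_or_none(byte_value, "cp1252")


# The range 0xA0-0xFF holds the accented letters and symbols of ISO-8859-1.
print(all(same_character(value) for value in range(0xA0, 0x100)))

# Windows-1252 puts extra characters into 0x80-0x9F, e.g. the euro sign.
windows_only = [
    decode_or_none(value, "cp1252")
    for value in range(0x80, 0xA0)
    if decode_or_none(value, "cp1252") is not None
]
print("".join(windows_only))
```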
The good news is that for at least two “complicated” languages, there is no difference between the front-end code page and the code page a non-Unicode system would use. These languages are Japanese and Korean.
But there is another pitfall with Chinese, Japanese, and Korean: In Unicode systems, the LEGACY TEXT MODE and the LEGACY BINARY MODE should be used only with special care. The reason is as follows. If the content of an ABAP variable of type C and length 10 is converted to an East-Asian non-Unicode code page, we may obtain up to 20 bytes. However, at most 10 bytes are written into the file because this is the defined length of the variable and a non-Unicode system would have written at most 10 bytes. To avoid this problem, use the TEXT MODE rather than a LEGACY MODE, or use ABAP variables of type STRING.
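The effect of the byte limit can be simulated with a Japanese code page. The Python sketch below is a simplified model of the behaviour described above, not of ABAP's actual implementation; the helper names and the sample text are assumptions of this example, and Shift-JIS stands in for the East-Asian code page.

```python
def write_fixed_field(text: str, encoding: str, field_bytes: int) -> bytes:
    """Convert text and keep only as many bytes as the field length allows."""
    return text.encode(encoding)[:field_bytes]


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Check whether the bytes still form complete characters."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


address = "東京都港区六本木一丁"          # 10 characters
print(len(address.encode("shift_jis")))   # every character needs two bytes
stored = write_fixed_field(address, "shift_jis", 10)
print(stored.decode("shift_jis"))         # only the first half survives

# With mixed text, the cut can even fall in the middle of a character.
print(decodes_cleanly(write_fixed_field("A東京", "shift_jis", 4), "shift_jis"))
```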
For the LEGACY TEXT MODE and the LEGACY BINARY MODE, it is possible to specify BIG ENDIAN and LITTLE ENDIAN. However, this has no influence on textual data. It affects the ABAP types F, I, and INT2 only, i.e., the endian-dependent numeric types.
In this context I must give one last warning: Never use the LEGACY TEXT MODE for non-textual data, in particular the ABAP types F, P, and I. In certain circumstances, some bytes of the non-textual data will be interpreted as a space character or a new-line character, with fatal consequences. Use the BINARY MODE (or the LEGACY BINARY MODE) instead, or convert the data into textual form first.
We have visited the towns of Genève and Děčín and the province of Śląskie, but there are many more places. The only way to cope with the variety of languages is Unicode. The ABAP file interface offers UTF-8, and it is recommended to use a byte-order mark.
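Why binary numbers and text-oriented file handling do not mix can be seen from the bytes of a small integer. The following Python sketch packs 4-byte integers in both byte orders; the helper pack_int and the record layout are invented for this illustration and do not reflect ABAP's internal formats.

```python
import struct


def pack_int(value: int, big_endian: bool = True) -> bytes:
    """Return the 4-byte binary form of an integer in the given byte order."""
    return struct.pack(">i" if big_endian else "<i", value)


print(pack_int(4711).hex(" "), "|", pack_int(4711, big_endian=False).hex(" "))

# The integer 10 contains the byte 0x0A, which a text reader takes as a line end.
record = b"ID" + pack_int(10) + b"\n"
print(record.split(b"\n"))

# The integer 32 ends with 0x20, which looks like a trailing space character.
print(pack_int(32))
```

The textual form "10" or "32" contains no such bytes, which is why converting numbers to text first is a safe alternative.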