abap-CSR: custom tool for Character Set and language Recognition
I regularly see questions in the SCN forum about wrong characters appearing when text files are read. Some of these issues occur because the file is not encoded as expected, and the original poster doesn’t know its character set (or code page), or isn’t even aware of the concept of character encoding.
These issues concern files, but also other media like HTTP, FTP, web services, archived or attached files, etc.
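To make the problem concrete, here is a minimal sketch (not part of abap-CSR) that decodes the same two bytes with two different character sets. It assumes that the standard class CL_ABAP_CODEPAGE is available and that the system accepts the names UTF-8 and ISO-8859-1; the report and class names are hypothetical.

```abap
REPORT z_wrong_charset_sketch. " hypothetical report name

CLASS lcl_decode_sketch DEFINITION.
  PUBLIC SECTION.
    CLASS-METHODS decode
      IMPORTING bytes       TYPE xstring
                codepage    TYPE string
      RETURNING VALUE(text) TYPE string.
ENDCLASS.

CLASS lcl_decode_sketch IMPLEMENTATION.
  METHOD decode.
    " Standard SAP conversion from bytes to characters.
    " Assumption: the code page name is known to the system,
    " otherwise a class-based exception is raised.
    text = cl_abap_codepage=>convert_from( source   = bytes
                                           codepage = codepage ).
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  " The two bytes C3 A9 are the UTF-8 encoding of the letter e with acute accent.
  DATA(bytes) = CONV xstring( 'C3A9' ).

  " Decoded with the right character set: one character.
  DATA(as_utf8) = lcl_decode_sketch=>decode( bytes = bytes codepage = `UTF-8` ).

  " Decoded with the wrong character set: each byte becomes its own character.
  DATA(as_latin1) = lcl_decode_sketch=>decode( bytes = bytes codepage = `ISO-8859-1` ).

  WRITE: / as_utf8, / as_latin1.
```

With the right character set the two bytes should come out as a single accented letter; with the wrong one they should come out as two unrelated characters, which is the typical symptom reported in the forum.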
ICU provides a tool (http://userguide.icu-project.org/conversion/detection) which automatically detects the character set by analyzing the bytes. Of course, its guess may sometimes be wrong, especially if the text is made of alphanumeric codes rather than plain words, but it guesses very well with texts from classic (electronic) books. It also guesses the language (English, French, etc.), which works well too, except with languages which are very close to each other, like Danish and Norwegian.
This tool would be useful for ABAP developers who run into wrong characters, so I have converted the ICU C++ source code into ABAP to make it easy to use.
Currently, I have only ported the part for Western and Unicode character sets (Chinese, Korean and Japanese are not ported). I made a few additions, like limiting the analysis to the first 1000 bytes and adding a UTF-16 recognition algorithm for when the Byte Order Mark is absent. If you find bugs or have ideas for improvements, don’t hesitate to report them at the project website, or even to contribute.
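The following is only an illustrative sketch of how UTF-16 could be recognized without a Byte Order Mark; it is not the code used in abap-CSR. It assumes mostly Western text, where one byte of every 16-bit unit is zero, and it looks at the first 1000 bytes only. The report name, class name and threshold are hypothetical.

```abap
REPORT z_utf16_guess_sketch. " hypothetical report name

CLASS lcl_utf16_sketch DEFINITION.
  PUBLIC SECTION.
    CLASS-METHODS guess
      IMPORTING content        TYPE xstring
      RETURNING VALUE(charset) TYPE string.
ENDCLASS.

CLASS lcl_utf16_sketch IMPLEMENTATION.
  METHOD guess.
    CONSTANTS max_bytes TYPE i VALUE 1000.
    CONSTANTS zero_byte TYPE x LENGTH 1 VALUE '00'.
    DATA even_zeros TYPE i.
    DATA odd_zeros  TYPE i.
    DATA offset     TYPE i.

    " Analyze at most the first 1000 bytes.
    DATA(len) = xstrlen( content ).
    IF len > max_bytes.
      len = max_bytes.
    ENDIF.

    " Count zero bytes separately at even and odd offsets (0-based).
    WHILE offset < len.
      IF content+offset(1) = zero_byte.
        IF offset MOD 2 = 0.
          even_zeros = even_zeros + 1.
        ELSE.
          odd_zeros = odd_zeros + 1.
        ENDIF.
      ENDIF.
      offset = offset + 1.
    ENDWHILE.

    " Assumed rule: more than half of the 16-bit units have a zero byte
    " on one side, and there is no zero byte at all on the other side.
    DATA(pairs) = len DIV 2.
    IF pairs = 0.
      charset = ``. " too short to tell
    ELSEIF even_zeros * 2 > pairs AND odd_zeros = 0.
      charset = `UTF-16BE`. " zero byte first, e.g. 00 48
    ELSEIF odd_zeros * 2 > pairs AND even_zeros = 0.
      charset = `UTF-16LE`. " zero byte second, e.g. 48 00
    ELSE.
      charset = ``. " no UTF-16 pattern found
    ENDIF.
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  " The bytes 00 48 00 69 are "Hi" in big-endian UTF-16 without a BOM.
  DATA(detected) = lcl_utf16_sketch=>guess( CONV xstring( '00480069' ) ).
  WRITE / detected.
```

A rule this simple fails on non-Western text, where neither byte of a unit is usually zero, so a real detector needs more evidence than this.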
There are two ways to use the tool:
- If it’s a file, run the program Z_CSR_FILE_DETECTOR.
- If it’s another medium, temporarily change your code (revert it when you’re done!): store the medium contents in an “XSTRING” variable, add the following lines of code to your program, and debug the execution to see which language and character set have been detected:
DATA(csr_det) = NEW zcl_csr_detector( ).
csr_det->set_text( xstring ). " the "text", to be passed as type XSTRING !
DATA(result) = csr_det->detect( ). " get result with highest confidence
DATA(language) = result-csr->get_language( ).
DATA(charset) = result-csr->get_name( ).
This tool should never be called from a custom program, because there is a risk that one day a file is processed incorrectly because the character set recognition failed. Programs should always assume that a file has a given character set. Moreover, the algorithm is not optimized at all (that’s why I decided to limit the recognition to the first 1000 bytes, which seems to be sufficient).
If you want to play in your sandbox, you may call the program Z_CSR_CREATE_DEMO_FILES to automatically create files in various character sets and languages. Then run the program Z_CSR_FILE_DETECTOR, and it will display the confidence for each character set and language. For instance, for the file with Norwegian text, the program guesses that it’s probably (best confidence 71%) encoded with the iso-8859-1 character set and contains Norwegian text (“no”, see language codes).
If you want to reverse-engineer the code to understand how the detection works, you may find the program Z_CSR_DISPLAY_NGRAMS useful. It shows the most frequent sequences of three consecutive characters for each language, by decoding the data in the classes in directory /src/data. For instance in English, they are AID, AND, ATE, ATI, ENT, FOR, HAT, HER, ING, ION, SAI, TER, THA, THE, TIO. In French, they are ANT, ATI, CON, DES, ENT, EUR, ION, LES, MEN, ONT, OUR, QUE, TIO, etc.
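To give an idea of what such a frequency list is built from, here is a minimal sketch that counts the three-letter sequences of a text. It is not the abap-CSR or ICU code: it works on an already decoded string, keeps only sequences made of the letters A to Z, and the report and class names are hypothetical.

```abap
REPORT z_trigram_count_sketch. " hypothetical report name

CLASS lcl_ngram_sketch DEFINITION.
  PUBLIC SECTION.
    TYPES: BEGIN OF ty_count,
             trigram     TYPE string,
             occurrences TYPE i,
           END OF ty_count,
           ty_counts TYPE STANDARD TABLE OF ty_count WITH EMPTY KEY.
    CLASS-METHODS count_trigrams
      IMPORTING text          TYPE string
      RETURNING VALUE(counts) TYPE ty_counts.
ENDCLASS.

CLASS lcl_ngram_sketch IMPLEMENTATION.
  METHOD count_trigrams.
    DATA offset TYPE i.

    DATA(upper)       = to_upper( text ).
    DATA(last_offset) = strlen( upper ) - 3.

    " Slide a window of three characters over the text.
    WHILE offset <= last_offset.
      DATA(trigram) = substring( val = upper off = offset len = 3 ).

      " Simplification: ignore windows containing anything but A to Z.
      IF trigram CO sy-abcde.
        ASSIGN counts[ trigram = trigram ] TO FIELD-SYMBOL(<entry>).
        IF sy-subrc = 0.
          <entry>-occurrences = <entry>-occurrences + 1.
        ELSE.
          INSERT VALUE #( trigram = trigram occurrences = 1 ) INTO TABLE counts.
        ENDIF.
      ENDIF.

      offset = offset + 1.
    ENDWHILE.

    " Most frequent first, alphabetical order among equal counts.
    SORT counts BY occurrences DESCENDING trigram ASCENDING.
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  DATA(counts) = lcl_ngram_sketch=>count_trigrams( `the other then` ).
  LOOP AT counts INTO DATA(entry).
    WRITE: / entry-trigram, entry-occurrences.
  ENDLOOP.
```

For this short sample the most frequent sequence is THE, with three occurrences. A detector can compare the sequences found in an unknown text with the stored list of each language and prefer the language with the most matches.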