
Convert Emoji Characters in Unicode String to equivalent Unicode Code Points using SAP ABAP

Preface:

 

Character encodings are not alien to SAP systems, or to any computer system for that matter, as they form the basis for data storage in and communication between computer systems. In recent times, Unicode has become the dominant character encoding standard, and its UTF-8 representation is especially popular for web content.

It is very rare that we get to deal with encoding schemes directly in ABAP. Recently, however, there was a unique requirement to convert the emoji characters in a Unicode string to their equivalent Unicode code points in hexadecimal, so that they could be properly displayed in an HTML-compliant client.

As interesting as it appeared at first, it seemed very straightforward as well. The reality turned out to be quite different once I realised that I had only a bare understanding of how Unicode data is stored using UTF-8 encoding.

In this article, I’m going to explain what the actual requirement was and how an ABAP solution was provided for it. Though the need for such a solution is very uncommon, the key takeaway is a better understanding of the following areas:

  • How does SAP store the data in the default code page configured?
  • How to convert from one code page to another in SAP?
  • How to handle the conversion between data types such as C, I and X?
  • How to perform bit manipulation in ABAP?
  • How does UTF-8 bit distribution logic work?

OK, let’s get started.

Requirement:

 

The actual requirement goes as follows:

Let’s consider the below Unicode string as input.

Test emoji 😀

 

As we see, this string has an emoji icon, technically a Unicode character, whose code point is U+1F600 (grinning face):

Reference: Emoji Chart v3.0

Now, as per the requirement, the emoji icon 😀 needs to be converted to &#x1F600; (its code point in hex, written as an HTML numeric character reference). As & is an unsafe character in the HTML context, it needs to be escaped as &amp;, and hence the expected output would be:

Test emoji &amp;#x1F600;

 

Let’s look at the ABAP solution below.

ABAP Solution:

*&---------------------------------------------------------------------*
*& Report ZGP_EMOJI_CONV
*&---------------------------------------------------------------------*
*& Convert Emoji Characters in a Unicode String to Unicode Codepoints
*& Author: Gopu Packirisamy
*&---------------------------------------------------------------------*

REPORT zgp_emoji_conv NO STANDARD PAGE HEADING.

* Constants

CONSTANTS c_semicolon    TYPE c VALUE ';'.
CONSTANTS c_uc_codepoint TYPE string VALUE '&amp;#x'.

* Selection screen

SELECTION-SCREEN BEGIN OF BLOCK b WITH FRAME.

PARAMETERS p_string TYPE char255 LOWER CASE.

SELECTION-SCREEN END OF BLOCK b.

* Processing Logic

START-OF-SELECTION.

  PERFORM conv_emoji2codepoint.

*&---------------------------------------------------------------------*
* Convert Emojis to Unicode Codepoint
*&---------------------------------------------------------------------*

FORM conv_emoji2codepoint.

  DATA lv_xstr_idx    TYPE sy-index.
  DATA lv_hex         TYPE xstring.
  DATA lv_hex_i       TYPE i.
  DATA lv_cur_pos     TYPE sy-index.
  DATA lv_unicode_cp  TYPE string.
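* Hex text of the UTF-8 bytes: 2 hex characters per byte and at most
* 3 bytes per input character (255 * 3 * 2 = 1530), so nothing is truncated
  DATA lv_string_utf8 TYPE c LENGTH 1530.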
  DATA lv_string_conv TYPE string.

  FIELD-SYMBOLS <fs_char>.

  lv_xstr_idx = 0.
  lv_cur_pos = 0.

* Convert text to UTF-8 format (Hex string)

  DATA(lo_converter) = cl_abap_conv_out_ce=>create( encoding = 'UTF-8' ).
  lo_converter->reset( ).
  lo_converter->write( EXPORTING data = p_string ).
  lv_string_utf8 = lo_converter->get_buffer( ).

  DATA(lv_len) = strlen( p_string ).

* Parse through the Hex string and identify each Unicode character
* according to its UTF-8 bit distribution pattern and
* apply codepoint conversion, if necessary

  WHILE lv_cur_pos < lv_len.

    ASSIGN lv_string_utf8+lv_xstr_idx(2) TO <fs_char> TYPE 'C'.
    lv_hex = <fs_char>.
    lv_hex_i = lv_hex.

    IF lv_hex_i >= 240.     " >= F0

      ASSIGN lv_string_utf8+lv_xstr_idx(8) TO <fs_char> TYPE 'C'.
      lv_hex = <fs_char>.
      PERFORM conv_utf8_4b USING lv_hex lv_unicode_cp.
      lv_xstr_idx = lv_xstr_idx + 8.
      lv_cur_pos = lv_cur_pos + 2.

    ELSEIF lv_hex_i >= 224. " >= E0

      ASSIGN lv_string_utf8+lv_xstr_idx(6) TO <fs_char> TYPE 'C'.
      lv_hex = <fs_char>.
      PERFORM conv_utf8_3b USING lv_hex lv_unicode_cp.
      lv_xstr_idx = lv_xstr_idx + 6.
      lv_cur_pos = lv_cur_pos + 1.

    ELSEIF lv_hex_i >= 192. " >= C0

      ASSIGN lv_string_utf8+lv_xstr_idx(4) TO <fs_char> TYPE 'C'.
      lv_hex = <fs_char>.
      PERFORM conv_utf8_2b USING lv_hex lv_unicode_cp.
      lv_xstr_idx = lv_xstr_idx + 4.
      lv_cur_pos = lv_cur_pos + 1.

    ELSE. " Other cases

      lv_unicode_cp = COND #( WHEN p_string+lv_cur_pos(1) IS NOT INITIAL
                             THEN p_string+lv_cur_pos(1)
                             ELSE | | ).
      lv_xstr_idx = lv_xstr_idx + 2.
      lv_cur_pos = lv_cur_pos + 1.

    ENDIF.

    lv_string_conv = |{ lv_string_conv }{ lv_unicode_cp }|.

  ENDWHILE.

  WRITE: lv_string_conv.

ENDFORM.

*&---------------------------------------------------------------------*
* Convert 4 bytes UTF-8 character to Unicode Codepoint
*&---------------------------------------------------------------------*

FORM conv_utf8_4b USING iv_hex TYPE xstring ev_emoji_cp TYPE string.

  DATA lv_emoji_hex TYPE xstring VALUE '000000'.

  PERFORM copy_hex_bits USING: 6 iv_hex 4 lv_emoji_hex,
                               7 iv_hex 5 lv_emoji_hex,
                               8 iv_hex 6 lv_emoji_hex,

                               11 iv_hex 7 lv_emoji_hex,
                               12 iv_hex 8 lv_emoji_hex,
                               13 iv_hex 9 lv_emoji_hex,
                               14 iv_hex 10 lv_emoji_hex,
                               15 iv_hex 11 lv_emoji_hex,
                               16 iv_hex 12 lv_emoji_hex,

                               19 iv_hex 13 lv_emoji_hex,
                               20 iv_hex 14 lv_emoji_hex,
                               21 iv_hex 15 lv_emoji_hex,
                               22 iv_hex 16 lv_emoji_hex,
                               23 iv_hex 17 lv_emoji_hex,
                               24 iv_hex 18 lv_emoji_hex,

                               27 iv_hex 19 lv_emoji_hex,
                               28 iv_hex 20 lv_emoji_hex,
                               29 iv_hex 21 lv_emoji_hex,
                               30 iv_hex 22 lv_emoji_hex,
                               31 iv_hex 23 lv_emoji_hex,
                               32 iv_hex 24 lv_emoji_hex.

  ev_emoji_cp = |{ c_uc_codepoint }{ lv_emoji_hex }{ c_semicolon }|.

ENDFORM.

*&---------------------------------------------------------------------*
* Convert 3 bytes UTF-8 character to Unicode Codepoint
*&---------------------------------------------------------------------*

FORM conv_utf8_3b USING iv_hex TYPE xstring ev_emoji_cp TYPE string.

  DATA lv_emoji_hex TYPE xstring VALUE '0000'.

  PERFORM copy_hex_bits USING: 5 iv_hex 1 lv_emoji_hex,
                               6 iv_hex 2 lv_emoji_hex,
                               7 iv_hex 3 lv_emoji_hex,
                               8 iv_hex 4 lv_emoji_hex,

                               11 iv_hex 5 lv_emoji_hex,
                               12 iv_hex 6 lv_emoji_hex,
                               13 iv_hex 7 lv_emoji_hex,
                               14 iv_hex 8 lv_emoji_hex,
                               15 iv_hex 9 lv_emoji_hex,
                               16 iv_hex 10 lv_emoji_hex,

                               19 iv_hex 11 lv_emoji_hex,
                               20 iv_hex 12 lv_emoji_hex,
                               21 iv_hex 13 lv_emoji_hex,
                               22 iv_hex 14 lv_emoji_hex,
                               23 iv_hex 15 lv_emoji_hex,
                               24 iv_hex 16 lv_emoji_hex.

  ev_emoji_cp = |{ c_uc_codepoint }{ lv_emoji_hex }{ c_semicolon }|.

ENDFORM.

*&---------------------------------------------------------------------*
* Convert 2 bytes UTF-8 character to Unicode Codepoint
*&---------------------------------------------------------------------*

FORM conv_utf8_2b USING iv_hex TYPE xstring ev_emoji_cp TYPE string.

  DATA lv_emoji_hex TYPE xstring VALUE '0000'.

  PERFORM copy_hex_bits USING: 4 iv_hex 6 lv_emoji_hex,
                               5 iv_hex 7 lv_emoji_hex,
                               6 iv_hex 8 lv_emoji_hex,
                               7 iv_hex 9 lv_emoji_hex,
                               8 iv_hex 10 lv_emoji_hex,

                               11 iv_hex 11 lv_emoji_hex,
                               12 iv_hex 12 lv_emoji_hex,
                               13 iv_hex 13 lv_emoji_hex,
                               14 iv_hex 14 lv_emoji_hex,
                               15 iv_hex 15 lv_emoji_hex,
                               16 iv_hex 16 lv_emoji_hex.

  ev_emoji_cp = |{ c_uc_codepoint }{ lv_emoji_hex }{ c_semicolon }|.

ENDFORM.

*&---------------------------------------------------------------------*
* Copy HEX bits from source byte to target byte
*&---------------------------------------------------------------------*

FORM copy_hex_bits USING    iv_src_bit  TYPE i
                            iv_src_str  TYPE xstring
                            iv_trgt_bit TYPE i
                   CHANGING cv_trgt_str TYPE xstring.

  GET BIT iv_src_bit OF iv_src_str INTO DATA(lv_bit).
  SET BIT iv_trgt_bit OF cv_trgt_str TO lv_bit.

ENDFORM.

 

Solution Explanation:

  • Read the input Unicode string from the selection screen via the parameter p_string.
  • Convert the input string to a UTF-8 hex string (xstring) using the ABAP conversion APIs.
    • We can find the default code page of the system by running the FM RFC_SYSTEM_INFO and checking the exporting parameter RFCSI_EXPORT-RFCCHARTYP. In my case, it happened to be 4102.
    • We can find the details of an SAP code page by running the FM SCP_CODEPAGE_INFO. It shows that code page 4102 is UTF-16BE Unicode / ISO/IEC 10646.
  • Loop through the converted UTF-8 text until the end of the xstring and parse each character as per the UTF-8 Bit Distribution logic shown below.

Reference: UTF-8 Bit Distribution (From Unicode Standard Version 9.0 Core Specification)
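The conversion step and the code page check from the bullets above can be tried in isolation. The following is only a minimal sketch: the report name and variable names are illustrative, and it uses cl_abap_codepage=>convert_to instead of the cl_abap_conv_out_ce converter used in the report, on the assumption that this class is available in your release.

```abap
REPORT zutf8_convert_sketch.

START-OF-SELECTION.
  " System code page as reported by RFC_SYSTEM_INFO
  " (4102 = UTF-16BE, 4103 = UTF-16LE).
  DATA ls_rfcsi TYPE rfcsi.
  CALL FUNCTION 'RFC_SYSTEM_INFO'
    IMPORTING
      rfcsi_export = ls_rfcsi.
  WRITE: / 'System code page:', ls_rfcsi-rfcchartyp.

  " 'A' followed by U+20AC (euro sign): one UTF-16 code unit each,
  " but one and three bytes respectively in UTF-8.
  DATA(lv_text) = |A{ cl_abap_conv_in_ce=>uccp( '20AC' ) }|.
  DATA(lv_utf8) = cl_abap_codepage=>convert_to( source   = lv_text
                                                codepage = `UTF-8` ).

  " An xstring in a string template is rendered as hex text.
  WRITE: / |{ lv_utf8 }|.
```

The euro sign is used here rather than an emoji only so that the source stays editable in any ABAP editor; a 4-byte character goes through the same call.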

  • As per the Bit Distribution logic above, check the first byte in the following order:
    • Case 1: If the first byte value >= F0 (Hex), 240 (Decimal) or 11110000 (Binary), the Unicode character occupies 4 bytes.
    • Case 2: Else if the first byte value >= E0 (Hex), 224 (Decimal) or 11100000 (Binary), the Unicode character occupies 3 bytes.
    • Case 3: Else if the first byte value >= C0 (Hex), 192 (Decimal) or 11000000 (Binary), the Unicode character occupies 2 bytes.
    • Case 4: In all other cases, the Unicode character occupies 1 byte.
  • Once the bit distribution pattern is identified as in the step above, read the required follow-up bytes and build the scalar value bytes for the single Unicode character by setting its individual bits, copied from the distributed bits (refer to Table 3-6 above).
  • Apply the conversion logic in the following manner for each character in the UTF-8 string.
    • For cases 1, 2 and 3 above, make up the equivalent HTML entity (&amp;#x followed by the scalar value in hexadecimal and a semicolon). Note that this covers every non-ASCII character, not only emoji.
    • For case 4 above, no conversion is required, as these are 7-bit ASCII characters.
  • Concatenate the results of the above step and output the converted string.
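The case distinction and the scalar value reconstruction described above can also be expressed with plain integer arithmetic instead of bit-by-bit copying. The sketch below is an alternative illustration, not the logic of the report: the class lcl_utf8 and its method names are hypothetical, and it assumes well-formed UTF-8 input (a truncated sequence would raise an out-of-bounds exception).

```abap
REPORT zutf8_decode_sketch.

CLASS lcl_utf8 DEFINITION.
  PUBLIC SECTION.
    " Number of bytes in the UTF-8 sequence started by lead byte iv_lead.
    CLASS-METHODS seq_length
      IMPORTING iv_lead       TYPE i
      RETURNING VALUE(rv_len) TYPE i.
    " Scalar value of the UTF-8 sequence at the start of iv_seq.
    CLASS-METHODS to_code_point
      IMPORTING iv_seq       TYPE xstring
      RETURNING VALUE(rv_cp) TYPE i.
ENDCLASS.

CLASS lcl_utf8 IMPLEMENTATION.
  METHOD seq_length.
    rv_len = COND #( WHEN iv_lead >= 240 THEN 4   " 11110uuu
                     WHEN iv_lead >= 224 THEN 3   " 1110zzzz
                     WHEN iv_lead >= 192 THEN 2   " 110yyyyy
                     ELSE 1 ).                    " 0xxxxxxx
  ENDMETHOD.

  METHOD to_code_point.
    DATA lv_byte TYPE i.

    lv_byte = iv_seq+0(1).
    DATA(lv_len) = seq_length( lv_byte ).

    " Payload of the lead byte: drop the marker bits on the left.
    CASE lv_len.
      WHEN 1.
        rv_cp = lv_byte.
      WHEN 2.
        rv_cp = lv_byte MOD 32.
      WHEN 3.
        rv_cp = lv_byte MOD 16.
      WHEN OTHERS.
        rv_cp = lv_byte MOD 8.
    ENDCASE.

    " Each continuation byte (10xxxxxx) contributes its low 6 bits.
    DATA(lv_offset) = 1.
    WHILE lv_offset < lv_len.
      lv_byte = iv_seq+lv_offset(1).
      rv_cp = rv_cp * 64 + lv_byte MOD 64.
      lv_offset = lv_offset + 1.
    ENDWHILE.
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  " F0 9F 98 80 is the UTF-8 encoding of U+1F600.
  DATA lv_cp_hex TYPE x LENGTH 4.
  lv_cp_hex = lcl_utf8=>to_code_point( CONV xstring( 'F09F9880' ) ).

  " Hex text of the scalar value without leading zeroes.
  DATA(lv_digits) = shift_left( val = |{ lv_cp_hex }| sub = `0` ).
  WRITE: / |&amp;#x{ lv_digits };|.   " &amp;#x1F600;
```

Multiplying by 64 shifts the accumulated value left by 6 bits, and MOD 64 keeps the 6 payload bits of a continuation byte, which is equivalent to the per-bit copy done with GET BIT and SET BIT in the report.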

 

Sample Test Results:

 

Please note that the emoji character is not displayed on the SAP GUI screen below, though it is part of the input to the parameter p_string.


Test 1:

Input:

Emoji test => 🤷🏼

Output:

 

Code Point Reference:

 


Test 2:

Input:

🇮🇳 sap 👨‍👩‍👧‍👦

Output:

 

Code Point Reference:

 


Test 3:

Input:

No Emoji text 🙂

 

Output:

 


Code Explanation with an example:

 

OK, now it’s time for a deep dive. Let’s apply what we have learned to an example to better understand how the whole conversion logic works.
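As an illustration, take the emoji 😀 (U+1F600) from the requirement, whose UTF-8 encoding is F0 9F 98 80. The sketch below traces the 4-byte case with the same GET BIT / SET BIT technique as the report, but in a single loop. It is a reconstruction for this one example, with an illustrative report name and variable names, and not the original walkthrough.

```abap
REPORT zutf8_worked_example.

START-OF-SELECTION.
  " Byte 1  F0 = 11110 000    -> uuu    = 000
  " Byte 2  9F = 10 01 1111   -> uu     = 01, zzzz = 1111
  " Byte 3  98 = 10 011000    -> yyyyyy = 011000
  " Byte 4  80 = 10 000000    -> xxxxxx = 000000
  "
  " Scalar value layout:  000uuuuu zzzzyyyy yyxxxxxx
  "                       00000001 11110110 00000000  = 01 F6 00
  DATA(lv_utf8) = CONV xstring( 'F09F9880' ).
  DATA lv_scalar TYPE x LENGTH 3.
  DATA lv_bit    TYPE i.

  " Bits 1 to 3 of the scalar stay 0; payload bits start at bit 4.
  DATA(lv_target) = 4.

  DO 32 TIMES.
    " Skip the marker bits: 11110 in byte 1 and 10 in bytes 2 to 4.
    IF sy-index <= 5 OR ( sy-index - 1 ) MOD 8 < 2.
      CONTINUE.
    ENDIF.
    GET BIT sy-index OF lv_utf8 INTO lv_bit.
    SET BIT lv_target OF lv_scalar TO lv_bit.
    lv_target = lv_target + 1.
  ENDDO.

  WRITE: / |{ lv_scalar }|.   " 01F600
```

The loop copies source bits 6 to 8, 11 to 16, 19 to 24 and 27 to 32 into target bits 4 to 24, which is the same mapping that conv_utf8_4b spells out one PERFORM at a time.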

Hope you enjoyed learning and found this information helpful. Cheers. 👍

 

12 Comments
  • It’s interesting to know what is Unicode, what is UTF, and so on.

    But I’m not sure I understand whether your blog post is only a theoretical article or has a business interest, because your business case is only to have emoji characters “be properly displayed in a HTML compliant client”, which is easily achieved by inserting Character Entity References (for instance &#x1F600; as you have explained in your post) in the HTML page.

    NB: list of many emoticons for those interested -> https://www.fileformat.info/info/unicode/block/emoticons/list.htm

    • Thanks for your comment. You’re right, we can directly insert HTML entity if we are dealing with HTML page directly, but here in this case SAP supplies the HTML content, so the conversion was necessary.

      • That’s important that you explain why HTML is to be handled differently when generated from SAP. When I generate HTML content from ABAP with special characters, I use the Character Entity References, I don’t need to escape them or to convert them in UTF-8 (or whatever kind of UTF).

        Why doing differently (what is your use case), and what do you propose?

         

    • Some people might just want to display characters U+010000 to U+10FFFF (surrogate pairs AKA few emoji and some other characters) via SAP GUI HTML Viewer. It can be done via conversion to UTF-8 which supports the conversion of surrogate pairs:

      PARAMETERS p_string TYPE string LOWER CASE DEFAULT '🤷🏼'.
      
      START-OF-SELECTION.
        DATA temp_xstring TYPE xstring.
        temp_xstring = cl_abap_codepage=>convert_to( '<html>'
            && '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
            && '<body>' && p_string && '</body></html>' ).
        cl_abap_browser=>show_html(
          EXPORTING
            html_xstring = temp_xstring
            check_html   = abap_false ).
      

       

    • A more simple algorithm is proposed by SAP in method cl_scp_mapping_rules=>utf16_s_pair_to_utf32, which converts surrogate pairs of Unicode characters U+010000 to U+10FFFF (AKA few emoji and some other characters) into UTF-32, which corresponds to the Unicode code point:

      PARAMETERS p_string TYPE string LOWER CASE DEFAULT '🤷🏼'.
      
      START-OF-SELECTION.
        PERFORM string_to_html USING p_string CHANGING p_string.
        WRITE p_string.
      
      FORM string_to_html USING p_string TYPE csequence CHANGING VALUE(html) TYPE string.
        DATA: im_utf16_s_pair TYPE xstring,
              ex_utf32        TYPE xstring,
              temp_string     TYPE string,
              temp_xstring    TYPE xstring.
        DATA(d800) = CONV char1( cl_abap_conv_in_ce=>uccp( 'D800' ) ).
        DATA(dbff) = CONV char1( cl_abap_conv_in_ce=>uccp( 'DBFF' ) ).
        DATA(dc00) = CONV char1( cl_abap_conv_in_ce=>uccp( 'DC00' ) ).
        DATA(dfff) = CONV char1( cl_abap_conv_in_ce=>uccp( 'DFFF' ) ).
      
        DATA(offset) = 0.
        html = ``.
        WHILE offset < strlen( p_string ).
          IF p_string+offset(1) NOT BETWEEN d800 AND dbff.
            html = html && escape( val = p_string+offset(1) format = cl_abap_format=>e_html_text ).
          ELSE.
            temp_string = p_string+offset(1).
            ADD 1 TO offset.
            IF offset >= strlen( p_string ) OR p_string+offset(1) NOT BETWEEN dc00 AND dfff.
              html = html && 'ERR!'.
            ELSE.
              temp_string = temp_string && p_string+offset(1).
              EXPORT dummyname = temp_string TO DATA BUFFER temp_xstring.
              IMPORT dummyname = im_utf16_s_pair FROM DATA BUFFER temp_xstring IN CHAR-TO-HEX MODE.
              cl_scp_mapping_rules=>utf16_s_pair_to_utf32(
                EXPORTING
                  im_utf16_s_pair = im_utf16_s_pair
                  im_endian       = cl_abap_char_utilities=>endian
                IMPORTING
                  ex_utf32        = ex_utf32 ).
              IF cl_abap_char_utilities=>endian = 'L'.
                CONCATENATE ex_utf32+3(1) ex_utf32+2(1) ex_utf32+1(1) ex_utf32+0(1) INTO ex_utf32 IN BYTE MODE.
              ENDIF.
              temp_string = shift_left( val = |{ ex_utf32 }| sub = '0' ). " remove leading zeroes
              html = html && |&#x{ temp_string };|.
            ENDIF.
          ENDIF.
          offset = offset + 1.
        ENDWHILE.
      ENDFORM.
      

       

  • is it possible if we pass ” &amp;#x01F468;&amp;#x200D;&amp;#x01F469;&amp;#x200D;&amp;#x01F467;&amp;#x200D;&amp;#x01F466; ” in selection revert back emoji ??

     

    ex: barcode if pass in selection-screen output should come company name or relevant details of barcode?

     

    thanks.

    sanjeev

    • Yes, we can convert an HTML entity to its equivalent emoji character by following the reverse process of the steps explained above. But the problem is that these Unicode characters may not be displayed on the SAP GUI screen.

      Barcode is a different topic altogether, since emoji is basically Unicode character whereas barcode is a graphic content (Binary data). We would need to use Barcode related APIs to interpret and read the text content from Barcode file/data.
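      The reverse process mentioned above could look roughly like the sketch below: take the hex digits from the entity, encode the scalar value as UTF-8 bytes, and let the system convert those bytes back into a string. The class lcl_cp and its method are hypothetical names, parsing the entity text itself is left out, and no validation of the code point range is done.

```abap
REPORT zcp_to_char_sketch.

CLASS lcl_cp DEFINITION.
  PUBLIC SECTION.
    " Encodes a scalar value (up to hex 10FFFF) as UTF-8 bytes.
    CLASS-METHODS to_utf8
      IMPORTING iv_cp          TYPE i
      RETURNING VALUE(rv_utf8) TYPE xstring.
ENDCLASS.

CLASS lcl_cp IMPLEMENTATION.
  METHOD to_utf8.
    DATA lv_byte TYPE x LENGTH 1.
    DATA(lv_rest) = iv_cp.

    " Number of continuation bytes needed for this scalar value.
    DATA(lv_cont) = COND i( WHEN iv_cp < 128   THEN 0
                            WHEN iv_cp < 2048  THEN 1
                            WHEN iv_cp < 65536 THEN 2
                            ELSE 3 ).
    " Marker bits of the lead byte: 0, 110, 1110 or 11110.
    DATA(lv_lead) = SWITCH i( lv_cont WHEN 0 THEN 0
                                      WHEN 1 THEN 192
                                      WHEN 2 THEN 224
                                      ELSE 240 ).

    " Emit continuation bytes (10xxxxxx) from the lowest 6 bits upwards,
    " prepending each one so the final byte order is correct.
    DO lv_cont TIMES.
      lv_byte = 128 + lv_rest MOD 64.
      CONCATENATE lv_byte rv_utf8 INTO rv_utf8 IN BYTE MODE.
      lv_rest = lv_rest DIV 64.
    ENDDO.

    " What is left fits into the payload bits of the lead byte.
    lv_byte = lv_lead + lv_rest.
    CONCATENATE lv_byte rv_utf8 INTO rv_utf8 IN BYTE MODE.
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  " Hex digits taken from the entity &amp;#x01F600; padded to 4 bytes.
  DATA lv_x4 TYPE x LENGTH 4.
  lv_x4 = '0001F600'.

  DATA(lv_bytes) = lcl_cp=>to_utf8( CONV i( lv_x4 ) ).
  DATA(lv_char)  = cl_abap_codepage=>convert_from( source   = lv_bytes
                                                   codepage = `UTF-8` ).

  " The hex bytes are shown because SAP GUI may not render the emoji.
  WRITE: / |{ lv_bytes }|, lv_char.
```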