Hendrik Brandes

Quick win using regular expressions

In one of my recent code reviews, I stumbled over a simple piece of code that has stuck with me ever since.

Suppose you are in an SAP NetWeaver BW transformation and want to replace all disallowed characters in a transformation rule:

METHOD _compute_XXXXX.
...
  CALL FUNCTION 'Z_REPLACE_UNALLOWED_CHARS'
    EXPORTING
      i_text = i_value_in
    IMPORTING
      e_text = r_value_out.
...
ENDMETHOD.

and the function module looks roughly like this:

FUNCTION z_replace_unallowed_chars.
*"----------------------------------------------------------------------
*"*"Local Interface:
*"  IMPORTING
*"     REFERENCE(I_TEXT) TYPE  CHAR100 OPTIONAL
*"     REFERENCE(I_DEFAULT) TYPE  C OPTIONAL
*"  EXPORTING
*"     REFERENCE(E_TEXT) TYPE  CHAR100
*"----------------------------------------------------------------------
...
*----------------------------------------------------------------------*
*       Determine unallowed characters
*----------------------------------------------------------------------*
  IF g_char IS INITIAL.
    CALL FUNCTION 'RSKC_ALLOWED_CHAR_GET'
      IMPORTING
        e_allowed_char = w_allowed_char.
    g_char = 'X'.
  ENDIF.
*----------------------------------------------------------------------*
*      Replacement
*----------------------------------------------------------------------*
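  DESCRIBE FIELD i_text LENGTH len IN CHARACTER MODE.
  boole  = true.
  e_text = i_text.
  lv_def = i_default.
  TRANSLATE lv_def TO UPPER CASE.
  WHILE boole = true.
    " If the check fails, sy-fdpos holds the offset of the first disallowed character
    IF e_text CO w_allowed_char.
      EXIT.
    ENDIF.
    " Overwrite that one character, then scan the whole text again
    " (this loops forever if lv_def itself is not an allowed character)
    IF sy-fdpos < len.
      pos = sy-fdpos.
      e_text+pos(1) = lv_def.
    ENDIF.
    IF sy-fdpos = len.
      boole = false.
    ENDIF.
  ENDWHILE.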
ENDFUNCTION.

After reading this function module and its calling routine, I thought to myself: oops, something is going wrong here, or at least it is not as efficient as it should be.

What I stumbled over was not the idea of replacing characters or the call of the BW function module. My attention went to the replacement block. OK, there is no real mistake. It could have used more built-in functions such as strlen( ), and it could have used a different loop technique.

But why replace the characters in such a roundabout way? Apart from not running correctly in all cases, it costs a lot of performance, and that adds up in ETL processes on BW, because this function module is not called just 1, 2 or 3 times. When you load millions of rows into your BW, this function module may well be called more than 3 times per row!

My solution: use regular expressions with ABAP's built-in statements:

REPLACE ALL OCCURRENCES OF REGEX l_pattern IN l_text WITH i_default.

This single line of code does the same as the entire replacement block (the WHILE loop) in the listing above!

But what is this pattern thing? Regular expressions are a very efficient way to perform text operations. In such an expression you define a syntax that the matcher checks against your input, for example e-mail addresses, date formats, naming conventions, unwanted characters and so on.
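
To make this concrete, here is a minimal sketch of such a check. The variable name l_date and the date pattern are illustrative assumptions only and are not part of the function module discussed here:

DATA l_date TYPE string VALUE '2011-05-17'.

* Illustrative pattern: four digits, a dash, two digits, a dash, two digits.
* It checks only the format, not whether the date actually exists.
FIND REGEX '^\d{4}-\d{2}-\d{2}$' IN l_date.
IF sy-subrc = 0.
  WRITE / 'Input has the format YYYY-MM-DD'.
ELSE.
  WRITE / 'Input does not match the expected format'.
ENDIF.

FIND REGEX sets sy-subrc to 0 when the pattern matches, so the same approach should work for any of the checks listed above once a suitable pattern is written.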

The pattern for this case looks like this:

[^ALLCHARACTERSYOUWANTTO]

Note: replace ALLCHARACTERSYOUWANTTO with the concrete characters you want to allow.

This pattern means: "every character except those within the brackets". That is exactly what the replacement loop does. The regex processor now scans the text in i_text and replaces every match of the pattern with the value in i_default.
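
A small self-contained sketch may help to see the effect. The sample text, the allowed set (A-Z, a-z, 0-9 and blank) and the replacement character # are assumptions chosen purely for illustration:

DATA l_text TYPE string VALUE 'Hello, World! 123'.

* Everything that is not a letter, a digit or a blank is replaced by '#'.
REPLACE ALL OCCURRENCES OF REGEX '[^A-Za-z0-9 ]' IN l_text WITH '#'.

WRITE / l_text.

In this sample the comma and the exclamation mark are the only characters outside the set, so they are the only ones that get replaced.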

Now, the new function module looks like this:

FUNCTION z_replace_unallowed_chars.
*"----------------------------------------------------------------------
*"*"Local Interface:
*"  IMPORTING
*"     REFERENCE(I_TEXT) TYPE  CHAR100 OPTIONAL
*"     REFERENCE(I_DEFAULT) TYPE  C OPTIONAL
*"  EXPORTING
*"     REFERENCE(E_TEXT) TYPE  CHAR100
*"----------------------------------------------------------------------
...
*----------------------------------------------------------------------*
*       Determine unallowed characters
*----------------------------------------------------------------------*
  IF g_pattern IS INITIAL.
    CALL FUNCTION 'RSKC_ALLOWED_CHAR_GET'
      IMPORTING
        e_allowed_char = l_allowed_char.
    " Build the pattern only once; g_pattern is kept across calls
    CONCATENATE '([^' l_allowed_char '])' INTO g_pattern.
  ENDIF.
*----------------------------------------------------------------------*
*      Replacement
*----------------------------------------------------------------------*
  e_text = i_text.
  REPLACE ALL OCCURRENCES OF REGEX g_pattern IN e_text WITH i_default.
ENDFUNCTION.

Have a closer look at the CONCATENATE statement, where the pattern is built, and at the REPLACE statement, where it is used.
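
One caveat with building the pattern by plain concatenation: inside a bracket expression a few characters (\, [, ], ^ and -) can keep a special meaning. If the allowed-character string returned by RSKC_ALLOWED_CHAR_GET happens to contain any of them, the pattern may not mean what you expect. The following is a minimal sketch of one way to guard against this. It assumes that a backslash escapes these characters inside a set in ABAP regular expressions, and the variable l_escaped is hypothetical and not part of the original function module:

DATA l_escaped TYPE string.

l_escaped = l_allowed_char.
* Plain (non-regex) replacements. The backslash must be handled first,
* otherwise the escapes added afterwards would be escaped again.
REPLACE ALL OCCURRENCES OF '\' IN l_escaped WITH '\\'.
REPLACE ALL OCCURRENCES OF '[' IN l_escaped WITH '\['.
REPLACE ALL OCCURRENCES OF ']' IN l_escaped WITH '\]'.
REPLACE ALL OCCURRENCES OF '^' IN l_escaped WITH '\^'.
REPLACE ALL OCCURRENCES OF '-' IN l_escaped WITH '\-'.

CONCATENATE '([^' l_escaped '])' INTO g_pattern.

Whether this is needed depends on the characters maintained as allowed in your system, so it is worth checking the generated pattern once in the debugger.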

After verifying that the result is correct, I ran a performance trace, and it shows that the new implementation is 20% faster (tested on a demo instance with table SBOOK and 1,300,000 rows, one field per row replaced). replace1 is the old implementation, replace2 the new one:

[Image: Auswahl_003.png, performance trace comparing replace1 and replace2]

If your input data is more diverse, I expect this implementation to gain even more.

For more details, see the ABAP documentation: http://help.sap.com/abapdocu_70/en/ABENREGULAR_EXPRESSIONS.htm or try it yourself with the report DEMO_REGEX_TOY.
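
If you want to reproduce such a comparison without a full performance trace, a rough measurement with GET RUN TIME can be sketched as follows. This is not how the figures above were obtained. The loop count, the sample text and the variable names are assumptions for illustration:

DATA: l_start TYPE i,
      l_stop  TYPE i,
      l_usec  TYPE i,
      l_in    TYPE char100 VALUE 'Sample text with ~some~ characters',
      l_out   TYPE char100.

GET RUN TIME FIELD l_start.
DO 100000 TIMES.
  CALL FUNCTION 'Z_REPLACE_UNALLOWED_CHARS'
    EXPORTING
      i_text    = l_in
      i_default = '#'
    IMPORTING
      e_text    = l_out.
ENDDO.
GET RUN TIME FIELD l_stop.

* Elapsed time in microseconds for all calls together.
l_usec = l_stop - l_start.
WRITE: / 'Runtime (microseconds):', l_usec.

Running the same loop once against the old and once against the new implementation gives two numbers that can be compared directly, although a single run like this is only a rough indication.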

Have fun!


      6 Comments
      Former Member

      good job making ABAP regex more popular !!!

      Hendrik Brandes
      Blog Post Author

      Hello Markus,

      thank you. I think there are a lot of places where this technique could make life much easier than stupid DO...IF/CASE/PERFORM...ENDDO constructs for searching and replacing strings.

      Sadly, most people feel rather queasy when they see a regular expression for the first time 😉

      Kind regards,
      Hendrik

      Rüdiger Plantiko

      I thought that performance would probably improve if you used a GO_REGEX TYPE REF TO CL_ABAP_REGEX instead of a string G_PATTERN, and then REPLACE ... REGEX GO_REGEX ...

      This way, the repeated parsing & compilation of the string into a regex could be avoided.

      I tested this out, but to my surprise, both variants were almost exactly equal in time!

      http://pastebin.com/R506LndU

      This suggests that there is an internal optimization for regex strings (such as keeping the last regex string together with its object, and re-using that regex object if the REPLACE is called again with the same regex string).

      Hendrik Brandes
      Blog Post Author

      Hello Rüdiger,

      thank you for sharing these test results. I had expected an equal result, because SAP uses the same internal mechanism for regular expressions, and I believe (though this is not true in all cases) that ABAP statements are faster than OO encapsulations.

      I am very happy to see that I am not alone in using regular expressions 😉!

      Kind regards,

      Hendrik

      Rüdiger Plantiko

      Hello Hendrik,

      "I am very happy to see that I am not alone in using regular expressions 😉!"

      I like regular expressions and use them in Java, JavaScript, Perl and ABAP - even in my "UltraEdit" plain text editor. Once you are used to the syntax, they are really great.

      "I had expected an equal result, because SAP uses the same internal mechanism for regular expressions, and I believe (though this is not true in all cases) that ABAP statements are faster than OO encapsulations."

      Hm. A regular expression that is given as a string only at run time, like g_pattern in your example, has to be parsed and compiled into an internal "regex mini-program" at run time before the regex match can be executed. When used repeatedly, a CL_ABAP_REGEX object can help you avoid this parse time, because internally it keeps this mini-program ready for arbitrarily many pattern match calls.

      It seems that for a regex of your kind, "([^" + ... sequence of characters + "])", the parse time is negligible. I wouldn't swear that this is the case for arbitrary regexes.

      But - yes, I have no example 🙂

      Regards,

      Rüdiger

      Rainer Hübenthal

      To become familiar with regular expressions and to test them, SAP offers the report

      DEMO_REGEX_TOY

      PS: I'm ashamed that I didn't read the article to the end... otherwise I would have noticed that this post was redundant 🙁