Quick win using regular expressions
In one of my recent reviews, I stumbled across a simple piece of code that I have not been able to get out of my head…
Suppose you are in an SAP NetWeaver BW transformation and want to replace all disallowed characters in a transformation rule:
METHOD _compute_XXXXX.
  ...
  CALL FUNCTION 'Z_REPLACE_UNALLOWED_CHARS'
    EXPORTING
      i_text = i_value_in
    IMPORTING
      e_text = r_value_out.
  ...
ENDMETHOD.
and the function module looks roughly like this:
FUNCTION z_replace_unallowed_chars.
*"----------------------------------------------------------------------
*"*"Local Interface:
*" IMPORTING
*" REFERENCE(I_TEXT) TYPE CHAR100 OPTIONAL
*" REFERENCE(I_DEFAULT) TYPE C OPTIONAL
*" EXPORTING
*" REFERENCE(E_TEXT) TYPE CHAR100
*"----------------------------------------------------------------------
...
*----------------------------------------------------------------------*
* Determine unallowed characters
*----------------------------------------------------------------------*
  IF g_char IS INITIAL.
    CALL FUNCTION 'RSKC_ALLOWED_CHAR_GET'
      IMPORTING
        e_allowed_char = w_allowed_char.
    g_char = 'X'.
  ENDIF.
*----------------------------------------------------------------------*
* Replacement
*----------------------------------------------------------------------*
  DESCRIBE FIELD i_text LENGTH len IN CHARACTER MODE.
  boole = true.
  e_text = i_text.
  lv_def = i_default.
  TRANSLATE lv_def TO UPPER CASE.
  WHILE boole = true.
    IF e_text CO w_allowed_char.
      EXIT.
    ENDIF.
    IF sy-fdpos < len.
      pos = sy-fdpos.
      e_text+pos(1) = lv_def.
    ENDIF.
    IF sy-fdpos = len.
      boole = false.
    ENDIF.
  ENDWHILE.
ENDFUNCTION.
After reading this function module and its calling routine, I thought to myself: oops, something is wrong here, or at least not as efficient as it should be.
I did not stumble over the way the function module is called, nor over the call to the BW function module. My attention went to the replacement block. OK, there is no real mistake in it; it could have used more built-in functions such as strlen( ), and it could have used a different loop technique.
But why replace the characters in such a roundabout way? Even setting aside that it does not run correctly in all cases, it costs a lot of performance. That adds up especially in ETL processes on BW, because this function module is not called just once, twice or three times. When you load millions of rows into your BW, this function module may well be called more than three times per row!
My solution: use regular expressions through ABAP's built-in functionality:
REPLACE ALL OCCURRENCES OF REGEX l_pattern IN l_text WITH i_default.
This single line of code does the same as the whole replacement block of the listing above (from DESCRIBE FIELD down to ENDWHILE)!
But what is this pattern thing? Regular expressions are a very efficient way to perform text operations. Within such an expression you define a syntax, and the matcher checks your input against it. Typical examples are e-mail addresses, date formats, naming conventions, unwanted characters and so on.
The pattern for this case looks like this:
[^ALLCHARACTERSYOUWANTTO]
Note: replace ALLCHARACTERSYOUWANTTO with the concrete characters you want to allow.
This pattern means "every character except those within the brackets", which is exactly what the replacement loop does. The regex processor now processes the text in i_text and replaces every match of the pattern with the value in i_default.
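To make the negated character class concrete, here is a minimal sketch. The report name, the variable names and the sample values are made up for illustration; in the function module the allowed characters come from RSKC_ALLOWED_CHAR_GET rather than from a literal.

```abap
REPORT z_demo_regex_replace.

* Hypothetical sample values, for illustration only.
DATA: l_allowed TYPE string VALUE 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
      l_pattern TYPE string,
      l_text    TYPE string VALUE 'HELLO#WORLD!42'.

* Build the negated set: [^...] matches any single character
* that is NOT listed between the brackets.
CONCATENATE '[^' l_allowed ']' INTO l_pattern.

* Replace every disallowed character with an underscore.
REPLACE ALL OCCURRENCES OF REGEX l_pattern IN l_text WITH '_'.

* Output: HELLO_WORLD_42
WRITE / l_text.
```

The two characters that are not in the allowed list, # and !, are each replaced by the default character; everything else is left untouched.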
Now, the new function module looks like this:
FUNCTION z_replace_unallowed_chars.
*"----------------------------------------------------------------------
*"*"Local Interface:
*" IMPORTING
*" REFERENCE(I_TEXT) TYPE CHAR100 OPTIONAL
*" REFERENCE(I_DEFAULT) TYPE C OPTIONAL
*" EXPORTING
*" REFERENCE(E_TEXT) TYPE CHAR100
*"----------------------------------------------------------------------
...
*----------------------------------------------------------------------*
* Determine unallowed characters
*----------------------------------------------------------------------*
  IF g_pattern IS INITIAL.
    CALL FUNCTION 'RSKC_ALLOWED_CHAR_GET'
      IMPORTING
        e_allowed_char = w_allowed_char.
    CONCATENATE '([^' w_allowed_char '])' INTO g_pattern.
  ENDIF.
*----------------------------------------------------------------------*
* Replacement
*----------------------------------------------------------------------*
  e_text = i_text.  " copy the input first, the replacement works on e_text
  REPLACE ALL OCCURRENCES OF REGEX g_pattern IN e_text WITH i_default.
ENDFUNCTION.
Take a closer look at the CONCATENATE statement, where the pattern is built, and at the REPLACE statement, where it is used.
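One thing to keep in mind when the pattern is built from an arbitrary list of allowed characters: a few characters (\, [, ], ^ and -) have a special meaning inside a set. The sketch below escapes them before the pattern is concatenated. It is not part of the original function module. The variable names and the sample value are hypothetical, and it assumes that the regex engine accepts backslash escapes inside a set, which is worth checking in DEMO_REGEX_TOY on your release.

```abap
DATA: l_allowed TYPE string VALUE 'A-Z]^',   " hypothetical sample value
      l_pattern TYPE string.

* Plain (non-regex) replacements, so the replacement text is taken literally.
* The backslash has to be handled first; otherwise the backslashes
* added by the later statements would be doubled again.
REPLACE ALL OCCURRENCES OF '\' IN l_allowed WITH '\\'.
REPLACE ALL OCCURRENCES OF '[' IN l_allowed WITH '\['.
REPLACE ALL OCCURRENCES OF ']' IN l_allowed WITH '\]'.
REPLACE ALL OCCURRENCES OF '^' IN l_allowed WITH '\^'.
REPLACE ALL OCCURRENCES OF '-' IN l_allowed WITH '\-'.

* Assumption: escaped characters inside [^...] are treated as literals.
CONCATENATE '[^' l_allowed ']' INTO l_pattern.
```

Without such a step, a hyphen between two other allowed characters would be read as a range, and a closing bracket would end the set early.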
After testing the new version for correctness, I ran a performance trace. It shows that the new implementation is 20% faster (tested on a demo instance with table SBOOK and 1,300,000 rows, one field replaced per row). In the trace, replace1 is the old implementation and replace2 the new one.
With more diverse input data, I would expect the new implementation to gain even more.
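If you want a rough comparison of your own without a full performance trace, a simple timing loop is one option. This is only a sketch and not the measurement used above: the loop count, the sample text and the variable names are assumptions, and GET RUN TIME gives a much coarser picture than a trace.

```abap
DATA: t_start TYPE i,
      t_end   TYPE i,
      t_diff  TYPE i,
      l_text  TYPE char100,
      l_out   TYPE char100.

* GET RUN TIME delivers the elapsed runtime in microseconds.
GET RUN TIME FIELD t_start.

DO 100000 TIMES.                       " hypothetical number of calls
  l_text = 'SOME#TEXT'.                " hypothetical sample input
  CALL FUNCTION 'Z_REPLACE_UNALLOWED_CHARS'
    EXPORTING
      i_text    = l_text
      i_default = '_'
    IMPORTING
      e_text    = l_out.
ENDDO.

GET RUN TIME FIELD t_end.
t_diff = t_end - t_start.

WRITE: / 'Runtime in microseconds:', t_diff.
```

Running the same loop once against the old and once against the new implementation gives a first indication; several repetitions help to even out fluctuations.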
For more detail, see the ABAP documentation: http://help.sap.com/abapdocu_70/en/ABENREGULAR_EXPRESSIONS.htm or try it yourself with the report DEMO_REGEX_TOY.
Have fun!
Good job making ABAP regex more popular!
Hello Markus,
thank you. I think there are a lot of places where this technique could make life much easier than tedious DO...IF/CASE/PERFORM...ENDDO sessions for searching and replacing strings.
Sadly, most people feel rather queasy when they see a regular expression for the first time 😉
Kind regards,
Hendrik
I thought that performance would probably improve if you used a GO_REGEX TYPE REF TO CL_ABAP_REGEX instead of a string G_PATTERN, and then REPLACE ... REGEX GO_REGEX ...
This way, the repeated parsing & compilation of the string into a regex could be avoided.
I tested this out, but to my surprise, both variants were almost exactly equal in time!
http://pastebin.com/R506LndU
This suggests that there is an internal optimization for regex strings (such as preserving the last regex string together with its object, and re-using that regex object if the REPLACE is called again with the same regex string).
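For readers who have not used the object-based variant described in this comment, a minimal sketch could look like the following. It is not the code from the pastebin link; the pattern and the sample text are hypothetical.

```abap
DATA: go_regex  TYPE REF TO cl_abap_regex,
      l_pattern TYPE string VALUE '[^A-Z0-9]',   " hypothetical pattern
      l_text    TYPE string VALUE 'AB#12'.       " hypothetical input

* Parse and compile the pattern once ...
CREATE OBJECT go_regex
  EXPORTING
    pattern = l_pattern.

* ... and hand the compiled object to REPLACE on every call
* instead of the pattern string.
REPLACE ALL OCCURRENCES OF REGEX go_regex IN l_text WITH '_'.
```

In a function module, go_regex would typically live in the global data of the function group, just like g_pattern, so that it is created only on the first call.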
Hello Rüdiger,
thank you for sharing these test results. I expected a similar result, because SAP uses the same internal mechanism for regular expressions, and I believe (although this is not true in all cases) that ABAP statements are generally faster than their OO encapsulations.
I am very happy to see that I am not alone in using regular expressions 😉 !
Kind regards,
Hendrik
Hello Hendrik,
I like regular expressions and use them in Java, JavaScript, Perl and ABAP - even in my "UltraEdit" plain text editor. Once you are used to the syntax, they are really great.
Hm. A regular expression which is given as a string only at run-time - like g_pattern in your example - has to be parsed and compiled into an internal "regex mini-program" at run-time before the regex match can be executed. When used repeatedly, a CL_ABAP_REGEX object can help you avoid this parse time, because internally it keeps this mini-program ready for arbitrarily many pattern match calls.
It seems that for the regex of your kind: "([^" + ... sequence of characters + "])", parse time is negligible. I wouldn't swear this to be the case for arbitrary regexes.
But - yes, I have no example 🙂
Regards,
Rüdiger
To become familiar with regular expressions and to test them, SAP offers the report
DEMO_REGEX_TOY
PS: I'm ashamed that I didn't read the article to its end... otherwise I would have realized that this post was redundant 🙁