Taming the RegEx monster
As a long-time ABAPer who started out in the early '90s, I don't have regular expressions in my DNA. To me, they are some sort of monster: ugly, dangerous and mysterious.
Then, I learned about the benefits of REGEX search and replace in the ABAP editor. Did you ever replace ‘ +’ with ‘ ‘? You know what I mean. This changed my mind to a certain degree, and I found myself googling REGEX to do more sophisticated search/replace in the editor.
Recently, the ABAP language developers added a set of REGEX-armed functions such as the function matches. I must say, I have rarely used them, mainly because of their poor readability. Who wants to read something like
IF matches( val = itf_wa-tdline regex = `<DS:([^>]+)>.+` ).
when reviewing source code?
However, refactoring an old program, I stumbled across the lines:
if lv_ilart = 'REP' or
lv_ilart = 'UMR' or
lv_ilart = 'MAW'.
and wondered how to avoid writing lv_ilart three times. In an SQL statement, I could write
where ilart in ('REP', 'UMR', 'MAW')
but how do I do that in an IF? Some insistent voice inside my head kept whispering, “Use a REGEX”. “No!”, I said, “nobody will understand it…”, but the voice said: “Why don’t you try?”.
After some googling about REGEX and writing some dummy code for testing it out, I arrived at
if matches( val = lv_ilart regex = 'REP|UMR|MAW' ).
Leaning back for a minute and staring at this line, I had to admit that a reader could understand it. So, if I keep my REGEX simple, using this feature could be a good idea.
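One detail makes this line work as a membership test: matches only succeeds when the whole value fits the pattern, whereas contains also accepts a hit somewhere inside the value. The following is a minimal sketch to illustrate the difference. The report name and the sample values are made up for the demonstration.

```abap
REPORT z_regex_membership_demo. " hypothetical report name

" Sample value; in the refactored program this would be the real lv_ilart.
DATA(lv_ilart) = `UMR`.

" Exact hit: the whole value is one of the three alternatives.
ASSERT matches( val = lv_ilart regex = 'REP|UMR|MAW' ).

" A longer value is rejected, because matches( ) compares the whole value.
ASSERT NOT matches( val = `UMRX` regex = 'REP|UMR|MAW' ).

" contains( ) only searches for an occurrence, so it accepts the longer value.
ASSERT contains( val = `UMRX` regex = 'REP|UMR|MAW' ).
```

So for a check like the one above, matches is the function to use. contains would silently accept values that merely include one of the three codes.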
But what about the more complicated cases? REGEX is a standard in programming, and it is very powerful. How could I use this and keep the code readable?
My approach is: put a complex REGEX operation into a small method whose name explains the purpose. For example, instead of writing:
if matches( val = filename regex = '.+\_[0-9][0-9]' ).
I would prefer to write
if has_two_numbers_suffix( filename ).
(...)
method has_two_numbers_suffix.
result = xsdbool( matches( val = in regex = '.+\_[0-9][0-9]' ) ).
endmethod.
Now, a reviewer understands what’s supposed to be going on. Even if he’s not able to decipher the REGEX, at least he knows what it should do.
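For completeness, here is a sketch of how the method could be declared, together with a small ABAP Unit test that spells out what the pattern accepts and rejects. The class names lcl_file_check and ltc_file_check, the static method and the sample file names are assumptions made for illustration. The underscore is written without the backslash here, since it has no special meaning in a regular expression.

```abap
" Sketch only: class names and sample values are hypothetical.
CLASS lcl_file_check DEFINITION.
  PUBLIC SECTION.
    CLASS-METHODS has_two_numbers_suffix
      IMPORTING !in           TYPE string
      RETURNING VALUE(result) TYPE abap_bool.
ENDCLASS.

CLASS lcl_file_check IMPLEMENTATION.
  METHOD has_two_numbers_suffix.
    " At least one character, then an underscore, then exactly two digits.
    result = xsdbool( matches( val = in regex = '.+_[0-9][0-9]' ) ).
  ENDMETHOD.
ENDCLASS.

CLASS ltc_file_check DEFINITION FINAL FOR TESTING
  DURATION SHORT RISK LEVEL HARMLESS.
  PRIVATE SECTION.
    METHODS accepts_two_digit_suffix FOR TESTING.
    METHODS rejects_other_names      FOR TESTING.
ENDCLASS.

CLASS ltc_file_check IMPLEMENTATION.
  METHOD accepts_two_digit_suffix.
    cl_abap_unit_assert=>assert_true(
      act = lcl_file_check=>has_two_numbers_suffix( `report_01` ) ).
  ENDMETHOD.

  METHOD rejects_other_names.
    " Only one digit after the underscore.
    cl_abap_unit_assert=>assert_false(
      act = lcl_file_check=>has_two_numbers_suffix( `report_1` ) ).
    " The digits are not at the very end of the name.
    cl_abap_unit_assert=>assert_false(
      act = lcl_file_check=>has_two_numbers_suffix( `report_01.txt` ) ).
  ENDMETHOD.
ENDCLASS.
```

The test methods give a reader who cannot decipher the pattern a second way in: the sample names show the intended behaviour in plain sight.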
Agreed on code readability. To decipher a regex, regex101.com makes the job a bit easier. You can save the regular expression on regex101.com and paste the URL into the code, so that a reviewer can decipher the regex by opening the link.
We also need to be cautious when using regex101.com, because the ABAP regex library does not support everything.
RegEx support for ABAP has been around for quite some time now. To be honest, I have never had the need to build fancy, complex RegExes.
In the past, I used regexr.com to build the RegEx I wanted and then tested it with the DEMO_REGEX_TOY report.
Since I started using aUnit, I have added unit tests to the mix. They serve as technical documentation for the developers who will be maintaining the code.
My point was the readability. In my company, there are people who do not code in ABAP, but they do read short dumps or sometimes debug. None of them even knows how to spell RegEx. So if I use heavy RegExes, they won't understand the purpose of the line. The very useful testing tools mentioned here will not help in this case.
Very nice blog post and an interesting read on your train of thoughts here.
I think the approach of wrapping the test-logic (whether the filename ends with two numeric characters or not) into a small function with a telling name can really help to make the "main" program easier to understand.
What I don't agree with so much is using regex for the examples mentioned. A simple list match can be done in many different ways and if there are just a few cases like in the example I don't see three IF branches as an issue.
Regex, on the other hand, is rather heavyweight in its runtime behavior, easy to get wrong, and not obvious to the reader - which you mentioned in the comment.
To me, this looks like an unfortunate trade: getting rid of having to write three IF branches and introducing a famously error-prone and confusing technology that absolutely needs to be explained in comments or documentation.
Don't get me wrong, I don't mind regex per se and have used them successfully in projects. But in my mind, regexes are always like a heavy-duty power tool that needs a lot of care and safety procedures when handling.
My 2 cents on that bit.
Thank you for this valid topic! I totally agree with this.