Regex – May the force be with you.

former_member183045 · 12-21-2015

Regular expressions are an often-discussed technique. Do they make your code more readable, or do they add unnecessary complexity to your software project?
Recently I once again faced the decision whether or not to introduce regex functionality in a project. As a result, I want to describe my thoughts and my experience with the functionality provided in ABAP, and invite you to share your opinions.

So what is the answer? Are regular expressions a mandatory key functionality for processing strings, or a technique to be avoided because it adds unneeded complexity to your projects? Is there a general answer or is it, as in many other cases, “it depends” ...

Let’s see.

First I want to define three common requirements for discussing the facets of regular expressions:

Requirement 1: A product code should be checked whether it starts with the country code ‘AT’ or not.

Requirement 2: A user types his or her surname, which should be validated before it is processed.
Requirement 3: A product code should be checked which consists of a two-character country code + a 5-digit product number + a color specification (e.g. AT12345-red), and the product number should be extracted for further processing.

Requirement 1: A product code should be checked whether it starts with the country code ‘AT’ or not.

A possible solution with standard ABAP string functions for the input lv_input would be:

" Note: CP ignores case; the pattern '#A#T*' would make the check case-sensitive.
IF lv_input CP 'AT*'.
  MESSAGE 'String starts with AT' TYPE 'I'.
ELSE.
  MESSAGE 'String does not start with AT' TYPE 'E'.
ENDIF.

Rather straightforward. So let’s see what an equivalent solution with regex looks like.

FIND REGEX '^AT' IN lv_input.
IF sy-subrc = 0.
  MESSAGE 'String starts with AT' TYPE 'I'.
ELSE.
  MESSAGE 'String does not start with AT' TYPE 'E'.
ENDIF.

A great thing about ABAP is that it features regular expressions directly in the standard language, so you do not have to include objects or libraries as in some other programming languages. Of course there are also predefined classes like CL_ABAP_STRING_UTILITIES, but for simplicity I will leave them aside in this post.

But despite the simplicity of the regex solution, if we compare the two, the first one needs one line less and is more readable, as the string comparison can be used directly in the IF statement.

Thus we come to the first two findings:

Finding 1: For simple text comparisons the standard text comparison functions are more readable, and regex functionality adds unnecessary complexity.

Finding 2: In ABAP, regular expressions are included in the standard language features and can be used in particular with the FIND and REPLACE statements.

(Sidenote: Of course, even in this simple example further checks would be useful, as for example a plain ‘AT’ would be a valid input, but let’s leave this aspect aside for the moment.)
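
As a small addition to the comparison above: the readability gap shrinks if the regex check can sit directly in the IF condition. The following is only a sketch, assuming a release that offers the built-in predicate function contains( ) with its regex parameter (ABAP 7.02 or later); the sample value of lv_input is made up for illustration.

```abap
DATA lv_input TYPE string VALUE `AT12345`.

" contains( ) is a built-in predicate function, so no sy-subrc check is needed.
" Unlike CP, the regex is evaluated case-sensitively by default.
IF contains( val = lv_input regex = '^AT' ).
  MESSAGE 'String starts with AT' TYPE 'I'.
ELSE.
  MESSAGE 'String does not start with AT' TYPE 'E'.
ENDIF.
```

For a check this simple the CP variant is still at least as readable, so Finding 1 remains valid.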

Requirement 2: A user types his or her surname, which should be validated before it is processed.

To check whether a given text is a valid surname, several text comparisons are needed (unless you find an existing object or function that does the work for you; I do not know of one in ABAP). With text functions we could, for example, do the following checks:

IF lv_input NA sy-abcde.
  MESSAGE 'String contains no upper case character' TYPE 'E'.
ENDIF.
IF lv_input(1) NA sy-abcde.
  MESSAGE 'String does not start with an upper case character' TYPE 'E'.
ENDIF.

" Compare the upper case copy, so that lower case letters count as valid.
lv_input_uppercase = lv_input.
TRANSLATE lv_input_uppercase TO UPPER CASE.
IF lv_input_uppercase CN sy-abcde.
  MESSAGE 'String contains invalid characters' TYPE 'E'.
ENDIF.
... and so on

An equivalent solution with regular expressions would be, for example:

FIND REGEX '^[A-Z]([a-z]*)$' IN lv_input.
IF sy-subrc = 0.
  MESSAGE 'String is a valid surname' TYPE 'I'.
ELSE.
  MESSAGE 'String is not a valid surname' TYPE 'E'.
ENDIF.

I will not explain in detail what the regex ‘^[A-Z]([a-z]*)$’ does. In short, it accepts exactly one upper case letter followed by any number of lower case letters and nothing else. There are a lot of tools, like the ABAP Regex Toy (DEMO_REGEX_TOY) or internet tools like https://regex101.com/, which provide a much more complete explanation than I could give you here.

Comparing the two solutions, some interesting things can be seen. First, the regex version is now much shorter. Of course the regular expression is not self-explanatory, and if you are not used to it, it needs some thinking. But the whole validation stays in one place. One difference in the output is that with the regex solution the user does not get specific messages. Let’s summarize this in two findings.

Finding 3: As the string parsing gets more complex, regular expressions provide a way to keep the validation in one place and avoid a lot of conditional statements.

Finding 4: If specific validation messages are needed, string functions may be better than one consolidated regular expression.
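
A possible middle ground between Findings 3 and 4 is to split the validation into a few small regular expressions, each with its own message. This is only a sketch under assumptions: it uses the built-in predicate function contains( ) with its regex parameter (available from ABAP 7.02), and the sample value and message texts are made up for illustration.

```abap
DATA lv_input TYPE string VALUE `Skywalker`.

" Each rule gets its own small regex and its own specific message.
IF lv_input IS INITIAL.
  MESSAGE 'Surname must not be empty' TYPE 'E'.
ELSEIF NOT contains( val = lv_input regex = '^[A-Z]' ).
  MESSAGE 'Surname must start with an upper case character' TYPE 'E'.
ELSEIF NOT contains( val = lv_input regex = '^.[a-z]*$' ).
  MESSAGE 'Only lower case characters are allowed after the first one' TYPE 'E'.
ELSE.
  MESSAGE 'String is a valid surname' TYPE 'I'.
ENDIF.
```

The price is a few more conditional statements again, but every single expression stays short enough to be read at a glance.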

Requirement 3: A product code should be checked which consists of a two-character country code + a 5-digit product number + a color specification (e.g. AT12345-red), and the product number should be extracted for further processing.

The validation for this requirement is quite similar to that of requirement 1.

You could check that the first two characters are upper case letters, or compare them to predefined country codes. Then check that the next 5 characters form a valid number. Ensure that the ‘-’ is in the correct place and that the last characters correspond to a valid color.
Afterwards you could extract all characters from the third one up to the ‘-’ character.
Straightforward to code; roughly speaking, 5 IF statements or equivalent expressions should do the work.

The regular expression solution could look like the following:

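" [0-9] instead of [1-9], so that product numbers containing a 0 are accepted too.
FIND REGEX '^[A-Z]{2}([0-9]{5})-[a-z]+$' IN lv_input
  SUBMATCHES lv_product_number.
IF sy-subrc = 0.
  CONCATENATE 'Product number found:' lv_product_number
    INTO lv_message_text SEPARATED BY space.
  MESSAGE lv_message_text TYPE 'I'.
ENDIF.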

So the amount of coding remains nearly the same, even though the requirement is much more complex. Admittedly, the regular expression itself is already complicated and needs some time to understand. Again, for a detailed description of the regex, use the tools mentioned above.

A big advantage is that validation and extraction can be done together by using round brackets ().
They can also be used to replace parts of the string with the ABAP standard statement REPLACE REGEX ...

This I want to summarize in the last finding.

Finding 5: If a string should be both validated and processed, regular expressions make it possible to validate the string and to extract or replace one or more substrings in one operation.
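
To illustrate the replacing part of Finding 5, here is a minimal sketch with REPLACE. The sample value, the target format and the message texts are assumptions made up for illustration; the placeholders $1, $2 and $3 in the replacement text refer to the bracketed groups of the pattern.

```abap
DATA lv_code TYPE string VALUE `AT12345-red`.

" Three groups: country code, product number, color.
REPLACE FIRST OCCURRENCE OF REGEX '^([A-Z]{2})([0-9]{5})-([a-z]+)$'
  IN lv_code WITH '$1-$2-$3'.
IF sy-subrc = 0.
  " The pattern matched, so the code was valid and has been rewritten
  " (intended result for the sample value: AT-12345-red).
  MESSAGE lv_code TYPE 'I'.
ELSE.
  " No match: lv_code is left unchanged.
  MESSAGE 'Product code has an unexpected format' TYPE 'E'.
ENDIF.
```

Because the pattern is anchored with ^ and $, sy-subrc serves as the validation result and the replacement happens in the same statement.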

To summarize my findings:

Regular expressions are a powerful tool which can solve a lot of requirements regarding text validation, processing and replacement.
Simple checks are almost always better done with standard text functions. For requirements of medium complexity, regular expressions can save you a lot of typing effort and make your code more compact, but they also add expressions that are not easy to understand.
When the requirements get more complex, a solution where the whole logic is bundled in one regular expression can lead to code that is quite hard to understand and maintain. In these cases a split into several regular expressions, or a combination of regex and standard string functions, is advisable.
In my opinion, regular expressions should always be considered for text operations of medium complexity, as they can combine text validation, extraction and substitution when desired. ABAP in particular supports them very well, as they can be used with the standard statements FIND and REPLACE, but also with predefined classes like CL_ABAP_MATCHER or CL_ABAP_REGEX.
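
For completeness, the class-based variant mentioned above could look roughly like the following sketch for requirement 3. The variable names and the sample value are assumptions for illustration, and exception handling (for example for an invalid pattern) is left out.

```abap
DATA: lo_regex          TYPE REF TO cl_abap_regex,
      lo_matcher        TYPE REF TO cl_abap_matcher,
      lv_input          TYPE string VALUE `AT12345-red`,
      lv_product_number TYPE string.

" Create the regex object once; it can be reused for many input values.
CREATE OBJECT lo_regex
  EXPORTING
    pattern = '^[A-Z]{2}([0-9]{5})-[a-z]+$'.

lo_matcher = lo_regex->create_matcher( text = lv_input ).

" match( ) checks whether the complete text matches the pattern.
IF lo_matcher->match( ) = abap_true.
  " Submatch 1 is the first bracketed group, here the product number.
  lv_product_number = lo_matcher->get_submatch( 1 ).
ENDIF.
```

Compared with FIND REGEX this is more typing, but the regex object can be created once and reused, which may pay off when many values have to be checked.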

I hope you enjoyed reading this post, and I appreciate feedback, ratings, corrections, comments, different opinions or additional experience.

May the force of regex be with you