This blog post is the second in a series of three introducing recent changes and enhancements made to regular expressions in the ABAP language. A basic understanding of regular expressions and their syntax is assumed; some experience with regular expressions in an ABAP context is beneficial if you want to follow the examples.
In part one (Modern Regular Expressions in ABAP – Part 1 – Introducing PCRE) we took a look at some of the new features PCRE has to offer. If you have not done so already, I strongly recommend you have a brief look at part one before continuing, so you have a rough understanding of what PCRE is capable of. The terminology introduced in part one will also be used throughout this part.
Table of Contents
- Why you should migrate
- POSIX’ leftmost longest rule
- Whitespaces in Patterns
- What the Dot matches
- Choosing the right Unicode Mode
- Minor Incompatibilities
- Replacement and Substitution
- Where to go from here
Why you should migrate
PCRE is intended to replace POSIX as the new go-to regular expression flavor and it comes with a huge list of advantages:
- it is more powerful and flexible: PCRE offers a vast number of features and can be configured in many aspects
- it is more robust: PCRE can handle complex matches better and will less likely result in a
- it is faster: PCRE supports JIT compilation to greatly increase matching speed in certain scenarios
- it is supported by external tools: you can now test and even debug your patterns, e.g. using https://regex101.com/; don’t forget good ol’ `DEMO_REGEX_TOY` though, which has been updated to also support PCRE
Apart from adding a ton of new features, PCRE also supports most of the existing POSIX features. There are however some differences and incompatibilities you have to watch out for when porting your existing patterns to PCRE. In the following sections we will take a closer look at these differences and how to deal with them.
POSIX’ Leftmost-Longest Rule
Both PCRE and POSIX use a regex-directed, backtracking algorithm, meaning both implementations will in most cases yield the same result. There is however a crucial difference: PCRE will always return the leftmost match, while POSIX aims to return the leftmost longest match, meaning that if multiple possible matches start at the same offset, the longest of those is returned.
Sounds a bit abstract at first, so let’s have a look at an example:
    DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold|foldable)` ).
    " --> returns 'unfold'

    DATA(posix_result) = match( val = `unfoldable` regex = `un(fold|foldable)` ) ##regex_posix.
    " --> returns 'unfoldable'
While PCRE was satisfied after matching `fold`, the leftmost alternative, POSIX tried all alternatives and found that matching `foldable` actually results in the longest match at this position, so it returned that. To retrieve the longest match in this example using PCRE, we have several options:

    " 1. reorder the pattern so that the leftmost match is automatically the longest
    DATA(fix1) = match( val = `unfoldable` pcre = `un(foldable|fold)` ).

    " 2. anchor the pattern at the beginning and end of the subject string
    DATA(fix2) = match( val = `unfoldable` pcre = `^un(fold|foldable)$` ).

    " 3. anchor the pattern at the word boundaries
    DATA(fix3) = match( val = `unfoldable` pcre = `\bun(fold|foldable)\b` ).

    " 4. extract the common prefix
    DATA(fix4) = match( val = `unfoldable` pcre = `unfold(able)?` ).
The different matching strategies do not only affect alternations introduced by `|`, but all cases where multiple matches start at the same location, for example when using the `?` quantifier:
    DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold)?(foldable)?` ).
    " --> returns 'unfold'

    DATA(posix_result) = match( val = `unfoldable` regex = `un(fold)?(foldable)?` ) ##regex_posix.
    " --> returns 'unfoldable'
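In this case, we can, for example, use a negative lookahead assertion to also return the longest match in the PCRE case:

    " (?!able) rejects 'fold' whenever 'able' follows, so the second group gets to match 'foldable'
    DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold(?!able))?(foldable)?` ).
    " --> returns 'unfoldable'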
This may seem like a huge deal, but in practice patterns rarely take advantage of POSIX’ leftmost longest rule. The vast majority of cases should simply work as is in PCRE. If you indeed require the longest of multiple possible results, you can apply the techniques described above to reorder and/or rewrite your pattern.
Whitespaces in Patterns
By default, PCRE’s extended mode is enabled for regular expressions of this kind in ABAP. This means that whitespace characters in the pattern are ignored when the pattern is evaluated. Take for example the following pattern, which in the PCRE case does not match the string `Hello World`:
    DATA(posix_result) = find( val = `Hello World` regex = `Hello World` ) ##regex_posix.
    " --> found

    DATA(pcre_result) = find( val = `Hello World` pcre = `Hello World` ).
    " --> not found, what is going on...?
This is because `Hello World` is equivalent to `HelloWorld` for PCRE in extended mode:
    DATA(posix_result) = find( val = `HelloWorld` regex = `Hello World` ) ##regex_posix.
    " --> not found

    DATA(pcre_result) = find( val = `HelloWorld` pcre = `Hello World` ).
    " --> found
If you want to explicitly match whitespace in PCRE’s extended mode, you can do one of the following:
- escape the relevant whitespace in the pattern using a backslash (`\`):

      DATA(result1) = find( val = `Hello World` pcre = `Hello\ World` ).
      " --> found

      DATA(result2) = find( val = `Hello World` pcre = `Hello \ World` ).
      " --> also found as unescaped whitespaces are ignored

- match any whitespace character using `\s`:

      DATA(result1) = find( val = `Hello World` pcre = `Hello\sWorld` ).
      " --> found

      DATA(result2) = find( val = `Hello World` pcre = `Hello \s World` ).
      " --> also found

      " where '\t' denotes the tabulation character
      DATA(result3) = find( val = |Hello\tWorld| pcre = `Hello \s World` ).
      " --> also found as the tabulation character is considered a whitespace
The extended mode allows you to write (arguably) more readable regular expressions, especially if you are dealing with complex patterns. Recall the parser example from the last blog post:
    (?(DEFINE)
        (?<true>   true  )
        (?<false>  false )
        (?<zero>   0 )
        (?<one>    1 )
        (?<if>     if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) )
        (?<succ>   succ   \s*+ \( \s*+ (?&T) \s*+ \) )
        (?<pred>   pred   \s*+ \( \s*+ (?&T) \s*+ \) )
        (?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) )
        (?<T>      (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )
    )

    \s*+ (?&T) \s*+
Without extended mode, the pattern would have to look like this:
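As a sketch, removing every ignorable whitespace character from the extended version above gives the following single line. It is derived mechanically from that pattern and has not been run against a system:

```text
(?(DEFINE)(?<true>true)(?<false>false)(?<zero>0)(?<one>1)(?<if>if\s++(?&T)\s++then\s++(?&T)\s++else\s++(?&T))(?<succ>succ\s*+\(\s*+(?&T)\s*+\))(?<pred>pred\s*+\(\s*+(?&T)\s*+\))(?<iszero>iszero\s*+\(\s*+(?&T)\s*+\))(?<T>(?&true)|(?&false)|(?&zero)|(?&one)|(?&if)|(?&succ)|(?&pred)|(?&iszero)))\s*+(?&T)\s*+
```

The grammar is identical; only the layout that made the individual rules recognizable is gone.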
Extended mode can however be a bit confusing at first, especially when migrating your POSIX regular expressions. You can therefore also disable the extended mode, either by setting `EXTENDED` to false when creating the regular expression via `CL_ABAP_REGEX=>CREATE_PCRE( )`, or by using the option syntax `(?-x)` in the pattern itself. The latter also works when used in the built-in string functions:

    DATA(pcre_result) = find( val = `Hello World` pcre = `(?-x)Hello World` ).
    " --> found
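For the class-based route, a minimal sketch could look as follows. It assumes that `CREATE_PCRE` takes the pattern via a `PATTERN` parameter next to the `EXTENDED` parameter mentioned above, and that the regex object is used through `CREATE_MATCHER( )` and `FIND_NEXT( )`; the variable names are made up for illustration.

```abap
" Sketch: switch off extended mode when creating the regex object.
" Assumption: CREATE_PCRE accepts the parameters PATTERN and EXTENDED.
DATA(regex) = cl_abap_regex=>create_pcre( pattern  = `Hello World`
                                          extended = abap_false ).

" With extended mode off, the blank in the pattern is matched literally,
" so the subject must contain 'Hello World' including the blank.
DATA(matcher) = regex->create_matcher( text = `Hello World` ).
DATA(found)   = matcher->find_next( ).
```

Setting the option on the regex object keeps the pattern itself free of inline options, which can be handy when the same pattern string is shared with other tools.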
What the Dot matches
In POSIX, the `.` meta-character matches anything. In PCRE this is not the case, as by default `.` will match everything except a newline sequence:
    DATA(pcre_result) = replace( val = |Hello\nWorld| pcre = `.` with = `x` occ = 0 ).
    " --> 'xxxxx\nxxxxx'

    DATA(posix_result) = replace( val = |Hello\nWorld| regex = `.` with = `x` occ = 0 ) ##regex_posix.
    " --> 'xxxxxxxxxxx'
What is considered a newline sequence in the context of the `.` meta-character can be controlled either via parameter `NEWLINE_MODE` of factory function `CL_ABAP_REGEX=>CREATE_PCRE( )`, or by prefixing your pattern with the corresponding control verb.
If you want the `.` meta-character to behave exactly as in the POSIX case, you can enable the so-called single line mode by either setting parameter `DOT_ALL` of factory function `CL_ABAP_REGEX=>CREATE_PCRE( )` to `ABAP_TRUE`, or by setting the `(?s)` option inside your pattern.
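As a quick illustration of the second option, the replacement example from above can be repeated with `(?s)` prefixed to the pattern. This is a minimal sketch; the variable name is made up for illustration:

```abap
" Sketch: (?s) enables single line mode, so '.' also matches the newline.
DATA(dotall_result) = replace( val  = |Hello\nWorld|
                               pcre = `(?s).`
                               with = `x`
                               occ  = 0 ).
" --> 'xxxxxxxxxxx' (all 11 characters replaced, including the newline)
```

Because the option lives in the pattern, this variant also works with the built-in string functions, where no `DOT_ALL` parameter is available.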
Choosing the right Unicode Mode
Unlike POSIX, which always assumes UCS-2, PCRE allows you to treat your input string as either UCS-2 or UTF-16, depending on your needs. This can be configured in different ways depending on the type of regular expression operation performed:
|methods of class
||Unicode support is controlled by parameter
||no additional parameter exists to control Unicode support, instead the verb
The following table gives a quick overview of which Unicode mode to use when migrating a pattern from POSIX to PCRE:
|Operation||Treat Input as UCS-2 or UTF-16?||Accept Invalid UTF-16?||Action|
|methods of class
|methods of class
|methods of class
||UCS-2 (ABAP default)||–||set
|built-in functions and ABAP statements||UTF-16||Yes||this cannot be achieved with the built-in functions and ABAP statements;
|built-in functions and ABAP statements||UTF-16||No||add verb
|built-in functions and ABAP statements||UCS-2 (ABAP default)||–||(default)|
Minor Incompatibilities
The following table contains a list of minor incompatibilities, along with ways to achieve equivalent behavior in PCRE:
|Description||POSIX Syntax||PCRE Equivalent|
|matching uppercase and lowercase letters (and the negation thereof)||
|word anchoring at the beginning or the end||
|matching all “unicode” characters||
||use a character range depending on the context, e.g.
Replacement and Substitution
While both POSIX and PCRE allow simple substitutions, e.g. `$0` for the contents of the whole match and `$n` for the contents of the `n`-th capture group, they differ in pretty much everything else related to replacement. We will not explore what PCRE brings to the table, as we already did that in the last part. Instead, we will focus on the POSIX replacement syntax that is not directly supported by PCRE.
Let’s start with the low hanging fruit: POSIX supports an additional syntax for referring to the whole match, `$&`. This can be trivially replaced by the `$0` syntax, which is equivalent.
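A minimal sketch of that migration step, using a made-up subject string and made-up variable names, could look like this:

```abap
" POSIX: $& refers to the whole match
DATA(posix_whole) = replace( val = `Hello World` regex = `World` with = `$& $&` ) ##regex_posix.
" --> 'Hello World World'

" PCRE: use $0 instead
DATA(pcre_whole) = replace( val = `Hello World` pcre = `World` with = `$0 $0` ).
" --> 'Hello World World'
```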
POSIX however also allows substituting with the parts before and after the actual match, using the `` $` `` and `$'` syntax respectively:
    DATA(posix_result) = replace( val = `again and` regex = `and` with = '$0 $`' ) ##regex_posix.
    " --> 'again and again'

    " === breakdown ===
    " subject  = 'again and'
    " match    = 'and'
    " $0       = 'and'
    " $`       = 'again '
    " $0 $`    = 'and again '
    " replaced = 'again and again ' --> only the 'and' was replaced
Achieving the same in PCRE can be tricky. For this simple example, we can get away with matching everything preceding the `and` using a capture group and then doing a simple capture group substitution:

    DATA(pcre_result) = replace( val = `again and` pcre = `^(.+?)and` with = `$0 $1` ).
    " --> 'again and again'
There may however be cases where you have to get a bit more creative. Again, I have rarely seen a pattern in production that made use of these POSIX features in a reasonable way, so this shouldn’t concern you for the most part.
In case your pattern does not compile, suddenly matches unwanted results or does not match things it previously matched, have a look at the most common pitfalls when migrating from POSIX to PCRE:
||in PCRE’s extended mode, which is enabled by default, whitespaces in the pattern are ignored; you can either:
|replacement and substitution||
Where to go from here
This concludes the second part of this series. You have hopefully gained some insights into the features PCRE has to offer and how to utilize them, as well as things you have to watch out for.
In the next and last part, we will take a look at yet more regular expression flavors and their use cases: