Author: Julius Bettin

Modern Regular Expressions in ABAP – Part 2 – Migrating from POSIX to PCRE

This blog post is the second in a series of three blog posts introducing recent changes and enhancements made to regular expressions in the ABAP language. A basic understanding of regular expressions and their syntax is assumed; some experience with using regular expressions in an ABAP context is beneficial if you want to follow the examples.

In part one (Modern Regular Expressions in ABAP – Part 1 – Introducing PCRE) we took a look at some of the new features PCRE has to offer. If you have not done so already, I strongly recommend you have a brief look at part one before continuing, so you have a rough understanding of what PCRE is capable of. The terminology introduced in part one will be used throughout this part as well.

Why you should migrate

PCRE is intended to replace POSIX as the new go-to regular expression flavor and it comes with a huge list of advantages:

  • it is more powerful and flexible: PCRE offers a vast number of features and can be configured in many aspects
  • it is more robust: PCRE can handle complex matches better and is less likely to result in a REGEX_TOO_COMPLEX error
  • it is faster: PCRE supports JIT compilation to greatly increase matching speed in certain scenarios
  • it is supported by external tools: you can now test and even debug your patterns, e.g. using https://regex101.com/; don’t forget good ol’ DEMO_REGEX and DEMO_REGEX_TOY though, which have been updated to also support PCRE

Apart from adding a ton of new features, PCRE also supports most of the existing POSIX features. There are however some differences and incompatibilities you have to watch out for when porting your existing patterns to PCRE. In the following sections we will take a closer look at these differences and how to deal with them.

POSIX’ Leftmost-Longest Rule

Both PCRE and POSIX use a regex-directed, backtracking algorithm, meaning both implementations will in most cases yield the same result. There is however a crucial difference: PCRE will always return the leftmost match, while POSIX aims to return the leftmost longest match, meaning that if multiple possible matches start at the same offset, the longest of those is returned.

Sounds a bit abstract at first, so let’s have a look at an example:

DATA(pcre_result)  = match( val = `unfoldable` pcre  = `un(fold|foldable)` ).
" --> returns 'unfold'
DATA(posix_result) = match( val = `unfoldable` regex = `un(fold|foldable)` ) ##regex_posix.
" --> returns 'unfoldable'

While PCRE was satisfied after matching fold, the leftmost alternative, POSIX tried all alternatives and found that matching foldable actually results in the longest match at this position, so it returned that. To retrieve the longest match in this example using PCRE, we have several options:

" 1. reorder the pattern so that the leftmost match is automatically the longest
DATA(fix1) = match( val = `unfoldable` pcre = `un(foldable|fold)` ).
" 2. anchor the pattern at the beginning and end of the subject string
DATA(fix2) = match( val = `unfoldable` pcre = `^un(fold|foldable)$` ).
" 3. anchor the pattern at the word boundaries
DATA(fix3) = match( val = `unfoldable` pcre = `\bun(fold|foldable)\b` ).
" 4. extract the common prefix
DATA(fix4) = match( val = `unfoldable` pcre = `unfold(able)?` ).

The different matching strategies affect not only alternations introduced by |, but all cases where multiple matches start at the same location, for example when using the ? quantifier:

DATA(pcre_result)  = match( val = `unfoldable` pcre  = `un(fold)?(foldable)?` ).
" --> returns 'unfold'
DATA(posix_result) = match( val = `unfoldable` regex = `un(fold)?(foldable)?` ) ##regex_posix.
" --> returns 'unfoldable'

In this case, we can use for example a lookahead assertion to also return the longest match in the PCRE case:

DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold(?!able))?(foldable)?` ).
" --> returns 'unfoldable'

This may seem like a huge deal, but in practice patterns rarely take advantage of POSIX’ leftmost longest rule. The vast majority of cases should simply work as is in PCRE. If you indeed require the longest of multiple possible results, you can apply the techniques described above to reorder and/or rewrite your pattern.
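
When the alternatives are not hard-coded but come from a list, the reordering technique can be automated. The following is only a minimal sketch: the word list and the type names are made up for illustration, and the words are assumed to contain no regular expression meta-characters (otherwise they would have to be escaped first).

```abap
" Hypothetical helper types, not part of any SAP API
TYPES: BEGIN OF ty_alternative,
         len  TYPE i,
         text TYPE string,
       END OF ty_alternative,
       ty_alternatives TYPE STANDARD TABLE OF ty_alternative WITH EMPTY KEY.

" Hypothetical word list; assumed to be free of regex meta-characters
DATA(words) = VALUE string_table( ( `fold` ) ( `foldable` ) ( `folder` ) ).

" Sort the alternatives by length, longest first
DATA(alternatives) = VALUE ty_alternatives(
  FOR word IN words ( len = strlen( word ) text = word ) ).
SORT alternatives BY len DESCENDING.

" Join them into a single alternation: un(foldable|folder|fold)
DATA(pattern) = `un(` &&
  concat_lines_of( table = VALUE string_table( FOR alt IN alternatives ( alt-text ) )
                   sep   = `|` ) &&
  `)`.

DATA(longest) = match( val = `unfoldable` pcre = pattern ).
" --> 'unfoldable'
```

Because the longest alternative is tried first, the leftmost match that PCRE returns is also the longest one among these literal alternatives.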

Whitespaces in Patterns

By default, PCRE’s extended mode is enabled for regular expressions of this kind in ABAP. This means that unescaped whitespace characters in the pattern (outside of character classes) are ignored when the pattern is evaluated. Take for example the following pattern, which in the PCRE case does not match the string Hello World:

DATA(posix_result) = find( val = `Hello World` regex = `Hello World` ) ##regex_posix.
" --> found
DATA(pcre_result)  = find( val = `Hello World` pcre  = `Hello World` ).
" --> not found, what is going on...?

This is because Hello World is equivalent to HelloWorld for PCRE in extended mode:

DATA(posix_result) = find( val = `HelloWorld` regex = `Hello World` ) ##regex_posix.
" --> not found
DATA(pcre_result)  = find( val = `HelloWorld` pcre  = `Hello World` ).
" --> found

If you want to explicitly match whitespaces in PCRE’s extended mode, you can do one of the following:

  • escape the relevant whitespaces in the pattern using \ (backslash):
    DATA(result1) = find( val = `Hello World` pcre = `Hello\ World` ).
    " --> found
    DATA(result2) = find( val = `Hello World` pcre = `Hello \  World` ).
    " --> also found as unescaped whitespaces are ignored
  • match all whitespaces using the \s syntax:
    DATA(result1) = find( val = `Hello World` pcre = `Hello\sWorld` ).
    " --> found
    DATA(result2) = find( val = `Hello World` pcre = `Hello \s World` ).
    " --> also found
    DATA(result3) = find( val = |Hello\tWorld| pcre = `Hello \s World` ). " where '\t' denotes the tabulation character
    " --> also found as the tabulation character is considered a whitespace
    

The extended mode allows you to write (arguably) more readable regular expressions, especially if you are dealing with complex patterns. Recall the parser example from the last blog post:

(?(DEFINE)
  (?<true> true )
  (?<false> false )
  (?<zero> 0 )
  (?<one> 1 )
  (?<if> if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) )
  (?<succ> succ \s*+ \( \s*+ (?&T) \s*+ \) )
  (?<pred> pred \s*+ \( \s*+ (?&T) \s*+ \) )
  (?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) )
  (?<T> (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )
)
\s*+ (?&T) \s*+

Without extended mode, the pattern would have to look like this:

(?(DEFINE)(?<true>true)(?<false>false)(?<zero>0)(?<one>1)(?<if>if\s++(?&T)\s++then\s++(?&T)\s++else\s++(?&T))(?<succ>succ\s*+\(\s*+(?&T)\s*+\))(?<pred>pred\s*+\(\s*+(?&T)\s*+\))(?<iszero>iszero\s*+\(\s*+(?&T)\s*+\))(?<T>(?&true)|(?&false)|(?&zero)|(?&one)|(?&if)|(?&succ)|(?&pred)|(?&iszero)))\s*+(?&T)\s*+

 

Extended mode can however be a bit confusing at first, especially when migrating your POSIX regular expressions. You can therefore also disable the extended mode, either by setting EXTENDED to false when creating the regular expression via CL_ABAP_REGEX=>CREATE_PCRE( ), or by using the option syntax (?-x) in the pattern itself. The latter also works when used in the built-in string functions:

DATA(pcre_result)  = find( val = `Hello World` pcre = `(?-x)Hello World` ).
" --> found

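For the class-based approach, a minimal sketch could look like the following. It only uses the EXTENDED parameter mentioned above; the variable names are chosen for illustration.

```abap
" Disable extended mode via the factory function
DATA(regex) = cl_abap_regex=>create_pcre( pattern  = `Hello World`
                                          extended = abap_false ).
DATA(matcher) = regex->create_matcher( text = `Hello World` ).

" With extended mode off, the blank in the pattern is matched literally,
" so this is expected to be abap_true
DATA(found) = matcher->find_next( ).
```
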
What the Dot matches

In POSIX, the . meta-character matches any character, including newlines. In PCRE this is not the case, as by default . will match everything except a newline sequence:

DATA(pcre_result)  = replace( val = |Hello\nWorld| pcre  = `.` with = `x` occ = 0 ).
" --> 'xxxxx\nxxxxx'
DATA(posix_result) = replace( val = |Hello\nWorld| regex = `.` with = `x` occ = 0 ) ##regex_posix.
" --> 'xxxxxxxxxxx'

What is considered a newline sequence in the context of the . meta-character can be controlled either via parameter NEWLINE_MODE of factory function CL_ABAP_REGEX=>CREATE_PCRE( ), or by prefixing your pattern with the corresponding control verb.

If you want the . meta-character to behave exactly as in the POSIX case, you can enable the so-called single line mode by either setting parameter DOT_ALL of factory function CL_ABAP_REGEX=>CREATE_PCRE( ) to ABAP_TRUE, or by setting the (?s) option inside your pattern.
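
Both variants might be used as sketched below. The first statement repeats the replace example from above with the (?s) option added; the second part uses the DOT_ALL parameter with an illustrative pattern and variable names of our own choosing.

```abap
" Option syntax inside the pattern, usable with the built-in functions
DATA(option_result) = replace( val = |Hello\nWorld| pcre = `(?s).` with = `x` occ = 0 ).
" --> 'xxxxxxxxxxx'

" Parameter of the factory function
DATA(matcher) = cl_abap_regex=>create_pcre( pattern = `Hello.World`
                                            dot_all = abap_true
                  )->create_matcher( text = |Hello\nWorld| ).

" The dot now also matches the newline, so the whole text is expected to match
DATA(matched) = matcher->match( ).
```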

Choosing the right Unicode Mode

Unlike POSIX, which always assumes UCS-2, PCRE allows you to treat your input string as either UCS-2 or UTF-16, depending on your needs. This can be configured in different ways depending on the type of regular expression operation performed:

Operation: methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER
Description: Unicode support is controlled by parameter UNICODE_HANDLING of the factory functions:

  • STRICT: treat input as UTF-16, throw an exception upon encountering invalid UTF-16 (i.e. broken surrogate pairs)
  • IGNORE: treat input as UTF-16, ignore invalid UTF-16; parts of the input that are not valid UTF-16 cannot be matched in any way
  • RELAXED: treat input as UCS-2; \C is enabled in patterns, the matching of surrogate pairs by their Unicode code point is however no longer possible

Default Behavior: UNICODE_HANDLING = STRICT is assumed unless specified otherwise

Operation: built-in functions find, find_end, replace, … and ABAP statements FIND and REPLACE
Description: no additional parameter exists to control Unicode support, instead the verb (*UTF) can be specified at the start of the pattern to enable UNICODE_HANDLING = STRICT
Default Behavior: if the (*UTF) verb is not specified at the start, UNICODE_HANDLING = RELAXED is assumed; the \C syntax can however not be used

The following table gives a quick overview of which Unicode mode to use when migrating a pattern from POSIX to PCRE:

| Operation | Treat Input as UCS-2 or UTF-16? | Accept Invalid UTF-16? | Action |
| --- | --- | --- | --- |
| methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | Yes | set UNICODE_HANDLING to IGNORE |
| methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | No | set UNICODE_HANDLING to STRICT (default) |
| methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | UCS-2 (ABAP default) | | set UNICODE_HANDLING to RELAXED |
| built-in functions and ABAP statements | UTF-16 | Yes | this cannot be achieved with the built-in functions and ABAP statements; use CL_ABAP_REGEX and CL_ABAP_MATCHER instead |
| built-in functions and ABAP statements | UTF-16 | No | add verb (*UTF) to the start of the pattern |
| built-in functions and ABAP statements | UCS-2 (ABAP default) | | (default) |
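
To get a feeling for the difference with the built-in functions, you could compare what a single . matches for a character outside the Basic Multilingual Plane. This is only a sketch with a sample character of our own choosing. Going by the descriptions above, the two lengths are expected to differ, as the UCS-2 case sees two separate 16-bit units while the UTF-16 case sees one surrogate pair.

```abap
" U+1F600 lies outside the Basic Multilingual Plane
" and is therefore stored as a surrogate pair
DATA(text) = `😀`.

" Default for built-in functions: input is treated as UCS-2
DATA(ucs2_match)  = match( val = text pcre = `.` ).

" With the (*UTF) verb: input is treated as UTF-16
DATA(utf16_match) = match( val = text pcre = `(*UTF).` ).

" Compare how much of the subject each variant consumed
DATA(ucs2_length)  = strlen( ucs2_match ).
DATA(utf16_length) = strlen( utf16_match ).
```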

Minor Incompatibilities

The following table contains a list of minor incompatibilities and ways to deal with them and achieve equivalent behavior in PCRE:

| Description | POSIX Syntax | PCRE Equivalent |
| --- | --- | --- |
| matching uppercase and lowercase letters (and the negation thereof) | \u, \l, \U and \L | \p{Lu}, \p{Ll}, \P{Lu} and \P{Ll} |
| word anchoring at the beginning or the end | \< and \> | \b(?=\w) or [[:<:]] and \b(?<=\w) or [[:>:]] |
| matching all “unicode” characters | [[:unicode:]] | use a character range depending on the context, e.g. [^\x{00}-\x{ff}] |

Note: \p and its negation \P are in fact much more powerful and can match a lot more character properties, e.g. \p{Sc} matches any currency symbol and \p{Hangul} matches any Hangul character.
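
Here is a short sketch of how two of these replacements look in practice. The subject strings are made up for illustration.

```abap
" POSIX: \u  -->  PCRE: \p{Lu}
DATA(upper_count) = count( val = `Hello World` pcre = `\p{Lu}` ).
" --> 2 ('H' and 'W')

" POSIX: \<bar\>  -->  PCRE: \b(?=\w)bar\b(?<=\w)
DATA(word_offset) = find( val = `foobar bar` pcre = `\b(?=\w)bar\b(?<=\w)` ).
" --> 7, the 'bar' inside 'foobar' does not start at a word boundary
```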

Replacement and Substitution

While both POSIX and PCRE allow simple substitutions, e.g. $0 for the contents of the whole match and $n for the contents of the n-th capture group, they differ in pretty much everything else related to replacements. We will not explore what PCRE brings to the table, as we already did that in the last part. Instead, we will focus on the POSIX replacement syntax that is not directly supported by PCRE.

Let’s start with the low hanging fruit: POSIX supports an additional syntax for referring to the whole match, $&. This can be trivially replaced by the $0 syntax, which is equivalent.

POSIX however also allows substituting with the parts before and after the actual match, using the $` and $' syntax respectively:

DATA(posix_result) = replace( val = `again and` regex = `and` with = '$0 $`' ) ##regex_posix.
" --> 'again and again ' (note the trailing blank, which is part of $`)
" === breakdown ===
" subject  = 'again and'
" match    =       'and'
" $0       =       'and'
" $`       = 'again '
" $0 $`    =       'and again '
" replaced = 'again and again ' --> only the 'and' was replaced

Achieving the same in PCRE can be tricky. For this simple example, we can get away with matching everything preceding the word and using a capture group, and then doing a simple capture group substitution:

DATA(pcre_result)  = replace( val = `again and` pcre  = `^(.+?)and` with = `$0 $1` ).
" --> 'again and again ' (including the trailing blank captured in $1)

There may however be cases where you have to get a bit more creative. Again, I have rarely seen a pattern in production that made use of these POSIX features in a reasonable way, so this shouldn’t concern you for the most part.
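
For the part after the match, one possible workaround is to capture the remainder of the subject inside a lookahead assertion, so that it is available as a capture group without being consumed by the match. This is only a sketch for a simple case with a made-up subject string; the (?s) option is added so that the remainder may also contain newlines.

```abap
DATA(posix_result) = replace( val = `and again` regex = `and` with = `$0$'` ) ##regex_posix.
" --> 'and again again'

" The lookahead captures everything after the match in group 1 without consuming it
DATA(pcre_result)  = replace( val = `and again` pcre  = `(?s)and(?=(.*))` with = `$0$1` ).
" --> 'and again again'
```

The same trick does not carry over to the part before the match in general, as lookbehind assertions in PCRE cannot match text of arbitrary length.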

Troubleshooting

In case your pattern does not compile, suddenly matches unwanted results or does not match things it previously matched, have a look at the most common pitfalls when migrating from POSIX to PCRE:

Category: extended mode
Symptoms:
  • pattern does not match
Prerequisites:
  • the pattern contains unescaped whitespaces
Solution: in PCRE’s extended mode, which is enabled by default, whitespaces in the pattern are ignored; you can either:

  • escape whitespaces in the pattern using \
  • match whitespaces using \s
  • disable extended mode using the EXTENDED parameter or the option syntax (?-x)

see Whitespaces in Patterns

Category: the . meta-character
Symptoms:
  • pattern does not match
  • pattern matches different parts of the subject than it did before
Prerequisites:
  • the subject string contains newline sequences
Solution: the . meta-character in PCRE by default does not match newline sequences; you can either:

  • enable single line mode using parameter DOT_ALL or the option syntax (?s)
  • match newline sequences explicitly using the \R syntax
  • use [\s\S] instead of .

see What the Dot matches

Category: unicode handling
Symptoms:
  • applying the pattern results in a UTF-related RABAX
Prerequisites:
  • the subject string or replacement string contains characters that are not valid in UTF-16
Solution: instances of CL_ABAP_REGEX by default assume UTF-16 input; see Choosing the right Unicode Mode

Category: unicode handling
Symptoms:
  • a different (sub-)match length is reported
Prerequisites:
  • the subject string contains UTF-16 surrogate pairs
Solution: instances of CL_ABAP_REGEX by default assume UTF-16 input; see Choosing the right Unicode Mode

Category: replacement and substitution
Symptoms:
  • applying a substitution results in a different string than before
Prerequisites:
  • the replacement string contains the substitutions $&, $' or $`
Solution: $&, $' and $` are not supported by PCRE; you can:

  • replace $& with $0
  • make use of additional capture groups and capture group substitutions using the $n syntax to emulate $' and $`

see Replacement and Substitution

Where to go from here

This concludes the second part of this series. You have hopefully gained some insights into the features PCRE has to offer and how to utilize them, as well as things you have to watch out for.

In the next and last part, we will take a look at yet more regular expression flavors and their use cases:

Modern Regular Expressions in ABAP – Part 3 – XPath and XSD

      2 Comments
      Sergei Haller

      maybe it's just "nitpicking" but having to escape spaces sounds wrong

      Kilian Kilger

      Hi Sergei,

      actually it's a very good thing and one gets used to it very quickly. It helps you to write much more readable regular expressions and is most beneficial for more complex regular expressions. It also helps to avoid errors, as in most cases " " is not what the user wants. Most users want \s.

      So:

      • Escaping spaces makes expressions more readable, as you can insert additional spaces for readability
      • Most users don't want " " but \s anyhow.

      Best regards,
      Kilian.