Modern Regular Expressions in ABAP – Part 3 – XPath and XSD
This blog post is the third and last in a series of blog posts introducing recent changes and enhancements made to regular expressions in the ABAP language. A basic understanding of regular expressions and their syntax is assumed; some experience with using regular expressions in an ABAP context is beneficial if you want to follow the examples.
The previous parts of this series have focused on PCRE:
- Modern Regular Expressions in ABAP – Part 1 – Introducing PCRE
- Modern Regular Expressions in ABAP – Part 2 – Migrating from POSIX to PCRE
In this part we will take a look at two additional regular expression flavors that made their way into the ABAP language with SAP Basis 782 / SAP S/4 HANA CE 2011.
Yet more Regex Flavors
If you are looking for a general purpose regular expression implementation, PCRE is the weapon of choice: it is fast, feature rich and can be used pretty much everywhere POSIX regular expressions can be used.
When processing XML data, however, chances are you will stumble upon regular expression flavors specified by the W3C. Oftentimes regular expressions in this context are used for input validation. While PCRE can also be used for this, there are cases where you as the programmer are not in control of the regular expression at hand, as it may be given by somebody else, or where you require certain features to process XML data efficiently.
To account for these cases, two new regular expression flavors have been added to the ABAP language.
XSD Regular Expressions
The XSD (XML Schema Definition) standard specifies its own regular expression flavor, which can be applied using the pattern constraint inside an XML Schema Definition.
The following syntax elements are supported:
- greedy quantifiers (+, *, …)
- capturing groups ((...))
- character escapes (e.g. \d for digits, \s for whitespaces)
- Unicode character properties (\p and its negation \P; while the general categories, e.g. Lu for upper case letters, are the same, the script names differ when compared to PCRE)
- alternations (|)
- character classes ([...])
DATA(regex1) = cl_abap_regex=>create_xsd( pattern = `\d{5} \w+` ).
" --> matches 5 digits, followed by a space,
" followed by one or more 'word' characters
DATA(regex2) = cl_abap_regex=>create_xsd( pattern = `(?:Hello)` ).
" --> ERROR: '(?:' is not supported
I will refer to these kinds of regular expressions as XSD regular expressions for the rest of this blog post. Keep in mind that we are only talking about the regular expressions here; other XSD aspects are out of scope.
XPath Regular Expressions
Similarly to the XSD standard, the XPath 2.0 standard also specifies a regular expression syntax, which is heavily based on XSD-style regular expressions.
Additionally, the XPath standard adds support for the following:
- lazy quantifiers (+?, *?, …)
- non-capturing groups ((?:...))
- backreferences (\n, where n is a number identifying a capture group)
- pattern anchors (^ and $; XSD regular expressions do not give these characters any special meaning and will match them literally)
DATA(regex1) = cl_abap_regex=>create_xpath2( pattern = `\w+ is (?:happy|sad)` ).
" --> matches one or more 'word' characters, followed by ' is ' literally,
" followed by either 'happy' or 'sad'
DATA(regex2) = cl_abap_regex=>create_xpath2( pattern = `(?<my_group>Hello) World` ).
" --> ERROR: '(?<' is not supported
I will refer to these kinds of regular expressions as XPath regular expressions for the rest of this blog post. Again, keep in mind that we are only talking about the regular expressions here; other XPath aspects are out of scope.
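The XPath additions listed above can be illustrated with a short sketch. The sample texts and variable names are made up for illustration, and the expected results follow from the descriptions above rather than from a specific system:

```abap
" Sketch only: sample texts and variable names are illustrative assumptions.

" 1. lazy quantifier: '.+?' stops at the first '>' instead of the last one
DATA(first_tag) = match( val = `<a><b>` xpath = `<.+?>` ).
" --> '<a>'

" 2. backreference: '\1' must repeat whatever group 1 captured
DATA(repeated) = xsdbool( matches( val = `abcabc` xpath = `(abc)\1` ) ).
" --> true

" 3. anchors: in XPath, '^' and '$' anchor the match ...
DATA(anchored) = xsdbool( matches( val = `12345` xpath = `^\d+$` ) ).
" --> true

" ... while XSD treats both characters as literals
DATA(xsd_literal) = cl_abap_regex=>create_xsd( pattern = `^\d+$` ).
DATA(literal_hit) = xsd_literal->create_matcher( text = `^12345$` )->match( ).
" --> true, the text contains the literal '^' and '$'
DATA(literal_miss) = xsd_literal->create_matcher( text = `12345` )->match( ).
" --> false, the literal '^' and '$' are missing
```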
Where they can be used
Both XSD and XPath regular expressions can be used in conjunction with classes CL_ABAP_REGEX and CL_ABAP_MATCHER:
" 1. create an XSD regular expression:
DATA(xsd_regex) = cl_abap_regex=>create_xsd( pattern = `[0-9]+` ).
DATA(xsd_matcher) = xsd_regex->create_matcher( text = `123456 HelloWorld` ).
" ...
" 2. create an XPath regular expression:
DATA(xpath_regex) = cl_abap_regex=>create_xpath2( pattern = `\w+` ).
DATA(xpath_matcher) = xpath_regex->create_matcher( text = `123456 HelloWorld` ).
" ...
While the XSD standard pretty much only considers the matching operation when it comes to regular expressions, there is no such restriction inside the ABAP language. Every possible operation of classes CL_ABAP_REGEX and CL_ABAP_MATCHER can also be performed on XSD and XPath regular expression based instances, including FIND and REPLACE:
DATA(xsd_result) = xsd_matcher->find_next( ).
" --> finds '123456'
DATA(xpath_result) = xpath_matcher->replace_next( newtext = `789` ).
" --> replaces '123456' with '789', yielding '789 HelloWorld'
Additionally, XPath regular expressions, as they are the more powerful of the two, can also be used inside built-in functions matches, match and count:
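For count and match, a minimal sketch could look like the following. It reuses the sample text from the matcher example; the variable names are illustrative assumptions, and the matches function is shown right after it:

```abap
" Sketch only: variable names are illustrative assumptions.
DATA(sample_text) = `123456 HelloWorld`.

" count the number of 'word' runs in the text
DATA(word_count) = count( val = sample_text xpath = `\w+` ).
" --> 2 ('123456' and 'HelloWorld')

" extract the second occurrence of the pattern
DATA(second_word) = match( val = sample_text xpath = `\w+` occ = 2 ).
" --> 'HelloWorld'
```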
DATA(result) = xsdbool( matches( val = `lower and UPPER case` xpath = `[a-z ]+` ) ).
" --> false
What makes them special
For the most part XSD and XPath regular expressions offer a limited feature set. Most of the syntax is also directly supported by PCRE.
There are however certain aspects that are particular to XSD and XPath and are not easily translatable to equivalent PCRE expressions.
Special Shorthands
As XSD regular expressions are intended to be used in an XML environment, they come with special shorthands to match XML names:
- \i matches any character that may be the first character of an XML name
- \c matches any character that may occur after the first character in an XML name
" Match only valid XML tags
DATA(regex) = cl_abap_regex=>create_xsd( pattern = `<\i\c*>` ).
DATA(matcher1) = regex->create_matcher( text = `<Hellö>` ).
DATA(result1) = matcher1->match( ).
" --> true
DATA(matcher2) = regex->create_matcher( text = `<.INVALID.>` ).
DATA(result2) = matcher2->match( ).
" --> false, '.' is not a valid first character in an XML tag
As with most shorthands, you can also use the negations \I and \C to match all characters that do not fulfill the criteria described above:
" NOTE: the simple pattern below only inspects the first two characters after '<';
" a one-character name such as '<a>' is reported as well, because '>' matches \C
" Match only tags with invalid XML name
DATA(regex) = cl_abap_regex=>create_xpath2( pattern = `<(\I|\i\C)` ).
DATA(matcher1) = regex->create_matcher( text = `<Hello>` ).
DATA(result1) = matcher1->find_next( ).
" --> nothing found, name is valid
DATA(matcher2) = regex->create_matcher( text = `<...>` ).
DATA(result2) = matcher2->find_next( ).
" --> found, name is invalid!
Character Class Subtraction
You are probably already familiar with character classes in POSIX and PCRE, which let you match a set consisting of single characters and character ranges:
" 1. character class containing a single character range '0-9':
DATA(result1) = xsdbool( matches( val = `06227` pcre = `[0-9]+` ) ).
" --> true
" 2. character class containing two character ranges 'a-z' and 'A-Z',
" as well as the characters ' ' and '!':
DATA(result2) = xsdbool( matches( val = `Hello World!` pcre = `[a-zA-Z !]+` ) ).
" --> true
Both XPath and XSD extend the character class mechanism by allowing set-like subtraction of character classes from one another. For example, subtracting character class B from character class A results in a character class that matches everything that A matches, unless it is matched by B.
This sounds a little abstract so let’s look at an example. Suppose we have two character classes:
- [abcde], which we will refer to as A; it matches the characters from a to e; we could also use the character range notation to write it as [a-e]
- [defg], which we will refer to as B; it matches the characters from d to g; we could also use the character range notation to write it as [d-g]
To subtract B from A, we write [abcde-[defg]] (note that the subtraction takes place inside the first character class; this is not a typo). The resulting character class is equivalent to [abc], as outlined in green in the following diagram:
Figure: Simple Character Class Subtraction
To subtract A from B, we write [defg-[abcde]]. The resulting character class is equivalent to [fg], as outlined in purple in the diagram above.
As you can see, the character class subtraction syntax also makes use of the - character, similar to the character range syntax. In combination with character ranges, this can get a bit confusing: [a-e-[d-g]], for example, is equivalent to [abcde-[defg]].
While you can use character ranges pretty much everywhere in a character class, a character class subtraction must always be the last element.
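To make the range notation concrete, here is a small sketch of both subtractions from the diagram, written with character ranges. The sample values and variable names are made up for illustration:

```abap
" Sketch only: sample values and variable names are illustrative assumptions.

" [a-e-[d-g]] is equivalent to [abc]
DATA(only_abc) = xsdbool( matches( val = `cab` xpath = `[a-e-[d-g]]+` ) ).
" --> true

DATA(contains_d) = xsdbool( matches( val = `cad` xpath = `[a-e-[d-g]]+` ) ).
" --> false, 'd' has been subtracted

" [d-g-[a-e]] is equivalent to [fg]
DATA(only_fg) = xsdbool( matches( val = `fg` xpath = `[d-g-[a-e]]+` ) ).
" --> true
```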
With that in mind, let’s look at some more useful examples. Character classes unfold their true power when combined with Unicode character properties (sometimes referred to as Unicode character categories):
" 1. match all Greek characters
DATA(result1_1) = xsdbool( matches( xpath = `\p{IsGreek}+` val = `ΑβΓδΕ` ) ).
" --> true
DATA(result1_2) = xsdbool( matches( xpath = `\p{IsGreek}+` val = `안녕` ) ).
" --> false
" 2. match all uppercase letters
DATA(result2_1) = xsdbool( matches( xpath = `\p{Lu}+` val = `ABΓ` ) ).
" --> true
DATA(result2_2) = xsdbool( matches( xpath = `\p{Lu}+` val = `ABγ` ) ).
" --> false
" 3. match all Greek characters that are NOT uppercase letters
DATA(result3_1) = xsdbool( matches( xpath = `[\p{IsGreek}-[\p{Lu}]]+` val = `αβγδε` ) ).
" --> true
DATA(result3_2) = xsdbool( matches( xpath = `[\p{IsGreek}-[\p{Lu}]]+` val = `αβγδεfgh` ) ).
" --> false (not all Greek)
DATA(result3_3) = xsdbool( matches( xpath = `[\p{IsGreek}-[\p{Lu}]]+` val = `ΑβΓδΕ` ) ).
" --> false (contains uppercase letters)
While not directly supported through a dedicated syntax, it is also possible to express character class intersection (matching every element that is in both A and B, expressed by the white center in the diagram above), using a simple trick:
" NOTE: the 'Nd' in '\p{Nd}' stands for the 'number, decimal' property
" as specified by the Unicode standard
" method 1: match all Thai numerals by subtracting everything
" that is not Thai ('\P{IsThai}') from the set of numerals:
DATA(result1) = xsdbool( matches( xpath = `[\p{Nd}-[\P{IsThai}]]+` val = `๐๖๒๒๗` ) ).
" --> true
" method 2: match all Thai numerals; same principle as above,
" but using character class negation (indicated by '^' at the start
" of a character class) instead of '\P'
DATA(result2) = xsdbool( matches( xpath = `[\p{Nd}-[^\p{IsThai}]]+` val = `๐๖๒๒๗` ) ).
" --> true
Keep in mind, the same behavior can be achieved in PCRE using lookbehind assertions:
" NOTE: PCRE uses slightly different names when referring to scripts
" inside '\p' and '\P'; instead of 'IsGreek' as used in XSD / XPath,
" we simply write 'Greek'
" 1. character class subtraction (using negative lookbehind):
DATA(result1) = xsdbool( matches( pcre = `(\p{Greek}(?<!\p{Lu}))+` val = `αβγδε` ) ).
" --> true
" 2. character class intersection (using positive lookbehind):
DATA(result2) = xsdbool( matches( pcre = `(\p{Nd}(?<=\p{Thai}))+` val = `๐๖๒๒๗` ) ).
" --> true
When to use them
As mentioned at the beginning, XSD and XPath regular expressions are intended for certain use cases. For most tasks you should favor PCRE, as it is more powerful and available in more places.
So, use XSD and/or XPath regular expressions:
- if you are dealing with a regular expression conforming to these standards, e.g. contained in an XSD or XPath expression
- if you are dealing with XML data and want to make use of the \i and \c shorthands
- if you want to express certain sets of characters using character class subtraction
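As a sketch of the first case, suppose a pattern has been taken over unchanged from the pattern constraint of an XSD. The pattern, the sample values and the variable names below are hypothetical and only serve as an illustration:

```abap
" Sketch only: the pattern stands in for one copied from an XSD pattern
" constraint; pattern, sample values and names are made up.
DATA(code_regex) = cl_abap_regex=>create_xsd( pattern = `[A-Z]{2}-\d{4}` ).

DATA(valid_code) = code_regex->create_matcher( text = `DE-1234` )->match( ).
" --> true

DATA(invalid_code) = code_regex->create_matcher( text = `DE-12` )->match( ).
" --> false, only two of the four required digits are present
```

Using the XSD flavor here means the pattern is applied as given and does not have to be translated to PCRE first.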
Conclusion
This concludes the third and last part of this series in which we have covered recent additions made to regular expressions in the ABAP language. Below you can find a non-exhaustive list of further tools and resources that may be useful when creating regular expressions.
If you need to quickly experiment with and test regular expressions, you can use the following tools:
Tool | Description |
---|---|
reports DEMO_REGEX / DEMO_REGEX_TOY | supports POSIX, PCRE, XSD and XPath regular expressions |
regex101.com | supports PCRE and others (but not POSIX / XSD / XPath); has basic debugging capabilities for PCRE |
If you want to know more about a certain regular expression related feature or technique, either consult the official ABAP documentation or take a look at one of these sources:
Source | Description |
---|---|
www.regular-expressions.info | great source covering a lot of different regular expression implementations, including lots of examples |
official PCRE documentation | the one stop shop for everything PCRE related; especially useful: NOTE: not all operations and settings described there can be performed or influenced from within ABAP; if in doubt, consult the official ABAP documentation |
XSD standard | regular expressions as specified by the XSD standard; very technical |
XPath 2.0 standard | regular expressions as specified by the XPath standard; also very technical |