Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
JasonHinsperger
Advisor
Advisor
0 Kudos

In this post, originally written by Glenn Paulley and posted to sybase.com in June of 2009, Glenn discusses how regular expressions can be used within queries in SQL Anywhere.

SQL Anywhere version 11.0.0 introduced support for search conditions that include regular expression searching. There are two variants of regular expression search predicates that one can use, each with their own semantics: SIMILAR TO and REGEXP.

SIMILAR TO

The SIMILAR TO predicate is part of the 2008 ANSI/ISO SQL standard. However, the draft of the next SQL standard, due in the 2011 timeframe, is currently under development - in fact, a WG3 editing meeting is underway in Korea this week - and SIMILAR TO will likely be eliminated from subsequent versions of the SQL standard as its functionality is being replaced by the REGEXP_LIKE predicate (see below).

The syntax of the SIMILAR TO predicate is straightforward:


expression [ NOT ] SIMILAR TO pattern [ ESCAPE escape-expression ]

but as usual the devil is in the details. For starters, here's an example, using the SQL Anywhere demo database:


SELECT *
FROM Customers
WHERE PostalCode NOT SIMILAR TO '([0-9]{5})|([0-9]{5}-[0-9]{4})|([A-Z][0-9][A-Z][[:whitespace:]]{1}[0-9][A-Z][0-9])'

which finds all those addresses with invalidly-formatted postal codes (either US or Canadian); the accepted codes have the formats of (a) five numbers (US), (b) five numbers, a dash, and four numbers (US), and (c) the six-character alphanumeric Canadian postal codes with a single embedded blank.



The regular expression patterns supported by SIMILAR TO, however, differ from those supported by regular expression conditions in other software packages (such as Perl). Here is a small, non-exhaustive list of differences (a more exhaustive list is contained in the SQL Anywhere documentation😞

  • As with LIKE and REGEXP, SIMILAR TO matches entire values, not portions of values.
  • SIMILAR TO uses "%" (percent) and "_" (underscore) as wildcard characters, in the same way as LIKE. One uses "%" instead of ".*".
  • SIMILAR TO doesn't support a variety of sub-character classes, such as [:ascii:], [:blank:], or [:punct:].
  • Perhaps most importantly, SIMILAR TO uses collation-based comparisons when comparing string values. This can be useful. For example, with SQL Anywhere's default case-insensitive string matching, the pattern [A]{1} is equivalent to [a]{1}, and these equivalences may also apply to accented characters with specific collations. However, a significant drawback is that range patterns don't work properly; the range pattern [A-C] does not, in fact, match only the upper case characters A, B, and C. Rather, in the default case-insensitive collation [A-C] matches any of the characters A, B, b, c and C; it does not match "a" because the character "a" precedes "A" in the collation sequence.

    This means, then, that the example above fails to properly validate Canadian postal codes; the query would accept Canadian postal codes containing lower-case letters.

REGEXP

With the SQL Anywhere 11.0.1 release, the REGEXP predicate supports regular expression patterns in a manner similar to Perl and other UNIX-based tools that support regular expression searching. Once again, the syntax is straightforward:


expression [ NOT ] REGEXP pattern [ ESCAPE escape-expression ]

In the SQL Standard, the syntax is virtually identical except that the predicate uses the keyword LIKE_REGEXP. Supported patterns are those from the XQuery portion of the standard. In SQL Anywhere, we've adopted pattern syntax from a variety of sources, primarily Perl. REGEXP does not use collation-based matching; matching is based on code point values in the database character set. For example, the comparison X REGEXP '[A-C]', for the single character X, is equivalent to CAST(X AS BINARY) >= CAST(A AS BINARY) AND CAST(X AS BINARY).REGEXP supports the common meta-characters and sub-classes familiar to programmers, and also supports special escaped characters such as "\s" for a space, or "\r" for carriage return, and look-ahead and look-behind assertions. Here is the same example for validating postal codes, but this time using REGEXP:

SELECT *
FROM Customers
WHERE PostalCode NOT REGEXP '([0-9]{5})|([0-9]{5}-[0-9]{4})|([A-Z][0-9][A-Z]\s[0-9][A-Z][0-9])'

Finally, note that the SQL Anywhere query optimizer will automatically optimize REGEXP and SIMILAR TO predicates - as it does for LIKE predicates - to be used as sargable predicates for index scans, depending on the specific pattern.