Introduction

Regular expression libraries for Java have been around for a very long time, but it wasn’t until JDK 1.4 that support for regular expressions was added to the Java platform’s class library in the package java.util.regex. In this blog entry I give an introduction to the regular expression classes and their use. I assume a basic knowledge of regular expressions. If you’re new to them, you might want to read Eddy De Clercq’s three-part regular expression tutorial (part 1, part 2, part 3) first.

The java.util.regex package

As Java packages go, java.util.regex is pretty small: it contains two classes (Pattern and Matcher) and one exception (PatternSyntaxException). The Pattern class represents a compiled regular expression, and the Matcher class is used to match a string against a Pattern object. A PatternSyntaxException is thrown when syntax errors in a regular expression prevent it from being compiled into a Pattern object.

The Pattern class

In order to match against a regular expression, you need to compile it. This is done by calling compile(String regex), which is a static method in the Pattern class. The regular expression in the following example matches account numbers that consist of four digits followed by two lowercase letters followed by three digits:

Pattern validAccount = Pattern.compile("^\\d{4}[a-z]{2}\\d{3}$");

The compile method throws the unchecked PatternSyntaxException if you pass it a regular expression with illegal syntax. However, unless the expressions are provided at runtime, you shouldn’t really need to catch this exception. Also, notice how the regular expression character class \d, which represents a single digit, must be escaped with yet another backslash in the Java code, giving \\d. Otherwise, the compiler will complain that it doesn’t know about the \d escape sequence.
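When the expression does come from outside the program (user input, a configuration file), catching the exception is worthwhile. The following is a minimal sketch of that case; the class name RuntimePatternDemo and the helper compileOrNull are made up for illustration and are not part of java.util.regex:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RuntimePatternDemo {

    // Hypothetical helper: returns the compiled pattern, or null if the
    // regular expression has illegal syntax.
    static Pattern compileOrNull(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // getDescription() explains the error; getIndex() is its
            // approximate position in the pattern (or -1 if unknown).
            System.out.println("Bad pattern: " + e.getDescription()
                    + " near index " + e.getIndex());
            return null;
        }
    }

    public static void main(String[] args) {
        // Legal syntax: a Pattern object is returned.
        System.out.println(compileOrNull("^\\d{4}[a-z]{2}\\d{3}$") != null);
        // Illegal syntax (unclosed character class): null is returned.
        System.out.println(compileOrNull("[a-z") != null);
    }
}
```

Returning null is only one possible design; rethrowing a more descriptive exception would work just as well.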

The Matcher class

We now have a Pattern object. To match this pattern against a string, we need to create a Matcher object. This is done by calling matcher(CharSequence input) on the Pattern object (CharSequence is an interface implemented by the String, StringBuffer and CharBuffer classes). To determine whether or not the string matches the pattern, we call the matches() method, which returns a boolean value. In the following example, an array of account numbers is matched against the valid account number pattern:

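// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String[] accounts = {"123ab456", "1234cd56", "1234abc567", "1234ab567"};
// Compile once and reuse the Pattern for every input
Pattern validAccount = Pattern.compile("^\\d{4}[a-z]{2}\\d{3}$");
for (int i = 0; i < accounts.length; i++) {
    Matcher m = validAccount.matcher(accounts[i]);
    // matches() succeeds only if the entire string fits the pattern
    if (m.matches()) {
        System.out.println("Match: " + accounts[i]);
    }
}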

The only valid account number turns out to be 1234ab567, which is what we’d expect.

Capturing groups

In the previous example we called the matches() method to determine whether or not a string matched a particular pattern. The Pattern and Matcher classes can do more than that, though. We can, for instance, match and extract substrings of the input by using so-called capturing groups in our regular expressions. To demonstrate, let’s change the rules for valid account numbers a bit. We now require valid account numbers to consist of between two and five digits followed by two lowercase letters followed by between one and three digits. Given an array of account numbers, we wish to extract the two lowercase letters from the valid account numbers:

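// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String[] accounts = {"12356ab456", "1234cd", "1234abc567", "1234ab5678", "12ef3"};
// The parentheses around [a-z]{2} form capturing group 1
Pattern validAccount = Pattern.compile("^\\d{2,5}([a-z]{2})\\d{1,3}$");
for (int i = 0; i < accounts.length; i++) {
    Matcher m = validAccount.matcher(accounts[i]);
    // find() looks for a matching subsequence; the ^ and $ anchors make it cover the whole string here
    if (m.find()) {
        System.out.println("Account number: " + accounts[i]);
        System.out.println("Letters: " + m.group(1));
    }
}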

The capturing groups are surrounded by parentheses in the regular expression and are numbered from left to right, starting at one. When the pattern has been matched to a subsequence of the input, i.e. if calling the find() method on the Matcher object returns true, we can access the String contents of the capturing groups by calling the group(int group) method.
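To illustrate the left-to-right numbering, here is a small sketch that wraps all three parts of the account number in capturing groups and reads them back one by one. The class AccountParts and its parts method are hypothetical names chosen for this example; groupCount() and group(int) are the actual Matcher methods:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccountParts {

    // Group 1: leading digits, group 2: letters, group 3: trailing digits.
    private static final Pattern ACCOUNT =
            Pattern.compile("^(\\d{2,5})([a-z]{2})(\\d{1,3})$");

    // Hypothetical helper: returns the three captured parts,
    // or null if the account number is not valid.
    static String[] parts(String account) {
        Matcher m = ACCOUNT.matcher(account);
        if (!m.find()) {
            return null;
        }
        // groupCount() is the number of capturing groups in the pattern;
        // group 0 (the whole match) is not included in that count.
        String[] result = new String[m.groupCount()];
        for (int g = 1; g <= m.groupCount(); g++) {
            result[g - 1] = m.group(g);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] p = parts("12ef3");
        System.out.println("Digits: " + p[0] + ", letters: " + p[1] + ", digits: " + p[2]);
    }
}
```

Calling group(0), or group() without arguments, returns the entire matched subsequence rather than a single group.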

Convenience methods in the String class

If all you want to do is match a single string s against a single pattern p, you can skip creating Pattern and Matcher objects explicitly. To do this, call matches(String regex), a new convenience method added to the venerable String class, as follows: s.matches(p). The String method forwards the call to matches(String regex, CharSequence input), which is a static convenience method in the Pattern class. Here’s a list of the other regular expression convenience methods added to the String class in JDK 1.4:

  • replaceAll(String regex, String replacement)
  • replaceFirst(String regex, String replacement)
  • split(String regex)
  • split(String regex, int limit)
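
The methods listed above can be tried out with a few lines of code. This is only a sketch; the class StringRegexDemo and its helper methods are invented for the example, while the String methods they call are the ones from the list:

```java
class StringRegexDemo {

    // s.matches(p): true only if the whole string matches the pattern.
    static boolean isValidAccount(String s) {
        return s.matches("^\\d{4}[a-z]{2}\\d{3}$");
    }

    // replaceAll: replaces every digit with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // replaceFirst: replaces only the first run of digits.
    static String maskLeadingDigits(String s) {
        return s.replaceFirst("\\d+", "N");
    }

    // split: breaks the string at commas followed by optional whitespace.
    static String[] splitList(String s) {
        return s.split(",\\s*");
    }

    // split with a limit: the resulting array has at most 'limit' entries,
    // and the last entry holds the unsplit remainder.
    static String[] splitFirst(String s) {
        return s.split(",", 2);
    }

    public static void main(String[] args) {
        System.out.println(isValidAccount("1234ab567"));
        System.out.println(maskDigits("1234ab567"));
        System.out.println(maskLeadingDigits("1234ab567"));
        System.out.println(splitList("1234ab567, 12ef3").length);
        System.out.println(splitFirst("a,b,c")[1]);
    }
}
```

Each of these calls compiles the regular expression again, so for a pattern that is used repeatedly it is probably better to compile a Pattern object once and reuse it.
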

Final thoughts

Regular expressions are a powerful addition to the Java class library. Keep in mind, though, that a complicated regular expression can become nearly impossible to read due to the extremely terse syntax. It’s always a good idea to document the way your regular expressions are constructed; you’ll be thankful when you return to the code six months later.
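One way to keep that documentation next to the expression itself is the Pattern.COMMENTS flag, which makes the regex engine ignore whitespace and treat everything from # to the end of a line as a comment. The sketch below applies it to the account number pattern from the capturing groups section; the class name DocumentedPattern and the helper letters are assumptions made for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocumentedPattern {

    // With Pattern.COMMENTS, whitespace inside the pattern is ignored and
    // '#' starts a comment that runs to the end of the line.
    static final Pattern VALID_ACCOUNT = Pattern.compile(
            "^\\d{2,5}     # two to five digits\n"
          + "([a-z]{2})    # group 1: two lowercase letters\n"
          + "\\d{1,3}$     # one to three digits\n",
            Pattern.COMMENTS);

    // Hypothetical helper: returns the two letters of a valid account
    // number, or null if the account number is not valid.
    static String letters(String account) {
        Matcher m = VALID_ACCOUNT.matcher(account);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(letters("12ef3"));
        System.out.println(letters("1234cd"));
    }
}
```

Ordinary Java comments above the compile call serve the same purpose; the flag simply lets the explanation sit beside each piece of the expression.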

2 Comments

    1. Morten Wittrock Post author
      Hi Daniel

      Interesting, indeed. By the way, the full source code of the Pattern and Matcher classes is available if anybody would like to inspect it. Download the Java source code from java.sun.com to get it. At 5000+ lines of code Pattern.java is not an easy read, though – but there’s some interesting stuff in there.

      Kind regards,

      Morten Wittrock
