Channel: Baeldung

Extracting Text Between Parentheses in Java

1. Overview

When we work with text in Java, we often need to extract the content enclosed in parentheses.

In this tutorial, we’ll explore different methods to achieve this, focusing on regular expressions and the popular Apache Commons Lang 3 library.

2. Introduction to the Problem

When our input contains only one pair of parentheses, we can extract the content between them using two replaceAll() method calls:

String myString = "a b c (d e f) x y z";
 
String result = myString.replaceAll(".*[(]", "")
  .replaceAll("[)].*", "");
assertEquals("d e f", result);

As the example above shows, the first replaceAll() removes everything up to and including the ‘(’ character. The second replaceAll() removes everything from ‘)’ to the end of the String. Thus, what remains is the text between ‘(’ and ‘)’.

However, this approach won’t work if our input has multiple “(…)” pairs. For example, let’s say we have another input:

static final String INPUT = "a (b c) d (e f) x (y z)";

There are three pairs of parentheses in INPUT. Therefore, we expect to see extracted values in the following String List:

static final List<String> EXPECTED = List.of("b c", "e f", "y z");
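To see why the two-step replaceAll() approach falls short here, we can run it against both inputs. The following is only a minimal sketch; the class and method names are hypothetical and aren't part of the original examples:

```java
class ReplaceAllLimitation {

    // The same two-step approach used for the single-pair input
    static String extractWithReplaceAll(String input) {
        return input.replaceAll(".*[(]", "") // greedy: removes everything up to the LAST '('
          .replaceAll("[)].*", "");          // removes from the first remaining ')' to the end
    }

    public static void main(String[] args) {
        // One pair of parentheses: works as intended
        System.out.println(extractWithReplaceAll("a b c (d e f) x y z"));     // d e f

        // Three pairs: only the content of the last pair survives
        System.out.println(extractWithReplaceAll("a (b c) d (e f) x (y z)")); // y z
    }
}
```

Because ".*" is greedy, the first call consumes everything up to the last ‘(’, so the earlier pairs are lost before we get a chance to read them.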

Next, let’s see how to extract these String values from the INPUT String.

For simplicity, we’ll leverage unit test assertions to verify whether each approach works as expected.

3. Greedy vs Non-greedy Regex Pattern

Regular expressions (regex) provide a powerful and flexible method for pattern matching and text extraction. So, let’s use regex to do the job.

Some of us may come up with this pattern to extract text between ‘(’ and ‘)’: “[(](.*)[)]”. This pattern is pretty straightforward:

  • [(] and [)] match the literal ‘(’ and ‘)’ characters
  • (.*) is a capturing group that matches anything between ‘(’ and ‘)’

Next, let’s check if this pattern solves the problem correctly:

String myRegex = "[(](.*)[)]";
Matcher myMatcher = Pattern.compile(myRegex)
  .matcher(INPUT);
List<String> myResult = new ArrayList<>();
while (myMatcher.find()) {
    myResult.add(myMatcher.group(1));
}
assertEquals(List.of("b c) d (e f) x (y z"), myResult);

As the above test shows, using this pattern, we only have one String element in the result List: “b c) d (e f) x (y z”. This is because the ‘*’ quantifier applies a greedy match. In other words, “[(](.*)[)]” matches the first ‘(‘ in the input and then everything up to the last ‘)’ character, even if the content includes other “(…)” pairs.

This isn’t what we expected. To solve the problem, we need non-greedy matching, so that the pattern stops at the nearest ‘)’ and matches each “(…)” pair separately.

To make the ‘*’ quantifier non-greedy, we can add a question mark ‘?’ after it: “[(](.*?)[)]”.

Next, let’s test if this pattern can extract the expected String elements:

String regex = "[(](.*?)[)]";
List<String> result = new ArrayList<>();
Matcher matcher = Pattern.compile(regex)
  .matcher(INPUT);
while (matcher.find()) {
    result.add(matcher.group(1));
}
assertEquals(EXPECTED, result);

As we can see, the non-greedy regex pattern “[(](.*?)[)]” does the job.
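If we need this extraction in more than one place, we could wrap it in a small helper that compiles the pattern once. The sketch below is one possible shape, with hypothetical class and method names. It also shows that the backslash-escaped spelling “\\((.*?)\\)” behaves the same as the character-class spelling “[(](.*?)[)]”:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonGreedyExtractor {

    // Compiled once and reused across calls
    private static final Pattern BRACKET_STYLE = Pattern.compile("[(](.*?)[)]");
    // "\\(" and "\\)" are the escaped equivalents of "[(]" and "[)]"
    private static final Pattern ESCAPED_STYLE = Pattern.compile("\\((.*?)\\)");

    static List<String> extractBracketStyle(String input) {
        return extract(BRACKET_STYLE, input);
    }

    static List<String> extractEscapedStyle(String input) {
        return extract(ESCAPED_STYLE, input);
    }

    private static List<String> extract(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1)); // group 1 holds the text between the parentheses
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "a (b c) d (e f) x (y z)";
        System.out.println(extractBracketStyle(input)); // [b c, e f, y z]
        System.out.println(extractEscapedStyle(input)); // [b c, e f, y z]
    }
}
```

Which spelling to use is mostly a matter of taste; the character class avoids the doubled backslashes that Java string literals require.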

4. Using the Negated Character Class

Apart from using the non-greedy quantifier (*?), we can also solve the problem using regex’s negated character class:

String regex = "[(]([^)]*)";
List<String> result = new ArrayList<>();
Matcher matcher = Pattern.compile(regex)
  .matcher(INPUT);
while (matcher.find()) {
    result.add(matcher.group(1));
}
assertEquals(EXPECTED, result);

As the code shows, our regex pattern to extract text between parentheses is “[(]([^)]*)”. Let’s break it down to understand how it works:

  • [(] – Matches the literal ‘(’ character
  • [^)]* – Matches any sequence of characters other than ‘)’. As it follows [(], it only matches characters inside the parentheses.
  • ([^)]*) – We create a capturing group to extract the text between parentheses without including any opening or closing parenthesis.
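The negated character class and the non-greedy pattern agree on our INPUT, but they can differ on other inputs. The sketch below, using hypothetical class and method names, compares them on two edge cases: an unclosed parenthesis and content that spans a line break. By default, ‘.’ doesn’t match line terminators, while [^)] does:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassVsNonGreedy {

    static final Pattern NON_GREEDY = Pattern.compile("[(](.*?)[)]");
    static final Pattern NEGATED = Pattern.compile("[(]([^)]*)");

    static List<String> extract(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Unclosed parenthesis: the non-greedy pattern requires a ')', the negated class doesn't
        String unclosed = "a (b c";
        System.out.println(extract(NON_GREEDY, unclosed)); // []
        System.out.println(extract(NEGATED, unclosed));    // [b c]

        // Line break inside the parentheses: '.' stops at '\n' unless Pattern.DOTALL is set
        String multiLine = "a (b\nc) d";
        System.out.println(extract(NON_GREEDY, multiLine).size()); // 0
        System.out.println(extract(NEGATED, multiLine).size());    // 1
    }
}
```

If unclosed parentheses should be ignored, one option is to append [)] to the negated pattern, giving “[(]([^)]*)[)]”.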

Alternatively, we can replace the “[(]” character class with a positive lookbehind assertion “(?<=[(])“. Lookbehind assertions allow us to match a group of characters only if a specified pattern precedes them. In this example, (?<=[(]) asserts that what immediately precedes the current position is an opening parenthesis ‘(‘:

String regex = "(?<=[(])[^)]*";
List<String> result = new ArrayList<>();
Matcher matcher = Pattern.compile(regex)
  .matcher(INPUT);
while (matcher.find()) {
    result.add(matcher.group());
}
assertEquals(EXPECTED, result);

It’s worth noting that since a lookbehind is a zero-width assertion, the ‘(’ character isn’t part of the match. Thus, we don’t need a capturing group, and matcher.group() returns the expected text directly.
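Since the whole match is already the text we want, the lookbehind pattern also pairs well with Matcher.results(), which is available since Java 9 and returns a stream of matches. The following is a minimal sketch with hypothetical class and method names:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LookbehindWithStreams {

    private static final Pattern INSIDE_PARENS = Pattern.compile("(?<=[(])[^)]*");

    static List<String> extract(String input) {
        return INSIDE_PARENS.matcher(input)
          .results()               // Stream<MatchResult>, requires Java 9 or later
          .map(MatchResult::group) // the whole match; '(' is excluded by the lookbehind
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(extract("a (b c) d (e f) x (y z)")); // [b c, e f, y z]
    }
}
```

This replaces the explicit while loop with a stream pipeline; the matching behavior is the same as in the loop-based version.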

5. Using StringUtils From Apache Commons Lang 3

Apache Commons Lang 3 is a widely used library. Its StringUtils class offers a rich set of convenient methods for manipulating String values.

If we have only one pair of parentheses in the input, the StringUtils.substringBetween() method allows us to extract the String between them straightforwardly:

String myString = "a b c (d e f) x y z";
 
String result = StringUtils.substringBetween(myString, "(", ")");
assertEquals("d e f", result);

When the input has multiple pairs of parentheses, StringUtils.substringsBetween() returns the text inside each pair of parentheses as an array:

String[] results = StringUtils.substringsBetween(INPUT, "(", ")");
assertArrayEquals(EXPECTED.toArray(), results);

If we’re using the Apache Commons Lang 3 library already in our project, these two methods are good choices for this task.
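If adding the dependency isn’t an option, a similar result can be approximated with plain indexOf() calls. The sketch below is our own illustration with hypothetical names, not the Commons Lang implementation. Unlike substringsBetween(), which is documented to return null when nothing matches, it returns an empty List:

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfExtractor {

    static List<String> substringsBetweenParens(String input) {
        List<String> result = new ArrayList<>();
        int open = input.indexOf('(');
        while (open >= 0) {
            int close = input.indexOf(')', open + 1);
            if (close < 0) {
                break; // unclosed parenthesis: stop without adding a partial value
            }
            result.add(input.substring(open + 1, close));
            open = input.indexOf('(', close + 1); // continue after the closing parenthesis
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(substringsBetweenParens("a (b c) d (e f) x (y z)")); // [b c, e f, y z]
        System.out.println(substringsBetweenParens("a (b c"));                  // []
    }
}
```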

6. Conclusion

In this article, we’ve explored different ways to extract text between parentheses in Java: a non-greedy regex pattern, a negated character class, a lookbehind assertion, and the StringUtils class from Apache Commons Lang 3.

As always, the complete source code for the examples is available over on GitHub.

       
