Channel: Baeldung

How to Use Regular Expressions to Replace Tokens in Strings


1. Overview

When we need to find or replace values in a string in Java, we usually use regular expressions. These allow us to determine if some or all of a string matches a pattern. We can easily apply the same replacement to multiple tokens in a string with the replaceAll method found in both Matcher and String.

In this tutorial, we'll explore how to apply a different replacement for each token found in a string. This will make it easy for us to satisfy use cases like escaping certain characters or replacing placeholder values.

We'll also look at a few tricks for tuning our regular expressions to identify tokens correctly.

2. Individually Processing Matches

Before we can build our token-by-token replacement algorithm, we need to understand the Java API around regular expressions. Let's solve a tricky matching problem using capturing and non-capturing groups.

2.1. Title Case Example

Let's imagine we want to build an algorithm to process all the title words in a string. These words start with one uppercase character and then either end or continue with only lowercase characters.

Our input might be:

"First 3 Capital Words! then 10 TLAs, I Found"

From the definition of a title word, this contains the matches:

  • First
  • Capital
  • Words
  • I
  • Found

And a regular expression to recognize this pattern would be:

"(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)"

To understand this, let's break it down into its component parts. We'll start in the middle:

[A-Z]

will recognize a single uppercase letter.

We allow single-character words or words followed by lowercase, so:

[a-z]*

recognizes zero or more lowercase letters.

In some cases, the above two character classes would be enough to recognize our tokens. Unfortunately, in our example text, there is a word that starts with multiple capital letters. Therefore, we need to express that the single capital letter we find must be the first to appear after non-letters.

Similarly, as we allow a single capital letter word, we need to express that the single capital letter we find must not be the first of a multi-capital letter word.

The expression [^A-Za-z] matches any single character that is not a letter. We have put one of these at the start of the expression in a non-capturing look-behind group:

(?<=^|[^A-Za-z])

The non-capturing group, starting with (?<=, does a look-behind to ensure the match appears at the correct boundary. Its counterpart at the end, the look-ahead starting with (?=, does the same job for the characters that follow.

However, words may touch the very start or end of the string, and we need to account for that. This is why we've added ^| to the first group, making it mean “the start of the string or any non-letter character”, and |$ to the end of the last group, allowing the end of the string to act as a boundary.

Look-behind and look-ahead groups are zero-width: they check the surrounding characters without consuming them, so those characters do not appear in the match when we use find.

We should note that even a simple use case like this can have many edge cases, so it's important to test our regular expressions. For this, we can write unit tests, use our IDE's built-in tools, or use an online tool like Regexr.
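To see the look-arounds at work, here is a minimal sketch that compiles the pattern from this section and returns the first title word it finds. The class name LookaroundDemo and the method firstTitleWord are hypothetical names chosen for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // The pattern from this section, compiled once for reuse.
    static final Pattern TITLE_CASE_PATTERN =
      Pattern.compile("(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)");

    // Hypothetical helper: returns the first title word, or null if there is none.
    static String firstTitleWord(String input) {
        Matcher matcher = TITLE_CASE_PATTERN.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The brackets satisfy the look-arounds but are not part of the match.
        System.out.println(firstTitleWord("(Hello)")); // prints Hello
        // No capital letter in "TLAs" has non-letter boundaries on both sides.
        System.out.println(firstTitleWord("TLAs"));    // prints null
    }
}
```

The second call shows why both boundaries matter: without them, the T or the A of "TLAs" could be picked up as a single-letter title word.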

2.2. Testing Our Example

With our example text in a constant called EXAMPLE_INPUT and our regular expression in a Pattern called TITLE_CASE_PATTERN, let's use find on the Matcher class to extract all of our matches in a unit test:

Matcher matcher = TITLE_CASE_PATTERN.matcher(EXAMPLE_INPUT);
List<String> matches = new ArrayList<>();
while (matcher.find()) {
    matches.add(matcher.group(1));
}

assertThat(matches)
  .containsExactly("First", "Capital", "Words", "I", "Found");

Here we use the matcher method on Pattern to produce a Matcher. Then we call find in a loop, which iterates over all the matches until it stops returning true.

Each time find returns true, the Matcher object's state is set to represent the current match. We can inspect the whole match with group(0) or inspect particular capturing groups with their 1-based index. In this case, there is a capturing group around the piece we want, so we use group(1) to add the match to our list.
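With our title case pattern, group(0) and group(1) happen to return the same text, because the look-arounds consume nothing. As a hedged illustration of the difference, the sketch below uses a hypothetical variant of the pattern, not the one from this article, whose leading boundary is a plain (?:...) group that does consume a character. The class and method names are also assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical variant for illustration only: the boundary is a plain
    // non-capturing group (?:...), so the boundary character is consumed.
    static final Pattern CONSUMING_PATTERN =
      Pattern.compile("(?:^|[^A-Za-z])([A-Z][a-z]*)");

    // Returns {group(0), group(1)} for the first match, or null if none is found.
    static String[] firstMatchGroups(String input) {
        Matcher matcher = CONSUMING_PATTERN.matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(0), matcher.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("3 Capital");
        System.out.println("[" + groups[0] + "]"); // prints [ Capital]
        System.out.println("[" + groups[1] + "]"); // prints [Capital]
    }
}
```

Here group(0) includes the leading space that the boundary group consumed, while group(1) holds only the word. This is why reading the capturing group is the safer habit when a pattern has surrounding context.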

2.3. Inspecting Matcher a Bit More

We've so far managed to find the words we want to process.

However, if each of these words were a token that we wanted to replace, we would need to have more information about the match in order to build the resulting string. Let's look at some other properties of Matcher that might help us:

while (matcher.find()) {
    System.out.println("Match: " + matcher.group(0));
    System.out.println("Start: " + matcher.start());
    System.out.println("End: " + matcher.end());
}

This code will show us where each match is. It also shows us the group(0) match, which is the entire matched text:

Match: First
Start: 0
End: 5
Match: Capital
Start: 8
End: 15
Match: Words
Start: 16
End: 21
Match: I
Start: 37
End: 38
... more

Here we can see that each match contains only the words we're expecting. The start property shows the zero-based index of the match within the string. The end shows the index of the character just after. This means we could use substring(start, end) to extract each match from the original string, as the second argument of Java's substring is an end index rather than a length. This is essentially how the group method does that for us.
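As a quick check of the relationship between start, end, and the matched text, here is a small sketch that rebuilds each match with substring. The class name MatchPositions and the method describeMatches are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositions {
    static final Pattern TITLE_CASE_PATTERN =
      Pattern.compile("(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)");

    // Describes every match as "start-end:text", extracting the text
    // with substring rather than group to show they agree.
    static List<String> describeMatches(String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = TITLE_CASE_PATTERN.matcher(input);
        while (matcher.find()) {
            String text = input.substring(matcher.start(), matcher.end());
            results.add(matcher.start() + "-" + matcher.end() + ":" + text);
        }
        return results;
    }

    public static void main(String[] args) {
        describeMatches("First 3 Capital Words! then 10 TLAs, I Found")
          .forEach(System.out::println);
    }
}
```

For our example input, the last entry is 39-44:Found, where 44 is also the length of the string.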

Now that we can use find to iterate over matches, let's process our tokens.

3. Replacing Matches One by One

Let's continue our example by using our algorithm to replace each title word in the original string with its lowercase equivalent. This means our test string will be converted to:

"first 3 capital words! then 10 TLAs, i found"

The string-based replaceAll methods can't do this for us, because they apply one replacement template to every match, so we need to construct an algorithm.

3.1. The Replacement Algorithm

Here is the pseudo-code for the algorithm:

  • Start with an empty output string
  • For each match:
    • Add to the output anything that came before the match and after any previous match
    • Process this match and add that to the output
  • Continue until all matches are processed
  • Add anything left after the last match to the output

We should note that the aim of this algorithm is to find all non-matched areas and add them to the output, as well as adding the processed matches.

3.2. The Token Replacer in Java

We want to convert each word to lowercase, so we can write a simple conversion method:

private static String convert(String token) {
    return token.toLowerCase();
}

Now we can write the algorithm to iterate over the matches. This can use a StringBuilder for the output:

int lastIndex = 0;
StringBuilder output = new StringBuilder();
Matcher matcher = TITLE_CASE_PATTERN.matcher(original);
while (matcher.find()) {
    output.append(original, lastIndex, matcher.start())
      .append(convert(matcher.group(1)));

    lastIndex = matcher.end();
}
if (lastIndex < original.length()) {
    output.append(original, lastIndex, original.length());
}
return output.toString();

We should note that StringBuilder provides a handy version of append that can extract substrings. This works well with the end property of Matcher to let us pick up all non-matched characters since the last match.

4. Generalizing the Algorithm

Now that we've solved the problem of replacing some specific tokens, why don't we convert the code into a form where it can be used for the general case? The only thing that varies from one implementation to the next is the regular expression to use, and the logic for converting each match into its replacement.

4.1. Use a Function and Pattern Input

We can use a Java Function<Matcher, String> object to allow the caller to provide the logic to process each match. And we can take an input called tokenPattern to find all the tokens:

// same as before, except the Matcher now comes from tokenPattern.matcher(original)
while (matcher.find()) {
    output.append(original, lastIndex, matcher.start())
      .append(converter.apply(matcher));

    lastIndex = matcher.end();
}
// same as before

Here, neither the regular expression nor the conversion logic is hard-coded. The caller provides both the tokenPattern and the converter function, which is applied to each match within the find loop.
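Putting the pieces together, the whole method might look like the sketch below. The method name replaceTokens and its argument order are taken from the tests in this article, while the enclosing class name TokenReplacer is an assumption:

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenReplacer {
    // Replaces every match of tokenPattern in original with the string
    // produced by converter, copying all non-matched text unchanged.
    static String replaceTokens(String original, Pattern tokenPattern,
      Function<Matcher, String> converter) {
        int lastIndex = 0;
        StringBuilder output = new StringBuilder();
        Matcher matcher = tokenPattern.matcher(original);
        while (matcher.find()) {
            // Text between the previous match and this one, then the replacement.
            output.append(original, lastIndex, matcher.start())
              .append(converter.apply(matcher));

            lastIndex = matcher.end();
        }
        // Whatever follows the last match (the whole input if nothing matched).
        output.append(original, lastIndex, original.length());
        return output.toString();
    }
}
```

Passing the Matcher itself to the converter, rather than just the matched string, lets the caller pick whichever group it needs.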

4.2. Testing the General Version

Let's see if the general method works as well as the original:

assertThat(replaceTokens("First 3 Capital Words! then 10 TLAs, I Found",
  TITLE_CASE_PATTERN,
  match -> match.group(1).toLowerCase()))
  .isEqualTo("first 3 capital words! then 10 TLAs, i found");

Here we see that calling the code is straightforward. The conversion function is easy to express as a lambda. And the test passes.

Now we have a token replacer, so let's try some other use cases.

5. Some Use Cases

5.1. Escaping Special Characters

Let's imagine we wanted to use the regular expression escape character \ to manually quote each character of a regular expression rather than use the Pattern.quote method. Perhaps we are quoting a string as part of creating a regular expression to pass to another library or service, so block quoting the expression won't suffice.

If we can express the pattern that means “a regular expression character”, it's easy to use our algorithm to escape them all:

Pattern regexCharacters = Pattern.compile("[<(\\[{\\\\^\\-=$!|\\]})?*+.>]");

assertThat(replaceTokens("A regex character like [",
  regexCharacters,
  match -> "\\" + match.group()))
  .isEqualTo("A regex character like \\[");

For each match, we prefix the \ character. As \ is a special character in Java strings, it's escaped with another \.

Indeed, this example is covered in extra \ characters as the character class in the pattern for regexCharacters has to quote many of the special characters. This shows the regular expression parser that we're using them to mean their literals, not as regular expression syntax.
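To check that the escaped text really behaves as a literal, we can compile it and match it against the original string. The following is a minimal, self-contained sketch with the replacement loop inlined. The names RegexEscaper and escape are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEscaper {
    // The character class from this section.
    static final Pattern REGEX_CHARACTERS =
      Pattern.compile("[<(\\[{\\\\^\\-=$!|\\]})?*+.>]");

    // Prefixes every regular expression character in literal with a backslash.
    static String escape(String literal) {
        StringBuilder output = new StringBuilder();
        int lastIndex = 0;
        Matcher matcher = REGEX_CHARACTERS.matcher(literal);
        while (matcher.find()) {
            output.append(literal, lastIndex, matcher.start())
              .append('\\').append(matcher.group());
            lastIndex = matcher.end();
        }
        return output.append(literal, lastIndex, literal.length()).toString();
    }

    public static void main(String[] args) {
        String escaped = escape("1+1=2?");
        System.out.println(escaped); // prints 1\+1\=2\?
        // The escaped text, used as a pattern, matches the original literally.
        System.out.println(Pattern.compile(escaped).matcher("1+1=2?").matches()); // prints true
    }
}
```

Unlike a block-quoted expression, the result contains only ordinary backslash escapes, which is useful when the consumer is not Java's own regular expression engine.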

5.2. Replacing Placeholders

A common way to express a placeholder is to use a syntax like ${name}. Let's consider a use case where the template “Hi ${name} at ${company}” needs to be populated from a map called placeholderValues:

Map<String, String> placeholderValues = new HashMap<>();
placeholderValues.put("name", "Bill");
placeholderValues.put("company", "Baeldung");

All we need is a good regular expression to find the ${…} tokens:

"\\$\\{(?<placeholder>[A-Za-z0-9-_]+)}"

is one option. It has to quote the $ and the initial curly brace as they would otherwise be treated as regular expression syntax.

At the heart of this pattern is a capturing group for the name of the placeholder. We've used a character class that allows alphanumeric characters, dashes, and underscores, which should fit most use cases.

However, to make the code more readable, we've named this capturing group placeholder. Let's see how to use that named capturing group:

assertThat(replaceTokens("Hi ${name} at ${company}",
  Pattern.compile("\\$\\{(?<placeholder>[A-Za-z0-9-_]+)}"),
  match -> placeholderValues.get(match.group("placeholder"))))
  .isEqualTo("Hi Bill at Baeldung");

Here we can see that getting the value of the named group out of the Matcher just involves using group with the name as the input, rather than the number.
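One edge case worth considering is a placeholder with no entry in the map. With a plain get, the converter returns null, and StringBuilder appends that as the text "null". The sketch below shows one possible policy, which is an assumption rather than part of the original example: leave unknown placeholders untouched. The names TemplateFiller and fill are hypothetical, and the character class is written as [A-Za-z0-9_-] with the closing brace escaped, a stylistic variant intended to match the same tokens:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateFiller {
    static final Pattern PLACEHOLDER_PATTERN =
      Pattern.compile("\\$\\{(?<placeholder>[A-Za-z0-9_-]+)\\}");

    // Fills ${name} style placeholders from values.
    static String fill(String template, Map<String, String> values) {
        StringBuilder output = new StringBuilder();
        int lastIndex = 0;
        Matcher matcher = PLACEHOLDER_PATTERN.matcher(template);
        while (matcher.find()) {
            String name = matcher.group("placeholder");
            // Assumed policy: a placeholder without a value is kept verbatim.
            String replacement = values.getOrDefault(name, matcher.group());
            output.append(template, lastIndex, matcher.start())
              .append(replacement);
            lastIndex = matcher.end();
        }
        return output.append(template, lastIndex, template.length()).toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name", "Bill");
        System.out.println(fill("Hi ${name} at ${company}", values));
        // prints Hi Bill at ${company}
    }
}
```

Other policies, such as throwing an exception or substituting an empty string, fit in the same place in the loop.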

6. Conclusion

In this article, we looked at how to use powerful regular expressions to find tokens in our strings. We learned how the find method works with Matcher to show us the matches.

Then we created and generalized an algorithm to allow us to do token-by-token replacement.

Finally, we looked at a couple of common use-cases for escaping characters and populating templates.

As always, the code examples can be found over on GitHub.

