Introduction to Apache Commons Text

1. Overview

Simply put, the Apache Commons Text library contains a number of useful utility methods for working with Strings, beyond what core Java offers.

In this quick introduction, we’ll see what Apache Commons Text is and what it is used for, and we'll walk through some practical examples of using the library.

2. Maven Dependency

Let’s start by adding the following Maven dependency to our pom.xml:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.1</version>
</dependency>

You can find the latest version of the library at the Maven Central Repository.

3. Package Overview

The root package org.apache.commons.text is divided into different sub-packages:

  • org.apache.commons.text.diff – diffs between Strings
  • org.apache.commons.text.similarity – similarities and distances between Strings
  • org.apache.commons.text.translate – translating text

Let’s look at what each package can be used for in more detail.

3. Handling Text

The org.apache.commons.text package contains multiple tools for working with Strings.

For instance, WordUtils has APIs capable of capitalizing the first letter of each word in a String, swapping the case of a String, and checking if a String contains all words in a given array.

Let’s see how we can capitalize the first letter of each word in a String:

@Test
public void whenCapitalized_thenCorrect() {
    String toBeCapitalized = "to be capitalized!";
    String result = WordUtils.capitalize(toBeCapitalized);
    
    assertEquals("To Be Capitalized!", result);
}

Here's how we can check if a String contains all the words in an array:

@Test
public void whenContainsWords_thenCorrect() {
    boolean containsWords = WordUtils
      .containsAllWords("String to search", "to", "search");
    
    assertTrue(containsWords);
}
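
To make the matching rule more concrete, here is a minimal plain-Java sketch of a whole-word check. It is not the library's implementation: the class WordMatchSketch is a hypothetical name, and the assumption (based on our reading of the library's documented behavior) is that each word must appear as a whole word rather than as a fragment of a longer one.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Apache Commons Text
final class WordMatchSketch {

    // Returns true only if every given word occurs in the text as a whole word
    static boolean containsAllWords(String text, String... words) {
        if (text == null || text.isEmpty() || words == null || words.length == 0) {
            return false;
        }
        for (String word : words) {
            if (word == null || word.trim().isEmpty()) {
                return false;
            }
            // \b anchors the match at word boundaries, so "sear" does not match "search"
            Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
            if (!wholeWord.matcher(text).find()) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, searching "String to search" for "to" and "search" succeeds, while searching it for the fragment "sear" fails.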

StrSubstitutor provides a convenient way of building Strings from templates (from version 1.3 of the library onward, it is deprecated in favor of StringSubstitutor):

@Test
public void whenSubstituted_thenCorrect() {
    Map<String, String> substitutes = new HashMap<>();
    substitutes.put("name", "John");
    substitutes.put("college", "University of Stanford");
    String templateString = "My name is ${name} and I am a student at the ${college}.";
    StrSubstitutor sub = new StrSubstitutor(substitutes);
    String result = sub.replace(templateString);
    
    assertEquals("My name is John and I am a student at the University of Stanford.", result);
}
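
To show what this kind of placeholder substitution involves, here is a rough plain-Java sketch built on java.util.regex. It is only an illustration, not how the library does it: TemplateSketch and substitute() are hypothetical names, and the choice to leave unknown placeholders untouched is an assumption of this sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Apache Commons Text
final class TemplateSketch {

    // Matches placeholders of the form ${key}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String substitute(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            // Assumption of this sketch: unknown keys keep their ${key} text
            String replacement = values.getOrDefault(key, matcher.group());
            // quoteReplacement() keeps '$' and '\' in values from being interpreted
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Calling substitute() with the template and map from the test above produces the same sentence about John.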

StrBuilder is an alternative to java.lang.StringBuilder that provides some features StringBuilder lacks (from version 1.3 of the library onward, it is deprecated in favor of TextStringBuilder).

For example, we can replace all occurrences of one String with another, or clear the builder's content without creating a new object.

Here’s a quick example that replaces part of a String:

@Test
public void whenReplaced_thenCorrect() {
    StrBuilder strBuilder = new StrBuilder("example StrBuilder!");
    strBuilder.replaceAll("example", "new");
   
    assertEquals(new StrBuilder("new StrBuilder!"), strBuilder);
}

To clear the content, we simply call the clear() method on the builder:

strBuilder.clear();
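
For comparison, here is a small sketch of what the same two operations look like with only java.lang.StringBuilder, which has no replaceAll() or clear() of its own. The helper class StringBuilderSketch is a hypothetical name introduced just for this illustration.

```java
// Hypothetical illustration class showing the manual work StrBuilder saves us
final class StringBuilderSketch {

    // Replaces every occurrence of search with replacement, in place
    static void replaceAll(StringBuilder sb, String search, String replacement) {
        if (search.isEmpty()) {
            return;
        }
        int index = sb.indexOf(search);
        while (index >= 0) {
            sb.replace(index, index + search.length(), replacement);
            // Continue after the inserted text so replacements are never rescanned
            index = sb.indexOf(search, index + replacement.length());
        }
    }

    // Empties the builder without allocating a new object
    static void clear(StringBuilder sb) {
        sb.setLength(0);
    }
}
```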

4. Calculating the Diff between Strings

The package org.apache.commons.text.diff implements Myers' algorithm for calculating the diff between two Strings.

The diff between two Strings is defined by a sequence of modifications that converts one String into the other.

There are three types of commands that can be used to convert one String into another: InsertCommand, KeepCommand, and DeleteCommand.

An EditScript object holds the script that should be run in order to convert one String into another. Let’s calculate the number of single-character modifications needed to convert one String into another:

@Test
public void whenEditScript_thenCorrect() {
    StringsComparator cmp = new StringsComparator("ABCFGH", "BCDEFG");
    EditScript<Character> script = cmp.getScript();
    // Counts insertions and deletions only: delete A and H, insert D and E
    int mod = script.getModifications();
    
    assertEquals(4, mod);
}
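
To see where the three command types come from, here is a plain-Java sketch that builds an edit script as a list of readable commands. Note that it uses a simple dynamic-programming table over the longest common subsequence rather than Myers' algorithm, and that EditScriptSketch and its methods are hypothetical names, not library APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; uses an LCS table, not Myers' algorithm
final class EditScriptSketch {

    // Returns commands such as "keep B", "delete A" and "insert D"
    static List<String> editScript(String from, String to) {
        int n = from.length();
        int m = to.length();
        // lcs[i][j] = length of the longest common subsequence of from[i..] and to[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = from.charAt(i) == to.charAt(j)
                    ? lcs[i + 1][j + 1] + 1
                    : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        List<String> script = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (from.charAt(i) == to.charAt(j)) {
                script.add("keep " + from.charAt(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                script.add("delete " + from.charAt(i++));
            } else {
                script.add("insert " + to.charAt(j++));
            }
        }
        // Whatever is left over can only be deleted or inserted
        while (i < n) {
            script.add("delete " + from.charAt(i++));
        }
        while (j < m) {
            script.add("insert " + to.charAt(j++));
        }
        return script;
    }

    // Keep commands change nothing, so only inserts and deletes are counted
    static long modifications(List<String> script) {
        return script.stream().filter(command -> !command.startsWith("keep ")).count();
    }
}
```

For "ABCFGH" and "BCDEFG", this sketch yields delete A, keep B, keep C, insert D, insert E, keep F, keep G, delete H, which is four modifications.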

5. Similarities and Distances between Strings

The org.apache.commons.text.similarity package contains algorithms useful for finding similarities and distances between Strings.

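For example, LongestCommonSubsequence can be used to find the length of the longest common subsequence of two Strings, that is, the largest number of characters that appear in both Strings in the same order (though not necessarily next to each other):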

@Test
public void whenCompare_thenCorrect() {
    LongestCommonSubsequence lcs = new LongestCommonSubsequence();
    int countLcs = lcs.apply("New York", "New Hampshire");
    
    // The longest common subsequence is "New r": 5 characters
    assertEquals(5, countLcs);
}

Similarly, LongestCommonSubsequenceDistance measures how far apart two Strings are. It returns the sum of both lengths minus twice the length of their longest common subsequence:

@Test
public void whenCalculateDistance_thenCorrect() {
    LongestCommonSubsequenceDistance lcsd = new LongestCommonSubsequenceDistance();
    int countLcsd = lcsd.apply("New York", "New Hampshire");
    
    // 8 + 13 - 2 * 5 = 11
    assertEquals(11, countLcsd);
}
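
The relationship between the two results can be sketched in plain Java. This is a minimal illustration under our own naming (LcsSketch, lcsLength() and lcsDistance() are hypothetical, not library APIs), using the textbook dynamic-programming approach rather than the library's code.

```java
// Hypothetical illustration class, not part of Apache Commons Text
final class LcsSketch {

    // Length of the longest common subsequence, computed row by row
    static int lcsLength(CharSequence left, CharSequence right) {
        int[] previous = new int[right.length() + 1];
        for (int i = 1; i <= left.length(); i++) {
            int[] current = new int[right.length() + 1];
            for (int j = 1; j <= right.length(); j++) {
                current[j] = left.charAt(i - 1) == right.charAt(j - 1)
                    ? previous[j - 1] + 1
                    : Math.max(previous[j], current[j - 1]);
            }
            previous = current;
        }
        return previous[right.length()];
    }

    // Characters outside the common subsequence, counted on both sides
    static int lcsDistance(CharSequence left, CharSequence right) {
        return left.length() + right.length() - 2 * lcsLength(left, right);
    }
}
```

For "New York" (8 characters) and "New Hampshire" (13 characters), the common subsequence has length 5, so the distance is 8 + 13 - 2 * 5 = 11.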

6. Text Translation

The org.apache.commons.text.translate package was initially created to allow us to customize the rules provided by StringEscapeUtils.

The package contains a set of classes that translate text into escaped representations such as Unicode escapes and numeric character references. We can also create our own customized translation routines.

Let’s see how we can convert a String to its equivalent Unicode text:

@Test
public void whenTranslate_thenCorrect() {
    UnicodeEscaper ue = UnicodeEscaper.above(0);
    String result = ue.translate("ABCD");
    
    assertEquals("\\u0041\\u0042\\u0043\\u0044", result);
}

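Here, the value we pass to the above() method is a code point threshold, not a position in the String: only characters whose code point is greater than that value are escaped. Passing 0 therefore escapes every character of our input.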
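
To illustrate how a threshold-based escaper behaves, here is a simplified plain-Java sketch. It is not the library's code: UnicodeEscapeSketch and escapeAbove() are hypothetical names, and for brevity the sketch works on UTF-16 char values, so it does not treat supplementary characters specially.

```java
// Hypothetical illustration class, not part of Apache Commons Text
final class UnicodeEscapeSketch {

    // Escapes every char whose value is greater than the threshold as \\uXXXX
    static String escapeAbove(String input, int threshold) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > threshold) {
                // Four uppercase hex digits, zero-padded
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With a threshold of 0, every character of "ABCD" is escaped. With a threshold of 127, plain ASCII text passes through unchanged and only non-ASCII characters are escaped.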

LookupTranslator enables us to define our own lookup table that maps character sequences to replacement values, and then translate any text according to that table.
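
The idea of a lookup table can be sketched in plain Java as follows. This is only an illustration of table-driven translation: LookupSketch and translate() are hypothetical names, and preferring the longest matching key at each position is a design choice of this sketch.

```java
import java.util.Map;

// Hypothetical illustration class, not part of Apache Commons Text
final class LookupSketch {

    static String translate(String input, Map<String, String> table) {
        // The longest key bounds how far ahead we need to look
        int longest = 0;
        for (String key : table.keySet()) {
            longest = Math.max(longest, key.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            // Try the longest candidate first so "ab" wins over "a"
            for (int len = Math.min(longest, input.length() - i); len >= 1; len--) {
                String replacement = table.get(input.substring(i, i + len));
                if (replacement != null) {
                    out.append(replacement);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No table entry starts here, so copy the character unchanged
                out.append(input.charAt(i++));
            }
        }
        return out.toString();
    }
}
```

For instance, a table mapping "<" to "&lt;", ">" to "&gt;" and "&" to "&amp;" turns "a < b" into "a &lt; b", leaving all other characters as they are.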

7. Conclusion

In this quick tutorial, we’ve seen an overview of what Apache Commons Text is all about and some of its common features.

The code samples can be found over on GitHub.

