Channel: Baeldung

Removing Stopwords from a String in Java

1. Overview

In this tutorial, we’ll discuss different ways to remove stopwords from a String in Java. This is a useful operation in cases where we want to remove unwanted or disallowed words from a text, such as comments or reviews added by users of an online site.

We’ll use a simple loop, Collection.removeAll() and regular expressions.

Finally, we’ll compare their performance using the Java Microbenchmark Harness.

2. Loading Stopwords

First, we’ll load our stopwords from a text file.

Here we have the file english_stopwords.txt, which contains a list of words we consider stopwords, such as I, he, she, and the.

We’ll load the stopwords into a List of String using Files.readAllLines():

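private static List<String> stopwords;

@BeforeClass
public static void loadStopwords() throws IOException {
    // One stopword per line; the path is resolved against the working directory
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
}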
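
As a minimal, self-contained sketch of the loading step, the class below reads a stopword file and normalizes its lines. The class name StopwordLoaderSketch, the normalize() helper and the sample file content are hypothetical. The trimming, blank-line filtering and lowercasing are an assumption about what a real stopword file might need, not something the original test does:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class StopwordLoaderSketch {

    // Trims each line, drops blank ones and lowercases the rest.
    // This normalization is an assumption, not part of the original example.
    static List<String> normalize(List<String> rawLines) {
        return rawLines.stream()
          .map(String::trim)
          .filter(line -> !line.isEmpty())
          .map(line -> line.toLowerCase(Locale.ROOT))
          .collect(Collectors.toList());
    }

    // Reads the file (UTF-8 by default) and normalizes its lines.
    static List<String> load(Path file) throws IOException {
        return normalize(Files.readAllLines(file));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample content, written to a temp file for the demo.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("I", " he ", "", "The"));
        System.out.println(load(file)); // prints [i, he, the]
        Files.delete(file);
    }
}
```

Lowercasing the stopwords matters here because the later examples lowercase the input text before comparing words.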

3. Removing Stopwords Manually

For our first solution, we’ll remove stopwords manually by iterating over each word and checking if it’s a stopword:

@Test
public void whenRemoveStopwordsManually_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog"; 
    String target = "quick brown fox jumps lazy dog";
    String[] allWords = original.toLowerCase().split(" ");

    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    
    String result = builder.toString().trim();
    assertEquals(target, result);
}
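
The same loop can be tried outside a test class. The sketch below is a standalone variant with a hard-coded, hypothetical stopword list. It also swaps the List for a HashSet, which is an assumption on our part rather than the article's setup: contains() on a HashSet does not scan every element the way it does on a List.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ManualRemovalSketch {

    // Keeps every word that is not in the stopword set, joined by single spaces.
    static String removeStopwords(String text, Set<String> stopwords) {
        StringBuilder builder = new StringBuilder();
        for (String word : text.toLowerCase().split(" ")) {
            if (!stopwords.contains(word)) {
                builder.append(word).append(' ');
            }
        }
        // Drop the trailing space appended after the last kept word.
        return builder.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical stopword list; the article loads its list from a file.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "over"));
        String result = removeStopwords("The quick brown fox jumps over the lazy dog", stopwords);
        System.out.println(result); // prints quick brown fox jumps lazy dog
    }
}
```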

4. Using Collection.removeAll()

Next, instead of iterating over each word in our String, we can use Collection.removeAll() to remove all stopwords at once:

@Test
public void whenRemoveStopwordsUsingRemoveAll_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";

    ArrayList<String> allWords = 
      Stream.of(original.toLowerCase().split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);

    String result = allWords.stream().collect(Collectors.joining(" "));
    assertEquals(target, result);
}

In this example, after splitting our String into an array of words, we’ll transform it into an ArrayList to be able to apply the removeAll() method.
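
To illustrate why the copy into an ArrayList is needed, here is a small standalone sketch. The class and method names are hypothetical, and the stopword list is hard-coded. It contrasts a mutable ArrayList with the fixed-size list returned by Arrays.asList(), which rejects the removal when a word actually has to be removed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class RemoveAllSketch {

    // Copies the words into a mutable ArrayList, then removes all stopwords at once.
    static String removeStopwords(String text, Collection<String> stopwords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.toLowerCase().split(" ")));
        words.removeAll(stopwords);
        return String.join(" ", words);
    }

    // Tries removeAll() directly on the fixed-size view from Arrays.asList().
    // Returns false if the list refuses the removal.
    static boolean fixedSizeListAllowsRemoval(String text, Collection<String> stopwords) {
        List<String> fixedSize = Arrays.asList(text.toLowerCase().split(" "));
        try {
            fixedSize.removeAll(stopwords);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical stopword list for the demo.
        List<String> stopwords = Arrays.asList("the", "over");
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(removeStopwords(text, stopwords)); // prints quick brown fox jumps lazy dog
        System.out.println(fixedSizeListAllowsRemoval(text, stopwords)); // prints false
    }
}
```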

5. Using Regular Expressions

Finally, we can create a regular expression from our stopwords list, then use it to replace stopwords in our String:

@Test
public void whenRemoveStopwordsUsingRegex_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";

    String stopwordsRegex = stopwords.stream()
      .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));

    String result = original.toLowerCase().replaceAll(stopwordsRegex, "");
    assertEquals(target, result);
}

The resulting stopwordsRegex will have the format “\b(he|she|the|…)\b\s?” (each backslash is doubled in the Java string literal). In this regex, “\b” matches a word boundary, which avoids replacing “he” in “heat”, for example, while “\s?” matches zero or one whitespace character, which deletes the extra space left behind by a removed stopword.
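
The word-boundary behaviour can be checked with a short standalone sketch. The class name and the stopword list are hypothetical. Wrapping each stopword in Pattern.quote() is an extra precaution we assume here, in case a stopword contains regex metacharacters; the original test joins the words unquoted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexRemovalSketch {

    // Builds \b(w1|w2|...)\b\s? with every stopword quoted as a literal.
    static String buildRegex(List<String> stopwords) {
        return stopwords.stream()
          .map(Pattern::quote)
          .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
    }

    // Lowercases the text and deletes each stopword plus one following whitespace character.
    static String removeStopwords(String text, List<String> stopwords) {
        return text.toLowerCase().replaceAll(buildRegex(stopwords), "");
    }

    public static void main(String[] args) {
        // Hypothetical stopword list for the demo.
        List<String> stopwords = Arrays.asList("he", "the");
        // "heat" survives because there is no word boundary after its "he".
        System.out.println(removeStopwords("He said the heat was fine", stopwords));
        // prints said heat was fine
    }
}
```

Note that a stopword at the very end of the text leaves the space before it in place, so callers may want to trim() the result.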

6. Performance Comparison

Now, let’s see which method has the best performance.

First, let’s set up our benchmark. As the source of our String, we’ll use a rather big text file called shakespeare-hamlet.txt:

@Setup
public void setup() throws IOException {
    data = new String(Files.readAllBytes(Paths.get("shakespeare-hamlet.txt")));
    data = data.toLowerCase();
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
    stopwordsRegex = stopwords.stream().collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
}

Then we’ll have our benchmark methods, starting with removeManually():

@Benchmark
public String removeManually() {
    String[] allWords = data.split(" ");
    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    return builder.toString().trim();
}

Next, we have the removeAll() benchmark:

@Benchmark
public String removeAll() {
    ArrayList<String> allWords = 
      Stream.of(data.split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);
    return allWords.stream().collect(Collectors.joining(" "));
}

Finally, we’ll add the benchmark for replaceRegex():

@Benchmark
public String replaceRegex() {
    return data.replaceAll(stopwordsRegex, "");
}

And here’s the result of our benchmark:

Benchmark                           Mode  Cnt   Score    Error  Units
removeAll                           avgt   60   7.782 ±  0.076  ms/op
removeManually                      avgt   60   8.186 ±  0.348  ms/op
replaceRegex                        avgt   60  42.035 ±  1.098  ms/op

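Collection.removeAll() comes out marginally ahead of the manual loop (about 7.8 ms/op versus 8.2 ms/op), while the regular expression approach is by far the slowest at roughly 42 ms/op, more than five times the cost of either alternative.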

7. Conclusion

In this quick article, we learned different methods to remove stopwords from a String in Java. We also benchmarked them to see which method has the best performance.

The full source code for the examples is available over on GitHub.

