1. Overview
In this tutorial, we’ll discuss different ways to remove stopwords from a String in Java. This is a useful operation in cases where we want to remove unwanted or disallowed words from a text, such as comments or reviews added by users of an online site.
We’ll use a simple loop, Collection.removeAll() and regular expressions.
Finally, we’ll compare their performance using the Java Microbenchmark Harness.
2. Loading Stopwords
First, we’ll load our stopwords from a text file.
Here we have the file english_stopwords.txt, which contains a list of words we consider stopwords, such as I, he, she, and the.
We’ll load the stopwords into a List of String using Files.readAllLines():
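// Shared by all tests in the class
private static List<String> stopwords;

@BeforeClass
public static void loadStopwords() throws IOException {
    // One stopword per line; readAllLines() decodes the file as UTF-8
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
}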
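Because the later examples lowercase the text before comparing, a stopword stored as "I" or with stray whitespace would never match. The following is a minimal sketch of a loader that normalizes entries first; StopwordLoader, load() and normalize() are hypothetical names used for illustration, not part of the article’s code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class StopwordLoader {

    // Hypothetical helper: trims, lowercases and de-duplicates raw lines
    static List<String> normalize(List<String> rawLines) {
        return rawLines.stream()
            .map(String::trim)
            .filter(line -> !line.isEmpty())          // skip blank lines
            .map(line -> line.toLowerCase(Locale.ROOT))
            .distinct()                               // keep first occurrence only
            .collect(Collectors.toList());
    }

    // Hypothetical helper: reads one stopword per line and normalizes the result
    static List<String> load(Path file) throws IOException {
        return normalize(Files.readAllLines(file));
    }
}
```

Calling StopwordLoader.load(Paths.get("english_stopwords.txt")) would then return a list that can be used in place of the raw one.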
3. Removing Stopwords Manually
For our first solution, we’ll remove stopwords manually by iterating over each word and checking if it’s a stopword:
@Test
public void whenRemoveStopwordsManually_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";
    String[] allWords = original.toLowerCase().split(" ");

    StringBuilder builder = new StringBuilder();
    for (String word : allWords) {
        // Keep only the words that aren't in the stopword list
        if (!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }

    String result = builder.toString().trim();
    // JUnit expects the expected value first, then the actual one
    assertEquals(target, result);
}
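The loop above lowercases the whole text and splits on a single space, so the output loses its original casing and repeated whitespace produces empty tokens. One possible variation, sketched here under the assumption that the stopword list is already lowercase, compares a lowercased copy of each word but appends the original; CasePreservingRemoval is a hypothetical name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CasePreservingRemoval {

    // Assumes every entry in stopwords is already lowercase
    static String removeStopwords(String text, List<String> stopwords) {
        StringBuilder builder = new StringBuilder();
        // Split on runs of whitespace so tabs, newlines and double spaces yield no empty tokens
        for (String word : text.trim().split("\\s+")) {
            if (!stopwords.contains(word.toLowerCase(Locale.ROOT))) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(word); // append the word with its original casing
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("the", "over");
        // prints "quick brown fox jumps lazy dog"
        System.out.println(removeStopwords("The quick brown fox jumps over the lazy dog", stopwords));
    }
}
```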
4. Using Collection.removeAll()
Next, instead of iterating over each word in our String, we can use Collection.removeAll() to remove all stopwords at once:
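@Test
public void whenRemoveStopwordsUsingRemoveAll_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";

    // Collect the words into a mutable ArrayList so removeAll() can modify it
    ArrayList<String> allWords = Stream.of(original.toLowerCase().split(" "))
        .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);

    String result = allWords.stream().collect(Collectors.joining(" "));
    assertEquals(target, result);
}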
In this example, after splitting our String into an array of words, we’ll transform it into an ArrayList to be able to apply the removeAll() method.
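The copy into an ArrayList matters because a list obtained directly from Arrays.asList() is fixed-size and rejects element removal. The sketch below illustrates the difference; RemoveAllDemo and its method names are hypothetical and serve only as a demonstration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveAllDemo {

    // Copies the words into a mutable ArrayList before removing stopwords
    static String removeWithMutableCopy(String text, List<String> stopwords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.toLowerCase().split(" ")));
        words.removeAll(stopwords);
        return String.join(" ", words);
    }

    // Returns true if removeAll() on the fixed-size view threw an exception.
    // The exception is only raised once a matching element actually has to be removed.
    static boolean fixedSizeListRejectsRemoval(String text, List<String> stopwords) {
        List<String> fixedSize = Arrays.asList(text.toLowerCase().split(" "));
        try {
            fixedSize.removeAll(stopwords);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```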
5. Using Regular Expressions
Finally, we can create a regular expression from our stopwords list, then use it to replace stopwords in our String:
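@Test
public void whenRemoveStopwordsUsingRegex_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";

    // Join all stopwords into a single alternation wrapped in word boundaries
    String stopwordsRegex = stopwords.stream()
        .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));

    String result = original.toLowerCase().replaceAll(stopwordsRegex, "");
    assertEquals(target, result);
}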
The resulting stopwordsRegex will have the form \b(he|she|the|…)\b\s? (each backslash is doubled in the Java string literal). In this regex, \b matches a word boundary, so that “he” inside “heat” isn’t replaced, for example, while \s? matches zero or one whitespace character, which deletes the extra space left behind after a stopword is removed.
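Joining the raw stopwords works as long as none of them contains a regex metacharacter. As a hedged sketch of a more defensive variant, each word can be passed through Pattern.quote() and the pattern compiled once with the CASE_INSENSITIVE flag, so the text doesn’t have to be lowercased first. QuotedStopwordRegex, compile() and remove() are hypothetical names, and the final trim() is an addition that handles a stopword at the very end of the text:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedStopwordRegex {

    // Builds \b(?:\Qhe\E|\Qshe\E|...)\b\s? with every stopword treated as a literal
    static Pattern compile(List<String> stopwords) {
        String regex = stopwords.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|", "\\b(?:", ")\\b\\s?"));
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Removes every match; trim() drops the space left when the last word is a stopword
    static String remove(String text, Pattern pattern) {
        return pattern.matcher(text).replaceAll("").trim();
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which is what String.replaceAll() does internally.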
6. Performance Comparison
Now, let’s see which method has the best performance.
First, let’s set up our benchmark. We’ll use a rather big text file, shakespeare-hamlet.txt, as the source of our String:
// Benchmark state: input text, stopword list and the prebuilt regex
private String data;
private List<String> stopwords;
private String stopwordsRegex;

@Setup
public void setup() throws IOException {
    data = new String(Files.readAllBytes(Paths.get("shakespeare-hamlet.txt")));
    data = data.toLowerCase();
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
    stopwordsRegex = stopwords.stream()
        .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
}
Then we’ll have our benchmark methods, starting with removeManually():
@Benchmark
public String removeManually() {
    String[] allWords = data.split(" ");
    StringBuilder builder = new StringBuilder();
    for (String word : allWords) {
        if (!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    return builder.toString().trim();
}
Next, we have the removeAll() benchmark:
@Benchmark
public String removeAll() {
    ArrayList<String> allWords = Stream.of(data.split(" "))
        .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);
    return allWords.stream().collect(Collectors.joining(" "));
}
Finally, we’ll add the benchmark for replaceRegex():
@Benchmark
public String replaceRegex() {
    return data.replaceAll(stopwordsRegex, "");
}
And here’s the result of our benchmark:
Benchmark       Mode  Cnt   Score   Error  Units
removeAll       avgt   60   7.782 ± 0.076  ms/op
removeManually  avgt   60   8.186 ± 0.348  ms/op
replaceRegex    avgt   60  42.035 ± 1.098  ms/op
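Collection.removeAll() has the lowest average time (7.782 ms/op), with the manual loop close behind (8.186 ms/op); given the reported error margins, the gap between the two is small. Using regular expressions is clearly the slowest, at 42.035 ms/op, more than five times slower than either alternative.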
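Both of the faster approaches call contains() on a List of stopwords, which scans the list linearly for every word in the text. A possible refinement, sketched below, keeps the stopwords in a HashSet so that each lookup takes expected constant time. SetBasedRemoval is a hypothetical name, and this variant wasn’t part of the benchmark above, so no timing is claimed for it:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

class SetBasedRemoval {

    // Same logic as the manual loop, but lookups go against a hash set.
    // Assumes the text has already been lowercased, as in the benchmark setup.
    static String removeStopwords(String lowerCasedText, Set<String> stopwordSet) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String word : lowerCasedText.split(" ")) {
            if (!stopwordSet.contains(word)) {
                joiner.add(word);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Set<String> stopwordSet = new HashSet<>(Arrays.asList("the", "over"));
        // prints "quick brown fox jumps lazy dog"
        System.out.println(removeStopwords("the quick brown fox jumps over the lazy dog", stopwordSet));
    }
}
```

Whether this helps in practice depends on the size of the stopword list, so it would need its own benchmark method to confirm.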
7. Conclusion
In this quick article, we learned different methods to remove stopwords from a String in Java. We also benchmarked them to see which method has the best performance.
The full source code for the examples is available over on GitHub.