Channel: Baeldung

Check If a String Contains Multiple Keywords

1. Introduction

In this quick tutorial, we’ll find out how to detect multiple words inside of a string.

2. Our Example

Let’s suppose we have the string:

String inputString = "hello there, Baeldung";

Our task is to find whether the inputString contains both the “hello” and “Baeldung” keywords.

So, let’s put our keywords into an array:

String[] words = {"hello", "Baeldung"};

Moreover, the order of the words isn’t important, and the matches should be case-sensitive.

3. Using String.contains()

As a start, we’ll show how to use the String.contains() method to achieve our goal.

Let’s loop over the keywords array and check the occurrence of each item inside of the inputString:

public static boolean containsWords(String inputString, String[] items) {
    boolean found = true;
    for (String item : items) {
        if (!inputString.contains(item)) {
            found = false;
            break;
        }
    }
    return found;
}

The contains() method returns true if the inputString contains the given item. As soon as one of the keywords is missing from our string, we can stop checking and return false immediately.

Although this approach takes a few more lines of code, it’s fast for simple use cases.
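As a quick check of this behavior, the sketch below repeats the same loop logic and calls it with a few inputs. The class name ContainsWordsDemo is only for illustration and isn't part of the article's code:

```java
final class ContainsWordsDemo {

    // Same idea as containsWords(): fail fast on the first missing keyword.
    static boolean containsWords(String inputString, String[] items) {
        for (String item : items) {
            if (!inputString.contains(item)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String inputString = "hello there, Baeldung";

        // Both keywords are present; their order doesn't matter.
        System.out.println(containsWords(inputString, new String[] {"Baeldung", "hello"})); // true

        // contains() is case-sensitive, so "baeldung" isn't found.
        System.out.println(containsWords(inputString, new String[] {"hello", "baeldung"})); // false

        // contains() matches substrings, so "hell" is found inside "hello".
        System.out.println(containsWords(inputString, new String[] {"hell"})); // true
    }
}
```

The last call shows that this approach matches substrings, not whole words.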

4. Using String.indexOf()

Similar to the solution that uses the String.contains() method, we can check the indices of the keywords by using the String.indexOf() method. For that, we need a method accepting the inputString and the array of keywords:

public static boolean containsWordsIndexOf(String inputString, String[] words) {
    boolean found = true;
    for (String word : words) {
        if (inputString.indexOf(word) == -1) {
            found = false;
            break;
        }
    }
    return found;
}

The indexOf() method returns the index of the first occurrence of the word inside the inputString. When the word isn’t in the text, the index will be -1.
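To see the indices themselves, here's a minimal sketch that records the result of indexOf() for each keyword. The class KeywordPositions and the method firstPositions() are hypothetical names used only for this illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeywordPositions {

    // Maps each keyword to the index of its first occurrence, or -1 if it's absent.
    static Map<String, Integer> firstPositions(String inputString, String[] words) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (String word : words) {
            positions.put(word, inputString.indexOf(word));
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] words = {"hello", "Baeldung", "missing"};
        System.out.println(firstPositions("hello there, Baeldung", words));
        // prints {hello=0, Baeldung=13, missing=-1}
    }
}
```

This can be handy when we need to know where a keyword was found, not just whether it's present.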

5. Using Regular Expressions

Now, let’s use a regular expression to match our words. For that, we’ll use the Pattern class.

First, let’s define the string expression. As we need to match two keywords, we’ll build our regex rule with two lookaheads:

Pattern pattern = Pattern.compile("(?=.*hello)(?=.*Baeldung)");

And for the general case:

StringBuilder regexp = new StringBuilder();
for (String word : words) {
    regexp.append("(?=.*").append(word).append(")");
}

After that, we’ll use the matcher() method to find() the occurrences:

public static boolean containsWordsPatternMatch(String inputString, String[] words) {

    StringBuilder regexp = new StringBuilder();
    for (String word : words) {
        regexp.append("(?=.*").append(word).append(")");
    }

    Pattern pattern = Pattern.compile(regexp.toString());

    return pattern.matcher(inputString).find();
}

However, regular expressions have a performance cost. Each lookahead scans the input on its own, so if we have many words to look up, this solution might not perform well.
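There are two more things to watch with this pattern: keywords containing regex metacharacters (such as "." or "+") aren't matched literally, and "." doesn't cross line breaks by default. A possible hardened variant, assuming we want literal keywords and multi-line input, could use Pattern.quote() and the Pattern.DOTALL flag. The class and method names below are hypothetical:

```java
import java.util.regex.Pattern;

final class QuotedLookaheads {

    static boolean containsWordsQuoted(String inputString, String[] words) {
        StringBuilder regexp = new StringBuilder();
        for (String word : words) {
            // Pattern.quote() makes the keyword match literally, character by character.
            regexp.append("(?=.*").append(Pattern.quote(word)).append(")");
        }
        // DOTALL lets '.' match line terminators, so keywords on later lines are found too.
        Pattern pattern = Pattern.compile(regexp.toString(), Pattern.DOTALL);
        return pattern.matcher(inputString).find();
    }

    public static void main(String[] args) {
        String[] words = {"hello", "Baeldung"};
        System.out.println(containsWordsQuoted("hello there, Baeldung", words)); // true
        System.out.println(containsWordsQuoted("hello\nBaeldung", words));       // true

        // Unquoted, "a.b" would also match "a+b"; quoted, it doesn't.
        System.out.println(containsWordsQuoted("a+b", new String[] {"a.b"}));    // false
    }
}
```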

6. Using Java 8 and List

And finally, we can use Java 8’s Stream API. But first, let’s do some minor transformations with our initial data:

List<String> inputStringList = Arrays.asList(inputString.split(" "));
List<String> wordsList = Arrays.asList(words);

Now, it’s time to use the Stream API:

public static boolean containsWordsJava8(String inputString, String[] words) {
    List<String> inputStringList = Arrays.asList(inputString.split(" "));
    List<String> wordsList = Arrays.asList(words);

    return wordsList.stream().allMatch(inputStringList::contains);
}

The operation pipeline above will return true if the input string contains all of our keywords.

Alternatively, we can simply use the containsAll() method of the Collections framework to achieve the desired result:

public static boolean containsWordsArray(String inputString, String[] words) {
    List<String> inputStringList = Arrays.asList(inputString.split(" "));
    List<String> wordsList = Arrays.asList(words);

    return inputStringList.containsAll(wordsList);
}

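However, this method works for whole words only. Because we split the text on single spaces, it finds a keyword only if it appears as a separate, space-delimited token. A word with attached punctuation, such as “there,” in our example, won’t match the keyword “there”.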
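The sketch below illustrates this limitation and one possible workaround, assuming it's acceptable to treat any run of non-word characters as a delimiter. The class WholeWordTokens and its delimiterRegex parameter are hypothetical and only serve this example:

```java
import java.util.Arrays;
import java.util.List;

final class WholeWordTokens {

    // Splits the text with the given delimiter regex and checks that every keyword is a token.
    static boolean containsAllTokens(String inputString, String[] words, String delimiterRegex) {
        List<String> tokens = Arrays.asList(inputString.split(delimiterRegex));
        return tokens.containsAll(Arrays.asList(words));
    }

    public static void main(String[] args) {
        String inputString = "hello there, Baeldung";
        String[] words = {"hello", "there"};

        // Splitting on a single space leaves the token "there," with its comma attached.
        System.out.println(containsAllTokens(inputString, words, " "));     // false

        // Splitting on runs of non-word characters strips the punctuation.
        // Note: by default \W treats only ASCII letters, digits, and '_' as word characters.
        System.out.println(containsAllTokens(inputString, words, "\\W+"));  // true

        // Partial words still don't match, since we compare whole tokens.
        System.out.println(containsAllTokens(inputString, new String[] {"hell"}, "\\W+")); // false
    }
}
```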

7. Using the Aho-Corasick Algorithm

Simply put, the Aho-Corasick algorithm is for text searching with multiple keywords. Once the keyword trie is built, a search runs in O(n) time, where n is the length of the text (plus the number of matches reported), no matter how many keywords we’re searching for.
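To make the idea more concrete, here's a minimal, library-free sketch of the algorithm. It isn't the code of any particular library; the class AhoCorasickSketch and its findAll() method are hypothetical names. It assumes non-empty keywords and matches substrings only, so there's no whole-word option:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // keywords ending at this node
        Node fail;                                      // longest proper suffix that is also in the trie
    }

    private final Node root = new Node();

    AhoCorasickSketch(String[] keywords) {
        // Phase 1: insert every keyword into the trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> entry : current.next.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node fallback = current.fail;
                while (fallback != root && !fallback.next.containsKey(c)) {
                    fallback = fallback.fail;
                }
                Node target = fallback.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit the keywords that end at the failure target.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns the distinct keywords found in it.
    Set<String> findAll(String text) {
        Set<String> found = new HashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail; // follow failure links instead of restarting the scan
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            found.addAll(node.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        String[] words = {"hello", "Baeldung"};
        Set<String> found = new AhoCorasickSketch(words).findAll("hello there, Baeldung");
        System.out.println(found.containsAll(Arrays.asList(words))); // prints true
    }
}
```

The failure links are what allow a single pass over the text: on a mismatch, the search falls back to the longest suffix that is still a prefix of some keyword rather than starting over.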

Let’s include the Aho-Corasick algorithm dependency in our pom.xml:

<dependency>
    <groupId>org.ahocorasick</groupId>
    <artifactId>ahocorasick</artifactId>
    <version>0.4.0</version>
</dependency>

First, let’s build a Trie from the words array of keywords using the library’s builder:

Trie trie = Trie.builder().onlyWholeWords().addKeywords(words).build();

After that, let’s call the parseText() method with the inputString text in which we would like to find the keywords, and save the results in the emits collection:

Collection<Emit> emits = trie.parseText(inputString);

And finally, if we print our results:

emits.forEach(System.out::println);

For each keyword, we’ll see the start position of the keyword in the text, the ending position, and the keyword itself:

0:4=hello
13:20=Baeldung

Finally, let’s see the complete implementation:

public static boolean containsWordsAhoCorasick(String inputString, String[] words) {
    Trie trie = Trie.builder().onlyWholeWords().addKeywords(words).build();

    Collection<Emit> emits = trie.parseText(inputString);

    // Collect the distinct keywords that were matched, instead of searching
    // the printed form of the emits (which could match positions or partial words).
    Set<String> foundKeywords = new HashSet<>();
    for (Emit emit : emits) {
        foundKeywords.add(emit.getKeyword());
    }

    return foundKeywords.containsAll(Arrays.asList(words));
}

In this example, we’re looking for whole words only. So, if we also want to match keywords inside a longer string such as “helloBaeldung”, we should simply remove the onlyWholeWords() call from the Trie builder pipeline.

In addition, keep in mind that the emits collection may contain several matches for the same keyword, so our check only requires each keyword to appear at least once.

8. Conclusion

In this article, we learned how to find multiple keywords inside a string. Moreover, we showed examples using the core JDK as well as the Aho-Corasick library.

As usual, the complete code for this article is available over on GitHub.

