Channel: Baeldung

Counting Words in a String

1. Overview

In this tutorial, we’ll look at different ways to count the words in a given string using Java.

2. Using StringTokenizer

A simple way to count words in a string in Java is to use the StringTokenizer class:

assertEquals(3, new StringTokenizer("three blind mice").countTokens());
assertEquals(4, new StringTokenizer("see\thow\tthey\trun").countTokens());

Note that StringTokenizer automatically takes care of whitespace for us, like tabs and carriage returns.
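To make the default behavior concrete, here is a minimal sketch (the class and method names are ours, for illustration only). When no delimiters are given, StringTokenizer splits on space, tab, newline, carriage return, and form feed, so any mix of those characters separates words:

```java
import java.util.StringTokenizer;

class DefaultDelimiterDemo {

    // Hypothetical helper: counts tokens using StringTokenizer's default
    // delimiter set (space, tab, newline, carriage return, form feed).
    static int countWithDefaults(String text) {
        return new StringTokenizer(text).countTokens();
    }

    public static void main(String[] args) {
        System.out.println(countWithDefaults("three blind mice"));      // 3
        System.out.println(countWithDefaults("see\thow\r\nthey\frun")); // 4
        System.out.println(countWithDefaults("   "));                   // 0
    }
}
```

Runs of whitespace count as a single separator, which is why a string made up only of spaces yields zero tokens.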

However, it can trip up in some cases, such as double hyphens:

assertEquals(6, new StringTokenizer("the farmer's wife--she was from Albuquerque").countTokens());

Here we’d want “wife” and “she” to count as separate words, for a total of seven. But since there’s no whitespace between them, the default delimiters treat “wife--she” as a single token, and we get six.

Fortunately, StringTokenizer ships with another constructor. We can pass it a string of delimiter characters, here a space and a hyphen, to get the count we expect:

assertEquals(7, new StringTokenizer("the farmer's wife--she was from Albuquerque", " -").countTokens());

This comes in handy when trying to count the words in a string from something like a CSV file:

assertEquals(10, new StringTokenizer("did,you,ever,see,such,a,sight,in,your,life", ",").countTokens());
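Two details of this constructor are easier to see when we look at the tokens themselves. The sketch below uses a hypothetical helper named tokens that collects them into a list instead of only counting them; it relies on nothing beyond the standard hasMoreTokens and nextToken methods:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenListDemo {

    // Hypothetical helper: returns the tokens so we can inspect them.
    static List<String> tokens(String text, String delimiters) {
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Each character of " -" is a delimiter in its own right.
        System.out.println(tokens("the farmer's wife--she was from Albuquerque", " -"));
        // prints [the, farmer's, wife, she, was, from, Albuquerque]

        // Consecutive delimiters never produce an empty token.
        System.out.println(tokens("a,,b", ","));
        // prints [a, b]
    }
}
```

Every character of the delimiter string is treated as a separate delimiter, and runs of delimiters collapse into one. For CSV data, this means empty fields are silently skipped, which may or may not be what we want.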

So, StringTokenizer is simple, and it gets us most of the way there.

Let’s see, though, what extra power regular expressions can give us.

3. Regular Expressions

To come up with a meaningful regular expression for this task, we first need to define what we consider a word: a word starts with a letter and ends with either a space character or a punctuation mark.

With this in mind, we want to split the string wherever we encounter spaces or punctuation marks, and then count the resulting pieces:

assertEquals(7, countWordsUsingRegex("the farmer's wife--she was from Albuquerque"));

Let’s crank things up a bit to see the power of regex:

assertEquals(9, countWordsUsingRegex("no&one#should%ever-write-like,this;but:well"));

It isn’t practical to solve this one by just passing delimiters to StringTokenizer, since we’d have to list every possible punctuation mark in the delimiter string.

It turns out we don’t have to do much. Passing the regex [\pP\s&&[^']]+ to the split method of the String class does the trick:

public static int countWordsUsingRegex(String arg) {
    if (arg == null) {
        return 0;
    }
    // Backslashes must be doubled inside a Java string literal
    final String[] words = arg.split("[\\pP\\s&&[^']]+");
    return words.length;
}

The regex [\pP\s&&[^']]+ matches one or more consecutive punctuation marks or whitespace characters, excluding the apostrophe, so an apostrophe never acts as a separator.
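One thing to watch with split is how it treats the edges of the input. As a hedged sketch (the class name and the countNonEmpty helper are our own, not part of the original code), the example below shows that an input starting with a separator yields a leading empty string, and that an empty input yields a single empty string. Filtering out empty pieces is one way to keep the count accurate:

```java
import java.util.Arrays;

class RegexSplitDemo {

    // Same pattern as above, with backslashes escaped for a Java literal.
    static final String SEPARATORS = "[\\pP\\s&&[^']]+";

    // Hypothetical variant: ignores empty pieces produced by split.
    static int countNonEmpty(String text) {
        if (text == null) {
            return 0;
        }
        return (int) Arrays.stream(text.split(SEPARATORS))
          .filter(piece -> !piece.isEmpty())
          .count();
    }

    public static void main(String[] args) {
        // A leading separator produces an empty first element.
        System.out.println("--hello world".split(SEPARATORS).length); // 3
        System.out.println(countNonEmpty("--hello world"));           // 2

        // Splitting an empty string returns an array holding one empty string.
        System.out.println("".split(SEPARATORS).length);              // 1
        System.out.println(countNonEmpty(""));                        // 0
    }
}
```

Trailing empty strings are already dropped by split, so only the leading piece and the empty-input case need this extra care.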

To find out more about regular expressions, refer to Regular Expressions on Baeldung.

4. Loops and the String API

Another approach is to loop over the characters ourselves, using a flag that tracks whether we are currently inside a word.

We set the flag to WORD and increment the word count when we encounter the start of a new word, then set it back to SEPARATOR when we encounter a non-word character (punctuation or whitespace).

This approach gives us the same results we got with regular expressions:

assertEquals(9, countWordsManually("no&one#should%ever-write-like,this but   well"));

We do have to be careful with special cases where punctuation marks are not really word separators, for example:

assertEquals(7, countWordsManually("the farmer's wife--she was from Albuquerque"));

What we want here is to count “farmer’s” as one word, even though the apostrophe (') is a punctuation mark.

In the regex version, we could state directly in the pattern which punctuation mark doesn’t act as a separator. Now that we are writing our own implementation, we have to define this exclusion in a separate method:

private static boolean isAllowedInWord(char charAt) {
    return charAt == '\'' || Character.isLetter(charAt);
}

Here we allow letters in a word, plus any punctuation mark we consider legal inside one, which in this case is only the apostrophe.

We can now use this method in our implementation:

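// Assumed declarations for the two states; any two distinct values work
private static final int WORD = 0;
private static final int SEPARATOR = 1;

public static int countWordsManually(String arg) {
    if (arg == null) {
        return 0;
    }
    // Start outside of any word
    int flag = SEPARATOR;
    int count = 0;
    int stringLength = arg.length();
    int characterCounter = 0;

    while (characterCounter < stringLength) {
        if (isAllowedInWord(arg.charAt(characterCounter)) && flag == SEPARATOR) {
            // First character of a new word
            flag = WORD;
            count++;
        } else if (!isAllowedInWord(arg.charAt(characterCounter))) {
            // Any non-word character ends the current word
            flag = SEPARATOR;
        }
        characterCounter++;
    }
    return count;
}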

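The first condition detects the start of a new word: the character is allowed in a word and we were previously on a separator, so we switch the flag to WORD and increment the counter. The second condition resets the flag to SEPARATOR whenever the character is not allowed in a word.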
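Since the flag only ever holds two states, the same idea can also be expressed with a boolean. The following is a self-contained sketch of that variant, not the tutorial’s own implementation; the class name and the inWord variable are ours:

```java
class ManualWordCountDemo {

    // Same rule as above: letters and the apostrophe belong to a word.
    static boolean isAllowedInWord(char c) {
        return c == '\'' || Character.isLetter(c);
    }

    // Hypothetical boolean-flag variant of the manual counter.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        boolean inWord = false;
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            boolean allowed = isAllowedInWord(text.charAt(i));
            // A word starts when we move from a separator to an allowed character.
            if (allowed && !inWord) {
                count++;
            }
            inWord = allowed;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("the farmer's wife--she was from Albuquerque"));   // 7
        System.out.println(countWords("no&one#should%ever-write-like,this but   well")); // 9
        // Digits are not letters, so "66" is treated as a separator run.
        System.out.println(countWords("route 66"));                                      // 1
    }
}
```

The last call shows a consequence of the isAllowedInWord rule: numbers are not counted as words. If they should be, Character.isLetterOrDigit is one possible replacement for Character.isLetter.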

5. Conclusion

In this tutorial, we looked at several ways to count the words in a string. We can pick whichever one suits our particular use case. As usual, the source code for this tutorial can be found on GitHub.

