1. Introduction
In this tutorial, we’ll show how to parse a stream of characters into tokens using the Java StreamTokenizer class.
2. StreamTokenizer
The StreamTokenizer class reads the stream character by character. Each character can have zero or more of the following attributes: white space, alphabetic, numeric, string quote, or comment character.
Now, we need to understand the default configuration. We have the following types of characters:
- Word characters: ‘a’ to ‘z’, ‘A’ to ‘Z’, and the characters from ‘\u00A0’ to ‘\u00FF’
- Numeric characters: 0, 1, …, 9, plus ‘.’ and ‘-’, which can be part of a number
- Whitespace characters: ASCII values from 0 to 32
- Comment character: /
- String quote characters: ‘ and “
Note that line endings are treated as whitespace, not as separate tokens, and that C/C++-style comments are not recognized by default.
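To make these defaults concrete, here is a minimal sketch that tokenizes a short string without changing any settings. The class name DefaultSyntaxDemo, the tokenize() helper, and the sample input are assumptions made for illustration; only the StreamTokenizer calls come from the JDK.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the tutorial's code base.
final class DefaultSyntaxDemo {

    // Collects every token produced under the default configuration.
    static List<Object> tokenize(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        List<Object> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add(tokenizer.nval);          // numeric value as a double
                        break;
                    case StreamTokenizer.TT_WORD:
                    case '\'':
                    case '"':
                        tokens.add(tokenizer.sval);          // word or body of a quoted string
                        break;
                    default:
                        tokens.add((char) tokenizer.ttype);  // ordinary character
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The line break is skipped as whitespace and '/' hides the rest of its line.
        System.out.println(tokenize("abc 42 'hi there'\nxyz / skipped\n+"));
        // prints [abc, 42.0, hi there, xyz, +]
    }
}
```

The text after ‘/’ never shows up because ‘/’ is the default comment character, while ‘+’ has no attributes and is therefore returned as an ordinary character.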
This class possesses a set of important fields:
- TT_EOF – A constant indicating the end of the stream
- TT_EOL – A constant indicating the end of the line
- TT_NUMBER – A constant indicating a number token
- TT_WORD – A constant indicating a word token
3. Default Configuration
Here, we’re going to create an example in order to understand the StreamTokenizer mechanism. We’ll start by creating an instance of this class and then call the nextToken() method until it returns the TT_EOF value:
private static final int QUOTE_CHARACTER = '\'';
private static final int DOUBLE_QUOTE_CHARACTER = '"';

public static List<Object> streamTokenizerWithDefaultConfiguration(Reader reader) throws IOException {
    StreamTokenizer streamTokenizer = new StreamTokenizer(reader);
    List<Object> tokens = new ArrayList<Object>();

    int currentToken = streamTokenizer.nextToken();
    while (currentToken != StreamTokenizer.TT_EOF) {
        if (streamTokenizer.ttype == StreamTokenizer.TT_NUMBER) {
            tokens.add(streamTokenizer.nval);
        } else if (streamTokenizer.ttype == StreamTokenizer.TT_WORD
            || streamTokenizer.ttype == QUOTE_CHARACTER
            || streamTokenizer.ttype == DOUBLE_QUOTE_CHARACTER) {
            tokens.add(streamTokenizer.sval);
        } else {
            tokens.add((char) currentToken);
        }
        currentToken = streamTokenizer.nextToken();
    }

    return tokens;
}
The test file simply contains:
3 quick brown foxes jump over the "lazy" dog!
#test1
//test2
Now, if we printed out the contents of the list, labeling each token with its type, we’d see:
Number: 3.0
Word: quick
Word: brown
Word: foxes
Word: jump
Word: over
Word: the
Word: lazy
Word: dog
Ordinary char: !
Ordinary char: #
Word: test1
In order to better understand the example, we need to explain the StreamTokenizer.ttype, StreamTokenizer.nval and StreamTokenizer.sval fields.
The ttype field contains the type of the token just read. It can be TT_EOF, TT_EOL, TT_NUMBER, or TT_WORD. For a quoted string token, however, its value is the ASCII value of the quote character. Likewise, if the token is an ordinary character like ‘!’, with no attributes, ttype holds the ASCII value of that character.
Next, we read the sval field to get the token text when the token is a word, that is, when ttype is TT_WORD. If we’re dealing with a quoted string token, say “lazy”, this field contains the body of the string without the surrounding quotes.
Last, we read the nval field to get the numeric value when the token is a number, that is, when ttype is TT_NUMBER.
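The relationship between ttype, sval and nval can be sketched as a small helper that labels the token just read. The TokenDescriber class, its method names, and the label strings are hypothetical; unlike the output shown earlier, this sketch gives quoted strings their own label so the role of ttype is easier to see.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, shown only to illustrate how the three fields relate.
final class TokenDescriber {

    // Labels the token the tokenizer has just read.
    static String describe(StreamTokenizer tokenizer) {
        switch (tokenizer.ttype) {
            case StreamTokenizer.TT_EOF:
                return "End of stream";
            case StreamTokenizer.TT_EOL:
                return "End of line";
            case StreamTokenizer.TT_NUMBER:
                return "Number: " + tokenizer.nval;             // nval holds the value
            case StreamTokenizer.TT_WORD:
                return "Word: " + tokenizer.sval;               // sval holds the text
            case '\'':
            case '"':
                return "Quoted string: " + tokenizer.sval;      // ttype is the quote character
            default:
                return "Ordinary char: " + (char) tokenizer.ttype;
        }
    }

    // Describes every token of the input, using the default configuration.
    static List<String> describeAll(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        List<String> descriptions = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                descriptions.add(describe(tokenizer));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return descriptions;
    }

    public static void main(String[] args) {
        describeAll("3 \"lazy\" dog!").forEach(System.out::println);
        // prints:
        // Number: 3.0
        // Quoted string: lazy
        // Word: dog
        // Ordinary char: !
    }
}
```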
4. Custom Configuration
Here, we’ll change the default configuration and create another example.
First, we’re going to set some extra word characters using the wordChars(int low, int hi) method. Then, we’ll make the comment character (‘/’) an ordinary one and promote ‘#’ as the new comment character.
Finally, we’ll treat the end of the line as a token of its own with the help of the eolIsSignificant(boolean flag) method.
We only need to call these methods on the streamTokenizer object:
public static List<Object> streamTokenizerWithCustomConfiguration(Reader reader) throws IOException {
    StreamTokenizer streamTokenizer = new StreamTokenizer(reader);
    List<Object> tokens = new ArrayList<Object>();

    streamTokenizer.wordChars('!', '-');
    streamTokenizer.ordinaryChar('/');
    streamTokenizer.commentChar('#');
    streamTokenizer.eolIsSignificant(true);

    // same as before

    return tokens;
}
And here we have a new output:
// same output as earlier
Word: "lazy"
Word: dog!
Ordinary char: 

Ordinary char: 

Ordinary char: /
Ordinary char: /
Word: test2
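Note that the double quotes became part of the token, since ‘"’ falls within the range from ‘!’ to ‘-’ that we marked as word characters. The end of the line is also no longer skipped as whitespace. Instead, nextToken() returns it as TT_EOL, whose value is the newline character, so our code records it as a single-character token.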
Also, the characters following the ‘#’ character are now skipped and the ‘/’ is an ordinary character.
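As a rough illustration of how significant line endings and a custom comment character interact, the following sketch groups words by line. The LineAwareDemo class, the wordsPerLine() helper, and the sample input are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the tutorial's code base.
final class LineAwareDemo {

    // Returns the words found on each line, skipping everything after '#'.
    static List<List<String>> wordsPerLine(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        tokenizer.commentChar('#');        // '#' now starts a comment
        tokenizer.eolIsSignificant(true);  // line ends are reported as TT_EOL

        List<List<String>> lines = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_EOL) {
                    lines.add(current);            // close the current line
                    current = new ArrayList<>();
                } else if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    current.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!current.isEmpty()) {
            lines.add(current);                    // last line without a trailing newline
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(wordsPerLine("quick brown\n# a comment\nfox"));
        // prints [[quick, brown], [], [fox]]
    }
}
```

The comment line contributes an empty list: its text is skipped, but the line ending after it is still reported, which matches the two consecutive end-of-line tokens in the output above.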
We could also change the quote character with the quoteChar(int ch) method, or even the whitespace characters with the whitespaceChars(int low, int hi) method. Further customizations can be made by calling StreamTokenizer’s methods in different combinations.
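As one possible combination, the sketch below treats commas as whitespace and uses ‘|’ as an extra quote character to split a delimited line into fields. The DelimitedFieldsDemo class, the fields() helper, and the choice of delimiters are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the tutorial's code base.
final class DelimitedFieldsDemo {

    private static final int PIPE_QUOTE = '|';

    // Splits a comma-separated line; text between '|' characters is kept as one field.
    static List<String> fields(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        tokenizer.whitespaceChars(',', ',');  // commas now only separate tokens
        tokenizer.quoteChar(PIPE_QUOTE);      // '|' delimits quoted strings

        List<String> fields = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD
                    || tokenizer.ttype == PIPE_QUOTE) {
                    fields.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fields("name,|hello world|,x"));
        // prints [name, hello world, x]
    }
}
```

Number tokens are ignored here to keep the sketch short; a fuller version would also handle TT_NUMBER.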
5. Conclusion
In this tutorial, we’ve seen how to parse a stream of characters into tokens using the StreamTokenizer class. We’ve learned about the default mechanism and created an example with the default configuration.
Finally, we’ve changed the default parameters and we’ve noticed how flexible the StreamTokenizer class is.
As usual, the code can be found over on GitHub.