■ StreamTokenizer splits a character stream, such as the contents of a file, into tokens in a manner similar to StringTokenizer.
■ To create a StreamTokenizer on a char file, you must supply a Reader such as a FileReader. For example:
    FileReader fr = new FileReader( "yourcharfile" );
    StreamTokenizer st = new StreamTokenizer( fr );
or, with buffering:
    st = new StreamTokenizer( new BufferedReader( new FileReader( "yourfile" ) ) );
■ To create a StreamTokenizer on a byte file, you must use an InputStreamReader to change the bytes into chars (it uses the platform default charset unless you pass one). For example:
    Reader r = new BufferedReader( new InputStreamReader( new FileInputStream( "yourbytefile" ) ) );
then
    StreamTokenizer st = new StreamTokenizer( r );
■ When a token is encountered with .nextToken( ), its type is returned and also placed in the field ttype. If it is a word, its value is placed in the field sval. If it is a number, its value is placed in nval, which is always a double. For an ordinary character, ttype holds the character itself.
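To make the three fields concrete, here is a minimal sketch that tokenizes a short String with the default syntax and records what ttype, sval and nval hold for each token. The class name TokenDump and the method describe are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original notes.
class TokenDump {
    // Returns one description per token, based on ttype, sval and nval.
    static List<String> describe(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD:" + st.sval);      // word text is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER:" + st.nval);    // nval is always a double
                        break;
                    default:
                        out.add("CHAR:" + (char) st.ttype); // ordinary char is ttype itself
                        break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("width 42 + height 7"));
        // prints [WORD:width, NUMBER:42.0, CHAR:+, WORD:height, NUMBER:7.0]
    }
}
```

Note that 42 comes back as 42.0, because nval is a double even when the input looks like an integer.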
■ You cannot recognize line endings without asking for them, as StreamTokenizer treats them as whitespace by default. The .eolIsSignificant( true ) method makes the parser recognize line endings and return the StreamTokenizer.TT_EOL value in ttype, so they can be tested for.
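As a small illustration of the difference, this sketch counts how many TT_EOL tokens come back with and without eolIsSignificant( true ). The class name EolDemo and the method countEolTokens are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original notes.
class EolDemo {
    // Counts the TT_EOL tokens returned for the given text.
    static int countEolTokens(String text, boolean significant) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(significant);
        int eols = 0;
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    eols++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return eols;
    }

    public static void main(String[] args) {
        String text = "one two\nthree\n";
        System.out.println(countEolTokens(text, true));  // prints 2
        System.out.println(countEolTokens(text, false)); // prints 0
    }
}
```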
■ You do not need to supply StreamTokenizer method parameter values as int literals, even though the methods declare int parameters. Passing a char works through widening primitive conversion, and chars are much more readable.
Using the st example above:
st.wordChars( 44, 46 ); sets commas (int 44), dashes (int 45) and periods (int 46) to be treated as pieces of regular words.
st.wordChars( ',', '.' ); does the same thing.
Note that supplying character parameters out of their sequential numeric chart order, as in st.wordChars( '.', ',' ); will compile successfully but silently does nothing at runtime, because the range from the higher code to the lower one is empty.
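The following sketch checks the three calls above against the input "red,green". It assumes the default syntax otherwise; the class name WordCharsDemo and the method firstWord are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original notes.
class WordCharsDemo {
    // Applies wordChars(low, high) and returns the text of the first token.
    static String firstWord(String text, int low, int high) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.wordChars(low, high);
        try {
            st.nextToken();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return st.sval;
    }

    public static void main(String[] args) {
        System.out.println(firstWord("red,green", 44, 46));   // prints red,green
        System.out.println(firstWord("red,green", ',', '.')); // prints red,green
        System.out.println(firstWord("red,green", '.', ','));  // prints red (reversed range is a no-op)
    }
}
```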
■ You cannot define token delimiter characters in StreamTokenizer's constructor, as you can with StringTokenizer. Instead, StreamTokenizer offers various methods to control the possible attributes of an incoming character. The table shows the attributes and their control methods:
| Situation: | Use Method: |
| Treat End-of-Line characters as whitespace rather than as tokens | eolIsSignificant( false ) (the default) |
| Define a range of characters to be considered "part of a word" | wordChars( int lowch, int highch ) |
| Define a range of chars to be ignorable whitespace characters | whitespaceChars( int lowch, int highch ) |
| Define a character as ordinary, to be returned in ttype if encountered | ordinaryChar( int ch ) |
| Define a range of characters as ordinary | ordinaryChars( int lowch, int highch ) |
| Reset all characters to ordinary, thus returning each one in ttype | resetSyntax( ) |
| Treat positive, decimal (.) and negative (-) numbers as numbers | parseNumbers( ) (the default) |
| Define a char whose matching pairs will bracket a String quote body, with the char to be returned in ttype and the quote body in sval | quoteChar( int ch ) |
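Several of the methods in the table can be combined to build a syntax from scratch. The sketch below is one possible arrangement, assuming input of the form name = "quoted value"; the class name SyntaxTableDemo and the method tokens are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original notes.
class SyntaxTableDemo {
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();            // start with every character ordinary
        st.wordChars('a', 'z');      // letters form words
        st.wordChars('A', 'Z');
        st.whitespaceChars(0, ' ');  // control chars and space separate tokens
        st.ordinaryChar('=');        // already ordinary after resetSyntax; shown for clarity
        st.quoteChar('"');           // "..." is returned as one token
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD:" + st.sval);
                        break;
                    case '"':
                        out.add("QUOTE:" + st.sval); // quote body is in sval
                        break;
                    default:
                        out.add("CHAR:" + (char) st.ttype);
                        break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("title = \"War and Peace\""));
        // prints [WORD:title, CHAR:=, QUOTE:War and Peace]
    }
}
```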
■ StreamTokenizer has several methods for controlling the parsing of program files written in C, C++, and Java.
| Situation: | Use Method: |
| Define a char beginning remainder-of-line comments to be ignored | commentChar( int ch ) |
| Ignore all line contents after occurrences of // | slashSlashComments( true ) |
| Ignore all file contents between /* ... */ | slashStarComments( true ) |
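A short sketch of the two comment switches follows. It assumes you also want a lone / to come back as a token, so it first calls ordinaryChar( '/' ), since the default syntax treats / as a remainder-of-line comment character. The class name CommentDemo and the method tokens are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original notes.
class CommentDemo {
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChar('/');         // stop treating a single '/' as a comment start
        st.slashSlashComments(true);  // skip // ... to end of line
        st.slashStarComments(true);   // skip /* ... */
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD:" + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER:" + st.nval);
                } else {
                    out.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "int x = 10 / 2; // halve\n/* block\n comment */ int y;";
        System.out.println(tokens(source));
        // prints [WORD:int, WORD:x, CHAR:=, NUMBER:10.0, CHAR:/, NUMBER:2.0, CHAR:;, WORD:int, WORD:y, CHAR:;]
    }
}
```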
■ This snippet uses if statements to parse any text file whose lines contain just words, numbers, commas and periods, counting them. It also uses the lineno( ) method to report the line number reached at the end of the file, which is one more than the number of line terminators read:
import java.io.*;

public class ParseWithIf {
    public static void main(String[] args) {
        int w = 0; // words
        int n = 0; // numbers
        int c = 0; // commas
        int p = 0; // periods
        StreamTokenizer st = null;
        try {
            st = new StreamTokenizer(new BufferedReader(new FileReader("yourfile")));
            st.eolIsSignificant(true);  // tells it to recognize line breaks
            st.ordinaryChars(',', '.'); // return ',' and '.' in ttype; the range also covers '-'
            while ( true ) {
                st.nextToken( );
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    w++;
                    continue;
                }
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    n++;
                    continue;
                }
                if (st.ttype == '.') {
                    p++;
                    continue;
                }
                if (st.ttype == ',') {
                    c++;
                    continue;
                }
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    continue;
                }
                if (st.ttype == StreamTokenizer.TT_EOF) {
                    break;
                } else {
                    System.out.println("Bad file format");
                    break;
                }
            }
        } catch (IOException ex) {
            System.out.println("Could not read file: " + ex.getMessage());
            return; // st may still be null here, so do not fall through to lineno( )
        }
        System.out.println( w + " words, " + n + " numbers, " + c + " commas, " + p + " periods" );
        System.out.println( "Last line number reached: " + st.lineno( ) );
    }
}
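The value that lineno( ) returns at end of file depends on whether the last line ends with a line terminator, because the counter starts at 1 and goes up by one for each terminator read. This sketch shows both cases on an in-memory String; the class name LineNoDemo and the method lastLine are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original notes.
class LineNoDemo {
    // Reads every token, then returns lineno() as seen at end of input.
    static int lastLine(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                // tokens are ignored; only the line counter matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return st.lineno();
    }

    public static void main(String[] args) {
        System.out.println(lastLine("a\nb"));   // prints 2
        System.out.println(lastLine("a\nb\n")); // prints 3
    }
}
```

If you need an exact line count, one option is to count TT_EOL tokens yourself with eolIsSignificant( true ) and decide how to treat a final unterminated line.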
■ You can also switch against the various StreamTokenizer result fields. This snippet repeats the above example but with switch instead of if statements:
import java.io.*;

public class ParseWithSwitch {
    public static void main(String[] args) {
        int w = 0; // words
        int n = 0; // numbers
        int c = 0; // commas
        int p = 0; // periods
        try {
            StreamTokenizer st = new StreamTokenizer(new FileReader("yourfile"));
            st.eolIsSignificant(true);  // tells it to recognize line breaks
            st.ordinaryChars(',', '.'); // return ',' and '.' in ttype; the range also covers '-'
            int token = st.nextToken(); // prime the token field for first comparison
            st.pushBack();              // reset for loop start
            while (token != StreamTokenizer.TT_EOF) {
                token = st.nextToken();
                switch (token) {
                    case StreamTokenizer.TT_NUMBER:
                        n++;
                        break;
                    case StreamTokenizer.TT_WORD:
                        w++;
                        break;
                    case ',':
                        c++;
                        break;
                    case '.':
                        p++;
                        break;
                    case StreamTokenizer.TT_EOL:
                        break;
                    case StreamTokenizer.TT_EOF:
                        break;
                }
            }
        } catch (IOException ex) {
            System.out.println("Could not read file: " + ex.getMessage());
        }
        System.out.println(w + " words, " + n + " numbers, " + c + " commas, " + p + " periods");
    }
}
■ Periods ( . ) and dashes ( - ) are considered part of numbers, so the order in which you call ordinaryChar( .. ) or ordinaryChars( .. ) and parseNumbers( ) matters. If you make them ordinary and then call parseNumbers( ) explicitly, parseNumbers( ) marks them as numeric again, number parsing takes precedence, and a period or dash that begins a number is not returned in ttype. For example:
    st.ordinaryChars( '-', '.' );
    st.parseNumbers( );
causes those chars to be parsed as parts of numbers again.
However, calling st.ordinaryChars( '-', '.' ); alone, relying on the number parsing that is on by default and without an explicit st.parseNumbers( ); afterwards, returns those two ordinary characters in ttype.
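The effect of the call order can be checked on the input "-5". This is a minimal sketch; the class name NumberOrderDemo, the method tokens and its reparse flag are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original notes.
class NumberOrderDemo {
    // reparse == true calls parseNumbers() after ordinaryChars('-', '.').
    static List<String> tokens(String text, boolean reparse) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChars('-', '.');
        if (reparse) {
            st.parseNumbers(); // marks digits, '.' and '-' as numeric again
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER:" + st.nval);
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD:" + st.sval);
                } else {
                    out.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("-5", false)); // prints [CHAR:-, NUMBER:5.0]
        System.out.println(tokens("-5", true));  // prints [NUMBER:-5.0]
    }
}
```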
■ You can use StreamTokenizer to parse a simple String character stream by supplying the StreamTokenizer constructor with a StringReader containing the desired characters. For example:
    StreamTokenizer st = new StreamTokenizer( new StringReader( "Mary had a little lamb." ) );
■ This snippet finds, sorts and displays all the distinct alphabetic words in a text file. It reduces them to lower case with lowerCaseMode( true ). But first resetSyntax( ) is used to clear the default syntax, under which characters such as periods and digits that follow letters would be absorbed into a word. The pushBack( ) method is used to read the same token twice. A TreeSet provides both the automatic uniqueness and the ordering of results:
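import java.io.*;
import java.util.*;

public class UniqueWords { // class name chosen for illustration
    public static void main(String[] args) {
        Set<String> words = new TreeSet<>( ); // sorted, no duplicates
        try {
            FileReader fr = new FileReader( "yourfile" );
            StreamTokenizer st = new StreamTokenizer( fr );
            st.resetSyntax( );          // every character becomes ordinary
            st.wordChars( 'a', 'z' );   // only letters form words
            st.wordChars( 'A', 'Z' );
            st.lowerCaseMode( true );   // sval is lower-cased for word tokens
            while ( st.nextToken( ) != StreamTokenizer.TT_EOF ) {
                st.pushBack( );         // the next call returns the same token again
                if ( st.nextToken( ) == StreamTokenizer.TT_WORD ) words.add( st.sval );
            }
        } catch (IOException e) {
            System.out.println( "Could not read file: " + e.getMessage( ) );
        }
        for ( String word : words ) {
            System.out.println( word );
        }
    }
}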
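pushBack( ) is perhaps more commonly used for one-token lookahead than for re-reading a token immediately. The sketch below assumes a made-up input format in which each word may be followed by an optional count; the class name PushBackDemo and the method parseCounts are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class; not part of the original notes.
class PushBackDemo {
    // Parses "name [count]" pairs; a missing count defaults to 1.
    static Map<String, Integer> parseCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype != StreamTokenizer.TT_WORD) {
                    continue; // skip anything that is not a name
                }
                String name = st.sval;
                // Look ahead one token for the optional count.
                if (st.nextToken() == StreamTokenizer.TT_NUMBER) {
                    counts.put(name, (int) st.nval);
                } else {
                    counts.put(name, 1);
                    st.pushBack(); // not a count: let the outer loop see this token again
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(parseCounts("apple 3 pear plum 2"));
        // prints {apple=3, pear=1, plum=2}
    }
}
```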