Tokenizers
26-Jul-16
Tokens



A tokenizer is a program that extracts tokens from an input
stream.
A token is a “word” or a significant punctuation mark.
A token has two parts:
Its value
Its kind, or type
For example, if we tokenize "while (x >= 0)" we might get these
tokens:
"while", keyword
"(", punctuation
"x", name
">=", operator
"0", integer
")", punctuation
Tokenizers as state machines

Tokenizers can be implemented as state machines, but with these
important differences:




To succeed (recognize a token), the tokenizer does not have to reach the
end of input; it only has to reach a final state
When the tokenizer returns a token, the remainder of the input string is
kept for use in getting the remaining tokens
Tokenizers are almost always implemented as state machines
We’ll do a quick tokenizer to recognize tokens in arithmetic
expressions:





Integers (digits only)
Variables (letters and digits, starting with a letter)
Operators, + - * / %
Parentheses, ( )
Errors (anything not in the above list)
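The two differences from an ordinary state machine noted above (stop at a final state rather than at end of input, and keep the rest of the input for later) can be sketched in a few lines. This is only an illustration: the class PrefixDemo and the helper leadingInteger are hypothetical names, not part of the tokenizer developed in these slides.

```java
import java.util.Arrays;

class PrefixDemo {
    // Hypothetical helper: recognizes an integer at the START of the input.
    // Returns {token, remainder}, or null if the input does not begin with a digit.
    static String[] leadingInteger(String input) {
        int i = 0;
        while (i < input.length() && Character.isDigit(input.charAt(i))) {
            i++; // stay in the "in number" state while digits keep coming
        }
        if (i == 0) {
            return null; // never reached a final state
        }
        // Success without consuming all input; the remainder is kept for later.
        return new String[] { input.substring(0, i), input.substring(i) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(leadingInteger("123+45"))); // prints [123, +45]
    }
}
```

Calling the helper again on the remainder would have to handle "+" first, which is the job of the full tokenizer.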
TokenType

public enum TokenType {
    INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR
}
Token

public class Token {
    private TokenType type;
    private String value;

    public Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    public TokenType getType() { return type; }
    public String getValue() { return value; }
}
Additions to the Token class




For my JUnit testing, I needed to ask whether my Tokenizer was
returning the correct Tokens
public boolean equals(Object object) {
    if (object == null) return false;
    if (!(object instanceof Token)) return false;
    Token that = (Token)object;
    return this.type == that.type
        && this.value.equals(that.value);
}
Since my tests were failing, I wanted to see what tokens I was actually
getting
public String toString() {
    return value + ":" + type;
}
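As a rough sketch, here is how the Token class might look with both additions in place, together with a small check of the kind a unit test would make. Two things here are assumptions rather than part of the slides: the hashCode() method (Java's contract expects it to be overridden along with equals()) and the TokenDemo class, which is a hypothetical driver. Access modifiers are left package-private so the whole example fits in one file.

```java
import java.util.Objects;

enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

class Token {
    private final TokenType type;
    private final String value;

    Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    TokenType getType() { return type; }
    String getValue() { return value; }

    @Override
    public boolean equals(Object object) {
        if (!(object instanceof Token)) return false; // instanceof is also false for null
        Token that = (Token) object;
        return this.type == that.type && this.value.equals(that.value);
    }

    // Assumption: not shown in the slides, added so equal tokens hash alike.
    @Override
    public int hashCode() {
        return Objects.hash(type, value);
    }

    @Override
    public String toString() {
        return value + ":" + type;
    }
}

class TokenDemo {
    public static void main(String[] args) {
        Token a = new Token(TokenType.INTEGER, "42");
        Token b = new Token(TokenType.INTEGER, "42");
        System.out.println(a.equals(b)); // prints true
        System.out.println(a);           // prints 42:INTEGER
    }
}
```

Without equals(), a comparison such as JUnit's assertEquals would fall back on object identity, and two separately constructed tokens would never match.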
The constructor and hasNext()

public class Tokenizer {
    private String input;
    private int position; // index of the last character consumed

    public Tokenizer(String input) {
        this.input = input.trim() + " "; // to simplify getting last token
        position = -1;
    }

    public boolean hasNext() {
        return position < input.length() - 2;
    }

    public Token next() { ... }
}
The shell of next()

public class Tokenizer {
    private enum States {
        READY, IN_NUMBER, IN_VARIABLE, ERROR
    }

    public Token next() {
        States state;
        String value = "";
        if (!hasNext()) {
            throw new IllegalStateException("No more tokens!");
        }
        state = States.READY;
        while ((++position) < input.length()) {
            char ch = input.charAt(position);
            switch (state) {
                case READY: { ... }
                case IN_VARIABLE: { ... }
                case IN_NUMBER: { ... }
                default: { ... } // returns an ERROR token
            }
        }
        assert false; // should never get here
        return null;
    }
}
The READY state

case READY:
    value = ch + "";
    if (Character.isWhitespace(ch)) break;
    if ("()".contains(ch + "")) {
        return new Token(TokenType.PARENTHESIS, value);
    }
    if ("+-*/%".contains(ch + "")) {
        return new Token(TokenType.OPERATOR, value);
    }
    if (Character.isLetter(ch)) {
        state = States.IN_VARIABLE;
        break;
    }
    if (Character.isDigit(ch)) {
        state = States.IN_NUMBER;
        break;
    }
    return new Token(TokenType.ERROR, value);
The IN_NUMBER state

case IN_NUMBER:
    if (Character.isDigit(ch)) {
        value += ch;
        break;
    } else {
        position--; // save char for next time
        return new Token(TokenType.INTEGER, value);
    }
The IN_VARIABLE state

case IN_VARIABLE:
    if (Character.isLetter(ch) || Character.isDigit(ch)) {
        value += ch;
        break;
    } else {
        position--; // save char for next time
        return new Token(TokenType.VARIABLE, value);
    }
The default case

default:
    return new Token(TokenType.ERROR, value);
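Putting the pieces together, the fragments above might assemble into something like the following. This is a sketch, not the definitive class: the Token here is a compact stand-in, the unused ERROR state is dropped, the method ends by throwing AssertionError instead of "assert false; return null;", and TokenizerDemo with its tokenize helper is a hypothetical driver added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

class Token {
    private final TokenType type;
    private final String value;
    Token(TokenType type, String value) { this.type = type; this.value = value; }
    TokenType getType() { return type; }
    String getValue() { return value; }
    @Override public String toString() { return value + ":" + type; }
}

class Tokenizer {
    private enum States { READY, IN_NUMBER, IN_VARIABLE }

    private final String input;
    private int position = -1; // index of the last character consumed

    Tokenizer(String input) {
        this.input = input.trim() + " "; // trailing blank ends the last token
    }

    boolean hasNext() {
        return position < input.length() - 2;
    }

    Token next() {
        if (!hasNext()) throw new IllegalStateException("No more tokens!");
        States state = States.READY;
        String value = "";
        while (++position < input.length()) {
            char ch = input.charAt(position);
            switch (state) {
                case READY:
                    value = ch + "";
                    if (Character.isWhitespace(ch)) break; // skip blanks between tokens
                    if ("()".contains(value)) return new Token(TokenType.PARENTHESIS, value);
                    if ("+-*/%".contains(value)) return new Token(TokenType.OPERATOR, value);
                    if (Character.isLetter(ch)) { state = States.IN_VARIABLE; break; }
                    if (Character.isDigit(ch)) { state = States.IN_NUMBER; break; }
                    return new Token(TokenType.ERROR, value);
                case IN_NUMBER:
                    if (Character.isDigit(ch)) { value += ch; break; }
                    position--; // save char for next time
                    return new Token(TokenType.INTEGER, value);
                case IN_VARIABLE:
                    if (Character.isLetter(ch) || Character.isDigit(ch)) { value += ch; break; }
                    position--; // save char for next time
                    return new Token(TokenType.VARIABLE, value);
            }
        }
        throw new AssertionError("should never get here");
    }
}

class TokenizerDemo {
    // Hypothetical helper: collects the printed form of every token.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Tokenizer tokenizer = new Tokenizer(text);
        while (tokenizer.hasNext()) {
            out.add(tokenizer.next().toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(x1 + 42) * y"));
        // prints [(:PARENTHESIS, x1:VARIABLE, +:OPERATOR, 42:INTEGER, ):PARENTHESIS, *:OPERATOR, y:VARIABLE]
    }
}
```

Note how "position--" lets the character that ended a number or variable be read again as the start of the next token, so "42)" yields an INTEGER followed by a PARENTHESIS.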
java.util.StringTokenizer

StringTokenizer is a trivial tokenizer provided by Sun



Everything is either a “token” or a “delimiter”
The most important methods are hasMoreTokens() and
nextToken()
There are three constructors:
StringTokenizer(String str)
Delimiters are whitespace characters; any sequence of non-whitespace
characters is returned as a token
StringTokenizer(String str, String delim)
Same as above, except you get to specify which characters are
delimiters
StringTokenizer(String str, String delim, boolean returnDelims)
Same as above, except you get to say you also want the delimiters
returned as tokens
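A short sketch of the three constructors in use. The class StringTokenizerDemo and its split helper are hypothetical names for illustration; the StringTokenizer calls themselves are the standard java.util API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Hypothetical helper: drains a StringTokenizer into a list.
    static List<String> split(StringTokenizer st) {
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Whitespace delimiters
        System.out.println(split(new StringTokenizer("x1 + 42")));          // prints [x1, +, 42]
        // "+" is the only delimiter, and it is discarded
        System.out.println(split(new StringTokenizer("x1+42", "+")));       // prints [x1, 42]
        // "+" is a delimiter, but it is also returned as a token
        System.out.println(split(new StringTokenizer("x1+42", "+", true))); // prints [x1, +, 42]
    }
}
```

Unlike the Tokenizer built earlier, StringTokenizer returns plain strings with no token type attached.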
java.io.StreamTokenizer

StreamTokenizer is a much more powerful (and much more
complex) tokenizer




It is basically capable of tokenizing C and Java programs, including
integers, doubles, and comments
There are a large number of possible settings, so that the tokenizer can be
customized
The constructor is StreamTokenizer(Reader r), where Reader is an
abstract class for reading character streams
The most important method is int nextToken(), where the returned int
tells you what kind of token it found


Once you know what kind of token has been found, you access fields of the
tokenizer to get its value
I’m not going to cover StreamTokenizer in my lectures


All the details are in the Java API
You may want to use StreamTokenizer in subsequent assignments
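For anyone who does try StreamTokenizer, a minimal sketch with the default settings follows. The class StreamTokenizerDemo and its tokenize helper are hypothetical; the fields ttype-style return value, nval, and sval and the TT_ constants are part of the standard java.io.StreamTokenizer API. Be aware that the defaults treat '/' as a comment character and '-' and '.' as parts of numbers, so arithmetic expressions generally need customized settings.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Hypothetical helper: describes each token found in the text.
    static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            int kind;
            // nextToken() returns an int saying what kind of token was found
            while ((kind = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (kind == StreamTokenizer.TT_NUMBER) {
                    out.add("number " + st.nval);    // numeric value is in the nval field
                } else if (kind == StreamTokenizer.TT_WORD) {
                    out.add("word " + st.sval);      // word text is in the sval field
                } else {
                    out.add("char " + (char) kind);  // ordinary character: the int is the char
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x1 + 42")); // prints [word x1, char +, number 42.0]
    }
}
```

Numbers are always delivered as doubles in nval, which is why 42 shows up as 42.0.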
The End