Tokenizers 6-Nov-15 Tokens   A tokenizer is a program that extracts tokens from an input stream A token has two parts:    Its value—this is just the.


Tokenizers

26-Apr-20

Tokens

A tokenizer is a program that extracts tokens from an input stream
A token has two parts:
  - Its value: this is just the characters making up the token
  - Its kind, or type
For example, if we tokenize "while (x >= 0)" we might get these tokens:
  - "while", keyword
  - "(", punctuation
  - "x", name
  - ">=", operator
  - "0", integer
  - ")", punctuation

Tokenizers as state machines

Tokenizers can be implemented as state machines, but with these important differences:
  - To succeed (recognize a token), the tokenizer does not have to reach the end of input; it only has to reach a final state
  - When the tokenizer returns a token, the remainder of the input string is kept for use in getting the remaining tokens
Tokenizers are almost always implemented as state machines
We’ll do a quick tokenizer to recognize tokens in arithmetic expressions:
  - Integers (digits only)
  - Variables (letters and digits, starting with a letter)
  - Operators, + - * / %
  - Parentheses, ( )
  - Errors (anything not in the above list)

Tokenizers as DFAs

A tokenizer is a kind of DFA, but…
[State diagram: from READY, a digit leads to INTEGER, which loops on digit; a letter leads to VARIABLE, which loops on letter or digit; any of +, -, *, /, (, ) leads to OPERATOR]
…if there is no valid transition:
  - If in a “final” state, return with a token; the next call starts in the READY state with the next input character
  - If not in a final state, that’s a syntax error
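The rule above (keep following transitions, and when none is valid, succeed only if the current state is final) can be sketched as a small transition function plus a scanning loop. This is an illustrative sketch, not the tokenizer developed in the following slides: the names DfaSketch, step and scan are hypothetical, every state except READY is assumed to be final, and whitespace is not handled.

```java
class DfaSketch {
    enum State { READY, INTEGER, VARIABLE, OPERATOR }

    // Returns the next state, or null when there is no valid transition.
    static State step(State s, char ch) {
        switch (s) {
            case READY:
                if (Character.isDigit(ch)) return State.INTEGER;
                if (Character.isLetter(ch)) return State.VARIABLE;
                if ("+-*/()".indexOf(ch) >= 0) return State.OPERATOR;
                return null;
            case INTEGER:
                return Character.isDigit(ch) ? State.INTEGER : null;
            case VARIABLE:
                return Character.isLetterOrDigit(ch) ? State.VARIABLE : null;
            default:
                return null; // OPERATOR has no outgoing transitions
        }
    }

    // Scans one token beginning at index start and returns "KIND:text".
    // Assumption: READY is the only non-final state.
    static String scan(String input, int start) {
        State state = State.READY;
        int pos = start;
        while (pos < input.length()) {
            State next = step(state, input.charAt(pos));
            if (next == null) break;   // no valid transition: stop here
            state = next;
            pos++;
        }
        if (state == State.READY) {
            // stopped in a non-final state, so this is a syntax error
            throw new IllegalArgumentException("Syntax error at index " + start);
        }
        return state + ":" + input.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(scan("12+x", 0)); // INTEGER:12
        System.out.println(scan("12+x", 2)); // OPERATOR:+
        System.out.println(scan("12+x", 3)); // VARIABLE:x
    }
}
```

Note that scan stops at the "+" without reaching the end of the input, and the caller resumes from the next index to get the remaining tokens.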

TokenType

    public enum TokenType {
        INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR;
    }

Token

    public class Token {
        private TokenType type;
        private String value;
        public Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }
        public TokenType getType() { return type; }
        public String getValue() { return value; }
    }

Additions to the Token class

For my JUnit testing, I needed to ask whether my Tokenizer was returning the correct Tokens
    @Override
    public boolean equals(Object object) {
        // guard against null and non-Token arguments before casting
        if (!(object instanceof Token)) return false;
        Token that = (Token) object;
        return this.type == that.type && this.value.equals(that.value);
    }
When tests fail, you need to see what Tokens you are getting
    @Override
    public String toString() {
        return value + ":" + type;
    }
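To illustrate why these additions matter, the sketch below repeats a cut-down Token class and compares two separately constructed tokens. The hashCode() override and the TokenEqualsDemo class are not in the slides; they are assumptions added so that the example is self-contained and stays consistent with the usual equals/hashCode contract.

```java
import java.util.Objects;

enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

class Token {
    private final TokenType type;
    private final String value;

    Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    @Override
    public boolean equals(Object object) {
        if (!(object instanceof Token)) return false;
        Token that = (Token) object;
        return this.type == that.type && this.value.equals(that.value);
    }

    // Assumption: not shown in the slides; kept consistent with equals().
    @Override
    public int hashCode() {
        return Objects.hash(type, value);
    }

    @Override
    public String toString() {
        return value + ":" + type;
    }
}

class TokenEqualsDemo {
    public static void main(String[] args) {
        Token a = new Token(TokenType.INTEGER, "42");
        Token b = new Token(TokenType.INTEGER, "42");
        System.out.println(a == b);      // false: two distinct objects
        System.out.println(a.equals(b)); // true: same type and value
        System.out.println(a);           // 42:INTEGER
    }
}
```

Without the equals override, a JUnit assertEquals on two such tokens would fall back to identity comparison and fail even when the tokenizer is correct.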

The constructor and hasNext()

    public class Tokenizer {
        private String input;
        private int position;
        public Tokenizer(String input) {
            // add space to simplify getting last token
            this.input = input.trim() + " ";
            position = -1;
        }
        public boolean hasNext() {
            return position < input.length() - 2;
        }
        public Token next() { ... }
    }

The shell of next()

    public class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE, ERROR }
        public Token next() {
            States state;
            String value = "";
            if (!hasNext()) {
                throw new IllegalStateException("No more tokens!");
            }
            state = States.READY;
            while ((++position) < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY: { ... }
                    case IN_VARIABLE: { ... }
                    case IN_NUMBER: { ... }
                    default: { ... } return new Token(TokenType.ERROR, value);
                }
            }
            assert false; // should never get here
            return null;
        }
    }

The READY state

    case READY:
        value = ch + "";
        if (Character.isWhitespace(ch)) break;
        if ("()".contains(ch + "")) {
            return new Token(TokenType.PARENTHESIS, value);
        }
        if ("+-*/%".contains(ch + "")) {
            return new Token(TokenType.OPERATOR, value);
        }
        if (Character.isLetter(ch)) {
            state = States.IN_VARIABLE;
            break;
        }
        if (Character.isDigit(ch)) {
            state = States.IN_NUMBER;
            break;
        }
        return new Token(TokenType.ERROR, value);

The IN_NUMBER state

    case IN_NUMBER:
        if (Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.INTEGER, value);
        }

The IN_VARIABLE state

    case IN_VARIABLE:
        if (Character.isLetter(ch) || Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.VARIABLE, value);
        }

The default case

    default:
        return new Token(TokenType.ERROR, value);
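Putting the pieces from the preceding slides together gives something like the following. This is a sketch assembled from the fragments above rather than the author's complete file: the TokenizerDemo class and its tokenize helper are hypothetical additions, the unused ERROR state is dropped, the closing assert false / return null is replaced by an exception, and the classes are package-private so that they fit in one source file.

```java
import java.util.ArrayList;
import java.util.List;

enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

class Token {
    private final TokenType type;
    private final String value;
    Token(TokenType type, String value) { this.type = type; this.value = value; }
    TokenType getType() { return type; }
    String getValue() { return value; }
    @Override public String toString() { return value + ":" + type; }
}

class Tokenizer {
    private enum States { READY, IN_NUMBER, IN_VARIABLE }

    private final String input;
    private int position = -1; // index of the last character consumed

    Tokenizer(String input) {
        // add space to simplify getting last token
        this.input = input.trim() + " ";
    }

    boolean hasNext() { return position < input.length() - 2; }

    Token next() {
        if (!hasNext()) throw new IllegalStateException("No more tokens!");
        States state = States.READY;
        String value = "";
        while (++position < input.length()) {
            char ch = input.charAt(position);
            switch (state) {
                case READY:
                    value = ch + "";
                    if (Character.isWhitespace(ch)) break;
                    if ("()".contains(value)) return new Token(TokenType.PARENTHESIS, value);
                    if ("+-*/%".contains(value)) return new Token(TokenType.OPERATOR, value);
                    if (Character.isLetter(ch)) { state = States.IN_VARIABLE; break; }
                    if (Character.isDigit(ch)) { state = States.IN_NUMBER; break; }
                    return new Token(TokenType.ERROR, value);
                case IN_NUMBER:
                    if (Character.isDigit(ch)) { value += ch; break; }
                    position--; // save char for next time
                    return new Token(TokenType.INTEGER, value);
                case IN_VARIABLE:
                    if (Character.isLetter(ch) || Character.isDigit(ch)) { value += ch; break; }
                    position--; // save char for next time
                    return new Token(TokenType.VARIABLE, value);
                default:
                    return new Token(TokenType.ERROR, value);
            }
        }
        throw new IllegalStateException("Ran off the end of the input");
    }
}

class TokenizerDemo {
    // Hypothetical helper: collects the printed form of every token.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Tokenizer t = new Tokenizer(text);
        while (t.hasNext()) out.add(t.next().toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x1+42")); // [x1:VARIABLE, +:OPERATOR, 42:INTEGER]
    }
}
```

The trailing space added by the constructor is what lets the last INTEGER or VARIABLE token terminate through the same "no valid transition" path as every other token.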

java.util.StringTokenizer

StringTokenizer is a trivial tokenizer provided by Sun
Everything is either a “token” or a “delimiter”
The most important methods are hasMoreTokens() and nextToken()
There are three constructors:
  - StringTokenizer(String str)
      Delimiters are whitespace characters; any sequence of non-whitespace characters is returned as a token
  - StringTokenizer(String str, String delim)
      Same as above, except you get to specify which characters are delimiters
  - StringTokenizer(String str, String delim, boolean returnDelims)
      Same as above, except you get to say you also want the delimiters returned as tokens
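A short sketch of the three constructors in use. The StringTokenizerDemo class and its split helper are hypothetical names introduced for this example; the StringTokenizer calls themselves are the standard java.util API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Hypothetical helper: drains a StringTokenizer into a list.
    static List<String> split(StringTokenizer st) {
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Whitespace delimiters
        System.out.println(split(new StringTokenizer("x1 + 42")));         // [x1, +, 42]
        // "+" as the only delimiter; the delimiter itself is discarded
        System.out.println(split(new StringTokenizer("x1+42", "+")));       // [x1, 42]
        // Same, but delimiters are returned as tokens too
        System.out.println(split(new StringTokenizer("x1+42", "+", true))); // [x1, +, 42]
    }
}
```

Unlike the Tokenizer built earlier, the results here are plain strings with no kind attached, so the caller still has to classify each one.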

java.io.StreamTokenizer

StreamTokenizer is a much more powerful (and much more complex) tokenizer
  - There are a large number of possible settings, so that the tokenizer can be customized
  - The constructor is StreamTokenizer(Reader r), where Reader is an abstract class for reading character streams
  - The most important method is int nextToken(), where the returned int tells you what kind of token it found
  - Once you know what kind of token has been found, you access fields of the tokenizer to get its value
I’m not going to cover StreamTokenizer in my lectures
  - All the details are in the Java API
  - It is basically capable of tokenizing C and Java programs, including integers, doubles, and comments
  - It’s really ugly
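For reference, a minimal sketch of the nextToken() pattern described above, using the tokenizer's default settings. The StreamTokenizerDemo class and its tokenize helper are hypothetical; TT_EOF, TT_NUMBER, TT_WORD, nval and sval are the standard java.io.StreamTokenizer members.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Hypothetical helper: describes every token found in the text.
    static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            int kind;
            // The returned int says what kind of token was found.
            while ((kind = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (kind == StreamTokenizer.TT_NUMBER) {
                    out.add("number " + st.nval);   // value is in the nval field
                } else if (kind == StreamTokenizer.TT_WORD) {
                    out.add("word " + st.sval);     // value is in the sval field
                } else {
                    out.add("char " + (char) kind); // ordinary single character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("count = 42")); // [word count, char =, number 42.0]
    }
}
```

Numbers come back as doubles by default, which is why 42 prints as 42.0.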

The End