Recognizers

26-Jul-16
Parsers and recognizers

Given a grammar (say, in BNF) and a string,


A recognizer will tell whether the string belongs to the language defined
by the grammar
A parser will try to build a tree corresponding to the string, according to
the rules of the grammar
Input string    Recognizer result    Parser result
2+3*4           true                 a parse tree
2+3*            false                Error
Building a recognizer


One way of building a recognizer from a grammar is
called recursive descent
Recursive descent is pretty easy to implement, once
you figure out the basic ideas



Recursive descent is a great way to build a “quick and dirty”
recognizer or parser
Production-quality parsers use much more sophisticated and
efficient techniques
In the following slides, I’ll talk about how to do
recursive descent, and give some examples in Java
Recognizing simple alternatives, I

Consider the following BNF rule:



<add_operator> ::= “+” | “-”
That is, an add operator is a plus sign or a minus sign
To recognize an add operator, we need to get the next
token, and test whether it is one of these characters


If it is a plus or a minus, we simply return true
But what if it isn’t?


We not only need to return false, but we also need to put the token
back because it doesn’t belong to us, and some other grammar rule
probably wants it
We need a tokenizer that can take back tokens

We will make do with putting back only one token at a time
Creating a decorator class



A Decorator class is a class that extends another class
and adds functionality
In this case, StringTokenizer does almost everything we need,
except that it cannot take back tokens
We can decorate StringTokenizer and add the ability to
take back tokens

For simplicity, we’ll allow only a single token to be pushed back
Our decorator class will need a little extra storage, and
will need to override and extend some methods
PushbackTokenizer I

import java.util.StringTokenizer;

public class PushbackTokenizer extends StringTokenizer {
    String pushedToken = null; // to hold the pushed-back token

    public PushbackTokenizer(String s) {
        super(s); // superclass has no default constructor
    }

    public void pushBack(String token) { // added method
        pushedToken = token;
    }
PushbackTokenizer II
    public boolean hasMoreTokens() {
        if (pushedToken != null) return true;
        else return super.hasMoreTokens();
    }

Notice that we overrode this method, but when we have nothing special to
do, we just let the superclass’s method handle it

    public String nextToken() {
        if (pushedToken == null) return super.nextToken();
        // We only return a pushedToken once
        String result = pushedToken;
        pushedToken = null;
        return result;
    }
} // end of class PushbackTokenizer

Again, we just use the superclass’s method; we don’t reinvent it
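For reference, the fragments from the two PushbackTokenizer slides can be assembled into one compilable file. This is a sketch: the @Override annotations, the private modifier, and the PushbackDemo class are additions that are not on the slides.

```java
import java.util.StringTokenizer;

// The PushbackTokenizer fragments assembled into a single class.
class PushbackTokenizer extends StringTokenizer {
    private String pushedToken = null; // holds the single pushed-back token, if any

    PushbackTokenizer(String s) {
        super(s); // StringTokenizer has no no-argument constructor
    }

    // Added method: remember one token so the next nextToken() call returns it.
    public void pushBack(String token) {
        pushedToken = token;
    }

    @Override
    public boolean hasMoreTokens() {
        return pushedToken != null || super.hasMoreTokens();
    }

    @Override
    public String nextToken() {
        if (pushedToken == null) return super.nextToken();
        String result = pushedToken; // hand the pushed-back token out exactly once
        pushedToken = null;
        return result;
    }
}

// Hypothetical demo class, not part of the slides.
class PushbackDemo {
    public static void main(String[] args) {
        PushbackTokenizer pb = new PushbackTokenizer("2 + 3");
        String first = pb.nextToken(); // "2"
        pb.pushBack(first);            // give it back
        while (pb.hasMoreTokens()) {
            System.out.println(pb.nextToken()); // prints 2, then +, then 3
        }
    }
}
```

Note that countTokens() is not overridden here, so it does not account for a pushed-back token.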
Sample use of PushbackTokenizer

public static void main(String[] args) {
    PushbackTokenizer pb = new PushbackTokenizer("This is too cool");
    String token;
    System.out.print(pb.nextToken() + " ");          // “This”
    System.out.print(pb.nextToken() + " ");          // “is”
    System.out.print(token = pb.nextToken() + " ");  // “too”
    pb.pushBack(token);                              // return “too”
    System.out.print(pb.nextToken() + " ");          // get “too” again
    System.out.print(pb.nextToken() + " ");          // “cool”
}


Output: This is too too  cool
Question: Why the extra space?
Recognizing simple alternatives, II


Our rule is <add_operator> ::= “+” | “-”
Our method for recognizing an <add_operator>
(which we will simply call addOperator) looks like
this:

public boolean addOperator() {
    Get the next token, call it t
    If t is a “+”, return true
    If t is a “-”, return true
    If t is anything else,
        put the token back
        return false
}
Helper methods

We could turn the preceding pseudocode directly into
Java code




But we will be writing a lot of very similar code...
...and it won’t be very readable code
We should write some auxiliary or “helper” methods to hide
some of the details for us
First helper method:

private boolean symbol(String expectedSymbol)

Gets the next token and tests whether it matches the expectedSymbol
If it matches, return true
If it doesn’t match, put the token back and return false
We’ll look more closely at this method in a moment
Recognizing simple alternatives, III


Our rule is <add_operator> ::= “+” | “-”
Our pseudocode is:


public boolean addOperator() {
    Get the next token, call it t
    If t is a “+”, return true
    If t is a “-”, return true
    If t is anything else,
        put the token back
        return false
}
Thanks to our helper method, our actual Java code is:

public boolean addOperator() {
    return symbol("+") || symbol("-");
}
Categories of tokens

Tokens are always strings, but they come in a variety of kinds:








Names: "limit", "y", "maxValue"
Keywords: "if", "while", "instanceof"
Numbers: "25", "3"
Symbols: "+", "=", ";"
Special: "\n", end_of_input
Instead of treating tokens as simple strings, it’s convenient to create a Token
class that holds both its string value and a constant telling what kind of token
it is
class Token {
    String value;
    int type;
    // ...and this class should define some constants to represent the various types:
    public static final int NAME = 1;
    public static final int SYMBOL = 2; // etc.
}
If you are using Java 5.0 or later, this is what enums were invented for!
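A minimal sketch of what the enum-based version might look like. The nested enum name Type, its constants, the accessors, and the TokenDemo class are assumptions for illustration; the slides only show the int-constant form.

```java
// Sketch of a Token class that uses an enum for the token kind.
class Token {
    // Hypothetical set of kinds, following the categories listed above.
    enum Type { NAME, KEYWORD, NUMBER, SYMBOL, END_OF_LINE, END_OF_INPUT }

    private final Type type;
    private final String value;

    Token(Type type, String value) {
        this.type = type;
        this.value = value;
    }

    Type getType() { return type; }
    String getValue() { return value; }

    @Override
    public String toString() {
        return type + "(" + value + ")";
    }
}

// Hypothetical demo class.
class TokenDemo {
    public static void main(String[] args) {
        Token plus = new Token(Token.Type.SYMBOL, "+");
        System.out.println(plus); // prints SYMBOL(+)
    }
}
```

With this variant, code that refers to Token.SYMBOL would refer to Token.Type.SYMBOL instead, and type parameters would be declared as Token.Type rather than int.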
Implementing symbol

symbol gets a token, makes sure it’s a symbol,
compares it to the desired value, possibly puts the token
back, and returns true or false




We will want to do something similar for numbers, names,
end of lines, and maybe end of input
It would be foolish to write and debug all of these separately
Again, we should use an auxiliary method
private boolean symbol(String expectedSymbol) {
    return nextTokenMatches(Token.SYMBOL, expectedSymbol);
}
nextTokenMatches #1

The nextTokenMatches method should:





Get a token
Compare types and values
Return true if the token is as expected
Put the token back and return false if it doesn’t match
private boolean nextTokenMatches(int type, String value) {
    Token t = tokenizer.next();
    if (type == t.getType() && value.equals(t.getValue())) {
        return true;
    } else {
        tokenizer.pushBack(t);
        return false;
    }
}
nextTokenMatches #2

The previous method is fine for symbols, but what if we only care
about the type?




For example, we want to get a number—any number
We need to compare only type, not value
private boolean nextTokenMatches(int type, String value) { // omit the value parameter
    Token t = tokenizer.next();
    if (type == t.getType() && value.equals(t.getValue())) return true; // omit the value test
    else tokenizer.pushBack(t);
    return false;
}
The two versions of nextTokenMatches are difficult to combine
and fairly small, so we won’t worry about the code duplication
too much
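The two versions can live side by side as overloads. The sketch below is one possible arrangement, not the course's actual code: the TokenTokenizer class, the NUMBER and END constants, and the rule that tokens are separated by spaces are all assumptions made so that the example is self-contained.

```java
import java.util.StringTokenizer;

class Token {
    static final int NAME = 1, SYMBOL = 2, NUMBER = 3, END = 4; // NUMBER and END are assumed
    private final int type;
    private final String value;
    Token(int type, String value) { this.type = type; this.value = value; }
    int getType() { return type; }
    String getValue() { return value; }
}

// Hypothetical tokenizer: splits on spaces, classifies each piece by its
// first character, and supports pushing back a single Token.
class TokenTokenizer {
    private final StringTokenizer words;
    private Token pushed = null;
    TokenTokenizer(String input) { words = new StringTokenizer(input); }

    Token next() {
        if (pushed != null) { Token t = pushed; pushed = null; return t; }
        if (!words.hasMoreTokens()) return new Token(Token.END, "");
        String w = words.nextToken();
        if (Character.isDigit(w.charAt(0))) return new Token(Token.NUMBER, w);
        if (Character.isLetter(w.charAt(0))) return new Token(Token.NAME, w);
        return new Token(Token.SYMBOL, w);
    }

    void pushBack(Token t) { pushed = t; }
}

class Recognizer {
    private final TokenTokenizer tokenizer;
    Recognizer(String input) { tokenizer = new TokenTokenizer(input); }

    // Type and value must both match.
    private boolean nextTokenMatches(int type, String value) {
        Token t = tokenizer.next();
        if (type == t.getType() && value.equals(t.getValue())) return true;
        tokenizer.pushBack(t);
        return false;
    }

    // Only the type must match: the value parameter and the value test are omitted.
    private boolean nextTokenMatches(int type) {
        Token t = tokenizer.next();
        if (type == t.getType()) return true;
        tokenizer.pushBack(t);
        return false;
    }

    boolean symbol(String expectedSymbol) { return nextTokenMatches(Token.SYMBOL, expectedSymbol); }
    boolean number() { return nextTokenMatches(Token.NUMBER); }
    boolean addOperator() { return symbol("+") || symbol("-"); }
}
```

For the input "25 + x", addOperator() fails first and leaves the 25 in place, after which number() and then addOperator() succeed.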
addOperator reprise



public boolean addOperator() {
    return symbol("+") || symbol("-");
}

private boolean symbol(String expectedSymbol) {
    return nextTokenMatches(Token.SYMBOL, expectedSymbol);
}

private boolean nextTokenMatches(int type, String value) {
    Token t = tokenizer.next();
    if (type == t.getType() && value.equals(t.getValue())) return true;
    else tokenizer.pushBack(t);
    return false;
}
Sequences, I

Suppose we want to recognize a grammar rule in
which one thing follows another, for example,

<empty_list> ::= “[” “]”

The code for this would be fairly simple...

public boolean emptyList() {
    return symbol("[") && symbol("]");
}
...except for one thing...


What happens if we get a “[” and don’t get a “]”?
The above method won’t work. Why not?

Only the second call to symbol failed, so only the second token gets
pushed back; the “[” has already been consumed
Sequences, II


The grammar rule is <empty_list> ::= “[” “]”
And the token string contains [ 5 ]

Solution #1: Write a pushBack method that can keep track of more than
one token at a time (say, in a Stack)
This will allow you to put back both the “[” and the “5”
The code gets pretty messy
You have to be very careful of the order in which you return tokens

Solution #2: Call it an error
You might be able to get away with this, depending on the grammar
For example, for any reasonable grammar, (2 + 3 +) is clearly an error

Solution #3: Change the grammar
Tricky, and may not be possible

Solution #4: Combine rules
See the next slide
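Solution #1 could be sketched as follows, using a java.util.Deque as the stack. The class names StackPushbackTokenizer and StackPushbackDemo are hypothetical; the slides do not give code for this option.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.StringTokenizer;

// Sketch of a tokenizer that can take back any number of tokens.
class StackPushbackTokenizer extends StringTokenizer {
    private final Deque<String> pushed = new ArrayDeque<>(); // used as a stack

    StackPushbackTokenizer(String s) {
        super(s);
    }

    // The most recently pushed token is the next one handed out.
    public void pushBack(String token) {
        pushed.push(token);
    }

    @Override
    public boolean hasMoreTokens() {
        return !pushed.isEmpty() || super.hasMoreTokens();
    }

    @Override
    public String nextToken() {
        return pushed.isEmpty() ? super.nextToken() : pushed.pop();
    }
}

// Hypothetical demo class.
class StackPushbackDemo {
    public static void main(String[] args) {
        StackPushbackTokenizer t = new StackPushbackTokenizer("[ 5 ]");
        String first = t.nextToken();  // "["
        String second = t.nextToken(); // "5"
        // Push back in the reverse of reading order, so "[" comes out first again.
        t.pushBack(second);
        t.pushBack(first);
        System.out.println(t.nextToken() + " " + t.nextToken() + " " + t.nextToken()); // prints [ 5 ]
    }
}
```

The demo shows why the order matters: pushing the tokens back in the order they were read would hand them out reversed.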
Sequences, III

Suppose the grammar really says
<list> ::= “[” “]” | “[” <number> “]”

Now your pseudocode should look something like this:


public boolean list() {
    if first token is “[” {
        if second token is “]” return true
        else if second token is a number {
            if third token is “]” return true
            else error
        }
        else put back first token
    }
    return false
}
Revised grammar:

<list> ::= “[” <rest_of_list>
<rest_of_list> ::= “]” | <number> “]”
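The revised grammar maps onto two short methods. The following is a sketch under several assumptions: tokens are plain strings separated by spaces, a number is a run of digits, the empty string stands for end of input, and error throws an exception (the slides do not define error).

```java
import java.util.StringTokenizer;

class ListRecognizer {
    private final StringTokenizer tokens;
    private String pushed = null; // the single pushed-back token, if any

    ListRecognizer(String input) {
        tokens = new StringTokenizer(input);
    }

    private String next() {
        if (pushed != null) { String t = pushed; pushed = null; return t; }
        return tokens.hasMoreTokens() ? tokens.nextToken() : ""; // "" marks end of input
    }

    private boolean symbol(String expected) {
        String t = next();
        if (t.equals(expected)) return true;
        pushed = t;
        return false;
    }

    private boolean number() {
        String t = next();
        if (!t.isEmpty() && t.chars().allMatch(Character::isDigit)) return true;
        pushed = t;
        return false;
    }

    // Assumed behavior: report a syntax error by throwing.
    private static void error(String message) {
        throw new IllegalStateException(message);
    }

    // <list> ::= "[" <rest_of_list>
    boolean list() {
        if (!symbol("[")) return false;
        if (!restOfList()) error("Expected \"]\" or a number after \"[\"");
        return true;
    }

    // <rest_of_list> ::= "]" | <number> "]"
    boolean restOfList() {
        if (symbol("]")) return true;
        if (number()) {
            if (!symbol("]")) error("Expected \"]\" after the number");
            return true;
        }
        return false;
    }
}
```

Each method needs to put back at most one token, so single-token pushback is enough here.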
Simple sequences in Java

Suppose you have this rule:



<factor> ::= “(” <expression> “)”
A good way to do this is often to test whether the grammar rule
is not met
public boolean factor() {
    if (symbol("(")) {
        if (!expression()) error("Error in parenthesized expression");
        if (!symbol(")")) error("Unclosed parenthetical expression");
        return true;
    }
    return false;
}
Sequences and alternatives




Here’s the real grammar rule for <factor>:
<factor> ::= <name>
           | <number>
           | “(” <expression> “)”
And here’s the actual code:
public boolean factor() {
    if (name()) return true;
    if (number()) return true;
    if (symbol("(")) {
        if (!expression()) error("Error in parenthesized expression");
        if (!symbol(")")) error("Unclosed parenthetical expression");
        return true;
    }
    return false;
}
Recursion, I

Here’s an unfortunate (but legal!) grammar rule:

<expression> ::= <expression> “+” <term>

Here’s some code for it:

public boolean expression() {
    if (!expression()) return false;
    if (!addOperator()) return true;
    if (!term()) error("Error in expression after '+' or '-'");
    return true;
}

Do you see the problem?
We aren’t recurring on a simpler case, so we have an
infinite recursion
Our grammar rule is left recursive (the recursive part is the
leftmost thing in the definition)
Recursion, II

Here’s our unfortunate grammar rule again:

<expression> ::= <expression> “+” <term>

Here’s an equivalent, right recursive rule:

<expression> ::= <term> “+” <expression>

Here’s some (much happier!) code for it:

public boolean expression() {
    if (!term()) return false;
    if (!addOperator()) return true;
    if (!expression()) error("Error in expression after '+' or '-'");
    return true;
}
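To try the right recursive version, it needs a tokenizer and a term method. The sketch below supplies stand-ins: a term is assumed to be just an unsigned integer, tokens are assumed to be separated by spaces, and error is assumed to throw an exception. None of these helpers come from the slides.

```java
import java.util.StringTokenizer;

class RightRecursiveRecognizer {
    private final StringTokenizer tokens;
    private String pushed = null; // the single pushed-back token, if any

    RightRecursiveRecognizer(String input) {
        tokens = new StringTokenizer(input);
    }

    private String next() {
        if (pushed != null) { String t = pushed; pushed = null; return t; }
        return tokens.hasMoreTokens() ? tokens.nextToken() : ""; // "" marks end of input
    }

    private boolean symbol(String expected) {
        String t = next();
        if (t.equals(expected)) return true;
        pushed = t;
        return false;
    }

    // Assumption: a <term> is an unsigned integer.
    private boolean term() {
        String t = next();
        if (!t.isEmpty() && t.chars().allMatch(Character::isDigit)) return true;
        pushed = t;
        return false;
    }

    private boolean addOperator() {
        return symbol("+") || symbol("-");
    }

    // Assumed behavior: report a syntax error by throwing.
    private static void error(String message) {
        throw new IllegalStateException(message);
    }

    // Each recursive call happens only after a term and an operator have
    // been consumed, so the remaining input keeps shrinking.
    boolean expression() {
        if (!term()) return false;
        if (!addOperator()) return true;
        if (!expression()) error("Error in expression after '+' or '-'");
        return true;
    }
}
```

With these stand-ins, "2 + 3 - 4" is accepted, "+ 2" is rejected with false, and "2 +" raises an error.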
Extended BNF—optional parts

Extended BNF uses brackets to indicate optional parts of rules


Example:
<if_statement> ::=
    “if” <condition> <statement> [ “else” <statement> ]
Pseudocode for this example:
public boolean ifStatement() {
    if you don’t see “if”, return false
    if you don’t see a condition, return an error
    if you don’t see a statement, return an error
    if you see an “else” {
        if you see a statement, return true
        else return an error
    }
    else return true;
}
Extended BNF—zero or more

Extended BNF uses braces to indicate parts of a rule
that can be repeated

Example: <expression> ::= <term> { “+” <term> }



Note that this is not a good definition for an expression
Pseudocode for example:
public boolean expression() {
    if you don’t see a term, return false
    while you see a “+” {
        if you don’t see a term, return an error
    }
    return true
}
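The braces in the rule turn into a while loop. Here is a sketch of the pseudocode in Java, again assuming that a term is an unsigned integer, that tokens are separated by spaces, and that an error is reported by throwing. The recognizes helper, which also checks that no input is left over, is an addition that is not in the pseudocode.

```java
import java.util.StringTokenizer;

class LoopRecognizer {
    private final StringTokenizer tokens;
    private String pushed = null; // the single pushed-back token, if any

    LoopRecognizer(String input) {
        tokens = new StringTokenizer(input);
    }

    private String next() {
        if (pushed != null) { String t = pushed; pushed = null; return t; }
        return tokens.hasMoreTokens() ? tokens.nextToken() : ""; // "" marks end of input
    }

    private boolean symbol(String expected) {
        String t = next();
        if (t.equals(expected)) return true;
        pushed = t;
        return false;
    }

    // Assumption: a <term> is an unsigned integer.
    private boolean term() {
        String t = next();
        if (!t.isEmpty() && t.chars().allMatch(Character::isDigit)) return true;
        pushed = t;
        return false;
    }

    // Peeks at the next token without consuming it.
    private boolean atEnd() {
        String t = next();
        pushed = t;
        return t.isEmpty();
    }

    // <expression> ::= <term> { "+" <term> }
    boolean expression() {
        if (!term()) return false;
        while (symbol("+")) {
            if (!term()) throw new IllegalStateException("Expected a term after '+'");
        }
        return true;
    }

    // Hypothetical convenience method: the whole input must be one expression.
    static boolean recognizes(String input) {
        LoopRecognizer r = new LoopRecognizer(input);
        return r.expression() && r.atEnd();
    }
}
```

Without the end-of-input check, "2 3" would be accepted, because expression() stops after the first term and leaves the 3 unread.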
Back to parsers


A parser is like a recognizer
The difference is that, when a parser recognizes
something, it does something about it


Usually, what a parser does is build a tree
If the thing that is being parsed is a program, then

You can write another program that “walks” the tree and
executes the statements and expressions as it finds them


Such a program is called an interpreter
You can write a similar program that “walks” the tree and
produces code in some other language (usually assembly
language) that does the same thing

Such a program is called a compiler
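As a small illustration of the step from recognizer to parser, the sketch below returns a tree node where a recognizer would return true, and null where it would return false. The Node and TinyParser classes, the eval method, and the choice of grammar (<expression> ::= <term> { “+” <term> }, with a term assumed to be an unsigned integer) are all assumptions made for the example.

```java
import java.util.StringTokenizer;

// A binary tree node: a leaf holds a number, an inner node holds "+".
class Node {
    final String value;
    final Node left, right;

    Node(String value) { this(value, null, null); }
    Node(String value, Node left, Node right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }

    // "Walking" the tree to execute it, as an interpreter would.
    int eval() {
        return left == null ? Integer.parseInt(value) : left.eval() + right.eval();
    }

    @Override
    public String toString() {
        return left == null ? value : "(" + left + " " + value + " " + right + ")";
    }
}

class TinyParser {
    private final StringTokenizer tokens;
    private String pushed = null; // the single pushed-back token, if any

    TinyParser(String input) { tokens = new StringTokenizer(input); }

    private String next() {
        if (pushed != null) { String t = pushed; pushed = null; return t; }
        return tokens.hasMoreTokens() ? tokens.nextToken() : ""; // "" marks end of input
    }

    private boolean symbol(String expected) {
        String t = next();
        if (t.equals(expected)) return true;
        pushed = t;
        return false;
    }

    // Returns a leaf for a number, or null if the next token is not a number.
    Node term() {
        String t = next();
        if (!t.isEmpty() && t.chars().allMatch(Character::isDigit)) return new Node(t);
        pushed = t;
        return null;
    }

    // <expression> ::= <term> { "+" <term> }
    Node expression() {
        Node tree = term();
        if (tree == null) return null;
        while (symbol("+")) {
            Node right = term();
            if (right == null) throw new IllegalStateException("Expected a term after '+'");
            tree = new Node("+", tree, right); // the tree grows to the left
        }
        return tree;
    }
}

// Hypothetical demo class.
class TinyParserDemo {
    public static void main(String[] args) {
        Node tree = new TinyParser("2 + 3 + 4").expression();
        System.out.println(tree);        // prints ((2 + 3) + 4)
        System.out.println(tree.eval()); // prints 9
    }
}
```

The control flow is the same as in the recognizer; only the return values change.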
Conclusions

If you start with a BNF definition of a language,

You can write a recursive descent recognizer to tell you
whether an input string “belongs to” that language (is a valid
program in that language)
Writing such a recognizer is a “cookbook” exercise: you just follow
the recipe and it works (hopefully)

You can write a recursive descent parser to create a parse
tree representing the program
The parse tree can later be used to execute the program

BNF is purely about syntax


BNF tells you what is legal, and how things are put together
BNF has nothing to say about what things actually mean
The End