Parsing
Giuseppe Attardi
Università di Pisa
Parsing
Calculate grammatical structure of
program, like diagramming
sentences, where:
Tokens = “words”
Programs = “sentences”
For further information:
Aho, Sethi, Ullman, “Compilers: Principles,
Techniques, and Tools” (a.k.a. the “Dragon Book”)
Outline of coverage
Context-free grammars
Parsing
– Tabular Parsing Methods
– One pass
• Top-down
• Bottom-up
Yacc
Parser: extracts grammatical structure of program
[Figure: parse tree of the “hello, world” program: a function-def with name main, arguments, and a stmt-list whose single stmt is an expression combining, via the operator <<, the variable cout and the string “hello, world\n”]
Context-free languages
Grammatical structure defined by a context-free grammar:
statement → labeled-statement
  | expression-statement
  | compound-statement
labeled-statement → ident : statement
  | case constant-expression : statement
compound-statement → { declaration-list statement-list }
“Context-free” = only one non-terminal in the left part
(ident, case, :, {, } are terminals; statement, labeled-statement, etc. are non-terminals)
Parse trees
Parse tree = tree labeled with grammar symbols, such that:
If a node is labeled A, and its children are labeled x1…xn, then there is a production A → x1…xn
“Parse tree from A” = root labeled with A
“Complete parse tree” = all leaves labeled with tokens
Parse trees and sentences
Frontier of tree = labels on leaves (in left-to-right order)
Frontier of a tree from S is a sentential form
Frontier of a complete tree from S is a sentence
[Figure: a tree with root L and children L ; E, where the lower L derives E and then a; its frontier is a ; E]
Example
G: L → L ; E | E
   E → a | b
Syntax trees from start symbol (L):
[Figure: three syntax trees with frontiers a, a;E and a;b;b]
Sentential forms:
a
a;E
a;b;b
Derivations
Alternate definition of sentence:
Given α, β in V*, we say α ⇒ β is a derivation step if α = α′Aα″ and β = α′γα″, where A → γ is a production
β is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ β (alternatively, we say that S ⇒* β)
The two definitions are equivalent, but note that there are many derivations corresponding to each parse tree
Another example
H: L → E ; L | E
   E → a | b
[Figure: example syntax trees from start symbol L for grammar H]
Ambiguity
For some purposes, it is important to
know whether a sentence can have more
than one parse tree
A grammar is ambiguous if there is a
sentence with more than one parse tree
Example: E → E + E | E * E | id
[Figure: two parse trees for id + id * id: one groups id * id under the right operand of +, the other groups id + id under the left operand of *]
Notes
If e then if b then d else f
{ int x; y = 0; }
A.b.c = d;
Id → s | s.id
E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id
Ambiguity
Ambiguity is a function of the
grammar rather than the language
Certain ambiguous grammars may
have equivalent unambiguous ones
Grammar Transformations
Grammars can be transformed
without affecting the language
generated
Three transformations are discussed
next:
– Eliminating Ambiguity
– Eliminating Left Recursion (i.e. productions of the form A → Aα)
– Left Factoring
Eliminating Ambiguity
Sometimes an ambiguous grammar can
be rewritten to eliminate ambiguity
For example, expressions involving
additions and products can be written as
follows:
E → E + T | T
T → T * id | id
The language generated by this grammar
is the same as that generated by the
grammar in slide “Ambiguity”. Both
generate id(+id|*id)*
However, this grammar is not ambiguous
Eliminating Ambiguity (Cont.)
One advantage of this grammar is that it represents the precedence between operators: in the parse tree, products appear nested within additions
[Figure: parse tree for id + id * id, with the subtree for id * id nested under the right T of E → E + T]
Eliminating Ambiguity (Cont.)
An example of ambiguity in a
programming language is the
dangling else
Consider
S → if E then S else S
  | if E then S
  | other
Eliminating Ambiguity (Cont.)
When there are two nested ifs and only one else…
[Figure: two parse trees for if E then if E then S else S, one attaching the else to the inner if, the other to the outer if]
Eliminating Ambiguity (Cont.)
In most languages (including C++ and Java),
each else is assumed to belong to the
nearest if that is not already matched by an
else. This association is expressed in the
following (unambiguous) grammar:
S → Matched
  | Unmatched
Matched → if E then Matched else Matched
  | other
Unmatched → if E then S
  | if E then Matched else Unmatched
Eliminating Ambiguity (Cont.)
Ambiguity is a property of the
grammar
It is undecidable whether a context-free grammar is ambiguous
The proof is done by reduction to
Post’s correspondence problem
Although there is no general
algorithm, it is possible to isolate
certain constructs in productions
which lead to ambiguous grammars
Eliminating Ambiguity (Cont.)
For example, a grammar containing the production A → AA | α would be ambiguous, because the string ααα has two parses:
[Figure: two parse trees for ααα, one grouping the first two α’s, the other grouping the last two]
This ambiguity disappears if we use the productions
A → AB | B and B → α
or the productions
A → BA | B and B → α
Eliminating Ambiguity (Cont.)
Examples of ambiguous productions:
A → AαA
A → Aα | Aβ
A → αA | Aα
A CF language is inherently ambiguous if
it has no unambiguous CFG
– An example of such a language is
L = {a^i b^j c^m | i=j or j=m}, which can be generated by the grammar:
S → AB | DC
A → aA | ε
C → cC | ε
B → bBc | ε
D → aDb | ε
Elimination of Left Recursion
A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α
– Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
Immediate left recursion (productions of the form A → Aα) can be easily eliminated:
1. Group the A-productions as
   A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
   where no βi begins with A
2. Replace the A-productions by
   A → β1A′ | β2A′ | … | βnA′
   A′ → α1A′ | α2A′ | … | αmA′ | ε
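The step-2 rewrite is mechanical enough to sketch in code. Below is a minimal Java illustration (class and method names are my own, not from the slides) that rewrites A → Aα1 | … | Aαm | β1 | … | βn into the primed form, applied to E → E + T | T:

```java
import java.util.ArrayList;
import java.util.List;

public class LeftRecursion {
    // Rewrites A → Aα1 | … | Aαm | β1 | … | βn  into
    //          A → β1A' | … | βnA'   and   A' → α1A' | … | αmA' | ε
    // alphas holds the α_i, betas the β_j, each as a plain string.
    static List<String> eliminate(String a, List<String> alphas, List<String> betas) {
        String aPrime = a + "'";
        List<String> out = new ArrayList<>();
        for (String beta : betas)
            out.add(a + " → " + beta + " " + aPrime);
        for (String alpha : alphas)
            out.add(aPrime + " → " + alpha + " " + aPrime);
        out.add(aPrime + " → ε");            // the escape production
        return out;
    }

    public static void main(String[] args) {
        // E → E + T | T   becomes   E → T E',  E' → + T E' | ε
        System.out.println(eliminate("E", List.of("+ T"), List.of("T")));
    }
}
```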
Elimination of Left Recursion (Cont.)
The previous transformation, however, does not eliminate left recursion involving two or more steps
For example, consider the grammar
S → Aa | b
A → Ac | Sd | ε
S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive
Elimination of Left Recursion (Cont.)
Algorithm. Eliminate left recursion
Arrange nonterminals in some order A1, A2, …, An
for i = 1 to n {
  for j = 1 to i - 1 {
    replace each production of the form Ai → Aj γ
    by the productions Ai → δ1 γ | δ2 γ | … | δk γ
    where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
  }
  eliminate the immediate left recursion among the Ai-productions
}
Elimination of Left Recursion (Cont.)
To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side, and m > i in all productions of the form Ai → Am α
Induction proof:
– Clearly true for i = 1
– If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Am α with m < i
– Finally, with the elimination of self recursion, m in the Ai → Am α productions is forced to be > i
At the end of the algorithm, all derivations of the form Ai ⇒+ Am α will have m > i, and therefore left recursion is no longer possible
Left Factoring
Left factoring helps transform a grammar for predictive parsing
For example, if we have the two productions
S → if E then S else S
  | if E then S
on seeing the input token if, we cannot immediately tell which production to choose to expand S
In general, if we have A → αβ1 | αβ2 and the input begins with α, we do not know (without looking further) which production to use to expand A
Left Factoring (Cont.)
However, we may defer the decision by expanding A to αA′
Then, after seeing the input derived from α, we may expand A′ to β1 or to β2
Left-factored, the original productions become
A → αA′
A′ → β1 | β2
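The left-factoring rewrite can likewise be sketched mechanically. Below is a hypothetical Java helper (names are my own, not from the slides) that extracts the longest common prefix of two right-hand sides and emits the factored productions, applied to the two if-productions:

```java
import java.util.ArrayList;
import java.util.List;

public class LeftFactor {
    // Longest common prefix of two right-hand sides, each given as a token list.
    static List<String> commonPrefix(List<String> p, List<String> q) {
        int i = 0;
        while (i < p.size() && i < q.size() && p.get(i).equals(q.get(i))) i++;
        return p.subList(0, i);
    }

    static String join(List<String> symbols) {
        return symbols.isEmpty() ? "ε" : String.join(" ", symbols);
    }

    // A → αβ1 | αβ2   becomes   A → αA'  and  A' → β1 | β2
    static List<String> factor(String a, List<String> rhs1, List<String> rhs2) {
        List<String> alpha = commonPrefix(rhs1, rhs2);
        String aPrime = a + "'";
        List<String> out = new ArrayList<>();
        out.add(a + " → " + join(alpha) + " " + aPrime);
        out.add(aPrime + " → " + join(rhs1.subList(alpha.size(), rhs1.size())));
        out.add(aPrime + " → " + join(rhs2.subList(alpha.size(), rhs2.size())));
        return out;
    }

    public static void main(String[] args) {
        // S → if E then S else S | if E then S
        System.out.println(factor("S",
            List.of("if", "E", "then", "S", "else", "S"),
            List.of("if", "E", "then", "S")));
    }
}
```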
Non-Context-Free Language Constructs
Examples of non-context-free languages are:
– L1 = {wcw | w is of the form (a|b)*}
– L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
– L3 = {a^n b^n c^n | n ≥ 0}
Languages similar to these that are context-free:
– L′1 = {w c w^R | w is of the form (a|b)*} (w^R stands for w reversed)
This language is generated by the grammar
S → aSa | bSb | c
– L′2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
This language is generated by the grammar
S → aSd | aAd
A → bAc | bc
Non-Context-Free Language Constructs
(Cont.)
L″2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1}
is generated by the grammar
S → AB
A → aAb | ab
B → cBd | cd
L′3 = {a^n b^n | n ≥ 1}
is generated by the grammar
S → aSb | ab
This language is not definable by any regular expression
Non-Context-Free Language Constructs
(Cont.)
Suppose we could construct a DFSM D accepting L′3.
D must have a finite number of states, say k.
Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
Since D only has k states, two of the states in the sequence have to be equal, say si = sj (i ≠ j).
From si, a sequence of i b’s leads to an accepting (final) state. Therefore, the same sequence of i b’s will also lead to an accepting state from sj.
Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L′3. A contradiction.
Parsing
The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w (equivalently, find a derivation of w)
A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
Two classes of algorithms for parsing:
– Top-down
– Bottom-up
Parser generators
A parser generator is a program that reads a grammar and produces a parser
The best known parser generator is yacc. It produces bottom-up parsers
Most parser generators (including yacc) do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
Top-down parsing
Starting from parse tree containing just
S, build tree down toward input.
Expand left-most non-terminal.
Algorithm: (next slide)
Top-down parsing (cont.)
Let input = a1a2…an
current sentential form (csf) = S
loop {
  suppose csf = a1…ak A γ
  based on ak+1…, choose production A → β
  csf becomes a1…ak β γ
}
Top-down parsing example
Grammar: H: L → E ; L | E
         E → a | b
Input: a;b
[Figure: the parse tree after each expansion step]
Sentential form   Input
L                 a;b
E;L               a;b
a;L               a;b
Top-down parsing example (cont.)
Sentential form   Input
a;E               a;b
a;b               a;b
LL(1) parsing
Efficient form of top-down parsing
Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: N × Σ → P in the “choose production” step of the algorithm.
When this is possible, the grammar is called LL(1)
LL(1) examples
Example 1:
H: L → E ; L | E
   E → a | b
Given input a;b, the next symbol is a.
Which production to use? Can’t tell.
⇒ H is not LL(1)
LL(1) examples
Example 2:
Exp → Term Exp’
Exp’ → $ | + Exp
Term → id
(Use $ for “end-of-input” symbol.)
Grammar is LL(1): Exp and Term have only one production; Exp’ has two productions but only one is applicable at any time.
Nonrecursive predictive parsing
Maintain a stack explicitly, rather
than implicitly via recursive calls
Key problem during predictive
parsing: determining the production
to be applied for a non-terminal
Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Set ip to point to the first symbol of w$.
repeat
  Let X be the top-of-stack symbol and a the symbol pointed to by ip
  if X is a terminal or $ then
    if X == a then
      pop X from the stack and advance ip
    else error()
  else // X is a nonterminal
    if M[X,a] == X → Y1 Y2 … Yk then
      pop X from the stack
      push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top
      (push nothing if Y1 Y2 … Yk is ε)
      output the production X → Y1 Y2 … Yk
    else error()
until X == $
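The stack-driven loop above can be rendered in a few lines of Java. This is a minimal sketch, hardcoding the table M for the Exp grammar of the LL(1) examples; the representation (strings for symbols, a map keyed by "X,a") is my own, not from the slides:

```java
import java.util.*;

public class TableDrivenParser {
    // Parsing table M for:  Exp → Term Exp'   Exp' → $ | + Exp   Term → id
    // Key "X,a" maps to the chosen production's right-hand side; absent = error.
    static final Map<String, String[]> M = Map.of(
        "Exp,id",  new String[]{"Term", "Exp'"},
        "Exp',$",  new String[]{"$"},
        "Exp',+",  new String[]{"+", "Exp"},
        "Term,id", new String[]{"id"});
    static final Set<String> NONTERMINALS = Set.of("Exp", "Exp'", "Term");

    static boolean parse(List<String> tokens) {   // tokens must end with "$"
        Deque<String> stack = new ArrayDeque<>();
        stack.push("Exp");                        // start symbol
        int ip = 0;
        while (!stack.isEmpty()) {
            String x = stack.pop();
            String a = tokens.get(ip);
            if (!NONTERMINALS.contains(x)) {      // terminal (or $): must match input
                if (!x.equals(a)) return false;
                ip++;
            } else {
                String[] rhs = M.get(x + "," + a);
                if (rhs == null) return false;    // error entry in the table
                for (int i = rhs.length - 1; i >= 0; i--)
                    stack.push(rhs[i]);           // push Yk … Y1, leaving Y1 on top
            }
        }
        return ip == tokens.size();               // all input consumed
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("id", "+", "id", "$"))); // true
    }
}
```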
LL(1) grammars
No left recursion
A → Aα: if this production is chosen, the parse makes no progress
No common prefixes
A → αβ | αγ
Can fix by “left factoring”:
A → αA′
A′ → β | γ
LL(1) grammars (cont.)
No ambiguity
The precise definition requires that the production to choose be unique (the “choose” function M is very hard to calculate otherwise)
Top-down Parsing
[Figure: starting from the start symbol L, the root of the parse tree, the tree is “grown” downwards from left to right; as productions are expanded into E0 … En, the input tokens <t0, t1, …, ti, …> are consumed]
Checking LL(1)-ness
For any sequence of grammar symbols α, define the set FIRST(α) to be
FIRST(α) = { a | α ⇒* aβ for some β }
LL(1) definition
Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two leftmost derivations (in which the leftmost nonterminal is always expanded first)
S ⇒* wAγ ⇒ wαγ ⇒* wtx
S ⇒* wAγ ⇒ wβγ ⇒* wty
it follows that α = β
In other words, given
1. a string wAγ in V* and
2. t, the first terminal symbol to be derived from Aγ
there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
FIRST sets can often be calculated by inspection
FIRST Sets
Exp → Term Exp’
Exp’ → $ | + Exp
Term → id
(Use $ for “end-of-input” symbol)
FIRST($) = {$}
FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = {}
⇒ grammar is LL(1)
FIRST Sets
L → E ; L | E
E → a | b
FIRST(E ; L) = {a, b} = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {}
⇒ grammar not LL(1)
Computing FIRST Sets
Algorithm. Compute FIRST(X) for all grammar symbols X
forall X ∈ V do FIRST(X) = {}
forall terminals X do FIRST(X) = {X}
forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
repeat
  c: forall productions X → Y1Y2 … Yk do
    forall i ∈ [1,k] do
      FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
      if ε ∉ FIRST(Yi) then continue c
    FIRST(X) = FIRST(X) ∪ {ε}
until no more terminals or ε are added to any FIRST set
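The fixed-point loop above can be sketched as follows. This hypothetical Java version (names and representation are my own) computes FIRST for the Exp grammar used in the earlier LL(1) example:

```java
import java.util.*;

public class FirstSets {
    // Productions: left-hand side first, then the right-hand side symbols.
    static final String[][] PRODS = {
        {"Exp", "Term", "Exp'"},
        {"Exp'", "$"},
        {"Exp'", "+", "Exp"},
        {"Term", "id"},
    };
    static final Set<String> NONTERMINALS = Set.of("Exp", "Exp'", "Term");

    static Map<String, Set<String>> first() {
        Map<String, Set<String>> first = new HashMap<>();
        for (String[] p : PRODS)
            for (String s : p)                    // terminals start as {s}, nonterminals as {}
                first.computeIfAbsent(s, k -> NONTERMINALS.contains(k)
                        ? new HashSet<>() : new HashSet<>(Set.of(k)));
        boolean changed = true;
        while (changed) {                         // repeat until no FIRST set grows
            changed = false;
            for (String[] p : PRODS) {
                Set<String> fx = first.get(p[0]);
                boolean allNullable = true;       // do all of Y1…Yk derive ε?
                for (int i = 1; i < p.length; i++) {
                    Set<String> fy = new HashSet<>(first.get(p[i]));
                    fy.remove("ε");
                    changed |= fx.addAll(fy);     // FIRST(X) ∪= FIRST(Yi) - {ε}
                    if (!first.get(p[i]).contains("ε")) { allNullable = false; break; }
                }
                if (allNullable) changed |= fx.add("ε");
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(first());
    }
}
```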
FIRST Sets of Strings of Symbols
FIRST(X1X2…Xn) is the union of FIRST(X1) and all FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2, …, i-1
FIRST(X1X2…Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, …, n
FIRST Sets do not Suffice
Given the productions
A → T x
A → T y
T → w
T → ε
T → w should be applied when the next input token is w
T → ε should be applied whenever the next terminal is either x or y
FOLLOW Sets
For any nonterminal X, define the set FOLLOW(X) as
FOLLOW(X) = {a | S ⇒* αXaβ}
Computing the FOLLOW Set
Algorithm. Compute FOLLOW(X) for all nonterminals X
FOLLOW(S) = {$}
forall productions A → αBβ do
  FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) - {ε})
repeat
  forall productions A → αB, or A → αBβ with ε ∈ FIRST(β), do
    FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
until all FOLLOW sets remain the same
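A Java sketch of this computation, specialized to the grammar H: L → E ; L | E, E → a | b from the earlier examples. Names are my own; since no symbol of H derives ε, the FIRST lookup stays trivial (a simplifying assumption, not the general case):

```java
import java.util.*;

public class FollowSets {
    // Grammar H:  L → E ; L | E    E → a | b   (start symbol L)
    static final String[][] PRODS = {
        {"L", "E", ";", "L"},
        {"L", "E"},
        {"E", "a"},
        {"E", "b"},
    };
    static final Set<String> NT = Set.of("L", "E");

    // FIRST for this grammar only: nothing derives ε, so FIRST of a string
    // is just FIRST of its first symbol.
    static Set<String> first(String s) {
        if (!NT.contains(s)) return Set.of(s);
        Set<String> r = new HashSet<>();
        for (String[] p : PRODS)
            if (p[0].equals(s)) r.addAll(first(p[1]));
        return r;
    }

    static Map<String, Set<String>> follow() {
        Map<String, Set<String>> follow = new HashMap<>();
        for (String x : NT) follow.put(x, new HashSet<>());
        follow.get("L").add("$");                 // FOLLOW(start) = {$}
        boolean changed = true;
        while (changed) {                         // until all FOLLOW sets stay the same
            changed = false;
            for (String[] p : PRODS)
                for (int i = 1; i < p.length; i++) {
                    if (!NT.contains(p[i])) continue;
                    Set<String> fb = follow.get(p[i]);
                    if (i + 1 < p.length)         // A → α B β: add FIRST(β)
                        changed |= fb.addAll(first(p[i + 1]));
                    else                          // A → α B: add FOLLOW(A)
                        changed |= fb.addAll(follow.get(p[0]));
                }
        }
        return follow;
    }

    public static void main(String[] args) {
        System.out.println(follow());             // FOLLOW(L)={$}, FOLLOW(E)={;,$}
    }
}
```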
Construction of a predictive parsing table
Algorithm. Construction of a predictive parsing table
M[:,:] = {}
forall productions A → α do
  forall a ∈ FIRST(α) do
    M[A,a] = M[A,a] ∪ {A → α}
  if ε ∈ FIRST(α) then
    forall b ∈ FOLLOW(A) do
      M[A,b] = M[A,b] ∪ {A → α}
Make all empty entries of M be error
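A Java sketch of the table construction, using the small grammar A → T x, T → w | ε adapted from the "FIRST Sets do not Suffice" slide (which, being LL(1) once the y-alternative is dropped, exercises both the FIRST and the FOLLOW branch). FIRST and FOLLOW are hardcoded here as the earlier algorithms would compute them; the representation is my own:

```java
import java.util.*;

public class PredictiveTable {
    // Grammar:  A → T x    T → w | ε   (an empty right-hand side encodes ε)
    static final List<String[]> PRODS = List.of(
        new String[]{"A", "T", "x"},
        new String[]{"T", "w"},
        new String[]{"T"});
    // FIRST and FOLLOW as the earlier algorithms would compute them (hardcoded).
    static final Map<String, Set<String>> FIRST = Map.of(
        "T", Set.of("w", "ε"), "x", Set.of("x"), "w", Set.of("w"));
    static final Map<String, Set<String>> FOLLOW = Map.of(
        "A", Set.of("$"), "T", Set.of("x"));

    // FIRST of a production's right-hand side, handling nullable prefixes.
    static Set<String> firstOfRhs(String[] p) {
        Set<String> r = new HashSet<>();
        for (int i = 1; i < p.length; i++) {
            Set<String> f = FIRST.get(p[i]);
            r.addAll(f);
            r.remove("ε");
            if (!f.contains("ε")) return r;
        }
        r.add("ε");                               // the whole right-hand side can vanish
        return r;
    }

    static Map<String, String> build() {
        Map<String, String> m = new HashMap<>();  // key "A,a" → production; absent = error
        for (String[] p : PRODS) {
            String rhs = p.length > 1
                ? String.join(" ", Arrays.copyOfRange(p, 1, p.length)) : "ε";
            String prod = p[0] + " → " + rhs;
            Set<String> f = firstOfRhs(p);
            for (String a : f)                    // FIRST branch
                if (!a.equals("ε")) m.put(p[0] + "," + a, prod);
            if (f.contains("ε"))                  // FOLLOW branch, for nullable α
                for (String b : FOLLOW.get(p[0]))
                    m.put(p[0] + "," + b, prod);
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```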
Another Definition of LL(1)
Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | … | αn
FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = {} for all i ≠ j
Regular Languages
Definition. A regular grammar is one whose productions are all of the type:
– A → aB
– A → a
A Regular Expression is either:
– a
– R1 | R2
– R1 R2
– R*
Nondeterministic Finite State Automaton
[Figure: an NDFS automaton with start state 0 and states 1, 2, 3, with transitions labeled a and b]
Regular Languages
Theorem. The classes of languages
– Generated by a regular grammar
– Expressed by a regular expression
– Recognized by a NDFS automaton
– Recognized by a DFS automaton
coincide.
Deterministic Finite Automaton
[Figure: the scanner’s DFA. From START, space/tab/newline loop back to START; a digit leads to NUM, which loops on further digits; $ and letters are involved in KEYWORD; the characters =, +, -, /, (, ) lead to OPERATOR]
Legend:
circle = state
double circle = accept state
arrow = transition
bold, cap labels = state names
lower case labels = transition characters
Scanner code
state := start
loop
  if no input character buffered then read one, and add it to the accumulated token
  case state of
    start:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := num
        else ...
      end
    id:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := id
        else ...
      end
    num:
      case input_char of
        0..9 : ...
        else ...
      end
    ...
  end;
end;
Table-driven DFA

State        white space  letter  digit  operator  $
0-start      0            2       1      3         4
1-num        exit         error   1      exit      error
2-id         exit         2       2      exit      error
3-operator   exit         exit    exit   exit      exit
4-keyword    exit         error   error  exit      4
Language Classes
[Figure: nesting of language classes: L0 ⊇ CSL ⊇ CFL [NPA] ⊇ LR(1) ⊇ LL(1) ⊇ RL [DFA=NFA]]
Question
Are regular expressions, as provided
by Perl or other languages, sufficient
for parsing nested structures, e.g.
XML files?
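The a^n b^n argument above already suggests the answer: classical regular expressions have no counter or stack, so they cannot check arbitrarily deep matching of open and close tags. The counting that a finite automaton lacks is a one-line loop in ordinary code; here is a hypothetical helper for balanced parentheses (not from the slides):

```java
public class Balanced {
    // A regular expression has no memory of nesting depth; this counter is
    // exactly the extra power needed (in effect, a one-symbol stack).
    static boolean balanced(String s) {
        int depth = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')' && --depth < 0) return false;  // close with no open
        }
        return depth == 0;                                    // every open was closed
    }

    public static void main(String[] args) {
        System.out.println(balanced("((()))"));  // true
        System.out.println(balanced("(()"));     // false
    }
}
```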
Recursive Descent Parser
stat → var = expr ;
expr → term [+ expr]
term → factor [* term]
factor → ( expr ) | var | constant
var → identifier
Scanner
public class Scanner {
private StreamTokenizer input;
private Type lastToken;
public enum Type { INVALID_CHAR, NO_TOKEN , PLUS,
// etc. for remaining tokens, then:
EOF
};
public Scanner (Reader r) {
input = new StreamTokenizer(r);
input.resetSyntax();
input.eolIsSignificant(false);
input.wordChars('a', 'z');
input.wordChars('A', 'Z');
input.ordinaryChar('+');
input.ordinaryChar('*');
input.ordinaryChar('=');
input.ordinaryChar('(');
input.ordinaryChar(')');
input.whitespaceChars('\u0000', ' ');
}
Scanner
public Type nextToken() {
Type token = Type.INVALID_CHAR;
try {
switch (input.nextToken()) {
case StreamTokenizer.TT_EOF:
token = Type.EOF;
break;
case StreamTokenizer.TT_WORD:
if (input.sval.equalsIgnoreCase("false"))
token = Type.FALSE;
else if (input.sval.equalsIgnoreCase("true"))
token = Type.TRUE;
else
token = Type.VARIABLE;
break;
case '+':
token = Type.PLUS;
break;
// etc.
}
} catch (IOException ex) { token = Type.EOF; }
return token;
}
}
Parser
public class Parser {
private LexicalAnalyzer lexer;
private Type token;
public Statement parse(Reader r) throws
SyntaxException {
lexer = new LexicalAnalyzer(r);
nextToken(); // assigns token
Statement stat = stat();
expect(LexicalAnalyzer.EOF);
return stat;
}
Statement
// stat ::= variable '=' expr ';'
private Statement stat() throws
SyntaxException {
Expr var = variable();
expect(LexicalAnalyzer.ASSIGN);
Expr exp = expr();
Statement stat = new Statement(var, exp);
expect(LexicalAnalyzer.SEMICOLON);
return stat;
}
Expr
// expr ::= term { '+' term }
private Expr expr() throws SyntaxException
{
Expr exp = term();
while (token == LexicalAnalyzer.PLUS) {
nextToken();
exp = new Expr(exp, term());
}
return exp;
}
Term
// term ::= factor ['*' term ]
private Expr term() throws
SyntaxException {
Expr exp = factor();
// Rest of body: left as an exercise.
}
Factor
// factor ::= ( expr ) | var
private Expr factor() throws SyntaxException {
Expr exp = null;
if (token == LexicalAnalyzer.LEFT_PAREN) {
nextToken();
exp = expr();
expect(LexicalAnalyzer.RIGHT_PAREN);
} else {
exp = variable();
}
return exp;
}
Variable
// variable ::= identifier
private Expr variable() throws SyntaxException {
if (token == LexicalAnalyzer.ID) {
Expr exp = new Variable(lexer.getString());
nextToken();
return exp;
}
throw new SyntaxException("identifier expected");
}
Constant
private Expr constantExpression() throws
SyntaxException {
Expr exp = null;
// Handle the various cases for constant
// expressions: left as an exercise.
return exp;
}
Utilities
private void expect(Type t) throws
SyntaxException {
if (token != t) { // throw SyntaxException...
}
nextToken();
}
private void nextToken() {
token = lexer.nextToken();
}
}