Transcript Document

CIS 461
Compiler Design and Construction
Fall 2012
slides derived from Tevfik Bultan, Keith Cooper, and
Linda Torczon
Lecture-Module #13
Parsing
Top-down Parsing
[Figure: a partial parse tree rooted at the start symbol S, with interior nodes such as A, B, C, and D; the lower fringe of the parse tree is matched against the input using a left-to-right scan, a left-most derivation, and lookahead.]
Bottom-up Parsing
[Figure: a partial parse tree whose upper fringe is built upward from the input string; bottom-up parsing constructs a right-most derivation in reverse, using lookahead, working up toward the start symbol S.]
LR Parsers
• LR(k) parsers are table-driven, bottom-up, shift-reduce parsers
that use a limited right context (k-token lookahead) for handle
recognition
• LR(k): Left-to-right scan of the input, Rightmost derivation in reverse
with k token lookahead
A grammar is LR(k) if, given a rightmost derivation
    S = γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn-1 ⇒ γn = sentence
we can
1. isolate the handle of each right-sentential form γi , and
2. determine the production by which to reduce,
by scanning γi from left-to-right, going at most k symbols beyond
the right end of the handle of γi
LR Parsers
A table-driven LR parser looks like
[Figure: block diagram of a table-driven LR parser. The grammar is fed to a Parser Generator, which produces the ACTION & GOTO Tables; source code is fed to the Scanner, whose tokens drive the Table-driven Parser, which uses a Stack and the tables to produce IR.]
LR Shift-Reduce Parsers
push($); // $ is the end-of-file symbol
push(s0); // s0 is the start state of the DFA that recognizes handles
lookahead = get_next_token();
repeat forever
    s = top_of_stack();
    if ( ACTION[s,lookahead] == reduce A → β ) then
        pop 2*|β| symbols;
        s = top_of_stack();
        push(A);
        push(GOTO[s,A]);
    else if ( ACTION[s,lookahead] == shift si ) then
        push(lookahead);
        push(si);
        lookahead = get_next_token();
    else if ( ACTION[s,lookahead] == accept and lookahead == $ ) then
        return success;
    else error();
The skeleton parser
• uses ACTION & GOTO
• does |words| shifts
• does |derivation| reductions
• does 1 accept
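As a concrete illustration, here is a minimal Python sketch of the same skeleton. The table encoding is an assumption made for this sketch: ACTION maps (state, token) to ("shift", k), ("reduce", (A, β)), or ("accept",), and GOTO maps (state, nonterminal) to a state number; it is not the slides' concrete representation.

# Minimal sketch of the table-driven shift-reduce driver (illustrative only)
def lr_parse(tokens, ACTION, GOTO, start_state=0, eof="$"):
    stack = [eof, start_state]                      # push $ and the DFA start state s0
    tokens = iter(tokens)
    lookahead = next(tokens, eof)
    while True:                                     # "repeat forever"
        s = stack[-1]                               # state on top of the stack
        act = ACTION.get((s, lookahead), ("error",))
        if act[0] == "reduce":                      # reduce by A -> beta
            A, beta = act[1]
            if beta:                                # pop 2*|beta| symbols
                del stack[-2 * len(beta):]
            s = stack[-1]
            stack.append(A)                         # push A
            stack.append(GOTO[(s, A)])              # push GOTO[s, A]
        elif act[0] == "shift":                     # shift the lookahead, enter state si
            stack.append(lookahead)
            stack.append(act[1])
            lookahead = next(tokens, eof)
        elif act[0] == "accept" and lookahead == eof:
            return True                             # one accept per successful parse
        else:
            raise SyntaxError("parse error at %r" % (lookahead,))

Hooked up to ACTION and GOTO tables for the example grammar built later in these slides, a call such as lr_parse(["id", "-", "id"], ACTION, GOTO) would return True.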
LR Parsers
How does this LR stuff work?
• Unambiguous grammar ⇒ unique rightmost derivation
• Keep upper fringe on a stack
– All active handles include TOS
– Shift inputs until TOS is right end of a handle
• Language of handles is regular
– Build a handle-recognizing DFA
– ACTION & GOTO tables encode the DFA
• To match subterms, recurse & leave DFA’s state on stack
• Final states of the DFA correspond to reduce actions
– New state is GOTO[state at TOS (after popping the handle), lhs]
Building LR Parsers
How do we generate the ACTION and GOTO tables?
• Use the grammar to build a model of the handle recognizing DFA
• Use the DFA model to build ACTION & GOTO tables
• If construction succeeds, the grammar is LR
How do we build the handle recognizing DFA ?
• Encode the set of productions that can be used as handles in the DFA
state: Use LR(k) items
• Use two functions goto( s, x ) and closure( s )
    – goto() is analogous to move() in the subset construction (NFA to DFA conversion)
    – closure() is analogous to ε-closure
• Build up the states and transition functions of the DFA
• Use this information to fill in the ACTION and GOTO tables
LR(k) items
An LR(k) item is a pair [A , B], where
    A is a production α → β with a • at some position in the rhs, and
    B is a lookahead string of length ≤ k
    (terminal symbols or $)
Examples: [A → •β1β2β3 , a], [A → β1•β2β3 , a], [A → β1β2•β3 , a], & [A → β1β2β3• , a]
The • in an item indicates the position of the top of the stack
• LR(0) items:  [A → β • γ]         (no lookahead symbol)
• LR(1) items:  [A → β • γ , a]     (one token lookahead)
• LR(2) items:  [A → β • γ , a b]   (two token lookahead)  ...
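For concreteness, one possible way to encode LR(1) items in Python is as plain tuples (lhs, rhs, dot, lookahead); both this layout and the production A → B C D below are illustrative assumptions, not part of the lecture's grammar.

# Hypothetical encoding of LR(1) items as tuples (lhs, rhs, dot, lookahead)
prod = ("A", ("B", "C", "D"))                 # the production A -> B C D
items = [("A", prod[1], dot, "a")             # [A -> . B C D, a] ... [A -> B C D . , a]
         for dot in range(len(prod[1]) + 1)]

def show(item):
    lhs, rhs, dot, la = item
    return "[%s -> %s . %s , %s]" % (lhs, " ".join(rhs[:dot]), " ".join(rhs[dot:]), la)

print("\n".join(show(i) for i in items))      # prints all four items of A -> B C D with lookahead a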
LR(1) Items
The production •, with lookahead a, generates 4 items
[• , a], [• , a], [• , a], & [• , a]
The set of LR(1) items for a grammar is finite
What’s the point of all these lookahead symbols?
• Carry them along to choose correct reduction
• Lookaheads are bookkeeping, unless the item has • at the right end
    – The lookahead has no direct use in [A → β•γ , a]
    – In [A → β• , a], a lookahead of a implies a reduction by A → β
    – For { [A → β• , a], [B → γ•δ , b] }
        lookahead = a  ⇒  reduce to A
        lookahead ∈ FIRST(δ)  ⇒  shift
⇒ Limited right context is enough to pick the actions
LR(1) Table Construction
High-level overview
1   Build the handle recognizing DFA (aka the Canonical Collection of sets of LR(1)
    items), C = { I0 , I1 , ... , In }
    a   Introduce a new start symbol S’ which has only one production,  S’ → S
    b   The initial state, I0 , should include
        • [S’ → •S , $], along with any equivalent items
        • Derive equivalent items as closure( I0 )
    c   Repeatedly compute, for each Ik , and each grammar symbol x, goto(Ik , x)
        • If the set is not already in the collection, add it
        • Record all the transitions created by goto( )
        This eventually reaches a fixed point
2   Fill in the ACTION and GOTO tables using the DFA
The canonical collection completely encodes the
transition diagram for the handle-finding DFA
Computing Closures
closure(I) adds all the items implied by the items already in I
• An item [A → β•Cδ , a] implies [C → •γ , x] for each production
  with C on the lhs, and each x ∈ FIRST(δa)
• Since βCδ is valid, any way to derive βCδ is valid, too
The algorithm
Closure( I )
    while ( I is still changing )
        for each item [A → β • Cδ , a] ∈ I
            for each production C → γ ∈ P
                for each terminal b ∈ FIRST(δa)
                    if [C → • γ , b] ∉ I
                        then add [C → • γ , b] to I
Fixpoint computation
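Below is a sketch of this fixpoint computation in Python, under the same hypothetical item encoding (tuples (lhs, rhs, dot, lookahead)); productions is assumed to be a list of (lhs, rhs) pairs and first_of a caller-supplied function that returns FIRST of a string of symbols.

# Sketch of Closure(I) for LR(1) items, following the fixpoint loop above
def closure(items, productions, first_of):
    I = set(items)
    changed = True
    while changed:                                  # "while I is still changing"
        changed = False
        for (lhs, rhs, dot, a) in list(I):
            if dot >= len(rhs):                     # dot at the right end: nothing implied
                continue
            C, delta = rhs[dot], rhs[dot + 1:]      # item is [lhs -> beta . C delta, a]
            for (plhs, gamma) in productions:
                if plhs != C:                       # only productions with C on the lhs
                    continue
                for b in first_of(delta + (a,)):    # b in FIRST(delta a)
                    new_item = (C, gamma, 0, b)     # [C -> . gamma, b]
                    if new_item not in I:
                        I.add(new_item)
                        changed = True
    return frozenset(I)                             # frozen so states can live in sets/dicts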
Computing Gotos
goto(I , x) computes the state that the parser would reach
if it recognized an x while in state I
• goto( { [A → β • x δ , a] } , x ) produces { [A → β x • δ , a] }
• It also includes closure( { [A → β x • δ , a] } ) to fill out the state
The algorithm
Goto( I, x )
    new = Ø
    for each [A → β • x δ , a] ∈ I
        new = new ∪ { [A → β x • δ , a] }
    return closure(new)
• Not a fixpoint method
• Uses closure
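A matching Python sketch of Goto(I, x), with the same assumed item encoding; closure_fn stands for the Closure routine (for example, the earlier sketch with the grammar's productions and FIRST already bound in).

# Sketch of Goto(I, x): advance the dot over x, then take the closure
def goto(I, x, closure_fn):
    new = set()
    for (lhs, rhs, dot, a) in I:
        if dot < len(rhs) and rhs[dot] == x:        # item is [lhs -> beta . x delta, a]
            new.add((lhs, rhs, dot + 1, a))         # becomes [lhs -> beta x . delta, a]
    return closure_fn(new)                          # not a fixpoint method; it just uses closure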
Building the Canonical Collection of LR(1) Items
This is where we build the handle recognizing DFA!
Start from I0 = closure( [S’ → • S , $] )
Repeatedly construct new states, until no new states are generated
The algorithm
I0 = closure( [S’ → • S , $] )
C = { I0 }
while ( C is still changing )
    for each Ii ∈ C and for each x ∈ ( T ∪ NT )
        Inew = goto(Ii , x)
        if Inew ∉ C then
            C = C ∪ { Inew }
        record transition Ii → Inew on x
• Fixed-point computation
• Loop adds to C
• C ⊆ 2^ITEMS, so C is finite
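The construction above can be sketched in Python as follows, reusing the closure and goto sketches (assumed here to be wrapped as a one-argument closure_fn and a two-argument goto_fn with the grammar bound in); states are frozensets of item tuples so they can be stored in sets and used as dictionary keys. All names are illustrative assumptions.

# Sketch of building the canonical collection C and the goto transitions
def canonical_collection(start_item, symbols, closure_fn, goto_fn):
    I0 = closure_fn({start_item})                   # I0 = closure({[S' -> . S, $]})
    C = {I0}
    transitions = {}                                # (state, symbol) -> state
    changed = True
    while changed:                                  # "while C is still changing"
        changed = False
        for Ii in list(C):
            for x in symbols:                       # for each x in (T union NT)
                Inew = goto_fn(Ii, x)
                if not Inew:                        # no item in Ii had x after its dot
                    continue
                if Inew not in C:
                    C.add(Inew)                     # add the new state
                    changed = True
                transitions[(Ii, x)] = Inew         # record transition Ii --x--> Inew
    return C, transitions

For the augmented grammar, start_item would be something like ("S'", ("S",), 0, "$").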
Algorithms for LR(1) Parsers
Computing the closure of a set of LR(1) items:
Closure( I )
    while ( I is still changing )
        for each item [A → β • Cδ , a] ∈ I
            for each production C → γ ∈ P
                for each terminal b ∈ FIRST(δa)
                    if [C → • γ , b] ∉ I
                        then add [C → • γ , b] to I

Computing goto for a set of LR(1) items:
Goto( I, x )
    new = Ø
    for each [A → β • x δ , a] ∈ I
        new = new ∪ { [A → β x • δ , a] }
    return closure(new)

Constructing the canonical collection of LR(1) items:
I0 = closure( [S’ → • S , $] )
C = { I0 }
while ( C is still changing )
    for each Ii ∈ C and for each x ∈ ( T ∪ NT )
        Inew = goto(Ii , x)
        if Inew ∉ C then
            C = C ∪ { Inew }
        record transition Ii → Inew on x

• The canonical collection construction algorithm is the algorithm for
  constructing the handle recognizing DFA
• It uses Closure to compute the states of the DFA
• It uses Goto to compute the transitions of the DFA
Example
Simplified, right recursive expression grammar
1   S → Expr
2   Expr → Term - Expr
3   Expr → Term
4   Term → Factor * Term
5   Term → Factor
6   Factor → id

Symbol      FIRST
S           { id }
Expr        { id }
Term        { id }
Factor      { id }
-           { - }
*           { * }
id          { id }
The canonical collection and the handle-recognizing DFA for the example grammar:

I0 = { [S → •Expr , $], [Expr → •Term - Expr , $], [Expr → •Term , $],
       [Term → •Factor * Term , {$,-}], [Term → •Factor , {$,-}], [Factor → •id , {$,-,*}] }
I1 = { [S → Expr • , $] }
I2 = { [Expr → Term • - Expr , $], [Expr → Term • , $] }
I3 = { [Term → Factor • * Term , {$,-}], [Term → Factor • , {$,-}] }
I4 = { [Factor → id • , {$,-,*}] }
I5 = { [Expr → Term - •Expr , $], [Expr → •Term - Expr , $], [Expr → •Term , $],
       [Term → •Factor * Term , {$,-}], [Term → •Factor , {$,-}], [Factor → •id , {$,-,*}] }
I6 = { [Term → Factor * •Term , {$,-}], [Term → •Factor * Term , {$,-}],
       [Term → •Factor , {$,-}], [Factor → •id , {$,-,*}] }
I7 = { [Expr → Term - Expr • , $] }
I8 = { [Term → Factor * Term • , {$,-}] }

Transitions (the goto edges of the DFA):
goto(I0, Expr) = I1    goto(I0, Term) = I2    goto(I0, Factor) = I3    goto(I0, id) = I4
goto(I2, -) = I5       goto(I3, *) = I6
goto(I5, Expr) = I7    goto(I5, Term) = I2    goto(I5, Factor) = I3    goto(I5, id) = I4
goto(I6, Term) = I8    goto(I6, Factor) = I3  goto(I6, id) = I4
Constructing the ACTION and GOTO Tables
The algorithm   (x is the state number; each state corresponds to a set of LR(1) items)
for each set of items Ix ∈ C
    for each item ∈ Ix
        if the item is [A → β•aγ , b] and a ∈ T and goto(Ix , a) = Ik
            then ACTION[x , a] ← “shift k”
        else if the item is [S’ → S• , $]
            then ACTION[x , $] ← “accept”
        else if the item is [A → β• , a]
            then ACTION[x , a] ← “reduce A → β”
    for each n ∈ NT
        if goto(Ix , n) = Ik
            then GOTO[x , n] ← k
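A Python sketch of this table-filling pass, under the earlier assumptions: C and transitions come from the canonical-collection sketch, state_no numbers each item set, and terminals / nonterminals describe the grammar. Conflicting entries simply overwrite each other here; a real generator would report them.

# Sketch of filling ACTION and GOTO from the canonical collection
def build_tables(C, transitions, state_no, terminals, nonterminals,
                 goal="S'", eof="$"):
    ACTION, GOTO = {}, {}
    for Ix in C:
        x = state_no[Ix]
        for (lhs, rhs, dot, a) in Ix:
            if dot < len(rhs) and rhs[dot] in terminals:        # [A -> beta . a gamma, b]
                Ik = transitions[(Ix, rhs[dot])]
                ACTION[(x, rhs[dot])] = ("shift", state_no[Ik])
            elif dot == len(rhs) and lhs == goal and a == eof:  # [S' -> S . , $]
                ACTION[(x, eof)] = ("accept",)
            elif dot == len(rhs):                               # [A -> beta . , a]
                ACTION[(x, a)] = ("reduce", (lhs, rhs))
        for n in nonterminals:                                  # GOTO entries
            if (Ix, n) in transitions:
                GOTO[(x, n)] = state_no[transitions[(Ix, n)]]
    return ACTION, GOTO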
Example (Constructing the LR(1) tables)
The algorithm produces the following table
(s k = shift and go to state k, r k = reduce by production k, acc = accept)

State   ACTION                              GOTO
        id      -       *       $           Expr    Term    Factor
0       s4                                  1       2       3
1                               acc
2               s5              r3
3               r5      s6      r5
4               r6      r6      r6
5       s4                                  7       2       3
6       s4                                          8       3
7                               r2
8               r4              r4
What can go wrong in LR(1) parsing?
What if state s contains both [A → α • aβ , b] and [B → γ • , a] ?
• The first item generates “shift”, the second generates “reduce”
• Both define ACTION[s,a]; the parser cannot do both actions
• This is called a shift/reduce conflict
• Modify the grammar to eliminate it
• Shifting will often resolve it correctly (remember the dangling else problem?)
What if state s contains both [A → α • , a] and [B → β • , a] ?
• Each generates “reduce”, but with a different production
• Both define ACTION[s,a]; the parser cannot do both reductions
• This is called a reduce/reduce conflict
• Modify the grammar to eliminate it
In either case, the grammar is not LR(1)
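In a table generator, both kinds of conflict surface as an attempt to write two different entries into the same ACTION slot. The helper below is a small illustrative sketch (its name and the table encoding are assumptions) of how the table-filling pass could detect and report that instead of silently overwriting.

# Sketch of conflict detection while filling ACTION[state, symbol]
def set_action(ACTION, state, symbol, entry):
    old = ACTION.get((state, symbol))
    if old is not None and old != entry:
        # two different actions requested for the same (state, symbol) slot
        kind = "shift/reduce" if "shift" in (old[0], entry[0]) else "reduce/reduce"
        raise ValueError("%s conflict in state %s on %r: %r vs %r"
                         % (kind, state, symbol, old, entry))
    ACTION[(state, symbol)] = entry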
Algorithms for LR(0) Parsers
Computing closure of set of LR(0) items:
Closure( I )
    while ( I is still changing )
        for each item [A → β • Cδ] ∈ I
            for each production C → γ ∈ P
                if [C → • γ] ∉ I
                    then add [C → • γ] to I
Constructing canonical collection of LR(0) items:
I0 = closure( [S’ → • S] )
C = { I0 }
while ( C is still changing )
    for each Ii ∈ C and for each x ∈ ( T ∪ NT )
        Inew = goto(Ii , x)
        if Inew ∉ C then
            C = C ∪ { Inew }
        record transition Ii → Inew on x
Computing goto for set of LR(0) items:
Goto( I, x )
    new = Ø
    for each [A → β • x δ] ∈ I
        new = new ∪ { [A → β x • δ] }
    return closure(new)
• Basically the same algorithms we use for LR(1)
• The only difference is that there is no lookahead symbol in LR(0) items,
  and therefore there is no lookahead propagation
• LR(0) parsers generate fewer states
• They cannot recognize all the grammars that LR(1) parsers recognize
• They are more likely to have conflicts in the parse table, since there is
  no lookahead
SLR (Simple LR) Table Construction
• The SLR(1) algorithm uses the canonical collection of LR(0) items
• It also uses the FOLLOW sets to decide when to do a reduction
for each set of items Ix ∈ C
    for each item ∈ Ix
        if the item is [A → β•aγ] and a ∈ T and goto(Ix , a) = Ik
            then ACTION[x , a] ← “shift k”
        else if the item is [S’ → S•]
            then ACTION[x , $] ← “accept”
        else if the item is [A → β•]
            then for each a ∈ FOLLOW(A)
                ACTION[x , a] ← “reduce A → β”
    for each n ∈ NT
        if goto(Ix , n) = Ik
            then GOTO[x , n] ← k
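Relative to the LR(1) table construction, only the reduce and accept cases change, since LR(0) items carry no lookahead. A hedged Python sketch of just that part, assuming LR(0) items are (lhs, rhs, dot) tuples and FOLLOW is a precomputed dict of sets:

# Sketch of the SLR(1) reduce/accept rule for one state Ix (numbered x)
def add_slr_reductions(ACTION, x, Ix, FOLLOW, goal="S'", eof="$"):
    for (lhs, rhs, dot) in Ix:
        if dot != len(rhs):                     # only items with the dot at the right end
            continue
        if lhs == goal:                         # [S' -> S .]   : accept on $
            ACTION[(x, eof)] = ("accept",)
        else:                                   # [A -> beta .] : reduce on all of FOLLOW(A)
            for a in FOLLOW[lhs]:
                ACTION[(x, a)] = ("reduce", (lhs, rhs))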
LALR (LookAhead LR) Table Construction
• LR(1) parsers have many more states than SLR(1) parsers
• Idea: Merge the LR(1) states
– Look at the LR(0) core of the LR(1) items (meaning ignore the
lookahead)
– If two sets of LR(1) items have the same core, merge them and
change the ACTION and GOTO tables accordingly
• LALR(1) parsers can be constructed in two ways
1. Build the canonical collection of LR(1) items and then merge the
sets with the same core (you should be able to do this; a sketch follows this slide)
2. Ignore the items with the dot at the beginning of the rhs and
construct kernels of sets of LR(0) items. Use a lookahead
propagation algorithm to compute lookahead symbols.
The second approach is more efficient since it avoids the construction of
the large LR(1) table.
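A short Python sketch of the first route, under the earlier item encoding: strip the lookaheads to get each set's LR(0) core, then union the item sets that share a core. Transitions must then be remapped onto the merged states; since goto is determined by the core, that step is mechanical.

# Sketch of merging LR(1) item sets that share the same LR(0) core
def merge_by_core(C):
    groups = {}                                             # core -> merged LR(1) item set
    for I in C:
        core = frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _la) in I)
        groups.setdefault(core, set()).update(I)            # union the lookaheads
    return [frozenset(items) for items in groups.values()]  # the LALR(1) states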
Properties of LALR(1) Parsers
• An LALR(1) parser for a grammar G has the same number of states as the
SLR(1) parser for that grammar
• If an LR(1) parser for a grammar G does not have any shift/reduce
conflicts, then the LALR(1) parser for that grammar will not have any
shift/reduce conflicts
• It is possible for LALR(1) parser for a grammar G to have a
reduce/reduce conflict while an LR(1) parser for that grammar does
not have any conflicts
Error recovery
• Panic-mode recovery: On discovering an error, discard input symbols
  one at a time until a synchronizing token is found
    – For example, delimiters such as “;” or “}” can be used as
      synchronizing tokens
• Phrase-level recovery: On discovering an error, make local
  corrections to the input
    – For example, replace “,” with “;”
• Error productions: If we have a good idea about what types of errors
  occur, we can augment the grammar with error productions and
  generate appropriate error messages when an error production is used
• Global correction: Given an incorrect input string, try to find a correct
  string that requires the minimum number of changes to the input string
    – In general, too costly
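As a small illustration of panic-mode recovery, the sketch below (the names and the token-stream interface are assumptions) discards tokens until it sees a synchronizing token such as “;” or “}”.

# Sketch of panic-mode recovery: skip input until a synchronizing token
SYNC_TOKENS = {";", "}"}

def panic_mode_skip(tokens, sync=SYNC_TOKENS, eof="$"):
    for tok in tokens:                  # tokens is an iterator over the remaining input
        if tok in sync or tok == eof:
            return tok                  # resume parsing at the synchronizing token
    return eof                          # ran off the end of the input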
Direct Encoding of Parse Tables
Rather than using a table-driven interpreter …
• Generate spaghetti code that implements the logic
• Each state becomes a small case statement or if-then-else
• Analogous to a direct-coded scanner
Advantages
• No table lookups and address calculations
• No representation for don’t care states
• No outer loop —it is implicit in the code for the states
This produces a faster parser with more code but no table
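To make the idea concrete, here is an illustrative Python fragment in which each DFA state becomes a branch of code instead of a table row. Only a few states are shown, loosely based on the earlier example tables; the function and its encoding of actions are assumptions, not generated output.

# Sketch of a directly encoded parser: one branch of code per state
def step(state, lookahead):
    if state == 0:
        if lookahead == "id":
            return ("shift", 4)
        return ("error",)
    elif state == 1:
        if lookahead == "$":
            return ("accept",)
        return ("error",)
    elif state == 4:
        if lookahead in ("-", "*", "$"):
            return ("reduce", ("Factor", ("id",)))   # Factor -> id
        return ("error",)
    # ... remaining states elided ...
    return ("error",)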
LR(k) versus LL(k) (Top-down Recursive Descent )
Finding Reductions
LR(k) ⇒ Each reduction in the parse is detectable with
    1. the complete left context,
    2. the reducible phrase, itself, and
    3. the k terminal symbols to its right
LL(k) ⇒ Parser must select the reduction based on
    1. the complete left context, and
    2. the next k terminals
Thus, LR(k) examines more context
Left Recursion versus Right Recursion
Right recursion
• Required for termination in top-down parsers
• Uses (on average) more stack space
• Produces right-associative operators
Left recursion
• More efficient bottom-up parsers
• Limits required stack space
• Produces left-associative operators
[Figure: two parse trees for w * x * y * z. The right-recursive grammar groups the operands as
w*(x*(y*z)); the left-recursive grammar groups them as ((w*x)*y)*z.]
Rule of thumb
• Left recursion for bottom-up parsers
(Bottom-up parsers can also deal with right-recursion but left recursion is more
efficient)
• Right recursion for top-down parsers
Hierarchy of Context-Free Languages
[Figure: the inclusion hierarchy for context-free languages. The context-free languages contain the
deterministic languages, i.e. the LR(k) languages (where LR(k) ≡ LR(1)); the figure also shows the
LL(k) languages, LL(1) languages, simple precedence languages, and operator precedence languages.]
Hierarchy of Context-Free Grammars
[Figure: the inclusion hierarchy for context-free grammars. Context-free grammars contain the
unambiguous CFGs, which contain LR(k) ⊇ LR(1) ⊇ LALR(1) ⊇ SLR(1) ⊇ LR(0); the LL(k) and LL(1)
grammar classes and the operator precedence grammars are also shown.]
• Operator precedence includes some ambiguous grammars
• LL(1) is a subset of SLR(1)
Summary
                            Advantages                  Disadvantages
Top-down recursive descent  Fast                        Hand-coded
                            Simplicity                  High maintenance
                            Good error detection        Right associativity

LR(1)                       Fast                        Large working sets
                            Deterministic langs.        Poor error messages
                            Automatable                 Large table sizes
                            Left associativity