Table-driven parsing • Parsing performed by a finite state machine. • Parsing algorithm is language-independent. • FSM driven by table (s) generated automatically.

Download Report

Transcript Table-driven parsing • Parsing performed by a finite state machine. • Parsing algorithm is language-independent. • FSM driven by table (s) generated automatically.

Table-driven parsing
• Parsing performed by a finite state machine.
• Parsing algorithm is language-independent.
• FSM driven by table (s) generated automatically from
grammar.
•
Language
generator
tables
Input
stack
parser
tables
Pushdown Automata
• A context-free grammar can be recognized by a finite state
machine with a stack: a PDA.
• The PDA is defined by set of internal states and a transition
table.
• The PDA can read the input and read/write on the stack.
• The actions of the PDA are determined by its current state, the
current top of the stack, and the current input symbol.
• There are three distinguished states:
– start state: nothing seen
– accept state: sentence complete
– error state: current symbol doesn’t belong.
•
Top-down parsing
• Parse tree is synthesized from the root (sentence
symbol).
• Stack contains symbols of rhs of current production,
and pending non-terminals.
• Automaton is trivial (no need for explicit states)
• Transition table indexed by grammar symbol G and
input symbol a. Entries in table are terminals or
productions: P
ABC…
Top-down parsing
• Actions:
– initially, stack contains sentence symbol
– At each step, let S be symbol on top of stack, and a be the
next token on input.
– if T (S, a) is terminal a, read token, pop symbol from stack
– if T (S, a) is production P
ABC…., remove S from
stack, push the symbols A, B, C on the stack (A on top).
– If S is the sentence symbol and a is the end of file, accept.
– If T (S, a) is undefined, signal error.
• Semantic action: when starting a production, build
tree node for non-terminal, attach to parent.
Table-driven parsing and recursive
descent parsing
• Recursive descent: every production is a procedure.
Call stack holds active procedures corresponding to
pending non-terminals.
• Stack still needed for context-sensitive legality
checks, error messages, etc.
• Table-driven parser: recursion simulated with
explicit stack.
Building the parse table
• Define two functions on the symbols of the grammar:
FIRST and FOLLOW.
• For a non-terminal N, FIRST (N) is the set of terminal
symbols that can start any derivation from N.
–
–
First (If_Statement) = {if}
First (Expr) = {id, ( }
• FOLLOW (N) is the set of terminals that can appear
after a string derived from N:
–
Follow (Expr) = {+, ), $ }
Computing FIRST (N)
• If N
e
First (N) includes e
• if N
aABC
First (N) includes a
• if N
X1X2
First (N) includes First (X1)
• if N
X1X2… and X1
•
e,
First (N) includes First (X2)
• Obvious generalization to First (a) where a is X1X2...
Computing First (N)
• Grammar for expressions, without left-recursion:
E
E’
T
T’
F
•
•
•
TE’ |
+TE’ |
FT’ |
*FT’ |
id
|
T
e
F
e
(E)
First (F) = { id, ( }
First (T’) = { *, e}
First (E’) = { +, e}
First (T) = { id, ( }
First (E) = { id, ( }
Computing Follow (N)
• Follow (N) is computed from productions in which N
appears on the rhs
• For the sentence symbol S, Follow (S) includes $
• if A
a N b, Follow (N) includes First (b)
– because an expansion of N will be followed by an expansion
from b
• if A
a N,
Follow (N) includes Follow (A)
– because N will be expanded in the context in which A is
expanded
• if A
aNB,B
e, Follow (N) includes Follow (A)
Computing Follow (N)
E
E’
T
T’
F
•
•
•
•
TE’
+TE’
FT’
*FT’
id
|
|
|
|
|
T
e
F
e
(E)
Follow (E) = { ), $ } Follow (E’) = { ), $ }
Follow (T) = First (E’ ) + Follow (E’) = { +, ), $ }
Follow (T’) = Follow (T) = { +, ), $ }
Follow (F) = First (T’) + Follow (T’) = { *, +, ), $ }
Building LL (1) parse tables
Table indexed by non-terminal and token. Table entry is a
production:
for each production P: A
a loop
for each terminal a in First (a) loop
T (A, a) := P;
end loop;
if e in First (a), then
for each terminal b in Follow (a) loop T (A, b) := P; end loop;
end if;
end loop;
• All other entries are errors.
• If two assignments conflict, parse table cannot be built.
LL (1) grammars
• If table construction is successful, grammar is LL (1):
left-to right, leftmost derivation with one-token
lookahead.
• If construction fails, can conceive of LL (2), etc.
• Ambiguous grammars are never LL (k)
• If a terminal is in First for two different productions of
A, the grammar cannot be LL (1).
• Grammars with left-recursion are never LL (k)
• Some useful constructs are not LL (k)
Bottom-up parsing
• Synthesize tree from fragments
• Automaton performs two actions:
– shift: push next symbol on stack
– reduce: replace symbols on stack
• Automaton synthesizes (reduces) when end of a
production is recognized
• States of automaton encode synthesis so far, and
expectation of pending non-terminals
• Automaton has potentially large set of states
• Technique more general than LL (k)
LR (k) parsing
• Left-to-right, rightmost derivation with k-token
lookahead.
• Most general parsing technique for deterministic
grammars.
• In general, not practical: tables too large (10^6 states
for C++, Ada).
• Common subsets: SLR, LALR (1).
The states of the LR(0) automaton
• An item is a point within a production, indicating that
part of the production has been recognized:
– A
a .B b,
• seen the expansion of a, expect to see expansion of B
• A state is a set of items
• Transition within states are determined by terminals
and non-terminals
• Parsing tables are built from automaton:
– action: shift / reduce depending on next symbol
– goto: change state depending on synthesized non-terminal
Building LR (0) states
• If a state includes:
A
a .B b
• it also includes every state that is the start of B:
B
.XYZ
• Informally: if I expect to see B next, I expect to see
anything that B can start with, and so on:
X
.GHI
• States are built by closure from individual items.
A grammar of expressions: initial state
•
•
•
•
•
E’
E
E
E + T | T;
-- left-recursion ok here.
T
T * F | F;
F
id | (E)
S0 = { E’ .E, E .E + T, E
.T,
F
.id, F
.(E),
T
.T * F, T
.F}
Adding states
• If a state has item A
a .a b,
and the next symbol in the input is a, we shift a on
the stack and enter a state that contains item
•
A
a a.b
(as well as all other items brought in by closure)
• if a state has as item A
a. , this indicates the
end of a production: reduce action.
• If a state has an item A
a .N b, then after a
reduction that find an N, go to a state with A a N. b
The LR (0) states for expressions
•
•
•
•
•
•
•
•
•
•
S1 = { E’ E., E E. + T }
S2 = { E T., T T. * F }
S3 = { T F. }
S4 = { F (. E), } + S0 (by closure)
S5 = { F id. }
S6 = { E E +. T, T .T * F, T .F, F .id, F .(E)}
S7 = { T T *. F, F .id, F .(E)}
S8 = { F (E.), E E.+ T}
S9 = { E E + T., T T.* F}
S10 = { T T * F.},
S11 = {F (E).}
Building SLR tables
• An arc between two states labeled with a terminal is
a shift action.
• An arc between two states labeled with a nonterminal is a goto action.
• if a state contains an item A a. , (a reduce item)
• the action is to reduce by this production, for all
terminals in Follow (A).
• If there are shift-reduce conflicts or reduce-reduce
conflicts, more elaborate techniques are needed.
LR (k) parsing
• Canonical LR (1): annotate each item with its own
follow set:
•
(A -> a a.b , f )
• f is a subset of the follow set of A, because it is
derived from a single specific production for A
• A state that includes A -> a a.b is a reduce state only
if next symbol is in f: fewer reduce actions, fewer
conflicts, technique is more powerful than SLR (1)
• Generalization: use sequences of k symbols in f
• Disadvantage: state explosion: impractical in general,
even for LR (1)
LALR (1)
• Compute follow set for a small set of
items
• Tables no bigger than SLR (1)
• Same power as LR (1), slightly worse
error diagnostics
• Incorporated into yacc, bison, etc.