Transcript Document

CIS 461
Compiler Design and Construction
Fall 2012
slides derived from Tevfik Bultan, Keith Cooper, and
Linda Torczon
Lecture-Module #7
Introduction to Parsing
First Phase: Lexical Analysis (Scanning)
Diagram: source code → Scanner → token → Parser → IR; the parser calls the scanner to get the next token; both phases report errors.
Scanner
• Maps stream of characters into tokens
– Basic unit of syntax
• Characters that form a word are its lexeme
• Its syntactic category is called its token
• Scanner discards white space and comments
• Scanner works as a subroutine of the parser
Lexical Analysis
• Specify tokens using Regular Expressions
• Translate Regular Expressions to Finite Automata
• Use Finite Automata to generate tables or code for the scanner
Diagram: specifications (regular expressions) → Scanner Generator → tables or code; source code → Scanner → tokens.
Automating Scanner Construction
To build a scanner:
1 Write down the RE that specifies the tokens
2 Translate the RE to an NFA
3 Build the DFA that simulates the NFA
4 Minimize the DFA
5 Turn it into code or table
Scanner generators
• Lex, Flex, and JLex work along these lines
• Algorithms are well-known and well-understood
• Interface to parser is important
Automating Scanner Construction
RENFA (Thompson’s construction)
•
Build an NFA for each term
•
Combine them with -moves
NFA DFA (subset construction)
•
Build the simulation
DFA Minimal DFA
•
The Cycle of Constructions
RE
NFA
DFA
minimal
DFA
Hopcroft’s algorithm
DFA RE
•
All pairs, all paths problem
•
Union together paths from s0 to a final state
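To make the middle step concrete, here is a minimal Java sketch of the subset construction; the NFA encoding (nested transition maps, an EPS marker for ε-moves) is an assumption made for illustration, not code from the course tools.

    // A sketch of the subset construction (NFA -> DFA), assuming the NFA is
    // given as nested transition maps and EPS (char 0) marks an epsilon-move.
    import java.util.*;

    class SubsetConstruction {
        static final char EPS = 0;   // stands in for an epsilon-move

        // epsilon-closure: all NFA states reachable from 'states' by epsilon-moves only
        static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> delta,
                                    Set<Integer> states) {
            Deque<Integer> work = new ArrayDeque<>(states);
            Set<Integer> result = new HashSet<>(states);
            while (!work.isEmpty()) {
                int s = work.pop();
                for (int t : delta.getOrDefault(s, Collections.emptyMap())
                                  .getOrDefault(EPS, Collections.emptySet()))
                    if (result.add(t)) work.push(t);
            }
            return result;
        }

        // Each DFA state is a set of NFA states; the result maps each DFA state
        // to its outgoing transitions.
        static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
                Map<Integer, Map<Character, Set<Integer>>> delta,
                int start, Set<Character> alphabet) {
            Set<Integer> d0 = closure(delta, Set.of(start));
            Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
            Deque<Set<Integer>> work = new ArrayDeque<>(List.of(d0));
            while (!work.isEmpty()) {
                Set<Integer> d = work.pop();
                if (dfa.containsKey(d)) continue;          // already processed
                Map<Character, Set<Integer>> row = new HashMap<>();
                for (char a : alphabet) {                  // simulate a move on 'a'
                    Set<Integer> moved = new HashSet<>();
                    for (int s : d)
                        moved.addAll(delta.getOrDefault(s, Collections.emptyMap())
                                          .getOrDefault(a, Collections.emptySet()));
                    if (!moved.isEmpty()) {
                        Set<Integer> next = closure(delta, moved);
                        row.put(a, next);
                        work.push(next);                   // may be a new DFA state
                    }
                }
                dfa.put(d, row);
            }
            return dfa;
        }
    }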
Scanner Generators: JLex, Lex, FLex
Structure of a specification file:
  user code                  (directly copied to the output file)
  %%
  JLex directives            (macro (regular) definitions, e.g., digits = [0-9]+, and state names)
  %%
  regular expression rules   (each rule: optional state list, regular expression, action)
• States can be mixed with regular expressions
• For each regular expression we can define a set of states where it is valid (JLex, Flex)
• Typical format of regular expression rules:
<optional_state_list> regular_expression { actions }
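For concreteness, a small sketch of what such a JLex specification might look like; the Token class, token names, and actions are illustrative assumptions, not taken from the course project.

    // user code section: e.g., a Token class, copied verbatim into the generated scanner
    %%
    %class Lexer
    %type Token
    DIGITS = [0-9]+
    ID     = [a-zA-Z][a-zA-Z0-9]*
    %%
    {DIGITS}         { return new Token(Token.NUM, yytext()); }
    {ID}             { return new Token(Token.ID,  yytext()); }
    "+"|"-"|"*"|"/"  { return new Token(Token.OP,  yytext()); }
    [\ \t\n]+        { /* white space: discard, return no token */ }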
JLex, FLex, Lex
Regular expression rules:
  r_1   { action_1 }
  r_2   { action_2 }
  ...
  r_n   { action_n }
Diagram: the rules are combined into one NFA: a new start state s0 has ε-moves into the automaton A_r_1, …, A_r_n built for each regular expression, and each automaton gets new final states; the actions are emitted as Java code for JLex and as C code for Flex and Lex.
Rules used by scanner generators
1) Continue scanning the input until reaching an error state
2) Accept the longest prefix that matches a regular expression and execute the corresponding action
3) If two patterns match the longest prefix, execute the action that is specified earlier
4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
(see the maximal-munch sketch below)
For faster scanning, convert this NFA
to a DFA and minimize the states
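A minimal Java sketch of the maximal-munch loop these rules describe; the table layout and names are illustrative assumptions, not JLex’s actual generated code.

    // An illustrative sketch of the maximal-munch rule: run the DFA as far as
    // possible, remember the last accepting position, then back up to it.
    class MaxMunchScanner {
        private final int[][] delta;      // delta[state][ch] = next state, -1 = error state
        private final int[] acceptRule;   // acceptRule[state] = rule number, -1 = not accepting
        private final String input;       // assumes the table is indexed by ASCII character codes
        private int pos = 0;

        MaxMunchScanner(int[][] delta, int[] acceptRule, String input) {
            this.delta = delta; this.acceptRule = acceptRule; this.input = input;
        }

        // Returns the rule number of the next token, or -1 at end of input.
        int nextToken() {
            if (pos >= input.length()) return -1;
            int state = 0;                               // DFA start state
            int lastAcceptPos = -1, lastAcceptRule = -1;
            int i = pos;
            while (i < input.length()) {
                state = delta[state][input.charAt(i)];
                if (state == -1) break;                  // rule 1: stop at the error state
                i++;
                if (acceptRule[state] != -1) {           // rules 2 and 3: remember the
                    lastAcceptPos = i;                   // longest match (ties resolved in
                    lastAcceptRule = acceptRule[state];  // the table toward the earlier rule)
                }
            }
            pos = lastAcceptPos;                         // rule 4: back up to the end of the
            return lastAcceptRule;                       // accepted prefix (handling of "no
        }                                                // rule matches" omitted here)
    }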
Limits of Regular Languages
Advantages of Regular Expressions
• Simple & powerful notation for specifying patterns
• Automatic construction of fast recognizers
• Many kinds of syntax can be specified with REs
If REs are so useful … Why not use them for everything?
Example — an expression grammar
Id   → [a-zA-Z] ( [a-zA-Z] | [0-9] )*
Num  → [0-9]+
Term → Id | Num
Op   → “+” | “-” | “*” | “/”
Expr → ( Term Op )* Term
Limits of Regular Languages
If we add balanced parentheses to the expressions grammar, we cannot
represent it using regular expressions:
Id   → [a-zA-Z] ( [a-zA-Z] | [0-9] )*
Num  → [0-9]+
Term → Id | Num
Op   → “+” | “-” | “*” | “/”
Expr → Term | Expr Op Expr | “(” Expr “)”
A DFA of size n cannot recognize balanced parentheses with nesting depth
greater than n
Not all languages are regular: RLs ⊂ CFLs ⊂ CSLs
Solution: Use a more powerful formalism, context-free grammars
The Front End: Parser
Diagram: source code → Scanner → token → Parser → IR → Type Checker → IR; the parser calls the scanner to get the next token; all phases report errors.
Parser
• Input: a sequence of tokens representing the source program
• Output: A parse tree (in practice an abstract syntax tree)
• While generating the parse tree, the parser checks the stream of tokens for grammatical correctness
– Checks the context-free syntax
• Parser builds an IR representation of the code
– Generates an abstract syntax tree
• Guides checking at deeper levels than syntax
The Study of Parsing
• Need a mathematical model of syntax — a grammar G
– Context-free grammars
• Need an algorithm for testing membership in L(G)
– Parsing algorithms
• Parsing is the process of discovering a derivation for some sentence
from the rules of the grammar
– Equivalently, it is the process of discovering a parse tree
• Natural language analogy
– Lexical rules correspond to rules that define the valid words
– Grammar rules correspond to rules that define valid sentences
An Example Grammar
1  Start → Expr
2  Expr  → Expr Op Expr
3        | num
4        | id
5  Op    → +
6        | -
7        | *
8        | /
Start Symbol:         S = Start
Nonterminal Symbols:  N = { Start, Expr, Op }
Terminal symbols:     T = { num, id, +, -, *, / }
Productions:          P = { 1, 2, 3, 4, 5, 6, 7, 8 } (shown above)
Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar
Formally, a grammar is a four tuple, G = (S,N,T,P)
• T is a set of terminal symbols
– These correspond to tokens returned by the scanner
– For the parser tokens are indivisible units of syntax
• N is a set of non-terminal symbols
– These are syntactic variables that can be substituted during a
derivation
– Variables that denote sets of substrings occurring in the language
• S is the start symbol: S ∈ N
– All the strings in L(G) are derived from the start symbol
• P is a set of productions or rewrite rules: P : N → (N ∪ T)*
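As a concrete aside, the four-tuple can be written down directly as data; a minimal, illustrative Java sketch using the example grammar from the previous slide (the names are assumptions, not from any tool):

    // An illustrative Java rendering of G = (S, N, T, P).
    import java.util.*;

    record Production(String lhs, List<String> rhs) {}      // lhs in N, rhs in (N ∪ T)*

    record Grammar(String start,                             // S
                   Set<String> nonterminals,                 // N
                   Set<String> terminals,                     // T
                   List<Production> productions) {}           // P

    class ExampleGrammar {
        static Grammar expressionGrammar() {                  // the grammar from the previous slide
            return new Grammar(
                "Start",
                Set.of("Start", "Expr", "Op"),
                Set.of("num", "id", "+", "-", "*", "/"),
                List.of(
                    new Production("Start", List.of("Expr")),
                    new Production("Expr",  List.of("Expr", "Op", "Expr")),
                    new Production("Expr",  List.of("num")),
                    new Production("Expr",  List.of("id")),
                    new Production("Op",    List.of("+")),
                    new Production("Op",    List.of("-")),
                    new Production("Op",    List.of("*")),
                    new Production("Op",    List.of("/"))));
        }
    }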
Production Rules
Restrictions on production rules determine the expressive power
• Regular grammars: productions are either left-linear or right-linear
– Right-linear: productions are of the form A → wB or A → w, where A, B are nonterminals and w is a string of terminals
– Left-linear: productions are of the form A → Bw or A → w, where A, B are nonterminals and w is a string of terminals
– Regular grammars recognize regular sets
– One can automatically construct a regular grammar from an NFA that accepts the same language (and vice versa)
• Context-free grammars: productions are of the form A → α, where A is a nonterminal symbol and α is a string of terminal and nonterminal symbols
• Context-sensitive grammars: productions are of the form α → β, where α and β are arbitrary strings of terminal and nonterminal symbols with α ≠ ε and |α| ≤ |β|
• Unrestricted grammars: productions are of the form α → β, where α and β are arbitrary strings of terminal and nonterminal symbols with α ≠ ε
– Unrestricted grammars are as powerful as Turing machines
An NFA can be translated to a Regular Grammar
• For each state i of the NFA create a nonterminal symbol Ai
• If state i has a transition to state j on symbol a, introduce the production Ai → a Aj
• If state i goes to state j on ε, introduce the production Ai → Aj
• If i is an accepting state, introduce Ai → ε
• If i is the start state, make Ai the start symbol of the grammar
Diagram: an NFA with states S0 through S4 for (a|b)* a b b: S0 has an ε-move to S1; S1 loops to itself on a and on b and moves to S2 on a; S2 moves to S3 on b; S3 moves to S4 on b; S4 is accepting.
A0 → A1
A1 → a A1
   | b A1
   | a A2
A2 → b A3
A3 → b A4
A4 → ε
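A small, illustrative Java sketch that mechanically applies the rules above to an NFA given as a transition list; the encoding (EPS for ε-moves, integer state numbers) is an assumption made for this example.

    // Emit a regular grammar from an NFA; EPS (char 0) marks an epsilon-move.
    import java.util.*;

    class NfaToGrammar {
        record Transition(int from, char symbol, int to) {}
        static final char EPS = 0;

        static List<String> toProductions(List<Transition> nfa, Set<Integer> accepting) {
            List<String> prods = new ArrayList<>();
            for (Transition t : nfa)
                prods.add(t.symbol == EPS
                    ? "A" + t.from + " → A" + t.to                     // Ai → Aj on an ε-move
                    : "A" + t.from + " → " + t.symbol + " A" + t.to);  // Ai → a Aj
            for (int i : accepting)
                prods.add("A" + i + " → ε");                           // Ai → ε for accepting i
            return prods;
        }
    }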
Derivations
An example grammar
1  S    → Expr
2  Expr → Expr Op Expr
3       | num
4       | id
5  Op   → +
6       | -
7       | *
8       | /
An example derivation for x - 2 * y
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
4      <id,x> Op Expr
6      <id,x> - Expr
2      <id,x> - Expr Op Expr
3      <id,x> - <num,2> Op Expr
7      <id,x> - <num,2> * Expr
4      <id,x> - <num,2> * <id,y>
We denote this as: S ⇒* id - num * id
• Such a sequence of rewrites is called a derivation
• The process of discovering a derivation is called parsing
A ⇒ B means A derives B after applying one production
A ⇒* B means A derives B after applying zero or more productions
Sentences and Sentential Forms
Given a grammar G with a start symbol S
• A string of terminal symbols that can be derived from S by applying
the productions is called a sentence of the grammar
– These strings are the members of set L(G), the language defined by
the grammar
• A string of terminal and nonterminal symbols that can be derived from
S by applying the productions of the grammar is called a sentential
form of the grammar
– Each step of derivation forms a sentential form
– Sentences are sentential forms with no nonterminal symbols
Derivations
• At each step, we make two choices
  1. Choose a non-terminal to replace
  2. Choose a production to apply
• Different choices lead to different derivations
Two types of derivation are of interest
• Leftmost derivation — replace leftmost non-terminal at each step
• Rightmost derivation — replace rightmost non-terminal at each step
These are the two systematic derivations (the first choice is fixed)
The example on the earlier slide was a leftmost derivation
• Of course, there is a rightmost derivation (next slide)
Two Derivations for x - 2 * y
Leftmost derivation
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
4      <id,x> Op Expr
6      <id,x> - Expr
2      <id,x> - Expr Op Expr
3      <id,x> - <num,2> Op Expr
7      <id,x> - <num,2> * Expr
4      <id,x> - <num,2> * <id,y>

Rightmost derivation
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
4      Expr Op <id,y>
7      Expr * <id,y>
2      Expr Op Expr * <id,y>
3      Expr Op <num,2> * <id,y>
6      Expr - <num,2> * <id,y>
4      <id,x> - <num,2> * <id,y>
In both cases, S ⇒* id - num * id
• Note that these two derivations produce different parse trees
• The parse trees imply different evaluation orders!
Derivations and Parse Trees
Leftmost derivation
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
4      <id,x> Op Expr
6      <id,x> - Expr
2      <id,x> - Expr Op Expr
3      <id,x> - <num,2> Op Expr
7      <id,x> - <num,2> * Expr
4      <id,x> - <num,2> * <id,y>
This evaluates as x - ( 2 * y )
Diagram: the parse tree has S over Expr; that Expr expands to Expr (<id,x>), Op (-), and an Expr that in turn expands to Expr (<num,2>), Op (*), Expr (<id,y>).
Derivations and Parse Trees
Rightmost derivation
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
4      Expr Op <id,y>
7      Expr * <id,y>
2      Expr Op Expr * <id,y>
3      Expr Op <num,2> * <id,y>
6      Expr - <num,2> * <id,y>
4      <id,x> - <num,2> * <id,y>
This evaluates as ( x - 2 ) * y
Diagram: the parse tree has S over Expr; that Expr expands to an Expr, Op (*), and Expr (<id,y>), where the left Expr expands to Expr (<id,x>), Op (-), Expr (<num,2>).
Another Rightmost Derivation
Another rightmost derivation
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
2      Expr Op Expr Op Expr
4      Expr Op Expr Op <id,y>
7      Expr Op Expr * <id,y>
3      Expr Op <num,2> * <id,y>
6      Expr - <num,2> * <id,y>
4      <id,x> - <num,2> * <id,y>
This evaluates as x - ( 2 * y )
Diagram: the parse tree has Expr with children Expr (<id,x>), Op (-), and an Expr that expands to Expr (<num,2>), Op (*), Expr (<id,y>).
This parse tree is different from the parse tree for the previous rightmost derivation, but it is the same as the parse tree for the earlier leftmost derivation.
Derivation and Parse Trees
• A parse tree does not show the order in which the productions were applied; it ignores the variations in the order
• Each parse tree has a corresponding unique leftmost derivation
• Each parse tree has a corresponding unique rightmost derivation
Parse Trees and Precedence
These two parse trees point out a problem with the expression grammar:
It has no notion of precedence (implied order of evaluation between
different operators)
To add precedence
• Create a non-terminal for each level of precedence
• Isolate the corresponding part of the grammar
• Force parser to recognize high precedence subexpressions first
For algebraic expressions
• Multiplication and division, first
• Subtraction and addition, next
Another Problem: Parse Trees and Associativity
Diagram: two parse trees for 5 - 2 - 2. In the left-associative tree, ( 5 - 2 ) - 2, the result is 1; in the right-associative tree, 5 - ( 2 - 2 ), the result is 5.
Precedence and Associativity
Adding the standard algebraic precedence and using left recursion
produces:
1  S      → Expr
2  Expr   → Expr + Term
3          | Expr - Term
4          | Term
5  Term   → Term * Factor
6          | Term / Factor
7          | Factor
8  Factor → num
9          | id
This grammar is slightly larger
• Takes more rewriting to reach
some of the terminal symbols
• Encodes expected precedence
• Enforces left-associativity
• Produces same parse tree
under leftmost & rightmost
derivations
Let’s see how it parses our example
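As an aside, because this grammar is unambiguous, a hand-written parser can follow its structure directly; below is a minimal, illustrative Java sketch in which the left-recursive rules become loops (token handling, class names, and the parenthesized string output are assumptions, not course material).

    // An illustrative parser for the precedence grammar above.
    import java.util.*;

    class ExprParser {
        private final List<String> tokens;   // e.g., ["id", "-", "num", "*", "id"]
        private int pos = 0;

        ExprParser(List<String> tokens) { this.tokens = tokens; }

        private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }
        private String next() { return tokens.get(pos++); }

        // Expr → Expr + Term | Expr - Term | Term
        // The left recursion becomes a loop, which also yields left associativity.
        String parseExpr() {
            String left = parseTerm();
            while (peek().equals("+") || peek().equals("-"))
                left = "(" + left + " " + next() + " " + parseTerm() + ")";
            return left;
        }

        // Term → Term * Factor | Term / Factor | Factor
        String parseTerm() {
            String left = parseFactor();
            while (peek().equals("*") || peek().equals("/"))
                left = "(" + left + " " + next() + " " + parseFactor() + ")";
            return left;
        }

        // Factor → num | id
        String parseFactor() {
            String t = next();
            if (!t.equals("num") && !t.equals("id"))
                throw new IllegalStateException("expected num or id, got " + t);
            return t;
        }

        public static void main(String[] args) {
            // x - 2 * y as the token stream id - num * id
            System.out.println(new ExprParser(List.of("id", "-", "num", "*", "id")).parseExpr());
            // prints (id - (num * id)), i.e., x - ( 2 * y )
        }
    }

On the token stream id - num * id this prints (id - (num * id)), the x - ( 2 * y ) grouping that the derivation on the next slide produces.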
Precedence
Rule   Sentential Form
—      S
1      Expr
3      Expr - Term
4      Term - Term
7      Factor - Term
9      <id,x> - Term
5      <id,x> - Term * Factor
7      <id,x> - Factor * Factor
8      <id,x> - <num,2> * Factor
9      <id,x> - <num,2> * <id,y>
The leftmost derivation
Its parse tree (diagram): S → Expr; the top Expr expands to Expr - Term; the left Expr derives <id,x> through Term and Factor, and the right Term expands to Term * Factor, deriving <num,2> and <id,y>. The multiplication sits lower in the tree, so it is recognized first.
This produces x - ( 2 * y ) , along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same parse tree and
the same evaluation order, because the grammar directly encodes the
desired precedence.
Associativity
Rule   Sentential Form
—      S
1      Expr
3      Expr - Term
7      Expr - Factor
8      Expr - <num,2>
3      Expr - Term - <num,2>
7      Expr - Factor - <num,2>
8      Expr - <num,2> - <num,2>
4      Term - <num,2> - <num,2>
7      Factor - <num,2> - <num,2>
8      <num,5> - <num,2> - <num,2>
The rightmost derivation
Its parse tree (diagram): the top Expr expands to Expr - Term; the left Expr expands again to Expr - Term, deriving <num,5> and the first <num,2>, and the right Term derives the second <num,2>. The tree groups the expression as ( 5 - 2 ) - 2.
This produces ( 5 - 2 ) - 2 , along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same parse tree and
the same evaluation order
Ambiguous Grammars
What was the problem with the original grammar?
1  S    → Expr
2  Expr → Expr Op Expr
3       | num
4       | id
5  Op   → +
6       | -
7       | *
8       | /
Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
4      Expr Op <id,y>
7      Expr * <id,y>
2      Expr Op Expr * <id,y>
3      Expr Op <num,2> * <id,y>
6      Expr - <num,2> * <id,y>
4      <id,x> - <num,2> * <id,y>

Rule   Sentential Form
—      S
1      Expr
2      Expr Op Expr
2      Expr Op Expr Op Expr
4      Expr Op Expr Op <id,y>
7      Expr Op Expr * <id,y>
3      Expr Op <num,2> * <id,y>
6      Expr - <num,2> * <id,y>
4      <id,x> - <num,2> * <id,y>
• This grammar allows multiple rightmost derivations for x - 2 * y (the two derivations above make different choices once they reach Expr Op Expr)
• Equivalently, this grammar generates multiple parse trees for x - 2 * y
• The grammar is ambiguous
Ambiguous Grammars
• If a grammar has more than one leftmost derivation for some sentence
(or sentential form), then the grammar is ambiguous
• If a grammar has more than one rightmost derivation for some
sentence (or sentential form), then the grammar is ambiguous
• If a grammar produces more than one parse tree for some sentence (or
sentential form), then it is ambiguous
Classic example — the dangling-else problem
1  Stmt → if Expr then Stmt
2       | if Expr then Stmt else Stmt
        | … other stmts …
Ambiguity
The following sentential form has two parse trees:
if Expr1 then if Expr2 then Stmt1 else Stmt2
Diagram: two parse trees. Applying production 2 and then production 1, the else attaches to the outer if: if Expr1 then ( if Expr2 then Stmt1 ) else Stmt2. Applying production 1 and then production 2, the else attaches to the inner if: if Expr1 then ( if Expr2 then Stmt1 else Stmt2 ).
Ambiguity
Removing the ambiguity
• Must rewrite the grammar to avoid generating the problem
• Match each else to innermost unmatched if
(common sense rule)
1  Stmt      → Matched
2            | Unmatched
3  Matched   → if Expr then Matched else Matched
4            | … other kinds of stmts …
5  Unmatched → if Expr then Stmt
6            | if Expr then Matched else Unmatched
With this grammar, the example has only one parse tree
Ambiguity
if Expr1 then if Expr2 then Stmt1 else Stmt2
Rule   Sentential Form
—      Stmt
2      Unmatched
5      if Expr then Stmt
?      if Expr1 then Stmt
1      if Expr1 then Matched
3      if Expr1 then if Expr then Matched else Matched
?      if Expr1 then if Expr2 then Matched else Matched
4      if Expr1 then if Expr2 then Stmt1 else Matched
4      if Expr1 then if Expr2 then Stmt1 else Stmt2
This binds the else to the inner if
Ambiguity
Theoretical results:
• It is undecidable whether an arbitrary CFG is ambiguous
• There exist CFLs for which every CFG is ambiguous. These are called inherently ambiguous CFLs.
– Example: { 0^i 1^j 2^k | i = j or j = k }
Ambiguity
Ambiguity usually refers to confusion in the CFG
Overloading can create deeper ambiguity
a = f(17)
In many Algol-like languages, f could be either a function or a subscripted
variable
Disambiguating this one requires context
• Need values of declarations
• Really an issue of type, not context-free syntax
• Requires an extra-grammatical solution (not in CFG)
• Must handle these with a different mechanism
– Step outside grammar rather than use a more complex grammar
Ambiguity
Ambiguity can arise from two distinct sources
• Confusion in the context-free syntax
• Confusion that requires context to resolve
Resolving ambiguity
• To remove context-free ambiguity, rewrite the grammar
• Handling context-sensitive ambiguity takes cooperation
– Knowledge of declarations, types, …
– Accept a superset of the input language then check it with other means
(type checking, context-sensitive analysis)
– This is a language design problem
Sometimes, the compiler writer accepts an ambiguous grammar
– Parsing algorithms can be kludged so that they “do the right thing”