week3 - TU Delft: Computer Science & Engineering

Compiler construction
in4020 – lecture 3
Koen Langendoen
Delft University of Technology
The Netherlands
Summary of lecture 2
• lexical analyzer generator
• description → FSA
• FSA construction
• dotted items
• character moves
• ε moves
[Diagram: program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST; a scanner generator turns the token description into the lexical analyzer]
Quiz
2.24 Will the function identify() (to access a
symbol table) still work if the hash
function maps all identifiers onto the
same number?
2.26 Tutor X insists that macro processing
must be implemented as a separate
phase between reading the program and
the lexical analysis. Show that Mr. X is wrong.
Overview
• syntax analysis: tokens → AST
• AST construction
• by hand: recursive descent
• automatic: top-down (LLgen), bottom-up (yacc)
[Diagram: program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST; a parser generator turns the language grammar into the syntax analyzer]
Syntax analysis
• parsing: given a context-free grammar and a
stream of tokens, find the derivation that ties
them together.
• result: parse tree
syn-tax: the way in which words are put together to form phrases,
clauses, or sentences. (Webster’s Dictionary)
[Figure: parse tree for b * b - 4 * a * c, built from expression, term, factor, identifier and constant nodes]
Context free grammar
• G = (VN, VT, S, P)
• VN : set of non-terminal symbols
• VT : set of terminal symbols
• S : start symbol (S ∈ VN)
• P : set of production rules {N → α}
• VN ∩ VT = ∅
• P = {N → α | N ∈ VN ∧ α ∈ (VN ∪ VT)*}
Top-down parsing
• process tokens left to right
• expression grammar
input → expression EOF
expression → term rest_expression
term → IDENTIFIER | ‘(’ expression ‘)’
rest_expression → ‘+’ expression | ε
• example expression
aap + ( noot + mies )
[Diagram: parse tree grown from the root downward: input → expression EOF, expression → term rest_expression, term → IDENTIFIER]
Bottom-up parsing
• process tokens left to right
• expression grammar
input → expression EOF
expression → term rest_expression
term → IDENTIFIER | ‘(’ expression ‘)’
rest_expression → ‘+’ expression | ε
[Diagram: parse tree built from the leaves upward over the input aap + ( noot + mies ), with IDENT, term, rest_expr and expression nodes completed so far]
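Comparison
                        top-down              bottom-up
node creation           pre-order             post-order
alternative selection   first token           last token
grammar type            restricted: LL(1)     relaxed: LR(1)
implementation          manual + automatic    automatic
[Figure: two five-node example trees with nodes numbered in creation order, pre-order for top-down and post-order for bottom-up]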
Recursive descent parsing
• each rule N translates to a boolean function
• return true if a terminal production of N was matched
• return false otherwise (without consuming any token)
• try alternatives of N in turn
• a terminal symbol must match the current token
• a non-terminal is matched by calling its routine
input → expression EOF
int input(void) {
    return expression() && require(token(EOF));
}
Recursive descent parsing
expression → term rest_expression
int expression(void) {
    return term() && require(rest_expression());
}
term → IDENTIFIER | ‘(’ expression ‘)’
int term(void) {
    return token(IDENTIFIER) ||
           (token('(') && require(expression()) && require(token(')')));
}
Recursive descent parsing
rest_expression → ‘+’ expression | ε
int rest_expression(void) {
    return (token('+') && require(expression())) || 1;
}
auxiliary functions
• consume matched tokens
• report syntax errors
int token(int tk) {
    if (Token.class != tk) return 0;
    get_next_token(); return 1;
}
int require(int found) {
    if (!found) error();
    return 1;
}
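The routines above can be assembled into one small program. The following is a minimal sketch, not the lecture's own code: the scanner (identifiers are runs of letters, single characters stand for themselves), the names TK_EOF, TK_IDENT, token_class and parse(), and the choice to let error() set a flag instead of aborting are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token classes: single characters stand for themselves (assumption). */
enum { TK_EOF = 256, TK_IDENT = 257 };

static const char *src;    /* remaining input text */
static int token_class;    /* plays the role of Token.class on the slides */
static int syntax_error;   /* set by error(); the slides leave error() open */

/* Hypothetical scanner: identifiers are runs of letters. */
static void get_next_token(void) {
    while (*src == ' ') src++;
    if (*src == '\0') { token_class = TK_EOF; return; }
    if (isalpha((unsigned char)*src)) {
        while (isalpha((unsigned char)*src)) src++;
        token_class = TK_IDENT;
        return;
    }
    token_class = (unsigned char)*src++;
}

static void error(void) { syntax_error = 1; }

/* Consume the current token if it has class tk. */
static int token(int tk) {
    if (token_class != tk) return 0;
    get_next_token();
    return 1;
}

/* Report a syntax error if a mandatory part was not found. */
static int require(int found) {
    if (!found) error();
    return 1;
}

static int expression(void);

static int term(void) {
    return token(TK_IDENT) ||
           (token('(') && require(expression()) && require(token(')')));
}

static int rest_expression(void) {
    return (token('+') && require(expression())) || 1;
}

static int expression(void) {
    return term() && require(rest_expression());
}

static int input(void) {
    return expression() && require(token(TK_EOF));
}

/* Returns 1 if text is a sentence of the expression grammar, 0 otherwise. */
int parse(const char *text) {
    assert(text != NULL);
    src = text;
    syntax_error = 0;
    get_next_token();
    return input() && !syntax_error;
}
```

Calling parse("aap + ( noot + mies )") accepts the example expression, while parse("aap +") is rejected because the expression required after ‘+’ is missing.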
Automatic top-down parsing
• follow recursive descent scheme, but avoid
interpretation overhead
• for each rule and alternative determine the
tokens it can start with: FIRST set
• parsing scheme for rule N → A1 | A2 | …
• if token ∉ FIRST(N) then ERROR
• if token ∈ FIRST(A1) then parse A1
• if token ∈ FIRST(A2) then parse A2
• …
Exercise (7 min.)
• design an algorithm to compute the FIRST
sets of all non-terminals in a context free
grammar.
• hint: consider the types of rules
• alternatives
• composition
• empty productions
input → expression EOF
expression → term rest_expression
term → IDENTIFIER | ‘(’ expression ‘)’
rest_expression → ‘+’ expression | ε
Answers (Fig 2.58, page 122)
• closure algorithm
• N → w (w ∈ VT)
FIRST(N) = {w}
• N → A1 | A2 | …
FIRST(N) = ∪ FIRST(Ai)
• N → ε
FIRST(N) = {ε}
• N → A β
FIRST(N) = FIRST(A), if ε ∉ FIRST(A)
FIRST(N) = (FIRST(A) \ {ε}) ∪ FIRST(β), otherwise
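The rules above can be turned into a closure algorithm that keeps applying them until no FIRST set grows any further. The following is a minimal sketch for the lecture's expression grammar; the symbol numbering, the bit-mask representation of sets, and the names grammar, first and compute_first_sets() are assumptions made for illustration, not taken from the book's Fig 2.58.

```c
#include <assert.h>

/* Symbols of the expression grammar (numbering is an assumption). */
enum { IDENT, PLUS, LPAREN, RPAREN, END, N_TERMINALS,     /* END is EOF */
       INPUT = N_TERMINALS, EXPRESSION, TERM, REST_EXPR, N_SYMBOLS };
#define EPSILON_BIT (1u << N_SYMBOLS)   /* marks "can produce the empty string" */

struct rule { int lhs; int len; int rhs[3]; };

static const struct rule grammar[] = {
    { INPUT,      2, { EXPRESSION, END } },
    { EXPRESSION, 2, { TERM, REST_EXPR } },
    { TERM,       1, { IDENT } },
    { TERM,       3, { LPAREN, EXPRESSION, RPAREN } },
    { REST_EXPR,  2, { PLUS, EXPRESSION } },
    { REST_EXPR,  0, { 0 } },                     /* empty production */
};
#define N_RULES ((int)(sizeof grammar / sizeof grammar[0]))

unsigned first[N_SYMBOLS];   /* bit masks over terminals, plus EPSILON_BIT */

/* FIRST of the sequence rhs[0..len-1], using the current approximation. */
static unsigned first_of_sequence(const int *rhs, int len) {
    unsigned set = 0;
    for (int i = 0; i < len; i++) {
        set |= first[rhs[i]] & ~EPSILON_BIT;
        if (!(first[rhs[i]] & EPSILON_BIT))
            return set;                 /* rhs[i] cannot vanish: stop here */
    }
    return set | EPSILON_BIT;           /* all symbols can vanish, or len == 0 */
}

/* Closure algorithm: repeat until no set changes. */
void compute_first_sets(void) {
    for (int s = 0; s < N_SYMBOLS; s++)
        first[s] = s < N_TERMINALS ? 1u << s : 0;   /* FIRST(w) = {w} */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int r = 0; r < N_RULES; r++) {
            const struct rule *g = &grammar[r];
            unsigned add = first_of_sequence(g->rhs, g->len);
            if (add & ~first[g->lhs]) {             /* something new to add */
                first[g->lhs] |= add;
                changed = 1;
            }
        }
    }
}
```

Alternatives of one non-terminal are simply listed as separate rules, so the union over the alternatives happens through the repeated `|=` on the same left-hand side.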
Break
Predictive parsing
• similar to recursive descent, but no back-tracking
• functions “know” what they are doing
input → expression EOF
FIRST(expression) = {IDENT, ‘(’}
void input(void) {
    switch (Token.class) {
    case IDENT: case '(':
        expression(); token(EOF); break;
    default:
        error();
    }
}
void token(int tk) {
    if (Token.class != tk) error();
    get_next_token();
}
Predictive parsing
expression → term rest_expression
FIRST(term) = {IDENT, ‘(’}
void expression(void) {
    switch (Token.class) {
    case IDENT: case '(':
        term(); rest_expression(); break;
    default:
        error();
    }
}
term → IDENTIFIER | ‘(’ expression ‘)’
void term(void) {
    switch (Token.class) {
    case IDENT:
        token(IDENT); break;
    case '(':
        token('('); expression(); token(')'); break;
    default:
        error();
    }
}
Predictive parsing
rest_expression → ‘+’ expression | ε
FIRST(rest_expr) = {‘+’, ε}
void rest_expression(void) {
    switch (Token.class) {
    case '+':
        token('+'); expression(); break;
    case EOF: case ')': break;
    default:
        error();
    }
}
• FIRST(ε) = {ε}
• check nothing?
• NO: token ∈ FOLLOW(rest_expr)
FOLLOW(rest_expr) = {EOF, ‘)’}
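The slide states FOLLOW(rest_expr) without showing how it is obtained. One common way is a second closure pass on top of the FIRST sets: for every occurrence of a non-terminal B in a right-hand side N → α B β, add FIRST(β) without ε to FOLLOW(B), and add FOLLOW(N) as well when β can produce ε. The following is a minimal sketch under the same assumptions as before (symbol numbering, bit-mask sets, and all names are illustrative, not from the lecture).

```c
#include <assert.h>

enum { IDENT, PLUS, LPAREN, RPAREN, END, N_TERMINALS,     /* END is EOF */
       INPUT = N_TERMINALS, EXPRESSION, TERM, REST_EXPR, N_SYMBOLS };
#define EPSILON_BIT (1u << N_SYMBOLS)

struct rule { int lhs; int len; int rhs[3]; };
static const struct rule grammar[] = {
    { INPUT,      2, { EXPRESSION, END } },
    { EXPRESSION, 2, { TERM, REST_EXPR } },
    { TERM,       1, { IDENT } },
    { TERM,       3, { LPAREN, EXPRESSION, RPAREN } },
    { REST_EXPR,  2, { PLUS, EXPRESSION } },
    { REST_EXPR,  0, { 0 } },                     /* empty production */
};
#define N_RULES ((int)(sizeof grammar / sizeof grammar[0]))

unsigned first[N_SYMBOLS];    /* terminals that can start a symbol, plus EPSILON_BIT */
unsigned follow[N_SYMBOLS];   /* terminals that can come right after a non-terminal */

/* FIRST of the sequence rhs[0..len-1]; an empty sequence yields {epsilon}. */
static unsigned first_of_sequence(const int *rhs, int len) {
    unsigned set = 0;
    for (int i = 0; i < len; i++) {
        set |= first[rhs[i]] & ~EPSILON_BIT;
        if (!(first[rhs[i]] & EPSILON_BIT)) return set;
    }
    return set | EPSILON_BIT;
}

void compute_follow_sets(void) {
    int changed = 1;
    for (int s = 0; s < N_SYMBOLS; s++) {
        first[s] = s < N_TERMINALS ? 1u << s : 0;
        follow[s] = 0;
    }
    while (changed) {                              /* closure for FIRST */
        changed = 0;
        for (int r = 0; r < N_RULES; r++) {
            const struct rule *g = &grammar[r];
            unsigned add = first_of_sequence(g->rhs, g->len);
            if (add & ~first[g->lhs]) { first[g->lhs] |= add; changed = 1; }
        }
    }
    changed = 1;
    while (changed) {                              /* closure for FOLLOW */
        changed = 0;
        for (int r = 0; r < N_RULES; r++) {
            const struct rule *g = &grammar[r];
            for (int i = 0; i < g->len; i++) {
                int sym = g->rhs[i];
                if (sym < N_TERMINALS) continue;   /* only non-terminals */
                unsigned rest = first_of_sequence(g->rhs + i + 1, g->len - i - 1);
                unsigned add = rest & ~EPSILON_BIT;
                if (rest & EPSILON_BIT) add |= follow[g->lhs];
                if (add & ~follow[sym]) { follow[sym] |= add; changed = 1; }
            }
        }
    }
}
```

For this grammar the sketch yields FOLLOW(rest_expr) = {EOF, ‘)’}, matching the two `case` labels of the empty alternative in rest_expression().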
Limitations of LL(1) parsers
• FIRST/FIRST conflict
term → IDENTIFIER
| IDENTIFIER ‘[’ expression ‘]’
| ‘(’ expression ‘)’
• FIRST/FOLLOW conflict
S → A ‘a’ ‘b’
A → ‘a’ | ε
FIRST(A) = { ‘a’ } = FOLLOW(A)
• left recursion
expression → expression ‘-’ term | term
Making grammars LL(1)
• manual labour
• rewrite grammar
• adjust semantic actions
• three rewrite methods
• left factoring
• substitution
• left-recursion removal
Left factoring
term → IDENTIFIER
| IDENTIFIER ‘[’ expression ‘]’
• factor out common prefix
term → IDENTIFIER after_identifier
after_identifier → ε | ‘[’ expression ‘]’
‘[’ ∉ FOLLOW(after_identifier)
Substitution
S → A ‘a’ ‘b’
A → ‘a’ | ε
• replace non-terminal by its alternatives
S → ‘a’ ‘a’ ‘b’ | ‘a’ ‘b’
Left-recursion removal
N → N α | β
• N produces β α α ...
• replace by
N → β M
M → α M | ε
• example (α = ‘-’ term, β = term)
expression → expression ‘-’ term | term
expression → term tail
tail → ‘-’ term tail | ε
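Removing left recursion changes the shape of the parse tree, which is why the semantic actions have to be adjusted: ‘-’ must still associate to the left. One way to do this is to pass the value of everything parsed so far into tail as a parameter. The following is a minimal sketch, assuming that a term is an unsigned decimal constant and ignoring error handling; the names evaluate(), tail() and the cursor p are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *p;   /* cursor into the input text */

static void skip_spaces(void) { while (*p == ' ') p++; }

/* term: an unsigned decimal constant (simplification for this sketch) */
static int term(void) {
    int value = 0;
    skip_spaces();
    while (isdigit((unsigned char)*p)) value = value * 10 + (*p++ - '0');
    return value;
}

/* tail -> '-' term tail | epsilon
   'left' is the value of everything parsed so far. */
static int tail(int left) {
    skip_spaces();
    if (*p != '-') return left;     /* epsilon alternative */
    p++;                            /* consume '-' */
    int right = term();
    return tail(left - right);      /* subtract first, then continue */
}

/* expression -> term tail */
int evaluate(const char *text) {
    assert(text != NULL);
    p = text;
    int left = term();
    return tail(left);
}
```

With the accumulator, evaluate("10 - 3 - 2") computes (10 - 3) - 2 = 5; evaluating the rewritten grammar naively from right to left would give 10 - (3 - 2) = 9 instead.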
Exercise (7 min.)
• make the following grammar LL(1)
expression → expression ‘+’ term | expression ‘-’ term | term
term → term ‘*’ factor | term ‘/’ factor | factor
factor → ‘(’ expression ‘)’ | func-call | identifier | constant
func-call → identifier ‘(’ expr-list? ‘)’
expr-list → expression (‘,’ expression)*
• and what about
S → if E then S (else S)?
Answers
• substitution
F → ‘(’ E ‘)’ | ID ‘(’ expr-list? ‘)’ | ID | constant
• left factoring
E → E ( ‘+’ | ‘-’ ) T | T
T → T ( ‘*’ | ‘/’ ) F | F
F → ‘(’ E ‘)’ | ID ( ‘(’ expr-list? ‘)’ )? | constant
• left recursion removal
E → T (( ‘+’ | ‘-’ ) T )*
T → F (( ‘*’ | ‘/’ ) F )*
• if-then-else grammar is ambiguous
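In a hand-written parser the repetitions ( ... )* of the rewritten grammar map directly onto loops. The following is a minimal sketch of an evaluator for E and T, assuming that F is restricted to parenthesized expressions and unsigned decimal constants (identifiers and function calls are left out); the names evaluate_expression(), peek() and the failed flag are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *p;   /* cursor into the input text */
static int failed;      /* set when a syntax error is seen */

/* Skip blanks and return the next character without consuming it. */
static int peek(void) { while (*p == ' ') p++; return (unsigned char)*p; }

static int E(void);

/* F -> '(' E ')' | constant   (identifiers and calls omitted here) */
static int F(void) {
    int c = peek();
    if (c == '(') {
        p++;
        int value = E();
        if (peek() == ')') p++; else failed = 1;
        return value;
    }
    if (isdigit(c)) {
        int value = 0;
        while (isdigit((unsigned char)*p)) value = value * 10 + (*p++ - '0');
        return value;
    }
    failed = 1;
    return 0;
}

/* T -> F (('*' | '/') F)* */
static int T(void) {
    int value = F();
    for (;;) {
        int op = peek();
        if (op != '*' && op != '/') return value;
        p++;                                  /* consume the operator */
        int right = F();
        if (op == '*') value *= right;
        else if (right != 0) value /= right;
        else failed = 1;                      /* division by zero */
    }
}

/* E -> T (('+' | '-') T)* */
static int E(void) {
    int value = T();
    for (;;) {
        int op = peek();
        if (op != '+' && op != '-') return value;
        p++;                                  /* consume the operator */
        int right = T();
        value = (op == '+') ? value + right : value - right;
    }
}

/* Returns 1 and stores the value in *result, or returns 0 on an error. */
int evaluate_expression(const char *text, int *result) {
    assert(text != NULL && result != NULL);
    p = text;
    failed = 0;
    *result = E();
    return !failed && peek() == '\0';
}
```

Because each loop folds the next operand into the running value, the operators stay left-associative even though the grammar is no longer left-recursive.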
automatic generation
[Diagram: program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST; a parser generator turns the language grammar into an LL(1) push-down automaton that performs the syntax analysis]
LL(1) push-down automaton
• transition table: rows are states (top of stack), columns are look-ahead tokens
• stack right-hand side of production
state        IDENT             +              (                 )    EOF
input        expression EOF                   expression EOF
expression   term rest-expr                   term rest-expr
term         IDENT                            ( expression )
rest-expr                      + expression                     ε    ε
LL(1) push-down automaton
• trace for the input aap + ( noot + mies ) EOF, using the transition table
prediction stack       input                        action
input                  aap + ( noot + mies ) EOF    replace non-terminal by transition entry
expression EOF         aap + ( noot + mies ) EOF    replace non-terminal by transition entry
term rest-expr EOF     aap + ( noot + mies ) EOF    replace non-terminal by transition entry
IDENT rest-expr EOF    aap + ( noot + mies ) EOF    pop matching token
rest-expr EOF          + ( noot + mies ) EOF        replace non-terminal by transition entry
+ expression EOF       + ( noot + mies ) EOF        pop matching token
expression EOF         ( noot + mies ) EOF          …
LLgen
• top-down parser generator
• to be used in assignment #1
• discussed in lecture 5
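Before relying on a generator such as LLgen, it can help to write the LL(1) push-down automaton out by hand once. The following is a minimal sketch, not LLgen output: the symbol numbering, the scanner (identifiers are runs of letters), the fixed stack size and the names table, next_token() and ll1_parse() are assumptions made for illustration. The table entries follow the transition table of the expression grammar.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Terminals first, then non-terminals (numbering is an assumption). */
enum { IDENT, PLUS, LPAREN, RPAREN, END, N_TERMINALS,     /* END is EOF */
       INPUT = N_TERMINALS, EXPRESSION, TERM, REST_EXPR, N_SYMBOLS };

#define NONE (-1)         /* end marker of a right-hand side; also "bad token" */
#define STACK_SIZE 256

static const int rhs_input[] = { EXPRESSION, END, NONE };
static const int rhs_expr[]  = { TERM, REST_EXPR, NONE };
static const int rhs_ident[] = { IDENT, NONE };
static const int rhs_paren[] = { LPAREN, EXPRESSION, RPAREN, NONE };
static const int rhs_plus[]  = { PLUS, EXPRESSION, NONE };
static const int rhs_empty[] = { NONE };

/* table[state][look-ahead] = right-hand side to stack, or NULL for an error */
static const int *table[N_SYMBOLS - N_TERMINALS][N_TERMINALS] = {
    /*                IDENT      +         (          )          EOF      */
    /* input      */ { rhs_input, NULL,     rhs_input, NULL,      NULL      },
    /* expression */ { rhs_expr,  NULL,     rhs_expr,  NULL,      NULL      },
    /* term       */ { rhs_ident, NULL,     rhs_paren, NULL,      NULL      },
    /* rest-expr  */ { NULL,      rhs_plus, NULL,      rhs_empty, rhs_empty },
};

/* Hypothetical scanner: identifiers are runs of letters. */
static int next_token(const char **src) {
    while (**src == ' ') (*src)++;
    if (**src == '\0') return END;
    if (isalpha((unsigned char)**src)) {
        while (isalpha((unsigned char)**src)) (*src)++;
        return IDENT;
    }
    switch (*(*src)++) {
    case '+': return PLUS;
    case '(': return LPAREN;
    case ')': return RPAREN;
    default:  return NONE;                   /* unknown character */
    }
}

/* Returns 1 if text is accepted, 0 on a syntax error. */
int ll1_parse(const char *text) {
    int stack[STACK_SIZE], top = 0;
    assert(text != NULL);
    stack[top++] = INPUT;                    /* initial prediction */
    int look_ahead = next_token(&text);
    while (top > 0) {
        int state = stack[--top];
        if (look_ahead == NONE) return 0;
        if (state < N_TERMINALS) {           /* terminal: pop matching token */
            if (state != look_ahead) return 0;
            if (state != END) look_ahead = next_token(&text);
            continue;
        }
        /* non-terminal: replace it by its transition entry */
        const int *rhs = table[state - N_TERMINALS][look_ahead];
        if (rhs == NULL) return 0;
        int len = 0;
        while (rhs[len] != NONE) len++;
        if (top + len > STACK_SIZE) return 0;        /* prediction stack full */
        while (len > 0) stack[top++] = rhs[--len];   /* push right to left */
    }
    return 1;
}
```

The right-hand side is pushed in reverse so that its first symbol ends up on top of the prediction stack, which is the order in which the input has to match it.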
Summary
• syntax analysis: tokens → AST
• top-down parsing
• recursive descent
• push-down automaton
• making grammars LL(1)
[Diagram: program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST; a parser generator turns the language grammar into the syntax analyzer]
Homework
• study sections:
• 1.10 closure algorithm
• 2.2.4.6 error handling in LL(1) parsers
• print handout for next week [blackboard]
• find a partner for the “practicum”
• register your group
• send e-mail to [email protected]