Compiler construction in4020 – lecture 5

Compiler construction
in4020 – lecture 5
Koen Langendoen
Delft University of Technology
The Netherlands
Summary of lecture 4
• syntax analysis: tokens → AST
• bottom-up parsing
• push-down automaton
• ACTION/GOTO tables
• LR(0): no look-ahead
• SLR(1): one-token look-ahead, FOLLOW sets to solve shift-reduce conflicts
• LR(1): SLR(1), but FOLLOW set per item
• LALR(1): LR(1), but “equal” states are merged
Quiz
2.50 Is the following grammar LR(0), SLR(1), LALR(1), or LR(1)?
(a) S → x S x | y
(b) S → x S x | x
Overview
• semantic analysis
• identification – symbol tables
• type checking
• assignment
• yacc
• LLgen
[Figure: compiler front-end. program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST. A parser generator produces the syntax analysis phase from the language grammar.]
Semantic analysis
se-man-tic: of or relating to meaning in language. (Webster’s Dictionary)
• information is scattered throughout the program
• identifiers serve as connectors
• find defining occurrence of each applied occurrence of an identifier in the AST
• undefined identifiers → error
• unused identifiers → warning
• check rules in the language definition
• type checking
• control flow (dead code)
Symbol table
• global storage used by all compiler phases
• holds information about identifiers:
• type
• location
• size
[Figure: compiler pipeline. program text → lexical analysis → syntax analysis → context handling → annotated AST → optimizations → code generation → executable. The symbol table sits alongside and is shared by all phases.]
Symbol table implementation
• extensible string-indexable array
• linear list
• tree
• hash table
hash function: string → int
[Figure: hash table with buckets 0 to 3. Each bucket points to a chain of entries with the fields next, name, type, …; the chains hold the names ”aap”, ”noot” and ”mies”.]
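The hash-table variant can be sketched in C as follows. This is a minimal illustration, not the implementation used in the course: the function names st_lookup and st_insert, the multiplicative hash, and the use of a plain string for the type field are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N_BUCKETS 4   /* the slide draws four buckets; a real table is larger */

/* One entry in a bucket chain: next, name, type, as in the figure. */
struct entry {
    struct entry *next;
    const char *name;
    const char *type;   /* assumption: a type is represented by a string */
};

static struct entry *bucket[N_BUCKETS];

/* hash function: string -> int (simple multiplicative hash, for illustration) */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31 + (unsigned char)*s++;
    return h % N_BUCKETS;
}

/* Return the entry for name, or NULL if name is not in the table. */
struct entry *st_lookup(const char *name)
{
    struct entry *e;
    for (e = bucket[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Insert name at the front of its bucket chain and return the new entry. */
struct entry *st_insert(const char *name, const char *type)
{
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    assert(e != NULL);
    e->name = name;     /* the caller keeps the string alive */
    e->type = type;
    e->next = bucket[h];
    bucket[h] = e;
    return e;
}
```

Inserting at the front of the chain keeps insertion constant-time; lookup cost depends on how evenly the hash function spreads the names over the buckets.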
Identification
• different kinds of identifiers
• variables
• type names
• field selectors
• name spaces
• scopes

    typedef int i;
    int j;
    void foo(int j)
    {
        struct i {i i;} i;
    i:
        i.i = 3;
        printf( "%d\n", i.i);
    }
Handling scopes
• stack of scope elements
• when entering a scope a new element is pushed on the stack
• declared identifiers are entered in the top scope element
• applied identifiers are looked up in the scope elements from top to bottom
• the top element is removed upon scope exit
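The four rules above map directly onto a small stack of linked lists. The sketch below is one possible rendering in C; the names enter_scope, declare, lookup and leave_scope are chosen for this example and do not come from the slides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct decl {                 /* one declared identifier */
    struct decl *next;
    const char *name;
    int level;                /* nesting level of the declaring scope */
};

struct scope {                /* one scope element on the stack */
    struct scope *below;
    struct decl *decls;
    int level;
};

static struct scope *top = NULL;

/* Entering a scope pushes a new, empty element. */
void enter_scope(void)
{
    struct scope *s = malloc(sizeof *s);
    assert(s != NULL);
    s->below = top;
    s->decls = NULL;
    s->level = (top == NULL) ? 0 : top->level + 1;
    top = s;
}

/* A declared identifier is entered in the top scope element. */
void declare(const char *name)
{
    struct decl *d = malloc(sizeof *d);
    assert(d != NULL && top != NULL);
    d->name = name;
    d->level = top->level;
    d->next = top->decls;
    top->decls = d;
}

/* An applied identifier is looked up from top to bottom. */
struct decl *lookup(const char *name)
{
    struct scope *s;
    struct decl *d;
    for (s = top; s != NULL; s = s->below)
        for (d = s->decls; d != NULL; d = d->next)
            if (strcmp(d->name, name) == 0)
                return d;
    return NULL;              /* undefined identifier */
}

/* Leaving a scope removes the top element and its declarations. */
void leave_scope(void)
{
    struct scope *s = top;
    assert(s != NULL);
    while (s->decls != NULL) {
        struct decl *d = s->decls;
        s->decls = d->next;
        free(d);
    }
    top = s->below;
    free(s);
}
```

Because lookup starts at the top element, an inner declaration hides an outer one with the same name, and the outer one becomes visible again once the inner scope is left.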
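A scoped hash-based symbol table

    aap( int noot)
    { int mies, aap;
      ....
    }

[Figure: a scope stack with levels 0, 1 and 2 next to a hash table with buckets 0 to 3. The entry ”aap” has declarations at level 2 and level 0, ”noot” has a declaration at level 1, and ”mies” has a declaration at level 2; each declaration carries its properties (prop) and the scope stack links the declarations of one level.]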
Identification: complications
• overloading
• operators: N*2, prijs*2.20371
• functions: PUT(s:STRING), PUT(i:INTEGER)
• solution: yield set of possibilities (to be constrained by type checking)
• imported scopes
• C++ scope resolution operator x::
• Modula FROM module IMPORT ...
• solution: stack (or merge) the new scope
Type checking
• operators and functions impose restrictions on the types of the arguments
• types
• basic types
• structured types
• type names

    typedef
    struct {
        double re;
        double im;
    } complex;
Forward declarations
• recursive data structures
    TYPE Tree = POINTER TO Node;
    TYPE Node = RECORD
        element : Integer;
        left, right : Tree;
    END RECORD;
• type information must be stored
• type table
• type information must be resolved
• undefined types
• circularities
Type equivalence
• name equivalence [all types get a unique name]
VAR a : ARRAY [Integer 1..10] OF Real;
VAR b : ARRAY [Integer 1..10] OF Real;
• structural equivalence [difficult to check]
    TYPE c = RECORD i : Integer; p : POINTER TO c; END RECORD;
    TYPE d = RECORD
        i : Integer;
        p : POINTER TO
            RECORD
                i : Integer;
                p : POINTER TO c;
            END RECORD;
    END RECORD;
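Structural equivalence is hard to check mainly because recursive types make a naive comparison loop forever. One common way around this, sketched below in C, is to record every pair of types that is currently being compared and to treat a pair seen again as equivalent. The type representation (struct type, MAX_FIELDS) and the function names are assumptions made for this example, and the slides do not prescribe this particular algorithm.

```c
#include <assert.h>
#include <string.h>

enum kind { INTEGER, POINTER, RECORD };

#define MAX_FIELDS 4

struct type {
    enum kind kind;
    struct type *target;                  /* POINTER: the pointed-to type */
    int n_fields;                         /* RECORD: number of fields */
    const char *field_name[MAX_FIELDS];
    struct type *field_type[MAX_FIELDS];
};

/* Pairs already under comparison; a pair seen again is assumed equivalent. */
#define MAX_PAIRS 64
static const struct type *assumed_a[MAX_PAIRS], *assumed_b[MAX_PAIRS];
static int n_assumed;

static int equiv(const struct type *a, const struct type *b)
{
    int i;
    if (a == b) return 1;
    if (a->kind != b->kind) return 0;
    for (i = 0; i < n_assumed; i++)
        if (assumed_a[i] == a && assumed_b[i] == b)
            return 1;                     /* cycle: rely on the assumption */
    assert(n_assumed < MAX_PAIRS);
    assumed_a[n_assumed] = a;
    assumed_b[n_assumed] = b;
    n_assumed++;
    switch (a->kind) {
    case INTEGER:
        return 1;
    case POINTER:
        return equiv(a->target, b->target);
    case RECORD:
        if (a->n_fields != b->n_fields) return 0;
        for (i = 0; i < a->n_fields; i++)
            if (strcmp(a->field_name[i], b->field_name[i]) != 0
                || !equiv(a->field_type[i], b->field_type[i]))
                return 0;
        return 1;
    }
    return 0;
}

/* Entry point: start each query with an empty assumption set. */
int structurally_equivalent(const struct type *a, const struct type *b)
{
    n_assumed = 0;
    return equiv(a, b);
}
```

The assumption is safe here because every sub-check is a conjunction: if any field or target differs, the failure propagates to the top regardless of what was assumed along the way. For the types c and d above, this check reports them as equivalent.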
Coercions
• implicit data and type conversion to match operand (argument) type

    VAR a : Real;
    ...
    a := 5;

• coercions complicate identification (ambiguity)

    3.14 + 7
    8 + 9

• two phase approach
• expand a type to a set by applying coercions
• reduce type sets based on constraints imposed by (overloaded) operators and language semantics
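The two phases can be illustrated with type sets stored as bit masks. The sketch below assumes a toy language with only Integer and Real, a single coercion Integer → Real, and a '+' that is overloaded as Integer+Integer and Real+Real; the tie-breaking rule (prefer the version that needs no coercions) stands in for "language semantics" and is an assumption of this example.

```c
#include <assert.h>

enum { T_INT = 1, T_REAL = 2 };   /* a type set is a bit mask of these */

/* Phase 1: expand a type to the set of types reachable by coercion.
   Assumption: the only coercion is Integer -> Real. */
unsigned expand(unsigned type)
{
    return (type == T_INT) ? (T_INT | T_REAL) : type;
}

/* Phase 2a: '+' exists as Integer+Integer -> Integer and Real+Real -> Real,
   so a version is applicable when its type occurs in both operand sets. */
unsigned plus_candidates(unsigned left_set, unsigned right_set)
{
    return left_set & right_set;
}

/* Phase 2b: if more than one version remains, prefer Integer, which is
   only possible when neither operand needs a coercion.
   Returns 0 when no version of '+' applies. */
unsigned type_of_plus(unsigned left_type, unsigned right_type)
{
    unsigned set = plus_candidates(expand(left_type), expand(right_type));
    if (set & T_INT)
        return T_INT;
    return set;
}
```

With these definitions 3.14 + 7 has only the Real version as a candidate, so 7 is coerced, while 8 + 9 starts out ambiguous and is resolved to the Integer version.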
Variable: value or location?
• two usages of variables
rvalue: value
lvalue: location

    VAR p : Real;
    VAR q : Real;
    ...
    p := q;

• insert coercion to dereference variable
[Figure: tree for p := q. The root := has the left child "(location of) p" and the right child "deref", which is applied to "(location of) q".]
• checking rules:

                  expected
    found         lvalue    rvalue
    lvalue        -         deref
    rvalue        ERROR     -
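The checking rules are small enough to write down as a single function. The following C sketch uses names invented for this example (value_kind, check_action, check_kind); it returns what the context handler has to do when a sub-expression of one kind is found where another kind is expected.

```c
#include <assert.h>

enum value_kind { LVALUE, RVALUE };
enum check_action { ACT_NONE, ACT_DEREF, ACT_ERROR };

/* Apply the expected/found table. */
enum check_action check_kind(enum value_kind expected, enum value_kind found)
{
    if (expected == found)
        return ACT_NONE;      /* kinds agree: nothing to insert */
    if (expected == RVALUE)
        return ACT_DEREF;     /* found a location: insert a dereference */
    return ACT_ERROR;         /* a value cannot serve as a location */
}
```

For p := q the left operand is expected to be an lvalue and is one, so nothing happens; the right operand is expected to be an rvalue but q is an lvalue, so a deref node is inserted.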
Exercise (5 min.)
complete the table

    expression construct    result kind (lvalue/rvalue)
    constant                rvalue
    identifier
    &lvalue
    *rvalue
    V[rvalue]
    V.selector
    rvalue + rvalue
    lvalue = rvalue

V stands for lvalue or rvalue
Answers
complete table

    expression construct      result kind (lvalue/rvalue)
    constant                  rvalue
    identifier (variable)     lvalue
    identifier (otherwise)    rvalue
    &lvalue                   rvalue
    *rvalue                   lvalue
    V[rvalue]                 V
    V.selector                V
    rvalue + rvalue           rvalue
    lvalue = rvalue           rvalue

V stands for lvalue or rvalue
Break
Assignment (practicum)
Asterix compiler
[Figure: Asterix compiler. Asterix program → lexical analysis (generated by lex from the token description) → syntax analysis (generated by yacc from the Asterix grammar) → context handling → code generation → C-code.]
1) replace yacc by LLgen
2) make Asterix object-oriented
• classes and objects
• inheritance and dynamic binding
Yet another compiler compiler
• yacc (bison): parser generator for UNIX
• LALR(1) grammar → C code
• format of the yacc input file:

    definitions      tokens + properties
    %%
    rules            grammar rules + actions
    %%
    user code        auxiliary C-code
Yacc-based expression interpreter
• input file (left column: grammar, right column: semantics)

    %token DIGIT
    %%
    line : expr '\n'        { printf("%d\n", $1);}
         ;
    expr : expr '+' expr    { $$ = $1 + $3;}
         | expr '*' expr    { $$ = $1 * $3;}
         | '(' expr ')'     { $$ = $2;}
         | DIGIT
         ;
    %%

• yacc maintains a stack of “values” that may be referenced ($i) in the semantic actions
Yacc interface to lexical analyzer
• yacc invokes yylex() to get the next token
• the “value” of a token must be stored in the global variable yylval
• the default value type is int, but can be changed

    %%
    yylex()
    {
        int c;

        c = getchar();
        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        }
        return c;
    }
Yacc interface to back-end
• yacc generates a function named yyparse()
• syntax errors are reported by invoking a callback function yyerror()

    %%
    yylex()
    {
        ...
    }

    main()
    {
        yyparse();
    }

    yyerror()
    {
        printf("syntax error\n");
        exit(1);
    }
Yacc-based expression interpreter
• input file (desk0)

    %%
    line : expr '\n'        { printf("%d\n", $1);}
         ;
    expr : expr '+' expr    { $$ = $1 + $3;}
         | expr '*' expr    { $$ = $1 * $3;}
         | '(' expr ')'     { $$ = $2;}
         | DIGIT
         ;
    %%

• run yacc

    > make desk0
    bison -v desk0.y
    desk0.y contains 4 shift/reduce conflicts.
    gcc -o desk0 desk0.tab.c
    >

Conflict resolution in Yacc
• shift-reduce: prefer shift
• reduce-reduce: prefer the rule that comes first
Yacc-based expression interpreter
• run desk0, is it correct? NO

    > desk0
    2*3+4
    14
Operator precedence in Yacc
• priority of the %left declarations runs from top (low) to bottom (high)

    %token DIGIT
    %left '+'
    %left '*'
    %%
    line : expr '\n'        { printf("%d\n", $1);}
         ;
    expr : expr '+' expr    { $$ = $1 + $3;}
         | expr '*' expr    { $$ = $1 * $3;}
         | '(' expr ')'     { $$ = $2;}
         | DIGIT
         ;
    %%
Exercise (7 min.)
multiple lines:

    %%
    lines: line
         | lines line
         ;
    line : expr '\n'        { printf("%d\n", $1);}
         ;
    expr : expr '+' expr    { $$ = $1 + $3;}
         | expr '*' expr    { $$ = $1 * $3;}
         | '(' expr ')'     { $$ = $2;}
         | DIGIT
         ;
    %%

Extend the interpreter to a desk calculator with registers named a – z. Example input: v=3*(w+4)
Answers

    %{
    int reg[26];
    %}
    %token DIGIT
    %token REG
    %right '='
    %left '+'
    %left '*'
    %%
    expr : REG '=' expr     { $$ = reg[$1] = $3;}
         | expr '+' expr    { $$ = $1 + $3;}
         | expr '*' expr    { $$ = $1 * $3;}
         | '(' expr ')'     { $$ = $2;}
         | REG              { $$ = reg[$1];}
         | DIGIT
         ;
    %%
Answers
    %%
    yylex()
    {
        int c = getchar();

        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        } else if ('a' <= c && c <= 'z') {
            yylval = c - 'a';
            return REG;
        }
        return c;
    }
LLgen: LL(1) parser generator
• LLgen is part of the Amsterdam Compiler Kit
• takes LL(1) grammar + semantic actions in C and generates a recursive descent parser
• LLgen features:
• repetition operators
• advanced error handling
• parameter passing
• control over semantic actions
• dynamic conflict resolvers
LLgen example: expression interpreter
• start from LR(1) grammar
• make grammar LL(1)
• left recursion
• operator precedence
• use repetition operators

yacc:

    lines : line
          | lines line
          ;
    line  : expr '\n'
          ;
    expr  : expr '+' expr
          | expr '*' expr
          | '(' expr ')'
          | DIGIT
          ;
LLgen example: expression interpreter
• start from LR(1) grammar
• make grammar LL(1)
• left recursion
• operator precedence
• use repetition operators
• add semantic actions
• attach parameters to grammar rules
• insert C-code between the symbols

LLgen:

    %token DIGIT;
    main   : [line]+
           ;
    line   : expr '\n'
           ;
    expr   : term [ '+' term ]*
           ;
    term   : factor [ '*' factor ]*
           ;
    factor : '(' expr ')'
           | DIGIT
           ;
    main
        : [line]+
        ;
    line {int e;}
        : expr(&e) '\n'         { printf("%d\n", e);}
        ;
    expr(int *e) {int t;}
        : term(e)
          [ '+' term(&t)        { *e += t;}
          ]*
        ;
    term(int *t) {int f;}
        : factor(t)
          [ '*' factor(&f)      { *t *= f;}
          ]*
        ;
    factor(int *f)
        : '(' expr(f) ')'
        | DIGIT                 { *f = yylval;}
        ;

(left column: grammar, right column: semantics)
• values/results passed as parameters
• semantic actions: C code between {}
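To see what a recursive descent parser for this grammar amounts to, the rules can be transcribed by hand into C functions, one per nonterminal, with the rule parameters becoming function parameters. The sketch below is a hand-written illustration, not LLgen output: it reads from a string instead of standard input, and the helpers match and eval and the error flag ok are inventions of this example.

```c
#include <assert.h>
#include <ctype.h>

enum { END = 0, DIGIT = 256 };

static const char *input;   /* assumption: the input is a string, not stdin */
static int token;           /* current look-ahead token */
static int yylval;          /* value of the last DIGIT token */
static int ok;              /* cleared when a syntax error is seen */

static int yylex(void)
{
    int c = (unsigned char)*input;
    if (c == '\0') return END;
    input++;
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}

static void match(int expected)
{
    if (token == expected) token = yylex();
    else ok = 0;
}

static void expr(int *e);

/* factor(int *f) : '(' expr(f) ')' | DIGIT { *f = yylval; } */
static void factor(int *f)
{
    if (token == '(') {
        match('('); expr(f); match(')');
    } else if (token == DIGIT) {
        *f = yylval; match(DIGIT);
    } else {
        *f = 0; ok = 0;
    }
}

/* term(int *t) {int f;} : factor(t) [ '*' factor(&f) { *t *= f; } ]* */
static void term(int *t)
{
    int f;
    factor(t);
    while (token == '*') { match('*'); factor(&f); *t *= f; }
}

/* expr(int *e) {int t;} : term(e) [ '+' term(&t) { *e += t; } ]* */
static void expr(int *e)
{
    int t;
    term(e);
    while (token == '+') { match('+'); term(&t); *e += t; }
}

/* Evaluate one expression; returns 1 on success with the value in *result. */
int eval(const char *text, int *result)
{
    input = text;
    ok = 1;
    token = yylex();
    expr(result);
    if (token != END) ok = 0;
    return ok;
}
```

Each repetition [ ... ]* becomes a while loop on the look-ahead token, and the split into expr, term and factor gives '*' a higher priority than '+' without any precedence declarations.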
LLgen interface to lexical analyzer
• by default LLgen invokes yylex() to get the next token
• the “value” of a token can be stored in any global variable (yylval) of any type (int)

    yylex()
    {
        int c;

        c = getchar();
        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        }
        return c;
    }
LLgen interface to back-end
• LLgen generates a user-named function (parse)
• LLgen handles syntax errors by inserting missing tokens and deleting unexpected tokens
• LLmessage() is invoked to notify the lexical analyzer

    %start parse, main;

    LLmessage(int class)
    {
        switch (class) {
        case -1:
            printf("expecting EOF, ");
        case 0:
            printf("deleting token (%d)\n",LLsymb);
            break;
        default:
            /* push back token LLsymb */
            printf("inserting token (%d)\n",class);
            break;
        }
    }
Exercise (5 min.)
• extend LLgen-based interpreter to a desk calculator with registers named a – z. Example input: v=3*(w+4)
Answers

    %token REG;
    expr(int *e) {int r,t;}
        : %if (ahead() == '=')
          reg(&r) '=' expr(e)   { reg[r] = *e;}
        | term(e)
          [ '+' term(&t)        { *e += t;}
          ]*
        ;
    factor(int *f) {int r;}
        : '(' expr(f) ')'
        | DIGIT                 { *f = yylval;}
        | reg(&r)               { *f = reg[r];}
        ;
    reg(int *r)
        : REG                   { *r = yylval;}
        ;

• the %if (ahead() == '=') clause is a dynamic conflict resolution
Homework
• study sections:
• 2.2.4.6 LLgen
• 2.2.5.9 yacc
• assignment 1:
• replace yacc with LLgen
• deadline April 9 08:59
• print handout for next week [blackboard]