
CIS 461
Compiler Design & Construction
Fall 2012
slides derived from Tevfik Bultan, Keith Cooper, and
Linda Torczon
Lecture-Module #6
Further Lexical Analysis
First Phase: Lexical Analysis (Scanning)
[Figure: source code → Scanner → token → Parser → IR. The parser asks the scanner to "get next token"; errors are reported along the way.]
Scanner
• Maps stream of characters into tokens
– Basic unit of syntax
• Characters that form a word are its lexeme
• Its syntactic category is called its token
• Scanner discards white space and comments
• Scanner works as a subroutine of the parser
Lexical Analysis
• Specify tokens using Regular Expressions
• Translate Regular Expressions to Finite Automata
• Use Finite Automata to generate tables or code for the scanner
[Figure: specifications (regular expressions) → Scanner Generator → tables or code → Scanner; source code → Scanner → tokens]
Automating Scanner Construction
To build a scanner:
1 Write down the RE that specifies the tokens
2 Translate the RE to an NFA
3 Build the DFA that simulates the NFA
4 Systematically shrink the DFA
5 Turn it into code or table
Scanner generators
• Lex, Flex, and JLex work along these lines
• Algorithms are well-known and well-understood
• Interface to parser is important
Automating Scanner Construction
RE → NFA (Thompson’s construction)
• Build an NFA for each term
• Combine them with ε-moves
NFA → DFA (subset construction)
• Build the simulation
DFA → Minimal DFA
• Hopcroft’s algorithm
DFA → RE
• All pairs, all paths problem
• Union together paths from s0 to a final state
The Cycle of Constructions: RE → NFA → DFA → minimal DFA → RE
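The NFA → DFA step can be sketched in Java as follows. This is a minimal illustration, not the code a scanner generator emits: the class name, the `EPS` marker, and the hand-built NFA for a b* (laid out in the style of Thompson’s construction) are all assumptions made for the example.

```java
import java.util.*;

class SubsetConstruction {
    static final char EPS = '\0';   // assumed marker for an epsilon-move
    private final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();

    void addEdge(int from, char symbol, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(symbol, k -> new TreeSet<>())
             .add(to);
    }

    // All NFA states reachable from 'states' on one occurrence of 'symbol'.
    Set<Integer> move(Set<Integer> states, char symbol) {
        Set<Integer> result = new TreeSet<>();
        for (int s : states) {
            Map<Character, Set<Integer>> out = delta.get(s);
            if (out != null && out.containsKey(symbol)) result.addAll(out.get(symbol));
        }
        return result;
    }

    // Adds every state reachable through epsilon-moves alone.
    Set<Integer> epsilonClosure(Set<Integer> states) {
        Set<Integer> closure = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (int t : move(Collections.singleton(work.pop()), EPS)) {
                if (closure.add(t)) work.push(t);
            }
        }
        return closure;
    }

    // Each DFA state is a set of NFA states; the result maps it to its outgoing transitions.
    Map<Set<Integer>, Map<Character, Set<Integer>>> build(int start, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> q0 = epsilonClosure(Collections.singleton(start));
        dfa.put(q0, new HashMap<>());
        work.add(q0);
        while (!work.isEmpty()) {
            Set<Integer> q = work.remove();
            for (char c : alphabet) {
                Set<Integer> t = epsilonClosure(move(q, c));
                if (t.isEmpty()) continue;          // no transition: implicit error state
                dfa.get(q).put(c, t);
                if (!dfa.containsKey(t)) {
                    dfa.put(t, new HashMap<>());
                    work.add(t);
                }
            }
        }
        return dfa;
    }

    // Hand-built NFA for a b*; state 5 is the final state.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> demo() {
        SubsetConstruction nfa = new SubsetConstruction();
        nfa.addEdge(0, 'a', 1);
        nfa.addEdge(1, EPS, 2);
        nfa.addEdge(2, EPS, 3);
        nfa.addEdge(2, EPS, 5);
        nfa.addEdge(3, 'b', 4);
        nfa.addEdge(4, EPS, 3);
        nfa.addEdge(4, EPS, 5);
        return nfa.build(0, new char[] {'a', 'b'});
    }

    public static void main(String[] args) {
        demo().forEach((state, out) -> System.out.println(state + " -> " + out));
    }
}
```

For this NFA the construction yields three DFA states: {0}, {1,2,3,5}, and {3,4,5}. Any DFA state that contains NFA state 5 is final.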
NFA vs. DFA Scanners
• Given a regular expression r, we can convert it to an NFA of size O(|r|)
• Given an NFA, we can convert it to a DFA of size O(2^|r|)
• We can simulate a DFA on a string x in O(|x|) time
• We can simulate an NFA N (constructed by Thompson’s construction) on a string x in O(|N| × |x|) time
Recognizing input string x for regular expression r
Automaton Type    Space Complexity    Time Complexity
NFA               O(|r|)              O(|r| × |x|)
DFA               O(2^|r|)            O(|x|)
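To make the NFA time bound concrete, here is a small Java sketch that simulates an NFA directly by tracking the set of active states. Each input character touches every state and edge at most once, which is where the |N| factor per character comes from. The edge encoding, the `EPS` marker, and the example NFA for a b* are assumptions for illustration.

```java
import java.util.*;

class NfaSimulation {
    static final int EPS = -1;                 // assumed marker for an epsilon-move
    static final int START = 0, FINAL = 5, STATE_COUNT = 6;
    // Each edge is {from, symbol, to}; hand-built NFA for a b*.
    static final int[][] EDGES = {
        {0, 'a', 1}, {1, EPS, 2}, {2, EPS, 3}, {2, EPS, 5},
        {3, 'b', 4}, {4, EPS, 3}, {4, EPS, 5}
    };
    static final List<List<int[]>> OUT = new ArrayList<>();
    static {
        for (int i = 0; i < STATE_COUNT; i++) OUT.add(new ArrayList<>());
        for (int[] e : EDGES) OUT.get(e[0]).add(e);
    }

    // Worklist epsilon-closure: each state is pushed at most once.
    static boolean[] closure(boolean[] active) {
        Deque<Integer> work = new ArrayDeque<>();
        for (int s = 0; s < STATE_COUNT; s++) if (active[s]) work.push(s);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : OUT.get(s)) {
                if (e[1] == EPS && !active[e[2]]) {
                    active[e[2]] = true;
                    work.push(e[2]);
                }
            }
        }
        return active;
    }

    static boolean accepts(String x) {
        boolean[] current = new boolean[STATE_COUNT];
        current[START] = true;
        closure(current);
        for (char c : x.toCharArray()) {
            boolean[] next = new boolean[STATE_COUNT];
            for (int s = 0; s < STATE_COUNT; s++) {
                if (!current[s]) continue;
                for (int[] e : OUT.get(s)) if (e[1] == c) next[e[2]] = true;
            }
            current = closure(next);
        }
        return current[FINAL];
    }

    public static void main(String[] args) {
        System.out.println(accepts("abbb"));   // true
        System.out.println(accepts("ba"));     // false
    }
}
```

A DFA for the same language would replace the inner loops with a single table lookup per character, at the cost of the potentially larger state space shown in the table.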
Scanner Generators: JLex, Lex, FLex
user code                   (directly copied to the output file)
%%
JLex directives             (macro (regular) definitions, e.g., digits = [0-9]+, and state names)
%%
regular expression rules    (each rule: optional state list, regular expression, action)
• User code at the top (from the parser generator) specifies what the tokens are
• States can be mixed with regular expressions
• For each regular expression we can define a set of states where it is valid (JLex, Flex)
• Standard format of regular expression rule:
<optional_state_list> regular_expression { actions }
JLex, FLex, Lex
Regular expression rules:
r_1 { action_1 }
r_2 { action_2 }
...
r_n { action_n }
The generated scanner is Java code for JLex, C code for FLex and Lex.
[Figure: a new start state s0 leads by ε-moves to the automata A_r_1, A_r_2, ..., A_r_n, one per regular expression. Each automaton ends in its own new final state or in an error state.]
Rules used by scanner generators
1) Continue scanning the input until reaching an error state
2) Accept the longest prefix that matches a regular expression and execute the corresponding action
3) If two patterns match the longest prefix, then the action which is specified earlier will be executed
4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
For faster scanning, convert this NFA to a DFA and minimize the states
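The longest-match and earliest-rule conventions can be illustrated with a short Java sketch. It uses `java.util.regex` instead of a generated automaton, so it only imitates what a scanner generator does. The rule set mirrors the if/Id/Num example used in this lecture, and the class and token names are made up for the illustration. For these simple patterns, `lookingAt()` returns the longest prefix each rule can match; that would not hold for arbitrary patterns containing alternation.

```java
import java.util.*;
import java.util.regex.*;

class MaximalMunch {
    // Rules in priority order; earlier rules win ties.
    static final String[] NAMES = {"IF", "ID", "NUM", "WS"};
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[ \\t\\f\\r\\n]+")
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0, bestRule = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input);
                m.region(pos, input.length());
                // lookingAt() matches a prefix of the region; strict '>' keeps the earlier rule on ties.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = i;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            if (!NAMES[bestRule].equals("WS")) {      // white space is discarded
                tokens.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;                            // resume right after the accepted prefix
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("if iffy 42 if9"));   // [IF:if, ID:iffy, NUM:42, ID:if9]
    }
}
```

Here "if" matches both the keyword rule and the identifier rule with the same length, so the earlier rule wins; "iffy" and "if9" are longer as identifiers, so the longest-prefix rule picks ID.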
A Simple Example
Recognize the following tokens:
Id = [a-z][a-z0-9]*
Num = [0-9]+
if = "if"
Also take care of one line comments and white space:
WhiteSpace = [\ \t\f\b\r\n]
Comment = \/\/.*
/* User code */
import java.io.*; // For FileInputStream and its exceptions.
/* ========================================== */
class Type {
    static final int IF = 0;
    static final int ID = 1;
    static final int NUM = 2;
    static final int EOF = 3;
}
class Token {
    public int type;
    public String attribute;   // lexeme text; only set for ID and NUM
    public Token(int t) {
        type = t;
    }
    public Token(int t, String s) {
        type = t;
        attribute = s;
    }
    public static String spellingOf(int t) {
        switch (t) {
            case Type.IF  : return "IF";
            case Type.ID  : return "ID";
            case Type.NUM : return "NUM";
            default       : return "Undefined token type";
        }
    }
    public String toString() {
        switch (type) {
            case Type.ID :
            case Type.NUM :
                return spellingOf(type) + ", " + attribute;
            default:
                return spellingOf(type);
        }
    }
}
/* ================================================= */
class Example {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        FileInputStream fis = new FileInputStream(args[0]);
        Lexer L = new Lexer(fis);
        Token T = L.next();
        while (T.type != Type.EOF) {
            System.out.println(T);
            T = L.next();
        }
    }
}
/* ================================================ */
%%
/* JLex directives */
%class Lexer
%function next
%type Token
%eofval{
return new Token(Type.EOF);
%eofval}
/* white space */
WhiteSpace = [\ \t\f\b\r\n]
/* comments */
Comment = \/\/.*
Id = [a-z][a-z0-9]*
Num = [0-9]+
%%
{WhiteSpace} {}
{Comment}    {}
"if"         { return new Token(Type.IF); }
{Id}         { return new Token(Type.ID, yytext()); }
{Num}        { return new Token(Type.NUM, yytext()); }
If the above JLex specification is in a file simple.jlx, you can generate a scanner
for that specification as follows:
% cd <directory for simple.jlx>
% setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
% java JLex.Main simple.jlx
% javac simple.jlx.java
% java Example input1
Input 1:
if i1
// this is a comment
if var15 15
1 2
4253
Output 1:
IF
ID, i1
IF
ID, var15
NUM, 15
NUM, 1
NUM, 2
NUM, 4253
Input 2:
if i1
// this is a comment
if var15 15
1,2,3
Output 2:
IF
ID, i1
IF
ID, var15
NUM, 15
NUM, 1
Undefined token type
NUM, 2
Undefined token type
NUM, 3
Building Faster Scanners from the DFA
Table-driven recognizers waste a lot of effort
• Read (& classify) the next character
• Find the next state
• Assign to the state variable
• Branch back to the top
state = s0;
string = ε;
char = get_next_char();
while (char != eof) {
    state = δ(state, char);
    string = string + char;
    char = get_next_char();
}
if (state in Final) then
    report acceptance;
else
    report failure;
We can do better
• Encode state & actions in the code
• Do transition tests locally
• Generate ugly, spaghetti-like code (it is OK, this is automatically generated code)
• Takes (many) fewer operations per input character
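As a point of comparison, a table-driven recognizer might look like the following Java sketch. It recognizes the register expression r Digit Digit* used in the direct-coded example of this lecture; the state numbering, the character classes, and the class name are assumptions for the illustration.

```java
class TableDrivenRecognizer {
    // Character classes (columns of the transition table).
    static final int R = 0, DIGIT = 1, OTHER = 2;
    // States (rows): S0 start, S1 after 'r', S2 after at least one digit, SE error.
    static final int S0 = 0, S1 = 1, S2 = 2, SE = 3;
    static final int[][] DELTA = {
        /* S0 */ {S1, SE, SE},
        /* S1 */ {SE, S2, SE},
        /* S2 */ {SE, S2, SE},
        /* SE */ {SE, SE, SE}
    };

    // Classifying the character keeps the table narrow.
    static int classify(char c) {
        if (c == 'r') return R;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    static boolean accepts(String input) {
        int state = S0;
        for (int i = 0; i < input.length(); i++) {
            // Per character: classify, look up the next state, assign, loop.
            state = DELTA[state][classify(input.charAt(i))];
        }
        return state == S2;   // S2 is the only final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("r17"));   // true
        System.out.println(accepts("r"));     // false
    }
}
```

Every character costs a classification, a two-dimensional table lookup, an assignment, and a loop branch, which is the overhead the bullets above refer to.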
Building Faster Scanners from the DFA
A direct-coded recognizer for the Register regular expression r Digit Digit*
goto s0;
s0: string ← ε;
    char ← get_next_char();
    if (char = ‘r’)
        then goto s1;
        else goto se;
s1: string ← string + char;
    char ← get_next_char();
    if (‘0’ ≤ char ≤ ‘9’)
        then goto s2;
        else goto se;
s2: string ← string + char;
    char ← get_next_char();
    if (‘0’ ≤ char ≤ ‘9’)
        then goto s2;
    else if (char = eof)
        then report acceptance;
        else goto se;
se: print error message;
    return failure;
• Many fewer operations per character
• State is encoded as the location in the code
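Java has no goto, so a literal translation is not possible, but the same idea can be sketched with straight-line code in which each state corresponds to a position in the method. This is an illustrative rewrite for r Digit Digit*, not the output of any particular generator; the class and helper names are assumptions.

```java
class DirectCodedRecognizer {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static boolean accepts(String input) {
        int i = 0;
        int n = input.length();
        // s0: the first character must be 'r'
        if (i >= n || input.charAt(i) != 'r') return false;
        i++;
        // s1: at least one digit must follow
        if (i >= n || !isDigit(input.charAt(i))) return false;
        i++;
        // s2: consume any further digits
        while (i < n && isDigit(input.charAt(i))) i++;
        // accept only if the whole input was consumed (the eof test in s2)
        return i == n;
    }

    public static void main(String[] args) {
        System.out.println(accepts("r17"));   // true
        System.out.println(accepts("r1x"));   // false
    }
}
```

There is no state variable and no table: the transition tests are local comparisons, and the current state is implied by which statement is executing.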
Building Faster Scanners
Hashing keywords versus encoding them directly
• Some compilers recognize keywords as identifiers and check them in a
hash table
• Encoding them in the DFA is a better idea
– O(1) cost per transition
– Avoids hash lookup on each identifier
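For contrast, the hash-table approach that this slide argues against can be sketched in a few lines of Java. The scanner recognizes every word as an identifier first and then consults a keyword table; the table contents and names below are assumptions for the illustration.

```java
import java.util.*;

class KeywordLookup {
    // Hypothetical keyword table; a real language would list all reserved words.
    static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", "IF");
    }

    // Called after the DFA has already accepted the lexeme as an identifier.
    static String classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, "ID");
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));     // IF
        System.out.println(classify("iffy"));   // ID
    }
}
```

The lookup runs once for every identifier in the program, which is the cost the DFA encoding avoids.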
What is hard about Lexical Analysis?
Poor language design can complicate scanning
• Reserved words are important
– In PL/I there are no reserved keywords, so you can write a valid
statement like:
if then then then = else; else else = then
• Significant blanks
– In Fortran blanks are not significant
do 10 i = 1,25    (do loop)
do 10 i = 1.25    (assignment to variable named do10i)
• Closures
– Limited identifier length adds states to the automata to count
length
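The Fortran blank problem above means the scanner cannot classify `do` until it has looked ahead for a comma. A deliberately simplified Java sketch of that lookahead follows; the regular expression and class name are assumptions, and the check ignores cases such as commas inside parentheses that a real Fortran front end must handle.

```java
import java.util.regex.*;

class FortranDoCheck {
    // After blanks are removed: do <label> <variable> = <expr> , <expr>
    static final Pattern DO_LOOP =
        Pattern.compile("do(\\d+)([a-z][a-z0-9]*)=([^,]+),(.+)");

    static String classify(String statement) {
        // Blanks are not significant, so drop them before matching.
        String squeezed = statement.replace(" ", "").toLowerCase();
        return DO_LOOP.matcher(squeezed).matches() ? "do loop" : "assignment";
    }

    public static void main(String[] args) {
        System.out.println(classify("do 10 i = 1,25"));   // do loop
        System.out.println(classify("do 10 i = 1.25"));   // assignment
    }
}
```

The two statements are identical up to the character after the 1, so the decision cannot be made from a fixed-length prefix.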
What can be so hard? (Fortran 66/77)
      INTEGERFUNCTIONA                     integer function A
      PARAMETER(A=6,B=2)                   macro definitions
      IMPLICIT CHARACTER*(A-B)(A-B)        first A and B are converted to (6-2); this statement declares that
                                           variables that begin with A and B are of data-type four character string
      INTEGER FORMAT(10), IF(10), DO9E1
100   FORMAT(4H)=(3)                       statement for formatting input, output; )=(3 is a literal constant
200   FORMAT(4 )=(3)
      DO9E1=1                              assigns value to variable DO9E1
      DO9E1=1,2
9     IF(X)=1                              assigns value to array element
      IF(X)H=1
      IF(X)300,200
300   CONTINUE
      END                                  one statement split into two lines
C     THIS IS A “COMMENT CARD”
$     FILE(1)
      END
How does a compiler do this?
• First pass finds & inserts blanks
• Can add extra words or tags to create a scannable language
• Second pass is normal scanner