Lexical Analysis


CS375
Compilers
Lexical Analysis
4th February, 2010
Outline
• Overview of a compiler.
• What is lexical analysis?
• Writing a Lexer
– Specifying tokens: regular expressions
– Converting regular expressions to NFA,
DFA
• Optimizations.
2
How It Works
[Figure: program representation at each compiler phase]
Source code (character stream): if (b == 0) a = b;
→ Lexical Analysis → Token stream: if ( b == 0 ) a = b ;
→ Syntax Analysis (Parsing) → Abstract syntax tree (AST): if, with condition == (b, 0) and body = (a, b)
→ Semantic Analysis → Decorated AST: boolean == (int b, int 0); = (int a lvalue, int b)
3
What is a lexical analyzer
• What?
– Reads in a stream of characters and
groups them into “tokens” or “lexemes”.
– Language definition describes what tokens
are valid.
• Why?
– Makes writing the parser a lot easier,
parser operates on “tokens”.
– Isolates input-dependent functionality such as
character codes, EOF, and newline characters.
4
First Step: Lexical Analysis
Source code (character stream): if (b == 0) a = b;
→ Lexical Analysis → Token stream: if ( b == 0 ) a = b ;
→ Syntax Analysis → Semantic Analysis
5
What should it do?
[Figure: a box labelled “?” takes a token description and an input string w, and answers “Token? {Yes/No}”]
We want some way to describe tokens, and have our Lexer
take that description as input and decide if a string is a
token or not.
6
STARTING OFF
7
Tokens
• Logical grouping of characters.
• Identifiers: x y11 elsen _i00
• Keywords: if else while break
• Constants:
– Integer: 2 1000 -500 5L 0x777
– Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10
– String: "x" "He said, \"Are you?\"\n"
– Character: 'c' '\000'
• Symbols: + * { } ++ < << [ ] >=
• Whitespace (typically recognized and discarded):
– Comment: /** don't change this **/
– Space: <space>
– Format characters: <newline> <return>
8
Ad-hoc Lexer
• Hand-write code to generate tokens
• How to read identifier tokens?
Token readIdentifier() {
    String id = "";
    while (true) {
        char c = input.read();
        if (!identifierChar(c))
            return new Token(ID, id, lineNumber);
        id = id + String(c);
    }
}
• Problems
– How to start?
– What to do with following character?
– How to avoid quadratic complexity of repeated concatenation?
– How to recognize keywords?
9
Look-ahead Character
• Scan text one character at a time
• Use look-ahead character (next) to determine
what kind of token to read and when the
current token ends
char next;
…
while (identifierChar(next)) {
    id = id + String(next);
    next = input.read();
}
[Figure: the input characters e l s e n, with an arrow labelled next (lookahead) pointing into the input]
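The look-ahead loop can be sketched in Java as below. This is only an illustration under stated assumptions: the class name LookaheadSketch, the keyword table, the string-based input, and the "KEYWORD:"/"ID:" result format are all hypothetical and not part of the slides. It uses a StringBuilder to avoid the quadratic cost of repeated concatenation, and a table lookup after scanning to recognize keywords.

```java
import java.util.Set;

public class LookaheadSketch {
    // Hypothetical keyword table; a real language would list all of its keywords.
    static final Set<String> KEYWORDS = Set.of("if", "else", "while", "break");

    private final String input;
    private int pos = 0;
    private int next;   // look-ahead character, or -1 at end of input

    public LookaheadSketch(String input) {
        this.input = input;
        advance();      // prime the look-ahead before any token is read
    }

    private void advance() {
        next = pos < input.length() ? input.charAt(pos++) : -1;
    }

    static boolean identifierChar(int c) {
        return c == '_' || Character.isLetterOrDigit(c);
    }

    /** Reads one identifier or keyword, leaving next on the first character after it. */
    public String readIdentifier() {
        StringBuilder id = new StringBuilder();   // appends are amortized constant time
        while (next != -1 && identifierChar(next)) {
            id.append((char) next);
            advance();
        }
        String text = id.toString();
        return (KEYWORDS.contains(text) ? "KEYWORD:" : "ID:") + text;
    }

    /** Skips blanks between tokens. */
    public void skipSpaces() {
        while (next == ' ') advance();
    }

    public static void main(String[] args) {
        LookaheadSketch lx = new LookaheadSketch("else elsen");
        System.out.println(lx.readIdentifier());
        lx.skipSpaces();
        System.out.println(lx.readIdentifier());
    }
}
```

Because the keyword check happens only after the whole identifier has been read, "elsen" is classified as an identifier even though it starts with the keyword "else".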
10
Ad-hoc Lexer: Top-level Loop
class Lexer {
    InputStream s;
    char next;
    Lexer(InputStream _s) { s = _s; next = s.read(); }
    Token nextToken() {
        if (identifierFirstChar(next)) // starts with a char
            return readIdentifier();   // is an identifier
        if (numericFirstChar(next))    // starts with a num
            return readNumber();       // is a number
        if (next == '\"') return readStringConst();
        …
    }
}
11
Problems
• Might not know what kind of token we
are going to read from seeing first
character
– if token begins with “i”, is it an identifier?
(what about int, if)
– if token begins with “2”, is it an integer
constant?
– interleaved tokenizer code hard to write
correctly, harder to maintain
– in general, unbounded look-ahead may be
needed
12
Problems (cont.)
• How to specify (unambiguously) tokens.
• Once specified, how to implement them
in a systematic way?
• How to implement them efficiently?
13
Problems (cont. )
• For instance, consider:
– How to describe tokens unambiguously
    2.e0   20.e-01   2.0000
    ""   "x"   "\\"   "\"\'"
– How to break up text into tokens
    if (x == 0) a = x<<1;
    if (x == 0) a = x<1;
– How to tokenize efficiently
    • tokens may have similar prefixes
    • want to look at each character ~1 time
14
Principled Approach
• Need a principled approach
1. Lexer Generators
– A lexer generator (a.k.a. scanner generator)
produces an efficient tokenizer automatically
(e.g., lex, flex, JLex)
2. Your own Lexer
– Describe programming language’s tokens
with a set of regular expressions
– Generate scanning automaton from that set
of regular expressions
15
Top level idea…
• Have a formal language to describe
tokens.
– Use regular expressions.
• Have a mechanical way of converting
this formal description to code.
– Convert regular expressions to finite
automaton (acceptors/state machines)
• Run the code on actual inputs.
– Simulate the finite automaton.
16
An Example : Integers
• Consider integers.
• We can describe integers using the
following grammar:
• Num -> ‘-’ Pos
• Num -> Pos
• Pos -> 0 | 1 | … | 9
• Pos -> (0 | 1 | … | 9) Pos
• Or in a more compact notation, we
have:
– Num-> -? [0-9]+
17
An Example : Integers
• Using Num-> -? [0-9]+ we can
generate integers such as -12, 23, 0.
• We can also represent above regular
expression as a state machine.
– This would be useful in simulation of the
regular expression.
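As a quick check of the compact notation, the expression -?[0-9]+ can be tried with Java's standard java.util.regex package. This is a sketch only: the class and method names are made up for illustration, and java.util.regex is a backtracking matcher rather than the automaton-based approach developed in these slides.

```java
import java.util.regex.Pattern;

public class IntegerRegex {
    // Num -> -? [0-9]+ : an optional minus sign followed by one or more digits.
    static final Pattern NUM = Pattern.compile("-?[0-9]+");

    /** True if the whole string w is generated by Num. */
    static boolean isInteger(String w) {
        return NUM.matcher(w).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"-12", "23", "0", "-", "1-2"}) {
            System.out.println(w + " : " + isInteger(w));
        }
    }
}
```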
18
An Example : Integers
• The Non-deterministic Finite Automaton
is as follows.
• We can verify that -123, 65, 0 are
accepted by the state machine.
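• But which path to take?
[Figure: NFA for -?[0-9]+ with states 0, 1, 2 and 3, ε-transitions, and edges labelled 0-9; the ε-transitions are annotated “ε-paths?”]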
19
An Example : Integers
• The NFA can be converted to an
equivalent Deterministic FA as below.
– We shall see later how.
• It accepts the same tokens.
– -123
– 65
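– 0
[Figure: DFA with states {0,1}, {1} and {2,3}; {0,1} goes to {1} on “-”, and edges labelled 0-9 lead from {0,1}, {1} and {2,3} into {2,3}]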
20
An Example : Integers
• The deterministic finite automaton
makes implementation much easier, as
we shall see later.
• So, all we have to do is:
– Express tokens as regular expressions
– Convert RE to NFA
– Convert NFA to DFA
– Simulate the DFA on inputs
21
The larger picture…
[Figure: the overall pipeline]
Regular expression R describing tokens → RE → NFA conversion → NFA → DFA conversion → DFA simulation
Input string w → DFA simulation → Yes, if w is a valid token; No, if not
22
QUICK LANGUAGE THEORY
REVIEW…
23
Language Theory Review
• Let Σ be a finite set
– Σ is called an alphabet
– a ∈ Σ is called a symbol
• Σ* is the set of all finite strings
consisting of symbols from Σ
• A subset L ⊆ Σ* is called a language
• If L1 and L2 are languages, then L1 L2 is
the concatenation of L1 and L2, i.e., the
set of all pair-wise concatenations of
strings from L1 and L2, respectively
24
Language Theory Review, ctd.
• Let L ⊆ Σ* be a language
• Then
– L⁰ = {ε}
– Lⁿ⁺¹ = L Lⁿ for all n ≥ 0
• Examples
– if L = {a, b} then
• L¹ = L = {a, b}
• L² = {aa, ab, ba, bb}
• L³ = {aaa, aab, aba, abb, baa, bab, bba, bbb}
• …
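The definition of language powers can be checked mechanically. The sketch below is an illustration only (the class and method names are hypothetical): it represents a finite language as a set of strings and applies the rule that the zeroth power is the set containing only the empty string, and each further power concatenates L in front of the previous one.

```java
import java.util.Set;
import java.util.TreeSet;

public class LanguagePower {
    /** Concatenation L1 L2: every string of l1 followed by every string of l2. */
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new TreeSet<>();
        for (String x : l1) {
            for (String y : l2) {
                out.add(x + y);
            }
        }
        return out;
    }

    /** The n-th power of l, starting from the language containing only the empty string. */
    static Set<String> power(Set<String> l, int n) {
        Set<String> result = new TreeSet<>(Set.of(""));
        for (int i = 0; i < n; i++) {
            result = concat(l, result);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> l = Set.of("a", "b");
        System.out.println(power(l, 2));
        System.out.println(power(l, 3));
    }
}
```

For L = {a, b} the third power has eight distinct strings, one for each way of choosing three symbols.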
25
Syntax of Regular Expressions
• Set of regular expressions (RE) over
alphabet Σ is defined inductively by
– Let a ∈ Σ and R, S ∈ RE. Then:
• a ∈ RE
• ε ∈ RE
• ∅ ∈ RE
• R|S ∈ RE
• RS ∈ RE
• R* ∈ RE
• In concrete syntactic form, precedence
rules, parentheses, and abbreviations are also used
26
Semantics of Regular Expressions
• Regular expression T ∈ RE denotes the
language L(T) ⊆ Σ* given according to the
inductive structure of T:
– L(a) = {a}   the string “a”
– L(ε) = {“”}   the empty string
– L(∅) = {}   the empty set
– L(R|S) = L(R) ∪ L(S)   alternation
– L(RS) = L(R) L(S)   concatenation
– L(R*) = {“”} ∪ L(R) ∪ L(R²) ∪ L(R³) ∪ L(R⁴) ∪ …   Kleene closure
27
Simple Examples
• L(R) = the “language” defined by R
– L( abc ) = { abc }
– L( hello|goodbye ) = {hello, goodbye}
• OR operator: L(R|S) contains every string of L(R) and every
string of L(S), so L(a|b) = {a, b}.
– L( 1(0|1)* ) = all non-zero binary numerals
beginning with 1
• Kleene star: zero or more repetitions of strings matched by
the parenthesized expression.
28
Convenient RE Shorthand
R+        one or more strings from L(R): R(R*)
R?        optional R: (R|ε)
[abce]    one of the listed characters: (a|b|c|e)
[a-z]     one character from this range: (a|b|c|d|e|…|y|z)
[^ab]     anything but one of the listed chars
[^a-z]    one character not from this range
"abc"     the string “abc”
\(        the character ‘(’
...
id=R      named non-recursive regular expressions
29
More Examples
Regular Expression R              Strings in L(R)
digit = [0-9]                     “0” “1” “2” “3” …
posint = digit+                   “8” “412” …
int = -? posint                   “-42” “1024” …
real = int ((. posint)?)          “-1.56” “12” “1.0”
     = (-|ε)([0-9]+)((. [0-9]+)|ε)
[a-zA-Z_][a-zA-Z0-9_]*            C identifiers
else                              the keyword “else”
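The named expressions in the table can be composed the same way in Java by building pattern strings from smaller ones. This is a sketch under assumptions: the constant names mirror the table but the class is hypothetical, the dot is escaped because java.util.regex treats a bare dot as "any character", and the table's int is called INT_RE here only to keep the constant names uniform.

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    static final String DIGIT = "[0-9]";
    static final String POSINT = DIGIT + "+";
    static final String INT_RE = "-?" + POSINT;
    // real = int ((. posint)?), with the literal dot escaped for java.util.regex.
    static final String REAL = INT_RE + "(\\." + POSINT + ")?";
    static final String IDENT = "[a-zA-Z_][a-zA-Z0-9_]*";

    /** True if the whole string w is in the language of the given expression. */
    static boolean matches(String regex, String w) {
        return Pattern.matches(regex, w);
    }

    public static void main(String[] args) {
        System.out.println(matches(REAL, "-1.56"));
        System.out.println(matches(REAL, "1."));
        System.out.println(matches(IDENT, "_i00"));
    }
}
```

Note that this definition of real rejects "1." because the fractional part, when present, needs at least one digit.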
30
Historical Anomalies
• PL/I
– Keywords not reserved
• IF IF THEN THEN ELSE ELSE;
• FORTRAN
– Whitespace stripped out prior to scanning
• DO 123 I = 1 (an assignment to the variable DO123I)
• DO 123 I = 1, 2 (the header of a DO loop)
• By and large, modern language design
intentionally makes scanning easier
31
WRITING A LEXER
32
Writing a Lexer
• Regular Expressions can be very useful
in describing languages (tokens).
– Use an automatic Lexer generator (Flex,
Lex) to generate a Lexer from language
specification.
– Have a systematic way of writing a Lexer
from a specification such as regular
expressions.
33
WRITING YOUR OWN LEXER
34
How To Use Regular Expressions
• Given R ∈ RE and input string w, need
a mechanism to determine if w ∈ L(R)
[Figure: a box labelled “?” takes R ∈ RE (that describes a token family) and an input string w (from the program), and answers Yes, if w is a token; No, if w is not a token]
• Such a mechanism is called an acceptor
35
Acceptors
• Acceptor determines if an input string belongs
to a language L
[Figure: a finite automaton acceptor takes a description of language L and an input string w, and answers Yes, if w ∈ L; No, if w ∉ L]
• Finite automata are acceptors for languages
described by regular expressions
36
Finite Automata
• Informally, a finite automaton consists of:
– A finite set of states
– Transitions between states
– An initial state (start state)
– A set of final states (accepting states)
• Two kinds of finite automata:
– Deterministic finite automata (DFA): the transition
from each state is uniquely determined by the
current input character
– Non-deterministic finite automata (NFA): there
may be multiple possible choices, and some
“spontaneous” transitions without input
37
DFA Example
• Finite automaton that accepts the strings in
the language denoted by regular expression
ab*a
• Can be represented as a graph or a transition
table.
– A graph.
• Read symbol
• Follow outgoing edge
[Figure: DFA with states 0, 1 and 2; 0 goes to 1 on a, 1 loops on b, and 1 goes to 2 on a]
38
DFA Example (cont.)
• Representing FA as transition tables
makes the implementation very easy.
• The above FA can be represented as:
– Current state and current symbol
determine next state.
– Until
• error state.
• End of input.
State   a       b
0       1       Error
1       2       1
2       Error   Error
39
Simulating the DFA
• Determine if the DFA accepts an input string
transition_table[NumSTATES][NumCHARS]
accept_states[NumSTATES]
state = INITIAL
while (state != Error)
{
    c = input.read();
    if (c == EOF) break;
    state = transition_table[state][c];
}
return (state != Error) && accept_states[state];
[Figure: the DFA for ab*a, with states 0, 1 and 2 and edges labelled a, b, a]
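A runnable version of this table-driven loop, for the ab*a automaton, might look as follows. It is a minimal sketch: the class name, the two-column table restricted to the symbols a and b, and the use of -1 as the Error state are choices made for this illustration.

```java
public class DfaSimulation {
    static final int ERROR = -1;

    // Rows are states 0..2; column 0 is the symbol 'a', column 1 is the symbol 'b'.
    static final int[][] TRANS = {
        {1, ERROR},      // state 0: a -> 1
        {2, 1},          // state 1: a -> 2, b -> 1
        {ERROR, ERROR},  // state 2: no outgoing edges
    };
    static final boolean[] ACCEPT = {false, false, true};

    /** Runs the DFA over w and reports whether it ends in an accepting state. */
    static boolean accepts(String w) {
        int state = 0;
        for (int i = 0; i < w.length() && state != ERROR; i++) {
            char c = w.charAt(i);
            int col = c == 'a' ? 0 : c == 'b' ? 1 : -1;   // any other character is an error
            state = col < 0 ? ERROR : TRANS[state][col];
        }
        return state != ERROR && ACCEPT[state];
    }

    public static void main(String[] args) {
        for (String w : new String[] {"aa", "abba", "a", "abab"}) {
            System.out.println(w + " : " + accepts(w));
        }
    }
}
```

Each input character is examined exactly once, so the running time is linear in the length of the input regardless of the regular expression.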
40
RE → Finite automaton?
• Can we build a finite automaton for every
regular expression?
• Strategy: build the finite automaton
inductively, based on the definition of regular
expressions
[Figure: automata for the base cases a, ε and ∅]
41
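RE → Finite automaton?
• Alternation R|S
[Figure: R automaton and S automaton side by side, with the connecting moves marked “?”]
• Concatenation: RS
[Figure: R automaton followed by S automaton, with the connecting move marked “?”]
• Recall ? implies optional move.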
42
NFA Definition
• A non-deterministic finite automaton (NFA) is
an automaton where:
– There may be ε-transitions (transitions that do not
consume input characters)
– There may be multiple transitions from the same
state on the same input character
Example:
[Figure: an NFA with ε-transitions and edges labelled a and b]
43
RE → NFA intuition
-?[0-9]+
[Figure: NFA for -?[0-9]+ with ε-transitions and edges labelled 0-9]
When to take the ε-path?
44
NFA construction (Thompson)
• NFA only needs one stop state (why?)
• Canonical NFA form:
• Use this canonical form to inductively
construct NFAs for regular expressions
45
Inductive NFA Construction
[Figure: inductive constructions for RS, R|S and R*, in which the NFAs for R and S are connected by ε-transitions]
46
DFA vs NFA
• DFA: action of automaton on each input
symbol is fully determined
– obvious table-driven implementation
• NFA:
– automaton may have choice on each step
– automaton accepts a string if there is any
way to make choices to arrive at accepting
state
– every path from start state to an accept
state is a string accepted by automaton
– not obvious how to implement!
48
Simulating an NFA
• Problem: how to execute NFA?
“strings accepted are those for which there is
some corresponding path from start state to
an accept state”
• Solution: search all paths in graph consistent
with the string in parallel
– Keep track of the subset of NFA states that search
could be in after seeing string prefix
– “Multiple fingers” pointing to graph
49
Example
• Input string: -23
• NFA states:
– Start: {0,1}
– “-”: {1}
– “2”: {2, 3}
– “3”: {2, 3}
[Figure: the NFA for -?[0-9]+ with states 0 to 3, ε-transitions, and edges labelled 0-9]
• But this is very difficult to implement
directly.
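The "multiple fingers" idea can be sketched for this one NFA by tracking the set of states the automaton could be in. The edge list used below is an assumption reconstructed from the state sets in the example (0 to 1 on "-" or ε, 1 to 2 on a digit, 2 to 2 on a digit, 2 to 3 on ε, with 3 accepting); the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

public class NfaSimulation {
    static final int ACCEPT = 3;

    /** Adds states reachable by ε-transitions; one pass suffices because the two ε-edges do not chain. */
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> t = new TreeSet<>(states);
        if (t.contains(0)) t.add(1);   // assumed ε-edge 0 -> 1
        if (t.contains(2)) t.add(3);   // assumed ε-edge 2 -> 3
        return t;
    }

    /** Moves every "finger" on character c, then takes the ε-closure. */
    static Set<Integer> step(Set<Integer> states, char c) {
        Set<Integer> moved = new TreeSet<>();
        boolean digit = c >= '0' && c <= '9';
        for (int s : states) {
            if (s == 0 && c == '-') moved.add(1);
            if ((s == 1 || s == 2) && digit) moved.add(2);
        }
        return closure(moved);
    }

    static boolean accepts(String w) {
        Set<Integer> current = closure(Set.of(0));
        for (char c : w.toCharArray()) {
            current = step(current, c);
        }
        return current.contains(ACCEPT);
    }

    public static void main(String[] args) {
        Set<Integer> current = closure(Set.of(0));
        System.out.println("Start: " + current);
        for (char c : "-23".toCharArray()) {
            current = step(current, c);
            System.out.println("'" + c + "': " + current);
        }
    }
}
```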
50
NFA → DFA conversion
• Can convert NFA directly to DFA by same approach
• Create one DFA state for each distinct subset of
NFA states that could arise
• States: {0,1}, {1}, {2, 3}
[Figure: the NFA with states 0 to 3 next to the resulting DFA with states {0,1}, {1} and {2,3}, with edges labelled “-” and 0-9]
• Called the “subset construction”
51
Algorithm
• For a set S of states in the NFA, compute
ε-closure(S) = set of states reachable from states in S
by zero or more ε-transitions
    T = S
    Repeat T = T ∪ {s’ | s ∈ T, (s,s’) is ε-transition}
    Until T remains unchanged
    ε-closure(S) = T
• For a set S of ε-closed states in the NFA, compute
DFAedge(S,c) = the set of states reachable from states
in S by transitions on symbol c and ε-transitions
    DFAedge(S,c) = ε-closure( { s’ | s ∈ S, (s,s’) is c-transition} )
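The two operations can be written for an arbitrary NFA roughly as below. This is a sketch, not a reference implementation: the nested-map representation, the class name, and the use of the NUL character as a stand-in label for ε are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ClosureSketch {
    static final char EPS = '\0';   // stand-in label for ε-transitions (an assumption of this sketch)

    // delta.get(s).get(c) is the set of states reachable from s on label c.
    private final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();

    public void addEdge(int from, char label, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(label, k -> new TreeSet<>())
             .add(to);
    }

    private Set<Integer> targets(int s, char label) {
        return delta.getOrDefault(s, Map.of()).getOrDefault(label, Set.of());
    }

    /** States reachable from s by zero or more ε-transitions. */
    public Set<Integer> epsClosure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        Deque<Integer> work = new ArrayDeque<>(s);
        while (!work.isEmpty()) {
            for (int next : targets(work.pop(), EPS)) {
                if (t.add(next)) work.push(next);   // only newly added states are revisited
            }
        }
        return t;
    }

    /** DFAedge(S,c): follow c-transitions from every state in s, then close under ε. */
    public Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> moved = new TreeSet<>();
        for (int st : s) moved.addAll(targets(st, c));
        return epsClosure(moved);
    }

    public static void main(String[] args) {
        ClosureSketch nfa = new ClosureSketch();
        nfa.addEdge(0, EPS, 1);
        nfa.addEdge(0, '-', 1);
        nfa.addEdge(1, '7', 2);
        nfa.addEdge(2, EPS, 3);
        System.out.println(nfa.epsClosure(Set.of(0)));
        System.out.println(nfa.dfaEdge(Set.of(1), '7'));
    }
}
```

The worklist replaces the "repeat until unchanged" loop: a state is processed once, when it is first added to T.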
52
Algorithm
DFA-initial-state = ε-closure({NFA-initial-state})
Worklist = { DFA-initial-state }
While ( Worklist not empty )
    Pick state S from Worklist and remove it
    For each character c
        S’ = DFAedge(S,c)
        if (S’ not in DFA states)
            Add S’ to DFA states and worklist
        Add an edge (S, S’) labeled c in DFA
For each DFA-state S
    If S contains an NFA-final state
        Mark S as DFA-final-state
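A compact Java sketch of the worklist algorithm follows. It makes several assumptions for the sake of a short example: the NFA is stored in nested maps, the NUL character stands in for ε, the alphabet is passed in as a string, and an empty target set is treated as the Error state and not materialized as a DFA state. All names are hypothetical.

```java
import java.util.*;

public class SubsetConstruction {
    static final char EPS = '\0';   // stand-in label for ε (an assumption of this sketch)
    final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
    final Set<Integer> nfaFinal = new TreeSet<>();

    void addEdge(int from, char label, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(label, k -> new TreeSet<>()).add(to);
    }

    Set<Integer> targets(int s, char label) {
        return delta.getOrDefault(s, Map.of()).getOrDefault(label, Set.of());
    }

    Set<Integer> epsClosure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        Deque<Integer> work = new ArrayDeque<>(s);
        while (!work.isEmpty()) {
            for (int next : targets(work.pop(), EPS)) {
                if (t.add(next)) work.push(next);
            }
        }
        return t;
    }

    Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> moved = new TreeSet<>();
        for (int st : s) moved.addAll(targets(st, c));
        return epsClosure(moved);
    }

    /** Returns the DFA edges: DFA state (a set of NFA states) -> character -> DFA state. */
    Map<Set<Integer>, Map<Character, Set<Integer>>> build(int nfaStart, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>();
        Set<Integer> start = epsClosure(Set.of(nfaStart));
        dfa.put(start, new TreeMap<>());
        worklist.add(start);
        while (!worklist.isEmpty()) {
            Set<Integer> s = worklist.remove();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> t = dfaEdge(s, c);
                if (t.isEmpty()) continue;          // the empty set plays the role of Error
                if (!dfa.containsKey(t)) {          // a DFA state not seen before
                    dfa.put(t, new TreeMap<>());
                    worklist.add(t);
                }
                dfa.get(s).put(c, t);               // edge (S, S') labelled c
            }
        }
        return dfa;
    }

    /** A DFA state is final if it contains at least one NFA-final state. */
    boolean isFinal(Set<Integer> dfaState) {
        return !Collections.disjoint(dfaState, nfaFinal);
    }
}
```

Applied to an NFA for -?[0-9]+ with the edges assumed in the test below, this produces the three DFA states {0,1}, {1} and {2,3}.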
53
Putting the Pieces Together
[Figure: the complete acceptor pipeline]
Regular expression R → RE → NFA conversion → NFA → DFA conversion → DFA simulation
Input string w → DFA simulation → Yes, if w ∈ L(R); No, if w ∉ L(R)
54
OPTIMIZATIONS
55
State minimization
• State Minimization is an optimization
that converts a DFA to another DFA
that recognizes the same language and
has a minimum number of states.
– Divide all states into “equivalence” groups.
• Two states p and q are equivalent if for all
symbols, the outgoing edges either lead to
error or the same destination group.
• Collapse the states of a group to a single state
(instead of p and q, have a single state).
56
State Minimization (Equivalence)
• More formally, all states in group Gi are equivalent iff
for any two states p and q in Gi, and for every
symbol σ, transition(p,σ) and transition(q,σ) are
either both Error, or are states in the same group Gj
(possibly Gi itself).
• For example:
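[Figure: group Gi containing states p and q, whose transitions on a, b and c lead into the same groups Gj and Gk (or Error); a further state r is also shown]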
57
State Minimization
• Step 1. Partition states of original DFA into maximal-sized
groups of “equivalent” states S = {G1, … ,Gn}
[Figure: example DFA with states 0 to 4 and edges labelled a and b]
• Step 2. Construct the minimized DFA such that there
is a state for each group Gi
[Figure: the minimized DFA, with edges labelled a and b]
58
DFA Minimization
• Step 1. Partition states of original DFA into maximal-sized
groups of equivalent states
– Step 1a. Discard states not reachable from start state
– Step 1b. Initial partition is S = {Final, Non-final}
– Step 1c. Repeatedly refine the partition {G1,…,Gn} while
some group Gi contains states p and q such that for some
symbol σ, transitions from p and q on σ are to different
groups
[Figure: p and q both in Gi; on a, p goes to Gj and q goes to Gk (or Error), with j≠k]
59
DFA Minimization
[Figure: the same refinement step after splitting: p remains in Gi and q moves to a new group Gi’, since their a-transitions lead to Gj and Gk (or Error) with j≠k]
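The refinement loop of Step 1b and Step 1c can be sketched as follows. This is a simple signature-based refinement written for illustration, not an optimized algorithm; it assumes unreachable states have already been discarded (Step 1a), uses -1 for Error, and the five-state DFA in main is a made-up example rather than the one in the figures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DfaMinimizer {
    static final int ERROR = -1;

    /** Returns group[s], the index of the equivalence group that state s ends up in. */
    static int[] minimize(int[][] trans, boolean[] accept) {
        int n = trans.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) group[s] = accept[s] ? 1 : 0;   // initial partition {Final, Non-final}
        int groupCount = 0;
        while (true) {
            // A state's signature is its current group plus the group reached on each symbol.
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> sig = new ArrayList<>();
                sig.add(group[s]);
                for (int t : trans[s]) sig.add(t == ERROR ? ERROR : group[t]);
                Integer id = ids.get(sig);
                if (id == null) {
                    id = ids.size();
                    ids.put(sig, id);
                }
                refined[s] = id;
            }
            // Refinement only ever splits groups, so an unchanged count means a stable partition.
            if (ids.size() == groupCount) return refined;
            groupCount = ids.size();
            group = refined;
        }
    }

    public static void main(String[] args) {
        // Hypothetical DFA over {a, b}: states 1 and 2 behave alike, as do 3 and 4.
        int[][] trans = {
            {1, 2}, {3, 1}, {4, 2}, {ERROR, ERROR}, {ERROR, ERROR},
        };
        boolean[] accept = {false, false, false, true, true};
        System.out.println(Arrays.toString(minimize(trans, accept)));
    }
}
```

States that receive the same group index can be collapsed into one state of the minimized DFA.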
60
After state minimization.
• We have an optimized acceptor.
[Figure: the optimized acceptor pipeline]
Regular expression R → RE → NFA → NFA → DFA → Minimize DFA → DFA simulation
Input string w → DFA simulation → Yes, if w ∈ L(R); No, if w ∉ L(R)
61
Lexical Analyzers vs Acceptors
• We really need a Lexer, not an acceptor.
• Lexical analyzers use the same mechanism,
but they:
– Have multiple RE descriptions for multiple tokens
– Output a sequence of matching tokens (or an
error)
– Always return the longest matching token
– For multiple longest matching tokens, use rule
priorities
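The longest-match and rule-priority policies can be illustrated with a small Java sketch. It is a deliberate simplification: it tries each rule with java.util.regex at the current position instead of running one combined DFA, and the rule names, patterns, and output format are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriorityLexer {
    // Rules in priority order (earlier wins ties); names and patterns are illustrative.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("WS", Pattern.compile("[ \\t\\n\\r]+"));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("Unexpected character at position " + pos);
            }
            if (!bestRule.equals("WS")) {   // whitespace is recognized and discarded
                out.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if ifx 42"));
    }
}
```

Here "if" matches both the IF and ID rules with the same length, so priority picks IF, while "ifx" is longer under ID and therefore becomes an identifier.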
62
Lexical Analyzers
[Figure: the lexical analyzer pipeline]
REs for all valid tokens R1 … Rn → RE → NFA → NFA → DFA → Minimize DFA → DFA simulation
Character stream (program) → DFA simulation → Token stream (and errors)
63
Handling Multiple REs
• Construct one NFA for each RE
• Associate the final state of each NFA with the given RE
• Combine NFAs for all REs into one NFA
• Convert NFA to minimized DFA, associating each final DFA state
with the highest priority RE of the corresponding NFA states
[Figure: NFAs for keywords, whitespace, identifier and number, joined by ε-transitions from a common start state and converted into one minimized DFA]
64
Using Roll Back
Consider three REs: {aa, ba, aabb} and input: aaba
[Figure: DFA with states 0 to 6; 0 goes to 1 on a, 1 to 2 on a, 2 to 3 on b, 3 to 4 on b; 0 goes to 5 on b, 5 to 6 on a]
• Reach state 3 with no transition on next character a
• Roll input back to position on entering state 2 (i.e.,
having read aa)
• Emit token for aa
• On next call to scanner, start in state 0 again with
input ba
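The roll-back rule can be sketched as a scanner that remembers the last accepting position. The transition table below encodes the seven-state automaton described in the bullets (states 2, 4 and 6 accept aa, aabb and ba respectively); the class name and the choice to return matched lexemes as strings are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class RollBackScanner {
    static final int ERROR = -1;

    // Columns: 'a', 'b'. Rows: states 0..6.
    static final int[][] TRANS = {
        {1, 5}, {2, ERROR}, {ERROR, 3}, {ERROR, 4}, {ERROR, ERROR}, {6, ERROR}, {ERROR, ERROR},
    };
    // Token recognized in each state, or null if the state is not accepting.
    static final String[] TOKEN = {null, null, "aa", null, "aabb", null, "ba"};

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int state = 0;
            int pos = start;
            int lastAcceptEnd = -1;      // input position just after the last accepting state
            String lastToken = null;
            while (pos < input.length()) {
                char c = input.charAt(pos);
                int col = c == 'a' ? 0 : c == 'b' ? 1 : -1;
                state = col < 0 ? ERROR : TRANS[state][col];
                if (state == ERROR) break;
                pos++;
                if (TOKEN[state] != null) {
                    lastAcceptEnd = pos;
                    lastToken = TOKEN[state];
                }
            }
            if (lastToken == null) {
                throw new IllegalArgumentException("Unexpected character at position " + start);
            }
            tokens.add(lastToken);
            start = lastAcceptEnd;       // roll the input back to the last accepting position
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("aaba"));   // prints [aa, ba]
    }
}
```

On aaba the scanner reads aab, fails on the next a, rolls back to just after aa, emits aa, and restarts in state 0 on ba.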
65
Automatic Lexer Generators
• Input: token specification
– list of regular expressions in priority order
– associated action for each RE (generates
appropriate kind of token, other bookkeeping)
• Output: lexer program
– program that reads an input stream and breaks it
up into tokens according to the REs (or reports
lexical error -- “Unexpected character” )
66
Automatic Lexer Generator (C)
67
Example: Jlex (Java)
%%
digits = 0|[1-9][0-9]*
letter = [A-Za-z]
identifier = {letter}({letter}|[0-9_])*
whitespace = [\ \t\n\r]+
%%
{whitespace}  { /* discard */ }
{digits}      { return new Token(INT, Integer.parseInt(yytext())); }
"if"          { return new Token(IF, yytext()); }
"while"       { return new Token(WHILE, yytext()); }
…
{identifier}  { return new Token(ID, yytext()); }
68
Example Output (Java)
• Java Lexer which implements the
functionality described in the language
specification.
• For instance :
case 5:
case 6:
case 2:
case 7:
case 4:
case 8:
case 1:
{ return new Token(WHILE, yytext()); }
break;
{ return new Token(ID, yytext()); }
break;
{ return new Token(IF, yytext());}
break;
{ return new Token(INT,
Integer.parseInt(yytext())); }
69
Start States
• Mechanism that specifies state in which to
start the execution of the DFA
• Declare states in the second section
– %state STATE
• Use states as prefixes of regular expressions
in the third section:
– <STATE> regex {action}
• Set current state in the actions
– yybegin(STATE)
• There is a pre-defined initial state: YYINITIAL
70
Example
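[Figure: two states, INITIAL and STRING; a " moves from INITIAL to STRING, another " moves back, any other character (.) stays in STRING, and “if” is recognized in INITIAL]
%%
%state STRING
%%
<YYINITIAL> "if"   { return new Token(IF, null); }
<YYINITIAL> "\""   { yybegin(STRING); … }
<STRING> "\""      { yybegin(YYINITIAL); … }
<STRING> .         { … }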
71
Summary
• Lexical analyzer converts a text stream to
tokens
• Ad-hoc lexers hard to get right, maintain
• For most languages, legal tokens are
conveniently and precisely defined using
regular expressions
• Lexer generators generate the lexer automaton
automatically from token REs and their priorities
72
Summary
• To write your own Lexer:
– Describe tokens using Regular Expressions.
– Construct NFAs for those tokens.
• If you have no ambiguities in the NFA, or you have a
DFA directly from the regular expressions, you are done.
– Construct DFA from NFA using the
algorithm described.
– Systematically implement the DFA using
transition tables.
73
Reading
• IC Language spec
• JLEX manual
• CVS manual
• Links on course web home page
• Regular Expression Matching Can Be Simple
And Fast (but is slow in Java, Perl, PHP,
Python, Ruby, ...), Russ Cox, January 2007
Acknowledgement
The slides are based on similar content by Tim
Teitelbaum, Cornell.
74