Chapter 3
Chang Chi-Chung
2007.4.12
The Role of the Lexical Analyzer
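[Figure: the Source Program feeds the Lexical Analyzer, which returns a Token to the Parser each time the parser calls getNextToken; both the Lexical Analyzer and the Parser report errors and consult the Symbol Table]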
The Reason for Using the Lexical Analyzer
Simplifies the design of the compiler
  Without it, LL(1) or LR(1) parsing with one token of lookahead would not be possible, because the parser would have to match multiple characters per token.
Compiler efficiency is improved
  Systematic techniques exist to implement lexical analyzers by hand or automatically from specifications.
  Stream buffering methods are used to scan the input.
Compiler portability is enhanced
  Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
Token
  A pair consisting of a token name and an optional attribute value.
  Example: num, id
Pattern
  A description of the form that the lexemes of a token may take.
  Example: "non-empty sequence of digits", "letter followed by letters and digits"
Lexeme
  A sequence of characters that matches the pattern for a token.
  Example: 123, abc
Example: Tokens, Patterns, and Lexemes
Token        Pattern                                  Sample lexemes
if           characters i f                           if
else         characters e l s e                       else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14, 0, 6.23
literal      anything but ", surrounded by "'s        "core dump"
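The table above can be turned into a small scanner. The following is a minimal sketch, not the lecture's implementation: the regular expressions are assumed approximations of the informal patterns (for example, "any numeric constant" is narrowed to digits with an optional fraction), and the names TOKEN_SPEC and tokenize are hypothetical.

```python
import re

# Assumed regex approximations of the patterns in the table.
# Order matters: keywords are tried before the more general id pattern.
TOKEN_SPEC = [
    ("if", r"if\b"),
    ("else", r"else\b"),
    ("comparison", r"<=|>=|==|!=|<|>"),
    ("number", r"\d+(?:\.\d+)?"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("literal", r'"[^"]*"'),
    ("ws", r"\s+"),  # whitespace is matched but not reported
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]


def tokenize(text):
    """Return a list of (token name, lexeme) pairs for the input text."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            if m:
                if name != "ws":
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens
```

For example, tokenize("D2 != 0") yields the pairs (id, D2), (comparison, !=), (number, 0): each lexeme is the matched text and each token name comes from the pattern that matched it.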
Input Buffering
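[Figure: a two-buffer input scheme holding the text E = M * C * * 2; each buffer half ends with an eof sentinel and a further eof marks the end of the input; the pointers lexemeBegin and forward delimit the current lexeme]
Sentinels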
Strings and Languages
Alphabet
  An alphabet Σ is a finite set of symbols (characters)
String
  A string is a finite sequence of symbols from Σ
  |s| denotes the length of string s
  ε denotes the empty string, thus |ε| = 0
Language
  A language is a countable set of strings over some fixed alphabet Σ
  Abstract languages: ∅ (the empty set) and {ε}
String Operations
Concatenation
  The concatenation of two strings x and y is denoted by xy
Identity
  The empty string is the identity under concatenation.
  εs = sε = s
Exponentiation
  Define
    s^0 = ε
    s^i = s^(i-1) s for i > 0
  By definition
    s^1 = s
    s^2 = ss
Language Operations
Union
  L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation
  LM = { xy | x ∈ L and y ∈ M }
Exponentiation
  L^0 = {ε}
  L^i = L^(i-1) L
Kleene closure
  L* = ∪ (i = 0, …, ∞) L^i
Positive closure
  L+ = ∪ (i = 1, …, ∞) L^i
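The language operations can be tried out on small finite languages. The following is a minimal sketch using Python sets of strings, with the empty string "" standing for ε. Because L* is infinite in general, the hypothetical helper kleene_upto only builds the union of L^0 through L^n; all function names here are assumptions made for illustration.

```python
def concat(L, M):
    """LM = { xy | x in L and y in M }."""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_upto(L, n):
    """Finite approximation of L*: the union of L^0 .. L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out


L = {"a", "b"}
M = {"0", "1"}
union = L | M              # set union plays the role of L ∪ M
pairs = concat(L, M)       # all two-symbol strings xy
squares = power(L, 2)      # L^2
```

String exponentiation is the special case of a one-element language: power({"ab"}, 2) is {"abab"}, matching s^2 = ss.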
Regular Expressions
Regular Expressions
A convenient means of specifying certain simple sets
of strings.
We use regular expressions to define structures of
tokens.
Tokens are built from symbols of a finite vocabulary.
Regular Sets
The sets of strings defined by regular expressions.
Regular Expressions
Basis symbols:
  ε is a regular expression denoting language L(ε) = {ε}
  a ∈ Σ is a regular expression denoting L(a) = {a}
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  r|s is a regular expression denoting L(r) ∪ M(s)
  rs is a regular expression denoting L(r)M(s)
  r* is a regular expression denoting L(r)*
  (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.
Operator Precedence
Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
Law                                    Description
r|s = s|r                              | is commutative
r|(s|t) = (r|s)|t                      | is associative
r(st) = (rs)t                          concatenation is associative
r(s|t) = rs | rt; (s|t)r = sr | tr     concatenation distributes over |
εr = rε = r                            ε is the identity for concatenation
r* = (r|ε)*                            ε is guaranteed in a closure
r** = r*                               * is idempotent
Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
Each di is a new symbol, not in Σ and not the same as any other of the d's.
Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}
Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
Example: Regular Definitions
Regular Definitions
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
Regular definitions are not recursive
digits → digit digits | digit    (wrong: digits is defined in terms of itself)
Extensions of Regular Definitions
One or more instances
  r+ = rr* = r*r
  r* = r+ | ε
Zero or one instance
  r? = r | ε
Character classes
  [a-z] = a | b | c | … | z
  [A-Za-z] = A | B | … | Z | a | … | z
Example
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
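The num definition maps almost directly onto the syntax of Python's re module. The following is a minimal sketch: the names NUM and is_num are hypothetical, and the dot is escaped because an unescaped . matches any character in re.

```python
import re

# digit -> [0-9]
# num   -> digit+ (. digit+)? ( E (+ | -)? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def is_num(s):
    """True if the whole string s is a lexeme of num."""
    return NUM.fullmatch(s) is not None
```

With this definition "3.14", "0" and "6.02E23" are accepted, while "1." and ".5" are rejected because digit+ requires at least one digit on each side of the dot.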
Regular Definitions and Grammars
Context-Free Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num
Regular Definitions
digit → [0-9]
letter → [A-Za-z]
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
ws → ( blank | tab | newline )+
Transition Diagrams
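relop → < | <= | <> | > | >= | =
[Figure: transition diagram for relop, with start state 0]
state 0 on <: go to state 1
state 1 on =: state 2, return(relop, LE)
state 1 on >: state 3, return(relop, NE)
state 1 on other: state 4 *, return(relop, LT)
state 0 on =: state 5, return(relop, EQ)
state 0 on >: go to state 6
state 6 on =: state 7, return(relop, GE)
state 6 on other: state 8 *, return(relop, GT)
(* marks an accepting state that retracts the input by one character)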
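A transition diagram like the one for relop can be coded by hand as nested decisions on the next character. The following is a minimal sketch under assumptions: the function name relop, the (token, next position) return convention and the use of a string index in place of an input buffer are all hypothetical. Retraction in the starred states is modelled by not consuming the lookahead character.

```python
def relop(text, pos=0):
    """Recognize a relational operator starting at text[pos].

    Returns ((token name, attribute), next position), or None if no relop
    starts at pos.
    """
    c = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if c == "<":                               # state 1
        if nxt == "=":
            return ("relop", "LE"), pos + 2    # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2    # state 3
        return ("relop", "LT"), pos + 1        # state 4 *: lookahead not consumed
    if c == "=":
        return ("relop", "EQ"), pos + 1        # state 5
    if c == ">":                               # state 6
        if nxt == "=":
            return ("relop", "GE"), pos + 2    # state 7
        return ("relop", "GT"), pos + 1        # state 8 *: lookahead not consumed
    return None
```

For the input "<a" the function returns the token (relop, LT) and next position 1, so the character a is left for the next token.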
Transition Diagrams
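id → letter ( letter | digit )*
[Figure: transition diagram for id, with start state 9]
state 9 on letter: go to state 10
state 10 on letter or digit: stay in state 10
state 10 on other: state 11 *, return( getToken(), installID() )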
Finite Automata
Finite Automata are recognizers.
  An FA simply says "Yes" or "No" about each possible input string.
  An FA can be used to recognize the tokens specified by a regular expression.
  FA are used in the design of a Lexical Analyzer Generator.
Two kinds of Finite Automata
  Nondeterministic finite automata (NFA)
  Deterministic finite automata (DFA)
Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions
NFA = ( S, Σ, δ, s0, F )
A finite set of states S
A set of input symbols Σ
  the input alphabet; ε is not in Σ
A transition function δ
  δ : S × (Σ ∪ {ε}) → 2^S
A special start state s0
A set of final states F, F ⊆ S (accepting states)
Transition Graph for FA
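[Figure: legend of the graph symbols]
A circle is a state
A labeled arrow is a transition
A circle with an incoming arrow labeled start is the start state
A double circle is a final state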
Example
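[Figure: transition graph with states 0 (start), 1, 2 and 3 (final); edges 0 -a-> 1, 1 -b-> 2, 2 -c-> 3, 3 -c-> 3, 3 -a-> 1]
This machine accepts abccabc, but it rejects abcab.
This machine accepts the language (abc+)+.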
Transition Table
The mapping δ of an NFA can be represented in a transition table
[Figure: NFA with start state 0 and states 1, 2, 3; state 0 loops on a and b, followed by 0 -a-> 1 -b-> 2 -b-> 3]
δ(0, a) = {0, 1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}
STATE   a        b      ε
0       {0, 1}   {0}    -
1       -        {2}    -
2       -        {3}    -
3       -        -      -
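The transition table can be stored as a dictionary and used to simulate the NFA by tracking the set of states it could be in. The following is a minimal sketch: the names NFA, START, FINAL and nfa_accepts are hypothetical, state 3 is assumed to be the only accepting state, and ε-moves are ignored because this table has none.

```python
# Transition table: (state, symbol) -> set of next states.
# Missing entries correspond to "-" in the table.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, FINAL = 0, {3}  # assumption: state 3 is the accepting state


def nfa_accepts(x):
    """True if some path labeled x leads from START to a state in FINAL."""
    current = {START}
    for c in x:
        # Follow every possible transition on c from every current state.
        current = set().union(*(NFA.get((s, c), set()) for s in current))
    return bool(current & FINAL)
```

On the input aabb the tracked sets are {0}, {0, 1}, {0, 1}, {0, 2}, {0, 3}; the last one contains state 3, so the string is accepted.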
DFA
DFA is a special case of an NFA
There are no moves on input ε
For each state s and input symbol a, there is
exactly one edge out of s labeled a.
Both DFA and NFA are capable of
recognizing the same languages.
Simulating a DFA
Input
An input string x terminated by
an end-of-file character eof. A
DFA D with start state s0,
accepting states F, and
transition function move.
Output
Answer “yes” if D accepts x;
“no” otherwise.
s = s0;
c = nextChar();
while ( c != eof ) {
    s = move(s, c);
    c = nextChar();
}
if ( s is in F )
    return "yes";
else
    return "no";
S = {0,1,2,3}
Σ = {a, b}
s0 = 0
F = {3}
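The simulation loop can be run on a concrete DFA with S = {0,1,2,3}, Σ = {a, b}, s0 = 0 and F = {3}. The following is a minimal sketch: the table MOVE is an assumption (a standard four-state DFA for (a | b)*abb, not copied from the slides), and the end of the Python string stands in for eof.

```python
# Assumed transition function for a DFA accepting (a|b)*abb.
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}


def dfa_accepts(x, s0=0, accepting=frozenset({3})):
    """Answer "yes" if the DFA accepts x, "no" otherwise."""
    s = s0
    for c in x:            # reaching the end of x plays the role of eof
        s = MOVE[(s, c)]   # exactly one move per state and input symbol
    return "yes" if s in accepting else "no"
```

Each input character costs one dictionary lookup, so the running time is proportional to the length of x.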
NFA vs DFA
[Figure: an NFA and an equivalent DFA, each with states 0 to 3 and start state 0, both recognizing (a | b)*abb]
The Regular Language
The regular language defined by an NFA is the
set of input strings it accepts.
Example: (a | b)*abb for the example NFA
An NFA accepts an input string x if and only if
there is some path with edges labeled with symbols
from x in sequence from the start state to some
accepting state in the transition graph
A state transition from one state to another on the
path is called a move.
Theorem
The following are equivalent
Regular Expression
NFA
DFA
Regular Language
Regular Grammar
Convert Concept
[Figure: conversion pipeline: Regular Expression → Nondeterministic Finite Automata → Deterministic Finite Automata → Minimization → Deterministic Finite Automata]
Construction of an NFA from a Regular Expression
Use Thompson's Construction
[Figure: NFA fragments built by Thompson's construction for ε, a, s|t (from N(s) and N(t)), st (from N(s) and N(t)), and s* (from N(s))]
Example
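( a | b )* a b b
[Figure: parse tree of the expression with subexpressions r1 to r11]
r1 = a, r2 = b, r3 = r1 | r2, r4 = ( r3 ), r5 = r4*, r6 = a, r7 = r5 r6, r8 = b, r9 = r7 r8, r10 = b, r11 = r9 r10
The NFA for r4 is the same as the NFA for r3 (r3 = r4)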
Example
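( a | b )* a b b
[Figure: NFA for the expression with start state 0 and accepting state 10; labeled edges 2 -a-> 3, 4 -b-> 5, 7 -a-> 8, 8 -b-> 9, 9 -b-> 10; the remaining edges are ε-transitions]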
Conversion of an NFA to a DFA
The subset construction algorithm converts an NFA into a DFA using the following operations.
Operation       Description
ε-closure(s)    Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)    Set of NFA states reachable from some NFA state s in set T on ε-transitions alone; = ∪ (s in T) ε-closure(s)
move(T, a)      Set of NFA states to which there is a transition on input symbol a from some state s in T
Subset Construction(1)
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while (there is an unmarked state T in Dstates) {
    mark T;
    for (each input symbol a) {
        U = ε-closure( move(T, a) );
        if (U is not in Dstates)
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U;
    }
}
Computing ε- closure(T)
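ε-closure(T) can be computed with a stack-based search over the ε-edges, and the result feeds the subset construction. The following is a minimal sketch under assumptions: the NFA dictionary encodes the Thompson NFA for ( a | b )* a b b with states 0 to 10 (its ε-edges are assumed from the standard construction), None marks an ε-transition, and all names are hypothetical.

```python
EPS = None  # assumed marker for ε-transitions

# Assumed Thompson NFA for (a|b)*abb; 0 is the start state, 10 accepts.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}


def eps_closure(T):
    """States reachable from some state in T on ε-transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)


def move(T, a):
    """States with a transition on a from some state in T."""
    return {u for s in T for u in NFA.get((s, a), ())}


def subset_construction(s0, alphabet):
    start = eps_closure({s0})
    dstates, unmarked, dtran = [start], [start], {}
    while unmarked:
        T = unmarked.pop()              # mark T
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)       # add U as an unmarked state
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran


dstates, dtran = subset_construction(0, "ab")
```

Each DFA state is a frozenset of NFA states, so it can be used directly as a dictionary key in dtran.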
Example
( a | b )* a b b
[Figure: the NFA for the expression with states 0 to 10, and the DFA obtained from it with start state A and states B, C, D, E]
NFA States           DFA State   a   b
{0,1,2,4,7}          A           B   C
{1,2,3,4,6,7,8}      B           B   D
{1,2,4,5,6,7}        C           B   C
{1,2,4,5,6,7,9}      D           B   E
{1,2,4,5,6,7,10}     E           B   C
Example
[Figure: an NFA with states 0 to 8 combining the patterns a, abb and a*b+, and the DFA built from it, whose states are the NFA state sets 0137, 247, 7, 8, 58 and 68]
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
Minimizing the DFA
Step 1
  Start with an initial partition Π with two groups: F and S - F (accepting and nonaccepting states)
Step 2
  Apply the Split Procedure to Π to obtain Πnew
Step 3
  If ( Πnew = Π )
    Πfinal = Π and continue with step 4
  else
    Π = Πnew and go to step 2
Step 4
  Construct the minimum-state DFA from the groups of Πfinal.
  Delete the dead state
Split Procedure
Initially, let Πnew = Π
for ( each group G of Π ) {
    Partition G into subgroups such that two states s and t
    are in the same subgroup if and only if, for all input
    symbols a, states s and t have transitions on a to states
    in the same group of Π.
    /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed
}
Example
Initially, two sets: {1, 2, 3, 5, 6} and {4, 7}.
{1, 2, 3, 5, 6} splits into {1, 2, 5} and {3, 6} on c.
{1, 2, 5} splits into {1} and {2, 5} on b.
Minimizing the DFA
Major operation: partition states into equivalent classes according to
  final / non-final states
  transition functions
State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C
Partition refinement: (ABCDE) → (ABCD)(E) → (ABC)(D)(E) → (AC)(B)(D)(E)
State   a   b
AC      B   AC
B       B   D
D       B   E
E       B   AC
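The partition refinement on the five-state DFA can be reproduced in a few lines. The following is a minimal sketch: DTRAN copies the A to E transition table, E is assumed to be the only accepting state, and the names minimize, DTRAN and ACCEPTING are hypothetical. It returns only the final groups and does not build the minimized transition table or remove dead states.

```python
DTRAN = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}  # assumption: E is the accepting state


def minimize(dtran, accepting, alphabet):
    """Return the final partition of states as a set of frozensets."""
    states = set(dtran)
    # Step 1: accepting and nonaccepting groups (empty groups dropped).
    groups = (frozenset(accepting), frozenset(states - accepting))
    partition = [g for g in groups if g]
    while True:
        index = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for G in partition:
            # Split G by the groups its states move to on each symbol.
            buckets = {}
            for s in G:
                key = tuple(index[dtran[s][a]] for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(frozenset(b) for b in buckets.values())
        # Splitting only refines, so an equal count means no change.
        if len(new_partition) == len(partition):
            return set(new_partition)
        partition = new_partition


final_groups = minimize(DTRAN, ACCEPTING, "ab")
```

States A and C end up in one group because both move to B on a and to the group containing C on b.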
Important States of an NFA
The "important states" of an NFA are those with a non-ε out-transition, that is
if move({s}, a) ≠ ∅ for some a then s is an important state
The subset construction algorithm uses only the important states when it determines ε-closure( move(T, a) )
Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#
Converting a RE Directly to a DFA
Construct a syntax tree for (r)#
Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
Construct DFA D by algorithm 3.62
Function Computed From the Syntax Tree
nullable(n)
  True if the subtree at node n generates a language that includes the empty string
firstpos(n)
  The set of positions that can match the first symbol of a string generated by the subtree at node n
lastpos(n)
  The set of positions that can match the last symbol of a string generated by the subtree at node n
followpos(i)
  The set of positions that can follow position i in the tree
Rules for Computing the Function
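Node n: a leaf labeled by ε
  nullable(n) = true
  firstpos(n) = ∅
  lastpos(n) = ∅
Node n: a leaf with position i
  nullable(n) = false
  firstpos(n) = {i}
  lastpos(n) = {i}
Node n = c1 | c2
  nullable(n) = nullable(c1) or nullable(c2)
  firstpos(n) = firstpos(c1) ∪ firstpos(c2)
  lastpos(n) = lastpos(c1) ∪ lastpos(c2)
Node n = c1 c2
  nullable(n) = nullable(c1) and nullable(c2)
  firstpos(n) = if ( nullable(c1) ) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
  lastpos(n) = if ( nullable(c2) ) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Node n = c1*
  nullable(n) = true
  firstpos(n) = firstpos(c1)
  lastpos(n) = lastpos(c1)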
Computing followpos
for (each node n in the tree) {
    if (n is a cat-node with left child c1 and right child c2)
        for (each i in lastpos(c1))
            followpos(i) = followpos(i) ∪ firstpos(c2);
    else if (n is a star-node)
        for (each i in lastpos(n))
            followpos(i) = followpos(i) ∪ firstpos(n);
}
Converting a RE Directly to a DFA
Initialize Dstates to contain only the unmarked state firstpos(n0),
where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ) {
        let U be the union of followpos(p)
            for all p in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}
The accepting states of D are those containing the position of the end symbol #.
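The direct construction can be traced on ( a | b )* a b b #. The following is a minimal sketch under assumptions: the syntax tree is written by hand as nested tuples, leaves carry positions 1 to 6, ε-leaves are omitted because this expression has none, and the names TREE, analyze and build_dfa are hypothetical.

```python
# Node forms: ("leaf", symbol, position), ("or", c1, c2),
# ("cat", c1, c2), ("star", c1).
TREE = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
        ("leaf", "a", 3)), ("leaf", "b", 4)), ("leaf", "b", 5)),
        ("leaf", "#", 6))


def analyze(n, followpos, symbol_at):
    """Return (nullable, firstpos, lastpos) of n and fill followpos."""
    kind = n[0]
    if kind == "leaf":
        _, sym, pos = n
        symbol_at[pos] = sym
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "star":
        _, f1, l1 = analyze(n[1], followpos, symbol_at)
        for i in l1:                      # star-node rule
            followpos[i] |= f1
        return True, f1, l1
    n1, f1, l1 = analyze(n[1], followpos, symbol_at)
    n2, f2, l2 = analyze(n[2], followpos, symbol_at)
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    for i in l1:                          # cat-node rule
        followpos[i] |= f2
    first = (f1 | f2) if n1 else f1
    last = (l1 | l2) if n2 else l2
    return n1 and n2, first, last


def build_dfa(tree, alphabet):
    followpos, symbol_at = {}, {}
    _, first, _ = analyze(tree, followpos, symbol_at)
    start = frozenset(first)
    dstates, unmarked, dtran = [start], [start], {}
    while unmarked:
        S = unmarked.pop()                # mark S
        for a in alphabet:
            U = frozenset(q for p in S if symbol_at[p] == a
                          for q in followpos[p])
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(S, a)] = U
    return followpos, dstates, dtran


followpos, dstates, dtran = build_dfa(TREE, "ab")
```

The start state is firstpos of the root, {1, 2, 3}, and the only state containing position 6 (the # leaf) is {1, 2, 3, 6}, which is therefore the accepting state.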
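Example
( a | b )* a b b #
[Figure: syntax tree for the expression, with leaf positions a = 1, b = 2, a = 3, b = 4, b = 5, # = 6; node n is the cat-node for ( a | b )* a]
n = ( a | b )* a
nullable(n) = false
firstpos(n) = { 1, 2, 3 }
lastpos(n) = { 3 }
followpos(1) = { 1, 2, 3 }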
Example
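( a | b )* a b b #
[Figure: the syntax tree annotated with firstpos to the left and lastpos to the right of each node]
Leaves: {1} a {1}, {2} b {2}, {3} a {3}, {4} b {4}, {5} b {5}, {6} # {6}
Node |: firstpos {1, 2}, lastpos {1, 2}
Node *: firstpos {1, 2}, lastpos {1, 2}, nullable
Cat-nodes, from the lowest up to the root: firstpos {1, 2, 3} for each; lastpos {3}, {4}, {5}, {6}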
Example
Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -
[Figure: directed graph of followpos over positions 1 to 6, and the resulting DFA for ( a | b )* a b b # with states {1,2,3} (start), {1,2,3,4}, {1,2,3,5} and {1,2,3,6}]
Time and Space Complexity
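Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)
where |r| is the length of the regular expression and |x| is the length of the input string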