Chapter 3
Chang Chi-Chung
2007.4.12
The Role of the Lexical Analyzer
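[Figure: the source program is read by the lexical analyzer, which returns a token to the parser each time the parser calls getNextToken; both the lexical analyzer and the parser report errors and consult the symbol table.]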
The Reason for Using the Lexical Analyzer

Simplifies the design of the compiler
  LL(1) or LR(1) parsing with 1 token lookahead would not be possible without it (multiple characters/tokens to match)
Compiler efficiency is improved
  Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  Stream buffering methods to scan input
Compiler portability is enhanced
  Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes

Token (符號單元)
  A pair consisting of a token name and an optional attribute value.
  Example: num, id
Pattern (樣本)
  A description of the form for the lexemes of a token.
  Example: "non-empty sequence of digits", "letter followed by letters and digits"
Lexeme (詞)
  A sequence of characters that matches the pattern for a token.
  Example: 123, abc
Example: Tokens, Patterns, and Lexemes
Token        Pattern                                 Lexeme
if           characters i f                          if
else         characters e l s e                      else
comparison   < or > or <= or >= or == or !=          <=, !=
id           letter followed by letters and digits   pi, score, D2
number       any numeric constant                    3.14, 0, 6.23
literal      anything but ", surrounded by "'s       "core dump"
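A minimal Python sketch of how the three notions relate, assuming regular-expression patterns chosen to mirror the rows of the table. The names TOKEN_SPEC and tokenize, and the extra ws entry used to skip blanks, are illustrative and not part of the slides.

```python
import re

# Hypothetical token specification: (token name, pattern) pairs.
# Keywords come first so that "if" is not reported as an id.
TOKEN_SPEC = [
    ("if",         r"if\b"),
    ("else",       r"else\b"),
    ("comparison", r"<=|>=|==|!=|<|>"),
    ("id",         r"[A-Za-z][A-Za-z0-9]*"),
    ("number",     r"[0-9]+(?:\.[0-9]+)?"),
    ("literal",    r'"[^"]*"'),
    ("ws",         r"\s+"),          # whitespace is matched but not returned
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]


def tokenize(text):
    """Return a list of (token name, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            if m:
                if name != "ws":
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no pattern matches at position {pos}")
    return tokens


if __name__ == "__main__":
    for token, lexeme in tokenize('if score <= 3.14 else "core dump"'):
        print(token, repr(lexeme))
```

Each pair returned is a token name together with the lexeme that matched its pattern; a real lexical analyzer would also attach an attribute value such as a symbol-table entry.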
Input Buffering
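[Figure: Sentinels. A pair of input buffers holding E = M * C * * 2, with an eof sentinel at the end of each buffer and an eof marking the end of the input; lexemeBegin marks the start of the current lexeme and forward scans ahead.]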
Strings and Languages

Alphabet
  An alphabet Σ is a finite set of symbols (characters).
String
  A string is a finite sequence of symbols from Σ.
  |s| denotes the length of string s.
  ε denotes the empty string, thus |ε| = 0.
Language
  A language is a countable set of strings over some fixed alphabet Σ.
  Abstract languages: ∅ (the empty set) and {ε}.
String Operations

Concatenation (連接)
  The concatenation of two strings x and y is denoted by xy.
Identity (單位元素)
  The empty string is the identity under concatenation.
  εs = sε = s
Exponentiation
  Define
    s^0 = ε
    s^i = s^(i-1) s for i > 0
  By definition
    s^1 = s
    s^2 = ss
Language Operations





Union
  L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation
  LM = { xy | x ∈ L and y ∈ M }
Exponentiation
  L^0 = { ε }
  L^i = L^(i-1) L
Kleene closure (封閉包)
  L* = ∪ i=0,…,∞ L^i
Positive closure
  L+ = ∪ i=1,…,∞ L^i
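The operations above can be tried on small finite languages. The following Python sketch represents a language as a set of strings and the empty string ε as "". Because L* and L+ are infinite in general, the closures here are truncated at a caller-chosen max_power, which is an assumption of the sketch and not part of the definitions.

```python
def union(L, M):
    """L ∪ M"""
    return L | M


def concat(L, M):
    """LM = { xy | x in L and y in M }"""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^0 = {ε}, L^i = L^(i-1) L"""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene(L, max_power):
    """Truncated Kleene closure: union of L^0 .. L^max_power."""
    out = set()
    for i in range(max_power + 1):
        out |= power(L, i)
    return out


def positive(L, max_power):
    """Truncated positive closure: union of L^1 .. L^max_power."""
    out = set()
    for i in range(1, max_power + 1):
        out |= power(L, i)
    return out


if __name__ == "__main__":
    L, M = {"a", "b"}, {"0", "1"}
    print(sorted(concat(L, M)))
    print(sorted(kleene(L, 2)))
```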
Regular Expressions

Regular Expressions
  A convenient means of specifying certain simple sets of strings.
  We use regular expressions to define structures of tokens.
  Tokens are built from symbols of a finite vocabulary.
Regular Sets
  The sets of strings defined by regular expressions.
Regular Expressions

Basis symbols:
  ε is a regular expression denoting language L(ε) = {ε}
  a ∈ Σ is a regular expression denoting L(a) = {a}
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  r|s is a regular expression denoting L(r) ∪ M(s)
  rs is a regular expression denoting L(r)M(s)
  r* is a regular expression denoting L(r)*
  (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.
Operator Precedence
Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
Law                    Description
r|s = s|r              | is commutative
r|(s|t) = (r|s)|t      | is associative
r(st) = (rs)t          concatenation is associative
r(s|t) = rs | rt       concatenation distributes over |
(s|t)r = sr | tr
εr = rε = r            ε is the identity for concatenation
r* = ( r |ε)*          ε is guaranteed in a closure
r** = r*               * is idempotent
Regular Definitions

If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
  d1 → r1
  d2 → r2
  …
  dn → rn
Each di is a new symbol, not in Σ and not the same as any other of the d's.
Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions.
Example: Regular Definitions
Regular Definitions
  letter_ → A | B | … | Z | a | b | … | z | _
  digit → 0 | 1 | … | 9
  id → letter_ ( letter_ | digit )*
Regular definitions are not recursive. The following is wrong, because digits appears in its own definition:
  digits → digit digits | digit
Extensions of Regular Definitions

One or more instances
  r+ = rr* = r*r
  r* = r+ | ε
Zero or one instance
  r? = r | ε
Character classes
  [a-z] = a|b|c|…|z
  [A-Za-z] = A|B|…|Z|a|…|z
Example
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
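The num definition maps almost directly onto the regular-expression syntax of Python's re module, which supports the same +, ? and character-class extensions. This is a small sketch; the names DIGIT, NUM and is_num are illustrative.

```python
import re

# digit -> [0-9]
DIGIT = r"[0-9]"
# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(rf"{DIGIT}+(\.{DIGIT}+)?(E[+-]?{DIGIT}+)?")


def is_num(lexeme):
    """True if the whole lexeme matches the num pattern."""
    return NUM.fullmatch(lexeme) is not None


if __name__ == "__main__":
    for s in ["0", "3.14", "6.02E23", "1E-5", ".5", "1.", "1E"]:
        print(s, is_num(s))
```

The dot is escaped because an unescaped dot in re matches any character, whereas the dot in the definition stands for the decimal point itself.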
Regular Definitions and Grammars
Context-Free Grammars
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num
Regular Definitions
  ws → ( blank | tab | newline )+
  digit → [0-9]
  letter → [A-Za-z]
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
Transition Diagrams
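relop → < | <= | <> | > | >= | =
[Figure: transition diagram for relop]
  start: state 0
  state 0 on < goes to state 1, on = to state 5, on > to state 6
  state 1 on = goes to state 2: return(relop, LE)
  state 1 on > goes to state 3: return(relop, NE)
  state 1 on other goes to state 4 *: return(relop, LT)
  state 5: return(relop, EQ)
  state 6 on = goes to state 7: return(relop, GE)
  state 6 on other goes to state 8 *: return(relop, GT)
  (* marks an accepting state that retracts the input by one character)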
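The relop diagram can be coded by hand as a state variable and a loop. The sketch below is one possible encoding in Python; the function name relop and the convention of returning the position after the lexeme are assumptions of the example. The starred states 4 and 8 retract by not consuming the character they inspected.

```python
def relop(text, start=0):
    """Run the relop transition diagram from position start.

    Returns ((token, attribute), next_position), or None if no
    relational operator begins at start.
    """
    state = 0
    pos = start
    while True:
        # "" stands for end of input and always takes the "other" edge.
        c = text[pos] if pos < len(text) else ""
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), pos + 1      # state 5
            elif c == ">":
                state = 6
            else:
                return None
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), pos + 1      # state 2
            if c == ">":
                return ("relop", "NE"), pos + 1      # state 3
            return ("relop", "LT"), pos              # state 4 *: retract
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), pos + 1      # state 7
            return ("relop", "GT"), pos              # state 8 *: retract
        pos += 1


if __name__ == "__main__":
    for s in ["<", "<=", "<>", "=", ">", ">=", "a"]:
        print(repr(s), relop(s))
```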
Transition Diagrams
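id → letter ( letter | digit )*
[Figure: transition diagram for id]
  start: state 9
  state 9 on letter goes to state 10
  state 10 on letter or digit stays in state 10
  state 10 on other goes to state 11 *: return( getToken(), installID() )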
Finite Automata

Finite Automata are recognizers.
  An FA simply says “Yes” or “No” about each possible input string.
  An FA can be used to recognize the tokens specified by a regular expression.
  FAs are used in the design of a Lexical Analyzer Generator.
Two kinds of Finite Automata
  Nondeterministic finite automata (NFA)
  Deterministic finite automata (DFA)
Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions

NFA = { S, Σ, δ, s0, F }
  A finite set of states S
  A set of input symbols Σ
    the input alphabet; ε is not in Σ
  A transition function δ
    δ : S × (Σ ∪ {ε}) → 2^S, giving a set of next states for each state and each symbol or ε
  A special start state s0
  A set of final states F, F ⊆ S (accepting states)
Transition Graph for FA
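[Figure legend; the symbols were lost in extraction. By the usual convention:]
  a circle is a state
  a labeled arrow is a transition
  an arrow labeled start points to the start state
  a double circle is a final state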
Example
[Figure: transition graph with states 0, 1, 2, 3 and edges labeled a, b, c]
This machine accepts abccabc, but it rejects abcab.
The language it accepts is (abc+)+.
Transition Table

The mapping δ of an NFA can be represented in a transition table.
[Figure: NFA with start state 0 and states 0, 1, 2, 3, with edges labeled a and b]
δ(0, a) = {0,1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}

STATE   a        b      ε
0       {0, 1}   {0}    -
1       -        {2}    -
2       -        {3}    -
3       -        -      -
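A transition table like this one can be stored as a dictionary keyed by (state, symbol), with missing entries standing for the "-" cells. The Python sketch below simulates the NFA by tracking the set of states it could be in. The accepting set {3} is the F given for this example in the slides; the names NFA, move and accepts are illustrative, and the simulation ignores ε-moves because this table has none.

```python
# delta as a dictionary; absent keys mean "no transition" (the "-" cells).
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def move(states, symbol):
    """Set of states reachable from any state in states on symbol."""
    result = set()
    for s in states:
        result |= NFA.get((s, symbol), set())
    return result


def accepts(x):
    """True if some path labeled x leads from START to an accepting state."""
    current = {START}
    for c in x:
        current = move(current, c)
    return bool(current & ACCEPTING)


if __name__ == "__main__":
    for s in ["abb", "aabb", "babb", "ab", "abba", ""]:
        print(repr(s), accepts(s))
```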
DFA

A DFA is a special case of an NFA
  There are no moves on input ε.
  For each state s and input symbol a, there is exactly one edge out of s labeled a.
Both DFA and NFA are capable of recognizing the same languages.
Simulating a DFA

Input
  An input string x terminated by an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move.
Output
  Answer “yes” if D accepts x; “no” otherwise.

s = s0;
c = nextChar();
while ( c != eof ) {
    s = move(s, c);
    c = nextChar();
}
if ( s is in F )
    return “yes”;
else
    return “no”;
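The loop above translates almost line for line into Python. In the sketch below, the transition table DTRAN is an assumption: it is the usual four-state DFA for (a | b)*abb, with start state 0 and accepting state 3, and the end of the Python string plays the role of eof.

```python
# Assumed DFA for (a | b)*abb: DTRAN[state][symbol] is the next state.
DTRAN = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}


def simulate_dfa(x, dtran=DTRAN, s0=0, accepting=frozenset({3})):
    """Return "yes" if the DFA accepts x, "no" otherwise."""
    s = s0
    for c in x:              # the loop ends where the pseudocode sees eof
        s = dtran[s][c]      # s = move(s, c)
    return "yes" if s in accepting else "no"


if __name__ == "__main__":
    for s in ["abb", "ababb", "abab", ""]:
        print(repr(s), simulate_dfa(s))
```

Each input character costs one table lookup, which is why DFA simulation runs in time proportional to the length of x.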
S = {0,1,2,3}
Σ = {a, b}
s0 = 0
F = {3}
NFA vs DFA
[Figure: an NFA and a DFA for (a | b)*abb, each with states 0, 1, 2, 3, start state 0, and edges labeled a and b]
The Regular Language

The regular language defined by an NFA is the set of input strings it accepts.
  Example: (a|b)*abb for the example NFA
An NFA accepts an input string x if and only if
  there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph.
  A state transition from one state to another on the path is called a move.
Theorem

The following are equivalent
  Regular Expression
  NFA
  DFA
  Regular Language
  Regular Grammar
Convert Concept
[Figure: conversion pipeline. A regular expression is converted to a nondeterministic finite automaton, then to a deterministic finite automaton, which minimization reduces to a smaller deterministic finite automaton.]
Construction of an NFA from a Regular Expression
Use Thompson’s Construction
[Figure: NFA fragments for the cases ε, a, s|t, st, and s*, composed from the sub-automata N(s) and N(t)]
Example

( a | b )* a b b
[Figure: parse tree decomposing the expression into subexpressions r1 to r11]
r3 = r4 (r4 is the parenthesized r3, so both have the same NFA)
Example

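( a | b )* a b b
[Figure: NFA with states 0 to 10 and start state 0; ε-transitions connect the fragments, and the labeled edges are 2 to 3 on a, 4 to 5 on b, 7 to 8 on a, 8 to 9 on b, 9 to 10 on b]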
Conversion of an NFA to a DFA

The subset construction algorithm converts an NFA into a DFA using the following operations.

Operation       Description
ε-closure(s)    Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)    Set of NFA states reachable from some NFA state s in set T on ε-transitions alone;
                = ∪ s in T ε-closure(s)
move(T, a)      Set of NFA states to which there is a transition on input symbol a from some state s in T.
Subset Construction(1)
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while ( there is an unmarked state T in Dstates ) {
    mark T;
    for ( each input symbol a ∈ Σ ) {
        U = ε-closure( move(T, a) );
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U;
    }
}
Computing ε-closure(T)
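ε-closure(T) can be computed with a simple stack-based search, and the subset construction then follows the pseudocode directly. The Python sketch below assumes the Thompson-style NFA for (a | b)*abb with states 0 to 10; its edge list is an assumption reconstructed from the standard construction. The names EPS, eps_closure, move and subset_construction are illustrative.

```python
EPS = None  # key used for ε-transitions (a convention of this sketch)

# Assumed NFA for (a|b)*abb: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},  (9, "b"): {10},
}


def eps_closure(T):
    """States reachable from any state in T on ε-transitions alone."""
    stack = list(T)
    closure = set(T)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)


def move(T, a):
    """States reachable on input symbol a from some state in T."""
    result = set()
    for s in T:
        result |= NFA.get((s, a), set())
    return result


def subset_construction(s0, alphabet):
    """Return (Dstates in discovery order, Dtran)."""
    dstates = [eps_closure({s0})]
    dtran = {}
    i = 0                       # states at index >= i are still unmarked
    while i < len(dstates):
        T = dstates[i]          # mark T
        i += 1
        for a in alphabet:
            U = eps_closure(move(T, a))   # an empty U would be a dead state
            if U not in dstates:
                dstates.append(U)
            dtran[T, a] = U
    return dstates, dtran


if __name__ == "__main__":
    dstates, dtran = subset_construction(0, "ab")
    name = dict(zip(dstates, "ABCDE"))
    for T in dstates:
        print(name[T], sorted(T), name[dtran[T, "a"]], name[dtran[T, "b"]])
```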
Example: ( a | b )* a b b
[Figure: the NFA with states 0 to 10 and start state 0, and the DFA with states A, B, C, D, E and start state A obtained from it by subset construction]

NFA State            DFA State   a   b
{0,1,2,4,7}          A           B   C
{1,2,3,4,6,7,8}      B           B   D
{1,2,4,5,6,7}        C           B   C
{1,2,4,5,6,7,9}      D           B   E
{1,2,4,5,6,7,10}     E           B   C
Example
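[Figure: an NFA with states 0 to 8 and start state 0 that recognizes the three patterns a, abb, and a*b+, and the DFA obtained from it, whose states are the NFA state sets 0137, 247, 8, 7, 58, and 68]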
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
Minimizing the DFA

Step 1
  Start with an initial partition Π with two groups: F and S-F (accepting and nonaccepting).
Step 2
  Apply the Split Procedure to Π to obtain Πnew.
Step 3
  If ( Πnew = Π )
    Πfinal = Π and continue with step 4
  else
    Π = Πnew and go to step 2
Step 4
  Construct the minimum-state DFA from the groups of Πfinal.
  Delete the dead state.
Split Procedure
Initially, let Πnew = Π
for ( each group G of Π ) {
    Partition G into subgroups such that
    two states s and t are in the same subgroup
    if and only if
    for all input symbols a, states s and t have
    transitions on a to states in the same group of Π.
    /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed
}
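The four steps and the split procedure can be sketched in Python as repeated partition refinement. The function below assumes a complete DFA given as a dictionary dtran[(state, symbol)]; the name minimize and this representation are assumptions of the example, and it stops after computing the final partition without building the new transition table or deleting a dead state.

```python
def minimize(states, alphabet, dtran, accepting):
    """Return the final partition of states as a list of sets."""
    # Step 1: initial partition into accepting and nonaccepting states.
    accepting = set(accepting)
    partition = [g for g in (accepting, set(states) - accepting) if g]
    while True:
        # Index of the group each state currently belongs to.
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        # Step 2: split every group by where its states go on each symbol.
        new_partition = []
        for G in partition:
            subgroups = {}
            for s in sorted(G):
                key = tuple(group_of[dtran[s, a]] for a in alphabet)
                subgroups.setdefault(key, set()).add(s)
            new_partition.extend(subgroups.values())
        # Step 3: groups are only ever split, so an equal count means
        # the new partition is the same as the old one.
        if len(new_partition) == len(partition):
            return new_partition
        partition = new_partition


if __name__ == "__main__":
    # The five-state DFA for (a|b)*abb; E is the accepting state.
    dtran = {("A", "a"): "B", ("A", "b"): "C",
             ("B", "a"): "B", ("B", "b"): "D",
             ("C", "a"): "B", ("C", "b"): "C",
             ("D", "a"): "B", ("D", "b"): "E",
             ("E", "a"): "B", ("E", "b"): "C"}
    print(minimize("ABCDE", "ab", dtran, {"E"}))
```

On this DFA the refinement goes from (ABCD)(E) to (ABC)(D)(E) to (AC)(B)(D)(E), so A and C merge into one state.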
Example



Initially, there are two sets: {1, 2, 3, 5, 6} and {4, 7}.
{1, 2, 3, 5, 6} splits into {1, 2, 5} and {3, 6} on c.
{1, 2, 5} splits into {1} and {2, 5} on b.
Minimizing the DFA

Major operation: partition states into equivalence classes according to
  final / non-final states
  transition functions

Original DFA
State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C

Successive partitions
(ABCDE)
(ABCD)(E)
(ABC)(D)(E)
(AC)(B)(D)(E)

Minimized DFA
State   a   b
AC      B   AC
B       B   D
D       B   E
E       B   AC
Important States of an NFA

The “important states” of an NFA are those without an ε-transition, that is
  if move({s}, a) ≠ ∅ for some a then s is an important state
The subset construction algorithm uses only the important states when it determines ε-closure( move(T, a) ).
Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#.
Converting a RE Directly to a DFA



Construct a syntax tree for (r)#
Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
Construct DFA D by algorithm 3.62
Function Computed From the Syntax Tree

nullable(n)
  True if the subtree at node n generates a language that includes the empty string
firstpos(n)
  The set of positions that can match the first symbol of a string generated by the subtree at node n
lastpos(n)
  The set of positions that can match the last symbol of a string generated by the subtree at node n
followpos(i)
  The set of positions that can follow position i in the tree
Rules for Computing the Function
A leaf labeled by ε
  nullable(n) = true
  firstpos(n) = ∅
  lastpos(n) = ∅
A leaf with position i
  nullable(n) = false
  firstpos(n) = {i}
  lastpos(n) = {i}
n = c1 | c2
  nullable(n) = nullable(c1) or nullable(c2)
  firstpos(n) = firstpos(c1) ∪ firstpos(c2)
  lastpos(n) = lastpos(c1) ∪ lastpos(c2)
n = c1 c2
  nullable(n) = nullable(c1) and nullable(c2)
  firstpos(n) = if ( nullable(c1) ) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
  lastpos(n) = if ( nullable(c2) ) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
n = c1*
  nullable(n) = true
  firstpos(n) = firstpos(c1)
  lastpos(n) = lastpos(c1)
Computing followpos
for ( each node n in the tree )
{
    // n is a cat-node with left child c1 and right child c2
    if ( n == c1.c2 )
        for ( each i in lastpos(c1) )
            followpos(i) = followpos(i) ∪ firstpos(c2);
    else if ( n is a star-node )
        for ( each i in lastpos(n) )
            followpos(i) = followpos(i) ∪ firstpos(n);
}
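The rules for nullable, firstpos and lastpos, together with the followpos loop, fit in one recursive traversal. The Python sketch below assumes a syntax tree encoded as nested tuples ("eps",), ("leaf", symbol, position), ("or", c1, c2), ("cat", c1, c2) and ("star", c1); this encoding and the name analyze are assumptions of the example.

```python
from collections import defaultdict


def analyze(n, followpos):
    """Return (nullable, firstpos, lastpos) of node n, updating followpos."""
    kind = n[0]
    if kind == "eps":
        return True, frozenset(), frozenset()
    if kind == "leaf":
        return False, frozenset({n[2]}), frozenset({n[2]})
    if kind == "or":
        n1, f1, l1 = analyze(n[1], followpos)
        n2, f2, l2 = analyze(n[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(n[1], followpos)
        n2, f2, l2 = analyze(n[2], followpos)
        for i in l1:                       # cat-node rule for followpos
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(n[1], followpos)
        for i in l1:                       # star-node rule for followpos
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind}")


def build_example_tree():
    """Syntax tree for (a|b)*abb# with leaf positions 1 to 6."""
    star = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
    tree = ("cat", star, ("leaf", "a", 3))
    tree = ("cat", tree, ("leaf", "b", 4))
    tree = ("cat", tree, ("leaf", "b", 5))
    return ("cat", tree, ("leaf", "#", 6))


if __name__ == "__main__":
    followpos = defaultdict(set)
    print(analyze(build_example_tree(), followpos))
    print(dict(followpos))
```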
Converting a RE Directly to a DFA
Initialize Dstates to contain only the unmarked state firstpos(n0), where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ∈ Σ ) {
        let U be the union of followpos(p) for all p in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}
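As a sketch of this loop in Python, the example below hard-codes the followpos table and leaf symbols for ( a | b )* a b b #, with positions 1 to 6, rather than deriving them from a syntax tree. The names FOLLOWPOS, SYMBOL_AT, FIRSTPOS_ROOT and build_dfa are illustrative.

```python
# Data for (a|b)*abb#: followpos of each position and the symbol at it.
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
FIRSTPOS_ROOT = frozenset({1, 2, 3})
END_POSITION = 6  # position of the end marker #


def build_dfa(alphabet="ab"):
    """Return (Dstates in discovery order, Dtran, accepting states)."""
    dstates = [FIRSTPOS_ROOT]
    dtran = {}
    i = 0                        # states at index >= i are still unmarked
    while i < len(dstates):
        S = dstates[i]           # mark S
        i += 1
        for a in alphabet:
            U = set()
            for p in S:
                if SYMBOL_AT[p] == a:
                    U |= FOLLOWPOS[p]
            U = frozenset(U)
            if U not in dstates:
                dstates.append(U)
            dtran[S, a] = U
    # A state is accepting if it contains the position of #.
    accepting = [S for S in dstates if END_POSITION in S]
    return dstates, dtran, accepting


if __name__ == "__main__":
    dstates, dtran, accepting = build_dfa()
    for S in dstates:
        print(sorted(S), sorted(dtran[S, "a"]), sorted(dtran[S, "b"]))
```

This yields four states directly, one fewer than the subset construction produces for the same expression before minimization.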
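Example: ( a | b )* a b b #
[Figure: syntax tree for ( a | b )* a b b #, with leaf positions a = 1, b = 2, a = 3, b = 4, b = 5, # = 6; n is the cat-node for ( a | b )* a]
n = ( a | b )* a
nullable(n) = false
firstpos(n) = { 1, 2, 3 }
lastpos(n) = { 3 }
followpos(1) = { 1, 2, 3 }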
Example
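( a | b )* a b b #
[Figure: the syntax tree annotated with firstpos and lastpos at each node; only the star-node is nullable]

Node                      firstpos     lastpos
a (position 1)            {1}          {1}
b (position 2)            {2}          {2}
a | b                     {1, 2}       {1, 2}
( a | b )*                {1, 2}       {1, 2}
a (position 3)            {3}          {3}
( a | b )* a              {1, 2, 3}    {3}
b (position 4)            {4}          {4}
( a | b )* a b            {1, 2, 3}    {4}
b (position 5)            {5}          {5}
( a | b )* a b b          {1, 2, 3}    {5}
# (position 6)            {6}          {6}
( a | b )* a b b #        {1, 2, 3}    {6}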
Example
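( a | b )* a b b #

Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -

[Figure: the followpos relation drawn as a directed graph over positions 1 to 6, and the resulting DFA with states {1,2,3}, {1,2,3,4}, {1,2,3,5} and {1,2,3,6}, edges labeled a and b, start state {1,2,3}, and accepting state {1,2,3,6}]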
Time and Space Complexity
Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)
(|r| is the length of the regular expression, |x| the length of the input string)