Chapter 3
Chang Chi-Chung
2007.4.12
The Role of the Lexical Analyzer
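[Figure: the source program feeds the lexical analyzer, which returns a token to the parser each time the parser calls getNextToken; both the lexical analyzer and the parser report errors and consult the symbol table]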
The Reason for Using the Lexical Analyzer

Simplifies the design of the compiler
  A parser that had to deal with comments and white space as syntactic units would be more complex.
  If lexical analysis were not separated from the parser, LL(1) or LR(1) parsing with one token of lookahead would not be possible (there would be multiple characters/tokens to match).
Compiler efficiency is improved
  Systematic techniques exist to implement lexical analyzers by hand or automatically from specifications.
  Stream buffering methods can be used to scan the input.
Compiler portability is enhanced
  Input-device-specific peculiarities can be restricted to the lexical analyzer.
Lexical Analyzer

Lexical analyzers are divided into a cascade of two processes.
Scanning
  Consists of the simple processes that do not require tokenization of the input.
    Deletion of comments.
    Compaction of consecutive whitespace characters into one.
Lexical analysis
  The scanner produces the sequence of tokens as output.
Tokens, Patterns, and Lexemes

Token
  A pair consisting of a token name and an optional attribute value.
  Example: num, id
Pattern
  A description of the form that the lexemes of a token may take.
  Example: “non-empty sequence of digits”, “letter followed by letters and digits”
Lexeme
  A sequence of characters that matches the pattern for a token.
  Example: 123, abc
Examples: Tokens, Patterns, and Lexemes
Token        Pattern                                  Sample lexemes
if           characters i f                           if
else         characters e l s e                       else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14, 0, 6.23
literal      anything but “, surrounded by “’s        “core dump”
An Example


E = M * C ** 2
The lexical analyzer produces the following sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
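The token sequence above can be reproduced with a small regex-driven tokenizer. This is only a sketch: the token names come from the slide, while the regular expressions, the function name tokenize, and the dictionary standing in for the symbol table are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op", r"\*\*"),   # listed before mult_op so ** is not read as two *
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (tokens, symtab); symtab maps each identifier to an entry index."""
    symtab = {}   # stands in for "pointer to symbol-table entry"
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                      # whitespace produces no token
        if kind == "id":
            tokens.append((kind, symtab.setdefault(lexeme, len(symtab))))
        elif kind == "number":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute
    return tokens, symtab


tokens, symtab = tokenize("E = M * C ** 2")
for tok in tokens:
    print(tok)
```

Operators are returned with no attribute value, matching <assign_op>, <mult_op>, and <exp_op> in the slide.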
Input Buffering
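[Figure: a buffer pair holding the input E = M * C ** 2; each buffer half ends with an eof sentinel, and a further eof inside a half marks the end of the input; the pointers lexemeBegin and forward delimit the current lexeme]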
Sentinels
Lookahead Code with Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}
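The sentinel scheme can be sketched in runnable form as follows. The class name TwoBufferReader, the half size N, and the use of "\0" as the eof sentinel are assumptions for illustration, and the sketch ignores the lexemeBegin pointer, so it does not guard against overwriting a lexeme that is still in progress.

```python
import io

EOF = "\0"   # sentinel; assumes "\0" never occurs in the source text
N = 4        # size of each buffer half, kept tiny for illustration


class TwoBufferReader:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [half 1: N chars][EOF][half 2: N chars][EOF]
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        chunk = self.stream.read(N)
        for i in range(N):
            # A short read leaves an EOF inside the half: end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        while True:
            c = self.buf[self.forward]
            self.forward += 1
            if c != EOF:
                return c                    # common case: a single test
            if self.forward == N + 1:       # passed the sentinel of half 1
                self._load(N + 1)           # forward already points at half 2
            elif self.forward == 2 * N + 2: # passed the sentinel of half 2
                self._load(0)
                self.forward = 0
            else:
                self.forward -= 1           # stay on the eof
                return None                 # eof within a half: end of input


def read_all(text):
    reader = TwoBufferReader(io.StringIO(text))
    chars = []
    c = reader.next_char()
    while c is not None:
        chars.append(c)
        c = reader.next_char()
    return "".join(chars)


print(read_all("E = M * C ** 2"))
```

The point of the sentinel is visible in next_char: ordinary characters cost one comparison, and the buffer-boundary checks run only when an eof is seen.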
Strings and Languages

Alphabet
  An alphabet Σ is a finite set of symbols (characters).
String
  A string is a finite sequence of symbols from Σ.
  |s| denotes the length of string s.
  ε denotes the empty string, thus |ε| = 0.
Language
  A language is a countable set of strings over some fixed alphabet Σ.
  Abstract languages: ∅ (the empty set) and {ε} (the set containing only the empty string).
String Operations

Concatenation
  The concatenation of two strings x and y is denoted by xy.
Identity
  The empty string ε is the identity under concatenation.
  εs = sε = s
Exponentiation
  Define
    s^0 = ε
    s^i = s^(i-1) s for i > 0
  By definition
    s^1 = s
    s^2 = ss
Language Operations





Union
  L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation
  LM = { xy | x ∈ L and y ∈ M }
Exponentiation
  L^0 = { ε }
  L^i = L^(i-1) L for i > 0
Kleene closure
  L* = ∪ i=0,…,∞ L^i
Positive closure
  L+ = ∪ i=1,…,∞ L^i
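These operations map directly onto Python sets of strings. The sketch below is illustrative only: the helper names are hypothetical, the empty string "" stands for ε, and since L* and L+ are infinite whenever L contains a non-empty string, closure_up_to computes only the finite union up to a chosen exponent n.

```python
def concat(L, M):
    """LM = { xy | x in L and y in M }"""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L"""
    result = {""}              # "" plays the role of ε
    for _ in range(i):
        result = concat(result, L)
    return result


def closure_up_to(L, n, start=0):
    """Union of L^i for start <= i <= n: a finite slice of L* (start=0) or L+ (start=1)."""
    out = set()
    for i in range(start, n + 1):
        out |= power(L, i)
    return out


L = {"a", "b"}
M = {"0", "1"}
print(sorted(L | M))                  # union
print(sorted(concat(L, M)))           # concatenation
print(sorted(power(L, 2)))            # exponentiation
print(sorted(closure_up_to(L, 2)))    # part of the Kleene closure
```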
Regular Expressions

Regular Expressions
  A convenient means of specifying certain simple sets of strings.
  We use regular expressions to define the structure of tokens.
  Tokens are built from symbols of a finite vocabulary.
Regular Sets
  The sets of strings defined by regular expressions.
Regular Expressions

Basis symbols:
  ε is a regular expression denoting the language L(ε) = {ε}
  a ∈ Σ is a regular expression denoting L(a) = {a}
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  r|s is a regular expression denoting L(r) ∪ M(s)
  rs is a regular expression denoting L(r)M(s)
  r* is a regular expression denoting L(r)*
  (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.
Operator Precedence
Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
Law                                  Description
r|s = s|r                            | is commutative
r|(s|t) = (r|s)|t                    | is associative
r(st) = (rs)t                        concatenation is associative
r(s|t) = rs | rt; (s|t)r = sr | tr   concatenation distributes over |
εr = rε = r                          ε is the identity for concatenation
r* = (r|ε)*                          ε is guaranteed in a closure
r** = r*                             * is idempotent
Regular Definitions

If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
  d1 → r1
  d2 → r2
  …
  dn → rn
Each di is a new symbol, not in Σ and not the same as any other of the d’s.
Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions.
Example: Regular Definitions
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
Regular definitions are not recursive:
digits → digit digits | digit    (wrong)
Extensions of Regular Definitions

One or more instances
  r+ = rr* = r*r
  r* = r+ | ε
Zero or one instance
  r? = r | ε
Character classes
  [a-z] = a|b|c|…|z
  [A-Za-z] = A|B|…|Z|a|…|z
Example
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
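The num definition translates almost symbol for symbol into a Python regular expression. This is a sketch under two assumptions: the dot in the definition is a literal decimal point (escaped below), and the names NUM and is_num are hypothetical.

```python
import re

# digit -> [0-9]
# num   -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def is_num(s):
    """True if the whole string s is a lexeme of num."""
    return NUM.fullmatch(s) is not None


for s in ["3.14", "0", "6.02E+23", "3.", ".5", "1E"]:
    print(repr(s), is_num(s))
```

fullmatch is used so that a string only counts when the entire lexeme fits the pattern, which is why "3." and "1E" are rejected.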
Regular Definitions and Grammars
Context-Free Grammars
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num
Regular Definitions
  digit → [0-9]
  letter → [A-Za-z]
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
  ws → ( blank | tab | newline )+
LEXEMES       TOKEN NAME   ATTRIBUTE VALUE
Any ws        -            -
if            if           -
then          then         -
else          else         -
Any id        id           Pointer to table entry
Any number    number       Pointer to table entry
<             relop        LT
<=            relop        LE
=             relop        EQ
<>            relop        NE
>             relop        GT
>=            relop        GE
Transition Diagrams
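relop → < | <= | <> | > | >= | =
[Figure: transition diagram for relop; its states and edges are:]
  state 0 (start): < → 1, = → 5, > → 6
  state 1: = → 2, > → 3, other → 4
  state 2: return(relop, LE)
  state 3: return(relop, NE)
  state 4 *: return(relop, LT)
  state 5: return(relop, EQ)
  state 6: = → 7, other → 8
  state 7: return(relop, GE)
  state 8 *: return(relop, GT)
  (* marks an accepting state at which the input must be retracted one position)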
Transition Diagrams
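id → letter ( letter | digit )*
[Figure: transition diagram for id; its states and edges are:]
  state 9 (start): letter → 10
  state 10: letter or digit → 10, other → 11
  state 11 *: return (getToken(), installID())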
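An Example: Implementation of RELOP
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    int state = 0;                /* start state of the diagram */
    while (1) {                   /* loop until a return or a failure */
        switch (state) {
        case 0: c = nextChar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();      /* the lexeme is not a relop */
                break;
        case 1: ...
        ...
        case 8: retract();        /* give back the "other" character */
                retToken.attribute = GT;
                return(retToken);
        }
    }
}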
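For comparison, here is a complete, runnable version of the same state machine in Python. It is a sketch rather than the lecture's implementation: the function name get_relop, the (token, attribute) tuples, and the convention of returning the next input position instead of calling retract() are assumptions. The accepting states 2, 3, 4, 7, and 8 are folded into the states that lead to them.

```python
def get_relop(text, pos=0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    state = 0
    while True:
        # "" stands for 'other' when the input is exhausted
        c = text[pos] if pos < len(text) else ""
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                state = 5
            elif c == ">":
                state = 6
            else:
                return None                    # fail(): not a relop
            pos += 1
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), pos + 1
            if c == ">":
                return ("relop", "NE"), pos + 1
            return ("relop", "LT"), pos        # state 4*: 'other' is not consumed
        elif state == 5:
            return ("relop", "EQ"), pos
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), pos + 1
            return ("relop", "GT"), pos        # state 8*: 'other' is not consumed


for s in ["<=", "<>", "<a", "=", ">=", ">", "a"]:
    print(repr(s), get_relop(s))
```

Not advancing pos in the starred cases has the same effect as retract() in the C-style code: the character that ended the lexeme is left for the next token.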
Finite Automata

Finite automata are recognizers.
  An FA simply says “yes” or “no” about each possible input string.
  An FA can be used to recognize the tokens specified by a regular expression.
  FAs are used in the design of a lexical analyzer generator.
Two kinds of finite automata
  Nondeterministic finite automata (NFA)
  Deterministic finite automata (DFA)
  Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions

NFA = { S, Σ, δ, s0, F }
  A finite set of states S
  A set of input symbols Σ (the input alphabet; ε is not in Σ)
  A transition function δ
    δ: S × (Σ ∪ {ε}) → 2^S, mapping a state and an input symbol (or ε) to a set of next states
  A special start state s0
  A set of final (accepting) states F, F ⊆ S
Transition Graph for FA
[Legend: graphical symbols for a state, a transition, the start state, and a final state]
Example
[Figure: transition graph with states 0, 1, 2, and 3 and edges labeled a, b, and c]
This machine accepts abccabc, but it rejects abcab.
This machine accepts (abc+)+.
Transition Table

The mapping δ of an NFA can be represented in a transition table.
[Figure: NFA with states 0-3: state 0 loops on a and b; 0 -a-> 1; 1 -b-> 2; 2 -b-> 3]
δ(0, a) = {0,1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}
STATE   a        b      ε
0       {0, 1}   {0}    -
1       -        {2}    -
2       -        {3}    -
3       -        -      -
DFA

DFA is a special case of an NFA
  There are no moves on input ε.
  For each state s and input symbol a, there is exactly one edge out of s labeled a.
Both DFA and NFA are capable of recognizing the same languages.
Simulating a DFA

Input
  An input string x terminated by an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move.
Output
  Answer “yes” if D accepts x; “no” otherwise.
s = s0;
c = nextChar();
while ( c != eof ) {
    s = move(s, c);
    c = nextChar();
}
if ( s is in F )
    return “yes”;
else
    return “no”;
S = {0,1,2,3}
Σ = {a, b}
s0 = 0
F = {3}
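The simulation loop is short enough to run directly. The sketch below assumes the four-state DFA for (a | b)*abb with S = {0,1,2,3}, s0 = 0, and F = {3}; the transition dictionary MOVE is written out by hand as an assumption, and the end of the Python string plays the role of eof.

```python
# Transition function of the assumed DFA for (a|b)*abb, keyed by (state, symbol).
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
ACCEPTING = {3}


def dfa_accepts(x):
    """Return True ("yes") if the DFA accepts x, False ("no") otherwise."""
    s = START
    for c in x:              # reaching the end of x corresponds to reading eof
        s = MOVE[(s, c)]     # exactly one move per input character
    return s in ACCEPTING


for x in ["abb", "aabb", "babb", "ab", "abba"]:
    print(x, "yes" if dfa_accepts(x) else "no")
```

Each character costs one dictionary lookup, which is why DFA simulation runs in time proportional to the length of the input.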
NFA vs DFA
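(a | b)*abb
[Figure: NFA with states 0-3 (3 accepting): state 0 loops on a and b; 0 -a-> 1; 1 -b-> 2; 2 -b-> 3]
[Figure: DFA with states 0-3 (3 accepting): 0 -b-> 0; 0 -a-> 1; 1 -a-> 1; 1 -b-> 2; 2 -a-> 1; 2 -b-> 3; 3 -a-> 1; 3 -b-> 0]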
The Regular Language

The regular language defined by an NFA is the set of input strings it accepts.
  Example: (a|b)*abb for the example NFA
An NFA accepts an input string x if and only if
  there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph.
  A state transition from one state to another on the path is called a move.
Theorem

The following are equivalent:
  Regular Expression
  NFA
  DFA
  Regular Language
  Regular Grammar
Convert Concept
[Figure: conversion pipeline: Regular Expression → Nondeterministic Finite Automata → Deterministic Finite Automata → (Minimization) → Deterministic Finite Automata]
Construction of an NFA from a Regular Expression
Use Thompson’s Construction
[Figure: the NFA fragments that Thompson’s construction builds for ε, a, s|t, st, and s*, composed from the sub-automata N(s) and N(t) with ε-transitions]
Example
( a | b )* a b b
[Figure: parse tree decomposing ( a | b )* a b b into subexpressions r1 through r11; the NFA for the parenthesized expression r4 is the same as the NFA for r3]
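Example
( a | b )* a b b
[Figure: NFA built by Thompson’s construction, states 0-10, start state 0, accepting state 10. ε-transitions: 0 → 1, 0 → 7, 1 → 2, 1 → 4, 3 → 6, 5 → 6, 6 → 1, 6 → 7. Labeled transitions: 2 -a-> 3, 4 -b-> 5, 7 -a-> 8, 8 -b-> 9, 9 -b-> 10]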
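Thompson's construction can be sketched with one small method per rule. Everything named here (the class Thompson, its methods, and the helper accepts) is hypothetical; a fragment is a (start, accept) pair and the label None stands for ε. For concatenation the start state of N(t) is merged into the accepting state of N(s). The state numbers it produces need not match the numbering in the figure.

```python
from itertools import count


class Thompson:
    def __init__(self):
        self.ids = count()
        self.edges = {}        # state -> list of (label, target); None means ε

    def _new(self):
        s = next(self.ids)
        self.edges[s] = []
        return s

    def symbol(self, a):
        i, f = self._new(), self._new()
        self.edges[i].append((a, f))
        return i, f

    def alt(self, s, t):       # s|t: new start and accept joined by ε-edges
        i, f = self._new(), self._new()
        for frag in (s, t):
            self.edges[i].append((None, frag[0]))
            self.edges[frag[1]].append((None, f))
        return i, f

    def cat(self, s, t):       # st: merge t's start into s's accepting state
        self.edges[s[1]].extend(self.edges.pop(t[0]))
        return s[0], t[1]

    def star(self, s):         # s*: skip edge plus loop-back edge
        i, f = self._new(), self._new()
        self.edges[i] += [(None, s[0]), (None, f)]
        self.edges[s[1]] += [(None, s[0]), (None, f)]
        return i, f


def eps_closure(nfa, states):
    stack, seen = list(states), set(states)
    while stack:
        for label, t in nfa.edges[stack.pop()]:
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(nfa, frag, x):
    cur = eps_closure(nfa, {frag[0]})
    for c in x:
        step = {t for s in cur for label, t in nfa.edges[s] if label == c}
        cur = eps_closure(nfa, step)
    return frag[1] in cur


# Build the NFA for (a|b)*abb.
nfa = Thompson()
frag = nfa.star(nfa.alt(nfa.symbol("a"), nfa.symbol("b")))
for ch in "abb":
    frag = nfa.cat(frag, nfa.symbol(ch))

print(len(nfa.edges), "states")
print(accepts(nfa, frag, "aabb"), accepts(nfa, frag, "ab"))
```

Each symbol and each operator adds at most two states, so the NFA grows linearly with the regular expression.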
Conversion of an NFA to a DFA

The subset construction algorithm converts an NFA into a DFA using the following operations.
Operation       Description
ε-closure(s)    Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)    Set of NFA states reachable from some NFA state s in set T on ε-transitions alone; = ∪ (s in T) ε-closure(s)
move(T, a)      Set of NFA states to which there is a transition on input symbol a from some state s in T.
Subset Construction
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while (there is an unmarked state T in Dstates) {
    mark T;
    for (each input symbol a ∈ Σ) {
        U = ε-closure( move(T, a) );
        if (U is not in Dstates)
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U;
    }
}
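A runnable sketch of the subset construction follows. It assumes the 11-state Thompson NFA for (a | b)*abb (states 0-10, start state 0), written out by hand as the dictionary NFA; the dictionary layout and the function names are hypothetical. A plain list is used for Dstates so that states are discovered in a fixed order.

```python
EPS = None   # label used for ε-transitions

# Assumed NFA for (a|b)*abb: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (3, EPS): {6},
    (4, "b"): {5}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
ALPHABET = "ab"


def eps_closure(T):
    """All NFA states reachable from the set T on ε-transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(T, a):
    """All NFA states reachable from some state in T on input symbol a."""
    return {t for s in T for t in NFA.get((s, a), ())}


def subset_construction(s0):
    start = eps_closure({s0})
    dstates, dtran = [start], {}
    unmarked = [start]
    while unmarked:
        T = unmarked.pop(0)               # mark T
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran


dstates, dtran = subset_construction(0)
names = {S: "ABCDE"[i] for i, S in enumerate(dstates)}
for S in dstates:
    print(names[S], sorted(S), names[dtran[(S, "a")]], names[dtran[(S, "b")]])
```

Because each DFA state is a frozenset of NFA states, membership tests and dictionary keys work without any extra encoding.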
Computing ε- closure(T)
Example
( a | b )* a b b
[Figure: the NFA for ( a | b )* a b b (states 0-10) and the DFA obtained from it by subset construction (states A-E, start state A, accepting state E)]
NFA State            DFA State   a   b
{0,1,2,4,7}          A           B   C
{1,2,3,4,6,7,8}      B           B   D
{1,2,4,5,6,7}        C           B   C
{1,2,4,5,6,7,9}      D           B   E
{1,2,4,5,6,7,10}     E           B   C
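Example
[Figure: combined NFA for the three patterns a, abb, and a*b+ (states 0-8, with ε-transitions from start state 0 to states 1, 3, and 7), and the DFA obtained from it by subset construction, whose states are the sets 0137, 247, 8, 7, 58, and 68]
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}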
Simulation of an NFA

Input
  An input string x terminated by an end-of-file character eof. An NFA N with start state s0, accepting states F, and transition function move.
Output
  Answer “yes” if N accepts x; “no” otherwise.
S = ε-closure(s0);
c = nextChar();
while ( c != eof ) {
    S = ε-closure( move(S, c) );
    c = nextChar();
}
if ( S ∩ F != ∅ )
    return “yes”;
else
    return “no”;
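The NFA simulation keeps a set of current states instead of a single state. The sketch below assumes the 11-state Thompson NFA for (a | b)*abb (start state 0, accepting state 10), written out by hand; the dictionary layout and the function names are hypothetical, and the end of the Python string plays the role of eof.

```python
EPS = None   # label used for ε-transitions

# Assumed NFA for (a|b)*abb: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (3, EPS): {6},
    (4, "b"): {5}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
START = 0
ACCEPTING = {10}


def eps_closure(T):
    """All NFA states reachable from the set T on ε-transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(T, a):
    """All NFA states reachable from some state in T on input symbol a."""
    return {t for s in T for t in NFA.get((s, a), ())}


def nfa_accepts(x):
    S = eps_closure({START})
    for c in x:                        # end of x corresponds to eof
        S = eps_closure(move(S, c))    # advance every current state at once
    return bool(S & ACCEPTING)         # yes iff S ∩ F is non-empty


for x in ["abb", "aabb", "babb", "ab", "abba"]:
    print(x, "yes" if nfa_accepts(x) else "no")
```

The set S computed after each character is exactly one of the DFA states that the subset construction would build, only computed on the fly.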
Minimizing the DFA

Step 1
  Start with an initial partition Π with two groups: F and S-F (accepting and nonaccepting).
Step 2
  Apply the split procedure to Π to construct Πnew.
Step 3
  If ( Πnew = Π )
    Πfinal = Π and continue with step 4
  else
    Π = Πnew and go to step 2
Step 4
  Construct the minimum-state DFA from the groups of Πfinal.
  Delete the dead state.
Split Procedure
Initially, let Πnew = Π
for ( each group G of Π ) {
    partition G into subgroups such that two states s and t
    are in the same subgroup if and only if, for all input
    symbols a, states s and t have transitions on a to states
    in the same group of Π;
    /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed;
}
Example



Initially, there are two sets: {1, 2, 3, 5, 6} and {4, 7}.
{1, 2, 3, 5, 6} splits into {1, 2, 5} and {3, 6} on c.
{1, 2, 5} splits into {1} and {2, 5} on b.
Minimizing the DFA

Major operation: partition states into equivalence classes according to
  final / non-final states
  transition functions
State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C
Successive partitions:
(ABCDE)
(ABCD)(E)
(ABC)(D)(E)
(AC)(B)(D)(E)
Minimized DFA:
State   a   b
AC      B   AC
B       B   D
D       B   E
E       B   AC
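The partition refinement can be sketched in a few lines. The transition table DTRAN below is the five-state DFA A-E from the example, with E assumed to be the only accepting state; the function names split and minimize are hypothetical.

```python
# DFA from the example: state -> {symbol: next state}.
DTRAN = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}
ALPHABET = "ab"


def split(partition):
    """One pass of the split procedure: refine every group of the partition."""
    group_of = {s: i for i, g in enumerate(partition) for s in g}
    new_partition = []
    for g in partition:
        buckets = {}
        for s in sorted(g):
            # States stay together only if, on every input symbol,
            # they move to states in the same group.
            key = tuple(group_of[DTRAN[s][a]] for a in ALPHABET)
            buckets.setdefault(key, set()).add(s)
        new_partition.extend(frozenset(b) for b in buckets.values())
    return new_partition


def minimize():
    # Step 1: nonaccepting and accepting states.
    partition = [frozenset(DTRAN) - ACCEPTING, frozenset(ACCEPTING)]
    while True:
        new_partition = split(partition)               # step 2
        if set(new_partition) == set(partition):       # step 3
            return partition
        partition = new_partition


for group in minimize():
    print("".join(sorted(group)))
```

Only A and C end up in the same group, so they are merged into the single state AC of the minimized DFA.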
Important States of an NFA

The “important states” of an NFA are those with a non-ε out-transition, that is
  if move({s}, a) ≠ ∅ for some a then s is an important state
The subset construction algorithm uses only the important states when it determines ε-closure( move(T, a) ).
Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#.
Converting a RE Directly to a DFA



Construct a syntax tree for (r)#
Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
Construct DFA D by algorithm 3.62
Function Computed From the Syntax Tree

nullable(n)
  True if the subtree at node n generates a language that includes the empty string
firstpos(n)
  The set of positions that can match the first symbol of a string generated by the subtree at node n
lastpos(n)
  The set of positions that can match the last symbol of a string generated by the subtree at node n
followpos(i)
  The set of positions that can follow position i in the tree
Rules for Computing the Function
Node n: a leaf labeled by ε
  nullable(n) = true
  firstpos(n) = ∅
  lastpos(n) = ∅
Node n: a leaf with position i
  nullable(n) = false
  firstpos(n) = {i}
  lastpos(n) = {i}
Node n = c1 | c2
  nullable(n) = nullable(c1) or nullable(c2)
  firstpos(n) = firstpos(c1) ∪ firstpos(c2)
  lastpos(n) = lastpos(c1) ∪ lastpos(c2)
Node n = c1 c2
  nullable(n) = nullable(c1) and nullable(c2)
  firstpos(n) = if ( nullable(c1) ) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
  lastpos(n) = if ( nullable(c2) ) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Node n = c1*
  nullable(n) = true
  firstpos(n) = firstpos(c1)
  lastpos(n) = lastpos(c1)
Computing followpos
for (each node n in the tree)
{
    // n is a cat-node with left child c1 and right child c2
    if ( n == c1.c2 )
        for ( each i in lastpos(c1) )
            followpos(i) = followpos(i) ∪ firstpos(c2);
    else if ( n is a star-node )
        for ( each i in lastpos(n) )
            followpos(i) = followpos(i) ∪ firstpos(n);
}
Converting a RE Directly to a DFA
Initialize Dstates to contain only the unmarked state firstpos(n0),
where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ∈ Σ ) {
        let U be the union of followpos(p)
            for all p in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}
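The whole direct construction fits in one short sketch. It assumes the augmented expression (a | b)*abb# with leaf positions 1 to 6; the tuple encoding of the syntax tree and the names leaf, analyze, and build_dfa are hypothetical. An ε-leaf case is omitted because this expression has none.

```python
from collections import defaultdict


def leaf(sym, pos):
    return ("leaf", sym, pos)


# Syntax tree for (a|b)*abb#, built by hand as nested tuples.
TREE = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", leaf("a", 1), leaf("b", 2))),
        leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))

symbol_at = {}                  # position -> input symbol
followpos = defaultdict(set)    # position -> set of positions


def analyze(n):
    """Return (nullable, firstpos, lastpos) of n, filling followpos on the way."""
    kind = n[0]
    if kind == "leaf":
        _, sym, pos = n
        symbol_at[pos] = sym
        return False, {pos}, {pos}
    if kind == "star":
        _, f1, l1 = analyze(n[1])
        for i in l1:                       # star-node rule
            followpos[i] |= f1
        return True, f1, l1
    n1, f1, l1 = analyze(n[1])
    n2, f2, l2 = analyze(n[2])
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        for i in l1:                       # cat-node rule
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    raise ValueError(kind)


def build_dfa(tree, alphabet):
    _, first, _ = analyze(tree)
    start = frozenset(first)
    dstates, dtran, unmarked = [start], {}, [start]
    while unmarked:
        S = unmarked.pop(0)                # mark S
        for a in alphabet:
            U = frozenset(q for p in S if symbol_at[p] == a
                          for q in followpos[p])
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(S, a)] = U
    return dstates, dtran


dstates, dtran = build_dfa(TREE, "ab")
for S in dstates:
    print(sorted(S), sorted(dtran[(S, "a")]), sorted(dtran[(S, "b")]))
```

A state is accepting when it contains the position of the end marker #, which is position 6 here.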
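Example
( a | b )* a b b #
[Figure: syntax tree for ( a | b )* a b b # with leaf positions a = 1, b = 2, a = 3, b = 4, b = 5, # = 6; n marks the cat-node for ( a | b )* a]
n = ( a | b )* a
nullable(n) = false
firstpos(n) = { 1, 2, 3 }
lastpos(n) = { 3 }
followpos(1) = { 1, 2, 3 }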
Example
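( a | b )* a b b #
[Figure: the syntax tree annotated with firstpos to the left and lastpos to the right of each node]
  Leaves a, b, a, b, b, # at positions 1 to 6: firstpos = lastpos = {i} for the leaf at position i
  | node: firstpos = {1, 2}, lastpos = {1, 2}
  * node (the only nullable node): firstpos = {1, 2}, lastpos = {1, 2}
  cat-nodes, from the lowest to the root: firstpos = {1, 2, 3} at each; lastpos = {3}, {4}, {5}, {6}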
Example
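Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -
[Figure: followpos drawn as a directed graph on positions 1 to 6]
( a | b )* a b b #
[Figure: the resulting DFA. Start state {1,2,3}: a → {1,2,3,4}, b → {1,2,3}. State {1,2,3,4}: a → {1,2,3,4}, b → {1,2,3,5}. State {1,2,3,5}: a → {1,2,3,4}, b → {1,2,3,6}. Accepting state {1,2,3,6}: a → {1,2,3,4}, b → {1,2,3}]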
Time and Space Complexity
Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)
Here |r| is the length of the regular expression and |x| is the length of the input string.