Ch.7 Expressions and Statements


Regular Expressions
Finite State Automaton
Regular expressions
Terminology on formal languages:
– alphabet: a finite set of symbols
– string: a finite sequence of alphabet symbols
– language: a (finite or infinite) set of strings.
Regular operations on languages:
Union: R ∪ S = { x | x ∈ R or x ∈ S }
Concatenation: RS = { xy | x ∈ R and y ∈ S }
Kleene closure: R* = R concatenated with itself 0 or more times
= {ε} ∪ R ∪ RR ∪ RRR ∪ …
= strings obtained by concatenating a finite number of strings from the set R.
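The three regular operations can be tried out on small finite languages. The sketch below is only an illustration: the function names are made up for this example, and because R* is usually infinite, `kleene_up_to` computes just the finite slice built from at most n concatenations.

```python
from itertools import product

def union(r, s):
    """R ∪ S: strings that are in R or in S."""
    return r | s

def concat(r, s):
    """RS: every string of R followed by every string of S."""
    return {x + y for x, y in product(r, s)}

def kleene_up_to(r, n):
    """Finite slice of R*: concatenations of at most n strings from R."""
    result = {""}   # R^0 = {ε}
    power = {""}
    for _ in range(n):
        power = concat(power, r)   # R^k from R^(k-1)
        result |= power
    return result

R = {"a", "ab"}
S = {"b"}
print(union(R, S))
print(concat(R, S))
print(sorted(kleene_up_to({"a"}, 3)))
```

The empty Python string plays the role of ε here.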
Programming Languages
Regular Expressions
A pattern notation for describing certain kinds of sets of strings:
Given an alphabet Σ:
– ε is a regular exp. (denotes the language {ε})
– for each a ∈ Σ, a is a regular exp. (denotes the language {a})
– if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
• (r) | (s) (denotes the language L(r) ∪ L(s))
• (r)(s) (denotes the language L(r)L(s))
• (r)* (denotes the language L(r)*)
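The inductive definition maps directly onto a recursive function. As a rough sketch, the code below represents a regular expression as a nested tuple (a representation chosen only for this example) and enumerates the strings of L(r) up to a length bound, one case per rule of the definition.

```python
def lang(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "eps":                      # ε denotes {ε}
        return {""}
    if kind == "sym":                      # a denotes {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                      # (r)|(s) denotes L(r) ∪ L(s)
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "cat":                      # (r)(s) denotes L(r)L(s)
        left, right = lang(r[1], max_len), lang(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                     # (r)* denotes L(r)*
        base = lang(r[1], max_len) - {""}  # dropping ε guarantees termination
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")

a_or_b_star = ("star", ("alt", ("sym", "a"), ("sym", "b")))
print(sorted(lang(a_or_b_star, 2)))
```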
Common Extensions to r.e. Notation
One or more repetitions of r: r+
A range of characters: [a-zA-Z], [0-9]
An optional expression: r?
Any single character: .
Giving names to regular expressions, e.g.:
– letter = [a-zA-Z_]
– digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
– ident = letter ( letter | digit )*
– Integer_const = digit+
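Named regular expressions can be imitated in most regex libraries by building larger patterns from smaller strings. The sketch below uses Python's `re` module; the variable and helper names are chosen for this example, and `(?:...)` is Python's non-capturing group.

```python
import re

# Named sub-patterns, mirroring the definitions on the slide.
letter = r"[a-zA-Z_]"
digit = "(?:" + "|".join("0123456789") + ")"   # 0 | 1 | ... | 9
ident = rf"{letter}(?:{letter}|{digit})*"
integer_const = rf"{digit}+"

def is_ident(s):
    """True if the whole string s is an identifier."""
    return re.fullmatch(ident, s) is not None

def is_integer_const(s):
    """True if the whole string s is an integer constant."""
    return re.fullmatch(integer_const, s) is not None

print(is_ident("_tmp1"), is_ident("1abc"), is_integer_const("007"))
```

`re.fullmatch` is used so that the pattern has to describe the entire string, which matches the language-membership reading of a regular expression.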
Examples of Regular Expressions
Identifiers:
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*
Numbers:
[0-9]
[0-9]+
[0-9]*
[1-9][0-9]*
([1-9][0-9]*)|0
-?[0-9]+
[0-9]*\.[0-9]+
([0-9]+)|([0-9]*\.[0-9]+)
[eE][-+]?[0-9]+
([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?
-?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? )
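A quick way to see what these number patterns accept is to run them with whole-string matching. This sketch uses Python's `re`; the `re.VERBOSE` flag is used only so that the spaces in the last pattern are ignored, as they are on the slide.

```python
import re

# ([1-9][0-9]*)|0 : non-negative integers without leading zeros
NO_LEADING_ZERO = re.compile(r"([1-9][0-9]*)|0")

# The last pattern on the slide; VERBOSE makes the spaces insignificant.
NUMBER = re.compile(
    r"-?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? )", re.VERBOSE
)

def matches(pattern, s):
    """True if pattern describes the whole string s."""
    return pattern.fullmatch(s) is not None

for s in ["42", "-3.14", ".5e-3", "1e5", "1.", "007"]:
    print(s, matches(NUMBER, s), matches(NO_LEADING_ZERO, s))
```

Note that in the last pattern the exponent is attached only to the decimal alternative, so a string such as 1e5 is rejected while .5e-3 is accepted.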
Examples of Regular Expressions
Numbers:
Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*) )
Decimal → Integer . Digit*
Real → ( Integer | Decimal ) E (+|-|ε) Digit*
Complex → ( Real , Real )
Exercises on Regular Expressions
Character sequences
[a-z]
[a-zA-Z]
[a-zA-Z][a-zA-Z0-9]*
[a-zA-Z0-9]
String literals
– "this is a string"
\".*\"      <- wrong!!! why?
\"[^"]*\"
Some exercises
Among strings made up of 0s and 1s...
– strings that start with 0: 0[01]*
– strings that start with 0 and end with 0: 0[01]*0
– strings in which 0 and 1 alternate: ______
– strings in which 0 never appears twice in a row: ______
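For the two string-literal patterns on this slide, the difference shows up as soon as a line contains more than one literal. The sketch below uses Python's `re.findall` on a sample input invented for this example.

```python
import re

text = '"this is a string" and "another"'

# .* may also consume quote characters, so the match runs to the last quote.
greedy = re.findall(r'\".*\"', text)

# [^"]* cannot cross a quote, so each literal is matched separately.
tight = re.findall(r'\"[^"]*\"', text)

print(greedy)
print(tight)
```

With the first pattern the scanner would return one token spanning both literals; the second pattern yields two tokens.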
Recognizing Tokens: Finite Automata
A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
– Σ is a finite alphabet;
– Q is a finite set of states;
– T: Q × Σ → Q is the transition function;
– q0 ∈ Q is the initial state; and
– F ⊆ Q is a set of final states.
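The 5-tuple can be written down almost literally as a data structure. In this sketch the class and field names are invented for illustration, and the transition function is stored as a partial dictionary: a missing entry is treated as a move to an implicit error state, which is an assumption of the sketch and not part of the definition above.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set        # Q
    alphabet: set      # Σ
    transition: dict   # T, stored as {(state, symbol): next_state}
    start: str         # q0
    finals: set        # F

    def accepts(self, word):
        state = self.start
        for ch in word:
            if (state, ch) not in self.transition:
                return False   # implicit error state (assumption of this sketch)
            state = self.transition[(state, ch)]
        return state in self.finals

# Example: the language 0[01]* (binary strings that start with 0).
starts_with_zero = DFA(
    states={"q0", "q1"},
    alphabet={"0", "1"},
    transition={("q0", "0"): "q1", ("q1", "0"): "q1", ("q1", "1"): "q1"},
    start="q0",
    finals={"q1"},
)
print(starts_with_zero.accepts("0110"), starts_with_zero.accepts("10"))
```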
Finite Automata: An Example
A (deterministic) finite automaton (DFA) to match C-style
comments:
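The slide's diagram is not reproduced in this transcript. As a stand-in, here is one possible DFA for C-style comments written as explicit state transitions; the state names are invented for this sketch and the original figure may have used a different layout.

```python
def is_c_comment(text):
    """True if text is exactly one C-style comment: /* ... */"""
    state = "start"
    for ch in text:
        if state == "start":
            state = "slash" if ch == "/" else "error"
        elif state == "slash":
            state = "body" if ch == "*" else "error"
        elif state == "body":
            state = "star" if ch == "*" else "body"
        elif state == "star":
            if ch == "/":
                state = "done"       # saw the closing */
            elif ch == "*":
                state = "star"       # still one '*' away from closing
            else:
                state = "body"
        else:
            state = "error"          # any input after "done", or already in error
    return state == "done"

print(is_c_comment("/* hi */"), is_c_comment("/*/"))
```

The "star" state is what keeps inputs like /***/ working: a run of asterisks stays in that state until either a slash closes the comment or another character returns to the body.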
Example 2
Consider the problem of recognizing register names
Register → r (0|1|2| … |9) (0|1|2| … |9)*
– Allows registers of arbitrary number
– Requires at least one digit
RE corresponds to a recognizer (or DFA)
[Figure: Recognizer for Register. S0 --r--> S1 --(0|1|2| … |9)--> S2 (accepting state); S2 loops on (0|1|2| … |9). Transitions on other inputs go to an error state, se.]
Example 2 (continued)
DFA operation
– Start in state S0 and take transitions on each input character
– DFA accepts a word x iff x leaves it in a final state (S2)
[Figure: Recognizer for Register. S0 --r--> S1 --(0|1|2| … |9)--> S2 (accepting state); S2 loops on (0|1|2| … |9).]
So,
– r17 takes it through s0, s1, s2 and accepts
– r takes it through s0, s1 and fails
– a takes it straight to se
Example 2 (continued)
To be useful, the recognizer must be turned into code
Char ← next character
State ← s0
while (Char ≠ EOF)
    State ← δ(State, Char)
    Char ← next character
if (State is a final state)
    then report success
    else report failure
Skeleton recognizer

δ  | r  | 0,1,2,3,4,5,6,7,8,9 | All others
s0 | s1 | se                  | se
s1 | se | s2                  | se
s2 | se | s2                  | se
se | se | se                  | se

Table encoding RE
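The skeleton recognizer and its table translate into a few lines of real code. The sketch below is one possible Python rendering: the table rows follow the slide, while the helper `char_class` and the dictionary layout are choices made for this example.

```python
DIGITS = "0123456789"

def char_class(ch):
    """Map an input character to its column in the transition table."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"

# δ, one row per state and one column per character class.
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}

def recognize(word):
    state = "s0"
    for ch in word:                      # end of string plays the role of EOF
        state = DELTA[state][char_class(ch)]
    return state in FINAL

for w in ["r17", "r", "a", "r00000"]:
    print(w, recognize(w))
```

The loop body is the same for any DFA; only the table and the set of final states change.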
What if we need a tighter specification?
r Digit Digit* allows arbitrary numbers
– Accepts r00000
– Accepts r99999
– What if we want to limit it to r0 through r31?
Write a tighter regular expression
– Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
– Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09
Produces a more complex DFA
– Has more states
– Same cost per transition
– Same basic implementation
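One way to check that the two tighter specifications describe the same set is to test the first one against an explicit enumeration of the second. The sketch below writes the first expression in Python `re` syntax (a transcription made for this example, with `[0-9]?` standing for (Digit | ε)) and compares it with r0 … r31 plus r00 … r09 over all register names of up to three digits.

```python
import re
from itertools import product

# r ( (0|1|2)(Digit | ε) | (4|...|9) | (3|30|31) ) in Python regex syntax
TIGHT = re.compile(r"r(?:[0-2][0-9]?|[4-9]|3|30|31)")

# The enumerated form: r0 ... r31 together with r00 ... r09
expected = {f"r{i}" for i in range(32)} | {f"r0{i}" for i in range(10)}

# Every candidate "r" followed by zero to three digits
candidates = {"r" + "".join(ds)
              for n in range(4)
              for ds in product("0123456789", repeat=n)}

accepted = {w for w in candidates if TIGHT.fullmatch(w)}
print(accepted == expected, len(accepted))
```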
Tighter register specification (continued)
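The DFA for
Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
[Figure: DFA for the tighter Register specification. S0 --r--> S1; S1 --0,1,2--> S2; S2 --(0|1|2| … |9)--> S3; S1 --3--> S5; S5 --0,1--> S6; S1 --4,5,6,7,8,9--> S4.]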
 Accepts a more constrained set of registers
 Same set of actions, more states
Tighter register specification (continued)
δ  | r  | 0,1 | 2  | 3  | 4-9 | All others
s0 | s1 | se  | se | se | se  | se
s1 | se | s2  | s2 | s5 | s4  | se
s2 | se | s3  | s3 | s3 | s3  | se
s3 | se | se  | se | se | se  | se
s4 | se | se  | se | se | se  | se
s5 | se | s6  | se | se | se  | se
s6 | se | se  | se | se | se  | se
se | se | se  | se | se | se  | se

Table encoding RE for the tighter register specification
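The larger table drops into the same driver loop as before. In the sketch below the rows are copied from the table; the set of accepting states (s2 through s6) is not marked in the table itself and is inferred here from the regular expression, so treat it as an assumption. The helper names are invented for this example.

```python
from itertools import product

COLUMNS = ["r", "0,1", "2", "3", "4-9", "other"]

def column(ch):
    """Map an input character to its column index in the table."""
    if ch == "r":
        return 0
    if ch in ("0", "1"):
        return 1
    if ch == "2":
        return 2
    if ch == "3":
        return 3
    if ch in ("4", "5", "6", "7", "8", "9"):
        return 4
    return 5

ROWS = {
    "s0": ["s1", "se", "se", "se", "se", "se"],
    "s1": ["se", "s2", "s2", "s5", "s4", "se"],
    "s2": ["se", "s3", "s3", "s3", "s3", "se"],
    "s3": ["se"] * 6,
    "s4": ["se"] * 6,
    "s5": ["se", "s6", "se", "se", "se", "se"],
    "s6": ["se"] * 6,
    "se": ["se"] * 6,
}
FINAL = {"s2", "s3", "s4", "s5", "s6"}   # inferred from the RE (assumption)

def accepts(word):
    state = "s0"
    for ch in word:
        state = ROWS[state][column(ch)]
    return state in FINAL

names = ["r" + "".join(ds) for n in range(4)
         for ds in product("0123456789", repeat=n)]
print(sorted(w for w in names if accepts(w)))
```

Only the table grew; the loop performs one lookup per character, as with the smaller recognizer.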
Automating Scanner Construction
RE → NFA (Thompson’s construction)
– Build an NFA for each term
– Combine them with ε-moves
NFA → DFA (subset construction)
– Build the simulation
DFA → Minimal DFA
– Hopcroft’s algorithm
DFA → RE (Not part of the scanner construction)
– All pairs, all paths problem
– Take the union of all paths from s0 to an accepting state
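The NFA → DFA step can be sketched in a few dozen lines. The code below is a minimal worklist version of the subset construction, with assumptions stated up front: the NFA is a dictionary from (state, label) to a set of states, the empty string is used as the label for ε-moves, and the example NFA for a(b|c)* is written by hand, not produced by Thompson's construction.

```python
from collections import deque

EPS = ""   # label used for ε-moves in this sketch

def eps_closure(nfa, states):
    """All NFA states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(nfa, start, finals, alphabet):
    """Each DFA state is a frozenset of NFA states."""
    d0 = eps_closure(nfa, {start})
    dfa, seen, worklist = {}, {d0}, deque([d0])
    while worklist:
        current = worklist.popleft()
        for a in alphabet:
            moved = {t for s in current for t in nfa.get((s, a), ())}
            if not moved:
                continue               # no entry: implicit error state
            target = eps_closure(nfa, moved)
            dfa[(current, a)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    dfa_finals = {d for d in seen if d & finals}
    return d0, dfa, dfa_finals

def dfa_accepts(d0, dfa, dfa_finals, word):
    state = d0
    for ch in word:
        state = dfa.get((state, ch))
        if state is None:
            return False
    return state in dfa_finals

# Hand-built NFA for a(b|c)*, with state 4 accepting.
nfa = {
    (0, "a"): {1},
    (1, EPS): {2, 4},
    (2, "b"): {3},
    (2, "c"): {3},
    (3, EPS): {2, 4},
}
d0, dfa, dfa_finals = subset_construction(nfa, 0, {4}, "abc")
print(dfa_accepts(d0, dfa, dfa_finals, "abcb"))
```

The resulting DFA can then be stored as a table and run with the same driver loop as the register recognizers; minimization (Hopcroft's algorithm) is not shown.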