Regular expressions - CS 434/534 Home Page


Course Overview
PART I: overview material
1 Introduction
2 Language processors (tombstone diagrams, bootstrapping)
3 Architecture of a compiler
Supplementary material: Theoretical foundations (Regular expressions)
PART II: inside a compiler
4 Syntax analysis
5 Contextual analysis
6 Runtime organization
7 Code generation
PART III: conclusion
8 Interpretation
9 Review
Regular Expressions
• finite state machine is a good “visual” aid
– but it is not very suitable as a specification
(its textual description is too clumsy)
• regular expressions are a suitable specification
– a more compact way to define a language that can be
accepted by an FSM
• used to give the lexical description of a programming
language
– define each “token” (keywords, identifiers, literals,
operators, punctuation, etc)
– define white-space, comments, etc
• these are not tokens, but must be recognized and ignored
Example: Pascal identifier
• Lexical specification (in English):
– a letter, followed by zero or more letters or digits
• Lexical specification (as a regular expression):
– letter . (letter | digit)*
|    means "or"
.    means "followed by" (dot may be omitted)
*    means "zero or more instances of"
( )  are used for grouping
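A rough way to try out this specification is with Python's `re` module. In this sketch the character classes `[A-Za-z]` and `[A-Za-z0-9]` stand in for `letter` and `(letter | digit)`; the name `is_identifier` is made up for illustration.

```python
import re

# letter . (letter | digit)*  written in Python's regex syntax.
# [A-Za-z] plays the role of "letter", [A-Za-z0-9] of "(letter | digit)".
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(text):
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    for s in ["x", "count1", "1count", ""]:
        print(repr(s), is_identifier(s))
```

`fullmatch` is used rather than `search` because a specification describes whole strings, not substrings.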
Operands of a regular expression
• Operands are same as labels on the edges of an FSM
– single characters, or
– the special character ε (the empty string)
• "letter" is a shorthand for
– a | b | c | ... | z | A | B | C | ... | Z
• "digit" is a shorthand for
– 0 | 1 | 2 | ... | 9
• sometimes we put the characters in quotes
– necessary when denoting | . * ( )
Precedence of | . * operators.
Regular Expression Operator    Analogous Arithmetic Operator    Precedence
|                              plus                             lowest
.                              times                            middle
*                              exponentiation                   highest
• Consider regular expressions:
– letter.letter | digit*
– letter.(letter | digit)*
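The two expressions differ only in grouping. By the precedence table, the first is read as (letter.letter) | (digit*), while the second is a letter followed by any mix of letters and digits. Python's `re` syntax ranks `*`, concatenation, and `|` in the same order, so the difference can be observed directly. A minimal sketch, with `LETTER`, `DIGIT`, and `matches` as assumed helper names:

```python
import re

LETTER = "[A-Za-z]"   # stand-in for the "letter" shorthand
DIGIT = "[0-9]"       # stand-in for the "digit" shorthand

# letter.letter | digit*      parsed as (letter.letter) | (digit*)
FIRST = re.compile(LETTER + LETTER + "|" + DIGIT + "*")
# letter.(letter | digit)*    the parentheses override the precedence
SECOND = re.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*")


def matches(pattern, text):
    """Return True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for s in ["ab", "123", "a1b2", ""]:
        print(repr(s), matches(FIRST, s), matches(SECOND, s))
```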
TEST YOURSELF
Question 1: Describe (in English) the language
defined by each of the following regular
expressions:
– letter (letter* | digit*)
– (letter | _ ) (letter | digit | _ )*
– digit* "." digit*
– digit digit* "." digit digit*
TEST YOURSELF
Question 2: Write a regular expression for
each of these languages:
– The set of all C++ reserved words
• Examples: if, while, for, class, int, case, char, true, false
– C++ string literals that begin with " and end with "
and don’t contain any other " except possibly in the
escape sequence \"
• Example: "The escape sequence \" occurs in this string"
– C++ comments that begin with /* and end with */
and don’t contain any other */ within the string
• Example: /* This is a comment * still the same comment */
Example: Integer Literals
• An integer literal with an optional sign can be
defined in English as:
– “(nothing or + or -) followed by one or more digits”
• The corresponding regular expression is:
– (+|-|) (digit.digit*)
• A new convenient operator ‘+’
– same precedence as ‘*’
– digit digit* is the same as
– digit +
which means "one or more digits"
8
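The long form and the `+` shorthand can be compared in Python's `re` syntax. This is only a sketch: the sign characters are escaped or placed in a character class because `+` is itself an operator, the empty alternative in `(\+|-|)` plays the role of ε, and the names below are hypothetical.

```python
import re

# (+|-|ε) (digit.digit*)   long form
SIGNED_INT_LONG = re.compile(r"(\+|-|)[0-9][0-9]*")
# same language, using the ? and + shorthands
SIGNED_INT_SHORT = re.compile(r"[+-]?[0-9]+")


def is_integer_literal(text, pattern=SIGNED_INT_SHORT):
    """Return True if the whole string is an optionally signed integer."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for s in ["42", "+7", "-0", "", "+", "4-2"]:
        print(repr(s),
              is_integer_literal(s, SIGNED_INT_LONG),
              is_integer_literal(s, SIGNED_INT_SHORT))
```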
Language Defined by a Regular Expression
• Recall: language = set of strings
• Language defined by an automaton
– the set of strings accepted by the automaton
• Language defined by a regular expression
– the set of strings that match the expression
Regular Exp.     Corresponding Set of Strings
ε                {""}
a                {"a"}
a.b.c            {"abc"}
a|b|c            {"a", "b", "c"}
(a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}
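To make "language = set of strings" concrete, the finite part of each language in the table can be listed by brute force. The sketch below tries every string over an alphabet up to a length bound and keeps those that Python's `re` accepts; `language_up_to` is a made-up helper name, and the empty pattern stands in for ε.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """List all strings over `alphabet` of length <= max_len matching `pattern`."""
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                matches.append(s)
    return matches


if __name__ == "__main__":
    for pattern in ["", "a", "abc", "a|b|c", "(a|b|c)*"]:
        print(repr(pattern), language_up_to(pattern, "abc", 3)[:8])
```

For `(a|b|c)*` the list keeps growing with the length bound, which is the sense in which the language is infinite.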
Concept of Reg Exp Generating a String
Rewrite the regular expression until only a
sequence of letters (a string) is left
Replacement Rules
1) r1 | r2 ––> r1
2) r1 | r2 ––> r2
3) r* ––> r r*
4) r* ––> ε
Example
(0|1)* 2 (0|1)*
(0|1) (0|1)* 2 (0|1)*     (rule 3)
1 (0|1)* 2 (0|1)*         (rule 2)
1 2 (0|1)*                (rule 4)
1 2 (0|1) (0|1)*          (rule 3)
1 2 (0|1)                 (rule 4)
120                       (rule 1)
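The rewriting process can be sketched as a small random generator. The tuple-based representation and the helper names (`sym`, `alt`, `cat`, `star`, `generate`) are assumptions made for this example, and the 50/50 coin flip for choosing between rules 3 and 4 is an arbitrary choice.

```python
import random

# A regular expression is represented as a nested tuple (illustrative encoding).
def sym(c):
    return ("sym", c)

def alt(a, b):
    return ("alt", a, b)

def cat(*parts):
    return ("cat",) + parts

def star(r):
    return ("star", r)


def generate(r, rng):
    """Rewrite r until only a string is left, choosing rules at random."""
    kind = r[0]
    if kind == "sym":
        return r[1]
    if kind == "alt":
        # Rules 1 and 2: replace r1 | r2 by one of its operands.
        return generate(rng.choice(r[1:]), rng)
    if kind == "cat":
        return "".join(generate(p, rng) for p in r[1:])
    if kind == "star":
        out = ""
        while rng.random() < 0.5:   # Rule 3: r* -> r r*
            out += generate(r[1], rng)
        return out                  # Rule 4: r* -> empty string
    raise ValueError("unknown node: %r" % (r,))


BIT = alt(sym("0"), sym("1"))
EXAMPLE = cat(star(BIT), sym("2"), star(BIT))   # (0|1)* 2 (0|1)*

if __name__ == "__main__":
    rng = random.Random()
    print([generate(EXAMPLE, rng) for _ in range(5)])
```

Running it several times gives different strings, each containing exactly one 2 surrounded by 0s and 1s.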
Non–determinism in Generation
• Different rule applications may yield different
final results
Example 1
(0|1)* 2 (0|1)*
(0|1) (0|1)* 2 (0|1)*
1 (0|1)* 2 (0|1)*
1 2 (0|1)*
1 2 (0|1) (0|1)*
1 2 (0|1)
120
Example 2
(0|1)* 2 (0|1)*
(0|1) (0|1)* 2 (0|1)*
0 (0|1)* 2 (0|1)*
0 2 (0|1)*
0 2 (0|1) (0|1)*
0 2 (0|1)
021
Concept of Language Generated by Reg Exp
• Set of all strings generated by a regular
expression is the language of the regular
expression
• In general, language may be infinite
• A string in the language generated by a regular
expression is often called a “token”
Examples of Languages and Reg Exp
• Σ = { 0, 1, . }
– (0 | 1)+ "." (0 | 1)* | (0 | 1)* "." (0 | 1)+ : binary floating point numbers
– (0 0)* : even-length all-zero strings
– 1* (0 1* 0 1*)* : binary strings with an even number of zeros
• Σ = { a, b, c, 0, 1, 2 }
– (a|b|c)(a|b|c|0|1|2)* : alphanumeric identifiers
– (0|1|2)+ : ternary numbers
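Claims such as "binary strings with an even number of zeros" can be spot-checked by comparing the regular expression against a direct count on every short string. This is a sanity check up to a length bound, not a proof; `check_even_zeros` is a hypothetical helper.

```python
import re
from itertools import product

# 1* (0 1* 0 1*)*  in Python's regex syntax
EVEN_ZEROS = re.compile(r"1*(01*01*)*")


def check_even_zeros(max_len):
    """Return True if the regex agrees with a zero count on all short strings."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            matched = EVEN_ZEROS.fullmatch(s) is not None
            expected = s.count("0") % 2 == 0
            if matched != expected:
                return False
    return True


if __name__ == "__main__":
    print(check_even_zeros(10))
```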
Reg Exp Notational Shorthand
• R+ one or more strings of R: R(R*)
• R? optional R: (R|ε)
• [abcd] one of the listed characters: (a|b|c|d)
• [a-z] one character from this range: (a|b|c|d|...|z)
• [^abc] any one character other than the listed ones
• [^a-z] any one character not from this range
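Each shorthand is only an abbreviation for the core operators. The sketch below expands three of them into plain `|`, concatenation, and `*`, producing pattern text that Python's `re` also accepts. The function names are invented for this example, and `char_range` assumes the range contains no regex metacharacters.

```python
import re


def plus(r):
    """R+  ->  R(R)*"""
    return "(" + r + ")(" + r + ")*"


def optional(r):
    """R?  ->  (R|ε); the empty alternative plays the role of ε."""
    return "(" + r + "|)"


def char_range(first, last):
    """[first-last]  ->  (c1|c2|...|cn)"""
    chars = [chr(code) for code in range(ord(first), ord(last) + 1)]
    return "(" + "|".join(chars) + ")"


if __name__ == "__main__":
    print(plus("ab"))
    print(optional("x"))
    print(char_range("a", "d"))
```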
Equivalence of FSM and Regular Expressions
• Theorem:
– For each finite state machine M, we can construct
a regular expression R such that M and R accept
the same language.
– [proof omitted]
• Theorem:
– For each regular expression R, we can construct a
finite state machine M such that R and M accept
the same language.
– [proof outline follows]
Regular Expressions to NFSM (1)
• For each kind of reg exp, define a NFSM
– Notation: NFSM for reg exp M
[diagram: box labeled M]
• For ε
[diagram: a single transition labeled ε]
• For input a
[diagram: a single transition labeled a]
Regular Expressions to NFSM (2)
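• For A . B
[diagram: the NFSMs for A and B joined by an ε-transition]
• For A | B
[diagram: the NFSMs for A and B in parallel, connected by four ε-transitions]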
Regular Expressions to NFSM (3)
• For A*
[diagram: the NFSM for A with four ε-transitions around it]
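The case-by-case construction can be sketched in code. The original diagrams are not reproduced here, so the exact placement of ε-edges below follows the common textbook (Thompson-style) construction and may differ in detail from the slides. The `NFSM` class and all function names are assumptions made for this example.

```python
from itertools import count

EPS = None          # label used for ε-moves
_ids = count()      # source of fresh state numbers


class NFSM:
    """One start state, one accepting state, and a list of (src, label, dst) edges."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges


def symbol(a):
    """NFSM for a single input a (or for ε when a is EPS)."""
    s, f = next(_ids), next(_ids)
    return NFSM(s, f, [(s, a, f)])


def concat(A, B):
    """NFSM for A . B: link A's accepting state to B's start state by ε."""
    return NFSM(A.start, B.accept,
                A.edges + B.edges + [(A.accept, EPS, B.start)])


def union(A, B):
    """NFSM for A | B: new start and accepting states, four ε-edges."""
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, A.start), (s, EPS, B.start),
             (A.accept, EPS, f), (B.accept, EPS, f)]
    return NFSM(s, f, A.edges + B.edges + extra)


def star(A):
    """NFSM for A*: ε-edges allow skipping A or repeating it."""
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, A.start), (s, EPS, f),
             (A.accept, EPS, A.start), (A.accept, EPS, f)]
    return NFSM(s, f, A.edges + extra)


def closure(m, states):
    """All states reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in m.edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(m, text):
    """Simulate the NFSM by tracking the set of states it may be in."""
    current = closure(m, {m.start})
    for ch in text:
        moved = {dst for src, label, dst in m.edges
                 if src in current and label == ch}
        current = closure(m, moved)
    return m.accept in current


if __name__ == "__main__":
    machine = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))  # (1|0)*1
    for s in ["1", "0101", "", "10"]:
        print(repr(s), accepts(machine, s))
```

Each builder returns a machine with exactly one start and one accepting state, which is what lets the pieces be composed without special cases.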
Example of RegExp -> NFSM conversion
• Consider the regular expression
(1|0)*1
• The NFSM is
[diagram: NFSM with states A through J, ε-transitions, and transitions labeled 0 and 1]
Converting NFSM to DFSM
• Simulate the NFSM
• Each state of DFSM
– is a non-empty subset of states of the NFSM
• Start state of DFSM
– is the set of NFSM states reachable from the
NFSM start state using only ε-moves
• Add a transition S ––a––> S’ to DFSM iff
– S’ is the set of NFSM states reachable from any
state in S after consuming only the input a,
considering ε-moves as well
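The steps above can be sketched as a worklist algorithm. The edge table below is an assumption: it is one plausible NFSM for the (1|0)*1 example over states A to J, and the diagram in the slides may differ. All names (`NFSM_EDGES`, `eps_closure`, `subset_construction`) are made up for this example.

```python
EPS = "eps"   # label used for ε-moves

# Assumed NFSM for (1|0)*1: (state, label) -> set of successor states.
NFSM_EDGES = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
START = "A"
ALPHABET = "01"


def eps_closure(states):
    """States reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in NFSM_EDGES.get((q, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def subset_construction():
    """Return (start state, transition table) of the DFSM."""
    start = eps_closure({START})
    table, todo = {}, [start]
    while todo:
        S = todo.pop()
        if S in table:
            continue
        table[S] = {}
        for a in ALPHABET:
            moved = set()
            for q in S:
                moved |= NFSM_EDGES.get((q, a), set())
            if not moved:
                continue          # DFSM states are non-empty subsets
            T = eps_closure(moved)
            table[S][a] = T
            todo.append(T)
    return start, table


if __name__ == "__main__":
    start, table = subset_construction()
    for S, row in table.items():
        for a, T in row.items():
            print("".join(sorted(S)), a, "".join(sorted(T)))
```

Only subsets reachable from the start state are ever created, so the table is usually far smaller than the set of all subsets.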
Remarks on converting NFSM to DFSM
• An NFSM may be in many states at any time
• How many different states?
• If there are N states, the NFSM must be in
some subset of those N states
• How many subsets are there?
• 2^N = finitely many
• For example, if N = 5 then 2^N = 32 subsets
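The count can be confirmed by listing the subsets directly. A small sketch; `all_subsets` is a hypothetical helper and the state names are arbitrary.

```python
from itertools import combinations


def all_subsets(states):
    """Return every subset of `states`, including the empty one."""
    items = list(states)
    return [frozenset(c)
            for size in range(len(items) + 1)
            for c in combinations(items, size)]


if __name__ == "__main__":
    subsets = all_subsets("ABCDE")      # N = 5 states
    print(len(subsets), 2 ** 5)
```

One of these subsets is the empty set, which is not used as a DFSM state.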
NFSM -> DFSM Example
[diagram: the NFSM for (1|0)*1 with states A through J, and the resulting DFSM with states ABCDHI, FGHIABCD, and EJGHIABCD connected by transitions on 0 and 1]
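Once the DFSM is built, recognizing a string is a simple table lookup per input character. In the sketch below the state names are the subsets from the example; the transition targets were worked out by hand under the assumption that J is the NFSM's only accepting state, so the DFSM state containing J is the accepting one. The names `DFSM`, `START`, `ACCEPTING`, and `accepts` are illustrative.

```python
# Transition table: state -> {input symbol -> next state}
DFSM = {
    "ABCDHI":    {"0": "FGHIABCD", "1": "EJGHIABCD"},
    "FGHIABCD":  {"0": "FGHIABCD", "1": "EJGHIABCD"},
    "EJGHIABCD": {"0": "FGHIABCD", "1": "EJGHIABCD"},
}
START = "ABCDHI"
ACCEPTING = {"EJGHIABCD"}   # the only subset that contains J


def accepts(text):
    """Run the DFSM over `text`; exactly one state is active at a time."""
    state = START
    for ch in text:
        if ch not in DFSM[state]:
            return False     # symbol outside the alphabet {0, 1}
        state = DFSM[state][ch]
    return state in ACCEPTING


if __name__ == "__main__":
    for s in ["1", "0011", "", "10"]:
        print(repr(s), accepts(s))
```

Unlike the NFSM simulation, no sets of states are tracked at run time; that work was done once during the conversion.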
TEST YOURSELF
Question 3: First convert each of these
regular expressions to a NFSM
– (a | b | ε) (a | b)
– (ab | ba)* (aa | bb)
Question 4: Next convert each resulting NFSM
to a DFSM