CPSC 388 – Compiler Design and Construction
Scanners – Regular Expressions
Announcements
Last day to Add/Drop Sept 11
Wipe Down Computer Keyboards and Mice
ACM programming contest
Read Chapter 3
Homework 2 due this Friday
PROG1 due this Friday
FSA for Java string constants (Anybody
figure it out?)
Homework 1 returned
Summary – In addition to your summary please
answer the following questions in your report:
Context – What is the context of the research
presented in the paper? What new ideas or concepts
do you think were presented in the paper?
Evaluation – How was the research evaluated? What
evaluation techniques would you like to see to
compare the research to the state of the art before
the paper?
Significance – What is the significance of the
research? Do you feel the work is a minor
improvement or a major step in the field?
Grammar / spelling / clarity / support for statements
made
Scanner Generator
.Jlex file (containing regular expressions) → Scanner Generator → .java file (containing scanner code)
To understand regular expressions, you need to understand finite-state automata.
FSA Formal Definition (5-tuple)
Q – a finite set of states
Σ – The alphabet of the automaton
(finite set of characters to label edges)
δ – state transition function
δ(state_i, character) → state_j
q – The start state
F – The set of final states
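A minimal sketch of the 5-tuple in code may help make the definition concrete. Python is used here only for brevity. The example automaton (it accepts binary strings ending in 1) and every name in it are illustrative assumptions, not something from the lecture.

```python
# A DFA written out as a 5-tuple (Q, SIGMA, DELTA, Q0, F).
# Hypothetical example: accepts binary strings that end in '1'.
Q = {"s0", "s1"}                 # finite set of states
SIGMA = {"0", "1"}               # alphabet
DELTA = {                        # transition function as a lookup table
    ("s0", "0"): "s0",
    ("s0", "1"): "s1",
    ("s1", "0"): "s0",
    ("s1", "1"): "s1",
}
Q0 = "s0"                        # start state
F = {"s1"}                       # set of final states


def dfa_accepts(text):
    """Run the DFA over text; accept if it stops in a final state."""
    state = Q0
    for ch in text:
        if (state, ch) not in DELTA:
            return False         # no edge for this character: reject
        state = DELTA[(state, ch)]
    return state in F
```

For example, dfa_accepts("0101") is True and dfa_accepts("10") is False. Storing δ as a table keyed by (state, character) mirrors the definition directly.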
Types of FSA
Deterministic (DFA)
No State has more than one outgoing
edge with the same label
Non-Deterministic (NFA)
States may have more than one
outgoing edge with same label.
Edges may be labeled with ε, the empty
string. The FSA can take an epsilon
transition without looking at the current
input character.
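One common way to run an NFA is to track the set of all states it could be in, following ε edges without consuming input. The sketch below assumes a hand-built NFA for ab*|a*; the state numbers, the use of the empty string as the ε label, and the function names are all assumptions made for illustration.

```python
EPS = ""  # label used for epsilon edges (an assumption of this sketch)

# Hypothetical NFA for ab*|a*: state -> {label -> set of next states}
NFA = {
    0: {EPS: {1, 3}},   # start: guess which alternative to follow
    1: {"a": {2}},      # ab* branch: the leading a
    2: {"b": {2}},      # ab* branch: loop on b
    3: {"a": {3}},      # a* branch: loop on a
}
START = 0
FINAL = {2, 3}


def eps_closure(states):
    """All states reachable from `states` using only epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get(s, {}).get(EPS, set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def nfa_accepts(text):
    """Simulate the NFA by tracking every state it could be in."""
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= NFA.get(s, {}).get(ch, set())
        current = eps_closure(moved)
    return bool(current & FINAL)
```

With this NFA, nfa_accepts("abb") and nfa_accepts("aaa") are True, while nfa_accepts("aab") is False.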
Terms to Know
Alphabet (Σ) – any finite set of
symbols e.g. binary, ASCII, Unicode
String – finite sequence of symbols
e.g. 010001, banana, bãër
Language – any countable set of
strings e.g.
Empty set
Well-formed C programs
English words
Regular Expressions
Easy way to express a language that is
accepted by FSA
Rules:
ε is a regular expression
Any symbol in Σ is a regular expression
If r and s are regular expressions, then so are:
r|s denotes union, i.e. “r or s”
rs denotes r followed by s (concatenation)
(r)* denotes concatenation of r with itself zero or
more times (Kleene closure)
() used for controlling order of operations
Example Regular Expressions
Regular Expression      Corresponding Language
ε                       {“”}
a                       {“a”}
abc                     {“abc”}
a|b|c                   {“a”, “b”, “c”}
(a|b|c)*                {“”, “a”, “b”, “c”, “aa”, “ab”, “ac”, “aaa”, …}
a|b|c|…|z|A|B|…|Z       Any letter
0|1|2|…|9               Any digit
Precedence in Regular Expressions
* has highest precedence, left associative
Concatenation has second highest
precedence, left associative
| has lowest precedence, left associative
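The precedence rules can be checked quickly with Python's re module, which follows the same ordering for *, concatenation, and |. The helper name matches is an assumption of this sketch; re.fullmatch requires the whole string to match.

```python
import re


def matches(pattern, text):
    """True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None


# * binds tighter than concatenation: ab* means a(b*), not (ab)*
assert matches(r"ab*", "abbb")
assert not matches(r"ab*", "abab")
assert matches(r"(ab)*", "abab")

# Concatenation binds tighter than |: a|bc means a|(bc), not (a|b)c
assert matches(r"a|bc", "a")
assert not matches(r"a|bc", "ac")
assert matches(r"(a|b)c", "ac")
```

Parentheses are the only way to override the default grouping, which is why (ab)* and ab* describe different languages.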
More Regular Expression Examples
Regular Expression      Corresponding Language
ε|a|b|ab*               {“”, “a”, “b”, “ab”, “abb”, “abbb”, …}
ab*c                    {“ac”, “abc”, “abbc”, …}
ab*|a*                  {“”, “a”, “ab”, “aa”, “aaa”, “abb”, …}
a(b*|a*)                {“a”, “ab”, “aa”, “abb”, “aaa”, …}
a(b|a)*                 {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”, …}
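One way to check a table like this is to enumerate every short string over the alphabet and keep the ones the expression matches. The sketch below is a brute-force approach using Python's re module; the function name language and its parameters are assumptions for illustration.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """List every string over alphabet, up to max_len, matched by pattern."""
    result = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                result.append(s)
    return result


print(language(r"ab*|a*"))              # ['', 'a', 'aa', 'ab', 'aaa', 'abb']
print(language(r"a(b*|a*)"))            # ['a', 'aa', 'ab', 'aaa', 'abb']
print(language(r"a(b|a)*", max_len=2))  # ['a', 'aa', 'ab']
```

The output is ordered by length and then alphabetically, so it lists the same strings as the table but in a different order. The number of candidate strings grows exponentially with max_len, so this is only practical for small checks.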
You Try
What is the language described by
each Regular Expression?
a*
(a|b)*
a|a*b
(a|b)(a|b)
aa|ab|ba|bb
(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
Regular Definitions
If Σ is an alphabet of basic symbols,
then a regular definition is a
sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol not in Σ and not the same as any other of the d’s.
2. Each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
Regular Definitions Example
Example C identifiers:
Σ = ASCII
letter_ → a|b|c|…|z|A|B|C|…|Z|_
digit   → 0|1|2|…|9
id      → letter_(letter_|digit)*
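A regular definition maps naturally onto named pattern strings, where each name is built only from names defined before it. The sketch below assumes Python's re module; ident stands in for id (to avoid shadowing Python's built-in id), and is_identifier is a hypothetical helper.

```python
import re
import string

# Each definition uses only Σ and the names defined before it.
letter_ = "(" + "|".join(string.ascii_letters + "_") + ")"  # a|b|…|Z|_
digit = "(" + "|".join(string.digits) + ")"                 # 0|1|…|9
ident = f"{letter_}({letter_}|{digit})*"                    # letter_(letter_|digit)*


def is_identifier(text):
    """True if text is in the language of the id definition."""
    return re.fullmatch(ident, text) is not None
```

For example, is_identifier("_tmp1") is True and is_identifier("1abc") is False. Wrapping each definition in parentheses keeps the precedence intact when it is substituted into a later one.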
Regular Definitions Example
Example Unsigned Numbers (integer or float):
Σ = ASCII
digit            → 0|1|2|…|9
digits           → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E(+|-|ε) digits) | ε
number           → digits optionalFraction optionalExponent
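The same definitions can be sketched as composed pattern strings. Python's re module has no ε symbol, so this sketch writes ε as an empty alternative, and it escapes . and + because both are special characters there. The snake_case names and the helper is_number are assumptions.

```python
import re

digit = r"(0|1|2|3|4|5|6|7|8|9)"
digits = rf"{digit}{digit}*"                    # digit digit*
optional_fraction = rf"(\.{digits}|)"           # . digits | ε
optional_exponent = rf"(E(\+|-|){digits}|)"     # (E(+|-|ε) digits) | ε
number = rf"{digits}{optional_fraction}{optional_exponent}"


def is_number(text):
    """True if text is an unsigned integer or float per the definition."""
    return re.fullmatch(number, text) is not None
```

Under this definition "42", "3.14", and "6.02E+23" are numbers, while ".5" and "1." are not, because digits requires at least one digit on each side of the point.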
Special Characters in Reg. Exp.
What does each of the following mean?
*    – Kleene closure
|    – or
()   – grouping
[]   – creates a character class
+    – positive closure
?    – zero or one instance
""   – anything in quotes means itself, e.g. "*"
.    – matches any single character (except newline)
\    – used for escape characters (newline, tab, etc.)
^    – matches beginning of a line
$    – matches the end of a line
Extensions to Regular Expressions
+ means one or more occurrences
(positive closure)
? means zero or one occurrence
Character classes
a|r|t can be written [art]
a|b|…|z can be written [a-z]
As long as there is a clear ordering to
characters
[^a-z] matches any character except a-z
Example Using Character Classes
^[^aeiou]*$
Matches any complete line that does not
contain a lowercase vowel
How do you tell which meaning of ^ is
intended?
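The two meanings of ^ are told apart by position: as the first character inside [ ] it negates the class, and outside brackets it anchors the match to the start of a line. The sketch below tries the pattern from this slide with Python's re module, one line at a time; vowel_free_lines is a hypothetical helper name.

```python
import re

# First ^ : anchor at start of line. Second ^ (inside [ ]): negation.
PATTERN = re.compile(r"^[^aeiou]*$")


def vowel_free_lines(text):
    """Return the lines of text that contain no lowercase vowel."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


print(vowel_free_lines("rhythm\nbanana\nSQL 123"))  # ['rhythm', 'SQL 123']
```

Uppercase vowels are not in the class, so a line such as "AEIOU" would still match.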
Try It
Create Character Classes for:
First ten letters (up to “j”)
Lowercase consonants
Digits in hexadecimal
Create Regular Expressions for:
Case Insensitive keyword such as
SELECT (or Select or SeLeCt) in SQL
Java string constants
Any string of whitespace characters
Creating a Scanner
Create a set of regular expressions, one for
each token to be recognized
Convert regular expressions into one
combined DFA
Run DFA over input character stream
Longest matching regular expression is selected
If a tie then use first matching regular
expression
Attach code to run when a regular
expression matches
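The longest-match and first-rule tie-break policies can be sketched without building a combined DFA. The version below is a simplification, not what a scanner generator produces: it tries each rule's regular expression separately with Python's re module at the current position. The token names and rule set are hypothetical.

```python
import re

# (token name, regular expression) in priority order. Hypothetical token set.
RULES = [
    ("IF", r"if"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUM", r"[0-9]+"),
    ("WS", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(rx)) for name, rx in RULES]


def scan(text):
    """Return (token, lexeme) pairs: longest match wins, first rule on ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Strict > keeps the earlier rule when two lengths tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name != "WS":  # the action attached to WS: skip it
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(scan("if iffy 42"))  # [('IF', 'if'), ('ID', 'iffy'), ('NUM', '42')]
```

Here "if" matches both IF and ID with the same length, so the earlier rule IF wins, while "iffy" goes to ID because that match is longer.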