
CPSC 388 – Compiler Design and Construction
Scanners – Regular Expressions
Announcements
Last day to Add/Drop Sept 11
Wipe Down Computer Keyboards and Mice
ACM programming contest
Read Chapter 3
Homework 2 due this Friday
PROG1 due this Friday
FSA for Java string constants (Anybody
figure it out?)
Homework 1 returned
 Summary – In addition to your summary please
answer the following questions in your report:
 Context – What is the context of the research
presented in the paper? What new ideas or concepts
do you think were presented in the paper?
 Evaluation – How was the research evaluated? What
evaluation techniques would you like to see to
compare the research to the state of the art before
the paper?
 Significance – What is the significance of the
research? Do you feel the work is a minor
improvement or a major step in the field?
 Grammar / spelling / clarity / support for statements
made
Scanner Generator
.Jlex file (containing regular expressions) → Scanner Generator → .java file (containing scanner code)
To understand regular expressions you need to understand finite-state automata.
FSA Formal Definition (5-tuple)
Q – a finite set of states
Σ – the alphabet of the automaton
(finite set of characters to label edges)
δ – the state transition function
δ(state_i, character) → state_j
q – the start state (q ∈ Q)
F – the set of final states (F ⊆ Q)
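The 5-tuple maps directly onto a small data structure. The following is a minimal Python sketch, assuming δ is stored as a dictionary; the class name, field names, and the example machine (strings over {a, b} that end in b) are illustrative and not from the lecture.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set      # Q
    alphabet: set    # Σ
    delta: dict      # δ as a table: (state, character) -> state
    start: str       # q
    finals: set      # F

    def accepts(self, text):
        """Follow one edge per input character; accept if we stop in F."""
        state = self.start
        for ch in text:
            key = (state, ch)
            if key not in self.delta:   # no edge for this character: reject
                return False
            state = self.delta[key]
        return state in self.finals

# Hypothetical example machine: strings over {a, b} that end in "b".
ends_in_b = DFA(
    states={"s0", "s1"},
    alphabet={"a", "b"},
    delta={("s0", "a"): "s0", ("s0", "b"): "s1",
           ("s1", "a"): "s0", ("s1", "b"): "s1"},
    start="s0",
    finals={"s1"},
)
```

Calling `ends_in_b.accepts("aab")` walks s0, s0, s0, s1 and accepts because s1 is in F.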
Types of FSA
 Deterministic (DFA)
 No state has more than one outgoing
edge with the same label
 Non-Deterministic (NFA)
 States may have more than one
outgoing edge with same label.
 Edges may be labeled with ε, the empty
string. The FSA can take an epsilon
transition without looking at the current
input character.
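One common way to run an NFA is to track the set of all states it could be in. The sketch below assumes δ maps (state, label) to a set of states and uses the empty string as the ε label; both choices, and the two example machines, are assumptions of this sketch.

```python
EPSILON = ""  # label used in this sketch for ε-edges

def epsilon_closure(states, delta):
    """All states reachable from `states` using only ε-edges."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, start, finals, text):
    """Track every state the NFA could be in after each character."""
    current = epsilon_closure({start}, delta)
    for ch in text:
        moved = set()
        for s in current:
            moved |= set(delta.get((s, ch), ()))
        current = epsilon_closure(moved, delta)
    return bool(current & set(finals))

# Hypothetical NFA for a*b*: loop on 'a' in state 0, ε-edge to state 1,
# loop on 'b' in state 1. State 1 is final.
a_star_b_star = {(0, "a"): {0}, (0, EPSILON): {1}, (1, "b"): {1}}

# Hypothetical NFA for strings ending in "ab": state 0 has two
# outgoing edges labeled 'a'. State 2 is final.
ends_in_ab = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
```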
Terms to Know
 Alphabet (Σ) – any finite set of
symbols e.g. binary, ASCII, Unicode
 String – finite sequence of symbols
e.g. 010001, banana, bãër
 Language – any countable set of
strings e.g.
Empty set
Well-formed C programs
English words
Regular Expressions
 Easy way to express a language that is
accepted by FSA
 Rules:
ε is a regular expression
Any symbol in Σ is a regular expression
If r and s are regular expressions, then so are:
r|s denotes union, i.e. “r or s”
rs denotes r followed by s (concatenation)
(r)* denotes concatenation of r with itself zero or
more times (Kleene closure)
 () used for controlling order of operations
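The three rules can be read as operations on sets of strings. Since (r)* usually denotes an infinite language, the sketch below truncates every result to strings of at most `max_len` characters; that bound and the function names are assumptions made only to keep the sets finite.

```python
def union(r, s):
    """r|s: strings that are in either language."""
    return r | s

def concat(r, s, max_len):
    """rs: a string of r followed by a string of s (kept if short enough)."""
    return {x + y for x in r for y in s if len(x + y) <= max_len}

def star(r, max_len):
    """(r)*: zero or more copies of r, truncated to max_len characters."""
    result = {""}        # zero copies gives the empty string
    frontier = {""}
    while frontier:      # stop once no new strings appear
        frontier = concat(frontier, r, max_len) - result
        result |= frontier
    return result

EPS = {""}               # language of the regular expression ε
a, b = {"a"}, {"b"}      # languages of the single symbols a and b
```

For example, `star(union(a, b), 2)` builds the part of (a|b)* with length at most 2.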
Example Regular Expressions
Regular Expression    Corresponding Language
ε                     {“”}
a                     {“a”}
abc                   {“abc”}
a|b|c                 {“a”, “b”, “c”}
(a|b|c)*              {“”, “a”, “b”, “c”, “aa”, “ab”, “ac”, “aaa”, …}
a|b|c|…|z|A|B|…|Z     Any letter
0|1|2|…|9             Any digit
Precedence in Regular Expressions
 * has highest precedence, left associative
 Concatenation has second highest
precedence, left associative
 | has lowest precedence, left associative
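To see what the precedence rules imply, a regular expression can be compared with explicitly parenthesized versions. This sketch uses Python's `re` module, which applies the same precedence to these three operators; the sample strings and the helper `same_on` are illustrative.

```python
import re

implicit = "ab*|c"          # relies on the precedence rules
explicit = "(a(b*))|(c)"    # grouping implied by: * first, then concatenation, then |
star_last = "(ab)*|c"       # grouping if concatenation bound tighter than *
bar_first = "a(b*|c)"       # grouping if | bound tighter than concatenation

def same_on(samples, r1, r2):
    """True if r1 and r2 accept exactly the same strings among samples."""
    return all(bool(re.fullmatch(r1, s)) == bool(re.fullmatch(r2, s))
               for s in samples)

samples = ["", "a", "c", "ab", "abb", "abab", "ac"]
```

On these samples only `explicit` agrees with `implicit`: `star_last` also accepts "" and "abab", and `bar_first` accepts "ac" but rejects "c".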
More Regular Expression Examples
Regular Expression    Corresponding Language
ε|a|b|ab*             {“”, “a”, “b”, “ab”, “abb”, “abbb”, …}
ab*c                  {“ac”, “abc”, “abbc”, …}
ab*|a*                {“”, “a”, “ab”, “aa”, “aaa”, “abb”, …}
a(b*|a*)              {“a”, “ab”, “aa”, “abb”, “aaa”, …}
a(b|a)*               {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”, …}
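Listings like these can be checked by brute force: generate every short string over the alphabet and keep the ones the expression matches. The sketch below assumes Python's `re` module and writes ε as an empty alternative (for example `|a|b|ab*`), since Python has no ε symbol; the function name and the length bound are illustrative.

```python
import re
from itertools import product

def language(regex, alphabet="ab", max_len=3):
    """Return, shortest first, every string over `alphabet` of length
    at most max_len that the whole regex matches."""
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(regex, s):
                found.append(s)
    return found
```

Comparing `language("a(b*|a*)")` with `language("a(b|a)*")` shows the difference between the last two expressions: only the second accepts mixed strings such as "aab" and "aba".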
You Try
 What is the language described by
each Regular Expression?
a*
(a|b)*
a|a*b
(a|b)(a|b)
aa|ab|ba|bb
(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
Regular Definitions
If Σ is an alphabet of basic symbols,
then a regular definition is a
sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol not in Σ and
not the same as any other of the d’s.
2. Each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
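The two conditions can be enforced mechanically while expanding definitions in order. The sketch below assumes a made-up notation in which a definition body refers to an earlier name as `{name}`; the function `expand` and the `bit`/`bits` definitions are hypothetical examples.

```python
import re

def expand(definitions):
    """definitions: ordered list of (name, body) pairs.
    Returns {name: regex with all earlier names substituted in}."""
    expanded = {}
    for name, body in definitions:
        if name in expanded:                       # condition 1: names are distinct
            raise ValueError(f"{name} is defined twice")

        def substitute(match):
            ref = match.group(1)
            if ref not in expanded:                # condition 2: only earlier names
                raise ValueError(f"{name} uses {ref} before it is defined")
            return "(" + expanded[ref] + ")"

        expanded[name] = re.sub(r"\{(\w+)\}", substitute, body)
    return expanded

# Hypothetical definitions: bit → 0|1, bits → bit bit*
defs = expand([("bit", "0|1"), ("bits", "{bit}{bit}*")])
```

Because each body may only mention earlier names, the expansion always terminates and the result is an ordinary regular expression over Σ.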
Regular Definitions Example
Example C identifiers:
Σ = ASCII
letter_ → a|b|c|…|z|A|B|C|…|Z|_
digit   → 0|1|2|…|9
id      → letter_(letter_|digit)*
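The identifier definition can be tried out by building each definition as a Python string and composing them. This is only a sketch: the variable names mirror the definitions above, `is_identifier` is a hypothetical helper, and the check does not exclude C keywords.

```python
import re
import string

# letter_ → a|b|…|z|A|B|…|Z|_   (spelled out as one long alternation)
letter_ = "(" + "|".join(string.ascii_letters + "_") + ")"
# digit → 0|1|…|9
digit = "(" + "|".join(string.digits) + ")"
# id → letter_(letter_|digit)*
ident = letter_ + "(" + letter_ + "|" + digit + ")*"

def is_identifier(s):
    """True if the whole string s matches the id definition."""
    return re.fullmatch(ident, s) is not None
```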
Regular Definitions Example
Example Unsigned Numbers (integer or float):
Σ = ASCII
digit            → 0|1|2|…|9
digits           → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E(+|-|ε)digits) | ε
number           → digits optionalFraction optionalExponent
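The same composition works for unsigned numbers. In this sketch ε is written as an empty alternative and the characters `.` and `+` are escaped, because they are special in Python's `re` syntax; the snake_case names are just Python spellings of the definitions above.

```python
import re

digit = "(" + "|".join("0123456789") + ")"        # 0|1|…|9
digits = digit + digit + "*"                       # digit digit*
optional_fraction = r"(\." + digits + r"|)"        # . digits | ε
optional_exponent = r"(E(\+|-|)" + digits + r"|)"  # (E(+|-|ε)digits) | ε
number = digits + optional_fraction + optional_exponent

def is_number(s):
    """True if the whole string s matches the number definition."""
    return re.fullmatch(number, s) is not None
```

Note that the definition requires digits on both sides of the decimal point and an uppercase E, so strings such as ".5", "1." and "1e5" are rejected.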
Special Characters in Reg. Exp.
What does each of the following mean?
*  – Kleene closure
|  – or
() – grouping
[] – creates a character class
+  – positive closure
?  – zero or one instance
“” – anything in quotes means itself, e.g. “*”
.  – matches any single character (except newline)
\  – used for escape characters (newline, tab, etc.)
^  – matches the beginning of a line
$  – matches the end of a line
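Most of these characters behave the same way in Python's `re` module, which makes it a convenient place to experiment. One difference to keep in mind: Python does not use double quotes for literal text, so this sketch uses a backslash or `re.escape` in that role. The test strings are illustrative.

```python
import re

checks = [
    (r"ab*",   "abbb", True),    # *  Kleene closure: zero or more b's
    (r"a|b",   "b",    True),    # |  or
    (r"(ab)*", "abab", True),    # () grouping
    (r"[abc]", "b",    True),    # [] character class
    (r"ab+",   "a",    False),   # +  positive closure needs at least one b
    (r"ab?",   "a",    True),    # ?  zero or one b
    (r"a\*",   "a*",   True),    # \  escape: a literal star
    (r"a.c",   "abc",  True),    # .  any single character ...
    (r"a.c",   "a\nc", False),   #    ... except newline
    (re.escape("a*b"), "a*b", True),  # Python's stand-in for quoting text
]
results = [bool(re.fullmatch(p, s)) == expected for p, s, expected in checks]

# In Python, ^ and $ anchor to line boundaries when re.MULTILINE is set.
lines = re.findall(r"^\w+$", "one\ntwo words\nthree", flags=re.MULTILINE)
```

Here `lines` contains only the lines made up entirely of word characters, because "two words" contains a space between ^ and $.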
Extensions to Regular Expressions
 + means one or more occurrences
(positive closure)
 ? means zero or one occurrence
 Character classes
 a|r|t can be written [art]
 a|b|…|z can be written [a-z]
(as long as there is a clear ordering to the
characters)
 [^a-z] matches any character except a-z
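Each extension is shorthand for something the basic operators can already express. The sketch below checks a few such equivalences by comparing two expressions on every string up to a small length, using Python's `re` module; the helper `agree` and the chosen alphabets are assumptions of the sketch, and ε is written as an empty alternative.

```python
import re
from itertools import product

def agree(r1, r2, alphabet, max_len=3):
    """True if r1 and r2 accept exactly the same strings over
    `alphabet` up to length max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(r1, s)) != bool(re.fullmatch(r2, s)):
                return False
    return True

plus_ok = agree("a+", "aa*", "ab")            # r+ is rr*
optional_ok = agree("a?", "a|", "ab")         # r? is r|ε
class_ok = agree("[art]", "a|r|t", "artx")    # [art] is a|r|t
negated_ok = agree("[^a-z]", "A|1", "aA1")    # over the alphabet {a, A, 1}
```

A bounded check like this is evidence rather than proof, but it catches most mistakes when rewriting an expression.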
Example Using Character Classes
^[^aeiou]*$
Matches any complete line that does not
contain a lowercase vowel
How do you tell which meaning of ^ is
intended?
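The pattern can be tried line by line in Python. In this expression the first ^ sits outside the brackets and anchors the match to the start of the line, while the ^ immediately after [ negates the character class. The function name and sample text below are illustrative.

```python
import re

no_vowels = re.compile(r"^[^aeiou]*$")

def vowel_free_lines(text):
    """Return the lines of `text` that contain no lowercase vowel."""
    return [line for line in text.splitlines() if no_vowels.match(line)]
```

Given the three lines "rhythm", "banana" and "APPLE", the function keeps "rhythm" and "APPLE": uppercase vowels are allowed because the class excludes only a, e, i, o, u.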
Try It
 Create Character Classes for:
 First ten letters (up to “j”)
 Lowercase consonants
 Digits in hexadecimal
 Create Regular Expressions for:
 Case-insensitive keyword such as
SELECT (or Select or SeLeCt) in SQL
 Java string constants
 Any string of whitespace characters
Creating a Scanner
 Create a set of regular expressions, one for
each token to be recognized
 Convert regular expressions into one
combined DFA
 Run DFA over input character stream
 Longest matching regular expression is selected
 If there is a tie, use the first matching
regular expression
 Attach code to run when a regular
expression matches
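The steps above can be sketched in a few lines of Python. Note the swapped-in technique: instead of converting the rules into one combined DFA, this sketch tries each rule's regex separately with Python's backtracking `re` engine, which is simpler but not what a scanner generator produces. The token names, rules, and actions are hypothetical.

```python
import re

# Hypothetical token rules in priority order: (name, regex, action).
RULES = [
    ("IF",     r"if",                     lambda text: ("IF", text)),
    ("ID",     r"[A-Za-z_][A-Za-z_0-9]*", lambda text: ("ID", text)),
    ("NUM",    r"[0-9]+",                 lambda text: ("NUM", int(text))),
    ("ASSIGN", r"=",                      lambda text: ("ASSIGN", text)),
    ("EQ",     r"==",                     lambda text: ("EQ", text)),
    ("WS",     r"[ \t\n]+",               lambda text: None),  # skip whitespace
]
COMPILED = [(name, re.compile(rx), action) for name, rx, action in RULES]

def scan(source):
    """Split source into tokens using longest match, first rule on ties."""
    tokens, pos = [], 0
    while pos < len(source):
        best_len, best_action = 0, None
        for _name, pattern, action in COMPILED:
            m = pattern.match(source, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_len, best_action = m.end() - pos, action
        if best_action is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        token = best_action(source[pos:pos + best_len])  # run the attached code
        if token is not None:
            tokens.append(token)
        pos += best_len
    return tokens
```

On the input "if x == 10" the IF rule wins the tie with ID for "if", and "==" is chosen over "=" because it is longer. On "iffy" the ID rule wins because its match is longer than "if".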