CPSC4600 Implementation of HLPL Spring 2003

CPSC4600 Implementation of HLPL
Spring 2003
Instructor: Dr. Shahadat Hossain
Today’s Agenda
- Administrative Matters
- Course Information
- Introduction to translation
Course Assessment
- Lectures: MWF 10:00 a.m. - 10:50 a.m.
- Project: Write a compiler for a small procedural language
- Two quiz tests (in January and March)
- One midterm on February 15
- Final exam
Course Material
- Course web page: http://classes.uleth.ca/200301/cpsc4600a/
  Course-related material will be made available here
- Homework assignments will be given occasionally. They will not be graded
- Text: Compilers by Aho, Sethi, and Ullman
  Brinch Hansen on Pascal Compilers by P.B. Hansen
Grading Method
- Project: 25% of course grade
- Quiz tests: 5% each
- Midterm: 25%
- Final: 40%
Out of Class Help
Office Hours: MWF 4:00 p.m. - 5:00 p.m. or by appointment
Office Location: C540 University Hall
Tentative Project Schedule
Scanner (weight 20%):
January 31 (midnight)
Parser (weight 25%):
February 21 (midnight)
Scope and Type Check (weight 20%):
March 08 (midnight)
Complete Compiler with code generation (35%):
April 04 (midnight)
NB: Project can be done in groups of at most 3 people
All phases of the project must be completed
Course Outline
Brinch Hansen On Pascal Compilers
- Whole book
The dragon book (Aho, Sethi, Ullman)
- Chapter 1 : all
- Chapter 2 : all
- Chapter 3: 3.1-3.7 (selected topics)
- Chapter 4: 4.1-4.5 (selected topics)
- Chapter 5: 5.1-5.2 (selected topics)
- Chapter 6: 6.1-6.2 (selected topics)
- Chapter 7: 7.1-7.6 (selected topics)
- Chapter 8: 8.1-8.7 (selected topics)
What is cs4600?
Introduction to theoretical and practical aspects of
program translators
Learn theory by doing!!
Handling nontrivial software project
Bits of advice to succeed in cs4600
Start early. Document your code.
Design before implementing
Test each function/method code separately for
correctness before you plug them in the main code
Discuss with your group members
Questions?
Pascal- : A subset of Pascal
Pascal- has only
two simple types integer and Boolean
two structured types array and record
Type definition: A type definition always creates a new
type; it can never rename an existing type
type
table = array [1..100] of integer;
stack = record
contents: table;
size: integer end;
Pascal- : A subset of Pascal
Variable definition: A type name must be used in a
variable definition
var
A: table;
x, y: integer;
All constants have simple types:
Predefined constants: true, false
Constant definition:
const max = 100; on = true;
Pascal- : A subset of Pascal
Statements:
-assignment
x := y;
-if-statement
if x = y then x := 1;
-while-statement
while I <10 do I := I+1;
-compound statement begin x := y; y := z end
-procedure call
-recursion
A Complete Pascal- Program
Program ProgramExample;
const n=100;
type table=array[1..n] of integer;
var A: table; i,x: integer; yes: Boolean;
procedure search(value: integer; var found:
Boolean; var index: integer);
var limit: integer;
begin
index:=1; limit:=n;
while index<limit do
if A[index]=value then
limit := index
else
index := index+1;
found := A[index] = value
end;
A Complete Pascal- Program
begin {input table}
i:=1;
while i<=n do
begin
read(A[i]);
i:=i+1
end;
{test search}
read(x);
while x<>0 do
begin
search(x,yes,i);
write(x);
if yes then
write(i);
read(x);
end
end. {program}
Pascal- Vocabulary
The vocabulary of a programming language is made up
of basic symbols and comments.
Basic Symbols:
a) Identifiers: In Pascal-, an identifier is called a Name, and
consists of a letter that may be followed by any sequence of
letters and digits (identifiers are case insensitive)
b) Denotations: Denotations represent specific values, according
to conventions laid down by the language designer. In Pascal- a
Numeral is the only denotation in the vocabulary.
Pascal- Vocabulary
c) There are two kinds of delimiters in Pascal-, word symbols and
special symbols:
Word symbols: and array begin const div do else
end if mod not of or procedure program record
then type var while
Special symbols: + - * < = > <= <> >= := ( ) [ ]
, . : ; ..
Comments: A comment in Pascal- is an arbitrary sequence
of characters enclosed in braces { }. Comments may extend
over several lines and may be nested to arbitrary depth.
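As a rough illustration of the nesting rule, a scanner can skip a comment by counting brace depth. The sketch below is only one way to do it: the function name skip_comment and its string-based interface are assumptions made for this example and do not come from the course material.

```c
#include <stddef.h>

/* Skip one Pascal- comment that starts at s[0] == '{'.
   Returns a pointer just past the matching '}', or NULL if the
   comment is still open at the end of the string.
   If s does not start with '{', s is returned unchanged. */
const char *skip_comment(const char *s)
{
    int depth = 0;                  /* current nesting depth */

    if (*s != '{')
        return s;                   /* not a comment: nothing to skip */

    for (; *s != '\0'; s++) {
        if (*s == '{') {
            depth++;                /* entering a (nested) comment */
        } else if (*s == '}') {
            if (--depth == 0)
                return s + 1;       /* matching close brace found */
        }
    }
    return NULL;                    /* unterminated comment */
}
```

A real scanner would read from its input buffer instead of a string and would report the unterminated case as a lexical error.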
Pascal- Vocabulary
White space (spaces, tabs and new lines) and comments are
called separators. Any basic symbol may be preceded by one or
more separators, and the program may be followed by zero or
more separators
Example:
{Incorrect}
ifx>0thenx:=10divx-1;
{Correct}
if x>0
then{Can divide}x:=10
div x-1;
Pascal- Grammar
Program --> 'program' ProgramName ';' BlockBody
'.'
BlockBody --> [ConstantDefinitionPart]
[TypeDefinitionPart] [VariableDefinitionPart]
{ProcedureDefinition} CompoundStatement .
Constant, Type, and Variable definition grammar:
Pascal- Grammar
Constant, Type, and Variable definition grammar
ConstantDefinitionPart --> 'const'
ConstantDefinition {ConstantDefinition}
ConstantDefinition --> ConstantNameDef '='
Constant ';'
Constant --> Numeral | ConstantNameUse
TypeDefinitionPart --> 'type' TypeDefinition
{TypeDefinition}
TypeDefinition --> TypeNameDef '=' NewType ';'
NewType --> NewArrayType | NewRecordType
NewArrayType --> 'array' '[' IndexRange ']' 'of'
TypeNameUse .
IndexRange --> Constant '..' Constant
Pascal- Grammar
Constant, Type, and Variable definition grammar
NewRecordType --> 'record' FieldList 'end'
FieldList --> RecordSection {';' RecordSection}
RecordSection --> FieldNameDefList ':'
TypeNameUse
FieldNameDefList --> FieldNameDef {','
FieldNameDef}
VariableDefinitionPart --> 'var'
VariableDefinition {VariableDefinition}
VariableDefinition --> VariableNameDefList ':'
TypeNameUse ';'
VariableNameDefList --> VariableNameDef {','
VariableNameDef}
Pascal- Grammar
Expression grammar
Expression --> SimpleExpression
[RelationalOperator SimpleExpression]
RelationalOperator --> '<' | '=' | '>' | '<=' |
'<>' | '>='
SimpleExpression --> [SignOperator] Term |
SimpleExpression AddingOperator Term
SignOperator --> '+' | '-'
AddingOperator --> '+' | '-' | 'or'
Term --> Factor | Term MultiplyingOperator Factor
MultiplyingOperator --> '*' | 'div' | 'mod' | 'and'
Pascal- Grammar
Expression grammar
Factor -->
Numeral |
VariableAccess |
'(' Expression ')' |
NotOperator Factor
NotOperator --> 'not' .
VariableAccess -->
VariableNameUse |
VariableAccess '[' Expression ']' |
VariableAccess '.' FieldNameUse
Pascal- Grammar
Statement grammar
Statement -->
AssignmentStatement |
ProcedureStatement |
IfStatement |
WhileStatement |
CompoundStatement |
Empty
AssignmentStatement --> VariableAccess ':='
Expression
ProcedureStatement --> ProcedureNameUse
ActualParameterList
Pascal- Grammar
Statement grammar
ActualParameterList --> '(' ActualParameters ')'
ActualParameters --> ActualParameter {','
ActualParameter}
ActualParameter --> Expression
IfStatement -->
'if' Expression 'then' Statement |
'if' Expression 'then' Statement 'else'
Statement
WhileStatement --> 'while' Expression 'do'
Statement
CompoundStatement --> 'begin' Statement {';'
Statement} 'end' .
Pascal- Grammar
Procedure grammar
ProcedureDefinition --> 'procedure'
ProcedureNameDef ProcedureBlock ';'
ProcedureBlock --> FormalParameterList ';'
BlockBody
FormalParameterList --> |'(' ParameterDefinitions
')'
ParameterDefinitions --> ParameterDefinition {';'
ParameterDefinition}
ParameterDefinition -->
'var' ParameterNameDefList ':' TypeNameUse |
ParameterNameDefList ':' TypeNameUse
ParameterNameDefList -->
ParameterNameDef | ParameterNameDefList ','
ParameterNameDef
The Project Language PL
A PL Program consists of a Block followed by a period
Program -> Block .
The Block describes a set of named objects, (constants, variables,
and procedures) and a sequence of statements that use these
objects
Variables can be simple variables of integer or Boolean type, or one-dimensional arrays of integer or Boolean elements indexed from
1 to some constant n
The Project Language PL
Block -> begin DefinitionPart StatementPart end
DefinitionPart -> { Definition ; }
Definition -> ConstantDefinition |
VariableDefinition | ProcedureDefinition
ConstantDefinition -> const ConstantName =
Constant
ConstantName -> Identifier
VariableDefinition -> TypeSymbol VariableList |
TypeSymbol array VariableList [ Constant ]
TypeSymbol -> integer | Boolean
VariableList -> VariableName { , VariableName}
VariableName -> Identifier
The Project Language PL
Procedure Definition:
Procedures can be recursive, but have no parameters in PL
ProcedureDefinition -> proc ProcedureName Block
ProcedureName -> Identifier
The Project Language PL
PL Statements
StatementPart -> { Statement ; }
Statement -> EmptyStatement | ReadStatement |
WriteStatement | AssignmentStatement |
ProcedureStatement | IfStatement | DoStatement
The Project Language PL
PL Statements can be
Empty statement which does nothing
EmptyStatement -> skip
read statement, which reads one or more integers into variables;
ReadStatement -> read VariableAccessList
write statement, which writes out the values of a sequence of
integer expressions;
WriteStatement -> write ExpressionList
assignment statement, which assigns to a sequence of variable
accesses to distinct variables, all of the same simple type
AssignmentStatement -> VariableAccessList :=
ExpressionList
The Project Language PL
procedure statement , which activates the code of a possibly
recursive parameterless procedure; All local variables are
allocated anew whenever the procedure is activated;
ProcedureStatement -> call ProcedureName
ProcedureName -> Identifier
if statement, which selects a guarded command from a sequence of
such commands, whose guard is true, and executes the
corresponding sequence of statements; at least one of the
guards must evaluate to true, otherwise the program execution
is aborted with an error message; If more than one guard is
true, one is selected arbitrarily.
IfStatement -> if GuardedCommandList fi
The Project Language PL
do statement, which executes a sequence of guarded commands
repeatedly, until all of the guards evaluate to false; at each
iteration if at least one guard is true, a guarded command with a
true guard is selected, and the corresponding statements are
executed.
DoStatement -> do GuardedCommandList od
GuardedCommandList -> GuardedCommand { []
GuardedCommand }
GuardedCommand ->
Expression -> StatementPart
VariableAccessList -> VariableAccess { ,
VariableAccess }
ExpressionList -> Expression { , Expression }
The Project Language PL
Expressions:
In PL expressions may contain arithmetic operators + - * / and \
or the relational operators, <, = and >
Expression -> PrimaryExpression { PrimaryOperator PrimaryExpression }
PrimaryOperator -> & | |     (the two operators are & and |)
PrimaryExpression -> SimpleExpression [ RelationalOperator SimpleExpression ]
RelationalOperator -> < | = | >
SimpleExpression -> [ - ] Term { AddingOperator Term }
AddingOperator -> + | -
Term -> Factor { MultiplyingOperator Factor }
MultiplyingOperator -> * | / | \
Factor -> Constant | VariableAccess | ( Expression ) | ~ Factor
The Project Language PL
Constant -> Numeral | BooleanSymbol | ConstantName
VariableAccess -> VariableName [ IndexedSelector ]
IndexedSelector -> [ Expression ]
BooleanSymbol -> false | true
Numeral -> Digit { Digit }
Identifier -> Letter { Letter | Digit | _ }
Example Program in PL
$A PL Program: Linear Search
begin const n=10;
integer array A[n];
integer x,i;
Boolean found;
proc Search
begin
integer m;
i,m := 1,n;
do i < m ->
if A[i] = x -> m:=i; []
~(A[i] = x ) -> i:=i+1;
fi;
Example Program in PL
od;
found := A[i] = x;
end; $ input the table :
i:=1; do
~(i > n) -> read A[i]; i:=i+1;
od;
$ Test Search :
read x; do ~(x = 0) -> call Search;
if found -> write x, i; []
~found -> write x;
fi;
read x;
od;
end.
A simple one-pass compiler (Chapter 2
ASU)
Build a translator for converting simple infix expressions to
their postfix form
- Discusses all phases of translation process
- Shows how the grammar rules are implemented in a programming
language
- Shows how the components of the translator are “glued” together
for example, A+B+C*D
– Postfix form: AB+CD*+
– Postfix notation can be converted directly into
code for a stack machine, for example
push A, push B, +, push C, push D, *, +, store
Infix and Postfix Expressions
Example
Infix: 3 * 4             Postfix: 3 4 *
Infix: 3 * 4 + 5 * 2     Postfix: 3 4 * 5 2 * +
Infix: A+B+C*D           Postfix: AB+CD*+
Postfix expressions can be converted directly into code for
a stack machine:
push A, push B, ADD, push C, push D, MULT, ADD, store
Context-free Grammar
Stmt --> list eof
list --> expr ; list | empty
expr --> expr + term | expr - term | term
term --> term * factor | term / factor | term div factor | term mod factor | factor
factor --> ( expr ) | id | num
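The expr, term and factor rules map fairly directly onto recursive procedures once the left recursion is rewritten as a loop. The following is a minimal sketch under simplifying assumptions that are not in the slides: operands are single letters or digits, div and mod are left out, and the struct translator type and the function to_postfix are names invented for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Translator state: input cursor, output buffer, error flag. */
struct translator {
    const char *in;
    char *out;
    size_t len, cap;
    int error;
};

static void expr(struct translator *t);

static void emit(struct translator *t, char c)
{
    if (t->len + 1 < t->cap)
        t->out[t->len++] = c;       /* keep room for the final '\0' */
    else
        t->error = 1;               /* output buffer too small */
}

static void skip_blanks(struct translator *t)
{
    while (*t->in == ' ' || *t->in == '\t')
        t->in++;
}

/* factor --> ( expr ) | id | num */
static void factor(struct translator *t)
{
    skip_blanks(t);
    if (*t->in == '(') {
        t->in++;
        expr(t);
        skip_blanks(t);
        if (*t->in == ')') t->in++; else t->error = 1;
    } else if (isalnum((unsigned char)*t->in)) {
        emit(t, *t->in++);          /* operands are copied as they are */
    } else {
        t->error = 1;
    }
}

/* term --> factor { (* | /) factor }: operator emitted after both operands */
static void term(struct translator *t)
{
    factor(t);
    skip_blanks(t);
    while (!t->error && (*t->in == '*' || *t->in == '/')) {
        char op = *t->in++;
        factor(t);
        emit(t, op);
        skip_blanks(t);
    }
}

/* expr --> term { (+ | -) term } */
static void expr(struct translator *t)
{
    term(t);
    skip_blanks(t);
    while (!t->error && (*t->in == '+' || *t->in == '-')) {
        char op = *t->in++;
        term(t);
        emit(t, op);
        skip_blanks(t);
    }
}

/* Writes the postfix form of infix into out; returns 0 on success, -1 on error. */
int to_postfix(const char *infix, char *out, size_t cap)
{
    struct translator t = { infix, out, 0, cap, 0 };
    if (cap == 0)
        return -1;
    expr(&t);
    skip_blanks(&t);
    if (*t.in != '\0')
        t.error = 1;                /* unconsumed input */
    out[t.len] = '\0';
    return t.error ? -1 : 0;
}
```

Calling to_postfix("A+B+C*D", buf, sizeof buf) fills buf with AB+CD*+, the postfix form given in the slides.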
Lexical Analysis
Remove white space (and comments)
while (1) {
    t = getchar();
    if (t == ' ' || t == '\t' || t == '\n')
        ;       /* strip off blanks, tabs, new lines */
    else
        break;  /* t is the first character of the next token */
}
Recognize Numbers (token and its attribute values)
value = 0;
while (isdigit(t)) {
    value = value*10 + t - '0';
    t = getchar();
}
Lexical Analysis
Recognize identifiers, keywords, and reserved words
if (isalpha(t)) {
    int b = 0;
    while (isalnum(t)) {
        lexbuff[b++] = t;
        t = getchar();
    }
    /* other code */
}
Lexical Analysis
A symbol table is needed to distinguish identifiers.
-- Keywords are fixed character strings that identify certain constructs,
e.g. begin
-- Reserved words are keywords that may not be used as
identifiers
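Interface to the lexical analyzer
[Figure: the Scanner reads characters from the Input, may put a character back, and keeps track of the line number. Each time the Parser calls nextToken(), the Scanner passes a token and its attribute value to the Parser.]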
Lexical Analysis
How to distinguish the token "<" from the token "<="
when the scanner reads the character "<"?
The scanner must read ahead one or more
characters
The scanner is often implemented as a procedure
called by the parser, returning a token at a time.
Lexical Analysis
Input buffer
- A block of characters is read into the buffer at a time for I/O
efficiency. A pointer keeps track of how many characters have been
analyzed.
Symbol Table
- Symbol table is a database that contains information about
identifiers (procedure names, variable names, labels, … etc) . It can
be used to communicate among multiple compiling phases.
- Symbol table interface
insert(s, t): returns the index of a new entry for string s, token t.
lookup(s): returns the index of the entry for string s, or 0 if not found
– Handling reserved words
We may initialize the symbol table by inserting all reserved
words
Lexical Analysis
Symbol table implementation
Symbol table is probably the most important data structure in
compiler implementation. A good design will support the following
– Fast access
– Easy to maintain
– Flexible
– Supporting nested scope
Lexical Analysis
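A sample implementation of symbol table
Each entry holds a lexPtr (a pointer into the lexeme array), a token, and attributes:
lexPtr       token   attributes
-> div       div
-> mod       mod
-> val       id
-> rate      id
-> height    id
Lexeme array: div\0 mod\0 val\0 rate\0 height\0 …….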
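The layout above (entries that point into one shared lexeme array, with reserved words inserted first) could be sketched in C roughly as follows. The array sizes, the token codes and the helper names init_symtable and token_of are assumptions for illustration; only insert and lookup come from the interface described in the slides.

```c
#include <string.h>

#define SYMMAX 100      /* maximum number of entries (assumed) */
#define STRMAX 999      /* size of the lexeme array (assumed) */

enum { TOK_DIV = 257, TOK_MOD, TOK_ID };    /* hypothetical token codes */

struct entry {
    char *lexptr;       /* points into lexemes[] */
    int   token;
};

static char lexemes[STRMAX];
static int  lastchar = -1;              /* last used position in lexemes[] */
static struct entry symtable[SYMMAX];
static int  lastentry = 0;              /* entry 0 stays unused: 0 means "not found" */

/* Returns the index of the entry for s, or 0 if s is not in the table. */
int lookup(const char *s)
{
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Adds a new entry for string s with token tok; returns its index,
   or 0 if the table or the lexeme array is full. */
int insert(const char *s, int tok)
{
    int len = (int)strlen(s);

    if (lastentry + 1 >= SYMMAX || lastchar + len + 1 >= STRMAX)
        return 0;
    lastentry++;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar += len + 1;                /* lexeme plus its terminating '\0' */
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}

/* Reserved words are handled by inserting them before scanning starts. */
void init_symtable(void)
{
    insert("div", TOK_DIV);
    insert("mod", TOK_MOD);
}

int token_of(int index)
{
    return symtable[index].token;
}
```

With this arrangement the scanner can call lookup on every name it reads: a nonzero index whose token is not TOK_ID identifies a reserved word, and a zero result means the name must be inserted as a new identifier. The linear search is the simplest choice; a hash table would give the fast access the slides ask for.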
Lexical Analysis
Functions of lexical analysis phase
– Grouping input characters into tokens
– Stripping out comments and white spaces
– Correlating error messages with the source program
Issues (why separating lexical analysis from parsing)
– Simpler design
– Compiler efficiency
– Compiler portability
Lexical Analysis
Tokens, Patterns and Lexemes
– Pattern: A rule that describes a set of strings
– Token: A set of strings with the same pattern
– Lexeme: The sequence of characters of a token
Token   Lexeme   Pattern
if      if       if
id      val      String of letters and digits that starts with a letter
num     123      Sequence of digits
….      ….       ….
Lexical Analysis
Token attributes
x = y + 10
Token          Attribute
ID             Index to symbol table entry for "x"
= (TOK_EQ)
ID             Index to symbol table entry for "y"
+ (TOK_ADD)
NUM            10
Outline of Lectures
Jan 20 - Jan 24
- Practical Issues in Scanner Construction [BH Chap 3,4]
- Finite Automata (Finite-State Machines) [ASU 3.6, 3.7]
  - Deterministic and Non-Deterministic FA
  - How to Implement an automaton
- Regular Expressions [ASU 3.3, 3.4]
  - a useful notation for describing lexemes
- Build Scanner Using [ASU Chap 3]
  - Regular Expressions
  - Finite Automata
  - automatically using flex
Announcements
- Email me ([email protected]) the names of the members of your project team
- There will be a quiz next week on material covered until this week
Overview of Scanner Construction
Task: translate the sequence of characters to a corresponding
sequence of tokens (by grouping characters into lexemes).
Q: How do I know how to group the characters?
“A PL identifier (or name) is a sequence of letters, digits, or
underscores, the first of which must be a letter”
“Two adjacent word symbols, names, or numerals must be
separated by at least one separator”
“PL names and word symbols are case sensitive”
The scanner is called by parser
[Figure: the Parser calls nextToken() on the Scanner; the Scanner returns a token]
- Each time the scanner is called, it should
  - find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and
  - return that token.
Writing a Scanner
1. write it from scratch (ad hoc methods)
2. automatically generate it with a scanner
generator
lex or flex (produce C code), or
jlex (produces Java code).
input to a scanner generator:
one regular expression for each token
output of a scanner generator:
a finite state machine (FSM)
Regular Expressions to Finite Automata
Generating a scanner
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA
DFA: Deterministic Finite Automaton; NFA: Nondeterministic Finite Automaton
Finite Automata
- A FA is similar to a compiler in that:
  - A compiler recognizes legal programs in some (source) language.
  - A finite-state machine recognizes legal strings in some language.
- Example: Pascal Identifiers
  - sequences of one or more letters or digits, starting with a letter:
[Figure: start state S goes to accepting state A on letter; A loops to itself on letter | digit | UNDERSCORE]
Finite-Automata State Graphs
A state
The start state
An accepting state
A transition
a
Finite Automata
Transition: s -a-> r
Is read: in state s on input "a" go to state r
If end of input
If in accepting state => accept
Otherwise => reject
If no transition possible (got stuck) => reject
Language defined by FSM
The language defined by a FSM is the set of strings
accepted by the FSM.
in the language of the FSM shown in page 58 :
x, tmp2, XyZzy, position_27.
not in the language of the FSM shown in page 58 :
123, a?, 13apples, _hello
Example: Integer Literals
FA that accepts integer literals with an optional + or - sign:
[Figure: start state S goes to state A on + or -; both S and A go to accepting state B on digit; B loops to itself on digit]
Formal Definition
A finite automaton is a 5-tuple (Σ, Q, δ, s, F) where:
An input alphabet Σ
A set of states Q
A start state s
A set of accepting states F ⊆ Q
δ is the state transition function: Q x Σ → Q
(i.e., encodes transitions state -input-> state)
Two kinds of Automata
Deterministic (DFA):
No state has more than one outgoing edge with the
same label.
Non-Deterministic (NFA):
States may have more than one outgoing edge with
same label.
Edges may be labeled with ε (epsilon), the empty
string.
The automaton can take an ε (epsilon) transition
without looking at the current input character.
Example of NFA
integer-literal example:
[Figure: start state S goes to state A on +, -, or ε; A goes to accepting state B on digit; B loops to itself on digit]
Non-deterministic automata (NFA)
sometimes simpler than DFA
can be in multiple states at the same time
NFA accepts a string if there exists a sequence of
moves starting in the start state, ending in a final
state, that consumes the entire string.
Example:
the integer-literal NFA on input "+75":
Equivalence of DFA and NFA
Theorem:
For every non-deterministic finite-state machine M,
there exists a deterministic machine M' such that
M and M' accept the same language.
Q: Why is the theorem important for scanner
generation?
Q: Theorem is not enough: what do we need for
automatic scanner generation?
How to Implement a FSM
A table-driven approach:
table:
one row for each state in the machine, and
one column for each possible character.
Table[j][k]
which state to go to from state j on character k,
an empty entry corresponds to the machine getting
stuck.
The table-driven program for a DFA
state = S // S is the start state
repeat {
k = next character from the input
if k == EOF then // end of input
if state is a final state then accept
else reject
state = T[state,k]
if state = empty then reject // got stuck
}
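The pseudocode can be turned into C along the following lines, using the integer-literal automaton (states S, A, B) as the machine. This is a sketch under assumptions of its own: the table has one column per character class rather than per character, the end of the string plays the role of EOF, and the names classify and accepts_integer_literal are invented for the example.

```c
#include <ctype.h>

enum { STUCK = -1, S, A, B };               /* states; STUCK marks an empty table entry */
enum { SIGN, DIGIT, OTHER, NCLASSES };      /* character classes (table columns) */

/* T[state][class]: next state, or STUCK if no transition exists. */
static const int T[3][NCLASSES] = {
    /*          SIGN    DIGIT   OTHER */
    /* S */ {   A,      B,      STUCK },
    /* A */ {   STUCK,  B,      STUCK },
    /* B */ {   STUCK,  B,      STUCK },
};

static int classify(char c)
{
    if (c == '+' || c == '-')
        return SIGN;
    if (isdigit((unsigned char)c))
        return DIGIT;
    return OTHER;
}

/* Returns 1 if input is an integer literal with an optional sign, else 0. */
int accepts_integer_literal(const char *input)
{
    int state = S;                          /* S is the start state */

    for (; *input != '\0'; input++) {
        state = T[state][classify(*input)];
        if (state == STUCK)
            return 0;                       /* got stuck: reject */
    }
    return state == B;                      /* end of input: accept only in a final state */
}
```

Grouping characters into classes keeps the table small; a table with one column for each possible character, as described in the slides, works the same way with a larger second dimension.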
Regular Expressions
Automaton is a good “visual” aid but cumbersome to
specify the patterns
Regular expressions are a suitable specification
- a compact way to define a language that can be
accepted by an automaton.
- can be used to specify input to a scanner generator
define pattern for
- each token
- white-space, comments, etc (these do not correspond to
tokens, but must be recognized and ignored)
Languages: Introductory Definitions
• An Alphabet Σ is a set of symbols (e.g. characters)
• A String is a finite sequence of symbols (sometimes
called a Sentence)
• The symbol ε (epsilon) represents the empty string
• x.y = xy is the concatenation of string x to string y
• x^i (“exponentiation”) is string x concatenated to itself i
times. (e.g. x^3 = “xxx”)
• x^0 = ε (by convention)
Languages
Given an alphabet Σ (a set of characters):
A language over Σ is a set of strings of characters
drawn from Σ
Examples:
• Alphabet: English characters
• Language: English sentences
Not every string of English characters is a valid English sentence
• Alphabet: ASCII
• Language: C programs
Not every string of ASCII characters is a valid C program
Regular Expressions
The regular expressions over Σ are expressions
representing languages over Σ built as follows:
ε = {""} Epsilon denotes the single zero-length string
'c' = {"c"} Single character c
For regular expressions A, B
A|B represents L(A) ∪ L(B)
AB represents L(A)L(B): a string of L(A) followed by a string of L(B)
A* represents L(A)*: zero or more strings of L(A) concatenated
A+ represents L(A)+: one or more strings of L(A) concatenated
Precedence and Parentheses
Parentheses ( ) can be used in RE’s to group operations
Otherwise, the operator precedence is :
* (exponentiation) highest
concatenation
| (alternation) lowest
Other Notations
r? means r | ε (zero or one “r”)
[abc] means: a | b | c
[a-f] means: a | b | c | d | e | f
[a-zA-Z] means: a|b|c....|z|A|B|C...|Z
Also observe:
R | S = S | R (| is commutative)
R(S|T) = RS | RT (concatenation distributes
over |)
Regular Definitions
We can give names to regular expressions and then give
a series of definitions in which each new definition
may contain references to previously defined named
definitions
Example:
letter → [a-zA-Z]
A regular definition in which the symbol letter represents a
regular expression denoting the set of lower and upper case
letters.
digit → [0-9]
The set of decimal digits
Summary
Regular expressions provide a useful notation for
describing the tokens in typical programming
languages
Regular languages are a language specification
Example: Pascal- identifier
- Lexical specification (in English):
  - a letter, followed by zero or more letters or digits.
- Lexical specification (as regular expression):
  - letter . (letter | digit)*
|    means "or"
.    means "followed by"
*    means zero or more instances of
()   are used for grouping
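To try such a specification against sample strings, one option is the POSIX regular-expression library. The sketch below assumes a POSIX system (regex.h is not part of standard C); the helper name matches and the pattern macros are made up for this example, and the patterns use POSIX extended syntax rather than the slide notation.

```c
#include <regex.h>
#include <stddef.h>

/* letter . (letter | digit)* written as a POSIX extended regular expression */
#define IDENTIFIER_RE "^[A-Za-z][A-Za-z0-9]*$"
/* optional sign followed by one or more digits */
#define INTEGER_RE    "^[+-]?[0-9]+$"

/* Returns 1 if the whole string s matches pattern, 0 if it does not,
   and -1 if the pattern cannot be compiled. */
int matches(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

The ^ and $ anchors force the whole string to match, which corresponds to asking whether the string is in the language of the expression.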
Operands of a regular expression
- Operands are same as labels on the edges of an FSM
  - single characters
  - the special character ε (the empty string)
- “letter” is a shorthand for
  - a | b | c | ... | z | A | ... | Z
- “digit” is a shorthand for
  - 0|1|…|9
- sometimes characters are enclosed in quotes
  - for example when denoting | . *
Precedence of | . * operators.
Regular Expression Operator   Analogous Arithmetic Operator   Precedence
|                             plus                            lowest
.                             times                           middle
*                             exponentiation                  highest
- What do the following regular expressions represent?
  - letter.letter | digit*
  - letter.(letter | digit)*
Regular Expressions
- Describe (in English) the language defined by each of
the following regular expressions:
  - letter.(letter | digit*)
  - digit digit* "." digit digit*
Example: Integer Literals
- An integer literal with an optional sign can be defined in English as:
  - “(empty or + or -) followed by one or more digits”
- The corresponding regular expression is:
  - (+|-|ε).(digit.digit*)
- A new convenient operator ‘+’
  - same precedence as ‘*’
  - digit.digit* is the same as digit+ which means "one or more digits"
Language Defined by Regular Expressions
- Recall: language = set of strings
- Language defined by an automaton
  - the set of strings accepted by the automaton
- Language defined by a regular expression
  - the set of strings that “match” the expression.
Regular Exp.     Corresponding Set of Strings
ε                {""}
a                {"a"}
a.b.c            {"abc"}
a|b|c            {"a", "b", "c"}
(a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
The Role of Regular Expressions
- Theorem: Let Σ be an alphabet. Let L(r) denote the set of strings over Σ defined by the regular expression r. Further, let L(M) denote the set of strings over Σ accepted by a NFA M. Then L = L(r) for regular expression r if and only if there is a NFA M that accepts L
- Q: Why is the theorem important for scanner generation?
- Q: Theorem is not enough: what do we need for automatic scanner generation?
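Regular Expressions to NFA (1)
- For each kind of regular expression r, define an NFA M(r) such that L(r) = L(M(r))
• For ε: [Figure: a start state with a single ε edge to an accepting state]
• For input a in Σ: [Figure: a start state with a single edge labeled a to an accepting state]
Regular Expressions to NFA (2)
• For A . B (concatenation): [Figure: M(A) joined to M(B) by an ε edge]
• For A | B (union): [Figure: a new start state with ε edges into M(A) and M(B), and ε edges from each of them to a new accepting state]
Regular Expressions to NFA (3)
• For A* (exponentiation or Kleene star): [Figure: the NFA for A with ε edges added so that it can be skipped or repeated]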
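Example of RegExp -> NFA conversion
- Consider the regular expression (1|0)*1
- The NFA is
[Figure: NFA with states A to J. Labeled edges: C goes to E on 1, D goes to F on 0, I goes to J on 1; all other edges are ε-moves. J is the accepting state.]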
NFA to DFA. The Trick
- Simulate the NFA
- Each state of DFA = a non-empty subset of states of the NFA
- Start state = the set of NFA states reachable through ε-moves from NFA start state
- Add a transition S -a-> S’ to DFA if and only if
  - S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
NFA to DFA. Remark
- An NFA may be in many states at any time
- How many different states?
  - If there are N states, the NFA must be in some subset of those N states
- How many subsets are there?
  - 2^N - 1 = finitely many
NFA -> DFA
States reachable from A on ε transitions:
ABCDHI = {A,B,C,D,H,I}
ε-closure(A) = DFA start state: the set ABCDHI becomes the start state of the DFA to be constructed.
Which NFA states can be reached from ABCDHI on reading a zero (0)?
States reachable from ABCDHI on reading 0: FGABCDHI = {F,G,A,B,C,D,H,I}
Which NFA states can be reached from FGABCDHI on reading a zero (0)?
States reachable from FGABCDHI on reading 0: FGABCDHI = {F,G,A,B,C,D,H,I}
Which NFA states can be reached from ABCDHI on reading a one (1)?
States reachable from ABCDHI on reading 1: EGJABCDHI = {E,G,J,A,B,C,D,H,I} (Final state)
Resulting DFA:
ABCDHI -0-> FGABCDHI         ABCDHI -1-> EGJABCDHI
FGABCDHI -0-> FGABCDHI       FGABCDHI -1-> EGJABCDHI
EGJABCDHI -0-> FGABCDHI      EGJABCDHI -1-> EGJABCDHI
OBS: No new states are possible!
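The same computation can be done mechanically with sets of NFA states stored as bitmasks. The sketch below is a small instance of the subset construction for this one NFA, not a general tool. The ε-edges in the eps table are an assumption: the original figure is not recoverable, so they were chosen to reproduce the closures worked out in the slides. All function and variable names are invented for the example.

```c
#define NSTATES 10                          /* NFA states A..J */
#define MAXDFA  16                          /* enough for this example */
#define BIT(c)  (1u << ((c) - 'A'))         /* bit for the NFA state named c */

typedef unsigned set;                       /* a set of NFA states */

/* eps[i]: states reachable from state i by one ε-move (assumed edges). */
static const set eps[NSTATES] = {
    /* A */ BIT('B') | BIT('H'),
    /* B */ BIT('C') | BIT('D'),
    /* C */ 0,
    /* D */ 0,
    /* E */ BIT('G'),
    /* F */ BIT('G'),
    /* G */ BIT('A') | BIT('H'),
    /* H */ BIT('I'),
    /* I */ 0,
    /* J */ 0,
};

/* step[i][b]: states reachable from state i on input symbol b (0 or 1). */
static const set step[NSTATES][2] = {
    ['C' - 'A'] = { 0, BIT('E') },          /* C --1--> E */
    ['D' - 'A'] = { BIT('F'), 0 },          /* D --0--> F */
    ['I' - 'A'] = { 0, BIT('J') },          /* I --1--> J */
};

/* ε-closure: keep adding ε-successors until nothing changes. */
static set closure(set s)
{
    set old;
    do {
        old = s;
        for (int i = 0; i < NSTATES; i++)
            if (s & (1u << i))
                s |= eps[i];
    } while (s != old);
    return s;
}

/* States reachable from s on symbol b, including ε-moves afterwards. */
static set move(set s, int b)
{
    set r = 0;
    for (int i = 0; i < NSTATES; i++)
        if (s & (1u << i))
            r |= step[i][b];
    return closure(r);
}

set dfa_state[MAXDFA];                      /* NFA subset behind each DFA state */
int dfa_next[MAXDFA][2];                    /* DFA transition table */
int dfa_count;

static int find_or_add(set s)
{
    for (int i = 0; i < dfa_count; i++)
        if (dfa_state[i] == s)
            return i;
    dfa_state[dfa_count] = s;
    return dfa_count++;
}

void build_dfa(void)
{
    dfa_count = 0;
    find_or_add(closure(BIT('A')));         /* DFA start state = ε-closure(A) */
    for (int i = 0; i < dfa_count; i++)     /* dfa_count grows as new subsets appear */
        for (int b = 0; b < 2; b++)
            dfa_next[i][b] = find_or_add(move(dfa_state[i], b));
}

/* Runs the DFA on a string of '0' and '1'; accepts if the subset contains J. */
int dfa_accepts(const char *w)
{
    int s = 0;
    for (; *w != '\0'; w++) {
        if (*w != '0' && *w != '1')
            return 0;
        s = dfa_next[s][*w - '0'];
    }
    return (dfa_state[s] & BIT('J')) != 0;
}
```

After build_dfa() the table holds three DFA states, matching the three subsets found by hand, and dfa_accepts recognizes exactly the strings of (1|0)*1, that is, binary strings ending in 1.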
NFA to DFA: the practice
- NFA -> DFA conversion is at the heart of tools such as flex
- But, DFAs can be huge
- In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
Scanner Construction
- Given a single string, automata and regular expressions return a Boolean answer:
  - a given string is/is not in a language
- In contrast …
  - Given an input (series of strings separated by white space/comments and terminated by EOF), a scanner returns a series of tokens
    - finds the longest lexeme, and
    - returns the corresponding token
Let’s build a scanner for a very simple
language:
The language of assignment statements:
LHS = RHS
LHS = RHS
…
- left-hand side of assignment is a Pascal- identifier:
  - a letter followed by zero or more letters or digits
- right-hand side is one of the following:
  - ID + ID
  - ID * ID
  - ID == ID
Step 1: Define tokens
- Our language has five tokens,
  - they can be defined by five regular expressions:
Token     Regular Expression
ASSIGN    "="
ID        letter (letter | digit)*
PLUS      "+"
TIMES     "*"
EQUALS    "=" "="
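Step 2: Convert REs to NFAs
[Figure: one NFA per token. ASSIGN: a single edge on "=". ID: an edge on letter, then a loop on letter | digit. PLUS: a single edge on "+". TIMES: a single edge on "*". EQUALS: an edge on "=" followed by another edge on "=".]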
Step 4: Combining per-token DFAs
- Goal of a scanner:
  - find the longest prefix of the current input that corresponds to a token.
- This has two consequences:
  - lookahead:
    - Examine if the next input character can “extend” the current token. If yes, keep building a larger token.
  - a real scanner cannot get stuck:
    - What if we get stuck building the larger token? Solution: return characters back to input.
Furthermore …
- In general the input can correspond to a series of tokens (lexemes), not just a single token.
  - Problem: It is no longer correct to run the FSM until it gets stuck or whole string is consumed. So, how to partition the input into lexemes?
  - Solution: a token must be returned when a regular expression is matched.
- Some lexemes (like whitespace and comments) do not correspond to tokens.
  - Problem: how to “discard” these lexemes?
  - Solution: after finding such a lexeme, the scanner simply starts again and tries to match another regular expression.
Extend the DFA
- modify the DFA so that an edge can have an associated action to
  - "put back one character" or
  - "return token XXX",
  such a DFA is called a transducer
- we must combine the DFAs for all of the tokens into a single DFA
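Step 4: Example of extending the DFA
- The DFA that recognizes Pascal- identifiers must be modified as follows:
[Figure: start state S goes to an identifier state on letter; that state loops to itself on letter | digit; on any char except letter or digit it goes to a final state with the action: put back 1 char, return ID]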
Implementing the extended DFA
- The table-driven technique works, with a few small modifications:
  - Include a column for end-of-file
  - besides ‘next state’, a table entry includes an (optional) action: put back n characters, return token
  - Instead of repeating "read a character; update the state variable" until the machine gets stuck or the entire input is read, repeat "read a character; update the state variable; perform the action"
    (eventually, the action will be to return a value, so the scanner code will stop).
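Step 4) Example: Combined DFA for our
language
[Figure: combined DFA with start state S.
S goes to F3 on “+” (return PLUS), to F4 on “*” (return TIMES), to ID on letter, and to TMP on “=”.
ID loops to itself on letter | digit; on any char except letter or digit it goes to F2 (put back 1 char; return ID).
TMP goes to F5 on “=” (return EQUALS); on any char except “=” it goes to F1 (put back 1 char; return ASSIGN).]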
Transition Table (part 1)
State | +                                  | *                                  | =
S     | F3, return PLUS                    | F4, return TIMES                   | TMP
ID    | F2, put back 1 char; return ID     | F2, put back 1 char; return ID     | F2, put back 1 char; return ID
TMP   | F1, put back 1 char; return ASSIGN | F1, put back 1 char; return ASSIGN | F5, return EQUALS
Transition Table (part 2)
State | letter                             | digit                              | EOF
S     | ID                                 | (empty)                            | (empty)
ID    | ID                                 | ID                                 | F2, put back 1 char; return ID
TMP   | F1, put back 1 char; return ASSIGN | F1, put back 1 char; return ASSIGN | F1, put back 1 char; return ASSIGN
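For comparison with the table, the same machine can be hand-coded with the states expressed as branches. The sketch below scans a string instead of a file and peeks at the next character instead of reading it and putting it back, which has the same effect. It also skips white space and reports an error token, two things the table above does not do. The token codes, struct scanner and next_token are assumptions made for this example.

```c
#include <ctype.h>

/* Hypothetical token codes; TOK_EOF is 0 so that a caller can loop until 0. */
enum { TOK_EOF = 0, ASSIGN = 256, EQUALS, ID, TIMES, PLUS, TOK_ERROR };

struct scanner {
    const char *p;                          /* next unread character */
};

/* Returns the next token; *lexeme and *len describe its characters. */
int next_token(struct scanner *sc, const char **lexeme, int *len)
{
    int tok;
    char c;

    while (*sc->p == ' ' || *sc->p == '\t' || *sc->p == '\n')
        sc->p++;                            /* separators produce no token */

    *lexeme = sc->p;
    c = *sc->p;
    if (c == '\0') {
        *len = 0;
        return TOK_EOF;
    }
    sc->p++;                                /* consume the first character (state S) */

    if (c == '+') {
        tok = PLUS;                         /* F3 */
    } else if (c == '*') {
        tok = TIMES;                        /* F4 */
    } else if (c == '=') {                  /* state TMP */
        if (*sc->p == '=') {
            sc->p++;
            tok = EQUALS;                   /* F5: longest match wins */
        } else {
            tok = ASSIGN;                   /* F1: lookahead character stays unread */
        }
    } else if (isalpha((unsigned char)c)) { /* state ID */
        while (isalnum((unsigned char)*sc->p))
            sc->p++;
        tok = ID;                           /* F2: lookahead character stays unread */
    } else {
        tok = TOK_ERROR;                    /* character that starts no token */
    }
    *len = (int)(sc->p - *lexeme);
    return tok;
}
```

Scanning the text "a = b + c * d" followed by "x == y" with repeated calls yields ID, ASSIGN, ID, PLUS, ID, TIMES, ID, ID, EQUALS, ID and finally TOK_EOF.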
TEST YOURSELF #1
Augment the "combined" finite-state machine to:
- Ignore white-spaces between tokens
  - white-spaces are spaces, tabs and newlines
- Give an error message if
  - a character other than +, *, =, letter, or digit occurs in the input, or
  - a digit is seen as the first character in the current input
  (in both cases, ignore the bad character).
- Return an EOF token when there are no more tokens in the input.
Implementation Remarks
NFA -> DFA conversion is at the heart of tools such as
lex and flex
• But, DFAs can be huge
• There are algorithms for reducing DFA’s but we will
not look at this issue
Lex and Flex: Scanner Generators
[Figure: the Lex source program lex.l is fed to the Lex compiler, which produces lex.yy.c. The C compiler turns lex.yy.c into the executable myScanner. myScanner reads a source program file and produces a token stream.]
Flex Program Structure
%{
.....C header stuff, copied into lex.yy.c.....
%}
.....Regular Definitions....
%%
.....Translation Rules for all tokens.....
%%
.....Auxiliary C Procedures, also copied
into lex.yy.c
FLex Program Example (1)
%{
/* File : simple.l A flex program that copies files */
%}
%%
%%
Compiling and Executing
flex simple.l (produces lex.yy.c)
gcc -o simple lex.yy.c -lfl (compile)
./simple < test.in > test.out
FLex Program Example (2)
%{
/* a Lex program that adds line numbers
to lines of stdin, printing to stdout */
#include <stdio.h>
int lineno = 1;
%}
line .*\n
%%
{line} { printf("%5d %s",lineno++,yytext); }
%%
int main(void)
{ yylex(); return 0; }
The Scanner Generator Flex
Whenever a pattern is matched with some text in the input,
corresponding action (if any) is executed
If the action does not “return”, Flex resumes scanning for the next
token from input
If more than one pattern matches a prefix of the remaining input,
Flex chooses the longest matching string
If more than one pattern matches the same input, Flex chooses the
“first” such pattern defined.
Unrecognized input is copied to the stdout
Flex Program Example(3)
%{/* Specification of a scanner for sample grammar
given in slide #102. */
#define ASSIGN 256
#define EQUAL 257
#define ID 258
#define MULT 259
#define PLUS 260
%}
/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [a-zA-Z]
digit   [0-9]
id      ({letter})({letter}|{digit})*
Flex Program Example(3)
%%
{ws}    {/* no action and no return */}
"="     { return(ASSIGN);}
"=="    { return(EQUAL);}
{id}    { return(ID);}
"*"     { return(MULT);}
"+"     { return(PLUS);}
%%
Main Program to Call the Scanner
/* A short program to test our scanner */
#include <stdio.h>
extern char *yytext;
int yylex(void);
int main() {
int token;
do {
token = yylex() ;
printf(" Token %d Value %s\n", token, yytext);
} while (token !=0);
return 0;
}
Sample Input File
a = b + c * d
x == y
Output From Scanner
Token 258 Value a
Token 256 Value =
Token 258 Value b
Token 260 Value +
Token 258 Value c
Token 259 Value *
Token 258 Value d
Token 258 Value x
Token 257 Value ==
Token 258 Value y
Token 0 Value