
Lexical Analysis and Scanning
Compiler Construction
Lecture 2
Spring 2001
Robert Dewar
The Input
Read string input
Might be sequence of characters (Unix)
Might be sequence of lines (VMS)
Character set
ISO Latin-1
ISO 10646 (16-bit = unicode)
Others (EBCDIC, JIS, etc)
The Output
A series of tokens
Punctuation ( ) ; , [ ]
Operators + - ** :=
Keywords begin end if
String literals "hello this is a string"
Character literals 'x'
Numeric literals 123 4_5.23e+2 16#ac#
Free form vs Fixed form
Free form languages
White space does not matter
Tabs, spaces, new lines, carriage returns
Only the ordering of tokens is important
Fixed format languages
Layout is critical
Fortran, label in cols 1-6
COBOL, area A B
Lexical analyzer must worry about layout
Punctuation
Typically individual special characters
Such as +
Lexical analyzer does not know : from :
Sometimes double characters
E.g. (* treated as a kind of bracket
Returned just as identity of token
And perhaps location
For error message and debugging purposes
Operators
Like punctuation
No real difference for lexical analyzer
Typically single or double special chars
Operators +
Operations :=
Returned just as identity of token
And perhaps location
Keywords
Reserved identifiers
E.g. BEGIN END in Pascal, if in C
Maybe distinguished from identifiers
E.g. bold mode vs plain mode in Algol-68
Returned just as token identity
With possible location information
Unreserved keywords (e.g. PL/1)
Handled as identifiers (parser distinguishes)
Identifiers
Rules differ
Length, allowed characters, separators
Need to build table
So that each occurrence of junk1 is recognized as the same junk1
Typical structure: hash table
Lexical analyzer returns token type
And key to table entry
Table entry includes location information
More on Identifier Tables
Most common structure is hash table
With fixed number of headers
Chain according to hash code
Serial search on one chain
Hash code computed from characters
No hash code is perfect!
Avoid any arbitrary limits
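The identifier table described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names NUM_HEADERS, id_entry, id_hash and id_lookup are made up for this sketch, the table size 211 is arbitrary, and the hash function is a simple multiplicative one chosen for brevity. Entries are heap-allocated, so there is no fixed limit on identifier length or count.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_HEADERS 211          /* fixed number of chain headers (assumed size) */

/* One table entry; a pointer to it can serve as the key handed to the parser. */
struct id_entry {
    struct id_entry *next;       /* next entry on the same hash chain */
    size_t length;               /* identifier length, no arbitrary limit */
    char *name;                  /* heap copy of the identifier text */
};

static struct id_entry *headers[NUM_HEADERS];

/* Hash code computed from the characters (simple multiplicative hash). */
static unsigned id_hash(const char *s, size_t n)
{
    unsigned h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 31u + (unsigned char)s[i];
    return h % NUM_HEADERS;
}

/* Return the entry for s[0..n-1], creating it the first time it is seen.
   Returns NULL only if memory runs out. */
struct id_entry *id_lookup(const char *s, size_t n)
{
    unsigned h = id_hash(s, n);

    /* Serial search on one chain. */
    for (struct id_entry *e = headers[h]; e != NULL; e = e->next)
        if (e->length == n && memcmp(e->name, s, n) == 0)
            return e;

    struct id_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->name = malloc(n + 1);
    if (e->name == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->name, s, n);
    e->name[n] = '\0';
    e->length = n;
    e->next = headers[h];        /* link at the front of the chain */
    headers[h] = e;
    return e;
}
```

Two calls such as id_lookup("junk1", 5) return the same entry, which is what lets the parser treat every occurrence of junk1 as one identifier.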
String Literals
Text must be stored
Actual characters are important
Not like identifiers
Character set issues
Table needed
Lexical analyzer returns key to table
May or may not be worth hashing
Character Literals
Similar issues to string literals
Lexical Analyzer returns
Token type
Identity of character
Note, cannot assume character set of
host machine, may be different
Numeric Literals
Also need a table
Typically record value
E.g. 123 = 0123 = 01_23 (Ada)
But cannot use host int for values
Because target may have different characteristics
Float stuff much more complex
Denormals, correct rounding
Very delicate stuff
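As a rough sketch of why 123, 0123 and 01_23 share one recorded value, the function below converts an Ada-style decimal integer literal while skipping underscores. The name literal_value is hypothetical, and the 64-bit accumulator is a simplifying assumption: it only reports overflow, whereas a real compiler would need a representation independent of the host integer type. Underscore placement is not checked here; that is assumed to be the scanner's job.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert an Ada-style decimal integer literal to its value, skipping
   underscores. Returns false on a non-digit character, on an empty
   literal, or if the value does not fit in 64 bits. */
bool literal_value(const char *text, uint64_t *value)
{
    uint64_t v = 0;
    bool seen_digit = false;

    for (const char *p = text; *p != '\0'; p++) {
        if (*p == '_')
            continue;                       /* separators carry no value */
        if (*p < '0' || *p > '9')
            return false;
        uint64_t d = (uint64_t)(*p - '0');
        if (v > (UINT64_MAX - d) / 10)
            return false;                   /* v * 10 + d would overflow */
        v = v * 10 + d;
        seen_digit = true;
    }
    if (!seen_digit)
        return false;
    *value = v;
    return true;
}
```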
Handling Comments
Comments have no effect on program
Can therefore be eliminated by scanner
But may need to be retrieved by tools
Error detection issues
E.g. unclosed comments
Scanner does not return comments
Case Equivalence
Some languages have case equivalence
Pascal, Ada
Some do not
C, Java
Lexical analyzer ignores case if needed
This_Routine = THIS_RouTine
Error analysis may need exact casing
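One possible way to ignore case is to compare identifiers character by character after folding, while leaving the original spelling untouched for error messages. The function name same_identifier is an assumption of this sketch, and tolower only handles the letters of the current C locale, so wider character sets would need more care.

```c
#include <ctype.h>
#include <stddef.h>

/* Compare two identifiers of length n, ignoring letter case.
   Returns 1 if they are equivalent, 0 otherwise. */
int same_identifier(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}
```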
Issues to Address
Lexical analysis can take a lot of time
Minimize processing per character
I/O is also an issue (read large blocks)
We compile frequently
Compilation time is important
Especially during development
General Approach
Define set of token codes
An enumeration type
A series of integer definitions
These are just codes (no semantics)
Some codes associated with data
E.g. key for identifier table
May be useful to build tree node
For identifiers, literals etc
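A set of token codes with optional associated data might look like the following in C. The particular codes, the struct layout and the helper token_has_data are all illustrative assumptions, not the definitions used by any particular compiler.

```c
#include <stddef.h>

/* Token codes: plain codes with no semantics attached. */
enum token_code {
    TOK_IDENTIFIER,
    TOK_INTEGER_LITERAL,
    TOK_STRING_LITERAL,
    TOK_BEGIN,
    TOK_END,
    TOK_IF,
    TOK_LEFT_PAREN,
    TOK_RIGHT_PAREN,
    TOK_SEMICOLON,
    TOK_ASSIGN,
    TOK_EOF
};

/* What the lexical analyzer hands to the parser for each token. */
struct token {
    enum token_code code;
    size_t location;     /* position in the source, for error messages */
    size_t table_key;    /* meaningful only when token_has_data(code) */
};

/* Only identifiers and literals carry a key into a table. */
int token_has_data(enum token_code code)
{
    return code == TOK_IDENTIFIER
        || code == TOK_INTEGER_LITERAL
        || code == TOK_STRING_LITERAL;
}
```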
Interface to Lexical Analyzer
Convert entire file to a file of tokens
Lexical analyzer is separate phase
Parser calls lexical analyzer
Get next token
This approach avoids extra I/O
Parser builds tree as we go along
Implementation of Scanner
Given the input text
Generate the required tokens
Or provide token by token on demand
Before we describe implementations
We take this short break
To describe relevant formalisms
Relevant Formalisms
Type 3 (Regular) Grammars
Regular Expressions
Finite State Machines
Regular Grammars
Regular grammars
Non-terminals (arbitrary names)
Terminals (characters)
Two forms of rules
Non-terminal ::= terminal
Non-terminal ::= terminal Non-terminal
One non-terminal is the start symbol
Regular (type 3) grammars cannot count
No concept of matching nested parens
Regular Grammars
Regular grammars
E.g. grammar of reals with no exponent
REAL ::= 0 REAL1
(repeat for 1 .. 9)
REAL1 ::= 0 REAL1
(repeat for 1 .. 9)
REAL1 ::= . INTEGER
INTEGER ::= 0 INTEGER
(repeat for 1 .. 9)
INTEGER ::= 0
(repeat for 1 .. 9)
Start symbol is REAL
Regular Expressions
Regular expressions (RE) defined by
Any terminal character is an RE
Alternation RE | RE
Concatenation RE1 RE2
Repetition RE* (zero or more RE’s)
Language of RE’s = type 3 grammars
Regular expressions are more convenient
Specifying RE’s in Unix Tools
Single characters a b c d \x
Alternation [bcd] [b-z] ab|cd
Match any character .
Match sequence of characters x* y+
Concatenation abc[d-q]
Optional [0-9]+(\.[0-9]*)?
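The notation above can be tried out with the POSIX regex functions regcomp and regexec from <regex.h>. The helper name matches is hypothetical, and the ^ and $ anchors (standard in POSIX extended regular expressions, though not listed on the slide) are added so that the whole string must match. Note that the decimal point has to be written \. in the pattern, because a bare . matches any character.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX extended regular expression
   pattern, 0 if it does not or if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With the escaped point, matches("^[0-9]+(\\.[0-9]*)?$", "12x5") is false; with an unescaped point the same string is accepted, which is usually not what is wanted.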
Finite State Machines
Languages and Automata
A language is a set of strings
An automaton is a machine
That determines if a given string is in the
language or not.
FSM’s are automata that recognize
regular languages (regular expressions)
Definitions of FSM
A set of labeled states
Directed arcs labeled with character
A state may be marked as terminal
Transition from state S1 to S2
If and only if arc from S1 to S2
Labeled with next character (which is eaten)
Recognized if ends up in terminal state
One state is distinguished start state
Building FSM from Grammar
One state for each non-terminal
A rule of the form
Nont1 ::= terminal
Generates transition from S1 to final state
A rule of the form
Nont1 ::= terminal Nont2
Generates transition from S1 to S2
Building FSM’s from RE’s
Every RE corresponds to a grammar
For all regular expressions
A natural translation to FSM exists
We will not give details of algorithm here
Non-Deterministic FSM
A non-deterministic FSM
Has at least one state
With two arcs to two separate states
Labeled with the same character
Which way to go?
Implementation requires backtracking
Nasty 
Deterministic FSM
For all states S
For all characters C
There is either ONE arc or NO arc
From state S
Labeled with character C
Much easier to implement
No backtracking 
Dealing with ND FSM
Construction naturally leads to ND FSM
For example, consider FSM for
[0-9]+ | [0-9]+\.[0-9]+
(integer or real)
We will naturally get a start state
With two sets of 0-9 branches
And thus non-deterministic
Converting to Deterministic
There is an algorithm for converting
From any ND FSM
To an equivalent deterministic FSM
Algorithm is in the text book
Example (given in terms of RE’s)
[0-9]+ | [0-9]+\.[0-9]+
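For this example the deterministic machine is small enough to work out by hand: both alternatives begin with [0-9]+, so the two sets of digit branches merge into one state, and the decision between integer and real is postponed until a point is seen. The sketch below encodes that hand-derived machine. The state and function names are assumptions, and it was not produced by running the textbook algorithm.

```c
#include <ctype.h>

enum kind { NOT_NUMBER, INTEGER, REAL };

/* Deterministic FSM for [0-9]+ | [0-9]+\.[0-9]+ over a whole string. */
enum kind classify_number(const char *s)
{
    enum { START, IN_INT, AFTER_DOT, IN_FRAC } state = START;

    for (; *s != '\0'; s++) {
        int digit = isdigit((unsigned char)*s);
        switch (state) {
        case START:                 /* need a first digit */
            if (!digit)
                return NOT_NUMBER;
            state = IN_INT;
            break;
        case IN_INT:                /* terminal state: integer so far */
            if (digit)
                break;
            if (*s != '.')
                return NOT_NUMBER;
            state = AFTER_DOT;
            break;
        case AFTER_DOT:             /* need at least one fraction digit */
            if (!digit)
                return NOT_NUMBER;
            state = IN_FRAC;
            break;
        case IN_FRAC:               /* terminal state: real so far */
            if (!digit)
                return NOT_NUMBER;
            break;
        }
    }
    if (state == IN_INT)
        return INTEGER;
    if (state == IN_FRAC)
        return REAL;
    return NOT_NUMBER;              /* ended in a non-terminal state */
}
```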
Implementing the Scanner
Three methods
Completely informal, just write code
Define tokens using regular expressions
Convert RE’s to ND finite state machine
Convert ND FSM to deterministic FSM
Program the FSM
Use an automated program
To achieve above three steps
Ad Hoc Code (forget FSM’s)
Write normal hand code
A procedure called Scan
Normal coding techniques
Basically scan over white space and comments
till non-blank character found.
Base subsequent processing on character
E.g. colon may be : or :=
/ may be operator or start of comment
Return token found
Write aggressive efficient code
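A minimal hand-written Scan in C might look like the sketch below. The token set and all names are assumptions made for illustration, comments are taken to be Ada-style (-- to end of line) rather than the / case mentioned above, and a production scanner would handle many more characters and record locations.

```c
#include <ctype.h>

enum tok { T_EOF, T_COLON, T_ASSIGN, T_PLUS, T_MINUS, T_IDENT, T_INT, T_ERROR };

/* Scan one token starting at *pp and advance *pp past it. */
enum tok scan(const char **pp)
{
    const char *p = *pp;
    enum tok t;

    /* Skip white space and Ada-style "--" comments. */
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
            p++;
        if (p[0] == '-' && p[1] == '-') {
            while (*p != '\0' && *p != '\n')
                p++;
        } else {
            break;
        }
    }

    /* Base the rest of the processing on the first character. */
    switch (*p) {
    case '\0':
        t = T_EOF;
        break;
    case ':':                       /* colon may be : or := */
        if (p[1] == '=') { t = T_ASSIGN; p += 2; }
        else             { t = T_COLON;  p += 1; }
        break;
    case '+':
        t = T_PLUS;
        p++;
        break;
    case '-':                       /* a lone - is an operator */
        t = T_MINUS;
        p++;
        break;
    default:
        if (isalpha((unsigned char)*p)) {
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            t = T_IDENT;
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p))
                p++;
            t = T_INT;
        } else {
            t = T_ERROR;            /* unrecognized character */
            p++;
        }
        break;
    }
    *pp = p;
    return t;
}
```

A caller keeps a cursor into the source text and calls scan(&cursor) repeatedly until T_EOF is returned.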
Using FSM Formalisms
Start with regular grammar or RE
Typically found in the language standard
For example, for Ada:
Chapter 2. Lexical Elements
Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
decimal-literal ::= integer [.integer][exponent]
integer ::= digit {[underline] digit}
exponent ::= E [+] integer | E - integer
Using FSM formalisms, cont
Given RE’s or grammar
Convert to finite state machine
Convert ND FSM to deterministic FSM
Write a program to recognize
Using the deterministic FSM
Implementing FSM (Method 1)
Each state is code of the form:
<<state1>>
case Next_Character is
   when 'a' => goto state3;
   when 'b' => goto state1;
   when others => End_token_processing;
end case;
Implementing FSM (Method 2)
There is a variable called State
loop
   case State is
      when state1 =>
         case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_token_processing;
         end case;
      when state2 …
   end case;
end loop;
Implementing FSM (Method 3)
T : array (State, Character) of State;
while More_Input loop
Curstate := T (Curstate, Next_Char);
if Curstate = Error_State then …
end loop;
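The table-driven method can be sketched in C for the Ada rule integer ::= digit {[underline] digit}. The state names, the function is_ada_integer, and the use of character classes instead of a full per-character column are assumptions of this sketch; indexing by class just keeps the table small.

```c
enum state { S_START, S_DIGITS, S_UNDERLINE, S_ERROR, NUM_STATES };
enum cclass { C_DIGIT, C_UNDERLINE, C_OTHER, NUM_CLASSES };

/* Transition table: T[current state][character class] = next state. */
static const enum state T[NUM_STATES][NUM_CLASSES] = {
    /*                 digit     underline    other   */
    /* START     */ { S_DIGITS, S_ERROR,     S_ERROR },
    /* DIGITS    */ { S_DIGITS, S_UNDERLINE, S_ERROR },
    /* UNDERLINE */ { S_DIGITS, S_ERROR,     S_ERROR },
    /* ERROR     */ { S_ERROR,  S_ERROR,     S_ERROR },
};

static enum cclass classify(char c)
{
    if (c >= '0' && c <= '9')
        return C_DIGIT;
    if (c == '_')
        return C_UNDERLINE;
    return C_OTHER;
}

/* Return 1 if s is a well-formed integer such as 01_23, else 0. */
int is_ada_integer(const char *s)
{
    enum state cur = S_START;
    for (; *s != '\0'; s++) {
        cur = T[cur][classify(*s)];
        if (cur == S_ERROR)
            return 0;
    }
    return cur == S_DIGITS;         /* DIGITS is the only terminal state */
}
```

The loop does one table lookup per character, which is the appeal of this method; changing the token language means changing only the table.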
Automatic FSM Generation
Our example, FLEX
See home page for manual in HTML
FLEX is given
A set of regular expressions
Actions associated with each RE
It builds a scanner
Which matches RE’s and executes actions
Flex General Format
Input to Flex is a set of rules:
Regexp   actions (C statements)
Regexp   actions (C statements)
Flex scans the longest matching Regexp
And executes the corresponding actions
An Example of a Flex scanner
DIGIT [0-9]
%%
{DIGIT}+ {
    printf ("an integer %s (%d)\n",
            yytext, atoi (yytext));
}
{DIGIT}+"."{DIGIT}* {
    printf ("a float %s (%g)\n",
            yytext, atof (yytext));
}
if|then|begin|end|procedure|function {
    printf ("a keyword: %s\n", yytext);
}
Flex Example (continued)
[a-z][a-z0-9]*  printf ("an identifier %s\n", yytext);  /* identifier pattern assumed */
"+"|"-"|"*"|"/" {
    printf ("an operator %s\n", yytext); }
"--".*\n        /* eat Ada style comment */
[ \t\n]+        /* eat white space */
.               printf ("unrecognized character");
Assembling the flex program
%{
#include <stdlib.h> /* for atoi and atof */
%}
<<flex text we gave goes here>>
%%
int main (int argc, char **argv)
{
    if (argc > 1)
        yyin = fopen (argv[1], "r");
    yylex ();
    return 0;
}
Running flex
flex is a program that is executed
The input is as we have given
The output is a C program (the scanner), ready to compile and run
For Ada fans
Look at aflex (www.adapower.com)
For C++ fans
flex can run in C++ mode
Generates appropriate classes
Choice Between Methods?
Hand written scanners
Typically much faster execution
And pretty easy to write
And easier for good error recovery
Flex approach
Simple to Use
Easy to modify token language
The GNAT Scanner
Hand written (scn.adb/scn.ads)
Basically a call does
Super quick scan past blanks/comments etc
Big case statement
Process based on first character
Call special routines
Namet.Get_Name for identifier (hashing)
Keywords recognized by special hash
Strings (stringt.ads)
Integers (uintp.ads)
Reals (ureal.ads)
More on the GNAT Scanner
Entire source read into memory
Single contiguous block
Source location is index into this block
Different index range for each source file
See sinput.adb/ads for source mgmt
See scans.ads for definitions of tokens
More on GNAT Scanner
Read scn.adb code
Very easy reading, e.g.
DTL (Dewar Trivial Language)
DTL Grammar
Program ::= DECLARE Decls BEGIN Stmts
Decls ::= {Decl}*
Stmts ::= {Stmt}+
Identifier ::= letter (_{digit}+)*
Decl ::= DECLARE identifier : Type
DTL (Continued)
Integer_Literal ::= {digit}+
Real_Literal ::= {digit}+"."{digit}*
Stmt ::= Assignstmt | Ifstmt | Whilestmt
Assignstmt ::= Identifier := Expr
Expr ::= Literal | (Expr) Op (Expr)
Op ::= + | -
Literal ::= Integer_Literal | Real_Literal
Ifstmt ::= IF Expr Relop Expr THEN Stmts
Whilestmt ::= WHILE Expr Relop Expr DO Stmts
Relop ::= > | < | >= | <=
DTL Example
A_123 := 23
B := 2.4
WHILE A_123 > (2) + (1)
DO A_123 := A_123 - 1
Write a flex or aflex program
Recognize tokens of DTL program
Print out tokens in style of flex example
Extra credit
Build hash table for identifiers
Output hash table key
Some languages allow preprocessing
This is a separate step
Can either be done as separate phase
Input is source
Output is expanded source
Or embedded into the lexical analyzer
Often done as separate phase
Need to keep track of source locations
Nasty Glitches
Separation of tokens
Not all languages have clear rules
FORTRAN has optional spaces
DO10I=1.6
identifier operator literal
DO10I=1,6
Keyword stmt loopvar operator literal punc literal
Modern languages avoid this kind of thing!
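The classic FORTRAN case is DO10I=1.6 (an assignment to the identifier DO10I) versus DO10I=1,6 (a DO loop). One way a scanner can tell them apart is to look ahead for a comma outside parentheses after the equals sign, as in the sketch below. The function name is_do_loop is hypothetical, and the check is deliberately simplified: it ignores character constants and other corner cases a real FORTRAN scanner must handle.

```c
#include <string.h>

/* Return 1 if stmt looks like a DO loop header, 0 if it looks like an
   assignment. The test: after "DO", is there an '=' followed by a comma
   that is not nested inside parentheses? */
int is_do_loop(const char *stmt)
{
    if (strncmp(stmt, "DO", 2) != 0)
        return 0;

    const char *eq = strchr(stmt, '=');
    if (eq == NULL)
        return 0;

    int depth = 0;                  /* current parenthesis nesting */
    for (const char *p = eq + 1; *p != '\0'; p++) {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        else if (*p == ',' && depth == 0)
            return 1;               /* loop bounds separator found */
    }
    return 0;
}
```

The scanner has to read to the end of the statement before it can classify the first token, which is exactly the kind of unbounded lookahead that free-form languages with clear token separation avoid.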