Lexical Analysis
The Input

Read string input



Might be sequence of characters (Unix)
Might be sequence of lines (VMS)
Character set:




ASCII
ISO Latin-1
ISO 10646 (Unicode, 16-bit in its original form): Ada, Java
Others (EBCDIC, JIS, etc)
The Output

A series of tokens: kind, location, name (if any)







Punctuation
( ) ; , [ ]
Operators
+ - ** :=
Keywords
begin end if while try catch
Identifiers
Square_Root
String literals
"press Enter to continue"
Character literals
'x'
Numeric literals
 Integer: 123
 Floating point: 4_5.23e+2
 Based representation: 16#ac#
Free form vs Fixed form

Free form languages (most modern ones)

White space does not matter. Ignore these:



Tabs, spaces, new lines, carriage returns
Only the ordering of tokens is important
Fixed format languages (historical)

Layout is critical



Fortran: statement label in columns 1-5, continuation mark in column 6
COBOL: Area A and Area B
Lexical analyzer must know about layout to find tokens
Punctuation: Separators

Typically individual special characters such as ( { } : .. (two dots)

Sometimes double characters: lexical scanner looks for longest token:

(*, /*, -- (comment openers in various languages)
Returned just as identity (kind) of token

And perhaps location for error messages and debugging purposes
Operators

Like punctuation


No real difference for lexical analyzer
Typically single or double special chars



Operators + - == <=
Operations := =>
Returned as kind of token

And perhaps location
Keywords

Reserved identifiers

E.g. BEGIN END in Pascal, if in C, catch in C++
May be distinguished from identifiers

E.g. the keyword mode (set in a distinct typeface or stropped) vs the identifier mode in Algol-68
Returned as kind of token

With possible location information
Oddity: unreserved keywords in PL/1


IF IF THEN THEN = THEN + 1;
Handled as identifiers (parser disambiguates)
Identifiers

Rules differ


Length, allowed characters, separators
Need to build a names table

Single entry for all occurrences of Var1



Language may be case insensitive: same entry for VAR1, vAr1, Var1
Typical structure: hash table
Lexical analyzer returns token kind


And key (index) to table entry
Table entry includes location information
Organization of names table

Most common structure is hash table






With fixed number of headers
Chain according to hash code
Serial search on one chain
Hash code computed from characters (e.g. sum mod table size).
No hash code is perfect! Expect collisions.
Avoid any arbitrary limits on table or chain size.
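The organization above can be sketched as follows. This is a minimal illustration, not the layout of any particular compiler: the class name NamesTable, its parameters, and the choice to return a small integer as the key are assumptions made for the example. The hash is the "sum mod table size" scheme from the slide, and each header owns a chain that is searched serially and may grow without limit.

```python
class NamesTable:
    """Chained hash table mapping identifier spellings to integer keys."""

    def __init__(self, num_headers=64, case_insensitive=False):
        self.headers = [[] for _ in range(num_headers)]  # fixed number of headers
        self.entries = []            # key -> spelling as first written
        self.case_insensitive = case_insensitive

    def _hash(self, text):
        # Sum of character codes modulo the table size.
        return sum(ord(ch) for ch in text) % len(self.headers)

    def lookup(self, spelling):
        """Return the key for spelling, creating an entry on first sight."""
        canon = spelling.lower() if self.case_insensitive else spelling
        chain = self.headers[self._hash(canon)]
        for name, key in chain:      # serial search on one chain
            if name == canon:
                return key
        key = len(self.entries)
        self.entries.append(spelling)   # keep original casing for diagnostics
        chain.append((canon, key))      # no arbitrary limit on chain length
        return key


table = NamesTable(num_headers=4, case_insensitive=True)
key = table.lookup("Var1")
print(key, table.lookup("VAR1"), table.lookup("vAr1"), table.entries[key])
```

With a sum-based hash, "ab" and "ba" always collide, which is why the chain search compares full spellings rather than trusting the hash code.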
String Literals


Text must be stored
Actual characters are important





Not like identifiers: must preserve casing
Character set issues: uniform internal representation
Table needed
Lexical analyzer returns key into table
May or may not be worth hashing to avoid duplicates
Character Literals


Similar issues to string literals
Lexical Analyzer returns



Token kind
Identity of character
Cannot assume the character set of the host machine; the target's may be different
Numeric Literals

Need a table to store numeric value

E.g. 123 = 0123 = 01_23 (Ada)
But cannot use predefined type for values

Because host and target may have different bounds
Floating point representations much more complex



Denormals, correct rounding
Very delicate to compute correct value.
Host / target issues
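As a sketch of the integer case only, the function below computes the value of an Ada-style literal, ignoring underscores and handling the based form 16#ac#. The name integer_literal_value is hypothetical, and the underscore check is deliberately lenient (it does not verify that underscores sit between digits). Python integers are unbounded, which stands in here for the requirement that the stored value must not depend on the host's predefined integer type. Floating point literals are not attempted.

```python
def integer_literal_value(text):
    """Value of an Ada-style integer literal such as 01_23 or 16#ac#."""
    if "#" in text:
        base_part, digits, rest = text.split("#")
        if rest != "":
            raise ValueError("text after closing '#': " + text)
        base = int(base_part.replace("_", ""))
        if not 2 <= base <= 16:
            raise ValueError("base out of range: %d" % base)
    else:
        base, digits = 10, text
    digits = digits.replace("_", "").lower()
    if not digits:
        raise ValueError("no digits in literal: " + text)
    value = 0
    for ch in digits:
        digit = "0123456789abcdef".find(ch)
        if digit < 0 or digit >= base:
            raise ValueError("bad digit %r for base %d" % (ch, base))
        value = value * base + digit   # unbounded arithmetic, no overflow
    return value


for literal in ["123", "0123", "01_23", "16#ac#"]:
    print(literal, integer_literal_value(literal))
```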
Handling Comments




Comments have no effect on program
Can be eliminated by scanner
But may need to be retrieved by tools
Error detection issues


E.g. unclosed comments
Scanner skips over comments and returns next meaningful token
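A minimal sketch of the skipping step, including detection of an unclosed comment. The function name skip_blanks_and_comments is hypothetical, and for illustration it accepts two comment styles at once (Ada-style -- to end of line and C-style /* ... */), which no single language combines in this way.

```python
def skip_blanks_and_comments(src, pos):
    """Return the index of the next meaningful character at or after pos."""
    while pos < len(src):
        if src[pos] in " \t\r\n":
            pos += 1
        elif src.startswith("--", pos):          # comment runs to end of line
            end = src.find("\n", pos)
            pos = len(src) if end < 0 else end + 1
        elif src.startswith("/*", pos):          # bracketed comment
            end = src.find("*/", pos + 2)
            if end < 0:
                raise SyntaxError("unclosed comment starting at offset %d" % pos)
            pos = end + 2
        else:
            break
    return pos


source = "  -- note\n  x := 1;"
print(source[skip_blanks_and_comments(source, 0):])
```

A tool that needs to retrieve comments could record the skipped spans instead of discarding them.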
Case Equivalence

Some languages are case-insensitive

Pascal, Ada
Some are not

C, Java
Lexical analyzer ignores case if needed

This_Routine = THIS_RouTine
Error analysis may need exact casing
Friendly diagnostics follow user’s conventions
Performance Issues

Speed


Lexical analysis can become bottleneck
Minimize processing per character



Skip blanks fast
I/O is also an issue (read large blocks)
We compile frequently

Compilation time is important


Especially during development
Communicate with parser through global variables
General Approach

Define set of token kinds:

An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign, etc.)

Or a series of integer definitions in more primitive languages…
Some tokens carry associated data


E.g. key for identifier table
May be useful to build tree node

For identifiers, literals etc
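One way to sketch this: an enumeration for the token kinds and a small record for each token. The names Tok and Token and the fields line, column, and data are assumptions for illustration; data stands for the associated key into the names or literal table and is absent for tokens such as operators.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Tok(Enum):
    INT = auto()
    IF = auto()
    PLUS = auto()
    LEFT_PAREN = auto()
    ASSIGN = auto()
    IDENT = auto()


@dataclass(frozen=True)
class Token:
    kind: Tok
    line: int                    # location, for error messages
    column: int
    data: Optional[int] = None   # e.g. key for identifier table


plus = Token(Tok.PLUS, line=3, column=7)
name = Token(Tok.IDENT, line=3, column=9, data=42)
print(plus)
print(name)
```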
Interface to Lexical Analyzer

Either: Convert entire file to a file of tokens


Lexical analyzer is separate phase
Or: Parser calls lexical analyzer to supply next token

This approach avoids extra I/O
Parser builds tree incrementally, using successive tokens as tree nodes
Relevant Formalisms





Type 3 (Regular) Grammars
Regular Expressions
Finite State Machines
Equivalent in expressive power
Useful for program construction, even if hand-written
Regular Grammars

Regular grammars





Non-terminals (arbitrary names)
Terminals (characters)
Productions limited to the following:
 Non-terminal ::= terminal
 Non-terminal ::= terminal Non-terminal
 Treat character class (e.g. digit) as terminal
Regular grammars cannot count: no practical way to express size limits on identifiers, literals
Cannot express proper nesting (parentheses)
Regular Grammars

Grammar for real literals with no exponent

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
REAL ::= digit REAL1
REAL1 ::= digit REAL1 (arbitrary size)
REAL1 ::= . INTEGER
INTEGER ::= digit INTEGER (arbitrary size)
INTEGER ::= digit
Start symbol is REAL
Regular Expressions

Regular expressions (RE) defined by an alphabet (terminal symbols) and three operations:

Alternation: RE1 | RE2
Concatenation: RE1 RE2
Repetition: RE* (zero or more RE’s)
Languages defined by RE’s = languages defined by regular grammars

Regular expressions are more convenient for some applications
Specifying RE’s in Unix Tools






Single characters: a b c d \x
Alternation: [bcd] [b-z] ab|cd
Any character: . (period)
Match sequence of characters: x* y+
Concatenation: abc[d-q]
Optional RE: [0-9]+(\.[0-9]*)?
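The patterns above can be tried out with Python's re module, whose syntax agrees with the Unix notation for these particular constructs. The helper name matches and the sample strings are chosen for illustration; fullmatch is used so that a pattern must cover the whole string, as it would have to cover a whole token.

```python
import re

examples = {
    r"[bcd]": ["b", "d"],
    r"[b-z]": ["q"],
    r"ab|cd": ["ab", "cd"],
    r".": ["x", "+"],
    r"x*": ["", "xxx"],
    r"y+": ["y", "yy"],
    r"abc[d-q]": ["abcd", "abcq"],
    r"[0-9]+(\.[0-9]*)?": ["42", "42.", "3.14"],
}


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, texts in examples.items():
    for text in texts:
        print(pattern, repr(text), matches(pattern, text))
```

Note that the last pattern accepts "42." but rejects ".5", because the digits before the period are mandatory and those after it are not.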
Finite State Machines




A language defined by a grammar is a (possibly infinite) set of strings
An automaton is a computation that determines whether a given string belongs to a specified language
A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions)
Simplest automaton: memory is single number (state)
Specifying an FSM





A set of labeled states
Directed arcs between states labeled with character
One or more states may be terminal (accepting)
A distinguished state is start
Automaton makes transition from state S1 to S2

If and only if arc from S1 to S2 is labeled with next character in input
Token is legal if automaton stops on terminal state
Building FSM from Grammar


One state for each non-terminal
A rule of the form



Nt1 ::= terminal
Generates transition from S1 to final state
A rule of the form


Nt1 ::= terminal Nt2
Generates transition from S1 to S2 on an arc labeled by the terminal
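The two rules can be applied mechanically. The sketch below does so for the real-literal grammar given earlier (REAL, REAL1, INTEGER); the names FINAL, build_fsm, classify, and accepts are hypothetical. Because INTEGER has both a "digit INTEGER" and a "digit" production, the resulting machine has two arcs labeled digit leaving the same state, so the simulation tracks a set of current states rather than a single one.

```python
import string

FINAL = "<final>"   # hypothetical name for the single accepting state

# (non-terminal, terminal or character class, successor non-terminal or None)
PRODUCTIONS = [
    ("REAL", "digit", "REAL1"),
    ("REAL1", "digit", "REAL1"),
    ("REAL1", ".", "INTEGER"),
    ("INTEGER", "digit", "INTEGER"),
    ("INTEGER", "digit", None),      # Nt ::= terminal: arc to the final state
]


def build_fsm(productions):
    """Map (state, terminal) to the set of states reachable on that terminal."""
    arcs = {}
    for nonterminal, terminal, successor in productions:
        target = FINAL if successor is None else successor
        arcs.setdefault((nonterminal, terminal), set()).add(target)
    return arcs


def classify(ch):
    """Treat the character class digit as a single terminal."""
    return "digit" if ch in string.digits else ch


def accepts(arcs, start, text):
    """Follow every possible arc at once; accept if FINAL is reachable."""
    current = {start}
    for ch in text:
        current = {t for s in current for t in arcs.get((s, classify(ch)), ())}
    return FINAL in current


ARCS = build_fsm(PRODUCTIONS)
for sample in ["3.14", "12.", "12", ".5"]:
    print(sample, accepts(ARCS, "REAL", sample))
```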
Graphic representation
[Figure: state diagram with states S, Int, and id; arcs labeled digit, letter, and underscore]
Building FSM’s from RE’s


Every RE corresponds to a grammar
For all regular expressions


A natural translation to FSM exists
Alternation often leads to non-deterministic machines
Non-Deterministic FSM

A non-deterministic FSM

Has at least one state





With two arcs to two distinct states
Labeled with the same character
Example: from start state, a digit can begin an integer literal or a real literal
Implementation requires backtracking
Nasty
Deterministic FSM

For all states S

For all characters C:



There is at most one arc from any state S that is labeled with C
Much easier to implement
No backtracking
From NFSM to DFSM


There is an algorithm for converting a non-deterministic machine to a deterministic one
Result may have exponentially more states

Intuitively: need new states to express uncertainty about token: int or real
Algorithm is efficient in practice (e.g. grep)
Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
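The conversion is usually done by the subset construction, sketched below for machines without empty moves. The example machine is an assumption built from the int-or-real case above: "d" stands for any digit, state I accepts an integer, and states R, D, F recognize digits, a period, and more digits. Each state of the deterministic machine is a set of original states, which is exactly the "uncertainty about token" mentioned on the slide.

```python
def determinize(nfa, start, accepting):
    """Subset construction for an NFSM without empty moves.

    nfa maps (state, symbol) to a set of successor states.
    Returns (dfa, dfa_start, dfa_accepting); DFSM states are frozensets.
    """
    symbols = {symbol for (_, symbol) in nfa}
    dfa_start = frozenset([start])
    dfa, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        subset = todo.pop()
        for symbol in symbols:
            target = frozenset(t for s in subset for t in nfa.get((s, symbol), ()))
            if not target:
                continue             # no arc: the DFSM has no transition here
            dfa[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accepting = {subset for subset in seen if subset & accepting}
    return dfa, dfa_start, dfa_accepting


def run(dfa, start, accepting, symbols):
    """Run the DFSM; no backtracking is needed."""
    state = start
    for symbol in symbols:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in accepting


NFA = {
    ("S", "d"): {"I", "R"},   # a digit may begin an integer or a real
    ("I", "d"): {"I"},
    ("R", "d"): {"R"},
    ("R", "."): {"D"},
    ("D", "d"): {"F"},
    ("F", "d"): {"F"},
}
DFA, DFA_START, DFA_ACCEPTING = determinize(NFA, "S", {"I", "F"})
for sample in ["dd", "dd.d", "dd."]:
    print(sample, run(DFA, DFA_START, DFA_ACCEPTING, sample))
```

In this small case the deterministic machine has four states, fewer than the original five; the exponential growth is a worst case.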
Implementing the Scanner

Three methods

Hand-coded approach:

Draw DFSM, then implement with loop and case statement
Hybrid approach:

Define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM
Hand-code resulting DFSM
Automated approach:
 Use regular grammar as input to lexical scanner generator (e.g. LEX)
Hand-coding

Normal coding techniques

Scan over white space and comments till non-blank character found.
Branch depending on first character:
 If digit, scan numeric literal
 If letter, scan identifier or keyword
 If operator, check next character (++, etc.)
 Need table to determine character type efficiently
Return token found
Write aggressive efficient code: goto’s, global variables
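The structure above can be sketched as follows. This is a simplified illustration rather than tuned code: the keyword set, the two-character operator set, and the names CHAR_CLASS, next_token, and tokenize are assumptions, comments are not handled, and numeric literals are limited to plain digit sequences. The point is the shape: skip blanks, look up the class of the first character in a table, then branch.

```python
KEYWORDS = {"begin", "end", "if", "while"}
TWO_CHAR_OPERATORS = {":=", "<=", "==", "**", "++"}

# Table giving the class of each character, so classification is one lookup.
CHAR_CLASS = {}
for c in "0123456789":
    CHAR_CLASS[c] = "digit"
for c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
    CHAR_CLASS[c] = "letter"
for c in " \t\r\n":
    CHAR_CLASS[c] = "blank"


def next_token(src, pos):
    """Return (kind, text, new_pos) for the token starting at or after pos."""
    while pos < len(src) and CHAR_CLASS.get(src[pos]) == "blank":
        pos += 1
    if pos == len(src):
        return ("eof", "", pos)
    start = pos
    cls = CHAR_CLASS.get(src[pos], "special")
    if cls == "digit":
        while pos < len(src) and CHAR_CLASS.get(src[pos]) == "digit":
            pos += 1
        return ("integer", src[start:pos], pos)
    if cls == "letter":
        while pos < len(src) and (
            CHAR_CLASS.get(src[pos]) in ("letter", "digit") or src[pos] == "_"
        ):
            pos += 1
        text = src[start:pos]
        return ("keyword" if text in KEYWORDS else "identifier", text, pos)
    # Operator or punctuation: check next character, prefer the longest token.
    if src[pos:pos + 2] in TWO_CHAR_OPERATORS:
        return ("operator", src[pos:pos + 2], pos + 2)
    return ("operator", src[pos], pos + 1)


def tokenize(src):
    """Collect (kind, text) pairs until end of input."""
    tokens, pos = [], 0
    while True:
        kind, text, pos = next_token(src, pos)
        if kind == "eof":
            return tokens
        tokens.append((kind, text))


print(tokenize("if x1 := 42 ** y"))
```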
Using grammar and FSM

Start with regular grammar or RE

Typically found in the language reference
Example (Ada):

Chapter 2. Lexical Elements

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
decimal-literal ::= integer [.integer] [exponent]
integer ::= digit {[underline] digit}
exponent ::= E [+] integer | E - integer
Using grammar and FSM





Create one state for each non-terminal
Label edges according to productions in grammar
Each state becomes a label in the program
Code for each state is a switch on next character, corresponding to edges out of current state
If no possible transition on next character, then:


If state is accepting, return the corresponding token
If state is not accepting, report error
Hand-coded version:

Each state is encoded as follows:

<<state1>>
   case Next_Character is
      when 'a' => goto state3;
      when 'b' => goto state1;
      when others =>
         End_of_token_processing;
   end case;
<<state2>>
   …
No explicit mention of state of automaton
Translating from FSM to code

Variable holds current state:
loop
   case State is
      when state1 =>
         case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_of_token_processing;
         end case;
      when state2 …
      …
   end case;
end loop;
Automatic scanner construction


LEX builds a transition table, indexed by state and by character.
Code gets transition from table:
   Tab : array (State, Character) of State := …
begin
   while More_Input loop
      Curstate := Tab (Curstate, Next_Char);
      if Curstate = Error_State then … end if;
   end loop;
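A table-driven loop of this kind can be sketched for a small machine with a start state, an integer state, and an identifier state. The state names, the table TAB, and the function scan are assumptions for the example, not LEX output. One deliberate difference from the fragment above: the table is indexed by character class rather than by character, which keeps it small at the cost of one extra classification step.

```python
import string

ERROR = -1
START, IN_INT, IN_ID = 0, 1, 2
DIGIT, LETTER, UNDERSCORE, OTHER = 0, 1, 2, 3

# TAB[state][character class] gives the next state.
TAB = [
    # DIGIT  LETTER  UNDERSCORE  OTHER
    [IN_INT, IN_ID,  ERROR,      ERROR],   # START
    [IN_INT, ERROR,  ERROR,      ERROR],   # IN_INT
    [IN_ID,  IN_ID,  IN_ID,      ERROR],   # IN_ID
]
ACCEPTING = {IN_INT: "integer", IN_ID: "identifier"}


def char_class(ch):
    if ch in string.digits:
        return DIGIT
    if ch in string.ascii_letters:
        return LETTER
    return UNDERSCORE if ch == "_" else OTHER


def scan(text, pos=0):
    """Longest-match scan from pos; return (kind, lexeme) or raise ValueError."""
    state, i = START, pos
    while i < len(text):
        nxt = TAB[state][char_class(text[i])]
        if nxt == ERROR:
            break                      # no transition: the token ends here
        state, i = nxt, i + 1
    if state not in ACCEPTING:
        raise ValueError("no token at offset %d" % pos)
    return ACCEPTING[state], text[pos:i]


print(scan("123+x"))
print(scan("Square_Root := 2"))
```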
Automatic FSM Generation

Our example, FLEX

See home page for manual in HTML
FLEX is given

A set of regular expressions
Actions associated with each RE
It builds a scanner

Which matches RE’s and executes actions
Flex General Format

Input to Flex is a set of rules:

Regexp   actions (C statements)
Regexp   actions (C statements)
…
Flex scans for the longest matching Regexp

And executes the corresponding actions
An Example of a Flex scanner

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+    {
            printf ("an integer %s (%d)\n",
                    yytext, atoi (yytext));
            }
{DIGIT}+"."{DIGIT}*    {
            printf ("a float %s (%g)\n",
                    yytext, atof (yytext));
            }
if|then|begin|end|procedure|function    {
            printf ("a keyword: %s\n", yytext);
            }
Flex Example (continued)
{ID}        printf ("an identifier %s\n", yytext);
"+"|"-"|"*"|"/"    {
            printf ("an operator %s\n", yytext); }
"--".*\n    /* eat Ada style comment */
[ \t\n]+    /* eat white space */
.           printf ("unrecognized character");
Assembling the flex program
%{
#include <stdlib.h> /* for atoi and atof */
%}
<<flex text we gave goes here>>
%%
int main (int argc, char **argv)
{
    if (argc > 1)
        yyin = fopen (argv[1], "r");  /* otherwise read standard input */
    yylex ();
    return 0;
}
Running flex

flex is an executable program

The input is lexical grammar as described
The output is a C program that implements the scanner
For Ada fans

Look at aflex (www.adapower.com)
For C++ fans

flex can run in C++ mode

Generates appropriate classes
Choice Between Methods?

Hand written scanners




Typically much faster execution
Easy to write (standard structure)
Preferable for good error recovery
Flex approach


Simple to Use
Easy to modify token language
The GNAT Scanner

Hand written (scn.adb/scn.ads)

Each call does:
 Optimal scan past blanks/comments etc.
 Processing based on first character
 Call special routines for major classes:




Namet.Get_Name for identifier (hashing)
Keywords recognized by special hash
Strings (scn-slit.adb):
 complication with “+”, “and”, etc. (string or operator?)
Numeric literals (scn-nlit.adb):
 complication with based literals: 16#FFF#
Historical oddities
Because early keypunch machines were unreliable, FORTRAN treats blanks as optional: lexical analysis and parsing are intertwined.

DO10I=1.6
 3 tokens: identifier operator literal
           DO10I      =        1.6
DO10I=1,6
 7 tokens: keyword stmt id operator literal comma literal
           DO      10   I  =        1       ,     6
Celebrated NASA failure caused by this bug (?)
