Lexical Analysis and Scanning Compiler Construction Lecture 2 Spring 2001 Robert Dewar The Input  Read string input    Might be sequence of characters (Unix) Might be sequence of.

Download Report

Transcript Lexical Analysis and Scanning Compiler Construction Lecture 2 Spring 2001 Robert Dewar The Input  Read string input    Might be sequence of characters (Unix) Might be sequence of.

Lexical Analysis and Scanning
Compiler Construction
Lecture 2
Spring 2001
Robert Dewar
The Input

Read string input



Might be sequence of characters (Unix)
Might be sequence of lines (VMS)
Character set




ASCII
ISO Latin-1
ISO 10646 (16-bit = unicode)
Others (EBCDIC, JIS, etc)
The Output

A series of tokens







Punctuation ( ) ; , [ ]
Operators
+ - ** :=
Keywords
begin end if
Identifiers
Square_Root
String literals “hello this is a string”
Character literals ‘x’
Numeric literals 123 4_5.23e+2 16#ac#
Free form vs Fixed form

Free form languages

White space does not matter



Tabs, spaces, new lines, carriage returns
Only the ordering of tokens is important
Fixed format languages

Layout is critical



Fortran, label in cols 1-6
COBOL, area A B
Lexical analyzer must worry about layout
Punctuation

Typically individual special characters



Such as + Lexical analyzer does not know : from :
Sometimes double characters


E.g. (* treated as a kind of bracket
Returned just as identity of token

And perhaps location

For error message and debugging purposes
Operators

Like punctuation


No real difference for lexical analyzer
Typically single or double special chars



Operators + Operations :=
Returned just as identity of token

And perhaps location
Keywords

Reserved identifiers


E.g. BEGIN END in Pascal, if in C
Maybe distinguished from identifiers


Returned just as token identity


E.g. mode vs mode in Algol-68
With possible location information
Unreserved keywords (e.g. PL/1)

Handled as identifiers (parser distinguishes)
Identifiers

Rules differ


Need to build table



Length, allowed characters, separators
So that junk1 is recognized as junk1
Typical structure: hash table
Lexical analyzer returns token type


And key to table entry
Table entry includes location information
More on Identifier Tables

Most common structure is hash table






With fixed number of headers
Chain according to hash code
Serial search on one chain
Hash code computed from characters
No hash code is perfect!
Avoid any arbitrary limits
String Literals


Text must be stored
Actual characters are important





Not like identifiers
Character set issues
Table needed
Lexical analyzer returns key to table
May or may not be worth hashing
Character Literals


Similar issues to string literals
Lexical Analyzer returns



Token type
Identity of character
Note, cannot assume character set of
host machine, may be different
Numeric Literals


Also need a table
Typically record value


E.g. 123 = 0123 = 01_23 (Ada)
But cannot use int for values


Because may have different characteristics
Float stuff much more complex


Denormals, correct rounding
Very delicate stuff
Handling Comments




Comments have no effect on program
Can therefore be eliminated by scanner
But may need to be retrieved by tools
Error detection issues


E.g. unclosed comments
Scanner does not return comments
Case Equivalence

Some languages have case equivalence


Some do not


Pascal, Ada
C, Java
Lexical analyzer ignores case if needed


This_Routine = THIS_RouTine
Error analysis may need exact casing
Issues to Address

Speed


Lexical analysis can take a lot of time
Minimize processing per character


I/O is also an issue (read large blocks)
We compile frequently

Compilation time is important

Especially during development
General Approach

Define set of token codes




An enumeration type
A series of integer definitions
These are just codes (no semantics)
Some codes associated with data


E.g. key for identifier table
May be useful to build tree node

For identifiers, literals etc
Interface to Lexical Analyzer

Convert entire file to a file of tokens


Lexical analyzer is separate phase
Parser calls lexical analyzer



Get next token
This approach avoids extra I/O
Parser builds tree as we go along
Implementation of Scanner




Given the input text
Generate the required tokens
Or provide token by token on demand
Before we describe implementations


We take this short break
To describe relevant formalisms
Relevant Formalisms



Type 3 (Regular) Grammars
Regular Expressions
Finite State Machines
Regular Grammars

Regular grammars



Non-terminals (arbitrary names)
Terminals (characters)
Two forms of rules




Non-terminal ::= terminal
Non-terminal ::= terminal Non-terminal
One non-terminal is the start symbol
Regular (type 3) grammars cannot count

No concept of matching nested parens
Regular Grammars

Regular grammars

E.g. grammar of reals with no exponent






REAL ::= 0 REAL1
(repeat for 1 .. 9)
REAL1 ::= 0 REAL1
(repeat for 1 .. 9)
REAL1 ::= . INTEGER
INTEGER ::= 0 INTEGER (repeat for 1 .. 9)
INTEGER ::= 0
(repeat for 1 .. 9)
Start symbol is REAL
Regular Expressions

Regular expressions (RE) defined by





Any terminal character is an RE
Alternation RE | RE
Concatenation RE1 RE2
Repetition RE* (zero or more RE’s)
Language of RE’s = type 3 grammars

Regular expressions are more convenient
Specifying RE’s in Unix Tools






Single characters a b c d \x
Alternation [bcd] [b-z] ab|cd
Match any character .
Match sequence of characters x* y+
Concatenation abc[d-q]
Optional [0-9]+(.[0-9]*)?
Finite State Machines



Languages and Automata
A language is a set of strings
An automaton is a machine


That determines if a given string is in the
language or not.
FSM’s are automata that recognize
regular languages (regular expressions)
Definitions of FSM




A set of labeled states
Directed arcs labeled with character
A state may be marked as terminal
Transition from state S1 to S2

If and only if arc from S1 to S2



Labeled with next character (which is eaten)
Recognized if ends up in terminal state
One state is distinguished start state
Building FSM from Grammar


One state for each non-terminal
A rule of the form



Nont1 ::= terminal
Generates transition from S1 to final state
A rule of the form


Nont1 ::= terminal Nont2
Generates transition from S1 to S2
Building FSM’s from RE’s


Every RE corresponds to a grammar
For all regular expressions


A natural translation to FSM exists
We will not give details of algorithm here
Non-Deterministic FSM

A non-deterministic FSM

Has at least one state





With two arcs to two separate states
Labeled with the same character
Which way to go?
Implementation requires backtracking
Nasty 
Deterministic FSM

For all states S

For all characters C





There is either ONE or NO arcs
From state S
Labeled with character C
Much easier to implement
No backtracking 
Dealing with ND FSM


Construction naturally leads to ND FSM
For example, consider FSM for

[0-9]+ | [0-9]+\.[0-9]+


(integer or real)
We will naturally get a start state


With two sets of 0-9 branches
And thus non-deterministic
Converting to Deterministic

There is an algorithm for converting

From any ND FSM



To an equivalent deterministic FSM
Algorithm is in the text book
Example (given in terms of RE’s)


[0-9]+ | [0-9]+\.[0-9]+
[0-9]+(\.[0-9]+)?
Implementing the Scanner

Three methods


Completely informal, just write code
Define tokens using regular expressions




Convert RE’s to ND finite state machine
Convert ND FSM to deterministic FSM
Program the FSM
Use an automated program

To achieve above three steps
Ad Hoc Code (forget FSM’s)

Write normal hand code


A procedure called Scan
Normal coding techniques


Basically scan over white space and comments
till non-blank character found.
Base subsequent processing on character




E.g. colon may be : or :=
/ may be operator or start of comment
Return token found
Write aggressive efficient code
Using FSM Formalisms

Start with regular grammar or RE


Typically found in the language standard
For example, for Ada:

Chapter 2. Lexical Elements




Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
decimal-literal ::= integer [.integer][exponent]
integer ::= digit {[underline] digit}
exponent ::= E [+] integer | E - integer
Using FSM formalisms, cont

Given RE’s or grammar



Convert to finite state machine
Convert ND FSM to deterministic FSM
Write a program to recognize

Using the deterministic FSM
Implementing FSM (Method 1)

Each state is code of the form:


<<state1>>
case Next_Character is
when ‘a’ => goto state3;
when ‘b’ => goto state1;
when others =>
End_of_token_processing;
end case;
<<state2>>
…
Implementing FSM (Method 2)

There is a variable called State
loop
case State is
when state1 =><<state1>>
case Next_Character is
when ‘a’ => State := state3;
when ‘b’ => State := state1;
when others => End_token_processing;
end case;
when state2 …
…
end case;
end loop;
Implementing FSM (Method 3)

T : array (State, Character) of State;
while More_Input loop
Curstate := T (Curstate, Next_Char);
if Curstate = Error_State then …
end loop;
Automatic FSM Generation

Our example, FLEX


FLEX is given



See home page for manual in HTML
A set of regular expressions
Actions associated with each RE
It builds a scanner

Which matches RE’s and executes actions
Flex General Format

Input to Flex is a set of rules:




Regexp
Regexp
…
actions (C statements)
actions (C statements)
Flex scans the longest matching Regexp

And executes the corresponding actions
An Example of a Flex scanner

DIGIT
ID
%%
{DIGIT}+
[0-9]
[a-z][a-z0-9]*
{
}
printf (“an integer %s (%d)\n”,
yytext, atoi (yytext));
{DIGIT}+”.”{DIGIT}* {
printf (“a float %s (%g)\n”,
yytext, atof (yytext));
if|then|begin|end|procedure|function {
printf (“a keyword: %s\n”, yytext));
Flex Example (continued)
{ID}
printf (“an identifier %s\n”, yytext);
“+”|“-”|“*”|“/” {
printf (“an operator %s\n”, yytext); }
“--”.*\n
[ \t\n]+
.
%%
/* eat Ada style comment */
/* eat white space */
printf (“unrecognized character”);
Assembling the flex program
%{
#include <math.h> /* for atof */
%}
<<flex text we gave goes here>>
%%
main (argc, argv)
int argc;
char **argv;
{
yyin = fopen (argv[1], “r”);
yylex();
}
Running flex

flex is a program that is executed



For Ada fans


The input is as we have given
The output is a running C program
Look at aflex (www.adapower.com)
For C++ fans

flex can run in C++ mode

Generates appropriate classes
Choice Between Methods?

Hand written scanners




Typically much faster execution
And pretty easy to write
And a easier for good error recovery
Flex approach


Simple to Use
Easy to modify token language
The GNAT Scanner

Hand written (scn.adb/scn.ads)

Basically a call does




Super quick scan past blanks/comments etc
Big case statement
Process based on first character
Call special routines





Namet.Get_Name for identifier (hashing)
Keywords recognized by special hash
Strings (stringt.ads)
Integers (uintp.ads)
Reals (ureal.ads)
More on the GNAT Scanner

Entire source read into memory





Single contiguous block
Source location is index into this block
Different index range for each source file
See sinput.adb/ads for source mgmt
See scans.ads for definitions of tokens
More on GNAT Scanner

Read scn.adb code

Very easy reading, e.g.
DTL (Dewar Trivial Language)

DTL Grammar






Program ::= DECLARE Decls BEGIN Stmts
Decls ::= {Decl}*
Stmts ::= {Stmt}+
Type ::= INTEGER | REAL
Identifier ::= letter (_{digit}+)*
Decl ::= DECLARE identifier : Type
DTL (Continued)

Integer_Literal ::= {digit}+
Real_Literal ::= {digit}+”.”{digit}*
Stmt ::= Assignstmt | Ifstmt | Whilestmt
Assignstmt ::= Identifier := Expr
Expr ::= Literal | (Expr) Op (Expr)
Op ::= + | Literal ::= Integer_Literal | Real_Literal
Ifstmt ::= IF Expr Relop Expr THEN Stmts
Whilestmt ::= WHILE Expr Relop Expr DO Stmts
Relop ::= > | < | >= | <=
DTL Example

DECLARE
DECL A_123 : INTEGER
DECL B: REAL
BEGIN
A_123 := 23
B := 2.4
WHILE A_123 > (2) + (1)
DO A_123 := A_123 - 1
ASSIGNMENT TWO

Write a flex or aflex program



Recognize tokens of DTL program
Print out tokens in style of flex example
Extra credit


Build hash table for identifiers
Output hash table key
Preprocessors

Some languages allow preprocessing

This is a separate step



Can either be done as separate phase



Input is source
Output is expanded source
Or embedded into the lexical analyzer
Often done as separate phase
Need to keep track of source locations
Nasty Glitches



Separation of tokens
Not all languages have clear rules
FORTRAN has optional spaces

DO10I=1.6



DO10I=1,6



identifier operator literal
DO10I
=
1.6
Keyword stmt loopvar operator literal punc literal
DO
10
I
=
1
,
6
Modern languages avoid this kind of thing!