UNIT 1 INTRODUCTION TO COMPILERS
UNIT 1
BY :- NAMRATHA NAYAK
www.Bookspar.com | Website for Students |
VTU - Notes - Question Papers
INTRODUCTION TO COMPILERS
TOPICS
Language Processors
Structure of a Compiler
Evolution of Programming Languages
Science of Building a Compiler
Applications of Compiler Technology
Programming Language Basics
LANGUAGE PROCESSORS
COMPILER
Reads a program in the source language and translates it into the target language
Source language – High-level language like C, C++
Target language – object code of the target machine
Reports any errors detected in the source program during translation
LANGUAGE PROCESSORS
INTERPRETER
Directly executes the operations specified in the source
program on inputs supplied by the user
Target program is not produced as output of translation
Usually gives better error diagnostics than a compiler
Executes source program statement by statement
Target program produced by compiler is much faster at
mapping inputs to outputs
LANGUAGE PROCESSORS
EXAMPLE
Java language processors combine compilation and interpretation
Source program is first compiled into bytecodes
Bytecodes are then interpreted by a virtual machine
Just-in-time compilers
Translate bytecodes into machine language immediately before they run the
intermediate program to process input
LANGUAGE PROCESSORS
A Language-Processing System
LANGUAGE PROCESSORS
Preprocessor
Source program may be divided into modules in separate files
Accomplishes the task of collecting the source program
Can delete comments, include other files, expand macros
Assembler
Compiler produces an assembly-language program
Symbolic form of the machine language
Produces relocatable machine code as output
LANGUAGE PROCESSORS
Linker/Loader
Relocatable Code
Not ready to execute
Memory references are made relative to an undetermined starting address in memory
Relocatable machine code may have to be linked with other object files
Linker
Resolves external memory addresses
Code in one file referring to a location in another file
Loader
Resolves all relocatable addresses relative to a given starting address
Puts together all the executable object files into memory for execution
THE STRUCTURE OF A COMPILER
Analysis Phase
Break up source program into constituent pieces (tokens)
Impose a grammatical structure
Create an intermediate representation of the source program
If source program is syntactically incorrect or semantically
wrong
Provide informative messages to the user
Symbol Table
Stores the information collected about the source program
Given to the synthesis phase along with the intermediate
representation
THE STRUCTURE OF A COMPILER
Synthesis Phase
Constructs the desired target program from
Intermediate representation
Information in symbol table
Back end of the compiler
Analysis phase is called front end of the compiler
Compilation process is a sequence of phases
Each phase transforms one representation of source program into another
Several phases may be grouped together
Symbol table is used by all the phases of the compiler
THE STRUCTURE OF A COMPILER
LEXICAL ANALYSIS
Lexical Analyzer
Reads stream of characters in the source program
Groups the characters into meaningful sequences – lexemes
For each lexeme, a token is produced as output
<token-name , attribute-value>
Token-name : symbol used during syntax analysis
Attribute-value : an entry in the symbol table for this token
Information from symbol table is needed for semantic analysis and code generation
Consider the following assignment statement: position = initial + rate * 60
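A minimal sketch of how lexemes become <token-name, attribute-value> pairs. It assumes the running example is the statement position = initial + rate * 60; the pattern table, the function name tokenize and the list-based symbol table are hypothetical choices made only for illustration.

```python
import re

# Hypothetical token patterns; a real language would need many more.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[=+*]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table) for a source string."""
    symbol_table = []      # entry number i is stored at index i - 1
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"no token matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue       # whitespace is discarded, not passed to the parser
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))   # operators need no attribute value
    return tokens, symbol_table

tokens, table = tokenize("position = initial + rate * 60")
print(tokens)
print(table)
```

Here the attribute of an id token is the number of its symbol-table entry, so position, initial and rate become id 1, id 2 and id 3.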
SYNTAX ANALYSIS
Parsing
Parser uses the tokens to create a tree-like intermediate representation
Depicts the grammatical structure of the token stream
Syntax tree is one such representation
Interior node – operation
Children – arguments of the operation
Other phases use this syntax tree to help analyze source program and generate target program
SEMANTIC ANALYSIS
Semantic Analyzer
Checks semantic consistency with language using:
Syntax tree
Information in symbol table
Gathers type information and saves it in the syntax tree or symbol table
Type Checking
Checks that each operator has matching operands
Ex: Report error if floating point number is used as index of an array
Coercions or type conversions
Binary arithmetic operator may be applied to a pair of integers or a pair of floating point numbers
If applied to floating point and integer, compiler may convert integer to floating-point number
SEMANTIC ANALYSIS
Semantic Analyzer
For our assignment statement
Position, rate and initial are floating-point numbers
Lexeme 60 is an integer
Type checker finds that * is applied to floating-point ‘rate’ and integer ‘60’
Integer is converted to floating-point
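The coercion step can be sketched as a small tree walk. This assumes the statement is position = initial + rate * 60 and represents the syntax tree as nested tuples; the TYPES table, the function name check and the node name inttofloat are illustrative assumptions, not a fixed design.

```python
# Assumed types of the names, as stated on the slide.
TYPES = {"position": "float", "initial": "float", "rate": "float"}

def check(node):
    """Return (typed_tree, type), inserting inttofloat where operand types differ."""
    if isinstance(node, int):
        return node, "int"
    if isinstance(node, str):
        return node, TYPES[node]
    op, left, right = node
    left, left_type = check(left)
    right, right_type = check(right)
    if left_type != right_type:
        # Mixed operands: widen the integer side to floating point.
        if left_type == "int":
            left, left_type = ("inttofloat", left), "float"
        else:
            right, right_type = ("inttofloat", right), "float"
    return (op, left, right), left_type

tree = ("=", "position", ("+", "initial", ("*", "rate", 60)))
typed_tree, result_type = check(tree)
print(typed_tree)
```

Only the operand 60 is wrapped in a conversion node; the rest of the tree is unchanged.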
INTERMEDIATE CODE GENERATION
After syntax and semantic analysis
Compilers generate machine-like intermediate representation
This intermediate representation should have two properties:
Should be easy to produce
Should be easy to translate into target machine
Three-address code
Sequence of assembly-like instructions with three operands per instruction
Each operand acts like a register
INTERMEDIATE CODE GENERATION
Points to be noted about three-address instructions are:
Each assignment instruction has at most one operator on the right
side
Compiler must generate a temporary name to hold the value
computed by a three-address instruction
Some instructions have fewer than three operands
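A rough sketch of producing three-address instructions from a syntax tree, under the assumption that the tree is given as nested tuples and that id1, id2 and id3 stand for the symbol-table entries of the running example. The function name gen_tac and the temporary names t1, t2, ... are hypothetical.

```python
import itertools

def gen_tac(tree):
    """Translate an assignment tree into a list of three-address instructions."""
    code = []
    temps = (f"t{i}" for i in itertools.count(1))

    def walk(node):
        if not isinstance(node, tuple):
            return str(node)              # a name or constant is its own address
        if node[0] == "inttofloat":
            arg = walk(node[1])
            t = next(temps)               # temporary holding the converted value
            code.append(f"{t} = inttofloat({arg})")
            return t
        op, left, right = node
        l, r = walk(left), walk(right)
        t = next(temps)                   # one temporary per computed value
        code.append(f"{t} = {l} {op} {r}")
        return t

    _, target, expr = tree
    code.append(f"{target} = {walk(expr)}")   # an instruction with fewer than three operands
    return code

tree = ("=", "id1", ("+", "id2", ("*", "id3", ("inttofloat", 60))))
tac = gen_tac(tree)
print("\n".join(tac))
```

Each generated instruction has at most one operator on its right side, and every intermediate value gets its own temporary name.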
CODE OPTIMIZATION
Attempt to improve the target code
Faster code, shorter code or target code that consumes less power
Optimizer can deduce that
Conversion of 60 from int to float can be done once at compile time
So, the inttofloat can be eliminated by replacing 60 with 60.0
t3 is used only once to transmit its value to id1
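The two deductions above can be sketched on a list of three-address instructions. The input list is an assumed form of the intermediate code for the running example, and optimize is a hypothetical helper that handles only these two cases; a real optimizer relies on general data-flow analysis.

```python
import re

def optimize(code):
    """Fold inttofloat of integer constants and merge a trailing copy instruction."""
    consts = {}   # temporary -> constant value known at compile time
    out = []
    for instr in code:
        dest, rhs = instr.split(" = ", 1)
        m = re.fullmatch(r"inttofloat\((\d+)\)", rhs)
        if m:
            # The conversion is done once, here, instead of at run time.
            consts[dest] = str(float(m.group(1)))
            continue
        # Replace uses of folded temporaries by their constant value.
        rhs = " ".join(consts.get(tok, tok) for tok in rhs.split())
        out.append(f"{dest} = {rhs}")
    # If the last instruction only copies the previous result, merge the two.
    # This assumes the copied temporary has no other use.
    if len(out) >= 2:
        last_dest, last_src = out[-1].split(" = ", 1)
        prev_dest, prev_rhs = out[-2].split(" = ", 1)
        if last_src == prev_dest:
            out[-2:] = [f"{last_dest} = {prev_rhs}"]
    return out

before = ["t1 = inttofloat(60)", "t2 = id3 * t1", "t3 = id2 + t2", "id1 = t3"]
after = optimize(before)
print("\n".join(after))
```

The four instructions shrink to two. The sketch does not renumber the remaining temporary.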
CODE GENERATION
Code Generator
Takes intermediate representation as input
Maps it into target language
If target language is machine code
Registers or memory locations are selected for each of the variables used
Intermediate instructions are translated into sequences of machine instructions performing the same task
Assignment of registers to hold variables is a crucial aspect
First operand of each instruction specifies a destination
SYMBOL-TABLE MANAGEMENT
Essential function of Compiler
Record variable names used in source program
Collect information about the attributes of each name
Storage allocated for the name
Type
Scope – where in the program the value may be used
In case of procedure names,
Number and type of its arguments
Method of passing each argument
Type returned
Symbol Table
Data structure containing a record for each variable name with fields for attributes
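One possible shape for such a record, as a sketch only. The class names SymbolEntry and SymbolTable and all field names are assumptions chosen to mirror the attributes listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SymbolEntry:
    name: str
    type: str
    scope: str
    # The remaining fields are meaningful only for procedure names.
    param_types: List[str] = field(default_factory=list)
    passing: List[str] = field(default_factory=list)   # e.g. "value" or "reference"
    return_type: Optional[str] = None

class SymbolTable:
    """Maps each name to its record so that any phase can store or fetch attributes."""
    def __init__(self):
        self._entries = {}

    def insert(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries.get(name)   # None if the name was never recorded

table = SymbolTable()
table.insert(SymbolEntry("rate", "float", "global"))
table.insert(SymbolEntry("area", "procedure", "global",
                         param_types=["float", "float"],
                         passing=["value", "value"],
                         return_type="float"))
print(table.lookup("area"))
```

A dictionary keyed by name keeps both storing and retrieving a record fast, which matters because every phase consults the table.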
COMPILER-CONSTRUCTION TOOLS
Commonly used compiler-construction tools
Parser Generators
Scanner Generators
Syntax-directed translation engines
Code-generator Generators
Data-flow analysis engines
Compiler-construction Toolkits
EVOLUTION OF PROGRAMMING LANGUAGES
Move to Higher-Level Languages
Development of mnemonic assembly languages in 1950’s
Classification of Languages
Generation
First-generation : machine languages
Second-generation : assembly languages
Third-generation : C, C++, C#, Java
Fourth-generation : SQL, Postscript
Fifth-generation : Prolog
Imperative and Declarative
Imperative : how a computation is to be done
Declarative : what computation is to be done
Object-oriented Language
Scripting Languages
EVOLUTION OF PROGRAMMING LANGUAGES
Impact on Compilers
Advances in PL’s placed new demands on compiler writers
Devise algorithms and representations to support new features
Performance of a computer is dependent on compiler technology
Good software-engineering techniques are essential for creating and evolving modern language processors
Compiler writers must evaluate tradeoffs about
What problems to deal with
What heuristics to use to approach the problem of generating efficient code
SCIENCE OF BUILDING A COMPILER
Modeling in Compiler Design and Implementation
Study of compilers is a study of how
To design the right mathematical models and
Choose the right algorithms
Finite-state machines and regular expressions
Useful for describing the lexical units of a program (keywords, identifiers)
Used to describe the algorithms used to recognize those units
Context-free Grammars
Describe syntactic structure of PL
Nesting of parentheses, control constructs
SCIENCE OF BUILDING A COMPILER
Science of Code Optimization
“Optimization” – an attempt to produce code that is more efficient
Processor architectures have become more complex
Important to formulate the right problem to solve
Need a good understanding of the programs
Compiler design must meet the following design objectives
Optimization must be correct, i.e., preserve the meaning of compiled program
Optimization must improve the performance of many programs
Compilation time must be kept reasonable
Engineering effort required must be manageable
APPLICATIONS OF COMPILER TECHNOLOGY
Implementation of high-level programming languages
High-level programming language defines a programming abstraction
Low-level languages give more control over computation and can produce more efficient code
Hard to write, less portable, prone to errors and harder to maintain
Example : register keyword
Common programming languages (C, Fortran, Cobol) support
User-defined aggregate data types (arrays, structures) and high-level control flow
Data-flow optimizations
Analyze flow of data through the program and remove redundancies
Key ideas behind object oriented languages are
Data Abstraction
Inheritance of properties
APPLICATIONS OF COMPILER TECHNOLOGY
Implementation of high-level programming languages
Java has features that make programming easier
Type-safe – an object cannot be used as an object of an unrelated type
Array accesses are checked to ensure that they lie within the bounds
Built in garbage-collection facility
Optimizations developed to overcome the overhead by eliminating
unnecessary range checks
APPLICATIONS OF COMPILER TECHNOLOGY
Optimizations for Computer Architectures
Parallelism
Instruction level : multiple operations are executed simultaneously
Processor level : different threads of the same application run on different processors
Memory hierarchies
Consists of several levels of storage with different speeds and sizes
Average memory access time is reduced
Using registers effectively is probably the most important problem in optimizing a program
Caches and physical memories are managed by the hardware
Improve effectiveness by changing the layout of data or order of instructions accessing the data
APPLICATIONS OF COMPILER TECHNOLOGY
Design of new Computer Architectures
RISC (Reduced Instruction-Set Computer)
CISC (Complex Instruction-Set Computer) instruction sets
Make assembly programming easier
Include complex memory addressing modes
Optimizations reduce these instructions to a small number of simpler operations
RISC examples: PowerPC, SPARC, MIPS, Alpha and PA-RISC
Specialized Architectures
Data flow machines, vector machines, VLIW, SIMD, systolic arrays
Made way into the designs of embedded machines
Entire systems can fit on a single chip
Compiler technology helps to evaluate architectural designs
APPLICATIONS OF COMPILER TECHNOLOGY
Program Translations
Binary Translation
Translate binary code of one machine to that of another
Allow machine to run programs compiled for another instruction set
Used to increase the availability of software for their machines
Can provide backward compatibility
Hardware synthesis
Hardware designs are described in high-level hardware description languages like Verilog and VHDL
Described at register transfer level (RTL)
Variables represent registers
Expressions represent combinational logic
Tools translate RTL descriptions into gates, which are then mapped to transistors and eventually to physical layout
APPLICATIONS OF COMPILER TECHNOLOGY
Program Translations
Database Query Interpreters
Languages are useful in other applications
Query languages like SQL are used to search databases
Queries consist of predicates containing relational and boolean operators
Can be interpreted or compiled into commands to search a database
Compiled Simulation
Simulation
Technique used in scientific and engineering disciplines
Understand a phenomenon or validate a design
Inputs include description of the design and specific input parameters for that run
APPLICATIONS OF COMPILER TECHNOLOGY
Software Productivity Tools
Testing is a primary technique for locating errors in a program
Use data flow analysis to locate errors statically
Problem of finding all program errors is undecidable
Ways in which program analysis has improved software productivity
Type Checking
Catch inconsistencies in the program
Operation applied to wrong type of object
Parameters to a procedure do not match the signature
Go beyond finding type errors by analyzing flow of data
If pointer is assigned null and then dereferenced, the program is clearly in error
APPLICATIONS OF COMPILER TECHNOLOGY
Software Productivity Tools
Bounds Checking
Many security breaches are caused by buffer overflows in programs written in C
Data-flow analysis can be used to locate buffer overflows
Failing to identify a buffer overflow may compromise the security of the
system
Memory-management tools
Automatic memory management removes all memory-management errors
like memory leaks
Tools developed to help programmers find memory management errors
Purify - dynamically catches memory management errors as they occur
PROGRAMMING LANGUAGE BASICS
The Static/Dynamic Distinction
What decision can the compiler make about a program
Static Policy - Language uses a policy that allows compiler to decide
an issue, i.e., at compile time
Dynamic Policy – Policy that allows a decision to be made when we
execute the program, i.e. at run time
Scope of Declarations
Scope of a declaration of x is the region of the program in which uses of x refer to this declaration
Static or Lexical scope : Used if it is possible to determine the scope of a declaration by looking only at the program
Dynamic Scope : As the program runs, the same use of x could refer to any of several different declarations of x
Example : public static int x;
PROGRAMMING LANGUAGE BASICS
Environments and States
Whether the changes that occur as the program runs
Affect the values of the data elements
Affect the interpretation of names for that data
Association of names with locations in memory (store) and then with values is described as a two-stage mapping
Environment – Mapping from names to locations in the store
State – Mapping from locations in store to their values. It maps l-values to their corresponding r-values
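The two-stage mapping can be sketched with two dictionaries. The location number 100 and the helper names value_of and assign are made up for illustration.

```python
environment = {"x": 100}   # name -> location in the store (l-value)
state = {100: 5}           # location -> value held there (r-value)

def value_of(name):
    """Follow both mappings: name -> location -> value."""
    return state[environment[name]]

def assign(name, value):
    """An assignment changes the state, but leaves the environment alone."""
    state[environment[name]] = value

before = value_of("x")
assign("x", 7)
after = value_of("x")
print(before, after, environment)
```

After the assignment, x still names the same location; only the value stored at that location has changed.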
PROGRAMMING LANGUAGE BASICS
Environments and States
Example
Exceptions to environment and state mappings
Static versus dynamic binding of names to locations
Static versus dynamic binding of locations to values
PROGRAMMING LANGUAGE BASICS
Static Scope and Block Structure
Scope rules for C – based on program structure
Scope of a declaration – determined by the location of its appearance
Languages like C++,C# and Java provide explicit control over scopes
– public, private and protected
Static scope rules for a language with blocks – a grouping of
declarations and statements
C static scope policy is as follows:
C program is a sequence of top-level declarations of variables & functions
Functions may have variable declarations within them, scope of which is
restricted to the function in which it appears
Scope of a top-level declaration of a name x consists of the entire program that
follows
PROGRAMMING LANGUAGE BASICS
Static Scope and Block Structure
The syntax of blocks in C is given by
It is a type of statement and can appear anywhere that other statements can appear
Is a sequence of declarations followed by a sequence of statements, all surrounded by braces
Block structure – nesting property of blocks
Static scope rule for variable declaration is as follows:
If declaration D of name x belongs to block B,
Then scope of D is all of B, except for any blocks B’ nested to any depth within B in which x is redeclared
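The static scope rule for nested blocks can be modelled as a chain of per-block declaration tables, which is roughly how a compiler might resolve a name. Python has no C-style block scope, so the Block class below is a hypothetical model rather than real C semantics.

```python
class Block:
    """One block: its own declarations plus a link to the enclosing block."""
    def __init__(self, parent=None):
        self.decls = {}
        self.parent = parent

    def declare(self, name, info):
        self.decls[name] = info

    def resolve(self, name):
        """Search this block first, then each enclosing block in turn."""
        block = self
        while block is not None:
            if name in block.decls:
                return block.decls[name]
            block = block.parent
        raise NameError(name)

outer = Block()                      # block B
outer.declare("x", "x declared in B")
outer.declare("y", "y declared in B")
inner = Block(parent=outer)          # block B' nested within B
inner.declare("x", "x redeclared in B'")

print(inner.resolve("x"))   # the redeclaration hides B's x inside B'
print(inner.resolve("y"))   # y is not redeclared, so B's declaration applies
print(outer.resolve("x"))   # outside B', B's own x is in scope
```

Because the chain is fixed by the nesting of blocks in the program text, every lookup can be decided at compile time.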
PROGRAMMING LANGUAGE BASICS
Static Scope and Block Structure
Blocks in a C++ program
PROGRAMMING LANGUAGE BASICS
Explicit Access Control
Classes and structures introduce new scope for their members
If p is an object of a class with a field x, then use of x in p.x refers to field x in
the class definition
The scope of declaration x in a class C extends to any subclass C’, except if C’
has a local declaration of the same name x
Public, private and protected – provide explicit control over access to
member names in a super class
In C++, class definition may be separated from the definition of some
or all of its methods
A name x associated with the class C may have a region of code that is outside
its scope followed by another region within its scope
PROGRAMMING LANGUAGE BASICS
Dynamic Scope
Based on factors that can be known only when the program executes
A use of a name x refers to the declaration of x in the most recently
called procedure with such a declaration
Macro expansion in the C preprocessor
Dynamic scope resolution is essential for polymorphic procedures
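Dynamic scope can be simulated with an explicit stack of activation records, searched from the most recent call backwards. Python itself is statically scoped, so call_stack, lookup and call are hypothetical helpers that only model the rule.

```python
call_stack = []   # one dict of declarations per active procedure; newest last

def lookup(name):
    """Find the declaration in the most recently called procedure that has one."""
    for frame in reversed(call_stack):
        if name in frame:
            return frame[name]
    raise NameError(name)

def call(decls, body):
    """Run body with decls pushed as the newest activation record."""
    call_stack.append(decls)
    try:
        return body()
    finally:
        call_stack.pop()

def show_x():
    return lookup("x")   # which x this is depends on the chain of callers

def main():
    via_main = call({}, show_x)   # show_x has no x of its own: main's x is found
    via_p = call({"x": "x of p"}, lambda: call({}, show_x))   # p declares x, then calls show_x
    return via_main, via_p

result = call({"x": "x of main"}, main)
print(result)
```

The same use of x inside show_x refers to two different declarations, depending on which procedures are active when it runs.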
PROGRAMMING LANGUAGE BASICS
Dynamic Scope
Method resolution in OOP
The procedure called when x.m() is executed depends on the class of the object
denoted by x at that time
Example:
Class C with a method named m()
D is a subclass of C , and D has its own method named m()
There is a use of m of the form x.m(), where x is an object of class C
Impossible to tell at compile time whether x will be of class C or of the
subclass D
Cannot be decided until runtime which definition of m is the right one
Code generated by compiler must determine the class of the object x, and call
one or the other method named m
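The example above can be written out directly. The class and method names follow the slide; the function use is a hypothetical stand-in for the code containing the call x.m().

```python
class C:
    def m(self):
        return "C.m"

class D(C):
    def m(self):          # D has its own method named m
        return "D.m"

def use(x):
    # Which m runs is decided from the class of x when this call executes.
    return x.m()

print(use(C()))
print(use(D()))
```

The text of use is the same in both calls; only the class of the object passed in selects the method.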
PROGRAMMING LANGUAGE BASICS
Parameter Passing Mechanisms
How actual parameters are associated with formal parameters
Actual parameters – used in the call of a procedure
Formal parameters – used in the procedure definition
Call-by-Value
The actual parameter is evaluated (if it is an expression) or copied (if it is a variable)
Value is placed in the location belonging to the corresponding formal
parameter of the called procedure
Computation involving formal parameters done by the called procedure is local
to that procedure and actual parameters cannot be changed
In C, we can pass a pointer to a variable to allow that variable to be changed by
the callee
Array names passed as parameters in C,C++ or Java give the called procedure
what is in effect a pointer or reference to the array itself
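A small sketch of both effects. Python is used only for illustration: it passes object references by value, which resembles the array case described above but is not identical to C. The function names are hypothetical.

```python
def reset(n):
    n = 0          # changes only the callee's formal parameter

def clear_first(arr):
    arr[0] = 0     # the formal holds a reference, so the caller's array changes

count = 5
reset(count)       # count is unaffected by the call

data = [1, 2, 3]
clear_first(data)  # data is modified through the reference

print(count, data)
```

The first call behaves like plain call-by-value, while the second shows how passing a reference by value still lets the callee change the caller's data.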
PROGRAMMING LANGUAGE BASICS
Parameter Passing Mechanisms
Call-by-Reference
Address of actual parameter is passed to the callee as the value of the
corresponding formal parameter
Changes to formal parameter appear as changes to the actual parameter
Essential when the formal parameter is a large object, array or a structure
Strict call-by-value requires that the caller copy the entire actual parameter
into the space of the corresponding formal parameter
Copying is expensive when the parameter is large
Call-by-Name
The callee executes as if the actual parameter were substituted literally for the
formal parameter in the code of the callee
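Call-by-name is often approximated by passing an unevaluated expression (a thunk) that the callee re-evaluates at each use. The sketch below assumes that approximation; twice, actual and evaluations are hypothetical names.

```python
evaluations = []

def actual():
    """Stands for the actual-parameter expression; records each evaluation."""
    evaluations.append(1)
    return 21

def twice(arg):
    # Each use of the formal re-evaluates the actual parameter expression.
    return arg() + arg()

result = twice(lambda: actual())
print(result, len(evaluations))
```

The expression is evaluated twice, once per use of the formal, which is the behaviour literal substitution would give.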
PROGRAMMING LANGUAGE BASICS
Aliasing
Consequence of call-by-reference parameter passing
Possible that two formal parameters can refer to the same location
Such variables are said to be aliases of one another
Example:
a is an array belonging to procedure p, and p calls another procedure q(x,y)
with a call q(a,a)
Parameters are passed by value but the array names are references to the
location where the array is stored
So, x and y become aliases of each other
Understanding aliasing is essential for a compiler that optimizes a program
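The q(a,a) example can be tried directly. The body of q is an assumption made to expose the aliasing: it writes through x and then reads through y.

```python
def q(x, y):
    x[0] = 10
    return y[0]    # sees the write made through x whenever x and y are aliases

a = [1, 2, 3]
aliased = q(a, a)            # x and y refer to the same array
separate = q([1, 2], [3, 4]) # x and y refer to different arrays

print(aliased, separate)
```

An optimizer that assumed x and y were always distinct could wrongly reuse an old value of y[0], which is why alias information matters.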
LEXICAL ANALYSIS
OBJECTIVES
Role of Lexical analyzer
Lexical analysis using formal language definitions with
Finite Automata
Specification of Tokens
Recognition of Tokens
PROGRAMMING LANGUAGE STRUCTURE
A Programming Language is defined by:
SYNTAX
Decides whether a sentence in a language is well-formed
SEMANTICS
Determines the meaning, if any, of a syntactically well-formed sentence
GRAMMAR
Provides a generative finite description of the language
Well developed tools (regular, context-free and attribute grammars) are available for the description of syntax
Lexical analyzer and the Parser handle the syntax of the programming language
THE ROLE OF THE LEXICAL ANALYZER
Main task of lexical analyzer
Read input characters in a source program
Group them into lexemes
Produce as output a token for each lexeme
Stream of tokens is sent to the parser
Whenever a lexeme constituting an identifier is found, it is entered into the symbol table
THE ROLE OF THE LEXICAL ANALYZER
Other tasks performed by the lexical analyzer
Removing comments and whitespace
Correlating error messages generated by compiler with source program
Associates a line number with each error message
Makes a copy of the source program with error messages
Cascade of two processes
Scanning
Processes that do not require the tokenization of input, like, deletion of
comments and compaction of whitespaces
Lexical analysis
Scanner produces the sequence of tokens as output
THE ROLE OF THE LEXICAL ANALYZER
Lexical Analysis versus Parsing
Simplicity of design
Separation of lexical and syntactic analysis often allows at least one of these tasks to be simplified
A parser that has to deal with comments and whitespace is more complex
Compiler efficiency is improved
Allows specialized techniques that serve only the lexical task to be applied
Specialized buffering techniques for reading input
Compiler portability is enhanced
Input device specific peculiarities can be restricted to lexical analyzer
TOKENS, PATTERNS, AND LEXEMES
Token
Pair consisting of a token name and an optional attribute value
Token name – abstract symbol for a lexical unit, like keyword
Pattern
Description of the form that the lexemes of a token may take
If keyword is a token, pattern is a sequence of characters that form the keyword
Lexeme
Sequence of characters in the source program that matches the pattern for a token
Identified by the lexical analyzer as an instance of that token
TOKENS, PATTERNS, AND LEXEMES
Typical tokens in a Programming Language
One token for each keyword
Tokens for the operators
Token representing all identifiers
One or more tokens representing constants, such as numbers and literal
strings
Tokens for each punctuation symbol, such as comma, semicolon, left and
right parentheses
ATTRIBUTES FOR TOKENS
When more than one lexeme matches a pattern, additional
information about the lexeme must be passed
Example : Pattern for token number matches both 0 and 1
So, lexical analyzer returns to the parser both the token name and attribute
value describing the lexeme
Token name influences parsing decisions
Attribute value influences translation of tokens after the parse
Appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier
LEXICAL ERRORS
Lexical analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input
Error recovery strategy
“Panic mode” recovery
Delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
See whether a prefix of the remaining input can be transformed into a valid
lexeme by a single transformation
INPUT BUFFERING
Buffer Pairs
Specialized buffering techniques to reduce the amount of overhead to
process a single input character
Scheme involving two buffers that are alternately reloaded
eof – marks the end of the source file
Two pointers to the input
lexemeBegin – marks beginning of the current lexeme
forward – scans ahead until a pattern match is found
INPUT BUFFERING
Buffer Pairs
Once the next lexeme is determined, forward is set to the character at
its right end
After lexeme is recorded as an attribute value, lexemeBegin is set to
the character immediately after the lexeme just found
To advance forward pointer,
Test whether the end of one of the buffers has been reached
If so, then reload the other buffer from the input
Move forward to the beginning of the newly loaded buffer
INPUT BUFFERING
Sentinels
In previous scheme, each time the forward pointer is advanced,
Must check that we have not moved off one of the buffers
If we do, then reload the other buffer
For each character read, make two tests
End of buffer
Determine what character is read
Combine the buffer-end test with the test for current character, if we extend each buffer to hold a sentinel character at the end
Sentinel is a special character that is not a part of the source program
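A minimal sketch of a buffer pair with sentinels. It assumes the sentinel is the character "\0", that it never occurs in the source text, and that each half holds only N = 4 characters so that reloading is easy to observe. The class name TwoBuffer is hypothetical, and lexemeBegin is left out to keep the sketch short.

```python
import io

EOF = "\0"   # sentinel character (assumed absent from the source text)
N = 4        # size of each buffer half; tiny on purpose

class TwoBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [0..N-1] first half, [N] sentinel, [N+1..2N] second half, [2N+1] sentinel.
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        chunk = self.stream.read(N)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        self.buf[start + len(chunk)] = EOF   # sentinel right after the last real character

    def next_char(self):
        """Return the next character, or None at the true end of input."""
        ch = self.buf[self.forward]
        if ch != EOF:                  # common case: a single test per character
            self.forward += 1
            return ch
        if self.forward == N:          # sentinel at end of first half: reload second
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:  # sentinel at end of second half: reload first
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                    # sentinel inside a half marks end of input

def read_all(text):
    tb = TwoBuffer(io.StringIO(text))
    chars = []
    while True:
        ch = tb.next_char()
        if ch is None:
            return "".join(chars)
        chars.append(ch)

print(read_all("position = rate"))
```

Only when the sentinel is seen does the code do the extra work of deciding between "end of a buffer half" and "end of input".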
SPECIFICATION OF TOKENS
Strings and Languages
Alphabet
Finite set of symbols, e.g., letters, digits and punctuation
Binary alphabet – {0,1}
String over an alphabet
Finite sequence of symbols drawn from that alphabet
Length of string s (|s|) – number of occurrences of symbols in s
Empty string (ε) – string of length zero
Language
Set of strings over some fixed alphabet
Ex :
{ε}, set containing only the empty string
Set of all syntactically well-formed C programs
SPECIFICATION OF TOKENS
Strings and Languages
Prefix of a string s
String obtained by removing zero or more symbols from the end of s
Suffix of string s
String obtained by removing zero or more symbols from the beginning of s
Substring of s
String obtained by deleting any prefix and any suffix from s
Proper prefixes, suffixes and substrings of a string s :
Prefixes, suffixes and substrings of s that are not ε and not equal to s itself
Subsequence of s
Any string formed by deleting zero or more not necessarily consecutive positions of s
Concatenation of x and y (xy)
String formed by appending y to x
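These definitions translate almost directly into code. In the sketch below the empty Python string "" plays the role of ε, and all function names are illustrative.

```python
def prefixes(s):
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(strings, s):
    """Drop the empty string and s itself."""
    return strings - {"", s}

def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)

print(sorted(prefixes("ban")))
print(sorted(proper(suffixes("ban"), "ban")))
print(is_subsequence("bn", "ban"), is_subsequence("nb", "ban"))
```

Note that "bn" is a subsequence of "ban" but not a substring, because the deleted position is in the middle.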
SPECIFICATION OF TOKENS
Operations on Languages
Kleene Closure (L*)
Set of strings obtained by concatenating L zero or more times
L^0 – concatenation of L zero times, that is, {ε}
Positive Closure (L+)
Same as Kleene closure but without the term L^0
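Both closures are infinite for any language containing a non-empty string, so the sketch below computes only finite approximations up to a chosen number of concatenations. The function names and the bound max_i are assumptions made for illustration.

```python
from itertools import product

def power(L, i):
    """L^i: all concatenations of exactly i strings from L; L^0 is {''}."""
    return {"".join(parts) for parts in product(sorted(L), repeat=i)}

def kleene(L, max_i):
    """Finite approximation of L*: the union of L^0 through L^max_i."""
    result = set()
    for i in range(max_i + 1):
        result |= power(L, i)
    return result

def positive(L, max_i):
    """Finite approximation of L+: the union of L^1 through L^max_i."""
    result = set()
    for i in range(1, max_i + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
print(sorted(kleene(L, 2)))
print(sorted(positive(L, 2)))
```

For this L the two results differ only by the empty string, which comes from the term L^0.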
SPECIFICATION OF TOKENS
Regular Expressions
Notation used for describing all languages that can be built from these
operators applied to the symbols of some alphabet
Ex: Language of C identifiers letter (letter | digit)*
Each regular expression r denotes a language L(r), defined recursively
from the languages denoted by r’s sub expressions
Rules that define the RE’s over some alphabet
ε is a regular expression and L(ε) is {ε}
If a is a symbol in the alphabet, then a is a regular expression and L(a) = {a}
If r and s are RE’s denoting languages L(r) and L(s), then
(r)|(s) is a RE denoting the language L(r) ∪ L(s)
(r)(s) is a RE denoting the language L(r)L(s)
(r)* is a RE denoting (L(r))*
(r) is a RE denoting L(r)
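The identifier pattern letter (letter | digit)* can be tried with Python's re module. The sketch assumes that letter means an ASCII letter and digit means 0-9, and follows the slide's pattern, which has no underscore; is_identifier is a hypothetical helper.

```python
import re

# letter (letter | digit)* : union, concatenation and closure in one pattern
identifier = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")

def is_identifier(s):
    # fullmatch requires the whole string, not just a prefix, to match
    return identifier.fullmatch(s) is not None

for candidate in ["rate", "rate60", "60rate", ""]:
    print(repr(candidate), is_identifier(candidate))
```

The pattern uses each rule once: the inner | is a union, placing the letter before the group is concatenation, and * is the closure.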
SPECIFICATION OF TOKENS
Regular Expressions
Parentheses in RE’s may be dropped if we adopt the following conventions
Unary operator * has highest precedence and is left associative
Concatenation has second highest precedence and is left associative
| has lowest precedence and is left associative
Example: (a)|((b)*(c)) may be written as a|b*c
Regular set : Language defined by a RE
Two RE’s are equivalent if they denote the same regular set
Ex: (a|b) = (b|a)
SPECIFICATION OF TOKENS
Regular Definitions
If Σ is an alphabet of symbols, then a regular definition is a sequence
of definitions of the form
d1 → r1
d2 → r2
........
dn → rn
Each di is a new symbol, not in Σ and not the same as any other of the d’s
Each ri is a RE over the alphabet Σ U {d1, d2, ..., di-1}
Avoid recursive definitions by restricting ri to Σ and the previously
defined d’s
Construct a RE over Σ alone for each ri
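A regular definition can be turned into plain regular expressions by substituting earlier names into later bodies. The sketch below is one possible way to do this; the {name} reference syntax and the three sample definitions are assumptions, not part of the slides.

```python
import re

# Hypothetical regular definitions, listed in the order d1, d2, ..., dn.
definitions = [
    ("letter_", "[A-Za-z_]"),
    ("digit", "[0-9]"),
    ("id", "{letter_}({letter_}|{digit})*"),
]

def expand(definitions):
    """Replace each {name} by the already expanded body of an earlier definition."""
    expanded = {}
    for name, body in definitions:
        for earlier, regex in expanded.items():
            body = body.replace("{" + earlier + "}", "(" + regex + ")")
        if "{" in body:
            # A leftover reference means the body uses itself or a later name.
            raise ValueError(f"{name} uses a name that is not yet defined")
        expanded[name] = body
    return expanded

table = expand(definitions)
id_pattern = re.compile(table["id"])
```

Because each body may only mention names defined before it, the substitution always terminates and a recursive definition is rejected.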
SPECIFICATION OF TOKENS
Extensions of Regular Expressions
One or more instances
Unary operator + represents the positive closure of a RE and its language
r* = r+|ε and r+ = rr* = r*r
Zero or one instance
Unary operator ? means “zero or one occurrence”
r? = r|ε or L(r?) = L(r) U {ε}
Character classes
Regular expression a1|a2|...|an can be replaced by [a1a2...an]
[abc] is short form for a|b|c
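These identities can be spot-checked with Python's re module, which supports the same +, ? and character-class extensions. The sketch compares languages only up to a length bound, so it illustrates the identities rather than proving them; an empty group () stands in for ε.

```python
import re
from itertools import product

def language(pattern, alphabet="abc", max_len=4):
    """Strings over the alphabet, up to max_len, that the pattern matches in full."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

# r+ = rr* = r*r, with r = ab
plus_ok = language(r"(ab)+") == language(r"(ab)(ab)*") == language(r"(ab)*(ab)")
# r* = r+ | ε
star_ok = language(r"(ab)*") == language(r"(ab)+|()")
# r? = r | ε
optional_ok = language(r"a?") == language(r"a|()")
# [abc] is short for a|b|c
class_ok = language(r"[abc]") == language(r"a|b|c")
```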
RECOGNITION OF TOKENS
Study how to take patterns for all the needed tokens
Build a piece of code that examines the input string
Find a prefix that is a lexeme matching one of the patterns
Simple form of branching statements and conditional expressions
Terminals of the grammar : if, then, else, relop, id, number
RECOGNITION OF TOKENS
Patterns for the tokens are described using regular definitions
Recognize the token ws so that whitespace can be stripped before tokens reach the parser
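A scanner driven by such patterns might look like the sketch below. The slide's own table of regular definitions is not reproduced in the text, so the patterns here are assumptions modelled on the terminals if, then, else, relop, id and number, plus ws; the function name tokenize is illustrative.

```python
import re

# Assumed patterns; the original table of regular definitions is not in the text.
TOKEN_PATTERNS = [
    ("ws", r"[ \t\n]+"),
    ("if", r"if"),
    ("then", r"then"),
    ("else", r"else"),
    ("relop", r"<=|<>|>=|<|>|="),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"),
]
COMPILED = [(name, re.compile(p)) for name, p in TOKEN_PATTERNS]

def tokenize(text):
    """Return (token, lexeme) pairs; ws lexemes are matched and then dropped."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Keep the longest prefix; on a tie the earlier pattern wins,
            # which is why the keywords are listed before id.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        if best_name != "ws":
            out.append((best_name, text[pos:best_end]))
        pos = best_end
    return out

tokens = tokenize("if x1 <= 3.5E2 then y else z")
```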
RECOGNITION OF TOKENS
Tokens, their patterns and attribute values
TRANSITION DIAGRAMS
Convert patterns into flowcharts called transition diagrams
States
Represent a condition that could occur during the scanning of input
that matches a pattern
Edges
Directed from one state to another
Labelled by a symbol or set of symbols
If in some state s, and the next input symbol is a,
Look for an edge out of state s labelled by a
If such an edge is found, advance the forward pointer and enter the
state to which that edge leads
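The edge-following rule can be sketched with a small table. The diagram below, for unsigned integers, is a hypothetical example with two states and one edge label; EDGES, label_of and run are illustrative names.

```python
# Hypothetical diagram: state 0 --digit--> state 1, and state 1 --digit--> state 1.
EDGES = {
    (0, "digit"): 1,
    (1, "digit"): 1,
}

def label_of(ch):
    """Map an input symbol to the edge label (a set of symbols) it belongs to."""
    return "digit" if ch in "0123456789" else "other"

def run(text, start=0):
    """Follow edges from state 0; return (last state reached, forward pointer)."""
    state, forward = 0, start
    while forward < len(text):
        nxt = EDGES.get((state, label_of(text[forward])))
        if nxt is None:      # no edge out of this state for this symbol
            break
        state = nxt          # enter the state to which the edge leads
        forward += 1         # advance the forward pointer
    return state, forward

result = run("123+4")
```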
TRANSITION DIAGRAMS
Important conventions about transition diagrams
Certain states are said to be final or accepting
Indicate that a lexeme has been found
If there is an action to be taken, such as returning a token and an attribute value to the
parser, attach that action to the accepting state
If it is necessary to retract the forward pointer one position, then
place a * near the accepting state
One state is the start state or initial state
Transition diagram always begins in the start state before any input symbols
have been read
TRANSITION DIAGRAMS
Transition diagram for relop
RECOGNITION OF RESERVED WORDS AND
IDENTIFIERS
Keywords like if or then are reserved: although they look
like identifiers, they are not identifiers
Two ways to handle reserved words that look like identifiers
1. Install the reserved words in the symbol table initially
When an identifier is found, a call to installID places it in the symbol table, if it is not already there, and
returns a pointer to the symbol-table entry
Any identifier not in the symbol table during lexical analysis has the token id
getToken examines the symbol-table entry for the lexeme found and returns
the token name
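The first method can be sketched with a dictionary standing in for the symbol table. The entry layout and the Python names install_id and get_token are assumptions that mirror installID and getToken from the slide.

```python
# Hypothetical symbol table: lexeme -> entry; reserved words are installed first.
symbol_table = {kw: {"lexeme": kw, "token": kw} for kw in ("if", "then", "else")}

def install_id(lexeme):
    """Add the lexeme with token id unless it is already present; return its entry."""
    if lexeme not in symbol_table:
        symbol_table[lexeme] = {"lexeme": lexeme, "token": "id"}
    return symbol_table[lexeme]

def get_token(lexeme):
    """Return the token name recorded in the symbol-table entry for this lexeme."""
    return symbol_table[lexeme]["token"]

entry = install_id("count")      # new identifier, recorded with token id
keyword = install_id("then")     # already present as a reserved word
```

Since the reserved words are in the table before scanning starts, a lexeme such as then is never recorded as an ordinary identifier.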
RECOGNITION OF RESERVED WORDS AND
IDENTIFIERS
Two ways to handle reserved words that look like identifiers
2. Create separate transition diagrams for each keyword
Such a diagram consists of states representing the situation after each
successive letter of the keyword is seen, followed by a test for “nonletter-or-digit”
Necessary to check that the identifier has ended, or else we would return
the token then in situations where the correct token was id (e.g. for the lexeme thenextvalue)
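A keyword diagram of this kind could be coded as below for the keyword then. This is a sketch under the assumption that identifiers consist of letters and digits only; match_then is a hypothetical name, and the state numbers simply count the letters seen so far.

```python
def match_then(text, begin):
    """Return the position just past the lexeme then, or None if the diagram fails."""
    keyword = "then"
    state, forward = 0, begin
    # States 0..3: the number of keyword letters seen so far.
    while state < len(keyword):
        if forward < len(text) and text[forward] == keyword[state]:
            state += 1
            forward += 1
        else:
            return None
    # State 4: test for "nonletter-or-digit"; end of input also passes the test.
    if forward < len(text) and text[forward].isalnum():
        return None      # e.g. thenextvalue: the identifier has not ended
    return forward       # the lookahead symbol is not part of the lexeme

end = match_then("then x", 0)
```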
ARCHITECTURE OF A TRANSITION-DIAGRAM-BASED LEXICAL ANALYZER
Collection of transition diagrams can be used to build a lexical
analyzer
Each state is represented by a piece of code
A variable state holds the number of the current state
A switch based on the value of state takes us to the code for each of
the possible states, where action of that state is found
Code for a state is itself a switch statement or multiway branch that
determines the next state
ARCHITECTURE OF A TRANSITION-DIAGRAM-BASED LEXICAL ANALYZER
Fig : Implementation of relop transition diagram
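The figure itself is not reproduced in the text. As a stand-in, the sketch below shows how a relop diagram might be coded with a state variable and one branch per state. The state numbers, the attribute names LT, LE, EQ, NE, GT, GE and the operator set < <= = <> > >= are assumptions, and Python if/elif chains take the place of a switch statement.

```python
def get_relop(text, begin):
    """Return (("relop", attribute), forward), or None if no relop starts at begin."""
    state, forward = 0, begin
    while True:
        # Read the next symbol and advance; "" stands for end of input.
        c = text[forward] if forward < len(text) else ""
        forward += 1
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward      # accepting state 5
            elif c == ">":
                state = 6
            else:
                return None                          # fail(): not a relop
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward      # accepting state 2
            if c == ">":
                return ("relop", "NE"), forward      # accepting state 3
            return ("relop", "LT"), forward - 1      # state 4*, retract one position
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), forward      # accepting state 7
            return ("relop", "GT"), forward - 1      # state 8*, retract one position

token, forward = get_relop("<= b", 0)
```

The two starred states return forward - 1 because the symbol that ended the lexeme was read but does not belong to it.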
ARCHITECTURE OF A TRANSITION-DIAGRAM-BASED LEXICAL ANALYZER
Ways in which the code could fit into the entire lexical analyzer
Arrange the transition diagrams for each token to be tried sequentially
Function fail() resets the forward pointer and starts the next transition diagram
Allows us to use transition diagrams for the individual keywords
Run various transition diagrams “in parallel”
Feed the next input character to all of them and allow each one to make the
transitions required
Must be careful to resolve the case where
One diagram finds a lexeme that matches its pattern
While one or more other diagrams are still able to process the input
Combine all transition diagrams into one
Allow the diagram to read input until there is no possible next state
Take the longest lexeme that matched any pattern
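The "in parallel" option, together with the longest-lexeme rule, can be sketched as follows. The two diagrams (the keyword if and id), their dictionary layout and the function names are assumptions made for illustration.

```python
def make_literal(word):
    """Diagram accepting exactly `word`; state i means i characters have matched."""
    def step(state, ch):
        return state + 1 if state < len(word) and word[state] == ch else None
    return {"start": 0, "accepting": {len(word)}, "step": step}

def id_step(state, ch):
    """Diagram for id: a letter, then any number of letters or digits."""
    if state == 0:
        return 1 if ch.isalpha() else None
    return 1 if ch.isalnum() else None

# Listed in priority order: the keyword diagram comes before id.
DIAGRAMS = [
    ("if", make_literal("if")),
    ("id", {"start": 0, "accepting": {1}, "step": id_step}),
]

def longest_match(text, begin):
    """Feed each character to every live diagram; remember the last accept seen."""
    live = [(name, d, d["start"]) for name, d in DIAGRAMS]
    best = None                          # (token name, end position)
    forward = begin
    while live and forward < len(text):
        ch = text[forward]
        forward += 1
        survivors = []
        for name, d, state in live:
            nxt = d["step"](state, ch)
            if nxt is not None:          # this diagram can still process input
                survivors.append((name, d, nxt))
        live = survivors
        for name, d, state in live:      # the first listed diagram wins a tie
            if state in d["accepting"]:
                best = (name, forward)
                break
    return best

result = longest_match("iffy", 0)
```

On the input iffy the keyword diagram accepts after two characters, but the id diagram keeps going, so the longer lexeme is the one reported.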