
Chapter 9
Compilers and Language
Translation
The Compilation Process




Phase I: Lexical analysis
Phase II: Parsing
Phase III: Semantics and code
generation
Phase IV: Code Optimization
Introduction



High-level languages are more
difficult to “translate” than assembly
languages.
Assembly language and machine
language are related 1-to-1.
The relationship between a high-level
language and machine language is 1-to-many.
Compiler


The piece of software that
translates high-level
programming language codes
into machine language codes.
Two distinct goals of a compiler:
• Correctness
• Efficiency and conciseness
Example: 2x0+2x1+…+2x50000
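The example above can be read as the sum 2·x0 + 2·x1 + … + 2·x50000 (this reading of the notation is an assumption). A minimal sketch of the two goals: both functions below are correct, but the second does one multiplication instead of 50,001. The function names and the data are hypothetical.

```python
# Hypothetical data: the slide gives no values for x0..x50000.
xs = list(range(50001))

def naive_sum(values):
    # Correct, but performs one multiplication per term.
    total = 0
    for v in values:
        total += 2 * v
    return total

def factored_sum(values):
    # Same result: factor out the 2, add first, multiply once at the end.
    total = 0
    for v in values:
        total += v
    return 2 * total
```

Both calls, naive_sum(xs) and factored_sum(xs), return the same value; only the amount of work differs.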
The Compilation Process
Scanner → Parser → Code Generator → Optimizer → Object file
Lexical Analysis


The compiler examines the individual
characters in the source program and
groups them into syntactical units,
called tokens, that will be analyzed in
succeeding stages.
Analogous to grouping letters into
words prior to analyzing text.
Parsing


During this stage the sequence of
tokens formed by the scanner is
checked to see whether it is
syntactically correct according to the
rules of the programming language.
Equivalent to checking whether the
words in the text form grammatically
correct sentences.
Semantic Analysis and Code
Generation

If the high-level language statement
is structurally correct, then the
compiler analyzes its meaning and
generates the proper sequence of
machine language instructions to
carry out these actions.
Code Optimization

The compiler takes the generated
code and sees whether it can be made
more efficient, either by making it
run faster or by having it occupy less
memory.
Phase I: Lexical Analysis



Scanner, or lexical analyzer, groups
input characters into tokens.
Example:
a = b + 319 - delta;
The scanner discards nonessential
characters, such as blanks and tabs,
and then groups the remaining
characters into high-level syntactic
units such as symbols, numbers,
and operators.
Token Classifications


Token type    Classification number
symbol        1
number        2
=             3
+             4
-             5
;             6
==            7
if            8
else          9
(             10
)             11
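A minimal sketch of a scanner that uses the classification numbers above. The function name scan and the rule that a symbol is a letter followed by letters or digits are assumptions made for illustration.

```python
# Classification numbers follow the token table on this slide.
FIXED_TOKENS = {"=": 3, "+": 4, "-": 5, ";": 6, "==": 7,
                "if": 8, "else": 9, "(": 10, ")": 11}

def scan(source):
    """Return a list of (text, classification number) pairs."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch in " \t\n":                  # discard blanks and tabs
            i += 1
        elif ch.isalpha():                 # symbol (1) or keyword (8, 9)
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            word = source[i:j]
            tokens.append((word, FIXED_TOKENS.get(word, 1)))
            i = j
        elif ch.isdigit():                 # number (2)
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append((source[i:j], 2))
            i = j
        elif source[i:i + 2] == "==":      # two-character operator first
            tokens.append(("==", 7))
            i += 2
        elif ch in FIXED_TOKENS:           # single-character tokens
            tokens.append((ch, FIXED_TOKENS[ch]))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens
```

For the statement a = b + 319 - delta; this yields the classification sequence 1, 3, 1, 4, 2, 5, 1, 6.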
Phase II: Parsing


During the parsing phase, a compiler
determines whether the tokens
recognized by the scanner fit
together in a grammatically
meaningful way.
Analogous to the operation of
“diagramming a sentence”.
Example

To prove the
sequence of words:
The man bit the
dog
is a correctly formed
sentence.
Another Example
The man bit the
Programming Language
Example

Statement: a = b + c
Parse Tree


The structure shown in the previous
example is called a parse tree.
It starts from the individual tokens
a, =, b, +, c and shows how these
tokens can be grouped together into
predefined grammatical categories
such as <symbol>, <addition
operator>, and <expression> until
the desired goal is reached (in this
case, <assignment statement>).
Grammars, Languages and BNF



How does a parser know how to
construct the parse tree?
The parser must be given a formal
description of the syntax, the
grammatical structure, of the
language that it is going to analyze.
The most widely used notation for
representing the syntax of a
programming language is called BNF,
an acronym for Backus-Naur Form.
BNF



The syntax of a language is specified
as a set of rules, also called
productions.
The entire collection of rules is called
a grammar.
BNF rule:
left-hand side ::= “definition”
BNF Example


<assignment statement>::=<symbol>=<expression>
The rule says that the syntactical
construct called <assignment
statement> is defined as a
<symbol> followed by the token =
followed by the syntactical construct
called <expression>
Terminal/Nonterminals

BNF uses two types of objects on the
right-hand side of a production:
• Terminals: actual tokens of the
language recognized and returned by a
scanner.
• Nonterminals: an intermediate
grammatical category used to help
explain and organize the language.
Goal Symbol



The goal symbol is the highest-level
nonterminal.
When the goal symbol has been
produced, the parser has finished
building the tree, and the statements
have been successfully parsed.
The collection of all statements that
can be successfully parsed is called
the language defined by a grammar.
Meta-symbols


Meta-symbol: a symbol used to describe the
characteristics of another language.
BNF has five meta-symbols:
<
>
::=
| : OR
Ex: <digit>::=0|1|2|3|4|5|6|7|8|9
Λ : null string
Ex: <signed integer>::=<sign><number>
<sign>::= +|-|Λ
Fundamental Rule of Parsing

If, by repeated applications of the
rules of the grammar, a parser can
convert the sequence of input tokens
into the goal symbol, then that
sequence of tokens is a syntactically
valid statement of the language.
Example

A three-rule grammar
1. <sentence>::=<noun><verb>
2. <noun>::= bees|dogs
3. <verb>::=buzz|bite
• Example 1: Dogs bite.
• Example 2: Bees dogs.
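The three rules above are small enough to check directly. A minimal sketch, assuming the input is split on blanks and the trailing period is ignored; the name is_sentence is hypothetical.

```python
NOUNS = {"bees", "dogs"}   # rule 2: <noun>::= bees|dogs
VERBS = {"buzz", "bite"}   # rule 3: <verb>::= buzz|bite

def is_sentence(text):
    # Rule 1: <sentence>::=<noun><verb>
    words = text.lower().rstrip(".").split()
    return len(words) == 2 and words[0] in NOUNS and words[1] in VERBS
```

Here is_sentence("Dogs bite.") is True, while is_sentence("Bees dogs.") is False because "dogs" cannot be reduced to <verb>.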
Another Example

Grammar for a simplified
assignment statement
1. <assignment statement>::=<variable>=<expression>
2. <expression>::=<variable>|<variable>+<variable>
3. <variable>::= x|y|z
Generated Parse Tree
Wrong Path
How to parse?


Parsing is a complex sequence of
applying rules, building
grammatical constructs, and seeing
whether things are moving toward
the correct answer (the goal
symbol). If not, the parser “undoes”
the rule just applied and tries another.
Look-ahead parsing algorithm:
“looking down the road” a few tokens
to see what would happen if a
certain choice were made.
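The "apply a rule, undo it if it fails, try another" idea can be sketched with a small backtracking recognizer for the simplified assignment grammar. This sketch works top-down from the goal symbol, which is a swapped-in technique chosen for brevity, not the bottom-up construction drawn on the slides; the names GRAMMAR, match, and parses are hypothetical.

```python
# Each nonterminal maps to its alternatives; each alternative is a
# sequence of terminals and nonterminals.
GRAMMAR = {
    "<assignment statement>": [["<variable>", "=", "<expression>"]],
    "<expression>": [["<variable>"], ["<variable>", "+", "<variable>"]],
    "<variable>": [["x"], ["y"], ["z"]],
}

def match(symbols, tokens):
    """True if the token list can be derived from the symbol sequence."""
    if not symbols:
        return not tokens                  # success only if all tokens are used
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                   # nonterminal: try each alternative
        for alternative in GRAMMAR[first]:
            if match(alternative + rest, tokens):
                return True
        return False                       # every choice failed: undo and report
    # Terminal: it must equal the next input token.
    return bool(tokens) and tokens[0] == first and match(rest, tokens[1:])

def parses(statement):
    # Assumes tokens are separated by blanks, e.g. "x = y + z".
    return match(["<assignment statement>"], statement.split())
```

For x = y + z the first alternative of <expression> consumes only y and fails with tokens left over, so the recognizer backs up and succeeds with the second alternative. For x = y + z + x no alternative works and parses returns False.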
Example
Not
possible to
build a
parse tree
with the
grammar.
Major Challenge

Design a grammar that:
• Includes every valid statement that we
want to be in the language
• Excludes every invalid statement that
we do not want to be in the language
Assignment Statement (2nd try)
1. <assignment statement>::=<variable>=<expression>
2. <expression>::=<variable>|<expression>+<expression> (recursive definition)
3. <variable>::= x|y|z
Resulting Parse Tree
Using Recursive Definition
Validity vs. Ambiguity


It is possible to construct two parse
trees for x=x+y+z using the 2nd
grammar, giving two different meanings:
x=(x+y)+z
x=x+(y+z)
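The ambiguity can be made visible by enumerating every parse tree the recursive rule allows for the right-hand side. A minimal sketch, in which trees are written as parenthesized strings and the name expression_trees is hypothetical:

```python
def expression_trees(tokens):
    """All parse trees for <expression>::=<variable>|<expression>+<expression>."""
    if len(tokens) == 1 and tokens[0] in ("x", "y", "z"):
        return [tokens[0]]                 # <expression>::=<variable>
    trees = []
    for i, tok in enumerate(tokens):
        if tok == "+":                     # try every + as the top-level operator
            for left in expression_trees(tokens[:i]):
                for right in expression_trees(tokens[i + 1:]):
                    trees.append(f"({left}+{right})")
    return trees
```

Calling expression_trees(["x", "+", "y", "+", "z"]) produces two trees, one grouping x+y first and one grouping y+z first, which is exactly the ambiguity described above.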
If-else grammar
Parse Tree
Phase III: Semantics and Code
Generation
1. <sentence>::=<noun><verb>
2. <noun>::= bees|dogs
3. <verb>::=buzz|bite

Possible combinations:
• Dogs bite.
• Dogs buzz.
• Bees bite.
• Bees buzz.
Not all combinations make sense.
Semantics and Code
Generation


A compiler examines the semantics
of a programming language
statement. It analyzes the meaning
of the tokens and tries to understand
the actions they perform.
If the statement is meaningless, it is
semantically rejected. Otherwise it is
translated into machine language.
Example


The statement
sum=a+b;
is syntactically correct.
But what if the variables are defined
as follows:
char a;
double b;
int sum;
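A minimal sketch of the kind of check this example calls for. It assumes a strict rule that both operands and the target must have the same type; that rule is an assumption for illustration (a real language may instead convert types automatically), and the names SYMBOL_TABLE and check_assignment are hypothetical.

```python
# Declarations from the slide: char a; double b; int sum;
SYMBOL_TABLE = {"a": "char", "b": "double", "sum": "int"}

def check_assignment(target, left, right, symbols):
    """Return a list of semantic problems for: target = left + right."""
    problems = []
    if symbols[left] != symbols[right]:
        problems.append(f"cannot add {symbols[left]} and {symbols[right]}")
    elif symbols[target] != symbols[left]:
        problems.append(f"cannot store {symbols[left]} in {symbols[target]}")
    return problems
```

Under this assumed rule, check_assignment("sum", "a", "b", SYMBOL_TABLE) reports that a char and a double cannot be added, even though the statement is syntactically correct.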
Semantic Records

Each nonterminal symbol is
associated with a semantic record, a
data structure that stores
information about a nonterminal,
such as the actual name of the
object and its data type.
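A semantic record could be represented as a small data structure like the one below. The field names are assumptions; the slide only says the record holds information such as the object's name and data type.

```python
from dataclasses import dataclass

@dataclass
class SemanticRecord:
    nonterminal: str   # grammatical category, e.g. "<variable>"
    name: str          # actual name of the object, e.g. "a"
    data_type: str     # data type of the object, e.g. "char"

# Hypothetical record for a variable declared as: char a;
a_record = SemanticRecord("<variable>", "a", "char")
```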
Semantic Records (II)

Grows gradually.
Another Situation
Two-Stage Process


Semantic analysis: a pass over the
parse tree to determine whether all
branches of the tree are semantically
valid.
Code generation: the compiler makes
a 2nd pass over the parse tree to
produce the translated code.
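The two passes can be sketched on a tiny parse tree for x = y + z. The tuple representation of the tree, the declarations in TYPES, and the LOAD/ADD/STORE instruction names are all assumptions made for illustration.

```python
# Parse tree as nested tuples: (operator, left, right).
TREE = ("=", "x", ("+", "y", "z"))
TYPES = {"x": "int", "y": "int", "z": "int"}   # hypothetical declarations

def analyze(node, types):
    """Pass 1: return the type of a branch, or raise if it is invalid."""
    if isinstance(node, str):
        if node not in types:
            raise NameError(f"{node} is not declared")
        return types[node]
    _, left, right = node
    left_type, right_type = analyze(left, types), analyze(right, types)
    if left_type != right_type:
        raise TypeError(f"type mismatch: {left_type} vs {right_type}")
    return left_type

def generate(node):
    """Pass 2: emit instructions for target = a + b."""
    _, target, (_, a, b) = node
    return [f"LOAD {a}", f"ADD {b}", f"STORE {target}"]

def compile_statement(tree, types):
    analyze(tree, types)       # first pass: semantic analysis
    return generate(tree)      # second pass: code generation
```

With the declarations above, compile_statement(TREE, TYPES) returns the three instructions LOAD y, ADD z, STORE x. If the types disagree, the first pass raises an error and no code is generated.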
Code Optimization

To make the code more efficient:
• Local optimization
• Global optimization

This differs from programmer
optimization, which is supported by
tools such as:
• Visual development environments
• On-line debuggers
• Reusable code libraries
Local Optimization


Look at a very small block of
instructions and try to improve it.
Possible approaches
• Constant evaluation: x=1+1;
• Strength reduction: x=x*2;
• Eliminating unnecessary operations
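The first two approaches can be sketched on statements written as (target, left, operator, right) tuples. The tuple form, the name optimize, and the choice to rewrite x*2 as x+x are assumptions for illustration.

```python
def optimize(statement):
    """Apply two local optimizations to a (target, left, op, right) tuple."""
    target, left, op, right = statement
    # Constant evaluation: both operands are known at compile time.
    if isinstance(left, int) and isinstance(right, int) and op in ("+", "*"):
        value = left + right if op == "+" else left * right
        return (target, value)
    # Strength reduction: replace multiplication by 2 with an addition.
    if op == "*" and right == 2:
        return (target, left, "+", left)
    return statement           # nothing to improve
```

Here optimize(("x", 1, "+", 1)) becomes ("x", 2), so the addition is done once at compile time, and optimize(("x", "x", "*", 2)) becomes ("x", "x", "+", "x").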
Global Optimization


Look at large segments of the program
and decide how to improve
performance.
A much harder problem.