ITS 015: Compiler Construction
Compilers
Week 1: 2009/09/01
권혁철
Assignment
It might be the biggest program you’ve ever
written.
It cannot be done the day it’s due!
Follow the course as laid out in the syllabus.
This is an ABEEK design course.
Team report: each member’s role must be stated clearly, and every
member must understand the structure and content of the whole system.
Design under the given constraints.
Creativity is required.
Team Organization / Assignment
Team organization
Teams of two.
The two members develop the compiler jointly, and both must
understand the entire program.
Assignment
Follow the general process for building a compiler.
Add your own ideas within the constraints.
Constraints
Type declarations, the four arithmetic operations, if statements, and
for statements must be included.
You may generate virtual assembly code and write an interpreter for it.
Adding subprograms or other features earns extra credit after evaluation.
Producing output in Pentium assembly or byte-code that actually runs
earns extra credit.
Language choice, grammar choice, language definition, and so on are
up to you.
You may use Lex or Yacc.
If you do not use them, you earn extra credit; the parse table may
still be taken from Yacc.
Making Languages Usable
It was our belief that if FORTRAN, during its first
months, were to translate any reasonable “scientific”
source program into an object program only half as
fast as its hand-coded counterpart, then acceptance
of our system would be in serious danger... I believe
that had we failed to produce efficient programs, the
widespread use of languages like FORTRAN would
have been seriously delayed.
18 person-years to complete!!!
— John Backus
Compiler construction
Compiler writing is perhaps the most
pervasive topic in computer science,
involving many fields:
Programming languages
Architecture
Theory of computation
Algorithms
Software engineering
In this course, you will put everything you
have learned together. Exciting, right??
A quick question
Consider the grammar shown below (<S> is the start symbol).
Circle the strings below that are in the language described by
the grammar. There may be zero or more correct answers.
Grammar:
<S> ::= <A> a <B> b
<A> ::= b <A> | b
<B> ::= <A> a | a
Strings:
A) baab B) bbbabb
C) bbaaaa D) baaabb
E) bbbabab
Compose a grammar for the language consisting of sentences with
some number of a’s followed by the same number of b’s.
For example, aaabbb is in the language, aabbb is not, and the empty
string is not in the language.
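One way to check the membership question mechanically: expanding the nonterminals by hand, <A> derives one or more b’s, and <B> derives either b’s followed by a or just a, so <S> derives exactly the strings of the form b+ a b* a b. A sketch in Python (our choice of language for illustration, not part of the course materials):

```python
import re

# Expanding the grammar by hand: <A> => b+, and <B> => <A> a | a => b* a,
# so <S> => <A> a <B> b => b+ a b* a b.  This particular grammar happens
# to describe a regular language, so a regex suffices as a checker.
S_LANGUAGE = re.compile(r"b+ab*ab")

def in_language(s: str) -> bool:
    """Return True iff s is derivable from <S>."""
    return S_LANGUAGE.fullmatch(s) is not None

# Checking the five candidate strings:
for s in ["baab", "bbbabb", "bbaaaa", "baaabb", "bbbabab"]:
    print(s, in_language(s))
# → only "baab" (A) and "bbbabab" (E) are in the language
```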
What is a compiler?
[Diagram: Source Program → Compiler → Target Program, with Error Messages as a side output]
The source language might be
General purpose, e.g. C or Pascal
A “little language” for a specific domain, e.g. SIML
The target language might be
Some other programming language
The machine language of a specific machine
Related terminology
What is an interpreter?
A program that reads an executable
program and produces the results of
executing that program
Target Machine: machine on which compiled
program is to be run
Cross-Compiler: compiler that runs on a
different type of machine than is its target
Compiler-Compiler: a tool to simplify the
construction of compilers (YACC/JCUP)
Is it hard??
In the 1950s, compiler writing took an
enormous amount of effort.
The first FORTRAN compiler took 18 person-years
Today, though, we have very good software
tools
You will write your own compiler in a team of 3 in
one semester!
Intrinsic interest
Compiler construction involves ideas from
many different parts of computer science:
Artificial intelligence: greedy algorithms, heuristic search techniques
Algorithms: graph algorithms, union-find, dynamic programming
Theory: DFAs & PDAs, pattern matching, fixed-point algorithms
Systems: allocation & naming, synchronization, locality
Architecture: pipeline & hierarchy management, instruction set use
Intrinsic merit
Compiler construction poses challenging and interesting
problems:
Compilers must do a lot but also run fast
Compilers have primary responsibility for run-time
performance
Compilers are responsible for making it acceptable to use
the full power of the programming language
Computer architects perpetually create new challenges for
the compiler by building more complex machines
Compilers must hide that complexity from the programmer
Success requires mastery of complex interactions
High-level View of a Compiler
[Diagram: Source code → Compiler → Machine code, with Errors as a side output]
Implications:
Must recognize legal (and illegal) programs
Must generate correct code
Must manage storage of all variables (and code)
Must agree with OS & linker on format for object code
Two Pass Compiler
We break compilation into two phases:
ANALYSIS breaks the program into pieces and creates an
intermediate representation of the source program.
SYNTHESIS constructs the target program from the
intermediate representation.
Sometimes we call the analysis part the FRONT END
and the synthesis part the BACK END of the compiler.
They can be written independently.
Traditional Two-pass Compiler
[Diagram: Source code → Front End → IR → Back End → Machine code, with Errors as a side output]
Implications:
Use an intermediate representation (IR)
Front end maps legal source code into IR
Back end maps IR into target machine code
Admits multiple front ends & multiple passes
(better code)
Typically, front end is O(n) or O(n log n), while back end is NP-Complete
A Common Fallacy
[Diagram: Fortran, Scheme, Java, and Smalltalk front ends all feeding shared back ends for SPARC, i86, and Power PC]
Can we build n x m compilers with n+m components?
Must encode all language-specific knowledge in each front end
Must encode all features in a single IR
Must encode all target-specific knowledge in each back end
Limited success in systems with very low-level IRs
Source code analysis
Analysis is important for many applications
besides compilers:
STRUCTURE EDITORS try to fill out syntax units as
you type
PRETTY PRINTERS highlight comments, indent
your code for you, and so on
STATIC CHECKERS try to find programming bugs
without actually running the program
INTERPRETERS don’t bother to produce target
code, but just perform the requested operations
(e.g. Matlab)
Source code analysis
Analysis comes in three phases:
LINEAR ANALYSIS processes characters left-to-right and groups them into TOKENS
HIERARCHICAL ANALYSIS groups tokens
hierarchically into nested collections of tokens
SEMANTIC ANALYSIS makes sure the program
components fit together, e.g. variables should be
declared before they are used
Linear (lexical) analysis
The linear analysis stage is called LEXICAL ANALYSIS or
SCANNING.
Example:
position = initial + rate * 60
gets translated as:
1. The IDENTIFIER “position”
2. The ASSIGNMENT SYMBOL “=”
3. The IDENTIFIER “initial”
4. The PLUS OPERATOR “+”
5. The IDENTIFIER “rate”
6. The MULTIPLICATION OPERATOR “*”
7. The NUMERIC LITERAL 60
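The scanning step above can be sketched as a table of token classes, each described by a regular expression (a Python stand-in for illustration; the course itself suggests Lex or a hand-written scanner, and the token names here are our own):

```python
import re

# A minimal scanner sketch.  Each token class is a regular expression;
# a master pattern tries them in order at each position.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),           # numeric literal
    ("ID",     r"[A-Za-z_]\w*"),  # identifier
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),           # whitespace: matched but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src)
            if m.lastgroup != "SKIP"]

tokenize("position = initial + rate * 60")
# → [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'),
#    ('PLUS', '+'), ('ID', 'rate'), ('TIMES', '*'), ('NUMBER', '60')]
```

This mirrors what a Lex specification does: the token definitions are data, and the matching loop is generic.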
Hierarchical (syntax) analysis
The hierarchical stage is called SYNTAX
ANALYSIS or PARSING.
The hierarchical structure of the source
program can be represented by a PARSE
TREE, for example:
assignment statement
 ├── identifier (position)
 ├── =
 └── expression
      ├── expression
      │    └── identifier (initial)
      ├── +
      └── expression
           ├── expression
           │    └── identifier (rate)
           ├── *
           └── expression
                └── 60
Syntax analysis
The hierarchical structure of the syntactic
units in a programming language is normally
represented by a set of recursive rules.
Example for expressions:
1. Any identifier is an expression
2. Any number is an expression
3. If expression1 and expression2 are expressions, so are:
   expression1 + expression2
   expression1 * expression2
   ( expression1 )
Syntax analysis
Example for statements:
1. If identifier1 is an identifier and expression2 is an
   expression, then identifier1 = expression2 is a
   statement.
2. If expression1 is an expression and statement2 is
   a statement, then the following are statements:
   while ( expression1 ) statement2
   if ( expression1 ) statement2
Lexical vs. syntactic analysis
Generally if a syntactic unit can be
recognized in a linear scan, we convert it into
a token during lexical analysis.
More complex syntactic units, especially
recursive structures, are normally processed
during syntactic analysis (parsing).
Identifiers, for example, can be recognized
easily in a linear scan, so identifiers are
tokenized during lexical analysis.
Source code analysis
It is common to convert complex parse trees to
simpler SYNTAX TREES, with a node for each
operator and children for the operands of each
operator.
position = initial + rate * 60
  ↓ analysis
=
 ├── position
 └── +
      ├── initial
      └── *
           ├── rate
           └── 60
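The conversion from a token stream to such a syntax tree can be sketched as a small recursive-descent parser. One assumption to flag: the expression grammar given earlier is ambiguous as stated, so this sketch bakes in the usual precedence, with * binding tighter than +:

```python
import re

def parse_assignment(src):
    """Parse 'id = expr' with + and *, returning a nested-tuple syntax
    tree with one node per operator.  A sketch with no error handling;
    operator precedence (* over +) is an assumption the slide grammar
    leaves open."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=+*()]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                      # identifier, number, or ( expr )
        tok = advance()
        if tok == "(":
            node = expr()
            advance()                  # consume ")"
            return node
        return int(tok) if tok.isdigit() else tok

    def term():                        # handles *, the higher precedence
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():                        # handles +, the lower precedence
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = advance()                 # the identifier being assigned
    advance()                          # consume "="
    return ("=", target, expr())

parse_assignment("position = initial + rate * 60")
# → ("=", "position", ("+", "initial", ("*", "rate", 60)))
```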
Semantic analysis
The semantic analysis stage:
Checks for semantic errors, e.g. undeclared variables
Gathers type information
Determines the operators and operands of expressions
Example: if rate is a float, the integer literal 60 should be
converted to a float before multiplying.
=
 ├── position
 └── +
      ├── initial
      └── *
           ├── rate
           └── inttoreal
                └── 60
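The coercion step can be sketched as a tree walk that consults a symbol table for identifier types. The tuple encoding of the tree and the symbol-table contents are illustrative assumptions; a real semantic analyzer also checks declarations, scopes, and more:

```python
# Declared types for the example (an assumed symbol table).
SYMBOLS = {"position": "float", "initial": "float", "rate": "float"}

def type_of(node):
    if isinstance(node, int):
        return "int"
    if isinstance(node, str):
        return SYMBOLS[node]            # identifier: its declared type
    if node[0] == "inttoreal":
        return "float"
    return type_of(node[1])             # operator node: operands agree after coercion

def coerce(node):
    """Wrap the int operand of a mixed int/float + or * in an inttoreal node."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = coerce(left), coerce(right)
    if op in ("+", "*") and type_of(left) != type_of(right):
        if type_of(left) == "int":
            left = ("inttoreal", left)
        else:
            right = ("inttoreal", right)
    return (op, left, right)

coerce(("=", "position", ("+", "initial", ("*", "rate", 60))))
# → ("=", "position", ("+", "initial", ("*", "rate", ("inttoreal", 60))))
```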
The rest of the process
[Diagram: source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program, with the symbol-table manager and error handler connected to every phase]
Symbol-table management
During analysis, we record the identifiers used in the
program.
The symbol table stores each identifier with its
ATTRIBUTES.
Example attributes:
How much STORAGE is allocated for the id
The id’s TYPE
The id’s SCOPE
For functions, the PARAMETER PROTOCOL
Some attributes can be determined immediately;
some are delayed.
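A minimal symbol table supporting attributes and nested scopes might look like the following sketch (the interface names are assumptions for illustration, not a required design):

```python
class SymbolTable:
    """A stack of scopes; lookups search from the innermost outward."""

    def __init__(self):
        self.scopes = [{}]                 # scopes[0] is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attrs):      # e.g. type="float", storage=4
        self.scopes[-1][name] = attrs

    def lookup(self, name):                # innermost declaration wins
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                        # undeclared identifier

syms = SymbolTable()
syms.declare("rate", type="float", storage=4)
syms.lookup("rate")
# → {"type": "float", "storage": 4}
```

Attributes that are delayed (e.g. storage offsets) can simply be added to the entry's dict when they become known.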
Error detection
Each compilation phase can have errors
Normally, we want to keep processing after
an error, in order to find more errors.
Each stage has its own characteristic errors,
e.g.
Lexical analysis: a string of characters that do not
form a legal token
Syntax analysis: unmatched { } or missing ;
Semantic: trying to add a float and a pointer
Internal Representations
Each stage of processing transforms a representation of the source program into a new representation:

position = initial + rate * 60

  ↓ lexical analyzer (symbol table: 1 position, 2 initial, 3 rate)

id1 = id2 + id3 * 60

  ↓ syntax analyzer

=
 ├── id1
 └── +
      ├── id2
      └── *
           ├── id3
           └── 60

  ↓ semantic analyzer

=
 ├── id1
 └── +
      ├── id2
      └── *
           ├── id3
           └── inttoreal
                └── 60
Intermediate code generation
Some compilers explicitly create an
intermediate representation of the source
code program after semantic analysis.
The representation is a program for an
abstract machine.
The most common representation is “three-address code”,
in which all memory locations are treated as registers;
most instructions apply an operator to two operand
registers and store the result in a destination register.
Intermediate code generation
The syntax tree produced by the semantic analyzer:

=
 ├── position
 └── +
      ├── initial
      └── *
           ├── rate
           └── inttoreal
                └── 60

becomes:

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
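The translation above can be sketched as a post-order walk of the syntax tree: operands are emitted first, then an instruction that combines them into a fresh temporary. The nested-tuple tree encoding is our own assumption; the instruction syntax follows the slide:

```python
import itertools

def gen_3ac(tree):
    """Flatten a tuple-encoded syntax tree into three-address code."""
    code = []
    temps = itertools.count(1)

    def gen(node):                      # returns the name holding node's value
        if isinstance(node, (int, float)):
            return str(node)
        if isinstance(node, str):
            return node                 # identifier: already addressable
        if node[0] == "inttoreal":
            t = f"temp{next(temps)}"
            code.append(f"{t} := inttoreal({gen(node[1])})")
            return t
        if node[0] == "=":
            code.append(f"{node[1]} := {gen(node[2])}")
            return node[1]
        op, left, right = node
        l, r = gen(left), gen(right)    # operands first (post-order)
        t = f"temp{next(temps)}"
        code.append(f"{t} := {l} {op} {r}")
        return t

    gen(tree)
    return code

gen_3ac(("=", "id1", ("+", "id2", ("*", "id3", ("inttoreal", 60)))))
# → ["temp1 := inttoreal(60)", "temp2 := id3 * temp1",
#    "temp3 := id2 + temp2", "id1 := temp3"]
```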
The Optimizer (or Middle End)
[Diagram: IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR, with Errors as a side output]
Modern optimizers are structured as a series of passes
Typical Transformations
(improving performance, code size, power consumption, etc.)
Discover & propagate some constant value
Move a computation to a less frequently executed place
Specialize some computation based on context
Discover a redundant computation & remove it
Remove useless or unreachable code
Encode an idiom in some particularly efficient form
Code optimization
At this stage, we improve the code to make it run
faster.
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1
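Two of these improvements, folding inttoreal over an integer literal and propagating known values through later uses, can be sketched over tuple-encoded instructions. The (dest, op, *operands) encoding is an assumption for illustration; a fuller optimizer would also merge the final copy away and renumber temporaries, reaching the two-instruction form shown above:

```python
def optimize(code):
    """Constant-fold inttoreal over int literals and propagate copies.
    A sketch of two transformations, not a complete optimizer."""
    env, out = {}, []
    for dest, op, *args in code:
        args = [env.get(a, a) for a in args]      # substitute known values
        if op == "inttoreal" and isinstance(args[0], int):
            env[dest] = float(args[0])            # fold: temp becomes a float constant
        elif op == "copy":
            env[dest] = args[0]                   # record the copy, emit nothing yet
        else:
            out.append((dest, op, *args))
    for dest, val in env.items():                 # keep assignments to real variables
        if not dest.startswith("temp"):
            out.append((dest, "copy", val))
    return out

optimize([("temp1", "inttoreal", 60),
          ("temp2", "*", "id3", "temp1"),
          ("temp3", "+", "id2", "temp2"),
          ("id1", "copy", "temp3")])
# → [("temp2", "*", "id3", 60.0),
#    ("temp3", "+", "id2", "temp2"),
#    ("id1", "copy", "temp3")]
```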
Code generation
In the final stage, we take the three-address code
(3AC) or other intermediate representation and
convert it to the target language.
We must pick memory locations for variables and
allocate registers.
temp1 := id3 * 60.0
id1 := id2 + temp1
code generator
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The Back End
[Diagram: IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code, with Errors as a side output]
Responsibilities:
Translate IR into target machine code
Choose instructions to implement each IR operation
Decide which values to keep in registers
Ensure conformance with system interfaces
Automation has been less successful in the back end
The Back End
[Diagram: IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code, with Errors as a side output]
Instruction Selection
Produce fast, compact code
Take advantage of target features such as
addressing modes
Usually viewed as a pattern-matching problem
(ad hoc methods, pattern matching, dynamic
programming)
The Back End
[Diagram: IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code, with Errors as a side output]
Register Allocation
Have each value in a register when it is used
Manage a limited set of resources
Can change instruction choices & insert LOADs & STOREs
Optimal allocation is NP-Complete (for 1 or k registers)
Compilers approximate solutions to NP-Complete problems
The Back End
[Diagram: IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code, with Errors as a side output]
Instruction Scheduling
Avoid hardware stalls and interlocks
Use all functional units productively
Can increase lifetime of variables (changing the allocation)
Optimal scheduling is NP-Complete in nearly all cases
Heuristic techniques are well developed
Cousins of the compiler
PREPROCESSORS take raw source code and
produce the input actually read by the
compiler
MACRO PROCESSING: macro calls need to be
replaced by the correct text
Macros can be used to define a constant used in many
places. E.g. #define BUFSIZE 100 in C
Also useful as shorthand for often-repeated expressions:
#define DEG_TO_RADIANS(x) ((x)/180.0*M_PI)
#define ARRAY(a,i,j,ncols) ((a)[(i)*(ncols)+(j)])
FILE INCLUSION: included files (e.g. using
#include in C) need to be expanded
Cousins of the compiler
ASSEMBLERS take assembly code and convert
it to machine code.
Some compilers go directly to machine code;
others produce assembly code then call a
separate assembler.
Either way, the output machine code is
usually RELOCATABLE, with memory
addresses starting at location 0.
Cousins of the compiler
LOADERS take relocatable machine code and
alter the addresses, putting the instructions
and data in a particular location in memory.
The LINK EDITOR (part of the loader) pieces
together a complete program from several
independently compiled parts.
Compiler writing tools
We’ve come a long way since the 1950s.
SCANNER GENERATORS produce lexical analyzers
automatically.
Input: a specification of the tokens of a language (usually
written as regular expressions)
Output: C code to break the source language into tokens.
PARSER GENERATORS produce syntactic analyzers
automatically.
Input: a specification of the language syntax (usually written
as a context-free grammar)
Output: C code to build the syntax tree from the token
sequence.
There are also automated systems for code synthesis.