Introduction
CPSC 388
Ellen Walker
Hiram College
Why Learn About Compilers?
• Practical application of important
computer science theory
• Ties together computer architecture and
programming
• Useful tools for developing language
interpreters
– Not just programming languages!
Computer Languages
• Machine language
– Binary numbers stored in memory
– Bits correspond directly to machine actions
• Assembly language
– A “symbolic face” for machine language
– Line-for-line translation
• High-level language (our goal!)
– Closer to human expressions of problems, e.g.
mathematical notation
Assembler vs. HLL
• Assembler
Ldi $r1, 2   -- put the value 2 in R1
Sto $r1, x   -- store that value in x
• HLL
x = 2;
Characteristics of HLLs
• Easier to learn (and remember)
• Machine independent
– No knowledge of architecture needed
– … as long as there is a compiler for that
machine!
Early Milestones
• FORTRAN (Formula Translation)
– IBM (John Backus) 1954-1957
– First high-level language and first
compiler
• Chomsky Hierarchy (1950’s)
– Formal description of natural language
structure
– Ranks languages according to the
complexity of their grammar
Chomsky Hierarchy
• Type 3: Regular languages
– Too simple for programming languages
– Good for tokens, e.g. numbers
• Type 2: Context Free languages
– Standard representation of programming
languages
• Type 1: Context Sensitive Languages
• Type 0: Unrestricted
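To make the token example concrete, here is a minimal C sketch (illustrative only, not part of the course materials) that checks whether a string belongs to the regular language [0-9]+, i.e. whether it is an unsigned integer token:

  #include <ctype.h>
  #include <stdio.h>

  /* Accept exactly the strings matching the regular expression [0-9]+ */
  static int is_number(const char *s)
  {
      if (*s == '\0')
          return 0;                          /* empty string: reject */
      for (; *s != '\0'; s++)
          if (!isdigit((unsigned char)*s))
              return 0;                      /* non-digit character: reject */
      return 1;                              /* one or more digits: accept */
  }

  int main(void)
  {
      printf("%d %d\n", is_number("101"), is_number("1o1"));   /* prints: 1 0 */
      return 0;
  }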
Another View of the Hierarchy
(Diagram: the language classes nest, with RL contained in CFL, which is contained in CSL)
Formal Language & Automata Theory
• Machines to recognize each language class
– Turing Machine (computable languages)
– Push-down Automaton (context-free languages)
– Finite Automaton (regular languages)
• Use machines to prove that a given language
belongs to a class
• Formally prove that a given language does
not belong to a class
Practical Applications of Theory
• Translate from grammar to formal
machine description
• Implement the formal machine to parse
the language
• Tools:
– Scanner Generator (RL / FA): LEX, FLEX
– Parser Generator (CFL / PDA): YACC, Bison
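Scanner generators such as LEX and FLEX turn regular expressions into table-driven finite automata. The sketch below is a hand-written illustration of that idea for identifiers of the form letter (letter | digit)*; it is not actual generator output, and the state and class names are made up:

  #include <ctype.h>
  #include <stdio.h>

  enum { START, IN_ID, DEAD, NSTATES };      /* automaton states */
  enum { LETTER, DIGIT, OTHER, NCLASSES };   /* character classes */

  /* Transition table: next_state[current state][character class] */
  static const int next_state[NSTATES][NCLASSES] = {
      /* START */ { IN_ID, DEAD, DEAD },
      /* IN_ID */ { IN_ID, IN_ID, DEAD },
      /* DEAD  */ { DEAD,  DEAD, DEAD },
  };

  static int class_of(char c)
  {
      if (isalpha((unsigned char)c)) return LETTER;
      if (isdigit((unsigned char)c)) return DIGIT;
      return OTHER;
  }

  /* Accept identifiers: a letter followed by any mix of letters and digits */
  static int is_identifier(const char *s)
  {
      int state = START;
      for (; *s != '\0'; s++)
          state = next_state[state][class_of(*s)];
      return state == IN_ID;                 /* IN_ID is the only accepting state */
  }

  int main(void)
  {
      printf("%d %d %d\n", is_identifier("a"), is_identifier("x27"), is_identifier("2x"));
      /* prints: 1 1 0 */
      return 0;
  }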
Beyond Parsing
• Code generation
• Optimization
– Techniques to “mindlessly” improve code
– Usually after code generation
– Rarely “optimal”, simply better
Phases of a Compiler
• Scanner -> tokens
• Parser -> syntax tree
• Semantic Analyzer -> annotated tree
• Source code optimizer -> intermediate code
• Code generator -> target code
• Target code optimizer -> better target code
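A hedged sketch of this pipeline as a chain of function calls; every type and phase function here is a placeholder chosen for illustration, not the interface of any real compiler:

  #include <stdio.h>

  /* Placeholder phases: each one reports its input and returns a label
     for what it would produce.                                         */
  typedef const char *Text;

  static Text scan(Text src)             { printf("scanner: %s\n", src);            return "tokens"; }
  static Text parse(Text toks)           { printf("parser: %s\n", toks);            return "syntax tree"; }
  static Text analyze(Text tree)         { printf("semantic analyzer: %s\n", tree); return "annotated tree"; }
  static Text optimize_source(Text tree) { printf("source optimizer: %s\n", tree);  return "intermediate code"; }
  static Text generate(Text ir)          { printf("code generator: %s\n", ir);      return "target code"; }
  static Text optimize_target(Text code) { printf("target optimizer: %s\n", code);  return "better target code"; }

  int main(void)
  {
      Text out = optimize_target(generate(optimize_source(analyze(parse(scan("a[j] = 4 + 2"))))));
      printf("result: %s\n", out);
      return 0;
  }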
Additional Tables
• Symbol table
– Tracks all variable names and other
symbols that will have to be mapped to
addresses later
• Literal table
– Tracks literals (such as numbers and
strings) that will have to be stored along
with the eventual program
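A minimal sketch of a symbol table as an array of name/address pairs; real compilers usually use hash tables, and the names and sizes here are illustrative only:

  #include <stdio.h>
  #include <string.h>

  /* One entry: a variable name and the address assigned to it later */
  struct symbol { char name[32]; int address; };

  static struct symbol table[100];           /* bounds checking omitted for brevity */
  static int nsymbols = 0;

  /* Return the index of name, adding it if it is not yet present */
  static int lookup(const char *name)
  {
      for (int i = 0; i < nsymbols; i++)
          if (strcmp(table[i].name, name) == 0)
              return i;
      strcpy(table[nsymbols].name, name);
      table[nsymbols].address = -1;          /* address filled in later */
      return nsymbols++;
  }

  int main(void)
  {
      printf("%d %d %d\n", lookup("a"), lookup("j"), lookup("a"));   /* prints: 0 1 0 */
      return 0;
  }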
Scanner
• Read a stream of characters
• Perform lexical analysis to generate tokens
• Update symbol and literal tables as needed
• Example:
  Input: a[j] = 4 + 2
  Tokens: ID Lbrack ID Rbrack EQL NUM PLUS NUM
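A hand-written sketch of a scanner loop for this input; the token names follow the slide, but the code itself is illustrative rather than the course's scanner:

  #include <ctype.h>
  #include <stdio.h>

  /* Print one token name per lexeme in the input string */
  static void scan(const char *s)
  {
      while (*s != '\0') {
          if (isspace((unsigned char)*s)) { s++; continue; }
          if (isalpha((unsigned char)*s)) {                /* identifier */
              while (isalpha((unsigned char)*s) || isdigit((unsigned char)*s)) s++;
              printf("ID ");
          } else if (isdigit((unsigned char)*s)) {         /* number */
              while (isdigit((unsigned char)*s)) s++;
              printf("NUM ");
          } else {                                         /* single-character tokens */
              switch (*s++) {
              case '[': printf("Lbrack "); break;
              case ']': printf("Rbrack "); break;
              case '=': printf("EQL ");    break;
              case '+': printf("PLUS ");   break;
              default:  printf("ERROR ");  break;
              }
          }
      }
      printf("\n");
  }

  int main(void)
  {
      scan("a[j] = 4 + 2");   /* prints: ID Lbrack ID Rbrack EQL NUM PLUS NUM */
      return 0;
  }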
Parser
• Performs syntax analysis
• Relates the sequence of tokens to the
grammar
• Builds a tree that represents this
relationship, the parse tree
Partial Grammar
• assign-expr -> expr = expr
• array-expr -> ID [ expr ]
• expr -> array-expr
• expr -> expr + expr
• expr -> ID
• expr -> NUM
Example Parse
(Parse tree for a[j] = 4 + 2, with each node's children indented beneath it)
assign-expression
  expression
    array-expression
      ID
      [
      expression
        ID
      ]
  =
  expression
    add-expression
      expression
        NUM
      +
      expression
        NUM
Abstract Syntax Tree
(Abstract syntax tree for a[j] = 4 + 2, with each node's children indented beneath it; the bracket, =, and + tokens no longer appear as nodes)
assign-expression
  expression
    array-expression
      ID
      expression
        ID
  expression
    add-expression
      expression
        NUM
      expression
        NUM
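One way such a tree might be represented and built in C; the node kinds and helper functions are made up for illustration:

  #include <stdio.h>
  #include <stdlib.h>

  enum kind { ASSIGN, SUBSCRIPT, ADD, ID, NUM };   /* node kinds for this small example */

  struct node {
      enum kind kind;
      const char *name;              /* used by ID nodes */
      int value;                     /* used by NUM nodes */
      struct node *left, *right;     /* children of the operator nodes */
  };

  static struct node *mknode(enum kind k, struct node *l, struct node *r)
  {
      struct node *n = malloc(sizeof *n);
      n->kind = k; n->name = NULL; n->value = 0; n->left = l; n->right = r;
      return n;
  }

  static struct node *mkid(const char *name)
  {
      struct node *n = mknode(ID, NULL, NULL); n->name = name; return n;
  }

  static struct node *mknum(int value)
  {
      struct node *n = mknode(NUM, NULL, NULL); n->value = value; return n;
  }

  int main(void)
  {
      /* Abstract syntax tree for:  a[j] = 4 + 2  */
      struct node *lhs    = mknode(SUBSCRIPT, mkid("a"), mkid("j"));
      struct node *rhs    = mknode(ADD, mknum(4), mknum(2));
      struct node *assign = mknode(ASSIGN, lhs, rhs);
      printf("assignment to %s[%s] of %d + %d\n",
             lhs->left->name, lhs->right->name, rhs->left->value, rhs->right->value);
      return 0;
  }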
Semantic Analyzer
• Determine the meaning (not structure) of the
program
• This is “compile-time” or static semantics only
• Example: a[j] = 4 + 2
– a refers to an array location
– a contains integers
– j is an integer
– j is in the range of the array (not checked in C)
• Parse or Syntax tree is “decorated” with this
information
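A sketch of one such static check, using made-up type codes rather than the course's analyzer: verify that a is an array of integers and that both the index and the assigned value are integers:

  #include <stdio.h>

  enum type { INT_TYPE, ARRAY_OF_INT, STRING_TYPE };

  /* Static check for a[j] = value:
     a must be an array of int; j and value must be int.
     The range of j is not checked, just as in C.        */
  static int check_assignment(enum type a, enum type j, enum type value)
  {
      if (a != ARRAY_OF_INT) { printf("error: not an array of int\n");     return 0; }
      if (j != INT_TYPE)     { printf("error: index is not an integer\n"); return 0; }
      if (value != INT_TYPE) { printf("error: value is not an integer\n"); return 0; }
      return 1;
  }

  int main(void)
  {
      printf("%d\n", check_assignment(ARRAY_OF_INT, INT_TYPE, INT_TYPE));     /* prints: 1 */
      printf("%d\n", check_assignment(ARRAY_OF_INT, STRING_TYPE, INT_TYPE));  /* error message, then 0 */
      return 0;
  }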
Source Code Optimizer
• Simplify and improve the source code by
applying rules
– Constant folding: replace “4+2” by 6
– Combine common sub-expressions
– Reordering expressions (often prior to constant
folding)
– Etc.
• Result: modified, decorated syntax tree or
Intermediate Representation
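A sketch of constant folding on a tiny expression node (the struct is made up for illustration): an addition whose operands are both constants is replaced by a single constant before any target code is generated:

  #include <stdio.h>

  /* A tiny expression: either a constant, or an addition of two sub-expressions */
  struct expr { int is_constant; int value; struct expr *left, *right; };

  /* If both operands of an addition are constants, fold the addition
     into a single constant node.                                      */
  static void fold(struct expr *e)
  {
      if (!e->is_constant && e->left->is_constant && e->right->is_constant) {
          e->value = e->left->value + e->right->value;
          e->is_constant = 1;
      }
  }

  int main(void)
  {
      struct expr four = { 1, 4, NULL, NULL };
      struct expr two  = { 1, 2, NULL, NULL };
      struct expr sum  = { 0, 0, &four, &two };
      fold(&sum);
      printf("folded to constant %d\n", sum.value);   /* prints: folded to constant 6 */
      return 0;
  }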
Code Generator
• Generates code for the target machine
• Example:
– MOV R0, j     -- value of j into R0
– MUL R0, 2     -- 2*j in R0 (int = 2 wds)
– MOV R1, &a    -- address of a into R1
– ADD R1, R0    -- a+2*j in R1 (addr of a[j])
– MOV *R1, 6    -- 6 into address in R1
Target Code Optimizer
• Apply rules to improve machine code
• Example:
– MOV R0, j
– SHL R0        -- shift to multiply by 2
– MOV &a[R0], 6 -- use a more complex machine instruction to replace simpler ones
Major Data Structures
• Tokens
• Syntax Tree
• Symbol Table
• Literal Table
• Intermediate Code
• Temporary files
Structuring a Compiler
• Analysis vs. Synthesis
– Analysis = understanding the source code
– Synthesis = generating the target code
• Front end vs. Back end
– Front end: parsing & intermediate code
generation (target machine-independent)
– Back end: target code generation
• Optimization included in both parts
Multiple Passes
• Each pass processes the source code once
– One pass per phase
– One pass for several phases
– One pass for entire compilation
• Language definition can preclude one-pass compilation
Runtime Environments
• Static (e.g. FORTRAN)
– No pointers, no dynamic allocation, no recursion
– All memory allocation done prior to execution
• Stack-based (e.g. C family)
– Stack for nested allocation (call/return)
– Heap for random allocation (new)
• Fully dynamic (LISP)
– Allocation is automatic (not in source code)
– Garbage collection required
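A small C illustration of the stack-based model: each call's local variable lives on the run-time stack, while malloc draws from the heap (the example itself is illustrative):

  #include <stdio.h>
  #include <stdlib.h>

  /* Each recursive call gets its own copy of n and local on the run-time stack */
  static int factorial(int n)
  {
      int local = n;                 /* stack allocation, released on return */
      return (local <= 1) ? 1 : local * factorial(local - 1);
  }

  int main(void)
  {
      int *p = malloc(sizeof *p);    /* heap allocation, lives until free() */
      *p = factorial(5);
      printf("%d\n", *p);            /* prints: 120 */
      free(p);
      return 0;
  }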
Error Handling
• Each phase finds and handles its own types
of errors
– Scanning: errors like 1o1 (invalid ID)
– Parsing: syntax errors
– Semantic Analysis: type errors
• Runtime errors handled by the runtime
environment
– Exception handling by programmer often allowed
Compiling the Compiler
• Using machine language
– Immediately executable, hard to write
– Necessary for the first (FORTRAN)
compiler
• Using a language with an existing
compiler and the same target machine
• Using the language to be compiled
(bootstrapping)
Bootstrapping
• Write a “quick & dirty” compiler for a
subset of the language (using machine
language or another available HLL)
• Write a complete compiler in the
language subset
• Compile the complete compiler using
the “quick & dirty” compiler