
Embedded Systems in Silicon
TD5102
Compilers
with emphasis on ILP compilation
Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore
2005/2006
Compiling for ILP Architectures

Overview:
• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Summary and Conclusions
Motivation

• Performance requirements increase
• Applications may contain much instruction-level parallelism
• Processors offer lots of hardware concurrency

Problem to be solved:
– how to exploit this concurrency automatically?
Goals of code generation
• High speedup
– Exploit all the hardware concurrency
– Extract all application parallelism
• obey true dependencies only
• resolve false dependencies by renaming
• No code rewriting: automatic parallelization
– However: application tuning may be required
• Limit code expansion
Overview

• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Summary and Conclusions
Measuring and exploiting available parallelism

• How to measure parallelism within applications?
– Using an existing compiler
– Using trace analysis
• Track all the real data dependences (RaWs) of instructions from the issue window
– register dependences
– memory dependences
• Check for correct branch prediction
– if the prediction is correct, continue
– if wrong, flush the schedule and restart in the next cycle
Trace analysis

Program:
    for i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:
        set  r1,0
        set  r2,3
        set  r3,&A
    Loop:
        st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3

Execution trace:
    set  r1,0
    set  r2,3
    set  r3,&A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

How parallel can this code be executed?
Trace analysis

Parallel trace:
    cycle 1:  set r1,0         set r2,3         set r3,&A
    cycle 2:  st r1,0(r3)      add r1,r1,1      add r3,r3,4
    cycle 3:  st r1,0(r3)      add r1,r1,1      add r3,r3,4      brne r1,r2,Loop
    cycle 4:  st r1,0(r3)      add r1,r1,1      add r3,r3,4      brne r1,r2,Loop
    cycle 5:  brne r1,r2,Loop
    cycle 6:  add r1,r5,3

Max ILP = Speedup = Lserial / Lparallel = 16 / 6 ≈ 2.7
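To make the measurement concrete, here is a minimal Python sketch of such a trace analyzer (an illustration, not the tool behind these slides): it assumes ideal renaming and perfect prediction, so only true (RaW) register dependences constrain the schedule, every operation has unit latency, and resources are unlimited.

    def max_ilp(trace):
        # trace: list of (written_regs, read_regs) sets, in serial order.
        ready_at = {}                  # register -> cycle its value is ready
        length = 0                     # parallel schedule length so far
        for written, read in trace:
            # An operation issues once all its source values are available.
            cycle = max([ready_at.get(r, 0) for r in read] + [0])
            for r in written:
                ready_at[r] = cycle + 1        # result ready the next cycle
            length = max(length, cycle + 1)
        return len(trace) / length             # Lserial / Lparallel

    # Tiny made-up example: two independent chains of two dependent ops each.
    trace = [({"r1"}, set()), ({"r2"}, {"r1"}),
             ({"r3"}, set()), ({"r4"}, {"r3"})]
    print(max_ilp(trace))    # 4 instructions in 2 cycles -> ILP = 2.0

Because this model resolves all false dependences and ignores resource limits, it reports an upper bound on the exploitable ILP.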
Ideal Processor

Assumptions for an ideal/perfect processor:
1. Register renaming – infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction – perfect => all program instructions available for execution
3. Memory-address alias analysis – addresses are known; a store can be moved before a load provided the addresses are not equal

Also:
– unlimited number of instructions issued per cycle (unlimited resources)
– unlimited instruction window
– perfect caches
– 1 cycle latency for all instructions (including FP *,/)

Programs were compiled using the MIPS compiler with maximum optimization level.
Upper Limit to ILP: Ideal Processor

Measured instruction issues per cycle (IPC):

    Program     IPC
    gcc         54.8
    espresso    62.6
    li          17.9
    fpppp       75.2
    doduc      118.7
    tomcatv    150.1

Integer: 18 - 60, FP: 75 - 150
Different effects reduce the exploitable parallelism

• Reducing window size
– i.e., the number of instructions to choose from
• Non-perfect branch prediction
– perfect (oracle model)
– dynamic predictor (e.g. 2-bit prediction table with a finite number of entries)
– static prediction (using profiling)
– no prediction
• Restricted number of registers for renaming
– typical superscalars have O(100) registers
• Restricted number of other resources, like FUs
Different effects reduce the exploitable parallelism

• Non-perfect alias analysis (memory disambiguation). Models to use:
– perfect
– inspection: no dependence in the following cases:

      r1 := 0(r9)          r1 := 0(fp)
      4(r9) := r2          0(gp) := r2

  (left: different offsets from the same base register; right: an fp-relative stack access and a gp-relative global access never overlap)
  A more advanced analysis may disambiguate most stack and global references, but not the heap references
– none
• Important:
– good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and (for floating point) a large window size
Summary
• Amount of parallelism is limited
– higher in Multi-Media
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of compiler
– hardware
– source-code transformations
Overview

• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Compiler basics
• Overview
– Compiler trajectory / structure / passes
– Abstract Syntax Tree (AST)
– Control Flow Graph (CFG)
– Data Dependence Graph (DDG)
– Basic optimizations
– Register allocation
– Code selection
Compiler basics: trajectory

    Source program
        ↓
    Preprocessor
        ↓
    Compiler        → error messages
        ↓
    Assembler
        ↓
    Loader/Linker   ← library code
        ↓
    Object program
Compiler basics: structure / passes

    Source code
      ↓  Lexical analyzer          token generation
      ↓  Parsing                   check syntax, check semantics,
                                   parse tree generation
      ↓  Intermediate code
      ↓  Code optimization         data flow analysis, local optimizations,
                                   global optimizations
      ↓  Code generation           code selection, peephole optimizations
      ↓  Register allocation       making interference graph, graph coloring,
                                   spill code insertion, caller/callee save
                                   and restore code
    Sequential code
      ↓  Scheduling and allocation exploiting ILP
    Object code
Compiler basics: structure
Simple compilation example

Input:
    position := initial + rate * 60

Lexical analyzer:
    id1 := id2 + id3 * 60

Syntax analyzer (parse tree):
    :=
        id1
        +
            id2
            *
                id3
                60

Intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1

Code generator:
    movf id3, r2
    mulf #60.0, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
Compiler basics: structure
- SUIF-1 toolkit example

Front-ends: FORTRAN enters via FORTRAN-to-C pre-processing, C via the C front-end; FORTRAN-specific transformations convert non-standard structures to SUIF.

Analysis and optimization passes on SUIF: constant propagation, forward propagation, induction variable identification, strength reduction, dead-code elimination, scalar privatization analysis, reduction analysis, locality optimization and parallelism analysis, high-SUIF to low-SUIF conversion, parallel code generation, register allocation, assembly code generation.

Back-ends: SUIF to text (SUIF text), SUIF to postscript (postscript), SUIF to C (C), and assembly code generation (assembly code).
Compiler basics:
Abstract Syntax Tree (AST)

C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }

Parse tree (allows ‘infinite’ nesting):

    Stat IF
        Cmp >
            Var a
            Var b
        Statlist
            Stat Expr
                Assign
                    Var r
                    Binop %
                        Var a
                        Var b
        Statlist
            Stat Expr
                Assign
                    Var r
                    Binop %
                        Var b
                        Var a
Compiler basics:
Control flow graph (CFG)

C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }

CFG:
    1:  sub t1, a, b
        bgz t1, 2, 3

    2:  rem r, a, b        3:  rem r, b, a
        goto 4                 goto 4

    4:  ...

A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports, ...
Data Dependence Graph (DDG)

    a := b + 15;
    c := 3.14 * d;
    e := c / f;

Translation to a DDG: each source operand is loaded (ld &b, ld &d, ld &f), the operations (+ with constant 15, * with constant 3.14, /) consume the loaded values, and each result is stored (st &a, st &c, st &e). The edges carry the value flow; note that the result of c := 3.14 * d feeds both its store and the division in e := c / f.
Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations
(details are in any good compiler book)
Machine independent optimizations
– Common subexpression elimination
– Constant folding
– Copy propagation
– Dead-code elimination
– Induction variable elimination
– Strength reduction
– Algebraic identities
• Commutative expressions
• Associativity: Tree height reduction
– Note: not always allowed
(due to limited precision)
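To make a few of these concrete, here is an illustrative before/after pair (a hand-worked sketch in Python; the function and variable names are made up):

    # Before: contains a foldable constant, a copy, and dead code.
    def before(a, n):
        s = 0
        for i in range(n):
            x = 4 * 2          # constant folding: 4 * 2 -> 8
            y = a[i] * x
            z = y              # copy propagation: uses of z become y
            s = s + z
            t = y * 3          # dead code: t is never used
        return s

    # After constant folding, copy propagation and dead-code elimination:
    def after(a, n):
        s = 0
        for i in range(n):
            s = s + a[i] * 8
        return s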
Machine dependent optimization example

What’s the optimal implementation of a*34 ?
– Use the multiplier: mul Tb,Ta,34
  • Pro: No thinking required
  • Con: May take many cycles
– Alternative (34 = 32 + 2):

      SHL Tc, Ta, 1       ; Tc = a*2
      ADD Tb, Tc, Tzero   ; Tb = a*2
      SHL Tc, Tc, 4       ; Tc = a*32
      ADD Tb, Tb, Tc      ; Tb = a*2 + a*32 = a*34

  • Pro: May take fewer cycles
  • Cons:
    Uses more registers
    Additional instructions (I-cache load / code size)
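As a sketch of how a code generator could derive such a sequence automatically, the following Python routine (an illustration; the names Ta/Tb/Tc follow the slide, everything else is assumed) rewrites a multiplication by a constant into shifts and adds based on the set bits of the constant:

    def shift_add_sequence(c):
        # a*c = sum of (a << i) over all set bits i of c,
        # e.g. 34 = 2 + 32 gives (a << 1) + (a << 5).
        ops, first = [], True
        for i in range(c.bit_length()):
            if (c >> i) & 1:
                ops.append(f"SHL Tc, Ta, {i}" if i else "MOV Tc, Ta")
                ops.append("MOV Tb, Tc" if first else "ADD Tb, Tb, Tc")
                first = False
        return ops

    for op in shift_add_sequence(34):
        print(op)
    # SHL Tc, Ta, 1 / MOV Tb, Tc / SHL Tc, Ta, 5 / ADD Tb, Tb, Tc

Unlike the slide’s version, which shifts the previous partial result incrementally, this sketch always shifts the original value; a real code generator would compare the cost of either sequence against the machine’s mul latency.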
Compiler basics: Register allocation

• Register organization
Conventions are needed for parameter passing and register usage across function calls; a MIPS example:

    r21 - r31   callee saved registers
    r11 - r20   caller saved registers (temporaries)
    r1  - r10   argument and result transfer
    r0          hard-wired 0
Register allocation using graph coloring
Given a set of registers, what is the most efficient
mapping of registers to program variables in
terms of execution time of the program?
• A variable is defined at a point in program when a value is
assigned to it.
• A variable is used at a point in a program when its value is
referenced in an expression.
• The live range of a variable is the execution range
between definitions and uses of a variable.
Register allocation using graph coloring

Example program:

    a :=
    c :=
    b :=
       := b
    d :=
       := a
       := c
       := d

Live ranges: a lives from its definition to ':= a'; b from 'b :=' to ':= b'; c from 'c :=' to ':= c'; d from 'd :=' to ':= d'.
Register allocation using graph coloring

Interference graph (an edge means two live ranges overlap):

    a – b, a – c, a – d, b – c, c – d

Coloring:
    a = red
    b = green
    c = blue
    d = green

The graph needs 3 colors (chromatic number = 3)
=> the program needs 3 registers
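For illustration, a minimal greedy-coloring sketch in Python (real allocators use Chaitin/Briggs-style heuristics; the k-register limit and the fixed node order here are assumptions):

    def greedy_color(nodes, edges, k):
        # Assign each node one of k colors so neighbors differ; None = spill.
        adj = {n: set() for n in nodes}
        for u, v in edges:
            adj[u].add(v); adj[v].add(u)
        color = {}
        for n in nodes:                    # fixed order; real allocators
            used = {color.get(m) for m in adj[n]}  # order by degree/spill cost
            free = [c for c in range(k) if c not in used]
            color[n] = free[0] if free else None   # None -> spill candidate
        return color

    print(greedy_color("abcd",
                       [("a","b"), ("a","c"), ("a","d"), ("b","c"), ("c","d")],
                       3))
    # {'a': 0, 'b': 1, 'c': 2, 'd': 1} -- three registers suffice

Greedy coloring in a fixed order is not optimal in general; when it runs out of colors the allocator falls back to spill code, as on the next slide.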
Register allocation using graph coloring

Spill/reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!

    Program:
    a :=
    c :=
    store c      ; spill c right after its definition
    b :=
       := b
    d :=
       := a
    load c       ; reload c just before its use
       := c
       := d

The store/load pair splits c’s live range, so at most two variables are live at any point.
Compiler basics: Code selection
• CISC era
– Code size important
– Determine shortest sequence of code
• Many options may exist
– Pattern matching
Example M68020:
    D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
    →  ADD ([10,A1], D2*16, 20), D1
• RISC era
– Performance important
– Only few possible code sequences
– New implementations of old architectures optimize the RISC part of the instruction set only; e.g. i486 / Pentium / M68020
Overview

• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
What is scheduling?
• Time allocation:
– Assigning instructions or operations to time slots
– Preserve dependences:
• Register dependences
• Memory dependences
– Optimize code with respect to performance / code size / power consumption / ...
• Space allocation
– satisfy resource constraints:
• Bind operations to FUs
• Bind variables to registers/ register files
• Bind transports to buses
Why scheduling?

Let’s look at the execution time:

    T_execution = N_cycles × T_cycle = N_instructions × CPI × T_cycle

Scheduling may reduce T_execution:
– Reduce CPI (cycles per instruction)
• early scheduling of long-latency operations
• avoid pipeline stalls due to structural, data and control hazards
• allow N_issue > 1 and therefore CPI < 1
– Reduce N_instructions
• compact many operations into each instruction (VLIW)
Scheduling data hazards

RaW dependences. Avoiding RaW stalls: reordering of instructions by the compiler.
Example: avoiding the one-cycle load interlock.

Code:
    a = b + c
    d = e - f

Unscheduled code:
    Lw  R1,b
    Lw  R2,c
    Add R3,R1,R2     ; interlock: R2 is loaded in the preceding cycle
    Sw  a,R3
    Lw  R1,e
    Lw  R2,f
    Sub R4,R1,R2     ; interlock
    Sw  d,R4

Scheduled code:
    Lw  R1,b
    Lw  R2,c
    Lw  R5,e         ; extra register needed!
    Add R3,R1,R2
    Lw  R2,f
    Sw  a,R3
    Sub R4,R5,R2
    Sw  d,R4
Scheduling control hazards

A branch requires 3 actions:
• Compute the new address
• Determine the condition
• Perform the actual branch (if taken): PC := new address

[Pipeline diagram: the branch instruction proceeds through IF, ID, OF, EX, WB. Under “predict not taken”, the fall-through instructions after the branch enter the pipeline; if the branch turns out taken they are squashed, and the target instruction at L only starts its IF after the branch has executed.]
Control hazards: what's the penalty?

    CPI = CPI_ideal + f_branch × P_branch
    P_branch = N_delayslots × miss_rate

• Superscalars tend to have a large branch penalty P_branch due to
– many pipeline stages
– multiple instructions (or operations) per cycle
• Note:
– the lower the CPI, the larger the relative effect of penalties
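A quick worked example (all numbers assumed for illustration): with CPI_ideal = 0.5, f_branch = 0.2, N_delayslots = 3 and a 10% miss rate, we get P_branch = 3 × 0.1 = 0.3 and CPI = 0.5 + 0.2 × 0.3 = 0.56, a 12% slowdown; with CPI_ideal = 1 the same 0.06 penalty would cost only 6%.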
What can we do about control hazards and the CPI penalty?

• Keep the penalty P_branch low:
– Early computation of the new PC
– Early determination of the condition
– Visible delay slots filled by the compiler (MIPS)
• Branch prediction
• Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro’95]
• Remove branches: if-conversion
– Conditional instructions: CMOVE, conditionally skip next
– Guarding all instructions: TriMedia
Scheduling: Conditional instructions

• Example: CMOVE (supported by the Alpha)

    if (A == 0) S = T;

Assume: r1 holds A, r2 holds S, r3 holds T.

Object code:
    Bnez r1, L
    Mov  r2, r3
L:  ....

After if-conversion:
    Cmovz r2, r3, r1
Scheduling: Conditional instructions

Conditional instructions are useful; however:
• Squashed instructions still take execution time and execution resources
– Consequence: long target blocks cannot be if-converted
• The condition has to be known early
• Moving operations across multiple branches requires complicated predicates
• Compatibility: change of ISA (instruction set architecture)

Practice:
• Current superscalars support a limited set of conditional instructions
– CMOVE: Alpha, MIPS, PowerPC, SPARC
– HP PA: any RR instruction can conditionally squash the next instruction
Large VLIWs profit from making all instructions conditional
• guarded execution: TriMedia, Intel/HP IA-64, TI C6x
Guarded execution

Before if-conversion:
          SLT  r1,r2,r3
          BEQ  r1,r0,else
    then: ADDI r2,r2,1
          ..X..
          j    cont
    else: SUBI r2,r2,1
          ..Y..
    cont: MUL  r4,r2

After if-conversion:
          SLT  b1,r2,r3
     b1:  ADDI r2,r2,1
    !b1:  SUBI r2,r2,1
     b1:  ..X..
    !b1:  ..Y..
          MUL  r4,r2
Scheduling: Conditional instructions

Full guard support: if-conversion of conditional code.

Assume:
• t_branch: branch latency
• p_branch: branching probability (TRUE path taken)
• t_true: execution time of the TRUE branch
• t_false: execution time of the FALSE branch

Execution times of the original and if-converted code for a non-ILP architecture:

    t_original_code = (1 + p_branch) × t_branch + p_branch × t_true + (1 − p_branch) × t_false
    t_if_converted_code = t_true + t_false
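A small Python sketch of these formulas (all parameter values below are assumed, purely for illustration):

    def speedup_if_conversion(t_branch, p_branch, t_true, t_false, ilp=False):
        # Original code: branch (+ jump on the TRUE path) plus one of the
        # two blocks; if-converted code executes both blocks.
        t_orig = ((1 + p_branch) * t_branch
                  + p_branch * t_true + (1 - p_branch) * t_false)
        # Non-ILP: both blocks serialize; ILP: they run side by side.
        t_ifc = max(t_true, t_false) if ilp else t_true + t_false
        return t_orig / t_ifc

    # Branch latency 2, 50% taken, TRUE/FALSE blocks of 3 operations each:
    print(speedup_if_conversion(2, 0.5, 3, 3))             # non-ILP: 1.0
    print(speedup_if_conversion(2, 0.5, 3, 3, ilp=True))   # ILP: 2.0

This matches the two plots referred to below: without ILP, if-conversion only pays off for short blocks; with enough resources the profitable region is much larger.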
Scheduling: Conditional instructions

Speedup of if-converted code for non-ILP architectures: only interesting for short target blocks!
Scheduling: Conditional instructions

Speedup of if-converted code for ILP architectures with sufficient resources:

    t_if_converted = max(t_true, t_false)

This gives a much larger area of interest!
Scheduling: Conditional instructions

• Full guard support for large ILP architectures has a number of advantages:
– Removing unpredictable branches
– Enlarging the scheduling scope
– Enabling software pipelining
– Enhancing code motion when speculation is not allowed
– Resource sharing; even when speculation is allowed, guarding may be profitable
Scheduling: Overview

Transforming a sequential program into a parallel program:

    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write parallel program
Scheduling: Integer Linear Programming

Integer linear programming scheduling method
• Introduce:
– Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j
– Constraints, like limited resources:

      ∀j, ∀t:  Σ_i  x_{i,j,t}  ≤  M_t

  where x_{i,j,t} ranges over the operations of type t and M_t is the number of resources of type t
– Data dependence constraints
– Timing constraints
• Problem: too many decision variables
List Scheduling

• Make a dependence graph
• Determine the minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in the first cycle with sufficient resources

Note:
– Scheduling order is sequential
– Priority is determined by the heuristic used, e.g. slack
Basic Block Scheduling

[Example DDG for a basic block computing x, y and z from A, B and C: every operation is annotated with its <ASAP, ALAP> pair, e.g. <1,1>, <2,2>, <3,3> and <4,4> for the ADD/SUB/NEG chain on the critical path (slack 0), and <1,3>, <2,3>, <1,4>, <2,4> for the LD, ADD and MUL operations off the critical path (slack = ALAP − ASAP > 0).]
ASAP and ALAP formulas

    asap(v) =  max{ asap(u) + delay(u,v) | (u,v) ∈ E }   if pred(v) ≠ ∅
               0                                          otherwise

    alap(v) =  min{ alap(u) − delay(v,u) | (v,u) ∈ E }   if succ(v) ≠ ∅
               Lmax                                       otherwise

    slack(v) = alap(v) − asap(v)
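The recurrences translate directly into a few lines of Python (a sketch; the DDG below is a made-up example given as (u, v, delay) edges, and plain recursion is fine for small acyclic graphs):

    edges = [("ld_b", "add", 1), ("add", "st_a", 1), ("ld_d", "mul", 1)]
    succ, pred = {}, {}
    for u, v, d in edges:
        succ.setdefault(u, []).append((v, d))
        pred.setdefault(v, []).append((u, d))
    nodes = {n for u, v, d in edges for n in (u, v)}

    def asap(v):
        return max((asap(u) + d for u, d in pred.get(v, [])), default=0)

    L_max = max(asap(v) for v in nodes)          # minimal schedule length

    def alap(v):
        return min((alap(u) - d for u, d in succ.get(v, [])), default=L_max)

    for v in sorted(nodes):
        print(v, asap(v), alap(v), alap(v) - asap(v))   # ASAP, ALAP, slack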
Cycle-based list scheduling

    proc Schedule(DDG = (V,E))
    beginproc
        ready  = { v | ¬∃(u,v) ∈ E }   // all nodes without predecessors
        ready’ = ready                  // all nodes schedulable in the
        sched  = ∅                      //   current cycle
        current_cycle = 0
        while sched ≠ V do
            for each v ∈ ready’ do
                if ¬ResourceConfl(v, current_cycle, sched) then
                    cycle(v) = current_cycle
                    sched = sched ∪ {v}
                endif
            endfor
            current_cycle = current_cycle + 1
            ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
            ready’ = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
        endwhile
    endproc
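For completeness, an executable Python version of this procedure (a sketch under simplifying assumptions: named resource classes with per-cycle capacities, integer delays, and the node list order as scheduling priority):

    def list_schedule(nodes, edges, res_class, capacity):
        # nodes: operation names; edges: (u, v, delay) meaning v may start
        # delay cycles after u starts; res_class: op -> resource class;
        # capacity: resource class -> units available per cycle.
        preds = {v: [] for v in nodes}
        for u, v, d in edges:
            preds[v].append((u, d))
        cycle, sched, t = {}, set(), 0
        while len(sched) < len(nodes):
            used = {}                             # resource use this cycle
            for v in nodes:                       # priority = list order
                if v in sched:
                    continue
                # Data-ready: all predecessors scheduled, results arrived.
                if all(u in sched and cycle[u] + d <= t for u, d in preds[v]):
                    r = res_class[v]
                    if used.get(r, 0) < capacity[r]:
                        cycle[v], used[r] = t, used.get(r, 0) + 1
                        sched.add(v)
            t += 1
        return cycle

    ops  = ["ld1", "ld2", "add", "st"]
    deps = [("ld1", "add", 2), ("ld2", "add", 2), ("add", "st", 1)]
    cls  = {"ld1": "mem", "ld2": "mem", "add": "alu", "st": "mem"}
    print(list_schedule(ops, deps, cls, {"mem": 1, "alu": 1}))
    # {'ld1': 0, 'ld2': 1, 'add': 3, 'st': 4}

In a production scheduler the inner loop would visit the ready set in slack order rather than list order.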
Problem with basic block scheduling

• Basic blocks contain on average only about 6 instructions
• Unrolling may help for loops
• Go beyond basic blocks:
1. Extended basic block scheduling
2. Software pipelining
Extended basic block scheduling: Scope

Partitioning a CFG into scheduling scopes:

[Figure: a CFG with basic blocks A–G. Left: a path through the CFG selected as a Trace; side-entries and join points from the off-trace blocks remain. Right: tail duplication (copies D’, E’ and G’) removes the side-entries, turning the trace into a single-entry Superblock.]
Extended basic block scheduling: Scope

Partitioning a CFG into scheduling scopes:

[Figure: the same CFG partitioned in two further ways. Left: a Decision Tree, where tail duplication (copies D’, E’, F’, G’ and G’’) eliminates all join points so each scope is a tree with a single entry. Right: a Hyperblock / region, which keeps join points and only tail-duplicates G (copy G’).]
Extended basic block scheduling: Scope

Comparing scheduling scopes:

                             Trace  Sup.block  Hyp.block  Dec.Tree  Region
    Multiple exec. paths      No       No        Yes        Yes      Yes
    Side-entries allowed      Yes      No        No         No       No
    Join points allowed       Yes      No        Yes        No       Yes
    Code motion down joins    Yes      No        No         No       No
    Must be if-convertible    No       No        Yes        No       No
    Tail dup. before sched.   No       Yes       No         Yes      No
Extended basic block scheduling:
Code Motion

[CFG: block A contains a) add r4,r4,4 and b) beq ...; its successors are B, containing c) add r1,r1,r2, and C, containing d) sub r1,r1,r2; both join in D, which contains e) st r1,8(r4).]

• Downward code motions?
— a → B, a → C, a → D, c → D, d → D
• Upward code motions?
— c → A, d → A, e → B, e → C, e → A
Extended basic block scheduling:
Code Motion

Legend:
    b’  source basic block
    b   destination basic block
    M   basic blocks between the source and destination basic blocks
    D   basic blocks where duplicates have to be placed
    I   control flow edges where off-liveness checks have to be performed

• SCP (single copy on a path) rule: no path may exist between 2 different D blocks
Extended basic block scheduling:
Code Motion

• A dominates B ⇔ A is always executed before B
– Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇔ B is always executed after A
– Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative

[Example CFG with blocks A–F:]
    Q1: does C dominate E?
    Q2: does C dominate D?
    Q3: does F post-dominate D?
    Q4: does D post-dominate B?
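Dominators can be computed with the standard iterative data-flow algorithm; a compact Python sketch (the diamond-shaped example graph is assumed, not the CFG in the figure):

    def dominators(cfg, entry):
        # cfg: block -> list of successor blocks.
        blocks = set(cfg) | {s for ss in cfg.values() for s in ss}
        preds = {b: {p for p in blocks if b in cfg.get(p, [])}
                 for b in blocks}
        dom = {b: set(blocks) for b in blocks}   # start from 'everything'
        dom[entry] = {entry}
        changed = True
        while changed:                           # iterate to a fixed point
            changed = False
            for b in blocks - {entry}:
                new = ({b} | set.intersection(*(dom[p] for p in preds[b]))
                       if preds[b] else {b})
                if new != dom[b]:
                    dom[b], changed = new, True
        return dom

    # A diamond: A branches to B and C, which join in D.
    print(dominators({"A": ["B", "C"], "B": ["D"], "C": ["D"]}, "A"))
    # D is dominated only by A and itself, so moving code from D up into
    # B or C crosses a join and needs duplication.

Post-dominators follow by running the same algorithm on the reversed CFG.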
Scheduling: Loops

Loop optimizations:

[Figure: a loop A → B → C → D with backedge. Loop peeling copies the body (C’, C’’) in front of the loop, so the first iterations execute straight-line; loop unrolling places the copies C’ and C’’ inside the loop, so each trip of the backedge executes several original iterations.]
Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining; only software pipelining keeps the resources continuously busy.]
Software pipelining

• Software pipelining a loop is:
– Scheduling the loop such that iterations start before preceding iterations have finished, or:
– Moving operations across the backedge

Example: y = a·x, with one load (LD), multiply (ML) and store (ST) per iteration:
– sequential schedule: LD, ML, ST per iteration — 3 cycles/iteration
– unrolling (3 iterations per loop body): the LD/ML/ST of the three iterations overlap within the unrolled body — 5/3 cycles/iteration
– software pipelining: in the steady state the LD of iteration i+2, the ML of iteration i+1 and the ST of iteration i issue together — 1 cycle/iteration
Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

(a) Example loop:
    for (i = 0; i < n; i++)
        a[i+6] = 3*a[i] - 1;

(b) Code without loop control:
    ld  r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st  r4,(r5)

(c) Software pipeline (one column per iteration, one row per cycle):

    ld                      \
    mul  ld                  |  Prologue
    sub  mul  ld            /
    st   sub  mul  ld       >   Kernel
         st   sub  mul      \
              st   sub       |  Epilogue
                   st       /

• The prologue fills the SW pipeline with iterations
• The epilogue drains the SW pipeline
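A small Python sketch that lays out such a pipeline for a straight-line loop body (the unit initiation interval and one operation per cycle per iteration are assumptions matching the example above):

    def software_pipeline(body, n_iters, ii=1):
        # body: the ops of one iteration, one issued per cycle;
        # ii: initiation interval -- a new iteration starts every ii cycles.
        depth = len(body)
        rows = [[] for _ in range((n_iters - 1) * ii + depth)]
        for it in range(n_iters):
            for stage, op in enumerate(body):
                rows[it * ii + stage].append(f"{op}({it})")
        return rows

    for t, row in enumerate(software_pipeline(["ld", "mul", "sub", "st"], 4)):
        print(t, "  ".join(row))
    # Rows 0-2 form the prologue, row 3 the steady-state kernel
    # (st/sub/mul/ld of four different iterations), rows 4-6 the epilogue.

A real modulo scheduler first computes the smallest feasible II from resource and recurrence constraints, then schedules the kernel within that II.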
Summary and Conclusions

• Compilation for ILP architectures is getting mature and is entering the commercial arena.
• However:
– There is a great discrepancy between the available and the exploitable parallelism

What if you need more parallelism?
– source-to-source transformations
– use other algorithms