ASCI Winterschool on
Embedded Systems
March 2004
Renesse
Compilers
with emphasis on ILP compilation
Henk Corporaal
Peter Knijnenburg
Compiling for ILP Architectures
Overview:
• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Motivation
• Performance requirements increase
• Applications may contain much instruction
level parallelism
• Processors offer lots of hardware
concurrency
Problem to be solved:
– how to exploit this concurrency automatically?
Goals of code generation
• High speedup
– Exploit all the hardware concurrency
– Extract all application parallelism
• obey true dependencies only
• No code rewriting: automatic parallelization
– However: application tuning may be required
• Limit code expansion
Overview
• Motivation and Goals
• Measuring and exploiting available
parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Measuring and exploiting available
parallelism
• How to measure parallelism within applications?
– Using existing compiler
– Using trace analysis
• Track all the real data dependencies (RaWs) of instructions from
issue window
– register dependence
– memory dependence
• Check for correct branch prediction
– if prediction correct continue
– if wrong, flush schedule and start in next cycle
Measuring and exploiting available parallelism
• Different effects reduce the exploitable parallelism:
– Reducing window size
• i.e., the number of instructions to choose from
– Non-perfect branch prediction
• perfect (oracle model)
• dynamic predictor
(e.g. 2 bit prediction table with finite number of entries)
• static prediction (using profiling)
• no prediction
– Restricted number of registers for renaming
• typical superscalars have O(100) registers
– Restricted number of other resources, like FUs
Measuring and exploiting available parallelism
• Different effects reduce the exploitable parallelism
(cont’d):
– Non-perfect alias analysis (memory disambiguation)
Models to use:
• perfect
• inspection: no dependence in cases like:

    r1 := 0(r9)      4(r9) := r2     (same base register, different offsets)
    r1 := 0(fp)      0(gp) := r2     (fp and gp address disjoint areas)

  A more advanced analysis may disambiguate most stack and global references, but not the heap references
• none
• Important: good branch prediction, 128 registers for
renaming, alias analysis on stack and global accesses,
and for FP a large window size
Measuring and exploiting available parallelism
• How much parallelism is there in real programs?

                                        Compiler model
  Application           Domain        Limited   Real   Oracle-a   Oracle-b
  Dhrystone             scalar         1.74     3.85      4.8       74.1
  Cpp                   scalar         1.28     3.65     10.6       40.1
  Compress              scalar         1.30     2.65      6.2       22.1
  Linpack               vector         2.26     4.19      9.9       92.6
  Livermore             vector         1.77     3.55      6.0        9.7
  Livermore, Kernel 1   vector         2.66     7.91    520.0      527.0
  Mpeg-play             appl. spec     1.85     3.25     14.4       32.7
  Mpeg-play, DCT        appl. spec     2.47     6.12     60.9       60.9
  Mccd                  appl. spec     2.06     3.18     17.4       45.0
  Mccd, GVI             appl. spec     3.39     3.39    111.0      111.0

Used compiler models:
  Limited:  look within basic blocks only
  Real:     inter-basic-block scheduling ILP compiler
  Oracle-a: trace analysis, within functions only
  Oracle-b: trace analysis, within whole program
Conclusions
• Amount of parallelism is limited
– higher in Multi-Media
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of compiler
– hardware
– source-code transformations
Overview
• Motivation and Goals
• Measuring and exploiting available
parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Compiler basics
• Overview
– Compiler trajectory / structure / passes
– Abstract Syntax Tree (AST)
– Control Flow Graph (CFG)
– Basic optimizations
– Register allocation
– Code selection
Compiler basics: trajectory
Source program
  ↓ Preprocessor
  ↓ Compiler        → error messages
  ↓ Assembler
  ↓ Loader/Linker   ← library code
Object program
Compiler basics: structure / passes
Source code
  ↓ Lexical analyzer          – token generation
  ↓ Parsing                   – check syntax, check semantics, parse tree generation
Intermediate code
  ↓ Code optimization         – data flow analysis, local optimizations, global optimizations
  ↓ Code generation           – code selection, peephole optimizations
  ↓ Register allocation       – making interference graph, graph coloring, spill code insertion, caller/callee save and restore code
Sequential code
  ↓ Scheduling and allocation – exploiting ILP
Object code
Compiler basics: structure
Simple compilation example: position := initial + rate * 60

Lexical analyzer:
  id1 := id2 + id3 * 60

Syntax analyzer (parse tree):
  :=
  ├─ id1
  └─ +
     ├─ id2
     └─ *
        ├─ id3
        └─ 60

Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1   := temp3

Code optimizer:
  temp1 := id3 * 60.0
  id1   := id2 + temp1

Code generator:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
Compiler basics: structure
- SUIF-1 toolkit example:

  FORTRAN → FORTRAN-to-C pre-processing → C → C front-end → SUIF

  Passes applied on SUIF (in the order listed):
  – FORTRAN specific transformations
  – converting non-standard structures to SUIF
  – constant propagation
  – forward propagation
  – high-SUIF to low-SUIF
  – induction variable identification
  – constant propagation
  – scalar privatization analysis
  – strength reduction
  – reduction analysis
  – dead-code elimination
  – locality optimization and parallelism analysis
  – register allocation
  – parallel code generation
  – assembly code generation

  Back-ends: SUIF to text (SUIF text), SUIF to postscript (postscript), SUIF to C (C), assembly code generation (assembly code).
Compiler basics:
Abstract Syntax Tree (AST)
C input code:

  if (a > b) { r = a % b; }
  else       { r = b % a; }

Parse tree: ‘infinite’ nesting:

  Stat IF
  ├─ Cmp >
  │   ├─ Var a
  │   └─ Var b
  ├─ Statlist
  │   └─ Stat Expr
  │       └─ Assign
  │           ├─ Var r
  │           └─ Binop %
  │               ├─ Var a
  │               └─ Var b
  └─ Statlist
      └─ Stat Expr
          └─ Assign
              ├─ Var r
              └─ Binop %
                  ├─ Var b
                  └─ Var a
Compiler basics:
Control flow graph (CFG)
C input code:

  if (a > b) { r = a % b; }
  else       { r = b % a; }

CFG:

  1:  sub t1, a, b
      bgz t1, 2, 3
  2:  rem r, a, b          3:  rem r, b, a
      goto 4                   goto 4
  4:  ...

A Program is a collection of Functions, each function is a collection of Basic Blocks, each BB contains a set of Instructions, each instruction consists of several Transports, ...
Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations
Compiler basics: Basic optimizations
• Machine independent optimizations
– Common subexpression elimination
– Constant folding
– Copy propagation
– Dead-code elimination
– Induction variable elimination
– Strength reduction
– Algebraic identities
  • Commutative expressions
  • Associativity: tree height reduction
    – Note: not always allowed (due to limited precision)
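To make a few of these concrete, a hand-written sketch of their effect at the C level (the function and variable names are made up for illustration):

  /* before optimization */
  int before(int a, int b, int i) {
      int x = a * b + 2 * 8;   /* 2 * 8 is a compile-time constant */
      int y = a * b + i;       /* a * b is recomputed              */
      int z = i * 2;           /* multiply by a power of two       */
      return x + y + z;
  }

  /* after constant folding, common subexpression elimination
     and strength reduction */
  int after(int a, int b, int i) {
      int t = a * b;           /* CSE: compute a * b once          */
      int x = t + 16;          /* constant folding: 2 * 8 -> 16    */
      int y = t + i;
      int z = i << 1;          /* strength reduction: * 2 -> shift */
      return x + y + z;
  }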
Compiler basics: Basic optimizations
• Machine dependent optimization example
  What’s the optimal implementation of a*34 ?
  – Use multiplier: mul Tb, Ta, 34
    • Pro: No thinking required
    • Con: May take many cycles
  – Alternative:
      SHL Tc, Ta, 1       ; Tc := 2*a
      ADD Tb, Tc, Tzero   ; Tb := 2*a
      SHL Tc, Tc, 4       ; Tc := 32*a
      ADD Tb, Tb, Tc      ; Tb := 34*a
    • Pro: May take fewer cycles
    • Cons:
      – Uses more registers
      – Additional instructions (I-cache load / code size)
Compiler basics: Register allocation
• Register Organization
  Conventions are needed for parameter passing and register usage across function calls. Example:

  r31 … r21   Callee saved registers
  r20 … r11   Caller saved registers / temporaries
  r10 … r1    Argument and result transfer
  r0          Hard-wired 0
Register allocation using graph coloring
Given a set of registers, what is the most efficient
mapping of registers to program variables in terms
of execution time of the program?
• A variable is defined at a point in program when a value is
assigned to it.
• A variable is used at a point in a program when its value is
referenced in an expression.
• The live range of a variable is the execution range
between definitions and uses of a variable.
Register allocation using graph coloring
Example:
  Program:       Live ranges:
                  a  b  c  d
  a :=            |
  c :=            |     |
  b :=            |  |  |
     := b         |  |  |
  d :=            |     |  |
     := a         |     |  |
     := c               |  |
     := d                  |
Register allocation using graph coloring
Interference graph (edges: a–b, a–c, a–d, b–c, c–d):

  a ─── b
  │ ╲   │
  │  ╲  │
  d ─── c

Coloring:
  a = red
  b = green
  c = blue
  d = green

Graph needs 3 colors => program needs 3 registers
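The coloring step can be sketched in a few lines of C; this is a minimal greedy coloring of exactly this four-node interference graph (a real allocator uses the simplify/select scheme with spilling, shown on the next slides):

  #include <stdio.h>

  #define N 4                           /* variables a, b, c, d */
  static const char *name[N] = { "a", "b", "c", "d" };

  /* adjacency matrix of the interference graph above:
     edges a-b, a-c, a-d, b-c, c-d */
  static const int interferes[N][N] = {
      { 0, 1, 1, 1 },
      { 1, 0, 1, 0 },
      { 1, 1, 0, 1 },
      { 1, 0, 1, 0 },
  };

  int main(void) {
      int color[N];
      for (int v = 0; v < N; v++) {
          int used[N] = { 0 };
          for (int u = 0; u < v; u++)   /* colors of already-colored neighbors */
              if (interferes[v][u])
                  used[color[u]] = 1;
          int c = 0;
          while (used[c]) c++;          /* lowest free color */
          color[v] = c;
          printf("%s -> r%d\n", name[v], c);
      }
      return 0;
  }

This prints a -> r0, b -> r1, c -> r2, d -> r1: three registers, with d sharing a register with b.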
Register allocation using graph coloring
Spill/reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available !!

  Program:
    a :=
    c :=
    store c       ; spill c
    b :=
       := b
    d :=
       := a
    load c        ; reload c
       := c
       := d

Live ranges: the spill splits the live range of c into two short ranges, so at most two variables are live at any point.
Register allocation for a monolithic RF
Scheme of the optimistic register allocator:

  Renumber → Build → Spill costs → Simplify → Select
     ↑                                          │
     └────────────── spill code ←───────────────┘

The Select phase selects a color (= machine register) for a variable that minimizes the heuristic:

  h1 = fdep(col, var) + caller_callee(col, var)

where:
  fdep(col, var):          a measure for the introduction of false dependencies
  caller_callee(col, var): cost for mapping var on a caller or callee saved register
Compiler basics: Code selection
• CISC era
  – Code size important
  – Determine shortest sequence of code
    • Many options may exist
  – Pattern matching
    Example M68020:
      D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
      → ADD ([10,A1], D2*16, 20), D1
• RISC era
  – Performance important
  – Only few possible code sequences
  – New implementations of old architectures optimize the RISC part of the instruction set only; e.g. i486 / Pentium / M68020
Overview
• Motivation and Goals
• Measuring and exploiting available
parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
What is scheduling?
• Time allocation:
– Assigning instructions or operations to time slots
– Preserve dependences:
• Register dependences
• Memory dependences
– Optimize code with respect to performance/ code
size/ power consumption/ ..
In practice scheduling may also integrate allocation of
resources:
• Space allocation (satisfy resource constraints):
– Bind operations to FUs
– Bind variables to registers/ register files
– Bind transports to buses
Why scheduling?
Let’s look at the execution time:

  T_execution = N_cycles × T_cycle
              = N_instructions × CPI × T_cycle

Scheduling may reduce T_execution:
– Reduce CPI (cycles per instruction)
  • early scheduling of long latency operations
  • avoid pipeline stalls due to structural, data and control hazards
  • allow N_issue > 1 and therefore CPI < 1
– Reduce N_instructions
  • compact many operations into each instruction (VLIW)
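A worked example with hypothetical numbers: for N_instructions = 10^9, CPI = 1.5 and T_cycle = 1 ns, T_execution = 10^9 × 1.5 × 10^-9 s = 1.5 s. A schedule that removes the stalls and fills two issue slots per cycle (CPI = 0.5) cuts this to 0.5 s.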
Scheduling: Structural hazards
Basic pipelining diagram:
• IF: instruction fetch
• ID: instruction decode
• OF: operand fetch
• EX: execute
• WB: write back

[Figure: ideal pipeline, one instruction entering the IF ID OF EX WB stages per cycle.]

Pipeline stalls due to lack of resources:

[Figure: a load occupying the shared memory port leaves an empty pipeline stage and delays the IF of a later instruction; a multi-cycle EX (EX EX EX) on the single FU stalls the following instructions in OF.]
Scheduling: Data dependences
Three types: RaW, WaR and WaW. Examples:

  add r1, r2, 5     ; r1 := r2+5
  sub r4, r1, r3    ; RaW of r1

  add r1, r2, 5
  sub r2, r4, 1     ; WaR of r2

  add r1, r2, 5
  sub r1, r1, 1     ; WaW of r1

  st  r1, 5(r2)     ; M[r2+5] := r1
  ld  r5, 0(r4)     ; RaW if 5+r2 = 0+r4

WaW and WaR can be solved through renaming !!
Scheduling: RaW dependence
  add r1, r2, 5     ; r1 := r2+5
  sub r4, r1, r3    ; RaW of r1

Without bypass circuitry: the sub has to wait until the add has written r1 back (WB) before it can fetch its operand.

With bypass circuitry: the result of the add is forwarded directly to the sub, which saves two cycles.

[Figure: pipeline diagrams of both cases.]
Scheduling: RaW dependence
Bypassing circuitry:

[Figure: ALU with two inputs from the register file; the ALU output goes to the register file and is also fed back to both ALU inputs.]
Scheduling:
RaW dependence
Avoiding RaW stalls: reordering of instructions by the compiler.

Example: avoiding the one-cycle load interlock for the code

  a = b + c
  d = e - f

  Unscheduled code:             Scheduled code:
    Lw  R1,b                      Lw  R1,b
    Lw  R2,c                      Lw  R2,c
    Add R3,R1,R2  ; interlock     Lw  R5,e    ; extra reg. needed!
    Sw  a,R3                      Add R3,R1,R2
    Lw  R1,e                      Lw  R2,f
    Lw  R2,f                      Sw  a,R3
    Sub R4,R1,R2  ; interlock     Sub R4,R5,R2
    Sw  d,R4                      Sw  d,R4
Scheduling: Control hazards
Branch requires 3 actions:
• Compute new address
• Determine condition
• Perform the actual branch (if taken): PC := new address

[Figure: pipeline diagram for "predict not taken": the instructions after the branch enter the pipeline; if the branch is taken, fetch restarts at L and the intervening slots are wasted.]
Control hazards: what's the penalty?
  CPI = CPI_ideal + f_branch × P_branch
  P_branch = N_delayslots × miss_rate

• Superscalars tend to have a large branch penalty P_branch due to many pipeline stages
• Note that penalties have a larger effect when CPI_ideal is low
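A worked example with hypothetical numbers: take CPI_ideal = 1, f_branch = 0.2 (one in five instructions is a branch), N_delayslots = 3 and a misprediction rate of 0.3. Then P_branch = 3 × 0.3 = 0.9 and CPI = 1 + 0.2 × 0.9 = 1.18, an 18% slowdown. For a 4-issue machine with CPI_ideal = 0.25 the same penalty gives CPI = 0.43, a 72% slowdown.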
Scheduling: Control hazards
• What can we do about control hazards and
CPI penalty?
– Keep penalty Pbranch low:
• Early computation of new PC
• Early determination of condition
• Visible delay slots filled by compiler (MIPS)
– Branch prediction
– Reduce control dependencies (control height
reduction) [Schlansker and Kathail, Micro’95]
– Remove branches: if-conversion
• Conditional instructions: CMOVE, cond skip next
• Guarding all instructions: TriMedia
Scheduling: Control height reduction
• Reduce the number of branches (control
height) along a trace [Schlansker and Kathail,
Micro’95]
• Problems with stores:
– May not move above branches
Scheduling: Control height reduction
Original code:

  store 0;  c0 → branch 0 (to exit 0)
  store 1;  c1 → branch 1 (to exit 1)
  store 2;  c2 → branch 2 (to exit 2)
  store 3;  c3 → branch 3 (to exit 3)
  fall-through
Scheduling: Control height reduction
New code:

  on-trace code:  store 0;  c0 ∨ c1 ∨ c2 ∨ c3 → one branch (to the off-trace code);
                  store 1;  store 2;  store 3;  fall-through

  off-trace code: re-tests the individual conditions:
                  c0 → branch 0 (exit 0);  store 1;  c1 → branch 1 (exit 1);
                  store 2;  c2 → branch 2 (exit 2);  store 3;  exit 3

Note that stores 1-3 may also be guarded; this eliminates the branch latency altogether along the on-trace path.
Scheduling: Conditional instructions
• Example: CMOVE (supported by Alpha)

  if (A == 0) S = T;

  assume: r1: A, r2: S, r3: T

  Object code:
      Bnez r1, L
      Mov  r2, r3
  L:  . . . .

  After if-conversion:
      Cmovz r2, r3, r1
Scheduling: Conditional instructions
Conditional instructions are useful, however:
• Squashed instructions still take execution time and execution resources
– Consequence: long target blocks can not be if-converted
• Condition has to be known early
• Moving operations across multiple branches requires complicated
predicates
• Compatibility: change of ISA (instruction set architecture)
Practice:
• Current superscalars support a limited set of conditional instructions
• CMOVE: alpha, MIPS, PowerPC, SPARC
• HP PA: any RR instruction can conditionally squash next instruction
Large VLIWs profit from making all instructions conditional
• guarded execution: TriMedia, Intel/HP IA-64, TI C6x
Scheduling: Conditional instructions
Full guard support: if-conversion of conditional code

Assume:
• t_branch: branch latency
• p_branch: branching probability
• t_true:   execution time of the TRUE branch
• t_false:  execution time of the FALSE branch

Execution times of the original and the if-converted code for a non-ILP architecture:

  t_original_code     = (1 + p_branch) × t_branch + p_branch × t_true + (1 − p_branch) × t_false
  t_if_converted_code = t_true + t_false
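A worked example with hypothetical numbers: take t_branch = 2, p_branch = 0.5 and t_true = t_false = 3. Then t_original_code = 1.5 × 2 + 0.5 × 3 + 0.5 × 3 = 6 cycles and t_if_converted_code = 3 + 3 = 6 cycles: no gain. For short blocks with t_true = t_false = 1, however, t_original_code = 4 while t_if_converted_code = 2, which is why the next slide shows if-conversion paying off only for short target blocks.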
Scheduling: Conditional instructions
Speedup of if-converted code for non-ILP architectures:

[Figure: speedup as a function of block size; only interesting for short target blocks.]
Scheduling: Conditional instructions
Speedup of if-converted code for ILP architectures with sufficient resources:

  t_if_converted = max(t_true, t_false)

[Figure: a much larger area of interest than in the non-ILP case !!]
Scheduling: Conditional instructions
• Full guard support for large ILP architectures has a number of advantages:
  – Removing unpredictable branches
  – Enlarging scheduling scope
  – Enabling software pipelining
  – Enhancing code motion when speculation is not allowed
  – Resource sharing; even when speculation is allowed guarding may be profitable
Scheduling: Overview
Transforming a sequential program into a parallel program:
read sequential program
read machine description file
for each procedure do
perform function inlining
for each procedure do
transform an irreducible CFG into a reducible CFG
perform control flow analysis
perform loop unrolling
perform data flow analysis
perform memory reference disambiguation
perform register allocation
for each scheduling scope do
perform instruction scheduling
write parallel program
Scheduling: Int.Lin.Programming
Integer linear programming scheduling method
• Introduce:
  – Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j
  – Constraints like:
    • Limited resources:

        ∀j, ∀t:  Σ_i x_{i,j,t} ≤ M_t

      where x_{i,j,t} refers to operations of type t and M_t is the number of resources of type t
    • Data dependence constraints
    • Timing constraints
• Problem: too many decision variables
List Scheduling
• Make a dependence graph
• Determine minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in the first cycle with sufficient resources

Note:
– Scheduling order is sequential
– Priority is determined by the heuristic used, e.g. slack
Basic Block Scheduling
[Figure: basic block dependence graph computing X, y and z from A, B and C with LD, ADD, SUB, NEG and MUL operations. Each operation is annotated with its <ASAP cycle, ALAP cycle> pair, e.g. <1,1>, <2,2>, <3,3>, <4,4> on the critical path and <1,3>, <2,3>, <1,4>, <2,4> elsewhere; the difference between the two is the slack.]
List scheduling
Priority heuristic:

  asap(v) = max{ asap(u) + delay(u,v) | (u,v) ∈ E }   if pred(v) ≠ ∅
            0                                          otherwise

  alap(v) = min{ alap(u) − delay(v,u) | (v,u) ∈ E }   if succ(v) ≠ ∅
            L_max                                      otherwise

  slack(v) = alap(v) − asap(v)
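The ASAP/ALAP/slack computation is easily written down for a DAG whose nodes are numbered in topological order; a minimal sketch in C, with a made-up four-node example graph:

  #include <stdio.h>

  #define N 4                        /* operations, in topological order */
  static int delay[N][N];            /* delay[u][v] > 0 iff edge u -> v  */

  int main(void) {
      /* example DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3 */
      delay[0][1] = 2; delay[1][3] = 1;
      delay[0][2] = 1; delay[2][3] = 1;

      int asap[N], alap[N];

      /* asap(v) = max over predecessors u of asap(u) + delay(u,v) */
      for (int v = 0; v < N; v++) {
          asap[v] = 0;
          for (int u = 0; u < v; u++)
              if (delay[u][v] && asap[u] + delay[u][v] > asap[v])
                  asap[v] = asap[u] + delay[u][v];
      }

      int Lmax = asap[N - 1];        /* node N-1 is the unique sink here */

      /* alap(v) = min over successors u of alap(u) - delay(v,u) */
      for (int v = N - 1; v >= 0; v--) {
          alap[v] = Lmax;
          for (int u = v + 1; u < N; u++)
              if (delay[v][u] && alap[u] - delay[v][u] < alap[v])
                  alap[v] = alap[u] - delay[v][u];
      }

      for (int v = 0; v < N; v++)
          printf("op %d: asap=%d alap=%d slack=%d\n",
                 v, asap[v], alap[v], alap[v] - asap[v]);
      return 0;
  }

Operation 2 gets slack 1 (it lies off the critical path 0 → 1 → 3); the others get slack 0.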
Extended basic block scheduling: Scope
[Figure: four scheduling scope shapes: Trace, Superblock, Decision Tree, Hyperblock/region.]
Extended basic block scheduling: Scope
Comparing scheduling scopes:

                            Trace   Sup.block   Hyp.block   Dec.tree   Region
  Multiple exc. paths        No        No          Yes         Yes       Yes
  Side-entries allowed       Yes       No          No          No        No
  Join points allowed        Yes       No          Yes         No        Yes
  Code motion down joins     Yes       No          No          No        No
  Must be if-convertible     No        No          Yes         No        No
  Tail dup. before sched.    No        Yes         No          Yes       No
Extended basic block scheduling: Scope
Partitioning a CFG into scheduling scopes:

[Figure: a CFG with blocks A–G partitioned into traces, and the same CFG partitioned into superblocks, where tail duplication creates the copies D’, E’ and G’ to remove side-entries.]
Extended basic block scheduling: Scope
Partitioning a CFG into scheduling scopes:

[Figure: the same CFG partitioned into decision trees, with tail duplication creating D’, E’, F’, G’ and G’’, and partitioned into a hyperblock/region.]
Extended basic block scheduling:
Code Motion
[Figure: CFG with block A containing a) add r4, r4, 4 and b) beq . . ., block B containing c) add r1, r1, r2, block C containing d) sub r1, r1, r2, and join block D containing e) st r1, 8(r4).]

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A
Extended basic block scheduling:
Code Motion
Legend:

  M:  basic blocks between the source and destination basic blocks
  D:  basic blocks where duplication has to be placed
  I:  control flow edges where off-liveness checks have to be performed
  b:  destination basic block
  b’: source basic block

[Figure: example CFG annotated with M, D and I blocks/edges between source block b’ and destination block b.]

• SCP (single copy on a path) rule: no path may exist between 2 different D blocks
Extended basic block scheduling:
Code Motion
• A dominates B ⇔ A is always executed before B
  – Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇔ B is always executed after A
  – Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative

[Figure: example CFG with blocks A–F.]

  Q1: does C dominate E?
  Q2: does C dominate D?
  Q3: does F post-dominate D?
  Q4: does D post-dominate B?
Scheduling: Loops
Loop optimizations:

[Figure: a loop A → B → C → D transformed by loop peeling (a copy C’ of the body is peeled off in front of the loop) and by loop unrolling (copies C’, C’’ of the body are placed inside the loop).]
Scheduling: Loops
Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining; software pipelining keeps resource utilization high continuously.]
Software pipelining
• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
Or:
  – Moving operations across the backedge

Example: y = a·x, with a LD, ML and ST operation per iteration:

  Sequential:           3 cycles/iteration
  Unrolling (3x):       5/3 cycles/iteration
  Software pipelining:  1 cycle/iteration (in the steady state the LD, ML and ST of three different iterations execute together)
Software pipelining (cont’d)
Basic techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge
Software pipelining: Modulo scheduling
Example: modulo scheduling a loop

(a) Example loop:

  for (i = 0; i < n; i++)
    a[i+6] = 3*a[i] - 1;

(b) Code without loop control:

  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)

(c) Software pipeline: successive iterations start one cycle apart, so in the steady state the ld, mul, sub and st of four consecutive iterations execute in the same cycle:

  Prologue: fills the SW pipeline with iterations
  Kernel:   ld | mul | sub | st   (one operation from each of four iterations)
  Epilogue: drains the SW pipeline
Software pipelining:
determine II, the Initiation Interval

  for (i=0; ...) a[i+6] = 3*a[i] - 1

Cyclic data dependences, edges labelled (delay, distance):

  ld r1, (r2) → mul r3, r1, 3 → sub r4, r3, 1 → st r4, (r5), each edge (1,0);
  (0,1) edges back between the operations of consecutive iterations;
  a loop-carried edge (1,6) from the st back to the ld, since a[i+6] is read again 6 iterations later.

Constraint:

  cycle(v) ≥ cycle(u) + delay(u,v) − II·distance(u,v)
Modulo scheduling constraints
MII: minimum initiation interval, bounded by cyclic dependences and resources:

  MII = max{ ResMII, RecMII }

Resources:

  ResMII = max over resources r of ⌈ used(r) / available(r) ⌉

Cycles: going once around a dependence cycle c must not decrease the schedule time, so

  cycle(v) ≥ cycle(v) + Σ_{e∈c} delay(e) − II · Σ_{e∈c} distance(e)

Therefore:

  RecMII = min{ II ∈ N | ∀c ∈ cycles: 0 ≥ Σ_{e∈c} delay(e) − II · Σ_{e∈c} distance(e) }

Or:

  RecMII = max over cycles c of ⌈ Σ_{e∈c} delay(e) / Σ_{e∈c} distance(e) ⌉
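A worked example for the loop above: the only dependence cycle runs ld → mul → sub → st → ld with Σ delay = 1+1+1+1 = 4 and Σ distance = 0+0+0+6 = 6, so RecMII = ⌈4/6⌉ = 1. Assuming (hypothetically) enough FUs so that ResMII = 1 as well, MII = max{1, 1} = 1, which matches the 1 cycle/iteration kernel shown in the software pipelining example.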
Overview
• Motivation and Goals
• Measuring and exploiting available
parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Source Level Transformations
• Dependence Analysis
• Loop level transformations
– Loop merging
– Loop interchange
– Loop unrolling
– Unroll-and-Jam
– Loop tiling
• Much more powerful than back-end
optimizations
Dependence Analysis
• Consider following statements:
S1: a = b + c;
S2: d = a + f;
S3: a = g + h;
• S1 → S2: true or flow dependence
• S2 → S3: anti-dependence
• S1 → S3: output dependence
Dependences in Loops
• Consider the following loop
for(i=0; i<N; i++){
S1: a[i] = …;
S2: b[i] = a[i-1];}
• Loop carried dependence S1 → S2.
• Need to detect whether there exist i and i’ such that i = i’−1 in the loop space.
Definition of Dependence
• There exists a dependence if there are two statement instances that refer to the same memory location and (at least) one of them is a write.
• There should not be a write between these two statement instances.
• In general, it is undecidable whether there exists a dependence.
Direction of Dependence
• If there is a flow dependence between two statements S1 and S2 in a loop, then S1 writes to a variable in an earlier iteration than S2 reads that variable.
• The iteration vector of the write is lexicographically less than the iteration vector of the read:

  I < I’ iff i1 = i’1 ∧ … ∧ i(k−1) = i’(k−1) ∧ ik < i’k for some k.
Direction Vectors
• A direction vector is a vector

  (=,=,…,=,<,*,*,…,*)

  where * can denote =, < or >.
• Such a vector encodes a (collection of) dependences.
• A loop transformation should result in a new direction vector for the dependence that is also lexicographically positive.
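As a small illustration (a hypothetical loop nest), the code below has a flow dependence with direction vector (<,>): the element written in iteration (i−1, j+1) is read one i-iteration later and one j-iteration earlier. By the rule on the next slide this nest may not be interchanged.

  #define N 100
  double A[N][N];

  void sweep(void) {
      for (int i = 1; i < N; i++)
          for (int j = 0; j < N - 1; j++)
              A[i][j] = A[i-1][j+1] + 1.0;   /* direction vector (<,>) */
  }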
Loop Interchange
• Interchanging two loops also interchanges
the corresponding entries in a direction
vector.
• Example: if direction vector of a
dependence is (<,>) then we may not
interchange the loops because the resulting
direction would be (>,<) which is
lexicographically negative.
Affine Bounds and Indices
• We assume loop bounds and array index expressions are affine expressions:

  a0 + a1 × i1 + … + ak × ik

• Each loop bound for loop index ik is an affine expression over the previous loop indices i1 to i(k−1).
• Each array index expression is an affine expression over all loop indices.
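For example (a hypothetical nest), both the triangular bound and the index expressions below are affine:

  #define N 64
  double A[N][2 * N + 4];

  void fill(void) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j <= i; j++)   /* bound affine in i       */
              A[i][2*j + 3] = 1.0;       /* index affine in i and j */
  }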
Non-Affine Expressions
• Index expressions like i*j cannot be
handled by dependence tests. We must
assume that there exists a dependence.
• An important class of index expressions are
indirections A[B[i]]. These occur frequently
in scientific applications (sparse matrix
computations).
• In embedded applications???
Linear Diophantine Equations
• A linear Diophantine equation is of the form

  a1·x1 + … + an·xn = c

• The equation has a solution iff gcd(a1,…,an) is a divisor of c.
GCD Test for Dependence
• Assume a single loop and two references A[a+b·i] and A[c+d·i].
• If there exists a dependence, then gcd(b,d) divides (c−a).
• Note the direction of the implication!
• If gcd(b,d) does not divide (c−a), then there exists no dependence.
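A minimal sketch of this test in C (the references in main are made up for illustration):

  #include <stdio.h>

  static int gcd(int x, int y) {
      if (x < 0) x = -x;
      if (y < 0) y = -y;
      while (y != 0) { int t = x % y; x = y; y = t; }
      return x;
  }

  /* references A[a + b*i] and A[c + d*i]:
     returns 1 if a dependence is possible, 0 if disproved */
  int gcd_test(int a, int b, int c, int d) {
      int g = gcd(b, d);
      if (g == 0)              /* both strides zero */
          return a == c;
      return (c - a) % g == 0;
  }

  int main(void) {
      /* A[2i] vs A[2i+1]: gcd(2,2) = 2 does not divide 1 -> no dependence */
      printf("%d\n", gcd_test(0, 2, 1, 2));   /* prints 0 */
      /* A[i] vs A[i+10]: gcd(1,1) = 1 divides 10 -> maybe */
      printf("%d\n", gcd_test(0, 1, 10, 1));  /* prints 1 */
      return 0;
  }

The second call is exactly the example on the next slide: the test cannot disprove the dependence even though the loop bounds exclude it.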
GCD Test (cont’d)
• However, if gcd(b,d) does divide (c-a) then
there might exist a dependence.
• Test is not exact since it does not take into
account loop bounds.
• For example:
for(i=0; i<10; i++)
A[i] = A[i+10] + 1;
GCD Test (cont’d)
• Using the theorem on linear Diophantine equations, we can test in arbitrary loop nests.
• We need one test for each direction vector.
• The vector (=,=,…,=,<,…) implies that the first k indices are the same.
• See the book by Zima for details.
Other Dependence Tests
• There exist many dependence tests
– separability test
– GCD test
– Banerjee test
– Range test
– Fourier-Motzkin test
– Omega test
• Exactness increases, but so does the cost.
Fourier-Motzkin Elimination
• Consider a collection of linear inequalities over the variables i1,…,in:

  e1(i1,…,in) ≤ e1’(i1,…,in)
  …
  em(i1,…,in) ≤ em’(i1,…,in)

• Is this system consistent, i.e. does there exist a solution?
• FM-elimination can determine this.
FM-Elimination (cont’d)
• First, create all pairs L(i1,…,i(n−1)) ≤ in and in ≤ U(i1,…,i(n−1)). This is the solution for in.
• Then create a new system

  L(i1,…,i(n−1)) ≤ U(i1,…,i(n−1))

  together with all original inequalities not involving in.
• This new system has one variable less, and we continue this way.
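A tiny worked example (made up for illustration): eliminate j from the system { i ≤ j, j ≤ 9, j ≤ i − 1 }. Pairing the lower bound i ≤ j with each upper bound gives i ≤ 9 and i ≤ i − 1; the latter simplifies to 0 ≤ −1, so the system is inconsistent and has no solution.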
FM-Elimination (cont’d)
• After eliminating i1, we end up with a collection of inequalities between constants, c1 ≤ c1’.
• The original system is consistent iff every such inequality can be satisfied, i.e. no inequality like 10 ≤ 3 occurs.
• There may be exponentially many new inequalities generated!
Fourier-Motzkin Test
• Loop bounds plus array index expressions generate sets of inequalities, using new loop indices i’ for the sink of the dependence.
• Each direction vector generates inequalities

  i1 = i1’ ∧ … ∧ i(k−1) = i(k−1)’ ∧ ik < ik’

• If all these systems are inconsistent, then there exists no dependence.
• The test is not exact (real solutions but no integer ones) but almost.
N-Dimensional Arrays
• Test in each dimension separately.
• This can introduce another level of
inaccuracy.
• Some tests (FM and Omega test) can test
in many dimensions at the same time.
• Otherwise, you can linearize an array:
Transform a logically N-dimensional array
to its one-dimensional storage format.
Hierarchy of Tests
• Try a cheap test first, then more expensive ones:

  if (cheap test1 = NO)
  then print ‘NO’
  else if (test2 = NO)
       then print ‘NO’
       else if (test3 = NO)
            then print ‘NO’
            else …
Practical Dependence Testing
• Cheap tests, like GCD and Banerjee tests,
can disprove many dependences.
• Adding expensive tests only disproves a
few more possible dependences.
• Compiler writer needs to trade-off
compilation time and accuracy of
dependence testing.
• For time critical applications, expensive
tests like Omega test (exact!) can be used.
Loop Transformations
• Change the order in which the iteration
space is traversed.
• Can expose parallelism, increase available
ILP, or improve memory behavior.
• Dependence testing is required to check
validity of transformation.
Loop Merging
• Two loops with the same upper and lower
bounds can be merged.
• Reduces loop overhead.
• Improves temporal locality.
Loop Merging
  for Ia = exp1 to exp2          for I = exp1 to exp2
    A(Ia)                    ⇒     A(I)
  for Ib = exp1 to exp2            B(I)
    B(Ib)
Example of locality improvement
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (j=0; j<N; j++)
    C[j] = f(B[j],A[j]);

        ⇓

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = f(B[i],A[i]);
  }

• Consumptions of the second loop are closer to the productions and consumptions of the first loop
• Not always so!
Loop Merge:
Satisfy dependencies
• Data dependencies from first to second
loop
can block Loop Merge
• Dependency is allowed if
 I: cons(I)  prod(I) in loop 2
• Enablers: Bump, Reverse, Skew
for (i=0; i<N; i++)
B[i] = f(A[i]);
N-1 >= i
for (i=0; i<N; i++)
C[i] = g(B[N-1]);
ASCI winterschool H.C.- P.K.
for (i=0; i<N; i++)
B[i] = f(A[i]);
i-2 < i
for (i=2; i<N; i++)
C[i] = g(B[i-2]);
94
Loop Unrolling
• Duplicate loop body and adjust loop
header.
• Increases available ILP, reduces loop
overhead, and increases possibilities for
common subexpression elimination.
• Always valid.
Loop Unrolling: Downside
• If the unroll factor is not a divisor of the trip count, then a remainder loop has to be added (see the sketch below).
• If the trip count is not known at compile time, a runtime check is needed.
• Code size increases which may result in
higher I-cache miss rate.
• Global determination of optimal unroll
factors is difficult.
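A sketch of unrolling by 4 with a remainder loop, for a hypothetical vector sum:

  /* main loop runs while at least 4 iterations remain;
     the remainder loop handles the final n % 4 iterations */
  double sum(const double *a, int n) {
      double s = 0.0;
      int i;
      for (i = 0; i + 3 < n; i += 4) {
          s += a[i];
          s += a[i + 1];
          s += a[i + 2];
          s += a[i + 3];
      }
      for (; i < n; i++)
          s += a[i];
      return s;
  }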
Loop Interchange
• Exchange two loops in a loop nest
  for(j=0; j<N; j++)            for(i=0; i<N; i++)
    for(i=0; i<N; i++)     ⇒      for(j=0; j<N; j++)
      A[i][j]                        A[i][j]
Loop Interchange
[Figure: traversal order of the (i,j) iteration space before and after interchange.]

  for(i=0; i<W; i++)            for(j=0; j<H; j++)
    for(j=0; j<H; j++)     ⇒      for(i=0; i<W; i++)
      A[i][j] = …;                   A[i][j] = …;
Loop Interchange
• Validity: dependence direction vectors.
• Mostly used to improve cache behavior.
• The innermost loop (loop index changes
fastest) should (only) index the right-most
array index expression in case of row-major
storage like in C.
• Can improve execution time by 1 or 2
orders of magnitude.
Loop Interchange (cont’d)
• Loop interchange can also expose
parallelism.
• If an inner loop does not carry a dependence (its entry in the direction vector equals ‘=’), this loop can be executed in parallel.
• Moving this loop outwards increases the
granularity of the parallel loop iterations.
Unroll-and-Jam
• Unroll the outer loop and fuse the new copies of the inner loop.
• Increases size of loop body and hence
available ILP.
• Can also improve locality.
Unroll-and-Jam Example
  for(i=0; i<N; i++)             for(i=0; i<N; i+=2)
    for(j=0; j<N; j++)       ⇒     for(j=0; j<N; j++){
      A[i][j] = B[j][i];             A[i][j] = B[j][i];
                                     A[i+1][j] = B[j][i+1];
                                   }
• More ILP exposed
• Spatial locality of B enhanced
Loop Tiling
• Improve cache reuse by dividing the
iteration space into tiles and iterating over
these tiles.
• Only useful when working set does not fit
into cache or when there exists much
interference.
• Two adjacent loops can legally be tiled if
they can legally be interchanged.
Tiling Example
  for(i=0; i<N; i++)
    for(j=0; j<N; j++)
      A[i][j] = B[j][i];

        ⇓

  for(TI=0; TI<N; TI+=16)
    for(TJ=0; TJ<N; TJ+=16)
      for(i=TI; i<min(TI+16,N); i++)
        for(j=TJ; j<min(TJ+16,N); j++)
          A[i][j] = B[j][i];
Selecting a Tile Size
• Current tile size selection algorithms use a
cache model:
– Generate collection of tile sizes;
– Estimate resulting cache miss rate;
– Select best one.
• Only take into account L1 cache.
• Mostly do not take into account n-way
associativity.
Polyhedral Model
• A polyhedron is a set {x : Ax ≤ c} for some matrix A and bounds vector c.
• Polyhedra are objects in a many-dimensional space without holes.
• Iteration spaces of loops (with unit stride) can be represented as polyhedra.
• Array accesses and loop transformations can be represented as matrices.
Iteration Space
• A loop nest is represented as BI ≤ b for iteration vector I.
• Example:

  for(i=0; i<10; i++)
    for(j=i; j<10; j++)

  ⎡ -1  0 ⎤           ⎡ 0 ⎤
  ⎢  1  0 ⎥  ⎡ i ⎤ ≤  ⎢ 9 ⎥
  ⎢  1 -1 ⎥  ⎣ j ⎦    ⎢ 0 ⎥
  ⎣  0  1 ⎦           ⎣ 9 ⎦
Array Accesses
• Any array access A[e1][e2] for linear index expressions e1 and e2 can be represented as an access matrix A and offset vector a: the element accessed in iteration I is AI + a.
• This can be considered as a mapping from the iteration space into the storage space of the array (which is a trivial polyhedron).
Unimodular Matrices
• A unimodular matrix T is a matrix with integer entries and determinant ±1.
• This means that such a matrix maps an object onto another object with exactly the same number of integer points in it.
• Its inverse T⁻¹ always exists and is unimodular as well.
Types of Unimodular
Transformations
• Loop interchange
• Loop reversal
• Loop skewing for arbitrary skew factor
• Since unimodular transformations are closed under multiplication, any combination is a unimodular transformation again.
Application
• The transformed loop nest is given by AT⁻¹I’ ≤ a.
• Any array access matrix is transformed into AT⁻¹.
• The transformed loop nest needs to be normalized by means of Fourier-Motzkin elimination to ensure that loop bounds are affine expressions in the more outer loop indices.
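A small worked case, using the triangular nest from the iteration space slide: loop interchange is the unimodular matrix

  T = ⎡ 0 1 ⎤   with det(T) = −1 and T⁻¹ = T.
      ⎣ 1 0 ⎦

Multiplying the constraint matrix by T⁻¹ swaps its columns, so the rows (−1,0), (1,0), (1,−1), (0,1) become (0,−1), (0,1), (−1,1), (1,0). After Fourier-Motzkin normalization of the bounds this is exactly the interchanged nest:

  for(j=0; j<10; j++)
    for(i=0; i<=j; i++)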
Data Transformations
• Use different data structure to improve
cache hit rate.
• Array padding changes one dimension of
an array by inserting ‘dummy elements’.
• int A[n][m]; ==> int A[n][m+pad-factor];
• This re-positions the array in cache and
may reduce interference.
• Good improvements.
Media Instructions
• Many embedded processors have special instructions for media applications: multiply-accumulate, clipped arithmetic, subword parallelism, etc.
• These instructions are not supported at the C level. Custom compilers provide intrinsic functions to drive the code generator.
• Idiom recognition is required but very difficult.
Phase Order Problem (1)
• The problem of determining in which order
transformations need to be applied.
• Well known example: scheduling and
register allocation:
– First scheduling may require too many
registers, introducing spill code.
– First register allocation may introduce spurious
dependences on registers, frustrating the
scheduling.
Phase Order Problem (2)
• In traditional compilers, the phase order is
hard-coded by the developer.
• However, transformations may destroy
opportunities for other transformations that
actually are more beneficial.
• Very little work has been done on this
difficult problem: Muchnick simply suggests
some sequence based on his experience.
Phase Order Problem (3)
• Optimal solution seems to be dependent on
application, target architecture, and input.
• Polyhedral model partially solves the
problem since many transformations can
be composed into one large transformation.
• Iterative Compilation searches the
transformation space and selects the best
one by profiling. Order may change also.
Overview
• Motivation and Goals
• Measuring and exploiting available
parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Compilation Frameworks
• It may be better to develop an interactive
compiler that offers possibilities to the
programmer.
• The compiler checks which transformations
can be legally employed in each point and
shows alternative code sequences.
• The programmer selects one alternative
based on his understanding of the code.
Retargetable Compilers
• There exist many embedded processors
and many, many more will be developed in
short time cycles.
• Developing a new compiler and, in
particular, an optimizer for each platform is
way too expensive.
• It is also too slow: we need compilers for
design space exploration.
Retargetable Compilers (2)
• New processors may differ from older ones in the width of the data path, number of registers, size of scratch pad, etc.
• They may also have a novel instruction set or new types of instructions, e.g., media instructions.
• Ideally, a compiler can be parameterized by
the architecture and the ISA.
Retargetable Compilers (3)
• For some issues, retargetability is not too
difficult:
– Register allocation algorithms take the number
of physical registers as input
– Scheduling algorithms take number of issue
slots and FUs as input
• There exist code generator generators that
take a semantic description of the ISA and
apply dynamic programming to cover IR.
Summary and Conclusions
• Compilation for ILP architectures is getting mature and is entering the commercial arena.
• However:
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to
exploit ILP
• Source-to-source transformations needed for
– enabling parallelism
– efficient use of memory hierarchy
Future
• Exploiting thread level parallelism
– At procedure level
– At loop level
• Source-to-source transformation framework
• Exploiting locality of
– communication, processing, and storage
• Language extensions
– to support ILP compilers?
– for multithreading?
• Retargetability