
Software Bubbles: Using Predication to Compensate
for Aliasing in Software Pipelines
Benjamin Goldberg, Emily Crutcher
NYU
Chad Huneycutt, Krishna Palem
Georgia Tech
PACT 2002
Introduction
• New VLIW/EPIC architectures have hardware
features for supporting a range of compiler
optimizations
• Intel IA-64 (Itanium), HP Labs’ HPL-PD
• Also several processors for embedded systems
– e.g. Sharc DSP processor
• predication is particularly interesting
• how can we use predication at run-time to enable optimizations
that the compiler would otherwise not be able to perform?
• This is part of a larger project developing run-time
tests for optimization and verification.
Predication in HPL-PD
• In HPL-PD, most operations can be predicated
• they can have an extra operand that is a one-bit
predicate register.
r2 = ADD r1,r3 if p2
• If the predicate register contains 0, the operation is not
performed
• The values of predicate registers are typically set by
“compare-to-predicate” operations
p1 = CMPP<= r4,r5
Software Pipelining
• Software Pipelining is the technique of scheduling
instructions across several iterations of a loop.
• reduces pipeline stalls on sequential pipelined machines
• exploits instruction level parallelism on superscalar and
VLIW machines
• intuitively, iterations are overlapped so that an iteration
starts before the previous iteration has completed
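As an illustrative sketch (not from the talk; the function name and the two-stage schedule are assumptions), the overlap can be written by hand in C as a prologue/kernel/epilogue structure, with the load for iteration i+1 overlapping the add/store for iteration i:

```c
/* Hand-pipelined version of: for (i = 0; i < n; i++) a[i] = b[i] + c;
 * mimicking a software-pipelined schedule with a two-stage kernel. */
void add_const(int *a, const int *b, int n, int c) {
    if (n <= 0) return;
    int t = b[0];                 /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int next = b[i + 1];      /* kernel: load for iteration i+1 ... */
        a[i] = t + c;             /* ... overlapped with add/store for i */
        t = next;
    }
    a[n - 1] = t + c;             /* epilogue: drain the pipeline */
}
```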
[Figure: a sequential loop vs. a software-pipelined loop, with iterations overlapped]
Constraints on Software Pipelining
• The instruction-level parallelism in a software pipeline
is limited by
• Resource Constraints
• VLIW instruction width, functional units, bus conflicts, etc.
• Dependence Constraints
• particularly loop carried dependences between iterations
• arise when
– the same register is used across several iterations
– the same memory location is used across several iterations (memory aliasing)
Aliasing-based Loop Dependences
• Source code:
    for (i = 3; i < n; i++)
        a[i] = a[i-3] + c;
  The dependence spans three iterations: “distance = 3”
• Assembly loop body: load, add, store, incra3, incra
• Pipeline: [figure: one iteration starts per cycle; kernel = 1 cycle; Initiation Interval (II) = 1 cycle]
Aliasing-based Loop Dependences
• Source code:
    for (i = 2; i < n; i++)
        a[i] = a[i-1] + c;
  distance = 1
• Pipeline: [figure: one iteration starts every three cycles; kernel = 3 cycles; Initiation Interval (II) = 3 cycles]
Dynamic Memory Aliasing
• What if the code were:
for (i = k; i < n; i++)
    a[i] = a[i-k] + c;
where k is unknown at compile time?
• the dependence distance is the value of k
• “dynamic” aliasing
• The possibilities are:
• k = 0 no loop carried dependence
• k > 0 loop carried true dependence with distance k
• k < 0 loop carried anti-dependence with distance | k |
• The worst case is k = 1 (as on previous slide)
• The compiler has to assume the worst, and generate
the most pessimistic pipelined schedule
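The run-time classification above can be sketched as a small helper (illustrative only; the enum and function name are assumptions, not from the talk):

```c
/* Classify the loop-carried dependence for a[i] = a[i-k] + c,
 * given the run-time value of k (the dependence distance). */
typedef enum { NO_DEP, TRUE_DEP, ANTI_DEP } dep_kind;

dep_kind classify(long k) {
    if (k == 0) return NO_DEP;    /* no loop-carried dependence    */
    if (k > 0)  return TRUE_DEP;  /* true dependence, distance k   */
    return ANTI_DEP;              /* anti-dependence, distance |k| */
}
```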
Pipelining Despite Aliasing
• This situation arises quite frequently:
void copy(char *a, char *b, int size)
{
    for (int i = 0; i < size; i++)
        a[i] = b[i];
}
• Distance = (b – a)
• What can the compiler do?
• Generate different versions of the software pipeline for
different distances
• branch to the appropriate version at run-time
• possible code explosion, cost of branch
• Another alternative: Software Bubbling
• a new technique for Software Pipelining in the presence
of dynamic aliasing
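The “different versions” alternative can be sketched as run-time loop versioning in C (a hypothetical hand-written variant, not the paper’s bubbling technique; `memcpy` stands in for the fully pipelined, no-overlap version):

```c
#include <stddef.h>
#include <string.h>

/* Run-time versioning for the copy loop: pick a version based on
 * the dependence distance d = b - a, known only at run time. */
void copy(char *a, char *b, int size) {
    ptrdiff_t d = b - a;
    if (d <= -(ptrdiff_t)size || d >= (ptrdiff_t)size) {
        memcpy(a, b, (size_t)size);      /* no overlap: freely schedulable version */
    } else {
        for (int i = 0; i < size; i++)   /* overlap: conservative version that      */
            a[i] = b[i];                 /* preserves the original loop semantics   */
    }
}
```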
Software Bubbling
• Compiler generates the most optimistic pipeline
• constrained only by resource constraints
• perhaps also by static dependences in the loop
• All operations in the pipeline kernel are predicated
• rotating predicate registers are especially useful, but not
necessary
• The predication pattern determines if the operations in
a given iteration “slot” are executed
• The predication pattern is assigned dynamically, based
on the dependence distance at run time.
• Continuing with the simple example:
    for (i = k; i < n; i++)
        a[i] = a[i-k] + c;
Software Bubbling
[Figure: three pipelines built from the same optimistic schedule of load, add, store, incr, incr. The optimistic pipeline (used for k > 2 or k < 0) starts a new iteration every cycle; in the pipelines for k = 1 and for k = 2, some iteration slots are disabled by predication to preserve the dependence.]
The Predication Pattern
• Each iteration slot is predicated upon a different
predicate register
• all operations within the slot are predicated on the same
predicate register
[Figure: four iteration slots, each with all of its operations (load, add, store, incr, incr) predicated on a single register: p[0], p[1], p[2], p[3] respectively]
kernel:  load if p[0]   add if p[1]   store if p[2]   incr if p[3]   incr if p[4]
Bubbling Predication Pattern
[Figure: the kernel’s predication pattern across successive iterations: 10110, 01101, 11011, 10110, 01101; the 0 bits disable iteration slots, creating “bubbles”]
• The predication pattern in the kernel rotates
• In this case, the initial pattern is 110110
• No operation is predicated on the leftmost bit in this case
• Rotating predicate registers are perfect for this.
Computing the predication pattern
• Suppose the compiler, based only on static constraints,
generated an initiation interval of II cycles.
• The number of cycles actually required between the source and
target iterations of a dynamic dependence is
latency(sourceOp) – offset(sourceOp, targetOp)
• The required distance in iteration slots between a source and a
target iteration is given by
L = ⌈(latency(sourceOp) – offset(sourceOp, targetOp)) / II⌉
Note that this can be computed at compile time.
• With a dependence distance of d, however, each dependence is
from an iteration i to an iteration i + d. Thus, as long as no more
than d iterations occur within L iteration slots, the dependence
is preserved.
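The compile-time computation of L can be sketched as integer ceiling division (function and parameter names are assumed):

```c
/* L = ceil((latency(sourceOp) - offset(sourceOp, targetOp)) / II):
 * the number of iteration slots a dynamic dependence must span. */
int slots_required(int latency, int offset, int II) {
    int cycles = latency - offset;   /* cycles needed between source and target */
    return (cycles + II - 1) / II;   /* ceiling division, assuming cycles >= 0  */
}
```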
Computing the predication pattern
Example: [figure: pipeline with offset(store, load) = –2 and static interval II = 1]
• Suppose d = 2 and latency(store) = 1
• L = ⌈(latency(store) – offset(store, load)) / II⌉ = ⌈(1 – (–2)) / 1⌉ = 3,
the factor by which the II would have to be increased, assuming
the dependence spanned one iteration
• The predication pattern should ensure that only d out of every L
iteration slots are enabled: in this case, 2 out of 3 slots.
Computing the Predication Pattern (cont)
• To enable d out of every L iteration slots, we simply create a
bit pattern of length L whose first d bits are 1 and the
rest are 0; read as a binary number, this pattern equals 2^d – 1.
• Before entering the loop, we initialize the aggregate
predicate register (PR) by executing
PR = shl 1, rd
PR = sub PR, 1
where rd contains the value of d (a run-time value)
• The predicate register rotation occurs automatically
using BRF and adding the instruction
p[0] = mov p[L]
within the loop, where L is a compile-time constant
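The initialization and rotation above can be simulated in C (a sketch only; the real code uses HPL-PD rotating predicate registers and BRF, and the function names are assumptions):

```c
#include <stdint.h>

/* Initialize the aggregate predicate pattern: low d bits set (PR = 2^d - 1). */
uint32_t init_pattern(unsigned d) {
    return (1u << d) - 1u;            /* e.g. d = 2 -> binary ...00011 */
}

/* One kernel iteration: rotate the L-bit pattern left by one position,
 * mimicking BRF rotation plus "p[0] = mov p[L]". */
uint32_t rotate_pattern(uint32_t pr, unsigned L) {
    uint32_t mask = (1u << L) - 1u;
    return ((pr << 1) | (pr >> (L - 1))) & mask;
}
```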
Generalized Software Bubbling
• So far, we’ve seen Simple Bubbling
• d is constant throughout the loop
• If d changes as the loop progresses, then software
bubbling can still be performed.
• The predication pattern changes as well
• This is called Generalized Bubbling
• the test occurs within the loop
• an iteration slot is enabled only if fewer than d iteration slots out of
the previous L slots have been enabled.
• Examples of code requiring generalized bubbling
appear quite often.
• Alvinn SPEC benchmark, Lawrence Livermore Loops code
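The in-loop test can be sketched as follows (a simulation in C, not the predicated HPL-PD code; the history-bitmask representation is an assumption):

```c
/* Generalized bubbling: decide whether to enable the current iteration
 * slot, given a bitmask of the last L slots (bit i set = slot enabled). */
int enable_slot(unsigned history, unsigned L, unsigned d) {
    unsigned enabled = 0;
    for (unsigned i = 0; i < L; i++)   /* count enabled slots in the window */
        enabled += (history >> i) & 1u;
    return enabled < d;                /* enable only if fewer than d */
}
```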
Bubbling vs. Hardware Disambiguation
• EPIC architectures provide hardware support for
memory disambiguation
• IA-64: Advanced Load Address Table, ld.a, ld.c
• HPL-PD: lds, ldv
• Allows loads to be moved above possibly aliased
stores.
• ldv (or ld.c) will reissue load if aliasing occurs
Before:              After:
  ...                  r1 = LDS r2
  S r3, 4              ...
  r1 = L r2            S r3, 4
  r1 = ADD r1,7        r1 = LDV r2
                       r1 = ADD r1,7
• Can’t this be used for software pipelining?
Bubbling vs. Hardware Disambiguation
• The problem with using hardware disambiguation is
that the ldv/ld.c must occur after the store.
[Figure: with a possible dependence from the store to the load, the load can be split into an lds hoisted above the store and an ldv, but the ldv must still follow the store; iterations therefore cannot overlap and the pipeline has a large II]
Experimental Results
• Experiments were performed using the Trimaran
Compiler Research Infrastructure
• www.trimaran.org
• produced by a consortium of HP Labs, UIUC, NYU, and
Georgia Tech
• Provides a highly optimizing EPIC compiler
• Configurable HPL-PD cycle-by-cycle simulator
• Visualization tools for displaying IR, performance, etc.
• Benchmarks from the literature were identified as
being amenable to software bubbling
Simple Bubbled Loops
[Chart: S152 Total Execution Time (Callahan-Dongarra-Levine S152 loop benchmark): cycles (thousands) for Static Pipeline vs. Bubbled across machine configurations 1-5]
[Chart: Matrix Add Total Execution Time: cycles (thousands) for Static Pipeline vs. Bubbled across machine configurations 1-5]
Generalized Bubbled Loops
[Chart: Alvinn Cycles per Pipelined Loop: cycles per iteration for Static Pipeline, Bubbled, and Unsafe Pipeline across machine configurations 1-6]
[Chart: Alvinn Total Execution Time: cycles (millions) for Static Pipelined vs. Bubbled across machine configurations 1-5]
Alvinn SPEC Benchmark
Generalized Bubbled Loops (cont)
[Chart: Cycles per Loop (Lawrence Livermore Loops Kernel 2 benchmark): cycles per iteration for Static Pipeline, Bubbled, and Unsafe Pipeline across machine configurations 1-6]
[Chart: Total Execution Time: total cycles (thousands) for Static Pipeline vs. Bubbled across machine configurations 1-5]
Related Work
• Nicolau [89]: memory disambiguation for Multiflow
• Bernstein, Cohen & Maydan [94]: run-time tests for
array aliasing
• Su, Habib, Zhao, Wang & Wu [94]: empirical study of
dynamic memory aliasing; run-time checks in
pipelined code
• Davidson & Jinturkar [95]: run-time disambiguation
for unrolling & pipelining
• Warter, Lavery & Hwu [93]: predication for if-conversion
in software pipelines
• Rau, Schlansker & Tirumalai [92]: predication for the
prolog and epilog of software pipelines
Conclusions
• Modern VLIW/EPIC architectures provide ample
opportunity, and need, for sophisticated optimizations
• Predication is a very powerful feature of these
machines
• Dynamic memory aliasing doesn’t have to prevent
optimizations like software pipelining
• we’ve also applied similar techniques to scalar
replacement, loop interchange, etc.