Auto-Vectorization of Interleaved Data for SIMD
Dorit Nuzman, Ira Rosen, Ayal Zaks
IBM Haifa Research Lab – HiPEAC member, Israel
{dorit, ira, zaks}@il.ibm.com
PLDI 2006
IBM Labs in Haifa
Main Message
1. Most SIMD targets support access to packed data in memory (“SIMpD”),
but there are important applications which access non-consecutive data
2. We show how a classic compiler loop-based auto-SIMDizing optimization
was augmented to support accesses to strided, interleaved data
3. This can serve as a first step to combine traditional loop-based
vectorization with (if-converted) basic-block vectorization (“SLP”)
2
PLDI 2006
IBM Labs in Haifa
SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple Packed Data
[Figure: data in memory (a b c d e f g h i j k l m n o p); the four scalar operations OP(a), OP(b), OP(c), OP(d) are replaced by one vector operation VOP on the packed vector register VR1 = (a,b,c,d).]
3
PLDI 2006
IBM Labs in Haifa
Vectorizing for a SIMpD Architecture
[Figure: to compute OP(a), OP(f), OP(k), OP(p) on stride-4 data, the four packed vectors VR1 = (a,b,c,d), VR2 = (e,f,g,h), VR3 = (i,j,k,l), VR4 = (m,n,o,p) are loaded from memory (a b c d e f g h i j k l m n o p) and the desired elements are gathered into VR5 = (a,f,k,p) before the vector operation VOP(VR5).]
5
PLDI 2006
IBM Labs in Haifa
SIMpD: Single Instruction Multiple Packed Data
[Figure: data in memory (a b c d e f g h i j k l m n o p) is loaded into four vector registers VR1 = (a,b,c,d), VR2 = (e,f,g,h), VR3 = (i,j,k,l), VR4 = (m,n,o,p); a pack operation driven by a mask (a reorder buffer between memory and the vector registers) gathers the strided elements into VR5 = (a,f,k,p), and the single vector operation VOP(VR5) computes OP(a), OP(f), OP(k), OP(p).]

loop:
  (VR1,…,VR4) ← vload (mem)
  VR5 ← pack (VR1,…,VR4), mask
  VOP(VR5)

6
PLDI 2006
IBM Labs in Haifa
Application accessing non-consecutive data – Viterbi decoder (before)
[Figure: data-flow graph of the Viterbi decoder kernel before vectorization; its memory accesses have strides 1, 2 and 4, feeding shift (<< 1, << 1|1), add, subtract, max and select operations.]
7
PLDI 2006
IBM Labs in Haifa
Application accessing non-consecutive data – Viterbi decoder (after)
[Figure: the corresponding data-flow graph after the transformation, with the same stride-1, stride-2 and stride-4 accesses and shift (<< 1, << 1|1), add, subtract, max and select operations.]
8
PLDI 2006
IBM Labs in Haifa
Application accessing non-consecutive data – Audio downmix (before)
[Figure: data-flow graph of the audio downmix kernel before vectorization; stride-4 loads feed shift-right (>> 1) and add operations that produce stride-2 results.]
9
PLDI 2006
IBM Labs in Haifa
Application accessing non-consecutive data – Audio downmix (after)
[Figure: the corresponding data-flow graph after the transformation, with the same stride-4 loads, shift-right (>> 1) and add operations and stride-2 results.]
10
PLDI 2006
IBM Labs in Haifa
Basic unpacking and packing operations for strided access
 Use two pairs of inverse operations widely supported on SIMD platforms:
 extract_even, extract_odd: de-interleave two vectors into their even- and odd-indexed elements
 interleave_high, interleave_low: the inverse pair, merging two vectors back into interleaved order
 Use them recursively to support strided accesses with power-of-2 strides (see the sketch below)
 Support several data types
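A minimal plain-C sketch of the de-interleaving pair (illustration only, not real SIMD intrinsics; the vec type, the VF value and the function names are invented stand-ins for the target's native permute instructions):

/* Plain-C model of the de-interleaving pair, for vectors of four shorts (VF = 4). */
#include <stdio.h>

#define VF 4
typedef struct { short e[VF]; } vec;

/* extract_even(a,b): the even-indexed elements of the concatenation a||b */
static vec extract_even(vec a, vec b) {
    vec r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[i]          = a.e[2 * i];
        r.e[VF / 2 + i] = b.e[2 * i];
    }
    return r;
}

/* extract_odd(a,b): the odd-indexed elements of the concatenation a||b */
static vec extract_odd(vec a, vec b) {
    vec r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[i]          = a.e[2 * i + 1];
        r.e[VF / 2 + i] = b.e[2 * i + 1];
    }
    return r;
}

int main(void) {
    /* memory holds interleaved pairs (stride 2): x0 y0 x1 y1 x2 y2 x3 y3 */
    vec v1 = {{1, 10, 2, 20}}, v2 = {{3, 30, 4, 40}};
    vec xs = extract_even(v1, v2);   /* 1 2 3 4     : the x stream */
    vec ys = extract_odd(v1, v2);    /* 10 20 30 40 : the y stream */
    /* for stride 4, apply the same pair again to the outputs of this level */
    for (int i = 0; i < VF; i++)
        printf("%d %d\n", xs.e[i], ys.e[i]);
    return 0;
}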
11
PLDI 2006
IBM Labs in Haifa
Classic loop-based auto-vectorization
vect_analyze_loop (loop) {
  if (!1_analyze_counted_single_bb_loop (loop))    FAIL
  if (!2_determine_VF (loop))                      FAIL
  if (!3_analyze_memory_access_patterns (loop))    FAIL
  if (!4_analyze_scalar_dependence_cycles (loop))  FAIL
  if (!5_analyze_data_dependence_distances (loop)) FAIL
  if (!6_analyze_consecutive_data_accesses (loop)) FAIL
  if (!7_analyze_data_alignment (loop))            FAIL
  if (!8_analyze_vops_exist_forall_ops (loop))     FAIL
  SUCCEED
}

vect_transform_loop (loop) {
  FOR_ALL_STMTS_IN_LOOP (loop, stmt)
    replace_OP_by_VOP (stmt);
  decrease_loop_bound_by_factor_VF (loop);
}
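As an illustration of what vect_transform_loop does in the simplest (unit-stride) case, here is a sketch using GCC's generic vector extensions; it is not the vectorizer's actual output, and for brevity it assumes n is a multiple of VF and the arrays are 16-byte aligned (the real vectorizer also handles leftover iterations and misalignment):

/* Illustration only: effect of vect_transform_loop on a unit-stride loop, VF = 4 ints. */
typedef int v4si __attribute__((vector_size(16)));

void scalar_loop(int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)                     /* one OP per iteration       */
        a[i] = b[i] + 1;
}

void vector_loop(int *a, const int *b, int n) {
    for (int i = 0; i < n; i += 4) {                /* loop bound decreased by VF */
        v4si vb = *(const v4si *)&b[i];             /* vector load                */
        *(v4si *)&a[i] = vb + (v4si){1, 1, 1, 1};   /* OP replaced by VOP         */
    }
}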
12
PLDI 2006
IBM Labs in Haifa
Vectorizing non-unit-stride access
 One VOP accessing data with stride d requires loading d·VF elements
 Several, otherwise unrelated VOPs can share these loaded elements (example below)
 If they all share the same stride d
 If they all start close to each other
 Up to d VOPs; if fewer, there are ‘gaps’
 Recognize this spatial reuse potential to eliminate redundant load and
extract operations
 Better to make the decision early rather than late – without such elimination
 vectorizing the loop may not be beneficial (for loads)
 vectorizing the loop may be prohibited (for stores)
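A hypothetical C kernel in the spirit of the audio-downmix example above: all four loads share stride 4 and start adjacent, so one spatial group of d = 4 vector loads (d·VF elements) can feed them all after de-interleaving, and the two stores form a stride-2 store group.

/* Hypothetical stride-4 kernel; the four loads a[4*i]..a[4*i+3] form one spatial
 * load group and the two stores out[2*i], out[2*i+1] form one store group. */
void downmix2(short *out, const short *a, int n) {
    for (int i = 0; i < n; i++) {
        out[2 * i]     = (short)((a[4 * i]     >> 1) + (a[4 * i + 1] >> 1));
        out[2 * i + 1] = (short)((a[4 * i + 2] >> 1) + (a[4 * i + 3] >> 1));
    }
}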
13
PLDI 2006
IBM Labs in Haifa
Augmenting the vectorizer: step 1/3 – build spatial groups
 5_analyze_data_dependence_distances
already traversed all pairs of load/stores to analyze their dependence distance:
if (cross_iteration_dependence_distance <= (VF-1)*stride)
  if (read,write) or (write,read) or (write,write)
    ok = dep_resolve();
  endif
endif
 Augment this traversal to look for spatial reuse between pairs of independent
loads and stores, building spatial groups:
if ok and (intra_iteration_address_distance < stride*u)
  if (read,read) or (write,write)
    ok = analyze_and_build_spatial_groups();
  endif
endif
14
PLDI 2006
IBM Labs in Haifa
Augmenting the vectorizer: step 2/3 – check spatial groups
 6_analyze_consecutive_data_accesses
already traversed each individual load/store to analyze its access pattern
 Augment this traversal by
 Allowing non-consecutive accesses
 Building singleton groups for strided ungrouped load/stores
 Checking for gaps and profitability of spatial groups (example below)
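A hypothetical kernel with gaps: only two of the d = 4 interleaved streams are read, so the spatial group built for it has gaps at positions 1 and 3 and the profitability check must account for the loaded elements that serve no use.

/* Hypothetical stride-4 kernel with gaps at positions 1 and 3. */
void diff_even(short *out, const short *in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (short)(in[4 * i] - in[4 * i + 2]);
}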
15
PLDI 2006
IBM Labs in Haifa
Augmenting the vectorizer: step 3/3 – transformation
 vect_transform_stmt
generates vector code per scalar OP
 Augment this by considering:
 If OP is a load/store in the first position of a spatial group
 generate d loads/stores
 handle their alignment according to the starting address
 generate d·log d extract/interleaves (sketched below)
 If OP belongs to a spatial group, connect it to the appropriate
extract/interleave according to its position
 Unused extract/interleaves are discarded by subsequent DCE
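A sketch of the code shape generated for a stride-2 (d = 2) load group, written with GCC's generic vector extensions; __builtin_shuffle stands in for the target's extract_even/extract_odd instructions, and the code is an illustration only, not the vectorizer's actual output.

/* Computes out[j] = in[2*j] + in[2*j+1]; VF = 8 shorts, n assumed a multiple of 8. */
#include <string.h>

typedef short v8hi __attribute__((vector_size(16)));

void sum_pairs(short *restrict out, const short *restrict in, int n) {
    for (int i = 0; i < n; i += 8) {
        v8hi v1, v2;
        memcpy(&v1, &in[2 * i],     sizeof v1);              /* d = 2 vector loads     */
        memcpy(&v2, &in[2 * i + 8], sizeof v2);
        v8hi even = __builtin_shuffle(v1, v2,                /* d*log2(d) = 2 extracts */
                        (v8hi){0, 2, 4, 6, 8, 10, 12, 14});
        v8hi odd  = __builtin_shuffle(v1, v2,
                        (v8hi){1, 3, 5, 7, 9, 11, 13, 15});
        v8hi r = even + odd;                                 /* the single VOP         */
        memcpy(&out[i], &r, sizeof r);                       /* unit-stride store      */
    }
}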
16
PLDI 2006
IBM Labs in Haifa
Performance – qualitative: VF/(1 + log d)
 Scalar code performs d·VF loads/stores for a full group of d stride-d accesses over VF iterations
 Vectorized code performs only d vector loads/stores plus d·log d extract/interleaves (logs are base 2)
 The improvement factor in the number of load/store/extract/interleave operations is therefore d·VF / (d·(1 + log d)) = VF/(1 + log d):

  d     VF=4    VF=8    VF=16
  1     4       8       16
  2     2       4       8
  4     1.3     2.6     5.3
  8     1       2       4
  16    0.8     1.6     3.2
  32    0.6     1.2     2.4

17
PLDI 2006
IBM Labs in Haifa
Performance – empirically (on PowerPC 970 with AltiVec)
[Figure: two charts of speedup factors versus interleaving factor d = 2, 4, 8, 16, 32 for VF = 4, 8, 16, with aligned and unaligned variants and the expected improvement factor on loads/stores shown for reference: (a) speedups for tests with memory operations only; (b) speedups for tests with arithmetic operations.]
Stride of 2 always provides speedups
Strides of 8, 16 suffer from increased code size – it turns off loop unrolling
Stride of 32 suffers from high register pressure (d+1)
If non-permute operations exist – speedups for all strides when VF ≥ 8
18
PLDI 2006
IBM Labs in Haifa
Performance – stride of 8 with gaps
[Figure: speedups on interleaving with gaps (VF=16, δ=8) for 1, 2, 4 and 7 memory accesses per iteration, aligned and unaligned, comparing accesses with least, medium and most reuse against the minimum expected improvement factor in loads/stores.]
 Position of gaps affects the number of extracts (interleaves) needed
 Improvement is observed even for a single strided access (VF=16 with arithmetic operations)
19
PLDI 2006
IBM Labs in Haifa
Performance – kernels
[Figure: bar chart of vectorization speedup factors on interleaved-data benchmarks (speedup factor 0 to 10, aligned and unaligned bars per kernel); each kernel name is prefixed by its strides and suffixed by its data type (fp, u8, u32, s16), covering kernels such as cxdot, cxfir, bitPack, cvt_codec, cxMultScale, interpolator, mixStream, alphaMix, viterbi, downMix, alphaBlend, dissolve and compress variants.]
 4 groups: VF=4, 8, 16, 16-with-gaps
 Strides prefix each kernel
 Slowdown when doing only memory operations at VF=4, d=8
20
PLDI 2006
IBM Labs in Haifa
Future direction – towards loop-aware SLP
 When building spatial groups, we consider distinct operations accessing
adjacent/close addresses; this is the first step of building SLP chains
 SLP looks for VF fully interleaved accesses, without gaps; may require
earlier loop unrolling
 Next step is to consider the operations that use a spatial group of loads –
if they are isomorphic, try to postpone the extracts (example below)
 Analogous to handling alignment using zero-shift, lazy-shift, eager-shift
policies
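A hypothetical kernel whose two uses of a stride-2 group are isomorphic: both the even and the odd stream are shifted right by one, so the vector shift could be applied directly to the interleaved vectors and the extract/interleave operations postponed or dropped, SLP-style.

/* Hypothetical kernel with isomorphic operations on interleaved data: no
 * de-interleave is actually needed before the vector shift. */
void halve_interleaved(short *out, const short *in, int n) {
    for (int i = 0; i < n; i++) {
        out[2 * i]     = (short)(in[2 * i]     >> 1);
        out[2 * i + 1] = (short)(in[2 * i + 1] >> 1);
    }
}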
21
PLDI 2006
IBM Labs in Haifa
Conclusions
1. Existing SIMD targets supporting SIMpD can provide improved
performance for important power-of-2 strided applications
– don’t be afraid of d > 2
2. Existing compiler loop-based auto-vectorization can be augmented
efficiently to handle such strided accesses
3. This can serve as a first step combining traditional loop-based
vectorization with (if-converted) basic-block vectorization (“SLP”)
4. This area of work is fertile;
consider details (d, gaps, positions, VF, non-mem ops) for it not to be
futile!
22
PLDI 2006
IBM Labs in Haifa
Questions
23
PLDI 2006