Transcript Document

ME964
High Performance Computing
for Engineering Applications
CUDA Arithmetic Support
A Software Design Exercise: Prefix Scan
Oct. 14, 2008
Before we get started…

Last Time

• Gauging the extent to which you use hardware resources in CUDA
  - The occupancy calculator and the “--ptxas-options=-v” compile option
• Control Flow in CUDA
  - The issue of divergent warps
  - Predicated instructions

Today

• CUDA arithmetic support
  - Performance implications
• A software design exercise: prefix scan
  - Preamble to the topic of software design patterns
  - Helpful with your assignment
2
Runtime Math Library

• There are two types of runtime math operations
  - __func(): direct mapping to hardware
    - Fast but low accuracy (see the programming guide for details)
    - Examples: __sinf(x), __expf(x), __powf(x,y)
  - func(): compiles to multiple instructions
    - Slower but higher accuracy (5 ulp or less)
      (ulp(x): Units in the Last Place - the gap between the two floating-point numbers closest to the value x)
    - Examples: sinf(x), expf(x), powf(x,y)
• The -use_fast_math compiler option forces every func() to compile to __func() (see the sketch below)

HK-UIUC
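A quick illustration (my sketch, not from the original slides): the kernel below evaluates the same expression with the accurate library call and with the fast hardware intrinsic; compiling with -use_fast_math would turn the first call into the second automatically. The kernel and array names are made up for the example.

    // Accurate library call vs. fast hardware intrinsic (illustrative names).
    __global__ void compareSin(const float *in, float *accurate, float *fast, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            accurate[i] = sinf(in[i]);    // compiles to multiple instructions, higher accuracy
            fast[i]     = __sinf(in[i]);  // direct hardware mapping, lower accuracy
        }
    }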
3
Arithmetic Instruction Throughput
in CUDA

• int and float add, shift, min, max and float mul, mad: 4 cycles per warp
• int multiply (*) is by default 32-bit
  - Requires multiple cycles per warp
  - Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply
• Integer divide and modulo are expensive
  - The compiler will convert literal power-of-2 divides to shifts
  - Keep this in mind; be explicit in cases where the compiler can’t tell that the divisor is a power of 2!
  - Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below)

HK-UIUC
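To make the last two tricks concrete, here is a small device-side sketch (function and parameter names are mine, not from the deck) that combines the 24-bit multiply intrinsic with the bitwise replacement for a power-of-2 modulo.

    // Illustrative device helper: 24-bit multiply plus power-of-2 wrap-around.
    __device__ int fastIndex(int row, int width, int col, int n)
    {
        // 24-bit multiply: 4 cycles per warp, valid while row*width fits in 24 bits
        int base = __mul24(row, width);

        // n is assumed to be a power of 2, so col % n == col & (n - 1)
        int wrapped = col & (n - 1);

        return base + wrapped;
    }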
4
Arithmetic Instruction Throughput

• Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
  - These are the versions prefixed with “__”
  - Examples: __rcp(), __sin(), __exp()
• Other functions are combinations of the above
  - y / x == rcp(x) * y == 20 cycles per warp
  - sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

HK-UIUC
5
Make your program float-safe!

• Existing hardware already has double precision support
  - G80 is single-precision only
  - Double precision comes at a performance cost
  - Careless use of double or undeclared types may run more slowly on existing cards with double precision support (GTX280)
• Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed
  - Add the ‘f’ specifier on float literals:
        foo = bar * 0.123;   // double assumed
        foo = bar * 0.123f;  // float explicit
  - Use the float version of standard library functions:
        foo = sin(bar);      // double assumed
        foo = sinf(bar);     // single precision explicit

HK-UIUC
6
Deviations from IEEE-754

• Addition and Multiplication are IEEE 754 compliant
  - Maximum 0.5 ulp (unit in the last place) error
• However, they are often combined into a multiply-add (FMAD)
  - The intermediate result is truncated
• Division is non-compliant (2 ulp)
• Not all rounding modes are supported
• No mechanism to detect floating-point exceptions

HK-UIUC
7
GPU Floating Point Features
Feature                            | G80                          | SSE                                           | IBM Altivec                 | Cell SPE
Precision                          | IEEE 754                     | IEEE 754                                      | IEEE 754                    | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only | Round to zero/truncate only
Denormal handling                  | Flush to zero                | Supported, 1000’s of cycles                   | Supported, 1000’s of cycles | Flush to zero
NaN support                        | Yes                          | Yes                                           | Yes                         | No
Overflow and Infinity support      | Yes, only clamps to max norm | Yes                                           | Yes                         | No, infinity
Flags                              | No                           | Yes                                           | Yes                         | Some
Square root                        | Software only                | Hardware                                      | Software only               | Software only
Division                           | Software only                | Hardware                                      | Software only               | Software only
Reciprocal estimate accuracy       | 24 bit                       | 12 bit                                        | 12 bit                      | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                       | 12 bit                                        | 12 bit                      | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                       | No                                            | 12 bit                      | No

HK-UIUC
8
End: CUDA Arithmetic Support
Begin: Software Design Exercise
~Prefix Sum~
9
Objective

• Putting your CUDA knowledge to work
  - The vehicle for the software design exercise today is the parallel implementation of a prefix sum operation
    - Recall the first assignment; it is also the topic of the current assignment
• Understand that:
  - Different algorithmic designs lead to different performance levels
  - Different constraints dominate in different applications and/or design solutions
  - Case studies help to establish intuition, idioms and ideas
• Point out parallel algorithm patterns that can result in superior performance
  - Understand that there are patterns and it’s worth being aware of them
  - If you want, these are the tricks of the trade
  - When considering patterns, you can’t lose sight of the underlying hardware
10
Software for Parallel Computers

• You rely on the compiler to figure out the parallelism in a piece of code and then map it to the underlying hardware
  - VERY hard; the holy grail in parallel computing
• You rely on parallel libraries built for a specific underlying hardware
  - Very convenient; the way to go when such libraries are available
• You rely on language extensions to facilitate the process of generating a parallel executable
  - This is where you are with CUDA
11
Parallel Prefix Sum (Scan)

• Definition:
  The all-prefix-sums operation takes a binary associative operator ⊕ with identity I, and an array of n elements
      [a0, a1, …, an-1]
  and returns the ordered set
      [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)].

• Example:
  If ⊕ is addition, then an exclusive scan on the set
      [3 1 7 0 4 1 6 3]
  returns the set
      [0 3 4 11 11 15 16 22]
  Exclusive scan: the last input element is not included in the result.
  (From Blelloch, 1990, “Prefix Sums and Their Applications”)
12
Applications of Scan
• Scan is a simple and useful parallel building block

• Convert recurrences from sequential …
      for(j=1;j<n;j++)
          out[j] = out[j-1] + f(j);
  … into parallel (a C sketch follows):
      forall(j) in parallel
          temp[j] = f(j);
      scan(out, temp);
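As a concrete illustration of this conversion (my sketch, not from the slides), the serial recurrence is replaced by an independent loop that evaluates f(j) for every j, followed by one scan. Here f() is a stand-in for any per-element function, and scan() is the CPU routine defined two slides ahead.

    #include <stdlib.h>

    /* Exclusive scan on the CPU, as defined on the "Scan on the CPU" slide. */
    void scan(float *scanned, float *input, int length);

    /* Stand-in for any per-element function f(j). */
    static float f(int j) { return (float)j; }

    void recurrence_via_scan(float *out, int n)
    {
        float *temp = (float*)malloc(n * sizeof(float));

        /* No loop-carried dependence: every j can be evaluated in parallel. */
        for (int j = 0; j < n; ++j)
            temp[j] = f(j);

        /* One scan call replaces the sequential recurrence out[j] = out[j-1] + f(j). */
        scan(out, temp, n);

        free(temp);
    }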
• Useful in the implementation of several parallel algorithms:
  - radix sort
  - quicksort
  - string comparison
  - lexical analysis
  - stream compaction
  - polynomial evaluation
  - solving recurrences
  - tree operations
  - histograms
  - etc.

HK-UIUC
13
Scan on the CPU
void scan( float* scanned, float* input, int length)
{
    scanned[0] = 0;
    for(int i = 1; i < length; ++i)
    {
        scanned[i] = scanned[i-1] + input[i-1];
    }
}

• Just add each element to the sum of the elements before it
• Trivial, but sequential
• Exactly n-1 adds: optimal in terms of work efficiency
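A quick way to sanity-check this routine (not part of the original slides) is to run it on the eight-element example from the scan definition; compiled together with the scan() definition above, the program should print the exclusive-scan result listed there.

    #include <stdio.h>

    void scan(float *scanned, float *input, int length);  /* the routine above */

    int main(void)
    {
        float input[8] = {3, 1, 7, 0, 4, 1, 6, 3};
        float scanned[8];

        scan(scanned, input, 8);

        /* Expected output: 0 3 4 11 11 15 16 22 */
        for (int i = 0; i < 8; ++i)
            printf("%g ", scanned[i]);
        printf("\n");
        return 0;
    }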
14
Parallel Scan Algorithm: Solution One
Hillis & Steele (1986)

• Note that an implementation of the algorithm shown in the picture requires two buffers of length n (shown is the case n=8=2^3)
• Assumption: the number n of elements is a power of 2: n=2^M

    d=0:  x0         x1         x2         x3         x4         x5         x6         x7
    d=1:  S(x0..x0)  S(x0..x1)  S(x1..x2)  S(x2..x3)  S(x3..x4)  S(x4..x5)  S(x5..x6)  S(x6..x7)
    d=2:  S(x0..x0)  S(x0..x1)  S(x0..x2)  S(x0..x3)  S(x1..x4)  S(x2..x5)  S(x3..x6)  S(x4..x7)
    d=3:  S(x0..x0)  S(x0..x1)  S(x0..x2)  S(x0..x3)  S(x0..x4)  S(x0..x5)  S(x0..x6)  S(x0..x7)

Picture courtesy of Mark Harris
15
The Plain English Perspective

• First iteration: I go with stride 1=2^0
  - Start at x[2^M-1] and apply this stride to all the array elements before x[2^M-1] to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 2^0 additions

• Second iteration: I go with stride 2=2^1
  - Start at x[2^M-1] and apply this stride to all the array elements before x[2^M-1] to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 2^1 additions

• Third iteration: I go with stride 4=2^2
  - Start at x[2^M-1] and apply this stride to all the array elements before x[2^M-1] to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 2^2 additions

• … (and so on)
16
The Plain English Perspective

• Consider the kth iteration (k is some arbitrary valid integer): I go with stride 2^(k-1)
  - Start at x[2^M-1] and apply this stride to all the array elements before x[2^M-1] to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 2^(k-1) additions

• …

• Mth iteration: I go with stride 2^(M-1)
  - Start at x[2^M-1] and apply this stride to all the array elements before x[2^M-1] to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 2^(M-1) additions

• NOTE: There is no (M+1)th iteration, since that would automatically put me beyond the bounds of the array (if you apply an offset of 2^M to &x[2^M-1] it places you right before the beginning of the array - not good…)
17
Hillis & Steele Parallel Scan Algorithm

• The algorithm looks like this:

    for d := 0 to M-1 do
        forall k in parallel do
            if k - 2^d >= 0 then
                x[out][k] := x[in][k] + x[in][k - 2^d]
            else
                x[out][k] := x[in][k]
        endforall
        swap(in,out)
    endfor

• Double-buffered version of the sum scan
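Before moving to the CUDA kernel, here is a CPU-side C sketch of the same double-buffered idea (names and the fixed buffer size are mine, not from the deck). It follows the pseudocode above and produces the inclusive sums; the kernel on the next slide obtains an exclusive scan by first shifting the input right by one.

    #include <string.h>

    /* CPU reference for the Hillis & Steele pass structure above.
       x holds the n = 2^M input values and is overwritten with the inclusive sums. */
    void hillis_steele_cpu(float *x, int n)
    {
        float buf[2][1024];               /* two buffers of length n (n <= 1024 here) */
        int in = 0, out = 1;

        memcpy(buf[in], x, n * sizeof(float));

        for (int stride = 1; stride < n; stride *= 2) {   /* stride = 2^d */
            for (int k = 0; k < n; ++k) {
                if (k - stride >= 0)
                    buf[out][k] = buf[in][k] + buf[in][k - stride];
                else
                    buf[out][k] = buf[in][k];
            }
            int tmp = in; in = out; out = tmp;            /* swap(in, out) */
        }
        memcpy(x, buf[in], n * sizeof(float));
    }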
18
Operation Count
Final Considerations

• The number of operations tally:
  (2^M-2^0) + (2^M-2^1) + … + (2^M-2^k) + … + (2^M-2^(M-1))
• Final operation count:
  M·2^M - (2^M - 1), i.e., roughly n·log2(n) for n = 2^M elements (derivation below)

• This is an algorithm with O(n*log(n)) work
  - This scan algorithm is not that work efficient
  - The sequential scan algorithm does n-1 adds
  - A factor of log(n) might hurt: 20x more work for 10^6 elements!
  - Homework requires a scan of about 16 million elements
• A parallel algorithm can be slow when execution resources are saturated due to low algorithm efficiency
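For completeness, the closed form quoted above follows from a short geometric-sum argument (my derivation; the slide leaves it implicit):

    \sum_{k=0}^{M-1} (2^M - 2^k) = M \cdot 2^M - \sum_{k=0}^{M-1} 2^k = M \cdot 2^M - (2^M - 1) = n \log_2 n - n + 1, \quad n = 2^M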
19
Hillis & Steele: Kernel Function

__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();

    for( int offset = 1; offset < n; offset <<= 1 )
    {
        pout = 1 - pout; // swap double buffer indices
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout*n+thid] = temp[pin*n+thid] + temp[pin*n+thid - offset];
        else
            temp[pout*n+thid] = temp[pin*n+thid];
        __syncthreads();
    }

    g_odata[thid] = temp[pout*n+thid]; // write output
}
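A possible host-side invocation (my sketch; the wrapper and pointer names are assumptions): because of the double buffer, the dynamically allocated shared memory must hold 2*n floats, and one thread is launched per element.

    // Hypothetical launch for the kernel above; d_in and d_out are device arrays of n floats.
    void launch_scan(float *d_out, float *d_in, int n)
    {
        // one block, one thread per element, 2*n floats of dynamic shared memory
        scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
        cudaThreadSynchronize();  // wait for completion (CUDA 2.x-era API)
    }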
20
Hillis & Steele: Kernel Function, Quick Remarks

• The kernel is very simple, which is good
  - Note the nice trick that was used to swap the buffers
• The kernel only works when the entire array is processed by one block
  - One block in CUDA has at most 512 threads, which means I can have up to 1024 elements (short of 16 million, which is your assignment)
  - This needs to be improved upon; we can’t limit the solution to what’s been presented so far
21
Improving Efficiency

• A common parallel algorithm pattern: Balanced Trees
  - Build a balanced binary tree on the input data and sweep it to and then from the root
  - The tree is not an actual data structure, but a concept used to determine what each thread does at each step

• For scan:
  - Traverse down from the leaves to the root building partial sums at internal nodes in the tree
    - The root holds the sum of all leaves (this is a reduction algorithm!)
  - Traverse back up the tree building the scan from the partial sums

HK-UIUC
22
Picture and Pseudocode
~ Reduction Step~
• Up-sweep (reduce) on an 8-element array; the bottom row (d=3) holds the input leaves, the top row (d=0) the final state:

    d=0:  x0  S(x0..x1)  x2  S(x0..x3)  x4  S(x4..x5)  x6  S(x0..x7)
    d=1:  x0  S(x0..x1)  x2  S(x0..x3)  x4  S(x4..x5)  x6  S(x4..x7)
    d=2:  x0  S(x0..x1)  x2  S(x2..x3)  x4  S(x4..x5)  x6  S(x6..x7)
    d=3:  x0  x1         x2  x3         x4  x5         x6  x7

• At each step the sums land at the positions x[j·2^(k+1)-1]:

    for k=0 to M-1
        offset = 2^k
        for j=1 to 2^(M-k-1) in parallel do
            x[j·2^(k+1)-1] = x[j·2^(k+1)-1] + x[j·2^(k+1)-2^k-1]
        endfor
    endfor

• NOTE: “-1” entries in the index tables produced by the MATLAB utility on the next slide indicate no-ops
23
function [offset,index1,index2] = coeffsReduce(M)
% MATLAB utility to validate my understanding of
% how the various strides and indices are to be
% constructed in the up-sweep phase of the
% Blelloch scan algorithm
offset = zeros(M,1);
offset = offset - 1;
index1 = zeros(M, 2^(M-1));
index1 = index1 - 1;
index2 = zeros(M, 2^(M-1));
index2 = index2 - 1;
for k=0:M-1
    offset(k+1) = 2^k;
    for j=1:2^(M-k-1)
        index1(k+1,j) = j*2^(k+1)-1;
        index2(k+1,j) = j*2^(k+1)-1-2^k;
    end
end

Validate the Indices/Offset

>> [offset, i1, i2] = coeffsReduce(3)
offset =
     1
     2
     4
i1 =
     1     3     5     7
     3     7    -1    -1
     7    -1    -1    -1
i2 =
     0     2     4     6
     1     5    -1    -1
     3    -1    -1    -1
24
Operation Count, Reduce Phase

    for k=0 to M-1
        offset = 2^k
        for j=1 to 2^(M-k-1) in parallel do
            x[j·2^(k+1)-1] = x[j·2^(k+1)-1] + x[j·2^(k+1)-2^k-1]
        endfor
    endfor

• By inspection: level k performs 2^(M-k-1) additions, so the whole reduce phase does n-1 additions (see below)
• Looks promising…
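Spelled out (my derivation; the slide just says “by inspection”):

    \sum_{k=0}^{M-1} 2^{M-k-1} = 2^{M-1} + 2^{M-2} + \dots + 2^0 = 2^M - 1 = n - 1 \text{ additions}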
25
The Down-Sweep Phase
• Start from the output of the reduce phase, zero the last element, then sweep back down (8-element example; d=0 is the first down-sweep step):

    after reduce:  x0  S(x0..x1)  x2  S(x0..x3)  x4  S(x4..x5)  x6  S(x0..x7)   <- zero the last element
    d=0:           x0  S(x0..x1)  x2  S(x0..x3)  x4  S(x4..x5)  x6  0
    d=1:           x0  S(x0..x1)  x2  0          x4  S(x4..x5)  x6  S(x0..x3)
    d=2:           x0  0          x2  S(x0..x1)  x4  S(x0..x3)  x6  S(x0..x5)
    d=3:           0   x0  S(x0..x1)  S(x0..x2)  S(x0..x3)  S(x0..x4)  S(x0..x5)  S(x0..x6)

• NOTE: This is just a mirror image of the reduction stage. Easy to come up with the indexing scheme…

    for k=M-1 to 0
        offset = 2^k
        for j=1 to 2^(M-k-1) in parallel do
            dummy = x[j·2^(k+1)-2^k-1]
            x[j·2^(k+1)-2^k-1] = x[j·2^(k+1)-1]
            x[j·2^(k+1)-1] = x[j·2^(k+1)-1] + dummy
        endfor
    endfor
26
Down-Sweep Phase, Remarks

• Number of operations for the down-sweep phase:
  - Additions: n-1
  - Swaps: n-1 (each swap shadows an addition)

• Total number of operations associated with this algorithm:
  - Additions: 2n-2
  - Swaps: n-1
  - Looks very comparable with the workload in the sequential solution

• The algorithm is convoluted though; it won’t be easy to implement
  - Kernel shown on the next slide
27
01| __global__ void prescan(float *g_odata, float *g_idata, int n)
02| {
03|     extern __shared__ float temp[]; // allocated on invocation
04|
05|
06|     int thid = threadIdx.x;
07|     int offset = 1;
08|
09|     temp[2*thid]   = g_idata[2*thid]; // load input into shared memory
10|     temp[2*thid+1] = g_idata[2*thid+1];
11|
12|     for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree
13|     {
14|         __syncthreads();
15|
16|         if (thid < d)
17|         {
18|             int ai = offset*(2*thid+1)-1;
19|             int bi = offset*(2*thid+2)-1;
20|
21|             temp[bi] += temp[ai];
22|         }
23|         offset *= 2;
24|     }
25|
26|     if (thid == 0) { temp[n - 1] = 0; } // clear the last element
27|
28|     for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
29|     {
30|         offset >>= 1;
31|         __syncthreads();
32|
33|         if (thid < d)
34|         {
35|             int ai = offset*(2*thid+1)-1;
36|             int bi = offset*(2*thid+2)-1;
37|
38|             float t  = temp[ai];
39|             temp[ai] = temp[bi];
40|             temp[bi] += t;
41|         }
42|     }
43|
44|     __syncthreads();
45|
46|     g_odata[2*thid]   = temp[2*thid]; // write results to device memory
47|     g_odata[2*thid+1] = temp[2*thid+1];
48| }
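A possible way to launch this kernel from the host (my sketch; the wrapper and pointer names are assumptions): each thread loads and writes two elements, so a scan of n elements uses n/2 threads and n floats of dynamic shared memory.

    // Hypothetical host-side launch for prescan(); d_in and d_out are device arrays of n floats,
    // with n a power of 2 small enough to be handled by a single block.
    void launch_prescan(float *d_out, float *d_in, int n)
    {
        prescan<<<1, n / 2, n * sizeof(float)>>>(d_out, d_in, n);
        cudaThreadSynchronize();  // wait for completion (CUDA 2.x-era API)
    }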
28
Bank Conflicts

• The current implementation has many ShMem bank conflicts
  - Can significantly hurt performance on current GPU hardware
  - The source of the conflicts: linear indexing with a stride that is a power-of-2 multiple of the thread id (see below): “j·2^(k+1)-1”

    for k=0 to M-1
        offset = 2^k
        for j=1 to 2^(M-k-1) in parallel do
            x[j·2^(k+1)-1] = x[j·2^(k+1)-1] + x[j·2^(k+1)-2^k-1]
        endfor
    endfor

• Simple modifications to the current memory addressing scheme can save a lot of cycles
29
Bank Conflicts

• Occur when multiple threads access the same shared memory bank with different addresses
  - In our case, we have something like 2^(k+1)·j-1
    - k=0: two-way bank conflict
    - k=1: four-way bank conflict
    - …

• No penalty if all threads access different banks
  - Or if all threads access the exact same address

• Recall that shared memory accesses with conflicts are serialized
  - N-bank memory conflicts lead to a set of N successive shared memory transactions
30
Initial Bank Conflicts on Load

• Each thread loads two shared mem data elements
• Tempting to interleave the loads (see lines 9 & 10, and 46 & 47, of the kernel):

    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

• Thread 0 accesses banks 0 and 1
• Thread 1 accesses banks 2 and 3
• …
• Thread 8 accesses banks 16 and 17. Oops, that’s 0 and 1 again…
  - Two-way bank conflict, can’t be easily eliminated
• Better to load one element from each half of the array:

    temp[thid]         = g_idata[thid];
    temp[thid + (n/2)] = g_idata[thid + (n/2)];

HK-UIUC
31
Bank Conflicts in the tree algorithm

• When we build the sums, during the first iteration of the algorithm each thread in a half-warp reads two shared memory locations and writes one:
  - Th(0,8) access bank 0
• First iteration: 2 threads access each of 8 banks (two-way conflicts).

(Figure omitted: the input [3 1 7 0 4 1 6 3 5 8 2 0 3 3 1 9 4 5 7 …] laid out over banks 0-15; each arrow corresponds to a single thread, and like-colored arrows represent simultaneous memory accesses.)

HK-UIUC
32
Bank Conflicts in the tree algorithm

• When we build the sums, each thread reads two shared memory locations and writes one:
  - Th(1,9) access bank 2, etc.
• First iteration: 2 threads access each of 8 banks.

(Figure omitted: same 16-bank layout as on the previous slide; each arrow corresponds to a single thread, and like-colored arrows represent simultaneous memory accesses.)

HK-UIUC
33
Bank Conflicts in the tree algorithm

• 2nd iteration: even worse!
  - 4-way bank conflicts; for example:
    Th(0,4,8,12) access bank 1, Th(1,5,9,13) access bank 5, etc.
• 2nd iteration: 4 threads access each of 4 banks.

(Figure omitted: the shared-memory layout after the first iteration; each arrow corresponds to a single thread, and like-colored arrows represent simultaneous memory accesses.)

HK-UIUC
34
Managing Bank Conflicts
in the Tree Algorithm

• Use padding to prevent bank conflicts
  - Add a word of padding every 16 words
    - Now you work with a virtual 17-bank shared memory layout
    - Within a 16-thread half-warp, all threads access different banks
      - They are aligned to a 17-word memory layout
  - It comes at a price: you have memory words that are wasted
  - Keep in mind: you should also load data from global into shared memory using the virtual memory layout of 17 banks
35
Use Padding to Reduce Conflicts

• After you compute a ShMem address like this:

    address = 2 * stride * thid;

• Add padding like this:

    address += (address >> 4); // divide by NUM_BANKS

• This removes most bank conflicts
  - Not all, in the case of deep trees
  - Material posted online will contain a discussion of this “deep tree” situation along with a proposed solution

HK-UIUC
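To show where such an adjustment could sit in the up-sweep loop of the kernel on slide 28, here is a hedged sketch (the PAD macro and its name are mine, not from the deck); both ai and bi get the same treatment, and the shared array must be allocated with room for the extra pad words.

    #define LOG_NUM_BANKS 4                                    // 16 banks on G80
    #define PAD(index) ((index) + ((index) >> LOG_NUM_BANKS))  // one pad word every 16 words

    // Inside the up-sweep loop of prescan(), index temp[] through PAD():
    if (thid < d)
    {
        int ai = offset*(2*thid+1)-1;
        int bi = offset*(2*thid+2)-1;
        temp[PAD(bi)] += temp[PAD(ai)];  // padded addresses fall in different banks
    }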
36
Managing Bank Conflicts
in the Tree Algorithm

• Original scenario: the data sit directly in the 16 physical banks, so two threads of a half-warp hit the same bank (the two-way conflicts shown earlier).

(Figure omitted: the array [3 1 7 0 4 1 6 3 5 8 2 0 3 3 1 9 4 5 7 …] laid out over the 16 physical banks.)

• Modified scenario: virtual 17-bank memory layout; a pad word P is inserted after every 16 data words, so the threads of a half-warp now fall in different banks.

(Figure omitted: the same array laid out over virtual banks 0-16 with the pad word P inserted; only arrows with the same color happen simultaneously.)

HK-UIUC
37
Concluding Remarks, Parallel Scan

• Intuitively, the scan operation is not the type of procedure ideally suited for parallel computing
• Even if it doesn’t fit like a glove, it leads to a nice speedup:

    # elements    CPU Scan (ms)    GPU Scan (ms)    Speedup
    1024              0.002231         0.079492       0.03
    32768             0.072663         0.106159       0.68
    65536             0.146326         0.137006       1.07
    131072            0.726429         0.200257       3.63
    262144            1.454742         0.326900       4.45
    524288            2.911067         0.624104       4.66
    1048576           5.900097         1.118091       5.28
    2097152          11.848376         2.099666       5.64
    4194304          23.835931         4.062923       5.87
    8388608          47.390906         7.987311       5.93
    16777216         94.794598        15.854781       5.98

Source: 2007 paper of Harris, Sengupta, Owens
38
Concluding Remarks, Parallel Scan

• The Hillis-Steele (HS) implementation is simple, but suboptimal
• The Harris-Sengupta-Owens (HSO) solution is convoluted, but scales like O(n)
  - The complexity of the algorithm is due to an acute bank-conflict situation
  - Michael Garland argues that there is a lot of serendipity in the HSO algorithm, and it’s not clear if it’s worth pursuing
• Finally, we have not solved the problem yet: we only looked at the case when our array has up to 1024 elements
  - You will have to think about how to handle the 16,777,216 = 2^24 element case
  - Likewise, it would be fantastic if you also implemented the case when the number of elements is not a power of 2
39
Taking it one step further: completely eliminating the bank conflicts
(individual reading)
40
Scan Bank Conflicts (1)

• A full binary tree with 64 leaf nodes (bank = address mod 16):

    Scale (s)   Conflicts   Thread addresses
    1           2-way       0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
    2           4-way       0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
    4           4-way       0 8 16 24 32 40 48 56
    8           4-way       0 16 32 48
    16          2-way       0 32
    32          None        0

• Multiple 2- and 4-way bank conflicts
• Shared memory cost for the whole tree:
  - 1 32-thread warp = 6 cycles per thread w/o conflicts
    - Counting 2 shared mem reads and one write (s[a] += s[b])
  - 6 * (2+4+4+4+2+1) = 102 cycles
  - 36 cycles if there were no bank conflicts (6 * 6)
41
Scan Bank Conflicts (2)

• It’s much worse with bigger trees!
• A full binary tree with 128 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)

    Scale (s)   Conflicts   Thread addresses
    2           4-way       0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124
    4           8-way       0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120
    8           8-way       0 16 32 48 64 80 96 112
    16          4-way       0 32 64 96
    32          2-way       0 64
    64          None        0

• Cost for the whole tree:
  - 12*2 + 6*(4+8+8+4+2+1) = 186 cycles
  - 48 cycles if there were no bank conflicts: 12*1 + (6*6)
42
Scan Bank Conflicts (3)

• A full binary tree with 512 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)

    Scale (s)   Conflicts   Thread addresses
    8           16-way      0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272 288 304 320 336 352 368 384 400 416 432 448 464 480 496
    16          16-way      0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
    32          8-way       0 64 128 192 256 320 384 448
    64          4-way       0 128 256 384
    128         2-way       0 256
    256         None        0

• Cost for the whole tree:
  - 48*2 + 24*4 + 12*8 + 6*(16+16+8+4+2+1) = 570 cycles
  - 120 cycles if there were no bank conflicts!
43
Fixing Scan Bank Conflicts

• Insert padding every NUM_BANKS elements

    const int LOG_NUM_BANKS = 4; // 16 banks on G80
    int thid = threadIdx.x;
    int s = 1;

    // Traversal from leaves up to root
    for (int d = n>>1; d > 0; d >>= 1)
    {
        if (thid < d)
        {
            int a = s*(2*thid);
            int b = s*(2*thid+1);
            a += (a >> LOG_NUM_BANKS); // insert pad word
            b += (b >> LOG_NUM_BANKS); // insert pad word
            shared[a] += shared[b];
        }
        s *= 2;          // double the stride at the next level
        __syncthreads(); // wait for all partial sums at this level
    }
44
Fixing Scan Bank Conflicts

• A full binary tree with 64 leaf nodes, with a pad word inserted every 16 words (padded address = address + address/16):

    Scale (s)   Conflicts   Thread addresses (padded)
    1           None        0 2 4 6 8 10 12 14 17 19 21 23 25 27 29 31 34 36 38 40 42 44 46 48 51 53 55 57 59 61 63
    2           None        0 4 8 12 17 21 25 29 34 38 42 46 51 55 59 63
    4           None        0 8 17 25 34 42 51 59
    8           None        0 17 34 51
    16          None        0 34
    32          None        0

• No more bank conflicts!
  - However, there are ~8 cycles of overhead for addressing
    - For each s[a] += s[b] (8 cycles/iter. * 6 iter. = 48 extra cycles)
  - So just barely worth the overhead on a small tree
    - 84 cycles vs. 102 with conflicts vs. 36 optimal
45
Fixing Scan Bank Conflicts

• A full binary tree with 128 leaf nodes, padding inserted every 16 words
  - Only the last 6 iterations shown (root and 5 levels below)

    Scale (s)   Conflicts   Thread addresses (padded)
    2           None        0 4 8 12 17 21 25 29 34 38 42 46 51 55 59 63 68 72 76 80 85 89 93 97 102 106 110 114 119 123 127 131
    4           None        0 8 17 25 34 42 51 59 68 76 85 93 102 110 119 127
    8           None        0 17 34 51 68 85 102 119
    16          None        0 34 68 102
    32          None        0 68
    64          None        0

• No more bank conflicts!
  - Significant performance win:
    - 106 cycles vs. 186 with bank conflicts vs. 48 optimal
46
Fixing Scan Bank Conflicts

• A full binary tree with 512 leaf nodes, padding inserted every 16 words
  - Only the last 6 iterations shown (root and 5 levels below)

    Scale (s)   Conflicts   Thread addresses (padded)
    8           None        0 17 34 51 68 85 102 119 136 153 170 187 204 221 238 255 272 289 306 323 340 357 374 391 408 425 442 459 476 493 510 527
    16          2-way       0 34 68 102 136 170 204 238 272 306 340 374 408 442 476 510
    32          2-way       0 68 136 204 272 340 408 476
    64          2-way       0 136 272 408
    128         2-way       0 272
    256         None        0

• Wait, we still have bank conflicts
  - The method is not foolproof, but still much improved
  - 304 cycles vs. 570 with bank conflicts vs. 120 optimal
47
Fixing Scan Bank Conflicts

• It’s possible to remove all bank conflicts
  - Just do multi-level padding
  - Example: two-level padding:

    const int LOG_NUM_BANKS = 4; // 16 banks on G80
    int thid = threadIdx.x;
    int s = 1;

    // Traversal from leaves up to root
    for (int d = n>>1; d > 0; d >>= 1)
    {
        if (thid < d)
        {
            int a = s*(2*thid);
            int b = s*(2*thid+1);
            int offset = (a >> LOG_NUM_BANKS);        // first level
            a += offset + (offset >> LOG_NUM_BANKS);  // second level
            offset = (b >> LOG_NUM_BANKS);            // first level
            b += offset + (offset >> LOG_NUM_BANKS);  // second level
            temp[a] += temp[b];
        }
        s *= 2;          // double the stride at the next level
        __syncthreads(); // wait for all partial sums at this level
    }
48
Fixing Scan Bank Conflicts

• A full binary tree with 512 leaf nodes, two-level padding inserted
  - Only the last 6 iterations shown (root and 5 levels below)

    Scale (s)   Conflicts   Thread addresses (padded)
    8           None        0 17 34 51 68 85 102 119 136 153 170 187 204 221 238 255 273 290 307 324 341 358 375 392 409 426 443 460 477 494 511 528
    16          None        0 34 68 102 136 170 204 238 273 307 341 375 409 443 477 511
    32          None        0 68 136 204 273 341 409 477
    64          None        0 136 273 409
    128         None        0 273
    256         None        0

• No bank conflicts
  - But an extra cycle of overhead per address calculation
  - Not worth it: 440 cycles vs. 304 with 1-level padding
• With 1-level padding, bank conflicts only occur in warp 0
  - Very small remaining cost due to bank conflicts
  - Removing them hurts all other warps
49