Parallel Prefix Sum (Scan)
GPU Graphics
Gary J. Katz
University of Pennsylvania CIS 665
Adapted from articles taken from GPU Gems III
Scan

Definition:

The all-prefix-sums operation takes a binary associative operator ⊕ with identity I, and an array of n elements
[a_0, a_1, …, a_(n-1)]
and returns the array
[I, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_(n-2))]
Example (⊕ is addition, I = 0):
[ 1 13 35 2 6 8 10 23 52 11 26 19 ]
[ 0 1 14 49 51 57 65 75 98 150 161 187 ]
Sequential Scan
out[0] = 0;
for (k = 1; k < n; k++)
    out[k] = in[k-1] + out[k-1];


Performs n - 1 adds for an array of length n
Work complexity is O(n)
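
The sequential loop above can be wrapped into a small C function. This is a minimal sketch: the function name `exclusive_scan` and the use of `int` elements are assumptions made for illustration.

```c
#include <stddef.h>

/* Exclusive prefix sum: out[k] holds in[0] + ... + in[k-1], and out[0]
 * is the identity 0. in and out must not overlap. */
void exclusive_scan(const int *in, int *out, size_t n)
{
    if (n == 0)
        return;
    out[0] = 0;
    for (size_t k = 1; k < n; k++)
        out[k] = in[k - 1] + out[k - 1];   /* one add per output element */
}
```

Called on the 12-element example from the definition slide, it fills `out` with 0, 1, 14, 49, and so on up to 187.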
Parallel Scan
for (d = 1; d <= log2(n); d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[k] = x[k - 2^(d-1)] + x[k]


Performs O(n log2 n) addition operations
Assumes there are as many processors as data elements
Parallel Scan
for (d = 1; d <= log2(n); d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[k] = x[k - 2^(d-1)] + x[k]

Input  x0         x1         x2         x3         x4         x5         x6         x7
d=1    ∑(x0..x0)  ∑(x0..x1)  ∑(x1..x2)  ∑(x2..x3)  ∑(x3..x4)  ∑(x4..x5)  ∑(x5..x6)  ∑(x6..x7)
d=2    ∑(x0..x0)  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x1..x4)  ∑(x2..x5)  ∑(x3..x6)  ∑(x4..x7)
d=3    ∑(x0..x0)  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)  ∑(x0..x7)
Parallel Scan
for (d = 1; d <= log2(n); d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[k] = x[k - 2^(d-1)] + x[k]

What's the problem with this algorithm for the GPU?
Parallel Scan
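for (d = 1; d <= log2(n); d++)
    swap(in, out)
    for all k in parallel
        if (k >= 2^(d-1))
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]

The GPU needs to double buffer the array: updating x in place would let one thread overwrite a value that another thread still has to read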
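
To make the double-buffered scheme concrete, here is a hedged CPU-side simulation in C. The function name `naive_scan`, the `add_count` parameter, and the sequential inner loop standing in for the parallel threads are all assumptions for illustration. Note that this variant produces an inclusive scan.

```c
#include <stddef.h>

/* Naive O(n log n) inclusive scan with two buffers, simulated sequentially.
 * buf_a holds the input, buf_b is scratch space; both have length n.
 * Returns a pointer to whichever buffer holds the final result.
 * add_count (may be NULL) receives the number of additions performed. */
int *naive_scan(int *buf_a, int *buf_b, size_t n, size_t *add_count)
{
    int *in = buf_a, *out = buf_b;
    size_t adds = 0;

    for (size_t offset = 1; offset < n; offset *= 2) {  /* offset = 2^(d-1) */
        /* On a GPU, each k would be handled by its own thread. */
        for (size_t k = 0; k < n; k++) {
            if (k >= offset) {
                out[k] = in[k - offset] + in[k];
                adds++;
            } else {
                out[k] = in[k];
            }
        }
        int *tmp = in; in = out; out = tmp;  /* swap buffers for the next pass */
    }

    if (add_count)
        *add_count = adds;
    return in;
}
```

For n = 8 the three passes perform 7 + 6 + 4 = 17 additions, compared with 7 for the sequential loop, which illustrates the extra work of this approach.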
Issues with Current Implementation?


Only works for 512 elements (one thread block)
The GPU version has a work complexity of O(n log2 n) (the CPU version is O(n))
A work efficient parallel scan


Goal is a parallel scan that performs O(n) work instead of O(n log2 n)
Solution:
Balanced Trees: Build a binary tree on the input data and sweep it to and from the root.
A binary tree with n leaves has d = log2 n levels, and each level d has 2^d nodes.
One add is performed per node, therefore O(n) adds on a single traversal of the tree.
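
As a quick check of the O(n) claim, the nodes can be summed level by level. This assumes the levels are numbered d = 0 (the root) through log2 n - 1 and that n is a power of two:

```latex
\sum_{d=0}^{\log_2 n - 1} 2^{d} \;=\; 2^{\log_2 n} - 1 \;=\; n - 1
```

So one traversal costs n - 1 adds, for example 7 adds when n = 8.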

Balanced Binary Trees
A binary tree with n leaves has d = log2 n levels, and each level d has 2^d nodes
One add is performed per node, therefore O(n) adds on a single traversal of the tree.
Two Phase Algorithm
1. Up-sweep phase
2. Down-sweep phase
[Figure: Tree for n = 8, with levels d = 0 through d = 3]
The Up-Sweep Phase
for (d = 0; d <= log2(n) - 1; d++)
    for all k = 0; k < n - 1; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
Where have we seen this before?
The Down-Sweep Phase
x[n-1] = 0;
for (d = log2(n) - 1; d >= 0; d--)
    for all k = 0; k < n - 1; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
After up-sweep  x0         ∑(x0..x1)  x2         ∑(x0..x3)  x4         ∑(x4..x5)  x6         ∑(x0..x7)
x[n-1] = 0      x0         ∑(x0..x1)  x2         ∑(x0..x3)  x4         ∑(x4..x5)  x6         0
d=2             x0         ∑(x0..x1)  x2         0          x4         ∑(x4..x5)  x6         ∑(x0..x3)
d=1             x0         0          x2         ∑(x0..x1)  x4         ∑(x0..x3)  x6         ∑(x0..x5)
d=0             0          x0         ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)
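
The two phases can be combined into one in-place routine. The following C sketch simulates them sequentially; the function name `work_efficient_scan` and the returned add counter are assumptions for illustration, and n is assumed to be a power of two.

```c
#include <stddef.h>

/* In-place exclusive scan using the up-sweep / down-sweep scheme.
 * Assumes n is a power of two. Each inner loop is written sequentially;
 * on a GPU its iterations would run in parallel.
 * Returns the number of additions performed. */
size_t work_efficient_scan(int *x, size_t n)
{
    size_t adds = 0;
    if (n == 0)
        return 0;

    /* Up-sweep (reduce): stride = 2^(d+1) for d = 0 .. log2(n) - 1. */
    for (size_t stride = 2; stride <= n; stride *= 2) {
        for (size_t k = 0; k < n; k += stride) {
            x[k + stride - 1] += x[k + stride / 2 - 1];
            adds++;
        }
    }

    /* Down-sweep: clear the root, then push partial sums back down. */
    x[n - 1] = 0;
    for (size_t stride = n; stride >= 2; stride /= 2) {
        for (size_t k = 0; k < n; k += stride) {
            int t = x[k + stride / 2 - 1];
            x[k + stride / 2 - 1] = x[k + stride - 1];
            x[k + stride - 1] += t;
            adds++;
        }
    }
    return adds;
}
```

For n = 8 this performs 7 adds in each phase, 14 in total, which grows linearly with n rather than as n log2 n.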
Current Limitations


Array sizes are limited to 1024 elements
Array sizes must be a power of two
Alterations for Arbitrary Sized Arrays
[Figure: Initial array of values -> Scan Block 0, Scan Block 1, Scan Block 2, Scan Block 3 -> Block Sums -> Scan Block Sums -> Final Array of Scanned Values]

1. Divide the large array into blocks that can each be scanned by a single thread block
2. Scan each block and write the total sum of each block to another array of block sums
3. Scan the block sums, generating an array of block increments
4. Add each block's increment to every element of that block
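
The block-based steps can be sketched on the CPU as follows. The names `blocked_scan`, `scan_block`, and the `block_size` parameter are hypothetical, and a plain sequential loop stands in for the scan that one thread block would perform.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sequential stand-in for the exclusive scan done by one thread block. */
static void scan_block(const int *in, int *out, size_t n)
{
    int running = 0;
    for (size_t k = 0; k < n; k++) {
        out[k] = running;
        running += in[k];
    }
}

/* Exclusive scan of an array of any length n, processed in blocks of
 * block_size elements (block_size > 0; in and out must not overlap).
 * Returns 0 on success, -1 if the temporary arrays cannot be allocated. */
int blocked_scan(const int *in, int *out, size_t n, size_t block_size)
{
    if (n == 0)
        return 0;
    size_t num_blocks = (n + block_size - 1) / block_size;
    int *sums = malloc(num_blocks * sizeof *sums);
    int *incr = malloc(num_blocks * sizeof *incr);
    if (!sums || !incr) {
        free(sums);
        free(incr);
        return -1;
    }

    /* Scan each block and record each block's total. */
    for (size_t b = 0; b < num_blocks; b++) {
        size_t start = b * block_size;
        size_t len = (n - start < block_size) ? n - start : block_size;
        scan_block(in + start, out + start, len);
        sums[b] = out[start + len - 1] + in[start + len - 1];
    }

    /* Scan the block sums to get one increment per block. */
    scan_block(sums, incr, num_blocks);

    /* Add each block's increment to all of its elements. */
    for (size_t k = 0; k < n; k++)
        out[k] += incr[k / block_size];

    free(sums);
    free(incr);
    return 0;
}
```

The last block may be shorter than `block_size`, which is how this sketch handles lengths that are not a multiple of the block size. A GPU version that uses the tree-based scan inside each block would instead need to pad the last block.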
Applications



Stream Compaction
Summed-Area Tables
Radix Sort
Stream Compaction
Definition:


Extracts the elements of interest from an array and places them contiguously in a new array
Uses:


Collision Detection
Sparse Matrix Compression
Input:  A B A D D E C F B
Output: A B A C B
Stream Compaction
Input (we want to preserve the gray elements):   A B A D D E C F B
Set a '1' in each gray input:                    1 1 1 0 0 0 1 0 1
Scan:                                            0 1 2 3 3 3 3 4 4
Scatter gray inputs to output using the scan result as scatter address:
Output:                                          A B A C B
Scatter address:                                 0 1 2 3 4
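
The flag, scan, and scatter steps can be sketched in C as below. The function name `compact` and the `keep` flag array are assumptions for illustration; the loops run sequentially here, whereas on a GPU the scatter would use one thread per element.

```c
#include <stddef.h>
#include <stdlib.h>

/* Copies the elements of in[] whose keep[] flag is nonzero into out[],
 * preserving their order. out must have room for n elements.
 * Returns the number of elements written, or (size_t)-1 if the
 * temporary address array cannot be allocated. */
size_t compact(const char *in, const int *keep, char *out, size_t n)
{
    if (n == 0)
        return 0;
    size_t *addr = malloc(n * sizeof *addr);
    if (!addr)
        return (size_t)-1;

    /* Exclusive scan of the flags gives each kept element its output slot. */
    size_t running = 0;
    for (size_t k = 0; k < n; k++) {
        addr[k] = running;
        running += (keep[k] != 0);
    }

    /* Scatter: every kept element is written to its own distinct address. */
    for (size_t k = 0; k < n; k++)
        if (keep[k])
            out[addr[k]] = in[k];

    free(addr);
    return running;
}
```

With the input A B A D D E C F B and flags 1 1 1 0 0 0 1 0 1 from the slide, it writes A B A C B and returns 5.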
Summed Area Tables

Definition:


A 2D table generated from an input image in which each entry stores the sum of all pixels between the entry location and the lower-left corner of the input image
Uses:

Can be used to perform filters of different widths at every pixel in the image in constant time per pixel
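
The constant-time claim comes from the fact that the sum over any axis-aligned rectangle needs at most four table lookups. The sketch below assumes a row-major table in which entry (r, c) holds the sum of all pixels with row <= r and column <= c, so index (0, 0) plays the role of the corner; the function name `box_sum` is hypothetical.

```c
#include <stddef.h>

/* Sum of the pixels in the inclusive rectangle rows r0..r1, columns c0..c1,
 * read from a row-major summed-area table sat with cols columns.
 * Uses at most four lookups regardless of the rectangle size. */
float box_sum(const float *sat, size_t cols,
              size_t r0, size_t c0, size_t r1, size_t c1)
{
    float total = sat[r1 * cols + c1];
    if (r0 > 0)
        total -= sat[(r0 - 1) * cols + c1];        /* strip below the box  */
    if (c0 > 0)
        total -= sat[r1 * cols + (c0 - 1)];        /* strip left of the box */
    if (r0 > 0 && c0 > 0)
        total += sat[(r0 - 1) * cols + (c0 - 1)];  /* corner removed twice */
    return total;
}
```

Dividing the result by the rectangle's area gives a box filter whose cost per pixel does not depend on the filter width.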
Summed Area Tables
1. Apply a sum scan to all rows of the image
2. Transpose the image
3. Apply a sum scan to all rows of the result
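
A CPU sketch of these three steps is shown below. It assumes an inclusive sum scan, a row-major `float` image, and a caller-provided scratch buffer; the function names are hypothetical. A final transpose is added so that the table comes back in the original orientation.

```c
#include <stddef.h>

/* Inclusive sum scan of each row of a rows x cols row-major image, in place. */
static void scan_rows(float *img, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 1; c < cols; c++)
            img[r * cols + c] += img[r * cols + c - 1];
}

/* Writes the transpose of src (rows x cols) into dst (cols x rows). */
static void transpose(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* Builds the summed-area table of img (rows x cols) in place.
 * tmp is scratch space with room for rows * cols floats. */
void summed_area_table(float *img, float *tmp, size_t rows, size_t cols)
{
    scan_rows(img, rows, cols);        /* 1. scan all rows              */
    transpose(img, tmp, rows, cols);   /* 2. transpose                  */
    scan_rows(tmp, cols, rows);        /* 3. scan rows of the result    */
    transpose(tmp, img, cols, rows);   /* extra: restore orientation    */
}
```

Scanning rows twice with a transpose in between lets the same row-scan routine handle both directions, rather than needing a separate column scan.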
Radix Sort
Each pass sorts on one bit, starting from the least significant bit, and keeps the relative order of keys with equal bits.

Initial Array  Pass 1     Pass 2     Pass 3     Pass 4     Pass 5     Pass 6
               (bit 0)    (bit 1)    (bit 2)    (bit 3)    (bit 4)    (bit 5)
110011 51      000110  6  110000 48  110000 48  110000 48  000110  6  000110  6
101001 41      110000 48  101001 41  101001 41  110011 51  101001 41  010011 19
010011 19      110011 51  011001 25  011001 25  010011 19  110000 48  010111 23
000110  6      101001 41  000110  6  110011 51  000110  6  110011 51  011001 25
110000 48      010011 19  110011 51  010011 19  010111 23  010011 19  101001 41
011001 25      011001 25  010011 19  000110  6  101001 41  010111 23  110000 48
010111 23      010111 23  010111 23  010111 23  011001 25  011001 25  110011 51
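
The passes in this example can be reproduced with a stable partition on one bit at a time. The C sketch below is sequential and the names `radix_pass` and `radix_sort` are hypothetical. It does not use scan; it only illustrates the pass structure.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One pass: stable partition of the keys by the given bit.
 * Keys with a 0 bit come first, then keys with a 1 bit. */
void radix_pass(const unsigned *in, unsigned *out, size_t n, unsigned bit)
{
    size_t pos = 0;
    for (size_t k = 0; k < n; k++)
        if (((in[k] >> bit) & 1u) == 0)
            out[pos++] = in[k];
    for (size_t k = 0; k < n; k++)
        if (((in[k] >> bit) & 1u) == 1)
            out[pos++] = in[k];
}

/* Sorts keys[] in ascending order, least significant bit first, using
 * num_bits passes. Returns 0 on success, -1 on allocation failure. */
int radix_sort(unsigned *keys, size_t n, unsigned num_bits)
{
    if (n == 0)
        return 0;
    unsigned *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return -1;
    for (unsigned bit = 0; bit < num_bits; bit++) {
        radix_pass(keys, tmp, n, bit);
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
    return 0;
}
```

Sorting the seven 6-bit keys of the example takes six passes. The stability of each pass is what keeps the ordering established by earlier, less significant bits.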
Radix Sort Using Scan
Input Array                              100  111  010  110  011  101  001  000
b = least significant bit                  0    1    0    0    1    1    1    0
e = insert a 1 for all false sort keys     1    0    1    1    0    0    0    1
f = scan the 1s                            0    1    1    2    3    3    3    3
Total Falses = e[n-1] + f[n-1] = 1 + 3 = 4
t = index - f + Total Falses               4    4    5    5    5    6    7    8
d = b ? t : f                              0    4    1    2    5    6    7    3
Scatter input using d as scatter address:
Output Array                             100  010  110  000  111  011  101  001
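
The split step can be sketched in C with the same variable names b, e, f, t, and d. The function name `split_by_bit` is an assumption, and the loops run sequentially here, whereas on a GPU each index would be handled by its own thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Split: stably moves keys whose given bit is 0 ahead of keys whose bit
 * is 1, using a scan to compute scatter addresses. in and out must not
 * overlap. Returns 0 on success, -1 on allocation failure. */
int split_by_bit(const unsigned *in, unsigned *out, size_t n, unsigned bit)
{
    if (n == 0)
        return 0;
    size_t *f = malloc(n * sizeof *f);
    if (!f)
        return -1;

    /* f = exclusive scan of e, where e[k] = 1 when the bit is 0 (false). */
    size_t running = 0;
    for (size_t k = 0; k < n; k++) {
        f[k] = running;
        running += (((in[k] >> bit) & 1u) == 0);
    }
    size_t total_falses = running;   /* equals e[n-1] + f[n-1] */

    for (size_t k = 0; k < n; k++) {
        unsigned b = (in[k] >> bit) & 1u;
        size_t t = k - f[k] + total_falses;   /* address if the bit is 1 */
        size_t d = b ? t : f[k];              /* final scatter address   */
        out[d] = in[k];
    }

    free(f);
    return 0;
}
```

Applied to the 3-bit input above with `bit = 0`, it yields 100 010 110 000 111 011 101 001. Repeating the split for each higher bit completes the sort.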
Radix Sort Using GPU



A partial radix sort is performed once for each block.
A scan needs to be performed once for each bit of the key
The sorted blocks are then merged using bitonic sort
References

These slides are directly based upon the following resource and are meant for educational purposes only.
GPU Gems III, Chapter 39, Parallel Prefix Sum (Scan) with CUDA, Mark Harris, Shubhabrata Sengupta, John D. Owens