Ye_ipdps10-slide: IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
High Performance Comparison-Based Sorting Algorithm
on Many-Core GPUs
Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne
Key Laboratory of Computer System and Architecture
ICT, CAS, China
Outline
GPU computation model
Our sorting algorithm
– A new bitonic-based merge sort, named Warpsort
Experimental results
Conclusion
GPU computation model
Massively multi-threaded, data-parallel many-core
architecture
Important features:
– SIMT execution model
Avoid branch divergence
– Warp-based scheduling
Implicit hardware synchronization among threads within a warp
– Access pattern
Coalesced vs. non-coalesced
Why merge sort?
Analogous to external sorting
– Limited shared memory on chip vs. limited main memory
Sequential memory access
– Easy to meet the coalescing requirement
Why a bitonic-based merge sort?
Massively fine-grained parallelism
– Because of its relatively high complexity, a bitonic network is not good at sorting large arrays
– Only used to sort small subsequences in our implementation
Again, the coalesced memory access requirement
Problems in the bitonic network
Naïve implementation
– Block-based bitonic network
– One element per thread
Some problems:
[Figure: bitonic sorting network, phases 0-2, with the stages inside each phase]
– In each stage, n elements produce only n/2 compare-and-swap operations (one per thread), forming both ascending pairs and descending pairs
– Between stages: block-wide synchronization
Too many branch divergences and synchronization operations
What do we use?
Warp-based bitonic network
– Each bitonic network is assigned to an independent warp, instead of a block
Barrier-free: avoids synchronization between stages
– The threads in a warp perform 32 distinct compare-and-swap operations in the same order
Avoids branch divergence
At least 128 elements per warp
And, building on this, a complete comparison-based sorting algorithm: GPU-Warpsort
Overview of GPU-Warpsort
Step 1: divide the input sequence into small tiles, each sorted by a warp-based bitonic network
Step 2: merge by warps, until the parallelism is insufficient
Step 3: split the resulting sequences into independent small subsequences
Step 4: merge each group of subsequences by warps, and form the output
[Figure: data flow from Input to Output; each "bitonic sort by a warp", "merge by a warp", and "split into independent subsequences" box operates in parallel with its siblings]
Step 1: barrier-free bitonic sort
Divide the input array into equal-sized tiles
Each tile is sorted by a warp-based bitonic network
– 128+ elements per tile to avoid branch divergence
– No need for __syncthreads()
– Ascending pairs + descending pairs
– Use max() and min() to replace the if-swap pairs
bitonic_warp_128(key_t *keyin, key_t *keyout) {
  // phases 0 to log2(128)-2: build bitonic sequences
  for (i = 2; i < 128; i *= 2) {
    for (j = i/2; j > 0; j /= 2) {
      k0 ← position of the preceding element in each pair to form ascending order
      if (keyin[k0] > keyin[k0+j])
        swap(keyin[k0], keyin[k0+j]);
      k1 ← position of the preceding element in each pair to form descending order
      if (keyin[k1] < keyin[k1+j])
        swap(keyin[k1], keyin[k1+j]);
    }
  }
  // special case for the last phase: every pair is ascending
  for (j = 128/2; j > 0; j /= 2) {
    k0 ← position of the preceding element in the thread's first pair to form ascending order
    if (keyin[k0] > keyin[k0+j])
      swap(keyin[k0], keyin[k0+j]);
    k1 ← position of the preceding element in the thread's second pair to form ascending order
    if (keyin[k1] > keyin[k1+j])
      swap(keyin[k1], keyin[k1+j]);
  }
}
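The phase/stage schedule in the pseudocode above can be exercised on the host. The sketch below is plain C++, not the authors' CUDA kernel: the function name and in-place interface are illustrative, and a sequential loop stands in for the 32 warp lanes that would each execute a share of the compare-and-swap operations per stage.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential stand-in for the warp-based bitonic network on one tile.
// `key.size()` must be a power of two (128 per tile in GPU-Warpsort).
void bitonic_tile_sort(std::vector<int>& key) {
    const std::size_t n = key.size();
    for (std::size_t i = 2; i <= n; i *= 2) {          // phases
        for (std::size_t j = i / 2; j > 0; j /= 2) {   // stages
            for (std::size_t k = 0; k < n; ++k) {      // all "lanes"
                std::size_t partner = k ^ j;           // paired element
                if (partner > k) {
                    // (k & i) == 0 -> this pair belongs to an ascending
                    // block; in the last phase (i == n) this holds for
                    // every k, so the final output is fully ascending
                    bool asc = (k & i) == 0;
                    if ((asc  && key[k] > key[partner]) ||
                        (!asc && key[k] < key[partner]))
                        std::swap(key[k], key[partner]);
                }
            }
        }
    }
}
```

On the GPU, the inner loop over `k` is what the warp executes in lockstep; because every lane runs the identical compare-and-swap sequence, no stage needs a barrier.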
Step 2: bitonic-based merge sort
t-element merge sort
– Allocate a t-element buffer in shared memory
– Load the t/2 smallest elements from sequences A and B, respectively
– Merge them with a bitonic merge network
– Output the lower t/2 elements
– Load the next t/2 smallest elements from A or B, whichever supplied the smaller last-loaded element
(t = 8 in this example)
[Figure: t = 8 walkthrough. The four smallest elements of sequence A (0, 2, 4, 6) and of sequence B (1, 3, 5, 7) fill the shared-memory buf, the barrier-free bitonic merge network sorts it, and the lower four elements (0, 1, 2, 3) are output. Then the test A[3] < B[3]: yes, so the next four elements are loaded from A; no would load the next four from B. The process repeats until both sequences are consumed.]
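A host-side sketch of this t-element scheme follows. It is plain C++, not the authors' kernel: the function names, the INT_MAX sentinel padding, and the sequential merge network are conveniences of this sketch; the refill test mirrors the "A[3] < B[3]?" decision on the slide.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts any bitonic sequence in buf into ascending order: a serial
// stand-in for the barrier-free bitonic merge network.
static void bitonic_merge(std::vector<int>& buf) {
    const std::size_t t = buf.size();
    for (std::size_t j = t / 2; j > 0; j /= 2)
        for (std::size_t k = 0; k < t; ++k) {
            std::size_t p = k ^ j;
            if (p > k && buf[k] > buf[p]) std::swap(buf[k], buf[p]);
        }
}

// t-element merge of two non-empty sorted sequences, t a power of two.
std::vector<int> merge_by_warp(std::vector<int> A, std::vector<int> B,
                               std::size_t t) {
    const std::size_t real = A.size() + B.size();
    while (A.size() % (t / 2)) A.push_back(INT_MAX);  // sentinels sort last
    while (B.size() % (t / 2)) B.push_back(INT_MAX);

    std::vector<int> buf(t), out;
    std::size_t ia = 0, ib = 0;
    // initial fill: t/2 from A ascending (lower half), t/2 from B
    // reversed (upper half) -> the buffer is a bitonic sequence
    for (std::size_t k = 0; k < t / 2; ++k) buf[k] = A[ia++];
    for (std::size_t k = 0; k < t / 2; ++k) buf[t - 1 - k] = B[ib++];

    for (;;) {
        bitonic_merge(buf);
        if (ia == A.size() && ib == B.size()) {       // nothing left to fetch
            out.insert(out.end(), buf.begin(), buf.end());
            break;
        }
        out.insert(out.end(), buf.begin(), buf.begin() + t / 2);
        // fetch the next t/2 from the sequence whose last loaded element
        // was smaller (the "A[3] < B[3]?" test); loading them reversed
        // into the lower half keeps the buffer bitonic
        bool fromA = (ib == B.size()) ||
                     (ia < A.size() && A[ia - 1] < B[ib - 1]);
        for (std::size_t k = 0; k < t / 2; ++k)
            buf[t / 2 - 1 - k] = fromA ? A[ia++] : B[ib++];
    }
    out.resize(real);  // drop padding sentinels
    return out;
}
```

Running this on the slide's data (A = 0 2 4 6 8 10 12 14, B = 1 3 5 7 9 11 13 15, t = 8) reproduces the buffer states shown in the figure.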
Step 3: split into small tiles
Problem of merge sort
– The number of sequence pairs decreases geometrically with each merge step
– Cannot keep this massively parallel platform busy
Method
– Divide the large sequences into independent small tiles which satisfy:
∀ a ∈ subsequence(x, i), ∀ b ∈ subsequence(y, j): a ≤ b, where 0 ≤ x < l, 0 ≤ y < l, 0 ≤ i < j < s
Step 3: split into small tiles (cont.)
How to get the splitters?
– Sample the input sequence randomly
[Figure: random samples drawn from the input sequence form the sample sequence; after a sort, evenly spaced elements of the sorted sample sequence are chosen as the splitters]
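A sequential sketch of this sampling step (the function name, sample size, and fixed seed are illustrative choices of this sketch, not parameters from the paper):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Draw a random sample, sort it, and take s - 1 evenly spaced elements
// as splitters; random sampling makes the resulting s buckets balanced
// with high probability.
std::vector<int> pick_splitters(const std::vector<int>& input,
                                std::size_t s, std::size_t sample_size,
                                unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, input.size() - 1);

    std::vector<int> sample(sample_size);
    for (int& x : sample) x = input[pick(gen)];   // random sampling
    std::sort(sample.begin(), sample.end());      // the "sort" box

    std::vector<int> splitters;
    for (std::size_t k = 1; k < s; ++k)           // s - 1 evenly spaced
        splitters.push_back(sample[k * sample_size / s]);
    return splitters;
}
```

Each sequence can then be cut at these splitters with a binary search, yielding the independent (x, i) tiles required by the invariant above.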
Step 4: final merge sort
Subsequences (0,i), (1,i), …, (l-1,i) are merged into S_i
Then S_0, S_1, …, S_{s-1} are assembled into a totally sorted array
[Figure: an l × s grid of subsequences (x, i); for each i, the column (0,i), (1,i), …, (l-1,i) is merged into S_i]
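This step can be simulated serially; in the sketch below (names are mine) std::merge stands in for the warp-based merges, and the final concatenation is valid only because of the step-3 splitting invariant.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// tiles[x][i] holds subsequence (x, i), already sorted. Column i is
// merged into S_i; since every element of column i is <= every element
// of column i+1, concatenating S_0, S_1, ... is the sorted output.
std::vector<int> final_merge(
        const std::vector<std::vector<std::vector<int>>>& tiles) {
    std::vector<int> out;
    const std::size_t l = tiles.size();      // number of sequences
    const std::size_t s = tiles[0].size();   // tiles per sequence
    for (std::size_t i = 0; i < s; ++i) {
        std::vector<int> Si;
        for (std::size_t x = 0; x < l; ++x) {   // merge (0,i)..(l-1,i)
            std::vector<int> merged;
            std::merge(Si.begin(), Si.end(),
                       tiles[x][i].begin(), tiles[x][i].end(),
                       std::back_inserter(merged));
            Si.swap(merged);
        }
        out.insert(out.end(), Si.begin(), Si.end());
    }
    return out;
}
```

On the GPU the s columns are independent, so all of them are merged in parallel by different warps, restoring the parallelism that plain merge sort loses in its final steps.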
Experimental setup
Host
– AMD Opteron 880 @ 2.4 GHz, 2 GB RAM
GPU
– NVIDIA GeForce 9800 GTX+, 512 MB
Input sequences
– Key-only and key-value configurations (32-bit keys and values)
– Sequence size: from 1M to 16M elements
– Distributions: Zero, Sorted, Uniform, Bucket, and Gaussian
Performance comparison
Quicksort
– Cederman, ESA'08
Radixsort
– Fastest sorting algorithm on GPU (Satish, IPDPS'09)
Mergesort
– Fastest comparison-based sorting algorithm on GPU (Satish, IPDPS'09)
Warpsort
– Our implementation
Implementations already compared by Satish are not included
[Figure: execution time (msec) and sorting rate (millions/sec) of warpsort, quicksort, radixsort, and mergesort for key-only (ko) and key-value (kv) inputs, sequence sizes 1M to 16M]
Performance results
Key-only
– 70% higher performance than quicksort
Key-value
– 20%+ higher performance than mergesort
– 30%+ for large sequences (>4M)
Results under different distributions
– Load imbalance
[Figure: time (msec) and sorting rate (millions/sec) under the Zero, Uniform, Gaussian, Bucket, and Sorted distributions, sequence sizes 1M to 16M]
The Uniform, Bucket, and Gaussian distributions achieve almost the same performance
The Zero distribution is the fastest
Does not excel on the Sorted distribution
Conclusion
We present an efficient comparison-based sorting algorithm for many-core GPUs
– Carefully maps the tasks to the GPU architecture
– Uses warp-based bitonic networks to eliminate barriers
– Provides sufficient homogeneous parallel operations for each thread, avoiding thread idling and branch divergence
– Achieves totally coalesced global memory accesses when fetching and storing the sequence elements
The results demonstrate up to 30% higher performance compared with previous optimized comparison-based algorithms
Thanks