
High Performance Comparison-Based Sorting Algorithm
on Many-Core GPUs
Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne
Key Laboratory of Computer System and Architecture
ICT, CAS, China
Outline
 GPU computation model
 Our sorting algorithm
– A new bitonic-based merge sort, named Warpsort
 Experimental results
 Conclusion
GPU computation model
 Massively multi-threaded, data-parallel many-core
architecture
 Important features:
– SIMT execution model
 Avoid branch divergence
– Warp-based scheduling
 Implicit hardware synchronization among threads within a warp
– Access pattern
 Coalesced vs. non-coalesced (sketched below)
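For illustration, a minimal CUDA sketch of the two access patterns (the kernel names and the stride parameter are ours, not from the talk):

    // Coalesced: consecutive threads of a warp read consecutive addresses,
    // so the warp's 32 loads are served by a few wide memory transactions.
    __global__ void copy_coalesced(const int *in, int *out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = in[idx];
    }

    // Non-coalesced: a large stride scatters a warp's loads across many
    // transactions and wastes most of the memory bandwidth.
    __global__ void copy_strided(const int *in, int *out, int n, int stride) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = in[(idx * stride) % n];  // assumes idx * stride fits in int
    }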
Why merge sort?
 Similar case to external sorting
– Limited on-chip shared memory vs. limited main memory
 Sequential memory access
– Easy to meet the coalescing requirement
Why a bitonic-based merge sort?
 Massively fine-grained parallelism
– Because of its relatively high complexity (O(n log² n) comparisons), the bitonic network is not good at sorting large arrays
– Only used to sort small subsequences in our implementation
 Again, the coalesced memory access requirement
Problems in the bitonic network
 Naïve implementation
– Block-based bitonic network
– One element per thread
 Some problems
– In each stage
 n elements produce only n/2 compare-and-swap operations
 Form both ascending pairs and descending pairs
– Between stages
 Block-wide synchronization
 Too many branch divergences and synchronization operations
[Figure: a block-based bitonic network with its phases and stages, one thread per element, and a block barrier between stages.]
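For contrast, a minimal CUDA sketch of this naïve block-based scheme (our illustration; one element per thread, launched as bitonic_block_naive<<<1, n>>> with n = blockDim.x a power of two):

    // Naïve bitonic sort: one element per thread, block-wide barriers.
    __global__ void bitonic_block_naive(int *key, int n) {
        int t = threadIdx.x;
        for (int i = 2; i <= n; i *= 2) {            // phases
            for (int j = i / 2; j > 0; j /= 2) {     // stages
                int partner = t ^ j;
                if (partner > t) {                   // only n/2 threads do useful work
                    // adjacent pairs alternate direction -> branch divergence
                    bool ascending = ((t & i) == 0);
                    if ((key[t] > key[partner]) == ascending) {
                        int tmp = key[t];
                        key[t] = key[partner];
                        key[partner] = tmp;
                    }
                }
                __syncthreads();                     // barrier after every stage
            }
        }
    }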
What do we use?
 Warp-based bitonic network
– Each bitonic network is assigned to an independent warp, instead of a block
 Barrier-free; avoids synchronization between stages
– Threads in a warp perform 32 distinct compare-and-swap operations in the same order
 Avoids branch divergence
 At least 128 elements per warp
 And further, a complete comparison-based sorting algorithm: GPU-Warpsort
Overview of GPU-Warpsort
 Step 1: divide the input sequence into small tiles, and sort each tile with a warp-based bitonic network
 Step 2: merge by warps, until the parallelism is insufficient
 Step 3: split the remaining large sequences into independent small subsequences
 Step 4: merge each group of subsequences by a warp, and form the output
[Figure: the four-step pipeline: input, bitonic sort by warps, merge by warps, split into independent subsequences, final merge by warps, output.]
Step 1: barrier-free bitonic sort
 Divide the input array into equal-sized tiles
 Each tile is sorted by a warp-based bitonic network
– 128+ elements per tile to avoid branch divergence
– No need for __syncthreads()
– Ascending pairs + descending pairs
– Use max() and min() to replace if-swap pairs (branch-free sketch after the pseudocode)
// Warp-based bitonic sort of a 128-element tile by one 32-thread warp.
// Each thread performs two compare-and-swap operations per step, so the
// pair directions never diverge and no __syncthreads() is needed.
// The k0/k1 bit tricks below are one concrete way to compute the
// positions that the slide's pseudocode describes.
bitonic_warp_128(key_t *keyin, key_t *keyout) {
    tid ← lane id of this thread within the warp (0..31)
    // phases 0 to log(128)-2; the last phase is handled below
    for (i = 2; i < 128; i *= 2) {
        for (j = i / 2; j > 0; j /= 2) {
            // k0: preceding element of this thread's ascending pair
            k  = ((tid & ~(j - 1)) << 1) | (tid & (j - 1));
            k0 = ((k & ~(i - 1)) << 1) | (k & (i - 1));
            if (keyin[k0] > keyin[k0 + j])
                swap(keyin[k0], keyin[k0 + j]);
            // k1: preceding element of this thread's descending pair
            k1 = k0 | i;
            if (keyin[k1] < keyin[k1 + j])
                swap(keyin[k1], keyin[k1 + j]);
        }
    }
    // special case for the last phase: all pairs are ascending
    for (j = 128 / 2; j > 0; j /= 2) {
        // k0: preceding element of the thread's first ascending pair
        k0 = ((tid & ~(j - 1)) << 1) | (tid & (j - 1));
        if (keyin[k0] > keyin[k0 + j])
            swap(keyin[k0], keyin[k0 + j]);
        // k1: preceding element of the thread's second ascending pair
        k1 = (((tid + 32) & ~(j - 1)) << 1) | ((tid + 32) & (j - 1));
        if (keyin[k1] > keyin[k1 + j])
            swap(keyin[k1], keyin[k1 + j]);
    }
    // finally, the sorted tile is written to keyout (copy omitted here)
}
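As the slide notes, each if-swap can also be made branch-free with min() and max(); a minimal sketch for one ascending pair (integer keys assumed, the temporaries are ours):

    // branch-free compare-and-swap of the ascending pair (k0, k0 + j)
    key_t a = keyin[k0];
    key_t b = keyin[k0 + j];
    keyin[k0]     = min(a, b);   // smaller key moves to the front
    keyin[k0 + j] = max(a, b);   // larger key moves to the back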
Step 2: bitonic-based merge sort
 t-element merge sort
– Allocate a t-element buffer in shared memory
– Load the t/2 smallest elements from sequences A and B, respectively
– Merge them with a barrier-free bitonic merge network
– Output the lower t/2 elements
– Load the next t/2 smallest elements from A or B
 t = 8 in this example (a code sketch follows the figure)
[Figure: t = 8 example. buf holds the 4 smallest elements of A (0 2 4 6) and of B (1 3 5 7); the barrier-free bitonic merge network sorts them, and the lower 4 elements (0 1 2 3) go to the output. Then: A[3] < B[3]? Yes: load the next 4 elements from A (8 10 12 14); no: load the next 4 elements from B. Repeat the merge.]
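A sequential sketch of this merge loop (our naming; on the GPU each line runs warp-parallel, qsort() stands in for the barrier-free bitonic merge network, and lenA/lenB are assumed to be multiples of t/2):

    #include <stdlib.h>
    #include <string.h>

    static int cmp(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    // Merge sorted A and B through a t-element buffer, t/2 outputs per round.
    void merge_buffered(const int *A, int lenA, const int *B, int lenB,
                        int *out, int t) {
        int h = t / 2, ta = h, tb = h, o = 0;
        int *buf = malloc(t * sizeof(int));
        memcpy(buf, A, h * sizeof(int));             // t/2 smallest of A
        memcpy(buf + h, B, h * sizeof(int));         // t/2 smallest of B
        while (ta < lenA || tb < lenB) {
            qsort(buf, t, sizeof(int), cmp);         // bitonic merge stand-in
            memcpy(out + o, buf, h * sizeof(int));   // output the lower t/2
            o += h;
            memmove(buf, buf + h, h * sizeof(int));  // keep the upper t/2
            // refill from the sequence whose last loaded element is smaller
            if (tb >= lenB || (ta < lenA && A[ta - 1] < B[tb - 1])) {
                memcpy(buf + h, A + ta, h * sizeof(int)); ta += h;
            } else {
                memcpy(buf + h, B + tb, h * sizeof(int)); tb += h;
            }
        }
        qsort(buf, t, sizeof(int), cmp);             // flush the final buffer
        memcpy(out + o, buf, t * sizeof(int));
        free(buf);
    }

On the GPU the refilled half can be loaded in reverse order so the buffer is bitonic by construction, which is why the cheap bitonic merge network can replace the full sort used in this sketch.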
Step 3: split into small tiles
 Problem of merge sort
– The number of pairs decreases geometrically
– This cannot fit the massively parallel platform
 Method
– Divide the large sequences into independent small subsequences which satisfy:
∀ a ∈ subsequence(x, i), ∀ b ∈ subsequence(y, j): a ≤ b,
for 0 ≤ x < l, 0 ≤ y < l, 0 ≤ i < j < s
Step 3: split into small tiles (cont.)
 How to get the splitters?
– Sample the input sequence randomly
[Figure: a sample sequence is drawn randomly from the input sequence; it is sorted, and the splitters are selected from the sorted sample sequence.]
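A hedged sketch of how such splitters can then be applied to one sorted sequence (our naming; the paper's exact procedure may differ). Subsequence i of the sequence is seq[bound[i] .. bound[i+1]):

    // Cut one sorted sequence at the s-1 splitters via binary search:
    // bound[k] is the first index whose element is >= splitter[k-1], so
    // every element of subsequence i sorts before those of subsequence j (i < j).
    void split_bounds(const int *seq, int len, const int *splitter, int s,
                      int *bound) {
        bound[0] = 0;
        bound[s] = len;
        for (int k = 1; k < s; k++) {
            int lo = 0, hi = len;
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (seq[mid] < splitter[k - 1]) lo = mid + 1;
                else hi = mid;
            }
            bound[k] = lo;
        }
    }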
Step 4: final merge sort
 Subsequences (0,i), (1,i), …, (l-1,i) are merged into Si
 Then S0, S1, …, Ss-1 are assembled into a totally sorted array
[Figure: an l × s grid of subsequences (x, i), x = 0 .. l-1, i = 0 .. s-1; for each i, the column (0,i), (1,i), …, (l-1,i) is merged into Si.]
Experimental setup
 Host
– AMD Opteron 880 @ 2.4 GHz, 2 GB RAM
 GPU
– NVIDIA GeForce 9800 GTX+, 512 MB
 Input sequence
– Key-only and key-value configurations
 32-bit keys and values
– Sequence size: from 1M to 16M elements
– Distributions
 Zero, Sorted, Uniform, Bucket, and Gaussian
Performance comparison
 Mergesort
– Fastest comparison-based sorting algorithm on GPU (Satish, IPDPS'09)
 Quicksort
– Cederman, ESA'08
 Radixsort
– Fastest sorting algorithm on GPU (Satish, IPDPS'09)
 Warpsort
– Our implementation
– Implementations already compared by Satish are not included
[Figure: left, execution time (msec) vs. sequence size (1M-16M) for key-only (ko) and key-value (kv) inputs, comparing warpsort, quicksort, radixsort, and mergesort; right, sorting rate (millions/sec) vs. sequence size for warpsort, radixsort, and mergesort.]
Performance results
 Key-only
– 70% higher performance than quicksort
 Key-value
– 20%+ higher performance than mergesort
– 30%+ for large sequences (>4M)
Results under different distributions
 Uniform, Bucket, and Gaussian distributions get almost the same performance
 The Zero distribution is the fastest
 Does not excel on the Sorted distribution
– Load imbalance
[Figure: execution time (msec) and sorting rate (millions/sec) under the Zero, Uniform, Gaussian, Bucket, and Sorted distributions, for sequence sizes 1M-16M.]
Conclusion
 We present an efficient comparison-based sorting algorithm for many-core GPUs
– Carefully maps the tasks to the GPU architecture
 Uses a warp-based bitonic network to eliminate barriers
– Provides sufficient homogeneous parallel operations for each thread
 Avoids thread idling and thread divergence
– Totally coalesced global memory accesses when fetching and storing the sequence elements
 The results demonstrate up to 30% higher performance
– Compared with previous optimized comparison-based algorithms
Thanks