Combining Data Parallelism and Task Parallelism for Efficient Performance on Hybrid CPU and GPU Systems
Aditya Deshpande
Adviser: Prof. P J Narayanan
Center for Visual Information Technology
International Institute of Information Technology, Hyderabad
Background: Why Parallel?
• Early computer systems had only a single core.
• #transistors doubled every 2 years (Moore's Law).
• Computer architects used them to give speedup (frequency scaling).
• Power increases with frequency.
• Post Pentium 4 (May '04), Intel shifted focus to multi-core processors.
• With the end of frequency scaling, parallel algorithms are the only means for speedup.
As of today …
• Commodity computers have a multi-core CPU and a many-core GPU.
• Multi-core CPUs (Coarse/Task Parallelism):
  • LU, QR and Cholesky decomposition.
  • Random number and probability distribution generators.
  • FFT, PBzip2, string processing, bioinformatics, data structures etc.
  • Intel MKL and other libraries.
• Many-core GPUs (Fine/Data Parallelism):
  • Scan, sort, hashing, SpMV, lists, linear algebra etc.
  • Graph algorithms: BFS, SSSP, APSP, SCC, MST.
  • cuBLAS, cuFFT, NvPP, Magma, cuSparse, CUDPP, Thrust etc.
Work in the Past …
• Earlier data-parallel algorithms constitute only portions of end-to-end applications. For example: linear algebra, matrix and list operations.
• Earlier algorithms also had some inherent data-parallelism. For example: BFS, image processing (filtering, color conversion), FFT.

Work in this Thesis …
• Design principles for challenging data-parallel algorithms:
  • Breaking sequentiality
  • Addressing irregularity
• Combining data and task parallelism for end-to-end applications:
  • Work sharing
  • Pipelining
Outline - Data Par. + Task Par. (GPU and CPU)
• BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
• WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
• ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
• PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)
Error Diffusion Dithering
• Technique to create an illusion of higher color depth.
• Floyd-Steinberg Dithering (FSD):
  1. Sum (error and pixel value)
  2. Find nearest color to the sum
  3. Output the nearest color
  4. Diffuse the error to neighbors with weights 7/16 (right), 3/16 (below-left), 5/16 (below) and 1/16 (below-right)
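Below is a minimal serial sketch of the four steps above for a grayscale image quantized to black/white (my own reference code, not the thesis implementation; names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Minimal serial reference: dither a row-major grayscale image (values in [0, 255])
// to black/white using the Floyd-Steinberg weights listed above.
std::vector<uint8_t> fsdSerial(std::vector<float> img, int m, int n) {
    std::vector<uint8_t> out(m * n);
    auto at = [&](int i, int j) -> float& { return img[i * n + j]; };
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {                         // scan-line order
            float old = at(i, j);                             // step 1: error already added
            float nearest = old < 128.0f ? 0.0f : 255.0f;     // step 2: nearest color
            out[i * n + j] = static_cast<uint8_t>(nearest);   // step 3: output it
            float err = old - nearest;                        // step 4: diffuse the error
            if (j + 1 < n)              at(i, j + 1)     += err * 7.0f / 16.0f;
            if (i + 1 < m && j > 0)     at(i + 1, j - 1) += err * 3.0f / 16.0f;
            if (i + 1 < m)              at(i + 1, j)     += err * 5.0f / 16.0f;
            if (i + 1 < m && j + 1 < n) at(i + 1, j + 1) += err * 1.0f / 16.0f;
        }
    }
    return out;
}
```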
Problem of Sequentiality
• Error distribution dictates the order of processing pixels.
• Scan-line order of processing an m x n image.
• Inherently sequential O(mn) algorithm.
• Long chain of dependency (from the last pixel back to the first).
• Previous work:
  • Metaxas: 3-pixel groups, processed on an N-processor array.
  • Zhang et al.: FSD too hard to parallelize; used the more inherently parallel pin-wheel error diffusion algorithm instead.
Data Dependency in FSD
[Figure: incoming and outgoing error weights (7/16, 1/16, 3/16, 5/16) around a pixel]
• The data dependency imposes a scheduling constraint.
• T(i,j): iteration in which pixel (i,j) is processed.
• Trapezoidal region of dependency:
  T(i,j) > max( T(i-1,j), T(i,j-1), T(i-1,j-1), T(i-1,j+1) )
Optimal Scheduling in FSD
[Figure: grid of iteration labels T(i,j) over the image, showing the knight's-move wavefront]
• Optimal scheduling: T(i,j) = 1 + max( T(i-1,j), T(i,j-1), T(i-1,j-1), T(i-1,j+1) )
• Knight's-move order of processing pixels.
• Each pixel depends only on the previous three iterations.
• Pixels in the same iteration (with the same label) can be processed in parallel.
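A small serial sketch (my own helper, not thesis code) that applies the recurrence directly and buckets pixels by label; each bucket is one wavefront whose pixels are mutually independent:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Compute T(i,j) = 1 + max of the four dependencies, then group pixels by label.
// buckets[t] holds all pixels processed in iteration t; buckets[0] stays empty.
std::vector<std::vector<std::pair<int,int>>> fsdSchedule(int m, int n) {
    std::vector<std::vector<int>> T(m, std::vector<int>(n, 0));
    int maxLabel = 0;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            int t = 0;
            if (j > 0)              t = std::max(t, T[i][j - 1]);
            if (i > 0)              t = std::max(t, T[i - 1][j]);
            if (i > 0 && j > 0)     t = std::max(t, T[i - 1][j - 1]);
            if (i > 0 && j + 1 < n) t = std::max(t, T[i - 1][j + 1]);
            T[i][j] = t + 1;                 // works out to the closed form 2*i + j + 1
            maxLabel = std::max(maxLabel, T[i][j]);
        }
    std::vector<std::vector<std::pair<int,int>>> buckets(maxLabel + 1);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            buckets[T[i][j]].push_back({i, j});
    return buckets;
}
```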
Coarse Parallel FSD on CPU
• To increase computation per thread, group pixels into blocks.
• Pixels within a block are processed sequentially.
• Trapezoidal blocks satisfy the data dependency of the last pixel within a block.
[Figure: trapezoidal blocks of width a and height b tiling the image]
Adjacent Block Processing
[Figure: iteration labels assigned to adjacent trapezoidal blocks]
• Trapezoidal blocks adhere to the knight's-move ordering.
• Blocks with the same label are processed in parallel (see the sketch below).
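A minimal sketch of the coarse-parallel loop, under stated assumptions: blocksByLabel and process are hypothetical stand-ins, not the thesis code.

```cpp
#include <functional>
#include <vector>

// Assumptions: blocksByLabel[t] lists the block ids of wavefront t (labels in
// increasing order) and process(blockId) runs sequential FSD inside one trapezoidal
// block. Blocks of one wavefront are independent, so each wavefront is a parallel-for;
// compile with OpenMP (-fopenmp) for the pragma to take effect.
void coarseParallelFSD(const std::vector<std::vector<int>>& blocksByLabel,
                       const std::function<void(int)>& process) {
    for (const auto& wavefront : blocksByLabel) {
        #pragma omp parallel for
        for (int b = 0; b < (int)wavefront.size(); ++b)
            process(wavefront[b]);
    }
}
```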
Fine-Grained Data Parallel GPU FSD
• GPUs have many more physical cores than multi-core CPUs.
• It is favorable to have a large number of light-weight threads.
• Grouping into blocks reduces the amount of parallelism.
• Process at the pixel level itself for more parallelism.
• Pixels of the same iteration (knight's move) can be scheduled in parallel.
Data Re-ordering for Optimal Performance
• Naïve storage: row major or column major.
• Results in un-coalesced access.
[Figure: access pattern of image values under naïve storage: pixels of one iteration are scattered in memory]
Data Re-ordering for Optimal Performance
• Store the image and errors in knight's-move order.
• Coalesced memory access!
[Figure: under knight's-move storage, only consecutive memory locations are accessed by each iteration]
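One way to build such a re-ordering is sketched below (my own construction, assuming the label 2i + j from the schedule above and m, n ≥ 1); perm maps a row-major index to its knight's-order position:

```cpp
#include <vector>

// perm[row-major index] = position of that pixel in knight's-move order: pixels are
// grouped label by label and laid out contiguously, so the threads of one iteration
// touch consecutive memory locations.
std::vector<int> knightsOrderIndex(int m, int n) {
    int numLabels = 2 * (m - 1) + (n - 1) + 1;          // labels 0 .. numLabels-1
    std::vector<int> count(numLabels, 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            ++count[2 * i + j];                         // wavefront sizes
    std::vector<int> start(numLabels + 1, 0);
    for (int l = 0; l < numLabels; ++l)                 // exclusive scan: start offsets
        start[l + 1] = start[l] + count[l];
    std::vector<int> perm(m * n);
    std::vector<int> next(start.begin(), start.begin() + numLabels);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            perm[i * n + j] = next[2 * i + j]++;        // row-major -> knight's order
    return perm;
}
```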
Outline - Data Par. + Task Par. (GPU and CPU)
• BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
• WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
• ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
• PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)
Work-Sharing for FSD
• Drawbacks of a pure (GPU-only) approach:
  • The GPU needs a lot of threads.
  • Dithering does not offer many parallel pixels initially and at the end.
  • Launching only a few threads on the GPU doesn't make sense.
• Use the GPU only when parallelism is above a threshold.
• Split the data-parallel step across CPU and GPU using work sharing!
[Figure: parallelism over time t; the width of the shaded region denotes the number of pixels available for parallel processing]
Handover and Hybrid FSD
[Figure, Handover FSD: work runs on the CPU up to the handover point in time t and on the GPU afterwards]
[Figure, Hybrid FSD: at every time t, a strip of the wavefront is done on the CPU and the rest on the GPU; the "width" of the CPU strip controls load balancing between CPU and GPU]
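A minimal sketch of the per-wavefront split (hypothetical callbacks runOnCPU/runOnGPU; not the thesis code). Handover FSD is the special case where cpuWidth is the whole wavefront below a size threshold and zero above it.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// The first cpuWidth pixels of the wavefront go to the CPU worker and the rest to the
// GPU, running concurrently.
void hybridWavefront(const std::vector<int>& wavefront, int cpuWidth,
                     const std::function<void(const int*, int)>& runOnCPU,
                     const std::function<void(const int*, int)>& runOnGPU) {
    int c = std::min(cpuWidth, (int)wavefront.size());
    std::thread gpu(runOnGPU, wavefront.data() + c, (int)wavefront.size() - c);
    runOnCPU(wavefront.data(), c);   // CPU share runs on the calling thread
    gpu.join();                      // both shares of a wavefront finish before the next
}
```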
Results: Speedup by Work Sharing
[Charts: runtime vs. the handover point for Handover FSD and vs. the CPU width for Hybrid FSD, showing an optimal handover point and an optimal width]
Results: Runtime Performance
[Chart: CPU-GPU Hybrid FSD runtime (ms) for image sizes from 1024x768 up to 10326x4910, on 8600M, Tesla T10 and GTX 480 GPUs paired with Intel Core2Duo P8600, Core i7 920 and Core i7 980X CPUs]
[Chart: CPU-GPU Handover FSD runtime (ms) for the same image sizes and CPU-GPU pairs]
[Chart: Coarse Parallel FSD runtime on the CPUs vs. the FSD sequential runtime]
• Pure GPU FSD (8600M): 1024x768 – 48 ms, 6200x8000 – 576 ms.
Outline - Data Par. + Task Par. (GPU and CPU)
• BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
• WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
• ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
• PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)
Sorting
• Textbooks teach us many popular sorting methods: Quicksort, Mergesort, Radixsort.
• But the data is always numbers!
• Real data is beyond just numbers:
  - Dictionary words or sentences
  - DNA sequences, multi-dimensional db records
  - File paths
Can we sort strings efficiently?
Irregularity in String Sorting
• Number sorting (or fixed-length sorting): FIXED LENGTH KEYS
  • Fixed-length keys (8 to 128 bits).
  • Standard containers: float, int, double etc.
  • Keys fit into registers.
  • Comparisons take O(1) time.
• String sorting (or variable/long-length sorting): VARIABLE LENGTH KEYS
  • Keys have no restriction on length.
  • Keys are iteratively loaded from main memory.
  • Comparisons *do not* take O(1) time.
  • Suffix sort (1M strings of 1M length!).
• Variable work per thread and arbitrary memory accesses: IRREGULARITY.
Can we sort strings efficiently?
Yes, we can, if we limit the number of iterative comparisons performed.
String Sorting
CPU:
• Multi-key Quicksort [Bentley and Sedgewick, SODA'97]
• Burstsort [Sinha et al., JEA'07]
• MSD Radix Sort [Kärkkäinen and Rantala, SPIRE'08]
GPU:
• Thrust Merge Sort [Satish et al., IPDPS'09]
• Fixed/Var. Merge Sort [Davidson et al., InPar'12]
• Hybrid Merge Sort [Banerjee et al., AsHES'13]
• Our String Sort (Radix Sort)
CPU: Burstsort (Sinha et al.)
• Input: {bat, barn, bark, by, byte, bytes, wane, way, wall, west}
• Partition into small buckets using a burst trie on the leading characters.
[Figure: burst trie over the first characters, with small buckets of string suffixes at the leaves]
• Sort the small buckets in the CPU cache.
• No merging of sorted buckets; they are already ordered!
CPU: MSD Radix Sort
SORT: {bat, barn, bark, by, byte, bytes, wane, way, wall}
[Figure: MSD radix buckets (AA, AB, ..., ZZ) holding the remaining suffixes of each string]
✔ Don't explicitly use pointers:
  - Counting methods (two-pass)
  - Dynamic methods (one-pass), using std::vector, std::list.
✔ Algorithmic caching: a fixed number of the next few characters are stored.
✔ Supra-alphabets: use 2-character granularity.
GPU (this thesis): fastest parallel radix sort; max. successive characters loaded; adaptive granularity.
GPU: Davidson et al.
Three-stage merge sort:
1. Stable bitonic sort
2. Parallel merge
3. Co-operative merge
• Prefer register packing over over-utilization.
• 2.5x faster than Thrust merge sort.
• String sort: keys are the first few characters; the value is the index of the successive characters; 3-stage merge sort with an iterative comparator.
Can we sort strings efficiently?
Yes, we can, if we limit the number of iterative comparisons performed.
Do previous methods do this?
CPU: ✔ (small buckets)
GPU: ✖ (not quite!)
Merge Sort: Iterative Comparisons
• Repetitive loading to resolve ties in every merge step.
• Davidson et al. show that "after every merge step comparisons are between more similar strings".
• All previous GPU string sorting approaches are based on merge sort.
[Illustration of comparisons]
• Iterative comparisons = high-latency global memory accesses = divergence.
• We develop a radix-sort-based string sort to mitigate this.
Radix Sort for String Sorting
• First sort: a k-character prefix of each string is the key.
• Strings that tie on the prefix form segments; an MSB segment ID acts as a proxy for the already-sorted prefix.
• Future sorts: (segment ID + next k-character prefix) as keys.
[Figure: segment IDs (0s and 1s) attached to the k-character prefixes after the first sort]
Our GPU String Sort
• Entire strings are not shuffled; we move only indices and prefixes.
• The fastest GPU radix sort primitive is used to perform each sort.
• GPU radix sort by Merrill and Grimshaw:
  • Optimized to be compute bound (hides the irregularity of memory!).
  • Many operations performed in shared memory/registers.
• Unlike merge sort, each part of a string is loaded only once!
[Figure: per-round steps: parallel scan of the segment array, parallel segment ID generation, parallel lookup of the next characters to form the sorted keys]
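A serial analogue of the round structure, to make the idea concrete (my own sketch: std::stable_sort on the pair (segment ID, next k characters) stands in for the GPU radix sort primitive on fixed-width keys):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// After each round, new segments open wherever adjacent keys differ; rounds stop once
// no unresolved ties remain, so each part of a string is read only once per round.
void segmentedStringSort(std::vector<std::string>& s, size_t k = 4) {
    if (s.size() < 2) return;
    std::vector<size_t> idx(s.size()), seg(s.size(), 0);
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    auto prefix = [&](size_t i, size_t pos) {            // next k chars, '\0'-padded
        std::string p = pos < s[i].size() ? s[i].substr(pos, k) : std::string();
        p.resize(k, '\0');
        return p;
    };
    for (size_t pos = 0; ; pos += k) {
        std::stable_sort(idx.begin(), idx.end(), [&](size_t a, size_t b) {
            return std::make_pair(seg[a], prefix(a, pos)) <
                   std::make_pair(seg[b], prefix(b, pos));
        });
        std::vector<size_t> newSeg(s.size());
        bool ties = false;
        size_t id = 0;
        newSeg[idx[0]] = 0;
        for (size_t r = 1; r < idx.size(); ++r) {
            bool same = seg[idx[r]] == seg[idx[r - 1]] &&
                        prefix(idx[r], pos) == prefix(idx[r - 1], pos);
            if (!same) ++id;
            else if (pos + k < s[idx[r]].size() || pos + k < s[idx[r - 1]].size())
                ties = true;                              // tie not yet resolved
            newSeg[idx[r]] = id;
        }
        seg.swap(newSeg);
        if (!ties) break;
    }
    std::vector<std::string> sorted;
    sorted.reserve(s.size());
    for (size_t i : idx) sorted.push_back(s[i]);
    s.swap(sorted);
}
```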
Additional Optimizations
• Adaptive Segment ID:
  • The number of segments is limited, so the number of segment-ID bytes in the key is limited!
  • Apart from the minimum segment-ID bytes, the remaining key bytes hold further characters.
  • Allows the maximum number of characters to be compared per sort step.
• Singleton Elimination: remove size-1 buckets.
[Figure: a stencil marks singletons; a parallel scan over the destination array and a parallel scatter yield a smaller sorting problem]
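A serial sketch of singleton elimination (illustration only; on the GPU this is the stencil + scan + scatter shown above, built from standard primitives):

```cpp
#include <vector>

// An element whose segment contains only itself is already in its final position and
// drops out of later sort rounds, shrinking the problem.
std::vector<int> dropSingletons(const std::vector<int>& order,       // current sorted order
                                const std::vector<int>& segOfRank) { // segment id per rank
    std::vector<int> kept;
    for (size_t r = 0; r < order.size(); ++r) {
        bool single = (r == 0 || segOfRank[r] != segOfRank[r - 1]) &&
                      (r + 1 == order.size() || segOfRank[r] != segOfRank[r + 1]);
        if (!single) kept.push_back(order[r]);    // only non-singletons continue
    }
    return kept;
}
```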
Results: Datasets
• We use the datasets created by Sinha et al. (Burstsort).
• We also create 2 practical datasets of our own (pc-filelist, sentences).
• After-sort tie length indicates the difficulty of sorting a dataset.
[Table: details of the datasets]
Runtime Speedup
[Charts: speedup vs. Burstsort and MSD radix sort (CPU), and speedup vs. the previous GPU method, across the artificial, random, dict, calls, genome, words, url, sentences and pc-filelist datasets]
• Good speedups on tough datasets (url, pc-filelist, sentences).
• Max. speedup on genome demonstrates the scalability of radix sort.
• Outperforms both previous GPU and CPU methods.
Analytical Estimate of Runtime
• Sort time t: for p Mkeys/s throughput of Thrust and an N Mkeys problem, t = 1000/p x N (ms).
• Time per iteration: α x t (where α = 1.5 to 2 for our approach).
• k: max. tie length.
• Total time = α x 1000/p x N x k.
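As a purely illustrative plug-in of hypothetical numbers (not figures from the thesis): with p = 1000 Mkeys/s, N = 10 Mkeys, α = 2 and k = 8, the estimate is 2 x (1000/1000) x 10 x 8 = 160 ms.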
Effect of Optimizations
• Speedup from Singleton Elimination: 0.9 to 4.5x
• Speedup from Adaptive Segment ID: 1.4 to 3.3x
Runtime on Standard Primitives
• Standard primitives improve in performance:
  • Tuned by vendors for new architectures.
  • New algorithms/improvements developed over time (GPU sorts have continually improved).
• Our string sort can inherit these improvements without re-design.
[Chart: % of runtime spent in Thrust primitives for each dataset (sentences, pc-filelist, url, artificial-2, artificial-4, artificial-5, random, dict, calls, genome, words)]
String Sorting Summary
We built a GPU string sort that:
• outperforms the state-of-the-art
• adapts to future architectures
• is the first radix-sort-based string sort
• scales to challenging inputs
• code available at http://web.iiit.ac.in/~adity.deshapandeug08/stringSort/
• code also made a part of CUDPP (a standard GPU library)
Outline - Data Par. + Task Par. (GPU and CPU)
• BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
• WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
• ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
• PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)
Burrows Wheeler Transform
• Input string: I[1…N].
• Sort all cyclically shifted strings of I[1…N].
• The last column of the sorted strings, along with the index of the original string, is the BWT output.
• O(N) strings are sorted, each of length O(N).
Example: I = "banana" (N = 6). The cyclic shifts, sorted, are (starting index on the left):
  6: a b a n a n
  4: a n a b a n
  2: a n a n a b
  1: b a n a n a
  5: n a b a n a
  3: n a n a b a
The last column "n n b a a a", along with the index of the original string (i.e. 4, since row 4 of the output matrix is the unshifted input), is the BWT output.
Note: the last column can easily be computed by offset addition even if we output this shuffled I[N].
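For reference, a naive version of this definition (illustration only; the thesis sorts the cyclic shifts with the modified GPU string sort instead):

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive reference BWT: sort all N cyclic shifts explicitly and emit the last column
// plus the row index of the original string.
std::pair<std::string, int> naiveBWT(const std::string& in) {
    int n = (int)in.size();
    std::vector<int> rot(n);                    // rot[r] = starting offset of rotation r
    std::iota(rot.begin(), rot.end(), 0);
    auto cyc = [&](int start, int k) { return in[(start + k) % n]; };
    std::sort(rot.begin(), rot.end(), [&](int a, int b) {
        for (int k = 0; k < n; ++k)
            if (cyc(a, k) != cyc(b, k)) return cyc(a, k) < cyc(b, k);
        return false;
    });
    std::string last(n, ' ');
    int origRow = 0;
    for (int r = 0; r < n; ++r) {
        last[r] = cyc(rot[r], n - 1);           // last column of the sorted matrix
        if (rot[r] == 0) origRow = r;           // row that holds the original string
    }
    return {last, origRow};                     // naiveBWT("banana") -> {"nnbaaa", 3} (0-based row)
}
```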
• GPU string sort works when ties are only a few characters (~100s).
• Suffix sort in BWT has much longer ties: 10^3 to 10^5 characters.
Modified String Sort for BWT
• Doubling the MCU length of the string sort:
  • The MCU length determines the number of sort steps.
  • Long ties mean a large number of sort steps and thus a longer runtime.
  • Use a fixed MCU length initially, then double it.
  • 1.5 to 2.5x speedup (e.g. enwik8: 1.06 s down to 0.58 s per block).
  • The fixed-length sort is inexpensive initially; doubling curtails the number of sort steps eventually.
Modified String Sort for BWT
• Partial GPU sort and CPU merge:
  • Cyclically shifted strings have a special property.
  • We can sort only 2/3rd of the strings and synthesize the rest without iterative sorting.
  • Sort all (mod 3) ≠ 0 strings iteratively.
  • The 1st char of a (mod 3) = 0 string and the rank of the next string in the 2/3rd sort are enough to sort the remaining 1/3rd strings.
  • A non-iterative overlapped merge is also possible (on the CPU).
Datasets: GPU BWT
• Datasets:
  • enwik8: first 10^8 bytes of the English Wikipedia dump (96MB).
  • wiki-xml: Wikipedia XML dump (151MB).
  • linux-2.6.11.tar: publicly available Linux kernel (199MB).
  • Silesia corpus: data-compression benchmark (208MB).
[Chart: tie length vs. block size]
Runtime: GPU BWT vs. Bzip2 BWT
[Chart: average runtime (secs per block) of GPU BWT (2/3rd + 1/3rd sort plus constant-time CPU merge) and CPU BWT (bzip2), block size 900KB, for enwik8, wiki-xml, silesia.tar and linux-2.6.11.tar. No speedup for small blocks: the GPU is not utilized sufficiently.]
[Chart: the same comparison with 4.5MB blocks. GPU BWT time increases with the MSD/ASD of the dataset; the CPU merge is a constant-time operation.]
[Chart: the same comparison with 9MB blocks. Speedup on large blocks, but the GPU is still slow for the worst-case linux dataset.]
String Perturbation
• Large numbers of sort steps result from repeated substrings/long ties.
• Runtime reduces greatly if we break the ties.
• Perturbation: add random characters at a fixed interval to break ties.
• Useful for applications where the BWT-transformed string itself is irrelevant and BWT+IBWT are used in pairs (viz. BW compression).
• The fixed perturbation can be removed after IBWT.
[Charts: GPU BWT and CPU BWT time vs. % perturbation for linux-2.6.11.tar with 4.5MB and 9MB blocks; GPU sort time decreases with perturbation while the constant-time merge and the CPU BWT are unaffected]
• Linux, 9MB blocks: 8.2x speedup with 0.1% perturbation.
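A minimal sketch of fixed-interval perturbation and its removal (my own illustration; interval is assumed positive, e.g. one inserted byte per 1000 input bytes for 0.1%):

```cpp
#include <cstdlib>
#include <string>

// Insert a random byte before every interval-th input byte to break long ties before
// BWT; strip the same positions after IBWT, so no side information is needed.
std::string perturb(const std::string& in, size_t interval) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (i % interval == 0)
            out.push_back(static_cast<char>(std::rand() % 256));  // inserted byte
        out.push_back(in[i]);
    }
    return out;
}

std::string unperturb(const std::string& in, size_t interval) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i)
        if (i % (interval + 1) != 0) out.push_back(in[i]);        // skip inserted bytes
    return out;
}
```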
Outline - Data Par. + Task Par. (GPU and CPU)
• BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
• WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
• ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
• PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)
Burrows Wheeler Compression
Three-step procedure: the file is divided into blocks and the following steps are done on each block.
1. Burrows Wheeler Transform: suffix sort and use the last column (most compute intensive).
2. Move-to-Front Transform: similar to run-length encoding (~10% of runtime).
3. Huffman Encoding: standard character-frequency-based encoding (~10% of runtime).
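For reference, a standard Move-to-Front transform (a textbook version, not code from the thesis):

```cpp
#include <cstdint>
#include <list>
#include <vector>

// Applied to the BWT output, runs of equal symbols become runs of 0s, which the
// Huffman stage compresses well.
std::vector<uint8_t> moveToFront(const std::vector<uint8_t>& in) {
    std::list<uint8_t> table;
    for (int c = 0; c < 256; ++c) table.push_back(static_cast<uint8_t>(c));
    std::vector<uint8_t> out;
    out.reserve(in.size());
    for (uint8_t c : in) {
        uint8_t rank = 0;
        auto it = table.begin();
        for (; *it != c; ++it) ++rank;      // position of c in the current table
        out.push_back(rank);
        table.erase(it);                    // move c to the front
        table.push_front(c);
    }
    return out;
}
```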
BWC Pipelining: Hybrid BWC
• Patel et al. did all 3 steps on the GPU: 2.78x slowdown.
• Map each operation to the appropriate compute platform.
• The GPU does the sorts of BWT; the CPU does the sequential merge, MTF and Huffman.
• Pipeline the blocks so that CPU computation overlaps with GPU computation.
• Throughput of BWC = throughput of BWT, barring the first- and last-block offset.
[Figure: while the GPU runs the 2/3rd and 1/3rd sorts of block #k, the CPU runs the merge, MTF and Huffman of block #k-1]
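A minimal sketch of this overlap (hypothetical callbacks gpuSort and cpuMergeMtfHuff; not the thesis scheduler):

```cpp
#include <functional>
#include <future>

// The GPU sorts of block b run asynchronously while the CPU finishes
// merge + MTF + Huffman of block b-1, so CPU work hides behind GPU work.
void hybridBWC(int numBlocks,
               const std::function<void(int)>& gpuSort,
               const std::function<void(int)>& cpuMergeMtfHuff) {
    if (numBlocks <= 0) return;
    std::future<void> pending;
    for (int b = 0; b < numBlocks; ++b) {
        if (pending.valid()) pending.get();                    // sorts of block b-1 done
        pending = std::async(std::launch::async, gpuSort, b);  // start sorts of block b
        if (b > 0) cpuMergeMtfHuff(b - 1);                     // CPU stage of b-1 overlaps
    }
    pending.get();
    cpuMergeMtfHuff(numBlocks - 1);
}
```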
BWC Pipelining: All-Core BWC
• The system is made of CoSts:
  • The GPU with its controlling CPU thread is a CoSt.
  • The other CPU cores are CoSts.
• Blocks are split across CoSts, dequeued from a work queue.
• The GPU CoSt runs Hybrid BWC.
• Each CPU CoSt runs the best CPU BWC, by Seward (i.e. bzip2).
[Figure: input blocks enter a FIFO work queue; CoSts atomically dequeue work items, execute them in parallel, and enqueue results to an output queue]
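A minimal sketch of the atomic dequeue (my own version using a shared counter; the thesis uses the FIFO work queue shown above):

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// One worker per CoSt (CPU cores plus the thread driving the GPU) grabs the next block
// index from a shared atomic counter until all blocks are consumed. costs[i] is that
// CoSt's block compressor (GPU CoSt: Hybrid BWC; CPU CoSt: bzip2-style BWC).
void allCoreBWC(int numBlocks, const std::vector<std::function<void(int)>>& costs) {
    std::atomic<int> next{0};
    std::vector<std::thread> pool;
    for (const auto& compressBlock : costs)
        pool.emplace_back([&, compressBlock] {
            for (int b = next.fetch_add(1); b < numBlocks; b = next.fetch_add(1))
                compressBlock(b);
        });
    for (auto& t : pool) t.join();
}
```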
Results: Hybrid BWC
• Compression ratio improves with an increase in block size.
• GPU runtime is better than the CPU's for larger blocks.
• GPU runtime improves with perturbation; CPU runtime stays the same.
• Compressed file size increases with perturbation, but is reasonable up to 0.1% (still smaller than the state-of-the-art).
• Runtime and compressed file size are better than the state-of-the-art (bzip2, 900KB blocks).
• Note: the CPU does much less work using 900KB blocks, while the GPU uses 9MB blocks.
Results: Hybrid BWC
[Chart: speedup of Hybrid BWC (9MB blocks) vs. CPU BWC with 900KB blocks and vs. CPU BWC with 9MB blocks, and % reduction in compressed file size (9MB blocks, right y-axis), for enwik8 (96MB), wiki-xml (151MB), linux (199MB) and silesia.tar (203MB)]
• GPU BWC by Patel et al. was 2.78x slower than the CPU.
• Our Hybrid BWC (using CPU BWC as a proxy) is nearly 5 times faster than Patel et al.
Results: All-Core BWC, High-end
• Using CPU CoSts only: 3.06x speedup
• Using all CoSts (CPU and GPU): 4.87x speedup
Results: All-Core BWC, Low-end
• Using CPU CoSts only: 1.22x speedup
• Using all CoSts (CPU and GPU): 1.67x speedup
• Though this GPU is slower, our load balancing still provides a speedup when using all resources.
Conclusions
• Developed data-parallel algorithms for difficult problems with sequentiality and irregularity.
• Developed techniques for efficient use of hybrid CPU and GPU systems.
• 3-4x speedup with Coarse Parallel FSD, 10x speedup using GPU FSD.
• The FSD techniques can be applied to several dynamic programming problems.
• Our string sort outperforms the state-of-the-art significantly and adapts to future GPUs.
• Speedup for the first time on BW compression using GPUs.
• Pipelining and work sharing, and their benefits on BWC and FSD respectively, should motivate developers to build fast end-to-end applications for CPU+GPU systems.
Related Publications
• Aditya Deshpande, Ishan Misra and P J Narayanan. Hybrid Implementation of Error Diffusion Dithering. Proceedings of IEEE International Conference on High Performance Computing, Dec 2011, Bangalore, India.
• Aditya Deshpande and P J Narayanan. Can GPUs Sort Strings Efficiently? Proceedings of IEEE International Conference on High Performance Computing, Dec 2013, Bangalore, India. (Best GPU Paper Award)
• Aditya Deshpande and P J Narayanan. Fast Burrows Wheeler Compression using CPU and GPU. ACM Transactions on Parallel Computing. (Submitted April 2014, under review)
Thank you. Questions?
All codes will be available for download at http://cvit.iiit.ac.in/ or http://web.iiit.ac.in/~adity.deshapandeug08 (CVIT/personal webpage).
Please contact [email protected] for more details.
We thank the 'Indo-Israeli Project' by the Department of Science and Technology for partial financial support of this work.