Floating-Point Data Compression at 75 Gb/s on a GPU
Molly A. O’Neil and Martin Burtscher
Department of Computer Science
Introduction
Scientific simulations on HPC clusters
Run on interconnected compute nodes
Produce and transfer lots of floating-point data
Data storage and transfer are expensive and slow
Compute nodes have multiple cores but only one link
Interconnects are getting faster
Lonestar (Texas Advanced Computing Center): 40 Gb/s InfiniBand
Speeds of up to 100 Gb/s soon
Floating-Point Data Compression at 75 Gb/s on a GPU
March 2011
Introduction (cont.)
Compression
Reduced storage, faster transfer
Only useful when done in real time
Saturate network with compressed data
Requires compressor tailored to hardware capabilities
GFC algorithm for IEEE 754 double-precision data
Designed specifically for GPU hardware (CUDA)
Provides a reasonable compression ratio and operates above the throughput of emerging networks
Lossless Data Compression
Dictionary-based (Lempel-Ziv family) [gzip, lzop]
Variable-length entropy coders (Huffman, arithmetic coding)
Run-length encoding [fax]
Transforms (Burrows-Wheeler) [bzip2]
Special-purpose FP compressors [FPC, FSD, PLMI]
Prediction and leading-zero suppression
None of these offer real-time speeds for state-of-the-art networks
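The "prediction and leading-zero suppression" idea behind the special-purpose FP compressors can be sketched on the CPU as follows. This is a minimal illustration, not the actual FPC/GFC code; the function names are hypothetical:

```python
import struct

def encode_double(val, pred):
    """XOR the IEEE 754 bit patterns of the value and its prediction,
    then drop the leading zero bytes of the residual. A good prediction
    shares the high-order bytes, so only a short tail must be stored."""
    v = struct.unpack("<Q", struct.pack("<d", val))[0]
    p = struct.unpack("<Q", struct.pack("<d", pred))[0]
    residual = (v ^ p).to_bytes(8, "big")
    tail = residual.lstrip(b"\x00")
    return 8 - len(tail), tail          # (leading-zero-byte count, payload)

def decode_double(pred, lz, tail):
    """Invert the encoding: re-expand the residual bytes and XOR with
    the same prediction the encoder used."""
    r = int.from_bytes(b"\x00" * lz + tail, "big")
    p = struct.unpack("<Q", struct.pack("<d", pred))[0]
    return struct.unpack("<d", struct.pack("<Q", p ^ r))[0]
```

An exact prediction suppresses all eight bytes, leaving only the small per-value leading-zero count to store.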
GFC Algorithm
GPUs require 1000s of parallel activities, but compression is generally a serial operation
Divide the data into n chunks, processed in parallel
Best performance: choose n to match the maximum number of resident warps
Each chunk is composed of 32-word subchunks
One double per warp thread
Use the previous subchunk to provide prediction values
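The chunking scheme above can be simulated on the host. This sketch (a hypothetical helper, not GFC's actual code) computes the 32-word subchunk offsets each warp would iterate over:

```python
WORDS_PER_SUBCHUNK = 32   # one double per thread of a 32-thread warp

def chunk_schedule(n_words, n_chunks):
    """Split the data into n_chunks runs of whole 32-word subchunks and
    return, per chunk, the word offsets of the subchunks its warp
    processes in order. Earlier chunks absorb any remainder."""
    n_sub = (n_words + WORDS_PER_SUBCHUNK - 1) // WORDS_PER_SUBCHUNK
    counts = [n_sub // n_chunks + (i < n_sub % n_chunks)
              for i in range(n_chunks)]
    schedule, start = [], 0
    for c in counts:
        schedule.append([(start + s) * WORDS_PER_SUBCHUNK for s in range(c)])
        start += c
    return schedule
```

Within each chunk, subchunk i can use subchunk i - 1 as its prediction source, so warps never need to coordinate with one another.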
Dimensionality
Many scientific data sets display dimensionality
Interleaved coordinates from multiple dimensions
Optional dimensionality parameter to GFC determines the index of the previous subchunk to use as the prediction
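The benefit of matching the prediction stride to the data's dimensionality can be shown with a small CPU experiment (hypothetical code, not from the paper): on interleaved coordinates, predicting each value from the element one full record back yields much smaller residuals than predicting from the immediately preceding element.

```python
def mean_abs_residual(values, stride):
    """Mean absolute difference between each value and the value
    'stride' positions earlier -- a proxy for how compressible the
    residuals are after prediction at that stride."""
    diffs = [abs(values[i] - values[i - stride])
             for i in range(stride, len(values))]
    return sum(diffs) / len(diffs)

# Interleaved (x, y, z) records: each coordinate drifts smoothly.
samples = []
for t in range(100):
    samples += [1000.0 + t, 2000.0 + 2.0 * t, 3000.0 + 3.0 * t]
```

Here a stride of 3 (the dimensionality) compares each coordinate against its own dimension's previous sample, while a stride of 1 compares across dimensions and produces residuals three orders of magnitude larger.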
GFC Algorithm (cont.)
GPU Optimizations
Low thread divergence (few if statements)
Some short enough to be predicated
Coalesce memory accesses by packing/unpacking data in shared memory (for CC < 2.0)
Very little inter-thread communication and synchronization
Prefix sum only
Warp-based implementation
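The single prefix-sum step can be sketched as follows (a hypothetical CPU stand-in for the warp-level scan): each thread reports the compressed size of its value, and an exclusive prefix sum turns those sizes into write offsets in the chunk's shared output buffer.

```python
def exclusive_prefix_sum(sizes):
    """Return (offsets, total): offsets[i] is where thread i writes its
    compressed bytes, and total is the size of the chunk's whole output."""
    offsets, running = [], 0
    for s in sizes:
        offsets.append(running)
        running += s
    return offsets, running
```

Because this is the only inter-thread step and it stays within one warp, GFC needs no block-wide barriers.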
Evaluation Method
Systems
Two quad-core 2.53 GHz Xeons
NVIDIA FX 5800 GPU (CC 1.3)
13 datasets: real-world data (19 – 277 MB)
Observational data, simulation results, MPI messages
Comparisons
Compression ratio vs. 5 compressors in common use
Throughput vs. pFPC (fastest known CPU compressor)
Compression Ratio
Harmonic mean compression ratio: 1.188 (range: 1.01 – 3.53)
Low (FP data), but in line with other algorithms
Largely independent of the number of chunks
When done in real time, compression at this ratio can greatly speed up MPI apps: 3% – 98% speed-up [Ke et al., SC’04]
[Chart: harmonic mean compression ratio (1.00 – 1.25) for bzip2, gzip, lzop, FPC, pFPC, and GFC]
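The harmonic mean used for these ratios can be computed directly (a standard formula, not code from the paper):

```python
def harmonic_mean(ratios):
    """Harmonic mean of per-dataset compression ratios; unlike the
    arithmetic mean, it is not inflated by a single highly
    compressible dataset."""
    return len(ratios) / sum(1.0 / r for r in ratios)
```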
Throughput
Compression: 75 – 87 Gb/s (harmonic mean: 77.9 Gb/s)
Decompression: 90 – 121 Gb/s (harmonic mean: 96.6 Gb/s)
4x faster than pFPC on 8 cores (2 CPUs)
Improvement over pFPC’s compression ratio vs. performance trend
[Chart: harmonic mean throughput (Gb/s) vs. harmonic mean compression ratio (1.15 – 1.40) for GFC and pFPC, compression and decompression]
NEW: Fermi Throughput
Fermi improvements:
Faster, simpler memory accesses
Hardware support for count-leading-zeros op
Compression ratio: 1.187
Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s)
Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s)
Compresses over 9.5x faster than pFPC on 8 x86 cores
[Chart: per-dataset compression and decompression throughput (0 – 200+ Gb/s)]
Summary
GFC algorithm
Chunks up the data; each warp processes a chunk iteratively in 32-word subchunks
No communication required between warps
Minimum 75 Gb/s – 90 Gb/s (encode – decode) throughput on GTX-285, and 119 Gb/s – 169 Gb/s on Fermi, with a compression ratio of 1.19
CUDA source code is freely available at
http://www.cs.txstate.edu/~burtscher/research/GFC/
Conclusions
The GPU can compress much faster than the PCIe bus can transfer the data
But…
PCIe bus will become faster
CPU-GPU increasingly on a single die
GPU-to-GPU and GPU-to-NIC transfers coming?
GFC is the first compressor with the potential to deliver real-time FP data compression for current and emerging network speeds