Floating-Point Data Compression at 75 Gb/s on a GPU

Molly A. O’Neil and Martin Burtscher
Department of Computer Science, Texas State University
Introduction
 Scientific simulations on HPC clusters
 Run on interconnected compute nodes
 Produce and transfer lots of floating-point data
 Data storage and transfer are expensive and slow
 Compute nodes have multiple cores but only one link
 Interconnects are getting faster
 Lonestar (Texas Advanced Computing Center): 40 Gb/s InfiniBand
 Speeds of up to 100 Gb/s coming soon
Introduction (cont.)
 Compression
 Reduced storage, faster transfer
 Only useful when done in real time
 Saturate the network with compressed data
 Requires a compressor tailored to the hardware's capabilities

 GFC algorithm for IEEE 754 double-precision data
 Designed specifically for GPU hardware (CUDA)
 Provides a reasonable compression ratio and operates above the throughput of emerging networks
Lossless Data Compression
 Dictionary-based (Lempel-Ziv family) [gzip, lzop]
 Variable-length entropy coders (Huffman, arithmetic coding)
 Run-length encoding [fax]
 Transforms (Burrows-Wheeler) [bzip2]
 Special-purpose FP compressors [FPC, FSD, PLMI]
 Prediction and leading-zero suppression
 None of these offers real-time speeds for state-of-the-art networks
GFC Algorithm
 GPUs require 1000s of parallel activities, but compression is a generally serial operation
 Divide the data into n chunks, processed in parallel
 Best performance: choose n to match the maximum number of resident warps
 Each chunk is composed of 32-word subchunks
 One double per warp thread
 Use the previous subchunk to provide prediction values (sketched below)
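A minimal CUDA sketch of this chunked layout (an illustration under stated assumptions, not the released GFC code; the kernel name, the integer view of the doubles, and the all-zero prediction for the first subchunk are assumptions): each warp owns one chunk, lane i handles word i of every 32-word subchunk, and each value is predicted by the value one subchunk back.

#include <cstdint>

#define WARP_SIZE 32

// Each warp strides through its chunk one 32-double subchunk at a
// time. A lane's value from the previous iteration sits exactly one
// subchunk back at the same lane position, so it serves as the
// predictor. The grid is assumed to supply one warp per chunk.
__global__ void gfcDeltaSketch(const uint64_t *in, uint64_t *residual,
                               int doublesPerChunk)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane = threadIdx.x % WARP_SIZE;
    const uint64_t *chunk = in + (size_t)warp * doublesPerChunk;
    uint64_t *out = residual + (size_t)warp * doublesPerChunk;

    uint64_t prev = 0;                  // no history yet: predict zero
    for (int i = lane; i < doublesPerChunk; i += WARP_SIZE) {
        uint64_t cur = chunk[i];
        out[i] = cur - prev;            // residual vs. previous subchunk
        prev = cur;                     // becomes next subchunk's predictor
    }
}

The encoder then suppresses the leading zero bytes of each residual, in the spirit of the special-purpose FP compressors listed earlier; warps never need to exchange data, which is what lets the scheme scale across the GPU.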
Dimensionality
 Many scientific data sets display dimensionality
 Interleaved coordinates from multiple dimensions
 Optional dimensionality parameter to GFC
 Determines the index of the previous subchunk to use as the prediction (illustrated below)
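A hypothetical helper showing how such a dimensionality parameter could select the predictor (the names and the zero fallback are assumptions, not the released code): with dimensionality dim, lane i of subchunk s is predicted from lane i of subchunk s - dim. Because the reach-back distance of dim * 32 words is a multiple of dim, interleaved x, y, z records always predict a coordinate from the previous value of the same coordinate.

// Selects the prediction value for lane 'lane' of subchunk 'subchunk'
// within one chunk; reaching back 'dim' subchunks keeps same-dimension
// values paired. Returns zero when there is no history yet.
__device__ uint64_t prediction(const uint64_t *chunk, int subchunk,
                               int lane, int dim)
{
    int p = subchunk - dim;        // reach back 'dim' subchunks
    if (p < 0) return 0;           // start of chunk: predict zero
    return chunk[p * 32 + lane];   // same lane, matching dimension
}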
GPU Optimizations
 Low thread divergence (few if statements)
 Some branches short enough to be predicated
 Coalesce memory accesses by packing/unpacking data in shared memory (for CC < 2.0)
 Very little inter-thread communication and synchronization
 Prefix sum only (sketched after this list)
 Warp-based implementation
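The prefix sum can stay entirely inside a warp. Below is a minimal sketch of a warp-level inclusive scan of the kind needed to turn per-value compressed sizes into output offsets (the function and variable names are assumptions); it uses the warp-synchronous volatile-shared-memory idiom that works on compute capability 1.3, which predates the shuffle intrinsics.

// Inclusive prefix sum across the 32 lanes of one warp; 's' points to
// 32 ints of shared memory reserved for this warp. Within a warp the
// lanes execute in lockstep, so no __syncthreads is needed.
__device__ int warpInclusiveScan(volatile int *s, int lane, int value)
{
    s[lane] = value;
    if (lane >= 1)  { value += s[lane - 1];  s[lane] = value; }
    if (lane >= 2)  { value += s[lane - 2];  s[lane] = value; }
    if (lane >= 4)  { value += s[lane - 4];  s[lane] = value; }
    if (lane >= 8)  { value += s[lane - 8];  s[lane] = value; }
    if (lane >= 16) { value += s[lane - 16]; s[lane] = value; }
    return value;   // sum of this lane's and all lower lanes' values
}

Each lane would pass in the byte count of its compressed value; the scan result tells it where to write within the subchunk's output.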
Evaluation Method
 Systems
 Two quad-core 2.53 GHz Xeons
 NVIDIA FX 5800 GPU (CC 1.3)
 13 real-world datasets (19 – 277 MB)
 Observational data, simulation results, MPI messages
 Comparisons
 Compression ratio vs. 5 compressors in common use
 Throughput vs. pFPC (fastest known CPU compressor)
Compression Ratio
 GFC harmonic-mean compression ratio: 1.188 (range: 1.01 – 3.53)
 Low (FP data), but in line with other algorithms
 Largely independent of the number of chunks
 When done in real time, compression at this ratio can greatly speed up MPI apps
 3% – 98% speed-up [Ke et al., SC’04]
[Chart: harmonic-mean compression ratio (1.00 – 1.25) of bzip2, gzip, lzop, FPC, pFPC, and GFC]
Throughput
 Compression: 75 – 87 Gb/s (harmonic mean: 77.9 Gb/s)
 Decompression: 90 – 121 Gb/s (harmonic mean: 96.6 Gb/s)
 4x faster than pFPC on 8 cores (2 CPUs)
 Improvement over pFPC’s compression-ratio vs. performance trend
[Chart: harmonic-mean throughput (Gb/s) vs. harmonic-mean compression ratio for GFC and pFPC, compression and decompression]
NEW: Fermi Throughput
 Fermi improvements:
 Faster, simpler memory accesses
 Hardware support for a count-leading-zeros op (see the sketch below)
 Compression ratio: 1.187
 Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s)
 Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s)
 Compresses over 9.5x faster than pFPC on 8 x86 cores
[Chart: Fermi compression and decompression throughput (Gb/s)]
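A hypothetical one-liner showing why hardware count-leading-zeros helps (the helper name is an assumption; __clzll is the real CUDA intrinsic): the leading-zero byte count of a residual, which the encoder must compute for every value, collapses to a single instruction plus a shift.

// __clzll counts the leading zero bits of a 64-bit value (returning
// 64 for zero), so shifting right by 3 yields 0 to 8 leading zero bytes.
__device__ int leadingZeroBytes(unsigned long long residual)
{
    return __clzll((long long)residual) >> 3;
}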
Summary
 GFC algorithm
 Chunks up the data; each warp processes a chunk iteratively in 32-word subchunks
 No communication required between warps
 Minimum throughput of 75 Gb/s (compression) and 90 Gb/s (decompression) on a GTX 285, and 119 Gb/s and 169 Gb/s on Fermi, at a compression ratio of 1.19
 CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/
Conclusions
 GPU can compress much faster than the PCIe bus can transfer the data
 But…
 PCIe bus will become faster
 CPU and GPU increasingly on a single die
 GPU-to-GPU and GPU-to-NIC transfers coming?
 GFC is the first compressor with the potential to deliver real-time FP data compression for current and emerging network speeds