A GPU Accelerated Storage System
Abdullah Gharaibeh
with: Samer Al-Kiswany
Sathish Gopalakrishnan
Matei Ripeanu
NetSysLab
The University of British Columbia
1
GPUs radically change the cost landscape
[Figure: cost comparison, $600 vs. $1279 (source: CUDA Guide)]
2
Harnessing GPU Power is Challenging
more complex programming model
limited memory space
accelerator / co-processor model
3
Motivating Question:
Does the 10x reduction in computation costs GPUs offer
change the way we design/implement distributed systems?
Context:
Distributed Storage Systems
4
Computationally Intensive Operations in Distributed Systems
Techniques: similarity detection, content addressability, security, integrity checks, redundancy, load balancing, summary cache, storage efficiency
Enabling operations: hashing, erasure coding, encryption/decryption, membership testing (Bloom filter), compression
These operations are computationally intensive and limit performance.
5
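One of the operations above, membership testing, can be illustrated with a minimal Bloom filter: a summary cache uses a test like this to cheaply check whether a block might already be stored before doing a remote lookup. This is an illustrative sketch only (the class, its parameters, and the SHA-1-based hashing are assumptions, not the system's implementation).

```python
# Minimal Bloom filter sketch for membership testing (illustrative only).
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item: bytes):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"block-42")
print(bf.might_contain(b"block-42"))   # True
print(bf.might_contain(b"block-999"))  # almost certainly False
```

The trade-off is the one that makes Bloom filters attractive here: a fixed, small memory footprint in exchange for a tunable false-positive rate.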
Distributed Storage System Architecture
[Diagram: the application layer accesses the system through an FS API; a metadata manager and storage nodes serve the client; files are divided into a stream of blocks. Techniques to improve performance/reliability (redundancy, integrity checks, similarity detection, security) rely on enabling operations (compression, encryption/decryption, hashing, encoding/decoding), which are offloaded from the CPU layer to the GPU.]
6
Contributions:
A GPU accelerated storage system:
Design and prototype implementation that integrates
similarity detection and GPU support
End-to-end system evaluation:
2x throughput improvement for a realistic
checkpointing workload
7
Challenges
Integration challenges:
– Minimizing the integration effort
– Transparency
– Separation of concerns
Extracting major performance gains:
– Hiding memory allocation overheads
– Hiding data transfer overheads
– Efficient utilization of the GPU memory units
– Use of multi-GPU systems
[Diagram: files divided into a stream of blocks pass through similarity detection and hashing to an offloading layer on the GPU.]
8
Past Work: Hashing on GPUs
HashGPU¹: a library that exploits GPUs to support specialized use of hashing in distributed storage systems
One performance data point: accelerates hashing by up to 5x compared to a single-core CPU
However, significant speedup is achieved only for large blocks (>16MB) => not suitable for efficient similarity detection
[Diagram: a stream of blocks is hashed by HashGPU on the GPU.]
¹ "Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems," S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08
9
Profiling HashGPU
At least 75% of the total time is overhead
Amortizing memory allocation and overlapping data transfers and computation may bring important benefits
10
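The 75% figure bounds what these optimizations can buy: if allocation and transfer overheads were fully amortized or hidden, per-block time would shrink to the remaining 25%, i.e., at most a 4x improvement. A simple model (not a measurement from the slides):

```python
# Upper bound on the gain from hiding overheads, Amdahl-style model.
overhead_fraction = 0.75          # allocation + transfer share of total time (from the profile)
compute_fraction = 1.0 - overhead_fraction

# If all overhead is amortized/overlapped, only compute time remains on the critical path.
max_speedup = 1.0 / compute_fraction
print(max_speedup)  # 4.0
```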
CrystalGPU
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations
One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude
[Diagram: files divided into a stream of blocks flow through similarity detection and HashGPU to CrystalGPU, the offloading layer on top of the GPU.]
11
CrystalGPU Opportunities and Enablers
Opportunity: reusing GPU memory buffers
– Enabler: a high-level memory manager
Opportunity: overlapping communication and computation
– Enabler: double buffering and asynchronous kernel launch
Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
– Enabler: a task queue manager
[Diagram: CrystalGPU (memory manager, task queue, double buffering) sits between HashGPU/similarity detection and the GPU as the offloading layer.]
12
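The double-buffering idea can be sketched on the CPU side: while one buffer is being processed, the next block is staged into a second buffer, so transfer and computation overlap. This is a minimal Python sketch of the concept only; the real system uses CUDA streams and asynchronous copies, and the function names here (`transfer`, `compute`, `pipeline`) are hypothetical stand-ins.

```python
# Double-buffering sketch: overlap staging of block i+1 with work on block i.
import threading
import queue

def transfer(block):
    # Stands in for a host-to-GPU copy.
    return block * 2

def compute(staged):
    # Stands in for the GPU kernel.
    return staged + 1

def pipeline(blocks):
    staged = queue.Queue(maxsize=1)  # one in-flight buffer => double buffering
    results = []

    def producer():
        for b in blocks:
            staged.put(transfer(b))  # stage the next block while the consumer computes
        staged.put(None)             # sentinel: no more blocks

    t = threading.Thread(target=producer)
    t.start()
    while (item := staged.get()) is not None:
        results.append(compute(item))
    t.join()
    return results

print(pipeline([1, 2, 3]))  # [3, 5, 7]
```

The bounded queue is what makes this double buffering rather than unbounded prefetching: at most one staged block waits while another is processed.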
Experimental Evaluation:
CrystalGPU evaluation
End-to-end system evaluation
13
CrystalGPU Evaluation
Testbed: a machine with
– CPU: Intel quad-core 2.66 GHz with PCI Express 2.0 x16 bus
– GPU: NVIDIA GeForce dual-GPU 9800GX2
Experiment space: HashGPU/CrystalGPU vs. original HashGPU, with three optimizations:
– Buffer reuse
– Overlap of communication and computation
– Exploiting the two GPUs
[Diagram: files divided into a stream of blocks flow through HashGPU and CrystalGPU to the GPU.]
14
HashGPU Performance on Top of CrystalGPU
Baseline: CPU single core
The gains enabled by the three optimizations can be realized!
15
End-to-End System Evaluation
Testbed
– Four storage nodes and one metadata server
– One client with 9800GX2 GPU
Three implementations
– No similarity detection (without-SD)
– Similarity detection
• on CPU (4 cores @ 2.6GHz) (SD-CPU)
• on GPU (9800 GX2) (SD-GPU)
Three workloads
– Real checkpointing workload
– Completely similar files: all possible gains in terms of data saving
– Completely different files: only overheads, no gains
Success metrics:
– System throughput
– Impact on a competing application: compute or I/O intensive
16
System Throughput (Checkpointing Workload)
1.8x improvement
The integrated system preserves the throughput gains on a realistic workload!
17
System Throughput (Synthetic Workload of Similar Files)
Room for 2x improvement
Offloading to the GPU enables close to optimal performance!
18
Impact on Competing (Compute-Intensive) Application
Writing checkpoints back to back
2x improvement
7% reduction
Frees resources (CPU) for competing applications while preserving throughput gains!
19
Summary
We present the design and implementation of a distributed
storage system that integrates GPU power
We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications
We empirically demonstrate that employing the GPU enables close-to-optimal system performance
We shed light on the impact of GPU offloading on competing
applications running on the same node
20
netsyslab.ece.ubc.ca
21
Similarity Detection
File A consists of blocks X, Y, Z; File B consists of blocks W, Y, Z
Only the first block is different
Hashing each file's blocks detects the shared blocks, potentially improving write throughput
22
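The File A / File B example above can be sketched as per-block hashing: hash fixed-size blocks and write only blocks whose hashes are not already stored. This is a simplified illustration (SHA-1 via `hashlib` on the CPU; in the system the hashes are computed on the GPU, and block sizes are much larger than the 4 bytes used here).

```python
# Similarity detection sketch: only blocks with unseen hashes need to be written.
import hashlib

BLOCK_SIZE = 4  # toy block size for illustration

def block_hashes(data: bytes):
    # Hash each fixed-size block of the input.
    return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

file_a = b"XXXXYYYYZZZZ"   # blocks X, Y, Z
file_b = b"WWWWYYYYZZZZ"   # blocks W, Y, Z: only the first block differs

stored = set(block_hashes(file_a))          # hashes already on the storage nodes
new_blocks = [h for h in block_hashes(file_b) if h not in stored]
print(len(new_blocks))  # 1 -- only block W must be transferred
```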
Execution Path on GPU – Data Processing Application
1. Preprocessing (memory allocation)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
23
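The formula sums five sequential stages. With buffer reuse (amortizing pre/postprocessing across blocks) and transfer/compute overlap, the steady-state per-block cost approaches the slowest remaining stage. A simple timing model with hypothetical numbers:

```python
# Illustrative timing model for the five-stage execution path (hypothetical ms).
t_pre, t_h2g, t_proc, t_g2h, t_post = 2.0, 3.0, 4.0, 3.0, 1.0

# Sequential: every stage is on the critical path for every block.
t_total_sequential = t_pre + t_h2g + t_proc + t_g2h + t_post

# With buffer reuse (pre/post amortized away) and overlapped transfers,
# steady-state per-block time approaches the slowest pipelined stage.
t_total_pipelined = max(t_h2g, t_proc, t_g2h)

print(t_total_sequential)  # 13.0
print(t_total_pipelined)   # 4.0
```

Under this model the pipeline is bottlenecked by GPU processing, which is the desirable regime: the accelerator, not the plumbing, sets the throughput.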