
A GPU Accelerated Storage System
Abdullah Gharaibeh
with: Samer Al-Kiswany
Sathish Gopalakrishnan
Matei Ripeanu
NetSysLab
The University of British Columbia
1
GPUs radically change the cost landscape
[Chart: hardware cost comparison, $600 vs. $1279 (source: CUDA Guide)]
2
Harnessing GPU Power is Challenging
- more complex programming model
- limited memory space
- accelerator / co-processor model
3
Motivating Question:
Does the 10x reduction in computation costs GPUs offer
change the way we design/implement distributed systems?
Context:
Distributed Storage Systems
4
Distributed Systems: Computationally Intensive Operations

Operations -> Techniques they enable:
Hashing -> similarity detection, content addressability, integrity checks, load balancing
Erasure coding -> redundancy
Encryption/decryption -> security
Membership testing (Bloom filter) -> summary cache
Compression -> storage efficiency

These operations are computationally intensive and can limit performance.
5
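To make one of these operations concrete, here is a minimal host-side sketch of membership testing with a Bloom filter, the structure behind a summary cache. The two hash functions, the filter size, and the keys are arbitrary choices for the sketch, not taken from the system described in the talk:

```cuda
// Minimal Bloom filter sketch: a compact bit array that answers
// "was this key inserted?" with possible false positives but no
// false negatives. Hash functions and sizes are arbitrary here.
#include <cstdio>
#include <bitset>
#include <string>

const size_t M = 1 << 16;                 // filter size in bits (assumed)
std::bitset<M> bits;

static size_t h1(const std::string &s) {  // FNV-1a
    size_t h = 1469598103934665603ULL;
    for (char c : s) { h ^= (unsigned char)c; h *= 1099511628211ULL; }
    return h % M;
}
static size_t h2(const std::string &s) {  // djb2
    size_t h = 5381;
    for (char c : s) h = h * 33 + (unsigned char)c;
    return h % M;
}

void add(const std::string &key) { bits.set(h1(key)); bits.set(h2(key)); }

// May return a false positive, never a false negative.
bool maybe_contains(const std::string &key) {
    return bits.test(h1(key)) && bits.test(h2(key));
}

int main() {
    add("block-42"); add("block-7");
    printf("block-42: %d\n", maybe_contains("block-42"));  // 1
    printf("block-99: %d\n", maybe_contains("block-99"));  // almost surely 0
    return 0;
}
```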
Distributed Storage System Architecture

[Architecture diagram: an application on the client uses the FS API; files are divided into a stream of blocks by the access module, which talks to a metadata manager and storage nodes. Techniques to improve performance/reliability (redundancy, integrity checks, similarity detection, security) rest on enabling operations (compression, encryption/decryption, hashing, encoding/decoding), which can be offloaded from the CPU layer to the GPU.]
Contributions:
- A GPU accelerated storage system: design and prototype implementation that integrates similarity detection and GPU support
- End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload
7
Challenges
- Integration challenges
  - Minimizing the integration effort
  - Transparency
  - Separation of concerns
- Extracting major performance gains
  - Hiding memory allocation overheads
  - Hiding data transfer overheads
  - Efficient utilization of the GPU memory units
  - Use of multi-GPU systems
[Diagram: files divided into a stream of blocks; similarity detection built on hashing, offloaded through an offloading layer to the GPU]
8
Past Work: Hashing on GPUs
HashGPU [1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems
[Diagram: a stream of blocks hashed by HashGPU on the GPU]
One performance data point: accelerates hashing by up to 5x compared to a single-core CPU
However, significant speedup is achieved only for large blocks (>16MB) => not suitable for efficient similarity detection
[1] "Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems", S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08
9
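For intuition only, here is a minimal sketch of the block-parallel structure such a library exposes: one GPU thread hashes one data block. It is not HashGPU's implementation (the HPDC '08 work targets cryptographic hashes such as MD5 and SHA1); a toy FNV-1a hash and assumed block sizes stand in to keep the sketch short:

```cuda
// Sketch: one thread hashes one fixed-size block with toy FNV-1a.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void hash_blocks(const unsigned char *data, size_t block_size,
                            int num_blocks, unsigned long long *digests) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= num_blocks) return;
    const unsigned char *p = data + (size_t)b * block_size;
    unsigned long long h = 1469598103934665603ULL;   // FNV offset basis
    for (size_t i = 0; i < block_size; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;                       // FNV prime
    }
    digests[b] = h;
}

int main() {
    const size_t block_size = 4096;                  // assumed block size
    const int    num_blocks = 1024;
    const size_t bytes = block_size * num_blocks;

    unsigned char *h_data = (unsigned char *)malloc(bytes);
    for (size_t i = 0; i < bytes; ++i) h_data[i] = (unsigned char)(i * 31);

    unsigned char *d_data; unsigned long long *d_digests;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_digests, num_blocks * sizeof(unsigned long long));
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    hash_blocks<<<(num_blocks + 255) / 256, 256>>>(d_data, block_size,
                                                   num_blocks, d_digests);

    unsigned long long digests[8];
    cudaMemcpy(digests, d_digests, sizeof(digests), cudaMemcpyDeviceToHost);
    printf("digest[0] = %llx\n", digests[0]);
    cudaFree(d_data); cudaFree(d_digests); free(h_data);
    return 0;
}
```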
Profiling HashGPU
[Profile breakdown: at least 75% of HashGPU's execution time is overhead]
Amortizing memory allocation and overlapping data transfers and computation may bring important benefits
10
CrystalGPU
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations
[Diagram: files divided into a stream of blocks flow through similarity detection, HashGPU, and CrystalGPU (the offloading layer) down to the GPU]
One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude
11
CrystalGPU Opportunities and Enablers
- Opportunity: reusing GPU memory buffers
  Enabler: a high-level memory manager
- Opportunity: overlapping communication and computation
  Enabler: double buffering and asynchronous kernel launch
- Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
  Enabler: a task queue manager
[Diagram: CrystalGPU offloading layer with memory manager, task queue, and double buffering between HashGPU/similarity detection and the GPU]
12
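A minimal CUDA sketch of the first two enablers, under assumed buffer sizes and a placeholder kernel: pinned host buffers are allocated once and recycled across tasks (buffer reuse), and two streams pipeline transfers against kernel execution (double buffering with asynchronous launches). Multi-GPU dispatch via a task queue is only hinted at in a comment:

```cuda
// Sketch of buffer reuse + double buffering with two CUDA streams.
// A task queue manager would additionally pick a device per task
// (cudaSetDevice) on a multi-GPU machine; not shown here.
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {          // stand-in kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, tasks = 8;
    const size_t bytes = n * sizeof(float);

    float *h_buf[2], *d_buf[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {                     // allocate once, reuse
        cudaHostAlloc(&h_buf[i], bytes, cudaHostAllocDefault); // pinned
        cudaMalloc(&d_buf[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    for (int t = 0; t < tasks; ++t) {
        int i = t % 2;                                // double buffering
        cudaStreamSynchronize(s[i]);                  // wait: buffer idle?
        // ... fill h_buf[i] with the next data block here ...
        cudaMemcpyAsync(d_buf[i], h_buf[i], bytes,
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(n + 255) / 256, 256, 0, s[i]>>>(d_buf[i], n);
        cudaMemcpyAsync(h_buf[i], d_buf[i], bytes,
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                          // drain both streams

    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h_buf[i]); cudaFree(d_buf[i]); cudaStreamDestroy(s[i]);
    }
    return 0;
}
```

With this shape, the copy-in of one task overlaps the kernel of the previous one, which is what hides the transfer overheads identified in the profiling slide.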
Experimental Evaluation:
- CrystalGPU evaluation
- End-to-end system evaluation
13
CrystalGPU Evaluation
Testbed: a machine with
- CPU: Intel quad-core 2.66 GHz with PCI Express 2.0 x16 bus
- GPU: NVIDIA GeForce 9800 GX2 (dual-GPU)
Experiment space:
- HashGPU/CrystalGPU vs. original HashGPU
- Three optimizations:
  - buffer reuse
  - overlapping communication and computation
  - exploiting the two GPUs
[Diagram: files divided into a stream of blocks; HashGPU on top of CrystalGPU on top of the GPU]
14
HashGPU Performance on top of CrystalGPU
[Chart: hashing speedup vs. block size; baseline: single-core CPU]
The gains enabled by the three optimizations can be realized!
15
End-to-End System Evaluation
- Testbed
  – Four storage nodes and one metadata server
  – One client with a 9800 GX2 GPU
- Three implementations
  – No similarity detection (without-SD)
  – Similarity detection
    • on CPU (4 cores @ 2.6GHz) (SD-CPU)
    • on GPU (9800 GX2) (SD-GPU)
- Three workloads
  – Real checkpointing workload
  – Completely similar files: all possible gains in terms of data saving
  – Completely different files: only overheads, no gains
- Success metrics
  – System throughput
  – Impact on a competing application: compute- or I/O-intensive
16
System Throughput (Checkpointing Workload)
[Chart: 1.8x throughput improvement]
The integrated system preserves the throughput gains on a realistic workload!
17
System Throughput (Synthetic Workload of Similar Files)
[Chart: room for 2x improvement]
Offloading to the GPU enables close-to-optimal performance!
18
Impact on Competing (Compute-Intensive) Application
Writing checkpoints back-to-back
[Chart: 2x throughput improvement; 7% reduction in the competing application's performance]
Frees resources (CPU) for competing applications while preserving throughput gains!
19
Summary
- We present the design and implementation of a distributed storage system that integrates GPU power
- We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications
- We empirically demonstrate that employing the GPU enables close-to-optimal system performance
- We shed light on the impact of GPU offloading on competing applications running on the same node
20
netsyslab.ece.ubc.ca
21
Similarity Detection
[Diagram: File A hashes to blocks X, Y, Z; File B hashes to W, Y, Z. Only the first block is different, potentially improving write throughput.]
22
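A host-side sketch of the comparison step, with made-up digest values: once per-block hashes are available (computed on the GPU in the real system), detecting similarity reduces to comparing digests and writing only the blocks that changed:

```cuda
// Sketch: compare per-block digests of a new file version against
// the stored version; identical blocks are referenced, not rewritten.
// Digest values below are made up for illustration.
#include <cstdio>
#include <cstdint>
#include <vector>

int main() {
    // File A (stored: X, Y, Z) and File B (new: W, Y, Z).
    std::vector<uint64_t> stored = {0xAAAA, 0xBBBB, 0xCCCC};  // X Y Z
    std::vector<uint64_t> fresh  = {0xDDDD, 0xBBBB, 0xCCCC};  // W Y Z

    size_t written = 0;
    for (size_t b = 0; b < fresh.size(); ++b) {
        bool changed = b >= stored.size() || fresh[b] != stored[b];
        if (changed) {
            printf("block %zu differs -> write it\n", b);
            ++written;
        } else {
            printf("block %zu identical -> reference stored copy\n", b);
        }
    }
    printf("%zu of %zu blocks written\n", written, fresh.size());
    return 0;
}
```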
Execution Path on GPU – Data Processing Application
1. Preprocessing (memory allocation)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing

T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
23
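A sketch of how the five terms could be measured with CUDA events on the default stream; the stand-in kernel, the data size, and the use of allocation/deallocation as the pre/post stages are assumptions for illustration, not the system's instrumentation:

```cuda
// Sketch: time the five stages of T_Total with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(char *d, size_t n) {          // stage 3 stand-in
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] ^= 0x5A;
}

static float elapsed(cudaEvent_t a, cudaEvent_t b) {
    float ms; cudaEventElapsedTime(&ms, a, b); return ms;
}

int main() {
    const size_t n = 1 << 24;
    cudaEvent_t e[6];
    for (int i = 0; i < 6; ++i) cudaEventCreate(&e[i]);
    char *h, *d;

    cudaEventRecord(e[0]);
    cudaHostAlloc(&h, n, cudaHostAllocDefault);       // 1. preprocessing
    cudaMalloc(&d, n);                                //    (allocation)
    cudaEventRecord(e[1]);
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);      // 2. transfer in
    cudaEventRecord(e[2]);
    process<<<(int)((n + 255) / 256), 256>>>(d, n);   // 3. GPU processing
    cudaEventRecord(e[3]);
    cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost);      // 4. transfer out
    cudaEventRecord(e[4]);
    cudaFree(d); cudaFreeHost(h);                     // 5. postprocessing
    cudaEventRecord(e[5]);
    cudaEventSynchronize(e[5]);

    const char *name[5] = {"Preprocessing", "DataHtoG", "Processing",
                           "DataGtoH", "PostProc"};
    float total = 0;
    for (int i = 0; i < 5; ++i) {
        float ms = elapsed(e[i], e[i + 1]);
        total += ms;
        printf("T_%s = %.2f ms\n", name[i], ms);
    }
    printf("T_Total = %.2f ms\n", total);
    return 0;
}
```

Measured this way, the allocation and transfer terms dominate for small blocks, which is exactly the overhead CrystalGPU's buffer reuse and double buffering are designed to hide.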