Transcript Slide 1

Revisiting Co-Processing for
Hash Joins on the Coupled CPU-GPU Architecture
Jiong He, Mian Lu, Bingsheng He
School of Computer Engineering
Nanyang Technological University
27th Aug 2013
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
Importance of Hash Joins
• In-memory databases
– Enable GBs or even TBs of data to reside in main memory
(e.g., on large-memory commodity servers)
– Are a hot research topic
• Hash joins
– The most efficient join algorithm in main memory
databases
– Focus: simple hash joins (SHJ, ICDE 2004) and
partitioned hash joins (PHJ, VLDB 1999)
Hash Joins on New Architectures
• Emerging hardware
– Multi-core CPUs (8-core, 16-core, even many-core)
– Massively parallel GPUs (NVIDIA, AMD, Intel, etc.)
• Query co-processing on new hardware
– On multi-core CPUs: SIGMOD’11 (S. Blanas), …
– On GPUs: SIGMOD’08 (B. He), VLDB’09 (C. Kim), …
– On Cell: ICDE’07 (K. Ross), …
Bottlenecks
• Conventional query co-processing is inefficient
– Data transfer overhead via PCI-e
– Imbalanced workload distribution
– Light-weight workload: create context, send and receive data via PCI-e, launch the GPU program, post-processing
– Heavy-weight workload: all real computations
[Figure: discrete architecture: CPU and GPU each with a private cache, connected via PCI-e; the CPU uses main memory, the GPU uses device memory]
The Coupled Architecture
• Coupled CPU-GPU architecture
– Intel Sandy Bridge, AMD Fusion APU, etc.
• New opportunities
– Remove the data transfer overhead
– Enable fine-grained workload scheduling
– Increase cache reuse
[Figure: coupled architecture: CPU and GPU share the last-level cache and main memory]
Challenges Come with Opportunities
• Efficient data sharing
– Share main memory
– Share Last-level cache (LLC)
• Keep both processors busy
– The GPU cannot dominate the performance
– Assign suitable tasks for maximum speedup
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
Fine-Grained Definition of Steps for Co-Processing
• Hash join consists of three stages (partition,
build and probe)
• Each stage consists of multiple steps (take build
as example)
– b1: compute the hash bucket number
– b2: access the hash bucket header
– b3: search the key list
– b4: insert the tuple
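The four build steps above can be sketched in plain Python (a minimal illustration with an assumed bucket count and chained key lists, not the paper's OpenCL implementation):

```python
# Minimal sketch of the build stage of a simple hash join, decomposed
# into the four steps named on the slide. Illustrative Python only;
# NUM_BUCKETS and the tuple layout (key, record-ID) are assumptions.

NUM_BUCKETS = 1024

def build(relation_r):
    """relation_r: list of (key, rid) pairs -> hash table of bucket lists."""
    hash_table = [[] for _ in range(NUM_BUCKETS)]
    for key, rid in relation_r:
        b = hash(key) % NUM_BUCKETS        # b1: compute the hash bucket
        bucket = hash_table[b]             # b2: access the hash bucket header
        entry = None                       # b3: search the key list
        for e in bucket:
            if e[0] == key:
                entry = e
                break
        if entry is None:                  # b4: insert the tuple
            bucket.append((key, [rid]))
        else:
            entry[1].append(rid)
    return hash_table

table = build([(1, "a"), (1, "b"), (1025, "c")])  # keys 1 and 1025 collide
```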
Co-Processing Mechanisms
• We study the following three kinds of co-processing mechanisms
– Off-loading (OL)
– Data-dividing (DD)
– Pipeline (PL)
• With the fine-grained step definition of hash
joins, we can easily implement algorithms with
those co-processing mechanisms
Off-loading (OL)
• Method: Offload the whole step to one device
• Advantage: Easy to schedule
• Disadvantage: Imbalance
[Figure: OL: each whole step is offloaded to one device (CPU or GPU)]
Data-dividing (DD)
• Method: Partition the input at stage level
• Advantage: Easy to schedule, no imbalance
• Disadvantage: Devices are underutilized
[Figure: DD: the input is divided between CPU and GPU at stage level]
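Data-dividing can be sketched as a single split of the input, after which each device runs every step of the stage on its own part (the 0.3 CPU share and the stage function below are illustrative assumptions):

```python
# Sketch of data-dividing (DD): the input is partitioned once, at stage
# level; each device then runs the whole stage on its share. Here both
# "devices" are plain Python calls; cpu_share = 0.3 is an assumed ratio.

def data_divide(tuples, cpu_share):
    cut = int(len(tuples) * cpu_share)
    return tuples[:cut], tuples[cut:]          # (CPU part, GPU part)

def run_stage(stage_fn, tuples, cpu_share):
    cpu_part, gpu_part = data_divide(tuples, cpu_share)
    # Each device executes all steps of the stage on its own part.
    return stage_fn(cpu_part) + stage_fn(gpu_part)

result = run_stage(lambda part: [t + 1 for t in part], list(range(100)), 0.3)
```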
Pipeline (PL)
• Method: Partition the input at step level
• Advantage: Balanced, devices are fully utilized
• Disadvantage: Hard to schedule
[Figure: PL: the input is divided between CPU and GPU at step level]
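Pipelining, by contrast, can be sketched as splitting the input of every individual step by its own ratio, so both devices cooperate within each step (the ratios and the step function are illustrative assumptions):

```python
# Sketch of pipelined co-processing (PL): each step's input is divided
# between CPU and GPU by a per-step ratio. The ratios dict and the step
# function are assumptions; real execution would overlap the devices.

def run_step(step_fn, tuples, cpu_share):
    cut = int(len(tuples) * cpu_share)
    out_cpu = [step_fn(t) for t in tuples[:cut]]   # CPU's share of this step
    out_gpu = [step_fn(t) for t in tuples[cut:]]   # GPU's share (sequential here)
    return out_cpu + out_gpu

# Assumed per-step CPU shares, e.g. the GPU takes most of hashing (b1).
ratios = {"b1": 0.1, "b2": 0.5, "b3": 0.5, "b4": 0.5}
out = run_step(lambda t: t * 2, list(range(10)), ratios["b1"])
```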
Determining Suitable Ratios for PL is Challenging
• Workload preferences of the CPU and the GPU vary
• Computation type and amount of memory access differ
across steps
• Delay across steps should be minimized to achieve a
global optimum
Cost Model
• Abstract model for CPU/GPU
• Estimate data transfer costs, memory access
costs and execution costs
• With the cost model, we can
– Estimate the elapsed time
– Choose the optimal workload ratios
More details can be found in our paper.
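As a toy illustration of how such a model can choose a ratio: for a single step with linear per-tuple costs and no transfer or delay terms (a strong simplification of the paper's model), both devices finish together when the CPU share r satisfies r*c_cpu = (1-r)*c_gpu:

```python
# Toy cost-model sketch (not the paper's full model, which also covers
# data transfer and memory access costs). c_cpu and c_gpu are per-tuple
# unit costs in ns; the closed form balances the two devices.

def optimal_cpu_share(c_cpu, c_gpu):
    # r * c_cpu == (1 - r) * c_gpu  =>  r = c_gpu / (c_cpu + c_gpu)
    return c_gpu / (c_cpu + c_gpu)

def elapsed(n, r, c_cpu, c_gpu):
    """Elapsed time when the CPU processes a fraction r of n tuples."""
    return max(r * n * c_cpu, (1 - r) * n * c_gpu)

r = optimal_cpu_share(c_cpu=10.0, c_gpu=5.0)   # GPU twice as fast here
```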
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
System Setup
• System configurations

        # cores   Core frequency (GHz)   Zero copy buffer (MB)   Local memory (KB)   Cache (MB)
  CPU   4         3.0                    512                     32                  4
  GPU   400       0.6                    512                     32                  4

• Data sets
– R and S relations with 16M tuples each
– Two attributes in each tuple: (key, record-ID)
– Data skew: uniform, low skew and high skew
Discrete vs. Coupled Architecture
• In the discrete architecture:
– data transfer takes 4%~10%
– merge takes 14%~18%
• The coupled architecture outperforms the discrete one
by 5%~21% across all variants
[Figure: elapsed time (s) of SHJ-DD, SHJ-OL, PHJ-DD and PHJ-OL on the discrete vs. the coupled architecture, broken down into data-transfer, merge, partition, build and probe; relative improvements of 5.1%, 6.2%, 15.3% and 21.5%]
Fine-grained vs. Coarse-grained
• For SHJ, PL outperforms OL & DD by 38% and 27%
• For PHJ, PL outperforms OL & DD by 39% and 23%
[Figure: elapsed time (s) of OL (GPU-only), DD and PL (fine-grained) for SHJ and PHJ, annotated with the 38%/27% (SHJ) and 39%/23% (PHJ) improvements of PL]
Unit Costs in Different Steps
• Unit cost represents the average processing
time of one tuple for one device in one step
• Costs vary heavily across steps and between the two
devices
[Figure: elapsed time per tuple (ns) on the CPU and the GPU for steps pr1–pr3 (partition), b1–b4 (build) and p1–p4 (probe)]
Ratios Derived from Cost Model
• Ratios across steps are different
– In the first step of all three stages, the GPU should
take most of the work (i.e., hashing)
• Workload dividing is fine-grained at the step level
Other Findings
• Results on skewed data
• Results on input with varying size
• Evaluations on some design tradeoffs, etc.
More details can be found in our paper.
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
Conclusions
• Implement hash joins on the discrete and the
coupled CPU-GPU architectures
• Propose a generic cost model to guide the fine-grained tuning for optimal performance
• Evaluate some design tradeoffs to make hash
join better exploit the hardware power
• The first systematic study of hash join co-processing on the emerging coupled CPU-GPU architecture
Future Work
• Design a full-fledged query processor
• Extend the fine-grained design methodology to
other applications on the coupled CPU-GPU
architecture
Acknowledgement
• Thank Dr. Qiong Luo and Ong Zhong Liang for
their valuable comments
• This work is partly supported by a MoE AcRF Tier 2
grant (MOE2012-T2-2-067) in Singapore and an
Interdisciplinary Strategic Competitive Fund of Nanyang
Technological University 2011 for “C3: Cloud-Assisted
Green Computing at NTU Campus”
Questions?