Debunking the 100X GPU vs. CPU
Myth: An Evaluation of Throughput
Computing on CPU and GPU
Presented by: Ahmad Lashgar
ECE Department, University of Tehran
Parallel Processing Seminar. Instructor: Dr. Fakhraie
29 Dec 11
ISCA 2010
Original authors: Victor W. Lee et al., Intel Corporation
Some slides are included from the original paper for educational purposes only.
Abstract
• Is the GPU the silver bullet of parallel computing?
• How large is the gap between peak and achievable performance?
Overview
• Abstract
• Architecture
– CPU: Intel Core i7
– GPU: NVIDIA GTX280
• Implications for throughput computing applications
• Methodology
• Results
• Analyzing the results
• Platform optimization guides
• Conclusion
Architecture (1)
• Intel Core i7-960
– 4 cores, 3.2 GHz
– 2-way multi-threading (SMT)
– 4-wide superscalar issue
– L1 32KB, L2 256KB, L3 8MB (shared)
– 32 GB/s memory bandwidth
[DIXON’2010]
Architecture (2)
• NVIDIA GTX280
– 30 cores (SMs), 1.3 GHz
– 1024-way multi-threading
– 8-wide SIMD
– 16KB software-managed cache (shared memory)
– 141 GB/s memory bandwidth
[LINDHOLM’2008]
Architecture (3)
                          Core i7-960        GTX280
Cores                     4                  30
Frequency (GHz)           3.2                1.3
Transistors               0.7B (263 mm²)     1.4B (576 mm²)
Memory bandwidth (GB/s)   32                 141
SP SIMD width             4                  8
DP SIMD width             2                  1
Peak SP scalar GFLOPS     25.6               116.6
Peak SP SIMD GFLOPS       102.4              311.1 (933.1)
Peak DP SIMD GFLOPS       51.2               77.8

Note: numbers shown in red on the original slide are not the paper authors' figures.
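As a sanity check on the table above, the peak figures follow from cores × clock × SIMD width × flops per cycle. A hedged reconstruction (my arithmetic, using the GTX280's exact 1.296 GHz shader clock; the 311.1 vs. 933.1 figures appear to differ only in how many flops are credited per lane per cycle):

\begin{align*}
\text{i7-960 SP scalar} &= 4 \times 3.2\,\text{GHz} \times 2\ (\text{add}+\text{mul}) = 25.6\ \text{GFLOPS}\\
\text{i7-960 SP SIMD} &= 25.6 \times 4\ (\text{4-wide SSE}) = 102.4\ \text{GFLOPS}\\
\text{GTX280 SP SIMD} &= 30 \times 8 \times 1.296\,\text{GHz} \approx 311.1\ \text{GFLOPS}\\
\text{GTX280 SP (MAD+MUL)} &= 311.1 \times 3 \approx 933.1\ \text{GFLOPS}\\
\text{GTX280 DP} &= 30 \times 1.296\,\text{GHz} \times 2\ (\text{FMA}) \approx 77.8\ \text{GFLOPS}
\end{align*}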
Implications for throughput computing
applications
1. Number of cores difference
2. Cache size / multi-threading
3. Bandwidth difference
1. Number of cores difference
• It is all about the core complexity:
– The common goal: Improving pipeline efficiency
– CPU goal: Single-thread performance
• Exploiting ILP
• Sophisticated branch predictor
• Multiple-issue logic
– GPU goal: Throughput
• Interleaving hundreds of threads
2. Cache size/multi-threading
• CPU goal: reducing memory latency
– Programmer-transparent data caching
• Increasing the cache size to capture the working set
– Prefetching (HW/SW)
• GPU goal: hiding memory latency
– Interleave the execution of hundreds of threads so they hide each other's latency
• Note the crossover:
– The CPU also uses multi-threading for latency hiding
– The GPU also uses software-controlled caching (shared memory) to reduce memory latency
3. Bandwidth difference
• Bandwidth versus latency
• CPU goal: single-thread performance
– Workloads do not demand many memory accesses
– Bring the data in as soon as possible (minimize latency)
• GPU goal: throughput
– There are lots of memory accesses, so provide high bandwidth
– Latency does not matter; the cores will hide it!
Methodology (1)
• Hardware
– Intel Core i7-960, 6GB DRAM, GTX280 1GB
• Software
– SUSE Linux Enterprise 11
– CUDA Toolkit 2.3
Methodology (2)
• Optimizations
– On CPU:
• SGEMM, SpMV, and FFT from Intel MKL 10
• Always 2 threads per core
– On GPU:
• Best available algorithms for SpMV, FFT, and MC
• Often 128 to 256 threads per core (to balance shared-memory and register-file usage)
– GPU execution is interleaved with host-device (HD/DH) memory transfers where possible, as sketched below
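The transfer/compute interleaving in the last bullet maps onto CUDA streams, already available in CUDA Toolkit 2.3. A minimal sketch, assuming a one-in/one-out kernel (the hypothetical process) and a host buffer h allocated with cudaMallocHost so the async copies can actually overlap; names and chunk sizes are illustrative, not the paper's code:

#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                       // placeholder computation
}

// Split the work into chunks; chunk k+1's host-to-device copy can overlap
// chunk k's kernel because they are queued in different streams.
void run_chunked(float *h, float *d, int n, int chunks) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    int c = n / chunks;                            // assumes chunks divides n
    for (int k = 0; k < chunks; ++k) {
        cudaStream_t st = s[k & 1];                // alternate streams
        float *hp = h + k * c, *dp = d + k * c;
        cudaMemcpyAsync(dp, hp, c * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<(c + 255) / 256, 256, 0, st>>>(dp, c);
        cudaMemcpyAsync(hp, dp, c * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);                   // drain both streams
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}

Note that the GTX280 has a single copy engine, so copies overlap kernels but not each other.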
Results
• The HD/DH data transfer time is not considered
• The GPU is only 2.5X faster on average
– Far from the 100X reported by previous studies
Where did the speedups in previous studies come from?!
• Which CPU and GPU are compared?
• How much optimization is performed on the CPU and on the GPU?
– Studies that optimized both platforms reported much lower speedups (as this paper does)
Analyzing the results (1)
1. Bandwidth
2. Compute flops (single precision)
3. Compute flops (double precision)
4. Reduction and synchronization
5. Fixed function
Analyzing the results (2)
1. Bandwidth
– Peak ratio: GTX280 / Core i7-960 ≈ 4.7X
– Feature: large working set; performance is bounded by bandwidth
– Examples:
• SAXPY (5.3X; sketched below)
• LBM (5X)
• SpMV (1.9X)
– The CPU benefits from caching
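SAXPY makes the bandwidth bound concrete: each element moves 12 bytes (two loads, one store) for only 2 flops, so throughput tracks memory bandwidth almost exactly. A minimal CUDA kernel, illustrative rather than the authors' code:

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];    // 2 flops per 12 bytes of traffic
}

At 141 GB/s this caps the GTX280 near 141/12 × 2 ≈ 23.5 GFLOPS, far below its compute peak, which is why the speedup lands near the 4.7X bandwidth ratio rather than any flops ratio.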
Analyzing the results (3)
2. Compute flops (single precision)
– Peak ratio: GTX280 / Core i7-960 ≈ 3X
– Feature: bounded by computation; benefits from more cores
– Examples:
• SGEMM, Conv, and FFT (2.8-4X)
Analyzing the results (4)
3. Compute flops (double precision)
– Peak ratio: GTX280 / Core i7-960 ≈ 1.5X
– Feature: bounded by computation; benefits from more cores
– Examples:
• MC (1.8X)
• Blitz (5X)
– Uses transcendental operations
• Sort (1.25X slower)
– Due to reduced SIMD-width usage
– Depends on scalar performance
Analyzing the results (5)
4. Reduction and synchronization
– Feature: the more threads, the higher the synchronization overhead
– Examples:
• Hist (1.8X; see the atomic-update sketch below)
– On the CPU, 28% of the time is spent on atomic operations
– On the GPU, the atomic operations are much slower
• Solv (1.9X slower)
– Requires multiple kernel launches to preserve cache coherency on the GPU
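The Hist result comes down to the atomic-update pattern. A hedged sketch of the core kernel (the bin count and input type are illustrative, not the paper's code):

#define NBINS 256   // bins[] is assumed to hold NBINS zero-initialized counters

__global__ void hist(const unsigned char *in, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // conflicting updates serialize
}

Threads that hit the same bin serialize on the atomic on both platforms, which is why neither side pulls far ahead.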
Analyzing the results (6)
5. Fixed function
– Feature: interpolation, texturing, and transcendental operations are a bonus on the GPU
– Examples:
• Bilat (5.7X; see the intrinsic sketch below)
– On the CPU, 66% of the time is spent on transcendental operations
• GJK (14.9X)
– Uses texture lookups
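The transcendental advantage comes from the GPU's special function units (SFUs), reachable through intrinsics such as __expf. A hedged sketch of a bilateral-filter-style weight (the function and parameter names are mine; the paper does not show Bilat's code):

__device__ float bilat_weight(float spatial_d2, float range_d2,
                              float inv_2sigma_s2, float inv_2sigma_r2) {
    // __expf runs on the SFU at high throughput (reduced precision);
    // a CPU evaluates exp() in software over many instructions instead
    return __expf(-spatial_d2 * inv_2sigma_s2) *
           __expf(-range_d2 * inv_2sigma_r2);
}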
Platform optimization guides
• CPU programmers have long relied on increasing clock frequency
• Their applications often do not benefit from TLP and DLP
• Today's CPUs use wider SIMD units, which sit idle if not exploited by the programmer (or compiler)
• This paper showed that careful multi-threading can close much of the gap
– For LBM, from 114X down to 5X
• Let's learn some optimization tips from the authors
CPU optimization
• Scalability (4X):
– Scale the kernel with the number of threads
• Blocking (5X):
– Be aware of the cache hierarchy and use it efficiently (see the sketch after this slide)
• Regularizing (1.5X):
– Lay out data regularly and aligned to take advantage of SIMD
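A minimal host-side sketch of the blocking tip, applied to a matrix-multiply loop nest (illustrative, not the paper's SGEMM; BS is an assumed tile size chosen so roughly 3·BS² floats fit in cache):

const int BS = 64;   // assumed tile size; tune to L1/L2 capacity

// Blocked SGEMM-style loop nest: each BS x BS tile of A, B, and C is
// reused out of cache instead of being re-fetched from DRAM.
// Assumes n is a multiple of BS and C starts zeroed.
void sgemm_blocked(int n, const float *A, const float *B, float *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS; ++i)
                    for (int k = kk; k < kk + BS; ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + BS; ++j)      // unit stride,
                            C[i * n + j] += a * B[k * n + j];   // SIMD-friendly
                    }
}

The unit-stride inner loop doubles as the regularizing tip: contiguous, aligned data is what lets the compiler vectorize it.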
GPU optimization
• Global synchronization
– Reduce the number of atomic operations
• Shared memory
– Use shared memory to reduce off-chip bandwidth demand
– Shared memory is multi-banked and is efficient for gather/scatter operations (see the sketch after this slide)
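Both tips appear in the classic shared-memory tiled transpose; a hedged sketch, not from the paper. Global reads and writes stay coalesced, while the awkward scatter happens in multi-banked shared memory, where it is cheap:

#define TILE 16

__global__ void transpose(const float *in, float *out, int w, int h) {
    // +1 padding keeps column accesses free of bank conflicts
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                  // swap tile coords
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}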
Conclusion
• This work analyzed the performance of important throughput computing kernels on a CPU and a GPU
– The gap is much smaller than previous reports claim (~2.5X)
• Recommendations for a throughput computing architecture:
– High compute throughput
– High memory bandwidth
– Large caches
– Gather/scatter support
– Efficient synchronization
– Fixed-function units
Thank you for your attention.
Any questions?
References
[LEE’2010] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA 2010.
[DIXON’2010] M. Dixon et al., "The Next-Generation Intel Core Microarchitecture," Intel Technology Journal, vol. 14, no. 3, 2010.
[LINDHOLM’2008] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008.