gpu-concurrency-management_kayiran_micro14


Transcript gpu-concurrency-management_kayiran_micro14

Managing GPU Concurrency in Heterogeneous Architectures

Onur Kayıran1  Nachiappan Chidambaram Nachiappan1  Adwait Jog1  Rachata Ausavarungnirun2  Mahmut T. Kandemir1  Gabriel H. Loh3  Onur Mutlu2  Chita R. Das1
1The Pennsylvania State University  2Carnegie Mellon University  3AMD Research
Heterogeneous Architectures
• Latency optimized cores and throughput optimized cores share the memory hierarchy.
• CPU cores: ROB, L1 caches, ALUs. GPU SIMT cores: CTA scheduler, warp scheduler, L1 caches, ALUs.
• Shared resources: interconnect, L2/LLC caches, DRAM.
• The warp scheduler controls GPU thread-level parallelism (TLP); see the sketch below.
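Every scheme on this poster works through one knob: how many warps the per-core warp scheduler may choose from. As a point of reference, here is a minimal C++ sketch of that knob, assuming a simple round-robin issue policy; the class and function names are illustrative, not the authors' simulator code.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of warp-limit-based TLP control (illustrative names only,
// not the authors' simulator code). The scheduler issues only from the first
// `active_warp_limit_` warps of a core; concurrency-management schemes such
// as CM-CPU / CM-BAL adjust that limit at run time.
class WarpScheduler {
public:
    explicit WarpScheduler(std::size_t max_warps)
        : ready_(max_warps, false), active_warp_limit_(max_warps) {}

    // Concurrency-management hook: cap how many warps are schedulable.
    void set_active_warp_limit(std::size_t n) {
        if (n < 1) n = 1;
        if (n > ready_.size()) n = ready_.size();
        active_warp_limit_ = n;
    }

    void set_ready(std::size_t warp_id, bool is_ready) { ready_[warp_id] = is_ready; }

    // Round-robin issue restricted to the first `active_warp_limit_` warps.
    // Returns the chosen warp id, or -1 on a scheduler stall cycle.
    int issue_one_cycle() {
        for (std::size_t i = 0; i < active_warp_limit_; ++i) {
            std::size_t id = (last_issued_ + 1 + i) % active_warp_limit_;
            if (ready_[id]) {
                last_issued_ = id;
                return static_cast<int>(id);
            }
        }
        return -1;  // no eligible warp: counted as a warp-scheduler stall
    }

private:
    std::vector<bool> ready_;        // per-warp "can issue this cycle" flag
    std::size_t active_warp_limit_;  // the TLP knob
    std::size_t last_issued_ = 0;
};
```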
Latency Tolerance of CPUs vs. GPUs
• GPU (SIMT) cores hide long memory latencies by multi-threading across many warps; CPU cores have much lower latency tolerance.
• Because of this disparity, the TLP level that is best for the GPU is not necessarily best for the CPUs sharing the memory hierarchy.
Effects of Application Interference
• CPU applications are affected significantly due to GPU interference: up to 85% (normalized CPU IPC vs. a noGPU baseline).
• GPU applications are affected moderately due to CPU interference: up to 20% (normalized GPU IPC vs. a noCPU baseline).
• 23% potential CPU improvements are available without significant performance loss for the GPU.
[Figure: Normalized CPU IPC (noGPU baseline) for omnetpp, perlbench, mcf and Normalized GPU IPC (noCPU baseline) for MM, PVR, KM.]
CPU-based Scheme: CM-CPU
• GPU TLP is reduced if memory or network congestion is high.
• Improves CPU performance.
• Might cause low latency tolerance for GPU cores.

Decision on the # of warps (memory congestion and network congestion each classified as L/M/H):

                 Network L    Network M    Network H
  Memory L       Increase     No change    Decrease
  Memory M       No change    No change    Decrease
  Memory H       Decrease     Decrease     Decrease

(See the sketch below.)
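Below is a small C++ sketch of the CM-CPU decision rule just described. The congestion inputs, threshold names, and one-warp step size are assumptions made for illustration; only the control flow (decrease the warp limit when either congestion metric is high, increase it when both are low, otherwise leave it alone) is taken from the bullets and table above.

```cpp
#include <algorithm>
#include <cstddef>

enum class Level { Low, Medium, High };

// Classify a raw congestion measurement into L/M/H using two thresholds.
// The threshold values and the metric itself are placeholders, not the
// paper's hardware-monitor definitions.
Level classify(double metric, double low_thresh, double high_thresh) {
    if (metric < low_thresh)  return Level::Low;
    if (metric > high_thresh) return Level::High;
    return Level::Medium;
}

// CM-CPU rule: if either memory or network congestion is High, reduce GPU TLP;
// if both are Low, raise it; otherwise keep the warp limit unchanged.
// The +/- 1 warp step is illustrative.
std::size_t cm_cpu_update(std::size_t active_warp_limit,
                          std::size_t max_warps,
                          Level mem_congestion,
                          Level net_congestion) {
    if (mem_congestion == Level::High || net_congestion == Level::High) {
        return std::max<std::size_t>(1, active_warp_limit - 1);   // Decrease
    }
    if (mem_congestion == Level::Low && net_congestion == Level::Low) {
        return std::min(max_warps, active_warp_limit + 1);        // Increase
    }
    return active_warp_limit;                                     // No change
}
```

The classification step mirrors the L/M/H buckets in the table; the actual metrics and threshold values in the paper come from hardware monitors and are not reproduced here.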
CPU-GPU Balanced Scheme: CM-BAL
• GPU warp scheduler (WS) stalls can be high due to: high memory congestion, or low latency tolerance caused by low TLP.
• GPU TLP is increased if GPU cores suffer from low latency tolerance.
• Provides balanced improvements.
• The CPU-GPU benefits trade-off can be controlled (see the sketch below).
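A sketch of a CM-BAL-style update follows. It behaves like the CM-CPU rule above except that, when warp-scheduler stall cycles are high and high memory congestion is not the apparent cause, it raises the warp limit instead. The stall-fraction metric, the way the two stall causes are distinguished, and the tunable threshold (standing in for the CM-BAL1..CM-BAL4 knob that controls the CPU-GPU trade-off) are assumptions for illustration, not the paper's exact mechanism.

```cpp
#include <algorithm>
#include <cstddef>

enum class Level { Low, Medium, High };  // congestion classification, as in the CM-CPU sketch

std::size_t cm_bal_update(std::size_t active_warp_limit,
                          std::size_t max_warps,
                          Level mem_congestion,
                          Level net_congestion,
                          double ws_stall_fraction,  // warp-scheduler stall cycles / epoch cycles
                          double stall_threshold) {  // trade-off knob (CM-BAL1..4 style)
    // CM-BAL addition: if the warp scheduler stalls heavily and memory
    // congestion is not the likely cause, the GPU core lacks latency
    // tolerance (TLP too low), so add warps.
    if (ws_stall_fraction > stall_threshold && mem_congestion != Level::High) {
        return std::min(max_warps, active_warp_limit + 1);
    }
    // Otherwise fall back to the CPU-friendly CM-CPU rule:
    // decrease on high congestion, increase when both metrics are low.
    if (mem_congestion == Level::High || net_congestion == Level::High) {
        return std::max<std::size_t>(1, active_warp_limit - 1);
    }
    if (mem_congestion == Level::Low && net_congestion == Level::Low) {
        return std::min(max_warps, active_warp_limit + 1);
    }
    return active_warp_limit;
}
```

In this sketch, a larger stall_threshold makes warp increases rarer and shifts the benefit toward CPU applications; that is meant to echo how the CM-BAL1 to CM-BAL4 variants trade CPU gains against GPU gains, not to reproduce their exact settings.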
Summary
• High GPU TLP causes memory and network congestion.
• High memory congestion degrades CPU performance.
• GPU cores can tolerate memory congestion due to multi-threading.
• The optimal TLP for CPUs and GPUs might be different due to the disparity between the latency tolerance of CPUs and GPUs.
Performance Benefits
• DYNCTA, an existing GPU-based technique, is effective for GPU performance.
• CM-CPU: improves CPU performance by 24%, but degrades GPU performance by 11%.
• CM-BAL1: balanced improvements for both CPUs and GPUs (7% CPU, 2% GPU).
• CM-BAL4: tuned to favor CPU applications (up to 19% CPU improvement).
[Figure: Normalized CPU IPC and Normalized GPU IPC for DYNCTA, CM-CPU, and CM-BAL1-CM-BAL4.]
                                     Improved GPU performance   Improved CPU performance
  Existing Works (e.g., DYNCTA)                ✓                           ×
  CPU-based Scheme (CM-CPU)                    ×                           ✓
  CPU-GPU Balanced Scheme (CM-BAL)             ✓                           ✓   + control the trade-off