
The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing
Manish Arora
Computer Science and Engineering
University of California, San Diego
From GPU to GPGPU
[Figure: a fixed-function GPU pipeline (Input Assembly, Vertex Processing, Geometry Processing, Frame Buffer Operations) alongside a GPGPU built from SMs with per-SM shared memory, an L2 cache, memory controllers, and off-chip memory]
- Widespread adoption (300M devices)
- First with NVIDIA Tesla in 2006-2007
Previous Generation Consumer Hardware (2006 - 2010)
[Figure: a multicore CPU with its cache hierarchy and last level cache, and a discrete GPGPU with SMs, shared memories, and L2, connected through a PCI bridge; each side has its own memory controller and off-chip memory]
Current Consumer Hardware (2011 - 2012)
[Figure: CPU cores and GPGPU SMs integrated on the same die, sharing an on-chip last level cache, a single memory controller, and one off-chip memory]
- Examples: Intel Sandy Bridge, AMD Fusion APUs
Our Goals Today
- Examine the current state of the art
- Trace the next steps of this evolution (major part)
- Lay out research opportunities
Outline: Next Generation CPU-GPU Architectures
- Part 1: Energy-efficient GPUs and chip-integrated CPU-GPU systems, driven by throughput applications, lower costs and overheads, and mixed GPGPU / CPU-only workloads
- Part 2: GPGPU evolution
- Part 3: Holistic optimizations (CPU core optimization, redundancy elimination)
- Part 4: Shared components
- Part 5: Opportunistic optimizations
- Part 6: Future work (tools, emerging technologies, power, temperature, reliability)
Part 1: Progression of GPGPU Architectures
[Outline diagram repeated, with the energy-efficient GPU and chip-integrated CPU-GPU system blocks highlighted]
GPGPUs - 1
- The fixed function graphics era (pre 2006)
  - Programmable vertex processors
  - Programmable pixel processors
  - Lots of fixed hardware blocks (assembly, geometry, z-culling, ...)
- Non-graphics processing was possible
  - Represent user work as graphics tasks and trick the graphics pipeline
  - Programming via graphics APIs
  - No hardware for bit-wise operations, no explicit branching, ...
- Imbalance in modern workloads motivated unification
- General purpose opportunity sensed by vendors
GPGPUs - 2
- The unified graphics and computing era (2006 - 2010)
  - Single programmable processor design
  - Explicit support for both graphics and computing
  - Computing-specific modifications (IEEE FP compliance and ECC)
- Non-graphics processing easy
  - High level programming (C, C++, Python etc.)
  - Significant customer market penetration
- Separate GPU and CPU memory spaces
  - Explicit GPU memory management required (see the sketch below)
- High overhead to process on the GPU
  - Memory transfers over PCI
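To make the separate memory spaces and explicit management concrete, here is a minimal CUDA sketch of my own (not from the talk): it allocates device memory, copies inputs across the PCI link, launches a vector-add kernel, and copies the result back. The kernel name and sizes are hypothetical.

#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                        // one element per thread
}

void runVecAdd(const float *hA, const float *hB, float *hC, int n) {
    float *dA, *dB, *dC;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes);                               // explicit device allocations
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);    // transfers over PCI
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);    // copy result back to host memory
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

Every byte the kernel touches has to be staged this way in this era, which is the "high overhead" the slide refers to.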
GPGPUs - 3
- Chip Integrated CPU-GPU era (2011 onwards)
  - Multicore CPU + GPGPU on the same die
  - Shared last level caches and memory controller
  - Shared main memory system
- Chip integration advantages
  - Lower total system costs
  - Shared hardware blocks improve utilization
  - Lower latency
  - Higher bandwidth
- Continued improvements in programmability
  - Standardization efforts (OpenCL and DirectCompute)
Contemporary GPU Architecture
(Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)
[Figure: a contemporary GPU built from many SMs connected through an on-chip interconnect to banked L2 / last level cache slices, multiple memory controllers, and off-chip DRAM; the CPU cores with their cache hierarchy, memory controller, and off-chip memory sit across a PCI bridge]
SM Architecture
(Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)
[Figure: a single SM with a warp scheduler, banked register file, operand buffering, SIMT lanes (ALUs, SFUs, memory and texture units), and a shared memory / L1 cache]
Multi-threading and Warp Scheduling
- Warp processing
  - 32 threads grouped and processed as a warp (see the sketch after this slide)
  - Single instruction fetched and issued per warp
- Hardware multithreading for latency hiding
  - Lots of active threads per SM (Fermi: 1536 threads in 48 warps)
  - Threads have dedicated registers (Fermi: 21 registers per thread)
  - Register state need not be copied or restored
  - Enables fast switching (potentially a new warp each cycle)
  - Threads processed in-order
  - Warps scheduled out-of-order

Example of Warp Scheduling
(Lindholm et al. IEEE Micro 2007)
[Figure: the SM multithreaded instruction scheduler issues, over time, Warp 1 Instruction 1, Warp 2 Instruction 1, Warp 3 Instruction 1, Warp 3 Instruction 2, ..., Warp 2 Instruction 2, Warp 1 Instruction 2]
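To make the warp grouping concrete, here is a small CUDA sketch of my own (not from the talk) that derives a thread's warp and lane from its indices, assuming the 32-thread warp size described above.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void showWarpMapping() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int warp = threadIdx.x / 32;                       // warp index within the block
    int lane = threadIdx.x % 32;                       // lane within the warp
    // All 32 threads of a warp execute the same fetched instruction together.
    if (lane == 0)
        printf("block %d, warp %d starts at global thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    showWarpMapping<<<2, 128>>>();   // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}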
Design for Efficiency and Scalability
(Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011)
- Amortized costs of instruction supply
  - Single instruction multiple thread model
  - No out-of-order processing
- Efficient data supply
  - Large register files
  - Managed locality (via shared memories)
- Lack of global structures
  - High utilization with hardware multithreading
- Biggest tradeoff: programmability
  - Exposed microarchitecture, frequent changes
  - Programmer has to manage data
Scalability
(Lee et al. ISCA 2010 / Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011, and other public sources)
- Double precision performance up 10x in 3 generations
- Memory structures growing slower than ALUs (22.5x)
- Memory bandwidth growing even slower (2.2x in 4 generations)
- Clearly favors workloads with high arithmetic intensity (see the note below)
- CPU performance gap increasing rapidly
  - Double precision performance gap from 2x to 9x
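For reference, and using the standard roofline-style definitions rather than anything from the talk, arithmetic intensity and the resulting bandwidth bound can be written as:

\[
\mathrm{AI} = \frac{\text{arithmetic operations}}{\text{bytes moved to/from memory}},
\qquad
\text{Attainable FLOP/s} \approx \min\bigl(\text{Peak FLOP/s},\ \mathrm{AI} \times \text{Memory bandwidth}\bigr)
\]

When ALU throughput scales faster than memory bandwidth, the arithmetic intensity needed to stay compute-bound rises, which is why the trends above favor high arithmetic intensity workloads.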
Part 2: GPGPU Evolution - Towards Better GPGPUs
[Outline diagram repeated, with the GPGPU evolution block highlighted]
Control-flow Divergence Losses
(Fung et al. Micro 2007)
[Figure: a warp executes Code A with full mask = 1111, reaches a divergent branch at the diverge point, serializes Path A (Ins 1, Ins 2) and Path B (Ins 1, Ins 2) under partial masks with low utilization, and returns to full mask = 1111 for Code B after the merge / converge point]
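A minimal CUDA sketch of my own (hypothetical kernel, not from the paper) of the kind of branch that triggers this serialization: odd and even lanes of the same warp take different paths, so the hardware runs the two paths back to back under partial masks.

__global__ void divergentKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        // Path A: executed first, with half of the warp's lanes masked off
        data[i] = data[i] * 2.0f;
    } else {
        // Path B: executed afterwards, with the other half of the lanes masked off
        data[i] = data[i] + 1.0f;
    }
    // Lanes reconverge here; the warp runs at full width again (Code B in the figure)
}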
Dynamic Warp Formation
(Fung et al. Micro 2007)
- Key insight: several warps are often at the same diverge point
- Combine threads from the same execution path dynamically, generating warps on the fly
- Example: the original scheme issues Warp 0: Path A, Warp 1: Path A, Warp 0: Path B, Warp 1: Path B; with DWF the two warps are merged into Warp 0+1: Path A and Warp 0+1: Path B (two new warps dynamically formed from the four original issue slots)
- 20.7% improvement at 4.7% area overhead
Dynamic Warp Formation Intricacies
(Fung et al. Micro 2007)
- Needs several warps at the same execution point
  - "Majority" warp scheduling policy
- Need for lane-awareness (see the sketch below)
  - Register files are banked, one bank per ALU lane
  - Spread out the threads of the dynamic warp across lanes
  - Simplifies the design
[Figure: register file accesses for static warps, for dynamic warps formed without lane awareness (bank conflicts), and for lane-aware dynamic warp formation; shaded cells denote the registers accessed in banks 1..N feeding ALUs 1..N]
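A rough host-side C++ model of my own (a sketch of the idea, not the paper's hardware) of lane-aware dynamic warp formation: threads waiting at the same PC are packed into new warps, but each thread may only occupy its home lane, so no register file bank is over-subscribed.

#include <cstdint>
#include <map>
#include <vector>

constexpr int WARP_SIZE = 32;

struct ThreadCtx { int tid; int lane; uint64_t pc; };   // lane = tid % WARP_SIZE

// Pack threads stalled at the same PC into dynamic warps, at most one thread per lane.
std::vector<std::vector<ThreadCtx>> formDynamicWarps(const std::vector<ThreadCtx>& ready) {
    std::map<uint64_t, std::vector<std::vector<ThreadCtx>>> forming;  // pc -> partially filled warps
    std::map<uint64_t, std::vector<uint32_t>> laneMasks;              // lane occupancy per forming warp

    for (const ThreadCtx& t : ready) {
        auto& warps = forming[t.pc];
        auto& masks = laneMasks[t.pc];
        // Find the first forming warp whose slot for this thread's home lane is still free.
        size_t w = 0;
        while (w < warps.size() && (masks[w] & (1u << t.lane))) ++w;
        if (w == warps.size()) {                       // no free slot: open a new warp
            warps.emplace_back();
            masks.push_back(0);
        }
        warps[w].push_back(t);
        masks[w] |= (1u << t.lane);
    }

    // Issue order is a policy choice; the paper favors the "majority" PC first.
    std::vector<std::vector<ThreadCtx>> result;
    for (auto& [pc, warps] : forming)
        for (auto& w : warps) result.push_back(std::move(w));
    return result;
}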
Large Warp Microarchitecture
(Narasiman et al. Micro 2011)
- Similar idea of generating dynamic warps, but differs in the creation method
- Machine organized as warps bigger than the SIMT width (the "large warp")
- Dynamically create SIMT-width warps from within the large warp
[Figure: at T = 0, 1, 2, 3 the large warp's activity mask is scanned column by column and the active bits are repacked into successive SIMT-width sub-warps]
Two-level Scheduling
(Narasiman et al. Micro 2011)
- Typical warp scheduling scheme: round robin
  - Beneficial because it exploits data locality across warps
  - But all warps tend to reach long latency operations at the same time
  - Cannot hide latency because everyone is waiting
- Solution: group warps into several sets (sketched below)
  - Schedule warps within a single set round robin, so data locality is still exploited
  - Switch to another set when all warps of a set hit long latency operations
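A toy scheduler model of my own (a host-side C++ sketch, not the paper's implementation) of the two-level policy: round robin inside the active group, with a switch to the next group once every warp in the current one is stalled on a long-latency operation.

#include <algorithm>
#include <vector>

struct Warp { bool stalled = false; };   // stalled on a long-latency operation

class TwoLevelScheduler {
public:
    TwoLevelScheduler(std::vector<Warp>* warps, int groupSize)
        : warps_(warps), groupSize_(groupSize) {}

    // Returns the index of the warp to issue from, or -1 if every warp is stalled.
    int pickWarp() {
        int numGroups = (int)((warps_->size() + groupSize_ - 1) / groupSize_);
        for (int g = 0; g < numGroups; ++g) {
            int group = (activeGroup_ + g) % numGroups;
            int begin = group * groupSize_;
            int end = std::min<int>(begin + groupSize_, (int)warps_->size());
            // Round robin within the group, starting after the last issued warp.
            for (int i = 0; i < end - begin; ++i) {
                int w = begin + (lastIssued_ + 1 + i) % (end - begin);
                if (!(*warps_)[w].stalled) {
                    activeGroup_ = group;           // stay on this group while it has ready warps
                    lastIssued_ = w - begin;
                    return w;
                }
            }
            // Whole group stalled: fall through and try the next group.
        }
        return -1;
    }

private:
    std::vector<Warp>* warps_;
    int groupSize_;
    int activeGroup_ = 0;
    int lastIssued_ = -1;
};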
Dynamic Warps vs. Large Warp + 2-Level Scheduling
(Fung et al. Micro 2007 vs. Narasiman et al. Micro 2011)
- Dynamic warp formation gives better performance than the large warp alone
  - More opportunities to form warps: all warps vs. the large warp size
- Large warp + 2-level scheduling is better than dynamic warp formation
- 2-level scheduling can be applied together with dynamic warp formation
Part 3: Holistic Optimizations - Holistically Optimized CPU Designs
[Outline diagram repeated, with the holistic optimization blocks (CPU core optimization, redundancy elimination) highlighted]
Motivation to Rethink CPU Design
(Arora et al., in submission to IEEE Micro 2012)
- Heterogeneity works best when each composing core runs a subset of codes well (Kumar et al. PACT 2006)
- The GPGPU is already an example of this
- The CPU need not be fully general-purpose
  - Sufficient to optimize it for non-GPU code
  - The CPU undergoes a "holistic optimization"
- Code expected to run on the CPU is very different
  - We start by investigating the properties of this code
Benchmarks
- Took important computing applications and partitioned them over the CPU and GPU
- Partitioning knowledge mostly based on expert information
  - Either used publicly available source code or details from publications
  - Performed our own CUDA implementations for 3 benchmarks
- Also used serial and parallel programs with no known GPU implementations as CPU-only workloads
- Total of 11 CPU-heavy, 11 mixed and 11 GPU-heavy benchmarks
Methodology
- Used a combination of two techniques
  - Real machine measurements
  - PIN based simulators
  - Inserted start-end functions based on partitioning information
- Branches categorized into 4 categories (see the sketch below)
  - Biased (same direction), patterned (95% accuracy on a local predictor), correlated (95% accuracy on gshare), hard (remaining)
- Loads and stores characterized into 4 categories
  - Static (same address), strided (95% accuracy on a stride prefetcher), patterned (95% accuracy on a Markov predictor), hard (remaining)
- Thread level parallelism is the speedup on a 32-core machine
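As a rough stand-in for the PIN-based tooling, here is a simplified C++ sketch of my own of the branch categorization: a small local-history predictor and a gshare predictor run over a (pc, taken) trace, and each static branch is classified with the 95% thresholds listed above. The table sizes, the 95% bias cutoff, and the predictor details are my assumptions, not the dissertation's.

#include <algorithm>
#include <cstdint>
#include <unordered_map>

enum class BranchClass { Biased, Patterned, Correlated, Hard };

struct Counter2 {                      // saturating 2-bit counter
    int v = 2;
    bool predict() const { return v >= 2; }
    void update(bool t) { v = t ? std::min(3, v + 1) : std::max(0, v - 1); }
};

struct BranchStats {
    long total = 0, taken = 0, localHits = 0, gshareHits = 0;
    uint32_t localHist = 0;                          // per-branch history (local predictor)
    std::unordered_map<uint32_t, Counter2> localPHT;
};

struct Classifier {
    std::unordered_map<uint64_t, BranchStats> branches;
    uint32_t globalHist = 0;
    std::unordered_map<uint32_t, Counter2> gsharePHT;

    void observe(uint64_t pc, bool taken) {
        BranchStats& b = branches[pc];
        b.total++; b.taken += taken;

        Counter2& lc = b.localPHT[b.localHist & 0x3ff];          // 10 bits of local history
        b.localHits += (lc.predict() == taken);
        lc.update(taken);
        b.localHist = (b.localHist << 1) | taken;

        Counter2& gc = gsharePHT[(uint32_t(pc) ^ globalHist) & 0x3fff];  // gshare index
        b.gshareHits += (gc.predict() == taken);
        gc.update(taken);
        globalHist = (globalHist << 1) | taken;
    }

    BranchClass classify(uint64_t pc) const {
        const BranchStats& b = branches.at(pc);
        double bias = std::max(b.taken, b.total - b.taken) / double(b.total);
        if (bias >= 0.95) return BranchClass::Biased;
        if (b.localHits  >= 0.95 * b.total) return BranchClass::Patterned;
        if (b.gshareHits >= 0.95 * b.total) return BranchClass::Correlated;
        return BranchClass::Hard;
    }
};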
Results – CPU Time
- Conservative: GPU speedups are capped at 10x
- More time is being spent on the CPU than on the GPU
Results – Instruction Level Parallelism
- Drops in 17/22 apps (11% drop for the larger window size)
- Short independent loops go to the GPU; dependence-heavy code stays on the CPU
Results – Branch Characterization
- Frequency of hard branches rises from 11.3% to 18.6%
- Occasional effects of data dependent branches
Results – Loads
- Reduction in strided loads, increase in hard loads
- Occasional GPU mapping of irregular access kernels
Results – Vector Instructions
- SSE usage drops to almost half
- GPUs and SSE extensions target the same regions of code
Results – Thread Level Parallelism
- GPU-heavy benchmarks are worst hit (14x down to 2.1x); overall 40-60% drops
- The majority of benchmarks have almost no post-GPU TLP
- Going from 8 cores to 32 cores yields only a 10% benefit
Impact: CPU Core Directions
- Larger instruction windows will have muted gains
- Considerably increased pressure on the branch predictor
  - Need to adopt better performing techniques (e.g. L-TAGE, Seznec et al.)
- Memory accesses will continue to be a major bottleneck
  - Stride or next-line prefetching almost irrelevant
  - Need to apply techniques that capture complex patterns
  - Lots of literature but never adopted on real machines (e.g. Markov prediction, helper thread prefetching)
Impact: Redundancy Elimination
- SSE is rendered significantly less important
  - Every core need not have it; cores could share SSE hardware
- Extra CPU cores are not of much use because of the lack of TLP
- A few bigger cores focused on highly irregular code will improve performance
Part 4: Shared Component Designs
[Outline diagram repeated, with the shared components block highlighted]
Optimization of Shared Structures
[Figure: the integrated chip again, with CPU cores (latency sensitive) and GPGPU SMs (potentially latency-insensitive but bandwidth hungry) contending for the shared on-chip last level cache, memory controller, and off-chip memory]
TAP: TLP-Aware Shared LLC Management
(Lee et al. HPCA 2012)
- Insight 1: GPU cache misses / hits may or may not impact performance
  - Misses only matter if there is not enough latency hiding
  - Allocated capacity is useless if there is abundant parallelism
  - Measure cache sensitivity to performance: core sampling controller
- Insight 2: the GPU causes a lot more cache traffic than the CPU
  - Allocation schemes typically allocate based on the number of accesses
  - Normalization needed for the larger number of GPU accesses: cache block lifetime normalization
TAP Design - 1
- Core sampling controller
  - GPUs usually run the same workload on all cores
  - Use different cache policies on 2 of the cores and measure the performance difference
  - E.g. LRU for one core, MRU for the other
- Cache block lifetime normalization
  - Count the number of cache accesses for all CPU and GPU workloads
  - Calculate ratios of access counts across workloads
TAP Design - 2
- Utility based Cache Partitioning (UCP)
  - Dynamic cache way allocation scheme
  - Allocates ways based on an application's expected gain from additional space (utility)
  - Uses cache hit rates to calculate utility
  - Uses cache access rates to calculate cache block lifetime
- TLP-Aware Utility based Cache Partitioning (TAP-UCP), sketched below
  - Uses core sampling controller information
  - Allocates ways based on performance sensitivity, not hit rate
  - Normalizes access rates to reduce the GPU workload's weight
- 5% better performance than UCP, 11% over LRU
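A rough host-side C++ sketch of my own (a simplification, not the paper's mechanism) of the flavor of partitioning involved: greedy way allocation driven by per-way hit utility, with the GPU's utility scaled down by an access-rate ratio and zeroed when core sampling finds it cache-insensitive. The structures and scaling rule are my assumptions.

#include <algorithm>
#include <utility>
#include <vector>

struct AppProfile {
    std::vector<double> hitsPerWay;   // UMON-style counters: extra hits gained by each additional way
    double accessRate = 1.0;          // LLC accesses per unit time
    bool cacheSensitive = true;       // verdict from the core sampling controller (GPU side)
};

// Greedily assign 'totalWays' cache ways between a CPU and a GPU application.
// Returns {cpuWays, gpuWays}.
std::pair<int, int> partitionWays(const AppProfile& cpu, const AppProfile& gpu, int totalWays) {
    // Normalize GPU utility by relative access rate so sheer traffic volume does not win ways,
    // and ignore it entirely if core sampling found the GPU insensitive to cache capacity.
    double gpuScale = gpu.cacheSensitive ? cpu.accessRate / gpu.accessRate : 0.0;

    int cpuWays = 0, gpuWays = 0;
    for (int w = 0; w < totalWays; ++w) {
        double cpuGain = cpuWays < (int)cpu.hitsPerWay.size() ? cpu.hitsPerWay[cpuWays] : 0.0;
        double gpuGain = gpuWays < (int)gpu.hitsPerWay.size() ? gpu.hitsPerWay[gpuWays] * gpuScale : 0.0;
        if (cpuGain >= gpuGain) ++cpuWays; else ++gpuWays;
    }
    return {cpuWays, gpuWays};
}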
QoS-Aware Memory Bandwidth Partitioning
(Jeong et al. DAC 2012)
- Typical memory controller policy: always prioritize the CPU
  - The CPU is latency sensitive, the GPU is not
- However, this can slow down GPU traffic
  - A problem for real-time applications (graphics)
QoS-Aware Memory Bandwidth Partitioning (continued)
(Jeong et al. DAC 2012)
- Static management policies are problematic
- The authors propose a dynamic management scheme (sketched below)
  - Default: prioritize the CPU over the GPU
  - Periodically measure the current rate of progress on the frame
    - Work is decomposed into smaller tiles, so measurement is simple
  - Compare with the target frame rate
  - If the current frame rate is slower than the target, set CPU and GPU priorities equal
  - If close to the deadline and still behind, boost GPU request priority even further
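A compact C++ sketch of my own paraphrasing that policy as a periodic update function (not the paper's controller); the tile counts and the near-deadline threshold are hypothetical knobs.

enum class Priority { CpuOverGpu, Equal, GpuBoost };

struct FrameStatus {
    int tilesDone;        // tiles of the current frame rendered so far
    int tilesTotal;       // tiles per frame
    double timeElapsed;   // seconds since the frame started
    double frameBudget;   // target frame time, e.g. 1.0 / 60
};

// Called periodically by the memory controller's management layer.
Priority updatePriority(const FrameStatus& f, double nearDeadlineFrac = 0.9) {
    double progress = double(f.tilesDone) / f.tilesTotal;    // fraction of the frame finished
    double timeFrac = f.timeElapsed / f.frameBudget;         // fraction of the budget consumed

    if (progress >= timeFrac)
        return Priority::CpuOverGpu;      // GPU on track: keep the default CPU-first policy
    if (timeFrac >= nearDeadlineFrac)
        return Priority::GpuBoost;        // close to the deadline and still behind: boost the GPU
    return Priority::Equal;               // behind schedule: treat CPU and GPU requests equally
}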
Part 5: Opportunistic Optimizations
[Outline diagram repeated, with the opportunistic optimizations block highlighted]
Opportunistic Optimizations
- Chip integration advantages
  - Lower latency
  - New communication paths, e.g. a shared L2
- Opportunity for non-envisioned usage
  - Using idle resources to help active execution
    - Idle GPU helps the CPU
    - Idle CPU helps the GPU
Idle GPU Shader-based Prefetching
(Woo et al. ASPLOS 2010)
- Realization: advanced prefetching schemes are not adopted because of high storage costs
- A GPU system can have exploitable idle resources
- Use idle GPU shader resources
  - Register files as prefetcher storage
  - Execution threads as the prefetcher's logic structures
  - Parallel prefetcher execution threads to improve latency
  - Miss Address Provider (next slide)
- Propose an OS-based enabling and control interface
  - Library of prefetchers and application-specific selection
- Prefetching performance benefit of 68%
Miss Address Provider
[Figure: CPU cores and GPU SMs share the on-chip last level cache; the Miss Address Provider (MAP) forwards the miss PC and miss address to an idle GPU core allocated by the OS via a command buffer and shader pointer; the GPU core stores and processes the miss stream and prefetches data into the shared LLC]
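To give a feel for what the idle shader might run, here is a heavily simplified CUDA sketch of my own (not the paper's prefetcher) of a Markov-style correlation predictor: threads consume forwarded miss addresses, learn miss-to-miss successors in a table, and emit prefetch candidates. The paper keeps such state in the idle SM's register file; global memory, the hashing, and the table layout here are purely my assumptions to keep the sketch short.

#include <cstdint>

struct MarkovEntry { unsigned long long tag; unsigned long long next; };

__device__ unsigned hashAddr(unsigned long long addr, unsigned tableSize) {
    return (unsigned)((addr >> 6) % tableSize);      // index by cache-line address
}

// missBatch: miss addresses forwarded by the MAP, in order of occurrence.
// prefetch:  one predicted address per miss (0 if no prediction).
// Races between threads updating the table are tolerated: a wrong prefetch is harmless.
__global__ void markovPrefetcher(const unsigned long long* missBatch, int numMisses,
                                 MarkovEntry* table, unsigned tableSize,
                                 unsigned long long* prefetch) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numMisses; i += stride) {
        unsigned long long miss = missBatch[i];

        // Learn: record that the previous miss was followed by this one.
        if (i > 0) {
            unsigned h = hashAddr(missBatch[i - 1], tableSize);
            table[h].tag = missBatch[i - 1];
            table[h].next = miss;
        }

        // Predict: if this miss has been seen before, emit its recorded successor.
        unsigned h = hashAddr(miss, tableSize);
        prefetch[i] = (table[h].tag == miss) ? table[h].next : 0ull;
    }
}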
CPU-assisted GPGPU Processing
(Yang et al. HPCA 2012)
- Use idle CPU resources to prefetch for GPGPU applications
  - Target bandwidth-sensitive GPGPU applications
- Compiler-based framework converts GPU kernels into a CPU prefetching program
- The CPU must run appropriately ahead of the GPU
  - If too far behind, the CPU cache hit rate will be very high (the prefetches are useless)
  - If too far ahead, the GPU cache hit rate will be very low (prefetched data is evicted before use)
  - Very few CPU cycles are required since an LLC line is large
- Prefetching performance benefit of 21%
Example GPU Kernel and CPU Program

GPU kernel:

__global__ void VecAdd(float *A, float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

CPU prefetching program (pseudocode; mem_fetch issues the memory requests of a single GPU thread):

float mem_fetch(float *A, float *B, float *C, int N) {
    return A[N] + B[N] + C[N];
}

void cpu_prefetching(...) {
    unroll_factor = 8;
    // traverse through all thread blocks (TB)
    for (j = 0; j < N_TB; j += Concurrent_TB)
        // loop to traverse the concurrently running threads
        // (skip_factor controls the CPU's timing relative to the GPU)
        for (i = 0; i < Concurrent_TB * TB_Size;
             i += skip_factor * batch_size * unroll_factor) {
            // batch_size controls how often skip_factor is updated
            for (k = 0; k < batch_size; k++) {
                id = i + skip_factor * k * unroll_factor + j * TB_Size;
                // unrolled loop: unroll_factor artificially boosts CPU requests
                float a0 = mem_fetch(id + skip_factor * 0);
                float a1 = mem_fetch(id + skip_factor * 1);
                ...
                sum += a0 + a1 + ...;
            }
            update skip_factor
        }
}
Drawbacks: CPU-assisted GPGPU Processing
- Does not consider the effects of thread block scheduling
- The CPU program is stripped of actual computations
  - Memory requests from data- or computation-dependent paths are not considered
Part 6: Future Work
[Outline diagram repeated, with the future work blocks highlighted: tools, emerging technologies, power, temperature, reliability]
Continued System Optimizations
- Continued holistic optimizations
  - Understand the impact of GPU workloads on CPU requests to the memory controller
- Continued opportunistic optimizations
  - The latest GPUs allow different kernels to run on the same GPU
  - Can GPU threads prefetch for other GPU kernels?
Research Tools
- Severe lack of GPU research tools
  - No GPU power model
  - No GPU temperature model
- Immediate and impactful opportunities
Power, Temperature and Reliability
- Bounded by the lack of power tools
- No work yet on effective power management
- No work yet on effective temperature management
Emerging Technologies
- Impact of non-volatile memories on GPUs
- 3D die-stacked GPUs
- Stacked CPU-GPU-main memory systems
Conclusions
- In this work we looked at the CPU-GPU research landscape
- GPGPU systems are quickly scaling in performance
- The CPU needs to be refocused to handle extremely irregular code
- The design of shared components needs to be rethought
- Abundant optimization and research opportunities!

Questions?
Backup Slides

Results – Stores
- Similar trends as for loads, but slightly less pronounced

Results – Branch Prediction Rates
- Hard branches translate to higher misprediction rates
- Strong influence of the CPU-only benchmarks