Cache Coherence for GPU Architectures
Inderpreet Singh (1), Arrvindh Shriraman (2), Wilson Fung (1), Mike O’Connor (3), Tor Aamodt (1)
(1) University of British Columbia  (2) Simon Fraser University  (3) AMD Research
Image source: www.forces.gc.ca
What is a GPU?
[Figure: a CPU spawns workgroups onto the GPU; each GPU core runs the workgroups' wavefronts and has its own private L1D cache, and the cores reach shared L2 banks over an interconnect. When the GPU work is done, execution returns to the CPU (timeline: CPU → spawn → GPU → done → CPU).]
Evolution of GPUs
• Graphics pipeline (OpenGL/DirectX): Vertex Shader → Pixel Shader
• Compute (OpenCL, CUDA), e.g. matrix multiplication (see the sketch below)
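A minimal CUDA sketch of the kind of compute workload named above; the kernel and its parameters are my illustration, not taken from the talk:

    // One thread computes one element of C = A x B (N x N, row-major).
    __global__ void matmul(const float *A, const float *B, float *C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];  // dot product of row of A and column of B
            C[row * N + col] = sum;
        }
    }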
Evolution of GPUs
• Future: coherent memory space
• Efficient critical sections
• Load balancing
Examples: stencil computation across workgroups, and a critical section on a shared structure (a CUDA sketch follows):
  lock shared structure
  …
  computation
  …
  unlock
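A hypothetical device-side critical section matching the pseudocode above; it assumes a coherent memory space, a global lock word initialized to 0, and one leader thread per block entering the critical section (none of these details are from the talk):

    __device__ void locked_update(int *lock, int *shared_counter) {
        if (threadIdx.x == 0) {                     // block leader only
            while (atomicCAS(lock, 0, 1) != 0) { }  // lock shared structure
            *shared_counter += 1;                   // ... computation on the shared structure ...
            __threadfence();                        // make the update visible before releasing
            atomicExch(lock, 0);                    // unlock
        }
        __syncthreads();                            // rest of the block waits for the leader
    }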
GPU Coherence Challenges
• Challenge 1: Coherence traffic
[Figure: left, interconnect traffic for MESI, GPU-VI, and no coherence on applications that do not require coherence; both coherence protocols add substantial traffic over the non-coherent baseline. Right, a directory example: cores C1-C4 each cache blocks A and B in their L1Ds; a "gets C" request at the L2/directory triggers recall ("rcl A") messages to every L1 and matching "ack" replies, even though these loads need no coherence.]
GPU Coherence Challenges
• Challenge 2: Tracking in-flight requests
• Coherence adds transient states (e.g. S_M between Shared and Modified) whose in-flight requests must be buffered in MSHRs at the L2/directory
• This tracking storage amounts to a significant fraction of the L2's capacity
GPU Coherence Challenges
• Challenge 3: Complexity
[Figure: state-transition tables (states x events) for non-coherent L1 and L2 caches versus MESI L1 and L2 caches; MESI adds many more states and events, i.e. far greater protocol complexity.]
GPU Coherence Challenges
All three challenges result from introducing coherence messages on a GPU:
1. Traffic: transferring coherence messages
2. Storage: tracking coherence messages
3. Complexity: managing coherence messages
Can we have GPU cache coherence without coherence messages?
• YES – using global time
Temporal Coherence (TC)
• Coherence is enforced with a single global time synchronized across the GPU
• Each L1 block carries a local timestamp: Local Timestamp > Global Time → the copy is VALID; once the timestamp passes, the L1 copy has self-invalidated
• Each L2 block carries a global timestamp: Global Timestamp < Global Time → NO L1 COPIES can exist, so the L2 can act without sending invalidations
[Figure: Core 1 and Core 2 each hold A=0 in their L1Ds with timestamp 0; the L2 bank holds A=0 with global timestamp 0.]
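A minimal sketch (my illustration, not the authors' hardware) of the L1-side check this implies: a cached copy is usable only while its local timestamp is still ahead of the global time, and an expired copy silently self-invalidates.

    struct L1Block {
        bool     valid;
        unsigned local_timestamp;  // time until which this copy is guaranteed coherent
        int      data;
    };

    // Returns true on a hit; no invalidation message ever needs to reach this L1.
    bool l1_load_hit(L1Block &blk, unsigned global_time) {
        if (blk.valid && blk.local_timestamp > global_time)
            return true;           // timestamp still in the future: copy is valid
        blk.valid = false;         // expired: silently drop the copy (self-invalidate)
        return false;              // miss; re-fetch from the L2 with a fresh lifetime
    }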
Temporal Coherence (TC)
[Figure: example walkthrough. At T=0, Core 1 loads A and caches A=0 in its L1D with local timestamp 10; the L2 bank records global timestamp 10 for A. At T=11 the timestamp has expired, so a store of A=1 completes at the L2 with no coherence messages; a load of A at T=15 simply fetches the new value.]
Temporal Coherence (TC)
What lifetime values should be requested on loads?
• Use a predictor to predict lifetime values
What about stores to unexpired blocks?
• One option: stall them at the L2 (sketched below)
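A minimal sketch (my own, paired with the L1-side sketch earlier) of these two L2-side operations: a load is granted a predicted lifetime, and a store to an unexpired block is held back until all L1 copies have self-invalidated, i.e. the stalling option.

    struct L2Block {
        unsigned global_timestamp;  // expiry time of the furthest-out L1 copy
        int      data;
    };

    unsigned tc_load(L2Block &blk, unsigned now, unsigned predicted_lifetime, int &out) {
        unsigned expiry = now + predicted_lifetime;
        if (expiry > blk.global_timestamp)
            blk.global_timestamp = expiry;  // remember the latest timestamp handed out
        out = blk.data;
        return expiry;                      // becomes the requesting L1's local timestamp
    }

    bool tc_store(L2Block &blk, unsigned now, int value) {
        if (now < blk.global_timestamp)
            return false;                   // unexpired L1 copies may exist: stall and retry
        blk.data = value;                   // expired: write completes, no invalidations sent
        return true;
    }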
TC Stalling Issues
Problem #1: Sensitive to mispredictions
Problem #2: Impedes other accesses
Problem #3: Hurts existing GPU applications
Solution: TC-Weak
TC-Weak
• Stores return Global Write Completion Time (GWCT)
• No stalling at the L2: a write completes immediately, and the L2 returns the time at which it becomes globally visible
• Each core keeps a per-wavefront GWCT table (entries W0, W1, …) recording the latest GWCT returned to that wavefront
• A FENCE waits until the global time reaches the wavefront's recorded GWCT, so all of its earlier writes are visible before execution continues
[Figure: GPU Core 1 executes (1) data=NEW, (2) FENCE, (3) flag=SET; the stores return GWCTs (30 and 47 in the example) from the L2 bank, the fence waits on the GWCT table entry, and GPU Core 2 observes flag=SET only after data=NEW is visible.]
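A hypothetical CUDA sketch of the message-passing idiom on the slide; the names data and flag come from the slide, everything else is my illustration. Under TC-Weak, the fence is where a wavefront waits for the GWCTs of its earlier stores instead of each store stalling at the L2.

    __device__ volatile int data = 0;   // 0 = OLD, 1 = NEW
    __device__ volatile int flag = 0;   // 0 = NULL, 1 = SET

    __global__ void producer() {
        data = 1;            // (1) data = NEW
        __threadfence();     // (2) FENCE: proceed only once the store is globally visible
        flag = 1;            // (3) flag = SET, released only after data
    }

    __global__ void consumer(int *out) {
        while (flag == 0) { }  // spin until the producer sets the flag
        __threadfence();       // conservative fence between the flag and data reads
        *out = data;           // observes data = NEW
    }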
TC-Weak
Compared to TC with stalling, TC-Weak:
• Is not sensitive to lifetime mispredictions
• Does not impede other accesses
• Works well for existing GPU applications
Methodology
• GPGPU-Sim v3.1.2 for GPU core model
• GEMS Ruby v2.1.1 for memory system
• All protocols written in SLICC
• Model a generic NVIDIA Fermi-based GPU (see paper for details)
• Applications:
  • 6 do not require coherence
  • 6 require coherence (using locks, stencil communication, and load balancing):
    • Barnes Hut
    • Cloth Physics
    • Versatile Place and Route
    • Max-Flow Min-Cut
    • 3D Wave Equation Solver
    • Octree Partitioning
Interconnect Traffic
[Figure: interconnect traffic, normalized, for applications that do not require coherence, comparing MESI, NO-COH, GPU-VI, and TC-Weak.]
• Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
• Lower traffic than a 16x-sized, 32-way directory
Performance
[Figure: speedup for applications that require coherence, comparing MESI, NO-L1, GPU-VI, and TC-Weak.]
• TC-Weak with a simple predictor performs 85% better than disabling L1 caches
• Performs 28% better than TC with stalling
• Larger directory sizes do not improve performance
Complexity
[Figure: state-transition tables for non-coherent, MESI, and TC-Weak L1 and L2 caches; TC-Weak adds far fewer states over the non-coherent baseline than MESI does.]
Summary
• First work to characterize GPU coherence challenges
• Save traffic and energy by using global time
• Reduce protocol complexity
• 85% performance improvement over disabling L1 caches (the no-coherence baseline)
Questions?
Backup Slides
Lifetime Predictor
• One prediction value per L2 bank
• Events local to the L2 bank update the prediction value (see the sketch below):
  1. Expired load: prediction value ↑
  2. Unexpired store: prediction value ↓
  3. Unexpired eviction: prediction value ↓
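A minimal sketch (my own, with made-up step sizes) of such a per-bank predictor: a single prediction value nudged up or down by events observed locally at the bank.

    struct LifetimePredictor {
        unsigned prediction = 32;   // lifetime handed out on loads (cycles; value is hypothetical)

        void on_expired_load()       { prediction += 4; }                       // copies expire too early: lengthen
        void on_unexpired_store()    { if (prediction >= 8) prediction -= 8; }  // writes held up by live copies: shorten
        void on_unexpired_eviction() { if (prediction >= 2) prediction -= 2; }  // evicted while still live: shorten

        unsigned lifetime() const { return prediction; }  // used when servicing a load
    };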
TC-Strong vs TC-Weak
[Figure: two speedup charts over all applications. Left: a fixed lifetime for all applications; right: the best lifetime for each application. Configurations compared: TCSUO, TCS, TCSOO, TCW, and TCW with predictor.]
Interconnect Power and Energy
[Figure: normalized interconnect energy and power for interworkgroup and intraworkgroup applications, broken down into Router (Static), Router (Dynamic), Link (Static), and Link (Dynamic), comparing NO-COH/NO-L1, MESI, GPU-VI, GPU-VIni, and TCW.]