The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian¹, Omer Khan², Srini Devadas¹
¹ Massachusetts Institute of Technology
² University of Connecticut, Storrs
Cache Hierarchy Organization
Directory-Based Coherence
• Private caches: 1 or 2 levels
• Shared cache: last level
• Concurrent reads lead to replication in private caches
• Directory maintains coherence for replicated lines
[Figure: a write miss in a private cache is (1) sent to the shared cache + directory, which (2) sends invalidations to the sharers, (3) collects their replies, and (4) responds to the writer]
Private Caching
Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient low-latency local access to private + shared data (cache line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality
☹ Working set > private cache size
– Inefficient cache utilization (cache thrashing)
– Unnecessary fetch of entire cache line
– Shared data replication increases working set
☹ Shared data with frequent writes
– Wasteful invalidations, synchronous writebacks, cache line ping-ponging
-> Increased on-chip communication and time spent waiting for expensive events
On-Chip Communication Problem
• "Wires relative to gates are getting worse every generation" – Bill Dally, Stanford
• "Bit movement is much more expensive than computation" – Shekhar Borkar, Intel
• Must architect efficient coherence protocols
Locality of Benchmarks
Evaluating Reuse before Evictions
• Utilization: # private L1 cache accesses before the cache line is evicted
• 40% of lines evicted have a utilization < 4
[Figure: distribution of utilization at eviction across benchmarks]
Locality of Benchmarks
Evaluating Reuse before Invalidations
• Utilization: # private L1 cache accesses before the cache line is invalidated (intervening write)
[Figure: distribution of utilization at invalidation across benchmarks]
Remote-Word Access (RA)
NUCA-based protocol [Fensch et al HPCA'08] [Hoffmann et al HiPEAC'10]
• Assign each memory address to a unique "home" core
– Cache line present only in shared cache at the "home" core (single location)
• For access to a non-locally cached word, request the "remote" shared cache on the "home" core to perform the read/write access
[Figure: (1) a write-word request is sent to the home core; (2) the home core performs the access at its shared cache and replies]
Remote-Word Access
Advantages & Drawbacks
☺ Energy efficient (low locality data): word access (~200 bits) cheaper than cache line fetch (~640 bits)
☺ NO data replication: efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for remote WORD access
☹ Expensive for high locality data
☹ Data placement dictates distance & frequency of remote accesses
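The word-versus-line trade-off above can be made concrete with a rough traffic calculation, using only the per-transaction bit counts quoted on the slide (~200 bits per remote word round trip, ~640 bits per cache line fetch). This is an illustrative sketch, not a cost model from the talk; the break-even point it computes is my own arithmetic.

```python
# Rough on-chip traffic comparison: remote word accesses vs a single
# private cache line fetch, using the bit counts quoted on the slide.
WORD_ACCESS_BITS = 200   # remote read/write round trip (request + word reply)
LINE_FETCH_BITS = 640    # full cache line fetch, including control overhead

def traffic_bits(accesses: int, mode: str) -> int:
    """Total network bits to serve `accesses` touches to one line."""
    if mode == "remote":
        return accesses * WORD_ACCESS_BITS   # every touch crosses the network
    elif mode == "private":
        return LINE_FETCH_BITS               # one fetch, then local L1 hits
    raise ValueError(mode)

assert traffic_bits(3, "remote") == 600  # 3 touches: remote costs less than 640
assert traffic_bits(4, "remote") == 800  # 4 touches: private caching wins
```

With these numbers the break-even reuse is 640/200 = 3.2 touches, which lines up with the PCT = 4 threshold the talk settles on later.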
Locality-Aware Cache Coherence
• Combine advantages of private caching and
remote access
• Privately cache high locality lines
– Optimize hit latency and energy
• Remotely cache low locality lines
– Prevent data replication & costly data movement
• Private Caching Threshold (PCT)
– Utilization >= PCT -> mark as private
– Utilization < PCT -> mark as remote
Locality-Aware Cache Coherence
• Private Caching Threshold (PCT) = 4
[Figure: breakdown of invalidations (%) by utilization bucket (1; 2,3; 4,5; 6,7; >=8); buckets below PCT are classified remote, the rest private]
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Baseline System
• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas – ISCA09]
• ACKwise limited-directory protocol [Kurian – PACT10]
[Figure: tiled multicore; each tile contains a core (compute pipeline, L1 I/D caches), an L2 shared cache slice with directory, and a router; memory controllers (M) sit at the edges]
Locality-Aware Coherence
Important Features
• Intelligent allocation of cache lines
– In the private L1 cache
– Allocation decision made per-core at cache line level
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking
structures
• Protocol complexity low
– NO additional networks for deadlock avoidance
Implementation Details
Private Cache Line Tag
[Tag fields: State | LRU | Tag | Private Utilization]
• Private Utilization bits track cache line usage in the L1 cache
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%
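The per-line bookkeeping above can be sketched as a tag entry with a small saturating counter. The 3-bit counter width is an assumption for illustration (it matches the >=8 bucket used in the locality plots; a few bits per 64-byte line is consistent with the quoted 0.4% overhead):

```python
# Sketch of the extra per-line L1 state: a small saturating utilization
# counter alongside the usual tag/state/LRU bits. The 3-bit width is an
# illustrative assumption, not taken from the talk.
UTIL_BITS = 3
UTIL_MAX = (1 << UTIL_BITS) - 1  # counter saturates at 7

class L1TagEntry:
    def __init__(self, tag: int):
        self.tag = tag
        self.state = "I"      # coherence state
        self.utilization = 0  # accesses since the line was filled

    def on_access(self):
        self.utilization = min(self.utilization + 1, UTIL_MAX)

    def on_evict_or_invalidate(self) -> int:
        # Utilization rides along with the eviction/invalidation
        # message back to the directory; the counter then resets.
        u, self.utilization = self.utilization, 0
        return u

entry = L1TagEntry(tag=0x1234)
for _ in range(10):
    entry.on_access()
assert entry.on_evict_or_invalidate() == 7  # saturated at UTIL_MAX
```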
Implementation Details
Directory Entry
[Entry fields: State | ACKwise Pointers 1…p | Tag | P/R1, Remote Utilization1 | … | P/Rn, Remote Utilizationn]
• P/Ri: Private/Remote mode of core i
• Remote Utilizationi: line usage by core i at the shared L2 cache
• Complete locality classifier: tracks mode / remote utilization for all cores
• Storage overhead reduced later
Mode Transitions Summary
• Classification based on previous behavior
[State diagram: Initial -> Private; Private stays Private while Private Utilization >= PCT and moves to Remote when Private Utilization < PCT; Remote stays Remote while Remote Utilization < PCT and moves back to Private when Remote Utilization >= PCT]
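The transition diagram reads as a two-state machine per (core, line) pair; a minimal sketch, with the private side evaluated on eviction/invalidation and the remote side on accesses served at the shared cache:

```python
# Mode transition rule from the slide, as a tiny state machine.
PCT = 2  # threshold used in the walk-through example

def next_mode(mode: str, utilization: int, pct: int = PCT) -> str:
    if mode == "private":
        # Evaluated when the private copy is evicted or invalidated,
        # using the Private Utilization counter from the L1 tag.
        return "private" if utilization >= pct else "remote"
    else:
        # Evaluated at the shared L2 cache, using the Remote
        # Utilization counter in the directory entry.
        return "private" if utilization >= pct else "remote"

assert next_mode("private", 1) == "remote"   # low reuse: demote
assert next_mode("remote", 2) == "private"   # enough remote touches: promote
```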
Walk Through Example
(Private Caching Threshold PCT = 2; Cores A, B and C access line X, whose directory entry and shared L2 slice live at Core D; all cores start out in private mode and X is initially uncached)
1. Core A issues Read[X]. The directory marks X Shared (clean) and returns the whole cache line; Core A caches it with private utilization 1.
2. Core C issues Read[X]. The line is also returned to Core C (utilization 1). A second Read[X] by Core C hits locally and raises its utilization to 2.
3. Core B issues Write[X]. The directory sends Inv[X] to Cores A and C. Core A's Inv-Reply carries utilization 1 < PCT, so the directory reclassifies Core A as remote; Core C's Inv-Reply carries utilization 2 >= PCT, so Core C stays private. The directory then marks X Modified and sends the line to Core B (utilization 1).
4. Core A issues Read[X]. Because Core A is now in remote mode, the directory issues WB[X] to Core B; Core B's WB-Reply downgrades its copy to Shared and writes the dirty data back. The read is performed at the shared L2 cache and only Word[X] is returned to Core A (remote utilization 1).
5. Core B issues Write[X]. The directory sends an UpgradeReply to Core B (Modified, utilization 2) and resets Core A's remote utilization to 0 because of the intervening write.
6. Core A issues Read[X] twice more. Each read is again served as a remote word access at the shared L2 cache (after a writeback from Core B), raising Core A's remote utilization to 1 and then 2.
7. Since Core A's remote utilization 2 >= PCT, the directory promotes Core A back to private mode and returns the whole cache line; Core A now caches X privately again (Shared, utilization 2).
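The classification bookkeeping in the walk-through can be replayed with a toy directory model. This is a sketch of the mode/utilization tracking only; coherence states, the network and the ACKwise pointers are deliberately left out.

```python
# Toy directory that reproduces the mode changes from the walk-through
# (PCT = 2): Core A is demoted to remote after a low-utilization
# invalidation, then promoted back once it accrues PCT remote accesses.
PCT = 2

class Directory:
    def __init__(self):
        self.mode = {}          # core -> "private" | "remote"
        self.remote_util = {}   # remote-mode utilization per core

    def _init_core(self, core):
        self.mode.setdefault(core, "private")   # all cores start private
        self.remote_util.setdefault(core, 0)

    def access(self, core):
        """A read/write by `core`; returns what is shipped to it."""
        self._init_core(core)
        if self.mode[core] == "private":
            return "cache-line"                 # line is replicated in L1
        self.remote_util[core] += 1
        if self.remote_util[core] >= PCT:       # enough reuse at the L2:
            self.mode[core] = "private"         # promote back to private
            self.remote_util[core] = 0
            return "cache-line"
        return "word"                           # remote word access

    def inv_reply(self, core, private_util):
        """Invalidation reply carrying the L1 utilization counter."""
        self._init_core(core)
        if private_util < PCT:
            self.mode[core] = "remote"          # low reuse: demote

    def write(self, writer):
        """An intervening write resets other cores' remote utilization."""
        for core in self.mode:
            if core != writer:
                self.remote_util[core] = 0

d = Directory()
d.access("A")                         # Core A reads X, cached privately
d.access("C"); d.access("C")          # Core C reads X twice
d.inv_reply("A", private_util=1)      # Core B's write invalidates A (1 < PCT)
d.inv_reply("C", private_util=2)      # ... and C (2 >= PCT)
assert d.mode == {"A": "remote", "C": "private"}
assert d.access("A") == "word"        # A's next read: remote word access
d.write("B")                          # B's write resets A's remote utilization
assert d.access("A") == "word"        # remote utilization back to 1
assert d.access("A") == "cache-line"  # utilization reaches PCT: promoted
assert d.mode["A"] == "private"
```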
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Complete Locality Classifier
High Directory Storage
[Entry fields: State | ACKwise Pointers 1…p | Tag | P/R1, Remote Utilization1 | … | P/Rn, Remote Utilizationn]
• Complete locality classifier: tracks locality information for all cores
• Bit overhead per core (with a 256 KB L2): 192 KB (60%)
Limited Locality Classifier
Reduces Directory Storage
[Entry fields: State | ACKwise Pointers 1…p | Tag | Core ID1 … Core IDk | P/R1, Remote Utilization1 | … | P/Rk, Remote Utilizationk]
• Utilization and mode tracked for k sharers
• Modes of other sharers obtained by taking a majority vote
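The majority-vote fallback can be sketched in a few lines. How ties are broken is not stated on the slide; breaking them toward private mode is my assumption for illustration:

```python
# Limited classifier fallback: mode for a core that is not among the k
# tracked sharers is guessed by majority vote over the tracked modes.
# Tie-breaking toward "private" is an assumption, not from the talk.
def mode_for_untracked(tracked_modes: list) -> str:
    private_votes = sum(1 for m in tracked_modes if m == "private")
    remote_votes = len(tracked_modes) - private_votes
    return "private" if private_votes >= remote_votes else "remote"

assert mode_for_untracked(["private", "remote", "private"]) == "private"
assert mode_for_untracked(["remote", "remote", "private"]) == "remote"
```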
Limited-3 Locality Classifier
• Utilization and mode tracked for 3 sharers
• Achieves the performance and energy of the Complete locality classifier
– Completion Time: 3% lower than Complete
– Energy: 1.5% lower than Complete
• Completion time and energy are lower because remote-mode classification is learned faster with Limited-3
• Bit overhead per core (with a 256 KB L2): Complete 192 KB (60%) vs Limited-3 18 KB (5.7%)
Private <-> Remote Transition
Results In Private Cache Thrashing
• Difficult to measure private cache locality of a line in the shared L2 cache
[State diagram: Initial -> Private; Private -> Remote when Private Utilization < PCT; Remote -> Private when Remote Utilization >= PCT]
• Core reverts back to private mode after #PCT accesses to the cache line at the shared L2 cache
• Evicts other lines in the private L1 cache
• Results in low spatio-temporal locality for all lines
Ideal Classifier
NO Private Cache Thrashing
• Ideal classifier maintains part of the working
set in the private cache
• Other lines placed in remote mode at shared
cache
Remote Access Threshold
Reduces Private Cache Thrashing
• If a core is classified as a remote sharer (capacity), increase the cost of promotion to private mode
• If a core is classified as a private sharer, reset the cost back to its starting value
• Reduces private cache thrashing to a negligible level
[State diagram: Initial -> Private; Private -> Remote when Private Utilization < PCT; Remote -> Private when Remote Utilization >= RAT; Remote stays Remote while Remote Utilization < RAT]
• Remote Access Threshold (RAT) varied based on PCT & application behavior [details in paper]
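The RAT mechanism can be sketched as a per-line promotion threshold that grows on capacity-induced demotions and resets on a successful private classification. The doubling schedule and the cap below are my assumptions for illustration; the actual schedule is in the paper.

```python
# Sketch of the Remote Access Threshold (RAT): the promotion threshold
# starts at PCT, is raised on each capacity-induced demotion so a
# thrashing line needs more L2 reuse before re-entering the L1, and is
# reset when the line proves private-worthy. Doubling and the cap are
# illustrative assumptions.
PCT = 4
RAT_MAX = 16

class LineClassifier:
    def __init__(self):
        self.rat = PCT  # promotion threshold starts at PCT

    def on_remote_demotion(self):
        # Capacity-induced demotion: make re-promotion harder.
        self.rat = min(self.rat * 2, RAT_MAX)

    def on_private_success(self):
        # Line was successfully classified private: reset the cost.
        self.rat = PCT

    def should_promote(self, remote_utilization: int) -> bool:
        return remote_utilization >= self.rat

c = LineClassifier()
assert c.should_promote(4)      # initially RAT == PCT
c.on_remote_demotion()
assert not c.should_promote(4)  # now needs 8 touches at the L2
assert c.should_promote(8)
c.on_private_success()
assert c.should_promote(4)      # reset after a private classification
```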
Outline
• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion
Reducing Capacity Misses
Private L1 Cache Miss Rate vs PCT (Blackscholes)
[Figure: L1 cache miss rate breakdown (%) into Cold, Capacity, Upgrade, Sharing and Word misses, for PCT = 1…8]
• Miss rate reduces as PCT increases (better utilization)
• Multiple capacity misses (expensive) replaced with a single word access (cheap)
• Cache miss rate increases towards the end (one capacity miss turns into multiple word misses)
Energy vs PCT
Blackscholes
[Figure: normalized energy breakdown into Network Link, Network Router, Directory, L2 Cache, L1-D Cache and L1-I Cache, for PCT = 1…8]
• Reducing L1 cache misses (Capacity & Word) leads to less network traffic and fewer L2 accesses
• Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)
Completion Time vs PCT
Blackscholes
[Figure: normalized completion time breakdown into Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache and Compute, for PCT = 1…8]
• Lower L1 cache miss rate + miss penalty
• Less time spent waiting on L1 cache misses
Reducing Sharing Misses
Private L1 Cache Miss Rate vs PCT (Streamcluster)
[Figure: L1 cache miss rate breakdown (%) into Cold, Capacity, Upgrade, Sharing and Word misses, for PCT = 1…8]
• Sharing misses (expensive) turned into word misses (cheap) as PCT increases
Energy vs PCT
Streamcluster
[Figure: normalized energy breakdown into Network Link, Network Router, Directory, L2 Cache, L1-D Cache and L1-I Cache, for PCT = 1…8]
• Reduced invalidations, asynchronous writebacks and cache-line ping-ponging
Completion Time vs PCT
Streamcluster
[Figure: normalized completion time breakdown into Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache and Compute, for PCT = 1…8]
• Less time spent waiting for invalidations and by loads waiting for previous stores
• Critical section time reduction -> synchronization time reduction
Variation with PCT
Results Summary
• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel MiBench and UHPC suites + 3 hand-written benchmarks
• PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
• Evaluations done using the Graphite simulator for 64 cores, McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
Conclusion
• Three potential advantages of the locality-aware adaptive cache coherence protocol
– Better private cache utilization
– Reduced on-chip communication (invalidations, asynchronous write-backs and cache-line transfers)
– Reduced memory access latency and energy
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking structures
– Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
• Simple to implement
– NO additional networks for deadlock avoidance