Transcript pptx

CACM July 2012
Talk: Mark D. Hill, Wisconsin
@ Cornell University, 10/2012
1
Executive Summary
• Today chips provide shared memory w/ HW coherence
as low-level support for OS & application SW
• As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays by managing overheads
• Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today’s
⇒ Compatibility of on-chip HW coherence is here to stay
Let’s spend programmer sanity on parallelism, not lost compatibility!
2
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication
2. Storage
3. Enforcing Inclusion
4. Latency
5. Energy
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
3
Academics Criticize HW Coherence
• Choi et al. [DeNovo]:
o “Directory … coherence … extremely complex & inefficient .... Directory … incurring significant storage and invalidation traffic overhead.”
• Kelm et al. [Cohesion]:
o “A software-managed coherence protocol ... avoids ... directories and duplicate tags, & implementing & verifying … less traffic ...”
4
Industry Eschews HW Coherence
• Intel 48-Core IA-32 Message-Passing Processor …
SW protocols … to eliminate the communication & HW overhead
• IBM Cell processor …
the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory
BUT…
5
Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.
6
Define “Coherence as Scalable”
• Define a coherent system as scalable when
the cost of providing coherence grows (at most) slowly
as core count increases
• Our Focus
o YES: coherence
o NO: Any scalable system also requires scalable HW
(interconnects, memories) and SW (OS, middleware, apps)
• Method
o Identify each overhead & show it can grow slowly
• Expect more cores
o Moore’s Law provides more transistors
o Power-efficiency improvements (w/o Dennard Scaling)
o Experts disagree on how many cores are possible
7
Caches & Coherence
• Cache— fast, hidden memory—to reduce
o Latency: average memory access time
o Bandwidth: interconnect traffic
o Energy: cache misses cost more energy
• Caches hidden (from software)
o Naturally for single core system
o Via Coherence Protocol for multicore
• Maintain coherence invariant (see the sketch below)
o For a given (memory) block at a given time, either:
o Modified (M): A single core can read & write
o Shared (S): Zero or more cores can read, but not write
8
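To make the invariant concrete, here is a minimal Python sketch (not from the talk; the per-block `copies` dictionary format is assumed purely for illustration) that checks the single-writer / multiple-reader rule:

```python
# Minimal sketch: the coherence invariant as a checkable predicate.
# 'copies' maps core id -> "M" or "S" for one memory block (assumed format).

def coherence_invariant_holds(copies):
    writers = sum(1 for st in copies.values() if st == "M")
    readers = sum(1 for st in copies.values() if st == "S")
    # Either exactly one core holds the block Modified and nobody else has it,
    # or zero or more cores hold it Shared (read-only).
    return (writers == 1 and readers == 0) or writers == 0

assert coherence_invariant_holds({0: "M"})               # one writer
assert coherence_invariant_holds({1: "S", 2: "S"})       # many readers
assert not coherence_invariant_holds({0: "M", 1: "S"})   # violation
```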
Baseline Multicore Chip
[Figure: Cores 1..C, each with a private cache, connected by an interconnection network to a shared last-level cache]
• Intel Core i7-like
• C = 16 cores (not 8)
• Private L1/L2 caches
• Shared Last-Level Cache (LLC)
• 64B blocks w/ ~8B tag
• Block in private cache: state (~2 bits), tag (~64 bits), block data (~512 bits)
• Block in shared cache: tracking bits (~C bits), state (~2 bits), tag (~64 bits), block data (~512 bits)
• HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)
9
Baseline Chip Coherence
[Figure: same chip; a block in a private cache holds state (~2 bits), tag (~64 bits), and block data (~512 bits); a block in the shared cache adds tracking bits (~C bits)]
• 2B per 64+8B L2 block to track L1 copies
• Inclusive L2 (w/ recall messages on LLC evictions)
10
Coherence Example Setup
[Figure: Cores 0-3 with private caches, an interconnection network, and shared-cache Banks 0-3; directory entries A: {0000} I and B: {0000} I]
• Block A in no private caches: state Invalid (I)
• Block B in no private caches: state Invalid (I)
11
Coherence Example 1/4
[Figure: Core 0 issues Write A; its private cache now holds A: M; directory entry A changes from {0000} I to {1000} M; B stays {0000} I]
• Block A at Core 0 exclusive read-write: Modified (M)
12
Coherence Example 2/4
[Figure: Cores 1 and 2 issue Read B; both private caches now hold B: S; directory entry B changes from {0000} I to {0110} S; A stays {1000} M at Core 0]
• Block B at Cores 1+2 shared read-only: Shared (S)
13
Coherence Example 3/4
[Figure: Core 3 issues Write A; A leaves Core 0’s private cache and is now M in Core 3’s; directory entry A changes from {1000} M to {0001} M; B stays {0110} S]
• Block A moved from Core 0 to 3 (still M)
14
Coherence Example 4/4
[Figure: Core 1 issues Write B; Core 2’s copy is invalidated and Core 1 now holds B: M; directory entry B changes from {0110} S to {1000} M; A stays {0001} M]
• Block B moved from Cores 1+2 (S) to Core 1 (M)
15
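The four example steps can be replayed with a toy directory model. The sketch below is illustrative only: it assumes an exact one-bit-per-core sharer vector and the M/S/I states above, the `ToyDirectory` class and its methods are invented for this example, and it omits the data forwarding and acknowledgment messages a real protocol exchanges.

```python
# Minimal sketch of the directory bookkeeping shown in examples 1/4-4/4.

class ToyDirectory:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.entries = {}          # block -> [state, set of sharer core ids]

    def _entry(self, block):
        return self.entries.setdefault(block, ["I", set()])

    def read(self, core, block):
        state, sharers = self._entry(block)
        # Simplified: a read adds the reader and moves the block to Shared
        # (a real protocol would also downgrade a Modified copy elsewhere).
        sharers.add(core)
        self.entries[block] = ["S", sharers]

    def write(self, core, block):
        state, sharers = self._entry(block)
        # Single-writer invariant: remove every other copy, then record
        # the writer as the lone Modified owner.
        sharers.discard(core)
        self.entries[block] = ["M", {core}]

    def show(self, block):
        state, sharers = self._entry(block)
        bits = "".join("1" if c in sharers else "0" for c in range(self.num_cores))
        return f"{block}: {{{bits}}} {state}"

d = ToyDirectory(4)
d.write(0, "A"); print(d.show("A"))                  # A: {1000} M   (example 1/4)
d.read(1, "B"); d.read(2, "B"); print(d.show("B"))   # B: {0110} S   (example 2/4)
d.write(3, "A"); print(d.show("A"))                  # A: {0001} M   (example 3/4)
d.write(1, "B"); print(d.show("B"))                  # B: {1000} M   (example 4/4)
```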
Caches & Coherence
16
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages
(longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
17
1. Communication: (a) No Sharing, Dirty
[Figure: Cores 1..C with private caches and an interconnection network. Key: green for required, red for overhead; thin arrows are 8-byte control messages, thick arrows are 72-byte data messages]
o W/o coherence: Request → Data → Data (writeback)
o W/ coherence: Request → Data → Data (writeback) → Ack
o Overhead = 8/(8+72+72) = 5% (independent of #cores!)
18
1. Communication: (b) No Sharing, Clean
[Figure: same diagram and key as (a)]
o W/o coherence: Request → Data → 0
o W/ coherence: Request → Data → (Evict) → Ack
o Overhead = 16/(8+72) = 10-20% (independent of #cores!)
19
1. Communication: (c) Sharing, Read
[Figure: same diagram and key as (a)]
o To memory: Request → Data
o To one other core: Request → Forward → Data → (Cleanup)
o Charge 1-2 control messages (independent of #cores!)
20
1. Communication: (d) Sharing, Write
[Figure: same diagram and key as (a)]
o If Shared at C other cores:
o Request → {Data, C Invalidations + C Acks} → (Cleanup)
o Not Scalable
o Needed since most directory protocols send invalidations to caches that have & sometimes do not have copies
21
1. Communication: Extra Invalidations
[Figure: same diagram and key as (a); here the directory keeps a coarse sharer vector {1|2 3|4 .. C-1|C} with one bit per pair of cores, so a single reader sets a bit that covers two cores]
o Core 1 Read: Request → Data
o Core C Write: Request → {Data, 2 Inv + 2 Acks} → (Cleanup)
o Charge Write for all necessary & unnecessary invalidations
o What if all invalidations necessary? Charge reads that get data!
22
1. Communication: No Extra Invalidations
[Figure: same diagram and key as (a); here the directory keeps an exact sharer vector {1 2 3 4 .. C-1 C} with one bit per core, e.g. {1 0 0 0 .. 0 0}]
o Core 1 Read: Request → Data + {Inv + Ack} (in future)
o Core C Write: Request → Data → (Cleanup)
o If all invalidations necessary, coherence adds bounded overhead to each miss -- independent of #cores!
23
1. Communication Overhead
(1) Communication overhead bounded & scalable
(a) Without Sharing & Dirty
(b) Without Sharing & Clean
(c) Shared Read Miss (charge future inv + ack)
(d) Shared Write Miss (not charged for inv + acks)
• But depends on tracking exact sharers (next)
24
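The per-miss traffic accounting behind cases (a) and (b) can be written out directly. The sketch below assumes the 8-byte control and 72-byte data message sizes from the figures' key; it simply reproduces the 5% and 10-20% overheads quoted above.

```python
# Per-miss traffic accounting from the previous slides (illustrative sketch).
# Assumes 8-byte control messages and 72-byte data messages (64B block + 8B header).

CTRL, DATA = 8, 72

def no_sharing_dirty():
    base = CTRL + DATA + DATA            # Request, Data, Data (writeback)
    with_coherence = base + CTRL         # + Ack of the writeback
    return (with_coherence - base) / base

def no_sharing_clean():
    base = CTRL + DATA                   # Request, Data (no writeback)
    with_coherence = base + 2 * CTRL     # + (Evict) + Ack
    return (with_coherence - base) / base

print(f"(a) no sharing, dirty: {no_sharing_dirty():.0%}")   # ~5%
print(f"(b) no sharing, clean: {no_sharing_clean():.0%}")   # up to ~20% (10-20% on the slide)
# (c) shared read and (d) shared write each add only 1-2 control messages per miss
# when sharers are tracked exactly, so they too are bounded independent of #cores.
```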
Total Communication
[Figure: two plots of bytes per miss (0-700) vs. read misses per write miss (1-1024), each with curves for 4 to 1024 cores. With inexact tracking (32-bit coarse vector), bytes per miss grow with core count; with exact tracking (unbounded storage), they stay roughly flat]
How to get the performance of “exact” w/ reasonable storage?
25
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages
(longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
26
2. Storage Overhead (Small Chip)
• Track up to C=#readers (cores) per LLC block
• Small #Cores: C-bit vector acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3% (see the sketch below)
[Figure: same chip as before; each shared-cache block holds tracking bits (~C bits), state (~2 bits), tag (~64 bits), and block data (~512 bits)]
27
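The 3% figure is just the tracking bits divided by the block-plus-tag size. A small sketch (assuming one tracking bit per core and 64B blocks with 8B tags, as above) also shows why a flat vector stops being acceptable at large core counts:

```python
# Flat bit-vector storage overhead (illustrative sketch).
# One tracking bit per core for every shared-cache block.

def flat_tracking_overhead(cores, block_bytes=64, tag_bytes=8):
    tracking_bytes = cores / 8                  # one bit per core
    return tracking_bytes / (block_bytes + tag_bytes)

print(f"{flat_tracking_overhead(16):.0%}")      # 2B / 72B ~= 3%
print(f"{flat_tracking_overhead(1024):.0%}")    # 128B / 72B -- clearly not acceptable
```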
2. Storage Overhead (Larger Chip)
• Use Hierarchy!
[Figure: K clusters, each with K cores and private caches joined by an intra-cluster interconnection network to a cluster cache whose blocks carry per-core tracking bits (e.g., {11..1} S). An inter-cluster interconnection network joins the cluster caches to a shared last-level cache whose blocks carry per-cluster tracking bits (e.g., {1 … 1} S)]
28
2. Storage Overhead (Larger Chip)
• Medium-Large #Cores: Use Hierarchy!
o Cluster: K1 cores with L2 cluster cache
o Chip: K2 clusters with L3 global cache
o Enables K1*K2 Cores
• E.g., 16 16-core clusters
o 256 cores (16*16)
o 3% storage overhead!!
• More generally?
29
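A sketch of the hierarchical arithmetic, assuming each level keeps one tracking bit per child (core within a cluster, cluster within the chip) for every 72-byte block it holds; it reproduces the ~3% overhead at 16*16 = 256 cores:

```python
# Hierarchical tracking overhead (illustrative sketch).

def per_level_overhead(children, block_bytes=64, tag_bytes=8):
    return (children / 8) / (block_bytes + tag_bytes)   # one bit per child

k1, k2 = 16, 16                                  # cores per cluster, clusters per chip
print(f"cores supported: {k1 * k2}")             # 256
print(f"cluster-cache overhead: {per_level_overhead(k1):.0%}")  # ~3%
print(f"LLC overhead:           {per_level_overhead(k2):.0%}")  # ~3%
```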
Storage Overhead for Scaling
[Figure: storage overhead vs. core count, with the design point of 16 clusters of 16 cores each marked]
(2) Hierarchy enables scalable storage
30
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages
(longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
31
3. Enforcing Inclusion (Subtle)
• Inclusion:
Block in a private cache ⇒ In shared cache
+ Augment shared cache to track private cache sharers (as assumed)
- Replace in shared cache ⇒ Replace in private caches
- Make impossible?
- Requires too much shared cache associativity
- E.g., 16 cores w/ 4-way caches ⇒ 64-way assoc (see the sketch below)
- Use recall messages
• Make recall messages necessary & rare
32
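The 64-way number is worst-case arithmetic: if private and shared caches index sets the same way (an assumption made for this sketch), every private way across all cores could hold a distinct block that maps to one shared-cache set.

```python
# Worst-case shared-cache associativity needed to avoid recalls entirely
# (illustrative; assumes private and shared caches index sets identically).

def required_shared_assoc(cores, private_ways):
    return cores * private_ways   # one distinct block per private way per core

print(required_shared_assoc(16, 4))   # 64-way, as on the slide
```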
Inclusion Recall Example
[Figure: a core issues Write C; private caches hold A (M, at Core 0) and B (S, at Cores 1 and 2); shared-cache directory entries are A: {1000} M and B: {0110} S]
• Shared cache miss to new block C
• Needs to replace (victimize) block B in shared cache
• Inclusion forces replacement of B in private caches
33
Make All Recalls Necessary
Exact state tracking (covered earlier)
+
L1/L2 replacement messages (even clean)
=
Every recall message finds a cached block
⇒
Every recall message is necessary & occurs after a cache miss (bounded overhead)
34
Make Necessary Recalls Rare
[Figure: percentage of misses causing recalls (0-100%) vs. ratio of aggregate private cache capacity to shared cache capacity (0-8), for shared-cache associativities of 1, 2, 4, and 8 ways, assuming misses to random sets [Hill & Smith 1989]; the Core i7 point and the expected design space are marked at low ratios]
• Recalls naturally rare when
Shared Cache Size / Σ Private Cache sizes > 2
(3) Recalls made rare
35
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages
(longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
36
4. Latency Overhead – Often None
[Figure: same diagram and key as before]
1. None: private hit
2. “None”: private miss + “direct” shared cache hit
3. “None”: private miss + shared cache miss
4. BUT …
37
4. Latency Overhead -- Some
[Figure: same diagram and key as before]
4. 1.5-2X: private miss + shared cache hit with indirection(s)
• How bad?
38
4. Latency Overhead -- Indirection
4. 1.5-2X: private miss + shared cache hit with indirection(s)
(interconnect + cache + interconnect + cache + interconnect)
-------------------------------------------------------------
(interconnect + cache + interconnect)
• Acceptable today
• Relative latency similar w/ more cores/hierarchy
• Vs. magically having data at shared cache
(4) Latency overhead bounded & scalable
39
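The 1.5-2X range follows from the ratio above. The sketch below plugs in placeholder hop and cache-lookup latencies (illustrative values, not measurements from the talk):

```python
# Indirection latency ratio (illustrative sketch; latencies are placeholders).

def indirection_ratio(interconnect, cache):
    direct   = interconnect + cache + interconnect                         # request + hit + reply
    indirect = interconnect + cache + interconnect + cache + interconnect  # extra forward hop + owner lookup
    return indirect / direct

for hop, hit in [(10, 10), (20, 10), (10, 30)]:
    print(f"hop={hop} cycles, cache={hit} cycles -> {indirection_ratio(hop, hit):.2f}x")
# Ratios fall roughly in the 1.5-2X range quoted on the slide.
```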
5. Energy Overhead
• Dynamic -- Small
o Extra message energy – traffic increase small/bounded
o Extra state lookup – small relative to cache block lookup
o …
• Static – Also Small
o Extra state – state increase small/bounded
o …
• Little effect on energy-intensive cores, cache data
arrays, off-chip DRAM, secondary storage, …
• (5) Energy overhead bounded & scalable
40
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages
(longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Apply analysis to caches used by AMD
Criticisms & Summary
41
Review Inclusive Shared Cache
[Figure: same chip; each shared-cache block holds tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), and block data (~512 bits)]
• Inclusive Shared Cache:
• Block in a private cache ⇒ In shared cache
• Blocks must be cached redundantly
42
Non-Inclusive Shared Cache
[Figure: same chip, but the shared level now has two structures]
1. Non-Inclusive Shared Cache: state (~2 bits), tag (~64 bits), block data (~512 bits)
- Any size or associativity
- Avoids redundant caching
- Allows victim caching
2. Inclusive Directory (probe filter): tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits)
- Dataless
- Ensures coherence
- But duplicates tags
43
Non-Inclusive Shared Cache
• Non-Inclusive Shared Cache: Data Block + Tag (any configuration)
• Inclusive Directory: Tag (again) + State
• Inclusive Directory == Coherence State Overhead
• WITH TWO LEVELS
o Directory size proportional to sum of private cache sizes
o 64b/(48b+512b) * 2 (for rare recalls) = 22% * Σ L1 size

  L2 / ΣL1s:   1      2      4      8
  Overhead:    11%    7.6%   4.6%   2.5%

• Coherence overhead higher than w/ inclusion
44
Non-Inclusive Shared Caches
WITH THREE LEVELS
• Cluster has L2 cache & cluster directory
o Cluster directory points to cores w/ L1 block (as before)
o (1) Size = 22% * ΣL1 sizes
• Chip has L3 cache & global directory
o Global directory points to clusters w/ block in:
o (2) Cluster directory, sized 22% * ΣL1s, +
o (3) Cluster L2 cache, sized 22% * ΣL2s

  L3 / ΣL2s = L2 / ΣL1s:   1      2      4      8
  Overhead (1)+(2)+(3):    23%    13%    6.5%   3.1%

• Hierarchical overhead higher than w/ inclusion
45
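Both tables follow from the same arithmetic: a directory entry costs 64 bits per (48b tag + 512b data) block of tracked capacity, doubled so recalls stay rare, i.e. about 22%. The sketch below reproduces the tables; measuring overhead against the summed capacity of all cache levels is a reconstruction of the denominator, chosen because it matches the slides' numbers.

```python
# Directory-overhead arithmetic behind the two tables above (illustrative sketch).

ENTRY = 2 * 64 / (48 + 512)          # ~0.229 of the cache capacity being tracked

def two_level_overhead(l2_over_l1):
    # Directory tracks the private L1s; overhead is relative to SumL1s + L2.
    return ENTRY * 1 / (1 + l2_over_l1)

def three_level_overhead(ratio):
    # ratio = L3/SumL2s = L2/SumL1s. Components (1)-(3) from the slide,
    # relative to SumL1s + SumL2s + L3 capacity.
    sum_l1, sum_l2, l3 = 1, ratio, ratio * ratio
    directories = ENTRY * (sum_l1 + sum_l1 + sum_l2)
    return directories / (sum_l1 + sum_l2 + l3)

for r in (1, 2, 4, 8):
    print(f"ratio {r}: two-level {two_level_overhead(r):.1%}, "
          f"three-level {three_level_overhead(r):.1%}")
# ratio 1: 11.4% / 22.9%;  2: 7.6% / 13.1%;  4: 4.6% / 6.5%;  8: 2.5% / 3.1%
```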
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages
(longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
46
Some Criticisms
(1) Where are workload-driven evaluations?
o Focused on robust analysis of first-order effects
(2) What about non-coherent approaches?
o Showed that compatible coherence scales
(3) What about protocol complexity?
o We have such protocols today (& ideas for better ones)
(4) What about multi-socket systems?
o Apply non-inclusive approaches
(5) What about software scalability?
o Hard SW work, but it need not re-implement coherence
Executive Summary
• Today chips provide shared memory w/ HW coherence
as low-level support for OS & application SW
• As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays by managing overheads
• Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today’s
⇒ Compatibility of on-chip HW coherence is here to stay
Let’s spend programmer sanity on parallelism, not lost compatibility!
48
Coherence NOT this Awkward
49