18-742
Parallel Computer Architecture
Lecture 5: Cache Coherence
Chris Craik (TA)
Carnegie Mellon University
Readings: Coherence
Required for Review
Required
Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.
Kelm et al., “Cohesion: A Hybrid Memory Model for Accelerators,” ISCA 2010.
Censier and Feautrier, “A new solution to coherence problems in multicache systems,” IEEE Trans. Comput., 1978.
Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.
Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.
Lenoski et al., “The Stanford DASH Multiprocessor,” IEEE Computer, 25(3):63-79, 1992.
Martin et al., “Token coherence: decoupling performance and correctness,” ISCA 2003.
Recommended
Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA 1988.
Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Trans. Comput., Sept. 1979, pp. 690-691.
Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.
Shared Memory Model
Many parallel programs communicate through shared memory
Proc 0 writes to an address, followed by Proc 1 reading
This implies communication between the two
Proc 0: Mem[A] = 1
Proc 1: … Print Mem[A]
Each read should receive the value last written by anyone
This requires synchronization (what does “last written” mean?)
What if Mem[A] is cached (at either end)?
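The communication pattern above can be sketched with Python threads. This is an illustrative sketch, not lecture material: the `Event` stands in for the synchronization the slide says is required, and all names here are my own.

```python
import threading

mem = {"A": 0}
done = threading.Event()  # stand-in for proper synchronization

def proc0():
    mem["A"] = 1          # Proc 0 writes to the shared location
    done.set()            # signal that the write has happened

def proc1(out):
    done.wait()           # without this, Proc 1 might read the old 0
    out.append(mem["A"])  # Proc 1 reads

result = []
t0 = threading.Thread(target=proc0)
t1 = threading.Thread(target=proc1, args=(result,))
t1.start(); t0.start()
t0.join(); t1.join()
print(result[0])  # 1, but only because the Event orders the accesses
```

Without the `Event`, "the value last written" is undefined: the read races the write.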
Cache Coherence
Basic question: If multiple processors cache the same
block, how do they ensure they all see a consistent state?
[Figure: P1 and P2, each with a private cache, connected by an interconnection network to main memory, where x = 1000]
The Cache Coherence Problem
[Figure sequence: P1 and P2, each with a private cache, share main memory over an interconnection network; x = 1000 in memory]
1. P2 executes ld r2, x and caches x = 1000.
2. P1 also executes ld r2, x; both caches now hold x = 1000.
3. P1 executes add r1, r2, r4 and st x, r1: P1's cache now holds x = 2000, while P2's cache and main memory still hold 1000.
4. P2 executes ld r5, x: it should NOT load the stale value 1000.
Cache Coherence: Whose Responsibility?
Software
Can the programmer ensure coherence if caches are
invisible to software?
What if the ISA provided a cache-flush instruction?
What needs to be flushed (what lines, and which caches)?
When does this need to be done?
Hardware
Simplifies software’s job
Knows sharer set
Doesn’t need to be conservative in synchronization
Coherence: Guarantees
Writes to location A by P0 should be seen by P1
(eventually), and all writes to A should appear in some
order
Coherence needs to provide:
Write propagation: guarantee that updates will propagate
Write serialization: provide a consistent global order seen
by all processors
Need a global point of serialization for this store ordering
Ordering between writes to different locations is a memory
consistency model problem: separate issue
Coherence: Update vs. Invalidate
How can we safely update replicated data?
Option 1: push updates to all copies
Option 2: ensure there is only one copy (local), update it
On a Read:
If local copy isn’t valid, put out request
(If another node has a copy, it returns it, otherwise
memory does)
Coherence: Update vs. Invalidate
On a Write:
Read block into cache as before
Update Protocol:
Write to block, and simultaneously broadcast written
data to sharers
(Other nodes update their caches if data was present)
Invalidate Protocol:
Write to block, and simultaneously broadcast invalidation
of address to sharers
(Other nodes clear block from cache)
Update vs. Invalidate
Which do we want?
Write frequency and sharing behavior are critical
Update
+ If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire pattern (broadcast updates instead)
- If data is rewritten without intervening reads by other cores, the updates were useless
- With a write-through cache policy, the bus becomes a bottleneck
Invalidate
+ After invalidation broadcast, core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid
mutual invalidation-reacquire)
Cache Coherence Methods
How do we ensure that the proper caches are updated?
Snoopy Bus [Goodman83, Papamarcos84]
Bus-based, single point of serialization
Processors observe other processors’ actions and infer ownership
E.g.: P1 makes “read-exclusive” request for A on bus, P0 sees this
and invalidates its own copy of A
Directory [Censier78, Lenoski92, Laudon97]
Single point of serialization per block, distributed among nodes
Processors make explicit requests for blocks
Directory tracks ownership (sharer set) for each block
Directory coordinates invalidation appropriately
E.g.: P1 asks directory for exclusive copy, directory asks P0 to
invalidate, waits for ACK, then responds to P1
Snoopy Bus vs. Directory Coherence
Snoopy
+ Critical path is short: miss → bus transaction → memory
+ Global serialization is easy: bus provides this already (arbitration)
+ Simple: adapt bus-based uniprocessors easily
- Requires single point of serialization (bus): not scalable
(not quite true that snoopy needs bus: recent work on this later)
Directory
- Requires extra storage space to track sharer sets
Can be approximate (false positives are OK)
- Adds indirection to critical path: request → directory → memory
- Protocols and race conditions are more complex
+ Exactly as scalable as interconnect and directory storage
(much more scalable than bus)
Snoopy-Bus Coherence
Snoopy Cache Coherence
Caches “snoop” (observe) each other’s write/read operations
A simple protocol (write-through, no-write-allocate cache; processor actions: PrRd, PrWr; bus actions: BusRd, BusWr):
Valid: PrRd / -- (hit); PrWr / BusWr (write-through)
Valid: observed BusWr → Invalid
Invalid: PrRd / BusRd → Valid
Invalid: PrWr / BusWr (stays Invalid: no allocation on a write miss)
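This two-state protocol is small enough to simulate directly. The following is a toy sketch (class and method names are my own, not from the lecture): every write is broadcast on the bus, and snooping caches invalidate their copies.

```python
# Toy simulator of the two-state (Valid/Invalid) write-through,
# no-write-allocate snoopy protocol described above.

class Cache:
    def __init__(self, name):
        self.name, self.state = name, "I"  # start Invalid

    def pr_rd(self, bus):
        if self.state == "I":
            bus.bus_rd(self)      # read miss -> BusRd, then Valid
            self.state = "V"
        # in Valid: read hit, no bus action

    def pr_wr(self, bus):
        bus.bus_wr(self)          # write-through: every write hits the bus
        # no-write-allocate: an Invalid cache stays Invalid on a write

    def snoop_bus_wr(self):
        self.state = "I"          # another cache's write invalidates us

class Bus:
    def __init__(self, caches):
        self.caches = caches
    def bus_rd(self, src):
        pass                      # memory supplies the data; nothing to model
    def bus_wr(self, src):
        for c in self.caches:     # broadcast: all other caches snoop it
            if c is not src:
                c.snoop_bus_wr()

c0, c1 = Cache("P0"), Cache("P1")
bus = Bus([c0, c1])
c0.pr_rd(bus); c1.pr_rd(bus)      # both caches become Valid
c0.pr_wr(bus)                     # P0 writes through; P1 is invalidated
print(c0.state, c1.state)         # V I
```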
Snoopy Invalidation Protocol: MSI
Extend single valid bit per block to three states:
M(odified): cache line is only copy and is dirty
S(hared): cache line is one of several copies
I(nvalid): not present
Read miss makes a Read request on bus, saves in S state
Write miss makes a ReadEx request, saves in M state
When a processor snoops ReadEx from another writer, it
must invalidate its own copy (if any)
S → M upgrade can be made without re-reading data from memory (via Invl)
MSI State Machine
M: PrRd / --; PrWr / --
M: observed BusRd / Flush → S; observed BusRdX / Flush → I
S: PrRd / --; PrWr / BusRdX → M
S: observed BusRd / --; observed BusRdX / -- → I
I: PrRd / BusRd → S; PrWr / BusRdX → M
Notation: Observed Event / Action
[Culler/Singh96]
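The MSI transitions above can be encoded as a table and exercised directly. A sketch under my own encoding (not the lecture's code): snooping is modeled by applying the bus action generated by one cache to every other cache.

```python
# Table-driven MSI sketch: (state, event) -> (next_state, bus_action).
MSI = {
    ("I", "PrRd"):  ("S", "BusRd"),
    ("I", "PrWr"):  ("M", "BusRdX"),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"):("I", None),
    ("S", "PrRd"):  ("S", None),
    ("S", "PrWr"):  ("M", "BusRdX"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"):("I", None),
    ("M", "PrRd"):  ("M", None),
    ("M", "PrWr"):  ("M", None),
    ("M", "BusRd"): ("S", "Flush"),   # downgrade, write data back
    ("M", "BusRdX"):("I", "Flush"),   # invalidate, write data back
}

def step(states, who, event):
    """Apply a processor event at cache `who`; others snoop the bus action."""
    states = dict(states)
    states[who], action = MSI[(states[who], event)]
    if action in ("BusRd", "BusRdX"):
        for other in states:
            if other != who:
                states[other], _ = MSI[(states[other], action)]
    return states

s = {"P0": "I", "P1": "I"}
s = step(s, "P0", "PrRd")   # P0 misses: BusRd, P0 -> S
s = step(s, "P1", "PrWr")   # P1 write miss: BusRdX invalidates P0
print(s)                    # {'P0': 'I', 'P1': 'M'}
```

Note how write serialization falls out of the table: only one cache can be in M, because every entry into M generates a BusRdX that forces all others to I.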
MSI Correctness
Write propagation
Immediately after write:
New value will exist only in writer’s cache
Block will be in M state
Transition into M state ensures all other caches are in I state
Upon a read in another thread, that cache will miss → BusRd
BusRd causes flush (writeback), and read will see new value
Write serialization
Only one cache can be in M state at a time
Entering M state generates BusRdX
BusRdX causes transition out of M state in other caches
Order of block ownership (M state) defines write ordering
This order is global by virtue of central bus
More States: MESI
Invalid → Shared → Modified sequence takes two bus ops
What if data is not shared? Unnecessary broadcasts
Exclusive state: this is the only copy, and it is clean
Block is exclusive if, during BusRd, no other cache had it
Wired-OR “shared” signal on bus can determine this: snooping
caches assert the signal if they also have a copy
Another BusRd also causes transition into Shared
Silent transition Exclusive → Modified is possible on write!
MESI is also called the Illinois protocol [Papamarcos84]
MESI State Machine
M: PrRd / --; PrWr / --
M: observed BusRd / Flush → S; observed BusRdX / Flush → I
E: PrRd / --; PrWr / -- → M (silent, no bus transaction)
E: observed BusRd / $ Transfer → S; observed BusRdX / -- → I
S: PrRd / --; PrWr / BusRdX → M
S: observed BusRd / --; observed BusRdX / -- → I
I: PrRd (S') / BusRd → E; PrRd (S) / BusRd → S; PrWr / BusRdX → M
[Culler/Singh96]
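The two MESI-specific behaviors, allocation in E when the wired-OR “shared” line is not asserted and the silent E → M upgrade, can be sketched as follows. This is my own simplified encoding (flushes on the M → S downgrade are not modeled), not lecture code.

```python
def read_miss(states, who):
    # Other caches assert the shared line if they hold the block.
    shared = any(st != "I" for p, st in states.items() if p != who)
    states = dict(states)
    if shared:
        for p in states:
            if p != who and states[p] in ("M", "E"):
                states[p] = "S"   # M would flush on its way to S (not modeled)
        states[who] = "S"
    else:
        states[who] = "E"         # sole copy, clean: no sharer asserted the line
    return states

def write(states, who):
    states = dict(states)
    if states[who] == "E":
        states[who] = "M"          # silent upgrade: no bus transaction needed
    elif states[who] != "M":
        for p in states:
            if p != who:
                states[p] = "I"    # BusRdX invalidates all other copies
        states[who] = "M"
    return states

s0 = {"P0": "I", "P1": "I"}
s1 = read_miss(s0, "P0")   # nobody else has it -> E, not S
s2 = write(s1, "P0")       # E -> M without broadcasting
print(s1["P0"], s2["P0"])  # E M
```

Under plain MSI the same sequence would cost a BusRd and then a BusRdX; the E state removes the second bus op for unshared data.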
Snoopy Invalidation Tradeoffs
Should a downgrade from M go to S or I?
S: if data is likely to be reused
I: if data is likely to be migratory
Cache-to-cache transfer
On a BusRd, should data come from another cache, or memory?
Another cache:
+ May be faster, if memory is slow or highly contended
Memory:
+ Simpler: memory doesn’t need to wait to see if a cache has the data first
+ Less contention at the other caches
- Requires writeback on M downgrade
Writeback on Modified->Shared: necessary?
One possibility: Owner (O) state (MOESI system)
One cache owns the latest data (memory is not updated)
Memory writeback happens when all caches evict copies
Update-Based Coherence: Dragon
Four states:
(E)xclusive: Only copy, clean
(Sm): shared, modified (Owner state in MOESI)
(Sc): shared, clean (with respect to Sm, not memory)
(M)odified: only copy, dirty
Use of updates allows multiple copies of dirty data
No I state: there is no invalidate (only ordinary evictions)
Invariant: at most one Sm in a sharer set
If a cache is Sm, it is authoritative, and Sc caches are
clean relative to this, not clean to memory
McCreight, E. “The Dragon computer system: an early overview.” Tech Report, Xerox Corp., Sept 1984.
(cited in Culler/Singh96)
Dragon State Machine
E: PrRd / --; PrWr / -- → M; observed BusRd / -- → Sc
Sc: PrRd / --; observed BusUpd / Update
Sc: PrWr / BusUpd(S) → Sm; PrWr / BusUpd(S') → M
Sm: PrRd / --; PrWr / BusUpd(S); observed BusRd / Flush; observed BusUpd / Update → Sc
Sm: PrWr / BusUpd(S') → M
M: PrRd / --; PrWr / --; observed BusRd / Flush → Sm
Misses: PrRdMiss / BusRd(S') → E; PrRdMiss / BusRd(S) → Sc; PrWrMiss / (BusRd(S); BusUpd) → Sm; PrWrMiss / BusRd(S') → M
[Culler/Singh96]
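The defining update-based behavior can be sketched in a few lines: a write by the owner broadcasts BusUpd, and Sc sharers patch their copies in place instead of being invalidated. This is an illustrative sketch with my own names, reduced to the Sm/Sc interaction (the full miss transitions are omitted).

```python
class DragonCache:
    def __init__(self):
        self.state, self.data = "I", None  # "I" here just means not present

    def fill(self, data, shared):
        self.data = data
        self.state = "Sc" if shared else "E"

    def write(self, data, peers):
        self.data = data
        sharers = [p for p in peers if p.state != "I"]
        if sharers:
            self.state = "Sm"          # at most one Sm owner per sharer set
            for p in sharers:
                p.data = data          # BusUpd: update sharers, don't invalidate
                p.state = "Sc"
        else:
            self.state = "M"           # sole copy, dirty

a, b = DragonCache(), DragonCache()
a.fill(1000, shared=False)   # a: E (only copy, clean)
b.fill(1000, shared=True)    # b: Sc (a would also snoop down to Sc)
a.write(2000, peers=[b])     # BusUpd keeps b's copy current
print(a.state, b.state, b.data)   # Sm Sc 2000
```

Contrast with MSI/MESI: after the same write, b would hold nothing and would have to re-read the block.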
Update-Protocol Tradeoffs
Shared-Modified state vs. keeping memory up-to-date
Equivalent to write-through cache when there are multiple
sharers
Immediate vs. lazy notification of Sc-block eviction
Immediate: if there is only one sharer left, it can go to E or M
and save a BusUpd later (only one saved)
Lazy: no extra traffic required upon eviction
Directory-Based Coherence
Directory-Based Protocols
Required when scaling past the capacity of a single bus
Distributed, but:
Coherence still requires single point of serialization (for write
serialization)
This can be different for every block (striped across nodes)
We can reason about the protocol for a single block: one
server (directory node), many clients (private caches)
Directory receives Read and ReadEx requests, and sends
Invl requests: invalidation is explicit (as opposed to snoopy
buses)
Directory: Data Structures
0x00: Shared: {P0, P1, P2}
0x04: --
0x08: Exclusive: P2
0x0C: --
…
Key operation to support is a set inclusion test
False positives are OK: want to know which caches may contain a copy of a block, and spurious invals are ignored
False positive rate determines performance
Most accurate (and expensive): full bit-vector
Compressed representation, linked list, Bloom filter [Zebchuk09] are all possible
Here, we will assume the directory has perfect knowledge
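The exact-vs-approximate trade-off above can be made concrete with two sharer-set representations: a full bit-vector, and a “coarse” vector with one bit per group of processors, whose inclusion test may return false positives (the names and the grouping scheme are mine, chosen for illustration).

```python
class BitVector:
    """Exact sharer set: one bit per processor."""
    def __init__(self):
        self.bits = 0
    def add(self, p):
        self.bits |= 1 << p
    def may_contain(self, p):
        return bool(self.bits & (1 << p))      # exact answer

class CoarseVector:
    """Approximate sharer set: one bit per group of g processors."""
    def __init__(self, g=4):
        self.g, self.bits = g, 0
    def add(self, p):
        self.bits |= 1 << (p // self.g)
    def may_contain(self, p):
        # A set bit only proves someone in p's group is a sharer,
        # so this can be a false positive -> a spurious, harmless inval.
        return bool(self.bits & (1 << (p // self.g)))

exact, coarse = BitVector(), CoarseVector()
for p in (0, 5):
    exact.add(p); coarse.add(p)
print(exact.may_contain(1))   # False: P1 never shared the block
print(coarse.may_contain(1))  # True: false positive (P1 shares P0's group)
```

The coarse vector uses a quarter of the storage here at the cost of extra invalidation traffic; a Bloom filter makes the same trade with different constants.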
Directory: Basic Operations
Follow semantics of snoop-based system, but with explicit request and reply messages
Directory:
Receives Read, ReadEx, Upgrade requests from nodes
Sends Inval/Downgrade messages to sharers if needed
Forwards request to memory if needed
Replies to requestor and updates sharing state
Protocol design is flexible
Exact forwarding paths depend on implementation
For example, do cache-to-cache transfer?
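A minimal per-block directory entry implementing the Read / ReadEx handling described above might look as follows. This is a sketch under my own message names (Downgrade, Inval, DatShr, DatEx); no network or memory is modeled, and the entry simply returns the messages it would send.

```python
class DirEntry:
    """Directory state for one block: a sharer set plus an optional owner."""
    def __init__(self):
        self.sharers = set()
        self.owner = None          # set iff some cache holds the block exclusive

    def read(self, req):
        msgs = []
        if self.owner is not None:
            msgs.append(("Downgrade", self.owner))  # owner flushes, keeps S copy
            self.sharers.add(self.owner)
            self.owner = None
        self.sharers.add(req)
        msgs.append(("DatShr", req))
        return msgs

    def read_ex(self, req):
        targets = sorted(self.sharers - {req})       # sorted for determinism
        if self.owner and self.owner != req:
            targets.append(self.owner)
        msgs = [("Inval", p) for p in targets]       # invalidate all other copies
        self.sharers.clear()
        self.owner = req
        return msgs + [("DatEx", req)]

d = DirEntry()
d.read("P0"); d.read("P1")        # P0, P1 become sharers
msgs = d.read_ex("P2")            # P2 wants exclusive access
print(msgs)  # [('Inval', 'P0'), ('Inval', 'P1'), ('DatEx', 'P2')]
```

A real protocol would wait for Inval ACKs before sending DatEx; here the entry only computes who must be notified.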
MESI Directory Transaction: Read
P0 acquires an address for reading:
1. Read: P0 → Home
2. DatEx (DatShr): Home → P0
Culler/Singh Fig. 8.16
RdEx with Former Owner
1. RdEx: P0 → Home
2. Invl: Home → Owner
3a. Rev: Owner → Home
3b. DatEx: Owner → P0
Contention Resolution (for Write)
P0
1a. RdEx
1b. RdEx
4. Invl
3. RdEx
5a. Rev
Home
2a. DatEx
P1
2b. NACK
5b. DatEx
33
Issues with Contention Resolution
Need to escape race conditions by:
NACKing requests to busy (pending invalidate) entries
Original requestor retries
OR, queuing requests and granting in sequence
(Or some combination thereof)
Fairness
Which requestor should be preferred in a conflict?
Interconnect delivery order, and distance, both matter
We guarantee that some node will make forward progress
Ping-ponging is a higher-level issue
With solutions like combining trees (for locks/barriers) and
better shared-data-structure design
Protocol and Directory Tradeoffs
Forwarding vs. strict request-reply
Speculative replies from memory/directory node
Decreases critical path length in best case
More complex implementation (and potentially more network
traffic)
Directory storage can imply protocol design
E.g., linked list for sharer set
Shorten critical path by creating a chain of request forwarding
Increases complexity
Hansson et al., “Avoiding message-dependent deadlock in network-based systems on chip,” VLSI Design 2007.
Other Issues / Backup
Memory Consistency (Briefly)
We consider only sequential consistency [Lamport79] here
Sequential Consistency gives the appearance that:
All operations (R and W) happen atomically in a global order
Operations from a single thread occur in order in this stream
Proc 0: A = 1;
Proc 1: while (A == 0); B = 1;
Proc 2: while (B == 0); print A (A = 1?)
Thus, ordering between different mem locations exists
More relaxed models exist; usually require memory barriers
when synchronizing
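The three-processor example can be checked by brute force: enumerate every sequentially consistent interleaving and collect every value Proc 2 could print. This is an illustrative checker of my own, not lecture code; spin loops are modeled as guards that block a thread until the condition holds.

```python
def runs(threads, mem, pc, outputs):
    """DFS over all SC interleavings of per-thread op lists.

    Ops: ("write", var, val), ("wait", var, val) for a spin loop,
    ("print", var). Collects every printed value in `outputs`.
    """
    for i, ops in enumerate(threads):
        if pc[i] >= len(ops):
            continue                      # thread finished
        op = ops[pc[i]]
        if op[0] == "wait" and mem[op[1]] != op[2]:
            continue                      # spinning: cannot make progress yet
        m2 = dict(mem)
        if op[0] == "write":
            m2[op[1]] = op[2]
        elif op[0] == "print":
            outputs.add(m2[op[1]])
        pc2 = list(pc); pc2[i] += 1
        runs(threads, m2, pc2, outputs)   # branch on every enabled thread
    return outputs

threads = [
    [("write", "A", 1)],                     # Proc 0
    [("wait", "A", 1), ("write", "B", 1)],   # Proc 1
    [("wait", "B", 1), ("print", "A")],      # Proc 2
]
outs = runs(threads, {"A": 0, "B": 0}, [0, 0, 0], set())
print(outs)  # {1}: under SC, Proc 2 can only ever see A == 1
```

Under a relaxed model the writes to A and B could be reordered, and the checker's single-order assumption, i.e. exactly what SC provides, would no longer hold.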
Correctness Issue: Inclusion
What happens with multilevel caches?
Snooping level (say, L2) must know about all data in
private hierarchy above it (L1)
Inclusive cache is one solution [Baer88]
What about directories?
L2 must contain all data that L1 contains
Must propagate invalidates upward to L1
Other options
Non-inclusive: inclusion property is optional
Why would L2 evict if it has more sets and more ways than L1?
Prefetching!
Exclusive: line is in L1 xor L2 (AMD K7)
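The L2-evicts-a-line case can be sketched directly: to preserve inclusion, an eviction from the L2 must back-invalidate the L1 above it, so the L2's contents remain a superset of the L1's and snoops never need to probe the L1. A toy LRU model with my own names, simplified so every L2 fill also fills the L1:

```python
class InclusiveL2:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []          # LRU order: least recent first
        self.l1 = set()          # lines currently in the L1 above us

    def access(self, addr):
        if addr in self.lines:
            self.lines.remove(addr)          # refresh LRU position
        elif len(self.lines) == self.capacity:
            victim = self.lines.pop(0)       # evict LRU line from L2...
            self.l1.discard(victim)          # ...and back-invalidate the L1
        self.lines.append(addr)
        self.l1.add(addr)                    # fill into L1 too (simplified)

l2 = InclusiveL2(capacity=2)
for a in ("x", "y", "z"):                    # filling "z" evicts "x"
    l2.access(a)
print("x" in l2.l1, l2.l1 <= set(l2.lines))  # False True
```

Without the back-invalidation step, "x" could linger in the L1 while the L2 knows nothing about it, and a snoop filtered by the L2 would miss it.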