18-742
Parallel Computer Architecture
Lecture 5: Cache Coherence
Chris Craik (TA)
Carnegie Mellon University
Readings: Coherence
Required for Review
Required
Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.
Kelm et al., “Cohesion: A Hybrid Memory Model for Accelerators,” ISCA 2010.
Censier and Feautrier, “A new solution to coherence problems in multicache systems,” IEEE Trans. Comput., 1978.
Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.
Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.
Lenoski et al., “The Stanford DASH Multiprocessor,” IEEE Computer, 25(3):63-79, 1992.
Martin et al., “Token coherence: decoupling performance and correctness,” ISCA 2003.
Recommended
Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA 1988.
Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Trans. Comput., Sept. 1979, pp. 690-691.
Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.
Shared Memory Model
Many parallel programs communicate through shared memory
Proc 0 writes to an address, followed by Proc 1 reading
This implies communication between the two
Proc 0: Mem[A] = 1
Proc 1: … Print Mem[A]
Each read should receive the value last written by anyone
This requires synchronization (what does “last written” mean?)
What if Mem[A] is cached (at either end)?
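The communication pattern above can be sketched with Python threads. This is an illustrative sketch, not lecture material: the `Event` stands in for the synchronization the slide says is required, and all names here are my own.

```python
import threading

mem = {"A": 0}
done = threading.Event()  # stand-in for proper synchronization

def proc0():
    mem["A"] = 1          # Proc 0 writes to the shared location
    done.set()            # signal that the write has happened

def proc1(out):
    done.wait()           # without this, Proc 1 might read the old 0
    out.append(mem["A"])  # Proc 1 reads

result = []
t0 = threading.Thread(target=proc0)
t1 = threading.Thread(target=proc1, args=(result,))
t1.start(); t0.start()
t0.join(); t1.join()
print(result[0])  # 1, but only because the Event orders the accesses
```

Without the `Event`, "the value last written" is undefined: the read races the write.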
Cache Coherence
Basic question: If multiple processors cache the same
block, how do they ensure they all see a consistent state?
[Figure: P1 and P2, each with a private cache, connected by an interconnection network to main memory, where x = 1000]
The Cache Coherence Problem
[Figure sequence: P1 and P2, each with a private cache, share main memory over an interconnection network; x = 1000 in memory]
1. P2 executes ld r2, x and caches x = 1000.
2. P1 also executes ld r2, x; both caches now hold x = 1000.
3. P1 executes add r1, r2, r4 and st x, r1: P1's cache now holds x = 2000, while P2's cache and main memory still hold 1000.
4. P2 executes ld r5, x: it should NOT load the stale value 1000.
Cache Coherence: Whose Responsibility?
Software
Can the programmer ensure coherence if caches are
invisible to software?
What if the ISA provided a cache-flush instruction?
What needs to be flushed (what lines, and which caches)?
When does this need to be done?
Hardware
Simplifies software’s job
Knows sharer set
Doesn’t need to be conservative in synchronization
Coherence: Guarantees
Writes to location A by P0 should be seen by P1
(eventually), and all writes to A should appear in some
order
Coherence needs to provide:
Write propagation: guarantee that updates will propagate
Write serialization: provide a consistent global order seen
by all processors
Need a global point of serialization for this store ordering
Ordering between writes to different locations is a memory
consistency model problem: separate issue
Coherence: Update vs. Invalidate
How can we safely update replicated data?
Option 1: push updates to all copies
Option 2: ensure there is only one copy (local), update it
On a Read:
If local copy isn’t valid, put out request
(If another node has a copy, it returns it, otherwise
memory does)
Coherence: Update vs. Invalidate
On a Write:
Read block into cache as before
Update Protocol:
Write to block, and simultaneously broadcast written
data to sharers
(Other nodes update their caches if data was present)
Invalidate Protocol:
Write to block, and simultaneously broadcast invalidation
of address to sharers
(Other nodes clear block from cache)
Update vs. Invalidate
Which do we want?
Write frequency and sharing behavior are critical
Update
+ If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire pattern (broadcast updates instead)
- If data is rewritten without intervening reads by other cores, the updates were useless
- With a write-through cache policy, the bus becomes a bottleneck
Invalidate
+ After invalidation broadcast, core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid
mutual invalidation-reacquire)
Cache Coherence Methods
How do we ensure that the proper caches are updated?
Snoopy Bus [Goodman83, Papamarcos84]
Bus-based, single point of serialization
Processors observe other processors’ actions and infer ownership
E.g.: P1 makes “read-exclusive” request for A on bus, P0 sees this
and invalidates its own copy of A
Directory [Censier78, Lenoski92, Laudon97]
Single point of serialization per block, distributed among nodes
Processors make explicit requests for blocks
Directory tracks ownership (sharer set) for each block
Directory coordinates invalidation appropriately
E.g.: P1 asks directory for exclusive copy, directory asks P0 to
invalidate, waits for ACK, then responds to P1
Snoopy Bus vs. Directory Coherence
Snoopy
+ Critical path is short: miss → bus transaction → memory
+ Global serialization is easy: bus provides this already (arbitration)
+ Simple: adapt bus-based uniprocessors easily
- Requires single point of serialization (bus): not scalable
(not quite true that snoopy needs bus: recent work on this later)
Directory
- Requires extra storage space to track sharer sets
Can be approximate (false positives are OK)
- Adds indirection to critical path: request → directory → memory
- Protocols and race conditions are more complex
+ Exactly as scalable as interconnect and directory storage
(much more scalable than bus)
Snoopy-Bus Coherence
Snoopy Cache Coherence
Caches “snoop” (observe) each other’s write/read operations
A simple protocol (write-through, no-write-allocate cache; processor actions: PrRd, PrWr; bus actions: BusRd, BusWr):
Valid: PrRd / -- (hit); PrWr / BusWr (write-through)
Valid: observed BusWr → Invalid
Invalid: PrRd / BusRd → Valid
Invalid: PrWr / BusWr (stays Invalid: no allocation on a write miss)
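This two-state protocol is small enough to simulate directly. The following is a toy sketch (class and method names are my own, not from the lecture): every write is broadcast on the bus, and snooping caches invalidate their copies.

```python
# Toy simulator of the two-state (Valid/Invalid) write-through,
# no-write-allocate snoopy protocol described above.

class Cache:
    def __init__(self, name):
        self.name, self.state = name, "I"  # start Invalid

    def pr_rd(self, bus):
        if self.state == "I":
            bus.bus_rd(self)      # read miss -> BusRd, then Valid
            self.state = "V"
        # in Valid: read hit, no bus action

    def pr_wr(self, bus):
        bus.bus_wr(self)          # write-through: every write hits the bus
        # no-write-allocate: an Invalid cache stays Invalid on a write

    def snoop_bus_wr(self):
        self.state = "I"          # another cache's write invalidates us

class Bus:
    def __init__(self, caches):
        self.caches = caches
    def bus_rd(self, src):
        pass                      # memory supplies the data; nothing to model
    def bus_wr(self, src):
        for c in self.caches:     # broadcast: all other caches snoop it
            if c is not src:
                c.snoop_bus_wr()

c0, c1 = Cache("P0"), Cache("P1")
bus = Bus([c0, c1])
c0.pr_rd(bus); c1.pr_rd(bus)      # both caches become Valid
c0.pr_wr(bus)                     # P0 writes through; P1 is invalidated
print(c0.state, c1.state)         # V I
```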
Snoopy Invalidation Protocol: MSI
Extend single valid bit per block to three states:
M(odified): cache line is only copy and is dirty
S(hared): cache line is one of several copies
I(nvalid): not present
Read miss makes a Read request on bus, saves in S state
Write miss makes a ReadEx request, saves in M state
When a processor snoops ReadEx from another writer, it
must invalidate its own copy (if any)
S → M upgrade can be made without re-reading data from memory (via Invl)
MSI State Machine
M: PrRd / --; PrWr / --
M: observed BusRd / Flush → S; observed BusRdX / Flush → I
S: PrRd / --; PrWr / BusRdX → M
S: observed BusRd / --; observed BusRdX / -- → I
I: PrRd / BusRd → S; PrWr / BusRdX → M
Notation: Observed Event / Action
[Culler/Singh96]
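The MSI transitions above can be encoded as a table and exercised directly. A sketch under my own encoding (not the lecture's code): snooping is modeled by applying the bus action generated by one cache to every other cache.

```python
# Table-driven MSI sketch: (state, event) -> (next_state, bus_action).
MSI = {
    ("I", "PrRd"):  ("S", "BusRd"),
    ("I", "PrWr"):  ("M", "BusRdX"),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"):("I", None),
    ("S", "PrRd"):  ("S", None),
    ("S", "PrWr"):  ("M", "BusRdX"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"):("I", None),
    ("M", "PrRd"):  ("M", None),
    ("M", "PrWr"):  ("M", None),
    ("M", "BusRd"): ("S", "Flush"),   # downgrade, write data back
    ("M", "BusRdX"):("I", "Flush"),   # invalidate, write data back
}

def step(states, who, event):
    """Apply a processor event at cache `who`; others snoop the bus action."""
    states = dict(states)
    states[who], action = MSI[(states[who], event)]
    if action in ("BusRd", "BusRdX"):
        for other in states:
            if other != who:
                states[other], _ = MSI[(states[other], action)]
    return states

s = {"P0": "I", "P1": "I"}
s = step(s, "P0", "PrRd")   # P0 misses: BusRd, P0 -> S
s = step(s, "P1", "PrWr")   # P1 write miss: BusRdX invalidates P0
print(s)                    # {'P0': 'I', 'P1': 'M'}
```

Note how write serialization falls out of the table: only one cache can be in M, because every entry into M generates a BusRdX that forces all others to I.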
MSI Correctness
Write propagation
Immediately after write:
New value will exist only in writer’s cache
Block will be in M state
Transition into M state ensures all other caches are in I state
Upon a read in another thread, that cache will miss → BusRd
BusRd causes flush (writeback), and read will see new value
Write serialization
Only one cache can be in M state at a time
Entering M state generates BusRdX
BusRdX causes transition out of M state in other caches
Order of block ownership (M state) defines write ordering
This order is global by virtue of central bus
More States: MESI
Invalid → Shared → Modified sequence takes two bus ops
What if data is not shared? Unnecessary broadcasts
Exclusive state: this is the only copy, and it is clean
Block is exclusive if, during BusRd, no other cache had it
Wired-OR “shared” signal on bus can determine this: snooping
caches assert the signal if they also have a copy
Another BusRd also causes transition into Shared
Silent transition Exclusive → Modified is possible on write!
MESI is also called the Illinois protocol [Papamarcos84]
MESI State Machine
M: PrRd / --; PrWr / --
M: observed BusRd / Flush → S; observed BusRdX / Flush → I
E: PrRd / --; PrWr / -- → M (silent, no bus transaction)
E: observed BusRd / $ Transfer → S; observed BusRdX / -- → I
S: PrRd / --; PrWr / BusRdX → M
S: observed BusRd / --; observed BusRdX / -- → I
I: PrRd (S') / BusRd → E; PrRd (S) / BusRd → S; PrWr / BusRdX → M
[Culler/Singh96]
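The two MESI-specific behaviors, allocation in E when the wired-OR “shared” line is not asserted and the silent E → M upgrade, can be sketched as follows. This is my own simplified encoding (flushes on the M → S downgrade are not modeled), not lecture code.

```python
def read_miss(states, who):
    # Other caches assert the shared line if they hold the block.
    shared = any(st != "I" for p, st in states.items() if p != who)
    states = dict(states)
    if shared:
        for p in states:
            if p != who and states[p] in ("M", "E"):
                states[p] = "S"   # M would flush on its way to S (not modeled)
        states[who] = "S"
    else:
        states[who] = "E"         # sole copy, clean: no sharer asserted the line
    return states

def write(states, who):
    states = dict(states)
    if states[who] == "E":
        states[who] = "M"          # silent upgrade: no bus transaction needed
    elif states[who] != "M":
        for p in states:
            if p != who:
                states[p] = "I"    # BusRdX invalidates all other copies
        states[who] = "M"
    return states

s0 = {"P0": "I", "P1": "I"}
s1 = read_miss(s0, "P0")   # nobody else has it -> E, not S
s2 = write(s1, "P0")       # E -> M without broadcasting
print(s1["P0"], s2["P0"])  # E M
```

Under plain MSI the same sequence would cost a BusRd and then a BusRdX; the E state removes the second bus op for unshared data.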
Snoopy Invalidation Tradeoffs
Should a downgrade from M go to S or I?
S: if data is likely to be reused
I: if data is likely to be migratory
Cache-to-cache transfer
On a BusRd, should data come from another cache, or memory?
Another cache:
+ May be faster, if memory is slow or highly contended
Memory:
+ Simpler: memory doesn’t need to wait to see if a cache has the data first
+ Less contention at the other caches
- Requires writeback on M downgrade
Writeback on Modified->Shared: necessary?
One possibility: Owner (O) state (MOESI system)
One cache owns the latest data (memory is not updated)
Memory writeback happens when all caches evict copies
Update-Based Coherence: Dragon
Four states:
(E)xclusive: Only copy, clean
(Sm): shared, modified (Owner state in MOESI)
(Sc): shared, clean (with respect to Sm, not memory)
(M)odified: only copy, dirty
Use of updates allows multiple copies of dirty data
No I state: there is no invalidate (only ordinary evictions)
Invariant: at most one Sm in a sharer set
If a cache is Sm, it is authoritative, and Sc caches are
clean relative to this, not clean to memory
McCreight, E. “The Dragon computer system: an early overview.” Tech Report, Xerox Corp., Sept 1984.
(cited in Culler/Singh96)
Dragon State Machine
E: PrRd / --; PrWr / -- → M; observed BusRd / -- → Sc
Sc: PrRd / --; observed BusUpd / Update
Sc: PrWr / BusUpd(S) → Sm; PrWr / BusUpd(S') → M
Sm: PrRd / --; PrWr / BusUpd(S); observed BusRd / Flush; observed BusUpd / Update → Sc
Sm: PrWr / BusUpd(S') → M
M: PrRd / --; PrWr / --; observed BusRd / Flush → Sm
Misses: PrRdMiss / BusRd(S') → E; PrRdMiss / BusRd(S) → Sc; PrWrMiss / (BusRd(S); BusUpd) → Sm; PrWrMiss / BusRd(S') → M
[Culler/Singh96]
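The defining update-based behavior can be sketched in a few lines: a write by the owner broadcasts BusUpd, and Sc sharers patch their copies in place instead of being invalidated. This is an illustrative sketch with my own names, reduced to the Sm/Sc interaction (the full miss transitions are omitted).

```python
class DragonCache:
    def __init__(self):
        self.state, self.data = "I", None  # "I" here just means not present

    def fill(self, data, shared):
        self.data = data
        self.state = "Sc" if shared else "E"

    def write(self, data, peers):
        self.data = data
        sharers = [p for p in peers if p.state != "I"]
        if sharers:
            self.state = "Sm"          # at most one Sm owner per sharer set
            for p in sharers:
                p.data = data          # BusUpd: update sharers, don't invalidate
                p.state = "Sc"
        else:
            self.state = "M"           # sole copy, dirty

a, b = DragonCache(), DragonCache()
a.fill(1000, shared=False)   # a: E (only copy, clean)
b.fill(1000, shared=True)    # b: Sc (a would also snoop down to Sc)
a.write(2000, peers=[b])     # BusUpd keeps b's copy current
print(a.state, b.state, b.data)   # Sm Sc 2000
```

Contrast with MSI/MESI: after the same write, b would hold nothing and would have to re-read the block.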
Update-Protocol Tradeoffs
Shared-Modified state vs. keeping memory up-to-date
Equivalent to write-through cache when there are multiple
sharers
Immediate vs. lazy notification of Sc-block eviction
Immediate: if there is only one sharer left, it can go to E or M
and save a BusUpd later (only one saved)
Lazy: no extra traffic required upon eviction
Directory-Based Coherence
Directory-Based Protocols
Required when scaling past the capacity of a single bus
Distributed, but:
Coherence still requires single point of serialization (for write
serialization)
This can be different for every block (striped across nodes)
We can reason about the protocol for a single block: one
server (directory node), many clients (private caches)
Directory receives Read and ReadEx requests, and sends
Invl requests: invalidation is explicit (as opposed to snoopy
buses)
Directory: Data Structures
0x00: Shared: {P0, P1, P2}
0x04: --
0x08: Exclusive: P2
0x0C: --
…
Key operation to support is a set inclusion test
False positives are OK: want to know which caches may contain a copy of a block, and spurious invals are ignored
False positive rate determines performance
Most accurate (and expensive): full bit-vector
Compressed representation, linked list, Bloom filter [Zebchuk09] are all possible
Here, we will assume the directory has perfect knowledge
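The exact-vs-approximate trade-off above can be made concrete with two sharer-set representations: a full bit-vector, and a “coarse” vector with one bit per group of processors, whose inclusion test may return false positives (the names and the grouping scheme are mine, chosen for illustration).

```python
class BitVector:
    """Exact sharer set: one bit per processor."""
    def __init__(self):
        self.bits = 0
    def add(self, p):
        self.bits |= 1 << p
    def may_contain(self, p):
        return bool(self.bits & (1 << p))      # exact answer

class CoarseVector:
    """Approximate sharer set: one bit per group of g processors."""
    def __init__(self, g=4):
        self.g, self.bits = g, 0
    def add(self, p):
        self.bits |= 1 << (p // self.g)
    def may_contain(self, p):
        # A set bit only proves someone in p's group is a sharer,
        # so this can be a false positive -> a spurious, harmless inval.
        return bool(self.bits & (1 << (p // self.g)))

exact, coarse = BitVector(), CoarseVector()
for p in (0, 5):
    exact.add(p); coarse.add(p)
print(exact.may_contain(1))   # False: P1 never shared the block
print(coarse.may_contain(1))  # True: false positive (P1 shares P0's group)
```

The coarse vector uses a quarter of the storage here at the cost of extra invalidation traffic; a Bloom filter makes the same trade with different constants.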
Directory: Basic Operations
Follow semantics of snoop-based system, but with explicit request and reply messages
Directory:
Receives Read, ReadEx, Upgrade requests from nodes
Sends Inval/Downgrade messages to sharers if needed
Forwards request to memory if needed
Replies to requestor and updates sharing state
Protocol design is flexible
Exact forwarding paths depend on implementation
For example, do cache-to-cache transfer?
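A minimal per-block directory entry implementing the Read / ReadEx handling described above might look as follows. This is a sketch under my own message names (Downgrade, Inval, DatShr, DatEx); no network or memory is modeled, and the entry simply returns the messages it would send.

```python
class DirEntry:
    """Directory state for one block: a sharer set plus an optional owner."""
    def __init__(self):
        self.sharers = set()
        self.owner = None          # set iff some cache holds the block exclusive

    def read(self, req):
        msgs = []
        if self.owner is not None:
            msgs.append(("Downgrade", self.owner))  # owner flushes, keeps S copy
            self.sharers.add(self.owner)
            self.owner = None
        self.sharers.add(req)
        msgs.append(("DatShr", req))
        return msgs

    def read_ex(self, req):
        targets = sorted(self.sharers - {req})       # sorted for determinism
        if self.owner and self.owner != req:
            targets.append(self.owner)
        msgs = [("Inval", p) for p in targets]       # invalidate all other copies
        self.sharers.clear()
        self.owner = req
        return msgs + [("DatEx", req)]

d = DirEntry()
d.read("P0"); d.read("P1")        # P0, P1 become sharers
msgs = d.read_ex("P2")            # P2 wants exclusive access
print(msgs)  # [('Inval', 'P0'), ('Inval', 'P1'), ('DatEx', 'P2')]
```

A real protocol would wait for Inval ACKs before sending DatEx; here the entry only computes who must be notified.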
MESI Directory Transaction: Read
P0 acquires an address for reading:
1. Read: P0 → Home
2. DatEx (DatShr): Home → P0
Culler/Singh Fig. 8.16
RdEx with Former Owner
1. RdEx: P0 → Home
2. Invl: Home → Owner
3a. Rev: Owner → Home
3b. DatEx: Owner → P0
Contention Resolution (for Write)
P0
1a. RdEx
1b. RdEx
4. Invl
3. RdEx
5a. Rev
Home
2a. DatEx
P1
2b. NACK
5b. DatEx
33
Issues with Contention Resolution
Need to escape race conditions by:
NACKing requests to busy (pending invalidate) entries
Original requestor retries
OR, queuing requests and granting in sequence
(Or some combination thereof)
Fairness
Which requestor should be preferred in a conflict?
Interconnect delivery order, and distance, both matter
We guarantee that some node will make forward progress
Ping-ponging is a higher-level issue
With solutions like combining trees (for locks/barriers) and
better shared-data-structure design
Protocol and Directory Tradeoffs
Forwarding vs. strict request-reply
Speculative replies from memory/directory node
Decreases critical path length in best case
More complex implementation (and potentially more network
traffic)
Directory storage can imply protocol design
E.g., linked list for sharer set
Shorten critical path by creating a chain of request forwarding
Increases complexity
Hansson et al., “Avoiding message-dependent deadlock in network-based systems on chip,” VLSI Design 2007.
Other Issues / Backup
Memory Consistency (Briefly)
We consider only sequential consistency [Lamport79] here
Sequential Consistency gives the appearance that:
All operations (R and W) happen atomically in a global order
Operations from a single thread occur in order in this stream
Proc 0: A = 1;
Proc 1: while (A == 0); B = 1;
Proc 2: while (B == 0); print A (A = 1?)
Thus, ordering between different mem locations exists
More relaxed models exist; usually require memory barriers
when synchronizing
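The three-processor example can be checked by brute force: enumerate every sequentially consistent interleaving and collect every value Proc 2 could print. This is an illustrative checker of my own, not lecture code; spin loops are modeled as guards that block a thread until the condition holds.

```python
def runs(threads, mem, pc, outputs):
    """DFS over all SC interleavings of per-thread op lists.

    Ops: ("write", var, val), ("wait", var, val) for a spin loop,
    ("print", var). Collects every printed value in `outputs`.
    """
    for i, ops in enumerate(threads):
        if pc[i] >= len(ops):
            continue                      # thread finished
        op = ops[pc[i]]
        if op[0] == "wait" and mem[op[1]] != op[2]:
            continue                      # spinning: cannot make progress yet
        m2 = dict(mem)
        if op[0] == "write":
            m2[op[1]] = op[2]
        elif op[0] == "print":
            outputs.add(m2[op[1]])
        pc2 = list(pc); pc2[i] += 1
        runs(threads, m2, pc2, outputs)   # branch on every enabled thread
    return outputs

threads = [
    [("write", "A", 1)],                     # Proc 0
    [("wait", "A", 1), ("write", "B", 1)],   # Proc 1
    [("wait", "B", 1), ("print", "A")],      # Proc 2
]
outs = runs(threads, {"A": 0, "B": 0}, [0, 0, 0], set())
print(outs)  # {1}: under SC, Proc 2 can only ever see A == 1
```

Under a relaxed model the writes to A and B could be reordered, and the checker's single-order assumption, i.e. exactly what SC provides, would no longer hold.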
Correctness Issue: Inclusion
What happens with multilevel caches?
Snooping level (say, L2) must know about all data in
private hierarchy above it (L1)
Inclusive cache is one solution [Baer88]
What about directories?
L2 must contain all data that L1 contains
Must propagate invalidates upward to L1
Other options
Non-inclusive: inclusion property is optional
Why would L2 evict if it has more sets and more ways than L1?
Prefetching!
Exclusive: line is in L1 xor L2 (AMD K7)
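The L2-evicts-a-line case can be sketched directly: to preserve inclusion, an eviction from the L2 must back-invalidate the L1 above it, so the L2's contents remain a superset of the L1's and snoops never need to probe the L1. A toy LRU model with my own names, simplified so every L2 fill also fills the L1:

```python
class InclusiveL2:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []          # LRU order: least recent first
        self.l1 = set()          # lines currently in the L1 above us

    def access(self, addr):
        if addr in self.lines:
            self.lines.remove(addr)          # refresh LRU position
        elif len(self.lines) == self.capacity:
            victim = self.lines.pop(0)       # evict LRU line from L2...
            self.l1.discard(victim)          # ...and back-invalidate the L1
        self.lines.append(addr)
        self.l1.add(addr)                    # fill into L1 too (simplified)

l2 = InclusiveL2(capacity=2)
for a in ("x", "y", "z"):                    # filling "z" evicts "x"
    l2.access(a)
print("x" in l2.l1, l2.l1 <= set(l2.lines))  # False True
```

Without the back-invalidation step, "x" could linger in the L1 while the L2 knows nothing about it, and a snoop filtered by the L2 would miss it.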