
ECE 669
Parallel Computer Architecture
Lecture 18
Scalable Parallel Caches
ECE669 L18: Scalable Parallel Caches
April 6, 2004
Overview
° Most cache protocols are more complicated than two states
° Snooping not effective for network-based systems
  • Consider three alternate cache coherence approaches
  • Full-map, limited directory, chained
° Caching will affect network performance
° LimitLESS protocol
  • Gives appearance of full-map
° Practical issues of processor – memory system interaction
Context for Scalable Cache Coherence
° Scalable networks
  - many simultaneous transactions
° Realizing programming models through network transaction protocols
  - efficient node-to-net interface
  - interprets transactions
[Figure: scalable network of switches connecting nodes, each with a processor (P), cache ($), memory (M), and communication assist (CA), forming a scalable distributed memory]
° Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency
° Need cache coherence protocols that scale!
  - no broadcast or single point of order
Generic Solution: Directories
[Figure: two nodes, each with a processor (P1), cache, directory, memory, and communication assist, connected by a scalable interconnection network]
° Maintain state vector explicitly
• associate with memory block
• records state of block in each cache
° On miss, communicate with directory
• determine location of cached copies
• determine action to take
• conduct protocol to maintain coherence
A Cache Coherent System Must:
° Provide set of states, state transition diagram, and actions
° Manage coherence protocol
  • (0) Determine when to invoke coherence protocol
  • (a) Find info about state of block in other caches to determine action
    - whether need to communicate with other cached copies
  • (b) Locate the other copies
  • (c) Communicate with those copies (inval/update)
° (0) is done the same way on all systems
  • state of the line is maintained in the cache
  • protocol is invoked if an “access fault” occurs on the line
° Different approaches distinguished by (a) to (c)
Coherence in small machines: Snooping Caches
[Figure: two processors with dual-ported cache/directory on a shared bus to memory; a write to address "a" is broadcast on the bus, the other cache's directory snoops, and on a match it purges its copy]
• Broadcast address on shared write
• Everyone listens (snoops) on bus to see if any of their own addresses match
• How do you know when to broadcast, invalidate...
  - State associated with each cache line
State diagram for ownership protocols
[State diagram, maintained for each shared data cache block:]
  • invalid → read-clean on local read
  • invalid → write-dirty on local write (broadcast a)
  • read-clean → write-dirty on local write (broadcast a)
  • read-clean → invalid on remote write (or replace)
  • write-dirty → read-clean on remote read
  • write-dirty → invalid on remote write (or replace)
• In ownership protocol: writer owns exclusive copy
• For each shared data cache block
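The transitions above can be sketched as a small table-driven state machine. This is an illustrative sketch, not the lecture's implementation; the names (`step`, `TRANSITIONS`, the event strings) are assumptions.

```python
# Three states per shared cache block, as in the diagram above.
INVALID, READ_CLEAN, WRITE_DIRTY = "invalid", "read-clean", "write-dirty"

# (state, event) -> (next state, bus action); "broadcast" models putting
# the address on the snooping bus to claim exclusive ownership.
TRANSITIONS = {
    (INVALID,     "local_read"):   (READ_CLEAN,  None),
    (INVALID,     "local_write"):  (WRITE_DIRTY, "broadcast"),
    (READ_CLEAN,  "local_write"):  (WRITE_DIRTY, "broadcast"),
    (READ_CLEAN,  "remote_write"): (INVALID,     None),   # also on replace
    (WRITE_DIRTY, "remote_read"):  (READ_CLEAN,  None),
    (WRITE_DIRTY, "remote_write"): (INVALID,     None),   # also on replace
}

def step(state, event):
    """Apply one event; events with no table entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))
```

Note that only local writes generate bus traffic: the writer must broadcast to claim ownership, which is what lets everyone else invalidate or demote their copies.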
Maintaining coherence in large machines
• Software
• Hardware - directories
° Software coherence
  Typically yields weak coherence,
  i.e. coherence at sync points (or fence points)
° E.g.: when using critical sections for shared objects foo1, foo2, foo3, foo4...
° Code
  GET_FOO_LOCK
  /* MUNGE WITH FOOs */
  Foo1 =
  X = Foo2
  Foo3 =
  ...
  RELEASE_FOO_LOCK
° How do you make this work?
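One way to make it work: explicitly invalidate on lock acquire and flush on lock release, so the data is coherent only at those sync points. The sketch below is hedged and illustrative: `invalidate_cache` and `flush_cache` stand in for the special processor instructions the slide mentions, the dictionaries model memory and a (hardware-incoherent) cache, and the lock itself is assumed to be handled separately.

```python
# All names here are illustrative. shared_memory stands in for the home
# memory of the foo variables; local_cache is this processor's cache,
# which hardware does NOT keep coherent.
shared_memory = {"foo1": 0, "foo2": 0, "foo3": 0, "foo4": 0}
local_cache = {}

def invalidate_cache():
    """Drop cached copies so later reads refetch from memory."""
    local_cache.clear()

def flush_cache():
    """Write cached values back to home memory (no dirty bits in this sketch)."""
    shared_memory.update(local_cache)

def read(var):
    if var not in local_cache:
        local_cache[var] = shared_memory[var]   # miss: fetch from home
    return local_cache[var]

def write(var, value):
    local_cache[var] = value                    # stays local until the flush

# GET_FOO_LOCK
invalidate_cache()                # see other processors' updates
write("foo1", read("foo2") + 1)   # /* MUNGE WITH FOOs */
flush_cache()                     # make our updates visible, then release
# RELEASE_FOO_LOCK
```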
Situation
[Figure: two processors with caches and a memory node that is the home of foo1..foo4; access sequence: lock; flush; read a1; ...; unlock]
° Flush foo* from cache, wait till done
° Issues
  • Lock?
  • Must be conservative
    - Lose some locality
  • But, can exploit application characteristics
    - e.g. TSP, Chaotic: allow some inconsistency
  • Need special processor instructions
Scalable Approach: Directories
° Every memory block has associated directory information
  • keeps track of copies of cached blocks and their states
  • on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
  • in scalable networks, communication with directory and copies is through network transactions
° Many alternatives for organizing directory information
Basic Operation of Directory
[Figure: processors with caches connected through an interconnection network to memory plus a directory of presence bits and a dirty bit]
• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in cache: 1 valid bit, and 1 dirty (owner) bit
• Read from main memory by processor i:
  • If dirty bit OFF then { read from main memory; turn p[i] ON; }
  • If dirty bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i; }
• Write to main memory by processor i:
  • If dirty bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty bit ON; turn p[i] ON; ... }
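The read and write cases above can be sketched as a small class. `DirectoryEntry` and the message tuples are illustrative names, and the dirty-on write case (elided as "..." on the slide) is filled in here with one plausible choice: recall-and-invalidate the single owner.

```python
class DirectoryEntry:
    """Full-map directory state for one memory block: k presence bits + 1 dirty bit."""

    def __init__(self, k):
        self.presence = [False] * k   # p[j]: does cache j hold a copy?
        self.dirty = False            # ON => exactly one (owner) copy exists

    def read(self, i):
        """Read miss by processor i; returns the network messages to send."""
        msgs = []
        if self.dirty:
            owner = self.presence.index(True)
            msgs.append(("recall", owner))   # owner writes back, drops to shared
            self.dirty = False               # memory is now up to date
        self.presence[i] = True
        msgs.append(("data", i))             # supply the block to i
        return msgs

    def write(self, i):
        """Write miss by processor i; returns the network messages to send."""
        msgs = []
        for j, has_copy in enumerate(self.presence):
            if has_copy and j != i:
                # dirty OFF: invalidate all sharers (per the slide).
                # dirty ON: plausible completion of the slide's "..." --
                # recall the block from the single owner.
                msgs.append(("recall" if self.dirty else "inv", j))
                self.presence[j] = False
        self.dirty = True
        self.presence[i] = True
        msgs.append(("data", i))
        return msgs
```

The key property is that every message is point-to-point to a node named by a presence bit, so no broadcast is ever needed.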
Scalable dynamic schemes
• Limited directories
• Chained directories
• LimitLESS schemes - use software
Other approach: full map (not scalable)
Scalable hardware schemes
[Figure: distributed memories (MEM), each with its own directory (DIR), plus caches (C) and processors (P) on a network; limited, chained, and LimitLESS directories are organizations of this same structure]
General directories:
• On write, check directory; if shared, send invalidate messages
• Distribute directories with the memories
  - Directory bandwidth scales in proportion to memory bandwidth
• Most directory accesses occur during a memory access anyway -- so not too many extra network requests (except a write to a read variable)
Memory controller (directory) state diagram for a memory block
[State diagram:]
  • uncached → 1 or more read copies (pointers) on read
  • uncached → 1 write copy (pointer to i) on write
  • 1 or more read copies → 1 write copy i on write (send invalidate requests)
  • 1 write copy → 1 or more read copies on read by i (update request)
  • 1 write copy → uncached on replace (update memory)
Limited directories: Exploit worker-set behavior
[Figure: memories (M), caches (C), and processors (P) on a network; each directory entry holds a small fixed number of pointers]
• Invalidate 1 if a 5th processor comes along (sometimes can set a broadcast-invalidate bit)
• Rarely more than 2 processors share
• Insight: the set of 4 pointers can be managed like a fully-associative 4-entry cache on the virtual space of all pointers
• But what do you do about widely shared data?
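The pointer management can be sketched as the small fully-associative cache the insight describes. This is a minimal sketch; the class name, message format, and the FIFO victim choice are assumptions (real designs may pick the victim differently or fall back to a broadcast bit).

```python
from collections import deque

POINTERS = 4   # hardware pointer count (a "Dir-4" limited directory)

class LimitedEntry:
    """Limited directory entry: the pointer set behaves like a small
    fully-associative cache of sharer IDs."""

    def __init__(self):
        self.sharers = deque()

    def add_sharer(self, i):
        """Record a new reader; returns any invalidation messages needed."""
        msgs = []
        if i in self.sharers:
            return msgs                      # already tracked, nothing to do
        if len(self.sharers) == POINTERS:
            victim = self.sharers.popleft()  # evict one pointer (FIFO here)...
            msgs.append(("inv", victim))     # ...by invalidating that copy
        self.sharers.append(i)
        return msgs
```

With worker sets that rarely exceed 2 sharers, the eviction path is almost never taken; widely shared data is exactly the case where this scheme thrashes.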
° LimitLESS directories: Limited directories Locally Extended through Software Support
[Figure: the 5th request (1) makes the memory-side controller trap its local processor (2), which extends the directory into local memory (3)]
• Trap processor when 5th request comes
• Processor extends directory into local memory
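In outline, the trap-and-extend mechanism might look like the sketch below. `LimitLESSEntry` and its fields are illustrative, not Alewife's actual data structures; the point is only the split between a fixed-capacity hardware fast path and an unbounded software slow path.

```python
HW_POINTERS = 4   # fast-path hardware pointers

class LimitLESSEntry:
    """Illustrative LimitLESS entry: hardware tracks up to HW_POINTERS
    sharers; the 5th request traps to software, which keeps the full
    sharer set in local memory from then on."""

    def __init__(self):
        self.hw_pointers = []     # fixed-capacity hardware directory
        self.sw_overflow = None   # software-extended set, built on trap
        self.traps = 0

    def add_sharer(self, i):
        if self.sw_overflow is not None:
            self.sw_overflow.add(i)              # slow path: software directory
        elif len(self.hw_pointers) < HW_POINTERS:
            self.hw_pointers.append(i)           # fast path: hardware pointer
        else:
            self.traps += 1                      # 5th request: trap processor
            self.sw_overflow = set(self.hw_pointers) | {i}
            self.hw_pointers = []                # directory now lives in memory

    def sharers(self):
        if self.sw_overflow is not None:
            return self.sw_overflow
        return set(self.hw_pointers)
```

Because most blocks never overflow, the common case runs at full hardware speed and gives the appearance of a full map; only widely shared blocks pay the software cost.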
Zero-pointer LimitLESS: All-software coherence
[Figure: each node has memory, communication assist, cache, and processor; the home node's directory traps to software on every request ("trap always"), with remote and local nodes shown alongside]
° Chained directories: Simply a different data structure for the directory
[Figure: a write (wrt) triggers invalidations (inv) that follow the chain of pointers through the caches holding the block]
• Link all cache entries
• But longer latencies
• Also more complex hardware
• Must handle replacements of elements in chain due to misses!
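A sketch of the chained structure, with illustrative names, shows both costs at once: invalidation latency grows with chain length, and splicing an element out on a replacement needs an O(chain) predecessor search in a singly linked chain, which is what doubly linked chains fix.

```python
class ChainedEntry:
    """Singly chained directory for one block: memory holds one head
    pointer; each sharing cache holds a pointer to the next sharer."""

    def __init__(self):
        self.head = None             # memory's single pointer
        self.next = {}               # per-cache forward pointer

    def add_sharer(self, i):
        self.next[i] = self.head     # new sharer becomes the chain head
        self.head = i

    def invalidate_all(self):
        """A write walks the whole chain: latency grows with its length."""
        msgs, node = [], self.head
        while node is not None:
            msgs.append(("inv", node))
            node = self.next.pop(node)
        self.head = None
        return msgs

    def replace(self, i):
        """A cache replacement must splice i out of the chain."""
        if self.head == i:
            self.head = self.next.pop(i)
            return
        prev = self.head
        while self.next[prev] != i:  # O(chain) without a back pointer
            prev = self.next[prev]
        self.next[prev] = self.next.pop(i)
```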
Doubly linked chains
[Figure: same organization, with forward and backward pointers between the caches in the chain]
° Of course, can do these data structures through software + messages, as in LimitLESS
Full map: Problem
[Figure: a write request (1) causes the directory to send invalidations (2); after the acks return (3), write permission is granted (4)]
• Does not scale -- need N pointers
• Directories distributed, so not a bottleneck
° Hierarchical - e.g. KSR (actually has rings...)
[Figure: processors with caches under a shared memory hierarchy (backed by disks?)]
[Figure: animation frames of the same hierarchy; a processor's read of A (RD A) issues a request (A?) that climbs the hierarchy until a cached copy of A is found]
Widely shared data
° 1. Synchronization variables
  Software combining tree (a write sets a flag that spinning readers observe)
° 2. Read-only objects
  Instructions, read-only data
  Can mark these and bypass coherence protocol
° 3. But, biggest problem:
  Frequently read, but rarely written data which does not fall into known patterns like synchronization variables
All-software coherence
[Bar chart: execution time (Mcycles, 0.00 to 1.60) for all-software, LimitLESS1, LimitLESS2, LimitLESS4, and all-hardware coherence]
[Chart: execution time (megacycles, 0.0 to 9.0) vs. matrix size (4x4 to 128x128) for the LimitLESS4 protocol and LimitLESS0 (all software)]
Cycle performance for block matrix multiplication, 16 processors
[Chart: execution time (units of 10,000 cycles, 0.0 to 5.0) vs. node count (2 to 64 nodes) for the LimitLESS4 protocol and LimitLESS0 (all software)]
Performance for a single global barrier (first INVR to last RDATA)
Summary
° Tradeoffs in caching are an important issue
° The LimitLESS protocol provides a software extension to hardware caching
° Goal: maintain coherence but minimize network traffic
° Full map not scalable and too costly
° Distributed memory makes caching more of a challenge