DDM – A Cache Only Memory Architecture
Hagersten, Landin, and Haridi (1991)
Presented by Patrick Eibl
Outline
• Basics of Cache-Only Memory Architectures
• The Data Diffusion Machine (DDM)
• DDM Coherence Protocol
• Examples of Replacement, Reading, Writing
• Memory Overhead
• Simulated Performance
• Strengths and Weaknesses
• Alternatives to DDM Architecture
The Big Idea: UMA → NUMA → COMA
• UMA: Centralized shared memory feeds data through the network to
individual caches; uniform access time to all memory
• NUMA: Shared memory is distributed among the processors (DASH);
data can move from its home memory to other caches as needed
• COMA: No notion of a “home” for data; it moves to wherever it is
needed, and the individual memories behave like caches
COMA: The Basics
• Individual memories are called Attraction
Memories (AM) – each processor “attracts” its
working data set
• AM also contains data that has never been
accessed (+/-?)
• Uses shared memory programming model, but
with no pressure to optimize static partitioning
• Limited duplication of shared memory
• The Data Diffusion Machine is the specific COMA
presented here
Data Diffusion Machine
• Directory hierarchy allows scaling to arbitrary
number of processors
• Branch factors and bottlenecks are a consideration
– Hierarchy can be split into different address domains
to improve bandwidth
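
To make the scaling point concrete, here is a small illustrative sketch (not taken from the paper): it only shows how the number of directory levels grows with the branch factor, and the function name and example parameters are assumptions chosen for illustration.

import math

# Illustrative only: how many directory levels sit above the attraction
# memories for a given processor count and tree branch factor.
def directory_levels(num_processors: int, branch_factor: int) -> int:
    return math.ceil(math.log(num_processors) / math.log(branch_factor))

# A modest branch factor keeps the tree shallow even for many processors.
for n in (32, 256, 2048):
    print(n, "processors ->", directory_levels(n, branch_factor=8), "levels")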
Coherence Protocol
Item States
I: Invalid
E: Exclusive
S: Shared
R: Reading
W: Waiting
RW: Reading and Waiting
Bus Transactions
e: erase
x: exclusive
r: read
d: data
i: inject
o: out
• Transient states support a split-transaction bus
• Fairly standard protocol, with the important exception of
replacement, which must be managed carefully (example to come)
• Sequential consistency is guaranteed, but at the cost that writes
must wait for an acknowledgment before continuing
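
As a reading aid, the sketch below (Python, my own simplification, not the paper's specification) encodes the item states and bus transactions listed above and shows one representative transition, a read miss; the hierarchy, selection, and replacement machinery of the real protocol are omitted.

from enum import Enum, auto

class State(Enum):
    I = auto()    # Invalid
    E = auto()    # Exclusive
    S = auto()    # Shared
    R = auto()    # Reading (transient: read outstanding)
    W = auto()    # Waiting (transient: erase sent, awaiting acknowledge)
    RW = auto()   # Reading and Waiting

class Trans(Enum):
    ERASE = "e"
    EXCLUSIVE = "x"
    READ = "r"
    DATA = "d"
    INJECT = "i"
    OUT = "o"

def on_processor_read(state: State):
    """Return (new_state, transaction_to_issue) for a processor read."""
    if state in (State.S, State.E):
        return state, None              # hit: no bus traffic
    if state == State.I:
        return State.R, Trans.READ      # miss: go transient, issue a read
    return state, None                  # already transient: just wait

def on_bus_data(state: State):
    """A data reply arrives for an outstanding read."""
    return State.S if state == State.R else state

print(on_processor_read(State.I))       # (State.R, Trans.READ)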
Replacement Example
(Diagram: processors, caches, and directories exchanging out/inject
transactions; states: I: Invalid, S: Shared; transactions: o: out,
i: inject)
1. A block needs to be brought into a full AM, necessitating a
replacement and an out transaction
2. The out propagates up until it finds another copy of the block in
state S, R, W, or RW
3. The out reaches the top and is converted to an inject, meaning this
is the last copy of the data and it needs a new home
4. The inject finds space in a new AM
5. The data is transferred to its new home
6. States change accordingly
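
The walkthrough above can be condensed into a toy simulation; the AttractionMemory class, the sibling_groups argument, and the returned strings are all invented for illustration and only mirror the out/inject flow, not the paper's state machinery.

class AttractionMemory:
    def __init__(self, name: str, capacity: int):
        self.name, self.capacity, self.items = name, capacity, set()

    def has_space(self) -> bool:
        return len(self.items) < self.capacity

def replace(item, victim_am, sibling_groups, all_ams):
    """sibling_groups: leaf-to-root list of the AMs visible to each directory."""
    victim_am.items.discard(item)                 # evict and send "out"
    for group in sibling_groups:                  # out propagates upward
        if any(item in am.items for am in group if am is not victim_am):
            return "out absorbed: another copy exists below this directory"
    for am in all_ams:                            # top reached: convert to inject
        if am is not victim_am and am.has_space():
            am.items.add(item)                    # last copy gets a new home
            return f"injected into {am.name}"
    return "inject failed: no attraction memory had space"

# Usage: item "X" lives only in a full AM0, so eviction ends in an inject.
am0, am1 = AttractionMemory("AM0", 1), AttractionMemory("AM1", 2)
am0.items.add("X")
print(replace("X", am0, [[am0, am1]], [am0, am1]))   # -> injected into AM1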
Multilevel Read Example
(Diagram: processors, caches, and directories exchanging read/data
transactions; states: I: Invalid, R: Reading, A: Answering, S: Shared;
transactions: r: read, d: data)
First read:
1. First cache issues a read request
2. The read propagates up the hierarchy
3. The read reaches a directory with the block in the shared state
4. Directories change to the answering state while waiting for the data
5. The data moves back along the same path, changing states to shared
as it goes
Second, concurrent read:
2. Second cache issues a request for the same block
3. The request for the same block is encountered; the directory simply
waits for the data reply from the other request
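
The combining in steps 2–3 of the second read can be sketched as below; the Directory class, its fields, and the callback arguments are hypothetical stand-ins for the real directory's Answering state.

class Directory:
    def __init__(self):
        self.answering = {}              # item -> requesters waiting for data

    def on_read(self, item, requester, forward_up):
        if item in self.answering:                   # duplicate read for same item
            self.answering[item].append(requester)   # just wait for the one reply
            return
        self.answering[item] = [requester]           # enter Answering state
        forward_up(item)                             # propagate the read upward

    def on_data(self, item, send_down):
        for requester in self.answering.pop(item, []):
            send_down(item, requester)               # one data reply serves all

# Usage: the second read is absorbed; both caches get the single reply.
d = Directory()
d.on_read("B", "cache0", forward_up=lambda i: print("read", i, "forwarded up"))
d.on_read("B", "cache1", forward_up=lambda i: print("never reached"))
d.on_data("B", send_down=lambda i, r: print("data", i, "->", r))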
Multilevel Write Example
(Diagram: processors, caches, and directories exchanging
erase/exclusive transactions; states: I: Invalid, R: Reading,
W: Waiting, E: Exclusive, S: Shared; transactions: e: erase,
x: exclusive)
First write:
1. Cache issues a write request
2. The erase propagates up the hierarchy and back down, invalidating
all other copies
4. The top of the hierarchy responds with an acknowledge
5. The ACK propagates back down, changing states from Waiting to
Exclusive
Second, concurrent write:
2. Second cache issues a write to the same block
3. The second exclusive request encounters the other write to the same
block; the first one won because it arrived first; the other erase is
propagated back down
4. The state of the second cache is changed to RW, and it will issue a
read request before another erase (not shown)
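
A minimal sketch of the race resolution above, assuming only that the erase which reaches the common directory first wins; the function name and the outcome strings are illustrative, not the paper's notation.

def arbitrate_writes(item, erase_arrival_order):
    """erase_arrival_order: caches in the order their erases reach the top."""
    winner, losers = erase_arrival_order[0], erase_arrival_order[1:]
    outcome = {winner: "Waiting -> Exclusive once the acknowledge comes back down"}
    for cache in losers:
        outcome[cache] = ("copy erased, Waiting -> Reading-and-Waiting; "
                          "reissue a read, then try the erase again")
    return outcome

print(arbitrate_writes("B", ["cache0", "cache1"]))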
Memory Overhead
• Inclusion is necessary for directories, but not
for data
• Directories only need state bits and address
tags
• For the two sample configurations given, overheads were 6% for a
one-level 32-processor machine and 16% for a two-level 256-processor
machine
• Larger item size reduces overhead
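
The "larger item size reduces overhead" point can be made concrete with a back-of-the-envelope calculation; the tag and state widths below are assumed values for illustration and are not the configurations behind the 6% and 16% figures above.

# Rough illustration: per-level directory cost relative to the data item.
# Directories store only an address tag plus state bits, never the data.
def per_level_overhead(item_size_bytes: int, tag_bits: int = 20, state_bits: int = 4) -> float:
    return (tag_bits + state_bits) / (item_size_bytes * 8)

for size in (16, 32, 64, 128):
    print(f"{size:3d}-byte items: {per_level_overhead(size):5.1%} per directory level")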
Simulated Performance
• Minimal success on programs for which each processor operates on
the entire shared data set
• MP3D was rewritten to improve performance by exploiting the fact
that data has no home
• OS, hardware, and
emulator in development
at the time
• Different DDM topology
for each program (-)
Strengths
• Each processor attracts the data it’s using into its
own memory space
• Data doesn’t need to be duplicated at a home
node
• Ordinary shared memory programming model
• No need to optimize static partitioning (there is
none)
• Directory hierarchy scales reasonably well
• Good when data is moved around in smaller
chunks
Weaknesses
• Attraction memories hold data that isn’t being
used, making them bigger and slower
• Different DDM hierarchy topology was used for
each program in simulations
• Does not fully exploit large spatial locality;
software wins in that case (S-COMA)
• Branching hierarchy is prone to bottlenecks and
hotspots
• No way to know where data is except via an expensive tree traversal
(NUMA wins here)
Alternatives to COMA/DDM
• Flat-COMA
– Blocks are free to migrate, but have home nodes
with directories corresponding to physical address
• Simple-COMA
– Allocation managed by OS and done at page
granularity
• Reactive-NUMA
– Switches between S-COMA and NUMA with
remote cache on per-page basis
Good summary of COMAs:
http://ieeexplore.ieee.org/iel5/2/16679/00769448.pdf?tp=&isnumber=&arnumber=769448