CMP L2 Cache Management
Presented by: Yang Liu
CPS221 Spring 2008
Based on:
Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. Vijaykumar
ASR: Adaptive Selective Replication for CMP Caches, B. Beckman, M. Marty, and D. Wood
Outline

Motivation
Related Work (1) – Non-uniform Caches
CMP-NuRAPID
Related Work (2) – Replication Schemes
ASR
Motivation

Two options for L2 caches in CMPs

Shared: high latency because of wire delay
Private: more misses because of replication

Need hybrid L2 caches

Keep in mind

On-chip communication is fast
On-chip capacity is limited
NUCA

Non-Uniform Cache Architecture

Place frequently-accessed data closest to the core to allow fast access
Couple tag and data placement
Can only place one or two ways in each set close to the processor
NuRAPID

Non-uniform access with Replacement And Placement usIng Distance associativity

Decouple the set-associative way number from data placement
Divide the cache data array into d-groups
Use forward and reverse pointers

Forward: from tag to data
Reverse: from data to tag
One-to-one?
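The forward/reverse pointer scheme above can be sketched in a few lines. The class and method names are illustrative, not from the paper; the point is that the tag's set-associative position stays fixed while only the pointers change when data moves between d-groups:

```python
# Hypothetical sketch of NuRAPID's decoupled tag/data arrays.
# Each tag entry holds a forward pointer (tag -> data frame); each
# data frame holds a reverse pointer (data -> tag), one-to-one.

class NuRapidCache:
    def __init__(self, num_tags, num_frames):
        self.forward = [None] * num_tags    # tag index -> frame index
        self.reverse = [None] * num_frames  # frame index -> tag index

    def place(self, tag_idx, frame_idx):
        """Bind a tag entry to a data frame in some d-group."""
        self.forward[tag_idx] = frame_idx
        self.reverse[frame_idx] = tag_idx

    def migrate(self, tag_idx, new_frame):
        """Move the data to another d-group: only the pointers change,
        the tag's way within its set does not."""
        old = self.forward[tag_idx]
        self.reverse[old] = None
        self.place(tag_idx, new_frame)

cache = NuRapidCache(num_tags=8, num_frames=8)
cache.place(3, 5)
cache.migrate(3, 0)   # e.g., promote the data to a faster d-group
```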
CMP-NuRAPID – Overview

Hybrid: private tags, shared data organization
Controlled Replication – CR
In-Situ Communication – ISC
Capacity Stealing – CS
CMP-NuRAPID – Structure

Need carefully chosen d-group preference
CMP-NuRAPID – Data and Tag Array

Tag arrays snoop on the bus to maintain coherence
The data array is accessed through a crossbar
CMP-NuRAPID – Controlled Replication

For read-only sharing

First access: no copy, saving capacity
Second access: make a copy, reducing future access latency
Overall: avoids off-chip misses
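A minimal sketch of the controlled-replication policy above, assuming invented names and return values (the real mechanism works on tag/data pointers, not strings): a core pays one extra remote hit on the first access and only gets a local replica on its second access.

```python
class ControlledReplication:
    """Sketch: a core gets a local replica of a read-only block only on
    its second access, trading one extra remote hit for saved capacity.
    All names and return values are illustrative."""

    def __init__(self):
        self.seen_once = set()   # (core, block) pairs accessed exactly once
        self.replicas = set()    # (core, block) pairs holding a local copy

    def read(self, core, block):
        key = (core, block)
        if key in self.replicas:
            return "local-hit"               # fast d-group access
        if key in self.seen_once:
            self.replicas.add(key)           # second use: make the copy
            return "remote-hit-then-copy"
        self.seen_once.add(key)
        return "remote-hit"                  # first use: no copy, save capacity

cr = ControlledReplication()
first = cr.read(0, "A")    # remote hit, no copy made
second = cr.read(0, "A")   # remote hit, replica created
third = cr.read(0, "A")    # local hit on the replica
```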
CMP-NuRAPID – Timing Issues

A read starts before an invalidation and ends after it

Mark the tag busy for the block being read from a farther d-group

A read starts after the invalidation begins and ends before it completes

Before sending a read request to a farther d-group, put an entry in the queue that holds the order of bus transactions
CMP-NuRAPID – In-Situ Communication

For read-write sharing
Adds a Communication (C) state
Write-through for all C blocks in the L1 cache
CMP-NuRAPID – Capacity Stealing

Demote less-frequently-used data to unused frames in the d-groups closer to cores with less capacity demand

Placement and Promotion

Place all private blocks in the d-group closest to the initiating core
Promote blocks directly to the initiating core's closest d-group
CMP-NuRAPID – Capacity Stealing

Demotion and Replacement

Demote the block to the next-fastest d-group
Replace in the order: invalid, then private, then shared
Doesn't this kind of demotion pollute another core's fastest d-group?
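The replacement preference above (invalid frames first, then private blocks, then shared blocks) can be sketched as a simple victim-selection function; the state names are assumptions for illustration:

```python
def choose_victim(frames):
    """frames: list of cache-frame states, each 'invalid', 'private',
    or 'shared'. Return the index of the frame to evict, preferring
    invalid frames, then private blocks, then shared blocks."""
    for wanted in ("invalid", "private", "shared"):
        for i, state in enumerate(frames):
            if state == wanted:
                return i
    return None  # no frame to evict

# A private block is chosen over the shared blocks around it.
victim = choose_victim(["shared", "private", "shared"])
```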
CMP-NuRAPID – Methodology

Simics
4-core CMP
8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups
Both multithreaded and multiprogrammed workloads
CMP-NuRAPID – Multithreaded
CMP-NuRAPID – Multiprogrammed
Replication Schemes

Cooperative Caching

Private L2 caches
Restrict replication under certain criteria

Victim Replication

Shared L2 caches
Allow replication under certain criteria

Both have static replication policies
How about dynamic?
ASR – Overview

Adaptive Selective Replication

Dynamic cache block replication
Replicate blocks when the benefits exceed the costs

Benefits: lower L2 hit latency
Costs: more L2 misses
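The benefit/cost trade-off can be illustrated with a toy calculation; the penalty values below are made up, not measured, and the function names are not from the paper:

```python
# Toy sketch of ASR's core decision: replicate more only while the
# cycles saved by extra local L2 hits exceed the cycles added by the
# extra misses the replicas' capacity pressure causes.

def replication_benefit(extra_local_hits, remote_penalty):
    return extra_local_hits * remote_penalty      # cycles saved

def replication_cost(extra_misses, miss_penalty):
    return extra_misses * miss_penalty            # cycles added

def should_replicate_more(extra_local_hits, extra_misses,
                          remote_penalty=40, miss_penalty=400):
    """Illustrative penalties: 40 cycles for a remote L2 hit avoided,
    400 cycles for an off-chip miss incurred."""
    return (replication_benefit(extra_local_hits, remote_penalty)
            > replication_cost(extra_misses, miss_penalty))

# Many avoided remote hits justify a few extra misses...
more = should_replicate_more(extra_local_hits=1000, extra_misses=50)
```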
ASR – Sharing Types

Single Requestor

Blocks are accessed by a single processor

Shared Read-Only

Blocks are read, but not written, by multiple processors

Shared Read-Write

Blocks are accessed by multiple processors, with at least one write

Focus on replicating shared read-only blocks

High locality
Small capacity footprint
Large portion of requests
ASR – SPR

Selective Probabilistic Replication
Assumes private L2 caches and selectively limits replication on L1 evictions
Uses probabilistic filtering to make local replication decisions
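The probabilistic filter might look like the following sketch: on an L1 eviction of a shared read-only block, replicate it into the local L2 with a probability tied to the current replication level. The probability table here is invented for illustration and is not the paper's:

```python
import random

# Illustrative per-level replication probabilities (not the paper's values):
# level 0 never replicates, the top level always does.
REPLICATION_PROB = [0.0, 1/64, 1/16, 1/4, 1/2, 1.0]

def maybe_replicate(level, rng=random.random):
    """On an L1 eviction of a shared read-only block, decide locally
    whether to replicate it into this core's L2. No coordination with
    other cores is needed; the coin flip is purely local."""
    return rng() < REPLICATION_PROB[level]

random.seed(0)
# At level 3 (p = 1/4), roughly a quarter of victims get replicated.
replicated = sum(maybe_replicate(3) for _ in range(10000))
```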
ASR – Balancing Replication
ASR – Replication Control

Replication levels

C: Current
H: Higher
L: Lower

Cycles

H: Hit cycles-per-instruction
M: Miss cycles-per-instruction
M: Miss cycles-per-instruction
ASR – Replication Control

Wait until there are enough events to ensure a fair cost/benefit comparison

Wait until four consecutive evaluation intervals predict the same change before changing the replication level
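The two waiting rules above can be sketched as a small controller: each interval compares estimated cycles-per-instruction at the current, higher, and lower replication levels and only moves after four consecutive intervals agree. The class, the number of levels, and the CPI inputs are all illustrative assumptions:

```python
class ReplicationController:
    """Sketch of ASR's hysteresis: change the replication level only
    after `agree_needed` consecutive intervals predict the same move."""

    def __init__(self, levels=6, agree_needed=4):
        self.level = 0
        self.levels = levels
        self.agree_needed = agree_needed
        self.pending = None   # proposed change: "up", "down", or None
        self.streak = 0       # consecutive intervals agreeing on it

    def end_interval(self, cpi_current, cpi_higher, cpi_lower):
        """Pick the direction with the lowest estimated hit+miss CPI."""
        best = min(("stay", cpi_current), ("up", cpi_higher),
                   ("down", cpi_lower), key=lambda kv: kv[1])[0]
        if best == "stay":
            self.pending, self.streak = None, 0
            return
        if best == self.pending:
            self.streak += 1
        else:
            self.pending, self.streak = best, 1
        if self.streak >= self.agree_needed:
            self.level += 1 if best == "up" else -1
            self.level = max(0, min(self.levels - 1, self.level))
            self.pending, self.streak = None, 0

ctrl = ReplicationController()
for _ in range(4):
    ctrl.end_interval(cpi_current=1.0, cpi_higher=0.9, cpi_lower=1.1)
# After four agreeing intervals, ctrl.level has moved from 0 to 1.
```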
ASR – Designs Supported by SPR

SPR-VR

Add 1 bit per L2 cache block to identify replicas
Disallow replication when the local cache set is filled with owner blocks with identified sharers

SPR-NR

Store a 1-bit counter per remote processor for each L2 block
Remove the shared bus overhead (How?)

SPR-CC

Model the centralized tag structure using an idealized distributed tag structure
ASR – Methodology

Two CMP configurations – Current and Future
8 processors
Writeback, write-allocate caches
Both commercial and scientific workloads
Use throughput as the metric
ASR – Memory Cycles
ASR - Speedup
Conclusion

Hybrid is better
Dynamic is better

Tradeoffs are needed

How does it scale…