CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008
CMP L2 Cache Management
Presented by: Yang Liu
CPS221 Spring 2008
Based on:
Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. Vijaykumar
ASR: Adaptive Selective Replication for CMP Caches, B. Beckmann, M. Marty, and D. Wood
Outline
Motivation
Related Work (1) – Non-uniform Caches
CMP-NuRAPID
Related Work (2) – Replication Schemes
ASR
Motivation
Two options for L2 caches in CMPs
Shared: high latency because of wire delay
Private: more misses because of replication
Need hybrid L2 caches
Keep in mind:
On-chip communication is fast
On-chip capacity is limited
NUCA
Non-Uniform Cache Architecture
Place frequently-accessed data closest to the core to allow fast access
Couple tag and data placement
Can only place one or two ways in each set close to the processor
NuRAPID
Non-uniform access with Replacement And Placement usIng Distance associativity
Decouple the set-associative way number from data placement
Divide the cache data array into d-groups
Use forward and reverse pointers
Forward: from tag to data
Reverse: from data to tag
One to one?
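The decoupling above can be illustrated with a toy model (names and structure are illustrative assumptions, not the paper's implementation): each tag entry holds a forward pointer to a data frame in some d-group, each frame holds a reverse pointer to its tag, so data can move between d-groups without changing its set/way position.

```python
# Toy sketch of NuRAPID's decoupled tag/data arrays: forward pointers
# (tag -> data frame) and reverse pointers (data frame -> tag) let data
# migrate between d-groups while the tag stays in its set-associative way.

class TagEntry:
    def __init__(self, addr):
        self.addr = addr
        self.forward = None   # (d_group, frame) where the data lives

class DataFrame:
    def __init__(self, d_group, frame):
        self.d_group = d_group
        self.frame = frame
        self.reverse = None   # back-pointer to the owning TagEntry

def link(tag, data):
    """Bind a tag to a data frame: one forward and one reverse pointer."""
    tag.forward = (data.d_group, data.frame)
    data.reverse = tag

def migrate(tag, new_data):
    """Move a block's data to another d-group; only pointers change."""
    old = tag.forward
    link(tag, new_data)
    return old                # old frame is now free for another block

tag = TagEntry(0x1000)
far = DataFrame(d_group=3, frame=7)    # slow d-group
near = DataFrame(d_group=0, frame=2)   # fast d-group
link(tag, far)
freed = migrate(tag, near)             # promotion toward the core
```

Because the tag never moves, the "one to one?" question amounts to whether each data frame maps back to exactly one tag, which the reverse pointer enforces in this sketch.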
CMP-NuRAPID - Overview
Hybrid private tag
Shared data organization
Controlled Replication – CR
In-Situ Communication – ISC
Capacity Stealing – CS
CMP-NuRAPID – Structure
Need carefully chosen d-group preference
CMP-NuRAPID – Data and Tag Array
Tag arrays snoop on bus to maintain coherence
The data array is accessed through a crossbar
CMP-NuRAPID – Controlled Replication
For read-only sharing
First access: no copy, saving capacity
Second access: make a local copy, reducing future access latency
Overall, avoids off-chip misses
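A minimal sketch of the Controlled Replication policy above (function names and the bookkeeping dictionaries are assumptions for illustration): the first remote access reuses the existing copy, and only the second access to the same block creates a local replica.

```python
# Hedged sketch of Controlled Replication: first remote access makes no
# copy (saving capacity); the second access replicates locally (cutting
# future access latency). The threshold of 2 mirrors the slide's
# first/second-access rule.

def cr_access(block, core, copies, remote_hits):
    """Return 'local' if the core now has its own copy, else 'remote'."""
    if core in copies.get(block, set()):
        return "local"
    # remote hit: count it, and replicate only on the second remote access
    key = (block, core)
    remote_hits[key] = remote_hits.get(key, 0) + 1
    if remote_hits[key] >= 2:
        copies.setdefault(block, set()).add(core)
        return "local"        # replica made; future hits are fast
    return "remote"           # first touch: no copy, capacity saved

copies = {0x40: {0}}          # core 0 holds block 0x40
hits = {}
first = cr_access(0x40, 1, copies, hits)    # "remote": no replica yet
second = cr_access(0x40, 1, copies, hits)   # "local": replica created
```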
CMP-NuRAPID – Time Issues
A read starts before the invalidation and ends after the invalidation
Mark the tag for the block being read from a farther d-group as busy
A read starts after the invalidation begins and ends before the invalidation completes
Put an entry in the queue that holds the order of the bus transactions before sending a read request to a farther d-group
CMP-NuRAPID – In-situ Communication
For read-write sharing
Communication state
Write-through for all C blocks in L1 cache
CMP-NuRAPID – Capacity Stealing
Demote less-frequently-used data to unused frames in the d-groups closer to the cores with less capacity demand
Placement and Promotion
Place all private blocks in the d-group closest to the initiating core
Promote the block directly to the d-group closest to the core
CMP-NuRAPID – Capacity Stealing
Demotion and Replacement
Demote the block to the next-fastest d-group
Replace in the order of invalid, private, and shared
Doesn’t this kind of demotion pollute another core’s fastest d-group?
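The stated victim ordering can be captured in a few lines (states and the frame list are illustrative, not the paper's exact structures): when a demoted block needs a frame, invalid frames are taken first, then private, then shared.

```python
# Sketch of the slide's replacement priority for capacity stealing:
# victims are chosen in the order invalid -> private -> shared.

PRIORITY = {"invalid": 0, "private": 1, "shared": 2}

def pick_victim(frames):
    """frames: list of (frame_id, state); return the frame id to evict."""
    return min(frames, key=lambda f: PRIORITY[f[1]])[0]

frames = [(0, "shared"), (1, "private"), (2, "shared")]
no_invalid = pick_victim(frames)                    # private beats shared
with_invalid = pick_victim(frames + [(3, "invalid")])  # invalid wins
```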
CMP-NuRAPID - Methodology
Simics
4-core CMP
8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups
Both multithreaded and multiprogrammed
workloads
CMP-NuRAPID – Multithreaded
CMP-NuRAPID – Multiprogrammed
Replication Schemes
Cooperative Caching
Private L2 caches
Restrict replication under certain criteria
Victim Replication
Shared L2 caches
Allow replication under certain criteria
Both have static replication policies
How about dynamic?
ASR - Overview
Adaptive Selective Replication
Dynamic cache block replication
Replicate blocks when the benefits exceed
the costs
Benefits: lower L2 hit latency
Costs: more L2 misses
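ASR's core trade-off can be written as a one-line comparison (the numbers and function interface below are illustrative assumptions): replicate only when the cycles saved by extra local hits exceed the cycles lost to the extra misses that replicas cause by consuming capacity.

```python
# Sketch of ASR's cost/benefit test: benefit = latency saved by hits
# that become local; cost = latency added by misses caused by lost
# capacity. Replicate only when benefit exceeds cost.

def should_replicate(extra_hits, local_lat, remote_lat,
                     extra_misses, miss_lat):
    benefit = extra_hits * (remote_lat - local_lat)  # cycles saved
    cost = extra_misses * miss_lat                   # cycles lost
    return benefit > cost

# e.g. 1000 extra local hits saving 30 cycles each, against extra
# off-chip misses costing 500 cycles each:
few_misses = should_replicate(1000, 10, 40, 50, 500)    # 30000 > 25000
many_misses = should_replicate(1000, 10, 40, 100, 500)  # 30000 < 50000
```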
ASR – Sharing Types
Single Requestor
Blocks are accessed by a single processor
Shared Read-Only
Blocks are read, but not written, by multiple processors
Shared Read-Write
Blocks are accessed by multiple processors, with at least one write
Focus on replicating shared read-only blocks
High locality
Little capacity
Large portion of requests
ASR - SPR
Selective Probabilistic Replication
Assume private L2 caches and selectively limit replication on L1 evictions
Use probabilistic filtering to make local replication decisions
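A sketch of that probabilistic filter (the level-to-probability table and function names are assumptions for illustration, not the paper's exact values): on each L1 eviction of a shared read-only block, the block is replicated into the local L2 with a probability set by the current replication level.

```python
# Sketch of SPR's probabilistic filtering: each replication level maps
# to a replication probability; a coin flip on every L1 eviction decides
# whether to replicate locally. Seeding the RNG keeps this deterministic.

import random

# hypothetical replication levels -> replication probabilities
LEVEL_PROB = [0.0, 1 / 64, 1 / 16, 1 / 4, 1.0]

def on_l1_eviction(level, rng):
    """Return True if the evicted block should be replicated locally."""
    return rng.random() < LEVEL_PROB[level]

rng = random.Random(42)
decisions = [on_l1_eviction(3, rng) for _ in range(1000)]
rate = sum(decisions) / len(decisions)   # empirically near 1/4
```

Because each core flips its own coin, the decision is purely local: no global coordination is needed to enforce a chip-wide replication level.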
ASR – Balancing Replication
ASR – Replication Control
Replication levels
C: Current
H: Higher
L: Lower
Cycles
H: Hit cycles-per-instruction
M: Miss cycles-per-instruction
ASR – Replication Control
Wait until there are enough events to ensure a fair cost/benefit comparison
Wait until four consecutive evaluation intervals predict the same change before changing the replication level
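The hysteresis rule above can be sketched as a small controller (class and method names are illustrative assumptions): each interval ends with a prediction from the cost/benefit check, and the level moves only after four consecutive intervals agree.

```python
# Sketch of ASR's replication control with hysteresis: the replication
# level changes only after four consecutive evaluation intervals predict
# the same direction, avoiding oscillation on noisy measurements.

class ReplicationController:
    def __init__(self, levels=5, threshold=4):
        self.level = levels // 2      # start at a middle level
        self.max_level = levels - 1
        self.threshold = threshold
        self.pending = None           # direction being considered
        self.streak = 0               # consecutive agreeing intervals

    def end_interval(self, prediction):
        """prediction: 'up', 'down', or 'stay' from the cost/benefit check."""
        if prediction == self.pending and prediction in ("up", "down"):
            self.streak += 1
        else:
            self.pending, self.streak = prediction, 1
        if self.streak >= self.threshold and self.pending == "up":
            self.level = min(self.level + 1, self.max_level)
            self.pending, self.streak = None, 0
        elif self.streak >= self.threshold and self.pending == "down":
            self.level = max(self.level - 1, 0)
            self.pending, self.streak = None, 0
        return self.level

ctrl = ReplicationController()
for _ in range(3):
    ctrl.end_interval("up")        # three agreeing intervals: no change yet
level_after_four = ctrl.end_interval("up")  # fourth consecutive 'up' commits
```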
ASR – Designs Supported by SPR
SPR-VR
SPR-NR
Add 1-bit per L2 cache block to identify replicas
Disallow replications when the local cache set is filled with
owner blocks with identified sharers
Store a 1-bit counter per remote processor for each L2
block
Remove the shared bus overhead (How?)
SPR-CC
Model the centralized tag structure using an idealized
distributed tag structure
ASR - Methodology
Two CMP configurations – Current and Future
8 processors
Writeback, write-allocate cache
Both commercial and scientific workloads
Use throughput as the metric
ASR – Memory Cycles
ASR - Speedup
Conclusion
Hybrid is better
Dynamic is better
Need tradeoff
How does it scale…