Hybrid Main Memory Literature Survey
HanBin Yoon, Justin Meza,
Rachata Ausavarungnirun
Talk Outline
- Project Overview
- Literature Review
  - Row Buffer Locality
  - Selective Promotion
  - Reuse
- Synthesis
Project Overview
- Use DRAM as a smart cache to PCM
- Before: CPU -> DRAM (main memory)
- After: CPU -> DRAM (cache) -> PCM (main memory)
Literature Review
- Row Buffer Locality
  - "Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement," K. Sudan et al., ASPLOS 2010
- Selective Promotion
  - "CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," X. Jiang et al., HPCA 2010
- Reuse
  - "Adaptive Insertion Policies for High Performance Caching," M. K. Qureshi et al., ISCA 2007
  - "Adaptive Insertion Policies for Managing Shared Caches," A. Jaleel et al., PACT 2008
Row-Buffer Locality
"Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement"
K. Sudan et al., ASPLOS 2010

Observation
- Observation: typically, most memory accesses to an OS page touch only a small, contiguous chunk of cache blocks within the DRAM row
- Solution: gather hot cache blocks and map them to the same DRAM row to provide better row buffer locality
Possible Solutions
- Reducing OS page size
  - OS-managed mechanism: page size is reduced from 4 KB to 1 KB
  - Hot pages are migrated via DRAM copy
  - Hot pages are co-located in the same row
  - Cold pages are promoted to superpages, to reduce the TLB thrashing caused by the smaller page size
- Hardware managed via indirection
  - An additional level of address mapping places hot pages into the same row
  - A mapping table stores the indirection and is updated every epoch
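The hardware-managed indirection above can be sketched as a small remapping table; this is an illustrative model (the class, field names, and epoch interface are hypothetical, not the paper's exact design):

```python
MICRO_PAGE_SIZE = 1024  # 1 KB micro-pages, as in the Micro-Pages proposal


class IndirectionTable:
    """Sketch: remap hot micro-pages into a reserved 'hot' DRAM row region.

    The table is rebuilt every epoch from the observed hot micro-pages.
    """

    def __init__(self):
        # original micro-page number -> remapped micro-page number
        self.remap = {}

    def translate(self, phys_addr):
        """Apply the extra level of address mapping on every access."""
        mpn, offset = divmod(phys_addr, MICRO_PAGE_SIZE)
        # Hot micro-pages are redirected; cold ones pass through unchanged.
        return self.remap.get(mpn, mpn) * MICRO_PAGE_SIZE + offset

    def new_epoch(self, hot_mpns, reserved_base_mpn):
        """Cluster this epoch's hottest micro-pages into contiguous slots
        so they land in the same DRAM row."""
        self.remap = {mpn: reserved_base_mpn + i
                      for i, mpn in enumerate(hot_mpns)}


table = IndirectionTable()
table.new_epoch(hot_mpns=[7, 42, 99], reserved_base_mpn=1000)
hot = table.translate(42 * 1024 + 16)    # micro-page 42 -> slot 1001
cold = table.translate(500 * 1024 + 3)   # unmapped page passes through
```

A real implementation would bound the table size and sit on the critical path of every memory access, which is exactly the disadvantage noted on the next slide.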
Tradeoffs
- Advantages
  - Clusters hot pages into the same row, which can potentially increase row buffer locality
  - The OS-managed technique can be done purely in software
- Disadvantages
  - The additional indirection adds latency on the critical path
  - Clustering hot pages into the same row does not necessarily mean more row buffer locality
Selective Promotion
"CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms"
X. Jiang et al., HPCA 2010

Motivation
- Many-core architectures provide lots of cores
- Problem: limited pin count -> restricted bandwidth to memory
- Want to reduce off-chip bandwidth by identifying and caching hot pages
  - In on-chip or 3D-stacked DRAM
  - Without requiring a large tag store
CHOP Architecture
- Keep a filter table of recently accessed pages
- On a page access:
  - Add an entry if the page is not already in the table
  - If it is in the table, increment its counter
- When an entry's counter reaches a certain threshold, cache the page on chip
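The filter logic above can be sketched as a small counter table with LRU replacement; a minimal model, assuming an LRU-managed filter (the capacity and threshold values here are illustrative, not the paper's parameters):

```python
from collections import OrderedDict


class ChopFilter:
    """Sketch of CHOP-style hot-page filtering: count accesses to recently
    touched pages and signal promotion to the on-chip DRAM cache once a
    page's counter crosses a threshold."""

    def __init__(self, capacity=4, threshold=3):
        self.capacity = capacity
        self.threshold = threshold
        self.counts = OrderedDict()  # page -> access count, LRU-ordered

    def access(self, page):
        """Record one access; return True when the page turns hot."""
        if page in self.counts:
            self.counts[page] += 1
            self.counts.move_to_end(page)       # refresh LRU position
        else:
            if len(self.counts) >= self.capacity:
                self.counts.popitem(last=False)  # evict LRU filter entry
            self.counts[page] = 1
        if self.counts[page] >= self.threshold:
            del self.counts[page]                # promoted; stop tracking
            return True
        return False


flt = ChopFilter()
hits = [flt.access(0xA0) for _ in range(3)]  # third access crosses threshold
```

Because the filter only tracks recently accessed pages, its storage is far smaller than a tag store covering all of DRAM, which is the point of the design.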
Results
- 30% performance improvement over a traditional DRAM cache
  - Hot pages are closer to the CPU and don't need to be re-fetched
- Small (?) storage overhead (800 KB) compared to a traditional DRAM cache
  - Data is tracked only for recently accessed pages, not for all pages in DRAM
- Lower bandwidth consumption compared to a traditional DRAM cache
  - Caching hot pages means main memory is accessed less often
Tradeoffs
- Pros
  - Reduces off-chip bandwidth
  - Smaller tag store than tracking all of DRAM
- Cons
  - Requires changes to the chip / on-chip DRAM to work
  - The tag store is still somewhat large
Reuse
"Adaptive Insertion Policies for High Performance Caching"
M. K. Qureshi et al., ISCA 2007
"Adaptive Insertion Policies for Managing Shared Caches"
A. Jaleel et al., PACT 2008

Adaptive Insertion Policies
- Problem: cache thrashing
- Key insight: account for reuse
- Solution 1: LRU insertion
- Solution 2: Bimodal insertion -- pick a random e in [0, 1) on each miss:

  if (e > 1/32)
      LRU insertion
  else
      MRU insertion
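The bimodal insertion above can be sketched in a few lines; a minimal model of one set's recency stack (the function name and the injectable `rng` parameter are illustrative conveniences):

```python
import random

EPSILON = 1 / 32  # bimodal throttle: insert at MRU only this often


def bip_insert(stack, block, rng=random.random):
    """Bimodal insertion into a full recency stack.

    stack[0] is the MRU position, stack[-1] the LRU position.
    With probability EPSILON the new block goes to MRU (so genuinely hot
    blocks can climb); otherwise it is parked at LRU (so a thrashing
    stream cannot flush the whole set).
    """
    stack.pop()                    # evict the LRU victim
    if rng() < EPSILON:
        stack.insert(0, block)     # rare: MRU insertion
    else:
        stack.append(block)        # common: LRU insertion


stack = [4, 3, 2, 1]
bip_insert(stack, 5, rng=lambda: 0.5)  # common case -> parked at LRU
bip_insert(stack, 6, rng=lambda: 0.0)  # rare case -> inserted at MRU
```

After the two calls the stack is `[6, 4, 3, 2]`: block 5 entered at LRU and was immediately the next victim, while block 6 drew the rare MRU insertion.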
Adaptive Insertion Policies
- Solution 3: Dynamic insertion -- choose between MRU and bimodal insertion at runtime
  - A saturating PSEL counter tracks which policy misses less, evaluated over intervals of M instructions
  - Set dueling: a few dedicated sets always use each policy; the remaining sets follow the winner
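The set-dueling mechanism can be sketched as follows; a minimal model in which every 32nd set is dedicated to MRU insertion and its neighbor to BIP (the class name, the 1-in-32 dedication pattern, and the tie-break at the midpoint are illustrative choices, not the paper's exact configuration):

```python
class SetDueling:
    """Sketch of PSEL-based set dueling between MRU insertion and BIP."""

    def __init__(self, psel_bits=10):
        self.midpoint = 1 << (psel_bits - 1)
        self.psel_max = (1 << psel_bits) - 1
        self.psel = self.midpoint          # start undecided

    def policy_for(self, set_index):
        if set_index % 32 == 0:
            return "MRU"                   # dedicated MRU-insertion set
        if set_index % 32 == 1:
            return "BIP"                   # dedicated BIP set
        # Follower sets adopt whichever dedicated policy misses less.
        return "MRU" if self.psel < self.midpoint else "BIP"

    def on_miss(self, set_index):
        # A miss in a dedicated set is a vote against that set's policy.
        if set_index % 32 == 0:
            self.psel = min(self.psel + 1, self.psel_max)   # MRU missed
        elif set_index % 32 == 1:
            self.psel = max(self.psel - 1, 0)               # BIP missed


duel = SetDueling()
for _ in range(600):
    duel.on_miss(0)          # MRU-dedicated sets keep missing
winner = duel.policy_for(5)  # followers switch to "BIP"
```

The appeal of set dueling is its cost: one saturating counter plus a fixed rule for which sets are dedicated, with no per-set metadata.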
Adaptive Insertion Policies
- Results
Adaptive Insertion Policies
- Advantages
  - Retains cache blocks with high reuse
    - Higher cache hit rate than MRU insertion
    - Avoids thrashing
  - Adapts to program phase changes
    - Aware of individual workload characteristics
  - Low implementation overhead
- Disadvantages
  - Slightly harmful for workloads with good temporal locality
Synthesis

Goal
  Micro-Pages: Improve row buffer locality
  CHOP: (1) Reduce off-chip memory bandwidth; (2) small tag store
  DIP: (1) Retain high-reuse blocks; (2) avoid thrashing

Memory hierarchy
  Micro-Pages: Main memory
  CHOP: On-chip DRAM cache
  DIP: Last-level cache

Granularity
  Micro-Pages: 1 KB u-pages in a 4 KB row buffer
  CHOP: OS page
  DIP: 64 bytes

Row buffer locality
  Micro-Pages: Increases
  CHOP: Does not consider
  DIP: N/A

Reuse
  Micro-Pages: Top N frequently accessed u-pages kept in dedicated rows
  CHOP: Threshold based on number of accesses
  DIP: Low-reuse blocks get evicted quickly due to LRU insertion

Applies to PCM
  Micro-Pages: Applicable, but with wear and energy concerns
  CHOP: Use PCM instead of a large DRAM main memory
  DIP: Possible to adapt the insertion policy to account for reuse
Thank You!
Appendix

Adaptive Insertion Policies
- Solution for multi-core: decide on MRU or bimodal insertion on a per-thread basis
Adaptive Insertion Policies
- Problem: cache thrashing
- Access pattern: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...
- With MRU insertion on a 4-way set (MRU -> LRU):
  [4, 3, 2, 1] -> access 5: [5, 4, 3, 2] -> access 1: [1, 5, 4, 3] -> access 2: [2, 1, 5, 4]
- Every block is evicted just before it is reused: no account for reuse
Adaptive Insertion Policies
- Key insight: account for reuse
- Solution 1: LRU insertion
- Access pattern: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...
- With LRU insertion on a 4-way set (MRU -> LRU):
  [4, 3, 2, 1] -> access 5: [4, 3, 2, 5] -> access 1: [4, 3, 2, 1] -> access 2: [2, 4, 3, 1] -> access 3: [3, 2, 4, 1]
- Blocks 2-4 stay resident and hit; only the LRU slot thrashes
- Access pattern: 6, 7, 8, 9, 10, 6, 7, 8, 9, 10, ... (a changed working set, which pure LRU insertion cannot adapt to)
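The hand-traced examples above can be checked with a one-set simulator; a sketch (the function, the 4-way set size, and the 20-access pattern are illustrative):

```python
def simulate(pattern, ways=4, insert_at_mru=True):
    """Count hits for an access stream on one set.

    stack[0] is the MRU position, stack[-1] the LRU position.
    Hits always promote the block to MRU; the policy only controls
    where a *missing* block is inserted.
    """
    stack, hits = [], 0
    for block in pattern:
        if block in stack:
            hits += 1
            stack.remove(block)
            stack.insert(0, block)   # promote to MRU on hit
        else:
            if len(stack) == ways:
                stack.pop()          # evict the LRU victim
            if insert_at_mru:
                stack.insert(0, block)
            else:
                stack.append(block)  # LRU insertion
    return hits


pattern = [1, 2, 3, 4, 5] * 4        # the thrashing pattern from the slide
mru_hits = simulate(pattern, insert_at_mru=True)   # 0: every access misses
lip_hits = simulate(pattern, insert_at_mru=False)  # 9: blocks 2-4 stay put
```

Under MRU insertion the cyclic pattern scores zero hits in 20 accesses, while LRU insertion confines the churn to the LRU slot and hits 9 times, matching the two traces above.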