[Paper Review]
Minimalist Open-page:
A DRAM Page-mode Scheduling Policy
for the Many-core Era
Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+
MICRO’11
+ The University of Texas at Austin
* IBM Corp.
Korea University, VLSI Signal Processing Lab.
Jinil Chung (정진일)
([email protected])
Abstract
DRAM: a balance between performance, power, and storage density
To realize good performance, the structural and timing restrictions of the DRAM devices must be managed
Use of the “page-mode” feature can mitigate many DRAM constraints
[IEEE Spectrum(link)]
Aggressive page-mode use results in many conflicts (e.g. bank conflicts) when multiple workloads in many-core systems map to the same DRAM
In this paper, the Minimalist approach: “just enough” page-mode accesses to get the benefits while avoiding unfairness
→ Proposed: address hashing + data prefetch engine + per-request priority
1. Introduction
Row buffer (or “page-mode”) access
                                       | Open-page policy | Closed-page policy
Page-mode gain                         | Reduces row access latency | None (single col. access per row activation)
Multiple requests in many-core system  | Introduces priority inversion and fairness/starvation problems | Avoids the complexities of row buffer management
This paper proposes a combination of open/closed-page policies based on:
1) Page-mode gain with only a small number of page accesses
→ Propose a fair DRAM address mapping scheme: low RBL & high BLP
2) Page-mode hits from spatial locality (NOT temporal locality!), which can be captured by prefetch engines
→ Propose an intuitive criticality-based memory request priority scheme
RBL: Row-buffer Locality
BLP: Bank-level Parallelism
2. Background
DRAM timing constraints result in “dead time” before and after each random access
The MC (Memory Controller)’s job is to reduce these performance-limiting gaps using parallelism
1) tRC (row cycle time; ACT-to-ACT @same BK)
: after the MC activates a page → it must wait tRC before the next activation @same BK
: multiple threads accessing diff. rows @same BK → latency overhead (tRC delay)
2) tRP (row precharge time; PRE-to-ACT @same BK)
: in open-page policy, when the MC activates another page → tRP penalty @same BK (= the current page must be closed before the new page is opened)
[Figure: ACT → PRE → ACT @same bank; tRAS (e.g. 36ns) from ACT to PRE, tRP (e.g. 12ns) from PRE to ACT, tRC (e.g. 48ns) from ACT to ACT]
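As a rough illustration of the two same-bank constraints above, the sketch below computes the earliest time a new row can be activated. The timing values are the example values from the slide; the function name is hypothetical, and all other DRAM timing constraints are ignored.

```python
T_RAS_NS = 36  # ACT-to-PRE, example value from the slide
T_RP_NS = 12   # PRE-to-ACT, example value from the slide
T_RC_NS = 48   # ACT-to-ACT, example value from the slide


def earliest_next_activate(last_act_ns: int, precharge_ns: int) -> int:
    """Earliest time (ns) a new row can be activated in the same bank.

    Hypothetical helper: it only models tRAS, tRP, and tRC.
    """
    if precharge_ns < last_act_ns + T_RAS_NS:
        raise ValueError("PRE issued before tRAS has elapsed")
    # Both the ACT-to-ACT and the PRE-to-ACT constraints must hold.
    return max(last_act_ns + T_RC_NS, precharge_ns + T_RP_NS)
```

With these example values tRC = tRAS + tRP, so precharging as early as possible (36ns after the ACT) allows the next ACT at 48ns; a later precharge pushes the next ACT back by tRP from that point.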
3. Motivation
Use of “page-mode” …
1) Latency Effects: due to tRC & tRP, overall latency increases → small # of accesses?
2) Power Reduction: only activate power is reduced → small # of accesses is enough
3) Bank Utilization: drops off quickly as accesses increase → small # of accesses is enough
4) Other DRAM complexities: a small # of accesses is needed to soften restrictions
ex) tFAW (Four page Activate time Window; 30ns), cache block transfer delay=6ns
-. single access per ACT: limited peak utilization (6ns*4/30ns=80%)
-. two~ accesses per ACT: not limited peak utilization (12ns*4/30ns>100%)
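The tFAW arithmetic above can be checked with a few lines. This is only a sketch: it uses the 30ns window and the 6ns per-access transfer time that appear in the calculation above, and the function name is hypothetical.

```python
T_FAW_NS = 30              # four-activate window, from the slide
BLOCK_TRANSFER_NS = 6      # per-access data transfer time used in the slide's arithmetic
ACTIVATES_PER_WINDOW = 4   # at most four ACTs per tFAW window


def peak_data_bus_utilization(accesses_per_activate: int) -> float:
    """Upper bound on data bus utilization imposed by tFAW alone."""
    busy_ns = ACTIVATES_PER_WINDOW * accesses_per_activate * BLOCK_TRANSFER_NS
    # Utilization cannot exceed 100%; beyond that tFAW is no longer the limit.
    return min(1.0, busy_ns / T_FAW_NS)
```

One access per activation is capped at 24ns of transfers per 30ns window (80%); with two or more accesses per activation the window no longer limits the bus.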
[Figure annotations: 62%, 16%; reference: closed-page policy]
If B/U (bank utilization) is high, the probability that a new request will conflict w/ a busy bank is greater.
3. Motivation
3.1 Row-buffer locality in Modern Processors
: in current WS/server-class designs
→ large last-level cache
(e.g. IBM POWER7)
Temporal locality: hits to the large last-level cache
Row buffers exploit only spatial locality
Using prefetch engines, spatial locality can be predicted
RBL: Row-buffer Locality
3. Motivation
3.2 Bank and Row Buffer Locality Interplay with Address Mapping
-. DRAM device address: row, column, and bank
Workload A: long sequential access seq.
Workload B: single operation
e.g. FR-FCFS (all DRAM column bits mapped to the low-order real addr. bits)
Workload A: higher priority → slows B0
e.g. ATLAS, PAR-BS (all DRAM column bits mapped to the low-order real addr. bits)
Workload B: higher priority → slows A4
e.g. Minimalist (DRAM column & bank bits mapped to the low-order real addr. bits)
High BLP (Bank-level Parallelism) → B0 can be serviced w/o degrading traffic to workload A
4. Minimalist Open-page Mode
4.1 DRAM Address Mapping Scheme
-. The basic difference is that the Column access bits are split in two places.
+. 2 LSB bits are located right after the Block bits
+. 5 MSB bits are located just before the Row bits
-. (Not shown in the figure) higher order address bits are XOR-ed with the bank bits to produce the actual bank selection bits → reducing row buffer conflicts [Zhang et al./MICRO’00]
[Figure: address mapping; the 7-bit column field is split into a 5-bit and a 2-bit part, for sequential access of 4 cache lines]
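The mapping described above can be sketched as a small bit-slicing function. Only the 2-bit / 5-bit column split and the XOR step come from the slide; the block size, the number of bank and row bits, the position of the bank bits between the two column parts, and the choice of the low row bits as the XOR source are assumptions made for illustration.

```python
from typing import NamedTuple, Tuple

BLOCK_BITS = 6    # assumed 64-byte cache block
COL_LO_BITS = 2   # from the slide: 2 column LSBs right after the block bits
BANK_BITS = 3     # assumed 8 banks, placed between the two column parts
COL_HI_BITS = 5   # from the slide: 5 column MSBs just before the row bits
ROW_BITS = 14     # assumed


class DramAddress(NamedTuple):
    row: int
    bank: int
    column: int


def _take(value: int, bits: int) -> Tuple[int, int]:
    """Return (low `bits` bits of value, remaining upper bits)."""
    return value & ((1 << bits) - 1), value >> bits


def map_address(addr: int) -> DramAddress:
    _, rest = _take(addr, BLOCK_BITS)        # drop the block offset
    col_lo, rest = _take(rest, COL_LO_BITS)
    bank, rest = _take(rest, BANK_BITS)
    col_hi, rest = _take(rest, COL_HI_BITS)
    row, _ = _take(rest, ROW_BITS)
    # XOR permutation: assumed here to use the low-order row bits.
    bank ^= row & ((1 << BANK_BITS) - 1)
    return DramAddress(row, bank, (col_hi << COL_LO_BITS) | col_lo)
```

Under these assumptions, four consecutive cache lines fall in the same row of one bank (up to four page-mode accesses per activation), and the fifth line moves to the next bank, which keeps row-buffer locality low and bank-level parallelism high.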
4. Minimalist Open-page Mode
4.2 Data Prefetch Engine [IBM POWER6]
: predictable “page-mode” opportunities → need for an accurate prefetch engine
: each core includes a HW prefetcher w/ a prefetch depth distance predictor
1) Multi-line Prefetch Requests
-. Multi-line prefetch operation: a single request (indicating a specific seq. of cache lines)
-. Reduces command BW and queue resource usage
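To make the multi-line idea concrete, here is a minimal sketch of one request that stands for several sequential cache lines. The encoding (base line index plus line count), the class name, and the 64-byte line size are assumptions for illustration, not the format used in the paper.

```python
from dataclasses import dataclass
from typing import List

LINE_BYTES = 64  # assumed cache line size


@dataclass(frozen=True)
class MultiLinePrefetch:
    """Hypothetical encoding: one request covering several sequential lines."""
    base_line: int   # index of the first cache line
    num_lines: int   # how many sequential lines the request covers

    def line_addresses(self) -> List[int]:
        """Expand the single request into the byte addresses it stands for."""
        return [(self.base_line + i) * LINE_BYTES for i in range(self.num_lines)]
```

A request such as `MultiLinePrefetch(base_line=8, num_lines=4)` occupies one queue entry and one command slot, yet describes four line transfers.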
4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme
: In OOO execution, the importance of each request can vary both between and within applications → need for a dynamic priority scheme
1) DRAM Memory Requests Priority Calculation
-. different priorities based on criticality to performance
-. the priority of each request is increased every 100ns time interval → time-based
-. 2 categories: read (normal) and prefetch → read requests have higher priority
-. MLP information from the MSHR in each core: many misses → less important
-. Distance information from the prefetch engine (4.2)
MLP: Memory Level Parallelism
MSHR: Miss Status Holding Register
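The priority inputs listed above can be combined as in the sketch below. Only the 100ns aging interval, the read-over-prefetch ordering, and the "many misses → less important" rule come from the slide. The numeric levels, the field names, and the assumption that a prefetch far ahead of the demand stream is less urgent are all hypothetical.

```python
from dataclasses import dataclass

AGING_INTERVAL_NS = 100  # from the slide: priority increases every 100ns


@dataclass
class Request:
    is_prefetch: bool
    arrival_ns: int
    outstanding_misses: int = 0  # MLP hint from the core's MSHR (reads)
    prefetch_distance: int = 0   # hint from the prefetch engine (prefetches)


def priority(req: Request, now_ns: int) -> int:
    """Higher value = served first. The level ranges are hypothetical."""
    if req.is_prefetch:
        # Assumed: prefetches far ahead of the demand stream are less urgent.
        base = max(0, 3 - req.prefetch_distance)       # levels 0..3
    else:
        # From the slide: many outstanding misses -> each one is less critical.
        base = 4 + max(0, 3 - req.outstanding_misses)  # levels 4..7
    # Time-based aging so that waiting requests eventually get served.
    aging = (now_ns - req.arrival_ns) // AGING_INTERVAL_NS
    return base + aging
```

Because aging is added on top of the base level, a prefetch that has waited long enough can overtake a newly arrived read, which limits starvation without any per-thread bookkeeping.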
4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme (cont.)
2) DRAM Page Closure (Precharge) Policy
-. Using autoprecharge → frees command BW (no separate PRE command is needed)
3) Overall Memory Requests Scheduling Scheme (Priority Rules 1)
-. The same rules are used by all MCs → no need for communication among MCs
-. if an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher-priority request → a very critical request can be serviced w/ the smallest latency
4) Handling write operations
-. the dynamic priority scheme does not apply to writes
-. Using the VWQ (Virtual Write Queue) → causing minimal write instructions
5. Evaluation
-. 8-core CMP system using the Simics functional model extended w/ the GEMS toolset
-. Simulates DDR3 1333MHz DRAM, using the corresponding memory controller policy for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies: Table 5
1) PAR-BS (Parallelism-aware Batch Scheduler)
2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
3) FR-FCFS (First-Ready, First-Come-First-Served): baseline
5. Evaluation
5.1 Throughput
-. Overall, “Minimalist Hash+Priority" demonstrated the best throughput
improvement over the other schemes, achieving a 10% improvement.
-. This is compared against ATLAS and PAR-BS that achieved 3.2% and 2.8%
throughput improvements over the whole workload suite.
5. Evaluation
5.2 Fairness
-. Minimalist improves fairness by up to 15%, with an overall improvement of 7.5%, 3.4%, and 2.5% over FR-FCFS, PAR-BS, and ATLAS, respectively.
5. Evaluation
5.3 Row Buffer Access per Activation
-. The observed page-access rate for the aggressive open-page policies falls significantly short → a high page hit rate is simply not possible given the interleaving of requests between the eight executing programs.
-. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of four.
5. Evaluation
5.4 Target Page-hit Count Sensitivity
-. The Minimalist system requires a target number of page hits to be selected, which indicates the maximum number of page hits the scheme attempts to achieve per row activation.
-. A target of 4 page hits provides the best results.
(Note that a different system configuration may shift the optimal page-mode hit count.)
5. Evaluation
5.5 DRAM Energy Consumption
-. To estimate the power consumption we used the Micron power calculator
-. Approximately the same as FR-FCFS. PAR-BS, ATLAS and “Minimalist
Hash+Priority" provide a small decrease of approximately 5% to the overall energy
consumption.
-. The energy results are essentially a balance between the decrease in page-mode hits (resulting in higher DRAM activation power) and the increase in system performance (decreasing runtime).
Conclusions
Minimalist Open-page memory scheduling policy
-. Page-mode gain w/ small number of page accesses for each page activation
-. Assign per-request priority using request stream information in MLP and data
prefetch engine
Improving throughput and fairness
-. Throughput increased by 10% on average (compared to FR-FCFS)
-. No need for thread based priority information
-. No need for communication/coordination among multiple MC or OS
Appendix. Detailed simulation information
Thanks,