Recent Progress in Embedded Memory Controller Design
Memory Hierarchy
Latency, Capacity, Bandwidth

Level   Latency   Capacity   Bandwidth
Cache   0.5 ns    10 MB      -
DRAM    50 ns     100 GB     100 GB/s
Flash   10 us     2 TB       2 GB/s
Disk    10 ms     4 TB       600 MB/s
DRAM Primer
Addressed by <bank, row, column>
One page buffer per bank
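The <bank, row, column> decomposition is just bit-field extraction. In the sketch below the field widths (4 banks, 8K rows, 1K columns) are hypothetical, chosen only for illustration:

```python
# Decompose a flat address into <bank, row, column> fields.
# Field widths are hypothetical: 4 banks, 8K rows, 1K columns.
BANK_BITS, ROW_BITS, COL_BITS = 2, 13, 10

def decode(addr):
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col

# Two addresses with the same bank and row hit the same open page buffer.
a, b = 0x12345, 0x12345 + 8
print(decode(a))   # -> (0, 72, 837)
```

Sequential accesses change only the column bits, so they stay within one open page; random accesses change the row bits and force page crossings.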
DRAM Characteristics
DRAM page crossing:
• Charges ~10K DRAM cells and bitlines
• Increases power & latency
• Decreases effective bandwidth
Sequential access vs. random access:
• Less page crossing
• Lower power consumption
• 4.4x shorter latency
• 10x better BW
Take Away: DRAM = Disk
Embedded Controller
Bad news: none available off the shelf, unlike in general-purpose processors
Good news: opportunities for customization
Agenda
Overview
Multi-Port Memory Controller (MPMC) Design
“Out-of-Core” Algorithmic Exploration
Motivating Example: H.264 Decoder
Diverse QoS requirements:
• Latency-sensitive ports
• Bandwidth-sensitive ports
• Dynamic latency, BW and power requirements
[Figure: H.264 decoder block diagram; per-port bandwidth demands span 0.09 to 164.8 MB/s (0.09, 1.2, 6.4, 9.6, 31.0, 94, 156.7, 164.8)]
Wanted
Bandwidth guarantee
Prioritized access
Reduced page crossing
Previous Works
Desired properties:
• Q0: Bandwidth guarantee for different classes of ports
• Q1: Bandwidth guarantee for each individual port
• Q2: Prioritized access
• Q3: Residual bandwidth allocation
• Q4: Effective DRAM bandwidth
Prior schedulers ([Rixner,00], [McKee,00], [Hur,04], [Heighecker,03,05], [Whitty,08], [Lee,05], [Burchard,05]) each satisfy only a subset of Q0-Q4; the proposed BCBR satisfies all five.
Key Observations
• Port locality: the same port tends to request the same DRAM page → bursting service
• Weighted round robin: statically allocated BW is underutilized at runtime → minimum BW guarantee with credit borrow & repay
• Service-time flexibility: 1/24 second to decode a video frame = 4M cycles at 100 MHz, ample time for request reordering → reorder requests according to priority
• Residual bandwidth: dynamic BW calculation captures and re-allocates residual BW
Weighted Round Robin
Assume bandwidth requirements over a round of Tround = 10 scheduling cycles: Q2: 30%, Q1: 50%, Q0: 20%
Time is measured in scheduling cycles; T(Rij) is the arrival time of the jth request of Qi
[Timeline: cycles 0-9; Q2 is served first (R20-R22, cycles 0-2), then Q1 (R10-R14, cycles 3-7), then Q0 (R00-R01, cycles 8-9)]
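The static schedule can be sketched in a few lines of Python. This is a simplification that assumes all requests are already queued at the start of the round; `wrr_schedule` and its argument layout are illustrative, not the paper's implementation:

```python
from collections import deque

def wrr_schedule(queues, weights, t_round):
    """Weighted round robin: each port i gets weights[i] * t_round
    service slots per round, visited in fixed port order."""
    order = []
    while any(queues):                           # until all queues drain
        for i, q in enumerate(queues):
            budget = round(weights[i] * t_round)  # static per-round allocation
            while budget > 0 and q:
                order.append(q.popleft())
                budget -= 1
    return order

# Slide example: Tround = 10, with Q2: 30%, Q1: 50%, Q0: 20%.
q2 = deque(["R20", "R21", "R22"])
q1 = deque(["R10", "R11", "R12", "R13", "R14"])
q0 = deque(["R00", "R01"])
print(wrr_schedule([q2, q1, q0], [0.3, 0.5, 0.2], 10))
# Q2 is served first, then Q1, then Q0: latency-sensitive Q0 waits 8 cycles.
```

The output makes the next slide's problem visible: Q0's requests arrive at cycle 0 but are only serviced in the last two slots of the round.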
Problem with WRR
Priority: Q0 > Q2, yet R00 and R01 (arriving at cycle 0) are not served until cycles 8 and 9.
8 cycles of waiting time! Could be worse!
[Timeline: same arrivals as before; Q0 is served last despite its higher priority]
Borrow Credits
Zero waiting time for Q0!
[Timeline: R00 and R01 are served immediately in cycles 0-1 using slots borrowed from Q2; each borrowed slot pushes a Q2 entry into debtQ0]
Repay Later
At Q0's turn, the BW guarantee is recovered: the borrowed slots are repaid, and Q2's deferred requests (R21, R22) are serviced out of debtQ0 in the slots originally allocated to Q0.
Prioritized access!
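A heavily simplified sketch of credit borrow & repay follows; the name `bcbr_schedule` and the single-debt-queue bookkeeping are illustrative assumptions, not the actual controller logic:

```python
from collections import deque

def bcbr_schedule(queues, credits, priority_port):
    """Credit borrow & repay, heavily simplified.

    queues[i]: pending requests of port i; credits[i]: slots per round.
    When the priority port has a pending request during another port's
    slot, it is served immediately (borrow) and the lender is recorded
    in a debt queue; at the priority port's own turn, its slots are
    given back to the lenders (repay)."""
    order, debt = [], deque()
    for port in range(len(queues)):
        for _ in range(credits[port]):
            if port != priority_port and queues[priority_port]:
                # Borrow: serve the priority port in the lender's slot.
                order.append(queues[priority_port].popleft())
                debt.append(port)                 # remember who we owe
            elif port == priority_port and debt:
                # Repay: give this slot back to the earliest lender.
                lender = debt.popleft()
                if queues[lender]:
                    order.append(queues[lender].popleft())
            elif queues[port]:
                order.append(queues[port].popleft())
    return order

# Ports: 0 = Q2 (3 slots), 1 = Q1 (5 slots), 2 = Q0 (2 slots, priority).
q2 = deque(["R20", "R21", "R22"])
q1 = deque(["R10", "R11", "R12", "R13", "R14"])
q0 = deque(["R00", "R01"])
print(bcbr_schedule([q2, q1, q0], [3, 5, 2], priority_port=2))
# R00, R01 come first (borrowed from Q2); R21, R22 are repaid at Q0's turn.
```

As on the slides, Q0 sees zero waiting time while every port still receives its per-round slot count, so the static bandwidth shares are preserved.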
Problem: Depth of DebtQ
Use DebtQ as a residual BW collector:
• BW allocated to Q0 increases to 20% + residual BW
• The required depth of DebtQ0 decreases
[Timeline: idle slots of other queues help repay the debt, so deferred requests drain from debtQ0 faster]
Evaluation Framework
Simulation Framework
Workload: ALPBench suite
DRAMSim: simulates DRAM latency+BW+power
Reference schedulers: PQ, RR, WRR, BGPQ
Bandwidth Guarantee
Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20% (system residual: 8%)
Measured bandwidth share per port:

Port (guarantee)   RR      PQ      BGPQ    WRR     BCBR
P0 (2%)            1.08%   0.73%   1.07%   0.76%   0.76%
P1 (30%)           24%     80%     39%     33%     33%
P2 (20%)           24%     18%     20%     22%     22%
P3 (20%)           24%     0%      20%     22%     22%
P4 (20%)           24%     0%      20%     22%     22%

RR and PQ offer no BW guarantee (under RR, P1 falls below its 30%; under PQ, P3 and P4 starve); BCBR provides the BW guarantee!
Cache Response Latency
• Average 16x faster than WRR
• As fast as PQ (prioritized access)
[Figure: average cache response latency (ns) per scheduler]
DRAM Energy & BW Efficiency
• 30% less page crossing (compared to RR)
• 1.4x more energy efficient
• 1.2x higher effective DRAM BW
• As good as WRR (exploits port locality)

                 RR      BGPQ    WRR     BCBR
GB/J             0.298   0.289   0.412   0.411
Act-Pre Ratio    29.6%   30.1%   23.0%   23.0%
Improvement      1.0x    0.97x   1.38x   1.38x
Hardware Cost

                                  LUTs   Registers   BRAMs
BCBR frontend                     1393   884         0
Speedy DDRMC (reference backend)  1986   1380        4
BCBR + Speedy                     3379   2264        4
Xilinx MPMC (frontend + backend)  3450   5540        1-9

Better performance without higher cost!
Agenda
Overview
Multi-Port Memory Controller (MPMC) Design
"Out-of-Core" Algorithm / Architecture Exploration
Idea
Out-of-core algorithms:
• Data does not fit in DRAM
• Performance is dominated by I/O
Key questions:
• Reduce the number of I/Os
• Choose the block granularity
Remember: DRAM = Disk. So let's ask the same questions, plug in DRAM parameters, and get DRAM-specific answers.
Motivating Example: CDN
Caches in a CDN:
• Get closer to users
• Save bandwidth
Zipf's law: an 80-20 rule governs the hit rate
Video Cache
Defining the Knobs
Transaction: a number of column access commands, enclosed by row activation / precharge
• W (burst size): a function of array organization & timing parameters
• s (number of bursts): a function of algorithmic parameters
• Activation / precharge cost: a function of array organization & timing parameters
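One way to read these knobs is as a per-transaction cost model: a transaction of s bursts of W words pays the activate/precharge overhead once. The sketch below uses made-up timing constants (`T_ACT`, `T_PRE`, `T_BURST`, `WORD_BYTES`), not values from any datasheet:

```python
# Per-transaction DRAM cost model: s column bursts of W words each,
# enclosed by a row activation and a precharge.
# Timing values (ns) are illustrative placeholders only.
T_ACT, T_PRE, T_BURST = 15.0, 15.0, 5.0   # activate, precharge, one burst
WORD_BYTES = 8

def transaction_time(s):
    # array-organization/timing part (T_ACT, T_PRE) plus the
    # algorithm-dependent part (s bursts)
    return T_ACT + s * T_BURST + T_PRE

def effective_bandwidth(s, w):
    bytes_moved = s * w * WORD_BYTES
    return bytes_moved / transaction_time(s)   # bytes per ns = GB/s

# Larger transactions amortize the activate/precharge overhead:
print(effective_bandwidth(1, 8), effective_bandwidth(16, 8))
```

Plugging DRAM parameters into this model is what makes the out-of-core question DRAM-specific: the block size that maximizes effective bandwidth falls out of the timing constants.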
d-ary Heap
Algorithmic design variables: branching factor, record size
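A d-ary heap makes the branching factor an explicit algorithmic design variable; in the out-of-core view, d would be chosen so that one node's children fill a whole DRAM transaction. A minimal sketch (not the talk's implementation):

```python
class DaryHeap:
    """Min-heap with a configurable branching factor d.

    In the out-of-core setting, d would be tuned so that one node's
    children occupy exactly one DRAM transaction (s bursts of W words).
    """
    def __init__(self, d):
        self.d, self.a = d, []

    def push(self, x):
        self.a.append(x)
        i = len(self.a) - 1
        while i > 0 and self.a[(i - 1) // self.d] > self.a[i]:
            p = (i - 1) // self.d
            self.a[i], self.a[p] = self.a[p], self.a[i]
            i = p                                 # sift up

    def pop(self):
        a, d = self.a, self.d
        a[0], a[-1] = a[-1], a[0]
        top = a.pop()
        i = 0
        while True:                               # sift down
            kids = range(d * i + 1, min(d * i + d + 1, len(a)))
            c = min(kids, key=lambda k: a[k], default=None)
            if c is None or a[i] <= a[c]:
                break
            a[i], a[c] = a[c], a[i]
            i = c
        return top

h = DaryHeap(d=4)
for x in [5, 3, 8, 1, 9, 2]:
    h.push(x)
print([h.pop() for _ in range(6)])   # pops in ascending order
```

A larger d makes the tree shallower (fewer levels, hence fewer transactions per operation) at the cost of scanning more children per level, which is exactly the trade-off the DRAM parameters decide.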
B+ Tree
Lessons Learned
• The optimal result can be beautifully derived!
• Big-O does not matter in some cases; it depends on the input data characteristics