Systems and interconnects: Addressing the design challenge

An introduction to SDRAM and memory controllers (5kk73)
Presentation Outline (part 1)
Introduction to SDRAM
Basic SDRAM operation
Memory efficiency
SDRAM controller architecture
Conclusions
Static RAM (SRAM)
► SRAM is typically on-chip memory
► Found in higher levels of the memory hierarchy
  – Commonly used for caches and scratchpads
► Either local to processor or centralized
  – Local memory has very short access time
  – Centralized shared memories have intermediate access time
► An SRAM cell consists of six transistors
  – Limits memory to a few megabytes, or even smaller
Dynamic RAM (DRAM)
► DRAM was patented in 1968 by Robert Dennard at IBM
► Significantly cheaper than SRAM
  – DRAM cell has 1 transistor and 1 capacitor vs. 6 transistors for SRAM
  – A bit is represented by a high or low charge on the capacitor
  – Charge dissipates due to leakage – hence the term dynamic RAM
  – Capacity of up to a gigabyte per chip
► DRAM is (shared) off-chip memory
  – Long access time compared to SRAM
  – Off-chip pins are expensive in terms of area and power
    • SDRAM bandwidth is scarce and must be efficiently utilized
► Found in lower levels of the memory hierarchy
  – Used as remote high-volume storage
The DRAM evolution
► Evolution of the DRAM design in the past 15 years
  – A clock signal was added, making the design synchronous (SDRAM)
  – The data bus transfers data on both the rising and falling edge of the clock (DDR SDRAM)
  – Second and third generations of DDR memory (DDR2/DDR3) scale to higher clock frequencies (up to 800 MHz)
  – DDR4 is now standardized by JEDEC (up to 1200 MHz)
  – Special branches of DDR memories exist for graphics cards (GDDR) and for low-power systems (LPDDR)
SDRAM Architecture
► The SDRAM architecture is organized in banks, rows and columns
  – A row buffer stores a currently active (open) row
  – A row is opened with an activate command and closed with a precharge command
► The memory interface has a command bus, address bus, and a data bus
  – Buses are shared between all banks to reduce the number of off-chip pins
  – A bank is essentially an independent memory, but with shared I/O

Typical values DDR2/DDR3: 4 or 8 banks; 8K–65K rows / bank; 1K–2K columns / row; 4, 8, or 16 bits / column; 200–800 MHz; 32 MB – 1 GB density

Example memory (16-bit DDR3-1600, 64 MB): 8 banks; 8K rows / bank; 1024 columns / row; 16 bits / column; 3200 MB/s peak bandwidth
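The 3200 MB/s peak bandwidth of the example memory follows directly from the I/O clock frequency and the interface width. A minimal sketch in Python:

```python
def peak_bandwidth_mbps(io_clock_mhz, interface_bits):
    """Peak bandwidth of a DDR memory: two transfers per clock cycle
    (rising + falling edge) times the interface width in bytes."""
    transfers_per_second = io_clock_mhz * 1e6 * 2  # double data rate
    return transfers_per_second * (interface_bits // 8) / 1e6

# 16-bit DDR3-1600 runs its I/O clock at 800 MHz, 2 bytes per transfer
print(peak_bandwidth_mbps(800, 16))  # 3200.0 MB/s, as in the example
```

The same formula gives 800 MB/s for a 16-bit DDR2-400 device (200 MHz I/O clock).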
Presentation Outline
Introduction to SDRAM
Basic SDRAM operation
Memory efficiency
SDRAM controller architecture
Conclusions
Basic SDRAM Operation
► Requested row is activated and copied into the row buffer of the bank
► Read bursts and/or write bursts are issued to the active row
  – Programmed burst length (BL) of 4 or 8 words
► Row is precharged and stored back into the memory array

Command       Abbr.  Description
Activate      ACT    Activate a row in a particular bank
Read          RD     Initiate a read burst to an active row
Write         WR     Initiate a write burst to an active row
Precharge     PRE    Close a row in a particular bank
Refresh       REF    Start a refresh operation
No operation  NOP    Ignores all inputs
Timing Constraints
► Timing constraints determine which commands can be scheduled
  – More than 20 constraints, some are inter-dependent
  – Limits the efficiency of memory accesses
    • Wait for precharge, activate and read/write commands before data appears on the bus
  – Timing constraints get increasingly severe for faster memories
    • The physical design of the memory core has not changed much
    • Constraints in nanoseconds stay constant, but the clock period gets shorter

Parameter                 Abbr.  Cycles
ACT to RD/WR              tRCD   3
ACT to ACT (diff. banks)  tRRD   2
ACT to ACT (same bank)    tRAS   12
Read latency              tRL    3
RD to RD                  -      BL/2

July 20, 2015
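The effect of these constraints can be sketched by computing when read data appears on the bus after an activate, using the cycle counts from the table above. This is a deliberate simplification that ignores the other roughly twenty constraints:

```python
# Cycle counts taken from the table above (in clock cycles)
tRCD = 3   # ACT to RD/WR
tRL  = 3   # read latency: RD command to first data word
BL   = 8   # programmed burst length in words

def read_data_window(act_cycle):
    """Cycles during which one read burst occupies the data bus,
    assuming the RD command is issued as early as tRCD allows."""
    rd_cycle   = act_cycle + tRCD
    first_data = rd_cycle + tRL
    return first_data, first_data + BL // 2  # DDR: 2 words per cycle

start, end = read_data_window(0)
print(start, end)  # data occupies cycles 6..10: only 4 of 10 cycles transfer data
```

This illustrates the slide's point: the bus is idle from the activate until cycle tRCD + tRL, so single isolated accesses waste most cycles.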
Bank Parallelism
► Multiple banks provide parallelism
  – SDRAM has separate data and command buses
  – Activate, precharge and transfer data in parallel (bank preparation)
  – Increases efficiency
► Figure shows parallel memory bursts with burst length 8
Presentation Outline
Introduction to SDRAM
Basic SDRAM operation
Memory efficiency
SDRAM controller architecture
Conclusions
Memory Efficiency
► Memory efficiency is the fraction of clock cycles with data transfer
  – Defines the exchange rate between peak bandwidth and net bandwidth
  – Net bandwidth is the actual useful bandwidth after considering overhead
► Five categories of memory efficiency for SDRAM:
  – Refresh efficiency
  – Read/write efficiency
  – Bank efficiency
  – Command efficiency
  – Data efficiency
► Memory efficiency is the product of these five categories
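Since the five categories combine multiplicatively, net bandwidth can be estimated as peak bandwidth scaled by their product. A small sketch with illustrative efficiency values (not taken from the slides):

```python
from math import prod

def net_bandwidth(peak_mbps, efficiencies):
    """Net bandwidth = peak bandwidth times the product of the
    five memory-efficiency categories."""
    return peak_mbps * prod(efficiencies)

# Illustrative values: refresh, read/write, bank, command, data efficiency
effs = [0.97, 0.90, 0.80, 0.95, 0.75]
print(round(net_bandwidth(3200, effs)))  # ≈ 1592 MB/s of the 3200 MB/s peak
```

Note how a few moderately low factors compound: no single category is below 75%, yet less than half of the peak bandwidth survives.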
Refresh Efficiency
► SDRAM needs to be refreshed regularly to retain data
  – DRAM cell contains a leaking capacitor
  – Refresh command must be issued every 7.8 μs for DDR2/DDR3/DDR4
  – All banks must be precharged
  – Data cannot be transferred during refresh
► Refresh efficiency is largely independent of traffic
  – Generally 95–99%
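Refresh efficiency can be approximated from the refresh interval and the time one refresh blocks the memory. The 7.8 μs interval is from the slide; the blocked time below is an assumed, device-dependent value used purely for illustration:

```python
def refresh_efficiency(t_refi_ns, t_blocked_ns):
    """Fraction of time not spent refreshing: one refresh operation
    blocks data transfer for t_blocked_ns every t_refi_ns."""
    return 1.0 - t_blocked_ns / t_refi_ns

# tREFI = 7.8 us (from the slide); ~200 ns blocked per refresh is an
# assumed value covering precharging all banks plus the refresh itself
print(round(refresh_efficiency(7800, 200), 3))  # 0.974, within the 95-99% range
```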
Read / Write Efficiency
► The data bus of an SDRAM is bi-directional
  – Cycles are lost when switching direction of the data bus
  – Extra NOPs must be inserted between read and write commands
► Read/write efficiency depends on traffic
  – Determined by frequency of read/write switches
  – Switching too often has a significant impact on memory efficiency
    • Switching after every burst of 8 words gives 57% r/w efficiency with DDR2-400
► How would you address this if you designed a memory controller?
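The 57% figure can be reproduced with a simple model: a burst of 8 words occupies BL/2 = 4 data cycles, and each direction switch is assumed to cost about 3 lost cycles on average (an assumed penalty; the real cost differs per device and per switch direction):

```python
def rw_efficiency(bl_words, switch_penalty_cycles, bursts_per_switch):
    """Fraction of data-bus cycles transferring data when the bus
    direction changes every `bursts_per_switch` bursts."""
    data_cycles = bursts_per_switch * bl_words // 2  # DDR: 2 words/cycle
    return data_cycles / (data_cycles + switch_penalty_cycles)

# Switching after every burst of 8 words, with an assumed ~3-cycle penalty
print(round(rw_efficiency(8, 3, 1), 2))  # 0.57, matching the DDR2-400 figure
# Grouping four bursts per direction amortizes the penalty
print(round(rw_efficiency(8, 3, 4), 2))  # 0.84
```

Grouping reads with reads and writes with writes, as in the second call, is one common controller-side answer to the question on the slide.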
Bank Efficiency
► Bank conflict when a read or write targets an inactive row (row miss)
  – Significantly impacts memory efficiency
  – Requires precharge followed by activate
    • Less than 40% bank efficiency if always row miss in the same bank
► Bank efficiency depends on traffic
  – Determined by address of request and memory map
► How would you address this if you designed a memory controller?
Command Efficiency
► Command bus uses single data rate
  – Congested if two commands are required simultaneously
  – One command has to be delayed – may delay data on the bus
► Command efficiency depends on traffic
  – Small bursts reduce command efficiency
    • Potentially more activate and precharge commands issued
  – Generally quite high (95–100%)
Data Efficiency
► A memory burst can access segments of the programmed burst length
  – Minimum access granularity
    • A burst length of 8 words is 16 B with a 16-bit memory and 64 B with a 64-bit memory
  – Excess data is thrown away!
► If data is poorly aligned, an extra segment has to be transferred
  – Cycles are lost when transferring unrequested data
► Data efficiency depends on the memory client (and the application)
  – Smaller requests and bigger burst lengths reduce data efficiency
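Data efficiency can be modeled as requested bytes over transferred bytes, where the memory moves whole segments of the minimum access granularity:

```python
def data_efficiency(request_bytes, offset_bytes, granularity_bytes):
    """Requested bytes divided by the bytes actually transferred, when
    whole segments of the minimum access granularity are moved."""
    first_seg = offset_bytes // granularity_bytes
    last_seg  = (offset_bytes + request_bytes - 1) // granularity_bytes
    transferred = (last_seg - first_seg + 1) * granularity_bytes
    return request_bytes / transferred

# 64 B request, 16 B granularity (BL8 on a 16-bit memory)
print(data_efficiency(64, 0, 16))  # 1.0 - perfectly aligned
print(data_efficiency(64, 8, 16))  # 0.8 - misalignment costs one extra segment
```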
Conclusions on Memory Efficiency
► Memory efficiency is highly dependent on traffic
► Worst-case efficiency is very low
  – Every burst targets different rows in the same bank
  – Read/write switch after every burst
► Results in
  – Less than 31% efficiency for all DDR2/DDR3/LPDDR/LPDDR2 memories
  – Efficiency drops as memories become faster (DDR4)
Worst-case memory efficiency

[Chart: worst-case memory efficiency (0–1) for ten devices: 128MB_DDR2-400, 128MB_DDR2-800, 128MB_DDR3-800, 128MB_DDR3-1600, 128MB_LPDDR-266, 128MB_LPDDR-400, 256MB_LPDDR-416, 256MB_LPDDR2-667-S4, 256MB_LPDDR2-800-S4, 256MB_LPDDR2-1066-S4]

► Conclusion
  – Worst-case efficiency must be avoided!
  – (And what is wrong with this picture?)

Note: DDRx-y runs at y/2 MHz command rate and transports y memory words per cycle
Presentation Outline
Introduction to SDRAM
Basic SDRAM operation
Memory efficiency
SDRAM controller architecture
Conclusions
A general memory controller architecture
► A general controller architecture consists of two parts
► The front-end
  – buffers requests and responses per requestor
  – schedules one (or more) requests for memory access
  – is independent of the memory type
► The back-end
  – translates scheduled request(s) into SDRAM command sequences
  – is dependent on the memory type
Front-end arbitration
► Front-end provides buffering and arbitration
► Arbiter can schedule requests in many different ways
  – Priorities are common to give low-latency access to critical requestors
    • E.g. a stalling processor waiting for a cache line
    • Important to prevent starvation of low-priority requestors
  – Common to schedule fairly in case of multiple processors (round-robin, TDM)
  – Next request may be scheduled before the previous is finished
    • Gives more options to the command generator in the back-end
► Scheduled requests are sent to the back-end for memory access
Back-end
► Back-end contains a memory map and a command generator
► Memory map decodes a logical address to a physical address
  – Physical address is (bank, row, column)
  – Can be done in different ways – choice affects efficiency
  – Example: logical address 0x10FF00 → memory map → physical address (2, 510, 128)
► Command generator schedules commands for the target memory
  – Customized for a particular memory generation
  – Programmable to handle different timing constraints
Continuous memory map
► The memory map decodes a memory address into (bank, row, column)
  – Decoding is done by slicing the bits in the logical address
► Continuous memory map
  – Maps sequential addresses to columns in a row
  – Switches bank when all columns in a row are visited
  – Switches row when all banks are visited
Bank-interleaved memory map
► Bank-interleaved memory map
  – Maps bursts to different banks in interleaving fashion
  – Active row in a bank is not changed until all columns are visited
Memory map generalization
► Continuous and interleaving memory maps are just 2 possible memory mapping schemes
  – In the most general case, an arbitrary set of bits out of the logical address could be used for the row, column and bank address, respectively
► Example memory map (1 burst per bank, 2 banks interleaving, 8 words per burst):

  Logical address (bit 26 down to bit 0):
  RRR RRRR RRRR RRBB CCCC CCCB CCCW
  (R = row, BB = bank offset, lone B = bank interleaving, C = column, W = burst-size offset)

  Example memory: 16-bit DDR3-1600, 64 MB; 8 banks; 8K rows / bank; 1024 columns / row; 16 bits / column

► How would you choose a memory mapping scheme?
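The bit slicing above can be made concrete. The masks below follow the example layout (13 row bits, a split 3-bit bank field with one interleaving bit, 10 column bits, and 1 byte-offset bit for the 16-bit interface); treat it as an illustrative decode of this particular layout, not the controller's actual implementation (the back-end example earlier used a different, unspecified map):

```python
def decode(addr):
    """Decode a 27-bit logical byte address using the example layout
    RRR RRRR RRRR RRBB CCCC CCCB CCCW (R=row, B=bank, C=column, W=byte)."""
    word      = addr & 0x1              # bit 0:      byte within 16-bit word
    col_low   = (addr >> 1) & 0x7       # bits 3:1  - word within burst of 8
    bank_low  = (addr >> 4) & 0x1       # bit 4     - bank-interleaving bit
    col_high  = (addr >> 5) & 0x7F      # bits 11:5 - remaining column bits
    bank_high = (addr >> 12) & 0x3      # bits 13:12 - bank-offset bits
    row       = (addr >> 14) & 0x1FFF   # bits 26:14 - 13 row bits (8K rows)
    bank   = (bank_high << 1) | bank_low  # 3 bank bits -> 8 banks
    column = (col_high << 3) | col_low    # 10 column bits -> 1024 columns
    return bank, row, column, word

print(decode(0x10FF00))  # (6, 67, 960, 0) under this illustrative layout
```

Changing which bits feed the bank field is exactly what distinguishes a continuous map from a bank-interleaved one.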
Command generator
► Generates and schedules commands for scheduled requests
  – May work with both requests and commands
► Many ways to determine which request to process
  – Increase bank efficiency
    • Prefer requests targeting open rows
  – Increase read/write efficiency
    • Prefer read after read and write after write
  – Reduce stall cycles of processor
    • Always prefer reads, since reads are blocking and writes are often posted
  – What are the pros and cons of these methods?
  – What happens to the worst-case?
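The preferences listed above can be combined into a simple scoring function for picking the next request, in the spirit of open-row-first schedulers. This is an illustrative sketch, not any specific controller's algorithm; the request fields and tie-breaking by age are assumptions:

```python
def pick_next(queue, open_rows, last_was_read):
    """Pick the next request from `queue` (dicts with keys 'bank',
    'row', 'is_read', 'age'). Prefers open-row hits (bank efficiency),
    then same-direction accesses (read/write efficiency), then reads
    (processor stalls), and finally the oldest request."""
    def score(req):
        row_hit  = open_rows.get(req['bank']) == req['row']
        same_dir = req['is_read'] == last_was_read
        return (row_hit, same_dir, req['is_read'], req['age'])
    return max(queue, key=score)

queue = [
    {'bank': 0, 'row': 5, 'is_read': False, 'age': 3},
    {'bank': 1, 'row': 7, 'is_read': True,  'age': 1},  # hits open row
]
print(pick_next(queue, {1: 7}, last_was_read=True))  # the open-row read wins
```

The downside, as the slide hints, is the worst case: a request that never scores well can be postponed indefinitely unless the scheduler adds an aging or starvation guard.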
Command generator
► Generate SDRAM commands without violating timing constraints
  – Often built hierarchically: distribute requests across banks, and issue commands once timing constraints are satisfied; then choose which command for which bank is executed
► Many possible policies to determine which command to schedule
  – Page policies
    • Close rows as soon as possible to activate a new one faster
    • Keep rows open as long as possible to benefit from locality
  – Command priorities
    • Read and write commands have high priority, as they put data on the bus
    • Precharge and activate commands have lower priorities
  – Algorithms often try to put data on the bus as soon as possible
    • Microsoft proposes a self-learning memory controller that uses reinforcement learning to do long-term planning
Presentation Outline
Introduction to SDRAM
Basic SDRAM operation
Memory efficiency
SDRAM controller architecture
Conclusions
Conclusions
► SDRAM is used as shared off-chip high-volume storage
  – Cheaper but slower than SRAM
► The worst-case efficiency of SDRAM depends on
  – Refresh efficiency, bank efficiency, read/write efficiency, command efficiency, and data efficiency
  – Actual case is highly variable and depends on the application
► Controller tries to minimize latency and maximize efficiency
  – Low latency for critical requestors using priorities
  – Fairness among multiple processors
  – High efficiency by reordering requests to fit with memory state
► Memory map impacts efficiency and power
Questions?
Presentation Outline (part 2)
Mixed time-criticality
Firm Real-Time Controllers
Soft/No Real-Time Controllers
Mixed Real-Time Controllers
Conclusions
Trends in embedded systems
► Embedded systems get increasingly complex
  – Increasingly complex applications (more functionality)
  – Growing number of applications integrated in a device
  – More applications execute concurrently
  – Requires increased system performance without increasing power
► The resulting complex contemporary platforms
  – are heterogeneous multi-processor systems with a distributed memory hierarchy to improve the performance/power ratio
  – share resources to reduce cost
Mixed time-criticality
► Applications have mixed time-criticality
► Firm real-time requirements (FRT)
  – E.g. software-defined radio application
  – Failure to satisfy requirement may violate correctness
  – No deadline misses tolerable
► Soft real-time requirements (SRT)
  – E.g. media decoder application
  – Failure to satisfy requirement reduces quality of output
  – Occasional deadline misses tolerable
► No real-time requirements (NRT)
  – E.g. graphical user interface
  – No timing requirements, but must be responsive
Formal verification
► Verifying MRT systems requires a combination of methods
  – Formal verification
  – Simulation-based verification
► Formal verification is often used to verify FRT requirements
  – Provides analytical bounds on response time or throughput
  – Considers all application inputs
  – Covers all combinations of concurrently running applications
► Approach requires models of both applications and hardware
  – Application models are not always available
  – Behavior of dynamic applications is not captured accurately
  – Most hardware is not designed with formal analysis in mind
Simulation-Based Verification
► Simulation is typically used to verify SRT and NRT applications
  – System simulated with a large set of inputs
► Resource sharing results in interference between applications
  – Timing behaviors of applications in a use-case are inter-dependent
  – All use-cases must be verified instead of all applications
  – Verification must be repeated if applications are added or modified
  – Verification by simulation is a slow process with poor coverage
► Verification is costly and the effort is expected to increase in the future!
Performance guarantees for SDRAM
► SDRAM memories are particularly challenging resources
► The execution time of a request in an SDRAM is variable
  – WCET is pessimistic and guaranteed bandwidth is very low
    • Less than 16% bandwidth can be guaranteed for all DDR3 devices

[Chart: guaranteed worst-case bandwidth fraction (0–1) for ten DDR2/DDR3/LPDDR/LPDDR2 devices, from 128MB_DDR2-400 through 256MB_LPDDR2-1066-S4]

► SDRAM bandwidth is scarce and must be efficiently utilized
  – Additional interfaces cannot be added due to cost constraints
Problem statement
► Complex systems have mixed time-criticality
  – Firm, soft, and no real-time requirements in one system
  – We refer to this as mixed real-time (MRT) requirements
► Sharing an SDRAM controller between FRT and SRT/NRT applications is challenging
  – We would like to use the SDRAM in an efficient and power-conscious manner
  – Satisfying the FRT requirements, while providing sufficient performance to the SRT/NRT applications
Presentation Outline
Mixed time-criticality
Firm Real-Time Controllers
Soft/No Real-Time Controllers
Mixed Real-Time Controllers
Conclusions
Memory efficiency
► Execution times of requests are variable and traffic dependent
  – Can vary by an order of magnitude
  – Three reasons for overhead cycles:
    • Activating and precharging (opening and closing) rows
    • Switching direction of the data bus from read to write
    • Refreshing the memory
► Memory efficiency
  – The fraction of clock cycles when requested data is transferred
  – The exchange rate between peak bandwidth and net bandwidth
  – High efficiency is required since bandwidth is a scarce resource
Firm Real-Time Controllers
► FRT requirements must be satisfied even in the worst-case scenario
► Typical goals of firm real-time controllers:
  – Maximize the worst-case net bandwidth
  – Minimize the worst-case response time
  – A trade-off between the two, since they are contradictory
► Given all we know about the incoming traffic, what is the worst possible behavior the controller can exhibit?
Firm Real-Time Controllers: Locality
► SDRAM performance is highly dependent on locality
  – Request served quickly if it targets an open row
  – No overhead of opening and closing rows
► FRT controllers are typically unable to exploit locality
  – Locality has to be guaranteed also in the worst case
  – Difficult for a single executing application
    • Requires intimate knowledge of memory accesses
  – More or less impossible for multiple concurrent applications
    • Memory accesses mixed by the memory arbiter
  – Average and worst-case performance are very different
    • One reason why it is expensive to provide firm performance guarantees
Close-page policy
► FRT controllers use close-page policies [Paolieri, Reineke]
  – Precharge banks immediately after each request
  – Assumes that every request targets closed rows
► Benefits of policy
  – Reduces worst-case overhead of opening/closing rows
  – Increases guaranteed net bandwidth
► Drawbacks of policy
  – Sacrifices best and average-case performance and power
    • Difference between average case and worst case is reduced
  – Limits best-case efficiency of DDR3-800 (16-bit) with 64 B requests to 63%
    • Results from the Predator SDRAM controller [Akesson]
Close-page policy

[Chart: guaranteed net bandwidth bAG (GB/s) vs. power (W) for the ten DDR2/DDR3/LPDDR/LPDDR2 devices at a request size of 64 bytes, with annotated points at ~35% and ~83% of peak bandwidth for different devices]

• Slower memories reach peak efficiency at a smaller request size than the faster memories
Statically Scheduled Controllers
► Controllers are classified as statically or dynamically scheduled
  – Depends on the SDRAM command scheduling mechanism
► Statically scheduled controllers
  – Pre-compute the SDRAM schedule at design time
  – Bandwidth and execution time bounded by inspecting the schedule
    • Suitable for FRT requirements
  – Restricted to applications with (extremely) well-specified memory behavior
Dynamically Scheduled Controllers
► Dynamically scheduled FRT controllers
  – Schedule commands at run-time based on incoming requests
  – Challenge is to analyze the command scheduler
    • Required to bound net bandwidth and execution times
  – Analysis often assumes large fixed-size requests [Paolieri]
    • Large enough to exploit maximum bank-level parallelism by interleaving
    • Requires 64–256 B requests depending on memory device
    • Can exploit the "guaranteed locality" within the request
Predictable Arbitration
► These FRT controllers all have bounded execution times
  – This only covers the time the back-end needs to process a request
  – Bounding response times requires predictable arbitration
  – Bounds the number of interfering requests from other memory clients
► Different controllers use different arbiters
  – Statically scheduled controllers use a static schedule
  – [Paolieri] employs Round-Robin arbitration
    • Targeting homogeneous chip multi-processors
  – [Akesson] supports a variety of predictable arbiters
    • E.g. (Weighted) Round-Robin, Credit-Controlled Static-Priority, and Frame-Based Static-Priority
    • Targets heterogeneous MPSoCs
Presentation Outline
Mixed time-criticality
Firm Real-Time Controllers
Soft/No Real-Time Controllers
Mixed Real-Time Controllers
Conclusions
Soft/No Real-Time Controllers
► Same controllers normally used for SRT/NRT requirements
  – Dynamically scheduled high-performance controllers
► SRT applications are verified by simulation rather than formally
  – Firm transaction-level guarantees are neither necessary nor possible
  – Sufficient to satisfy application-level deadlines with high probability
    • May correspond to thousands of memory requests
► Typical goals of soft/no real-time controllers:
  – Maximize the average net bandwidth
  – Minimize the average response time
  – A trade-off between the two, since they are contradictory
Locality in SRT Controllers
► SRT controllers do not have to guarantee locality
  – Only require locality to offset miss penalties with high probability
► Open-page policies are common in SRT controllers
  – Rows are speculatively kept open to exploit locality
  – Average efficiency is hence typically higher than for FRT controllers
  – Best-case memory efficiency is around 98%
    • All requests are either reads or writes to the same row
    • Efficiency losses only due to mandatory refresh activities
Flexibility
► SRT controllers are flexible and support most memory traffic
  – SRT controllers are dynamically scheduled
  – Do not require formal analysis of supported memory traffic
  – Enables support of e.g. variable request sizes
  – Makes no assumptions on alignment of requests
► Fine-grained scheduling at the level of single SDRAM bursts
  – Reduces wasted data when serving small requests
  – Reduces response times of sensitive clients
  – Low worst-case memory efficiency
    • Cannot guarantee locality or bank-level parallelism
    • Worst-case efficiency about 16% for DDR3-800 with BL = 8 words
Reducing Response Times
► Memory efficiency is optimized using sophisticated mechanisms
► Preference for requests that target open rows
  – Reduces overhead of opening and closing rows
  – Increases response times for clients targeting closed rows
► Read/write grouping
  – Reduces read/write switching overhead
► Preference for reads over writes [Shao]
  – Reduces stall cycles on the processor
► Preemption of low-priority requests in service [Lee]
  – Reduces response times of high-priority clients
► Interactions between mechanisms are complex
  – Difficult to derive useful bounds on bandwidth and response times
  – May even be difficult to guarantee the default 16% net bandwidth
Presentation Outline
Mixed time-criticality
Firm Real-Time Controllers
Soft/No Real-Time Controllers
Mixed Real-Time Controllers
Conclusions
Mixed Real-Time Controllers
► MRT controllers must efficiently support FRT, SRT and NRT
► Current FRT controllers treat SRT/NRT clients like FRT clients
  – Expensive both in terms of bandwidth and power
► Current SRT/NRT controllers treat FRT clients like SRT/NRT clients
  – Guarantees are either not formally proven or very pessimistic
  – Worst-case may be maximum observed case plus a safety margin
  – Deadlines may be missed in corner cases
► MRT controllers are likely to evolve from current controllers
  – Either from FRT controllers or SRT/NRT controllers
Predator controller (made at TU/e)
► Predator is a mixed real-time hybrid controller
  – Evolved from a firm real-time design
► Combines static and dynamic scheduling
  – Statically computed memory patterns (sub-schedules) that are dynamically scheduled
► Three key ideas to predictability
  1. Independent execution of resource to simplify analysis
     • Require entire request and space for response before scheduling
  2. Predictable WCET by using memory patterns
  3. Predictable WCRT by using predictable arbitration
► Part of the CompSOC project (www.compsoc.eu)
Predictable SDRAM
► Schedule memory patterns instead of memory commands
  – Generated at design time
  – Patterns + scheduling rules satisfy the timing constraints
  – Simple scheduling rules allow worst-case bandwidth and latency analysis
► Chop large requests into fixed size chunks (atoms)
  – Allows preemption at the atom-level
► 6 patterns:
  – Read
  – Write
  – Read/write switch
  – Write/read switch
  – Refresh
  – (NOP/IDLE/Power down)

[Figure: pattern scheduling rules among the Read, Write, R/W switch, W/R switch, and Refresh patterns]
Memory patterns
► Patterns enable scheduling at a higher level than commands
  – Less state and fewer constraints, making them easier to analyze
► Bounding memory efficiency
  – Worst sequence of patterns is known (mapping & pattern lengths)
  – Data transferred by patterns is known (by definition)

[Figure: sequence of patterns for a DDR2-400 memory]
Access Pattern Parameters
► There are three key access pattern parameters
  – Defines the memory map of the memory controller
  – Determined per use-case at design time
► Number of banks to interleave (BI) a request over
  – Determines bank parallelism
  – Memory efficiency and power consumption increase with BI
► Number of bursts to each bank, called burst count (BC)
► Number of words in a burst, the burst length (BL)
  • BC & BL determine the amount of data transferred per bank
  • Increasing BC & BL amortizes overhead and increases efficiency
► Parameters determine the access granularity (AG) of the memory
  – Amount of data accessed by a pattern
  – AG = BI x BC x BL x word size
► Parameters are a trade-off between efficiency, WCRT and power
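The access granularity formula can be sketched directly; the values below use the 16-bit example memory (2-byte words):

```python
def access_granularity(bi, bc, bl, word_bytes):
    """AG = BI x BC x BL x word size: bytes accessed by one pattern."""
    return bi * bc * bl * word_bytes

# 16-bit memory (2-byte words), BL = 8
print(access_granularity(1, 1, 8, 2))  # 16 B  - minimum granularity
print(access_granularity(4, 2, 8, 2))  # 128 B - more parallelism, larger atoms
```

Larger AG amortizes activate/precharge overhead over more data, but increases the minimum atom size and hence the WCRT and the potential for wasted data, which is exactly the trade-off the slide names.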
Conservative open-page policy
► Controller uses a conservative open-page policy
  – Page kept open as long as possible without sacrificing WCET
  – Decision to close is made when the last burst to a bank is issued
  – Enables limited locality to be exploited
  – Reduces execution time and power consumption

[Figure: ACT/read/PRE command sequences for five read requests under the close-page, open-page, and conservative open-page policies; requests of the same color use the same page]

► Policy introduces three new types of read and write patterns
  – With / without activates and precharges, respectively
  – Pattern is chosen when it is known if the following request hits or misses
  – Worst-case analysis is only based on miss patterns
Predictable Front-End Arbitration
► Controller works with any predictable front-end arbiter
  – Examples: Round-Robin, TDM, or our own priority-based arbiter
► Bounding response times
  – Number of interfering requests is known (arbiter analysis)
  – Request to pattern mapping is known (mapping)
  – Pattern to cycle mapping is known (pattern lengths)
► Design provides bounds on net bandwidth and response times
  – For any combination of supported memory and arbiter
Experimental setup
► Simple MRT use-case with three memory clients

Application           Request size (bytes)  Net BW (MB/s)  Response time (ns)
Synthetic FRT app. 1  128                   300            700
Synthetic FRT app. 2  64                    300            700
SRT H.263 decoder     32                    -              -

► Memory is Micron 16-bit DDR3-800
  – Peak bandwidth of 1600 MB/s
  – Example total power budget of 0.5 W
  – Using own power model that analyzes the command trace
► Controller uses Round-Robin arbitration
FRT Bandwidth / Latency / Power Trade-Off
► Trade-off for FRT applications
  – Bandwidth vs. power vs. latency for different values of BI and BC (BL = 8)
  – The labels denote BI (1 up to 8)
  – Gross bandwidth requirement depends on AG
    • bgross = 0.6 GB/s, b128gross = 0.9 GB/s
    • Only AG ≤ 128 B is possible due to data efficiency

[Charts: gross bandwidth bAG (GB/s) vs. power (W) with the power budget Pmax marked, and latency (ns) vs. number of banks interleaving (BI), for access granularities from 16 to 1024 bytes]
SRT Execution Time / Power Trade-Off
► SRT application is verified by SystemC system simulation
  – Traffic generator running an H.263 trace
  – FRT applications modeled with synthetic traffic
  – All feasible configurations simulated
  – Results normalized to the best configuration
► Legend: labels are (BI, BC); colored points = close-page policy; thin symbol = violates power budget
► Results
  – Conservative OP policy reduces power and execution time (free lunch!)
  – BI = 4, 8 violates the power budget
  – BI = BC = 2 is best for SRT performance
  – BI = 1, BC = 4 minimizes power
Presentation Outline
Mixed time-criticality
Firm Real-Time Controllers
Soft/No Real-Time Controllers
Mixed Real-Time Controllers
Conclusions
Conclusions
► Complex SoCs have mixed real-time (MRT) requirements
  – Mix of firm (FRT), soft (SRT), and no real-time (NRT) requirements
  – There are suitable controllers for FRT and SRT/NRT, but not MRT
► Firm real-time controllers
  – Maximize bandwidth bound and minimize response time bound
  – Static, dynamic, or hybrid SDRAM command scheduling
  – Close-page policies to reduce miss penalty
  – Predictable arbitration
► Soft/no real-time controllers
  – Maximize average bandwidth and minimize average response time
  – Dynamically scheduled with sophisticated mechanisms
  – Open-page policies to exploit locality
Conclusions
► We propose a mixed real-time SDRAM controller
► Predictability enables formal verification of FRT requirements
  – Achieved by memory patterns and predictable arbitration
  – Configurable trade-off between efficiency, response time, and power
► Uses a conservative open-page policy to exploit locality and increase the average-case performance
► First satisfy the requirements of the FRT applications, then use simulation to find the best configuration for the SRT/NRT applications
References

[Akesson] B. Akesson and K. Goossens. "Architectures and Modeling of Predictable Memory Controllers for Improved System Integration". In Proc. DATE, 2011.

[Lee] K. Lee, T. Lin, and C. Jen. "An efficient quality-aware memory controller for multimedia platform SoC". In IEEE Trans. on Circuits and Systems for Video Technology, 15(5), 2005.

[Paolieri] M. Paolieri, E. Quinones, F. Cazorla, and M. Valero. "An Analyzable Memory Controller for Hard Real-Time CMPs". In IEEE Embedded Systems Letters, 1(4), 2009.

[Reineke] J. Reineke, I. Liu, H. Patel, S. Kim, and E. Lee. "PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation". In Proc. CODES+ISSS, 2011.

[Shao] J. Shao and B. Davis. "A burst scheduling access reordering mechanism". In Proc. HPCA, 2007.
Questions?