
Designing Efficient Memory for Future
Computing Systems
Aniruddha N. Udipi
University of Utah
Ph.D. Dissertation Defense, March 7, 2012
Advisor: Rajeev Balasubramonian
www.cs.utah.edu/~udipi
Scaling server farms
• Facebook: 30,000 servers, 80 billion images stored, 600,000 photos served per second, 25 TB of data logged per day… and the statistics go on
• The primary challenge to scaling: efficient
supply of data to thousands of cores
• It’s all about the memory!
Performance Trends
• Demand-side
– Multi-socket, multi-core, multi-thread
– Large datasets: big data analytics, scientific computation models
– RAMCloud-like designs
– 1 TB/s per node by 2017 (Source: Tom’s Hardware)
• Supply-side
– Pin count, per-pin BW, capacity
– Severely power limited (Source: ZDNet)
Energy Trends
• Datacenters consume ~2% of all
power generated in the US
– Operation + cooling
• 100 Billion kWh, $7.4 Billion
• 25-40% of total power in large systems is consumed by memory
• As processors get simpler, this fraction is likely to increase
Cost-per-bit
• Traditionally the holy grail of DRAM design
• Operational expenditure over 3 years now equals the capital expenditure on datacenter servers
– Cost-per-bit is less important than before
(Figure: $0.30 / 60 W vs. $3.00 / 13 W)
Complexity Trends
• The job of the memory controller is hard
– 18+ timing parameters for DRAM!
– Maintenance operations: refresh, scrub, power down, etc.
• Several DIMM and controller variants
– Hard to provide interoperability
– Need processor-side support for new
memory features
• Now throw in heterogeneity
– Memristors, PCM, STT-RAM, etc.
Reliability Trends
• Shrinking feature sizes not helping
• Nor is the scale
– 64 × 10^15 DRAM cells in a typical datacenter
• DRAM errors are the #1 reason for servers at Google to enter repair
• Datacenters are the backbone of web-connected
infrastructure
– Reliability is essential
• Server downtime has huge economic impact
– Breached SLAs, for example
Thesis statement
• Main memory systems are at an inflection point
– Convergence of several trends
• Major overhaul required to achieve a system
that is
– Energy-efficient, high-performance, low-complexity, reliable, and cost-effective
• Combination of two things
– Prudent application of novel technologies
– Fundamental rethinking of conventional design
decisions
Designing Future Memory Systems
(Diagram: CPU and memory controller (MC) connected to DIMMs, with four numbered callouts.)
1. Memory Chip Architecture – reducing overfetch & increasing parallelism [ISCA ’10]
2. Memory Interconnect – prudent use of silicon photonics, without modifying DRAM dies [ISCA ’11]
3. Memory Protocol – streamlined slot-based interface with semi-autonomous memory [ISCA ’11]
4. Memory Reliability – efficient RAID-based high-availability chipkill memory [ISCA ’12]
PART 1 – Memory Chip Organization
Key bottleneck
(Diagram: a cache line striped across four DRAM chips; a RAS brings an entire row into each chip's row buffer, and a CAS then selects the cache line. One bank shown in each chip.)
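As a rough aside, a back-of-envelope sketch of the overfetch this implies (the 8 KB activated-row size is an assumed, typical DDR3-era figure, not taken from the slide):

```python
# Back-of-envelope overfetch estimate; the row size is an assumption, not a measurement.
ROW_BYTES_ACTIVATED = 8 * 1024   # bytes latched into row buffers across the rank per activation
CACHE_LINE_BYTES = 64            # bytes actually requested

overfetch = ROW_BYTES_ACTIVATED / CACHE_LINE_BYTES
print(f"Activate {ROW_BYTES_ACTIVATED} B to serve {CACHE_LINE_BYTES} B "
      f"-> {overfetch:.0f}x overfetch on a row-buffer miss")
```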
Why this is a problem
(Figure: row buffer hit rate (%) for 1-core, 4-core, and 16-core configurations.)
(Figure: percentage of row fetches broken down by use count – >3, 3, 2, and 1 accesses per row activation.)
SSA Architecture
(Diagram: one DRAM chip on the DIMM, divided into banks and subarrays, each with its own bitlines and row buffer; an entire 64-byte cache line is read from a single subarray. Eight 8-bit chip interfaces feed the data bus, and the address/command bus and a global interconnect to I/O connect to the memory controller.)
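A minimal sketch (Python; subarray counts and the interleaving order are assumptions, not the dissertation's exact mapping) of decoding a physical address so that an entire cache line maps to a single subarray of a single chip:

```python
# Hypothetical SSA-style address decode; all field widths are assumptions.
LINE_BYTES = 64
CHIPS_PER_DIMM = 8           # one x8 chip serves the whole cache line
SUBARRAYS_PER_CHIP = 16      # assumed
ROWS_PER_SUBARRAY = 1 << 15  # assumed

def ssa_decode(phys_addr: int) -> tuple[int, int, int]:
    """Map a physical address to (chip, subarray, row within the subarray)."""
    line = phys_addr // LINE_BYTES
    chip = line % CHIPS_PER_DIMM             # consecutive lines go to different chips
    line //= CHIPS_PER_DIMM
    subarray = line % SUBARRAYS_PER_CHIP     # then to different subarrays within a chip
    line //= SUBARRAYS_PER_CHIP
    return chip, subarray, line % ROWS_PER_SUBARRAY

print(ssa_decode(0x12345678))  # -> (chip, subarray, row) for one 64 B line
```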
SSA Operation
(Diagram: four DRAM chips, each split into subarrays. The address selects a single subarray in a single chip, which supplies the whole cache line; the remaining subarrays stay in sleep mode or serve other accesses in parallel.)
SSA Impact
• Energy reduction
– Dynamic – fewer bitlines activated
– Static – smaller activation footprint – more and
longer spells of inactivity – better power down
• Latency impact
– Limited pins per cache line – serialization latency
– Higher bank-level parallelism – shorter queuing
delays
• Area increase
– More peripheral circuitry and I/O at finer granularities
– area overhead (< 5%)
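The serialization cost can be quantified with a quick sketch (the 1.6 GT/s per-pin rate is an assumed DDR3-1600-like figure):

```python
# Serialization latency of a 64 B line over narrow vs. wide interfaces (pin rate assumed).
LINE_BITS = 64 * 8
TRANSFER_RATE = 1.6e9  # transfers per second per pin (assumption)

def transfer_time_ns(interface_bits: int) -> float:
    return (LINE_BITS / interface_bits) / TRANSFER_RATE * 1e9

print(f"64-bit rank interface: {transfer_time_ns(64):.1f} ns")    # ~5 ns
print(f"8-bit SSA chip interface: {transfer_time_ns(8):.1f} ns")  # ~40 ns
# The extra serialization is traded against shorter queuing delays
# from the much higher bank-level parallelism.
```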
Key Contributions
• Up to 6X reduction in DRAM chip dynamic
energy
• Up to 5X reduction in DRAM chip static
energy
• Up to 50% improvements in performance in
applications limited by bank contention
• All for ~5% increase in area
PART 2 – Memory Interconnect
Key Bottleneck
• Fundamental nature of electrical pins
– Limited pin count, per-pin bandwidth, memory capacity, etc.
• Diverging growth rates of core count and pin
count
• Limited by physics, not engineering!
Silicon Photonic Interconnects
• We need something that can break
the edge-bandwidth bottleneck
• Ring-modulator-based photonics (Source: Xu et al., Optics Express 16(6), 2008)
– Off-chip light source
– Indirect modulation using resonant rings
– Relatively cheap coupling on- and off-chip
• DWDM for high bandwidth density
– As many as 67 wavelengths possible
– Limited by free spectral range and coupling losses between rings
– 64 λ × 10 Gbps/λ = 640 Gbps = 80 GB/s per waveguide
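The per-waveguide bandwidth arithmetic from the slide, as a small parameterized sketch (GB here means 10^9 bytes):

```python
# DWDM waveguide bandwidth: wavelengths x per-wavelength rate (numbers from the slide).
def waveguide_bandwidth_gbps(wavelengths: int, gbps_per_wavelength: float) -> float:
    return wavelengths * gbps_per_wavelength

bw_gbps = waveguide_bandwidth_gbps(64, 10)                           # 640 Gbps
print(f"{bw_gbps:.0f} Gbps = {bw_gbps / 8:.0f} GB/s per waveguide")  # 80 GB/s
```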
The Questions We’re Trying to Answer
• What should the role of electrical signaling be?
• How do we make photonics less invasive to memory die design?
• Should we replace all interconnects with photonics? On-chip too?
• What should the role of 3D be in an optically connected memory?
• Should we be designing photonic DRAM dies? Stacks? Channels?
Design Considerations – I
• Photonic interconnects
– Large static power dissipation: ring tuning
  – Rings are designed to resonate at a specific frequency
  – Processing defects and temperature change this
  – Need to heat the rings to correct for this
– Much lower dynamic energy consumption –
relatively independent of distance
• Electrical interconnects
– Relatively small static power dissipation
– Large dynamic energy consumption
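To make the trade-off concrete, a small illustrative sketch (every number below is an assumption chosen for illustration, not a measured value) of link energy per bit versus channel utilization: photonics pays a fixed ring-tuning power regardless of traffic, while electrical signaling pays mostly per-bit dynamic energy:

```python
# Illustrative static-vs-dynamic link energy comparison; all parameters are assumptions.
PHOTONIC_TUNING_W = 0.5      # ring-tuning (heater) power, independent of traffic
PHOTONIC_PJ_PER_BIT = 0.2    # photonic dynamic energy, roughly distance-independent
ELECTRICAL_PJ_PER_BIT = 5.0  # off-chip electrical dynamic energy
PEAK_GBPS = 640              # per-waveguide peak from the previous slide

def energy_pj_per_bit(utilization: float) -> tuple[float, float]:
    bits_per_s = utilization * PEAK_GBPS * 1e9
    photonic = PHOTONIC_PJ_PER_BIT + PHOTONIC_TUNING_W / bits_per_s * 1e12
    return photonic, ELECTRICAL_PJ_PER_BIT

for u in (0.05, 0.25, 0.75):
    p, e = energy_pj_per_bit(u)
    print(f"utilization {u:.0%}: photonic {p:.2f} pJ/bit vs. electrical {e:.2f} pJ/bit")
# At low utilization the fixed tuning power dominates, so photonics only pays off
# when the link is kept busy -- which motivates using it only where necessary.
```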
Design Considerations – II
• Should not over-provision photonic
bandwidth, use only where necessary
• Use photonics where they’re really useful
– To break the off-chip pin barrier
• Exploit 3D-Stacking and TSVs
– High bandwidth, low static power, decouples
memory dies
• Exploit low-swing wires
– Cheap on-chip communication
Proposed Design
ADVANTAGE 1: Not disruptive to the design of commodity memory dies
ADVANTAGE 2: Increased activity factor, more efficient use of photonics
ADVANTAGE 3: Rings are co-located; easier to isolate or tune thermally
(Diagram: processor and memory controller connected by a waveguide to a DIMM of DRAM chips through a photonic interface die.)
Key Contributions
(Diagram: processor, memory controller, waveguide, and a DIMM of DRAM chips behind a photonic interface die. Callout: this makes the job of the memory controller difficult!)
• 23% reduced energy consumption
• 4X capacity per channel
• Potential for performance improvements
due to increased bank count
• Less disruptive to memory die design
PART 3 – Memory Access Protocol
Key Bottleneck
• Large capacity, high bandwidth, and evolving
technology trends will increase pressure on the
memory interface
• Memory controller micro-manages every operation
of the memory system
– Processor-side support required for every memory
innovation
– Several signals between processor and memory
  – Heavy pressure on the address/command bus
  – Worse with several independent banks and large amounts of state
Proposed Solution
• Release MC’s tight control, make memory stack
more autonomous
• Move mundane tasks to the interface die
– Maintenance operations (refresh, scrub, etc.)
– Routine operations (DRAM precharge, NVM wear
leveling)
– Timing control (18+ constraints for DRAM alone)
– Coding and any other special requirements
• Processor-side controller only schedules requests
and controls data bus
Memory Access Operation
(Timeline diagram: a request arrives and is issued; the controller starts looking for a data-bus slot at least ML after issue, reserving the first free slot S1 and a backup slot S2.)
Slot – cache line data bus occupancy
X – reserved slot
ML – memory latency = address latency + bank access + data bus latency
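A minimal sketch of the slot-reservation idea (a hypothetical helper, not the dissertation's implementation): the processor-side controller only tracks data-bus slots, claiming the first free slot at least ML after issue plus a backup slot in case the semi-autonomous memory cannot meet the first:

```python
# Hypothetical slot-based scheduling sketch; slot granularity and ML handling simplified.
class SlotScheduler:
    def __init__(self, memory_latency_slots: int):
        self.ml = memory_latency_slots   # ML, communicated once by the interface die
        self.reserved: set[int] = set()  # data-bus slots already claimed

    def reserve(self, issue_slot: int) -> tuple[int, int]:
        """Return (primary, backup) data-bus slots for a request issued now."""
        slot = issue_slot + self.ml      # earliest slot the data could return
        while slot in self.reserved:     # first free slot at or after issue + ML
            slot += 1
        backup = slot + 1
        while backup in self.reserved:   # backup slot if memory misses the first
            backup += 1
        self.reserved.update((slot, backup))
        return slot, backup

sched = SlotScheduler(memory_latency_slots=12)
print(sched.reserve(issue_slot=0))  # (12, 13)
print(sched.reserve(issue_slot=1))  # (14, 15)
```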
Performance Impact – Synthetic Traffic
< 9% latency impact, even at maximum load
Virtually no impact on achieved bandwidth
Performance Impact – PARSEC/STREAM
Apps have very low BW requirements
Scaled down system, similar trends
Key Contributions
• Plug and play
– Everything is interchangeable and interoperable
– Only interface-die support required (communicate ML)
• Better support for heterogeneous systems
– Easier DRAM-NVM data movement on the same channel
• More innovation in the memory system
– Without processor-side support constraints
• Fewer commands between processor and memory
– Energy, performance advantages
PART 4 – Memory Reliability
Key Bottleneck
• Increased access granularity
– Every data access is spread across 36 DRAM chips
– DRAM industry standards define minimum access granularity
from each chip
– Massive overfetch of data at multiple levels
  – Wastes energy
  – Wastes bandwidth
  – Occupies ranks/banks for longer, hurting performance
• x4 device width restriction
– fewer ranks for given DIMM real estate
– x8/x16/x32 more power efficient per capacity
• Reliability level: 1 failed chip out of 36
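A rough illustration of the granularity problem (assuming standard DDR3 x4 chipkill parameters: burst length 8, 36 devices ganged across two ECC ranks):

```python
# Access-granularity estimate for conventional x4 chipkill; parameters are assumptions.
CHIPS = 36              # two ganged 18-chip ranks of x4 devices
DEVICE_WIDTH_BITS = 4
BURST_LENGTH = 8        # minimum DDR3 burst

bytes_per_chip = DEVICE_WIDTH_BITS * BURST_LENGTH // 8   # 4 bytes from each chip
total_bytes = CHIPS * bytes_per_chip                     # 144 bytes on the bus
print(f"{total_bytes} B transferred (data + ECC) to serve one 64 B cache line")
```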
A new approach: LOT-ECC
• Operate on a single rank of memory: 9 chips
– and support failure of 1 chip per rank (9 chips)
• Multiple tiers of localized protection
– Tier 1: Local Error Detection (checksums)
– Tier 2: Global Error Correction (parity)
– Tiers 3 & 4 to handle specific failure cases
• Error correction data stored in data memory
• Data mapping handled by memory controller
with firmware support
– Transparent to OS, caches, etc.
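A toy sketch of the two main tiers (hypothetical checksum and parity helpers with simplified field sizes; this is not the actual LOT-ECC layout): a per-chip local checksum detects which chip failed, and byte-wise parity across the rank reconstructs its data:

```python
# Toy two-tier protection in the spirit of LOT-ECC; sizes and codes are simplified.
from functools import reduce

def local_checksum(chunk: bytes) -> int:
    """Tier 1: per-chip error detection (a real design uses a stronger checksum)."""
    return sum(chunk) & 0xFFFF

def global_parity(chunks: list) -> bytes:
    """Tier 2: byte-wise XOR parity across the chips of one rank."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks))

def recover(chunks, checksums, parity):
    """Detect the failed chip via its checksum; rebuild it from parity + survivors."""
    for i, chunk in enumerate(chunks):
        if local_checksum(chunk) != checksums[i]:
            survivors = [c for j, c in enumerate(chunks) if j != i]
            return i, global_parity(survivors + [parity])
    return None, None

chips = [bytes([i] * 8) for i in range(8)]   # 8 data chips, 8 bytes each (toy sizes)
sums = [local_checksum(c) for c in chips]
parity = global_parity(chips)                # stored on the 9th chip
chips[3] = bytes(8)                          # chip 3 returns garbage (all zeros)
print(recover(chips, sums, parity))          # -> (3, eight 0x03 bytes)
```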
LOT-ECC Design
The Devil is in the Details
(Diagram: one surplus bit is borrowed from the data + LED stored on each chip. Chips 0-7 hold 7-bit global parity fragments PA0-6 … PA49-55 plus a 1-bit T4 field each; chip 8 holds PA56, PPA, and T4.)
• We’re borrowing one bit from [data + LED] to
use in the GEC
– Put them all in the same DRAM row
• When a cache line is written,
– Write data, LED, GEC – all “self-contained”
– no read-before-write
– Guaranteed row-buffer hit
Key Benefits
• Energy Efficiency: Fewer chips activated per access, reduced access
granularity, reduced static energy through better use of low-power
modes
• Performance Gains: More rank-level parallelism, reduced access
granularity
• Improved Protection: Can handle 1 failed chip out of 9, compared
to 1 in 36 currently
• Flexibility: Works with a single rank of x4 DRAMs or more efficient
wide-I/O x8/x16 DRAMs
• Implementation Ease: Changes to memory controller and system
firmware only; commodity processor/memory/OS
Power Results
(Figure: power results; 55% reduction.)
Performance Results
Latency reduction: LOT-ECC x8 – 43%
+ GEC coalescing – 47%
Oracular – 57%
Exploiting features in SSA
(Diagram: LOT-ECC data layout in one SSA DRAM device on the DIMM: cache lines L0–L63, each followed by its local checksum C, are grouped with rotating global parity lines P0–P7, which carry checksums of their own.)
L – cache line, C – local checksum, P – global parity
Putting it all together
Summary
• Tremendous pressure on the memory system
– Bandwidth, energy, complexity, reliability
• Prudently apply novel technologies
– Silicon photonics
– Low-swing wires
– 3D-stacking
• Rethink some fundamental design choices
– Micromanagement by the memory controller
– Overfetch in the face of diminishing locality
– Conventional ECC codes
Impact
• Significant static/dynamic energy reduction
– Memory core, channel, controller, reliability
• Significant performance improvement
– Bank parallelism, channel bandwidth, reliability
• Significant complexity reduction
– Memory controller
• Improved reliability
Synergies
• SSA + Photonics
• Photonics + Autonomous memory
• SSA + Reliability
• SSA, Photonics, and LOT-ECC provide additive energy benefits
– Each targets one of three major sources of energy consumption – DRAM array, off-chip channel, reliability
• SSA, Photonics, and LOT-ECC also provide additive performance benefits
– Each targets one of three major performance bottlenecks – bank contention, off-chip BW, reliability
Research Contributions
• Memory reliability [ISCA 2012]
• Memory access protocol [ISCA 2011]
• Memory channel architecture [ISCA 2011]
• Memory chip microarchitecture [ISCA 2010]
• On-chip networks [HPCA 2010]
• Non-uniform power caches [HiPC 2009]
• 3D stacked cache design [HPCA 2009]
Future Work
• Future project ideas include
– Memory architectures for graphics/throughput-oriented applications
– Memory optimizations for handheld devices
  – Tightly integrated software support
  – Managing heterogeneity, reconfigurability
  – Novel memory hierarchies
– Memory autonomy and virtualization
– Refresh management in DRAM
Acknowledgements
• Rajeev
• Naveen
• Committee: Al, Norm, Erik, Ken
• Awesome lab-mates
• Karen, Ann, Emily… front office
• Parents & family
• Friends