Exploiting 3D-Stacked Memory Devices
Rajeev Balasubramonian
School of Computing
University of Utah
Oct 2012
1
Power Contributions
[Chart, shown across two slides: processor and memory as percentages of total server power]
2-3
Example IBM Server
[Figure: power data for an example IBM server]
Source: P. Bose, WETI Workshop, 2012
4
Reasons for Memory Power Increase
• Innovations for the processor, but not for memory
• Harder to get to memory (buffer chips)
• New workloads that demand more memory
– SAP HANA in-memory databases
– SAS in-memory analytics
5
The Cost of Data Movement
• 64-bit double-precision FP MAC: 50 pJ
(NSF CPOM Workshop report)
• 1 instruction on an ARM Cortex A5: 80 pJ
(ARM datasheets)
• Fetching a 256-bit block from a distant cache bank: 1.2 nJ
(NSF CPOM Workshop report)
• Fetching a 256-bit block from an HMC device: 2.68 nJ
• Fetching a 256-bit block from a DDR3 device: 16.6 nJ
(Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
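To put these figures on a common scale, here is a small illustrative calculation (the per-bit conversion is ours, not from the slides):

# Illustrative arithmetic only: per-bit energy for the 256-bit transfers
# quoted above (block energies from the slide; the pJ/bit conversion is ours).
BLOCK_BITS = 256

transfers_nj = {
    "distant cache bank": 1.2,
    "HMC device": 2.68,
    "DDR3 device": 16.6,
}

for source, nanojoules in transfers_nj.items():
    pj_per_bit = nanojoules * 1000 / BLOCK_BITS  # 1 nJ = 1000 pJ
    print(f"{source}: {pj_per_bit:.1f} pJ/bit")  # 4.7, 10.5, 64.8 pJ/bit

# A DDR3 fetch costs roughly 16.6 / 2.68 = 6.2x the energy of an HMC fetch.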
6
Memory Basics
[Diagram: host multi-core processor with multiple on-chip memory controllers (MC), each driving a DDR memory channel]
7
FB-DIMM
[Diagram: host processor with memory controllers (MC) driving a daisy chain of FB-DIMM buffer chips]
8
SMB/SMI
[Diagram: host processor with memory controllers (MC) reaching DIMMs through SMB buffer chips over SMI links]
9
Micron Hybrid Memory Cube Device
10
HMC Architecture
[Diagram: host multi-core processor with memory controllers (MC) connected to HMC devices over serial links]
11
Key Points
• HMC allows logic layer to easily reach DRAM chips
• Open question: new functionalities on the logic chip –
cores, routing, refresh, scheduling
• Data transfer out of the HMC is just as expensive as before
– Near Data Computing … to cut off-HMC movement
– Intelligent Network-of-Memories … to reduce hops
12
Near Data Computing (NDC)
13
Timely Innovation
• A low-cost way to achieve NDC
• Workloads that are embarrassingly parallel
• Workloads that are increasingly memory bound
• Mature frameworks (MapReduce) in place
14
Open Questions
• What workloads will benefit from this?
• What causes the benefit?
15
Workloads
• Initial focus on MapReduce, but any workload with
localized data access patterns will be a good fit
• Map phase in MapReduce: the dataset is partitioned
and each Map task works on its “split”; embarrassingly
parallel, localized data access, often the bottleneck;
e.g., count word occurrences in each individual document
• Reduce phase in MapReduce: aggregates the results of
many Mappers; requires random access to data, but deals
with less data than the Mappers;
e.g., summing up the occurrences of each word
(a minimal sketch follows)
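As a purely illustrative sketch (ours, not from the talk), a toy word-count MapReduce makes this division of labor concrete:

from collections import Counter

def map_task(split):
    # A Mapper counts word occurrences in its own split:
    # embarrassingly parallel, with localized data access.
    return Counter(split.lower().split())

def reduce_task(mapper_outputs):
    # The Reducer sums per-word counts across Mappers: random access
    # over keys, but far less data than the Mappers read.
    totals = Counter()
    for partial in mapper_outputs:
        totals.update(partial)
    return totals

splits = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_task(map_task(s) for s in splits))
# Counter({'the': 3, 'fox': 2, 'quick': 1, 'brown': 1, 'lazy': 1, 'dog': 1})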
16
Baseline Architecture
[Diagram: baseline system; all cores on the host processor, with HMC devices used only as memory]
• Mappers and Reducers both execute on the host processor
• Many simple cores are better than a few complex cores
• 2 sockets, 256 GB memory, 260 W processing power budget,
512 ARM cores (EE-Cores) per socket, each core at 876 MHz
17
NDC Architecture
[Diagram: NDC system; ND Cores on each HMC logic layer, EE-Cores on the host processor]
• Mappers execute on ND Cores; Reducers execute on the
host processor
• 32 cores per HMC; 2048 total ND Cores and 1024 total
EE-Cores; 260 W total processing power budget
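Some back-of-the-envelope arithmetic on these counts (ours; the even power split is purely an illustrative assumption):

# Illustrative arithmetic only, from the core counts and budget above.
nd_cores_per_hmc = 32
total_nd_cores = 2048
total_ee_cores = 1024
power_budget_w = 260

num_hmcs = total_nd_cores // nd_cores_per_hmc          # 64 HMC devices
total_cores = total_nd_cores + total_ee_cores          # 3072 cores in all
avg_mw_per_core = 1000 * power_budget_w / total_cores  # ~85 mW/core if the
                                                       # budget splits evenly
print(num_hmcs, total_cores, round(avg_mw_per_core))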
18
NDC Memory Hierarchy
[Diagram: per-vault memory hierarchy seen by the ND Cores]
• Memory latency excludes delay for link queuing and traversal
• Many row buffer hits
• L1 I and D caches per ND Core
• The vault reserves space for intermediate outputs and for
Mapper/Runtime code and data
19
Methodology
• Three workloads (toy sketches below):
– Range-Aggregate: count occurrences of something
– Group-By: count occurrences of everything
– Equi-Join: for two tables, count the pairs that have
matching attributes
• Dataset: 1998 World Cup web server logs
• Simulations of individual Mappers and Reducers on
EE-Cores with the TRAX simulator
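Purely for concreteness, a toy rendering of the three query types (the field names and data are ours; the real runs use the World Cup logs):

from collections import Counter

logs = [{"url": "/a", "status": 200}, {"url": "/b", "status": 404},
        {"url": "/a", "status": 200}]

# Range-Aggregate: count occurrences of one thing.
errors = sum(1 for rec in logs if rec["status"] == 404)

# Group-By: count occurrences of everything.
hits_per_url = Counter(rec["url"] for rec in logs)

# Equi-Join: pair records from two tables whose join attributes match.
table_a = [("u1", "/a"), ("u2", "/b")]
table_b = [("/a", 200), ("/a", 200), ("/b", 404)]
joined = [(user, url, status)
          for (user, url) in table_a
          for (url_b, status) in table_b if url == url_b]

print(errors, hits_per_url, len(joined))  # 1 Counter({'/a': 2, '/b': 1}) 3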
20
Single Thread Performance
21
Effect of Bandwidth
22
Exec Time vs. Frequency
23
Maximizing the Power Budget
24
Scaling the Core Count
25
Energy Reduction
26
Results Summary
• Execution time reductions of 7%-89%
• NDC performance scales better with core count
• Energy reduction of 26%-91%
– No bandwidth limitation
– Lower memory access latency
– Lower bit transport energy
27
Intelligent Network of Memories
• How should several HMCs be connected to the processor?
• How should data be placed in these HMCs?
28
Contributions
• Evaluation of different network topologies
– Route adaptivity does help
• Page placement to bring popular data to nearby HMCs
– Percolate-down based on page access counts
• Use of router bypassing under low load
• Use of deep sleep modes for distant HMCs
29
Topologies
[Figure, spanning three slides: candidate HMC network topologies; labeled panels include (d) F-Tree and (e) T-Tree]
30-32
Network Properties
• Supports 44-64 HMC devices with 2-4 rings
• Adaptive routing (deadlock avoidance based on timers)
• An entire page resides in one ring, but cache lines are
striped across the channels
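A hypothetical address-mapping sketch of this placement (the page and line sizes, counts, and static modular ring assignment are all our assumptions; the actual policy, described next, places new pages in the nearest ring):

PAGE_SIZE = 4096   # assumed page size
LINE_SIZE = 64     # assumed cache-line size
NUM_RINGS = 4
NUM_CHANNELS = 4

def locate(addr):
    ring = (addr // PAGE_SIZE) % NUM_RINGS        # whole page -> one ring
    channel = (addr // LINE_SIZE) % NUM_CHANNELS  # lines stripe across channels
    return ring, channel

# Two lines of the same page share a ring but use different channels:
print(locate(0x1000), locate(0x1040))  # (1, 0) (1, 1)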
33
Percolate-Down Page Placement
• New pages are placed in nearest ring
• Periodically, inactive pages are demoted to the next ring;
thresholds matter because of queuing delays
• Activity is tracked with the multi-queue algorithm:
hierarchical queues, where each entry has a timer and an
access count; an entry is demoted to a lower queue if its
timer expires and promoted to a higher queue if its access
count is high (a sketch follows below)
• Page migration is off the critical path and striped across
many channels; the distant links it uses are under-utilized
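A minimal sketch of that multi-queue bookkeeping (the queue count, timer value, and power-of-two promotion threshold are our assumptions):

NUM_QUEUES = 4
LIFETIME = 1000  # assumed: accesses an entry may sit idle before demotion

queues = [dict() for _ in range(NUM_QUEUES)]  # queue level -> {page: entry}
clock = 0

def access(page):
    global clock
    clock += 1
    # Find the page's current queue; new pages enter the lowest queue.
    level = next((l for l, q in enumerate(queues) if page in q), 0)
    entry = queues[level].pop(page, {"count": 0, "expires": 0})
    entry["count"] += 1
    entry["expires"] = clock + LIFETIME  # reset this entry's timer
    # Promotion: a high access count moves the page up one queue.
    if level + 1 < NUM_QUEUES and entry["count"] >= 2 ** (level + 1):
        level += 1
    queues[level][page] = entry
    # Demotion: any entry whose timer expired drops one queue level.
    for l in range(1, NUM_QUEUES):
        for p in [p for p, e in queues[l].items() if e["expires"] < clock]:
            expired = queues[l].pop(p)
            expired["expires"] = clock + LIFETIME  # restart timer below
            queues[l - 1][p] = expired

for _ in range(4):
    access(0x2A)
print(next(l for l, q in enumerate(queues) if 0x2A in q))  # -> 2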
34
Router Bypassing
• Topologies with more links and adaptive routing (T-Tree)
are better… but distant links experience relatively low load
• While a complex router is required for the T-Tree, the router
can often be bypassed
35
Power-Down Modes
• Activity shift to nearby rings → under-utilization at distant
HMCs
• Can power off the DRAM layers (PD-0) and the SerDes
circuits (PD-1)
• 26% energy saving for a 5% performance penalty
36
Methodology
• 128-thread traces of NAS parallel benchmarks (capacity
requirements of nearly 211 GB)
• Detailed simulations over traces of 1 billion memory
accesses, plus confirmatory page-access simulations of the
entire application
• Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit
for HMC logic layer, 3.9 pJ/bit for a 5x5 router
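An illustrative per-line energy calculation from these figures (the 64-byte line size and hop counts are our assumptions):

LINE_BITS = 64 * 8  # assumed 64 B cache line

DRAM_PJ_PER_BIT = 3.7
LOGIC_PJ_PER_BIT = 6.8
ROUTER_PJ_PER_BIT = 3.9

def fetch_energy_nj(router_hops):
    per_bit = DRAM_PJ_PER_BIT + LOGIC_PJ_PER_BIT + ROUTER_PJ_PER_BIT * router_hops
    return LINE_BITS * per_bit / 1000  # pJ -> nJ

# Each extra hop adds ~2 nJ per line, which is why nearby placement pays off.
print(fetch_energy_nj(1), fetch_energy_nj(4))  # ~7.4 nJ vs ~13.4 nJ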
37
Results – Normalized Exec Time
• T-Tree P-Down reduces exec time by 50%
• 86% of flits bypass the router
• 88% of requests serviced by Ring-0
38
Results – Energy
39
Summary
• Must reduce data movement on off-chip memory links
• NDC reduces energy and improves performance by
overcoming the bandwidth wall
• More work required to analyze workloads, build software
frameworks, analyze thermals, etc.
• iNoM uses OS page placement to minimize hops for
popular data and increase power-down opportunities
• Path diversity is useful, router overhead is small
40
Acknowledgements
• Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor,
Jeff Jestes, Al Davis, Feifei Li
• Group funded by: NSF, HP, Samsung, IBM
41