Verdana Bold 30 - Center for Energy Efficient Electronics

Energy Efficient Designs with Wide Dynamic Range
Vivek De
Intel Fellow
Director of Circuit Technology Research
Circuits Research Lab
Intel Labs
Platform segment characteristics
[Table: segments Server, Desktop, Mobile and MID compared by Power, Workload, V-F, Cores, Thermal, Load and Ambient. Power runs from High through Med and Low to Very low. Other entries include Throughput, Parallel and HPC-RAS; Burst, Serial-Parallel, Stream and Graphics; Wide ranges at Low/Med, Med/High, Low-Med and Med-High; Active, Fan (-less) and Passive fan-less cooling; SOC; and ambient conditions from Controlled/Cool through Controlled/Varying to Uncontrolled/Extreme.]
Few designs must serve all segments
Dynamic platform control
[Figure: processor block diagram with caches, special purpose engines, graphics, video, a scalable on-die interconnect fabric, last level cache, integrated memory controllers and off-die interconnect.]
Maximize performance & efficiency
• Independent V/F control regions
• Dynamic V/F control
• Scenario-based power allocation
• Workload-based core activation & shutdown
Deliver best user experience under constraints
Dynamic adaptation & reconfiguration
[Figure: sensors feed a Dynamic Control Unit, which drives the processor.]
• Sensors: thermal sensor, voltage sensor, current sensor, aging sensor
• Dynamic Control Unit: voltage control, frequency control, configuration control
• Processor: change V, change F, reconfigure
Adapt & reconfigure for best power-performance
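The sensor-to-control loop on this slide can be sketched in a few lines. This is only an illustrative policy, not the control unit described in the talk: the `Sensors`, `OperatingPoint` and `Limits` types, every threshold and step size, and the priority order (limits first, then droop/aging compensation, then speed-up) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensors:
    temp_c: float      # thermal sensor reading
    droop_mv: float    # voltage sensor: measured supply droop
    current_a: float   # current sensor reading
    aging_pct: float   # aging sensor: estimated slowdown


@dataclass(frozen=True)
class OperatingPoint:
    vcc: float
    freq_mhz: float
    active_cores: int


@dataclass(frozen=True)
class Limits:
    # All values below are hypothetical, chosen only for illustration.
    t_max_c: float = 90.0
    i_max_a: float = 50.0
    droop_max_mv: float = 50.0
    aging_max_pct: float = 5.0
    v_step: float = 0.025
    v_max: float = 1.2
    f_step_mhz: float = 100.0
    f_min_mhz: float = 1000.0
    f_max_mhz: float = 3000.0


def control_step(s, op, lim=Limits()):
    """Return the next operating point for one pass of the control loop."""
    if s.temp_c > lim.t_max_c or s.current_a > lim.i_max_a:
        if op.freq_mhz - lim.f_step_mhz >= lim.f_min_mhz:
            # Over a limit: change F and V downward together.
            return OperatingPoint(round(op.vcc - lim.v_step, 3),
                                  op.freq_mhz - lim.f_step_mhz,
                                  op.active_cores)
        # Already at the frequency floor: reconfigure by shutting off a core.
        return OperatingPoint(op.vcc, op.freq_mhz, max(1, op.active_cores - 1))
    if s.droop_mv > lim.droop_max_mv or s.aging_pct > lim.aging_max_pct:
        # Compensate droop or aging by raising V, capped at v_max.
        return OperatingPoint(round(min(lim.v_max, op.vcc + lim.v_step), 3),
                              op.freq_mhz, op.active_cores)
    # Headroom available: raise F, capped at f_max.
    return OperatingPoint(op.vcc,
                          min(lim.f_max_mhz, op.freq_mhz + lim.f_step_mhz),
                          op.active_cores)
```

Calling `control_step` once per sampling period with fresh sensor readings gives a simple closed loop; a real controller would also need hysteresis and would account for V/F transition latency.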
Resilient platforms
• Resiliency framework: error detection, fault diagnosis, fault confinement, error correction, system recovery, system adaptation, system reconfiguration
• System layers: Applications, Programming System, OS, VM, Firmware, Microcode, Microarchitecture, Circuit & Design
• Resilient platform features: less silicon overhead, lower error rate, less recovery overhead
Resiliency for performance, efficiency & reliability
Fine-grain power management
Temporal domain
TODAY: coarse-grain management, slow voltage & frequency change
FUTURE: fine-grain management, fast voltage & frequency change
[Figure: two plots of Energy versus Time (msec), 0.0 to 1.5, comparing user demand with delivered energy under coarse-grain and fine-grain management.]
• V/F change latency
• Active-sleep transition latency
• Wake-up latency
• Wake-up prediction
• State transition energy
• Supply noise
TFLOPS at 62 watts
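The gap between user demand and delivered energy can be illustrated with a toy model. It is a sketch under stated assumptions, not data from the talk: the delivered level may change only once every `interval` samples (standing in for V/F change latency), it is sized for the peak demand in that window (ideal prediction), and the demand trace is made up.

```python
def delivered(demand, interval):
    """Hold one power level per `interval` samples, sized for the window's peak demand."""
    out = []
    for start in range(0, len(demand), interval):
        window = demand[start:start + interval]
        out.extend([max(window)] * len(window))
    return out


def wasted_energy(demand, interval, dt=1.0):
    """Energy delivered beyond what was demanded, for sample period `dt`."""
    return sum(d - u for d, u in zip(delivered(demand, interval), demand)) * dt


demand = [1, 4, 1, 1, 3, 1]            # hypothetical demand trace
coarse = wasted_energy(demand, 6)      # one level for the whole trace
finer = wasted_energy(demand, 3)       # level may change every 3 samples
finest = wasted_energy(demand, 1)      # level tracks demand sample by sample
```

Shrinking the interval reduces the over-delivered energy in this model; it ignores state transition energy and wake-up latency, which the bullets above list as the practical obstacles.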
Fine-grain power management
Spatial domain
TODAY: coarse-grain management
• Same voltage to all cores
• Same frequency for all cores
FUTURE: fine-grain management
• Each core/cluster at optimum voltage
• Each core/cluster at optimum frequency
• V/F domain interfaces
• Synchronization overhead
• Clock generation/distribution
• Power grid routing
• Optimum V/F for non-cores
• Sub-core clock/leakage gating
Advanced voltage regulators
Conversion, distribution and control: efficient, low loss, fine-grain
• Area efficient
• Scalable
• Persistent rail
• Lower loss
• Higher fidelity
• Simpler
• Fast & efficient
• Load adaptive
• Independent rails
[Figure: 5mm converter chip and load chip assembly, top and bottom views, with RF launch, input capacitors, output capacitors and inductors.]
[Plot: Efficiency [%] (70 to 90) versus Load Current [A] (0 to 20), comparing a load adaptive power source with a conventional power source; curves labeled 60MHz, 80MHz and 100MHz, 2.4V to 1.2V and 2.4V to 1.5V, L = 1.9nH and L = 0.8nH.]
VR innovations for fine-grain power management
Research testchip
[Figure: die photo, 12.64mm by 21.72mm, with I/O areas, PLL and TAP; single tile 1.5mm by 2.0mm.]
[Figure: tile block diagram, with datapath widths labeled 39, 64, 96 and 32.]
• Crossbar router with mesochronous interfaces (MSINT), 20 GB/s
• RIB
• 2KB data memory (DMEM)
• 3KB instruction memory (IMEM)
• 6-read, 4-write 32 entry RF
• FPMAC0 and FPMAC1, each with a multiplier, adder and normalize stage
• Processing Engine (PE)
Many-core power management
Dynamic sleep: 21 sleep regions per tile (not all shown)
• Data Memory sleeping: 57% less power
• Instruction Memory sleeping: 56% less power
• FP Engine 1 sleeping: 90% less power
• FP Engine 2 sleeping: 90% less power
• Router sleeping: 10% less power (stays on to pass traffic)
STANDBY:
• Memory retains data
• 50% less power/tile
FULL SLEEP:
• Memories fully off
• 80% less power/tile
Energy efficiency of 19.4 Gflops/Watt
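As a rough worked example of what the STANDBY and FULL SLEEP figures mean for a tile that spends only part of its time active, the sketch below averages tile power over a residency mix. The 50% and 80% savings come from the slide; the active tile power, the residency fractions and the function name are hypothetical, and state transition energy is ignored.

```python
# Per-tile power savings relative to the active state (from the slide).
SAVINGS = {"active": 0.0, "standby": 0.50, "full_sleep": 0.80}


def average_tile_power(active_w, time_fraction):
    """Time-weighted tile power for a dict mapping state -> fraction of time."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(active_w * (1.0 - SAVINGS[state]) * frac
               for state, frac in time_fraction.items())


# Hypothetical tile drawing 1.0 W when active, asleep half of the time.
avg_w = average_tile_power(1.0, {"active": 0.5, "standby": 0.25, "full_sleep": 0.25})
```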
Memory integration
3D stacking with thru-silicon vias
• 256 KB SRAM per core
• 4X C4 bump density
• 8490 thru-silicon vias
[Figure: cross-sections of the Polaris processor and memory on a package, labeled Processor with Cu bump (denser than C4 pitch), DRAM, Memory, Thru-Silicon Via, C4 pitch and Package.]
Voltage-frequency range limiters
Vmax/Fmax limiters
• Reliability
• Thermals
• Power delivery
Vmin limiters
• Circuit functional failures
• Soft errors
• Steep frequency roll-off
• Aging
[Figure: Frequency versus Voltage, with the operating range bounded by Vmin, Vmax and Fmax.]
Reliability & functional failures limit range
Voltage-frequency margins
• V variation: IR drop, inductive droops, load line variations
• T variation: worst T versus nominal T
• Path activity: MIS, signal coupling, worst critical path versus nominal
• Aging: EOL versus BOL
[Figure: four Frequency versus Voltage plots, one per source, each marking the resulting V margin and F margin.]
Multi-voltage cache
Push active Vmin limit to Vmax
[Plot: cumulative fail rate versus array Vmin for 6T SRAM, marking nominal array Vmin, worst die Vmin, and SER and erratic bits.]
[Plot: LLC density versus array voltage for 6T SRAM, multi-V 6T SRAM and 8T+ cell.]
Embedded level shifters for wordline & write drivers minimize area & power overhead
Dynamic multi-voltage cache
• Wordline underdrive (WLUD) for read
• Dynamic voltage collapse (DVC) for write
• Array to WL differential supply noise tracking
• Pulse width control
[Figure: 6T SRAM cell (BL, BL#, WL, VCC, VSS; devices NX, NPD, PPU) with read and write strength settings Strong, Balanced and Weak.]
[Figure: supply pulse waveform with delay TD, duration TM, magnitude M and retention level; circuit labels R1, R2, C, VOPAMP and VWL.]
$M = V_{CC}\,(1 - e^{-T_M/RC})$
45nm dynamic multi-V testchip (Source: Intel)
• Chip area: 0.91mm2
• Transistor count: 6.2M
• Vcc: 1.1V-0.7V
• 128Kb and 107Kb arrays of min-cells and p-cells, PBIST block, WLUD & dummy WLs, IO pad drivers, testing interface for a membrane probe card
[Plot: Relative Single Bit Fails (1.E+00 to 1.E+10) versus VWL (V), 0.7 to 1.1, for DVC off, weakest DVC, nominal DVC and strongest DVC; 26X less fails.]
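The pulse relation on this slide reads as a first-order RC response, M = VCC(1 - e^(-TM/RC)), with M the magnitude and TM the duration of the pulse. The sketch below evaluates it in both directions. The interpretation of the symbols follows the slide labels; the function names and the R, C and VCC values are assumptions chosen only for illustration.

```python
import math


def collapse_magnitude(vcc, t_m, r, c):
    """Magnitude M reached after a pulse of duration t_m: M = VCC * (1 - exp(-t_m / RC))."""
    return vcc * (1.0 - math.exp(-t_m / (r * c)))


def pulse_width_for(vcc, m, r, c):
    """Invert the relation: duration needed to reach magnitude m (requires 0 <= m < vcc)."""
    if not 0.0 <= m < vcc:
        raise ValueError("magnitude must satisfy 0 <= m < vcc")
    return -r * c * math.log(1.0 - m / vcc)


# Hypothetical values: 1 kOhm and 100 fF give RC = 100 ps.
m = collapse_magnitude(1.1, 200e-12, 1e3, 100e-15)   # pulse lasting two time constants
t = pulse_width_for(1.1, m, 1e3, 100e-15)            # recovers the 200 ps duration
```

Because M saturates toward VCC as TM grows, pulse width control gives fine control over the magnitude only for durations of a few time constants.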
Cache reconfiguration
Reduce cache size @ low V/F by eliminating failing words/bits
• Word disable: bitmap of failing words (for example 00101000)
• Bit fix
[Plot: Failure Prob. (1.0E-11 to 1.0E+00) versus Supply Voltage (0.4 to 0.7) for 1-bit ECC, 10-bit ECC, Word disable and Bit fix. Source: Intel]
Scheme                  Vmin (mV)  Density  Capacity      Latency (cycles)  IPC   EPI
Conventional            660        1*       1* (8-way)    L1: 3, L2: 20     1*    1*
Word disable (32KB L1)  500        0.92     0.5 (4-way)   4                 0.95  0.5
Bit fix (2MB L2)        500        1        0.75 (6-way)  23                0.95  0.5
* Normalized reference value
Source: Intel
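The word-disable idea (skip words flagged in a failure bitmap) and the reason word failures rise so quickly at low voltage can be sketched as follows. This is a simplified illustration, not the scheme evaluated in the table: the bitmap is read left to right as word 0 upward, bit failures are assumed independent, and the 32-bit word size and function names are assumptions.

```python
def usable_words(fail_bitmap):
    """Indices of words whose bit in the failure bitmap is '0' (not failing)."""
    return [i for i, bit in enumerate(fail_bitmap) if bit == "0"]


def remaining_capacity(fail_bitmap):
    """Fraction of words still usable after disabling the failing ones."""
    return len(usable_words(fail_bitmap)) / len(fail_bitmap)


def word_fail_prob(p_bit, bits_per_word=32):
    """Probability that at least one bit of a word fails, for independent bit failures."""
    return 1.0 - (1.0 - p_bit) ** bits_per_word


good = usable_words("00101000")             # words 2 and 4 are disabled
capacity = remaining_capacity("00101000")   # 6 of 8 words remain
```

With a per-bit failure probability of 1e-3, `word_fail_prob` already gives about 3% per 32-bit word, which is why capacity is traded away as the supply voltage drops.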
Low-voltage logic design
• Narrow muxes
• No stack height > 2
• Robust flip-flops
• Robust level converters
[Figure: 4:1 mux and level converter schematics with vcch and vccl supplies, input and output.]
[Figure: two plots of Efficiency (MIPS/Watt) versus Performance, annotated "Improve range", "Impact max performance" and "Improve range & efficiency".]
Design & technology optimizations to balance range, performance & efficiency
Low-voltage motion estimation engine
• 65nm CMOS
• 70K transistors
• Die area ~1mm2
[Figure: die photo with Motion Estimation Engine, I/O Control, Input FIFO, Output FIFO and Clock Generator.]
[Plot: Fmax (MHz) versus Supply Voltage (V), 0.2 to 1.4, showing a wide V-F range. Source: Intel, P1264, 50°C]
[Plot: GOPS/Watt versus Supply Voltage (V), 0.2 to 1.4, annotated 412 Gops/Watt @ 320 mV and 9.6X. Source: Intel]
Dynamic V & F adaptation
Prototype chip in 90nm (Source: Intel)
[Figure: chip diagram with TCP/IP processor core, input buffer and output port; clocking block (PLL0, PLL1 and PLL2 producing F0, F1 and F2, with divider and gating for I/O clk and core clk, driven by a PLL command); VRM; noise generator, noise generator control and noise injector; thermal sensors with sensor input & analog buffer; droop sensor; DAB control; PMOS and NMOS body bias (CBG); JTAG control.]
[Plot: Vcc versus Time showing the 1st, 2nd and 3rd droop. Source: Intel]
Environment-aware dynamic adaptation
• Adapt F/V to V/T change → reduce V/T margin
• Adapt F/V to aging → reduce aging margin
Resilient circuits
Error Detection Sequential (EDS)
• Detect errors in critical path FFs
• Propagate error signals
• Correct errors by re-execution
• Feedback to adaptive V/F
[Figure: EDS circuit built around MSFFs with error detection on clk; a voltage droop raises "Error!".]
65nm resilient circuits testchip
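The trade-off behind detect-and-re-execute can be sketched with a simple throughput model plus a feedback rule for the adaptive V/F loop. Both functions are illustrative assumptions rather than the testchip's mechanism: the model charges a fixed number of replay cycles per detected error, and the target error rate and frequency step are made-up values.

```python
def effective_throughput(freq_mhz, error_rate, replay_cycles):
    """Useful cycles per microsecond when each error costs `replay_cycles` extra cycles.

    `error_rate` is the fraction of cycles in which an error is detected.
    """
    return freq_mhz / (1.0 + error_rate * replay_cycles)


def adapt_frequency(freq_mhz, error_rate, target=1e-4, step_mhz=50.0):
    """Feedback to adaptive V/F: back off when errors exceed the target, else speed up."""
    if error_rate > target:
        return freq_mhz - step_mhz
    return freq_mhz + step_mhz


# Hypothetical operating points: rare errors cost little, frequent errors erase the gain.
rare = effective_throughput(3000.0, 1e-6, 100)
frequent = effective_throughput(3000.0, 1e-2, 100)
```

In this model, running past the conventional guardband pays off only while the error rate stays low enough that replay overhead is smaller than the frequency gained.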
Resiliency experiments
Response to voltage droops
• Nominal: VCC=1.2V & Temp=60C
• Worst-Case: 10% VCC Droop & Temp=110C
• Resilient design versus conventional design: 21% throughput gain OR 37% power reduction
[Plot: Throughput (BIPS) and Error Rate (%) versus Clock Frequency (MHz), 2100 to 3600, marking Max TP for the conventional and resilient designs and the FCLK guardband for VCC & temperature. Source: Intel]
[Plot: Power (mW) versus Throughput (BIPS) for the conventional and resilient designs. Source: Intel]
Summary
• Energy efficiency and wide dynamic operating range are critical for all platforms
• Integration, fine-grain power management, advanced voltage regulators & 3D memory stacking are key for energy efficiency
• Reliability, functionality, margins & efficiency limit dynamic operating range
• Multi-voltage design, dynamic adaptation, reconfiguration & resiliency are key enablers
Acknowledgement
CRL prototype
Tomm Aldridge
Jerry Bautista
Keith Bowman
Lev Finkelstein
Marci Glenn
Matthew Haycock
Andrew Henroid
Steven Hsu
Alaa Alameldeen
Sean Koehl
Sanu Mathew
Alon Naveh
Clark Roberts
Mark Rowland
Joe Schutz
Sriram Vangal
Bibiche Geuskens
Clair Webb
Bangalore Design
Mark Anders
Nitin Borkar
Saurabh Dighe
Shih-Lien Lu
George Goodman
Chris Wilkerson
Yatin Hoskote
Tanay Karnik
M. Khellah
R. Krishnamurthy
Randy Mooney
Trang Nguyen
Ronny Ronen
Greg Ruhl
D. Somasekhar
Manny Vara
Steven Hsu
Mark Bohr
LTD Design
Paolo Aseron
Shekhar Borkar
Zeshan Chishti
Varghese George
Steve Gunther
Ming Zhang
Jason Howard
Himanshu Kaul
V. Erraguntla
Partha Kundu
A. Raychowdhury
Fabrice Paillet
Efraim Rotem
Gerhard Schrom
James Tschanz
Howard Wilson
Kevin Zhang
Gunjan Pandya
Murli Tirumala
Ravi Mahajan
Ravi Prasher
Venkat Natarajan
Anand Deshpande
Ali Farhang
Amit Agarwal
Robert Chau
Rajesh Kumar
Ketan Paranjape
Greg Taylor
P. Vishakantaiah
R. Kuppuswamy
Rohit Vidwans
Gautam Doshi
Sunit Tyagi
B. Chatterjee
Sati Banerjee
Q&A