
Exponential Challenges, Exponential Rewards—The Future of Moore's Law

Shekhar Borkar
Intel Fellow
Circuit Research, Intel Labs
Fall, 2004
ISSCC 2003—Gordon Moore said…
"No exponential is forever… But we can delay Forever"
Outline

• The exponential challenges
• Circuit and µArch solutions
• Major paradigm shifts in design
• Integration & SOC
• The exponential reward
• Summary
Goal: 1 TIPS by 2010

[Chart: MIPS (log scale, 0.01 to 1,000,000) versus year (1970-2010), tracking the 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, and Pentium® 4 Architecture on the path toward 1 TIPS by 2010.]

How do you get there?
Technology Scaling

[Diagram: MOS transistor cross-section before and after scaling, showing gate, source, drain, and body, with oxide thickness Tox, effective channel length Leff, and junction depth Xj.]

• Dimensions scale down by 30%, doubling transistor density
• Oxide thickness scales down
• Faster transistors, higher performance
• Vdd & Vt scaling lowers active power

Technology has scaled well, will it in the future?
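The classical scaling arithmetic behind these bullets can be written out explicitly. The sketch below is a minimal illustration (not from the talk), assuming the ideal Dennard case where every dimension and voltage scales by the same factor k ≈ 0.7 per generation:

```python
# Ideal (Dennard) scaling per process generation: every linear
# dimension and voltage shrinks by k ~ 0.7 (a 30% reduction).
k = 0.7

area        = k * k        # transistor area -> ~0.5x, i.e. 2x density
delay       = k            # gate delay CV/I -> ~0.7x, i.e. ~1.4x faster
capacitance = k            # device capacitance -> ~0.7x
voltage     = k            # Vdd (and Vt) -> ~0.7x
energy      = capacitance * voltage**2          # ~0.34x per switch
power_density = energy / delay / area           # ~1x (stays constant)

print(f"density gain  : {1/area:.2f}x")
print(f"speed gain    : {1/delay:.2f}x")
print(f"energy/op     : {energy:.2f}x")
print(f"power density : {power_density:.2f}x")
```

The Technology Outlook table on the next slide shows where reality departs from this ideal: delay scales by more than 0.7 and energy per logic operation by more than 0.35 going forward.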
Technology Outlook

| High Volume Manufacturing | 2004 | 2006 | 2008 | 2010 | 2012 | 2014 | 2016 | 2018 |
| Technology Node (nm)      | 90   | 65   | 45   | 32   | 22   | 16   | 11   | 8    |
| Integration Capacity (BT) | 2    | 4    | 8    | 16   | 32   | 64   | 128  | 256  |
| Delay = CV/I scaling      | 0.7  | ~0.7 | >0.7 | Delay scaling will slow down |
| Energy/Logic Op scaling   | >0.35 | >0.5 | >0.5 | Energy scaling will slow down |
| Bulk Planar CMOS          | High Probability | Low Probability |
| Alternate, 3G etc.        | Low Probability  | High Probability |
| Variability               | Medium | High | Very High |
| ILD (K)                   | ~3   | <3   | Reduce slowly towards 2-2.5 |
| RC Delay                  | 1    | 1    | 1    | 1    | 1    | 1    | 1    | 1 |
| Metal Layers              | 6-7  | 7-8  | 8-9  | 0.5 to 1 layer per generation |
Is Transistor a Good Switch?

[Diagram: an ideal switch carries I = ∞ when on and I = 0 when off. A real MOS transistor delivers on the order of 1 mA/µm when on, but its off-current is not zero: sub-threshold leakage.]
Exponential Challenge #1
Sub-threshold Leakage

[Chart: Ioff (nA/µm, log scale from 1 to 10,000) versus temperature (30-130 °C), with a curve per generation from 0.25u to 45nm. Assumes Ioff = 1 nA/µm at 30 °C for 0.25µm, and a 5X increase each generation.]

Sub-threshold leakage increases exponentially
SD Leakage Power

[Chart: source-drain leakage power (W, log scale from 0.1 to 1000) versus technology (0.25u, 0.18u, 0.13u, 90nm, 65nm, 45nm), for 2X and 1.5X transistor-count growth per generation.]

SD leakage power becomes prohibitive
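The trend in the last two charts can be reproduced with a back-of-the-envelope model. The sketch below is a hypothetical illustration: the slide supplies the per-transistor Ioff growth (~5X per generation) and the transistor-count growth (1.5X-2X per generation), while the width and Vdd scaling factors are assumptions carried over from the Technology Scaling slide, and the absolute wattage is not calibrated to any real product:

```python
# Back-of-the-envelope source-drain leakage projection.
# Slide assumptions: Ioff per micron of width grows ~5X per generation,
# transistor count grows 1.5X-2X per generation.
# Added assumptions (from the Technology Scaling slide): device width
# and Vdd each scale by ~0.7X per generation.
IOFF_GROWTH = 5.0
DIM_SCALE = 0.7
VDD_SCALE = 0.7

nodes = ["0.25u", "0.18u", "0.13u", "90nm", "65nm", "45nm"]

def leakage_trend(tr_growth, p_start=0.1):
    """Leakage power (W) per node, starting from p_start watts at 0.25u."""
    per_gen = IOFF_GROWTH * DIM_SCALE * VDD_SCALE * tr_growth
    return [p_start * per_gen**i for i in range(len(nodes))]

for growth in (1.5, 2.0):
    trend = leakage_trend(growth)
    print(f"{growth}X transistor growth:",
          ", ".join(f"{n}={p:.1f} W" for n, p in zip(nodes, trend)))
```

Even with the slower 1.5X transistor growth, leakage alone reaches tens of watts by 45nm in this toy model, which is the slide's point.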
Leakage Power

[Chart: leakage power as a percentage of total power (0-50%) versus technology (1.5µm down to 0.09µm and 0.05µm), rising toward the "must stop at 50%" line. A. Grove, IEDM 2002.]

Leakage power limits Vt scaling
Exponential Challenge #2
Gate Oxide is Near Limit

[Chart: gate leakage power (W, log scale from 0.001 to 1,000,000) versus technology (0.25u to 45nm), with curves labeled 1.5X and 2X and a "During Burn-in, 1.4X Vdd" annotation. Inset: TEM cross-section of a 90nm MOS transistor with a 50nm gate and 1.2 nm SiO2 gate oxide on the silicon substrate.]

If Tox scaling slows down, then Vdd scaling will have to slow down.
High-K dielectric is crucial.
Exponential Challenge #3
Energy per Logic Operation

[Chart: energy per logic operation (normalized, log scale from 1 down to 1E-8) versus technology (10µm down to 0.07µm). Historical scaling was steep ("This was good"); the projected curve flattens ("Slow down").]

Energy per logic operation scaling will slow down
The Power Crisis

[Chart: power (W, up to 1200) for a 15 mm die versus technology (0.25u to 45nm), split into active and leakage components, both growing rapidly.]

Business as usual is not an option
Exponential Challenge #4
Sources of Variations

[Multi-panel figure; source: Mark Bohr, Intel.]
• Sub-wavelength lithography: lithography wavelength (365nm, 248nm, 193nm, and 13nm EUV on the horizon) plotted against technology node (500nm down to 32nm) from 1980 to 2020; the widening gap between wavelength and feature size drives dimension variation.
• Random dopant fluctuations: the mean number of dopant atoms in the channel keeps falling across the 90nm, 65nm, 45nm, and 32nm generations.
• Heat flux and hot spots: heat flux approaching 200-250 W/cm² produces within-die temperature variation of tens of °C (roughly 50-110 °C across the die).
• Supply noise: results in Vcc variation.
Frequency & SD Leakage

[Scatter plot: normalized frequency (0.9-1.4) versus normalized leakage Isb (0-20) for ~1000 samples of a 0.18 micron part. Frequency spreads about 30% while leakage spreads about 20X. Regions: low frequency / low Isb, high frequency / medium Isb, and high frequency / high Isb.]
Vt Distribution

[Histogram: number of chips versus VTn (mV, roughly -40 to +32 about the mean) for the same ~1000 samples of the 0.18 micron part, with a ~30mV spread marked. Chips in the low-Vt tail run at high frequency with high Isb; chips in the high-Vt tail run at low frequency with low Isb.]
Exponential Challenge #5
Platform Requirements

[Charts: system volume (cubic inches, up to ~3000) shrinking from PC tower to mini tower, µ-tower, slim line, and small PC; platforms must be smaller and quieter, yet deliver high performance. Thermal budget (°C/W) decreasing from the Pentium® III to the Pentium® 4, while required heat-sink volume (in³) and air flow rate (CFM) rise steeply with power (up to ~200 W).]

Thermal budget decreasing; higher heat-sink volume and higher air flow rate required.
Exponential Challenge #6
Exponential Costs

[Charts: lithography tool cost ($K) and fab cost ($M) both rising exponentially from 1960 to 2010 (www.icknowledge.com), while $ per transistor and $ per MIPS fall exponentially over the same period (G. Moore, ISSCC 03).]
Product Cost Pressure

[Chart: desktop ASP ($) declining from about $1400 over 1999-2005 (source: IDC). Of the platform budget, roughly $25 goes to power delivery and $10 to the thermal solution, with the rest to everything else.]

Shrinking ASP, and shrinking $ budget for power
Must Fit in Power Envelope

[Chart: power and power density (W, W/cm², up to ~1400) for a 10 mm die versus technology (90nm to 16nm), split into active power, SD leakage, and SiO2 (gate) leakage, with on-die memory sizes from 1 MB to 8 MB marked.]

Technology, Circuits, and Architecture to constrain the power
Some Implications

• Tox scaling will slow down—may stop?
• Vdd scaling will slow down—may stop?
• Vt scaling will slow down—may stop?
• Approaching constant-Vdd scaling
• Energy/logic op will not scale

[Charts: Vdd (V, log scale) versus technology (10µm down to 0.05µm), flattening near ~1 Volt; energy per logic operation (normalized, log scale down to 1E-8) versus technology (10µm down to 0.07µm), with scaling slowing down.]
The Gigascale Dilemma

• 1B-transistor integration capacity will be available
• But it could be unusable due to power
• Logic transistor growth will slow down
• Transistor performance will be limited

Solutions
• Low-power design techniques
• Improve design efficiency—Multi everywhere
• Valued performance by even higher integration (of potentially slower transistors)
• Power—active and leakage
• Variations
• Microarchitecture
Active Power Reduction

Multiple supply voltages: run fast, critical blocks from a high supply voltage and slow, non-critical blocks from a low supply voltage.

Replicated designs: replace one logic block running at Vdd with two copies running at Vdd/2.
• Single block: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Power density = 1
• Two replicated blocks: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Power density = 0.125
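The replicated-design numbers follow directly from the first-order dynamic power relation P ∝ C·Vdd²·f. A minimal sketch of that arithmetic, assuming (as the slide does) that frequency can scale linearly with Vdd in this range and that capacitance scales with area:

```python
# Dynamic power of a CMOS block: P ~ C * Vdd^2 * f.
# Compare one block at full Vdd against two replicated blocks at Vdd/2,
# each running at half frequency (assumed linear f-vs-Vdd relationship).
def block(vdd, freq, n_copies=1, c_per_copy=1.0):
    power = n_copies * c_per_copy * vdd**2 * freq
    throughput = n_copies * freq          # each copy processes its own stream
    area = n_copies * 1.0
    return {"throughput": throughput, "power": power,
            "area": area, "power_density": power / area}

baseline   = block(vdd=1.0, freq=1.0)                 # one block at Vdd
replicated = block(vdd=0.5, freq=0.5, n_copies=2)     # two blocks at Vdd/2

print(baseline)    # throughput 1.0, power 1.0,  area 1, density 1.0
print(replicated)  # throughput 1.0, power 0.25, area 2, density 0.125
```

The same throughput is delivered at one quarter of the power, at the cost of twice the area; this is the voltage-frequency trade that the multi-core slides later in the talk exploit.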
Leakage Control

• Body bias: apply body bias (Vbp / Vbn, forward or reverse) to the transistors of a logic block: 2-10X leakage reduction
• Stack effect: turn off stacks of series transistors (with equal loading) rather than single devices: 5-10X leakage reduction
• Sleep transistor: cut the supply to idle logic blocks with a series switch: 2-1000X leakage reduction
Circuit Design Tradeoffs

[Charts: power and probability of meeting the target frequency versus transistor size (small to large) and versus low-Vt usage (low to high). Both probability and power rise with size and with low-Vt usage.]

Higher probability of meeting target frequency with:
1. Larger transistor sizes
2. Higher low-Vt usage
But with a power penalty
Impact of Critical Paths

[Charts: distribution of dies versus clock frequency (0.9-1.5) for designs with different numbers of critical paths; mean clock frequency (normalized, 1-1.4) versus number of critical paths (9 to 25).]

• With an increasing number of critical paths:
  – Both σ and µ of the frequency distribution become smaller
  – Lower mean frequency
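The statistics behind this are just the minimum of many random variables: chip frequency is set by the slowest of N nominally identical critical paths, so the more critical paths, the lower and tighter the maximum frequency. A small Monte Carlo sketch (a toy model with made-up numbers, not the data from the slide):

```python
import random

def max_freq_samples(n_paths, n_dies=10_000, sigma=0.05):
    """Chip frequency = 1 / max(path delays); each path delay ~ N(1, sigma)."""
    freqs = []
    for _ in range(n_dies):
        slowest = max(random.gauss(1.0, sigma) for _ in range(n_paths))
        freqs.append(1.0 / slowest)
    return freqs

for n_paths in (1, 9, 17, 25):
    f = max_freq_samples(n_paths)
    mean = sum(f) / len(f)
    std = (sum((x - mean) ** 2 for x in f) / len(f)) ** 0.5
    print(f"{n_paths:3d} critical paths: mean freq {mean:.3f}, sigma {std:.3f}")
```

As the number of critical paths grows, both the mean and the spread of the achievable frequency shrink, matching the slide's bullet.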
Impact of Logic Depth

[Charts: distributions of device ION and path delay for a logic depth of 16, spread roughly ±16%: NMOS Ion σ/µ = 5.6%, PMOS Ion σ/µ = 3.0%, delay σ/µ = 4.2%. The ratio of delay-σ to Ion-σ falls as logic depth grows from 16 to 49.]
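The reason deeper paths show proportionally less delay variation is averaging: a path delay is the sum of many stage delays, so the random, uncorrelated component of its σ/µ shrinks roughly as 1/√(logic depth). Correlated, die-to-die variation does not average out, which is why the measured reduction is milder than this ideal. A toy illustration with hypothetical numbers, modeling only the random component:

```python
import random

def path_sigma_over_mu(depth, n_paths=10_000, stage_sigma=0.05):
    """sigma/mu of a path of `depth` independent stages, each ~ N(1, stage_sigma)."""
    delays = [sum(random.gauss(1.0, stage_sigma) for _ in range(depth))
              for _ in range(n_paths)]
    mu = sum(delays) / n_paths
    sigma = (sum((d - mu) ** 2 for d in delays) / n_paths) ** 0.5
    return sigma / mu

for depth in (1, 16, 49):
    print(f"logic depth {depth:2d}: delay sigma/mu = {path_sigma_over_mu(depth):.1%}")
```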
µArchitecture Tradeoffs

[Charts: frequency and probability of meeting the target frequency versus logic depth (large to small) and versus number of µArch critical paths (fewer to more).]

Higher target frequency with:
1. Shallow logic depth
2. Larger number of critical paths
But with lower probability
Variation-tolerant Design

[The four trade-off charts together: power and probability versus transistor size and low-Vt usage (circuit knobs); frequency and probability versus logic depth and number of µArch critical paths (µArch knobs).]

Balance power & frequency with variation tolerance
Probabilistic Design

[Charts: due to variations in Vdd, Vt, and temperature, path delay becomes a probability distribution, so a deterministic delay target becomes a probabilistic one. Frequency and leakage power likewise shift from deterministic points to distributions, with ~10X leakage variation and leakage approaching ~50% of total power.]

Deterministic design techniques inadequate in the future
Shift in Design Paradigm

• Multi-variable design optimization for:
  – Yield and bin splits
  – Parameter variations
  – Active and leakage power
  – Performance

Today: local optimization, single variable
Tomorrow: global optimization, multi-variate
Adaptive Body Bias—Experiment

[Die photo: 4.5 mm × 5.3 mm test chip with multiple subsites, each containing a circuit under test (CUT), phase detector & counter, resistor network, delay element, and bias amplifier. Subsite size 1.6 × 0.24 mm, 21 subsites per die.]

Technology: 150nm CMOS
Subsites per die: 21
Body bias range: 0.5V FBB to 0.5V RBB
Bias resolution: 32 mV

Die frequency: Min(F1..F21)
Die power: Sum(P1..P21)
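A sketch of how the reported die metrics and the adaptive bias loop fit together. This is a hypothetical simulation of the scheme described on the slide (per-subsite bias search within roughly ±0.5V at 32 mV resolution, die frequency taken as the slowest subsite, die power as the sum), not the actual test-chip control logic, and the frequency/leakage models are made up:

```python
import random

BIAS_STEPS = [i * 0.032 for i in range(-15, 16)]   # ~0.5V RBB .. ~0.5V FBB, 32 mV steps

def subsite_freq(f_nom, bias):
    """Toy model: forward body bias speeds a subsite up, reverse bias slows it."""
    return f_nom * (1.0 + 0.2 * bias)

def subsite_leak(p_nom, bias):
    """Toy model: leakage rises steeply with forward body bias."""
    return p_nom * (2.0 ** (bias / 0.25))

def adapt_die(subsites, f_target):
    """Per-subsite ABB: pick the most-reverse bias that still meets f_target."""
    freqs, powers = [], []
    for f_nom, p_nom in subsites:
        chosen = BIAS_STEPS[-1]                    # fall back to max FBB
        for b in BIAS_STEPS:                       # search from RBB toward FBB
            if subsite_freq(f_nom, b) >= f_target:
                chosen = b
                break
        freqs.append(subsite_freq(f_nom, chosen))
        powers.append(subsite_leak(p_nom, chosen))
    return min(freqs), sum(powers)                 # Min(F1..F21), Sum(P1..P21)

# 21 subsites with random within-die variation
subsites = [(random.gauss(1.0, 0.05), random.gauss(1.0, 0.2)) for _ in range(21)]
die_freq, die_power = adapt_die(subsites, f_target=1.0)
print(f"die frequency {die_freq:.3f}, die leakage power {die_power:.2f}")
```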
Adaptive Body Bias—Results

[Histograms of die frequency: with no body bias, dies spread from "too slow" to "too leaky"; applying FBB or RBB per die (ABB) pulls them toward the target frequency, and applying bias per subsite (within-die ABB) tightens the distribution further and shifts it toward higher frequency.]

For a given frequency and power density:
• 100% yield with ABB
• 97% in the highest frequency bin with within-die ABB
Design & µArch Efficiency

[Charts, same process technology: growth (X) in die area, performance, and power relative to the previous µArch for the super-scalar, dynamic (out-of-order), and deep-pipeline generations; area and power grow faster than performance, and the resulting energy efficiency (MIPS/Watt) drops roughly 20% with each step.]

Employ efficient design & µArchitectures
Memory Latency

[Diagram: CPU with a small on-die cache (~few clocks access) in front of a large external memory (50-100ns). Chart: memory latency in clocks versus CPU frequency (100 MHz to 10 GHz, log-log), assuming a 50ns memory latency.]

Cache miss hurts performance; worse at higher frequency
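The chart is just a unit conversion: a fixed memory latency in nanoseconds costs more core clocks as frequency rises. A short sketch makes the slide's assumption concrete (50 ns memory, frequencies spanning the chart's axis):

```python
# A fixed 50 ns memory latency expressed in CPU clock cycles.
MEM_LATENCY_NS = 50

for freq_mhz in (100, 500, 1000, 3000, 10000):
    clocks = MEM_LATENCY_NS * 1e-9 * freq_mhz * 1e6
    print(f"{freq_mhz:6d} MHz -> cache miss costs {clocks:6.0f} clocks")
```

At 100 MHz a miss costs a handful of clocks; at several GHz it costs hundreds.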
Increase On-die Memory

[Charts: power density (W/cm²) of logic versus memory; memory runs at far lower power density than logic. Cache as a percentage of total die area (0-100%) grows across the 486, Pentium®, Pentium® III, Pentium® 4, and Pentium® M generations (1µ down to 65nm).]

Large on-die memory provides:
1. Increased data bandwidth & reduced latency
2. Hence, higher performance for much lower power
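One way to see why big on-die caches pay off is the standard average-memory-access-time relation, AMAT = hit_time + miss_rate × miss_penalty. The numbers below are illustrative assumptions (a few-clock cache, and a memory penalty in the hundreds of clocks as on the Memory Latency slide), not measurements from the talk:

```python
# Average memory access time (clocks): hit_time + miss_rate * miss_penalty.
HIT_TIME = 3          # ~few clocks for an on-die cache (slide's assumption)
MISS_PENALTY = 150    # 50 ns memory at 3 GHz (from the Memory Latency slide)

for miss_rate in (0.10, 0.04, 0.02, 0.01):   # larger cache -> lower miss rate
    amat = HIT_TIME + miss_rate * MISS_PENALTY
    print(f"miss rate {miss_rate:4.0%}: average access = {amat:5.1f} clocks")
```

Halving the miss rate cuts the average access time far more than a realistic frequency bump would, and it does so by adding memory, which burns far less power per mm² than logic.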
Multi-threading

[Chart: single-thread performance versus cache hit rate (100%, 98%, 96%) at 1, 2, and 3 GHz. At higher frequency and lower hit rate, the core spends most of its time waiting for memory, so hardware utilization falls well below the full utilization its thermals and power delivery were designed for. Timeline: a single thread (ST) alternates compute with "wait for mem"; with multi-threading, threads MT1, MT2, MT3 fill the wait slots.]

Multi-threading improves performance without impacting thermals & power delivery
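A toy utilization model shows the effect the slide is describing. Assume (hypothetically) that each thread computes for C clocks and then stalls for M clocks on a miss; with T threads sharing the core, the stalls of one thread can be covered by the others:

```python
# Core utilization when each thread alternates C compute clocks with
# M memory-stall clocks, and T threads share the core round-robin.
def utilization(compute_clocks, miss_clocks, threads):
    # With perfect interleaving the core does threads*compute_clocks of
    # useful work per compute+miss period, capped at full utilization.
    busy = threads * compute_clocks
    period = compute_clocks + miss_clocks
    return min(1.0, busy / period)

C, M = 20, 150          # e.g. 20 useful clocks, then a 150-clock miss (3 GHz, 50 ns)
for threads in (1, 2, 4, 8):
    print(f"{threads} thread(s): utilization = {utilization(C, M, threads):.0%}")
```

Adding threads raises utilization of hardware that is already built and powered, which is why the thermal and power-delivery budgets are unaffected.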
Chip Multi-Processing

[Diagram: four cores (C1-C4) sharing an on-die cache. Chart: relative performance versus die area and power (1X to 4X); the multi-core curve reaches roughly 3-3.5X while the single-core curve grows much more slowly.]

• Multi-core, each core multi-threaded
• Shared cache and front-side bus
• Each core has different Vdd & frequency
• Core hopping to spread hot spots
• Lower junction temperature
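A common way to quantify this comparison (not from this talk) is Pollack's rule of thumb: single-core performance grows only about as the square root of the area spent on it, while N smaller cores deliver roughly N times one core's throughput on parallel work. A sketch under those assumptions:

```python
# Compare spending a 1X-4X area/power budget on one big core versus the
# same budget split into unit-sized cores, assuming Pollack's rule
# (single-core perf ~ sqrt(area)) and perfectly parallel work.
import math

def single_core_perf(area):
    return math.sqrt(area)          # Pollack's rule of thumb

def multi_core_perf(n_cores, area_per_core=1.0):
    return n_cores * single_core_perf(area_per_core)

for budget in (1, 2, 3, 4):
    big = single_core_perf(budget)
    many = multi_core_perf(budget)  # `budget` unit-sized cores
    print(f"area/power {budget}X: single core {big:.1f}X, multi-core {many:.1f}X")
```

The slide's multi-core curve tops out nearer 3X than this ideal 4X, consistent with shared-cache and bus overheads.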
Special Purpose Hardware

[Chart: MIPS (log scale) from 1995 to 2015, comparing general-purpose MIPS at ~75W against a TCP offload engine (TOE) delivering its MIPS at ~2W. Die photo: 2.23 mm × 3.54 mm, 260K transistors.]

Opportunities:
• Network processing engines
• MPEG encode/decode engines
• Speech engines

Special purpose HW—Best MIPS/Watt
Valued Performance: SOC (System on a Chip)

• Special-purpose hardware → more MIPS/mm²
• SIMD integer and FP instructions in several ISAs

| | General Purpose | Multimedia Kernels |
| Die Area    | 2X    | <10%     |
| Power       | 2X    | <10%     |
| Performance | ~1.4X | 1.5 - 4X |

Si monolithic: CPU, memory, special wireline HW, CMOS RF.
Polylithic, heterogeneous: Si, SiGe, GaAs, optoelectronics, RF, dense memory.
The Exponential Reward

[Chart: MIPS (log scale, 0.01 to 1,000,000) versus year (1970-2010). The era of pipelined architecture (8086, 286, 386, 486) gives way to the era of instruction-level parallelism (super scalar, speculative out-of-order), and then to the era of thread and processor level parallelism (multi-threaded, multi-threaded multi-core, special purpose HW).]

Multi-everywhere: MT, CMP
Summary—Delaying Forever

• Gigascale transistor integration capacity will be available—Power and Energy are the barriers
• Variations will be even more prominent—shift from Deterministic to Probabilistic design
• Improve design efficiency
• Multi-everywhere & SOC valued performance
• Exploit integration capacity to deliver performance in the power/cost envelope