Design challenges in sub-100nm high performance microprocessors Vasantha Erraguntla


Design challenges in sub-100nm
high performance microprocessors
Nitin Borkar, Siva Narendra, James Tschanz,
Vasantha Erraguntla
Circuit Research, Intel Labs
Outline
• Section 1: Challenges for low power and high performance (90 mins)
– Historical device and system scaling trends
– Sub-100nm device scaling challenges
– Power delivery and dissipation challenges
– Power efficient design choices
• Section 2a: Circuit techniques for variation tolerance (90 mins)
– Short channel effects
– Adaptive circuit techniques for variation tolerance
2
Outline (contd.)
• Section 2b: Circuit techniques for leakage control (90 mins)
– Leakage power components
– Leakage power prediction
– Leakage reduction and control techniques
• Section 3: Full-chip power reduction techniques (90 mins)
– Micro-architecture innovations
– Coding techniques for interconnect power reduction
– CMOS compatible dense memory design
– Special purpose hardware
– Design methodologies & challenges for CAD
3
Section 1
Challenges for low power
and high performance
4
Moore’s Law on scaling
5
Scaling of dimensions
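[Figure: MOSFET cross-section (gate, source, drain, body; oxide thickness Tox, channel length L) before and after a 0.7X shrink]
Before scaling: dimensions 1, Tox, L; area 1; Delay = 1; Freq = 1
After scaling: dimensions 0.7, 0.7 Tox, 0.7 L; area 0.7 × 0.7 = 0.49; Delay = 0.7; Freq = 1/0.7 = 1.43
6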
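The scaling arithmetic on this slide can be sketched in a few lines. This is only an illustration of the quoted numbers; the function name `scale_generation` and the dictionary keys are hypothetical, not part of the tutorial.

```python
# Ideal scaling arithmetic: every linear dimension shrinks by 0.7X per generation.
SCALE = 0.7  # shrink factor quoted on the slide


def scale_generation(scale: float = SCALE) -> dict:
    """Return the relative dimension, area, delay and frequency after one shrink."""
    return {
        "dimension": scale,          # L and Tox both scale by 0.7
        "area": scale * scale,       # 0.7 x 0.7 = 0.49
        "delay": scale,              # gate delay tracks the linear shrink
        "frequency": 1.0 / scale,    # 1 / 0.7, about 1.43
    }


if __name__ == "__main__":
    for name, value in scale_generation().items():
        print(f"{name:>9}: {value:.2f}x")
```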
Transistors on a chip
Transistors (MT)
1000
2X growth in 1.96 years!
100
10
486
1
386
286
0.1
0.01
Pentium 4
Pentium III
Pentium II
Pentium
8086
8080
8008
4004
8085
0.001
1970
1980
1990
Year
2000
2010
Transistors on Lead Microprocessors double every 2 years
7
Die size growth
Die size (mm)
100
10
8080
8008
4004
1
1970
8086
8085
1980
286
386
Pentium 4
Pentium III
Pentium II
Pentium
486
~7% growth per year
~2X growth in 10 years
1990
Year
2000
2010
Die size grows by ~14% every two years (~7% per year) to satisfy Moore’s Law
8
Frequency
Frequency (Mhz)
10000
Doubles every
2 years
1000
100
486
10
8085
1
0.1
1970
8086 286
Pentium 4
Pentium III
Pentium II
Pentium
386
8080
8008
4004
1980
1990
Year
2000
2010
Lead Microprocessors frequency doubles every 2 years
9
Performance
Applications will demand TIPS performance
10
Power
Future
Pentium 4
Power (Watts)
100
Pentium III
Pentium
10
8086 286
1
8008
4004
486
386
8085
8080
0.1
1971
1974
1978
1985
1992
2000
Year
Lead Microprocessors power continues to increase
11
Obeying Moore’s Law...
10000
1.8B
Transistors (MT)
1000
900M
425M
200M
Pentium 4
Pentium II
Pentium
486
100
10
1
386
286
0.1
0.01
8086
8080
8008
4004
8085
0.001
1970
1980
1990
Year
2000
2010
200M--1.8B transistors on the Lead Microprocessor
12
Vcc will continue to reduce
Supply Voltage (V)
10.00
1.35
1
1.00
1.15
0.9
0.10
1970
1980
1990
Year
2000
2010
Only 15% Vcc reduction to meet frequency demand
13
Constant Electric Field Scaling
[Figure: oxide field (MV/cm), 0 to 5, vs. technology dimension (1.5, 1.2, 1.0, 0.8, 0.6, 0.35, 0.25, 0.18 µm)]
14
Active capacitance density
[Figure: Active Cap Density (nF/mm²), log scale from 0.01 to 1.00, for the 386, 486, Pentium(R), Pentium(R) with MMX(TM) and Pentium Pro(R)]
$\text{Active Cap} \propto \dfrac{\text{Power}}{V_{CC}^2 \times \text{freq}}$
$\text{Cap Density} \propto \dfrac{C}{\text{Area}}$
Active capacitance grows 30-35% each technology generation
15
Power will be a problem
100000
18KW
5KW
1.5KW
500W
Power (Watts)
10000
1000
100
286 386
10
8086
8085
1
80088080
4004
0.1
1971 1974
1978
1985
486
P4
P III
Pentium
1992
Year
2000
2004
2008
Power delivery and dissipation will be prohibitive
16
Closer look at the power
100,000
Will be...
Power (Watts)
18KW
10,000
5KW
Should be...
1.5KW
1,000
623W
500W
375W
225W
135W
100
2002
2004
2006
Year
17
2008
Advanced transistor design
Shallow highly doped
source/drain extension
Thin TOX
p+
p+
Halo/pocket
Shallow trench
isolation
n-well
Retrograde Well
Deep source/drain
18
Intel’s 15 nm bulk transistor
500
Vg = 0.8V
15nm
R. Chau et al., IEDM 2000
Drain Current (
25 nm
 A/  m)
Intel’s 15nm NMOS
400
0.7V
300
0.6V
200
0.5V
0.4V
100
0.3V
0
0
0.2
0.4
0.6
Drain Voltage (V)
19
0.8
Transistor scaling trends - SCE
Le
D
MOSFET Aspect Ratio
(lateral/vertical)
7.0
Tox
Dj
Uniform doping
Retrograde doping
6.0
5.0
4.0
3.0
2.0
1.0
0.25
0.18
0.13
0.10
0.07
Technology Generation (um)
Aspect ratio $= \dfrac{L_e}{\left[T_{ox}\,(\varepsilon_{si}/\varepsilon_{ox})\right]^{1/3} D^{1/3} D_j^{1/3}}$
Short channel effect (SCE), measured as aspect ratio, has been worsening with scaling
20
Transistor scaling challenges - Dj
Salicide
Poly-Si
S. Asai et al., 1997.
Salicide
RC
RSE
Rsalicide
• Junction depth reduction:
+ Device channel length decrease for same SCE
- Series resistance to the channel increases
21
Transistor scaling challenges - Tox
• Thinning gate oxide
– Increased gate tunneling leakage
– Electrical thickness is ~2X physical thickness
– Gate stress now limits max VCC
• Solutions
– New decoupling caps
– Modified oxides/gate materials
– Model gate leakage in circuit simulation
22
VCC and VT scaling
VCC or VT (V)
V CC or V T (V)
5
4
3
5
VCC
4
3
(VCC- VT )
Gate over drive
22
VT
11
00
0
VCC=1.8V
VT =.45V
1
2
3
4
5
6
7
1.4 1.0 0.8 0.6 .35 .25 .18
Technology Generation (m)
23
Vcc scaling & Soft errors
• Vcc and cap scaling with technology
reduces charge stored
• Soft errors prominent in logic circuits
• No error correction in logic circuits
• Storage nodes per chip increasing
• Higher soft errors at the chip level
24
Motivation
• Soft error rate (SER) per bit staying constant
in future processes
– T. Karnik et al, 2001 VLSI Circuits Symposium
$\dfrac{\text{SER}}{\text{bit}} \propto \dfrac{A_{diff}}{C_{gate}\, V_{cc}}$
• Need to reduce SER/bit
Goal: Reduce chip-level SER with no performance
penalty and minimum power penalty
25
Measured Latch Data
SERX
2
5,250
Errors
2.25
Original
3,500
1,750
0
Hardened
0.5
0.7
0.9
1.1
1
1.3
SER ImprovementX
7,000
Supply Voltage (V)
T. Karnik et al, 2001 VLSI Circuits Symposium
Will need ~2X SER improvement in latches
with no performance loss.
26
VT vs. leakage
Leakage rises as the VT is lowered
– MOS has a sub-threshold slope of ~110mV/decade
– Lowering VT by 50mV → ~3X leakage
Solutions
– Dual VT
– Stacking of off gates
– Controlled back gate bias?
– Multiple process technologies:
Mobile vs. Performance?
27
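The "50mV lower VT gives ~3X leakage" figure follows from the quoted sub-threshold slope. A minimal sketch, assuming leakage changes by one decade per `slope_mv_per_decade` of threshold shift; the function name is hypothetical.

```python
def leakage_multiplier(vt_drop_mv: float, slope_mv_per_decade: float = 110.0) -> float:
    """Leakage increase factor when VT is lowered by vt_drop_mv.

    Sub-threshold current changes by 10X for every `slope_mv_per_decade`
    millivolts of gate (or threshold) shift, so the factor is 10^(dVT / S).
    """
    return 10.0 ** (vt_drop_mv / slope_mv_per_decade)


if __name__ == "__main__":
    # 50 mV at ~110 mV/decade: a bit under 3X, matching the slide's "3X leakage".
    print(f"{leakage_multiplier(50.0):.2f}x")
```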
Sub-threshold Leakage
10000
DVt
1000
Ids
(log)
Exponential
Increase in Ioff
Ioff (na/u)
MOS Transistor Characteristics
100
10
1
30
Vgs
50
70
90
110
130
Temp (C)
Sub-threshold leakage current
will increase exponentially
28
Assumptions:
0.25 µm, Ioff = 1 nA/µm
5X increase each
generation at 30ºC
Leakage Power
SD Leakage (Watts)
1000
100
30M Tr
15mm Die
10
1
0.1
0.25u 0.18u 0.13u 90nm 65nm 45nm
Technology
Excessive sub-threshold leakage power
29
Leakage Power increases
0.18u
0.13u
0.07u
0.05u
Drain Leakage Power
100,000
0.1u
Ioff (na/u)
10,000
1,000
100
50%
8KW
40%
1.7KW
30%
20%
400W
88W
12W
10%
10
0%
30
40
50
60
70
80
90
100
2000
Temp (C)
2002
2004
2006
Year
Drain leakage will have to increase to meet freq demand
Results in excessive leakage power
30
2008
Wide Domino Functionality
CLK
CLK
Q2
Q1
A
B
B
C
C
Static Gate
D2 Domino Gate
CLK
D1 Domino Gate
• High performance ~30% over static
• High fan-in NOR, fewer logic gates
• High fan-in complex gates possible
• Smaller area
• Lower AC noise margin ~ Vt
• Ioff could limit NOR fan-in
• High activity, higher power, ~2X
• Irreversible logic evaluation
• Scalability is not good
31
Bitline Delay Scaling Problem
Normalized delay
1.2
1
Logic circuit delay
Bit line delay (15% swing scaling)
Bit line delay (const swing)
0.8
0.6
0.4
0.2
0
0.25
0.18
0.13
0.10
Technology generation (um)
• Bit line swing limited by parameter mismatch & differential noise
• Cell stability degrades with Vt lowering
• Bit line delay ∝ (Cap/W) × Vswing / (Ion/W − #rows × Ioff/W)
• Reducing # of rows per bitline approaching limit
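The bit line delay proportionality can be explored numerically. The sketch below evaluates the slide's expression up to a constant factor; the function name and all numeric values in the usage lines are hypothetical and chosen only to show the trend.

```python
def relative_bitline_delay(cap_per_w: float, v_swing: float,
                           ion_per_w: float, ioff_per_w: float, rows: int) -> float:
    """Delay proportional to (Cap/W) * Vswing / (Ion/W - rows * Ioff/W).

    The accessed cell's on-current must overcome the leakage of every
    unaccessed cell on the same bit line; if it cannot, the bit line
    never develops the required swing.
    """
    net_current = ion_per_w - rows * ioff_per_w
    if net_current <= 0:
        raise ValueError("leakage of unselected rows exceeds cell on-current")
    return cap_per_w * v_swing / net_current


if __name__ == "__main__":
    # Hypothetical units: raising Ioff or the row count both slow the bit line.
    base = relative_bitline_delay(1.0, 0.1, 100.0, 0.1, rows=256)
    leaky = relative_bitline_delay(1.0, 0.1, 100.0, 0.3, rows=256)
    print(f"delay penalty from 3X Ioff: {leaky / base:.2f}x")
```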
32
Restrict transistor leakage
Frequency (Mhz)
10000
7 GHz
5.5 GHz
4 GHz
2.5 Ghz
Pentium 4
1000
Pentium II
Pentium
100
486
10 386
1985
1990
1995
2000
2005
2010
Year
Reduce leakage → Frequency will not double every 2 years
33
Interconnect scaling trends
Interconnect Stack
10
3.5
M5
ILD4
9
3.0
M4
8
ILD3
2.5
7
M3
6
ILD2
2.0
M2
5
ILD1
4
M1
3
ILD0
2
Poly


M3
M2
M1
0.5
0

M4
1.0
Gate oxide

M5
1.5
Field oxide
1
4.0
Minimum Widths (Relative)
Poly
0.0


Minimum Spacing (Relative)
3.5
3.5




Minimum Pitch (Relative)
3.0
3.0
2.5
2.5
2.0
M5
2.0
M4
M2
1.0
M4
1.5
M3
1.5
M5
M3
M2
1.0
M1
M1
0.5
Poly
0.5
0.0
Poly
0.0






34




Interconnect performance
Total Capacitance (Relative)
Relative Resistance
6
1.6
5
1.4
1.2
4
Poly
1.0
Poly
M1
0.8
3
M2
M2
M3
0.6
2
M4
M4
M5
1
0.2
0.0
0

7


Relative RC delay


6
5
Poly
4
M1
M2
3
M3




% increase each tech generation
Layer   R      C      RC
Poly    45%    -2%    42%
M1      53%    5%     61%
M2      46%    12%    62%
M3      39%    8%     51%
M4      18%    24%    46%
R increases faster at lower levels
C increases faster at higher levels
RC increases ~40-60%
35
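The RC column is roughly the compounded growth of R and C, which the short check below illustrates. The percentages are the ones listed on the slide; the function name and the 2-point tolerance are assumptions made for this sketch.

```python
# Per-generation growth of R, C and RC for each layer, as listed on the slide.
GROWTH = {
    "Poly": (0.45, -0.02, 0.42),
    "M1": (0.53, 0.05, 0.61),
    "M2": (0.46, 0.12, 0.62),
    "M3": (0.39, 0.08, 0.51),
    "M4": (0.18, 0.24, 0.46),
}


def combined_rc_growth(r_growth: float, c_growth: float) -> float:
    """Growth of the RC product when R and C grow independently."""
    return (1.0 + r_growth) * (1.0 + c_growth) - 1.0


if __name__ == "__main__":
    for layer, (r, c, rc_listed) in GROWTH.items():
        rc = combined_rc_growth(r, c)
        print(f"{layer:>4}: computed {rc:.1%}, listed {rc_listed:.0%}")
```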
Interconnect distribution
No of nets
(Log Scale)
Pentium Pro (R)
Pentium(R) II
Pentium (MMX)
Pentium (R)
Pentium (R) II
10
100
1,000
10,000
100,000
Length (u)
Interconnect distribution does not change significantly
36
Wire Scaling
• Tall wires to reduce R
–thickness to width ratios of 2 to 1
–large cross cap
• Uarch for short wires
• Repeaters
37
Optimum Repeater
P size
N size
2
2
Repeater distance
1
• Vary
– N size, P size
– Repeater distance
– Metal width, space
Pitch
• Best speed at
– space ~2X width
• Include metal thickness
and optimize for PD3
– thickness ~2X width
38
P, V, T Variations
Device Ion
Process
• Die-to-die variation
• Within-die variation
• Static for each die
Very slow
Voltage
• Chip activity change
• Current delivery—RLC
• Dynamic: ns to 10-100us
• Within-die variation
Temperature
Years
Time dependent
degradation
• Activity & ambient change
• Dynamic: 100-1000us
• Within-die variation
39
Normalized Frequency
Frequency & SD Leakage
1.4
1.3
1.2
0.18 micron
~1000 samples
1.1
30%
1.0
20X
0.9
0
Low Freq
Low Isb
5
10
15
Normalized Leakage (Isb)
High Freq
Medium Isb
40
High Freq
High Isb
20
Vt Distribution
120
0.18 micron
~1000 samples
# of Chips
100
80
~30mV
60
40
20
0
-39.71
-25.27
-10.83
3.61
18.05
32.49
D VTn(mv)
High Freq
High Isb
High Freq
Medium Isb
41
Low Freq
Low Isb
Frequency Distribution
100
50
0
1.37
1.30
1.22
1.15
1.07
Freq (Normalized)
High Freq
High Isb
High Freq
Medium Isb
42
Low Freq
Low Isb
1.00
# of Chips
150
100
1
20.11
16.29
12.47
8.64
4.82
1.00
Isb (Normalized)
High Freq
High Isb
High Freq
Medium Isb
43
Low Freq
Low Isb
# of Chips
Isb Distribution
Supply voltage (V)
Supply Voltage Variation
Reliability & power → Vmax
Vmin → frequency
Time (sec)
• Activity changes
• Current delivery RI and L(di/dt) drops
• Dynamic: ns to 10-100us
• Within-die variation
44
Handling di/dt
Bulk Decoupling
High Frequency Decoupling
VRM Response
Local Decoupling
Silver Box
Response
1.E+00
1.E+01
1.E+02
On Die
Decoupling
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
Frequency
• Land-side / package capacitors
• High frequency or local VRMs
• Low leakage on-die capacitors
45
1.E+09
1.E+10
1.E+11
Vcc Variation Reduction
With Die Caps
Without Die Caps
On die decoupling capacitors reduce ΔVcc
• Cost, area, and gate oxide leakage concerns
On die voltage down converters & regulators
46
Temperature Variation
Cache
70ºC
Temp
(oC)
Core
120ºC
• Activity & ambient change
• Dynamic: 100-1000us
• Within-die variation
47
Major Paradigm Shift
• From deterministic design to probabilistic
and statistical design
– A path delay estimate is probabilistic (not
deterministic)
• Multi-variable design optimization for
–
–
–
–
Yield and bin splits
Parameter variations
Active and leakage power
Performance
48
Performance Efficiency of Arch
Pollack’s Rule
4
3
Area
(Lead / Compaction)
Growth (X) 2
Performance
(Lead / Compaction)
1
*Note: Performance measured using SpecINT and SpecFP
0
1.5
1
0.7
0.5
0.35
Technology Generation
0.18
Implications (in the same technology)
• New microarchitecture ~2-3X die area of the last uArch
• Provides 1.5-1.7X performance of the last uArch
We are on the wrong side of a Square Law
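Pollack's Rule is commonly stated as performance growing with roughly the square root of area (or complexity). That formulation is an assumption here, since the slide only gives the resulting numbers, but it reproduces them closely.

```python
import math


def pollack_performance(area_growth: float) -> float:
    """Relative performance for a given relative die area, assuming perf ~ sqrt(area)."""
    return math.sqrt(area_growth)


if __name__ == "__main__":
    # 2X to 3X more area buys only about 1.4X to 1.7X performance.
    for area in (2.0, 3.0):
        print(f"{area:.0f}X area -> {pollack_performance(area):.2f}X performance")
```

Because performance rises only as the square root while area (and with it power) rises linearly, each new microarchitecture in the same process is less area-efficient than the last, which is the "wrong side of a Square Law" remark.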
49
Frequency & Performance
[Figure: Freq (MHz), log scale, split into Freq (Process) and Freq (uArch), for i486, Pentium® proc, Pentium® II & III proc and Pentium® 4 proc]
Frequency increased 61X
1. 18.3X: process technology
2. Additional 3.3X: uArch
[Figure: Relative Performance, log scale, split into Perf due to Freq and additional due to uArch, for the same processors]
Performance increased 100X
1. 14X: process technology
2. Additional 7X: uArch, design
50
Growth (X) from previous uArch
Design Efficiency—Arch
4
Die Area
Performance
Power
3
In the same process technology,
compare:
Scalar → Super-scalar → Dynamic → Netburst
2
2-3X Growth in area
~1.4X Growth in Integer Performance
1
~1.7X Growth in Total Performance
0
S-Scalar
Dynamic
Netburst
2-2.5X Growth in Power
Pollack’s Rule in action—Power inefficiency
51
% Power for 1% Performance
Power Efficiency - Circuits
5
15% Domino
5% Other fast
circuits
4
3
2
15% Domino
Static 5% Domino
1
0
Scalar
Sup-Scalar
Dynamic
Netburst
Assumptions:
Activity: Static = 0.2,
Domino = 0.5
Clock consumes 40%
of full chip power
High Power circuits contribute to power inefficiency
52
Power density will increase
Power Density (W/cm2)
10000
1000
100
Rocket
Nozzle
Nuclear
Reactor
8086
P4
10 4004
Hot Plate
P III
8008 8085
Pentium
386
286
486
8080
1
1970
1980
1990
2000
Year
2010
Power density too high to keep junctions at low temp
53
Thermal Solutions
Ta
Attachment
Ts
Heat Sink
Interface
Tc
Package
Mounting
Tj
54
θsa: Sink-to-Ambient (Heat-Sink) Resistance
θcs: Case-to-Sink (Interface) Resistance
θjc: Junction-to-Case (Package) Resistance
Thermal Capability—Today
[Figure: stacked bar of thermal resistance (°C/W): Package (0.35°C/W), TIM (0.12°C/W), Heat Sink (0.38°C/W); θJA 0.82°C/W]
Package
- Polymer thermal interface
- 1.5mm Cu heat spreader
- 0.35°C/W (typical)
Thermal Interface Material
- Thermal Grease
- Phase Change Material
- 0.12°C/W
Heat Sink
- Al Folded Fin + Cu base
- 3.5” x 2.5” x 2” at 400g
- 0.38°C/W
- ~$5 (for RM & fan)
θJA = (TJ – TA) / Power
TJ = 90°C, TA = 45°C, θJA = 0.82°C/W
P = (90 – 45) / 0.82 ≈ 55W
55
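The power budget on this slide is a one-line rearrangement of the junction-to-ambient resistance definition. A minimal sketch using the slide's numbers; the function name is hypothetical.

```python
def max_power_w(tj_c: float, ta_c: float, theta_ja_c_per_w: float) -> float:
    """Largest power that keeps the junction at tj_c for ambient ta_c.

    Rearranged from theta_JA = (TJ - TA) / Power.
    """
    return (tj_c - ta_c) / theta_ja_c_per_w


if __name__ == "__main__":
    # Slide values: TJ = 90 C, TA = 45 C, theta_JA = 0.82 C/W.
    print(f"{max_power_w(90.0, 45.0, 0.82):.1f} W")
```

Lowering any one of the three resistances only helps in proportion to its share of the total, which is why the following slide argues for improvement on all fronts.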
Thermal Capability—Future
Thermal Resistance (oC/W)
1.0
0.8
0.6
20%
20%
Heat Sink
20%
0.4
TIM2
0.2
Package
0.0
Today
Future
Must improve on all fronts—no silver bullet
56
Shrinking Size & Quieter
BA Scale
6.5
System Volume ( cubic inch)
3000
5.5
Typical Office
2500
4.5
Conference Room
2000
3.5
1500
Typical Home
VHS Recorder
Recording Studio
1000
2.5
Whisper
CD Player
500
1.5
0
Threshold Hearing
PC tower Mini tower -tower
Slim line
Small pc
0
Sound Power Levels
Small & quiet, yet high performance
57
Thermal Budget
$2000 PC cost (‘97)
Desktop PC ASP
2200
Performance PC
CPU
Fax/Modem
Margin
1800
Monitor
Memory
CD ROM
HDD
US $
1400
Value PC
1000
$1
$7
Power Supply
600
$7$15
EMI
Thermals
MB PWB
Pwr Delivery
200
1995
1996
1997
1998
1999
2000
Source: Dataquest Personal Computers
Shrinking ASP, and shrinking budget for thermals
58
$28
Thermal
• Throttling / clock gating
• Circuits and sizing
–10% performance gain at same power can be
translated into 25% power reduction by
changing VCC
• Improved die attach / package
• Can effect new uArch / floor planning
– Spread and reduce power
59
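The "10% performance gain becomes 25% power reduction" trade can be checked under a simple model. The sketch assumes frequency scales linearly with VCC and dynamic power scales as VCC² × frequency; both are assumptions of this illustration, not statements from the slide.

```python
def power_saving_from_perf_gain(perf_gain: float) -> float:
    """Fractional power reduction when a performance gain is traded for lower VCC.

    Assumes frequency ~ VCC and power ~ VCC^2 * frequency, so power ~ VCC^3.
    Giving back the gain means scaling VCC (and frequency) by 1 / (1 + gain).
    """
    vcc_scale = 1.0 / (1.0 + perf_gain)
    return 1.0 - vcc_scale ** 3


if __name__ == "__main__":
    print(f"{power_saving_from_perf_gain(0.10):.1%} power reduction")
```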
Thermal Envelope & Cost
Unit Cost ($)
1000
100
Mobile
High Perf
Itanium™ proc
10
Pentium® 4
proc
Celeron
1
1
10
100
Power (W)
60
1000
The “Odds”
1.5
100
1.0
Pentium ® 4
50
0.5
25
Air Flow Rate (CFM)
75
Heat-Sink Volume (in3)
Thermal Budget (oC/W)
Pentium ® III
0
0
0
50
100
150
200
250
Power (W)
Power → Thermals → Higher Heat Sink Volume → Higher Air-flow
Is this cheaper, smaller, and quieter…?
61
What’s next…
• Circuit techniques for variation tolerance
• Circuit techniques for leakage control
• Full-chip power reduction techniques
• 30 min quiz
62
Section 2a
Circuit techniques for
variation tolerance
63
Moore’s law on scaling
64
Scaling of dimensions
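[Figure: MOSFET cross-section (gate, source, drain, body; oxide thickness Tox, channel length L) before and after a 0.7X shrink]
Before scaling: dimensions 1, Tox, L; area 1; Delay = 1; Freq = 1
After scaling: dimensions 0.7, 0.7 Tox, 0.7 L; area 0.7 × 0.7 = 0.49; Delay = 0.7; Freq = 1/0.7 = 1.43
65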
Channel length (um)
0.7X in
3.6 years
1
0.7X in
2 years
0.1
July-02
July-92
July-82
1e+8
July-72
0.01
2.1X in
2 years
Introduction date
1e+7
1e+6
Requires die size growth
or same die size
3.3X in
3.6 years
1e+5
1e+4
July-02
July-92
July-82
1e+3
July-72
Channel length (um)
Number
of logic
logictransistors
transistors
Number of
10
Introduction date
66
Core frequency (Hz)
10e+9
From early 90s to Present:
2X in
2 years
1e+9
100e+6
1.7X in
3.6 years
10e+6
July-02
July-92
July-82
July-72
10
Introduction date
0.7X in
2 years
1
$\text{Delay} \propto \dfrac{1}{I_{ON}} \propto \dfrac{1}{(V_{DD} - V_T)^{\alpha}}$
VDD scaling requires VT scaling
July-02
July-92
July-82
0.1
July-72
Supply voltage (V)
Supply voltage (V)
1e+6

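$\text{Power} \propto \text{Area} \times \dfrac{\varepsilon}{t} \times V_{DD}^2 \times F$
Constant VDD: $\sim 1.0 \times \dfrac{1}{0.7} \times 1^2 \times 2 = 2.9$
Scaled VDD: $\sim 1.0 \times \dfrac{1}{0.7} \times 0.7^2 \times 2 = 1.4$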
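The two power-growth numbers on this slide come from the same product with different supply scaling. A minimal sketch, assuming die area stays constant, oxide thickness scales by 0.7 and frequency doubles per generation; the function and parameter names are hypothetical.

```python
def power_growth(area_scale: float, t_scale: float,
                 vdd_scale: float, freq_scale: float) -> float:
    """Relative power per generation: Area x (1/t) x VDD^2 x F.

    Capacitance per unit area goes as 1/t, so a thinner oxide raises power.
    """
    return area_scale * (1.0 / t_scale) * vdd_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Same die area, 0.7X oxide thickness, 2X frequency.
    print(f"constant VDD: {power_growth(1.0, 0.7, 1.0, 2.0):.1f}x")
    print(f"VDD scaled 0.7X: {power_growth(1.0, 0.7, 0.7, 2.0):.1f}x")
```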
67
Drain current
(Linear scale)
1
0.8
0.6
0.4
0.2
0
$I_{OFF} \propto 10^{-V_T/S}$
S ~ 85 mV / decade
(Log scale)
0.5
1
1 0
0.1
Gate voltage
0.01
0.001
0.0001
0.00001
IOFF
0.000001
0
0.5
1
1.5
VDD  
VT
1.5
Gate voltage
68
IOFF 
Increasing
electron
energy
(NMOS)
n+
height
Barrier height
Barrier
Barrier Lowering (BL)
Source (n+)
L
p
Channel of length L
Xd
Short L
n+
Xd
Long L
L 
VT
69
IOFF 
Drain (n+)
Barrier height
Barrier height
Drain Induced BL (DIBL)
D
S
Increase Drain
voltage
D
S
Increase Drain
voltage
Long L
Short L
VDS 
VT
70
IOFF 
Impact of variation in L
VT (Volts)
BL (VDS0)
DIBL (VDS=VDD)
VTLIN (VDS0)
VTSAT (VDS=VDD)
Channel length (um)
ΔL → ΔVT → ΔION & ΔIOFF
71
[Figure: 180nm measurements (n: 110) of 3σ threshold voltage variation (mV) vs. L (µm) at 0.18, 0.36 and 0.72 µm, for Vds = 50 mV and Vds = 1.1 V; plotted values: 29.1, 21.0, 12.1, 11.8, 10.2, 10.1]
Necessary to make circuits less sensitive to
VT (ION & IOFF) variation
72
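Transistor scaling
[Figure: device aspect ratio (0 to 10) vs. technology generation (N-1, N, N+1, N+2); transistor cross-section labeled L, D, Tox, Dj]
Transistor aspect ratio $= \dfrac{L}{\left[T_{ox}\,(\varepsilon_{si}/\varepsilon_{ox})\right]^{1/3} D^{1/3} D_j^{1/3}}$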
Short channel effects increase with scaling
73
Transistor scaling challenges - Dj
I
0.6
0.4
0.3
PMOS
0.5
0.2
130
L MET
0.15
L MET ( m)
NMOS
I DP (mA/  m)
0.7
0.2
R EXT
0.1
0.05
S. Thompson et al., 1998.
0.1
0
50
100
150
Junction Depth (nm)
0
200
Salicide
Poly-Si
110
100
S. Thompson et al., 1998.
0.4
120
R EXT ( W  m)
0.5
DN (mA/  m)
0.8
0
50
100
150
Junction Depth (nm)
S. Asai et al., 1997.
Salicide
RC
RSE
Rsalicide
74
90
200
eIGATE
IOFF
N+ Gate
P- Substrate
2E+20
Electrons (#/cm 3 )
Classical
2) 2)
Gate ICurrent
(A/cm
GATE (A/cm
Transistor scaling challenges - Tox
1E+03
3
10
1E+02
1E+01
1E+00
100
1E-01
1E-02
1E-03
1E-04
1E-05
1E-06
-7
1E-07
10
2.5nm
3.5nm
5.1nm
7.6nm
3.0nm
C. Hu, 1996.
0
0
22
44 66 88 10
10 1212
GateVoltage
Voltage (V)(V)
Gate
S.A. Hareland, et al., 1995.
1E+20
Q.M. (Three Subband)
1E+00
0.0
1.0
2.0
3.0
Depth (nm)
4.0
Electrical Tox = Physical Tox + 1nm
50% poly depletion, 50% quantum effect
75
High-K Gate Dielectric
1E+1
1E+0
TiO2
HfO2
2
JOX @1V (A/cm )
1E-1
1E-2
1E-3
ZrO2 Ta2O5
SiO2
1E-4
1E-5
Al2O3
1E-6
1E-7
1E-8
0.50
Fig 17. Comparison of gate leakage between SiO2
and high-K dielectrics.
1.00
1.50
2.00
C ox ( F/cm )
• Lower gate leakage
• Higher Cox at a given gate leakage
76
2.50
2
3.00
3.50
Parameter variation
10%
8%
6%
4%
2%
0%
Normalized frequency
DVT due to 10% DL
Device and chip level parameters
6%
4%
N
N+1
Technology generation
2
1.5
1
0.5
0
7%
5%
N
N+1
Technology generation
Parameter variations increase with scaling
Adaptive VDD, VT to reduce chip level variation
77
Scaling challenges summary
L, VDD, VT scaling
– Increasing parameter variation
– Increasing sub-threshold leakage power
– Increasing gate leakage power
Product life cycle reduced from 3.6 years to 2 years
– Concurrent engineering
– Better prediction models
78
VT variation categories
Neighborhood threshold
voltage mismatch
Within-die threshold
voltage variation
Die-to-die threshold
voltage variation
79
Adaptive Body Bias (ABB)
Adaptive Body Bias
Die count
Die count
Conventional
D Vt1
Before
adaptive
body bias
Vt-low
Vt-target
D Vt2
After adaptive
body bias
Vt-nom Vt-target
Die's Mean Vt (V)
Die's Mean Vt (V)
Adaptive body bias reduces die-to-die mean VT variation
80
Side effects of ABB
S (n+)
D (n+)
p
Xd
Xd
L
(1) Lower VT
(2) Apply reverse bias
Determine impact of adaptive body bias on
within-die VT variation.
81
Short Channel MOS VT
$V_t = V_{fb} + 2\phi_p + \dfrac{\lambda_b}{C_{ox}}\sqrt{2 q N \varepsilon_s \left(|2\phi_p| + V_{sb}\right)} - \lambda_d V_{ds}$
BL: λb, DIBL: λd
Barrier lowering
H. C. Poon et al., IEDM, pp. 156-159, 1973
$\lambda_b = 1 - \left(\sqrt{1 + \dfrac{2W}{X_j}} - 1\right)\dfrac{X_j}{L}$
DIBL equation (Empirical)
K. K. Ng et al., IEEE TED, pp. 1895-1897, Oct. 1993
$\lambda_d = \left[\dfrac{L}{2.2\,\mu m^{-2}\,(T_{ox} + 0.012\,\mu m)(W_{sd} + 0.15\,\mu m)(X_j + 2.9\,\mu m)}\right]^{-2.7}$
82
Within-die VT Variation
Within-die VT variation is primarily due to CD variation
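$\dfrac{\Delta V_t}{\Delta L} \approx \dfrac{\partial V_t}{\partial \lambda_d}\dfrac{d\lambda_d}{dL} + \dfrac{\partial V_t}{\partial \lambda_b}\dfrac{d\lambda_b}{dL}$ (BL: λb, DIBL: λd)
$\Rightarrow \Delta V_t \approx \left[2.7\,V_{dd}\,\lambda_d + \dfrac{1}{C_{ox}}\sqrt{2 q N \varepsilon_s \left(|2\phi_p| + V_{sb}\right)}\,(1 - \lambda_b)\right]\dfrac{\Delta L}{L}$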
BL and DIBL increase due to ABB will result in
larger within-die VT variation
83
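The within-die sensitivity can be sketched numerically from the barrier-lowering and DIBL factors on the preceding slides. All device values in the usage lines are hypothetical, lengths are taken in micrometres, and `body_term_v` stands in for the (1/Cox)·sqrt(2qNεs(|2φp| + Vsb)) term, passed directly in volts to keep the sketch short.

```python
import math


def lambda_b(l_um: float, w_um: float, xj_um: float) -> float:
    """Barrier-lowering factor: 1 - (sqrt(1 + 2W/Xj) - 1) * Xj / L."""
    return 1.0 - (math.sqrt(1.0 + 2.0 * w_um / xj_um) - 1.0) * xj_um / l_um


def lambda_d(l_um: float, tox_um: float, wsd_um: float, xj_um: float) -> float:
    """Empirical DIBL factor; all lengths in micrometres."""
    denom = 2.2 * (tox_um + 0.012) * (wsd_um + 0.15) * (xj_um + 2.9)
    return (l_um / denom) ** -2.7


def delta_vt(dl_over_l: float, vdd: float, lam_d: float,
             lam_b: float, body_term_v: float) -> float:
    """Vt shift for a fractional channel-length change dL/L."""
    return (2.7 * vdd * lam_d + body_term_v * (1.0 - lam_b)) * dl_over_l


if __name__ == "__main__":
    # Hypothetical device: L = 0.15 um, Tox = 3 nm, Xj = 50 nm.
    for w in (0.05, 0.08):  # a wider depletion region, e.g. under reverse body bias
        lb = lambda_b(0.15, w, 0.05)
        ld = lambda_d(0.15, 0.003, w, 0.05)
        dvt = delta_vt(0.10, 1.1, ld, lb, body_term_v=0.45)
        print(f"W = {w:.2f} um: dVt = {dvt * 1e3:.1f} mV for 10% dL")
```

In this toy setup a wider depletion region lowers λb and raises λd, so the same ΔL produces a larger ΔVt, which is the side effect the slide attributes to adaptive body bias.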
Solutions
DVt1
Vt-target
Die's Vt (V)
DVt2
Before
adaptive
body bias
Vt-low
After adaptive
body bias
Vt-nom Vt-target
Die's Vt (V)
Die count
Die count
Die count
• Bi-directional adaptive body bias
After
BABB
DVt1
Before
BABB
Vt-target
Die's Vt (V)
• Several separate bias generators on-chip
84
Testchip die micrograph
150nm CMOS
5.3 mm
21 subsites per die
Microprocessor critical path
Frequency=Min(F1..F21)
Power=Sum(P1..P21)
Separate VBS for each subsite
4.5 mm
62 dies per wafer
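The die-level figures of merit combine the 21 sub-site measurements as stated above: the slowest sub-site sets the die frequency and the sub-site powers add. A minimal sketch with hypothetical function names and made-up sample values:

```python
from typing import Sequence


def die_frequency(subsite_freqs: Sequence[float]) -> float:
    """Die frequency is limited by the slowest sub-site: Min(F1..Fn)."""
    return min(subsite_freqs)


def die_power(subsite_powers: Sequence[float]) -> float:
    """Die power is the total over all sub-sites: Sum(P1..Pn)."""
    return sum(subsite_powers)


if __name__ == "__main__":
    freqs = [1.02, 0.97, 1.05]    # normalized, hypothetical
    powers = [0.30, 0.35, 0.33]   # normalized, hypothetical
    print(die_frequency(freqs), round(die_power(powers), 2))
```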
85
Sub-site micrograph
Critical path

Phase
detector
R
VREF
2R
Phase Detector &
Counter
Resistor
Network
Bias Amplifier
2R 2R
R
R
R
2R 2R 2R
5-bit
counter
Delay
CUT
R
Rf
+
-
VCCA
Bias selector
VBP,ext
VBN,ext
VBP
VCC
Circuit block
(CUT)
21 sub-sites with separate body bias for each sub-site
86
PD
VSS
CUT schematics
VDD
ROenable
VBSP (VDD+0.5 to VDD–0.5 V)
Microprocessor
critical path
16
ROout
GND (0 V)
VBSN (+0.5 to –0.5 V)
87
Simple Adaptive Body Bias (S-ABB)
Neglects WID variation
Circuit
Block
Apply Ftarget
Reduce Ftarget
Apply NMOS
bias
PD
PMOS bias
adapts
Bias
Gen.
Measure Pleak
of circuit block
PD
Pick best NMOS/PMOS
bias (minimize Pleak)
Phase detector
=
and critical path
Circuit block
Pleak < Pleak,max?
Area overhead: ~2%
YES
88
NO
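The S-ABB flow on this slide (apply a target frequency, sweep the biases, pick the lowest-leakage setting, and fall back to a lower target if the leakage limit is missed) can be sketched as a search loop. This is a simplification: on the test chip the PMOS bias adapts through the phase detector and bias generator, whereas the sketch simply tries every candidate. The callbacks `meets_frequency` and `leakage` are hypothetical stand-ins for measurements.

```python
from typing import Callable, Optional, Sequence, Tuple

Result = Tuple[float, float, float, float]  # (frequency, NMOS bias, PMOS bias, leakage)


def simple_abb(f_targets: Sequence[float],
               nmos_biases: Sequence[float],
               pmos_biases: Sequence[float],
               meets_frequency: Callable[[float, float, float], bool],
               leakage: Callable[[float, float], float],
               pleak_max: float) -> Optional[Result]:
    """Return the highest target frequency whose best bias pair meets the leakage limit."""
    for f in sorted(f_targets, reverse=True):      # "Reduce Ftarget" on failure
        best = None
        for vbn in nmos_biases:                    # "Apply NMOS bias"
            for vbp in pmos_biases:                # candidate PMOS bias
                if not meets_frequency(f, vbn, vbp):
                    continue
                p = leakage(vbn, vbp)              # "Measure Pleak"
                if best is None or p < best[0]:    # keep the minimum-leakage pair
                    best = (p, vbn, vbp)
        if best is not None and best[0] < pleak_max:   # "Pleak < Pleak,max?"
            return f, best[1], best[2], best[0]
    return None                                    # no target met the limit
```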
Effectiveness of S-ABB
Normalized leakage
Die count
100%
80%
60%
40%
20%
0%
6
Accepted
dies:
NBB
110C
1.1V
S-ABB
5
S-ABB
4
3
Frequency variation σ/µ: NBB 4.1%, S-ABB 1.0%
2
1
NBB
0
0.925
1
1.075
1.15
Normalized frequency
89
1.225
Adaptive Body Bias (ABB)
Accounts for WID variation
Circuit
Block
PD
Apply Ftarget
Reduce Ftarget
Apply NMOS bias
PD
PD
Apply PMOS bias
PD
Bias
Gen.
PD
PD
PD
PD
Measure F and
Pleak of die
Pick best PMOS bias
Counter
...
PD0
Pick best NMOS/
PMOS bias
PDn
Die Pleak < Pleak,max?
Area overhead: ~2-3%
YES
90
NO
Effectiveness of ABB
Normalized leakage
Die count
100%
80%
60%
40%
20%
0%
6
Accepted
dies:
NBB
110C
1.1V
ABB
5
Frequency variation σ/µ: NBB 4.1%, ABB 0.69%
ABB
4
3
2
1
NBB
0
0.925
1
1.075
1.15
Normalized frequency
91
1.225
PMOS Body Bias (V)
NMOS Body Bias (V)
Adaptive Bias Distribution
0.4
0.3
0.2
0.1
0
-0.1
-0.2
-0.3
-0.4
0.4
0.3
0.2
0.1
0
-0.1
-0.2
-0.3
-0.4
1 die
N: RBB
P: FBB
N: FBB
P: RBB
10 dies
13 dies
N: FBB
P: FBB
38 dies
92
Frequency vs. Critical Path Count (NCP)
Number of dies
60%
NCP=14
40%
NCP=20
NCP=1
20%
0%
0.9
1.1
1.3
1.5
Normalized die frequency
• Frequency µ and σ reduce as NCP increases
• Frequency distribution unchanged for NCP > 14
93
Number of samples (%)
WID Delay Variation vs. Logic Depth
[Figure: histograms of number of samples (%) vs. variation (%) from -16% to 16%. Device ION: NMOS σ/µ = 5.6%, PMOS σ/µ = 3.0%. Delay: σ/µ = 4.2%]
                 [Miyazaki, ISSCC 2000]   This work
Path Depth       49                       16
Device σ/µ       2.4%                     4.27%
Frequency σ/µ    0.55%                    4.17%
94
Within-Die Adaptive Body Bias (WID-ABB)
Compensates for
WID variation
Apply Ftarget
Reduce Ftarget
Apply NMOS bias to all circuit blocks
Circuit
Block
Circuit Block 1:
Adapt PMOS
bias
Circuit Block n:
Adapt PMOS
bias
PD
Measure Pleak
of block
Measure Pleak
of block
PD
Pick best NMOS/
PMOS bias
Pick best NMOS/
PMOS bias
PD
PD
PD
PD
Bias
Gen.
PD
PD
Measure total die leakage Pleak
Area overhead:
Similar to ABB
Pleak < Pleak,max?
YES
95
NO
Effectiveness of WID-ABB
Normalized leakage
Die count
100%
80%
60%
40%
20%
0%
6
Accepted
dies:
5
110C
1.1V
ABB
WID-ABB
Frequency variation σ/µ: ABB 0.69%, WID-ABB 0.21%
ABB
ABB
4
3
2
1
WIDABB
0
0.925
1
1.075
1.15
Normalized frequency
96
97% in
highest bin
1.225
Within-Die Bias Distributions
Block Count
CircuitCount
14%
12%
P FBB
N RBB
10%
8%
6%
4%
2%
-0.5
-0.5
0
NMOS Body
Bias (V)
97
0.5
FBB
FBB
0
P RBB
N FBB
P,N RBB
0%
RBB
0.5
P,N FBB
PMOS Body
Bias (V)
RBB
Bias Resolution
Bias resolution   ABB: dies with F > 1   ABB: σ/µ   WID-ABB: dies with F > 1.075   WID-ABB: σ/µ
500mV             79%                    2.87%      2%                             1.89%
300mV             100%                   1.47%      66%                            0.50%
100mV             100%                   0.69%      97%                            0.21%
• 300mV bias resolution sufficient for ABB
• WID-ABB requires 100mV bias resolution
98
ABB summary
• D2D and WID variations impact microprocessor
frequency and leakage
• ABB improves die acceptance rate from 50% to
100%
• ABB is most effective when WID variations are
considered
• Compensating for WID variations by WID-ABB
increases number of high frequency dies from
32% to 97%
99
Frequency (GHz)
Adaptive VDD & VT
1.5
Fast die
1
0.5
Slow die
For iso-frequency
Decrease VDD
Fast die
Increase VT
Increase VDD
Decrease VT
0
Slow die
0 0.5 1 1.5 2
VDD (V)
$P_{leak} \propto 10^{-V_T/S}$
$P_{sw} \propto V_{DD}^2$
100
Testchip goals
• Body bias (VBS) for VT modulation
• Measure frequency improvement with
Adaptive VDD
Adaptive VBS
Adaptive VDD+VBS
Adaptive VDD+ Within-die VBS
• Subject to total active and standby power
constraints
101
Baseline measurements
[Figure: die count, total power (normalized), standby leakage power (normalized) and switched capacitance (normalized) vs. frequency (normalized, 0.85 to 1.1); annotations: 1.05V, 110°C, α: 0.03, 10 W/cm² limit; 40°C, 0.5 W/cm² limit]
102
Adaptive VDD vs. Fixed VDD
Die count
Active power limit: 10W/cm2 & Standby power limit: 0.5W/cm2
100%
74%
80%
52%
60% 37%
40%
15%
20%
6% 10% 0% 0%
0%
0.9
0.95
1
1.05
Frequency Bin
Fixed VDD: 1.05V – Frequency reduced to meet power limit
Adaptive VDD: 20mV resolution – VDD & frequency changed
simultaneously
103
Die count
VDD resolution requirement
100%
80%
60%
40%
20%
0%
0.9
0.95
1
Frequency Bin
1.05
Fixed VDD: 1.05V
Adaptive VDD: 50mV resolution
Adaptive VDD: 20mV resolution
Minimum of 20mV resolution in VDD is required
104
0%
0%
16%
10%
79%
74%
3%
100%
80%
60%
40%
20%
0%
15%
Die count
Adaptive VDD vs. Adaptive VBS
0.9
0.95
1
1.05
Frequency Bin
Adaptive VDD: 20mV resolution
Adaptive VBS: 100mV resolution
6% Fixed VDD
10% Adaptive VDD
16% Adaptive VBS
Target frequency bin
105
0.9
0.95
1
Frequency Bin
0%
0%
26%
16%
71%
79%
2%
100%
80%
60%
40%
20%
0%
3%
Die count
Adaptive VDD + VBS
1.05
Adaptive VBS
Adaptive VDD+VBS
Adaptive VDD +VBS more effective than
adaptive VDD or adaptive VBS
106
50%
Nominal VDD: 1.05V
Adaptive VDD
Adaptive VDD+VBS
40%
30%
20%
4%
2%
1.07
0%
1.05
-2%
1.03
-4%
1.01
0%
-7%
0.99
10%
-9%
Accepted die count
VDD distribution
VDD (V)
VDD (normalized)
Adaptive VDD +VBS results in lower VDD
than in adaptive VDD
107
Adaptive VBS
0.4 P FBB
0.2 N RBB
2% …
25%
Adaptive VDD+VBS
P FBB
N FBB
P FBB
N RBB
P FBB
N FBB
P RBB
N FBB
P RBB
N RBB
P RBB
N FBB
NMOS body bias (V)
0.4
0.2
0
-0.2
-0.4
0.4
0
-0.2
-0.2 P RBB
-0.4 N RBB
0.2
0
-0.4
PMOS body bias (V)
VBS distribution
NMOS body bias (V)
Adaptive VDD +VBS results in more dies with FBB
than in adaptive
VBS
108
0.9
0.95
1
Frequency Bin
6%
0%
74%
26%
71%
18%
0%
100%
80%
60%
40%
20%
0%
2%
Die count
Adaptive VDD + Within-die VBS
1.05
Adaptive VDD+VBS
Adaptive VDD+ Within-die VBS
Adaptive VDD + Within-die VBS is most effective
109
AVDD + ABB Summary
150nm CMOS with 10W/cm2 active & 0.5W/cm2
standby power density limits result in:
Number of dies in F=1 and F=1.05 frequency bins:
Fixed VDD                        6%
Adaptive VDD                     10%
Adaptive VBS                     16%
Adaptive VDD + VBS               26%
Adaptive VDD + Within-die VBS    80%
20mV resolution in VDD is required
100mV resolution for VBS is required
110
Neighborhood VT variation
The devices of interest that are in close proximity
can be either of the same or different polarity.
Impacts sense amps,
diff amps, current mirrors etc.
Impacts clock
generation circuits,
switching thresholds etc.
Voltage biasing
Current biasing
111
Voltage biasing
Linear threshold voltage mismatch of matched device pair for 500 mV
forward body bias, zero body bias and 500 mV reverse body bias.
[Figure: ΔVt (Volts, -0.01 to 0.01) vs. Vt-lin (Volts, 0.0 to 0.5); n: 50, W/L: 40/0.18]
Body bias             500 mV forward   Zero   500 mV reverse
Average Vt-lin (V)    0.23             0.35   0.42
Vt mismatch (mV)      3.8              5.8    6.1
$\dfrac{I_{mismatch}}{I_{avg}} \propto \dfrac{V_{tsat1} - V_{tsat2}}{V_{gs} - V_{tsat}^{avg}}$
112
Application to sense-amp
Traditional sense-amplifier
2
Strobe
1.5
1
When Strobe is ½ Vcc
Vsb = -1.0 V
0.5
0
-0.5
Reset
Reset
Voltage (V)
Strobe
VsbTraditional
-1
-1.5
0
3s Vt mismatch
Reset
Reset
Vx
100
150 200 250
Time (pS)
New sense-amplifier
Strobe
50
80
60
Traditional
32 mV
40
20
0
Vds:1.5 V
2m/0.18m
New
-1000
-500
0
Body bias (mV)
113
Simulation results
Ramp+SA delay (pS)
1.5 V, 1 mV/pS ramp rate, and 110 C
400
300
Traditional
New
200
100
0
13%
32 mV
0 50 100 150 200
Input differential (mV)
114
Current biasing
Basic iso-current biasing
p
1
pIb
PMOS
Network
Ib
Output
Inputs
NMOS
Network
nIb
n
1
115
Charging current
= discharging
current when n=p
Application
Non-overlapping 2 clock generation
1
k
in
in
O1
kIb
Ib
Delay difference
between two phases
at 1/2 Vcc (S)
t1
n1
O2
n2
O2'
2 x 2 crossbar
kIb
O1'
1
Equalize
delay at one process,
voltage, and
temperature corner
t2
k
t1
t2
O2
O2'
20e-12
Standard scheme
O1 rising and falling
15e-12
10e-12
Iso-bias current
O2 rising and falling
5e-12
000e+00
3X smaller overlap duration @ 2X power and area cost
116
Current biasing
Process insensitive current biasing
1
p
Iref
pIref
PMOS
Network
Output
Inputs
NMOS
Network
nIref
1
n
1
117
Charging current
= discharging
current when n=p
Iref existing techniques
Reference voltage to reference current conversion
Bandgap circuit with off-chip resistor
MOS ‘reference’ voltage with off-chip resistor
Direct reference current generation
MOS based – temperature compensation only
118
Objective
Generate process compensated current
 with thin tox digital CMOS devices
 without external resistors
119
Device measurement
[Figure: uncompensated current Iu (Amperes, 0 to 2e-3) vs. sample number for a W/L device biased at 0.9 V; 0.18 µm CMOS technology, 30°C]
n: 77, µ: 1.6 mA, σ: 235.8 µA, σ/µ: 15%
120
Subtraction method
Let $y_1 = f_1(x)$, $y_2 = f_2(x)$
$y_D = y_1 - y_2$ & $y_1 \neq y_2$
Design $\dfrac{dy_1}{dx} = \dfrac{dy_2}{dx}$ (at x = Xmid)
then $\dfrac{dy_D}{dx} \approx 0$ & $y_D \neq 0$ (around Xmid)
y1 & y2 vary with x, but yD is ‘insensitive’ to x
121
Example
y1  n1x(m1 x )
y 2  n2 x(m 2  x )
Let y D  y1 - y 2
dy 1 dy 2
n1 m 2  2 xd
At x  xd set



dx dx
n2 m1  2 xd
Choose m1m2 and n2
This will provide non-zero yD ‘insensitive’
to x around xd for proper n1
122
Illustration
n2 = 1; m1 = 4.2; m2 = 2; xd = 15  n1 = 0.13
1000
y1 (±35%)
800
y2 (±47%)
600
400
xd
200
yD (±6%)
0
10
15
x
123
20
MOS devices in saturation
Using long-channel wide devices
$I_1 = \beta z_1 (V_{gs1} - V_t)^2$, $I_2 = \beta z_2 (V_{gs2} - V_t)^2$
$\beta = \mu C_{ox}$, $z_1 = \dfrac{W_1}{2L}$, $z_2 = \dfrac{W_2}{2L}$, $V_t = \dfrac{Q}{C_{ox}}$
124
Compensation by subtraction
$I_{ref} = I_1 - I_2$
$V_{gs2} = a V_t^{TYP}$ and $V_{gs1} = b V_t^{TYP}$
For compensation: $\dfrac{z_1}{z_2} = \dfrac{a^2 - 1}{b^2 - 1}$
b    a    z1/z2
5    2    1/8
4    2    1/5
3    2    3/8
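The tabulated z1/z2 values can be reproduced, and the cancellation checked, with a small model. The sketch assumes one reading of the preceding slides: gate voltages are held at fixed multiples of the typical Vt, only Cox varies from die to die, β = µCox and Vt = Q/Cox with Q fixed. Those modelling choices and all function names are assumptions of this illustration.

```python
from fractions import Fraction


def compensation_ratio(a: int, b: int) -> Fraction:
    """z1/z2 that equalizes dI1/dCox and dI2/dCox at the typical corner."""
    return Fraction(a * a - 1, b * b - 1)


def iref(cox: float, a: float, b: float, z1: float, z2: float) -> float:
    """Iref = I1 - I2 with everything normalized to the typical corner.

    Typical Vt, typical Cox and mobility are all 1, so Q = 1 and Vt = 1 / cox.
    """
    vt = 1.0 / cox
    i1 = cox * z1 * (b - vt) ** 2   # Vgs1 = b * Vt(typ)
    i2 = cox * z2 * (a - vt) ** 2   # Vgs2 = a * Vt(typ)
    return i1 - i2


if __name__ == "__main__":
    for b in (5, 4, 3):
        print(f"b={b}, a=2 -> z1/z2 = {compensation_ratio(2, b)}")
    typ = iref(1.0, 2, 5, z1=1.0, z2=8.0)
    for cox in (0.9, 1.1):
        print(f"Cox {cox:.1f}: Iref changes by {iref(cox, 2, 5, 1.0, 8.0) / typ - 1:+.1%}")
```

In this model a ±10% swing in Cox moves I1 alone by well over 10%, while the difference Iref moves by about 1%, which is the qualitative behaviour the subtraction technique is after.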
125
VT generation circuit
VDD
½ VDD
20/2
~1VT
15/2
~5VT
15/2
20/2
2/20
126
Subtraction circuit
Vsg1  5VT
VDD
z2
z1
Iref =I1-I2
Vsg2  2VT
I1
127
I2
z1/z2 = 1/8
Device measurement
[Figure: I1 and I2 (Amperes, 0 to 2e-3) vs. sample number; 0.18 µm digital CMOS technology, 30°C; n: 112]
128
Compensated current
[Figure: compensated current Iref = I1 – I2 (Amperes, 0 to 400e-6) vs. sample number; 0.18 µm digital CMOS technology, 30°C]
n: 112, µ: 305 µA, σ: 17.4 µA, σ/µ: 5.7%
129
Sub-1 V operation
b, a    Vddmin (V)   Temp (°C)   Iref variation   Vdd sensitivity
5, 2    0.9          30          5.0%             0.3% per 100 mV
3, 2    0.6          30          5.2%             0.4% per 100 mV
Low voltage operation enabled by
redesigning Vt generation circuit
130
Process corner simulation results
0.18 µm digital CMOS technology, 30°C, VDD = 0.9 V, z1/z2 = 1/6
[Figure: normalized current (0.0 to 1.4) vs. process corner (Slow -, Slow, Typical, Fast, Fast +). Iu varies from -16% to +22% (bar values 0.84, 0.89, 1, 1.14, 1.22); Iref stays within 5% (bar values between 0.95 and 1)]
7.6X smaller variation than uncompensated current
131
Summary on Iref
• Subtraction technique for compensation
• Compensation technique reduces reference current variation from 38% to 5% at Vdd of 0.9 V
• Variation remains at about 5% at Vdd of 0.6 V
132
Section 2a Summary
• Device parameter variation increases with scaling → design margins increase
• Adaptive schemes required to minimize impact of device variation on design margin of digital circuits
• Voltage and current biasing schemes to minimize impact of variation on analog circuits
133
Section 2b
Circuit techniques for
leakage control
134
Outline
• Leakage sources & impact of variations
• Leakage estimation with variations
• Static leakage reduction techniques
• Dynamic leakage reduction techniques
• Leakage-tolerant circuits
135
Sources of Leakage
136
Transistor leakage mechanisms
From Keshavarzi, Roy, & Hawkins (ITC 1997)
[Figure: NMOS cross-section (gate, n+ source, n+ drain, p-well) with leakage paths I1 to I8]
1. PN junction leakage
2. Weak inversion SD leakage
3. DIBL and contribution from SCE
4. GIDL
5. Punchthrough current
6. Narrow width effects
7. Gate oxide leakage
8. Hot carrier injection
137
Components of leakage
[Figure: ID (A, 1E-14 to 1E-02, log scale) vs. VG (V, -0.5 to 2) at VD = 0.1 V, 2.7 V and 4.0 V, marking the DIBL, GIDL, and weak inversion & junction leakage regions]
138
Subthreshold leakage trends
[Figure: Ioff (nA/µm, 1 to 10,000, log scale) vs. Temp (°C, 30 to 110)]
• Historic Vt scaling: ~15% per generation
• S-D and gate leakage impact: 3-5X increase
• Significant component of total power
• Serious dynamic circuit robustness penalty
139
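A rough sketch of why a modest Vt reduction multiplies leakage, assuming the usual exponential subthreshold dependence of Ioff on Vt. The 300 mV threshold and 85 mV/decade subthreshold swing are hypothetical illustration values, not figures from the slides, and gate leakage is not modeled.

```python
def ioff_increase(delta_vt_mv, swing_mv_per_decade):
    """Factor by which subthreshold Ioff grows when Vt drops by delta_vt_mv.

    Assumes Ioff is proportional to 10^(-Vt / S), where S is the
    subthreshold swing in mV per decade of current.
    """
    return 10 ** (delta_vt_mv / swing_mv_per_decade)


vt_mv = 300.0                 # hypothetical starting threshold voltage
delta_mv = 0.15 * vt_mv       # ~15% Vt scaling per generation (from the slide)
factor = ioff_increase(delta_mv, swing_mv_per_decade=85.0)
print(f"Ioff increase per generation: {factor:.1f}X")
```

With these assumed values the increase falls inside the 3-5X range quoted on the slide; a different Vt or swing would shift the result.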
Leakage vs. switching power
[Figure: active power and active leakage power (W, 0 to 250) and leakage as a share of total power (0% to 120%) vs. technology (250nm, 180nm, 130nm, 100nm, 70nm). Leakage > 50% of total power!]
Key requirements:
• Accurate prediction of chip leakage power
• Techniques to reduce chip leakage power
140
DIBL impact on leakage
VT (Volts)
BL (VDS0)
DIBL (VDS=VDD)
Higher IOFF due
to DIBL
Channel length (um)
141
Variation impact on leakage
[Figure: intrinsic IOFF (A, 1.0E-11 to 1.0E-05, log scale) vs. 1/IDlin (1500 to 3500) at 110°C, VD = 1 V, for NBB = 0 V and RBB = 1 V, with Lnom and Lwc marked and an arrow toward shorter L; labels: 150 nm technology, 0.18 µm CMOS]
Shorter L transistors contribute more to chip leakage
142
Transistor scaling challenges - Tox
[Figure: gate current IGATE (A/cm², 1E-07 to 1E+03, log scale) vs. gate voltage (V, 0 to 12) for oxide thicknesses of 2.5 nm, 3.0 nm, 3.5 nm, 5.1 nm and 7.6 nm. C. Hu, 1996.]
[Figure: electrons (#/cm³) vs. depth (nm, 0.0 to 4.0) for an N+ gate on a P- substrate, classical vs. Q.M. (three subband). S.A. Hareland, et al., 1995.]
Electrical Tox = Physical Tox + 1nm
50% poly depletion, 50% quantum effect
143
High-K Gate Dielectric
[Figure: JOX @ 1 V (A/cm², 1E-8 to 1E+1, log scale) vs. Cox (µF/cm², 0.50 to 3.50) for SiO2, Al2O3, ZrO2, Ta2O5, HfO2 and TiO2]
Fig 17. Comparison of gate leakage between SiO2 and high-K dielectrics.
• Lower gate leakage
• Higher Cox at a given gate leakage
144
Source/Drain Tunneling Leakage
[Figure: IJE (A/µm, 1E-12 to 1E-06, log scale) vs. doping concentration (10^18/cm³, 0 to 20), with points marked 10nm, 15nm, 20nm and 30nm and reference levels for Ig leakage @ 30nm and Ioff leakage @ 30nm]
Fig 15. Junction leakage vs. doping concentration. Circles - data, squares – extrapolated points. Other sources of leakage at Lg=30nm have been added to the graph
145
Leakage Estimation and
Modeling
146
Leakage estimation
Prior techniques
Lower bound: assumes all devices in the die are Lnom
$I_{leak-l} = \frac{w_p}{k_p} I_p^{o} + \frac{w_n}{k_n} I_n^{o}$
Upper bound: assumes all devices in the die are Lwc
$I_{leak-u} = \frac{w_p}{k_p} I_{off-p}^{3\sigma} + \frac{w_n}{k_n} I_{off-n}^{3\sigma}$
147
New model
Includes within-die variation
$I_{leak} = \frac{I^{o} w}{k} \frac{1}{\sigma\sqrt{2\pi}} \int_{l_{min}}^{l_{max}} e^{-\frac{(l-\mu)^2}{2\sigma^2}} \, e^{\frac{\mu - l}{\lambda}} \, dl$
After simplification using error function properties ($\mathrm{erf}(z) \approx 1$ for large $z$):
$I_{leak} \approx \frac{I^{o} w}{k} \, e^{\frac{\sigma^2}{2\lambda^2}}$
[Figure: IOFF vs. L (µm), with IOFF fitted as $a\,e^{-L/\lambda}$; $I^{o}$ and $I^{3\sigma}$ marked]
148
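The simplification can be sanity-checked numerically. The sketch below integrates a Gaussian channel-length distribution against the exponential length dependence and compares the result with the closed form exp(σ²/(2λ²)). The parameter values (µ = 0.18, σ = 0.01, λ = 0.02, all in µm) are hypothetical, and the integration limits are simply taken wide enough that the error-function terms are negligible.

```python
import math


def leakage_factor_numeric(mu, sigma, lam, n_sigma=8.0, steps=20000):
    """Trapezoidal estimate of the integral of
    N(l; mu, sigma) * exp((mu - l) / lam) over mu +/- n_sigma * sigma."""
    lo = mu - n_sigma * sigma
    hi = mu + n_sigma * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        l = lo + i * h
        pdf = math.exp(-((l - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * pdf * math.exp((mu - l) / lam)
    return total * h


def leakage_factor_closed(sigma, lam):
    """Closed-form multiplier on I0*w/k after simplification."""
    return math.exp(sigma ** 2 / (2 * lam ** 2))


numeric = leakage_factor_numeric(mu=0.18, sigma=0.01, lam=0.02)
closed = leakage_factor_closed(sigma=0.01, lam=0.02)
print(numeric, closed)
```

The two values agree closely for these inputs, which suggests the closed form is a good stand-in whenever the length limits sit several σ away from the mean.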
Applications…
A macroscopic standard deviation (σ) representing parameter variation in a chip:
$\sigma = \lambda \sqrt{2 \ln\left(\frac{k\, I_{leak}}{w\, I^{o}}\right)}$
Leakage estimation:
$I_{leak-w} = \frac{I_p^{o} w_p}{k_p} e^{\frac{\sigma_p^2}{2\lambda_p^2}} + \frac{I_n^{o} w_n}{k_n} e^{\frac{\sigma_n^2}{2\lambda_n^2}}$
Depends on parameters that can be estimated
149
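The two applications are inverses of each other, which a short sketch makes concrete. The function names and all numeric inputs below are hypothetical; only the formulas come from the slide.

```python
import math


def leak_with_variation(i0, w, k, sigma, lam):
    """One device type's contribution: (I0 * w / k) * exp(sigma^2 / (2 lam^2))."""
    return (i0 * w / k) * math.exp(sigma ** 2 / (2 * lam ** 2))


def sigma_from_leakage(i_leak, i0, w, k, lam):
    """Macroscopic sigma implied by a measured leakage value."""
    return lam * math.sqrt(2 * math.log(k * i_leak / (w * i0)))


def chip_leakage(pmos, nmos):
    """Ileak-w as the sum of the PMOS and NMOS terms."""
    return leak_with_variation(**pmos) + leak_with_variation(**nmos)


# Hypothetical per-type parameters (arbitrary consistent units).
pmos = dict(i0=1.0e-9, w=2.0e6, k=2.0, sigma=0.012, lam=0.020)
nmos = dict(i0=3.0e-9, w=1.0e6, k=2.5, sigma=0.010, lam=0.018)

total = chip_leakage(pmos, nmos)
recovered = sigma_from_leakage(leak_with_variation(**nmos), nmos["i0"],
                               nmos["w"], nmos["k"], nmos["lam"])
print(total, recovered)
```

Feeding the estimated NMOS leakage back through the σ formula returns the σ that produced it, so a measured leakage value can be turned into an effective σ for later estimates.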
Measurement results
0.18 µm 32-bit microprocessors (n=960)
[Figure: number of samples (0 to 500) vs. ratio of measured to calculated leakage (0.1 to 100, log scale)]
Model | µ | σ
Ileak-u | 0.65 | 0.27
Ileak-w | 1.04 | 0.3
Ileak-l | 6.5 | 3.8
With the new model, 50% of the samples are within ±20% of the measured leakage, compared with 11% and 0.2% of the samples using the other techniques
150
Static Leakage Reduction
1) Transistor Stacks
151
Leakage of Stacks
[Figure: two-transistor stack below Vdd with upper device width wu, lower device width wl and intermediate node Vint; normalized currents Istack-u and Istack-l (0 to 1.2) vs. Vint (V, 0 to 1.5), crossing at VX]
[Figure: normalized two-stack leakage vs. normalized single-device leakage (1 to 100000, log-log) at 30°C and 80°C]
Stack leakage is ~5-10X smaller
152
Scalability—Stack Effect
Stack effect becomes stronger with scaling
153
Exploiting natural stacks
32-bit Kogge-Stone adder
[Figure: % of input vectors (0% to 30%) vs. standby leakage current (µA): 5.0 to 7.4 for high VT, 105 to 135 for low VT]
Reduction | Avg | Worst
High VT | 1.5X | 2.5X
Low VT | 1.5X | 2X
Quantity | High VT | Low VT
Energy overhead | 1.64 nJ | 1.84 nJ
Savings | 2.2 µA | 38.4 µA
Min time in standby | 84 µs | 5.4 µs
154
Stack forcing
[Figure: delay under iso-input load (A.U., 1 to 10) vs. normalized Ioff for Lmin device (1e-3 to 1e+0) for low-Vt and two-stack low-Vt devices at equal loading (a device of width w replaced by a stack of widths wu and wl), showing the delay penalty and the leakage reduction]
Low-Vt + stack-forcing reduces leakage power by 3X
155
Static Leakage Reduction
2) Dual-Vt Process
156
Dual VT design technique
Leakage 3X smaller
(Active & Standby)
No performance loss
157
Optimum choices of high & low Vt
[Figure: leakage power reduction: % leakage savings per transistor and percentage of HVth transistors vs. HVth - LVth (mV, 25 to 225)]
[Figure: normalized standby leakage power (0.5 to 1.1) vs. HVth - LVth (mV, 0 to 200) for gate-level, half gate level and quasi-transistor level assignment]
75-100mV VT difference is optimal
158
Dual-VT and sizing
Original design in 180nm single-VT technology, scaled to 130nm technology with all high-VT.
[Flow diagram: noise constraints, parasitic estimation, target frequency and process specifications feed the dual-VT flow (timing and power estimation, data analysis), which produces the final design in 130nm dual-VT technology]
Techniques
• DVT
• min-lvt
• min-area
• min-pwr
Optimize design with concurrent dual-VT allocation and sizing
159
Results: total power
[Figure: total power (normalized, 0 to 5), split into switching and leakage, at 1.96 GHz (High-VT target), 2.21 GHz and 2.30 GHz (Low-VT target); percentages labeled on the bars range from 14% to 22%]
• Total power reduced by 6-8% over DVT-only
• Leakage power reduced by 20% over DVT-only
160
Results: total device width
[Figure: total width (normalized, 0 to 2.5), split into high-VT and low-VT, at 1.96 GHz (High-VT target), 2.21 GHz and 2.30 GHz (Low-VT target); percentages labeled on the bars range from 2% to 22%]
• Less low-VT usage than DVT-only
• Trade-off between area & low-VT usage
161
Results: area comparison
Frequency: 2.3GHz
[Figure: total power (normalized, 0.7 to 1.1) and burn-in leakage power (normalized, 0.7 to 1.1) vs. die area (normalized, 0.95 to 1.2) for DVT, min-lvt, min-area and min-pwr]
min-lvt: 15% area overhead
20% burn-in power reduction
162
Effect of leakage change
• Push leakage in manufacturing to increase frequency
• Dual-VT design: ideally push low-VT only
[Figure: path count (x1000, 0 to 20) vs. path delay (ps, 203 to 503) for DVT+S original and DVT+S with low-VT leakage push; 2.2 GHz and 2.76 GHz marked]
High-VT paths do not speed up
163
Enhanced dual-VT design
• Allow for efficient frequency change
• Insert additional low-VT devices
[Figure: path count (x1000, 0 to 30) vs. path delay (ps, 203 to 503) for EDVT+S and EDVT+S with low-VT leakage push; 2.2 GHz and 2.76 GHz marked, 20% annotated]
Dual-VT insertion should consider process scaling
164
Dual-VT + sizing summary
• Dual-VT + sizing reduces low-VT usage by ~30-60% compared with DVT-only
• Leakage power reduced by 20%
• Dual-VT designs offer 9% frequency improvement over single-VT
• Enhanced design allows frequency increase through low-VT leakage push
165
Dynamic Leakage Reduction
1) Body bias
166
Reverse body bias
[Figure: intrinsic leakage reduction factor (X, 1 to 100) vs. target Ioff (nA/µm, 0.01 to 1000) at 110°C with 0.5V RBB; trend arrows: higher VT, lower VT, shorter L]
[Figure: total leakage power (W, 0 to 400E-9) vs. body bias (V, -1.0 to 0.0), measured on 0.18 µm test chip, showing SD leakage, Ibn and Ibp junction leakage and the optimum; labels: microprocessor critical path circuit, I/O circuit]
Tech | 0.35 µm | 0.18 µm
Opt. RBB | 2V | 0.5V
Ioff Red. | 1000X | 10X
RBB reduces SD leakage
Less effective with: shorter L, lower VT, & scaling
167
Impact of scaling on RBB effectiveness
[Figure: Iint reduction factor (X, 1 to 10) vs. target Ioff (nA/µm, 0.1 to 10000) at 110°C for 150 nm, 110 nm LVt and 110 nm HVt]
Tech | Average optimum reverse body bias | Ioff reduction at optimum body bias
Tech A | ~0.5 V | ~10 X
Tech B | ~2 V | ~1000 X
RBB becomes less effective with technology scaling
168
Switching + leakage reduction: forward body bias
[Figure: normalized total power (0 to 4) vs. frequency (GHz, 0.6 to 1.4) for ZBB and 500mV FBB at 110°C, α=0.1, Vcc: 1, 1.05, 1.1 … 1.5V, with the 1.1V and 1.2V points marked]
[Figure: FBB/ZBB leakage ratio (0 to 30) vs. frequency (GHz, 0.6 to 1.4) at 27°C]
20% power reduction at 1GHz
8% ↑ frequency at iso-power
20X ↓ idle-mode leakage
169
Router chip with forward body bias
[Die photo: I/O (F-Links, S-Links), PLL, CBG and digital core (export, import, 6-port 72-bit symmetric cross-bar, 24 LBGs) with on-chip PMOS FBB]
CBG – Central bias generator (836 µm x 267 µm)
LBGs – Local bias generators (156 µm x 68 µm each)
Die size | 10.1 x 10.1 mm² – total; 8 x 8 mm² – digital core
Technology | 150nm CMOS
Transistors | 6.6 million
Body bias | 0 or 450 mV for PMOS devices in digital core
Total PMOS width biased | 2.2 meters
Area overhead | 2%
Power overhead | 1%
Core frequency | 1 GHz
Supply voltage | 1.1 V with 450 mV FBB; 1.25 V with NBB
Digital core with on-chip PMOS body bias generator (BG).
170
Power and performance gain by FBB
[Figure: Fmax (MHz, 250 to 2000) vs. Vcc (V, 0.9 to 1.7), Tj ~ 60°C: body bias chip with 450 mV FBB vs. NBB chip & body bias chip with ZBB]
[Figure: Fmax (MHz, 250 to 2000) vs. active power (W, 0 to 20), Tj ~ 60°C: body bias chip with 450 mV FBB vs. body bias chip with ZBB]
[Figure: power (W, 0 to 8) at 1GHz, split into switching and DC + active leakage, for FBB and ZBB; bar labels 3.1 and 4.7]
33% performance gain at 1.1V!
25% power reduction at 1GHz!!
171
Standby leakage control by FBB
[Figure: number of dies (0 to 30) vs. leakage current ratio Ioff(FBB)/Ioff(ZBB) (1X to 8X), Tj ~ 60°C]
n: 74
µ: 3.5X
σ: 1.4X
Leakage reduction and 1.1V operation made possible by body bias
172
Dynamic Leakage Reduction
2) Dynamic sleep transistor
173
Active leakage control
OFF: gate
underdrive
VCC
Virtual VCC
Sleep
transistor
...
IDLE
ACTIVE
+ sleep or
body bias
Virtual VSS
VSS
OFF: gate
underdrive
IDLE
+ sleep or
body bias
PMOS
body
VCC
VHIGH
PMOS
bias
Body bias
NMOS
bias
VLOW
174
500mV
RBB
...
...
500mV
RBB V
SS
NMOS
body
32-bit ALU overview
[Block diagram: dynamic ALU core between virtual VCC and virtual VSS, with sleep transistors to external VCC and VSS, sleep and non-sleep ALUs, LBGs for the core and for the sleep transistors, a CBG, body bias and sleep control, scan control and FIFOs, and a 3-bit A/D on virtual VSS]
Technology | 130nm dual-VT CMOS
Die area | 1.61 X 1.44 mm²
Transistors | 160K
Frequency | 4.05GHz @ 1.28V, 450mV FBB, 75°C
CBG: central bias generator
LBG: local bias generator
175
Sleep transistor layout
ALU
Sleep
transistor
cells
VCC
M4
V
VSS
M4
V
M3
Virtual VCC
M3
Virtual VSS
VCC
M4
VCC
M4
VSS
M4
VSS
M4
M3
M3
176
M3
M3
Body bias layout
[Layout: ALU with ALU core LBGs and sleep transistor LBGs; PMOS body bias and PMOS sleep transistor regions marked]
Number of ALU core LBGs | 30
Number of sleep transistor LBGs | 10
PMOS device width | 13mm
Area overhead | 8%
177
Frequency & leakage impact
Reference: No sleep transistor, 450mV FBB to core, 1.35V, 75°C
Technique | Frequency degradation | Leakage reduction | Area increase
No over/under drive or sleep body bias | 2.3% | 37X | 6%
200mV over/under drive | 1.8% | 44X | 7%
Sleep body bias: FBB – RBB | 1.8% | 64X | 8%
Dynamic body bias: FBB – ZBB | 0% | 1.9X | 8%
178
Virtual supply convergence
[Figure: virtual supply (V, 0 to 1.4) vs. idle time (μs, 0.01 to 10000) at 1.32V, 75°C: convergence < 1µs with no decap on virtual Vcc; convergence > 1ms with low-leakage 133nF decap on virtual Vcc]
Convergence time is dependent on capacitance
[Figure: leakage power savings (70% to 100%) vs. idle time (ms, 0.1 to 1000) at 1.32V, 75°C, α=0.05: decap on virtual VCC (10% total area) vs. decap on full VCC]
Leaky MOS decap on virtual VCC: better leakage savings for > 1ms idle time
179
Total power: equal frequency
TON = 100 cycles, 75°C, α=0.05, F=4.05GHz
[Figure: total power (mW, 0 to 12), split into switching, leakage and LBG overhead, for clock gating only, clock gating + sleep transistor and clock gating + body bias, at supply voltages of 1.28V and 1.32V; annotated total savings of 8% and 15%]
180
Leakage-Tolerant Circuits
1) Dynamic register file
181
Impact of increasing leakage
• Leakage disturbs the local bit line (LBL)
– Noise can result in erroneous evaluation
– Wider addressing exacerbates problem
[Figure: dynamic register file local bit lines LBL0 and LBL1 with word lines WL0 to WL15 and data D0 to D15 from the storage cells, output N0 (to GBL), and connections to other ports]
182
Dual-Vt design for robustness
• High-Vt and stronger keepers mitigate
leakage and improve robustness
– Contention causes severe penalty in delay
LBL1

LBL0
WL0
WL1
D1
...
...
D0
To other ports
183
N0
... WL15
D15
High-Vt
Source-follower NMOS (SFN)
N0
D1
WL0
WL1
...
D15
WL15
LBL1
D0
LBL0

Automatic Vgs
reduction and
reverse Vbs
• As leakage charges the output node, feedback
reduces the leakage
184
Leakage bypass w/ stack forcing

LBL1
LBL0
WL0
D0
Vds=0
...
WL15
Stack node D15
• Extra PMOSs supply leakage currents
– Leakage is bypassed away from LBL
• Extra NMOS device forces stack node
185
N0
Better robustness vs. delay
[Figure: LBL delay (normalized, 0.5 to 2.5) vs. DC noise robustness (unity gain DC noise / Vcc, 0.05 to 0.25) for LVT, DVT, SFN and LBSF; arrow: larger keeper & smaller skew]
DVT: Much better than LVT
186
Energy vs. delay for SFN
[Figure: total energy (normalized, 0.4 to 1.2) and total transistor width (normalized, 0.4 to 1.4) vs. delay (normalized, 0.6 to 1.6) for DVT, SFN+DVT and SFN+LBSF]
• Robustness fixed at 10% across all points
• Leakage-tolerant techniques not only improve robustness, but reduce energy as well
• SFN width not as competitive because of PMOS pull-up
187
Energy vs. delay for LBSF
[Figure: total energy (normalized, 0.4 to 1.2) and total transistor width (normalized, 0.4 to 1.4) vs. delay (normalized, 0.6 to 1.6) for DVT, LBSF+DVT and Full LBSF]
• LBSF faster despite 3-stack pull-down in LBL, 2-stack in GBL
• Comparable total width in pull-down stacks yields similar capacitance
188
Summary of LBSF and SFN
Metric | Full LBSF | SFN+DVT | SFN+LBSF
Delay improvement | 33% | 10% | 31%
Energy reduction | 37% | 24% | 38%
Total width reduction | 47% | -3% | 26%
• Improved RF robustness without delay penalty
• Advantages of LBSF and SFN improve as leakage increases
189
Leakage-Tolerant Circuits
2) L1 cache using bitline leakage
reduction (BLR)
190
Bitline leakage reduction
[Figure: bitline development rate (0.0 to 1.5) vs. technology generation (160n, 130n, 100n)]
• Memory cell: HVT and Lmax
• Solution: Larger, Dual-Vt cell for L1 cache
3 types of cells
– HVT + Lmax
– HVT + Lmin
– DVT + Lmin
191
Intrinsic and effective read current
• DVT+Lmin cell: IINT is 35% larger; IEFF is smaller
[Figure: read current (0.0 to 1.5), split into IEFF and ILEAK, for HVT+Lmax, HVT+Lmin and DVT+Lmin cells; 100 nm technology, 128 rows per bitline]
192
Bitline leakage reduction
WL: -100mV ↔ Vmax;
193
Vvc = Vmax – 100mV
BLR test chip results
2Kb bank of 16Kb L1 cache
[Die photo: 128b X 128b array with WL driver, decoder, timer and I/O; 200 µm x 133 µm]
[Figure: effective read current and cell area (normalized, 0.0 to 1.5) for HVT+Lmax, DVT+Lmin and DVT+Lmin + BLR]
BLR: 25% higher read current, 3% larger cell area
194
BLR performance
1.2V, 110°C
[Figure: cycle time (ps, 0 to 200), split into precharge delay, other read delay and bitline delay, for HVT + Lmax and DVT + Lmin + BLR]
• Bitline delay improved from 91ps to 75ps
• Read delay reduced from 159ps to 132ps
[Figure: bitline development rate (0.0 to 1.5) vs. technology generation (160n, 130n, 100n, 100n BLR)]
• Bitline development rate improved by 8%
195
Leakage-Tolerant Circuits
3) Conditional keeper for burn-in
196
Leakage at burn-in (BI)
• BI condition’s elevated voltage and temperature further aggravate the leakage issue
• Higher leakage, higher temperature
• Thermal runaway issue and positive feedback effect
• Impact of leakage (especially at BI) on circuit functionality
• Stability of IDDQ measurement with BI stress
197
Keepers need to be upsized for burn-in
• Larger keepers increase delay at “normal” condition
198
Burn-in conditional keeper
Normal mode Keeper
Effective Burn-in Keeper
Burn-in signal (BI)
Clock
PKB
PK1
Min.
sized
Pull Down
NMOS
Clock
199
1.6
1.4
STD
1.2
BI-CKP
1
0.8
10 15 20 25 30
Burn-in Keeper size
[% of pull down]
Delay improvement [%]
Norm. delay
(Normal condition)
Burn-in keeper: 100nm comparison
20
15
10
5
0
2
3 4 5 6
NORs Fan-in
(number of inputs)
Larger delay improvement for wider dynamic gates
200
Summary
• Control of leakage power becoming crucial
• Leakage estimation is necessary during design phase
• Static and dynamic techniques can be used for leakage control
– Dual-VT process and stack effect
– Dynamic sleep transistor and body bias
• Leakage-tolerant circuits
– Cache and memory leakage techniques
– Burn-in leakage reduction
201
Section 3
Full-chip power reduction
techniques and design
methodologies
202
Micro architecture
innovations
203
Architecture Tradeoffs
1.5
1.5
1
1
0.5
frequency
0
target
frequency
probability
large
small
Logic depth
0.5
0
less
# uArch critical paths
Higher target frequency with:
1. Shallow logic depth
2. Larger number of critical paths
But with lower probability
204
more
Improve Arch Efficiency
3000
Thermals & Power Delivery
designed for full HW utilization
2500
CPU
MHz
2000
Single Thread
1500
GAP
1000
Mem
500
0
1992
ST
Multi-Threading
MT1
1994
1996
1998
2000
Wait for Mem
2002
Wait for Mem
MT2
Wait
MT3
Multi-threading improves performance without
impacting thermals & power delivery
Still obey Moore’s Law!
[Figure: transistors (MT, 10 to 10,000, log scale) vs. year (2000 to 2008), actual vs. Moore's Law]
Total transistors meet Moore’s Law
206
Fred’s Rule
[Figure: growth (X, 0 to 4) of Area(Lead/Compaction) and Perf(Lead/Compaction) vs. technology generation (1.5, 1, 0.7, 0.5, 0.35, 0.18)]
In the same process technology:
2X Area → 1.4X Performance
207
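One way to read the rule is as a square-root relation between area and performance, since 1.4 is close to the square root of 2. The sketch below assumes that reading; the slide itself only gives the single 2X to 1.4X data point, so the exponent and the function name are assumptions.

```python
def relative_performance(area_ratio, exponent=0.5):
    """Performance ratio for a given die-area ratio in the same process.

    exponent=0.5 is an assumed square-root reading of "2X area -> 1.4X perf".
    """
    return area_ratio ** exponent


print(f"2X area   -> {relative_performance(2.0):.2f}X performance")
print(f"0.5X area -> {relative_performance(0.5):.2f}X performance")
```

Under the same assumption, halving the die area would give up roughly 30% of the performance, which is the kind of loss that shows up when die size stops growing.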
Reduced die size causes “Performance gap”
[Figure: performance gap (0% to 70%) vs. year: 0% (2000), 35% (2002), 47% (2004), 53% (2006), 59% (2008)]
30-60% performance loss even after meeting Moore’s Law
208
Exploit Memory—Low PD
Large on die caches provide:
1. Increased Data Bandwidth & Reduced Latency
2. Hence, higher performance for much lower power
209
Memory has lower power density
Exploit memory !
210
Increase memory area
[Figure: memory area as % of total (0% to 70%) vs. year: 29% (2000), 41% (2002), 54% (2004), 55% (2006), 57% (2008)]
Use > 50% die area in memory
211
Memory trend
100000
12M
2.5M
Memory (KB)
10000
24M
1M
5.5M
1000
100
8
16
16
10
1
1980
1990
2000
Year
212
2010
Power density is reduced
Power Density (W/cm2)
10000
Rocket
Nozzle
1000
Nuclear
Reactor
100
8086
Pentium 4
Hot Plate
10 4004
Pentium II
8008 8085
Pentium
386
286
486
8080
1
1970
1980
1990
2000
2010
Year
Full chip power density is reduced
But local power density will be high
213
Can DRAM help?
• Transistor perf not critical for DRAM
• Don’t need large retention time
• 10X more storage in same area & power
• TB/sec Bandwidth, at <10ns latency
Metric | SRAM | DRAM
Cell size (f²) | ~150 | ~10
Array efficiency | 85% | 60%
Memory density | 1 | 11
214
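The memory density row follows from the other two rows if density is taken as array efficiency divided by cell size. That relation is an assumption about how the slide's number was obtained, and the function name is illustrative.

```python
def relative_density(cell_ref, eff_ref, cell_new, eff_new):
    """Bits per unit area of the new array relative to the reference array,
    assuming density is proportional to array efficiency / cell size."""
    return (eff_new / cell_new) / (eff_ref / cell_ref)


# SRAM as reference: ~150 f^2 cell, 85% efficiency; DRAM: ~10 f^2, 60%.
ratio = relative_density(cell_ref=150, eff_ref=0.85, cell_new=10, eff_new=0.60)
print(f"DRAM density relative to SRAM: {ratio:.1f}X")
```

The 15X cell-size advantage is diluted by the lower array efficiency, leaving roughly the 11X (about 10X) density gain quoted on the slides.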
Embedded DRAM on logic
Provides 10X memory--same area, same power as SRAM
215
Embedded DRAM could improve performance
[Figure: cache latency (clocks, 1 to 1000, log scale) for L0, L1, L2 and DRAM (Foster)]
[Figure: external DRAM latency as instruction cost (0 to 800) for Pentium, Pentium Pro, Pentium III and Pentium 4 (1.4GHz and 2.1GHz) processors. Source: Glenn Hinton, 99]
Embedded DRAM provides:
10X increase in on-die Memory
1,000X increase in Bandwidth
10X reduction in Latency
216
On-die DRAM Applications
(1)
Memory
Controller
CPU
(2)
External
DRAM
On-Die
DRAM
External
Memory
Network
Processor
On-Die
DRAM
Packet Memory
Routing Tables
217
130nm test chip
0.52
P
1.2
1.0
N+
“0”
“1”
0.8
N+/P
0.6
0.4
0.2
0.0
-1.8 -1.3 -0.8 -0.3 0.2 0.7 1.2 1.7
Vg (V)
N+/P
Inversion
N
Capacitance per unit area
(normalized)
Capacitance per unit area
(normalized)
Vhigh+vth
P+
1.2
“1”
1.0
P+/N
0.8
0.6
0.4
“0”
0.2
0.0
-1.8 -1.3 -0.8 -0.3 0.2 0.7 1.2 1.7
Vg (V)
P+/N
Accumulation
218
1.10
0.73
Capacitance per unit area
(normalized)
0.52
P+
P
1.2
1.0
P+/P
0.8
“0”
“1”
0.6
0.4
0.2
0.0
-1.8 -1.3 -0.8 -0.3 0.2 0.7 1.2 1.7
Vg (V)
P+/P
Depletion
110°C
19.39
* SRAM / DRAM Cell Area
P+/N
P+/P
1.0
N+/P
3.3
3.4
110°C
Power (W/cm 2)
5
10
1.0
P+/N
15
1.0*
5.0
SRAM
N+/P
P+/P
0.85
0.99
1.25
0
0
Retention Time TR (ns)
500
1000
1500
20
Area and Power Comparison
0.80
1.20
1.60
2.00
2.40
Ratio of SRAM / DRAM array area
1.00
1.95
1.97
2.27
Ratio of SRAM / DRAM array area
• P+/P the best from power and area perspective
219
Interconnect power
reduction
220
Motivation: CC Multiplier (CCM)
[Figure: a wire with ground capacitance Cg and coupling capacitance Cc to each neighbor, shown for CCM = 0, CCM = 1 and CCM = 2 (CCM: Cc multiplier)]
• RintCint delay of long busses is a key speed limiter
• Coupling cap (Cc) is a large component of Cint:
Cint = Cg + CCM × (2Cc)
221
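A small sketch of how the multiplier changes the capacitance a driver sees. It assumes the common interpretation that CCM = 0 when both neighbors switch in the same direction, 1 when they are quiet, and 2 when they switch in the opposite direction. The capacitance values are hypothetical, chosen so that coupling is 75% of Cint at CCM = 1.

```python
def c_int(cg, cc, ccm):
    """Effective wire capacitance: Cint = Cg + CCM * (2 * Cc)."""
    return cg + ccm * (2 * cc)


cg, cc = 1.0, 1.5   # hypothetical, in arbitrary units
for ccm in (0, 1, 2):
    print(f"CCM = {ccm}: Cint = {c_int(cg, cc, ccm):.1f}")

# Share of Cint due to coupling when neighbors are quiet (CCM = 1).
coupling_share = (2 * cc) / c_int(cg, cc, 1)
print(f"coupling share at CCM = 1: {coupling_share:.0%}")
```

With these values the worst case (CCM = 2) presents 1.75 times the capacitance of the quiet case, which is why schemes that cap CCM at 1 can be attractive for RC-limited buses.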
75%
59%
100%
77%
metal-4
75%
Coupling Cap. Ratio
Coupling Capacitance Scaling
50%
25%
0%
150nm 130nm 100nm
(Al)
(Cu)
(Cu)
Coupling capacitance remains a large fraction of Cint
despite moving from Al to Cu.
222
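Static Bus (SB)
[Figure: driver flip-flop (DFF) to receiver flip-flop (RFF) through repeated wire segments of a given segment length, with Cc to each neighbor and Cg to ground; timing Tclk-q, Tint, Tsetup]
Delay = Tclk-q + Tint@CCM=2
• Simple scheme with no timing constraints
• Minimize delay through optimal repeater insertion
• CCM of 2 has negative impact on delay
223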
Dynamic Bus
• Domino timing applied to interconnect
• Monotonic transitions
– Reduced collinear capacitance
• Static (worst case) = 2X
• Dynamic (worst case) = 1X
– Φ2 repeater required – susceptible to noise
• Higher transition activity when input = 1
• Static CMOS inverters drive all segments
224
Dynamic Bus Advantages
• Capacitance effects reduced
– Collinear capacitance reduced 2X
– Orthogonal capacitance unchanged
• Inductance effects reduced
– Can oppose transition for static bus
– Can reduce capacitive effects for dynamic bus
225
Static Pulsed Bus (SPB)
[Figure: DFF, pulse generator (PG), repeated wire segments, toggle flip-flop (TFF) and RFF; timing Tclk-q, Tpg, Tint, Ttff, Tsetup]
Delay = Tclk-q + Tpg + Tint@CCM=1 + Ttff
• Static PG generates a pulse on a data transition
• Toggle FF (TFF) restores correct data at bus end
• Leading edge is critical: repeaters are skewed
226
SPB Benefits
[Figure: coupling cases CCM = 0 and CCM = 1, with Cc to each neighbor and Cg to ground]
• In SPB, data transitions are monotonic: worst case CCM = 1 and repeaters can be skewed
• Similar to dynamic bus but: (1) has no clock overhead and (2) its energy scales with switching activity
227
Normalized Energy
3
2.5
SB Vs. SPB: Delay
SPB
SPB Delay
Breakdown
2
RC + Rep. 77%
1.5
1
200
Other
SB
300
400
500
Delay (pS)
SPB reduces delay by 22% as a result of:
1. Repeater skewing
2. CCM < 1 due to “useful” noise coupling
228
23%
SPB
2
1.5
1
200
SB
300
400
500
5
4
5%
3
2
1
11%
SB
SPB
0
Delay (pS)
SPB reduces energy by 12% due to:
1. Smaller skewed repeater sizes
2. Smaller CCM
229
Other
Repeater + RC
89%
2.5
6
95%
3
Normalized Energy
Normalized Energy
SB Vs. SPB: Energy
SB vs. SPB: Different Bus Lengths
at iso-delay
% SPB Savings
%Delay Reduction
at iso-energy
30%
20%
10%
75%
50%
Energy
Width
Peak Current
25%
0%
0%
1500
3000
4500
1500
Bus Length (um)
3000
4500
Bus Length (m)
• At iso-energy, SPB improves delay by 15%-25%
• At iso-delay, SPB reduces energy by 12%-25%
• At iso-delay, SPB reduces current/width by 26%-34%
230
SPB summary
• SPB has monotonic data transitions:
– worst case CCM = 1
– repeaters can be skewed
• Unlike dynamic bus:
– no clock precharge-evaluate energy and routing
– energy consumption is data activity dependent
• For 1500µm-4500µm metal-4 line, SPB:
– improves delay by 15%-25%
– reduces energy by 12%-25%
– reduces width by 34%-42%
– reduces peak-current by 26%-34%
231
Transition-Encoded Bus (TEB)
• Encoder circuit
– XOR of previous and current input
– Domino compatible output
• Decoder circuit
– XOR of previous output and bus state
232
TEB Advantages
D1
2
decode
1
encode
D1
• Dynamic bus performance improvement
– Collinear capacitance reduction
• Static bus energy
– Transition dependent switching activity
• Noise-insensitive Φ2 repeater required
– Regains noise immunity of CMOS inverter
233
FF
1
Energy Comparison
Energy (normalized)
1.0
Static
0.8
0.6
Transition-encoded
0.4
0.2
9mm metal3, 130nm process, 1.2V, 30ºC
0.0
0.65
0.70
0.75 0.80 0.85 0.90
Delay (normalized)
234
0.95
1.00
Results
Metric | Equal delay | Equal transistor width | Equal driver size
Delay reduction | 0% | 19% | 22%
Total transistor width reduction | 32% | 0% | -20%
Peak current reduction | 49% | 30% | 17%
Energy increase | 9% | 16% | 19%
• Averaged over 3-9mm buses
• Metal3 in 130nm technology, 1.2V, 30ºC
235
TEB summary
• Transition-encoded bus:
– High performance, energy efficient on-chip interconnect technique
– 32% active area reduction
– 49% peak current reduction
– Transition dependent energy consumption
→ Energy savings at aggressive delay targets
• Enables 10%-35% performance improvement on 79% of full-chip Pentium® 4 buses
236
Special purpose hardware
237
Special-Purpose HW
• Special-purpose performance → more MIPS/mm²
• SIMD integer and FP instructions in several ISAs
Approach | Die Area | Power | Performance
General Purpose | 2X | 2X | ~1.4X
Multimedia Kernels | <10% | <10% | 1.5-4X
• Integration of other platform components, e.g. memory controller, graphics
• Special-purpose logic, programmable logic, and separately programmable engines
Improve power efficiency with Valued Performance
238
TCP/IP challenges
[Figure: CPU % (0 to 250) vs. packet size (bytes, 128 to 64K) on saturated 1GbE for 1P Tx/Rx, 2P Rx and 2P Tx]
Link | Packet rate | Time per packet
1GbE | 1.48M pkts/sec | 672 ns
10GbE | 14.8M pkts/sec | 67.2 ns
General purpose MIPS will not keep up!
239
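The packet rates above are consistent with minimum-size Ethernet frames on a saturated link. The sketch assumes standard Ethernet framing (64-byte minimum frame, 8 bytes of preamble and start delimiter, 12 bytes of inter-frame gap); the slide does not state how its numbers were derived, and the function names are illustrative.

```python
MIN_FRAME_BYTES = 64      # minimum Ethernet frame
PREAMBLE_SFD_BYTES = 8    # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12 # minimum idle time between frames


def packet_time_ns(rate_gbps, frame_bytes=MIN_FRAME_BYTES):
    """Wire time per packet in ns; bits / (Gbit/s) gives nanoseconds."""
    bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return bits / rate_gbps


def packets_per_second(rate_gbps, frame_bytes=MIN_FRAME_BYTES):
    return 1e9 / packet_time_ns(rate_gbps, frame_bytes)


for rate in (1, 10):
    print(f"{rate}GbE: {packet_time_ns(rate):.1f} ns per packet, "
          f"{packets_per_second(rate) / 1e6:.2f}M pkts/sec")
```

At 10GbE a processor therefore has under 70 ns per minimum-size packet, which is only a few hundred cycles even at multi-GHz clock rates.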
Compute power required for TCP/IP
1000000
10000
160GbE
10
40GbE
1GbE
100
10GbE
CPU MIPS
1000
2010
2008
2003
1999
1997
0.1
1993
1
1985
MIPS required
100000
TCP/IP Engine will provide required MIPs
240
A sample approach
• A programmable hardware engine for offloading TCP processing
• Focus on
– Most complex part: TCP inbound processing
– Handle 10Gbps Ethernet traffic with sufficient headroom for outbound processing
– Aggressive wire speed goal - minimum packet size on saturated wire
– Simple, scalable, flexible design enabling fast time to market
241
Key features
• Special purpose processor
– Dual frequency, low latency, buffer-free design
– High frequency execution core
– Accelerated context lookup and loading
• Programmability for ever-changing protocols
– Programmable design with special instructions
– Rapid validation and debug
• Scalable solution
– Across bandwidth and packet sizes
– Extendable to multi-core solution
242
Packet size vs. core frequency
Packet size (bytes)
10000
1Gbps
10Gbps
40Gbps
1000
100
64
1Gbps
1GHz
10
10Gbps
10GHz
1
0.1
1
10
100
Core frequency required (GHz)
Increase packet size  reduce frequency
243
Chip characteristics
Chip Area | 2.23 x 3.54mm²
Process | 90nm dual-VT CMOS
Interconnect | 1 poly, 7 metal
Transistors | 460K
Pad count | 306
244
Standard FP MAC
A
EF
EA
Swapper
Control
AXB
EA ± EF
shifter
SUB
Post Norm
1’s Comp
X+Y
FB
MA
MA + FB
FB
MA
B
NEG
LZD
SUB
Post Norm
Exp Sub
EF
Critical Path Logic Stages = 26
@30ps per stage, Fmax = 1.2Ghz (P860, 1.1V)
245
shifter LZD
FB
Prototype FP MAC
A
B
M(CS)
EF
AXB
FB(CS)
EA
MP(CS)
Control
0 ZD
1
1
FB
MA
1
M >F
0
4:2 Compressor
4:2
compressor
Shift
by 32
Overflow
detector
Overflow Detect
ME = FBE
Critical Path Logic Stages = 12
@30ps per stage, Fmax = ~3GHz (P860, 1.1V)
246
1
0
Accumulator Algorithm
• Key: Minimize interaction between incoming operand and accumulator result
• Floating point number converted to base 32
• Exponent subtraction no longer necessary
• Exponent comparison reduced from 8 to 3 bits
[Figure: input format: Sign, Exp[7:0], Mantissa (24 bits); converted format: Sign, Exp[7:5] with the low exponent bits set to 00000, Mantissa (55 bits)]
247
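The format change can be illustrated with integers. The sketch below is one reading of the figure, assuming the low five exponent bits are folded into the mantissa as a left shift, so the remaining exponent counts steps of 32 bit positions. Sign, exponent bias and rounding are ignored, and the function names are hypothetical.

```python
def to_base32(mant24, exp8):
    """Fold Exp[4:0] into the mantissa, keeping only Exp[7:5]."""
    assert 0 <= mant24 < (1 << 24) and 0 <= exp8 < (1 << 8)
    exp_hi = exp8 >> 5          # Exp[7:5]: 3 bits left to compare
    shift = exp8 & 0x1F         # Exp[4:0]: 0..31
    mant55 = mant24 << shift    # needs at most 24 + 31 = 55 bits
    return exp_hi, mant55


def magnitude(mant, exp, bits_per_step):
    """Integer value mant * 2^(exp * bits_per_step)."""
    return mant << (exp * bits_per_step)


mant24, exp8 = 0xFFFFFF, 0b10111111          # worst-case shift of 31
exp_hi, mant55 = to_base32(mant24, exp8)
print(exp_hi, mant55.bit_length())
print(magnitude(mant55, exp_hi, 32) == magnitude(mant24, exp8, 1))
```

In this reading, two operands whose 3-bit exponents match are already aligned, and a mismatch needs only shifts in multiples of 32, which fits the bullets about dropping exponent subtraction and narrowing the comparison.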
Die photograph and characteristics
[Die photo: multiplier, aligner, accumulate, normalize, CLK, clock grid buffers, FIFOs & scan]
Die Area | 1.32 x 1.57 mm²
Process | 90nm CMOS
Interconnect | 1 poly, 7 metal
Transistors | 230K
Pad Count | 75
248
Design methodologies
249
Motivation
• Parameter variations will become worse with technology scaling
• Robust variation tolerant circuits and microarchitectures needed
• Multi-variable design optimizations considering parameter variations
• Major shift from deterministic to probabilistic design
250
Probability
Impact on Design Methodology
Path Delay
Due to variations in:
Vdd, Vt, and Temp
Delay Target
Probabilistic
Delay Target
251
Frequency
Deterministic
# of Paths
# of Paths
Delay
Deterministic
Probabilistic
10X variation
~50% total power
Leakage Power
Tool Complexity
• Problems
–Far too many tools and tool interfaces
–Data is not easily extractable
–Circuit reuse is minimal
• Solutions
–Common tool interfaces
–Standard databases
–Parameterized design
252
Designer Cockpit
[Screen mock-up: menu bar (File, Edit, View, Select, Synthesize, Parasitics, Sizing, Analyze, Experiment, Checks, Options); commands Path select, Sim path, gen BVR, RaceCheck; window "Schematic - cira" showing node n1; status: path selected]
• Everything on the menu bar
253
Designer Cockpit
[Screen mock-up: same menu bar; windows "Schematic - cira" (node n1) and "Layout - cira" (source: layout extract 9/10/96; cira.cel1, cira.cel2, cira.nout2o); status: path selected]
• Layout planning view compatible with schematic view
• Selection in either view
254
Designer Cockpit
[Screen mock-up: Sizing menu (AMPS, Tune to target: sense amp, memory cell; Set restrictions; Autosize) and Analyze menu (Speed power curve, Optimize with sensitivity, Optimize metal line, Pick VT, Delay vs size, Cell characterize, Sense amp characterize, Memory cell stability, Setup & hold characterize, User specified, New Script)]
• Tools work with partial or full selection
• Designer intervention allowed anywhere
• Layout planner provides wiring parasitics
– Not the route of the week
• All tools callable from user programs
• Experiment organizer
• Optimization and experiments built in
255
Optimization Example
• Imagine:
– Select gates from schematic editor or layout planner to optimize
– Select optimization for P•D³
– Include a metal width and space
– Include VT range optimization
– Force a metal line length as a function of transistor sizes in a cell
– Select Pathmill analysis
– Run with sensitivity turned on
256
Optimization Example
[Screenshot: Designer Cockpit window. Menu bar: File, Edit, View, Select, Synthesize, Parasitics, Sizing, Analyze, Experiment, Checks, Options. Menu entries shown: Speed power curve, Optimize with sensitivity, Delay vs size, Cell characterize, Setup & hold characterize, User specified. Results panel with a sensitivity chart (axis 0 to 4)]
Optimum
gate1.size1    1.8
gate1.size2    3.5
gate2.size1    2.4
Vt low         0.234
line2.width    0.30
line2.space    0.55
• Optimization determines best solution
• Sensitivity analysis gives designer insight into the selection
257
Evolve a Macro Library
[Diagram labels: Feasibility studies & estimation; Parameterized Macro Library; RTL; Circuit design; Library generator with auto parameterize; Layout; Tapeout]
• Executable on-line documentation
• Designs must be easily absorbed into the library
258
Tools and productivity
• Functional uarch modules
– Investigation tools and libraries
• Cross discipline optimization & Monte Carlo
• Easy database access
– Designer has same access as developer
• Full chip path extraction and visualization
• Productive design requires
– Innovation to be early
• Early innovation enabled by
– Flexible and open tools
259
Development CAD and DA
Tool Vendors
Research
- Core technologies
CAD Development
- Productize modules and sample flow
Design DA Groups
- Interface, flows and adaptations
Designers
- Special features
260
Examples
261
Chip with bias generator (BG)
150nm Communications router (ISSCC ’01)
Digital core with on-chip PMOS body bias generator (BG).
1.5 million PMOS devices
262
Distributed biasing scheme
Central Bias Generator (CBG) and Local Bias Generator (LBG)
[Figure: Placement of bias generators. Labels: CBG, Export & Import, Cross-bar, 24 LBGs]
263
Bias generation & distribution
[Figure labels: Central Bias Generator (CBG); Scaled bandgap circuit; Current mirror; Buffer; Global routing; Reference translation; 24 Local Bias Generators (LBGs); Load; Vcca; Vcca - 450 mV; Local Vcc; Local Vcc - 450 mV]
264
Routing details
[Figure labels. Global routing (to LBGs): Vcca, Vcca – 450mV, Vcca, FBB / ZBB control bit. Local routing (from LBGs): Vcc, Vcc – 450mV]
265
Router chip summary
[Die photo labels: I/O: F-Links, Digital Core, CBG, 24 LBGs, PLL, I/O: S-Links]
Die size                  10.1 x 10.1 mm² (total); 8 x 8 mm² (digital core)
Technology                150nm CMOS
Transistors               6.6 million
Body bias                 0 or 450 mV for PMOS devices in digital core
Total PMOS width biased   2.2 meters
Area overhead             2%
Power overhead            1%
Core frequency            1 GHz
Supply voltage            1.1 V with 450 mV FBB; 1.25 V with NBB
CBG – Central bias generator (836 µm x 267 µm)
LBGs – Local bias generators (156 µm x 68 µm each)
266
Dual-VT Motivation
[Plot: leakage power versus frequency for low-VT, high-VT, and dual-VT designs]
• Low-VT used in critical paths
• Achieve same frequency as all low-VT design
• Leakage power much smaller than all low-VT design
267
Dual-VT Options
Original design in 180nm single-VT technology, scaled to 130nm technology with all high-VT.
[Diagram labels: Noise constraints, parasitic estimation; Target frequency; Process specifications; Dual-VT Flow; Timing and power estimation, data analysis]
Final design in 130nm dual-VT technology
1. DVT
2. H-SDVT
3. L-SDVT
4. DVT+S
268
Dual-VT Allocation Only (DVT)
[Flow labels: Netlist: all high-VT; Target frequency; TA-DVT (H-L); New DVT netlist]
• Transistors sized for original target
• Insert low-VT to meet new target frequency
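As a rough illustration of inserting low-VT devices into an all-high-VT netlist until a new target is met, the sketch below runs a greedy swap on a five-gate example. It is not the TA-DVT tool's algorithm, which the slides do not describe; the netlist, the 0.8x delay and 10x leakage ratios, and the names critical_path, allocate_low_vt, and leakage are all hypothetical.

```python
# Greedy high-to-low VT swap on a tiny gate-level DAG (illustrative only).
LOW_VT_SPEEDUP = 0.8     # assumed: low-VT gate delay = 0.8 x high-VT delay
LOW_VT_LEAKAGE = 10.0    # assumed: low-VT gate leaks 10x a high-VT gate

# gate -> (high-VT delay, fan-in gates); entries are in topological order
NETLIST = {
    "g1": (1.0, []),
    "g2": (2.0, ["g1"]),
    "g3": (1.0, ["g1"]),
    "g4": (2.0, ["g2", "g3"]),
    "g5": (1.0, ["g3"]),
}


def gate_delay(gate, low_vt):
    base = NETLIST[gate][0]
    return base * LOW_VT_SPEEDUP if gate in low_vt else base


def critical_path(low_vt):
    """Return (delay, gates) of the slowest path through the DAG."""
    arrival, prev = {}, {}
    for gate, (_, fanin) in NETLIST.items():  # dict keeps insertion order
        worst = max(fanin, key=lambda g: arrival[g], default=None)
        start = arrival[worst] if worst is not None else 0.0
        arrival[gate] = start + gate_delay(gate, low_vt)
        prev[gate] = worst
    end = max(arrival, key=arrival.get)
    path, gate = [], end
    while gate is not None:
        path.append(gate)
        gate = prev[gate]
    return arrival[end], path[::-1]


def allocate_low_vt(target_delay):
    """Swap critical-path gates to low-VT until the target delay is met."""
    low_vt = set()
    while True:
        delay, path = critical_path(low_vt)
        candidates = [g for g in path if g not in low_vt]
        if delay <= target_delay or not candidates:
            return low_vt, delay
        # Swap the slowest remaining high-VT gate on the critical path.
        low_vt.add(max(candidates, key=lambda g: NETLIST[g][0]))


def leakage(low_vt):
    """Total leakage in units of one high-VT gate."""
    return sum(LOW_VT_LEAKAGE if g in low_vt else 1.0 for g in NETLIST)


swapped, achieved = allocate_low_vt(target_delay=4.3)
print("low-VT gates:", sorted(swapped), "delay:", achieved)
print("leakage:", leakage(swapped), "vs all low-VT:", leakage(set(NETLIST)))
```

Only gates on the critical path are swapped, so the off-path gates stay high-VT and the leakage stays well below that of an all-low-VT design.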
269
Selective LVT Insertion (H-SDVT)
[Flow labels: Netlist: all high-VT; AMPS sizing; Target frequency; TA-DVT (H-L); AMPS sizing; New DVT netlist]
• Size at target frequency
• Insert low-VT to fix critical paths
• Size to optimize slack (down-size)
270
Selective HVT Insertion (L-SDVT)
[Flow labels: Netlist: all low-VT; AMPS sizing; Target frequency; TA-DVT (L-H); AMPS sizing; New DVT netlist]
• Convert netlist to all low-VT
• Size at target frequency
• Insert high-VT on non-critical paths
• Size to optimize slack
271
Dual-VT and Sizing (DVT+S)
[Flow labels: Netlist: all high-VT; Relaxed target for sizing; AMPS sizing; Target frequency; TA-DVT (H-L); AMPS sizing; Power estimation; Pick best iteration (lowest power); New DVT netlist]
• Iterative DVT flow
• Use different amounts of sizing, low-VT to reach target
• Pick best iteration
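The iterate-and-pick-best idea of the DVT+S flow can be sketched with a toy model. Everything here is an assumption made for illustration: the quadratic sizing cost, the linear low-VT speedup, the leakage ratio, the normalized constants, and the function names. It does not represent AMPS or TA-DVT. Each iteration sizes to a relaxed target, closes the remaining timing gap with low-VT insertion, estimates power, and the lowest-power iteration wins.

```python
# Toy DVT+S sweep: try several relaxed sizing targets, close the remaining
# timing gap with low-VT insertion, and keep the lowest-power iteration.
# All models and constants are illustrative assumptions.

UNSIZED_DELAY = 1.3       # all-high-VT delay, normalized to target delay 1.0
SIZING_COST = 10.0        # assumed quadratic switching-power cost of upsizing
LOW_VT_SPEEDUP = 0.2      # assumed: delay drops 20% if every gate is low-VT
HIGH_VT_LEAKAGE = 0.1     # leakage of the all-high-VT design (normalized)
LOW_VT_LEAK_RATIO = 10.0  # assumed: low-VT gate leaks 10x a high-VT gate


def switching_power(relaxed_target):
    """Switching power after sizing the design to the relaxed delay target."""
    speedup = UNSIZED_DELAY / relaxed_target - 1.0
    return 1.0 + SIZING_COST * speedup ** 2


def low_vt_fraction(relaxed_target):
    """Fraction of gates that must become low-VT to reach delay 1.0."""
    return (1.0 - 1.0 / relaxed_target) / LOW_VT_SPEEDUP


def total_power(relaxed_target):
    """Power estimate for one iteration: switching plus leakage."""
    frac = low_vt_fraction(relaxed_target)
    leak = HIGH_VT_LEAKAGE * (1.0 + (LOW_VT_LEAK_RATIO - 1.0) * frac)
    return switching_power(relaxed_target) + leak


def best_iteration(relaxed_targets):
    """Keep iterations that low-VT can close, then pick the lowest power."""
    feasible = [r for r in relaxed_targets if low_vt_fraction(r) <= 1.0]
    return min(feasible, key=total_power)


targets = [t / 100 for t in range(100, 130, 5)]  # 1.00, 1.05, ..., 1.25
best = best_iteration(targets)
print("best relaxed target:", best, "power:", round(total_power(best), 3))
```

With these assumed constants, neither extreme wins: sizing alone (relaxed target 1.00) pays too much switching power, and low-VT alone (1.25) pays too much leakage, which is the trade-off the iterative flow is meant to explore.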
272
Tutorial summary
• Challenges for low power and high performance
– Historical device and system scaling trends
– Sub-100nm device scaling challenges
– Power delivery and dissipation challenges
– Power efficient design choices
• Circuit techniques for variation tolerance
– Short channel effects
– Adaptive circuit techniques for variation tolerance
273
Tutorial summary (contd.)
• Circuit techniques for leakage control
– Leakage power components
– Leakage power prediction and control techniques
• Full-chip power reduction techniques
– Micro-architecture innovations
– Coding techniques for interconnect power reduction
– CMOS compatible dense memory design
– Special purpose hardware
– Design methodologies & challenges for CAD
274
Power limited microprocessor integration choices
[Diagram labels. Present: General purpose units; Memory. Next decade: Adaptive general purpose units (Adapt to Process); Special purpose units; Dense Memory. Power (active and standby) management. Special purpose processing: DSP, Network processing (wired/wireless)]
275
Acknowledgements
The presenters would like to thank all the CRL team members and the Intel design and manufacturing teams for their contributions to the contents of this tutorial.
276
Bibliography (1 of 7)
• De, V.; Borkar, S.; Technology and design challenges for low power and high performance [microprocessors], Low Power Electronics and Design, 1999. Proceedings. 1999 International Symposium on, 1999, Page(s): 163-168
• Lundstrom, M.; Ren, Z.; Essential physics of carrier transport in nanoscale MOSFETs, Electron Devices, IEEE Transactions on, Volume: 49 Issue: 1, Jan. 2002, Page(s): 133-141
• Thompson, S. et al; A 90 nm logic technology featuring 50 nm strained silicon channel transistors, 7 layers of Cu interconnects, low k ILD, and 1 µm² SRAM cell, Electron Devices Meeting, 2002. IEDM '02. Digest. International, 8-11 Dec. 2002, Page(s): 61-64
• Karnik, T.; Borkar, S.; De, V.; Sub-90nm technologies--challenges and opportunities for CAD, Computer Aided Design, 2002. ICCAD 2002. IEEE/ACM International Conference on, 2002, Page(s): 203-206
• Belady, C.; Cooling and power considerations for semiconductors into the next century, Low Power Electronics and Design, International Symposium on, 2001, 6-7 Aug. 2001, Page(s): 100-105
• Karnik, T. et al; Selective node engineering for chip-level soft error rate improvement, VLSI Circuits Digest of Technical Papers, 2002. Symposium on, 13-15 June 2002, Page(s): 204-205
• Narendra, S.; De, V.; Borkar, S.; Antoniadis, D.; Chandrakasan, A.; Full chip sub-threshold leakage power prediction model for sub 0.18 µm CMOS, Low Power Electronics and Design, 2002. ISLPED '02. Proceedings of the 2002 International Symposium on, 2002, Page(s): 19-23
• Narendra, S.; Borkar, S.; De, V.; Antoniadis, D.; Chandrakasan, A.; Scaling of stack effect and its application for leakage reduction, Low Power Electronics and Design, International Symposium on, 2001, Page(s): 195-200
• Narendra, S. et al; 1.1 V 1 GHz communications router with on-chip body bias in 150 nm CMOS, Solid-State Circuits Conference, 2002. Digest of Technical Papers. ISSCC. 2002 IEEE International, Volume: 1, 2002, Page(s): 270-466 vol.1
277
Bibliography (2 of 7)
• Tschanz, J.W.; Narendra, S.; Nair, R.; De, V.; Effectiveness of adaptive supply voltage and body bias for reducing impact of parameter variations in low power and high performance microprocessors, Solid-State Circuits, IEEE Journal of, Volume: 38 Issue: 5, May 2003, Page(s): 826-829
• Tschanz, J.W. et al; Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage, Solid-State Circuits, IEEE Journal of, Volume: 37 Issue: 11, Nov. 2002, Page(s): 1396-1402
• Vangal, S. et al; 5GHz 32b integer-execution core in 130nm dual-Vt CMOS, Solid-State Circuits Conference, 2002. Digest of Technical Papers. ISSCC. 2002 IEEE International, Volume: 2, 2002, Page(s): 334-535
• Narendra, S.; Keshavarzi, A.; Bloechel, B.A.; Borkar, S.; De, V.; Forward body bias for microprocessors in 130-nm technology generation and beyond, Solid-State Circuits, IEEE Journal of, Volume: 38 Issue: 5, May 2003, Page(s): 696-701
• Somasekhar, D.; Lu, Shih-Lien; Bloechel, Bradley; Lai, Konrad; Borkar, Shekhar; De, Vivek; Planar 1T-Cell DRAM with MOS Storage Capacitors in a 130nm Logic Technology for High Density Microprocessor Caches, European Solid-State Circuits Conference, 2002, Proceedings of the 2002 International Conference on, ESSCIRC 2002, Page(s): 127-130
• Khellah, M.; Tschanz, J.; Ye, Y.; Narendra, S.; De, V.; Static pulsed bus for on-chip interconnects, VLSI Circuits Digest of Technical Papers, 2002. Symposium on, 13-15 June 2002, Page(s): 78-79
• Anders, M.; Rai, N.; Krishnamurthy, R.K.; Borkar, S.; A transition-encoded dynamic bus technique for high-performance interconnects, Solid-State Circuits, IEEE Journal of, Volume: 38 Issue: 5, May 2003, Page(s): 709-714
• Vangal, S. et al; A 5GHz Floating Point Multiply Accumulator in 90nm Dual-VT CMOS, Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International, Volume: 46, 2003, Page(s): 334-335
278
Bibliography (3 of 7)
• Hoskote, Y. et al; A 10GHz TCP Offload Accelerator for 10Gbps Ethernet in 90nm Dual-VT CMOS, Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International, Volume: 46, 2003, Page(s): 258-259
• http://www.intel.com/research/silicon/mooreslaw.htm
• G.E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, April 19, 1965.
• K.G. Kempf, “Improving Throughput across the Factory Life-Cycle,” Intel Technology Journal, Q4, 1998.
• S. Thompson, P. Packan, and M. Bohr, “MOS Scaling: Transistor Challenges for the 21st Century,” Intel Technology Journal, Q3, 1998.
• Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices, Cambridge University Press, 1998.
• D. Antoniadis and J.E. Chung, “Physics and Technology of Ultra Short Channel MOSFET Devices,” Intl. Electron Devices Meeting, pp. 21-24, 1991.
• A. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-Power CMOS Digital Design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473-484, Apr. 1992.
• Z. Chen, J. Shott, J. Burr, and J. D. Plummer, “CMOS Technology Scaling for Low Voltage Low Power Applications,” IEEE Symp. Low Power Elec., pp. 56-57, 1994.
• H.C. Poon, L.D. Yau, R.L. Johnston, D. Beecham, “DC Model for Short-Channel IGFET's,” Intl. Electron Devices Meeting, pp. 156-159, Dec. 1973.
• A. Asenov, G. Slavcheva, A.R. Brown, J.H. Davies, and S. Saini, “Increase in the Random Dopant Induced Threshold Fluctuations and Lowering in Sub-100 nm MOSFETs due to Quantum Effects: A 3-D Density-Gradient Simulation Study,” IEEE Transactions on Electron Devices, vol. 48, no. 4, pp. 722-729, April 2001.
• S. W. Sun and P. G. Y. Tsui, “Limitation of Supply Voltage Scaling by MOSFET Threshold-Voltage Variation,” Custom Integrated Circuits Conf., pp. 267-270, 1994.
279
Bibliography (4 of 7)
• D.A. Muller, T. Sorsch, S. Moccio, F.H. Baumann, K. Evans-Lutterodt, and G. Timp, “The Electronic Structure at the Atomic Scale of Ultrathin Gate Oxides,” Nature, vol. 399, pp. 758-761, June 1999.
• M. Schulz, “The End of the Road for Silicon,” Nature, vol. 399, pp. 729-730, June 1999.
• C. H. Lee, S. J. Lee, T. S. Jeon, W. P. Bai, Y. Sensaki, D. Roberts, and D. L. Kwong, “Ultra Thin ZrO(2) and Zr(27)Si(10)O(63) Gate Dielectrics Directly Prepared on Si-Substrate by Rapid Thermal Processing,” SRC Techcon, pp. 46, Sep. 2000.
• N. R. Mohapatra, M. P. Desai, S. Narendra, and V. R. Rao, “The Impact of High-K Gate Dielectrics on Sub 100 nm CMOS Circuit Performance,” IEEE Transactions on Electron Devices, To be published, 2002.
• J. Lee, G. Tarachi, A. Wei, T. A. Langdo, E. A. Fitzgerald, D. Antoniadis, “Super self-aligned double-gate (SSDG) MOSFETs utilizing oxidation rate difference and selective epitaxy,” Intl. Electron Devices Meeting, pp. 71-74, 1999.
• I. Kohno, T. Sano, N. Katoh, and K. Yano, “Threshold Canceling Logic (TCL): A Post-CMOS Logic Family Scalable Down to 0.02 µm,” Intl. Solid-State Circuits Conf., pp. 218-219, 2000.
• T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, “A 0.9-V, 150-MHz, 10-mW, 4-mm², 2-D Discrete Cosine Transform core Processor with Variable Threshold-Voltage (VT) Scheme,” IEEE J. Solid-State Circuits, vol. 31, pp. 1770-1779, Nov. 1996.
• M. Miyazaki, H. Mizuno, and K. Ishibashi, “A Delay Distribution Squeezing Scheme with Speed-Adaptive Threshold-Voltage CMOS (SA-Vt CMOS) for Low Voltage LSIs,” Intl. Symp. Low Power Electronics and Design, pp. 48-53, Aug. 1998.
• M. Miyazaki, G. Ono, T. Hattori, K. Shiozawa, K. Uchiyama, and K. Ishibashi, “A 1000-MIPS/W Microprocessor using Speed Adaptive Threshold-Voltage CMOS with Forward Bias,” Intl. Solid-State Circuits Conf., pp. 420-421, 2000.
280
Bibliography (5 of 7)
• V. De, “Forward Biased MOS Circuits,” United States Patent, Patent number: 6,166,584, Filed: June 1997, Issued: Dec. 2000.
• C. Wann, J. Harrington, R. Mih, S. Biesemans, K. Han, R. Dennard, O. Prigge, C. Lin, R. Mahnkopf, and B. Chen, “CMOS with Active Well Bias for Low-Power and RF/Analog Applications,” Symp. on VLSI Technology, pp. 158-159, 2000.
• R. Kraus, “Analysis and reduction of sense-amplifier offset,” IEEE J. Solid-State Circuits, vol. 24, no. 4, pp. 1028-1033, Aug. 1989.
• S. Narendra, D. Klowden, and V. De, “Sub-1 V Process Compensated MOS Current Generation without Voltage Reference,” Symp. on VLSI Circuits, pp. 143-144, 2001.
• Y.P. Tsividis, Operation and Modeling of The MOS Transistor, McGraw Hill, New York, 1987.
• H.C. Poon et al., Intl. Electron Devices Meeting, pp. 156-159, 1973.
• K.K. Ng, S.A. Eshraghi, and T.D. Stanik, “An improved generalized guide for MOSFET scaling,” IEEE Transactions on Electron Devices, vol. 40, pp. 1895-1897, Oct. 1993.
• K. Bowman, S. Duvall, and J. Meindl, “Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution,” Intl. Solid-State Circuits Conf., pp. 278-279, 2001.
• A. Keshavarzi, S. Narendra, B. Bloechel, S. Borkar, and V. De, “Forward Body Bias for Microprocessors in 130nm Technology Generation and Beyond,” Submitted for review, 2002 Symposium on VLSI Circuits.
• M. Haycock et al., “3.2 GHz 6.4Gb/s per Wire Signaling in 0.18 µm CMOS,” Intl. Solid-State Circuits Conf., pp. 62-63, 2001.
• R. Nair et al., “A 28.5 GB/s CMOS Non-Blocking Router for Terabits/s Connectivity between Multiple Processors and Peripheral I/O Nodes,” Intl. Solid-State Circuits Conf., pp. 224-225, 2001.
• Y. Oowaki et al., “A Sub-0.1 µm Circuit Design with Substrate-over-Biasing,” Intl. Solid-State Circuits Conf., pp. 88-89, 1998.
• http://public.itrs.net/files/1999_SIA_Roadmap/ORTC.pdf
281
Bibliography (6 of 7)
• H. Banba et al., “A CMOS Band-gap Reference Circuit with Sub-1V operation,” Symp. on VLSI Circuits, pp. 228-229, 1998.
• S. Mutoh et al., “1-V power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS,” IEEE J. Solid-State Circuits, pp. 847-854, Aug. 1995.
• A. Keshavarzi, S. Ma, S. Narendra, B. Bloechel, K. Mistry, T. Ghani, S. Borkar, V. De, “Effectiveness of reverse body bias for leakage control in scaled dual Vt CMOS ICs,” Intl. Symp. Low Power Electronics and Design, pp. 207-212, Aug. 2001.
• Y. Ye, S. Borkar, and V. De, “A Technique for Standby Leakage Reduction in High-Performance Circuits,” Symp. of VLSI Circuits, pp. 40-41, 1998.
• J.P. Halter and F. Najm, “A gate-level leakage power reduction method for ultra-low-power CMOS circuits,” Custom Integrated Circuits Conf., pp. 475-478, 1997.
• Z. Chen, M. Johnson, L. Wei, and K. Roy, “Estimation of Standby Leakage Power in CMOS Circuits Considering Accurate Modeling of Transistor Stacks,” Intl. Symp. Low Power Electronics and Design, pp. 239-244, 1998.
• L. Su, R. Schulz, J. Adkisson, K. Beyer, G. Biery, W. Cote, E. Crabbe, D. Edelstein, J. Ellis-Monaghan, E. Eld, D. Foster, R. Gehres, R. Goldblatt, N. Greco, C. Guenther, J. Heidenreich, J. Herman, D. Kiesling, L. Lin, S-H. Lo, McKenn, “A high-performance sub-0.25 µm CMOS technology with multiple thresholds and copper interconnects,” Intl. Symp. on VLSI Technology, Systems, and Applications, pp. 18-19, 1998.
• D. T. Blaauw, A. Dharchoudhury, R. Panda, S. Sirichotiyakul, C. Oh, and T. Edwards, “Emerging power management tools for processor design,” Intl. Symp. Low Power Electronics and Design, pp. 143-148, 1998.
• A. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High Performance Microprocessor Circuits, IEEE Press, pp. 46-47, 2000.
282
Bibliography (7 of 7)
• Z. Liu, C. Hu, J. Huang, T. Chan, M. Jeng, P. Ko, and Y. Cheng, “Threshold Voltage Model for Deep-Submicrometer MOSFET’s,” IEEE Transactions on Electron Devices, vol. 40, no. 1, pp. 86-95, January 1993.
• Y. Taur, “CMOS Scaling beyond 0.1 µm: how far can it go?” Intl. Symp. on VLSI Technology, Systems, and Applications, pp. 6-9, 1999.
• S. Tyagi, M. Alavi, R. Bigwood, T. Bramblett, J. Bradenburg, W. Chen, B. Crew, M. Hussein, P. Jacob, C. Kenyon, C. Lo, B. Mcintyre, Z. Ma, P. Moon, P. Nguyen, L. Rumaner, R. Schweinfurth, S. Sivakumar, M. Stettler, S. Thompson, B. Tufts, J. Xu, S. Yang, and M. Bohr, “A 130 nm Generation Logic Technology Featuring 70 nm Transistors, Dual Vt Transistors and 6 layers of Cu Interconnects,” Intl. Elec. Devices Meeting, pp. 567-570, December 2000.
• D. Dobberpuhl, “The Design of a High Performance Low Power Microprocessor,” Intl. Symp. Low Power Electronics and Design, pp. 11-16, 1996.
• E. Vittoz, “The Design of High-Performance Analog Circuits on Digital CMOS Chips,” IEEE J. Solid-State Circuits, pp. 657-665, June 1985.
• H. Banba, H. Shiga, A. Umezawa, T. Miyaba, T. Tanzawa, S. Atsumi, and K. Sakui, “A CMOS band-gap reference circuit with sub 1 V operation,” Symp. on VLSI Circuits, pp. 228-229, 1998.
• E. Vittoz et al., “CMOS analog integrated circuits based on weak inversion operations,” IEEE J. Solid-State Circuits, pp. 224-231, June 1977.
• H.J. Oguey and D. Aebischer, “CMOS current reference without resistance,” IEEE J. Solid-State Circuits, pp. 1132-1135, July 1997.
• C.H. Lee and H.J. Park, “All-CMOS temperature independent current reference,” Electronics Letters, pp. 1280-1281, July 1996.
• S. Yang, S. Ahmed, B. Arcot, R. Arghavani, P. Bai, S. Chambers, P. Charvat, R. Cotner, R. Gasser, T. Ghani, M. Hussein, C. Jan, C. Kardas, J. Maiz, P. McGregor, B. McIntyre, P. Nguyen, P. Packan, I. Post, S. Sivakumar, J. Steigerwald, “A high performance 180 nm generation logic technology,” Intl. Elec. Devices Meeting, pp. 197-200, Dec. 1998.
283