Reliability in VLSI Design
PPGEE ’08
Reliability in Nanometer Technologies –
Problems and Solutions
Dr.-Ing. Frank Sill
Department of Electrical Engineering, Federal University of Minas Gerais,
Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil
[email protected]
http://www.cpdee.ufmg.br/~frank/
Agenda
Motivation
Failures in Nanometer Technologies
Techniques to Increase Reliability
Shadow Transistors
Copyright Sill, 2008
PPGEE‘08, Reliability
Motivation
Reliability important for
Normal user
Companies
Medical applications
Cars
Air / Space Environment
…
Motivation
[Figure: transistor count [Mill.] and technology [nm] versus year, 2002 to 2008: Northwood (130 nm, 55 Mill.), Prescott (90 nm, 125 Mill.), Yonah (65 nm, 151 Mill.), Wolfdale (45 nm, 410 Mill.)]
Probability for failures increases due to:
Increasing transistor count
Shrinking technology
Dimensions
[Figure: length scales from m through cm, mm, and µm down to nm, ending at a „65 nm“ transistor. Sources: „Spektrum der Wissenschaften“; Intel]
Failures in Nanometer Technologies
Process Failures
Occur at production phase
Based on
Process Variations
Particles
…
Source: Mak
Sub-wavelength Lithography
[Figure: lithography wavelength [nm] (365 nm, 248 nm, 193 nm, 13 nm EUV) and technology generation (180 nm down to 32 nm) versus year, 1980 to 2020; a gap opens where the generation falls below the wavelength]
Source: Mark Bohr, Intel
Field-dependent Aberrations
CELL_A(X1, Y1) ≠ CELL_A(X0, Y0) ≠ CELL_A(X2, Y2)
[Figure: the same Cell A placed at three positions (X1, Y1), (X0, Y0), and (X2, Y2) of a big chip on the wafer plane, below the lens]
Center: Minimal aberrations
Edge: High aberrations
Source: R. Pack, Cadence
Varying Line Width
[Figure: line width [nm] over wafer position (Wafer X, Wafer Y)]
Source: Zhou, 2001
Random Dopant Fluctuations
Causes Vth variations
[Figure: mean number of dopant atoms (log scale, 10 to 10000) versus technology node (1000 nm down to 32 nm); panels: uniform and non-uniform dopant distribution]
Source: Borkar, Intel
Power Density
[Figure: power density (W/cm2, log scale, 1 to 10000) versus year, 1970 to 2010, for the processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P4, and Prescott, compared with a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface]
Source: Moore, ISSCC 2003
Temperature Variation
[Figures: power map; on-die temperature]
Power density is not uniformly distributed across the chip
Silicon is not a good heat conductor
Max junction temperature is determined by hot-spots
Impact on packaging, cooling
Source: Borkar, Intel
Temperature Variation cont’d
Power4 Server Chip
Source: Devgan, ICCAD’03
Temperature Variation cont’d
[Figure: delay [s] and drain current IDS [pA] versus temperature [°C]]
Threshold voltage Vth changes with temperature → drain-source current changes → delay changes
Source: Burleson, UMASS, 2007
Supply Voltage Drop
Source: Trester, 2005
Failures Through Increasing Delay
[Figure: FF → Logic → FF, both flip-flops driven by the clock (Clk)]
Normal operation: data are processed before the clock phase is over
VDD↓, Temp.↑, ... → Logic too slow!
→ Data processing takes longer than the clock phase
→ Wrong data in next clock phase!
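The failure mechanism above can be sketched as a simple setup check between two flip-flops. The parameter names (`clock_period`, `t_clk_to_q`, `t_logic`, `t_setup`) and all numbers are illustrative assumptions, not values from the slides.

```python
def meets_timing(clock_period, t_clk_to_q, t_logic, t_setup):
    """Return True if data launched by one flip-flop reaches the next in time."""
    return t_clk_to_q + t_logic + t_setup <= clock_period


# Hypothetical numbers in nanoseconds.
CLOCK_PERIOD = 1.0
T_CLK_TO_Q = 0.1
T_SETUP = 0.05

# Nominal logic delay of 0.8 ns fits into the clock period.
nominal_ok = meets_timing(CLOCK_PERIOD, T_CLK_TO_Q, 0.8, T_SETUP)
# Lower VDD or higher temperature slows the logic, here by an assumed 20 %.
slowed_ok = meets_timing(CLOCK_PERIOD, T_CLK_TO_Q, 0.8 * 1.2, T_SETUP)
print(nominal_ok, slowed_ok)
```

In this sketch the slowed path misses the clock edge, so the second flip-flop would capture wrong data.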
Soft Errors
Source: Automotive 7-8, 2004
Observed in the 1970s: DRAMs occasionally flip bits for no apparent reason
Ultimately linked to alpha particles and cosmic rays
Collisions with particles create electron-hole pairs in substrate
These carriers are collected on dynamic nodes, disturbing the voltage
Soft Errors cont’d
Internal state of a node flips briefly
If the error isn’t masked by
Logic: Wrong input doesn’t lead to wrong output
Electrical: Pulse is attenuated by following gates
Timing: Data based on pulse reach flip-flop after clock transition
→ wrong data
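Logical masking can be illustrated with a small enumeration. The circuit `out = (a AND b) OR c` and the function names are hypothetical; the sketch only counts the input vectors for which a brief flip of the internal AND output never reaches the output.

```python
from itertools import product


def circuit(a, b, c, flip_internal=False):
    """out = (a AND b) OR c, with an optional transient flip of the AND output."""
    n = a & b
    if flip_internal:
        n ^= 1  # soft error: the internal node flips briefly
    return n | c


masked = 0
total = 0
for a, b, c in product((0, 1), repeat=3):
    total += 1
    if circuit(a, b, c) == circuit(a, b, c, flip_internal=True):
        masked += 1  # the wrong internal value does not change the output
print(f"{masked} of {total} input vectors mask the flip logically")
```

Here the flip is masked whenever `c` is 1, because the OR gate output no longer depends on the internal node.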
Electromigration
Electromigration:
Transport of material caused by the gradual movement of ions in a conductor
One of the major failure mechanisms in interconnects
Interconnect lifetime is proportional to the width and thickness of the metal lines
and inversely proportional to the current density
[Figure: top view of a Metal 1 line with void, whisker, and hillock; cross-section view with Metal 1, thick oxide, and Metal 2]
Source: Plusquellic, UMBC
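The dependence on current density is commonly modeled with Black's equation, which is not shown on the slide and is added here only as a hedged illustration. $J$ is the current density, $T$ the temperature, and $k$ the Boltzmann constant; $A$, $n$, and $E_a$ are fitted, technology-dependent parameters.

```latex
\mathrm{MTTF} = \frac{A}{J^{n}} \exp\!\left(\frac{E_a}{k\,T}\right)
```

With $n = 1$ this gives the inverse proportionality to current density stated above. A wider or thicker line lowers $J$ for the same current, which is consistent with the dependence on line width and thickness.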
Electromigration cont’d
Void in 0.45 µm Al-0.5%Cu line
Source: IMM-Bologna
Whiskers in Sn
Source: EPA Centre
Hillocks in ZnSn
Source: Ku&Lin,2007
Time-Dependent Dielectric Breakdown (TDDB)
Tunneling currents
Wear out of gate oxide
Creation of conducting path between Gate and Substrate, Drain, Source
Depending on electrical field over gate oxide, temperature (exp.), and gate oxide thickness (exp.)
Also: abrupt damage due to extreme overvoltage (e.g. ElectroStatic Discharge)
Source: Pey&Tung
Variability Trends
[Figure: % variability (0 to 70) of Vdd, Vth, performance, power, and Lgate versus technology node [nm] (90 down to 28)]
Source: Burleson, UMASS, 2007
Variability Trends cont’d
[Figure: relative soft error rate (SER) per chip (logic & memory) versus technology [nm] (180 down to 16)]
Source: Borkar, Intel
Variability Trends cont’d
Frequency and sub-threshold leakage variations
[Figure: normalized frequency (0.9 to 1.4) versus normalized leakage Isub (1 to 5), ~1000 samples in 130nm: frequency spread ~30%, leakage power spread ~5-10X]
Source: Borkar, Intel
Variability Trends cont’d
Increasing probability for Gate-Oxide-Breakdown
[Figure: reliability (Weibull slope β) versus gate oxide thickness [nm]. Source: Kauerauf, EDL, 2002]
[Figure: current density Jox (log scale) versus technology (180 nm, 90 nm, 45 nm, 22 nm), annotated "high-k?". Source: Borkar, Intel]
Future Designs
100 BT (100 billion transistors) integration capacity
Billions unusable (variations)
Some will fail over time
Intermittent failures
Source: Borkar, Intel
Approaches to Increase Reliability
Failure Measurement
Reliability R(t):
– Probability of a system to perform as desired until time t
– Example: R(tx) = 0.8 → 80 % chance that system is still running at time tx
Mean Time To Failure MTTF:
– Average time that a system runs until it fails
Failure rate λ:
– Probability that system fails in given time interval
$R(t) = e^{-\lambda t}$
$MTTF = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}$
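For a constant failure rate λ, the relation between R(t) and MTTF can be checked numerically. This is a minimal sketch; the value of `LAMBDA`, the integration limit, and the function names are assumptions chosen for illustration.

```python
import math


def reliability(t, failure_rate):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def mttf_numeric(failure_rate, t_max, steps=100_000):
    """Approximate MTTF as the integral of R(t) from 0 to t_max (trapezoidal rule)."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(t_max, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt


LAMBDA = 0.1  # assumed failure rate, in failures per year
mttf_closed = 1.0 / LAMBDA  # closed form: MTTF = 1 / lambda
mttf_approx = mttf_numeric(LAMBDA, t_max=200.0)
print(mttf_closed, round(mttf_approx, 4))
print(round(reliability(mttf_closed, LAMBDA), 3))  # reliability left at t = MTTF
```

Note that at t = MTTF the reliability has already dropped to e^-1, so only about 37 % of such systems are still running at that point.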
Bathtub Failure Model
[Figure: failure rate versus time]
Infant mortality (1-40 weeks): declining failure rate, based on latent reliability defects
Normal lifetime (7-15 years): constant failure rate, based on TDDB, EM, hot-electrons…
Wearout period: increasing failure rate, based on TDDB, EM, etc.
Classification
Failure
  Permanent: Defects, wearout, out of range parameters, EM, TDDB ...
  Temporary
    Intermittent: Process variations, infant mortality, random dopant fluctuation, ...
    Transient
      Radiation: Soft errors
      Non-Radiation: Power supply, coupling, operation peaks
Source: Mitra, 2007
The Whole System Counts!
Triple Module Redundancy (TMR)
[Figure: the input feeds Logic L and two copies of Logic L; their outputs A, B, and C go to a voter that produces the output]
Triple Module Redundancy: Voter
Hardware realization of 1-bit majority voter
OUT = AB + AC + BC
Requires 2 gate delays
A  B  C  | OUT
1  1  0  | 1
0  0  1  | 0
0  1  0  | 0
0  1  1  | 1
:  :  :  | :
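The voter equation OUT = AB + AC + BC can be written down directly and enumerated over all inputs. This is a minimal software sketch of the logic function, not a model of the gate-level timing.

```python
from itertools import product


def majority(a, b, c):
    """1-bit majority voter: OUT = AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


# Full truth table as (A, B, C, OUT) tuples.
truth_table = [(a, b, c, majority(a, b, c)) for a, b, c in product((0, 1), repeat=3)]
for row in truth_table:
    print(*row)
```

The output follows any two agreeing inputs, so a single wrong module is outvoted.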
Triple Module Redundancy cont’d
Note: For a constant module failure rate
[Figure: reliability (0 to 1.0) versus time for TMR and simplex (only 1 module)]
After a certain time, the reliability of the TMR system is lower than that of the simplex system
Why: After some time, the probability that 2 modules are wrong is higher than the probability that 2 modules are working!
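The crossover can be reproduced with a short calculation. The sketch assumes three independent modules with the same constant failure rate and a perfect voter, so that the TMR system works when at least 2 of 3 modules work: R_TMR = 3R² − 2R³. The value of `LAMBDA` is arbitrary.

```python
import math


def r_simplex(t, failure_rate):
    """Reliability of a single module with constant failure rate."""
    return math.exp(-failure_rate * t)


def r_tmr(t, failure_rate):
    """At least 2 of 3 independent modules work; the voter is assumed perfect."""
    r = r_simplex(t, failure_rate)
    return 3 * r**2 - 2 * r**3


LAMBDA = 1.0  # assumed failure rate, in failures per time unit
t_cross = math.log(2) / LAMBDA  # module reliability equals 0.5 at this time

for t in (0.1, t_cross, 2.0):
    print(f"t={t:.3f}  simplex={r_simplex(t, LAMBDA):.3f}  TMR={r_tmr(t, LAMBDA):.3f}")
```

Under these assumptions TMR is better while the module reliability is above 0.5, equal at t = ln(2)/λ, and worse afterwards.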
Self Adaptive Design
Extend idea of clock domains to Adaptive Power Domains
Tackle static process and slowly varying timing variations
Control VDD, Vth (indirectly by body bias), fclk by calibration at power-on
[Figure: a test module applies test inputs to the module, evaluates its responses, and sets fclk, VDD, and VBB]
Self Adaptive Design: Example
21 submodules per die
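Applying 0.5 V Forward/Reverse Body Biasing (FBB/RBB) in steps of 32 mV, respectively
[Figure: accepted dies (0% to 100%) per frequency bin for no body bias (noBB), adaptive body bias (ABB), and within-die ABB; ABB reaches 100% yield, within-die ABB puts 97% of the dies in the highest bin]
Source: Borkar, Intel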
For given Freq and Power density
100% yield with ABB
97% highest freq bin with ABB for within die variability
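A power-on calibration of this kind could be sketched as a simple search over the bias steps. The function `calibrate_body_bias` and the callback `passes_test` are hypothetical and only use the 0.5 V range and 32 mV step from the example; a real controller would also weigh leakage power, which this sketch ignores.

```python
def calibrate_body_bias(passes_test, max_bias_mv=500, step_mv=32):
    """Return the smallest forward body bias (in mV) at which the module passes.

    passes_test is a hypothetical callback that applies the test inputs at
    the given bias and compares the responses with the expected ones.
    Returns None if no bias up to max_bias_mv is sufficient.
    """
    for bias_mv in range(0, max_bias_mv + 1, step_mv):
        if passes_test(bias_mv):
            return bias_mv
    return None


# Hypothetical module that meets its target frequency from 100 mV on.
chosen = calibrate_body_bias(lambda bias_mv: bias_mv >= 100)
print(chosen)
```

Choosing the smallest sufficient bias is one possible policy, picked here because stronger forward bias generally raises leakage.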
Razor Flip-Flop
For uncertainty- and variation-tolerant design
Razor methodology
Voltage-scaling methodology based on real-time detection and correction of circuit timing errors
Use the actual hardware to check for errors
Latch the input data twice:
Once on the clock edge, and then a little later
If the data is not the same, you are going too fast
Source: Austin, Computer Magazine, 2004
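The double-sampling idea can be sketched behaviorally. The function `razor_sample`, its parameters, and the times below are illustrative assumptions; the model only captures that data arriving after a sampling point is missed there.

```python
def razor_sample(arrival_time, new_value, old_value, clk_edge, shadow_delay):
    """Model one clock cycle of a Razor flip-flop.

    The main flip-flop samples at clk_edge, the shadow latch at
    clk_edge + shadow_delay. Data that arrives after a sampling point is
    missed there, so the stale old_value is captured instead.
    """
    main = new_value if arrival_time <= clk_edge else old_value
    shadow = new_value if arrival_time <= clk_edge + shadow_delay else old_value
    error = main != shadow  # comparator output
    corrected = shadow if error else main  # restore from the shadow latch
    return main, shadow, error, corrected


# Hypothetical times in ns: clock edge at 1.0, shadow latch 0.3 later.
on_time = razor_sample(0.9, new_value=1, old_value=0, clk_edge=1.0, shadow_delay=0.3)
too_late = razor_sample(1.2, new_value=1, old_value=0, clk_edge=1.0, shadow_delay=0.3)
print(on_time)
print(too_late)
```

In the late case the two samples differ, the error is flagged, and the value from the shadow latch is used. The scheme relies on the data always arriving before the delayed sampling point.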
Razor Flip-Flop cont’d
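[Figure: Razor flip-flop between logic stage n and logic stage n+1: input D is sampled by the main flip-flop (clocked by CLK) and by the shadow latch (clocked by CLK_delayed); a comparator raises Error when the two values differ, and a MUX in front of the main flip-flop selects the shadow-latch value]
[Timing diagram: CLK, CLK_delayed, D, Error, and Q for Instr 1 and Instr 2]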
Source: Austin, 2004
Shadow Transistor Approach
TDDB model
TDDB between gate and channel
Model: resistance RGC between gate and channel; transistor of width W split into widths W1 and W2, W = W1 + W2
[Figure: for an inverter, 65nm-BPTM: Vout/VDD (0% to 100%) and relative delay (0 to 20) versus RGC [kΩ]]
Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.
TDDB Model cont’d
TDDB between gate and source/drain
Model: resistances RGD and RGS between gate and drain/source; transistor of width W
[Figure: for an inverter, 65nm-BPTM: Vout/VDD (0% to 100%) versus RGC [kΩ]]
Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.
Shadow Transistors
1. Insertion of additional transistors in parallel to vulnerable transistors
→ Shadow transistors (ST)
[Figure: for an inverter, 65nm-BPTM: Vout/VDD (0% to 100%) and relative delay (0 to 10) versus RGC [kΩ], with (w/ ST) and without (wo/ ST) shadow transistors]
Shadow Transistors cont’d
2. Application of H-Vt/To transistors with:
– Higher threshold voltage
– Thicker gate oxide
→ Less vulnerable to TDDB
$MTTF \propto 10^{t_{ox}/0.22}$ (MTTF – Mean Time To Failure)
$\frac{MTTF_{H\text{-}Vt/To}}{MTTF_{L\text{-}Vt/To}} = 10^{0.15/0.22} \approx 4.81$
Source: Srinivasan, “RAMP: A Model for Reliability Aware Microprocessor Design”
Stathis, J., “Reliability Limits for the Gate Insulator in CMOS Technology”
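The MTTF ratio can be reproduced with a few lines. The function name `mttf_gain` and its parameter names are assumptions; the constants 0.22 and 0.15 (oxide thickness in nm) are the ones used on the slide.

```python
def mttf_gain(delta_tox_nm, slope_nm_per_decade=0.22):
    """Relative MTTF gain for a gate oxide that is thicker by delta_tox_nm.

    Follows MTTF ~ 10^(tox / 0.22): every 0.22 nm of extra oxide
    multiplies the MTTF by 10.
    """
    return 10 ** (delta_tox_nm / slope_nm_per_decade)


gain = mttf_gain(0.15)
print(f"{gain:.2f}")  # prints 4.81
```

Because the dependence is exponential, the same formula gives a factor of 10 for an oxide that is 0.22 nm thicker.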
Shadow Transistors cont’d
3. Selective insertion of shadow transistors in parallel to vulnerable transistors:
– Component reliability depends on activity, state, temperature, size, fabrication …
→ Most vulnerable can be identified
New approach:
Estimation of stress factors
Determination of components’ reliability
Adding redundancy only at most vulnerable components
Advantage: Lower area, power and delay penalty compared to complete redundancy or random insertion [Sri04]
[Figure: netlist modification; shadow transistors only added in parallel to most vulnerable devices]
Source: [Sri04] Sirisantana, D&T, 2004
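The selection step might look like the following sketch. It is not the algorithm from the slides: the stress scores, the `fraction` parameter, the `_st` suffix, and both function names are hypothetical, and a netlist is reduced to a plain list of transistor names.

```python
def select_vulnerable(stress, fraction):
    """Return the names of the most stressed transistors.

    stress maps a transistor name to a hypothetical stress score
    (for example combining activity, state, temperature, and size).
    """
    count = round(len(stress) * fraction)
    ranked = sorted(stress, key=stress.get, reverse=True)
    return ranked[:count]


def add_shadow_transistors(netlist, vulnerable):
    """Add a parallel copy (suffix '_st') for each vulnerable transistor."""
    shadows = [name + "_st" for name in netlist if name in vulnerable]
    return netlist + shadows


# Hypothetical stress scores for a four-transistor netlist.
stress = {"M1": 0.9, "M2": 0.2, "M3": 0.6, "M4": 0.1}
vulnerable = select_vulnerable(stress, fraction=0.5)
modified = add_shadow_transistors(list(stress), vulnerable)
print(vulnerable)
print(modified)
```

Only the two most stressed devices receive a shadow transistor here, which keeps the area overhead below that of duplicating every transistor.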
Shadow Transistors cont’d
Advantages
Increased reliability with respect to TDDB
H-Vt/To: Reliability increases by ~5x (for Δtox = 0.15 nm)
Remarkable increase of system lifetime
Drawbacks
Higher input capacitance → higher delay and dynamic power dissipation
Area increase
Remarks
Only slight improvements for Gate-Drain/Source breakdown
H-Vt/To has to be supported by technology
ST – Improvement MTTF
[Figure: improvement of MTTF with regard to TDDB (0% to 20%) for the circuits c17, c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, and c7552, comparing our algorithm with random insertion; ≈ 23 % additional transistors]
Insertion of L-Vt/To Shadow Transistors
ST – Improvement MTTF (H-Vt/To)
[Figure: improvement of MTTF with regard to TDDB (0% to 250%) for the circuits c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, and c7552, for SPth = 30 and SPth = 55]
Insertion of H-Vt/To Shadow Transistors
Take Home Messages
Integrated circuits face several kinds of failures
Decreasing structure sizes create more failure sources
Future designs should (have to) be failure tolerant
Possible approaches:
Triple Module Redundancy (TMR)
Self-Adapting Designs
Razor Flip-Flops
Shadow Transistors
There’s still a lot to do!
Thank you!
[email protected]