Reliability in VLSI Design


PPGEE ’08
Reliability in Nanometer Technologies –
Problems and Solutions
Dr.-Ing. Frank Sill
Department of Electrical Engineering, Federal University of Minas Gerais,
Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil
[email protected]
http://www.cpdee.ufmg.br/~frank/
Agenda

Motivation

Failures in Nanometer Technologies

Techniques to Increase Reliability

Shadow Transistors
Copyright Sill, 2008
PPGEE‘08, Reliability
Motivation

Reliability important for
• Normal user
• Companies
• Medical applications
• Cars
• Air / Space Environment
• …
Motivation
[Figure: Transistor count [Mill.] and technology node [nm] versus year, 2002 to 2008: Northwood (130 nm, 55 Mill.), Prescott (90 nm, 125 Mill.), Yonah (65 nm, 151 Mill.), Wolfdale (45 nm, 410 Mill.)]
Probability for failures increases due to:
• Increasing transistor count
• Shrinking technology
Dimensions
[Figure: Size comparison across length scales (m, cm, µm, nm), ending at a „65 nm“ transistor. Sources: „Spektrum der Wissenschaften“; Intel]
Failures in Nanometer Technologies
Process Failures
• Occur at production phase
• Based on
  – Process variations
  – Particles
  – …
Source: Mak
Sub-wavelength Lithography
[Figure: Lithography wavelength [nm] (365 nm, 248 nm, 193 nm, 13 nm EUV) and technology generation [µm] (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm) versus year, 1980 to 2020, with the gap between wavelength and generation marked. Source: Mark Bohr, Intel]
Field-dependent Aberrations
CELL_A(X1, Y1) ≠ CELL_A(X0, Y0) ≠ CELL_A(X2, Y2)
[Figure: Lens above the wafer plane of a big chip, with the same Cell A placed at (X1, Y1), (X0, Y0), and (X2, Y2). Center: minimal aberrations. Edge: high aberrations. Source: R. Pack, Cadence]
Varying Line Width
[Figure: Line width [nm] as a function of wafer position (Wafer X, Wafer Y). Source: Zhou, 2001]
Random Dopant Fluctuations
Causes Vth Variations
[Figure: Mean number of dopant atoms (10 to 10000) versus technology node (1000 nm down to 32 nm), with uniform and non-uniform dopant distributions. Source: Borkar, Intel]
Power Density
[Figure: Power density (W/cm2), 1 to 10000, versus year, 1970 to 2010, for 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P4, and Prescott, with reference levels: hot plate, nuclear reactor, rocket nozzle, Sun’s surface. Source: Moore, ISSCC 2003]
Temperature Variation
[Figure: Power map and on-die temperature]
• Power density is not uniformly distributed across the chip
• Silicon is not a good heat conductor
• Max junction temperature is determined by hot-spots
• Impact on packaging, cooling
Source: Borkar, Intel
Temperature Variation cont’d
Power4 Server Chip
Source: Devgan, ICCAD’03
Temperature Variation cont’d
[Figure: Drain current IDS [pA] and delay [s] versus temperature [°C]]
Threshold voltage Vth changes with temperature → drain-source current changes → delay changes
Source: Burleson, UMASS, 2007
Supply Voltage Drop
Source: Trester, 2005
Failures Through Increasing Delay
[Figure: Logic block between two flip-flops (FF), both driven by the clock (Clk)]
• Normal case: data are processed before the clock phase is over
• VDD↓, Temp.↑, ... → data processing takes longer than the clock phase → wrong data in the next clock phase (logic too slow!)
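The failure condition on this slide can be sketched as a simple timing check. This is only an illustration: the function name `meets_timing`, the clock period, the nominal delay, and the slowdown factor are hypothetical values chosen for the example and do not come from the slides.

```python
def meets_timing(logic_delay_ns: float, clock_period_ns: float) -> bool:
    """Return True if the logic settles before the clock phase is over."""
    return logic_delay_ns <= clock_period_ns


# Hypothetical numbers: a 1.0 ns clock period and 0.9 ns nominal logic delay.
clock_period_ns = 1.0
nominal_delay_ns = 0.9

# Assumed slowdown factor caused by a VDD drop or a temperature rise.
derating = 1.2
slow_delay_ns = nominal_delay_ns * derating

print(meets_timing(nominal_delay_ns, clock_period_ns))  # nominal case
print(meets_timing(slow_delay_ns, clock_period_ns))     # degraded case
```

In the degraded case the second flip-flop would capture stale data, which is the "wrong data in next clock phase" situation.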
Soft Errors
• Observed in the 1970s: DRAMs occasionally flip bits for no apparent reason
• Ultimately linked to alpha particles and cosmic rays
• Collisions with particles create electron-hole pairs in the substrate
• These carriers are collected on dynamic nodes, disturbing the voltage
Source: Automotive 7-8, 2004
Soft Errors cont’d
• Internal state of a node flips briefly
• Wrong data result if the error isn’t masked by:
  – Logic: wrong input doesn’t lead to wrong output
  – Electrical: pulse is attenuated by following gates
  – Timing: data based on the pulse reach the flip-flop after the clock transition
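Logical masking, the first of the three masking effects, can be illustrated with a single gate. The helper `is_logically_masked` below is a hypothetical name introduced for this sketch: it flips one input bit and checks whether the gate output changes.

```python
def and_gate(a: int, b: int) -> int:
    """2-input AND gate on single bits."""
    return a & b


def is_logically_masked(gate, inputs, flipped_index: int) -> bool:
    """Return True if flipping one input bit leaves the gate output unchanged."""
    faulty = list(inputs)
    faulty[flipped_index] ^= 1  # model the soft error as a single bit flip
    return gate(*inputs) == gate(*faulty)


# The other input is 0, so the AND output stays 0: the flip is masked.
print(is_logically_masked(and_gate, (1, 0), 0))
# The other input is 1, so the flip propagates to the output.
print(is_logically_masked(and_gate, (1, 1), 0))
```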
Electromigration
• Transport of material caused by the gradual movement of ions in a conductor
• One of the major failure mechanisms in interconnects
• Time to failure is proportional to the width and thickness of the metal lines
• Time to failure is inversely proportional to the current density
[Figure: Top view of a Metal 1 line with void, whisker, and hillock; cross-section view with Metal 1, thick oxide, and Metal 2. Source: Plusquellic, UMBC]
Electromigration cont’d
[Figure: Void in 0.45 µm Al-0.5%Cu line. Source: IMM-Bologna]
[Figure: Whiskers in Sn. Source: EPA Centre]
[Figure: Hillocks in ZnSn. Source: Ku & Lin, 2007]
Time-Dependent Dielectric Breakdown (TDDB)
• Tunneling currents → wear-out of gate oxide
• Creation of a conducting path between gate and substrate, drain, or source
• Depends on the electric field across the gate oxide, temperature (exponentially), and gate oxide thickness (exponentially)
• Also: abrupt damage due to extreme overvoltage (e.g. electrostatic discharge)
Source: Pey & Tung
Variability Trends
[Figure: % variability (0 to 70) of Vdd, Vth, performance, power, and Lgate versus technology node [nm], 90 down to 28. Source: Burleson, UMASS, 2007]
Variability Trends cont’d
[Figure: Relative soft error rate (SER) per chip (logic & memory), 0 to 150, versus technology [nm], 180 down to 16. Source: Borkar, Intel]
Variability Trends cont’d
Frequency and sub-threshold leakage variations
[Figure: Normalized frequency versus normalized leakage (Isub) for ~1000 samples at 130 nm; frequency spread ~30%, leakage power spread ~5-10X. Source: Borkar, Intel]
Variability Trends cont’d
Increasing probability for Gate-Oxide-Breakdown
[Figure: Reliability (Weibull slope β) versus gate oxide thickness [nm]. Source: Kauerauf, EDL, 2002]
[Figure: Current density Jox versus technology (180 nm, 90 nm, 45 nm, 22 nm), annotated "high-k?". Source: Borkar, Intel]
Future Designs
• 100 BT (billion transistors) integration capacity
• Billions unusable (variations)
• Some will fail over time
• Intermittent failures
Source: Borkar, Intel
Approaches to Increase Reliability
Failure Measurement
• Reliability R(t):
  – Probability of a system to perform as desired until time t
  – Example: R(tx) = 0.8 → 80 % chance that system is still running at time tx
• Mean Time To Failure MTTF:
  – Average time that a system runs until it fails
• Failure rate λ:
  – Probability that system fails in given time interval
For a constant failure rate λ:
$R(t) = e^{-\lambda t}$
$MTTF = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}$
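The relation between R(t), λ, and MTTF can be checked numerically. The sketch below assumes a constant failure rate; the value of `lam` and the integration settings are arbitrary choices for the example.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def mttf_numeric(failure_rate: float, steps: int = 200_000) -> float:
    """Approximate MTTF = integral of R(t) dt with the trapezoidal rule."""
    t_max = 40.0 / failure_rate  # R(t) is negligible beyond this point
    dt = t_max / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(t_max, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt


lam = 0.001  # hypothetical failure rate: 0.001 failures per hour
print(reliability(223.0, lam))        # close to 0.8, as in the R(tx) example
print(mttf_numeric(lam), 1.0 / lam)   # numeric integral versus 1/lambda
```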
Bathtub Failure Model
[Figure: Failure rate versus time]
• Infant mortality (1-40 weeks): declining failure rate, based on latent reliability defects
• Normal lifetime (7-15 years): constant failure rate, based on TDDB, EM, hot-electrons…
• Wearout period: increasing failure rate, based on TDDB, EM, etc.
Classification
Failure
  Permanent: defects, wearout, out-of-range parameters, EM, TDDB, ...
  Temporary
    Intermittent: process variations, infant mortality, random dopant fluctuation, ...
    Transient
      Radiation: soft errors
      Non-radiation: power supply, coupling, operation peaks
Source: Mitra, 2007
The Whole System Counts!
Triple Modular Redundancy (TMR)
[Figure: The input feeds Logic L and two copies of Logic L; their outputs A, B, and C go to a voter, which drives the output]
Triple Modular Redundancy: Voter
Hardware realization of 1-bit majority voter
OUT = AB + AC + BC
Requires 2 gate delays
A B C | OUT
1 1 0 | 1
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
: : : | :
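The voter equation OUT = AB + AC + BC can be written directly as code. The following is a small software model of the 1-bit majority voter, not a hardware description; the function name `majority_vote` is chosen for the example.

```python
from itertools import product


def majority_vote(a: int, b: int, c: int) -> int:
    """1-bit majority voter: OUT = AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


# Print the complete truth table.
print("A B C | OUT")
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "|", majority_vote(a, b, c))
```

Because the operators are bitwise, the same function also votes bit by bit on wider integer words.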
Triple Modular Redundancy cont’d
Note: For a constant module failure rate λ
[Figure: Reliability (0 to 1.0) versus time for TMR and simplex (only 1 module)]
• After a certain time, the reliability of the TMR system is lower than that of the simplex system
• Why: After some time, the probability that 2 modules are wrong is higher than the probability that 2 modules are working!
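The crossover between the TMR and simplex curves can be reproduced with a short calculation. The sketch assumes independent module failures with constant failure rate λ and a perfect voter, so that the TMR system works when at least 2 of 3 modules work: R_TMR = 3R² − 2R³. These assumptions and the value `lam = 1.0` are not stated on the slide.

```python
import math


def r_simplex(t: float, lam: float) -> float:
    """Reliability of a single module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(t: float, lam: float) -> float:
    """Reliability of 2-out-of-3 voting with a perfect voter (assumption)."""
    r = r_simplex(t, lam)
    return 3 * r**2 - 2 * r**3


lam = 1.0                     # arbitrary failure rate for the example
t_cross = math.log(2) / lam   # time at which the module reliability is 0.5
for t in (0.2, t_cross, 1.5):
    print(f"t={t:.3f}  simplex={r_simplex(t, lam):.3f}  TMR={r_tmr(t, lam):.3f}")
```

Under these assumptions both curves meet where the module reliability equals 0.5, that is at t = ln(2)/λ; before that point TMR is better, afterwards it is worse.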
Self Adaptive Design
• Extend idea of clock domains to Adaptive Power Domains
• Tackle static process variations and slowly varying timing variations
• Control VDD, Vth (indirectly by body bias), and fclk by calibration at power-on
[Figure: A test module exchanges test inputs and responses with the module and sets its fclk, VDD, and VBB]
Self Adaptive Design: Example
• 21 submodules per die
• Applying 0.5 V Forward/Reverse Body Biasing (FBB/RBB), each in steps of 32 mV
[Figure: Accepted die (0% to 100%) per frequency bin (higher frequency →) for noBB, ABB, and within-die ABB. Source: Borkar, Intel]
For given frequency and power density:
• 100% yield with ABB
• 97% in highest frequency bin with ABB for within-die variability
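A power-on calibration of the body bias could look roughly like the loop below. This is a hypothetical sketch, not the procedure from the cited work: the function `calibrate_body_bias`, the sign convention (negative = reverse bias, positive = forward bias), the search order, and the toy module model are all assumptions. Only the 0.5 V range and the 32 mV step come from the slide.

```python
from typing import Callable, Optional


def calibrate_body_bias(
    passes_test: Callable[[int], bool],
    max_bias_mv: int = 500,
    step_mv: int = 32,
) -> Optional[int]:
    """Return the lowest bias (in mV) at which the module passes its test.

    The search runs from the strongest reverse bias to the strongest forward
    bias, so the first passing setting uses as little forward bias as possible.
    """
    n_steps = max_bias_mv // step_mv
    for k in range(-n_steps, n_steps + 1):
        bias_mv = k * step_mv
        if passes_test(bias_mv):
            return bias_mv
    return None  # module cannot be calibrated within the bias range


def toy_module(bias_mv: int) -> bool:
    """Hypothetical slow module: meets the target frequency from 100 mV FBB on."""
    return bias_mv >= 100


print(calibrate_body_bias(toy_module))  # 128, the first 32 mV step above 100 mV
```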
Razor Flip-Flop
• For uncertainty- and variation-tolerant design
• Razor methodology
  – Voltage-scaling methodology based on real-time detection and correction of circuit timing errors
  – Use the actual hardware to check for errors
  – Latch the input data twice: once on the clock edge, and then a little later
  – If the data is not the same, you are going too fast
Source: Austin, Computer Magazine, 2004
Razor Flip-Flop cont’d
[Figure: Razor flip-flop between logic stage n and logic stage n+1: input D reaches the main flip-flop (clocked by CLK) through a MUX and also feeds a shadow latch (clocked by CLK_delayed); a comparator raises Error when the two stored values differ. Timing diagram of CLK, CLK_delayed, D, Error, and Q for Instr 1 and Instr 2. Source: Austin, 2004]
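The double-sampling idea can be modeled behaviorally. The sketch below is a simplified timing model, not the circuit from the cited article: the function names, the nanosecond values, and the assumption that the shadow value is restored into the main flip-flop on an error are illustrative.

```python
def sample(arrival_ns: float, sample_time_ns: float, old: int, new: int) -> int:
    """Value seen by a storage element that samples at sample_time_ns."""
    return new if arrival_ns <= sample_time_ns else old


def razor_capture(arrival_ns, clk_edge_ns, shadow_delay_ns, old, new):
    """Return (q, error) for one clock cycle of a Razor-style flip-flop."""
    main = sample(arrival_ns, clk_edge_ns, old, new)
    shadow = sample(arrival_ns, clk_edge_ns + shadow_delay_ns, old, new)
    error = main != shadow
    # Assumption: on an error the shadow value replaces the main value.
    q = shadow if error else main
    return q, error


# Data arrives before the clock edge: no error.
print(razor_capture(0.9, 1.0, 0.3, old=0, new=1))
# Data arrives after the clock edge but before the delayed clock: error flagged.
print(razor_capture(1.1, 1.0, 0.3, old=0, new=1))
```

In this model, data arriving even later than the delayed clock is missed by both elements, so the shadow delay bounds the timing errors that can be detected.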
Shadow Transistor Approach
TDDB Model
TDDB between gate and channel
[Figure: Transistor cross section with gate, gate oxide, source, and drain. Model: resistance RGC; transistor of width W split into widths W1 and W2, W = W1 + W2]
[Figure: Vout/VDD (0% to 100%) and relative delay versus RGC [kΩ] for an inverter, 65nm-BPTM]
Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.
TDDB Model cont’d
TDDB between gate and source/drain
[Figure: Transistor cross section with gate, gate oxide, source, and drain. Model: resistances RGD and RGS at a transistor of width W]
[Figure: Vout/VDD (0% to 100%) versus resistance [kΩ] for an inverter, 65nm-BPTM]
Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.
Shadow Transistors
1. Insertion of additional transistors in parallel to vulnerable transistors
→ Shadow transistors (ST)
[Figure: Vout/VDD (0% to 100%) and relative delay (0 to 10) versus RGC [kΩ], with (w/ ST) and without (wo/ ST) shadow transistors, for an inverter, 65nm-BPTM]
Shadow Transistors cont’d
2. Application of H-Vt/To transistors with:
– Higher threshold voltage
– Thicker gate oxide
→ Less vulnerable to TDDB
$MTTF \propto 10^{t_{ox}/0.22}$ (MTTF: Mean Time To Failure)
$\frac{MTTF_{H\text{-}Vt/To}}{MTTF_{L\text{-}Vt/To}} = 10^{0.15/0.22} = 4.81$
Source: Srinivasan, “RAMP: A Model for Reliability Aware Microprocessor Design”
Stathis, J., “Reliability Limits for the Gate Insulator in CMOS Technology”
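The ratio on this slide follows directly from the exponential dependence of MTTF on oxide thickness. A minimal check, assuming tox is given in nm and using a hypothetical helper name `mttf_ratio`:

```python
def mttf_ratio(delta_tox_nm: float, nm_per_decade: float = 0.22) -> float:
    """MTTF gain for a gate oxide that is delta_tox_nm thicker.

    Uses MTTF ~ 10^(tox / 0.22), so only the thickness difference matters.
    """
    return 10 ** (delta_tox_nm / nm_per_decade)


print(round(mttf_ratio(0.15), 2))  # 4.81
```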
Shadow Transistors cont’d
3. Selective insertion of shadow transistors in parallel to vulnerable transistors:
– Component reliability depends on activity, state, temperature, size, fabrication …
→ Most vulnerable can be identified
New Approach
• Estimation of stress factors
• Determination of component reliability
• Adding redundancy only at most vulnerable components
• Advantage: Lower area, power and delay penalty compared to complete redundancy or random insertion [Sri04]
[Figure: Netlist modification; shadow transistors only added in parallel to most vulnerable devices]
Source: [Sri04] Sirisantana, D&T, 2004
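The selection step can be sketched as a ranking problem. Everything in the example below is hypothetical: the `Transistor` fields, the `stress_score` formula (activity times temperature), and the device names are placeholders, since the slides do not give the actual stress model.

```python
from dataclasses import dataclass


@dataclass
class Transistor:
    name: str
    activity: float       # assumed switching activity, 0..1
    temperature_c: float  # assumed local temperature in °C


def stress_score(t: Transistor) -> float:
    """Placeholder stress estimate; a real flow would use a TDDB model."""
    return t.activity * t.temperature_c


def select_for_shadowing(transistors, fraction: float):
    """Return the names of the most stressed devices, up to the given fraction."""
    count = int(len(transistors) * fraction)
    ranked = sorted(transistors, key=stress_score, reverse=True)
    return [t.name for t in ranked[:count]]


netlist = [
    Transistor("M1", 0.1, 60.0),
    Transistor("M2", 0.9, 85.0),
    Transistor("M3", 0.5, 70.0),
    Transistor("M4", 0.3, 90.0),
]
print(select_for_shadowing(netlist, 0.25))  # only the most stressed device
print(select_for_shadowing(netlist, 0.5))
```

Only the returned devices would receive a parallel shadow transistor in the modified netlist.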
Shadow Transistors cont’d
Advantages
• Increased reliability with respect to TDDB
• H-Vt/To: Reliability increases by ~5x (for Δtox = 0.15 nm)
• Remarkable increase of system lifetime
Drawbacks
• Higher input capacitance → higher delay and dynamic power dissipation
• Area increase
Remarks
• Only slight improvements for Gate-Drain/Source breakdown
• H-Vt/To has to be supported by technology
ST – Improvement MTTF
[Figure: Improvement of MTTF as regards TDDB (0% to 20%) for benchmark circuits c17, c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, c7552, comparing our algorithm with random insertion. Insertion of L-Vt/To shadow transistors, ≈ 23 % additional transistors]
ST – Improvement MTTF (H-Vt/To)
[Figure: Improvement of MTTF as regards TDDB (0% to 250%) for benchmark circuits c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, c7552, for SPth = 30 and SPth = 55. Insertion of H-Vt/To shadow transistors]
Take Home Messages
• Integrated circuits face several kinds of failures
• Decreasing structure sizes create more failure sources
• Future designs should (have to) be failure tolerant
• Possible approaches:
  – Triple Modular Redundancy (TMR)
  – Self-Adapting Designs
  – Razor Flip-Flops
  – Shadow Transistors
There’s still a lot to do!
Thank you!
[email protected]