Chapter 3 - Fault

Download Report

Transcript Chapter 3 - Fault

Chapter 3
Fault-Tolerant Design
EE141
System-on-Chip
Test Architectures
1
Ch. 3 - Fault-Tolerant Design - P. 1
What is this chapter about?

Gives Overview of Fault-Tolerant Design

Focus on




Basic Concepts in Fault-Tolerant Design
Metrics Used to Specify and Evaluate Dependability
Review of Coding Theory
Fault-Tolerant Design Schemes
– Hardware Redundancy
– Information Redundancy
– Time Redundancy
 Examples of Fault-Tolerant Applications in Industry
EE141
System-on-Chip
Test Architectures
2
Ch. 3 - Fault-Tolerant Design - P. 2
Fault-Tolerant Design
Introduction
 Fundamentals of Fault Tolerance
 Fundamentals of Coding Theory
 Fault Tolerant Schemes
 Industry Practices
 Concluding Remarks

EE141
System-on-Chip
Test Architectures
3
Ch. 3 - Fault-Tolerant Design - P. 3
Introduction

Fault Tolerance
 Ability of system to continue error-free operation in
presence of unexpected fault

Important in mission-critical applications
 E.g., medical, aviation, banking, etc.
 Errors very costly

Becoming important in mainstream applications
 Technology scaling causing circuit behavior to
become less predictable and more prone to failures
 Needing fault tolerance to keep failure rate within
acceptable levels
EE141
System-on-Chip
Test Architectures
4
Ch. 3 - Fault-Tolerant Design - P. 4
Faults

Permanent Faults
 Due to manufacturing defects, early life failures,
wearout failures
 Wearout failures due to various mechanisms
– e.g., electromigration, hot carrier degradation, dielectric
breakdown, etc.

Temporary Faults
 Only present for short period of time
 Caused by external disturbance or marginal design
parameters
EE141
System-on-Chip
Test Architectures
5
Ch. 3 - Fault-Tolerant Design - P. 5
Temporary Faults
 Transient
Errors (Non-recurring errors)
 Cause by external disturbance
– e.g., radiation, noise, power disturbance, etc.
 Intermittent
Errors (Recurring errors)
 Cause by marginal design parameters
 Timing problems
– e.g., races, hazards, skew
 Signal integrity problems
– e.g., crosstalk, ground bounce, etc.
EE141
System-on-Chip
Test Architectures
6
Ch. 3 - Fault-Tolerant Design - P. 6
Redundancy
 Fault
Tolerance requires some form of
redundancy
 Time Redundancy
 Hardware Redundancy
 Information Redundancy
EE141
System-on-Chip
Test Architectures
7
Ch. 3 - Fault-Tolerant Design - P. 7
Time Redundancy
 Perform




Same Operation Twice
See if get same result both times
If not, then fault occurred
Can detect temporary faults
Cannot detect permanent faults
– Would affect both computations
 Advantage
 Little to no hardware overhead
 Disadvantage
 Impacts system or circuit performance
EE141
System-on-Chip
Test Architectures
8
Ch. 3 - Fault-Tolerant Design - P. 8
Hardware Redundancy
 Replicate
hardware and compare outputs
 From two or more modules
 Detects both permanent and temporary faults
 Advantage
 Little or no performance impact
 Disadvantage
 Area and power for redundant hardware
EE141
System-on-Chip
Test Architectures
9
Ch. 3 - Fault-Tolerant Design - P. 9
Information Redundancy
 Encode
outputs with error detecting or
correcting code
 Code selected to minimize redundancy for
class of faults
 Advantage
 Less hardware to generate redundant
information than replicating module
 Drawback
 Added complexity in design
EE141
System-on-Chip
Test Architectures
10
Ch. 3 - Fault-Tolerant Design - P. 10
Failure Rate
 (t)
= Component failure rate
 Measured in FITS (failures per 109 hours)
Failure rate
Infant
mortality
Working life
Wearout
Overall curve
Random failures
Early
failures
Wearout
failures
Time
EE141
System-on-Chip
Test Architectures
11
Ch. 3 - Fault-Tolerant Design - P. 11
System Failure Rate
 System
constructed from components
 No Fault Tolerance
 Any component fails, whole system fails
k
sys   c ,i
i 1
EE141
System-on-Chip
Test Architectures
12
Ch. 3 - Fault-Tolerant Design - P. 12
Reliability
 If
component working at time 0
 R(t) = Probability still working at time t
 Exponential
Failure Law
 If failure rate assumed constant
– Good approximation if past infant mortality period
R(t )  e
EE141
System-on-Chip
Test Architectures
 t
13
Ch. 3 - Fault-Tolerant Design - P. 13
Reliability for Series System
 Series
System
 All components need to work for system to
work
A
B
C
Rsys  RA RB RC
EE141
System-on-Chip
Test Architectures
14
Ch. 3 - Fault-Tolerant Design - P. 14
System Reliability with Redundancy
 System
reliability with component B in
Parallel
 Can tolerate one component B failing
B
A
C
B


Rsys  RA 1  (1  RB ) RC  RA (2RB  R )RC
EE141
System-on-Chip
Test Architectures
2
2
B
15
Ch. 3 - Fault-Tolerant Design - P. 15
Mean-Time-to-Failure (MTTF)
 Average
time before system fails
 Equal to area under reliability curve

MTTF   R (t )dt
0
 For
Exponential Failure Law

MTTF   e dt 
0
EE141
System-on-Chip
Test Architectures
 t
1

16
Ch. 3 - Fault-Tolerant Design - P. 16
Maintainability
 If
system failed at time 0
 M(t) = Probability repaired and operational
at time t
 System
repair time divided into
 Passive repair time
– Time for service engineer to travel to site
 Active repair time
– Time to locate failing component,
repair/replace, and verify system operational
– Can be improved through designing system so
easy to locate failed component and verify
EE141
System-on-Chip
Test Architectures
17
Ch. 3 - Fault-Tolerant Design - P. 17
Repair Rate and MTTR

= rate at which system repaired
 Analogous to failure rate 
 Maintainability
often modeled as
M (t )  1  e
 Mean-Time-to-Repair
EE141
System-on-Chip
Test Architectures
 t
(MTTR) = 1/
18
Ch. 3 - Fault-Tolerant Design - P. 18
Availability
S
1
0
t0
Normal system operation
t1 t2
t3
t4
t
failures
 System Availability
 Fraction of time system is operational
MTTF
system ava ilability 
MTTF  MTTR
EE141
System-on-Chip
Test Architectures
19
Ch. 3 - Fault-Tolerant Design - P. 19
Availability
 Telephone
Systems
 Required to have system availability of
0.9999 (“four nines”)
 High-Reliability
Systems
 May require 7 or more nines
 Fault-Tolerant
Design
 Needed to achieve such high availability
from less reliable components
EE141
System-on-Chip
Test Architectures
20
Ch. 3 - Fault-Tolerant Design - P. 20
Coding Theory
 Coding
 Using more bits than necessary to
represent data
 Provides way to detect errors
– Errors occur when bits get flipped
 Error




Detecting Codes
Many types
Detect different classes of errors
Use different amounts of redundancy
Ease of encoding and decoding data varies
EE141
System-on-Chip
Test Architectures
21
Ch. 3 - Fault-Tolerant Design - P. 21
Block Code
 Message
= Data Being Encoded
 Block code
 Encodes m messages with n-bit codeword
log 2 m 
redundancy  1 
n
 If
no redundancy
 m messages encoded with log2(m) bits
 minimum possible
EE141
System-on-Chip
Test Architectures
22
Ch. 3 - Fault-Tolerant Design - P. 22
Block Code
 To
detect errors, some redundancy
needed
 Space of distinct 2n blocks partitioned into
codewords and non-codewords
 Can
detect errors that cause codeword
to become non-codeword
 Cannot detect errors that cause
codeword to become another codeword
EE141
System-on-Chip
Test Architectures
23
Ch. 3 - Fault-Tolerant Design - P. 23
Separable Block Code
 Separable
 n-bit blocks partitioned into
– k information bits directly representing message
– (n-k) check bits
 Denoted (n,k) Block Code
 Advantage
 k-bit message directly extracted without
decoding
 Rate
of Separable Block Code = k/n
EE141
System-on-Chip
Test Architectures
24
Ch. 3 - Fault-Tolerant Design - P. 24
Example of Separable Block Code
 (4,3)
Parity Code
 Check bit is XOR of 3 message bits
 message 101  codeword 1010
 Single
Bit Parity
log2 m
log2 (2k )
k nk 1
redundancy 1 
1
1 

n
n
n
n
n
k n 1
rate  
n
n
EE141
System-on-Chip
Test Architectures
25
Ch. 3 - Fault-Tolerant Design - P. 25
Example of Non-Separable Block Code
 One-Hot
Code
 Each Codeword has single 1
 Example of 8-bit one-hot
– 10000000, 01000000, 00100000, 00010000
00001000, 00000100, 00000010, 00000001
 Redundancy = 1 - log2(8)/8 = 5/8
log 2 m 
log 2 (n)
redundancy  1 
1
n
n
EE141
System-on-Chip
Test Architectures
26
Ch. 3 - Fault-Tolerant Design - P. 26
Linear Block Codes
 Special
class
 Modulo-2 sum of any 2 codewords also
codeword
 Null space of (n-k)xn Boolean matrix
– Called Parity Check Matrix, H
 For
any n-bit codeword c
 cHT = 0
 All 0 codeword exists in any linear code
EE141
System-on-Chip
Test Architectures
27
Ch. 3 - Fault-Tolerant Design - P. 27
Linear Block Codes
 Generator
Matrix, G
 kxn Matrix
 Codeword
c for message m
 c = mG
 GHT
=0
EE141
System-on-Chip
Test Architectures
28
Ch. 3 - Fault-Tolerant Design - P. 28
Systematic Block Code
 First
k-bits correspond to message
 Last n-k bits correspond to check bits
 For
Systematic Code
 G = [Ikxk : Pkx(n-k)]
 H = [I(n-k)x(n-k) : PT(n-k)xk]
 Example
1 0 0
G  0 1 0
0 0 1
EE141
System-on-Chip
Test Architectures
1
1
1
H  1 1 1 1
29
Ch. 3 - Fault-Tolerant Design - P. 29
Distance of Code
 Distance
between two codewords
 Number of bits in which they differ
 Distance
of Code
 Minimum distance between any two
codewords in code
 If n=k (no redundancy), distance = 1
 Single-bit parity, distance = 2
 Code
with distance d
 Detect d-1 errors
 Correct up to (d-1)/2 errors
EE141
System-on-Chip
Test Architectures
30
Ch. 3 - Fault-Tolerant Design - P. 30
Error Correcting Codes
 Code
with distance 3
 Called single error correcting (SEC) code
 Code
with distance 4
 Called single error correcting and double
error detecting (SEC-DED) code
 Procedure
for constructing SEC code
 Described in [Hamming 1950]
 Any H-matrix with all columns distinct and
no all-0 column is SEC
EE141
System-on-Chip
Test Architectures
31
Ch. 3 - Fault-Tolerant Design - P. 31
Hamming Code
 For
any value of n
 SEC code constructed by
– setting each column in H equal to binary
representation of column number (starting from 1)
 Number of rows in H equal to log2(n+1)
 Example
of SEC Hamming Code for n=7
0 0 0 1 1 1 1


H  0 1 1 0 0 1 1
1 0 1 0 1 0 1
EE141
System-on-Chip
Test Architectures
32
Ch. 3 - Fault-Tolerant Design - P. 32
Error Correction in Hamming Code
 Syndrome,
s
 s = HvT for received vector v
 If v is codeword
– Syndrome = 0
 If v non-codeword and single-bit error
– Syndrome will match one of columns of H
– Will contain binary value of bit position in error
EE141
System-on-Chip
Test Architectures
33
Ch. 3 - Fault-Tolerant Design - P. 33
Example of Error Correction
 For
(7,3) Hamming Code
 Suppose codeword 0110011 has one-bit
error changing it to 1110011
0
0

0

T
s  vH  [1110011] 1
1

1
1

EE141
System-on-Chip
Test Architectures
0 1
1 0
1 1

0 0  [001]
0 1

1 0
1 1 
34
Ch. 3 - Fault-Tolerant Design - P. 34
SEC-DED Code
 Make
SEC Hamming Code SEC-DED
 By adding parity check over all bits
 Extra parity bit
– 1 for single-bit error
– 0 for double-bit error
 Makes possible to detect double bit error
– Avoid assuming single-bit error and
miscorrecting it
EE141
System-on-Chip
Test Architectures
35
Ch. 3 - Fault-Tolerant Design - P. 35
Example of Error Correction
 For
(7,4) SEC-DED Hamming Code
 Suppose codeword 0110011 has two-bit
error changing it to 1010011
– Doesn’t match any column in H
0
0

0

T
s  vH  [1010011] 1
1

1
1

EE141
System-on-Chip
Test Architectures
0 1 1
1 0 1
1 1 1

0 0 1  [0010]
0 1 1

1 0 1
1 1 1
36
Ch. 3 - Fault-Tolerant Design - P. 36
Hsiao Code
 Weight
of column
 Number of 1’s in column
 Constructing
n-bit SEC-DED Hsiao Code
 First use all possible weight-1 columns
– Then all possible weight-3 columns
– Then weight-5 columns, etc.
 Until n columns formed
 Number check bits is log2(n+1)
 Minimizes number of 1’s in H-matrix
– Less hardware and delay for computing syndrome
– Disadvantage: Correction logic more complex 37
EE141
System-on-Chip
Test Architectures
Ch. 3 - Fault-Tolerant Design - P. 37
Example of Hsiao Code
 (7,3)
Hsiao Code
 Uses weight-1 and weight-3 columns
0
0
H 
0

1
EE141
System-on-Chip
Test Architectures
0
0
1
0
0
1
0
0
1
0
0
0
0
1
1
1
1
0
1
1
1

1
0

1
38
Ch. 3 - Fault-Tolerant Design - P. 38
Unidirectional Errors
 Errors
in block of data which only cause
01 or 10, but not both
 Any number of bits in error in one direction
 Example
 Correct codeword 111000
 Unidirectional errors could cause
– 001000, 000000, 101000 (only 10 errors)
 Non-unidirectional errors
– 101001, 011001, 011011 (both10 and 01)
EE141
System-on-Chip
Test Architectures
39
Ch. 3 - Fault-Tolerant Design - P. 39
Unidirectional Error Detecting Codes
 All
unidirectional error detecting (AUED)
Codes
 Detect all unidirectional errors in codeword
 Single-bit parity is not AUED
– Cannot detect even number of errors
 No linear code is AUED
– All linear codes must contain all-0 vector, so
cannot detect all 10 errors
EE141
System-on-Chip
Test Architectures
40
Ch. 3 - Fault-Tolerant Design - P. 40
Two-Rail Code
 Two-Rail
Code
 One check bit for each information bit
– Equal to complement of information bit
 Two-Rail Code is AEUD
 50% Redundancy
 Example
of (6,3) Two-Rail Code
 Message 101 has Codeword 101010
 Set of all codewords
– 000111, 001110, 010101, 011100, 100110,
101010, 110001, 111000
EE141
System-on-Chip
Test Architectures
41
Ch. 3 - Fault-Tolerant Design - P. 41
Berger Codes
 Lowest
redundancy of separable AUED
codes
 For k information bits, log2(k+1) check bits
 Check bits equal to binary representation
of number of 0’s in information bits
 Example
 Information bits 1000101
– log2(7+1)=3 check bits
– Check bits equal to 100 (4 zero’s)
EE141
System-on-Chip
Test Architectures
42
Ch. 3 - Fault-Tolerant Design - P. 42
Berger Codes
 Codewords
for (5,3) Berger Code
 00011, 00110, 01010, 01101, 10010,
10101, 11001, 11100
 If
unidirectional errors
 Contain 10 errors
– increase 0’s in information bits
– can only decrease binary number in check bits
 Contain 01 errors
– decrease 0’s in information bits
– can only increase binary number in check bits
EE141
System-on-Chip
Test Architectures
43
Ch. 3 - Fault-Tolerant Design - P. 43
Berger Codes
 If
8 information bits
 Berger code requires log28+1=4 check bits
log2 m
log2 (2k )
8 1
redundancy 1 
1
 1    25%
n
n
12 4
 (16,8)
Two-Rail Code
 Requires 50% redundancy
 Redundancy
advantage of Berger Code
 Increases as k increased
EE141
System-on-Chip
Test Architectures
44
Ch. 3 - Fault-Tolerant Design - P. 44
Constant Weight Codes
 Constant
Weight Codes
 Non-separable, but lower redundancy than
Berger
 Each codeword has same number of 1’s
 Example
2-out-of-3 constant weight code
 110, 011, 101
 AEUD
code
 Unidirectional errors always change number
of 1’s
EE141
System-on-Chip
Test Architectures
45
Ch. 3 - Fault-Tolerant Design - P. 45
Constant Weight Codes
 Number
codewords in m-out-of-n code
n
m
C
 Codewords
maximized when m close to
n/2 as possible
 n/2-out-of-n when n even
 (n/2-0.5 or n/2+0.5)-out-of-n when n odd
 Minimizes redundancy of code
EE141
System-on-Chip
Test Architectures
46
Ch. 3 - Fault-Tolerant Design - P. 46
Example
 6-out-of-12
constant weight code
C612  924 codewords
log 2 m 
log 2 (924 )
redundancy  1 
1
 17.9%
n
12
 12-bit
Berger Code
 Only 28 = 256 codewords
log2 m
log2 (28 )
redundancy 1 
1
 33.3%
n
12
EE141
System-on-Chip
Test Architectures
47
Ch. 3 - Fault-Tolerant Design - P. 47
Constant Weight Codes
 Advantage
 Less redundancy than Berger codes
 Disadvantage
 Non-separable
 Need decoding logic
– to convert codeword back to binary message
EE141
System-on-Chip
Test Architectures
48
Ch. 3 - Fault-Tolerant Design - P. 48
Burst Error
 Burst
Error
 Common, multi-bit errors tend to be clustered
– Noise source affects contiguous set of bus lines
 Length of burst error
– number of bits between first and last error
 Wrap around from last to first bit of codeword
 Example:
Original codeword 00000000
 00111100 is burst error length 4
 00110100 is burst error length 4
– Any number of errors between first and last error
EE141
System-on-Chip
Test Architectures
49
Ch. 3 - Fault-Tolerant Design - P. 49
Cyclic Codes
 Special
class of linear code
 Any codeword shifted cyclically is another
codeword
 Used to detect burst errors
 Less redundancy required to detect burst
error than general multi-bit errors
– Some distance 2 codes can detect all burst
errors of length 4
– detecting all possible 4-bit errors requires
distance 5 code
EE141
System-on-Chip
Test Architectures
50
Ch. 3 - Fault-Tolerant Design - P. 50
Cyclic Redundancy Check (CRC) Code
 Most
widely used cyclic code
 Uses binary alphabet based on GF(2)
 CRC
code is (n,k) block code
 Formed using generator polynomial, g(x)
– called code generator
– degree n-k polynomial (same degree as
number of check bits)
nk
g ( x)  gnk x  ...  g2 x  g1x  g0
c( x)  m( x) g ( x)
EE141
System-on-Chip
Test Architectures
2
51
Ch. 3 - Fault-Tolerant Design - P. 51
Message
m(x)
g(x)
c(x)
Codeword
0000
0
x2 + 1
0
000000
0001
1
x2 + 1
x2 + 1
000101
0010
x
x2 + 1
x3 + x
001010
0011
x+1
x2 + 1
x3 + x2 + x + 1
001111
0100
x2
x2 + 1
x4 + x2
010100
0101
x2 + 1
x2 + 1
x4 + 1
010001
0110
x2 + x
x2 + 1
x4 + x3 + x2 + x
011110
0111
x2 + x + 1
x2 + 1
x4 + x3 + x + 1
011011
1000
x3
x2 + 1
x5 + x3
101000
1001
x3 + 1
x2 + 1
x5 + x3 + x2 + 1
101101
1010
x3 + x
x2 + 1
x5 + x
100010
1011
x3 + x + 1
x2 + 1
x5 + x2 + x + 1
100111
1100
x3 + x2
x2 + 1
x5 + x4 + x3 + x2
111100
1101
x3 + x2 + 1
x2 + 1
x5 + x4 + x3 + 1
111001
1110
x3 + x2 + x
x2 + 1
x5 + x4 + x2 + x
110110
1111
x3 + x2 + x + 1
x2 + 1
x5 + x4 + x + 1
110011
EE141
System-on-Chip
Test Architectures
52
Ch. 3 - Fault-Tolerant Design - P. 52
CRC Code
 Linear
block code
 Has G-matrix and H-matrix
 G-matrix shifted version of generator
polynomial
 g nk
 0
G
 .

 0
EE141
System-on-Chip
Test Architectures
...
g1
g0
0
0
g nk
.
...
.
g1
.
g0
0
.
.
0
...
g nk
... g1
0
0 
. 

g0 
53
Ch. 3 - Fault-Tolerant Design - P. 53
CRC Code Example
 (6,4)
CRC code generated by g(x)=x2+1
1
0
G
0

0
EE141
System-on-Chip
Test Architectures
0
1
0
0
1
0
1
0
0
1
0
1
0
0
1
0
0
0
0

1
54
Ch. 3 - Fault-Tolerant Design - P. 54
Systematic CRC Codes
 To
obtain systematic CRC code
 codewords formed using Galois division
– nice because LFSR can be used for performing
division
c( x)  m( x) x
n k
 r ( x)
nk
m( x) x
r ( x)  rem ainderof
g ( x)
EE141
System-on-Chip
Test Architectures
55
Ch. 3 - Fault-Tolerant Design - P. 55
Galois Division Example
 Encode
m(x)=x2+x with g(x)=x2+1
 Requires dividing m(x)xn-k =x4+x3 by g(x)
111
101 11000
101
110
101
110
101
11 remainder
 Remainder r(x)=x+1
– c(x) = m(x)xn-k+r(x) = (x2+x)(x2)+x+1 = x4+x3+x+1
EE141
System-on-Chip
Test Architectures
56
Ch. 3 - Fault-Tolerant Design - P. 56
Message
m(x)
g(x)
r(x)
c(x)
Codeword
0000
0
x2 + 1
0
0
000000
0001
1
x2 + 1
1
x2 + 1
000101
0010
x
x2 + 1
x
x3 + x
001010
0011
x+1
x2 + 1
x+1
x3 + x2 + x + 1
001111
0100
x2
x2 + 1
1
x4 + 1
010001
0101
x2 + 1
x2 + 1
0
x4 + x2
010100
0110
x2 + x
x2 + 1
x+1
x4 + x3 + x + 1
011011
0111
x2 + x + 1
x2 + 1
x
x4 + x3 + x + 1
011110
1000
x3
x2 + 1
x
x4 + x3 + x + 1
100010
1001
x3 + 1
x2 + 1
x+1
x4 + x3 + x + 1
100111
1010
x3 + x
x2 + 1
0
x4 + x3 + x + 1
101000
1011
x3 + x + 1
x2 + 1
1
x4 + x3 + x + 1
101101
1100
x3 + x2
x2 + 1
x+1
x4 + x3 + x + 1
110011
1101
x3 + x2 + 1
x2 + 1
x
x4 + x3 + x + 1
110110
1110
x3 + x2 + x
x2 + 1
1
x4 + x3 + x + 1
111001
1111
x3 + x2 + x + 1
x2 + 1
0
x4 + x3 + x2 + x 111100
EE141
System-on-Chip
Test Architectures
57
Ch. 3 - Fault-Tolerant Design - P. 57
Generating Check Bits for CRC Code
 Use
LFSR
 With characteristic polynomial equal to g(x)
 Append n-k 0’s to end of message
 Example:
m(x)=x2+x+1 and g(x)=x3+x+1
Message
0
0
0
111000
Appended 0’s
0
1
0
Final state after shifting equals remainder
EE141
System-on-Chip
Test Architectures
58
Ch. 3 - Fault-Tolerant Design - P. 58
Checking CRC Codeword
 Checking
Received Codeword for Errors
 Shift codeword into LFSR
– with same characteristic polynomial as used to
generate it
 If final state of LFSR non-zero, then error
0
0
0
111010
codeword to check
EE141
System-on-Chip
Test Architectures
59
Ch. 3 - Fault-Tolerant Design - P. 59
Selecting Generator Polynomial
 Key
issue for CRC Codes
 If first and last bit of polynomial are 1
– Will detect burst errors of length n-k or less
 If generator polynomial is mutliple of (x+1)
– Will detect any odd number of errors
 If g(x) = (x+1)p(x) where p(x) primitive of
degree n-k-1 and n < 2n-k-1
– Will detect single, double, triple, and odd errors
EE141
System-on-Chip
Test Architectures
60
Ch. 3 - Fault-Tolerant Design - P. 60
Commonly Used CRC Generators
CRC code
Generator Polynomial
CRC-5 (USB token packets)
x5+x2+1
CRC-12 (Telecom systems)
x12+x11+x3+x2+x+1
CRC-16-CCITT (X25, Bluetooth)
CRC-32 (Ethernet)
CRC-64 (ISO)
EE141
System-on-Chip
Test Architectures
x16+x12+x5+1
x32+x26+x23+x22+x16+x12+x11+x10+x8
+x7+x5+x4+x+1
x64+x4+x3+x+1
61
Ch. 3 - Fault-Tolerant Design - P. 61
Fault Tolerance Schemes
 Adding
Fault Tolerance to Design
 Improves dependability of system
 Requires redundancy
– Hardware
– Time
– Information
EE141
System-on-Chip
Test Architectures
62
Ch. 3 - Fault-Tolerant Design - P. 62
Hardware Redundancy
 Involves
replicating hardware units
 At any level of design
– gate-level, module-level, chip-level, board-level
 Three
Basic Forms
 Static (also called Passive)
– Masks faults rather than detects them
 Dynamic (also called Active)
– Detects faults and reconfigures to spare hardware
 Hybrid
– Combines active and passive approaches
EE141
System-on-Chip
Test Architectures
63
Ch. 3 - Fault-Tolerant Design - P. 63
Static Redundancy
 Masks
faults so no erroneous outputs
 Provides uninterrupted operation
 Important for real-time systems
– No time to reconfigure or retry operation
 Simple self-contained
– No need to update or rollback system state
EE141
System-on-Chip
Test Architectures
64
Ch. 3 - Fault-Tolerant Design - P. 64
Triple Module Redundancy (TMR)
 Well-known
static redundancy scheme
 Three copies of module
 Use majority voter to determine final output
 Error in one module out-voted by other two
Module
1
Module
2
Majority
Voter
Module
3
EE141
System-on-Chip
Test Architectures
65
Ch. 3 - Fault-Tolerant Design - P. 65
TMR Reliability and MTTF
 TMR
works if any 2 modules work
 Rm = reliability of each module
 Rv = reliability of voter
RTMR  Rv [Rm3  C23 Rm2 (1  Rm )]  Rv (3Rm2  2Rm3 )
 MTTF
for TMR



0
0
0
MTTFTMR   RTMR dt   Rv (3Rm2  2 Rm3 )dt   e vt (3e 2mt  2e 3mt )dt
3
2


2m  v 3m  v
EE141
System-on-Chip
Test Architectures
66
Ch. 3 - Fault-Tolerant Design - P. 66
Comparison with Simplex
 Neglecting
MTTFTMR
 TMR
fault rate of voter
 5  1  5


     MTTFsimplex
2m 3m  6  m  6
3
2
has lower MTTF, but
 Can tolerate temporary faults
 Higher reliability for short mission times
EE141
System-on-Chip
Test Architectures
67
Ch. 3 - Fault-Tolerant Design - P. 67
Comparison with Simplex
 Crossover
point
RTMR  Rsimplex
3e 2 mt  2e 3mt  e mt
ln 2
Solve  t 
 0.7 MTTFsimplex
m
 RTMR
> Rsimplex when
 Mission time shorter than 70% of MTTF
EE141
System-on-Chip
Test Architectures
68
Ch. 3 - Fault-Tolerant Design - P. 68
N-Modular Redundancy (NMR)
 NMR
 N modules along with majority voter
– TMR special case
 Number of failed modules masked = (N-1)/2
 As N increases, MTTF decreases
– But, reliability for short missions increases
 If
goal only to tolerate temporary faults
 TMR sufficient
EE141
System-on-Chip
Test Architectures
69
Ch. 3 - Fault-Tolerant Design - P. 69
Interwoven Logic
 Replace
each gate
 with 4 gates using inconnection pattern
that automatically corrects errors
 Traditionally
not as attractive as TMR
 Requires lots of area overhead
 Renewed interest by researchers
investigating emerging nanoelectronic
technologies
EE141
System-on-Chip
Test Architectures
70
Ch. 3 - Fault-Tolerant Design - P. 70
Interwoven Logic with 4 NOR Gates
+
2a
X
+
2b
X
Y
+
+
2c
+
1a
+
4a
+
+
2
+
1b
1
+
4
+
1c
3
+
+
1d
+
2d
4b
+
+
4c
3a
+
+
4d
3b
+
Y
3c
+
3d
EE141
System-on-Chip
Test Architectures
71
Ch. 3 - Fault-Tolerant Design - P. 71
Example of Error on Third Y Input
X
X
Y
0
0
0
0
+ 0
2a
+ 0
2b
+ 1
2c
+ 1
1a
+ 0
4a
+ 1
+
+
2
+
1
+
4
+ 1
2d
1b
4b
+ 0
1c
3
+ 0
1d
Y
EE141
System-on-Chip
Test Architectures
+ 0
0
0
1
0
+ 1
+ 0
4c
3a
+ 1
+ 0
4d
3b
+ 0
3c
+ 0
3d
72
Ch. 3 - Fault-Tolerant Design - P. 72
Dynamic Redundancy
 Involves
 Detecting fault
 Locating faulty hardware unit
 Reconfiguring system to use spare fault-free
hardware unit
EE141
System-on-Chip
Test Architectures
73
Ch. 3 - Fault-Tolerant Design - P. 73
Unpowered (Cold) Spares
 Advantage
 Extends lifetime of spares
 Equations
 Assume spare not failing until powered
 Perfect reconfiguration capability
Rw / cold _ spare  (1  t )e  t
MTTFw / cold _ spare 
EE141
System-on-Chip
Test Architectures
2

74
Ch. 3 - Fault-Tolerant Design - P. 74
Unpowered (Cold) Spares
 One
cold spare doubles MTTF
 Assuming faults always detected and
reconfiguration circuitry never fails
 Drawback
of cold spare
 Extra time to power and initialize
 Cannot be used to help in detecting faults
 Fault detection requires either
– periodic offline testing
– online testing using time or information
redundancy
EE141
System-on-Chip
Test Architectures
75
Ch. 3 - Fault-Tolerant Design - P. 75
Powered (Hot) Spares
 Can
use spares for online fault detection
 One approach is duplicate-and-compare
 If outputs mismatch then fault occurred
– Run diagnostic procedure to determine which
module is faulty and replace with spare
 Any number of spares can be used
Module
A
Spare
Module
Module
B
EE141
System-on-Chip
Test Architectures
Output
Compare
Agree/Disagree
76
Ch. 3 - Fault-Tolerant Design - P. 76
Pair-and-a-Spare
 Avoids
halting system to run diagnostic
procedure when fault occurs
Module
A
Module
B
Output
Compare
Agree/Disagree
Switch
Module
C
Module
D
EE141
System-on-Chip
Test Architectures
Output
Compare
Agree/Disagree
77
Ch. 3 - Fault-Tolerant Design - P. 77
TMR/Simplex
 When
one module in TMR fails
 Disconnect one of remaining modules
 Improves MTTF while retaining advantages
of TMR when 3 good modules
 TMR/Simplex
 Reliability always better than either TMR or
Simplex alone
EE141
System-on-Chip
Test Architectures
78
Ch. 3 - Fault-Tolerant Design - P. 78
Comparison of Reliability vs Time
1
0.9
RELIABILITY
0.8
0.7
SIMPLEX
TMR
0.6
TMR/SIMPLEX
0.5
0.4
0.3
0
0.2
0.4
0.6
0.8
1
NORMALIZED MISSION TIME (T/MTTF)
EE141
System-on-Chip
Test Architectures
79
Ch. 3 - Fault-Tolerant Design - P. 79
Hybrid Redundancy
 Combines
both static and dynamic
redundancy
 Masks faults like static
 Detects and reconfigures like dynamic
EE141
System-on-Chip
Test Architectures
80
Ch. 3 - Fault-Tolerant Design - P. 80
TMR with Spares
 If
TMR module fails
 Replace with spare
– can be either hot or cold spare
 While system has three working modules
– TMR will provide fault masking for
uninterrupted operation
EE141
System-on-Chip
Test Architectures
81
Ch. 3 - Fault-Tolerant Design - P. 81
Self-Purging Redundancy
 Uses
threshold voter instead of majority
voter
 Threshold voter outputs 1 if number of
input that are 1 greater than threshold
– Otherwise outputs 0
 Requires hot spares
EE141
System-on-Chip
Test Architectures
82
Ch. 3 - Fault-Tolerant Design - P. 82
Self-Purging Redundancy
Module
1
Elem.
Switch
Module
2
Elem.
Switch
Module
3
Elem.
Switch
Module
4
Elem.
Switch
Module
5
Elem.
Switch
Threshold
Voter
2
Elementary Switch
Initialization
S
R

Flip
Flop
EE141
System-on-Chip
Test Architectures
Module
&
Voter
83
Ch. 3 - Fault-Tolerant Design - P. 83
Self-Purging Redundancy
 Compared
with 5MR
 Self-purging with 5 modules
– Tolerate up to 3 failing modules (5MR cannot)
– Cannot tolerate two modules simultaneously
failing (5MR can)
 Compared
with TMR with 2 spares
 Self-purging with 5 modules
– simpler reconfiguration circuitry
– requires hot spares (3MR w/spares can use
either hot or cold spares)
EE141
System-on-Chip
Test Architectures
84
Ch. 3 - Fault-Tolerant Design - P. 84
Time Redundancy
 Advantage
 Less hardware
 Drawback
 Cannot detect permanent faults
 If
error detected
 System needs to rollback to known good
state before resuming operation
EE141
System-on-Chip
Test Architectures
85
Ch. 3 - Fault-Tolerant Design - P. 85
Repeated Execution
 Repeat
operation twice
 Simplest time redundancy approach
 Detects temporary faults occurring during
one execution (but not both)
– Causes mismatch in results
 Can reuse same hardware for both
executions
– Only one copy of functional hardware needed
EE141
System-on-Chip
Test Architectures
86
Ch. 3 - Fault-Tolerant Design - P. 86
Repeated Execution
 Requires
mechanism for storing and
comparing results of both executions
 In processor, can store in memory or on
disk and use software to compare
 Main
cost
 Additional time for redundant execution
and comparison
EE141
System-on-Chip
Test Architectures
87
Ch. 3 - Fault-Tolerant Design - P. 87
Multi-threaded Redundant Execution
 Can
use in processor-based system that
can run multiple threads
 Two copies of thread executed concurrently
 Results compared when both complete
 Take advantage of processor’s built-in
capability to exploit processing resources
– Reduce execution time
– Can significantly reduce performance penalty
EE141
System-on-Chip
Test Architectures
88
Ch. 3 - Fault-Tolerant Design - P. 88
Multiple Sampling of Ouputs
 Done
at circuit-level
 Sample once at end of normal clock cycle
 Same again after delay of t
 Two samples compared to detect mismatch
– Indicates error occurred
 Detect fault whose duration is less than t
 Performance overhead depends on
– Size of t relative to normal clock period
EE141
System-on-Chip
Test Architectures
89
Ch. 3 - Fault-Tolerant Design - P. 89
Multiple Sampling of Outputs
 Simple
approach using two latches
Main
Latch
Signal
Clk

Error
Shadow
Latch
Clk+t
EE141
System-on-Chip
Test Architectures
90
Ch. 3 - Fault-Tolerant Design - P. 90
Multiple Sampling of Outputs
 Approach
using stability checker at output
Normal
Clock Period
Checking
Period
Stability
Stability
Checking
Checking
Period
Period
Normal
Clock Period
t
t
&
+
&
Signal
Error
+
&
EE141
System-on-Chip
Test Architectures
91
Ch. 3 - Fault-Tolerant Design - P. 91
Diverse Recomputation
 Use
same hardware, but perform
computation differently second time
 Can detect permanent faults that affects
only one computation
 For
arithmetic or logical operations
 Shift operands when performing second
computation [Patel 1982]
 Detects permanent fault affecting only one
bit-slice
EE141
System-on-Chip
Test Architectures
92
Ch. 3 - Fault-Tolerant Design - P. 92
Information Redundancy
 Based
on Error Detecting and
Correcting Codes
 Advantage
 Detects both permanent and temporary
faults
 Implemented with less hardware overhead
than using multiple copies of module
 Disadvantage
 More complex design
EE141
System-on-Chip
Test Architectures
93
Ch. 3 - Fault-Tolerant Design - P. 93
Error Detection
 Error
detecting codes used to detect
errors
 If error detected
– Rollback to previous known error-free state
– Retry operation
EE141
System-on-Chip
Test Architectures
94
Ch. 3 - Fault-Tolerant Design - P. 94
Rollback
 Requires
adding storage to save
previous state
 Amount of rollback depends on latency of
error detection mechanism
 Zero-latency error detection
– rollback implemented by preventing system
state from updating
 If errors detected after n cycles
– need rollback restoring system to state at least
n clock cycles earlier
EE141
System-on-Chip
Test Architectures
95
Ch. 3 - Fault-Tolerant Design - P. 95
Checkpoint
 Execution
divided into set of operations
 Before each operation executed
– checkpoint created where system state saved
 If any error detected during operation
– rollback to last checkpoint and retry operation
 If multiple retries fail
– operation halts and system flags that
permanent fault has occurred
EE141
System-on-Chip
Test Architectures
96
Ch. 3 - Fault-Tolerant Design - P. 96
Error Detection
 Encode
outputs of circuit with error
detecting code
 Non-codeword output indicates error
Inputs
m
k
Functional
Logic
Outputs
k
m
EE141
System-on-Chip
Test Architectures
Check Bit
Generator
Checker
c
Error
Indication
97
Ch. 3 - Fault-Tolerant Design - P. 97
Self-Checking Checker
 Has
two outputs
 Normal error-free case (1,0) or (0,1)
 If equal to each other, then error (0,0) or (1,1)
 Cannot have single error indicator output
– Stuck-at 0 fault on output could never be detected
EE141
System-on-Chip
Test Architectures
98
Ch. 3 - Fault-Tolerant Design - P. 98
Totally Self-Checking Checker
 Requires
three properties
 Code Disjoint
– all codeword inputs mapped to codeword outputs
 Fault Secure
– for all codeword inputs, checker in presence of
fault will either procedure correct codeword output
or non-codeword output (not incorrect codeword)
 Self-Testing
– For each fault, at least one codeword input gives
error indication
EE141
System-on-Chip
Test Architectures
99
Ch. 3 - Fault-Tolerant Design - P. 99
Duplicate-and-Compare
 Equality
checker indicates error
 Undetected error can occur only if
common-mode fault affecting both copies
 Only faults after stems detected
 Over 100% overhead (including checker)
Stems
Primary
Inputs
Functional
Logic
Equality
Checker
Error
Indication
Functional
Logic
EE141
System-on-Chip
Test Architectures
100
Ch. 3 - Fault-Tolerant Design - P. 100
Single-Bit Parity Code
 Totally
self-checking checker formed by
removing final gate from XOR tree

Functional
Logic

EI0

EI1


Parity
Prediction
EE141
System-on-Chip
Test Architectures
101
Ch. 3 - Fault-Tolerant Design - P. 101
Single-Bit Parity Code
 Cannot
detect even bit errors
 Can ensure no even bit errors by
generating each output with independent
cone of logic
– Only single bit errors can occur due to single
point fault
– Typically requires a lot of overhead
EE141
System-on-Chip
Test Architectures
102
Ch. 3 - Fault-Tolerant Design - P. 102
Parity-Check Codes
 Each
check bit is parity for some set of
output bits
 Example: 6 outputs and 3 check bits
Z1 Z2 Z3 Z4 Z5 Z6 c1 c2 c3
Parity Group 1
1
0
0
1
1
0
1
0
0
Parity Group 2
0
1
1
0
0
0
0
1
0
Parity Group 3
0
0
0
0
0
1
0
0
1
EE141
System-on-Chip
Test Architectures
103
Ch. 3 - Fault-Tolerant Design - P. 103
Parity-Check Codes
 For
c check bits and k functional outputs
 2ck possible parity check codes
 Can choose code based on structure of
circuit to minimize undetected error
combinations
 Fanouts in circuit determine possible error
combinations due to single-point fault
EE141
System-on-Chip
Test Architectures
104
Ch. 3 - Fault-Tolerant Design - P. 104
Checker for Parity-Check Codes
 Constructed
from single-bit parity
checkers and two-rail checkers
Z1
Z4
Z5
c1
Z2
Z3
Parity
Checker
Two-Rail
Checker
Parity
Checker
Two-Rail
Checker
E0
E1
c2
Z6
c3
Parity
Checker
EE141
System-on-Chip
Test Architectures
105
Ch. 3 - Fault-Tolerant Design - P. 105
Two-Rail Checkers
 Totally
self-checking two-rail checker
A0
B0
&
+
C0
+
C1
&
A1
B1
EE141
System-on-Chip
Test Architectures
&
&
106
Ch. 3 - Fault-Tolerant Design - P. 106
Berger Codes
 Inverter-free
circuit
 Inverters only at primary inputs
 Can be synthesized using only algebraic
factoring [Jha 1993]
 Only unidirectional errors possible for
single point faults
– Can use unidirectional code
– Berger code gives 100% coverage
EE141
System-on-Chip
Test Architectures
107
Ch. 3 - Fault-Tolerant Design - P. 107
Constant Weight Codes
 Non-separable
with lower redundancy
 Drawback: need decoding logic to convert
codeword back to its original binary value
 Can use for encoding states of FSM
– No need for decoding logic
EE141
System-on-Chip
Test Architectures
108
Ch. 3 - Fault-Tolerant Design - P. 108
Error Correction
 Information
redundancy can also be
used to mask errors
 Not as attractive as TMR because logic for
predicting check bits very complex
 However, very good for memories
– Check bits stored with data
– Error do not propagate in memories as in logic
circuits, so SEC-DED usually sufficient
EE141
System-on-Chip
Test Architectures
109
Ch. 3 - Fault-Tolerant Design - P. 109
Error Correction
 Memories
very dense and prone to errors
 Especially due to single-event upsets (SEUs)
from radiation
 SEC-DED
check bits stored in memory
 32-bit word, SEC-DED requires 7 check bits
– Increases size of memory by 7/32=21.9%
 64-bit word, SEC-DED requires 8 check bits
– Increases size of memory by 8/64=12.5%
EE141
System-on-Chip
Test Architectures
110
Ch. 3 - Fault-Tolerant Design - P. 110
Memory ECC Architecture
Write Data Word
Read Data Word
Generate
Check
Bits
Data Word
In
Write
Check Bits
Memory
Calculated
Check Bits
Data Word
Out
Correct
Data
EE141
System-on-Chip
Test Architectures
Generate
Syndrome
Read
Check Bits
111
Ch. 3 - Fault-Tolerant Design - P. 111
Hamming Code for ECC RAM
Z
Input Data
Hamming
Check Bit
Generator
Bit Error
Z
Correction Circuit
Z
RAM
Core
c
Parity Bit
Generator
c
N words
Z+c+1
bits/word
Parity Bit
Generator
Parity Group 1
Parity Group 2
Parity Group 3
Parity Group 4
Z2
1
0
1
0
Error Type
No bit error
Single-bit correctable error
Double-bit error detection
EE141
System-on-Chip
Test Architectures
Z3
0
1
1
0
Z4
1
1
1
0
Parity
Check
Hamming
Check
Hamming c
Check Bit
Detect/Correct
Generator
Generate
Z1
1
1
0
0
Output
Data
Z5
1
0
0
1
Z6
0
1
0
1
Z7
1
1
0
1
Z8
0
0
1
1
c1
1
0
0
0
c2
0
1
0
0
c3
0
0
1
0
c4
0
0
0
1
Condition
Hamming check bits match, no parity error
Hamming check bits mismatch, parity error
Hamming check bits mismatch, no parity error
112
Ch. 3 - Fault-Tolerant Design - P. 112
Memory ECC
 SEC-DED
generally very effective
 Memory bit-flips tend to be independent
and uniformly distributed
 If bit-flip occurs, gets corrected next time
memory location accessed
 Main risk is if memory word not access for
long time
– Multiple bit-flips could accumulate
EE141
System-on-Chip
Test Architectures
113
Ch. 3 - Fault-Tolerant Design - P. 113
Memory Scrubbing
 Every
location in memory read on
periodic basis
 Reduces chance of multiple errors
accumulating in a memory word
 Can be implemented by having memory
controller cycle through memory during idle
periods
EE141
System-on-Chip
Test Architectures
114
Ch. 3 - Fault-Tolerant Design - P. 114
Multiple-Bit Upsets (MBU)
 Can
occur due to single SEU
 Typically occur in adjacent memory cells
 Memory
interleaving used
 To prevent MBUs from resulting in multiple
bit errors in same word
Memory
Word1
Bit1
Word2
Bit1
Word3
Bit1
EE141
System-on-Chip
Test Architectures
Word4
Bit1
Word1
Bit2
Word2
Bit2
Word3
Bit2
Word4
Bit2
Word1 Word2
Bit3
Bit3
Word3
Bit3
Word4
Bit3
115
Ch. 3 - Fault-Tolerant Design - P. 115
Type
Issues
Goal
Examples
Techniques
Long-Life
Systems
Difficult or
Expensive to Repair
Maximize
MTTF
Satellites
Spacecraft
Implanted Biomedical
Dynamic
Redundancy
Reliable
Real-Time
Systems
Error or Delay
Catastrophic
Fault Masking
Capability
Aircraft
Nuclear Power Plant
Air Bag Electronics
Radar
TMR
High
Availability
Systems
Downtime
Very Costly
High
Availability
Reservation System
Stock Exchange
Telephone Systems
No Single Point of
Failure;
Self-Checking Pairs;
Fault Isolation
High
Integrity
Systems
Data Corruption
Very Costly
High
Data Integrity
Banking
Transaction
Processing
Database
Checkpointing,
Time Redundancy;
ECC; Redundant
Disks
Mainstream
Low-Cost
Systems
Reasonable Level of
Failures Acceptable
Meet Failure Rate
Expectations
at Low Cost
Consumer Electronics
Personal Computers
Often None;
Memory ECC; Bus
Parity; Changing as
Technology Scales
EE141
System-on-Chip
Test Architectures
116
Ch. 3 - Fault-Tolerant Design - P. 116
Concluding Remarks
 Many
different fault-tolerant schemes
 Choosing scheme depends on
 Types of faults to be tolerated
– Temporary or permanent
– Single or multiple point failures
– etc.
 Design constraints
– Area, performance, power, etc.
EE141
System-on-Chip
Test Architectures
117
Ch. 3 - Fault-Tolerant Design - P. 117
Concluding Remarks
 As
technology scales
 Circuits increasingly prone to failure
 Achieving sufficient fault tolerance will be
major design issue
EE141
System-on-Chip
Test Architectures
118
Ch. 3 - Fault-Tolerant Design - P. 118