Part 15: Software Fault Tolerance II

UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing
ECE 655
Software Fault Tolerance II
ECE655/swftII .1
Copyright 2004 Koren & Krishna
Data Diversity
 Input space of a program can be divided into fault
and non-fault regions - program fails if and only if an
input from the fault region is applied
 Consider an unrealistic input space of 2 dimensions
 In both cases shown in the original figure (not reproduced here), fault regions occupy
a third of the input area
 Perturb input slightly - new input may fall in a non-fault region
 This is called data diversity
 One copy of software - use acceptance test - if output is rejected, recompute with perturbed inputs and recheck output
 Massive redundancy - apply slightly different input
sets to different versions and vote
Explicit & Implicit Perturbation
 Explicit - add a small deviation term to a selected
subset of inputs
 Implicit - gather inputs to program such that we can
expect them to be slightly different
 Example 1: software control of industrial process - inputs are pressure and temperature of boiler
 Every second - $(p_i, t_i)$ measured - input to controller
 Measurement at time i not much different from that at time i-1
 Implicit perturbation may consist of using $(p_{i-1}, t_{i-1})$
as alternative to $(p_i, t_i)$
 If $(p_i, t_i)$ is in fault region - $(p_{i-1}, t_{i-1})$ may not be
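The retry-with-perturbed-input idea for a single software copy can be sketched as follows. This is only an illustration: the control law, its planted fault region, and the acceptance-test bounds are all hypothetical and not taken from the slides.

```python
def controller(p, t):
    """Hypothetical control law with a deliberately planted fault region."""
    if 99.5 <= p <= 100.5:            # assumed fault region in the pressure input
        return float("nan")           # the fault shows up as a nonsensical output
    return 0.5 * p + 0.1 * t


def acceptance_test(u):
    """Accept only a real number inside an assumed plausible actuator range."""
    return u == u and 0.0 <= u <= 200.0   # u == u is False for NaN


def control_with_data_diversity(current, previous):
    """Try the current sample first, then the previous one (implicit perturbation)."""
    for p, t in (current, previous):
        u = controller(p, t)
        if acceptance_test(u):
            return u
    raise RuntimeError("all re-expressed inputs failed the acceptance test")


# Current sample lies in the fault region, previous sample does not.
output = control_with_data_diversity((100.0, 300.0), (99.0, 299.0))
```

Because the second result is computed from the previous sample, it is an inexact re-expression: usable, but slightly different from what a fault-free computation on the current sample would give.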
Explicit Perturbation - Reorder Inputs
 Example 2: add floating-point numbers a,b,c -
compute a+b, and then add c
 a=1.1E+20, b=5, c=-1.1E+20
 Depending on precision used, a+b may be 1.1E+20
resulting in a+b+c=0
Change order of inputs to a,c,b - then a+c=0 and
a+c+b=5
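The reordering example can be checked directly, assuming IEEE 754 double precision (Python's float type). With other precisions the first sum may differ, as the slide notes.

```python
a, b, c = 1.1e20, 5.0, -1.1e20

# Original order: b is smaller than the spacing between doubles near 1.1e20,
# so a + b rounds back to a and the 5 is lost.
original = (a + b) + c

# Reordered inputs: the two large terms cancel exactly before b is added.
reordered = (a + c) + b

print(original)   # 0.0
print(reordered)  # 5.0
```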
 Example 2 - exact re-expression - output can be
used as is (if it passes acceptance test or vote)
 Example 1 - inexact re-expression - likely to have
$f(p_i, t_i) \neq f(p_{i-1}, t_{i-1})$
 Use raw output as a degraded but acceptable
alternative, or attempt to correct before use, e.g.,
by a Taylor expansion
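One way to read the Taylor-expansion correction, assuming f is differentiable and its partial derivatives at the previous sample are known or can be estimated (an assumption, not something the slides state), is a first-order correction of the output computed from the perturbed input:

```latex
f(p_i, t_i) \approx f(p_{i-1}, t_{i-1})
  + \frac{\partial f}{\partial p}\,(p_i - p_{i-1})
  + \frac{\partial f}{\partial t}\,(t_i - t_{i-1})
```

Here both partial derivatives are evaluated at $(p_{i-1}, t_{i-1})$, so the fault region around $(p_i, t_i)$ is never exercised.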
Software Implemented Hardware Fault Tolerance (SIHFT)
 Variation on data diversity, used to deal with
permanent hardware failures
 Each input is multiplied by a constant, k, and program
is transformed to correct for this multiplication
 Example: the construct if (x=y) then ...
 x=001, y=000 - equality checked by hardware that has
its $x_0$ input (least significant bit of x) stuck-at-0 - will erroneously compute x=y
 SIHFT with k=2 yields
x=010, y=000 - fault not exercised - circuit correctly
determines $x \neq y$
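The comparator example can be simulated in a few lines. The faulty comparator below is a hypothetical software model of the stuck-at-0 input, and the function names are illustrative only.

```python
def faulty_equal(x, y):
    """Model of the faulty hardware: bit 0 of the x input always reads as 0."""
    return (x & ~1) == y


def sihft_equal(x, y, k=2):
    """Compare k*x with k*y instead; for k != 0 this preserves (in)equality."""
    return faulty_equal(k * x, k * y)


x, y = 0b001, 0b000
plain = faulty_equal(x, y)      # True: the fault makes x look equal to y
protected = sihft_equal(x, y)   # False: with k=2 the stuck bit carries no information
```

The sketch ignores word length: in real hardware k*x must still fit in the data type, which is the overflow concern raised when choosing k.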
SIHFT
 Transforming the program to compensate for
multiplying by k is not difficult
 Finding an appropriate value of k:
(1) Ensure that it is possible to find suitable data
types so that arithmetic overflow or underflow
does not happen
 (2) Select k such that it is able to mask a large
fraction of the hardware faults - experimental
studies by injecting faults
Recovery Block Approach
 N versions, one running - if it fails, execution is switched to a backup
 Example - primary + 3 secondary versions
 Primary executed - output passed to acceptance test
 If output is not accepted - system state is rolled back and secondary 1 starts, and so on
 If all fail - computation fails
 Success of recovery block approach depends on
failure independence of different versions and
quality of acceptance test
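A minimal sketch of the recovery block control flow, assuming the system state fits in a dictionary that can be checkpointed by deep copy. The two versions and the acceptance test are hypothetical stand-ins.

```python
import copy


def recovery_block(state, versions, acceptance_test):
    """Try versions in order; roll back to the checkpoint after each rejection."""
    checkpoint = copy.deepcopy(state)          # establish the recovery point
    for version in versions:
        trial = copy.deepcopy(checkpoint)      # rollback: restart from the saved state
        try:
            output = version(trial)
        except Exception:
            continue                           # a crash counts as a failed version
        if acceptance_test(checkpoint, output):
            state.clear()
            state.update(trial)                # commit the accepted state
            return output
    raise RuntimeError("all versions failed the acceptance test")


def primary(state):
    """Hypothetical faulty primary: loses an element while sorting."""
    state["data"] = sorted(state["data"][:-1])
    return state["data"]


def secondary(state):
    """Hypothetical simpler backup version."""
    state["data"] = sorted(state["data"])
    return state["data"]


def acceptance_test(before, output):
    """Accept only an ordered output with the same length as the input."""
    ordered = all(x <= y for x, y in zip(output, output[1:]))
    return ordered and len(output) == len(before["data"])


state = {"data": [3, 1, 2]}
result = recovery_block(state, [primary, secondary], acceptance_test)
```

Note that this acceptance test would still pass an ordered output containing wrong values, which is why the quality of the test matters as much as the independence of the versions.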
Recovery Block Approach - Analytical Model
 Assumption - different versions fail independently
 Notations:
 E - the event - output of a version is erroneous
 T - the event - test fails (test detects a fault)
 f - failure probability of a version
 $f = P(E)$
 s - test sensitivity
 $s = P(T \mid E)$
 $\sigma$ - test specificity
 $\sigma = P(E \mid T)$
 N - number of software versions
Failure Probability Calculation
 Calculate complementary probability - of success
 For the scheme to succeed, it must succeed at some
stage i, $1 \le i \le N$
 The test fails (output is rejected, whether or not it is actually erroneous) at stages 1,...,i-1, and
at stage i the software version is correct and the
output passes the test
 Notation reminder: E - output of version erroneous; T - test fails; $f = P(E)$; $s = P(T \mid E)$; $\sigma = P(E \mid T)$
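Failure Probability Calculation cont.
 $P(E \mid T) = P(E \cap T)/P(T)$, and therefore,
$P(T) = \frac{P(E \cap T)}{P(E \mid T)} = \frac{P(T \mid E)\,P(E)}{P(E \mid T)} = \frac{s f}{\sigma}$
and
$P(\bar{E} \cap \bar{T}) = 1 - P(E \cup T) = (1 - f) - \frac{s f}{\sigma}(1 - \sigma)$
 Each of stages 1,...,i-1 ends with the test failing (probability $P(T)$) - substituting and summing over i yields
$P(\text{scheme succeeds}) = \sum_{i=1}^{N} [P(T)]^{i-1}\, P(\bar{E} \cap \bar{T}) = \left[(1 - f) - \frac{s f}{\sigma}(1 - \sigma)\right] \sum_{i=1}^{N} \left(\frac{s f}{\sigma}\right)^{i-1}$
 P(scheme fails) = 1 - P(scheme succeeds)
 Critical importance of high acceptance-test
specificity $\sigma$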
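The effect of specificity can be explored numerically. The sketch below assumes the model follows from the definitions f = P(E), s = P(T|E), sigma = P(E|T): a stage is rejected with probability P(T) = s*f/sigma, and a stage succeeds with probability (1 - f) - P(T)*(1 - sigma). The function name is illustrative.

```python
def recovery_block_failure_probability(f, s, sigma, n):
    """P(scheme fails) for n independent versions under the assumed model."""
    p_reject = s * f / sigma                       # P(T): test fails at a stage
    if not 0.0 <= p_reject <= 1.0:
        raise ValueError("inconsistent parameters: P(T) = s*f/sigma must lie in [0, 1]")
    p_stage_ok = (1.0 - f) - p_reject * (1.0 - sigma)   # correct output and test passes
    p_success = p_stage_ok * sum(p_reject ** (i - 1) for i in range(1, n + 1))
    return 1.0 - p_success


perfect_test = recovery_block_failure_probability(f=0.1, s=1.0, sigma=1.0, n=3)
poor_specificity = recovery_block_failure_probability(f=0.1, s=1.0, sigma=0.5, n=3)
```

With a perfect test the scheme fails only if all three versions fail (0.1 cubed); halving the specificity raises the failure probability by roughly a factor of eight in this example.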
Distributed Recovery Blocks
 Two nodes carry identical copies of primary and secondary
 Node 1 executes the primary - in parallel, node 2
executes the secondary
 If node 1 fails the acceptance test, output of node
2 is used (provided that it passes the test)
 Output of node 2 can also be used if node 1 fails to
produce an output within a prespecified time
Distributed Recovery Blocks - cont.
 Once primary fails, roles of primary and secondary
are reversed
 Node 2 continues to execute the secondary copy,
which is now treated as primary
 Execution by node 1 of primary is used as a backup
This continues until execution by node 2 is flagged
erroneous, then system toggles back to using
execution by node 2 as a backup
 Rollback is not necessary - saves time - useful for
real-time system with tight task deadlines
 Scheme can be extended to N versions (primary plus
N-1 secondaries) run in parallel on N processors
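The role reversal can be sketched with a small class. This is a single-process illustration only: the two "nodes" run one after the other here, whereas the scheme runs them in parallel, and timeouts are not modeled. The versions and the acceptance test are hypothetical.

```python
import math


class DistributedRecoveryBlock:
    """Two versions run on every input; the active one's output is preferred."""

    def __init__(self, primary, secondary, acceptance_test):
        self.active, self.standby = primary, secondary
        self.accept = acceptance_test

    def step(self, x):
        out_active = self.active(x)      # would run on one node
        out_standby = self.standby(x)    # would run concurrently on the other node
        if self.accept(x, out_active):
            return out_active
        # Active copy failed: reverse roles and use the standby result that is
        # already available, so no rollback and re-execution is needed.
        self.active, self.standby = self.standby, self.active
        if self.accept(x, out_standby):
            return out_standby
        raise RuntimeError("both nodes failed the acceptance test")


def primary(x):
    """Hypothetical primary with a fault for inputs above 100."""
    return x ** 0.5 if x <= 100 else -1.0


def accept(x, out):
    return out >= 0 and abs(out * out - x) <= 1e-9 * max(1.0, x)


drb = DistributedRecoveryBlock(primary, math.sqrt, accept)
first = drb.step(4.0)      # primary output accepted
second = drb.step(400.0)   # primary rejected, secondary output used, roles swapped
```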
Exception Handling
 Exception indicates that something happened during
execution that needs attention
Control is transferred to an exception-handler routine which takes appropriate action
 Example: When executing y=a*b, if overflow, result
incorrect - signal an exception
Effective exception-handling can make a significant
improvement to system fault tolerance
 Over half of code lines in many programs are
devoted to exception-handling
Exceptions can be used to deal with
 (a) domain or range failure
 (b) out-of-ordinary event (not failure) needing special
attention
 (c) timing failure
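The y=a*b overflow case can be illustrated as follows. Python integers do not overflow, so the sketch checks a 32-bit signed range explicitly. The word size and the saturating recovery are assumptions chosen for illustration.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def checked_mul_int32(a, b):
    """Multiply, signalling an exception if the result leaves the 32-bit range."""
    y = a * b
    if not INT32_MIN <= y <= INT32_MAX:
        raise OverflowError(f"{a} * {b} does not fit in 32 bits")
    return y


def saturating_product(a, b):
    """Caller with a handler: on overflow, clamp to the nearest representable value."""
    try:
        return checked_mul_int32(a, b)
    except OverflowError:
        # Overflow implies both operands are nonzero, so the sign test is safe.
        return INT32_MAX if (a < 0) == (b < 0) else INT32_MIN
```

The handler is kept separate from the arithmetic itself, so the normal path carries only the range check.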
Domain and Range Failure
 A domain failure happens when illegal input is used
- Example: if X, Y are real numbers and $X = \sqrt{Y}$ is
attempted with Y=-1, a domain failure occurs
 A range failure occurs when program produces an
output or carries out an operation that is seen to
be incorrect in some way
 Examples include:
 Encountering an end-of-file while reading data from file
 Producing a result that violates an acceptance test
 Trying to print a line that is too long
 Generating an arithmetic overflow or underflow
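Both kinds of failure can be shown on the square-root example. In the sketch below the domain failure is the ValueError that Python's math.sqrt raises for negative input, while RangeFailure is a hypothetical exception class raised when the result violates a simple acceptance test.

```python
import math


class RangeFailure(Exception):
    """Hypothetical exception: the output violates an acceptance test."""


def real_sqrt(y):
    try:
        x = math.sqrt(y)                       # raises ValueError for y < 0
    except ValueError as exc:
        raise ValueError(f"domain failure: sqrt undefined for y={y}") from exc
    if abs(x * x - y) > 1e-9 * max(1.0, y):    # acceptance test on the output
        raise RangeFailure(f"sqrt({y}) returned an implausible {x}")
    return x


try:
    real_sqrt(-1.0)
    domain_failure_seen = False
except ValueError:
    domain_failure_seen = True
```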
Out-of-the-Ordinary Events
Exceptions can be used to ensure special handling of
rare, but perfectly normal, events
 Example - Reading the last item of a list from a
file - may trigger an exception to notify invoker
that this was the last item
 Timing Failures:
 In real-time applications, tasks have deadlines
 If deadlines are violated - can trigger an exception
 Exception-handler decides what to do in
response: e.g., may switch to a backup routine
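A timing failure handled by switching to a backup routine might look like the sketch below. It is deliberately simplified: the deadline is checked only after the task returns, whereas a real-time system would normally use a watchdog timer or preemption. All names are hypothetical.

```python
import time


class DeadlineMissed(Exception):
    """Hypothetical exception signalling a timing failure."""


def run_with_deadline(task, deadline_s):
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        raise DeadlineMissed(f"task took {elapsed:.3f}s, deadline {deadline_s}s")
    return result


def control_step(primary, backup, deadline_s):
    """Exception-handler policy: fall back to a faster, cruder backup routine."""
    try:
        return run_with_deadline(primary, deadline_s)
    except DeadlineMissed:
        return backup()


def slow_primary():
    time.sleep(0.05)
    return "precise"


def fast_backup():
    return "approximate"
```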
Requirements of Exception-Handlers
 (1) Should be easy to program and use
 Be modular and separable from rest of software
 Not be mixed with other lines of code in a routine -
would be hard to understand, debug, and modify
 (2) Exception-handling should not impose a
substantial overhead on normal functioning of system
 Exceptions should be invoked only in exceptional
circumstances
 Exception-handling should not inflict a burden in the usual
case with no exception conditions
 (3) Exception-handling must not compromise system
state - not render it inconsistent
Software Reliability - Definitions
Failure - departure of software behavior from user
requirements
 Reliability - probability of failure-free software
operation in a defined environment for a specified
period of time
 Software does not deteriorate with time like
hardware - its reliability remains constant if no changes are made
 If faults are detected and removed during testing - software reliability will increase with time
 Notations:
 N - number of faults existing at the start of testing (can be
a random variable)
 M(t) - number of faults detected and removed by time t
 N-M(t) - number of faults remaining at time t
Software Reliability Models
 Attempt to predict future failure rate of software as
a function of either number of faults removed or
number of faults remaining at time t
 Unlike hardware models, these are largely untested
 Available models often give contradictory results
 When testing starts, "easiest" faults are caught
quickly
 Remaining faults are more difficult to catch - either
harder to exercise or their effects masked by
subsequent computations
Rate at which a yet-undiscovered fault causes failures
drops as testing proceeds
 Failure rate - described either as a decreasing
function of M(t) or as an increasing function of N-M(t)
Jelinski-Moranda Model
 Failure rate $\lambda(t)$ is proportional to number of
faults remaining in software
 $\lambda(t) = C\,(N - M(t))$
 When a failure is detected and its fault removed, time to
next failure is exponentially distributed with
parameter $\lambda(t)$
 Problem - not all faults are equal: some occur
more often - others more difficult to catch
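Under this model the expected time to the next failure, after k faults have been removed, is the mean of an exponential distribution with rate C(N-k). A small sketch with illustrative parameter values:

```python
def jm_failure_rate(c, n_faults, removed):
    """Jelinski-Moranda failure rate with `removed` faults already fixed."""
    return c * (n_faults - removed)


def jm_expected_times_between_failures(c, n_faults):
    """Mean of the exponential time to next failure after 0, 1, ..., N-1 removals."""
    return [1.0 / jm_failure_rate(c, n_faults, k) for k in range(n_faults)]


# Illustrative values only: C = 0.01 per hour per fault, N = 5 initial faults.
times = jm_expected_times_between_failures(0.01, 5)
```

Every removal lowers the rate by the same amount C, which is exactly the equal-faults assumption the slide criticizes.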
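Littlewood-Verrall Model
 Time between failures is exponentially distributed
with parameter $\lambda(i)$ where i = N-M(t) - number of remaining
faults
 $\lambda(i)$ - random variable with the gamma density
function
$f_{\lambda(i)}(\ell) = \frac{[\psi(i)]^{\alpha}\, \ell^{\alpha-1}\, e^{-\psi(i)\ell}}{\Gamma(\alpha)}$
 Gamma function $\Gamma(\alpha)$ - generalization of the factorial
function
 Find $\psi(i)$ and $\alpha$ by experiments on the software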
Musa-Okumoto Logarithmic Poisson
Execution Time Model
 More widely used software reliability model
 Failure rate after testing for time t is
 $\lambda(t) = \lambda(0)\exp(-\theta\,\mu(t))$
 $\theta$ - constant
 $\mu(t)$ - expected value of M(t) - number of failures
experienced and removed by time t
 $d\mu(t)/dt = \lambda(t)$
Musa-Okumoto Model - Cont.
(0)
=1
(0)=1

 Very slow decay of failure rate requiring significant
amount of testing
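The slow decay can be seen from a closed form. Assuming $\mu(0) = 0$, solving $d\mu/dt = \lambda(0)\exp(-\theta\mu)$ gives $\mu(t) = \ln(1 + \lambda(0)\theta t)/\theta$ and hence $\lambda(t) = \lambda(0)/(1 + \lambda(0)\theta t)$. This derivation is added here as an aid and is not in the slides. The sketch evaluates it for $\lambda(0) = \theta = 1$:

```python
import math


def mo_mean_failures(t, lam0, theta):
    """mu(t): expected number of failures by time t (assuming mu(0) = 0)."""
    return math.log(1.0 + lam0 * theta * t) / theta


def mo_failure_rate(t, lam0, theta):
    """lambda(t) = lam0 * exp(-theta * mu(t)), written in closed form."""
    return lam0 / (1.0 + lam0 * theta * t)


for t in (0.0, 9.0, 99.0, 999.0):
    print(t, mo_failure_rate(t, 1.0, 1.0))
```

The rate falls only like 1/t: each further tenfold reduction in failure rate needs roughly ten times more testing.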
Selecting Model and Parameter Estimation
 (1) Which model is appropriate
 (2) How to estimate model parameters
 No comprehensive experimental data to guide users
 Study failure rate as a function of testing, and
guess which model it follows
 Then - estimate its parameters
 Use standard statistical estimation techniques, e.g., Maximum Likelihood and Least Squares
methods
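As one hedged illustration of least-squares estimation, suppose the Musa-Okumoto model has been chosen and failure rates have been estimated at several test times. Under that model with $\mu(0) = 0$, $1/\lambda(t) = 1/\lambda(0) + \theta t$, so a straight-line fit of $1/\lambda$ against t yields both parameters. The data below are synthetic, generated from the model itself, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_musa_okumoto(times, rates):
    """Estimate (lambda(0), theta) from 1/lambda(t) = 1/lambda(0) + theta * t."""
    slope, intercept = fit_line(times, [1.0 / r for r in rates])
    return 1.0 / intercept, slope


# Synthetic observations generated with lambda(0) = 2 and theta = 0.5.
times = [0.0, 1.0, 2.0, 5.0, 10.0]
rates = [2.0 / (1.0 + 2.0 * 0.5 * t) for t in times]
lam0_hat, theta_hat = fit_musa_okumoto(times, rates)
```

With real, noisy failure data the fit would not be exact, and the reciprocal transform distorts the noise, which is one reason maximum likelihood is often preferred.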
Basics of Exceptions and Exception-Handling
 Internal or external
Is Software Reliability Modeling Hopeless?
 Can we model software reliability with confidence?
 We show that this is intrinsically impractical
 Suppose we try to determine if some model works for
an ultra-reliable application (extremely low failure
rates desired)
 Do so by experiment or simulation
 Find mean E(X) of a random variable X with standard
deviation $\sigma(X)$ by experiment or simulation
 Collect a sample of n randomly selected and
statistically independent instances
 Example: X is waiting time in queue - run n
simulations to get n instances of X
 $\bar{X}$, S(X) - average, standard deviation of n instances
- different value for each new sample of size n
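Software Reliability Modeling
 For a large enough n, use Central Limit Theorem and
calculate an interval around E(X) in which $\bar{X}$ will reside
with a pre-determined probability
 $1-\alpha$ - probability that $\bar{X}$ will reside in range:
$E(X) - Z_{\alpha/2}\,\sigma(X)/\sqrt{n} \le \bar{X} \le E(X) + Z_{\alpha/2}\,\sigma(X)/\sqrt{n}$
where $Z_k$ satisfies $P(Z > Z_k) = k$ for a standard normal random variable Z
 Estimate $\sigma(X)$ by S(X):
$P\left(\bar{X} - Z_{\alpha/2}\,S(X)/\sqrt{n} \le E(X) \le \bar{X} + Z_{\alpha/2}\,S(X)/\sqrt{n}\right) \approx 1-\alpha$
 Confidence interval for E(X):
$\left[\bar{X} - Z_{\alpha/2}\,S(X)/\sqrt{n},\ \bar{X} + Z_{\alpha/2}\,S(X)/\sqrt{n}\right]$,
confidence level $1-\alpha$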
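A normal-approximation confidence interval of this kind can be computed with the Python standard library. The sketch assumes the usual form, sample mean plus or minus $Z_{\alpha/2} S(X)/\sqrt{n}$, and uses a tiny made-up sample purely to show the arithmetic; the approximation itself needs a large n.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, alpha=0.05):
    """Normal-approximation interval for E(X) with confidence level 1 - alpha."""
    n = len(sample)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # Z_{alpha/2}
    half_width = z * stdev(sample) / math.sqrt(n)   # S(X) is the sample std deviation
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Made-up waiting times, for illustration only.
low, high = confidence_interval([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Halving the width requires four times as many instances, since the width shrinks only as one over the square root of n.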
Cont. - incomplete
 Probability that interval will include the real value of
E(X) is $1-\alpha$
 If a sample of size n is selected many times, and a
different interval is calculated each time, a fraction
$1-\alpha$ of these intervals will include E(X)
 The width of the interval is $2\,Z_{\alpha/2}\,S(X)/\sqrt{n}$
 Interval informative if it has high confidence level
and a small width
 After a specific interval is calculated based on $\bar{X}$ and
S(X) obtained from the sample, it is not accurate to
claim that E(X) lies within this interval with
probability $1-\alpha$
Cont. 2 - incomplete
 Equation is only correct before the sample is taken,
when there is still a probability of $1-\alpha$ that the
resulting calculated interval will include in it the real
E(X) as one of its points
 Once the interval is calculated, it either does or does
not contain E(X) in it (and so the probability that the
interval includes E(X) is either 0 or 1)
 Still, we claim to have a "confidence" of $1-\alpha$ that
the calculated interval includes the real value of E(X)
 This level of confidence is based on the fact that the
procedure which generated the interval has a $1-\alpha$
probability of success
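The "procedure succeeds with probability 1 - alpha" reading can be checked by simulation: draw many samples, build an interval from each, and count how often the known true mean is covered. The sketch assumes exponentially distributed waiting times and a fixed seed, both arbitrary choices made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def coverage(true_mean, n, trials, alpha=0.05, seed=1):
    """Fraction of normal-approximation intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        half_width = z * stdev(sample) / math.sqrt(n)
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / trials


fraction = coverage(true_mean=2.0, n=100, trials=1000)
```

Each individual interval either contains the true mean or it does not; only the long-run fraction across repeated samples is close to 1 - alpha.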