Part 15: Software Fault Tolerance II
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing
ECE 655
Software Fault Tolerance II
ECE655/swftII .1
Copyright 2004 Koren & Krishna
Data Diversity
Input space of a program can be divided into fault
and non-fault regions - program fails if and only if an
input from the fault region is applied
Consider an unrealistic input space of 2 dimensions
In both cases Fault regions occupy
a third of input area
Perturb input slightly - new input may fall in a non-fault region
This is called data diversity
One copy of software - use acceptance test; if it fails, recompute with perturbed inputs and recheck output
Massive redundancy - apply slightly different input
sets to different versions and vote
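The single-copy variant can be sketched as a retry loop. This is only an illustration under stated assumptions: the program `buggy_reciprocal_sqrt`, its fault region [2.0, 2.001), the acceptance test, and the perturbation sizes are all hypothetical and not taken from the lecture.

```python
import math

def buggy_reciprocal_sqrt(x: float) -> float:
    """Hypothetical program: computes 1/sqrt(x) but fails inside a small fault region."""
    if 2.0 <= x < 2.001:
        return float("nan")          # simulated software fault
    return 1.0 / math.sqrt(x)

def acceptance_test(x: float, y: float) -> bool:
    """Accept y if y*y*x is close to 1 (loose tolerance, since the input may be perturbed)."""
    return not math.isnan(y) and abs(y * y * x - 1.0) < 1e-2

def run_with_data_diversity(x: float, deltas=(0.0, 1e-3, -1e-3, 2e-3)):
    """Run on the original input first, then on slightly perturbed inputs."""
    for d in deltas:
        y = buggy_reciprocal_sqrt(x + d)
        if acceptance_test(x, y):
            return y, d              # output and the perturbation that produced it
    raise RuntimeError("all perturbed inputs failed the acceptance test")
```

Calling `run_with_data_diversity(2.0005)` hits the fault region on the first attempt and succeeds once the input is nudged out of it. The returned output is then an inexact answer for the original input.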
Explicit & Implicit Perturbation
Explicit - add a small deviation term to a selected
subset of inputs
Implicit - gather inputs to program such that we can
expect them to be slightly different
Example 1: software control of an industrial process - inputs are pressure and temperature of boiler
Every second - $(p_i, t_i)$ measured - input to controller
Measurement at time i not much different from that at time i-1
Implicit perturbation may consist of using $(p_{i-1}, t_{i-1})$
as alternative to $(p_i, t_i)$
If $(p_i, t_i)$ is in fault region - $(p_{i-1}, t_{i-1})$ may not be
Explicit Perturbation - Reorder Inputs
Example 2: add floating-point numbers a,b,c -
compute a+b, and then add c
a=1.1E+20, b=5, c=-1.1E+20
Depending on precision used, a+b may be 1.1E+20
resulting in a+b+c=0
Change order of inputs to a,c,b - then a+c=0 and
a+c+b=5
Example 2 - exact re-expression - output can be
used as is (if it passes acceptance test or vote)
Example 1 - inexact re-expression - likely to have
$f(p_i, t_i) \neq f(p_{i-1}, t_{i-1})$
Use raw output as a degraded but acceptable
alternative, or attempt to correct before use, e.g.,
Taylor expansion
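Example 2 can be reproduced directly in IEEE 754 double precision. The helper name `sum_in_order` is an assumption made for this sketch; the values of a, b, and c are those given in the example.

```python
def sum_in_order(values):
    """Add floating-point numbers strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

a, b, c = 1.1e20, 5.0, -1.1e20

original = sum_in_order([a, b, c])    # (a + b) + c: b is absorbed by a, so the result is 0.0
reordered = sum_in_order([a, c, b])   # (a + c) + b: the large terms cancel first, giving 5.0
```

Reordering is an exact re-expression here: both orders compute the same mathematical sum, so the reordered output can be used as is.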
Software Implemented Hardware Fault
Tolerance (SIHFT)
Variation on data diversity, used to deal with
permanent hardware failures
Each input is multiplied by a constant, k, and program
is transformed to correct for this multiplication
Example: construct if (x=y) then ...
x=001, y=000 - equality checked by hardware that has its
x0 input stuck-at-0 - will erroneously compute x=y
SIHFT with k=2 yields
x=010, y=000 - fault not exercised - circuit correctly
determines x≠y
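The stuck-at-0 example can be simulated in a few lines. This is a minimal sketch: `faulty_equal` is a hypothetical software model of the faulty comparator, and the transformed program simply compares k·x with k·y, which is equivalent to comparing x with y for any nonzero k.

```python
K = 2  # SIHFT multiplier k from the example

def faulty_equal(x: int, y: int) -> bool:
    """Model of an equality circuit whose x0 input (bit 0 of x) is stuck at 0."""
    x_seen = x & ~1              # the circuit always sees bit 0 of x as 0
    return x_seen == y

def sihft_equal(x: int, y: int) -> bool:
    """Transformed comparison: operate on k*x and k*y instead of x and y."""
    return faulty_equal(K * x, K * y)
```

With x=001 and y=000 the plain comparison wrongly reports equality, while the transformed comparison works on 010 and 000, leaves bit 0 unused, and reports inequality.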
SIHFT
Transforming the program to compensate for
multiplying by k is not difficult
Finding an appropriate value of k:
(1) Ensure that it is possible to find suitable data
types so that arithmetic overflow or underflow
does not happen
(2) Select k such that it is able to mask a large
fraction of the hardware faults - experimental
studies by injecting faults
Recovery Block Approach
N versions, one running - if it fails, execution is switched to a backup
Example - primary + 3 secondary versions
Primary executed - output passed to acceptance test
If output is not accepted - system state is rolled back and secondary 1 starts, and so on
If all fail - computation fails
Success of recovery block approach depends on
failure independence of different versions and
quality of acceptance test
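The control flow of a recovery block can be sketched as follows. The state representation (a dictionary), the deep-copy checkpoint, and the sample versions `primary`, `secondary`, and `sqrt_test` are assumptions chosen for illustration, not part of the lecture.

```python
import copy

def recovery_block(versions, acceptance_test, state, x):
    """Try each version in order; roll the state back after every rejected attempt."""
    checkpoint = copy.deepcopy(state)            # recovery point taken on entry
    for version in versions:
        try:
            result = version(state, x)
            accepted = acceptance_test(x, result)
        except Exception:
            accepted = False                     # a crash counts as a failed attempt
        if accepted:
            return result
        state.clear()                            # roll back to the checkpoint
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all versions failed the acceptance test")

def primary(state, x):
    """Hypothetical faulty primary: returns x/2 instead of the square root."""
    state["attempts"] = state.get("attempts", 0) + 1
    return x / 2.0

def secondary(state, x):
    """Hypothetical correct secondary."""
    state["attempts"] = state.get("attempts", 0) + 1
    return x ** 0.5

def sqrt_test(x, y):
    """Acceptance test: the output squared must reproduce the input."""
    return abs(y * y - x) < 1e-9
```

Running `recovery_block([primary, secondary], sqrt_test, {}, 9.0)` rejects the primary's output, discards its state change, and returns the secondary's result.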
Recovery Block Approach - Analytical Model
Assumption - different versions fail independently
Notations:
E - the event - output of a version is erroneous
T - the event - test fails (test detects a fault)
f - failure probability of a version
$f = P(E)$
s - test sensitivity
$s = P(T \mid E)$
$\sigma$ - test specificity
$\sigma = P(E \mid T)$
N - number of software versions
Failure Probability Calculation
Calculate complementary probability - of success
For the scheme to succeed, it must succeed at some
stage i, $1 \le i \le N$
Both software and test fail at stages 1,...,i-1, and
at stage i the software version is correct and the
output passes the test
Notation: E - output of version erroneous; T - test fails; $f = P(E)$; $s = P(T \mid E)$; $\sigma = P(E \mid T)$
Failure Probability Calculation cont.
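$P(E \mid T) = P(E \cap T)/P(T)$ and $P(E \cap T) = P(T \mid E)\,P(E) = sf$, and therefore,
$P(T) = sf/\sigma$, where $\sigma = P(E \mid T)$ is the test specificity
and
$P(\bar{E} \cap \bar{T}) = P(\bar{E}) - P(\bar{E} \cap T) = (1-f) - sf(1-\sigma)/\sigma$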
Substituting and summing over i yields
P(scheme fails) = 1 - P(scheme succeeds)
Critical importance of high acceptance-test
specificity
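The calculation can be evaluated numerically. This sketch rests on one reading of the model, stated here as an assumption: the scheme reaches stage i whenever the test rejects the output at each of stages 1,...,i-1 (event T), and it succeeds at stage i when the output is correct and accepted. With f = P(E), s = P(T|E), and specificity sigma = P(E|T), this gives P(T) = f·s/sigma and P(correct and accepted) = (1-f) - f·s·(1-sigma)/sigma. The function name is hypothetical.

```python
def recovery_block_failure_probability(f: float, s: float, sigma: float, n: int) -> float:
    """Failure probability of an n-version recovery block under the assumptions above."""
    p_reject = f * s / sigma                              # P(T)
    p_good = (1.0 - f) - f * s * (1.0 - sigma) / sigma    # P(output correct and accepted)
    p_success = sum(p_reject ** (i - 1) * p_good for i in range(1, n + 1))
    return 1.0 - p_success
```

With a perfect test (s = sigma = 1) the result reduces to f**n, as expected for independent versions. Lowering sigma while holding f, s, and n fixed raises the failure probability, which is the point made about specificity.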
Distributed Recovery Blocks
Two nodes carry identical copies of primary and secondary
Node 1 executes the primary - in parallel, node 2
executes the secondary
If node 1 fails the acceptance test, output of node
2 is used (provided that it passes the test)
Output of node 2 can also be used if node 1 fails to
produce an output within a prespecified time
Distributed Recovery Blocks - cont.
Once primary fails, roles of primary and secondary
are reversed
Node 2 continues to execute the secondary copy,
which is now treated as primary
Execution by node 1 of primary is used as a backup
This continues until execution by node 2 is flagged
erroneous, then system toggles back to using
execution by node 2 as a backup
Rollback is not necessary - saves time - useful for
real-time system with tight task deadlines
Scheme can be extended to N versions (primary plus
N-1 secondaries run in parallel on N processors)
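The role reversal can be illustrated with a sequential simulation. This is a simplification under stated assumptions: the two nodes are modeled as two function calls per input instead of truly parallel executions, the timeout path is omitted, and the versions and acceptance test (an absolute-value task) are hypothetical.

```python
def run_distributed_recovery_blocks(inputs, version_a, version_b, acceptance_test):
    """Use the active copy's output when accepted; otherwise use the backup and swap roles."""
    active, backup = version_a, version_b
    outputs = []
    for x in inputs:
        y_active = active(x)             # in a real system both nodes compute concurrently
        y_backup = backup(x)
        if acceptance_test(x, y_active):
            outputs.append(y_active)
        elif acceptance_test(x, y_backup):
            outputs.append(y_backup)     # no rollback needed: the result is already there
            active, backup = backup, active
        else:
            outputs.append(None)         # both copies failed for this input
    return outputs

def version_a(x):
    """Hypothetical copy that is wrong for negative inputs."""
    return x

def version_b(x):
    """Hypothetical copy that is wrong only for x == 7."""
    return 0 if x == 7 else abs(x)

def abs_test(x, y):
    """Hypothetical acceptance test for the absolute-value task."""
    return y >= 0 and y * y == x * x
```

For the inputs `[3, -4, 5, 7, -2]` the roles swap three times, and every output is still correct because the backup result is available immediately.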
Exception Handling
Exception indicates that something happened during
execution that needs attention
Control is transferred to an exception-handler routine which takes appropriate action
Example: When executing y=a*b, if overflow, result
incorrect - signal an exception
Effective exception-handling can make a significant
improvement to system fault tolerance
Over half of code lines in many programs are
devoted to exception-handling
Exceptions can be used to deal with
(a) domain or range failure
(b) out-of-ordinary event (not failure) needing special
attention
(c) timing failure
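The y=a*b overflow example can be sketched in Python, with the caveat that Python integers do not overflow, so a 32-bit signed range is checked explicitly. The function names and the saturating recovery policy are assumptions for illustration only.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def checked_mul_int32(a: int, b: int) -> int:
    """Multiply and signal an exception if the result does not fit in 32 signed bits."""
    y = a * b
    if not INT32_MIN <= y <= INT32_MAX:
        raise OverflowError(f"{a} * {b} does not fit in 32 bits")
    return y

def saturating_product(a: int, b: int) -> int:
    """Caller with an exception handler: on overflow, fall back to a saturated value."""
    try:
        return checked_mul_int32(a, b)
    except OverflowError:
        # Handler kept separate from the normal path: clamp to the nearest limit.
        return INT32_MAX if (a < 0) == (b < 0) else INT32_MIN
```

The normal path pays only for one range check, and the handler runs solely in the exceptional case.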
Domain and Range Failure
A domain failure happens when illegal input is used
- Example: if X, Y are real numbers and $X = \sqrt{Y}$ is
attempted with Y=-1, a domain failure occurs
A range failure occurs when program produces an
output or carries out an operation that is seen to
be incorrect in some way
Examples include:
Encountering an end-of-file while reading data from file
Producing a result that violates an acceptance test
Trying to print a line that is too long
Generating an arithmetic overflow or underflow
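Both kinds of failure map naturally onto exceptions. In the sketch below, the domain failure uses Python's own behavior (`math.sqrt` raises `ValueError` for a negative argument), while `RangeFailure`, `format_line`, and the 80-character limit are hypothetical names and values introduced for the example.

```python
import math

class RangeFailure(Exception):
    """Hypothetical exception for an output that is seen to be incorrect."""

def safe_sqrt(y: float):
    """Domain failure: an illegal input (negative y) is reported as None."""
    try:
        return math.sqrt(y)
    except ValueError:
        return None

def format_line(text: str, width: int = 80) -> str:
    """Range failure: refuse to produce a line that is too long."""
    if len(text) > width:
        raise RangeFailure(f"line has {len(text)} characters, limit is {width}")
    return text
```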
Out-of-the-Ordinary Events
Exceptions can be used to ensure special handling of
rare, but perfectly normal, events
Example - Reading the last item of a list from a
file - may trigger an exception to notify invoker
that this was the last item
Timing Failures:
In real-time applications, tasks have deadlines
If deadlines are violated - can trigger an exception
Exception-handler decides what to do in
response: e.g., may switch to a backup routine
Requirements of Exception-Handlers
(1) Should be easy to program and use
Be modular and separable from rest of software
Not be mixed with other lines of code in a routine -
would be hard to understand, debug, and modify
(2) Exception-handling should not impose a
substantial overhead on normal functioning of system
Exceptions should be invoked only in exceptional
circumstances
Exception-handling should not inflict a burden in the usual
case with no exception conditions
(3) Exception-handling must not compromise system
state - not render it inconsistent
Software Reliability - Definitions
Failure - departure of software behavior from user
requirements
Reliability - probability of failure-free software
operation in a defined environment for a specified
period of time
Software does not deteriorate with time as hardware
does - its reliability remains constant if no changes are made
If faults are detected and removed during testing, software reliability will increase with time
Notations:
N - number of faults existing at the start of testing (can be
a random variable)
M(t) - number of faults detected and removed by time t
N-M(t) - number of faults remaining at time t
Software Reliability Models
Attempt to predict future failure rate of software as
a function of either number of faults removed or
number of faults remaining at time t
Unlike hardware models, these are largely untested
Available models often give contradictory results
When testing starts, "easiest" faults are caught
quickly
Remaining faults are more difficult to catch - either
harder to exercise or their effects masked by
subsequent computations
Rate at which a yet-undiscovered fault causes failures
drops as testing proceeds
Failure rate - described either as a decreasing
function of M(t) or as an increasing function of N-M(t)
Jelinski-Moranda Model
Failure rate $\lambda(t)$ is proportional to number of
faults remaining in software
$\lambda(t) = C\,(N - M(t))$
When a fault is detected and removed, time to
next failure is exponentially distributed with
parameter $\lambda(t)$
Problem - not all faults are equal: some occur
more often - others more difficult to catch
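A small calculation shows what the model predicts. Since an exponential distribution with parameter λ has mean 1/λ, the expected time to the next failure after k faults have been removed is 1/(C(N-k)). The function names and the sample values of C and N are assumptions for this sketch.

```python
def jm_failure_rate(c: float, n_faults: int, removed: int) -> float:
    """Jelinski-Moranda failure rate C * (N - M), with M faults already removed."""
    return c * (n_faults - removed)

def jm_expected_times_between_failures(c: float, n_faults: int):
    """Mean time to the next failure after 0, 1, ..., N-1 faults have been removed."""
    return [1.0 / jm_failure_rate(c, n_faults, k) for k in range(n_faults)]

times = jm_expected_times_between_failures(c=0.5, n_faults=4)
```

Every fault contributes the same amount C to the failure rate, so the expected gaps grow in a fixed pattern. That uniformity is exactly what the model is criticized for.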
Littlewood-Verrall Model
Time between failures is exponentially distributed
with parameter $\lambda(i)$, where i = N-M(t) - number of remaining
faults
$\lambda(i)$ - random variable with the gamma density
function
Gamma function - generalization of the factorial
function
Find the parameters of the gamma density (one of which depends on i) by experiments on the software
Musa-Okumoto Logarithmic Poisson
Execution Time Model
More widely used software reliability model
Failure rate after testing for time t is
$\lambda(t) = \lambda(0)\exp(-\theta\,\mu(t))$
$\theta$ - constant
$\mu(t)$ - expected value of M(t) - number of failures
experienced and removed by time t
$d\mu(t)/dt = \lambda(t)$
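The model's two relations, failure rate = λ(0)·exp(-θ·μ(t)) and dμ(t)/dt = failure rate, can be solved in closed form. Assuming μ(0) = 0 (no failures before testing starts), integration gives μ(t) = ln(λ(0)θt + 1)/θ and a failure rate of λ(0)/(λ(0)θt + 1). The function names below are hypothetical.

```python
import math

def mo_expected_failures(t: float, lam0: float, theta: float) -> float:
    """mu(t): expected number of failures experienced by time t, assuming mu(0) = 0."""
    return math.log(lam0 * theta * t + 1.0) / theta

def mo_failure_rate(t: float, lam0: float, theta: float) -> float:
    """lambda(t) = lam0 / (lam0 * theta * t + 1), the closed-form failure rate."""
    return lam0 / (lam0 * theta * t + 1.0)
```

The failure rate decays only as 1/t: with λ(0) = θ = 1, it takes t = 99 time units of testing to bring the rate down to 1% of its initial value.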
Musa-Okumoto Model - Cont.
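[Figure: Musa-Okumoto failure rate versus testing time t; legend fragments suggest $\lambda(0) = 1$ and $\theta = 1$]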
Very slow decay of failure rate requiring significant
amount of testing
Selecting Model and Parameter Estimation
(1) Which model is appropriate
(2) How to estimate model parameters
No comprehensive experimental data to guide users
Study failure rate as a function of testing, and
guess which model it follows
Then - estimate its parameters
Use standard statistical estimation techniques, e.g., Maximum Likelihood and Least Squares
methods
Basics of Exceptions and Exception-Handling
Internal or external
Is Software Reliability Modeling Hopeless?
Can we model software reliability with confidence?
We show that this is intrinsically impractical
Suppose we try to determine if some model works for
an ultra-reliable application (extremely low failure
rates desired)
Do so by experiment or simulation
Find mean E(X) of a random variable X with standard
deviation $\sigma(X)$ by experiment or simulation
Collect a sample of n randomly selected and
statistically independent instances
Example: X is waiting time in queue - run n
simulations to get n instances of X
$\bar{X}$, S(X) - average, standard deviation of n instances
- different value for each new sample of size n
Software Reliability Modeling
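For a large enough n, use Central Limit Theorem and
calculate an interval around E(X) in which the sample average $\bar{X}$ will reside
with a pre-determined probability
$1-\alpha$ - probability that $\bar{X}$ will reside in range:
$E(X) - Z_{\alpha/2}\,\sigma(X)/\sqrt{n} \;\le\; \bar{X} \;\le\; E(X) + Z_{\alpha/2}\,\sigma(X)/\sqrt{n}$
where $Z_k$ satisfies $P(Z > Z_k) = k$ for a standard normal random variable Z
Estimate $\sigma(X)$ by S(X):
Confidence interval for E(X): $\left[\bar{X} - Z_{\alpha/2}\,S(X)/\sqrt{n},\ \bar{X} + Z_{\alpha/2}\,S(X)/\sqrt{n}\right]$
confidence level $1-\alpha$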
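The interval can be computed with the Python standard library. This is a minimal sketch assuming the normal approximation holds (large n); the function name and the default of 0.05 for the complement of the confidence level are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, alpha: float = 0.05):
    """Normal-approximation confidence interval for E(X) at confidence level 1 - alpha."""
    n = len(sample)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 point of the standard normal
    half_width = z * stdev(sample) / sqrt(n)      # stdev() is the sample standard deviation S(X)
    center = mean(sample)
    return center - half_width, center + half_width
```

For example, `confidence_interval([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])` returns an interval centered on the sample average 5.0. A sample this small is used only to show the mechanics, since the Central Limit Theorem argument needs a much larger n.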
Cont. - incomplete
Probability that interval will include the real value of
E(X) is $1-\alpha$
If a sample of size n is selected many times, and a
different interval is calculated each time, a fraction
$1-\alpha$ of these intervals will include E(X)
The width of the interval is $2\,Z_{\alpha/2}\,S(X)/\sqrt{n}$
Interval informative if it has high confidence level
and a small width
After a specific interval is calculated based on $\bar{X}$ and
S(X) obtained from the sample, it is not accurate to
claim that E(X) lies within this interval with
probability $1-\alpha$
Cont. 2 - incomplete
Equation is only correct before the sample is taken,
when there is still a probability of $1-\alpha$ that the
resulting calculated interval will include in it the real
E(X) as one of its points
Once the interval is calculated, it either does or does
not contain E(X) in it (and so the probability that the
interval includes E(X) is either 0 or 1)
Still, we claim to have a "confidence" of $1-\alpha$ that
the calculated interval includes the real value of E(X)
This level of confidence is based on the fact that the
procedure which generated the interval has a $1-\alpha$
probability of success
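The procedural meaning of confidence can be checked by simulation: draw many samples from a distribution whose mean is known, build an interval from each, and count how often the known mean is covered. The distribution (normal with mean 10 and standard deviation 2), the sample size, the number of trials, and the seed below are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def coverage(n=50, trials=2000, alpha=0.05, true_mean=10.0, true_sd=2.0, seed=1):
    """Fraction of normal-approximation intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        half_width = z * stdev(sample) / sqrt(n)
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1                      # this particular interval covers E(X)
    return hits / trials
```

Each individual interval either contains the true mean or does not. Only the long-run fraction returned by `coverage()` is expected to be close to the stated confidence level.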