Review of the October 20-21 MIT Autonomous Sensing Conference

Software Reliability Methods and Experience
Dave Dwyer
USA – E&IS
david.j.dwyer@baesystems.com
2007 MIT BAE Systems Fall Conference: October 30-31
Overview and outline
• Definitions
• Similarities and differences: hardware and software reliability
• Foundations of Musa’s models reviewed
– Trachtenberg (Trachtenberg, Martin. “The Linear Software Reliability
Model and Uniform Testing,” IEEE Transactions on Reliability, 1985,
pp 8-16)
– Downs (Downs, Thomas. “An Approach to the Modeling of Software
Testing with Some Applications,” IEEE Transactions on Software
Engineering, Vol. SE-11, No. 4, April 1985, pp 375-386)
• Instantaneous Failure Rate, a.k.a. failure intensity
– Hardware - Duane, Codier
– Software - analogous derivation
• Testing results
• SW reliability calculator
SW reliability defined
• Software reliability defined:
– The probability of failure-free operation for a specified time in a specified
environment for a specified purpose (“Software Engineering”, 5th edition,
I. Sommerville, Addison-Wesley, 1995)
– The probability of failure-free operation of a computer program for a specified
time in a specified environment (“Software Reliability”, Musa, Iannino,
Okumoto, McGraw-Hill, 1987)
– We will use MTBF or its reciprocal, λ
HW vs. SW reliability
• The hardware reliability discipline provided the impetus for safety margins
against both mechanical and electrical stresses
• But margins of safety don’t mean much in software because it doesn’t wear out
• Software has ‘x’ failures per million unique executions [if ‘y’ executions/hour, then
‘xy’ failures/million hours]
• Once a process has been successfully executed, that identical process is not
going to fail in the future
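The rate conversion in the third bullet is plain unit arithmetic. A minimal sketch follows; the function names and the numbers are illustrative assumptions, not values from the test program:

```python
def failures_per_million_hours(x_per_million_exec, y_exec_per_hour):
    """x failures per million unique executions, run at y executions/hour,
    gives x*y failures per million hours."""
    return x_per_million_exec * y_exec_per_hour


def mtbf_hours(x_per_million_exec, y_exec_per_hour):
    """MTBF is the reciprocal of the failure rate expressed per hour."""
    per_hour = failures_per_million_hours(x_per_million_exec, y_exec_per_hour) / 1e6
    return 1.0 / per_hour


# Hypothetical inputs: 2 failures per million executions, 500 executions/hour
print(failures_per_million_hours(2, 500))
print(mtbf_hours(2, 500))
```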
Martin Trachtenberg (1985):
• Simulation testing showed that:
– Testing the functions of the software system in a random or round-robin order
and fixing the failures gives linearly decaying system error rates
– Testing and fixing each function exhaustively one at a time gives flat
system-error rates
– Testing and fixing different functions at widely different frequencies gives
exponentially decaying system error rates [operational profile testing], and
– Testing strategies that result in linearly decaying error rates tend to require the
fewest tests to detect a given number of errors
– Testing to the operational profile gives the shortest time to reach an operational
MTBF
Downs’ ‘Pure’ approach reflected the nature
of software (1985)
• The execution of a sequence of M paths
• The actual number of paths affected by a fault is treated as a random variable ‘c’
• Not all paths are equally likely to be executed
• $\lambda_j = (N - j)\,\phi$, where:
N = the total number of faults,
j = the number of corrected faults,
$\phi = -r \log(1 - c/M)$,
r = the number of paths executed/unit time
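The relation between the failure rate and the number of corrected faults can be sketched in a few lines. This is an illustration only: it treats c as a fixed constant (the slide treats it as a random variable), and the values of N, r, c and M are hypothetical.

```python
import math


def per_fault_rate(r, c, M):
    """phi = -r * log(1 - c/M): failure-rate contribution of one fault."""
    return -r * math.log(1.0 - c / M)


def failure_rate(N, j, r, c, M):
    """lambda_j = (N - j) * phi, the failure rate after j faults are corrected."""
    return (N - j) * per_fault_rate(r, c, M)


# Hypothetical values: 66 initial faults, 100 paths executed per unit time,
# each fault affecting 2 of 10,000 paths.
N, r, c, M = 66, 100.0, 2, 10_000
for j in (0, 33, 66):
    print(j, failure_rate(N, j, r, c, M))

# For c much smaller than M, -log(1 - c/M) is close to c/M, so phi ~ r*c/M.
print(per_fault_rate(r, c, M), r * c / M)
```

The failure rate falls linearly with j and reaches zero when all N faults have been corrected.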
Downs’ execution path parameters
[Figure: execution paths 1, 2, 3, ... M leaving Start, with faults x1, x2, ... xN located on the paths. Labels: 2 paths affected by x1; 1 path affected by x2; ‘M’ total paths; ‘N’ total faults initially; ‘c’ paths affected by an arbitrary fault]
Our data analysis approach
• Cumulative 8-hour test shifts are recorded
• Failures plotted:
– All
– First instance
• The last data point will be put at the end of the test time
• Only integration and system test data
Failure rate is proportional to failure number,
Downs: $\lambda_j \approx (N - j)\,r\,(c/M)$
Given:
N = total initial number of faults
$\lambda(0)$ = initial failure rate => 0 errors detected/corrected (start of testing)
$\lambda_j$ = cumulative failure rate after some number of faults is detected, ‘j’
j = the number of faults removed over time
$\lambda_i$ = instantaneous failure rate (failure intensity)
T = time
[Figure: plot of j (from 0 to N) against $\lambda_j = j/T$]
Failure rate plots against failure number for a range
of non-uniform testing profiles, M1, M2 paths and
N1, N2 initial faults in those paths
[Figure: failure rate vs. failure number; the curves are ‘concave’ or logarithmic plots]
Instantaneous failure intensity derivation ~
Duane’s for hardware
Instantaneous $\lambda$ for HW:
$\lambda_c = F/T$
$\lambda_c = kT^{-m}$
$F = kT^{1-m}$
$\lambda_i = \partial F/\partial T = k(1 - m)T^{-m}$
$\lambda_i = (1 - m)\lambda_c$
Same Approach
Instantaneous $\lambda$ for SW:
$\lambda_j = j/T$
$\lambda_j = \phi(N - j)$
$j = \phi T(N - j)$
$\lambda_i = \partial j/\partial T = \phi(N - j) - \phi T(\partial j/\partial T)$
$\lambda_i = \lambda_j - \phi T\lambda_i$
$\lambda_i(1 + \phi T) = \lambda_j$
$\lambda_i = \lambda_j/(1 + \phi T)$
Similar Result
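The software result, λ_i = λ_j/(1 + φT), can be checked numerically. The sketch below solves j = φT(N − j) for j in closed form, then compares a finite-difference derivative dj/dT with the formula. The function names are assumptions, and the parameter values are used purely for illustration.

```python
def cumulative_failures(T, N, phi):
    """Solve j = phi*T*(N - j) for j: j = N*phi*T / (1 + phi*T)."""
    return N * phi * T / (1.0 + phi * T)


def cumulative_rate(T, N, phi):
    """lambda_j = j / T."""
    return cumulative_failures(T, N, phi) / T


def instantaneous_rate(T, N, phi):
    """lambda_i = lambda_j / (1 + phi*T)."""
    return cumulative_rate(T, N, phi) / (1.0 + phi * T)


def numerical_rate(T, N, phi, h=1e-3):
    """Central-difference estimate of dj/dT, for comparison."""
    return (cumulative_failures(T + h, N, phi)
            - cumulative_failures(T - h, N, phi)) / (2.0 * h)


# Illustrative parameters (assumed): N = 66 faults, phi = 1/431, T = 880
N, phi, T = 66, 1.0 / 431.0, 880.0
print(instantaneous_rate(T, N, phi))
print(numerical_rate(T, N, phi))
```

The two printed values should agree to many decimal places, which is what the derivation predicts.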
Background – test example
• Console operation and operating profile
• Necessity of distinguishing failure priorities:
– Priority 1: “Prevents mission essential capability”
– Priority 2: “Adversely affects mission essential capability with no alternative
workaround”
– Priority 3: “Adversely affects mission essential capability with alternative
workaround”
• Work shifts varied over test duration: 1-3/day
• Calculation of failure intensity
Corrective action for Priority 2 failures
suspended while Priority 1 failures corrected
[Figure: Sum Failures (y-axis, 0 to 400) vs. Failures/8 Hours (x-axis, 0 to 1.2); Series1, Series2 and Series3, with linear trend lines for Series2 and Series3. Trend-line equations shown: y = -176.83x + 349.85 and y = -179.88x + 288.61]
Codier, Duane 1964 RAMS HW reliability growth
• Ref. Appendix B, Notes on Plotting (Codier, Ernest O., “Reliability Growth in Real
Life”, Proceedings, 1968 Annual Symposium on Reliability, New York, IEEE,
January 1968, pp 458-469)
– 1. “The latter points, having more information content, must be given more
weight than earlier points” (Trachtenberg, too)
– 2. The normal curve-fitting procedures of drawing the line through the “center
of gravity” of all the points should not be used
– 3. “Start the line on the last data point and seek the region of highest density of
points to the left [right for Musa plots] of it”
How do I draw a growth line through the points
on a reliability growth plot?
• Is there one point that is most important?
– Yes, the last point represents the cumulative MTBF to date; it has the most
degrees of freedom
• Should the trend line go through that point?
– Yes, it has the best measure of cumulative MTBF
• Would an Excel trend line go through that point?
– No, it’s just a least squares fit with all points weighted the same
• What is the least important point?
– The first; it has the fewest degrees of freedom
Questions: Drawing a line through the points (cont.)
• If the line goes through the last point, what else should it go through?
– The center of density of the other points (ref. back to Duane, Codier)
• What is the center of density?
– The center of density is where the center of mass would be if “The latter
points …[are]… given more weight than earlier points”
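One way to turn the rule "through the last point and the center of density" into a calculation is sketched below. The weighting scheme is an assumption made for illustration: the k-th earlier point, in chronological order, gets weight k, so later points count more. The slides do not prescribe particular weights, and the sample data are hypothetical.

```python
def growth_line(xs, ys):
    """Return (slope, intercept) of the line through the last point and the
    weighted centroid of the earlier points.

    Points must be in chronological order. Weight k for the k-th earlier
    point is an assumed scheme, not one taken from Duane or Codier.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points")
    x_last, y_last = xs[-1], ys[-1]
    weights = list(range(1, len(xs)))          # 1, 2, ..., n-1
    total = sum(weights)
    cx = sum(w * x for w, x in zip(weights, xs[:-1])) / total
    cy = sum(w * y for w, y in zip(weights, ys[:-1])) / total
    if cx == x_last:
        raise ValueError("centroid and last point share the same x")
    slope = (y_last - cy) / (x_last - cx)
    intercept = y_last - slope * x_last
    return slope, intercept


# Hypothetical Musa-type data: x = failure rate, y = cumulative failures
rates = [0.5, 0.4, 0.3, 0.2]
failures = [10, 20, 30, 40]
print(growth_line(rates, failures))
```

Unlike an ordinary least-squares fit, this line is forced through the last point. On a plot of cumulative failures against failure rate, its intercept is the projected failure count at zero failure rate.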
Example - Priority 1 data plotted
[Figure: Sum Failures (n) (y-axis, 0 to 45) vs. Failures/8 Hours (x-axis, 0 to 0.6), with linear trend line y = -43.964x + 38.803]
Point estimates vs. instantaneous
The formula for calculation of $\lambda_i$ correlates with
interval estimates of failure intensity
From the previous graph: $j = -431\lambda_c + 66$

j        $\lambda_c$    T = j/$\lambda_c$
44.00    0.050          880
41.84    0.055          761
46.16    0.045          1026

Interval estimate:
$\lambda_i$ = (46.16 – 41.84)/(1,026 – 761)
= 4.32/265
= 0.016
From the formula for instantaneous failure intensity:
$\lambda_i = \lambda_c/(1 + \phi T)$, with $\phi$ = 1/431 and T = 880
$\lambda_i$ = 0.050/(1 + 880/431)
= 0.050/(1 + 2.04)
= 0.050/3.04
= 0.016
Failure count - first instance
Most recent data plot
[Figure: failure count, first instance (y-axis, 0 to 70) vs. Failure rate, Lambda (x-axis, 0 to 0.1)]
A calculator has been developed for
BAE Systems SW reliability practice 8349714
Priority 1 data graph
Questions?
• Anybody want a grad course in SW Reliability? I need 5 more students
• Rivier College can do that through teleconference
(e-mail: [email protected])
• You will solve a real problem @ no charge to your department (except tuition)