Draper IR&D Project Progress Report Reliable Software
Some thoughts for the
industry session
Cochin Conference
Dec 18, 2002
Prof. Kishor S. Trivedi
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291
Phone: (919)660-5269
e-mail: [email protected]
At present: visiting Professor IIT Kanpur, CSE Dept.
1
What does industry want?
Well trained students
Short term research problems solved
Short courses on timely topics
2
What do faculty want?
Funding for 'their' research
Place their students in good company labs
Hope to get their research results
transferred to industry
To get to know important and difficult
problems that can drive their research
3
Some lessons learned
Student placement should be guided by the advisor
Start early with summer internship
Patience is needed in listening to problems from
industry
Patience is needed in getting the IP problems
resolved
Expect to do at least 50% more work than the
funding provided
Tech transfer is a double-edged sword
Practical problems can give rise to respectable
research papers
Short courses are ideal entry points
4
Characteristics of the Systems
being Studied
Dependability (Reliability, Availability, Safety):
Redundancy: Hardware (Static, Dynamic), Information, Time
Fault Types: Permanent, Intermittent, Transient, Design
Fault Detection, Automated Reconfiguration
Imperfect Coverage
Maintenance: scheduled, unscheduled
5
Characteristics of the Systems
being Studied
Performance:
Resource Contention, Concurrency and Synchronization
Timeliness (Have to Meet Deadlines)
Composite Performance and Dependability:
Degradable Levels of Performance
Need Techniques and Tools that can Evaluate:
Systems with All the Characteristics Above
Explicitly Address Complexity
6
MEASURES TO BE EVALUATED
Dependability
Reliability: R(t), System MTTF
Availability: Steady-state, Transient, Interval
Safety
"Does it work, and for how long?"
Performance
Throughput, Loss Probability, Response Time
"Given that it works, how well does it work?"
7
MEASURES TO BE EVALUATED
Composite Performance and Dependability
"How much work will be done (lost) in a
given interval including the effects of
failure/repair/contention?"
Need Techniques and Tools That Can Evaluate
Performance, Dependability and Their Combinations
8
PURPOSE OF EVALUATION
Understanding a System
Observation
Operational Environment
Controlled Environment
Reasoning
A Model is a Convenient Abstraction
9
PURPOSE OF EVALUATION
Predicting Behavior of a System
Need a Model
Accuracy Based on Degree of Extrapolation
All Models are Wrong; Some Models are Useful
Prediction is fine as long as it is not about the
future
10
Methods of Quantitative Evaluation
Measurement-Based
Most believable, most expensive
Not always possible or cost effective during
system design
11
Methods of Quantitative Evaluation
(Continued)
Model-Based
Less believable, Less expensive
1. Discrete-Event Simulation vs. Analytic
2. State-Space Methods vs. Non-State-Space Methods
3. Hybrid: Simulation + Analytic (SPNP)
4. State Space + Non-State Space (SHARPE)
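The trade-off between discrete-event simulation and an analytic solution can be illustrated on the smallest possible model. The sketch below assumes a single repairable component with exponential failure and repair times; the rates and the function names are hypothetical and are not taken from SPNP or SHARPE.

```python
import random

def analytic_availability(failure_rate, repair_rate):
    """Steady-state availability of a two-state (up/down) component."""
    return repair_rate / (failure_rate + repair_rate)

def simulated_availability(failure_rate, repair_rate, cycles, seed=1):
    """Estimate availability by simulating alternating up and down periods."""
    rng = random.Random(seed)
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        up_time += rng.expovariate(failure_rate)   # time until the next failure
        down_time += rng.expovariate(repair_rate)  # time to complete the repair
    return up_time / (up_time + down_time)

# Hypothetical rates: one failure per 100 hours, one-hour mean repair.
exact = analytic_availability(0.01, 1.0)
estimate = simulated_availability(0.01, 1.0, cycles=20000)
print(f"analytic  = {exact:.5f}")
print(f"simulated = {estimate:.5f}")
```

The analytic answer is exact and immediate; the simulated one carries sampling error but would still work if the exponential assumption were dropped.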
12
Why MODEL?
Provides a framework for gathering, organizing,
understanding and evaluating information about
a system, e.g. Zitel, US&S, HP
A cost-effective means to evaluate a system,
e.g. Boeing, US&S, HP, IBM, Motorola,
Cisco, SUN
13
Why MODEL? (continued)
Provides a means of evaluating a set of
alternatives in a structured and quantitative
manner, e.g. Zitel, DEC, HP
Sometimes needed due to legal and contractual
obligations e.g. FAA
Sometimes needed for business reasons:
Motorola, SUN, Cisco
14
Compare two CLIENT-SERVER
Architectures
[Figure: Architecture 1 and Architecture 2]
15
Compare Connection Reliabilities
Connection reliability R(t) is the probability
that throughout the interval [0,t) at least one
path exists from the client to server on
which all components are operational.
From R(t), system mean time to failure can
be computed:
$$\mathrm{MTTF} = \int_0^\infty R(t)\,dt$$
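As a rough illustration of computing MTTF from R(t), the sketch below assumes a connection with two independent client-to-server paths, each a series of components with exponential lifetimes. The topology and the failure rates are hypothetical; they are not the architectures compared in the slides.

```python
import math

def path_reliability(rates, t):
    """Series path: every component on the path must survive until t."""
    return math.exp(-sum(rates) * t)

def connection_reliability(paths, t):
    """Parallel paths: the connection is lost only if every path has failed."""
    prob_all_failed = 1.0
    for rates in paths:
        prob_all_failed *= 1.0 - path_reliability(rates, t)
    return 1.0 - prob_all_failed

def mttf(paths, horizon, steps):
    """Trapezoidal approximation of the integral of R(t) over [0, horizon]."""
    h = horizon / steps
    total = 0.5 * (connection_reliability(paths, 0.0)
                   + connection_reliability(paths, horizon))
    for i in range(1, steps):
        total += connection_reliability(paths, i * h)
    return total * h

# Hypothetical failure rates per hour; the paths share no components.
paths = [[1e-3, 2e-3], [1e-3, 2e-3]]
numeric = mttf(paths, horizon=10000.0, steps=100000)

# Closed form for two independent paths with total rates a and b:
# MTTF = 1/a + 1/b - 1/(a + b)
closed_form = 1 / 3e-3 + 1 / 3e-3 - 1 / 6e-3
print(f"numeric MTTF     = {numeric:.3f} h")
print(f"closed-form MTTF = {closed_form:.3f} h")
```

The horizon must be long enough that R(t) is negligible beyond it; paths that share components would need a different R(t).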
16
Compare Connection Reliabilities
17
Compare Connection Availabilities
Connection (instantaneous, transient or point)
availability A(t) is the probability that at time t at
least one path exists from the client to server on
which all components are operational.
$A(t) \ge R(t)$, and the limiting or steady-state availability is
$$A = \lim_{t \to \infty} A(t)$$
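For a single repairable component with exponential failure rate `lam` and repair rate `mu` (an assumed two-state Markov model, not the client-server architectures of the slides), the point availability has a well-known closed form. The sketch below uses it to show the relationship between A(t), R(t) and the steady-state value.

```python
import math

def point_availability(lam, mu, t):
    """A(t) for a two-state Markov model that starts in the up state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def reliability(lam, t):
    """R(t): probability of no failure at all in [0, t)."""
    return math.exp(-lam * t)

def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows without bound."""
    return mu / (lam + mu)

lam, mu = 0.01, 1.0   # hypothetical rates per hour
for t in (1.0, 10.0, 100.0, 1000.0):
    print(f"t={t:7.1f}  A(t)={point_availability(lam, mu, t):.6f}  "
          f"R(t)={reliability(lam, t):.6f}")
print(f"steady state A = {steady_state_availability(lam, mu):.6f}")
```

Repair lets A(t) level off at the steady-state value while R(t) keeps decaying toward zero.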
18
Compare Connection Availabilities
19
MODELING THROUGHOUT
SYSTEM LIFECYCLE
System Specification/Design Phase
Answer "What-if Questions"
Compare design alternatives (Zitel, HP, Motorola)
Performance-Dependability Trade-offs (DEC)
Design Optimization (wireless handoff)
20
MODELING THROUGHOUT
SYSTEM LIFECYCLE
Design Verification Phase
Use Measurements + Models
E.g. Fault Injection + Reliability Model
Union Switch and Signals, Boeing, Draper
Configuration Selection Phase: DEC
System Operational Phase: Lucent
It is fun!
21
CASE STUDY: ZITEL
Comparison of two different fault-tolerant
RAMdisks.
Stochastic Petri Net Package (SPNP) was used
to model the two systems for their reliability.
22
CASE STUDY: ZITEL
Trivedi worked with the designers directly:
Model validation was done using face validation
and sanity checks.
Parameterization was easy due to the
experience of the designers.
One difficult research problem originated from
the study; it was subsequently solved and published in
Microelectronics and Reliability journal.
23
CASE STUDY: VAXCLUSTER
Developed three models of Processor Subsystem:
Two-Level Decomposition (IEEE-TR, Apr 89)
Inner Level: 9-state Markov
Outer level: n parallel diodes
A Detailed SPN Model (PNPM 89)
A Detailed SPN model for Heterogeneous Cluster
(Avresky book)
24
CASE STUDY: VAXCLUSTER
Storage Subsystem Model: A fixed-point iteration
over a set of Markov submodels. (IEEE-TR, to
appear)
Observed that availability is maximized with 2
processors (HCSS 90)
Many interesting reliability, availability,
performability measures computed
25
Case Study: HP
Cluster Availability Modeling
Server Availability
Mass Storage Arrays Availability Modeling
Started with Markov chains via SHARPE
Progressed toward Stochastic Petri Nets
and Stochastic Reward nets via SPNP
26
CASE STUDY: LUCENT
A Validated Model of Hardware-Software
Availability.
Worked with V. Mendiratta of Naperville.
Model is semi-Markov; solved using SHARPE.
Parameters collected from field data.
Model results validated against actual
measurements.
27
CASE STUDY: LUCENT, IBM,
Motorola, SUN
Software Rejuvenation:
A technique to counter software “aging” and increase its
availability to clients.
Evaluated optimum rejuvenation interval which
maximizes steady state availability (minimizes expected
cost) for IBM cluster, Motorola CMTS cluster
Collected data from real systems to show aging and to
determine proactive fault management strategies.
Worked in our lab, with SUN Microsystems
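The idea of an optimum rejuvenation interval can be sketched with a textbook age-replacement model. This is a deliberate simplification and not the model used in the IBM, Motorola or SUN studies: time to an aging-related failure is assumed Weibull with increasing hazard, rejuvenation is performed at a fixed age if no failure has occurred, and all parameter values are hypothetical.

```python
import math

def weibull_survival(t, shape, scale):
    """Probability that no aging-related failure has occurred by age t."""
    return math.exp(-((t / scale) ** shape))

def availability(interval, shape, scale, repair_time, rejuvenation_time, steps=2000):
    """Steady-state availability when rejuvenating at age `interval`."""
    h = interval / steps
    # Expected uptime per cycle: integral of the survival function over [0, interval].
    uptime = 0.5 * (1.0 + weibull_survival(interval, shape, scale))
    for i in range(1, steps):
        uptime += weibull_survival(i * h, shape, scale)
    uptime *= h
    survive = weibull_survival(interval, shape, scale)
    # A cycle ends either with a failure (long repair) or a rejuvenation (short outage).
    downtime = (1.0 - survive) * repair_time + survive * rejuvenation_time
    return uptime / (uptime + downtime)

def best_interval(candidates, **params):
    """Grid search for the interval with the highest availability."""
    return max(candidates, key=lambda interval: availability(interval, **params))

# Hypothetical parameters, all in hours.
params = dict(shape=2.0, scale=1000.0, repair_time=10.0, rejuvenation_time=1.0)
candidates = [50.0 * i for i in range(1, 101)]   # 50 h to 5000 h
optimum = best_interval(candidates, **params)
print(f"best interval on the grid: {optimum:.0f} h, "
      f"availability {availability(optimum, **params):.5f}")
```

Rejuvenating too often wastes uptime on planned outages, while rejuvenating too rarely lets costly failures dominate, so the maximum lies strictly inside the grid for these parameters.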
28
CASE STUDY: MOTOROLA
Availability & Performability Modeling:
Modeled several configurations of
Communication Enterprise Common
Platform.
Practical approaches for approximating
steady state measures in large, repairable,
and highly dependable systems: model
decomposition, state space truncation, etc.
Both SHARPE and SPNP used.
29
CASE STUDY: MOTOROLA
Recovery strategies in wireless handoff:
proposed and modeled several strategies
a patent being filed by Motorola
SPNP was used
Hierarchy of two-level models used
Fixed-point iteration was used
30
CASE STUDY: BELLCORE
Architecture-based software reliability:
proposed a methodology
applied the methodology to SHARPE
used Bellcore's test coverage tool, ATAC, to
parameterize the model
Bellcore is currently enhancing ATAC to
incorporate our methodology
31
CASE STUDY: DRAPER LAB
Overall aim was Verification of system with
very high reliability/availability specifications.
Prototype under consideration was FTPP
cluster 3.
Hybrid approach proposed
Fault-injection-based measurements.
Statistical analysis of measured data to enable
parameterization of analytical models.
32
CASE STUDY: DRAPER LAB
Reliability modeling of the prototype done:
Parameterization done with the aid of existing
reliability databases.
Analytical solution provided exact closed-form
expressions
Markov model solved using SHARPE
Petri net model solved using SPNP
Reliability bottlenecks found
33
CASE STUDY: AT&T
GSHARPE:
A Preprocessor to SHARPE developed at
Bell Labs by a Duke Student.
User can specify Weibull Failure times and
lognormal and other repair time
distributions.
GSHARPE fits these to phase type
distributions and produces a Markov model
that is generated for processing by
SHARPE
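The fitting step can be illustrated with simple moment matching. The sketch below is not GSHARPE's algorithm: it assumes a Weibull lifetime with shape greater than 1, matches its mean with an Erlang distribution whose number of stages approximates the coefficient of variation, and builds the generator matrix of the resulting absorbing Markov chain. All function names and parameter values are hypothetical.

```python
import math

def weibull_moments(shape, scale):
    """Mean and variance of a Weibull distribution."""
    mean = scale * math.gamma(1 + 1 / shape)
    second_moment = scale ** 2 * math.gamma(1 + 2 / shape)
    return mean, second_moment - mean ** 2

def fit_erlang(mean, variance):
    """Pick Erlang stages so that 1/stages is close to the squared CV."""
    cv2 = variance / mean ** 2
    if cv2 > 1:
        raise ValueError("squared CV above 1 needs a hyperexponential fit")
    stages = max(1, round(1 / cv2))
    return stages, stages / mean   # the stage rate preserves the mean exactly

def erlang_generator(stages, rate):
    """Generator matrix of the absorbing CTMC: stage i moves to stage i+1."""
    size = stages + 1              # the last state is absorbing ("failed")
    q = [[0.0] * size for _ in range(size)]
    for i in range(stages):
        q[i][i] = -rate
        q[i][i + 1] = rate
    return q

mean, variance = weibull_moments(shape=2.0, scale=1000.0)
stages, rate = fit_erlang(mean, variance)
q = erlang_generator(stages, rate)
print(f"Weibull mean {mean:.1f}, fitted Erlang with {stages} stages at rate {rate:.6f}")
```

Each non-exponential distribution becomes a small chain of exponential stages, which is what lets a Markov solver process the expanded model.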
34
CASE STUDY: BOEING
An Integrated Reliability Environment
A working prototype
Developed a high-level modeling language
(SDM)
Designed and implemented an intelligent
interpreter
35
CASE STUDY: BOEING
(Continued)
Interpreter determines which solution method is
applicable
Five different modeling engines are integrated:
CAFTA, SETS, EHARP, SHARPE and SPNP.
36
QUANTITATIVE EVALUATION
TAXONOMY
Closed-form solution
Numerical solution using a tool
37
MODELING TAXONOMY
38
STATE SPACE MODELING
TAXONOMY
39
ANALYTIC MODELING
TAXONOMY
NON-STATE SPACE MODELING TECHNIQUES
Product form queuing models
SP reliability block diagrams
Non-SP reliability block diagrams
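A series-parallel reliability block diagram can be evaluated by composing two rules, assuming statistically independent blocks. The sketch below uses hypothetical block reliabilities; non-series-parallel diagrams (for example a bridge) need other techniques such as factoring and are not covered.

```python
def series(*reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result

def parallel(*reliabilities):
    """At least one block must work: 1 minus the probability that all fail."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail

# Hypothetical diagram: client, two redundant network links, server.
system = series(0.99, parallel(0.9, 0.9), 0.95)
print(f"system reliability = {system:.6f}")
```

No state space is generated, which is why such models stay cheap to solve even for large systems.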
40
State Space Modeling Taxonomy
State space methods
Markovian modeling: discrete-time Markov chains, continuous-time Markov chains, Markov reward models
non-Markovian modeling: Semi-Markov models, Markov regenerative models, Non-Homogeneous Markov
41
State-Space Based Models
Transition label:
Probability: (homogeneous) discrete-time Markov chain (DTMC)
Time-independent Rate: homogeneous continuous-time Markov chain
Time-dependent Rate: non-homogeneous continuous-time Markov chain
Distribution function: semi-Markov process
Two Dist. Functions: Markov Regenerative Process
42
IN ORDER TO FULFILL OUR
GOALS OF
Modeling Performance, Dependability and
Performability
Modeling Complex Systems
We Need
Automatic Generation and Solution of Large
Markov Reward Models
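A Markov reward model attaches a reward rate to each state of a Markov chain, so one solution yields performance, dependability and performability measures. The sketch below assumes a hypothetical two-processor system with one shared repair facility; states count working processors, and the rates are illustrative.

```python
def steady_state(lam, mu):
    """Steady-state probabilities of the birth-death chain 2 <-> 1 <-> 0."""
    weights = {2: 1.0}
    weights[1] = weights[2] * (2 * lam) / mu   # balance: pi2 * 2*lam = pi1 * mu
    weights[0] = weights[1] * lam / mu         # balance: pi1 * lam   = pi0 * mu
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

def expected_reward(probabilities, rewards):
    """Expected steady-state reward rate: sum of reward times probability."""
    return sum(probabilities[s] * rewards[s] for s in probabilities)

pi = steady_state(lam=0.001, mu=0.1)

# Reward = processing capacity: a performability measure.
capacity = expected_reward(pi, {2: 2.0, 1: 1.0, 0: 0.0})
# Reward = 1 in every up state: plain availability.
availability = expected_reward(pi, {2: 1.0, 1: 1.0, 0: 0.0})
print(f"expected capacity = {capacity:.6f} processors")
print(f"availability      = {availability:.6f}")
```

Changing only the reward vector switches the measure, which is the main convenience of the reward formulation.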
43
IN ORDER TO FULFILL OUR
GOALS OF
Facility for State Truncation, Hierarchical
composition of Non-State-Space and State-Space Models, Fixed-Point Iteration
There are Two Tools that Potentially meet these
Goals
Stochastic Petri Net Package (SPNP)
Symbolic Hierarchical Automated Rel. and Perf.
Evaluator (SHARPE)
44
MODELING SOFTWARE PACKAGES
HARP - Hybrid Automated Reliability Predictor
(Duke Univ, funded by NASA Langley)
SAVE - System Availability Estimator
(Duke Univ. funded by IBM)
SHARPE - Symbolic Hierarchical Automated Reliability and
Performance Evaluator; installed at nearly 280 locations (GUI
available)
SPNP - Stochastic Petri Net Package installed at nearly 120
locations (iSPN - GUI available)
D_RAMP for Union Switch and Signals by Duke, UVA and CMU
SDM - Boeing Integrated Reliability Modeling Environment
(Jointly developed by Duke Univ., Univ. of Wash. and Boeing)
SDDS - Developed by Sohar with the help from K. Trivedi
SREPT - Software Reliability Estimation and Prediction Tool
45
Challenges in Modeling
COMPLEXITIES OF MODELS
Large State Space
Model construction problem
Model solution problem
Model Stiffness:
Fast and slow rates acting together
Failure and Recovery/Repair
Performance and failure
47
COMPLEXITIES OF MODELS
Modeling Non-Exponential Distributions
Combining performance and reliability
Believability/Understandability/Usability
Incorporation in the design process
Connection between measurements &
models:
Parameterization
Validation
48
LARGENESS TOLERANCE
Automated Model Construction
Stochastic Petri nets (GreatSPN, SPNP,
SHARPE, DSPNexpress, ULTRASAN)
High level languages (SAVE, QNAP, ASSIST,
SDM)
Fault-Tree + Recovery Info (HARP)
Object-Oriented Approaches (TANGRAM)
Loops in the specification of CTMC (SHARPE)
49
LARGENESS TOLERANCE
Efficient numerical solution techniques
Sparse Storage
Accurate and Efficient Solution Methods
We have Generated and Solved Models
with 1,000,000 states (has gone up
considerably recently)
Steady-State : NEAR-Optimal SOR
Transient: Modified Jensen's method
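The steady-state iteration can be sketched as follows. This is a minimal illustration and not the tools' implementation: it stores the generator matrix densely instead of sparsely, and it takes the relaxation factor `omega` as a fixed argument (1.0 gives Gauss-Seidel) rather than searching for a near-optimal value. The example chain and its rates are hypothetical.

```python
def sor_steady_state(q, omega=1.0, tol=1e-12, max_iter=10000):
    """Solve pi * Q = 0 with sum(pi) = 1 by successive over-relaxation."""
    n = len(q)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        previous = pi[:]
        for j in range(n):
            # Balance equation for state j: inflow equals pi[j] * outflow rate.
            inflow = sum(pi[i] * q[i][j] for i in range(n) if i != j)
            pi[j] = (1 - omega) * pi[j] + omega * inflow / (-q[j][j])
        total = sum(pi)
        pi = [p / total for p in pi]   # renormalize after every sweep
        if max(abs(a - b) for a, b in zip(pi, previous)) < tol:
            return pi
    raise RuntimeError("SOR did not converge")

# Hypothetical 3-state chain: index 0 = two units up, 1 = one up, 2 = none up.
lam, mu = 0.001, 0.1
q = [
    [-2 * lam, 2 * lam, 0.0],
    [mu, -(lam + mu), lam],
    [0.0, mu, -mu],
]
pi = sor_steady_state(q)
print([f"{p:.6f}" for p in pi])
```

Each sweep touches only the nonzero entries of a column when sparse storage is used, which is what makes this family of methods practical for very large chains.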
50
MODEL SPECIFICATION
LANGUAGES
Different languages can be used to specify
a single model type:
SAVE, QNAP, SPNP all appear very different;
underlying model type is Markov
Same language can be used to specify
different model types: RESQ input language
used for PFQN or EQN
51
LARGENESS AVOIDANCE
Non-State-Space methods
Reliability block diagrams
Fault-trees
Product-Form Queuing Networks
Approximate solutions
State Truncation
SAVE, SPNP, ASSIST (Kantz and Trivedi: PNPM91)
52
LARGENESS AVOIDANCE
Approximate solutions
Hierarchical Decomposition (Chapter 11)
and Fixed-Point Iteration among submodels:
Heidelberger and Trivedi; IEEE-TC,1983
(Queueing Models)
Ciardo and Trivedi; PNPM91 (SPN Models)
Tomek and Trivedi (Availability Models)
Singhal (IEEE-TPDS, 1992)
Chapter 11 of Sahner et al.
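Fixed-point iteration among submodels can be sketched on a toy example. The coupling used here is a hypothetical approximation invented for illustration, not one from the cited papers: each of `n` identical components is solved as its own two-state submodel, and the effective repair rate it sees is slowed down by the expected number of other failed components competing for a single repairer.

```python
def fixed_point(update, start, tol=1e-12, max_iter=1000):
    """Iterate x <- update(x) until successive values agree within tol."""
    x = start
    for _ in range(max_iter):
        new_x = update(x)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    raise RuntimeError("fixed-point iteration did not converge")

def make_update(lam, mu, n):
    """Build the map from a guessed unavailability to the recomputed one."""
    def update(unavailability):
        # Repair submodel (assumed form): contention slows the repair rate.
        effective_mu = mu / (1 + (n - 1) * unavailability)
        # Component submodel: two-state Markov chain with the effective rate.
        return lam / (lam + effective_mu)
    return update

update = make_update(lam=0.01, mu=1.0, n=10)
unavailability = fixed_point(update, start=0.0)
print(f"per-component unavailability = {unavailability:.6f}")
```

The overall state space with all components is never built; only small submodels are solved repeatedly until their exchanged parameters stop changing.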
53
LARGENESS AVOIDANCE
Approximate solutions
Time-Scale Decomposition
Bobbio and Trivedi (IEEE-TC, 1986); Section 11.2
Fluid Approximation:
Mitra; Kulkarni; Ciardo; Nicol, and Trivedi;
FSPN
Performability (Chapters 6 and 12)
54
Difficulties in Modeling Using
MRMs
Stiffness
Causes numerical difficulties in solution
Stiffness Tolerance
Develop stiffness tolerant numerical
solution methods
Stiffness Avoidance
Avoid generating stiff models through
decomposition
55
STIFFNESS TOLERANCE
Automatic Detection of Stiffness (HARP)
Special Stable ODE Solver
Reibman and Trivedi (TR-BDF2)
Computers and Operations Research, 1988.
Malhotra and Trivedi (Pade, Implicit RK)
56
STIFFNESS TOLERANCE
Uniformization for Stiff Markov Chains
Muppala and Trivedi
We can solve models with rate ratios of 10^8 or
higher
Implemented in SHARPE & SPNP
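Plain uniformization (Jensen's method) can be sketched as below. This is the basic algorithm only, without the modifications for stiff chains mentioned above, and it is not the SHARPE or SPNP code: for very large values of rate times t the first Poisson term underflows and the number of terms grows, which is exactly the difficulty that stiffness creates. The two-state example and its rates are hypothetical.

```python
import math

def uniformization(q_matrix, initial, t, eps=1e-12, max_terms=10000):
    """Transient state probabilities of a CTMC at time t."""
    n = len(q_matrix)
    rate = 1.02 * max(-q_matrix[i][i] for i in range(n))
    # DTMC obtained by uniformizing the CTMC: P = I + Q / rate.
    p = [[(1.0 if i == j else 0.0) + q_matrix[i][j] / rate for j in range(n)]
         for i in range(n)]
    weight = math.exp(-rate * t)   # Poisson probability of zero jumps
    vector = initial[:]            # holds initial * P^k
    result = [weight * v for v in vector]
    accumulated = weight
    k = 0
    # Add Poisson-weighted terms until the neglected tail is below eps.
    while accumulated < 1.0 - eps and k < max_terms:
        k += 1
        vector = [sum(vector[i] * p[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        accumulated += weight
        result = [r + weight * v for r, v in zip(result, vector)]
    return result

lam, mu, t = 0.01, 1.0, 5.0
q = [[-lam, lam], [mu, -mu]]       # state 0 = up, state 1 = down
probabilities = uniformization(q, [1.0, 0.0], t)

s = lam + mu
closed_form = mu / s + (lam / s) * math.exp(-s * t)
print(f"uniformization A(t) = {probabilities[0]:.9f}")
print(f"closed form    A(t) = {closed_form:.9f}")
```

The method uses only non-negative quantities, so it is numerically stable, and the truncation error is bounded in advance by `eps`.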
57
STIFFNESS AVOIDANCE
Model-level decomposition
Behavioral Decomposition (HARP, Bobbio &
Trivedi): Fault-Occurrence vs. Fault/Error
Handling
Hierarchical Composition (SHARPE)
Composition of Submodel solutions without
generating a single one-level overall model
Fixed-Point Iteration (Ciardo and Trivedi; SPNP)
58
Non-Exponential Behavior
Non state space models: Fault Trees, Reliability
Graphs, RBDs; no problem
59
Non-Exponential Behavior
in State Space Models
60
NON-EXPONENTIAL
DISTRIBUTIONS
Phase-Type Expansions
Malhotra and Reibman (GSHARPE)
See Figure 9.38 on p. 191 (Red Book)
Non-Homogeneous Markov Chains
CARE III, HARP
Soft Reliability model with imperfect repairs
solved using SHARPE
61
NON-EXPONENTIAL DISTRIBUTIONS
Semi-Markov Chains
Ciardo et al, IEEE-TC Oct. 90
Markov Regenerative Processes:
Choi, Logothetis, Kulkarni, Trivedi
DSPN and MRSPN:
Choi, Kulkarni, Trivedi
Discrete-Event Simulation
Now in SPNP (FSPN and Non-Markovian SPN
Simulation), RESQ, QNAP
62
BELIEVABILITY/
UNDERSTANDABILITY
Integration of Measurements and Models
Measurements Provide Parameters to Models
Models Provide Guidelines For Measurements
Models Validated Against Measurements
Integration of Different Modeling Tools
Boeing SDM project
IDEAS project at Duke
63
BELIEVABILITY/
UNDERSTANDABILITY
Many Case-Studies of Validations Needed
Vaxcluster Availability Model: Wein & Sathaye
Hsueh, Iyer and Trivedi; IEEE-TC, Apr. 1988
AT&T Validation of ESS
Technology Transfer
Seminars and Workshops
Development and Dissemination of Tools
Application of the Techniques and Tools
64
MODELING AND MEASUREMENTS:
INTERFACES
Measurements supply Input Parameters to
Models
(Model Calibration or Parameterization)
Confidence Intervals should be obtained
Boeing, Draper, Union Switch projects
Model Sensitivity Analysis can suggest which
Parameters to Measure More Accurately:
Blake, Reibman and Trivedi: SIGMETRICS
1988.
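Parameter sensitivity can be approximated numerically for any model that maps parameters to a measure. The sketch below is a generic finite-difference illustration, not the method of the cited paper; the two-component series availability model and its rates are hypothetical.

```python
def availability(params):
    """Series system of two independent repairable components."""
    a1 = params["mu1"] / (params["lam1"] + params["mu1"])
    a2 = params["mu2"] / (params["lam2"] + params["mu2"])
    return a1 * a2

def scaled_sensitivities(model, params, rel_step=1e-6):
    """Parameter value times the central-difference derivative of the model."""
    result = {}
    for name, value in params.items():
        up = dict(params, **{name: value * (1 + rel_step)})
        down = dict(params, **{name: value * (1 - rel_step)})
        derivative = (model(up) - model(down)) / (2 * value * rel_step)
        result[name] = value * derivative
    return result

params = {"lam1": 0.001, "mu1": 0.1, "lam2": 0.0001, "mu2": 1.0}
sens = scaled_sensitivities(availability, params)
ranking = sorted(sens, key=lambda name: abs(sens[name]), reverse=True)
for name in ranking:
    print(f"{name}: {sens[name]:+.6f}")
```

Parameters at the top of the ranking are the ones whose measurement error matters most, so they are the natural candidates for more careful measurement.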
65
MODELING AND MEASUREMENTS:
INTERFACES
Model Validation
1. Face Validation
2. Input-Output Validation
3. Validation of Model Assumptions
(Hypothesis Testing)
Rejection of a hypothesis regarding model
assumption based on measurement data leads
to an improved model
66
MODELING AND MEASUREMENTS:
INTERFACES
Model Structure Based on Measurement Data
Hsueh, Iyer and Trivedi; IEEE TC, April 1988;
Gokhale et al, IPDS 98
67