Draper IR&D Project Progress Report: Reliable Software

Some Thoughts for the Industry Session
Cochin Conference
Dec 18, 2002

Prof. Kishor S. Trivedi
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291
Phone: (919) 660-5269
E-mail: [email protected]
At present: Visiting Professor, IIT Kanpur, CSE Dept.

What does industry want?

• Well trained students
• Short term research problems solved
• Short courses on timely topics

What do faculty want?

• Funding for 'their' research
• To place their students in good company labs
• Hope to get their research results transferred to industry
• To get to know important and difficult problems that can drive their research

Some lessons learned

• Student placement should be guided by the advisor
• Start early with a summer internship
• Patience is needed in listening to problems from industry
• Patience is needed in getting the IP problems resolved
• Expect to do at least 50% more work than the funding provides
• Tech transfer is a double-edged sword
• Practical problems can give rise to respectable research papers
• Short courses are ideal entry points

Characteristics of the Systems being Studied

• Dependability (Reliability, Availability, Safety):
  - Redundancy: Hardware (Static, Dynamic), Information, Time
  - Fault Types: Permanent, Intermittent, Transient, Design
  - Fault Detection, Automated Reconfiguration
  - Imperfect Coverage
  - Maintenance: Scheduled, Unscheduled

Characteristics of the Systems being Studied

• Performance:
  - Resource Contention, Concurrency and Synchronization
  - Timeliness (Have to Meet Deadlines)
• Composite Performance and Dependability:
  - Degradable Levels of Performance
• Need Techniques and Tools that can Evaluate Systems with All the Characteristics Above and Explicitly Address Complexity

MEASURES TO BE EVALUATED

• Dependability
  - Reliability: R(t), System MTTF
  - Availability: Steady-state, Transient, Interval
  - Safety
  "Does it work, and for how long?"
• Performance
  - Throughput, Loss Probability, Response Time
  "Given that it works, how well does it work?"

MEASURES TO BE EVALUATED

• Composite Performance and Dependability
  "How much work will be done (lost) in a given interval, including the effects of failure/repair/contention?"
• Need Techniques and Tools That Can Evaluate Performance, Dependability and Their Combinations

PURPOSE OF EVALUATION

• Understanding a System
  - Observation
    - Operational Environment
    - Controlled Environment
  - Reasoning
    - A Model is a Convenient Abstraction

PURPOSE OF EVALUATION

• Predicting Behavior of a System
  - Need a Model
  - Accuracy Based on Degree of Extrapolation
• All Models are Wrong; Some Models are Useful
• Prediction is fine as long as it is not about the future

Methods of Quantitative Evaluation

• Measurement-Based
  - Most believable, most expensive
  - Not always possible or cost-effective during system design

Methods of Quantitative Evaluation (Continued)

• Model-Based
  - Less believable, less expensive
  1. Discrete-Event Simulation vs. Analytic
  2. State-Space Methods vs. Non-State-Space Methods
  3. Hybrid: Simulation + Analytic (SPNP)
  4. State Space + Non-State Space (SHARPE)

Why MODEL?

• Provides a framework for gathering, organizing, understanding and evaluating information about a system, e.g. Zitel, US&S, HP
• A cost-effective means to evaluate a system, e.g. Boeing, US&S, HP, IBM, Motorola, Cisco, SUN

Why MODEL? (continued)

• Provides a means of evaluating a set of alternatives in a structured and quantitative manner, e.g. Zitel, DEC, HP
• Sometimes needed due to legal and contractual obligations, e.g. FAA
• Sometimes needed for business reasons: Motorola, SUN, Cisco

Compare two CLIENT-SERVER Architectures

[Figure: diagrams of Architecture 1 and Architecture 2]

Compare Connection Reliabilities

• Connection reliability R(t) is the probability that throughout the interval [0, t) at least one path exists from the client to the server on which all components are operational.
• From R(t), the system mean time to failure can be computed (a numerical sketch follows below):

  MTTF = \int_0^\infty R(t)\,dt

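The following is a minimal Python sketch of these two definitions, assuming a hypothetical client-to-server architecture (client, two parallel links, server) with made-up exponential failure rates; it is illustrative only, not one of the architectures compared in the talk:

```python
import numpy as np

# Hypothetical architecture (illustrative only): client -> two parallel
# links -> server, each component with an exponential lifetime.

def r_component(t, rate):
    """Reliability of one component with constant failure rate."""
    return np.exp(-rate * t)

def r_connection(t, lam_client=1e-4, lam_link=5e-4, lam_server=2e-4):
    """At least one of the two links up, plus client and server up."""
    r_links = 1.0 - (1.0 - r_component(t, lam_link)) ** 2
    return r_component(t, lam_client) * r_links * r_component(t, lam_server)

# MTTF = integral of R(t) over [0, inf); truncate where R(t) is ~ 0.
t = np.linspace(0.0, 2e5, 200_001)
r = r_connection(t)
mttf = float(((r[:-1] + r[1:]) / 2.0 * np.diff(t)).sum())  # trapezoid rule
print(f"Connection MTTF ~ {mttf:.0f} hours")
```
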
Compare Connection Reliabilities

[Figure: R(t) plotted over time for the two architectures]

Compare Connection Availabilities

• Connection (instantaneous, transient or point) availability A(t) is the probability that at time t at least one path exists from the client to the server on which all components are operational.
• A(t) ≥ R(t), and the limiting or steady-state availability is

  A = \lim_{t \to \infty} A(t)

A sketch of a steady-state computation follows below.

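A companion sketch for steady-state availability, under the same hypothetical structure and with assumed MTTF/MTTR values (all numbers invented for illustration), composes per-component availabilities A = MTTF / (MTTF + MTTR) over the series-parallel path:

```python
# Illustrative sketch: steady-state availability of the hypothetical
# client -> 2 parallel links -> server connection, assuming independent
# failure/repair behavior of the components.

def a_component(mttf_hours, mttr_hours):
    """Steady-state availability of one repairable component."""
    return mttf_hours / (mttf_hours + mttr_hours)

a_client = a_component(10_000, 4)   # assumed MTTF/MTTR values (hours)
a_link   = a_component(2_000, 2)
a_server = a_component(5_000, 8)

# Series-parallel structure: client AND (link1 OR link2) AND server.
a_links = 1.0 - (1.0 - a_link) ** 2
a_connection = a_client * a_links * a_server
print(f"Steady-state connection availability = {a_connection:.6f}")
```
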
Compare Connection Availabilities

[Figure: A(t) plotted over time for the two architectures]

MODELING THROUGHOUT SYSTEM LIFECYCLE

• System Specification/Design Phase
  - Answer "What-if" Questions
  - Compare design alternatives (Zitel, HP, Motorola)
  - Performance-Dependability Trade-offs (DEC)
  - Design Optimization (wireless handoff)

MODELING THROUGHOUT SYSTEM LIFECYCLE

• Design Verification Phase
  - Use Measurements + Models, e.g. Fault Injection + Reliability Model
  - Union Switch and Signals, Boeing, Draper
• Configuration Selection Phase: DEC
• System Operational Phase: Lucent
• It is fun!

CASE STUDY: ZITEL

• Comparison of two different fault-tolerant RAMdisks.
• The Stochastic Petri Net Package (SPNP) was used to model the reliability of the two systems.

CASE STUDY: ZITEL

• Trivedi worked with the designers directly:
  - Model validation was done using face validation and sanity checks.
  - Parameterization was easy due to the experience of the designers.
  - One difficult research problem originated from the study; it was subsequently solved and published in the Microelectronics and Reliability journal.

CASE STUDY: VAXCLUSTER

• Developed three models of the Processor Subsystem:
  - Two-Level Decomposition (IEEE-TR, Apr 89)
    - Inner level: 9-state Markov chain
    - Outer level: n parallel diodes
  - A Detailed SPN Model (PNPM 89)
  - A Detailed SPN Model for a Heterogeneous Cluster (Averesky book)

CASE STUDY: VAXCLUSTER

• Storage Subsystem Model: a fixed-point iteration over a set of Markov submodels (IEEE-TR, to appear); a sketch of the iteration idea follows below.
• Observed that availability is maximized with 2 processors (HCSS 90)
• Many interesting reliability, availability and performability measures computed

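To illustrate the fixed-point iteration idea only (this is not the actual VAXcluster storage model; both submodels and all rates below are invented), two interdependent submodels can be solved by iterating until the parameter they exchange converges:

```python
# Illustrative sketch of fixed-point iteration among submodels.
# Submodel 1 yields an availability given a failure rate; submodel 2
# yields a (load-dependent) failure rate given that availability.

def submodel_availability(failure_rate, repair_rate=1.0):
    # Single repairable unit: A = mu / (lambda + mu).
    return repair_rate / (failure_rate + repair_rate)

def submodel_failure_rate(availability, base_rate=0.01):
    # Assumed dependency: lower availability puts more load (and hence
    # a higher failure rate) on the surviving units.
    return base_rate * (1.0 + (1.0 - availability))

avail = 0.5                       # initial guess
for i in range(100):
    rate = submodel_failure_rate(avail)
    new_avail = submodel_availability(rate)
    if abs(new_avail - avail) < 1e-12:
        break
    avail = new_avail
print(f"converged in {i} iterations: availability = {avail:.6f}")
```
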
Case Study: HP

• Cluster Availability Modeling
• Server Availability
• Mass Storage Arrays Availability Modeling
• Started with Markov chains via SHARPE
• Progressed toward Stochastic Petri Nets and Stochastic Reward Nets via SPNP

CASE STUDY: LUCENT

• A Validated Model of Hardware-Software Availability.
• Worked with V. Mendiratta of Naperville.
• Model is semi-Markov; solved using SHARPE.
• Parameters collected from field data.
• Model results validated against actual measurements.

CASE STUDY: LUCENT, IBM, Motorola, SUN

• Software Rejuvenation: a technique to counter software "aging" and increase its availability to clients.
• Evaluated the optimum rejuvenation interval which maximizes steady-state availability (minimizes expected cost) for the IBM cluster and the Motorola CMTS cluster; an illustrative optimization sketch follows below.
• Collected data from real systems to show aging and to determine proactive fault management strategies.
• Worked in our lab, with SUN Microsystems.

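A minimal sketch of the rejuvenation trade-off, with invented Weibull aging parameters and downtimes rather than the published IBM/Motorola numbers: shrinking the rejuvenation interval T reduces unplanned outages but increases planned ones, so an interior optimum exists:

```python
import numpy as np

# Illustrative rejuvenation model: aging failures follow a Weibull with
# increasing hazard; rejuvenation is a short planned outage, an aging
# failure a long unplanned one. All parameters are assumptions.
SHAPE, SCALE = 2.0, 1000.0     # Weibull aging parameters (hours)
D_REJUV, D_FAIL = 0.1, 4.0     # planned / unplanned downtime (hours)

def availability(T, n=20_000):
    t = np.linspace(0.0, T, n)
    f_cdf = 1.0 - np.exp(-(t / SCALE) ** SHAPE)
    # Expected uptime in a cycle of planned length T (truncated mean).
    up = float(np.mean(1.0 - f_cdf) * T)
    # Cycle ends in rejuvenation (prob 1-F(T)) or failure (prob F(T)).
    down = (1.0 - f_cdf[-1]) * D_REJUV + f_cdf[-1] * D_FAIL
    return up / (up + down)

Ts = np.linspace(50, 3000, 60)
best = max(Ts, key=availability)
print(f"best rejuvenation interval ~ {best:.0f} h, "
      f"availability = {availability(best):.6f}")
```
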
CASE STUDY: MOTOROLA

• Availability & Performability Modeling:
  - Modeled several configurations of the Communication Enterprise Common Platform.
  - Practical approaches for approximating steady-state measures in large, repairable and highly dependable systems: model decomposition, state-space truncation, etc.
  - Both SHARPE and SPNP were used.

CASE STUDY: MOTOROLA

• Recovery strategies in wireless handoff:
  - Proposed and modeled several strategies; a patent is being filed by Motorola
  - SPNP was used
  - A hierarchy of two-level models was used
  - Fixed-point iteration was used

CASE STUDY: BELLCORE

• Architecture-based software reliability:
  - Proposed a methodology
  - Applied the methodology to SHARPE
  - Used Bellcore's test coverage tool, ATAC, to parameterize the model
  - Bellcore is currently enhancing ATAC to incorporate our methodology

CASE STUDY: DRAPER LAB

• Overall aim was verification of a system with very high reliability/availability specifications.
• Prototype under consideration was the FTPP Cluster 3.
• Hybrid approach proposed:
  - Fault-injection-based measurements.
  - Statistical analysis of measured data to enable parameterization of analytical models.

CASE STUDY: DRAPER LAB

• Reliability modeling of the prototype done:
  - Parameterization done with the aid of existing reliability databases.
  - Analytical solution provided exact closed-form expressions.
  - Markov model solved using SHARPE.
  - Petri net model solved using SPNP.
  - Reliability bottlenecks found.

CASE STUDY: AT&T

• GSHARPE:
  - A preprocessor to SHARPE developed at Bell Labs by a Duke student.
  - The user can specify Weibull failure times and lognormal and other repair-time distributions.
  - GSHARPE fits these to phase-type distributions and produces a Markov model for processing by SHARPE.

CASE STUDY: BOEING

• An Integrated Reliability Environment
• A working prototype
• Developed a high-level modeling language (SDM)
• Designed and implemented an intelligent interpreter

CASE STUDY: BOEING (Continued)

• Interpreter determines which solution method is applicable
• Five different modeling engines are integrated: CAFTA, SETS, EHARP, SHARPE and SPNP.

QUANTITATIVE EVALUATION TAXONOMY

• Closed-form solution
• Numerical solution using a tool

MODELING TAXONOMY

[Figure: modeling taxonomy diagram]

STATE SPACE MODELING TAXONOMY

[Figure: state-space modeling taxonomy diagram]

ANALYTIC MODELING TAXONOMY

NON-STATE SPACE MODELING TECHNIQUES
• Product-form queuing models
• SP (series-parallel) reliability block diagrams
• Non-SP reliability block diagrams

State Space Modeling Taxonomy

State space methods
• Markovian modeling
  - Discrete-time Markov chains
  - Continuous-time Markov chains
  - Markov reward models
• Non-Markovian modeling
  - Semi-Markov models
  - Markov regenerative models
  - Non-homogeneous Markov models

State-Space Based Models

• The transition label determines the model type:
  - Probability: (homogeneous) discrete-time Markov chain (DTMC)
  - Time-independent rate: homogeneous continuous-time Markov chain (CTMC)
  - Time-dependent rate: non-homogeneous continuous-time Markov chain
  - Distribution function: semi-Markov process
  - Two distribution functions: Markov regenerative process
A small CTMC solution sketch follows below.

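As a concrete instance of the CTMC case, here is a minimal sketch (a toy three-state up/degraded/down model with assumed rates, not a model from the talk) that solves pi Q = 0 subject to sum(pi) = 1:

```python
import numpy as np

# Toy homogeneous CTMC. States: 0 = up, 1 = degraded, 2 = down.
lam1, lam2, mu = 1e-3, 5e-3, 0.5   # assumed transition rates per hour

Q = np.array([
    [-lam1,   lam1,   0.0],
    [  0.0,  -lam2,  lam2],
    [   mu,    0.0,   -mu],
])

# Solve pi @ Q = 0 with sum(pi) = 1 by replacing one balance equation
# with the normalization condition.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(3)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
print("steady-state probabilities:", pi)
print("availability (up or degraded):", pi[0] + pi[1])
```
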
IN ORDER TO FULFILL OUR GOALS OF

• Modeling Performance, Dependability and Performability
• Modeling Complex Systems

We Need

• Automatic Generation and Solution of Large Markov Reward Models

IN ORDER TO FULFILL OUR GOALS OF

• Facility for State Truncation, Hierarchical Composition of Non-State-Space and State-Space Models, and Fixed-Point Iteration

There are two tools that potentially meet these goals:
• Stochastic Petri Net Package (SPNP)
• Symbolic Hierarchical Automated Reliability and Performance Evaluator (SHARPE)

MODELING SOFTWARE PACKAGES

• HARP - Hybrid Automated Reliability Predictor (Duke Univ., funded by NASA Langley)
• SAVE - System Availability Estimator (Duke Univ., funded by IBM)
• SHARPE - Symbolic Hierarchical Automated Reliability and Performance Evaluator; installed at nearly 280 locations (GUI available)
• SPNP - Stochastic Petri Net Package; installed at nearly 120 locations (iSPN GUI available)
• D_RAMP - developed for Union Switch and Signals by Duke, UVA and CMU
• SDM - Boeing Integrated Reliability Modeling Environment (jointly developed by Duke Univ., Univ. of Washington and Boeing)
• SDDS - developed by Sohar with help from K. Trivedi
• SREPT - Software Reliability Estimation and Prediction Tool

Challenges in Modeling

COMPLEXITIES OF MODELS

• Large State Space
  - Model construction problem
  - Model solution problem
• Model Stiffness: fast and slow rates acting together
  - Failure and Recovery/Repair
  - Performance and failure

COMPLEXITIES OF MODELS

• Modeling Non-Exponential Distributions
• Combining performance and reliability
• Believability/Understandability/Usability
• Incorporation in the design process
• Connection between measurements & models:
  - Parameterization
  - Validation

LARGENESS TOLERANCE

• Automated Model Construction
  - Stochastic Petri nets (GreatSPN, SPNP, SHARPE, DSPNexpress, ULTRASAN)
  - High-level languages (SAVE, QNAP, ASSIST, SDM)
  - Fault Tree + Recovery Info (HARP)
  - Object-Oriented Approaches (TANGRAM)
  - Loops in the specification of CTMC (SHARPE)

LARGENESS TOLERANCE

• Efficient numerical solution techniques
  - Sparse storage
  - Accurate and efficient solution methods
• We have generated and solved models with 1,000,000 states (the figure has gone up considerably recently)
  - Steady-state: near-optimal SOR (an illustrative sketch follows below)
  - Transient: modified Jensen's method

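A minimal sketch of the steady-state iteration on a sparsely stored generator (a toy three-state chain; omega = 1 gives Gauss-Seidel, other values give SOR; the tools' near-optimal choice of omega is not reproduced here). The point of sparse storage is exactly that million-state generators fit in memory:

```python
import numpy as np
from scipy.sparse import csr_matrix

# SOR for the steady-state equations pi Q = 0 on a sparse generator.
lam1, lam2, mu = 1e-3, 5e-3, 0.5   # assumed rates (toy chain)
Q = csr_matrix(np.array([
    [-lam1,   lam1,   0.0],
    [  0.0,  -lam2,  lam2],
    [   mu,    0.0,   -mu],
]))
QT = Q.T.tocsr()           # row j of Q^T holds the entries q_{i,j}
n = Q.shape[0]
pi = np.full(n, 1.0 / n)
omega = 1.0                # 1.0 = Gauss-Seidel; tune for SOR

for sweep in range(1000):
    delta = 0.0
    for j in range(n):
        col = QT[j].toarray().ravel()          # q_{i,j} for all i
        s = col @ pi - col[j] * pi[j]          # sum over i != j
        new = (1 - omega) * pi[j] + omega * (s / -col[j])
        delta = max(delta, abs(new - pi[j]))
        pi[j] = new
    pi /= pi.sum()          # renormalize each sweep
    if delta < 1e-12:
        break
print("steady-state probabilities:", pi)
```
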
MODEL SPECIFICATION LANGUAGES

• Different languages can be used to specify a single model type: the SAVE, QNAP and SPNP input languages all appear very different, yet the underlying model type is Markov.
• The same language can be used to specify different model types: the RESQ input language is used for PFQN or EQN.

LARGENESS AVOIDANCE

• Non-State-Space methods
  - Reliability block diagrams
  - Fault trees
  - Product-Form Queuing Networks
• Approximate solutions
  - State Truncation: SAVE, SPNP, ASSIST (Kantz and Trivedi; PNPM 91)

LARGENESS AVOIDANCE

• Approximate solutions
  - Hierarchical Decomposition (Chapter 11) and Fixed-Point Iteration among submodels:
    - Heidelberger and Trivedi; IEEE-TC, 1983 (Queueing Models)
    - Ciardo and Trivedi; PNPM 91 (SPN Models)
    - Tomek and Trivedi (Availability Models)
    - Singhal (IEEE-TPDS, 1992)
    - Chapter 11 of Sahner et al.

LARGENESS AVOIDANCE

• Approximate solutions
  - Time-Scale Decomposition: Bobbio and Trivedi (IEEE-TC, 1986); Section 11.2
  - Fluid Approximation: Mitra; Kulkarni; Ciardo, Nicol and Trivedi; FSPN
  - Performability (Chapters 6 and 12)

Difficulties in Modeling Using MRMs

• Stiffness: causes numerical difficulties in solution
  - Stiffness Tolerance: develop stiffness-tolerant numerical solution methods
  - Stiffness Avoidance: avoid generating stiff models through decomposition

STIFFNESS TOLERANCE

• Automatic Detection of Stiffness (HARP)
• Special Stable ODE Solvers
  - Reibman and Trivedi (TR-BDF2); Computers and Operations Research, 1988
  - Malhotra and Trivedi (Pade, Implicit RK)

STIFFNESS TOLERANCE

• Uniformization for Stiff Markov Chains: Muppala and Trivedi
• We can solve models with rate ratios of 10^8 or higher
• Implemented in SHARPE & SPNP; a small uniformization sketch follows below

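A minimal sketch of uniformization (Jensen's method) on a toy two-state chain with assumed rates; truly stiff models additionally need the truncation and stabilization refinements the cited work provides:

```python
import numpy as np

# Toy repairable component: state 0 = up, state 1 = down.
lam, mu = 1e-3, 0.5               # assumed failure/repair rates per hour
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

def transient(p0, Q, t, tol=1e-12):
    """pi(t) = sum_k e^{-qt}(qt)^k/k! * pi(0) P^k, P = I + Q/q."""
    q = np.max(-np.diag(Q)) * 1.02     # uniformization rate q >= max|q_ii|
    P = np.eye(len(Q)) + Q / q         # embedded DTMC
    term = np.exp(-q * t)              # Poisson weight for k = 0
    acc = term * p0
    vk = p0.copy()
    k = 0
    # Continue past the Poisson mode (k ~ q*t) before testing the weight.
    while (term > tol or k < q * t) and k < 100_000:
        vk = vk @ P                    # pi(0) P^{k+1}
        k += 1
        term *= q * t / k              # next Poisson weight
        acc += term * vk
    return acc

p0 = np.array([1.0, 0.0])
print("P(up) at t = 10 h:", transient(p0, Q, 10.0)[0])
```
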
STIFFNESS AVOIDANCE

• Model-level decomposition
  - Behavioral Decomposition (HARP; Bobbio & Trivedi): Fault Occurrence vs. Fault/Error Handling
  - Hierarchical Composition (SHARPE): composition of submodel solutions without generating a single one-level overall model
  - Fixed-Point Iteration (Ciardo and Trivedi; SPNP)

Non-Exponential Behavior

• Non-state-space models (Fault Trees, Reliability Graphs, RBDs): no problem

Non-Exponential Behavior in State Space Models

NON-EXPONENTIAL DISTRIBUTIONS

• Phase-Type Expansions: Malhotra and Reibman (GSHARPE); see Figure 9.38 on p. 191 (Red Book). An illustrative phase-type sketch follows below.
• Non-Homogeneous Markov Chains: CARE III, HARP; software reliability model with imperfect repairs solved using SHARPE

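To illustrate the phase-type idea (a simple Erlang-k moment match, not GSHARPE's fitting algorithm; all parameters are assumptions), a low-variance repair time can be replaced by k exponential stages so the overall model remains a CTMC:

```python
import numpy as np

# Erlang-k expansion of a low-variance repair time: mean preserved,
# squared coefficient of variation 1/k.
mean, k = 2.0, 8                 # assumed repair mean (hours), 8 stages
rate = k / mean                  # each exponential stage has rate k/mean
lam = 0.01                       # assumed failure rate of the up state

# CTMC: state 0 = up; states 1..k = repair stages.
n = k + 1
Q = np.zeros((n, n))
Q[0, 0], Q[0, 1] = -lam, lam     # failure enters the first repair stage
for s in range(1, k):
    Q[s, s], Q[s, s + 1] = -rate, rate
Q[k, k], Q[k, 0] = -rate, rate   # last stage completes the repair

# Steady state: pi Q = 0, sum(pi) = 1.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(n)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(f"availability with Erlang-{k} repair: {pi[0]:.6f}")
```
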
NON-EXPONENTIAL DISTRIBUTIONS

• Semi-Markov Chains: Ciardo et al.; IEEE-TC, Oct. 90
• Markov Regenerative Processes: Choi, Logothetis, Kulkarni, Trivedi
• DSPN and MRSPN: Choi, Kulkarni, Trivedi
• Discrete-Event Simulation: now in SPNP (FSPN and Non-Markovian SPN Simulation), RESQ, QNAP

BELIEVABILITY/UNDERSTANDABILITY

• Integration of Measurements and Models
  - Measurements Provide Parameters to Models
  - Models Provide Guidelines for Measurements
  - Models Validated Against Measurements
• Integration of Different Modeling Tools
  - Boeing SDM project
  - IDEAS project at Duke

BELIEVABILITY/UNDERSTANDABILITY

• Many Case Studies of Validation Needed
  - Vaxcluster Availability Model: Wein & Sathaye
  - Hsueh, Iyer and Trivedi; IEEE-TC, Apr. 1988
  - AT&T Validation of ESS
• Technology Transfer
  - Seminars and Workshops
  - Development and Dissemination of Tools
  - Application of the Techniques and Tools

MODELING AND MEASUREMENTS: INTERFACES

• Measurements supply input parameters to models (Model Calibration or Parameterization)
  - Confidence intervals should be obtained
  - Boeing, Draper, Union Switch projects
• Model Sensitivity Analysis can suggest which parameters to measure more accurately: Blake, Reibman and Trivedi; SIGMETRICS 1988. (An illustrative sketch follows below.)

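A minimal sketch of parametric sensitivity by finite differences (a toy one-component availability model with assumed rates, not the method of the cited paper): the scaled sensitivities rank which parameters most deserve measurement effort:

```python
# Finite-difference sensitivity of steady-state availability to each
# input rate of a toy repairable-component model.

def availability(lam, mu):
    """Single repairable component: A = mu / (lam + mu)."""
    return mu / (lam + mu)

params = {"lam": 1e-3, "mu": 0.5}   # assumed failure and repair rates

for name, val in params.items():
    h = 1e-6 * val
    hi = dict(params); hi[name] = val + h
    lo = dict(params); lo[name] = val - h
    dA = (availability(**hi) - availability(**lo)) / (2 * h)
    # Scaled sensitivity: percent change in A per percent change in the
    # parameter; larger magnitude = measure this parameter more carefully.
    print(f"S_{name} = {dA * val / availability(**params):+.4g}")
```
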
MODELING AND MEASUREMENTS: INTERFACES

• Model Validation
  1. Face Validation
  2. Input-Output Validation
  3. Validation of Model Assumptions (Hypothesis Testing)
• Rejection of a hypothesis regarding a model assumption based on measurement data leads to an improved model

MODELING AND MEASUREMENTS: INTERFACES

• Model Structure Based on Measurement Data: Hsueh, Iyer and Trivedi; IEEE-TC, April 1988; Gokhale et al.; IPDS 98