Draper IR&D Project Progress Report Reliable Software


Probability and Statistics with Reliability, Queuing and Computer Science Applications: Introduction
IIT Kanpur
Kishor S. Trivedi
Visiting Prof. Of Computer Science and Engineering, IITK
Prof. Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291
Phone: 7576
e-mail: [email protected]
URL: www.ee.duke.edu/~kst
1
Outline

Introduction

Reliability, Availability, Security, Performance, Performability

Methods of Evaluation

Evaluation Vs. Optimization
Model construction, parameterization, solution, validation, interpretation
Preliminaries: Sample Space, Probability Axioms, Independence, Conditioning,
Binomial Trials
Random Variables: Binomial, Poisson, Exponential, Weibull, Erlang,
Hyperexponential, Hypoexponential, Pareto, Defective
Reliability, Hazard Rate
Average Case Analysis of Program Performance
Reliability Analysis Using Block Diagrams and Fault Trees
Reliability of Standby Systems
Statistical Inference Including Confidence Intervals
Hypothesis Testing
Regression










2
Schedule & Textbooks

Schedule: Jan 21, 23, 28 and Feb 6, 18, 25, 27

Probability and Statistics with Reliability, Queuing,
and Computer Science Applications, K. S. Trivedi, second
edition, John Wiley & Sons, 2001.

Performance and Reliability Analysis of Computer Systems:
An Example-Based Approach Using the SHARPE Software
Package, Sahner, Trivedi, Puliafito, Kluwer Academic
Publishers, 1996.
3
Program Performance Evaluation









Worst-case vs. Average case analysis
Data-structure-oriented vs. Control structure-oriented
Sequential vs. Concurrent
Centralized vs. Distributed
Structured vs. with unrestricted transfer of control
Unlimited (hardware) resources vs. limited resources
Software architecture: modules, their characteristics
(execution time) and interactions (branching, looping)
Measures: completion time (mean, variance & dist.)
Measurements or Models (simulation vs. analytic)
analytic models: combinatorial, DTMC, SMP, CTMC, SPN
4
System Performance Evaluation










Workload: traffic arrivals, service time distributions
pattern of resource requests
Hardware architecture and software architecture
Resource Contention, Scheduling & Allocation
Concurrency, Synchronization, distributed processing
Timeliness (Have to Meet Deadlines)
Measures: Throughput, Goodput, loss probability,
response time or delay (mean, variance & dist.)
Low-level (Cache, memory interference: ch. 7)
System-level (CPU-I/O, multiprocessing: ch. 8,9)
Network-level (protocols, handoff in wireless: ch. 7,8)
Measurements or models (simulation or analytic)
analytic models: DTMC, CTMC, PFQN, SPN
5
System Performance Evaluation

Workload:
Single vs. multiple types of requests (classes, chains)
 The following items needed for each type of request:

traffic arrivals: one time vs. a stream
stream: Poisson (Bernoulli), General renewal, IPP (IBP),
MMPP(MMBP), MAP, BMAP, NHPP, Self-similar
 service time distributions: Exponential (geometric), deterministic,
uniform, Erlang, Hyperexponential, Hypoexponential, Phase-type, general (with finite mean and variance), Pareto
 pattern of resource requests: service time distribution (or the
mean) at each resource per visit, branching probabilities; often
described as a DTMC (discrete-time Markov chain) and can also
be seen as the behavior of an individual program


All this information should be collected from actual
measurements (if possible) followed by statistical inference
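As a rough illustration of describing the pattern of resource requests as a DTMC, the sketch below computes the expected number of visits a request makes to each resource and its total mean service demand (contention is ignored). The two-resource example (CPU and disk), the branching probabilities and the per-visit service times are assumed values chosen for illustration; they do not come from the slides.

```python
def expected_visits(P, start, iterations=5000):
    """Expected number of visits to each transient state of an absorbing DTMC.

    P[i][j] is the branching probability from resource i to resource j;
    a row sum below 1 means the request may finish (absorption) from i.
    Solves v = e_start + v P by simple fixed-point iteration.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(iterations):
        v = [(1.0 if j == start else 0.0) + sum(v[i] * P[i][j] for i in range(n))
             for j in range(n)]
    return v


# Assumed example: after each CPU burst the request goes to the disk with
# probability 0.9 or finishes with probability 0.1; the disk always returns
# control to the CPU.
P = [[0.0, 0.9],   # from CPU:  to CPU, to disk
     [1.0, 0.0]]   # from disk: to CPU, to disk
mean_service = [0.02, 0.03]   # assumed mean service time per visit (seconds)

visits = expected_visits(P, start=0)
total_demand = sum(v * s for v, s in zip(visits, mean_service))
print("expected visits (CPU, disk):", visits)
print("total mean service demand (s):", total_demand)
```

In a real study the branching probabilities and service times would be estimated from measurements, as the slide recommends.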
6
Software Reliability




Black-box (measurements + statistical inference) vs.
Architecture-based approach (models)
Black-box approaches treat software as a monolithic whole,
considering only its interactions with external environment,
without an attempt to model its internal structure
With growing emphasis on reuse, software development
process moves toward component-based software design
White-box approach may be better to analyze a system with
many software components and how they fit together
7
Software Architecture


Software behavior with respect to the manner in which
different components interact
May include the information about the execution time of
each component

Use control flow graph to represent architecture

Sequential program architecture modeled by
• Discrete Time Markov Chain (DTMC)
• Continuous Time Markov Chain (CTMC)
• Semi-Markov process (SMP)
8
Failure Behavior of Components and
Interfaces
Failure can happen

during the execution of any component or

during the transfer of control between components
Failure behavior can be specified in terms of

reliability

constant failure rate

time-dependent failure intensity
9
System Reliability/Availability











Faultload: fault types, fault arrivals, repair/recovery
procedures and delay time distributions
Hardware architecture and software architecture
Minimum Resource Requirements
Dynamic failures
Performance/Reliability interdependence
Measures: Reliability, Availability, MTTF, Downtime
Low-level (Physics of failures, chip level)
System-level (CPU-I/O, multiprocessing: ch. 8,9)
Software and Hardware combined together
Network-level
Measurements or models (simulation or analytic)
analytic models: RBD, FTREE, CTMC, SPN
10
Definition of Reliability
Recommendation E.800 of the International
Telecommunication Union (ITU-T) defines reliability as follows:


“The ability of an item to perform a required function
under given conditions for a given time interval.”
In this definition, an item may be a circuit board, a component on
a circuit board, a module consisting of several circuit boards, a
base transceiver station with several modules, a fiber-optic
transport-system, or a mobile switching center (MSC) and all its
subtending network elements. The definition includes systems
with software.

11
Definition of Availability
Availability is closely related to reliability, and is also defined in
ITU-T Recommendation E.800 as follows:[1]

"The ability of an item to be in a state to perform a required
function at a given instant of time or at any instant of time
within a given time interval, assuming
that the external resources, if required, are provided."
An important difference between reliability and availability is that
reliability refers to failure-free operation during an interval, while
availability refers to failure-free operation at a given instant of
time, usually the time when a device or system is first accessed
to provide a required function or service

12
High Reliability/Availability/Safety

Traditional applications
(long-life/life-critical/safety-critical)
 Space missions, aircraft control, defense,
nuclear systems

New applications
(non-life-critical/non-safety-critical, business
critical)
 Banking, airline reservation, e-commerce
applications, web-hosting,
telecommunication

Scientific applications
(non-critical)
13
Motivation: High Availability








Scott McNealy, Sun Microsystems Inc.
• "We're paying people for uptime. The only thing that really
matters is uptime, uptime, uptime, uptime and uptime. I want to
get it down to a handful of times you might want to bring a Sun
computer down in a year. I'm spending all my time with
employees to get this design goal."
Sun Microsystems – SunUP & RASCAL program for high availability
Motorola – 5NINES Initiative
HP, Cisco, Oracle, SAP – 5nines:5minutes Alliance
IBM – Cornhusker clustering technology for high availability, eLiza,
autonomic computing
Microsoft – Trustworthy Computing initiative
John Hennessy – in IEEE Computer
Microsoft – Regular full page ad on 99.999% availability in USA
Today
14
Motivation – High Availability
15
Need for a new term



Reliability is used in a generic
sense
Reliability used as a precisely
defined mathematical function
To remove confusion, IFIP WG
10.4 has proposed Dependability
as an umbrella term
16
Dependability – Umbrella term
Trustworthiness of a computer system such that reliance can justifiably be
placed on the service it delivers
DEPENDABILITY
• ATTRIBUTES: AVAILABILITY, RELIABILITY, SAFETY, CONFIDENTIALITY, INTEGRITY, MAINTAINABILITY
• MEANS: FAULT PREVENTION, FAULT REMOVAL, FAULT TOLERANCE, FAULT FORECASTING
• THREATS: FAULTS, ERRORS, FAILURES
SECURITY (label attached to the attributes in the figure)
17
IFIP WG10.4



Failure occurs when the delivered
service no longer complies with the
specification
Error is that part of the system state
which is liable to lead to subsequent
failure
Fault is the adjudged or hypothesized cause
of an error
Faults are the cause of errors that may lead to failures
Fault → Error → Failure
18
Dependability: Reliability, Availability, Safety,
Security

Redundancy: Hardware (Static, Dynamic), Information,
Time, Software
Fault Types: Permanent (needs repair or replacement),
Intermittent (reboot/restart or replacement), Transient
(retry), Design:
Bohrbugs, Heisenbugs, Aging-related bugs

Fault Detection, Automated Reconfiguration

Imperfect Coverage

Maintenance: scheduled, unscheduled
19
Software Fault Classification
Bohrbugs
• Many software bugs are reproducible, easily found
and fixed during the testing and debugging phase
Heisenbugs
• Other bugs that are hard to find and fix remain in the software
during the operational phase
• These bugs may never be fixed, but if the operation is retried
or the system is rebooted, the bugs may not manifest
themselves as failures
• Manifestation is non-deterministic and dependent on the
software reaching very rare states
20
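Software Fault Classification
[Figure: software faults divided into Bohrbugs, Heisenbugs and "aging"-related bugs. Remedies shown by phase: test/debug in the design/development phase; design/data diversity, retry operation, restart application and reboot node in the operational phase]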
21
Failure Classification (Cristian)

Failures
• Omission failures (send/receive failures)
  • Crash failures
  • Infinite loop
• Timing failures
  • Early
  • Late (performance or dynamic failures)
• Response failures
  • Value failures
  • State-transition failures
22
Security
Security intrusions cause a system to fail
• Security Failure
• Integrity: Destruction/Unauthorized
modification of information
• Confidentiality: Theft of information
• Availability: e.g., Denial of Service
(DoS)
Similarity (as well as differences) between:
• Malicious vs. accidental faults
• Security vs. reliability/availability
• Intrusion tolerance vs. fault tolerance
23
The Need of Performability
Modeling

New technologies, services & standards need
new modeling methodologies
Pure performance modeling: too optimistic!
Outage-and-recovery behavior not considered

Pure dependability modeling: too conservative!
Different levels of performance not considered

24
“ilities” besides performance
Performability: measures of the system's ability to perform designated functions
Reliability: for a specified operational time
Availability: at any given instant
Survivability: performance under failures
R.A.S.-ability concerns are growing. High R.A.S. is not only a selling point for
equipment vendors and service providers: the regulatory outage reports that the
FCC requires for public switched telephone networks (PSTN) may soon
apply to wireless networks as well.
25
Evaluation vs. Optimization


Evaluation of system for desired measures given a
set of parameters
Sensitivity Analysis
• Bottleneck analysis
• Reliability importance

Optimization
• Static: linear, integer, geometric, nonlinear, multi-objective; constrained or unconstrained
• Dynamic: dynamic programming, Markov decision
process, semi-Markov decision process

26
PURPOSE OF EVALUATION

Understanding a system

Observation
Operational environment
Controlled environment

Reasoning
A model is a convenient abstraction

Predicting behavior of a system

Need a model

Accuracy based on degree of extrapolation
27
PURPOSE OF EVALUATION
(Continued)
These famous quotes bring out the difficulty of prediction
based on models:

“All Models are Wrong; Some Models are Useful”
George Box

“Prediction is fine as long as it is not about the future”
Mark Twain
28
Basic Definitions

Reliability R(t):
X: time to failure of a system
F(t): distribution function of system lifetime
f(t): density function of system lifetime
$$R(t) = P(X > t) = 1 - F(t)$$

Mean Time To system Failure:
$$\mathrm{MTTF} = E[X] = \int_0^\infty t\, f(t)\, dt = \int_0^\infty R(t)\, dt$$
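The identity MTTF = ∫ R(t) dt can be checked numerically. The following is a minimal sketch, assuming an exponential lifetime with rate 0.001 per hour and a Weibull lifetime with scale 1000 hours and shape 2; these parameter values, the function name and the truncation points are illustrative choices, not values from the slides.

```python
import math


def mttf_from_reliability(R, upper, steps=200_000):
    """Approximate MTTF = integral of R(t) over [0, infinity) with the
    trapezoidal rule, truncated at `upper` (where R(upper) is negligible)."""
    h = upper / steps
    total = 0.5 * (R(0.0) + R(upper))
    for k in range(1, steps):
        total += R(k * h)
    return total * h


lam = 1e-3                 # assumed failure rate (per hour)
eta, beta = 1000.0, 2.0    # assumed Weibull scale (hours) and shape


def exp_reliability(t):
    return math.exp(-lam * t)


def weibull_reliability(t):
    return math.exp(-((t / eta) ** beta))


mttf_exp = mttf_from_reliability(exp_reliability, upper=30_000.0)
mttf_weibull = mttf_from_reliability(weibull_reliability, upper=10_000.0)

# Closed forms for comparison: 1/lam and eta * Gamma(1 + 1/beta)
print(mttf_exp, 1.0 / lam)
print(mttf_weibull, eta * math.gamma(1.0 + 1.0 / beta))
```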
29
Availability

(Continued)
Instantaneous (point) Availability A(t):
A(t) = P(system working at t)
g(t): density function of system repair time
Let H(t) be the convolution of F and G:
$$H(t) = \int_0^t F(t - x)\, g(x)\, dx$$
Then:
$$A(t) = R(t) + \int_0^t A(t - x)\, dH(x) \tag{1}$$
Instantaneous availability is never less than reliability: $A(t) \ge R(t)$
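As a worked special case of the renewal equation for A(t), assume (for illustration only) exponentially distributed times to failure and repair with rates $\lambda$ and $\mu$; these symbols are introduced here and are not used on the slide. The well-known closed-form solution is then:

```latex
F(t) = 1 - e^{-\lambda t}, \qquad G(t) = 1 - e^{-\mu t}
\quad\Longrightarrow\quad
A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t}
```

This gives $A(0) = 1$, and as $t \to \infty$ the availability tends to $\mu / (\lambda + \mu)$, while $R(t) = e^{-\lambda t}$ tends to 0, consistent with $A(t) \ge R(t)$.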
30
Availability
System working at time t:
• Never failed in (0,t), prob: R(t)
• First failed and got repaired at time x<t & UP at end of interval (x,t), prob: $\int_0^t A(t - x)\, dH(x)$
[Figure: time axis from 0 to t; the first repair is completed in (x, x + dx)]
31
Availability (Continued)

MTTR: Mean Time to Repair

Y: repair period of the system
$$\mathrm{MTTR} = E[Y] = \int_0^\infty t\, g(t)\, dt$$

Availability and Reliability are related but different!
32
Availability

(Continued)
We can show from equation (1) that:
$$A_{SS} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$

Also:
$$\text{downtime} = (1 - A_{SS}) \times 8760 \times 60 \quad \text{(in minutes per year)}$$
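A small sketch of the two formulas above. The function names and the example MTTF and MTTR values are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 8760 * 60   # hours per year times minutes per hour


def steady_state_availability(mttf, mttr):
    """A_SS = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(a_ss):
    """Expected downtime in minutes per year for steady-state availability a_ss."""
    return (1.0 - a_ss) * MINUTES_PER_YEAR


# Assumed example: MTTF = 10000 hours, MTTR = 2 hours
a_ss = steady_state_availability(10_000.0, 2.0)
print(a_ss, downtime_minutes_per_year(a_ss))

# "Five nines" availability corresponds to roughly 5.26 minutes per year
print(downtime_minutes_per_year(0.99999))
```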
33
Availability (Continued)

Steady-State Availability:
$$A_{SS} = \lim_{t \to \infty} A(t)$$

There are two kinds of Availabilities!
Instantaneous & Steady-state
For a system with high degree of redundancy
$$A_{SS} = \frac{\mathrm{MTTF}_{eq}}{\mathrm{MTTF}_{eq} + \mathrm{MTTR}_{eq}}$$
where $\mathrm{MTTF}_{eq}$ and $\mathrm{MTTR}_{eq}$ must be carefully defined;
they can be computed using SHARPE
34
MEASURES TO BE EVALUATED

Dependability
 Reliability: R(t), System MTTF
 Availability: Steady-state, Transient; Downtime
 Safety, security
“Does it work, and for how long?”

Performance

Throughput, Blocking Probability, Response Time
“Given that it works, how well does it work?”
35
MEASURES TO BE EVALUATED
(Continued)

Composite Performance and Dependability
“How much work will be done (lost) in a
given interval including the effects of
failure/repair/contention?”

Need Techniques and Tools That Can Evaluate
• Performance, Dependability and Their Combinations
36
Methods of EVALUATION

Measurement-Based
Most believable, most expensive
Not always possible or cost effective during system
design


Statistical techniques are very important here
Model-Based
37
Methods of EVALUATION
(Continued)

Model-Based
Less believable, Less expensive
1. Discrete-Event Simulation vs. Analytic
2. State-Space Methods vs. Non-State-Space
Methods
3. Hybrid: Simulation + Analytic (SPNP)
4. State Space + Non-State Space (SHARPE)
38
Methods of EVALUATION
(Continued)

Measurements + Models
Vaidyanathan et al ISSRE 99
39
QUANTITATIVE EVALUATION
TAXONOMY
Closed-form solution
Numerical solution using a tool
40
Note that

Both measurements & simulations imply statistical
analysis of outputs (ch. 10,11)
Statistical inference
 Hypothesis testing
 Design of experiments
 Analysis of variance
 Regression (linear, nonlinear)





Distribution driven simulation requires generation of
random deviates (variates) (ch. 3, 4, 5)
Probability and Statistics are different yet highly related
Probability models need inputs that generally come from
measurement data (followed by statistical inference)
Statistics in turn uses probability theory
41
MODELING TAXONOMY
42
ANALYTIC MODELING
TAXONOMY
NON-STATE SPACE MODELING TECHNIQUES
SP reliability block diagrams
Non-SP reliability block diagrams
43
State Space Modeling Taxonomy
State space methods
• Markovian modeling
  • discrete-time Markov chains
  • continuous-time Markov chains
  • Markov reward models
• non-Markovian modeling
  • Semi-Markov models
  • Markov regenerative models
  • Non-Homogeneous Markov
44
Modeling Steps
• Model construction
• Model parameterization
• Model solution
• Result interpretation
• Model validation
45
MODELING AND MEASUREMENTS:
INTERFACES

Measurements supply Input Parameters to Models
(Model Calibration or Parameterization)
Confidence Intervals should be obtained
Boeing, Draper, Union Switch projects

Model Sensitivity Analysis can suggest which Parameters
to Measure More Accurately: Blake, Reibman and Trivedi:
SIGMETRICS 1988.
46
MODELING AND MEASUREMENTS:
INTERFACES

Model Validation
1. Face Validation
2. Input-Output Validation
3. Validation of Model Assumptions
(Hypothesis Testing)

Rejection of a hypothesis regarding model assumption
based on measurement data leads to an improved
model
47
MODELING AND MEASUREMENTS:
INTERFACES

Model Structure Based on Measurement Data

Hsueh, Iyer and Trivedi; IEEE TC, April 1988

Gokhale et al, IPDS 98; Vaidyanathan et al, ISSRE 99
48
MODELING TAXONOMY
49
ANALYTIC MODELING
TAXONOMY
NON-STATE SPACE MODELING TECHNIQUES
SP reliability block diagrams
Non-SP reliability block diagrams
50
State Space Modeling Taxonomy
(discrete) State space models
• Markovian models
  • discrete-time Markov chains
  • continuous-time Markov chains
  • Markov reward models
• non-Markovian models
  • Semi-Markov process
  • Markov regenerative process
  • Non-Homogeneous Markov
51
MODELING THROUGHOUT
SYSTEM LIFECYCLE

System Specification/Design Phase
Answer “What-if Questions”
• Compare design alternatives (Bedrock, Wireless handoff)
• Performance-Dependability Trade-offs (Wireless Handoff)
• Design Optimization (optimizing the number of guard channels)
52
MODELING THROUGHOUT
SYSTEM LIFECYCLE (Continued)

Design Verification Phase
Use Measurements + Models
E.g. Fault Injection + Availability Model
Union Switch and Signals, Boeing, Draper

Configuration Selection Phase: DEC, HP

System Operational Phase: IDEN Project
Workload based adaptive rejuvenation
• It is fun!
53
MODELER'S DILEMMA
Should I Use Discrete-Event Simulation?

Point Estimates and Confidence Intervals

How many simulation runs are sufficient?

What Specification Language to use?
• C, SIMULA, SIMSCRIPT, MODSIM, GPSS, RESQ,
SPNP v6, Bones, SES workbench, ns, opnet
54
MODELER'S DILEMMA

(Continued)
Simulation:
+ Detailed System Behavior including non-exponential
behavior
+ Performance, Dependability and Performability
Modeling Possible
- Long Execution Time (Variance Reduction Possible)
  • Importance sampling, importance splitting, regenerative simulation
  • Parallel and Distributed Simulation
- Many users in practice do not realize the need to
calculate confidence intervals
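To illustrate why a simulation result should be reported as a point estimate with a confidence interval, the sketch below estimates the MTTF of a two-component parallel system by simulation. The failure rate, the seed, the run counts and the normal-approximation 95% interval are all assumptions chosen for this example.

```python
import math
import random


def estimate_mttf(n_runs, failure_rate, seed=12345):
    """Monte Carlo estimate of the MTTF of two independent components in parallel
    (the system fails when both components have failed).

    Returns (point estimate, half-width of an approximate 95% confidence interval).
    """
    rng = random.Random(seed)
    lifetimes = [max(rng.expovariate(failure_rate), rng.expovariate(failure_rate))
                 for _ in range(n_runs)]
    mean = sum(lifetimes) / n_runs
    variance = sum((x - mean) ** 2 for x in lifetimes) / (n_runs - 1)
    half_width = 1.96 * math.sqrt(variance / n_runs)   # normal approximation
    return mean, half_width


lam = 1e-3   # assumed component failure rate (per hour)
for n in (100, 10_000):
    mean, hw = estimate_mttf(n, lam)
    print(f"runs={n:6d}  MTTF estimate = {mean:8.1f} +/- {hw:6.1f} hours")

# Exact value for comparison: 1/lam + 1/(2*lam) = 1.5/lam
print("exact MTTF:", 1.5 / lam)
```

The half-width shrinks roughly with the square root of the number of runs, which is one way to decide how many runs are sufficient for a target precision.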
55
MODELER'S DILEMMA
(Continued)
Should I Use Non-State-Space Methods?

Also Known as Combinatorial Models

Model Solved Without Generating State Space

Use: Order Statistics, Mixing, Convolution (chapters 1-5)

Common Dependability Model Types:
also called Combinatorial Models

Series-Parallel Reliability Block Diagrams

Non-Series-Parallel Block Diagrams (or Reliability Graphs)

Fault-Trees Without Repeated Events

Fault-Trees With Repeated Events
56
Combinatorial analytic models

Reliability block diagrams, Fault trees and Reliability
graphs

Commonly used for reliability and availability

These model types are similar in that they capture
conditions that make a system fail in terms of the
structural relationships between the system
components.
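A minimal sketch of series-parallel composition for a reliability block diagram, assuming statistically independent components. The block structure (two redundant controllers in series with two mirrored disks) and the component reliabilities are hypothetical values for illustration.

```python
def series(*reliabilities):
    """A series structure works only if every block works."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """A parallel structure fails only if every block fails."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


# Hypothetical system: a pair of redundant controllers in series with a
# pair of mirrored disks (probabilities of working at some fixed time t).
controllers = parallel(0.99, 0.99)
disks = parallel(0.95, 0.95)
system = series(controllers, disks)
print(controllers, disks, system)
```

The same functions apply to steady-state availabilities in place of reliabilities, as long as the independence assumption (including independent repair) holds.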
57
RBD example
58
Combinatorial Models


Combinatorial modeling techniques like RBDs
and FTs are easy to use; assuming
statistical independence, they can be solved for system
availability and system MTTF
Each component can have attached to it
• A probability of failure
• A failure rate
• A distribution of time to failure
• Steady-state and instantaneous unavailability
59
Non-State Space
Modeling Techniques

Possible to compute (given
component failure/repair rates):
• System Reliability
• System Availability (Steady-state, instantaneous)
• Downtime
• System MTTF
60
Non-State Space Modeling
Techniques (Continued)


Assuming:

Failures are statistically independent

As many repair units as needed
Relatively good algorithms are available for
solving such models so that 100 component
systems can be handled.
61
Non-State Space Modeling
Techniques (Continued)

Common Model Types: Performance

Series-Parallel Task Precedence Graphs

Product-Form Queuing Networks
+ Easy specification, fast computation, no
distributional assumption
+ Can easily solve models with 100’s of components
62
Combinatorial Modeling
(Continued)

These models can be solved using fast algorithms assuming
stochastic independence between system components. Systems with
several hundred components can be handled.
• Sum of disjoint products (SDP) algorithms
• Binary decision diagrams (BDD) algorithms
• Factoring (conditioning) algorithms
• Series-parallel composition algorithm
- Failure/Repair Dependencies are often present; RBDs, FTREEs
cannot easily handle these
• (e.g., shared repair, warm/cold spares, imperfect coverage,
non-zero switching time, travel time of repair person, reliability with
repair)
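The effect of one such dependency, shared repair, can be seen in a small birth-death CTMC. The sketch below compares the steady-state unavailability of a two-component parallel system when each component has its own repair facility against the case of a single shared repair facility. The rates and function names are assumed for illustration.

```python
def birth_death_steady_state(up_rates, down_rates):
    """Steady-state probabilities of a birth-death CTMC.

    up_rates[i] is the rate from state i to i+1, down_rates[i] the rate
    from state i+1 back to i.
    """
    weights = [1.0]
    for up, down in zip(up_rates, down_rates):
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return [w / total for w in weights]


lam, mu = 0.001, 0.1   # assumed failure and repair rates (per hour)

# State = number of failed components (0, 1, 2); system is down in state 2.
independent = birth_death_steady_state([2 * lam, lam], [mu, 2 * mu])
shared = birth_death_steady_state([2 * lam, lam], [mu, mu])

u_independent = independent[2]
u_shared = shared[2]
u_rbd = (lam / (lam + mu)) ** 2   # what an RBD with independent blocks gives

print(u_independent, u_rbd, u_shared)
```

With independent repair the CTMC result matches the combinatorial (RBD) answer; with shared repair the unavailability roughly doubles for these rates, which the RBD cannot capture.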
63
Markov chain

To model more complicated interactions between
components, use other kinds of models like Markov
chains or more generally state space models.

Many examples of dependencies among system
components have been observed in practice and
captured by Markov models.
64
State-Space-Based Models


States and labeled state transitions
State can keep track of:
• Number of functioning resources of each type
• States of recovery for each failed resource
• Number of tasks of each type waiting at each resource
• Allocation of resources to tasks

A transition:
• Can occur from any state to any other state
• Can represent a simple or a compound event
65
State-Space-Based Models (Continued)

Transitions between states represent the change of the system
state due to the occurrence of an event

Drawn as a directed graph

Transition label:

Probability: homogeneous discrete-time Markov chain
(DTMC)

Rate: homogeneous continuous-time Markov chain (CTMC)

Time-dependent rate: non-homogeneous CTMC

Distribution function: semi-Markov process (SMP)

Two distribution functions: Markov regenerative process
(MRGP)
66
MODELER'S DILEMMA
(Continued)
Should I Use Markov Models?
State-Space-Based Methods
+ Model Fault-Tolerance and Recovery/Repair
+ Model Dependencies
+ Model Contention for Resources
+ Model Concurrency and Timeliness
+ Generalize to Markov Reward Models for Modeling
Degradable Performance
67
MODELER'S DILEMMA
(Continued)
Should I Use Markov Models?
+ Generalize to Markov Regenerative Models for Allowing
Generally Distributed Event Times
+ Generalize to Non-Homogeneous Markov Chains for
Allowing Weibull Failure Distributions
+ Performance, Availability and Performability Modeling
Possible
- Large (Exponential) State Space
68
IN ORDER TO FULFILL OUR
GOALS


Modeling Performance, Availability and
Performability
Modeling Complex Systems
We Need

Automatic Generation and Solution of Large
Markov Reward Models
69
IN ORDER TO FULFILL OUR
GOALS (Continued)

Facility for State Truncation, Hierarchical composition of
Non-State-Space and State-Space Models, Fixed-Point
Iteration



There are Two Tools that Potentially meet these Goals
Stochastic Petri Net Package (SPNP)
Symbolic Hierarchical Automated Reliability and
Performance Evaluator (SHARPE)
70
Model-based
Performance/Dependability
evaluation

Choice of the model type is dictated by:
• Measures of interest
• Level of detailed system behavior to be represented
• Ease of model specification and solution
• Representation power of the model type
• Access to suitable tools or toolkits
71
Difficulty in Modeling using
Markov chains
Markov chains tend to be large and complex,
leading to:
• Model generation problem
Use automated means of generating the Markov
chains: Stochastic Petri Nets, Stochastic Reward
Nets
72
Difficulty in Modeling using
Markov chains (Continued)

Model solution problem
Use sparse storage for the matrices
Use sparsity preserving solution methods
• Successive Overrelaxation,
• Gauss-Seidel,
• Uniformization,
• ODE-solution methods
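As an illustration of one of the listed methods, here is a minimal sketch of uniformization for the transient state probabilities of a CTMC. It uses dense Python lists for brevity, whereas a practical solver would use sparse storage as noted above; the two-state up/down model and its rates are assumed values, and no safeguard against underflow for large q·t is included.

```python
import math


def uniformization(Q, p0, t, eps=1e-12, max_terms=100_000):
    """Transient state probabilities p(t) = p0 * exp(Q t) by uniformization.

    Q is the generator matrix (rows sum to zero), p0 the initial distribution.
    The Poisson-weighted series is truncated once the neglected mass is < eps.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))            # uniformization rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]                        # DTMC matrix P = I + Q/q
    weight = math.exp(-q * t)                      # Poisson probability of 0 events
    v = list(p0)
    result = [weight * x for x in v]
    mass = weight
    for k in range(1, max_terms):
        if 1.0 - mass < eps:
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        result = [r + weight * x for r, x in zip(result, v)]
        mass += weight
    return result


lam, mu, t = 0.01, 1.0, 2.0            # assumed failure rate, repair rate, time
Q = [[-lam, lam],                      # state 0 = up, state 1 = down
     [mu, -mu]]
p_t = uniformization(Q, [1.0, 0.0], t)
closed_form = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
print(p_t[0], closed_form)
```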
73
Markov Reward Models
(MRMs)

Modeling any system with a pure reliability / availability
model can lead to incomplete, or, at least, less precise
results.

Gracefully degrading systems may be able to survive the
failure of one or more of their active components and
continue to provide service at a reduced level.

A Markov reward model is a commonly used technique for
the modeling of gracefully degradable systems
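A minimal sketch of a Markov reward model for a gracefully degrading multiprocessor. The assumptions are illustrative and not taken from the slides: n processors that fail independently at rate lam, a single shared repair facility with rate mu, and a reward rate equal to the number of working processors.

```python
def degradable_multiprocessor(n, lam, mu):
    """Steady-state Markov reward model; state j = number of failed processors."""
    weights = [1.0]
    for j in range(n):
        # failure rate out of state j is (n - j) * lam; repair rate back is mu
        weights.append(weights[-1] * (n - j) * lam / mu)
    total = sum(weights)
    pi = [w / total for w in weights]

    rewards = [n - j for j in range(n + 1)]      # reward = working processors
    expected_reward = sum(r * p for r, p in zip(rewards, pi))
    availability = 1.0 - pi[n]                   # up if at least one processor works
    return pi, expected_reward, availability


pi, capacity, availability = degradable_multiprocessor(n=2, lam=0.001, mu=0.1)
print("state probabilities:", pi)
print("expected computational capacity:", capacity)
print("steady-state availability:", availability)
```

A pure availability model would report only the last number; the reward model additionally shows how much capacity is delivered on average, which is the performability view.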
74
State-Space-Based Models

Use also the following model types:

Markov chains & Markov reward models

semi-Markov & Markov regenerative processes

Stochastic reward nets or generalized stochastic Petri nets.

SRN & GSPN models are transformed into Markov chains for
analysis.

These are the only model types (in SHARPE) that require conversion to a
different model type (Markov chain) to be solved.
75
Summary- Modeling
Techniques




Combinatorial techniques like RBDs and FTREEs are
easy to use and solve
Combinatorial models cannot easily represent intricate
dependencies
State space based models like Markov chains can
handle dependencies
State space explosion problem

Use automated generation methods: stochastic Petri
nets

Concurrency, contention and conditional branching
easily modeled with Petri nets.
76
Hierarchy used

State space explosion can be handled in two
ways:
Large model tolerance must apply to
specification, storage and solution of the model.
If the storage and solution problems can be
solved, the specification problem can be solved
by using more concise (and smaller) model
specifications that can be automatically
transformed into Markov models.
 Large models can be avoided by using
hierarchical (Multilevel) model composition.

77
LARGENESS AVOIDANCE

Non-State-Space methods
• Reliability block diagrams
• Fault-trees
• Product-Form Queuing Networks
Approximate solutions
• State Truncation
SAVE, SPNP, ASSIST (Kantz and Trivedi: PNPM91)
78
VAXcluster example
Hierarchy:
Diode on top, CTMC at bottom
Storage model
79
Approximate Availability Model for
the Processing Subsystem
80
Availability Model for the VAXcluster
Storage Subsystem
Reliability block diagram for the storage system
81
A novel availability model for VAXclusters
with large storage subsystems.


The configuration shown consists of two HSCs,
and a set of disks. The disks are further classified
into two system disks and two application disks.
The operating system resides on the system disk,
and the user accounts and other application
software on the application disks.
Further, it is assumed that the disks are shadowed
and dual pathed and ported between the two
HSCs. A disk dual pathed between two HSCs can
be accessed cluster-wide in a coordinated way
through either HSC. In case one of the HSCs fails,
82
Assumptions

The model assumed that each component in the
block diagram has its own repair facility.

The repair time is a 2-stage hypoexponentially
distributed random variable with the first phase
being the travel time and the second phase being
the actual repair time.
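A short sketch of the 2-stage hypoexponential repair-time distribution described above, using the standard closed forms for its CDF, mean and variance. The travel and repair rates are assumed values for illustration.

```python
import math


def hypoexp_cdf(t, mu1, mu2):
    """CDF of the sum of two independent exponential phases with distinct rates."""
    if mu1 == mu2:
        raise ValueError("rates must differ (equal rates give an Erlang distribution)")
    return 1.0 - (mu2 * math.exp(-mu1 * t) - mu1 * math.exp(-mu2 * t)) / (mu2 - mu1)


def hypoexp_mean(mu1, mu2):
    """Mean repair time (MTTR) = mean travel time + mean actual repair time."""
    return 1.0 / mu1 + 1.0 / mu2


def hypoexp_variance(mu1, mu2):
    return 1.0 / mu1 ** 2 + 1.0 / mu2 ** 2


travel_rate = 0.5   # assumed: mean travel time of 2 hours
repair_rate = 2.0   # assumed: mean actual repair time of 0.5 hours

mttr = hypoexp_mean(travel_rate, repair_rate)
scv = hypoexp_variance(travel_rate, repair_rate) / mttr ** 2
print("MTTR (hours):", mttr)
print("P(repair done within 4 hours):", hypoexp_cdf(4.0, travel_rate, repair_rate))
print("squared coefficient of variation:", scv)
```

The squared coefficient of variation is below 1, so this repair time is less variable than an exponential with the same mean.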
83
Top level reliability block
diagram for the storage
subsystem
Top level reliability block diagram for the storage subsystem
84
CTMC submodels
85
Assumptions

In the second improved model, we removed the
assumption of independent repair. Instead, it is
assumed that a repair facility is shared within a
subsystem.

The storage system is now modeled as a two-level
hierarchical model. The bottom level consists of
three independent CTMC models, namely HSC,
SDisk and ADisk, representing the HSC, system
disk and application disk subsystems respectively.

The top level consists of a reliability block diagram
86
Storage shared repair recovery
(a) System recovery included for HSC subsystem
(b) Disk reload and system recovery included for SDisk subsystem
(c) Disk reload and system recovery included for ADisk subsystem
87
Assumptions

In the third approximation we took into account
disk reload and system recovery. This takes into
account the following activities.

When a disk subsystem experiences a failure, data
on the disk may be corrupted or lost. After the disk
is repaired the data is reloaded on to the disk from
an external source, such as a backup disk or tape.
While the reload is a local activity of a disk
subsystem, recovery is a global system-wide
88
HSC submodel with
failover included
89
An Introduction to SHARPE
software tool
90
Overview of SHARPE





SHARPE: Symbolic-Hierarchical Automated
Reliability and Performance Evaluator
Well-known modeling tool (Installed at over
300 Sites; companies and universities)
Combines flexibility of Markov models and
efficiency of combinatorial models
Ported to most architectures and operating
systems
Used for Education, Research, Engineering
Practice
91
Overview of SHARPE (cont.)

Graphical User Interface is available

Used for analysis of performance (traffic),
dependability and performability

Hierarchy facilitates largeness & stiffness avoidance

Steady-state as well as transient analysis

Written in C language

Used as an engine by several other tools
92
SHARPE - new features

Many more built in distributions

Ability to easily specify structured Markov
chains (Loop feature)

Ability to print models and outputs
93
New Features







Equivalent mean time to system failure and equivalent mean
time to system repair implemented for Markov chains and
RBDs
BDD algorithms implemented for FTs and RGs
Steady-state computation of MRGP models
Stochastic reward net is available as a model type
Fast MTTF algorithm implemented for Markov chain
Mathematica used for some fully symbolic computations
GUI implemented
94
Architecture of SHARPE interface
[Figure: model types (reliability block diagrams, fault tree, reliability graph, Markov chain, MRGP, Petri net (GSPN & SRN), task graph, pfqn, mpfqn) combined through hierarchical & hybrid compositions to evaluate reliability/availability, performance and performability]
95
SHARPE MENU OF MODEL TYPES

Availability/Reliability:
• Series-Parallel Reliability Block Diagram (block)
• Fault Trees (ftree)
• Reliability Graphs (relgraph)
96
SHARPE MENU OF MODEL TYPES

Performance (traffic modeling):
• Product-Form Queuing Networks (pfqn, mpfqn)
• Series-Parallel Task Graphs (graph)
97
SHARPE MENU OF MODEL TYPES


Both Availability and Performance

Markov Chains (markov)

Semi-Markov Chains (semimark)

Reward Models

Generalized Stochastic Petri Nets (gspn)

Hierarchical & Hybrid Compositions of Above
Many solution algorithms for each model type; these algorithms
are continually improving
98
Architecture of SHARPE
Fault tree
Multistate fault tree
Reliability block diagram
Reliability graph
Phased-mission systems
Markov chain
Semi-Markov chain
GSPN
Stochastic reward net
MRGP
PFQN
MPFQN
Task Graph
Reliability/Availability
Performance
Performability
99
State Space Explosion

State space explosion can be handled in two ways:
Large model tolerance must apply to specification, storage
and solution of the model. If the storage and solution
problems can be solved, the specification problem can be
solved by using more concise (and smaller) model
specifications that can be automatically transformed into
Markov models (GSPN and SRN models).
 Large models can be avoided by using hierarchical model
composition.


Ability of SHARPE to combine results from different kinds of
models

Possibility to use state-space methods for those parts of a
system that require them, and use non-state-space methods
for the more “well-behaved” parts of the system.
100
Reliability models in practice
Fully symbolic CDF
Fully symbolic MTTF
Fully symbolic PQCDF
101
Availability models in practice
Expected interval availability
102
RBD example
103
Fault tree example
104
Performance models in practice
105
Markov chain model of a multiprocessor system
106
Markov reward model
107
GSPN model
108
GSPN model
109
Performability models in practice
110
Possible outputs






Availability, Unavailability and Downtime
Cost of downtime
Mean Time to System Failure, Mean Time to System Repair
Downtime breakdown into Hardware, Software & Upgrade
Breakdown of downtime by states for Markov chain models,
by blocks for Reliability block diagram models.
Sensitivity Analysis, Strategy to improve the availability of
the systems.
111
SHARPE - references

Performance and Reliability Analysis of Computer
Systems, Robin Sahner, Kishor Trivedi, A. Puliafito,
Kluwer Academic Publishers, 1996 (the "Red book")

Reliability and Performability Modeling using
SHARPE 2000, C. Hirel, R. Sahner, X. Zang, K.
Trivedi, in Computer Performance Evaluation:
Modelling Tools and Techniques; 11th International
Conference; TOOLS 2000, Schaumburg, IL, USA,
March 2000.
112
ADVANTAGES OF THE APPROACH

Pick a Natural Model Type for a Given Application
(No Retrofitting Required)

Use a Natural Model Type for a Portion of a Model
(Encourages Hybrid and Hierarchical Composition)
113
ADVANTAGES OF THE APPROACH

Except for gspn and srn Models, No Internal Conversion
Done
Appropriate Solution Algorithm for Each Model Type
i.e., Hierarchy for Solution as well as Specification

Pedagogic Advantages

Multi-Version Modeling

Step-Wise Refinement in Modeling
114