E E 681 - Module 2
Background on Reliability and
Availability
Slides prepared by Wayne D. Grover and Matthieu Clouqueur
TRLabs & University of Alberta
© Wayne D. Grover 2002, 2003
( Version for book website )
Overview of the lecture
• Concept of Reliability
– Reliability function, Failure density function, hazard rate
• Concept of Availability:
– Availability function, unavailability, availability of elements in
series/parallel
• Methodology for Availability Analysis
– Quick Unavailability Lower bound estimation
– Cut sets method
– Tie paths method
• Automatic Protection Switching (APS) Systems
– Principle
– Availability Analysis of an APS system
E E 681 Lecture #2
Technical meaning of Reliability
• In everyday English:
– “My car is very reliable” → it works well, it starts every time (even at -30°).
• Technical meaning:
– Reliability is the probability of a device performing its purpose
adequately for the period of time intended under the operating
conditions intended.
– Example:
• Reliability of a fuel-pump during a rocket launch
Reliability is a mission-oriented question
Reliability
• The reliability function R(t):
– R(t) = probability { no failure in interval [0,t] }
[Figure: R(t) plotted against t, starting at 1 and decreasing toward 0]
– $\frac{dR(t)}{dt} \le 0$ (R(t) is always a non-increasing function)
– $R(0) = 1$
– $R(\infty) = 0$
– Q(t) = probability { at least one failure in interval [0,t] }
– Q(t) = 1 - R(t)
Reliability
• R(t) = prob { no failure in [0,t] }
• Related function: failure density function, f(t)
$\int_{t_1}^{t_2} f(t)\,dt$ = probability that the first failure occurs in interval [t1,t2]
$Q(t) = \int_0^t f(u)\,du$
$R(t) = 1 - \int_0^t f(u)\,du$
f(t) can be seen as the pdf of time to next failure
Reliability
• Hazard rate λ(t) (age-specific failure rate): rate of failure of an element given that this element has survived this long
$\lambda(t) = \frac{1}{R(t)} \cdot \frac{dQ(t)}{dt} = -\frac{1}{R(t)} \cdot \frac{dR(t)}{dt}$
– The factor $\frac{1}{R(t)}$ expresses "given that the element has survived this long"
– The factor $\frac{dQ(t)}{dt}$ is the rate of failures
Integration:
$\int_0^t \lambda(u)\,du = -\ln R(t)$
$R(t) = e^{-\int_0^t \lambda(u)\,du}$
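The relation between the hazard rate and R(t) can be checked numerically. The sketch below is only an illustration and is not part of the original slides: the function name `reliability_from_hazard`, the trapezoidal integration with a fixed step count, and both example hazard rates are assumptions chosen for the demonstration.

```python
import math

def reliability_from_hazard(hazard, t, steps=10_000):
    """Approximate R(t) = exp(-integral_0^t hazard(u) du) with the trapezoidal rule."""
    if t == 0:
        return 1.0
    h = t / steps
    # Trapezoidal rule: half weight on the two end points, full weight inside.
    integral = 0.5 * (hazard(0.0) + hazard(t))
    integral += sum(hazard(i * h) for i in range(1, steps))
    integral *= h
    return math.exp(-integral)

def constant(u):
    """Hypothetical constant hazard rate: 0.2 failures per year."""
    return 0.2

def wear_out(u):
    """Hypothetical linearly increasing (wear-out) hazard rate."""
    return 0.1 * u

r_const = reliability_from_hazard(constant, 5.0)  # closed form: exp(-0.2 * 5)
r_wear = reliability_from_hazard(wear_out, 5.0)   # closed form: exp(-0.05 * 5**2)
print(r_const, math.exp(-1.0))
print(r_wear, math.exp(-1.25))
```

Both printed pairs should agree closely, because the trapezoidal rule is exact for constant and linear hazard rates.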
Reliability
• Expected Time To Failure or Mean Time To Failure (MTTF):
– It is the expected value of the time to failure T, the random variable with pdf f(t):
$MTTF = E\{T\} = \int_0^\infty t \cdot f(t)\,dt = -\int_0^\infty t \cdot \frac{d}{dt}R(t)\,dt$
[Figure: time axis from t = 0 to the first failure, marking the time to failure TTF1]
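Since f(t) = -dR(t)/dt, the MTTF integral can be simplified by integration by parts. The step below is a standard result added here for illustration, not something stated on the slide, and it assumes that $t \cdot R(t) \to 0$ as $t \to \infty$:

```latex
\begin{aligned}
MTTF &= -\int_0^\infty t \, \frac{dR(t)}{dt} \, dt
      = \Big[ -t \, R(t) \Big]_0^\infty + \int_0^\infty R(t) \, dt
      = \int_0^\infty R(t) \, dt
\end{aligned}
```

Under that assumption, the MTTF is simply the area under the reliability curve.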
Reliability
• Special case: constant hazard rate (memoryless)
$\lambda(t) = \lambda_0$
$R(t) = e^{-\lambda_0 t}$
$MTTF = \frac{1}{\lambda_0}$
– In this case we can apply the Poisson distribution:
$\text{prob}\{\, k \text{ failures in } [0,t] \,\} = \frac{(\lambda_0 t)^k}{k!} \, e^{-\lambda_0 t}$
Reliability
• Numerical example:
– Poisson distribution with λ0 = 1 / (5 years)
• Probability of exactly 1 failure in the first year: P = 16.4%
• Probability of at least one failure in the first year: P = 18.1%
• Probability of exactly 1 failure in the first 5 years: P = 36.8%
• Probability of at least one failure in the first 5 years: P = 63.2%
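The four percentages follow directly from the Poisson formula for a constant hazard rate. The short sketch below reproduces them; the function name `poisson_prob` is an assumption introduced for this illustration.

```python
import math

def poisson_prob(k, rate, t):
    """Probability of exactly k failures in [0, t] for a constant hazard rate."""
    mean = rate * t
    return mean ** k / math.factorial(k) * math.exp(-mean)

rate = 1 / 5  # failures per year, as in the numerical example

for years in (1, 5):
    exactly_one = poisson_prob(1, rate, years)
    # "At least one failure" is the complement of "zero failures".
    at_least_one = 1 - poisson_prob(0, rate, years)
    print(f"{years} year(s): exactly one = {exactly_one:.1%}, "
          f"at least one = {at_least_one:.1%}")
```

Note that "at least one failure" is computed as 1 - prob{0 failures}, which equals Q(t) = 1 - R(t).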
Availability (Repairable systems)
• “What is the probability that the engine of a Formula 1 car
will work during the whole race?”
– This is a reliability question
• “How often do I hear the dial tone when I pick up the
phone?”
– This is an availability question
Availability is the probability of finding the system in
the operating state at any arbitrary time in the future.
Unlike in the context of reliability, we now consider systems that
can be repaired.
Availability
• Comparison of Availability and Reliability Functions:
[Figure: A(t) and R(t) plotted against t from t = 0, with three regions marked; A(t) settles at the level labelled “A system”]
– Region 1: R(t) and A(t) are the same
– Region 2: Repair actions begin to hold up A
– Region 3: A reaches a steady state
Availability
[Figure: timeline of alternating Failure and Repair events along t, marking Time to Failure, Time to Repair, and Time Between Failures]
MTTF: Mean Time To Failure
MTBF: Mean Time Between Failures
MTTR: Mean Time To Repair
$A = \lim_{T_{obs} \to \infty} \left\{ \frac{\text{Uptime}}{T_{obs}} \right\} = \frac{MTTF}{MTTF + MTTR}$
Availability
• Unavailability:
$U = 1 - A$
$U = \frac{MTTR}{MTTF + MTTR} = \frac{MTTR}{MTBF}$
In availability analysis we usually work with unavailability quantities, because the unavailabilities of elements in series and in parallel can be combined with simple approximations.
FIT: unit corresponding to 1 failure in $10^9$ hours
1 FIT = 1 failure in 114,155 years
1 failure / year = 114,155 FITs (high!)
Typical value for telecom equipment: 1500 FITs
( MTTF = 76 years )
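The FIT conversions and the unavailability formula can be sketched as follows. The function names and the 4-hour repair time are assumptions made for this illustration; the slides do not give an MTTR value.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours

def fits_to_mttf_years(fits):
    """Convert a failure rate in FITs (failures per 10^9 hours) to MTTF in years."""
    mttf_hours = 1e9 / fits
    return mttf_hours / HOURS_PER_YEAR

def unavailability(mttf_hours, mttr_hours):
    """U = MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)

print(round(fits_to_mttf_years(1)))     # 1 FIT expressed as years per failure
print(round(fits_to_mttf_years(1500)))  # typical telecom equipment

# Hypothetical repair time of 4 hours (not given in the slides).
mttf_hours = 1e9 / 1500
u = unavailability(mttf_hours, mttr_hours=4.0)
print(u, 1 - u)
```

With these assumed numbers the unavailability is on the order of a few parts per million, which shows how strongly MTTR drives U when MTTF is large.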
Availability
• Series elements unavailability reduction:
[Figure: elements 1, 2, 3, ..., n connected in series]
$U_s \approx \sum_{i=1}^{n} U_i$ (approximation based on the fact that $U_i \ll 1$)
– Numerical example: $U_i = 10^{-3}$, n = 3 → $U_s = 3 \cdot 10^{-3}$
• Parallel elements unavailability reduction:
[Figure: elements 1, 2, ..., n connected in parallel]
$U_s = \prod_{i=1}^{n} U_i$
– Numerical example: $U_i = 10^{-3}$, n = 3 → $U_s = 10^{-9}$
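The two reductions can be sketched in a few lines. The function names are assumptions for this illustration, and the exact series expression is included only to show how small the approximation error is; it assumes independent element failures.

```python
import math

def series_unavailability(u_list):
    """Approximate series reduction: sum of the element unavailabilities."""
    return sum(u_list)

def series_unavailability_exact(u_list):
    """Exact series result for independent elements: 1 - product of availabilities."""
    return 1 - math.prod(1 - u for u in u_list)

def parallel_unavailability(u_list):
    """Parallel reduction for independent elements: product of unavailabilities."""
    return math.prod(u_list)

elements = [1e-3] * 3
print(series_unavailability(elements))        # about 3e-3
print(series_unavailability_exact(elements))  # slightly below 3e-3
print(parallel_unavailability(elements))      # about 1e-9
```

The summed value is always at least as large as the exact one, so the series approximation is slightly pessimistic.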
Availability Analysis
• The reliability engineer can use different techniques to
evaluate the availability of a system:
– 1) Quick estimate of a lower bound for the unavailability
– 2) Series and parallel unavailability reductions
– 3) Cut set method
– 4) Tie paths method
– 5) Conditional decomposition
• The general methodology is explained next…
Availability Analysis
• General Methodology:
1) Get the unavailability values of all components and sub-systems.
2) Draw the parallel and series availability relationships.
3) Reduce the system availability model by repeated application of the
parallel/series availability simplifications.
4) If the model is not completely reduced, apply a quick unavailability
lower-bound estimate, the tie paths method, the cut sets method, or
conditional decomposition.
Availability Analysis
• Lower bound on unavailability
– The contributions of parallel elements to the unavailability are not
taken into account
[Figure: example system built from elements A to H, in which only A and H are pure series elements]
Lower bound of Us: $U_A + U_H$
– In some cases this quick evaluation of a lower bound on U can be
enough to conclude that the system does not meet the availability
requirements
Availability Analysis
• Tie paths method:
– We enumerate all the paths from I to O
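[Figure: example network between input I and output O built from elements 1 to 5 with intermediate nodes A and B; the figure also labels the groups 6+7 and 8 || 9]
– The availability of each path is calculated:
$A_{path\,i} = \prod_{v \in path\,i} A_v$
– The availability of the system is:
$A_{syst} = \bigcup_{i \in tie\,paths} A_{path\,i}$ (the probability that at least one tie path is fully available)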
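The tie paths method can be illustrated on a small example. The sketch below does not use the network from the slide's figure: it assumes a hypothetical five-element bridge network between I and O, with identical element availabilities, and evaluates the union of the tie paths by enumerating every element state. All names and values are assumptions for illustration.

```python
from itertools import product

# Hypothetical five-element bridge network between I and O (not the slide's figure).
# Each tie path is the set of elements that must all work for that path to be up.
TIE_PATHS = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]
AVAILABILITY = {1: 0.999, 2: 0.999, 3: 0.999, 4: 0.999, 5: 0.999}

def path_availability(path, avail):
    """Availability of one tie path: product of its element availabilities."""
    result = 1.0
    for v in path:
        result *= avail[v]
    return result

def system_availability(tie_paths, avail):
    """Probability that at least one tie path is fully up.

    Enumerates every up/down state of the elements, which handles paths
    that share elements but is only practical for small systems.
    """
    elements = sorted(avail)
    total = 0.0
    for state in product([True, False], repeat=len(elements)):
        up = {e for e, is_up in zip(elements, state) if is_up}
        prob = 1.0
        for e, is_up in zip(elements, state):
            prob *= avail[e] if is_up else 1 - avail[e]
        if any(path <= up for path in tie_paths):
            total += prob
    return total

for path in TIE_PATHS:
    print(sorted(path), path_availability(path, AVAILABILITY))
a_syst = system_availability(TIE_PATHS, AVAILABILITY)
print(a_syst, 1 - a_syst)
```

State enumeration is used here because the tie paths share elements, so their availabilities cannot simply be combined as if the paths were independent.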
Availability Analysis
• Cut sets method:
– Which combinations of element failures can bring the system down?
[Figure: example network built from elements 1 to 5 with intermediate nodes A and B]
– The probability of each cut is calculated:
$P(\text{cut } i) = \prod_{v \in \text{cut } i} (1 - A_v)$
– The availability of the system is: $A_{syst} \approx 1 - \sum_{i \in \text{cut sets}} P(\text{cut } i)$
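The cut sets method can be sketched in the same spirit. The example below again assumes a hypothetical five-element bridge network rather than the network in the slide's figure; the cut sets, availabilities, and function names are all assumptions for illustration.

```python
import math

# Hypothetical five-element bridge network (not the slide's figure).
# Each minimal cut set is a group of elements whose joint failure brings the system down.
CUT_SETS = [{1, 2}, {4, 5}, {1, 3, 5}, {2, 3, 4}]
AVAILABILITY = {1: 0.999, 2: 0.999, 3: 0.999, 4: 0.999, 5: 0.999}

def cut_probability(cut, avail):
    """P(cut i): product of (1 - A_v) over the elements v in the cut."""
    return math.prod(1 - avail[v] for v in cut)

def system_availability(cut_sets, avail):
    """A_syst as 1 minus the sum of the cut probabilities.

    The sum slightly overestimates the unavailability because states in
    which several cuts occur together are counted more than once.
    """
    return 1 - sum(cut_probability(cut, avail) for cut in cut_sets)

u_syst = sum(cut_probability(cut, AVAILABILITY) for cut in CUT_SETS)
print(u_syst)                                       # about 2.002e-6
print(system_availability(CUT_SETS, AVAILABILITY))
```

The two-element cuts dominate the result; the three-element cuts contribute only about one part in a thousand of the total unavailability.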
Availability Analysis
• Conditional decomposition (High Unavailability Elements):
– When some elements have high U, it becomes less acceptable to
sum unavailabilities.
– Solution: Conditional decomposition:
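[Figure: system built from elements A1, A2, A3, A4 and a low-availability element Ad; it is decomposed into system 1 (element d failed) and system 2 (element d working), each built from A1 to A4, with availabilities A_syst 1 and A_syst 2]
– The availability of the system is: $A_{syst} = (1 - A_d) \cdot A_{syst\,1} + A_d \cdot A_{syst\,2}$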
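Conditional decomposition can be checked against brute-force state enumeration. The sketch below assumes a hypothetical bridge layout (A1-A2 in the upper branch, A3-A4 in the lower branch, element d linking the midpoints); the exact topology of the slide's figure is not recoverable here, and all names and values are assumptions for illustration.

```python
from itertools import product

def parallel(*avails):
    """Availability of independent elements in parallel."""
    unavailability = 1.0
    for a in avails:
        unavailability *= 1 - a
    return 1 - unavailability

def decomposed_availability(a1, a2, a3, a4, ad):
    """Condition on the state of element d in the assumed bridge layout."""
    a_syst_1 = parallel(a1 * a2, a3 * a4)           # d failed: two series branches in parallel
    a_syst_2 = parallel(a1, a3) * parallel(a2, a4)  # d working: two parallel pairs in series
    return (1 - ad) * a_syst_1 + ad * a_syst_2

def brute_force_availability(a1, a2, a3, a4, ad):
    """Reference value obtained by enumerating all 2^5 element states."""
    avails = (a1, a2, a3, a4, ad)
    total = 0.0
    for state in product([True, False], repeat=5):
        s1, s2, s3, s4, sd = state
        prob = 1.0
        for a, is_up in zip(avails, state):
            prob *= a if is_up else 1 - a
        system_up = ((s1 and s2) or (s3 and s4)
                     or (sd and s1 and s4) or (sd and s3 and s2))
        if system_up:
            total += prob
    return total

# Hypothetical values: d has a much lower availability than the other elements.
values = dict(a1=0.99, a2=0.99, a3=0.99, a4=0.99, ad=0.7)
a_decomposed = decomposed_availability(**values)
a_reference = brute_force_availability(**values)
print(a_decomposed, a_reference)
```

The two results should agree to within floating-point rounding, since the decomposition is exact for independent elements; no small-unavailability approximation is involved.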
Automatic Protection Switching (APS) Systems
• Basic idea:
– to provide a standby transmission channel that is kept in fully
operating condition and used to replace any of the other traffic
bearing channels in the event of their failure
• Characteristics of an APS system:
– spare to working ratio:
• ‘1-to-1’ or ‘1-to-N’
– co-routed / diversely routed:
• ‘1-to-1’ or ‘1-to-1 /DP’
• ‘1-to-N’ or ‘1-to-N /DP’
– 1+1 or 1:1:
• ‘1+1’: Signal is always sent on the spare channel as well as on the working channel
• ‘1:1’: Signal is sent on the spare channel only upon failure of the working channel
Automatic Protection Switching (APS) Systems
• 1:N APS system:
[Figure: 1:N APS system with N working channels (unavailability Uw each) and one spare channel (Us) between a Head End Bridge (Ub1, Ub2) and a Tail End Transfer (Ut1, Ut2)]
For Head End Bridge (HEB) and Tail End Transfer (TET):
– Mode 1 failure: working signal is not relayed (Ub1, Ut1)
– Mode 2 failure: no bridging or transfer to/from spare channel (Ub2, Ut2)
Automatic Protection Switching (APS) Systems
• Cut sets approach to 1:N APS availability analysis:
– Combinations creating outage for a specific channel (cut sets):
• Cut set 1: Failure of that channel with prior failure of at least one other working
channel
• Cut set 2: Failure of that working channel plus the spare channel or head end
bridge or tail end transfer in mode 2
• Cut set 3: Failure of head end bridge or tail end transfer in mode 1
– The probability of each cut set is:
• Cut set 1: $U_w \cdot (N-1) U_w \cdot 0.5$
• Cut set 2: $U_w \cdot (U_s + U_{b2} + U_{t2})$
• Cut set 3: $U_{b1} + U_{t1}$
Automatic Protection Switching (APS) Systems
• 1:N APS Unavailability
– The unavailability of a channel is:
$U_{channel} = \frac{N-1}{2} U_w^2 + U_w (U_s + U_{b2} + U_{t2}) + (U_{b1} + U_{t1})$, where the last term, $(U_{b1} + U_{t1})$, is O(U)
– The term in O(U) reflects the irreducible series-availability elements:
the HEB and the TET in their mode 1 failure.
• It is impossible to make a perfectly redundant system. There
is always some parallelism-accessing device c that brings a
series unavailability contribution.
[Figure: elements A and B in parallel, accessed at each end through a device c]
$U_A = U_B = 10^{-3}$, $U_C = 10^{-5}$
$U_S = U_A U_B + 2 U_C \approx 2 U_C$, which is $O(U_C)$
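The channel unavailability is the sum of the three cut set probabilities, which the sketch below evaluates. The function name and the numerical values passed for the 1:N system are assumptions chosen only to show the orders of magnitude; the values for the parallel pair A, B and the device c are those of the slide's example.

```python
def aps_channel_unavailability(n, u_w, u_s, u_b1, u_b2, u_t1, u_t2):
    """Unavailability of one working channel in a 1:N APS system (sum of cut sets)."""
    cut_1 = u_w * (n - 1) * u_w * 0.5  # another working channel failed first
    cut_2 = u_w * (u_s + u_b2 + u_t2)  # working channel plus spare, or HEB/TET mode 2
    cut_3 = u_b1 + u_t1                # HEB or TET mode 1 failure
    return cut_1 + cut_2 + cut_3

# Hypothetical values for a 1:7 system.
u = aps_channel_unavailability(n=7, u_w=1e-3, u_s=1e-3,
                               u_b1=1e-6, u_b2=1e-6, u_t1=1e-6, u_t2=1e-6)
print(u)

# Parallel pair A, B accessed through device c at both ends.
u_a = u_b = 1e-3
u_c = 1e-5
u_sys = u_a * u_b + 2 * u_c
print(u_sys, 2 * u_c)
```

In the second calculation the series contribution 2·U_C is twenty times larger than the parallel term U_A·U_B, which is why the access device dominates the result.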
Summary
• Reliability is a mission-oriented question for non-repairable
systems
• In telecom engineering we are interested in the availability
of the designed system
• There are several techniques that can be used for
availability analysis. The one we will use in the rest of the
course is the algebraic approach (equivalent to cut sets)
• APS is a protection scheme that enhances availability by
providing a spare channel for restoration of failed working
channels