SENG 521
Software Reliability &
Testing
Fault Tolerant Software Systems:
Techniques (Part 4a)
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far
([email protected])
http://www.enel.ucalgary.ca/~far/Lectures/SENG521/04a/
SENG521 (Fall 2002)
What Is Fault Tolerance



A fault-tolerant computing system must be capable
of providing specified services in the presence of a
bounded number of failures.
These failures could occur because of faults present
in either the components of the system or in the
system’s design.
Building large computing systems is a complex
task; fault-tolerance requirements could make the
task even more difficult unless appropriate system
structuring concepts are utilized.
Problems …



The traditional approaches to fault tolerance in
hardware systems have been based on coping with
the effects of well-understood failure modes of
physical components.
Conventional hardware fault tolerance methods are
rarely powerful enough to cope with deficiencies of
design.
Consequently, most hardware fault tolerance
techniques cannot be applied in software, where
almost all faults are design faults.
History …

Defensive programming:


Relatively ad hoc methods are used to minimize
the damage which could arise from the presence
of residual bugs.
Dual software technique:

Implementing two distinct versions of the same
software and executing them. Any discrepancy in
the outputs of the two versions may trigger an
alarm.
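
The dual software idea can be sketched as follows. This is a minimal illustration, not a historical implementation: the names run_dual, square_v1, square_v2 and the on_alarm callback are hypothetical, and square_v2 is made deliberately faulty for negative inputs.

```python
def run_dual(version_a, version_b, x, on_alarm):
    """Execute two distinct versions on the same input and compare outputs."""
    out_a = version_a(x)
    out_b = version_b(x)
    if out_a != out_b:
        # Any discrepancy triggers the alarm; it does not say which version is wrong.
        on_alarm(x, out_a, out_b)
    return out_a, out_b


def square_v1(n):
    return n * n


def square_v2(n):
    # Hypothetical second version with a design fault for negative n.
    return n * abs(n)


alarms = []
run_dual(square_v1, square_v2, 3, lambda *info: alarms.append(info))
run_dual(square_v1, square_v2, -3, lambda *info: alarms.append(info))
```

With only two versions the discrepancy can be detected but not resolved, which is why the technique raises an alarm instead of choosing a result.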
Fault Tolerance Phases /1

Phase 1: Error detection


For a fault to be tolerated, it must first be detected. Thus
the starting point for fault-tolerance techniques is
observing failures.
Phase 2: Damage assessment


It is necessary to assess the extent to which the system
state has been damaged or corrupted.
If the delay between the manifestation of a fault and
the detection of the resulting error (the latency interval)
is large, then it is likely that the damage to the system
state will be more severe than if the interval were shorter.
Fault Tolerance Phases /2

Phase 3: Error recovery


Error recovery techniques must be utilized in order to
obtain a normal, error-free system state.
There are two different kinds of recovery technique.


Backward recovery technique consists of discarding the current
(corrupted) state in favor of an earlier state. Therefore,
mechanisms are needed to record and store system states.
Forward recovery technique involves making use of the current
(corrupted) state to construct an error-free state.
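
The difference between the two recovery techniques can be illustrated with a small sketch. The state layout (a dict with "items" and a derived "total") and the function names are assumptions made for this example only.

```python
import copy


def backward_recover(checkpoint):
    """Discard the corrupted state and return a copy of the earlier recorded one."""
    return copy.deepcopy(checkpoint)


def forward_recover(state):
    """Build an error-free state from the current one by recomputing the derived field."""
    repaired = dict(state)
    repaired["total"] = sum(repaired["items"])
    return repaired


state = {"items": [1, 2, 3], "total": 6}
checkpoint = copy.deepcopy(state)   # state recorded before the risky operation

state["items"].append(4)
state["total"] = -1                 # simulated corruption of the derived field

rolled_back = backward_recover(checkpoint)   # the appended item is lost
repaired = forward_recover(state)            # the appended item is kept
```

Backward recovery needs no knowledge of the damage but loses the work done since the checkpoint; forward recovery keeps the work but must know what to repair.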
Fault Tolerance Phases /3

Phase 4: Fault treatment & continued service



Once recovery has been undertaken, it is essential to
ensure that the normal operation of the system will
continue without the fault immediately manifesting itself
once more.
The first aspect of fault treatment is to attempt to locate
the fault.
Following this, steps can be taken either to repair the
fault or to reconfigure the rest of the system to avoid the
fault.
Recovery Block Mechanism

Syntax of a recovery block construct:
ensure <acceptance test> by P0 else-by P1 else fail



It depicts a software system with 3 components, the two
procedures P0 (the primary) and P1 (the alternative), and the
acceptance test.
The design of the system is the control structure implied by
the syntax.
If the acceptance test is assumed to be perfect (i.e., it detects
all violations of the specification), then the alternative P1 lets
the recovery block tolerate all the faults of procedure P0 that
could lead to its failure.
Example

Fault tolerance phases:




Error detection: acceptance test (a Boolean expression) is
used.
Damage assessment: only the program in execution is
assumed to be affected.
Error recovery: (backward in this case) consists of
recovering the state of the executing program to that at
the beginning of the recovery block.
Fault treatment: the program in execution (primary or
alternative) is assumed to be faulty, so its faults are
avoided by executing the next alternative (if any).
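
The four phases above can be sketched as a small recovery block. This is an illustrative approximation, not the original language construct: the function recovery_block, the RecoveryBlockFailure exception, and the example procedures p0, p1 and is_ascending are all hypothetical names.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Corresponds to 'else fail': every alternative was rejected."""


def recovery_block(state, acceptance_test, alternatives):
    checkpoint = copy.deepcopy(state)            # recovery point at block entry
    for alternative in alternatives:
        candidate = copy.deepcopy(checkpoint)    # backward recovery: start from the saved state
        try:
            alternative(candidate)
        except Exception:
            continue                             # an exception also counts as a detected error
        if acceptance_test(candidate):           # error detection
            return candidate
        # Fault treatment: this alternative is assumed faulty, so try the next one.
    raise RecoveryBlockFailure("all alternatives rejected")


def p0(s):
    s["data"].sort(reverse=True)                 # faulty primary: sorts in the wrong order


def p1(s):
    s["data"] = sorted(s["data"])                # alternative


def is_ascending(s):
    d = s["data"]
    return all(a <= b for a, b in zip(d, d[1:]))


result = recovery_block({"data": [3, 1, 2]}, is_ascending, [p0, p1])
```

Each alternative works on its own copy of the checkpoint, so damage done by a rejected alternative never reaches the next one.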
Design Technique /1


Robust Software Systems (Anderson and Lee 1981,
etc.):
Construction of a robust module requires:



Exception handlers for coping with exceptions
propagated from lower levels; and
Boolean expressions for detecting exceptions arising in
the module itself, and their exception handlers.
It is often possible (and desirable for the sake of
simplicity) to map several exceptions onto a single
handler.
Design Technique /2



Assuming the use of data
abstractions (abstract data types)
in program development.
The software system is
structured into a hierarchy of
modules represented by an
acyclic graph.
Modules are represented by nodes. An arrow from node A to node B
means that A has one or more operations whose successful completion
depends on the successful completion of some operation provided by B;
in other words, B provides certain services to A.
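
One way to check that a module hierarchy is acyclic is to attempt a topological sort. The sketch below assumes a three-module chain (A uses B, B uses F) chosen for illustration; it relies on the standard library module graphlib (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# module -> set of modules whose services it uses (hypothetical hierarchy)
uses = {"A": {"B"}, "B": {"F"}, "F": set()}


def build_order(uses):
    """Return the modules lowest level first; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(uses).static_order())


order = build_order(uses)


def is_acyclic(uses):
    try:
        build_order(uses)
        return True
    except CycleError:
        return False
```

The resulting order lists each service provider before the modules that depend on it.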
Design Technique /3
A normal chain of events
consists of some procedure
of ‘A’ making a call on ‘B’,
and ‘B’ calling a lower-level
module (say ‘F’); this call
returns normally, and
subsequently A’s call returns
normally.
Design Technique /4
Exception cases:
1. A call from ‘B’ to a lower level module returns an
exception and this is passed to ‘A’
2. A call from ‘B’ to a lower level module returns an
exception but ‘B’ has exception handlers that can
handle this and provides a normal service to ‘A’
3. A Boolean expression in ‘B’, inserted specifically for
detecting an error (exception), evaluates to false. This
is handled in one of two ways:
   - The exception is masked, in which case ‘B’ will return normally to ‘A’
   - An exceptional return is obtained by ‘A’
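
The three exception cases can be sketched with plain Python exceptions. All names here (FError, BError, f, b_propagate, b_mask, b_check, a) are hypothetical, and re-raising F's exception as B's own exception in case 1 is a design choice of this sketch, not something the slides prescribe.

```python
class FError(Exception):
    """Exception signalled by the lower-level module F."""


class BError(Exception):
    """Exception signalled by module B to its caller A."""


def f(x):
    if x == 0:
        raise FError("division by zero")
    return 100 // x


def b_propagate(x):
    # Case 1: the exception from F is passed on to A.
    try:
        return f(x)
    except FError as exc:
        raise BError("B could not complete") from exc


def b_mask(x):
    # Case 2: B has a handler and still provides a normal service to A.
    try:
        return f(x)
    except FError:
        return 0


def b_check(x, mask):
    # Case 3: a run-time check inserted in B itself evaluates to false.
    result = f(x)
    if not (result >= 0):
        if mask:
            return 0                       # masked: normal return to A
        raise BError("negative result")    # exceptional return to A
    return result


def a(call_b, *args):
    """Module A: report whether the call on B returned normally or exceptionally."""
    try:
        return ("normal", call_b(*args))
    except BError:
        return ("exceptional", None)
```

In every case A sees only a normal return or an exception declared by B, never an exception of F directly.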
Notation


A procedure P, besides the normal return, also provides an exceptional return
E:
procedure P(--) signals E
The invoker of P can define the exceptional continuation to be some operation
H which is called the handler of E:
P(--) [E ⇒ H]

In P the following constructs can be inserted:
[T ⇒ .. ; {signal E}]      (1)
O [L ⇒ .. ; {signal E}]    (2)

(1) represents the case where an exception is detected by a run-time test T.
(2) represents the case when invocation of an operation 'O' results in an
exceptional return L which in turn could lead to the signaling of exception E.
When an exception is signaled using construct (1) or (2), the control passes to
the handler of that exception (H).
Example: Expected Events





Design of a procedure P which adds three positive integers.
The procedure uses the operation '+', which may signal the overflow exception 'OV'.
procedure P (var i,j,k:integer) signals OW;
begin
i:=i+j [OV ⇒ signal OW];
i:=i+k [OV ⇒ i:=i-j; signal OW];
end;
An important aspect of exception handling is the clean-up operation: before
signaling OW, the second handler restores i to its original value (i:=i-j).
If all the procedures of a module follow this strategy, we get a module
with the following highly desirable property:
Either the module produces results that reflect the desired normal service
to the caller, or no results are produced and an exceptional return is
obtained by the caller.
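
The procedure P above can be approximated in Python as follows. Several things are assumptions of this sketch: Python integers do not overflow, so a bound MAX_INT is imposed by a hypothetical add function; the var parameters are modelled by a dict v; and OV and OW become exception classes.

```python
MAX_INT = 2**31 - 1   # assumed bound, chosen only for illustration


class OV(Exception):
    """Overflow signalled by the '+' operation."""


class OW(Exception):
    """Exception signalled by P to its caller."""


def add(a, b):
    """Bounded addition for positive integers; signals OV on overflow."""
    if a + b > MAX_INT:
        raise OV()
    return a + b


def p(v):
    """Add v['j'] and v['k'] to v['i'], or leave v unchanged and signal OW."""
    try:
        v["i"] = add(v["i"], v["j"])
    except OV:
        raise OW() from None       # nothing was changed yet, nothing to undo
    try:
        v["i"] = add(v["i"], v["k"])
    except OV:
        v["i"] -= v["j"]           # clean-up: restore i before signalling
        raise OW() from None


ok = {"i": 1, "j": 2, "k": 3}
p(ok)

bad = {"i": 1, "j": 2, "k": MAX_INT}
try:
    p(bad)
    outcome = "normal"
except OW:
    outcome = "OW"
```

After the failed call, bad still holds its original values, which is the all-or-nothing property described above.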
Unexpected Events /1
1. The execution of P does not terminate.
2. A lower level exception is detected for which there is no
exception handler in P.
3. The execution of P terminates normally (the invoker
obtains a normal return) but the results produced by P are
not in accordance with the specification.

Situations (1) and (2) will eventually cause a failure of the
module; situation (3) represents the case where the module
has failed but this event has not yet been detected by the
system.
Unexpected Events /2

To cope with such cases, we can employ a default exception
handler:
procedure P (--) signals E;
begin
… … …
end[⇒ "default handler"];

The control goes to this handler during the execution of P
whenever an exception is detected for which there is no
handler.
Unexpected Events /3



Case (1): It is possible to start a ‘timer’ concurrently
with the invocation of P; the ‘time out’ exception
will then be handled by the default handler.
Case (2): All the lower level exceptions with no
programmed handlers will similarly be handled by
the default handler.
Case (3): Make use of run time checks to detect
possible violations of specifications to minimize the
danger of undetected failures.
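
Case (1) can be sketched with a worker thread and a timeout. The names call_with_timer and Fail are hypothetical. Note one limitation of this sketch: Python cannot forcibly stop the worker thread, so the timed-out procedure keeps running in the background until it finishes on its own.

```python
import concurrent.futures
import time


class Fail(Exception):
    """Hypothetical 'fail' exception signalled by the default handler."""


def call_with_timer(procedure, timeout_s, *args):
    """Run procedure(*args); turn a time out into a Fail exception."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(procedure, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The 'time out' exception is handled like any other unexpected event.
        raise Fail("time out") from None
    finally:
        pool.shutdown(wait=False)   # do not block the caller on the stuck worker


quick = call_with_timer(lambda: 42, 1.0)

try:
    call_with_timer(time.sleep, 0.05, 0.5)   # takes longer than the timer allows
    slow = "normal"
except Fail:
    slow = "fail"
```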
Unexpected Events /4

What strategy should be adopted by the
default handler?


The simplest thing to do is to undo any side-effects produced by the procedure and to signal
a fail exception.
When the invoker receives a fail exception, it
means that the called module has failed to
provide the specified service.
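
This strategy can be sketched as a decorator that acts as the default handler. The names with_default_handler, Fail and transfer are hypothetical, and the sketch assumes the procedure's side-effects are confined to a dict passed as its first argument.

```python
import copy


class Fail(Exception):
    """The called module failed to provide the specified service."""


def with_default_handler(procedure):
    def wrapped(state, *args):
        snapshot = copy.deepcopy(state)
        try:
            return procedure(state, *args)
        except Exception as exc:
            # Default handler: undo side-effects, then signal fail to the invoker.
            state.clear()
            state.update(snapshot)
            raise Fail(procedure.__name__) from exc
    return wrapped


@with_default_handler
def transfer(accounts, src, dst, amount):
    accounts[src] -= amount
    accounts[dst] += amount    # KeyError for an unknown dst: no programmed handler


accounts = {"a": 10, "b": 0}
transfer(accounts, "a", "b", 4)

try:
    transfer(accounts, "a", "zz", 1)   # fails after the first side-effect
    outcome = "normal"
except Fail:
    outcome = "fail"
```

The second call debits account "a" before failing, and the default handler restores the balance before signalling fail.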
Design Guidelines



For a given module, carefully analyze the cases that
could prevent the module from providing the
desired normal services.
Make use of exception handlers either to mask the
effects of such undesired, but expected, exceptions
or to signal an appropriate exception to the caller of
the module.
Make use of default exception handlers or recovery
blocks to obtain a measure of tolerance against
design faults.
Discussion




The capability of tolerating design faults rests largely on the
‘coverage’ of run-time checks (i.e. acceptance tests) for
detecting errors.
Often, it is not possible to check completely within a
procedure that the results produced conform to
the specification (e.g., for a routine that sorts its input, the
check that the output has been sorted would be as complex
as the routine itself).
Hence run-time checks are often limited to checking certain
critical aspects of the specification.
This means that the possibility of undetected failures cannot
be ruled out entirely.
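
The effect of limited coverage can be seen in a small sketch for the sorting example. The function names and the deliberately faulty buggy_sort are hypothetical. The cheap check tests only one critical aspect (ordering), while the fuller check also compares the contents of input and output.

```python
from collections import Counter


def is_ordered(out):
    """Partial check: the output is in ascending order."""
    return all(a <= b for a, b in zip(out, out[1:]))


def is_permutation(inp, out):
    """Additional check: the output holds exactly the input's elements."""
    return Counter(inp) == Counter(out)


def buggy_sort(xs):
    # Hypothetical faulty routine: silently drops duplicate elements.
    return sorted(set(xs))


inp = [3, 1, 3, 2]
out = buggy_sort(inp)

partial_ok = is_ordered(out)                              # the partial check passes
full_ok = is_ordered(out) and is_permutation(inp, out)    # the fuller check fails
```

Here the partial check accepts a wrong result, which is exactly the kind of undetected failure the slide warns about.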