5. Software Redundancy

Reliable System Design 2010 by: Amir M. Rahmani

Software

There are many kinds of software

System software

• Operating system (Windows, Linux, Solaris, etc.)
• Device driver (for printer, graphics card, etc.)
• Compiler (gcc)
• Library (DLLs)
• Distributed system (software shared memory, Napster, etc.)

User-level software

• E.g., simulator, word processor, spreadsheet, game, etc.

Software Faults/Errors

Operating system software (including device drivers)

• Deadlock (may be able to escape with Ctrl-Alt-Delete)
• Crash and reboot
• Incorrect I/O

User-level software

• Deadlock (can escape with Control-C)
• Incorrect algorithm
• Array bounds violation
• Memory leak (C, C++, but not Java)
  – Allocating memory, but not deallocating it
• Reference to a NULL pointer (C, C++, but not Java)
• Incorrect synchronization in multithreaded code
  – Allowing more than one thread in critical section at a time
  – Blocking when holding a lock
• Inability to handle unanticipated inputs
• Exception that triggers OS to kill process
  – Segmentation fault
  – Bus error

Specification vs. Implementation

• There are many, many techniques
• Problem is in two parts:
  1. Correct specification, erroneous implementation
  2. Erroneous specification, correct implementation
• Both parts concern us as software engineers
• Both parts need attention: why a system fails is not important to the users
• Different techniques for the two parts

Static (Pre-Release) Fault Detection in Software

• As with hardware, can try to find faults before shipping the product
  – Design reviews
  – Formal verification (analysis of design)
  – Testing (analysis of implementation)
• Can try to add in redundancy to mask potential faults
  – N-version programming
• Can try to proactively "scrub" the software to remove latent errors (due to aging) before failures occur
  – Software rejuvenation: stopping the running software occasionally, "cleaning" its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures), and restarting it. E.g., periodically reboot the system to flush out remaining latent problems due to aging

Static Fault Detection with Formal Verification

• Formal verification is a systematic, mathematical way to prove that a system (software or hardware) is correct or incorrect
• Correctness is based on a specification
• Examples of mathematical objects often used to model systems are:
  – finite state machines, Petri nets, timed automata, process algebra
• Two broad approaches for formal verification
  – Theorem proving
  – Model checking

Formal Verification: Theorem Proving

• Theorem proving consists of using a formal version of mathematical reasoning (logical inference) about the system
• Example: theorem proving software such as
  – a HOL (Higher Order Logic) theorem prover
  – ACL2 (A Computational Logic for Applicative Common Lisp)
• Develop logical/mathematical equations that describe:
  – System to be verified
  – Specification of correctness for the system
• In the rules of this logic/mathematics, prove that the system is equivalent to its specification
• Theorem proving is difficult for very large complex systems, but can work on small subsystems

Formal Verification: Model Checking

• Model checking consists of a systematically exhaustive exploration of the mathematical model (possible for finite models)
• Example: Describe system as finite state machine (FSM)
  – Develop logical/mathematical equations that describe required properties of the FSM
• Example properties:
  – Never ends up in state X
  – Can reach every desired state in FSM
• Software has been developed to perform model checking; logically, this is an exhaustive search
  – Example: Murφ (Murphi) model checker (from Stanford)
• Similar to theorem proving, model checking is difficult for large complicated systems
  – Algorithms tend to be exponential in number of states
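
The two example properties can be checked by exhaustively exploring a small FSM. The sketch below is only an illustration in Python: the state names and the TRANSITIONS table are made up for this example, and a real model checker uses far more elaborate state representations.

```python
from collections import deque

# Hypothetical FSM (assumption for illustration): each state maps to the
# states that can be entered in one step.
TRANSITIONS = {
    "idle": ["waiting"],
    "waiting": ["running", "idle"],
    "running": ["idle", "done"],
    "done": [],
    "error": ["idle"],  # no transition leads into this state
}


def reachable_states(transitions, start):
    """Exhaustively explore the FSM from `start` using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def never_reaches(transitions, start, bad_state):
    """Property: the system never ends up in `bad_state`."""
    return bad_state not in reachable_states(transitions, start)


def reaches_all(transitions, start, desired):
    """Property: every desired state can be reached from `start`."""
    return set(desired) <= reachable_states(transitions, start)
```

The search visits every reachable state once, which is why the approach only works for finite models and becomes expensive as the state space grows.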

Verification and Validation

• Validation: "Are we trying to make the right thing?", i.e., is the product specified to the user's actual needs?
• Verification: "Have we made what we were trying to make?", i.e., does the product conform to the specifications?
• The overall checking process is often referred to as V&V

Software tools for Static Analysis

• There are tools that can analyze software to determine if it has bugs
• In most cases the analysis is performed on some version of the source code; in other cases, on some form of the object code
• Can check to see if:
  – All code is reachable
  – Deadlock is possible
• Advantage of static analysis tools
  – Checks all possible control flow paths through the application, so it can detect any possible specified problem, even if it would only occur very rarely in practice
• Disadvantages
  – Must have access to entire code base, e.g., can't deal with dynamically loaded libraries
  – Difficult to assess probability of error occurring in practice

Dynamic Fault Detection in Software

• Must add code to check software as it is running
  – Unless you're willing to wait for it to crash
• Added code = redundancy!
• Most common form of error detection: assertions
  – E.g., assert (Grade >= 0 && Grade <= 20)
• Challenges
  – Knowing which invariants to check
  – Knowing when to check these invariants
  – Dealing with black box code (e.g., libraries)
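
As a rough illustration of the grade assertion, here is a small Python sketch. The function name record_grade and the dictionary layout are assumptions made for this example; only the 0 to 20 range comes from the slide.

```python
def record_grade(grades, student, grade):
    """Store a grade, checking the 0..20 invariant at run time."""
    # Check on entry: the input must be within the valid range.
    assert 0 <= grade <= 20, f"grade out of range: {grade}"
    grades[student] = grade
    # Check on exit: every stored grade still satisfies the invariant.
    assert all(0 <= g <= 20 for g in grades.values())
    return grades
```

Note that Python skips assert statements when run with the -O option, so checks that must stay active in production are usually written as explicit if/raise tests instead.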

Automatic Dynamic Fault Detection with Meta-Compilation

• Recent research from Berkeley explores how to have the compiler automatically integrate error checking into code
• User can specify general high-level invariants
• Compiler automatically integrates invariant checking into the code
• Example
  1. 99% of lock_acquire() calls have a corresponding lock_release(), so the other 1% are probably wrong
  2. if (ptr == NULL) {
         printf("%d", ptr->data);  // what's wrong here?
     }

Other Forms of Dynamic Fault Detection

• Java has automatic array bounds checking, and it won't let you write beyond the bounds of the array
• Operating system will not let an application process access memory that doesn't belong to it. This is what is happening when you see "segmentation fault"!
• FTP software uses a checksum to make sure that the data that was received is the same as the data that was sent
• Other examples?
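
The checksum idea can be sketched in a few lines of Python. This is not how any particular FTP implementation works; CRC-32 from the standard zlib module and the send/receive function names are assumptions chosen for illustration.

```python
import zlib


def send(payload: bytes):
    """Sender side: attach a CRC-32 checksum to the payload."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, checksum: int) -> bytes:
    """Receiver side: recompute the checksum and compare before accepting."""
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: data corrupted in transit")
    return payload
```

The checksum is redundant information: it adds nothing to the data itself and exists only so that corruption can be detected.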

Self-Checking Code

• Can we write software that checks that its output is correct?
  – Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.
  – Detects hardware faults (famous Pentium bug)
  – Detects software faults (assuming more complicated operation than just division, which is a single instruction)
• Key idea: checking a computation is always at least as easy as performing it (result from computational complexity theory)
• Other examples? Finding paper.
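
A minimal Python sketch of the division check follows. The divide parameter is an assumption added so that a faulty divider can be substituted, and the comparison uses a tolerance because floating-point results rarely multiply back exactly.

```python
import math


def checked_divide(a, b, divide=lambda x, y: x / y):
    """Divide a by b, then verify the result by multiplying back."""
    c = divide(a, b)
    # Floating-point results are compared with a tolerance rather than ==.
    if not math.isclose(b * c, a, rel_tol=1e-9, abs_tol=1e-12):
        raise ArithmeticError(f"self-check failed: {b} * {c} != {a}")
    return c
```

Passing a deliberately wrong divider, such as one that adds a small error to every quotient, makes the check raise ArithmeticError, which is how a Pentium-style fault would show up.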

Hardware for Software Fault-Tolerance

• Difficult for HW to know that SW is in error, because HW doesn't know what SW is trying to do
• Example
  – It's unlikely that a program really wants to divide by zero
  – Any others?
  – Watchdog timer
• Current work at Duke is exploring hardware support for detecting starvation

Software for Hardware Fault-Tolerance

• Many examples of using software to tolerate HW faults
• In fact, all schemes for tolerating software errors will detect hardware errors that manifest themselves in the same way (i.e., they have the same error model)
  – E.g., self-checking software will detect a hardware fault if it leads to an incorrect result
  – Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.

What is Software Fault Tolerance?

• The term "software fault-tolerance" can mean two things:
  1. "the tolerance of software faults", or
  2. "the tolerance of faults by the use of software"
• Definition 1 is more commonly used.
• The term "software redundancy" corresponds to definition 2.
• Remember: All software faults are design faults (specification and implementation mistakes)!

Cause-and-Effect Relationship


Software Redundancy

• Software redundancy techniques can be divided into two major classes:
• With diversity
  – Design or data diversity
  – Aim is to tolerate design faults
• Without diversity
  – Implements error detection, recovery, etc.
  – Aim is to handle errors of any origin (physical faults, design faults, operator faults)

Design Diversity

• Design diversity is used to tolerate design faults in hardware and software
• Two techniques for tolerating software design faults:
  – N-version programming
  – Recovery blocks

N-version programming

• Heterogeneous redundancy
  – TMR is homogeneous redundancy
  – Question: why would TMR not work here?
• Uses majority voting on results produced by N program versions
• Program versions are developed by different teams of programmers
• Assumes that programs fail independently
• Looks like masking hardware redundancy
• Uses Forward Error Recovery
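
Majority voting over N versions can be sketched as follows in Python. The three version functions are hypothetical stand-ins for independently developed programs, one of them with a seeded design fault, and the voter assumes results can be compared for exact equality.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A crashed version simply contributes no vote.
            pass
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


# Three hypothetical versions of "square a number".
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    # Seeded design fault: wrong answer for the input 3 only.
    return x * x + 1 if x == 3 else x * x
```

With the input 3, version_c is outvoted by the other two and the fault is masked without any rollback. Running three copies of the same version (TMR) would not help, because all copies would contain the same design fault.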


Achieving Version Independence-Diversity

• Different design teams for each version
• Diverse specifications
• Versions with differing capabilities
• Teams working on different modules are forbidden to directly communicate
• Diverse programming languages, development tools, compilers, hardware, operating systems, etc.
• Questions regarding ambiguities in specifications or any other issue have to be addressed to some central authority who makes any necessary corrections and updates all teams
• …

Causes of Version Correlation

• Common specifications: errors in specifications will propagate to software
• Inherent difficulty of problem: algorithms may be more difficult to implement for some inputs, causing faults triggered by the same inputs
• Common algorithms: the algorithm itself may contain instabilities in certain regions of the input space, so different versions have instabilities in the same region
• Cultural factors: programmers make similar mistakes in interpreting ambiguous specifications
• Common software and hardware platforms: if the same hardware, operating system, and compiler are used, their faults can trigger a correlated failure

N-version programming depends on

• Initial specification: The majority of software faults stem from inadequate specification? A specification error will manifest itself in all N versions of the implementation
• Independence of effort: Experiments produce conflicting results. Where part of a specification is complex, this leads to a lack of understanding of the requirements. If these requirements also refer to rarely occurring input data, common design errors may not be caught during system testing
• Adequate budget: The predominant cost is software. A 3-version system will triple the budget requirement and cause problems of maintenance. Would a more reliable system be produced if the resources potentially available for constructing an N-version system were instead used to produce a single version?

Evaluation of N-version programming

• Few experimental studies of effectiveness of N-version programming
• Published results only for work in universities
• Program: Anti-missile application
  – 27 versions produced by students at University of Virginia and University of California at Irvine. Published in 1985.
  – Some had no prior industrial experience while others had over ten years
  – All students were given the same specification
  – All versions were written in Pascal
  – 200 test cases to validate each program
  – 1 million test cases to test independence (simulation of production)
  – 93 correlated faults were identified by standard statistical hypothesis-testing methods
  – No correlation observed between quality of programs produced and experience of programmer

Recovery Block

• N versions, one running: if it fails, execution is switched to a backup
• Uses one primary software module and one or several secondary (back-up) software modules
• Assumes that program failures can be detected by acceptance tests
• Executes only the primary module under error-free conditions
• Looks like dynamic hardware redundancy


Recovery Block Mechanism

[Flowchart: Establish Recovery Point → Any Alternatives Left? → (Yes) Execute Next Alternative → Evaluate Acceptance Test → (Pass) Discard Recovery Point; (Fail) Restore Recovery Point, then back to Any Alternatives Left?; (No) Fail Recovery Block]

Recovery Block Format

• Acceptance test is provided to check if answers are reasonable
• Format:

ensure <acceptance test>
by <primary module>
else by <first alternative>
else by <second alternative>
...
else error
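
The ensure / by / else by / else error structure can be approximated in Python as below. This is a sketch under simplifying assumptions: the recovery point is a deep copy of the input state, and the function name recovery_block and the two sort modules are invented for the example.

```python
import copy


def recovery_block(state, acceptance_test, alternatives):
    """Try each alternative on a copy of `state` until one passes the test."""
    recovery_point = copy.deepcopy(state)        # establish recovery point
    for alternative in alternatives:             # primary first, then backups
        working = copy.deepcopy(recovery_point)  # restore recovery point
        try:
            result = alternative(working)
        except Exception:
            continue                             # a crash counts as a failure
        if acceptance_test(result):
            return result                        # pass: discard recovery point
    raise RuntimeError("recovery block failed: no alternatives left")


# Hypothetical modules: a faulty primary and a simple backup.
def primary_sort(xs):
    xs.sort()
    return xs[:-1]  # seeded design fault: drops the last element


def backup_sort(xs):
    return sorted(xs)
```

Only the primary runs when its result is accepted; the backup is executed only after the acceptance test rejects the primary's result and the state has been restored.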

Example: Solution to Differential Equation

ensure Rounding_err_has_acceptable_tolerance
by Explicit Kutta Method
else by Implicit Kutta Method
else error

• Explicit Kutta Method: fast but inaccurate when equations are stiff
• Implicit Kutta Method: more expensive but can deal with stiff equations
  – The above will cope with all equations
  – It will also potentially tolerate design errors in the Explicit Kutta Method if the acceptance test is flexible enough

Construction of Acceptance Tests

• An acceptance test is a software-implemented check designed to detect errors in the results produced by a primary or a secondary module
• The design of the acceptance test is crucial to the efficacy of the Recovery Block scheme
• Acceptance tests often rely on application-specific information
• All the previously discussed error detection techniques can be used to form the acceptance tests
• There is a trade-off between providing comprehensive acceptance tests and keeping overhead to a minimum, so that fault-free execution is not affected
• Note that the term used is acceptance, not correctness; this allows a component to provide a degraded service
• However, care must be taken, as a faulty acceptance test may lead to residual errors going undetected
• Success of the Recovery Block approach depends on failure independence of the different versions (modules) and on the quality of the acceptance test

Examples of how acceptance tests can be constructed

• Satisfaction of requirements
  – Inversion of mathematical functions; e.g., squaring the result of a square-root operation to see if it equals the original operand
  – Checking sort functions; result should have elements in descending order
• Reasonableness checks
  – Checking physical constraints; e.g., speed, pressure, etc.
  – Checking sequence of application states
• Structural checks
  – Structural checks are based on known properties of data structures: the number of elements in a list can be counted, or links and pointers can be verified
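
The three kinds of check might look like this in Python. The function names, the tolerance, and the 1000 km/h speed limit are assumptions picked for the example, not values from the slides.

```python
import math


def sqrt_is_acceptable(operand, result, rel_tol=1e-9):
    """Inversion check: squaring the result should give back the operand."""
    return result >= 0 and math.isclose(result * result, operand, rel_tol=rel_tol)


def sort_is_acceptable(original, result):
    """Sort check: same number of elements, in descending order."""
    same_length = len(result) == len(original)  # structural check
    descending = all(a >= b for a, b in zip(result, result[1:]))
    return same_length and descending


def speed_is_reasonable(speed_kmh, max_kmh=1000.0):
    """Reasonableness check against a physical constraint (assumed limit)."""
    return 0.0 <= speed_kmh <= max_kmh
```

None of these checks proves the result correct. The sort check, for instance, would accept a descending list of the right length that contains the wrong elements, which is the sense in which the test is about acceptance rather than correctness.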

Evaluation of Recovery Blocks

• Naval command and control system (8000 statements in the Coral language)
• 117 abnormal events
  – Correct recovery: 78%
  – Incorrect recovery, program failure: 3%
  – Incorrect recovery, no program failure: 15%
  – Unnecessary recovery: 3%
• Anderson, T., et al., "Software Fault Tolerance: An Evaluation," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec. 1985, pp. 1502-1510.

N-Version vs. Recovery Block

• N-version programming
  – Applied at the program level
  – Runs N programs at the same time
  – Looks like static hardware redundancy
  – Vote comparison (error masking)
  – Assumes that independence among program versions is achieved by random differences in programming style among programmers
• Recovery block
  – Applied at the module (subprogram) level
  – Runs only the primary module under error-free conditions
  – Looks like dynamic hardware redundancy
  – Error detection: acceptance test
  – Independence is achieved by intentionally designing the primary and secondary modules to be as different as possible (different algorithms)

Data Diversity

• This technique is cheaper to implement than the design diversity technique.
• Popular techniques based on the data diversity concept for fault tolerance in software are:
  – Retry blocks
  – N-copy programming

Retry Blocks

• A retry block is a modification of the recovery block structure that uses data diversity instead of design diversity (data and re-expressed data, like the complement of the data).
• Rather than the multiple alternate algorithms used in a recovery block, a retry block uses only one algorithm.
• A retry block's acceptance test has the same form and purpose as a recovery block's acceptance test.
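
A retry block can be sketched in Python as follows. The names retry_block, faulty_max, and rotate are invented for the example; the re-expression used here (rotating a list) is one possible choice that leaves the correct answer unchanged, not the only one.

```python
def retry_block(algorithm, acceptance_test, data, re_express, max_retries=3):
    """Run one algorithm; if the result is rejected, re-express the data and retry."""
    current = data
    for _ in range(max_retries + 1):
        try:
            result = algorithm(current)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # a crash counts as a failed acceptance test
        current = re_express(current)
    raise RuntimeError("retry block failed: all re-expressions rejected")


def faulty_max(xs):
    # Seeded design fault: the first element is ignored.
    return max(xs[1:])


def rotate(xs):
    # Re-expression: the maximum of a list does not depend on element order.
    return xs[1:] + xs[:1]
```

For the input [9, 2, 5] the first attempt returns 5 and is rejected; after one rotation the same faulty algorithm sees [2, 5, 9] and returns the correct value 9.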

N-Copy Programming

• N-copy programming is similar to N-version programming but uses data diversity instead of design diversity.
• N copies of a program execute in parallel, each on a set of data produced by re-expression.
• The system selects the output to be used by an enhanced voting scheme.
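
The following Python sketch runs one program on several re-expressions of the same data and votes on the results. It simplifies in two ways that should be read as assumptions: the copies run sequentially rather than in parallel, and a plain majority vote stands in for the enhanced voting scheme.

```python
from collections import Counter


def n_copy_execute(program, data, re_expressions):
    """Run one program on N re-expressed copies of the data and vote."""
    results = []
    for re_express in re_expressions:
        try:
            results.append(program(re_express(data)))
        except Exception:
            pass  # a crashed copy contributes no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(re_expressions) // 2:
            return value
    raise RuntimeError("no majority among copies")


def faulty_max(xs):
    # Hypothetical program with a seeded design fault: ignores the first element.
    return max(xs[1:])


# Hypothetical re-expressions that do not change the correct answer.
RE_EXPRESSIONS = [
    lambda xs: list(xs),          # original data
    lambda xs: xs[1:] + xs[:1],   # rotated
    lambda xs: xs[::-1],          # reversed
]
```

For the data [9, 2, 5], the copy that sees the original order returns 5, while the rotated and reversed copies both return 9, so the vote selects 9.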

Airbus A330

National origin: Multi-national
Manufacturer: Airbus
First flight: 2 November 1992
Status: In production, in service
Primary users: Cathay Pacific, Delta Air Lines, Qatar Airways, Emirates
Produced: 1993–present
Number built: 1,016 as of 10 October 2013
Unit cost: A330-300, €215 million (2011)

http://en.wikipedia.org/wiki/Airbus_A330 (3 Nov. 2013)

A340

National origin: Multi-national
Manufacturer: Airbus
First flight: 25 October 1991
Status: Out of production, in service
Primary users: Lufthansa, Iberia, South African Airways, Virgin Atlantic Airways
Produced: 1993–2011
Number built: 375
Unit cost: A340-600, US$275.4 million

Design Diversity in Airbus A330/A340

• Two types of computers
  – 3 primary computers
  – 2 secondary computers
• Each computer is internally duplicated and consists of two channels
  – Command channel
  – Monitor channel

Architecture for A330/A340

[Figure: flight control primary computers, flight control secondary computers, flight control data concentrators]

Design Diversity in Airbus A330/A340

• Implementation of primary computers
  – Supplier: Aerospatiale (HW & SW)
  – Hardware: Two Intel 80386 (one for each channel)
  – Software: assembler for command channel, PL/M for monitor channel
• Implementation of secondary computers
  – Supplier: Sextant Avionique (HW), Aerospatiale (SW)
  – Hardware: Two Intel 80186 (one for each channel)
  – Software: assembler for command channel, Pascal for monitor channel

Exception Handling

• An exception indicates that something happened during execution that needs attention
• Control is transferred to an exception handler: a routine which takes appropriate action
• Example: when executing y = a*b, if overflow => result incorrect => signal an exception
• Effective exception handling can make a significant improvement to system fault tolerance
• Over half of code lines in many programs are devoted to exception handling
• Exception handling is a Forward Error Recovery mechanism, as there is no roll back to a previous state; instead control is passed to the handler so that recovery procedures can be initiated
• However, the exception handling facility can be used to provide Backward Error Recovery

Example: Domain and Range Failure

• Exceptions can be used to deal with
  – domain or range failure
  – out-of-ordinary event (not failure) needing special attention
  – timing failure
• A domain failure happens when illegal input is used
  – Example: if X, Y are real numbers and X = √Y is attempted with Y = -1, a domain failure occurs
• A range failure occurs when a program produces an output or carries out an operation that is seen to be incorrect in some way. Examples include:
  – Encountering an end-of-file while reading data from a file
  – Producing a result that violates an acceptance test
  – Trying to print a line that is too long
  – Generating an arithmetic overflow or underflow
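
Both kinds of failure can be shown with Python's standard math module, which raises ValueError for the square root of a negative number and OverflowError when a result is too large to represent. The wrapper names safe_sqrt and safe_exp, and the recovery values they return, are assumptions made for this sketch.

```python
import math


def safe_sqrt(y):
    """Handle a domain failure: the square root of a negative real number."""
    try:
        return math.sqrt(y)
    except ValueError:
        # Forward recovery: report "no result" instead of rolling back.
        return None


def safe_exp(x):
    """Handle a range failure: arithmetic overflow in the result."""
    try:
        return math.exp(x)
    except OverflowError:
        # Forward recovery: substitute infinity for the unrepresentable value.
        return math.inf
```

In both handlers control moves forward to a recovery action; no earlier state is restored.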

Timing Failure

Timing checks:

• Timing checks are an effective software check for detecting errors, even when programs run in a dual-redundant execution mode, provided the specification of a component includes timing constraints.

Watchdog timer

• Is used to guard against program hang-ups.
• Also used in communications between CPU and main store.
• Also used in periodic "hello" exchanges (network surveillance) and in I/O operations.
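
A software watchdog can be sketched in Python as below. The class name WatchdogTimer, its methods, and the injectable clock parameter are assumptions for illustration; hardware watchdogs typically reset the processor rather than report a flag.

```python
import time


class WatchdogTimer:
    """Software watchdog: the monitored program must kick it before the timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the class can be tested without waiting
        self.last_kick = clock()

    def kick(self):
        """Called periodically by a healthy program to reset the timer."""
        self.last_kick = self.clock()

    def expired(self):
        """True if no kick arrived within the timeout (hang-up suspected)."""
        return self.clock() - self.last_kick > self.timeout_s
```

A supervising thread or process would poll expired() and restart or otherwise recover the monitored program when it returns True.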