3. Hardware Redundancy
Reliable System Design 2010
by: Amir M. Rahmani
Forms of Redundancy
• Hardware redundancy
– add extra hardware for detection or tolerating faults
• Software redundancy
– add extra software for detection and possibly tolerating faults
• Information redundancy
– extra information, i.e. codes
• Time redundancy
– extra time for performing tasks for fault tolerance
matlab1.ir
Types of Hardware Redundancy
Fault Tolerance requires Redundancy
1- Static Redundancy (i.e. passive)
• uses fault masking to hide the occurrence of a fault
• does not require reconfiguration
• Example: TMR, Voting
2- Dynamic Redundancy (i.e. active)
• uses comparison for detection and/or diagnosis
• requires reconfiguration
• removes faulty hardware from the system
• Example: Stand-by system
3- Hybrid Redundancy
• combination of static & dynamic redundancy
1- Static Redundancy
A class of redundancy techniques that can tolerate faults without reconfiguration (failover).
Static redundancy can be divided into two major subclasses:
• Masking redundancy
• Active redundancy
Masking Redundancy
• Uses majority voting to mask faults
• Requires 2f + 1 modules to tolerate f faulty modules
• N-Modular Redundant system (NMR): N independent modules replicate the same function
– modules operate in parallel
– results are voted on
– requirement: N >= 3
• TMR (Triple Modular Redundancy): the N = 3 case
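The 2f + 1 rule above can be illustrated with a small software model of an NMR voter. This is only a sketch: the names `nmr_vote` and `max_tolerated_faults` are hypothetical, and module outputs are assumed to be plain hashable Python values rather than hardware signals.

```python
from collections import Counter


def nmr_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority, which is what
    happens once more than f = (N - 1) // 2 modules are faulty.
    """
    n = len(outputs)
    if n < 3:
        raise ValueError("masking redundancy needs at least 3 modules")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= n // 2:
        raise ValueError("no majority: too many faulty modules")
    return value


def max_tolerated_faults(n):
    """Largest f such that 2f + 1 <= n."""
    return (n - 1) // 2
```

With three modules, `nmr_vote([7, 7, 9])` masks the single wrong output and returns 7; with five modules, two wrong outputs can be masked.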
Triple Modular Redundancy (TMR)
e.g. Majority voting.
1-bit majority voter (3 AND gates ORed)
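The "3 AND gates ORed" voter corresponds to the Boolean expression AB + BC + AC. A minimal sketch in Python follows; the function name `majority` is an assumption, and the inputs are assumed to be non-negative integers so that the same expression also votes bit by bit on whole words.

```python
def majority(a, b, c):
    """Majority of three inputs: three 2-input ANDs feeding one OR.

    For 0/1 inputs this is the 1-bit voter. For wider integers the
    bitwise operators apply the same voter to every bit position.
    """
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    # Print the truth table of the 1-bit voter.
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c, "->", majority(a, b, c))
```

The output is 1 exactly when at least two of the three inputs are 1, so a single faulty module cannot change the voted result.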
Triple Modular Redundancy (TMR)
Masking Redundancy
TMR with triple voting
Masking Redundancy
Multi-stage TMR
N-Modular Redundant system (NMR)
Active Redundancy
• Two or more units are active and produce replicated results simultaneously
• Relies on fail-stop units
• Fail-stop property: a unit produces correct results or no results at all
• Requires f + 1 modules to tolerate f faulty modules
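Why f + 1 units are enough here, while masking needs 2f + 1, can be sketched as follows. The function `first_available` is hypothetical, and a stopped unit is modelled as producing `None`; under the fail-stop assumption any result that does arrive is correct, so no voting is needed.

```python
def first_available(results):
    """Pick the result of the first unit that produced one.

    Each entry is one unit's output, or None if that unit has stopped.
    Because fail-stop units never emit wrong values, any non-None entry
    can be used directly.
    """
    for result in results:
        if result is not None:
            return result
    raise RuntimeError("all units have stopped")
```

With two units, `first_available([None, 42])` still delivers 42 after one unit has stopped. The scheme breaks down if a unit violates the fail-stop property and emits a wrong value.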
Fail-stop Nodes
Nodes 1 and 2 send their results individually to nodes 3 and 4.
All nodes are fail-stop: they send correct results or no results at all.
2- Dynamic Redundancy
• Relies on error detection and reconfiguration
• Requires f + 1 modules to tolerate f faulty modules
• May require recovery of system or application state
• May require outage time
Example: Duplicate and Compare
• Can only detect, but NOT diagnose
– i.e. fault detection, no fault tolerance
• May order shutdown
• Comparator is a single point of failure
– simple implementation: 2-input XOR for single-bit compare
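The XOR comparator can be sketched in software as below. The names `mismatch_bits` and `duplicate_and_compare` are hypothetical, modules are modelled as Python callables returning non-negative integers, and "order shutdown" is modelled as raising an exception.

```python
def mismatch_bits(out_a, out_b):
    """Bitwise XOR of the two outputs: non-zero means they disagree."""
    return out_a ^ out_b


def duplicate_and_compare(module_a, module_b, x):
    """Run both copies on the same input and compare the results.

    A mismatch is detected, but the comparator cannot tell which copy
    is the faulty one, so the only safe reaction is to stop.
    """
    a, b = module_a(x), module_b(x)
    if mismatch_bits(a, b) != 0:
        raise RuntimeError("outputs differ: fault detected, shutting down")
    return a
```

If the comparator itself fails (for example by always reporting "equal"), faults go unnoticed, which is why it is a single point of failure.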
Example: Stand-by System
• E.g. communications checksums and memory parity bits
• Only one module is driving outputs
• Other modules are:
– idle => hot spares
– shut down => cold spares
• Error detection => switch to a new module (hot or cold spare)
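The detect-then-switch behaviour can be sketched as a small Python class. Everything here is an assumption made for illustration: the class name `StandbySystem`, modules modelled as callables, and error detection modelled as the active module raising an exception.

```python
class StandbySystem:
    """One active module drives the output; spares wait in a pool."""

    def __init__(self, primary, spares):
        self.active = primary
        self.spares = list(spares)
        self.switchovers = 0

    def run(self, x):
        """Return the active module's output, switching on a detected error."""
        while True:
            try:
                return self.active(x)
            except Exception:
                # Error detected: reconfigure by promoting the next spare.
                if not self.spares:
                    raise RuntimeError("no spares left")
                self.active = self.spares.pop(0)
                self.switchovers += 1
```

The sketch ignores outage time and state recovery, which are exactly what distinguish hot, warm and cold spares.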
Types of Stand-by Systems
• Hot standby
• Warm standby
• Cold standby
Hot Stand-by
Characteristics
• Spare updated simultaneously with primary module
+ Advantages
+ Very short or no outage time
+ Does not require recovery of application state
- Drawbacks
- High failure rate (fault rate)
- High power consumption
Warm Stand-by
Characteristics
• Spare up and running
• Needs to recover application state
+ Advantages
+ Does not require simultaneous updating of spare and primary module
- Drawbacks
- Requires recovery of application state
- High fault rate
- High power consumption
Cold Stand-by
Characteristics
• Spare powered down
+ Advantages
+ Low failure rate (fault rate)
+ Low power consumption
• Example: satellite applications
- Drawbacks
- Very long outage time
- Needs to boot kernel/operating system and recover application state
3- Hybrid Redundancy
N-Modular Redundancy with spares
• N active + S spare modules (off-line)
• Voting and comparison
• Replaces erroneous module from spare pool
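One round of NMR with spares can be sketched as follows: vote on the active outputs, compare each output with the voted value, and swap any disagreeing module for a spare. The function name `hybrid_step` and the list-based module pools are assumptions made for this illustration.

```python
from collections import Counter


def hybrid_step(active, spares, x):
    """Run one voting round and reconfigure if a module disagrees.

    active: list of callables (the N active modules), updated in place
    spares: list of callables (the spare pool), consumed in place
    Returns the majority-voted output for input x.
    """
    outputs = [module(x) for module in active]
    voted, count = Counter(outputs).most_common(1)[0]
    if count <= len(active) // 2:
        raise RuntimeError("no majority among active modules")
    for i, out in enumerate(outputs):
        # Comparison step: a module that disagrees with the vote is
        # treated as erroneous and replaced while spares remain.
        if out != voted and spares:
            active[i] = spares.pop(0)
    return voted
```

The fault is masked by the vote in the same round in which it is detected, and the replacement restores the full N-module margin for later rounds.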
N-Modular Redundancy with spares
Coding checks / Exception checks
Coding checks
• Error detection codes are formed by the addition of check bits to a data word.
• A cyclic redundancy code check was used in the disk store of ESS.
• A parity bit was used in the RAM.
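As a small illustration of adding a check bit to a data word, here is an even-parity sketch. The function names are hypothetical, data words are assumed to be non-negative integers, and this is not a description of how ESS implemented its parity or CRC checks.

```python
def even_parity_bit(word):
    """Check bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def encode(word):
    """Append the parity bit as the least significant bit."""
    return (word << 1) | even_parity_bit(word)


def check(codeword):
    """True when the codeword still has an even number of 1 bits."""
    return bin(codeword).count("1") % 2 == 0
```

A single flipped bit makes `check` return False. Two flipped bits cancel out and go undetected, which is one reason stronger codes such as CRCs are used where more protection is needed.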
Exception checks
• Hardware constraints: usually result from the inability of the hardware to provide the service needed by the software.
• Examples
– Improper address alignment
– Unequipped memory locations
– Unused op-code
– Stack overflow
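Some of these exception checks can be sketched in software as below. The machine parameters (`WORD_SIZE`, `EQUIPPED`, `VALID_OPCODES`) and all names are invented for the illustration; real hardware performs these checks in its address and instruction decoding logic.

```python
class HardwareException(Exception):
    """Raised when the (simulated) hardware cannot honour a request."""


WORD_SIZE = 4                       # hypothetical word size in bytes
EQUIPPED = range(0x0000, 0x1000)    # hypothetical installed memory
VALID_OPCODES = {0x01, 0x02, 0x03}  # hypothetical instruction set


def check_word_access(addr):
    """Reject misaligned or unequipped word addresses."""
    if addr % WORD_SIZE != 0:
        raise HardwareException("improper address alignment")
    if addr not in EQUIPPED:
        raise HardwareException("unequipped memory location")


def check_opcode(opcode):
    """Reject op-codes that the instruction set does not define."""
    if opcode not in VALID_OPCODES:
        raise HardwareException("unused op-code")
```

Such checks do not say what went wrong upstream; they only signal that the software asked for something the hardware cannot provide.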
Watchdog Timers
• So far, we’ve figured out how to detect when something is wrong … but how do we detect when we’re not doing anything at all?
• Watchdog timer monitors a module and triggers a recovery if the module doesn’t do anything in a given amount of time
– E.g., put a watchdog timer on a microprocessor bus
• Who watches the watchdog?
– If we assume a single-fault scenario, then this usually isn’t a problem
– But what if the watchdog has a hard fault that causes it to never time out and trigger a recovery?
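The watchdog mechanism can be sketched as a countdown that the monitored module must keep resetting. The class `WatchdogTimer`, its tick-based notion of time and the `on_timeout` callback are assumptions chosen to keep the example deterministic; a hardware watchdog counts clock cycles instead.

```python
class WatchdogTimer:
    """Counts down once per tick; the monitored module must kick it."""

    def __init__(self, timeout_ticks, on_timeout):
        self.timeout_ticks = timeout_ticks
        self.on_timeout = on_timeout
        self.remaining = timeout_ticks

    def kick(self):
        """Called by the monitored module to show that it is alive."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance time by one tick and trigger recovery on expiry."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_timeout()
            self.remaining = self.timeout_ticks
```

If `tick` is never called, or `on_timeout` never fires, a hung module goes unnoticed, which is the "who watches the watchdog?" problem raised above.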