3. Hardware Redundancy
Reliable System Design 2010
by: Amir M. Rahmani

Forms of Redundancy
• Hardware redundancy
  – add extra hardware for detection or tolerating faults
• Software redundancy
  – add extra software for detection and possibly tolerating faults
• Information redundancy
  – extra information, i.e. codes
• Time redundancy
  – extra time for performing tasks for fault tolerance

matlab1.ir

Types of Hardware Redundancy
Fault tolerance requires redundancy.
1- Static Redundancy (that is, Passive)
• uses fault masking to hide occurrence of fault
• does not require reconfiguration
• Example: TMR, voting
2- Dynamic Redundancy (that is, Active)
• uses comparison for detection and/or diagnosis
• requires reconfiguration
• removes faulty hardware from system
• Example: Stand-by system
3- Hybrid Redundancy
• combination of static & dynamic redundancy

1- Static Redundancy
A class of redundancy techniques that can tolerate faults without reconfiguration (failover).
Static redundancy can be divided into two major subclasses:
• Masking redundancy
• Active redundancy

Masking Redundancy
• Uses majority voting to mask faults
• Requires 2f + 1 modules to tolerate f faulty modules
• N-Modular Redundant system (NMR): N independent modules replicate the same function
  – parallelism
  – results are voted on
  – requirements: N >= 3
• TMR (Triple Modular Redundancy)

Triple Modular Redundancy (TMR)
• e.g. majority voting
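
The majority voting used by TMR can be sketched in software as follows. This is only an illustrative sketch, not code from the slides: the names majority_bit and tmr are hypothetical, and the replicated modules are assumed to return non-negative integers so that the vote can be taken bit by bit.

```python
def majority_bit(a: int, b: int, c: int) -> int:
    """Majority vote: three pairwise ANDs combined with OR.

    Works on single bits and, because the operators are bitwise,
    on whole integer words as well.
    """
    return (a & b) | (b & c) | (a & c)


def tmr(modules, x):
    """Run three replicated modules on the same input and vote on the results.

    `modules` is assumed to hold exactly three callables that implement
    the same function and return non-negative integers.
    """
    r0, r1, r2 = (m(x) for m in modules)
    return majority_bit(r0, r1, r2)


if __name__ == "__main__":
    good = lambda x: x + 1      # fault-free module
    faulty = lambda x: 0        # module with a stuck output
    # One faulty module out of three (f = 1, N = 2f + 1 = 3) is masked.
    print(tmr([good, good, faulty], 41))  # prints 42
```

With two faulty modules that agree on the same wrong value the vote would follow them, which is why masking f faults requires 2f + 1 modules.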
1-bit majority voter (3 AND gates ORed)

Triple Modular Redundancy (TMR)

Masking Redundancy: TMR with triple voting

Masking Redundancy: Multi-stage TMR

N-Modular Redundant system (NMR)

Active Redundancy
• Two or more units are active and produce replicated results simultaneously
• Relies on fail-stop units
• Fail-stop property: a unit produces correct results or no results at all
• Requires f + 1 modules to tolerate f faulty modules

Fail-stop Nodes
• Nodes 1 and 2 send their results individually to nodes 3 and 4
• All nodes are fail-stop: they send correct results or no results at all

2- Dynamic Redundancy
• Relies on error detection and reconfiguration
• Requires f + 1 modules to tolerate f faulty modules
• May require recovery of system or application state
• May require outage time

Example: Duplicate and Compare
• can only detect, but NOT diagnose
  – i.e. fault detection, no fault tolerance
  – may order shutdown
• comparator is a single point of failure
• simple implementation: 2-input XOR for single-bit compare

Example: Stand-by System
• only one module is driving outputs
• other modules are:
  – idle => hot spares
  – shut down => cold spares
• error detection (e.g. communications checksums and memory parity bits) => switch to a new module (hot or cold spares)

Types of Stand-by Systems
• Hot standby
• Warm standby
• Cold standby

Hot Stand-by
Characteristics
• Spare updated simultaneously with primary module
Advantages
+ Very short or no outage time
+ Does not require recovery of application
Drawbacks
- High failure rate (fault rate)
- High power consumption

Warm Stand-by
Characteristics
• Spare up and running
• Needs to recover application status
Advantages
+ Does not require simultaneous updating of spare and primary module
Drawbacks
- Requires recovery of application state
- High fault rate
- High power consumption

Cold Stand-by
Characteristics
• Spare powered down
Advantages
+ Low failure rate (fault rate)
+ Low power consumption
• Example: satellite application
Drawbacks
- Very long outage time
- Needs to boot kernel/operating system and recover application status

3- Hybrid Redundancy
N-Modular Redundancy with spares
– N active + S spare modules (off-line)
– Voting and comparison
– Replaces erroneous module from spare pool

N-Modular Redundancy with spares

Coding checks / Exception checks
Coding checks
• Error detection codes are formed by the addition of check bits to a data word.
• A cyclic redundancy code check was used in the disk store of ESS.
• A parity bit was used in the RAM.
Exception checks
• Hardware constraints: usually result from the inability of the hardware to provide the service needed by the software.
Examples:
• Improper address alignment
• Unequipped memory locations
• Unused op-code
• Stack overflow

Watchdog Timers
So far, we've figured out how to detect when something is wrong, but how do we detect when we're not doing anything at all?
• A watchdog timer monitors a module and triggers a recovery if the module doesn't do anything in a given amount of time
  – E.g., put a watchdog timer on a microprocessor bus
• Who watches the watchdog?
  – If we assume a single-fault scenario, then this usually isn't a problem
  – But what if the watchdog has a hard fault that causes it to never time out and trigger a recovery?
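
The watchdog behaviour described above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the mechanism from the slides: the Watchdog class, its kick and poll methods, and the on_timeout callback are all hypothetical names, and a real watchdog is usually a hardware timer rather than a polled object.

```python
import time


class Watchdog:
    """Software sketch of a watchdog timer (hypothetical, for illustration)."""

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        self.timeout_s = timeout_s      # allowed silence before recovery
        self.on_timeout = on_timeout    # recovery action to trigger
        self.clock = clock              # injectable clock, eases testing
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored module to show it is still doing work."""
        self.last_kick = self.clock()

    def poll(self):
        """Trigger recovery if the module has been silent for too long.

        Returns True when recovery was triggered, False otherwise.
        """
        if self.clock() - self.last_kick >= self.timeout_s:
            self.on_timeout()
            self.last_kick = self.clock()  # re-arm after recovery
            return True
        return False
```

The monitored module calls kick() whenever it makes progress, and something else must call poll() periodically. If that poller itself stops running, no recovery is ever triggered, which is the "who watches the watchdog?" problem in miniature.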