Transcript Design and Test Technology for Automotive Electronic Systems
New Approaches to Fault-Tolerant Systems Design
Andreas Steininger
Vienna University of Technology
My contact data
Andreas Steininger
Vienna University of Technology
Faculty of Informatics
Institute of Computer Engineering
Embedded Computing Systems Group
Treitlstrasse 3
A-1040 Vienna
Austria
http://ti.tuwien.ac.at/ecs A. Steininger page 2
Main Contributors to this Material
Dr. Thomas Kottke
R. Bosch AG / EADS
Dr. Peter Tummeltshammer
R. Bosch AG / Thales
Dr. Christoph Scherrer
Alcatel / Thales
Dr. Eric Armengaud
DecomSys / VirtualVehicle
Dr. Karl Thaller
DecomSys / Elektrobit Austria
Dr. Martin Horauer
UAT Technikum Wien
Paul Milbredt
AUDI AG
Outline
• Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
– Basic architecture
– Temporal diversity
– Treatment of common cause faults
– Switching performance mode / safety mode
– Fault-tolerance validation by fault injection
Faults, Errors and Failures
[Figure: fault -> computer -> error (value 1 instead of 0) -> failure]
Error Detection
Fault detection:
usually too difficult (too many possibilities)
Error Detection
Failure detection:
too late: want to prevent failure!
Error Detection
To decide that „1“ is wrong we need a reference.
Where to get this reference from?
Option 1: Perform the same computation a second time (hopefully the fault is gone by then …)
Time redundancy
Option 2: Use a second computer in parallel (hopefully this one works well…)
Space redundancy
Option 3: Add additional information (hopefully not affected as well…)
Information redundancy
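The three options can be illustrated with a small sketch. Everything here is an assumption made for illustration: `compute` is a hypothetical stand-in for the protected computation, `faulty_compute` simulates a second unit hit by a fault, and a single even-parity bit stands in for "additional information".

```python
def compute(x):
    """Hypothetical stand-in for the protected computation."""
    return (3 * x + 1) & 0xFF


def faulty_compute(x):
    """Same computation with a simulated fault: bit 0 of the result is flipped."""
    return compute(x) ^ 0x01


def time_redundant(fn, x):
    # Option 1: run the same computation twice and compare the two results.
    first, second = fn(x), fn(x)
    return first, first == second


def space_redundant(fn_a, fn_b, x):
    # Option 2: two units compute in parallel; a mismatch signals an error.
    a, b = fn_a(x), fn_b(x)
    return a, a == b


def parity(word):
    # Even parity bit over the bits of an integer word.
    return bin(word).count("1") % 2


def encode(word):
    # Option 3: attach redundant information (here one parity bit) to the data.
    return word, parity(word)


def check(word, parity_bit):
    # The stored parity bit serves as the reference for the data word.
    return parity(word) == parity_bit
```

For example, `space_redundant(compute, faulty_compute, 5)` reports a mismatch, and `check` rejects a word whose single bit was flipped after `encode`. Note that time redundancy as sketched only helps against faults that disappear between the two runs.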
Achieving Fault Tolerance
Fail safe:
system can be safely stopped when error is detected
example: train
[Figure: one computer with error detection (ED)]
Fail operational:
system must keep on working when error is detected
example: autopilot in airplane
[Figure: two computers with error detection (ED)]
Electronics in Cars – some Facts
high proportion of value:
up to 30%
high development potential:
more than 80% of the innovations
high number of Electronic Control Units (ECUs)
up to 70
complex distributed system
different networks & topologies
Electronics in Cars - Benefits
• cheap alternative to existing mechanical solutions
– lighter, smaller, cheaper, more flexible, …
• enabler for further optimizations
– electronic ignition, motor management, …
• key to new functionality
– safety: ESP, active suspension, crash sensing, …
– comfort: air conditioning, infotainment, …
– security: immobilizer, alarm, electronic key, GPS tracking, …
– autonomy: anticipatory braking, lane keeping, …
Key Demands
– Safety
– Real-Time
– Low Cost
– Robustness
– Testability
Key Demands
Safety
– high risk potential (energy!)
– high public awareness
– no safe state (in general)
– certification required (EN 61508, ISO 26262)
– high complexity of system & application
– legal issues (liability)
Key Demands
Real-Time
– engine: 6000 rpm = 1 revolution per 10 ms
– VDM: 100 km/h = 28 cm per 10 ms
– need to synchronize distributed activities
– real-time communication
– image processing tasks
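The two timing figures on this slide follow from simple unit conversions. A minimal sketch of the arithmetic (the function names are made up for this example):

```python
def revolution_period_ms(rpm):
    # One revolution takes 60 s / rpm; expressed in milliseconds.
    return 60_000.0 / rpm


def distance_cm(speed_kmh, interval_ms):
    # km/h -> m/s is a division by 3.6.
    speed_m_per_s = speed_kmh / 3.6
    # (m/s) * ms gives millimetres; divide by 10 to get centimetres.
    return speed_m_per_s * interval_ms / 10.0


print(revolution_period_ms(6000))       # engine at 6000 rpm
print(round(distance_cm(100, 10), 1))   # vehicle at 100 km/h over 10 ms
```

At 6000 rpm one revolution takes 10 ms, and at 100 km/h the vehicle covers roughly 28 cm in the same 10 ms.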
Key Demands
Low Cost
– extreme competition
– high cost inhibits introduction
– tailored safety concepts: minimum degree of replication, use structural redundancies
– generic solutions: scalable, configurable, flexible
– marginal costs beat NRE
Current Status
• fail safe functions realized:
– shut off upon error
– mechanical fall-back system assumes control
• no true “by wire” functions
– single-channel solutions sufficient
• tolerance against random faults
– avoid design faults by field experience => no diversity
– avoid common cause faults by design (?)
• single fault assumption
– keep faults rare (shielding, etc.)
A Fault Tolerant Node
• mission: make a node (processor) fault tolerant
• need to consider CPU and memory
• aim is “fail safe” (but keep option for fail op in mind)
– simplex unit with error detection capabilities
– duplication and comparison
– hybrid approach
Options for the CPU Core
Three options: single core + ED, dual core + cmp, superscalar proc. + cmp + ED
Single core + ED: modify custom CPU core
– parity for buses
– two-rail coding for signals
– self-checking implementation of simple units
– duplicate & compare for complex units
– careful layout
Options for the CPU Core
Dual core + cmp: duplicate custom CPU core
– master/checker operation
– shared (safe) memory
– validity check for inputs
– self-checking comparator checks equality of outputs
– option: clock delay
– option: mode switch
Solution Example “Dual Core Frame”
benefits
– can use custom core without modifications
– safety analysis valid for other cores as well
– promises high ED coverage with moderate efforts
– CPU is hard to protect otherwise
crucial points
– enable easy recovery (=> keep outage short)
– eliminate single points of failure
– detect common cause faults
Protection in the Dual Core Frame
[Figure: Core 1 (Master) and Core 2 (Checker) with signals Instr., Instr. Addr., Data in, Data out and Data Addr.; comparators (=?) on the core outputs drive Error_Sig]
Potential for Common Cause Faults
– identical input data
– identical clock (lock step)
– shared clock generator
– shared power supply
– both processors on same die (physical proximity; thermal & mechanical coupling)
Temporal Diversity
• operate checker with a delay against master
– same fault hits at different point of computation
– therefore different effect => detect by comparison
– different critical paths emerge
• store master output for comparison
• choose delay of 1 / 1.5 / 2 clock cycles
– larger delay causes high effort for little gain (=> experiments)
– error detection latency is equal to the delay
– need to delay memory write and outputs by this amount
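The effect of the delay can be illustrated with a toy cycle-level model. This is only a sketch under simplifying assumptions, not the actual hardware: `step_output` is a hypothetical fault-free core output, the delay is a whole number of cycles (the slide also mentions 1.5), and a common cause fault is modelled as the same bit flip hitting both cores in the same clock cycle.

```python
from collections import deque


def step_output(i):
    """Hypothetical fault-free core output at computation step i."""
    return (7 * i + 3) & 0xFF


def run_dual_core(n_steps, delay, fault_cycle=None, flip=0x01):
    """Return the cycle of the first comparator mismatch, or None.

    The checker runs `delay` cycles behind the master. Master outputs are
    stored in a FIFO until the checker reaches the same computation step.
    """
    fifo = deque()
    for cycle in range(n_steps + delay):
        # The fault hits both cores in the same clock cycle.
        hit = flip if cycle == fault_cycle else 0
        if cycle < n_steps:
            fifo.append(step_output(cycle) ^ hit)        # master, step `cycle`
        if cycle >= delay:
            checker = step_output(cycle - delay) ^ hit   # checker, step `cycle - delay`
            if fifo.popleft() != checker:
                return cycle
    return None
```

With `delay=0` (pure lock step) the symmetric flip corrupts both outputs identically and the comparator stays silent; with `delay=1` the same fault hits the two cores at different computation steps and is flagged.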
Temporal Diversity: Implementation
[Figure: Core #1 (Master) and Core #2 (Checker) connected to Instr. Mem and Data Mem; delay elements (ΔT) in the checker path; comparators (=?) on Instr. Addr., Data Addr. and Data out drive the Error signal]
Fail Safe Dual Core Frame – Summary
– safe memories for instructions and data
– comparison of all core outputs
– parity protection for buses (data, address)
– dual rail coding for single signals (int, rst, err)
– totally self-checking comparators
– temporal diversity
How safe is the proposed solution?
Assessment of the Solution’s Quality
How to measure quality? (Aim is fail safe)
– error detection coverage => detect all errors
– error detection latency => detect them quickly
Which method to choose?
– theoretical analysis / modelling
– experimental fault injection
– field observation
Fault Injection Experiment
– 2 SPEAR cores in fail safe frame (= DUT)
– synthesized to EDIF netlist
– exhaustive list of stuck-at-1 and stuck-at-0 faults, injected one by one into netlist
– download to FPGA, application run
– “golden device” as reference (= REF)
– upon mismatch (DUT ≠ REF) => check comparator
Results of FI Experiment
         detected                                                                not detected            overall
         no effect   before      during      during      after       after       no effect   with effect
                     effect      effect, RD  effect, WR  effect, RD  effect, WR
master   204         19047       0           559         31455       0           4269        0           55534
slave    51170       98          0           0           0           0           4276        0           55544
frame    3517        734         0           921         87          0           1073        0           6332
overall  54891       19879       0           1480        31542       0           9618        0           117410
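The table can be cross-checked with a few lines of code. The sketch below copies each row as eight outcome counts followed by the row total, in the order given on the slide. Treating the last two counts as the "not detected" classes, and defining the detected share as detected faults over all injected faults, are assumptions of this example.

```python
# Eight outcome counts per row, then the row total, in slide order.
# Assumption: positions 6 and 7 are "not detected, no effect" and
# "not detected, with effect".
RESULTS = {
    "master": [204, 19047, 0, 559, 31455, 0, 4269, 0, 55534],
    "slave": [51170, 98, 0, 0, 0, 0, 4276, 0, 55544],
    "frame": [3517, 734, 0, 921, 87, 0, 1073, 0, 6332],
    "overall": [54891, 19879, 0, 1480, 31542, 0, 9618, 0, 117410],
}


def row_is_consistent(row):
    # The outcome counts must add up to the stated total.
    return sum(row[:-1]) == row[-1]


def undetected_with_effect(row):
    # Faults that changed the behaviour but escaped detection.
    return row[7]


def detected_share(row):
    # Share of injected faults that were flagged by the error detection.
    return sum(row[:6]) / row[-1]


for name, row in RESULTS.items():
    print(name, row_is_consistent(row), undetected_with_effect(row),
          round(detected_share(row), 3))
```

All four rows add up to their totals, and no row contains a fault that had an effect without being detected.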
Enabling fast Recovery
error signal (dual rail)
– notifies external component / memory
– turns any further WR into RD (error confinement)
– triggers processor interrupt
status register (memory mapped)
– updated by HW
– indicates source of error (data parity, address mismatch, …)
recovery
– can build on uncorrupted status
– can benefit from detailed status information
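A behavioural sketch of the confinement idea: once the error signal is raised, the memory treats every write as a read, so stored data stays uncorrupted, and a status register records the error source. The class, its method names and the bit positions are hypothetical; the slides do not define the register layout.

```python
class ConfinedMemory:
    """Toy memory model with error confinement (illustrative only)."""

    # Hypothetical status bit positions.
    DATA_PARITY = 0
    ADDRESS_MISMATCH = 1

    def __init__(self, size):
        self.cells = [0] * size
        self.error = False
        self.status = 0  # stands in for the memory mapped status register

    def raise_error(self, source_bit):
        # Hardware sets the error signal and records the error source.
        self.error = True
        self.status |= 1 << source_bit

    def write(self, addr, value):
        if self.error:
            # Error confinement: the write is turned into a read.
            return self.cells[addr]
        self.cells[addr] = value
        return value

    def read(self, addr):
        return self.cells[addr]


mem = ConfinedMemory(4)
mem.write(1, 42)
mem.raise_error(ConfinedMemory.ADDRESS_MISMATCH)
mem.write(1, 99)      # suppressed, cell 1 keeps its old value
print(mem.read(1), bin(mem.status))
```

Recovery software can then read the status register to find out what went wrong and rely on the memory contents written before the error.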
Why is fast Recovery important?
application specific fault-tolerance time
– application can “survive” without computer even in fail-operational case
– typ. some 10 ms for car (recall: 100 km/h = 28 cm/10 ms)
meaning of fast recovery
– if failed computer recovers within FT time, no need for hot standby => COST!
re-booting after failure is
– pragmatic
– safe
– expensive!
Fail Safe Dual Core – Summary 1
duplicate & compare
– generic approach, applicable to any core type
– covers all (local) errors
– need to carefully eliminate single points of failure
– need to complement with protection for signals & buses
temporal diversity
– mitigates (many) common cause failures
– requires output delay to ensure error confinement
Possible Sources of CCFs
Design & process
– design fault or (latent) process deficiency
Thermal coupling
– hot spot affects both replicas in the same way
Mechanical defect
– affects both replicas symmetrically
Electrical coupling
– wire bound (shared lines: VDD, reset, clock)
– wireless (EMI)
Why use Single Die then?
cheaper and faster
– use two instances of same design
– fast & comprehensive comparison
CCFs on single die
– intuitively higher threat
– quantification of threat?
– mitigation techniques?
The Actual Problem with CCFs
One fault event
– affects both replicas AND
– is not detected by comparator, i.e. leads to “symmetric” fault effect AND
– produces an erroneous output, i.e. does not crash the cores
Possible Countermeasures for CCFs
Design & process
– diversity, burn-in, …
Thermal coupling, mechanical defect (affects both replicas symmetrically)
– asymmetric propagation paths
Electrical coupling, wireless (EMI)
– asymmetric antennas (?)
Propagation Speed Comparison
Thermal & mechanical propagation are relatively slow
10000s of clock cycles within 1 ms
Experimental Assessment
Evaluation experiments
1) single corresponding points with offset Δt
2) multiple corresponding points with offset Δt
3) single non-corresponding points, no offset
[Figure: fault injection points in Core 1 (Master) and Core 2 (Checker); compare unit; golden node checks for erroneous write access]
Symmetry Requirements for CCF
– even a small offset …
– fault multiplicity …
– asymmetry of impact …
… improve detection coverage
[Figure: per-unit labels with values as shown on the slide: ExVecTab (8202), DEC (152), RF (7028), PSW (308), ALU (2472), P2 (158), PC+P1 (182)]
Squeezing out more Efficiency
dual core is expensive
– normally yields performance improvement
– would be welcome here as well: increasing performance demand @ limited clock rates
– but: exclusively dedicated to safety here
observation: not all tasks are safety critical
enable flexible switching between “safety mode” and “performance mode”
Operation in Performance Mode
– cores execute different instruction streams in parallel
– both cores have direct access to memory / peripherals
– instruction caches introduced to minimize penalties from conflicting access
– temporal diversity disabled
– comparator disabled
Requirements on the Mode Switching
coherent operation in safety mode
– internal states of cores must be aligned before switching to safety mode (register file, cache)
safe operation in safety mode
– switching must not introduce safety leakage
– no corruption of safety-relevant data in perform. mode
low performance penalty for mode switching
– slow or complicated switching would spoil the anticipated performance gain
Implementation of the Split Core Frame
[Figure: block diagram with two blocks labelled Instruction RAM / Control, an instruction memory and a data memory]
Mode Switch: Safety => Performance
load ID reg address:
  LDL r1, 248
  LDH r1, 255
mode switch instr (mode switching):
  => core1 wait => core2 wait => clk align => switch mode
load & check ID bit => cond branch core2:
  LDW r2, r1
  BTEST r2, 1
  JMPI_CT
Mode Switch: Performance => Safety
core1 encounters mode switch instr
=> trigger MSU (core1 signal)
=> halt core1 (wait1)
=> interrupt core2 (message2)
core2 encounters interrupt
=> save context
=> jump to mode switch instr
core2 executes mode switch
=> halt core2 & switch clock
=> resume core1
=> resume core2 after delay
Fault Injection in Safety Mode
         detected                                        not detected            overall
         no effect   before      within      later       no effect   with effect
                     effect      1.5 cy
master   1029        5026        50956       0           7055        0           64066
slave    56962       0           0           0           7102        0           64064
frame    5334        1324        569         0           4275        0           11502
overall  63325       6350        51525       0           18432       0           139632
Delayed WR still ensures error confinement
Fault Injection in Performance Mode
fault injected in performance mode, then switch to safety mode
detection in: perf mode (early / late / stuck), safety mode (≤1.5cy / >1.5cy), never
effect in: perf only, both modes, safety only, none
perf mode: 1149 - - 1473 423 - - 25617 - -
safety mode: 34583 0 9654 47715 0 0
never: 458 0 0 18560
We still need a “Safe Memory”
detect bit flips in storage cells
– parity (or EDC/ECC)
– Why not duplicate & compare?
protect interfaces
– parity for data, address and control buses
detect erroneous address decoding
– special decoder logic design
prevent illegal WR access
– provide mask input for write enable
Possible Address Decoder Errors
correct behavior:
– any given address activates exactly one assigned memory cell
erroneous behaviors:
– an address activates no memory cell at all
– an address activates more than one memory cell
– an address activates a wrong memory cell
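The three erroneous behaviours can be told apart by looking at the decoder's select lines. The following is a behavioural sketch only, assuming a 3-bit address and one select line per cell; it is not the checker circuit from the slides, and both function names are made up for this example.

```python
def ideal_decoder(address, n_bits=3):
    # Fault-free decoder: exactly one select line is active.
    return [1 if i == address else 0 for i in range(2 ** n_bits)]


def classify_decoder_output(address, select_lines):
    """Classify the select lines observed for a given address."""
    active = [i for i, bit in enumerate(select_lines) if bit]
    if not active:
        return "no cell activated"
    if len(active) > 1:
        return "multiple cells activated"
    if active[0] != address:
        return "wrong cell activated"
    return "ok"


lines = ideal_decoder(5)
print(classify_decoder_output(5, lines))   # fault-free case
lines[2] = 1                               # simulate a second, spurious activation
print(classify_decoder_output(5, lines))
```

A hardware checker cannot simply count active lines like this, which is why a dedicated decoder design with built-in checks is needed.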
Checking the Address Decoder
– check for missing or multiple cell activations: XOR(upper half) vs. XOR(lower half)?
– re-check parity behind cell array: OR over even cells vs. parity?
– large decoders built from cascade of smaller ones
[Figure: address decoder for A2, A1, A0 and address parity AP; AND gates drive the memory cell array; XOR stages feed dual-rail checkers]
Summary
the automotive domain has its own laws and rules
– need “extremely cost-effective robust solutions for safety critical real time applications, versatile and custom tailored”
on node level
– different redundancy concepts applicable
– example: dual core CPU and memory with protection mech’s
– on-line testing for memory may be required
on system level
– crucial role of communication infrastructure
– advantages of time triggered approach
– insufficient suitability of structural testing
Hungry for more?
http://ti.tuwien.ac.at/ecs
Related Projects
STEACS (Systematic Test of Embedded Automotive Communication Systems)
http://embsys.technikum-wien.at/projects/steacs/index.html
EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing)
http://ti.tuwien.ac.at/ecs/research/projects/extract
DARTS (Distributed Algorithms for Robust Tick Synchronization)
http://ti.tuwien.ac.at/ecs/research/projects/DARTS