Design and Test Technology for Automotive Electronic Systems


New Approaches to Fault-Tolerant Systems Design

Andreas Steininger

Vienna University of Technology

My contact data

Andreas Steininger

Vienna University of Technology
Faculty of Informatics
Institute of Computer Engineering
Embedded Computing Systems Group
Treitlstrasse 3
A-1040 Vienna, Austria

http://ti.tuwien.ac.at/ecs A. Steininger page 2

Main Contributors to this Material

• Dr. Thomas Kottke: R. Bosch AG / EADS
• Dr. Peter Tummeltshammer: R. Bosch AG / Thales
• Dr. Christoph Scherrer: Alcatel / Thales
• Dr. Eric Armengaud: DecomSys / VirtualVehicle
• Dr. Karl Thaller: DecomSys / Elektrobit Austria
• Dr. Martin Horauer: UAT Technikum Wien
• Paul Milbredt: AUDI AG

Outline

• Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
  – Basic architecture
  – Temporal diversity
  – Treatment of common cause faults
  – Switching performance mode / safety mode
  – Fault-tolerance validation by fault injection

Faults, Errors and Failures

[Figure: fault → computer → error (1 instead of 0) → failure]

Error Detection

Fault detection: usually too difficult (too many possibilities)

Error Detection

Failure detection: too late, we want to prevent the failure!

Error Detection

To decide that “1” is wrong we need a reference.

Where to get this reference from?

Option 1: Perform the same computation a second time (hopefully the fault is gone by then…)

Time redundancy
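A minimal sketch of time redundancy, assuming a computation that can simply be repeated. The names `run_twice_and_compare`, `MismatchError` and `make_flaky_adder` are hypothetical and only serve the illustration; the fault model (a single flipped bit on selected calls) is an assumption as well.

```python
from typing import Callable, Set


class MismatchError(Exception):
    """Raised when two runs of the same computation disagree."""


def run_twice_and_compare(compute: Callable[[], int]) -> int:
    """Time redundancy: run the computation twice and use the second run as reference."""
    first = compute()
    second = compute()
    if first != second:
        raise MismatchError(f"results differ: {first} vs {second}")
    return first


def make_flaky_adder(a: int, b: int, faulty_calls: Set[int]) -> Callable[[], int]:
    """Hypothetical computation a + b whose lowest result bit flips on the given call numbers."""
    calls = {"n": 0}

    def compute() -> int:
        calls["n"] += 1
        result = a + b
        if calls["n"] in faulty_calls:
            result ^= 1  # injected bit flip (assumed fault model)
        return result

    return compute
```

A transient fault that hits only one of the two runs is detected. A fault that is still present during the second run produces the same wrong value twice and passes the comparison, which is the "hopefully the fault is gone by then" caveat.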


Error Detection

To decide that “1” is wrong we need a reference.

Where to get this reference from?

Option 2: Use a second computer in parallel (hopefully this one works well…)

Space redundancy
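Space redundancy can be sketched as two replicas fed with the same input and a comparison of their outputs. The function names and the stuck-at fault on output bit 2 are assumptions made for this illustration.

```python
from typing import Callable, Tuple


def duplex(replica_a: Callable[[int], int],
           replica_b: Callable[[int], int],
           x: int) -> Tuple[int, bool]:
    """Run both replicas on the same input; return (output of A, error flag)."""
    out_a = replica_a(x)
    out_b = replica_b(x)
    return out_a, out_a != out_b


def healthy(x: int) -> int:
    """Hypothetical fault-free computation."""
    return 2 * x + 1


def stuck_bit(x: int) -> int:
    """Same computation, but output bit 2 is stuck at 1 (assumed fault)."""
    return (2 * x + 1) | 0b100
```

With `x = 5` the faulty replica returns 15 instead of 11 and the comparison flags the error. With `x = 3` the stuck bit happens to match the correct value, so the fault has no effect. If both replicas carry the same fault, their outputs agree and the error goes unnoticed.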

Error Detection

To decide that “1” is wrong we need a reference.

Where to get this reference from?

Option 3: Add additional information (hopefully not affected as well…)

Information redundancy
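A single parity bit is about the simplest form of information redundancy. The sketch below assumes even parity over an integer word; the helper names are hypothetical.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of 1 bits, else 0 (even parity)."""
    return bin(word).count("1") & 1


def encode(word: int) -> Tuple[int, int]:
    """Store the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """True if the word is consistent with the stored parity bit."""
    return parity_bit(word) == stored_parity
```

A single bit flip changes the parity and is detected; two bit flips in the same word cancel out and pass the check, which is why stronger codes (EDC/ECC) exist.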

Achieving Fault Tolerance

Fail safe:

• system can be safely stopped when an error is detected
• example: train

[Figure: one computer with error detection (ED)]

Fail operational:

• system must keep on working when an error is detected
• example: autopilot in airplane

[Figure: two computers with error detection (ED)]


Electronics in Cars – some Facts

• high proportion of value: up to 30%
• high development potential: more than 80% of the innovations
• high number of Electronic Control Units (ECUs): up to 70
• complex distributed system: different networks & topologies

Electronics in Cars – Benefits

• cheap alternative to existing mechanical solutions
  – lighter, smaller, cheaper, more flexible,…
• enabler for further optimizations
  – electronic ignition, motor management,…
• key to new functionality
  – safety: ESP, active suspension, crash sensing,…
  – comfort: air conditioning, infotainment,…
  – security: immobilizer, alarm, electronic key, GPS tracking,…
  – autonomy: anticipatory braking, lane keeping,…

Key Demands

• Safety
• Real-Time
• Low Cost
• Robustness
• Testability

Key Demands

• Safety
  – high risk potential (energy!)
  – high public awareness
  – no safe state (in general)
  – certification required (EN 61508, ISO 26262)
  – high complexity of system & application
  – legal issues (liability)
• Real-Time
• Low Cost
• Robustness
• Testability

Key Demands

• Safety
• Real-Time
  – engine: 6000 rpm = 1 revolution per 10 ms
  – VDM: 100 km/h = 28 cm per 10 ms
  – need to synchronize distributed activities
  – real-time communication
  – image processing tasks
• Low Cost
• Robustness
• Testability

Key Demands

• Safety
• Real-Time
• Low Cost
  – extreme competition
  – high cost inhibits introduction
  – tailored safety concepts: minimum degree of replication, use structural redundancies
  – generic solutions: scalable, configurable, flexible
  – marginal costs beat NRE
• Robustness
• Testability

Current Status

• fail safe functions realized:
  – shut off upon error
  – mechanical fall-back system assumes control => no true “by wire” functions
  – single-channel solutions sufficient
• tolerance against random faults
  – avoid design faults by field experience => no diversity
  – avoid common cause faults by design (?)
• single fault assumption
  – keep faults rare (shielding, etc.)


A Fault Tolerant Node

• mission: make a node (processor) fault tolerant
• need to consider CPU and memory
• aim is “fail safe” (but keep option for fail op in mind)
  – simplex unit with error detection capabilities
  – duplication and comparison
  – hybrid approach

Options for the CPU Core

• Single core + ED
• Dual core + cmp
• Superscalar proc. + cmp + ED

Single core + ED: modify custom CPU core

– parity for buses
– two-rail coding for signals
– self-checking implementation of simple units
– duplicate & compare for complex units
– careful layout

Options for the CPU Core

• Single core + ED
• Dual core + cmp
• Superscalar proc. + cmp + ED

Dual core + cmp: duplicate custom CPU core

– master/checker operation
– shared (safe) memory
– validity check for inputs
– self-checking comparator checks equality of outputs
– option: clock delay
– option: mode switch

Solution Example “Dual Core Frame”

• benefits
  – can use custom core without modifications
  – safety analysis valid for other cores as well
  – promises high ED coverage with moderate efforts
  – CPU is hard to protect otherwise
• crucial points
  – enable easy recovery (=> keep outage short)
  – eliminate single points of failure
  – detect common cause faults

Protection in the Dual Core Frame

[Figure: Core 1 (Master) and Core 2 (Checker) with signals Instr. Addr., Instr., Data Addr., Data out and Data in; a comparator (=?) on the core outputs drives Error_Sig.]

Potential for Common Cause Faults

• identical input data
• identical clock (lock step)
• shared clock generator
• shared power supply
• both processors on same die (physical proximity; thermal & mechanical coupling)

Temporal Diversity

• operate checker with a delay against master
  – same fault hits at different point of computation
  – therefore different effect => detect by comparison
  – different critical paths emerge
• store master output for comparison
• choose delay of 1 / 1.5 / 2 clock cycles
  – larger delay causes high effort for little gain (=> experiments)
  – error detection latency is equal to the delay
  – need to delay memory write and outputs by this amount

Temporal Diversity: Implementation

[Figure: Core #1 (Master) and Core #2 (Checker) connected to Instr. Mem and Data Mem; signals Instr. Addr., Instr., Data Addr., Data out and Data in; comparators (=?), a delay element (D) and the Error output.]
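The effect of temporal diversity can be illustrated with a small cycle-based simulation. This is only a sketch under strong assumptions: the "core" is a hypothetical running-sum function, the delay is a whole number of cycles (the slides also mention 1.5), and the common cause fault is modelled as a bit flip that hits the state of both cores in the same clock cycle. Master outputs wait in a buffer until the checker has reproduced them, so writes are committed with a delay.

```python
from collections import deque
from typing import List, Optional, Tuple


def core_step(state: int, x: int) -> int:
    """Hypothetical core: running sum modulo 256; the output equals the new state."""
    return (state + x) & 0xFF


def run_frame(inputs: List[int], delay: int,
              fault_cycle: Optional[int] = None) -> Tuple[List[int], Optional[int]]:
    """Simulate master and delayed checker.

    Returns (committed writes, cycle in which the comparator fired or None).
    """
    master_state = 0
    checker_state = 0
    pending = deque()          # master outputs waiting for the checker
    committed: List[int] = []  # writes released after a successful comparison
    n = len(inputs)
    for cycle in range(n + delay):
        if cycle == fault_cycle:
            # assumed common cause fault: same bit flips in both cores at once
            master_state ^= 1
            checker_state ^= 1
        if cycle < n:
            master_state = core_step(master_state, inputs[cycle])
            pending.append(master_state)
        k = cycle - delay      # step the checker executes in this cycle
        if 0 <= k < n:
            checker_state = core_step(checker_state, inputs[k])
            master_out = pending.popleft()
            if master_out != checker_state:
                return committed, cycle
            committed.append(master_out)
    return committed, None
```

With `delay=0` (pure lock step) the fault corrupts both cores identically and the wrong values are committed without any error. With `delay=1` the same fault hits the two cores at different points of the computation, the comparator fires in the fault cycle, and only values produced before the fault have been committed.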

Fail Safe Dual Core Frame – Summary

• safe memories for instructions and data
• comparison of all core outputs
• parity protection for buses (data, address)
• dual rail coding for single signals (int, rst, err)
• totally self-checking comparators
• temporal diversity

How safe is the proposed solution?

Assessment of the Solution’s Quality

• How to measure quality? (Aim is fail safe)
  – error detection coverage => detect all errors
  – error detection latency => detect them quickly
• Which method to choose?
  – theoretical analysis / modelling
  – experimental fault injection
  – field observation

Fault Injection Experiment

• 2 SPEAR cores in fail safe frame (= DUT)
• synthesized to EDIF netlist
• exhaustive list of stuck-at-1 and stuck-at-0 faults, injected one by one into the netlist
• download to FPGA, application run
• “golden device” as reference (= REF)
• upon mismatch (DUT ≠ REF) => check comparator

Results of FI Experiment

         detected                                                    not detected
         no        before    during    during    after     after     no        with
         effect    effect    effect    effect    effect    effect    effect    effect    overall
                             (RD)      (WR)      (RD)      (WR)
master   204       19047     0         559       31455     0         4269      0         55534
slave    51170     98        0         0         0         0         4276      0         55544
frame    3517      734       0         921       87        0         1073      0         6332
overall  54891     19879     0         1480      31542     0         9618      0         117410
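The figures of this experiment can be condensed into a coverage number. The sketch below takes the per-unit counts from the results above; the assignment of the eight counts to columns (detected: no effect, before effect, during effect RD/WR, after effect RD/WR; not detected: no effect, with effect) is a reading of the flattened slide table and should be treated as an assumption, as are all names in the code.

```python
from typing import Dict, List

# Column order assumed from the slide's header row.
COLUMNS = [
    "detected/no effect", "detected/before effect",
    "detected/during RD", "detected/during WR",
    "detected/after RD", "detected/after WR",
    "not detected/no effect", "not detected/with effect",
]

RESULTS: Dict[str, List[int]] = {
    "master": [204, 19047, 0, 559, 31455, 0, 4269, 0],
    "slave":  [51170, 98, 0, 0, 0, 0, 4276, 0],
    "frame":  [3517, 734, 0, 921, 87, 0, 1073, 0],
}


def summarize(results: Dict[str, List[int]]) -> Dict[str, float]:
    """Sum the columns over all units and derive the error detection coverage."""
    totals = [sum(col) for col in zip(*results.values())]
    injected = sum(totals)
    no_effect = totals[0] + totals[6]      # faults that never changed an output
    undetected_with_effect = totals[7]     # the dangerous case
    effective = injected - no_effect
    coverage = 1.0 - undetected_with_effect / effective
    return {
        "injected": injected,
        "effective": effective,
        "undetected_with_effect": undetected_with_effect,
        "coverage": coverage,
    }
```

Under this reading, none of the injected faults that had an effect escaped detection in this experiment, so the computed coverage over effective faults is 1.0.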

Enabling fast Recovery

• error signal (dual rail)
  – notifies external component / memory
  – turns any further WR into RD (error confinement)
  – triggers processor interrupt
• status register (memory mapped)
  – updated by HW
  – indicates source of error (data parity, address mismatch,…)
• recovery
  – can build on uncorrupted status
  – can benefit from detailed status information

Why is fast Recovery important?

• application specific fault-tolerance time
  – application can “survive” without computer
  – even in fail-operational case
  – typ. some 10 ms for car (recall: 100 km/h = 28 cm per 10 ms)
• meaning of fast recovery
  – if failed computer recovers within FT time, no need for hot standby => COST!
• re-booting after failure is
  – pragmatic
  – safe
  – expensive!
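The timing figures quoted in the talk follow from plain unit conversion. The helper names below are hypothetical; the sketch only reproduces the arithmetic behind "100 km/h = 28 cm per 10 ms" and "6000 rpm = 1 revolution per 10 ms".

```python
def distance_cm(speed_kmh: float, time_ms: float) -> float:
    """Distance in cm travelled at speed_kmh during time_ms.

    km/h -> m/s is a division by 3.6; ms -> s a division by 1000;
    m -> cm a multiplication by 100, which gives speed * time / 36.
    """
    return speed_kmh * time_ms / 36.0


def revolution_period_ms(rpm: float) -> float:
    """Duration of one engine revolution in ms (60000 ms per minute)."""
    return 60000.0 / rpm
```

For example, `distance_cm(100, 10)` is about 27.8 cm, which the slides round to 28 cm, and `revolution_period_ms(6000)` is 10 ms.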

Fail Safe Dual Core – Summary 1

• duplicate & compare
  – generic approach, applicable to any core type
  – covers all (local) errors
  – need to carefully eliminate single points of failure
  – need to complement with protection for signals & buses
• temporal diversity
  – mitigates (many) common cause failures
  – requires output delay to ensure error confinement

Possible Sources of CCFs

• Design & process
  – design fault or (latent) process deficiency
• Thermal coupling
  – hot spot affects both replicas in the same way
• Mechanical defect
  – affects both replicas symmetrically
• Electrical coupling
  – wire bound (shared lines: VDD, reset, clock)
  – wireless (EMI)

Why use Single Die then?

• cheaper and faster
  – use two instances of same design
  – fast & comprehensive comparison
• CCFs on single die
  – intuitively higher threat
  – quantification of threat?
  – mitigation techniques?

The Actual Problem with CCFs

• One fault event affects both replicas AND
• is not detected by the comparator, i.e. leads to a “symmetric” fault effect AND
• produces an erroneous output, i.e. does not crash the cores



Possible Countermeasures for CCFs

• Design & process: diversity, burn-in
• Thermal coupling and mechanical defect: asymmetric propagation paths
• Electrical coupling
  – wire bound (shared lines: VDD, reset, clock)
  – wireless (EMI): asymmetric antennas (?)

Propagation Speed Comparison

• Thermal & mechanical propagation are relatively slow
  – 10000s of clock cycles within 1 ms

Experimental Assessment

• Evaluation experiments
  1) single corresponding points with offset
  2) multiple corresp. points with offset
  3) single non-corresp. points, no offset

[Figure: fault locations in Core 1 and Core 2 for each experiment; master, checker and compare unit are checked against a golden node for erroneous write accesses.]

Symmetry Requirements for CCF

• even a small offset…
• fault multiplicity…
• asymmetry of impact…

…improve detection coverage

[Figure labels: ExVecTab (8202), DEC (152), RF (7028), PSW (308), ALU (2472), P2 (158), PC+P1 (182)]

Squeezing out more Efficiency

• dual core is expensive
  – normally yields performance improvement
  – would be welcome here as well: increasing performance demand @ limited clock rates
  – but: exclusively dedicated to safety here
• observation: not all tasks are safety critical

enable flexible switching between “safety mode” and “performance mode”

Operation in Performance Mode

• cores execute different instruction streams in parallel
• both cores have direct access to memory / peripherals
• instruction caches introduced to minimize penalties from conflicting access
• temporal diversity disabled
• comparator disabled

Requirements on the Mode Switching

• coherent operation in safety mode
  – internal states of cores must be aligned before switching to safety mode (register file, cache)
• safe operation in safety mode
  – switching must not introduce safety leakage
  – no corruption of safety-relevant data in perform. mode
• low performance penalty for mode switching
  – slow or complicated switching would spoil the anticipated performance gain

Implementation of the Split Core Frame

[Figure: two blocks labelled Instruction / RAM / Control, plus instruction memory and data memory.]

Mode Switch: Safety => Performance

Steps:

• load ID reg address
• mode switch instr => core1 wait => core2 wait => clk align => switch mode
• load & check ID bit => cond branch core2

Instruction sequence:

    LDL r1, 248
    LDH r1, 255
    mode switching
    LDW r2, r1
    BTEST r2, 1
    JMPI_CT

Mode Switch: Performance => Safety

• core1 encounters mode switch instr
  => trigger MSU (core1 signal)
  => halt core1 (wait1)
  => interrupt core2 (message2)
• core2 encounters interrupt
  => save context
  => jump to mode switch instr
• core2 executes mode switch
  => halt core2 & switch clock
  => resume core1
  => resume core2 after delay

Fault Injection in Safety Mode

         detected                                not detected
         no        before    within    later     no        with
         effect    effect    1.5cy               effect    effect    overall
master   1029      5026      50956     0         7055      0         64066
slave    56962     0         0         0         7102      0         64064
frame    5334      1324      569       0         4275      0         11502
overall  63325     6350      51525     0         18432     0         139632

Delayed WR still ensures error confinement

Fault Injection in Performance Mode

fault injected in performance mode, then switch to safety mode

[Table, layout not recoverable from the slide. Row labels (detection in): perf mode early / late / stuck; safety mode ≤1.5cy / >1.5cy; never. Column labels (effect in): perf only, both modes, safety only, none. Values as extracted: 1149 - - 1473 423 - - 25617 - - 34583 0 9654 47715 0 0; never: 458 0 0 18560]

We still need a “Safe Memory”

• detect bit flips in storage cells
  – parity (or EDC/ECC)
  – Why not duplicate & compare?
• protect interfaces
  – parity for data, address and control buses
• detect erroneous address decoding
  – special decoder logic design
• prevent illegal WR access
  – provide mask input for write enable

Possible Address Decoder Errors

• correct behavior:
  – any given address activates exactly one assigned memory cell
• erroneous behaviors:
  – an address activates no memory cell at all
  – an address activates more than one memory cell
  – an address activates a wrong memory cell

Checking the Address Decoder

• check for missing or multiple cell activations: XOR(upper half) ≠ XOR(lower half)?
• re-check parity behind cell array: OR over even cells ≠ parity?
• large decoders built from cascade of smaller ones

[Figure: decoder for address inputs A2, A1, A0 and parity AP; AND gates select cells of the memory cell array; XOR trees feed dual-rail checkers.]
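The two decoder checks can be modelled in software. The sketch below is a simplified interpretation, not the gate-level design from the slide: it assumes the decoder output is a list of select lines indexed by cell address, that the address parity bit is 1 for an odd number of 1 bits, and that "even cells" are cells whose address has even parity. All function names are hypothetical.

```python
from typing import List


def parity(addr: int) -> int:
    """1 if addr has an odd number of 1 bits, else 0."""
    return bin(addr).count("1") & 1


def xor_all(bits: List[int]) -> int:
    """XOR over a list of 0/1 values."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc


def decode(addr: int, n_cells: int) -> List[int]:
    """Fault-free decoder: exactly one select line is active."""
    return [1 if i == addr else 0 for i in range(n_cells)]


def check_decoder(select_lines: List[int], addr_parity: int) -> bool:
    """Return True if both checks pass for the given select lines."""
    half = len(select_lines) // 2
    lower, upper = select_lines[:half], select_lines[half:]
    # Check 1: XOR of the two halves must form a dual-rail pair (01 or 10).
    activation_ok = xor_all(upper) != xor_all(lower)
    # Check 2: "an even-parity cell is active" must be the complement
    # of the address parity bit supplied with the address.
    even_active = int(any(s for a, s in enumerate(select_lines)
                          if s and parity(a) == 0))
    parity_ok = even_active != addr_parity
    return activation_ok and parity_ok
```

In this simplified model a missing activation, a double activation, and a wrong cell of different parity are all flagged. A wrong cell whose address has the same parity as the requested one passes both checks, which is a limitation of this sketch and says nothing about the actual circuit.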

Summary

• the automotive domain has its own laws and rules
  – need “extremely cost-effective robust solutions for safety critical real time applications, versatile and custom tailored”
• on node level
  – different redundancy concepts applicable
  – example: dual core CPU and memory with protection mech’s
  – on-line testing for memory may be required
• on system level
  – crucial role of communication infrastructure
  – advantages of time triggered approach
  – insufficient suitability of structural testing

Hungry for more?

http://ti.tuwien.ac.at/ecs



Related Projects

• STEACS (Systematic Test of Embedded Automotive Communication Systems): http://embsys.technikum-wien.at/projects/steacs/index.html
• EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing): http://ti.tuwien.ac.at/ecs/research/projects/extract
• DARTS (Distributed Algorithms for Robust Tick Synchronization): http://ti.tuwien.ac.at/ecs/research/projects/DARTS