Evaluation of Processor Faults Due to EM Interference



Evaluation of Computer Faults Due to EM Interference

Concepts, Simulation Environment and Some Results

Shantanu Dutt (Student Involved: Hasan Arslan) ECE Dept.

University of Illinois at Chicago

Outline

• Past work: general fault detection and tolerance, EMI faults
• Our goals
• Our methodology for fault detection and classification
• Experimental results
• Conclusions and future work

Past Work: General Fault Detection and Tolerance

 

• Off-line testing (mainly for hard faults)
• Concurrent on-line testing (operational faults):
  - Adding external hardware to monitor data, address and control lines
  - Memory: error-detecting and error-correcting codes
  - Computer systems:
    - Watchdog processor: detecting control flow errors in program execution [Mahmood & McCluskey, TC’88]
    - Algorithm-based fault tolerance: use of some property of the computation for self-checking [Huang & Abraham, TC’84, Dutt & Assad, TC’96]

Past Work On EM/Radiation Induced Faults

• Detection of high-level computer failure due to different types of EM signals [Mojert et al., EMC’01]
• Failure in real-time communication and control systems from communication line errors due to EM signals [Kohlberg & Carter, EMC’01]
• Also: radiation-hardened processors: Leon and ERC32 processors (http sites). These rely primarily on ECC for memory and the register file only: simple fault tolerance, but probably targeting the most likely source of “permanent” faults.

Assumptions/Scenarios of Past Work

• Past work on general fault detection:
  - Random single (sometimes double) faults
  - Deterministic faults
  - Types of faults: permanent, transient, intermittent; the intermittent type is not generally tackled
• Past work on EM-induced faults:
  - No how/why/what analysis and classification of computer failure due to EM interference

Goals of Our Work

• We will determine and classify the following types of computer system behavioral errors (i.e., program errors) due to different patterns, extent, duration and location of faults:
  - Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  - Data errors. Causes: computation errors, memory and bus faults
  - Termination errors (hung processor and crashes). Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts (?)
  - Note: error types are NOT mutually exclusive
• Provide broad-based recipes for FT and reliable operation
• To the best of our knowledge, this is a more comprehensive analysis of fault effects on a computer system than any attempted previously
• Comprehensive analysis is needed due to the nature of EM effects: all-pervasive, periodic, clustered

Our System of Fault Analysis in a Computer System

• Fault injection in each component; control of fault duration, frequency, number and pattern (random, clustered)
• Computer system = processor + memory + ext. buses
• Use a VHDL model of a modern microprocessor: DLX and SuperScalar DLX

Characteristics of Fault Injection Methods -- Previous Work

                   Hardware                        Software
                   With contact   Without contact  Compilation         Runtime
Cost               High           High             Low                 Low
Damage             High           Low              None                None
Trigger            Yes            No               Yes                 Yes
Repeatability      High           Low              High                High
Controllability    High           Low              High                High
Acc. FIP           Chip pin       Chip int.        Reg., mem., soft.   Reg., mem., I/O cont./port

Our Fault Injection Approach

• Inject faults in a “software” model (VHDL) of a computer: advantages of both the h/w and s/w approaches without the disadvantages
• Fault types (stuck-at 0, stuck-at 1, single random, clustered, multiple random, etc.)
• Variable duration and frequency of faults

[Figure: fault generator (Counter_1, Counter_2, variable-width, variable-period pulse generator, MUX selecting 1/0) placed on the data bus and address bus between the memory and the DLX CPU]
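The fault types and the pulse-controlled injection listed above can be sketched in a few lines of Python. This is only an illustrative model, not the VHDL fault generator itself: the 32-bit bus width, the function names and the "active during the first part of each period" pulse shape are assumptions made for the sketch.

```python
import random

WORD_BITS = 32  # assumed bus width for this sketch


def stuck_at(word, bit, value):
    """Force one bit of the bus word to 0 or 1 (stuck-at fault)."""
    mask = 1 << bit
    return (word | mask) if value else (word & ~mask)


def flip_bits(word, bits):
    """Invert the listed bit positions (random or clustered bit faults)."""
    for b in bits:
        word ^= 1 << b
    return word


def random_fault_bits(rng, count):
    """Pick `count` distinct bit positions anywhere in the word."""
    return rng.sample(range(WORD_BITS), count)


def clustered_fault_bits(rng, count):
    """Pick `count` adjacent bit positions (a clustered fault)."""
    start = rng.randrange(WORD_BITS - count + 1)
    return list(range(start, start + count))


def fault_active(t_ns, period_ns, duration_ns):
    """Variable-period, variable-width pulse: high for the first
    duration_ns of every period_ns (assumed pulse shape)."""
    return (t_ns % period_ns) < duration_ns


def inject(word, t_ns, period_ns, duration_ns, bits):
    """MUX behavior: pass the word through unless the pulse is active."""
    if fault_active(t_ns, period_ns, duration_ns):
        return flip_bits(word, bits)
    return word
```

For example, `inject(0x000ef020, 3, 20, 5, [0])` corrupts bit 0 because time 3 ns falls inside the 5 ns pulse, while the same call at time 7 ns returns the word unchanged.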


Methodologies for Control Flow Errors [Mahmood & McCluskey, TC’88]

• A node is a block of instructions with a branch at the end
• A derived signature of a node is a function (e.g., OR, LFSR) of all its instructions
• A program graph is one in which there is an arc from node u to node v if the branch at u can lead to node v

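[Figure: program graph with nodes v1 to v6 next to an example block (NOP Sign(v4); ADD r1 r3; LD r2 address; NOP sign(v5); NOP sign(v6); BLT r4 r8 off) with embedded signatures. The watchdog sits on the memory bus between the memory hierarchy and the processor and receives a signal from the branch circuit]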

Methodologies for Control Flow Errors: CFC Checking Using a Watchdog

• The WD compares the information gathered concurrently with the information previously provided
• Complexity lies between the current circuit-level and system-level techniques
• 90% error coverage for single errors [Mahmood et al., IEEE TC’88]

[Figure: watchdog flow chart: START, block header, compute block signature, check block signature, check branch signature, wait for new block, error flag; shown next to the program graph (v1 to v6) and the example block with embedded NOP sign(v) instructions]
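A minimal Python sketch of the checking loop described above, under stated assumptions: the derived signature is taken to be a plain XOR of the instruction words (a stand-in for whatever signature function the watchdog actually uses), and the class name, node names and instruction words are hypothetical.

```python
from functools import reduce


def derived_signature(instruction_words):
    """Signature of a block: XOR of all its instruction words (assumed function)."""
    return reduce(lambda a, b: a ^ b, instruction_words, 0)


class Watchdog:
    """Checks each executed block against a stored signature and checks
    that each block-to-block transition is an arc of the program graph."""

    def __init__(self, blocks, arcs):
        # blocks: node name -> reference list of instruction words
        # arcs:   node name -> set of legal successor nodes
        self.expected = {n: derived_signature(w) for n, w in blocks.items()}
        self.arcs = arcs
        self.prev = None

    def check_block(self, node, observed_words):
        """Return the list of error flags raised for this block."""
        errors = []
        # Branch check: the previous node must have an arc to this node.
        if self.prev is not None and node not in self.arcs[self.prev]:
            errors.append("branch")
        # Block check: recompute the signature from the words seen on the bus.
        if derived_signature(observed_words) != self.expected[node]:
            errors.append("block")
        self.prev = node
        return errors
```

Note that an XOR signature aliases when an even number of words are corrupted in the same bit position, which is one reason stronger signatures such as an LFSR are of interest.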

Data Error Methodologies: Algorithm Based Fault Tolerance

• Data errors are difficult to detect: they occur inside the microprocessor and are not necessarily observable by an external WD processor
• Use properties of the computation to check the correctness of computed data
• E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f( ) can be used to check it:
  - Pre-compute v’ = v1 + v2 + … + vk (input checksum)
  - Compute f(v1), …, f(vk)
  - Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
  - Check if f(v’) = u; inequality indicates computation error(s)
• Can be used for linear computations such as matrix multiplication, matrix addition, Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
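The checksum steps above can be illustrated with a small Python sketch. It assumes f is an integer matrix-vector product (one example of a linear computation); the function names are hypothetical and integers are used so that the equality test is exact.

```python
def f(v, matrix):
    """A linear computation: integer matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def abft_check(inputs, outputs, matrix):
    """Return True if f(v') equals u, i.e. no error was detected."""
    v_check = vec_sum(inputs)   # input checksum v' = v1 + ... + vk
    u = vec_sum(outputs)        # output checksum u = f(v1) + ... + f(vk)
    return f(v_check, matrix) == u
```

A single corrupted output element changes u but not f(v’), so the check fails; errors that cancel each other in the sum would go undetected.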

Data Error Methodologies: Data Encoding

• Data that is numerically processed can be encoded, and the output of arithmetic operations checked to see whether it is still encoded (e.g., Berger, AN codes)
• A simple coding scheme is AN coding: a number N is transformed to A·N, where A is odd, say 3
• Works for addition: 3·N1 + 3·N2 = 3(N1 + N2); check whether the result is still a multiple of 3; if not, there is an error
• 100% detection of single faults: a single fault changes the result by ±2^i, so it is no longer a multiple of 3
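A short Python sketch of 3N coding as described above. The helper names are hypothetical; the single-fault property relies on 2^i never being divisible by 3.

```python
A = 3  # odd multiplier of the AN code


def encode(n):
    """Transform N into A*N."""
    return A * n


def is_valid(code_word):
    """A code word is valid only if it is still a multiple of A."""
    return code_word % A == 0


def decode(code_word):
    """Recover N, raising if the code word fails the check."""
    if not is_valid(code_word):
        raise ValueError("AN code check failed")
    return code_word // A


# Addition preserves the code: 3*17 + 3*25 = 3*(17 + 25)
total = encode(17) + encode(25)
```

Flipping any single bit of `total` adds or subtracts a power of two, so `is_valid` rejects every single-bit corruption of the sum.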

Methodologies for Termination Errors

• Valid address range registers R_low, R_high in the processor: check each generated address to see if it is in range
  - Can detect crashes due to invalid addresses
• Timeout mechanisms: store an upper bound on execution time for each block in the watchdog; if the time is exceeded at run time, flag an error
  - Can detect infinite loops or a hung processor due to control unit faults
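Both mechanisms reduce to simple comparisons, as the following Python sketch shows. The class and method names, the block name and the numeric bounds are hypothetical; in the actual system the range registers live in the processor and the time bounds in the watchdog.

```python
class TerminationChecker:
    """Address-range check plus per-block timeout check."""

    def __init__(self, r_low, r_high, block_time_bounds_ns):
        self.r_low = r_low                      # lowest valid address
        self.r_high = r_high                    # highest valid address
        self.block_time_bounds_ns = block_time_bounds_ns  # block -> upper bound

    def address_in_range(self, address):
        """True if a generated address lies within [R_low, R_high]."""
        return self.r_low <= address <= self.r_high

    def block_timed_out(self, block, elapsed_ns):
        """True if a block has run longer than its stored upper bound."""
        return elapsed_ns > self.block_time_bounds_ns[block]
```

An out-of-range address or an exceeded bound would raise the error flag, which lets the system stop before a crash or an infinite loop goes unnoticed.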

Current Implementation

• Fault injection with various controls (duration, frequency, extent, pattern) for a non-pipelined DLX processor in VHDL
• Fault injection on memory data/address buses
• Description of a watchdog processor in VHDL for control flow checking + infinite-loop termination errors
• Valid address range registers in the processor
• ECC (1-error correction and 2-error detection) of memory (commercial feature) and buses (non-standard)
• Some error analysis results for a simple Fibonacci computation: f(i) = f(i-1) + f(i-2), i=2 to 99, f(0)=f(1)=0

Current Implementation -- ECC Capabilities on Memory and Buses

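[Figure: memory and CPU connected through ECC encoder/decoder (En/Dec) pairs, with the fault injector on the bus between them. Each 32-bit word carries 7 ECC bits (32 + 7 ECC = 39 bits). Example instruction: ADD r30,r0,r14 (000ef020), shown with the bus word 4181ee8008. The CPU contains the PC and the low and high address range registers (L. Adr. Reg., H. Adr. Reg.)]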
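To make the 32 + 7 = 39-bit arrangement concrete, here is a Python sketch of a single-error-correcting, double-error-detecting code of that size. It assumes a standard extended Hamming layout (parity bits at positions 1, 2, 4, 8, 16, 32 plus an overall parity bit at position 0); the slides do not state which bit layout the VHDL model uses, so the layout and all names here are assumptions.

```python
CODE_LEN = 39                      # position 0: overall parity; 1..38: Hamming code
PARITY_POS = [1, 2, 4, 8, 16, 32]  # six Hamming parity positions
DATA_POS = [p for p in range(1, CODE_LEN) if p not in PARITY_POS]  # 32 positions


def encode(data):
    """Encode a 32-bit word into a list of 39 bits."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Even parity over every position whose index has bit p set.
        bits[p] = sum(bits[pos] for pos in range(1, CODE_LEN) if pos & p) % 2
    bits[0] = sum(bits) % 2        # overall even parity
    return bits


def decode(bits):
    """Return (status, data); status is 'ok', 'corrected' or 'detected'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < CODE_LEN:
        bits[syndrome] ^= 1        # single error; syndrome 0 means the parity bit
        status = "corrected"
    else:
        return "detected", None    # even number of errors, or invalid syndrome
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= bits[pos] << i
    return status, data
```

With this layout every single-bit error is corrected and every double-bit error is flagged. Faults of three or more bits may be miscorrected or missed, which is relevant when 4-bit faults are injected.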

Some Error Observations

Address    Orig. Instruction    Corrupted Inst.      Error
00000040   ADDI R7, R0, 3       ADDI R7, R0, 3
00000044   SW 0(R3), R7         SW 0(R2), R23        Wrong initializ. Err
...
000000A4   LW R3, -12(R30)      LW R3, -12(R30)
000000A8   LHI R4, 0            LHI R4, 0
000000D4   LW R6, -12(R30)      LW R6, -12(R30)
000000D8   SLLI R6, R6, 2       SLLI R14, R4, 2      Ind. Addr. Inc.: unknown value as index value
000000DC   ADD R5, R5, R6       ADD R5, R5, R6
000000E0   LW R4, 0(R4)         LW R4, 0(R4)
000000E4   LW R5, 0(R5)         LW R5, 0(R5)
000000E8   ADD R4, R4, R5       ADD R4, R4, R5
000000EC   SW 0(R3), R4         SW 0(R3), R4
...
0000010C   LW R3, -12(R30)      LW R3, 1040(R14)     Invalid Addr Err
00000110   ADDI R3, R3, 1       ADDI R3, R3, 1
00000114   SW -12(R30), R3      SW -12(R30), R3

Some Error Observations (contd.)

[Figure: DLX TRAP instruction format: bits 0-5 hold the TRAP opcode, bits 6-31 the trap_id; Trap_0 code = 44000000. Example: with 2-bit faults, SLLI r5,r5,#4 (50c60004) or ADD r4,r4,r5 (00a62820) becomes Trap_0 (44XXXXX4)]

• DLX uses TRAP_0 to stop execution. The processor checks the first 6 bits (0-5) for the trap instruction and the last bit (31) for the trap_id. No other bit is checked.
• For all ALU instructions, the first 6 bits are always 0. When the 2nd and 5th bits are set, they become trap instructions. Hence their distance is 2.
• For trap instructions, if the last bit is 0, execution stops (Trap_0). Unfortunately, for most ALU instructions (add, and, xor, rfe, etc.), the last bit is also 0.
• DLX interprets the last 5 bits (27-31) as the trap_id (bits 6-26 are ignored). Non-trap instructions interpret bits 6-10 as a source/destination register.
• The check for trap/non-trap instructions was extended to bits 6-10, to increase the minimum distance from 2 to 3.
• Premature stops due to Trap_0 are thus reduced.
• More refined schemes to increase the minimum distance: ongoing work
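The distance argument can be checked numerically. The Python sketch below uses the instruction words quoted on the slide and numbers bits from the most significant end (bit 0 is the MSB, bit 31 the LSB). How the extended check treats bits 6-10 is not spelled out in the slides; requiring them to be zero, as in the Trap_0 code 44000000, is an assumption, and the function names are hypothetical.

```python
TRAP_0 = 0x44000000


def opcode(word):
    """Bits 0-5 (MSB-first numbering) of a 32-bit instruction word."""
    return (word >> 26) & 0x3F


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def is_trap_0_original(word):
    """Original check: opcode bits 0-5 and the last bit (31) only."""
    return opcode(word) == opcode(TRAP_0) and (word & 1) == 0


def is_trap_0_extended(word):
    """Extended check (assumed form): bits 6-10 must also be zero."""
    return is_trap_0_original(word) and ((word >> 21) & 0x1F) == 0


add_word = 0x00A62820               # ADD r4,r4,r5 as given on the slide
corrupted = add_word ^ 0x44000000   # two bit faults in the opcode field
```

Here `hamming(opcode(add_word), opcode(TRAP_0))` is 2, `is_trap_0_original(corrupted)` is true, and `is_trap_0_extended(corrupted)` is false because bits 6-10 of the corrupted word are nonzero. An instruction whose bits 6-10 are already zero would still be only two bit flips away, which is consistent with the need for more refined schemes.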

Experimental Setup: Fault Injection Parameters

• 4 random errors simulated on the data bus with the following characteristics:
  - Clock cycle: 22 ns
  - Repeat period: 10 ns - 800 ns (f = 100 MHz - 1.25 MHz)
  - Duration range: 5 ns - 400 ns

[Figure: example fault pulse with R=20, D=5]

                           Low_Low     Med_Med     High_High
Repeat Period Range (ns)   305 - 425   160 - 440   305 - 425
Duration Range (ns)        5 - 25      180 - 220   300 - 400

Experimental Results

[Chart: average number of executed instructions for each run stopped by a trap, and average number of executed instructions for each simulation (1134 instructions with no fault); series include No Addr. Corr. Trap_0 and Addr. Corr. Trap_0]

• When the trap is fixed, few runs are terminated because of a trap, but invalid address termination (IAT) errors increase.
• Average number of instructions executed when the simulation stopped because of IAT: 35.42 (first type), 38.37 (second type), 88.46 (third type), 124.54 (fourth type).

Experimental Results (Cont)

[Chart: simulation times and data computation: average execution time for each simulation (265,620 ns for the non-faulty run) and average number of array element updates, for No Addr. Corr. Trap_0, No Addr. Corr. Trap_Fixed, Addr. Corr. Trap_0 and Addr. Corr. Trap_Fixed]

The longer a simulation runs, the more data elements it calculates.

Experimental Results

52 simulations for Low_Low

[Chart: percentage of data errors, control flow errors and termination errors for T_0, T_F, A_0 and A_F]

Config                              Errors   Error density   Term. Err.   (trap / Inv. Addr.)   CF           Data Err.
T_0: No Addr. Corr., Trap_0         410      7/100           43 (10%)     14 / 29               66 (16%)     301 (74%)
T_F: No Addr. Corr., Trap_Fixed     424      7/100           41 (9.6%)    9 / 32                76 (18%)     307 (73%)
A_0: Addr. Corr., Trap_0            444      2.2/100         38 (6.8%)    11 / 27               54 (13.6%)   315 (79.5%)
A_F: Addr. Corr., Trap_Fixed        446      1.7/100         27 (6%)      6 / 21                24 (5.4%)    395 (88.6%)

• The longer the program runs, the more data errors it produces.
• When the trap is fixed, more simulations are completed, but the number of invalid address terminations increases.
• When addresses are corrected, invalid address errors are reduced. The simulation executes more instructions, which increases the data errors.

Experimental Results (cont)

52 simulations for Med_Med

[Chart: percentage of data errors, control flow errors and termination errors for T_0, T_F, A_0 and A_F]

Config                              Errors   Error density   Term. Err.   (trap / Inv. Addr.)   CF           Data Err.
T_0: No Addr. Corr., Trap_0         68       15/100          52 (76.5%)   1 / 51                3 (4.4%)     13 (19.1%)
T_F: No Addr. Corr., Trap_Fixed     82       11/100          52 (63.5%)   0 / 52                7 (8.5%)     23 (28%)
A_0: Addr. Corr., Trap_0            150      10/100          52 (34%)     7 / 43                54 (36%)     44 (30%)
A_F: Addr. Corr., Trap_Fixed        175      8/100           45 (25.7%)   8 / 37                45 (25.7%)   85 (48.6%)

• Increasing the fault injection period reduces the number of executed instructions, so the error density increases sharply.

Experimental Results (cont)

52 simulations for High_High

[Chart: percentage of data errors, control flow errors and termination errors for T_0, T_F, A_0 and A_F]

Config                              Errors   Error density   Term. Err.   (trap / Inv. Addr.)   CF           Data Err.
T_0: No Addr. Corr., Trap_0         61       35/100          52 (85%)     1 / 51                9 (15%)      0 (0%)
T_F: No Addr. Corr., Trap_Fixed     90       48/100          52 (57%)     0 / 52                38 (43%)     0 (0%)
A_0: Addr. Corr., Trap_0            93       22/100          52 (55.3%)   9 / 43                41 (43.6%)   1 (1.1%)
A_F: Addr. Corr., Trap_Fixed        52       26/100          52 (100%)    4 / 48                0 (0%)       0 (0%)

• The process is never able to calculate the Fibonacci values because of the high fault injection rate. None of the simulations is completed.

Error Coverage

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

T_0: No Addr. Corr. Trap_0

[Chart: ECC coverage, control flow coverage and data coverage]

• Total: 434 erroneous instructions executed
• 411 erroneous instructions covered by ECC (95%)
• 90 erroneous instructions covered by the WD (20%)
• We are injecting 4-bit faults. If the process jumps into the middle of a block, the WD spends time finding the beginning of a block.

Error Coverage (Cont.)

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

T_F: No Addr. Corr. Trap_Fixed

[Chart: ECC coverage, control flow coverage and data coverage]

• Total: 474 erroneous instructions executed
• 450 erroneous instructions covered by ECC (95%)
• 106 erroneous instructions covered by the WD (23%)
• We are injecting 4-bit faults. If the process jumps into the middle of a block, the WD spends time finding the beginning of a block.

Error Coverage (Cont.)

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

A_0: Addr. Corr. Trap_0

[Chart: ECC coverage, control flow coverage and data coverage]

• Total: 3553 erroneous instructions executed
• 2959 erroneous instructions covered by ECC (83%)
• 703 erroneous instructions covered by the WD (20%)
• There were 89 data errors; 12 (13%) of them were covered by 3N coding
• We are injecting 4-bit faults. If the process jumps into the middle of a block, the WD spends time finding the beginning of a block.

Error Coverage (Cont.)

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

A_F: Addr. Corr. Trap_Fixed

[Chart: ECC coverage, control flow coverage and data coverage]

• Total: 4219 erroneous instructions executed
• 3426 erroneous instructions covered by ECC (82%)
• 762 erroneous instructions covered by the WD (18%)
• There were 106 data errors; 20 (19%) of them were covered by 3N coding
• We are injecting 4-bit faults. If the process jumps into the middle of a block, the WD spends time finding the beginning of a block.

Error Coverage (cont)

For error coverage, 20 runs were selected that resulted in complete simulations, with combinations of period 305 - 460 ns and duration range 5 - 60 ns.

Addr. Corr. Trap_0

[Chart: ECC coverage, control flow coverage and data coverage]

• Total: 438 erroneous instructions executed
• 349 erroneous instructions covered by ECC (80%)
• 170 erroneous instructions covered by the WD (39%)
• There were 217 data errors; 51 (23%) of them were covered by 3N coding
• We are injecting 4-bit faults. If the process jumps into the middle of a block, the WD spends time finding the beginning of a block.

    

Conclusions

• Have completed a significant but preliminary fault simulation of the DLX processor in VHDL
• Obtained percentages of termination, control and data errors for different fault durations and frequencies
• Encoding the TRAP instruction to have a minimum distance from other instructions helps in reducing incorrect termination
• Need ECC for the register fields of instructions to reduce incorrect address generation and data errors
• It seems possible to catch most errors with the combination of mechanisms we have suggested, so at least a fail-safe mode can be guaranteed with high confidence, though there is room for improvement in control and data error detection

     

Future Work

• Other fault patterns (e.g., clusters); correlation with EM-induced fault work by others in our group
• Other block signature techniques (e.g., LFSR) with better fault coverage
• Aliasing analysis (mathematical, empirical) for signatures
• Perform error analysis for more substantial “real-life” programs (scientific computations, non-numeric, system or O.S.)
• Fault injection and analysis for SuperScalar DLX
• Fault injection and analysis of on-chip processor components (integer and FP ALU, register files, control unit, internal buses, power/ground lines)

Looking Further Ahead

• Q: Are there patterns of errors that lead to computer crashes with high probability?
• Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (save state and data for later resumption)?
• Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?
• Q: If so, can these be used as “early detection & warning” of EM interference?

Future: Based on the correlation of system errors to EM faults, determine fault tolerance/ error minimization techniques for EM-induced faults.