Understanding the Impact of Gate-Level Physical Reliability Effects on Whole Program Execution
Raghuraman Balasubramanian, Karthikeyan Sankaralingam

PERSim
A framework with unprecedented fidelity in studying the end-to-end physical effects of reliability at the gate-level while running entire programs.

Understanding Reliability
• How does reliability affect the processor over its lifetime?
• What happens to the user application?
[Diagram labels: Architecture Errors, Logic Fault, Wearout, Particle Strikes, Permanent faults, Change in gate behavior]

State-Of-The-Art
• Simulators capable of running full programs
• What happens to the user application?
• Device-level models that capture the effects of reliability physics
[Diagram labels: Architecture Errors, Logic Fault, Wearout, Particle Strikes, Permanent faults, Change in gate behavior]

PERSim
• Simulators capable of running full programs
• A framework with unprecedented fidelity in studying the end-to-end physical effects of reliability at the gate-level while running entire programs.
• Device-level models that capture the effects of reliability physics

Why is this important?
Device-level problems ➔ Microarchitectural / Architectural level.
• M. Agarwal, B. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its application to transistor aging. In VTS ’07.
• T. M. Austin. DIVA: A reliable substrate for deep submicron microarchitecture design. In MICRO ’99.
• R. Balasubramanian and K. Sankaralingam. Virtually aged sampling DMR: Unifying circuit failure detection and circuit failure prediction. In MICRO ’13.
• J. Blome, S. Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout detection. In MICRO ’07.
• K. Bowman, J. Tschanz, C. Wilkerson, S. Lu, T. Karnik, V. De, and S. Borkar. Circuit techniques for dynamic variation tolerance. In DAC ’09.
Evaluations ➔ A small structured hardware, abstracted physics

Executive Summary
PERSim
• Model physical effects at gate-level
• See impact running full programs on a full processor
• At high simulation speeds (25 million cycles per second)
• With good signal observability
• Fine grain control on fault injection
Demonstration
• Evaluation of 4 recently proposed techniques
• End-to-end transient fault analysis

Outline
• Motivation and Overview
• How do we do it
  • Motivating running example
  • Breaking it down into mechanisms
• Implementation
• Evaluations using PERSim
• Questions

Virtually Aged Sampling DMR
Virtual Aging
Fault Exposure
• In most gates the faults are automatically exposed
• A new mechanism to expose faults in other gates
Detect Errors

Evaluation using PERSim
[Diagram labels: Applications, µP, Input Sequences, Delay Aware Simulation, µP µP DMR, Error??, Fault Vector, Wearout, Delay as a function of Time/Vdd]

Mechanisms: Fault Modeling
[Diagram labels: Applications, Input Sequences, Delay Aware Simulation, DMR, Error??, Fault Vector, Wearout, Delay as a function of Time/Vdd, Fault Modeling]

Mechanisms: Delay Aware Simulation
[Diagram labels: Applications, Input Sequences, Delay Aware Simulation, DMR, Error??, Fault Vector, Wearout, Delay as a function of Time/Vdd, Fault Modeling]

Mechanisms: Input Sequence Extraction
[Diagram labels: Applications, Input Sequences, Input Sequence Extraction, Delay Aware Simulation, DMR, Error??, Fault Vector, Wearout, Delay as a function of Time/Vdd, Fault Modeling]

Mechanisms: Fault Injection
[Diagram labels: Input Sequence Extraction, Fault Injection & Deterministic Re-execution, Input Sequences, Wearout, Applications, Delay Aware Simulation, DMR, Error??]
[Diagram labels, continued: Fault Vector, Fault Modeling, Delay as a function of Time/Vdd]

Mechanisms
• Input Sequence Extraction
• Delay Aware Simulation
• Fault Modeling
• Fault Injection and Deterministic Re-execution

Outline
• Motivation and Overview
• How do we do it
• Implementation
  • Input Sequence Extraction & Fault Injection
  • Delay Aware Simulation
  • Fault Modeling
• Evaluations using PERSim
• Questions

Implementation
Input Sequence Extraction; Fault Injection and Deterministic Re-execution
• Be able to run full programs

Input Sequence Extraction
Input Sequence Extraction; Fault Injection and Deterministic Re-execution
• Be able to observe signals on a cycle-by-cycle basis

Fault injection
Input Sequence Extraction; Fault Injection and Deterministic Re-execution
• Fine grain control for fault injection

Implementation
Input Sequence Extraction; Fault Injection and Deterministic Re-execution
• Automate: run multiple tests at the push of a button

Implementation
• 25 million cycles per second per board
• Full SPEC benchmarks

Fault Modeling
Reliability phenomena ⇒ behavior of gates?
• Wearout: Synopsys HSPICE+MOSRA

Fault Modeling
Reliability phenomena ⇒ behavior of gates?
• Transient Faults: Charge Accumulation Model

Fault Modeling
Reliability phenomena ⇒ behavior of gates?
• Permanent Faults: Probabilistic Models

Implementation
[Diagram labels: Input Sequences, Applications, Delay Aware Simulation, DMR, Error??]
[Diagram labels, continued: Fault Vector, Fault Modeling]
• Wearout: HSPICE+MOSRA
• Transient Faults: Charge Accumulation Model
• Permanent Faults: Probabilistic models

Outline
• Motivation and Overview
• How do we do it
• Implementation
• Evaluations using PERSim
  • Reliability Techniques
  • Key Results
• Questions

Reliability Techniques
Circuit failure prediction (Wearout)
• FIRST [Smolens et al., SELSE’07]
• WearMon [Zandian et al., DSN’12]
• Online Wearout Prediction [Blome et al., MICRO’07]
Transient Fault Analysis
• Gate level modeling of particle strike
• Application level impact analysis
Permanent Fault Detection
• Sampling-DMR [Nomura et al., ISCA’11]

Key Results
Circuit failure prediction (Wearout)
• Gates in non-critical paths not covered.
• PERSim enables full processor coverage
• Modeling Hole Covered
Transient Fault Analysis
• Accurate modeling of particle strikes on individual gates ⇒ impact on full programs
• Cross-layer transient fault analysis
Permanent Fault Detection
• Cycle-by-cycle error traces running full programs
• Fine-grained signal visibility

Executive Summary
PERSim
• Model physical effects at gate-level
• See impact running full programs on a full processor
• At high simulation speeds (25 million cycles per second)
• With good signal observability
• Fine grain control on fault injection
Demonstration
• Evaluation of 4 recently proposed techniques
• End-to-end transient fault analysis
www.persim.org

Backup slides

Limitations
• Related faults: interaction between faults not captured
• OpenRISC: simple, in-order processor
• Limited online visibility on current state/program progress
• ZedBoard memory footprint/Zynq FPGA size
• Determinism: requires careful manipulation of programs
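
Illustration (not from the slides): the mechanisms named in the talk (a fault vector, delay-aware simulation, deterministic re-execution of an input sequence, and a DMR comparison) can be pictured with a toy Python sketch. This is not PERSim's implementation, which the slides describe as running on FPGA boards with HSPICE+MOSRA-derived models. Everything here is an assumption made for illustration: the one-bit full-adder netlist, the gate delays, the 110 ps clock period, the names `Gate`, `run`, and `dmr_mismatches`, and the simplification that an output arriving after the clock edge captures the previous cycle's settled value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str        # hypothetical gate identifier used by the fault vector
    op: str          # "AND", "OR" or "XOR"
    inputs: tuple    # names of the two input signals
    output: str      # name of the output signal
    delay_ps: float  # made-up nominal propagation delay


OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

# One-bit full adder, listed in topological order (illustrative netlist).
FULL_ADDER = [
    Gate("g1", "XOR", ("a", "b"), "x1", 40.0),
    Gate("g2", "XOR", ("x1", "cin"), "sum", 40.0),
    Gate("g3", "AND", ("a", "b"), "c1", 30.0),
    Gate("g4", "AND", ("x1", "cin"), "c2", 30.0),
    Gate("g5", "OR", ("c1", "c2"), "cout", 30.0),
]
OUTPUTS = ("sum", "cout")
CLOCK_PS = 110.0  # assumed clock period


def run(netlist, input_sequence, fault_vector=None):
    """Delay-aware simulation: return the captured outputs for every cycle.

    fault_vector maps a gate name to a delay multiplier (e.g. from wearout).
    An output that arrives after CLOCK_PS captures the value the signal
    settled to in the previous cycle instead of the new one.
    """
    fault_vector = fault_vector or {}
    settled = {g.output: 0 for g in netlist}  # all signals start at 0
    trace = []
    for a, b, cin in input_sequence:
        values = {"a": a, "b": b, "cin": cin}
        arrival = {"a": 0.0, "b": 0.0, "cin": 0.0}
        for g in netlist:
            values[g.output] = OPS[g.op](*(values[i] for i in g.inputs))
            scale = fault_vector.get(g.name, 1.0)
            arrival[g.output] = (
                max(arrival[i] for i in g.inputs) + g.delay_ps * scale
            )
        trace.append(tuple(
            values[o] if arrival[o] <= CLOCK_PS else settled[o]
            for o in OUTPUTS
        ))
        settled = {g.output: values[g.output] for g in netlist}
    return trace


def dmr_mismatches(netlist, input_sequence, fault_vector):
    """Replay the same inputs on a fault-free and a faulty copy (DMR-style)
    and return the cycle numbers where their captured outputs differ."""
    golden = run(netlist, input_sequence)
    faulty = run(netlist, input_sequence, fault_vector)
    return [c for c, (g, f) in enumerate(zip(golden, faulty)) if g != f]


# Extracted input sequence (a, b, cin) and a fault vector slowing gate g1.
INPUTS = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 0, 0)]
FAULTS = {"g1": 1.5}
mismatches = dmr_mismatches(FULL_ADDER, INPUTS, FAULTS)
```

Because the simulation is a pure function of the input sequence and the fault vector, replaying the same pair always reports the same mismatching cycles, which is the property that deterministic re-execution relies on in this toy model.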