Understanding the Impact of Gate-Level
Physical Reliability Effects on Whole
Program Execution
Raghuraman Balasubramanian
Karthikeyan Sankaralingam
PERSim
A framework with unprecedented fidelity in studying the
end-to-end physical effects of reliability at the gate-level
while running entire programs.
Understanding Reliability
How does reliability affect the processor over its lifetime?
What happens to the user application?
Architecture Errors
Logic Fault
Wearout
Particle Strikes
Permanent faults
Change in gate behavior
State-Of-The-Art
 Simulators capable of running full programs
What happens to the user application?
Architecture Errors
Logic Fault
Wearout
Particle Strikes
Permanent faults
Change in gate behavior
 Device-level models that capture the effects of reliability physics
PERSim
 Simulators capable of running full programs
A framework with unprecedented fidelity in studying
the end-to-end physical effects of reliability at the
gate-level while running entire programs.
 Device-level models that capture the effects of reliability physics
Why is this important?
 Device-level problems ➔ addressed at the microarchitectural / architectural level
M. Agarwal, B. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its
application to transistor aging. In VTS ’07.
T. M. Austin. DIVA: A reliable substrate for deep submicron
microarchitecture design. In MICRO ’99.
R. Balasubramanian and K. Sankaralingam. Virtually aged sampling DMR:
Unifying circuit failure detection and circuit failure prediction. In MICRO ’13.
J. Blome, S. Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout
detection. In MICRO ’07.
K. Bowman, J. Tschanz, C. Wilkerson, S. Lu, T. Karnik, V. De, and S. Borkar.
Circuit techniques for dynamic variation tolerance. In DAC ’09.
 Evaluations ➔ small, structured hardware with abstracted physics
Executive Summary
 PERSim
 Model physical effects at gate-level
 See impact running full programs on a full processor
 At high simulation speeds (25 million cycles per second)
 With good signal observability
 Fine-grained control over fault injection
 Demonstration
 Evaluation of 4 recently proposed techniques
 End-to-end transient fault analysis
Outline
 Motivation and Overview
 How do we do it
 Motivating running example
 Breaking it down into mechanisms
 Implementation
 Evaluations using PERSim
 Questions
Virtually Aged Sampling DMR
Virtual Aging
Fault Exposure
• In most gates the faults are automatically exposed
• A new mechanism to expose faults in other gates
Detect Errors
Evaluation using PERSim
[Figure: block diagram of the evaluation flow. Labels: Applications, µP, Input Sequences, Delay Aware Simulation, DMR, Error??, Wearout, Delay as a function of Time/Vdd, Fault Vector]
Mechanisms – Fault Modeling
[Figure: evaluation-flow block diagram with the Fault Modeling block labeled. Other labels: Applications, Input Sequences, Delay Aware Simulation, DMR, Error??, Wearout, Delay as a function of Time/Vdd, Fault Vector]
Mechanisms – Delay Aware Simulation
[Figure: evaluation-flow block diagram with the Delay Aware Simulation block labeled. Other labels: Applications, Input Sequences, DMR, Error??, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Fault Vector]
Mechanisms – Input Sequence Extraction
[Figure: evaluation-flow block diagram with the Input Sequence Extraction block labeled. Other labels: Applications, Input Sequences, Delay Aware Simulation, DMR, Error??, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Fault Vector]
Mechanisms – Fault Injection
[Figure: evaluation-flow block diagram with the Fault Injection & Deterministic Re-execution block labeled. Other labels: Applications, Input Sequence Extraction, Input Sequences, Delay Aware Simulation, DMR, Error??, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Fault Vector]
Mechanisms
 Input Sequence Extraction
 Delay Aware Simulation
 Fault Modeling
 Fault Injection and Deterministic Re-execution
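The four mechanisms above can be illustrated on a toy design. The sketch below is not PERSim code: an 8-bit accumulator stands in for the processor, and the names `Fault`, `run`, and `first_divergence` are hypothetical. It shows only the idea of recording an input sequence, replaying it deterministically, injecting one fault at a chosen cycle and bit, and comparing against the fault-free trace.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault descriptor: flip one output bit in one cycle."""
    cycle: int
    bit: int


def accumulator_step(state: int, operand: int, fault_bit: Optional[int]) -> int:
    """One cycle of a toy 8-bit accumulator standing in for the processor."""
    result = (state + operand) & 0xFF
    if fault_bit is not None:
        result ^= 1 << fault_bit  # inject the fault on the adder output
    return result


def run(inputs: Iterable[int], fault: Optional[Fault] = None) -> List[int]:
    """Replay an input sequence; the trace depends only on (inputs, fault)."""
    state, trace = 0, []
    for cycle, operand in enumerate(inputs):
        bit = fault.bit if fault is not None and fault.cycle == cycle else None
        state = accumulator_step(state, operand, bit)
        trace.append(state)
    return trace


def first_divergence(golden: List[int], faulty: List[int]) -> Optional[int]:
    """Return the first cycle at which the two traces differ, if any."""
    for cycle, (g, f) in enumerate(zip(golden, faulty)):
        if g != f:
            return cycle
    return None


# Input sequence extraction: the values the "application" feeds the design.
recorded_inputs = [3, 5, 7, 11, 13]
# Fault-free run, then deterministic re-execution with one injected fault.
golden = run(recorded_inputs)
faulty = run(recorded_inputs, Fault(cycle=2, bit=0))
print(golden)
print(faulty)
print(first_divergence(golden, faulty))
```

Because the replay is deterministic, any difference between the two traces can be attributed to the injected fault rather than to run-to-run variation.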
Outline
 Motivation and Overview
 How do we do it
 Implementation
 Input Sequence Extraction & Fault Injection
 Delay Aware Simulation
 Fault Modeling
 Evaluations using PERSim
 Questions
Implementation
 Input Sequence Extraction
 Fault Injection and Deterministic Re-execution
 Be able to run full programs
Input Sequence Extraction
 Input Sequence Extraction
 Fault Injection and Deterministic Re-execution
 Be able to observe signals on a cycle-by-cycle basis
Fault injection
 Input Sequence Extraction
 Fault Injection and Deterministic Re-execution
 Fine-grained control over fault injection
Implementation
 Input Sequence Extraction
 Fault Injection and Deterministic Re-execution
 Automate – run multiple tests at the push of a button
Implementation
 25 million cycles per second per board
 Full SPEC benchmarks
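To put the quoted throughput in perspective, the arithmetic below converts a cycle count into wall-clock time. Only the 25 million cycles per second per board figure comes from the slide. The helper name `wall_clock_hours`, the example cycle count, and the assumption that work divides evenly across boards (for example, independent fault-injection runs) are illustrative.

```python
CYCLES_PER_SECOND_PER_BOARD = 25_000_000  # figure quoted on the slide


def wall_clock_hours(total_cycles: int, boards: int = 1) -> float:
    """Hours needed to simulate total_cycles, assuming an even split across boards."""
    seconds = total_cycles / (CYCLES_PER_SECOND_PER_BOARD * boards)
    return seconds / 3600.0


# Example: one trillion cycles (an assumed workload size, not a slide figure).
print(round(wall_clock_hours(10**12), 2))
print(round(wall_clock_hours(10**12, boards=4), 2))
```

One trillion cycles is 40,000 seconds on a single board, or roughly 11 hours.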
Fault Modeling
 Reliability phenomena ⇒ behavior of gates?
 Wearout
Synopsys HSPICE+MOSRA
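The slides obtain gate delay as a function of time and Vdd from HSPICE+MOSRA. As a rough stand-in for that characterization, the sketch below uses two textbook approximations: a power-law threshold-voltage shift with age and the alpha-power-law delay model. All constants (`a`, `n`, `alpha`, the 10-gate path, the 2% guard band) are assumed values chosen for illustration, not numbers from the talk.

```python
def aged_vth(vth0: float, years: float, a: float = 0.02, n: float = 1.0 / 6.0) -> float:
    """Threshold voltage after aging, using an assumed power law dVth = a * t**n."""
    return vth0 + a * years ** n


def gate_delay(vdd: float, vth: float, d0: float = 1.0, alpha: float = 1.3) -> float:
    """Relative gate delay from the alpha-power law: d0 * Vdd / (Vdd - Vth)**alpha."""
    if vdd <= vth:
        raise ValueError("Vdd must exceed Vth for the gate to switch")
    return d0 * vdd / (vdd - vth) ** alpha


def path_delay(num_gates: int, vdd: float, vth: float) -> float:
    """Delay of a path of identical gates (a deliberate simplification)."""
    return num_gates * gate_delay(vdd, vth)


VDD, VTH0, GATES = 1.0, 0.30, 10
# Assumed clock period: fresh path delay plus a 2% guard band.
clock_period = 1.02 * path_delay(GATES, VDD, VTH0)

for years in (0, 1, 10):
    delay = path_delay(GATES, VDD, aged_vth(VTH0, years))
    # A path slower than the clock period is a timing violation.
    print(years, round(delay, 3), delay > clock_period)
```

In this toy setting the fresh path meets timing while the aged path does not, which is the kind of delay change a delay-aware simulation would turn into a fault vector.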
Fault Modeling
 Reliability phenomena ⇒ behavior of gates?
 Transient Faults
Charge Accumulation Model
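The slide names a charge accumulation model without giving its details. The sketch below is one simple reading under stated assumptions: charge deposited on a gate's output node is summed, and the output is treated as flipped once the total reaches a critical charge taken here as C * Vdd / 2. That threshold, the units, and all function names are illustrative and are not the model used in PERSim.

```python
from typing import Iterable


def critical_charge_fc(node_capacitance_ff: float, vdd: float) -> float:
    """Assumed threshold: charge in fC needed to swing the node by Vdd/2 (fF * V = fC)."""
    return node_capacitance_ff * vdd / 2.0


def accumulated_charge_fc(deposits_fc: Iterable[float]) -> float:
    """Total charge collected at the node from a set of deposits."""
    return sum(deposits_fc)


def strike_flips_output(deposits_fc: Iterable[float],
                        node_capacitance_ff: float,
                        vdd: float) -> bool:
    """True if the accumulated charge reaches the assumed critical charge."""
    total = accumulated_charge_fc(deposits_fc)
    return total >= critical_charge_fc(node_capacitance_ff, vdd)


# A 2 fF node: the same deposits are harmless at 1.0 V but flip the gate at 0.6 V.
print(strike_flips_output([0.3, 0.4], node_capacitance_ff=2.0, vdd=1.0))
print(strike_flips_output([0.3, 0.4], node_capacitance_ff=2.0, vdd=0.6))
```

A flip detected this way would then be injected as a transient fault on that gate for the affected cycle.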
Fault Modeling
 Reliability phenomena ⇒ behavior of gates?
 Permanent Faults
Probabilistic Models
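A probabilistic permanent-fault model can be pictured as sampling which gates fail. The sketch below assumes each gate independently becomes stuck-at-0 or stuck-at-1 with a fixed probability. The gate names, the probability, and `sample_stuck_at_faults` are hypothetical, and the slides do not state which distribution PERSim uses. A seeded generator keeps the sample reproducible, in line with deterministic re-execution.

```python
import random
from typing import Dict, List


def sample_stuck_at_faults(gates: List[str], p_fault: float, seed: int) -> Dict[str, int]:
    """Pick faulty gates independently with probability p_fault.

    Returns a mapping from gate name to the stuck-at value (0 or 1).
    """
    rng = random.Random(seed)  # private generator, so the sample is reproducible
    faults: Dict[str, int] = {}
    for gate in gates:
        if rng.random() < p_fault:
            faults[gate] = rng.randint(0, 1)
    return faults


gates = [f"g{i}" for i in range(1000)]  # hypothetical gate names
fault_vector = sample_stuck_at_faults(gates, p_fault=0.01, seed=42)
print(len(fault_vector))
```

Re-running with the same seed yields the same fault vector, so a failing run can be reproduced exactly.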
Implementation
[Figure: evaluation-flow block diagram. Labels: Applications, Input Sequences, Delay Aware Simulation, DMR, Error??, Fault Vector, Fault Modeling]
Fault Modeling
 Wearout: HSPICE+MOSRA
 Transient Faults: Charge Accumulation Model
 Permanent Faults: Probabilistic Models
Outline
 Motivation and Overview
 How do we do it
 Implementation
 Evaluations using PERSim
 Reliability Techniques
 Key Results
 Questions
Reliability Techniques
 Circuit failure prediction (Wearout)
 FIRST [Smolens et al., SELSE ’07]
 WearMon [Zandian et al., DSN ’12]
 Online Wearout Prediction [Blome et al., MICRO ’07]
 Transient Fault Analysis
 Gate-level modeling of particle strikes
 Application-level impact analysis
 Permanent Fault Detection
 Sampling-DMR [Nomura et al., ISCA ’11]
Key Results
 Circuit failure prediction (Wearout)
 Gates on non-critical paths are not covered
 PERSim enables full processor coverage
 Modeling hole covered
 Transient Fault Analysis
 Accurate modeling of particle strikes on individual gates
⇒ impact on full programs
 Cross-layer transient fault analysis
 Permanent Fault Detection
 Cycle-by-cycle error traces running full programs
 Fine-grained signal visibility
Executive Summary
 PERSim
 Model physical effects at gate-level
 See impact running full programs on a full processor
 At high simulation speeds (25 million cycles per second)
 With good signal observability
 Fine-grained control over fault injection
 Demonstration
 Evaluation of 4 recently proposed techniques
 End-to-end transient fault analysis
www.persim.org
Backup slides
Limitations
 Related faults
 Interaction between faults not captured
 OpenRISC
 Simple, in-order processor
 Limited online visibility on current state/program progress
 ZedBoard memory footprint/Zynq FPGA size
 Determinism
 Requires careful manipulation of programs