Understanding the Impact of Gate-Level
Physical Reliability Effects on Whole
Program Execution
Raghuraman Balasubramanian
Karthikeyan Sankaralingam
PERSim
A framework with unprecedented fidelity in studying the
end-to-end physical effects of reliability at the gate-level
while running entire programs.
Understanding Reliability
How does reliability affect the processor over its lifetime?
What happens to the user application?
Architecture Errors
Logic Fault
Wearout
Particle Strikes
Permanent faults
Change in gate behavior
State-Of-The-Art
Simulators capable of running full programs
What happens to the user application?
Architecture Errors
Logic Fault
Wearout
Particle Strikes
Permanent faults
Change in gate behavior
Device-level models that capture the effects of reliability physics
PERSim
Simulators capable of running full programs
A framework with unprecedented fidelity in studying
the end-to-end physical effects of reliability at the
gate-level while running entire programs.
Device-level models that capture the effects of reliability physics
Why is this important?
Device level problems ➔
Microarchitectural / Architectural level.
M. Agarwal, B. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its
application to transistor aging. In VTS ’07.
T. M. Austin. DIVA: A reliable substrate for deep submicron
microarchitecture design. In MICRO ’99.
R. Balasubramanian and K. Sankaralingam. Virtually aged sampling DMR:
Unifying circuit failure detection and circuit failure prediction. In MICRO ’13.
J. Blome, S. Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout
detection. In MICRO ’07.
K. Bowman, J. Tschanz, C. Wilkerson, S. Lu, T. Karnik, V. De, and S. Borkar.
Circuit techniques for dynamic variation tolerance. In DAC ’09.
Evaluations ➔
Small hardware structures, abstracted physics
Executive Summary
PERSim
Model physical effects at gate-level
See impact running full programs on a full processor
At high simulation speeds (25 million cycles per second)
With good signal observability
Fine-grained control over fault injection
Demonstration
Evaluation of 4 recently proposed techniques
End-to-end transient fault analysis
Outline
Motivation and Overview
How do we do it
Motivating running example
Breaking it down into mechanisms
Implementation
Evaluations using PERSim
Questions
Virtually Aged Sampling DMR
Virtual Aging
Fault Exposure
• In most gates the faults are automatically exposed
• A new mechanism to expose faults in other gates
Detect Errors
Evaluation using PERSim
[Figure: evaluation flow. Labels: Applications, µP, Input Sequences, Wearout, Delay as a function of Time/Vdd, Delay Aware Simulation, Fault Vector, DMR, Error?]
Mechanisms – Fault Modeling
[Figure: evaluation flow. Labels: Applications, Input Sequences, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Delay Aware Simulation, Fault Vector, DMR, Error?]
Mechanisms – Delay Aware Simulation
[Figure: evaluation flow. Labels: Applications, Input Sequences, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Delay Aware Simulation, Fault Vector, DMR, Error?]
Mechanisms – Input Sequence Extraction
[Figure: evaluation flow. Labels: Applications, Input Sequence Extraction, Input Sequences, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Delay Aware Simulation, Fault Vector, DMR, Error?]
Mechanisms – Fault Injection
[Figure: evaluation flow. Labels: Applications, Input Sequence Extraction, Input Sequences, Fault Injection & Deterministic Re-execution, Wearout, Fault Modeling, Delay as a function of Time/Vdd, Delay Aware Simulation, Fault Vector, DMR, Error?]
Mechanisms
Input Sequence Extraction
Delay Aware Simulation
Fault Modeling
Fault Injection and Deterministic Re-execution
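As a rough illustration of how fault injection and deterministic re-execution fit together, the following Python sketch runs a toy deterministic design twice, once fault-free and once with a single bit flipped at a chosen cycle, and reports the first cycle at which the two traces diverge. The toy design (`step`), the input values, and all function names are hypothetical. PERSim itself operates on a real processor, which this sketch does not model.

```python
def step(state, inp):
    """Toy deterministic 'design': an 8-bit accumulator (hypothetical)."""
    return (state + inp) & 0xFF


def run(inputs, fault=None):
    """Replay `inputs`; optionally flip one state bit at one cycle.

    `fault` is a (cycle, bit) pair, or None for the fault-free run.
    Returns the per-cycle trace of the state signal.
    """
    state, trace = 0, []
    for cycle, inp in enumerate(inputs):
        state = step(state, inp)
        if fault is not None and fault[0] == cycle:
            state ^= 1 << fault[1]  # inject a single bit flip
        trace.append(state)
    return trace


def first_divergence(golden, faulty):
    """Return the first cycle where the traces differ, or None."""
    for cycle, (g, f) in enumerate(zip(golden, faulty)):
        if g != f:
            return cycle
    return None


inputs = [3, 5, 7, 11, 13]           # recorded input sequence (made up)
golden = run(inputs)                 # fault-free reference run
faulty = run(inputs, fault=(2, 0))   # flip bit 0 at cycle 2
print(first_divergence(golden, faulty))  # prints 2
```

Because the replay is deterministic, any difference between the two traces can be attributed to the injected fault rather than to run-to-run variation.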
Outline
Motivation and Overview
How do we do it
Implementation
Input Sequence Extraction & Fault Injection
Delay Aware Simulation
Fault Modeling
Evaluations using PERSim
Questions
Implementation
Input Sequence Extraction
Fault Injection and Deterministic Re-execution
Be able to run full programs
Input Sequence Extraction
Input Sequence Extraction
Fault Injection and Deterministic Re-execution
Be able to observe signals on a cycle-by-cycle basis
Fault Injection
Input Sequence Extraction
Fault Injection and Deterministic Re-execution
Fine-grained control over fault injection
Implementation
Input Sequence Extraction
Fault Injection and Deterministic Re-execution
Automate – run multiple tests at the push of a button
Implementation
25 million cycles per second per board
Full SPEC benchmarks
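To put the quoted rate in perspective, the short Python sketch below converts a cycle count into wall-clock time on one board. The one-trillion-cycle workload is a hypothetical figure chosen for illustration, not a number from the talk.

```python
CYCLES_PER_SECOND = 25_000_000  # per-board rate quoted on the slide


def emulation_seconds(cycles, rate=CYCLES_PER_SECOND):
    """Wall-clock seconds to run `cycles` cycles on one board."""
    return cycles / rate


# Hypothetical workload: one trillion cycles (not a figure from the talk).
seconds = emulation_seconds(1_000_000_000_000)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
# prints: 40000 s, about 11.1 hours
```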
Fault Modeling
Reliability phenomena ⇒ behavior of gates?
Wearout
Synopsys HSPICE+MOSRA
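The slides do not spell out how aged gate delays turn into faults. One common way to think about it is a timing check: a path misbehaves once its accumulated gate delay exceeds the clock period. The Python sketch below assumes that view. The gate delays, clock period, and per-gate slowdown percentages are made-up numbers; in a real flow they would come from a circuit simulator such as HSPICE with MOSRA.

```python
def path_delay(gate_delays_ps, slowdown_pct):
    """Sum gate delays along a path after applying per-gate slowdown."""
    return sum(d * (1 + p / 100) for d, p in zip(gate_delays_ps, slowdown_pct))


def violates_timing(gate_delays_ps, slowdown_pct, clock_period_ps):
    """True if the (possibly aged) path no longer fits in one clock period."""
    return path_delay(gate_delays_ps, slowdown_pct) > clock_period_ps


fresh = [100, 150, 200]   # hypothetical gate delays in picoseconds
no_aging = [0, 0, 0]      # percent slowdown per gate when new
aged = [10, 10, 20]       # hypothetical percent slowdown after wearout

print(violates_timing(fresh, no_aging, clock_period_ps=500))  # False (450 ps)
print(violates_timing(fresh, aged, clock_period_ps=500))      # True (about 515 ps)
```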
Fault Modeling
Reliability phenomena ⇒ behavior of gates?
Transient Faults
Charge Accumulation Model
Fault Modeling
Reliability phenomena ⇒ behavior of gates?
Permanent Faults
Probabilistic Models
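The slide names probabilistic models for permanent faults without giving details. As one possible reading, the Python sketch below draws a stuck-at fault for each gate independently with a fixed probability, using a seeded generator so that the same fault vector can be regenerated for re-execution. The gate names, the probability, and the choice of stuck-at faults are all assumptions made for illustration.

```python
import random


def sample_fault_vector(gates, p_fault, seed):
    """Map each faulty gate to a stuck-at value (0 or 1).

    Each gate fails independently with probability `p_fault`.
    A fixed `seed` makes the draw reproducible.
    """
    rng = random.Random(seed)
    faults = {}
    for gate in gates:
        if rng.random() < p_fault:
            faults[gate] = rng.randint(0, 1)  # stuck-at-0 or stuck-at-1
    return faults


gates = [f"g{i}" for i in range(1000)]  # hypothetical gate names
vector = sample_fault_vector(gates, p_fault=0.01, seed=42)
print(len(vector), "faulty gates out of", len(gates))
```

Calling `sample_fault_vector` again with the same seed returns the same fault vector, which is what a repeatable experiment would need.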
Implementation
[Figure: evaluation flow. Labels: Applications, Input Sequences, Delay Aware Simulation, Fault Vector, DMR, Error?]
Fault Modeling
Wearout: HSPICE+MOSRA
Transient Faults: Charge Accumulation Model
Permanent Faults: Probabilistic Models
Outline
Motivation and Overview
How do we do it
Implementation
Evaluations using PERSim
Reliability Techniques
Key Results
Questions
Reliability Techniques
Circuit failure prediction (Wearout)
FIRST [Smolens et al., SELSE’07]
WearMon [Zandian et al., DSN’12]
Online Wearout Prediction [Blome et al., MICRO’07]
Transient Fault Analysis
Gate level modeling of particle strike
Application level impact analysis
Permanent Fault Detection
Sampling-DMR [Nomura et al., ISCA’11]
Key Results
Circuit failure prediction (Wearout)
Gates in non-critical paths are not covered.
PERSim enables full processor coverage
Modeling Hole Covered
Transient Fault Analysis
Accurate modeling of particle strikes on individual gates
⇒ impact on full programs
Cross-layer transient fault analysis
Permanent Fault Detection
Cycle-by-cycle error traces running full programs
Fine-grained signal visibility
Executive Summary
PERSim
Model physical effects at gate-level
See impact running full programs on a full processor
At high simulation speeds (25 million cycles per second)
With good signal observability
Fine-grained control over fault injection
Demonstration
Evaluation of 4 recently proposed techniques
End-to-end transient fault analysis
www.persim.org
Backup slides
Limitations
Related faults
Interaction between faults not captured
OpenRISC
Simple, in-order processor
Limited online visibility on current state/program progress
ZedBoard memory footprint/Zynq FPGA size
Determinism
Requires careful manipulation of programs