Olay: Combat the Signs of Aging with Introspective Reliability Management
Authors:
Shuguang Feng
Shantanu Gupta
Scott Mahlke
W-QUAD (ISCA-35)
June 21, 2008
University of Michigan
Electrical Engineering and Computer Science
Motivation

“Designing Reliable Systems from Unreliable Components…”
- Shekhar Borkar (Intel)
More failures to come
Failures will be wearout-induced
[Srinivasan, DSN'04]
[Borkar, MICRO'05]
Approaches to Reliability
Tolerate Faults (reactive)
- Detect
- Diagnose
- Repair/reconfigure/recover
or…
Prevent Faults (proactive)
Circuit-level
- Margining
- High-K dielectrics
- Robust cell topologies
- Passivation
Architecture-level
- Dynamic thermal mgmt (DTM)
- Introspective reliability mgmt (IRM): targeted management based on wearout monitoring
Not All Cores Are Created Equal

Chip-multiprocessors will be subject to severe process variation
Dynamic thermal/power budgeting can be suboptimal
- Temperature is only part of the picture
- Need low-level reliability awareness
Low-level sensors measure physical changes
Wearout-aware management improves reliability-enhancement techniques
- System reconfiguration
- Dynamic voltage and frequency scaling (DVFS)
- Job assignment
Introspective Reliability Management (IRM)
[Figure: the IRM stack. The OS (scheduled jobs) sits above a virtualization layer that hosts the IRM policy. Reliability assessment flows upward from raw sensor data, through filtering and analysis, processed data, and aggregate analysis, to management decisions.]
Low-level Sensors
- delay
- leakage
- temperature
- etc.
WDU [MICRO'07]
- measure propagation delay
- track statistical trends
Olay
- track the progression of wearout
- profile workload behavior
- generate wearout-aware job schedules
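The "track statistical trends" step on raw delay measurements can be illustrated with a minimal sketch. It assumes, which the slides do not state, that propagation-delay samples are smoothed with an exponentially weighted moving average and that a module is flagged once the smoothed delay drifts a fixed fraction above its baseline. The class name `DelayTrend`, the smoothing factor, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayTrend:
    """Smoothed propagation delay for one module (hypothetical helper)."""
    baseline_ps: float                    # delay measured when the module was new
    smoothing: float = 0.1                # EWMA weight per new sample (assumed value)
    threshold: float = 0.05               # fractional slowdown treated as wearout (assumed)
    smoothed_ps: Optional[float] = None   # running average, unset until the first sample

    def update(self, sample_ps: float) -> float:
        """Fold one raw sensor sample into the running average and return it."""
        if self.smoothed_ps is None:
            self.smoothed_ps = sample_ps
        else:
            self.smoothed_ps += self.smoothing * (sample_ps - self.smoothed_ps)
        return self.smoothed_ps

    def slowdown(self) -> float:
        """Fractional increase of the smoothed delay over the baseline."""
        if self.smoothed_ps is None:
            return 0.0
        return self.smoothed_ps / self.baseline_ps - 1.0

    def is_worn(self) -> bool:
        """True once the smoothed slowdown reaches the assumed threshold."""
        return self.slowdown() >= self.threshold
```

Smoothing before thresholding keeps a single noisy reading from triggering a management decision, which is one plausible reason to filter raw sensor data before aggregate analysis.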
Wearout-aware Scheduling
[Figure: active jobs (T0 to Tn), each with a per-module reliability profile given as activity percentages, are mapped onto the available cores to form a job schedule. Cores without a job are left idle.]
Wearout-aware Scheduling
[Figure: the IRM stack (OS and scheduled jobs above the virtualization layer, which hosts the IRM policy and reliability assessment) performing job-to-core binding. Applications T0 to Tn are ranked from lightweight to heavyweight, and cores are ranked from strong to weak by remaining life (0% to 100%).]
Wearout-aware Policies

GreedyE
- Optimizes for early life performance
- Minimizes premature failures with wear-leveling
[Figure: GreedyE schedule. Jobs T0 to Tn, ordered from light to heavy, are matched to cores C0 to Cn, ordered from weak to strong.]
Wearout-aware Policies

GreedyE
- Optimizes for early life performance
- Minimizes premature failures with wear-leveling
GreedyL
- Optimizes for end of life performance
- Victimizes weak cores to maximize the life of stronger cores
GreedyA
- Hybrid of GreedyE and GreedyL
- Adapts behavior based on system utilization
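The three policies can be sketched as simple sort-and-pair heuristics. This is an illustration inferred from the one-line descriptions above, not the authors' implementation. It assumes that GreedyE pairs the heaviest jobs with the strongest cores (wear-leveling), that GreedyL pairs the heaviest jobs with the weakest cores (so strong cores are spared), and that GreedyA switches between them at a utilization threshold. The function names, the 0.5 threshold, and the direction of the switch are hypothetical.

```python
from typing import Dict, List, Optional

Binding = Dict[str, Optional[str]]  # core name -> job name, or None if idle


def _bind(jobs_heavy_first: List[str], cores_in_order: List[str]) -> Binding:
    """Pair jobs with cores position by position; leftover cores stay idle.

    If there are more jobs than cores, the lightest jobs are left unscheduled.
    """
    binding: Binding = {core: None for core in cores_in_order}
    for job, core in zip(jobs_heavy_first, cores_in_order):
        binding[core] = job
    return binding


def greedy_e(job_weight: Dict[str, float], core_life: Dict[str, float]) -> Binding:
    """Assumed GreedyE: heaviest job on the core with the most remaining life."""
    jobs = sorted(job_weight, key=job_weight.get, reverse=True)
    cores = sorted(core_life, key=core_life.get, reverse=True)  # strongest first
    return _bind(jobs, cores)


def greedy_l(job_weight: Dict[str, float], core_life: Dict[str, float]) -> Binding:
    """Assumed GreedyL: heaviest job on the core with the least remaining life."""
    jobs = sorted(job_weight, key=job_weight.get, reverse=True)
    cores = sorted(core_life, key=core_life.get)  # weakest first
    return _bind(jobs, cores)


def greedy_a(job_weight: Dict[str, float], core_life: Dict[str, float],
             utilization_threshold: float = 0.5) -> Binding:
    """Assumed GreedyA: pick a policy from utilization (needs at least one core)."""
    utilization = len(job_weight) / len(core_life)
    policy = greedy_e if utilization >= utilization_threshold else greedy_l
    return policy(job_weight, core_life)
```

For example, with jobs `{"T0": 0.9, "T1": 0.2}` and remaining core life `{"C0": 0.8, "C1": 0.3, "C2": 0.5}`, `greedy_e` leaves the weakest core C1 idle, while `greedy_l` leaves the strongest core C0 idle.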
Lifetime Reliability Simulation (FACE)
Offline Characterization
[Figure: offline characterization flow. The benchmark suite (SPEC2000 INT & FP) is run through SimAlpha, Wattch, and HotSpot to produce execution, power, and temperature traces, which are condensed into benchmark profiles.]
Synthetic Benchmarks
- representative of SPEC2000 suite
- reduces online profiling complexity
Lifetime Reliability Simulation (FACE)
[Figure: FACE flow. Offline characterization (SimAlpha, Wattch, HotSpot; benchmark suite to benchmark profiles) feeds the online simulation, which links parameter specification, workload generation, the workload simulator, the CMP simulator (simulate aging, driven by a Monte Carlo engine), and reliability management (Olay).]
Online Simulation
Callouts recovered from the figure:
- emulates OS scheduler
- tracks progression of wearout mechanisms
- monitors CMP health
- device lifetimes
- utilization pattern
- temperature traces, power traces
- hierarchical design
- profiling
- intelligent heuristics
- wearout-aware scheduling
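A Monte Carlo engine for lifetime simulation can be sketched in a few lines. The sketch below is not FACE itself. It assumes that each core's lifetime is drawn from an exponential distribution with the given MTTF (the slides do not name the distribution) and that the system fails once more than a tolerated number of cores have failed. The function names and parameters are hypothetical.

```python
import random
from typing import List


def sample_system_lifetime(core_mttf_years: List[float],
                           tolerated_failures: int,
                           rng: random.Random) -> float:
    """Draw one lifetime per core and return the time of the fatal failure.

    Requires tolerated_failures < number of cores.
    """
    lifetimes = sorted(rng.expovariate(1.0 / mttf) for mttf in core_mttf_years)
    # The system dies at the (tolerated_failures + 1)-th core failure.
    return lifetimes[tolerated_failures]


def mean_system_lifetime(core_mttf_years: List[float],
                         tolerated_failures: int = 0,
                         trials: int = 10_000,
                         seed: int = 0) -> float:
    """Average the sampled system lifetime over many independent trials."""
    rng = random.Random(seed)
    total = sum(sample_system_lifetime(core_mttf_years, tolerated_failures, rng)
                for _ in range(trials))
    return total / trials
```

Under these assumptions, four identical cores with a 10-year MTTF and no tolerated failures give a mean system lifetime near 10/4 = 2.5 years, since the minimum of n independent exponential lifetimes with mean m has mean m/n.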
Wearout Modeling

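Mean time to failure (MTTF)

$$\mathrm{MTTF}_{TDDB} \propto \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{X + \frac{Y}{T} + ZT}{kT}}$$

$$\mathrm{MTTF}_{NBTI} \propto \left(\frac{1}{V}\right)^{\gamma} e^{\frac{E_{a,NBTI}}{kT}}$$

where V is voltage, T is temperature, and k is Boltzmann's constant
MTTF defines the distribution of device lifetimes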
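How voltage and temperature move the MTTF can be checked numerically with a minimal sketch. It assumes the proportional forms MTTF_NBTI ∝ (1/V)^γ · exp(Ea/(kT)) and MTTF_TDDB ∝ (1/V)^(a − bT) · exp((X + Y/T + ZT)/(kT)). Because these are proportionalities, only ratios between operating points are meaningful. The function names are hypothetical, and any parameter values passed in are placeholders rather than values from the talk.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def mttf_nbti_relative(voltage: float, temp_k: float,
                       gamma: float, ea_ev: float) -> float:
    """Unnormalized NBTI MTTF: (1/V)^gamma * exp(Ea / (k T))."""
    return (1.0 / voltage) ** gamma * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


def mttf_tddb_relative(voltage: float, temp_k: float,
                       a: float, b: float, x: float, y: float, z: float) -> float:
    """Unnormalized TDDB MTTF: (1/V)^(a - bT) * exp((X + Y/T + ZT) / (k T))."""
    exponent = (x + y / temp_k + z * temp_k) / (BOLTZMANN_EV_PER_K * temp_k)
    return (1.0 / voltage) ** (a - b * temp_k) * math.exp(exponent)


def nbti_lifetime_ratio(v_ref: float, t_ref: float, v_new: float, t_new: float,
                        gamma: float, ea_ev: float) -> float:
    """MTTF at the new operating point divided by MTTF at the reference point."""
    return (mttf_nbti_relative(v_new, t_new, gamma, ea_ev)
            / mttf_nbti_relative(v_ref, t_ref, gamma, ea_ev))
```

With a positive γ and Ea, `nbti_lifetime_ratio` drops below 1 when either the voltage or the temperature rises. That sensitivity is what lets job assignment and DVFS trade activity against wearout.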
Damage accumulation

$$D_n = (1 - \alpha_{n-1})\, D_{n-1} = \prod_{i=0}^{n-1} (1 - \alpha_i)\, D_0$$

where α is the degradation rate,

$$\alpha_i \propto \frac{\mathrm{MTTF}_{qual}}{\mathrm{MTTF}_i}$$
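The damage-accumulation recurrence can be worked through in code. This minimal sketch assumes that D tracks remaining device health, so each interval scales it by (1 − α_i), and that α_i equals a constant times MTTF_qual / MTTF_i. The constant `scale` stands in for the unstated proportionality factor; it and the function name are hypothetical.

```python
from typing import Iterable


def remaining_health(d0: float, interval_mttfs: Iterable[float],
                     mttf_qual: float, scale: float = 0.01) -> float:
    """Apply D_n = (1 - alpha_{n-1}) * D_{n-1} once per interval.

    interval_mttfs holds the MTTF implied by the operating conditions
    (voltage, temperature) seen in each interval.
    """
    d = d0
    for mttf_i in interval_mttfs:
        alpha = scale * mttf_qual / mttf_i   # degradation rate for this interval
        alpha = min(max(alpha, 0.0), 1.0)    # keep the factor (1 - alpha) in [0, 1]
        d *= 1.0 - alpha
    return d
```

With `scale=0.1`, two intervals at the qualification point (MTTF_i = MTTF_qual) leave 0.9 × 0.9 = 0.81 of the initial health, while two intervals at half the qualification MTTF leave 0.8 × 0.8 = 0.64.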
CMP Reliability Simulation
[Figure: simulation hierarchy, from CMP to core to module to transistor]
CMPs:
- variable number of cores
- model systematic variation
Cores:
- Alpha 21264-type processor
Modules:
- experience load-dependent stress
- smallest granularity of temperature modeling
Transistors:
- multiple mechanisms evolve independently
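The CMP, core, module, transistor hierarchy maps naturally onto nested records. The sketch below is one hypothetical layout, not the simulator's data model. It assumes that each wearout mechanism keeps its own remaining-health value per transistor and that a unit is only as healthy as its weakest part (a series failure model). All class and field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transistor:
    # One remaining-health value per mechanism; each evolves independently.
    health: Dict[str, float] = field(
        default_factory=lambda: {"TDDB": 1.0, "NBTI": 1.0})

    def remaining(self) -> float:
        """Health of the most degraded mechanism."""
        return min(self.health.values())


@dataclass
class Module:
    name: str
    temperature_k: float  # temperature is modeled at module granularity
    transistors: List[Transistor] = field(default_factory=list)

    def remaining(self) -> float:
        return min((t.remaining() for t in self.transistors), default=1.0)


@dataclass
class Core:
    modules: List[Module] = field(default_factory=list)

    def remaining(self) -> float:
        return min((m.remaining() for m in self.modules), default=1.0)


@dataclass
class CMP:
    cores: List[Core] = field(default_factory=list)

    def weakest_core(self) -> int:
        """Index of the core with the least remaining health (needs >= 1 core)."""
        return min(range(len(self.cores)), key=lambda i: self.cores[i].remaining())
```

Keeping temperature on the module and health on the transistor mirrors the granularities listed above: stress is applied per module, while failure is decided per device.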
Evaluation

Policies
- Random (baseline), GreedyE, GreedyL, GreedyA
Figures of merit
- Failure distribution
- Useful work performed prior to system failure
Varied system parameters
- CMP size
- System utilization
- Sensor error
Failure Distribution
[Figure: failure distribution for a 16-core CMP]
Sensitivity to System Utilization
[Figure: sensitivity to system utilization for a 16-core CMP]
Sensitivity to CMP Size
[Figure: sensitivity to CMP size at 100% utilization with GreedyE]
Sensitivity to Sensor Error
[Figure: sensitivity to sensor error for a 16-core CMP at 100% utilization with GreedyE]
Conclusions

Heterogeneity exists in both CMPs and their workloads
Wearout-aware job assignments effectively exploit this heterogeneity
- Real-time health monitoring (low-level sensors)
- CMPs augmented with Olay perform up to 20% more useful work
Proper high-level analysis and profiling are essential for enhancing lifetime reliability
Questions?