Transcript Document

From Crash-and-Recover to Sense-and-Adapt: Our Evolving Models of Computing Machines
Rajesh K. Gupta
UC San Diego.
To a software designer, all chips look alike. To a hardware engineer, a chip is delivered as per contract in a data sheet.
Reality is: COMPUTERS ARE BUILT ON STUFF THAT IS IMPERFECT AND…
Changing From Chiseled Objects to Molecular Assemblies
[45nm implementation of the LEON3 processor core; courtesy P. Gupta, UCLA]
Engineers Know How to “Sandbag”
• PVTA margins add to guardbands
  – Static process variation: effective transistor channel length and threshold voltage
  – Dynamic variations: temperature fluctuations, supply voltage droops, and device aging (NBTI, HCI)
[Figure: a clock guardband added on top of the actual circuit delay to cover aging, temperature, VCC droop, and across-wafer frequency variation]
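As a rough illustration (my sketch, not from the slides), the guardband can be read as additive worst-case margins stacked on the nominal delay:

T_{clk} \;\geq\; t_{nominal} + \Delta t_{process} + \Delta t_{temperature} + \Delta t_{droop} + \Delta t_{aging}

Setting the clock for every term at its worst case simultaneously is precisely the overdesign the rest of the talk goes after.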
Uncertainty Means Unpredictability
• VLSI Designer: Eliminate It
  – Capture physics into models
  – Statistical or plain-old Monte Carlo
  – Manufacturing, temperature effects
  [Callouts: simulate a 'degraded' netlist with model input changes (ΔVth); deterministic simulations capture known physical processes (e.g., aging); multiple Monte Carlo simulations wrapped around a nominal model]
• Architect: Average It Out
  – Workload (dynamic) variations
• Software, OS: Deny It
  – Simplify and re-organize OS/tasks, breaking them into parts that are precise (worst case) and imprecise (average)
Each doing their own thing, massive overdesign…
Let us step back a bit: HW-SW Stack
[Stack figure: applications on top of the operating system on top of the Hardware Abstraction Layer (HAL), drawn against an axis of time or part]
Let us step back a bit: HW-SW Stack
[Same stack over overdesigned hardware: the spread across time or part amounts to 20x in sleep power and 50% in performance, and the overdesigned hardware costs a 40% larger chip, 35% more active power, and 60% more sleep power]
What if?
[The same stack (applications, operating system, HAL), now over underdesigned hardware, across time or part]
New Hardware-Software Interface…
[Underdesigned hardware with minimal variability handling in hardware; opportunistic software above the HAL takes the place of traditional fault tolerance, across time or part]
UnO Computing Machines Seek Opportunities Based on Sensing Results
• Do nothing (elastic user, robust app)
• Change hardware operating point (disabling parts of the cache, changing V-f)
• Change algorithm parameters (codec setting, duty-cycle ratio)
• Change algorithm implementation (alternate code path, dynamic recompilation)
Metadata mechanisms: reflection, introspection
Variability signatures (from models): cache bit map, CPU speed-power map, memory access time, ALU error rates
Variability manifestations (from sensors): faulty cache bits, delay variation, power variation
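To make the signature metadata concrete, here is a minimal sketch (hypothetical types, sizes, and field names, not the project's actual interface) of a per-chip variability signature holding the items listed above:

#include <stdint.h>

/* Hypothetical per-chip variability signature, populated from models and
 * on-chip sensors, and consulted by the opportunistic software layer. */
#define NUM_CACHE_LINES 4096
#define NUM_DVFS_POINTS 8

struct variability_signature {
    /* cache bit map: one bit per line, 1 = faulty, disable at runtime */
    uint32_t faulty_cache_lines[NUM_CACHE_LINES / 32];
    /* cpu speed-power map: measured frequency (MHz) and power (mW) per DVFS point */
    uint16_t freq_mhz[NUM_DVFS_POINTS];
    uint16_t power_mw[NUM_DVFS_POINTS];
    /* memory access time in core cycles, as measured on this part */
    uint8_t  mem_access_cycles;
    /* observed ALU timing-error rate, in errors per million operations */
    uint32_t alu_errors_per_million;
};

/* Example adaptation decision: skip a cache line the signature marks faulty. */
static inline int cache_line_usable(const struct variability_signature *sig, int line)
{
    return !((sig->faulty_cache_lines[line / 32] >> (line % 32)) & 1u);
}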
UnO Computing Machines: Taxonomy of Underdesign
[Flow figure, courtesy Puneet Gupta/UCLA: a nominal design under performance constraints spans software and hardware; manufacturing yields dies; hardware characterization tests and burn-in produce a signature; die-specific adaptation yields manufactured dies with stored signatures]
Several Fundamental Questions
• How do we distinguish between code that needs to be accurate and code that does not?
  – How fine-grained are these distinctions (or do they have to be)?
• How do we communicate this information across the stack in a manner that is robust and portable?
  – And error-controllable (= safe)?
• What is the model of error that should be used in designing UnO machines?
Building Machines that Leverage the Move from Crash & Recover to Sense & Adapt
Expedition Grand Challenge & Questions
“Can microelectronic variability be controlled and utilized in building better computer systems?”
Three goals:
a. Address fundamental technical challenges (understand the problem)
b. Create experimental systems (proof-of-concept prototypes)
c. Pursue education and broader-impact opportunities (ensure training of future talent)
What are the most effective ways to detect variability?
What are the software-visible manifestations?
What are software mechanisms to exploit variability?
How can designers and tools leverage adaptation?
How do we verify and test HW-SW interfaces?
Thrusts traverse institutions on testbed vehicles, seeding various projects
Group A: Signature Detection and Generation; Group B: Variability Mitigation Measures; Group C: Opportunistic Software and Abstractions
Projects include:
• Characterizing variability in power consumption for modern computing platforms, and its implications
• Mitigating variability in solid-state storage devices
• Effective error resilience
• Runtime support and software adaptation for variable hardware
• Hardware solutions to better understand and exploit variability
• Negative bias temperature instability and electromigration
• Probabilistic analysis of faulty hardware
• VarEmu: emulation-based testbed for variability-aware software
• Memory-variability-aware runtime systems
• Understanding and exploiting variability in flash memory devices
• Variability-aware opportunistic system software stack
• Design-dependent ring oscillator and software testbed
• FPGA-based variability simulator
• Application robustification for stochastic processors
• Executing programs under relaxed semantics
Observe and Control Variability Across the Stack
The steps to build variability abstractions up to the SW layer:
• Instruction-Level Vulnerability (ILV)
• Sequence-Level Vulnerability (SLV)
• Procedure-Level Vulnerability (PLV)
• Task-Level Vulnerability (TLV)
By the time we get to TLV, we are in a parallel software context: instruct the OpenMP scheduler, even create an abstraction for programmers to express irregular and unstructured parallelism (code refactoring).
Monitor manifestations from the instruction level to the task level.
[ILV, SLV, PLV, TLV] Rahimi et al., DATE'12, ISLPED'12, TC'13, DATE'13
Closer to HW: Uncertainty Manifestations
• The most immediate manifestations of variability are in path delay and power variations.
  – Path delay variation has been addressed extensively in delay-fault detection by the test community.
• With variability, it is possible to do better by focusing on the actual mechanisms.
  – For instance, a major source of timing variation is voltage droop, and errors matter only when they end up in a state change.
Combine these two observations and you get a rich literature in recent years on handling variability-induced errors: Razor, EDS, TRC, …
Detecting and Correcting Timing Errors
• Detect the error, tune supply voltage to reach a target error rate, borrow time, stretch the clock
  – Exploit detection circuits (e.g., for voltage droops), double sampling with shadow latches, and the data dependence of circuit delays
  – Enable reduction in voltage margin
  – Manage timing guardbands and voltage margins
  – Tunable replica circuits allow non-intrusive operation
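To illustrate the control idea (a sketch of my own, not any specific Razor implementation), a feedback loop can shave supply voltage until the observed timing-error rate approaches a small target, and restore margin when it overshoots; the hardware hooks below are assumptions, not real register names:

#include <stdint.h>

/* Hypothetical platform hooks; real systems expose these via a PMIC and
 * error-counter registers. */
extern uint32_t read_timing_error_count(void);   /* errors since last read */
extern uint32_t read_cycle_count(void);          /* cycles since last read */
extern void     set_vdd_millivolts(uint32_t mv);

#define VDD_MIN_MV   720
#define VDD_MAX_MV  1100
#define VDD_STEP_MV   10
#define TARGET_ERR_PER_MILLION 100   /* tolerated rate; detected errors are replayed */

void tune_vdd_step(uint32_t *vdd_mv)
{
    uint32_t errors = read_timing_error_count();
    uint32_t cycles = read_cycle_count();
    if (cycles == 0)
        return;
    /* errors per million cycles, computed without floating point */
    uint64_t rate = (uint64_t)errors * 1000000u / cycles;

    if (rate > TARGET_ERR_PER_MILLION && *vdd_mv + VDD_STEP_MV <= VDD_MAX_MV)
        *vdd_mv += VDD_STEP_MV;      /* too many errors: restore margin */
    else if (rate < TARGET_ERR_PER_MILLION / 2 && *vdd_mv - VDD_STEP_MV >= VDD_MIN_MV)
        *vdd_mv -= VDD_STEP_MV;      /* comfortably below target: shave margin */

    set_vdd_millivolts(*vdd_mv);
}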
Sensing: Razor, Razor II, EDS, Bubble Razor
• Double sampling (Razor I) [Ernst'03]
• Transition detector with time borrowing [Bowman'09]
• Razor II [Das'09]
• Double sampling with time borrowing [Bowman'09]
• EDS [Bowman'11]
Task Ingredients: Model, Sense, Predict, Adapt
I. Sense & Adapt: observation using in-situ monitors (Razor, EDS) with cycle-by-cycle correction (leveraging CMOS knobs or replay)
II. Predict & Prevent: relying on external or replica monitors; a model-based rule derives an adaptive guardband to prevent errors
[Figure: sense (detect) and adapt (correct) illustrated on example path delays (4ns, 3ns, 5ns), versus model-plus-sensor prediction that adds margin (1ns) ahead of time]
Don’t Fear Errors: Bits Flip, Instructions Don’t Always Execute Correctly
CHARACTERIZE, MODEL, PREDICT
Bit Error Rate, Timing Error Rate, Instruction Error Rate, …
Characterize Instructions and Instruction Sequences for Vulnerability to Timing Errors
Characterize LEON3 in 65nm TSMC across the full range of operating conditions: (0.81V, 125°C), (0.81V, -40°C), (0.99V, -40°C), (0.72V, 125°C); adaptive guardbanding over (-40°C to 125°C, 0.72V to 1.1V).
[Design-time flow: VHDL → Design Compiler → Verilog netlist → IC Compiler → netlist + parasitics (TSMC 45nm libraries) → PrimeTime → netlist + SDF → ModelSim simulations driven by an instruction generator (ILV flow) and by a sequence generator with gcc-extracted high-frequency sequences (SLV flow), each with operand distributions and testbenches, producing ILV and SLV metadata; a PLUT module and adaptive clock close the loop at runtime for intolerant apps.]
Dynamic variations cause the critical path delay to increase by a factor of 6.1×.
Generate ILV, SLV “Metadata”
• The ILV (SLV) for each instruction_i (sequence_i) at every operating condition is quantified as:

  ILV(i, V, T, cycle\_time) = \frac{1}{N_i} \sum_{j=1}^{N_i} Violation_j

  SLV(i, V, T, cycle\_time) = \frac{1}{M_i} \sum_{j=1}^{M_i} Violation_j

  Violation_j = \begin{cases} 1 & \text{if any pipeline stage violates timing at cycle } j \\ 0 & \text{otherwise} \end{cases}

  – where N_i (M_i) is the total number of clock cycles in the Monte Carlo simulation of instruction_i (sequence_i) with random operands.
  – Violation_j indicates whether any stage violates timing at clock cycle j.
• ILV_i (SLV_i) is thus the fraction of violated cycles over the total simulated cycles for instruction_i (sequence_i).
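The definition translates directly into post-processing of the gate-level simulation results; a minimal sketch (hypothetical inputs, assuming a per-cycle violation flag extracted from the SDF-annotated simulation):

#include <stddef.h>

/* violation[j] = 1 if any pipeline stage violated timing at cycle j of the
 * Monte Carlo simulation of instruction i at a given (V, T, cycle_time). */
double compute_ilv(const unsigned char *violation, size_t n_cycles)
{
    size_t violated = 0;
    for (size_t j = 0; j < n_cycles; ++j)
        violated += violation[j] ? 1 : 0;
    return n_cycles ? (double)violated / (double)n_cycles : 0.0;
}
/* SLV is identical, computed over the M_i cycles of sequence i. */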
Now, I am going to make a jump over characterization data…
Connect the Dots from Paths to Instructions
[Plots: number of failed paths (×10000) per pipeline stage (Fetch, Decode, Reg. access, Execute, Memory, Write back), swept over VDD (0.72V, 0.88V, 1.10V) at T = 125°C and over temperature (-40°C, 0°C, 125°C) at VDD = 1.1V]
Observe: the execute and memory stages are sensitive to V/T variations, and also exhibit a large number of critical paths in comparison to the rest of the processor.
Hypothesis: we anticipate that instructions that significantly exercise the execute and memory stages are likely to be more vulnerable to V/T variations.
Instruction-Level Vulnerability (ILV): for SPARC V8 instructions, (V, T, F) are varied and ILV is evaluated for every instruction with random operands; SLV is evaluated for high-frequency sequences of instructions.
ILV and SLV: Partition into Groups According to Vulnerability to Timing Errors
• For every operating condition:
  ILV(3rd class) ≥ ILV(2nd class) ≥ ILV(1st class)
  SLV(Class II) ≥ SLV(Class I)
• ILV: 1st class = logical and arithmetic; 2nd class = memory; 3rd class = multiply and divide.
• SLV: Class II = mixtures of memory, logic, and control; Class I = logical and arithmetic.
• Based on the top 20 high-frequency sequences from 80 billion dynamic instructions across 32 benchmarks.
ILV and SLV classification for the integer SPARC V8 ISA.
Use Instruction Vulnerabilities to Generate Better Code, Calls/Returns
APPLY STATICALLY, AT COMPILE TIME, TO ACHIEVE HIGHER INSTRUCTION THROUGHPUT AND LOWER POWER
Now Use ILV, SLV to Dynamically Adapt Guardbands
I. Application code (application type, ILV):
  – Error-tolerant applications: duplicate critical instructions while satisfying the fidelity metric.
  – Error-intolerant applications: increase the percentage of Class I sequences, i.e., increase the number of arithmetic instructions relative to memory and control-flow instructions, e.g., through loop unrolling.
II. Variability-aware (VA) compiler with SLV metadata, feeding the runtime:
[Figure: LEON3 core (IF, ID, RA, EX, ME, WB pipeline with I$ and D$), a CPM sensing (V, T), and a PLUT accessed via memory-mapped I/O driving an adaptive clocking circuit for the core clock]
• Adaptive clock scaling for each class of sequences mitigates the conservative inter- and intra-corner guardbanding.
• At runtime, in every cycle, the PLUT module sends the desired frequency to the adaptive clocking circuit, using the characterized SLV metadata of the current sequence (Seq_i) and the operating condition monitored by the CPM.
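A minimal sketch of that per-cycle rule is shown below (illustrative table contents and interface, not the actual PLUT hardware encoding):

#include <stdint.h>

enum slv_class { SLV_CLASS_I = 0, SLV_CLASS_II = 1 };

#define NUM_CORNERS 4   /* e.g., (0.72V,125C), (0.81V,125C), (0.81V,-40C), (0.99V,-40C) */

/* Characterized cycle times in picoseconds, one entry per (corner, class).
 * Values here are placeholders, not the published characterization data. */
static const uint16_t plut_cycle_ps[NUM_CORNERS][2] = {
    { 5200, 6100 },
    { 3900, 4600 },
    { 3600, 4200 },
    { 2800, 3300 },
};

/* Every cycle: pick the guardband for the sequence class currently in flight
 * at the operating corner reported by the critical-path monitor (CPM). */
static inline uint16_t adaptive_cycle_ps(unsigned cpm_corner, enum slv_class cls)
{
    if (cpm_corner >= NUM_CORNERS)
        cpm_corner = 0;              /* fall back to the slowest (most conservative) corner */
    return plut_cycle_ps[cpm_corner][cls];
}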
Utilizing SLV at Compile Time
[Charts: percentage of Class I sequences, by sequence length (2-7), (a) without and (b) with loop unrolling]
• Applying loop unrolling produces a longer chain of ALU instructions; as a result, the percentage of Class I sequences increases, up to 41% and on average by 31%.
• Hence, adaptive guardbanding benefits from this compiler transformation to further reduce the guardband for Class I sequences.
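For intuition, the transformation is ordinary loop unrolling; the point here is that the unrolled body is a longer run of back-to-back arithmetic instructions relative to branches, so more executed sequences fall into Class I. A generic sketch (not the benchmark code used in the study):

/* Original loop: each arithmetic operation is interleaved with loop control. */
int sum_scaled(const int *a, int n, int k)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += k * a[i];
    return s;
}

/* Unrolled by 4 (assumes n is a multiple of 4 for brevity): the body now has
 * fewer branches per arithmetic instruction, so longer Class I sequences can
 * run with the tighter Class I guardband. */
int sum_scaled_unrolled(const int *a, int n, int k)
{
    int s = 0;
    for (int i = 0; i < n; i += 4) {
        s += k * a[i];
        s += k * a[i + 1];
        s += k * a[i + 2];
        s += k * a[i + 3];
    }
    return s;
}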
Effectiveness of Adaptive Guardbanding
• Adaptive guardbanding achieves up to 1.9× performance improvement for error-tolerant (probabilistic) applications in comparison to traditional worst-case design.
• Using online SLV coupled with offline compiler techniques enables the processor to achieve a 1.6× average speedup for error-intolerant applications, compared to recent work [Hoang'11], by adapting the cycle time to dynamic variations (inter-corner) and to different instruction sequences (intra-corner).
[Charts: normalized throughput at (0.72V, 125°C), (0.81V, 0°C), and (0.81V, 125°C)]
Example: Procedure Hopping in a Clustered CPU, Each Core with Its Own Voltage Domain
[Figure: 16-core cluster (Core0-Core15) with per-core CPMs, power supply selectors (PSS), level shifters, and a logarithmic interconnect to TCDM banks, with VA-VDD-hopping across high/typical/low VDD; per-core maximum frequencies range from 820 to 917 at VDD = 0.81V and from 1370 to 1408 at VDD = 0.99V, and VA-VDD-hopping chooses between (0.81V, 0.99V) per core, compared against DFS]
• Statically characterize each procedure for PLV
• A core increases its voltage if the monitored delay is high
• A procedure hops from one core to another if its voltage variation is high
• Less than 1% cycle overhead on EEMBC
HW/SW Collaborative Architecture to Support Intra-cluster Procedure Hopping
[Figure: caller core_i and callee core_k, each with an operating-condition monitor and interrupt controller, share the TCDM (local stacks, shared stack, PHIT, heap) and a shared L1 I$]

Caller core_i:
  …
  call ProcX                // conventional compile
  call ProcX@Caller         // VA-compile
  …
  ProcX@Caller:
    if (calculate_PLV <= PLV_threshold)
      call ProcX
    else
      create_shared_stack_layout
      set_PHIT_for_ProcX
      send_broadcast_req
      set_timer
      wait_on_ack_or_timer
  …
  Broadcast_ack_ISR:
    if (statusX_PHIT == done)
      load_context&return_from_SSPX

Callee core_k:
  …
  ProcX@Callee:
    if (calculate_PLV <= PLV_threshold)
      set_statusX_PHIT = running
      load_context&param_from_SSPX
      set_all_param&pointers
      call ProcX
      store_context_to_SSPX
      set_statusX_PHIT = done
      send_broadcast_ack
    else
      resume_normal_execution
  …
  Broadcast_req_ISR:
    ProcX@Callee = search_in_PHIT
    call ProcX@Callee

• The code is easily accessible via the shared L1 I$.
• Data and parameters are passed through the shared stack in the TCDM (Tightly Coupled Data Memory).
• A procedure hopping information table (PHIT) keeps the status of a migrated procedure.
Combine Characterization with Online Recognition
APPLY: MODEL, SENSE, AND ADAPT DYNAMICALLY
Consider a Full Permutation of PVTA Parameters
• 10 32-bit integer and 15 single-precision FP functional units (FUs)
• For each FU_i working with t_clk under given PVTA variations, we define the Timing Error Rate (TER):

  TER(FU_i, t_{clk}, V, T, P, A) = \frac{\sum CriticalPaths(FU_i, t_{clk}, V, T, P, A)}{\sum Paths(FU_i)} \times 100

[Characterization flow: FU Verilog from DesignWare libraries → Design Compiler → IC Compiler with 45nm corner and variability-aware libraries → netlist & SPEF → PrimeTime SSTA & STA → timing error rate analysis → MATLAB linear classifier → parametric model]

Characterization sweep:
  Parameter         Start    End      Step    # of Points
  Voltage           0.88V    1.10V    0.01V   23
  Temperature       0°C      120°C    10°C    13
  Process (σWID)    0%       9.6%     3.2%    4
  Aging (ΔVth)      0mV      100mV    25mV    5
  t_clk             0.2ns    5.0ns    0.2ns   25
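A sketch of how TER would be computed from the per-path timing results (hypothetical data layout; in the actual flow the path delays come from the PrimeTime STA/SSTA reports):

#include <stddef.h>

/* delay_ps[p]: analyzed delay of path p of a functional unit at a given
 * (V, T, P, A) point; a path is "critical" if it exceeds the clock period. */
double timing_error_rate(const double *delay_ps, size_t n_paths, double tclk_ps)
{
    size_t critical = 0;
    for (size_t p = 0; p < n_paths; ++p)
        if (delay_ps[p] > tclk_ps)
            ++critical;
    return n_paths ? 100.0 * (double)critical / (double)n_paths : 0.0;
}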
Parametric Model Fitting
[Flow: PVTA parameters and t_clk feed the HFG ASIC analysis flow for TER; linear discriminant analysis maps them to classes of TER, yielding a parametric model]

  \hat{y} = \arg\min_{y = 1,\dots,K} \sum_{k=1}^{K} P(k \mid x)\, C(y \mid k)

  P(x \mid k) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)

  P(k \mid x) = \frac{P(x \mid k)\, P(k)}{P(x)}, \qquad cost(k) = \sum_{i=1}^{K} P(i \mid x)\, C(k \mid i)

TER classes:
  TER = 0%           → Class0 (C0)
  0% < TER ≤ 33%     → ClassLow (CL)
  33% < TER ≤ 66%    → ClassMedium (CM)
  66% < TER ≤ 100%   → ClassHigh (CH)

• We used supervised learning (linear discriminant analysis) to generate a parametric model at the level of each FU that relates PVTA parameter variations and t_clk to classes of TER.
• On average, across all FUs the resubstitution error is 0.036, meaning the models classify nearly all training data correctly.
• For extra characterization points, the model makes correct estimates for 97% of out-of-sample data; the remaining 3% is misclassified into the high-error-rate class, CH, and thus gets a safe guardband.
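Once the model is trained, the online decision reduces to the argmin rule above; a sketch (assuming the class posteriors P(k|x) have already been evaluated for the current PVTA and t_clk point):

#include <stddef.h>

#define NUM_TER_CLASSES 4   /* C0, CL, CM, CH */

/* posterior[k] = P(k | x); cost[y][k] = C(y | k), the penalty of predicting
 * class y when the true class is k (misclassifying into CH is cheap because
 * it only costs guardband, never correctness). */
int predict_ter_class(const double posterior[NUM_TER_CLASSES],
                      const double cost[NUM_TER_CLASSES][NUM_TER_CLASSES])
{
    int best = 0;
    double best_cost = 0.0;
    for (int y = 0; y < NUM_TER_CLASSES; ++y) {
        double expected = 0.0;
        for (int k = 0; k < NUM_TER_CLASSES; ++k)
            expected += posterior[k] * cost[y][k];
        if (y == 0 || expected < best_cost) {
            best = y;
            best_cost = expected;
        }
    }
    return best;   /* 0 = C0, 1 = CL, 2 = CM, 3 = CH */
}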
Delay Variation and TER Characterization
[Surface plots: timing error rate (%) and delay (ns) as functions of VDD (0.88V-1.1V) and temperature, for the four corners (P(σWID), A(ΔVth)) ∈ {0%, 9.6%} × {0mV, 100mV}; a tree enumerates the full (P, A, T, V) characterization space]
• During design time, the delay of the FP adder has a large uncertainty of [0.73ns, 1.32ns], since the actual values of the PVTA parameters are unknown.
Hierarchical Sensors Observability
• The question is: what mix of monitors would be useful?
• The more sensors we provide for an FU, the better the conservative guardband reduction for that FU.
• The guardband of the FP adder can be reduced by up to 8% (P_sensor), 24% (PA_sensors), 28% (PAT_sensors), and 44% (PATV_sensors).
[Chart: t_clk (ns) for FP_exp, FP_add, and INT_mac under P, PA, PAT, and PATV sensors]
• In-situ PVT sensors impose 1-3% area overhead [Bowman'09]; five replica PVT sensors increase area by 0.2% [Lefurgy'11]; banks of 96 NBTI aging sensors occupy less than 0.01% of the core's area [Singh'11].
Online Utilization of Guardbanding
The control system tunes the clock frequency through an online model-based rule.
[Offline: TER raw data → classifier → parametric model, compiled into LUTs per (PATV_config, target_TER). Online: per-FU PATV sensor readings (P: 2 bits, A: 3 bits, T: 3 bits, V: 3 bits) index the LUTs to produce a 5-bit t_clk code; the clock control takes the max across the FUs in the GPU SIMD execution stage.]
• Fine-grained: instruction-by-instruction monitoring and adaptation using the signals of the PATV sensors of individual FUs.
• Coarse-grained: kernel-level monitoring using representative PATV sensors for the entire execution stage of the pipeline.
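Given the bit widths in the figure, the online rule amounts to a table lookup per FU followed by a max; a sketch (the field packing and the assumption that a larger code means a longer period are mine, not the actual hardware encoding):

#include <stdint.h>

/* PATV sensor fields as in the figure: P (2 bits), A (3), T (3), V (3) -> 11-bit index. */
static inline unsigned patv_index(unsigned p, unsigned a, unsigned t, unsigned v)
{
    return ((p & 0x3u) << 9) | ((a & 0x7u) << 6) | ((t & 0x7u) << 3) | (v & 0x7u);
}

/* Per-FU LUTs, filled offline from the parametric model for the target TER.
 * Each entry is a 5-bit t_clk code consumed by the clock control. */
#define NUM_FUS 3
static uint8_t fu_tclk_lut[NUM_FUS][1u << 11];   /* populated offline */

/* Assumption: a larger code requests a longer clock period, so the clock
 * control takes the maximum across the FUs active in the SIMD execution stage. */
uint8_t select_tclk_code(unsigned p, unsigned a, unsigned t, unsigned v)
{
    unsigned idx = patv_index(p, a, t, v);
    uint8_t code = 0;
    for (int fu = 0; fu < NUM_FUS; ++fu)
        if (fu_tclk_lut[fu][idx] > code)
            code = fu_tclk_lut[fu][idx];
    return code & 0x1Fu;
}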
Throughput Benefit of HFG
[Charts: throughput (GIPS) with P, PA, PAT, and PATV sensors]
• Kernel-level monitoring improves throughput by 70% going from P to PATV sensors (target TER = 0).
• Instruction-level monitoring improves throughput by 1.8-2.1×.
Consider shared 8-FPU 16-core architectures
PUTTING IT TOGETHER: COORDINATED ADAPTATION TO PROPAGATE ERRORS TOWARDS THE APPLICATION
Accurate, Approximate Operating Modes
Modeled after the STM P2012 16-core machine.
[Figure: shared FPU with ADD/SUB, MUL, and DIV pipelines (the DIV pipe has up to 18 stages), each with EDS sensors and an error control unit (ECU) and per-pipeline FLV tracking; the slave port carries opmode, operands, result, and done signals]
• Accurate mode: every pipeline uses (with 3.8% area overhead) EDS circuit sensors to detect any timing error, and an ECU to correct errors using a multiple-issue operation replay mechanism (without changing frequency).
Accuracy-Configurable Architecture
• In the approximate mode:
  – The pipeline disables the EDS sensors on the N least significant bits of the fraction, where N is reprogrammable through a memory-mapped register.
  – The sign and the exponent bits are always protected by EDS.
  – Thus the pipeline ignores any timing error below the N least significant bits of the fraction and saves the recovery cost.
• Switching between modes disables/enables the error detection circuits partially, on the N bits of the fraction, so the FP pipeline can efficiently execute subsequent interleaved accurate or approximate software blocks.
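A software-level sketch of what "ignore errors below the N least significant fraction bits" means for a single-precision value (illustrative only; in the design itself the masking happens in the EDS/ECU logic, not in software):

#include <stdint.h>
#include <string.h>

/* Decide whether a detected single-precision error is significant: errors are
 * ignored if they are confined to the N least significant fraction bits;
 * sign (bit 31) and exponent (bits 30..23) are always protected. */
int error_is_significant(float expected, float observed, unsigned n_ignored_bits)
{
    uint32_t e, o;
    memcpy(&e, &expected, sizeof e);
    memcpy(&o, &observed, sizeof o);
    uint32_t diff = e ^ o;
    uint32_t ignore_mask = (n_ignored_bits >= 23) ? 0x007FFFFFu
                                                  : ((1u << n_ignored_bits) - 1u);
    return (diff & ~ignore_mask) != 0;   /* any flip above bit N-1 matters */
}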
Fine-grain Interleaving Possible Through Coordination and Controlled Approximation
Architecture: accuracy-reconfigurable FPUs that are shared among tightly-coupled processors and support online FPV characterization.
Compiler: OpenMP pragmas for approximate FP computations; a profiling technique identifies the tolerable error significance and error rate.
Runtime: the scheduler utilizes FPV metadata and promotes FPUs to accurate mode, or demotes them to approximate mode, depending on the requirements of the code region.
Either ignore the timing errors (in approximate regions) or reduce the frequency of errors by assigning computations to correctable hardware resources, at a cost.
Ensure the safety of ignoring errors through a set of rules.
FP Vulnerability Dynamically Monitored and Controlled by ECU
• The percentage of cycles with timing errors, as reported by the EDS sensors, is captured as FPV metadata.
• The metadata is visible to software through memory-mapped registers.
• This enables the runtime scheduler to perform online selection of the best FP pipeline candidates:
  – low-FPV units for accurate blocks, or steer errors, without correction, to the application.
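A sketch of that selection step (hypothetical register address and names; the point is only that FPV is read through memory-mapped registers and the lowest-FPV unit is preferred for accurate blocks):

#include <stdint.h>

#define NUM_SHARED_FPUS 8

/* Hypothetical memory-mapped FPV registers: percentage of error cycles
 * reported by each FPU's EDS/ECU, updated continuously by hardware. */
static volatile uint32_t * const FPV_REG = (volatile uint32_t *)0x40010000u;

/* For an accurate block, pick the shared FPU with the lowest FPV so that
 * costly replay-based correction is least likely; approximate blocks can
 * take any unit and simply ignore low-significance errors. */
int pick_fpu_for_accurate_block(void)
{
    int best = 0;
    uint32_t best_fpv = FPV_REG[0];
    for (int i = 1; i < NUM_SHARED_FPUS; ++i) {
        uint32_t fpv = FPV_REG[i];
        if (fpv < best_fpv) {
            best_fpv = fpv;
            best = i;
        }
    }
    return best;
}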
OpenMP Compiler Extension
#pragma omp accurate
  structured-block

#pragma omp approximate [clause]
  structured-block
  // clause: error_significance_threshold(<value N>)

#pragma omp parallel
{
  #pragma omp accurate
  #pragma omp for
  for (i = K/2; i < (IMG_M - K/2); ++i) {      // iterate over image
    for (j = K/2; j < (IMG_N - K/2); ++j) {
      float sum = 0;
      int ii, jj;
      for (ii = -K/2; ii <= K/2; ++ii) {       // iterate over kernel
        for (jj = -K/2; jj <= K/2; ++jj) {
          float data = in[i+ii][j+jj];
          float coef = coeffs[ii+K/2][jj+K/2];
          float result;
          #pragma omp approximate error_significance_threshold(20)
          {
            result = data * coef;              // programs the FPU
            sum += result;
          }
        }
      }
      out[i][j] = sum / scale;
} } }

Code snippet for a Gaussian filter utilizing the OpenMP variability-aware directives. The approximate block lowers to calls that invoke the runtime FPU scheduler:

int ID = GOMP_resolve_FP(GOMP_APPROX, GOMP_MUL, 20);
GOMP_FP(ID, data, coef, &result);
ID = GOMP_resolve_FP(GOMP_APPROX, GOMP_ADD, 20);
GOMP_FP(ID, sum, result, &sum);
FPV Metadata Can Even Drive Synthesis!
[Flow: source code + input data → profiling against an error-rate threshold → annotated source code with OpenMP approximate directives → runtime library/scheduler; approximation error-rate analysis yields the error-significance threshold (N) → approximate-aware timing constraint generation → design-time FPU synthesis & optimization, with fidelity (PSNR) under controlled error significance]
• Tightly constrained paths use fast, leaky standard cells (low-VTH).
• The remaining paths get relaxed timing and use the regular and slow standard cells (regular-VTH and high-VTH), since their errors can be ignored.
Save Recovery Time and Energy Using FPV Monitoring (TSMC 45nm)
  ARM v6 cores         16            TCDM banks        16
  I$ size (per core)   16KB          TCDM latency      2 cycles
  I$ line              4 words       TCDM size         256 KB
  Hit latency          1 cycle       L3 latency        ≥ 60 cycles
  Miss latency         ≥ 59 cycles   L3 size           256MB
  Shared FPUs          8             FP ADD latency    2
  FP MUL latency       2             FP DIV latency    18
• Error-tolerant applications: Gaussian and Sobel filters
  – PSNR results support an error-significance threshold of N = 20 while maintaining > 30 dB
  – FPUs are 36% more energy efficient; recovery cycles are reduced by 46%
• Five kernel codes as error-intolerant applications
  – 22% average energy savings
Expedition Experimental Platforms & Artifacts
• Interesting and unique challenges in building research testbeds that drive our explorations
  – Mock-ups don't go far, since variability is at the heart of microelectronic scaling; we need platforms that capture scaling and integration aspects.
• Testbeds to observe (Molecule, GreenLight, Ming) and to control (Oven, ERSA)
[Photos: Molecule, Ming the Merciless, Red Cooper, ERSA@BEE3]
Red Cooper Testbed
• Customized chip with processor + speed/leakage sensors
• Testbed board to close the sensor feedback loop on the board
• Used in building a duty-cycled OS based on variability sensors
[Figure: hardware-software stack (applications, runtime, microarchitecture and compilers, CPU, memory, storage, accelerators, network, energy source/batteries) annotated with power, performance, and error flows arising from vendor, process, ambient, and aging variation]
Ferrari Chip: Closing the Loop On-Chip
[Chip diagram: ARM Cortex-M3 with JTAG, GPIO, AMBA bus, timers, 64 kB IMEM, 176 kB DMEM with ECC, PLL and ring-oscillator clock, 19 DDROs with counters, 8 banks of sensors (N/P leakage, temperature, oxide), and DUT/reference devices; available since April 2013]
• On-chip sensors
• Better support for OS and software
  – Memory-mapped I/O and control
  – Leakage sensors, DDROs, temperature sensors, reliability sensors
Sense-and-Adapt Fundamentally Alters the Stack
• Machines that consist of parts with variations in performance, power, and reliability
• Machines that incorporate sensing circuits
• Machines with interfaces to change ongoing computation and structures
• New machine models: QoS or relaxed-reliability parts
Thank You!
Rajesh K. Gupta
Nikil Dutt, UCI
Puneet Gupta, UCLA
Mani Srivastava, UCLA
Steve Swanson, UCSD
Lara Dolecek, UCLA
Subhasish Mitra, Stanford
YY Zhou, UCSD
Tajana Rosing, UCSD
Alex Nicolau, UCI
Ranjit Jhala, UCSD
Sorin Lerner, UCSD
Rakesh Kumar, UIUC
Dennis Sylvester, UMich
Yuvraj Agarwal, CMU
Lucas Wanner, UCLA
The Variability Expedition
A NSF Expeditions in Computing Project
http://variability.org