
Application Level Fault
Tolerance and Detection
Principal Investigators:
C. Mani Krishna
Israel Koren
Presented By:
Eric Ciocca
Architecture and Real-Time Systems (ARTS) Lab.
Department of Electrical and Computer Engineering
University of Massachusetts Amherst MA 01003
What is ALFTD?


Application Level Fault Tolerance and Detection
ALFTD complements existing system or algorithm
level fault tolerance by leveraging information
available only at the application level


Using such application level semantic information
significantly reduces the overall cost of providing fault
tolerance
ALFTD may be used alone or to supplement other fault
detection schemes
ALFTD Overview

Application Level Fault Tolerance and Detection
allows for system survival of both data and
system (instruction/hardware) faults.



System faults cause a process to eventually cease
functioning
Data faults cause a process to continue running with
incorrect results
ALFTD is scalable

The level of fault tolerance can be traded off with
invested time overhead
Principles of ALFTD


Node 1: P1, S4
Node 2: P2, S1
Node 3: P3, S2
Node 4: P4, S3
To provide system fault tolerance, every physical node runs
its own work (P, primary) as well as a scaled-down copy of
a neighboring node’s work (S, secondary)
If a fault should corrupt a process, the corresponding
secondary of that task will still produce output, albeit at a
lower (but acceptable) quality
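The placement shown above can be sketched in a few lines of Python. This is only an illustration of the ring layout on the slide (node 1 hosts S4, node 2 hosts S1, and so on); the function names `ring_placement` and `surviving_outputs` are hypothetical and do not come from the ALFTD implementation.

```python
def ring_placement(num_nodes):
    """Return {node: (primary, secondary)} for a ring of nodes.

    Assumption taken from the slide: each node hosts its own primary and
    the secondary of the previous node, with node 1 wrapping to the last.
    """
    placement = {}
    for node in range(1, num_nodes + 1):
        neighbor = node - 1 if node > 1 else num_nodes
        placement[node] = (f"P{node}", f"S{neighbor}")
    return placement


def surviving_outputs(placement, failed_node):
    """Map each task to the process that still produces its output."""
    sources = {}
    for node, (primary, secondary) in placement.items():
        if node == failed_node:
            continue  # nothing hosted on the failed node survives
        sources[int(primary[1:])] = primary  # full-quality output
        backup_task = int(secondary[1:])
        if backup_task == failed_node:
            # The failed node's task is covered at reduced quality.
            sources[backup_task] = secondary
    return sources


if __name__ == "__main__":
    layout = ring_placement(4)
    print(layout)
    print(surviving_outputs(layout, failed_node=3))
```

With four nodes and node 3 lost, task 3 is still covered by S3 on node 4, while the other tasks keep their primaries.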
Principles of ALFTD

The secondary processes can be scaled down by:
reducing the resolution of input data
reducing the precision of calculations
heuristically predicting results from previous
iterations’ output
In some applications the secondary can be run
optionally on an as-needed basis:
If the corresponding primary is approaching a
deadline miss
If the corresponding primary has been incapacitated
If the corresponding primary has produced faulty data
If faults are infrequent, an optional secondary
will incur very little additional overhead
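The three as-needed conditions, and the reason an optional secondary is cheap when faults are rare, can be sketched as follows. Everything here is a hypothetical illustration: the `PrimaryStatus` fields, the `deadline_margin` parameter, and the simple cost model are assumptions, not part of the ALFTD implementation.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStatus:
    elapsed: float        # time spent so far on the current task
    deadline: float       # time allowed for the task
    alive: bool           # False if the primary has been incapacitated
    output_suspect: bool  # True if a check flagged the primary's data


def should_run_secondary(status, deadline_margin=0.1):
    """Decide whether the optional secondary should be started.

    deadline_margin is an assumed tuning knob: the secondary starts once
    the primary has used (1 - deadline_margin) of its time budget.
    """
    if not status.alive:
        return True  # primary incapacitated
    if status.output_suspect:
        return True  # primary produced potentially faulty data
    # primary approaching a deadline miss
    return status.elapsed >= (1.0 - deadline_margin) * status.deadline


def expected_overhead(trigger_rate, secondary_cost):
    """Average extra work per task, as a fraction of the primary's cost.

    Assumed model: the secondary runs for a fraction trigger_rate of the
    tasks and costs secondary_cost (relative to the primary) each time.
    """
    return trigger_rate * secondary_cost


if __name__ == "__main__":
    healthy = PrimaryStatus(elapsed=0.5, deadline=1.0, alive=True, output_suspect=False)
    print(should_run_secondary(healthy))
    print(expected_overhead(trigger_rate=0.02, secondary_cost=0.25))
```

Under this model, a secondary costing a quarter of the primary and triggered on 2% of tasks adds about half a percent of average overhead, compared with 25% if it ran every time.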
ALFTD in OTIS



ALFTD was implemented in OTIS (Orbital
Thermal Imaging Spectrometer) to test its
viability as a fault tolerance and detection
scheme
OTIS, part of the REE (Remote Exploration and
Experimentation) program group from JPL, is
intended to run on orbiting satellites
OTIS processes radiation data of a geographic
area from a sensor array [input] and produces
temperature and emissivity data [output]
OTIS Structure
[Diagram: MPI, a master (M), three slaves (S), and OUTPUT, with arrows numbered 1 to 5 for the steps below]
1. MPI Starts
2. MPI Starts Slave and master processes
3. Master sends tasks
4. Slave Calculations
5. Slave Returns Results
ALFTD in OTIS (cont’d)

ALFTD is suited for remote applications:



As a software-based fault handling mechanism, it
requires no extra hardware
The scaled secondaries require less power than full
software redundancy
In OTIS, and other applications, ALFTD is passive, only
requiring extra runtime in a fault case.
ALFTD OTIS Structure
[Diagram: MPI, a master (M), three slave boxes labelled P1/S2, P2/S3, and P3/S1, and OUTPUT, with arrows numbered 1 to 5 for the steps below]
1. MPI Starts
2. MPI Starts master and slaves, primary and secondary processes
3. Master sends tasks
4. Slave Calculations
5. Slave Returns Results
Secondaries in OTIS


The secondary required for ALFTD is
implemented to be functionally similar to the
primary
Secondary scaling occurs through resolution
reduction



OTIS’ “natural” temperature data input exhibits spatial
locality
Points not directly calculated can be estimated
using interpolation between calculated points
Secondary processes have been tested at 20% to 50% of the primary calculation overhead

While 50% affords better quality, 20% has less
overhead
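Resolution reduction with interpolation can be illustrated with a small sketch. It assumes the secondary computes every `stride`-th row and fills the skipped rows by linear interpolation between the nearest computed rows; the row-wise scheme, the function name `secondary_rows`, and the `compute_row` callback are hypothetical stand-ins rather than the OTIS code.

```python
def secondary_rows(compute_row, num_rows, stride):
    """Compute every stride-th row and linearly interpolate the rest.

    compute_row(r) is a stand-in for the expensive per-row calculation and
    must return a list of floats. Assumes num_rows >= 1 and stride >= 1.
    """
    # Rows that are actually computed. The last row is always included so
    # that every skipped row has a computed neighbor on both sides.
    computed = list(range(0, num_rows, stride))
    if computed[-1] != num_rows - 1:
        computed.append(num_rows - 1)
    rows = {r: compute_row(r) for r in computed}

    result = []
    for r in range(num_rows):
        if r in rows:
            result.append(rows[r])
            continue
        lo = max(c for c in computed if c < r)  # nearest computed row above
        hi = min(c for c in computed if c > r)  # nearest computed row below
        w = (r - lo) / (hi - lo)
        result.append([(1 - w) * a + w * b for a, b in zip(rows[lo], rows[hi])])
    return result


if __name__ == "__main__":
    # Toy data with a smooth gradient, standing in for temperature rows.
    approx = secondary_rows(lambda r: [280.0 + r, 290.0 + 2 * r], num_rows=10, stride=4)
    for row in approx:
        print(row)
```

A stride of 4 corresponds roughly to 25% resolution and a stride of 2 to 50%. The interpolation is only as good as the spatial locality of the data: a feature narrower than the stride can be missed entirely.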
Example of Secondary Resolution
[Figure: four panels showing Secondary Resolution at 100%, 50%, 33%, and 25%. Caption: ALFTD Compensation for 10 rows in a sample dataset]
Fault Detection


Output filters on the primary data determine
when secondary validation is required
Output filters are created to check for
application-specific trends in data


Aberrations from normal data characteristics can be
considered to be the product of potentially faulty
processes
OTIS relies on natural temperature
characteristics to detect potentially faulty data


Spatial Locality: temperature changes gradually over small
areas
Absolute Bounds: temperature should not exceed certain
values
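The two checks named above, spatial locality and absolute bounds, could be applied to a row of output temperatures along these lines. The thresholds (`lower`, `upper`, `max_step`) and the function name `check_row` are illustrative assumptions only; the slides do not give the values or the filter design used in OTIS.

```python
def check_row(temps, lower=150.0, upper=400.0, max_step=15.0):
    """Return the indices in a row that look potentially faulty.

    Assumed thresholds, in kelvin: values outside [lower, upper] violate
    the absolute bounds, and a jump larger than max_step between adjacent
    points violates spatial locality.
    """
    suspects = set()
    # Absolute bounds: temperature should not exceed certain values.
    for i, t in enumerate(temps):
        if not (lower <= t <= upper):
            suspects.add(i)
    # Spatial locality: temperature changes gradually over small areas.
    # A large jump implicates both points, since either could be wrong.
    for i in range(1, len(temps)):
        if abs(temps[i] - temps[i - 1]) > max_step:
            suspects.update((i - 1, i))
    return sorted(suspects)


if __name__ == "__main__":
    print(check_row([290.0, 291.0, 292.5, 291.0]))
    print(check_row([290.0, 291.0, 900.0, 292.0]))
```

A non-empty result would be the signal that secondary validation is required. Genuinely turbulent data can also trip the locality check, so a flagged point is only potentially faulty.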
Fault Detection (cont’d)

After the secondary has been run to validate a
primary’s results, the “better” data is chosen
according to the following logic grid:
                     Primary Results
Secondary Results    Faultless   Ambiguous   Faulty
Faultless            Primary     Secondary   Secondary
Ambiguous            Primary     Primary     Secondary
Faulty               Primary     Primary     Primary*
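The logic grid above can be written as a lookup table. This sketch assumes the grid is read with secondary results as rows and primary results as columns, which is the most natural reading of the flattened slide but is not stated explicitly; the names `SELECTION` and `choose_output` are hypothetical.

```python
FAULTLESS, AMBIGUOUS, FAULTY = "faultless", "ambiguous", "faulty"

# Keys are (secondary_result, primary_result), following the assumed
# row/column reading of the grid; values name the output that is kept.
SELECTION = {
    (FAULTLESS, FAULTLESS): "primary",
    (FAULTLESS, AMBIGUOUS): "secondary",
    (FAULTLESS, FAULTY): "secondary",
    (AMBIGUOUS, FAULTLESS): "primary",
    (AMBIGUOUS, AMBIGUOUS): "primary",
    (AMBIGUOUS, FAULTY): "secondary",
    (FAULTY, FAULTLESS): "primary",
    (FAULTY, AMBIGUOUS): "primary",
    (FAULTY, FAULTY): "primary",  # marked with an asterisk in the grid
}


def choose_output(primary_result, secondary_result):
    """Return which output to keep for one validated piece of data."""
    return SELECTION[(secondary_result, primary_result)]


if __name__ == "__main__":
    print(choose_output(primary_result=FAULTY, secondary_result=AMBIGUOUS))
    print(choose_output(primary_result=AMBIGUOUS, secondary_result=AMBIGUOUS))
```

Read this way, the grid keeps the secondary only when it is rated strictly better than the primary; ties go to the primary, which is computed at full resolution.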
Data Sets

Three data sets were chosen for their
interesting characteristics
“Blob”: Broad, unchanging areas with dark spots
“Stripe”: Relatively undynamic except for one “stripe”
“Spots”: Turbulent spots may defy “spatial locality” predictions
Fault Tolerance Results: “Spots”

Fault Tolerance with injected faults in “Spots”
Fault Tolerance Results: “Spots” (cont’d)
[Figure: Fault-Free Output, Faulty Output, and ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
Fault Tolerance Results: “Blob”

Fault Tolerance with injected faults in “Blob”
Fault Tolerance Results: “Blob” (cont’d)
[Figure: Fault-Free Output, Faulty Output, and ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
Fault Tolerance Results: “Stripe”
[Figure: Difference Plots, faulty output versus faultless output. Panels: No ALFTD; 25%, 33%, and 50% ALFTD Computation Overhead. Scale runs from No Error to Max Error]
Fault Tolerance Results: “Stripe” (cont’d)
[Figure: Fault-Free Output, Faulty Output, and ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
Conclusion / Future Work




ALFTD has been shown to be a cost-effective
alternative to full redundancy
Improvements on the scheme will increase fault
coverage and decrease secondary calculation
overhead
OTIS has general application characteristics that
will make its implementation a springboard to
other, similar programs
ALFTD should continue to be effective in any
programs that have predictable data
characteristics
Thank You!

For additional information, please contact



Eric Ciocca ([email protected])
Israel Koren ([email protected])
C. Mani Krishna ([email protected])