
Advanced Space Computing with
System-Level Fault Tolerance
Grzegorz Cieslewski, Adam Jacobs,
Chris Conger, Alan D. George
ECE Dept., University of Florida
NSF CHREC Center
Outline
- Overview
- NASA Dependable Multiprocessor
- Reconfigurable Fault Tolerance (RFT)
- Space Applications
- Novel Computing Platforms
- RapidIO
- Conclusions
Overview

What is advanced space computing?
- New concepts, methods, and technologies to enable and deploy high-performance computing in space, for an increasing variety of missions and applications

Why is advanced space computing vital?
- On-board data processing
  - Downlink bandwidth to Earth is extremely limited
  - Sensor data rates, resolutions, and modes are dramatically increasing
  - Remote data processing from Earth is no longer viable
  - Must process sensor data where it is captured, then downlink results
- On-board autonomous processing & control
  - Remote control from Earth is often not viable
  - Propagation delays and bandwidth limits are insurmountable
  - Space vehicles and space-delivered vehicles require autonomy
  - Autonomy requires high-speed computing for decision-making

Why is it difficult to achieve?
- Cannot simply strap a Cray to a rocket!
  - Hazardous radiation environment in space
  - Platforms with limited power, weight, size, cooling, etc.
  - Traditional space processing technologies (RadHard) are severely limited
- Potential for long mission times with a diverse set of needs
  - Need powerful yet adaptive technologies
  - Must ensure high levels of reliability and availability
Taxonomy of Fault Tolerance

First, let us define the various possible modes/methods of providing fault tolerance (FT):
- There are many options beyond simply throwing triple-modular redundancy (TMR) at the problem
- Software-FT and hardware-FT concepts are largely similar, differing only at the implementation level
- Radiation hardening is not listed; it falls under "prevention" as opposed to detection or correction

Most of these FT modes are currently being used at UF, and temporal and spatial variants are possible for many techniques. The taxonomy divides into techniques that detect errors and techniques that correct or mask them:
- CED: Concurrent Error Detection
- SCP: Self-Checking Pairs
- CR: Checkpointing & Roll-back
- ABFT: Algorithm-Based Fault Tolerance
- NMR: N-Modular Redundancy
- FT-HLL: Fault-Tolerant HLL (e.g., MPI)
- SIFT: Software-Implemented Fault Tolerance
- BR: Byzantine Resilience
- NVP: N-Version Programming
- ECC: Error Correction Codes
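The simplest correction/masking mode in this taxonomy, NMR/TMR, runs redundant copies of a computation and votes on the results. A minimal sketch of a TMR majority voter, written in Python purely for brevity (flight implementations would be in hardware or C):

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant results.

    A single-event upset in any one copy is masked, because each output
    bit takes the value held by at least two of the three inputs.
    """
    return (a & b) | (a & c) | (b & c)

# One copy suffers a single-bit upset (bit 4 flips); the vote masks it:
good = 0b10101100
upset = good ^ (1 << 4)
assert tmr_vote(good, upset, good) == good
```

The same majority expression is exactly what chip-level TMR implements per flip-flop; spatial NMR replicates hardware, while temporal NMR reruns the same unit N times.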
NASA Dependable Multiprocessor (DM)

NASA/Honeywell/UF project, funded by NASA NMP
- First fault-tolerant, parallel, reconfigurable computer for space
- In-situ sensor processing and autonomous control
- Speedups of 100 to 1000

Infrastructure for fault-tolerant, high-speed computing in space
- Robust system services
- Fault-tolerant MPI services
- Application services
- FPGA services
- Standard design framework
- Transparent API to resources for earth & space scientists

[System diagram: dual system controllers A and B (RHPPC) with spacecraft interfaces; a reconfigurable cluster computer of data processors #1 ... #N (PPC, FPGA) joined by redundant high-speed networks A and B; and a mission-specific spacecraft interface to mission-specific devices]
Dependable Multiprocessor

DM System Architecture
- Dual system controllers
  - Redundant radiation-hardened PPC boards
  - Monitor data processors' health and communicate with the spacecraft
- Data processing engines
  - High-performance, low-power COTS SBCs running Linux
  - PowerPC with AltiVec capabilities
  - Optional FPGA co-processor for additional performance
  - Scalable to 20 data processing nodes
- Redundant interconnect
  - Dual GigE connections
  - Automatically switch networks when an error is detected

DM Middleware (DMM)
- FT System Services
  - Manage status and health of multiple concurrent jobs
- FT Embedded MPI (FEMPI)
  - Lightweight subset of MPI
  - Allows fault recovery without restarting an entire parallel application
- Application & FPGA Services
  - Commonly used libraries such as ATLAS, FFTW, GSL
  - Simplified, generic API for FPGA usage through USURP*
- High-Availability Middleware
  - Framework used to enable health monitoring of the cluster

* USURP is a standardized interface specification for RC platforms, developed by researchers at UF
DMM Components

- Mission Manager (MM)
  - Controls high-level job deployment
  - Facilitates replication of lower-level jobs, spatial or temporal
  - Automatically compares and validates outputs
  - Monitors real-time deadlines
  - Enables roll-forward / roll-back when faults occur
- Job Manager (JM)
  - Controls low-level job deployment and scheduling across the system
- FT Manager (FTM)
  - Manages low-level system faults (node crash, job crash)
- JM Agent (JMA)
  - Deploys and monitors programs on a given node
  - Provides application "heartbeat" to the system controller
- Mass Data Store (MDS)
  - Provides reliable centralized data services
  - Enables reliable checkpointing

[Layered diagram: on the hardened system controller, the Mission Manager (fed by mission-specific parameters), JM, and FTM run atop Reliable Messaging Middleware, COTS OS and drivers, and the hardened processor; on each COTS data processor, the MPI application process, JMA, FEMPI (Fault-Tolerant Embedded MPI), ASL (Application Services Library), and FCL (FPGA Coprocessor Library) run atop Reliable Messaging Middleware, COTS OS and drivers, and the COTS processor; the two sides are joined by a COTS packet-switched network]
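Two of the services above, the JMA's heartbeat and the MDS-backed checkpointing, can be sketched in a few lines. This is an illustrative sketch, not the actual DMM API; the class and method names are hypothetical, in Python rather than the deck's C/MPI setting:

```python
import json
import os
import tempfile
import time

class JobAgentSketch:
    """Hypothetical stand-in for a JMA-style agent (not the real DMM)."""

    def __init__(self, ckpt_path):
        self.ckpt_path = ckpt_path
        self.last_beat = None

    def heartbeat(self):
        # In DM, the JMA periodically reports liveness to the system
        # controller; here we just record a monotonic timestamp.
        self.last_beat = time.monotonic()

    def checkpoint(self, state):
        # Write-then-rename so a crash mid-write never corrupts the
        # previous checkpoint -- the property a reliable MDS must provide.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.ckpt_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.ckpt_path)

    def rollback(self):
        # On a detected fault, restart from the last committed state
        # instead of restarting the entire parallel application.
        with open(self.ckpt_path) as f:
            return json.load(f)

agent = JobAgentSketch("job42.ckpt")
agent.checkpoint({"iteration": 10, "partial_sum": 3.5})
agent.heartbeat()
assert agent.rollback() == {"iteration": 10, "partial_sum": 3.5}
```

A missed heartbeat deadline is what lets the FTM distinguish a hung job from a slow one; the checkpoint is what makes the subsequent roll-back cheap.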
Algorithm-Based Fault Tolerance

- Commonly refers to a matrix-coding method that is preserved through certain linear algebra operations:
  - Matrix and vector multiply
  - Discrete Fourier Transform
  - Discrete Wavelet Transform
  - Matrix decomposition, C = AB (LU, QR, Cholesky)
  - Matrix inversion
- Used to detect errors in these operations, and in certain cases allows for error correction
- ABFT algorithms integrate with DM through the Application Services API
- An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF
  - Uses Hamming encoding
  - Low overhead due to ABFT
- Important aspects of ABFT currently under investigation at UF:
  - Round-off analysis
  - Coverage analysis
  - Code types
  - Encoding and decoding strategies
  - Overhead

[Figures: the fault-tolerant partial transform (1. augment the input matrix, 2. partial transform, 3. verify the checksums); the computation flow of the fault-tolerant 2D-FFT (time-domain data matrix -> 1. fault-tolerant partial transform -> 2. transpose -> 3. fault-tolerant partial transform -> 4. transpose -> frequency-domain data matrix); and a chart of experimental overhead of the fault-tolerant RDP vs. a fault-intolerant version, error-free and with-error cases, for image sizes 128x128 through 4096x4096, with overheads ranging roughly from 15% to 95%]
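For the matrix-multiply case listed above, the classical checksum encoding (due to Huang and Abraham) works because the product of a column-checksum A and a row-checksum B is the full checksum matrix of C = AB. A minimal sketch with plain integer matrices, which sidesteps the round-off analysis the slide flags as an open issue:

```python
def col_checksum(A):
    """Append a row whose entries are each column's sum."""
    return A + [[sum(col) for col in zip(*A)]]

def row_checksum(B):
    """Append a column whose entries are each row's sum."""
    return [row + [sum(row)] for row in B]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def abft_matmul(A, B):
    """Checksum-encoded multiply: compute col_checksum(A) * row_checksum(B),
    then verify that every row and column of the product still sums to its
    checksum entry. Any single corrupted element trips an assertion."""
    C = matmul(col_checksum(A), row_checksum(B))
    n = len(C) - 1
    for i in range(n):
        assert sum(C[i][:n]) == C[i][n], f"error detected in row {i}"
    for j in range(n):
        assert sum(C[i][j] for i in range(n)) == C[n][j], \
            f"error detected in column {j}"
    return [row[:n] for row in C[:n]]  # strip checksums from the result

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert abft_matmul(A, B) == matmul(A, B)
```

The detection cost is O(n^2) checksum arithmetic on top of the O(n^3) multiply, which is why the slide can claim low overhead; with both a row and a column check, a single erroneous element can also be located and corrected.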
Source Code Transformations

- Most science applications are inherently non-fault-tolerant
  - They require a SIFT framework to improve reliability
- Possible to immunize programs against most errors by transforming the application source code
  - Less overhead
  - More control over FT techniques
  - Compiler-independent
  - Integrates with the DM system through the Application Services API
- A custom source-to-source (S2S) transformation tool is currently under development at UF
  - Accepts C source files as inputs
  - Generates fault-tolerant versions
  - Uses a fine-grain NMR-type approach to provide improved reliability and dependability
  - Provides means of control-flow checking (CFC) through software
  - Minimizes the number of undetected errors
- Transformation options to be supported by the tool:
  - Variable replication
  - Function replication
  - Memory duplication / memory checking
  - Synchronization intervals
  - Condition evaluation
    - Post-evaluation verification
    - Evaluation using replicated variables
  - Block protection
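The flavor of code such a tool emits can be sketched by hand. The real tool consumes and produces C; the same fine-grain variable replication and post-evaluation condition check are shown here in Python, with all names hypothetical:

```python
class SiftError(Exception):
    """Raised when replicated values disagree (a detected transient error)."""

def scale_and_clip(x, limit):
    # Transformed version of the untransformed body:
    #     y = x * 2
    #     if y > limit: y = limit
    y0 = x * 2          # original variable
    y1 = x * 2          # replicated variable (fine-grain dual redundancy)
    if y0 != y1:        # check replicas before every use
        raise SiftError("mismatch in y")
    cond0 = y0 > limit  # condition evaluated with replica 0 ...
    cond1 = y1 > limit  # ... and re-evaluated with replica 1
    if cond0 != cond1:  # post-evaluation verification of the branch
        raise SiftError("mismatch in branch condition")
    return limit if cond0 else y0

assert scale_and_clip(3, 10) == 6
assert scale_and_clip(9, 10) == 10
```

Duplication as shown only detects errors; the tool's NMR option triplicates so that a majority vote can also correct them, at correspondingly higher overhead.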
Reconfigurable Fault Tolerance

GOAL – research how to take advantage of the reconfigurable nature of FPGAs to provide dynamically-adaptive fault tolerance in RC systems
- Leverage partial reconfiguration (PR) where advantageous
- Explore virtual architectures to enable PR and reconfigurable fault tolerance (RFT)

MOTIVATION – why go with fixed/static FT when performance & reliability can be tuned as needed?
- Environmentally-aware & adaptive computing is the wave of the future
- Achieve power savings and/or performance improvement, without sacrificing reliability, by trading between performance and fault tolerance

CHALLENGES – limitations in concepts and tools; this open-ended problem requires innovative solutions
- Conventional methods are typically based upon radiation-hardened components and/or fault masking via chip-level TMR
- The highly-custom nature of FPGA architectures in different systems and apps makes defining a common approach to FT difficult

[Figure: satellite orbits passing through the Van Allen radiation belt]
Reconfigurable FT

Virtual Architecture for RFT
- Novel concept of adaptable component-level protection (ACP)
- Common components within the VA:
  - Adaptable protection frame – largely module/design-independent
  - Error Status Register (ESR) for system-level error tracking/handling
  - Re-synchronization controller or interfaces, for state saving and restoration
  - Configuration controller, two options: internal configuration through ICAP, or an external configuration controller
- Benefits of internal protection:
  - Early error detection and handling = faster recovery
  - Redundancy can be changed into parallelism
  - PR can be leveraged to provide uninterrupted operation of non-failed components
- Challenges of internal protection:
  - Impossible to eliminate single points of failure; may still need higher-level (external) detection and handling
  - Stronger possibility of a fault/error going unnoticed
  - Single-event functional interrupts (SEFI) are a major concern

[Figures: an ACP protection-module "frame" whose sockets can hold modules in several arrangements (e.g., 4x parallel single modules, 2x parallel self-checking pairs, or TMR); and the VA concept diagram on the FPGA, showing reliable static logic, a reliable ICAP controller, ACP modules #1 ... #N, a reliable ESR & re-synchronization controller, state-saving and restoration interfaces, error-detection indicators, and a link to an RH FPGA_CON external configuration controller]
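The ACP trade-off reduces to a small policy: when the measured upset rate is low, redundant sockets can be reconfigured into extra parallel throughput; when it rises (e.g., while passing through the Van Allen belt), the sockets go back to redundancy. A sketch of such a mode selector, with entirely hypothetical thresholds (the deck does not give any):

```python
def select_protection_mode(upsets_per_hour):
    """Map a measured SEU rate to an ACP configuration for a 4-socket
    protection frame. Thresholds are illustrative placeholders only."""
    if upsets_per_hour < 0.1:
        return "4x parallel"      # benign orbit: every socket does useful work
    if upsets_per_hour < 1.0:
        return "2x parallel SCP"  # moderate rate: self-checking pairs (detect)
    return "TMR + spare"          # radiation belt: mask errors outright

assert select_protection_mode(0.01) == "4x parallel"
assert select_protection_mode(0.5) == "2x parallel SCP"
assert select_protection_mode(20) == "TMR + spare"
```

In a real system the decision would also weigh deadlines and power budget, and the reconfiguration itself would be a partial bitstream load through the ICAP.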
Space Applications

Synthetic Aperture Radar (SAR)
- Used to form high-resolution images of Earth's surface from a moving platform in space
- Patch-based processing with a significant amount of overlap between patch boundaries
- Parallelizable at multiple levels of granularity, possible without any inter-processor communication (one patch per node)
- 2-dimensional data set, ranging in size from several hundred megabytes to gigabytes
- Data set is not significantly reduced through the course of the application
- Highly amenable to ABFT; a possible application for the Dependable Multiprocessor project
Space Applications

Hyperspectral Imaging (HSI)
- Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images
- Adjustable to enable real-time processing
- Mostly embarrassingly parallel, the exception being the weight computation
- 3-dimensional data set, reduced through the course of the application
- Auto-correlation sample matrix (ACSM) calculation and beamforming (detection) are amenable to ABFT
- NMR suggested for the weight computation
- Parallel and multi-FPGA decompositions explored
Space Applications

Cosmic Ray Elimination
- Uses image-processing techniques to remove artifacts caused by cosmic rays
- Image shows pre- and post-processed versions of a Hubble Telescope observation
- Images are highly parallelizable, with minimal communication necessary
- Main computation: median filtering
  - Fault-tolerant median filter developed
  - Other portions of the algorithm replicated by hand or via the S2S translator

Other aerospace-related application kernels
- Space-Time Adaptive Processing (STAP)
- Ground Moving Target Indicator (GMTI)
- Airborne LIDAR
- Digital Down Conversion (DDC)
- PDF Estimation
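Median filtering both removes the cosmic-ray hits and, because it is cheap, can afford simple temporal redundancy. The sketch below is one straightforward way to harden it (duplicate-and-compare with recompute on mismatch); it is illustrative and not necessarily the method developed at UF:

```python
def median3x3(img, r, c):
    """Median of the 3x3 neighborhood of (r, c), with edges clamped."""
    rows, cols = len(img), len(img[0])
    win = [img[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
           for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    return sorted(win)[4]

def ft_median_filter(img):
    """Temporally duplicated median filter: each output pixel is computed
    twice and compared; on a transient mismatch the pixel is simply
    recomputed (roll-forward), so no checkpoint is needed."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            v1, v2 = median3x3(img, r, c), median3x3(img, r, c)
            while v1 != v2:  # disagreement implies a transient fault: retry
                v1, v2 = median3x3(img, r, c), median3x3(img, r, c)
            row.append(v1)
        out.append(row)
    return out

# A cosmic-ray "hit" (the 999) is replaced by its neighborhood median:
img = [[10, 10, 10], [10, 999, 10], [10, 10, 10]]
assert ft_median_filter(img)[1][1] == 10
```

Because each pixel depends only on its 3x3 neighborhood, the filter decomposes across nodes with only a one-row halo exchange, matching the "minimal communication" claim above.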
Novel Computing Platforms

Fixed multi-core (FMC) devices
- Cell: heterogeneous, vector compute engine, 3.2 GHz clock rate, ~70 W max. power consumption
- GPU: homogeneous, many (e.g., 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption

Reconfigurable multi-core (RMC) devices
- Field-Programmable Object Array (FPOA): heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max. power consumption
- Field-Programmable Gate Array (FPGA): heterogeneous, fine-grained processing elements, max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption
- Tilera: homogeneous, coarse-grained processing elements (64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption
- Element CXi: heterogeneous, coarse-grained processing elements, 200 MHz clock rate, ~1 W max. power consumption

[Figures: Cell processor block diagram, http://www.research.ibm.com/journal/rd/494/kahle.html; FPOA architecture, http://www.mathstar.com/Architecture.php]
RC: Vital Technology for Space

- Versatility in space missions (adapts as needs demand)
  - Fixed architectures are burdened with fixed choices and limited tradeoffs
- Performance in space missions (speed, power, size, etc.)
  - e.g., the computational density per Watt (CDW) device metric
  - FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.)

How the CDW metric is computed for the FPGA case:
- Parallel operations – scales up to the max. # of adds and mults (# of adds = # of mults) possible
- Achievable frequency – the lowest frequency after PAR of DSP and logic-only implementations of the add & mult computation cores
- Power – scales linearly with resource utilization; max. power is reduced by the ratio of achievable frequency to max. frequency

HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI'08). Results excerpted from a pending presentation from the CHREC-UF site for the HPEC'08 Workshop.
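The CDW notes above reduce to a small formula: computational density is parallel operations times achievable frequency, and the power term is the device's maximum power derated by the ratio of achievable to maximum frequency. A sketch of that arithmetic, with placeholder device numbers rather than the study's measurements:

```python
def cdw_giga_ops_per_watt(parallel_ops, achievable_mhz, max_mhz, max_power_w):
    """Computational density per Watt, following the slide's notes:
    density = parallel ops x achievable frequency (in GOPS), and power is
    the max. device power scaled by achievable/max frequency (full
    resource utilization assumed for simplicity)."""
    density_gops = parallel_ops * achievable_mhz / 1000.0
    power_w = max_power_w * (achievable_mhz / max_mhz)
    return density_gops / power_w

# Hypothetical FPGA: 200 parallel add/mult cores, 250 MHz achievable of a
# 550 MHz max, 35 W power cap -- placeholder values, not measured data.
cdw = cdw_giga_ops_per_watt(200, 250, 550, 35)
```

Note that because both density and power scale with achievable frequency, the ratio is dominated by parallel-operation count per Watt, which is precisely where FPGAs pull ahead of fixed multi-core devices.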
RapidIO

High-speed embedded system interconnect, a replacement for bus-based backplanes
- Parallel and serial variants; serial is the wave of the future
- Multiple programming models

Research with RapidIO at UF
- Simulative research studying the capability of RapidIO-based computing platforms to support space-based radar (SBR) processing
- Custom testbed designed and built, for verification of simulation models and experimentation with RapidIO & FPGAs

[Figures: experimental logic-analyzer measurements feeding trace files; visualization of simulated GMTI application progress (256 pulses, 6 beams, 1 engine per task per FPGA, 64k ranges), plotting SDRAM utilization (%) over 0-2048 ms]
Conclusions

- Fault tolerance for space should be more than RadHard components & spatial TMR designs
  - Fixed worst-case designs are extremely limited in perf/Watt
  - Instead, many FT methods & modes can be exploited
  - Adaptive systems can react to environmental changes
  - COTS featured inside the critical performance path
  - RadHard for FT management, outside the critical perf. path
- UF is active on many space-related FT issues
  - NASA Dependable Multiprocessor, CHREC RFT F4-08
  - Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc.
  - Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc.
  - Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.
2009 IEEE Aerospace Conference

Track 7.12: Dependable Software for High-Performance Embedded Computing Platforms
- Transient error detection and recovery techniques
- Compiler-based fault-tolerance techniques
- Algorithm-based fault-tolerance techniques
- Tools and techniques for designing reliable software
- SIFT management frameworks
- Software dependability analysis
- Adaptive fault-tolerance techniques
- FT applications

Track Chairs
- Richard Linderman, [email protected]
- Grzegorz Cieslewski, [email protected]

Dates
- Abstract Submissions Due: July 1st, 2008
- Paper Submissions Due: November 2nd, 2008