NMP ST8 Dependable Multiprocessor (DM) Dr. John R. Samson, Jr.

NMP ST8 Dependable
Multiprocessor (DM)
Dr. John R. Samson, Jr.
Honeywell Defense & Space Systems
13350 U.S. Highway 19 North
Clearwater, Florida 33764
(727) 539 - 2449
High Performance Embedded Computing Workshop (HPEC)
18 – 20 September 2007
Outline
• Introduction
- Dependable Multiprocessor * technology
- overview
- hardware architecture
- software architecture
• Current Status & Future Plans
• TRL6 Technology Validation
• TRL7 Flight Experiment
• Summary & Conclusion
* formerly known as the Environmentally-Adaptive Fault-Tolerant Computer (EAFTC);
The Dependable Multiprocessor effort is funded under NASA NMP ST8 contract
NMO-710209.
This presentation has not been published elsewhere, and is hereby offered for exclusive publication
except that Honeywell reserves the right to reproduce the material in whole or in part for its own use
and where Honeywell is so obligated by contract, for whatever use is required thereunder.
DM Technology Advance: Overview
• A high-performance, COTS-based, fault tolerant cluster onboard processing
system that can operate in a natural space radiation environment
NASA Level 1 Requirements (Minimum):
- high throughput, low power, scalable, & fully programmable: >300 MOPS/watt (>100)
- high system availability: >0.995 (>0.95)
- high system reliability for timely and correct delivery of data: >0.995 (>0.95)
- technology independent system software that manages cluster of high performance
COTS processing elements
- technology independent system software that enhances radiation upset tolerance
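To give a feel for what the availability figures mean in operating time, the sketch below converts an availability value into the downtime it permits over a period. It assumes the common definition availability = uptime / total time, which the slides do not spell out; the function name and the 30-day window are illustrative assumptions only.

```python
def allowed_downtime_hours(availability: float, period_hours: float) -> float:
    """Downtime budget for a given availability.

    Assumes availability = uptime / total time (an assumption, not from the slides).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


PERIOD_HOURS = 30 * 24  # hypothetical 30-day window

headline = allowed_downtime_hours(0.995, PERIOD_HOURS)      # >0.995 on the slide
parenthesized = allowed_downtime_hours(0.95, PERIOD_HOURS)  # (>0.95) on the slide

print(f"availability 0.995: {headline:.1f} h of downtime per 30 days")
print(f"availability 0.95:  {parenthesized:.1f} h of downtime per 30 days")
```

Under these assumptions, the tenfold difference in the downtime budget is the gap between the two figures quoted on the slide.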
Benefits to future users if DM experiment is successful:
- 10X – 100X more delivered computational throughput in space than currently available
- enables heretofore unrealizable levels of science data and autonomy processing
- faster, more efficient applications software development
-- robust, COTS-derived, fault tolerant cluster processing
-- port applications directly from laboratory to space environment
--- MPI-based middleware
--- compatible with standard cluster processing application software including
existing parallel processing libraries
- minimizes non-recurring development time and cost for future missions
- highly efficient, flexible, and portable SW fault tolerant approach applicable to space and
other harsh environments
- DM technology directly portable to future advances in hardware and software technology
Dependable Multiprocessor Technology
• Desire --> ‘Fly high performance COTS multiprocessors in space’
- To satisfy the long-held desire to put the power of today’s PCs and
supercomputers in space, three key issues, SEUs, cooling, & power
efficiency, need to be overcome
DM has addressed and solved all three issues
• Single Event Upset (SEU): Radiation induces transient faults in COTS
hardware causing erratic performance and confusing COTS software
DM Solution
- robust control of cluster
- enhanced, SW-based, SEU-tolerance
• Cooling: Air flow is generally used to cool high performance COTS
multiprocessors, but there is no air in space
DM Solution
- tapped the airborne-conductively-cooled market
• Power Efficiency: COTS only employs power efficiency for compact
mobile computing, not for scalable multiprocessing
DM Solution
- tapped the high performance density mobile market
DM Hardware Architecture
[Figure: DM hardware architecture block diagram. Blocks as labeled: System Controller A and System Controller B, with S/C Interface A and S/C Interface B; Data Processor 1 … Data Processor N; Network A and Network B; Instruments; Mission-Specific Devices *. Data Processor detail labels: Main Processor (PPC 750FX), Co-Processor (FPGA), Memory (Volatile & NV), Bridge/Controller, High-Speed Network I/O (N Ports), Net & Instr IO.]
* Examples: Mass Data Storage Unit, Custom S/C or Sensor I/O, other mission-specific functions
DMM Top-Level Software Layers
DMM – Dependable Multiprocessor Middleware
[Figure: software layer stacks for the two node types, connected by cPCI (TCP/IP over cPCI). Layers as labeled:]
System Controller:
- Policies, Configuration Parameters
- S/C Interface SW and SOH and Exp. Data Collection
- Mission Specific Applications
- DMM
- OS – WindRiver VxWorks 5.4
- Hardware – Honeywell RHSBC
Data Processor:
- Scientific Application
- Application Programming Interface (API)
- DMM
- SAL (System Abstraction Layer)
- OS – WindRiver PNE-LE (CGE) Linux
- Hardware – Extreme 7447A, FPGA
Side labels: Application Specific; Generic Fault Tolerant Framework; OS/Hardware Specific
DMM components and agents.
DMM Software Architecture “Stack”
[Figure: DM system software stack. Blocks as labeled:]
- System Controller (VxWorks OS and Drivers; DMS, CMS, and AMS): DMM with JM, FTM, MM; SCIP & I/F S/W; Mission Specific Parameters; Spacecraft Control Process
- Data Processors (Linux OS and Drivers; DMS, CMS, and AMS), including a Data Processor with FPGA CoProcessor: JMA, AS, FEMPI; MPI Application Process
- MDS: Application Data, Check Points
- RTMM
- Network and sideband signals
Color key categories: HA Middleware; Platform Components; Application Components; Mission Specific Components; Dependable Multiprocessor MW Specific Components
JM – Job Manager
JMA – Job Manager Agent
MM – Mission Manager
FTM – Fault Tolerance Manager
FEMPI – Fault Tolerant Embedded Message Passing Interface
SCIP – Space Craft Interface Message Processor
AS – Application Services
MDS – Mass Data Storage
CMS – Cluster Management Services
AMS – Availability Management Services
DMS – Distributed Messaging Services
RTMM – Radiation Tolerant Mass Memory
Examples: User-Selectable Fault Tolerance Modes
Fault Tolerance Option | Comments
NMR Spatial Replication Services | Multi-node HW SCP and Multi-node HW TMR
NMR Temporal Replication Services | Multiple execution SW SCP and Multiple Execution SW TMR in same node with protected voting
ABFT | Existing or user-defined algorithm; can either detect or detect and correct data errors with less overhead than NMR solution
ABFT with partial Replication Services | Optimal mix of ABFT to handle data errors and Replication Services for critical control flow functions
Check-pointing Roll Back | User can specify one or more check-points within the application, including the ability to roll all the way back to the original
Roll forward | As defined by user
Soft Node Reset | DM system supports soft node reset
Hard Node Reset | DM system supports hard node reset
Fast kernel OS reload | Future DM system will support faster OS re-load for faster recovery
Partial re-load of System Controller/Bridge Chip configuration and control registers | Faster recovery than complete re-load of all registers in the device
Complete System re-boot | System can be designed with defined interaction with the S/C; TBD missing heartbeats will cause the S/C to cycle power
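The replication options above come down to running the same work more than once and voting on the results. The following is a minimal sketch of temporal replication with majority voting, not the DMM implementation: the function names, the simulated bit flip, and the requirement that results be hashable are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(results: Sequence[T]) -> T:
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority
    (results must be hashable; this is a simplification).
    """
    if not results:
        raise RuntimeError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replica results")
    return value


def run_temporal_nmr(task: Callable[[], T], n: int = 3) -> T:
    """Temporal replication: execute the task n times on one node, then vote."""
    return majority_vote([task() for _ in range(n)])


# Hypothetical task whose second execution suffers a simulated bit flip.
calls = {"n": 0}


def flaky_sum() -> int:
    calls["n"] += 1
    total = sum(range(10))        # correct answer is 45
    if calls["n"] == 2:
        total ^= 1 << 3           # simulated upset: flip one bit
    return total


voted = run_temporal_nmr(flaky_sum)
print(voted)
```

With three executions the corrupted result is outvoted. With only two replicas, the same vote can detect a disagreement (it raises) but cannot say which result is correct.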
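As an illustration of the ABFT option (detect, or detect and correct, data errors without full replication), the sketch below uses the classic checksum scheme for matrix multiplication. It is a generic textbook technique, not necessarily the algorithm used in DM; all function names are hypothetical, and integer matrices are used so that checksum comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def check(c_full):
    """Return (bad_rows, bad_cols): data rows/columns that disagree with their checksum."""
    data = [row[:-1] for row in c_full[:-1]]
    bad_rows = [i for i, row in enumerate(data) if sum(row) != c_full[i][-1]]
    bad_cols = [j for j, col in enumerate(zip(*data)) if sum(col) != c_full[-1][j]]
    return bad_rows, bad_cols


def correct_single_error(c_full):
    """Repair one corrupted data element in place; return True if a repair was made."""
    bad_rows, bad_cols = check(c_full)
    if not bad_rows and not bad_cols:
        return False
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise RuntimeError("error cannot be localized to a single data element")
    i, j = bad_rows[0], bad_cols[0]
    others = sum(c_full[i][:-1]) - c_full[i][j]
    c_full[i][j] = c_full[i][-1] - others   # rebuild the element from the row checksum
    return True


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(A), with_row_checksum(B))

c_full[1][0] += 16                 # simulated upset in one result element
detected = check(c_full)           # the bad row and bad column intersect at the error
fixed = correct_single_error(c_full)
print(detected, fixed, [row[:-1] for row in c_full[:-1]])
```

The cost is one extra row and one extra column of arithmetic rather than a second or third full execution, which is the overhead advantage over NMR that the table mentions.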
DM Technology Readiness & Experiment
Development Status and Future Plans
[Figure: program timeline. Milestones and dates as labeled on the chart:]
- 5/17/06, 10/27/06: NASA ST8 Project Confirmation Review; TRL5 Technology Validation (Technology in Relevant Environment)
- 5/31/06: Preliminary Design Review (Preliminary Experiment HW & SW Design & Analysis)
- 6/27/07: Critical Design Review (Final Experiment HW & SW Design & Analysis)
- Radiation testing (5/06, 4/07, & 5/07): Preliminary Radiation Testing (Critical Component Survivability & Preliminary Rates); Final Radiation Testing (Complete Component & System-Level Beam Tests)
- 10/08, 9/08 *: TRL6 Technology Validation (Technology Demonstration in a Relevant Environment *)
- Flight Readiness Review (Built/Tested HW & SW, Ready to Fly)
- Launch 11/09 *
- Mission 1/10 - 6/10 *
- TRL7 Technology Validation (Flight)
* Per direction from NASA Headquarters 8/3/07, the ST8 project ends with TRL6 Validation
Test results indicate DM components will survive and upset adequately @ 455 km x 960 km x 98.2° orbit
DM Phase C/D Flight Testbed System
[Figure: testbed block diagram. Blocks as labeled: Spacecraft Computer (RS422 link); System Controller; four Data Processors, one of which emulates the Mass Data Service; Rad Tolerant Memory Module. DMM is shown on the System Controller and on each Data Processor; the Interface Message Process (SCIP) is also shown.]
- System Controller: Wind River OS - VxWorks 5.4; Honeywell RHSBC (PPC 603e)
- Data Processor: Wind River OS - PNE-LE 4.0 (CGE) Linux; Extreme 6031, PPC 7447a with AltiVec co-processor
- Memory Card: Aitech S990
- Networks: cPCI; Ethernet: 100Mb/s (Point-to-Point Ethernet)
SCIP – S/C Interface Process
DM Phase C/D Flight Testbed
[Photo of the testbed. Parts as labeled: Custom Commercial Open cPCI Chassis; System Controller (flight RHSBC); Backplane Ethernet Extender Cards; Flight-like Mass Memory Module; Flight-like COTS DP nodes]
TRL6 Technology Validation Demonstration (1)
Automated Fault Injection Tests:
[Figure: test setup. Blocks as labeled: Host (NFTAPE); CTSIM or S/C Emulator; Phase C/D Testbed System (cPCI chassis with System Controller, RTMM, and DP boards; Ethernet); DP board with NFTAPE kernel injector and NFTAPE interface.]
KEY:
RTMM - Rad Tolerant Memory Module
DP - COTS Data Processor
NFTAPE – Network Fault Tolerance And Performance Evaluation tool
CTSIM – Command & Telemetry Simulator
TRL6 Technology Validation Demonstration (2)
System-Level Proton Beam Tests:
[Figure: test setup. Blocks as labeled: Proton Beam Radiation Source; Borax Shield; Aperture for Radiation Beam; Additional Cooling Fan; CTSIM or S/C Emulator; Phase C/D Test Bed (cPCI chassis with System Controller, RTMM, and DP boards; Ethernet); DP board on cPCI Extender Card.]
KEY:
RTMM - Rad Tolerant Memory Module
DP - COTS Data Processor
CTSIM – Command & Telemetry Simulator
Dependable Multiprocessor Experiment Payload
on the ST8 “NMP Carrier” Spacecraft
[Figure: payload drawing. Parts as labeled: RHPPC-SBC System Controller; 4-xPedite 6031 DP nodes; Mass Memory Module; MIB; Power Supply Module; DM Payload; Test, Telemetry, & Power Cables]
ST8 Orbit:
- sun-synchronous
- 955 km x 460 km @ 98.2° inclination
Software
• Multi-layered System SW
- OS, DMM, APIs, FT algorithms
• SEU-Tolerance
- detection
- autonomous, transparent recovery
• Applications
- 2DFFT, LUD, Matrix Multiply, GSFC Neural Sensor application
• Multi-processing
- parallelism, redundancy
- combinable FT modes
Flight Hardware
• Dimensions
10.6 x 12.2 x 24.0 in. (26.9 x 30.9 x 45.7 cm)
• Weight (Mass)
~ 61.05 lbs (27.8 kg)
• Power
~ 121 W (max)
The ST8 DM Experiment Payload is a
stand-alone, self-contained, bolt-on system.
Overview of DM Payload Flight Experiment Operation
[Figure: operations flow. Steps as labeled:]
- S/C DM Warm-Up Power On: DM Payload warms to start-up temperature
- S/C DM Operational Power On: DM Init. Power Up Sequence: 1) Syst. Cntrl. 2) DP Nodes 3) Syst. SW
- Continuous execution after start-up, as long as DM experiment is “on”: DM System SW *; DM Environment Data Collection; DM Experiment Application Sequence; System-Level SEU Event Detection
- Uplink or S/C DM Payload Command: DM System Controller; DM responds to command
- Periodic SOH Message: DM System Controller; data collection for periodic SOH message
- Experiment Telemetry Message: DM System Controller; data collection for Experiment Telemetry msg.
- S/C Imm. Power Off Indication: DM System Controller; DM Power Down Sequence
Note:
1) The data collected for the periodic SOH message includes summary experiment
statistics on the environment and on system operation and performance
2) Data collection for the Experiment Data Telemetry message is triggered by
detection of a System-Level SEU event
* S/C Interface, OS, HAM, DMM
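The two notes describe one periodic and one event-triggered data path. A toy simulation of that behavior is sketched below; the class, field names, message tuples, and the 5-tick SOH period are all hypothetical and stand in for details the slides do not give.

```python
from dataclasses import dataclass, field


@dataclass
class DMPayloadSim:
    """Toy model of the continuous-execution phase (all names hypothetical)."""
    soh_period: int = 5                       # assumed ticks between SOH messages
    tick: int = 0
    seu_events: int = 0
    downlink: list = field(default_factory=list)

    def power_up_sequence(self) -> list:
        # Order as listed on the slide.
        return ["System Controller", "DP Nodes", "System SW"]

    def step(self, seu_detected: bool = False) -> None:
        """Advance one tick of continuous execution."""
        self.tick += 1
        if seu_detected:
            # Note 2: a system-level SEU event triggers experiment-data collection.
            self.seu_events += 1
            self.downlink.append(("EXPERIMENT_DATA", self.tick))
        if self.tick % self.soh_period == 0:
            # Note 1: the periodic SOH message carries summary statistics.
            self.downlink.append(("SOH", self.tick, self.seu_events))


sim = DMPayloadSim()
boot_order = sim.power_up_sequence()
for t in range(1, 11):
    sim.step(seu_detected=(t == 3))   # inject one simulated SEU at tick 3
print(boot_order)
print(sim.downlink)
```

In this sketch the SEU count rides along in every later SOH message, while the detailed record is sent only once, when the event occurs.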
DM Technology - Platform Independence
• DM technology has already been ported successfully to a number of
platforms with heterogeneous HW and SW elements
- Pegasus II with Freescale 7447a 1.0GHz processor with AltiVec vector
processor with existing DM TRL5 Testbed
- 35-Node Dual 2.4GHz Intel Xeon processors with 533MHz front-side bus and
hyper-threading (Kappa Cluster)
- 9-Node Dual Motorola G4 7455 @ 1.42 GHz, with AltiVec vector processor (Sigma
Cluster)
- DM flight experiment 7447a COTS processing boards with DM TRL5 Testbed
- State-of-the-art IBM multi-core Cell processor
-- DMM working on Cell; awaiting integration & demonstration with the DM TRL5 Testbed
[Photos: DM TRL6 “Wind Tunnel” with COTS 7447a ST8 Flight Boards; DM TRL5 Testbed System with COTS 750fx boards; 35-Node Kappa Cluster at UF]
NASA GSFC Application Port to DM –
Demonstrated Ease of Use
Time to port a previously unseen application, the NASA Goddard Neural
System Application written in FORTRAN and Java, to the DM TRL5 testbed.*
Task Description (in order):
- Download/Install Gfortran compiler
- Attempt initial compile (failure due to F2003 code)
- Install G95 compiler
- Review FORTRAN Code
- Analyze FORTRAN Code
- Run script with 2 nodes (took a long time to run)
- Review and analyze Java code
- Get Eclipse and JAT
- Create Uni-Processor version
- Create NN Evaluation Program
- Update Data Entry
- Clean up NN Training Program
- Convert Java code to a small C application
- Still working on NN Training
- Install G95 on PPC Cluster
- Install G95 on DM System
- Set up DM files for a new Mission
- Set up DM files for a new Mission
- Modify makefile structure for FORTRAN code
- Execute Mission on DM System
Total Time per Task (Hours):
1, 0.5, 1, 2, 3, 5, 1, 1, 2.5, 0.5, 4, 2, 4, 7, 1, 1, 5, 2, 4, 2, 2
Cumulative Total Time Spent (Hours):
1, 1.5, 2.5, 4.5, 7.5, 12.5, 13.5, 14.5, 17, 17.5, 21.5, 23.5, 27.5, 34.5, 35.5, 36.5, 41.5, 43.5, 47.5, 49.5, 51.5
Comment:
- Successful compilation
- MPI (eminently parallelizable)
- Mostly training time
- Eclipse - Java dev. environment; JAT - Java Astrodynamics Toolkit
- Java --> C conversion
- TRL5 Testbed
- Demo NN with spatial redundancy
Approximately one man-week, including time to find and test FORTRAN compilers that would work on the DM system!
* Port performed by Adam Jacobs, doctoral student at the University of Florida, member of the ST8 DM team.
Neural System application provided by Dr. Steve Curtis (NASA GSFC) and Dr. Michael Rilee (CSC/NASA GSFC)
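The "approximately one man-week" figure can be checked directly from the per-task hours listed above. The short sketch below recomputes the running total; the 40-hour working week used for the conversion is an assumption, not a figure from the slide.

```python
from itertools import accumulate

# Per-task hours as listed on the slide, in order.
hours_per_task = [1, 0.5, 1, 2, 3, 5, 1, 1, 2.5, 0.5, 4,
                  2, 4, 7, 1, 1, 5, 2, 4, 2, 2]

cumulative = list(accumulate(hours_per_task))   # running total after each task
total_hours = cumulative[-1]

HOURS_PER_WEEK = 40  # assumed length of a working week

print(cumulative)
print(f"total: {total_hours} h, about {total_hours / HOURS_PER_WEEK:.2f} working weeks")
```

The recomputed running totals match the cumulative column on the slide, ending at 51.5 hours.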
Summary & Conclusion
• Flying high performance COTS in space is a long-held desire/goal
- Space Touchstone - (DARPA/NRL)
- Remote Exploration and Experimentation (REE) - (NASA/JPL)
- Improved Space Architecture Concept (ISAC) - (USAF)
• NMP ST8 DM project is bringing this desire/goal closer to reality
• Successful DM Experiment CDR on 6/27/07
• DM technology is applicable to wide range of missions
- science and autonomy missions
- landers/rovers
- CEV docking computer
- MKV
- UAVs (Unattended Airborne Vehicles)
- UUVs (Unattended or Un-tethered Undersea Vehicles)
- ORS (Operationally Responsive Space)
- Stratolites
- ground-based systems
- rad hard space applications