High-Performance, Dependable Multiprocessor
John David Eriksen, Jamie Unger-Fink

Background and Motivation

 Traditional space computing limited primarily to mission-critical applications
  ◦ Spacecraft control
  ◦ Life support
 Data collected in space is processed on the ground
 Data sets in space applications continue to grow

Background and Motivation

 Communication bandwidth not growing fast enough to cope with the increasing size of data sets
  ◦ Instruments and sensors grow in capability
 Increasing need for on-board data processing
  ◦ Perform data filtering and other operations on board
  ◦ Autonomous systems demand more computing power

Related Work

 Advanced Onboard Signal Processor (AOSP)
  ◦ Developed in the 1970s and 1980s
  ◦ Helped develop understanding of radiation effects on computing systems and components

 Advanced Architecture Onboard Processor (AAOP)
  ◦ Engineered new approaches to onboard data processing

Related Work

 Space Touchstone
  ◦ First COTS-based, fault-tolerant, high-performance system
 Remote Exploration and Experimentation (REE)
  ◦ Extended fault-tolerance techniques to parallel and cluster computing
  ◦ Focused on low-cost, high-performance compute cluster designs with a favorable performance-to-power ratio

Goal

 Address the need for increased on-board data processing
 Bring COTS systems to space
  ◦ COTS (Commodity Off-The-Shelf)
     Less expensive
     General-purpose
  ◦ Need special considerations to meet the requirements of aerospace environments
     Fault tolerance
     High reliability
     High availability

Dependable Multiprocessor is…

 A reconfigurable cluster computer with centralized control.

Dependable Multiprocessor is…

 A hardware architecture
  ◦ High-performance characteristics
  ◦ Scalable
  ◦ Upgradable (thanks to reliance on COTS)
 A parallel processing environment
  ◦ Supports a common scientific computing development environment (FEMPI)
 A fault-tolerant computing platform
  ◦ System controllers provide fault-tolerance properties
 A toolset for predicting application behavior
  ◦ Fault behavior, performance, availability, …

Hardware Architecture

 Redundant radiation-hardened system controller
 Cluster of COTS-based reconfigurable data processors
 Redundant COTS-based packet-switched networks
 Radiation-hardened mass data store
 Redundancy available in:
  ◦ System controller
  ◦ Network
  ◦ Configurable N-of-M sparing in compute nodes (see the sketch below)
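A rough sketch of how configurable N-of-M sparing can work: M physical compute nodes are installed, N must stay active, and a spare is promoted when an active node fails. The node counts and the handle_node_failure routine below are hypothetical illustrations, not the DM controller's actual logic.

```c
/* Minimal sketch of configurable N-of-M sparing (illustrative only; the
 * real DM controller logic is not described at this level in the slides). */
#include <stdio.h>

#define M_NODES 6   /* physical compute nodes installed (hypothetical)   */
#define N_ACTIVE 4  /* nodes that must be healthy for the mission to run */

enum node_state { NODE_ACTIVE, NODE_SPARE, NODE_FAILED };

static enum node_state nodes[M_NODES] =
    { NODE_ACTIVE, NODE_ACTIVE, NODE_ACTIVE, NODE_ACTIVE, NODE_SPARE, NODE_SPARE };

/* Mark a node failed and, if possible, promote a spare so that N of the
 * M nodes remain active. Returns 0 on success, -1 if no spare is left. */
static int handle_node_failure(int failed)
{
    nodes[failed] = NODE_FAILED;
    for (int i = 0; i < M_NODES; i++) {
        if (nodes[i] == NODE_SPARE) {
            nodes[i] = NODE_ACTIVE;
            printf("node %d failed, spare %d promoted\n", failed, i);
            return 0;
        }
    }
    printf("node %d failed, no spares remain\n", failed);
    return -1;
}

int main(void)
{
    handle_node_failure(2);   /* tolerated: a spare takes over */
    handle_node_failure(4);   /* tolerated: the last spare takes over */
    return 0;
}
```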

Hardware Architecture

 Scalability
  ◦ Variable number of compute nodes
  ◦ Cluster-of-clusters
 Compute nodes
  ◦ IBM PowerPC 750FX general-purpose processor
  ◦ Xilinx Virtex-II 6000 FPGA co-processor
     Reconfigurable to fulfill various roles: DSP processor, data compression, vector processing
     Applications implemented in hardware can be very fast
  ◦ Memory and other support chips

Hardware Architecture

 Network interconnect
  ◦ Gigabit Ethernet for data exchange
  ◦ A low-latency, low-bandwidth bus used for control
 Mission interface
  ◦ Provides interface to the rest of the space vehicle's computer systems
  ◦ Radiation-hardened

Hardware Architecture

 Current hardware implementation
  ◦ Four data processors
  ◦ Two redundant system controllers
  ◦ One mass data store
  ◦ Two Gigabit Ethernet networks, including two network switches
  ◦ Software-controlled instrumented power supply
  ◦ Workstation running spacecraft system emulator software

Software Architecture

 Platform layer
  ◦ Lowest layer; interfaces the hardware to the middleware
  ◦ Hardware-specific software, network drivers
  ◦ Uses Linux, allowing the use of many existing software tools
 Middleware layer
  ◦ Includes DM System Services: fault tolerance, job management, etc.
 Mission layer

 DM Framework is application-independent and platform-independent
  ◦ API to communicate with the mission layer
  ◦ SAL (System Abstraction Layer) for the platform layer
 Allows for future applications by facilitating porting to new platforms
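The slides name the layers but not their interfaces. The sketch below illustrates, under assumed function names (dm_api_submit_result, sal_send, etc.), how a mission-layer call could pass through the DM framework API, the middleware, and the SAL before reaching platform-specific code, so that only the SAL changes when the platform does.

```c
/* Hedged sketch of the layering: a mission-layer call passes through the DM
 * framework API, the middleware, and the SAL before touching the platform.
 * All function names are invented to illustrate the boundaries only. */
#include <stdio.h>

/* Platform layer: hardware-specific code (here just a stub printing to stdout). */
static void platform_net_send(const char *buf) { printf("[driver] %s\n", buf); }

/* SAL: the only place that calls platform-specific code, so a port to new
 * hardware replaces this layer rather than the middleware above it. */
static void sal_send(const char *buf) { platform_net_send(buf); }

/* Middleware (DM System Services): would add services such as fault reporting. */
static void dm_send(const char *buf) { sal_send(buf); }

/* DM framework API used by the mission layer (the application). */
static void dm_api_submit_result(const char *result) { dm_send(result); }

int main(void)
{
    dm_api_submit_result("image filter output block 7");  /* mission-layer call */
    return 0;
}
```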

 HA Middleware foundation includes: Availability Management (AMS), Distributed Messaging (DMS), Cluster Management (CMS)
 Primary functions
  ◦ Resource monitoring
  ◦ Fault detection, diagnosis, recovery, and reporting
  ◦ Cluster configuration
  ◦ Event logging
  ◦ Distributed messaging
 Based on a small, cross-platform kernel

 Hosted on the cluster's system controller
 Managed resources include:
  ◦ Applications
  ◦ Operating system
  ◦ Chassis
  ◦ I/O cards
  ◦ Redundant CPUs
  ◦ Networks
  ◦ Peripherals
  ◦ Clusters
  ◦ Other middleware

 Provides a reliable messaging layer for communications in the DM cluster
 Used for checkpointing, client/server communications, event notification, fault management, and time-critical communications
 An application opens a DMS connection (channel) to pass data to interested subscribers
 Since messaging is handled in the middleware instead of lower layers, the application does not have to specify a destination address explicitly
 Messages are classified, and machines choose to receive messages of a certain type
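As a hedged illustration of the publish/subscribe pattern described above, the sketch below registers a subscriber for one message class and publishes a message without naming a destination. The dms_subscribe/dms_publish names and the in-process delivery are invented stand-ins; the real DMS API is not given in the slides.

```c
/* Illustrative publish/subscribe pattern in the spirit of the DMS description:
 * the publisher classifies a message by type and never names a destination;
 * subscribers register interest in that type. Every identifier is hypothetical. */
#include <stdio.h>
#include <string.h>

#define MSG_HEALTH  1   /* example message class: node health reports */

typedef void (*dms_handler)(const void *payload, unsigned len);

static dms_handler health_subscribers[8];
static int n_health_subscribers;

/* Subscriber side: ask the middleware for all messages of a given class. */
static void dms_subscribe(int msg_type, dms_handler fn)
{
    if (msg_type == MSG_HEALTH)
        health_subscribers[n_health_subscribers++] = fn;
}

/* Publisher side: hand the message to the middleware; delivery to whoever
 * subscribed happens without the publisher knowing any addresses. */
static void dms_publish(int msg_type, const void *payload, unsigned len)
{
    if (msg_type == MSG_HEALTH)
        for (int i = 0; i < n_health_subscribers; i++)
            health_subscribers[i](payload, len);
}

static void on_health(const void *payload, unsigned len)
{
    printf("FT manager received %u-byte health report: %.*s\n",
           len, (int)len, (const char *)payload);
}

int main(void)
{
    dms_subscribe(MSG_HEALTH, on_health);                       /* e.g. FT manager */
    const char *report = "node 2 OK";
    dms_publish(MSG_HEALTH, report, (unsigned)strlen(report));  /* e.g. an agent */
    return 0;
}
```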

 Manages physical nodes or instances of HA middleware
 Discovers and monitors nodes in a cluster
 Passes node failures to the AMS and the FT Manager via DMS

 Database management
 Logging services
 Tracing

 Interface to the control computer or ground station
 Communicates with the system via DMS
 Monitors system health with the FT Manager
  ◦ "Heartbeat"

 Detects and recovers from system faults
 The FTM refers to a set of recovery policies at runtime
 Relies on distributed software agents to gather system and application liveness information
  ◦ Avoids a monitoring bottleneck
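A minimal sketch of a per-node liveness agent, assuming a one-second heartbeat period and an invented message layout: the agent periodically announces that its node is alive, and the FT Manager would flag the node if a heartbeat deadline is missed. None of these names come from the DM implementation.

```c
/* Sketch of a per-node liveness agent: it periodically sends a heartbeat so
 * the Fault Tolerance Manager can detect a silent node by a missed deadline.
 * The message layout, period, and function names are assumptions. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HEARTBEAT_PERIOD_S 1   /* hypothetical heartbeat period */

struct heartbeat {
    int    node_id;
    time_t sent_at;
};

/* Stand-in for publishing the heartbeat on the control network / DMS. */
static void send_heartbeat(const struct heartbeat *hb)
{
    printf("node %d alive at %ld\n", hb->node_id, (long)hb->sent_at);
}

int main(void)
{
    struct heartbeat hb = { .node_id = 2 };
    for (int i = 0; i < 3; i++) {          /* a real agent would loop forever */
        hb.sent_at = time(NULL);
        send_heartbeat(&hb);
        sleep(HEARTBEAT_PERIOD_S);
    }
    return 0;
}
```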

 Provides application scheduling and resource allocation
 Opportunistic load-balancing scheduler (see the sketch below)
 Jobs are registered and tracked by the JM via tables
 Checkpointing allows seamless recovery of the JM
 Heartbeats to the FT Manager via the middleware
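A small sketch of an opportunistic load-balancing dispatch, assuming a per-node load table in the spirit of the JM's job tables: each new job simply goes to the least-loaded healthy node. The names and table layout are illustrative, not the JM's own.

```c
/* Rough sketch of opportunistic load balancing: place each new job on the
 * healthy node with the fewest jobs currently running. Hypothetical names. */
#include <stdio.h>

#define NODES 4

static int jobs_on_node[NODES];                  /* per-node load table */
static int node_healthy[NODES] = { 1, 1, 1, 1 };

/* Pick the least-loaded healthy node; return -1 if none is available. */
static int dispatch_job(int job_id)
{
    int best = -1;
    for (int n = 0; n < NODES; n++)
        if (node_healthy[n] && (best < 0 || jobs_on_node[n] < jobs_on_node[best]))
            best = n;
    if (best >= 0) {
        jobs_on_node[best]++;
        printf("job %d -> node %d\n", job_id, best);
    }
    return best;
}

int main(void)
{
    node_healthy[2] = 0;          /* e.g. node 2 was marked failed by the FTM */
    for (int j = 0; j < 6; j++)
        dispatch_job(j);
    return 0;
}
```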

 Fault-Tolerant Embedded Message Passing Interface (FEMPI)
  ◦ Application-independent fault-tolerant middleware
  ◦ Message Passing Interface (MPI) standard
  ◦ Built on top of the HA middleware

 Recovery from failure should be automatic, with minimal impact
 Needs to maintain global awareness of the processes in parallel applications
 Three stages:
  ◦ Fault detection
  ◦ Notification
  ◦ Recovery
 Process failures vs. network failures
 Survives the crash of n-1 processes in an n-process job
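FEMPI's own calls are not shown in the slides; the hedged sketch below uses standard MPI to illustrate the detect/notify/recover idea: the error handler is switched to MPI_ERRORS_RETURN so a failed communication is reported to the application instead of aborting the whole job.

```c
/* Hedged illustration of the detect/notify/recover pattern using standard MPI:
 * errors are returned to the caller rather than aborting, so the application
 * can react to a failed peer. This mirrors the idea, not FEMPI's implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Detection: ask MPI to return error codes rather than abort the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = (rank + 1) % size;
    int token = rank, recv_token;
    int rc = MPI_Sendrecv(&token, 1, MPI_INT, peer, 0,
                          &recv_token, 1, MPI_INT, (rank + size - 1) % size, 0,
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rc != MPI_SUCCESS) {
        /* Notification: report the failure; a FEMPI-like layer would inform the
         * FT manager here, and recovery (e.g. restart from checkpoint) would follow. */
        fprintf(stderr, "rank %d: exchange with peers failed (code %d)\n", rank, rc);
    }

    MPI_Finalize();
    return 0;
}
```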

 Proprietary nature of the FPGA industry
 USURP (USURP's Standard for Unified Reconfigurable Platforms)
  ◦ Standard to interact with hardware
  ◦ Provides middleware for portability
  ◦ Black-box IP cores
  ◦ Wrappers mask the FPGA board

 Not a universal tool for mapping high-level code to hardware design
 OpenFPGA
 Adaptive Computing System (ACS) vs. USURP
  ◦ Object-oriented models vs. software APIs
 IGOL
 BLAST
 CARMA

 Responsible for:
  ◦ Unifying vendor APIs
  ◦ Standardizing the HW interface
  ◦ Organization of data for the user application core
  ◦ Exposing the developer to common FPGA resources
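As a hedged sketch of the wrapper idea, the code below defines a board-neutral table of operations that an application core programs against, with a stub standing in for one vendor's board. Every identifier is invented for illustration; the actual USURP interfaces are not given here.

```c
/* Hypothetical sketch of a USURP-style wrapper: the application talks to a
 * board-neutral API; a per-vendor wrapper hides how the bitstream is loaded
 * and how registers are reached. All names are invented. */
#include <stdio.h>
#include <stdint.h>

/* Board-neutral API the application core developer programs against. */
typedef struct {
    int (*load_core)(const char *bitstream_path);       /* black-box IP core */
    int (*write_reg)(uint32_t addr, uint32_t value);
} fpga_wrapper;

/* Stub wrapper for one (imaginary) vendor board. */
static int demo_load_core(const char *path)
{ printf("loading bitstream %s\n", path); return 0; }
static int demo_write_reg(uint32_t addr, uint32_t value)
{ printf("reg[0x%08x] <- 0x%08x\n", addr, value); return 0; }

static const fpga_wrapper demo_board = { demo_load_core, demo_write_reg };

int main(void)
{
    const fpga_wrapper *fpga = &demo_board;   /* chosen at build/run time */
    fpga->load_core("fft_core.bit");          /* hypothetical core name */
    fpga->write_reg(0x10, 1);                 /* e.g. start the core */
    return 0;
}
```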

 User-level protocol for system recovery
 Consists of:
  ◦ Server process that runs on the Mass Data Store
  ◦ DMS API for applications
     C-type interfaces
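A minimal sketch of user-level checkpointing with C-type interfaces, where a local file stands in for the checkpoint server on the Mass Data Store: the application chooses what state to save and restores it after a restart. The ckpt_save/ckpt_restore names are hypothetical.

```c
/* Hedged sketch of user-level checkpointing: a local file stands in for the
 * checkpoint server; in the DM it would sit behind the DMS API. */
#include <stdio.h>

static int ckpt_save(const char *job, const void *state, size_t len)
{
    FILE *f = fopen(job, "wb");
    if (!f) return -1;
    size_t written = fwrite(state, 1, len, f);
    fclose(f);
    return written == len ? 0 : -1;
}

static int ckpt_restore(const char *job, void *state, size_t len)
{
    FILE *f = fopen(job, "rb");
    if (!f) return -1;
    size_t got = fread(state, 1, len, f);
    fclose(f);
    return got == len ? 0 : -1;
}

int main(void)
{
    double progress[2] = { 0.75, 42.0 };          /* application-defined state */
    ckpt_save("job17.ckpt", progress, sizeof progress);

    double recovered[2] = { 0 };
    if (ckpt_restore("job17.ckpt", recovered, sizeof recovered) == 0)
        printf("restarted at %.0f%% (item %.0f)\n", recovered[0] * 100, recovered[1]);
    return 0;
}
```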

 Algorithm-Based Fault Tolerance (ABFT) library
 Collection of mathematical routines that can detect and correct faults
 BLAS-3 library
  ◦ Matrix multiply, LU decomposition, QR decomposition, singular value decomposition (SVD), and fast Fourier transform (FFT)

 Uses checksums
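The worked example below shows the classic checksum idea (in the style of Huang and Abraham's ABFT) for matrix multiply: row and column checksums of the product locate a single corrupted element and give the value needed to correct it. It is a from-scratch sketch, not the DM library's code, and for brevity the reference checksums are taken from the product before the fault rather than from checksum-encoded inputs.

```c
/* Checksum-based ABFT sketch for matrix multiply: one corrupted element shows
 * up as one bad row sum and one bad column sum, which locates it; the size of
 * the discrepancy gives the correction. Illustrative only. */
#include <stdio.h>
#include <math.h>

#define N 3

static void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void)
{
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double C[N][N];
    matmul(A, B, C);

    /* Reference checksums (here taken from C before the fault for brevity;
     * the real scheme derives them from checksum-encoded inputs). */
    double row_ref[N] = {0}, col_ref[N] = {0};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { row_ref[i] += C[i][j]; col_ref[j] += C[i][j]; }

    C[1][2] += 100.0;   /* inject a single fault, e.g. a radiation-induced upset */

    /* Detection and correction: recompute the sums and compare. */
    int bad_row = -1, bad_col = -1;
    double delta = 0.0;
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++) s += C[i][j];
        if (fabs(s - row_ref[i]) > 1e-9) { bad_row = i; delta = s - row_ref[i]; }
    }
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += C[i][j];
        if (fabs(s - col_ref[j]) > 1e-9) bad_col = j;
    }
    if (bad_row >= 0 && bad_col >= 0) {
        C[bad_row][bad_col] -= delta;
        printf("corrected C[%d][%d] (error %.1f)\n", bad_row, bad_col, delta);
    }
    return 0;
}
```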

 Triple Modular Redundancy (TMR)
 Process-level replication
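A minimal sketch of the two techniques named above: three replicas compute the same result and a majority vote masks a single faulty copy. The voter and the replicated function are illustrative stand-ins, not DM code.

```c
/* Triple modular redundancy at the process/result level: run the computation
 * three times and vote. A single disagreeing replica is outvoted. */
#include <stdio.h>
#include <stdint.h>

/* Majority vote over three replica results; returns 0 and sets *out when at
 * least two replicas agree, -1 when all three disagree (uncorrectable). */
static int tmr_vote(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
{
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}

static uint32_t compute(uint32_t x) { return x * x + 1; }   /* replicated work */

int main(void)
{
    uint32_t r1 = compute(7);
    uint32_t r2 = compute(7);
    uint32_t r3 = compute(7) ^ 0x4;     /* one replica hit by a bit flip */

    uint32_t result;
    if (tmr_vote(r1, r2, r3, &result) == 0)
        printf("voted result: %u\n", result);   /* 50, despite the faulty copy */
    else
        printf("no majority\n");
    return 0;
}
```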

Conclusion

 System architecture has been defined
 Testbench has been assembled
 Future improvements:
  ◦ More aggressively address power consumption issues
  ◦ Add support for other scientific computing languages such as Fortran