Fault Tolerant FPGA Co-processing Toolkit Douglas Michael DiSabello Oral defense in partial fulfillment of the requirements for the degree of Master of Science 2006


Fault Tolerant FPGA
Co-processing Toolkit
Douglas Michael DiSabello
Oral defense in partial fulfillment of the requirements for the degree of
Master of Science 2006
1
Overview



Computation in space radiation environments is slow compared to terrestrial computation
FPGAs can close and even exceed this gap for specific applications
The Fault Tolerant FPGA Co-processing Toolkit provides a general FPGA co-processing capability for space-based applications
2
Outline




The Space Computation Performance Gap
Ionizing Radiation and FPGA Background
Development Board
Fault Tolerant Co-Processing Toolkit
  Support System
  Fault Tolerant Support System
  End User Environment
3
The Space Computation Performance Gap
Radiation Hardened Microprocessor
  BAE Systems 133 MHz PowerPC
  Special and proprietary design techniques
Laptop running this presentation
  1,500 MHz PowerPC
4
Computation Gap Problems


Bare minimum of computations performed in space
Large and slow data transfers to a limited number of receiving stations
5
Field Programmable Gate Arrays

200 to 1600 fold speedups
Exploits fine-grain parallelism of algorithms
Speeds up computations in space beyond what even normal terrestrial computers can offer
Fast development cycles
In-flight reprogramming to adapt to changing mission requirements
6
FPGA Details
SRAM Based
  Configuration memory stores and implements design
  Programmed using JTAG or SelectMap interfaces
Configurable Logic Blocks
  Look Up Tables and supporting logic
Input/Output Blocks
  Used for all general I/O package pins
Block RAMs
General Routing Matrix
7
Ionizing Radiation



Low Earth Orbit contains ionizing particles trapped in the Van Allen Belts
Particles cause direct and secondary nuclear reactions in the silicon substrate
Decreasing feature sizes and shrinking threshold voltages increase the probability of these interactions causing errors in circuits
8
Ionizing Radiation Definitions

Single Event Transient (SET)
  Temporary change in logic value
Single Event Upset (SEU)
  SET that is latched into a memory
Single Event Functional Interrupt (SEFI)
  Component stops service
9
Ionizing Radiation and FPGAs

Architecture
  SEUs in configuration memory
  Instantiated design changes
  Usually results in a SEFI
Data
  SEUs in Flip Flops, Latches, BRAMs, etc.
  Incorrect computation results and/or SEFIs
Off Chip Communication
  Non-dedicated configurable input/output package pins
  SEUs to input/output blocks can disable a package pin
10
Typical FPGA Cross Section

Configuration Memory accounts for 91% of a typical FPGA cross section
  78% - 84.8% Routing Structure
  20% Control bits and CLB LUT values
Michael Caffrey, Paul Graham, Eric Johnson, Michael Wirthlin, Nathan Rollins, and Carl Carmichael, “Single-Event Upsets in SRAM FPGAs,” MAPLD, Sep. 2002
11
Fault Mitigation Techniques

Scrubbing fixes architectural upsets
  Continuously rewrite static portions of configuration memory
  Active partial reconfiguration bitstream
  SEUs corrected at a given reconfiguration interval (shorter than the expected interval between upsets for the given orbit)
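The scrubbing idea on this slide can be pictured with a small software model. This is a minimal sketch, not the toolkit's implementation: configuration memory is modeled as a vector of frames, `golden` stands in for the bitstream held in flash, and `is_static` marks the frames a scrub pass may rewrite. All names and the frame width are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: one 32-bit entry per configuration frame.
using Frame = std::uint32_t;

// Rewrite every static frame from the golden copy, whether or not it was
// upset. Returns how many frames actually differed (upsets repaired).
std::size_t scrub_pass(std::vector<Frame>& config,
                       const std::vector<Frame>& golden,
                       const std::vector<bool>& is_static) {
    assert(config.size() == golden.size());
    assert(config.size() == is_static.size());
    std::size_t repaired = 0;
    for (std::size_t i = 0; i < config.size(); ++i) {
        if (!is_static[i]) continue;           // dynamic frames are masked out
        if (config[i] != golden[i]) ++repaired;
        config[i] = golden[i];                 // blind rewrite of the frame
    }
    return repaired;
}
```

Calling `scrub_pass` once per scrubbing interval repairs any upset in a static frame and leaves masked frames alone.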
12
Fault Mitigation Techniques

Triple Modular Redundancy
  Allows continuous service when architectural upsets occur
  Majority Voters determine final output
  Inherent data redundancy
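A majority voter is easy to state in software. The sketch below is illustrative only (the toolkit's voters are VHDL and hardware resources); it votes bit by bit across three copies of an 8-bit value, and the helper name `check_single_upset_is_masked` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bitwise 2-of-3 majority: each output bit takes the value held by at least
// two of the three copies.
std::uint8_t majority(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    return static_cast<std::uint8_t>((a & b) | (a & c) | (b & c));
}

// Self-check: corrupting any single copy with `upset_mask` does not change
// the voted result.
void check_single_upset_is_masked(std::uint8_t value, std::uint8_t upset_mask) {
    const std::uint8_t bad = static_cast<std::uint8_t>(value ^ upset_mask);
    assert(majority(bad, value, value) == value);
    assert(majority(value, bad, value) == value);
    assert(majority(value, value, bad) == value);
}
```

Two copies upset in the same bit position would defeat the vote, which is why TMR is paired with scrubbing.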
13
Hardware Development Board



Design developed by the Naval Postgraduate School
Modified by the Naval Research Laboratory to include a Virtex II FPGA
Designed for Configurable Fault Tolerant Computing
14
Hardware Development Board
[Figure: block diagram of the development board. Labels: Support FPGA; Co-processing FPGA; Xilinx Virtex FPGA; Xilinx Virtex II FPGA; SelectMap (two links); Flash EEPROM; General I/O; JTAG; PC104/ISA Bus; Embedded x86 PC; note "Only off-chip memory".]
15
Development Board Radiation Testing



NPS and NRL conducted tests at the Crocker Nuclear Lab, U.C. Davis
Protons were emitted from a cyclotron to interact with the FPGAs
Both the Virtex and Virtex II were irradiated
16
Development Board Radiation Testing

Results:
  1 upset every 5 days for the Virtex in orbit
  Upset rate an order of magnitude greater for the Virtex II in orbit
  Scrubbing was proven to repair configuration memory upsets
  TMR was proven to allow continuous service between SEU scrubbing intervals
James C. Coudeyras, “Radiation Testing of The Configurable Fault Tolerant Processor (CFTP) for Space-Based Applications,” Thesis, United States Naval Postgraduate School, 2005
17
The Toolkit Objective



Build a suite of VHDL designs, C++ software, and tools to give a general FPGA co-processing capability
Modular design for easy integration into new hardware platforms and with new HDL modules
Allow designers to concentrate mainly on the co-processing algorithm and design instead of Fault Tolerance
18
How Objective was Achieved
1. Create a support system to interface between all components
2. Modify the support system into a Fault Tolerant version
3. Co-processing designer templates and interfaces
19
Support System




Interpret and execute all commands from a host embedded computer
Route data between all components on the board
Program FPGAs
Software for embedded x86 to interact with support system instantiated into an FPGA
20
[Figure: Support System block diagram. Blocks: Virtex SelectMap Interface; Flash Interface; Flash Arbitrator; Flash Control Interface; Virtex II SelectMap Interface; Inter-FPGA Communication Interface; PC104/ISA Bus Interfaces.]
21
PC104 / ISA Bus Interface


Responsible for all data transfers on and off the FPGA board
Two addresses from the host PC are used:
  Data address
    8-bit data words
  Control address
    A write causes the interface to reset and send a reset to any other modules
    A read gives the status of the buffer FIFOs
Each main support module uses a copy of this interface
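The two-address protocol above can be sketched as a host-visible software model. This is an illustration under stated assumptions, not the toolkit's code: the FIFO depth, the status bit layout, and every identifier here are hypothetical, since the slides give only the behavior (data words through FIFOs, reset on control write, FIFO status on control read). `module_echo` stands in for whatever support module sits behind the interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

constexpr std::size_t kFifoDepth = 16;      // assumed FIFO depth
constexpr std::uint8_t kInEmpty  = 0x01;    // assumed status bit layout
constexpr std::uint8_t kInFull   = 0x02;
constexpr std::uint8_t kOutEmpty = 0x04;
constexpr std::uint8_t kOutFull  = 0x08;

class BusInterfaceModel {
public:
    // Host write to the data address: queue one 8-bit word.
    bool write_data(std::uint8_t word) {
        if (in_.size() >= kFifoDepth) return false;   // input FIFO full
        in_.push_back(word);
        return true;
    }
    // Host read from the data address: take one word from the output FIFO.
    std::uint8_t read_data() {
        assert(!out_.empty());
        const std::uint8_t word = out_.front();
        out_.pop_front();
        return word;
    }
    // Host write to the control address: reset the interface and count the
    // reset that would be forwarded to the attached module.
    void write_control() {
        in_.clear();
        out_.clear();
        ++resets_sent_;
    }
    // Host read from the control address: status of the buffer FIFOs.
    std::uint8_t read_control() const {
        std::uint8_t status = 0;
        if (in_.empty()) status |= kInEmpty;
        if (in_.size() >= kFifoDepth) status |= kInFull;
        if (out_.empty()) status |= kOutEmpty;
        if (out_.size() >= kFifoDepth) status |= kOutFull;
        return status;
    }
    // FPGA side: hypothetical module that echoes input words to the output.
    void module_echo() {
        while (!in_.empty() && out_.size() < kFifoDepth) {
            out_.push_back(in_.front());
            in_.pop_front();
        }
    }
    unsigned resets_sent() const { return resets_sent_; }

private:
    std::deque<std::uint8_t> in_, out_;
    unsigned resets_sent_ = 0;
};
```

A host program would poll `read_control` before each `read_data` so that it never reads an empty output FIFO.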
22
PC104 / ISA Bus Interface
[Figure: bus interface block diagram. Blocks: Bus Control Logic; Input FIFO; Output FIFO. Bus signals: ADDRESS, BUS READ, BUS WRITE, BUS DATA, AEN. FIFO signals: Write Enable, Read Enable, FULL, EMPTY, Data In, Data Out.]
23
Flash Components

Interface
  Translates commands and data into a series of signals to interact with a Flash chip
Control
  Facilitates interaction between the Flash interface and the Bus Interface
Arbitrator
  Each module that requires Flash access is given a priority number
  A module must relinquish control before another module can be given access
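The arbitration rule can be sketched as a small non-preemptive priority arbiter. The slides do not say which direction the priority numbers run or how requests are queued, so this sketch assumes that a lower number wins and that priorities are unique per module; `FlashArbiter` and `kNoOwner` are hypothetical names.

```cpp
#include <cassert>
#include <map>

constexpr int kNoOwner = -1;

class FlashArbiter {
public:
    // A module asks for the Flash. Assumption: priorities are unique.
    void request(int module, int priority) {
        pending_[priority] = module;
        grant();
    }
    // The current owner relinquishes control; only then can another module
    // be granted access.
    void release(int module) {
        assert(owner_ == module);
        owner_ = kNoOwner;
        grant();
    }
    int owner() const { return owner_; }

private:
    void grant() {
        if (owner_ != kNoOwner || pending_.empty()) return;  // no preemption
        auto best = pending_.begin();   // smallest priority number first
        owner_ = best->second;
        pending_.erase(best);
    }
    std::map<int, int> pending_;        // priority -> waiting module id
    int owner_ = kNoOwner;
};
```

Because `grant` never revokes ownership, a high-priority request that arrives late still waits for the current owner to release the Flash.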
24
SelectMap Interfaces




Control configuration data flow to the FPGAs' configuration interfaces
Configuration Clock is used to allow non-uniform data loading
Actual FPGA configuration commands are contained in the configuration data
Two versions: Virtex and Virtex II (and Virtex 4)
[Figure: SelectMap Interface block. Signals: Flash Address; Flash Interface CMD; Flash Data; Flash Data Valid; Flash Interface Busy; Flash Control Request; Current Flash Owner; SelectMap Data Out <7:0>; Write; Chip Select; Configuration Clock.]
25
SelectMap Interface Commands







Load Start Address
Load Stop Address
Program Using Flash
Load Bus Word Number
Program Using Bus
Scrub Using Flash
Abort (Virtex version only)
26
Inter-FPGA Communication Interface



ISA Bus interface modified to account for delays between physical FPGA chips
Co-processing FPGA has direct access to the bus and its own memory space
Designed specifically to use resources of the support FPGA
27
x86 Host PC Programs





Flash program
Flash verify
SelectMap Configuration
Scrub On and Scrub Off
Co-processing echo check program
28
Fault Tolerant Support System




Add fault tolerance to original support system
Specialized fault mitigation techniques
FPGA configuration, scrubbing, and bitstream manipulation
Support System End User Environment
29
Support System Main Fault Tolerance Methods
Triple Modular Redundancy
  All modules are made in triplicate
  Majority voter determines correct output
  Between HDL modules three voters are used to keep redundancy of signal paths
  Place and Route to keep redundant modules separate
30
Majority Voters

Tri-State Buffers
  Not made from SRAM material
  Only interconnects are susceptible and correct operation still results
  Takes multiple SEUs for incorrect function
Carl Carmichael, “Triple Modular Redundancy Design Techniques for Virtex FPGAs,” Xilinx Application Note 197, 2001
31
FSMs and TMR


Method keeps FSM synchronized if an SEU occurs in the state register
Outputs are also majority voted
[Figure: triplicated FSM with three copies each of Next State Logic, Current State Register, and voter (V), driven by common INPUTS.]
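The resynchronization behavior can be demonstrated with a small model. This is a sketch under assumptions, not the toolkit's VHDL: the example FSM (a 2-bit counter) and all names are hypothetical. Each state register copy computes its next state from the voted current state, so a copy hit by an SEU rejoins the other two on the next clock.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

std::uint8_t vote(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    return static_cast<std::uint8_t>((a & b) | (a & c) | (b & c));
}

// Hypothetical next-state logic: a 2-bit counter that advances on `enable`.
std::uint8_t next_state(std::uint8_t state, bool enable) {
    return enable ? static_cast<std::uint8_t>((state + 1) & 0x3) : state;
}

struct TmrFsm {
    std::array<std::uint8_t, 3> state{{0, 0, 0}};

    // One clock edge. In hardware each copy has its own voter; all three
    // voters see the same inputs, so one call stands in for them here.
    void clock(bool enable) {
        const std::uint8_t voted = vote(state[0], state[1], state[2]);
        for (auto& s : state) s = next_state(voted, enable);
    }
    // Outputs are majority voted as well.
    std::uint8_t output() const { return vote(state[0], state[1], state[2]); }

    // Flip bits in one state register copy to model an SEU.
    void inject_upset(std::size_t copy, std::uint8_t mask) {
        assert(copy < state.size());
        state[copy] ^= mask;
    }
};
```

Without the voted feedback, the upset copy would keep counting from its wrong state and stay out of step until the next reset.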
32
Block RAM Fault Mitigation

TMR
  Necessary for routing structure
  Inherent data redundancy
  Method is fine for short term data storage
  SEUs are not corrected during long term data storage in individual BRAM
33
BRAM TMR w/ Refresh




All BRAMs are dual ported
Second ports are used to constantly read data values from three copies, vote, and rewrite the values
Data write collision avoidance
Not needed for support system, but useful for co-processing applications
Carl Carmichael, “Triple Modular Redundancy Design Techniques for Virtex FPGAs,” Xilinx Application Note 197, 2001
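The read, vote, and rewrite cycle can be sketched in software as follows. This is a minimal model, not the application note's design: the memory depth, the word width, and the way a write collision is signalled (`user_write_addr`) are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 8;   // assumed (tiny) BRAM depth
using Bram = std::array<std::uint16_t, kWords>;

std::uint16_t vote(std::uint16_t a, std::uint16_t b, std::uint16_t c) {
    return static_cast<std::uint16_t>((a & b) | (a & c) | (b & c));
}

struct TmrBram {
    std::array<Bram, 3> copy{};     // three redundant memories, zeroed
    std::size_t refresh_addr = 0;   // address counter for the second port

    // User port: write the same value into all three copies.
    void write(std::size_t addr, std::uint16_t value) {
        assert(addr < kWords);
        for (auto& c : copy) c[addr] = value;
    }
    // User port: voted read.
    std::uint16_t read(std::size_t addr) const {
        assert(addr < kWords);
        return vote(copy[0][addr], copy[1][addr], copy[2][addr]);
    }
    // Second port: read the three copies at one address, vote, rewrite, and
    // advance. If the user port is writing that address this cycle
    // (user_write_addr), skip the rewrite to avoid a collision.
    void refresh_step(std::size_t user_write_addr = kWords) {
        if (refresh_addr != user_write_addr) {
            const std::uint16_t v = read(refresh_addr);
            for (auto& c : copy) c[refresh_addr] = v;
        }
        refresh_addr = (refresh_addr + 1) % kWords;
    }
};
```

Stepping the refresh counter through every address bounds how long a single upset can sit in one copy, which is what plain TMR lacks for long-term storage.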
34
Specialized Techniques

Off FPGA transfers
  TMR of package pins to a single trace
  TMR of package pins not available on development board
Carl Carmichael, “Triple Modular Redundancy Design Techniques for Virtex FPGAs,” Xilinx Application Note 197, 2001
35
Specialized Techniques

Inter-FPGA transfers
  Not enough pins available to triplicate all signals
  Hamming code with 8 data bits and 4 redundant bits used for data
  Double Error Detection and Single Error Correction
  Triplication of all other signals
36
Specialized Techniques

PC104 / ISA Bus transfers
  Updated fault tolerant module allows for extended data transfer sizes
  All data sent is encoded in (8,4) Hamming code
  No redundancy available for other signals
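A worked encoder and decoder helps show what the code buys. The sketch below assumes that "(8,4)" means the standard extended Hamming code, in which each 8-bit bus word carries 4 data bits, 3 Hamming check bits, and 1 overall parity bit. The bit layout and all names are illustrative assumptions; the toolkit's actual layout is not given in the slides.

```cpp
#include <cassert>
#include <cstdint>

enum class Status { Ok, Corrected, DoubleError };
struct Decoded { Status status; std::uint8_t nibble; };

// Parity (0 or 1) of all eight bits.
int parity8(std::uint8_t v) {
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1;
}

// Assumed layout: bit i holds Hamming position i (1..7), with check bits at
// positions 1, 2, 4 and data at 3, 5, 6, 7; bit 0 is the overall parity.
std::uint8_t hamming84_encode(std::uint8_t nibble) {
    assert(nibble < 16);
    const int d0 = nibble & 1, d1 = (nibble >> 1) & 1;
    const int d2 = (nibble >> 2) & 1, d3 = (nibble >> 3) & 1;
    const int p1 = d0 ^ d1 ^ d3, p2 = d0 ^ d2 ^ d3, p4 = d1 ^ d2 ^ d3;
    std::uint8_t cw = static_cast<std::uint8_t>(
        (p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) |
        (d1 << 5) | (d2 << 6) | (d3 << 7));
    cw = static_cast<std::uint8_t>(cw | parity8(cw));  // make parity even
    return cw;
}

Decoded hamming84_decode(std::uint8_t cw) {
    int syndrome = 0;                        // XOR of positions of set bits
    for (int pos = 1; pos <= 7; ++pos)
        if ((cw >> pos) & 1) syndrome ^= pos;
    Status status = Status::Ok;
    if (parity8(cw) != 0) {
        // Odd overall parity: treat as a single error. The syndrome names
        // the flipped position (0 means the overall parity bit itself).
        cw ^= static_cast<std::uint8_t>(1u << syndrome);
        status = Status::Corrected;
    } else if (syndrome != 0) {
        return {Status::DoubleError, 0};     // even parity, bad syndrome
    }
    const std::uint8_t nibble = static_cast<std::uint8_t>(
        ((cw >> 3) & 1) | (((cw >> 5) & 1) << 1) |
        (((cw >> 6) & 1) << 2) | (((cw >> 7) & 1) << 3));
    return {status, nibble};
}
```

Under this layout every single-bit error is corrected and every double-bit error is flagged, at the cost of halving the payload of each 8-bit bus word.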
37
Flash Memory Data Reads



CRC values embedded every 512 16-bit flash words by Flash program
Fault tolerant Flash interface uses a BRAM buffer cache to hold each 512-word block of data when data is requested
If the CRC value is incorrect, the data cache is flushed and the data is read again
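The check-and-reread loop can be sketched as below. The slides do not name the CRC polynomial or a retry limit, so this sketch assumes a 16-bit CCITT polynomial (0x1021, initial value 0xFFFF) and a hypothetical `max_attempts` parameter; `FlashReader` is a stand-in for the Flash interface and returns one block of 512 data words followed by its stored CRC word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr std::size_t kBlockWords = 512;

// CRC over 16-bit words, high byte first. The polynomial is an assumption.
std::uint16_t crc16(const std::vector<std::uint16_t>& words) {
    std::uint16_t crc = 0xFFFF;
    for (std::uint16_t w : words) {
        for (int byte = 1; byte >= 0; --byte) {
            crc ^= static_cast<std::uint16_t>(((w >> (8 * byte)) & 0xFF) << 8);
            for (int bit = 0; bit < 8; ++bit) {
                crc = static_cast<std::uint16_t>(
                    (crc & 0x8000) ? (crc << 1) ^ 0x1021 : (crc << 1));
            }
        }
    }
    return crc;
}

// Hypothetical raw read: 512 data words plus one trailing CRC word.
using FlashReader = std::function<std::vector<std::uint16_t>(std::size_t)>;

// Fill the cache from flash. On a CRC mismatch, flush the cache and read the
// block again, up to max_attempts times (the limit is an assumption).
bool read_block(const FlashReader& read_raw, std::size_t block,
                std::vector<std::uint16_t>& cache, int max_attempts = 3) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::vector<std::uint16_t> raw = read_raw(block);
        assert(raw.size() == kBlockWords + 1);
        const std::uint16_t stored = raw.back();
        raw.pop_back();
        cache = raw;
        if (crc16(cache) == stored) return true;
        cache.clear();                       // flush and try again
    }
    return false;
}
```

Only data that passes the CRC is ever handed on from the cache, so a transient read error costs one extra block read.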
38
Configuration and Scrubbing

Three modes of configuration determined by the bit file created using the Xilinx BitGen tool
  Initial Configuration
    Contains startup commands
  Reconfiguration
    Contains shutdown and startup commands
    Doesn't require a power cycle, but FPGA is taken out of service
  Active Partial Reconfiguration
    Used for scrubbing
    Rewrites static portions of bitstream
    Removes initial BRAM contents
    Any portion of design could be masked out
39
Support System End User Environment

Flash program and verify host programs
  Store initial co-processing, co-processing scrubbing, and support scrubbing bitstreams at designated flash addresses
Scrub On and Off
  SelectMap interfaces begin scrubbing at given periods using flash data
  Configuration can also still be done across the bus
C++ example programs and header files for data transfers to the co-processing FPGA
40
Toolkit User Environment

Fault Tolerant Support System
  Controls all data flow
  Modular design for quick integration into different physical systems with varying numbers of FPGAs
  C++ programs and header files
Co-Processing Tools
  Templates for HDL co-processing components
  HDL “hooks” for data transfers
  C++ programs and header files
41
Future Toolkit Additions



Different bus modules for interaction with radiation hardened microprocessors
Co-processing library
Use support system on single and multiple FPGA boards
42