Transcript Slide 1

Upset Susceptibility and Design Mitigation of PowerPC405 Processors
Embedded in Virtex II-Pro FPGAs
Swift
1
P173/MAPLD 2005
Authors
Gary Swift
Jet Propulsion Laboratory/California Institute of Technology
Gregory Allen
Jet Propulsion Laboratory/California Institute of Technology
Jeffrey George
The Aerospace Corporation
Sana Rezgui
Xilinx Corporation
Carl Carmichael
Xilinx Corporation
Fayez Chayab
MDRobotics
Abstract
We show recent results for the upset susceptibility of the registers
and caches in the embedded PowerPC405 in the Xilinx V2P40 FPGA.
For critical flight designs where configuration upsets are mitigated
effectively, these upsets can dominate the system error rate.
We consider several techniques for implementing various levels of
redundancy to reduce system errors, including single-, dual- and
triple-chip options. We conclude that the dual-chip option may
often be the best choice and warrants further study.
Background - Reconfigurable FPGA Upsets
The basic building blocks are soft to upset [Ref. 1]
[Figure: Configuration Cells and Block RAM, XQR2VP40. Cross section
(cm^2/bit, 1E-10 to 1E-7) vs. LET (MeV per mg/cm^2, 0 to 40); series:
Config, BRAM.]
Background - Upset Mitigation
Critical applications require design-level upset
mitigation
• Design Triplication
– The use of TMR (or triple modular redundancy) in a design
allows correct function through triplicated majority voters
even when a configuration element is upset.
– The extra design effort is now largely automated by new
software (TMRtool).
• Active Configuration Scrubbing
– Upsets in the configuration must not be allowed to accumulate,
or TMR will “break”.
– Scrubbing uses some resources, but can be implemented so
that it is transparent to system operation.
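The voting idea behind TMR can be illustrated with a minimal Python sketch. It is only a software analogy, not the TMRtool implementation: the function name majority_vote and the 8-bit test values are assumptions made for illustration. It shows a bitwise 2-of-3 majority masking one upset copy, and failing once a second copy accumulates the same upset, which is why scrubbing matters.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 8-bit value held in three redundant copies.
good = 0b1011_0010
upset = good ^ 0b0000_1000  # the same value with bit 3 flipped

# One upset copy is outvoted by the two good copies.
single_upset_masked = majority_vote(good, upset, good) == good

# If a second copy accumulates the same upset before scrubbing,
# the voter now returns the wrong value: TMR has "broken".
double_upset_masked = majority_vote(upset, upset, good) == good

print(single_upset_masked, double_upset_masked)
```

In hardware the voters themselves are triplicated as well; the sketch models only the voting function.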
Embedded “Hard-Core” Processor(s) Upset
PowerPC 405 cores in Virtex II-Pro family FPGAs offer
unprecedented computational power inside an FPGA,
but include additional upsetable storage elements
[Figure: General Purpose Registers, XQR2VP40 embedded PPC405 core.
Cross section (cm^2/bit, 1E-10 to 1E-6) vs. LET (MeV per mg/cm^2,
0 to 40); series: Ones, Zeros.]
Processor Upsets – Data Cache
Processor caches are very important features for
increased performance; however, upsets in the caches
can lead to system errors.
[Figure: Data Cache, XQR2VP40 embedded PPC405 core. Cross section
(cm^2/bit, 1E-10 to 1E-7) vs. LET (MeV per mg/cm^2, 0 to 40);
series: Ones, Zeros.]
Processor Upset Mitigation
The “obvious” solution of implementing TMR with
three processor cores is not an available single chip
option because the maximum number of processors
per FPGA is currently two.
Tradeoffs between upset robustness and system
complexity, possibly spanning multiple FPGAs, must be
considered.
One-Chip Solution
Running two processors in lockstep is conceptually simple,
especially as they can reside in a single FPGA. A fast TMR-ed
comparison block is required to contain errors and not allow them
to propagate into the rest of the system. A processor upset will
appear to the comparison block as a disagreement, necessitating
that both processors be stopped within the current clock cycle.
Then both must be forced to roll back to a known good software
“bookmark” or, alternatively, to reboot.
Flow Chart
One-Chip Solution
• Single instruction executed (in lockstep)
• Compare processor outputs
• Error detected?
– N: execute the next instruction in lockstep
– Y: stop execution
• Initialize processor reboot
• Execute reboot and/or resynchronization processes
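The one-chip flow can be sketched as a toy Python simulation. Everything here is an illustrative assumption rather than the authors' design: the Core class models a processor as a single accumulator, upset_at is a hypothetical fault injection, and bookmark_every stands in for whatever software “bookmark” policy a real system would use. The sketch shows the comparison detecting a disagreement and both cores rolling back to the last known good bookmark.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Toy processor: its whole state is one accumulator."""
    acc: int = 0

    def step(self, value: int) -> int:
        self.acc += value
        return self.acc


def run_lockstep(program, upset_at=None, bookmark_every=4):
    """Run two toy cores in lockstep; roll both back on a mismatch.

    upset_at: program index at which core B's state is corrupted once
    (hypothetical fault injection).
    Returns (final accumulator value, number of rollbacks).
    """
    a, b = Core(), Core()
    bookmark_pc, bookmark_acc = 0, 0
    rollbacks = 0
    pc = 0
    while pc < len(program):
        if pc % bookmark_every == 0:
            # Both cores agreed after the previous step, so this state is good.
            bookmark_pc, bookmark_acc = pc, a.acc
        if pc == upset_at:
            b.acc ^= 0x10      # flip one bit in core B only
            upset_at = None    # inject the upset a single time
        if a.step(program[pc]) != b.step(program[pc]):
            # Disagreement: stop, then restore both cores from the bookmark.
            a.acc = b.acc = bookmark_acc
            pc = bookmark_pc
            rollbacks += 1
            continue
        pc += 1
    return a.acc, rollbacks


program = list(range(1, 11))
print(run_lockstep(program))              # no upset
print(run_lockstep(program, upset_at=6))  # one upset, one rollback
```

The rollback costs re-execution of every instruction since the bookmark, which is the outage the one-chip approach accepts on every detected error.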
Advantages
• Contained in one chip
– No chip-to-chip interconnects (minimal latency and
propagation delay)
– Lower power consumption
– Less board area
– No chip-to-chip synchronization
• Technology is more developed and tested
[See Reference 2]
Disadvantages
• More system outages
– Reboot or rollback on every error
– Not suitable for some critical real-time applications
• Twice as many errors as on a single processor,
but at least they are detected
Note: Requires extra device – either watchdog timer or external
configuration scrubber
Two-Chip Solution
With four processors in lockstep (necessitating two chips), a
solution as robust as full TMR is possible. In this scheme, a pair
of processors that get into a disagreement due to an upset will be
stopped while the system runs without interruption on the processor
pair that is in agreement. Correct internal state information is
available in the working pair. Thus, it is possible to resynchronize
almost transparently, preferably soon, and rapidly get back to full
four-processor lockstep operation with minimal intrusion. As a side
effect of using two separate FPGAs, additional robustness is
possible by adding on cross-strapped configuration control.
Flow Chart
Two-Chip Solution
• Power up configuration (both FPGAs from the same ROM)
• Parallel internal error checking
• Error detected?
– N: continue parallel internal error checking
– Y: processors with disagreement halt; wait for an opportunity to
reconfigure
• Healthy FPGA takes over and initiates a full or partial
reconfiguration of the upset FPGA
• Resynchronization arbitrator synchronizes processors to
appropriate location
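The two-chip flow can be sketched in the same toy style. This is a simplified model under stated assumptions, not the planned implementation: each core is one accumulator, upsets is a hypothetical fault-injection map, resync_at marks hypothetical “convenient” resynchronization opportunities, and reconfiguration of the upset FPGA is reduced to copying state from the healthy pair.

```python
def run_two_chip(program, upsets=None, resync_at=None):
    """Toy model: chips 0 and 1 each hold a lockstep pair of cores.

    upsets: {step: (chip, core)} hypothetical fault injections.
    resync_at: steps treated as resynchronization opportunities.
    Returns (system outputs per step, event log).
    """
    upsets = upsets or {}
    resync_at = resync_at or set()
    acc = [[0, 0], [0, 0]]   # acc[chip][core]
    halted = [False, False]
    outputs, log = [], []
    for pc, value in enumerate(program):
        if pc in resync_at:
            for chip in (0, 1):
                other = 1 - chip
                if halted[chip] and not halted[other]:
                    # Copy the known good state from the healthy pair.
                    acc[chip] = list(acc[other])
                    halted[chip] = False
                    log.append((pc, f"chip {chip} resynchronized"))
        if pc in upsets:
            chip, core = upsets[pc]
            acc[chip][core] ^= 0x10  # flip one bit in one core
        for chip in (0, 1):
            if halted[chip]:
                continue
            acc[chip] = [a + value for a in acc[chip]]
            if acc[chip][0] != acc[chip][1]:
                halted[chip] = True  # the disagreeing pair halts
                log.append((pc, f"chip {chip} halted"))
        running = [c for c in (0, 1) if not halted[c]]
        if not running:
            raise RuntimeError("both pairs disagree: reboot required")
        outputs.append(acc[running[0]][0])  # system runs on an agreeing pair
    return outputs, log


outputs, log = run_two_chip([1, 2, 3, 4, 5, 6],
                            upsets={2: (1, 0)}, resync_at={4})
print(outputs)
print(log)
```

The system output stream is unaffected by the single upset; only simultaneous disagreements in both pairs force a reboot.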
Advantages
• Reboots rare; requires simultaneous errors in two
separate processors
• Processor upsets are handled transparently, without system
outage, until a convenient re-synchronization opportunity arises
• Enhanced robustness – outages lowered to less than
the SEFI rate of ~1 in 80 years per device
• Allows added configuration robustness
– Chips check each other (not self-checking)
– Eliminates need for external watchdog timer
Disadvantages
• Complicated
– Inter-chip communication/synchronization
– Transparent reboot/resynchronization of both
processors in chip with error
• Twice the power consumption
• In-beam testing is not yet done (although
planned for the near future)
Three-Chip Solution
The three-chip implementation (also known as the “virtual FPGA”
solution [Ref. 3]) takes the responsibility of error detection out
of the hands of the upsetable FPGAs by adding a radiation-hardened
ASIC. Note that only one processor per FPGA is needed. The ASIC
handles stopping error propagation and re-synchronizing an upset
processor. Additionally, the ASIC can be used for configuration
control of all three FPGAs.
Flow Chart
Three-Chip Solution
• Configure all three FPGAs
• Processors execute a cycle in lockstep
• Error is detected?
– N: execute the next cycle in lockstep
– Y: re-synchronize state of device with upset
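The voting and re-synchronization role of the ASIC can be sketched as follows. This is a toy software analogy under assumptions, not the ASIC design: each processor is one accumulator, upsets is a hypothetical fault-injection map, and re-synchronization is reduced to overwriting the outlier with the majority value.

```python
from collections import Counter


def vote_and_resync(states):
    """Majority-vote three processor states and repair any outlier in place.

    Returns (voted value, index of the repaired processor or None).
    """
    value, count = Counter(states).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three processors")
    outlier = None
    for i, state in enumerate(states):
        if state != value:
            states[i] = value  # re-synchronize the upset processor
            outlier = i
    return value, outlier


def run_three_chip(program, upsets=None):
    """Toy model: one accumulator 'processor' in each of three FPGAs.

    upsets: {step: processor index} hypothetical fault injections.
    Returns (voted outputs per step, list of (step, repaired index)).
    """
    upsets = upsets or {}
    acc = [0, 0, 0]
    outputs, resyncs = [], []
    for pc, value in enumerate(program):
        if pc in upsets:
            acc[upsets[pc]] ^= 0x10  # flip one bit in one processor
        acc = [a + value for a in acc]
        voted, outlier = vote_and_resync(acc)
        if outlier is not None:
            resyncs.append((pc, outlier))
        outputs.append(voted)
    return outputs, resyncs


print(run_three_chip([1, 2, 3, 4], upsets={1: 2}))
```

Because the voted output is always taken from the majority, a single upset never reaches the rest of the system and no pair has to halt.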
Advantages
• Maximum robustness to upsets
• Only three processors in lockstep (but in 3 chips)
• More fabric available for other functions
• No system outages; errors and SEFIs are handled
transparently
• Most implementation details are confined to the
ASIC and don’t affect the IP in the FPGAs
significantly
Disadvantages
• Complex ASIC development for controller to vote
outputs and re-load/re-sync upset processor
• ASIC development cost (currently funded though)
• Board area
Conclusions
• Both two-chip and three-chip solutions have about the
same robustness, power consumption, and system
complication, but handle upsets better than the
one-chip solution.
• The two- vs. three-chip decision mostly boils down to
the familiar FPGA vs. ASIC debate.
• Three-chip solution may use less power than the
two-chip. (Is the ASIC’s power consumption less than that
of one processor core?)
• At present, the JPL-preferred approach is the
two-chip implementation achieving maximum flexibility
and near maximum robustness to upsets.
References
• [1] J. George et al., “Initial Single-Event Effects
Testing and Mitigation in the Xilinx Virtex II-Pro
FPGA,” Paper 211, MAPLD 2005.
• [2] M. Wang and G. Bolotin, “SEU Mitigation
Techniques for Xilinx Virtex-II Pro FPGA,” Paper D110,
MAPLD 2004,
http://klabs.org/mapld04/presentations/session_d/1_d110_wang_s.ppt
• [3] J. Lyke and B. Marty, Virtual Field Programmable
Gate Array Triple Modular Redundant Cell Design, Air
Force Research Laboratory: Space Vehicles
Directorate, AFRL-VS-PS-TR-2004-1093, April 28, 2004.