A Genetic Representation for Evolutionary Fault Recovery


FPGA Self-Repair
using an
Organic Embedded System Architecture
Kening Zhang, Jaafar Alghazo and Ronald F. DeMara
University of Central Florida
06 December 2007
Organic Computing (OC)
biologically-inspired computing with “self-x” properties
Technical Objective: support long lifetime missions with multiple failure occurrences
Research Focus: Reliability, Availability, Sustainability
OC Approach: addresses system controllability with increasing complexity
System Property
• Composed of a large collection of autonomous systems
• Autonomous systems own sensors and actuators
• Communication networks among autonomous systems
Self-x Characteristics
• Self-organization
• Self-configuration
• Self-optimization
• Self-healing
• Self-protection
• Self-explaining
• Context-awareness
• Self-synchronization
Example Relevance:
How to achieve sustainable presence in NASA’s Moon, Mars & Beyond objective?
Reconfigurable Hardware with Self-Healing
based on SRAM FPGA platform
Sponsors: NASA: FPGA platform and Genetic Algorithm research
DARPA: OC approach and SOAR Longevity Platform
Goal: Autonomous FPGA Refurbishment
increase availability without carrying pre-configured spares …
Comparison: Redundancy vs. Refurbishment
• Overhead from Unutilized Spares (weight, size, power)
  – Redundancy: increases with amount of spare capacity
  – Refurbishment: weakly-related to number recovery capacity
• Granularity of Fault Coverage (resolution where fault handled)
  – Redundancy: restricted at design-time
  – Refurbishment: variable at recovery-time
• Fault-Resolution Latency (availability via downtime required to handle fault)
  – Redundancy: based on time required to select spare resource
  – Refurbishment: based on time required to find suitable recovery
• Quality of Repair (likelihood and completeness)
  – Redundancy: determined by adequacy of spares available (?)
  – Refurbishment: affected by multiple characteristics (+ or -)
• Autonomous Operation (fix without outside intervention)
  – Redundancy: yes
  – Refurbishment: yes
Fault-Handling Techniques for SRAM-based FPGAs
[Figure: taxonomy of fault-handling techniques, organized by device failure characteristics and by approach]
Device Failure Characteristics
• Duration: Transient (SEU) or Permanent (SEL, Oxide Breakdown, Electromigration, LPD)
• Target: Device Configuration or Processing Datapath
Approaches shown: Scrubbing, Bitwise Comparison, Evolutionary, Majority Vote, STARS, CED, Vigander, OC, TMR, BIST Methods
Fault-handling stages compared: Detection, Isolation, Diagnosis, Recovery
Entries appearing in the figure: Supplementary Testbench; Duplex Output Comparison; Duplex/Triplex Output Comparison; Autonomous Element (AE); Autonomous Supervisor (AS); Population-based GA using Extrinsic Fitness Evaluation; Evolutionary Algorithm using Intrinsic Fitness Evaluation; Cartesian Intersection; Worst-case Clock Period Dilation; Reload Bitstream / Invert Bit Value; Ignore Discrepancy; Replicate in Spare Resource; Fast Run-time Location; Select Spare Resource; unnecessary; (not addressed)
Autonomous System-on-a-Chip (ASoC) Architecture
Dual-layer ASoC proposed by Lipsa et al. [Lipsa 05]
• Functional Layer
  – Functional Elements (FEs), e.g. CPU, RAM, Network interface
• Autonomic Layer
  – Autonomic Elements (AEs): Monitor, Actuator, Communication interface
  – Autonomic Supervisor (AS)
UCF Approach for fault coverage
Functional Layer & Autonomic Layer
• achieved by assessing consensus among elements
  1. first to realize failure detection
  2. consensus provides an organic method for fitness evaluation of competing alternatives during evolution, providing a self-regulating approach to fault resolution
EHW Environments
• Evolvable Hardware (EHW) environments enable experimental methods to research soft computing intelligent search techniques
• EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process
Two modes of Evolvable Hardware:
• Extrinsic Evolution: Genetic Algorithm with simulation in the loop (software model); device “design-time” refinement
• Intrinsic Evolution: Genetic Algorithm with hardware in the loop; device “run-time” refinement, a new approach to Autonomous Repair of failed devices
[Figure labels: Application, Done?, Build it]
Deep Space Satellite:
• >100 FPGAs onboard
• hostile environment: radiation, thermal stress
• How to achieve reliability to avoid mission failure?
Genetic Algorithms (GAs)
Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics)
[Figure: GA cycle. start → population of candidate solutions → evaluate fitness of individuals (fitness function) → goal reached? → selection of parents → parents → crossover, mutation → offspring → replacement → population]
Genetic Mechanisms
• Guided trial-and-error search techniques using principles of Darwinian evolution
• GAs frequently use strings of 1s and 0s to represent candidate solutions
  – iterative selection, “survival of the fittest”
  – genetic operators: mutation, crossover, …
  – implementor must define fitness function
Genotype chromosomes of GA operation: if 100101 is better than 010001, it will have more chance to breed and influence the future population
Genotype changes during evolution must adhere to the Xilinx-defined bitstream format
To prevent undesirable conditions that may damage the FPGA, such as a mutation that ties two logic outputs together, a logical genotype is used for evolution and mapped to a physical phenotype
Logic # = functional logic index number for LUT
Row/Column = physical location of LUT in FPGA
• Can invoke Elitism Operator (E=1, E=2 …)
  – guarantees monotonically non-decreasing fitness of the best individual over all generations
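The selection, crossover, mutation, and elitism steps named on this slide can be sketched as a small bit-string GA. This is a minimal illustration only: the function name `evolve`, the tournament selection, the single-point crossover, and all parameter values are assumptions chosen for the sketch, not the configuration used in the presented experiments.

```python
import random

def evolve(fitness, length=16, pop_size=20, elite=1, p_mut=0.05,
           generations=50, seed=0):
    """Toy bit-string GA with elitism (hypothetical parameters, elite >= 1)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        best_history.append(fitness(ranked[0]))
        # Elitism: copy the E best individuals unchanged into the next generation.
        next_pop = [ind[:] for ind in ranked[:elite]]
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents (an assumption).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Single-point crossover followed by per-bit mutation.
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness), best_history

# Example fitness: number of 1 bits in the string.
best, history = evolve(sum)
```

Because the elite individuals are copied without modification, the best fitness recorded in `history` can never decrease from one generation to the next.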
Loosely Coupled Solution on Xilinx Virtex II Pro & Virtex 4
[Figure: Avnet FPGA Development Board with Virtex-II Pro FPGA and off-chip RAM; control hosted on PC over the PCI interface; signals shown: input data, bit file, FPGA output]
The entire system operates on a 32-bit basis.
The Virtex-II Pro/4 is mounted on a development board, which can then be interfaced with a workstation running Xilinx EDK and ISE.
Organic Embedded System (OES) Architecture
One-dimensional, column-oriented OES based on Xilinx Virtex II Pro FPGA platform
• FEs and AEs reside on two distinct layers with an interconnection structure between them
• AEs and FEs can be realized in hardware, software, or co-design
• AE layer supervises functionality of FE elements while requiring no application-specific algorithms on the AE layer
• Observer/Controller architecture includes an AS element, which has no counterpart to evaluate whether the AS is fault-free; the proposed approach addresses this by minimizing AS complexity
• Utilizes Xilinx partial reconfiguration technology to manipulate relocatable bitstreams
OES AE Component Design
AEs decentralize Observer/Controller functionality:
• Concurrent Error Detection (CED) unit collects the outputs of 2 FEs for discrepancy identification
• A Checksum unit for AE fault detection, whose results are checked against Stored Checksum values
• Evaluator of outputs from 2 FEs against checksum, and Actuator which initiates the recovery phase
• An important architectural property is that all AE components are identical in structure even though they monitor different types of FEs
• Homogeneous characteristics deliver a uniform-behavior property leveraged for the consensus-based evaluation fault-handling methodology
• OC Concept: although AE components add complexity to the design, they ease integration of the fault-handling difficulties inherent in current commercial IP cores
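The two checks performed by an AE, a checksum test of the AE itself and a CED comparison of two FE outputs, can be sketched in software as follows. This is a hedged illustration: the function `ae_check`, the byte-string modeling of FE outputs and AE state, and the use of CRC-32 as the checksum are all assumptions; the slides do not specify the checksum algorithm or the signal encoding.

```python
import zlib

def ae_check(fe_out_a: bytes, fe_out_b: bytes,
             ae_state: bytes, stored_checksum: int) -> str:
    """Classify one observation made by a (hypothetical) Autonomic Element."""
    # AE self-check: compare a freshly computed checksum with the stored value.
    # CRC-32 is a stand-in; the real checksum scheme is not given in the source.
    if zlib.crc32(ae_state) != stored_checksum:
        return "ae_fault"
    # CED: the two active FE instances should produce identical outputs.
    if fe_out_a != fe_out_b:
        return "fe_discrepancy"
    return "ok"

state = b"\x12\x34"
golden = zlib.crc32(state)
status_ok = ae_check(b"\x01", b"\x01", state, golden)
status_fe = ae_check(b"\x01", b"\x00", state, golden)
status_ae = ae_check(b"\x01", b"\x01", b"\x12\x35", golden)
```

In this sketch a discrepancy only signals that one of the two FEs disagrees; deciding which one is faulty is left to the later CED to TMR step described in the experiments.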
Consensus-Based Evaluation (CBE)
• Uses a Relative Fitness Measure
  – Pairwise discrepancy checking yields relative fitness measure
  – Broad temporal consensus in the population used to determine fitness metric
  – Transition between Fitness States occurs in the population
  – Provides graceful degradation in presence of changing environments, applications and inputs, since this is a moving measure
• Test Inputs = Normal Inputs for Data Throughput
  – CBE utilizes neither additional functional test vectors nor additional resource test vectors
  – Potential for higher availability as regeneration is integrated with normal operation
Genetic Operators: Mutation
Typical Approach: bit inversion of LUT functionality
Selected Approach: input interconnection of LUTs mutated
Rearrange input interconnection to search unused LUT resources which occlude faulty resource
Mutation: Genotype chromosomes
• original functionality is F = F1·(F3+F4), with input F2 left unassigned by the synthesis tool
• mutation operator changes input F4 to the unused input F2, giving F = F1·(F3+F2)
• shading shows changed input and LUT contents
Mutation: Phenotype chromosomes
• some opportunity for input stuck-at fault or LUT content stuck-at fault
• functionalities of LUTs remain undistorted while search space explored
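The input-rewiring mutation in the F = F1·(F3+F4) example can be sketched on a logical genotype as follows. The class `LogicalLut` and the functions `rewire` and `mutate` are hypothetical names introduced for illustration; the LUT's positional function a·(b+c) stays fixed while only the wiring of its inputs changes.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalLut:
    used: tuple    # signals wired to the LUT inputs, in positional order
    unused: tuple  # signals left unassigned by the synthesis tool

    def evaluate(self, signals: dict) -> int:
        # Fixed LUT function over input positions: F = a·(b + c).
        a, b, c = (signals[name] for name in self.used)
        return a & (b | c)

def rewire(lut: LogicalLut, used_idx: int, unused_idx: int) -> LogicalLut:
    """Swap one wired input with one unused input, keeping the LUT function."""
    used, unused = list(lut.used), list(lut.unused)
    used[used_idx], unused[unused_idx] = unused[unused_idx], used[used_idx]
    return LogicalLut(tuple(used), tuple(unused))

def mutate(lut: LogicalLut, rng: random.Random) -> LogicalLut:
    """Random input-interconnection mutation; a no-op if nothing is unused."""
    if not lut.unused:
        return lut
    return rewire(lut, rng.randrange(len(lut.used)),
                  rng.randrange(len(lut.unused)))

original = LogicalLut(used=("F1", "F3", "F4"), unused=("F2",))
mutated = rewire(original, used_idx=2, unused_idx=0)  # F4 replaced by F2
```

Swapping rather than overwriting keeps the displaced signal (F4 here) available for later mutations, which is a design choice of this sketch rather than something stated on the slide.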
Genetic Operators: Cell Swapping
Cell-Swap operation on Genotype chromosomes
• interchanges two distinct LUT blocks while maintaining correct logic order and functionalities in genotype
• exchanges all LUT input interconnections, LUT content and physical 2-tuple (Col#, Row#) as well as the logic sequence
Cell-Swap operation on Phenotype chromosomes
Genetic Operators: PMX Operator
Partial Match Crossover (PMX) maintains crossover information as well as order information
• two genotype configuration streams are aligned at LUT boundary
• crossover site selected at random along LUT boundary
• this crossover point defines a left/right partition used to effect crossover through LUT-by-LUT exchange
• suppose crossover point at position 4 of the LUT vector:
  – first step is to map configuration B to configuration A by exchanging the following aligned LUTs: {(4,7),(5,2),(6,1),(7,5)}
  – applying PMX results in two new configurations A’ and B’
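The LUT-by-LUT exchange can be sketched as a swap-based PMX on permutations of LUT indices. This is a minimal sketch under assumptions: the function name `pmx_from_cut`, the 0-based cut index, and the two example parent permutations are hypothetical, since the slide does not list the full configurations A and B.

```python
def pmx_from_cut(parent_a, parent_b, cut):
    """Swap-based PMX: positions from `cut` to the end take the donor's values.

    Each child remains a valid permutation because values are exchanged
    inside the receiver instead of being overwritten.
    """
    def build(receiver, donor):
        child = list(receiver)
        for i in range(cut, len(child)):
            j = child.index(donor[i])              # where the donor's LUT sits now
            child[i], child[j] = child[j], child[i]  # pairwise LUT exchange
        return child

    return build(parent_a, parent_b), build(parent_b, parent_a)

# Hypothetical LUT orderings for two configurations.
config_a = [1, 2, 3, 4, 5, 6, 7, 8]
config_b = [3, 7, 5, 1, 6, 8, 2, 4]
child_a, child_b = pmx_from_cut(config_a, config_b, cut=4)
```

With these inputs, `child_a` carries the right partition of `config_b` and `child_b` carries the right partition of `config_a`, while every LUT index still appears exactly once in each child.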
Illustrative Example: Gate Level Design of OES
• Experiment circuit: 1-bit Full-adder
• Fault-free model: Duplex
• Fault-impact model: TMR
• Fault-detect model: CBE
• Fault recovery strategy: GA operation
• Experimental setup:
  – Hardware prototype implemented in Xilinx Virtex-II Pro FPGA
  – VHDL implementation
  – Using the GNAT library along with the MRRA framework and JTAG reconfiguration interface
MCNC-91 Benchmark Case Studies
Circuit Name   Circuit Function   Inputs   Outputs   Approximate Gates
z4ml           2-bit Add          7        4         20
cm85a          Logic              11       3         38
cm138a         Logic              6        8         17
System Availability under Multiple Faults
Fc = number of correct behaviors of FE observed during evolutionary recovery phase
Fe = number of errant or discrepant behaviors
1 = exactly one output required to detect the fault during the original CED configuration.
2 = number of the reconfigurations required, i.e. one from CED to TMR, and one back
from TMR to CED
Fc1 & Fe1 = correct and faulty output number of the FE during the AE repair period
Fc2 & Fe2 = correct and faulty output number during the FE repair period
n = number of reconfigurations of the FE
β represents reconfiguration to computation time ratio
Experimental Results
• Fault-free arrangement: CED FEs with cold standby FE
• Inject a stuck-at-zero or stuck-at-one fault at one of the FE’s LUT input pins
• CED -> TMR to identify faulty FE or AE
• CBE used to resolve faulty AE
Redundancy for both FE (RFE) and AE (RAE) = ratio of unused LUT inputs to total number of LUT inputs
Fc = number of correct behaviors of FE observed during evolutionary recovery phase
Fe = number of errant or discrepant behaviors
n = number of reconfigurations of the FE
β represents reconfiguration to computation time ratio
Conclusion
• A self-adapting and self-healing OES architecture was developed for autonomic operation without human intervention.
• The OES architecture is capable of handling many single-fault scenarios and several multiple-fault scenarios for small digital logic designs.
• Experimental results support our design objectives: availability during the repair phase averaged 75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and cm138a circuits, respectively, under the stated conditions.
• The reconfiguration time ratio (β) is a key factor limiting availability during AE repair.
• Future work: evaluate extensions of the OES architecture addressing scalability in terms of pipelined stages.
Backup Slides
• On following pages …
Isolation of a single faulty individual with 1-out-of-64 impact
[Figure: instantaneous DV (point values) for a sample individual in population and population oracles (solid lines)]
Sliding Window
• Outliers are identified after EW iterations have elapsed
• Expected D.V. = (1/64)*600 = 9.375 from individual impacted by fault
• Isolated faulty individual’s DV differs from the average DV by 3 after 1 or more observation intervals of length EW
Future Work: Development Board to Self-Contained FPGA
[Figure: three-stage roadmap (Year 1, Year 2, Year 3) on the Avnet FPGA Development Board with a Xilinx Virtex-II Pro FPGA. Labels shown: control hosted on PC, PCI Interface, Input Data, Output, Bit file, Off Chip RAM, Functional CLBs, ICAP, Config Data, Reconfig Request, Control via on-chip Power PC, CRR on a Chip, Device Fault Configurations in On Chip RAM Blocks]
Qualitative Analysis of CRR model
• Number of iterations and completeness of regeneration repair
• Percentage of time the device remains online despite physical resource fault (availability)
Hardware Resource Management
• Optimization of hardware profile for Xilinx Virtex II Pro
Field Testing on SRAM-based FPGA in a Cubesat mission
OES Integrated FE and AE Failure Detection Procedure
• System Initialization
  – FE Initialization step
  – Compute Checksum step
• FE Fault Detection/Recovery
  – AE-CED fault detection
  – FE fault-recovery
• AE fault detection Phase
  – A fault may exist in the CED, Actuator, or Evaluator,
  – A fault may exist in the Check Sum component, or
  – A fault may exist in the Stored CheckSum-LUT.
Runtime inputs to the FE are applied to both active instances under a CED strategy. After allowing for the propagation time of the FE inputs through the AE, the expected output is supplied to the AE-CED for fault detection. The output of the FE is then compared in the AE-CED module and any
Previous Work
Detection Characteristics of FPGA Fault-Handling Schemes
Fault Detection
Resource Coverage
Fault Isolation
Approach
Fault Handling Method
Latency
Distinguish
Transients
Logic
TMR
Spatial voting
Negligible
No
Yes
Yes
No
Voting element
[Vigander01]
Spatial voting & offline
evolutionary regeneration
Negligible
No
Yes
No
No
Voting element
[Lohn,
Larchev,
DeMara03]
Offline evolutionary
regeneration
Negligible
No
Yes
Yes
No
Unnecessary
[Lach98]
Static-capability tile
reconfiguration
STARS
[Abramovici01]
[Keymeulen,
Stoica,
Zebulum00]
CRR
InterComparator
connect
Granularity
Relies on independent fault detection mechanism
Online BIST
Up to 8.5M
erroneous outputs
Test pattern
transients
Yes
Yes
No
LUT function
Population-based fault
insensitive design
Design-time
prevention emphasis
No
Yes
Yes
No
Not addressed
at runtime
Negligible
Transients are
attenuated
automatically
Yes
Unnecessary, but
can isolate
functional
components
Competing configurations
with temporal voting and
online regeneration
Yes
Yes
… Strategy #1) Evolve redundancy into design before the anticipated failure or …
Previous Work
Fault Recovery Characteristics of Selected Approaches
Approach
Online
Recovery
TMR
Yes
[Vigander01]
No
[Lohn,
Larchev,
DeMara03]
[Lach98]
Basis for
Recovery
Quality of
Recovery
No
Single
datapath
3n
Design
complexity
GA Controller,
NonNondeterministic deterministic function test vectors
Yes
None
3n+r
No
Design
complexity
GA Controller,
NonNondeterministic deterministic function test vectors
Yes
None
2n+r
No
Available
spares
No
Only one
faulty CLB
per tile
2n+i+r
Yes
Available
spares
Yes
Available
spares within
routing
chokepoints
s • (c+m+b)
No
Depends on
redundancy
during design
n • (1 + f(g))
Yes
None
2n+r
STARS
[Keymeulen,
Stoica,
Zebulum00]
CRR
PrePower
Externally-supplied Resource
determined
Consumption
Recycling
Elements
Limits
Either
100% for
Requires 2
single fault, 2 of 3 Majority Voter
datapaths are
complete or
0% thereafter
operational
none
Either
[Abramovici
01]
Availability
No
Yes
complete or
none
Either
complete or
none
Restricted by Only ~93%
Test Reconfiguration
regardless of
nonController + device
fault
optimizable
test vectors
occurrence
re-routing
Depends on
NonNoncharacteristics
deterministic deterministic
at design time
Recovery
complexity
Device test vectors
and controller
Optimized by
second-order
fitness metric
Adaptable
None at runtime
Optional RAM. RAM
coverage is intrinsic.
No test vectors.
… Strategy #2) Evolve recovery from specific failure after (and if) it occurs or …
CRR Arrangement in SRAM FPGA
Configurations in Population
• $C = C_L \cup C_R$
• $C_L$ = subset of left-half configurations
• $C_R$ = subset of right-half configurations
• $|C_L| = |C_R| = |C|/2$
Discrepancy Operator
• Baseline Discrepancy Operator $\otimes$ is a dyadic operator with binary output:
\[ C_i^L \otimes C_i^R = \begin{cases} 0 & \text{if } Z(C_i^L) = Z(C_i^R) \\ 1 & \text{otherwise} \end{cases} \]
• $Z(C_i)$ is the FPGA data throughput output of configuration $C_i$
• Each half-configuration evaluates $\otimes$ using an embedded checker (XNOR gate) within each individual
• Any fault in the checker lowers that individual’s fitness, so that individual is no longer preferred and eventually undergoes repair
• Variants: WTA (Equivalence) and RS (Hamming Distance), both computed from the bitwise EOR of $C_{i,j}^L$ and $C_{i,j}^R$
[Figure: SRAM-based FPGA holding an L half-configuration and an R half-configuration (Function Logic L/R, Discrepancy Check L/R) with INPUT DATA, DATA OUTPUT, CONTROL, and FEEDBACK signals; the Reconfiguration Algorithm and the CONFIGURATION BIT STREAM connect to OFF-CHIP EEPROM]
(NOTE: a non-volatile memory is already required to boot any SRAM FPGA from cold start ... this is not an additional chip)
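The two discrepancy measures named on this slide can be sketched on integer-encoded outputs as follows. The function names and the integer encoding of the output words Z are assumptions for illustration; in hardware the check is performed by the embedded checker within each half-configuration.

```python
def discrepancy_equivalence(z_left: int, z_right: int) -> int:
    """Binary discrepancy: 0 when the two half-configuration outputs agree."""
    return 0 if z_left == z_right else 1

def discrepancy_hamming(z_left: int, z_right: int) -> int:
    """Graded discrepancy: number of output bit positions that differ."""
    return bin(z_left ^ z_right).count("1")

same = discrepancy_equivalence(0b1010, 0b1010)
differ = discrepancy_equivalence(0b1010, 0b1000)
distance = discrepancy_hamming(0b1010, 0b0101)
```

The equivalence form only records that a disagreement occurred, while the Hamming form also reports how many output bits disagree, which gives a finer-grained signal for fitness adjustment.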
Terminology and Characteristics
Pristine Pool $C_P$: any $C_i \in C$ is a member of $C_P$ at generation $G$ if and only if
\[ \sum_{K=1}^{G} C_K^L \otimes C_K^R = 0 \]
Suspect Pool $C_S$: any $C_i \in C$ is a member of $C_S$ at generation $G$ if and only if at least one of $C_K^L \otimes C_K^R \neq 0$ $(1 \le K \le G)$
Under Repair Pool $C_U$: any $C_i \in C$ is a member of $C_U$ at generation $G$ if and only if
\[ \sum_{K=1}^{G} C_K^L \otimes C_K^R \ge 1 \]
Refurbished Pool $C_R$: after a Genetic Operator is applied, the newly generated individual is a member of $C_R$ at generation $G$ if and only if
\[ \sum_{K=1}^{G} C_K^L \otimes C_K^R = 0 \]
$E_D$ is the Discrepancy Count of $C_i$ and $E_C$ is the Correctness Count of $C_i$
Length of Fitness Evaluation Window: $E_W = E_D + E_C$
Fitness Metric: $f(C_i) = E_C / E_W$
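The fitness metric can be sketched as a sliding window of discrepancy observations. The class `FitnessWindow` is a hypothetical helper: it keeps the most recent observations, counts discrepancies (ED) and agreements (EC), and reports EC divided by the number of observations held, which equals EW once the window is full.

```python
from collections import deque

class FitnessWindow:
    """Sliding evaluation window for one configuration (illustrative only)."""

    def __init__(self, width: int):
        self.width = width
        self.results = deque(maxlen=width)  # 1 = discrepancy, 0 = agreement

    def record(self, discrepancy: int) -> None:
        self.results.append(1 if discrepancy else 0)

    @property
    def e_d(self) -> int:
        return sum(self.results)              # discrepancy count

    @property
    def e_c(self) -> int:
        return len(self.results) - self.e_d   # correctness count

    def full(self) -> bool:
        return len(self.results) == self.width

    def fitness(self) -> float:
        if not self.results:
            raise ValueError("no observations recorded yet")
        return self.e_c / len(self.results)

window = FitnessWindow(600)
for _ in range(594):
    window.record(0)
for _ in range(6):
    window.record(1)
```

Using a bounded deque means old observations fall out automatically, so the metric tracks recent behavior rather than the whole lifetime of the configuration.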
Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity
1. Initialization
  – Population P of functionally-identical yet physically-distinct configurations
  – Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization
2. Fitness Assessment (via pairwise discrepancy: temporal voting vs. spatial voting)
  – Discrepancy Operator is some function of bitwise agreement between each half’s output
  – Four Fitness States defined for Configurations as {CP, CS, CU, CR} with transitions, respectively: Pristine → Suspect → Under Repair → Refurbished
  – Fitness Evaluation Window W determines comparison interval
3. Regeneration
  – Genetic Operators used to recover from fault based on Reintroduction Rate
  – Operators only applied once, then offspring returned to “service” without concern about increasing fitness
Configuration Health States
State transitions during lifetime of the ith Half-Configuration
[Figure: state-transition diagram with states primordial, pristine, suspect, under repair, and refurbished. Transitions numbered 1 to 11 are labeled by the detection outcome (L = R or L ≠ R) and by comparisons of the fitness fi against the thresholds fOT and fRT; repair may be complete or partial. COMPETITION is shown as integral with EVOLUTION.]
Procedural Flow under Competitive Runtime Reconfiguration
[Figure: flowchart of the PRIMARY LOOP]
• Initialization: population partitioned into functionally-identical yet physically-distinct half-configurations
• Selection: choose FPGA configuration(s) labeled L and R
• Detection: apply functional inputs to compute FPGA outputs using L, R (L = R is discrepancy free)
• Fitness Adjustment: update fitness of only L and R based on detection results
• Is either L's or R's fitness < Repair Threshold? YES: invoke Genetic Operators only once and only on L or R; NO: continue the primary loop
• Adjust Controls: detection mode, overlap interval, ...
Integrates all fault handling stages using EC strategy
  – Detects faults by the occurrence of discrepancy
  – Isolates faults by accumulation of discrepancies
  – Failure-specific refurbishment using Genetic Operators: Intra-Module-Crossover, Inter-Module-Crossover, Intra-Module-Mutation
Realize online device refurbishment
  – Refurbished online without additional function or resource test vectors
  – Repair during the normal data throughput process
Fitness Evaluation Window
• Fitness Evaluation Window: W
  – denotes number of iterations used to evaluate fitness before the state of an individual is determined
• Determination of W for 3x3 multiplier
  – 6 input pins articulating $2^6 = 64$ possible inputs
  – W should be selected so that all possible inputs appear
  – More formally, let rand(X) return some $x_i \in X$ at random; seek W such that
\[ \bigcup_{i=1}^{W} \mathrm{rand}(X) = X \quad \text{with high probability} \]
• $x_K$ = distinct orderings of K inputs showing in D trials
• if D constant, can calculate $P_{K>1}$ successively
• probability $P_K$ of K inputs showing after D trials is the ratio $x_K / K^D$
\[ \binom{K}{K} x_K + \binom{K}{K-1} x_{K-1} + \cdots + \binom{K}{2} x_2 + \binom{K}{1} x_1 = K^D \]
\[ \binom{K}{K} P_K + \binom{K}{K-1} P_{K-1} + \cdots + \binom{K}{2} P_2 + \binom{K}{1} P_1 = 1 \]
\[ \sum_{m=1}^{K} \binom{K}{m} P_m = 1 \]
W Determination
When K=64:
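The probability that all K inputs appear within D uniformly random trials can also be computed in closed form by inclusion-exclusion, which is equivalent to solving the recurrence above for $P_K$. The sketch below is an illustration under the assumption of independent, uniformly distributed inputs; the function names and the 99% target are hypothetical and are not taken from the slides.

```python
from fractions import Fraction
from math import comb

def prob_all_inputs_seen(k: int, d: int) -> Fraction:
    """Exact probability that d uniform random draws cover all k inputs."""
    # Inclusion-exclusion count of length-d sequences that use every input.
    covering = sum((-1) ** m * comb(k, m) * (k - m) ** d for m in range(k + 1))
    return Fraction(covering, k ** d)

def smallest_window(k: int, target: Fraction) -> int:
    """Smallest d whose coverage probability reaches the target."""
    lo = hi = k
    while prob_all_inputs_seen(k, hi) < target:
        hi *= 2                      # grow until the target is met
    while lo < hi:                   # binary search; probability rises with d
        mid = (lo + hi) // 2
        if prob_all_inputs_seen(k, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

coverage_at_600 = prob_all_inputs_seen(64, 600)
window_for_99 = smallest_window(64, Fraction(99, 100))
```

Exact rational arithmetic is used because the alternating terms are large and would lose precision in floating point. Under these assumptions, a window of 600 iterations covers all 64 inputs with probability above 99%.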
Integer Multiplier Case Study
• 3-bit x 3-bit unsigned multiplier automated design:
  – Building blocks
    Half-Adder: 18 templates created
    Full-Adder: 24 templates created
    Parallel-And: 1 template created
  – Randomly select templates for instantiation in modules
GA parameters
• Population size: 20 individuals
• Crossover rate: 5%
• Mutation rate: up to 80% per bit
GA operators
• External-Module-Crossover
• Internal-Module-Crossover
• Internal-Module-Mutation
Experimental Evaluation
Xilinx Virtex II Pro on Avnet PCI board
Experiments Demonstrate …
• Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
• Elimination of additional test vectors
• Temporal Assessment process
Template Fault Coverage
[Figure: Half-Adder Template A and Half-Adder Template B]
Template A
  – Gate3 is an AND gate
  – Will lose correctness if a Stuck-At-Zero fault occurs in the second input line of Gate3
Template B
  – Gate3 is a NOT gate and only uses the first input line
  – Will work correctly even if the second input line is stuck at Zero or One
Regeneration Performance
Parameters:
Difference (vs. Hamming Distance)
Evaluation Window, Ew = 600
Suspect Threshold: S = 1-6/600=99%
Repair Threshold: R = 1-4/600 = 99.3%
Re-introduction rate: r = 0.1
Repairs evolved in-situ, in real-time, without additional test
vectors, while allowing device to remain partially online.
Isolation of a single faulty individual with 1-out-of-64 impact
• Outliers are identified after W iterations have elapsed
• E.V. = (1/64)*600 = 9.375 from minimum impact faulty individual
• Isolated individual’s f differs from the average DV by 3 after 1 or more observation intervals of length W
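The threshold and expected-value arithmetic from the Regeneration Performance parameters can be checked with a short calculation. This is a worked example only: the helper `threshold` and the variable names are hypothetical, and the comparison assumes the faulty individual is observed for a full window of 600 evaluations.

```python
EW = 600  # evaluation window length from the stated parameters

def threshold(allowed_discrepancies: int, window: int = EW) -> float:
    """Fitness threshold that tolerates a given number of discrepancies."""
    return 1 - allowed_discrepancies / window

suspect_threshold = threshold(6)   # 1 - 6/600, about 99%
repair_threshold = threshold(4)    # 1 - 4/600, about 99.3%

# A fault that corrupts 1 of the 64 possible inputs is expected to cause
# (1/64) * 600 discrepancies within one window.
expected_dv = EW / 64
expected_fitness = 1 - expected_dv / EW
```

With these values the expected discrepancy count of 9.375 exceeds the 6 discrepancies tolerated by the suspect threshold, so even a minimum-impact fault is expected to push the affected individual below both thresholds within one full window.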