Bus Architectures for Safety-Critical Embedded Systems
--by Harit Desai
Introduction
• Safety-critical systems are federated.
– Each function has its own fault-tolerant embedded control system, with only minor
interconnections to the others.
– This provides a strong barrier to fault propagation.
– The federated approach is expensive (replication).
[Figure: hosts connected through interfaces to a bus interconnect]
Buses
• Time-triggered buses
– All activities are driven by the passage of time.
– The system interacts with its environment according to an internal schedule.
• Event-triggered buses
– All activities are driven by the occurrence of events.
– The system is under the control of its environment and responds to stimuli
as they occur.
Why not an event-triggered system?
• In a safety-critical system it is necessary to guarantee some
basic quality of service, even in the presence of faults.
• Guaranteed low latency is required.
– Events arriving at different nodes may have to contend
for access to the bus.
– So some form of media access control is required.
• Ethernet resolves contention probabilistically.
• To resolve contention deterministically (as on the CAN bus), the message with the
lowest number wins the arbitration, but latency increases as the load
increases.
• In the presence of faults, a message may be retransmitted,
thereby delaying the next message even if it has higher
priority.
• Furthermore, faulty nodes may make excessive demands
for service.
• ARINC 629 uses a technique called ‘minislotting’.
– Each node has to wait a certain period after sending a
message before it can contend to send another.
– But here, too, latency is a function of load.
• Byteflight (BMW) extends this mechanism with
guaranteed, preallocated slots for critical messages.
– It provides no protection against a faulty node that fails to
recognize these slots and transmits at arbitrary times. This kind of fault
is called the ‘babbling idiot’ failure.
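The load dependence of deterministic arbitration can be illustrated with a small sketch. It is a toy model, not any real bus protocol: the function name `arbitrate`, the fixed `frame_time`, and the message numbers are assumptions made for illustration. All messages are taken to be pending at time zero, and the lowest number always wins the bus.

```python
import heapq


def arbitrate(pending, frame_time=1.0):
    """Return {message_number: completion_time} when the lowest number always wins.

    Toy model: every message is already pending at time 0 and each
    transmission occupies the bus for exactly `frame_time`.
    """
    heap = list(pending)
    heapq.heapify(heap)
    now = 0.0
    done = {}
    while heap:
        winner = heapq.heappop(heap)  # lowest number wins the arbitration
        now += frame_time
        done[winner] = now
    return done


# Hypothetical traffic: message 40 is the lowest-priority message in both cases.
light = arbitrate([5, 40])
heavy = arbitrate([5, 7, 9, 11, 13, 40])
print(light[40], heavy[40])
```

With two pending messages, message 40 completes after two frame times; with six, it must wait for all five lower-numbered messages, so its latency grows with the load even though arbitration itself is deterministic.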
Time-triggered bus
• Communication bandwidth is statically preallocated in the form of a
global schedule.
• Thus, contention is resolved at design time rather than at run time.
• But what about the babbling idiot failure?
– Each node has an independent component, called a bus
guardian, that allows the node to transmit only when it is permitted to do so.
– The guardian has an independent clock and independent knowledge of
the schedule, and allows its node to broadcast only when indicated
by the schedule.
• There is no need for source or destination addresses in the message,
because the schedule already identifies the sender of each slot.
– This reduces the size of the message.
– It increases the message bandwidth of the bus.
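The combination of a static schedule and a bus guardian can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of any particular bus: the schedule contents, the 100-microsecond slot length, and the names `BusGuardian`, `permits`, and `bus_traffic` are all hypothetical, and the guardian's independent clock is assumed to be perfectly synchronized.

```python
SCHEDULE = ["A", "B", "C", "D"]  # hypothetical global schedule: slot index -> sender
SLOT_LENGTH_US = 100             # hypothetical slot length in microseconds


class BusGuardian:
    """Gates one node's transmissions using its own copy of the schedule."""

    def __init__(self, node_id, schedule, slot_length_us):
        self.node_id = node_id
        self.schedule = list(schedule)  # independent knowledge of the schedule
        self.slot_length_us = slot_length_us

    def permits(self, guardian_time_us):
        """True only while the guardian's own clock is inside the node's slot."""
        slot = (guardian_time_us // self.slot_length_us) % len(self.schedule)
        return self.schedule[slot] == self.node_id


def bus_traffic(attempts, guardians):
    """Keep only the (time_us, node_id) attempts that the guardians let through."""
    return [(t, n) for t, n in attempts if guardians[n].permits(t)]


guardians = {n: BusGuardian(n, SCHEDULE, SLOT_LENGTH_US) for n in SCHEDULE}

# A babbling node B tries to transmit every 50 microseconds for one full round.
babble = [(t, "B") for t in range(0, 400, 50)]
on_bus = bus_traffic(babble, guardians)
print(on_bus)
```

Only the two attempts that fall inside B's own slot (100 to 199 microseconds) reach the bus; the other six are blocked, so the slots of A, C, and D stay free.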
Time-triggered bus (continued)
• Fault-tolerant clock synchronization is a fundamental
requirement for a time-triggered bus architecture.
• The abstraction of a global clock is realized by each node having
a local clock that is closely synchronized with the clocks of
all other nodes.
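One well-known way to make clock synchronization fault tolerant is the fault-tolerant midpoint: discard the most extreme readings, then take the midpoint of what remains. The sketch below illustrates that general technique; the slides do not say which algorithm any particular bus uses, and the function name, the microsecond units, and the sample readings are assumptions.

```python
def fault_tolerant_midpoint(readings, max_faulty):
    """Return a clock correction that tolerates `max_faulty` arbitrary readings.

    `readings` are the offsets (here in microseconds) that one node observes
    between its own clock and each clock in the system. The `max_faulty`
    largest and smallest values are discarded, and the midpoint of the
    remaining extremes is returned.
    """
    if max_faulty < 0:
        raise ValueError("max_faulty must be non-negative")
    if len(readings) <= 3 * max_faulty:
        raise ValueError("need more than 3 * max_faulty readings")
    ordered = sorted(readings)
    trimmed = ordered[max_faulty:len(ordered) - max_faulty]
    return (trimmed[0] + trimmed[-1]) / 2


# Hypothetical readings: three good clocks and one wildly wrong one.
correction = fault_tolerant_midpoint([-2.0, 1.0, 3.0, 500.0], max_faulty=1)
print(correction)
```

The faulty reading of 500.0 is discarded together with the smallest reading, so the correction is the midpoint of 1.0 and 3.0 and the bad clock cannot drag the others away.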
Fault hypotheses and Fault Containment Units
• A fault hypothesis must describe
– the modes of faults that are to be tolerated,
– their maximum number,
– and their arrival rates.
• It must also identify the different fault containment units (FCUs).
– There must be no propagation of faults from one FCU to another.
– There must be no “common mode failures”, in which a single physical event
produces faults in multiple FCUs.
• A fault may exhibit different modes at different levels of the protocol
hierarchy.
– Example: at the electrical level → intermediate voltage;
at the message level → Byzantine failure.
– Such faults must be controlled at some underlying, intermediate level.
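The example above, where an intermediate voltage at the electrical level shows up as a Byzantine failure at the message level, can be sketched with a toy model. The receiver names and threshold voltages are hypothetical; the only point is that receivers with slightly different thresholds may decode the same marginal signal differently.

```python
# Hypothetical receivers whose decision thresholds (in volts) differ slightly.
RECEIVER_THRESHOLDS = {"R1": 2.4, "R2": 2.5, "R3": 2.6}


def decode_bit(voltage, threshold):
    """Decode a line voltage as a 1 or a 0 against one receiver's threshold."""
    return 1 if voltage >= threshold else 0


def observe(voltage):
    """Return the bit that each receiver decodes from the same line voltage."""
    return {name: decode_bit(voltage, th) for name, th in RECEIVER_THRESHOLDS.items()}


def is_asymmetric(observations):
    """True if the receivers disagree about what was sent."""
    return len(set(observations.values())) > 1


clean = observe(5.0)     # a clean high level: every receiver decodes 1
marginal = observe(2.5)  # an intermediate voltage: the receivers disagree
print(clean, is_asymmetric(clean))
print(marginal, is_asymmetric(marginal))
```

A single electrical fault thus produces inconsistent messages at different receivers, which is why it has to be contained below the message level.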
Basic Dimensions of faults
• Faults can affect value, time or space.
• Value fault: causes an incorrect value to be computed,
transmitted or received.
• Timing fault: causes a value to be computed, transmitted or
received at the wrong time.
• Spatial proximity fault: all the matter in some specified
volume is destroyed.
– In a bus topology, redundant buses come into close proximity at each node.
– A central hub topology is more resilient.
Fault Classification
• Manifest: the fault can be reliably detected.
– Example: a fault that causes an FCU to cease transmitting.
• Symmetric: whatever the effect, it is the same for all
observers.
• Arbitrary: may be asymmetric or Byzantine, meaning that
its effect is perceived differently by different observers.
– Example: a slightly-out-of-specification (SOS) fault, such as
an intermediate electrical voltage or a weak edge.
• The redundancy required for fault tolerance
depends on the type of fault considered.
• Number of FCUs required for clock
synchronization:
n > 3a + 2s + m
where a = number of arbitrary faults,
s = number of symmetric faults,
m = number of manifest faults.
• Some architectures can tolerate only one fault at a time;
they then reconfigure and are able to tolerate additional
faults.
• In such architectures, the fault arrival rate is very important.
– Faults must not arrive faster than the architecture can reconfigure.
– The bus operates according to static schedules, which consist of “rounds” or
“frames” that are executed repeatedly.
– The acceptable fault arrival rate is expressed in faults per round.
• Sometimes a system may experience many simultaneous
faults (for example, due to HIRF, high-intensity radiated fields).
– A restart is usually initiated.
– Detection of such a failure and the restart must be very fast.
– An estimate of the acceptable restart time for a steer-by-wire automobile application is 50 ms.
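The redundancy bound for clock synchronization stated earlier in this section, n > 3a + 2s + m, is easy to evaluate for a given fault hypothesis. The helper names `min_fcus` and `can_tolerate` below are hypothetical and simply restate the inequality.

```python
def min_fcus(arbitrary, symmetric, manifest):
    """Smallest n that satisfies n > 3a + 2s + m."""
    if min(arbitrary, symmetric, manifest) < 0:
        raise ValueError("fault counts must be non-negative")
    return 3 * arbitrary + 2 * symmetric + manifest + 1


def can_tolerate(n, arbitrary, symmetric, manifest):
    """True if n FCUs satisfy the bound for the given fault counts."""
    return n > 3 * arbitrary + 2 * symmetric + manifest


print(min_fcus(1, 0, 0))         # one arbitrary fault
print(min_fcus(0, 1, 1))         # one symmetric and one manifest fault
print(can_tolerate(4, 1, 0, 1))  # four FCUs, one arbitrary plus one manifest fault
```

One arbitrary fault alone already requires four FCUs, the same number needed for one symmetric plus one manifest fault. Four FCUs are not enough for one arbitrary plus one manifest fault, which needs five.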
Services
• The basic purpose of these architectures is to support building reliable
distributed applications.
• Basic services
– clock synchronization
– time-triggered activation
– reliable message delivery
• The problem of distributing data consistently in the presence of
faults goes by several names, one of which is interactive consistency. It requires:
– Agreement: all nonfaulty receivers obtain the same message.
– Validity: if the transmitter is nonfaulty, then nonfaulty receivers
obtain the message actually sent.
• Failure notification, or ‘membership’, service
– The service must produce consistent knowledge.
– If one nonfaulty node thinks that a particular node has failed, then
all the nonfaulty nodes must hold the same opinion.
• Each node maintains a private membership list.
– Agreement: the membership lists of all nonfaulty nodes are the
same.
– Validity: the membership lists of all nonfaulty nodes contain all
nonfaulty nodes and at most one faulty node.
• When it is not possible to maintain accurate membership, the best
recourse is to maintain agreement but sacrifice
validity. This weakened requirement is called “clique
avoidance”.
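The agreement and validity requirements on membership lists can be written as two small predicates. This is only a checking sketch with hypothetical names and data: it assumes an outside observer (for example, a test harness) already knows which nodes are nonfaulty, which a running system does not.

```python
def agreement(lists, nonfaulty):
    """True if all nonfaulty nodes hold the same membership list."""
    views = [frozenset(lists[node]) for node in nonfaulty]
    return all(view == views[0] for view in views)


def validity(lists, nonfaulty):
    """True if every nonfaulty node's list contains all nonfaulty nodes
    and at most one faulty node."""
    for node in nonfaulty:
        members = set(lists[node])
        if not nonfaulty <= members:
            return False
        if len(members - nonfaulty) > 1:
            return False
    return True


nonfaulty = {"A", "B", "C"}  # hypothetical system in which node D is faulty

consistent = {"A": {"A", "B", "C", "D"}, "B": {"A", "B", "C", "D"}, "C": {"A", "B", "C", "D"}}
split = {"A": {"A", "B", "C"}, "B": {"A", "B", "C", "D"}, "C": {"A", "B", "C"}}
clique = {"A": {"A", "B"}, "B": {"A", "B"}, "C": {"A", "B"}}

for name, lists in [("consistent", consistent), ("split", split), ("clique", clique)]:
    print(name, agreement(lists, nonfaulty), validity(lists, nonfaulty))
```

The `clique` case shows the weakened requirement: all nonfaulty nodes still agree, but validity is lost because the nonfaulty node C has been excluded.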
Practical Implementations
• SAFEbus: developed by Honeywell for cockpit displays.
– The interfaces, called BIUs (Bus Interface Units), are duplicated. The BIUs perform clock
synchronization, message scheduling and transmission functions.
– Each BIU of a pair is a separate FCU.
– The interconnect bus is quad-redundant.
– Each BIU of a pair drives a different pair of interconnect buses but
is able to read all four.
– Each interconnect bus comprises two data lines and a clock line
and operates at 30 MHz.
– It can handle arbitrary faults and a high rate of fault arrivals. It also
tolerates spatial proximity faults.
– It is considered to be the best of the buses described here and is used in the Boeing 777 passenger aircraft.
• SPIDER: Scalable Processor-Independent Design for
Electromagnetic Resilience
– Developed at the NASA Langley Research Center.
– It is a research platform for exploring recovery strategies for radiation-induced (HIRF) faults.
– It uses a star configuration, in which the interfaces may be located either
with their hosts or in a centralized hub.
– Services include interactively consistent message broadcast and
identification of failed nodes (a membership service).
• FlexRay: developed for powertrain and chassis control in
cars.
– More flexible than the other buses.
– Supports both ‘static’ time-triggered operation and ‘dynamic’
event-triggered operation.
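The idea of combining static time-triggered and dynamic event-triggered operation in one bus can be sketched as a communication cycle with two parts. This is a toy model and not the FlexRay protocol: the function `run_cycle`, the slot layout, the message contents, and the rule that the dynamic part stops at the first message that no longer fits are all assumptions made for illustration.

```python
def run_cycle(static_slots, static_msgs, dynamic_budget, event_msgs):
    """Return the frames sent in one cycle of a hypothetical hybrid bus.

    static_slots:   sender of each static slot, in schedule order
    static_msgs:    {sender: payload} for the time-triggered part
    dynamic_budget: time units available for the event-triggered part
    event_msgs:     list of (priority, length, payload); lowest priority
                    number is sent first
    """
    sent = []
    # Static part: every slot elapses, whether or not its owner has data.
    for sender in static_slots:
        sent.append(("static", sender, static_msgs.get(sender)))
    # Dynamic part: pending events go out in priority order while they fit.
    remaining = dynamic_budget
    for priority, length, payload in sorted(event_msgs, key=lambda m: m[0]):
        if length > remaining:
            break  # does not fit in this cycle; it waits for a later one
        sent.append(("dynamic", priority, payload))
        remaining -= length
    return sent


frames = run_cycle(
    static_slots=["A", "B"],
    static_msgs={"A": "wheel speed", "B": "brake pressure"},
    dynamic_budget=3,
    event_msgs=[(7, 2, "diagnostic"), (2, 2, "door open")],
)
for frame in frames:
    print(frame)
```

The two static frames are always sent, so their latency does not depend on load, while the lower-priority "diagnostic" event is deferred because the dynamic budget is exhausted.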