Dependability Considerations
in Distributed Control Systems
Klemen Žagar, Cosylab
2005-10-13
ICALEPCS 2005, Geneve, Switzerland
Dependability
• A dependable system is one which the users may trust.
• Examples of dependable distributed systems:
– The Internet
– Power distribution grid
– Water supply
• Dependability is a very general term. Among other things, it covers:
– Availability: it is there when needed.
– Reliability: it can work autonomously for a long period of time.
– Maintainability: it is easily fixed when broken.
– Safety: it will not harm other equipment or personnel.
– Security: unauthorized, possibly malicious, users cannot gain control.
Motivation
• Nodes of a distributed system are like dominos
– The domino effect: one falls, all may go down
– May happen often, and takes a long time to rebuild
• Thus, fault tolerance is important:
– Improved mean-time-to-failure of the system as a whole
– Lower mean-time-to-repair
→ Improved availability
→ Reduced maintenance effort
• Fault tolerance in distributed control systems?
Research Objectives
• Dependable Distributed Systems (DeDiSys)
research project with the European Union.
• What are the most frequent causes of faults in
distributed control systems?
• What mitigation mechanisms are available?
• How to improve availability by trading it against
constraint consistency?
– What is constraint consistency in control
systems?
Reliability
• Reliability, $R(t)$, is the probability that a system will perform as specified for a given period of time $t$.
– Typically exponential: $R(t) = e^{-\lambda t}$, where $\lambda$ is the (constant) failure rate.
– Alternative measure is the mean time to failure (MTTF/MTBF): $\mathrm{MTTF} = 1/\lambda$.
[Figure: $R(t)$ versus $t$, starting at 1. Caption: reliability of the Microsoft Windows 95 operating system; 49.7 days is marked on the time axis.]
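As a worked illustration of the exponential model, the sketch below evaluates the reliability for a constant failure rate, written in terms of the MTTF. The function name and the MTTF of 1000 hours are illustrative assumptions, not figures from the talk.

```python
import math

def reliability(t_hours: float, mttf_hours: float) -> float:
    """R(t) = exp(-t / MTTF), i.e. a constant failure rate of 1 / MTTF."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return math.exp(-t_hours / mttf_hours)

# Hypothetical component with an MTTF of 1000 hours.
for t in (0, 100, 1000):
    print(f"R({t} h) = {reliability(t, 1000.0):.3f}")
```

Under this model, the probability of surviving until t = MTTF is exp(-1), about 37%, so the MTTF should not be read as a guaranteed lifetime.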
Reliability of Composed Systems
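• Weakest link: reliability of a coupled composed system is less than the reliability of its least reliable constituent (assuming independent failures): $R_{\text{coupled}} = \prod_i R_i \le \min_i R_i$
• Redundancy: reliability of a redundant subsystem is greater than the reliability of its most reliable constituent (assuming independent failures): $R_{\text{redundant}} = 1 - \prod_i (1 - R_i) \ge \max_i R_i$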
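The two composition rules can be checked numerically. This is a minimal sketch that assumes the constituents fail independently; the function names and the three example reliabilities are hypothetical.

```python
from math import prod

def coupled_reliability(parts):
    """Every constituent must work: R = product of R_i."""
    return prod(parts)

def redundant_reliability(parts):
    """At least one constituent must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in parts)

parts = [0.99, 0.95, 0.90]  # hypothetical constituent reliabilities

print(coupled_reliability(parts) < min(parts))    # weakest-link bound
print(redundant_reliability(parts) > max(parts))  # redundancy bound
```

With these numbers the coupled system comes out near 0.846 and the redundant one near 0.99995, which shows how quickly a chain of individually good components degrades and how much replication buys back.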
Maintainability and Availability
• Maintainability: how long it takes to repair a system after a
failure.
– The measure is mean time to repair (MTTR)
• Availability: percentage of time the system is actually
available during periods when it should be available.
– Directly experienced by users!
– Expressed in percent. In marketing, also as a number of nines (e.g., 99.999% availability → unavailable about 5 min/year).
• Example: a gas station (working hours 6AM to 10PM, 16 hours)
– Ran out of gas at 10AM (2h)
– Pump malfunction at 2PM (2h)
– Availability: 12h/16h = 75%
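The arithmetic behind these availability figures can be sketched as follows. The function names are hypothetical, and the steady-state formula MTTF / (MTTF + MTTR) is the standard textbook relation, added here as an assumption rather than taken from the slides.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def availability(scheduled_hours, outage_hours):
    """Fraction of the scheduled time during which the service was usable."""
    return (scheduled_hours - sum(outage_hours)) / scheduled_hours

def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected yearly downtime for a service that should run around the clock."""
    return (1.0 - avail) * MINUTES_PER_YEAR

# Gas station: 16 scheduled hours, two outages of 2 hours each.
print(f"{availability(16, [2, 2]):.0%}")
# "Five nines" expressed as downtime.
print(f"{downtime_minutes_per_year(0.99999):.1f} min/year")
```

The gas station comes out at 75%, and five nines corresponds to roughly 5.3 minutes of downtime per year.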
Research Methodology
• Research in the context of the DeDiSys project
• Collection of requirements from
– DeDiSys project’s interest group members
– Cosylab’s customers (e.g., ANKA, SLS, ...)
• Identification of scenarios
– ALMA Common Software (ACS)
– EPICS
– Geographical Information Systems
• Definition of the architecture for a fault-tolerant naming service (FTNS)
Faults in Distributed Systems
Node failures
• A host crashes or a process dies
• Volatile state is lost
Link failures
• A network link is broken
• Results in two or more partitions
• Difficult to distinguish from a host crash
Consequences
• Affected services are lost
• Dependent systems malfunction
• User interface doesn’t show actual status
[Figure: six clients (Client 1 to Client 6) and three replicas of a service. Copy 1: active, available. Copy 2: crashed (crashed server). Copy 3: active, inconsistent, available. A severed network link partitions the network.]
Improving Hardware MTTF
• Reduce the number of mechanical parts:
– Solid-state storage instead of hard disks
– Passive cooling of power supplies and CPUs (no fans)
• High-quality or redundant power supplies
• Replication:
– network links
– CPU boards
• Remote reset (e.g., via power cycling)
Improving Software MTTF
• Ensure that overflows of variables that constantly increase
(handle IDs, timers, counters, ...) are properly handled.
• Ensure all resources are properly released when no longer
needed (memory leaks, …)
– Use a managed platform (Java, .NET)
– Use auto-pointers (C++)
• Avoid using heap storage on a per-transaction basis (may
result in memory fragmentation); e.g., use free-lists
• Restart a process in a controllable fashion (rejuvenation)
• Isolate processes through inter-process communication
• Recovery:
– Recover state after a crash
– Effective for host and process crashes
– Automated repair
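The first point on this slide, overflow of ever-increasing counters, can be illustrated with a 32-bit millisecond tick counter, which wraps after 2^32 ms, roughly 49.7 days. The sketch below emulates the fixed-width counter by masking Python integers to 32 bits; all helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # emulate an unsigned 32-bit counter

def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Wraparound-safe elapsed time between two 32-bit tick readings."""
    return (now_tick - start_tick) & MASK32

def timed_out_naive(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Buggy: compares against an absolute deadline that may have wrapped."""
    deadline = (start_tick + timeout_ms) & MASK32
    return now_tick >= deadline

def timed_out_safe(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Correct as long as the real interval is shorter than 2**32 ms."""
    return elapsed_ms(start_tick, now_tick) >= timeout_ms

wrap_days = 2**32 / (1000 * 60 * 60 * 24)
print(f"32-bit ms counter wraps after {wrap_days:.1f} days")

start = MASK32 - 50   # 50 ms before the counter wraps
now = start + 10      # only 10 ms later, counter has not wrapped yet
print(timed_out_naive(start, now, 100))  # True: a false alarm
print(timed_out_safe(start, now, 100))   # False: only 10 ms have elapsed
```

The safe variant compares elapsed intervals rather than absolute deadlines, so it keeps working across the wrap as long as no single interval exceeds the counter range.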
Decreasing MTTR
• Foresee failures during design
– "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair." (Douglas Adams, Mostly Harmless)
• Provide good diagnostics
– Alarms
• Detailed description of where and when an error occurred
– Logs
– State-dump at failures
• ADC buffers after a beam dump
• Status of synchronization primitives
• Memory dump
• Automated fail-over
– In combination with redundancy
– Passive replica must have up-to-date state of the primary copy
– Fault detection (network ping, analog signal, …)
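Automated fail-over driven by timeout-based fault detection might look like the sketch below. It is a toy model, not the mechanism of any particular control system: the class names, the 3-second timeout, and the injected clock are all assumptions made for illustration.

```python
import time

class HeartbeatMonitor:
    """Suspects the primary when no heartbeat arrives within a timeout."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def heartbeat(self) -> None:
        """Call whenever a ping reply (or other sign of life) is received."""
        self.last_seen = self.clock()

    def primary_suspected(self) -> bool:
        return self.clock() - self.last_seen > self.timeout_s

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.state = {}
        self.active = False

def fail_over(monitor, primary, backup):
    """Promote the backup if the primary is suspected; return the active copy."""
    if monitor.primary_suspected() and not backup.active:
        primary.active = False
        backup.active = True
    return backup if backup.active else primary

# Usage with a fake clock so the example runs instantly.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: now[0])
primary, backup = Replica("primary"), Replica("backup")
primary.active = True
backup.state = dict(primary.state)  # passive replica kept up to date

now[0] = 2.0
print(fail_over(monitor, primary, backup).name)  # primary: within the timeout
now[0] = 6.0
print(fail_over(monitor, primary, backup).name)  # backup: heartbeat overdue
```

A timeout alone cannot tell a crashed primary from a severed link, so a real deployment would need an additional safeguard against both copies becoming active at once.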
Consistency/Availability Trade-Off
[Figure: a spectrum from consistency to availability.]
• Toward consistency: finance, banking, access control, corporate databases
• Toward availability: control systems, air-traffic control, fly-by-wire, drive-by-wire
Constraint Consistency in Control Systems
• Constraints: rules that one or more objects must satisfy,
for example:
– If and only if serverChannel.monitors.contains(client)
then client.isSubscribedTo(serverChannel)
– serverChannel.value == clientChannel.value
– server.getFromDatabase('x') == database.get('x')
– If client.referencesComponent(component)
then component.isReferencedBy(client)
• Can some constraints be temporarily relaxed in the presence of faults?
• If so, how to reconcile the system into a consistent state when faults are removed?
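To make the idea of relaxing and later reconciling a constraint concrete, the toy model below uses the constraint serverChannel.value == clientChannel.value. It is only a sketch under strong assumptions (a single writer on the server side, reconciliation by copying the server value); the class and attribute names are hypothetical and do not describe the DeDiSys design.

```python
from dataclasses import dataclass

@dataclass
class Channel:
    value: float = 0.0
    possibly_stale: bool = False  # set while the constraint is relaxed

@dataclass
class Link:
    server: Channel
    client: Channel
    connected: bool = True

    def constraint_holds(self) -> bool:
        return self.server.value == self.client.value

    def server_update(self, value: float) -> None:
        self.server.value = value
        if self.connected:
            self.client.value = value          # constraint enforced
        else:
            self.client.possibly_stale = True  # constraint relaxed, client warned

    def reconcile(self) -> None:
        """Restore the constraint once the fault has been removed."""
        self.connected = True
        self.client.value = self.server.value
        self.client.possibly_stale = False

link = Link(Channel(), Channel())
link.server_update(1.5)
link.connected = False  # simulated network partition
link.server_update(2.5)
print(link.constraint_holds(), link.client.possibly_stale)  # False True
link.reconcile()
print(link.constraint_holds(), link.client.possibly_stale)  # True False
```

During the partition both sides stay available, and the client can at least show that its value may be out of date instead of presenting it as current.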
Future Work
• DeDiSys:
– Design and implementation (due: January 2007)
– Validation (due: June 2007)
• Possible inclusion of research findings in control system
infrastructures:
– ACS (e.g., replication of the manager and components)
– EPICS (e.g., V4 fault-tolerance efforts of the EPICS
community)
• Inclusion in products:
– The microIOC platform
– Servers for Geographical Information Systems
– Other high-availability products (telecommunications,
automotive)
• Know-how for consulting and development services
Conclusion
• Distributed systems are inherently fragile
• Fault tolerance is difficult to program
– Should be addressed by infrastructure/middleware, but frequently isn’t
• Comments/questions/contributions:
[email protected]