Transcript Document

Deployment and Runtime Techniques
for Fault-tolerance in Distributed, Realtime and Embedded Systems
Aniruddha Gokhale
Associate Professor, Dept of EECS, Vanderbilt Univ, Nashville, TN, USA
www.dre.vanderbilt.edu/~gokhale
Based on work done by Jaiganesh Balasubramanian and Sumant Tambe
Presented at Dept of CS, IUPUI, April 15, 2011
Work supported in part by NSF CAREER, NSF SHF/CNS
Focus: Distributed Real-time and Embedded Systems
• Is this a Distributed, Real-time and Embedded (DRE) System?
Just an embedded system =>
Not a DRE system
• Highly resource-constrained
Focus: Distributed Real-time and Embedded Systems
• Is this a Distributed, Real-time and Embedded (DRE) System?
A composition of embedded systems =>
Not DRE yet
• Highly resource-constrained
• Real-time requirements on interactions among
individual embedded systems
• Failures of individual systems possible
• Other QoS requirements
Focus: Distributed Real-time and Embedded Systems
Networked systems of systems => is DRE
• Highly resource-constrained
• Real-time requirements on intra- and inter-subsystem interactions
• Failures of individual subsystems possible
• Other QoS requirements
• Network with constraints on bandwidth
• Workloads can fluctuate
Focus: Distributed Real-time and Embedded Systems
OPEN
• Multiple tasks with real-time requirements
• Resource-constrained environment
• Resource fluctuations and faults are a
norm => maintain high availability
• Uses COTS component middleware
technologies, e.g., RTCORBA/CCM
CLOSED
Objective: Highly available DRE systems
• Resource-aware
• Fault-tolerant
• QoS-aware (soft real-time)
Challenge 1: Satisfy Multi-objective Requirements
• Soft real-time performance must be assured despite failures
• Passive (primary-backup) replication is preferred due to its low resource consumption
• Replicas must be allocated on a minimum number of resources => task allocation that minimizes the resources used
Challenge 2: Dealing with Failures & Overloads
Context
• One or more failures at runtime in processes, processors, links, etc.
• Mode changes in operation may
occur
• System overloads are possible
Solution Needs
• Maintain QoS properties
maximally
• Minimize impact
• Require middleware-based
solutions for reuse and portability
Challenge 3: Replication with End-to-end Tasks
• DRE systems often include end-to-end workflows of tasks
organized in a service oriented architecture
• A multi-tier processing model focused on the end-to-end QoS
requirements
• Critical Path: The chain of tasks with a soft real-time deadline
• Failures may compromise end-to-end QoS (response time)
[Figure: an end-to-end operational string – Detector1/Detector2 -> Planner1/Planner3 -> Config -> Effector1/Effector2; legend: facet, receptacle, event source, event sink, error recovery]
Non-determinism in behavior leads to orphan components
Non-determinism and the Side Effects of Replication
Many sources of non-determinism in DRE systems
• e.g., local information (sensors, clocks), thread scheduling, timers, and more
• Enforcing determinism is not always possible
Side effects of replication + non-determinism + nested invocation => the orphan request & orphan state problem
• Hard to support exactly-once semantics
[Diagram: passive replication + non-determinism + nested invocation => the orphan request problem]
Exactly-once Semantics, Failures, & Determinism
• Deterministic component A
  • Caching of the request/reply at component B is sufficient
  • Caching of the request/reply rectifies the problem
• Non-deterministic component A
  • Two possibilities upon failover: 1. no invocation; 2. a different invocation
  • Caching of the request/reply does not help – orphan request & orphan state
  • Non-deterministic code must re-execute
Challenge 4: Engineering Challenges
Context
• Solutions to challenges 1 through 3
require system (re)configuration
and (re)deployment
• Manual efforts at configuring
middleware must be avoided
Solution Needs
• Maximally automate the
configuration and deployment =>
Leads to systems that are
“correct-by-construction”
• Autonomous adaptive
capabilities
Contributions within the Lifecycle of DRE Systems
Lifecycle
Specification
Algorithms + Systems + S/W Engineering
• CQML to provide expressive capabilities
to capture requirements
• CoSMIC MDE toolsuite
Composition
Deployment
Configuration
Run-time
• DeCoRAM task allocation to balance
resources, real-time and faults
• GRAFT to automatically inject FT logic
• DAnCE for deployment & configuration
• FLARe adaptive middleware for RT+FT
• CORFU middleware for componentizing
FLARe
• The Group-failover Protocol for orphan
requests
Contributions within the Lifecycle of DRE Systems
Lifecycle
Algorithms + Systems + S/W Engineering
Specification
Composition
Deployment
• DeCoRAM task allocation to balance
resources, real-time and faults
• DAnCE for deployment & configuration
Configuration
Run-time
Our Solution: The DeCoRAM D&C Middleware
• DeCoRAM = “Deployment & Configuration Reasoning via Analysis & Modeling”
• DeCoRAM consists of
  • a Pluggable Allocation Engine that determines appropriate node mappings for all applications & replicas using the installed algorithm (no coupling with the allocation algorithm)
  • a Deployment & Configuration (D&C) Engine that deploys & configures applications and replicas on top of middleware on the appropriate hosts (middleware-agnostic)
  • a specific allocation algorithm that is real-time-, fault- and resource-aware
This talk focuses on the allocation algorithm
DeCoRAM Allocation Algorithm
• System model
  • N periodic DRE system tasks
  • RT requirements – periodic tasks, worst-case execution time (WCET), worst-case state synchronization time (WCSST)
  • FT requirements – K number of processor failures to tolerate (number of replicas)
  • Fail-stop processors
How many processors shall we need for a primary-backup scheme? An intuition:
Num proc in no-fault case <= Num proc for passive replication <= Num proc for active replication
Designing the DeCoRAM Allocation Algorithm (1/5)
Basic Step 1: No fault tolerance
• Only primaries exist, consuming WCET each
• Apply first-fit optimal bin-packing using the [Dhall:78]* algorithm
• Consider the sample task set shown
• Tasks arranged according to rate-monotonic priorities

Task | WCET | WCSST | Period | Util (%)
A    | 20   | 0.2   | 50     | 40
B    | 40   | 0.4   | 100    | 40
C    | 50   | 0.5   | 200    | 25
D    | 200  | 2     | 500    | 40
E    | 250  | 2.5   | 1,000  | 25

*[Dhall:78] S. K. Dhall & C. Liu, “On a Real-time Scheduling Problem”, Operations Research, 1978
Applying first-fit placement task by task: A and B fit together on P1 (utilization 80%); C does not fit with them, so it opens P2; D and E then also fit on P2 (utilization 90%).

Outcome -> Lower bound established
• System is schedulable
• Uses the minimum number of resources
RT & resource constraints satisfied; but no FT
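The no-fault placement above can be sketched in code. This is an illustrative sketch, not DeCoRAM's implementation: a standard rate-monotonic response-time analysis stands in for the exact feasibility condition of [Dhall:78], and the task numbers are taken from the slide's table.

```python
import math

def schedulable(taskset):
    """Exact rate-monotonic response-time analysis.
    taskset: list of (name, wcet, period); deadline == period."""
    ts = sorted(taskset, key=lambda t: t[2])  # RM: shorter period = higher priority
    for i, (_, c, t) in enumerate(ts):
        r = c
        while r <= t:
            # interference from all higher-priority tasks released in [0, r)
            nxt = c + sum(math.ceil(r / tj) * cj for _, cj, tj in ts[:i])
            if nxt == r:
                break  # fixed point reached: r is the response time
            r = nxt
        if r > t:
            return False  # deadline miss
    return True

def first_fit(tasks):
    """Place each task on the first processor that stays schedulable."""
    procs = []
    for task in tasks:
        for p in procs:
            if schedulable(p + [task]):
                p.append(task)
                break
        else:  # no existing processor fits -> open a new one
            procs.append([task])
    return procs

tasks = [("A", 20, 50), ("B", 40, 100), ("C", 50, 200),
         ("D", 200, 500), ("E", 250, 1000)]
print([[name for name, _, _ in p] for p in first_fit(tasks)])
# -> [['A', 'B'], ['C', 'D', 'E']] : two processors, as on the slides
```

Note that the simpler Liu–Layland utilization bound would reject {C, D, E} at 90% utilization; the exact response-time test accepts it, which is why the slides need only two processors here.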
Designing the DeCoRAM Allocation Algorithm (2/5)
Refinement 1: Introduce replica tasks
• Do not differentiate between primaries & replicas
• Assume tolerance to 2 failures => 2 replicas each
• Apply the [Dhall:78] algorithm

Task     | WCET | WCSST | Period
A1,A2,A3 | 20   | 0.2   | 50
B1,B2,B3 | 40   | 0.4   | 100
C1,C2,C3 | 50   | 0.5   | 200
D1,D2,D3 | 200  | 2     | 500
E1,E2,E3 | 250  | 2.5   | 1,000

Outcome -> Upper bound established
• An RT-FT solution is created – but with active replication
• System is schedulable
• Demonstrates the upper bound on the number of resources needed
Next: minimize resources using passive replication
Designing the DeCoRAM Allocation Algorithm (3/5)
Refinement 2: Passive replication
• Differentiate between primaries & replicas
• Assume tolerance to 2 failures => 2 additional backup replicas each
• Apply the [Dhall:78] algorithm
• In the no-failure case, backups contribute only WCSST; primaries contribute WCET
• The allocation is fine while A2/B2 remain backups; a failure, however, triggers promotion of A2/B2 to primaries, and promoted backups then contribute WCET, making the system unschedulable
• C1/D1/E1 therefore cannot be placed with A2/B2 – unschedulable; they may be placed on P2 or P3 only as long as there are no failures

Outcome
• Resource minimization & system schedulability are feasible in non-faulty scenarios only, because a backup contributes only WCSST
• It is unrealistic not to expect failures
• Need a way to consider failures & determine which backup will be promoted to primary (contributing WCET)
Designing the DeCoRAM Allocation Algorithm (4/5)
Refinement 3: Enable the offline algorithm to consider failures
• “Look ahead” at failure scenarios of already allocated tasks & replicas, determining the worst-case impact on a given processor
• Feasible to do this because the system properties are invariant
• Looking ahead that any of A2/B2 or A3/B3 may be promoted, C1/D1/E1 must be placed on a different processor
• Where should the backups of C/D/E be placed? On P2, on P3, or on a different processor? P1 is not a choice.
• Suppose the allocation of the backups of C/D/E is as shown; we now look ahead at all 2-failure combinations:
  • Suppose P1 & P2 were to fail: A3 & B3 will be promoted. The schedule is feasible => the original placement decision was OK.
  • Suppose P1 & P4 were to fail, with A2 & B2 on P2 promoted while C3, D3 & E3 on P3 are promoted. The schedule is feasible => the original placement decision was OK.
  • Suppose P1 & P4 were to fail, with A2, B2, C2, D2 & E2 on P2 promoted. The schedule is not feasible => the original placement decision was incorrect.
• Placing the backups of C/D/E here thus exposes one potential combination that leads to an infeasible schedule
Outcome
• Due to the potential for an infeasible schedule, the look-ahead algorithm suggests more resources
• The look-ahead strategy cannot determine the impact of multiple uncorrelated failures that may make the system unschedulable
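The look-ahead step above can be sketched as follows. This is a hedged sketch under simplifying assumptions: a plain utilization test (sum of cost/period ≤ 1) stands in for the exact schedulability analysis, and, like refinement 3, it pessimistically promotes every surviving backup of a lost primary at once, which is exactly the pessimism refinement 4 removes.

```python
from itertools import combinations

def load(proc_tasks, promoted):
    """Utilization of one processor: a primary (or a promoted backup)
    contributes WCET/period; an un-promoted backup only WCSST/period."""
    total = 0.0
    for name, role, wcet, wcsst, period in proc_tasks:
        cost = wcet if (role == "primary" or name in promoted) else wcsst
        total += cost / period
    return total

def lookahead_feasible(alloc, k):
    """alloc: {processor: [(task, role, wcet, wcsst, period), ...]}.
    Pessimistically promotes *every* surviving backup of a task whose
    primary sat on a failed processor, for all <=k failure combinations."""
    for r in range(1, k + 1):
        for failed in combinations(alloc, r):
            lost = {name for p in failed
                    for name, role, *_ in alloc[p] if role == "primary"}
            for p, ts in alloc.items():
                if p not in failed and load(ts, lost) > 1.0:
                    return False  # some surviving processor would overload
    return True

# Illustrative placement: C's primary co-located with the backups of A and B.
alloc = {
    "P1": [("A", "primary", 20, 0.2, 50), ("B", "primary", 40, 0.4, 100)],
    "P2": [("A", "backup", 20, 0.2, 50), ("B", "backup", 40, 0.4, 100),
           ("C", "primary", 50, 0.5, 200)],
}
print(lookahead_feasible(alloc, 1))
# -> False: if P1 fails, P2's promoted backups push it to 105% utilization
```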
Designing the DeCoRAM Allocation Algorithm (5/5)
Refinement 4: Restrict the order in which failover targets are chosen
• Utilize a rank order of replicas to dictate how failover happens
• Enables the look-ahead algorithm to overbook resources, thanks to guarantees that no two uncorrelated failures will make the system unschedulable
• The replica number denotes the ordering in the failover process; suppose the replica allocation is as shown (slightly different from before)
• Suppose P1 & P4 were to fail (the interesting case): A2 & B2 on P2, and C2, D2 & E2 on P3, will be chosen as failover targets due to the restrictions imposed
• C3, D3 & E3 can never become primaries along with A2 & B2 unless more than two failures occur
• For a 2-fault-tolerant system, a replica numbered 3 is assured never to become a primary along with a replica numbered 2; this allows us to overbook the processor, thereby minimizing resources
Resources minimized from 6 to 4 while assuring both RT & FT
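The rank-ordering guarantee can be checked mechanically. A hedged sketch, with an illustrative placement (the processor names and the two tasks A and C are assumptions, not the slides' exact allocation): replicas fail over strictly in rank order, so the acting primary after a set of failures is simply the lowest-ranked surviving replica.

```python
from itertools import combinations

def promoted_replicas(placement, failed):
    """placement: {task: [(rank, processor), ...]} with rank 1 = primary.
    Returns {task: processor} hosting the acting primary after `failed`."""
    acting = {}
    for task, replicas in placement.items():
        for rank, proc in sorted(replicas):
            if proc not in failed:
                acting[task] = proc
                break  # lowest surviving rank wins; the rest stay backups
    return acting

placement = {
    "A": [(1, "P1"), (2, "P2"), (3, "P3")],
    "C": [(1, "P4"), (2, "P3"), (3, "P2")],
}
# With at most 2 failures, A's rank-3 replica (on P3) and C's rank-3
# replica (on P2) can never both be acting primaries: that would require
# at least three of P1, P2, P3, P4 to have failed.
for failed in combinations(["P1", "P2", "P3", "P4"], 2):
    acting = promoted_replicas(placement, set(failed))
    assert not (acting["A"] == "P3" and acting["C"] == "P2")
print("no 2-failure scenario co-promotes two rank-3 replicas")
```

This is what lets the look-ahead overbook: scenarios in which two rank-3 replicas run as primaries together simply never have to be checked.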
DeCoRAM Evaluation Hypotheses
• DeCoRAM’s Failure-aware Look-ahead Feasibility algorithm allocates applications & replicas to hosts while minimizing the number of processors utilized
  • the number of processors utilized is smaller than the number utilized using active replication
• The deployment-time-configured real-time fault-tolerance solution works at runtime when failures occur
  • none of the applications lose high availability & timeliness assurances
[Figure: DeCoRAM Allocation Engine]
Experiment Results
• Linear increase in the # of processors utilized with AFT (active FT) compared to no FT
• DeCoRAM’s rate of increase is much slower than AFT’s
• DeCoRAM uses only approx. 50% of the number of processors used by AFT
• As the task load increases, the # of processors utilized increases
• DeCoRAM scales well, continuing to save ~50% of the processors
DeCoRAM Pluggable Allocation Engine Architecture
• Design driven by separation of concerns
• Use of design patterns
• Input Manager component – collects per-task FT & RT requirements
• Task Replicator component – decides the order in which tasks are allocated
• Node Selector component – decides the node in which allocation will be checked
• Admission Controller component – applies DeCoRAM’s novel algorithm
• Placement Controller component – calls the admission controller repeatedly to
deploy all the applications & their replicas
[Figure: Input Manager -> Task Replicator -> Node Selector -> Admission Controller -> Placement Controller; output decisions realized by DeCoRAM’s D&C Engine]
The Allocation Engine is implemented in ~7,000 lines of C++ code
DeCoRAM Deployment & Configuration Engine
• Automated deployment & configuration support for fault-tolerant real-time systems
• XML Parser: uses middleware D&C mechanisms to decode allocation decisions
• Middleware Deployer: deploys FT middleware-specific entities
• Middleware Configurator: configures the underlying FT-RT middleware artifacts
• Application Installer: installs the application components & their replicas
• Easily extensible
• Current implementation on top of the CIAO, DAnCE, & FLARe middleware
The DeCoRAM D&C Engine is implemented in ~3,500 lines of C++ code
Summary of DeCoRAM Contributions
• The DeCoRAM allocation algorithm reduces the number of resources used via clever resource overbooking of backup replicas
• The DeCoRAM allocation engine can execute many different allocation algorithms
• The DeCoRAM D&C engine requires a concrete bridge implemented for the underlying middleware => the cost is amortized over the number of uses
• Existing fault tolerant middleware runtimes can leverage
DeCoRAM decisions
• For closed DRE systems, runtimes can be very simple and
obey all the decisions determined at design-time
• For open DRE systems, runtimes can use DeCoRAM results for the initial deployment
www.dre.vanderbilt.edu/CIAO
Contributions within the Lifecycle of DRE Systems
Lifecycle
Specification
Algorithms + Systems + S/W Engineering
Composition
Deployment
Configuration
• FLARe adaptive middleware for RT+FT
Run-time
Resolving Challenges 2 & 4: FLARe
FLARe: Fault-tolerant Load-Aware and adaptive middlewaRe
Key Ideas
• Load-Aware Adaptive Failover (LAAF) Target Selection
  • Load-aware => maintain the desired soft real-time performance after recovery
  • Adaptive => handle dynamic load due to workload changes and multiple failures
• Resource Overload Management and rEdirection (ROME)
  • maintain soft real-time performance during overloads
• Failure model: multiple processor/process failures; fail-stop
• Replication model: passive replication; asynchronous state updates
• Implemented on top of the TAO Real-time CORBA middleware
Middleware Architecture
• Client Failover Manager
  • catches processor/process failure exceptions
  • redirects clients to failover targets
• Monitors
  • periodically monitor the liveness and CPU utilization of each processor
• Replication Manager
  • collects system utilizations from the monitors
  • calculates a ranked list of failover targets using LAAF
  • updates the client side with the ranked list of targets
  • manages overloads using ROME
Load-Aware Adaptive Failover (LAAF)
• monitor the CPU utilization of each processor
• rank backup processors based on load
• distribute the failover targets of objects on the same processor => avoid overload after a processor failure
• proactively update clients
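The LAAF idea above can be sketched as a small ranking routine. A hedged sketch only: the object names, candidate lists, and the fixed per-object load increment of 0.2 are illustrative assumptions, not FLARe's API; the real Replication Manager works from monitored utilizations.

```python
def laaf_targets(objects_on_proc, backup_load):
    """objects_on_proc: {object: [candidate backup processors]}.
    backup_load: {processor: current CPU utilization (0..1)}.
    Returns {object: chosen failover processor}, spreading co-located
    objects across backups so one failure does not overload a survivor."""
    projected = dict(backup_load)  # running estimate of post-failover load
    targets = {}
    for obj, candidates in objects_on_proc.items():
        best = min(candidates, key=lambda p: projected[p])  # least loaded
        targets[obj] = best
        projected[best] += 0.2  # assumed per-object load; illustrative only
    return targets

backup_load = {"P2": 0.30, "P3": 0.35, "P4": 0.60}
objects = {"S1": ["P2", "P3", "P4"], "S2": ["P2", "P3", "P4"],
           "S3": ["P2", "P3", "P4"]}
print(laaf_targets(objects, backup_load))
# -> {'S1': 'P2', 'S2': 'P3', 'S3': 'P2'}: targets spread, not all on P2
```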
Resource Overload Management & rEdirection (ROME)
• overloads can occur due to multiple processor failures
• soft real-time => treat overloads as failures
• redirect clients of high-utilization objects to backups on lightly loaded processors
• distributes overloads across multiple processors
Experiment Setup
• Linux clusters at ISISLab
• 6 clients – 2 clients (CL-5 & CL-6) are dynamic clients (start after 50 seconds)
• 6 different servers – each has 2 replicas
• Experiment ran for 300 seconds – each server consumes some CPU load
• Rate Monotonic scheduling on each processor
Experiment Configurations
• Static Failover Strategy
  • each client knows the order in which it accesses the server replicas in the presence of failures – i.e., the failover targets are known in advance
  • this strategy is optimal at deployment time
LAAF Algorithm Results
• At 50 seconds, dynamic loads are introduced
• At 150 seconds, failures are introduced
• The static strategy increases CPU utilizations to 90% and 80%, which could cause system crashes
• LAAF modifies the failover targets at 50 seconds – it prevents overloads when failures occur by choosing different failover targets
Contributions within the Lifecycle of DRE Systems
Lifecycle
Specification
Algorithms + Systems + S/W Engineering
Composition
Deployment
Configuration
• Group Failover to handle orphan requests
Run-time
Resolving Challenges 3 & 4: Group Failover
• Enforcing determinism
  • Point solutions: compensate for specific sources of non-determinism (e.g., thread scheduling, mutual exclusion)
  • Compensation using semi-automated program analysis
  • Humans must rectify non-automated compensation
[Diagram: enforcing determinism across components A, B, C and their replicas A’, B’]
Unresolved Challenges: End-to-end Reliability of Non-deterministic Stateful Components
• Integration of replication & transactions
  • Applicable to multi-tier transactional web-based systems only
• Overhead of transactions (fault-free situation)
  • Messaging overhead in the critical path (e.g., create, join)
  • 2-phase commit (2PC) protocol at the end of the invocation
[Diagram: a client invokes A -> B -> C -> D; each component joins a transaction created with the Transaction Manager]
• Overhead of transactions (faulty situation)
  • Must roll back to avoid orphan state
  • Re-execute & run 2PC again upon recovery
• Transactional semantics are not transparent
  • Developers must implement: prepare, commit, rollback
[Diagram: state updates propagate from A through B, C, D; the potential orphan state grows along the chain but is bounded in B, C, D]
Solution: The Group-failover Protocol
Orphan state is bounded within a group of components
[Diagram: the client fails over from the group A, B, C, D to the passive replica group A’, B’, C’, D’]
Protocol characteristics:
1. Supports exactly-once execution semantics in the presence of nested invocations, non-deterministic stateful components, and passive replication
2. Ensures state consistency of replicas
3. Does not require intrusive changes to the component implementation (no need to implement prepare, commit, & rollback)
4. Supports fast client failover that is insensitive to the location of the failure in the operational string and to the size of the operational string
The Group-failover Protocol (1/3)
Constituents of the group-failover protocol:
1. Accurate failure detection (in a timely fashion)
2. Transparent failover
3. Identifying orphan components
4. Eliminating orphan components
5. Ensuring state consistency

1. Accurate failure detection
• Fault-monitoring infrastructure based on heartbeats
• Synthesized using model-to-model transformations in GRAFT
2. Transparent failover alternatives
• Client-side request interceptors (CORBA standard)
• Aspect-oriented programming (AOP)
• Fault-masking code generation using model-to-code transformations in GRAFT
The Group-failover Protocol (2/3)
3. Identifying orphan components
• Without transactions, the run-time stage of a nested invocation is opaque
• Strategies for determining the extent of the orphan group (statically):
  1. The whole operational string
    • Tolerates catastrophic faults (pool failure, network failure)
    • Tolerates Bohrbugs: a Bohrbug repeats itself predictably when the same state reoccurs
    • Preventing Bohrbugs: reliability through diversity; diversity via non-isomorphic replication (different implementation, structure, QoS), i.e., potentially non-isomorphic operational strings
  2. Dataflow-aware component grouping
The Group-failover Protocol (3/3)
4. Eliminating orphan components
• Using the deployment and configuration (D&C) infrastructure
• Invoke component life-cycle operations (e.g., activate, passivate)
• Passivation discards the application-specific state; the component is no longer remotely addressable
5. Ensuring state consistency
• Must assure exactly-once semantics
• State must be transferred atomically
• Strategies for state synchronization:

Strategy   | Fault-free scenario | Faulty scenario (recovery)
Eager      | Messaging overhead  | No overhead
Lag-by-one | No overhead         | Messaging overhead
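The trade-off in the table can be made concrete by listing which messages each strategy exchanges. A hedged sketch: the message names follow the slides, but the grouping into lists (and the idea of simply counting them) is illustrative, not the protocol's wire format.

```python
def eager_messages(failed):
    """Eager strategy: two explicit phases on *every* invocation."""
    msgs = ["finish", "precommit",        # phase 1
            "state_transfer", "commit"]   # phase 2
    if failed:
        msgs.append("failover")  # recovery itself adds no extra 2PC work
    return msgs

def lag_by_one_messages(failed):
    """Lag-by-one strategy: defer synchronization, pay on recovery."""
    msgs = ["lazy_state_transfer"]  # no explicit phases when fault-free
    if failed:
        msgs += ["prepare", "commit", "failover"]  # recovery-time overhead
    return msgs

# Fault-free: eager pays 4 messages per invocation, lag-by-one just 1.
print(len(eager_messages(False)), len(lag_by_one_messages(False)))  # 4 1
# On failure: the costs flip, matching the table above.
print(len(eager_messages(True)), len(lag_by_one_messages(True)))    # 5 4
```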
Eager State Synchronization Strategy
• State synchronization in two explicit phases
• Fault-free scenario messages: Finish, Precommit (phase 1); State transfer, Commit (phase 2)
• Faulty scenario: transparent failover

Lag-by-one State Synchronization Strategy
• No explicit phases
• Fault-free scenario messages: lazy state transfer
• Faulty scenario messages: Prepare, Commit, transparent failover
Evaluation: Overhead of the State Synchronization Strategies
• Experiments: CIAO middleware, 2 to 5 components
• Eager state synchronization
  • Insensitive to the # of components
  • Concurrent state transfer using CORBA AMI (Asynchronous Method Invocation)
• Lag-by-one state synchronization
  • Insensitive to the # of components
  • Fault-free overhead less than the eager protocol
Evaluation: Client-perceived Failover Latency of the Synchronization Strategies
• The lag-by-one protocol has (low) messaging overhead during failure recovery
• The eager protocol has no overhead during failure recovery
(Jitter +/- 3%)
Relevant Publications
1. GroupFailOver, CBSE 2011 (to appear), Boulder, CO, June 2011
2. DeCoRAM, IEEE RTAS 2010, Stockholm, Sweden, 2010.
3. Adaptive Failover for Real-time Middleware with Passive Replication, IEEE RTAS
2009
4. Component Replication Based on Failover Units, IEEE RTCSA 2009
5. Towards Middleware for Fault-tolerance in Distributed Real-time Embedded
Systems, IFIP DAIS 2008
6. FLARe: A Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed
Real-time Embedded Systems, ACM Middleware Conference Doctoral Symposium
(MDS 2007), 2007
7. MDDPro: Model-Driven Dependability Provisioning in Enterprise Distributed Real-time & Embedded Systems, ISAS 2007
8. A Framework for (Re)Deploying Components in Distributed Real-time Embedded
Systems, ACM SAC 2006
9. Middleware Support for Dynamic Component Updating, DOA 2005
10. Model-driven QoS Provisioning for Distributed Real-time & Embedded Systems,
Under Submission, IEEE Transactions on Software Engineering, 2009
11. NetQoPE: A Model-driven Network QoS Provisioning Engine for Distributed Real-time & Embedded Systems, IEEE RTAS 2008
12. Model-driven Middleware: A New Paradigm for Deploying & Provisioning Distributed
Real-time & Embedded Applications: Elsevier Jour. of Science & Comp. Prog., 2008
13. DAnCE: A QoS-enabled Deployment & Configuration Engine, CD 2005
Concluding Remarks & Future Work
• Satisfying multiple QoS properties simultaneously in
DRE systems is hard
• Resource constraints and fluctuating
workloads/operating conditions make the problem even
harder
• DOC Group at Vanderbilt/ISIS has made significant
R&D contributions in this area
• Technologies we have developed are part of our
ACE/TAO/CIAO/DAnCE middleware suites
• www.dre.vanderbilt.edu
• Future work seeks to address issues in cyber-physical systems
  • Needs interdisciplinary expertise