Transcript Document
Deployment and Runtime Techniques for Fault-tolerance in Distributed, Real-time and Embedded Systems

Aniruddha Gokhale
Associate Professor, Dept. of EECS, Vanderbilt University, Nashville, TN, USA
www.dre.vanderbilt.edu/~gokhale

Based on work done by Jaiganesh Balasubramanian and Sumant Tambe.
Presented at the Dept. of CS, IUPUI, April 15, 2011.
Work supported in part by NSF CAREER and NSF SHF/CNS grants.

Focus: Distributed Real-time and Embedded (DRE) Systems
• Is a single, highly resource-constrained device a DRE system? No; it is just an embedded system.
• Is a composition of embedded systems a DRE system? Not yet, even though it is highly resource-constrained, has real-time requirements on the interactions among the individual embedded systems, can suffer failures of individual systems, and has other QoS requirements.
• A networked system of systems is a DRE system:
  • highly resource-constrained;
  • real-time requirements on intra- and inter-subsystem interactions;
  • failures of individual subsystems are possible;
  • other QoS requirements;
  • a network with constraints on bandwidth;
  • workloads that can fluctuate.
• Whether open or closed, such systems run multiple tasks with real-time requirements in resource-constrained environments where resource fluctuations and faults are the norm (=> high availability must be maintained), and they use COTS component middleware technologies, e.g., RT-CORBA/CCM.

Objective: highly available DRE systems that are resource-aware, fault-tolerant, and QoS-aware (soft real-time).

Challenge 1: Satisfy Multi-objective Requirements
• Soft real-time performance must be assured despite failures.
• Passive (primary-backup) replication is preferred due to its low resource consumption.
• Replicas must be allocated on a minimum number of resources => we need a task allocation that minimizes the resources used.

Challenge 2: Dealing with Failures & Overloads
Context
• One or more failures at runtime in processes, processors, links, etc.
• Mode changes in operation may occur.
• System overloads are possible.
Solution needs
• Maintain QoS properties maximally.
• Minimize the impact of failures and overloads.
• Require middleware-based solutions for reuse and portability.

Challenge 3: Replication with End-to-end Tasks
• DRE systems often include end-to-end workflows of tasks organized in a service-oriented architecture.
• A multi-tier processing model focused on the end-to-end QoS requirements.
• Critical path: the chain of tasks with a soft real-time deadline.
• Failures may compromise end-to-end QoS (response time).
(Figure: an operational string of components, Detector1 and Detector2 feeding Planner1, Planner3, Config, Effector1, and Effector2, with a legend for receptacle, facet, event source, event sink, and error recovery. Non-determinism in behavior leads to orphan components.)

Non-determinism and the Side Effects of Replication
• There are many sources of non-determinism in DRE systems, e.g., local information (sensors, clocks), thread scheduling, timers, and more; enforcing determinism is not always possible.
• Replication + non-determinism + nested invocation => the orphan request & orphan state problem.
• The problem: it is hard to support exactly-once semantics under passive replication.

Exactly-once Semantics, Failures, & Determinism
• Deterministic component A: caching of the request/reply at component B is sufficient; the cache rectifies the problem.
• Non-deterministic component A: there are two possibilities upon failover, (1) no invocation or (2) a different invocation. Caching of the request/reply does not help; an orphan request & orphan state result, and the non-deterministic code must re-execute.
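For the deterministic case, the request/reply cache amounts to duplicate detection keyed by a request id: a retried request after failover is answered from the cache instead of being re-executed. A minimal sketch, with hypothetical RequestId/Reply types and a caller-supplied process function (none of these names come from the talk's middleware):

```cpp
#include <map>
#include <string>

// Hypothetical request/reply types; a real middleware would key on its
// own protocol-level request ids.
using RequestId = long;
using Reply = std::string;

class ReplyCache {
  std::map<RequestId, Reply> cache_;
public:
  // Returns the cached reply for a retried (duplicate) request instead
  // of re-executing, preserving exactly-once semantics for
  // deterministic components.
  Reply invoke(RequestId id, const std::string& payload,
               Reply (*process)(const std::string&)) {
    auto it = cache_.find(id);
    if (it != cache_.end())
      return it->second;            // duplicate after failover: replay reply
    Reply r = process(payload);     // first execution
    cache_[id] = r;                 // record before replying
    return r;
  }
};
```

If component A is non-deterministic, the retried request may never arrive or may differ from the original, so no cache lookup can reconcile work that already executed downstream; that is the orphan-state problem the group-failover protocol addresses later in the talk.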
Challenge 4: Engineering Challenges
Context
• Solutions to challenges 1 through 3 require system (re)configuration and (re)deployment.
• Manual efforts at configuring the middleware must be avoided.
Solution needs
• Maximally automate configuration and deployment => leads to systems that are "correct-by-construction".
• Autonomous adaptive capabilities.

Contributions within the Lifecycle of DRE Systems (Algorithms + Systems + S/W Engineering)
• Specification: CQML to provide expressive capabilities to capture requirements; the CoSMIC MDE toolsuite.
• Composition, deployment & configuration: DeCoRAM task allocation to balance resources, real-time, and faults; GRAFT to automatically inject FT logic; DAnCE for deployment & configuration.
• Run-time: FLARe, an adaptive middleware for RT+FT; CORFU, a middleware for componentizing FLARe; the group-failover protocol for orphan requests.
The talk takes these up in turn, starting with DeCoRAM and DAnCE.

Our Solution: The DeCoRAM D&C Middleware
• DeCoRAM = "Deployment & Configuration Reasoning via Analysis & Modeling".
• DeCoRAM consists of:
  • a pluggable Allocation Engine that determines appropriate node mappings for all applications & replicas using the installed algorithm (no coupling with the allocation algorithm);
  • a middleware-agnostic Deployment & Configuration (D&C) Engine that deploys & configures applications and replicas on the appropriate hosts on top of the middleware;
  • a specific allocation algorithm that is real-time-, fault-, and resource-aware.
• This talk focuses on the allocation algorithm.

DeCoRAM Allocation Algorithm
• System model:
  • N periodic DRE system tasks;
  • RT requirements: periodic tasks with a worst-case execution time (WCET) and a worst-case state synchronization time (WCSST);
  • FT requirements: tolerate K processor failures, i.e., K backup replicas per task;
  • fail-stop processors.
• How many processors do we need for a primary-backup scheme? An intuition:
  #processors (no-fault case) <= #processors (passive replication) <= #processors (active replication).

Designing the DeCoRAM Allocation Algorithm (1/5)
Basic Step 1: No fault tolerance
• Only primaries exist, each consuming its WCET.
• Apply first-fit optimal bin packing using the [Dhall:78] algorithm.
• Consider the sample task set below, with tasks arranged according to rate-monotonic priorities.

Task  WCET  WCSST  Period  Util (%)
A     20    0.2    50      40
B     40    0.4    100     40
C     50    0.5    200     25
D     200   2      500     40
E     250   2.5    1,000   25

[Dhall:78] S. K. Dhall & C. L. Liu, "On a Real-time Scheduling Problem", Operations Research, 1978.
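To make the first-fit step concrete, here is a minimal allocation sketch. It uses standard response-time analysis as the rate-monotonic schedulability test; the exact formulation in [Dhall:78] differs in detail, so treat this as an illustrative stand-in rather than DeCoRAM's implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

struct Task { std::string name; double wcet, period; };

// Exact rate-monotonic test via response-time analysis: task i meets
// its deadline (= its period) if the fixed point of
//   R = C_i + sum over higher-priority j of ceil(R / T_j) * C_j
// stays within T_i. Tasks are sorted by period (RM priority).
bool schedulable(std::vector<Task> tasks) {
  std::sort(tasks.begin(), tasks.end(),
            [](const Task& a, const Task& b) { return a.period < b.period; });
  for (std::size_t i = 0; i < tasks.size(); ++i) {
    double r = tasks[i].wcet, prev = 0.0;
    while (r != prev && r <= tasks[i].period) {
      prev = r;
      r = tasks[i].wcet;
      for (std::size_t j = 0; j < i; ++j)
        r += std::ceil(prev / tasks[j].period) * tasks[j].wcet;
    }
    if (r > tasks[i].period) return false;
  }
  return true;
}

// First-fit bin packing: place each task on the first processor that
// remains schedulable with it; open a new processor otherwise.
std::vector<std::vector<Task>> first_fit(const std::vector<Task>& tasks) {
  std::vector<std::vector<Task>> procs;
  for (const Task& t : tasks) {
    bool placed = false;
    for (auto& p : procs) {
      p.push_back(t);
      if (schedulable(p)) { placed = true; break; }
      p.pop_back();                    // does not fit: undo, try the next
    }
    if (!placed) procs.push_back({t}); // open a new processor
  }
  return procs;
}
```

On the sample task set this reproduces the allocation on the slides: A and B fill P1, C's worst-case response time on P1 would be 210 > 200, so C opens P2, where D and E also fit.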
Applying first-fit to the sample task set:
• A and B are placed on P1 (combined utilization 80%).
• C does not fit on P1 (its worst-case response time would exceed its period of 200), so a second processor P2 is opened.
• D and E also fit on P2, yielding P1 = {A, B} and P2 = {C, D, E}.
Outcome => a lower bound is established:
• The system is schedulable and uses the minimum number of resources.
• RT & resource constraints are satisfied, but there is no FT.

Designing the DeCoRAM Allocation Algorithm (2/5)
Refinement 1: Introduce replica tasks
• Do not differentiate between primaries & replicas.
• Assume tolerance to 2 failures => 2 replicas of each task, so three copies in all, each with the WCET, WCSST, and period of the original task:

Task        WCET  WCSST  Period
A1, A2, A3  20    0.2    50
B1, B2, B3  40    0.4    100
C1, C2, C3  50    0.5    200
D1, D2, D3  200   2      500
E1, E2, E3  250   2.5    1,000

• Apply the [Dhall:78] algorithm to this replicated task set.
Outcome => an upper bound is established:
• An RT-FT solution is created, but with active replication.
• The system is schedulable, and the allocation demonstrates the upper bound on the number of resources needed.
• The next refinements minimize resources using passive replication.
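Refinement 1 can be expressed as a thin driver over the first-fit sketch above (reusing its Task type and schedulable() test): triple the task set, charge every copy its full WCET, and add an anti-affinity rule so copies of the same task never share a processor. A hypothetical illustration, not DeCoRAM's code:

```cpp
#include <string>
#include <vector>

// Active replication: three copies of every task (primary + 2 replicas),
// each costing full WCET, with an anti-affinity rule so copies of one
// task never land on the same processor.
std::vector<std::vector<Task>> allocate_active(const std::vector<Task>& tasks) {
  std::vector<std::vector<Task>> procs;
  for (const Task& t : tasks) {
    for (int copy = 1; copy <= 3; ++copy) {
      Task rep{t.name + std::to_string(copy), t.wcet, t.period};
      bool placed = false;
      for (auto& p : procs) {
        bool conflict = false;          // same base task already here?
        for (const Task& q : p)
          if (q.name.compare(0, t.name.size(), t.name) == 0) conflict = true;
        if (conflict) continue;
        p.push_back(rep);
        if (schedulable(p)) { placed = true; break; }
        p.pop_back();
      }
      if (!placed) procs.push_back({rep});
    }
  }
  return procs;
}
```

On the sample set this needs six processors, the upper bound that passive replication subsequently drives down to four.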
Designing the DeCoRAM Allocation Algorithm (3/5)
Refinement 2: Passive replication
• Differentiate between primaries & replicas.
• Assume tolerance to 2 failures => 2 additional backup replicas of each task (the same replicated task set as above).
• Apply the [Dhall:78] algorithm, but in the no-failure case backups contribute only their WCSST, while primaries contribute their WCET.
• Because a dormant backup is so cheap, primaries can share processors with other tasks' backups: e.g., C1 can be packed alongside the backups A2/B2, and the allocation is fine as long as A2/B2 remain backups.
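Only the cost accounting changes relative to Step 1: what a replica charges to a processor's schedulability test now depends on its role. A minimal sketch of that accounting, with a hypothetical Replica record (not DeCoRAM's representation):

```cpp
#include <string>
#include <vector>

struct Replica {
  std::string name;
  double wcet, wcsst, period;
  bool is_primary;          // backups are dormant copies of the same task
};

// In the no-failure case a backup costs only its worst-case state
// synchronization time (WCSST); a primary costs its full WCET.
double demand(const Replica& r) { return r.is_primary ? r.wcet : r.wcsst; }

// Utilization a processor sees from the replicas it hosts.
double processor_utilization(const std::vector<Replica>& hosted) {
  double u = 0.0;
  for (const Replica& r : hosted) u += demand(r) / r.period;
  return u;
}
```

With WCSST two orders of magnitude below WCET in the sample set, a processor full of backups looks nearly idle, which is exactly why this no-failure packing is optimistic, as the next slides show.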
But failures change the picture:
• A failure triggers the promotion of A2/B2 to primaries, and promoted backups contribute their full WCET.
• The allocation that was fine while A2/B2 were backups becomes unschedulable once they are promoted: C1/D1/E1 cannot be placed with them.
• C1/D1/E1 may be placed on P2 or P3 only as long as there are no failures.
Outcome:
• Resource minimization & system schedulability are feasible in non-faulty scenarios only, because a backup contributes only its WCSST.
• It is unrealistic not to expect failures.
• We need a way to consider failures & determine which backups will be promoted to primaries (thereafter contributing their WCET).

Designing the DeCoRAM Allocation Algorithm (4/5)
Refinement 3: Enable the offline algorithm to consider failures
• "Look ahead" at failure scenarios of the already-allocated tasks & replicas, determining the worst-case impact on a given processor.
• Doing this offline is feasible because the system properties are invariant.
• Looking ahead that any of A2/B2 or A3/B3 may be promoted, C1/D1/E1 must be placed on a different processor.
• Where should the backups of C/D/E be placed: on P2, on P3, or on a different processor? P1 is not a choice.
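The shape of the look-ahead feasibility test can be sketched as follows: enumerate every combination of up to K = 2 processor failures and re-check the candidate processor, charging full WCET for any backup that might be promoted. Without a failover order, the test must conservatively assume a backup may be promoted whenever any replica of its task sat on a failed processor. A simplified utilization-based sketch (DeCoRAM's actual test uses the exact schedulability analysis):

```cpp
#include <cstddef>
#include <vector>

// Placement state: pl[p] = replicas hosted on processor p.
struct Rep { int task; bool backup; double wcet, wcsst, period; };
using Placement = std::vector<std::vector<Rep>>;

// Refinement 3 look-ahead, with no failover ordering: for every pair of
// processor failures (K = 2; f1 == f2 covers single failures), assume a
// backup on the candidate processor may be promoted whenever some
// replica of its task sat on a failed processor.
bool lookahead_feasible(std::size_t proc, const Placement& pl) {
  const std::size_t n = pl.size();
  for (std::size_t f1 = 0; f1 < n; ++f1)
    for (std::size_t f2 = f1; f2 < n; ++f2) {
      if (f1 == proc || f2 == proc) continue;  // our processor survives
      double u = 0.0;
      for (const Rep& r : pl[proc]) {
        bool may_promote = false;
        if (r.backup)
          for (const Rep& q : pl[f1]) if (q.task == r.task) may_promote = true;
        if (r.backup && !may_promote)
          for (const Rep& q : pl[f2]) if (q.task == r.task) may_promote = true;
        // Primaries and possibly-promoted backups cost WCET; the rest WCSST.
        u += ((!r.backup || may_promote) ? r.wcet : r.wcsst) / r.period;
      }
      if (u > 1.0) return false;  // utilization stand-in for the exact RM test
    }
  return true;
}
```

This pessimism is what the slides observe next: some placements fail the look-ahead even though most concrete failure patterns would be fine.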
Walking through the look-ahead on the allocation above:
• Suppose the backups of C/D/E are allocated as shown on the slides; we now look ahead at all 2-failure combinations.
• Suppose P1 & P2 fail: A3 & B3 are promoted. The schedule remains feasible => the original placement decision was OK.
• Suppose P1 & P4 fail, and A2 & B2 on P2 are promoted while C3, D3 & E3 on P3 are promoted: the schedule remains feasible => still OK.
• Suppose P1 & P4 fail, and A2, B2, C2, D2 & E2 on P2 are promoted: the schedule is not feasible => the original placement decision was incorrect.
Outcome:
• Due to the potential for an infeasible schedule, the look-ahead algorithm must suggest more resources.
• The look-ahead strategy by itself cannot bound the impact of multiple uncorrelated failures that may make the system unschedulable.

Designing the DeCoRAM Allocation Algorithm (5/5)
Refinement 4: Restrict the order in which failover targets are chosen
• Utilize a rank order of replicas to dictate how failover happens; the replica number denotes the replica's position in the failover order.
• This enables the look-ahead algorithm to overbook resources, backed by the guarantee that no two uncorrelated failures will make the system unschedulable.
• Suppose the replica allocation is as shown (slightly different from before), with replica numbers indicating the failover order.
• Suppose P1 & P4 fail (the interesting case): A2 & B2 on P2 and C2, D2, E2 on P3 are chosen as failover targets because of the imposed order. C3, D3, E3 can never become primaries alongside A2 & B2 unless more than two failures occur.
• In general, for a 2-fault-tolerant system, a replica numbered 3 is assured never to become a primary together with a replica numbered 2. This allows the algorithm to overbook processors, thereby minimizing resources: for the sample task set, resources are minimized from 6 to 4 processors while assuring both RT & FT.
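With the failover order fixed, the promotion test inside the look-ahead becomes precise. A sketch of the rank-aware test, in the same illustrative model as the previous sketch:

```cpp
#include <cstddef>
#include <vector>

// Rank-aware promotion: replica ranks fix the failover order (rank 1 is
// the primary, rank 2 the first failover target, and so on). A backup is
// promoted only when every lower-ranked replica of its task sits on a
// failed processor, so rank 3 never runs alongside a surviving rank 2.
struct RankedRep { int task; int rank; double wcet, wcsst, period; };
using RankedPlacement = std::vector<std::vector<RankedRep>>;

bool promoted(const RankedRep& r, const RankedPlacement& pl,
              const std::vector<bool>& failed) {
  if (r.rank == 1) return false;        // already the primary
  for (std::size_t p = 0; p < pl.size(); ++p) {
    if (failed[p]) continue;            // this host is down
    for (const RankedRep& q : pl[p])
      if (q.task == r.task && q.rank < r.rank)
        return false;                   // a lower rank survived: it goes first
  }
  return true;
}
```

Substituting this for the conservative may_promote check charges far fewer WCETs per failure scenario, precisely the overbooking that takes the sample system from six processors down to four.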
DeCoRAM Evaluation
Hypotheses
• DeCoRAM's failure-aware look-ahead feasibility algorithm allocates applications & replicas to hosts while minimizing the number of processors utilized; in particular, it uses fewer processors than active replication.
• The real-time fault-tolerance solution configured at deployment time works at runtime when failures occur: none of the applications lose their high-availability & timeliness assurances.

Experiment Results
• Active replication (AFT) shows a linear increase in the number of processors utilized compared to the no-FT baseline.
• DeCoRAM's rate of increase is much slower than AFT's: DeCoRAM uses only approximately 50% of the number of processors used by AFT.
• As the task load increases, the number of processors utilized increases, but DeCoRAM scales well, continuing to save ~50% of the processors.

DeCoRAM Pluggable Allocation Engine Architecture
• The design is driven by separation of concerns and the use of design patterns (a sketch of the resulting interfaces follows below):
  • Input Manager component: collects per-task FT & RT requirements.
  • Task Replicator component: decides the order in which tasks are allocated.
  • Node Selector component: decides the node on which an allocation will be checked.
  • Admission Controller component: applies DeCoRAM's novel algorithm.
  • Placement Controller component: calls the Admission Controller repeatedly to deploy all the applications & their replicas.
• The Allocation Engine is implemented in ~7,000 lines of C++ code; its output decisions are realized by DeCoRAM's D&C Engine.

DeCoRAM Deployment & Configuration Engine
• Automated deployment & configuration support for fault-tolerant real-time systems:
  • XML Parser: uses middleware D&C mechanisms to decode the allocation decisions;
  • Middleware Deployer: deploys the FT-middleware-specific entities;
  • Middleware Configurator: configures the underlying FT-RT middleware artifacts;
  • Application Installer: installs the application components & their replicas.
• Easily extensible; the current implementation sits on top of the CIAO, DAnCE, & FLARe middleware.
• The D&C Engine is implemented in ~3,500 lines of C++ code.
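The separation of concerns above can be pictured as a handful of narrow interfaces driven by a placement loop. The names mirror the slide's components, but every signature below is hypothetical; DeCoRAM's real classes will differ:

```cpp
#include <cstddef>
#include <vector>

struct TaskSpec { double wcet, wcsst, period; int backups; };
struct Node { std::vector<TaskSpec> hosted; };

struct TaskReplicator {          // decides the order tasks are allocated in
  virtual std::vector<TaskSpec> order(std::vector<TaskSpec> t) = 0;
  virtual ~TaskReplicator() = default;
};
struct NodeSelector {            // decides which node to check next
  virtual std::size_t next(const std::vector<Node>& nodes,
                           std::size_t attempt) = 0;
  virtual ~NodeSelector() = default;
};
struct AdmissionController {     // applies the look-ahead feasibility test
  virtual bool admit(const TaskSpec& t, const Node& n) = 0;
  virtual ~AdmissionController() = default;
};

// Placement Controller: drives the other components until every task and
// each of its backups is placed, opening fresh nodes on demand.
void place_all(const std::vector<TaskSpec>& tasks, std::vector<Node>& nodes,
               TaskReplicator& rep, NodeSelector& sel,
               AdmissionController& adm) {
  for (const TaskSpec& t : rep.order(tasks)) {
    for (int r = 0; r <= t.backups; ++r) {   // primary + K backups
      bool placed = false;
      for (std::size_t k = 0; k < nodes.size() && !placed; ++k) {
        Node& n = nodes[sel.next(nodes, k)]; // selector returns a valid index
        if (adm.admit(t, n)) { n.hosted.push_back(t); placed = true; }
      }
      if (!placed) nodes.push_back(Node{{t}});
    }
  }
}
```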
Summary of DeCoRAM Contributions
• The DeCoRAM allocation algorithm reduces the number of resources used via clever resource overbooking with backup replicas.
• The DeCoRAM allocation engine can execute many different allocation algorithms.
• The DeCoRAM D&C engine requires a concrete bridge implemented for the underlying middleware => the cost is amortized over the number of uses.
• Existing fault-tolerant middleware runtimes can leverage DeCoRAM decisions: for closed DRE systems, the runtime can be very simple and obey all the decisions determined at design time; for open DRE systems, the runtime can use DeCoRAM's results for the initial deployment.
• www.dre.vanderbilt.edu/CIAO

Contributions within the Lifecycle of DRE Systems: Run-time
• FLARe, an adaptive middleware for RT+FT.

Resolving Challenges 2 & 4: FLARe
FLARe = Fault-tolerant, Load-aware and Adaptive middlewaRe.
Key ideas
• Load-Aware Adaptive Failover (LAAF) target selection: load-aware, to maintain the desired soft real-time performance after recovery; adaptive, to handle dynamic load due to workload changes and multiple failures.
• Resource Overload Management and rEdirection (ROME): maintain soft real-time performance during overloads.
• Failure model: multiple processor/process failures; fail-stop.
• Replication model: passive replication with asynchronous state updates.
• Implemented on top of the TAO Real-time CORBA middleware.

Middleware Architecture
• Client Failover Manager: catches processor/process failure exceptions and redirects clients to failover targets.
• Monitors: periodically monitor the liveness and CPU utilization of each processor.
• Replication Manager: collects system utilizations from the monitors, calculates a ranked list of failover targets using LAAF, proactively updates the client side with the ranked list, and manages overloads using ROME.

Load-Aware Adaptive Failover (LAAF)
• Monitor the CPU utilization of each processor.
• Rank the backup processors based on load.
• Distribute the failover targets of objects residing on the same processor, to avoid overload after a processor failure.
• Proactively update the clients.

Resource Overload Management & rEdirection (ROME)
• Overloads can occur due to multiple processor failures.
• Under soft real-time requirements, overloads are treated as failures: the clients of high-utilization objects are redirected to backups on lightly loaded processors.
• This distributes overloads across multiple processors.

Experiment Setup
• Linux clusters at ISISLab.
• 6 clients; 2 of them (CL-5 & CL-6) are dynamic clients that start after 50 seconds.
• 6 different servers, each with 2 replicas; each server consumes some CPU load.
• Each experiment ran for 300 seconds, with rate-monotonic scheduling on each processor.

Experiment Configurations
• Static failover strategy: each client knows the order in which it accesses the server replicas in the presence of failures, i.e., the failover targets are known in advance. This strategy is optimal at deployment time.

LAAF Algorithm Results
• At 50 seconds, dynamic loads are introduced; at 150 seconds, failures are introduced.
• The static strategy drives CPU utilizations to 90% and 80%, which could cause system crashes.
• LAAF modifies the failover targets at 50 seconds, preventing overloads when the failure occurs by choosing different failover targets.
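A minimal sketch of the LAAF idea: rank candidate processors by measured CPU utilization and spread the failover targets of co-located objects across them, so a single failure does not dump all of its objects on one backup host. Names are hypothetical, and for brevity the sketch ignores the constraint that a target must already host a backup replica of the object:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Host { std::string name; double cpu_util; };

// For the objects co-located on one (potentially failing) processor,
// choose failover targets from the least-loaded candidate hosts,
// round-robin, so the objects are spread rather than piled onto the
// single lightest host.
std::map<std::string, std::string>
laaf_targets(std::vector<Host> candidates,
             const std::vector<std::string>& colocated_objects) {
  std::sort(candidates.begin(), candidates.end(),
            [](const Host& a, const Host& b) {
              return a.cpu_util < b.cpu_util;
            });
  std::map<std::string, std::string> target;
  for (std::size_t i = 0; i < colocated_objects.size(); ++i)
    target[colocated_objects[i]] = candidates[i % candidates.size()].name;
  return target;
}
```

The Replication Manager would recompute such a ranked list whenever the monitors report changed utilizations (e.g., at the 50-second mark in the experiment) and push it to the clients proactively.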
Contributions within the Lifecycle of DRE Systems: Run-time
• The group-failover protocol to handle orphan requests.

Resolving Challenges 3 & 4: Group Failover
Enforcing determinism
• Point solutions compensate for specific sources of non-determinism, e.g., thread scheduling and mutual exclusion.
• Compensation relies on semi-automated program analysis; humans must rectify whatever cannot be compensated automatically.

Unresolved Challenges: End-to-end Reliability of Non-deterministic Stateful Components
• Integrating replication & transactions is applicable to multi-tier transactional web-based systems only.
• Overhead of transactions in the fault-free situation: messaging overhead in the critical path (e.g., create, join) and a two-phase commit (2PC) protocol at the end of each invocation. (Figure: a transaction manager issuing create/join calls for components A, B, C, D invoked by a client.)
• Overhead of transactions in the faulty situation: the system must roll back to avoid orphan state, then re-execute & run 2PC again upon recovery.
• Transactional semantics are not transparent: developers must implement prepare, commit, and rollback.
• (Figure: state updates flowing from A to B, C, D; without this machinery the potential orphan state keeps growing, with it the orphan state is bounded in B, C, D.)

Solution: The Group-failover Protocol
• Orphan state is bounded within a group of components: upon a failure anywhere in the operational string A-B-C-D, the whole group fails over to its passive replicas A'-B'-C'-D'.
Protocol characteristics
1. Supports exactly-once execution semantics in the presence of nested invocations, non-deterministic stateful components, and passive replication.
2. Ensures state consistency of replicas.
3. Does not require intrusive changes to component implementations (no need to implement prepare, commit, & rollback).
4. Supports fast client failover that is insensitive to both the location of the failure in the operational string and the size of the operational string.

The Group-failover Protocol (1/3)
Constituents of the group-failover protocol (all operating in a timely fashion):
1. Accurate failure detection: a fault-monitoring infrastructure based on heartbeats, synthesized using model-to-model transformations in GRAFT.
2. Transparent failover, with several alternatives: client-side request interceptors (a CORBA standard), aspect-oriented programming (AOP), or fault-masking code generation using model-to-code transformations in GRAFT.
3. Identifying orphan components.
4. Eliminating orphan components.
5. Ensuring state consistency.
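As an illustration of transparent failover on the client side, here is a generic wrapper that retries an invocation on the next ranked replica when a connection fails. It stands in for a CORBA request interceptor; all names are hypothetical:

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Thrown by the transport when a replica is unreachable; stands in for
// a CORBA COMM_FAILURE system exception.
struct CommFailure : std::runtime_error {
  CommFailure() : std::runtime_error("connection lost") {}
};

// Fault-masking call: walk the ranked failover targets (primary first,
// then backups) until one answers. The application code above this
// wrapper never observes the failure.
template <typename Reply>
Reply masked_invoke(const std::vector<std::function<Reply()>>& targets) {
  for (std::size_t i = 0; i < targets.size(); ++i) {
    try {
      return targets[i]();   // invoke primary, then backups in rank order
    } catch (const CommFailure&) {
      // fall through to the next ranked failover target
    }
  }
  throw CommFailure();       // every replica was unreachable
}
```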
The Group-failover Protocol (2/3)
3. Identifying orphan components
• Without transactions, the run-time stage of a nested invocation is opaque.
• Strategies for determining the extent of the orphan group (statically):
  1. The whole operational string, potentially with non-isomorphic operational strings. This tolerates catastrophic faults (pool failure, network failure) as well as Bohrbugs. A Bohrbug repeats itself predictably when the same state reoccurs; Bohrbugs are prevented by reliability through diversity, i.e., non-isomorphic replication with a different implementation, structure, and QoS.
  2. Dataflow-aware component grouping.

The Group-failover Protocol (3/3)
4. Eliminating orphan components
• Use the deployment and configuration (D&C) infrastructure to invoke component life-cycle operations (e.g., activate, passivate).
• Passivation discards the application-specific state, and the component is no longer remotely addressable.
5. Ensuring state consistency
• Must assure exactly-once semantics: state must be transferred atomically.
• Strategies for state synchronization:

Strategy     Fault-free scenario   Faulty scenario (recovery)
Eager        Messaging overhead    No overhead
Lag-by-one   No overhead           Messaging overhead

Eager State Synchronization Strategy
• State synchronization happens in two explicit phases.
• Fault-free scenario messages: Finish, Precommit (phase 1), State transfer, Commit (phase 2).
• Faulty scenario: transparent failover only.

Lag-by-one State Synchronization Strategy
• No explicit phases.
• Fault-free scenario messages: lazy state transfer only.
• Faulty scenario messages: Prepare, Commit, then transparent failover.

Evaluation: Overhead of the State Synchronization Strategies
• Experiments on the CIAO middleware with 2 to 5 components.
• Eager state synchronization: insensitive to the number of components; concurrent state transfer using CORBA AMI (asynchronous messaging).
• Lag-by-one state synchronization: insensitive to the number of components; its fault-free overhead is less than the eager protocol's.

Evaluation: Client-perceived Failover Latency of the Synchronization Strategies
• The lag-by-one protocol has a (low) messaging overhead during failure recovery.
• The eager protocol has no overhead during failure recovery (jitter +/- 3%).

Relevant Publications
1. Group Failover, CBSE 2011 (to appear), Boulder, CO, June 2011.
2. DeCoRAM, IEEE RTAS 2010, Stockholm, Sweden, 2010.
3. Adaptive Failover for Real-time Middleware with Passive Replication, IEEE RTAS 2009.
4. Component Replication Based on Failover Units, IEEE RTCSA 2009.
5. Towards Middleware for Fault-tolerance in Distributed Real-time Embedded Systems, IFIP DAIS 2008.
6. FLARe: A Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed Real-time Embedded Systems, ACM Middleware Conference Doctoral Symposium (MDS 2007), 2007.
7. MDDPro: Model-Driven Dependability Provisioning in Enterprise Distributed Real-time & Embedded Systems, ISAS 2007.
8. A Framework for (Re)Deploying Components in Distributed Real-time Embedded Systems, ACM SAC 2006.
9. Middleware Support for Dynamic Component Updating, DOA 2005.
10. Model-driven QoS Provisioning for Distributed Real-time & Embedded Systems, under submission, IEEE Transactions on Software Engineering, 2009.
11. NetQoPE: A Model-driven Network QoS Provisioning Engine for Distributed Real-time & Embedded Systems, IEEE RTAS 2008.
12. Model-driven Middleware: A New Paradigm for Deploying & Provisioning Distributed Real-time & Embedded Applications, Elsevier Journal of Science & Computer Programming, 2008.
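To contrast where the two strategies place their messages, here is a condensed sketch; the Finish/Precommit/Commit/Prepare message names follow the slides, while the Component type and its send()/state() hooks are hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical component hook: each member of the operational string can
// export its state and receive protocol messages.
struct Component {
  std::string state() const { return state_; }
  void send(const std::string& msg) { log_.push_back(msg); }
  std::string state_;
  std::vector<std::string> log_;
};

// Eager strategy: two explicit phases on every fault-free invocation, so
// recovery itself needs no extra messages.
void eager_sync(std::vector<Component>& group,
                std::vector<Component>& replicas) {
  for (auto& c : group) c.send("Finish");
  for (auto& c : group) c.send("Precommit");        // phase 1
  for (std::size_t i = 0; i < group.size(); ++i)
    replicas[i].state_ = group[i].state();          // state transfer
  for (auto& c : group) c.send("Commit");           // phase 2
}

// Lag-by-one strategy: fault-free runs only transfer state lazily; the
// Prepare/Commit messages are deferred until failure recovery.
void lag_by_one_sync(std::vector<Component>& group,
                     std::vector<Component>& replicas) {
  for (std::size_t i = 0; i < group.size(); ++i)
    replicas[i].state_ = group[i].state();          // lazy state transfer
}
void lag_by_one_recover(std::vector<Component>& replicas) {
  for (auto& c : replicas) c.send("Prepare");       // paid only on failure
  for (auto& c : replicas) c.send("Commit");
}
```

The trade-off in the table falls directly out of this structure: eager pays its messaging on every fault-free invocation and fails over for free, while lag-by-one is nearly free fault-free and pays a small Prepare/Commit exchange during recovery.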
13. DAnCE: A QoS-enabled Deployment & Configuration Engine, CD 2005.

Concluding Remarks & Future Work
• Satisfying multiple QoS properties simultaneously in DRE systems is hard; resource constraints and fluctuating workloads/operating conditions make the problem even harder.
• The DOC Group at Vanderbilt/ISIS has made significant R&D contributions in this area; the technologies we have developed are part of our ACE/TAO/CIAO/DAnCE middleware suites (www.dre.vanderbilt.edu).
• Future work seeks to address issues in cyber-physical systems, which needs interdisciplinary expertise.