ORTEGA - PolyU


ORTEGA: An Efficient and Flexible Online
Fault Tolerance Architecture for Real-Time
Control Systems
Xue Liu, Qixin Wang, Sathish
Gopalakrishnan, Wenbo He, Lui Sha, Hui
Ding, Kihwal Lee
1
Outline






Motivation and related work
ORTEGA goals
ORTEGA architecture
Details of ORTEGA designs
Implementation and evaluation
Demo
2
Motivations

Cyber-Physical Systems



Real-world systems involve not only computer
science but also knowledge from various other disciplines.
As the computer system becomes more complex,
the complexity of the integrated system (i.e., the cyber-physical system) grows even faster.
Major challenge: how can engineers of drastically
different backgrounds collaborate with each other?
3
Motivations

Control Systems

Conventional analog control systems
$\dot{x} = Ax + Bu$, $u = -Kx$
Digital control systems
$x(kh + h) = e^{Ah} x(kh) + \int_0^h e^{As}\,ds\, B u(kh)$, $u(kh) = -Kx(kh)$

Computer Systems

Real-time scheduling
Fault tolerance
Reliable/online software upgrade

We need to design a framework so that
computer engineers and control engineers can
easily collaborate and integrate their knowledge
4
Motivations

Control Systems

Conventional analog control systems
$\dot{x} = Ax + Bu$, $u = -Kx$
Digital control systems
$x(k+1) = Fx(k) + Gu(k)$, $u(k) = -Kx(k)$

Computer Systems

Real-time scheduling
Fault tolerance
Reliable/online software upgrade

We need to design a framework so that
computer engineers and control engineers can
easily collaborate and integrate their knowledge
5
Related work: Simplex architecture

Demand:

Low cost development of upgraded control
systems for mission critical control applications




Instead of multi-versioning, develop just one version
Focus on the control theories
Runtime upgrade/testing of the single-version
(possibly buggy) new system
Applications:

Aircraft control (F-16, Seto et al., 2000)
Submarine control (NSSN, the new attack
submarine program of the US Navy)
6
Simplex for real-time control
[Figure: Simplex Architecture. A decision module selects which of two controllers drives the plant: the simple high-assurance control subsystem (HAC) or the complex high-performance control subsystem (HPC).]
7
Simplex for real-time control
Given an LTI control system:
$\dot{x} = \bar{A}x + Bu = \bar{A}x - BKx = Ax$
The above LTI control system is stable iff there exists a $P > 0$
such that the Lyapunov function $V(x) = x^T P x$ satisfies
$\dot{V}(x) = x^T (A^T P + PA) x < 0$ for all $x \neq 0$
8
Simplex for real-time control
[Figure: state constraints, Lyapunov functions, stability regions, and the maximum stability region (recovery region).]
We can choose a smaller solution ellipsoid (i.e. $x^T P x < x^T P_{\max} x$) to leave
margins to guard against model/actuator/measurement errors.
9
Drawbacks of Simplex

P1: Lack of Efficiency

Analytically redundant high assurance controller (HAC)
runs in parallel with complex controller (HPC)



Lowers system performance and increases operating costs
Limits the application of Simplex to safety-critical
domains only
P2: Lack of Flexibility

Enforces the same execution period on HAC and HPC


In practice, different controllers may use different periods for
different performance considerations
For example: fast HAC recovery
10
Design goals of ORTEGA

On-demand Real-TimE GuArd (ORTEGA)
A new efficient fault tolerance software architecture
designed for real-time control systems

More efficient resource usage (P1)
Through on-demand real-time recovery

Flexible design (P2)
Allows HAC and HPC to run at different rates
Through new design and schedulability analysis
Applicable to a wider range of real-time control
systems
11
ORTEGA Architecture
12
On-demand execution of HAC


At any time, only one of the HAC and the HPC is
running to control the plant
The decision module (DM) uses a mutex semaphore to
control which of the HAC and the HPC is running

When the HPC is running well, the HAC blocks on the
semaphore;
Only when a fault is detected in the HPC does the DM
release the semaphore to allow the HAC to take over
Decision logic is based on stability regions


Determined through Linear Matrix Inequality theory
Details later
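The on-demand hand-over described on this slide can be sketched with a semaphore that starts at zero, so the HAC thread sleeps until the decision module releases it. This is a minimal illustration only: `run_demo`, `is_faulty`, and the log strings are hypothetical names, and a counting semaphore stands in for the mutex semaphore mentioned above.

```python
import threading

def run_demo(hpc_outputs, is_faulty):
    """Return the sequence of events, showing that the HAC runs only after a fault."""
    gate = threading.Semaphore(0)   # starts at 0, so the HAC blocks immediately
    state = {"fault": False}
    log = []

    def hac_task():
        gate.acquire()              # sleeps here while the HPC is healthy
        if state["fault"]:
            log.append("HAC takes over")

    hac = threading.Thread(target=hac_task)
    hac.start()

    for step, u in enumerate(hpc_outputs):
        if is_faulty(u):            # decision module detects a fault in the HPC output
            state["fault"] = True
            break
        log.append(f"HPC step {step}")

    gate.release()                  # on a fault this wakes the HAC; otherwise it lets the demo thread exit
    hac.join()
    return log

if __name__ == "__main__":
    print(run_demo([0.1, 0.2, 9.9, 0.3], lambda u: abs(u) > 5.0))
```

While the HPC behaves, the HAC consumes no CPU time because it is blocked in `acquire()`.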
13
CPU savings of ORTEGA
HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};
Pr: the percentage of time for recovery (HAC) during a total time of T
• Total CPU resource usage under Simplex
• Total CPU resource usage under ORTEGA
• CPU resource usage savings:
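The formulas for these three quantities did not survive in the transcript. The sketch below is one plausible reading, stated as an assumption rather than the authors' expressions: under Simplex both controllers run every period, so utilization is Cp/Tp + Ca/Ta; under ORTEGA the HAC runs only for the fraction Pr of the time and the HPC for the rest. All function names are hypothetical.

```python
def simplex_utilization(cp, tp, ca, ta):
    # Assumption: HPC and HAC both execute in every period.
    return cp / tp + ca / ta

def ortega_utilization(cp, tp, ca, ta, pr):
    # Assumption: the HAC runs a fraction pr of the time, the HPC otherwise.
    return (1.0 - pr) * cp / tp + pr * ca / ta

def relative_saving(cp, tp, ca, ta, pr):
    """Fraction of the Simplex CPU usage that ORTEGA avoids, under the assumptions above."""
    u_simplex = simplex_utilization(cp, tp, ca, ta)
    return (u_simplex - ortega_utilization(cp, tp, ca, ta, pr)) / u_simplex

if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the slides.
    print(relative_saving(cp=2.0, tp=20.0, ca=1.0, ta=20.0, pr=0.0))
```

Under this reading the saving is largest when recoveries are rare (Pr close to 0), which matches the qualitative claim that the HAC's cost is paid only on demand.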
14
No Free Lunch: An extra period of delay
[Figure: timelines under Simplex and under ORTEGA after a fault is detected in period k, marking the first time the HAC's control output takes effect in each case.]
An extra delay of up to $T_a$ is incurred due to the on-demand execution of the HAC.
15
Handle the extra delay by state projections
[Figure: the current state $x(t)$ at decision time $t$ and the projected state $x(t + T_a)$, both shown relative to the stability region.]
Resource usage reduction vs. extra delay:
(1) The extra delay causes disturbances when a fault occurs (infrequent).
(2) But the gain in resource usage is large.
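A minimal sketch of a projection-based decision follows, under assumptions the slides do not spell out: the projection is a one-step prediction with the current control output held over $T_a$, and the region test is $x^T P x \le 1$. The helper names and the numbers (a double integrator discretized at 0.1 with $P = I$) are illustrative only.

```python
def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def quad_form(p, x):
    """Return x^T P x."""
    return sum(xi * yi for xi, yi in zip(x, mat_vec(p, x)))

def project_state(f_bar, g, x, u):
    """Assumed projection: x(t + Ta) = F_bar x(t) + G u(t), with u held constant."""
    fx = mat_vec(f_bar, x)
    return [fxi + gi * u for fxi, gi in zip(fx, g)]

def should_switch_to_hac(f_bar, g, p, x, u):
    """Switch now if the projected state would leave {x | x^T P x <= 1}."""
    return quad_form(p, project_state(f_bar, g, x, u)) > 1.0

if __name__ == "__main__":
    F_BAR = [[1.0, 0.1], [0.0, 1.0]]   # illustrative open-loop model, Ta = 0.1
    G = [0.005, 0.1]
    P = [[1.0, 0.0], [0.0, 1.0]]
    print(should_switch_to_hac(F_BAR, G, P, [0.9, 0.3], 2.0))
```

In the usage example the current state is still inside the region, but its projection is not, so the switch happens one period early and absorbs the extra delay.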
16
Recovery region design
[Figure: state constraints, Lyapunov functions, stability regions, and the maximum stability region (recovery region).]
• The decision module uses the recovery region to determine when to
switch to the HAC
• The recovery region is defined as the maximum region in which the HAC
can make the plant stable
17
Determine recovery region (1)
Digital controllers:
$u(k) = -Kx(k)$
$x(k+1) = Fx(k)$ (*), where $F = \bar{F} - GK$
State constraints:
$\alpha_m^T x \le 1$, $m = 1, \ldots, q$. (1)
Stability region:
The discrete LTI control system is stable iff
there exists a $P > 0$ such that $F^T P F - P < 0$
18
Determine recovery region (1)
Digital controllers:
$u(k) = -Kx(k)$
$x(k+1) = Fx(k)$ (*), where $F = \bar{F} - GK$
State constraints:
$\alpha_m^T x \le 1$, $m = 1, \ldots, q$. (1)
Stability region:
The stability region of the system with respect to
$P$ is defined as
$\{x \mid x^T P x \le 1\}$
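To make the two conditions concrete, here is a small sketch for the 2-state case: it forms $F^T P F - P$, tests negative definiteness with Sylvester's criterion, and checks membership in $\{x \mid x^T P x \le 1\}$. The matrices are illustrative values, not taken from the slides, and a failed check only means that this particular $P$ is not a valid certificate.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lyapunov_difference(f, p):
    """Return M = F^T P F - P."""
    fpf = mat_mul(mat_mul(transpose(f), p), f)
    return [[fpf[i][j] - p[i][j] for j in range(len(p))] for i in range(len(p))]

def is_negative_definite_2x2(m):
    """Sylvester's criterion for a symmetric 2x2 matrix: m11 < 0 and det(M) > 0."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] < 0 and det > 0

def in_stability_region(p, x):
    """True if x^T P x <= 1."""
    px = [sum(a * b for a, b in zip(row, x)) for row in p]
    return sum(xi * pxi for xi, pxi in zip(x, px)) <= 1.0

if __name__ == "__main__":
    F = [[0.5, 0.1], [0.0, 0.8]]       # illustrative closed-loop matrix
    P = [[1.0, 0.0], [0.0, 1.0]]
    print(is_negative_definite_2x2(lyapunov_difference(F, P)))
    print(in_stability_region(P, [0.6, 0.6]))
```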
19
Determine recovery region (2)
[Figure: state constraints, Lyapunov functions, stability regions, and the maximum stability region (recovery region).]
Theorem: Determining the maximum stability region of the digitally
implemented closed-loop system with constraints (1) can be
transformed into the following MAXDET (LMI) problem.
Maximize $\log\det P^{-1}$ (area of recovery region)
subject to
$P > 0$
$F^T P F - P < 0$ (stability)
$\alpha_m^T P^{-1} \alpha_m \le 1$, $m = 1, \ldots, q$ (state constraints)
20
Recovery region vs. control loop period
Stability Index A(T): Area of the maximum stability region
• It is a function of the control loop period T. The smaller the control loop
period, the larger the maximum stability region.
Example: an inverted pendulum
System model: $\dot{x} = Ax + Bu$ with four states. Rows of $[A \mid B]$ with non-trivial entries (digits as extracted; decimal points and signs missing):
0 27528 109526 00043 | 19432
0 285812 249179 00441 | 44385
Controller: $u(k) = Kx(k)$ with $K$ = [57807, 422087, 140953, 86016] (digits as extracted; decimal points and signs missing)
[Figure: Stability Index (scale $\times 10^{-7}$) versus Control Loop Period, for periods from 0.005 to 0.035.]
The smaller the period, the larger the recovery region.
ORTEGA allows a larger recovery region (more flexible).
21
Implementation and evaluation
• Inverted pendulum from Quanser
• CPU: Pentium II 350MHz
• OS: Linux kernel 2.4.18-3 with RMS
• HAC: field tested state feedback controller
Evaluation of CPU savings
• If HAC and HPC both run at 50Hz, ORTEGA’s CPU saving is 29.29%
• If HAC runs at 50Hz, HPC runs at 20Hz, ORTEGA’s CPU saving is 50.87%
22
Evaluation of fault tolerance








Infinite loop bug
Non-performing bug
Maximum control output bug
Divide-by-zero bug
Bang-Bang type bug
Positive feedback bug
Tricky design bug
…
23
Evaluation of fault tolerance
24
Evaluation of fault tolerance
25
Thank You
Q&A
26
Backup Slides
27
Simplex: software engineering economics: the more
effort, the more reliable
Reliability: no failure happens during [0, t]
$R(t) = \exp(-\lambda t) = \exp(-kCt/E)$, where $\lambda$ is the failure rate, $C$ the complexity, and $E$ the effort
As a function of effort: $R(E) = \exp(-kCt/E)$
Simplex: comparison with N-version
$R_{3}(E) = R(E/3)^3 + 3R(E/3)^2 (1 - R(E/3))$
Simplex: roots in recovery block: only one version must
be correct
$R_{RB}(E) = 1 - (1 - R(E/3))^3$
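The comparison can be tried numerically with a short sketch. It assumes the reliability model $R(E) = \exp(-kCt/E)$ from the preceding slide, a 2-out-of-3 majority vote for the N-version case, and a perfect acceptance test for the recovery block; the function names and default parameters are hypothetical.

```python
import math

def reliability(effort, k=1.0, complexity=1.0, t=1.0):
    """R(E) = exp(-k C t / E): more effort gives higher reliability."""
    return math.exp(-k * complexity * t / effort)

def three_version_voting(effort, **params):
    """Effort split three ways; the system works if at least 2 of 3 versions work."""
    r = reliability(effort / 3.0, **params)
    return r ** 3 + 3 * r ** 2 * (1 - r)

def three_alternative_recovery_block(effort, **params):
    """Effort split three ways; the system works if any one alternative works."""
    r = reliability(effort / 3.0, **params)
    return 1 - (1 - r) ** 3

if __name__ == "__main__":
    for e in (1.0, 3.0, 10.0):
        print(e, reliability(e), three_version_voting(e),
              three_alternative_recovery_block(e))
```

For the same per-version reliability the recovery block is never worse than voting, since the difference works out to $3r(1-r)^2 \ge 0$.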
Simplex: recovery block: dividing into more alternatives
doesn't always gain.
Simplex: recovery block: reducing complexity gains
One alternative has complexity 1, ½, and 1/10
Simplex: a two-alternative recovery block with reduced
complexity wins if a reliable acceptance test is possible.
Simplex: recovery block: reducing complexity gains
RB2: 2 alternatives, same complexity C=1, perfect acceptance test
*RB2L5: 2 alternatives, C1=1, C2=1/5, imperfect acceptance test whose reliability equals that of alternative 2
Schedulability analysis of ORTEGA
35
Mode-Change Problem Incurred by Recovery
Example: Suppose one plant has an HPC task $\tau_{1p}$: $(C_{1p}, T_{1p}) = (3, 5)$ and an HAC task $\tau_{1a}$: $(C_{1a}, T_{1a}) = (4, 10)$,
together with another real-time task $\tau_2$: $(C_2, T_2) = (6, 15)$.
• Before the recovery at t=10, $\{\tau_{1p}, \tau_2\} = \{(3,5), (6,15)\}$ is schedulable;
• After the recovery transition, $\{\tau_{1a}, \tau_2\} = \{(4,10), (6,15)\}$ is also schedulable;
• However, during the transition of recovery, $\tau_2$ misses its deadline at t=15!
[Figure: schedules of $\tau_{1p}$ switching to $\tau_{1a}$ ((3,5) -> (4,10)) and of $\tau_2$ (6,15) over t = 0 to 20, showing the mode change incurred by recovery and the missed deadline of $\tau_2$.]
Mode change in fixed-priority scheduling is recognized as a difficult
problem by the real-time community
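The missed deadline in this example can be reproduced with a small unit-step simulation of preemptive fixed-priority scheduling. The sketch assumes rate-monotonic priorities (shorter period wins), deadlines equal to periods, and that the HAC job is released at t=10 exactly when the HPC task is dropped; the function names are hypothetical.

```python
def periodic_jobs(name, wcet, period, start, end):
    """Jobs of a periodic task released in [start, end); deadline = release + period."""
    return [{"name": f"{name}@{r}", "release": r, "wcet": wcet,
             "deadline": r + period, "period": period}
            for r in range(start, end, period)]

def missed_deadlines(jobs, horizon):
    """Simulate one time unit per step and return the jobs that finish late."""
    remaining = {j["name"]: j["wcet"] for j in jobs}
    finish = {}
    for t in range(horizon):
        ready = [j for j in jobs
                 if j["release"] <= t and remaining[j["name"]] > 0]
        if not ready:
            continue
        job = min(ready, key=lambda j: j["period"])   # rate-monotonic priority
        remaining[job["name"]] -= 1
        if remaining[job["name"]] == 0:
            finish[job["name"]] = t + 1
    return [j["name"] for j in jobs
            if j["deadline"] <= horizon
            and finish.get(j["name"], horizon + 1) > j["deadline"]]

if __name__ == "__main__":
    old_mode = periodic_jobs("tau1p", 3, 5, 0, 15) + periodic_jobs("tau2", 6, 15, 0, 15)
    new_mode = periodic_jobs("tau1a", 4, 10, 0, 20) + periodic_jobs("tau2", 6, 15, 0, 15)
    transition = (periodic_jobs("tau1p", 3, 5, 0, 10)
                  + periodic_jobs("tau1a", 4, 10, 10, 20)
                  + periodic_jobs("tau2", 6, 15, 0, 15))
    print(missed_deadlines(old_mode, 15))
    print(missed_deadlines(new_mode, 20))
    print(missed_deadlines(transition, 20))
```

Each steady-state task set meets all deadlines, while the transition leaves the lower-priority task one unit short at t=15.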
36
Schedulability Analysis
Schedulability Analysis: We adopt the work by Real and Crespo (2004)
Idea: Analyze the transitional scheduling overhead incurred by the recovery.
(I) Schedulability analysis of steady state task set
(II) Schedulability analysis of old-mode tasks with transitional
scheduling overhead (due to the mode change)
(III) Schedulability analysis of new-mode tasks with transitional
scheduling overhead (due to the mode change)
[Figure: jobs of $\tau_{kp}$ followed by jobs of $\tau_{ka}$ for task $\tau_k$, and the workload function $w_i(x)$ of task $\tau_i$.]
37
Fault Tolerance and Scheduling Co-design
-- one FT-enabled task case
Maximize the recovery region subject to the schedulability constraint.
Find the smallest (optimal) control loop period $T_{ka}^*$ such that the task set is schedulable under random recoveries.
Given the schedulability test, we can use a binary search algorithm to find $T_{ka}^*$.
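The search itself can be sketched as follows, assuming integer candidate periods and a test that is monotone (once a period is schedulable, every longer period is too). The test used in the example is a stand-in based on the Liu and Layland utilization bound for two tasks, not the mode-change analysis adopted on the previous slide; all names are hypothetical.

```python
def smallest_schedulable_period(is_schedulable, lo, hi):
    """Smallest integer T in [lo, hi] with is_schedulable(T) true, or None.

    Assumes monotonicity: if T is schedulable, so is every larger T.
    """
    if not is_schedulable(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_schedulable(mid):
            hi = mid            # mid works, so the answer is mid or smaller
        else:
            lo = mid + 1        # mid fails, so the answer is larger
    return lo

def utilization_test(period):
    """Stand-in test: HAC with C=4 plus a task (6, 15) against the 2-task bound."""
    bound = 2 * (2 ** 0.5 - 1)
    return 4 / period + 6 / 15 <= bound

if __name__ == "__main__":
    print(smallest_schedulable_period(utilization_test, 4, 40))
```

Because a smaller HAC period gives a larger recovery region, the smallest period that passes the test is the one that maximizes the region under the schedulability constraint.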
38
P2: Recovery Region for Digital Controllers
Sampling time $h$, zero-order hold:
$\bar{F}(h) = e^{\bar{A}h}$, $G(h) = \int_0^h e^{\bar{A}s}\,ds\, B$
Controller: $u(k) = -Kx(k)$
$x(k+1) = Fx(k)$, where $F = \bar{F} - GK$
Theorem (Lyapunov): A discrete-time LTI system shown above is
stable iff there exists a matrix $P > 0$ such that
$F^T P F - P < 0$.
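As an illustration of the zero-order-hold discretization, the sketch below evaluates the matrix exponential and its integral with a truncated power series. This is a simple stand-in chosen to keep the example dependency-free, not necessarily how the authors computed it; the double-integrator plant and the function names are assumptions.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def zoh_discretize(a, b, h, terms=20):
    """Return (F_bar, G) with F_bar = e^{Ah} and G = (integral_0^h e^{As} ds) B.

    Uses e^{Ah} = sum_k (Ah)^k / k! and
    integral_0^h e^{As} ds = sum_k A^k h^(k+1) / (k+1)!.
    """
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    f_bar = [row[:] for row in identity]
    integral = [[h * v for v in row] for row in identity]   # k = 0 term
    term = [row[:] for row in identity]
    for k in range(1, terms):
        term = [[v * h / k for v in row] for row in mat_mul(term, a)]  # (Ah)^k / k!
        for i in range(n):
            for j in range(n):
                f_bar[i][j] += term[i][j]
                integral[i][j] += term[i][j] * h / (k + 1)
    g = [sum(integral[i][j] * b[j] for j in range(n)) for i in range(n)]
    return f_bar, g

if __name__ == "__main__":
    # Double integrator: position and velocity driven by an acceleration input.
    F_bar, G = zoh_discretize([[0.0, 1.0], [0.0, 0.0]], [0.0, 1.0], h=0.1)
    print(F_bar, G)
```

For the double integrator the series terminates after the first-order term, giving $\bar{F} = [[1, h], [0, 1]]$ and $G = [h^2/2, h]$.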
39
Stability Region (Continued)
The stability region of the system $x(k+1) = Fx(k)$
with respect to $P$ is defined as:
$\{x \mid x^T P x \le 1\}$
Stability Region with Constraints
State constraints: $a_i^T x \le 1$, $i = 1, \ldots, l$
Control input constraints: $b_j^T u \le 1$, $j = 1, \ldots, r$
These can be combined in the closed-loop system as
$\alpha_m^T x \le 1$, $m = 1, \ldots, q$. (1)
Lemma: The stability region defined above satisfies
constraints (1) iff $\alpha_m^T P^{-1} \alpha_m \le 1$, $m = 1, \ldots, q$.
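The lemma gives a direct numerical check, sketched here for the 2-state case with an explicit 2x2 inverse. The symbol $\alpha_m$ for the combined constraint vectors, the function names, and the example values are assumptions for illustration.

```python
def inverse_2x2(p):
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    return [[p[1][1] / det, -p[0][1] / det],
            [-p[1][0] / det, p[0][0] / det]]

def ellipsoid_respects(p, alpha):
    """True if {x | x^T P x <= 1} lies inside the half-space alpha^T x <= 1."""
    p_inv = inverse_2x2(p)
    v = [sum(a * b for a, b in zip(row, alpha)) for row in p_inv]
    return sum(a * b for a, b in zip(alpha, v)) <= 1.0

def region_satisfies_constraints(p, alphas):
    return all(ellipsoid_respects(p, alpha) for alpha in alphas)

if __name__ == "__main__":
    P = [[4.0, 0.0], [0.0, 1.0]]   # ellipse with semi-axes 0.5 and 1
    print(region_satisfies_constraints(P, [[1.0, 0.0]]))   # x1 <= 1
    print(region_satisfies_constraints(P, [[0.0, 2.0]]))   # x2 <= 0.5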
40