Reliable Software for Real-Time Mission Critical Systems
Chris Turiano
Mesa State College
Grand Junction, CO 81501
Some Quick Definitions
Mission Critical
Items (data, software, etc.) determined to be vital to operational readiness or
mission effectiveness in terms of both content and timeliness; they must be
absolutely accurate and available on demand
Failure
when a system does not conform to its specified behavior
Fault
internal or algorithmic cause of a failure
Real-time system
a system that produces a time-constrained output into the physical world based on
input from the same physical world
Hard real-time system
system failure occurs for every missed deadline
Soft real-time system
system is tolerant of missed deadlines
Reliability
a quantitative measure of how well a system conforms to its specified behavior
System Design
System design (planning) is crucial to reliability
Incorporate
– High-end design specifications
Choices
– Hardware, Software, IDE, OS
– Fault Prevention
Testing
Reviews, program analysis tools
– Fault Tolerance
Error handling
Error recovery
Design Specification
• Hardware
– Mil-spec vs. cost
– Single expensive unit vs. Redundant cheap unit
• Software
– Language
• Data-abstraction
• Modularity
• Strongly typed
– Strict choice of timing requirements
• Integrated Development Environments
– Help prevent typographical errors and reduce complexity
• Operating System
– Support for concurrency
– Top level error detection/recovery
Fault Prevention
– Begins during design specification
Appropriate specifications
Fault Avoidance techniques
Help prevent introduction of faults into system
Well established techniques
Strongly typed languages
Data abstraction and modularity
Tools
Design reviews
Program-verification software
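As a rough illustration of how a strongly typed language helps keep faults out, the sketch below uses two distinct integer types. The names Amps, Volts, Request and Typing_Demo are invented for this example and are not taken from the talk's code.

```ada
with Ada.Text_IO;

procedure Typing_Demo is
   --  Distinct types: the compiler will not let one be used as the other.
   type Amps  is range 0 .. 50;
   type Volts is range 0 .. 36;

   Supply  : constant Volts := 28;
   Current : Amps := 12;
   Request : Integer := 75;   --  stands in for an unchecked external value
begin
   --  Current := Supply;     --  rejected at compile time: Volts is not Amps
   Ada.Text_IO.Put_Line ("Supply =" & Volts'Image (Supply));

   --  An explicit conversion is allowed, but it is range-checked at run time.
   Current := Amps (Request);
   Ada.Text_IO.Put_Line ("Current =" & Amps'Image (Current));
exception
   when Constraint_Error =>
      Ada.Text_IO.Put_Line ("Out-of-range amperage rejected");
end Typing_Demo;
```

The commented-out assignment is the kind of mistake caught before the program ever runs; the out-of-range conversion is caught by a run-time check instead of silently corrupting Current.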
Fault Removal
– Testing, testing, testing
the best method to improve reliability
exhaustive system testing is impossible
– difficulty proving correctness
– adequate realism
– incorrect early assumptions
Unit Testing vs. System Testing
Fault Tolerance
• 3 levels
– Full fault tolerance
• All functionality in the presence of faults (perhaps for a limited time)
– Graceful degradation
• Less functionality until repair of the fault
– Fail safe
• Records a prior safe state so it can return to that state after shutdown
– Allow regression
• Data recovery
• System restart may restore functionality
Add-on components
– To detect, handle and recover from faults
– Redundancy vs. complexity
• Balance redundancy against the risk of introducing new faults and against physical restrictions
• Separate the support components from the potentially faulty components
Fault Tolerance (cont.)
Dynamic Redundancy
– Error detection system
“exception” block in Ada95
– 4 stages
Error detection
Error diagnosis
Error recovery
Fault treatment
Static Redundancy
– Masking errors by comparing the output of multiple components
An average or a majority-rule algorithm to select the correct value
n-version programming
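The majority-rule selection mentioned above can be sketched as follows. This is only a minimal illustration: the three "versions" are represented by three hard-coded results, and the names Majority, Reading_Array and No_Majority are assumptions made for the example.

```ada
with Ada.Text_IO;

procedure Voter_Demo is

   type Reading_Array is array (Positive range <>) of Integer;

   No_Majority : exception;

   --  Return the value reported by more than half of the versions,
   --  or raise No_Majority when no such value exists.
   function Majority (Readings : Reading_Array) return Integer is
      Count : Natural;
   begin
      for I in Readings'Range loop
         Count := 0;
         for J in Readings'Range loop
            if Readings (J) = Readings (I) then
               Count := Count + 1;
            end if;
         end loop;
         if 2 * Count > Readings'Length then
            return Readings (I);
         end if;
      end loop;
      raise No_Majority;
   end Majority;

begin
   --  Three hypothetical versions; the second one is faulty and is outvoted.
   Ada.Text_IO.Put_Line (Integer'Image (Majority ((42, 17, 42))));
   --  No two versions agree, so the error cannot be masked.
   Ada.Text_IO.Put_Line (Integer'Image (Majority ((1, 2, 3))));
exception
   when No_Majority =>
      Ada.Text_IO.Put_Line ("No majority: error cannot be masked");
end Voter_Demo;
```

The first call masks the single faulty result; the second shows the limit of static redundancy, where the voter can only report that it has no trustworthy answer.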
Error Detection
• Error Detection
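with Ada.Text_IO;
with Ada.Exceptions; use Ada.Exceptions;

package body SomeThing is
   procedure SomeProc (param : out Integer) is
      ProcException : exception;
   begin
      -- <the internal workings of SomeProc>
   exception
      -- A specific, anticipated error goes to its own handler
      when ProcException =>
         ProcExceptionHandler (param);
      -- Anything else is reported with its diagnostic information
      when anError : others =>
         Ada.Text_IO.Put_Line (Exception_Information (anError));
   end SomeProc;
end SomeThing;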
– Levels of abstraction
– Propagation
– Error detection occurs everywhere but diagnosis/recovery occur where the
specific error is best handled
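Propagation across levels of abstraction might look like the sketch below: the lowest level detects the error, the middle level has no handler, and the top level recovers. Read_Amps, Needs_Charge and Sensor_Failure are hypothetical names chosen for this example.

```ada
with Ada.Text_IO;
with Ada.Exceptions;

procedure Propagation_Demo is

   Sensor_Failure : exception;

   --  Lowest level: detects the error but cannot repair it.
   function Read_Amps (Raw : Integer) return Natural is
   begin
      if Raw < 0 then
         raise Sensor_Failure;
      end if;
      return Raw;
   end Read_Amps;

   --  Middle level: no handler here, so the exception propagates upward.
   function Needs_Charge (Raw : Integer) return Boolean is
   begin
      return Read_Amps (Raw) < 10;
   end Needs_Charge;

begin
   if Needs_Charge (-1) then
      Ada.Text_IO.Put_Line ("Extend array");
   end if;
exception
   --  Top level: the place that knows what recovery means in this context.
   when E : Sensor_Failure =>
      Ada.Text_IO.Put_Line
        ("Recovering from " & Ada.Exceptions.Exception_Name (E));
end Propagation_Demo;
```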
Error Detection
Shortcomings
– Stack corrupting errors
The pointers to the error handling code and to the parent are lost
– Infinite loop
An exception never fires, so the infinite loop continues
Necessitates other mechanisms
Timeliness of interaction
– Detecting deadline errors
Handles infinite loop and stack corruption errors
RTOS scheduling
– Inherent part of the OS
Watchdog Scheduler
– Analogy
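One language-level way to catch a missed deadline is Ada 95's asynchronous transfer of control, sketched below. This is not the watchdog scheduler used in the talk, only a small stand-in for the same idea; the 0.5 second deadline and the looping "work" are assumptions made for the example.

```ada
with Ada.Text_IO;

procedure Deadline_Demo is
   Timed_Out : Boolean := False;
begin
   select
      delay 0.5;                    --  the deadline, in seconds
      Timed_Out := True;            --  runs only if the deadline expires
   then abort
      loop                          --  stands in for a runaway computation
         delay 0.01;
      end loop;
   end select;

   if Timed_Out then
      Ada.Text_IO.Put_Line ("Deadline missed: starting recovery");
   end if;
end Deadline_Demo;
```

No exception is ever raised inside the loop, yet control still returns to the code after the select statement once the deadline passes, which is exactly the case an exception handler alone cannot cover.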
Diagnosis and Recovery
• Error diagnosis
– Preventing spread of erroneous data
– Examining components where error could have spread
• Object-oriented design naturally limits the interactivity between components
– Access from outside the object is well defined
– Object hides its data
– All access to shared resources indivisible
• Error Recovery
– Roll participating threads to a known point
• Forward error recovery
– Predetermined checkpoints are written into the code
• Backward error recovery
– A “good” state is saved as threads progress
– Infinite loop
Threads enter an atomic action and each creates a recovery point; then an
error occurs, rolling all threads back to the same state they just left
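Backward error recovery in its simplest, single-threaded form can be sketched as saving a known good state and restoring it when an operation fails. Array_State, Extend and Move_Failed are hypothetical names invented for this illustration.

```ada
with Ada.Text_IO;

procedure Rollback_Demo is

   type Array_State is record
      Extension : Natural := 0;     --  percent extended
      Moves     : Natural := 0;     --  number of moves attempted
   end record;

   Move_Failed : exception;

   State      : Array_State;
   Checkpoint : Array_State;

   procedure Extend (By : Natural) is
   begin
      State.Moves     := State.Moves + 1;
      State.Extension := State.Extension + By;
      if State.Extension > 100 then
         raise Move_Failed;         --  State is now inconsistent
      end if;
   end Extend;

begin
   Extend (60);
   Checkpoint := State;             --  save a known good state
   begin
      Extend (60);                  --  pushes Extension to 120 and fails
   exception
      when Move_Failed =>
         State := Checkpoint;       --  roll back to the saved state
   end;
   Ada.Text_IO.Put_Line ("Extension =" & Natural'Image (State.Extension));
end Rollback_Demo;
```

The program prints `Extension = 60`. If the same failing operation were simply retried after the rollback, it would fail again in the same way, which is the infinite-loop hazard noted above.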
A Little Real-Time System in Ada95
Imagine a satellite with an array of solar cells that extends or retracts
based on the amperage of the battery it recharges.
User’s keyboard controller
– A task accepting input to alter states
Amperage Sensor
– A task which senses amps
System Controller
– A task which decides to extend or retract array
Watchdog Scheduler
– A task which handles problems with the timeliness of extension or retraction of the array
Motor Controller
– A task which extends or retracts array using a protected object
Ada 95’s Task
TASK SensorTask IS
   PRAGMA Priority(SensorPriority);
END SensorTask;

TASK BODY SensorTask IS
   T      : Ada.Real_Time.Time;
   MyWait : Ada.Real_Time.Time_Span := SensorPeriod;
   Sensor : CurrentSensor.Object;
   Span   : Time_Span := Milliseconds (5000);
   eid    : EventID;  -- handle of the pending watchdog timer event
BEGIN
   theWindows.WindowInitSensor;
   T := Ada.Real_Time.Clock + MyWait;
   LOOP
      -- Register a deadline with the watchdog before doing the work
      eid := TheWatchDogScheduler.AddTimerEvent(TimerEvent.MakeEvent(newPid =>
         Current_Task, newID => SensorID, newSpan => Span,
         newErrHan => ErrorHandler.Object(TheSensorErrorHandler)));
      TheSensor.SetAmpsRandom;
      -- The work finished in time, so cancel the deadline
      TheWatchDogScheduler.RemoveTimerEvent(eid);
      -- Sleep until the start of the next period
      DELAY UNTIL T;
      T := T + MyWait;
   END LOOP;
EXCEPTION
   WHEN AnError : OTHERS =>
      OutPoint := (x => 1, y => 7);
      theWindows.WriteString(3, OutPoint, Exception_Information(AnError));
END SensorTask;
Ada 95’s Design
History
Built-in Fault Prevention
Built-in mechanisms to aid reliability
– Very strongly typed
– Modular and Object-oriented
– Exception mechanism
Raise <exception name>
– Delays
– Tasks and Protected Objects
Readers/writer (analogy)
Entry
– Queues
– Barriers/Guards
Accept
Select…
– When <condition>
– Or…Delay/Abort/Terminate
– Else…
Requeue
Priority
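The entry, barrier and select features listed above can be seen together in a small sketch: a protected object whose entries are guarded by barriers, and a timed entry call that gives up after a delay. Amps_Box, Sensor and the one-second timeout are assumptions made for this example, not parts of the satellite program.

```ada
with Ada.Text_IO;

procedure Barrier_Demo is

   --  One-slot mailbox; the barriers make callers queue until it is safe.
   protected Amps_Box is
      entry Put (Value : in Natural);
      entry Get (Value : out Natural);
   private
      Slot : Natural := 0;
      Full : Boolean := False;
   end Amps_Box;

   Amps : Natural;

   task Sensor;

   protected body Amps_Box is
      entry Put (Value : in Natural) when not Full is
      begin
         Slot := Value;
         Full := True;
      end Put;

      entry Get (Value : out Natural) when Full is
      begin
         Value := Slot;
         Full  := False;
      end Get;
   end Amps_Box;

   task body Sensor is
   begin
      for Reading in 1 .. 3 loop
         Amps_Box.Put (Reading);    --  waits while the slot is still full
      end loop;
   end Sensor;

begin
   loop
      select
         Amps_Box.Get (Amps);       --  waits while the slot is empty
         Ada.Text_IO.Put_Line ("Amps:" & Natural'Image (Amps));
      or
         delay 1.0;                 --  timed entry call: give up after 1 s
         Ada.Text_IO.Put_Line ("No reading within 1 s, stopping");
         exit;
      end select;
   end loop;
end Barrier_Demo;
```

All access to Slot and Full goes through the protected object, so each access is indivisible without any explicit locking in the tasks themselves.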
Ada 95’s Error Detection
PROCEDURE InitWindow (newWindow : IN OUT Window.Object) IS
BEGIN
   -- <statements of procedure removed due to space limitations>
EXCEPTION
   WHEN WindowOutofBoundsX =>
      Movecursor(2, 23);
      Put_Line(Item => "The window width is too large: recalling InitWindow with smaller box.");
      newWindow.deltaX := newWindow.deltaX - 1;
      InitWindow(newWindow => newWindow);
   WHEN WindowOutofBoundsY =>
      Movecursor(2, 23);
      Put_Line(Item => "The window height is too large: recalling InitWindow with smaller box.");
      newWindow.deltaY := newWindow.deltaY - 1;
      InitWindow(newWindow => newWindow);
   WHEN Another : OTHERS =>
      Movecursor(2, 23);
      Put_Line(Item => Exception_Information(Another));
END InitWindow;
Ada 95’s Error Recovery
Watchdog Scheduler
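TASK BODY WatchDog IS
   -- TE, Exists and MyTime are assumed to be declared in an enclosing scope
BEGIN
   LOOP
      -- Fetch the timer event at the head of the queue, if there is one
      TheWatchDogScheduler.GetTopItem(TE, Exists);
      IF Exists THEN
         MyTime := Clock;
         -- The deadline has passed, so run the matching recovery procedure
         IF TE.ExpTime <= MyTime THEN
            CASE TE.ID IS
               WHEN 1 => TheWatchDogScheduler.SolarArrayErrorRecoveryProcedure(TE);
               WHEN 2 => TheWatchDogScheduler.SystemControllerErrorRecoveryProcedure(TE);
               WHEN OTHERS => TheWatchDogScheduler.CurrentSensorRecoveryProcedure(TE);
            END CASE;
         END IF;
      END IF;
      DELAY 0.1;  -- poll roughly ten times per second
   END LOOP;
END WatchDog;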
Ada 95’s Error Recovery
• Select…
– Requeue/Delay
– Abort/Terminate
• Exception handling
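A selective accept with a delay alternative is one way a task can notice that a partner has gone silent and begin recovery. The sketch below is only an illustration; the Monitor task, its Heartbeat entry and the timing values are hypothetical.

```ada
with Ada.Text_IO;

procedure Heartbeat_Demo is

   task Monitor is
      entry Heartbeat;
   end Monitor;

   task body Monitor is
   begin
      loop
         select
            accept Heartbeat;
            Ada.Text_IO.Put_Line ("Heartbeat received");
         or
            delay 0.5;              --  no caller arrived within 0.5 s
            Ada.Text_IO.Put_Line ("Heartbeat missed: starting recovery");
            exit;
         end select;
      end loop;
   end Monitor;

begin
   for Beat in 1 .. 2 loop
      Monitor.Heartbeat;
      delay 0.1;
   end loop;
   --  The main program now goes silent, so Monitor times out.
end Heartbeat_Demo;
```

The delay alternative restarts each time the select statement is entered, so the timeout measures the gap since the last accepted heartbeat.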
Wrap-up
Reliability begins at Design Specification stage
– Choices
hardware, programming languages, timing requirements
– Incorporating
Correctness Proofs
Design Reviews
Extensive Testing
– Planning
Fault Prevention
Fault Tolerance
– Error Handling
– Error Recovery
Use time-tested methodologies