Reliable Software for Real-Time Mission Critical Systems
Chris Turiano
Mesa State College
Grand Junction, CO 81501
Some Quick Definitions
Mission Critical
Items (data, software, etc.) determined to be vital to operational readiness or
mission effectiveness in terms of both content and timeliness; they must be
absolutely accurate and available on demand
Failure
when a system does not conform to its specified behavior
Fault
internal or algorithmic cause of a failure
Real-time system
a system that produces a time-constrained output into the physical world based on
input from the same physical world
Hard real-time system
system failure occurs for every missed deadline
Soft real-time system
system is tolerant of missed deadlines
Reliability
a quantitative measure of how well a system conforms to its specified behavior
System Design
• System design (planning) is crucial to reliability
• Incorporate
– High-end design specifications
• Choices
– Hardware, Software, IDE, OS
– Fault Prevention
• Testing
• Reviews, program analysis tools
– Fault Tolerance
• Error handling
• Error recovery
Design Specification
• Hardware
– Mil-spec vs. cost
– Single expensive unit vs. Redundant cheap unit
• Software
– Language
• Data-abstraction
• Modularity
• Strongly typed
– Strict choice of timing requirements
• Integrated Development Environments
– Help prevent typographical errors and reduce complexity
• Operating System
– Support for concurrency
– Top level error detection/recovery
Fault Prevention
– Begins during design specification
• Appropriate specifications
• Fault Avoidance techniques
• Help prevent introduction of faults into system
– Well established techniques
– Strongly typed languages
– Data abstraction and modularity
– Tools
• Design reviews
• Program-verification software
• Fault Removal
– Testing, testing, testing
• the best method to improve the reliability
• exhaustive system testing is impossible
– difficulty proving correctness
– adequate realism
– incorrect early assumptions
• Unit Testing vs. System Testing
Fault Tolerance
• 3 levels
– Full fault tolerance
• All functionality in the presence of faults (perhaps for a limited time)
– Graceful degradation
• Less functionality until repair of the fault
– Fail safe
• Records a prior safe state so it can return to that state after shutdown
– Allow regression
• Data recovery
• System restart may restore functionality
• Add-on components
– To detect, handle and recover from faults
– Redundancy vs. complexity
• Balance redundancy, introducing new faults and physical restrictions
• Separate the support components from the potentially faulty components
Fault Tolerance (cont.)
• Dynamic Redundancy
– Error detection system
• “exception” block in Ada95
– 4 stages
• Error detection
• Error diagnosis
• Error recovery
• Fault treatment
• Static Redundancy
– Masking errors by comparing the output of multiple components
• An average or a majority-rule algorithm to select the correct value
• n-version programming
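The majority-rule masking described under Static Redundancy can be sketched in a few lines of Ada. This is only an illustration under stated assumptions: the three integer inputs stand in for the outputs of three redundant components, and the names Vote and No_Majority are hypothetical, not taken from the presentation's code.

```ada
with Ada.Text_IO;

procedure Majority_Vote_Demo is

   --  Hypothetical exception: raised when no two replicas agree.
   No_Majority : exception;

   --  Return the value that at least two of the three replicas agree on.
   function Vote (A, B, C : Integer) return Integer is
   begin
      if A = B or else A = C then
         return A;
      elsif B = C then
         return B;
      else
         raise No_Majority;
      end if;
   end Vote;

begin
   --  The third replica delivers a wrong value; the voter masks it.
   Ada.Text_IO.Put_Line (Integer'Image (Vote (42, 42, 17)));

   --  All three replicas disagree: the error cannot be masked.
   begin
      Ada.Text_IO.Put_Line (Integer'Image (Vote (1, 2, 3)));
   exception
      when No_Majority =>
         Ada.Text_IO.Put_Line ("no majority");
   end;
end Majority_Vote_Demo;
```

The first call prints 42 because two of the three values agree; the second prints "no majority", which is the point where a real system would have to fall back on dynamic redundancy.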
Error Detection
• Error Detection
with Ada.Text_IO;
with Ada.Exceptions; use Ada.Exceptions;

package body SomeThing is
   procedure SomeProc (param : out Integer) is
      ProcException : exception;
   begin
      <the internal workings of SomeProc>
   exception
      -- An anticipated error is passed to its own handler
      when ProcException => ProcExceptionHandler (param);
      -- Any other error is caught and reported
      when anError : others =>
         Ada.Text_IO.Put_Line (Exception_Information (anError));
   end SomeProc;
end SomeThing;
– Levels of abstraction
– Propagation
– Error detection occurs everywhere but diagnosis/recovery occur where the
specific error is best handled
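Propagation across levels of abstraction might look like the following sketch. The procedure and exception names are hypothetical; the point is only that the lowest level detects the error, the middle level has no handler, and recovery happens at the level that knows what to do.

```ada
with Ada.Text_IO;
with Ada.Exceptions;

procedure Propagation_Demo is

   --  Hypothetical exception for this sketch.
   Sensor_Fault : exception;

   --  Lowest level: detects the error but cannot repair it.
   procedure Read_Sensor is
   begin
      raise Sensor_Fault;
   end Read_Sensor;

   --  Middle level: no handler, so the exception passes straight through.
   procedure Sample is
   begin
      Read_Sensor;
   end Sample;

begin
   Sample;
exception
   --  Top level: diagnosis and recovery happen here.
   when E : Sensor_Fault =>
      Ada.Text_IO.Put_Line
        ("recovered from " & Ada.Exceptions.Exception_Name (E));
end Propagation_Demo;
```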
Error Detection
• Shortcomings
– Stack corrupting errors
• No pointers to the error handling code nor to the parent
– Infinite loop
• Error never fires and the infinite loop continues
• Necessitates other mechanisms
• Timeliness of interaction
– Detecting deadline errors
• Handles infinite loop and stack corruption errors
• RTOS scheduling
– Inherent part of the OS
• Watchdog Scheduler
– Analogy
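A deadline check catches the infinite-loop case that an exception handler never sees. The sketch below is not the watchdog scheduler from this presentation; it swaps in Ada 95's asynchronous transfer of control (select ... then abort) as a smaller stand-in for the same idea. The 0.5 second deadline and the looping computation are assumptions chosen for illustration.

```ada
with Ada.Text_IO;

procedure Deadline_Demo is
begin
   select
      --  The deadline: if it expires first, the abortable part is abandoned.
      delay 0.5;
      Ada.Text_IO.Put_Line ("deadline missed: computation abandoned");
   then abort
      --  Stands in for a computation that never finishes.
      loop
         delay 0.01;
      end loop;
   end select;
end Deadline_Demo;
```

No exception is ever raised inside the loop, yet control still returns to the caller once the deadline passes. The short delay inside the loop is there only to give the runtime a definite point at which the abort can take effect.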
Diagnosis and Recovery
• Error diagnosis
– Preventing spread of erroneous data
– Examining components where error could have spread
• Object-oriented design naturally limits the interactivity between components
– Access from outside the object is well defined
– Object hides its data
– All access to shared resources indivisible
• Error Recovery
– Roll participating threads to a known point
• Forward error recovery
– Predetermined checkpoints are written into the code
• Backward error recovery
– A “good” state is saved as threads progress
– Infinite loop
• Threads enter an atomic action and each creates a recovery point; then an
error occurs, rolling all threads back to the same state they just left
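Saving a "good" state and rolling back to it can be sketched for a single thread as follows. The State record, its Extension field, and the Bad_Command exception are hypothetical; a real system would coordinate recovery points across all participating tasks.

```ada
with Ada.Text_IO;

procedure Recovery_Point_Demo is

   --  Hypothetical state: how far the array is extended, in percent.
   type State is record
      Extension : Natural := 0;
   end record;

   Bad_Command : exception;

   Current    : State;
   Checkpoint : State;

   --  Updates the state, then fails part-way for a negative step.
   procedure Apply (Step : Integer) is
   begin
      Current.Extension := Current.Extension + 10;
      if Step < 0 then
         raise Bad_Command;
      end if;
   end Apply;

begin
   Apply (1);
   Checkpoint := Current;          --  save a known good state
   begin
      Apply (-1);                  --  leaves Current half-updated, then fails
   exception
      when Bad_Command =>
         Current := Checkpoint;    --  roll back to the recovery point
   end;
   Ada.Text_IO.Put_Line (Natural'Image (Current.Extension));
end Recovery_Point_Demo;
```

The program prints 10: the partial update made by the failing call is discarded. If the same failing command were simply retried after the rollback, the result would be exactly the infinite loop described above.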
A Little Real-Time System in Ada95
Imagine a satellite with an array of solar cells that extends or retracts
based on the amperage of the battery it recharges.
User’s keyboard controller
A task accepting input to alter states
Amperage Sensor
A task which senses amps
System Controller
A task which decides to extend or retract array
Watchdog Scheduler
A task which handles problems with the timeliness of extension or retraction of the array
Motor Controller
A task which extends or retracts array using a protected object
Ada 95’s Task
TASK SensorTask IS
   PRAGMA Priority (SensorPriority);
END SensorTask;

TASK BODY SensorTask IS
   T      : Ada.Real_Time.Time;
   MyWait : Ada.Real_Time.Time_Span := SensorPeriod;
   Sensor : CurrentSensor.Object;
   Span   : Time_Span := Milliseconds (5000);
   eid    : EventID;   -- handle of the pending watchdog event
BEGIN
   theWindows.WindowInitSensor;
   T := Ada.Real_Time.Clock + MyWait;
   LOOP
      -- Register a timer event with the watchdog before doing the work
      eid := TheWatchDogScheduler.AddTimerEvent
               (TimerEvent.MakeEvent
                  (newPid    => Current_Task,
                   newID     => SensorID,
                   newSpan   => Span,
                   newErrHan => ErrorHandler.Object (TheSensorErrorHandler)));
      TheSensor.SetAmpsRandom;
      -- Work finished: remove the timer event again
      TheWatchDogScheduler.RemoveTimerEvent (eid);
      DELAY UNTIL T;   -- sleep until the start of the next period
      T := T + MyWait;
   END LOOP;
EXCEPTION
   WHEN AnError : OTHERS =>
      OutPoint := (x => 1, y => 7);
      theWindows.WriteString (3, OutPoint, Exception_Information (AnError));
END SensorTask;
Ada 95’s Design
• History
• Built-in Fault Prevention
• Built-in mechanisms to aid reliability
– Very strongly typed
– Modular and Object-oriented
– Exception mechanism
• Raise <exception name>
– Delays
– Tasks and Protected Objects
• Readers/writer (analogy)
• Entry
– Queues
– Barriers/Guards
• Accept
• Select…
– When <condition>
– Or…Delay/Abort/Terminate
– Else…
• Requeue
• Priority
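Tasks, protected objects, and entry barriers fit together roughly as in the sketch below. The one-slot Mailbox and the Producer task are hypothetical stand-ins, not part of the satellite example; the barriers (when not Full, when Full) make callers queue until the entry may proceed.

```ada
with Ada.Text_IO;

procedure Protected_Demo is

   --  One-slot mailbox: a hypothetical shared resource.
   protected Mailbox is
      entry Put (Value : in Integer);
      entry Get (Value : out Integer);
   private
      Slot : Integer := 0;
      Full : Boolean := False;
   end Mailbox;

   task Producer;

   Reading : Integer;

   protected body Mailbox is
      --  Barrier: callers queue here until the slot is empty.
      entry Put (Value : in Integer) when not Full is
      begin
         Slot := Value;
         Full := True;
      end Put;

      --  Barrier: callers queue here until a value is available.
      entry Get (Value : out Integer) when Full is
      begin
         Value := Slot;
         Full  := False;
      end Get;
   end Mailbox;

   task body Producer is
   begin
      for I in 1 .. 3 loop
         Mailbox.Put (I * 10);
      end loop;
   end Producer;

begin
   for I in 1 .. 3 loop
      Mailbox.Get (Reading);
      Ada.Text_IO.Put_Line (Integer'Image (Reading));
   end loop;
end Protected_Demo;
```

The main program prints 10, 20 and 30 in that order. All access to Slot goes through the protected object, so neither side needs explicit locks.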
Ada 95’s Error Detection
PROCEDURE InitWindow (newWindow : IN OUT Window.Object) IS
BEGIN
   <statements of procedure removed due to space limitations>
EXCEPTION
   -- Window too wide: shrink deltaX by one and try again
   WHEN WindowOutofBoundsX =>
      Movecursor (2, 23);
      Put_Line (Item => "The window width is too large: recalling InitWindow with smaller box.");
      newWindow.deltaX := newWindow.deltaX - 1;
      InitWindow (newWindow => newWindow);
   -- Window too tall: shrink deltaY by one and try again
   WHEN WindowOutofBoundsY =>
      Movecursor (2, 23);
      Put_Line (Item => "The window height is too large: recalling InitWindow with smaller box.");
      newWindow.deltaY := newWindow.deltaY - 1;
      InitWindow (newWindow => newWindow);
   -- Any other error is reported, not retried
   WHEN Another : OTHERS =>
      Movecursor (2, 23);
      Put_Line (Item => Exception_Information (Another));
END InitWindow;
Ada 95’s Error Recovery
• Watchdog Scheduler
TASK BODY WatchDog IS
   -- Declarations of TE, Exists and MyTime are not shown on the slide
BEGIN
   LOOP
      -- Fetch the top timer event, if there is one
      TheWatchDogScheduler.GetTopItem (TE, Exists);
      IF Exists THEN
         MyTime := Clock;
         -- Expiry time reached: run the recovery procedure for that task
         IF TE.ExpTime <= MyTime THEN
            CASE TE.ID IS
               WHEN 1 => TheWatchDogScheduler.SolarArrayErrorRecoveryProcedure (TE);
               WHEN 2 => TheWatchDogScheduler.SystemControllerErrorRecoveryProcedure (TE);
               WHEN OTHERS => TheWatchDogScheduler.CurrentSensorRecoveryProcedure (TE);
            END CASE;
         END IF;
      END IF;
      DELAY 0.1;   -- poll again after 0.1 seconds
   END LOOP;
END WatchDog;
Ada 95’s Error Recovery
• Select…
– Requeue/Delay
– Abort/Terminate
• Exception handling
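A select statement with a delay alternative lets a caller give up on an unresponsive task instead of blocking forever. The sketch below is a minimal illustration: the Motor task, its Extend entry, and the timings are assumptions, with the 1.0 second delay simulating a motor controller that is stuck.

```ada
with Ada.Text_IO;

procedure Timed_Call_Demo is

   --  Hypothetical motor controller task.
   task Motor is
      entry Extend;
   end Motor;

   task body Motor is
   begin
      delay 1.0;        --  simulates a controller that is slow to respond
      select
         accept Extend;
      or
         terminate;     --  lets the task end once nobody can call it
      end select;
   end Motor;

begin
   --  Timed entry call: wait at most 0.2 seconds for the rendezvous.
   select
      Motor.Extend;
      Ada.Text_IO.Put_Line ("array extended");
   or
      delay 0.2;
      Ada.Text_IO.Put_Line ("motor did not respond in time");
   end select;
end Timed_Call_Demo;
```

Here the call times out and the second message is printed. In a real system the delay branch is where error recovery would start.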
Wrap-up
• Reliability begins at Design Specification stage
– Choices
• hardware, programming languages, timing requirements
– Incorporating
• Correctness Proofs
• Design Reviews
• Extensive Testing
– Planning
• Fault Prevention
• Fault Tolerance
– Error Handling
– Error Recovery
• Use time-tested methodologies