Deterministic Replay for Real-time Software Systems

Alice Lee
Safety, Reliability & Quality
Assurance Office
JSC, NASA
Yann-Hang Lee
Computer Science & Eng
Arizona State University,
Tempe, AZ
Background
• Major difficulties of building real-time embedded applications
  - handling concurrent events (real-world events occur in parallel)
  - timing control and temporal dependence in program behavior
  - asynchronous operations
• Non-deterministic operation, time-dependent behavior, and race conditions
  - difficult to model, analyze, test, and reproduce
• Example: NASA Pathfinder spacecraft
  - Total system resets in Mars Pathfinder
    - An overrun of the data collection task → a priority inversion on a mutex semaphore → failure of the communication task → a system reset
    - Took 18 hours to reproduce the failure in a lab replica → the problem became obvious and a fix was installed
lee_IV&V-1
Background (Cont’d)
• Other examples
  - select(2)/accept(2) race condition in TCP servers of NetBSD
    - the bug depends on a specific event and is sometimes difficult to reproduce, particularly if the server is very fast and the network is relatively slow
  - The Delphi Bug Report 459
    - difficult to reproduce the bug since the timing of the two threads (one is being destroyed and one is being created) has to be "right" for it to occur
• It is easy to identify the faults and fix them once the failing sequences are reproduced (or observed).
• The failures are rooted in the interaction of multiple concurrent operations/threads and are based on timing dependencies.
Deterministic Replay
• Can we reproduce the exact execution behavior with additional delays in a controlled environment?
  - the delays may be caused by instrumentation and breakpoints
• For multiple purposes:
  - Test analysis: instead of Execution/Instrumentation, run Execution → D. replay/Instrumentation
  - Debugging: instead of Execution/Observation/Assertion, run Execution → D. replay/Observation/Assertion
  - Recovery: Execution/Checkpointing/Msg logging → Rollback/D. replay
Deterministic Replay (Cont’d)
• Programs read in the same input values (timer, DAQ, status, etc.)
• Interrupts occur at the same program execution instances
• Need to log external events during real-time execution and resubmit the events during replay
  - recording and replaying stages
[Figure: timelines of real-time execution and deterministic replay. interrupt_1 arrives at PC=1000 and interrupt_2 at PC=2000 in both timelines; intrusions stretch the replay timeline.]
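The recording and replaying stages can be sketched as below. This is a minimal illustration, not the authors' tool: the `Event`, `LiveEnvironment`, `ReplayEnvironment`, and `control_loop` names are hypothetical, a loop step number stands in for the execution instance, and a random generator stands in for the outside world.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    step: int    # execution instance at which the event was observed (assumption: a loop step)
    kind: str    # "input" or "interrupt"
    value: int


class LiveEnvironment:
    """Record stage: events come from a non-deterministic source and are logged."""

    def __init__(self, seed: Optional[int] = None):
        self.rng = random.Random(seed)
        self.log: List[Event] = []

    def read_input(self, step: int) -> int:
        value = self.rng.randint(0, 100)
        self.log.append(Event(step, "input", value))
        return value

    def pending_interrupt(self, step: int) -> Optional[int]:
        if self.rng.random() < 0.3:
            value = self.rng.randint(1, 9)
            self.log.append(Event(step, "interrupt", value))
            return value
        return None


class ReplayEnvironment:
    """Replay stage: logged events are resubmitted at the same execution instances."""

    def __init__(self, log: List[Event]):
        self.inputs = {e.step: e.value for e in log if e.kind == "input"}
        self.interrupts = {e.step: e.value for e in log if e.kind == "interrupt"}

    def read_input(self, step: int) -> int:
        return self.inputs[step]

    def pending_interrupt(self, step: int) -> Optional[int]:
        return self.interrupts.get(step)


def control_loop(env, steps: int = 20) -> List[int]:
    """Toy program whose output depends on inputs and on when interrupts arrive."""
    gain, outputs = 1, []
    for step in range(steps):
        irq = env.pending_interrupt(step)
        if irq is not None:
            gain = irq  # the interrupt handler changes program state
        outputs.append(gain * env.read_input(step))
    return outputs


live = LiveEnvironment()
recorded = control_loop(live)
replayed = control_loop(ReplayEnvironment(live.log))
assert recorded == replayed  # same inputs, same interrupt instances, same behavior
```

Because the replay run reads only from the log, extra delays from instrumentation or breakpoints would not change its outputs in this sketch.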
Testing Analysis and Timing Intrusion
• Software quality analysis and test coverage
  - Instrumentation at source programs
  - program behavior may be changed due to timing intrusion
    - test a robotic controller in the target system: hardware and human-in-the-loop operations
  - some solutions:
    - hardware-based trace collection (Applied Microsystems)
    - special data logging, monitoring, and test facility (SVF for NASA ISS)
• Apply instrumentation during deterministic replay
  - if the overhead of logging external events can be minimized
Our Approach -- A Two-stage Instrumentation
• Instrumentation based on RTOS, for context switches, interrupts, events, and task communication
• Annotation for device drivers
• Synchronize program execution with external events
  - cannot rely on program counter
    - an interrupt during a loop (need loop count and program counter)
  - simulated time
    - must be adjusted to match the real execution time
  - determine when an event occurs
    - if no data dependence, it can occur at any instance during a block execution
    - else, need to know the corresponding statement
Software Instruction Counter
• Exact instance in program execution
  - specified by program counter (PC)
[Figure: a polling loop (read I/O, check value) repeats at the same PC; the I/O status changes between two iterations.]
• Software instruction counter (SIC): incremented on a backward jump or procedure call
  - software or hardware implemented
• Has been applied to recovery and debugging
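A small sketch of why a counter is needed in addition to the PC. The toy instruction set (`nop`, `inc`, `jlt`, `call`, `ret`, `halt`) and the `run` helper are hypothetical and only illustrate the counting rule stated above: the SIC is incremented on every backward jump and every procedure call.

```python
def run(program, max_steps=1000):
    """Execute a toy program and return the (SIC, PC) stamp of each executed instruction."""
    pc, sic = 0, 0
    regs = {"n": 0}
    call_stack = []
    stamps = []
    while pc < len(program) and len(stamps) < max_steps:
        op, arg = program[pc]
        stamps.append((sic, pc))
        if op == "nop":
            pc += 1
        elif op == "inc":
            regs[arg] += 1
            pc += 1
        elif op == "jlt":            # jump to target while n < limit
            target, limit = arg
            if regs["n"] < limit:
                if target <= pc:     # backward jump: bump the SIC
                    sic += 1
                pc = target
            else:
                pc += 1
        elif op == "call":           # procedure call: bump the SIC
            sic += 1
            call_stack.append(pc + 1)
            pc = arg
        elif op == "ret":
            pc = call_stack.pop()
        elif op == "halt":
            break
    return stamps


PROGRAM = [
    ("call", 4),        # 0: call the subroutine at address 4
    ("inc", "n"),       # 1: loop body
    ("jlt", (1, 3)),    # 2: loop back to 1 while n < 3
    ("halt", None),     # 3
    ("nop", None),      # 4: subroutine body
    ("ret", None),      # 5
]

stamps = run(PROGRAM)
pcs = [pc for _, pc in stamps]
assert len(set(pcs)) < len(pcs)           # the PC alone repeats inside the loop
assert len(set(stamps)) == len(stamps)    # (SIC, PC) identifies each instance in this run
```

In this run the loop body at PC=1 executes three times, with stamps (1, 1), (2, 1), and (3, 1), so an interrupt logged with its (SIC, PC) stamp can be placed in the right iteration.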
Current Status
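[Figure: tool chain. The source program passes through a code analyzer and code instrumentation, which produce two builds: instrumented program_1 (ESIC, system, and event instrumentation) for the target record environment, and instrumented program_2 (ESIC and replay instrumentation) for the target replay environment. The record environment produces event trace_1; a PC stamp converter turns it into event trace_2 for the replay environment. An execution trace is also shown as an output.]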
Current Status (Cont’d)
• Works for a single execution thread in the whole system (vxWorks + MPC860)
• There are kernel and non-instrumented threads
  - test analysis of one program in a multitasking environment
  - debug a program which calls library routines
  - system calls to RTOS
• Can we still reach deterministic replay if the execution of the instrumented thread is interleaved with other threads?
  - If interrupts (input) → thread_1 → thread_2, then both threads must be instrumented
[Figure: the instrumented program, the RTOS, the other thread, and an ISR; labels include interrupt, semTake(), and semGive().]
Current Status (Cont’d)
• If interrupts (input) → thread_2 and thread_1 → thread_2, thread_1 doesn't need to be instrumented
  - however, interrupts can occur while thread_1 is running (i.e., execution is not in the instrumentation region due to a blocked system call or library call)
• Solution:
  - check thread id when an interrupt occurs
    - if the interrupted instruction is in the instrumentation region, use PC+SIC for replay
    - else, replay the interrupt just before the call (RTOS or library)
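The decision in the solution above can be sketched as a single function. All names here are assumptions made for illustration: `INSTRUMENTED_THREAD`, `REGION`, `ReplayPoint`, and `replay_point` are hypothetical, and the instrumentation region is modeled as one address range.

```python
from dataclasses import dataclass
from typing import Tuple

INSTRUMENTED_THREAD = 1            # hypothetical id of the instrumented thread
REGION = range(0x1000, 0x2000)     # hypothetical address range of the instrumented code


@dataclass(frozen=True)
class ReplayPoint:
    pc: int
    sic: int
    before_call: bool  # True: deliver the interrupt just before the RTOS/library call


def replay_point(thread_id: int, pc: int, sic: int,
                 last_call: Tuple[int, int]) -> ReplayPoint:
    """Choose where a recorded interrupt is re-delivered during replay.

    last_call is the (PC, SIC) stamp of the most recent RTOS or library call
    made from the instrumented region (assumed to be tracked by the recorder).
    """
    in_region = thread_id == INSTRUMENTED_THREAD and pc in REGION
    if in_region:
        # Interrupted inside the instrumentation region: the PC+SIC stamp is exact.
        return ReplayPoint(pc, sic, before_call=False)
    # Interrupted in another thread, or inside a system/library call:
    # fall back to the point just before the call.
    call_pc, call_sic = last_call
    return ReplayPoint(call_pc, call_sic, before_call=True)


# Interrupt arrives while the instrumented thread runs its own code.
assert replay_point(1, 0x1234, 7, (0x1100, 5)) == ReplayPoint(0x1234, 7, False)
# Interrupt arrives while another thread is running.
assert replay_point(2, 0x8000, 0, (0x1100, 5)) == ReplayPoint(0x1100, 5, True)
```

The fallback trades exactness for simplicity: the uninstrumented code has no PC+SIC stamp, so the last stamped call site is the nearest point the replay can identify.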
Current Tasks
• Tool integration and GUI
• Experiments
  - joystick program with input and timer
  - DC motor controller with a LabVIEW-based simulator
• Applications in JSC
  - X38
  - AERCam
• Porting
  - vxWorks and Suds on MBX860 embedded controller
  - porting to RT-Linux and other platforms
• Documentation and dissemination