Deterministic Replay for Real-time Software Systems
Alice Lee
Safety, Reliability & Quality
Assurance Office
JSC, NASA
Yann-Hang Lee
Computer Science & Eng
Arizona State University,
Tempe, AZ
Background
Major difficulties of building real-time embedded applications
handling concurrent events (real-world events occur in parallel)
timing control and temporal dependence in program behavior
asynchronous operations
Non-deterministic operation, time-dependent behavior, and race conditions
are difficult to model, analyze, test, and reproduce.
Example: NASA Pathfinder spacecraft
Total system resets in Mars Pathfinder
An overrun of data collection task
a priority inversion in mutex semaphore
failure of communication task
a system reset.
Took 18 hours to reproduce the failure in a lab replica;
the problem then became obvious and a fix was installed
lee_IV&V-1
Background (Cont’d)
Other examples
select(2)/accept(2) Race Condition in TCP Servers of NetBSD
the bug depends on a specific event and is sometimes difficult
to reproduce, particularly if the server is very fast and the
network is relatively slow.
The Delphi Bug Report 459
difficult to reproduce the bug since the timing of the two threads
(one is being destroyed and one is being created) has to be
“right” for it to occur.
It is easy to identify the faults and fix them once the failing
sequences are reproduced (or observed).
The failures are rooted in the interaction of multiple concurrent
operations/threads and are based on timing dependencies.
Deterministic Replay
Can we reproduce the exact execution behavior with additional
delays in a controlled environment?
the delays may be caused by instrumentation and breakpoints
For multiple purposes:
Test analysis: Execution/Instrumentation becomes Execution, then D. replay/Instrumentation
Debugging: Execution/Observation/Assertion becomes Execution, then D. replay/Observation/Assertion
Recovery: Execution/Checkpointing/Msg logging, then Rollback/D. replay
Deterministic Replay (Cont’d)
Programs read in the same input values (timer, DAQ, status, etc.)
Interrupts occur at the same program execution instants
Need to log external events during real-time execution and
resubmit the events during replay
recording and replaying stages
[Figure: timeline of real-time execution versus deterministic replay. interrupt_1 arrives at PC=1000 and interrupt_2 at PC=2000 in both runs; intrusions stretch the replay in time, but each interrupt is delivered at the same PC.]
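The record and replay stages can be sketched as follows. This is a minimal illustration, not the tool described in these slides: the names Recorder, Replayer, and control_loop are hypothetical, and a random number generator stands in for a timer or DAQ read.

```python
import random


class Recorder:
    """Record stage: read the real input source and log every value."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def read_input(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Replay stage: resubmit the logged values instead of touching the source."""

    def __init__(self, log):
        self._values = iter(log)

    def read_input(self):
        return next(self._values)


def control_loop(io, steps=5):
    """Hypothetical program under test: its result depends on every input read."""
    total = 0
    for _ in range(steps):
        sample = io.read_input()
        total = total * 31 + sample
    return total


# A random source stands in for a non-deterministic device (assumption).
recorder = Recorder(lambda: random.randint(0, 1023))
recorded_result = control_loop(recorder)

# The replay sees exactly the same inputs, so it computes the same result.
replayed_result = control_loop(Replayer(recorder.log))
```

Because the program only sees its inputs through read_input, the replay can run arbitrarily slowly (for example under a debugger) without changing the result.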
Testing Analysis and Timing Intrusion
Software quality analysis and test coverage
Instrumentation at source programs
program behavior may be changed due to timing intrusion
test a robotic controller in the target system: hardware
and human-in-the-loop operations
some solutions:
hardware-based trace collection (Applied Microsystems)
special data logging, monitoring, and test facility (SVF for
NASA ISS)
Apply instrumentation during deterministic replay
if the overhead of logging external events can be minimized
Our Approach -- A Two-stage Instrumentation
Instrumentation based on RTOS -- for context
switches, interrupts, events, and task communication
Annotation for device drivers
Synchronize program execution with external events
cannot rely on program counter
an interrupt during a loop (need loop count and program
counter)
simulated time
must be adjusted to match with the real execution time
determine when an event occurs
if there is no data dependence, it can occur at any instant during a
block execution
else, need to know the corresponding statement
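The need for a loop count in addition to the program counter can be illustrated with a toy loop. This is a sketch under stated assumptions: the address 1000, the "gain" variable, and the run function are all hypothetical, and the interrupt handler is modeled as an inline assignment.

```python
def run(iterations, interrupt_at=None, arrival_iteration=None):
    """Run a toy loop and return (result, stamp of the interrupt).

    Record mode: arrival_iteration says when the interrupt happens to arrive.
    Replay mode: interrupt_at is the logged (pc, loop_count) stamp to re-inject at.
    """
    PC_LOOP_BODY = 1000  # hypothetical address of the loop body
    gain = 1             # state changed by the interrupt handler
    total = 0
    stamp = None
    for loop_count in range(iterations):
        here = (PC_LOOP_BODY, loop_count)
        if loop_count == arrival_iteration or here == interrupt_at:
            gain = 10    # body of the (modeled) interrupt handler
            stamp = here
        total += gain * loop_count
    return total, stamp


# Record: the interrupt arrives during the fifth pass through the loop body.
recorded_total, logged_stamp = run(6, arrival_iteration=4)

# Replay: re-inject at the logged (PC, loop count) stamp.
replayed_total, _ = run(6, interrupt_at=logged_stamp)

# With the PC alone, the first visit to address 1000 would match instead.
wrong_total, _ = run(6, interrupt_at=(1000, 0))
```

The replay matches the recorded run only when the loop count is part of the stamp; matching on the PC alone delivers the interrupt four iterations too early.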
Software Instruction Counter
Exact instant in program execution
specified by program counter (PC)
[Figure: a polling loop that repeats "read I/O, check value" until the I/O status changes]
Software instruction counter (SIC):
incremented on each backward jump or procedure call
software or hardware implemented
Has been applied to recovery and debugging
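A software instruction counter can be sketched on a toy machine. This is an illustration only: the Machine class, the statement addresses 100/104/108, and poll_until_ready are hypothetical, and a real SIC would be maintained by inserted instructions or hardware rather than explicit method calls.

```python
class Machine:
    """Toy machine: 'pc' labels the current statement, 'sic' is the counter."""

    def __init__(self):
        self.sic = 0
        self.trace = []  # (pc, sic) stamp of every executed statement

    def step(self, pc):
        self.trace.append((pc, self.sic))

    def backward_jump(self):
        self.sic += 1  # taken on every loop back edge

    def call(self):
        self.sic += 1  # taken on every procedure call


def poll_until_ready(m, readings):
    """Polling loop: read I/O and check the value until it becomes non-zero."""
    m.call()
    for value in readings:
        m.step(100)  # read I/O
        m.step(104)  # check value
        if value:
            break
        m.backward_jump()
    m.step(108)      # first statement after the loop


m = Machine()
poll_until_ready(m, [0, 0, 0, 1])
pcs = [pc for pc, _ in m.trace]
```

The statement at address 100 is executed four times, so the PC alone is ambiguous, while every (PC, SIC) pair in the trace is distinct.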
Current Status
[Figure: tool flow. A code analyzer processes the source program for code instrumentation. ESIC, system, and event instrumentation produces instrumented program_1, which runs in the target record environment and yields an execution trace and event trace_1. A PC stamp converter turns event trace_1 into event trace_2. ESIC and replay instrumentation produces instrumented program_2, which runs in the target replay environment.]
Current Status (Cont’d)
Works for a single execution thread in the
whole system (VxWorks + MPC860)
There are kernel and non-instrumented
threads
test analysis of one program in a multitasking
environment
debug a program which calls library routines
Can we still reach deterministic replay if the
execution of the instrumented thread is
interleaved with other threads?
If interrupts (input) -> thread_1 -> thread_2,
then both threads must be instrumented
[Figure: an instrumented program making system calls to the RTOS (semTake()), the other thread, and an ISR that calls semGive() on an interrupt]
Current Status (Cont’d)
If interrupts (input) -> thread_2 and thread_1 -> thread_2,
thread_1 doesn’t need to be instrumented
however, interrupts can occur while thread_1 is running (i.e., execution
is not in the instrumentation region due to a blocked system call or
library call)
Solution:
check thread id when an interrupt occurs
if the interrupted instruction is in the instrumentation region, use
PC+SIC for replay
else, replay the interrupt just before the call (RTOS or library)
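The decision in this solution can be sketched as a small function. The record layout, the address range of the instrumentation region, and the name replay_point are assumptions made for illustration; the slides do not specify how the pending call is identified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterruptRecord:
    thread_id: int               # thread that was running when the interrupt occurred
    pc: int                      # interrupted program counter
    sic: int                     # software instruction counter at that point
    pending_call: Optional[str]  # RTOS or library call in progress, if any (assumed field)


def in_region(pc, region):
    """True if pc lies inside the instrumented address range [low, high)."""
    low, high = region
    return low <= pc < high


def replay_point(rec, instrumented_thread, region):
    """Decide where the logged interrupt should be re-injected during replay."""
    if rec.thread_id == instrumented_thread and in_region(rec.pc, region):
        # Interrupted inside instrumented code: the PC+SIC stamp is exact.
        return ("at_stamp", rec.pc, rec.sic)
    # Otherwise fall back to the point just before the RTOS or library call.
    return ("before_call", rec.pending_call)


REGION = (0x1000, 0x2000)  # hypothetical instrumentation region
inside = replay_point(InterruptRecord(1, 0x1200, 7, None), 1, REGION)
outside = replay_point(InterruptRecord(2, 0x8000, 0, "semTake"), 1, REGION)
```

Here the first interrupt is replayed at its exact PC+SIC stamp, while the second, taken outside the instrumentation region, is replayed just before the semTake() call.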
Current Tasks
Tool integration and GUI
Experiments
joystick program with input and timer
DC motor controller with a LabView-based simulator
Applications in JSC
X38
AERCam
Porting
VxWorks and Suds on MBX860 embedded controller
porting to RT-Linux and other platforms
Documentation and dissemination