Related Work: 3 approaches to fault injection/analysis

FIG: A Prototype Tool for On-Line
Verification of Recovery Mechanisms
Naveen Sastry, Pete Broadwell,
Jonathan Traupman, David Patterson
University of California, Berkeley
Presentation Outline
1. Introduction
– Objective/Motivation
– Background
2. Methods
– Implementation
– Test setup
3. Evaluation
– Test results
– Conclusions
The Berkeley/Stanford
ROC Project
• Purpose: investigating novel
techniques for building highly dependable Internet services
• Example techniques:
– Advanced support for operator undo
– Stability through targeted restarts
– Integrated root cause analysis
– Online verification of recovery
mechanisms
FIG Project
Objective/Motivation
Objective:
• Develop a lightweight, extensible tool
for injecting errors to test recovery
code/mechanisms
Motivation:
• Testing and production environments
are always different
• Large systems will require recovery
code, which should be tested as part of
normal operation
“Software’s Invisible Users”
[Diagram labels: User input, User interface, Application, Other libraries, System libraries (libc), OS, Other apps]
Concept: Jim Whittaker, Florida Institute of Technology
Related Testing Methods
1. Ballista
(DeVale, Koopman, Siewiorek)
• “Top-down” testing of POSIX-compliant
OS and library interfaces
2. Fuzz
(Miller, Fredriksen, So)
• Tested UNIX applications by feeding
them random input streams
3. Holodeck
(Whittaker et al.)
• Similar approach to ours, but only for
Windows 2000/XP
FIG Implementation
• Thin stub library between app & libraries
• Traps API calls
– Logs them
– Inserts faults
• Can be inserted into any app without modification
– Uses LD_PRELOAD
[Diagram: Application, libfig.so, libc.so and other libs, OS, stacked top to bottom; legend: normal call path, injected fault]
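To make the interposition idea concrete, here is a minimal sketch of an LD_PRELOAD-style stub for read(). It is not FIG's generated code: the call counter, the fixed "fail on the 3rd call" trigger, and the stderr logging are all assumptions made for illustration (the slides say FIG logs through a separate daemon). It relies only on the standard dlsym(RTLD_NEXT, ...) lookup.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef RTLD_NEXT               /* defensive fallback if _GNU_SOURCE came too late */
#define RTLD_NEXT ((void *)-1L)
#endif

/* Hypothetical state: FIG's real bookkeeping is not shown in the slides. */
static unsigned long read_calls   = 0;  /* calls to read() seen so far */
static unsigned long fault_callno = 3;  /* inject a fault on this call */

/* Same signature as libc's read(), so the dynamic linker resolves the
 * application's calls to this stub when the library is preloaded. */
ssize_t read(int fd, void *buf, size_t count)
{
    static ssize_t (*real_read)(int, void *, size_t);

    if (real_read == NULL) {
        /* Find the next definition of read() in load order (libc's). */
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
        if (real_read == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }

    read_calls++;
    /* Simplified logging; the slides describe a shared-memory logging daemon. */
    fprintf(stderr, "stub: read(fd=%d, count=%zu) call #%lu\n",
            fd, count, read_calls);

    if (read_calls == fault_callno) {
        errno = EIO;            /* injected fault: simulated I/O error */
        return -1;
    }
    return real_read(fd, buf, count);   /* normal call path */
}
```

Built as a shared object (for example `cc -shared -fPIC stub.c -o libstub.so -ldl`, with hypothetical file names) and loaded via `LD_PRELOAD=./libstub.so`, a stub like this sits between an unmodified application and libc.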
Extensibility
• API stubs are
automatically
generated
• Very easy to add
new APIs to log
• Fault injection is
under script control
• Can simulate
multiple fault models
(e.g., memory
pressure)
Sample control file:
MALLOC_INDEX
interval 82 to infinity return 0 errno ENOMEM probability 0.03
OPEN_INDEX
// device out of space.
interval 100 to infinity return -1 errno ENOSPC probability 0.001
// kernel out of memory.
interval 100 to 120 return -1 errno ENOMEM probability 0.1
// too many files open.
callnumber 108 return -1 errno EMFILE probability 1.0
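One way to read the control file is as a list of rules per API, each with a call-number interval, a return value, an errno, and a probability. The sketch below encodes the OPEN_INDEX rules that way. The struct layout, the function names, the "first matching rule wins" policy, and the single random draw shared by all rules are assumptions; the slides do not say how FIG evaluates overlapping rules.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory form of one control-file rule. */
struct fault_rule {
    unsigned long first_call;   /* start of interval (inclusive)           */
    unsigned long last_call;    /* end of interval; ULONG_MAX = "infinity" */
    long          retval;       /* value the stub returns when firing      */
    int           err;          /* errno to set when firing                */
    double        probability;  /* chance of firing inside the interval    */
};

/* The OPEN_INDEX rules from the sample control file. */
static const struct fault_rule open_rules[] = {
    { 100, ULONG_MAX, -1, ENOSPC, 0.001 },  /* device out of space  */
    { 100, 120,       -1, ENOMEM, 0.1   },  /* kernel out of memory */
    { 108, 108,       -1, EMFILE, 1.0   },  /* too many files open  */
};

/* Does `rule` fire on call number `callno`? `draw` is a uniform random
 * number in [0, 1) supplied by the caller, which keeps this deterministic. */
bool rule_fires(const struct fault_rule *rule, unsigned long callno, double draw)
{
    if (callno < rule->first_call || callno > rule->last_call)
        return false;
    return draw < rule->probability;
}

/* Apply the first rule that fires: set errno, store the return value in
 * *out, and report true. Report false if the call should proceed normally. */
bool apply_rules(const struct fault_rule *rules, size_t n,
                 unsigned long callno, double draw, long *out)
{
    for (size_t i = 0; i < n; i++) {
        if (rule_fires(&rules[i], callno, draw)) {
            errno = rules[i].err;
            *out = rules[i].retval;
            return true;
        }
    }
    return false;
}
```

A stub would call `apply_rules` with its current call count and a fresh random draw, returning `*out` on a hit and forwarding to the real library function otherwise.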
Test Setup: Applications
• GNU file utilities (ls, mv, etc.)
• Emacs 20.7.1 – with and without X
• Apache 1.3.22
• Berkeley DB 4.0.14
• Netscape Navigator 4.76
• MySQL server 3.23.36
Test Setup:
Instrumented Calls & Their Errors
• malloc() – memory exhaustion
• read() – I/O error, system call was
interrupted
• write() – I/O error, no space left on
device, call interrupted
• open() – memory exhaustion, no space
on device, too many files open
• select() – memory exhaustion
Test Results: Client Apps
                read()          write()          select()     malloc()
                EINTR   EIO     ENOSPC   EIO     ENOMEM       ENOMEM
Emacs – no X    o.k.    exit    warn     warn    o.k.         crash
Emacs w/X       o.k.    crash   o.k.     crash   crash/exit   crash
Netscape        warn    exit    exit     exit    n/a          exit
Test Results: Server Apps
                        read()                    write()                   select()   malloc()
                        EINTR        EIO          ENOSPC       EIO          ENOMEM     ENOMEM
Berkeley DB – Xact      retry        detect       Xact abort   Xact abort   n/a        Xact abort
Berkeley DB – no Xact   retry        detect       data loss    data loss    n/a        detect, or data loss
MySQL Server            Xact abort   retry, warn  Xact abort   Xact abort   retry      restart process
Apache                  o.k.         req. drop    req. drop    req. drop    o.k.       n/a
Netscape Reacts
Test Results: Overhead
                          Time (s)   Overhead
No FIG                    33.46      N/A
FIG, no logging           34.28      2.5%
Logging w/o timestamps    47.83      42.9%
Logging w/timestamps      61.74      84.5%
strace (all syscalls)     112.85     237.3%
Timing using Berkeley DB (non-transactional) to
read, sort and write one million words.
• Note: FIG communicates with a separate logging
daemon through shared memory to reduce logging
overhead.
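The Overhead figures are consistent with the slowdown of each run relative to the "No FIG" baseline. A minimal sketch of that arithmetic follows; the function name is illustrative, not from the slides.

```c
/* Relative slowdown of a measured run versus the baseline, in percent. */
double overhead_percent(double baseline_s, double measured_s)
{
    return (measured_s - baseline_s) / baseline_s * 100.0;
}
```

For example, `overhead_percent(33.46, 34.28)` gives roughly 2.45, which the slide reports as 2.5%.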
Strategies for
Reliable Services:
• Intelligent retry
– ls: “bounded retry” of malloc()
• Resource preallocation
– Apache: allocates buffer pool at startup
• Degraded service
– Apache: deactivates logging if disk full
• Process pools
– Apache and MySQL
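As an illustration of the "bounded retry" strategy, the sketch below retries a failed allocation a fixed number of times before giving up. It is not the code from ls: the function names and the pluggable allocator (used so a failing allocator can be substituted in tests) are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocator signature, so a test can substitute a failing allocator. */
typedef void *(*alloc_fn)(size_t);

/* Try alloc(size) up to max_attempts times; return NULL if every try fails. */
void *alloc_bounded_retry(alloc_fn alloc, size_t size, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        void *p = alloc(size);
        if (p != NULL)
            return p;
        /* A real program might sleep or free caches here before retrying. */
    }
    return NULL;    /* give up: the caller must handle the failure */
}

/* Hypothetical test allocator: fails failures_left times, then uses malloc(). */
static int failures_left = 0;

static void *flaky_malloc(size_t size)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(size);
}
```

Bounding the retries matters: an unbounded loop would hang under persistent memory exhaustion, whereas a bound turns a transient fault into a success and a persistent one into a clean error.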
FIG as a Prototype for
Online Error Injection
• Low run-time overhead
• Easy to enable/disable
• Easy to configure
• Extensible
• Can simulate multiple fault
models
A Case for Online
Error Injection
• Recovery code is not usually
exercised during normal operation
• Deployed environments tend to
differ from testing environments
• Can run error injection tests on a
subset of deployed systems
• FIG can simulate common
environmental errors
Conclusions
• FIG exposed a variety of deficiencies in
how our test applications handled
environmental errors
• Server apps are generally more robust
than client applications
• FIG exhibits low overhead
• FIG is suitable for online error injection
Future Directions
• Limitations of FIG:
– Only for UNIX-like OSes
– Limited to app/library interface (proxy for
app/OS interaction)
• Make FIG part of a larger test suite
• Include clock time and event based
error triggers
• Greater flexibility in configuration file
Other Related Work
1. Xept
(Vo et al.)
• Instruments object code to ensure
that error handling code exists
2. Processor & memory errors
• DOCTOR, HYBRID, DEFINE
3. Process memory corruption
• FERRARI, DEFINE