Triage: Diagnosing Production Run Failures at the User's Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou
University of Illinois at Urbana-Champaign
Motivation
Software failures are a major contributor to system downtime and security holes.
Software has grown in size, complexity, and cost.
Software testing has become more difficult.
Software packages inevitably contain bugs, even production releases.
Motivation
Result: software failures occur during production runs at the user's site.
One existing approach, offsite diagnosis, has serious drawbacks:
Failure-triggering conditions are difficult to reproduce.
It cannot provide timely online recovery (e.g., from fast Internet worms).
Programmers cannot be provided to every site.
Privacy concerns.
Goal: automatically diagnose software failures occurring during production runs at the end user's site:
Understand a failure that has happened.
Find the root causes.
Minimize manual debugging.
Current state of the art
Offsite diagnosis:
Interactive debuggers.
Program slicing.
Deterministic replay tools.
Large overhead makes them impractical for production sites.
All require manual analysis.
Privacy concerns.
Primitive onsite diagnosis:
Unprocessed failure information collection.
Core dump analysis (partial execution path construction).
Onsite Diagnosis
Efficiently reproduce the failure that occurred (i.e., fast and automatic).
Impose little overhead during normal execution.
Require no human involvement.
Require no prior knowledge.
Triage
Captures the failure point and conducts just-in-time failure diagnosis with checkpoint/re-execution.
Delta generation and delta analysis.
An automated, top-down, human-like software failure diagnosis protocol.
Reports:
Failure nature and type.
Failure-triggering conditions.
Failure-related code/variables and the fault propagation chain.
Triage Architecture
3 groups of components:
1. Runtime Group.
2. Control Group.
3. Analysis Group.
Checkpoint & Reexecution
Uses Rx (previous work by the authors).
Rx checkpointing:
Uses fork()-like operations (sketch below).
Keeps a copy of accessed files and file pointers.
Records messages using a network proxy.
Re-execution may be deliberately modified (e.g., different inputs or environment).
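To make the mechanism concrete, here is a minimal sketch of fork()-based checkpointing, assuming a single-threaded process; checkpoint() and rollback() are illustrative names, not Rx's actual API:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Take a checkpoint by forking: the stopped child holds a
     * copy-on-write snapshot of the parent's address space. */
    static pid_t checkpoint(void)
    {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); exit(1); }
        if (pid > 0)
            return pid;     /* parent: continue normal execution */
        raise(SIGSTOP);     /* child: freeze until rollback */
        return 0;           /* execution resumes here after rollback() */
    }

    /* Roll back by waking the snapshot and letting it take over. */
    static void rollback(pid_t snap)
    {
        kill(snap, SIGCONT);
        _exit(0);
    }

    int main(void)
    {
        pid_t snap = checkpoint();
        if (snap > 0)
            rollback(snap);     /* pretend a failure was detected */
        else
            printf("re-executing from checkpoint for diagnosis\n");
        return 0;
    }

Rx additionally snapshots file state and buffers network messages in its proxy so that re-execution can replay, reorder, or drop them.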
Lightweight Monitoring for Detecting Failures
Must not impose high overhead.
Cheapest way: catch fault traps (sketch below):
Assertions.
Access violations.
Divide by zero.
More…
Extensions: branch histories, system call traces…
Triage only uses exceptions and assertions.
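As a rough illustration, such fault traps can be caught with POSIX signal handlers; triage_on_failure is a hypothetical hook standing in for Triage's rollback-and-diagnose path:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical hook: Triage would roll back to a checkpoint and
     * start just-in-time diagnosis here. write() is used because it
     * is async-signal-safe. */
    static void triage_on_failure(int sig)
    {
        (void)sig;
        const char msg[] = "fault trap caught; diagnosis would start\n";
        write(2, msg, sizeof msg - 1);
        _exit(1);
    }

    static void install_fault_traps(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = triage_on_failure;
        sigaction(SIGSEGV, &sa, NULL);   /* access violations */
        sigaction(SIGBUS,  &sa, NULL);
        sigaction(SIGFPE,  &sa, NULL);   /* divide by zero */
        sigaction(SIGABRT, &sa, NULL);   /* failed assert() */
    }

    int main(void)
    {
        install_fault_traps();
        volatile int zero = 0;
        return 1 / zero;    /* raises SIGFPE -> handler fires */
    }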
Control layer
Implements the Triage Diagnosis protocol.
Controls re-executions with different inputs based on past results.
Chooses the analysis techniques to apply.
Collects results and sends them to offsite programmers.
Analysis Layer Techniques:
TDP: Triage Diagnosis Protocol
The protocol applies the analyses in order, each step narrowing the diagnosis (sample outputs for an example bug in parentheses); a control-loop sketch follows the list:
1. Simple replay (the bug is deterministic).
2. Core dump analysis (stack/heap OK; segmentation fault in strlen()).
3. Dynamic bug detection (null-pointer dereference).
4. Delta generation (collection of good and bad inputs).
5. Delta analysis (code paths leading to the fault).
6. Report.
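A minimal sketch of how the control layer might drive these stages in order; the stage names come from the slide, but the C structure and stub functions are purely illustrative (in Triage each stage runs against a rolled-back re-execution):

    #include <stdbool.h>
    #include <stdio.h>

    struct stage {
        const char *name;
        bool (*run)(void);    /* true = stage produced findings */
    };

    /* Stubs standing in for the real analyses. */
    static bool simple_replay(void)  { return true; }
    static bool core_dump(void)      { return true; }
    static bool bug_detectors(void)  { return true; }
    static bool delta_generate(void) { return true; }
    static bool delta_analyze(void)  { return true; }

    int main(void)
    {
        struct stage tdp[] = {
            { "simple replay",         simple_replay  },
            { "core dump analysis",    core_dump      },
            { "dynamic bug detection", bug_detectors  },
            { "delta generation",      delta_generate },
            { "delta analysis",        delta_analyze  },
        };
        for (size_t i = 0; i < sizeof tdp / sizeof tdp[0]; i++)
            printf("stage %zu (%s): %s\n", i + 1, tdp[i].name,
                   tdp[i].run() ? "findings added to report"
                                : "no findings");
        return 0;
    }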
TDP: Triage Diagnosis Protocol
Example report
Protocol extensions and variations
Add different debugging techniques.
Reorder diagnosis steps.
Omit steps (e.g., memory checks for Java programs).
The protocol may be custom-designed for specific applications.
Try to fix bugs:
Filter failure-triggering inputs.
Dynamically delete code (risky).
Change variable values.
Automatic patch generation (future work?).
Delta Generation
Two goals:
1. Generate many similar replays: some that fail and some that don't.
2. Identify the signature of failure-triggering inputs.
Signatures may be used for:
Failure analysis and reproduction.
Input filtering (e.g., Vigilante, Autograph, etc.).
Delta Generation
Changing the input:
Replay previously stored client requests via the proxy; try different subsets and combinations.
Isolate the bug-triggering part by data "fuzzing" (see the sketch after this list).
Find non-failing inputs with minimum distance from failing ones.
Make protocol-aware changes.
Use a "normal form" of the input if the specific triggering portion is known.
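A toy sketch of the input-isolation idea, assuming a length-triggered failure; runs_ok() is a hypothetical oracle standing in for replaying the program from a checkpoint on each variant:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in oracle: pretend inputs longer than 8 bytes overflow
     * a buffer and fail, as in a stack-smash bug. */
    static bool runs_ok(const char *input)
    {
        return strlen(input) <= 8;
    }

    int main(void)
    {
        const char *failing = "ABCDEFGHI";   /* 9 bytes: fails */
        char variant[64];
        size_t n = strlen(failing);

        /* try every one-byte deletion: each variant is at edit
         * distance 1 from the failing input */
        for (size_t skip = 0; skip < n; skip++) {
            size_t k = 0;
            for (size_t i = 0; i < n; i++)
                if (i != skip)
                    variant[k++] = failing[i];
            variant[k] = '\0';
            printf("%-10s -> %s\n", variant,
                   runs_ok(variant) ? "passes" : "still fails");
        }
        return 0;
    }

Variants that pass give minimally different non-failing inputs, and the portion whose removal flips the outcome localizes the failure trigger.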
Changing the Environment:
Pad or zero-fill new allocations.
Change message order.
Drop messages.
Manipulate thread scheduling.
Modify the system environment.
Make use of information from prior steps (e.g., target specific buffers).
Delta Generation
Results passed to the next stage:
Break the code into basic blocks.
For each replay, extract a vector of execution counts for each block, plus the block trace.
The granularity can be changed.
Example revisited
Good run:
Trace: AHIKBDEFEF…EG
Block vector: {A:1, B:1, D:1, E:11, F:10, G:1, H:1, I:1, K:1}
Bad run:
Trace: AHIJBCDE
Block vector: {A:1, B:1, C:1, D:1, E:1, H:1, I:1, J:1}
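A small sketch of how such block vectors can be derived from recorded traces, using the slide's example (each letter names a basic block; the good run's "…" is expanded to match its stated counts of E:11 and F:10):

    #include <stdio.h>
    #include <string.h>

    /* Count how often each block (letter) appears in a trace. */
    static void bbv(const char *trace, int vec[26])
    {
        memset(vec, 0, 26 * sizeof vec[0]);
        for (const char *p = trace; *p; p++)
            vec[*p - 'A']++;
    }

    static void print_vec(const char *label, const int vec[26])
    {
        printf("%s: ", label);
        for (int b = 0; b < 26; b++)
            if (vec[b])
                printf("%c:%d ", 'A' + b, vec[b]);
        printf("\n");
    }

    int main(void)
    {
        const char *good = "AHIKBD" "EFEFEFEFEFEFEFEFEFEF" "EG"; /* "EF" x 10 */
        const char *bad  = "AHIJBCDE";
        int g[26], b[26];
        bbv(good, g);
        bbv(bad,  b);
        print_vec("good", g);   /* A:1 B:1 D:1 E:11 F:10 G:1 H:1 I:1 K:1 */
        print_vec("bad",  b);   /* A:1 B:1 C:1 D:1 E:1 H:1 I:1 J:1 */
        return 0;
    }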
Delta Analysis
Follows three steps:
1. Basic block vector (BBV) comparison: find the most similar pair of failing and non-failing replays, F and S.
2. Path comparison: compare the execution paths of F and S.
3. Intersection with the backward slice: find the differences that contribute to the failure.
Delta Analysis: BBV Comparison
The number of times each block is executed is recorded using instrumentation.
Calculate the Manhattan distance between every pair of failing and non-failing replays (the minimum-distance requirement can be relaxed to settle for similar pairs); a sketch follows.
In the example, the difference is {C:-1, E:10, F:10, G:1, J:-1, K:1}, giving a Manhattan distance of 24.
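A minimal sketch of the distance computation over the example's vectors (indices 0 through 10 correspond to blocks A through K):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* per-block execution counts for blocks A..K */
        int good[11] = {1, 1, 0, 1, 11, 10, 1, 1, 1, 0, 1};
        int bad[11]  = {1, 1, 1, 1,  1,  0, 0, 1, 1, 1, 0};

        /* Manhattan distance: sum of absolute per-block differences */
        int dist = 0;
        for (int b = 0; b < 11; b++)
            dist += abs(good[b] - bad[b]);
        printf("Manhattan distance = %d\n", dist);   /* prints 24 */
        return 0;
    }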
Delta Analysis: Path Comparison
Consider execution order.
Find where the failing and non-failing runs diverge.
Compute the minimum edit distance, i.e., the minimum number of insertion, deletion, and substitution operations needed to transform one trace into the other.
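As an illustration, the standard dynamic-programming edit distance over the example's two traces; Triage's comparison runs in O(ND) time where D is the distance, but this simpler O(NM) version is used here for clarity:

    #include <stdio.h>
    #include <string.h>

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    int main(void)
    {
        const char *s = "AHIKBD" "EFEFEFEFEFEFEFEFEFEF" "EG"; /* good run */
        const char *t = "AHIJBCDE";                            /* bad run  */
        int n = (int)strlen(s), m = (int)strlen(t);
        static int d[64][64];

        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                d[i][j] = min3(d[i-1][j] + 1,              /* delete     */
                               d[i][j-1] + 1,              /* insert     */
                               d[i-1][j-1]
                                   + (s[i-1] != t[j-1]));  /* substitute */
        printf("minimum edit distance = %d\n", d[n][m]);
        return 0;
    }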
Delta Analysis: Backward Slicing
We want to eliminate differences that have no effect on the failure.
Dynamic backward slicing extracts a program slice consisting of all and only those instructions that lead to a given instruction's execution (a toy sketch follows).
The starting point may be supplied by earlier steps of the protocol.
The overhead is acceptable in post-hoc analysis.
Optimization: build the dependencies dynamically during replays.
Experiments show the overhead is acceptably low.
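A toy sketch of the backward walk over a recorded def-use log, assuming single-letter variable names; the log and statements are invented for illustration and are not Triage's representation:

    #include <stdio.h>
    #include <string.h>

    /* One executed statement: its id, the variable it writes, and
     * the variables it reads. */
    struct event { int id; char writes; const char *reads; };

    int main(void)
    {
        struct event log[] = {
            {1, 'a', ""  },   /* a = input()                     */
            {2, 'b', "a" },   /* b = a + 1                       */
            {3, 'c', ""  },   /* c = 0   (irrelevant to failure) */
            {4, 'd', "b" },   /* d = buf[b]   <- faulting point  */
        };
        int n = sizeof log / sizeof log[0];

        /* Walk backward from the slicing criterion (variable 'd' at
         * the fault): a statement joins the slice if it writes a
         * variable the slice still needs, and its reads become
         * needed in turn. */
        char need[32] = "d";
        for (int i = n - 1; i >= 0; i--) {
            if (strchr(need, log[i].writes)) {
                printf("statement %d is in the slice\n", log[i].id);
                strcat(need, log[i].reads);
            }
        }
        return 0;   /* selects statements 4, 2, 1; statement 3 is sliced out */
    }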
Backward Slicing and Result Intersection
Limitations and Extensions
Need to define a privacy policy for the results sent to programmers.
Very limited success with patch generation.
Does not handle memory leaks well.
The failure must actually occur; incorrect (but non-crashing) operation is not handled.
Bugs that take a long time to manifest are difficult to reproduce.
No support for deterministic replay on multi-processor architectures.
The analyses may report false positives.
Evaluation Methodology
Experimented with 10 real software failures in 9 applications.
Triage is implemented on Linux (kernel 2.4.22).
Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB memory, 100 Mbps Ethernet.
Triage checkpoints every 200 ms and keeps 20 checkpoints.
User study: 15 programmers were given 5 bugs, with Triage's report provided for some of them; time to locate each bug was compared with and without the report.
Bugs used for Evaluation
Name     Program          Application                  #LOC    Bug Type                Root Cause Description
Apache1  apache-1.3.27    A web server                 114K    Stack smash             Long alias match pattern overflows a local array
Apache2  apache-1.3.12    A web server                 102K    Semantic (NULL ptr)     Missing part of the URL causes a NULL pointer dereference
CVS      cvs-1.11.4       GNU version control server   115K    Double free             Error-handling code in the wrong order leads to a double free
MySQL    mysql-4.0.12     A database server            1028K   Data race               Database logging error in case of a data race
Squid    squid-2.3        A web proxy cache server     94K     Heap buffer overflow    Buffer length calculation misses special-character cases
BC       bc-1.06          Interactive algebraic lang.  17K     Heap buffer overflow    Wrong variable used in a for-loop end condition
Linux    linux-extract    Extracted from linux-2.6.6   0.3K    Semantic (copy-paste)   Variable identifier not changed after copy-paste
MAN      man-1.5h1        Documentation tools          4.7K    Global buffer overflow  Wrong for-loop end condition
NCOMP    ncompress-1.2.4  File (de)compression         1.9K    Stack smash             Fixed-length array cannot hold a long input file name
TAR      tar-1.13.25      GNU tar archive tool         27K     Semantic (NULL ptr)     Directory property corner case is not well handled
Experimental Results
(Results table shown as a figure; these runs use no input testing.)
Experimental Results
Among the application bugs, delta generation only worked for BC and TAR.
In all cases, Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic).
In all 6 applicable cases, Triage correctly pinpoints the bug type, buggy instruction, and memory location.
When delta analysis is applied, it reduces the amount of data to be considered by 63% on average (best: 98%; worst: 12%).
For MySQL, Triage finds an example interleaving pair as the trigger.
Case Study 1: Apache
The failure occurs at ap_pregsub.
The bug detector catches a stack smash in lmatcher.
How can lmatcher affect try_alias_list? The stack smash overwrites the stack frame above it, invalidating r.
The trace shows how lmatcher is called by try_alias_list.
The failure is independent of the headers.
The failure is triggered by requests for a specific resource.
Case Study 2: Squid
Core dump analysis suggests a heap overflow.
It happens at a strcat of two buffers.
Fault propagation shows how the buffers were allocated: one has strlen(usr) while the other has strlen(user)*3.
Input testing yields the failure-triggering input.
It also gives minimally different non-failing inputs.
Efficiency and Overhead
Normal execution overhead:
Negligible effect from checkpointing.
In no case over 5%.
With 400 ms checkpointing intervals, the overhead is 0.1%.
Efficiency and Overhead
Diagnosis efficiency:
Except for delta analysis, all steps are efficient; all other diagnostic steps finish within 5 minutes.
Delta analysis time is governed by the edit distance D in the O(ND) computation (N = number of blocks).
The comparison step of delta analysis may run in the background.
User Study
Real bugs: on average, programmers took 44.6% less time debugging with Triage reports.
Toy bugs: on average, programmers took 18.4% less time debugging with Triage reports.
Questions?