Triage: Diagnosing Production Run Failures at the User's Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou
University of Illinois at Urbana-Champaign
Motivation
 Software failures are a major contributor to system downtime.
 Security holes.
 Software has grown in size, complexity and cost.
 Software testing has become more difficult.
 Software packages inevitably contain bugs (even production ones).
Motivation
 Result: software failures during production runs at the user's site.
 One solution: offsite software diagnosis. Its problems:
 Difficult to reproduce the failure-triggering conditions.
 Cannot provide timely online recovery (e.g., from fast Internet worms).
 Programmers cannot be provided to every site.
 Privacy concerns.
Goal: automatically diagnose software failures occurring during production runs at the end-user's site.
 Understand a failure that has happened.
 Find the root causes.
 Minimize manual debugging.
Current state of the art
Offsite diagnosis:
Primitive onsite diagnosis:
 Interactive debuggers.
 Unprocessed failure
 Program slicing.
information collections.
 Deterministic replay tools.
 Core Dump analysis
(Partial execution path
construction).
Large overhead makes it
impractical for production
sites.
All require manual analysis.
Privacy concerns.
Onsite Diagnosis
An onsite diagnosis tool should:
 Efficiently reproduce the failure that occurred (i.e., fast and automatic).
 Impose little overhead during normal execution.
 Require no human involvement.
 Require no prior knowledge.
Triage
 Captures the failure point and conducts just-in-time failure diagnosis with checkpoint/reexecution.
 Delta generation and delta analysis.
 Automated, top-down, human-like software failure diagnosis protocol.
 Reports:
 Failure nature and type.
 Failure-triggering conditions.
 Failure-related code/variables and the fault propagation chain.
Triage Architecture
Three groups of components:
1. Runtime group.
2. Control group.
3. Analysis group.
Checkpoint & Reexecution
 Uses Rx (Previous work by authors).
 Rx checkpointing:
 Use fork()-like operations.
 Keeps a copy of accessed files and file pointers.
 Record messages using a network proxy.
 Replay may be potentially modified.
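For concreteness, here is a minimal sketch of the fork()-based checkpointing idea, assuming a single-process application. The real Rx also snapshots file state and buffers network messages through its proxy, which this sketch omits:

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a checkpoint: the child becomes a frozen copy-on-write snapshot
 * of the address space; the parent continues normal execution. */
static pid_t take_checkpoint(void) {
    pid_t snap = fork();
    if (snap < 0) { perror("fork"); exit(1); }
    if (snap == 0) {
        raise(SIGSTOP);   /* freeze until woken for reexecution */
        return 0;         /* resumes here on SIGCONT            */
    }
    return snap;
}

int main(void) {
    pid_t snap = take_checkpoint();
    if (snap == 0) {
        /* Reexecution path: diagnosis steps would run here,
         * possibly with modified inputs or environment. */
        printf("reexecuting from checkpoint\n");
        return 0;
    }
    /* Normal execution; on failure detection, roll back: */
    waitpid(snap, NULL, WUNTRACED);  /* wait until snapshot is frozen */
    kill(snap, SIGCONT);             /* wake it to reexecute          */
    waitpid(snap, NULL, 0);
    return 0;
}
```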
Lightweight Monitoring for Detecting Failures
 Must not impose high overhead.
 Cheapest way: catch fault traps:
 Assertions.
 Access violations.
 Divide by zero.
 More…
 Extensions: branch histories, system call traces, …
 Triage only uses exceptions and assertions (a sketch of trap installation follows).
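A minimal sketch of trap-based detection, assuming POSIX signal handlers are the catch mechanism; the slides name the events but not the implementation:

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Runs on a fault trap. Only async-signal-safe calls are allowed here;
 * in Triage this is where rollback-and-diagnose would be triggered. */
static void on_fault(int sig) {
    static const char msg[] = "fault trap caught; starting diagnosis\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(128 + sig);
}

static void install_fault_traps(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigaction(SIGSEGV, &sa, NULL);  /* access violations */
    sigaction(SIGBUS,  &sa, NULL);  /* bad memory access */
    sigaction(SIGFPE,  &sa, NULL);  /* divide by zero    */
    sigaction(SIGABRT, &sa, NULL);  /* failed assert()   */
}

int main(void) {
    install_fault_traps();
    volatile int zero = 0;
    return 1 / zero;                /* triggers SIGFPE   */
}
```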
Control Layer
 Implements the Triage diagnosis protocol.
 Controls reexecutions with different inputs based on past results.
 Chooses the analysis techniques.
 Collects results and sends them to off-site programmers.
Analysis Layer Techniques:
TDP: Triage Diagnosis Protocol
Diagnosis proceeds top-down; each step produces an intermediate result (example outputs shown; a sketch of a driver loop follows the list):
1. Simple replay → deterministic bug.
2. Coredump analysis → stack/heap OK; segmentation fault in strlen().
3. Dynamic bug detection → null-pointer dereference.
4. Delta generation → collection of good and bad inputs.
5. Delta analysis → code paths leading to the fault.
6. Report.
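As a rough illustration of the top-down structure (not the authors' actual code), a driver could run each step against a fresh reexecution from the checkpoint and accumulate its findings into the report. All step names and return strings here are illustrative:

```c
#include <stdio.h>

typedef const char *(*diag_step_fn)(void);

/* Stand-ins for the real diagnosis steps; each would trigger a
 * reexecution from the checkpoint and return its finding. */
static const char *simple_replay(void)  { return "deterministic bug"; }
static const char *coredump_scan(void)  { return "stack/heap OK; segfault in strlen()"; }
static const char *bug_detectors(void)  { return "null-pointer dereference"; }
static const char *delta_generate(void) { return "set of good and bad inputs"; }
static const char *delta_analyze(void)  { return "code paths leading to the fault"; }

int main(void) {
    struct { const char *name; diag_step_fn run; } steps[] = {
        { "simple replay",         simple_replay  },
        { "coredump analysis",     coredump_scan  },
        { "dynamic bug detection", bug_detectors  },
        { "delta generation",      delta_generate },
        { "delta analysis",        delta_analyze  },
    };
    for (int i = 0; i < 5; i++)   /* each finding goes into the report */
        printf("%s -> %s\n", steps[i].name, steps[i].run());
    return 0;
}
```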
TDP: Triage Diagnosis Protocol
Example report (shown as a figure on the slide).
Protocol Extensions and Variations
 Add different debugging techniques.
 Reorder diagnosis steps.
 Omit steps (e.g., memory checks for Java programs).
 The protocol may be custom-designed for specific applications.
 Try to fix bugs:
 Filter failure-triggering inputs.
 Dynamically delete code – risky.
 Change variable values.
 Automatic patch generation – future work?
Delta Generation
 Two goals:
1. Generate many similar replays: some that fail and some that don't.
2. Identify the signature of failure-triggering inputs.
 Signatures may be used for:
 Failure analysis and reproduction.
 Input filtering, e.g., Vigilante, Autograph, etc.
Delta Generation
Changing the input:
 Replay previously stored client requests via the proxy – try different subsets and combinations.
 Isolate the bug-triggering part – data "fuzzing" (a toy sketch follows this list).
 Find non-failing inputs with minimum distance from failing ones.
 Make protocol-aware changes.
 Use a "normal form" of the input, if the specific triggering portion is known.
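A toy sketch of the "isolate the bug-triggering part" idea: delete one byte at a time and replay; bytes whose removal makes the failure disappear belong to the trigger. The replay_fails() predicate and the "%n" trigger are contrived stand-ins for an actual reexecution:

```c
#include <stdio.h>
#include <string.h>

/* Contrived stand-in for replaying from a checkpoint with a modified
 * input: "fails" whenever the trigger substring "%n" is still present. */
static int replay_fails(const char *input) {
    return strstr(input, "%n") != NULL;
}

/* Try one-byte deletions; variants that stop failing mark bytes that
 * belong to the failure-triggering part of the input. */
static void isolate_trigger(const char *input) {
    size_t len = strlen(input);
    char variant[256];
    for (size_t i = 0; i < len && len < sizeof variant; i++) {
        memcpy(variant, input, i);            /* prefix before byte i */
        strcpy(variant + i, input + i + 1);   /* drop byte i          */
        if (!replay_fails(variant))
            printf("byte %zu ('%c') is part of the trigger\n", i, input[i]);
    }
}

int main(void) {
    isolate_trigger("GET /a%nb HTTP/1.0");    /* flags bytes 6 and 7 */
    return 0;
}
```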
Changing the environment:
 Pad or zero-fill new allocations.
 Change message order.
 Drop messages.
 Manipulate thread scheduling.
 Modify the system environment.
 Make use of information from prior steps (e.g., target specific buffers).
Delta Generation
 Results passed to the next stage:
 Break the code into basic blocks.
 For each replay, extract a vector of per-block exercise counts and the block trace.
 The granularity can be changed.
Example revisited
Good run:
 Trace: AHIKBDEFEF…EG
 Block vector: {A:1, B:1, D:1, E:11, F:10, G:1, H:1, I:1, K:1}
Bad run:
 Trace: AHIJBCDE
 Block vector: {A:1, B:1, C:1, D:1, E:1, H:1, I:1, J:1}
(A sketch of extracting such a vector from a trace follows.)
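A small sketch of turning a block trace into a basic block vector, using the good run above; the elided trace is expanded consistently with the stated counts E:11 and F:10:

```c
#include <stdio.h>

int main(void) {
    /* Good-run trace AHIKBDEFEF…EG, expanded so that E:11 and F:10. */
    const char *trace = "AHIKBDEFEFEFEFEFEFEFEFEFEFEG";
    int bbv[26] = {0};
    for (const char *p = trace; *p; p++)
        bbv[*p - 'A']++;                 /* exercise count per block */
    for (int i = 0; i < 26; i++)
        if (bbv[i])
            printf("%c:%d ", 'A' + i, bbv[i]);
    printf("\n");  /* prints A:1 B:1 D:1 E:11 F:10 G:1 H:1 I:1 K:1 */
    return 0;
}
```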
Delta Analysis
Follows three steps:
1. Basic block vector (BBV) comparison: find the pair of most similar failing and non-failing replays, F and S.
2. Path comparison: compare the execution paths of F and S.
3. Intersection with backward slice: find the differences that contribute to the failure.
Delta Analysis: BBV Comparison
 The number of times each basic block is executed is recorded using instrumentation.
 Calculate the Manhattan distance between every pair of failing and non-failing replays (the minimum-distance requirement can be relaxed to merely "similar").
 In the example, the difference vector is {C:−1, E:10, F:10, G:1, J:−1, K:1}, giving a Manhattan distance of 24 (see the sketch below).
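A direct sketch of the BBV comparison on the example vectors, assuming blocks A–K are mapped to array indices 0–10:

```c
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 11   /* blocks A..K */

static int manhattan(const int x[], const int y[], int n) {
    int d = 0;
    for (int i = 0; i < n; i++)
        d += abs(x[i] - y[i]);   /* per-block count difference */
    return d;
}

int main(void) {
    /* index:             A  B  C  D   E   F  G  H  I  J  K */
    int good[NBLOCKS] = { 1, 1, 0, 1, 11, 10, 1, 1, 1, 0, 1 };
    int bad[NBLOCKS]  = { 1, 1, 1, 1,  1,  0, 0, 1, 1, 1, 0 };
    printf("Manhattan distance = %d\n", manhattan(good, bad, NBLOCKS));
    return 0;   /* prints 24, matching the example */
}
```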
Delta Analysis: Path Comparison
 Considers execution order.
 Finds where the failing and non-failing runs diverge.
 Computes the minimum edit distance, i.e., the minimum number of insertion, deletion, and substitution operations needed to transform one path into the other.
 Example: shown as a figure on the slide (a minimal DP sketch follows).
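Triage uses an O(ND) difference algorithm (see the efficiency slide); the following is the simpler O(NM) dynamic-programming formulation of the same minimum edit distance, applied to the example traces (the good trace expanded as before):

```c
#include <stdio.h>
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Classic edit distance over block traces, one character per block. */
static int edit_distance(const char *s, const char *t) {
    int n = strlen(s), m = strlen(t);
    int dp[64][64];                  /* assumes traces under 64 blocks */
    for (int i = 0; i <= n; i++) dp[i][0] = i;
    for (int j = 0; j <= m; j++) dp[0][j] = j;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            dp[i][j] = min3(dp[i-1][j] + 1,          /* delete     */
                            dp[i][j-1] + 1,          /* insert     */
                            dp[i-1][j-1] +           /* substitute */
                              (s[i-1] != t[j-1]));
    return dp[n][m];
}

int main(void) {
    printf("edit distance = %d\n",
           edit_distance("AHIKBDEFEFEFEFEFEFEFEFEFEFEG", "AHIJBCDE"));
    return 0;
}
```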
Delta Analysis: Backward Slicing
 Want to eliminate differences that have no effect on the failure.
 Dynamic backward slicing extracts a program slice consisting of all and only those instructions that lead to a given instruction's execution (a toy sketch follows this list).
 The starting point may be supplied by earlier steps of the protocol.
 Overhead is acceptable in post-hoc analysis.
 Optimization: dynamically build dependencies during replays.
 Experiments show that the overhead is acceptably low.
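A toy sketch of the data-dependence half of dynamic backward slicing (real slices also follow control dependencies); the instruction trace and variable names are purely illustrative:

```c
#include <stdio.h>
#include <string.h>

typedef struct { const char *def, *use1, *use2; } Instr;

static int find(const char **set, int n, const char *v) {
    for (int i = 0; i < n; i++)
        if (v && strcmp(set[i], v) == 0) return i;
    return -1;
}

int main(void) {
    /* Linear trace recorded during replay; each entry defines one
     * variable and uses up to two. */
    Instr trace[] = {
        { "a", "input", NULL },   /* 0 */
        { "b", "a",     NULL },   /* 1 */
        { "c", "noise", NULL },   /* 2: irrelevant to the fault */
        { "d", "b",     NULL },   /* 3: faulting instruction    */
    };
    int fault = 3;
    const char *rel[16]; int nrel = 0;   /* variables still needed */
    if (trace[fault].use1) rel[nrel++] = trace[fault].use1;
    if (trace[fault].use2) rel[nrel++] = trace[fault].use2;
    printf("slice: %d", fault);
    for (int i = fault - 1; i >= 0; i--) {
        int k = find(rel, nrel, trace[i].def);
        if (k < 0) continue;             /* def not needed: skip  */
        rel[k] = rel[--nrel];            /* def satisfied         */
        if (trace[i].use1) rel[nrel++] = trace[i].use1;
        if (trace[i].use2) rel[nrel++] = trace[i].use2;
        printf(" %d", i);                /* instruction in slice  */
    }
    printf("\n");                        /* prints: slice: 3 1 0  */
    return 0;
}
```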
Backward Slicing and Result Intersection
(Shown as a figure on the slide: the path difference is intersected with the backward slice.)
Limitations and Extensions
 Need to define a privacy policy for the results sent to programmers.
 Very limited success with patch generation.
 Does not handle memory leaks well.
 The failure must occur; does not handle incorrect operation.
 Difficult to reproduce bugs that take a long time to manifest.
 No support for deterministic replay on multiprocessor architectures.
 False positives.
Evaluation Methodology
 Experimented with 10 real software failures in 9 applications.
 Triage is implemented on Linux (kernel 2.4.22).
 Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB memory and 100 Mbps Ethernet.
 Triage checkpoints every 200 ms and keeps 20 checkpoints.
 User study: 15 programmers were given 5 bugs, and Triage's report for some of the bugs; time to locate each bug was compared with and without the report.
Bugs used for Evaluation

| Name | Program | App Description | #LOC | Bug Type | Root Cause Description |
|------|---------|-----------------|------|----------|------------------------|
| Apache1 | apache-1.3.27 | A web server | 114K | Stack smash | Long alias match pattern overflows a local array |
| Apache2 | apache-1.3.12 | A web server | 102K | Semantic (NULL ptr) | Missing part of the URL causes a NULL pointer dereference |
| CVS | cvs-1.11.4 | GNU version control server | 115K | Double free | Error-handling code placed in the wrong order leads to a double free |
| MySQL | mysql-4.0.12 | A database server | 1028K | Data race | Database logging error in case of a data race |
| Squid | squid-2.3 | A web proxy cache server | 94K | Heap buffer overflow | Buffer length calculation misses special-character cases |
| BC | bc-1.06 | Interactive algebraic language | 17K | Heap buffer overflow | Wrong variable used in a for-loop end condition |
| Linux | linux-extract | Extracted from linux-2.6.6 | 0.3K | Semantic (copy-paste error) | Variable identifier not changed after copy-paste |
| MAN | man-1.5h1 | Documentation tools | 4.7K | Global buffer overflow | Wrong for-loop end condition |
| NCOMP | ncompress-1.2.4 | File (de)compression | 1.9K | Stack smash | Fixed-length array cannot hold a long input file name |
| TAR | tar-1.13.25 | GNU tar archive tool | 27K | Semantic (NULL ptr) | Directory property corner case is not well handled |
Experimental Results
(Results table shown as a figure on the slide; no input testing.)
Experimental Results
 For application bugs, delta generation only worked for BC and TAR.
 In all cases, Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic).
 In all 6 applicable cases, Triage correctly pinpoints the bug type, buggy instruction, and memory location.
 When delta analysis is applied, it reduces the amount of data to be considered by 63% (best: 98%; worst: 12%).
 For MySQL, Triage finds an example interleaving pair as the trigger.
Case Study 1: Apache
 Failure at ap_gregsub.
 The bug detector catches a stack smash in lmatcher.
 How can lmatcher affect try_alias_list? The stack smash overwrites the stack frame above it, invalidating r.
 The trace shows how lmatcher is called by try_alias_list.
 The failure is independent of the headers.
 The failure is triggered by requests for a specific resource.
Case Study 2: Squid
 Coredump analysis suggests a heap overflow.
 It happens at a strcat of two buffers.
 Fault propagation shows how the buffers were allocated.
 One buffer has length strlen(usr) while the other has strlen(user)*3.
 Input testing gives the failure-triggering input.
 It also gives minimally different non-failing inputs.
Efficiency and Overhead
Normal execution overhead:
 Negligible effect caused by checkpointing.
 In no case over 5%.
 With 400 ms checkpointing intervals, the overhead is 0.1%.
Efficiency and Overhead
Diagnosis efficiency:
 Except for delta analysis, all steps are efficient.
 All other diagnostic steps finish within 5 minutes.
 Delta analysis time is governed by the edit distance D in the O(ND) computation (N = number of blocks).
 The comparison step of delta analysis may run in the background.
User Study
 Real bugs: on average, programmers took 44.6% less time to debug using Triage reports.
 Toy bugs: on average, programmers took 18.4% less time to debug using Triage reports.
Questions?