Transcript (slides)

A HUMAN STUDY OF FAULT LOCALIZATION ACCURACY
Zachary P. Fry
Westley Weimer
University of Virginia
September 16, 2010
SOFTWARE MAINTENANCE

Maintenance can account for the majority of the software lifecycle

What if we knew how easy it was to locate faults in a code base beforehand?

Locating defects in code is a considerable challenge
Engineer systems to make bug finding easier
Concentrate on problem areas
Could we develop a model that measures this?
How would we gather a data set?
2
PROBLEM – FAULT LOCALIZATION
We treat fault localization as the task of determining if a program or code fragment contains a defect and, if so, locating the line where that defect resides
Research question: Which factors contribute to a human’s ability to detect and locate defects?

3
PROBLEM – FAULT LOCALIZATION

We examine four categories of defect and code characteristics
  Error type
  Surface and syntactical features
  Control flow and contextual features
  Abstraction

Which of these affect humans’ abilities to locate defects in code?
4
OUTLINE





Motivation
Structure of Model
Human Study
Evaluation of Model
Conclusions
5
MOTIVATION: AN EXAMPLE
/** Move a single disk from src to dest. */
public static void hanoi1(int src, int dest) {
    System.out.println(src + " => " + dest);
}

/** Move two disks from src to dest,
    making use of a spare peg. */
public static void hanoi2(int src, int dest, int spare) {
    hanoi1(src, dest);
    System.out.println(src + " => " + dest);
    hanoi1(spare, dest);
}

/** Move three disks from src to dest,
    making use of a spare peg. */
public static void hanoi3(int src, int dest, int spare) {
    hanoi2(src, spare, dest);
    System.out.println(src + " => " + dest);
    hanoi2(spare, dest, src);
}

33% of participants correctly located the defect
Callout on slide: hanoi1(src, spare);
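
For comparison, here is a minimal runnable sketch of the same listing with the callout statement hanoi1(src, spare); used as the first call in hanoi2, on the assumption that the callout marks the intended statement. The class name HanoiUnrolled and the list that records the moves are illustrative additions and do not appear on the slide.

```java
import java.util.ArrayList;
import java.util.List;

class HanoiUnrolled {
    /** Move a single disk from src to dest, recording the move. */
    static void hanoi1(int src, int dest, List<String> moves) {
        moves.add(src + " => " + dest);
    }

    /** Move two disks from src to dest, making use of a spare peg. */
    static void hanoi2(int src, int dest, int spare, List<String> moves) {
        hanoi1(src, spare, moves);       // assumed fix: top disk goes to the spare peg first
        moves.add(src + " => " + dest);  // bottom disk goes to the destination
        hanoi1(spare, dest, moves);      // top disk follows it
    }

    /** Move three disks from src to dest, making use of a spare peg. */
    static void hanoi3(int src, int dest, int spare, List<String> moves) {
        hanoi2(src, spare, dest, moves);
        moves.add(src + " => " + dest);
        hanoi2(spare, dest, src, moves);
    }

    public static void main(String[] args) {
        List<String> moves = new ArrayList<>();
        hanoi3(1, 3, 2, moves);
        moves.forEach(System.out::println);
    }
}
```

With this change, three disks are moved in seven steps, which is the minimum for the puzzle.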
6
TOWERS OF HANOI – VERSION 2

More complex control flow
  if/else statement
  recursion
Rich commenting
Descriptive identifiers
53% of participants correctly located the fault
Callout on slide: moveOneDisk (start, end);
/*******************************************
  Performs the initial call to moveTower
  to solve the puzzle. Moves the disks
  from tower 1 to tower 3 using tower 2.
********************************************/
public void solve () {
    moveTower (totalDisks, 1, 3, 2);
}

/*******************************************
  Moves the specified number of disks
  from one tower to another by moving a
  subtower of n-1 disks out of the way,
  moving one disk, then moving the
  subtower back. Base case of 1 disk.
********************************************/
private void moveTower (int numDisks, int start, int end, int temp) {
    if (numDisks == 1)
        moveTower(numDisks-1, temp, end, start);
    else {
        moveTower (numDisks-1, start, temp, end);
        moveOneDisk (start, end);
        moveTower (numDisks-1, temp, end, start);
    }
}

/*******************************************
  Prints instructions to move one disk
  from the specified start tower to the
  specified end tower.
*******************************************/
private void moveOneDisk (int start, int end) {
    System.out.println ("Move one disk from "
        + start + " to " + end);
}
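
A minimal sketch of the recursive version with the callout statement moveOneDisk (start, end); placed in the base case, assuming that is what the callout indicates. As listed on the slide, the base case calls moveTower with zero disks and the recursion never terminates. The class name TowersSketch, the totalDisks parameter on solve, and the list of recorded moves are illustrative additions.

```java
import java.util.ArrayList;
import java.util.List;

class TowersSketch {
    private final List<String> moves = new ArrayList<>();

    /** Solves the puzzle from tower 1 to tower 3 using tower 2 and returns the moves. */
    List<String> solve(int totalDisks) {
        moves.clear();
        moveTower(totalDisks, 1, 3, 2);
        return moves;
    }

    private void moveTower(int numDisks, int start, int end, int temp) {
        if (numDisks == 1) {
            moveOneDisk(start, end);  // assumed fix: base case moves the single disk
        } else {
            moveTower(numDisks - 1, start, temp, end);
            moveOneDisk(start, end);
            moveTower(numDisks - 1, temp, end, start);
        }
    }

    private void moveOneDisk(int start, int end) {
        moves.add("Move one disk from " + start + " to " + end);
    }

    public static void main(String[] args) {
        for (String move : new TowersSketch().solve(3)) {
            System.out.println(move);
        }
    }
}
```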
7
MODEL – OVERVIEW



We desire a model of human fault localization accuracy that, given source code as input, can predict the likelihood that a human will be able to accurately locate faults within it
We hypothesize that features relevant to such a model will fall into four categories: fault type, syntax, context, and abstraction
  Existing work tends to focus on only one of these areas at a time
Linear regression – trained on human study data
  Ease of analysis
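
To make the modeling choice concrete, here is a minimal sketch of fitting a linear model by ordinary least squares. It uses a single feature for brevity, whereas the model described on this slide combines many features. The class name SimpleLinearRegression and all numbers are made up for illustration and are not study data.

```java
class SimpleLinearRegression {
    final double slope;
    final double intercept;

    /** Fits y = intercept + slope * x by ordinary least squares. */
    SimpleLinearRegression(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        slope = sxy / sxx;
        intercept = meanY - slope * meanX;
    }

    /** Predicted accuracy for a file with the given feature value. */
    double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        // Hypothetical data: block nesting level vs. fraction of participants who found the fault.
        double[] nesting = {1, 2, 3, 4};
        double[] accuracy = {0.9, 0.7, 0.5, 0.3};
        SimpleLinearRegression model = new SimpleLinearRegression(nesting, accuracy);
        System.out.println("slope = " + model.slope + ", intercept = " + model.intercept);
    }
}
```

The fitted coefficients can be read directly as the direction and size of each feature's effect, which is one sense in which a linear model eases analysis.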
8
DEFECT FEATURES

Error type
Adapted and expanded existing Knight taxonomy
Sampled from consecutive Mozilla bugs to obtain types and distribution
We consider 17 total types of single-line defects

Missing statement
Uninitialized variable
Extra assignment
Incorrect type
Incorrect constant
Incorrect parameter
Negated conditional
Incorrect method call
Incorrect variable
…
9
MODEL – CODE FEATURES

Code based features
  Most measured automatically, some manually
  92 total

Syntax                    Context                     Abstraction
Block nesting level       Avg/Max CFG in-edges        Num of array-based structures
Number of method calls    Avg/Max CFG out-edges       Uses underlying data structure
Num of local vars         Avg CFG path length         Implements a heap
Num of var declarations   Num of CFG edges            Implements a tree
Num of var uses           Num of CFG leaves           Implements reheap
Avg line length           Ratio of “ifs” to “elses”   …
…                         …
10
HUMAN STUDY – PARTICIPANT SELECTION

215 fourth year students and volunteers from the internet (crowdsourcing)
Monetary reward given for completion to encourage best effort

Subset                  Average Accuracy   Number of Participants
All                     46.3%              65
Accuracy > 40%          55.2%              46
Experience > 4 years    51.5%              34
Experience = 4 years    46.7%              17
Experience < 4 years    33.4%              14
11
HUMAN STUDY – CODE SELECTION

Five textbooks
Three sets of code features to vary or control:
  Syntax and Surface
  Control flow and Contextual
  Abstraction
Provides similar concepts but differing presentations and/or implementations
45 Java files total
12
HUMAN STUDY – FAULT SEEDING

Types and distribution based on Mozilla
  All faults selected are limited to one line for simplicity
Random seeding
  Zero or one bugs per file
  Type chosen based on distribution
  All possible sites enumerated and one is randomly chosen
  Fault seeded manually, based on actual bugs if possible
20 line search-space windows
  To further control for code length and facilitate quick and accurate response
  Randomly chosen around the seeded fault location
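
The random parts of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the study's tooling: the class FaultSeedingSketch, its method names, the weights, and the candidate line numbers are all hypothetical, and the manual seeding step is not modeled.

```java
import java.util.*;

class FaultSeedingSketch {
    /** Picks a fault type with probability proportional to its weight. */
    static String pickType(Map<String, Double> distribution, Random rng) {
        double total = 0;
        for (double w : distribution.values()) total += w;
        double r = rng.nextDouble() * total;
        String last = null;
        for (Map.Entry<String, Double> e : distribution.entrySet()) {
            last = e.getKey();
            r -= e.getValue();
            if (r < 0) return last;
        }
        return last;  // guards against floating-point round-off
    }

    /** Picks one of the enumerated candidate sites uniformly at random. */
    static int pickSite(List<Integer> candidateLines, Random rng) {
        return candidateLines.get(rng.nextInt(candidateLines.size()));
    }

    /** Returns the first line (1-based) of a window of `size` lines that contains faultLine. */
    static int pickWindowStart(int faultLine, int fileLength, int size, Random rng) {
        int lo = Math.max(1, faultLine - size + 1);
        int hi = Math.min(faultLine, Math.max(1, fileLength - size + 1));
        return lo + rng.nextInt(hi - lo + 1);
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        Map<String, Double> distribution = new LinkedHashMap<>();
        distribution.put("Missing statement", 0.6);    // made-up weight
        distribution.put("Negated conditional", 0.4);  // made-up weight
        String type = pickType(distribution, rng);
        int faultLine = pickSite(Arrays.asList(12, 31, 58), rng);  // hypothetical sites
        int start = pickWindowStart(faultLine, 80, 20, rng);
        System.out.println(type + " at line " + faultLine
                + ", window " + start + "-" + (start + 19));
    }
}
```

Choosing the window start at random, rather than centering the window on the fault, keeps the fault's position within the window from giving participants a hint.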

13
HUMAN STUDY - PROTOCOL

Each participant sees 30 consecutive files and is asked:
  Is there a bug in this code?
  If so, on what line does the bug occur?
  How difficult do you feel this code is to understand (1-5)?

Participants cannot execute or automatically search the code – only manual inspection is permitted
14
EVALUATION

Three separate experiments
1. Examines defect type as related to fault localization accuracy
   Are certain bugs harder to find?
2. Examines Syntactical, Contextual, and Abstraction features as related to fault localization accuracy
   Does our model correlate with actual human ability to locate faults better than existing baselines?
3. Analysis of individual features
   What features contribute the most towards humans’ ability to locate defects in source code?
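
The second question compares how strongly different predictors correlate with observed human accuracy. A minimal sketch of one common measure, Pearson's r, is shown here; the slides do not state which correlation coefficient was used, so this choice is an assumption, and the class name and sample values are hypothetical.

```java
class CorrelationSketch {
    /** Pearson correlation coefficient between two equally sized samples. */
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;

        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Hypothetical per-file values: model prediction vs. observed human accuracy.
        double[] predicted = {0.30, 0.45, 0.50, 0.70};
        double[] observed = {0.25, 0.50, 0.45, 0.75};
        System.out.println("r = " + pearson(predicted, observed));
    }
}
```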
15
EVALUATION – EXPERIMENT 1

Goal: relate fault type to fault localization accuracy
16
EVALUATION – EXPERIMENT 2

Goal: measure how accurately our model predicts the ease of human fault localization
Two versions of our model
  All features vs. only those that are measured automatically
Baselines
  Code readability (syntactic and surface features)
  Cyclomatic complexity (contextual features)
  “Textbook difficulty” (chapter number in the textbook)
10-fold cross validation to mitigate over-fitting
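
A minimal sketch of how data points can be partitioned for k-fold cross validation: each point lands in exactly one held-out fold, and the model is trained on the remaining folds. The class name KFoldSketch is hypothetical, and the count of 45 is borrowed from the number of Java files purely as an example size; the slides do not say what unit was split.

```java
import java.util.*;

class KFoldSketch {
    /** Splits indices 0..n-1 into k disjoint folds after a seeded shuffle. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) result.add(new ArrayList<>());
        for (int i = 0; i < n; i++) {
            result.get(i % k).add(indices.get(i));  // deal indices round-robin into folds
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(45, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            int testSize = folds.get(f).size();
            // Train on every fold except f, then evaluate on fold f.
            System.out.println("fold " + f + ": train on " + (45 - testSize)
                    + ", test on " + testSize);
        }
    }
}
```

Because every prediction is scored on data the model did not see during training, a model that merely memorizes its training set gains nothing.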
17
EVALUATION – EXPERIMENT 2

Our model greatly outperforms the baselines
Automatic-only model does only slightly worse than the full model
18
EVALUATION – EXPERIMENT 2

Perceived difficulty is a concrete measure of understandability
Fault localization accuracy is correlated with understandability
While baselines do comparatively better, our model correlates in a similar fashion
19
EVALUATION – FEATURE ANALYSIS

ANOVA of features with respect to human accuracy

(type) - Feature                                    F       Pr(F)     Dir
abs – uses abstraction: array                       130.9   < 0.001   -
abs – provides abstraction: queue                   54.1    < 0.001   +
syn – ratio of constant to variable assignments     40.4    < 0.001   +
syn – avg block nesting level                       38.9    < 0.001   -
abs – provides abstraction: heap                    28.3    < 0.001   +
syn – max global variables                          25.6    < 0.001   +
abs – uses abstraction: linked list                 25.6    < 0.001   -
syn – ratio simple to constant conditional          20.6    < 0.001   -
cfg – max CFG out-edges per node                    10.0    0.002     -
cfg – avg CFG in-edges per node                     5.8     0.016     +
…
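
For readers unfamiliar with the F column, here is a minimal sketch of a one-way ANOVA F statistic: the ratio of between-group to within-group variance. The slides do not specify the ANOVA design, so treating a feature as a split of observations into groups is an assumption, and the class name and sample accuracies are made up.

```java
class OneWayAnova {
    /** F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups and N observations. */
    static double fStatistic(double[][] groups) {
        int k = groups.length;
        int n = 0;
        double grandSum = 0;
        for (double[] g : groups) {
            for (double v : g) {
                grandSum += v;
                n++;
            }
        }
        double grandMean = grandSum / n;

        double ssBetween = 0, ssWithin = 0;
        for (double[] g : groups) {
            double mean = 0;
            for (double v : g) mean += v;
            mean /= g.length;
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            for (double v : g) ssWithin += (v - mean) * (v - mean);
        }
        return (ssBetween / (k - 1)) / (ssWithin / (n - k));
    }

    public static void main(String[] args) {
        // Hypothetical accuracies for files that do and do not use a given abstraction.
        double[][] groups = {
            {0.30, 0.35, 0.25, 0.40},
            {0.55, 0.60, 0.50, 0.65}
        };
        System.out.println("F = " + fStatistic(groups));
    }
}
```

A larger F means the feature separates the accuracy values more cleanly relative to the noise within each group.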
20
CONCLUSION




We present a human study of 65 participants based on concrete fault localization tasks
We analyze the effect that the type of defect has on humans’ ability to locate faults
Based on the source code, we analyze the correlation of surface, control flow, and abstraction features with humans’ ability to locate faults
We present a model of human fault localization accuracy based on these features that correlates with human accuracy at least four times more strongly than corresponding baselines
21
Questions?