Introduction to Software Evolution and Maintenance


Predicting Bugs From History
Software Evolution
Chapter 4: Predicting Bugs from History
T. Zimmermann, N. Nagappan, A. Zeller
Motivations


Managers want to know where to spend resources in quality assurance activities.
In practice, this is based on developer expertise.
• By experience, developers have some intuition of where the bugs are more likely to occur.
• But memory is limited and, sometimes, inaccurate.

A solution: mine the bug database.
• Which modules have been known to be defect prone?
• In practice, not every module is defect prone.
• 80% of defects are in 20% of modules.
Research questions


Why are some modules more defect-prone than others?
Can we predict how defect-prone a module will be?
What makes a module defect prone?

The defect likelihood of a module depends on its history:
• All activities that led to changes that affected it.
• History is never changed, only accumulated.
• Strategy: look for invariant properties of this history.
What makes a module defect prone?

Complexity
• Intuitively, the likelihood of making mistakes (injecting defects) increases with the complexity of a software artifact.
• The number of “details” in a software artifact.
• How these “details” relate and interact with each other.
• Challenge: how to measure complexity.

Problem domain
• The problem domain leads to product requirements.
• Some domains appear to be more difficult to work with than others.
• What is it about a domain that makes it harder?

Evolution
• When code is changed very frequently, the likelihood of introducing defects increases.
• How can we decrease the likelihood?

Process
• Field problems imply that defects managed to slip through the quality assurance process.
• How do we take the quality of the process into account?
Measuring complexity


How do we measure complexity?
Traditional complexity measures:
• Size
  • Lines of code, function points
• Code structure
  • Cyclomatic complexity, number of variables
• Coupling relationships
  • Module fan-in, fan-out
• Object-oriented metrics
  • WMC, DIT, NOC, …

Do these metrics correlate with complexity?
Do these metrics really capture complexity?
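To make these measures concrete, here is a minimal sketch (not from the chapter) of computing two of them for a Python snippet: lines of code, and an approximate McCabe cyclomatic complexity obtained by counting branching nodes in the AST. The helper names and the exact set of counted node types are illustrative choices.

import ast
import textwrap

# Branching constructs counted toward cyclomatic complexity in this sketch.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                ast.ExceptHandler, ast.And, ast.Or)

def cyclomatic_complexity(source: str) -> int:
    # Approximation: 1 + number of branching nodes in the parse tree.
    tree = ast.parse(textwrap.dedent(source))
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

def lines_of_code(source: str) -> int:
    # Count non-blank, non-comment lines.
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

example = '''
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    return "positive"
'''

print(lines_of_code(example))          # 6
print(cyclomatic_complexity(example))  # 3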
Traditional O-O metrics

• Weighted Methods per Class (WMC)
• Depth of Inheritance Tree (DIT)
• Number of Children (NOC)
• Coupling Between Objects (CBO)
• Response for a Class (RFC)
• Lack of Cohesion in Methods (LCOM)
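As an aside not on the original slide, two of these metrics are simple enough to sketch directly against Python's own class hierarchy; the class names below are made up for the example.

# Minimal sketch: DIT and NOC for Python classes (illustrative names).
class Shape: ...
class Polygon(Shape): ...
class Triangle(Polygon): ...
class Rectangle(Polygon): ...

def depth_of_inheritance_tree(cls: type) -> int:
    # DIT: longest path from the class up to the root of the hierarchy
    # (object, the implicit Python root, is excluded here).
    parents = [base for base in cls.__bases__ if base is not object]
    if not parents:
        return 0
    return 1 + max(depth_of_inheritance_tree(p) for p in parents)

def number_of_children(cls: type) -> int:
    # NOC: number of immediate subclasses.
    return len(cls.__subclasses__())

print(depth_of_inheritance_tree(Triangle))  # 2
print(number_of_children(Polygon))          # 2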
Empirical studies

Basili, et al.
• Student experiment.
• WMC, CBO, DIT, NOC, RFC were correlated with defects.

Briand, et al.
• Industrial case study.
• CBO, RFC, LCOM were correlated with defects.

Nagappan, et al.
• Case study of 5 Microsoft projects.
• Results are not that simple.

Basili, Briand, Melo. A validation of object-oriented design metrics as quality indicators. IEEE TSE, Oct. 1996.
Briand, et al. Investigating quality factors in object-oriented designs: an industrial case study. ICSE 1999.
Nagappan, Ball, Zeller. Mining metrics to predict component failures. ICSE 2006.
Microsoft case study


Which complexity metrics have a high correlation with defects?

5 projects:
• HTML rendering module for IE6.
• Application loader for IIS.
• Process Messaging Component.
• DirectX.
• NetMeeting.

Findings:
• In each case, defects correlated with some complexity metrics.
• But every project correlated with a different set of complexity metrics!
• No universal complexity metric was found.
• Hence, no universal prediction model can be formulated using complexity metrics.
• Never blindly trust a complexity metric.
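To show what "correlated with defects" means operationally (this is an invented example, not data from the study), one can compute a rank correlation between a metric and per-module defect counts, for example with SciPy:

from scipy.stats import spearmanr

# Invented data: one complexity metric and post-release defect counts
# for six modules.
cyclomatic = [5, 12, 7, 30, 22, 9]
defects    = [1,  4, 0, 11,  6, 2]

rho, p_value = spearmanr(cyclomatic, defects)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

# A strong correlation in one project does not carry over to another;
# that is precisely the slide's point about non-universality.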
General study methodology

Model development
• Variables
  • Indices: complexity metrics, etc.
  • Response: number of defects, defect density.
  • Study the distribution of the variables to find suitable models.
  • Use transformation functions as needed.
• Model formulation
  • Use supervised learning models: regression, Bayesian, SVM, etc.
  • Balance goodness-of-fit with simplicity (Ockham’s Razor).
  • Examine significance, direction and magnitude of contribution for each independent variable.

Model validation
• Calculate precision and recall.
• Compare against the null hypothesis (is it better than guessing?).
• Test sensitivity of different factors, e.g., using stepwise regression.
• Test with hold-out data – separate the training set from the validation set.
• Cross-validate:
  • Leave one out.
  • K-fold cross-validation.

For those interested, check out Andrew Moore’s statistical data mining tutorials: http://www.autonlab.org/tutorials/
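As a rough end-to-end illustration of this methodology (the data, model choice and fold count below are assumptions, not the chapter's setup), one can fit a supervised model on module metrics and validate it with k-fold cross-validation, reporting precision and recall:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
n_modules = 200

# Indices (independent variables): e.g. size and cyclomatic complexity.
X = rng.normal(size=(n_modules, 2))
# Response: defect-prone yes/no, loosely driven by the metrics plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n_modules) > 0).astype(int)

model = LogisticRegression()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["precision", "recall"])

print("precision:", scores["test_precision"].mean())
print("recall:   ", scores["test_recall"].mean())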
Problem domain



Does the problem domain contribute to the defect proneness of a module?
Algorithms for some domains are difficult to get right.
Even reusing existing code is no guarantee.
Eclipse case study


Study of 52 Eclipse 2.0 plug-ins.
Determine the domain of each plug-in:
• Use the list of packages imported as a surrogate for the domain.
• Plug-ins that come with a GUI will require UI classes.
• Those that need access to the parse tree will need compiler classes.

Methodology: use supervised learning techniques.

Findings:
• Java code that imported packages from the following is the most likely to have defects:
  • org.eclipse.jdt.internal.compiler (> 0.7)
  • org.eclipse.jdt.internal.ui (> 0.6)
• Findings are specific to the Eclipse project.

Schroter, Zimmermann, Zeller. Predicting component failures at design time. ISESE 2006.
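To illustrate the idea of using imports as domain features (a toy sketch with invented data, not the study's actual pipeline), each plug-in can be encoded as a binary vector over imported packages and fed to a simple classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented data: imported packages per plug-in, and whether it was defect prone.
plugin_imports = [
    "org.eclipse.jdt.internal.compiler org.eclipse.core.runtime",
    "org.eclipse.jdt.internal.ui org.eclipse.swt.widgets",
    "org.eclipse.core.runtime org.eclipse.core.resources",
    "org.eclipse.jdt.internal.compiler org.eclipse.jdt.core",
]
defect_prone = [1, 1, 0, 1]

vectorizer = CountVectorizer(binary=True, token_pattern=r"\S+")
X = vectorizer.fit_transform(plugin_imports)
model = LogisticRegression().fit(X, defect_prone)

new_plugin = vectorizer.transform(["org.eclipse.jdt.internal.ui org.eclipse.ui"])
print(model.predict_proba(new_plugin)[0, 1])  # estimated defect proneness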
Eclipse classes
(Slide figure not reproduced in the transcript.)
Code churn


Code churn – the number of times a piece of code has been modified.
Some code churn metrics:
• Number of lines in a new version.
• Number of lines added/deleted.
• Number of files in a new version.
• Number of files added/modified/deleted.
• Number of changes (deltas) applied.
• Churn interval: from initial to final change.

Can code churn metrics correlate with defects?
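As an illustration of how such metrics can be collected (not part of the slides; assumes a local Git repository and the standard git command-line tool), the sketch below tallies per-file deltas and added/deleted lines from git log --numstat output:

import subprocess
from collections import defaultdict

def churn_per_file(repo_path: str) -> dict:
    # One numstat line per file per commit: "<added>\t<deleted>\t<path>".
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format="],
        capture_output=True, text=True, check=True,
    ).stdout

    churn = defaultdict(lambda: {"deltas": 0, "added": 0, "deleted": 0})
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank separator lines
        added, deleted, path = parts
        churn[path]["deltas"] += 1
        if added != "-":  # "-" marks binary files
            churn[path]["added"] += int(added)
            churn[path]["deleted"] += int(deleted)
    return churn

# Example usage (path is illustrative):
# for path, stats in churn_per_file(".").items():
#     print(path, stats)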
Empirical studies

Ostrand, et al.; Munson, et al.; Nagappan, et al.
• [Ostrand] Industrial inventory system.
• [Munson] 300 KLOC embedded real-time system.
• [Nagappan] Windows Server 2003.
• New, changed and unchanged files were good predictors of defect proneness.

Graves, et al.
• 5ESS subsystem, 20-year history of changes.
• Churn metrics correlate with defects better than complexity metrics.
• Modules with more recent changes are more likely to have higher defect proneness.

Ostrand, Weyuker, Bell. Where the bugs are. ISSTA 2004.
Munson, Elbaum. Code churn: a measure for estimating the impact of code change. ICSM 1998.
Nagappan, Ball. Use of relative code churn measures to predict system defect density. ICSE 2005.
Graves, et al. Predicting fault incidence using software change history. IEEE TSE, July 2000.
Open issues

External validity
• Most findings come from case studies. Replication is needed to improve confidence in the results.

Construct validity
• Defect count is not the most important response variable.
• Some defects are more severe than others.
• We are interested in the ones that lead to failures.

Learning from history
• Findings are difficult to apply outside of specific projects.
• What lessons can we transfer to a new project that has no history?
• How can we abstract/consolidate/generalize our knowledge of the nature of defects from one project to another?
• Are there universal properties of programs and processes that likely result in more defects?
Some future work

Leverage process data
• Incorporate process features (process model used, quality assurance activities employed, etc.) into prediction models.

Better metrics and models
• Examine failures, not just defects.

Combined approaches
• Combine knowledge of complexity, domain, churn and process into models.

Finer granularity
• White-box treatment of modules.
• Examine calling relationships, data dependencies, etc.