Towards Logistic Regression Models for Predicting Fault

Towards Logistic Regression Models
for Predicting Fault-prone Code
across Software Projects
Erika Camargo
and
Ochimizu Koichiro
Japan Advanced Institute of Science and Technology (JAIST)
ESEM 2009
1
Contents
1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work
2
Abstract
Challenge:
To make logistic regression (LR) models that use
design-complexity metrics able to predict fault-prone
object-oriented (OO) classes across software projects.
[Figure: P(y=1), the probability that a class is fault-prone, plotted against X, a design-complexity metric]
First attempt at a solution:
simple log data transformations
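
For reference, a univariate logistic regression model of the kind plotted on this slide is usually written as below. The symbols \beta_0 (intercept) and \beta_1 (slope) are standard notation and do not appear on the slide; y = 1 stands for "the class is fault-prone" and x for the value of one design-complexity metric.

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```

Under this form, a model fitted on one project carries that project's \beta_0 and \beta_1 with it, so the same metric value x always maps to the same probability, whatever the distribution of x in the project being predicted.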
3
Background
• Some design-complexity metrics have been shown to
be good predictors of fault-prone classes in LR
models
• Among these metrics are the Chidamber &
Kemerer (CK) metrics
– The 80th and 20th percentiles of their distributions can
be used to determine high and low values
– Their thresholds cannot be determined in advance
and should be derived and used locally
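
A minimal sketch of deriving such local thresholds, assuming plain percentiles over the metric values of a single project. The function names, the "medium" label and the sample values are illustrative and not taken from the slides.

```python
from statistics import quantiles

def local_thresholds(metric_values):
    """Return (low, high): the 20th and 80th percentiles of one project's values."""
    # n=5 yields the 20th, 40th, 60th and 80th percentile cut points.
    cuts = quantiles(metric_values, n=5, method="inclusive")
    return cuts[0], cuts[3]

def label(value, low, high):
    """Label a metric value relative to the thresholds of its own project."""
    if value <= low:
        return "low"
    if value >= high:
        return "high"
    return "medium"  # hypothetical middle band, not named on the slide

# Made-up metric values for one project.
project_values = list(range(1, 12))
low, high = local_thresholds(project_values)
```

Because the thresholds are recomputed per project, the same raw value can be "high" in one project and "medium" in another, which is the locality the slide refers to.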
4
Problem Analysis
Can an LR model built with this kind of
metric work effectively across different
software projects?
[Figure: P(y=1) against X = Number of Methods; labels: LEAST FAULTY, MOST FAULTY, Large Size SW project, Small Size SW project; X-axis values 10 and 20]
5
Case Study
1. Data analysis of 7 different projects and
application of simple log data transformations.
2. Construction of 3 univariate LR models using a
large open source project (1st release of the
MYLYN System with 638 Java classes).
– Independent variables: CK-CBO (Coupling Between Objects), CK-RFC (Response For a Class), CK-WMC (Weighted Methods per Class)
– Dependent variable: defects (from Bugzilla & CVS)
3. Testing of these models with 2 other, smaller projects
(with 11 and 13 Java classes)
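
Step 2 can be illustrated with a small, self-contained sketch of fitting one univariate LR model. The slides do not say which tool or fitting procedure was used, so batch gradient ascent stands in here. Treating defects as a binary outcome (faulty or not) is also an assumption, and the data are made up rather than taken from Mylyn.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_univariate_lr(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient ascent
    on the mean log-likelihood. Returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1

def predict(b0, b1, x):
    """Estimated probability that a class with metric value x is faulty."""
    return sigmoid(b0 + b1 * x)

# Made-up training data: one metric value per class and a 0/1 fault flag.
metric = [0.0, 0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.5, 1.8]
faulty = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_univariate_lr(metric, faulty)
```

One such model would be fitted per metric (CBO, RFC, WMC) on the large project and then applied unchanged, via `predict`, to the classes of the smaller projects.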
6
Challenge
BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphical Modeling Framework **
MYL: Mylyn system **
[...] produce biased regression estimates and reduce the predictive power of regression models
(**) Eclipse Project
(*) Systems developed by students of JAIST, as described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time
Applications with UML, Addison-Wesley Object Technology Series, July 2000.
7
RFC data of BNS is more spread out than the data of MYL
8
Case Study
Solution: simple data transformation using “Log10”
– There are fewer outliers
– Data spread is more uniform
Example :
LCBO = Log10(CBO+1)
LTCBO = Log10(CBO+1) + dm
where dm is the difference between the CBO medians of the
Mylyn system and of the system whose data is being
transformed
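
The transformation on this slide can be sketched as follows. The sketch reads dm as the reference (Mylyn) median minus the target system's median, both taken on the log10 scale. The use of medians, the direction of the difference and the scale on which dm is computed are all assumptions, and the function names and sample values are illustrative.

```python
import math
from statistics import median

def log_transform(values):
    """LCBO-style transform: log10(value + 1) for each metric value."""
    return [math.log10(v + 1) for v in values]

def shifted_log_transform(target_values, reference_values):
    """LTCBO-style transform: log10(value + 1) + dm.

    dm is assumed to be median(log reference) - median(log target),
    so the shifted target data share the reference project's median.
    """
    reference_log = log_transform(reference_values)
    target_log = log_transform(target_values)
    dm = median(reference_log) - median(target_log)
    return [v + dm for v in target_log]

# Made-up CBO values: a reference project and a smaller target project.
reference_cbo = [9, 99, 999]
target_cbo = [0, 9, 99]
shifted = shifted_log_transform(target_cbo, reference_cbo)
```

The +1 inside the logarithm keeps classes with a metric value of 0 defined, since log10(0) is not.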
10
Results
Effects of the Log data Transformations:
• Elimination of a large number of outliers
• Overall goodness of fit of the 3 models is
better
• Discrimination (Most Faulty/Least Faulty)
– All models discriminate well between the Most Faulty
and Least Faulty classes of the Mylyn system
– What about using different projects?
11
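Results
BANKING SYSTEM
MF: Most Faulty, LF: Least Faulty
Correct classifications by model, raw data vs. log-transformed (LOG Tx) data:

Group               Model   RAW DATA   LOG Tx DATA   Effect
MF (6 classes)      CBO     2          5             ↑
MF (6 classes)      RFC     5          5             =
MF (6 classes)      WMC     6          6             =
LF (5 classes)      CBO     5          5             =
LF (5 classes)      RFC     3          3             =
LF (5 classes)      WMC     4          4             =
BOTH (11 classes)   CBO     7          10            ↑
BOTH (11 classes)   RFC     8          8             =
BOTH (11 classes)   WMC     10         10            =

Effect: ↑ more classes correctly classified with log-transformed data, ↓ fewer, = no change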
12
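Results
E-COMMERCE SYSTEM
MF: Most Faulty, LF: Least Faulty
Correct classifications by model, raw data vs. log-transformed (LOG Tx) data:

Group               Model   RAW DATA   LOG Tx DATA   Effect
MF (9 classes)      CBO     3          7             ↑
MF (9 classes)      RFC     9          8             ↓
MF (9 classes)      WMC     7          6             ↓
LF (4 classes)      CBO     4          4             =
LF (4 classes)      RFC     0          3             ↑
LF (4 classes)      WMC     0          4             ↑
BOTH (13 classes)   CBO     7          11            ↑
BOTH (13 classes)   RFC     9          11            ↑
BOTH (13 classes)   WMC     7          10            ↑

Effect: ↑ more classes correctly classified with log-transformed data, ↓ fewer, = no change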
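
The BOTH rows of the two results tables can be turned into classification rates with a few lines of arithmetic. The counts below are copied from the tables; the function and variable names are illustrative.

```python
# Correct classifications (raw data, log-transformed data) from the BOTH rows.
banking = {"CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)}    # 11 classes
ecommerce = {"CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)}  # 13 classes

def classification_rates(counts, total_classes):
    """Map each model to (raw rate, log rate) as fractions of all classes."""
    return {
        model: (raw / total_classes, log / total_classes)
        for model, (raw, log) in counts.items()
    }

banking_rates = classification_rates(banking, 11)
ecommerce_rates = classification_rates(ecommerce, 13)

for name, rates in (("Banking", banking_rates), ("E-commerce", ecommerce_rates)):
    for model, (raw, log) in rates.items():
        print(f"{name} {model}: raw {raw:.1%}, log {log:.1%}")
```

Expressed this way, the overall rate never drops for any model after the log transformation, even though two MF rows of the e-commerce system lose one class each.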
13
Conclusions and Future Work
• CK-CBO, CK-RFC and CK-WMC can have
different distributions in different projects
• Simple log transformations seem to improve
the prediction ability of LR models, especially
when the project's measures are not as spread out
as those used to build the model
• Future work: further data exploration and study of data
transformations
14
Thank you!
questions, comments …
contact: [email protected]
15