Introduction and Overview to Mining Software Repository
Zoltan Karaszi
zkaraszi (at) kent.edu
MS/PHD seminar (cs6/89191)
November 9th, 2011
Abstract
Based on the following survey paper: “A survey and taxonomy of approaches
for mining software repositories in the context of software evolution”
by Huzefa Kagdi, Michael L. Collard and Jonathan I. Maletic, 2007
After defining MSR and giving the background and the different classifications,
my main goal is to give a general picture of MSR
After showing the different MSR approaches, I will focus on one example,
frequent-pattern mining, which examines the changes and evolution of software
Outline
1. Introduction
2. Dimensions of survey
3. A layered taxonomy of MSR
4. Software repository mining overview
5. Example: Frequent-pattern mining
6. Discussion and open issues
7. Concluding remarks
8. References
1. Introduction
1.1. Terms
1.2. Premise
1.3. Scope, background, history
1.4. Goals of the survey
1.1. Terms
Mining Software Repositories (MSR): a term coined to describe a broad class of
investigations into the examination of software repositories
Software Repositories (SRs): artifacts produced and archived during software evolution
Concurrent Versions System (CVS): free client-server revision control
system that keeps track of all changes in a set of files
1.2. Premise
Empirical and systematic investigations of repositories
Uncover hidden information, relationships or trends
Shed new light on the process of software evolution and its changes
1.3. Scope, background and history
Scope
Surveys the literature up to June 2006
Specifically investigates evolutionary changes of software artifacts
Background
No earlier survey covered investigations that examine the changes and evolution of software
using data mining and other similar techniques
In the past
MSR investigations were conducted on industrial systems
Research efforts were limited to a few software systems
Currently
Large increase in open-source software: how to manage this challenge?
1.4. Goals of the survey
Form a basis for researchers interested in MSR to better understand the
evolution of software systems
Create a taxonomy to assist in the continued advancement of the field
A clearer understanding supports the development of tools, methods and processes
that more precisely reflect the actual nature of software evolution
2. Dimensions of survey
2.1. Information sources
2.2. Purpose
2.3. Methodology
2.4. Evaluation
2.1. Information sources
Categories of information in SR
Metadata about the software change: comments, user-ids, timestamps
Differences between the versions: addition, deletion or modification
Classification of different software versions (artifacts)
Version control and bug-tracking systems
CVS – does not maintain explicit branch and merge points
Subversion (more modern) – preserves the change-set of each commit
Bugzilla – bug-tracking system that keeps the history of the entire lifecycle of a bug (bug report)
2.2. Purpose
Extract information and uncover relationships or trends in source code evolution
Two classes of MSR questions
Market-Basket Question (MBQ) formulated as
If A occurs then what else occurs on a regular basis?
Prevalence Questions (PQ) formulated as
Was a particular function added/deleted/modified?
How many and which of the functions are reused?
2.3. Methodology
Researchers utilize software repositories in multiple ways
Limit the studies to the metadata
directly available from the repositories, used in the traditional manner
Use the functionality of source code repositories directly (CVS commands)
to get a particular version of the code, then apply an adopted or invented methodology
2.4. Evaluation
Assessment metrics
Precision: how much of the information found is relevant
Recall: how much of all of the relevant information is found
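The two metrics can be sketched in a few lines. The function name and the sample sets below are hypothetical and only illustrate the definitions above; they are not taken from the survey.

```python
def precision_recall(suggested, relevant):
    """Return (precision, recall) for suggested items against the relevant ones."""
    suggested, relevant = set(suggested), set(relevant)
    hits = suggested & relevant
    # Precision: share of the suggestions that are relevant.
    precision = len(hits) / len(suggested) if suggested else 0.0
    # Recall: share of the relevant items that were suggested.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: a tool suggests four functions, three of which really changed;
# five functions changed in total.
suggested = {"f1", "f2", "f3", "f4"}
relevant = {"f1", "f2", "f3", "f5", "f6"}
print(precision_recall(suggested, relevant))  # (0.75, 0.6)
```

Returning 0.0 for an empty set is a convention chosen for this sketch; other conventions exist.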
3. Layered taxonomy of MSR approaches
All the works investigated in the survey paper: operate on version-release histories, work at
the same level of granularity, ask and answer very similar types of MSR questions, and analyze
the information and derive conclusions within the context of software evolution
The four-layer taxonomic description [1]
4. Software repository mining overview
4.1. Metadata analysis
4.2. Static source code analysis
4.3. Source code differencing and analysis
4.4. Software metrics
4.5. Visualization
4.6. Clone-detection methods
4.7. Information-retrieval methods
4.8. Classification with supervised learning
4.9. Social network analysis
4.1. Metadata analysis
Lightweight methodology to analyze metadata
Utilize the metadata stored in software repositories
Straightforward first choice: easily accessible (e.g. CVS log)
4.2. Static source code analysis
Good approach to extract facts and other information from versions of a system
Bug finding and fixing
4.3. Source code differencing and analysis
Further extension of MSR with regard to source code changes
Works in a more source-code-‘aware’ manner
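As a starting point, the simplest form of differencing compares two versions line by line. The sketch below uses Python's standard difflib; the function name and the sample versions are hypothetical. A source-code-aware approach would go further and map the changed lines to syntactic entities such as functions, which this sketch does not attempt.

```python
import difflib


def line_changes(old_source, new_source):
    """Return (added, deleted) lists of lines between two versions of a file."""
    added, deleted = [], []
    diff = difflib.ndiff(old_source.splitlines(), new_source.splitlines())
    for line in diff:
        if line.startswith("+ "):
            added.append(line[2:])    # line present only in the new version
        elif line.startswith("- "):
            deleted.append(line[2:])  # line present only in the old version
        # lines starting with "  " are unchanged; "? " lines are ndiff hints
    return added, deleted


# Hypothetical versions of one small source file.
old_version = "int add(int a, int b) {\n    return a + b;\n}\n"
new_version = (
    "int add(int a, int b) {\n    return b + a;\n}\n"
    "int zero() { return 0; }\n"
)
added, deleted = line_changes(old_version, new_version)
```

A modified line shows up here as one deletion plus one addition, which is exactly the limitation that entity-level differencing tries to overcome.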
4.4. Software metrics
Quantitatively measure various aspects of software products and projects
Include size, effort, cost, functionality, quality, complexity and efficiency
4.5. Visualization
Interactive visual representation of data to amplify cognition and to support
software maintenance and evolution
Very task specific
Based on the mined data and how one separates approach categories
4.6. Clone-detection methods
Approaches to identify both exact and near-miss clones
Source code entities with similar textual, structural and semantic composition
4.7. Information-Retrieval (IR) methods
Classification and clustering of textual units
Applied to many software engineering problems
Traceability, program comprehension, and software reuse
CVS comments, textual descriptions of bug reports, and e-mails
4.8. Classification with supervised learning
Supervised learning: technique for creating a cause–effect function from training data
4.9. Social network analysis
For deriving and measuring ‘invisible’ relationships between social entities
To discover developer roles, contributions and associations in software development
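One simple way to derive such ‘invisible’ relationships is to link developers who changed the same files. The sketch below is an assumption for illustration only: the function name, the commit format (developer, list of files) and the sample data are hypothetical and not taken from the survey.

```python
from collections import Counter
from itertools import combinations


def developer_links(commits):
    """Count, for each pair of developers, how many files both have changed."""
    developers_by_file = {}
    for developer, files in commits:
        for path in files:
            developers_by_file.setdefault(path, set()).add(developer)
    links = Counter()
    for developers in developers_by_file.values():
        # Every pair of developers who touched this file gets one shared-file link.
        for pair in combinations(sorted(developers), 2):
            links[pair] += 1
    return links


# Hypothetical commit metadata: (developer, files changed in the commit).
commits = [
    ("alice", ["parser.c", "lexer.c"]),
    ("bob", ["parser.c"]),
    ("carol", ["lexer.c", "main.c"]),
    ("bob", ["lexer.c"]),
]
links = developer_links(commits)
```

The resulting weighted pairs can be read as edges of a developer network, where a higher count suggests a closer working association.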
5. Example: Frequent-pattern mining
5.1. Evolutionary couplings and change predictions
5.2. Capabilities of technique
5.3. Extension of their work [33]
5.4. Evaluation
5.5. Advantages of extended ROSE
5. Example: Frequent-pattern mining
Discover implicit knowledge from large datasets (patterns, trends, rules)
Encompasses IR, statistical analysis and modeling, and machine learning
Applied to uncover software entities that frequently change together (frequent patterns)
Can include the ordering information
[34]
5.1. Evolutionary couplings and change predictions
Zimmermann et al. [15] aimed to identify co-occurring changes in a software system
Purpose: find related changes
When a source code entity (function A) is modified, which other entities (functions B and C) are also modified?
Use
ROSE (parser tool) for source code (C++, Java, Python)
Association-rule mining technique to determine rules of the form B ⇒ A
Derived association rules such as: a change to a particular ‘type’ definition
leads to changes
in instances of variables of that ‘type’
in the coupling between interface and implementation
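The idea behind such rules can be sketched with support and confidence counts over transactions, where a transaction is the set of entities changed together. This is a minimal sketch limited to single-antecedent rules and is not the ROSE implementation; the function name, thresholds and sample transactions are hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(transactions, min_support=2, min_confidence=0.5):
    """Mine rules (a, b), read as 'when a changes, b changes too'."""
    item_count = Counter()
    pair_count = Counter()
    for transaction in transactions:
        entities = set(transaction)
        item_count.update(entities)
        # Ordered pairs, so that (a, b) and (b, a) are counted as separate rules.
        pair_count.update(permutations(sorted(entities), 2))
    rules = {}
    for (a, b), support in pair_count.items():
        # Confidence: share of the transactions containing a that also contain b.
        confidence = support / item_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


# Hypothetical transactions: each set holds entities changed in one commit.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]
rules = mine_rules(transactions)
```

With this data, the rule ("C", "A") has support 2 and confidence 1.0, because every transaction that changes C also changes A, while ("B", "D") is dropped for insufficient support.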
5.2. Capabilities of technique
Ability to identify addition, modification and deletion of syntactic entities
Handles various programming languages and HTML documents
Detection of hidden dependencies
Figure 1.2: Programmers who Changed this Function also Changed…[15]
5.3. Extension of their work [33]
Allows prediction of additions to and deletions from entities
ROSE was evaluated for
Navigation (recommendation of other affected entities)
Closure (false suggestions for missing entities)
Granularity (fine versus coarse)
Maintenance (modified only)
5.4. Evaluation (‘interactive power’ of ROSE tool)
Period: an evaluation period of at least one month was selected for eight open-source projects
Prediction: based on the previous versions, predict the changes that occurred during the evaluation period
New additional measure, feedback: percentage of queries that result in at least one recommendation
Average precision, recall, and feedback values
Navigation and prevention support is better at coarse granularity than at fine
granularity
Average feedback values in the case of closure: 1.9%
in the case of fine and coarse granularity: 3%
5.5. Advantages of extended ROSE tool
Needs only a few weeks of history to make suggestions
Results can be improved by assigning higher weight to rapid renames and moves
Similar approach
Ying et al. [34] - approach for source code change prediction at a file level
Use: association-mining technique based on FP-tree item-set mining
Evaluated: version histories of Mozilla and Eclipse projects
6. Discussion and open issues
Need to be able to perform MSR on fine-grained entities
Standards for validation must be developed
7. Concluding remarks
Over 80 investigations were surveyed
Layered taxonomy was derived
MSR investigations are a promising avenue
to help support and understand software evolution!
8. References
[1]. Kagdi H, Collard ML, Maletic JI. A survey and taxonomy of approaches for
mining software repositories in the context of software evolution. Journal of
Software Maintenance and Evolution: Research and Practice 2007; 19(2):77–131.
[15]. Zimmermann T, Weißgerber P, Diehl S, Zeller A. Mining version histories to
guide software changes. Proceedings 26th International Conference on Software
Engineering (ICSE’04). IEEE Computer Society Press: Los Alamitos CA, 2004.
[33]. Zimmermann T, Zeller A, Weißgerber P, Diehl S. Mining version histories to guide
software changes. IEEE Transactions on Software Engineering 2005; 31(6):429–445.
[34]. Ying ATT, Murphy GC, Ng R, Chu-Carroll MC. Predicting source code changes
by mining change history. IEEE Transactions on Software Engineering 2004;
30(9):574–586.
Thank you for your time!