DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool
Simone Livieri
Yoshiki Higo
Makoto Matsushita
Katsuro Inoue
Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
Background
• Open-Source Software (OSS) is used in many software systems
• Relations between software systems can be exposed through code clone analysis
• Large collections of OSS exist
• Huge memory requirements, long running time
• Computing power is cheap
• Large numbers of computers are often easily accessible
• Code clone analysis can be distributed
In the beginning was CCFinder
• CCFinder is a code-clone analysis tool
• Widely used and cited
• Token based
• Many languages supported (e.g. C, C++, Java)
• Good scalability (but can’t handle very large input)
DCCFinder
• D(istributed)CCFinder is a tool for distributed code clone analysis
• Master-slave distributed system
• Data sharing through a shared file system
• Uses CCFinder to perform the code clone analysis
• The prototype ran on 80 computers of the Student Laboratory of our department
Computational Model
• A target is the collection of source files undergoing code clone analysis
• A category is a set of projects sharing a specific feature or use
• A project is a set of source files of a single software system
• A unit is a set of source files that may cross multiple projects
• Two units make a piece; a piece is analyzed by CCFinder on each slave node
[Figure: the target is divided into categories 1 to 4, projects 1 to 8, and units 1 to n; a slave node runs CCFinder on piece (i, j), made of unit i and unit j]
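The unit and piece definitions above can be illustrated with a small sketch. It is only an assumption about how such a partition might be computed: the greedy packing by file size, the function names `make_units` and `make_pieces`, the toy file list, and the choice to pair each unit with itself (so that clones inside a unit are also covered) are all hypothetical and not taken from the slides.

```python
from itertools import combinations_with_replacement

def make_units(file_sizes, unit_size):
    """Greedily pack (path, size) pairs, in order, into units of at most unit_size."""
    units, current, current_size = [], [], 0
    for path, size in file_sizes:
        # Close the current unit when this file would push it over the limit.
        if current and current_size + size > unit_size:
            units.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        units.append(current)
    return units

def make_pieces(units):
    """Pair every unit index with itself and with every later unit index."""
    return list(combinations_with_replacement(range(len(units)), 2))

# Toy target: three projects (a, b, c); sizes are in arbitrary units.
files = [("a/x.c", 6), ("a/y.c", 3), ("b/z.c", 3),
         ("b/w.c", 7), ("c/v.c", 4), ("c/u.c", 9)]
units = make_units(files, unit_size=12)
pieces = make_pieces(units)
print(units)   # the first unit spans projects a and b
print(pieces)  # 3 units give 6 pieces
```

Because units are cut by size rather than by project boundary, a unit can contain files from more than one project, which matches the definition on the slide.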
System Implementation (1)
• Written in Java (about 20 kLOC)
• Master-Slave-Registry communication handled with Java RMI
• Basic fault tolerance
Master and slave node characteristics
Processor: Pentium IV 3 GHz
Memory: 1 GByte
Network link: Gigabit Ethernet connected to 100 Mbit/s network hubs
OS: FreeBSD 5.3-STABLE
Local storage: 40 to 50 GBytes
Analysis Process
System Implementation (2)
• Indexer
  • Examines the target and collects file size, LoC, project name, and category name
  • Computes unit boundaries
• Master Node
  • Creates the input files for CCFinder and assigns jobs to the slaves
• Slave Node
  • Copies the files to the local storage
  • Executes CCFinder
  • Copies the output to the shared storage
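The division of labor above can be sketched as a master handing pieces to a pool of workers. This is a minimal single-machine sketch, not the tool's implementation: threads stand in for the RMI-connected slave nodes, `analyze_piece` and the `clones_i_j.out` naming are hypothetical, and the copy and CCFinder steps are left as comments.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_piece(piece):
    """Hypothetical stand-in for one slave job on piece (i, j)."""
    i, j = piece
    # 1. Copy the files of units i and j to local storage (omitted).
    # 2. Execute CCFinder on them (omitted).
    # 3. Copy the output to the shared storage (omitted).
    return f"clones_{i}_{j}.out"  # placeholder output name

def run_master(pieces, n_slaves):
    """Hand each piece to the next free worker; results come back in piece order."""
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return list(pool.map(analyze_piece, pieces))

pieces = [(0, 0), (0, 1), (1, 1)]
outputs = run_master(pieces, n_slaves=2)
print(outputs)
```

Since each piece is analyzed independently, no worker needs results from another, which is what makes this kind of job assignment straightforward to distribute.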
System Implementation (3)
• Clone Coverage Analyzer
  • Computes the number of shared lines of code between each pair of files, projects, and categories
• Image Generator
  • Generates scatter plots, heat maps, or bar charts from the clone coverage data
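As a rough illustration of the aggregation step, the sketch below rolls file-level shared line counts up to project pairs. The input format, the `aggregate_by_project` name, and the sample numbers are hypothetical. Plain summation is also a simplification, since a line shared with several files would be counted more than once.

```python
from collections import defaultdict

def aggregate_by_project(file_pairs, project_of):
    """Sum shared lines of (file_a, file_b, shared) records per unordered project pair."""
    totals = defaultdict(int)
    for file_a, file_b, shared in file_pairs:
        key = tuple(sorted((project_of[file_a], project_of[file_b])))
        totals[key] += shared
    return dict(totals)

# Made-up example data: which project each file belongs to.
project_of = {"php4/a.c": "php4", "php5/a.c": "php5", "php5/b.c": "php5"}
file_pairs = [
    ("php4/a.c", "php5/a.c", 120),
    ("php4/a.c", "php5/b.c", 30),
    ("php5/a.c", "php5/b.c", 10),
]
project_totals = aggregate_by_project(file_pairs, project_of)
print(project_totals)  # {('php4', 'php5'): 150, ('php5', 'php5'): 10}
```

The same roll-up applied with a project-to-category mapping would give the category-level figures.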
Case Study I: The FreeBSD Target
• Vast collection of open-source software used by the FreeBSD OS
• Unit size: 15 MBytes
• Minimum code clone length: 50 tokens
• Total number of tasks: 269,745
Number of categories: 45
Number of projects: 6,658
Number of .c files: 754,552
Total lines of code: 403,625,067
Total size: 10.8 GBytes
Time elapsed
Indexer: 22 minutes
D-CCFinder: 51 hours
Clone Coverage Analyzer (scatter plot): 23 hours
Image Generator (scatter plot): 4 hours
Total: 78 hours 22 minutes
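Two of the figures above can be cross-checked with simple arithmetic. The reported task count equals n(n+1)/2 for n = 734, which is what one would get if every unit were paired with itself and with every other unit. The unit count of 734 is inferred from that formula and is not stated in the slides. The helper name `units_from_tasks` is hypothetical.

```python
from math import isqrt

def units_from_tasks(tasks):
    """Return n with n*(n+1)/2 == tasks, or None if no such integer exists."""
    root = isqrt(8 * tasks + 1)
    if root * root != 8 * tasks + 1:
        return None
    return (root - 1) // 2

n_units = units_from_tasks(269_745)
print(n_units)  # 734

# Phase durations from the table, in minutes.
phases = {
    "Indexer": 22,
    "D-CCFinder": 51 * 60,
    "Clone Coverage Analyzer": 23 * 60,
    "Image Generator": 4 * 60,
}
hours, minutes = divmod(sum(phases.values()), 60)
print(f"{hours} hours {minutes} minutes")  # 78 hours 22 minutes
```

An inferred 734 units is also plausible given a 10.8 GByte target and a 15 MByte unit size, since 10.8 GBytes divided by 15 MBytes is a little over 700.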
Case Study I: Result
Duplicated source tree in php4 and php5
Case Study I: Result
gstream’s main source tree is duplicated inside all the gstream plugin projects
Case Study I: Result
Multiple copies of the X Window System source tree
Case Study I: Result
Database Category
CCC1: 41%
Causes:
• Different versions of the same software
• Database drivers for different languages
• Multiple copies of the phpX source tree
Case Study I: Result
Development Category
CCC1: 38%
Causes:
• Mainly the presence of different versions of the GNU binary utilities and compilers
Case Study I: Result
Lang and Development Categories
CCC1: 28%
Causes:
• The GNU compiler suite is present in both categories
Case Study I: Result
X11 Fonts Category
CCC1: 46%
Causes:
• Small category size
• Seven copies of the X Window System source tree
Case Study II: SPARS-J and the FreeBSD Target
• SPARS-J is a Java component analysis tool
• About 47,000 lines of code; written in C
• Code clones between SPARS-J and the whole FreeBSD target were detected
Case Study II: Code Clone Coverage (before)
Most of the code clones were from a single file: getopt.c
Case Study II: Code Clone Coverage (after)
• Code clones from CGI handling source code
• Specialized version of getopt.c
Summary
• Proposed a new approach to distributed large-scale code clone analysis
• Obtained a global overview of code clones in the FreeBSD target
• In SPARS-J, easily identified the use of code from the FreeBSD target
Summary (2)
• The speedup was a factor of 20. Limited by:
  • data transfer, network congestion, master-slave coordination
• Generating a scatter plot of reasonable size involved a trade-off between speed and accuracy. Effects:
  • Source code organization easily visible, enhanced artifacts, finer details not distinguishable
• Currently cannot efficiently filter out unnecessary or uninteresting code clones
  • Being addressed by exploring fingerprint-based source code analysis
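To put the reported gain in context, the sketch below computes the parallel efficiency it would imply. It rests on two assumptions that the slides do not confirm: that all 80 prototype machines took part in the run, and that the factor of 20 refers to the 51-hour D-CCFinder phase of Case Study I.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes

speedup = 20            # reported on the slide
nodes = 80              # assumption: every prototype machine was used
distributed_hours = 51  # D-CCFinder phase of Case Study I

efficiency = parallel_efficiency(speedup, nodes)
implied_sequential_hours = distributed_hours * speedup

print(f"efficiency: {efficiency:.0%}")  # efficiency: 25%
print(f"implied single-machine time: {implied_sequential_hours} hours")
```

Under these assumptions, roughly three quarters of the available computing power was lost to the overheads listed above.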
Future Work
• D-CCFinder is currently being rewritten
• Better fault tolerance
• Graphical user interface
• Distributed post-processing and image generation
• Exploring the evolution of different software systems with code clone analysis
Metrics
\[ \mathrm{CCC1}(M_0, M_1) = \frac{\mathrm{LOC}\left(CC_{M_0 M_1}(M_0, M_1)\right)}{\mathrm{LOC}(M_0) + \mathrm{LOC}(M_1)} \times 100 \]
\[ \mathrm{CCC2}(M_0, M_1) = \frac{\mathrm{LOC}\left(CC_{M_0 M_1}(M_0)\right)}{\mathrm{LOC}(M_0)} \times 100 \]
CCC1 is the percentage of shared lines of code between M0 and M1, computed over the total lines of code of M0 and M1.
CCC2 is the percentage of lines of code that M0 shares with M1, computed over the total lines of code of M0.
$M_0, M_1$: a pair of files, projects, or categories
$CC_{M_0 M_1}(M_0, M_1)$: segments of the code clones between M0 and M1
$CC_{M_0 M_1}(M_0)$: segments of the code clones between M0 and M1 that lie in M0
$\mathrm{LOC}(x)$: number of lines of code in x
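The two metrics can be worked through on a small example. This is a sketch under stated assumptions: clone segments are represented as inclusive (start, end) line ranges, overlapping segments are counted once, and the numerator of CCC1 is taken as the cloned lines in M0 plus the cloned lines in M1. The function names and sample numbers are hypothetical.

```python
def covered_lines(segments):
    """Count distinct lines covered by inclusive (start, end) segments."""
    lines = set()
    for start, end in segments:
        lines.update(range(start, end + 1))
    return len(lines)

def ccc1(segments_m0, segments_m1, loc_m0, loc_m1):
    """Shared lines in M0 and M1 as a percentage of LOC(M0) + LOC(M1)."""
    shared = covered_lines(segments_m0) + covered_lines(segments_m1)
    return 100.0 * shared / (loc_m0 + loc_m1)

def ccc2(segments_m0, loc_m0):
    """Lines of M0 shared with M1 as a percentage of LOC(M0)."""
    return 100.0 * covered_lines(segments_m0) / loc_m0

# M0 has 1000 lines, 200 of them inside clone segments (the two ranges overlap).
# M1 has 500 lines, 250 of them inside clone segments.
segs_m0 = [(1, 100), (51, 200)]
segs_m1 = [(1, 250)]

print(ccc1(segs_m0, segs_m1, 1000, 500))  # 30.0
print(ccc2(segs_m0, 1000))                # 20.0  (CCC2 of M0 against M1)
print(ccc2(segs_m1, 500))                 # 50.0  (CCC2 of M1 against M0)
```

The example shows why both metrics are useful: CCC1 is symmetric, while CCC2 depends on which member of the pair is taken as M0.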