Finding File Clones in FreeBSD Ports Collection Yusuke Sasaki Tetsuo Yamamoto Yasuhiro Hayase Katsuro Inoue Department of Computer Science, Graduate School of Information Science & Technology, Osaka University.


Finding File Clones in FreeBSD Ports Collection
Yusuke Sasaki
Tetsuo Yamamoto
Yasuhiro Hayase
Katsuro Inoue
Department of Computer Science,
Graduate School of Information Science & Technology,
Osaka University
File Clones
* Two or more files with the same content
  * Comments and code indentation ignored
* Inside a project or between different projects
* Research about file-clones is scarce
  * Get new knowledge about file-clones
Example: Project A and Project B each contain the same file:
#include <stdio.h>

int main() {
    printf("Hello msr!");
    return 0;
}
FCFinder
* Input
  * .c and .h files
* Output
  * File-clone sets
* Faster than other tools

Tool         Speed                      Speedup   Machines
CCFinder     1.4M files / 960 hours     x1        1 PC
D-CCFinder   1.4M files / 51 hours      x19       80 PCs
FCFinder     1.4M files / 17.16 hours   x55       1 PC

* Detection
  * Tokenization
  * MD5 Hash Calculation
  * Exact Matching
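The three detection steps can be sketched roughly as follows. This is a minimal illustration, not FCFinder itself: the slides only name the steps, so the tokenizer here is a simplified, hypothetical one (a regular expression that drops comments and whitespace), and the function names `tokenize`, `fingerprint`, and `find_file_clones` are assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical, simplified C tokenizer (an assumption, not FCFinder's real one).
# Alternatives are tried left to right, so comment markers inside string
# literals are kept as part of the string token.
TOKEN = re.compile(r'''
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<word>\w+)
  | (?P<punct>[^\w\s])
''', re.DOTALL | re.VERBOSE)


def tokenize(source):
    """Step 1: split source into tokens, ignoring comments and whitespace."""
    return [m.group() for m in TOKEN.finditer(source)
            if m.lastgroup != "comment"]


def fingerprint(source):
    """Step 2: MD5 hash of the token sequence."""
    joined = "\n".join(tokenize(source))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_file_clones(files):
    """Step 3: group files whose hashes match exactly.

    `files` maps a path to its source text. Returns the file-clone sets
    (groups of two or more paths) as sorted lists.
    """
    groups = defaultdict(list)
    for path, source in files.items():
        groups[fingerprint(source)].append(path)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

Because every file is reduced to a single hash, comparing files is a dictionary lookup rather than a pairwise token comparison, which is consistent with the speed difference reported in the table above.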
Experiment
* Target
  * Only .c and .h files in the FreeBSD Ports Collection
  * ~1.4M files
  * ~12 GB
  * 17.16 hours
* We measured:
  * File size
  * Number of files in each project
  * Size of each file-clone set
  * Number of file-clones in a project
* These values follow a power law
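One common way to check whether a measured distribution follows a power law y = c * x^k is a least-squares line fit on log-log axes, reporting the coefficient of determination R^2. The slides do not say how their R^2 values were obtained, so the sketch below is an assumption about the method, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Fit log(y) = log(c) + k * log(x) by ordinary least squares.

    Returns (c, k, r_squared). All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    k = sxy / sxx                       # slope = power-law exponent
    c = math.exp(mean_y - k * mean_x)   # intercept gives the scale factor
    r_squared = (sxy * sxy) / (sxx * syy)
    return c, k, r_squared


# Synthetic data generated from y = 1000 * x^-2 (not data from the study).
sizes = [1, 2, 4, 8, 16]
counts = [1000 * x ** -2 for x in sizes]
c, k, r_squared = fit_power_law(sizes, counts)
```

An R^2 close to 1 means the points lie near a straight line on log-log axes; real measurements will scatter around that line and give lower values.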
File-clone Set Size
[Figure: number of file clone sets plotted against the population of a file clone set (file clone set size); axis ticks run from 1 to 100. R^2 = 0.8508.
Point labels: D, E.
Annotations: "used in both PHP4 and PHP5"; "Left: used in PHP5, Right: used in PHP4"; "120 file clones"; "L: 61 file clones, R: 59 file clones"; "L: 650 sets, R: 500 sets"; "419 sets".]
File-clones per Project
[Figure: number of projects with file clones plotted against the number of file clones in projects (clones inside a project are excluded). R^2 = 0.8263.
Point label: G.
Annotation: "Left: PHP5 modules, Center: projects related to bin-utils, Right: PHP4 modules".]
File-clones Between Projects (1/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
* Ex) gcc41 and gfortran share 7691 file clones
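Edge weights for such a graph could be derived from the file-clone sets along these lines. The slides do not define exactly how "the number of file clones between two projects" is counted, so this sketch assumes one plausible reading: each clone set with files in both projects adds 1 to their edge. The helper names and the rule that the first path component is the project name are also assumptions.

```python
from collections import Counter
from itertools import combinations


def project_of(path):
    # Assumption: the first path component names the project.
    return path.split("/", 1)[0]


def clone_edges(clone_sets):
    """Count, for each pair of projects, the clone sets they share.

    `clone_sets` is an iterable of lists of file paths. Returns a Counter
    keyed by (project_a, project_b) with project_a < project_b.
    Clone sets that stay inside a single project produce no edge.
    """
    edges = Counter()
    for clone_set in clone_sets:
        projects = sorted({project_of(p) for p in clone_set})
        for a, b in combinations(projects, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical file names; only the project names come from the slide.
example_sets = [
    ["gcc41/a.c", "gfortran/a.c"],
    ["gcc41/b.h", "gcc41/copy/b.h", "gfortran/b.h"],
    ["gcc41/x.c", "gcc41/y.c"],  # inside one project: no edge
]
edges = clone_edges(example_sets)
```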
File-clones Between Projects (2/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
File-clones Between Projects (3/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
Conclusions & Future Work
* Conclusions
  * Measured several features of the FreeBSD Ports Collection
  * Found that the measured features follow a power law
* Future Work
  * Investigation of logical coupling between projects