Gemini: Code Clone Analysis Tool
Yasushi Ueda†, Yoshiki Higo‡, Toshihiro Kamiya*,
Shinji Kusumoto‡, and Katsuro Inoue‡
†Graduate School of Engineering Science, Osaka Univ., Japan
‡Graduate School of Information Science and Technology, Osaka Univ., Japan
*PRESTO, Japan Science and Technology Corp., Japan
{y-ueda, y-higo, kamiya, kusumoto, inoue}@ist.osaka-u.ac.jp
1
Contents
Background
Code Clone Analysis Tool, Gemini
Overview
System structure
Scatter Plot
2
Background (1/2)
A code clone is a pair/set of code portions in
source files that are identical or similar to each
other.
3
Background (2/2)
Code clones are one of the factors that make
software maintenance more difficult.
If a fault is found in a code portion, it must be
corrected in all of its clones as well.
We have developed a code clone detection tool,
CCFinder [1].
Token-based clone detector
Its input is a set of source files and
output is the locations of clone pairs.
[1] T. Kamiya, S. Kusumoto, and K. Inoue,
“CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”,
IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
4
CCFinder
Example of clone detection process
Source files (Java example, 17 lines):
 1. static void foo() throws RESyntaxException {
 2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
 3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
 4.   int sum = 0;
 5.   for (int i = 0; i < a.length; ++i)
 6.     if (pat.match(a[i]))
 7.       sum += Sample.parseNumber(pat.getParen(0));
 8.   System.out.println("sum = " + sum);
 9. }
10. static void goo(String [] a) throws RESyntaxException {
11.   RE exp = new RE("[0-9,]+");
12.   int sum = 0;
13.   for (int i = 0; i < a.length; ++i)
14.     if (exp.match(a[i]))
15.       sum += parseNumber(exp.getParen(0));
16.   System.out.println("sum = " + sum);
17. }
Process:
Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs
Resulting clone pair in file 0.1: 3,1 - 9,1 and 11,1 - 17,1 (lines 3 to 9 and lines 11 to 17)
5
Gemini overview
A GUI-based code clone analysis tool
Uses CCFinder as a code clone detector.
Provides several views for interactive analysis.
Scatter plot view
Select by mouse dragging
Sorting function
Zoom in/out
Metric graph view
Select by metric values
Source code view
Implemented in Java
About 10,000 lines of code
6
Scatter plot
Both the vertical and
horizontal axes represent the
token sequence of the source
code.
Example axes: a b c a b c a d e c (a, b, c, ... : tokens)
A dot (matched position) means that the two
corresponding tokens on
the two axes are the same.
A clone pair is shown as a
diagonal line segment.
The main diagonal line is
always drawn, because each
dot on it compares a token
position with itself.
The distribution is symmetric
about the main diagonal line.
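The scatter plot described on this slide can be sketched in a few lines. This is a minimal illustration in Python, not Gemini's implementation: the function names `dot_matrix` and `clone_segments` and the `min_len` threshold are assumptions made for the example, and positions are 0-based.

```python
def dot_matrix(tokens):
    """Return an n x n boolean matrix: cell [i][j] is True when tokens i and j are equal."""
    return [[a == b for b in tokens] for a in tokens]


def clone_segments(tokens, min_len=2):
    """Return (row, col, length) for each maximal diagonal run of dots above the main diagonal.

    min_len is an illustrative threshold; runs shorter than it are ignored.
    """
    n = len(tokens)
    m = dot_matrix(tokens)
    segments = []
    for i in range(n):
        for j in range(i + 1, n):
            # A run starts here only if the previous cell on the same diagonal is empty.
            starts_run = m[i][j] and (i == 0 or not m[i - 1][j - 1])
            if not starts_run:
                continue
            length = 0
            while j + length < n and m[i + length][j + length]:
                length += 1
            if length >= min_len:
                segments.append((i, j, length))
    return segments


tokens = "a b c a b c a d e c".split()
matrix = dot_matrix(tokens)
for row in matrix:
    print("".join("*" if dot else "." for dot in row))
print(clone_segments(tokens))
```

For the token sequence on the slide, the only diagonal run of two or more dots above the main diagonal starts at offsets 0 and 3 and is four tokens long (a b c a). Only the upper triangle is scanned because the plot is symmetric.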
7
Sorting function
When multiple files are compared in a scatter plot,
the boundaries between the files are shown on the axes.
Depending on the order of the files,
the dots may be spread widely over the plot.
The sorting function places similar files as close together as possible.
8
Snapshots of Gemini
9
Conclusions
We presented Gemini, a maintenance support
environment based on code clone analysis.
As future work, we are going to evaluate its
applicability to large-scale software in actual
maintenance.
10
11
CCFinder: Implementation
CCFinder extracts code clones by direct
comparison of source text.
It transforms the source text for precise and
effective detection of code clones.
Token-based transformation rules regularize and
select code portions in Java, C++, COBOL, etc.
programs.
It uses an efficient matching algorithm for
large source code.
Complexity of the algorithm: O(n), where n is the length
of the source code
Scalability: 108 min. for 7.2 million lines (Pentium
III 650 MHz, 640 MB memory)
12
The difference between ‘diff’
and clone detection tools
Diff finds the longest common subsequence.
For a given code portion, diff does not report
two or more identical code portions (clones).
A clone detection tool finds all identical
or similar code portions.
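A small experiment makes the difference concrete. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for diff (an assumption: its matching algorithm is not the one used by the Unix tool, but it also aligns each element at most once). The helper names `aligned_blocks` and `all_occurrences` are made up for this example.

```python
from difflib import SequenceMatcher


def aligned_blocks(a, b):
    """Non-empty matching blocks of difflib's alignment; each element of a is used at most once."""
    matcher = SequenceMatcher(None, a, b)
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]


def all_occurrences(fragment, tokens):
    """Clone-detection view: every start offset at which fragment occurs in tokens."""
    n = len(fragment)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == fragment]


fragment = ["a", "b", "c"]
tokens = ["a", "b", "c", "x", "a", "b", "c"]
print(aligned_blocks(fragment, tokens))   # one aligned block only
print(all_occurrences(fragment, tokens))  # both copies of the fragment
```

The alignment pairs the fragment with one copy only, while the exhaustive search reports both copies, which is the behaviour a clone detector needs.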
13
Example of
transformation rules in Java
All user-defined identifiers are transformed into the same
token.
A unique identifier is inserted at each end of the top-level
definitions and declarations.
This prevents detecting clones that begin in the middle of one class
definition and end in the middle of another one.
"java.lang.Math.PI" is transformed to "Math.PI".
With an import statement, a class can be referred to by either its full
package name or a shorter name.
"new int[] {1, 2, 3}" is transformed to "new int[] {$}".
This eliminates table initialization code.
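Three of these rules can be sketched on a token list as follows. This is a simplified illustration in Python and not CCFinder's actual rule set: the tokenizer regex, the keyword subset, the `$p` placeholder, and the heuristic that package names start with a lower-case letter and class names with an upper-case letter are all assumptions of the example. Every non-keyword identifier is treated as user-defined.

```python
import re

# Small illustrative subset of Java keywords (assumption, not the full list).
JAVA_KEYWORDS = {"static", "void", "int", "new", "for", "if", "throws", "return", "class"}

TOKEN_RE = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|\+\+|\+=|[^\s\w]')


def tokenize(source):
    """Split source text into string literals, names, numbers, and punctuation."""
    return TOKEN_RE.findall(source)


def strip_package_prefix(tokens):
    """Drop leading package components: java . lang . Math . PI -> Math . PI."""
    out, i = [], 0
    while i < len(tokens):
        j = i
        # Skip a run of "lower-case name ." pairs ...
        while (j + 1 < len(tokens) and tokens[j + 1] == "."
               and tokens[j][0].islower() and tokens[j].isidentifier()):
            j += 2
        # ... but only drop the run when it ends at a capitalised (class) name.
        if j > i and j < len(tokens) and tokens[j][0].isupper():
            i = j
        out.append(tokens[i])
        i += 1
    return out


def collapse_array_initializers(tokens):
    """Replace the contents of an initializer that follows ']' with a single '$'."""
    out, i = [], 0
    while i < len(tokens):
        out.append(tokens[i])
        if tokens[i] == "{" and i > 0 and tokens[i - 1] == "]":
            depth, j = 1, i + 1
            while j < len(tokens) and depth > 0:
                depth += {"{": 1, "}": -1}.get(tokens[j], 0)
                j += 1
            out.extend(["$", "}"])
            i = j
        else:
            i += 1
    return out


def transform(source):
    """Apply the three rules, then map every non-keyword identifier to '$p'."""
    tokens = collapse_array_initializers(strip_package_prefix(tokenize(source)))
    return ["$p" if t.isidentifier() and t not in JAVA_KEYWORDS else t for t in tokens]


print(transform('org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");'))
print(transform('RE exp = new RE("[0-9,]+");'))
```

Both statements produce the same transformed token sequence, so a matcher working on the transformed tokens can report them as clones even though the class is written with and without its package name and the variable names differ.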
14
Clone class metrics
LEN(C): Length of the token sequence of each element in clone class C
POP(C): Number of elements in clone class C
DFL(C): Estimate of how many tokens would be removed
from the source files if all code fragments of clone
class C were replaced with caller statements of one new
identical routine (a new subroutine plus caller statements)
RAD(C): Distribution of the elements of clone class C in the file system
15
Snapshots of
clone class metric graph
[Screenshot: clone class metric graph with the axes RAD, LEN, POP, and DFL; filtering mode: ON]
16
Aims of clone class metrics
We are interested in two kinds of clone classes.
Clone classes whose elements are spread widely.
A high value of POP means that there are many
similar code fragments.
A high value of RAD means that the clones are spread
over many subsystems, so it is difficult to find them all
during maintenance.
Clone classes that are appropriate for
refactoring.
A high value of DFL (high POP and high
LEN) means that it is worth
evaluating whether the elements of the clone class can be merged into
one routine.
17
Definition of DFL and RAD
DFL(C)
DFL(C) = LEN(C) × POP(C) - (5 × POP(C) + LEN(C))
LEN(C) × POP(C): the target code size for restructuring
5 × POP(C): the code size of the new caller statements
LEN(C): the code size of the new identical routine
RAD(C)
Distribution in the file system of the elements of clone class C
RAD(C) = 0: C is enclosed within a single file.
RAD(C) = 1: C is enclosed within a single directory.
RAD(C) = n: C is enclosed within a directory tree of n layers.
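The two definitions can be tried out with a short sketch. DFL follows the component descriptions on this slide: the target code size, minus the new caller statements and the new routine. The RAD function is one possible reading of "directory tree of n layers" (the number of path levels below the deepest directory shared by all fragments), which is an assumption of this example, as are the function names and the sample paths.

```python
from pathlib import PurePosixPath

CALL_SIZE = 5  # tokens per caller statement, the constant used in the DFL definition


def dfl(length, pop):
    """Tokens saved when POP fragments of LEN tokens become one routine plus POP calls."""
    return length * pop - (CALL_SIZE * pop + length)


def rad(paths):
    """Layers of the directory tree that encloses all files of a clone class.

    Assumed reading: 0 for a single file, otherwise the number of path levels
    (including the file name) below the deepest directory shared by all files.
    """
    parts = [PurePosixPath(p).parts for p in set(paths)]
    if len(parts) == 1:
        return 0
    common = 0
    for level in zip(*parts):
        if len(set(level)) != 1:
            break
        common += 1
    return max(len(p) for p in parts) - common


print(dfl(35, 3))                            # 35*3 - (5*3 + 35) = 55
print(rad(["src/a.java"]))                   # single file
print(rad(["src/a.java", "src/b.java"]))     # single directory
print(rad(["src/ui/a.java", "src/db/b.java"]))
```

A clone class with LEN = 35 and POP = 3 gives DFL = 55, while a short clone with few elements (for example LEN = 5, POP = 2) gives a negative value, which indicates that merging would not reduce the code size.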
18
CCFinder (3/4)
Application of CCFinder
Free software
JDK libraries (Java, 570 KLOC)
Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
FreeBSD, OpenBSD, NetBSD (C)
Qt (C++, 240 KLOC)
Commercial software
NTT data Corp., Hitachi Ltd., NEC soft Ltd.,
ASTEC Inc., SRA Inc.
NASDA (Control program for rocket)
19
CCFinder (4/4)
Output of CCFinder
#version: ccfinder 3.1
#langspec: JAVA
#option: -b 30,1
#option: -k +
#option: -r abcdfikmnprsv
#option: -c wfg
#begin{file description}
0.0 52 C:\Gemini.java
0.1 94 C:\GeneralManager.java
:
:
#end{file description}
#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
:
:
#end{clone}
Object file ID: the first column of a file description (0.0 is file 0 in group 0).
Location of a clone pair: each line of the clone section (in the first line, lines 53 - 63
in file 0.1 and lines 542 - 553 in file 1.10
are identical or similar to each other).
It is difficult to analyze
source code using only this
text-based information on
the locations of clone pairs.
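To illustrate how a tool such as Gemini might start from this output, the clone section can be read into records with a few lines of Python. This is a sketch based only on the sample shown on this slide: the `ClonePair` type and its field names are made up for the example, the second number of each position is assumed to be a column, and the meaning of the last column is not stated on the slide, so it is kept as an uninterpreted integer.

```python
from typing import NamedTuple, Tuple


class ClonePair(NamedTuple):
    file_a: str
    start_a: Tuple[int, int]  # (line, second number; assumed to be a column)
    end_a: Tuple[int, int]
    file_b: str
    start_b: Tuple[int, int]
    end_b: Tuple[int, int]
    last_column: int          # meaning not stated on the slide


def _position(text):
    line, col = text.split(",")
    return int(line), int(col)


def parse_clone_pairs(output):
    """Collect the records between #begin{clone} and #end{clone}."""
    pairs, inside = [], False
    for raw in output.splitlines():
        line = raw.strip()
        if line == "#begin{clone}":
            inside = True
        elif line == "#end{clone}":
            inside = False
        elif inside:
            fields = line.split()
            if len(fields) != 7:
                continue  # skips elision markers such as ":"
            fa, sa, ea, fb, sb, eb, last = fields
            pairs.append(ClonePair(fa, _position(sa), _position(ea),
                                   fb, _position(sb), _position(eb), int(last)))
    return pairs


sample = """#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
:
#end{clone}"""

for pair in parse_clone_pairs(sample):
    print(pair.file_a, pair.start_a[0], pair.end_a[0], "<->",
          pair.file_b, pair.start_b[0], pair.end_b[0])
```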
20
System structure of Gemini
[Figure: system structure of Gemini. Elements shown in the diagram:]
Code clone detector: CCFinder
Gemini components: Clone pair manager, Source code manager, Metrics manager
User interfaces: Scatter plot view, Metric graph views, Source code view
Data: Source files, Code clone database, Clone selection information
Actor: User
21
CCFinder
Example of clone detection process
[Figure: a 17-line Java example (methods foo() and goo()) shown at each stage of the process; in the transformed token sequence, identifiers appear as the token $p]
Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs
Resulting clone pair in file 0.1: 3,1 - 9,1 and 11,1 - 17,1
22
Suffix-tree
A suffix tree is a tree that satisfies the following conditions.
1. A leaf node represents the starting
position of a substring.
2. A path from the root node to a leaf node
represents a substring.
3. The first characters of the labels
of all the edges from one node
are different from each other.
→ A common path means
a clone
Example: the string x x y x y z % (positions 1 to 7)
root
  x
    xyxyz% → leaf 1
    y
      xyz% → leaf 2
      z% → leaf 4
  y
    xyz% → leaf 3
    z% → leaf 5
  z% → leaf 6
  % → leaf 7
[Figure: scatter plot of the string against itself; the repeated substrings x, xy, and y appear as matched positions off the main diagonal]
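The idea that a common path means a clone can be tried out with a naive suffix trie. The sketch below is not the linear-time suffix tree construction that CCFinder relies on: it inserts every suffix character by character, which takes quadratic time and space, and the function names are assumptions of this example. It reports every substring that lies on a path shared by two or more suffixes.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)  # 1-based positions, as on the slide
    return root


def repeated_substrings(text):
    """Return {substring: [start positions]} for substrings shared by two or more suffixes."""
    found = {}
    stack = [(build_suffix_trie(text), "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            if len(child["starts"]) >= 2:  # a common path: at least two suffixes pass through
                found[prefix + ch] = sorted(child["starts"])
                stack.append((child, prefix + ch))
    return found


print(repeated_substrings("xxyxyz%"))
```

For the string on the slide this yields x at positions 1, 2, and 4, xy at positions 2 and 4, and y at positions 3 and 5, which correspond to the internal nodes of the tree.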
23
Case study overview
Application target
Programs developed in a programming exercise
at Osaka Univ.
Compilers written in C
Programs of 69 students
Total size is 360,000 lines of code
Issue of analysis
Similarity among all programs
In programming exercises,
plagiarism sometimes occurs.
24
Analysis (1/2)
The compilers of the 69 students
are arranged on the
two axes.
The grid represents the
boundary lines between
individual students.
The distribution is
spread widely.
→ Rearrangement of the
scatter plot
using the sorting
function
25
Analysis (2/2)
The corresponding code
A (2 students)
The similar code fragments
came from the source code
of a sample compiler
described in the textbook.
B (4 students)
Many code fragments
were similar even with
respect to the names of
variables and comments.
26
Sorting function
RSA(i): Ratio of the code range in file i covered by clones between file i and the other files
RST(i,j): Ratio of the code range in file i covered by clones between file i and file j
Step 1:
Select a head file by the
value of RSA
(make F the head file)
Step 2:
From among the remaining
files, select the file most similar
to F by the value of RST and put it next to F
Step 3:
Repeat step 2
while any file remains,
treating the most similar
file from the previous step 2 as
the new F
[Figure: example with six files, f1 to f6, rearranged step by step by this procedure]
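The three steps amount to a greedy nearest-neighbour ordering, which can be sketched as follows. This is an illustration, not Gemini's code: the slide does not say whether the head file has the highest or lowest RSA, nor whether similarity to F is measured as RST(F, j) or RST(j, F), so the sketch assumes the highest RSA and RST(F, j). The function name and the sample values are invented for the example.

```python
def sort_files(files, rsa, rst):
    """Greedy ordering: pick a head file by RSA, then keep appending the file most similar to the last one.

    rsa maps a file to its RSA value; rst maps a (file_i, file_j) pair to RST(i, j).
    Pairs missing from rst are treated as having no shared clones (0.0).
    """
    remaining = list(files)
    current = max(remaining, key=lambda f: rsa[f])  # assumption: highest RSA becomes the head F
    order = [current]
    remaining.remove(current)
    while remaining:
        # Step 2: the remaining file most similar to the current F goes next to it.
        nearest = max(remaining, key=lambda f: rst.get((current, f), 0.0))
        order.append(nearest)
        remaining.remove(nearest)
        current = nearest  # Step 3: the file just placed becomes the new F
    return order


# Made-up values for three files.
rsa = {"f1": 0.2, "f2": 0.9, "f3": 0.5}
rst = {("f2", "f1"): 0.6, ("f2", "f3"): 0.1, ("f1", "f3"): 0.3}
print(sort_files(["f1", "f2", "f3"], rsa, rst))
```

With these values f2 becomes the head file, f1 follows because it shares the most cloned code with f2, and f3 comes last. Because each step only looks at the file placed last, the result is a heuristic ordering rather than a globally optimal one.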
27