On Detection of Gapped Code Clones using Gap Locations

Yasushi Ueda†, Toshihiro Kamiya‡,
Shinji Kusumoto†, and Katsuro Inoue†
†Graduate School of Information Science and Technology, Osaka University, Japan
{y-ueda, y-higo, kusumoto, inoue}@ist.osaka-u.ac.jp
‡PRESTO, Japan Science and Technology Corporation, Japan
[email protected]
Contents
Background
 Research goals
 Gapped code clone detection
 Case study
 Conclusions and future works

APSEC 2002
Background (1/2)

A code clone is a pair/set of code portions
in source files that are identical or similar to
each other.
Background (2/2)

Code clones are one of the factors that make software
maintenance more difficult.

If a fault is found in a code portion, the same fault must be
corrected in all of its clones.
We have developed a code clone detection tool,
CCFinder [1], and its analysis tool, Gemini [2].

CCFinder
• Token-based clone detector
• The input is a set of source files and
the output (text-based) is the locations of clone pairs.

Gemini
• GUI-based clone analysis environment
• Uses CCFinder as a clone detector.
[1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”,
IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
[2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, “Gemini: Maintenance Support Environment Based on Code Clone Analysis”,
Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.
CCFinder/Gemini (1/4)

Example of clone detection process
Source files (two Java methods, foo and goo):

static void foo() throws RESyntaxException {
    String a[] = new String [] { "123,400", "abc", "orange 100" };
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
        if (pat.match(a[i]))
            sum += Sample.parseNumber(pat.getParen(0));
    System.out.println("sum = " + sum);
}
static void goo(String [] a) throws RESyntaxException {
    RE exp = new RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
        if (exp.match(a[i]))
            sum += parseNumber(exp.getParen(0));
    System.out.println("sum = " + sum);
}

[Figure: the listing is shown as a token sequence and as a transformed token sequence in which identifiers appear as the token $.]
Clone detection process:
Source files
→ Lexical analysis → Token sequence
→ Transformation → Transformed token sequence
→ Match detection → Clones on transformed sequence
→ Formatting → Clone pairs
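The lexical analysis and transformation stages can be illustrated with a small Python sketch. It is a deliberately simplified stand-in, not CCFinder's implementation: the regular expression, the reduced KEYWORDS set, and the function names lex and transform are assumptions made for illustration.

```python
import re

# Hypothetical, much-reduced keyword list; a real lexer would follow the language grammar.
KEYWORDS = {"static", "void", "throws", "new", "int", "for", "if"}

# String literals, identifiers/keywords, integers, two-character operators, any other symbol.
TOKEN_RE = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|\+\+|\+=|\S')
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def lex(source):
    """Split source text into a flat token sequence."""
    return TOKEN_RE.findall(source)


def transform(tokens):
    """Replace every non-keyword identifier with the placeholder token '$'."""
    return ["$" if IDENT_RE.fullmatch(t) and t not in KEYWORDS else t for t in tokens]


# The same statement from foo() and goo() differs only in identifier names.
foo_line = transform(lex("if (pat.match(a[i]))"))
goo_line = transform(lex("if (exp.match(a[i]))"))
print(foo_line)
print(foo_line == goo_line)
```

After the transformation the two statements yield the same token sequence, which is why match detection on the transformed sequence can report them as a clone even though the variable names differ.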
CCFinder/Gemini (2/4)

Gemini overview

A GUI-based code clone analysis tool
Uses CCFinder as a code clone detector.
Has several views for interactive analysis.
• Scatter plot view
• Select clones by mouse dragging
• Metric graph view
• Select clones by the value of metrics for clones
• Source code view
[Figure: scatter plot of the token sequence "a b c a b c a d e c" against itself. a, b, c, ... are tokens; each dot marks a matched position.]
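The idea behind the scatter plot view can be sketched as a text dot matrix: a cell is marked when the token on the vertical axis equals the token on the horizontal axis. This is only an illustration of the plot's principle; the function names dot_matrix and render are hypothetical and unrelated to Gemini's code.

```python
def dot_matrix(xs, ys):
    """Return a boolean matrix: cell [i][j] is True when xs[i] equals ys[j]."""
    return [[x == y for y in ys] for x in xs]


def render(xs, ys):
    """Draw the matrix as text, '*' for a matched position and '.' otherwise."""
    rows = ["  " + " ".join(ys)]
    for x, row in zip(xs, dot_matrix(xs, ys)):
        rows.append(x + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


tokens = "a b c a b c a d e c".split()
matrix = dot_matrix(tokens, tokens)
print(render(tokens, tokens))
```

Runs of marks along a diagonal correspond to repeated token sequences, that is, to clones.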
CCFinder/Gemini (3/4)

Classification of code clones

• Exact clone
• Renamed clone
• Gapped clone

Original code, reused by ‘copy-and-paste’:
if (a > b) { b++; a=1; }

Non-gapped clones
• Exact clone:
if (a > b) { b++; a=1; }
• Renamed clone (renamed):
if (i > j) { j++; i=0; }

Gapped clones (the differing statements are the gaps)
• A statement is inserted:
if (i > j) { i = i / 2; j++; i=0; }
• A statement is modified:
if (i > j) { j = j + 1; i=0; }
• A statement is deleted:
if (i > j) { i=0; }
CCFinder/Gemini (4/4)

Needs of gapped clone detection


CCFinder can detect only non-gapped clones.
A gapped clone is detected separately, as several short non-gapped clones.
• If a matched portion is too short, CCFinder does not identify it as
a clone, because the minimum length of clones to be detected must be
set in CCFinder beforehand.
• Generally, if the minimum length is set to a small value, too many clones
are detected.
• Minimum length of 30 tokens: the number of clone pairs is 1208.
• Minimum length of 10 tokens: the number of clone pairs is 26984.

Example: the matched portions of the two methods below are 14, 27, and 13 tokens long.
Set the min. length to 20 tokens…

static void foo() throws RESyntaxException
{
    String a[] = new String [] {"123,400", "abc"};
    org.apache.regexp.RE pat =
        new org.apache.regexp.RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
    {
        if (pat.match(a[i])){
            sum += Sample.parseNumber(pat.getParen(0));}
    }
    System.out.println("sum = " + sum);
}

static void goo(String [] a) throws RESyntaxException
{
    RE exp = new RE("[0-9,]+");
    int sum = 0;
    int i = 0;
    while (i < a.length)
    {
        if (exp.match(a[i]))
            sum += parseNumber(exp.getParen(0));
        i++;
    }
    System.out.println("sum = " + sum);
}
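The effect of the minimum length can be made concrete with the three matched portions from the example (14, 27, and 13 tokens). The sketch below assumes that a portion is reported exactly when its length is at least the minimum; the function name reported is hypothetical.

```python
def reported(match_lengths, min_length):
    """Keep only the matched portions that are at least min_length tokens long."""
    return [n for n in match_lengths if n >= min_length]


portions = [14, 27, 13]          # token lengths of the matched portions in the example
print(reported(portions, 20))    # only the 27-token portion survives
print(reported(portions, 10))    # all three portions are reported
```

With a minimum of 20 tokens two of the three pieces of the gapped clone disappear, and lowering the minimum to recover them also admits many unrelated short clones.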
Research goals

Propose a method to efficiently detect
gapped clones.

Conduct a case study to evaluate the
method.
Gapped code clone detection
- Overview (1/2)

Major premise


Treat the problem of detecting gapped clones as a problem of
combining non-gapped clones.
Combination explosion of non-gapped clones

If there are many overlapping or densely packed non-gapped
clones, identifying gapped clones causes a combinatorial
explosion, because one non-gapped clone may have many
other non-gapped clones that can be combined with it into a gapped clone.
[Figure: examples in which the number of combinations is 3, 15, and 105.]
Computation takes a long time.
Gapped code clone detection
- Overview (2/2)

Approach

Man-machine collaboration
• Extract concatenated subsets (entanglements) from all of the non-gapped clones.
• Visualize the entanglements on a scatter plot.
• Users can see the locations where gapped clones possibly exist and
interactively pick one of them to find gapped clones in it.

Detecting process

Step1: Non-gapped clone detection
Step2: Gap identification
Step3: Visualization
Step4: Source code investigation
[Figure: Source files → Non-gapped clone detection → Non-gapped clones → Gap identification → Gaps (correspondences) → Visualization → Gap-and-clone scatter plot → Source code investigation]
Gapped code clone detection
- Detecting process

Sample input
 Code sequence of source file X: “ABCDCDEFBCDG”
 Code sequence of source file Y: “ABCEFBCDEBCD”
• “A”, “B”, “C” … are code portions in a certain unit.
[Figure: scatter plot with file X on the vertical axis and file Y on the horizontal axis.]
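For the sample input, non-gapped clones are the maximal diagonal runs of equal units in the scatter plot of X against Y. The following Python sketch finds them by brute force; it only illustrates the definition and is not CCFinder's algorithm. The function name non_gapped_clones and the sort key are assumptions.

```python
def non_gapped_clones(x, y, min_len=1):
    """Return maximal matching runs as (x_start, x_end, y_start, y_end, text), 1-based."""
    clones = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue  # inside a longer run, not its first position
            k = 0
            while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
                k += 1
            if k >= min_len:
                clones.append((i + 1, i + k, j + 1, j + k, x[i:i + k]))
    # Clones that start earlier in file X come first.
    return sorted(clones, key=lambda c: (c[0], c[1], c[2]))


clones = non_gapped_clones("ABCDCDEFBCDG", "ABCEFBCDEBCD")
for number, clone in enumerate(clones, start=1):
    print(f"c{number}", clone)
```

With a minimum length of one unit the sketch yields nine runs, including “ABC” at positions 1 – 3 of both files and “EFBCD” at positions 7 – 11 of X and 4 – 8 of Y.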
Gapped code clone detection
- Detecting process

Gap identification connects non-gapped clones, subject to the upper limit of gap length.
[Figure: three animation steps of the scatter plot of file X (“ABCDCDEFBCDG”, vertical axis) against file Y (“ABCEFBCDEBCD”, horizontal axis), shown beside the flow Source files → Non-gapped clone detection → Non-gapped clones → Gap identification → Gaps (correspondences) → Visualization → Gap-and-clone scatter plot → Source code investigation.]
Gapped code clone detection
- Implementation


CCFinder is used as the non-gapped clone
detection tool.
The GUI maintenance support tool Gemini is extended.
 On the gap-and-clone scatter plot view
implemented in Gemini, the user can select non-gapped clones by mouse dragging and refer to
the actual source code.
[Figure: an entanglement on the gap-and-clone scatter plot.]
Case study overview

Application target

Programs developed in a programming exercise at Osaka University.
• A compiler, written in the C language
• The exercise consists of three steps (sub-exercises):
• Step1 (Ex.1): Making a syntax checker
• Step2 (Ex.2): Making a semantic checker
• Step3 (Ex.3): Making a compiler
• In Ex.2 and Ex.3, the programs were also required to be
developed by reusing the code of the previous programs.
• Programs of 69 students.
• Total size is 360,000 lines of code.

Issues of analysis

Type of gapped clones found in gap-and-clone scatter plot
Usefulness of gap-and-clone scatter plot
Analysis – Type of gapped clone found in gap-and-clone scatter plot

 Compare three versions of a function “sentence()” in Ex.1, Ex.2 and Ex.3 of a certain student.
 Parameters shown on the slide: the minimum size of non-gapped clones, the maximum size of gaps, and the minimum size of entanglements.
[Figure: the Ex.1, Ex.2 and Ex.3 listings of sentence() side by side, with matched portions carrying labels such as A, B, B1, B3 and B4 and annotated with their sizes in tokens (between 10 and 50 tokens). The listings share the checks on tok_name (SIDENTIFIER, SREADLN, SWRITELN, SBEGIN, SIF, STHEN, SELSE, SWHILE, SDO) and the calls scan(), basic_sen(), multi_sentence(), sentence() and syntax_error(). The listings also contain statements such as "if (expression() != TBOOLEAN) error(4);" and fprintf(outfile, ...) calls.]
Conclusions and future works



A method to show gapped clones based on
gap location information was proposed and
implemented.
A case study was conducted.
 As a result, we successfully found gapped
clones that are composed of several short clones,
each of which is too short to be detected individually.
Since we only show gapped clones and have no
mechanism to evaluate the characteristics of each
gapped clone quantitatively, we are going to examine
a method to extract each of them efficiently as future
work.
Web page of CCFinder/Gemini is available at
http://sel.ist.osaka-u.ac.jp/cdtools/index.html.en
Application of CCFinder/Gemini

Free software
• JDK libraries (Java, 570 KLOC)
• Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
• FreeBSD, OpenBSD, NetBSD (C)
• Qt (C++, 240 KLOC)

Commercial software
• NTT Data Corp., Hitachi Ltd., Hitachi GP Ltd., NEC
soft Ltd., ASTEC Inc., SRA Inc., NASDA, etc.

Student exercises at Osaka University
Filed in court as evidence in a software
copyright suit.
Differences between our method and homology analysis in genome informatics

Alignment analysis
• Dynamic programming
• O(mn) (m, n: lengths of the sequences)
• The optimal alignment is not our interest.

Homology search
• BLAST, FASTA
• We have no query sequence for search and want to detect all
gapped clones.
[Figure: an example dynamic-programming score matrix and alignments of short amino-acid sequences.]
Related work

Baxter et al. [3]

• Extract clone pairs of statements, declarations, or sequences
of them from C source files.
• Parse source code to build an abstract syntax tree (AST) and
compare its sub-trees by characterization metrics (hash
functions).
• The computational complexity is O(n), where n is the number of
sub-trees of the source files.
• The hash function enables one to do parameterized matching,
to detect gapped clones, and to identify clones of code portions
in which some statements are reordered.
[3] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier,
“Clone Detection Using Abstract Syntax Trees,”
Proc. of ICSM ’98, pp. 368-377, Bethesda, Maryland, 1998.
Computation cost of our method

Non-gapped clone detection (in CCFinder): O(n + m)
 n: length of source code
 m: number of non-gapped clones

Gap identification: O(m)
 Identification of the gaps combined with each non-gapped clone: O(1)

Total: O(n + m)
The difference between ‘diff’ and clone detection tools

Diff finds the longest common subsequence.
 Given a code portion, diff does not report
two or more identical code portions (clones).
A clone detection tool finds all the identical or
similar code portions.
Snapshots of clone class metric graph
[Figure: metric graph with the axes RAD, LEN, POP and DFL; filtering mode: ON.]
Clone class metrics

LEN (C): Length of the token sequence of each element in clone class C
POP (C): Number of elements in clone class C
DFL (C): Estimation of how many tokens would be removed
from source files when all code fragments of clone
class C are replaced with caller statements of a new
identical routine
[Figure: code fragments replaced by caller statements of a new subroutine.]
RAD (C): Distribution in the file system of elements in clone class C
Definitions of DFL and RAD

DFL (C)

DFL(C) = LEN(C) × POP(C) - (5 × POP(C) + LEN(C))
• LEN(C) × POP(C) : the target code size for restructuring
• 5 × POP(C) : the code size of new caller statements
• LEN(C) : the code size of new identical routine

RAD (C)

Distribution in the file system of elements in clone class C
• RAD(C) = 0 : C is enclosed within a single file.
• RAD(C) = 1 : C is enclosed within a single directory.
• RAD(C) = n : C is enclosed within a directory tree of n layers.
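A short worked example may help with DFL. The sketch below reads the metric as the code size before restructuring minus the code size afterwards (caller statements plus the new routine), which follows from the three terms listed in its definition; the function name dfl and the sample values are hypothetical.

```python
CALLER_TOKENS = 5  # size of one caller statement, as given in the definition of DFL


def dfl(length, population):
    """Tokens removed when every fragment of a clone class becomes a call to one routine."""
    before = length * population                      # target code size for restructuring
    after = CALLER_TOKENS * population + length       # caller statements + new routine
    return before - after


# Hypothetical clone class: 4 fragments of 30 tokens each.
print(dfl(30, 4))
```

For this hypothetical class the 120 tokens of cloned code are replaced by 20 tokens of caller statements and one 30-token routine, so 70 tokens are removed.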
Analysis using clone class metrics

Example of analysis issue

Finding clones that are appropriate for refactoring.
• Clones having high DFL
• Clones having high POP and low RAD
• It may be easy and meaningful to merge clones into one
routine because of their density.

Finding portions that are not reliable.
• Clones having high LEN
• Modules having larger code clones are less maintainable
than modules having smaller code clones [4].
[4] A. Monden, D. Nakae, T. Kamiya, S. Sato, and K. Matsumoto,
“Software Quality Analysis by Code Clones in Industrial Legacy Software”,
Proc. of the 8th IEEE International Symposium on Software Metrics, 87-96, 2002.
Suffix-tree

A suffix tree is a tree that satisfies the following conditions.
1. A leaf node represents the starting position of a sub-string.
2. A path from the root node to a leaf node represents a sub-string.
3. The first characters of the labels of all the edges from one node
are different from each other.
→ A common path means a clone.
[Figure: the suffix tree and the dot matrix of the string "xxyxyz%" (positions 1 to 7), with leaves numbered 1 to 7.]
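The statement that a common path means a clone can be checked on the example string "xxyxyz%". The sketch below does not build a suffix tree; it swaps in a simpler technique, sorting all suffixes and comparing neighbours, which finds the same longest shared prefix for short inputs. The function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_repeat(text):
    """Return (substring, first_pos, second_pos), 1-based, for the longest repeated substring."""
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sorting
    best = (0, -1, -1)
    for i, j in zip(order, order[1:]):
        k = common_prefix_len(text[i:], text[j:])
        if k > best[0]:
            best = (k, min(i, j) + 1, max(i, j) + 1)
    length, first, second = best
    return text[first - 1:first - 1 + length], first, second


print(longest_repeat("xxyxyz%"))
```

The suffixes starting at positions 2 and 4 share the prefix "xy", which corresponds to the common path in the tree and is the clone in this string.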
Example of transformation rules in Java

• All user-defined identifiers are transformed to the same token.
• A unique identifier is inserted at each end of the top-level definitions
and declarations.
 Prevents detecting clones that begin in the middle of one class
definition and end in the middle of another one.
• ”java. lang. Math. PI” is transformed to ”Math. PI”.
 With an import statement, a class can be referred to by either its full
package name or a shorter name.
• ” new int[] {1, 2, 3} ” is transformed to ” new int[] {$} ”.
 Eliminates table initialization code.
The output of CCFinder

Output of CCFinder
#version: ccfinder 3.1
#langspec: JAVA
#option: -b 30,1
#option: -k +
#option: -r abcdfikmnprsv
#option: -c wfg
#begin{file description}
0.0 52 C:\Gemini.java
0.1 94 C:\GeneralManager.java
:
:
#end{file description}
#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
:
:
#end{clone}

• Object file ID: “0.0” means file 0 in Group 0.
• Location of a clone pair: the first line in the clone section means that
Lines 53 - 63 in file 0.1 and Lines 542 - 553 in file 1.10
are identical or similar to each other.
 It is difficult to analyze source code using only this
text-based information on the locations of clone pairs.
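To show why tooling on top of this text output is useful, here is a minimal Python sketch that reads the clone section shown above. It relies only on what the slide states (file IDs and start/end lines); treating the number after each comma as a column and the last field as a length are assumptions, and all type and function names are hypothetical.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    file_id: str
    start_line: int
    start_col: int   # assumption: the number after the comma is a column
    end_line: int
    end_col: int


class ClonePair(NamedTuple):
    left: Fragment
    right: Fragment
    length: int      # assumption: the last field is the length of the clone


def parse_fragment(file_id, start, end):
    start_line, start_col = (int(v) for v in start.split(","))
    end_line, end_col = (int(v) for v in end.split(","))
    return Fragment(file_id, start_line, start_col, end_line, end_col)


def parse_clones(text):
    """Collect the clone pairs listed between #begin{clone} and #end{clone}."""
    pairs, inside = [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "#begin{clone}":
            inside = True
        elif line == "#end{clone}":
            inside = False
        elif inside and line and line != ":":
            f = line.split()
            if len(f) != 7:
                raise ValueError(f"unexpected clone line: {line!r}")
            pairs.append(ClonePair(parse_fragment(*f[0:3]), parse_fragment(*f[3:6]), int(f[6])))
    return pairs


SAMPLE = """#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
#end{clone}"""

pairs = parse_clones(SAMPLE)
for p in pairs:
    print(p.left.file_id, p.left.start_line, p.left.end_line,
          p.right.file_id, p.right.start_line, p.right.end_line)
```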
Gapped code clone detection
- Algorithm (1/5)

Step1: Non-gapped clone detection

Detect non-gapped clones from input source files.
• Set the minimum length of clone (threshold1).

Sort the list of the detected non-gapped clones for
efficient identification of gap locations in Step2.
• A clone pair that appears earlier in the file
also appears earlier in the sorted list.
• When the detection result comes from a
comparison among three or more files, the set of non-gapped
clones can be divided into subsets, one for each
combination of two files.

Non-gapped    Pos. in file X    Pos. in file Y    Matched
clone ID      (ABCDCDEFBCDG)    (ABCEFBCDEBCD)    subsequence
c1            1 – 3             1 – 3             “ABC”
c2            2 – 4             6 – 8             “BCD”
c3            2 – 4             10 – 12           “BCD”
c4            5 – 5             3 – 3             “C”
c5            5 – 6             11 – 12           “CD”
c6            5 – 7             7 – 9             “CDE”
c7            7 – 11            4 – 8             “EFBCD”
c8            9 – 10            2 – 3             “BC”
c9            9 – 11            10 – 12           “BCD”
Gapped code clone detection
- Algorithm (2/5)

Step2: Gap identification

Generate gap locations from the sorted list of non-gapped
clones.
• A gap location is a kind of combination of two non-gapped clones.
• (c1, c6) = ((1-3, 1-3), (5-7, 7-9))  g1 = (4-4, 4-6)
• The length of each gap is the length of the longer unmatched
subsequence.
• Set the upper limit of the length of each gap (threshold2).
 The overall time complexity of
Step2 is O(n) (n: number of non-gapped clones)
• Use the following facts for optimization:
• Non-gapped clones are stored in sorted order.
• The number of gaps connected
from each non-gapped clone can
be considered to be bounded by a certain
constant.

Gap ID    Pos. in file X    Pos. in file Y    Length
          (ABCDCDEFBCDG)    (ABCEFBCDEBCD)    (of the longer side)
g1        4 – 4             4 – 6             3
g2        4 – 4             4 – 10            7
g3        4 – 6             –                 3
g4        4 – 8             4 – 9             6
g5        –                 9 – 10            2
g6        5 – 8             9 – 9             4
g7        8 – 8             –                 1
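Gap identification can be sketched in Python as follows. This is a naive scan over all ordered pairs of clones, not the O(n) procedure over the sorted list described above. The clone positions come from the Step1 table, with two assumptions that are labeled in the code: the one-unit clone c4 is left out, and c6 is entered as X 5 – 7 to match its three-unit subsequence “CDE”. The names CLONES, span and find_gaps are hypothetical.

```python
# (x_start, x_end, y_start, y_end), 1-based positions from the Step1 table.
CLONES = {
    "c1": (1, 3, 1, 3),
    "c2": (2, 4, 6, 8),
    "c3": (2, 4, 10, 12),
    # c4 (a single unit) is omitted: an assumption made for this sketch.
    "c5": (5, 6, 11, 12),
    "c6": (5, 7, 7, 9),   # assumption: X range 5-7, the extent of "CDE" in file X
    "c7": (7, 11, 4, 8),
    "c8": (9, 10, 2, 3),
    "c9": (9, 11, 10, 12),
}


def span(start, end):
    """Return (start, end), or None when the unmatched range is empty."""
    return (start, end) if start <= end else None


def find_gaps(clones, max_gap):
    """Pair each clone with every clone that starts after it in both files."""
    gaps = []
    for a, (_, a_x_end, _, a_y_end) in clones.items():
        for b, (b_x_start, _, b_y_start, _) in clones.items():
            if b_x_start <= a_x_end or b_y_start <= a_y_end:
                continue  # b does not follow a in both files
            length = max(b_x_start - a_x_end - 1, b_y_start - a_y_end - 1)
            if 0 < length <= max_gap:  # threshold2
                gaps.append((a, b,
                             span(a_x_end + 1, b_x_start - 1),
                             span(a_y_end + 1, b_y_start - 1),
                             length))
    return gaps


gaps = find_gaps(CLONES, max_gap=7)
for gap in gaps:
    print(gap)
```

Under these assumptions and an upper limit of 7, the scan produces seven gaps, among them the gap between c1 and c6 at X 4 – 4 and Y 4 – 6 with length 3.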
Gapped code clone detection
- Algorithm (3/5)

Step3-1: Visualization – gap-and-clone scatter plot

Draw gaps on the scatter plot of non-gapped clones to
visualize gapped clones in a pseudo way.
[Figure: gap-and-clone scatter plot of file X (ABCDCDEFBCDG, vertical axis, positions 1 to 12) against file Y (ABCEFBCDEBCD, horizontal axis, positions 1 to 12), showing the non-gapped clones c1 to c3 and c5 to c9 and the gaps g1, g3, g5 and g7.]

Gapped      Path              Subsequence in file X    Subsequence in file Y
clone ID                      (ABCDCDEFBCDG)           (ABCEFBCDEBCD)
gc1         c1 g1 c5 g7 c7    “ABC-CDE--CD”            “ABC---CDEBCD”
gc2         c1 g3 c6          “ABC---EFBCD”            “ABCEFBCD”
gc3         c2 g5 c4          “BCDCD”                  “BCDCD”
Gapped code clone detection
- Algorithm (4/5)

Step3-2: Visualization – filtering
Remove non-gapped clones and
gaps that do not contribute to
making a long gapped clone.
• Introduce the length of each
entanglement (“eSize”) of non-gapped clones and gaps.
• eSize = max (eSizeX, eSizeY)
eSizeX = eEndX – eStartX
eSizeY = eEndY – eStartY
• “eSize” means the maximum
length of gapped clone
included in the entanglement.
• Set the minimum “eSize” for
display (threshold3).
[Figure: the gap-and-clone scatter plot of file X (vertical axis) against file Y (horizontal axis) with clones c1 to c3 and c5 to c9 and gaps g1, g3, g5 and g7.]
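The eSize filter can be sketched in a few lines of Python. The slide does not define eStart and eEnd precisely; the sketch assumes they are the smallest start position and the largest end position over the non-gapped clones of an entanglement. The function names e_size and visible and the sample entanglements are hypothetical.

```python
def e_size(boxes):
    """eSize of an entanglement given its clones as (x_start, x_end, y_start, y_end)."""
    e_start_x = min(b[0] for b in boxes)
    e_end_x = max(b[1] for b in boxes)
    e_start_y = min(b[2] for b in boxes)
    e_end_y = max(b[3] for b in boxes)
    return max(e_end_x - e_start_x, e_end_y - e_start_y)


def visible(entanglements, threshold3):
    """Keep only the entanglements whose eSize reaches the display threshold."""
    return [e for e in entanglements if e_size(e) >= threshold3]


chain = [(1, 3, 1, 3), (5, 7, 7, 9), (9, 11, 10, 12)]  # three clones joined by gaps
lone = [(9, 10, 2, 3)]                                  # a clone with no gap attached

print(e_size(chain), e_size(lone))
print(visible([chain, lone], threshold3=5))
```

Here the chain spans positions 1 to 11 in X and 1 to 12 in Y, so its eSize is 11 and it stays on the plot, while the isolated two-unit clone is filtered out.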
Gapped code clone detection
- Algorithm (5/5)

Step4: Source code investigation

Investigate source files with the gap-and-clone
scatter plot.
Change parameters.
• Threshold1: Minimum size of non-gapped clones
in non-gapped clone detection.
• Threshold2: Maximum size of gaps in
identification of gap locations.
• Threshold3: Minimum size of entanglement of
non-gapped clones and gaps in the gap-and-clone
scatter plot.
• Threshold1 and threshold2 greatly affect
computation time.
• A small threshold1 makes O(m^2) non-gapped clone
pairs detected from size-m source code.
• A large threshold2 makes O(n^2) gaps detected from n
clone pairs.
Analysis - Usefulness of gap-and-clone scatter plot

Compared the scatter plots of non-gapped clones to the gap-and-clone
scatter plot.
• Three programs (Ex.1: 2267 tokens, Ex.2: 4394 tokens and Ex.3: 5738
tokens) of a student S are arranged on both the vertical and horizontal
axes.
 The grid lines represent the boundaries between sub-exercises.
[Figure: three scatter plots over Ex.1, Ex.2 and Ex.3. The first two show non-gapped clones with Threshold1 = 10 and with Threshold1 = 30. The third is the gap-and-clone scatter plot with Threshold1 = 10, Threshold2 = 10 and Threshold3 = 30, in which clones show up as long gapped clones. Two histograms plot the frequency of non-gapped clones against their length in tokens.]
The analysis of comparison among
students (non-gapped clones only)

The corresponding code

A (2 students)
• Similar code fragments
came from the source code of the
sample compiler described
in the textbook.

B (4 students)
• Many code fragments were
similar even with respect to
variable names and
comments.
[Figure: scatter plot with the regions A and B marked.]