Automatic Categorization Tool for Open Software Repositories

Shinji Kawaguchi†, Pankaj K. Garg††,
Makoto Matsushita†, Katsuro Inoue†
† Osaka University, Japan
†† Zee Source, USA
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
Outline
Background and research aim
Latent Semantic Analysis (LSA)
Problem with naive LSA approach
Proposed automatic categorization method
Case study
Discussions and conclusions
2003/10/26
OSIC'03
Software Repository
A “software repository” archives many software systems together with their source code.
Such repositories have become very common in recent years.
In the open source community
Provides a platform for many open source projects
E.g. SourceForge (http://sourceforge.net/)
In an industrial context
Archives the software systems created within a company
Shares information about projects that exist (or existed) in the company
Especially useful for large, distributed organizations
E.g. Corporate Source*, Progressive Open Source**
*J. Dinkelacker and P. Garg. “Corporate Source: Applying Open Source Concepts to a Corporate Environment (Position Paper)”. In Proceedings of the 1st ICSE International Workshop on Open Source Software Engineering, May 15, 2001, Toronto, Canada.
**J. Dinkelacker, P. Garg, D. Nelson, and R. Miller. “Progressive Open Source”. In Proceedings of the International Conference on Software Engineering, Orlando, Florida, 2002.
Background
A software repository is also used for...
finding a software system that meets a need
finding source code related to products currently under development
Generally, a repository holds many software systems.
SourceForge hosted 69,677 projects as of Oct. 24, 2003.
Categorization is essential for finding software.
At present, software systems are categorized manually.
A repository manager builds a hierarchical category structure.
A software developer chooses an appropriate category for each software system.
Problem
Inflexible and exclusive classification
Generally, software systems are categorized by their use.
Classification by the libraries or architecture a system depends on is also valuable.
A software system has various aspects.
Building a hierarchical category structure requires a huge amount of work.
Doing it well requires comprehensive knowledge of various libraries and architectures.
The load on a repository manager is high.
Nonexclusive classification
Software 1: Editor, GUI (MFC), support for regular expression
Software 2: Editor, GUI (GTK), support for regular expression
Software 3: Spreadsheet, GUI (MFC), support for regular expression
Software 4: Spreadsheet, GUI (GTK)
Overlapping categories: Editor, Spreadsheet, MFC, GTK, regexp
If you do not have knowledge about these libraries and architecture, you cannot prepare such categories.
Research Aim
An automatic categorization method for open source software
Nonexclusive categorization that accounts for various aspects of a software system.
Identifies the libraries and architecture a system depends on and classifies software systems automatically.
Uses only source code.
Does not require comprehensive knowledge about the software systems.
LSA - Latent Semantic Analysis
LSA was proposed for calculating the similarity between documents or terms in natural language.
LSA is based on the Vector Space Model.
LSA can detect similarity between documents that share only highly related (but not identical) words.
The original vector space model cannot detect such relationships.
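To illustrate the limitation of the original vector space model, here is a minimal sketch in Python. The function name `cosine` and the three toy vectors are assumptions made for illustration, not part of the slides. It computes the cosine of two term-count vectors: two documents with no word in common always score 0, however related their words may be.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Term order: A B C D E F G H (hypothetical toy documents)
doc_x = [1, 2, 0, 0, 0, 1, 0, 0]   # words: A B B F
doc_y = [0, 0, 0, 0, 0, 0, 2, 0]   # words: G G
doc_z = [0, 0, 0, 0, 0, 1, 1, 2]   # words: F G H H

sim_xy = cosine(doc_x, doc_y)  # no shared word, so the score is 0
sim_xz = cosine(doc_x, doc_z)  # only F is shared
```

Here `doc_x` and `doc_y` can never look similar in the plain model, even though `doc_z` links them through F and G. That indirect link is what LSA is meant to expose.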
Example of LSA
Doc1: A B B F
Doc2: A B C D E
Doc3: B C C C D
Doc4: G G
Doc5: F G H H
Doc6: E G H
Make a word-by-document matrix. Each row is a DocumentVector and each column is a TermVector.
        A    B    C    D    E    F    G    H
Doc1    1    2    0    0    0    1    0    0
Doc2    1    1    1    1    1    0    0    0
Doc3    0    1    3    1    0    0    0    0
Doc4    0    0    0    0    0    0    2    0
Doc5    0    0    0    0    0    1    1    2
Doc6    0    0    0    0    1    0    1    1
After LSA:
        A    B    C    D    E    F    G    H
Doc1   0.3  0.7  0.9  0.4  0.3  0.2  0.3  0.3
Doc2   0.4  1.0  1.4  0.6  0.3  0.2  0.1  0.1
Doc3   0.6  1.5  2.3  1.0  0.4  0.2 -0.2 -0.2
Doc4   0.1  0.1 -0.2  0.0  0.2  0.4  0.9  0.9
Doc5   0.1  0.2 -0.2  0.0  0.4  0.6  1.5  1.4
Doc6   0.1  0.2 -0.1  0.0  0.3  0.4  1.0  0.9
Similarities between documents and between terms are represented by the cosine of two vectors.
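The dimension reduction at the heart of LSA can be sketched with a truncated singular value decomposition. The sketch below uses NumPy and assumes a rank of k = 2; the slides do not state the rank or any term weighting, so the numbers it produces need not match the example above. The helper names `lsa_reduce` and `cosine_matrix` are illustrative.

```python
import numpy as np

# Word-by-document counts from the example (rows: Doc1..Doc6, columns: A..H)
X = np.array([
    [1, 2, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 0, 1, 1],
], dtype=float)

def lsa_reduce(matrix, k):
    """Return the rank-k approximation of matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    return unit @ unit.T

X2 = lsa_reduce(X, k=2)        # assumed rank; not given on the slides
before = cosine_matrix(X)      # document similarities on raw counts
after = cosine_matrix(X2)      # document similarities after reduction
```

Comparing `before` and `after` row by row shows how the reduction changes the similarity between documents that share no word directly.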
Effect of LSA
Documents that are only indirectly related show high similarity.
LSA makes the tendencies of documents clear.
Similarities between each pair of documents:
Before LSA
      1     2     3     4     5     6
1    1.0   0.2  -0.1  -0.3  -0.3  -0.5
2    0.2   1.0   0.5  -0.5  -0.9  -0.5
3   -0.1   0.5   1.0  -0.2  -0.4  -0.5
4   -0.3  -0.5  -0.2   1.0   0.3   0.5
5   -0.3  -0.9  -0.4   0.3   1.0   0.5
6   -0.5  -0.5  -0.5   0.5   0.5   1.0
After LSA
      1     2     3     4     5     6
1    1.0   1.0   0.9  -0.6  -0.6  -0.5
2    1.0   1.0   1.0  -0.8  -0.8  -0.7
3    0.9   1.0   1.0  -0.8  -0.8  -0.8
4   -0.6  -0.8  -0.8   1.0   1.0   1.0
5   -0.6  -0.8  -0.8   1.0   1.0   1.0
6   -0.5  -0.7  -0.8   1.0   1.0   1.0
Naive LSA approach for categorization
Apply LSA to software similarity
A software system corresponds to a document
An identifier (variable, function, type) corresponds to a word
Calculate similarities from the result of LSA
Apply cluster analysis to the similarities between software systems calculated above
Cluster analysis divides a set into groups using the similarities between items
Problem of naive approach
Each high similarity has its own reason.
Cluster analysis based on a single software similarity is therefore not adequate.
Software 1: Editor, GUI (MFC), support for regular expression
Software 2: Editor, GUI (GTK), support for regular expression
Software 3: Spreadsheet, GUI (MFC), support for regular expression
Software 4: Spreadsheet, GUI (GTK)
Classification by identifiers
Identifiers imply the behavior of source code.
Statements containing the identifier “window” are related to some kind of GUI operation.
Group highly related identifiers and consider each group as one category.
Software 1 (Editor, GUI (MFC)): window, cmdButton
Software 3 (Spreadsheet, GUI (MFC)): menuBar, window
Category: MFC
1. Extract Identifiers
Extract all identifiers
variable name
constant name
function name
type name
Soft1: A B B F J
Soft2: A B C D E
Soft3: B C C C D
Soft4: G G I J
Soft5: F G H H J
Soft6: E G H J
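The extraction step can be approximated as follows. This is a rough sketch only: it relies on regular expressions rather than a real C parser, so preprocessor directives and macros are not handled, and names such as `extract_identifiers` and `C_KEYWORDS` are assumptions made for illustration.

```python
import re
from collections import Counter

# C89 keywords, which are not counted as identifiers.
C_KEYWORDS = frozenset("""
auto break case char const continue default do double else enum extern
float for goto if int long register return short signed sizeof static
struct switch typedef union unsigned void volatile while
""".split())

# Comments, string literals and character literals are blanked out first.
_COMMENT_OR_LITERAL = re.compile(
    r'/\*.*?\*/|//[^\n]*|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)
_IDENTIFIER = re.compile(r'\b[A-Za-z_]\w*')

def extract_identifiers(source):
    """Count identifier occurrences in a C source string."""
    stripped = _COMMENT_OR_LITERAL.sub(' ', source)
    return Counter(
        name for name in _IDENTIFIER.findall(stripped)
        if name not in C_KEYWORDS
    )

sample = '''
/* open a window */
int main(void) {
    Widget window = create_window("window");
    show(window);
    return 0;
}
'''
counts = extract_identifiers(sample)
```

In this sample `window` is counted twice, because the occurrences inside the comment and the string literal are ignored.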
2. Make Identifier-by-Software Matrix
A row represents a software system.
A column represents an identifier.
A cell holds the number of times the identifier appears in the software system.
Soft1: A B B F J
Soft2: A B C D E
Soft3: B C C C D
Soft4: G G I J
Soft5: F G H H J
Soft6: E G H J
Identifier-by-software matrix:
     A  B  C  D  E  F  G  H  I  J
1    1  2  0  0  0  1  0  0  0  1
2    1  1  1  1  1  0  0  0  0  0
3    0  1  3  1  0  0  0  0  0  0
4    0  0  0  0  0  0  2  0  1  1
5    0  0  0  0  0  1  1  2  0  1
6    0  0  0  0  1  0  1  1  0  1
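Building the matrix amounts to counting identifiers per software system. The following sketch assumes the identifiers have already been extracted; the identifier lists are read from the slide figure, and `build_matrix` is a hypothetical helper name.

```python
from collections import Counter

# Identifiers of the six example systems, as read from the slide figure
software = {
    "Soft1": "A B B F J".split(),
    "Soft2": "A B C D E".split(),
    "Soft3": "B C C C D".split(),
    "Soft4": "G G I J".split(),
    "Soft5": "F G H H J".split(),
    "Soft6": "E G H J".split(),
}

def build_matrix(identifiers_by_software):
    """Return (software names, sorted vocabulary, count rows)."""
    names = list(identifiers_by_software)
    counts = [Counter(identifiers_by_software[n]) for n in names]
    vocabulary = sorted(set().union(*counts))
    # One row per software system, one column per identifier.
    rows = [[c[ident] for ident in vocabulary] for c in counts]
    return names, vocabulary, rows

names, vocabulary, matrix = build_matrix(software)
```

Each entry of `matrix` is the row of one software system, in the order given by `names`, with columns ordered as in `vocabulary`.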
3. Remove Stand-off Identifiers and Common Identifiers
We remove stand-off identifiers and common identifiers because they are useless for categorization.
Stand-off identifier
An identifier that appears in only one software system.
Common identifier
An identifier that appears in more than half of the software systems.
Columns I and J are removed from the example matrix, leaving:
     A  B  C  D  E  F  G  H
1    1  2  0  0  0  1  0  0
2    1  1  1  1  1  0  0  0
3    0  1  3  1  0  0  0  0
4    0  0  0  0  0  0  2  0
5    0  0  0  0  0  1  1  2
6    0  0  0  0  1  0  1  1
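The two removal rules can be written directly from their definitions. This sketch assumes the matrix is a list of rows, one per software system; the function name `remove_standoff_and_common` is illustrative, and the example matrix is the one reconstructed from the slide figure.

```python
VOCABULARY = list("ABCDEFGHIJ")
# Rows: software 1..6, columns: identifiers A..J (example from the slides)
MATRIX = [
    [1, 2, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 2, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
]

def remove_standoff_and_common(vocabulary, matrix):
    """Drop identifiers found in at most one system or in more than half."""
    n_software = len(matrix)
    kept = []
    for col in range(len(vocabulary)):
        n_containing = sum(1 for row in matrix if row[col] > 0)
        is_standoff = n_containing <= 1             # appears in only one system
        is_common = n_containing > n_software / 2   # appears in more than half
        if not (is_standoff or is_common):
            kept.append(col)
    new_vocabulary = [vocabulary[c] for c in kept]
    new_matrix = [[row[c] for c in kept] for row in matrix]
    return new_vocabulary, new_matrix

kept_vocabulary, kept_matrix = remove_standoff_and_common(VOCABULARY, MATRIX)
```

On the example, I occurs in one system and J in four of six, so both are dropped; B and G occur in exactly three systems, which is not more than half, so they stay.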
4. LSA
We apply LSA to the matrix from which stand-off identifiers and common identifiers have been removed.
Applying LSA lets us retrieve indirect relationships.
Matrix after LSA:
      A    B    C    D    E    F    G    H
1    0.3  0.7  0.9  0.4  0.3  0.2  0.3  0.3
2    0.4  1.0  1.4  0.6  0.3  0.2  0.1  0.1
3    0.6  1.5  2.3  1.0  0.4  0.2 -0.2 -0.2
4    0.1  0.1 -0.2  0.0  0.2  0.4  0.9  0.9
5    0.1  0.2 -0.2  0.0  0.4  0.6  1.5  1.4
6    0.1  0.2 -0.1  0.0  0.3  0.4  1.0  0.9
5. Cluster Identifiers
Calculate the similarities between all pairs of identifiers using the result of LSA.
Apply cluster analysis based on the similarities.
We call each resulting cluster an “identifier cluster”.
Identifier clusters in the example:
A B C D
F G H
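The slides do not say which cluster analysis is used, so the sketch below swaps in one simple choice: the cosine between identifier columns, followed by single-linkage grouping with a fixed threshold. The threshold of 0.92 is picked by hand for this toy matrix, and `cluster_identifiers` is a hypothetical name.

```python
import math

IDENTIFIERS = list("ABCDEFGH")
# Matrix after LSA, copied from the example (rows: software 1..6)
LSA_MATRIX = [
    [0.3, 0.7, 0.9, 0.4, 0.3, 0.2, 0.3, 0.3],
    [0.4, 1.0, 1.4, 0.6, 0.3, 0.2, 0.1, 0.1],
    [0.6, 1.5, 2.3, 1.0, 0.4, 0.2, -0.2, -0.2],
    [0.1, 0.1, -0.2, 0.0, 0.2, 0.4, 0.9, 0.9],
    [0.1, 0.2, -0.2, 0.0, 0.4, 0.6, 1.5, 1.4],
    [0.1, 0.2, -0.1, 0.0, 0.3, 0.4, 1.0, 0.9],
]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_identifiers(names, matrix, threshold):
    """Merge identifiers whose column cosine is at least threshold."""
    columns = [[row[j] for row in matrix] for j in range(len(names))]
    parent = list(range(len(names)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(columns[i], columns[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())

# threshold is an assumption chosen for this toy data
clusters = cluster_identifiers(IDENTIFIERS, LSA_MATRIX, threshold=0.92)
```

With this threshold the sketch groups A, B, C, D together and F, G, H together, while E stays on its own. A different threshold or linkage rule would give different groups.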
6. Make Software Clusters
From each identifier cluster, we make a software cluster.
A software cluster is the union of the software systems that contain an identifier included in the identifier cluster.
Identifier cluster A B C D: software cluster 1, 2, 3
Identifier cluster F G H: software cluster 1, 4, 5, 6
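Under that definition, a software cluster is a simple membership test. A minimal sketch, assuming the identifier lists of the example after stand-off and common identifiers have been removed; `make_software_cluster` is an illustrative name.

```python
# Identifiers per software system in the example, without I and J
IDENTIFIERS_BY_SOFTWARE = {
    1: ["A", "B", "B", "F"],
    2: ["A", "B", "C", "D", "E"],
    3: ["B", "C", "C", "C", "D"],
    4: ["G", "G"],
    5: ["F", "G", "H", "H"],
    6: ["E", "G", "H"],
}

def make_software_cluster(identifier_cluster, identifiers_by_software):
    """Software systems containing at least one identifier of the cluster."""
    wanted = set(identifier_cluster)
    return sorted(
        software
        for software, identifiers in identifiers_by_software.items()
        if wanted.intersection(identifiers)
    )

cluster_abcd = make_software_cluster("ABCD", IDENTIFIERS_BY_SOFTWARE)  # [1, 2, 3]
cluster_fgh = make_software_cluster("FGH", IDENTIFIERS_BY_SOFTWARE)    # [1, 4, 5, 6]
```

Software 1 lands in both clusters because it contains A and B as well as F, which is what makes the categorization nonexclusive.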
7. Make Cluster Titles
For each software cluster, we make a title that represents what kind of software systems it contains.
1. Get the software vectors of all software systems in the software cluster.
2. Sum them up.
3. From the summed vector, choose some tokens with high values and use them as the title of the cluster.
Software cluster 1, 2, 3: ClusterTitle1
Software cluster 1, 4, 5, 6: ClusterTitle2
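The three title steps can be sketched as below. The slides do not say whether the software vectors are taken before or after LSA, so this sketch assumes the raw count rows; the number of tokens kept (`n_tokens`) and the name `make_title` are also assumptions.

```python
VOCABULARY = list("ABCDEFGH")
# Software vectors of the example (raw counts, an assumption of this sketch)
SOFTWARE_VECTORS = {
    1: [1, 2, 0, 0, 0, 1, 0, 0],
    2: [1, 1, 1, 1, 1, 0, 0, 0],
    3: [0, 1, 3, 1, 0, 0, 0, 0],
    4: [0, 0, 0, 0, 0, 0, 2, 0],
    5: [0, 0, 0, 0, 0, 1, 1, 2],
    6: [0, 0, 0, 0, 1, 0, 1, 1],
}

def make_title(software_cluster, vectors, vocabulary, n_tokens):
    """Sum the cluster's software vectors and keep the highest-valued tokens."""
    members = [vectors[s] for s in software_cluster]      # step 1
    summed = [sum(column) for column in zip(*members)]    # step 2
    ranked = sorted(range(len(vocabulary)),
                    key=lambda j: summed[j], reverse=True)
    return [vocabulary[j] for j in ranked[:n_tokens]]     # step 3

title_1 = make_title([1, 2, 3], SOFTWARE_VECTORS, VOCABULARY, n_tokens=2)
title_2 = make_title([1, 4, 5, 6], SOFTWARE_VECTORS, VOCABULARY, n_tokens=2)
```

For the first cluster the summed vector is 2, 4, 4, 2, 1, 1, 0, 0, so the title is B, C; for the second it is 1, 2, 0, 0, 1, 2, 4, 3, giving G, H. Ties are broken by column order because Python's sort is stable.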
Automatic Categorization System
Target: programs written in C
Implemented in Perl
The token extractor, however, is written in C using YACC
Uses the SVDPACKC program for the LSA calculation
About 4,000 lines of code in total
Case study
We applied the proposed method to real software systems using the implemented prototype.
We chose 6 genres from SourceForge at random:
boardgames, compilers, database, editor, videoconversion, xterm
We retrieved all C programs from these 6 genres.
41 software systems
164,102 identifiers
After removing stand-off and common identifiers, 22,048 identifiers remain.
The result of case study (subset)
Columns: Title, Software, NoI

Title: AOP, emitcode, IC_RESULT, IC_LEFT, aop, aopGet, IC_RIGHT, pic14_emitcode, iCode, etype
Software: compilers/gbdk, compilers/sdcc
NoI: 8597

Title: CASE_IGNORE, CASE_GROUND_STATE, screen, CASE_PRINT, CASE_BYP_STATE, Widget, TScreen, CASE_IGNORE_STATE, CASE_PLT_VEC, CASE_PT_POINT
Software: xterm/R6.3, xterm/R6.4
NoI: 2160

Title: YY_BREAK, yyvsp, yyval, DATA, yy_current_buffer, tuple, yy_current_state, yy_c_buf_p, yy_cp, uint32
Software: compilers/gbdk, database/mysql-3.23.49, database/postgresql-7.2.1
NoI: 223
New category: software systems using YACC

Title: AVI, cinfo, OUTLONG, avi_t, AVI_errno, hdrl_data, OUT4CC, nhb, ERR_EXIT, str2ulong
Software: videoconversion/dv2jpg-1.1, videoconversion/libcu30-1.0, videoconversion/mjpgTools
NoI: 177

Title: white_to_move, move_s, promoted, board, num_moves, ply, pawn_file, npiece, pawns, moves
Software: boardgame/cinag-1.1.4, boardgame/faile_1_4_4, boardgame/Sjeng-10.0

Title: GtkWidget, gchar, gpointer, gint, widget, gtk_widget_show, N_, g_free, dialog, g_return_if_fail
Software: boardgame/gbatnav-1.0.4, editor/gedit-1.120.0, editor/gmas-1.1.0, editor/gnotepad+-1.3.3, editor/peacock-0.4
New category: software systems using the GTK library

NoI of the last two clusters, as listed on the slide: 154, 104
The remaining clusters fall in the same category as SourceForge.
The result of case study
Our system returned 40 clusters
Clusters same as existing categories: 18
New clusters: 8
Details of new clusters
GTK (2 clusters): GUI library
yacc (2 clusters): Library for syntactic analysis
regexp: Library for regular expressions
getopt: Library for parsing arguments
JNI: Java Native Interface
Python/C: Architecture for extending the Python interpreter
Discussion
Our method found categorizations by library and by architecture without any prior knowledge
Categorization by many aspects of software systems
Categorization without human knowledge
Cluster titles
Some titles are easy to understand, and some are not.
Clusters based on the same library tend to have understandable titles.
Conclusion and Future Work
We proposed an automatic categorization method for open software systems.
We showed that our method can find new categorizations without any knowledge about the software systems.
Future work
Improve the understandability of cluster titles
Large-scale experimentation
Similarity calculation
[Figure: kinds of similarity arranged by abstraction level (vertical axis) and unit (horizontal axis: function; module, component; software; team)]
metrics level: by developer, LoC, cyclomatic number, etc.; by the number of developers, CMM level, etc.
semantic level: by usage; by library or architecture
lexical level: by lexical similarity; by programming language
Usage of Software Search
[Figure: usages of software search arranged by abstraction level (metrics, semantic, lexical) and unit (function; module, component; software; team): estimate metrics, refer to development process, refer to design, reuse implementation]
Product Search System
[Figure: Development Division A and Development Division B each search for products in the Company Source Repository, which holds software developed in division A, software developed in division B, and software imported from an open source repository]
Proposed Method (1/2)
1. Extract Identifiers
2. Make Identifier-by-Software Matrix
3. Remove Stand-off Identifiers and Common Identifiers
Proposed Method (2/2)
4. LSA
5. Calculate Identifier Similarity and Cluster Analysis
6. Make Software Clusters
7. Make Cluster Titles