CLUTO A Clustering Toolkit

BY
ROSELINE ANTAI
What is CLUTO?
• CLUTO is a software package for clustering high-dimensional datasets and for analyzing the characteristics of the resulting clusters.
Algorithms of CLUTO
• vcluster
• scluster
Major difference: input format
vcluster: takes the actual multidimensional representation of the objects to be clustered.
scluster: takes the similarity matrix (or graph) between these objects.
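To make the difference between the two input types concrete, here is a minimal Python sketch. It assumes a tiny hypothetical dataset of three objects with two features each, and it only shows the two in-memory representations; it does not write CLUTO's file formats. The function name `cosine_similarity_matrix` is illustrative and not part of CLUTO.

```python
import math

def cosine_similarity_matrix(rows):
    """Return the n x n matrix of pairwise cosine similarities between rows."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    n = len(rows)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0 or norms[j] == 0:
                continue  # leave the similarity at 0 for all-zero rows
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim

# Hypothetical toy data: 3 objects (rows) described by 2 features (columns).
# This object-by-feature matrix is the kind of input vcluster works on.
objects = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

# The object-by-object similarity matrix is the kind of input scluster works on.
similarities = cosine_similarity_matrix(objects)
for row in similarities:
    print(row)
```

Objects 0 and 1 point in the same direction, so their similarity is 1, while object 2 is orthogonal to both and gets similarity 0.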
Calling Sequence
vcluster [optional parameters] MatrixFile NClusters
scluster [optional parameters] GraphFile NClusters
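As a rough illustration of the calling sequence, the Python sketch below assembles an argument list of the form `program [optional parameters] File NClusters`, writing each option as `-paramname` or `-paramname=value`. The helper name `build_cluto_command` and the file name `data.mat` are hypothetical; the sketch only builds and prints the command and does not run CLUTO.

```python
import shlex

def build_cluto_command(program, input_file, n_clusters, **options):
    """Assemble `program [optional parameters] File NClusters` as an argument list."""
    args = [program]
    for name, value in options.items():
        if value is True:
            args.append(f"-{name}")          # flag without a value, e.g. -fulltree
        else:
            args.append(f"-{name}={value}")  # option with a value, e.g. -sim=cos
    args += [input_file, str(n_clusters)]
    return args

# Hypothetical input file name; the options are taken from the examples in these slides.
cmd = build_cluto_command("vcluster", "data.mat", 4,
                          clmethod="rb", sim="cos", fulltree=True)
print(shlex.join(cmd))
# prints: vcluster -clmethod=rb -sim=cos -fulltree data.mat 4
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later without worrying about shell quoting.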
Optional Parameters
• Standard specification: -paramname or -paramname=value
• Three categories:
  • Clustering algorithm parameters
  • Reporting and analysis parameters
  • Cluster visualization parameters
Clustering algorithm parameters
• Control how CLUTO computes the clustering solution.
• Examples
1. -clmethod=string (rb, agglo, direct, graph, etc.)
2. -sim=string (cos, corr, dist, jacc)
3. -crfun=string (i1, i2, etc.)
4. -fulltree
Reporting and Analysis Parameters
• Control the amount of information that vcluster and scluster report about the clusters, as well as the analysis performed on the discovered clusters.
• Examples
1. -clustfile=string (name of the file that stores the clustering solution; the default is MatrixFile.clustering.NClusters, or GraphFile.clustering.NClusters)
2. -clabelfile=string (name of the file that stores the labels of the columns; used when -showfeatures, -showsummaries or -labeltree are used)
3. -rlabelfile=string (name of the file that stores the labels of the rows, i.e. the objects to be clustered)
4. -rclassfile=string (name of the file that stores the class labels of the rows)
5. -showtree
6. -showfeatures (reports the descriptive and discriminating features of each cluster)
Cluster Visualization Parameters
• Simple plots of the original input matrix which show how the different objects (rows) and features (columns) are clustered together.
• Examples
1. -plottree=string: gives a graphic representation of the entire hierarchical tree.
2. -plotmatrix=string: shows how the rows of the original matrix are clustered together.
A practical example

../cluto/Linux/vcluster -clmethod=rb -sim=cos -fulltree \
    -rlabelfile=Final_Results/rlabelfile \
    -rclassfile=Final_Results/classfile \
    -showtree -plotformat=gif \
    -plottree=Final_Results/Images/PT-Final10d \
    -plotmatrix=Final_Results/Images/PM-Final10d \
    -plotclusters=Final_Results/Images/PC-Final10d \
    -showfeatures Final_Results/FinalOutput10d-Vt.mat 4
roselineantai@ubuntu:~/JLSI/jlsi$ ./clusterscript.sh
********************************************************************************
vcluster (CLUTO 2.1.1) Copyright 2001-03, Regents of the University of Minnesota

Matrix Information -----------------------------------------------------------
  Name: Final_Results/FinalOutput5d-Vt.mat, #Rows: 59, #Columns: 5, #NonZeros: 295

Options ----------------------------------------------------------------------
  CLMethod=RB, CRfun=I2, SimFun=Cosine, #Clusters: 4
  RowModel=None, ColModel=None, GrModel=SY-DIR, NNbrs=40
  Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5
  CSType=Best, AggloFrom=0, AggloCRFun=I2, NTrials=10, NIter=10

Solution ---------------------------------------------------------------------

------------------------------------------------------------------------
4-way clustering: [I2=5.41e+01] [59 of 59], Entropy: 0.473, Purity: 0.780
------------------------------------------------------------------------
cid  Size   ISim  ISdev   ESim  ESdev  Entpy  Purty | Sem Imp Deo Evo
------------------------------------------------------------------------
  0    17 +0.731 +0.207 +0.095 +0.158  0.661  0.706 |   1   2   2  12
  1    18 +0.931 +0.034 +0.327 +0.081  0.252  0.889 |   0  16   2   0
  2    13 +0.811 +0.175 +0.270 +0.145  0.570  0.692 |   9   1   3   0
  3    11 +0.902 +0.022 +0.441 +0.095  0.433  0.818 |   1   1   9   0
------------------------------------------------------------------------

--------------------------------------------------------------------------------
4-way clustering solution - Descriptive & Discriminating Features...
--------------------------------------------------------------------------------
Cluster 0, Size: 17, ISim: 0.731, ESim: 0.095
  Descriptive:    col00001 29.6%, col00005 26.6%, col00003 25.8%, col00004 12.5%, col00002  5.6%
  Discriminating: col00003 58.4%, col00004 21.0%, col00005 17.3%, col00001  2.8%, col00002  0.5%

Cluster 1, Size: 18, ISim: 0.931, ESim: 0.327
  Descriptive:    col00003 44.6%, col00001 42.7%, col00005 10.5%, col00004  2.0%, col00002  0.3%
  Discriminating: col00003 62.4%, col00002 23.1%, col00005  9.1%, col00001  4.1%, col00004  1.4%

Cluster 2, Size: 13, ISim: 0.811, ESim: 0.270
  Descriptive:    col00001 43.1%, col00005 31.2%, col00002 24.0%, col00004  1.5%, col00003  0.1%
  Discriminating: col00005 83.1%, col00003 10.3%, col00002  4.0%, col00001  2.2%, col00004  0.4%

Cluster 3, Size: 11, ISim: 0.902, ESim: 0.441
  Descriptive:    col00001 38.6%, col00003 26.3%, col00004 17.7%, col00002 17.4%, col00005  0.0%
  Discriminating: col00004 42.7%, col00003 29.6%, col00001 15.9%, col00005 10.5%, col00002  1.2%
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Hierarchical Tree that optimizes the I2 criterion function...
--------------------------------------------------------------------------------
             Sem Imp Deo Evo
-----------------------------------
6
|-------0      1   2   2  12
|-5
  |-----2      9   1   3   0
  |-4
    |---3      1   1   9   0
    |---1      0  16   2   0
-----------------------------------

Timing Information -----------------------------------------------------------
  I/O:          0.004 sec
  Clustering:   0.008 sec
  Reporting:    0.268 sec
********************************************************************************
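The Entropy and Purity figures in the solution table of the output above can be checked by hand from the per-class counts (the Sem, Imp, Deo, Evo columns). The Python sketch below assumes that purity is the fraction of a cluster taken by its largest class, that entropy is the class-distribution entropy divided by the log of the number of classes, and that the overall figures are size-weighted averages over clusters. These definitions are an assumption on my part, but they do reproduce the values reported in the output.

```python
import math

# Class counts per cluster, columns ordered Sem, Imp, Deo, Evo
# (copied from the solution table in the output above).
counts = [
    [1, 2, 2, 12],   # cluster 0
    [0, 16, 2, 0],   # cluster 1
    [9, 1, 3, 0],    # cluster 2
    [1, 1, 9, 0],    # cluster 3
]

def purity(row):
    """Fraction of the cluster that belongs to its largest class."""
    return max(row) / sum(row)

def entropy(row):
    """Class entropy of the cluster, normalized by log(number of classes)."""
    n = sum(row)
    h = -sum((c / n) * math.log(c / n) for c in row if c > 0)
    return h / math.log(len(row))

total = sum(sum(row) for row in counts)
# Overall values are averages weighted by cluster size.
overall_purity = sum(sum(row) * purity(row) for row in counts) / total
overall_entropy = sum(sum(row) * entropy(row) for row in counts) / total

for cid, row in enumerate(counts):
    print(f"{cid}: entropy={entropy(row):.3f} purity={purity(row):.3f}")
print(f"overall: entropy={overall_entropy:.3f} purity={overall_purity:.3f}")
# last line prints: overall: entropy=0.473 purity=0.780
```

Lower entropy and higher purity both mean a cluster is dominated by a single class; cluster 1, with 16 of its 18 objects in Imp, scores best on both.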
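Classfile and rlabelfile
Classfile (one class label per line): Evo, Sem, Imp, Imp, Deo, Deo, Imp, Imp, Deo, Deo, Imp, Deo, Deo, Imp, Sem, Deo, Sem, Imp, Imp, Evo
rlabelfile (one row label per line): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15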
Plotclusters output
The plot uses red to denote positive values and green to denote negative values. Bright red and bright green indicate large positive and negative values respectively, whereas colors close to white indicate values close to zero.