Classic IR Models
Clustering
“Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)” [ACM CS’99]
Instances within a cluster are very similar
Instances in different clusters are very different
E.G.M. Petrakis
Text Clustering
1
Example
[Figure: scatter plot of points in a two-dimensional term space (term1, term2); the points form visibly separated groups.]
Applications
Faster retrieval
Faster and better browsing
Structuring of search results
Revealing classes and other data
regularities
Directory construction
Better data organization in general
Cluster Searching
Similar instances tend to be relevant to the same requests
The query is mapped to the closest cluster by comparison with the cluster centroids
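The mapping step above can be sketched in a few lines of Python (an illustrative sketch; `cosine` and `nearest_cluster` are hypothetical helper names, with the query and the centroids represented as plain term-weight vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_cluster(query, centroids):
    """Index of the centroid most similar to the query vector."""
    return max(range(len(centroids)), key=lambda i: cosine(query, centroids[i]))

centroids = [[1.0, 0.0], [0.0, 1.0]]
print(nearest_cluster([0.9, 0.1], centroids))  # 0
```

Only the clusters whose centroids are closest to the query then need to be searched.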
Notation
N: number of elements
Class: real world grouping – ground truth
Cluster: grouping by algorithm
The ideal clustering algorithm will
produce clusters equivalent to real
world classes with exactly the same
members
Problems
How many clusters ?
Complexity? N is usually large
Quality of clustering
When is one method better than another?
Overlapping clusters
Sensitivity to outliers
Example
[Figure: scatter plot with points of varying density; the number of clusters and the treatment of outliers are not obvious.]
Clustering Approaches
Divisive: build clusters “top down” starting
from the entire data set
K-means, Bisecting K-means
Hierarchical or flat clustering
Agglomerative: build clusters “bottom-up”, starting with individual instances and iteratively combining them to form larger clusters at higher levels
Hierarchical clustering
Combinations of the above
Buckshot algorithm
Hierarchical – Flat Clustering
Flat: all clusters at the same level
K-means, Buckshot
Hierarchical: nested sequence of clusters
Single cluster with all data at the top & singleton
clusters at the bottom
Intermediate levels are more useful
Every intermediate level combines two clusters
from the next lower level
Agglomerative, Bisecting K-means
Flat Clustering
[Figure: flat clustering — the points are partitioned into disjoint groups at a single level.]
Hierarchical Clustering
[Figure: hierarchical clustering — the same points are grouped into nested clusters 1–7, shown both on the scatter plot and as a dendrogram.]
Text Clustering
Finds overall similarities among
documents or groups of documents
Faster searching, browsing etc.
Requires a way to compute the similarity (or equivalently the distance) between documents
Query – Document Similarity
[Figure: document vectors d1 and d2 at angle θ.]
Similarity is defined as the cosine of the angle θ between the document (or query) vectors:
Sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|} = \frac{\sum_{i=1}^{M} w_{i d_1} w_{i d_2}}{\sqrt{\sum_{i=1}^{M} w_{i d_1}^2}\,\sqrt{\sum_{i=1}^{M} w_{i d_2}^2}}
Document Distance
Consider documents d1, d2 with vectors u1, u2
Their distance is defined as the length of the chord AB between the endpoints of the (unit-length) vectors:
distance(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2(1-\cos\theta)} = \sqrt{2(1-Sim(d_1, d_2))}
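The cosine similarity and the distance derived from it can be checked with a small sketch (illustrative; `sim` and `distance` are hypothetical names operating on plain term-weight vectors):

```python
import math

def sim(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

def distance(d1, d2):
    """Chord length between the unit vectors: sqrt(2 * (1 - cos(theta)))."""
    return math.sqrt(2.0 * (1.0 - sim(d1, d2)))

print(sim([1, 0], [1, 0]))       # identical vectors: similarity 1, distance 0
print(distance([1, 0], [0, 1]))  # orthogonal unit vectors: distance sqrt(2)
```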
Normalization by Document Length
The longer the document, the more likely it is for a given term to appear in it
Normalize the term weights by document length (so terms in long documents are not given more weight):
w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{M} w_{kj}^2}}
Evaluation of Cluster Quality
Clusters can be evaluated using internal
or external knowledge
Internal measures: intra-cluster cohesion and cluster separability
  intra-cluster similarity
  inter-cluster similarity
External measures: quality of clusters compared to real classes
  Entropy (E), Harmonic Mean (F)
Intra Cluster Similarity
A measure of cluster cohesion
Defined as the average pair-wise similarity of
documents in a cluster
\frac{1}{|S|^2} \sum_{d, d' \in S} sim(d, d') = c \cdot c = |c|^2
where c = \frac{1}{|S|} \sum_{d \in S} d is the cluster centroid
Documents (not centroids) have unit length
Inter Cluster Similarity
a) Single link: the similarity of the two most similar members
   max sim(c_i, c'_j), c_i ∈ S, c'_j ∈ S'
b) Complete link: the similarity of the two least similar members
   min sim(c_i, c'_j), c_i ∈ S, c'_j ∈ S'
c) Group average: the average similarity between members
   \frac{1}{|S||S'|} \sum_{d \in S} \sum_{d' \in S'} sim(d, d') = c \cdot c' = sim(c, c')
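The three definitions can be sketched as follows (illustrative; clusters are plain lists of unit-length vectors and the similarity is the dot product):

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def single_link(S, T):
    """Similarity of the two most similar members."""
    return max(dot(d, e) for d in S for e in T)

def complete_link(S, T):
    """Similarity of the two least similar members."""
    return min(dot(d, e) for d in S for e in T)

def group_average(S, T):
    """Average similarity over all cross-cluster pairs."""
    return sum(dot(d, e) for d in S for e in T) / (len(S) * len(T))

S = [[1.0, 0.0], [0.8, 0.6]]
T = [[0.0, 1.0], [0.6, 0.8]]
print(single_link(S, T), complete_link(S, T), group_average(S, T))
```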
Example
[Figure: two clusters S and S' with centroids c and c'; single link, complete link, and group average correspond to the closest pair, the farthest pair, and the centroid-to-centroid similarity, respectively.]
Entropy
Measures the quality of flat clusters
using external knowledge
Pre-existing classification
Assessment by experts
P_{ij}: probability that a member of cluster j belongs to class i
The entropy of cluster j is defined as
E_j = -\sum_i P_{ij} \log P_{ij}
Entropy (con’t)
Total entropy over all clusters:
E = \sum_{j=1}^{m} \frac{n_j}{N} E_j
where n_j is the size of cluster j, m is the number of clusters, and N is the number of instances
The smaller the value of E, the better the quality of the clustering
Note that E is trivially minimized when each cluster contains exactly one instance
Harmonic Mean (F)
Treats each cluster as a query result
F combines precision (P) and recall (R)
Fij for cluster j and class i is defined as
F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}}, where P_{ij} = \frac{n_{ij}}{n_j}, R_{ij} = \frac{n_{ij}}{n_i}
n_{ij}: number of instances of class i in cluster j
n_i: number of instances of class i
n_j: number of instances in cluster j
Harmonic Mean (con’t)
The F value of any class i is the maximum
value it achieves over all j
Fi = maxj Fij
The F value of a clustering solution is
computed as the weighted average over all
classes
F = \sum_{i} \frac{n_i}{N} F_i
where N is the total number of data instances
Quality of Clustering
A good clustering method
Maximizes intra-cluster similarity
Minimizes inter cluster similarity
Minimizes Entropy
Maximizes Harmonic Mean
It is difficult to achieve all of these simultaneously
Instead, maximize some objective function of the above
An algorithm is better than another if it scores better on most of these measures
K-means Algorithm
Select K centroids
Repeat I times or until the centroids do not change:
  Assign each instance to the cluster represented by its nearest centroid
  Compute new centroids
  Reassign instances, compute new centroids, …
K-Means demo: http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
(demo screenshots: Nikos Hourdakis, MSc Thesis, 20/7/2015)
Comments on K-Means (1)
Generates a flat partition of K clusters
K is the desired number of clusters and
must be known in advance
Starts with K random cluster centroids
A centroid is the mean or the median of
a group of instances
The mean rarely corresponds to a real
instance
Comments on K-Means (2)
Up to I = 10 iterations
Keep the clustering that resulted in the best inter/intra-cluster similarity, or the final clusters after I iterations
Complexity O(IKN)
A repeated application of K-Means for
K=2, 4,… can produce a hierarchical
clustering
Choosing Centroids for K-means
Quality of clustering depends on the
selection of initial centroids
Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
Select good initial centroids using a
heuristic or the results of another
method
Buckshot algorithm
Incremental K-Means
Update each centroid as each point is assigned to a cluster, rather than at the end of each iteration
Reassign instances to clusters at the end of each iteration
Converges faster than simple K-means: usually 2–5 iterations
Bisecting K-Means
Starts with a single cluster with all
instances
Select a cluster to split: the largest cluster, or the cluster with the least intra-cluster similarity
The selected cluster is split into 2
partitions using K-means (K=2)
Repeat up to the desired depth h
Hierarchical clustering
Complexity O(2hN)
Agglomerative Clustering
Compute the similarity matrix between
all pairs of instances
Starting from singleton clusters, repeat until a single cluster remains:
  Merge the two most similar clusters and replace them with a single cluster
  Update the similarity matrix accordingly
Complexity O(N²)
Similarity Matrix
         C1=d1   C2=d2   …   CN=dN
C1=d1     1       0.8    …    0.3
C2=d2     0.8     1      …    0.6
…         …       …      1    …
CN=dN     0.3     0.6    …    1
Update Similarity Matrix
The two most similar clusters (here C1=d1 and C2=d2, with similarity 0.8) are selected for merging; their rows and columns will be replaced by a single row and column:

          C1=d1*  C2=d2*  …   CN=dN
C1=d1*     1       0.8    …    0.3
C2=d2*     0.8     1      …    0.6
…          …       …      1    …
CN=dN      0.3     0.6    …    1

(* merged)
New Similarity Matrix
            C12=d1∪d2   …   CN=dN
C12=d1∪d2      1        …    0.4
…              …        1    …
CN=dN         0.4       …    1
Single Link
Selecting the most similar clusters for
merging using single link
max sim(c_i, c'_j), c_i ∈ S, c'_j ∈ S'
Can result in long and thin clusters due
to “chaining effect”
Appropriate in some domains, such as
clustering islands
Complete Link
Selecting the most similar clusters for
merging using complete link
min sim(c_i, c'_j), c_i ∈ S, c'_j ∈ S'
Results in compact, spherical clusters
that are preferable
Group Average
Selecting the most similar clusters for
merging using group average
\frac{1}{|S||S'|} \sum_{d \in S} \sum_{d' \in S'} sim(d, d') = c \cdot c' = sim(c, c')
A fast compromise between single and complete link, since it can be computed from the centroids
Example
[Figure: two clusters A and B with centroids c1 and c2; single link, complete link, and group average correspond to the closest pair, the farthest pair, and the centroid-to-centroid similarity, respectively.]
Inter Cluster Similarity
A new cluster is represented by its centroid
c = \frac{1}{|S|} \sum_{d \in S} d
The document-to-cluster similarity is computed as
sim(d, c) = d \cdot c
The cluster-to-cluster similarity can then be computed as single, complete, or group-average similarity
Buckshot K-Means
Combines agglomerative clustering and K-Means
Agglomerative clustering produces a good solution but has O(N²) complexity
Randomly select a sample of √N instances
Apply agglomerative clustering on the sample, which takes O(N) time
Use the centroids of the resulting clusters as input to K-Means
Overall complexity is O(N)
Example
[Figure: dendrogram over a sample of 15 instances (1–15); the centroids of the clusters at the selected level become the initial centroids for K-Means.]
More on Clustering
Sound methods based on the document-to-document similarity matrix
graph theoretic methods
O(N2) time
Iterative methods operating directly on
the document vectors
O(N log N), O(N²/log N), O(mN) time
Soft Clustering
Hard clustering: each instance belongs to
exactly one cluster
Does not allow for uncertainty
An instance may belong to two or more clusters
Soft clustering is based on probabilities that
an instance belongs to each of a set of
clusters
the probabilities over all clusters must sum to 1
Expectation Maximization (EM) is the most popular approach
More Methods
Two documents with similarity > T
(threshold) are connected with an
edge [Duda&Hart73]
clusters: the connected components (or, in a stricter variant, the maximal cliques) of the resulting graph
problem: selection of appropriate
threshold T
Zahn’s method [Zahn71]
Zahn’s method [Zahn71]
[Figure: minimum spanning tree of the points; the dashed edge is inconsistent and is deleted.]
1. Find the minimum spanning tree
2. For each point, delete incident edges with length l > lavg, where lavg is the average length of its incident edges
3. Clusters: the connected components of the resulting graph
References
"Searching Multimedia Databases by Content",
Christos Faloutsos, Kluwer Academic Publishers, 1996
“A Comparison of Document Clustering Techniques”,
M. Steinbach, G. Karypis, V. Kumar, In KDD Workshop
on Text Mining,2000
“Data Clustering: A Review”, A.K. Jain, M.N. Murphy,
P.J. Flynn, ACM Comp. Surveys, Vol. 31, No. 3, Sept.
99.
“Algorithms for Clustering Data” A.K. Jain, R.C.
Dubes; Prentice-Hall , 1988, ISBN 0-13-022278-X
“Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by Computer”,
G. Salton, Addison-Wesley, 1989