Classic IR Models


Clustering
“Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)” [ACM CS’99]
Instances within a cluster are very similar
Instances in different clusters are very different
Example
[Figure: scatter plot of documents in the (term1, term2) plane, forming several distinct groups of points]
Applications
Faster retrieval
Faster and better browsing
Structuring of search results
Revealing classes and other data
regularities
Directory construction
Better data organization in general
Cluster Searching
Similar instances tend to be relevant to the same requests
The query is mapped to the closest cluster by comparing it with the cluster centroids
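As an illustration, a minimal Python sketch of this routing step (the function name and the use of cosine similarity over raw weight vectors are assumptions for the example, not prescribed by the slides):

    import numpy as np

    def nearest_cluster(query, centroids):
        # route the query to the cluster whose centroid is most similar,
        # using the cosine of the angle between the vectors
        def cos(a, b):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(np.dot(a, b)) / denom if denom else 0.0
        return max(range(len(centroids)),
                   key=lambda k: cos(query, centroids[k]))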
Notation
N: number of elements
Class: real-world grouping (the ground truth)
Cluster: grouping produced by the algorithm
The ideal clustering algorithm produces clusters equivalent to the real-world classes, with exactly the same members
Problems
How many clusters?
Complexity? N is usually large
Quality of clustering: when is one method better than another?
Overlapping clusters
Sensitivity to outliers
Example
[Figure: scatter plot of data points illustrating the difficulties above, with overlapping groups and scattered outliers]
Clustering Approaches
Divisive: build clusters “top down” starting
from the entire data set
K-means, Bisecting K-means
Hierarchical or flat clustering
Agglomerative: build clusters “bottom-up”, starting with individual instances and iteratively combining them to form larger clusters at higher levels
Hierarchical clustering
Combinations of the above
Buckshot algorithm
Hierarchical – Flat Clustering
Flat: all clusters at the same level
K-means, Buckshot
Hierarchical: nested sequence of clusters
Single cluster with all data at the top & singleton
clusters at the bottom
Intermediate levels are more useful
Every intermediate level combines two clusters
from the next lower level
Agglomerative, Bisecting K-means
Flat Clustering
[Figure: the data points partitioned into flat clusters, all at a single level]
Hierarchical Clustering
[Figure: the data points grouped into nested clusters, numbered 1-7, together with the corresponding dendrogram]
Text Clustering
Finds overall similarities among documents or groups of documents
Enables faster searching, browsing, etc.
Requires knowing how to compute the similarity (or, equivalently, the distance) between documents
Query – Document Similarity
$$\mathrm{Sim}(\vec{d}_1, \vec{d}_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{|\vec{d}_1|\,|\vec{d}_2|} = \frac{\sum_{i=1}^{M} w_{i,d_1} w_{i,d_2}}{\sqrt{\sum_{i=1}^{M} w_{i,d_1}^2}\,\sqrt{\sum_{i=1}^{M} w_{i,d_2}^2}}$$

Similarity is defined as the cosine of the angle θ between the document and query vectors
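A direct transcription of the cosine formula into Python (illustrative only):

    import math

    def cosine_sim(d1, d2):
        # d1, d2: term-weight vectors of equal length M
        dot = sum(w1 * w2 for w1, w2 in zip(d1, d2))
        n1 = math.sqrt(sum(w * w for w in d1))
        n2 = math.sqrt(sum(w * w for w in d2))
        return dot / (n1 * n2) if n1 and n2 else 0.0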
Document Distance
Consider documents d1, d2 with unit vectors u1, u2
Their distance is defined as the length of the chord AB between the endpoints of the two vectors:

$$\mathrm{distance}(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2(1 - \cos\theta)} = \sqrt{2(1 - \mathrm{Sim}(d_1, d_2))}$$
Normalization by Document Length
The longer the document is, the more
likely it is for a given term to appear
in it
Normalize the term weights by
document length (so terms in long
documents are not given more weight)
$$w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{M} w_{kj}^2}}$$
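A small Python sketch of this normalization (illustrative):

    import math

    def normalize(weights):
        # divide each term weight by the document's vector length, so that
        # terms in long documents are not given more weight
        length = math.sqrt(sum(w * w for w in weights))
        return [w / length for w in weights] if length else list(weights)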
Evaluation of Cluster Quality
Clusters can be evaluated using internal or external knowledge
Internal measures: intra-cluster cohesion and cluster separability
  high intra-cluster similarity
  low inter-cluster similarity
External measures: quality of clusters compared to the real classes
  Entropy (E), Harmonic Mean (F)
Intra Cluster Similarity
A measure of cluster cohesion
Defined as the average pair-wise similarity of
documents in a cluster
$$\frac{1}{|S|^2}\sum_{d,d' \in S}\mathrm{sim}(\vec{d}, \vec{d'}) = \left(\frac{1}{|S|}\sum_{d \in S}\vec{d}\right)\cdot\left(\frac{1}{|S|}\sum_{d' \in S}\vec{d'}\right) = \vec{c}\cdot\vec{c}$$

where $\vec{c} = \frac{1}{|S|}\sum_{d \in S}\vec{d}$ is the cluster centroid
Documents (not centroids) have unit length
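A sketch of this shortcut in Python, assuming the rows of `docs` are unit-length document vectors as stated above:

    import numpy as np

    def intra_cluster_similarity(docs):
        # average pairwise cosine similarity over unit vectors equals the
        # squared norm of the centroid: O(|S|) instead of O(|S|^2)
        centroid = np.asarray(docs, dtype=float).mean(axis=0)
        return float(np.dot(centroid, centroid))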
Inter Cluster Similarity
a) Single Link: similarity of the two most similar members
$$\max\, \mathrm{sim}(c_i, c'_j), \quad c_i \in S,\; c'_j \in S'$$
b) Complete Link: similarity of the two least similar members
$$\min\, \mathrm{sim}(c_i, c'_j), \quad c_i \in S,\; c'_j \in S'$$
c) Group Average: average similarity between members
$$\frac{1}{|S|\,|S'|}\sum_{d \in S}\sum_{d' \in S'}\mathrm{sim}(\vec{d}, \vec{d'}) = \left(\frac{1}{|S|}\sum_{d \in S}\vec{d}\right)\cdot\left(\frac{1}{|S'|}\sum_{d' \in S'}\vec{d'}\right) = \vec{c}\cdot\vec{c'} = \mathrm{sim}(\vec{c}, \vec{c'})$$
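Illustrative Python sketches of the three criteria, assuming unit-length document vectors so that the dot product equals cosine similarity:

    import numpy as np

    def single_link(S, S2):
        # similarity of the two most similar members
        return max(float(np.dot(a, b)) for a in S for b in S2)

    def complete_link(S, S2):
        # similarity of the two least similar members
        return min(float(np.dot(a, b)) for a in S for b in S2)

    def group_average(S, S2):
        # average member-to-member similarity = dot product of the centroids
        c1 = np.mean(np.asarray(S, dtype=float), axis=0)
        c2 = np.mean(np.asarray(S2, dtype=float), axis=0)
        return float(np.dot(c1, c2))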
Example
[Figure: two clusters S and S' with centroids c and c'; the single link, complete link, and group average similarities connect different pairs of points]
Entropy
Measures the quality of flat clusters
using external knowledge
Pre-existing classification
Assessment by experts
P_ij: the probability that a member of cluster j belongs to class i
The entropy of cluster j is defined as
$$E_j = -\sum_i P_{ij} \log P_{ij}$$
Entropy (con’t)
Total entropy over all clusters:
$$E = \sum_{j=1}^{m} \frac{n_j}{N} E_j$$
where n_j is the size of cluster j, m is the number of clusters, and N is the number of instances
The smaller the value of E, the better the quality of the clustering
Note that the best (lowest) entropy is trivially obtained when each cluster contains exactly one instance, so E alone should not decide cluster quality
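A possible implementation of E in Python (the dictionary-based input format is an assumption for the example):

    import math
    from collections import Counter

    def clustering_entropy(clusters):
        # clusters: dict mapping cluster id -> list of the true class labels
        # of its members; lower E means purer clusters
        N = sum(len(labels) for labels in clusters.values())
        E = 0.0
        for labels in clusters.values():
            n_j = len(labels)
            E_j = -sum((c / n_j) * math.log(c / n_j)   # -sum of P_ij log P_ij
                       for c in Counter(labels).values())
            E += (n_j / N) * E_j
        return E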
Harmonic Mean (F)
Treats each cluster as a query result
F combines precision (P) and recall (R)
Fij for cluster j and class i is defined as
$$F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}}, \qquad \text{where } P_{ij} = \frac{n_{ij}}{n_j},\quad R_{ij} = \frac{n_{ij}}{n_i}$$

n_ij: number of instances of class i in cluster j
n_i: number of instances of class i
n_j: number of instances in cluster j
Harmonic Mean (con’t)
The F value of a class i is the maximum value it achieves over all clusters j:
$$F_i = \max_j F_{ij}$$
The F value of a clustering solution is computed as the weighted average over all classes:
$$F = \sum_i \frac{n_i}{N} F_i$$
where N is the number of data instances
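A possible implementation in Python (the set-based input format is an assumption for the example):

    def clustering_f(clusters, classes):
        # clusters: dict cluster id -> set of member ids
        # classes:  dict class id   -> set of member ids (ground truth)
        N = sum(len(members) for members in classes.values())
        total = 0.0
        for class_members in classes.values():
            best = 0.0                         # F_i = max over clusters j
            for cluster_members in clusters.values():
                n_ij = len(class_members & cluster_members)
                if n_ij == 0:
                    continue
                P = n_ij / len(cluster_members)   # precision P_ij
                R = n_ij / len(class_members)     # recall R_ij
                best = max(best, 2 * P * R / (P + R))
            total += (len(class_members) / N) * best
        return total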
Quality of Clustering
A good clustering method
  maximizes intra-cluster similarity
  minimizes inter-cluster similarity
  minimizes Entropy (E)
  maximizes the Harmonic Mean (F)
It is difficult to achieve all of these simultaneously; instead, maximize some objective function of the above
An algorithm is better than another if it achieves better values on most of these measures
K-means Algorithm
Select K centroids
Repeat I times or until the centroids do
not change
Assign each instance to the cluster
represented by its nearest centroid
Compute new centroids
Reassign instances
Compute new centroids
…….
K-Means Demo
An animated, step-by-step K-Means demo (shown as 7 screenshots in the original slides, from Nikos Hourdakis's MSc Thesis) is available at:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
Comments on K-Means (1)
Generates a flat partition of K clusters
K is the desired number of clusters and
must be known in advance
Starts with K random cluster centroids
A centroid is the mean or the median of
a group of instances
The mean rarely corresponds to a real
instance
Comments on K-Means (2)
Run for up to I = 10 iterations
Keep the clustering that yields the best inter/intra-cluster similarity, or the final clusters after I iterations
Complexity O(IKN)
A repeated application of K-Means for
K=2, 4,… can produce a hierarchical
clustering
Choosing Centroids for K-means
Quality of clustering depends on the
selection of initial centroids
Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
Select good initial centroids using a
heuristic or the results of another
method
Buckshot algorithm
Incremental K-Means
Update each centroid immediately after a point is assigned to a cluster, rather than only at the end of the iteration
Reassign instances to clusters at the end of each iteration
Converges faster than simple K-means
Usually 2-5 iterations
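The per-point update can be done as a running mean; a one-line illustrative sketch (works on NumPy vectors):

    def update_centroid(centroid, x, n):
        # centroid currently averages n points; fold in newly assigned point x
        # (a running-mean update, applied per assignment instead of per iteration)
        return centroid + (x - centroid) / (n + 1)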
Bisecting K-Means
Starts with a single cluster with all
instances
Select a cluster to split: the largest cluster, or the cluster with the lowest intra-cluster similarity
The selected cluster is split into 2
partitions using K-means (K=2)
Repeat up to the desired depth h
Hierarchical clustering
Complexity O(2hN)
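A sketch reusing the k_means function from above, splitting the largest cluster at each step (splitting by intra-cluster similarity would work equally well):

    import numpy as np

    def bisecting_k_means(X, splits):
        # split `splits` times; each step bisects the largest cluster
        # into 2 partitions using K-means with K=2
        clusters = [np.arange(len(X))]
        for _ in range(splits):
            largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            idx = clusters.pop(largest)
            assignments, _ = k_means(X[idx], K=2)
            clusters += [idx[assignments == 0], idx[assignments == 1]]
        return clusters   # list of index arrays, one per leaf cluster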
Agglomerative Clustering
Compute the similarity matrix between all pairs of instances
Start from singleton clusters
Repeat until a single cluster remains:
  Merge the two most similar clusters
  Replace them with a single cluster in the similarity matrix and update the matrix
Complexity O(N²) (see the sketch below)
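A deliberately naive Python sketch of the loop above (group-average similarity between merged clusters is an assumption here; a real implementation would update the matrix incrementally rather than rescanning it):

    import numpy as np

    def agglomerative(sim):
        # sim: N x N document-to-document similarity matrix
        clusters = [[i] for i in range(len(sim))]
        merges = []
        while len(clusters) > 1:
            best, pair = -np.inf, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # group-average similarity over the original documents
                    s = float(np.mean([sim[i][j] for i in clusters[a]
                                       for j in clusters[b]]))
                    if s > best:
                        best, pair = s, (a, b)
            a, b = pair
            merges.append((list(clusters[a]), list(clusters[b])))
            clusters[a] += clusters.pop(b)
        return merges   # the merge history defines the hierarchy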
Similarity Matrix
         C1=d1   C2=d2   …   CN=dN
C1=d1    1       0.8     …   0.3
C2=d2    0.8     1       …   0.6
…        …       …       1   …
CN=dN    0.3     0.6     …   1
Update Similarity Matrix
The most similar pair of clusters, here C1=d1 and C2=d2 (similarity 0.8), is selected; their rows and columns will be merged:

         C1=d1   C2=d2   …   CN=dN
C1=d1    1       0.8     …   0.3
C2=d2    0.8     1       …   0.6
…        …       …       1   …
CN=dN    0.3     0.6     …   1
New Similarity Matrix
            C12=d1∪d2   …   CN=dN
C12=d1∪d2   1           …   0.4
…           …           1   …
CN=dN       0.4         …   1
Single Link
Select the most similar clusters for merging using single link:
$$\max\, \mathrm{sim}(c_i, c'_j), \quad c_i \in S,\; c'_j \in S'$$
Can result in long and thin clusters due
to “chaining effect”
Appropriate in some domains, such as
clustering islands
Complete Link
Select the most similar clusters for merging using complete link:
$$\min\, \mathrm{sim}(c_i, c'_j), \quad c_i \in S,\; c'_j \in S'$$
Results in compact, spherical clusters
that are preferable
Group Average
Select the most similar clusters for merging using group average:
$$\frac{1}{|S|\,|S'|}\sum_{d \in S}\sum_{d' \in S'}\mathrm{sim}(\vec{d}, \vec{d'}) = \left(\frac{1}{|S|}\sum_{d \in S}\vec{d}\right)\cdot\left(\frac{1}{|S'|}\sum_{d' \in S'}\vec{d'}\right) = \vec{c}\cdot\vec{c'} = \mathrm{sim}(\vec{c}, \vec{c'})$$
A fast compromise between single and complete link
Example
[Figure: two clusters A and B with centroids c1 and c2; again, the single link, complete link, and group average similarities connect different pairs of points]
Inter Cluster Similarity
A new cluster is represented by its centroid:
$$\vec{c} = \frac{1}{|S|}\sum_{d \in S}\vec{d}$$
The document-to-cluster similarity is computed as
$$\mathrm{sim}(\vec{d}, \vec{c}) = \vec{d}\cdot\vec{c}$$
The cluster-to-cluster similarity can be computed as single link, complete link, or group average similarity
Buckshot K-Means
Combines Agglomerative clustering and K-Means
Agglomerative clustering produces a good clustering solution but has O(N²) complexity
Randomly select a sample of √N instances; applying Agglomerative clustering on the sample then takes O(N) time
Use the centroids of the resulting clusters as the initial centroids for K-Means
Overall complexity is O(N)
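A sketch under these assumptions (a √N sample, a group-average agglomerative phase, unit-length vectors); the names and details are illustrative:

    import numpy as np

    def buckshot(X, K, iterations=10, seed=0):
        # sample about sqrt(N) instances (assumed >= K)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=int(np.sqrt(len(X))), replace=False)
        # agglomerative phase on the sample, merging down to K clusters
        clusters = [[i] for i in idx]
        while len(clusters) > K:
            cents = [X[c].mean(axis=0) for c in clusters]
            best, pair = -np.inf, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    s = float(np.dot(cents[a], cents[b]))
                    if s > best:
                        best, pair = s, (a, b)
            a, b = pair
            clusters[a] += clusters.pop(b)
        # the sample-cluster centroids seed K-Means over the full data set
        centroids = np.array([X[c].mean(axis=0) for c in clusters])
        for _ in range(iterations):
            assignments = np.array([int(np.argmax(centroids @ x)) for x in X])
            centroids = np.array(
                [X[assignments == k].mean(axis=0) if np.any(assignments == k)
                 else centroids[k] for k in range(K)])
        return assignments, centroids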
Example
[Figure: dendrogram over the sampled instances (internal nodes 1-7, leaves 8-15); the clusters at the chosen cut level provide the initial centroids for K-Means]
More on Clustering
Sound methods based on the document-to-document similarity matrix
  graph-theoretic methods
  O(N²) time
Iterative methods operating directly on the document vectors
  O(N log N), O(N²/log N), O(mN) time
Soft Clustering
Hard clustering: each instance belongs to exactly one cluster
  Does not allow for uncertainty: an instance may belong to two or more clusters
Soft clustering is based on the probabilities that an instance belongs to each of a set of clusters
  The probabilities over all clusters must sum to 1
  Expectation Maximization (EM) is the most popular approach
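Not the EM algorithm itself, but a tiny sketch of the soft-assignment idea: similarities of an instance to the cluster centroids are turned into membership probabilities that sum to 1 (the softmax form is an assumption for the example):

    import numpy as np

    def soft_assign(x, centroids):
        # membership probabilities of instance x over all clusters;
        # illustrative only, not the full EM procedure
        sims = np.array([float(np.dot(x, c)) for c in centroids])
        p = np.exp(sims - sims.max())   # numerically stable softmax
        return p / p.sum()              # probabilities sum to 1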
More Methods
Two documents with similarity > T (a threshold) are connected with an edge [Duda&Hart73]
  Clusters: the connected components (maximal cliques) of the resulting graph; see the sketch below
  Problem: selection of an appropriate threshold T
Zahn’s method [Zahn71]
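A sketch of the threshold-graph method (connected components found with a depth-first search; illustrative):

    def threshold_clusters(sim, T):
        # connect documents with similarity > T; clusters are the
        # connected components of the resulting graph
        N = len(sim)
        seen, clusters = set(), []
        for start in range(N):
            if start in seen:
                continue
            stack, component = [start], []
            while stack:
                u = stack.pop()
                if u in seen:
                    continue
                seen.add(u)
                component.append(u)
                stack.extend(v for v in range(N)
                             if v != u and sim[u][v] > T and v not in seen)
            clusters.append(component)
        return clusters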
Zahn’s method [Zahn71]
[Figure: a spanning tree over the data points; the dashed edge is inconsistent and is deleted]
1. Find the minimum spanning tree of the documents
2. For each document, delete incident edges with length l > l_avg, where l_avg is the average length of the edges incident to that document (such edges are "inconsistent")
3. Clusters: the connected components of the remaining graph
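A self-contained Python sketch of the three steps (Prim's algorithm builds the MST; the edge-deletion rule follows step 2):

    def zahn_clusters(dist):
        # dist: N x N distance matrix
        N = len(dist)
        in_tree, edges = {0}, []
        while len(in_tree) < N:                 # Prim's minimum spanning tree
            u, v = min(((a, b) for a in in_tree for b in range(N)
                        if b not in in_tree), key=lambda e: dist[e[0]][e[1]])
            edges.append((u, v))
            in_tree.add(v)
        def avg_incident(node):
            lengths = [dist[a][b] for a, b in edges if node in (a, b)]
            return sum(lengths) / len(lengths)
        # delete edges longer than the average of either endpoint's edges
        kept = [(a, b) for a, b in edges
                if dist[a][b] <= avg_incident(a) and dist[a][b] <= avg_incident(b)]
        # connected components of the pruned tree (union-find)
        parent = list(range(N))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for a, b in kept:
            parent[find(a)] = find(b)
        groups = {}
        for i in range(N):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())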
References
C. Faloutsos, "Searching Multimedia Databases by Content", Kluwer Academic Publishers, 1996
M. Steinbach, G. Karypis, V. Kumar, "A Comparison of Document Clustering Techniques", KDD Workshop on Text Mining, 2000
A.K. Jain, M.N. Murty, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999
A.K. Jain, R.C. Dubes, "Algorithms for Clustering Data", Prentice-Hall, 1988, ISBN 0-13-022278-X
G. Salton, "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer", Addison-Wesley, 1989