lecture_17

Transcript lecture_17

Clustering
Supervised vs. Unsupervised Learning
Examples of clustering in Web IR
Characteristics of clustering
Clustering algorithms
Cluster Labeling
1
Supervised vs. Unsupervised Learning
Supervised Learning




Goal: A program that performs a task as well as humans.
TASK – well defined (the target function)
EXPERIENCE – training data provided by a human
PERFORMANCE – error/accuracy on the task
Unsupervised Learning




Goal: To find some kind of structure in the data.
TASK – vaguely defined
No EXPERIENCE
No PERFORMANCE (but there are some evaluation metrics)
2
What is Clustering?
Clustering is the most common form of Unsupervised
Learning
Clustering is the process of grouping a set of physical or
abstract objects into classes of similar objects
It can be used in IR:


To improve recall in search applications
For better navigation of search results
3
Example 1: Improving Recall
Cluster hypothesis - Documents with similar text are
related
Thus, when a query matches a document D, also return
other documents in the cluster containing D.
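The recall idea above can be sketched in a few lines. This is only an illustration: the function name `expand_with_clusters`, the document ids, and the two lookup tables are hypothetical, and a clustering of the collection is assumed to have been computed beforehand.

```python
def expand_with_clusters(matched_docs, cluster_of, members_of):
    """Return the matched documents followed by their unmatched cluster-mates."""
    results = list(matched_docs)
    seen = set(results)
    for doc in matched_docs:
        for mate in members_of[cluster_of[doc]]:
            if mate not in seen:
                seen.add(mate)
                results.append(mate)
    return results


# Hypothetical clustering of five documents into two clusters.
members_of = {0: ["d1", "d2", "d3"], 1: ["d4", "d5"]}
cluster_of = {doc: cid for cid, docs in members_of.items() for doc in docs}

# The query matched only d2; its cluster-mates d1 and d3 are returned as well.
expanded = expand_with_clusters(["d2"], cluster_of, members_of)
```

Matched documents stay at the front of the list, so the added cluster-mates extend the result set without displacing direct matches.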
4
Example 2: Better Navigation
5
Clustering Characteristics
Flat versus Hierarchical Clustering


Flat means dividing objects into groups (clusters)
Hierarchical means organizing clusters in a subsuming hierarchy
Evaluating Clustering

Internal Criteria
 The intra-cluster similarity is high (tightness)
 The inter-cluster similarity is low (separateness)

External Criteria
 Did we discover the hidden classes? (we need gold
standard data for this evaluation)
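A rough sketch of the internal criteria follows. It assumes that tightness is measured as the mean similarity over pairs inside a cluster and separateness as the mean similarity over pairs drawn from different clusters; the slide does not fix a formula, so both function names and the averaging are assumptions. The similarity used here is a plain dot product on unit-length vectors.

```python
from itertools import combinations, product


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_intra_similarity(clusters, sim=dot):
    """Mean similarity over all pairs of points that share a cluster.

    Assumes at least one cluster contains two or more points.
    """
    sims = [sim(x, y) for c in clusters for x, y in combinations(c, 2)]
    return sum(sims) / len(sims)


def mean_inter_similarity(clusters, sim=dot):
    """Mean similarity over all pairs of points from different clusters.

    Assumes there are at least two non-empty clusters.
    """
    sims = [sim(x, y)
            for a, b in combinations(clusters, 2)
            for x, y in product(a, b)]
    return sum(sims) / len(sims)


# Two hypothetical clusters of unit-length 2-D vectors.
clusters = [[(1.0, 0.0), (0.8, 0.6)], [(0.0, 1.0), (0.6, 0.8)]]
intra = mean_intra_similarity(clusters)  # high value means tight clusters
inter = mean_inter_similarity(clusters)  # low value means well-separated clusters
```

A clustering looks good under these criteria when `intra` is clearly larger than `inter`.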
6
Clustering for Web IR
Representation for clustering

Document representation
 Vector space? Normalization?

Need a notion of similarity/distance
How many clusters?


Fixed a priori?
Completely data driven?
 Avoid “trivial” clusters - too large or small
7
Recall documents as vectors
Each doc j is a vector of tfidf values, one component for
each term.
Can normalize to unit length.

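Normalized document vector: $\hat{d}_j = \frac{\vec{d}_j}{\lVert \vec{d}_j \rVert} = \frac{\vec{d}_j}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}}$, where $w_{i,j} = tf_{i,j} \cdot idf_i$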
So we have a vector space



terms are axes - aka features
n docs live in this space
even with stemming, may have 20,000+ dimensions
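The weighting and unit-length normalization can be sketched as below. Documents are stored as sparse dictionaries so that only non-zero components are kept. The raw term counts used as tf and the idf values are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def tfidf_vector(term_counts, idf):
    """Weight each term by tf * idf; terms without an idf entry are skipped."""
    return {t: tf * idf[t] for t, tf in term_counts.items() if t in idf}


def normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


# Hypothetical idf table and term counts for one document.
idf = {"jaguar": 2.0, "car": 1.0, "cat": 1.5}
doc_counts = {"jaguar": 3, "car": 4}

weights = tfidf_vector(doc_counts, idf)  # jaguar: 3 * 2.0, car: 4 * 1.0
unit = normalize(weights)                # same direction, length 1
```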
8
What makes documents related?
Ideal: semantic similarity.
Practical: statistical similarity


We will use cosine similarity.
Documents as vectors.
We will describe algorithms in terms of cosine similarity.
Cosine similarity of normalized $d_j$, $d_k$:
$sim(d_j, d_k) = \sum_{i=1}^{n} w_{i,j} \cdot w_{i,k}$
This is known as the normalized inner product.
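As a small sketch of this formula: once both vectors have unit length, the cosine is just the sum of products over shared terms. The sparse-dictionary representation and the example weights are assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def cosine(u, v):
    """Inner product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical documents with made-up term weights.
d1 = normalize({"jaguar": 3.0, "car": 4.0})  # becomes jaguar 0.6, car 0.8
d2 = normalize({"car": 2.0})                 # becomes car 1.0
d3 = normalize({"cat": 1.0})                 # shares no term with d1

sim_12 = cosine(d1, d2)  # only "car" is shared
sim_13 = cosine(d1, d3)  # no shared terms
```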
9
Intuition for relatedness
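[Figure: documents D1, D2, D3 and D4 plotted in a vector space with term axes t1 and t2; x and y label elements of the plot.]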
Documents that are “close together”
in vector space talk about the same things.
10
Clustering Algorithms
Partitioning “flat” algorithms


Usually start with a random (partial) partitioning
Refine it iteratively
 k-means clustering
 Model based clustering (we will not cover it)
Hierarchical algorithms


Bottom-up, agglomerative
Top-down, divisive (we will not cover it)
11
Partitioning “flat” algorithms
Partitioning method: Construct a partition of n documents
into a set of k clusters
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen
partitioning criterion
12
K-means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or
mean) of points in a cluster, c:


$\mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
Reassignment of instances to clusters is based on distance
to the current cluster centroids.
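A minimal sketch of the centroid computation, assuming each point is a dense list of coordinates (the example points are made up):

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


# Hypothetical cluster of three 2-D points.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
mu = centroid(cluster)  # [(1 + 3 + 5) / 3, (2 + 4 + 0) / 3]
```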
13
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
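The algorithm above can be sketched in plain Python as follows. Several choices here are assumptions rather than part of the slide: Euclidean distance as d, stopping when assignments no longer change (or after `max_iters` passes), a fixed random generator for repeatability, and keeping the old seed if a cluster ends up empty.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, max_iters=100, rng=None):
    """Return (assignment, seeds) where assignment[i] is the cluster of points[i]."""
    rng = rng or random.Random(0)
    seeds = [list(p) for p in rng.sample(points, k)]  # k random instances
    assignment = None
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest seed.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(x, seeds[j])) for x in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old seed
                seeds[j] = centroid(members)
    return assignment, seeds


# Two hypothetical, well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
```

With groups this far apart, the first three points should end up in one cluster and the last three in the other, although which cluster gets index 0 depends on the seeds.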
14
K-means: Different Issues
When to stop?


When a fixed number of iterations is reached
When centroid positions do not change
Seed Choice


Results can vary based on random seed selection.
Try out multiple starting points
Example showing sensitivity to seeds (six points A, B, C, D, E, F):
If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}
If you start with D and F, you converge to {A,B,D,E} and {C,F}
[Figure: the six points A, B, C, D, E, F]
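The seed sensitivity can be reproduced with a small sketch. The coordinates below are hypothetical: the slide does not give them, so they were chosen as two rows of three points that happen to yield the two outcomes described. Euclidean distance and the function name `kmeans_from_seeds` are also assumptions.

```python
import math


def kmeans_from_seeds(points, seed_names, max_iters=100):
    """Run k-means on named points, starting from the named seed points."""
    seeds = [points[name] for name in seed_names]
    assignment = None
    for _ in range(max_iters):
        new_assignment = {
            name: min(range(len(seeds)), key=lambda j: math.dist(xy, seeds[j]))
            for name, xy in points.items()
        }
        if new_assignment == assignment:  # no point changed cluster
            break
        assignment = new_assignment
        for j in range(len(seeds)):
            members = [points[n] for n, a in assignment.items() if a == j]
            if members:
                seeds[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Report clusters as sorted lists of point names.
    return sorted(
        sorted(n for n, a in assignment.items() if a == j)
        for j in range(len(seeds))
    )


# Hypothetical layout: A, B, C on the top row and D, E, F on the bottom row.
points = {
    "A": (0.0, 1.0), "B": (1.0, 1.0), "C": (2.5, 1.0),
    "D": (0.0, 0.0), "E": (1.0, 0.0), "F": (2.5, 0.0),
}

from_b_e = kmeans_from_seeds(points, ["B", "E"])  # splits into rows
from_d_f = kmeans_from_seeds(points, ["D", "F"])  # splits left vs. right
```

Both results are stable under further iterations, which is why trying several starting points is worthwhile.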
15
Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram)
from a set of unlabeled examples.
animal
  vertebrate
    fish, reptile, amphib., mammal
  invertebrate
    worm, insect, crustacean
16
Hierarchical Agglomerative Clustering
We assume there is a similarity function that determines
the similarity of two instances.
Algorithm:
Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj
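A naive sketch of this loop is shown below. The cluster-level similarity is left as a parameter; the default used here (maximum similarity over pairs of members) is only one possible choice, and the example documents are hypothetical unit vectors built from angles so that their cosine is easy to reason about. The sketch recomputes all cluster similarities on every pass, so it favors clarity over speed.

```python
import math
from itertools import combinations


def max_pair_sim(a, b, sim):
    """Cluster similarity as the best similarity over pairs of members."""
    return max(sim(x, y) for x in a for y in b)


def hac(n, sim, cluster_sim=max_pair_sim):
    """Cluster items 0..n-1 bottom-up; return the merges in the order made."""
    clusters = [(i,) for i in range(n)]  # every instance starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two most similar current clusters.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_sim(pair[0], pair[1], sim),
        )
        # Replace them with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


# Hypothetical documents: unit vectors at 0, 10, 80 and 85 degrees.
angles = [0, 10, 80, 85]
docs = [(math.cos(math.radians(t)), math.sin(math.radians(t))) for t in angles]


def cosine(i, j):
    """Cosine similarity of documents i and j (both unit length)."""
    return docs[i][0] * docs[j][0] + docs[i][1] * docs[j][1]


merges = hac(len(docs), cosine)
```

The sequence of merges is exactly the information a dendrogram displays: the closest pair (80 and 85 degrees) merges first, then the pair at 0 and 10 degrees, and finally the two resulting clusters.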
17
What is the most similar cluster?
Single-link

Similarity of the most cosine-similar pair of points
Complete-link

Similarity of the “furthest” points, the least cosine-similar
Group-average agglomerative clustering

Average cosine between pairs of elements
Centroid clustering

Similarity of clusters’ centroids
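The four options can be put side by side in a short sketch. Two details are assumptions, since the slide leaves them open: the group average here is taken over pairs with one element from each cluster, and the centroid variant compares the two centroids by cosine. Points are assumed to be unit-length vectors, so the dot product of two points is their cosine.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centroid(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def single_link(a, b):
    """Similarity of the most similar pair."""
    return max(dot(x, y) for x in a for y in b)


def complete_link(a, b):
    """Similarity of the least similar pair."""
    return min(dot(x, y) for x in a for y in b)


def group_average(a, b):
    """Mean similarity over cross-cluster pairs (assumed definition)."""
    return sum(dot(x, y) for x in a for y in b) / (len(a) * len(b))


def centroid_sim(a, b):
    """Cosine between the two cluster centroids (assumed definition)."""
    ca, cb = centroid(a), centroid(b)
    return dot(ca, cb) / (math.sqrt(dot(ca, ca)) * math.sqrt(dot(cb, cb)))


# Two hypothetical clusters of unit-length 2-D vectors.
a = [(1.0, 0.0), (0.8, 0.6)]
b = [(0.0, 1.0), (0.6, 0.8)]
scores = {
    "single": single_link(a, b),
    "complete": complete_link(a, b),
    "average": group_average(a, b),
    "centroid": centroid_sim(a, b),
}
```

On the same pair of clusters the four criteria give different values, so the choice of criterion can change which clusters are merged next.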
18
Single link clustering
1) Use maximum similarity of pairs:
$sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
$sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$
19
Complete link clustering
1) Use minimum similarity of pairs:
$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
$sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$
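The update rules for single link and complete link mean that, after a merge, similarities to the other clusters can be derived from values already stored instead of being recomputed from all pairs. The sketch below checks this on a tiny example; the pairwise similarity values and the helper names are made up for illustration.

```python
def merged_similarity(sim_ik, sim_jk, linkage):
    """Similarity of (ci merged with cj) to ck, from the two stored values."""
    if linkage == "single":
        return max(sim_ik, sim_jk)
    if linkage == "complete":
        return min(sim_ik, sim_jk)
    raise ValueError("unknown linkage: " + linkage)


# Hypothetical pairwise similarities between items 0, 1 and items 2, 3.
pair_sim = {(0, 2): 0.9, (0, 3): 0.2, (1, 2): 0.5, (1, 3): 0.4}


def direct(cluster, other, pick):
    """Recompute cluster similarity from scratch over all cross pairs."""
    return pick(pair_sim[(x, y)] for x in cluster for y in other)


ci, cj, ck = [0], [1], [2, 3]

# Single link: stored values are maxima, and the merge takes their maximum.
single_updated = merged_similarity(direct(ci, ck, max), direct(cj, ck, max), "single")
single_direct = direct(ci + cj, ck, max)

# Complete link: stored values are minima, and the merge takes their minimum.
complete_updated = merged_similarity(direct(ci, ck, min), direct(cj, ck, min), "complete")
complete_direct = direct(ci + cj, ck, min)
```

In both cases the updated value equals the value recomputed from scratch, which is what makes the shortcut safe.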
20
Major issue - labeling
After a clustering algorithm finds clusters, how can they be
useful to the end user?
Need a concise label for each cluster


In search results, say “Animal” or “Car” in the jaguar example.
In topic trees (Yahoo), need navigational cues.
 Often done by hand, a posteriori.
21
How to Label Clusters
Show titles of typical documents



Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles, which may not fully represent
the cluster
Show words/phrases prominent in cluster



More likely to fully represent the cluster
Use distinguishing words/phrases
But harder to scan
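One simple way to pick distinguishing words is sketched below. It is an assumed heuristic, not a method given in the lecture: each term is scored by its mean weight inside the cluster minus its mean weight in the other clusters, and the top-scoring terms become the label. The documents and weights are invented around the jaguar example.

```python
from collections import defaultdict


def mean_weights(docs):
    """Mean weight of every term over a list of sparse document vectors."""
    totals = defaultdict(float)
    for doc in docs:
        for term, w in doc.items():
            totals[term] += w
    return {t: s / len(docs) for t, s in totals.items()}


def label_terms(cluster_docs, other_docs, k=2):
    """Return the k terms that most distinguish the cluster from the rest."""
    inside = mean_weights(cluster_docs)
    outside = mean_weights(other_docs) if other_docs else {}
    scores = {t: w - outside.get(t, 0.0) for t, w in inside.items()}
    # Highest score first; ties broken alphabetically for repeatability.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical clusters for the query "jaguar".
animal = [{"jaguar": 0.5, "cat": 0.7, "habitat": 0.5},
          {"jaguar": 0.6, "cat": 0.4, "prey": 0.7}]
car = [{"jaguar": 0.6, "engine": 0.8},
       {"jaguar": 0.5, "dealer": 0.6, "engine": 0.6}]

animal_label = label_terms(animal, car)
car_label = label_terms(car, animal)
```

The term "jaguar" is prominent in both clusters, so it scores zero and is not chosen, which illustrates why prominent words alone are not enough and distinguishing words are preferred.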
22
Not covered in this lecture
Complexity:

Clustering is computationally expensive. Implementations need
careful balancing of needs.
How to decide how many clusters are best?
Evaluating the “goodness” of clustering

There are many techniques, some focus on implementation issues
(complexity/time), some on the quality of
23
