CS315-L17-Clustering


Basic Machine Learning: Clustering

CS 315 – Web Search and Data Mining

Supervised vs. Unsupervised Learning

Two fundamental methods in machine learning:

Supervised Learning ("learn from my example")
 Goal: a program that performs a task as well as humans.
 TASK – well defined (the target function)
 EXPERIENCE – training data provided by a human
 PERFORMANCE – error/accuracy on the task

Unsupervised Learning ("see what you can find")
 Goal: to find some kind of structure in the data.
 TASK – vaguely defined
 No EXPERIENCE
 No PERFORMANCE (but there are some evaluation metrics)

What is Clustering?

The most common form of unsupervised learning. Clustering is the process of grouping a set of physical or abstract objects into classes ("clusters") of similar objects. It can be used in IR:
 To improve recall in search
 For better navigation of search results

(Figure: scatter plot of points forming visible clusters in a two-dimensional space.)

Ex1: Cluster to Improve Recall

Cluster hypothesis: documents with similar text are related. Thus, when a query matches a document D, also return the other documents in the cluster containing D.

Ex2: Cluster for Better Navigation


Clustering Characteristics

Flat clustering vs. hierarchical clustering:
 Flat: just divide objects into groups (clusters)
 Hierarchical: organize clusters in a hierarchy

Evaluating clustering:
 Internal criteria:
  The intra-cluster similarity is high (tightness)
  The inter-cluster similarity is low (separateness)
 External criteria:
  Did we discover the hidden classes? (we need gold-standard data for this evaluation)

Clustering for Web IR

Representation for clustering:
 Document representation
 Need a notion of similarity/distance

How many clusters?
 Fixed a priori?
 Completely data driven?
 Avoid "trivial" clusters – too large or too small

Recall: Documents as vectors

Each doc j is a vector of tf.idf values, one component for each term. Can normalize to unit length:

\vec{d}_j = \frac{\vec{d}_j}{|\vec{d}_j|} = \frac{\vec{d}_j}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}}

where

w_{i,j} = tf_{i,j} \cdot idf_i

Vector space:
 terms are axes – aka features
 N docs live in this space
 even with stemming, may have 20,000+ dimensions

What makes documents related?
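The unit-length normalization can be sketched in a few lines of Python (a minimal illustration; the term weights below are made-up tf.idf values, not taken from any real corpus):

```python
import math

# Hypothetical tf.idf weights for one document (one component per term).
doc = [0.5, 1.2, 0.0, 3.1]

# Normalize to unit length: divide each component by the vector's norm.
norm = math.sqrt(sum(w * w for w in doc))
unit_doc = [w / norm for w in doc]

# After normalization, the vector has length 1, so dot products between
# such vectors are cosine similarities directly.
print(math.sqrt(sum(w * w for w in unit_doc)))  # → 1.0
```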

Intuition for relatedness

(Figure: documents D1–D4 plotted in a two-term vector space with axes t1 and t2.)

Documents that are "close together" in vector space talk about the same things.

What makes documents related?

Ideal: semantic similarity.
Practical: statistical similarity.
 We will use cosine similarity.
 We will describe algorithms in terms of cosine similarity.

Cosine similarity of normalized \vec{d}_j, \vec{d}_k:

sim(\vec{d}_j, \vec{d}_k) = \sum_{i=1}^{n} w_{i,j} \, w_{i,k}

This is known as the "normalized inner product".
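For unit-length vectors, cosine similarity is just the inner product described above. A minimal sketch (the example vectors are invented for illustration):

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_sim(dj, dk):
    """Inner product of two unit-length vectors = cosine of their angle."""
    return sum(wj * wk for wj, wk in zip(dj, dk))

d1 = unit([1.0, 2.0, 0.0])
d2 = unit([2.0, 4.0, 0.0])  # same direction as d1
d3 = unit([0.0, 0.0, 5.0])  # shares no terms with d1

print(cosine_sim(d1, d2))  # ≈ 1.0 (identical direction)
print(cosine_sim(d1, d3))  # ≈ 0.0 (orthogonal)
```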

Clustering Algorithms

Hierarchical algorithms  Bottom-up, agglomerative clustering Partitioning “ flat ” algorithms  Usually start with a random (partial) partitioning  Refine it iteratively The famous k-means partitioning algorithm:   Given: a set of

n

documents and the number

k

Compute: a partition of

k

clusters that optimizes the chosen partitioning criterion 11

K-means

Assumes documents are real-valued vectors.

Clusters are based on the centroids (= the center of gravity, or mean) of the points in a cluster c:

\mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-Means Algorithm

Let d be the distance measure between instances.

Select k random instances {s_1, s_2, …, s_k} as seeds.
Until clustering converges (or another stopping criterion is met):
 For each instance x_i:
  Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
 (Update the seeds to the centroid of each cluster.)
 For each cluster c_j:
  s_j = \mu(c_j)
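The loop above can be sketched as follows (a minimal one-dimensional illustration with Euclidean distance; the data points and seeds are invented for the demo, whereas a real run would pick the seeds at random):

```python
def kmeans(points, seeds, max_iters=100):
    """Plain k-means: assign each point to its nearest seed,
    then move each seed to the centroid of its cluster."""
    for _ in range(max_iters):
        # Assignment step: nearest seed by squared distance.
        clusters = [[] for _ in seeds]
        for x in points:
            j = min(range(len(seeds)), key=lambda j: (x - seeds[j]) ** 2)
            clusters[j].append(x)
        # Update step: each seed moves to its cluster's centroid.
        new_seeds = [sum(c) / len(c) if c else s
                     for c, s in zip(clusters, seeds)]
        if new_seeds == seeds:  # converged: centroids stopped moving
            break
        seeds = new_seeds
    return seeds, clusters

# Two clear groups on the number line; seeds chosen by hand for the demo.
points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centroids, clusters = kmeans(points, seeds=[0.0, 5.0])
print(centroids)  # ≈ [1.0, 10.0]
```

Because the update step only moves centroids toward their assigned points, the algorithm always terminates, but (as the seed-sensitivity example on the next slide shows) only at a local optimum.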

K-means: Different Issues

When to stop?
 When a fixed number of iterations is reached
 When centroid positions do not change

Seed choice:
 Results can vary based on random seed selection.
 Try out multiple starting points.

Example showing sensitivity to seeds: if you start with centroids B and E, you converge to {A B C}; if you start with centroids D and F, you converge to {D E F}.

Hierarchical clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.

(Figure: example dendrogram – "animal" splits into "vertebrate" (fish, reptile, amphibian, mammal) and "invertebrate" (worm, insect, crustacean).)

Hierarchical Agglomerative Clustering

We assume there is a similarity function that determines the similarity of two instances.

Algorithm:
 Start with all instances in their own cluster.
 Until there is only one cluster:
  Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
  Replace c_i and c_j with a single cluster c_i ∪ c_j.
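A minimal sketch of this agglomerative loop, using one-dimensional points and centroid similarity (negative distance between cluster means); the data is invented, and the function records each merge until a single cluster remains:

```python
def hac(points):
    """Agglomerative clustering: repeatedly merge the two most
    similar clusters until only one cluster remains."""
    # Every instance starts in its own cluster.
    clusters = [[x] for x in points]
    merges = []

    def sim(a, b):
        # Similarity of two clusters = negative distance between centroids.
        return -abs(sum(a) / len(a) - sum(b) / len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i, j).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # Replace c_i and c_j with their union.
        merged = clusters[i] + clusters[j]
        merges.append(sorted(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

print(hac([1.0, 1.5, 8.0, 9.0]))
# → [[1.0, 1.5], [8.0, 9.0], [1.0, 1.5, 8.0, 9.0]]
```

The sequence of merges is exactly the dendrogram read bottom-up: nearby points join first, and the root merge combines everything.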

What is the most similar cluster?

Single-link  Similarity of the most cosine-similar (single-link) Complete-link  Similarity of the “ furthest ” points, the least cosine-similar Group-average agglomerative clustering  Average cosine between pairs of elements Centroid clustering  Similarity of clusters ’ centroids 17

Single link clustering

1) Use the maximum similarity of pairs:

sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k),\, sim(c_j, c_k))

Complete link clustering

1) Use the minimum similarity of pairs:

sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k),\, sim(c_j, c_k))
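The two merge-update rules differ only in the combining function, which a short sketch makes explicit (the similarity values in `sims` are invented pairwise cluster similarities, not from the lecture):

```python
# Hypothetical similarities between clusters ci, cj and a third cluster ck.
sims = {("ci", "ck"): 0.7, ("cj", "ck"): 0.3}

def merged_similarity(ci, cj, ck, linkage):
    """Similarity of the merged cluster (ci ∪ cj) to ck.
    Single-link keeps the best pair (max); complete-link the worst (min)."""
    combine = max if linkage == "single" else min
    return combine(sims[(ci, ck)], sims[(cj, ck)])

print(merged_similarity("ci", "cj", "ck", "single"))    # → 0.7
print(merged_similarity("ci", "cj", "ck", "complete"))  # → 0.3
```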

Major issue - labeling

After a clustering algorithm finds clusters, how can they be useful to the end user?

Need a concise label for each cluster:
 In search results, say "Animal" or "Car" in the jaguar example.
 In topic trees (Yahoo), need navigational cues.
 Often done by hand, a posteriori.

How to Label Clusters

Show titles of typical documents:
 Titles are easy to scan
 Authors create them for quick scanning!
 But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in the cluster:
 More likely to fully represent the cluster
 Use distinguishing words/phrases
 But harder to scan

Further issues

Complexity:
 Clustering is computationally expensive. Implementations need careful balancing of needs.

How to decide how many clusters are best?

Evaluating the "goodness" of clustering:
 There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.