Information Retrieval Lecture 7 Dell Zhang


Information Retrieval
For the MSc Computer Science Programme
Lecture 7
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 17
Dell Zhang
Birkbeck, University of London
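Yahoo! Hierarchy
[Figure: part of the Yahoo! directory tree rooted at http://dir.yahoo.com/science … (30). Top-level categories shown: agriculture, biology, physics, CS, space, ... Subcategories shown: dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, missions, ...]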
Hierarchical Clustering

Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.

Divisive (top-down)
  Start with all documents belonging to the same cluster.
  Eventually each document forms a cluster on its own.
  Recursively apply a (flat) partitional clustering algorithm, e.g., kMeans (k=2) → Bi-secting kMeans.

Agglomerative (bottom-up)
  Start with each document being a single cluster.
  Eventually all documents belong to the same cluster.
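
The divisive (top-down) strategy can be sketched as a recursive 2-way split. The following is a minimal illustration only, not the lecture's implementation: the function names (`two_means`, `divisive`), the use of squared Euclidean distance, and the deterministic seeding (first point plus the point farthest from it) are all assumptions made to keep the example short and repeatable.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, iters=20):
    """Split points (len >= 2) into two groups with a plain 2-means loop.

    Seeding is an assumption: the first point and the point farthest from it.
    """
    c0 = list(points[0])
    c1 = list(max(points, key=lambda p: sq_dist(p, c0)))
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            # Degenerate case (e.g. identical points): split arbitrarily.
            return points[:1], points[1:]
        new0, new1 = mean(left), mean(right)
        if new0 == c0 and new1 == c1:
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return left, right


def divisive(points):
    """Top-down clustering: internal nodes are [left, right] lists,
    leaves are the original points."""
    if len(points) == 1:
        return points[0]
    left, right = two_means(points)
    return [divisive(left), divisive(right)]


pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
tree = divisive(pts)
```

Here the recursion continues until every document is a leaf, matching the "each document forms a cluster on its own" end state; a practical version would usually stop earlier.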
Dendrogram
Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
The number of clusters k is not required in advance.
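
Cutting a dendrogram can be sketched in code as follows. This is an illustration under stated assumptions: the merge history is represented as a list of hypothetical `(doc_a, doc_b, similarity)` triples, one per merge, where `doc_a` and `doc_b` are any members of the two merged clusters, and merge similarities never increase as merging proceeds (which holds for single-link). The function name `cut_dendrogram` is made up for this example.

```python
def cut_dendrogram(n_docs, merges, threshold):
    """Return the clusters obtained by keeping only merges whose
    similarity is at least `threshold`.

    merges: list of (doc_a, doc_b, similarity) triples (assumed format).
    """
    parent = list(range(n_docs))  # union-find forest over document ids

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, sim in merges:
        if sim >= threshold:
            parent[find(a)] = find(b)

    # Each connected component forms one cluster.
    clusters = {}
    for doc in range(n_docs):
        clusters.setdefault(find(doc), []).append(doc)
    return sorted(clusters.values())


# Hypothetical merge history over five documents, numbered 0..4.
merges = [(0, 1, 0.9), (3, 4, 0.8), (2, 3, 0.5), (0, 2, 0.1)]
high_cut = cut_dendrogram(5, merges, 0.6)
low_cut = cut_dendrogram(5, merges, 0.4)
```

Different thresholds give different numbers of clusters from the same merge history, which is why k does not have to be fixed before clustering.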
Dendrogram – Example
Clusters of News Stories:
Reuters RCV1
Dendrogram – Example
Clusters of Things that People Want:
ZEBO
HAC

Hierarchical Agglomerative Clustering

Start with each doc in a separate cluster.
Repeat until there is only one cluster:
  Among the current clusters, determine the pair of clusters, $c_i$ and $c_j$, that are most similar (Single-Link, Complete-Link, etc.).
  Then merge $c_i$ and $c_j$ into a single cluster.
The history of merging forms a binary tree or hierarchy.
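
The HAC loop can be sketched naively as below. This is an illustrative sketch, not an efficient or authoritative implementation: it assumes a precomputed symmetric matrix of pairwise document similarities, recomputes every cluster-pair similarity from the members on each pass, and uses a hypothetical `linkage` parameter where `max` stands for single-link and `min` for complete-link.

```python
def hac(sim, linkage=max):
    """Naive hierarchical agglomerative clustering.

    sim: symmetric list-of-lists of pairwise document similarities (assumed input).
    linkage: max for single-link, min for complete-link.
    Returns the merge history as (cluster_a, cluster_b, similarity) triples.
    """
    clusters = [(i,) for i in range(len(sim))]  # each doc starts alone
    history = []
    while len(clusters) > 1:
        best = None
        # Find the most similar pair among the current clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(sim[x][y] for x in clusters[a] for y in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Made-up similarities for four documents.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.3],
    [0.2, 0.6, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
single = hac(sim)
complete = hac(sim, linkage=min)
```

On this toy matrix both linkages merge the same pairs in the same order, but the final merge happens at similarity 0.6 under single-link and 0.1 under complete-link.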
Single-Link

The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine similarity) between their members:
$$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$$

After merging $c_i$ and $c_j$, the similarity of the resulting cluster to another cluster, $c_k$, is:
$$\mathrm{sim}((c_i \cup c_j), c_k) = \max\bigl(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k)\bigr)$$
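
The single-link update rule means cluster similarities never have to be recomputed from the individual members after a merge. A minimal sketch of this idea follows; the function name `single_link_hac`, the matrix-based representation, and the toy similarity values are assumptions for illustration.

```python
def single_link_hac(sim):
    """Single-link HAC that maintains cluster-to-cluster similarities
    with the max update rule instead of rescanning cluster members.

    sim: symmetric list-of-lists of pairwise document similarities (assumed input).
    Returns the merge history as (members_a, members_b, similarity) triples.
    """
    n = len(sim)
    c = [row[:] for row in sim]          # c[i][j]: similarity of clusters i and j
    active = set(range(n))               # ids of clusters that still exist
    members = {i: [i] for i in range(n)}
    history = []
    while len(active) > 1:
        # Most similar pair of active clusters.
        s, i, j = max((c[i][j], i, j) for i in active for j in active if i < j)
        history.append((sorted(members[i]), sorted(members[j]), s))
        # Update rule: sim(ci U cj, ck) = max(sim(ci, ck), sim(cj, ck)).
        for k in active:
            if k != i and k != j:
                c[i][k] = c[k][i] = max(c[i][k], c[j][k])
        members[i] += members.pop(j)     # cluster i now stands for the union
        active.remove(j)
    return history


# Made-up similarities for four documents.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.3],
    [0.2, 0.6, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
merges = single_link_hac(sim)
```

After documents 0 and 1 merge, the similarity of the new cluster to document 2 is simply max(0.2, 0.6) = 0.6, taken from the two old rows.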
HAC – Example

As clusters agglomerate, docs are likely to fall into a dendrogram.
[Figure: documents d1, d2, d3, d4, d5 and the clusters they form: {d1,d2}, {d4,d5}, {d3,d4,d5}.]
HAC – Example
Single-Link
Take Home Message


Single-Link HAC
Dendrogram
$$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$$
$$\mathrm{sim}((c_i \cup c_j), c_k) = \max\bigl(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k)\bigr)$$