WEB BAR 2004
Advanced Retrieval and Web Mining
Lecture 12
Today’s Topic: Clustering 1



Motivation: Recommendations
Document clustering
Clustering algorithms
Restaurant recommendations

We have a list of all Palo Alto restaurants

with positive ("Yes") and negative ("No") ratings for some,
as provided by Stanford students
Which restaurant(s) should I recommend to you?
Input

Alice   Il Fornaio     Yes
Bob     Ming's         No
Cindy   Straits Café   No
Dave    Ming's         Yes
Alice   Straits Café   No
Estie   Zao            Yes
Cindy   Zao            No
Dave    Brahma Bull    No
Dave    Zao            Yes
Estie   Ming's         Yes
Fred    Brahma Bull    No
Alice   Mango Café     No
Fred    Ramona's       No
Dave    Homma's        Yes
Bob     Higashi West   Yes
Estie   Straits Café   Yes
Algorithm 0

Recommend to you the most popular restaurants
  say # positive votes minus # negative votes
Ignores your culinary preferences
  and judgements of those with similar preferences
How can we exploit the wisdom of “like-minded” people?
Basic assumption
  Preferences are not random
  For example, if I like Il Fornaio, it’s more likely I will also like Cenzo
Another look at the input - a matrix

Rows: Alice, Bob, Cindy, Dave, Estie, Fred
Columns: Brahma Bull, Higashi West, Mango, Il Fornaio, Zao, Ming's, Ramona's, Straits, Homma's
Cells hold Yes/No ratings where they exist; the remaining cells are empty.
Alice's row: Yes, No, Yes, No
Bob's row: Yes, No, No
Cindy's row: Yes, No, No
Dave's row: No, No, Yes, Yes, Yes
Estie's row: No, Yes, Yes, Yes
Fred's row: No, No
Now that we have a matrix

Same people and restaurants, with Yes coded as 1 and No as -1:
Alice's row: 1, -1, 1, -1
Bob's row: 1, -1, -1
Cindy's row: 1, -1, -1
Dave's row: -1, -1, 1, 1, 1
Estie's row: -1, 1, 1, 1
Fred's row: -1, -1
View all other entries as zeros for now.
Similarity between two people

Similarity between their preference vectors.
Inner products are a good start.
Dave has similarity 3 with Estie but -2 with Cindy.
Perhaps recommend Straits Cafe to Dave and Il Fornaio to Bob, etc.
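To make the inner-product idea concrete, here is a minimal Python sketch (my illustration, not code from the lecture). It uses only the ratings listed on the Input slide, so the exact values differ from the slide's 3 and -2, which come from the fuller matrix; the sign of the result carries the same message.

```python
# Inner-product similarity on the +1/-1/0 preference matrix.
# Unrated restaurants count as 0, so they contribute nothing.
RESTAURANTS = ["Brahma Bull", "Higashi West", "Mango Cafe", "Il Fornaio",
               "Zao", "Ming's", "Ramona's", "Straits Cafe", "Homma's"]

RATINGS = {  # person -> {restaurant: +1 (Yes) or -1 (No)}
    "Alice": {"Il Fornaio": 1, "Straits Cafe": -1, "Mango Cafe": -1},
    "Bob":   {"Ming's": -1, "Higashi West": 1},
    "Cindy": {"Straits Cafe": -1, "Zao": -1},
    "Dave":  {"Ming's": 1, "Brahma Bull": -1, "Zao": 1, "Homma's": 1},
    "Estie": {"Zao": 1, "Ming's": 1, "Straits Cafe": 1},
    "Fred":  {"Brahma Bull": -1, "Ramona's": -1},
}

def preference_vector(person):
    """Vector over all restaurants, 0 where the person gave no rating."""
    return [RATINGS[person].get(r, 0) for r in RESTAURANTS]

def similarity(a, b):
    """Inner product of two preference vectors."""
    return sum(x * y for x, y in zip(preference_vector(a), preference_vector(b)))

print(similarity("Dave", "Estie"))   # positive: they tend to agree
print(similarity("Dave", "Cindy"))   # negative: they tend to disagree
```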
Algorithm 1.1




Goal: recommend restaurants I don’t know
Input: evaluation of restaurants I’ve been to
Basic idea: find the person “most similar” to me in
the database and recommend something s/he
likes.
Aspects to consider:




No attempt to discern cuisines, etc.
What if I’ve been to all the restaurants s/he has?
Do you want to rely on one person’s opinions?
www.everyonesacritic.net (movies)
Algorithm 1.k



Look at the k people who are most similar
Recommend what’s most popular among them
Issues?
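A hedged sketch of Algorithm 1.k (my own illustration, reusing the RATINGS dictionary and similarity() from the sketch above): take the k most similar people and recommend the unvisited restaurant with the best vote balance among them.

```python
# Algorithm 1.k sketch: the k most similar people vote on restaurants I don't know.
from collections import defaultdict

def recommend(me, ratings, similarity, k=3):
    # The k people most similar to me (by inner-product similarity)
    neighbours = sorted((p for p in ratings if p != me),
                        key=lambda p: similarity(me, p), reverse=True)[:k]
    votes = defaultdict(int)
    for p in neighbours:
        for restaurant, vote in ratings[p].items():
            if restaurant not in ratings[me]:      # only places I haven't rated
                votes[restaurant] += vote          # +1 for Yes, -1 for No
    # Most popular among the neighbours, or None if they add nothing new
    return max(votes, key=votes.get) if votes else None

# e.g. recommend("Dave", RATINGS, similarity, k=2)
```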
Slightly more sophisticated attempt


Group similar users together into clusters
To make recommendations:



Find the “nearest cluster”
Recommend the restaurants most popular in this
cluster
Features:




efficient
avoids data sparsity issues
still no attempt to discern why you’re
recommended what you’re recommended
how do you cluster?
How do you cluster?

Two key requirements for “good” clustering:



Keep similar people together in a cluster
Separate dissimilar people
Factors:



Need a notion of similarity/distance
Vector space? Normalization?
How many clusters?



Fixed a priori?
Completely data driven?
Avoid “trivial” clusters - too large or small
Looking beyond
Clustering people for
restaurant recommendations
Amazon.com
Clustering other things
(documents, web pages)
Other approaches
to recommendation
General unsupervised machine learning.
Why cluster documents?

For improving recall in search applications
  Better search results
For speeding up vector space retrieval
  Faster search
Corpus analysis/navigation
  Better user interface
Improving search recall

Cluster hypothesis - Documents with similar text are related
Ergo, to improve search recall:
  Cluster docs in corpus a priori
  When a query matches a doc D, also return other docs in the cluster containing D
Hope if we do this:
  The query “car” will also return docs containing automobile,
  because clustering grouped together docs containing car with those containing automobile.
Why might this happen?
Speeding up vector space retrieval

In vector space retrieval, must find nearest doc vectors to query vector
This would entail finding the similarity of the query to every doc – slow (for some applications)
By clustering docs in corpus a priori
  find nearest docs in cluster(s) close to query
  inexact but avoids exhaustive similarity computation

Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.
Speeding up vector space retrieval




Cluster documents into k clusters
Retrieve closest cluster ci to query
Rank documents in ci and return to user
Applications? Web search engines?
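A sketch of this pruned retrieval (assumptions of the illustration: docs and queries are dicts mapping term to tf-idf weight, and clusters plus centroids were computed a priori). It scores only the documents in the cluster whose centroid is closest to the query.

```python
# Cluster-pruned retrieval sketch: score only the docs in the closest cluster.
def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cluster_pruned_search(query, centroids, clusters, top_n=10):
    """centroids[i] is the centroid of clusters[i] (a list of doc vectors)."""
    best = max(range(len(centroids)), key=lambda i: dot(query, centroids[i]))
    ranked = sorted(clusters[best], key=lambda d: dot(query, d), reverse=True)
    return ranked[:top_n]   # inexact: a good doc in another cluster is missed
```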
Clustering for UI (1)
Corpus analysis/navigation

Given a corpus, partition it into groups of related
docs




Recursively, can induce a tree of topics
Allows user to browse through corpus to find
information
Crucial need: meaningful labels for topic nodes.
Yahoo: manual hierarchy

Often not available for new document collection
Clustering for UI (2)
Navigating search results



Given the results of a search (say Jaguar, or
NLP), partition into groups of related docs
Can be viewed as a form of word sense
disambiguation
Jaguar may have senses:





The car company
The animal
The football team
The video game
…
Results list clustering example

Cluster 1:
  Jaguar Motor Cars’ home page
  Mike’s XJS resource page
  Vermont Jaguar owners’ club
Cluster 2:
  Big cats
  My summer safari trip
  Pictures of jaguars, leopards and lions
Cluster 3:
  Jacksonville Jaguars’ Home Page
  AFC East Football Teams
Search Engine Example: Vivisimo

Search for “NLP” on vivisimo
www.vivisimo.com
Doesn’t always work well: no geographic/coffee clusters for “java”!
Representation for Clustering


Similarity measure
Document representation
What makes docs “related”?


Ideal: semantic similarity.
Practical: statistical similarity




We will use cosine similarity.
Docs as vectors.
For many algorithms, easier to think in terms of a
distance (rather than similarity) between docs.
We will describe algorithms in terms of cosine
similarity.
Recall doc as vector



Each doc j is a vector of tfidf values, one
component for each term.
Can normalize to unit length.
So we have a vector space




terms are axes - aka features
n docs live in this space
even with stemming, may have 10000+
dimensions
do we really want to use all terms?

Different from using vector space for search. Why?
Intuition

[Figure: documents D1, D2, D3, D4 (and points x, y) in a term vector space with axes t1, t2, t3]
Postulate: Documents that are “close together” in vector space talk about the same things.
Cosine similarity

Cosine similarity of D_j, D_k:
  sim(D_j, D_k) = sum_{i=1}^{m} w_ij * w_ik
Aka normalized inner product.
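A small sketch of the formula above, assuming docs are dicts of tf-idf weights: after unit-length normalization, cosine similarity is just the inner product.

```python
import math

def normalize(doc):
    """Scale a term->weight dict to unit length."""
    norm = math.sqrt(sum(w * w for w in doc.values()))
    return {t: w / norm for t, w in doc.items()} if norm else dict(doc)

def cosine(dj, dk):
    """sim(Dj, Dk) = sum_i w_ij * w_ik for unit-length vectors."""
    dj, dk = normalize(dj), normalize(dk)
    return sum(w * dk.get(t, 0.0) for t, w in dj.items())
```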
How Many Clusters?

Number of clusters k is given
Partition n docs into a predetermined number of clusters

Finding the “right” number of clusters is part of
the problem



Given docs, partition into an “appropriate” number
of subsets.
E.g., for query results - ideal value of k not known
up front - though UI may impose limits.
Can usually take an algorithm for one flavor and
convert to the other.
Clustering Algorithms

Hierarchical algorithms




Bottom-up, agglomerative
Top-down, divisive
Need a notion of cluster similarity
Iterative, “flat” algorithms


Usually start with a random (partial) partitioning
Refine it iteratively
Dendrogram: Example

[Dendrogram over the function words: be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was]
Dendrogram: Document Example

As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts.
[Figure: dendrogram over docs d1-d5; d1 and d2 merge, d4 and d5 merge, and d3 then joins {d4,d5} to form {d3,d4,d5}]
Agglomerative clustering


Given: target number of clusters k.
Initially, each doc viewed as a cluster


start with n clusters;
Repeat:

while there are > k clusters, find the “closest pair”
of clusters and merge them.
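A naive sketch of this loop (illustrative, not the lecture's code), using centroid cosine similarity on unit-length dense vectors as the "closest pair" criterion; the other linkage definitions on the next slide would slot into the same loop.

```python
# Naive agglomerative clustering: merge the closest pair until k clusters remain.
def centroid(cluster):
    m, dims = len(cluster), len(cluster[0])
    return [sum(v[i] for v in cluster) / m for i in range(dims)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def agglomerate(points, k):
    clusters = [[p] for p in points]              # start with n singleton clusters
    while len(clusters) > k:
        best_sim, best_pair = -float("inf"), None
        for i in range(len(clusters)):            # find the closest pair ...
            for j in range(i + 1, len(clusters)):
                sim = dot(centroid(clusters[i]), centroid(clusters[j]))
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        i, j = best_pair
        clusters[i] += clusters.pop(j)            # ... and merge it
    return clusters
```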
“Closest pair” of clusters

Many variants to defining closest pair of clusters
“Center of gravity”
  Clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link
  Average cosine between pairs of elements
Single-link
  Similarity of the most cosine-similar pair
Complete-link
  Similarity of the “furthest” points, the least cosine-similar
Definition of Cluster Similarity

Single-link clustering
  Similarity of two closest points
  Can create elongated, straggly clusters
  Chaining effect
Complete-link clustering
  Similarity of two least similar points
  Sensitive to outliers
Centroid-based and average-link
  Good compromise
Key notion: cluster representative


We want a notion of a representative point in a
cluster
Representative should be some sort of “typical”
or central point in the cluster, e.g.,



point inducing smallest radii to docs in cluster
smallest squared distances, etc.
point that is the “average” of all docs in the cluster

Centroid or center of gravity
Centroid

Centroid of a cluster = component-wise average
of vectors in a cluster - is a vector.



Need not be a doc.
Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5).
Centroid is a good cluster representative in most
cases.
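A one-line check of the slide's example: the centroid is the component-wise mean of the cluster's vectors.

```python
vectors = [(1, 2, 3), (4, 5, 6), (7, 2, 6)]
centroid = tuple(sum(component) / len(vectors) for component in zip(*vectors))
print(centroid)   # (4.0, 3.0, 5.0), as on the slide
```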
Centroid

Is the centroid of normalized vectors normalized?
Outliers in centroid computation

Can ignore outliers when computing centroid.
What is an outlier?
  Lots of statistical definitions, e.g.
  moment of point to centroid > M * some cluster moment. Say M = 10.
[Figure: a cluster with its centroid and a distant outlier point]
Medoid As Cluster Representative




The centroid does not have to be a document.
Medoid: A cluster representative that is one of
the documents
For example: the document closest to the
centroid
One reason this is useful




Consider the representative of a large cluster
(>1000 documents)
The centroid of this cluster will be a dense vector
The medoid of this cluster will be a sparse vector
Compare: mean/centroid vs. median/medoid
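A sketch of picking a medoid as suggested above (my illustration; docs are sparse term->weight dicts): compute the dense centroid, then return the document most cosine-similar to it.

```python
import math

def cosine(a, b):
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def dense_centroid(docs):
    """Component-wise average over every term seen in the cluster (dense)."""
    terms = {t for d in docs for t in d}
    return {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}

def medoid(docs):
    """The actual document closest to the centroid (stays sparse)."""
    c = dense_centroid(docs)
    return max(docs, key=lambda d: cosine(d, c))
```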
Example: n=6, k=3, closest pair of centroids

[Figure: six docs d1-d6; the centroid after the first merge step (near d1 and d2) and the centroid after the second step]
Issues

Have to support finding closest pairs continually
  compare all pairs?
Potentially n^3 cosine similarity computations
  Why?
  To avoid: use approximations.
“points” are switching clusters as centroids change.
Naïve implementation expensive for large document sets (100,000s)
Efficient implementation


Cluster a sample, then assign the entire set
Avoid dense centroids (e.g., by using medoids)
Exercise

Consider agglomerative clustering on n points on a line. Explain how you could avoid n^3 distance computations - how many will your scheme use?
“Using approximations”


In standard algorithm, must find closest pair of
centroids at each step
Approximation: instead, find nearly closest pair


use some data structure that makes this
approximation easier to maintain
simplistic example: maintain closest pair based on
distances in projection on a random line
[Figure: points projected onto a random line]
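A simplistic sketch of the random-line idea (an illustration assuming cluster centroids are dense vectors): project every centroid onto one random direction and report the pair that is closest in that 1-D projection as the "nearly closest" pair.

```python
import random

def approx_closest_pair(centroids):
    """Indices of a nearly-closest pair, via a 1-D random projection."""
    dims = len(centroids[0])
    line = [random.gauss(0.0, 1.0) for _ in range(dims)]      # random direction
    proj = sorted((sum(c * l for c, l in zip(v, line)), i)    # (projection, index)
                  for i, v in enumerate(centroids))
    # In one dimension the closest pair must be adjacent in sorted order.
    _, pair = min(((b[0] - a[0]), (a[1], b[1])) for a, b in zip(proj, proj[1:]))
    return pair
```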
Different algorithm: k-means





K-means generates a “flat” set of clusters
K-means is non-hierarchical
Given: k - the number of clusters desired.
Iterative algorithm.
Hard to get good bounds on the number of
iterations to convergence.

Rarely a problem in practice
Basic iteration

Reassignment
  At the start of the iteration, we have k centroids.
    Subproblem: where do we get them for the first iteration?
  Each doc assigned to the nearest centroid.
Centroid recomputation
  All docs assigned to the same centroid are averaged to compute a new centroid
  thus we have k new centroids.
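A sketch of this basic iteration (illustration only; docs and centroids are dense lists of equal length): reassignment to the nearest centroid, then recomputation of each centroid as the average of its docs.

```python
def kmeans_iteration(docs, centroids):
    """One k-means step: returns (clusters, new_centroids)."""
    k, dims = len(centroids), len(docs[0])

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Reassignment: each doc goes to its nearest centroid.
    clusters = [[] for _ in range(k)]
    for d in docs:
        clusters[min(range(k), key=lambda i: dist2(d, centroids[i]))].append(d)

    # Recomputation: average the docs assigned to each centroid
    # (an empty cluster keeps its old centroid in this sketch).
    new_centroids = [
        [sum(d[i] for d in c) / len(c) for i in range(dims)] if c else centroids[j]
        for j, c in enumerate(clusters)
    ]
    return clusters, new_centroids
```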
Iteration example
[Figure: docs and the current centroids]

Iteration example
[Figure: the same docs with the new centroids]
k-Means Clustering: Initialization

We could start with any k docs as centroids
But k random docs are better.
Repeat basic iteration until termination condition satisfied.
Exercise: find a better approach for finding good starting points
Termination conditions

Several possibilities, e.g.,



A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Does this mean that the
docs in a cluster are
unchanged?
Convergence

Why should the k-means algorithm ever reach a
fixed point?


A state in which clusters don’t change.
k-means is a special case of a general procedure
known as the EM algorithm.


EM is known to converge.
Number of iterations could be large.
Exercise


Consider running 2-means clustering on a
corpus, each doc of which is from one of two
different languages. What are the two clusters
we would expect to see?
Is agglomerative clustering likely to produce
different results?
Convergence of K-Means

Define goodness measure of cluster k as sum of squared distances from cluster centroid:
  G_k = sum_i (v_i - c_k)^2   (sum over all v_i in cluster k)
  G = sum_k G_k
Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Recomputation monotonically decreases each G_k since (m_k = number of members in cluster k; look at the n-th vector component):
  sum_i (v_in - a)^2 reaches its minimum for a satisfying:
  sum_i -2(v_in - a) = 0

Convergence of K-Means

sum_i -2(v_in - a) = 0
sum_i v_in = sum_i a
m_k a = sum_i v_in
a = (1/m_k) sum_i v_in = c_kn
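For concreteness, a short sketch computing the goodness measure G from the definition above (clusters as lists of equal-length dense vectors); evaluating it before and after a k-means step should show G never increasing.

```python
def goodness(clusters):
    """G = sum_k G_k, with G_k the squared distances to cluster k's centroid."""
    G = 0.0
    for cluster in clusters:
        m, dims = len(cluster), len(cluster[0])
        c = [sum(v[i] for v in cluster) / m for i in range(dims)]   # centroid c_k
        G += sum(sum((v[i] - c[i]) ** 2 for i in range(dims)) for v in cluster)
    return G
```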
k not specified in advance

Say, the results of a query.
Solve an optimization problem: penalize having lots of clusters
  application dependent, e.g., compressed summary of search results list.
Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
k not specified in advance


Given a clustering, define the Benefit for a doc to
be the cosine similarity to its centroid
Define the Total Benefit to be the sum of the
individual doc Benefits.
Why is there always a clustering of Total Benefit n?
Penalize lots of clusters



For each cluster, we have a Cost C.
Thus for a clustering with k clusters, the Total
Cost is kC.
Define the Value of a clustering to be =
Total Benefit - Total Cost.

Find the clustering of highest value, over all
choices of k.
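A hedged sketch of this selection rule (the per-cluster cost C and the helper names cosine and centroid are assumptions carried over from earlier sketches): Value = Total Benefit - kC, and we keep the clustering whose value is highest.

```python
def total_benefit(clusters, cosine, centroid):
    """Sum over all docs of the cosine similarity to their cluster centroid."""
    return sum(cosine(d, centroid(cluster)) for cluster in clusters for d in cluster)

def value(clusters, cosine, centroid, C=0.5):   # C: assumed per-cluster cost
    return total_benefit(clusters, cosine, centroid) - C * len(clusters)

# Given candidate clusterings for k = 1..n (e.g. cuts of an agglomerative run):
# best = max(candidates, key=lambda cl: value(cl, cosine, centroid))
```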
Back to agglomerative
clustering


In a run of agglomerative clustering, we can try
all values of k=n,n-1,n-2, … 1.
At each, we can measure our value, then pick the
best choice of k.
Exercise

Suppose a run of agglomerative clustering finds
k=7 to have the highest value amongst all k.
Have we found the highest-value clustering
amongst all clusterings with k=7?
Clustering vs Classification

Clustering
  Unsupervised
  Input
    Clustering algorithm
    Similarity measure
    Number of clusters
  No specific information for each document
Classification
  Supervised
  Each document is labeled with a class
  Build a classifier that assigns documents to one of the classes
Two types of partitioning: supervised vs unsupervised
Clustering vs Classification

Consider clustering a large set of computer
science documents

what do you expect to see in the vector space?
[Figure: blobs of docs in the vector space, labeled Arch., Graphics, Theory, NLP, AI]
Decision boundaries

Could we use these blobs to infer the subject of a new document?
[Figure: the labeled blobs (Arch., Graphics, Theory, NLP, AI) separated by decision boundaries]
Deciding what a new doc is about

Check which region the new doc falls into
  can output “softer” decisions as well.
[Figure: the labeled regions (Arch., Graphics, Theory, NLP, AI); the new doc falls in the AI region, so the label is AI]
Setup for Classification

Given “training” docs for each category
  Theory, AI, NLP, etc.
Cast them into a decision space
  generally a vector space with each doc viewed as a bag of words
Build a classifier that will classify new docs
  Essentially, partition the decision space
  Given a new doc, figure out which partition it falls into
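A sketch of one way to realize this setup (a simple nearest-centroid classifier of my own choosing, not necessarily the lecture's method): training docs for each category are summed into a class centroid, and a new doc gets the label of the most cosine-similar centroid.

```python
import math

def cosine(a, b):
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def train(labeled_docs):
    """labeled_docs: list of (label, term->weight dict). Returns class centroids."""
    centroids = {}
    for label, doc in labeled_docs:
        c = centroids.setdefault(label, {})
        for t, w in doc.items():
            c[t] = c.get(t, 0.0) + w
    return centroids          # unnormalized sums are fine for cosine comparison

def classify(doc, centroids):
    """Label of the centroid most cosine-similar to the new doc."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))
```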
Supervised vs. unsupervised
learning


This setup is called supervised learning in the
terminology of Machine Learning
In the domain of text, various names





Text classification, text categorization
Document classification/categorization
“Automatic” categorization
Routing, filtering …
In contrast, the earlier setting of clustering is
called unsupervised learning


Presumes no availability of training samples
Clusters output may not be thematically unified.
Which is better?

Depends
  on your setting
  on your application
Can use in combination
  Analyze a corpus using clustering
  Hand-tweak the clusters and label them
  Use tweaked clusters as training input for classification
  Subsequent docs get classified
Computationally, methods quite different
Main issue: can you get training data?
Summary

Two types of clustering
  Hierarchical, agglomerative clustering
  Flat, iterative clustering
How many clusters?
Key parameters
  Representation of data points
  Similarity/distance measure