CIS732-Lecture-24

Transcript CIS732-Lecture-24

Lecture 24 of 42
Model-Based Clustering:
Expectation-Maximization
Monday, 24 March 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih
Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading:
Today: Section 7.5, Han & Kamber 2e
After spring break: Sections 7.6 – 7.7, Han & Kamber 2e
CIS 732 / 830: Machine Learning / Advanced
Topics in AI
Monday, 24 Mar 2008
Computing & Information Sciences
Kansas State University
What is Clustering?
Also called unsupervised learning, sometimes called
classification by statisticians and sorting by
psychologists and segmentation by people in marketing
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of
classes directly from the data (in contrast to
classification).
• More informally, finding natural groupings among
objects.
Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn
Hierarchical Clustering:
Names (using String Edit Distance)
Pedro (Portuguese)
Petros (Greek), Peter (English), Piotr
(Polish), Peadar (Irish), Pierre (French),
Peder (Danish), Peka (Hawaiian), Pietro
(Italian), Piero (Italian Alternative), Petr
(Czech), Pyotr (Russian)
Cristovao (Portuguese)
Christoph (German), Christophe
(French), Cristobal (Spanish), Cristoforo
(Italian), Kristoffer (Scandinavian),
Krystof (Czech), Christopher (English)
Miguel (Portuguese)
Michalis (Greek), Michael (English), Mick
(Irish!)
Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn
Hierarchical Clustering:
Names by Linguistic Similarity
Pedro (Portuguese/Spanish)
Petros (Greek), Peter (English), Piotr (Polish),
Peadar (Irish), Pierre (French), Peder (Danish),
Peka (Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn
Incremental Clustering [1]
Nearest Neighbor Clustering
Not to be confused with Nearest Neighbor Classification
• Items are iteratively merged into the
existing clusters that are closest.
• Incremental
• Threshold, t, used to determine if items are
added to existing clusters or a new cluster is
created.
Incremental Clustering [2]
[Figure: scatter plot on axes from 1 to 10 showing clusters 1 and 2 and the threshold t]
Incremental Clustering [3]
New data point arrives…
It is within the threshold for cluster 1, so add it to the cluster, and update the cluster center.
[Figure: the same plot with new point 3 inside the threshold of cluster 1]
Incremental Clustering [4]
New data point arrives…
It is not within the threshold for cluster 1, so create a new cluster, and so on.
The algorithm is highly order dependent…
It is difficult to determine t in advance…
[Figure: the same plot with new point 4 outside the threshold of cluster 1]
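The incremental procedure can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `incremental_cluster`, the use of Euclidean distance, and the mean as the cluster center are all assumptions. The two runs use the same three points in different arrival orders to show the order dependence.

```python
import math

def incremental_cluster(points, t):
    """Assign each point, in arrival order, to the nearest cluster or start a new one."""
    centers, members = [], []
    for p in points:
        # Find the nearest existing cluster center, if any.
        best, best_dist = None, math.inf
        for i, c in enumerate(centers):
            dist = math.dist(p, c)
            if dist < best_dist:
                best, best_dist = i, dist
        if best is not None and best_dist <= t:
            # Within the threshold: join the cluster and update its center (mean).
            members[best].append(p)
            n = len(members[best])
            centers[best] = tuple(sum(q[k] for q in members[best]) / n
                                  for k in range(len(p)))
        else:
            # Outside the threshold of every cluster: create a new cluster.
            centers.append(tuple(p))
            members.append([p])
    return centers, members

# Same points, two arrival orders, same threshold t.
in_order, _ = incremental_cluster([(0.0,), (1.9,), (3.8,)], t=3.0)
reordered, _ = incremental_cluster([(0.0,), (3.8,), (1.9,)], t=3.0)
print(len(in_order), len(reordered))
```

With the first order the center drifts toward each new point and everything ends up in one cluster; with the second order the far point arrives before the center has moved, so a second cluster is created.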
Similarity and clustering
Motivation
• Problem: Query word could be ambiguous
  • E.g., the query "Star" retrieves documents about astronomy, plants, animals, etc.
  • Solution: Visualisation
    • Clustering document responses to queries along lines of different topics
• Problem 2: Manual construction of topic hierarchies and taxonomies
  • Solution: Preliminary clustering of large samples of web documents
• Problem 3: Speeding up similarity search
  • Solution: Restrict the search for documents similar to a query to the most representative cluster(s)
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering
• Task: Evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
• Cluster Hypothesis: Given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
• Similarity measures
  • Represent documents by TFIDF vectors
  • Distance between document vectors
  • Cosine of angle between document vectors
• Issues
  • Large number of noisy dimensions
  • Notion of noise is application dependent
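As a rough illustration of the TFIDF and cosine measures, the sketch below builds sparse vectors and compares them. The weighting used here (raw term frequency times log(n/df)) is only one common TFIDF variant and is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(n / document frequency).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (made up): "star" occurs everywhere, so it carries no weight.
docs = [
    "star galaxy telescope".split(),
    "star galaxy orbit".split(),
    "star movie actor".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The ambiguous query word "star" appears in every document and so gets zero weight; the two astronomy documents remain similar through "galaxy", while the film document shares nothing weighted with them.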
Top-down clustering
• k-Means: Repeat…
  • Choose k arbitrary 'centroids'
  • Assign each document to nearest centroid
  • Recompute centroids
• Expectation maximization (EM):
  • Pick k arbitrary 'distributions'
  • Repeat:
    • Find probability that document d is generated from distribution f, for all d and f
    • Estimate distribution parameters from weighted contribution of documents
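The EM loop can be made concrete with a small sketch. The slide does not fix a family of distributions; the example below assumes one-dimensional Gaussians, and the names `em_gmm_1d` and `gaussian`, the variance floor, and the toy data are illustrative choices only.

```python
import math

def gaussian(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, iters=50):
    """EM for a 1-D Gaussian mixture; mus holds the k arbitrary initial means."""
    k = len(mus)
    mus = list(mus)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: probability that point x was generated by distribution f.
        resp = []
        for x in xs:
            p = [weights[f] * gaussian(x, mus[f], variances[f]) for f in range(k)]
            total = sum(p)
            resp.append([pf / total for pf in p])
        # M-step: re-estimate parameters from the weighted contributions.
        for f in range(k):
            nf = sum(r[f] for r in resp)
            weights[f] = nf / len(xs)
            mus[f] = sum(r[f] * x for r, x in zip(resp, xs)) / nf
            var = sum(r[f] * (x - mus[f]) ** 2 for r, x in zip(resp, xs)) / nf
            variances[f] = max(var, 1e-6)  # floor avoids a collapsed component
    return weights, mus, variances

# Toy data (made up): two well-separated groups.
xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, mus, variances = em_gmm_1d(xs, mus=[0.0, 6.0])
print(weights, mus)
```

Unlike k-means, every point contributes to every component, weighted by its responsibility; on well-separated data the responsibilities become nearly 0/1 and the result approaches the k-means solution.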
Choosing 'k'
• Mostly problem driven
• Could be 'data driven' only when either
  • Data is not sparse
  • Measurement dimensions are not too noisy
• Interactive
  • Data analyst interprets results of structure discovery
Choosing 'k': Approaches
• Hypothesis testing:
  • Null Hypothesis ($H_0$): Underlying density is a mixture of 'k' distributions
  • Requires regularity conditions on the mixture likelihood function (Smith '85)
• Bayesian Estimation
  • Estimate posterior distribution on k, given data and prior on k
  • Difficulty: Computational complexity of integration
  • Autoclass algorithm of (Cheeseman '98) uses approximations
  • (Diebolt '94) suggests sampling techniques
Choosing 'k': Approaches
• Penalised Likelihood
  • Accounts for the fact that $L_k(D)$ is a non-decreasing function of k
  • Penalise the number of parameters
  • Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML
  • Assumption: Penalised criteria are asymptotically optimal (Titterington 1985)
• Cross Validation Likelihood
  • Find ML estimate on part of training data
  • Choose k that maximises the average of the M cross-validated average likelihoods on held-out data $D_{test}$
  • Cross Validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)
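To illustrate the penalised-likelihood idea, here is a small sketch using BIC in its usual form, -2 log L + p log n (lower is better). The log-likelihood values are invented purely for illustration, and the parameter count assumes a 1-D Gaussian mixture; neither comes from the lecture.

```python
import math

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)

def choose_k(log_likelihoods, params_per_k, n_points):
    """Return the k with the lowest BIC, plus all scores."""
    scores = {k: bic(ll, params_per_k(k), n_points)
              for k, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores

# Hypothetical maximized log-likelihoods L_k(D) for k = 1..4 (made-up numbers).
# They never decrease with k, so the raw likelihood alone would always pick k = 4.
log_likelihoods = {1: -520.0, 2: -430.0, 3: -425.0, 4: -423.0}

# Assumption: a 1-D Gaussian mixture with k components has 3k - 1 free parameters.
best_k, scores = choose_k(log_likelihoods, lambda k: 3 * k - 1, n_points=200)
print(best_k, scores)
```

The penalty term grows with the number of parameters, so the small likelihood gains from k = 3 and k = 4 no longer pay for themselves.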
Clustering
• Task: Evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
• Cluster Hypothesis: Given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
• Collaborative filtering: Clustering of two or more objects which have a bipartite relationship
Clustering (contd)
• Two important paradigms:
  • Bottom-up agglomerative clustering
  • Top-down partitioning
• Visualisation techniques: Embedding of corpus in a low-dimensional space
• Characterising the entities:
  • Internally: Vector space model, probabilistic models
  • Externally: Measure of similarity/dissimilarity between pairs
• Learning: Supplement stock algorithms with experience with data
Clustering: Parameters
• Similarity measure $\rho(d_1, d_2)$ (e.g., cosine similarity)
• Distance measure $\delta(d_1, d_2)$ (e.g., Euclidean distance)
• Number 'k' of clusters
• Issues
  • Large number of noisy dimensions
  • Notion of noise is application dependent
Clustering: Formal specification
• Partitioning Approaches
  • Bottom-up clustering
  • Top-down clustering
• Geometric Embedding Approaches
  • Self-organization map
  • Multidimensional scaling
  • Latent semantic indexing
• Generative models and probabilistic approaches
  • Single topic per document
  • Documents correspond to mixtures of multiple topics
Partitioning Approaches
• Partition document collection into k clusters $\{D_1, D_2, \ldots, D_k\}$
• Choices:
  • Minimize intra-cluster distance $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
  • Maximize intra-cluster semblance $\sum_i \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2)$
• If cluster representations $\vec{D}_i$ are available
  • Minimize $\sum_i \sum_{d \in D_i} \delta(d, \vec{D}_i)$
  • Maximize $\sum_i \sum_{d \in D_i} \rho(d, \vec{D}_i)$
• Soft clustering
  • d assigned to $D_i$ with 'confidence' $z_{d,i}$
  • Find $z_{d,i}$ so as to minimize $\sum_i \sum_{d \in D_i} z_{d,i}\, \delta(d, \vec{D}_i)$ or maximize $\sum_i \sum_{d \in D_i} z_{d,i}\, \rho(d, \vec{D}_i)$
• Two ways to get partitions: bottom-up clustering and top-down clustering
Bottom-up clustering (HAC)
• Initially G is a collection of singleton groups, each with one document d
• Repeat
  • Find $\Gamma$, $\Delta$ in G with max similarity measure $s(\Gamma \cup \Delta)$
  • Merge group $\Gamma$ with group $\Delta$
• For each $\Gamma$ keep track of best $\Delta$
• Use above info to plot the hierarchical merging process (dendrogram)
• To get desired number of clusters: cut across any level of the dendrogram
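A naive version of the merge loop might look like the sketch below. It assumes group-average similarity over cosine of normalized vectors and recomputes every candidate merge from scratch, so it is far slower than the bookkeeping the slide alludes to; the names `hac`, `self_similarity`, and `normalize` are hypothetical.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def self_similarity(group, docs):
    """Average pairwise cosine similarity between the documents of a group."""
    pairs = list(combinations(group, 2))
    total = sum(sum(a * b for a, b in zip(docs[i], docs[j])) for i, j in pairs)
    return total / len(pairs)

def hac(vectors, k):
    """Merge groups bottom-up until k remain; return the groups and merge history."""
    docs = [normalize(v) for v in vectors]
    groups = [(i,) for i in range(len(docs))]  # singleton groups
    merges = []
    while len(groups) > k:
        # Pick the pair of groups whose union has maximum self-similarity.
        g, d = max(combinations(groups, 2),
                   key=lambda gd: self_similarity(gd[0] + gd[1], docs))
        groups = [x for x in groups if x not in (g, d)] + [g + d]
        merges.append((g, d, self_similarity(g + d, docs)))
    return groups, merges

# Toy vectors (made up): two documents near each axis.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
groups, merges = hac(vectors, k=2)
print(groups)
```

The `merges` list records which groups were joined and at what similarity, which is the information a dendrogram plots; stopping at k groups corresponds to cutting the dendrogram at that level.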
Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
Similarity measure
• Typically $s(\Gamma \cup \Delta)$ decreases with increasing number of merges
• Self-Similarity
  • Average pairwise similarity between documents in $\Gamma$:
    $s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)$
  • $s(d_1, d_2)$ = inter-document similarity measure (say cosine of TFIDF vectors)
• Other criteria: Maximum/Minimum pairwise similarity between documents in the clusters
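Computation
• Un-normalized group profile: $\hat{p}(\Gamma) = \sum_{d \in \Gamma} p(d)$
• Can show:
  $s(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma| (|\Gamma| - 1)}$
  $s(\Gamma \cup \Delta) = \frac{\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)(|\Gamma| + |\Delta| - 1)}$
  $\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle$
• $O(n^2 \log n)$ algorithm with $n^2$ space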
Similarity
• $s(\alpha, \beta) = \frac{\langle g(c(\alpha)), g(c(\beta)) \rangle}{\| g(c(\alpha)) \| \, \| g(c(\beta)) \|}$, where $\langle \cdot, \cdot \rangle$ is the inner product
• Normalized document profile: $p(\alpha) = \frac{g(c(\alpha))}{\| g(c(\alpha)) \|}$
• Profile for document group $\Gamma$: $p(\Gamma) = \frac{\sum_{\alpha \in \Gamma} p(\alpha)}{\left\| \sum_{\alpha \in \Gamma} p(\alpha) \right\|}$
Switch to top-down
• Bottom-up
  • Requires quadratic time and space
• Top-down or move-to-nearest
  • Internal representation for documents as well as clusters
  • Partition documents into 'k' clusters
  • 2 variants
    • "Hard" (0/1) assignment of documents to clusters
    • "Soft": documents belong to clusters, with fractional scores
  • Termination
    • When assignment of documents to clusters ceases to change much, OR
    • When cluster centroids move negligibly over successive iterations
Top-down clustering
• Hard k-Means: Repeat…
  • Choose k arbitrary 'centroids'
  • Assign each document to nearest centroid
  • Recompute centroids
• Soft k-Means:
  • Don't break close ties between document assignments to clusters
  • Don't make documents contribute to a single cluster which wins narrowly
  • Contribution for updating cluster centroid $\mu_c$ from document d is related to the current similarity between $\mu_c$ and d:
    $\Delta\mu_c = \eta \, \frac{\exp(-|d - \mu_c|^2)}{\sum_{\gamma} \exp(-|d - \mu_\gamma|^2)}$, $\quad \mu_c \leftarrow \mu_c + \Delta\mu_c$
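One pass of soft k-means could be sketched as follows. The share of each document that goes to a centroid is the exponential weight from the slide; the direction of the step, (d - mu_c), is not spelled out there and is an assumption of this sketch, as are the learning rate `eta`, the function name, and the toy data.

```python
import math

def soft_kmeans_step(docs, centroids, eta=0.5):
    """One pass of soft k-means: every document nudges every centroid."""
    new = [list(c) for c in centroids]
    for d in docs:
        # Share of d given to each centroid: softmax over negative squared distance.
        sq = [sum((x - m) ** 2 for x, m in zip(d, c)) for c in new]
        low = min(sq)  # subtracting the minimum avoids underflow to all zeros
        w = [math.exp(-(s - low)) for s in sq]
        total = sum(w)
        for c, wc in zip(new, w):
            share = wc / total
            for j in range(len(c)):
                # Assumption: the step is eta * share in the direction (d - mu_c).
                c[j] += eta * share * (d[j] - c[j])
    return [tuple(c) for c in new]

# Toy data (made up): two groups of two points each.
docs = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centroids = [(1.0, 1.0), (4.0, 4.0)]
for _ in range(10):
    centroids = soft_kmeans_step(docs, centroids)
print(centroids)
```

A document roughly equidistant from two centroids splits its contribution between them instead of handing everything to a narrow winner.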
Seeding 'k' clusters
• Randomly sample $O(\sqrt{kn})$ documents
• Run bottom-up group average clustering algorithm to reduce to k groups or clusters: $O(kn \log n)$ time
• Iterate assign-to-nearest O(1) times
  • Move each document to nearest cluster
  • Recompute cluster centroids
• Total time taken is O(kn)
• Non-deterministic behavior
Visualisation techniques
• Goal: Embedding of corpus in a low-dimensional space
• Hierarchical Agglomerative Clustering (HAC)
  • Lends itself easily to visualisation
• Self-Organization map (SOM)
  • A close cousin of k-means
• Multidimensional scaling (MDS)
  • Minimize the distortion of interpoint distances in the low-dimensional embedding as compared to the dissimilarity given in the input data
• Latent Semantic Indexing (LSI)
  • Linear transformations to reduce number of dimensions
Self-Organization Map (SOM)
• Like soft k-means
  • Determine association between clusters and documents
  • Associate a representative vector $\mu_c$ with each cluster c and iteratively refine
• Unlike k-means
  • Embed the clusters in a low-dimensional space right from the beginning
  • Large number of clusters can be initialised even if eventually many are to remain devoid of documents
• Each cluster can be a slot in a square/hexagonal grid
• The grid structure defines the neighborhood N(c) for each cluster c
• Also involves a proximity function $h(\gamma, c)$ between clusters $\gamma$ and c
SOM: Update Rule
• Like a neural network
  • Data item d activates neuron $c_d$ (closest cluster) as well as the neighborhood neurons $N(c_d)$
  • E.g., Gaussian neighborhood function: $h(\gamma, c) = \exp\left(-\frac{\|\gamma - c\|^2}{2\sigma^2(t)}\right)$
  • Update rule for node $\gamma$ under the influence of d is: $\mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t)\, h(\gamma, c_d)\, (d - \mu_\gamma)$
  • Where $\sigma(t)$ is the neighborhood width and $\eta(t)$ is the learning rate parameter
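A single SOM update can be sketched as below. The sketch assumes the neighborhood function is a Gaussian of the distance between grid positions, with fixed `eta` and `sigma` for one step (in practice both are usually decayed over time); the function name `som_update` and the three-node grid are made up for illustration.

```python
import math

def som_update(weights, grid, d, eta, sigma):
    """Apply one SOM update for data item d; weights[i] is the vector of node grid[i]."""
    # Winner c_d: the node whose representative vector is closest to d.
    winner = min(range(len(weights)), key=lambda i: math.dist(weights[i], d))
    updated = []
    for pos, mu in zip(grid, weights):
        # Gaussian neighborhood, measured between grid positions (an assumption).
        h = math.exp(-math.dist(pos, grid[winner]) ** 2 / (2 * sigma ** 2))
        # Move the node toward d in proportion to the learning rate and neighborhood.
        updated.append(tuple(m + eta * h * (x - m) for m, x in zip(mu, d)))
    return winner, updated

# Three nodes on a 1-D grid, each with a 2-D representative vector (toy values).
grid = [(0,), (1,), (2,)]
weights = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
d = (1.0, 0.8)
winner, weights2 = som_update(weights, grid, d, eta=0.5, sigma=1.0)
print(winner, weights2)
```

The winner moves most, and its grid neighbors are dragged along by smaller amounts, which is what keeps nearby grid slots holding similar documents.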
SOM : Example I
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.
SOM: Example II
Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.
Multidimensional Scaling (MDS)
• Goal
  • "Distance preserving" low dimensional embedding of documents
• Symmetric inter-document distances
  • Given a priori or computed from internal representation
• Coarse-grained user feedback
  • User provides similarity $d_{ij}$ between documents i and j
  • With increasing feedback, prior distances are overridden
• Objective: Minimize the stress of embedding $\hat{d}_{ij}$
  $\text{stress} = \frac{\sum_{i,j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i,j} d_{ij}^2}$
MDS: issues
• Stress not easy to optimize
• Iterative hill climbing
  1. Points (documents) assigned random coordinates by external heuristic
  2. Points moved by small distance in direction of locally decreasing stress
• For n documents
  • Each takes O(n) time to be moved
  • Totally $O(n^2)$ time per relaxation
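The stress objective and a crude hill-climbing relaxation can be sketched as follows. This is a deliberately naive illustration: it recomputes the full stress for every trial move, so a sweep costs more than the O(n^2) quoted on the slide. The names `stress` and `relax`, the axis-aligned moves, and the step size are assumptions.

```python
import math

def stress(target, coords):
    """Stress of an embedding; target[i][j] is the desired distance between i and j."""
    num = den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            num += (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
            den += target[i][j] ** 2
    return num / den

def relax(target, coords, step=0.05):
    """One sweep: try small axis-aligned moves per point, keep those that lower stress."""
    coords = [list(c) for c in coords]
    for p in coords:
        for axis in range(len(p)):
            for delta in (step, -step):
                before = stress(target, coords)
                p[axis] += delta
                if stress(target, coords) >= before:
                    p[axis] -= delta  # undo a move that did not help
    return [tuple(c) for c in coords]

# Target distances for three points on a line (toy values) and a poor starting layout.
target = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
coords = [(0.0, 0.0), (0.5, 0.5), (2.5, 0.3)]
before = stress(target, coords)
for _ in range(100):
    coords = relax(target, coords)
after = stress(target, coords)
print(before, after)
```

Because only improving moves are kept, the stress can never increase, but the procedure may stall in a local minimum, which is the difficulty the slide points to.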
Fast Map [Faloutsos '95]
• No internal representation of documents available
• Goal
  • Find a projection from an 'n' dimensional space to a space with a smaller number 'k' of dimensions
• Iterative projection of documents along lines of maximum spread
• Each 1D projection preserves distance information
Best line
• Pivots for a line: two points (a and b) that determine it
• Avoid exhaustive checking by picking pivots that are far apart
• First coordinate $x_1$ of point x on "best line" (a, b):
  $x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2\, d_{a,b}}$
Iterative projection
• For i = 1 to k
  1. Find the next (i-th) "best" line
     • A "best" line is one which gives maximum variance of the point-set in the direction of the line
  2. Project points on the line
  3. Project points on the "hyperspace" orthogonal to the above line
Projection
• Purpose
  • To correct the inter-point distance $d'_{x',y'}$ between the projected points $x', y'$ of (x, y) by taking into account the components $(x_1, y_1)$ already accounted for by the first pivot line:
    $d'^2_{x',y'} = d_{x,y}^2 - (x_1 - y_1)^2$
• Project recursively up to 1-D space
• Time: O(nk)
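Putting the pivot formula and the distance correction together gives a compact sketch of the FastMap idea. The pivot heuristic used here (start anywhere, jump to the farthest point, then once more) is one common choice and is an assumption, as are the function names and the toy points.

```python
import math

def fastmap(objects, dist, k):
    """Embed objects in k dimensions using only the distance function dist."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared distance after removing the components of the first `col` axes.
        sq = dist(objects[i], objects[j]) ** 2
        sq -= sum((coords[i][c] - coords[j][c]) ** 2 for c in range(col))
        return max(sq, 0.0)

    for col in range(k):
        # Heuristic pivots: start anywhere, jump to the farthest point, then once more.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, col))
        a = max(range(n), key=lambda j: d2(b, j, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all residual distances are zero
        for i in range(n):
            # Coordinate of object i along the pivot line (a, b).
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * math.sqrt(dab2))
    return coords

# Corners of a 3-by-4 rectangle (toy data); only their distances are used.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
emb = fastmap(points, math.dist, k=2)
print(emb)
```

For points that really live in a 2-D Euclidean space, two projections recover all pairwise distances; with real document dissimilarities the embedding is only approximate.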
Issues
• Detecting noise dimensions
  • Bottom-up dimension composition too slow
  • Definition of noise depends on application
• Running time
  • Distance computation dominates
  • Random projections
  • Sublinear time w/o losing small clusters
• Integrating semi-structured information
  • Hyperlinks, tags embed similarity clues
  • A link is worth a ? words
Extended similarity
• Where can I fix my scooter?
• A great garage to repair your 2-wheeler is at …
• auto and car co-occur often
• Documents having related words are related
• Useful for search and clustering
• Two basic approaches
  • Hand-made thesaurus (WordNet)
  • Co-occurrence and associations
[Figure: documents in which "car" and "auto" co-occur, suggesting that car and auto are related]
Latent semantic indexing
[Figure: SVD of the term-by-document matrix A (t terms by d documents) into U, D, V of rank r; terms such as "car" and "auto", and each document, map to k-dim vectors]
Collaborative recommendation
• People = records, movies = features
• People and features to be clustered
  • Mutual reinforcement of similarity
• Need advanced models
[Figure: people (Lyle, Ellen, Jason, Fred, Dean, Karen) by movies (Batman, Rambo, Andre, Hiver, Whispers, StarWars)]
From Clustering methods in collaborative filtering, by Ungar and Foster
A model for collaboration
• People and movies belong to unknown classes
• $P_k$ = probability a random person is in class k
• $P_l$ = probability a random movie is in class l
• $P_{kl}$ = probability of a class-k person liking a class-l movie
• Gibbs sampling: iterate
  • Pick a person or movie at random and assign to a class with probability proportional to $P_k$ or $P_l$
  • Estimate new parameters
Aspect Model
• Metric data vs Dyadic data vs Proximity data vs Ranked preference data
• Dyadic data: domain with two finite sets of objects
• Observations: of dyads X and Y
• Unsupervised learning from dyadic data
• Two sets of objects: $X = \{x_1, \ldots, x_i, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_i, \ldots, y_n\}$
Aspect Model (contd)
• Two main tasks
  • Probabilistic modeling:
    • Learning a joint or conditional probability model over $X \times Y$
  • Structure discovery:
    • Identifying clusters and data hierarchies
Aspect Model
• Statistical models
  • Empirical co-occurrence frequencies
    • Sufficient statistics
  • Data sparseness:
    • Empirical frequencies either 0 or significantly corrupted by sampling noise
  • Solution
    • Smoothing
      • Back-off method [Katz'87]
      • Model interpolation with held-out data [JM'80, Jel'85]
      • Similarity-based smoothing techniques [ES'92]
    • Model-based statistical approach: a principled approach to deal with data sparseness
Aspect Model
• Model-based statistical approach: a principled approach to deal with data sparseness
  • Finite Mixture Models [TSM'85]
  • Latent class [And'97]
  • Specification of a joint probability distribution for latent and observable variables [Hofmann'98]
• Unifies
  • Statistical modeling
    • Probabilistic modeling by marginalization
  • Structure detection (exploratory data analysis)
    • Posterior probabilities by Bayes' rule on latent space of structures
Aspect Model
• $S = (x^n, y^n)_{1 \le n \le N}$: realisation of an underlying sequence of random variables $S = (X^n, Y^n)_{1 \le n \le N}$
• 2 assumptions
  • All co-occurrences in sample S are iid
  • $X^n, Y^n$ are independent given $A^n$
• P(c) are the mixture components
Aspect Model: Latent classes
Increasing degree of restriction on latent space:
• $A^n (X^n, Y^n)_{1 \le n \le N}$, with $A = \{a_1, \ldots, a_K\}$
• $\{C(X^n), Y^n\}_{1 \le n \le N}$, with $C = \{c_1, \ldots, c_K\}$
• $\{C(X^n), Y^n\}_{1 \le n \le N}$, with $C = \{c_1, \ldots, c_K\}$
• $\{C(X^n), D(Y^n)\}_{1 \le n \le N}$, with $C = \{c_1, \ldots, c_K\}$ and $D = \{d_1, \ldots, d_L\}$
Aspect Model
Symmetric:
$P(S, a) = \prod_{n=1}^{N} P(x^n, y^n, a^n) = \prod_{n=1}^{N} P(a^n) P(x^n \mid a^n) P(y^n \mid a^n)$
$P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a) P(x \mid a) P(y \mid a) \Big]^{n(x,y)}$
Asymmetric:
$P(S, a) = \prod_{n=1}^{N} P(x^n, y^n, a^n) = \prod_{n=1}^{N} P(a^n) P(x^n \mid a^n) P(y^n \mid a^n)$
$P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} P(x)^{n(x)} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x) P(y \mid a) \Big]^{n(x,y)}$
Clustering vs Aspect
• Clustering model
  • Constrained aspect model: $P(a \mid x, c) = P\{A^n = a \mid X^n = x, C(x) = c\} = \delta_{ac}$
  • For flat: $c_k = a_k \Rightarrow \delta_{ac}$
  • For hierarchical: $a_k \uparrow c_k \Rightarrow \delta_{ac} \cdot P(a \mid x, c)$
    • Group structure on object spaces, as against partitioning the observations
• Notation
  • P(.): are the parameters
  • P{.}: are posteriors
Hierarchical Clustering model
One-sided clustering:
$P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} P(x)^{n(x)} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x) P(y \mid a) \Big]^{n(x,y)}$
$= \prod_{x \in X} P(x)^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x, c) P(y \mid a) \Big]^{n(x,y)} = \prod_{x \in X} [P(x)]^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} [P(y \mid a)]^{n(x,y)}$, where $a = c$ in the flat case
Hierarchical clustering:
$P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} P(x)^{n(x)} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x) P(y \mid a) \Big]^{n(x,y)}$
$= \prod_{x \in X} P(x)^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x, c) P(y \mid a) \Big]^{n(x,y)}$
Comparison of E's
• Aspect model
  $P\{A^n = a \mid X^n = x, Y^n = y; \theta\} = \frac{P(a) P(x \mid a) P(y \mid a)}{\sum_{a' \in A} P(a') P(x \mid a') P(y \mid a')}$
• One-sided aspect model
  $P\{C(x) = c \mid S_x, \theta\} = \frac{P(c) \prod_{y \in Y} [P(y \mid c)]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} [P(y \mid c')]^{n(x,y)}}$
• Hierarchical aspect model
  $P\{C(x) = c \mid S, \theta\} = \frac{P(c) \prod_{y \in Y} \big[ \sum_{a \in A} P(y \mid a) P(a \mid x, c) \big]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} \big[ \sum_{a \in A} P(y \mid a) P(a \mid x, c') \big]^{n(x,y)}}$
  $P\{A^n = a \mid X^n = x, Y^n = y, C(x) = c; \theta\} = \frac{P(a \mid x, c) P(y \mid a)}{\sum_{a' \in A} P(a' \mid x, c) P(y \mid a')}$
Tempered EM (TEM)
• Additively (on the log scale) discount the likelihood part in Bayes' formula:
  $P\{A^n = a \mid X^n = x, Y^n = y; \theta\} = \frac{P(a) [P(x \mid a) P(y \mid a)]^{\beta}}{\sum_{a' \in A} P(a') [P(x \mid a') P(y \mid a')]^{\beta}}$
1. Set $\beta = 1$ and perform EM until the performance on held-out data deteriorates (early stopping).
2. Decrease $\beta$, e.g., by setting $\beta \leftarrow \eta\beta$ with some rate parameter $\eta < 1$.
3. As long as the performance on held-out data improves, continue TEM iterations at this value of $\beta$.
4. Stop on $\beta$, i.e., stop when decreasing $\beta$ does not yield further improvements; otherwise go to step (2).
5. Perform some final iterations using both training and held-out data.
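The effect of tempering on the E-step can be seen in a short sketch. It assumes the exponent (called `beta` here) is applied to the likelihood part P(x|a)P(y|a) only, with beta = 1 recovering the ordinary posterior; the function name and the probabilities are made up for illustration.

```python
def tempered_posterior(p_a, p_x_given_a, p_y_given_a, beta):
    """P{a | x, y} with the likelihood part raised to the power beta."""
    scores = [pa * (px * py) ** beta
              for pa, px, py in zip(p_a, p_x_given_a, p_y_given_a)]
    z = sum(scores)
    return [s / z for s in scores]

# Two aspects with made-up parameters for one observed pair (x, y).
p_a = [0.5, 0.5]
p_x_given_a = [0.8, 0.2]
p_y_given_a = [0.9, 0.1]

sharp = tempered_posterior(p_a, p_x_given_a, p_y_given_a, beta=1.0)
smooth = tempered_posterior(p_a, p_x_given_a, p_y_given_a, beta=0.5)
print(sharp, smooth)
```

Lowering beta flattens the posterior, so each observation is shared more evenly among aspects, which acts as a guard against overfitting the training counts.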
M-Steps
1. Aspect
   $P(x \mid a) = \frac{\sum_{n: x^n = x} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')} = \frac{\sum_{y} n(x, y) P(a \mid x, y; \theta')}{\sum_{x', y} n(x', y) P(a \mid x', y; \theta')}$
   $P(y \mid a) = \frac{\sum_{n: y^n = y} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')} = \frac{\sum_{x} n(x, y) P(a \mid x, y; \theta')}{\sum_{x, y'} n(x, y') P(a \mid x, y'; \theta')}$
2. Asymmetric
   $P(x) = \frac{n(x)}{N}$
   $P(a \mid x) = \frac{\sum_{n: x^n = x} P(a \mid x^n, y^n; \theta')}{n(x)} = \frac{\sum_{y} n(x, y) P(a \mid x, y; \theta')}{n(x)}$
   $P(y \mid a) = \frac{\sum_{n: y^n = y} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')} = \frac{\sum_{x} n(x, y) P(a \mid x, y; \theta')}{\sum_{x, y'} n(x, y') P(a \mid x, y'; \theta')}$
3. Hierarchical x-clustering
   $P(x) = \frac{n(x)}{N}$
   $P(y \mid a) = \frac{\sum_{n: y^n = y} P\{a \mid x^n, y^n; \theta'\}}{\sum_{n=1}^{N} P\{a \mid x^n, y^n; \theta'\}} = \frac{\sum_{x} n(x, y) P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y') P\{a \mid x, y'; \theta'\}}$
4. One-sided x-clustering
   $P(x) = \frac{n(x)}{N}$
   $P(y \mid c) = \frac{\sum_{x} n(x, y) P\{C(x) = c \mid S; \theta'\}}{\sum_{x} n(x) P\{C(x) = c \mid S; \theta'\}}$
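The E-step and M-step for the symmetric aspect model can be combined into one loop, sketched below under stated assumptions. The function names, the toy counts, and in particular the deterministic initialisation (which slightly favors a different chunk of the sorted x values for each aspect) are illustrative choices, not part of the model; a random initialisation is the more usual practice.

```python
def normalized(weights):
    """Scale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}

def plsa_em(counts, n_aspects, iters=50):
    """EM for the symmetric aspect model; counts[(x, y)] = n(x, y)."""
    xs = sorted({x for x, _ in counts})
    ys = sorted({y for _, y in counts})
    aspects = range(n_aspects)
    n_total = sum(counts.values())
    p_a = [1.0 / n_aspects] * n_aspects
    # Assumed initialisation: aspect a slightly favors the a-th chunk of sorted xs.
    p_x = [normalized({x: (2.0 if i * n_aspects // len(xs) == a else 1.0)
                       for i, x in enumerate(xs)}) for a in aspects]
    p_y = [{y: 1.0 / len(ys) for y in ys} for _ in aspects]
    for _ in range(iters):
        # E-step: posterior P{a | x, y} for every observed pair.
        post = {}
        for (x, y) in counts:
            joint = [p_a[a] * p_x[a][x] * p_y[a][y] for a in aspects]
            z = sum(joint)
            post[(x, y)] = [j / z for j in joint]
        # M-step: re-estimate P(a), P(x | a), P(y | a) from expected counts.
        for a in aspects:
            mass = sum(n * post[xy][a] for xy, n in counts.items())
            p_a[a] = mass / n_total
            p_x[a] = {x: sum(n * post[(x2, y)][a]
                             for (x2, y), n in counts.items() if x2 == x) / mass
                      for x in xs}
            p_y[a] = {y: sum(n * post[(x, y2)][a]
                             for (x, y2), n in counts.items() if y2 == y) / mass
                      for y in ys}
    return p_a, p_x, p_y

# Toy document-word counts (made up): two documents about cars, two about astronomy.
counts = {
    ("d1", "car"): 4, ("d1", "auto"): 3, ("d2", "car"): 3, ("d2", "auto"): 4,
    ("d3", "star"): 4, ("d3", "galaxy"): 3, ("d4", "star"): 3, ("d4", "galaxy"): 4,
}
p_a, p_x, p_y = plsa_em(counts, n_aspects=2)
print(p_a, p_y[0], p_y[1])
```

On this block-structured toy data each aspect ends up owning one topic, so "car" and "auto" share an aspect even though the model was never told they are related.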
Example Model [Hofmann and Popat CIKM 2001]
• Hierarchy of document categories
Example Application
Topic Hierarchies
 To overcome sparseness problem in topic hierarchies with large number
of classes
 Sparseness Problem: Small number of positive examples
• Topic hierarchies to reduce variance in parameter estimation
 Automatically differentiate
 Make use of term distributions estimated for more general, coarser text aspects to
provide better, smoothed estimates of class conditional term distributions
 Convex combination of term distributions in a hierarchical mixture model:

$$P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\, P(w \mid a)$$

where $a \uparrow c$ refers to all inner nodes $a$ above the terminal class node $c$.
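The convex combination can be illustrated with a small sketch. The node names (`root`, `science`), the vocabulary, and all probabilities below are made up for the example, and `smoothed_term_distribution` is a hypothetical helper name.

```python
def smoothed_term_distribution(path_weights, term_dists, vocabulary):
    """P(w | c) = sum over nodes a above c of P(a | c) * P(w | a).

    path_weights: dict a -> P(a | c) for the nodes a on the path above class c
    term_dists:   dict a -> dict w -> P(w | a)
    """
    return {
        w: sum(weight * term_dists[a].get(w, 0.0) for a, weight in path_weights.items())
        for w in vocabulary
    }


vocabulary = ["model", "cluster", "goal"]
term_dists = {
    "root": {"model": 0.4, "cluster": 0.3, "goal": 0.3},     # coarse, general node
    "science": {"model": 0.6, "cluster": 0.4, "goal": 0.0},  # more specific node
}
path_weights = {"root": 0.25, "science": 0.75}  # P(a | c), sums to 1

p_w_given_c = smoothed_term_distribution(path_weights, term_dists, vocabulary)
```

In this toy case the term "goal" has zero probability at the specific node but still receives 0.25 * 0.3 = 0.075 from the root, which is the smoothing effect described above.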

Topic Hierarchies
(Hierarchical X-clustering)

X = document, Y = word

$$P(y \mid a) = \frac{\sum_{n:\, y_n = y} P\{a \mid x_n, y_n; \theta'\}}{\sum_{n=1}^{N} P\{a \mid x_n, y_n; \theta'\}} = \frac{\sum_{x} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\, P\{a \mid x, y'; \theta'\}} = \frac{\sum_{c(x):\, a \uparrow c(x)} n(c(x), y)\, P\{a \mid c(x), y; \theta'\}}{\sum_{c(x):\, a \uparrow c(x),\; y'} n(c(x), y')\, P\{a \mid c(x), y'; \theta'\}}$$

$$P\{a \mid x, y, c(x); \theta\} = P\{a \mid y, c(x); \theta\} = \frac{P(a \mid x, c)\, P(y \mid a)}{\sum_{a' \in A} P(a' \mid x, c)\, P(y \mid a')} = \frac{P(a \mid c)\, P(y \mid a)}{\sum_{a' \uparrow c} P(a' \mid c(x))\, P(y \mid a')}$$

$$P\{a \mid c(x); \theta\} = \frac{\sum_{y} n(y, c)\, P(a \mid y, c(x))}{\sum_{a' \uparrow c} \sum_{y} n(y, c)\, P(a' \mid y, c(x))}$$

$$P(x) = \frac{n(x)}{N}$$

$$P\{C(x) = c \mid S, \theta\} = \frac{P(c) \prod_{y \in Y} \left[\sum_{a \in A} P(y \mid a)\, P(a \mid x, c(x))\right]^{n(x, y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} \left[\sum_{a \in A} P(y \mid a)\, P(a \mid x, c'(x))\right]^{n(x, y)}} = \frac{P(c) \prod_{y \in Y} \left[\sum_{a \uparrow c} P(y \mid a)\, P(a \mid c(x))\right]^{n(x, y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} \left[\sum_{a \uparrow c'} P(y \mid a)\, P(a \mid c'(x))\right]^{n(x, y)}}$$
Document Classification Exercise
 Modification of Naïve Bayes
$$P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\, P(w \mid a)$$

$$P(c \mid x) = \frac{P(c) \prod_{y_i \in x} P(y_i \mid c)}{\sum_{c'} P(c') \prod_{y_i \in x} P(y_i \mid c')}$$
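A minimal sketch of the class posterior, computed in log space to avoid underflow on long documents. The class names, priors, and term probabilities are invented for illustration; in the modified Naïve Bayes model, `term_probs[c]` would hold the smoothed hierarchical mixture P(w | c) rather than raw per-class estimates.

```python
import math


def class_posterior(doc_terms, priors, term_probs):
    """P(c | x) for a document x, treating its terms as independent given c.

    doc_terms:  list of terms y_i occurring in x (repeats allowed)
    priors:     dict c -> P(c)
    term_probs: dict c -> dict w -> P(w | c), assumed nonzero for every term used
    """
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(term_probs[c][w]) for w in doc_terms)
        for c in priors
    }
    top = max(log_scores.values())  # subtract the max before exponentiating
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


priors = {"sports": 0.5, "science": 0.5}
term_probs = {
    "sports": {"goal": 0.7, "model": 0.3},
    "science": {"goal": 0.2, "model": 0.8},
}
posterior = class_posterior(["model", "model", "goal"], priors, term_probs)
```

The nonzero-probability assumption on `term_probs` is exactly what smoothing is meant to guarantee; an unsmoothed zero would make `math.log` fail.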
Mixture vs Shrinkage
 Shrinkage [McCallum Rosenfeld AAAI’98]: interior nodes in the hierarchy
represent coarser views of the data, obtained by a simple
pooling scheme of term counts
 Mixture: interior nodes represent abstraction levels with their
corresponding specific vocabulary
 Predefined hierarchy [Hofmann and Popat CIKM 2001]
 Creation of hierarchical model from unlabeled data [Hofmann IJCAI’99]
Mixture Density Networks (MDN)
[Bishop CM ’94 Mixture Density Networks]
 Broad and flexible class of distributions capable of modeling
completely general continuous distributions
 Superimpose simple component densities with well-known properties to
generate or approximate more complex distributions
 Two modules:
 Mixture model: output has a distribution given as a mixture of distributions
 Neural network: outputs determine the parameters of the mixture model
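The coupling between the two modules can be sketched as below for a one-dimensional Gaussian mixture. The raw output values are hand-picked stand-ins for what a trained network would emit, and the function names are hypothetical; the softmax and exponential mappings follow the usual MDN parameterization.

```python
import math


def mixture_params(raw_alpha, raw_mu, raw_sigma):
    """Map unconstrained network outputs to valid 1-D Gaussian mixture parameters."""
    top = max(raw_alpha)
    exps = [math.exp(z - top) for z in raw_alpha]
    total = sum(exps)
    alphas = [e / total for e in exps]         # softmax: positive, sums to 1
    sigmas = [math.exp(z) for z in raw_sigma]  # exp keeps the widths positive
    return alphas, list(raw_mu), sigmas


def mixture_density(t, alphas, mus, sigmas):
    """Density of the target value t under the mixture."""
    return sum(
        a * math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for a, m, s in zip(alphas, mus, sigmas)
    )


# Stand-in network outputs for one input (illustrative values only).
alphas, mus, sigmas = mixture_params([0.0, 1.0], [-1.0, 2.0], [0.0, -0.5])
density_at_2 = mixture_density(2.0, alphas, mus, sigmas)
```

Because the constraints are enforced by the mapping, the network itself can be trained with unconstrained outputs.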
MDN: Example
A conditional mixture density network with Gaussian component densities
MDN
 Parameter estimation:
 Use the Generalized EM (GEM) algorithm to speed up estimation
 Inference:
 Even for a linear mixture, a closed-form solution is not possible
 Use Monte Carlo simulation as a substitute
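One way to picture the Monte Carlo substitute is to sample from the mixture and average a function of the samples. The sketch below assumes a one-dimensional Gaussian mixture with made-up parameters; it shows the general sampling idea, not the specific inference procedure used for MDNs.

```python
import random


def sample_mixture(alphas, mus, sigmas, rng):
    """Draw one value: pick a component by its weight, then sample its Gaussian."""
    k = rng.choices(range(len(alphas)), weights=alphas)[0]
    return rng.gauss(mus[k], sigmas[k])


def monte_carlo_expectation(f, alphas, mus, sigmas, num_samples=20000, seed=0):
    """Estimate E[f(t)] under the mixture by averaging f over random draws."""
    rng = random.Random(seed)
    draws = (sample_mixture(alphas, mus, sigmas, rng) for _ in range(num_samples))
    return sum(f(t) for t in draws) / num_samples


# Illustrative mixture parameters (assumed, not from the lecture).
alphas, mus, sigmas = [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5]
estimated_mean = monte_carlo_expectation(lambda t: t, alphas, mus, sigmas)
exact_mean = sum(a * m for a, m in zip(alphas, mus))
```

The mean is used here only because it has a known exact value to compare against; the same estimator applies to any `f` for which no closed form is available.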
Document model


 Vocabulary $V$, term $w_i$; document $d$ represented by $c(d) = (f(w_i, d))_{w_i \in V}$
 $f(w_i, d)$ is the number of times $w_i$ occurs in document $d$
 Most $f(w_i, d)$ values are zero for a single document
 Monotone component-wise damping function $g$, such as log or square root:

$$g(c(d)) = (g(f(w_i, d)))_{w_i \in V}$$
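The document representation and damping can be sketched in a few lines. The vocabulary and token list are invented for the example, and using `math.log1p` (that is, log(1 + f)) as the log-style damping is an assumption made so that zero counts stay defined.

```python
import math
from collections import Counter


def term_frequency_vector(tokens, vocabulary):
    """Return the vector of counts f(w_i, d), one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def damp(vector, g):
    """Apply the damping function g to each component of the vector."""
    return [g(f) for f in vector]


vocabulary = ["cluster", "em", "mixture", "neural"]
tokens = "em mixture em cluster em".split()

c_doc = term_frequency_vector(tokens, vocabulary)
damped_sqrt = damp(c_doc, math.sqrt)
damped_log = damp(c_doc, math.log1p)  # log(1 + f): an assumed variant, keeps 0 at 0
```

A dense list is used for clarity; since most counts are zero for a single document, a sparse mapping from term to count would be the more practical choice for a real vocabulary.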
Terminology
 Expectation-Maximization (EM) Algorithm
 Iterative refinement: repeat until convergence to a locally optimal label
 Expectation step: estimate parameters with which to simulate data
 Maximization step: use simulated (“fictitious”) data to update parameters
 Unsupervised Learning and Clustering
 Constructive induction: using unsupervised learning for supervised learning
 Feature construction: “front end” - construct new x values
 Cluster definition: “back end” - use these to reformulate y
 Clustering problems: formation, segmentation, labeling
 Key criterion: distance metric (points closer intra-cluster than inter-cluster)
 Algorithms
 AutoClass: Bayesian clustering
 Principal Components Analysis (PCA), factor analysis (FA)
 Self-Organizing Maps (SOM): topology preserving transform (dimensionality
reduction) for competitive unsupervised learning
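The iterative refinement loop of EM can be sketched for a mixture of two one-dimensional Gaussians. The synthetic data, the starting values, and the fixed iteration count (used here in place of a convergence test) are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(data, mus, sigmas, weights, num_iters=50):
    """Run a fixed number of EM iterations; returns updated (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    for _ in range(num_iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(scores)
            resp.append([sc / total for sc in scores])
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for k in range(len(mus)):
            nk = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-3)  # floor guards against collapse
            weights[k] = nk / len(data)
    return mus, sigmas, weights


# Synthetic data from two well-separated Gaussians (illustrative only).
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(300)]
mus, sigmas, weights = em_gaussian_mixture(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

As with any EM run, the result is only a local optimum and depends on the starting values; the initial means here are deliberately placed on either side of zero.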
Summary Points
 Expectation-Maximization (EM) Algorithm
 Unsupervised Learning and Clustering
 Types of unsupervised learning
 Clustering, vector quantization
 Feature extraction (typically, dimensionality reduction)
 Constructive induction: unsupervised learning in support of supervised
learning
 Feature construction (aka feature extraction)
 Cluster definition
 Algorithms
 EM: mixture parameter estimation (e.g., for AutoClass)
 AutoClass: Bayesian clustering
 Principal Components Analysis (PCA), factor analysis (FA)
 Self-Organizing Maps (SOM): projection of data; competitive algorithm
 Clustering problems: formation, segmentation, labeling
 Next Lecture: Time Series Learning
and Characterization