A Guided Tour of Finite Mixture Models: From Pearson to the Web
ICML '01 Keynote Talk
Williams College, MA
June 29th, 2001
Padhraic Smyth
Information and Computer Science
University of California, Irvine
www.datalab.uci.edu
Outline
• What are mixture models?
– Definitions and examples
• How can we learn mixture models?
– A brief history and illustration
• What are mixture models useful for?
– Applications in Web and transaction data
• Recent research in mixtures?
Acknowledgements
• Students:
– Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
• Collaborators
– David Heckerman, Chris Meek, Heikki Mannila,
Christine McLaren, Geoff McLachlan, David Wolpert
• Funding
– NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center,
Microsoft Research, IBM Research, HNC Software.
Finite Mixture Models

p(x) = ?

p(x) = \sum_{k=1}^{K} p(x, c_k)
     = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)
     = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

where, for component k, p(x \mid c_k, \theta_k) is the component model, \theta_k its parameters, and \alpha_k its weight.
Example: Mixture of Gaussians

• Gaussian mixtures:

p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

Each mixture component is a multidimensional Gaussian with its own mean \mu_k and covariance "shape" \Sigma_k.

e.g., K = 2, 1-dim: \{\theta, \alpha\} = \{\mu_1, \sigma_1, \mu_2, \sigma_2, \alpha_1\}
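To make the notation concrete, here is a minimal Python sketch (with made-up parameter values) that evaluates the K = 2, one-dimensional case directly:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, mus, sigmas, alphas):
    """p(x) = sum_k alpha_k * N(x; mu_k, sigma_k^2)."""
    return sum(a * normal_pdf(x, m, s) for m, s, a in zip(mus, sigmas, alphas))

# Hypothetical K=2 parameters {mu1, sigma1, mu2, sigma2, alpha1}:
mus, sigmas, alphas = [0.0, 5.0], [1.0, 2.0], [0.6, 0.4]
x = np.linspace(-5, 10, 7)
print(mixture_pdf(x, mus, sigmas, alphas))
```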
[Figure: two panels. Top: the two Gaussian component densities p(x) (Component 1 and Component 2). Bottom: the resulting mixture model density p(x) as a function of x.]
[Figure: two panels. Top: the unweighted component density models p(x). Bottom: the resulting mixture model density p(x) as a function of x.]
Example: Mixture of Naïve Bayes

p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

p(x \mid c_k, \theta_k) = \prod_{j=1}^{d} p(x_j \mid c_k)

Conditional independence model for each component (often quite useful to first order)
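As an illustrative sketch (all parameters are invented), the mixture-of-naïve-Bayes density for a binary vector multiplies per-dimension Bernoulli probabilities within each component and then mixes the components:

```python
import numpy as np

def nb_component_pdf(x, p_k):
    """p(x | c_k) = prod_j p(x_j | c_k) for a binary vector x,
    where p_k[j] is the probability that x_j = 1 under component k."""
    return np.prod(np.where(x == 1, p_k, 1.0 - p_k))

def nb_mixture_pdf(x, P, alphas):
    """p(x) = sum_k alpha_k * prod_j p(x_j | c_k)."""
    return sum(a * nb_component_pdf(x, p_k) for p_k, a in zip(P, alphas))

# Hypothetical model: K=2 components over d=4 binary term indicators.
P = np.array([[0.9, 0.8, 0.1, 0.1],   # component 1: favors terms 1 and 2
              [0.1, 0.2, 0.9, 0.7]])  # component 2: favors terms 3 and 4
alphas = np.array([0.5, 0.5])
x = np.array([1, 1, 0, 0])
print(nb_mixture_pdf(x, P, alphas))
```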
Mixtures of Naïve Bayes

[Figure: a sparse binary documents-by-terms matrix; 1s mark which terms occur in which documents.]
Mixtures of Naïve Bayes

[Figure: the same documents-by-terms matrix with the documents grouped into two clusters, labeled Component 1 and Component 2.]
Other Component Models
• Mixtures of Rectangles
– Pelleg and Moore (ICML, 2001)
• Mixtures of Trees
– Meila and Jordan (2000)
• Mixtures of Curves
– Quandt and Ramsey (1978)
• Mixtures of Sequences
– Poulsen (1990)
Interpretation of Mixtures

1. C has a direct (physical) interpretation
   e.g., C = {age of fish}, C = {male, female}

2. C might have an interpretation
   e.g., clusters of Web surfers

3. C is just a convenient latent variable
   e.g., flexible density estimation
Graphical Models for Mixtures

E.g., mixtures of Naïve Bayes:

[Figure: graphical model with a single discrete, hidden class node C as the parent of the observed nodes X1, X2, X3.]
Sequential Mixtures

[Figure: at each of times t-1, t, t+1 a hidden class node C has observed children X1, X2, X3, and the C nodes are linked across time.]

Markov mixtures = C has Markov dependence
= hidden Markov model (here with a naïve Bayes observation model)

C = discrete state, couples observables through time
Dynamic Mixtures
• Computer Vision
• mixtures of Kalman filters for tracking
• Atmospheric Science
• mixtures of curves and dynamical models for
cyclones
• Economics
• mixtures of switching regressions for the US
economy
Limitations of Mixtures
• Discrete state space
– not always appropriate
– e.g., in modeling dynamical systems
• Training
– no closed form solution, can be tricky
• Interpretability
– many different mixture solutions may explain the
same data
Learning of mixture models
Learning Mixtures from Data

Consider fixed K
e.g., unknown parameters \Theta = \{\mu_1, \sigma_1, \mu_2, \sigma_2, \alpha_1\}

Given data D = \{x_1, \ldots, x_N\}, we want to find the parameters \Theta that "best fit" the data.
Early Attempts
Weldon’s data, 1893
- n=1000 crabs from Bay of Naples
- Ratio of forehead to body length
- suspected existence of 2 separate species
Early Attempts

Karl Pearson, 1894:
- JRSS paper
- proposed a mixture of 2 Gaussians
- 5 parameters \Theta = \{\mu_1, \sigma_1, \mu_2, \sigma_2, \alpha_1\}
- parameter estimation -> method of moments
- involved solution of 9th-order equations!

(see Chapter 10, Stigler (1986), The History of Statistics)
“The solution of an equation of the ninth degree,
where almost all powers, to the ninth, of the
unknown quantity are existing, is, however, a very
laborious task. Mr. Pearson has indeed possessed
the energy to perform his heroic task…. But I fear
he will have few successors…..”
Charlier (1906)
Maximum Likelihood Principle

• Fisher, 1922
– assume a probabilistic model
– likelihood = p(data | parameters, model)
– find the parameters that make the data most likely

L(\{\theta, \alpha\}) = p(D \mid \{\theta, \alpha\})
                      = \prod_{i=1}^{N} p(x_i \mid \{\theta, \alpha\})
                      = \prod_{i=1}^{N} \left[ \sum_{k=1}^{K} p(x_i \mid c_k, \theta_k)\, \alpha_k \right]
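As a small sketch of this computation (hypothetical data and parameters, reusing the Gaussian mixture density from the earlier sketch), the log-likelihood of a data set is just the sum of the log mixture densities of the individual points:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def log_likelihood(data, mus, sigmas, alphas):
    """log L = sum_i log [ sum_k alpha_k p(x_i | c_k, theta_k) ]."""
    px = sum(a * normal_pdf(data, m, s) for m, s, a in zip(mus, sigmas, alphas))
    return np.sum(np.log(px))

# Hypothetical data and parameters:
data = np.array([-0.3, 0.1, 4.8, 5.2, 0.4])
print(log_likelihood(data, [0.0, 5.0], [1.0, 2.0], [0.6, 0.4]))
```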
1977: The EM Algorithm
• Dempster, Laird, and Rubin
– general framework for likelihood-based parameter
estimation with missing data
• start with initial guesses of parameters
• E-step: estimate memberships given params
• M-step: estimate params given memberships
• Repeat until convergence
– converges to a (local) maximum of the likelihood
– the E-step and M-step are often computationally simple
– generalizes to maximum a posteriori estimation (with priors)
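For illustration only (a sketch, not the code behind the anemia example shown later), here is a compact EM loop for a one-dimensional, two-component Gaussian mixture on synthetic data:

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=25, seed=0):
    """EM for a 1-d, K=2 Gaussian mixture. Returns (mus, sigmas, alphas)."""
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=2, replace=False)      # initial guesses
    sigmas = np.array([x.std(), x.std()])
    alphas = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: membership probabilities (responsibilities)
        dens = np.stack([a * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
                         for m, s, a in zip(mus, sigmas, alphas)])
        resp = dens / dens.sum(axis=0)
        # M-step: weighted parameter updates
        nk = resp.sum(axis=1)
        mus = (resp * x).sum(axis=1) / nk
        sigmas = np.sqrt((resp * (x - mus[:, None]) ** 2).sum(axis=1) / nk)
        alphas = nk / len(x)
    return mus, sigmas, alphas

# Hypothetical synthetic data drawn from two groups:
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 2, 100)])
print(em_gaussian_mixture(x))
```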
[Figure (repeated): the two Gaussian component densities and the resulting mixture model density from the earlier example.]
Example of a Log-Likelihood Surface

[Figure: log-likelihood surface as a function of the second component's mean and of sigma (log scale).]
Log-Likelihood Cross-Section

[Figure: log-likelihood plotted as a function of log(sigma).]
ANEMIA PATIENTS AND CONTROLS

[Figure: scatter plot of red blood cell hemoglobin concentration vs. red blood cell volume for anemia patients and controls.]
EM ITERATIONS 1, 3, 5, 10, 15, 25

[Figures: the same scatter plot (red blood cell hemoglobin concentration vs. red blood cell volume), with the mixture fit shown at EM iterations 1, 3, 5, 10, 15, and 25 as it converges.]
LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS

[Figure: log-likelihood plotted against EM iteration (0 to 25).]
ANEMIA DATA WITH LABELS

[Figure: the same scatter plot with the true labels shown: a control group and an anemia group.]
Data for an Individual Patient

[Figure: data for an individual patient, with mixture components labeled healthy state and anemic state.]

(Cadez et al., Machine Learning, in press)
Alternatives to EM
• Method of Moments
– EM is more efficient
• Direct optimization
– e.g., gradient descent, Newton methods
– EM is simpler to implement
• Sampling (e.g., MCMC)
• Minimum distance, e.g., integrated mean squared error:
IMSE(\Theta) = E\left[ \int \big( p(x \mid \Theta) - q(x) \big)^2 \, dx \right]
How many components?
• 2 general approaches:
1. Best density estimator
e.g., what predicts best on new data
2. True number of components
- cannot be answered from data alone
[Figure: the data-generating process ("truth") lies outside the K=2 model class; an arrow marks the "closest" model in the class in terms of KL distance.]
[Figure: the same picture for the K=10 model class.]
Prescriptions for Model Selection
• Minimize distance to “truth”
• Maximize predictive “logp score”
– gives an estimate of KL(model, truth)
– pick model that predicts best on validation data
• Bayesian techniques
– p(k|data): impossible to compute exactly
– closed-form approximations:
• BIC, Autoclass, etc
– sampling
• Monte Carlo techniques: quite tricky for mixtures
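A sketch of the "predictive logp score" prescription (using a plain K-component version of the EM sketch above and invented data): fit each candidate K on training data and keep the K that assigns the highest average log-density to held-out validation data.

```python
import numpy as np

def gm_logpdf(x, mus, sigmas, alphas):
    """Log of a 1-d Gaussian mixture density at each point in x."""
    dens = sum(a * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
               for m, s, a in zip(mus, sigmas, alphas))
    return np.log(dens)

def em_fit(x, K, n_iter=50, seed=0):
    """Plain EM for a 1-d, K-component Gaussian mixture (no safeguards)."""
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=K, replace=False)
    sigmas = np.full(K, x.std())
    alphas = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        dens = np.stack([a * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
                         for m, s, a in zip(mus, sigmas, alphas)])
        resp = dens / dens.sum(axis=0)
        nk = resp.sum(axis=1)
        mus = (resp * x).sum(axis=1) / nk
        sigmas = np.sqrt((resp * (x - mus[:, None]) ** 2).sum(axis=1) / nk) + 1e-6
        alphas = nk / len(x)
    return mus, sigmas, alphas

# Hypothetical data split into train/validation; pick K by held-out log p score.
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 150)])
rng.shuffle(data)
train, valid = data[:300], data[300:]
scores = {K: np.mean(gm_logpdf(valid, *em_fit(train, K))) for K in (1, 2, 3, 4)}
print(scores)          # higher held-out log p is better
```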
Mixture model applications
What are Mixtures used for?
• Modeling heterogeneity
– e.g., inferring multiple species in biology
• Handling missing data
– e.g., variables and cases missing in model-building
• Density estimation
– e.g., as flexible priors in Bayesian statistics
• Clustering
– components as clusters
• Model Averaging
– combining density models
Mixtures of non-vector data
• Example
– N individuals, and sets of sequences for each
– e.g., Web session data
• Clustering of the N individuals?
– Vectorize data and apply vector methods?
– Pairwise distances of sets of sequences?
– Parameter estimate for each individual and then
cluster?
Mixtures of {Sequences, Curves, …}

p(D_i) = \sum_{k=1}^{K} p(D_i \mid c_k)\, \alpha_k

Generative model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general
- e.g., sets of sequences, spatial patterns, etc.

[Note: given p(D_i | c_k), we can define an EM algorithm]
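A toy sketch of this generative view for Web session data (the page alphabet and all probabilities are invented): pick a component for each individual, then generate that individual's set of sessions from the component's first-order Markov model with an end state.

```python
import numpy as np

rng = np.random.default_rng(0)
PAGES = ["home", "news", "sports", "END"]          # "END" is the end state

# Hypothetical component parameters: initial and transition probabilities.
COMPONENTS = [
    {"init": [0.8, 0.1, 0.1], "trans": [[0.1, 0.6, 0.1, 0.2],
                                        [0.2, 0.4, 0.1, 0.3],
                                        [0.1, 0.1, 0.5, 0.3]]},
    {"init": [0.2, 0.1, 0.7], "trans": [[0.3, 0.1, 0.3, 0.3],
                                        [0.1, 0.2, 0.4, 0.3],
                                        [0.1, 0.1, 0.6, 0.2]]},
]
ALPHAS = [0.6, 0.4]                                 # component weights

def generate_session(comp):
    """One page sequence, ended when the Markov chain hits END."""
    page = rng.choice(3, p=comp["init"])
    session = [PAGES[page]]
    while True:
        page = rng.choice(4, p=comp["trans"][page])
        if page == 3:                               # reached the end state
            return session
        session.append(PAGES[page])

def generate_individual(n_sessions=3):
    """Select a component c_k for the individual, then generate D_i from it."""
    k = rng.choice(len(ALPHAS), p=ALPHAS)
    return k, [generate_session(COMPONENTS[k]) for _ in range(n_sessions)]

print(generate_individual())
```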
Application 1: Web Log Visualization
(Cadez, Heckerman, Meek, Smyth, KDD 2000)
• MSNBC Web logs
– 2 million individuals per day
– different session lengths per individual
– difficult visualization and clustering problem
• WebCanvas
– uses mixtures of SFSMs (stochastic finite-state machines) to cluster individuals based on their observed sequences
– software tool: EM mixture modeling + visualization
Example: Mixtures of SFSMs
Simple model for traversal on a Web site
(equivalent to first-order Markov with end-state)
Generative model for large sets of Web users
- different behaviors <=> mixture of SFSMs
EM algorithm is quite simple: weighted counts
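To illustrate why EM reduces to weighted counts here (a sketch with invented transition matrices, not the WebCanvas code): the E-step only needs each user's likelihood under each component's Markov chain, from which Bayes' rule gives the membership probabilities.

```python
import numpy as np

def session_loglik(session, init, trans):
    """Log-likelihood of one page sequence under a first-order Markov
    chain with an end state (last column of `trans` is END)."""
    ll = np.log(init[session[0]])
    for a, b in zip(session, session[1:]):
        ll += np.log(trans[a, b])
    ll += np.log(trans[session[-1], -1])      # transition into END
    return ll

def responsibilities(sessions, components, alphas):
    """P(c_k | user's sessions), via Bayes' rule over the mixture."""
    logp = np.array([np.log(a) + sum(session_loglik(s, c["init"], c["trans"])
                                     for s in sessions)
                     for c, a in zip(components, alphas)])
    w = np.exp(logp - logp.max())
    return w / w.sum()

# Hypothetical 2-page site (pages 0 and 1) plus an END state.
components = [
    {"init": np.array([0.9, 0.1]),
     "trans": np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])},
    {"init": np.array([0.2, 0.8]),
     "trans": np.array([[0.3, 0.2, 0.5], [0.1, 0.7, 0.2]])},
]
alphas = [0.5, 0.5]
sessions = [[0, 0, 1], [0, 1, 1, 1]]
print(responsibilities(sessions, components, alphas))
```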
WebCanvas: Cadez, Heckerman, et al., KDD 2000
Predictive Entropy Out-of-Sample

[Figure: out-of-sample negative log-likelihood (bits per token) vs. the number of mixture components K (20 to 200), for mixtures of multinomials and mixtures of SFSMs.]
Timing Results

[Figure: training time (seconds) vs. the number of mixture components K, for data sets of N = 70,000, 110,000, and 150,000 individuals.]
Transaction Data

[Figure: a matrix of transactions (rows) by 50 product categories (columns).]
Profiling for Transaction Data
• Profiling
– given transactions for a set of individuals
– infer a predictive model for future transactions of
individual i
– typical applications:
• automated recommender systems: e.g.,
Amazon.com
• personalization: e.g., wireless information
services
– existing techniques
• collaborative filtering, association rules
• not well suited to prediction, e.g., to handling seasonality
Application 2: Transaction Data
(Cadez, Smyth, Mannila, KDD 2001)
• Retail Data Set
– 200,000 individuals
– all market baskets over 2 years
– 50 departments, 50k items
• Problem
– predict what an individual will purchase in the future
• want to generalize across products
• want to allow heterogeneity
Examples of Mixture Components

[Figure: six panels (Components 1-6), each showing a multinomial probability distribution over the 50 departments; the components "encode" typical combinations of clothes.]
Mixture-based Profiles

p(B \mid i) = \sum_{k=1}^{K} p(B \mid C_k)\, p(C_k \mid i)

where p(B | i) is the predictive profile for individual i, p(B | C_k) is the basket model for component k (a "basis function"), and p(C_k | i) is the probability that individual i engages in "behavior" k, given that they enter the store.
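A small sketch of this equation (the component basket models and individual weights below are made up): the individual's predictive distribution over departments is a weighted sum of the component multinomials.

```python
import numpy as np

# Hypothetical component basket models: K=3 multinomials over 5 departments.
p_basket_given_component = np.array([
    [0.60, 0.20, 0.10, 0.05, 0.05],   # "behavior" 1
    [0.05, 0.10, 0.60, 0.20, 0.05],   # "behavior" 2
    [0.05, 0.05, 0.10, 0.20, 0.60],   # "behavior" 3
])

# Hypothetical weights p(C_k | i) for one individual i.
p_component_given_i = np.array([0.7, 0.2, 0.1])

# Predictive profile: p(B | i) = sum_k p(B | C_k) p(C_k | i)
profile_i = p_component_given_i @ p_basket_given_component
print(profile_i, profile_i.sum())
```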
Hierarchical Bayes Model

[Figure: an empirical prior on the mixture weights sits above all individuals; each individual (1, ..., i, ..., N) has its own weights and its own observed baskets B1, B2, ...]

Individuals with little data get "shrunk" to the prior.
Individuals with a lot of data are more data-driven.
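One simple way to get this shrinkage behavior (a sketch of the idea, not the exact model in the paper): give each individual's mixture weights a Dirichlet prior centered on population-level weights; the posterior-mean estimate then interpolates between the prior and the individual's own counts.

```python
import numpy as np

def shrunk_weights(counts, prior_mean, strength=10.0):
    """Posterior-mean mixture weights under a Dirichlet(strength * prior_mean)
    prior. With few observations the estimate stays near the prior; with many
    observations it is dominated by the individual's own data."""
    counts = np.asarray(counts, dtype=float)
    return (counts + strength * prior_mean) / (counts.sum() + strength)

prior = np.array([0.5, 0.3, 0.2])         # hypothetical population-level weights
print(shrunk_weights([1, 0, 0], prior))    # little data: close to the prior
print(shrunk_weights([40, 2, 0], prior))   # lots of data: close to the empirical ratio
```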
Data and Profile Example

[Figure: for one individual, aligned panels over the 50 departments: training purchases (number of items), the smoothed histogram profile (MAP), the profile from the individual's mixture weights, and test purchases. Annotations mark "No training data for 14" and "No purchases above Dept 25".]
Predictive Entropy Out of Sample

[Figure: out-of-sample negative log-likelihood per item vs. log(number of mixture components K), for empirical Bayes multinomials, standard mixtures, and empirical Bayes mixtures.]
Scatter Plot of Multinomials vs. Mixtures

[Figure: per-individual out-of-sample logP under empirical Bayes mixtures (vertical axis) vs. empirical Bayes multinomials (horizontal axis).]
Transaction mixtures
• Mixture-based profiles
– interpretable and flexible
– more accurate than non-mixture approaches
– training time linear in the number of items
• Applications
– early detection of high-value customers
– visualization and exploration
– forecasting customer behavior
Extensions of Mixtures
Extensions: Multiple Causes

• Single "cause" variable
[Figure: graphical model with one hidden cause C pointing to the observed X.]

• Multiple causes ("factors")
– examples
• Hofmann (1999, 2000) for text
• ICA for signals
[Figure: graphical model with hidden causes C1, C2, C3 all pointing to the observed X.]
Extensions: High Dimensions

• Density estimation is difficult in high dimensions

[Figure: a point cloud with a single global PCA subspace vs. mixtures of PCA (Tipping and Bishop, 1999), which fit several local subspaces.]
Extensions: Predictive Mixtures

• Standard mixture model:

p(x) = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)

• Conditional mixtures (input-dependent):

p(x \mid y) = \sum_{k=1}^{K} p(x \mid y, c_k)\, p(c_k \mid y)

– e.g., mixtures of experts, Jordan and Jacobs (1994)
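A toy sketch of an input-dependent mixture in one dimension, loosely in the spirit of mixtures of experts (all parameters are invented): the mixing weights p(c_k | y) come from a softmax gate on the input, and each component is a Gaussian around its own linear function of y.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def conditional_mixture_pdf(x, y, gate_w, gate_b, slopes, intercepts, sigmas):
    """p(x | y) = sum_k p(c_k | y) N(x; slope_k * y + intercept_k, sigma_k^2)."""
    gates = softmax(gate_w * y + gate_b)                 # p(c_k | y)
    means = slopes * y + intercepts
    dens = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return float(np.dot(gates, dens))

# Hypothetical 2-expert model.
gate_w, gate_b = np.array([2.0, -2.0]), np.array([0.0, 0.0])
slopes, intercepts, sigmas = np.array([1.0, -1.0]), np.array([0.0, 3.0]), np.array([0.5, 0.5])
print(conditional_mixture_pdf(x=1.2, y=0.8, gate_w=gate_w, gate_b=gate_b,
                              slopes=slopes, intercepts=intercepts, sigmas=sigmas))
```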
Extensions: Learning Algorithms
• Fast algorithms
• kd-trees (Moore, 1998)
• caching (Bradley et al.)
• Random projections
• Dasgupta (2000)
• Mean-squared error criteria
• Scott (2000)
• Bayesian techniques
• reversible-jump MCMC (Green et al.)
Classic References
Statistical Analysis of Finite Mixture Distributions
Titterington, Smith, and Makov
Wiley, 1985
Finite Mixture Models
McLachlan and Peel
Wiley 2000
Conclusions
• Mixtures
– flexible “tool” in the machine learner’s toolbox
• Beyond mixtures of Gaussians…
– mixtures of sequences, curves, trees, etc.
• Applications
– numerous and broad
Concavity of Likelihood
(Cadez and Smyth, NIPS 2000)

[Figure: in-sample log-likelihood (×10^5) for a mixture of multinomials as a function of the number of components K (0 to 100); the curve is concave in K.]
Application 3: Model Averaging

Bayesian model averaging for p(x):
- since we don't know which model (if any) is the true one, average out this uncertainty

p(x \mid D) = \sum_{k=1}^{K} p(x \mid M_k, D)\, p(M_k \mid D)

where p(x | D) is the prediction of x given data D, p(x | M_k, D) is the prediction of model k, and p(M_k | D) is the weight of model k.
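A minimal sketch of this averaging formula (the candidate model densities and posterior weights below are placeholders, assumed already computed): each model contributes its prediction p(x | M_k, D), weighted by p(M_k | D).

```python
import numpy as np

def model_average(x, model_pdfs, posterior_weights):
    """p(x | D) = sum_k p(x | M_k, D) p(M_k | D)."""
    return sum(w * pdf(x) for pdf, w in zip(model_pdfs, posterior_weights))

# Hypothetical candidate models: two Gaussian densities for the same quantity.
def gauss(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

models = [gauss(0.0, 1.0), gauss(0.5, 2.0)]
weights = [0.7, 0.3]                     # p(M_k | D), assumed already computed
print(model_average(0.2, models, weights))
```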
Stacked Mixtures
(Smyth and Wolpert, Machine Learning, 1999)
Simple idea:
use cross-validation to estimate the weights,
rather than using a Bayesian scheme
Two-phase learning
1. Learn each model Mk on Dtrain
2. Learn mixture model weights on Dvalidation
- components are fixed
- EM just learns the weights
Outperforms any single model selection technique
Even outperforms “cheating”
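A sketch of the two-phase idea (the fitted component densities here are stand-ins for models learned on D_train): with the components frozen, EM on the validation set re-estimates only the mixture weights.

```python
import numpy as np

def stack_weights(valid, component_pdfs, n_iter=100):
    """EM over the stacking weights only: the components are fixed densities."""
    dens = np.stack([pdf(valid) for pdf in component_pdfs])   # K x N
    w = np.full(len(component_pdfs), 1.0 / len(component_pdfs))
    for _ in range(n_iter):
        resp = w[:, None] * dens
        resp /= resp.sum(axis=0)
        w = resp.mean(axis=1)          # M-step touches the weights only
    return w

# Hypothetical component models fit on D_train (two Gaussian densities here).
def gauss(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
valid = rng.normal(0.0, 1.2, size=500)                 # hypothetical D_validation
print(stack_weights(valid, [gauss(0.0, 1.0), gauss(0.0, 2.0)]))
```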
Query Approximation
(Pavlov and Smyth, KDD 2001)

[Figure: a query generator issues queries against a large database. Probability models (e.g., mixtures, belief networks) are constructed offline as approximate models of the database and then provide fast query answers online.]
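A toy sketch of the online step (the model below is an invented mixture of independent Bernoullis, not the actual models from the paper): a conjunctive query's selectivity is estimated from the probability model instead of scanning the database.

```python
import numpy as np

# Hypothetical offline model: K=2 components over 4 binary attributes.
ALPHAS = np.array([0.6, 0.4])
P = np.array([[0.9, 0.7, 0.1, 0.2],     # P(attribute j = 1 | component k)
              [0.1, 0.2, 0.8, 0.6]])

def query_prob(query):
    """Estimate P(query) = sum_k alpha_k prod_j P(A_j = v_j | c_k),
    where `query` maps attribute index -> required value (0 or 1)."""
    prob = 0.0
    for alpha, p_k in zip(ALPHAS, P):
        term = alpha
        for j, v in query.items():
            term *= p_k[j] if v == 1 else 1.0 - p_k[j]
        prob += term
    return prob

# "How many records have A0 = 1 AND A2 = 1?" -> estimated selectivity * N records.
print(query_prob({0: 1, 2: 1}))
```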
Stacking for Query Model Combining

[Figure: mean percent error vs. query size (4, 8, 12, 16) for conjunctive queries on Microsoft Web data (32k records, 294 attributes, available online at the UCI KDD Archive), comparing a 16-component mixture, a Bayesian network, the best holdout model, and the stacked model.]
[Figure: a sparse binary data matrix, with rows as observation vectors and columns as attributes.]
Treat as Missing

[Figure: the same data matrix augmented with a component-label column (C1 or C2 for each observation vector); the labels are treated as missing data.]
Treat as Missing

[Figure: the missing labels are replaced by membership probabilities P(C1 | x_i) and P(C2 | x_i) for each observation vector.]

E-step: estimate component membership probabilities given current parameter estimates
Treat as Missing

[Figure: the membership probabilities P(C1 | x_i) and P(C2 | x_i) are used as fractional weights on the data.]

M-step: use the "fractionally" weighted data to get new estimates of the parameters
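To tie the last two slides together, here is a compact sketch (on invented binary data) of exactly this E-step / M-step cycle for a two-component mixture of independent Bernoullis: memberships are estimated and then used as fractional counts.

```python
import numpy as np

def em_bernoulli_mixture(X, K=2, n_iter=50, seed=0):
    """EM for a mixture of independent Bernoullis on a binary matrix X (N x d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = rng.uniform(0.3, 0.7, size=(K, d))        # P(x_j = 1 | c_k)
    alphas = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: membership probabilities for each row (labels treated as missing).
        logp = X @ np.log(P).T + (1 - X) @ np.log(1 - P).T + np.log(alphas)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)    # N x K
        # M-step: "fractionally" weighted counts give the new parameter estimates.
        nk = resp.sum(axis=0)
        P = (resp.T @ X + 1e-3) / (nk[:, None] + 2e-3)
        alphas = nk / N
    return P, alphas

# Hypothetical binary data with two obvious blocks of attributes.
rng = np.random.default_rng(4)
X = np.vstack([rng.random((60, 6)) < [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
               rng.random((40, 6)) < [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]]).astype(float)
print(em_bernoulli_mixture(X))
```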