
Large Data Set Analysis
using Mixture Models
Seminar at IBM Watson Research Center
June 27th 2001
Padhraic Smyth
Information and Computer Science
University of California, Irvine
www.datalab.uci.edu
© Padhraic Smyth, UC Irvine
Outline
• Part 1: Basic concepts in mixture modeling
– representational capabilities of mixtures
– learning mixtures from data
– extensions of mixtures to non-vector data
• Part 2: New applications of mixtures
– 1. Visualization and clustering of Web log data
– 2. predictive profiles from transaction data
– 3. query approximation problems
© Padhraic Smyth, UC Irvine
Acknowledgements
• Students:
– Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
• Collaborators
– David Heckerman, Chris Meek, Heikki Mannila, Christine
McLaren, Geoff McLachlan, David Wolpert
• Funding
– NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft
Research, IBM Research, HNC Software.
© Padhraic Smyth, UC Irvine
Finite Mixture Models
p(x) = \sum_{k=1}^{K} p(x, c_k)
© Padhraic Smyth, UC Irvine
Finite Mixture Models
p(x) = \sum_{k=1}^{K} p(x, c_k)
     = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)
© Padhraic Smyth, UC Irvine
Finite Mixture Models
p(x) = \sum_{k=1}^{K} p(x, c_k)
     = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)
     = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k
© Padhraic Smyth, UC Irvine
Finite Mixture Models
p(x) = \sum_{k=1}^{K} p(x, c_k)
     = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)
     = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

where p(x | c_k, θ_k) is component model k with parameters θ_k, and α_k is the weight of component k.
© Padhraic Smyth, UC Irvine
Example: Mixture of Gaussians
• Gaussian mixtures:
p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k
© Padhraic Smyth, UC Irvine
Example: Mixture of Gaussians
• Gaussian mixtures:
p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

Each mixture component is a
multidimensional Gaussian with its own
mean μ_k and covariance "shape" Σ_k
© Padhraic Smyth, UC Irvine
Example: Mixture of Gaussians
• Gaussian mixtures:
p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

Each mixture component is a
multidimensional Gaussian with its own
mean μ_k and covariance "shape" Σ_k

e.g., K = 2, 1-dim:
{θ, α} = {μ_1, σ_1, μ_2, σ_2, α_1}
© Padhraic Smyth, UC Irvine
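To make the K = 2, one-dimensional example concrete, here is a minimal sketch (plain NumPy/SciPy, not from the slides; the parameter values are invented) of evaluating such a mixture density:

```python
# Sketch: a two-component, 1-D Gaussian mixture fully specified by
# {mu1, sigma1, mu2, sigma2, alpha1}, as in the example above.
import numpy as np
from scipy.stats import norm

mu1, sigma1 = 0.0, 1.0      # hypothetical component-1 parameters
mu2, sigma2 = 5.0, 2.0      # hypothetical component-2 parameters
alpha1 = 0.6                # weight of component 1; alpha2 = 1 - alpha1

def mixture_density(x):
    """p(x) = alpha1 * N(x; mu1, sigma1^2) + (1 - alpha1) * N(x; mu2, sigma2^2)."""
    return alpha1 * norm.pdf(x, mu1, sigma1) + (1 - alpha1) * norm.pdf(x, mu2, sigma2)

xs = np.linspace(-5, 10, 200)   # roughly the x-range of the plots that follow
px = mixture_density(xs)
```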
[Figure: top panel shows the two component densities (Component 1, Component 2); bottom panel shows the resulting Mixture Model density p(x) over x from about -5 to 10]
© Padhraic Smyth, UC Irvine
[Figure: the same two-panel plot of the component densities (top) and the mixture model density (bottom)]
© Padhraic Smyth, UC Irvine
[Figure: top panel shows the Component Models (p(x) axis up to 2); bottom panel shows the Mixture Model density over x from about -5 to 10]
© Padhraic Smyth, UC Irvine
Example: Mixture of Naïve Bayes
p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k
© Padhraic Smyth, UC Irvine
Example: Mixture of Naïve Bayes
p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

p(x \mid c_k, \theta_k) = \prod_{j=1}^{d} p(x_j \mid c_k)
© Padhraic Smyth, UC Irvine
Example: Mixture of Naïve Bayes
p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k

p(x \mid c_k, \theta_k) = \prod_{j=1}^{d} p(x_j \mid c_k)
Conditional Independence
model for each component
(often quite useful to first-order)
© Padhraic Smyth, UC Irvine
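A minimal sketch of this component structure, assuming binary term vectors with Bernoulli naive Bayes components (consistent with the document-term example that follows, but all numbers are invented):

```python
# Sketch: each component models a binary term vector as a product of
# per-term probabilities; the mixture sums these over components.
import numpy as np

def component_loglik(x, p_k):
    """log p(x | c_k): x is a binary term vector, p_k[j] = p(x_j = 1 | c_k)."""
    return np.sum(x * np.log(p_k) + (1 - x) * np.log(1 - p_k))

def mixture_density(x, P, alpha):
    """p(x) = sum_k alpha_k * prod_j p(x_j | c_k); P has one row per component."""
    return sum(a * np.exp(component_loglik(x, p_k)) for p_k, a in zip(P, alpha))

# toy example: K = 2 components over d = 5 terms
P = np.array([[0.8, 0.7, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.2, 0.9, 0.8]])
alpha = np.array([0.5, 0.5])
x = np.array([1, 1, 0, 0, 0])
print(mixture_density(x, P, alpha))
```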
Mixtures of Naïve Bayes
Terms × Documents
[Figure: sparse binary document-term matrix, with 1s marking which terms occur in which documents]
© Padhraic Smyth, UC Irvine
Mixtures of Naïve Bayes
Terms × Documents
[Figure: the same sparse binary document-term matrix, with two groups of documents and their co-occurring terms labeled Component 1 and Component 2]
© Padhraic Smyth, UC Irvine
Interpretation of Mixtures
1. C has a direct (physical) interpretation
e.g., C = {age of fish}, C = {male, female}
© Padhraic Smyth, UC Irvine
Interpretation of Mixtures
1. C has a direct (physical) interpretation
e.g., C = {age of fish}, C = {male, female}
2. C might have an interpretation
e.g., clusters of Web surfers
© Padhraic Smyth, UC Irvine
Interpretation of Mixtures
1. C has a direct (physical) interpretation
e.g., C = {age of fish}, C = {male, female}
2. C might have an interpretation
e.g., clusters of Web surfers
3. C is just a convenient latent variable
e.g., flexible density estimation
© Padhraic Smyth, UC Irvine
Learning Mixtures from Data
Consider fixed K
e.g., Unknown parameters Θ = {μ_1, σ_1, μ_2, σ_2, α_1}

Given data D = {x_1, …, x_N}, we want to find the parameters Θ that
"best fit" the data
© Padhraic Smyth, UC Irvine
Maximum Likelihood Principle
• Fisher, 1922
– assume a probabilistic model
– likelihood = p(data | parameters, model)
– find the parameters that make the data most likely
© Padhraic Smyth, UC Irvine
Maximum Likelihood Principle
• Fisher, 1922
– assume a probabilistic model
– likelihood = p(data | parameters, model)
– find the parameters that make the data most likely
L({ , })  p( D | { , })
N
  p(xi |{ , })
i 1
N

i 1
K

 p(xi | ck , k )  k)
 k 1

© Padhraic Smyth, UC Irvine
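As a small illustration, the log of this likelihood can be computed directly; the sketch below assumes the one-dimensional Gaussian mixture from earlier and is not taken from the talk:

```python
# Sketch: log L = sum_i log sum_k alpha_k * p(x_i | c_k, theta_k)
import numpy as np
from scipy.stats import norm

def log_likelihood(D, mus, sigmas, alphas):
    """D: 1-D array of data points; mus, sigmas, alphas: length-K parameter arrays."""
    # weighted per-point densities of each component, shape (N, K)
    comp = np.stack([a * norm.pdf(D, m, s)
                     for m, s, a in zip(mus, sigmas, alphas)], axis=1)
    return np.sum(np.log(comp.sum(axis=1)))

# maximum likelihood seeks the (mus, sigmas, alphas) that maximize this value
```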
[Figure: the two component densities (top) and the mixture model density (bottom), as before]
© Padhraic Smyth, UC Irvine
Example of a Log-Likelihood Surface
[Figure: the log-likelihood surface plotted as a function of mean 2 (roughly 50-400) and of sigma 2 on a log scale (roughly 10-100)]
© Padhraic Smyth, UC Irvine
1977: The EM Algorithm
• Dempster, Laird, and Rubin
– general framework for likelihood-based parameter
estimation with missing data
• start with initial guesses of parameters
• E-step: estimate memberships given parameters
• M-step: estimate parameters given memberships
• Repeat until convergence
– converges to a (local) maximum of likelihood
– E-step and M-step are often computationally simple
– generalizes to maximum a posteriori (with priors)
© Padhraic Smyth, UC Irvine
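The E-step/M-step loop above, written out for a one-dimensional Gaussian mixture (a standard textbook sketch, not the authors' implementation):

```python
# Sketch: EM for a 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, K, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, K)                      # initial guesses of parameters
    sigmas = np.full(K, x.std())
    alphas = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: membership probabilities given current parameters
        r = np.stack([a * norm.pdf(x, m, s)
                      for m, s, a in zip(mus, sigmas, alphas)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: parameters given memberships (weighted means and variances)
        Nk = r.sum(axis=0)
        mus = (r * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((r * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
        alphas = Nk / len(x)
    return mus, sigmas, alphas
```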
ANEMIA PATIENTS AND CONTROLS
[Figure: scatter plot of Red Blood Cell Hemoglobin Concentration (y-axis, about 3.7-4.4) versus Red Blood Cell Volume (x-axis, about 3.3-4.0)]
© Padhraic Smyth, UC Irvine
EM ITERATION 1
[Figure: the same scatter plot of Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume, with the fitted mixture components overlaid after EM iteration 1]
© Padhraic Smyth, UC Irvine
EM ITERATION 3
[Figure: the same scatter plot with the fitted mixture components after EM iteration 3]
© Padhraic Smyth, UC Irvine
EM ITERATION 5
[Figure: the same scatter plot with the fitted mixture components after EM iteration 5]
© Padhraic Smyth, UC Irvine
EM ITERATION 10
[Figure: the same scatter plot with the fitted mixture components after EM iteration 10]
© Padhraic Smyth, UC Irvine
EM ITERATION 15
[Figure: the same scatter plot with the fitted mixture components after EM iteration 15]
© Padhraic Smyth, UC Irvine
EM ITERATION 25
[Figure: the same scatter plot with the fitted mixture components after EM iteration 25]
© Padhraic Smyth, UC Irvine
LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS
[Figure: log-likelihood (y-axis, about 400-490) versus EM iteration (x-axis, 0-25)]
© Padhraic Smyth, UC Irvine
ANEMIA DATA WITH LABELS
[Figure: the same scatter plot of Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume, with the Control Group and the Anemia Group labeled]
© Padhraic Smyth, UC Irvine
Alternatives to EM
• Direct optimization
– e.g., gradient descent, Newton methods
– EM is simpler to implement
• Sampling (e.g., MCMC)
– computationally intensive
• Minimum distance, e.g.,

\mathrm{IMSE}(\Theta) = E\left[ \int \left( p(x \mid \Theta) - q(x) \right)^2 dx \right]
© Padhraic Smyth, UC Irvine
How many components?
• 2 general approaches
– 1. Best density estimator
• e.g., what predicts best on new data
– 2. “True” number of components
• typically cannot be done with data alone
© Padhraic Smyth, UC Irvine
K=1 Model Class
© Padhraic Smyth, UC Irvine
Data-generating
process (“truth”)
K=1 Model Class
© Padhraic Smyth, UC Irvine
Data-generating
process (“truth”)
“Closest” model in terms
of logp scores on new data
K=1 Model Class
© Padhraic Smyth, UC Irvine
Data-generating
process (“truth”)
“Closest” model in terms
of logp scores on new data
K=1 Model Class
Best model is relatively far from truth
=> High Bias
© Padhraic Smyth, UC Irvine
Data-generating
process (“truth”)
K=1 Model Class
K=10 Model Class
© Padhraic Smyth, UC Irvine
Data-generating
process (“truth”)
K=1 Model Class
K=10 Model Class
Best model is closer to Truth
=> Low Bias
© Padhraic Smyth, UC Irvine
However, this could be the model that best fits the observed data
=> High Variance
Data-generating
process (“truth”)
K=1 Model Class
K=10 Model Class
© Padhraic Smyth, UC Irvine
Prescriptions for Model Selection
• Minimize distance to “truth”
• Method 1: Predictive logp scores
– calculate log p(test data| model k)
– select model that predicts best
• Method 2: Bayesian techniques
– p(k|data): impossible to compute exactly
– closed-form approximations:
• BIC, Autoclass, MDL, etc
– sampling
• Monte Carlo techniques: quite tricky for mixtures
© Padhraic Smyth, UC Irvine
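A minimal sketch of Method 1 (predictive log p scores), using scikit-learn's GaussianMixture purely as a stand-in mixture learner; the candidate values of K and the train/test split are assumptions:

```python
# Sketch: fit each candidate K on training data, score the held-out data,
# and keep the K that predicts best.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k(train, test, candidate_ks=(1, 2, 3, 5, 10)):
    """train, test: arrays of shape (n_samples, n_features)."""
    scores = {}
    for k in candidate_ks:
        model = GaussianMixture(n_components=k, random_state=0).fit(train)
        scores[k] = model.score(test) * len(test)   # total held-out log-likelihood
    return max(scores, key=scores.get), scores
```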
Mixtures of non-vector data
• Example
– N individuals, and sets of sequences for each
– e.g., Web session data
• Clustering of the N individuals?
– Vectorize data and apply vector methods?
– Estimate parameters for each sequence and cluster in
parameter space?
– Pairwise distances of sequences?
© Padhraic Smyth, UC Irvine
Mixtures of {Sequences, Curves, …}
p(D_i) = \sum_{k=1}^{K} p(D_i \mid c_k)\, \alpha_k
© Padhraic Smyth, UC Irvine
Mixtures of {Sequences, Curves, …}
p(D_i) = \sum_{k=1}^{K} p(D_i \mid c_k)\, \alpha_k
Generative Model
- pick individual i
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general
- e.g., sets of sequences, spatial patterns, etc
[Note: given p(Di | ck), we can define an EM algorithm]
© Padhraic Smyth, UC Irvine
Outline
• Part 1: Basic concepts in mixture modeling
– representational capabilities of mixtures
– learning mixtures from data
– extensions of mixtures to non-vector data
• Part 2: New applications of mixtures
– 1. Visualization and clustering of Web log data (mixtures of Markov models)
– 2. predictive profiles from transaction data
– 3. query approximation problems
© Padhraic Smyth, UC Irvine
Application 1: Web Log Visualization
and Clustering
(Cadez, Heckerman, Meek, Smyth, White, KDD 2000)
© Padhraic Smyth, UC Irvine
Web Log Visualization
• MSNBC Web logs
– 2 million individuals per day
– different session lengths per individual
– difficult visualization and clustering problem
• WebCanvas
– uses mixtures of finite state machines to cluster
individuals
– software tool: EM mixture modeling + visualization
© Padhraic Smyth, UC Irvine
© Padhraic Smyth, UC Irvine
Web Log Files
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Figure: the log entries above are grouped by user (User 1, User 2, User 3, User 4, User 5, ...) and each user's requests are converted into a sequence of page-category indices, e.g., 2 3 7 1 5 ...]
© Padhraic Smyth, UC Irvine
Model: Mixtures of SFSMs
SFSM: stochastic finite state machine
Simple model for traversal on a Web site
(equivalent to first-order Markov with end-state)
Generative model for large sets of Web users
- different behaviors <=> mixture of SFSMs
EM algorithm is quite simple: weighted counts
© Padhraic Smyth, UC Irvine
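To illustrate why the EM updates reduce to weighted counts, here is a minimal sketch for a mixture of first-order Markov chains; it omits the end-state of the SFSM formulation, and all names and defaults are assumptions:

```python
# Sketch: EM for a mixture of first-order Markov chains over integer-coded
# page categories; the M-step is just weighted transition counting.
import numpy as np

def em_markov_mixture(sequences, n_states, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    T = rng.dirichlet(np.ones(n_states), size=(K, n_states))  # transition matrices
    init = rng.dirichlet(np.ones(n_states), size=K)           # initial-state probs
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probability that each sequence came from component k
        logp = np.array([[np.log(alpha[k]) + np.log(init[k, s[0]]) +
                          sum(np.log(T[k, a, b]) for a, b in zip(s[:-1], s[1:]))
                          for k in range(K)] for s in sequences])
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate each component from transition counts weighted by r
        for k in range(K):
            counts = np.zeros((n_states, n_states)) + 1e-6
            starts = np.zeros(n_states) + 1e-6
            for w, s in zip(r[:, k], sequences):
                starts[s[0]] += w
                for a, b in zip(s[:-1], s[1:]):
                    counts[a, b] += w
            T[k] = counts / counts.sum(axis=1, keepdims=True)
            init[k] = starts / starts.sum()
        alpha = r.mean(axis=0)
    return alpha, init, T
```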
Predictive Entropy Out-of-Sample
[Figure: out-of-sample negative log-likelihood [bits/token] (y-axis, about 2-4) versus number of mixture components K (x-axis, 20-200), comparing Mixtures of Multinomials with Mixtures of SFSMs]
© Padhraic Smyth, UC Irvine
Timing Results
[Figure: fitting time [sec] versus number of mixture components K (20-200), for data sets of size N = 70,000, N = 110,000, and N = 150,000]
© Padhraic Smyth, UC Irvine
WebCanvas: Cadez et al, KDD 2000
© Padhraic Smyth, UC Irvine
Application 2: Building Predictive
Profiles from Transaction Data
(Cadez, Smyth, Mannila, KDD 2001)
© Padhraic Smyth, UC Irvine
Example of Transaction Data
[Figure: sparse binary matrix of transactions (rows, up to about 350) by product categories (columns, up to about 50)]
© Padhraic Smyth, UC Irvine
Profiling Approaches
• Predictive profile
– predictive model of an individual’s behavior
• Histograms
– Simple but inefficient:
• No generalization: p(product you did not buy) = 0
• Collaborative filtering
– Your profile = function(k “most similar” other individuals)
• Ad hoc: no statistical basis (e.g., cannot incorporate
covariates, seasonality, etc)
• Proposed approach: generative probabilistic models
– mixtures of baskets (captures heterogeneity)
– hierarchical Bayes (helps with sparseness)
© Padhraic Smyth, UC Irvine
The Nature of Transaction Data
• Large and sparse
– N = number of individuals, can be on the order of millions
– P = number of items, can be in the thousands
– Very sparse:
• Each transaction may only have a few items
• Most individuals only have a few transactions
• Implications for modeling:
– Effectively modeling the joint distribution of a set of very
high-dimensional binary/categorical random variables
– Relatively little information on any single individual
– Typically want to start inferring a profile even after an
individual purchases a single item
– Volume of data and the nature of applications (e.g., e-commerce)
dictate that inference methods must be computationally
efficient
© Padhraic Smyth, UC Irvine
Mixture-based Profiles
p(B \mid i) = \sum_{k=1}^{K} p(B \mid C_k)\, p(C_k \mid i)

where p(B | i) is the predictive profile for individual i, p(B | C_k) is the basket model for component k (a multinomial that emphasizes certain products), and p(C_k | i) is the probability that individual i engages in "behavior" k, given that they enter the store.
© Padhraic Smyth, UC Irvine
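A minimal numerical sketch of this profile equation, with invented component multinomials and individual weights:

```python
# Sketch: the predictive profile is a weighted mixture of the K basket multinomials.
import numpy as np

components = np.array([[0.5, 0.3, 0.1, 0.1],    # p(item | C_1): multinomial over items
                       [0.1, 0.1, 0.4, 0.4]])   # p(item | C_2)
weights_i = np.array([0.8, 0.2])                # p(C_k | i) for individual i

profile_i = weights_i @ components              # p(item | i): individual i's profile
```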
Hierarchical Model
[Diagram: a population-level prior on the mixture weights, with each individual (Individual 1, ..., Individual i, ..., Individual N) having its own weights and its own set of baskets B1, B2, B3, ...]
Individuals with little data get “shrunk” to the prior
Individuals with a lot of data are more data-driven
© Padhraic Smyth, UC Irvine
Inference Algorithms
• MAP/Empirical Bayes vs. Full Bayes
– Full Bayesian analysis is computationally impractical
• 59k transactions even for this "small" study data set
– We use a maximum a posteriori (MAP) inference approach
• prior is matched to data, i.e., "empirical Bayes"
© Padhraic Smyth, UC Irvine
Inference Algorithm
• 3-phase estimation algorithm
– 1. Use EM (MAP version) to learn a K-component mixture
model
• Ignore individual grouping, just find K components for
all transactions
– 2. Empirical Bayes prior:
• Use the resulting “global mixture weights” to determine
the mean of the population prior (Dirichlet)
– 3. Fitting of individual weights (α_k's for each individual)
• Use EM (MAP) again on each individual, with population
prior
• Mixture components are fixed, just use EM to find the
weights (very fast)
© Padhraic Smyth, UC Irvine
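A minimal sketch of phase 3 under assumed details (fixed multinomial components, a Dirichlet prior from phase 2, and a MAP EM update for one individual's weights):

```python
# Sketch: with the K component multinomials fixed, re-estimate one individual's
# mixture weights by a few MAP EM steps under the population Dirichlet prior.
import numpy as np

def fit_individual_weights(items, components, dirichlet_prior, n_iter=10):
    """items: item indices purchased by one individual;
    components: (K, n_items) multinomials; dirichlet_prior: (K,) prior counts."""
    w = dirichlet_prior / dirichlet_prior.sum()        # start at the prior mean
    for _ in range(n_iter):
        # E-step: responsibility of each component for each purchased item
        r = w[:, None] * components[:, items]          # shape (K, len(items))
        r /= r.sum(axis=0, keepdims=True)
        # M-step (MAP): weighted counts plus prior pseudo-counts
        w = r.sum(axis=1) + dirichlet_prior - 1.0
        w = np.maximum(w, 1e-10)
        w /= w.sum()
    return w
```

Because the components stay fixed, this per-individual step is very fast, which matches the efficiency claims later in the talk.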
Experiments on Real Data
• Retail transaction data set
– 2 years worth of transactions from chain of 9 stores
– 1 million transactions in total
– 200,000 individuals, “product tree” of 50k items
• Experiments described here:
– Data used for model training (months 1 to 6)
• 4300 individuals with at least 10 transactions (10 store
visits)
• 58,886 transactions, 164,000 items purchased
– Out-of-sample data used for model test (months 7 and 8)
• 4040 individuals, 25,292 transactions, and 69,103
items
– Predictive accuracy on out-of-sample data
• Logp: "log p score" on new data; higher is better
• -Logp/n': predictive entropy; lower is better
© Padhraic Smyth, UC Irvine
Transaction Data
[Figure: the same sparse binary matrix of transactions by product categories]
© Padhraic Smyth, UC Irvine
Examples of Mixture Components

[Figure: six panels (COMPONENT 1 through COMPONENT 6), each showing a component's probability distribution over the ~50 departments; the components "encode" typical combinations of clothes]
© Padhraic Smyth, UC Irvine
Data and Profile Example
[Figure: TRAINING PURCHASES - histogram of the number of items an individual purchased in each department (1-50)]
© Padhraic Smyth, UC Irvine
Data and Profile Example
[Figure: TRAINING PURCHASES histogram (top) and the corresponding SMOOTHED HISTOGRAM PROFILE (MAP) over departments (bottom)]
© Padhraic Smyth, UC Irvine
Data and Profile Example
[Figure: TRAINING PURCHASES histogram (top), SMOOTHED HISTOGRAM PROFILE (MAP) (middle), and PROFILE FROM INDIVIDUAL WEIGHTS (bottom), all over departments 1-50]
© Padhraic Smyth, UC Irvine
Data and Profile Example
[Figure: TRAINING PURCHASES histogram, SMOOTHED HISTOGRAM PROFILE (MAP), PROFILE FROM INDIVIDUAL WEIGHTS, and the individual's TEST PURCHASES, all over departments 1-50]
© Padhraic Smyth, UC Irvine
Data and Profile Example
[Figure: the same four panels (TRAINING PURCHASES, SMOOTHED HISTOGRAM PROFILE (MAP), PROFILE FROM INDIVIDUAL WEIGHTS, TEST PURCHASES), annotated "No Training Data for 14" and "No Purchases above Dept 25"]
© Padhraic Smyth, UC Irvine
Predictive Entropy Out of Sample
[Figure: out-of-sample negative log-likelihood per item (y-axis, about 2.7-3.5) versus log10 of the number of mixture components K (x-axis, 0-2.5), comparing Empirical Bayes Multinomials, Standard Mixtures, and Empirical Bayes Mixtures]
© Padhraic Smyth, UC Irvine
Scatter plot of multinomials vs. mixtures
[Figure: scatter plot of logP under Empirical Bayes mixtures (y-axis) versus logP under Empirical Bayes multinomials (x-axis), both roughly -400 to 0]
© Padhraic Smyth, UC Irvine
Scatter plot of multinomials vs. mixtures
[Figure: the same scatter plot zoomed to the range -100 to 0]
© Padhraic Smyth, UC Irvine
Timing Results
[Figure: time (seconds, about 500-5500) to fit a model with 4300 individuals, 59,000 transactions, and 164,000 items, versus model complexity (number of components K, 20-200), for Standard Mixtures and Empirical Bayes Mixtures]
© Padhraic Smyth, UC Irvine
Ongoing Work
• Applications
– interactive visualization and exploration tool
– early identification of high value customers
– segmentation
• Extensions
– "factored" mixtures: multiple behaviors in one transaction
– time-rate of purchases (e.g., Poisson, seasonal)
– covariate information (e.g., demographics, etc)
– outlier detection, clustering, forecasting, cross-selling
• Other Applications
– sequential profiles for Web users
• component models integrate time and content
• hierarchical models
© Padhraic Smyth, UC Irvine
Summary of Transaction Results
• Predictive performance out-of-sample:
– hierarchical mixtures are better than both
• global mixture weights
• hierarchical multinomials
– predictive power continues to improve up to about K = 50 to
100 mixture components
• Computational efficiency
– model fitting is relatively fast
– estimation time scales roughly as 10 to 100 transactions
per second
• Predictive profiles are interpretable, fast, and accurate
© Padhraic Smyth, UC Irvine
Application 3: Fast Approximate
Querying
(Pavlov and Smyth, KDD 2001)
© Padhraic Smyth, UC Irvine
Query Approximation
[Diagram: a Large Database, a set of Approximate Models, and a Query Generator]
© Padhraic Smyth, UC Irvine
Query Approximation
[Diagram: probability models (e.g., mixtures, belief networks) are constructed offline from the Large Database to form the Approximate Models; the Query Generator issues the queries]
© Padhraic Smyth, UC Irvine
Query Approximation
[Diagram: probability models (e.g., mixtures, belief networks) are constructed offline from the Large Database; the resulting Approximate Models then provide fast query answers online to the Query Generator]
© Padhraic Smyth, UC Irvine
Model Averaging
Bayesian model averaging for p(x):
- since we don’t know which model (if any)
is the true one, average out this uncertainty
p(x \mid D) = \sum_{k=1}^{K} p(x \mid M_k, D)\, p(M_k \mid D)

where p(x | D) is the prediction of x given data D, p(x | M_k, D) is the prediction of model k, and p(M_k | D) is the weight of model k.
© Padhraic Smyth, UC Irvine
Stacked Mixtures
(Smyth and Wolpert, Machine Learning, 1999)
Simple idea:
use cross-validation to estimate the weights,
rather than using a Bayesian scheme
Two-phase learning
1. Learn each model Mk on Dtrain
2. Learn mixture model weights on Dvalidation
- components are fixed
- EM just learns the weights
Outperforms any single model selection technique
Even outperforms “cheating”
© Padhraic Smyth, UC Irvine
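A minimal sketch of the second phase, assuming the per-model densities on the validation set have already been computed; EM moves only the combination weights:

```python
# Sketch: stacking weights learned on D_validation with the component models fixed.
import numpy as np

def stack_weights(val_densities, n_iter=50):
    """val_densities[i, k] = p(x_i | M_k) for x_i in the validation set;
    returns weights alpha maximizing the validation likelihood of sum_k alpha_k p(x | M_k)."""
    N, K = val_densities.shape
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        r = alpha * val_densities                 # E-step: responsibilities
        r /= r.sum(axis=1, keepdims=True)
        alpha = r.mean(axis=0)                    # M-step: only the weights move
    return alpha
```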
Model Averaging for Queries
“Best model” is a function of (a) data
(b) query distribution (Q)
Minimize

E_{Q}\left[ \left( p_{\mathrm{true}}(Q) - \sum_{k=1}^{K} \alpha_k\, p(Q \mid M_k) \right)^{2} \right]

where the expectation is over the distribution of queries (the long-run frequency with which Q occurs) and α_k is the weight of model k.
© Padhraic Smyth, UC Irvine
Stacking for Query Model Combining
[Figure: Mean Percent Error (y-axis, 0-6) versus Query Size (x-axis, 4-16) for conjunctive queries on Microsoft Web data (32k records, 294 attributes, available online at the UCI KDD Archive), comparing a 16-component Mixture, a Bayesian Network, the Best Holdout Model, and the Stacked Model]
© Padhraic Smyth, UC Irvine
Other Work in Our Group
• Pattern recognition in time series
– semi-Markov models for time-series pattern matching
• Ge and Smyth, KDD 2000
– applications
• semiconductor manufacturing
• NASA space station data
• Pattern discovery in categorical sequences
– unsupervised hidden Markov learning of patterns
embedded in “background”
– preliminary results at KDD 2001 temporal DM workshop
© Padhraic Smyth, UC Irvine
Other Work in Our Group
• Trajectory modeling and mixtures
– general frameworks for modeling/clustering/prediction with
sets of trajectories
– application to cyclone-tracking and fluid flow data
– Gaffney and Smyth, KDD 1999
• Spatial data models for pattern classification
– learn priors from human-labeled images
– applications:
• biological cell image segmentation
• detecting "double-bent" galaxies
• Medical diagnosis with mixtures
– see Cadez et al., Machine Learning, in press
© Padhraic Smyth, UC Irvine