Unsupervised Learning of
Finite Mixture Models
Mário A. T. Figueiredo
[email protected]
http://red.lx.it.pt/~mtf
Institute of Telecommunications, and
Instituto Superior Técnico.
Technical University of Lisbon
PORTUGAL
This work was done jointly with Anil K. Jain, Michigan State University
2 Outline
- Review of finite mixtures
- Estimating mixtures: the expectation-maximization (EM) algorithm
- Research issues: order selection and initialization of EM
- A new order selection criterion
- A new algorithm
- Experimental results
- Concluding remarks
Some of this work (and earlier versions of it) is reported in:
- M. Figueiredo and A. K. Jain, "Unsupervised Learning of Finite Mixture Models", to appear in
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
- M. Figueiredo, and A. K. Jain, "Unsupervised Selection and Estimation of Finite Mixture Models",
in Proc. of the Intern. Conf. on Pattern Recognition - ICPR'2000, vol. 2, pp. 87-90, Barcelona, 2000.
- M. Figueiredo, J. Leitão, and A. K. Jain, "On Fitting Mixture Models", in E. Hancock and M. Pelillo
(Editors), Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 54-69,
Springer-Verlag, 1999.
3 Finite mixtures
k random sources, probability density functions fi(x), i=1,…,k
[Diagram: one of the k sources, with densities f_1(x), f_2(x), \ldots, f_k(x), is chosen at random and generates the random variable X.]
4 Finite mixtures
Example: 3 species (Iris)
5 Finite mixtures
[Diagram: one of the k sources, with densities f_1(x), \ldots, f_k(x), is chosen at random, with Prob.(source i) = \alpha_i, and generates the random variable X.]
Conditional: f(x | source i) = f_i(x)
Joint: f(x and source i) = \alpha_i f_i(x)
Unconditional: f(x) = \sum_{all sources} f(x and source i) = \sum_{i=1}^{k} \alpha_i f_i(x)
6 Finite mixtures
f(x) = \sum_{i=1}^{k} \alpha_i f_i(x)
Component densities: f_i(x)
Mixing probabilities: \alpha_i \ge 0 and \sum_{i=1}^{k} \alpha_i = 1
Parameterized components (e.g., Gaussian): f_i(x) = f(x \mid \theta_i), so
f(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i)
\Theta = \{\theta_1, \theta_2, \ldots, \theta_k, \alpha_1, \alpha_2, \ldots, \alpha_k\}
7 Gaussian mixtures
f(x \mid \theta_i): Gaussian
Arbitrary covariances: f(x \mid \theta_i) = N(x \mid \mu_i, C_i),
\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C_1, C_2, \ldots, C_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}
Common covariance: f(x \mid \theta_i) = N(x \mid \mu_i, C),
\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}
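To make this parameterization concrete, here is a minimal Python sketch (function and variable names and values are illustrative, not from the slides) that evaluates the density of a Gaussian mixture with arbitrary covariances:

```python
# Minimal sketch: f(x | Theta) = sum_i alpha_i N(x | mu_i, C_i)
# for a Gaussian mixture with arbitrary covariances. Names/values are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, alphas, mus, covs):
    """Return the mixture density f(x | Theta) at a point x."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=C)
               for a, m, C in zip(alphas, mus, covs))

# Example: a 2-component mixture in R^2
alphas = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[2.0, 1.0], [1.0, 2.0]])]
print(gmm_density(np.array([1.0, 1.0]), alphas, mus, covs))
```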
8 Mixture fitting / estimation
Data: n independent observations, x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}
Goals: estimate the parameter set \Theta, and maybe "classify the observations".
Example:
- How many species ? Mean of each species ?
- Which points belong to each species ?
[Figure: the observed data, and the same data with the (unknown) classes indicated.]
9 Gaussian mixtures (d=1), an example
1  2
1  3
2  4
2  2
3  7
3  0.1
1  0.6
 2  0.3
 3  0.1
10 Gaussian mixtures, an R2 example
(n = 1500 points), k = 3
[Slide lists the three component means \mu_1, \mu_2, \mu_3 and covariance matrices C_1, C_2, C_3, and shows the sampled data.]
11 Uses of mixtures in pattern recognition
Unsupervised learning (model-based clustering):
- each component models one cluster
- clustering = mixture fitting
Observations:
- unclassified points
Goals:
- find the classes,
- classify the points
12 Uses of mixtures in pattern recognition
Mixtures can approximate arbitrary densities
Good for representing class-conditional densities in supervised learning.
Example:
- two strongly non-Gaussian classes;
- use mixtures to model each class-conditional density.
13 Fitting mixtures
n independent observations: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}
Maximum (log)likelihood (ML) estimate of \Theta:
\hat{\Theta} = \arg\max_{\Theta} L(x, \Theta)
L(x, \Theta) = \log \prod_{j=1}^{n} f(x^{(j)} \mid \Theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i f(x^{(j)} \mid \theta_i)
The ML estimate has no closed-form solution.
14 Gaussian mixtures: A peculiar type of ML
\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C_1, C_2, \ldots, C_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}
Maximum (log)likelihood (ML) estimate of \Theta:
\hat{\Theta} = \arg\max_{\Theta} L(x, \Theta)
subject to: C_i positive definite, \alpha_i \ge 0, and \sum_{i=1}^{k} \alpha_i = 1.
Problem: the likelihood function is unbounded as \det(C_i) \rightarrow 0.
There is no global maximum.
Unusual goal: a “good” local maximum
15 A Peculiar type of ML problem
Example: a 2-component Gaussian mixture (d = 1),
f(x \mid \mu_1, \mu_2, \sigma_1, \sigma_2) = \frac{1}{2\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + \frac{1}{2\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}
Some data points: \{x_1, x_2, \ldots, x_n\}. Set \mu_1 = x_1; then
L(x, \Theta) = \log\Big[ \frac{1}{2\sqrt{2\pi}\,\sigma_1} + \frac{1}{2\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x_1-\mu_2)^2}{2\sigma_2^2}} \Big] + \sum_{j=2}^{n} \log(\ldots) \rightarrow \infty, as \sigma_1 \rightarrow 0.
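A short numerical illustration of this degeneracy (all data and values here are made up): placing \mu_1 on one data point and shrinking \sigma_1 makes the log-likelihood grow without bound:

```python
# Put mu_1 exactly on one data point and shrink sigma_1:
# the log-likelihood keeps increasing (no global maximum exists).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)          # some made-up data points
mu1, mu2, sigma2 = x[0], 0.0, 1.0          # mu_1 = x_1

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

for sigma1 in [1.0, 0.1, 0.01, 0.001]:
    mix = 0.5 * normal_pdf(x, mu1, sigma1) + 0.5 * normal_pdf(x, mu2, sigma2)
    print(sigma1, np.log(mix).sum())       # grows as sigma_1 -> 0
```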
16 Fitting mixtures: a missing data problem
ML estimate has no closed-form solution
Standard alternative: expectation-maximization (EM) algorithm
Missing data problem:
Observed data: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}
Missing data: z = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}, the missing labels ("colors"):
z^{(j)} = [z_1^{(j)}, z_2^{(j)}, \ldots, z_k^{(j)}]^T = [0 \ \ldots \ 0 \ 1 \ 0 \ \ldots \ 0]^T
"1" at position i \Leftrightarrow x^{(j)} generated by component i
17 Fitting mixtures: a missing data problem
Observed data: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}
Missing data: z = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}, with z^{(j)} = [z_1^{(j)}, \ldots, z_k^{(j)}]: k-1 zeros and one "1".
Complete log-likelihood function:
L_c(x, z, \Theta) = \sum_{j=1}^{n} \sum_{i=1}^{k} z_i^{(j)} \log [\alpha_i f(x^{(j)} \mid \theta_i)] = \sum_{j=1}^{n} \log f(x^{(j)}, z^{(j)} \mid \Theta)
In the presence of both x and z, \Theta would be easy to estimate, but z is missing.
18 The EM algorithm
Iterative procedure:
\hat{\Theta}^{(0)}, \hat{\Theta}^{(1)}, \ldots, \hat{\Theta}^{(t)}, \hat{\Theta}^{(t+1)}, \ldots
Under mild conditions: \hat{\Theta}^{(t)} \rightarrow local maximum of L(x, \Theta), as t \rightarrow \infty.
The E-step: compute the expected value of L_c(x, z, \Theta),
Q(\Theta, \hat{\Theta}^{(t)}) \equiv E[L_c(x, z, \Theta) \mid x, \hat{\Theta}^{(t)}]
The M-step: update the parameter estimates,
\hat{\Theta}^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \hat{\Theta}^{(t)})
19 The EM algorithm: the Gaussian case
The E-step:
Q(\Theta, \hat{\Theta}^{(t)}) = E_z[L_c(x, z, \Theta) \mid x, \hat{\Theta}^{(t)}] = L_c(x, E[z \mid x, \hat{\Theta}^{(t)}], \Theta),
because L_c(x, z, \Theta) is linear in z.
Since each z_i^{(j)} is a binary variable, Bayes law gives
E[z_i^{(j)} \mid x, \hat{\Theta}^{(t)}] = \Pr\{z_i^{(j)} = 1 \mid x^{(j)}, \hat{\Theta}^{(t)}\} = \frac{\hat{\alpha}_i^{(t)} f(x^{(j)} \mid \hat{\theta}_i^{(t)})}{\sum_{m=1}^{k} \hat{\alpha}_m^{(t)} f(x^{(j)} \mid \hat{\theta}_m^{(t)})} \equiv w_i^{(j,t)}
w_i^{(j,t)}: estimate, at iteration t, of the probability that x^{(j)} was produced by component i ("soft" probabilistic assignment).
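A minimal sketch of this E-step for Gaussian components (working in the log domain is an implementation choice):

```python
# E-step sketch: w[j, i] = alpha_i f(x_j | theta_i) / sum_m alpha_m f(x_j | theta_m).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, alphas, mus, covs):
    log_w = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=C)
        for a, m, C in zip(alphas, mus, covs)
    ])
    log_w -= logsumexp(log_w, axis=1, keepdims=True)   # normalize over the k components
    return np.exp(log_w)                                # shape (n, k); each row sums to 1
```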
20 The EM algorithm: the Gaussian case
Result of the E-step: w_i^{(j,t)}, the estimate, at iteration t, of the probability that x^{(j)} was produced by component i.
The M-step:
\hat{\alpha}_i^{(t+1)} = \frac{1}{n} \sum_{j=1}^{n} w_i^{(j,t)}
\hat{\mu}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)} x^{(j)}}{\sum_{j=1}^{n} w_i^{(j,t)}}
\hat{C}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)} (x^{(j)} - \hat{\mu}_i^{(t+1)})(x^{(j)} - \hat{\mu}_i^{(t+1)})^T}{\sum_{j=1}^{n} w_i^{(j,t)}}
21 Difficulties with EM
It is a local (greedy) algorithm (the likelihood never decreases).
Initialization dependent.
[Figure: two EM runs from different initializations, converging after 74 and 270 iterations.]
22 Model selection
Number of components?
The maximized likelihood never decreases when k increases: the classical over/under-fitting issue.
L(x, \hat{\Theta}_{(k+1)}) \ge L(x, \hat{\Theta}_{(k)})
…because, for any \Theta_{(k)}, there exists a \Theta_{(k+1)} such that f(x \mid \Theta_{(k+1)}) = f(x \mid \Theta_{(k)}): the parameter spaces are "nested". Example: set \alpha_{k+1} = 0.
Hence L(x, \hat{\Theta}_{(k)}) cannot be used to estimate k.
23 Estimating the number of components (EM-based)
\hat{k} = \arg\min_k \{ C(\hat{\Theta}_{(k)}, k) \}, \quad k = k_{min}, k_{min}+1, \ldots, k_{max},
where each \hat{\Theta}_{(k)} is obtained, e.g., via EM.
Usually: C(\hat{\Theta}_{(k)}, k) = -L(x, \hat{\Theta}_{(k)}) + P(k), a log-likelihood term plus a penalty term that grows with k (a code sketch of this recipe follows the list below).
Criteria in this category:
- Bezdek's partition coefficient (PC), Bezdek, 1981 (in a clustering framework).
- Minimum description length (MDL), Rissanen and Ristad, 1992.
- Akaike's information criterion (AIC), Windham and Cutler, 1992.
- Approximate weight of evidence (AWE), Banfield and Raftery, 1993.
- Evidence Bayesian criterion (EBC), Roberts, Husmeier, Rezek, and Penny, 1998.
- Schwarz's Bayesian inference criterion (BIC), Fraley and Raftery, 1998.
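As mentioned above, here is a sketch of the "fit a mixture for each k, minimize a criterion" recipe. It uses scikit-learn's GaussianMixture and its BIC (one of the criteria listed) purely as an illustration of the recipe, not of the criterion proposed in this talk:

```python
# Illustration of the EM-based recipe: fit for each k in [k_min, k_max], pick the
# minimum of a criterion (here scikit-learn's BIC, just as an example).
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_min=1, k_max=10, seed=0):
    scores = {}
    for k in range(k_min, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(X)
        scores[k] = gm.bic(X)
    return min(scores, key=scores.get), scores
```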
24 Estimating the number of components: other approaches
Resampling based techniques
- Bootstrap for clustering, Jain and Moreau, 1987.
- Bootstrap for Gaussian mixtures, McLachlan, 1987.
- Cross validation, Smyth, 1998.
Comment: computationally very heavy.
Stochastic techniques
- Estimating k via Markov Chain Monte Carlo (MCMC), Mengersen and
Robert, 1996; Bensmail, Celeux, Raftery, and Robert, 1997; Roeder and
Wasserman, 1997.
- Sampling the full posterior via MCMC, Neal, 1992; Richardson and
Green, 1997
Comment: computationally extremely heavy.
25 STOP !
- Review of basic concepts concerning finite mixtures
- Review of the EM algorithm for Gaussian mixtures
E-step:
w
( j, t )
i
probability that x( j) was produced by component i
M-step: Given
w i( j,t )
, update parameter estimates
- Difficulty: how to initialize EM ?
- Difficulty: how to estimate the number of components, k ?
- Difficulty: how to avoid the boundary of the parameter space ?
26 A New Model Selection Criterion
Introduction to minimum encoding length criteria
x  {x (1) , x ( 2) ,..., x ( n ) }
 good model
long code  bad model
code length  model adequacy
coder
Rationale: short code
compressed
data
Several flavors: Rissanen (MDL) 78, 87
decoder
x  {x , x ,..., x }
(1)
( 2)
(n)
Rissanen (NML) 96,
Wallace and Freeman (MML), 87
27 MDL criterion: formalization
Family of models: \{f(x \mid \Theta_{(k)})\}, known to transmitter and receiver; the data x is not known to the receiver.
Given \Theta_{(k)}, the shortest code-length for x (Shannon's):
L(x \mid \Theta_{(k)}) = -\log f(x \mid \Theta_{(k)})
Not enough, because the decoder needs to know which code was chosen, i.e., \Theta_{(k)}.
Total code-length (two-part code):
L(x, \Theta_{(k)}) = -\log f(x \mid \Theta_{(k)}) + L(\Theta_{(k)})   (the second term is the parameter code-length)
28 MDL criterion: formalization
MDL criterion:
\hat{\Theta}_{(k)} = \arg\min_{\Theta_{(k)}} \{ -\log f(x \mid \Theta_{(k)}) + L(\Theta_{(k)}) \}, where L(\Theta_{(k)}) grows with k.
Can be seen as an order-penalized ML.
Remaining question: how to choose L(\Theta_{(k)})?
29 MDL criterion: parameter code length
For L(\Theta_{(k)}) to be finite, truncate \Theta_{(k)} to finite precision, \tilde{\Theta}_{(k)}:
- High precision: -\log f(x \mid \tilde{\Theta}_{(k)}) \approx -\log f(x \mid \hat{\Theta}_{(k)}^{ML}), but L(\tilde{\Theta}_{(k)}) is large.
- Low precision: L(\tilde{\Theta}_{(k)}) is small, but -\log f(x \mid \tilde{\Theta}_{(k)}) may be >> -\log f(x \mid \hat{\Theta}_{(k)}^{ML}).
Optimal compromise (under certain conditions, and asymptotic):
L(each component of \Theta_{(k)}) = \frac{1}{2} \log(n'),
where n' is the amount of data from which the parameter is estimated.
30 MDL criterion: parameter code length
L[each component of \Theta_{(k)}] = \frac{1}{2} \log(n'); n' = amount of data from which the parameter is estimated.
Classical MDL: n' = n, giving
\hat{\Theta}_{(k)} = \arg\min_{\Theta_{(k)}} \{ -\log f(x \mid \Theta_{(k)}) + \frac{(\text{total number of parameters})}{2} \log(n) \}
Not true for mixtures! Why? Not all data points have equal weight in each parameter estimate.
31 MDL for mixtures
L(each component of \Theta_{(k)}) = \frac{1}{2} \log(n'); n' = amount of data from which the parameter is estimated.
Let \theta_m be any parameter of the m-th component (e.g., a component of \mu_m).
Fisher information for \theta_m: I(\theta_m) = n \alpha_m I_1(\theta_m), where I_1(\theta_m) is the Fisher information for one observation from component m.
Conclusion: the sample size "seen" by \theta_m is n \alpha_m.
32 MDL for mixtures
Recall that \Theta_{(k)} = \{\theta_1, \theta_2, \ldots, \theta_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}.
The sample size "seen" by \theta_m is n \alpha_m, so
L(\theta_m) = \frac{N_p}{2} \log(n \alpha_m),
where N_p is the number of parameters of each component.
Examples: Gaussian with arbitrary covariances, N_p = d + d(d+1)/2; Gaussian with common covariance, N_p = d.
What about \alpha_m? I(\alpha_m) = n I_1(\alpha_m), so L(\alpha_m) = \frac{1}{2} \log(n).
33 MDL for mixtures
( k )  {1 , 2 ,..., k , 1 ,  2 ,... k 1}
L(  m ) 
Np
2
1
L( m )  log( n )
2
log( n  m )
Np
k 1
L(( k ) ) 
log( n ) 
2
2
 log( n 
m
)
m
the mixture-MDL (MMDL) criterion :
ˆ  arg min   log f (x |  )  L( )

(k)
(k)
(k)
( k )
34 The MMDL criterion
\hat{\Theta}_{(k)} = \arg\min_{\Theta_{(k)}} \Big\{ -\log f(x \mid \Theta_{(k)}) + \frac{k(N_p+1)}{2} \log(n) + \frac{N_p}{2} \sum_{m=1}^{k} \log(\alpha_m) \Big\}
Key observation: L(\Theta_{(k)}) is not just a function of k.
This is not a simple order penalty (like in standard MDL):
for fixed k, \hat{\Theta}_{(k)} is not an ML estimate.
For fixed k, MMDL has a simple Bayesian interpretation:
p(\{\alpha_1, \ldots, \alpha_k\}) \propto \exp\Big\{ -\frac{N_p}{2} \sum_{m=1}^{k} \log(\alpha_m) \Big\} = \prod_{m=1}^{k} (\alpha_m)^{-N_p/2}
This is a Dirichlet-type (improper) prior.
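A small sketch that evaluates the MMDL cost as written above for a fitted mixture; restricting the sums to components with \alpha_m > 0 anticipates the component annihilation discussed on the next slides and is my reading, not stated explicitly here:

```python
# MMDL cost sketch:
#   -log f(x | Theta) + (k (Np + 1) / 2) log(n) + (Np / 2) sum_m log(alpha_m),
# evaluated over the surviving (alpha_m > 0) components only.
import numpy as np

def mmdl_cost(neg_loglik, alphas, n, n_p):
    alphas = np.asarray([a for a in alphas if a > 0])   # "dead" components drop out
    k = len(alphas)
    return (neg_loglik
            + 0.5 * k * (n_p + 1) * np.log(n)
            + 0.5 * n_p * np.log(alphas).sum())
```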
35 EM for MMDL
\hat{\Theta}_{(k)} = \arg\min_{\Theta_{(k)}} \Big\{ -\log f(x \mid \Theta_{(k)}) + \frac{k(N_p+1)}{2} \log(n) + \frac{N_p}{2} \sum_{m=1}^{k} \log(\alpha_m) \Big\}
Using EM \Rightarrow redefining the M-step (there is now a prior on the \alpha_m's).
Simple, because the Dirichlet-type prior is conjugate:
\hat{\alpha}_i^{(t+1)} = \frac{\max\big\{0,\ \sum_{j=1}^{n} w_i^{(j,t)} - \frac{N_p}{2}\big\}}{\sum_{m=1}^{k} \max\big\{0,\ \sum_{j=1}^{n} w_m^{(j,t)} - \frac{N_p}{2}\big\}}
(the \max\{0, \cdot\} comes from the constraints \alpha_i \ge 0 and \sum_{i=1}^{k} \alpha_i = 1).
Remarkable fact: this M-step may annihilate components
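A minimal sketch of this modified M-step for the mixing probabilities; the error raised when every component is annihilated corresponds to the failure mode discussed on slide 39:

```python
# "Killer" M-step sketch: alpha_i <- max{0, sum_j w[j, i] - Np/2}, then renormalize.
# A component whose total responsibility mass falls below Np/2 is annihilated.
import numpy as np

def mmdl_alpha_update(w, n_p):
    raw = np.maximum(0.0, w.sum(axis=0) - n_p / 2.0)
    total = raw.sum()
    if total == 0.0:                     # can happen if k is far too large (see slide 39)
        raise RuntimeError("all components annihilated; use component-wise EM")
    return raw / total
```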
36 EM for MMDL
The M-step for MMDL is able to annihilate components
MMDL promotes sparseness
This suggests: start with k much higher than true value,
EM for MMDL will set some m’s to zero
Interesting interpretation:
-\frac{1}{2} \sum_{m=1}^{k} \log(\alpha_m) \propto D[\{1/k, \ldots, 1/k\} \,\|\, \{\alpha_1, \ldots, \alpha_k\}]   (up to constants),
the Kullback-Leibler divergence between the uniform distribution and \{\alpha_1, \ldots, \alpha_k\}.
MMDL favors lower-entropy distributions
(Zhu, Wu, & Mumford, 1997; Brand, 1999)
37 The MMDL criterion: an example
The MMDL term, for Np = 2 and k = 2
Promotes instability and competition between components.
38 The initialization problem
EM is a local (greedy) algorithm; mixture estimates depend on good initialization.
Many approaches:
- Multiple random starts, McLachlan and Peel, 1998; Roberts, Husmeier, Rezek, and Penny, 1998; and many others.
- Initialization by a clustering algorithm (e.g., k-means), McLachlan and Peel, 1998, and others.
- Deterministic annealing, Yuille, Stolorz, and Utans, 1994; Kloppenburg and Tavan, 1997; Ueda and Nakano, Neural Networks, 1998.
- The split-and-merge EM algorithm, Ueda, Nakano, Ghahramani, and Hinton, 2000.
Our approach:
- start with too many components, and prune with the new M-step
39
Possible problem with the new M-step:
\hat{\alpha}_i^{(t+1)} = \frac{\max\big\{0,\ \sum_{j=1}^{n} w_i^{(j,t)} - \frac{N_p}{2}\big\}}{\sum_{m=1}^{k} \max\big\{0,\ \sum_{j=1}^{n} w_m^{(j,t)} - \frac{N_p}{2}\big\}}
where w_i^{(j,t)} is the probability that x^{(j)} was produced by component i.
If k is too high, it may happen that \sum_{j=1}^{n} w_i^{(j,t)} \le \frac{N_p}{2} for all i, so that all \hat{\alpha}_i^{(t+1)} = 0.
Solution: use “component-wise EM”
[Celeux, Chrétien, Forbes, and Mkhadri, 1999]
Convergence shown using the approach in [Chrétien and Hero, 2000]
40
Component-wise EM [Celeux, Chrétien, Forbes, and Mkhadri, 1999]
- Update \hat{\theta}_1, \hat{\alpha}_1; recompute all the w_i^{(j)}
- Update \hat{\theta}_2, \hat{\alpha}_2; recompute all the w_i^{(j)}
- ...
- Update \hat{\theta}_k, \hat{\alpha}_k; recompute all the w_i^{(j)}
- Repeat until convergence.
Key fact: if one component "dies", its probability mass is immediately redistributed.
41 The complete algorithm
- Start with a large number of components.
- While the number of "surviving" components is > kmin:
- Run component-wise EM (CEM2), using the new "killer" M-step
- After convergence, store the final value of the MMDL cost function
- Kill the weakest component, and restart CEM2
- Select model with the minimum value of the MMDL cost.
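A simplified sketch of this loop, reusing the earlier sketches (e_step, m_step, mmdl_alpha_update, mmdl_cost, gmm_loglik). For readability it runs ordinary batch EM with the "killer" M-step instead of the component-wise EM (CEM2) the algorithm actually uses, so it only approximates the procedure above:

```python
import numpy as np

def fit_mmdl(X, k_max=20, k_min=1, n_iter=100, seed=0):
    n, d = X.shape
    n_p = d + d * (d + 1) // 2                         # Gaussian, arbitrary covariances
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(n, size=k_max, replace=False)]  # start with many components
    covs = [np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k_max)]
    alphas = np.full(k_max, 1.0 / k_max)
    best_cost, best_model = np.inf, None

    while len(alphas) >= k_min:
        for _ in range(n_iter):                        # EM with the "killer" alpha update
            w = e_step(X, alphas, mus, covs)
            alphas = mmdl_alpha_update(w, n_p)
            keep = alphas > 0                          # prune annihilated components
            alphas, mus = alphas[keep], mus[keep]
            covs = [c for c, kp in zip(covs, keep) if kp]
            _, mus, covs = m_step(X, w[:, keep])
        cost = mmdl_cost(-gmm_loglik(X, alphas, mus, covs), alphas, n, n_p)
        if cost < best_cost:                           # keep the best MMDL value seen so far
            best_cost, best_model = cost, (alphas, mus, covs)
        weakest = np.argmin(alphas)                    # kill the weakest component, restart
        keep = np.arange(len(alphas)) != weakest
        alphas, mus = alphas[keep], mus[keep]
        covs = [c for c, kp in zip(covs, keep) if kp]
        if alphas.size and alphas.sum() > 0:
            alphas = alphas / alphas.sum()
    return best_cost, best_model
```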
42 Example
Same as in [Ueda and Nakano, 1998].
43 Example
k = 4 (true), n = 1200, kmax = 10
[Slide lists the four component means, equal mixing probabilities \alpha_m = 1/4, and common covariance C = I.]
44 Example
Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].
45 Example
An example with overlapping components
46 Evolution of the MMDL cost function
47 Evolution of the MMDL cost function
48 Comparison with other methods
Performance evaluation, as a function of the separation \delta.
[Figures: an easy mixture, \delta = 5; a not-so-easy mixture, \delta = 2.]
49
2-dimensional data, separation = \delta
50
10-dimensional data, separation = \delta \sqrt{10}
51
2-dimensional data, common covariance, separation = \delta
52
10-dimensional data, common covariance, separation = \delta \sqrt{10}
53
The iris (4-dim.) data-set: 3 components correctly identified
54 One-dimensional examples: enzyme data-set
55 One-dimensional examples: acidity data-set
56 A supervised learning example
(Problem suggested in Hastie and Tibshirani, 1996)
3 classes
21-dimensional feature-space
Consider these 3 points in R^{21} (the waveforms h_1, h_2, h_3 shown on the slide).
Each sample is a vector y^{(i)} = [y_1^{(i)}, y_2^{(i)}, \ldots, y_{21}^{(i)}].
57 A supervised learning example (cont.)
Class definitions (u^{(i)} i.i.d. uniform on [0,1], n^{(i)} noise):
Class 1: y^{(i)} = u^{(i)} h_1 + (1 - u^{(i)}) h_2 + n^{(i)}
Class 2: y^{(i)} = u^{(i)} h_1 + (1 - u^{(i)}) h_3 + n^{(i)}
Class 3: y^{(i)} = u^{(i)} h_2 + (1 - u^{(i)}) h_3 + n^{(i)}
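A sketch of this generator. The specific shifted triangular waveforms h_1, h_2, h_3 and the standard Gaussian noise are assumptions (they follow the usual "waveform" data of Breiman et al. that Hastie and Tibshirani use); the slide itself only shows the three curves graphically:

```python
# Waveform-style generator for the three classes above. The h_1, h_2, h_3 shapes and the
# unit-variance Gaussian noise are assumptions; only the mixing rule comes from the slide.
import numpy as np

t = np.arange(1, 22)                                  # 21 feature dimensions
h1 = np.maximum(6 - np.abs(t - 11), 0)
h2 = np.maximum(6 - np.abs(t - 15), 0)
h3 = np.maximum(6 - np.abs(t - 7), 0)

def sample_class(label, n, rng):
    pairs = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}
    ha, hb = pairs[label]
    u = rng.uniform(size=(n, 1))                      # u^(i) i.i.d. uniform on [0, 1]
    return u * ha + (1 - u) * hb + rng.normal(size=(n, 21))

rng = np.random.default_rng(0)
train = np.vstack([sample_class(c, 100, rng) for c in (1, 2, 3)])   # ~100 samples per class
```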
58 How do samples look like ?
[Figure: pairwise scatter plots of selected coordinates (y_7, y_11, y_15) for classes 1, 2, and 3.]
59 How do samples look like ?
Best projection
60 A supervised learning example (cont.)
Learn Bayes classifiers from 300 samples (~100/class);
test on 500 samples.
Methods: Linear discriminant (LD),
quadratic discriminant (QD),
mixture discriminant (MD) – fit mixture to class-conditionals
Results (over 10 trials), mean error rates (std. dev.):
- Linear discriminant: 0.195 (0.008)
- Quadratic discriminant: 0.211 (0.008)
- MD, random starts, k=3: 0.167 (0.005)
- MD, new method: 0.159 (0.005)
61 Another supervised learning example
Problem: learn to classify textures from 19 Gabor features.
- Four classes (textures shown on the slide).
- Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture.
- Test on the remaining data.
Error rates:
- Mixture-based: 0.0074
- Linear discriminant: 0.0185
- Quadratic discriminant: 0.0155
62
Resulting decision regions
2-d projection of the texture
data and the obtained mixtures
63 Summary and Conclusions
- Review of mixture models and the EM algorithm.
- Two main difficulties:
- Model selection
- Initialization of EM
- A new MDL-type criterion (MMDL) adequate for mixtures.
- MMDL is a “sparseness” prior.
- MMDL helps EM avoid poor local maxima.
- Experimental results.