Introduction to Bayesian Learning


BAYESIAN LEARNING
Jianping Fan
Dept of Computer Science
UNC-Charlotte
OVERVIEW

Bayesian classification: one example
E.g. how to decide if a patient is sick or healthy, based on
 A probabilistic model of the observed data
 Prior knowledge
CLASSIFICATION PROBLEM
Training data: examples of the form (d, h(d))
 where d are the data objects to classify (inputs)
 and h(d) is the correct class label for d, h(d) ∈ {1, …, K}
 Goal: given d_new, provide h(d_new)

WHY BAYESIAN?
Provides practical learning algorithms
 E.g. Naïve Bayes
 Prior knowledge and observed data can be combined
 It is a generative (model-based) approach, which offers a useful conceptual framework
 Any kind of object can be classified, based on a probabilistic model specification
 E.g. sequences can also be classified this way

BAYES’ RULE
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)} = \frac{P(d \mid h)\,P(h)}{\sum_{h} P(d \mid h)\,P(h)}$$

Understanding Bayes' rule:
 d = data
 h = hypothesis (model)
 Rearranging: P(h | d) P(d) = P(d | h) P(h), i.e. P(d, h) = P(d, h): the same joint probability on both sides.
Who is who in Bayes’ rule
P(h): prior belief (probability of hypothesis h before seeing any data)
P(d | h): likelihood (probability of the data if the hypothesis h is true)
P(d) = Σ_h P(d | h) P(h): data evidence (marginal probability of the data)
P(h | d): posterior (probability of hypothesis h after having seen the data d)
Gaussian Mixture Model (GMM)
PROBABILITIES – AUXILIARY SLIDE FOR MEMORY REFRESHING

 Have two dice h1 and h2.
 The probability of rolling an i given die h1 is denoted P(i | h1). This is a conditional probability.
 Pick a die at random with probability P(hj), j = 1 or 2. The probability of picking die hj and rolling an i with it is called the joint probability and is P(i, hj) = P(hj) P(i | hj).
 For any events X and Y, P(X, Y) = P(X | Y) P(Y).
 If we know P(X, Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X, Y).
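The dice example can be spelled out in a few lines of code. The sketch below is not from the slides: the prior and the two face distributions are invented for illustration; only the relations P(i, hj) = P(hj) P(i | hj) and P(i) = Σ_j P(i, hj) come from the slide.

```python
# Minimal sketch of the two-dice example (hypothetical numbers).
prior = {"h1": 0.5, "h2": 0.5}            # P(hj): pick a die at random (assumed 50/50)
faces = {"h1": [1/6] * 6,                 # P(i | h1): a fair die
         "h2": [0.5] + [0.1] * 5}         # P(i | h2): a loaded die (assumption)

# Joint probability P(i, hj) = P(hj) * P(i | hj)
joint = {(i + 1, h): prior[h] * faces[h][i] for h in prior for i in range(6)}

# Marginal P(i) = sum over hj of P(i, hj)
marginal = {i: sum(joint[(i, h)] for h in prior) for i in range(1, 7)}

print(marginal[1])   # P(rolling a 1) = 0.5*1/6 + 0.5*0.5 ≈ 0.333
```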
DOES PATIENT HAVE CANCER OR NOT?

A patient takes a lab test and the result comes back
positive. It is known that the test returns a correct positive
result in only 98% of the cases and a correct negative
result in only 97% of the cases. Furthermore, only 0.008 of
the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?
hypothesis1: 'cancer'
hypothesis2: '¬cancer'      } hypothesis space H
data: '+' (the positive test result)

1. P(cancer | +) = P(+ | cancer) P(cancer) / P(+) = ..........
   P(+ | cancer) = 0.98
   P(cancer) = 0.008
   P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer) = ..........
   P(+ | ¬cancer) = 0.03
   P(¬cancer) = ..........
2. P(¬cancer | +) = ..........
3. Diagnosis??
CHOOSING HYPOTHESES

Maximum Likelihood hypothesis:
$$h_{ML} = \arg\max_{h \in H} P(d \mid h)$$

Generally we want the most probable hypothesis given the training data. This is the maximum a posteriori (MAP) hypothesis:
$$h_{MAP} = \arg\max_{h \in H} P(h \mid d)$$

Useful observation: the MAP hypothesis does not depend on the denominator P(d), so it can be found by maximising P(d | h) P(h).
NOW WE COMPUTE THE DIAGNOSIS

To find the Maximum Likelihood hypothesis, we evaluate P(d | h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it:
   P(+ | cancer) = ............
   P(+ | ¬cancer) = ............
   Diagnosis: h_ML = ............

To find the Maximum A Posteriori hypothesis, we evaluate P(d | h) P(h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis that gives the higher posterior probability.
   P(+ | cancer) P(cancer) = ................
   P(+ | ¬cancer) P(¬cancer) = .............
   Diagnosis: h_MAP = ......................
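As a quick way to fill in the blanks above, here is a small Python sketch (ours, not from the slides) that plugs in the numbers stated earlier: P(+ | cancer) = 0.98, P(+ | ¬cancer) = 1 − 0.97 = 0.03, P(cancer) = 0.008.

```python
# Numbers from the problem statement
p_pos_given_cancer     = 0.98
p_pos_given_not_cancer = 0.03
p_cancer               = 0.008
p_not_cancer           = 1 - p_cancer

# Maximum likelihood: compare P(+ | h) only
ml = max([("cancer", p_pos_given_cancer), ("not cancer", p_pos_given_not_cancer)],
         key=lambda t: t[1])

# MAP: compare P(+ | h) P(h); dividing by P(+) would not change the argmax
scores = {"cancer": p_pos_given_cancer * p_cancer,
          "not cancer": p_pos_given_not_cancer * p_not_cancer}
map_h = max(scores, key=scores.get)

p_pos = sum(scores.values())                  # P(+) = sum_h P(+ | h) P(h)
posterior_cancer = scores["cancer"] / p_pos   # P(cancer | +)

print(ml[0], map_h, round(posterior_cancer, 3))
# -> ML: cancer, MAP: not cancer, P(cancer | +) ≈ 0.21
```

So even though the test is positive, the MAP diagnosis is 'not cancer': the small prior 0.008 outweighs the accuracy of the test.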
NAÏVE BAYES CLASSIFIER

What can we do if our data d has several attributes?
Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis:
$$P(d \mid h) = P(a_1, \ldots, a_T \mid h) = \prod_{t} P(a_t \mid h)$$
 It is a simplifying assumption; obviously it may be violated in reality.
 In spite of that, it works well in practice.
The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier.

 One of the most practical learning methods.
 Successful applications:
   Medical diagnosis
   Text classification
EXAMPLE. ‘PLAY TENNIS’ DATA
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
NAÏVE BAYES SOLUTION
Classify any new datum instance x = (a_1, …, a_T) as:
$$h_{NaiveBayes} = \arg\max_{h} P(h)\,P(\mathbf{x} \mid h) = \arg\max_{h} P(h) \prod_{t} P(a_t \mid h)$$

To do this based on training examples, we need to estimate the parameters from the training examples:
 For each target value (hypothesis) h: $\hat{P}(h)$, an estimate of P(h)
 For each attribute value a_t of each datum instance: $\hat{P}(a_t \mid h)$, an estimate of P(a_t | h)

Based on the examples in the table, classify the following datum x:
   x = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
That means: play tennis or not?

$$h_{NB} = \arg\max_{h \in \{yes,\,no\}} P(h)\,P(\mathbf{x} \mid h) = \arg\max_{h \in \{yes,\,no\}} P(h)\,P(Outlook{=}sunny \mid h)\,P(Temp{=}cool \mid h)\,P(Humidity{=}high \mid h)\,P(Wind{=}strong \mid h)$$

Working:
   P(PlayTennis = yes) = 9/14 = 0.64
   P(PlayTennis = no) = 5/14 = 0.36
   P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
   P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
   etc.
   P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
   P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206
   ⇒ answer: PlayTennis(x) = no
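A compact Python sketch of the classifier described above (our own variable names; plain counting with no smoothing, exactly as in the working):

```python
from collections import Counter, defaultdict

# Play-tennis training data from the table above: (attribute tuple, class label)
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),(("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"), (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"), (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),(("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),(("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),(("Rain", "Mild", "High", "Strong"), "No"),
]

# Estimate P(h) and P(a_t | h) by counting (maximum-likelihood estimates)
class_counts = Counter(h for _, h in data)
attr_counts = defaultdict(Counter)            # (attribute index, class) -> value counts
for attrs, h in data:
    for t, a in enumerate(attrs):
        attr_counts[(t, h)][a] += 1

def classify(x):
    """Return argmax_h P(h) * prod_t P(x_t | h), plus the raw scores."""
    scores = {}
    for h, n_h in class_counts.items():
        score = n_h / len(data)                          # P(h)
        for t, a in enumerate(x):
            score *= attr_counts[(t, h)][a] / n_h        # P(a_t | h)
        scores[h] = score
    return max(scores, key=scores.get), scores

print(classify(("Sunny", "Cool", "High", "Strong")))
# -> ('No', {'No': ~0.0206, 'Yes': ~0.0053}), matching the working above
```

In practice one usually adds Laplace smoothing to the counts so that an unseen attribute value does not zero out the whole product.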
LEARNING TO CLASSIFY TEXT
 Learn from examples which articles are of interest
 The attributes are the words
 Observe that the Naïve Bayes assumption just means that we have a random sequence model within each class!
 NB classifiers are one of the most effective for this task
 Resources for those interested:
   Tom Mitchell: Machine Learning (book), Chapter 6.
RESULTS ON A BENCHMARK TEXT CORPUS
REMEMBER

 Bayes' rule can be turned into a classifier
 Maximum A Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood doesn't
 The Naïve Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class
 Bayesian classification is a generative approach to classification
RESOURCES

 Textbook reading (contains details about using Naïve Bayes for text classification):
   Tom Mitchell, Machine Learning (book), Chapter 6.
 Software: NB for classifying text:
   http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naivebayes.html
 Useful reading for those interested to learn more about NB classification, beyond the scope of this module:
   http://www-2.cs.cmu.edu/~tom/NewChapters.html
UNIVARIATE NORMAL SAMPLE

$X \sim N(\mu, \sigma^2)$

Sampling: $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$\hat{\mu} = ?\qquad \hat{\sigma}^2 = ?$
MAXIMUM LIKELIHOOD

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Sampling: $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$

Given x, the likelihood is a function of $\mu$ and $\sigma^2$; we want to maximize it:

$$L(\mu, \sigma^2 \mid \mathbf{x}) = f(\mathbf{x} \mid \mu, \sigma^2) = f(x_1 \mid \mu, \sigma^2) \cdots f(x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
LOG-LIKELIHOOD FUNCTION

$$L(\mu, \sigma^2 \mid \mathbf{x}) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Maximize the log-likelihood instead:

$$l(\mu, \sigma^2 \mid \mathbf{x}) = \log L(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log 2\pi - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}$$
$$= -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}$$

By setting

$$\frac{\partial}{\partial \mu}\, l(\mu, \sigma^2 \mid \mathbf{x}) = 0 \qquad\text{and}\qquad \frac{\partial}{\partial \sigma^2}\, l(\mu, \sigma^2 \mid \mathbf{x}) = 0$$
MAX. THE LOG-LIKELIHOOD FUNCTION

$$l(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}$$

$$\frac{\partial}{\partial \mu}\, l(\mu, \sigma^2 \mid \mathbf{x}) = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu}{\sigma^2} = 0
\qquad\Longrightarrow\qquad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
MAX. THE LOG-LIKELIHOOD FUNCTION

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$l(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}$$

$$\frac{\partial}{\partial \sigma^2}\, l(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} x_i^2 - \frac{\mu}{\sigma^4}\sum_{i=1}^{n} x_i + \frac{n\mu^2}{2\sigma^4} = 0$$

$$\Longrightarrow\quad n\sigma^2 = \sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \hat{\mu}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$$
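A short numerical check of these closed-form estimates (the sample below is synthetic, drawn from assumed parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)      # synthetic sample (assumed mu=5, sigma^2=4)

# Closed-form ML estimates derived above
mu_hat = x.sum() / len(x)                          # (1/n) sum x_i
sigma2_hat = (x ** 2).sum() / len(x) - mu_hat**2   # (1/n) sum x_i^2 - mu_hat^2

print(mu_hat, sigma2_hat)   # should be close to 5.0 and 4.0
```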
MISSING DATA

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \hat{\mu}^2$$

Sampling: $\mathbf{x} = (x_1, \ldots, x_m, x_{m+1}, \ldots, x_n)^T$, where $x_{m+1}, \ldots, x_n$ are missing.

$$\hat{\mu} = \frac{1}{n}\left(\sum_{i=1}^{m} x_i + \sum_{j=m+1}^{n} x_j\right)
\qquad
\hat{\sigma}^2 = \frac{1}{n}\left(\sum_{i=1}^{m} x_i^2 + \sum_{j=m+1}^{n} x_j^2\right) - \hat{\mu}^2$$
E-STEP

Let $\hat{\mu}^{(t)}, \hat{\sigma}^{2(t)}$ be the estimated parameters at the start of the $t$-th iteration. The missing sums in

$$\hat{\mu} = \frac{1}{n}\left(\sum_{i=1}^{m} x_i + \sum_{j=m+1}^{n} x_j\right)
\qquad
\hat{\sigma}^2 = \frac{1}{n}\left(\sum_{i=1}^{m} x_i^2 + \sum_{j=m+1}^{n} x_j^2\right) - \hat{\mu}^2$$

are replaced by their conditional expectations:

$$E_{\hat{\mu}^{(t)}, \hat{\sigma}^{2(t)}}\!\left[\sum_{j=m+1}^{n} x_j \,\middle|\, \mathbf{x}\right] = (n-m)\,\hat{\mu}^{(t)}
\qquad
E_{\hat{\mu}^{(t)}, \hat{\sigma}^{2(t)}}\!\left[\sum_{j=m+1}^{n} x_j^2 \,\middle|\, \mathbf{x}\right] = (n-m)\left[\big(\hat{\mu}^{(t)}\big)^2 + \hat{\sigma}^{2(t)}\right]$$
E-STEP

Collecting these expectations into completed sufficient statistics:

$$s_1^{(t)} = \sum_{i=1}^{m} x_i + (n-m)\,\hat{\mu}^{(t)}
\qquad
s_2^{(t)} = \sum_{i=1}^{m} x_i^2 + (n-m)\left[\big(\hat{\mu}^{(t)}\big)^2 + \hat{\sigma}^{2(t)}\right]$$
M-STEP

$$\hat{\mu}^{(t+1)} = \frac{s_1^{(t)}}{n}
\qquad
\hat{\sigma}^{2(t+1)} = \frac{s_2^{(t)}}{n} - \big(\hat{\mu}^{(t+1)}\big)^2$$
EXERCISE

$X \sim N(\mu, \sigma^2)$, n = 40 (10 data points missing).
Estimate $\mu$, $\sigma^2$ using different initial conditions.

Observed data (30 values):
375.081556  362.275902  332.612068  351.383048  304.823174  386.438672
430.079689  395.317406  369.029845  365.343938  243.548664  382.789939
374.419161  337.289831  418.928822  364.086502  343.854855  371.279406
439.241736  338.281616  454.981077  479.685107  336.634962  407.030453
297.821512  311.267105  528.267783  419.841982  392.684770  301.910093
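One possible way to do the exercise, sketched in Python. The E/M updates are exactly the ones above; only the function and variable names are ours. For this missing-completely-at-random setup the iteration should converge to the mean and variance of the 30 observed values, whatever the initial conditions.

```python
import numpy as np

def em_missing_normal(x_obs, n, mu0, sigma2_0, iters=200):
    """EM for X ~ N(mu, sigma^2) with n - len(x_obs) values missing,
    using the E-step / M-step updates given above."""
    m = len(x_obs)
    s_obs, s2_obs = x_obs.sum(), (x_obs ** 2).sum()
    mu, sigma2 = mu0, sigma2_0
    for _ in range(iters):
        s1 = s_obs + (n - m) * mu                      # E-step: completed sum
        s2 = s2_obs + (n - m) * (mu ** 2 + sigma2)     # E-step: completed sum of squares
        mu = s1 / n                                    # M-step
        sigma2 = s2 / n - mu ** 2                      # M-step
    return mu, sigma2

x_obs = np.array([
    375.081556, 362.275902, 332.612068, 351.383048, 304.823174, 386.438672,
    430.079689, 395.317406, 369.029845, 365.343938, 243.548664, 382.789939,
    374.419161, 337.289831, 418.928822, 364.086502, 343.854855, 371.279406,
    439.241736, 338.281616, 454.981077, 479.685107, 336.634962, 407.030453,
    297.821512, 311.267105, 528.267783, 419.841982, 392.684770, 301.910093])

# Two different initial conditions; both end up at the same estimates
print(em_missing_normal(x_obs, n=40, mu0=0.0, sigma2_0=1.0))
print(em_missing_normal(x_obs, n=40, mu0=1000.0, sigma2_0=500.0))
```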
MULTINOMIAL POPULATION

Sampling from n categories $C_1, \ldots, C_n$ with probabilities $(p_1, p_2, \ldots, p_n)$, $\sum_i p_i = 1$.

N samples: $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$, where $x_i$ = # samples in $C_i$ and $x_1 + x_2 + \cdots + x_n = N$.

$$p(\mathbf{x} \mid p_1, \ldots, p_n) = \frac{N!}{x_1! \cdots x_n!}\; p_1^{x_1} \cdots p_n^{x_n}$$
MAXIMUM LIKELIHOOD

Sampling with four categories whose cell probabilities depend on a single parameter $\theta$:

$$\boldsymbol{\theta} = \left(\tfrac{1}{2} - \tfrac{\theta}{2},\ \tfrac{\theta}{4},\ \tfrac{\theta}{4},\ \tfrac{1}{2}\right)$$

N samples: $\mathbf{x} = (x_1, x_2, x_3, x_4)^T$, $x_i$ = # samples in $C_i$, $x_1 + x_2 + x_3 + x_4 = N$.

$$L(\theta \mid \mathbf{x}) = p(\mathbf{x} \mid \theta) = \frac{N!}{x_1!\, x_2!\, x_3!\, x_4!}\left(\tfrac{1}{2} - \tfrac{\theta}{2}\right)^{x_1}\left(\tfrac{\theta}{4}\right)^{x_2}\left(\tfrac{\theta}{4}\right)^{x_3}\left(\tfrac{1}{2}\right)^{x_4}$$

We want to maximize it.
LOG-LIKELIHOOD

$$L(\theta \mid \mathbf{x}) = \frac{N!}{x_1!\, x_2!\, x_3!\, x_4!}\left(\tfrac{1}{2} - \tfrac{\theta}{2}\right)^{x_1}\left(\tfrac{\theta}{4}\right)^{x_2}\left(\tfrac{\theta}{4}\right)^{x_3}\left(\tfrac{1}{2}\right)^{x_4}$$

$$l(\theta \mid \mathbf{x}) = \log L(\theta \mid \mathbf{x}) = x_1 \log\!\left(\tfrac{1}{2} - \tfrac{\theta}{2}\right) + x_2 \log\tfrac{\theta}{4} + x_3 \log\tfrac{\theta}{4} + \text{const}$$

$$\frac{\partial}{\partial \theta}\, l(\theta \mid \mathbf{x}) = -\frac{x_1}{1-\theta} + \frac{x_2}{\theta} + \frac{x_3}{\theta} = 0
\quad\Longrightarrow\quad
-x_1\theta + x_2(1-\theta) + x_3(1-\theta) = 0
\quad\Longrightarrow\quad
\hat{\theta} = \frac{x_2 + x_3}{x_1 + x_2 + x_3}$$
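A quick numerical sanity check of the closed-form MLE (the counts below are invented; only the formulas come from the slides):

```python
import numpy as np

# Hypothetical cell counts x1..x4
x1, x2, x3, x4 = 20, 12, 14, 30

def log_lik(theta):
    # log L up to the additive constant log(N!/(x1! x2! x3! x4!)) + x4*log(1/2)
    return x1 * np.log(0.5 - theta / 2) + (x2 + x3) * np.log(theta / 4)

theta_hat = (x2 + x3) / (x1 + x2 + x3)            # closed-form MLE derived above
grid = np.linspace(0.01, 0.99, 999)
print(theta_hat, grid[np.argmax(log_lik(grid))])  # both ~0.565
```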
MIXED ATTRIBUTES

$$\hat{\theta} = \frac{x_2 + x_3}{x_1 + x_2 + x_3}$$

Sampling with $\boldsymbol{\theta} = \left(\tfrac{1}{2} - \tfrac{\theta}{2},\ \tfrac{\theta}{4},\ \tfrac{\theta}{4},\ \tfrac{1}{2}\right)$, N samples $\mathbf{x} = (x_1, x_2, x_3 + x_4)^T$.

$x_3$ is not available on its own; only the sum $x_3 + x_4$ is observed, so $\hat{\theta}$ above cannot be evaluated directly.
E-STEP

Given $\theta^{(t)}$, what can you say about $x_3$?

$$\hat{x}_3^{(t)} = E_{\theta^{(t)}}\!\left[x_3 \mid \mathbf{x}\right] = (x_3 + x_4)\;\frac{\theta^{(t)}/4}{\theta^{(t)}/4 + 1/2}$$
M-STEP

$$\hat{\theta}^{(t+1)} = \frac{x_2 + \hat{x}_3^{(t)}}{x_1 + x_2 + \hat{x}_3^{(t)}}
\qquad\text{where}\qquad
\hat{x}_3^{(t)} = (x_3 + x_4)\;\frac{\theta^{(t)}/4}{\theta^{(t)}/4 + 1/2}$$
EXERCISE

$\mathbf{x}_{obs} = (x_1, x_2, x_3 + x_4)^T = (38, 34, 125)^T$

Estimate $\theta$ using different initial conditions.
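A possible solution sketch in Python, iterating the E-step and M-step above on the exercise data (function and variable names are ours):

```python
def em_theta(x1, x2, x34, theta0, iters=100):
    """EM for the mixed-attribute multinomial above: cell probabilities
    (1/2 - theta/2, theta/4, theta/4, 1/2), with only x3 + x4 observed."""
    theta = theta0
    for _ in range(iters):
        x3_hat = x34 * (theta / 4) / (theta / 4 + 1 / 2)   # E-step
        theta = (x2 + x3_hat) / (x1 + x2 + x3_hat)         # M-step
    return theta

# Exercise data x_obs = (38, 34, 125), two different starting points;
# both runs should converge to the same estimate (roughly 0.63)
print(em_theta(38, 34, 125, theta0=0.1), em_theta(38, 34, 125, theta0=0.9))
```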
BINOMIAL/POISSON MIXTURE

M: married obasong;  X: # children

# Children    0    1    2    3    4    5    6
# Obasongs    n0   n1   n2   n3   n4   n5   n6

Married obasongs: $P(M) = 1 - \xi$, and the number of children is Poisson, $X \mid M \sim P(\lambda)$:
$$P(x \mid M) = \frac{\lambda^x e^{-\lambda}}{x!}$$

Unmarried obasongs (no children): $P(M^c) = \xi$ and $P(X = 0 \mid M^c) = 1$.
BINOMIAL/POISSON MIXTURE

# Children    0    1    2    3    4    5    6
# Obasongs    n0   n1   n2   n3   n4   n5   n6

n0 = nA + nB

Unobserved data:
  nA : # married obasongs (with no children)
  nB : # unmarried obasongs
BINOMIAL/POISSON MIXTURE

# Children      0        1    2    3    4    5    6
# Obasongs      n0       n1   n2   n3   n4   n5   n6
Complete data   nA, nB   n1   n2   n3   n4   n5   n6
Probability     pA, pB   p1   p2   p3   p4   p5   p6
BINOMIAL/POISSON MIXTURE

$$p_A = \xi \qquad p_B = e^{-\lambda}(1 - \xi) \qquad p_x = \frac{\lambda^x e^{-\lambda}}{x!}(1 - \xi), \quad x = 1, 2, \ldots$$

# Children      0        1    2    3    4    5    6
# Obasongs      n0       n1   n2   n3   n4   n5   n6
Complete data   nA, nB   n1   n2   n3   n4   n5   n6
Probability     pA, pB   p1   p2   p3   p4   p5   p6
COMPLETE DATA LIKELIHOOD

$$p_A = \xi \qquad p_B = e^{-\lambda}(1 - \xi) \qquad p_x = \frac{\lambda^x e^{-\lambda}}{x!}(1 - \xi), \quad x = 1, 2, \ldots$$

Observed data: $\mathbf{n}_{obs} = (n_0, n_1, \ldots, n_6)^T$, with $n_0 = n_A + n_B$.
Complete data: $\mathbf{n} = (n_A, n_B, n_1, \ldots, n_6)^T$.

$$L(\xi, \lambda \mid \mathbf{n}) = p(\mathbf{n} \mid \xi, \lambda) = \frac{(n_A + n_B + n_1 + \cdots + n_6)!}{n_A!\, n_B!\, n_1! \cdots n_6!}\; p_A^{n_A}\, p_B^{n_B}\, p_1^{n_1} \cdots p_6^{n_6}$$
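The slides stop at the complete-data likelihood. For concreteness, here is a sketch of the EM iteration that can be derived from it; the E/M update formulas below are our own derivation (not shown in the slides), and the counts used in the example are invented.

```python
import math

def em_zero_inflated_poisson(counts, xi0=0.5, lam0=1.0, iters=200):
    """counts[x] = number of obasongs with x children (x = 0, 1, ..., 6).
    Model from the slides: unmarried with prob. xi (always 0 children),
    married with prob. 1 - xi and X ~ Poisson(lambda).
    The E/M updates are derived from the complete-data likelihood."""
    n0, positive = counts[0], counts[1:]
    N = sum(counts)
    n_pos = sum(positive)                          # obasongs with >= 1 child (all married)
    sum_x = sum(x * n for x, n in enumerate(positive, start=1))
    xi, lam = xi0, lam0
    for _ in range(iters):
        # E-step: split n0 into expected unmarried / married-with-0-children parts
        p_unmarried0 = xi                          # P(unmarried, 0 children) = xi
        p_married0 = math.exp(-lam) * (1 - xi)     # P(married, 0 children)
        n_unmarried = n0 * p_unmarried0 / (p_unmarried0 + p_married0)
        n_married0 = n0 - n_unmarried
        # M-step: maximize the expected complete-data log-likelihood
        xi = n_unmarried / N
        lam = sum_x / (n_married0 + n_pos)
    return xi, lam

# Hypothetical counts n0..n6 (the slides keep them symbolic)
print(em_zero_inflated_poisson([50, 20, 15, 8, 4, 2, 1]))
```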
MAXIMUM LIKELIHOOD

$$\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$$

$$L(\Theta \mid \mathbf{X}) = p(\mathbf{X} \mid \Theta) = \prod_{i=1}^{N} p(\mathbf{x}_i \mid \Theta)
\qquad
\Theta^* = \arg\max_{\Theta} L(\Theta \mid \mathbf{X})$$
LATENT VARIABLES

Incomplete data: $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, with latent variables $\mathbf{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}$

Complete data: $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$
COMPLETE DATA LIKELIHOOD

Complete data: $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$, $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{Z} \mid \Theta) = p(\mathbf{X}, \mathbf{Y} \mid \Theta) = p(\mathbf{Y} \mid \mathbf{X}, \Theta)\, p(\mathbf{X} \mid \Theta)$$
COMPLETE DATA LIKELIHOOD

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{Y} \mid \mathbf{X}, \Theta)\, p(\mathbf{X} \mid \Theta)$$

The complete data likelihood is a function of the latent variable Y and of the parameter Θ. If we are given Θ, it is still a function of the random variable Y, so the result is in terms of the random variable Y; the factor $p(\mathbf{X} \mid \Theta)$ is a function of the parameter Θ only and is computable.
EXPECTATION STEP

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{X}, \mathbf{Y} \mid \Theta)$$

Let $\Theta^{(i-1)}$ be the parameter vector obtained at the $(i-1)$-th step. Define

$$Q(\Theta, \Theta^{(i-1)}) = E\big[\log L(\Theta \mid \mathbf{Z}) \,\big|\, \mathbf{X}, \Theta^{(i-1)}\big]
= \begin{cases}
\displaystyle\int_{\mathbf{y} \in \Upsilon} \log p(\mathbf{X}, \mathbf{y} \mid \Theta)\; p(\mathbf{y} \mid \mathbf{X}, \Theta^{(i-1)})\, d\mathbf{y} & \text{continuous} \\[1.5ex]
\displaystyle\sum_{\mathbf{y} \in \Upsilon} \log p(\mathbf{X}, \mathbf{y} \mid \Theta)\; p(\mathbf{y} \mid \mathbf{X}, \Theta^{(i-1)}) & \text{discrete}
\end{cases}$$
MAXIMIZATION STEP

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{X}, \mathbf{Y} \mid \Theta)$$

Let $\Theta^{(i-1)}$ be the parameter vector obtained at the $(i-1)$-th step, and let $Q(\Theta, \Theta^{(i-1)})$ be defined as in the expectation step. The maximization step sets

$$\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)})$$
MIXTURE MODELS

If there is a reason to believe that a data set is comprised of several distinct populations, a mixture model can be used. It has the following form:

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)
\qquad\text{with}\qquad
\sum_{j=1}^{M} \alpha_j = 1,
\qquad
\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$$
MIXTURE MODELS

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$

$$\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \qquad \mathbf{Y} = \{y_1, y_2, \ldots, y_N\}$$

Let $y_i \in \{1, \ldots, M\}$ represent the source that generates the data.
MIXTURE MODELS

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$

Let $y_i \in \{1, \ldots, M\}$ represent the source that generates the data. Then

$$p(\mathbf{x} \mid y = j, \Theta) = p_j(\mathbf{x} \mid \theta_j)
\qquad
p(y = j \mid \Theta) = \alpha_j$$
MIXTURE MODELS

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$

For the complete datum $\mathbf{z}_i = (\mathbf{x}_i, y_i)$:

$$p(\mathbf{z}_i \mid \Theta) = p(\mathbf{x}_i, y_i \mid \Theta) = p(y_i \mid \mathbf{x}_i, \Theta)\, p(\mathbf{x}_i \mid \Theta)$$
MIXTURE MODELS

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$

$$p(y_i \mid \mathbf{x}_i, \Theta) = \frac{p(\mathbf{x}_i, y_i \mid \Theta)}{p(\mathbf{x}_i \mid \Theta)}
= \frac{p(\mathbf{x}_i \mid y_i, \Theta)\, p(y_i \mid \Theta)}{p(\mathbf{x}_i \mid \Theta)}
= \frac{p_{y_i}(\mathbf{x}_i \mid \theta_{y_i})\, \alpha_{y_i}}{\sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x}_i \mid \theta_j)}$$
EXPECTATION

$$Q(\Theta, \Theta^g) = \sum_{\mathbf{y} \in \Upsilon} \sum_{l=1}^{M} \sum_{i=1}^{N} \delta_{y_i, l}\, \log\!\big[\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big] \prod_{j=1}^{N} p(y_j \mid \mathbf{x}_j, \Theta^g)$$
$$= \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big] \sum_{y_1=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{y_i, l} \prod_{j=1}^{N} p(y_j \mid \mathbf{x}_j, \Theta^g)$$

(The indicator $\delta_{y_i, l}$ is zero when $y_i \neq l$.)
EXPECTATION

The inner sum over $\mathbf{y}$ factorizes, because each $y_j$ with $j \neq i$ sums out to one:

$$\sum_{y_1=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{y_i, l} \prod_{j=1}^{N} p(y_j \mid \mathbf{x}_j, \Theta^g)
= \left[\prod_{\substack{j=1 \\ j \neq i}}^{N} \sum_{y_j=1}^{M} p(y_j \mid \mathbf{x}_j, \Theta^g)\right] p(l \mid \mathbf{x}_i, \Theta^g)
= p(l \mid \mathbf{x}_i, \Theta^g)$$

since $\sum_{y_j=1}^{M} p(y_j \mid \mathbf{x}_j, \Theta^g) = 1$.
EXPECTATION

Substituting this back:

$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big]\; p(l \mid \mathbf{x}_i, \Theta^g)$$
EXPECTATION

$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big]\; p(l \mid \mathbf{x}_i, \Theta^g)$$
$$= \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[p_l(\mathbf{x}_i \mid \theta_l)\big]\, p(l \mid \mathbf{x}_i, \Theta^g)$$
MAXIMIZATION

Given the initial guess $\Theta^g$, we want to find $\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$ that maximizes

$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[p_l(\mathbf{x}_i \mid \theta_l)\big]\, p(l \mid \mathbf{x}_i, \Theta^g)$$

In fact, this is done iteratively.
THE GMM (GAUSSIAN MIXTURE MODEL)

Gaussian model of a d-dimensional source, say j:

$$p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}_j|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\right),
\qquad \theta_j = (\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

GMM with M sources:

$$p(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M) = \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j),
\qquad \alpha_j \geq 0,\ \ \sum_{j=1}^{M} \alpha_j = 1$$
GOAL

Mixture model:

$$p(\mathbf{x} \mid \Theta) = \sum_{l=1}^{M} \alpha_l\, p_l(\mathbf{x} \mid \theta_l),
\qquad
\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M),
\qquad\text{subject to}\quad \sum_{l=1}^{M} \alpha_l = 1$$

To maximize:

$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[p_l(\mathbf{x}_i \mid \theta_l)\big]\, p(l \mid \mathbf{x}_i, \Theta^g)$$
FINDING α_l

$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big[p_l(\mathbf{x}_i \mid \theta_l)\big]\, p(l \mid \mathbf{x}_i, \Theta^g)$$

Due to the constraint on the $\alpha_l$'s, we introduce a Lagrange multiplier $\lambda$ and solve the following equation:

$$\frac{\partial}{\partial \alpha_l}\left[\sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \lambda\left(\sum_{l=1}^{M} \alpha_l - 1\right)\right] = 0, \qquad l = 1, \ldots, M$$
$$\Longrightarrow\quad \sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l \mid \mathbf{x}_i, \Theta^g) + \lambda = 0, \qquad l = 1, \ldots, M$$
$$\Longrightarrow\quad \sum_{i=1}^{N} p(l \mid \mathbf{x}_i, \Theta^g) + \lambda\, \alpha_l = 0, \qquad l = 1, \ldots, M$$
SOURCE CODE FOR GMM
1. EPFL
http://lasa.epfl.ch/sourcecode/
2. Google Source Code on GMM
http://code.google.com/p/gmmreg/
3. GMM & EM
http://crsouza.blogspot.com/2010/10/gaussian-mixture-models-and-expectation.html