2D1431 Machine Learning
Bayesian Learning
Outline







Bayes theorem
Maximum likelihood (ML) hypothesis
Maximum a posteriori (MAP) hypothesis
Naïve Bayes classifier
Bayes optimal classifier
Bayesian belief networks
Expectation maximization (EM) algorithm
Handwritten characters classification
Gray level pictures: object classification
Gray level pictures: human action classification
Literature & Software





T. Mitchell: chapter 6
S. Russell & P. Norvig, “Artificial Intelligence – A Modern
Approach” : chapters 14+15
R.O. Duda, P.E. Hart, D.G. Stork, “Pattern Classification 2nd
ed.” : chapters 2+3
David Heckerman: “A Tutorial on Learning with Bayesian
Belief Networks”
http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
Bayes Net Toolbox for Matlab (free), Kevin Murphy
http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)




P(D) : prior probability of the data D, evidence
P(h) : prior probability of the hypothesis h, prior
P(h|D) : posterior probability of the hypothesis
given the data D, posterior
P(D|h) : probability of the data D given the
hypothesis h , likelihood of the data
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
posterior = likelihood x prior / evidence



By observing the data D we can convert the prior
probability P(h) to the a posteriori probability (posterior)
P(h|D)
The posterior is the probability that h holds after the data D
have been observed.
The evidence P(D) can be viewed merely as a scale factor
that guarantees that the posterior probabilities sum to
one.
Choosing Hypotheses




P(h|D) = P(D|h) P(h) / P(D)
Generally we want the most probable hypothesis given
the training data
Maximum a posteriori hypothesis $h_{MAP}$:
$h_{MAP} = \arg\max_{h \in H} P(h|D)$
$= \arg\max_{h \in H} P(D|h) P(h) / P(D)$
$= \arg\max_{h \in H} P(D|h) P(h)$
If all hypotheses are equally likely a priori,
$P(h_i) = P(h_j)$ for all i, j, then one can choose the maximum
likelihood (ML) hypothesis
$h_{ML} = \arg\max_{h \in H} P(D|h)$
Bayes Theorem Example
A patient takes a lab test and the result is positive. The test
returns a correct positive (+) result in 98% of the cases in
which the disease is actually present, and a correct negative
(-) result in 97% of the cases in which the disease is not
present. Furthermore, 0.8% of the entire population have
the disease.
Hypotheses: disease, ¬disease
Priors P(h): P(disease) = 0.008, P(¬disease) = 0.992
Likelihoods P(D|h): P(+|disease) = 0.98, P(-|disease) = 0.02
P(+|¬disease) = 0.03, P(-|¬disease) = 0.97
Maximum posterior, argmax P(h|D):
P(disease|+) ∝ P(+|disease) P(disease) = 0.0078
P(¬disease|+) ∝ P(+|¬disease) P(¬disease) = 0.0298
Normalizing by the evidence P(+) = 0.0078 + 0.0298:
P(disease|+) = 0.0078/(0.0078+0.0298) = 0.21
P(¬disease|+) = 0.0298/(0.0078+0.0298) = 0.79
Hence hMAP = ¬disease, even though the test result is positive.
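The arithmetic of the lab-test example can be checked with a few lines of Python. This is only a sketch of the calculation on the slide; the variable names are chosen here for illustration.

```python
# Bayes theorem for the lab-test example: posterior = likelihood * prior / evidence.
p_disease = 0.008            # prior P(disease)
p_pos_given_disease = 0.98   # P(+ | disease)
p_pos_given_healthy = 0.03   # P(+ | not disease) = 1 - 0.97

# Unnormalized posteriors (likelihood times prior).
joint_disease = p_pos_given_disease * p_disease
joint_healthy = p_pos_given_healthy * (1.0 - p_disease)

# The evidence P(+) is the scale factor that makes the posteriors sum to one.
evidence = joint_disease + joint_healthy
post_disease = joint_disease / evidence
post_healthy = joint_healthy / evidence

h_map = "disease" if post_disease > post_healthy else "not disease"
print(round(post_disease, 2), round(post_healthy, 2), h_map)
```

Because the prior P(disease) is so small, the posterior stays well below 0.5 despite the 98% sensitivity of the test.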
Basic Formula for Probabilities



Product rule: $P(A \wedge B) = P(A|B) P(B) = P(B|A) P(A)$
Sum rule: $P(A \vee B) = P(A) + P(B) - P(A \wedge B)$
Theorem of total probability: if $A_1, A_2, \ldots, A_n$ are
mutually exclusive events with $\sum_i P(A_i) = 1$, then
$P(B) = \sum_i P(B|A_i) P(A_i)$
Bayes Theorem Example
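$P(x_1, x_2 \mid \mu_1, \mu_2, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\sum_i (x_i - \mu_i)^2 / 2\sigma^2\right)$
$h = \{\mu_1, \mu_2, \sigma\}$
$D = \{x^{(1)}, \ldots, x^{(M)}\}$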
Gaussian Probability Function
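$P(D \mid \mu_1, \mu_2, \sigma) = \prod_m P(x^{(m)} \mid \mu_1, \mu_2, \sigma)$
Maximum likelihood hypothesis $h_{ML}$:
$h_{ML} = \arg\max_{\mu_1, \mu_2, \sigma} P(D \mid \mu_1, \mu_2, \sigma)$
Trick: maximize the log-likelihood
$\log P(D \mid \mu_1, \mu_2, \sigma) = \sum_m \log P(x^{(m)} \mid \mu_1, \mu_2, \sigma)$
$= \sum_m \log\left[\frac{1}{2\pi\sigma^2} \exp\left(-\sum_i (x^{(m)}_i - \mu_i)^2 / 2\sigma^2\right)\right]$
$= -M \log(2\pi\sigma^2) - \sum_m \sum_i (x^{(m)}_i - \mu_i)^2 / 2\sigma^2$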

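Gaussian Probability Function
$\partial \log P(D \mid \mu_1, \mu_2, \sigma) / \partial \mu_i = 0$
$\Rightarrow \sum_m (x^{(m)}_i - \mu_i) = 0 \Rightarrow \mu_i^{ML} = \frac{1}{M} \sum_m x^{(m)}_i = E[x_i]$
$\partial \log P(D \mid \mu_1, \mu_2, \sigma) / \partial \sigma = 0$
$\Rightarrow \sigma^2_{ML} = \sum_m \sum_i (x^{(m)}_i - \mu_i^{ML})^2 / 2M = E\left[\sum_i (x_i - \mu_i)^2\right] / 2$
Maximum likelihood hypothesis $h_{ML} = \{\mu_i^{ML}, \sigma_{ML}\}$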
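The closed-form estimates above translate directly into code. The following is a minimal sketch for a two-dimensional Gaussian with one shared standard deviation; the function name `fit_isotropic_gaussian` and the synthetic data (true mean and sigma) are assumptions made for illustration, not part of the slides.

```python
import math
import random

def fit_isotropic_gaussian(points):
    """ML estimate of the mean (mu_1, mu_2) and the shared sigma for 2-D points."""
    M = len(points)
    # mu_i = (1/M) * sum over samples of component i
    mu = [sum(p[i] for p in points) / M for i in range(2)]
    # sigma^2 = sum over samples and components of squared deviations / (2M)
    sq = sum((p[i] - mu[i]) ** 2 for p in points for i in range(2))
    return mu, math.sqrt(sq / (2 * M))

# Hypothetical synthetic data drawn from a known Gaussian.
random.seed(0)
true_mu, true_sigma = (0.0, 0.0), 1.5
sample = [(random.gauss(true_mu[0], true_sigma), random.gauss(true_mu[1], true_sigma))
          for _ in range(5000)]

mu_ml, sigma_ml = fit_isotropic_gaussian(sample)
print(mu_ml, sigma_ml)
```

With enough samples the estimates should approach the parameters used to generate the data.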
Maximum Likelihood Hypothesis

$\mu_{ML} = (0.20, -0.14)$, $\sigma_{ML} = 1.42$
Bayes Decision Rule


[Figure: scatter plot; x = examples of class c1, o = examples of class c2; the two Gaussians are labeled {μ1,σ1} and {μ2,σ2}]
Bayes Decision Rule





Assume we have two Gaussian distributions
associated with two separate classes c1, c2.
$P(x \mid c_i) = P(x \mid \mu_i, \sigma_i) = \frac{1}{2\pi\sigma_i^2} \exp\left(-\sum_j (x_j - \mu_{ij})^2 / 2\sigma_i^2\right)$
Bayes decision rule (maximum posterior probability):
Decide c1 if P(c1|x) > P(c2|x),
otherwise decide c2.
If P(c1) = P(c2), use the maximum likelihood P(x|ci);
else use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci)
Bayes Decision Rule
[Figure: decision regions labeled c1 and c2]
Two-Category Case
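Discriminant functions:
if g(x) > 0 then c1 else c2
Posterior difference (up to the common factor 1/P(x)):
$g(x) = P(c_1|x) - P(c_2|x) \propto P(x|c_1) P(c_1) - P(x|c_2) P(c_2)$
Log-posterior ratio:
$g(x) = \log P(c_1|x) - \log P(c_2|x) = \log \frac{P(x|c_1)}{P(x|c_2)} + \log \frac{P(c_1)}{P(c_2)}$
Gaussian probability functions with identical $\sigma_i = \sigma$:
$g(x) = (x - \mu_2)^2 / 2\sigma^2 - (x - \mu_1)^2 / 2\sigma^2 + \log P(c_1) - \log P(c_2)$
The decision surface is a line/hyperplane, because the quadratic terms in x cancel.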
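A small sketch of the discriminant for two Gaussians with a shared sigma follows. The means, priors and test points are hypothetical values picked for illustration.

```python
import math

def gaussian_discriminant(x, mu1, mu2, sigma, prior1, prior2):
    """Log-posterior ratio g(x) for two Gaussians with the same sigma.

    g(x) > 0 means class c1, otherwise class c2.
    """
    d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))  # squared distance to mu1
    d2 = sum((a - b) ** 2 for a, b in zip(x, mu2))  # squared distance to mu2
    return (d2 - d1) / (2 * sigma ** 2) + math.log(prior1) - math.log(prior2)

# Hypothetical class means.
mu1, mu2 = (0.0, 0.0), (2.0, 0.0)

# Equal priors: the point halfway between the means lies on the decision surface.
g_mid = gaussian_discriminant((1.0, 0.0), mu1, mu2, 1.0, 0.5, 0.5)
# Unequal priors shift the decision surface toward the less probable class.
g_shift = gaussian_discriminant((1.0, 0.0), mu1, mu2, 1.0, 0.8, 0.2)
print(g_mid, g_shift)
```

With equal priors the rule reduces to choosing the nearest mean; the prior term only adds a constant offset to g(x).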

Learning a Real Valued Function
[Figure: target function f, noise e, and maximum likelihood hypothesis hML]
Consider a real-valued target function f.
Noisy training examples <xi,di>:
$d_i = f(x_i) + e_i$
ei is a random variable drawn from a Gaussian distribution
with zero mean.
The maximum likelihood hypothesis $h_{ML}$ is the one that
minimizes the sum of squared errors:
$h_{ML} = \arg\min_{h \in H} \sum_i (d_i - h(x_i))^2$
Learning a Real Valued Function
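$h_{ML} = \arg\max_{h \in H} P(D|h)$
$= \arg\max_{h \in H} \prod_i P(d_i|h)$
$= \arg\max_{h \in H} \prod_i (2\pi\sigma^2)^{-1/2} \exp\left(-(d_i - h(x_i))^2 / 2\sigma^2\right)$
Maximizing the logarithm $\log P(D|h)$ instead:
$h_{ML} = \arg\max_{h \in H} \sum_i \left[-\tfrac{1}{2} \log(2\pi\sigma^2) - (d_i - h(x_i))^2 / 2\sigma^2\right]$
$= \arg\max_{h \in H} \sum_i -(d_i - h(x_i))^2$
$= \arg\min_{h \in H} \sum_i (d_i - h(x_i))^2$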
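The equivalence between least squares and maximum likelihood under Gaussian noise can be illustrated numerically. The sketch below assumes a hypothesis space of straight lines h(x) = w0 + w1 x and synthetic data; both choices, and all names, are illustrative assumptions.

```python
import math
import random

def fit_line(xs, ds):
    """Closed-form least-squares fit of h(x) = w0 + w1 * x."""
    n = len(xs)
    mx, md = sum(xs) / n, sum(ds) / n
    w1 = (sum((x - mx) * (d - md) for x, d in zip(xs, ds))
          / sum((x - mx) ** 2 for x in xs))
    return md - w1 * mx, w1

def log_likelihood(w0, w1, xs, ds, sigma):
    """Gaussian log-likelihood log P(D|h) of the line (w0, w1)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - (w0 + w1 * x)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

# Hypothetical target f(x) = 1 + 2x observed with zero-mean Gaussian noise.
random.seed(1)
sigma = 0.3
xs = [i / 10 for i in range(50)]
ds = [1.0 + 2.0 * x + random.gauss(0.0, sigma) for x in xs]

w0, w1 = fit_line(xs, ds)
ll_best = log_likelihood(w0, w1, xs, ds, sigma)
ll_other = log_likelihood(w0 + 0.1, w1, xs, ds, sigma)  # a perturbed hypothesis
print(w0, w1, ll_best > ll_other)
```

Any hypothesis other than the least-squares line has a larger sum of squared errors and therefore a lower log-likelihood.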
Learning to Predict Probabilities




Predicting the survival probability of a patient
Training examples <xi,di> where di is 0 or 1
Objective: train a neural network to output the probability
h(xi) = P(di=1|xi)
Maximum likelihood hypothesis:
$h_{ML} = \arg\max_{h \in H} \sum_i \left[d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))\right]$
Equivalently, minimize the cross entropy between di and h(xi), which is the negative of this sum.

Weight update rule for the synapses wk to the output neuron h(xi):
$w_k \leftarrow w_k + \eta \sum_i (d_i - h(x_i)) x_{ik}$

Compare to the standard BP weight update rule (squared error):
$w_k \leftarrow w_k + \eta \sum_i h(x_i)(1 - h(x_i)) (d_i - h(x_i)) x_{ik}$
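As an illustration of the first update rule, here is a sketch of batch gradient ascent for a single sigmoid output unit without hidden layers. The toy data, the learning rate and the number of epochs are assumptions made for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_sigmoid_unit(xs, ds, eta=0.05, epochs=5000):
    """Gradient ascent on sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        # Outputs of the unit for all training examples under the current weights.
        hs = [sigmoid(sum(wk * xk for wk, xk in zip(w, x))) for x in xs]
        # w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik
        for k in range(len(w)):
            w[k] += eta * sum((d - h) * x[k] for x, d, h in zip(xs, ds, hs))
    return w

# Hypothetical data: first input is a constant bias 1, second is one feature.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0), (1.0, 5.0)]
ds = [0, 0, 1, 0, 1, 1]

w = train_sigmoid_unit(xs, ds)
p_low = sigmoid(w[0] + w[1] * 0.0)   # predicted P(d=1) for feature value 0
p_high = sigmoid(w[0] + w[1] * 5.0)  # predicted P(d=1) for feature value 5
print(w, p_low, p_high)
```

The outputs are probabilities rather than hard labels, which is the point of training with the likelihood criterion instead of squared error.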
Most Probable Classification
So far we sought the most probable hypothesis hMAP.
What is the most probable classification of a new instance x
given the data D?
hMAP(x) is not necessarily the most probable classification, although
it is often a sufficiently good approximation of it.
Consider three possible hypotheses:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given a new instance x: h1(x) = +, h2(x) = -, h3(x) = -
hMAP(x) = h1(x) = +
Most probable classification:
P(+) = P(h1|D) = 0.4
P(-) = P(h2|D) + P(h3|D) = 0.6

Bayes Optimal Classifier


$c_{max} = \arg\max_{c_j \in C} \sum_{h_i \in H} P(c_j|h_i) P(h_i|D)$
Example:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
P(+|h1)=1, P(-|h1)=0
P(+|h2)=0, P(-|h2)=1
P(+|h3)=0, P(-|h3)=1
therefore
$\sum_{h_i \in H} P(+|h_i) P(h_i|D) = 0.4$
$\sum_{h_i \in H} P(-|h_i) P(h_i|D) = 0.6$
$\arg\max_{c_j \in C} \sum_{h_i \in H} P(c_j|h_i) P(h_i|D) = -$
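The three-hypothesis example can be reproduced in a few lines. This is a sketch using plain dictionaries; the function name `bayes_optimal` is chosen here for illustration.

```python
def bayes_optimal(classes, posteriors, predictions):
    """Return argmax_c sum_h P(c|h) P(h|D) together with all class scores."""
    scores = {c: sum(predictions[h][c] * posteriors[h] for h in posteriors)
              for c in classes}
    return max(scores, key=scores.get), scores

# Values from the example: P(h|D) and P(c|h) for each hypothesis.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

c_max, scores = bayes_optimal(["+", "-"], posteriors, predictions)

# For comparison: the classification of the single MAP hypothesis.
h_map = max(posteriors, key=posteriors.get)
c_map = max(predictions[h_map], key=predictions[h_map].get)
print(c_max, scores, h_map, c_map)
```

The MAP hypothesis h1 predicts +, while the Bayes optimal classification is -, because h2 and h3 together carry more posterior mass.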
MAP vs. Bayes Method






The maximum posterior hypothesis estimates a single point hMAP
in the hypothesis space H.
The Bayes method instead estimates and uses the complete
distribution P(h|D).
The difference appears when MAP or the Bayes
method is used for inference on unseen instances and
one compares the distributions P(x|D):
MAP: $P(x|D) = h_{MAP}(x)$ with $h_{MAP} = \arg\max_{h \in H} P(h|D)$
Bayes: $P(x|D) = \sum_{h_i \in H} P(x|h_i) P(h_i|D)$
For reasonable prior distributions P(h), the MAP and Bayes
solutions are equivalent in the asymptotic limit of infinite
training data D.
Naïve Bayes Classifier




popular, simple learning algorithm
moderate or large training set available
assumption: attributes that describe instances are
conditionally independent given classification (in practice
works surprisingly well even if assumption is violated)
Applications:
 diagnosis
 text classification (newsgroup articles 20 newsgroups,
1000 documents per newsgroup, classification
accuracy 89%)
Naïve Bayes Classifier


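Assume a discrete target function f: X → C, where each
instance x is described by attributes <a1,a2,…,an>
The most probable value of f(x) is:
$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid a_1, a_2, \ldots, a_n)$
$= \arg\max_{c_j \in C} P(a_1, a_2, \ldots, a_n \mid c_j) P(c_j) / P(a_1, a_2, \ldots, a_n)$
$= \arg\max_{c_j \in C} P(a_1, a_2, \ldots, a_n \mid c_j) P(c_j)$

Naïve Bayes assumption: $P(a_1, a_2, \ldots, a_n \mid c_j) = \prod_i P(a_i \mid c_j)$
Naïve Bayes classifier: $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(a_i \mid c_j)$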
Naïve Bayes Learning Algorithm
Naïve_Bayes_Learn(examples)
  for each target value cj
    estimate P(cj)
    for each attribute value ai of each attribute a
      estimate P(ai|cj)
Classify_New_Instance(x)
  $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{a_i \in x} P(a_i \mid c_j)$
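The two procedures above might be sketched in Python as follows. Probabilities are estimated as plain relative frequencies, and the toy training set and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, label). Returns P(c) and a P(a|c) lookup."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (label, attribute name) -> value counts
    for attrs, label in examples:
        for name, value in attrs.items():
            value_counts[(label, name)][value] += 1
    priors = {c: k / len(examples) for c, k in class_counts.items()}

    def cond(value, name, label):
        # Relative-frequency estimate n_c / n of P(attribute = value | label).
        return value_counts[(label, name)][value] / class_counts[label]

    return priors, cond

def classify_new_instance(x, priors, cond):
    """Return argmax_c P(c) * prod_i P(a_i|c) and the unnormalized scores."""
    scores = {}
    for c, p in priors.items():
        score = p
        for name, value in x.items():
            score *= cond(value, name, c)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy training set.
examples = [
    ({"wind": "weak", "sky": "sunny"}, "yes"),
    ({"wind": "weak", "sky": "rain"}, "yes"),
    ({"wind": "strong", "sky": "sunny"}, "yes"),
    ({"wind": "strong", "sky": "rain"}, "no"),
    ({"wind": "strong", "sky": "sunny"}, "no"),
]
priors, cond = naive_bayes_learn(examples)
label, scores = classify_new_instance({"wind": "weak", "sky": "sunny"}, priors, cond)
print(label, scores)
```

Note that the class "no" gets a score of exactly zero here, because no "no" example has wind=weak; a single unseen attribute value wipes out the whole product.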
Naïve Bayes Example

Consider PlayTennis and new instance
<Outlook=Sunny, Temp=cool, Humidity=high, Wind=strong>

Compute $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{a_i \in x} P(a_i \mid c_j)$
playtennis (9+,5-)
P(yes) = 9/14, P(no) = 5/14
wind=strong (3+,3-)
P(strong|yes) = 3/9, P(strong|no) = 3/5
…
P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005
P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.021
Since 0.021 > 0.005, cNB = no.
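The two products can be recomputed as a sanity check. The slide elides most of the conditional probabilities, so the values below are an assumption: they are taken from the standard PlayTennis table in Mitchell's chapter 6, and they reproduce the 0.005 and 0.021 quoted above.

```python
# Class priors from the slide.
priors = {"yes": 9 / 14, "no": 5 / 14}

# Conditional probabilities P(value | class). Only the wind=strong entries appear
# on the slide; the others are assumed from the standard PlayTennis table.
likelihoods = {
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no": {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}
instance = ["sunny", "cool", "high", "strong"]

scores = {}
for c in priors:
    score = priors[c]
    for value in instance:
        score *= likelihoods[c][value]
    scores[c] = score

c_nb = max(scores, key=scores.get)
# Normalizing the two scores gives the conditional probability of "no".
p_no = scores["no"] / sum(scores.values())
print(c_nb, scores, p_no)
```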
Estimating Probabilities

What if none (nc = 0) of the training instances with target
value cj have attribute value ai?
Then $P(a_i|c_j) = n_c/n = 0$, and therefore $P(c_j) \prod_{a_i \in x} P(a_i|c_j) = 0$
Solution: Bayesian estimate (m-estimate) for P(ai|cj)
$P(a_i|c_j) = (n_c + m p)/(n + m)$
 n : number of training examples for which c = cj
 nc : number of examples for which c = cj and a = ai
 p : prior estimate of P(ai|cj)
 m : weight given to the prior (number of “virtual” examples)
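A minimal sketch of this estimate follows. The counts in the example and the uniform prior p = 1/3 (an attribute with three possible values) are hypothetical.

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate (n_c + m*p) / (n + m) of P(a_i | c_j)."""
    return (n_c + m * p) / (n + m)

# Attribute value never seen with this class: n_c = 0 out of n = 5 examples.
# Assumed prior: uniform over 3 attribute values, weighted like m = 3 virtual examples.
smoothed = m_estimate(0, 5, 1 / 3, 3)

# With m = 0 the estimate falls back to the raw relative frequency n_c / n.
raw = m_estimate(3, 9, 0.5, 0)
print(smoothed, raw)
```

The smoothed estimate is small but nonzero, so one unseen attribute value no longer forces the whole product to zero.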
Bayesian Belief Networks




the naïve assumption of conditional independence is too
restrictive
the full joint probability distribution is intractable due to lack of data
Bayesian belief networks describe conditional
independence among subsets of variables
this allows combining prior knowledge about causal
relationships among variables with observed data
Conditional Independence
Definition: X is conditionally independent of Y given Z if the
probability distribution governing X is independent of the
value of Y given the value of Z, that is, if
$\forall x_i, y_j, z_k: P(X=x_i \mid Y=y_j, Z=z_k) = P(X=x_i \mid Z=z_k)$
or more compactly P(X|Y,Z) = P(X|Z)
Example: Thunder is conditionally independent of Rain given
Lightning
P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
Notice: P(Thunder|Rain) ≠ P(Thunder)
Naïve Bayes uses conditional independence to justify:
P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)
Bayesian Belief Network
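[Figure: directed acyclic graph over the nodes Storm, Lightning, Thunder, BusTourGroup, Campfire, Forestfire]
Conditional probability table for Campfire (C) given Storm (S) and BusTourGroup (B):
        S,B    S,¬B   ¬S,B   ¬S,¬B
C       0.4    0.1    0.8    0.2
¬C      0.6    0.9    0.2    0.8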
Network represents a set of conditional independence assertions:
 Each node is conditionally independent of its non-descendants,
given its immediate predecessors. (directed acyclic graph)
Bayesian Belief Network
[Figure: the same network over Storm, Lightning, Thunder, BusTourGroup, Campfire, Forestfire; the Campfire table gives P(C|S,B)]
Network represents the joint probability distribution over all variables,
e.g. P(Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire)
$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$
The joint distribution is fully defined by the graph plus the P(yi|Parents(Yi))
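To illustrate the factorization, the sketch below evaluates the joint distribution of a three-node fragment Storm, BusTourGroup, Campfire. The Campfire table is the one from the slides; the priors for Storm and BusTourGroup, and the assumption that these two nodes have no parents, are hypothetical and introduced only for this example.

```python
# P(Campfire = True | Storm, BusTourGroup), from the table P(C|S,B).
p_campfire = {
    (True, True): 0.4,
    (True, False): 0.1,
    (False, True): 0.8,
    (False, False): 0.2,
}

# Hypothetical priors (not given in the slides).
p_storm = 0.2
p_bus = 0.5

def bernoulli(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(storm, bus, campfire):
    """P(S, B, C) = P(S) P(B) P(C | S, B), assuming S and B are root nodes."""
    return (bernoulli(p_storm, storm)
            * bernoulli(p_bus, bus)
            * bernoulli(p_campfire[(storm, bus)], campfire))

values = (True, False)
# Summing the joint over all assignments must give 1.
total = sum(joint(s, b, c) for s in values for b in values for c in values)
# Marginal P(Campfire = True), obtained by summing out Storm and BusTourGroup.
p_c = sum(joint(s, b, True) for s in values for b in values)
print(total, p_c)
```

The full six-variable network would multiply one such factor per node, each conditioned only on that node's parents.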
Expectation Maximization EM
when to use
 data is only partially observable
 unsupervised clustering: target value unobservable
 supervised learning: some instance attributes
unobservable
applications
 training Bayesian Belief Networks
 unsupervised clustering
 learning hidden Markov models
Generating Data from Mixture of Gaussians
Each instance x generated by
 choosing one of the k Gaussians at random
 Generating an instance according to that Gaussian
EM for Estimating k Means
Given:
 instances from X generated by a mixture of k Gaussians
 unknown means <μ1,…,μk> of the k Gaussians
 we do not know which instance xi was generated by which
Gaussian
Determine:
 maximum likelihood estimates of <μ1,…,μk>
Think of the full description of each instance as yi = <xi,zi1,zi2> (here k = 2)
 zij is 1 if xi was generated by the j-th Gaussian, otherwise 0
 xi is observable
 zij is unobservable
EM for Estimating k Means
EM algorithm: pick a random initial h = <μ1,μ2>, then iterate
E step: Calculate the expected value E[zij] of each hidden
variable zij, assuming the current hypothesis h = <μ1,μ2>
holds.
$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{\exp(-(x_i - \mu_j)^2 / 2\sigma^2)}{\sum_{n=1}^{2} \exp(-(x_i - \mu_n)^2 / 2\sigma^2)}$

M step: Calculate a new maximum likelihood hypothesis
h' = <μ1',μ2'>, assuming the value taken on by each hidden
variable zij is its expected value E[zij] calculated in the E step. Replace h = <μ1,μ2> by h' = <μ1',μ2'>.
$\mu_j \leftarrow \sum_{i=1}^{m} E[z_{ij}] x_i / \sum_{i=1}^{m} E[z_{ij}]$
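The E and M steps for two one-dimensional Gaussians with a known, shared sigma could be sketched as below. The synthetic data and the fixed (rather than random) initial means are assumptions made so that the example is reproducible.

```python
import math
import random

def em_two_means(xs, sigma, mu_init, iterations=50):
    """EM for the means of a mixture of two Gaussians with known, equal sigma."""
    mu = list(mu_init)
    for _ in range(iterations):
        # E step: E[z_ij] is the normalized Gaussian density of x_i under mean j.
        expectations = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            expectations.append([wj / total for wj in w])
        # M step: each mean becomes the E[z_ij]-weighted average of the data.
        mu = [sum(e[j] * x for e, x in zip(expectations, xs))
              / sum(e[j] for e in expectations)
              for j in range(2)]
    return mu

# Hypothetical data: 200 points around -2 and 200 points around 3, sigma = 1.
random.seed(0)
xs = ([random.gauss(-2.0, 1.0) for _ in range(200)]
      + [random.gauss(3.0, 1.0) for _ in range(200)])

mu = em_two_means(xs, sigma=1.0, mu_init=(-1.0, 1.0))
print(mu)
```

Since EM only finds a local maximum, a different initialization can lead to a different solution, for example with the two means swapped.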
EM Algorithm
Converges to a local maximum of the likelihood and provides
estimates of the hidden variables zij.
In fact, it finds a local maximum of E[ln P(Y|h)]
 Y is the complete data (observable plus unobservable
variables)
 The expected value is taken over the possible values of the
unobserved variables in Y
General EM Problem
Given:
 observed data X = {x1,…,xm}
 unobserved data Z = {z1,…,zm}
 parameterized probability distribution P(Y|h) where
 Y = {y1,…,ym} is the full data yi=<xi,zi>
 h are the parameters
Determine:
 h that (locally) maximizes E[ln P(Y|h)]
Applications:
 train Bayesian Belief Networks
 unsupervised clustering
 hidden Markov models
General EM Method
Define the function Q(h'|h), the expected log-likelihood of the full data
Y = X ∪ Z, where the observed X and the current parameters h
are used to estimate Z
$Q(h'|h) = E[\ln P(Y|h') \mid h, X]$
EM algorithm:
Estimation (E) step: Calculate Q(h'|h) using the current
hypothesis h and the observed data X to estimate the
probability distribution over Y.
$Q(h'|h) = E[\ln P(Y|h') \mid h, X]$
Maximization (M) step: Replace hypothesis h by the
hypothesis h' that maximizes this Q function.
$h \leftarrow \arg\max_{h' \in H} Q(h'|h)$
