Machine Learning: Lecture 6
Bayesian Learning
(Based on Chapter 6 of Mitchell, T.,
Machine Learning, 1997)
1
An Introduction
• Bayesian Decision Theory came long before Version
Spaces, Decision Tree Learning and Neural Networks. It
was studied in the field of Statistical Theory and more
specifically, in the field of Pattern Recognition.
• Bayesian Decision Theory is at the basis of important
learning schemes such as the Naïve Bayes Classifier,
Learning Bayesian Belief Networks and the EM
Algorithm.
• Bayesian Decision Theory is also useful as it provides a
framework within which many non-Bayesian classifiers
can be studied (see [Mitchell, Sections 6.3–6.6]).
2
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
• Prior probability of h, P(h): it reflects any background
knowledge we have about the chance that h is a correct
hypothesis (before having observed the data).
• Prior probability of D, P(D): it reflects the probability
that training data D will be observed given no
knowledge about which hypothesis h holds.
• Conditional probability of observation D, P(D|h): it
denotes the probability of observing data D given some
world in which hypothesis h holds.
3
Bayes Theorem (Cont’d)
• Posterior probability of h, P(h|D): it represents
the probability that h holds given the observed
training data D. It reflects our confidence that h
holds after we have seen the training data D, and
it is the quantity that Machine Learning
researchers are interested in.
• Bayes Theorem allows us to compute P(h|D):
P(h|D) = P(D|h) P(h) / P(D)
4
Maximum A Posteriori (MAP)
Hypothesis and Maximum Likelihood
• Goal: To find the most probable hypothesis h from a set
of candidate hypotheses H given the observed data D.
• MAP Hypothesis:
h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)
• If every hypothesis in H is equally probable a priori, we
only need to consider the likelihood of the data D given
h, P(D|h). Then h_MAP becomes the Maximum Likelihood hypothesis:
h_ML = argmax_{h ∈ H} P(D|h)
5
An Example
• Two alternative hypotheses:
  – The patient has a particular form of cancer
  – The patient does not
• Observed data: a positive (+) or negative (-) result of a particular lab test
• Prior knowledge: 0.008 of the population have this disease

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(-|cancer) = 0.02
P(+|¬cancer) = 0.03      P(-|¬cancer) = 0.97

h_MAP = argmax_{h ∈ H} P(h|+)
      = argmax { P(+|cancer) P(cancer), P(+|¬cancer) P(¬cancer) }
      = argmax { 0.98 × 0.008, 0.03 × 0.992 } = argmax { 0.0078, 0.0298 }
      = ¬cancer
6
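To make the argmax concrete, here is a minimal Python sketch (not part of the original slides) that reproduces the cancer example above; the priors and likelihoods are the ones listed on the slide, and dropping the normalizing constant P(+) does not change the argmax.

```python
# MAP hypothesis for the cancer / lab-test example (illustrative sketch).
priors = {"cancer": 0.008, "not_cancer": 0.992}        # P(h)
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}  # P(+ | h)

# Unnormalized posteriors P(+|h) P(h); dividing by P(+) would not change the argmax.
scores = {h: likelihood_pos[h] * priors[h] for h in priors}

print(scores)                       # cancer: ~0.0078, not_cancer: ~0.0298
print(max(scores, key=scores.get))  # not_cancer
```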
Bayes Theorem and Concept Learning
• Assumptions:
  – Training data D is noise free.
  – The target concept c is contained in H.
  – There is no a priori reason to believe that any hypothesis is more probable
    than any other.

P(h) = 1/|H|   for all h ∈ H

P(D|h) = 1   if d_i = h(x_i) for all d_i in D = {d_1, ..., d_m}
         0   otherwise

Note: P(D|h) = 1 if D is consistent with h.
7
Bayes Theorem and Concept Learning
1. h is inconsistent with D (P(D|h) = 0):

   P(h|D) = 0 · P(h) / P(D) = 0

2. h is consistent with D (P(D|h) = 1):

   P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VS_{H,D}| / |H|) = 1 / |VS_{H,D}|
   if h is consistent with D, where VS_{H,D} is the version space.

Note: the sum of P(h|D) over all h ∈ H is 1.

If P(h) = 1/|H| and if P(D|h) = 1 if D is consistent with h, and 0
otherwise, then every hypothesis in the version space
resulting from D is a MAP hypothesis.
See Fig. 6.1 in text.
8
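As a small illustration of this brute-force Bayesian result, the sketch below (not from the slides; the threshold hypotheses and data are made up) computes P(h|D) over a tiny hypothesis space and shows that the consistent hypotheses split the posterior mass, 1/|VS_{H,D}| each.

```python
# Brute-force Bayesian concept learning over a tiny hypothesis space (illustrative sketch).
# Assumes a uniform prior, noise-free data, and that the target concept is in H.
def posterior(hypotheses, D):
    """Return P(h|D): 1/|VS_{H,D}| for consistent hypotheses, 0 otherwise."""
    consistent = [h for h in hypotheses if all(h(x) == d for x, d in D)]
    vs_size = len(consistent)                # |VS_{H,D}|
    return {h: (1.0 / vs_size if h in consistent else 0.0) for h in hypotheses}

# Example: instances are integers, hypotheses are threshold concepts "x >= t".
H = [lambda x, t=t: x >= t for t in range(5)]
D = [(3, True), (1, False)]                  # consistent thresholds: t = 2 or t = 3
post = posterior(H, D)
print([round(p, 2) for p in post.values()])  # two hypotheses share P(h|D) = 0.5 each
```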
MAP Hypotheses and Consistent Learners
• Every hypothesis consistent with D is a MAP
hypothesis.
• Consistent Learner:
  – It outputs a hypothesis that commits zero errors over the
    training examples.
  – Every consistent learner outputs a MAP hypothesis if we
    assume a uniform prior probability distribution over H
    and if we assume deterministic, noise-free data.
  – Example: the Find-S algorithm outputs MAP hypotheses that
    are maximally specific members of the version space.
• Are there other probability distributions for P(h) and P(D|h)?
  – Yes, e.g. priors with P(h1) ≥ P(h2) whenever h1 is more specific than h2.
9
MAP Hypotheses and Consistent Learners
• The Bayesian framework provides one way to
characterize the behavior of learning algorithms
(e.g. Find-S), even when the learning algorithm
does not explicitly manipulate probabilities.
10
ML and Least-Squared Error Hypotheses
• Under certain assumptions regarding noise in the
data, minimizing the mean squared error (what
common neural nets do) corresponds to computing
the maximum likelihood hypothesis.

h_ML = argmax_{h ∈ H} P(D|h)
h_ML = argmax_{h ∈ H} p(D|h),   where p(D|h) refers to a probability density.

Training instances: x_1, ..., x_m
D = {d_1, ..., d_m},   d_i = f(x_i) + e_i

Assuming that the training examples are mutually independent given h,

h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} p(d_i | h).
11
ML and Least-Squared Error Hypotheses
If e_i ~ N(0, σ²), then

h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{-(d_i - μ)² / (2σ²)}.   Considering μ = f(x_i) = h(x_i),

h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{-(d_i - h(x_i))² / (2σ²)}.   Taking the log,

h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) - (d_i - h(x_i))² / (2σ²) ]
     = argmax_{h ∈ H} Σ_{i=1}^{m} - (d_i - h(x_i))²
     = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i - h(x_i))²
12
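The derivation above says that, under Gaussian noise, the least-squares hypothesis is the ML hypothesis. The short numpy sketch below (not from the slides; the linear target and noise level are made up) fits a line by least squares and checks that perturbing it can only lower the Gaussian log-likelihood.

```python
# With Gaussian noise, the least-squares fit and the maximum-likelihood fit coincide
# (illustrative sketch): fit a line d = w*x + b to noisy targets and compare likelihoods.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)  # d_i = f(x_i) + e_i, e_i ~ N(0, sigma^2)

# Least-squares hypothesis (closed form via the normal equations).
A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, d, rcond=None)[0]

def gaussian_log_likelihood(w, b, sigma=0.1):
    residuals = d - (w * x + b)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - residuals**2 / (2 * sigma**2))

# Perturbing the least-squares solution can only lower the log-likelihood.
print(gaussian_log_likelihood(w, b) >= gaussian_log_likelihood(w + 0.1, b))  # True
```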
ML and Least-Squared Error Hypotheses
• Assumptions:
  – MAP becomes ML when every P(h) is equally probable.
  – Normally distributed error.
  – Noise in the target values, but noise-free attribute values.
    • Target value: Weight
    • Attributes: Age, Height
13
ML Hypotheses for Predicting Probability
• Binary classifier learned by a neural net
  – Training data: D = {(x_1, d_1), (x_2, d_2), ..., (x_m, d_m)}, where d_i = 1 or 0
  – Single output node

P(D|h) = ∏_{i=1}^{m} P(x_i, d_i | h) = ∏_{i=1}^{m} P(d_i | h, x_i) P(x_i)
when x_i is independent of h, the hypothesis.

P(d_i | h, x_i) = h(x_i)       if d_i = 1
                  1 - h(x_i)   if d_i = 0

Note: the output of the hypothesis (or network) is related to this probability.

P(D|h) = ∏_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)
14
ML Hypotheses for Predicting Probability
ML hypothesis:

h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)

Assume that the x_i are equally probable:

h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i}

Note: Binomial distribution: the probability that flipping each of m distinct coins
will produce the outcome d_1, d_2, ..., d_m, assuming that each coin x_i has
probability h(x_i) of producing a head.

Taking the log,

h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]

Cross entropy: -Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]
15
ML Hypotheses for Predicting Probability
• Gradient search in a single-layer NN

Update rule for weight w_jk from input k to unit j:

w_jk ← w_jk + η Σ_{i=1}^{m} (d_i - h(x_i)) x_ijk

x_ijk: the kth input to unit j for the ith training example

• Summary:
  – Minimizing the sum of squared errors seeks the ML hypothesis under the
    assumption that the training data can be modeled by Normally
    distributed noise added to the target function value.
  – The rule that minimizes cross entropy seeks the ML hypothesis
    under the assumption that the observed Boolean value is a
    probabilistic function of the input instance.
16
Let G(h, D) = Σ_{i=1}^{m} d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)).

∂G(h, D)/∂w_jk = Σ_{i=1}^{m} [∂G(h, D)/∂h(x_i)] · [∂h(x_i)/∂w_jk]

               = Σ_{i=1}^{m} [∂(d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)))/∂h(x_i)] · [∂h(x_i)/∂w_jk]

               = Σ_{i=1}^{m} [(d_i - h(x_i)) / (h(x_i)(1 - h(x_i)))] · [∂h(x_i)/∂w_jk]

For a sigmoid unit, ∂h(x_i)/∂w_jk = h(x_i)(1 - h(x_i)) x_ijk, so

∂G(h, D)/∂w_jk = Σ_{i=1}^{m} (d_i - h(x_i)) x_ijk

and the gradient-ascent update is Δw_jk = η Σ_{i=1}^{m} (d_i - h(x_i)) x_ijk.
17
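A compact sketch of the weight update derived above, applied to a single sigmoid unit; the data, learning rate, and number of epochs below are made up for illustration and are not from the slides.

```python
# Gradient-ascent training of a single sigmoid unit with the cross-entropy rule
# w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik derived above (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, d, eta=0.1, epochs=1000):
    """X: (m, n) inputs (include a bias column), d: (m,) binary targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        h = sigmoid(X @ w)          # h(x_i) for every training example
        w += eta * X.T @ (d - h)    # gradient of the cross-entropy likelihood G(h, D)
    return w

# Toy usage: a threshold concept on one input, plus a bias column (made-up data).
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(-2, -0.5, 50), rng.uniform(0.5, 2, 50)])
X = np.column_stack([x, np.ones_like(x)])
d = (x > 0).astype(float)
w = train_sigmoid_unit(X, d)
print(sigmoid(np.array([1.0, 1.0]) @ w))    # close to 1: the unit predicts d = 1 for x = 1
```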
Bayes Optimal Classifier
• One great advantage of Bayesian Decision Theory is
that it gives us a lower bound on the classification error
that can be obtained for a given problem.
• Bayes Optimal Classification: The most probable
classification of a new instance is obtained by
combining the predictions of all hypotheses, weighted
by their posterior probabilities:
argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
where V is the set of all the values a classification can take
and v_j is one possible such classification.
• Unfortunately, the Bayes Optimal Classifier is usually too
costly to apply! ==> Naïve Bayes Classifier
18
P(h1|D) = .4    P(-|h1) = 0    P(+|h1) = 1
P(h2|D) = .3    P(-|h2) = 1    P(+|h2) = 0
P(h3|D) = .3    P(-|h3) = 1    P(+|h3) = 0

Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i ∈ H} P(-|h_i) P(h_i|D) = .6

argmax_{v_j ∈ {+,-}} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = -
19
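The numbers above can be checked directly; the sketch below (not from the slides) computes the posterior-weighted vote and shows that the Bayes optimal classification differs from the prediction of the single MAP hypothesis h1.

```python
# Bayes optimal classification for the three-hypothesis example above (illustrative sketch).
# posteriors[h] = P(h|D); votes[h][v] = P(v|h).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
votes = {"h1": {"+": 1.0, "-": 0.0},
         "h2": {"+": 0.0, "-": 1.0},
         "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(posteriors, votes, values=("+", "-")):
    scores = {v: sum(votes[h][v] * posteriors[h] for h in posteriors) for v in values}
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal(posteriors, votes)
print({v: round(s, 1) for v, s in scores.items()})   # {'+': 0.4, '-': 0.6}
print(label)   # '-': differs from the MAP hypothesis h1, which predicts '+'
```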
Gibbs algorithm
• Algorithm:
  1. Choose a hypothesis h from H at random, according to the posterior
     probability distribution over H.
  2. Use h to predict the classification of the next instance x.
• Validity of the Gibbs algorithm (Haussler et al., 1994):
  Error(Gibbs algorithm) ≤ 2 × Error(Bayes optimal classifier), in expectation.
20
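A minimal sketch of the Gibbs algorithm (not from the slides), reusing the three-hypothesis example from the Bayes optimal slide: instead of averaging over all hypotheses, a single hypothesis is drawn according to P(h|D) and used to classify.

```python
# The Gibbs algorithm draws one hypothesis according to P(h|D) and classifies with it
# (illustrative sketch; same example as the Bayes optimal classifier above).
import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
votes = {"h1": "+", "h2": "-", "h3": "-"}        # each h's deterministic prediction

def gibbs_classify(posteriors, votes, rng):
    h = rng.choices(list(posteriors), weights=list(posteriors.values()))[0]
    return votes[h]                              # predict with the single sampled h

print([gibbs_classify(posteriors, votes, random.Random(seed)) for seed in range(5)])
# mostly '-', occasionally '+' whenever h1 happens to be drawn
```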
Naïve Bayes Classifier
• Let each instance x of a training set D be described by a
conjunction of n attribute values <a1, a2, ..., an> and let f(x),
the target function, be such that f(x) ∈ V, a finite set.
• Bayesian Approach:
v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)
      = argmax_{vj ∈ V} [P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)]
      = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)
• Naïve Bayes Approach: We assume that the attribute
values are conditionally independent given the target value, so that
P(a1, a2, ..., an | vj) = ∏_i P(ai | vj) [and not too large a data set is
required].
• Naïve Bayes Classifier:
v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
21
An Illustrative Example
• New instance: (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong)
• P(Wind=strong | PlayTennis=yes) = 3/9 = .33
• P(Wind=strong | PlayTennis=no) = 3/5 = .60
• P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
• P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206
• v_NB = no
22
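The two products above can be reproduced with a few lines of Python. The Wind probabilities come from this slide; the remaining conditional probabilities and the priors are the usual estimates from Mitchell's PlayTennis training set (Table 3.2).

```python
# Naive Bayes computation for the PlayTennis instance above (illustrative sketch).
from math import prod

priors = {"yes": 9/14, "no": 5/14}
cond = {  # P(attribute value | PlayTennis = v)
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}
instance = ["sunny", "cool", "high", "strong"]

scores = {v: priors[v] * prod(cond[v][a] for a in instance) for v in priors}
print({v: round(s, 4) for v, s in scores.items()})   # {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))                   # no
```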
Naïve Bayes Classifier
• Estimating the probabilities P(ai|vj):
  m-estimate of probability = (nc + m·p) / (n + m)
  – m: equivalent sample size, p: prior estimate of the probability
  – nc: number of training examples with target value vj for which attribute value ai holds
  – n: number of training examples with target value vj
  Example: Table 3.2
23
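A one-function sketch of the m-estimate; the choice of p = 1/2 and m = 2 below is just for illustration.

```python
# m-estimate smoothing of P(a_i | v_j) toward a prior estimate p (illustrative sketch).
def m_estimate(nc, n, p, m):
    """nc occurrences out of n observations, prior estimate p, equivalent sample size m."""
    return (nc + m * p) / (n + m)

# E.g. P(Wind=strong | PlayTennis=no) with nc=3, n=5, a uniform prior p=1/2 and m=2:
print(m_estimate(3, 5, 0.5, 2))   # 0.571..., pulled from the raw 3/5 = 0.6 toward the prior 0.5
```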
An Example: Learning to classify text
• Independence assumption:

v_NB = argmax_{vj ∈ {like, dislike}} P(vj) ∏_{i=1}^{111} P(ai | vj)
     = argmax_{vj ∈ {like, dislike}} P(vj) P(a1="our" | vj) P(a2="approach" | vj) ... P(a111="trouble" | vj)

  – An incorrect assumption, but there is no practical alternative.
  – Example: a document about machine learning
  – Impractical: 2 target values × 111 word positions × 50,000 words
• Position-independence assumption:
  – 2 target values × 50,000 words
  – See the algorithm at page 183:
    Learn_Bayes_Text / Classify_Bayes_Text
24
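The sketch below is a compact reading of Learn_Bayes_Text / Classify_Bayes_Text: it estimates P(vj) and position-independent word probabilities with (nk + 1) / (n + |Vocabulary|) smoothing and classifies in log space. The two-document corpus at the end is made up for illustration.

```python
# Compact sketch in the spirit of Learn_Bayes_Text / Classify_Bayes_Text (Mitchell, p. 183).
import math
from collections import Counter

def learn_bayes_text(docs):
    """docs: list of (list_of_words, label). Returns priors and smoothed word probabilities."""
    vocab = {w for words, _ in docs for w in words}
    labels = {label for _, label in docs}
    priors, word_probs = {}, {}
    for v in labels:
        texts = [words for words, label in docs if label == v]
        priors[v] = len(texts) / len(docs)
        counts = Counter(w for words in texts for w in words)
        n = sum(counts.values())
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, word_probs

def classify_bayes_text(words, priors, word_probs):
    def score(v):  # log P(v) + sum of log P(w|v) over known words
        return math.log(priors[v]) + sum(math.log(word_probs[v][w])
                                         for w in words if w in word_probs[v])
    return max(priors, key=score)

# Toy usage with a made-up two-document "corpus".
docs = [("our approach works".split(), "like"), ("trouble ahead".split(), "dislike")]
priors, word_probs = learn_bayes_text(docs)
print(classify_bayes_text("our approach".split(), priors, word_probs))   # like
```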
Bayesian Belief Networks
• The Bayes Optimal Classifier is often too
costly to apply.
• The Naïve Bayes Classifier uses the
conditional independence assumption to
defray these costs. However, in many cases,
such an assumption is overly restrictive.
• Bayesian belief networks provide an
intermediate approach which allows stating
conditional independence assumptions that
apply to subsets of the variables.
25
Conditional Independence
• We say that X is conditionally independent of Y
given Z if the probability distribution governing X is
independent of the value of Y given a value for Z.
• i.e., (∀ xi, yj, zk) P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)
• or, P(X|Y,Z) = P(X|Z)
• This definition can be extended to sets of variables
as well: we say that the set of variables X1...Xl is
conditionally independent of the set of variables Y1...Ym
given the set of variables Z1...Zn, if
P(X1...Xl | Y1...Ym, Z1...Zn) = P(X1...Xl | Z1...Zn)
26
Representation in Bayesian
Belief Networks
[Figure: a Bayesian network over the nodes Storm, BusTourGroup, Lightning,
Campfire, Thunder and ForestFire]

Associated with each node is a conditional probability table, which specifies
the conditional distribution for the variable given its immediate parents in
the graph.

Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents.
27
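To illustrate what the conditional probability tables buy us, the sketch below factorizes the joint distribution as the product of P(node | parents). The graph structure follows Mitchell's Storm/ForestFire example; all probability values are made up for illustration.

```python
# A Bayesian network factorizes the joint as the product of P(node | parents)
# (illustrative sketch; the CPT numbers below are invented, not Mitchell's).
parents = {"Storm": [], "BusTourGroup": [], "Lightning": ["Storm"],
           "Campfire": ["Storm", "BusTourGroup"], "Thunder": ["Lightning"],
           "ForestFire": ["Storm", "Lightning", "Campfire"]}

# cpt[node][parent_values] = P(node = True | parents = parent_values)
cpt = {"Storm": {(): 0.1}, "BusTourGroup": {(): 0.3},
       "Lightning": {(True,): 0.7, (False,): 0.01},
       "Campfire": {(True, True): 0.4, (True, False): 0.1,
                    (False, True): 0.8, (False, False): 0.2},
       "Thunder": {(True,): 0.95, (False,): 0.05},
       "ForestFire": {(True, True, True): 0.5, (True, True, False): 0.4,
                      (True, False, True): 0.3, (True, False, False): 0.1,
                      (False, True, True): 0.3, (False, True, False): 0.2,
                      (False, False, True): 0.2, (False, False, False): 0.01}}

def joint(assignment):
    """P(full assignment) = product over nodes of P(node | its parents)."""
    p = 1.0
    for node, pars in parents.items():
        key = tuple(assignment[par] for par in pars)
        p_true = cpt[node][key]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

print(joint({"Storm": True, "BusTourGroup": False, "Lightning": True,
             "Campfire": False, "Thunder": True, "ForestFire": False}))  # ~0.025 here
```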
Inference in Bayesian Belief
Networks
• A Bayesian Network can be used to compute the
probability distribution for any subset of network
variables given the values or distributions for any
subset of the remaining variables.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is
known to be NP-hard.
• In theory, approximate techniques (such as Monte
Carlo methods) can also be NP-hard, though in
practice, many such methods have been shown to be
useful.
28
Learning Bayesian Belief
Networks
3 Cases:
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
==> Trivial Case: just estimate the conditional
probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data. ==> Similar to learning the weights for the hidden
units of a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance. ==>
Use a heuristic search or constraint-based technique to
search through potential structures.
29
The EM Algorithm: Learning with
unobservable relevant variables.
• Example: Assume that data points have been uniformly
generated from k distinct Gaussians with the same known
variance. The problem is to output a hypothesis
h = <μ1, μ2, ..., μk> that describes the means of each of
the k distributions. In particular, we are looking for a
maximum likelihood hypothesis for these means.
• We extend the problem description as follows: for each
point xi, there are k hidden variables zi1, ..., zik such that
zil = 1 if xi was generated by normal distribution l and
ziq = 0 for all q ≠ l.
30
The EM Algorithm (Cont’d)
• An arbitrary initial hypothesis h = <μ1, μ2, ..., μk> is chosen.
• The EM Algorithm iterates over two steps:
• Step 1 (Estimation, E): Calculate the expected value
E[zij] of each hidden variable zij, assuming that the
current hypothesis h = <μ1, μ2, ..., μk> holds.
• Step 2 (Maximization, M): Calculate a new maximum
likelihood hypothesis h' = <μ1', μ2', ..., μk'>, assuming the
value taken on by each hidden variable zij is its expected
value E[zij] calculated in step 1. Then replace the
hypothesis h = <μ1, μ2, ..., μk> by the new hypothesis
h' = <μ1', μ2', ..., μk'> and iterate.
The EM Algorithm can be applied to more general problems.
31
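A short numpy sketch of these two steps for the k-means-of-Gaussians problem described above; the data, k = 2, and the shared σ = 1 are made up for illustration.

```python
# EM for the means of k Gaussians with a known, shared variance (illustrative sketch).
import numpy as np

def em_gaussian_means(x, k, sigma=1.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)            # arbitrary initial hypothesis
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new means are the E[z_ij]-weighted averages of the points
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mu

# Two well-separated clusters; EM should recover means near 0 and 5.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(np.sort(em_gaussian_means(x, k=2)))   # two means, close to 0 and 5
```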