Machine Learning
Machine Learning: Lecture 6
Bayesian Learning
(Based on Chapter 6 of Mitchell, T.,
Machine Learning, 1997)
1
An Introduction
Bayesian Decision Theory came long before Version
Spaces, Decision Tree Learning and Neural Networks. It
was studied in the field of Statistical Theory and more
specifically, in the field of Pattern Recognition.
Bayesian Decision Theory is at the basis of important
learning schemes such as the Naïve Bayes Classifier,
Learning Bayesian Belief Networks and the EM
Algorithm.
Bayesian Decision Theory is also useful as it provides a
framework within which many non-Bayesian classifiers
can be studied (see [Mitchell, Sections 6.3, 6.4, 6.5, 6.6]).
2
Bayes Theorem
Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
Prior probability of h, P(h): it reflects any background
knowledge we have about the chance that h is a correct
hypothesis (before having observed the data).
Prior probability of D, P(D): it reflects the probability
that training data D will be observed given no
knowledge about which hypothesis h holds.
Conditional Probability of observation D, P(D|h): it
denotes the probability of observing data D given some
world in which hypothesis h holds.
3
Bayes Theorem (Cont’d)
Posterior probability of h, P(h|D): it represents
the probability that h holds given the observed
training data D. It reflects our confidence that h
holds after we have seen the training data D and
it is the quantity that Machine Learning
researchers are interested in.
Bayes Theorem allows us to compute P(h|D):
P(h|D)=P(D|h)P(h)/P(D)
4
Maximum A Posteriori (MAP)
Hypothesis and Maximum Likelihood
Goal: To find the most probable hypothesis h from a set
of candidate hypotheses H given the observed data D.
MAP Hypothesis, $h_{MAP} = \arg\max_{h \in H} P(h \mid D)$
$= \arg\max_{h \in H} P(D \mid h) P(h) / P(D)$
$= \arg\max_{h \in H} P(D \mid h) P(h)$
If every hypothesis in H is equally probable a priori, we
only need to consider the likelihood of the data D given
h, P(D|h). Then, hMAP becomes the Maximum
Likelihood hypothesis,
$h_{ML} = \arg\max_{h \in H} P(D \mid h)$
5
An Example
Two alternative hypotheses
  The patient has a particular form of cancer
  The patient does not
Observed data
  Positive (+) or negative (-) result of a particular lab test
Prior knowledge
  0.008 of the population have this disease
$P(cancer) = 0.008$, $P(\neg cancer) = 0.992$
$P(+ \mid cancer) = 0.98$, $P(- \mid cancer) = 0.02$
$P(+ \mid \neg cancer) = 0.03$, $P(- \mid \neg cancer) = 0.97$
For a positive test result:
$h_{MAP} = \arg\max_{h \in H} P(h \mid +)$
$= \arg\max \{ P(+ \mid cancer) P(cancer),\ P(+ \mid \neg cancer) P(\neg cancer) \}$
$= \arg\max \{ 0.98 \times 0.008,\ 0.03 \times 0.992 \} = \arg\max \{ 0.0078,\ 0.0298 \}$
$= \neg cancer$
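The arithmetic of this example can be checked with a short Python sketch. It uses only the probabilities stated on the slide; the dictionary names and the label `not_cancer` are illustrative choices, not part of the original. It also normalizes by P(+) to obtain the actual posterior, and contrasts the MAP choice with the ML choice, which ignores the prior.

```python
# Probabilities taken from the lab-test example above.
prior = {"cancer": 0.008, "not_cancer": 0.992}          # P(h)
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}   # P(+ | h)

# Unnormalized posteriors P(+ | h) P(h); MAP only needs these.
joint = {h: likelihood_pos[h] * prior[h] for h in prior}

# Normalizing by P(+) = sum_h P(+ | h) P(h) gives P(h | +).
evidence = sum(joint.values())
posterior = {h: joint[h] / evidence for h in joint}

h_map = max(joint, key=joint.get)                    # uses the prior
h_ml = max(likelihood_pos, key=likelihood_pos.get)   # ignores the prior

print(h_map, h_ml, round(posterior["cancer"], 3))
```

Even after a positive test the posterior probability of cancer is only about 0.21, because the disease is rare. The ML hypothesis would be "cancer", which shows how much the prior matters here.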
6
Bayes Theorem and Concept Learning
Assumption
Training data D is noise free.
The target concept c is contained in H
No a priori reason to believe that any hypothesis is more probable than
any other
$P(h) = \frac{1}{|H|}$ for all $h \in H$
$P(D \mid h) = 1$ if $d_i = h(x_i)$ for all $d_i \in D = \langle d_1, \ldots, d_m \rangle$, and $0$ otherwise
Note: $P(D \mid h) = 1$ if D is consistent with h.
7
Bayes Theorem and Concept Learning
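1. h is inconsistent with D ($P(D \mid h) = 0$):
$P(h \mid D) = \frac{0 \cdot P(h)}{P(D)} = 0$
2. h is consistent with D ($P(D \mid h) = 1$):
$P(h \mid D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$ if h is consistent with D,
where $VS_{H,D}$ is the version space of H with respect to D.
Note: the sum of all $P(h \mid D)$ is 1, which gives $P(D) = |VS_{H,D}| / |H|$.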
If P(h)=1/|H| and if P(D|h)=1 if D is consistent with h, and 0
otherwise, then every hypothesis in the version space
resulting from D is a MAP hypothesis.
See Fig. 6.1 in text
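A brute-force check of this result on a toy problem, as a sketch only. The hypothesis space (threshold functions on integers) and the three training examples are hypothetical and chosen purely for illustration; exact fractions are used so the posterior can be compared with 1/|VS| without rounding.

```python
from fractions import Fraction

# Hypothetical hypothesis space: h_t(x) = 1 if x >= t, for thresholds t = 0..10.
H = list(range(11))

def h(t, x):
    return int(x >= t)

# Hypothetical noise-free training data (x_i, d_i).
D = [(2, 0), (7, 1), (4, 0)]

prior = Fraction(1, len(H))  # uniform P(h) = 1/|H|

def likelihood(t):
    # P(D | h) = 1 if h agrees with every example, else 0.
    return 1 if all(h(t, x) == d for x, d in D) else 0

p_D = sum(likelihood(t) * prior for t in H)               # P(D) = |VS| / |H|
posterior = {t: likelihood(t) * prior / p_D for t in H}   # Bayes theorem
version_space = [t for t in H if likelihood(t) == 1]

print(version_space, posterior[version_space[0]])
```

Every threshold in the version space receives the same posterior 1/|VS|, and every other threshold receives 0, so each consistent hypothesis is a MAP hypothesis.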
8
MAP Hypotheses and Consistent Learners
Every hypothesis consistent with D is a MAP
hypothesis.
Consistent Learner
It outputs a hypothesis that commits zero errors over
training examples.
Every consistent learner outputs a MAP hypothesis if we
assume a uniform prior probability distribution over H
and if we assume deterministic, noise free data.
Example: Find-S algorithm outputs MAP hypotheses that
are maximally specific members of version space.
• Are there other probability distributions for P(h) and P(D|h) under which Find-S outputs MAP hypotheses?
– Yes, for example any prior with $P(h_1) \geq P(h_2)$ whenever h1 is more specific than h2.
9
MAP Hypotheses and Consistent Learners
The Bayesian framework offers one way to
characterize the behavior of learning algorithms
(e.g. Find-S), even when the learning algorithm
does not explicitly manipulate probabilities.
10
ML and Least-Squared Error Hypotheses
Under certain assumptions regarding noise in the
data, minimizing the mean squared error (what
common neural nets do) corresponds to computing
the maximum likelihood hypothesis.
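$h_{ML} = \arg\max_{h \in H} P(D \mid h)$
For continuous target values:
$h_{ML} = \arg\max_{h \in H} p(D \mid h)$, where $p(D \mid h)$ refers to a probability density.
Training instances: $x_1, \ldots, x_m$
$D = \langle d_1, \ldots, d_m \rangle$
$d_i = f(x_i) + e_i$
Assuming that the training examples are mutually independent given h,
$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)$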
11
ML and Least-Squared Error Hypotheses
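If $e_i \sim N(0, \sigma^2)$, then
$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(d_i - \mu)^2}$
By considering $\mu = f(x_i) = h(x_i)$,
$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}$
By taking the log function,
$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right]$
$= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i - h(x_i))^2$
$= \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2$
$= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$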
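The equivalence between maximizing the Gaussian likelihood and minimizing the sum of squared errors can be checked numerically. The sketch below is illustrative only: the target function f(x) = 2x, the noise level, and the candidate hypotheses h_w(x) = w·x on a grid of slopes are all assumptions made for the demonstration.

```python
import math
import random

random.seed(0)
sigma = 0.5                                   # assumed known noise std. deviation
xs = [i / 10 for i in range(20)]
# d_i = f(x_i) + e_i with hypothetical target f(x) = 2x and e_i ~ N(0, sigma^2)
ds = [2.0 * x + random.gauss(0, sigma) for x in xs]

# Hypothetical finite hypothesis space: h_w(x) = w * x for w = 0.0, 0.1, ..., 4.0
candidates = [w / 10 for w in range(41)]

def log_likelihood(w):
    # ln p(D | h_w) = sum_i [ ln(1 / sqrt(2 pi sigma^2)) - (d_i - h_w(x_i))^2 / (2 sigma^2) ]
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

def sum_squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

w_ml = max(candidates, key=log_likelihood)      # maximum likelihood hypothesis
w_ls = min(candidates, key=sum_squared_error)   # least-squared-error hypothesis
print(w_ml, w_ls)
```

Both searches select the same slope, because the log likelihood is a constant minus the squared error divided by 2σ².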
12
ML and Least-Squared Error Hypotheses
Assumptions
MAP reduces to ML when all hypotheses are equally probable a priori (uniform P(h)).
The errors are normally distributed with zero mean.
Noise affects the target values only; the attribute values are noise free.
• Target value: Weight
• Attributes: Age, Height
13
ML Hypotheses for Predicting Probability
Binary classifier of neural net
Training data
$D = \{(x_1, d_1), (x_2, d_2), \ldots, (x_m, d_m)\}$ where $d_i = 1$ or $0$
Single output node
$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) = \prod_{i=1}^{m} P(d_i \mid h, x_i) P(x_i)$
when $x_i$ is independent of the hypothesis h.
$P(d_i \mid h, x_i) = h(x_i)$ if $d_i = 1$
$P(d_i \mid h, x_i) = 1 - h(x_i)$ if $d_i = 0$
Note: the output of the hypothesis (or network) is interpreted as the probability that $d_i = 1$.
$P(D \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)$
14
ML Hypotheses for Predicting Probability
ML hypothesis
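$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)$
Assume that the $x_i$ are equally probable, so the factor $P(x_i)$ can be dropped:
$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i}$
Note: compare the binomial distribution.
This is the probability that flipping each of m distinct coins will produce
the outcome $\langle d_1, d_2, \ldots, d_m \rangle$, assuming that each coin $x_i$ has
probability $h(x_i)$ of producing a head.
Taking the log:
$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$
Cross entropy: $-\sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$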
15
ML Hypotheses for Predicting Probability
Gradient Search in single layer NN
Update rule for weight $w_{jk}$ from input k to unit j:
$w_{jk} \leftarrow w_{jk} + \eta \sum_{i=1}^{m} (d_i - h(x_i)) x_{ijk}$
$x_{ijk}$: kth input to unit j for the ith training example
Summary:
Minimizing the sum of squared errors seeks the ML hypothesis under
the assumption that the training data can be modeled by normally
distributed noise added to the target function value.
The rule that minimizes cross entropy seeks the ML hypothesis
under the assumption that the observed Boolean value is a
probabilistic function of the input instance.
16
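Let $G(h, D) = \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$
$\frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial G(h, D)}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}}$
$= \sum_{i=1}^{m} \frac{\partial (d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)))}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}}$
$= \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \frac{\partial h(x_i)}{\partial w_{jk}}$
For a sigmoid unit, $\frac{\partial h(x_i)}{\partial w_{jk}} = h(x_i)(1 - h(x_i)) x_{ijk}$, so
$\frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} h(x_i)(1 - h(x_i)) x_{ijk} = \sum_{i=1}^{m} (d_i - h(x_i)) x_{ijk}$
$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$, where $\Delta w_{jk} = \eta \frac{\partial G(h, D)}{\partial w_{jk}}$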
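The gradient-ascent rule on G(h, D) can be sketched for a single sigmoid unit as follows. This is a minimal illustration under stated assumptions: the six training examples, the learning rate `eta`, and the number of steps are hypothetical, and because there is only one unit the index j is dropped (the first input is fixed at 1.0 to act as a bias).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    # Output of a single sigmoid unit, read as P(d = 1 | x).
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)))

def G(w, data):
    # G(h, D) = sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))
    return sum(d * math.log(h(w, x)) + (1 - d) * math.log(1 - h(w, x))
               for x, d in data)

def grad_G(w, data):
    # dG/dw_k = sum_i (d_i - h(x_i)) x_ik
    return [sum((d - h(w, x)) * x[k] for x, d in data) for k in range(len(w))]

# Hypothetical training data: (inputs, Boolean target).
data = [((1.0, 0.5), 1), ((1.0, -1.2), 0), ((1.0, 2.0), 1),
        ((1.0, -0.3), 0), ((1.0, 0.1), 1), ((1.0, 0.4), 0)]

w = [0.0, 0.0]
eta = 0.1                       # assumed learning rate
history = [G(w, data)]
for _ in range(50):
    w = [wk + eta * gk for wk, gk in zip(w, grad_G(w, data))]
    history.append(G(w, data))

print(history[0], history[-1])
```

With a sufficiently small learning rate G increases at every step, which is the same as the cross entropy decreasing.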
17
Bayes Optimal Classifier
One great advantage of Bayesian Decision Theory is
that it gives us a lower bound on the classification error
that can be obtained for a given problem.
Bayes Optimal Classification: The most probable
classification of a new instance is obtained by
combining the predictions of all hypotheses, weighted
by their posterior probabilities:
$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i) P(h_i \mid D)$
where V is the set of all the values a classification can take
and vj is one possible such classification.
Unfortunately, Bayes Optimal Classifier is usually too
costly to apply! ==> Naïve Bayes Classifier
18
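Example:
$P(h_1 \mid D) = .4$, $P(- \mid h_1) = 0$, $P(+ \mid h_1) = 1$
$P(h_2 \mid D) = .3$, $P(- \mid h_2) = 1$, $P(+ \mid h_2) = 0$
$P(h_3 \mid D) = .3$, $P(- \mid h_3) = 1$, $P(+ \mid h_3) = 0$
$\sum_{h_i \in H} P(+ \mid h_i) P(h_i \mid D) = .4$
$\sum_{h_i \in H} P(- \mid h_i) P(h_i \mid D) = .6$
$\arg\max_{v_j \in \{+, -\}} \sum_{h_i \in H} P(v_j \mid h_i) P(h_i \mid D) = -$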
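The three-hypothesis example can be reproduced in a few lines of Python. The numbers are those of the slide; the dictionary layout and variable names are illustrative. The sketch also shows that the Bayes optimal classification can disagree with the prediction of the single MAP hypothesis.

```python
# Posteriors P(h_i | D) from the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(v | h_i): each hypothesis predicts one label with certainty.
prediction = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# Bayes optimal: weight each hypothesis' vote by its posterior.
score = {v: sum(prediction[h][v] * posterior[h] for h in posterior)
         for v in ("+", "-")}
bayes_optimal = max(score, key=score.get)

# For comparison: the label predicted by the MAP hypothesis alone.
h_map = max(posterior, key=posterior.get)
map_label = max(prediction[h_map], key=prediction[h_map].get)

print(score, bayes_optimal, map_label)
```

Here the MAP hypothesis h1 predicts "+", while the posterior-weighted vote chooses "-".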
19
Gibbs algorithm
Algorithm
1. Choose h from H, according to the posterior
probability distribution over H
2. Use h to predict the classification of x
Validity of Gibbs algorithm
Haussler, 1994: under certain conditions, the expected error satisfies
E[Error(Gibbs algorithm)] ≤ 2 * E[Error(Bayes optimal
classifier)]
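A minimal sketch of the two steps of the Gibbs algorithm, assuming the posterior over H is available as a list of weights. The function name `gibbs_classify` and the three constant hypotheses (reusing the posteriors .4, .3, .3 from the Bayes optimal classifier example) are illustrative, not taken from the source.

```python
import random

def gibbs_classify(x, hypotheses, posterior, rng=random):
    """Step 1: draw one h according to P(h | D). Step 2: classify x with it."""
    h = rng.choices(hypotheses, weights=posterior, k=1)[0]
    return h(x)

# Hypothetical hypotheses: h1 always predicts "+", h2 and h3 always predict "-".
hypotheses = [lambda x: "+", lambda x: "-", lambda x: "-"]
posterior = [0.4, 0.3, 0.3]

rng = random.Random(0)  # seeded for reproducibility
votes = [gibbs_classify(None, hypotheses, posterior, rng) for _ in range(10_000)]
frac_minus = votes.count("-") / len(votes)
print(frac_minus)
```

Over many draws the algorithm answers "-" in roughly 60% of cases, matching the posterior mass of the hypotheses that predict "-". Unlike the Bayes optimal classifier, each individual prediction consults only one hypothesis.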
20
Naïve Bayes Classifier
Let each instance x of a training set D be described by a
conjunction of n attribute values <a1,a2,..,an> and let f(x),
the target function, be such that $f(x) \in V$, a finite set.
Bayesian Approach:
$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$
$= \arg\max_{v_j \in V} [P(a_1, a_2, \ldots, a_n \mid v_j) P(v_j) / P(a_1, a_2, \ldots, a_n)]$
$= \arg\max_{v_j \in V} [P(a_1, a_2, \ldots, a_n \mid v_j) P(v_j)]$
Naïve Bayesian Approach: We assume that the attribute
values are conditionally independent given the target value, so that
$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$ [and not too large a data set is
required.]
Naïve Bayes Classifier:
$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$
21
An Illustrative Example
New instance: (outlook=sunny, temperature=cool, humidity=high, wind=strong)
P(wind=strong|PlayTennis=yes) = 3/9 = .33
P(wind=strong|PlayTennis=no) = 3/5 = .60
P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = .0053
P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = .0206
vNB = no
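The two products can be recomputed from frequency estimates. Only the two wind estimates (3/9 and 3/5) appear on the slide; the remaining fractions below are the relative frequencies from the 14-example PlayTennis table in Mitchell (Table 3.2) and should be read as an assumption about which data the slide uses. They reproduce the stated .0053 and .0206.

```python
from fractions import Fraction as F

# Factors: P(v), P(sunny|v), P(cool|v), P(high|v), P(strong|v).
# Wind estimates are from the slide; the rest are assumed from Mitchell's Table 3.2.
factors = {
    "yes": [F(9, 14), F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no":  [F(5, 14), F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

score = {}
for v, probs in factors.items():
    product = F(1)
    for p in probs:
        product *= p          # naive Bayes: multiply prior and per-attribute terms
    score[v] = product

v_nb = max(score, key=score.get)
# Normalizing the two scores gives the conditional probability of each label.
p_no = score["no"] / sum(score.values())

print(v_nb, float(score["yes"]), float(score["no"]), float(p_no))
```

Normalizing shows that the classifier assigns roughly 0.795 probability to "no" for this instance.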
22
Naïve Bayes Classifier
• Estimating Probabilities P(ai|vj)
m-estimate of probability = $\frac{n_c + m p}{n + m}$
– m: equivalent sample size; p: prior estimate of the probability
– nc: number of training examples with target value vj that also have attribute value ai
– n: number of training examples with target value vj
Example: Table 3.2
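The m-estimate is a one-line function. In the usage below, the counts 3 and 5 come from the wind=strong, PlayTennis=no estimate of the earlier example; the choices p = 0.5 (uniform prior over two wind values) and m = 2 are illustrative assumptions.

```python
def m_estimate(n_c, n, p, m):
    """Return (n_c + m*p) / (n + m), the m-estimate of P(a_i | v_j)."""
    return (n_c + m * p) / (n + m)

# With m = 0 the estimate is the raw relative frequency n_c / n.
raw = m_estimate(3, 5, p=0.5, m=0)

# With m > 0 the estimate is pulled toward the prior p.
smoothed = m_estimate(3, 5, p=0.5, m=2)

# A value never seen with this class no longer gets probability zero.
unseen = m_estimate(0, 5, p=0.5, m=2)

print(raw, smoothed, unseen)
```

The smoothing matters because a single zero estimate would otherwise force the whole naive Bayes product to zero.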
23
An Example: Learning to classify text
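Independence Assumption
$v_{NB} = \arg\max_{v_j \in \{like, dislike\}} P(v_j) \prod_{i=1}^{111} P(a_i \mid v_j)$
$= \arg\max_{v_j \in \{like, dislike\}} P(v_j) P(a_1 = \text{"our"} \mid v_j) P(a_2 = \text{"approach"} \mid v_j) \cdots P(a_{111} = \text{"trouble"} \mid v_j)$
Incorrect assumption but no other choice
Example: the words "machine" and "learning" tend to occur together, so word occurrences are not independent
Impractical: too many probability terms to estimate,
2 target values * 111 word positions * 50000 words
Position Independence Assumption
• 2 target values * 50000 words
• See the algorithm on page 183 of Mitchell:
Learn_Bayes_Text/Classify_Bayes_Text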
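A compact sketch in the spirit of Learn_Bayes_Text and Classify_Bayes_Text under the position independence assumption. It is not a transcription of the textbook pseudocode: the function signatures, the use of log probabilities, and the four-document toy corpus are assumptions. Word probabilities use the estimate (n_k + 1) / (n + |Vocabulary|).

```python
import math
from collections import Counter

def learn_bayes_text(examples):
    """examples: list of (list_of_words, label). Returns (vocab, priors, word_probs)."""
    vocab = {w for words, _ in examples for w in words}
    priors, word_probs = {}, {}
    for v in {label for _, label in examples}:
        docs = [words for words, label in examples if label == v]
        priors[v] = len(docs) / len(examples)               # P(v_j)
        counts = Counter(w for words in docs for w in words)
        n = sum(counts.values())                            # word positions in class v
        # One estimate per (class, word), shared by all positions.
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, word_probs

def classify_bayes_text(words, vocab, priors, word_probs):
    def log_score(v):
        # Sum of logs instead of a product, to avoid numeric underflow.
        return math.log(priors[v]) + sum(
            math.log(word_probs[v][w]) for w in words if w in vocab)
    return max(priors, key=log_score)

# Hypothetical toy corpus.
train = [("good great fun".split(), "like"),
         ("great approach".split(), "like"),
         ("bad trouble".split(), "dislike"),
         ("trouble boring bad".split(), "dislike")]
model = learn_bayes_text(train)
print(classify_bayes_text("great fun".split(), *model))
print(classify_bayes_text("bad trouble".split(), *model))
```

Words that do not appear in the training vocabulary are skipped at classification time, which is one common convention.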
24
Bayesian Belief Networks
The Bayes Optimal Classifier is often too
costly to apply.
The Naïve Bayes Classifier uses the
conditional independence assumption to
defray these costs. However, in many cases,
such an assumption is overly restrictive.
Bayesian belief networks provide an
intermediate approach which allows stating
conditional independence assumptions that
apply to subsets of the variables.
25
Conditional Independence
We say that X is conditionally independent of Y
given Z if the probability distribution governing X is
independent of the value of Y given a value for Z.
i.e., $(\forall x_i, y_j, z_k)\ P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$
or, P(X|Y,Z)=P(X|Z)
This definition can be extended to sets of variables
as well: we say that the set of variables X1…Xl is
conditionally independent of the set of variables Y1…Ym
given the set of variables Z1…Zn , if
P(X1…Xl|Y1…Ym,Z1…Zn)=P(X1…Xl|Z1…Zn)
26
Representation in Bayesian
Belief Networks
[Figure: network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire]
Associated with each node is a conditional
probability table, which specifies the conditional
distribution for the variable given its
immediate parents in the graph.
Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents
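How conditional probability tables combine into a joint distribution can be sketched on a three-node fragment. The structure (Campfire with parents Storm and BusTourGroup) and every probability value below are assumptions made for illustration; the figure itself did not survive in this transcript. The joint probability is the product of each node's probability given its parents, and any query can then be answered by enumeration.

```python
from itertools import product

# Hypothetical probabilities for the two root nodes.
p_storm = 0.2
p_bus = 0.5
# Hypothetical CPT: P(Campfire=True | Storm, BusTourGroup).
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}

def joint(s, b, c):
    # P(S, B, C) = P(S) * P(B) * P(C | S, B)
    ps = p_storm if s else 1 - p_storm
    pb = p_bus if b else 1 - p_bus
    pc = p_campfire[(s, b)] if c else 1 - p_campfire[(s, b)]
    return ps * pb * pc

values = [True, False]
total = sum(joint(s, b, c) for s, b, c in product(values, repeat=3))

# Inference by enumeration: P(Campfire=True) and P(Storm=True | Campfire=True).
p_c = sum(joint(s, b, True) for s, b in product(values, repeat=2))
p_s_given_c = sum(joint(True, b, True) for b in values) / p_c

print(total, p_c, p_s_given_c)
```

Enumeration sums over every assignment of the unobserved variables, so its cost grows exponentially with the number of variables.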
27
Inference in Bayesian Belief
Networks
A Bayesian Network can be used to compute the
probability distribution for any subset of network
variables given the values or distributions for any
subset of the remaining variables.
Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is
known to be NP-hard.
In theory, approximate techniques (such as Monte
Carlo Methods) can also be NP-hard, though in
practice, many such methods were shown to be
useful.
28
Learning Bayesian Belief
Networks
3 Cases:
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
==> Trivial Case: just estimate the conditional
probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data. ==> Similar to learning the weights for the hidden
units of a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance. ==>
Use a heuristic search or constraint-based technique to
search through potential structures.
29
The EM Algorithm: Learning with
unobservable relevant variables.
Example: Assume that each data point has been generated by one of
k distinct Gaussians, chosen uniformly at random, all with the same known
variance. The problem is to output a hypothesis
h=<μ1, μ2, .., μk> that describes the means of each of
the k distributions. In particular, we are looking for a
maximum likelihood hypothesis for these means.
We extend the problem description as follows: for each
point xi, there are k hidden variables zi1,..,zik such that
zil=1 if xi was generated by normal distribution l and
ziq=0 for all q≠l.
30
The EM Algorithm (Cont’d)
An arbitrary initial hypothesis h=<μ1, μ2, .., μk> is chosen.
The EM Algorithm iterates over two steps:
Step 1 (Estimation, E): Calculate the expected value
E[zij] of each hidden variable zij, assuming that the
current hypothesis h=<μ1, μ2, .., μk> holds.
Step 2 (Maximization, M): Calculate a new maximum
likelihood hypothesis h’=<μ1’, μ2’, .., μk’>, assuming the
value taken on by each hidden variable zij is its expected
value E[zij] calculated in step 1. Then replace the
hypothesis h=<μ1, μ2, .., μk> by the new hypothesis
h’=<μ1’, μ2’, .., μk’> and iterate.
The EM Algorithm can be applied to more general problems
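The two steps can be sketched for one-dimensional data as follows. This is an illustrative implementation under the slide's assumptions (equal mixing proportions, shared known variance); the function name `em_means`, the synthetic data with true means 0 and 5, the initial guess, and the fixed iteration count are all hypothetical choices.

```python
import math
import random

def em_means(xs, mus, sigma, iters=50):
    """Estimate the means of k equal-variance Gaussians with EM."""
    for _ in range(iters):
        # E step: E[z_ij] is proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij].
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(len(mus))]
    return mus

random.seed(1)
sigma = 1.0                                   # known, shared standard deviation
xs = ([random.gauss(0.0, sigma) for _ in range(200)]
      + [random.gauss(5.0, sigma) for _ in range(200)])

mus = em_means(xs, [1.0, 4.0], sigma)         # arbitrary initial hypothesis
print(sorted(mus))
```

Each point contributes to every mean in proportion to E[z_ij], so the M step is a weighted sample mean rather than a hard assignment of points to clusters.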
31