Bayesian Learning

• Provides practical learning algorithms
– Naïve Bayes learning
– Bayesian belief network learning
– Combining prior knowledge (prior probabilities) with observed data
• Provides foundations for machine learning
– Evaluating learning algorithms
– Guiding the design of new algorithms
– Learning from models: meta-learning
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities
for hypotheses; among the most practical approaches to
certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Basic Formulas for Probabilities
• Product Rule: probability P(A, B) of a conjunction of two events A and B:
P(A, B) = P(A | B)·P(B) = P(B | A)·P(A)
• Sum Rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) - P(A, B)
• Theorem of Total Probability: if events A1, ..., An are mutually
exclusive with Σ_{i=1..n} P(Ai) = 1, then
P(B) = Σ_{i=1..n} P(B | Ai)·P(Ai)
Basic Approach
Bayes Rule:
P(h | D) = P(D | h)·P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h | D) = probability of h given D (posterior probability)
• P(D | h) = probability of D given h (likelihood of D given h)
The Goal of Bayesian Learning: the most probable hypothesis given the
training data (the Maximum A Posteriori hypothesis h_MAP)
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h)·P(h) / P(D)
      = argmax_{h∈H} P(D | h)·P(h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
P(cancer) = .008, P(¬cancer) = .992
P(+ | cancer) = .98, P(− | cancer) = .02
P(+ | ¬cancer) = .03, P(− | ¬cancer) = .97
P(cancer | +) = P(+ | cancer)·P(cancer) / P(+) = .0078 / P(+)
P(¬cancer | +) = P(+ | ¬cancer)·P(¬cancer) / P(+) = .0298 / P(+)
So h_MAP = ¬cancer: even after the positive test, the patient most probably does not have cancer.
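A short Python check of this calculation, using the numbers above:

# Numbers from the slide.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

joint_cancer = p_pos_given_cancer * p_cancer     # P(+ | cancer) P(cancer)   = 0.00784
joint_not    = p_pos_given_not * p_not_cancer    # P(+ | ~cancer) P(~cancer) = 0.02976
p_pos = joint_cancer + joint_not                 # the normaliser P(+)

print(joint_cancer / p_pos)   # P(cancer | +)  ≈ 0.21
print(joint_not / p_pos)      # P(~cancer | +) ≈ 0.79  ->  h_MAP = ~cancer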
MAP Learner
For each hypothesis h in H, calculate the posterior probability
P(h | D) = P(D | h)·P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability
h_MAP = argmax_{h∈H} P(h | D)
Comments:
• Computationally intensive
• Provides a standard for judging the performance of learning algorithms
• Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
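A minimal sketch of this brute-force MAP learner in Python; the hypothesis space and the prior/likelihood functions are placeholders supplied by the caller, not anything defined on the slides:

def map_learner(hypotheses, prior, likelihood, D):
    # P(h | D) is proportional to P(D | h) * P(h); P(D) is the same for every h,
    # so it can be dropped when taking the argmax over the hypothesis space.
    return max(hypotheses, key=lambda h: likelihood(D, h) * prior(h))

# e.g. with the cancer example above (D is the positive test result):
print(map_learner(
    hypotheses=["cancer", "~cancer"],
    prior=lambda h: {"cancer": 0.008, "~cancer": 0.992}[h],
    likelihood=lambda D, h: {"cancer": 0.98, "~cancer": 0.03}[h],   # P(+ | h)
    D="+"))                                                         # -> "~cancer"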
Bayes Optimal Classifier
• Question: Given new instance x, what is its most probable
classification?
• h_MAP(x) is not necessarily the most probable classification!
Example: Let P(h1 | D) = .4, P(h2 | D) = .3, P(h3 | D) = .3
Given new data x, we have h1(x) = +, h2(x) = -, h3(x) = -. What is the most probable classification of x?
Bayes optimal classification:
argmax_{vj∈V} Σ_{hi∈H} P(vj | hi)·P(hi | D)
Example:
P(h1 | D) = .4,  P(- | h1) = 0,  P(+ | h1) = 1
P(h2 | D) = .3,  P(- | h2) = 1,  P(+ | h2) = 0
P(h3 | D) = .3,  P(- | h3) = 1,  P(+ | h3) = 0
Σ_{hi∈H} P(+ | hi)·P(hi | D) = .4
Σ_{hi∈H} P(- | hi)·P(hi | D) = .6
so the Bayes optimal classification of x is -.
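A small Python sketch reproducing this example (hypothesis names h1, h2, h3 as above):

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
votes = {"h1": "+", "h2": "-", "h3": "-"}         # h_i(x): here P(v | h_i) is 1 for the
                                                  # hypothesis's own vote and 0 otherwise

def bayes_optimal(votes, posteriors, labels=("+", "-")):
    # weight each hypothesis's vote by its posterior and return the heaviest label
    score = {v: sum(p for h, p in posteriors.items() if votes[h] == v) for v in labels}
    return max(score, key=score.get), score

print(bayes_optimal(votes, posteriors))   # ('-', {'+': 0.4, '-': 0.6})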
Naïve Bayes Learner
Assume a target function f: X -> V, where each instance x is described
by attributes <a1, a2, ..., an>. The most probable value of f(x) is:
v = argmax_{vj∈V} P(vj | a1, a2, ..., an)
  = argmax_{vj∈V} P(a1, a2, ..., an | vj)·P(vj) / P(a1, a2, ..., an)
  = argmax_{vj∈V} P(a1, a2, ..., an | vj)·P(vj)
Naïve Bayes assumption:
P(a1, a2, ..., an | vj) = Π_i P(ai | vj)   (attributes are conditionally independent given the class)
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny,windy=true,…)
• Idea: assign to sample X the class label C such
that P(C|X) is maximal
Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) directly is infeasible (it requires the joint distribution over x1, ..., xk)!
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples
having value xi for the i-th attribute in class C
• If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases
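A minimal sketch of the continuous case, assuming one Gaussian is fitted per (attribute, class) pair; the temperature values below are invented for illustration:

import math

def gaussian_estimate(values):
    # fit the mean and variance of the attribute values observed within one class
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_density(x, mu, var):
    # P(x_i | C) approximated by the Gaussian density with the fitted mean and variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# e.g. temperature values observed for one class (invented numbers)
mu, var = gaussian_estimate([21.0, 23.5, 19.8, 22.1])
print(gaussian_density(22.0, mu, var))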
Naive Bayesian Classifier (II)
• Given a training set, we can compute the probabilities
Outlook      P    N      Humidity   P    N
sunny        2/9  3/5    high       3/9  4/5
overcast     4/9  0      normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N      Windy      P    N
hot          2/9  2/5    true       3/9  3/5
mild         4/9  2/5    false      6/9  2/5
cool         3/9  1/5
Play-tennis example: estimating P(xi|C)
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
P(p) = 9/14
P(n) = 5/14
outlook:
P(sunny|p) = 2/9        P(sunny|n) = 3/5
P(overcast|p) = 4/9     P(overcast|n) = 0
P(rain|p) = 3/9         P(rain|n) = 2/5
temperature:
P(hot|p) = 2/9          P(hot|n) = 2/5
P(mild|p) = 4/9         P(mild|n) = 2/5
P(cool|p) = 3/9         P(cool|n) = 1/5
humidity:
P(high|p) = 3/9         P(high|n) = 4/5
P(normal|p) = 6/9       P(normal|n) = 1/5
windy:
P(true|p) = 3/9         P(true|n) = 3/5
P(false|p) = 6/9        P(false|n) = 2/5
Example: Naïve Bayes
Predict whether tennis will be played on a day with the conditions <sunny, cool, high,
strong>, i.e. compute P(v | outlook=sunny, temperature=cool, humidity=high, wind=strong),
using the following training data:
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
Each probability is estimated as a relative frequency, e.g.
P(strong | yes) = (# days of playing tennis with strong wind) / (# days of playing tennis) = 3/9
We then have:
P(yes)·P(sunny|yes)·P(cool|yes)·P(high|yes)·P(strong|yes) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ .005
P(no)·P(sunny|no)·P(cool|no)·P(high|no)·P(strong|no) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ .021
so Naïve Bayes predicts Play Tennis = No for this day.
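A short Python sketch that reproduces these two numbers directly from the 14 training days above:

# One tuple per training day: (outlook, temperature, humidity, wind, play).
data = [
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no"),
]

def nb_score(label, instance):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                          # P(v_j)
    for i, value in enumerate(instance):                   # times each P(a_i | v_j)
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score

x = ("sunny", "cool", "high", "strong")
print(nb_score("yes", x))   # ≈ 0.0053
print(nb_score("no", x))    # ≈ 0.0206  ->  predict Play Tennis = No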
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at a time,
considering the most important attributes first
Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
for each target value vj
estimate P(vj)
for each attribute value ai of each attribute a
estimate P(ai | vj )
Classify_New_Instance (x)
v = argmax_{vj∈V} P(vj) · Π_{ai∈x} P(ai | vj)
Typical estimate of P(ai | vj) (the m-estimate):
P(ai | vj) = (nc + m·p) / (n + m)
where
n: number of training examples with v = vj
nc: number of those examples that also have a = ai
p: prior estimate for P(ai | vj)
m: weight given to the prior (equivalent sample size)
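A minimal sketch of the m-estimate itself; the prior p and the weight m are choices of the designer (e.g. a uniform p = 1/k over k attribute values):

def m_estimate(nc, n, p, m):
    # nc: training examples with v = v_j that also have a = a_i
    # n : training examples with v = v_j
    # p : prior estimate of P(a_i | v_j);  m: weight (equivalent sample size) given to p
    return (nc + m * p) / (n + m)

# e.g. P(outlook = overcast | n) in the play-tennis data: nc = 0, n = 5,
# uniform prior p = 1/3 over the three outlook values, m = 3
print(m_estimate(0, 5, 1/3, 3))   # 0.125 instead of the zero relative frequency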
Bayesian Belief Networks
• The Naïve Bayes assumption of conditional independence is too restrictive
• But the problem is intractable without some such assumption
• A Bayesian belief network (Bayesian net) describes conditional
independence among subsets of variables (attributes), combining prior
knowledge about dependencies among variables with observed training data.
• Bayesian Net
– Node = variable
– Arc = dependency
– DAG, with the direction of an arc representing causality
Bayesian Networks:
Multiple Variables with Dependencies
• A Bayesian belief network (Bayesian net) describes conditional
independence among subsets of variables (attributes), combining prior
knowledge about dependencies among variables with observed training data.
• Bayesian Net
– Node = variable; each variable has a finite set of mutually exclusive states
– Arc = dependency
– DAG, with the direction of an arc representing causality
– To each variable A with parents B1, ..., Bn there is attached a
conditional probability table P(A | B1, ..., Bn)
Bayesian Belief Networks
[Network structure: Age, Occ and Income are parents of Buy X; Buy X is the parent of Interested in Insurance]
•Age, Occupation and Income determine whether the customer will buy this product.
•Given that the customer buys the product, interest in insurance is independent of Age, Occupation and Income.
•P(Age, Occ, Inc, Buy, Ins) = P(Age)·P(Occ)·P(Inc)·P(Buy | Age, Occ, Inc)·P(Ins | Buy)
Current State of the Art: Given the structure and the probabilities,
existing algorithms can handle inference with categorical values and
a limited representation of numerical values.
General Product Rule
P(x1, ..., xn | M) = Π_{i=1..n} P(xi | Pai, M),  where Pai = parents(xi)
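A small sketch of this factorisation on a reduced version of the customer network from the earlier slide (Occupation omitted, all variables treated as binary, CPT numbers invented for illustration):

# Each variable maps to (its parents, a CPT giving P(variable = True | parents)).
net = {
    "Age": ((), {(): 0.2}),
    "Inc": ((), {(): 0.3}),
    "Buy": (("Age", "Inc"), {(True, True): 0.8, (True, False): 0.4,
                             (False, True): 0.5, (False, False): 0.1}),
    "Ins": (("Buy",), {(True,): 0.7, (False,): 0.2}),
}

def local_prob(var, value, assignment):
    parents, cpt = net[var]
    p_true = cpt[tuple(assignment[p] for p in parents)]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    # general product rule: P(x1, ..., xn) = product over i of P(x_i | parents(x_i))
    result = 1.0
    for var, value in assignment.items():
        result *= local_prob(var, value, assignment)
    return result

print(joint({"Age": True, "Inc": True, "Buy": True, "Ins": False}))   # 0.2 * 0.3 * 0.8 * 0.3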
Nodes as Functions
A node in a BN is a conditional distribution function: given its parents' state values,
it returns a distribution over its own values, e.g. P(X | A=a, B=b).
Example: X has states {l, m, h} and parents A and B:
      ab    ¬ab   a¬b   ¬a¬b
l     0.1   0.7   0.4   0.2
m     0.3   0.2   0.4   0.5
h     0.6   0.1   0.2   0.3
•input: the parents' state values (e.g. A=a, B=b)
•output: a distribution over its own values (here P(X | a, b) = (0.1, 0.3, 0.6))
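A minimal sketch of this node-as-function view, with the CPT taken from the table above:

# P(X | A, B): keys are the (a, b) truth values, values are distributions over X's states.
cpt_X = {
    (True,  True):  {"l": 0.1, "m": 0.3, "h": 0.6},
    (False, True):  {"l": 0.7, "m": 0.2, "h": 0.1},
    (True,  False): {"l": 0.4, "m": 0.4, "h": 0.2},
    (False, False): {"l": 0.2, "m": 0.5, "h": 0.3},
}

def node_X(a, b):
    # input: the parents' state values; output: a distribution over X's own values
    return cpt_X[(a, b)]

print(node_X(True, True))   # {'l': 0.1, 'm': 0.3, 'h': 0.6}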
Special Case: Naïve Bayes
[Network structure: the class h is the single parent of the evidence nodes e1, e2, ..., en]
P(e1, e2, ..., en, h) = P(h)·P(e1 | h)·...·P(en | h)
Inference in Bayesian Networks
[Network over: Age, Income, Living Location, House Owner, Newspaper Preference, EU Voting Pattern]
How likely are elderly rich people to buy the Sun?
P(paper = Sun | Age > 60, Income > 60k)
Inference in Bayesian Networks
[Same network: Age, Income, Living Location, House Owner, Newspaper Preference, EU Voting Pattern]
How likely are elderly rich people who voted Labour to buy the Daily Mail?
P(paper = DM | Age > 60, Income > 60k, v = labour)
Bayesian Learning
[Burglary-Earthquake-Alarm-Call network, together with a table of N data cases over the
variables B, E, A, C (e.g. <~b, e, a, c>, <b, ~e, ~a, ~c>, ...), possibly only partially observed]
Input: fully or partially observable data cases
Output: the parameters AND also the structure
Learning Methods:
• EM (Expectation Maximisation)
– use the current approximation of the parameters to estimate the filled-in data
– use the filled-in data to update the parameters (ML)
• Gradient Ascent Training
• Gibbs Sampling (MCMC)
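For fully observed data cases, the maximum-likelihood parameters are just conditional relative frequencies; a minimal sketch (the data cases below are invented for illustration):

# Fully observed cases over Burglary (B), Earthquake (E), Alarm (A), Call (C); invented values.
cases = [
    {"B": False, "E": True,  "A": True,  "C": True},
    {"B": True,  "E": False, "A": False, "C": False},
    {"B": False, "E": False, "A": False, "C": False},
    {"B": False, "E": True,  "A": False, "C": False},
]

def ml_cpt_entry(child, child_value, parent_values):
    # ML estimate of P(child = child_value | parents = parent_values), by counting
    matching = [c for c in cases if all(c[p] == v for p, v in parent_values.items())]
    if not matching:
        return None   # no data for this parent configuration
    return sum(1 for c in matching if c[child] == child_value) / len(matching)

print(ml_cpt_entry("A", True, {"B": False, "E": True}))   # 0.5 with the cases above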