Machine Learning - Naive Bayes Classifier


Naïve Bayes Classifier
Ke Chen
COMP24111 Machine Learning
Outline
• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
– Principle and Algorithms
– Example: Play Tennis
• Relevant Issues
• Summary
Background
• There are three methods to establish a classifier:
  a) Model a classification rule directly
     Examples: k-NN, decision trees, perceptron, SVM
  b) Model the probability of class membership given the input data
     Example: perceptron with the cross-entropy cost
  c) Build a probabilistic model of the data within each class
     Examples: naïve Bayes, model-based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
Probability Basics
• Prior, conditional and joint probability for random variables
  – Prior probability: $P(X)$
  – Conditional probability: $P(X_1|X_2)$, $P(X_2|X_1)$
  – Joint probability: $X = (X_1, X_2)$, $P(X) = P(X_1, X_2)$
  – Relationship: $P(X_1, X_2) = P(X_2|X_1)P(X_1) = P(X_1|X_2)P(X_2)$
  – Independence: $P(X_2|X_1) = P(X_2)$, $P(X_1|X_2) = P(X_1)$, $P(X_1, X_2) = P(X_1)P(X_2)$
• Bayesian rule
  $$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}, \qquad P(C|X) = \frac{P(X|C)\,P(C)}{P(X)}$$
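To make the rule concrete, here is a minimal Python sketch with invented numbers (the spam/ham classes, priors and likelihoods below are purely illustrative, not from the lecture):

```python
# Bayesian rule with made-up numbers: two classes, one observed event X.
prior = {"spam": 0.3, "ham": 0.7}        # P(C)
likelihood = {"spam": 0.8, "ham": 0.1}   # P(X | C), e.g. X = "message contains 'offer'"

# Evidence: P(X) = sum over classes of P(X|C) P(C)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: P(C|X) = P(X|C) P(C) / P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)   # {'spam': ~0.774, 'ham': ~0.226}
```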
Probability Basics
• Quiz:
  We have two six-sided dice. When they are rolled, the following events could occur:
  (A) die 1 lands on side "3", (B) die 2 lands on side "1", and (C) the two dice sum
  to eight. Answer the following questions:
  1) P(A) = ?
  2) P(B) = ?
  3) P(C) = ?
  4) P(A|B) = ?
  5) P(C|A) = ?
  6) P(A, B) = ?
  7) P(A, C) = ?
  8) Is P(A, C) equal to P(A) × P(C)?
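One way to check your answers is to enumerate all 36 equally likely outcomes; the sketch below (not part of the original slides) does exactly that:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

def prob(event):
    """Exact probability of an event given as a predicate over (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3        # die 1 shows 3
B = lambda o: o[1] == 1        # die 2 shows 1
C = lambda o: sum(o) == 8      # the two dice sum to eight

print(prob(A), prob(B), prob(C))                          # P(A), P(B), P(C)
print(prob(lambda o: A(o) and C(o)) / prob(A))            # P(C|A)
print(prob(lambda o: A(o) and C(o)), prob(A) * prob(C))   # P(A,C) vs P(A)P(C)
```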
Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: $P(C|X)$, where $C = c_1, \ldots, c_L$ and $X = (X_1, \ldots, X_n)$
  [Figure: a discriminative probabilistic classifier takes the input $x = (x_1, x_2, \ldots, x_n)$
   and outputs the posteriors $P(c_1|x), P(c_2|x), \ldots, P(c_L|x)$]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
  – Generative model: $P(X|C)$, where $C = c_1, \ldots, c_L$ and $X = (X_1, \ldots, X_n)$
  [Figure: one generative probabilistic model per class; the model for class $i$ takes
   $x = (x_1, x_2, \ldots, x_n)$ and evaluates the class-conditional likelihood $P(x|c_i)$,
   for classes $1, \ldots, L$]
Probabilistic Classification
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign $x$ to $c^*$ if
    $$P(C = c^*|X = x) > P(C = c|X = x), \quad c \neq c^*,\ c = c_1, \ldots, c_L$$
• Generative classification with the MAP rule
  – Apply the Bayesian rule to convert the class-conditional likelihoods into posterior probabilities:
    $$P(C = c_i|X = x) = \frac{P(X = x|C = c_i)\,P(C = c_i)}{P(X = x)} \propto P(X = x|C = c_i)\,P(C = c_i), \quad \text{for } i = 1, 2, \ldots, L$$
  – Then apply the MAP rule
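A minimal sketch of generative classification with the MAP rule (the likelihoods and priors below are invented placeholders):

```python
# Hypothetical class-conditional likelihoods P(x|ci) and priors P(ci) for one test point x.
likelihood_x = {"c1": 0.02, "c2": 0.05, "c3": 0.01}   # P(x | C = ci), assumed already modelled
prior = {"c1": 0.5, "c2": 0.3, "c3": 0.2}             # P(C = ci)

# The evidence P(x) is the same for every class, so it is enough to compare P(x|ci) P(ci).
scores = {c: likelihood_x[c] * prior[c] for c in prior}
c_star = max(scores, key=scores.get)
print(scores, "->", c_star)   # c2 wins: 0.015 > 0.010 (c1) > 0.002 (c3)
```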
Naïve Bayes
• Bayes classification
  $$P(C|X) \propto P(X|C)\,P(C) = P(X_1, \ldots, X_n|C)\,P(C)$$
  Difficulty: learning the joint probability $P(X_1, \ldots, X_n|C)$
• Naïve Bayes classification
  – Assumption: all input features are conditionally independent given the class!
    $$P(X_1, X_2, \ldots, X_n|C) = P(X_1|X_2, \ldots, X_n, C)\,P(X_2, \ldots, X_n|C) = P(X_1|C)\,P(X_2, \ldots, X_n|C) = P(X_1|C)\,P(X_2|C) \cdots P(X_n|C)$$
  – MAP classification rule: for $x = (x_1, x_2, \ldots, x_n)$, assign $x$ to $c^*$ if
    $$[P(x_1|c^*) \cdots P(x_n|c^*)]\,P(c^*) > [P(x_1|c) \cdots P(x_n|c)]\,P(c), \quad c \neq c^*,\ c = c_1, \ldots, c_L$$
Naïve Bayes
• Algorithm: Discrete-Valued Features
  – Learning Phase: Given a training set S,
    For each target value $c_i$ ($c_i = c_1, \ldots, c_L$)
      $\hat{P}(C = c_i) \leftarrow$ estimate $P(C = c_i)$ with examples in S;
      For every feature value $x_{jk}$ of each feature $X_j$ ($j = 1, \ldots, n$; $k = 1, \ldots, N_j$)
        $\hat{P}(X_j = x_{jk}|C = c_i) \leftarrow$ estimate $P(X_j = x_{jk}|C = c_i)$ with examples in S;
    Output: conditional probability tables; for $X_j$, $N_j \times L$ elements
  – Test Phase: Given an unknown instance $X' = (a_1, \ldots, a_n)$,
    look up the tables to assign the label $c^*$ to $X'$ if
    $$[\hat{P}(a_1|c^*) \cdots \hat{P}(a_n|c^*)]\,\hat{P}(c^*) > [\hat{P}(a_1|c) \cdots \hat{P}(a_n|c)]\,\hat{P}(c), \quad c \neq c^*,\ c = c_1, \ldots, c_L$$
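A compact Python sketch of this algorithm (the tiny training set at the bottom is invented just to exercise the code, not taken from the slides):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Learning phase: estimate P(C=ci) and P(Xj=xjk | C=ci) by counting."""
    class_counts = Counter(label for _, label in examples)
    prior = {c: class_counts[c] / len(examples) for c in class_counts}

    cond = defaultdict(lambda: defaultdict(Counter))   # cond[c][j][value] -> count
    for features, label in examples:
        for j, value in enumerate(features):
            cond[label][j][value] += 1
    # convert counts into conditional probability tables
    cpt = {c: {j: {v: cnt / class_counts[c] for v, cnt in vals.items()}
               for j, vals in cond[c].items()}
           for c in cond}
    return prior, cpt

def classify_nb(x, prior, cpt):
    """Test phase: pick the class maximising [prod_j P(xj|c)] * P(c)."""
    def score(c):
        s = prior[c]
        for j, value in enumerate(x):
            s *= cpt[c].get(j, {}).get(value, 0.0)   # unseen value -> probability 0
        return s
    return max(prior, key=score)

# Invented toy data: (features, label)
data = [(("sunny", "hot"), "no"), (("sunny", "cool"), "yes"),
        (("rain", "cool"), "yes"), (("rain", "hot"), "no")]
prior, cpt = train_nb(data)
print(classify_nb(("sunny", "cool"), prior, cpt))   # -> 'yes'
```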
Example
• Example: Play Tennis
  Training data: 14 example days, each described by Outlook, Temperature, Humidity and
  Wind and labelled Play = Yes or Play = No (the standard Play-Tennis data set).
Example
• Learning Phase

  Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
  Sunny      2/9       3/5            Hot          2/9       2/5
  Overcast   4/9       0/5            Mild         4/9       2/5
  Rain       3/9       2/5            Cool         3/9       1/5

  Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
  High       3/9       4/5            Strong       3/9       3/5
  Normal     6/9       1/5            Weak         6/9       2/5

  P(Play=Yes) = 9/14        P(Play=No) = 5/14
Example
• Test Phase
  – Given a new instance, predict its label:
    x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the tables obtained in the learning phase:
    P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
    P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
    P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
    P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
    P(Play=Yes) = 9/14                       P(Play=No) = 5/14
  – Decision making with the MAP rule (the two scores are unnormalised posteriors):
    P(Yes|x′) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) ≈ 0.0053
    P(No|x′) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) ≈ 0.0206
    Since P(Yes|x′) < P(No|x′), we label x′ as "No".
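A quick sketch to reproduce these two scores (the numbers are taken straight from the tables above):

```python
# Unnormalised posterior scores for x' = (Sunny, Cool, High, Strong)
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)
print(round(score_yes, 4), round(score_no, 4))   # 0.0053 0.0206
print("Yes" if score_yes > score_no else "No")   # -> No
```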
Naïve Bayes
• Algorithm: Continuous-Valued Features
  – A continuous feature takes uncountably many values, so per-value probability tables are not possible
  – The conditional probability is instead often modelled with the normal distribution:
    $$\hat{P}(X_j|C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$$
    $\mu_{ji}$: mean (average) of the values of feature $X_j$ over the examples for which $C = c_i$
    $\sigma_{ji}$: standard deviation of the values of feature $X_j$ over the examples for which $C = c_i$
  – Learning Phase: for $X = (X_1, \ldots, X_n)$, $C = c_1, \ldots, c_L$
    Output: $n \times L$ normal distributions and $P(C = c_i)$, $i = 1, \ldots, L$
  – Test Phase: Given an unknown instance $X' = (a_1, \ldots, a_n)$,
    • instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase
    • apply the MAP rule to make a decision
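A minimal sketch of this continuous-valued variant (the per-class means, standard deviations and priors below are invented placeholders, not learned from any data in the slides):

```python
import math

def gaussian(x, mu, sigma):
    """Normal density used as the class-conditional P(Xj = x | C = ci)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Learning-phase output (assumed already estimated): one (mu, sigma) per feature
# per class, plus the class priors.  Two features, two classes here.
params = {
    "yes": {"prior": 0.6, "gauss": [(21.0, 2.0), (65.0, 8.0)]},
    "no":  {"prior": 0.4, "gauss": [(25.0, 6.0), (80.0, 9.0)]},
}

def classify(x):
    """Test phase: MAP over classes using the product of Gaussian likelihoods."""
    def score(c):
        s = params[c]["prior"]
        for xj, (mu, sigma) in zip(x, params[c]["gauss"]):
            s *= gaussian(xj, mu, sigma)
        return s
    return max(params, key=score)

print(classify((22.5, 70.0)))   # -> 'yes' with these placeholder numbers
```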
Naïve Bayes
• Example: Continuous-Valued Features
  – Temperature is naturally a continuous-valued feature.
    Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
    No: 27.3, 30.1, 17.4, 29.5, 15.1
  – Estimate the mean and variance for each class:
    $$\mu = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2$$
    $$\mu_{\text{Yes}} = 21.64, \ \sigma_{\text{Yes}} = 2.35 \qquad\qquad \mu_{\text{No}} = 23.88, \ \sigma_{\text{No}} = 7.09$$
  – Learning Phase: output two Gaussian models for P(temp|C):
    $$\hat{P}(x|\text{Yes}) = \frac{1}{2.35\sqrt{2\pi}} \exp\!\left(-\frac{(x - 21.64)^2}{2 \cdot 2.35^2}\right), \qquad
      \hat{P}(x|\text{No}) = \frac{1}{7.09\sqrt{2\pi}} \exp\!\left(-\frac{(x - 23.88)^2}{2 \cdot 7.09^2}\right)$$
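A quick sketch (not from the slides) that reproduces these estimates; note that the quoted values 2.35 and 7.09 match the sample (N−1) standard deviation rather than the 1/N formula above:

```python
import math
from statistics import mean, stdev   # stdev uses the N-1 (sample) estimator

temps_yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
temps_no  = [27.3, 30.1, 17.4, 29.5, 15.1]

mu_yes, sd_yes = mean(temps_yes), stdev(temps_yes)   # ~21.64, ~2.35
mu_no,  sd_no  = mean(temps_no),  stdev(temps_no)    # ~23.88, ~7.09

def gaussian(x, mu, sigma):
    """Normal density used as the class-conditional likelihood P(temp = x | C)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Class-conditional likelihoods for a new temperature reading, e.g. 22 degrees
print(gaussian(22.0, mu_yes, sd_yes), gaussian(22.0, mu_no, sd_no))
```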
Relevant Issues
• Violation of the Independence Assumption
  – For many real-world tasks, $P(X_1, \ldots, X_n|C) \neq P(X_1|C) \cdots P(X_n|C)$
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
  – If no training example contains the attribute value $X_j = a_{jk}$, then $\hat{P}(X_j = a_{jk}|C = c_i) = 0$
  – In this circumstance, $\hat{P}(x_1|c_i) \cdots \hat{P}(a_{jk}|c_i) \cdots \hat{P}(x_n|c_i) = 0$ during test
  – As a remedy, estimate the conditional probabilities with
    $$\hat{P}(X_j = a_{jk}|C = c_i) = \frac{n_c + mp}{n + m}$$
    $n_c$: number of training examples for which $X_j = a_{jk}$ and $C = c_i$
    $n$: number of training examples for which $C = c_i$
    $p$: prior estimate (usually $p = 1/t$ for $t$ possible values of $X_j$)
    $m$: weight given to the prior (number of "virtual" examples, $m \geq 1$)
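A minimal sketch of this m-estimate (the counts in the example call are invented for illustration):

```python
from fractions import Fraction

def m_estimate(n_c, n, t, m=1):
    """Smoothed estimate of P(Xj = ajk | C = ci) with prior p = 1/t."""
    p = Fraction(1, t)
    return (n_c + m * p) / (n + m)

# Hypothetical counts: the value never occurs with this class (n_c = 0),
# the class has 9 training examples, and the feature has 3 possible values.
print(m_estimate(n_c=0, n=9, t=3))   # 1/30 instead of 0
```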
Summary
• Naïve Bayes is based on the conditional independence assumption
  – Training is very easy and fast: it only requires considering each attribute in each class separately
  – Testing is straightforward: just look up tables or calculate conditional probabilities with the estimated distributions
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
  – Many successful applications, e.g. spam mail filtering
  – A good candidate as a base learner in ensemble learning
  – Apart from classification, naïve Bayes can do more…