Transcript Lecture 11

INC 551 Artificial Intelligence
Lecture 11
Machine Learning (Continued)
Bayes Classifier
Bayes Rule
Play Tennis Example
John wants to play tennis every day.
However, on some days the conditions are not good, so he decides not to play.
The following table is the record of the last 14 days.
Outlook    Temperature  Humidity  Wind    PlayTennis
Sunny      Hot          High      Weak    No
Sunny      Hot          High      Strong  No
Overcast   Hot          High      Weak    Yes
Rain       Mild         High      Weak    Yes
Rain       Cool         Normal    Weak    Yes
Rain       Cool         Normal    Strong  No
Overcast   Cool         Normal    Strong  Yes
Sunny      Mild         High      Weak    No
Sunny      Cool         Normal    Weak    Yes
Rain       Mild         Normal    Weak    Yes
Sunny      Mild         Normal    Strong  Yes
Overcast   Mild         High      Strong  Yes
Overcast   Hot          Normal    Weak    Yes
Rain       Mild         High      Strong  No
Question:
Today’s condition is
<Sunny, Mild Temperature, Normal Humidity, Strong Wind>
Do you think John will play tennis?
Find P(condition | PlayTennis).
We need to use the naïve Bayes assumption: assume that all attributes are independent given the class.

P(sunny ∧ mild ∧ normal ∧ strong | PlayTennis)
  = P(sunny | PlayTennis) × P(mild | PlayTennis) × P(normal | PlayTennis) × P(strong | PlayTennis)
Now, let’s look at each property
P(sunny | PlayTennis = yes) = 2/9 ≈ 0.22
P(sunny | PlayTennis = no) = 3/5 = 0.6
P(Temp = mild | PlayTennis = yes) = 4/9 ≈ 0.44
P(Temp = mild | PlayTennis = no) = 2/5 = 0.4
P(Humid = normal | PlayTennis = yes) = 6/9 ≈ 0.66
P(Humid = normal | PlayTennis = no) = 1/5 = 0.2
P(Wind = strong | PlayTennis = yes) = 3/9 ≈ 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.6

P(sunny ∧ mild ∧ normal ∧ strong | PlayTennis = yes) = 0.22 × 0.44 × 0.66 × 0.33 ≈ 0.022
P(sunny ∧ mild ∧ normal ∧ strong | PlayTennis = no) = 0.6 × 0.4 × 0.2 × 0.6 = 0.0288
Using Bayes' rule:

P(PlayTennis | condition) = P(condition | PlayTennis) × P(PlayTennis) / P(condition)

From the table, P(PlayTennis = yes) = 9/14 ≈ 0.643 and P(PlayTennis = no) = 5/14 ≈ 0.357, so

P(PlayTennis = yes | condition) = 0.022 × 0.643 / P(condition) = 0.01415 / P(condition)
P(PlayTennis = no | condition) = 0.0288 × 0.357 / P(condition) = 0.01028 / P(condition)
Since P(condition) is the same for both, we can conclude that John is more likely to play tennis today.
Note that we do not need to compute P(condition) to get the answer. However, if we want the actual numbers, we can calculate P(condition) and use it to normalize the probabilities:

P(condition) = P(condition ∧ PlayTennis = yes) + P(condition ∧ PlayTennis = no)
P(condition) = 0.01415 + 0.01028 = 0.02443
P(PlayTennis = yes | condition) = 0.01415 / 0.02443 ≈ 0.58
P(PlayTennis = no | condition) = 0.01028 / 0.02443 ≈ 0.42

Therefore, John is more likely to play tennis today, with a 58% chance.
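To make the whole calculation concrete, here is a minimal Python sketch of the naïve Bayes computation above, assuming the table is stored as a list of tuples (the function and variable names are my own, not from the lecture).

records = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]

def naive_bayes_posterior(condition, records):
    # Returns P(PlayTennis | condition) for each label, using the naive Bayes assumption.
    scores = {}
    for label in ("Yes", "No"):
        rows = [r for r in records if r[-1] == label]
        prior = len(rows) / len(records)                    # P(PlayTennis)
        likelihood = 1.0
        for i, value in enumerate(condition):               # product of per-attribute terms
            likelihood *= sum(r[i] == value for r in rows) / len(rows)
        scores[label] = prior * likelihood                   # P(condition and PlayTennis)
    total = sum(scores.values())                             # P(condition)
    return {label: s / total for label, s in scores.items()}

print(naive_bayes_posterior(("Sunny", "Mild", "Normal", "Strong"), records))
# roughly {'Yes': 0.58, 'No': 0.42}, matching the numbers above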
Learning and Bayes Classifier
Learning is the adjustment of the probability values used to compute the posterior probability as new data is added.
Classifying Object Example
Suppose we want to classify objects into two
classes, A and B. There are two features that
we can measure from each object, f1 and f2.
We randomly sample four objects to form a database and classify them by hand.
Sample   f1    f2    Class
1        5.2   1.2   B
2        2.3   5.4   A
3        1.5   4.4   A
4        4.5   2.1   B
Now we have another sample with f1 = 3.2 and f2 = 4.2, and we want to know what class it is.
We want to find P(Class | feature).
Using Bayes' rule:

P(Class | feature) = P(feature | Class) × P(Class) / P(feature)
From the table, we will count the number of events.
P(Class = A) = 2/4 = 0.5
P(Class = B) = 2/4 = 0.5
Find P(feature | Class).
Again, we use the naïve Bayes assumption: assume that the features are independent given the class.

P(f1 ∧ f2 | Class) = P(f1 | Class) × P(f2 | Class)
To find P(f1 | Class), we need to assume a probability distribution because the features are continuous values. The most common choice is the Gaussian (normal) distribution.
Gaussian distribution:

P(x) = ( 1 / √(2πσ²) ) · exp( −(x − µ)² / (2σ²) )

There are two parameters: the mean µ and the variance σ².
Using the maximum likelihood principle, the mean and the variance can be estimated from the samples in the database.
Class A
  f1: mean = (2.3 + 1.5) / 2 = 1.9,  SD = 0.4
  f2: mean = (5.4 + 4.4) / 2 = 4.9,  SD = 0.5
Class B
  f1: mean = (5.2 + 4.5) / 2 = 4.85, SD = 0.35
  f2: mean = (1.2 + 2.1) / 2 = 1.65, SD = 0.45
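As a quick check, here is a small Python sketch of those maximum-likelihood estimates; the helper name ml_mean_sd is my own and is not part of the lecture.

from math import sqrt

def ml_mean_sd(values):
    # Maximum-likelihood (population) estimates of the mean and standard deviation.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, sqrt(var)

print(ml_mean_sd([2.3, 1.5]))   # Class A, f1 -> (1.9, 0.4)
print(ml_mean_sd([5.4, 4.4]))   # Class A, f2 -> (4.9, 0.5)
print(ml_mean_sd([5.2, 4.5]))   # Class B, f1 -> (4.85, 0.35)
print(ml_mean_sd([1.2, 2.1]))   # Class B, f2 -> (1.65, 0.45)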
P(f1 | A) = ( 1 / √(2π · 0.4²) ) · exp( −(f1 − 1.9)² / (2 · 0.4²) )
The object that we want to classify has f1 = 3.2 and f2 = 4.2.
 (3.2  1.9) 2 
  0.0051
P( f 1 | A) 
exp 
2
2(0.4 ) 
2 (0.4 2 )

1
 (4.2  4.9) 2 
  0.2995
P( f 2 | A) 
exp 
2
2(0.5 ) 
2 (0.52 )

1
 (3.2  4.85) 2 
  1.7016e- 05
P( f 1 | B) 
exp 
2
2(0.35 ) 
2 (0.352 )

1
 (4.2  1.65) 2 
  9.4375e- 08
P( f 2 | B) 
exp 
2
2(0.45 ) 
2 (0.452 )

1
Therefore,

P(f1 ∧ f2 | Class = A) = 0.0051 × 0.2995 ≈ 0.0015
P(f1 ∧ f2 | Class = B) = 1.7016e-05 × 9.4375e-08 ≈ 1.6059e-12
From Bayes' rule:

P(Class | feature) = P(feature | Class) × P(Class) / P(feature)

P(A | feature) = 0.0015 × 0.5 / P(feature)
P(B | feature) = 1.6059e-12 × 0.5 / P(feature)
Therefore, we should classify the sample as Class A.
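Putting the whole procedure together, here is a minimal Python sketch of this Gaussian naïve Bayes decision, using the means, standard deviations, and priors computed above (the function and variable names are my own).

from math import exp, pi, sqrt

def gaussian(x, mean, sd):
    # Gaussian (normal) density with the given mean and standard deviation.
    return exp(-(x - mean) ** 2 / (2 * sd ** 2)) / sqrt(2 * pi * sd ** 2)

params = {                                   # class -> [(mean, sd) of f1, (mean, sd) of f2]
    "A": [(1.9, 0.4), (4.9, 0.5)],
    "B": [(4.85, 0.35), (1.65, 0.45)],
}
prior = {"A": 0.5, "B": 0.5}

x = (3.2, 4.2)                               # the unknown object
score = {}
for c, feats in params.items():
    likelihood = 1.0
    for value, (mean, sd) in zip(x, feats):  # naive Bayes: features treated independently
        likelihood *= gaussian(value, mean, sd)
    score[c] = likelihood * prior[c]         # proportional to P(Class | feature)

print(score)                                 # A: about 7.6e-4, B: about 8.0e-13
print(max(score, key=score.get))             # -> "A"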
Nearest Neighbor Classification
NN is considered a classification method that uses no model.
Nearest Neighbor’s Principle
The unknown sample is classified into the same class as the sample at the closest distance.
[Figure: two-dimensional feature space (Feature 1 vs. Feature 2); the closest distance from the unknown sample is to a circle sample.]
We classify the sample as a circle.
Distance between Samples
Samples X and Y have multi-dimensional feature values.
2
 3
X  
1
 
0
  2
  1
Y  
5
 
3
The distance between samples X and Y can be calculated by this formula:

D(x, y) = ( Σ_{i=1..N} |xi − yi|^k )^(1/k)
        = ( |x1 − y1|^k + |x2 − y2|^k + ... + |xN − yN|^k )^(1/k)
If k = 1, the distance is called the Manhattan distance.
If k = 2, the distance is called the Euclidean distance.
If k = ∞, the distance is the maximum absolute difference over the features.
The Euclidean distance is the best known and the most commonly used.
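As a small illustration, here is a Python sketch of this distance (often called the Minkowski distance of order k), applied to the example vectors X and Y above; the function name is my own.

def minkowski(x, y, k):
    # Distance of order k between two feature vectors x and y.
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

X = (2, -3, 1, 0)
Y = (-2, -1, 5, 3)
print(minkowski(X, Y, 1))                        # k = 1: Manhattan distance
print(minkowski(X, Y, 2))                        # k = 2: Euclidean distance
print(max(abs(a - b) for a, b in zip(X, Y)))     # k -> infinity: maximum difference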
Classifying Object with NN
Sample   f1    f2    Class
1        5.2   1.2   B
2        2.3   5.4   A
3        1.5   4.4   A
4        4.5   2.1   B
Now we have another sample with f1 = 3.2 and f2 = 4.2, and we want to know its class.
Compute the Euclidean distance from it to all the other samples:

D(x, s1) = √( (3.2 − 5.2)² + (4.2 − 1.2)² ) = 3.6056
D(x, s2) = √( (3.2 − 2.3)² + (4.2 − 5.4)² ) = 1.5
D(x, s3) = √( (3.2 − 1.5)² + (4.2 − 4.4)² ) = 1.7117
D(x, s4) = √( (3.2 − 4.5)² + (4.2 − 2.1)² ) = 2.4698
The unknown sample is closest to the second sample. Therefore, we classify it into the same class as the second sample, which is Class A.
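A minimal Python sketch of this nearest-neighbour decision, assuming the database is stored as (feature vector, class) pairs (the variable names are my own):

from math import sqrt

samples = [((5.2, 1.2), "B"), ((2.3, 5.4), "A"), ((1.5, 4.4), "A"), ((4.5, 2.1), "B")]

def euclidean(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

x = (3.2, 4.2)
print([round(euclidean(x, f), 4) for f, _ in samples])   # [3.6056, 1.5, 1.7117, 2.4698]
nearest = min(samples, key=lambda s: euclidean(x, s[0]))
print(nearest[1])                                        # -> "A" (sample 2 is the closest)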
K-Nearest Neighbor (KNN)
Instead of using only the single closest sample to decide the class, we use the k closest samples and take the majority class among them.
[Figure: example with k = 3 in the feature space (Feature 1 vs. Feature 2).]
With k = 3, the data is classified as a circle.
[Figure: example with k = 5 in the same feature space.]
With k = 5, the data is classified as a star.
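Finally, here is a minimal Python sketch of KNN with a majority vote, run on the same four-sample database as an example (the function name knn_classify is my own; with this tiny database only values of k up to 4 make sense).

from collections import Counter
from math import sqrt

def knn_classify(x, samples, k):
    # samples is a list of (feature_vector, class_label) pairs.
    dist = lambda a, b: sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))   # Euclidean distance
    nearest = sorted(samples, key=lambda s: dist(x, s[0]))[:k]          # k closest samples
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]                                   # majority class

samples = [((5.2, 1.2), "B"), ((2.3, 5.4), "A"), ((1.5, 4.4), "A"), ((4.5, 2.1), "B")]
print(knn_classify((3.2, 4.2), samples, k=3))   # -> "A" (two of the three neighbours are A)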