SEEM4630 2012-2013
Tutorial 2 – Classification
Decision tree, Naïve Bayes & k-NN
WANG Jing
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
Methods covered in this tutorial: decision tree, Naïve Bayes and k-NN.
Decision Tree
Goal: construct a tree so that instances belonging to different classes are separated.
Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down recursive manner.
At the start, all the training examples are at the root.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Examples are partitioned recursively based on the selected attributes.
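To make the greedy procedure concrete, here is a minimal recursive sketch in Python. It is not code from the tutorial: records are assumed to be tuples whose last element is the class label, and the attribute-selection heuristic is passed in as a function (for example, one of the measures defined on the following slides).

```python
from collections import Counter

def majority_class(rows):
    # Most frequent class label among the remaining training examples.
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    """Greedy top-down induction. rows: tuples with the class label last;
    attributes: list of column indices still available for testing."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1:                  # node is pure: make it a leaf
        return labels.pop()
    if not attributes:                    # no attributes left: majority vote
        return majority_class(rows)
    best = select_attribute(rows, attributes)   # e.g. highest information gain
    node = {}
    for value in {row[best] for row in rows}:   # one branch per attribute value
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        node[value] = build_tree(subset, remaining, select_attribute)
    return (best, node)
```

A selection function based on information gain, gain ratio, or the Gini index (all defined next) can be plugged in as `select_attribute` without changing the recursion.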
Attribute Selection Measure 1:
Information Gain
Let pi be the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1}^{m} pi log2(pi)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|Dj|/|D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
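A small self-contained Python sketch of these two quantities (my own illustration, not from the slides); the demo numbers reuse the class counts from the tennis example that follows, where the full set is [9+, 5-] and Outlook splits it into [2+, 3-], [4+, 0-] and [3+, 2-].

```python
from math import log2

def entropy(counts):
    # Info(D) for a node described by its per-class counts, with 0*log2(0) = 0.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    # Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj)
    total = sum(parent_counts)
    remainder = sum(sum(part) / total * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - remainder

print(entropy([9, 5]))                                     # ~0.94
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.25 (the Outlook split)
```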
Attribute Selection Measure 2:
Gain Ratio
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses the gain ratio, a normalization of information gain, to overcome this problem.
SplitInfo_A(D) = - Σ_{j=1}^{v} (|Dj|/|D|) log2(|Dj|/|D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
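Continuing the illustration (again my own code, not from the slides): the Outlook split of the 14-record example has partition sizes 5, 4 and 5, so SplitInfo and the gain ratio can be computed directly.

```python
from math import log2

def split_info(partition_sizes):
    # SplitInfo_A(D) = -sum_j |Dj|/|D| * log2(|Dj|/|D|)
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

gain_outlook = 0.25                    # from the worked example below
split = split_info([5, 4, 5])          # Outlook partitions: Sunny, Overcast, Rain
print(split)                           # ~1.58
print(gain_outlook / split)            # GainRatio(Outlook) ~0.16
```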
Attribute Selection Measure 3:
Gini index
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1}^{n} pj^2
where pj is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
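A matching sketch for the Gini index (my own illustration; the binary split used in the demo is the Humidity attribute from the example table below, which divides the 14 records into High [3+, 4-] and Normal [6+, 1-]).

```python
def gini(counts):
    # gini(D) = 1 - sum_j pj^2, where pj is the relative frequency of class j.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    # gini_A(D) for a binary split of D into D1 and D2.
    n1, n2 = sum(d1_counts), sum(d2_counts)
    total = n1 + n2
    return n1 / total * gini(d1_counts) + n2 / total * gini(d2_counts)

full = [9, 5]                                  # the whole training set
high, normal = [3, 4], [6, 1]                  # Humidity = High / Normal
print(gini(full))                              # ~0.46
print(gini(full) - gini_split(high, normal))   # reduction in impurity ~0.09
```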
Example
Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No
Tree induction example
S [9+, 5-]
Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94
Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5)) - 3/5(log2(3/5))]
              – 4/14[-4/4(log2(4/4)) - 0/4(log2(0/4))]
              – 5/14[-3/5(log2(3/5)) - 2/5(log2(2/5))]
            = 0.94 – 0.69 = 0.25
(Terms of the form 0·log2(0) are taken to be 0.)

S [9+, 5-]
Temperature: <15 [3+, 1-], 15-25 [4+, 2-], >25 [2+, 2-]
Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4)) - 1/4(log2(1/4))]
                  – 6/14[-4/6(log2(4/6)) - 2/6(log2(2/6))]
                  – 4/14[-2/4(log2(2/4)) - 2/4(log2(2/4))]
                = 0.94 – 0.91 = 0.03
S [9+, 5-]
Humidity: High [3+, 4-], Normal [6+, 1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7)) - 4/7(log2(4/7))]
               – 7/14[-6/7(log2(6/7)) - 1/7(log2(1/7))]
             = 0.94 – 0.79 = 0.15

S [9+, 5-]
Wind: Weak [6+, 2-], Strong [3+, 3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8)) - 2/8(log2(2/8))]
           – 6/14[-3/6(log2(3/6)) - 3/6(log2(3/6))]
         = 0.94 – 0.89 = 0.05
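The four root-level gains can also be checked programmatically. The sketch below is my own verification code (not from the tutorial): it encodes the 14 training records, recomputes the gain of each attribute, and confirms that Outlook comes out highest.

```python
from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", ">25", "High", "Weak", "No"),        ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),    ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),      ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"),("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),     ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"), ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),  ("Rain", "15-25", "High", "Strong", "No"),
]

def entropy(rows):
    # Entropy of the class label (last column), with 0*log2(0) treated as 0.
    labels = [r[-1] for r in rows]
    return -sum((labels.count(c) / len(rows)) * log2(labels.count(c) / len(rows))
                for c in set(labels))

def gain(rows, attr):
    # Information gain of splitting rows on column index attr.
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        part = [r for r in rows if r[attr] == v]
        remainder += len(part) / len(rows) * entropy(part)
    return entropy(rows) - remainder

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(data, i), 2))   # 0.25, 0.03, 0.15, 0.05
```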
Comparing the information gains of the four attributes on the full training set:
Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05
Outlook has the largest gain, so it becomes the root test. The Overcast branch is pure (all Yes); the Sunny and Rain branches still have to be resolved:
Outlook = Sunny → ??
Outlook = Overcast → Yes
Outlook = Rain → ??
Sunny [2+, 3-]
Temperature: <15 [1+, 0-], 15-25 [1+, 1-], >25 [0+, 2-]
Humidity: High [0+, 3-], Normal [2+, 0-]
Wind: Weak [1+, 2-], Strong [1+, 1-]
Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97
Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1)) - 0/1(log2(0/1))]
                  – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                  – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
                = 0.97 – 0.4 = 0.57
Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3)) - 3/3(log2(3/3))]
               – 2/5[-2/2(log2(2/2)) - 0/2(log2(0/2))]
             = 0.97 – 0 = 0.97
Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3)) - 2/3(log2(2/3))]
           – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
         = 0.97 – 0.95 = 0.02
Humidity has the largest gain, so the Sunny branch is split on Humidity.
The tree so far:
Outlook = Sunny → Humidity: High → No, Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → ??
Rain [3+, 2-]
Temperature: <15 [1+, 1-], 15-25 [2+, 1-], >25 [0+, 0-]
Humidity: High [1+, 1-], Normal [2+, 1-]
Wind: Weak [3+, 0-], Strong [0+, 2-]
Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97
Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                  – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
                  – 0/5 × 0
                = 0.97 – 0.95 = 0.02
Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
               – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
             = 0.97 – 0.95 = 0.02
Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3)) - 0/3(log2(0/3))]
           – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
         = 0.97 – 0 = 0.97
Wind has the largest gain, so the Rain branch is split on Wind.
The final decision tree:
Outlook = Sunny → Humidity: High → No, Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → Wind: Weak → Yes, Strong → No
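One compact way to represent and use the induced tree is a nested dictionary plus a small lookup function. This is my own illustration, not code from the tutorial; it checks one of the training records.

```python
# Learned tree: internal nodes are (attribute, {value: subtree}) pairs, leaves are labels.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"}),
})

def predict(tree, record):
    # record maps attribute names to values, e.g. {"Outlook": "Rain", "Wind": "Strong"}.
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

print(predict(tree, {"Outlook": "Rain", "Temperature": "15-25",
                     "Humidity": "High", "Wind": "Strong"}))   # "No"
```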
Bayesian Classification
A statistical classifier: it performs probabilistic prediction, i.e., it predicts class membership probabilities P(Ci | x1, x2, ..., xn), where xj is the value of attribute Aj.
Choose the class label that has the highest probability.
Foundation: Bayes' Theorem:
P(Ci | x1, x2, ..., xn) = P(x1, x2, ..., xn | Ci) P(Ci) / P(x1, x2, ..., xn)
P(Ci | x1, x2, ..., xn): posterior probability
P(Ci): prior probability
P(x1, x2, ..., xn | Ci): likelihood
Model: compute these probabilities from the data.
Naïve Bayes Classifier
Problem: the joint probability P(x1, x2, ..., xn | Ci) is difficult to estimate.
Naïve Bayes classifier
Assumption: the attributes are conditionally independent given the class:
P(x1, x2, ..., xn | Ci) = P(x1 | Ci) × ... × P(xn | Ci)
so that
P(Ci | x1, x2, ..., xn) = [ Π_{j=1}^{n} P(xj | Ci) ] P(Ci) / P(x1, x2, ..., xn)
Naïve Bayes Classifier
A  B  C
m  b  t
m  s  t
g  q  t
h  s  t
g  q  t
g  q  f
g  s  f
h  b  f
h  q  f
m  b  f

P(C=t) = 1/2    P(C=f) = 1/2
P(A=m|C=t) = 2/5    P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5    P(B=q|C=f) = 2/5
Test record: A=m, B=q, C=?
Naïve Bayes Classifier
For C=t:
P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2 = 2/25
P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)  ← higher
For C=f:
P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2 = 1/25
P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
Conclusion: the test record A=m, B=q is classified as C=t.
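The same computation can be reproduced with a few lines of Python; this is my own verification sketch, estimating the prior and conditional probabilities by counting over the ten-record table.

```python
# Records as (A, B, C) tuples, copied from the table above.
data = [("m","b","t"), ("m","s","t"), ("g","q","t"), ("h","s","t"), ("g","q","t"),
        ("g","q","f"), ("g","s","f"), ("h","b","f"), ("h","q","f"), ("m","b","f")]

def naive_bayes_score(a, b, c):
    # P(A=a|C=c) * P(B=b|C=c) * P(C=c); proportional to the posterior P(C=c | A=a, B=b).
    rows_c = [r for r in data if r[2] == c]
    p_a = sum(r[0] == a for r in rows_c) / len(rows_c)
    p_b = sum(r[1] == b for r in rows_c) / len(rows_c)
    return p_a * p_b * len(rows_c) / len(data)

scores = {c: naive_bayes_score("m", "q", c) for c in ("t", "f")}
print(scores)                       # {'t': 0.08 (= 2/25), 'f': 0.04 (= 1/25)}
print(max(scores, key=scores.get))  # 't'
```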
Nearest Neighbor Classification
Input:
A set of stored records
k: the number of nearest neighbors
Output: the class label of the unknown record, obtained as follows:
Compute the distance to each stored record: d(p, q) = sqrt( Σ_i (p_i - q_i)^2 )
Identify the k nearest neighbors
Determine the class label of the unknown record based on the class labels of its nearest neighbors (i.e., by taking a majority vote)
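A minimal k-NN classifier along these lines (my own sketch, not from the tutorial): records are assumed to be (point, label) pairs, and ties in the vote are broken arbitrarily by Counter.

```python
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_classify(records, query, k):
    """records: list of ((x1, x2, ...), label) pairs; query: a point; k: #neighbors."""
    # Sort the stored records by distance to the query point and keep the k nearest.
    neighbors = sorted(records, key=lambda r: dist(r[0], query))[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```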
Nearest Neighbor Classification
A Discrete Example
Input: 8 training instances and a new instance Pn = (4, 4); classify Pn with k = 1 and k = 3.

P1 (4, 2)     Orange
P2 (0.5, 2.5) Orange
P3 (2.5, 2.5) Orange
P4 (3, 3.5)   Orange
P5 (5.5, 3.5) Orange
P6 (2, 4)     Black
P7 (4, 5)     Black
P8 (2.5, 5.5) Black

Calculate the distances to Pn:
d(P1, Pn) = sqrt((4-4)^2 + (2-4)^2) = 2
d(P2, Pn) = 3.80
d(P3, Pn) = 2.12
d(P4, Pn) = 1.12
d(P5, Pn) = 1.58
d(P6, Pn) = 2
d(P7, Pn) = 1
d(P8, Pn) = 2.12
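These distances and the resulting votes can be checked with a few lines of code in the style of the classifier sketched above; the script below is my own illustration.

```python
from collections import Counter
from math import dist

points = {"P1": ((4, 2), "Orange"),     "P2": ((0.5, 2.5), "Orange"),
          "P3": ((2.5, 2.5), "Orange"), "P4": ((3, 3.5), "Orange"),
          "P5": ((5.5, 3.5), "Orange"), "P6": ((2, 4), "Black"),
          "P7": ((4, 5), "Black"),      "P8": ((2.5, 5.5), "Black")}
pn = (4, 4)

for name, (p, _) in points.items():
    print(name, round(dist(p, pn), 2))           # 2, 3.8, 2.12, 1.12, 1.58, 2, 1, 2.12

for k in (1, 3):
    nearest = sorted(points.values(), key=lambda r: dist(r[0], pn))[:k]
    vote = Counter(label for _, label in nearest).most_common(1)[0][0]
    print(k, vote)                               # k=1 -> Black (P7); k=3 -> Orange (P7, P4, P5)
```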
Nearest Neighbor Classification
[Figure: the training points P1–P8 and the new instance Pn plotted twice, once for k = 1 and once for k = 3. With k = 1 the single nearest neighbor is P7 (Black); with k = 3 the nearest neighbors are P7, P4 and P5, so the majority vote is Orange.]
Nearest Neighbor Classification…
Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Each attribute should be mapped to the same range, e.g. by min-max normalization.
Example: two data records a = (1, 1000) and b = (0.5, 1). dis(a, b) = ?
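To illustrate the point, here is a small sketch of my own, assuming for simplicity that the per-attribute min and max are taken over just these two records: before normalization the distance is dominated almost entirely by the second attribute, after min-max scaling both attributes contribute equally.

```python
from math import dist

a, b = (1, 1000), (0.5, 1)
print(round(dist(a, b), 2))        # ~999.0, dominated by the second attribute

def min_max(records):
    # Rescale every attribute to [0, 1]: x' = (x - min) / (max - min).
    lo = [min(col) for col in zip(*records)]
    hi = [max(col) for col in zip(*records)]
    return [tuple((x - l) / (h - l) for x, l, h in zip(r, lo, hi)) for r in records]

a_n, b_n = min_max([a, b])
print(a_n, b_n)                    # (1.0, 1.0) and (0.0, 0.0)
print(round(dist(a_n, b_n), 2))    # ~1.41: both attributes now contribute equally
```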
Lazy & Eager Learning
Two types of learning methodologies:
Lazy learning: instance-based learning (e.g. k-NN)
Eager learning: decision-tree and Bayesian classification, ANN & SVM
Lazy & Eager Learning
Key differences:
Lazy learning
Does not require model building
Less time training but more time predicting
Effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
Eager learning
Requires model building
More time training but less time predicting
Must commit to a single hypothesis that covers the entire instance space