Transcript Tutorial 2
SEEM4630 2013-2014 Tutorial 2
Classification: Definition
Given a collection of records (the training set), where each record
contains a set of attributes and one of the attributes is the
class.
Find a model for the class attribute as a function of the
values of the other attributes.
Decision tree
Naïve Bayes
k-NN
Goal: previously unseen records should be assigned a
class as accurately as possible.
Decision Tree
Goal
Construct a tree so that instances belonging to
different classes are separated
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive manner
At start, all the training examples are at the root
Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)
Examples are partitioned recursively based on selected
attributes
Attribute Selection Measure 1: Information Gain
Let pi be the probability that a tuple belongs to class Ci,
estimated by |Ci,D|/|D|
Expected information (entropy) needed to classify a tuple
in D:
Info(D) = - Σ_{i=1}^{m} p_i × log2(p_i)
Information needed (after using A to split D into v
partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A
Gain(A) = Info(D) - Info_A(D)
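These formulas map directly onto code. Below is a minimal Python sketch of Info(D) and Gain(A); the function names entropy and info_gain are my own, not from the tutorial.

```python
# Minimal sketch of Info(D) and Gain(A); names are illustrative.
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple drawn from `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D); `values` holds attribute A's value per tuple."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a
```

For instance, entropy(["Yes"] * 9 + ["No"] * 5) returns ≈ 0.94, the Info(D) used in the worked example below.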
Attribute Selection Measure 2: Gain Ratio
Information gain measure is biased towards attributes
with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (a normalization of information gain)
SplitInfo_A(D) = - Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
GainRatio(A) = Gain(A)/SplitInfo(A)
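Since SplitInfo_A(D) is exactly the entropy of attribute A's own value distribution, the sketch above extends in a few lines (again, names are mine):

```python
# Minimal sketch of SplitInfo_A(D) and GainRatio(A), reusing entropy() and
# info_gain() from the previous sketch.
def split_info(values):
    # SplitInfo_A(D) is the entropy of the attribute's value distribution
    return entropy(values)

def gain_ratio(values, labels):
    # GainRatio(A) = Gain(A) / SplitInfo(A)
    return info_gain(values, labels) / split_info(values)
```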
Attribute Selection Measure 3: Gini index
If a data set D contains examples from n classes, the gini
index gini(D) is defined as
gini(D) = 1 - Σ_{j=1}^{n} p_j^2
where pj is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2,
the gini index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)
Reduction in Impurity: Δgini(A) = gini(D) - gini_A(D)
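The Gini measures admit the same treatment; a minimal sketch with my own names:

```python
# Minimal sketch of gini(D), gini_A(D) for a binary split, and the reduction.
from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels1, labels2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n = len(labels1) + len(labels2)
    return len(labels1) / n * gini(labels1) + len(labels2) / n * gini(labels2)

def gini_reduction(labels1, labels2):
    # Reduction in impurity: gini(D) - gini_A(D)
    return gini(labels1 + labels2) - gini_split(labels1, labels2)
```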
Example
Outlook    Temperature  Humidity  Wind    Play Tennis
Sunny      >25          High      Weak    No
Sunny      >25          High      Strong  No
Overcast   >25          High      Weak    Yes
Rain       15-25        High      Weak    Yes
Rain       <15          Normal    Weak    Yes
Rain       <15          Normal    Strong  No
Overcast   <15          Normal    Strong  Yes
Sunny      15-25        High      Weak    No
Sunny      <15          Normal    Weak    Yes
Rain       15-25        Normal    Weak    Yes
Sunny      15-25        Normal    Strong  Yes
Overcast   15-25        High      Strong  Yes
Overcast   >25          Normal    Weak    Yes
Rain       15-25        High      Strong  No
Tree induction example
Entropy of data S
Info(S) = -9/14(log2(9/14))-5/14(log2(5/14)) = 0.94
Split data by attribute Outlook
S [9+,5-] split by Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5))-3/5(log2(3/5))]
– 4/14[-4/4(log2(4/4))-0/4(log2(0/4))]
– 5/14[-3/5(log2(3/5))-2/5(log2(2/5))]
= 0.94 – 0.69 = 0.25
Tree induction example
Split data by attribute Temperature
S [9+,5-] split by Temperature: <15 [3+,1-], 15-25 [4+,2-], >25 [2+,2-]
Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4))-1/4(log2(1/4))]
– 6/14[-4/6(log2(4/6))-2/6(log2(2/6))]
– 4/14[-2/4(log2(2/4))-2/4(log2(2/4))]
= 0.94 – 0.91 = 0.03
Tree induction example
Split data by attribute Humidity
S [9+,5-] split by Humidity: High [3+,4-], Normal [6+,1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7))-4/7(log2(4/7))]
– 7/14[-6/7(log2(6/7))-1/7(log2(1/7))]
= 0.94 – 0.79 = 0.15
Split data by attribute Wind
S [9+,5-] split by Wind: Weak [6+,2-], Strong [3+,3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8))-2/8(log2(2/8))]
– 6/14[-3/6(log2(3/6))-3/6(log2(3/6))]
= 0.94 – 0.89 = 0.05
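All four gains can be verified numerically with the info_gain sketch from earlier, applied to the 14 records of the table above:

```python
# Verifying the worked example: columns transcribed from the Play Tennis table.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temperature = [">25", ">25", ">25", "15-25", "<15", "<15", "<15",
               "15-25", "<15", "15-25", "15-25", "15-25", ">25", "15-25"]
humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
            "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

for name, column in [("Outlook", outlook), ("Temperature", temperature),
                     ("Humidity", humidity), ("Wind", wind)]:
    print(name, round(info_gain(column, play), 2))
# Outlook 0.25, Temperature 0.03, Humidity 0.15, Wind 0.05
```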
Tree induction example
Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05
Outlook gives the largest gain, so it becomes the root split.
Outlook
  Sunny → ??
  Overcast → Yes
  Rain → ??
Entropy of branch Sunny
Info(Sunny) = -2/5(log2(2/5))-3/5(log2(3/5)) = 0.97
Split Sunny branch by attribute Temperature
Sunny [2+,3-] split by Temperature: <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]
Gain(Temperature)
= 0.97
– 1/5[-1/1(log2(1/1))-0/1(log2(0/1))]
– 2/5[-1/2(log2(1/2))-1/2(log2(1/2))]
– 2/5[-0/2(log2(0/2))-2/2(log2(2/2))]
= 0.97 – 0.4 = 0.57
Split Sunny branch by attribute Humidity
Sunny [2+,3-] split by Humidity: High [0+,3-], Normal [2+,0-]
Gain(Humidity)
= 0.97
– 3/5[-0/3(log2(0/3))-3/3(log2(3/3))]
– 2/5[-2/2(log2(2/2))-0/2(log2(0/2))]
= 0.97 – 0 = 0.97
Split Sunny branch by attribute Wind
Sunny [2+,3-] split by Wind: Weak [1+,2-], Strong [1+,1-]
Gain(Wind)
= 0.97
– 3/5[-1/3(log2(1/3))-2/3(log2(2/3))]
– 2/5[-1/2(log2(1/2))-1/2(log2(1/2))]
= 0.97 – 0.95 = 0.02
Tree induction example
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → ??
Entropy of branch Rain
Info(Rain) = -3/5(log2(3/5))-2/5(log2(2/5)) = 0.97
Split Rain branch by attribute Temperature
Rain [3+,2-] split by Temperature: <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]
Gain(Temperature)
= 0.97
– 2/5[-1/2(log2(1/2))-1/2(log2(1/2))]
– 3/5[-2/3(log2(2/3))-1/3(log2(1/3))]
– 0/5 × 0 (the empty >25 partition contributes nothing)
= 0.97 – 0.95 = 0.02
Split Rain branch by attribute Humidity
Rain [3+,2-] split by Humidity: High [1+,1-], Normal [2+,1-]
Gain(Humidity)
= 0.97
– 2/5[-1/2(log2(1/2))-1/2(log2(1/2))]
– 3/5[-2/3(log2(2/3))-1/3(log2(1/3))]
= 0.97 – 0.95 = 0.02
Split Rain branch by attribute Wind
Rain [3+,2-] split by Wind: Weak [3+,0-], Strong [0+,2-]
Gain(Wind)
= 0.97
– 3/5[-3/3(log2(3/3))-0/3(log2(0/3))]
– 2/5[-0/2(log2(0/2))-2/2(log2(2/2))]
= 0.97 – 0 = 0.97
The final decision tree:
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Weak → Yes
    Strong → No
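The greedy recursion that produces this tree fits in a few lines. The following ID3-style sketch is my own code, not the tutorial's; it reuses info_gain from earlier and returns the tree as nested dicts:

```python
# Minimal ID3-style sketch: greedy, top-down, recursive construction.
def id3(rows, labels, attributes):
    """rows: list of dicts mapping attribute name -> value; labels: class per row."""
    if len(set(labels)) == 1:          # pure node: stop and predict its class
        return labels[0]
    if not attributes:                 # no attributes left: majority class
        return max(set(labels), key=labels.count)
    # Greedy step: choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain([r[a] for r in rows], labels))
    rest = [a for a in attributes if a != best]
    tree = {}
    for v in set(r[best] for r in rows):
        branch = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        tree[v] = id3([r for r, _ in branch], [y for _, y in branch], rest)
    return {best: tree}
```

On the 14 Play Tennis records it reproduces the tree drawn above: Outlook at the root, Humidity under Sunny, Wind under Rain.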
Bayesian Classification
A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
P(Ci | x1, x2, ..., xn), where xi is the value of attribute Ai
Choose the class label that has the highest probability
Foundation: Based on Bayes’ Theorem.
P(Ci | x1, x2, ..., xn) = P(x1, x2, ..., xn | Ci) × P(Ci) / P(x1, x2, ..., xn)
P(Ci | x1, x2, ..., xn): posterior probability
P(Ci): prior probability
P(x1, x2, ..., xn | Ci): likelihood
Model: compute P(Ci | x1, x2, ..., xn) from the data
Naïve Bayes Classifier
Problem: joint probabilities are difficult to estimate
P(x1, x2, ..., xn | Ci)
Naïve Bayes Classifier
Assumption: attributes are conditionally independent
P(x1, x2, ..., xn | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
P(Ci | x1, x2, ..., xn) = [Π_{j=1}^{n} P(xj | Ci)] × P(Ci) / P(x1, x2, ..., xn)
Example: Naïve Bayes Classifier
A  B  C
m  b  t
m  s  t
g  q  t
h  s  t
g  q  t
g  q  f
g  s  f
h  b  f
h  q  f
m  b  f
P(C=t) = 1/2
P(C=f ) = 1/2
P(A=m|C=t) = 2/5
P(A=m|C=f ) = 1/5
P(B=q|C=t) = 2/5
P(B=q|C=f ) = 2/5
Test Record: A=m, B=q, C=?
Example: Naïve Bayes Classifier
For C = t
P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2
= 2/25 Higher!
P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)
For C = f
P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2
= 1/25
P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
Conclusion: for A=m, B=q, predict C=t
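The same arithmetic is easy to check in code; a minimal sketch over the ten records of the table (variable names are mine):

```python
# Minimal naive Bayes check of the worked example; data from the A/B/C table.
data = [("m", "b", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"),
        ("g", "q", "t"), ("g", "q", "f"), ("g", "s", "f"), ("h", "b", "f"),
        ("h", "q", "f"), ("m", "b", "f")]

def score(c, a, b):
    """Numerator of the posterior: P(A=a|C=c) * P(B=b|C=c) * P(C=c)."""
    rows = [r for r in data if r[2] == c]
    p_a = sum(1 for r in rows if r[0] == a) / len(rows)
    p_b = sum(1 for r in rows if r[1] == b) / len(rows)
    return p_a * p_b * len(rows) / len(data)

print(score("t", "m", "q"))  # 0.08 = 2/25
print(score("f", "m", "q"))  # 0.04 = 1/25  -> predict C = t
```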
Nearest Neighbor Classification
Input
A set of stored records
k: # of nearest neighbors
Output
The class label of the unknown record, determined as follows:
Compute the distance to each stored record: d(p, q) = √(Σ_i (p_i - q_i)^2)
Identify the k nearest neighbors
Determine the class label of the unknown record based on the class labels
of its nearest neighbors (i.e., by taking a majority vote)
Nearest Neighbor Classification
A Discrete Example
Input: 8 training instances
P1 (4, 2) Orange
P2 (0.5, 2.5) Orange
P3 (2.5, 2.5) Orange
P4 (3, 3.5) Orange
P5 (5.5, 3.5) Orange
P6 (2, 4) Black
P7 (4, 5) Black
P8 (2.5, 5.5) Black
New instance: Pn (4, 4), class = ?, with k = 1 and k = 3
Calculate the distances:
d(P1, Pn) = √((4-4)^2 + (2-4)^2) = 2
d(P2, Pn) = 3.80
d(P3, Pn) = 2.12
d(P4, Pn) = 1.12
d(P5, Pn) = 1.58
d(P6, Pn) = 2
d(P7, Pn) = 1
d(P8, Pn) = 2.12
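A minimal k-NN sketch over these eight points (math.dist is Python's Euclidean distance; the rest of the names are mine):

```python
# Minimal k-NN: sort stored records by distance, majority vote over the top k.
from collections import Counter
from math import dist

points = [((4, 2), "Orange"), ((0.5, 2.5), "Orange"), ((2.5, 2.5), "Orange"),
          ((3, 3.5), "Orange"), ((5.5, 3.5), "Orange"), ((2, 4), "Black"),
          ((4, 5), "Black"), ((2.5, 5.5), "Black")]

def knn(query, k):
    neighbors = sorted(points, key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn((4, 4), 1))  # Black  (nearest neighbor is P7)
print(knn((4, 4), 3))  # Orange (P7 is Black; P4 and P5 are Orange)
```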
Nearest Neighbor Classification
[Figure: the eight training points and Pn plotted for k = 1 (nearest neighbor P7 → Black) and k = 3 (nearest neighbors P7, P4, P5 → Orange)]
Nearest Neighbor Classification…
Scaling issues
Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
• Each attribute should fall within the same range
• Min-Max normalization
Example:
• Two data records: a = (1, 1000), b = (0.5, 1)
• dis(a, b) = ?
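A sketch of the answer, and of how min-max normalization changes it; the per-attribute ranges below are taken from these two records only, so they are illustrative assumptions:

```python
# Without scaling, the second attribute (range ~1..1000) dominates the distance.
from math import dist

a, b = (1, 1000), (0.5, 1)
print(dist(a, b))  # ~999.0: essentially just |1000 - 1|

# Min-max normalization: map each attribute to [0, 1] using its observed min/max.
def min_max(x, lo, hi):
    return (x - lo) / (hi - lo)

a_norm = (min_max(1, 0.5, 1), min_max(1000, 1, 1000))  # (1.0, 1.0)
b_norm = (min_max(0.5, 0.5, 1), min_max(1, 1, 1000))   # (0.0, 0.0)
print(dist(a_norm, b_norm))  # ~1.41: both attributes now contribute equally
```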
Classification: Lazy & Eager Learning
Two Types of Learning Methodologies
Lazy Learning
• Instance-based learning (e.g., k-NN)
Eager Learning
• Decision-tree and Bayesian classification.
• ANN & SVM
Differences Between Lazy & Eager Learning
Lazy Learning
a. Do not require model building
b. Less time training but more time predicting
c. Lazy method effectively uses a richer hypothesis space
since it uses many local linear functions to form its
implicit global approximation to the target function
Eager Learning
a. Require model building
b. More time training but less time predicting
c. Must commit to a single hypothesis that covers the
entire instance space