Transcript Tutorial 2

SEEM4630 2013-2014 Tutorial 2
Classification: Definition
• Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes, e.g.:
  - Decision tree
  - Naïve Bayes
  - k-NN
• Goal: previously unseen records should be assigned a class as accurately as possible.
Decision Tree
• Goal
  - Construct a tree so that instances belonging to different classes are separated
• Basic algorithm (a greedy algorithm); a minimal sketch is given below
  - The tree is constructed in a top-down, recursive manner
  - At the start, all the training examples are at the root
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Examples are partitioned recursively based on the selected attributes
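The procedure above can be summarised in a few lines of code. Below is a minimal, illustrative Python sketch of the top-down greedy construction; the record format (a dict of attribute values with a 'class' key) and the pluggable choose_attribute scoring function are assumptions made for illustration, not part of the original slides.

```python
from collections import Counter

def build_tree(records, attributes, choose_attribute):
    """Top-down greedy decision-tree induction (sketch).

    records          : list of dicts, each holding attribute values plus a 'class' key
    attributes       : attribute names still available for splitting
    choose_attribute : function(records, attributes) -> attribute to split on
    """
    classes = [r['class'] for r in records]

    # Stop when all examples share one class or no attributes remain: return a majority-class leaf.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]

    best = choose_attribute(records, attributes)          # greedy attribute-selection step
    node = {'attribute': best, 'branches': {}}

    # Partition the examples by the selected attribute and recurse on each partition.
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node['branches'][value] = build_tree(subset, remaining, choose_attribute)
    return node
```

A measure such as the information gain defined on the next slide can serve as choose_attribute.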
Attribute Selection Measure 1: Information Gain
• Let p_i be the probability that a tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

• Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

• Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
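For concreteness, these three quantities can be written directly as code. The sketch below reuses the illustrative dict-of-records format from the previous snippet (the 'class' key is an assumption, not part of the slides).

```python
from collections import Counter
from math import log2

def entropy(records):
    """Info(D): expected information needed to classify a tuple in D."""
    counts = Counter(r['class'] for r in records)
    total = len(records)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_after_split(records, attribute):
    """Info_A(D): weighted entropy of the partitions induced by `attribute`."""
    total = len(records)
    values = set(r[attribute] for r in records)
    return sum((len(subset) / total) * entropy(subset)
               for subset in ([r for r in records if r[attribute] == v] for v in values))

def information_gain(records, attribute):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy(records) - info_after_split(records, attribute)
```

Selecting `max(attributes, key=lambda a: information_gain(records, a))` then plays the role of choose_attribute in the earlier tree-building sketch.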
Attribute Selection Measure 2: Gain Ratio
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of the information gain):

  SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

• GainRatio(A) = Gain(A) / SplitInfo_A(D)
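A matching sketch of the gain ratio; it reuses the information_gain helper defined above and is otherwise self-contained (the function names are again illustrative).

```python
from collections import Counter
from math import log2

def split_info(records, attribute):
    """SplitInfo_A(D): entropy of the partition sizes produced by splitting on `attribute`."""
    total = len(records)
    counts = Counter(r[attribute] for r in records)   # |D_j| for each value of the attribute
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(records, attribute):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); reuses information_gain() from the previous sketch."""
    si = split_info(records, attribute)
    return information_gain(records, attribute) / si if si > 0 else 0.0
```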
Attribute Selection Measure 3: Gini index
• If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 - Σ_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)

• Reduction in impurity: Δgini(A) = gini(D) - gini_A(D)
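A corresponding Python sketch of the Gini-based measure, again using the illustrative dict-of-records format with a 'class' key; the binary split follows the two-subset form above.

```python
from collections import Counter

def gini(records):
    """gini(D) = 1 - sum of squared class frequencies."""
    if not records:
        return 0.0
    counts = Counter(r['class'] for r in records)
    return 1.0 - sum((c / len(records)) ** 2 for c in counts.values())

def gini_split(records, attribute, value):
    """gini_A(D) for the binary split D1 = {attribute == value}, D2 = the rest."""
    d1 = [r for r in records if r[attribute] == value]
    d2 = [r for r in records if r[attribute] != value]
    total = len(records)
    return (len(d1) / total) * gini(d1) + (len(d2) / total) * gini(d2)

def gini_reduction(records, attribute, value):
    """Reduction in impurity: gini(D) - gini_A(D)."""
    return gini(records) - gini_split(records, attribute, value)
```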
Example
Outlook    Temperature  Humidity  Wind    Play Tennis
Sunny      >25          High      Weak    No
Sunny      >25          High      Strong  No
Overcast   >25          High      Weak    Yes
Rain       15-25        High      Weak    Yes
Rain       <15          Normal    Weak    Yes
Rain       <15          Normal    Strong  No
Overcast   <15          Normal    Strong  Yes
Sunny      15-25        High      Weak    No
Sunny      <15          Normal    Weak    Yes
Rain       15-25        Normal    Weak    Yes
Sunny      15-25        Normal    Strong  Yes
Overcast   15-25        High      Strong  Yes
Overcast   >25          Normal    Weak    Yes
Rain       15-25        High      Strong  No
Tree induction example
• Entropy of data S

  Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94

• Split data by attribute Outlook

  S[9+,5-] --Outlook--> Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]

  Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5)) - 3/5(log2(3/5))]
                       – 4/14[-4/4(log2(4/4)) - 0/4(log2(0/4))]
                       – 5/14[-3/5(log2(3/5)) - 2/5(log2(2/5))]
                = 0.94 – 0.69 = 0.25
Tree induction example
• Split data by attribute Temperature

  S[9+,5-] --Temperature--> <15 [3+,1-], 15-25 [5+,1-], >25 [2+,2-]

  Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4)) - 1/4(log2(1/4))]
                           – 6/14[-5/6(log2(5/6)) - 1/6(log2(1/6))]
                           – 4/14[-2/4(log2(2/4)) - 2/4(log2(2/4))]
                    = 0.94 – 0.80 = 0.14
Tree induction example
• Split data by attribute Humidity

  S[9+,5-] --Humidity--> High [3+,4-], Normal [6+,1-]

  Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7)) - 4/7(log2(4/7))]
                        – 7/14[-6/7(log2(6/7)) - 1/7(log2(1/7))]
                 = 0.94 – 0.79 = 0.15

• Split data by attribute Wind

  S[9+,5-] --Wind--> Weak [6+,2-], Strong [3+,3-]

  Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8)) - 2/8(log2(2/8))]
                    – 6/14[-3/6(log2(3/6)) - 3/6(log2(3/6))]
             = 0.94 – 0.89 = 0.05
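These four gains can also be checked numerically. The short, self-contained Python sketch below encodes the play-tennis table and recomputes them; the tuple layout and names are simply taken from the table above.

```python
from collections import Counter
from math import log2

# The 14 play-tennis records: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ('Sunny', '>25', 'High', 'Weak', 'No'),        ('Sunny', '>25', 'High', 'Strong', 'No'),
    ('Overcast', '>25', 'High', 'Weak', 'Yes'),    ('Rain', '15-25', 'High', 'Weak', 'Yes'),
    ('Rain', '<15', 'Normal', 'Weak', 'Yes'),      ('Rain', '<15', 'Normal', 'Strong', 'No'),
    ('Overcast', '<15', 'Normal', 'Strong', 'Yes'),('Sunny', '15-25', 'High', 'Weak', 'No'),
    ('Sunny', '<15', 'Normal', 'Weak', 'Yes'),     ('Rain', '15-25', 'Normal', 'Weak', 'Yes'),
    ('Sunny', '15-25', 'Normal', 'Strong', 'Yes'), ('Overcast', '15-25', 'High', 'Strong', 'Yes'),
    ('Overcast', '>25', 'Normal', 'Weak', 'Yes'),  ('Rain', '15-25', 'High', 'Strong', 'No'),
]
ATTRS = {'Outlook': 0, 'Temperature': 1, 'Humidity': 2, 'Wind': 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)                  # class label is the last column
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    i = ATTRS[attr]
    after = sum((len(s) / len(rows)) * entropy(s)          # weighted entropy after the split
                for v in set(r[i] for r in rows)
                for s in [[r for r in rows if r[i] == v]])
    return entropy(rows) - after

for a in ATTRS:
    print(a, round(gain(DATA, a), 2))
# Outlook 0.25, Temperature 0.14, Humidity 0.15, Wind 0.05 -- matching the hand calculations
```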
Tree induction example
Using the full play-tennis table above:

  Gain(Outlook) = 0.25
  Gain(Temperature) = 0.14
  Gain(Humidity) = 0.15
  Gain(Wind) = 0.05

Outlook has the largest gain, so it is chosen as the root attribute:

  Outlook --> Sunny: ??   Overcast: Yes   Rain: ??
• Entropy of branch Sunny

  Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97

• Split the Sunny branch by attribute Temperature

  Sunny[2+,3-] --Temperature--> <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]

  Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1)) - 0/1(log2(0/1))]
                           – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                           – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
                    = 0.97 – 0.4 = 0.57

• Split the Sunny branch by attribute Humidity

  Sunny[2+,3-] --Humidity--> High [0+,3-], Normal [2+,0-]

  Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3)) - 3/3(log2(3/3))]
                        – 2/5[-2/2(log2(2/2)) - 0/2(log2(0/2))]
                 = 0.97 – 0 = 0.97

• Split the Sunny branch by attribute Wind

  Sunny[2+,3-] --Wind--> Weak [1+,2-], Strong [1+,1-]

  Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3)) - 2/3(log2(2/3))]
                    – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
             = 0.97 – 0.95 = 0.02
Tree induction example
  Outlook --> Sunny: Humidity (High: No, Normal: Yes)
              Overcast: Yes
              Rain: ??
• Entropy of branch Rain

  Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97

• Split the Rain branch by attribute Temperature

  Rain[3+,2-] --Temperature--> <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]

  Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                           – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
                           – 0/5[0]   (the >25 partition is empty)
                    = 0.97 – 0.95 = 0.02

• Split the Rain branch by attribute Humidity

  Rain[3+,2-] --Humidity--> High [1+,1-], Normal [2+,1-]

  Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                        – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
                 = 0.97 – 0.95 = 0.02

• Split the Rain branch by attribute Wind

  Rain[3+,2-] --Wind--> Weak [3+,0-], Strong [0+,2-]

  Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3)) - 0/3(log2(0/3))]
                    – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
             = 0.97 – 0 = 0.97
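The branch-level gains for both the Sunny and Rain subsets can be verified with the same machinery; the snippet below reuses DATA, entropy() and gain() from the earlier play-tennis sketch and simply filters the rows before recomputing the gains.

```python
# Reuses DATA, ATTRS, entropy() and gain() from the earlier play-tennis sketch.
for branch in ('Sunny', 'Rain'):
    subset = [r for r in DATA if r[0] == branch]      # rows whose Outlook equals the branch value
    gains = {a: round(gain(subset, a), 2) for a in ('Temperature', 'Humidity', 'Wind')}
    print(branch, gains)
# Sunny: Temperature 0.57, Humidity 0.97, Wind 0.02 -> split on Humidity
# Rain:  Temperature 0.02, Humidity 0.02, Wind 0.97 -> split on Wind
```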
The final decision tree:

  Outlook --> Sunny: Humidity (High: No, Normal: Yes)
              Overcast: Yes
              Rain: Wind (Weak: Yes, Strong: No)
Bayesian Classification
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
  - P(Ci | x1, x2, ..., xn), where xi is the value of attribute Ai
  - Choose the class label that has the highest probability
• Foundation: based on Bayes' theorem

  P(Ci | x1, x2, ..., xn) = P(x1, x2, ..., xn | Ci) P(Ci) / P(x1, x2, ..., xn)

  - P(Ci | x1, x2, ..., xn): posterior probability
  - P(Ci): prior probability
  - P(x1, x2, ..., xn | Ci): likelihood
• Model: the likelihood and the prior are computed from the data
Naïve Bayes Classifier
• Problem: the joint probabilities P(x1, x2, ..., xn | Ci) are difficult to estimate
• Naïve Bayes classifier
  - Assumption: attributes are conditionally independent given the class

    P(x1, x2, ..., xn | Ci) = P(x1 | Ci) × ... × P(xn | Ci)

    P(Ci | x1, x2, ..., xn) = [ Π_{j=1}^{n} P(xj | Ci) ] P(Ci) / P(x1, x2, ..., xn)
Example: Naïve Bayes Classifier
A    B    C
m    b    t
m    s    t
g    q    t
h    s    t
g    q    t
g    q    f
g    s    f
h    b    f
h    q    f
m    b    f

P(C=t) = 1/2          P(C=f) = 1/2
P(A=m|C=t) = 2/5      P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5      P(B=q|C=f) = 2/5

Test record: A=m, B=q, C=?
Example: Naïve Bayes Classifier
For C = t:
  P(A=m|C=t) × P(B=q|C=t) × P(C=t) = 2/5 × 2/5 × 1/2 = 2/25   (higher!)
  P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)

For C = f:
  P(A=m|C=f) × P(B=q|C=f) × P(C=f) = 1/5 × 2/5 × 1/2 = 1/25
  P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)

Conclusion: for A=m, B=q, predict C=t.
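The same computation as a short, self-contained Python sketch; it only counts frequencies in the ten training records above (the column order A, B, C is taken from the example table).

```python
# Training records from the example: (A, B, C), with C as the class label.
DATA = [
    ('m', 'b', 't'), ('m', 's', 't'), ('g', 'q', 't'), ('h', 's', 't'), ('g', 'q', 't'),
    ('g', 'q', 'f'), ('g', 's', 'f'), ('h', 'b', 'f'), ('h', 'q', 'f'), ('m', 'b', 'f'),
]

def naive_bayes_score(a, b, c):
    """P(A=a|C=c) * P(B=b|C=c) * P(C=c): the numerator of the posterior for class c."""
    rows_c = [r for r in DATA if r[2] == c]
    p_c = len(rows_c) / len(DATA)                              # prior P(C=c)
    p_a = sum(1 for r in rows_c if r[0] == a) / len(rows_c)    # likelihood P(A=a|C=c)
    p_b = sum(1 for r in rows_c if r[1] == b) / len(rows_c)    # likelihood P(B=b|C=c)
    return p_a * p_b * p_c

scores = {c: naive_bayes_score('m', 'q', c) for c in ('t', 'f')}
print(scores)                       # t -> 2/25 = 0.08, f -> 1/25 = 0.04 (up to floating-point rounding)
print(max(scores, key=scores.get))  # 't': predict C=t for A=m, B=q
```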
Nearest Neighbor Classification
Input:
  - A set of stored records
  - k: the number of nearest neighbors

Procedure:
• Compute the distance to every stored record: d(p, q) = sqrt( Σ_i (p_i - q_i)^2 )
• Identify the k nearest neighbors
• Determine the class label of the unknown record based on the class labels of its nearest neighbors (i.e., by taking a majority vote)
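A minimal Python sketch of these three steps; the (point, label) record format and the function names are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(stored, unknown, k):
    """stored: list of (point, label) pairs; unknown: the point to classify; k: number of neighbours."""
    # 1. Compute the distance from the unknown record to every stored record.
    distances = [(euclidean(point, unknown), label) for point, label in stored]
    # 2. Identify the k nearest neighbours.
    neighbours = sorted(distances)[:k]
    # 3. Majority vote over the neighbours' class labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```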
Nearest Neighbor Classification
A discrete example (k = 1 and k = 3)

Input: eight training instances
  P1 (4, 2)     → Orange
  P2 (0.5, 2.5) → Orange
  P3 (2.5, 2.5) → Orange
  P4 (3, 3.5)   → Orange
  P5 (5.5, 3.5) → Orange
  P6 (2, 4)     → Black
  P7 (4, 5)     → Black
  P8 (2.5, 5.5) → Black

New instance: Pn (4, 4) → ?

Calculate the distances:
  d(P1, Pn) = sqrt((4-4)^2 + (2-4)^2) = 2
  d(P2, Pn) = 3.80
  d(P3, Pn) = 2.12
  d(P4, Pn) = 1.12
  d(P5, Pn) = 1.58
  d(P6, Pn) = 2
  d(P7, Pn) = 1
  d(P8, Pn) = 2.12
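As a quick check, the same example can be run through the knn_classify sketch above (this usage snippet reuses euclidean() and knn_classify() from that sketch).

```python
# Reuses euclidean() and knn_classify() from the sketch above.
stored = [
    ((4, 2), 'Orange'), ((0.5, 2.5), 'Orange'), ((2.5, 2.5), 'Orange'),
    ((3, 3.5), 'Orange'), ((5.5, 3.5), 'Orange'),
    ((2, 4), 'Black'), ((4, 5), 'Black'), ((2.5, 5.5), 'Black'),
]
pn = (4, 4)

for (point, label), name in zip(stored, ('P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8')):
    print(name, label, round(euclidean(point, pn), 2))   # distances match the list above (up to rounding)

print(knn_classify(stored, pn, k=1))   # 'Black'  -- P7 at distance 1 is the single nearest neighbour
print(knn_classify(stored, pn, k=3))   # 'Orange' -- P7, P4, P5: two Orange votes against one Black
```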
Nearest Neighbor Classification
(Figure: the points P1-P8 and Pn plotted for k = 1 and k = 3, showing the neighbourhood of Pn in each case.)
Nearest Neighbor Classification…
• Scaling issues
  - Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
    - Each attribute should be brought into the same range
    - Min-max normalization
• Example (see the sketch below):
  - Two data records: a = (1, 1000), b = (0.5, 1)
  - dis(a, b) = ?
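To make the question concrete, here is a small illustrative sketch: without scaling, the second attribute dominates the distance almost entirely, and a min-max normalization (computed here over the two given records, as an assumption for illustration) brings both attributes into [0, 1].

```python
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def min_max_normalize(records):
    """Rescale each attribute to [0, 1] using the min and max observed in `records`."""
    lows  = [min(col) for col in zip(*records)]
    highs = [max(col) for col in zip(*records)]
    return [tuple((x - lo) / (hi - lo) if hi > lo else 0.0
                  for x, lo, hi in zip(r, lows, highs))
            for r in records]

a, b = (1, 1000), (0.5, 1)
print(round(euclidean(a, b), 2))       # 999.0 -- dominated almost entirely by the second attribute
na, nb = min_max_normalize([a, b])
print(round(euclidean(na, nb), 2))     # 1.41  -- after scaling, both attributes contribute equally
```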
Classification: Lazy & Eager Learning
Two types of learning methodologies
• Lazy learning
  - Instance-based learning (e.g., k-NN)
• Eager learning
  - Decision-tree and Bayesian classification
  - ANN & SVM
Differences Between Lazy & Eager Learning
• Lazy learning
  a. Does not require model building
  b. Less time training but more time predicting
  c. Effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function
• Eager learning
  a. Requires model building
  b. More time training but less time predicting
  c. Must commit to a single hypothesis that covers the entire instance space