The Art and Technology of Data Mining


Decision Tree
Saed Sayad
University of Toronto
Decision Tree (Mitchell 97)
Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
Dataset

Outlook  | Temp | Humidity | Windy | Play
------------------------------------------
Sunny    | Hot  | High     | False | No
Sunny    | Hot  | High     | True  | No
Overcast | Hot  | High     | False | Yes
Rainy    | Mild | High     | False | Yes
Rainy    | Cool | Normal   | False | Yes
Rainy    | Cool | Normal   | True  | No
Overcast | Cool | Normal   | True  | Yes
Sunny    | Mild | High     | False | No
Sunny    | Cool | Normal   | False | Yes
Rainy    | Mild | Normal   | False | Yes
Sunny    | Mild | Normal   | True  | Yes
Overcast | Mild | High     | True  | Yes
Overcast | Hot  | Normal   | False | Yes
Rainy    | Mild | High     | True  | No
Decision Tree

[Figure: decision tree with an attribute node (Outlook), value nodes on the branches and leaf nodes]

Outlook = Sunny    -> Humidity = High   -> No
                      Humidity = Normal -> Yes
Outlook = Overcast -> Yes
Outlook = Rain     -> Wind = Strong     -> No
                      Wind = Weak       -> Yes
Frequency Tables

The Play column is sorted and its values are counted:

No  : 5 / 14 = 0.36
Yes : 9 / 14 = 0.64
Frequency Tables …

Play counts for each attribute value:

Outlook  | No | Yes
-------------------
Sunny    |  3 |  2
Overcast |  0 |  4
Rainy    |  2 |  3

Temp     | No | Yes
-------------------
Hot      |  2 |  2
Mild     |  2 |  4
Cool     |  1 |  3

Humidity | No | Yes
-------------------
High     |  4 |  3
Normal   |  1 |  6

Windy    | No | Yes
-------------------
False    |  2 |  6
True     |  3 |  3
Entropy
Entropy(S) = - p log2 p - q log2 q
- Entropy measures the impurity of S
- S is a set of examples
- p is the proportion of positive examples
- q is the proportion of negative examples
Entropy: One Variable

Play: Yes = 9 / 14 = 0.64, No = 5 / 14 = 0.36

Entropy(Play) = - p log2 p - q log2 q
              = - (0.64 * log2 0.64) - (0.36 * log2 0.36)
              = 0.94
Entropy: One Variable
Example:
Entropy(5,3,2) = - (0.5 * log2 0.5) - (0.3 * log2 0.3) - (0.2 * log2 0.2)
= 1.49
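Not part of the original slides: a minimal Python sketch of the entropy calculation, assuming the class counts shown above (the function name is my own).

import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))     # Play column (9 Yes, 5 No) -> 0.94
print(round(entropy([5, 3, 2]), 2))  # three-class example above -> 1.49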
Entropy: Two Variables

Play counts for Outlook, with the size of each subset and of the whole set:

Outlook  | No | Yes | Size
--------------------------
Sunny    |  3 |  2  |  5
Overcast |  0 |  4  |  4
Rainy    |  2 |  3  |  5
--------------------------
                      14

E(Play,Outlook) = (5/14)*0.971 + (4/14)*0.0 + (5/14)*0.971
                = 0.693
Information Gain
Gain(S, A) = E(S) – E(S, A)
Example:
Gain(Play,Outlook) = 0.940 – 0.693 = 0.247
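As an illustration only (not from the slides), the same gain can be computed in Python from the Outlook frequency table; the dictionary layout is my own choice.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Outlook frequency table: value -> [No, Yes] counts
outlook = {"Sunny": [3, 2], "Overcast": [0, 4], "Rainy": [2, 3]}

n = sum(sum(v) for v in outlook.values())                                 # 14 examples
e_play = entropy([5, 9])                                                  # E(Play) ~ 0.940
e_play_outlook = sum(sum(v) / n * entropy(v) for v in outlook.values())   # ~ 0.693
print(round(e_play - e_play_outlook, 3))                                  # Gain ~ 0.247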
Selecting The Root Node

Play = [9+, 5-], E = 0.940

Outlook = Sunny:    [2+, 3-], E = 0.971
Outlook = Overcast: [4+, 0-], E = 0.0
Outlook = Rain:     [3+, 2-], E = 0.971

Gain(Play,Outlook) = 0.940 – ((5/14)*0.971 + (4/14)*0.0 + (5/14)*0.971)
                   = 0.247
Selecting The Root Node …

Play = [9+, 5-], E = 0.940

Temp = Hot:  [2+, 2-], E = 1.0
Temp = Mild: [4+, 2-], E = 0.918
Temp = Cool: [3+, 1-], E = 0.811

Gain(Play,Temp) = 0.940 – ((4/14)*1.0 + (6/14)*0.918 + (4/14)*0.811)
                = 0.029
Selecting The Root Node …

Play = [9+, 5-], E = 0.940

Humidity = High:   [3+, 4-], E = 0.985
Humidity = Normal: [6+, 1-], E = 0.592

Gain(Play,Humidity) = 0.940 – ((7/14)*0.985 + (7/14)*0.592)
                    = 0.152
Selecting The Root Node …

Play = [9+, 5-], E = 0.940

Windy = Weak:   [6+, 2-], E = 0.811
Windy = Strong: [3+, 3-], E = 1.0

Gain(Play,Wind) = 0.940 – ((8/14)*0.811 + (6/14)*1.0)
                = 0.048
Selecting The Root Node …

Gain(Play,Outlook)     = 0.247
Gain(Play,Temperature) = 0.029
Gain(Play,Humidity)    = 0.152
Gain(Play,Windy)       = 0.048

Outlook has the largest information gain, so it is selected as the root node.
Decision Tree - Classification

[Figure: final classification tree with attribute nodes, value nodes and leaf nodes]

Outlook = Sunny    -> Humidity = High   -> No
                      Humidity = Normal -> Yes
Outlook = Overcast -> Yes
Outlook = Rain     -> Wind = Strong     -> No
                      Wind = Weak       -> Yes
ID3 Algorithm

1. A <- the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute) stop, else iterate over the new leaf nodes (see the sketch below).
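A compact recursive sketch of these steps in Python (illustrative only; it assumes the rows are dicts with categorical attributes, and the helper names are mine, not from the slides).

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target):
    # Information gain of splitting `rows` on `attr`.
    before = entropy([r[target] for r in rows])
    after = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Step 5: stop when all examples agree (or nothing is left to split on).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-2: choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: gain(rows, a, target))
    tree = {best: {}}
    # Steps 3-4: one branch per value; sort the examples down and recurse.
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best], target)
    return tree

# Usage (weather data as a list of dicts):
# id3(data, ["Outlook", "Temp", "Humidity", "Windy"], "Play")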
Converting Tree to Rules

[Figure: the classification tree above, read off branch by branch]
R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
R3: IF (Outlook=Overcast) THEN Play=Yes
R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
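The five rules translate directly into code; a small sketch (the function and argument names are my own):

def play(outlook, humidity, wind):
    """Classify one instance using rules R1-R5."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"   # R1, R2
    if outlook == "Overcast":
        return "Yes"                                   # R3
    return "No" if wind == "Strong" else "Yes"         # R4, R5 (Outlook = Rain)

print(play("Sunny", "Normal", "Weak"))  # Yes (rule R2)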
Decision Tree - Regression
Dataset

The target column (Players) is numeric:

Outlook  | Temp | Humidity | Windy | Players
--------------------------------------------
Sunny    | Hot  | High     | False | 25
Sunny    | Hot  | High     | True  | 30
Overcast | Hot  | High     | False | 46
Rainy    | Mild | High     | False | 45
Rainy    | Cool | Normal   | False | 52
Rainy    | Cool | Normal   | True  | 23
Overcast | Cool | Normal   | True  | 43
Sunny    | Mild | High     | False | 35
Sunny    | Cool | Normal   | False | 38
Rainy    | Mild | Normal   | False | 46
Sunny    | Mild | Normal   | True  | 48
Overcast | Mild | High     | True  | 52
Overcast | Hot  | Normal   | False | 44
Rainy    | Mild | High     | True  | 30
Decision Tree - Regression

[Figure: regression tree with attribute nodes, value nodes and numeric leaf nodes]

Outlook = Sunny    -> Humidity = High   -> 30
                      Humidity = Normal -> 45
Outlook = Overcast -> 50
Outlook = Rain     -> Wind = Strong     -> 25
                      Wind = Weak       -> 55
Standard Deviation and Mean

Players: 25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30

SD (Players)   = 9.32
Mean (Players) = 39.79
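A quick check of these two numbers (not from the slides); note that the slides use the population standard deviation (divide by n, not n - 1):

import math

players = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]

mean = sum(players) / len(players)
sd = math.sqrt(sum((x - mean) ** 2 for x in players) / len(players))
print(round(mean, 2), round(sd, 2))  # 39.79 9.32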
Standard Deviation and Mean …

Standard deviation and mean of Players for each attribute value:

Outlook  | SD    | Mean
-----------------------
Sunny    |  7.78 | 35.20
Overcast |  3.49 | 46.25
Rainy    | 10.87 | 39.20

Temp     | SD    | Mean
-----------------------
Hot      |  8.95 | 36.25
Mild     |  7.65 | 42.67
Cool     | 10.51 | 39.00

Humidity | SD    | Mean
-----------------------
High     |  9.36 | 37.57
Normal   |  8.73 | 42.00

Windy    | SD    | Mean
-----------------------
False    |  7.87 | 41.36
True     | 10.59 | 37.67
Standard Deviation versus Entropy

[Figure: in a decision tree, entropy is the impurity measure for classification, while standard deviation plays the same role for regression.]
Information Gain versus Standard Deviation Reduction

[Figure: information gain is used to select splits in classification trees; standard deviation reduction (SDR) plays the same role in regression trees.]
Selecting The Root Node

Play = [14], SD = 9.32

Outlook = Sunny:    [5], SD = 7.78
Outlook = Overcast: [4], SD = 3.49
Outlook = Rain:     [5], SD = 10.87

SDR(Play,Outlook) = 9.32 - ((5/14)*7.78 + (4/14)*3.49 + (5/14)*10.87)
                  = 1.662
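A minimal sketch of the SDR computation from the branch statistics above (the layout is my own; the SD values are taken from the earlier table):

sd_all = 9.32                       # SD of Players over all 14 rows
branches = {                        # Outlook value -> (subset size, subset SD)
    "Sunny":    (5, 7.78),
    "Overcast": (4, 3.49),
    "Rainy":    (5, 10.87),
}

n = sum(size for size, _ in branches.values())
weighted_sd = sum(size / n * sd for size, sd in branches.values())
print(round(sd_all - weighted_sd, 3))  # SDR(Play,Outlook) ~ 1.662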
Selecting The Root Node …

Play = [14], SD = 9.32

Temp = Hot:  [4], SD = 8.95
Temp = Mild: [6], SD = 7.65
Temp = Cool: [4], SD = 10.51

SDR(Play,Temp) = 9.32 - ((4/14)*8.95 + (6/14)*7.65 + (4/14)*10.51)
               = 0.481
Selecting The Root Node …

Play = [14], SD = 9.32

Humidity = High:   [7], SD = 9.36
Humidity = Normal: [7], SD = 8.73

SDR(Play,Humidity) = 9.32 - ((7/14)*9.36 + (7/14)*8.73)
                   = 0.275
Selecting The Root Node …

Play = [14], SD = 9.32

Windy = Weak:   [8], SD = 7.87
Windy = Strong: [6], SD = 10.59

SDR(Play,Windy) = 9.32 - ((8/14)*7.87 + (6/14)*10.59)
                = 0.284
Selecting The Root Node …

SDR(Players,Outlook)     = 1.662
SDR(Players,Temperature) = 0.481
SDR(Players,Humidity)    = 0.275
SDR(Players,Windy)       = 0.284

Outlook gives the largest standard deviation reduction, so it is selected as the root node.
Decision Tree - Regression

[Figure: the resulting regression tree with attribute nodes, value nodes and numeric leaf nodes]

Outlook = Sunny    -> Humidity = High   -> 30
                      Humidity = Normal -> 45
Outlook = Overcast -> 50
Outlook = Rain     -> Wind = Strong     -> 25
                      Wind = Weak       -> 55
Decision Tree - Issues

- Working with continuous attributes (discretization)
- Overfitting and pruning
- Super attributes (attributes with many values)
- Working with missing values
- Attributes with different costs
Discretization

- Equally probable intervals
  This strategy creates a set of N intervals with the same number of elements.
- Equal width intervals
  The original range of values is divided into N intervals with the same range.
- Entropy based
  For each numeric attribute, instances are sorted and, for each possible threshold, a binary <, >= test is considered and evaluated in exactly the same way that a categorical attribute would be (see the sketch below).
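Two of these strategies are easy to sketch in a few lines of Python (illustrative only; the function names and the sample values are mine):

def equal_width_bins(values, k):
    """k intervals covering the range of values, all with the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def equal_frequency_bins(values, k):
    """k groups holding (roughly) the same number of sorted values."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k - 1)] + [ordered[(k - 1) * size:]]

values = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]
print(equal_width_bins(values, 3))       # three intervals of equal width
print(equal_frequency_bins(values, 3))   # three groups of roughly equal size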
Avoid Overfitting

Overfitting occurs when the learning algorithm continues to develop hypotheses that reduce the training set error at the cost of an increased test set error.

- Stop growing when the data split is not statistically significant (Chi2 test)
- Grow the full tree, then post-prune
- Minimum description length (MDL): minimize size(tree) + size(misclassifications(tree))
Post-pruning

- First, build the full tree
- Then, prune it
  - A fully-grown tree shows all attribute interactions
- Problem: some subtrees might be due to chance effects
- Two pruning operations:
  1. Subtree replacement
  2. Subtree raising
- Possible strategies:
  - error estimation
  - significance testing
  - MDL principle

(Witten & Eibe)
Subtree Replacement

- Bottom-up
- Consider replacing a tree only after considering all its subtrees

(Witten & Eibe)
Subtree Replacement …

[Figure: example of a subtree being replaced by a single leaf]

(Witten & Eibe)
Error Estimation

- Transformed value for f:   (f − p) / sqrt(p(1 − p)/N)
  (i.e. subtract the mean and divide by the standard deviation)
- Resulting equation:   Pr[ −z ≤ (f − p) / sqrt(p(1 − p)/N) ≤ z ] = c
- Solving for p:
  p = ( f + z²/2N ± z·sqrt(f/N − f²/N + z²/4N²) ) / ( 1 + z²/N )

(Witten & Eibe)
Error Estimation …

- The error estimate for a subtree is the weighted sum of the error estimates for all its leaves
- Error estimate for a node (upper bound):
  e = ( f + z²/2N + z·sqrt(f/N − f²/N + z²/4N²) ) / ( 1 + z²/N )
- If c = 25% then z = 0.69 (from the normal distribution)
- f is the error on the training data
- N is the number of instances covered by the leaf

(Witten & Eibe)
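A small helper for the upper-bound estimate above (not from the slides); with c = 25% and z = 0.69 it reproduces the figures in the next example up to rounding:

import math

def pessimistic_error(f, n, z=0.69):
    """C4.5-style upper bound on the error rate of a node.
    f: observed error rate on the training data, n: instances covered."""
    num = f + z * z / (2 * n) + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

print(round(pessimistic_error(2 / 6, 6), 2))    # leaf with f = 0.33   -> ~0.47
print(round(pessimistic_error(1 / 2, 2), 2))    # leaf with f = 0.5    -> ~0.72
print(round(pessimistic_error(5 / 14, 14), 2))  # parent with f = 5/14 -> ~0.45 (0.46 on the slide)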
[Figure: a subtree with three leaves considered for replacement by a single leaf]

Leaf error estimates: f = 0.33, e = 0.47;  f = 0.5, e = 0.72;  f = 0.33, e = 0.47
Combined using ratios 6:2:6 gives 0.51
Parent node: f = 5/14, e = 0.46
e = 0.46 < 0.51, so prune!

(Witten & Eibe)
Super Attributes

- The information gain equation, G(S,A), is biased toward attributes that have a large number of values over attributes that have a smaller number of values.
- These "super attributes" will easily be selected as the root, resulting in a broad tree that classifies perfectly but performs poorly on unseen instances.
- We can penalize attributes with large numbers of values by using an alternative method for attribute selection, referred to as GainRatio.

GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
Super Attributes …

Outlook  | No | Yes | Size
--------------------------
Sunny    |  3 |  2  |  5
Overcast |  0 |  4  |  4
Rainy    |  2 |  3  |  5
--------------------------
                      14

Split(S,A) = - sum over i = 1..n of (|Si| / |S|) * log2(|Si| / |S|)

Split(Play,Outlook) = - (5/14*log2(5/14) + 4/14*log2(4/14) + 5/14*log2(5/14))
                    = 1.577
Gain(Play,Outlook) = 0.247
Gain Ratio(Play,Outlook) = 0.247 / 1.577 = 0.156
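A one-function sketch of the split information and gain ratio (the names are mine; the gain value is taken from the earlier slide):

import math

def split_information(sizes):
    """Split information of a partition, given the subset sizes |Si|."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

split = split_information([5, 4, 5])   # Outlook: Sunny, Overcast, Rainy
gain = 0.247                           # Gain(Play,Outlook) from above
print(round(split, 3))                 # 1.577
print(round(gain / split, 3))          # ~0.157 (0.156 on the slide, rounding)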
Missing Values

- Most common value
- Most common value at node K
- Mean or median
- Nearest neighbor
- …
Attributes with Different Costs

- Sometimes the best attribute for splitting the training examples is very costly. In order to make the overall decision process more cost-effective, we may wish to penalize the information gain of an attribute by its cost:

  G'(S,A) = G(S,A) / Cost(A)
Decision Trees:

- are simple, quick and robust
- are non-parametric
- can handle complex datasets
- can use any combination of categorical and continuous variables and missing values
- are not incremental and adaptive
- are sometimes not easy to read
- …
Thank You!
[Appendix figure: the Play values (Yes/No) listed for each value of Outlook, Humidity, Windy and Temperature; the counts are the same as in the frequency tables above.]

Outlook:     Sunny 2 Yes / 3 No,  Overcast 4 Yes / 0 No,  Rainy 3 Yes / 2 No
Humidity:    High 3 Yes / 4 No,   Normal 6 Yes / 1 No
Windy:       False 6 Yes / 2 No,  True 3 Yes / 3 No
Temperature: Hot 2 Yes / 2 No,    Mild 4 Yes / 2 No,      Cool 3 Yes / 1 No
[Appendix figure: the Players values listed for each value of Outlook, Windy, Humidity and Temperature, with the standard deviation of each group.]

Outlook:     Sunny SD = 7.78,  Overcast SD = 3.49,  Rainy SD = 10.87
Windy:       False SD = 7.87,  True SD = 10.59
Humidity:    High SD = 9.36,   Normal SD = 8.73
Temperature: Hot SD = 8.95,    Mild SD = 7.65,      Cool SD = 10.51
Overfitting (Mitchell 97)

Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
error_train(h) < error_train(h')
and
error_D(h) > error_D(h')
Pruning

Pruning steps:
- Step 1. Grow the decision tree with respect to the training set.
- Step 2. Randomly select and remove a node.
- Step 3. Replace the node with its majority classification.
- Step 4. If the performance of the modified tree is just as good or better on the validation set as the current tree, then set the current tree equal to the modified tree.
- While (not done) go to Step 2.
Pruning …

Rule post-pruning:
- Step 1. Grow the decision tree with respect to the training set.
- Step 2. Convert the tree into a set of rules.
- Step 3. Remove antecedents that result in a reduction of the validation set error rate.
- Step 4. Sort the resulting list of rules by their accuracy and use this sorted list as a sequence for classifying unseen instances.