Decision Tree
Rong Jin
Determine Mileage Per Gallon

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
A Decision Tree for Determining MPG

[Figure: a decision tree for predicting mpg from cylinders, displacement, horsepower, weight, acceleration, modelyear, and maker; one example record (mpg = good, cylinders = 4, maker = asia, modelyear = 75to78, ...) is traced through the tree.]

From slides of Andrew Moore
Decision Tree Learning

- Extremely popular method
  - Credit risk assessment
  - Medical diagnosis
  - Market analysis
- Good at dealing with symbolic features
- Easy to comprehend
  - Compared to logistic regression models and support vector machines
Representational Power

- Q: Can trees represent arbitrary Boolean expressions?
- Q: How many Boolean functions are there over N binary attributes?
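(An answer sketch, not spelled out in the transcript: a tree that tests every attribute along each path simply enumerates a truth table, so trees can represent any Boolean function; and since each of the 2^N input combinations can independently map to 0 or 1, there are 2^(2^N) Boolean functions over N binary attributes.)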
How to Generate Trees from Training Data

A Simple Idea
- Enumerate all possible trees
- Check how well each tree matches the training data (how do we determine the quality of a decision tree?)
- Pick the one that works best

Problems? Too many trees.
Solution: A Greedy Approach
- Choose the most informative feature
- Split the data set on it
- Recurse until each data item is classified correctly
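As a rough sketch (my own, not code from the lecture) of this greedy recursion, assuming categorical attributes and using information gain, i.e. the mutual information defined on the following slides, as the "most informative" criterion:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(records, attr, target):
        labels = [r[target] for r in records]
        gain = entropy(labels)
        for value in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == value]
            gain -= len(subset) / len(records) * entropy(subset)
        return gain

    def build_tree(records, attrs, target):
        labels = [r[target] for r in records]
        majority = Counter(labels).most_common(1)[0][0]
        # Base cases: all records share one output, or no attributes are left.
        if len(set(labels)) == 1 or not attrs:
            return majority
        # Greedy step: pick the most informative feature and split on it.
        best = max(attrs, key=lambda a: info_gain(records, a, target))
        tree = {"attr": best, "default": majority, "children": {}}
        for value in set(r[best] for r in records):
            subset = [r for r in records if r[best] == value]
            rest = [a for a in attrs if a != best]
            tree["children"][value] = build_tree(subset, rest, target)
        return tree

Each record here is assumed to be a dict such as {"cylinders": 4, "maker": "asia", ..., "mpg": "good"}, with target = "mpg".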
How to Determine the Best Feature?
- Which feature is more informative about MPG?
- What metric should be used?

Mutual Information!

From Andrew Moore's slides
Mutual Information for Selecting Best Features

I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}

Y: MPG (good or bad), X: cylinder (3, 4, 6, 8)

From Andrew Moore's slides
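A minimal sketch (mine, with assumed names) of estimating this quantity from data, for a categorical feature X and label Y given as parallel lists:

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        """I(X; Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) )."""
        n = len(xs)
        p_xy = Counter(zip(xs, ys))
        p_x = Counter(xs)
        p_y = Counter(ys)
        mi = 0.0
        for (x, y), count in p_xy.items():
            pxy = count / n
            mi += pxy * math.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
        return mi

    # e.g. mutual_information([4, 6, 4, 8, ...], ["good", "bad", "bad", "bad", ...])
    # scores how informative the cylinders column is about mpg.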
Another Example: Playing Tennis

Example: Playing Tennis (full data set: 9+, 5-)

Humidity: High -> (3+, 4-), Normal -> (6+, 1-)

I_h = P(h,p) \log \frac{P(h,p)}{P(h)P(p)} + P(n,p) \log \frac{P(n,p)}{P(n)P(p)} + P(h,\bar{p}) \log \frac{P(h,\bar{p})}{P(h)P(\bar{p})} + P(n,\bar{p}) \log \frac{P(n,\bar{p})}{P(n)P(\bar{p})} \approx 0.151

where h = high humidity, n = normal humidity, p = play, \bar{p} = don't play.

Wind: Weak -> (6+, 2-), Strong -> (3+, 3-)

I_w = P(w,p) \log \frac{P(w,p)}{P(w)P(p)} + P(s,p) \log \frac{P(s,p)}{P(s)P(p)} + P(w,\bar{p}) \log \frac{P(w,\bar{p})}{P(w)P(\bar{p})} + P(s,\bar{p}) \log \frac{P(s,\bar{p})}{P(s)P(\bar{p})} \approx 0.048

where w = weak wind, s = strong wind.

Humidity has the higher mutual information with the label, so it is the more informative split.
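As a quick numeric check (my own, not on the slides) of these two values, using the counts summarized above:

    import math

    def entropy(pos, neg):
        total = pos + neg
        return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

    def mutual_info(parent, splits):
        # I(split; label) = H(label) - H(label | split), i.e. the information gain.
        total = sum(p + n for p, n in splits)
        conditional = sum((p + n) / total * entropy(p, n) for p, n in splits)
        return entropy(*parent) - conditional

    print(mutual_info((9, 5), [(3, 4), (6, 1)]))  # Humidity: ~0.1518 (the 0.151 above)
    print(mutual_info((9, 5), [(6, 2), (3, 3)]))  # Wind:     ~0.0481 (the 0.048 above)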
Prediction for Nodes

What is the prediction for each node?

From Andrew Moore's slides

Prediction for Nodes

The standard choice: each node predicts the majority class of the training records that reach it.
Recursively Growing Trees

[Figure: the original dataset is partitioned according to the value of the attribute we split on, e.g. cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8.]

From Andrew Moore slides

Recursively Growing Trees

[Figure: a tree is then built recursively from the records in each partition: cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8.]

From Andrew Moore slides
A Two-Level Tree

Recursively growing trees

When Should We Stop Growing Trees?

Should we split this node?
Base Cases
- Base Case One: If all records in the current data subset have the same output, then don't recurse.
- Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.
Base Cases: An Idea
- Base Case One: If all records in the current data subset have the same output, then don't recurse.
- Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.
- Proposed Base Case 3: If all attributes have zero information gain, then don't recurse.

Is this a good idea?
Old Topic: Overfitting

What should we do? Pruning.
Pruning Decision Trees
- Stop growing trees in time.
- Build the full decision tree as before, but when you can grow it no more, start to prune:
  - Reduced error pruning
  - Rule post-pruning
Reduced Error Pruning
- Split the data into a training set and a validation set.
- Build a full decision tree over the training set.
- Keep removing the node whose removal maximally increases validation set accuracy.

[Figure: original decision tree vs. pruned decision tree.]
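A rough sketch (mine, reusing the nested-dict tree representation from the earlier build_tree sketch) of reduced error pruning: repeatedly collapse the internal node whose replacement by a majority-class leaf most improves accuracy on the validation set, and stop when no replacement helps.

    def predict(tree, record):
        # Follow attribute tests until a leaf label is reached; unseen values
        # fall back to the node's majority ("default") label.
        while isinstance(tree, dict):
            tree = tree["children"].get(record[tree["attr"]], tree["default"])
        return tree

    def accuracy(tree, records, target):
        return sum(predict(tree, r) == r[target] for r in records) / len(records)

    def internal_nodes(tree):
        if not isinstance(tree, dict):
            return []
        nodes = [tree]
        for child in tree["children"].values():
            nodes += internal_nodes(child)
        return nodes

    def reduced_error_prune(tree, validation, target):
        while isinstance(tree, dict):
            base = accuracy(tree, validation, target)
            best_gain, best_node = 0.0, None
            for node in internal_nodes(tree):
                saved = node["children"]
                node["children"] = {}      # tentatively replace subtree by a leaf
                gain = accuracy(tree, validation, target) - base
                node["children"] = saved   # undo
                if gain > best_gain:
                    best_gain, best_node = gain, node
            if best_node is None:
                break                      # no removal improves validation accuracy
            best_node["children"] = {}     # commit the best pruning step
        return tree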
Rule Post-Pruning
- Convert the tree into rules.
- Prune each rule by removing preconditions.
- Sort the final rules by their estimated accuracy.
- Most widely used method (e.g., C4.5).
- Other methods: statistical significance tests (chi-square).
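An illustrative sketch (mine) of the first step, converting a tree in the nested-dict form used in the earlier sketches into one IF-THEN rule per leaf; pruning would then try dropping preconditions that do not hurt a rule's estimated accuracy.

    def tree_to_rules(tree, conditions=()):
        # A leaf (or a pruned node with no children) yields one rule.
        if not isinstance(tree, dict) or not tree["children"]:
            label = tree["default"] if isinstance(tree, dict) else tree
            return [(conditions, label)]
        rules = []
        for value, child in tree["children"].items():
            rules += tree_to_rules(child, conditions + ((tree["attr"], value),))
        return rules

    # Example rule format: ((("cylinders", 4), ("maker", "asia")), "good"), read as
    # IF cylinders = 4 AND maker = asia THEN mpg = good.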
Real-Valued Inputs

What should we do to deal with real-valued inputs?
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe
Information Gain
- x: a real-valued input
- t: a split value
- Find the split value t such that the mutual information I(x, y : t) between x and the class label y is maximized.
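A small sketch (mine, with assumed details such as using midpoints between sorted distinct values as candidate thresholds) of this search: score each candidate t by the mutual information between the indicator [x < t] and the label y, and keep the best one.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def best_split(xs, ys):
        base = entropy(ys)
        values = sorted(set(xs))
        best_t, best_mi = None, -1.0
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in zip(xs, ys) if x < t]
            right = [y for x, y in zip(xs, ys) if x >= t]
            mi = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
            if mi > best_mi:
                best_t, best_mi = t, mi
        return best_t, best_mi

    # e.g. best_split([97, 199, 121, 350, ...], ["good", "bad", "bad", "bad", ...])
    # returns the displacement threshold t with the largest I(x, y : t).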
Conclusions
- Decision trees are the single most popular data mining tool:
  - Easy to understand
  - Easy to implement
  - Easy to use
  - Computationally cheap
- It's possible to get in trouble with overfitting.
- They do classification: predict a categorical output from categorical and/or real inputs.
Software
- Most widely used decision tree: C4.5 (or C5.0)
  http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
  - Source code, tutorial

The End