Decision tree learning
CS B551: DECISION TREES
AGENDA
Decision trees
Complexity
Learning curves
Combatting overfitting
Boosting
RECAP
Still in the supervised setting, with logical attributes
Find a representation of CONCEPT in the form:
  CONCEPT(x) ⇔ S(A, B, …)
where S(A, B, …) is a sentence built with the observable attributes, e.g.:
  CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

A?
  False -> False
  True  -> B?
             False -> True
             True  -> C?
                        True  -> True
                        False -> False

Example: x is a mushroom; CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED.
A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
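The tree above is just nested conditionals. A minimal Python sketch (the function name `concept` and the boolean-argument interface are mine, not from the slides) encoding CONCEPT ⇔ A ∧ (¬B ∨ C):

```python
def concept(a: bool, b: bool, c: bool) -> bool:
    """Decision tree for CONCEPT(x) <=> A(x) and (not B(x) or C(x)).

    A = YELLOW, B = BIG, C = SPOTTED; CONCEPT = POISONOUS.
    """
    if not a:        # A? False branch: not yellow -> not poisonous
        return False
    if not b:        # B? False branch: yellow and small -> poisonous
        return True
    return c         # yellow and big: C (spotted) decides

# yellow, big, spotted -> poisonous; yellow, big, unspotted -> not
assert concept(True, True, True) is True
assert concept(True, True, False) is False
```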
PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the preceding decision tree.
Example: x is a mushroom; CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED.
A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
Two further observable attributes, D = FUNNEL-CAP and E = BULKY, do not appear in the concept.
TRAINING SET

Ex. #   A      B      C      D      E      CONCEPT
 1      False  False  True   False  True   False
 2      False  True   False  False  False  False
 3      False  True   True   True   True   False
 4      False  False  True   False  False  False
 5      False  False  False  True   True   False
 6      True   False  True   False  False  True
 7      True   False  False  True   False  True
 8      True   False  True   False  True   True
 9      True   True   True   False  True   True
10      True   True   True   True   True   True
11      True   True   False  False  False  False
12      True   True   False  False  True   False
13      True   False  True   True   True   True
POSSIBLE DECISION TREE

D?
  True  -> E?
             True  -> A?  (True -> True, False -> False)
             False -> True
  False -> C?
             True  -> B?
                        True  -> True
                        False -> E?
                                   True  -> A?  (True -> True, False -> False)
                                   False -> A?  (True -> True, False -> False)
             False -> False

This tree agrees with all 13 examples of the training set.
POSSIBLE DECISION TREE
The tree above computes:
  CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))
which is far larger than the target concept:
  CONCEPT ⇔ A ∧ (¬B ∨ C)
POSSIBLE DECISION TREE
  CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))
  CONCEPT ⇔ A ∧ (¬B ∨ C)
KIS bias -> build the smallest decision tree
Computationally intractable problem -> greedy algorithm
TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return majority rule
4. A <- error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates-A),
   - right branch is DTL(D-A, Predicates-A)
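The DTL procedure above can be sketched in Python. This is a sketch, not the textbook implementation: `examples` pairs an attribute assignment (a dict) with a boolean label, and step 4's error-minimizing predicate is the one whose single-node, majority-per-branch classifier makes the fewest mistakes.

```python
from collections import Counter

def dtl(examples, predicates):
    """Top-down induction of a decision tree (sketch of DTL).

    Returns True/False at leaves, or a tuple (predicate, true_branch,
    false_branch) at inner nodes."""
    labels = [label for _, label in examples]
    if all(labels):
        return True                       # step 1: all positive
    if not any(labels):
        return False                      # step 2: all negative
    majority = Counter(labels).most_common(1)[0][0]
    if not predicates:
        return majority                   # step 3: majority rule

    def errors(p):
        # Mistakes made by splitting on p and predicting the
        # majority label on each branch.
        e = 0
        for value in (True, False):
            branch = [lab for a, lab in examples if a[p] == value]
            if branch:
                e += len(branch) - Counter(branch).most_common(1)[0][1]
        return e

    best = min(predicates, key=errors)    # step 4
    rest = [p for p in predicates if p != best]
    pos = [(a, lab) for a, lab in examples if a[best]]
    neg = [(a, lab) for a, lab in examples if not a[best]]
    return (best,                         # step 5
            dtl(pos, rest) if pos else majority,
            dtl(neg, rest) if neg else majority)
```

On the mushroom training set this picks A at the root (2 errors, the fewest) and recovers a tree equivalent to A ∧ (¬B ∨ C).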
LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly in DTs:
  Parity(x) = x1 xor x2 xor … xor xn
  Majority(x) = 1 if most of the xi's are 1, 0 otherwise
These require trees of exponential size in the # of attributes,
and an exponential # of examples to learn exactly
The ease of learning depends on shrewdly (or luckily) chosen
attributes that correlate with CONCEPT
PERFORMANCE ISSUES
Assessing performance:
  Training set and test set
  Learning curve
[Figure: typical learning curve — % correct on the test set vs. size of the
training set; some concepts are unrealizable within a machine's capacity,
so the curve need not reach 100%]
Overfitting: risk of using irrelevant observable predicates to generate a
hypothesis that agrees with all examples in the training set
Tree pruning: terminate recursion when the # of errors / information gain
is small
The resulting decision tree + majority rule may not classify correctly all
examples in the training set
PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve
Overfitting
Tree pruning
Incorrect examples
Missing data
Multi-valued and continuous attributes
USING INFORMATION THEORY
Rather than minimizing the probability of error,
minimize the expected number of questions
needed to decide if an object x satisfies
CONCEPT
Use the information-theoretic quantity known as
information gain
Split on variable with highest information gain
ENTROPY / INFORMATION GAIN
Entropy encodes the quantity of uncertainty in a random variable:
  H(X) = -Σ_{x ∈ Val(X)} P(x) log P(x)
Properties:
  H(X) = 0 if X is known, i.e. P(x) = 1 for some value x
  H(X) > 0 if X is not known with certainty
  H(X) is maximal if P(X) is the uniform distribution
Information gain measures the reduction in uncertainty in X given
knowledge of Y:
  I(X;Y) = E_y[H(X) - H(X|Y)] = Σ_y P(y) Σ_x [P(x|y) log P(x|y) - P(x) log P(x)]
Properties:
  Always nonnegative
  = 0 iff X and Y are independent
If Y is a choice, maximizing IG is equivalent to minimizing E_y[H(X|Y)]
MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES
  E_y[H(X|Y)] = -Σ_y P(y) Σ_x P(x|y) log P(x|y)
Let n be the # of examples
Let n+, n- be the # of examples on the True/False branches of Y
Let p+, p- be the accuracy on the True/False branches of Y
  P(correct) = (p+ n+ + p- n-)/n
  P(correct|Y) = p+, P(correct|¬Y) = p-
  E_y[H(X|Y)] = -(n+/n)[p+ log p+ + (1-p+) log(1-p+)]
                - (n-/n)[p- log p- + (1-p-) log(1-p-)]
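On the training set from earlier, entropy and information gain can be computed directly. A sketch using base-2 logarithms (the helper names `entropy` and `info_gain` are mine):

```python
from math import log2

def entropy(labels):
    """H(X) = -sum_x P(x) log2 P(x) over the empirical distribution."""
    n = len(labels)
    probs = [labels.count(v) / n for v in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(examples, attr):
    """I(CONCEPT; attr) = H(CONCEPT) - E_y[H(CONCEPT | attr = y)].

    `examples` is a list of (assignment_dict, label) pairs."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in (True, False):
        branch = [y for x, y in examples if x[attr] == value]
        if branch:
            gain -= len(branch) / len(labels) * entropy(branch)
    return gain
```

On the mushroom data, H(CONCEPT) ≈ 0.996 bits and A has the highest gain (≈ 0.50 bits), so an IG-based split would also pick A at the root.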
CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds:
  X => (X < a)
When considering splitting on X, pick the threshold a to minimize the
# of errors / entropy
[Figure: examples sorted along the X axis, with the error count for each
candidate threshold between consecutive values]
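The threshold search can be sketched as scanning midpoints between consecutive sorted values and counting errors; the predicate direction (predict positive below or above the threshold) is whichever errs less. The helper name `best_threshold` is mine:

```python
def best_threshold(values, labels):
    """Pick threshold a for the logical attribute (X < a) minimizing
    training errors, trying midpoints between consecutive sorted values."""
    pairs = sorted(zip(values, labels))
    best_a, best_err = None, len(pairs) + 1
    for i in range(len(pairs) - 1):
        a = (pairs[i][0] + pairs[i + 1][0]) / 2
        # Errors if we predict True below a; the reversed predicate
        # errs on exactly the complementary set.
        below = sum(1 for x, y in pairs if (x < a) != y)
        err = min(below, len(pairs) - below)
        if err < best_err:
            best_a, best_err = a, err
    return best_a, best_err
```

For entropy-based scoring, replace the error count with the weighted entropy of the two sides of the split.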
MULTI-VALUED ATTRIBUTES
Simple change: consider splits on all values A can
take on
Caveat: the more values A can take on, the more
important it may appear to be, even if it is
irrelevant
More values => dataset split into smaller example
sets when picking attributes
Smaller example sets => more likely to fit well to
spurious noise
STATISTICAL METHODS FOR ADDRESSING
OVERFITTING / NOISE
There may be few training examples that match
the path leading to a deep node in the decision
tree
More susceptible to choosing irrelevant/incorrect
attributes when sample is small
Idea:
Make a statistical estimate of predictive power
(which increases with larger samples)
Prune branches with low predictive power
Chi-squared pruning
TOP-DOWN DT PRUNING
Consider an inner node X that by itself (majority
rule) predicts p examples correctly and n
examples incorrectly
At its k leaf nodes, the numbers of correctly/incorrectly
classified examples are p1/n1, …, pk/nk
Chi-squared statistical significance test:
Null hypothesis: example labels randomly chosen
with distribution p/(p+n) (X is irrelevant)
Alternate hypothesis: examples not randomly chosen
(X is relevant)
Prune X if testing X is not statistically significant
CHI-SQUARED TEST
Let Z = Σ_i [(pi - pi')²/pi' + (ni - ni')²/ni']
where pi' = p(pi+ni)/(p+n) and ni' = n(pi+ni)/(p+n) are the expected
numbers of true/false examples at leaf node i if the null hypothesis holds
Z is a statistic that is approximately drawn from the chi-squared
distribution with k degrees of freedom
Look up the p-value of Z in a table; prune if the p-value > α for some α
(usually ≈ .05)
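A sketch of the Z statistic (the table lookup is omitted); `leaves` holds the (pi, ni) counts at the k leaves, and the function name is mine:

```python
def chi_squared_stat(leaves):
    """Z = sum_i (pi - pi')**2/pi' + (ni - ni')**2/ni', where the
    expected counts under the null hypothesis (X irrelevant) are
    pi' = p*(pi+ni)/(p+n) and ni' = n*(pi+ni)/(p+n).

    `leaves` is a list of (pi, ni) pairs, one per leaf."""
    p = sum(pi for pi, _ in leaves)
    n = sum(ni for _, ni in leaves)
    z = 0.0
    for pi, ni in leaves:
        expected_p = p * (pi + ni) / (p + n)
        expected_n = n * (pi + ni) / (p + n)
        z += (pi - expected_p) ** 2 / expected_p
        z += (ni - expected_n) ** 2 / expected_n
    return z
```

A split whose leaves mirror the root's label proportions gives Z = 0 (prune); a split that cleanly separates the labels gives a large Z (keep).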
ENSEMBLE LEARNING
(BOOSTING)
IDEA
It may be difficult to search for a single
hypothesis that explains the data
Construct multiple hypotheses (ensemble), and
combine their predictions
“Can a set of weak learners construct a single
strong learner?” – Michael Kearns, 1988
MOTIVATION
5 classifiers, each with 60% accuracy
On a new example, run them all, and pick the prediction using majority voting
If errors are independent, the majority vote is right whenever at least 3 of
the 5 classifiers are, giving ≈68% accuracy — and the advantage grows with
the ensemble: 101 such classifiers vote correctly over 97% of the time
(In reality errors will not be independent, but we hope they will be
mostly uncorrelated)
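The majority-vote figure can be checked with the binomial distribution; a sketch (the function name is mine, and an odd ensemble size is assumed so the vote cannot tie):

```python
from math import comb

def majority_vote_accuracy(m: int, p: float) -> float:
    """Probability that a majority of m independent classifiers,
    each correct with probability p, votes for the right answer.
    Assumes m is odd, so a strict majority always exists."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m // 2 + 1, m + 1))
```

For 5 classifiers at 60% this gives ≈0.683; reaching ≈0.94 with only 5 classifiers would require ≈80% individual accuracy.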
BOOSTING
Main idea:
If learner 1 fails to learn an example correctly, this
example is more important for learner 2
If learner 1 and 2 fail to learn an example correctly,
this example is more important for learner 3
…
Weighted training set
Weights encode importance
BOOSTING
Weighted training set

Ex. #   Weight   A      B      C      D      E      CONCEPT
 1      w1       False  False  True   False  True   False
 2      w2       False  True   False  False  False  False
 3      w3       False  True   True   True   True   False
 4      w4       False  False  True   False  False  False
 5      w5       False  False  False  True   True   False
 6      w6       True   False  True   False  False  True
 7      w7       True   False  False  True   False  True
 8      w8       True   False  True   False  True   True
 9      w9       True   True   True   False  True   True
10      w10      True   True   True   True   True   True
11      w11      True   True   False  False  False  False
12      w12      True   True   False  False  True   False
13      w13      True   False  True   True   True   True
BOOSTING
Start with uniform weights wi=1/N
Use learner 1 to generate hypothesis h1
Adjust weights to give higher importance to
misclassified examples
Use learner 2 to generate hypothesis h2
…
Weight hypotheses according to performance, and
return weighted majority
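One reweighting round can be sketched as follows. The scaling used here — multiply misclassified weights by 1/(2ε) and correct ones by 1/(2(1−ε)), where ε is the weighted error — is one standard AdaBoost normalization (the weights again sum to 1), and the hypothesis weight log((1−ε)/ε) reproduces the numbers in the mushroom example that follows:

```python
from math import log

def adaboost_round(weights, predictions, labels):
    """One AdaBoost reweighting step. Assumes 0 < eps < 1.

    Returns (new_weights, hypothesis_weight)."""
    eps = sum(w for w, h, y in zip(weights, predictions, labels) if h != y)
    new = [w / (2 * eps) if h != y else w / (2 * (1 - eps))
           for w, h, y in zip(weights, predictions, labels)]
    alpha = log((1 - eps) / eps)   # weight of this hypothesis in the vote
    return new, alpha
```

With uniform weights 1/13 and the stump CONCEPT = C (which misclassifies examples 1, 3, 4, 7, so ε = 4/13), this yields new weights 1/8 = .125 and 1/18 ≈ .056 and a hypothesis weight log(9/4) ≈ 0.8.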
MUSHROOM EXAMPLE
"Decision stumps": single-attribute DTs
Weighted training set as above, with wi = 1/13 for every example
MUSHROOM EXAMPLE
Pick C first, learn CONCEPT = C (weights still 1/13 each)
MUSHROOM EXAMPLE
Update weights (precise formula given in R&N)
New weights: .125 for examples 1, 3, 4, 7 (misclassified by C);
.056 for the rest
MUSHROOM EXAMPLE
Next try A, learn CONCEPT = A (weights unchanged: .125 / .056)
MUSHROOM EXAMPLE
Update weights
New weights: 0.25 for examples 11, 12 (misclassified by A);
0.07 for examples 1, 3, 4, 7; 0.03 for the rest
MUSHROOM EXAMPLE
Next try E, learn CONCEPT = E (weights unchanged: 0.25 / 0.07 / 0.03)
MUSHROOM EXAMPLE
Update weights… (and continue in the same way with D and B)
MUSHROOM EXAMPLE
Final classifier, in order C, A, E, D, B
Weights on hypotheses determined by overall error
Weighted majority weights: A = 2.1, B = 0.9, C = 0.8, D = 1.4, E = 0.09
100% accuracy on training set
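The weighted majority can be checked against the training set. The slide does not spell out the sign of each stump, so this sketch assumes each stump predicts its attribute's value except B, whose better stump on this data is the negation ¬B (B = False correlates with POISONOUS):

```python
def weighted_majority(x):
    """Weighted majority vote of the five decision stumps, using the
    hypothesis weights from the slide (A=2.1, B=0.9, C=0.8, D=1.4,
    E=0.09). Stump predictions A, not-B, C, D, E are an assumption."""
    votes = {'A': (x['A'], 2.1), 'B': (not x['B'], 0.9),
             'C': (x['C'], 0.8), 'D': (x['D'], 1.4), 'E': (x['E'], 0.09)}
    true_w = sum(w for pred, w in votes.values() if pred)
    false_w = sum(w for pred, w in votes.values() if not pred)
    return true_w > false_w
```

Under this reading the ensemble classifies all 13 training examples correctly, matching the slide's 100% claim.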
BOOSTING STRATEGIES
The preceding weighting strategy is the popular AdaBoost algorithm
(see R&N p. 667)
Many other strategies
Typically as the number of hypotheses increases,
accuracy increases as well
Does this conflict with Occam’s razor?
ANNOUNCEMENTS
Next class:
Neural networks & function learning
R&N 18.6-7