Transcript Slide 1

Decision Tree Learning Algorithms
Sagar Kasukurthy
DECISION TREE ALGORITHMS
o One of the simplest forms of machine learning.
o Supervised learning – Output for the training data is known.
o Takes as input a vector of attribute values and returns a single output value (the decision).
o We first build a decision tree from the training data and then apply it to test samples.
Decision – Will a new customer default on their credit card payment?
Goal – To arrive at a value, yes or no.
Attributes – The random variables in the problem are the attributes.
Training Data
Home Owner | Marital Status | Annual Income | Defaulted Borrower
Yes        | Single         | 125k          | No
No         | Married        | 100k          | No
No         | Single         | 70k           | No
Yes        | Married        | 120k          | No
No         | Divorced       | 95k           | Yes
No         | Married        | 60k           | No
Yes        | Divorced       | 220k          | No
No         | Single         | 85k           | Yes
No         | Married        | 75k           | No
No         | Single         | 90k           | Yes
Generated Decision Tree
New Test Sample
• Consider a person who is not a home owner, is single, and has an annual income of 94k.
Would this person default on the payment? (A sketch of applying the tree follows below.)
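Purely as an illustration, here is a minimal Python sketch of walking such a tree for this test sample. The tree encoded below is an assumption: it uses Home Owner at the root and Marital Status as an internal node (as noted on the following slides) plus the Annual Income > 80k test that appears later as the continuous-attribute example; the actual tree in the slide figure may differ in detail.

```python
# Hypothetical encoding of the generated decision tree as nested dicts.
# Structure assumed from the surrounding slides; the real slide figure may differ.
tree = {
    "Home Owner": {
        "Yes": "No",                     # home owners in the training data never default
        "No": {
            "Marital Status": {
                "Married": "No",
                "Single": {"Annual Income > 80k": {"Yes": "Yes", "No": "No"}},
                "Divorced": {"Annual Income > 80k": {"Yes": "Yes", "No": "No"}},
            }
        },
    }
}

def classify(node, sample):
    """Walk the tree: at each node, follow the branch matching the sample's value."""
    if not isinstance(node, dict):
        return node                      # reached a leaf (class label)
    attribute, branches = next(iter(node.items()))
    return classify(branches[sample[attribute]], sample)

# The continuous test is pre-answered here for simplicity: 94k > 80k -> "Yes".
test_sample = {"Home Owner": "No", "Marital Status": "Single",
               "Annual Income > 80k": "Yes"}
print(classify(tree, test_sample))       # -> "Yes" under the assumed tree
```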
Observations on the tree
• We do not need to check all attributes to make a decision.
• The tree is very intuitive.
• Why is Home Owner the root node and not Marital Status?
Observations on the tree
• Node Types
– Root Node : No incoming edges, only outgoing edges
• Example: Home Owner.
– Internal Node : Exactly one incoming edge and >= 2
outgoing edges.
• Example: Marital Status
– Leaf Node: Exactly one incoming edge and no outgoing edges.
• Example: a class label (Yes / No).
• Edges : Represent possible values of the attributes.
Attribute Types
• Binary Attribute – Two possible values
– Example: Home Owner : Yes or No
• Nominal Attributes – Many possible values
– A k-valued attribute can be split into two groups in (2^(k-1) − 1) ways (see the sketch after this list).
– Example: Marital Status:
• (Single, Divorced, Married)
• (Single, Divorced/Married)
• (Single/Married, Divorced)
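As a quick illustration of the (2^(k-1) − 1) count, here is a minimal Python sketch (the function name binary_splits is just for illustration) that enumerates the distinct two-way groupings of a nominal attribute:

```python
from itertools import combinations

def binary_splits(values):
    """Enumerate the 2^(k-1) - 1 distinct two-way groupings of a nominal attribute."""
    values = list(values)
    # Every split is counted twice (a subset and its complement), so fix the
    # first value on the left-hand side to avoid duplicates.
    first, rest = values[0], values[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                     # skip the trivial split with an empty group
                splits.append((left, right))
    return splits

print(binary_splits(["Single", "Married", "Divorced"]))
# 2^(3-1) - 1 = 3 splits, matching the groupings listed above.
```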
Attribute Types
• Ordinal Attributes – Similar to nominal, except that the grouping must not violate the order
property of the attribute values.
– Example: Shirt Size with values small, medium, large and extra large. Groupings must preserve
the order small, medium, large, extra large (e.g. {small, medium} vs {large, extra large}, but not {small, large} vs {medium, extra large}).
• Continuous Attributes
– Binary outcome. Example: Annual Income > 80k (Yes, No)
– Range query. Example: Annual Income with branches:
• < 10k
• 10k – 25k
• 25k – 50k
• 50k – 80k
• > 80k
Learning Algorithm
• Aim: Find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of (sub)tree
function DTL(examples, attributes, parent_examples) returns a decision tree
{
    if examples is empty then return MAJORITY_VALUE(parent_examples)
    else if all examples have the same classification then return the classification
    else if attributes is empty then return MAJORITY_VALUE(examples)
    else
        best ← CHOOSE_BEST_ATTRIBUTE(attributes, examples)
        Tree ← a new decision tree with root test best
        for each value vi of best do
            examples_i ← { elements of examples with best = vi }
            subtree ← DTL(examples_i, attributes − best, examples)
            add a branch to Tree with label vi and subtree subtree
        return Tree
}
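For concreteness, here is a minimal runnable Python sketch of the DTL pseudocode above. It assumes examples are dicts mapping attribute names to values, with the class label stored under a target key, and it uses entropy-based information gain (defined on the following slides) as CHOOSE_BEST_ATTRIBUTE; the helper names are illustrative, not part of the original pseudocode.

```python
from collections import Counter
from math import log2

def majority_value(examples, target):
    """MAJORITY_VALUE: most common class label among the examples."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def entropy(examples, target):
    """Entropy of the class-label distribution: -sum p * log2 p."""
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    """I(Parent) - sum_j N(Vj)/N * I(Vj), with entropy as the impurity measure."""
    n = len(examples)
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def dtl(examples, attributes, parent_examples, target="Class"):
    """Decision-tree learning, following the DTL pseudocode above."""
    if not examples:
        return majority_value(parent_examples, target)
    classes = {e[target] for e in examples}
    if len(classes) == 1:
        return classes.pop()
    if not attributes:
        return majority_value(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = dtl(subset, [a for a in attributes if a != best],
                            examples, target)
    return tree
```

On the A OR (B AND C) truth table worked through later, this sketch chooses A at the root, matching the hand calculation.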
CHOOSE_BEST_ATTRIBUTE
• Compute the information gain for each attribute.
• Choose the attribute with the highest information gain.
• Equation:
InformationGain(V) = I(Parent) − Σ_{j=1..k} [ N(Vj) · I(Vj) ] / N
o I = impurity measure
o N = total number of samples at the parent node
o N(Vj) = number of samples for which attribute V takes the value Vj
o k = number of distinct values of attribute V
CHOOSE_BEST_ATTRIBUTE
• Impurity Measure: Measure of the goodness of a
split at a node.
• When is a split pure?
– A split is pure if, after the split, the instances that follow each branch all
belong to the same class.
– The measures for selecting the best split are based
on the degree of impurity of child nodes.
IMPURITY MEASURES
• ENTROPY: Entropy(t) = − Σ_i P(i|t) log2 P(i|t)
• GINI INDEX: Gini(t) = 1 − Σ_i P(i|t)^2
• MISCLASSIFICATION ERROR: Error(t) = 1 − max_i P(i|t)
• C = number of classes; the sums and the max run over i = 1, ..., C
• P(i|t) = fraction of records belonging to class i at node t
(A short code sketch of these measures follows below.)
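A small Python sketch of the three measures, computed from the class counts at a node (the function name is illustrative):

```python
from math import log2

def impurities(class_counts):
    """Entropy, Gini index, and misclassification error for one node,
    given the number of records of each class at that node."""
    n = sum(class_counts)
    p = [c / n for c in class_counts]            # P(i|t) for each class i
    entropy = -sum(pi * log2(pi) for pi in p if pi > 0)
    gini = 1 - sum(pi ** 2 for pi in p)
    misclassification = 1 - max(p)
    return entropy, gini, misclassification

# Example: a node with 3 records of class 0 and 5 of class 1.
print(impurities([3, 5]))   # entropy ~0.954, gini ~0.469, error = 0.375
```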
ENTROPY
• Measure of uncertainty of a random variable.
• The greater the uncertainty, the higher the entropy.
• Example: Coin toss which always comes up as heads.
– No uncertainty, thus entropy = zero.
– We gain no information by observing the value since the value is
always heads.
• Entropy:
– H(V) = − Σ_k P(Vk) log2 P(Vk)
– V = random variable
– P(Vk) = probability of V taking the value Vk
– For a fair coin, H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1
(A quick check of the coin examples follows below.)
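A quick check of these two coin examples in Python (a minimal sketch):

```python
from math import log2

def entropy(probs):
    """H(V) = -sum P(Vk) * log2 P(Vk), skipping zero-probability values."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin     -> 1.0 bit
print(entropy([1.0, 0.0]))   # always heads  -> 0.0 bits (no uncertainty)
```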
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
A | B | C | B AND C | A OR (B AND C)
0 | 0 | 0 |    0    |       0
0 | 0 | 1 |    0    |       0
0 | 1 | 0 |    0    |       0
0 | 1 | 1 |    1    |       1
1 | 0 | 0 |    0    |       1
1 | 0 | 1 |    0    |       1
1 | 1 | 0 |    0    |       1
1 | 1 | 1 |    1    |       1
Let output 0 be Class 0 and output 1 be Class 1.
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
• I(Parent)
– Total of 8 samples: 3 in class 0 and 5 in class 1
– I(Parent) = −3/8 log2(3/8) − 5/8 log2(5/8) ≈ 0.95
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
For attribute A:
A takes value 0: 3 samples in class 0 and 1 in class 1, so N(Vj) = 4, N = 8
A takes value 1: 0 samples in class 0 and 4 in class 1, so N(Vj) = 4, N = 8
Information gain for attribute A =
0.95 − [ 4/8 · (−3/4 log2(3/4) − 1/4 log2(1/4)) + 4/8 · (−4/4 log2(4/4) − 0/4 log2(0/4)) ] ≈ 0.54
(taking 0 · log2 0 = 0)
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
For attribute B:
B takes value 0: 2 samples in class 0 and 2 in class 1, so N(Vj) = 4, N = 8
B takes value 1: 1 sample in class 0 and 3 in class 1, so N(Vj) = 4, N = 8
Information gain for attribute B =
0.95 − [ 4/8 · (−2/4 log2(2/4) − 2/4 log2(2/4)) + 4/8 · (−1/4 log2(1/4) − 3/4 log2(3/4)) ] ≈ 0.04
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
For attribute C:
C takes value 0: 2 samples in class 0 and 2 in class 1, so N(Vj) = 4, N = 8
C takes value 1: 1 sample in class 0 and 3 in class 1, so N(Vj) = 4, N = 8
Information gain for attribute C =
0.95 − [ 4/8 · (−2/4 log2(2/4) − 2/4 log2(2/4)) + 4/8 · (−1/4 log2(1/4) − 3/4 log2(3/4)) ] ≈ 0.04
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
• Information gain for attribute A is the highest, so we use A as the root node.
• When A = 1, all samples belong to class label 1, so that branch becomes a leaf.
• The remaining 4 samples have A = 0. We need to compute the information gain for
attributes B and C on this subset.
• In this subset, class 0 has 3 samples and class 1 has 1 sample.
I(Parent) = −3/4 log2(3/4) − 1/4 log2(1/4) ≈ 0.81
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
For attribute B:
B takes value 0: 2 samples in class 0 and 0 in class 1, so N(Vj) = 2, N = 4
B takes value 1: 1 sample in class 0 and 1 in class 1, so N(Vj) = 2, N = 4
Information gain for attribute B =
0.81 − [ 2/4 · (−2/2 log2(2/2) − 0/2 log2(0/2)) + 2/4 · (−1/2 log2(1/2) − 1/2 log2(1/2)) ] ≈ 0.31
(taking 0 · log2 0 = 0)
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
For attribute C:
C takes value 0: 2 samples in class 0 and 0 in class 1, so N(Vj) = 2, N = 4
C takes value 1: 1 sample in class 0 and 1 in class 1, so N(Vj) = 2, N = 4
Information gain for attribute C =
0.81 − [ 2/4 · (−2/2 log2(2/2) − 0/2 log2(0/2)) + 2/4 · (−1/2 log2(1/2) − 1/2 log2(1/2)) ] ≈ 0.31
DECISION TREE USING ENTROPY AS
IMPURITY MEASURE
• Both B and C have the same information gain, so either can be chosen for the next split. (A sketch that verifies these calculations follows below.)
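To check the worked example end to end, here is a self-contained Python sketch that recomputes the gains on the truth table above. Note that, computed without intermediate rounding, the gains come out slightly higher than the slide's rounded figures (about 0.55 for A and 0.05 for B and C, since I(Parent) ≈ 0.954 rather than 0.95); the ranking is the same.

```python
from math import log2

# The truth table above: rows of (A, B, C, class) where class = A OR (B AND C).
rows = [(a, b, c, a or (b and c)) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def entropy(labels):
    n = len(labels)
    return -sum(p * log2(p)
                for p in (labels.count(v) / n for v in set(labels)) if p > 0)

def gain(rows, col):
    """Information gain of splitting on column col (0 = A, 1 = B, 2 = C)."""
    labels = [r[3] for r in rows]
    remainder = 0.0
    for v in (0, 1):
        subset = [r[3] for r in rows if r[col] == v]
        if subset:
            remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

for name, col in (("A", 0), ("B", 1), ("C", 2)):
    print(name, round(gain(rows, col), 3))
# A has the highest gain, so it becomes the root; B and C tie.

# Repeating the calculation on the A = 0 subset:
subset = [r for r in rows if r[0] == 0]
print("A=0 subset:", round(gain(subset, 1), 3), round(gain(subset, 2), 3))
```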
ENTROPY, GINI
• Entropy – used by the ID3, C4.5 and C5.0 algorithms
• Gini index – used by the CART algorithm
PERFORMANCE
OVERFITTING
• The algorithm generates a large tree even when there is no real pattern in the data.
– Consider the problem of predicting whether a roll of a die comes up 6 or not.
– We carry out experiments with various dice and decide to use attributes such as
the color and weight of the die.
– If in the experiments a roll of a 7-gram blue die happened to come up 6, the
decision tree will build a spurious pattern from that training sample.
REASONS FOR OVERFITTING
• Choosing attributes with little meaning in an attempt to fit noisy data.
• Huge number of attributes
• Small Training Data Set.
HOW TO COMBAT OVERFITTING: PRUNING
• Eliminate irrelevant nodes – nodes that have zero information gain.
– Example: a split that still leaves 50 Yes and 50 No out of 100 examples
provides no information gain.
Problems associated with
Decision Trees
• Missing Data
• Multi-valued attributes
– An attribute with many possible values may have a high information gain, but
choosing it first might not yield the best tree.
• Continuous attributes
Continuous attributes
• Steps
– Sort the records by the numeric value of the attribute.
– Scan the sorted values, updating the Yes/No count matrix at each candidate
split position and computing the impurity.
– Choose the split position with the least impurity. (A sketch of this scan follows below.)
• Finding these split positions is the most expensive part of real-world
decision tree learning applications.
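As an illustration of these steps, here is a minimal Python sketch (assuming a binary Yes/No label and entropy as the impurity measure) that finds the best threshold for a continuous attribute such as Annual Income; the function names are illustrative.

```python
from math import log2

def entropy(n_yes, n_no):
    """Entropy of a Yes/No count pair; 0 log 0 is taken as 0."""
    n = n_yes + n_no
    if n == 0:
        return 0.0
    return -sum(p * log2(p) for p in (n_yes / n, n_no / n) if p > 0)

def best_split(values, labels):
    """Sort records by attribute value, scan candidate thresholds between
    consecutive distinct values, and return the threshold whose two children
    have the lowest weighted impurity."""
    records = sorted(zip(values, labels))
    n = len(records)
    total_yes = sum(1 for _, y in records if y == "Yes")
    left_yes = left_no = 0
    best = (float("inf"), None)
    for i in range(n - 1):
        v, y = records[i]
        left_yes += y == "Yes"
        left_no += y == "No"
        if v == records[i + 1][0]:
            continue                      # no valid threshold between equal values
        right_yes = total_yes - left_yes
        right_no = (n - i - 1) - right_yes
        threshold = (v + records[i + 1][0]) / 2
        impurity = ((i + 1) / n) * entropy(left_yes, left_no) \
                 + ((n - i - 1) / n) * entropy(right_yes, right_no)
        best = min(best, (impurity, threshold))
    return best                           # (weighted impurity, threshold)

# Annual Income (in thousands) and Defaulted Borrower from the training data above.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
default = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, default))        # best threshold falls between 95k and 100k
```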
Thank You
References
• Artificial Intelligence: A Modern Approach, Third Edition, by Russell and Norvig.
• Video lecture by Prof. P. Dasgupta, Dept. of Computer Science, IIT Kharagpur.
• Neural Networks course classroom lecture on decision trees by Dr. Eun Youn,
Texas Tech University, Lubbock.