Transcript Document

Inductive Learning from Examples:
Decision tree learning
Prof. Gheorghe Tecuci
Learning Agents Laboratory
Computer Science Department
George Mason University
© 2002, G.Tecuci, Learning Agents Laboratory
Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
 2002, G.Tecuci, Learning Agents Laboratory
2
The decision tree learning problem
Given
• language of instances: feature value vectors
• language of generalizations: decision trees
• a set of positive examples (E1, ..., En) of a concept
• a set of negative examples (C1, ... , Cm) of the same concept
• learning bias: preference for shorter decision trees
Determine
• a concept description in the form of a decision tree which is
a generalization of the positive examples that does not cover
any of the negative examples
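As a concrete illustration of the two languages (an encoding of my own choosing, not prescribed by the slides), an instance can be written as an attribute-value dictionary and a decision tree as a nested (attribute, branches) structure:

```python
# Hypothetical encoding: an instance is a feature-value vector, a learned
# concept is a decision tree mapping attribute values to subtrees or labels.
positive_example = ({"height": "short", "hair": "blond", "eyes": "blue"}, "+")
negative_example = ({"height": "tall", "hair": "dark", "eyes": "brown"}, "-")
decision_tree = ("hair", {              # (attribute, {value: subtree-or-label})
    "dark": "-",
    "red": "+",
    "blond": ("eyes", {"blue": "+", "brown": "-"}),
})
```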
 2002, G.Tecuci, Learning Agents Laboratory
3
Illustration
Examples

Feature vector representation of examples: there is a fixed set of attributes, each attribute taking values from a specified set.

height  hair   eyes   class
short   blond  blue   +
tall    blond  brown  -
tall    red    blue   +
short   dark   blue   -
tall    dark   blue   -
tall    blond  blue   +
tall    dark   brown  -
short   blond  brown  -

Decision tree concept:

hair = dark: -
hair = red: +
hair = blond:
    eyes = blue: +
    eyes = brown: -

(short, blond, blue) is +: the instance follows the branches hair = blond and eyes = blue, so the tree classifies it as positive.
What is the logical expression represented by the decision tree?
Decision tree concept:

hair = dark: -
hair = red: +
hair = blond:
    eyes = blue: +
    eyes = brown: -

Disjunction of conjunctions (one conjunct per path to a + node):
(hair = red) ∨ [(hair = blond) & (eyes = blue)]
Feature-value representation
Is the feature value representation powerful enough?
If the training set (i.e. the set of positive and negative examples from
which the tree is learned) contains a positive example and a negative
example that have identical values for each attribute, it is impossible to
differentiate between the instances with reference only to the given
attributes.
In such a case the attributes are inadequate for the training set and for the
induction task.
 2002, G.Tecuci, Learning Agents Laboratory
6
Feature-value representation (cont.)
When could a decision tree be built?
If the attributes are adequate, it is always possible to construct a decision
tree that correctly classifies each instance in the training set.
So what is the difficulty in learning a decision tree?
The problem is that there are many such correct decision trees and the
task of induction is to construct a decision tree that correctly classifies not
only instances from the training set but other (unseen) instances as well.
 2002, G.Tecuci, Learning Agents Laboratory
7
Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
 2002, G.Tecuci, Learning Agents Laboratory
8
The basic ID3 learning algorithm
• Let C be the set of training examples.
• If all the examples in C are positive, then create a node with label +.
• If all the examples in C are negative, then create a node with label -.
• If there is no attribute left, then create a node with the same label as the majority of examples in C.
• Otherwise:
- select the best attribute A and create a decision node with one branch for each of the values v1, v2, ..., vk of A;
- partition the examples into subsets C1, C2, ..., Ck according to the values of A;
- apply the algorithm recursively to each of the sets Ci which is not empty;
- for each Ci which is empty, create a node with the same label as the majority of examples in C.
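A minimal sketch of this procedure in Python follows. The helper names, the dictionary encoding of feature vectors, and the "+"/"-" labels are my own choices, not part of the slides; the sketch branches only on attribute values that actually occur in C, so the empty-subset case never arises here.

```python
# Minimal ID3 sketch (hypothetical helpers; not the original implementation).
import math
from collections import Counter

def entropy(labels):
    """Average information -sum(pi * log2 pi) over the class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute):
    """gain(A) = I(C) - Ires(A) for one candidate attribute."""
    labels = [label for _, label in examples]
    residual = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == value]
        residual += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - residual

def id3(examples, attributes):
    """examples: list of (feature_dict, label); attributes: list of names.
    Returns a label or a nested (attribute, {value: subtree}) node."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                      # all + or all -
        return labels[0]
    if not attributes:                             # no attribute left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Only values present in C are expanded; a value of A with no examples
    # would get the majority label of C, as in the algorithm above.
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        branches[value] = id3(subset, remaining)
    return (best, branches)
```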
 2002, G.Tecuci, Learning Agents Laboratory
9
Feature selection: information theory
Let us consider a set S containing objects from n classes S1, ..., Sn, such that the probability that an object belongs to class Si is pi.
According to information theory, the amount of information needed to identify that a particular member of S belongs to class Si is:
Ii = - log2 pi
Intuitively, Ii represents the number of yes/no questions required to identify the class Si of a given element of S.
The average amount of information needed to identify the class of an
element in S is:
- ∑ pi log2 pi
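As a quick check of the formula (a worked example added here, not on the original slide): with two equally likely classes, p1 = p2 = 1/2, the average information is -1/2 log2 1/2 - 1/2 log2 1/2 = 1, i.e. one yes/no question is needed; if some pi = 1, the sum is 0, since no question is needed at all.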
 2002, G.Tecuci, Learning Agents Laboratory
10
Feature selection: the best attribute
Let us suppose that the decision tree has been built from a training set C
consisting of p positive examples and n negative examples.
The average amount of information needed to classify an instance from C is:

I(p, n) = - p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)

If attribute A with values {v1, v2, ..., vk} is used for the root of the decision tree, it will partition C into {C1, C2, ..., Ck}, where each Ci contains pi positive examples and ni negative examples.
The expected information required to classify an instance in Ci is I(pi, ni).
The expected amount of information required to classify an instance after the value of the attribute A is known is therefore:

Ires(A) = Σ (i=1..k) (pi + ni)/(p + n) · I(pi, ni)

The information gained by branching on A is: gain(A) = I(p, n) - Ires(A)
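A direct transcription of these formulas into code may help; the function names below are mine, and the closing comment reuses the counts from the illustration that follows.

```python
# Sketch of I(p, n), Ires(A) and gain(A) for a binary (+/-) concept.
import math

def I(p, n):
    """Average information needed to classify an instance of a set with
    p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def gain(p, n, partition):
    """partition: the (pi, ni) counts of the subsets C1, ..., Ck produced by A."""
    i_res = sum(((pi + ni) / (p + n)) * I(pi, ni) for pi, ni in partition)
    return I(p, n) - i_res

# e.g. the 'hair' attribute of the illustration: 3+/5- examples split into
# blond(2+, 2-), red(1+, 0-), dark(0+, 3-):
# gain(3, 5, [(2, 2), (1, 0), (0, 3)])  ->  about 0.4544
```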
 2002, G.Tecuci, Learning Agents Laboratory
11
Feature selection: the heuristic
The information gained by branching on A is:
gain(A) = I(p, n) - Ires(A)
What would be a good heuristic?
Choose the attribute which leads to the greatest information gain.
Why is this a heuristic and not a guaranteed method?
 2002, G.Tecuci, Learning Agents Laboratory
12
Feature selection: algorithm optimization
How could we optimize the algorithm for determining the best
attribute?
Hint: The information gained by branching on A is: gain(A) = I(p, n) - Ires(A)
Since I(p, n) is constant for all attributes, maximizing the gain is equivalent
to minimizing Ires(A), which in turn is equivalent to minimizing the following
expression:
Σi ( - pi log2 pi/(pi + ni) - ni log2 ni/(pi + ni) )

where pi is the number of positive examples in Ci and ni is the number of negative examples in Ci;
if pi = 0 or ni = 0, then the corresponding term in the sum is 0.
ID3 examines all candidate attributes and chooses A to maximize gain(A)
(or minimize Ires(A)), forms the tree as above, and then uses the same
process recursively to form decision trees for the residual subsets C1,
C2,...,Ck.
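A sketch of this optimization (the helper name is mine): compute only the residual term, skipping the zero terms.

```python
# Because I(p, n) is the same for every candidate attribute, ID3 can simply
# minimize this residual sum; terms with pi = 0 or ni = 0 contribute 0.
import math

def residual(partition):
    """partition: (pi, ni) counts of the subsets C1, ..., Ck produced by A."""
    total = 0.0
    for pi, ni in partition:
        for count in (pi, ni):
            if count:
                total -= count * math.log2(count / (pi + ni))
    return total

# Smallest residual = largest gain, e.g. for 'hair':
# residual([(2, 2), (1, 0), (0, 3)]) == 4.0 bits
```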
 2002, G.Tecuci, Learning Agents Laboratory
13
Illustration of the method
Examples

height  hair   eyes   class
short   blond  blue   +
tall    blond  brown  -
tall    red    blue   +
short   dark   blue   -
tall    dark   blue   -
tall    blond  blue   +
tall    dark   brown  -
short   blond  brown  -

1. Find the attribute that maximizes the information gain:

gain(A) = I(p, n) - Σ (i=1..k) (pi + ni)/(p + n) · I(pi, ni)

I(p, n) = - p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)

I(3+, 5-) = -3/8 log2 3/8 - 5/8 log2 5/8 = 0.954434003

Height: short(1+, 2-), tall(2+, 3-)
Gain(height) = 0.954434003 - 3/8·I(1+, 2-) - 5/8·I(2+, 3-)
             = 0.954434003 - 3/8(-1/3 log2 1/3 - 2/3 log2 2/3) - 5/8(-2/5 log2 2/5 - 3/5 log2 3/5) = 0.003228944

Hair: blond(2+, 2-), red(1+, 0-), dark(0+, 3-)
Gain(hair) = 0.954434003 - 4/8(-2/4 log2 2/4 - 2/4 log2 2/4) - 1/8·I(1+, 0-) - 3/8·I(0+, 3-)
           = 0.954434003 - 0.5 = 0.454434003

Eyes: blue(3+, 2-), brown(0+, 3-)
Gain(eyes) = 0.954434003 - 5/8(-3/5 log2 3/5 - 2/5 log2 2/5) - 3/8·I(0+, 3-)
           = 0.954434003 - 0.606844122 = 0.347589881

"Hair" is the best attribute.
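Assuming the id3() sketch given with the algorithm, the same table can be written as data and the computation reproduced (the encoding below is mine):

```python
# The eight training examples of the illustration as (feature_dict, label) pairs.
data = [
    ({"height": "short", "hair": "blond", "eyes": "blue"},  "+"),
    ({"height": "tall",  "hair": "blond", "eyes": "brown"}, "-"),
    ({"height": "tall",  "hair": "red",   "eyes": "blue"},  "+"),
    ({"height": "short", "hair": "dark",  "eyes": "blue"},  "-"),
    ({"height": "tall",  "hair": "dark",  "eyes": "blue"},  "-"),
    ({"height": "tall",  "hair": "blond", "eyes": "blue"},  "+"),
    ({"height": "tall",  "hair": "dark",  "eyes": "brown"}, "-"),
    ({"height": "short", "hair": "blond", "eyes": "brown"}, "-"),
]
# id3(data, ["height", "hair", "eyes"]) should yield, up to branch order,
# ('hair', {'dark': '-', 'red': '+', 'blond': ('eyes', {'blue': '+', 'brown': '-'})})
```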
Illustration of the method (cont.)
Examples: the same eight instances as above.

2. "Hair" is the best attribute. Build the tree using it:

hair = dark:
    short, dark, blue: -
    tall, dark, blue: -
    tall, dark, brown: -
hair = red:
    tall, red, blue: +
hair = blond:
    short, blond, blue: +
    tall, blond, brown: -
    tall, blond, blue: +
    short, blond, brown: -
Illustration of the method (cont.)
3. Select the best attribute for the set of examples:

short, blond, blue: +
tall, blond, brown: -
tall, blond, blue: +
short, blond, brown: -

I(2+, 2-) = -2/4 log2 2/4 - 2/4 log2 2/4 = -log2 1/2 = 1

Height: short(1+, 1-), tall(1+, 1-)
Eyes: blue(2+, 0-), brown(0+, 2-)

Gain(height) = 1 - 2/4·I(1+, 1-) - 2/4·I(1+, 1-) = 1 - I(1+, 1-) = 1 - 1 = 0
Gain(eyes) = 1 - 2/4·I(2+, 0-) - 2/4·I(0+, 2-) = 1 - 0 - 0 = 1

"Eyes" is the best attribute.
Illustration of the method (cont.)
4. "Eyes" is the best attribute. Expand the tree using it:

hair = dark:
    short, dark, blue: -
    tall, dark, blue: -
    tall, dark, brown: -
hair = red:
    tall, red, blue: +
hair = blond:
    eyes = blue:
        short, blond, blue: +
        tall, blond, blue: +
    eyes = brown:
        tall, blond, brown: -
        short, blond, brown: -
Illustration of the method (cont.)
5. Build the decision tree:

hair = dark: -
hair = red: +
hair = blond:
    eyes = blue: +
    eyes = brown: -

What induction hypothesis is made?
 2002, G.Tecuci, Learning Agents Laboratory
20
Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
 2002, G.Tecuci, Learning Agents Laboratory
21
How could we transform a tree into a set of rules?
hair = dark: -
hair = red: +
hair = blond:
    eyes = blue: +
    eyes = brown: -

Answer:

IF (hair = red)
THEN positive example

IF (hair = blond) & (eyes = blue)
THEN positive example
Why should we make such a transformation?
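For trees in the nested (attribute, {value: subtree}) format of the earlier sketch, the transformation can be done mechanically; a hypothetical helper:

```python
def tree_to_rules(tree, conditions=()):
    """Yield one rule (a conjunction of attribute tests) per path to a '+' leaf,
    for trees in the (attribute, {value: subtree}) format of the id3() sketch."""
    if not isinstance(tree, tuple):                 # leaf: a class label
        if tree == "+":
            yield " & ".join(f"({a} = {v})" for a, v in conditions)
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

# For the tree above this yields the two rules:
#   (hair = red)
#   (hair = blond) & (eyes = blue)
```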
 2002, G.Tecuci, Learning Agents Laboratory
22
Learning from noisy data
What errors could be found in an example (also called noise in data)?
• errors in the values of attributes (due to measurement errors or subjective judgments);
• errors in the classification of instances (for instance, a negative example that was recorded as a positive example).
What are the effects of noise?
How to change the ID3 algorithm to deal with noise?
 2002, G.Tecuci, Learning Agents Laboratory
23
How to deal with noise?
What are the effects of noise?
Noise may cause the attributes to become inadequate.
Noise may lead to decision trees of spurious complexity (overfitting).
How to change the ID3 algorithm to deal with noise?
The algorithm must be able to work with inadequate attributes,
because noise can cause even the most comprehensive set of
attributes to appear inadequate.
The algorithm must be able to decide that testing further attributes will
not improve the predictive accuracy of the decision tree.
For instance, it should refrain from increasing the complexity of the
decision tree to accommodate a single noise-generated special case.
 2002, G.Tecuci, Learning Agents Laboratory
24
How to deal with an inadequate attribute set?
(inadequacy due to noise)
A collection C of instances may contain representatives of both
classes, yet further testing of C may be ruled out, either because
the attributes are inadequate and unable to distinguish among the
instances in C, or because each attribute has been judged to be
irrelevant to the class of instances in C.
In this situation it is necessary to produce a leaf labeled with class information, even though the instances in C are not all of the same class.
What class to assign a leaf node that contains both + and - examples?
 2002, G.Tecuci, Learning Agents Laboratory
25
What class to assign a leaf node that contains both + and - examples?
Approaches:
1. The notion of class could be generalized from a binary value (0 for
negative examples and 1 for positive examples) to a number in the
interval [0; 1]. In such a case, a class of 0.8 would be interpreted as
'belonging to class P with probability 0.8'.
2. Opt for the more numerous class, i.e. assign the leaf to class P if p>n,
to class N if p<n, and to either if p=n.
The first approach minimizes the sum of the squares of the error over
objects in C.
The second approach minimizes the sum of the absolute errors over
objects in C. If the aim is to minimize expected error, the second
approach might be anticipated to be superior.
 2002, G.Tecuci, Learning Agents Laboratory
26
How to avoid overfitting the data?
One says that a hypothesis overfits the training examples if some other
hypothesis that fits the training examples less well actually performs
better over the entire distribution of instances.
How to avoid overfitting?
• Stop growing the tree before it overfits;
• Allow the tree to overfit and then prune it.
How to determine the correct size of the tree?
Use a testing set of examples to compare the likely errors of various trees.
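A minimal sketch of that idea (helper names are mine): classify each held-out test example by walking the tree, then compare the error rates of the candidate trees and prefer the smaller tree when its error is not worse.

```python
def classify(tree, instance, default="-"):
    """Follow a (attribute, {value: subtree}) tree; unseen values fall back
    to a default label."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches.get(instance.get(attribute), default)
    return tree

def error_rate(tree, test_examples):
    """Fraction of (feature_dict, label) pairs the tree misclassifies."""
    wrong = sum(classify(tree, x) != label for x, label in test_examples)
    return wrong / len(test_examples)
```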
 2002, G.Tecuci, Learning Agents Laboratory
27
Rule post pruning to avoid overfitting the data
Rule post pruning algorithm
Infer a decision tree
Convert the tree into a set of rules
Prune (generalize) the rules by removing antecedents as long as this
improves their accuracy
Sort the rules by their accuracy and use this order in classification
Compare tree pruning with rule post pruning.
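A sketch of the pruning step under one common reading ("remove an antecedent whenever the rule's estimated accuracy does not decrease"); the rule encoding and helper names are mine, not prescribed by the slide.

```python
def rule_accuracy(antecedents, conclusion, validation):
    """antecedents: list of (attribute, value) tests; validation: (features, label) pairs."""
    covered = [(x, label) for x, label in validation
               if all(x.get(a) == v for a, v in antecedents)]
    if not covered:
        return 0.0
    return sum(label == conclusion for _, label in covered) / len(covered)

def prune_rule(antecedents, conclusion, validation):
    """Greedily drop antecedents while estimated accuracy does not decrease."""
    best = list(antecedents)
    improved = True
    while improved and best:
        improved = False
        for a in list(best):
            candidate = [c for c in best if c != a]
            if rule_accuracy(candidate, conclusion, validation) >= \
               rule_accuracy(best, conclusion, validation):
                best, improved = candidate, True
                break
    return best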
 2002, G.Tecuci, Learning Agents Laboratory
28
How to use continuous attributes?
Transform a continuous attribute into a discrete one.
Give an example of such a transformation.
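One standard transformation (not spelled out on the slide) is to introduce a Boolean test "value <= t" and pick the threshold t with the greatest information gain, trying thresholds midway between consecutive values of different classes; a sketch with my own helper names:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(points):
    """points: [(value, label), ...] for one continuous attribute.
    Returns the threshold t of the Boolean test 'value <= t' with maximal gain."""
    points = sorted(points)
    base = entropy([label for _, label in points])
    best, best_gain = None, -1.0
    for i in range(len(points) - 1):
        if points[i][1] == points[i + 1][1]:
            continue                          # try only class-boundary candidates
        t = (points[i][0] + points[i + 1][0]) / 2
        left = [label for v, label in points if v <= t]
        right = [label for v, label in points if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(points)
        if gain > best_gain:
            best, best_gain = t, gain
    return best
```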
 2002, G.Tecuci, Learning Agents Laboratory
29
How to deal with missing attribute values?
Estimate the value from the values of the other examples.
How?
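One simple estimate (one option among several; not prescribed by the slide) is to fill in the attribute's most common value among training examples of the same class, falling back to the overall most common value; a sketch with my own helper names:

```python
from collections import Counter

def fill_missing(examples, attribute):
    """examples: [(feature_dict, label), ...]; missing values are None.
    Assumes the attribute's value is known for at least one example."""
    by_class, overall = {}, Counter()
    for features, label in examples:
        value = features.get(attribute)
        if value is not None:
            by_class.setdefault(label, Counter())[value] += 1
            overall[value] += 1
    for features, label in examples:
        if features.get(attribute) is None:
            counts = by_class.get(label, overall)
            features[attribute] = counts.most_common(1)[0][0]
    return examples
```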
 2002, G.Tecuci, Learning Agents Laboratory
30
Comparison with the candidate elimination algorithm
Generalization language
ID3 – disjunctions of conjunctions
CE – conjunctions
Bias
ID3 – preference bias (Occam’s razor)
CE – representation bias
Search strategy
ID3: hill climbing (may not find the concept but only an approximation)
CE: exhaustive search
Use of examples
ID3 – all at the same time (can deal with noise and missing values)
CE – one at a time (can determine the most informative example)
 2002, G.Tecuci, Learning Agents Laboratory
31
Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
 2002, G.Tecuci, Learning Agents Laboratory
32
What problems are appropriate for decision tree learning?
Problems for which:
Instances can be represented by attribute-value pairs
Disjunctive descriptions may be required to represent the learned concept
Training data may contain errors
Training data may contain missing attribute values
 2002, G.Tecuci, Learning Agents Laboratory
33
What practical applications could you envision?
Classify:
- Patients by their disease;
- Equipment malfunctions by their cause;
- Loan applicants by their likelihood to default on payments.
 2002, G.Tecuci, Learning Agents Laboratory
34
Which are the main features of decision tree learning?
May employ a large number of examples.
Discovers efficient classification trees that are theoretically justified.
Learns disjunctive concepts.
Is limited to attribute-value representations.
Has a non-incremental nature (there are, however, incremental versions, which are less efficient).
The tree representation is not very understandable.
The method is limited to learning classification rules.
The method was successfully applied to complex real world problems.
 2002, G.Tecuci, Learning Agents Laboratory
35
Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
 2002, G.Tecuci, Learning Agents Laboratory
36
Exercise
Build two different decision trees corresponding to the examples and counterexamples from the following table. Indicate the concept represented by each decision tree.

               food        medium      type      class
deer (e1)      herbivore   land        harmless  mammal     +
lion (c1)      carnivore   land        harmful   mammal     -
goldfish (e2)  omnivorous  water       harmless  fish       +
frog (c2)      herbivore   amphibious  harmless  amphibian  -
parrot (c3)    omnivorous  air         harmless  bird       -
cobra (e3)     carnivore   land        harmful   reptile    +
lizard (c4)    carnivore   land        harmless  reptile    -
bear (e4)      omnivorous  land        moody     mammal     +

Apply the ID3 algorithm to build the decision tree corresponding to the examples and counterexamples from the above table.
Exercise
Consider the following positive and negative examples of a concept

     shape  size   class
e1   ball   large  +
c1   brick  small  -
c2   cube   large  -
e2   ball   small  +

and the following background knowledge
a) You will be required to learn this concept by applying two different learning methods, the Induction of Decision
Trees method, and the Versions Space (candidate elimination) method.
Do you expect to learn the same concept with each method or different concepts?
Explain in detail your prediction (You will need to consider various aspects like the instance space, the hypothesis
space, and the method of learning).
b) Learn the concept represented by the above examples by applying:
- the Induction of Decision Trees method;
- the Versions Space method.
c) Explain the results obtained in b) and compare them with your predictions.
d) What will be the results of learning with the above two methods if only the first three examples are available?
 2002, G.Tecuci, Learning Agents Laboratory
38
Exercise
Consider the following positive and negative examples of a concept

     workstation  software        printer      class
e1   maclc        macwrite        laserwriter  +
e2   sun          frame-maker     laserwriter  +
c1   hp           accounting      laserjet     -
c2   sgi          spreadsheet     laserwriter  -
e3   macII        microsoft-word  proprinter   +

and the following background knowledge:

[Figure: generalization hierarchies over the attribute values, rooted at the concept something, with intermediate concepts any-workstation, any-printer, any-software, op-system, publishing-sw, and mac, and values macplus, maclc, macII, vax, sgi, sun, hp; proprinter, laserwriter, microlaser, laserjet, xerox; unix, vms, mac-os; accounting, spreadsheet; mac-write, page-maker, frame-maker, microsoft-word.]

a) Build two decision trees corresponding to the above examples. Indicate the concept represented by each decision tree. In principle, how many different decision trees could you build?
b) Learn the concept represented by the above examples by applying the Versions Space method. Which is the learned concept if only the first four examples are available?
c) Compare and justify the obtained results.
Exercise
True or false:
If decision tree D2 is an elaboration of D1 (according to ID3), then D1 is more
general than D2.
 2002, G.Tecuci, Learning Agents Laboratory
40
Recommended reading
Mitchell T.M., Machine Learning, Chapter 3: Decision tree learning, pp. 52-80, McGraw Hill, 1997.
Quinlan J.R., Induction of Decision Trees, Machine Learning, 1:81-106, 1986. Also in Shavlik J. and Dietterich T. (eds), Readings in Machine Learning, Morgan Kaufmann, 1990.
Barr A., Cohen P., and Feigenbaum E. (eds), The Handbook of Artificial Intelligence, vol. III, pp. 406-410, Morgan Kaufmann, 1982.
Edwards E., Information Transmission, Chapter 4: Uncertainty, pp. 28-39, Chapman and Hall, 1964.
 2002, G.Tecuci, Learning Agents Laboratory
41