Decision Tree Induction
CIT366: Data Mining & Data Warehousing
Bajuna Salehe
The Institute of Finance Management:
Computing and IT Dept.
What is a Decision Tree?
• Decision tree induction is the learning of
decision trees from class-labeled training
tuples.
• A decision tree is a flowchart-like tree
structure, where each internal node (nonleaf
node) denotes a test on an attribute, each
branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class
label.
Decision Tree Induction
• Decision trees are among the most
widely used classification
techniques in data mining today
• Formulate problems into a tree
composed of decision nodes (or
branch nodes) and classification
nodes (or leaf nodes)
• A problem is solved by navigating
down the tree until we reach an
appropriate leaf node
• The tricky bit is building the
most efficient and powerful tree
J. Ross Quinlan is a famed
researcher in data mining
and decision theory. He
has done pioneering
work in the area of
decision trees, including
inventing the ID3 and
C4.5 algorithms.
What is a Decision Tree?
• A decision tree for the concept buys computer, indicating
whether a customer at AllElectronics is likely to purchase a
computer. Each internal (nonleaf) node represents a test on
an attribute. Each leaf node represents a class (either buys
computer = yes or buys computer = no).
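The structure just described (internal nodes as attribute tests, branches as test outcomes, leaves as class labels) can be captured with a very small data structure. The sketch below uses nested Python dicts; since the AllElectronics figure is not reproduced in this transcript, the attributes age, student and credit_rating and their split values are assumptions standing in for it.

```python
# A minimal sketch of the flowchart-like tree described above, as nested dicts.
# The attribute names and split values are assumptions (the original figure is
# not reproduced here), chosen to match the usual buys_computer example.
buys_computer_tree = {
    "attribute": "age",                       # internal node: a test on an attribute
    "branches": {                             # each key is one outcome of the test
        "youth": {
            "attribute": "student",
            "branches": {"no": {"label": "no"},       # leaf node: class label
                         "yes": {"label": "yes"}},
        },
        "middle_aged": {"label": "yes"},              # leaf node
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": {"label": "no"},
                         "fair": {"label": "yes"}},
        },
    },
}
```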
How Does a Decision Tree Work?
• Given a tuple, X, for which the associated class
label is unknown, the attribute values of the
tuple are tested against the decision tree.
• A path is traced from the root to a leaf node,
which holds the class prediction for that tuple.
• Decision trees can easily be converted to
classification rules.
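Because every root-to-leaf path corresponds to one IF-THEN rule, the conversion to classification rules can be sketched in a few lines. The function below walks the nested-dict tree from the previous sketch (buys_computer_tree); the rule wording is illustrative only.

```python
def tree_to_rules(node, conditions=()):
    """Emit one IF-THEN classification rule per root-to-leaf path."""
    if "label" in node:                       # leaf: the accumulated tests form the rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN buys_computer = {node['label']}"]
    rules = []
    for outcome, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attribute"], outcome),))
    return rules

for rule in tree_to_rules(buys_computer_tree):
    print(rule)
# e.g. IF age = youth AND student = no THEN buys_computer = no
```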
How Does a Decision Tree Work?
• Consider a simpler version of the vertebrate
classification problem introduced in the last
lecture.
• Suppose a new species was discovered by
scientists. How can we tell whether it is a
mammal or non-mammal?
• One way to do this is to pose a series of
questions to learn about the characteristics of
the creature
How Does a Decision Tree Work?
• The first question one may ask is whether it is
a warm- or cold-blooded creature.
– If it is cold-blooded, then it is definitely not a
mammal.
– Otherwise, it is either a bird or a mammal.
• In the latter case, we need to ask a follow-up
question: do the females of the species give
birth to their young?
How Does a Decision Tree Work?
• If the answer is yes, then it is definitely a
mammal; otherwise, it is most likely a non-mammal
(with the exception of egg-laying mammals
such as the platypus and spiny anteater).
• The example illustrates how a classification
problem can be solved by simply asking a
series of carefully crafted questions about the
attributes of the unknown record.
How Does a Decision Tree Work?
• Each time an answer is given, a follow-up
question is asked until we reach a conclusion
about the class label of the record.
• The series of questions and their possible
answers can be organized in the form of a
decision tree, which is a hierarchical structure
consisting of nodes and directed edges.
• Figure 1 illustrates an example of a decision
tree for the mammal classification problem.
How Does a Decision Tree Work?
Figure 1. A decision tree for the mammal classification problem.
How Does a Decision Tree Work?
• There are three types of nodes in the decision
tree:
– A root node, which has no incoming edges and
zero or more outgoing edges.
– Internal nodes, each of which has exactly one
incoming edge and two or more outgoing edges.
– Leaf or terminal nodes, each of which has exactly
one incoming edge and no outgoing edges.
• In a decision tree, each leaf node is assigned a
class label.
How Does a Decision Tree Work?
• The non-terminal nodes, which consist of the
root and other internal nodes, contain
attribute test conditions to separate records
that have different properties.
• For example, the root node shown in Figure 1
uses a test condition (Body Temperature?) to
separate warm-blooded from cold-blooded
vertebrates.
How Does a Decision Tree Work?
• Since all cold-blooded vertebrates are
essentially non-mammals, a leaf node labeled
as Non-mammals is created as the right child
of the root node.
• If the vertebrate is warm-blooded, a
subsequent test condition (Give Birth?) is used
to distinguish mammals from other warm-blooded
creatures, which are mostly birds.
How Does a Decision Tree Work?
• Classifying unlabeled records is a
straightforward task once a decision tree has
been constructed.
• Starting from the root node, we apply the
attribute test condition to the unlabeled
record and follow the appropriate branch
based on the outcome of the test.
How Does a Decision Tree Work?
• This will lead us either to another internal
node, for which a new test condition is
applied, or eventually, to a leaf node.
• The class label associated with the leaf node is
then assigned to the record.
• As an illustration, Figure 2 traces the path
along the decision tree which is used to
predict the class label of a flamingo.
How Does a Decision Tree Work?
• The path ends up at a leaf node labeled as
Non-mammals.
How Does a Decision Tree Work?
Figure 2. Classifying an unlabeled vertebrate. The dashed lines represent the
outcomes of applying various attribute test conditions on the unlabeled
vertebrate. The vertebrate is eventually assigned to the Non-mammal
class.
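As a concrete illustration of this traversal, here is a small sketch in the same nested-dict format used earlier. The tree mirrors the mammal classification tree of Figure 1 (Body Temperature at the root, then Give Birth), and the flamingo record follows the text: warm-blooded but does not give birth. The attribute and value spellings are my own shorthand, not taken from the figure.

```python
# The mammal classification tree of Figure 1, in the nested-dict sketch format.
mammal_tree = {
    "attribute": "body_temperature",
    "branches": {
        "cold": {"label": "Non-mammal"},              # cold-blooded -> non-mammal
        "warm": {
            "attribute": "gives_birth",
            "branches": {"yes": {"label": "Mammal"},
                         "no": {"label": "Non-mammal"}},
        },
    },
}

def classify(node, record):
    """Trace a path from the root to a leaf and return its class label."""
    while "label" not in node:                # stop once a leaf (terminal) node is reached
        outcome = record[node["attribute"]]   # apply the attribute test condition
        node = node["branches"][outcome]      # follow the branch for that outcome
    return node["label"]

flamingo = {"body_temperature": "warm", "gives_birth": "no"}
print(classify(mammal_tree, flamingo))        # -> Non-mammal
```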
How to Build a Decision Tree?
• There are exponentially many decision trees
that can be constructed from a given set of
attributes.
• Some of these trees are better than others
because they achieve higher accuracy on the
test set.
• A number of efficient algorithms have been
developed to induce reasonably accurate
decision trees.
How to Build a Decision Tree?
• These algorithms often employ a greedy
strategy for growing decision trees by making
a series of locally optimum decisions about
which attribute to use for partitioning the
data.
• One such algorithm is Hunt’s algorithm,
which is the basis of many existing decision
tree induction algorithms, including ID3, C4.5,
and CART.
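A rough sketch of that greedy, recursive strategy is given below. This is not the exact pseudocode of Hunt's algorithm, only an illustration of the idea: stop when a partition is pure (or no attributes remain), otherwise pick a locally best attribute and recurse on each outcome. The choose_best_attribute placeholder stands in for an attribute selection measure; ID3, C4.5 and CART each plug in their own.

```python
from collections import Counter

def choose_best_attribute(records, attributes, target):
    """Placeholder for an attribute selection measure (e.g. information gain);
    here we naively take the first candidate attribute."""
    return attributes[0]

def grow_tree(records, attributes, target="class"):
    """Greedy, recursive tree growing in the spirit of Hunt's algorithm."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:                     # pure partition: create a leaf
        return {"label": labels[0]}
    if not attributes:                            # no tests left: majority-class leaf
        return {"label": Counter(labels).most_common(1)[0][0]}
    best = choose_best_attribute(records, attributes, target)  # locally optimal choice
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}}
    for value in {r[best] for r in records}:      # one branch per observed outcome
        subset = [r for r in records if r[best] == value]
        node["branches"][value] = grow_tree(subset, remaining, target)
    return node
```

Called as, say, grow_tree(training_records, ["body_temperature", "gives_birth"]), it would return a tree in the same nested-dict format used in the earlier sketches.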
Why are decision tree classifiers so
popular?
• Decision trees can handle high-dimensional data.
• Their representation of acquired knowledge in
tree form is intuitive and generally easy to
assimilate by humans.
• The learning and classification steps of
decision tree induction are simple and fast.
• In general, decision tree classifiers have good
accuracy.
Applications
• Decision tree induction algorithms have been
used for classification in many application
areas, such as medicine, manufacturing and
production, financial analysis, astronomy, and
molecular biology.
• Decision trees are the basis of several
commercial rule induction systems.
Note
• During tree construction, attribute selection
measures are used to select the attribute that
best partitions the tuples into distinct classes
(one such measure is sketched below).
• When decision trees are built, many of the
branches may reflect noise or outliers in the
training data.
• Tree pruning attempts to identify and remove
such branches, with the goal of improving
classification accuracy on unseen data.
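The note above mentions attribute selection measures without naming one. As a concrete illustration, here is a minimal sketch of information gain (the measure used by ID3): the entropy of the class distribution minus the weighted entropy that remains after splitting on an attribute. The toy records at the end are made up purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(records, target="class"):
    """Shannon entropy of the class-label distribution in a set of records."""
    counts = Counter(r[target] for r in records)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(records, attribute, target="class"):
    """Reduction in entropy obtained by partitioning the records on one attribute."""
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

# Illustrative toy data: the attribute with the higher gain is the better split.
data = [
    {"body_temperature": "warm", "gives_birth": "yes", "class": "Mammal"},
    {"body_temperature": "warm", "gives_birth": "no",  "class": "Non-mammal"},
    {"body_temperature": "cold", "gives_birth": "no",  "class": "Non-mammal"},
]
print(information_gain(data, "gives_birth"))       # ~0.92 (a pure split)
print(information_gain(data, "body_temperature"))  # ~0.25
```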