TZN 330 – Projekt Radosław Wesołowski

Radosław Wesołowski
Tomasz Pękalski, Michal Borkowicz, Maciej Kopaczyński
12-03-2008
What is it anyway?
Decision tree T – a tree with a root (in the graph-theoretic sense), in which we assign the following meanings to its elements:
- inner nodes represent attributes,
- edges represent values of the attribute,
- leaves represent classification decisions.
Using a decision tree we can visualize a program built only from ‘if-then’ instructions, as in the sketch below.
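A minimal sketch of such an ‘if-then’ program; the attribute names, values and decisions are illustrative assumptions, not taken from the talk:

def classify(outlook, humidity, wind):
    # inner node: test the attribute 'outlook'; each branch is one value
    if outlook == 'sunny':
        # inner node: test 'humidity'
        if humidity == 'high':
            return 'do not play'   # leaf: classification decision
        else:
            return 'play'          # leaf
    elif outlook == 'overcast':
        return 'play'              # leaf
    else:  # outlook == 'rainy'
        # inner node: test 'wind'
        if wind == 'strong':
            return 'do not play'   # leaf
        else:
            return 'play'          # leaf

print(classify('sunny', 'high', 'weak'))  # -> do not play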
Testing functions
Let us consider an attribute A (e.g. temperature). Let VA denote the set of all possible values of A (0 K up to infinity). Let Rt denote the set of all possible test results (hot, mild, cold). By a testing function we mean a map

t: VA → Rt

We distinguish two main types of testing functions, depending on the set VA: discrete and continuous.
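A minimal sketch of a testing function for the temperature example, mapping VA (kelvins) onto Rt = {hot, mild, cold}; the two thresholds are illustrative assumptions:

def temperature_test(kelvins):
    # t: VA -> Rt; the attribute value is continuous, the result discrete
    if kelvins >= 300.0:
        return 'hot'
    elif kelvins >= 283.0:
        return 'mild'
    else:
        return 'cold'

print(temperature_test(305.0))  # -> hot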
Quality of a decision tree (Occam's razor):
- we prefer small, simple trees,
- we want to gain maximum accuracy of classification (on the training set and on the test set).
For example:

Q(T) = α*size(T) + β*accuracy(T)

where the weights α and β trade tree size against accuracy.
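A hedged sketch of such a quality criterion; the weight values and the sign convention (penalise size, reward accuracy) are assumptions made for illustration:

ALPHA = -0.01  # penalty per node: prefer small, simple trees
BETA = 1.0     # reward for classification accuracy

def quality(size, accuracy):
    # Q(T) = ALPHA*size(T) + BETA*accuracy(T)
    return ALPHA * size + BETA * accuracy

# with equal accuracy, the smaller tree scores higher:
print(quality(size=15, accuracy=0.90))  # ~0.75
print(quality(size=40, accuracy=0.90))  # ~0.5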
Optimal tree – we are given:
- a training set S,
- a set of testing functions TEST,
- a quality criterion Q.
Goal: find a tree T optimising Q(T).
Fact: this is usually an NP-hard problem.
Conclusion: we have to use heuristics.
Building a decision tree:
- top-down method (sketched below):
a. in the beginning the root includes all training examples,
b. we divide them recursively, choosing one attribute at a time;
- bottom-up method: we remove subtrees or edges to improve accuracy when judging new cases.
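A minimal sketch of the top-down method, assuming each training example is a dict of attribute values plus a 'decision' key; the attribute-scoring function is passed in, because ID3 scores attributes with the information gain defined below:

from collections import Counter

def build_tree(examples, attributes, score):
    decisions = [e['decision'] for e in examples]
    # stop: the node is pure, or no attributes are left -> make a leaf
    if len(set(decisions)) == 1 or not attributes:
        return Counter(decisions).most_common(1)[0][0]
    # greedy choice: divide on the best-scoring attribute
    best = max(attributes, key=lambda a: score(examples, a))
    remaining = [a for a in attributes if a != best]
    # one subtree per observed value of the chosen attribute
    children = {}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        children[value] = build_tree(subset, remaining, score)
    return {'attribute': best, 'children': children}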
Entropy – the average number of bits needed to represent a decision d for a randomly chosen object from a given set S. Why? Because an optimal binary representation assigns -log2(p) bits to a decision whose probability is p. We have the formula:

entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn)
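A minimal sketch of the formula in Python; by the usual convention, terms with p = 0 contribute 0:

from math import log2

def entropy(*probabilities):
    # entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn)
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy(0.5, 0.5))    # 1.0 bit: a fair binary decision
print(entropy(0.25, 0.75))  # ~0.811 bits: a skewed decision needs fewer bits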
Information gain:
gain(.) = info before dividing – info after dividing
i.e. the entropy of the decisions in the whole set minus the weighted average entropy in the subsets produced by the division.
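A self-contained sketch of the gain computation for the example format used in the sketch above (dicts with a 'decision' key); this is the score that ID3 plugs into the top-down division:

from collections import Counter
from math import log2

def decision_entropy(examples):
    # entropy of the decision distribution within a set of examples
    counts = Counter(e['decision'] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attribute):
    # info before dividing minus the weighted info after dividing
    before = decision_entropy(examples)
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        after += len(subset) / len(examples) * decision_entropy(subset)
    return before - after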
Overtraining: we say that a model H overfits if there is a model H' such that:
- training_error(H) < training_error(H'),
- testing_error(H) > testing_error(H').
Avoiding overtraining:
- adequate stop criteria (sketched below),
- post-pruning,
- pre-pruning.
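A minimal sketch of such stop criteria, guarding the recursive division from above; the thresholds are illustrative assumptions:

MAX_DEPTH = 5       # illustrative threshold
MIN_EXAMPLES = 10   # illustrative threshold

def should_stop(examples, depth):
    # turn the node into a leaf when it is pure, too small or too deep
    decisions = {e['decision'] for e in examples}
    return (len(decisions) == 1
            or len(examples) < MIN_EXAMPLES
            or depth >= MAX_DEPTH)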
Some decision tree algorithms:
- 1R,
- ID3 (Iterative Dichotomiser 3),
- C4.5 (ID3 + discretization + pruning),
- CART (Classification and Regression Trees),
- CHAID (CHi-squared Automatic Interaction Detection).
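For completeness, a hedged usage sketch with scikit-learn, whose DecisionTreeClassifier implements an optimised variant of CART; criterion='entropy' selects entropy-based division, and max_depth/min_samples_leaf act as pre-pruning stop criteria (the dataset and the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(criterion='entropy',
                             max_depth=3,         # pre-pruning
                             min_samples_leaf=5)  # pre-pruning
clf.fit(X_train, y_train)

# compare training and testing accuracy to watch for overtraining
print('train accuracy:', clf.score(X_train, y_train))
print('test accuracy:', clf.score(X_test, y_test))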