A Dynamic Discretization Approach
for Constructing Decision Trees
with a Continuous Label
Adviser: Yu-Chiang Li
Speaker: Gung-Shian Lin
Date: 2010/07/20
IEEE TRANSACTIONS ON
KNOWLEDGE AND DATA
ENGINEERING, VOL. 21, NO. 11,
NOVEMBER 2009
Southern Taiwan University of Science and Technology
Department of Computer Science and Information Engineering
Outline
1. Introduction
2. Related work
3. The proposed algorithm
4. Performance evaluation
5. Conclusion
1. Introduction
When the label in the data is a continuous variable, two
possible approaches based on existing decision tree
algorithms can be used to handle the situation.
The first uses a data discretization method in the
preprocessing stage to convert the continuous label into a
class label defined by a finite set of nonoverlapping
intervals and then applies a decision tree algorithm.
The second simply applies a regression tree algorithm,
using the continuous label directly.
We propose an algorithm that dynamically discretizes
the continuous label at each node during the tree
induction process. The proposed algorithm has the
following two important features:
• The algorithm dynamically performs discretization based on
the data associated with the node in the process of
constructing a tree.
• The algorithm can also produce the mean, median, and other
statistics for each leaf node as part of its output.
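The second feature can be made concrete with a small sketch: a helper that summarizes the continuous labels reaching one leaf. The function name and the returned fields are illustrative, not taken from the paper.

```python
import statistics

def leaf_summary(labels):
    """Summarize the continuous label values that fall into one leaf node.

    `labels` is the list of label values associated with the leaf
    (an illustrative helper, not the paper's exact output format).
    """
    return {
        "count": len(labels),
        "mean": statistics.mean(labels),
        "median": statistics.median(labels),
        "stdev": statistics.stdev(labels) if len(labels) > 1 else 0.0,
    }

# Example: labels reaching a hypothetical leaf
print(leaf_summary([24, 26, 30, 31, 40]))
```

Reporting such statistics lets a user read a predicted value (mean or median) directly off each leaf, instead of a class label.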
2. Related work
Main types of DT algorithms
Data discretization method
• Drawback: may not provide a good fit for the data.
Regression tree algorithm
• Drawback: the size of a regression tree is usually large, and its
results are often not accurate.
Data discretization method (C4.5)
• equal width method
• equal depth method
• clustering method
• Monothetic Contrast Criterions (MCCs)
• 3-4-5 partition method
Regression tree algorithm
• Classification and Regression Trees (CARTs)
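The first two discretization methods can be sketched in a few lines. This is a minimal illustration assuming cut points are taken between consecutive intervals; the function names are ours, not C4.5's.

```python
def equal_width(values, k):
    """Equal width method: split the label range into k intervals
    of equal width, returning the k-1 interior cut points."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]

def equal_depth(values, k):
    """Equal depth method: split the sorted labels into k groups of
    (roughly) equal size, returning the boundary values."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]

vals = [10, 12, 13, 15, 40, 41, 43, 90]
print(equal_width(vals, 4))  # [30.0, 50.0, 70.0]
print(equal_depth(vals, 4))  # [13, 40, 43]
```

Note how equal width ignores where the data actually clusters, while equal depth ignores the gaps in the range; both are fixed in the preprocessing stage, which is the drawback the proposed algorithm addresses by discretizing per node.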
3. The proposed algorithm
The main steps of the algorithm are outlined in the paper's pseudocode.
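As a rough sketch of node-by-node induction with node-local label discretization: the code below substitutes an equal-width discretizer, an information-gain goodness measure, and simple stopping rules for the paper's exact definitions, which are not reproduced here. Only the overall shape (discretize at each node, split, recurse, summarize leaves) follows the slides.

```python
import collections
import math
import statistics

def discretize(labels, k=2):
    """Node-local stand-in: equal-width intervals over this node's labels
    (the paper derives intervals from the data itself; this simplifies)."""
    lo, hi = min(labels), max(labels)
    if hi == lo:
        return [0] * len(labels)
    return [min(int((y - lo) / (hi - lo) * k), k - 1) for y in labels]

def entropy(classes):
    n = len(classes)
    return -sum(c / n * math.log2(c / n)
                for c in collections.Counter(classes).values())

def build_tree(rows, labels, min_size=4, k=2):
    # Dynamically discretize the continuous label at THIS node.
    classes = discretize(labels, k)
    # Stop growing: node too small, or all labels fall in one interval.
    if len(rows) < min_size or len(set(classes)) == 1:
        return {"leaf": True, "n": len(labels),
                "mean": statistics.mean(labels),
                "median": statistics.median(labels)}
    best = None
    for f in range(len(rows[0])):                      # candidate feature
        for t in sorted({r[f] for r in rows})[:-1]:    # candidate threshold
            left = [c for r, c in zip(rows, classes) if r[f] <= t]
            right = [c for r, c in zip(rows, classes) if r[f] > t]
            # Goodness stand-in: information gain over node-local classes.
            g = (entropy(classes)
                 - len(left) / len(classes) * entropy(left)
                 - len(right) / len(classes) * entropy(right))
            if best is None or g > best[0]:
                best = (g, f, t)
    if best is None:
        return {"leaf": True, "n": len(labels),
                "mean": statistics.mean(labels),
                "median": statistics.median(labels)}
    _, f, t = best
    l = [i for i, r in enumerate(rows) if r[f] <= t]
    r = [i for i, r in enumerate(rows) if r[f] > t]
    return {"leaf": False, "feature": f, "threshold": t,
            "left": build_tree([rows[i] for i in l],
                               [labels[i] for i in l], min_size, k),
            "right": build_tree([rows[i] for i in r],
                                [labels[i] for i in r], min_size, k)}

tree = build_tree([[1], [2], [3], [10], [11], [12]],
                  [5.0, 6.0, 5.5, 40.0, 41.0, 42.0], min_size=2)
print(tree["feature"], tree["threshold"])
```

The key difference from preprocessing-based discretization is that each recursive call re-discretizes using only the labels that reach that node.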
We rewrite steps 6 and 7 into more detailed steps.
We use three sections to explain the following key
steps in the algorithm: determining nonoverlapping
intervals, computing the goodness value, and stopping
tree growing.
Determining Nonoverlapping Intervals
Set Ci ± 16:
C5: 40 − 16 = 24 and 40 + 16 = 56
C8: 65 − 16 = 49 and 65 + 16 = 81
Neighboring range: C1: 33, C2: 28, C3: 27, C4: 28, C5: 10,
C6: 11, C7: 24, C8: 29, C9: 35, C10: 27, C11: 28
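The ± 16 construction in the example can be sketched directly. The interval helper reproduces the slide's C5 and C8 calculations; the merging rule below is a generic overlap merge into nonoverlapping intervals, which may differ from the paper's exact rule.

```python
def interval(c, s=16):
    """Form the interval c - s .. c + s around a label value
    (s = 16 in the slide's example)."""
    return (c - s, c + s)

def merge(intervals):
    """Merge overlapping intervals into nonoverlapping ones
    (generic sketch; the paper's merging rule may differ)."""
    out = []
    for lo, hi in sorted(intervals):
        if out and lo <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out

print(interval(40))  # (24, 56), matching the slide's C5 example
print(interval(65))  # (49, 81), matching the slide's C8 example
print(merge([interval(40), interval(65)]))  # overlap at 49..56
```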
Computing the Goodness Value
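A common way to score a candidate split over a node's discretized label classes is entropy reduction. This is a hedged stand-in for the paper's goodness value, whose exact formula is not reproduced here.

```python
import collections
import math

def entropy(classes):
    """Shannon entropy of a list of discrete class labels."""
    n = len(classes)
    return -sum(c / n * math.log2(c / n)
                for c in collections.Counter(classes).values())

def split_goodness(parent, left, right):
    """Goodness of a split as the weighted entropy reduction over the
    node's discretized label classes (an illustrative stand-in, not
    necessarily the paper's measure)."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# A perfect split separates the two label intervals entirely:
print(split_goodness([0, 0, 1, 1], [0, 0], [1, 1]))  # 1.0
```

The split with the highest goodness value over the node-local classes is the one chosen to grow the tree.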
Stopping Tree Growing
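As an illustration of when tree growing might stop: a node can be made a leaf when it has too few examples or its continuous labels barely vary. The predicate and its thresholds below are placeholders, not the paper's criteria.

```python
import statistics

def should_stop(labels, min_size=4, min_spread=1e-6):
    """Illustrative stopping rule: become a leaf when the node is too
    small or its continuous labels are (nearly) constant.
    Thresholds are placeholders, not the paper's."""
    return len(labels) < min_size or statistics.pstdev(labels) < min_spread

print(should_stop([5.0, 5.0, 5.0, 5.0]))    # True: labels do not vary
print(should_stop([5.0, 9.0, 14.0, 40.0]))  # False: keep splitting
```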
4. Performance evaluation
First Experiment: Comparing CLC and Approach 1
Second Experiment: CLC and Regression Trees
Third Experiment: Supplementary Comparisons
5. Conclusion
Extensive numerical experiments were performed to
evaluate the proposed algorithm. The results confirm
the efficiency and accuracy of the proposed algorithm.