A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009
Adviser: Yu-Chiang Li
Speaker: Gung-Shian Lin
Date: 2010/07/20
南台科技大學 資訊工程系

Outline
1 Introduction
2 Related work
3 The proposed algorithm
4 Performance evaluation
5 Conclusion

1. Introduction
When the label is a continuous variable in the data, two possible approaches based on existing decision tree algorithms can be used to handle the situation. The first uses a data discretization method in the preprocessing stage to convert the continuous label into a class label defined by a finite set of nonoverlapping intervals, and then applies a decision tree algorithm. The second simply applies a regression tree algorithm, using the continuous label directly.

1. Introduction
We propose an algorithm that dynamically discretizes the continuous label at each node during the tree induction process. The proposed algorithm has the following two important features:
- The algorithm dynamically performs discretization based on the data associated with the node in the process of constructing the tree.
- The algorithm can also produce the mean, median, and other statistics for each leaf node as part of its output.
(A minimal sketch of this per-node discretization idea appears after the conclusion.)

2. Related work
Main types of decision tree algorithms:
- Data discretization method
  Drawback: may not provide a good fit for the data.
- Regression tree algorithm
  Drawback: the size of a regression tree is usually large, and its results are often not accurate.

2. Related work
Data discretization methods (used with C4.5):
- equal width method
- equal depth method
- clustering method
- Monothetic Contrast Criterions (MCCs)
- 3-4-5 partition method
(A sketch of the equal width and equal depth methods appears after the conclusion.)
Regression tree algorithm:
- Classification and Regression Trees (CART)

3. The proposed algorithm
The main steps of the algorithm are outlined as follows: [step listing shown as a figure in the slides; not reproduced in this transcript]

3. The proposed algorithm
We rewrite steps 6 and 7 into the following more detailed steps: [detailed steps shown as a figure in the slides]

3. The proposed algorithm
We use three sections to explain the following key steps in the algorithm: determining the nonoverlapping intervals, computing the goodness value, and stopping tree growing.

3. The proposed algorithm
Determining the Nonoverlapping Intervals
Set Ci, ±16:
C5: 40 - 16 = 24 and 40 + 16 = 56
C8: 65 - 16 = 49 and 65 + 16 = 81
Neighboring range: C1: 33, C2: 28, C3: 27, C4: 28, C5: 10, C6: 11, C7: 24, C8: 29, C9: 35, C10: 27, C11: 28

3. The proposed algorithm
Computing the Goodness Value [formula and worked example shown as figures in the slides]

3. The proposed algorithm
Stopping Tree Growing [stopping criteria shown as a figure in the slides]

4. Performance evaluation
First Experiment: Comparing CLC and Approach 1 [result tables shown as figures in the slides]

4. Performance evaluation
Second Experiment: CLC and Regression Trees

4. Performance evaluation
Third Experiment: Supplementary Comparisons

5. Conclusion
Extensive numerical experiments have been performed to evaluate the proposed algorithm. The results confirm the efficiency and accuracy of the proposed algorithm.
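
The related-work slide names the equal width and equal depth methods without showing them. The following is a minimal sketch, not taken from the paper, of how these two static discretization schemes could convert a continuous label y into k class intervals before an ordinary decision tree algorithm is applied (Approach 1 in the introduction); the function names, the example data, and the choice of k = 3 are illustrative.

import numpy as np

def equal_width(y, k):
    # Split the range of y into k intervals of identical width.
    edges = np.linspace(y.min(), y.max(), k + 1)
    # Interior edges only; np.digitize then yields bin indices 0..k-1.
    return np.digitize(y, edges[1:-1], right=True)

def equal_depth(y, k):
    # Split y into k intervals holding roughly the same number of records.
    edges = np.quantile(y, np.linspace(0.0, 1.0, k + 1))
    return np.digitize(y, edges[1:-1], right=True)

# Purely illustrative label values (not data from the paper).
y = np.array([3.0, 5.5, 7.2, 8.0, 9.1, 20.0, 21.5, 22.0, 40.0])
print(equal_width(y, 3))   # -> [0 0 0 0 0 1 1 1 2]; width-based bins leave the last bin sparse
print(equal_depth(y, 3))   # -> [0 0 0 1 1 1 2 2 2]; each bin holds three records

The contrast between the two outputs illustrates the drawback noted on the related-work slide: a fixed, data-independent partition may not fit the label distribution well.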
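
The slides describe the proposed algorithm's key idea, discretizing the continuous label from the records that reach each node rather than once in preprocessing, but the step listing, interval construction, and goodness value appear only in figures that the transcript does not reproduce. The sketch below is therefore an illustration of the general idea, not the paper's CLC procedure: it substitutes equal-width intervals for the paper's neighboring-range construction and information gain for the paper's goodness value, and it reports the mean and median at each leaf as the slides describe. All names and thresholds are placeholders.

import numpy as np

def entropy(classes):
    # Shannon entropy of the label-interval assignments at a node.
    _, counts = np.unique(classes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def grow(X, y, depth=0, max_depth=3, min_size=5, k=3):
    # Leaf: report summary statistics of the continuous label, as the
    # proposed algorithm does for each leaf node.
    if len(y) < min_size or depth >= max_depth or y.min() == y.max():
        return {"leaf": True, "mean": float(y.mean()), "median": float(np.median(y))}
    # Dynamic discretization: build label intervals from this node's records
    # only (placeholder: equal-width bins instead of the paper's intervals).
    edges = np.linspace(y.min(), y.max(), k + 1)
    classes = np.digitize(y, edges[1:-1], right=True)
    base = entropy(classes)
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            # Placeholder goodness value: information gain of the split.
            gain = base - (left.mean() * entropy(classes[left])
                           + (~left).mean() * entropy(classes[~left]))
            if best is None or gain > best[0]:
                best = (gain, j, float(t))
    # Stop growing when no split improves the (placeholder) goodness value.
    if best is None or best[0] <= 0:
        return {"leaf": True, "mean": float(y.mean()), "median": float(np.median(y))}
    _, j, t = best
    left = X[:, j] <= t
    return {"leaf": False, "attribute": j, "threshold": t,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_size, k),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size, k)}

# Example usage on synthetic data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=200)
tree = grow(X, y)

Because the label intervals are rebuilt from each node's own records, deeper nodes partition a narrower label range more finely, which is the behavior the slides contrast with a single discretization performed in preprocessing.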