Dept. Electronic Computer Engineering. University Of Seoul.
Visual Communication Lab.
YaDT (Yet another Decision Tree builder)
Ah Young Shin
[email protected]
Visual Communication Lab.
1. Introduction
• YaDT is a from-scratch, main-memory implementation of a C4.5-like decision tree algorithm.
• The algorithm family was extended in the order ID3 (entropy-based information gain) → C4.5 (gain ratio) → C5.0.
• Unfortunately, C4.5 (and EC4.5) are implemented in old-style K&R C code.
• The sources are therefore hard to understand, profile and extend.
• Experimental results are reported comparing YaDT with Weka, dti and (E)C4.5.
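The entropy and information gain measures named on this slide can be sketched in a few lines. This is a minimal illustration, not YaDT code: the function names are assumptions, and the counts are the well-known textbook PlayTennis figures (9 "yes", 5 "no", split on Outlook). The gain ratio, the default splitting criterion of C4.5, is included for comparison.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a vector of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder


def gain_ratio(parent_counts, child_counts):
    """Information gain normalised by the entropy of the split itself."""
    split_info = entropy([sum(child) for child in child_counts])
    if split_info == 0:  # all cases fall into one child: the split is useless
        return 0.0
    return information_gain(parent_counts, child_counts) / split_info


# Textbook PlayTennis counts as [yes, no]
parent = [9, 5]
by_outlook = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rain

print(round(entropy(parent), 3))                       # 0.94
print(round(information_gain(parent, by_outlook), 3))  # 0.247
print(round(gain_ratio(parent, by_outlook), 3))        # 0.156
```

The gain ratio penalises attributes that split the cases into many small subsets, which plain information gain tends to favour.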
1. Introduction - C4.5
• C4.5 addresses the following issues:
① Handling continuous (numeric) attributes
② Excluding irrelevant attributes
③ How deeply to grow the decision tree
④ Handling missing attribute values
⑤ Handling attributes with different costs
⑥ Improving computational efficiency
2. Meta data representation
• Each attribute has one of the following attribute types: discrete, continuous, weights or class.
• The values of an attribute in a case belong to a data type such as integer, float, double or string. The special value ‘?’ or NULL marks an unknown value.
2. Meta data representation
• Summarizing, the YaDT meta data describing the training set TS can be structured as a table with columns: attribute name, data type and attribute type.
• Such a table can be provided as a database table or as a text file.
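As an illustration of the three-column meta data table, the sketch below reads it from a text file body. The comma-separated layout, the `AttributeMeta` class, the `parse_meta` function and the PlayTennis attribute rows are all assumptions made for this example; the slides do not specify the exact file syntax.

```python
import csv
from dataclasses import dataclass
from io import StringIO

# Data types and attribute types listed on the meta data slide
DATA_TYPES = {"integer", "float", "double", "string"}
ATTRIBUTE_TYPES = {"discrete", "continuous", "weights", "class"}


@dataclass(frozen=True)
class AttributeMeta:
    name: str
    data_type: str
    attribute_type: str


def parse_meta(text):
    """Parse lines of the (assumed) form 'name,data type,attribute type'."""
    rows = []
    for record in csv.reader(StringIO(text)):
        if not record:  # skip blank lines
            continue
        name, data_type, attribute_type = (field.strip() for field in record)
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        if attribute_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown attribute type: {attribute_type}")
        rows.append(AttributeMeta(name, data_type, attribute_type))
    return rows


# Hypothetical meta data for a PlayTennis training set
PLAY_TENNIS_META = """\
outlook,string,discrete
temperature,integer,continuous
humidity,integer,continuous
windy,string,discrete
play,string,class
"""

meta = parse_meta(PLAY_TENNIS_META)
for attribute in meta:
    print(attribute.name, attribute.data_type, attribute.attribute_type)
```

The same three fields could equally be read from a database table, with one row per attribute.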
3. Data representation
• Example: training data for PlayTennis may include the following case:
• C4.5 models an attribute value by a union structure to distinguish discrete from continuous attributes.
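The union idea can be mimicked with `ctypes.Union`, which overlays its members in the same storage as a C union does. This is only a sketch: the member names, the value table and the sample case are hypothetical and are not copied from the C4.5 sources.

```python
import ctypes


class AttValue(ctypes.Union):
    """One storage slot, read either as a discrete index or as a float."""
    _fields_ = [
        ("discr_val", ctypes.c_short),  # index into the attribute's list of discrete values
        ("cont_val", ctypes.c_float),   # value of a continuous attribute
    ]


OUTLOOK_VALUES = ["sunny", "overcast", "rain"]  # hypothetical value table

outlook = AttValue()
outlook.discr_val = OUTLOOK_VALUES.index("overcast")

humidity = AttValue()
humidity.cont_val = 72.0

# The union does not record which member is valid; the attribute type
# from the meta data decides how each slot is read.
case = [("discrete", outlook), ("continuous", humidity)]
decoded = [
    OUTLOOK_VALUES[value.discr_val] if kind == "discrete" else value.cont_val
    for kind, value in case
]
print(decoded)
print(ctypes.sizeof(AttValue))  # no larger than its widest member
```

Because both members share the same bytes, a case costs one slot per attribute regardless of its type, at the price of relying on the meta data to interpret each slot.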
4.1 YaDT optimizations
• All the strategies implement several optimizations, mainly related to the efficient computation of information gain.
① The first strategy computes the local threshold using the algorithm of C4.5, which sorts the cases with the quicksort method.
② The second strategy also uses the algorithm of C4.5, but adopts a counting sort method.
⇒ The strategy to adopt is selected according to an analytic comparison of their efficiency.
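The second strategy can be sketched as follows: cases are ordered by a counting sort over the ids of the distinct attribute values, and the sorted order is then scanned once to find the threshold with the highest information gain. This is a simplified illustration under stated assumptions, not the YaDT implementation: the function names are invented, the distinct values are sorted inside the function (a real builder would presumably prepare them once for the whole training set), and the reported threshold is simply the lower of the two neighbouring values.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a vector of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def counting_sort_cases(value_ids, n_values):
    """Return case indexes ordered by value id in O(n + n_values) time (stable)."""
    counts = [0] * (n_values + 1)
    for v in value_ids:
        counts[v + 1] += 1
    for i in range(n_values):
        counts[i + 1] += counts[i]
    # counts[v] now holds the first output position for value id v
    order = [0] * len(value_ids)
    for case, v in enumerate(value_ids):
        order[counts[v]] = case
        counts[v] += 1
    return order


def best_threshold(values, labels, n_classes):
    """Scan the sorted cases once and return (threshold, information gain)."""
    distinct = sorted(set(values))
    value_id = {v: i for i, v in enumerate(distinct)}
    order = counting_sort_cases([value_id[v] for v in values], len(distinct))

    total = [0] * n_classes
    for y in labels:
        total[y] += 1
    base, n = entropy(total), len(values)

    left = [0] * n_classes
    best_gain, best_thr = 0.0, None
    for pos, case in enumerate(order[:-1]):
        left[labels[case]] += 1
        if values[case] == values[order[pos + 1]]:
            continue  # a threshold can only lie between two different values
        right = [t - l for t, l in zip(total, left)]
        n_left = pos + 1
        gain = base - n_left / n * entropy(left) - (n - n_left) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_thr = gain, values[case]
    return best_thr, best_gain


# Toy temperature data; class 0 = "no", class 1 = "yes"
values = [72, 40, 90, 60, 48, 80]
labels = [1, 0, 0, 1, 0, 1]
print(best_threshold(values, labels, n_classes=2))
```

Counting sort pays off when a node holds many cases but the attribute has few distinct values; with few cases and many distinct values, a comparison sort such as quicksort is the cheaper choice, which is why a per-node selection between the two makes sense.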
4.1 YaDT optimizations
• After splitting a node, a (weighted) subset of cases is “pushed down” to each child node (pushed down = LIFO).
• YaDT builds a weighted array for each node.
• The depth-first strategy is slightly faster, since the following optimization can be implemented.
• The breadth-first strategy has better memory occupation, since it only needs to maintain arrays of weights and case indexes for a total of at most 2∙|TS| elements. YaDT adopts this strategy.
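The “push down” step can be pictured as follows. Each node holds an array of case indexes and a parallel array of weights; a case whose split attribute is unknown is sent to every child with a fractional weight, proportional to the weight of the known cases reaching that child, as C4.5 does for missing values. The function name `push_down`, its signature and the toy data are assumptions for this sketch, not YaDT's actual interface.

```python
def push_down(indexes, weights, branch_of, n_children):
    """Distribute a node's weighted cases among its children.

    branch_of(case) returns the child number for a case, or None when the
    value of the split attribute is unknown. Assumes at least one case has
    a known value, otherwise the node would not have been split.
    """
    known = [0.0] * n_children
    for case, weight in zip(indexes, weights):
        branch = branch_of(case)
        if branch is not None:
            known[branch] += weight
    total_known = sum(known)

    children = [([], []) for _ in range(n_children)]  # (indexes, weights) per child
    for case, weight in zip(indexes, weights):
        branch = branch_of(case)
        if branch is not None:
            children[branch][0].append(case)
            children[branch][1].append(weight)
            continue
        for child, known_weight in enumerate(known):
            share = weight * known_weight / total_known
            if share > 0:  # skip children that received no known case
                children[child][0].append(case)
                children[child][1].append(share)
    return children


# Toy node: four cases with unit weight, split on outlook (one value unknown)
outlook = ["sunny", "rain", None, "sunny"]
branch_number = {"sunny": 0, "rain": 1}
children = push_down(
    indexes=[0, 1, 2, 3],
    weights=[1.0, 1.0, 1.0, 1.0],
    branch_of=lambda case: branch_number.get(outlook[case]),
    n_children=2,
)
for child_indexes, child_weights in children:
    print(child_indexes, [round(w, 3) for w in child_weights])
```

The total weight is preserved by the split, so the children of the toy node together still carry a weight of 4.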
4.2 Some experiments on efficiency
• TS name: the name of the training set
• |TS|: the number of cases
• NC: the number of class values
5. YaDT version 1.2.5
6. Conclusion
• A structured object-oriented programming implementation
• Portable code over Windows (Visual Studio) and Linux (gcc)
• 32-bit and 64-bit executables
• A documented C++ library of classes
• Compressed binary output/input of trees
• A command-line tree builder and a Java GUI