Predictions and classification capabilities Decision - ISKO


By Adeyemo O.O., Adewole A.P., Ogunbiyi T.D., Oni Samson.
ABSTRACT
 Decision tree is a data mining technique that can accurately
classify data and make effective predictions. It has been
successfully employed for data analysis as a comprehensible
knowledge representation in a broad range of fields such as
customer relationship management, engineering, medicine,
agriculture, computational biology, business management and
fraudulent statement detection.
 In this paper, we review research publications that have
explored the accuracy of the prediction and classification
capabilities of decision trees for developing data mining models,
in comparison with several other algorithms in different
application domains. This will give researchers a general
overview of the knowledge gaps in decision tree data mining
algorithms. Data mining takes advantage of the large sets of data
available to carry out prediction and classification activities,
so we used data consisting of records of heart disease patients
gathered over the years, and data mining was performed on them
using a decision tree.
INTRODUCTION
 Decision tree is a classification and prediction tool. It is
widely used because the knowledge discovered from it is
illustrated in a hierarchical structure, which makes it easy to
understand for people who are not experts in data mining.
 It is a predictive modelling technique developed
by Ross Quinlan.
 It is a sequential classifier in the form of a recursive tree
structure. The data set is analyzed by developing a branch-like
structure with an appropriate decision tree algorithm.
 Each internal node of the tree splits into branches based
on the splitting criterion. Each test node denotes a test on an
attribute.
 Each terminal node represents the decision, that is, a class
label. Decision trees can work on both continuous and categorical
attributes (Manpreet Singh et al., 2013).
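The structure described above (test nodes, branches and terminal nodes) can be sketched in a few lines of Python. This is only an illustration: the class names `Node` and `Leaf`, the function `classify` and the attribute names in the toy tree are hypothetical and do not come from the reviewed studies or from the heart disease data.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: holds the decision (class label)."""
    label: str


@dataclass
class Node:
    """Internal node: tests one attribute and has one branch per outcome."""
    attribute: str
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree: Union[Node, Leaf], record: Dict[str, str]) -> str:
    """Follow the branch matching the record's value at each test node."""
    while isinstance(tree, Node):
        tree = tree.branches[record[tree.attribute]]
    return tree.label


# Hypothetical toy tree; the attribute names are illustrative only.
toy_tree = Node("chest_pain", {
    "yes": Node("high_blood_pressure", {
        "yes": Leaf("Present"),
        "no": Leaf("Absent"),
    }),
    "no": Leaf("Absent"),
})

print(classify(toy_tree, {"chest_pain": "yes", "high_blood_pressure": "no"}))
```

Classifying a record is a walk from the root to a terminal node, which is why the resulting model is easy for non-experts to read.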
RESEARCH OBJECTIVES
 To adopt a fast and reliable means of predicting or
detecting heart disease, a disease that has claimed several
lives in Nigeria, Africa and the world at large, so that it
will be possible to eradicate it.
 With the use of a decision-making system that implements a
decision tree (whose predictive capability in heart disease
prediction and some other domains is critically reviewed in
this paper), heart disease could be eradicated or reduced to
a very minimal level in Nigeria.
PROCESSES OF DEVELOPING A DECISION TREE MODEL
 TREE GROWING
The initial stage of creating a decision tree model is tree growing,
which includes two steps: tree merging and tree splitting.
Tree merging: The non-significant predictor categories and the
significant categories within a dataset are grouped together.
Tree splitting: The data are split into different leaves to reduce
the impurities within the model, which increase as the tree grows and
may reduce the accuracy of the model (Mutasem Sh. Alkhasawneh et al.,
2012).
TREE PRUNING
Tree pruning removes irrelevant splitting nodes. Removing them helps
reduce the chance of creating an over-fitted tree. This procedure is
particularly useful because an over-fitted tree model may misclassify
data in real-world applications (Mutasem Sh. Alkhasawneh et al., 2012).
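One way to decide which splitting nodes are irrelevant is to check them against held-out validation records. The sketch below is a simplified variant in the spirit of reduced-error pruning, not the procedure of the cited authors. It assumes a tree stored as nested dictionaries of the form {attribute: {value: subtree_or_label}}, takes the replacement label from the validation records reaching a node (textbook versions usually use the training majority), and leaves nodes that no validation record reaches untouched. All names are hypothetical.

```python
from collections import Counter


def classify(tree, record, default):
    """Walk the nested-dict tree; fall back to default on unseen values."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(record.get(attribute), default)
    return tree


def prune(tree, rows, labels):
    """Bottom-up pruning using the validation rows that reach this node."""
    if not isinstance(tree, dict) or not labels:
        return tree  # a leaf, or no validation evidence to judge the node
    attribute = next(iter(tree))
    majority = Counter(labels).most_common(1)[0][0]
    branches = {}
    for value, subtree in tree[attribute].items():
        keep = [i for i, row in enumerate(rows) if row.get(attribute) == value]
        branches[value] = prune(subtree,
                                [rows[i] for i in keep],
                                [labels[i] for i in keep])
    candidate = {attribute: branches}
    leaf_hits = sum(label == majority for label in labels)
    tree_hits = sum(classify(candidate, row, majority) == label
                    for row, label in zip(rows, labels))
    # Replace the subtree with a leaf when the leaf does at least as well.
    return majority if leaf_hits >= tree_hits else candidate


# Hypothetical example: the split on "sex" does not help on validation data.
tree = {"cp": {"y": {"sex": {"m": "P", "f": "A"}}, "n": "A"}}
val_rows = [{"cp": "y", "sex": "m"}, {"cp": "y", "sex": "f"}, {"cp": "n", "sex": "m"}]
val_labels = ["P", "P", "A"]
print(prune(tree, val_rows, val_labels))
```

In this toy case the lower split is collapsed into a single leaf while the root split, which still improves validation accuracy, is kept.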
TREE SELECTION
The final stage of developing a decision tree model is tree selection. At
this stage, the created decision tree model is evaluated using either
cross-validation or a testing dataset. This stage is essential as it
can reduce the chances of misclassifying data in real-world
applications and, consequently, minimize the cost of developing
further applications (Mutasem Sh. Alkhasawneh et al., 2012).
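The cross-validation mentioned above can be sketched as follows. The code is a minimal k-fold illustration under stated assumptions: `train` and `predict` are hypothetical callables supplied by the caller, the number of records is at least k, and a majority-class learner stands in for a real decision tree learner in the usage lines.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train, predict, rows, labels, k=5, seed=0):
    """Average held-out accuracy over k folds (assumes len(rows) >= k)."""
    scores = []
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train(train_rows, train_labels)
        hits = sum(predict(model, rows[i]) == labels[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


# Stand-in learner (an assumption): always predicts the majority class.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


rows = [{"id": i} for i in range(10)]
labels = ["Absent"] * 10
print(cross_validate(train_majority, predict_majority, rows, labels, k=5))
```

Evaluating on a single testing dataset corresponds to using one held-out fold instead of averaging over all of them.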
DECISION TREE ALGORITHMS
 The different decision tree algorithms are
o ID3
o C4.5
o C5.0
o CHAID
o CART.
ALGORITHM FOR DECISION TREE INDUCTION
BASIC ALGORITHM (A GREEDY ALGORITHM)
- The tree is constructed in a top-down, recursive, divide-and-conquer
manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized
in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
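Information gain, the example measure named above, is the reduction in class entropy obtained by partitioning the examples on an attribute. The sketch below shows the standard calculation; the function names and the toy records are hypothetical and are not taken from the heart disease dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical records: "cp" separates the classes perfectly, "sex" not at all.
rows = [
    {"cp": "y", "sex": "m"},
    {"cp": "y", "sex": "f"},
    {"cp": "n", "sex": "m"},
    {"cp": "n", "sex": "f"},
]
labels = ["Present", "Present", "Absent", "Absent"]
print(information_gain(rows, labels, "cp"), information_gain(rows, labels, "sex"))
```

The greedy algorithm picks the attribute with the largest gain at each node, so "cp" would be chosen as the test attribute for these records.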
CONDITIONS FOR STOPPING PARTITIONING
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority
voting is employed for classifying the leaf
- There are no samples left (Jiawei Han, 2006)
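The three stopping conditions above map directly onto the base cases of a recursive induction routine. The following is a minimal ID3-style sketch, not the authors' implementation. It assumes categorical attributes, a hypothetical `domains` dictionary listing every possible value of each attribute (so that a branch can end up with no samples), and a nested-dictionary tree representation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, attribute):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def id3(rows, labels, attributes, domains, default=None):
    if not labels:                      # no samples left
        return default
    if len(set(labels)) == 1:           # all samples belong to the same class
        return labels[0]
    if not attributes:                  # no attributes left: majority voting
        return majority(labels)
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    fallback = majority(labels)         # label used for empty branches
    branches = {}
    for value in domains[best]:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              remaining, domains, fallback)
    return {best: branches}


# Hypothetical records and domains, for illustration only.
rows = [{"cp": "y", "sex": "m"}, {"cp": "y", "sex": "f"}, {"cp": "y", "sex": "m"},
        {"cp": "n", "sex": "f"}, {"cp": "n", "sex": "m"}]
labels = ["P", "P", "P", "A", "A"]
domains = {"cp": ["y", "n", "unknown"], "sex": ["m", "f"]}
print(id3(rows, labels, ["cp", "sex"], domains))
```

The "unknown" branch has no training samples, so it receives the majority class of its parent node, which is one common way of handling the third condition.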
DECISION TREE APPLICATIONS
 Decision trees have been used to develop models for prediction and
classification in different domains, some of which are:
 Business Management,
 Customer Relationship Management,
 Fraudulent Statement Detection,
 Engineering, Energy Consumption,
 Fault Diagnosis,
 Healthcare Management,
 Agriculture,
as explained in the studies below.
CLASSIFICATION
Decision tree algorithms used for classification in different
domains, independently and in combination with other
algorithms, by different researchers are discussed below:
 Mohd Najwadi Yusoff and Aman Jantan, 2011
proposed the use of a Genetic Algorithm (GA) to optimize a
Decision Tree (DT) for malware classification, in comparison
with current techniques in malware classification. A new
classifier was developed by combining GA with DT and named the
Anti-Malware System (AMS) Classifier in order to classify unique
types of malware. Their results show that the AMS Classifier gives
an accuracy increase of 4.5% to 6.5% over the DT classifier.
 Baisen Zhang and Russ Tillman, 2007
investigated the potential of a decision tree
approach for modelling NFUE (Nitrogen Fertilizer
Use Efficiency) in New Zealand pastures. The
models were correct for 11 of the 16 trials used
for validation, a predictive accuracy of 69%.

 D. Senthil Kumar et al. focused on medical diagnosis by learning
patterns from collected data on diabetes, hepatitis and heart
diseases, in order to develop intelligent medical decision support
systems to help physicians. They proposed the use of the C4.5, ID3
and CART decision tree algorithms to classify these diseases and
compared their effectiveness and correction rates.
 Abolfazl Kazemia et al., 2011 researched the use of the
"CRT", "QUEST", "C5.0" and "CHAID" decision tree algorithms to help
organizations determine the criteria needed for the identification of
potential customers in the competitive environment of their business.
The tree obtained with the C5.0 algorithm provided the best variables
and decision tree, with 83.96% accuracy, which was closest to the field
results used for the comparison, and it performed better in practice.
 Baisen Zhang and Russ Tillman, 2007 investigated the potential of a
decision tree approach for modelling NFUE (Nitrogen Fertilizer Use
Efficiency) in New Zealand pastures. It was concluded that this type
of modelling approach can be used to predict NFUE and thereby to
assist decisions on when and where to apply N fertilizer in pastures,
increasing productivity while reducing the environmental impact.
 Abishek Suresh et al. investigated the application of decision tree
models to the formation of protein homodimer complexes for
molecular catalysis and regulation. The decision tree model produced
positive predictive values (PPV) of 72% for 2S, 58% for 3SMI and 57%
for 3SDI in cross-validation. It was thus concluded that the method
finds application in assigning homodimers with folding mechanism.
 Majoobi, J., 2007 studied the performance of decision tree
classification for the prediction of wave parameters, which are
necessary for many applications in coastal and offshore engineering.
According to the researchers, several prediction models have been
proposed in the literature for this purpose, and decision tree models
were found to give better accuracy.
 Wang Wei, 2012 used a decision tree for image classification,
established on the basis of an analysis of spectral characteristics,
texture characteristics and other auxiliary information such as NDVI,
NDBI and topographic characteristics. The results of the study
indicated that the accuracy of decision tree classification was
4.06% higher than that of maximum likelihood classification and
that the Kappa coefficient increased by 5.61%.
 Kuldeep Kumar et al., 2006 discussed the effectiveness of using
decision trees for classification in mammography. The results obtained
using algorithms based on decision trees were compared with those
produced by a neural network, and the decision tree was reported to
have a higher classification rate.
 Micheal D Twa, 2011 described the application of decision tree
induction, an automated machine learning classification method, to
discriminate between normal and keratoconic corneal shapes in an
objective and quantitative way, with the aim of addressing the
challenge of interpreting the volume and complexity of data produced
during videokeratography examinations. The proposed method was
compared with other known classification methods, and the decision
tree classifier performed equal to or better than the other
classifiers tested.
 Gregor Stiglic et al., 2012 presented an extension to an existing
machine learning environment and a study on visual tuning of decision
tree classifiers. The results demonstrate a significant increase in
accuracy in less complex, visually tuned decision trees. In contrast
to classical machine learning benchmarking datasets, higher accuracy
gains were observed in bioinformatics datasets.
 Peng Du and Ding Xiaoqing, 2008 presented a method based on a
decision tree classifier to identify the gender of a person. The
results of their research show that the performance of the decision
tree classifier is superior to that of the ordinary classifier.
 Felipe Lirra, 2013 developed a decision tree model that indicates
the action range of peptides on the types of microorganisms on which
they can exercise biological activity. The aim was to assist recent
attempts to find effective substitutes to combat infections, which
have been directed at identifying natural antimicrobial peptides in
order to circumvent resistance to commercial antibiotics. The
results of the study showed that the use of decision trees to evaluate
the antimicrobial activity of synthetic peptides enables the creation
of more effective models for use in the development of new drugs.
PREDICTION
Decision tree algorithms used for prediction in different domains,
independently and in combination with other algorithms, by different
researchers are discussed below:
 Jay Gholap, 2013 used attribute selection and boosting
meta-techniques to tune the performance of the J48 decision tree
algorithm on the large amounts of data that are harvested
along with the crops, in order to predict the soil fertility class
with a view to achieving and maintaining appropriate levels of
soil fertility. J48 gave an accuracy of 96.73%, which makes it a
good predictive model for soil fertility in agriculture.
 Mohammad Taha Khan et al., 2012 researched the application of two
decision tree algorithms, C4.5 and C5.0, to breast cancer and heart
disease prediction. On a breast cancer dataset of 400 records, C4.5
showed 5 training errors whereas C5.0 showed only 3. C5.0 produces
rules in a very easily readable form, whereas C4.5 generates the rule
set in the form of a decision tree.
 Yoshikazu Goto et al., 2010 developed a simple and generally
applicable bedside model for predicting outcomes after
out-of-hospital cardiac arrest (OHCA). This simple prediction model
may provide clinicians with a practical bedside tool for the
stratification of OHCA patients in the emergency department.
 Atul Kumar Pandey et al., 2013 compared a pruned J48
decision tree prediction model using the reduced error
pruning approach against simple pruned and unpruned
approaches for classifying heart disease based on clinical
data of patients. They also developed a heart disease
prediction model that can assist medical professionals in
predicting heart disease status based on these clinical
features. From the results obtained, it was found that
fasting blood sugar is the most important attribute, giving
better classification than the other attributes, but it
does not give better accuracy.
 A. R. Senthil Kumar et al., 2013 investigated the performance of soft
computing techniques in modelling qualitative and quantitative water
resource variables such as stream flow. It was found that the REPTree
(decision tree) model performed well compared to other soft computing
techniques such as MLR, ANN, fuzzy logic and M5P.

 B.S. Zhang et al., 2004 applied decision tree models to predict
annual and seasonal pasture production and investigated the interactions
between pasture production and environmental and management factors in
the North Island hill country. The decision tree models for annual,
spring, summer, autumn and winter pasture production correctly predicted
82%, 71%, 90%, 88% and 90% of cases in the model validation.
 Sevgi Zeynep Dogan et al., 2008 compared the performance of three
different decision-tree-based methods of assigning attribute weights
for use in a case-based reasoning (CBR) prediction model. The study
compares the impact of the attribute weights generated by the three
methods and highlights the fact that the prediction rate of models
such as CBR largely depends on the data associated with the parameters
used in the model.
 Bark Cheung Chiu et al., 2013 adopted Input-Output Agent Modelling
(IOAM), an approach to modelling an agent in terms of relationships
between the inputs and outputs of the cognitive system, together with a
leading inductive learning algorithm, C4.5, to build a subtraction skill
modeller, C4.5-IOAM. Experimental results in the domain of modelling
elementary subtraction skills showed that the tree quality and the leaf
quality of a decision path provided valuable references for resolving
contradicting predictions, and that a single-tree model representation
performed nearly as well as the multi-tree model representation.
 Middendorf et al. used alternating decision trees to predict whether an
S. cerevisiae gene would be up- or down-regulated under particular
conditions of transcription regulator expression, given the sequence of
its regulatory region. In addition to good performance in predicting the
expression state of target genes, they were able to identify motifs and
regulators that appear to control the expression of the target genes.
 Lee S. and Park I., 2013 analyzed ground subsidence hazard (GSH) using
factors that can affect ground subsidence and a decision tree approach in
a geographic information system (GIS). The highest accuracy was achieved
by the decision tree model using the CHAID algorithm (94.01%), compared
with the QUEST algorithm (90.37%) and the frequency ratio model (86.70%).
These accuracies are higher than previously reported results for decision
trees. Decision tree methods can therefore be used efficiently for GSH
analysis and might be widely used for prediction of various spatial events.
 Heiko Milde et al., 1999 introduced the MAD system, which generates
decision trees based on a new method for qualitative electrical circuit
analysis. In particular, their new approach to qualitative reasoning
about faults in electrical circuits has reached a level of achievement
at which it can be used to generate diagnosis systems employed in
industry.
 Smitha T. and Dr. V. Sundaram, 2012 studied the application of the ID3
algorithm to build a decision tree model that predicts the chances of
occurrence of disease in an area by identifying the significant
parameters for the prediction process. A prediction accuracy of 95% was
achieved with the decision tree classification model, which led the
researchers to conclude that those suffering from the disease are mostly
female inhabitants with a hereditary history, living in poor
environmental conditions and with an average age greater than 35.
Methodology
 In this research, the decision tree algorithm ID3 (Iterative
Dichotomiser 3) was used. This classification algorithm was
selected because it has the potential to yield good results
in prediction and classification applications.
Heart Disease Data
 A record set with medical attributes was obtained online from a
hospital. With the help of the dataset, the patterns significant to
heart attack prediction were extracted using the developed ID3
data mining model.
The records were split equally into two datasets: a training dataset
and a testing dataset. To avoid bias, the records for each set were
selected randomly. The data include values for the following:
Heart Disease Predictor Interface
The result page shows the result of the prediction, which can be either
heart disease Present or Absent.
Results
 A decision tree is a flowchart-like structure in which each
internal node represents a test on an attribute, each branch
represents an outcome of the test and each leaf node represents
a class label (the decision taken after computing all attributes).
A path from root to leaf represents a classification rule.
 The Java program consists of several packages, but ID3
Logic is the package that does the main work.
 The system has been built into a jar file, which can be launched
by double-clicking it on a system with a Java runtime installed.
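The point that each path from root to leaf is a classification rule can be illustrated with a short sketch. It is written in Python purely for illustration and is not the authors' Java ID3 Logic package. The nested-dictionary tree, the attribute names and the helper names are all hypothetical.

```python
def extract_rules(tree, conditions=()):
    """Return one (conditions, label) pair for every root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf node: emit the finished rule
        return [(list(conditions), tree)]
    attribute = next(iter(tree))            # attribute tested at this node
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    """Render a rule as readable IF ... THEN ... text."""
    if not conditions:
        return f"THEN {label}"
    tests = " AND ".join(f"{attribute} = {value}" for attribute, value in conditions)
    return f"IF {tests} THEN {label}"


# Hypothetical tree; attribute names are illustrative only.
toy_tree = {"chest_pain": {
    "yes": {"high_bp": {"yes": "Present", "no": "Absent"}},
    "no": "Absent",
}}

for conditions, label in extract_rules(toy_tree):
    print(format_rule(conditions, label))
```

Each printed line corresponds to one leaf, so the number of rules equals the number of leaf nodes in the tree.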
CONCLUSION
 Decision trees have been found useful in classification and prediction modelling because
they can accurately discover hidden relationships between variables and are capable of
removing insignificant attributes within a dataset.
 Twenty-one studies published between 1999 and 2014 in more than three application
domains were examined in this research and met the minimum criteria for inclusion
in our literature review.
 The decision tree data mining model developed and employed in this research was used to
predict the existence of heart disease in any diagnosed patient, providing a solution
that helps remove the bottleneck at hospitals. It also gives an idea of the possible
heart disease status of a patient without carrying out a laboratory test, simply by
using the symptoms felt by the patient. Anybody can make use of the system, since
training of the system is required just once for any particular data set.