Weka Tutorial

WEKA:: Introduction
• A collection of open source ML algorithms
– pre-processing
– classifiers
– clustering
– association rules
• Created by researchers at the University of
Waikato in New Zealand
• Java based
WEKA:: Installation
• Download software from
http://www.cs.waikato.ac.nz/ml/weka/
– If you are interested in modifying or extending Weka,
there is a developer version that includes the
source code
• Set the Weka environment variables for Java (csh syntax)
– setenv WEKAHOME /usr/local/weka/weka-3-6-1
– setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH
• Download some ML data from
http://mlearn.ics.uci.edu/MLRepository.html
WEKA:: Introduction (contd.)
• Routines are implemented as classes and
logically arranged in packages
• Comes with an extensive GUI
– Weka routines can also be used standalone via the
command line
• E.g. java weka.classifiers.trees.J48 -t
$WEKAHOME/data/iris.arff
WEKA:: Interface
WEKA:: Data format
• Uses flat text files to describe the data
• Can work with a wide variety of data files including its own
“.arff” format and C4.5 file formats
• Data can be imported from a file in various formats:
– ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL database
(using JDBC)
WEKA:: ARFF file format
@relation anneal
@attribute carbon
@attribute hardness
@attribute 'enamelability' {'?','1','2','3','4','5'}
@attribute cholesterol numeric
@attribute shape { COIL, SHEET}
@attribute class {'1','2','3','4','5','U'}
@data
...
'?','C','A',0,60,'T','?','?',0,'?','?','G','?','?','?','?','M','?','?','?','?','?','?','?','?','?','?','?','?','?','?','COIL',2.801,385.1,0,'?','0','?','3'
'?','C','A',0,60,'T','?','?',0,'?','?','G','?','?','?','?','B','Y','?','?','?','Y','?','?','?','?','?','?','?','?','?','SHEET',0.801,255,269,'?','0','?','3'
'?','C','A',0,45,'?','S','?',0,'?','?','D','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','COIL',1.6,610,0,'?','0','?','3'
...
A more thorough description is available at
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
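To make the header layout concrete, here is a minimal sketch in plain Java (no Weka classes) that reads the @attribute declarations of an ARFF file. The class and method names are hypothetical, and the parsing is deliberately simplified: it covers only names and raw type text, not the full ARFF grammar.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; this is not part of Weka.
class ArffHeaderSketch {

    /** Maps each declared attribute name to its raw type text, in declaration order. */
    static Map<String, String> parseAttributes(List<String> lines) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("@data")) {
                break; // the header ends where the data section begins
            }
            if (!lower.startsWith("@attribute")) {
                continue; // skip @relation, comments and blank lines
            }
            String rest = line.substring("@attribute".length()).trim();
            int nameEnd; // index just past the attribute name
            String name;
            if (rest.startsWith("'") && rest.indexOf('\'', 1) > 0) {
                nameEnd = rest.indexOf('\'', 1) + 1;
                name = rest.substring(1, nameEnd - 1); // drop the surrounding quotes
            } else {
                nameEnd = 0;
                while (nameEnd < rest.length() && !Character.isWhitespace(rest.charAt(nameEnd))) {
                    nameEnd++;
                }
                name = rest.substring(0, nameEnd);
            }
            attributes.put(name, rest.substring(nameEnd).trim());
        }
        return attributes;
    }

    public static void main(String[] args) {
        List<String> header = List.of(
            "@relation anneal",
            "@attribute 'enamelability' {'?','1','2','3','4','5'}",
            "@attribute cholesterol numeric",
            "@attribute shape { COIL, SHEET}",
            "@data");
        parseAttributes(header).forEach((name, type) -> System.out.println(name + " -> " + type));
    }
}
```

A nominal attribute keeps its value list (the text in braces) as its type, while a numeric attribute maps to the keyword numeric.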
WEKA:: Explorer: Preprocessing
• Pre-processing tools in WEKA are
called “filters”
• WEKA contains filters for:
– Discretization, normalization,
resampling, attribute selection,
transforming, combining attributes, etc
Annealing dataset : Description
• The Annealing dataset is from the UCI repository of datasets.
It contains information about steel being annealed and its
various properties.
• The dataset has 38 attributes, of which 6 are
continuous, 3 are integer valued and the remaining 29 are
nominal.
• The dataset contains missing values and has
798 records in total, with 6 major classes.
• The notion of classes will be explained later, during
classification.
Data Cleaning: Removing missing values
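The screenshot for this slide is not reproduced, so the exact filter and settings are unknown. As a rough sketch of one common way to eliminate missing values, the plain-Java example below assumes mean imputation for a numeric attribute, with missing entries encoded as NaN. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration; this is not part of Weka.
class MissingValueSketch {

    /** Returns a copy of the column with NaN entries replaced by the mean of the observed values. */
    static double[] replaceMissingWithMean(double[] column) {
        double sum = 0.0;
        int observed = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        // If every entry is missing there is nothing to impute from, so NaN is kept.
        double mean = observed == 0 ? Double.NaN : sum / observed;
        double[] result = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = Double.isNaN(column[i]) ? mean : column[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] filled = replaceMissingWithMean(new double[] {1.0, Double.NaN, 3.0});
        System.out.println(java.util.Arrays.toString(filled));
    }
}
```

For nominal attributes the analogous choice would be the most frequent value instead of the mean.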
Data Cleaning: Removing useless attributes
Earlier 38 attributes, now 32
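The slide does not state what makes an attribute "useless". A minimal sketch, assuming it means an attribute that takes the same value in every record (for example, a column that is entirely '?'), could look like this in plain Java. The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; this is not part of Weka.
class UselessAttributeSketch {

    /** Returns the indices of columns that take more than one distinct value. */
    static List<Integer> informativeColumns(String[][] rows) {
        List<Integer> keep = new ArrayList<>();
        if (rows.length == 0) {
            return keep;
        }
        for (int col = 0; col < rows[0].length; col++) {
            Set<String> distinct = new HashSet<>();
            for (String[] row : rows) {
                distinct.add(row[col]);
            }
            if (distinct.size() > 1) {
                keep.add(col); // a constant column carries no information for learning
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"?", "C", "0"},
            {"?", "A", "0"},
            {"?", "C", "1"},
        };
        System.out.println(informativeColumns(rows)); // column 0 is constant and is dropped
    }
}
```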
Data transformation: Discretizing the attributes
• Discretization into 15 bins
• "first-last" means all attributes
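To illustrate what discretizing into 15 bins does to a numeric attribute, here is a minimal plain-Java sketch. It assumes equal-width binning over the observed range, which may differ from the settings in the screenshot. The names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; this is not part of Weka.
class EqualWidthDiscretizerSketch {

    /** Maps each value to a bin index in [0, bins - 1] using equal-width intervals over [min, max]. */
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            if (width == 0.0) {
                result[i] = 0; // all values are equal, so a single bin suffices
                continue;
            }
            int bin = (int) ((values[i] - min) / width);
            result[i] = Math.min(bin, bins - 1); // the maximum value falls into the last bin
        }
        return result;
    }

    public static void main(String[] args) {
        double[] hardness = {0.0, 15.0, 30.0};
        System.out.println(Arrays.toString(discretize(hardness, 15)));
    }
}
```

Each numeric value is replaced by the index of the interval it falls into, so the attribute becomes nominal with at most 15 distinct values.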
Data reduction: Supervised attribute selection
Reducing data size from 32 to 10 attributes
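The slide does not name the evaluator or search method used. As one possible illustration of supervised attribute selection, the plain-Java sketch below ranks nominal attributes by information gain with respect to the class and keeps the top k. This is an assumed technique, not necessarily the one in the screenshot, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; this is not part of Weka.
class InfoGainSelectionSketch {

    /** Shannon entropy (in bits) of a collection of label counts that sum to total. */
    static double entropy(Iterable<Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Information gain of one nominal attribute with respect to the class labels. */
    static double infoGain(String[] attribute, String[] labels) {
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> byValue = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            classCounts.merge(labels[i], 1, Integer::sum);
            byValue.computeIfAbsent(attribute[i], k -> new HashMap<>())
                   .merge(labels[i], 1, Integer::sum);
        }
        double gain = entropy(classCounts.values(), labels.length);
        for (Map<String, Integer> subset : byValue.values()) {
            int size = 0;
            for (int c : subset.values()) {
                size += c;
            }
            // Subtract the class entropy within each attribute value, weighted by its share.
            gain -= ((double) size / labels.length) * entropy(subset.values(), size);
        }
        return gain;
    }

    /** Indices of the k attributes (one array per attribute) with the highest gain, best first. */
    static List<Integer> topK(String[][] columns, String[] labels, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < columns.length; i++) {
            indices.add(i);
        }
        indices.sort(Comparator.comparingDouble((Integer i) -> -infoGain(columns[i], labels)));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        String[] labels = {"1", "1", "3", "3"};
        String[][] columns = {
            {"p", "q", "p", "q"},               // unrelated to the class
            {"COIL", "COIL", "SHEET", "SHEET"}, // determines the class
        };
        System.out.println(topK(columns, labels, 1));
    }
}
```

With 32 attributes and k = 10, the same call would keep the ten attributes that say the most about the class.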
Viewing and understanding the transformed data
• This can be done using the ARFF viewer option
in Weka.
• It also allows us to save files in other formats,
such as CSV. Options for converting ARFF to CSV
and vice versa are also available.
• After this conversion, such files can easily be
imported into MySQL and other databases.
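The ARFF-to-CSV conversion mentioned above amounts to turning the attribute declarations into a header row and copying the data rows. A minimal plain-Java sketch of that idea follows. It is not Weka's converter, the names are hypothetical, and it assumes attribute names without spaces and leaves the quoting of data values untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; this is not part of Weka.
class ArffToCsvSketch {

    /** Converts ARFF lines to CSV lines: one header row of attribute names, then the data rows unchanged. */
    static List<String> toCsv(List<String> arffLines) {
        List<String> names = new ArrayList<>();
        List<String> csv = new ArrayList<>();
        boolean inData = false;
        for (String raw : arffLines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // blank line or ARFF comment
            }
            String lower = line.toLowerCase();
            if (inData) {
                csv.add(line);
            } else if (lower.startsWith("@attribute")) {
                // Simplification: the name is the first whitespace-delimited token.
                String name = line.substring("@attribute".length()).trim().split("\\s+")[0];
                names.add(name.replace("'", ""));
            } else if (lower.startsWith("@data")) {
                inData = true;
                csv.add(String.join(",", names)); // the CSV header row
            }
        }
        return csv;
    }

    public static void main(String[] args) {
        List<String> arff = List.of(
            "@relation anneal",
            "@attribute cholesterol numeric",
            "@attribute shape { COIL, SHEET}",
            "@data",
            "2.801,COIL",
            "0.801,SHEET");
        toCsv(arff).forEach(System.out::println);
    }
}
```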
Data is now ready for data mining!