WEKA Tutorial
Sugato Basu and Prem Melville
Machine Learning Group, Department of Computer Sciences, University of Texas at Austin
What is WEKA?
• Collection of ML algorithms
  – open-source Java package
  – http://www.cs.waikato.ac.nz/ml/weka/
• Schemes for classification include:
  – decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, voted perceptrons, multi-layer perceptron
• Schemes for numeric prediction include:
  – linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
• Meta-schemes include:
  – bagging, boosting, stacking, regression via classification, classification via regression, cost-sensitive classification
• Schemes for clustering:
  – EM and Cobweb
Getting Started
• Set environment variable WEKAHOME
  – setenv WEKAHOME /u/ml/software/weka
• Add $WEKAHOME/weka.jar to your CLASSPATH
  – setenv CLASSPATH /u/ml/software/weka/weka.jar
• Test:
  – java weka.classifiers.j48.J48 -t $WEKAHOME/data/iris.arff
ARFF File Format
• Requires declarations of @RELATION, @ATTRIBUTE and @DATA
• The @RELATION declaration associates a name with the dataset
  – @RELATION <relation-name>
• An @ATTRIBUTE declaration specifies the name and type of an attribute
  – @ATTRIBUTE <attribute-name> <datatype>
• The @DATA declaration is a single line denoting the start of the data segment
  – Missing values are represented by ?

  @DATA
  5.1, 3.5, 1.4, 0.2, Iris-setosa
  4.9, ?, 1.4, ?, Iris-versicolor
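Putting the pieces together, a complete minimal ARFF file (a trimmed version of the iris.arff that ships with WEKA) looks like this:

  @RELATION iris

  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE sepalwidth  NUMERIC
  @ATTRIBUTE petallength NUMERIC
  @ATTRIBUTE petalwidth  NUMERIC
  @ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

  @DATA
  5.1, 3.5, 1.4, 0.2, Iris-setosa
  4.9, ?, 1.4, ?, Iris-versicolor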
Sparse ARFF Files
• Similar to ARFF files, except that data values of 0 are not represented
• Non-zero attributes are specified by attribute number and value

  Standard ARFF:
    @data
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"

  Sparse ARFF:
    @data
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}

• For examples of ARFF files, see $WEKAHOME/data
Running Learning Schemes
• java <scheme class> [options], e.g.:
  – Decision tree: java weka.classifiers.j48.J48
  – Naïve Bayes: java weka.classifiers.NaiveBayes
  – KNN: java weka.classifiers.IBk
• Important generic options:
  – -t <training file>: specify the training file
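For example, to train J48 on one file and evaluate it on another (file names here are hypothetical; -T is WEKA's standard generic option for naming a separate test file):

  java weka.classifiers.j48.J48 -t train.arff -T test.arff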
Output
• Summary of model – if possible
• Statistics on training data
• Cross-validation statistics
• Output for numeric prediction is different
  – Correlation coefficient instead of accuracy
  – No confusion matrices
Using Meta-Learners
• java <meta-learner class> -W <base classifier class> [options]
• CVParameterSelection finds the best value for a specified parameter using CV
  – Use the -P option to specify the parameter and the space to search: -P "<param> <min> <max> <steps>"
  – Example: java …CVParameterSelection -W …OneR -P "B 1 10 10" -t iris.arff
Using Filters
• Filters can be used to change data files, e.g. delete the first and second attributes:
  java weka.filters.AttributeFilter -R 1,2 -i iris.arff -o iris.new.arff
• AttributeSelectionFilter lets you select a set of attributes using classes in the weka.attributeSelection package:
  java weka.filters.AttributeSelectionFilter -E weka.attributeSelection.InfoGainAttributeEval -i weather.arff
• Other filters:
  – DiscretizeFilter: discretizes a range of numeric attributes in the dataset into nominal attributes
  – NominalToBinaryFilter: converts nominal attributes into binary ones, replacing each attribute with k values with k-1 new binary attributes
  – NumericTransformFilter: transforms numeric attributes using a given method
    (java weka.filters.NumericTransformFilter -C java.lang.Math -M sqrt …)
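For instance, a discretization run from the command line might look like the following (the -B option for the number of bins, and the file names, are assumptions here):

  java weka.filters.DiscretizeFilter -B 10 -i iris.arff -o iris.disc.arff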
The Instance Class
• All attribute values are stored as doubles
  – The value of a nominal attribute is the index of the nominal value in the attribute definition
• Some important methods of Instance:
  – classAttribute(): returns the class attribute
  – classValue(): returns an instance's class value
  – value(int): returns a specified attribute value in internal format
  – enumerateAttributes(): returns an enumeration of all the attributes
  – weight(): returns the instance's weight
• Instances is a collection of Instance objects; important methods:
  – numInstances(): returns the number of instances in the dataset
  – instance(int): returns the instance at the given position
  – enumerateInstances(): returns an enumeration of all instances in the dataset
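A minimal sketch tying these methods together (assuming an ARFF file whose last attribute is the class; the class name InstanceDemo is hypothetical):

  import java.io.FileReader;
  import weka.core.Instance;
  import weka.core.Instances;

  public class InstanceDemo {
    public static void main(String[] args) throws Exception {
      // Load a dataset from the ARFF file named on the command line
      Instances data = new Instances(new FileReader(args[0]));
      data.setClassIndex(data.numAttributes() - 1);  // last attribute = class

      // Iterate over the dataset using the methods listed above
      for (int i = 0; i < data.numInstances(); i++) {
        Instance inst = data.instance(i);
        System.out.println("class = " + inst.classValue()
                           + ", weight = " + inst.weight());
      }
    }
  }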
Writing Classifiers
• Import the following packages:
  import weka.classifiers.*;
  import weka.core.*;
  import java.util.*;
• Extend Classifier
  – If predicting class probabilities, extend DistributionClassifier
• Essential methods:
  – buildClassifier(Instances): generates a classifier
  – classifyInstance(Instance): classifies a given instance
  – distributionForInstance(Instance): predicts the class memberships (for DistributionClassifier)
• Interfaces that can be implemented:
  – UpdateableClassifier: for incremental classifiers
  – WeightedInstancesHandler: if the classifier can make use of instance weights
Example: ZeroR (Majority Class)
public class ZeroR extends DistributionClassifier
                   implements WeightedInstancesHandler {

  private double m_ClassValue;  // The class value 0R predicts
  private double[] m_Counts;    // The number of instances in each class

  public void buildClassifier(Instances instances) throws Exception {
    m_Counts = new double[instances.numClasses()];
    for (int i = 0; i < m_Counts.length; i++) {  // Initialize counts
      m_Counts[i] = 1;
    }
    Enumeration instEnum = instances.enumerateInstances();
    while (instEnum.hasMoreElements()) {  // Add up class counts
      Instance instance = (Instance) instEnum.nextElement();
      m_Counts[(int) instance.classValue()] += instance.weight();
    }
    m_ClassValue = Utils.maxIndex(m_Counts);  // Find majority class
    Utils.normalize(m_Counts);  // Normalize counts
  }
Example: ZeroR - II
  // Return index of the predicted class
  public double classifyInstance(Instance instance) {
    return m_ClassValue;
  }

  // Return predicted class probability distribution
  public double[] distributionForInstance(Instance instance) throws Exception {
    return (double[]) m_Counts.clone();
  }
}
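The completed classifier runs like any other scheme (WEKA also ships its own weka.classifiers.ZeroR), e.g.:

  java weka.classifiers.ZeroR -t $WEKAHOME/data/iris.arff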
WekaUT: Extensions to WEKA
• Clusterers package:
  – SemiSupClusterer: interface for semi-supervised clustering
  – SeededEM, SeededKMeans: implement SemiSupClusterer, with seeding
  – HAC, MatrixHAC: implement hierarchical agglomerative clustering
  – ConsensusClusterer: abstract class for consensus clustering
  – ConsensusPairwiseClusterer: takes the output of many clusterings, uses cluster collocation statistics as similarity values, and applies a clustering algorithm
  – CoTrainableClusterer: performs co-trainable clustering, similar to Nigam's Co-EM
  – CVEvaluation: 10-fold cross-validation with learning curves, in a transductive framework
WekaUT (contd.)
• Metrics:
  – Metric: abstract class for a distance metric
  – LearnableMetric: abstract class for a learnable distance metric
  – WeightedDotP: learnable
  – WeightedL1Norm: learnable
  – WeightedEuclid: learnable
  – Mahalanobis metric: uses Jama for matrix operations
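These metric classes are WekaUT extensions rather than stock WEKA; as a sketch of the underlying idea only (class and method names here are hypothetical, not the WekaUT API), a learned weighted Euclidean distance computes something like:

  public class WeightedEuclidSketch {
    // Weighted Euclidean distance between two dense attribute vectors;
    // the per-attribute weights w would be learned from training constraints
    public static double distance(double[] x, double[] y, double[] w) {
      double sum = 0;
      for (int i = 0; i < x.length; i++) {
        double d = x[i] - y[i];
        sum += w[i] * d * d;  // scale each dimension by its learned weight
      }
      return Math.sqrt(sum);
    }
  }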
Making Weka Text-friendly
• Preprocess text by making wrapper calls to:
  – Mooney's IR package: tokenize, Porter stemming, TFIDF
  – McCallum's BOW package: tokenize, stem, TFIDF, information-theoretic pruning, n-gram tokens, different smoothing algorithms
  – Fan's MC toolkit: tokenize, TFIDF, pruning, CCS format
• No inverted index in Weka: OK if not doing IR, but KNN is inefficient
  – May want to integrate the VSR package of IR with Weka
• Probability underflow currently: have to do calculations with logs
  – NaiveBayes, KNN, etc.: can have 2 versions of each (sparse, dense)
• Sparse vector formats:
  – Weka's SparseInstance
  – IR's hashMapVector
Weka’s SparseInstance format
• Non-zero attributes explicitly stated, 0 values not stated:
  @data
  {1 "the", 3 "small", 6 "boy", 9 "ate", 13 "the", 17 "small", 21 "pie"}
• Strings are mapped to integer indices using a hashtable:
  the  small  boy  ate  the  small  pie
   0     1     2    3    4     5     6
• Use StringToWordVectorFilter to convert a text SparseInstance to a word vector (in Weka 3-2-2)
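In code, a SparseInstance can be built from a dense value array, and the zero entries are dropped automatically (a minimal sketch; the class name SparseDemo is hypothetical):

  import weka.core.SparseInstance;

  public class SparseDemo {
    public static void main(String[] args) {
      double[] vals = {0, 2, 0, 0, 5};                    // dense values, mostly zero
      SparseInstance si = new SparseInstance(1.0, vals);  // weight 1.0; zeros not stored
      System.out.println(si);                             // prints only non-zero index/value pairs
    }
  }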
Comparison of sparse vector formats
• hashMapVector
  + Compact hashMap representation
  + Amortized constant-time access
  – Does not store position information, which may be necessary for future applications
  – Would need a lot of modification to Weka
• SparseInstance
  + Efficient storage, in terms of indices of string values and position
  + Contains position information of tokens
  + Will not require any modification to Weka
  – Uses binary search to insert a new element into the vector
  – Would need filters for TF, IDF, token counts, etc.
  – Will require a hack to bypass a soft bug during multiple read-writes
Future Work
• Write wrappers for existing C/C++ packages
  – mc, spkmeans, rainbow, svmlight, cluto
• Data format converters, e.g. CCStoARFF
• 10-fold CV evaluation with learning curves
  – inductive (modify Weka's)
  – transductive (use clusterer CV code)
• Statistical tests, e.g. t-tests for classification
• Cluster evaluation metrics – we have KL, MI, Pairwise
• Making changes to handle text documents
Weka Problems
• Internal variables are private
  – Should have protected or package-level access
• SparseInstance for strings requires a dummy at index 0
  – Problem:
    • Strings are mapped into internal indices to an array
    • The string at position 0 is mapped to value "0"
    • When written out as a SparseInstance, it will not be written (0 value)
    • If read back in, the first string is missing from the Instances
  – Solution:
    • Put a dummy string in position 0 when writing a SparseInstance with strings
    • The dummy will be ignored while writing; the actual instance will be written properly
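A sketch of the workaround when building a dataset with a string attribute (variable and class names are illustrative; the API calls are the standard weka.core ones):

  import weka.core.Attribute;
  import weka.core.FastVector;
  import weka.core.Instances;

  public class DummyStringDemo {
    public static void main(String[] args) {
      FastVector atts = new FastVector();
      atts.addElement(new Attribute("text", (FastVector) null));  // a string attribute
      Instances data = new Instances("docs", atts, 0);
      data.attribute(0).addStringValue("dummy");  // reserve internal index 0
      // Strings added from here on get indices >= 1,
      // so they survive a sparse write/read cycle
    }
  }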