
Advanced data mining with
TagHelper and Weka
Carolyn Penstein Rosé
Carnegie Mellon University
Funded through the Pittsburgh Science of Learning Center and
The Office of Naval Research, Cognitive and Neural Sciences Division
Outline
• Selecting a classifier
• Feature selection
• Optimization
• Semi-supervised learning
Selecting a Classifier
Classifier Options
* The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48)
Classifier Options
• Rules of thumb (a code sketch of each option follows below):
• SMO is state-of-the-art for text classification
• J48 works best with small feature sets and handles contingencies between features well
• Naïve Bayes works well for models where decisions are made by accumulating evidence rather than by hard-and-fast rules
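If you prefer to work with the Weka Java API directly rather than through the TagHelper interface, the sketch below shows how the three classifier families above can be instantiated and trained. It is a minimal sketch, assuming a Weka 3.x distribution on the classpath; the file name coded_examples.arff is a placeholder, and the class label is assumed to be the last attribute.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierChoices {
    public static void main(String[] args) throws Exception {
        // Placeholder file; the last attribute is assumed to be the class label.
        Instances data = DataSource.read("coded_examples.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // The three classifier families named above: Bayesian, function, tree.
        Classifier[] options = { new NaiveBayes(), new SMO(), new J48() };
        for (Classifier c : options) {
            c.buildClassifier(data);
            System.out.println(c.getClass().getSimpleName() + " trained on "
                    + data.numInstances() + " coded examples");
        }
    }
}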
Feature Selection
Why do irrelevant features hurt performance?
• They might confuse a classifier
• They waste time
Two Solutions
• Use a feature selection algorithm
• Only extract a subset of possible features
Feature Selection
* Click on the AttributeSelectedClassifier
Feature Selection
• Feature selection algorithms pick out a subset of the features that works best
• Usually they evaluate each feature in isolation
Feature Selection
* First click here
* Then pick your base classifier just like before
* Finally you will configure the feature selection
Setting Up Feature Selection
• The number of features you pick should not be larger than the number of features available
• The number should also not be larger than the number of coded examples you have (a code sketch of this setup follows below)
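The same setup can be expressed through the Weka Java API. This is a minimal sketch, assuming a Weka 3.x distribution; InfoGainAttributeEval and Ranker are one reasonable evaluator/search pairing rather than the only option, and coded_examples.arff, the SMO base classifier, and the cap of 100 features are placeholders.

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class FeatureSelectionSetup {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("coded_examples.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Score each feature in isolation by information gain, then rank.
        InfoGainAttributeEval evaluator = new InfoGainAttributeEval();
        Ranker search = new Ranker();
        // Keep at most 100 features, and never more than the number of
        // available features or the number of coded examples.
        search.setNumToSelect(Math.min(100,
                Math.min(data.numAttributes() - 1, data.numInstances())));

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(evaluator);
        asc.setSearch(search);
        asc.setClassifier(new SMO());   // base classifier, picked just like before

        // 10-fold cross-validation of the whole wrapped setup.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Because the selection is wrapped inside AttributeSelectedClassifier, the features are re-selected within each training fold of the cross-validation, so the reported performance does not leak information from the test folds.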
Examining Which Features are
Most Predictive
• You can find a ranked list of features in the Performance Report if you use feature selection (a code sketch for producing such a ranking follows below)
* Predictiveness score
* Frequency
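If you want to inspect a ranked feature list outside of TagHelper's Performance Report, a minimal sketch using Weka's AttributeSelection class is shown below, assuming a Weka 3.x distribution. It prints each feature with its predictiveness score (here, information gain), best first; it does not reproduce the frequency column from TagHelper's report, and the file name is a placeholder.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("coded_examples.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // predictiveness score per feature
        selector.setSearch(new Ranker());                   // sort features by that score
        selector.SelectAttributes(data);

        // Prints each feature with its score, most predictive first.
        System.out.println(selector.toResultsString());
    }
}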
Optimization
Key idea: combine multiple views on the same data in order to increase reliability
Boosting
• In boosting, a series of models is trained, and each model is influenced by the strengths and weaknesses of the previous model
• New models should become experts at classifying the examples that the previous model got wrong
• Boosting specifically seeks to train multiple models that complement each other
• In the final vote, each model's predictions are weighted by that model's performance
More about Boosting
• The more iterations, the more confident the trained classifier will be in its predictions (since it will have more experts voting)
• On the other hand, boosting sometimes overfits
• Boosting can turn a weak classifier into a strong classifier
Boosting
• Boosting is an option listed in the Meta folder, near the Attribute Selected Classifier
• It is listed as AdaBoostM1
• Go ahead and click on it now
Boosting
* Now click here
Setting Up Boosting
* Select a classifier
* Set the number of cycles of boosting (a code sketch of this setup follows below)
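The equivalent setup in the Weka Java API is sketched below, assuming a Weka 3.x distribution; J48 as the base classifier, 10 boosting cycles, and the file name are placeholder choices you would adjust.

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class BoostingSetup {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("coded_examples.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());   // the (possibly weak) base classifier
        booster.setNumIterations(10);       // number of boosting cycles

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(booster, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Raising setNumIterations adds more experts to the final vote, but as noted above, more cycles can also lead to overfitting.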
Semi-Supervised Learning
Using Unlabeled Data
• If you have a small amount of labeled data and a large amount of unlabeled data, you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data
• The stable regularities might be easier to spot in the larger set than in the smaller set
• This makes you less likely to overfit your labeled data
Co-training
• Train two different models based on a few labeled examples
• Each model learns the same labels but uses different features
• Use each of these models to label the unlabeled data
• For each model, take the example most confidently labeled negative and the example most confidently labeled positive, and add them to the labeled data
• Now repeat the process until all of the data is labeled
Semi-supervised Learning
• Remember the basic idea:
• Train on a small amount of data
• Add the positive and negative examples you are most confident about to the training data
• Retrain
• Keep looping until you have labeled all the data (a code sketch of this loop follows below)
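As an illustration of the loop described above, here is a minimal single-view self-training sketch (full co-training would train two classifiers on different feature sets). It assumes the Weka 3.x Java API, a binary class in which value 0 is treated as the positive label and value 1 as the negative label, and placeholder file names labeled.arff and unlabeled.arff that share the same ARFF header.

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelfTraining {
    public static void main(String[] args) throws Exception {
        Instances labeled = DataSource.read("labeled.arff");
        Instances unlabeled = DataSource.read("unlabeled.arff");
        labeled.setClassIndex(labeled.numAttributes() - 1);
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        while (unlabeled.numInstances() > 0) {
            // 1. Train on the current labeled set.
            NaiveBayes model = new NaiveBayes();
            model.buildClassifier(labeled);

            // 2. Find the most confidently positive (class value 0) and most
            //    confidently negative (class value 1) unlabeled examples.
            int bestPos = 0, bestNeg = 0;
            double bestPosConf = -1, bestNegConf = -1;
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double[] dist = model.distributionForInstance(unlabeled.instance(i));
                if (dist[0] > bestPosConf) { bestPosConf = dist[0]; bestPos = i; }
                if (dist[1] > bestNegConf) { bestNegConf = dist[1]; bestNeg = i; }
            }

            // 3. Assign the predicted labels, then move the picked examples into
            //    the labeled set (delete the larger index first so the smaller
            //    index stays valid).
            unlabeled.instance(bestPos).setClassValue(0.0);
            if (bestNeg != bestPos) unlabeled.instance(bestNeg).setClassValue(1.0);
            for (int idx : new int[] { Math.max(bestPos, bestNeg), Math.min(bestPos, bestNeg) }) {
                labeled.add(unlabeled.instance(idx));   // add() copies the instance
                unlabeled.delete(idx);
                if (bestPos == bestNeg) break;          // both picks were the same example
            }
        }
        System.out.println("All " + labeled.numInstances() + " examples are now labeled.");
    }
}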