Transcript Slide 1
Advanced data mining with
TagHelper and Weka
Carolyn Penstein Rosé
Carnegie Mellon University
Funded through the Pittsburgh Science of Learning Center and
The Office of Naval Research, Cognitive and Neural Sciences Division
Outline
Selecting a classifier
Feature Selection
Optimization
Semi-supervised learning
Selecting a Classifier
Classifier Options
* The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48)
Classifier Options
Rules of thumb:
SMO is state-of-the-art for text classification
J48 is best with small feature sets – it also handles contingencies between features well
Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
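The "accumulating evidence" idea behind Naïve Bayes can be sketched in a few lines of Python. The toy training sentences and the raw-word features below are illustrative assumptions, not how TagHelper actually extracts features:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: each word contributes a small piece of evidence.
train = [
    ("the movie was great and fun", "pos"),
    ("great acting and a fun plot", "pos"),
    ("the movie was dull and bad", "neg"),
    ("bad plot and dull acting", "neg"),
]

def fit(docs):
    """Count word occurrences per class, plus class priors and vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """Each word adds log-evidence; no single hard-and-fast rule decides."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.split():
            if w in vocab:  # Laplace smoothing so unseen counts stay nonzero
                score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

wc, cc, vocab = fit(train)
print(predict("a great fun movie", wc, cc, vocab))  # evidence favors "pos"
```

Note that the prediction tips one way or the other as words accumulate, which is why Naïve Bayes suits evidence-accumulation decisions.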
Feature Selection
Why do irrelevant features hurt performance?
They might confuse a classifier
They waste time
Two Solutions
Use a feature selection algorithm
Only extract a subset of possible features
Feature Selection
* Click on the AttributeSelectedClassifier
Feature Selection
Feature selection algorithms pick out a subset of the features that work best
Usually they evaluate each feature in isolation
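Evaluating each feature in isolation usually means a univariate score such as information gain: how much knowing one feature's value reduces uncertainty about the label. A minimal sketch, with made-up feature names and toy labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Information gain of a single feature, scored in isolation."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: columns are binary word-presence features (names invented here).
features = {
    "contains_because": [1, 1, 1, 0, 0, 0],   # tracks the label perfectly
    "contains_the":     [1, 0, 1, 0, 1, 0],   # essentially irrelevant
}
labels = ["reasoning", "reasoning", "reasoning", "other", "other", "other"]

ranked = sorted(features, key=lambda f: info_gain(features[f], labels),
                reverse=True)
print(ranked)  # 'contains_because' ranks first
```

Ranking every feature this way and keeping the top k is exactly the kind of per-feature scoring a selection algorithm applies.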
Feature Selection
* First click here
* Then pick your base classifier just like before
* Finally you will configure the feature selection
Setting Up Feature Selection
The number of features you pick should not be larger than the number of features available
The number should also not be larger than the number of coded examples you have
Examining Which Features are Most Predictive
You can find a ranked list of features in the Performance Report if you use feature selection
* Predictiveness score
* Frequency
Optimization
Key idea: combine multiple views on the same data in order to increase reliability
Boosting
In boosting, a series of models is trained, and each model is influenced by the strengths and weaknesses of the previous one
New models should become experts on the examples that the previous model got wrong
Boosting specifically seeks to train multiple models that complement each other
In the final vote, each model's predictions are weighted by that model's performance
More about Boosting
The more iterations, the more confident the trained classifier will be in its predictions (since it will have more experts voting)
On the other hand, boosting sometimes overfits
Boosting can turn a weak classifier into a strong classifier
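The loop described above (reweight the examples the last model got wrong, then weight each model's vote by its performance) can be sketched as AdaBoost over decision stumps. The one-dimensional toy data and the stump weak learner are illustrative assumptions, not what Weka's AdaBoostM1 uses internally:

```python
import math

def stump(threshold, sign):
    """Weak learner: predict +1 or -1 by comparing x to a threshold."""
    return lambda x: sign if x > threshold else -sign

def train_adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest *weighted* error: the new expert
        # focuses on the examples the previous models got wrong.
        best = None
        for t in xs:
            for sign in (+1, -1):
                h = stump(t, sign)
                err = sum(w for w, x, y in zip(weights, xs, ys) if h(x) != y)
                if best is None or err < best[0]:
                    best = (err, h)
        err, h = best
        err = max(err, 1e-10)                     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # vote weight from performance
        ensemble.append((alpha, h))
        # Reweight: mistakes get heavier, so the next model must fix them.
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, xs, ys)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all the weak models."""
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, 1, -1, 1, 1]   # no single threshold separates these labels
model = train_adaboost(xs, ys, rounds=5)
print([predict(model, x) for x in xs])  # the ensemble fits all six labels
```

No single stump can classify this data, but the weighted vote of complementary stumps can, which is the sense in which boosting turns a weak classifier into a strong one.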
Boosting
Boosting is an option listed in the Meta folder, near the AttributeSelectedClassifier
It is listed as AdaBoostM1
Go ahead and click on it now
Boosting
* Now click here
Setting Up Boosting
* Select a classifier
* Set the number of cycles of boosting
Semi-Supervised Learning
Using Unlabeled Data
If you have a small amount of labeled data and a large amount of unlabeled data:
you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data
The stable regularities might be easier to spot in the larger set than in the smaller set
Less likely to overfit your labeled data
Co-training
Train two different models based on a few labeled examples
Each model learns the same labels but uses different features
Use each of these models to label the unlabeled data
For each model, take the example most confidently labeled negative and the example most confidently labeled positive, and add them to the labeled data
Now repeat the process until all of the data is labeled
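The co-training loop above can be sketched as follows; here the two "views" are simply two numeric features per example, and the per-view classifier is a toy nearest-class-mean model standing in for a real classifier:

```python
def train_view(labeled, view):
    """Class means for one feature view (a stand-in for a real classifier)."""
    means = {}
    for label in ("pos", "neg"):
        vals = [x[view] for x, y in labeled if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def confidence(x, view, means):
    """Signed margin: positive means 'pos'; magnitude is confidence."""
    return abs(x[view] - means["neg"]) - abs(x[view] - means["pos"])

# A few labeled examples, and a larger unlabeled pool (toy values).
labeled = [((0.9, 0.8), "pos"), ((0.1, 0.2), "neg")]
unlabeled = [(0.8, 0.9), (0.7, 0.6), (0.2, 0.1), (0.3, 0.4)]

while unlabeled:
    for view in (0, 1):  # each model sees a different feature view
        if not unlabeled:
            break
        means = train_view(labeled, view)
        scores = [(confidence(x, view, means), x) for x in unlabeled]
        # Most confidently positive and most confidently negative examples
        best_pos = max(scores)[1]
        best_neg = min(scores)[1]
        for x, label in ((best_pos, "pos"), (best_neg, "neg")):
            if x in unlabeled:  # best_pos may equal best_neg on the last item
                unlabeled.remove(x)
                labeled.append((x, label))

print(sorted(labeled))  # every example now carries a label
```

Each view's most confident calls feed the other view's next round of training, which is how the two models bootstrap each other.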
Semi-supervised Learning
Remember the basic idea:
Train on a small amount of data
Add the positive and negative examples you are most confident about to the training data
Retrain
Keep looping until you label all the data
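The basic loop above (train, add the most confident positive and negative examples, retrain) can be sketched as self-training on toy one-dimensional data; the nearest-class-mean "classifier" here is an illustrative stand-in, not a TagHelper model:

```python
# A couple of labeled examples and a pool of unlabeled ones (toy values).
labeled = [(0.9, "pos"), (0.1, "neg")]
unlabeled = [0.8, 0.3, 0.6, 0.2]

while unlabeled:
    # 1. Train on the small labeled set (here: one mean per class).
    mean = {c: sum(x for x, y in labeled if y == c) /
               sum(1 for x, y in labeled if y == c) for c in ("pos", "neg")}
    # 2. Score every unlabeled example; the margin acts as confidence.
    score = {x: abs(x - mean["neg"]) - abs(x - mean["pos"]) for x in unlabeled}
    # 3. Move the most confident positive and negative into the training data.
    for x, label in ((max(score, key=score.get), "pos"),
                     (min(score, key=score.get), "neg")):
        if x in unlabeled:  # guards the case where max and min coincide
            unlabeled.remove(x)
            labeled.append((x, label))
    # 4. Retrain and keep looping until everything is labeled.

print(sorted(labeled))
```

Because only the most confident calls are promoted each round, early mistakes are less likely to snowball than if the whole pool were labeled at once.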