Transcript Slides
Machine Learning in Practice
Lecture 21
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day
Announcements
Questions?
No quiz and no new assignment today
Weka helpful hints
Clustering
Advanced Statistical Models
More on Optimization and Tuning
Weka Helpful Hints
Remember SMOreg vs SMO…
Setting the Exponent in SMO
* Note that an exponent larger than 1.0 means you are using a non-linear kernel.
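A minimal sketch of the same setting through the Weka Java API (not shown on the slide): the exponent field in the GUI corresponds to the exponent of a PolyKernel handed to SMO. The file name is a placeholder.

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoExponentSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder file: any ARFF with a nominal class attribute.
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO smo = new SMO();
        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(2.0);   // larger than 1.0 = non-linear (polynomial) kernel
        smo.setKernel(kernel);
        smo.buildClassifier(data);
        System.out.println(smo);
    }
}
```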
Clustering
What is clustering?
Finding natural groupings of your data
Not supervised! No class attribute.
Usually only works well if you have a huge amount of data!
InfoMagnets: Interactive Text Clustering
What does clustering do?
Finds natural breaks in your data
If there are obvious clusters, you can do this with a small amount of data
If you have lots of weak predictors, you need a huge amount of data to make it work
Clustering in Weka
* You can pick which clustering algorithm you want to use and how many clusters you want.
Clustering in Weka
* Clustering is unsupervised, so you want it to ignore your class attribute!
[Screenshot: click to select the class attribute to ignore]
Clustering in Weka
* You can evaluate the clustering in comparison with class attribute assignments
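The same workflow as a rough code sketch (the file name and the number of clusters are placeholder choices): strip the class attribute before clustering, then let ClusterEvaluation compare the resulting clusters against the class assignments.

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterVsClassSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");               // placeholder file
        data.setClassIndex(data.numAttributes() - 1);

        // Clustering is unsupervised: remove the class attribute before building clusters.
        Remove rm = new Remove();
        rm.setAttributeIndices(String.valueOf(data.classIndex() + 1)); // 1-based index
        rm.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(4);              // pick how many clusters you want
        km.buildClusterer(noClass);

        // Classes-to-clusters evaluation: compare clusters with the class assignments.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);      // class is still set here, so it maps classes to clusters
        System.out.println(eval.clusterResultsToString());
    }
}
```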
Adding a Cluster Feature
* You should set it explicitly to ignore the class attribute
* Set the pulldown menu to No Class
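In code, the AddCluster filter plays this role; a rough sketch under the same assumptions as above (placeholder file name, k-means with an arbitrary cluster count), with the class attribute explicitly ignored rather than clustered on:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterFeatureSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder file; mirror "No Class" by leaving the class index unset while filtering.
        Instances data = DataSource.read("mydata.arff");

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(4);

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(km);
        addCluster.setIgnoredAttributeIndices("last");   // assumes the class is the last attribute
        addCluster.setInputFormat(data);

        // The output has one extra nominal attribute holding each instance's cluster.
        Instances withCluster = Filter.useFilter(data, addCluster);
        System.out.println(withCluster.attribute(withCluster.numAttributes() - 1));
    }
}
```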
Why add cluster features?
[Figures: example data with instances labeled Class 1 and Class 2]
Clustering with Weka
K-means and FarthestFirst: disjoint flat clusters
EM: statistical approach
Cobweb: hierarchical clustering
K-Means
You choose the number of clusters you want
You might need to play with this by looking at what kind of clusters you get out
K initial points chosen randomly as cluster centroids
All points assigned to the centroid they are closest to
Once data is clustered, a new centroid is picked based on relationships within the cluster
K-Means
Then clustering occurs again using the new centroids
This continues until no changes in clustering take place
Clusters are flat and disjoint
K-Means
[Figures: successive iterations of assigning points to centroids and recomputing the centroids]
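To make the assign/update loop concrete, here is a minimal, self-contained sketch of k-means on 2-D points; it is not Weka code, and the toy data, k, and random seed are all illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {9, 8} };  // toy data
        int k = 2;
        Random rnd = new Random(42);

        // K initial centroids chosen randomly from the points
        // (a real implementation would pick distinct points).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        boolean changed = true;
        while (changed) {                      // repeat until no assignments change
            changed = false;
            // Assignment step: each point goes to the closest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centroids[c][0];
                    double dy = points[i][1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // Update step: each centroid becomes the mean of its cluster.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                }
                if (n > 0) { centroids[c][0] = sx / n; centroids[c][1] = sy / n; }
            }
        }
        System.out.println("assignments: " + Arrays.toString(assignment));
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
    }
}
```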
EM: Expectation Maximization
Does not base clustering on distance from a centroid
Instead clusters based on probability of cluster membership
Overlapping clusters rather than disjoint clusters
Every instance belongs to every cluster with some probability
EM: Expectation Maximization
Two important kinds of probability distributions
Each cluster has an associated distribution of attribute values for each attribute
Based on the extent to which instances are in the cluster
Each instance has a certain probability of being in each cluster
Based on how close its attribute values are to typical attribute values for the cluster
Probabilities of Cluster Membership Initialized
[Figure: two example instances, one at 35% A / 65% B, the other at 75% A / 25% B]
Central Tendencies Computed Based on Cluster Membership
[Figure: central tendencies for clusters A and B computed from those memberships]
Cluster Membership Re-Assigned Probabilistically
[Figure: memberships shift to 25% A / 75% B and 65% A / 35% B]
Central Tendencies Re-Assigned Based on Membership
[Figure: central tendencies for A and B recomputed from the new memberships]
Cluster Membership Reassigned
[Figure: memberships shift again to 40% A / 60% B and 55% A / 45% B]
EM: Expectation Maximization
Iterative like k-means, but guided by a different computation
Considered more principled than k-means, but much more computationally expensive
Like k-means, you pick the number of clusters you want
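A rough sketch of what soft membership looks like with Weka's EM clusterer: distributionForInstance returns one probability per cluster for each instance. The file (assumed to have no class attribute) and the cluster count are placeholders.

```java
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EmMembershipSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata_noclass.arff");  // placeholder, no class attribute
        EM em = new EM();
        em.setNumClusters(2);          // like k-means, you pick the number of clusters
        em.buildClusterer(data);

        // Every instance belongs to every cluster with some probability.
        for (int i = 0; i < Math.min(5, data.numInstances()); i++) {
            double[] membership = em.distributionForInstance(data.instance(i));
            System.out.printf("instance %d: cluster A %.2f, cluster B %.2f%n",
                    i, membership[0], membership[1]);
        }
    }
}
```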
Advanced Statistical Models
Quick View of Bayesian Networks
[Diagram: nodes Play, Windy, Humidity, Outlook, and Temperature]
Normally with Naïve Bayes you have simple conditional probabilities
P[Play = yes | Humidity = high]
Quick View of Bayesian Networks
[Diagram: nodes Play, Windy, Humidity, Outlook, and Temperature, with links between attributes]
With Bayes Nets, there are interactions between attributes
P[Play = yes & Temp = hot | Humidity = high]
Similar likelihood computation for an instance
You will still have one conditional probability per attribute to multiply together
But they won't all be simple
Humidity is related jointly to temperature and play
Quick View of Bayesian Networks
[Diagram: learned network over Play, Windy, Humidity, Outlook, and Temperature]
Learning algorithm needs to find the shape of the network
Probabilities come from counts
Two stages: similar idea to “kernel methods”
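For reference, a hedged sketch of the two stages with Weka's BayesNet classifier: a search algorithm (K2 here, allowed more than one parent so attributes can interact) finds the shape of the network, and the conditional probability tables are then filled in from counts. The file name and parent limit are illustrative assumptions.

```java
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.K2;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesNetSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");  // placeholder file
        data.setClassIndex(data.numAttributes() - 1);

        BayesNet net = new BayesNet();
        K2 search = new K2();
        search.setMaxNrOfParents(2);     // allow interactions between attributes
        net.setSearchAlgorithm(search);

        // Stage 1: structure search; stage 2: probabilities estimated from counts.
        net.buildClassifier(data);
        System.out.println(net);         // prints the learned structure and tables
    }
}
```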
Doing Optimization in Weka
Optimizing Parameter Settings
Use a modified form of cross-validation:
[Diagram: folds 1 through 5, each split into Train, Validation, and Test portions]
Iterate over settings
Compare performance over the validation set; pick the optimal setting
Test on the test set
Still N folds, but each fold has less training data than with standard cross-validation
Or you can have a hold-out validation set you use for all folds
Remember!
Cross-validation is for estimating your performance
If you want the model that achieves that estimated performance, train over the whole set
Same principle for optimization
Estimate your tuned performance using cross-validation with an inner loop for optimization
When you build the model over the whole set, use the settings that work best in cross-validation over the whole set
Optimization in Weka
Divide your data into 10 train/test pairs
Tune parameters using cross-validation on the training set (this is the inner loop)
Use those optimized settings on the corresponding test set
Note that you may have a different set of parameter settings for each of the 10 train/test pairs
You can do the optimization in the Experimenter
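The whole procedure as a rough sketch in the Java API (rather than the Experimenter): StratifiedRemoveFolds builds the 10 train/test pairs, an inner cross-validation on each training set picks the J48 confidence factor from the candidate values, and the winning setting is applied to the matching test set. The file name, fold counts, and candidate values are illustrative assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class NestedOptimizationSketch {
    // invert = false keeps only the selected fold; invert = true keeps everything else.
    static Instances foldSubset(Instances data, int fold, boolean invert) throws Exception {
        StratifiedRemoveFolds f = new StratifiedRemoveFolds();
        f.setNumFolds(10);
        f.setFold(fold);
        f.setInvertSelection(invert);
        f.setInputFormat(data);
        return Filter.useFilter(data, f);
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file
        data.setClassIndex(data.numAttributes() - 1);
        double[] settings = {0.1, 0.25, 0.5, 0.75};

        for (int fold = 1; fold <= 10; fold++) {
            Instances test  = foldSubset(data, fold, false);   // this fold only
            Instances train = foldSubset(data, fold, true);    // the other nine folds

            // Inner loop: pick the confidence factor by cross-validation on the training set.
            double bestSetting = settings[0];
            double bestAcc = -1;
            for (double c : settings) {
                J48 tree = new J48();
                tree.setConfidenceFactor((float) c);
                Evaluation inner = new Evaluation(train);
                inner.crossValidateModel(tree, train, 10, new Random(1));
                if (inner.pctCorrect() > bestAcc) { bestAcc = inner.pctCorrect(); bestSetting = c; }
            }

            // Apply the optimized setting to the corresponding test set.
            J48 tuned = new J48();
            tuned.setConfidenceFactor((float) bestSetting);
            tuned.buildClassifier(train);
            Evaluation outer = new Evaluation(train);
            outer.evaluateModel(tuned, test);
            System.out.printf("fold %d: best confidence factor %.2f, test accuracy %.1f%%%n",
                    fold, bestSetting, outer.pctCorrect());
        }
    }
}
```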
Train/Test Pairs
* Use the StratifiedRemoveFolds filter
Setting Up for Optimization
* Prepare to save the results
* Load in training sets for all folds
* We'll use cross-validation within training folds to do the optimization
What are we optimizing?
Let’s optimize the confidence factor.
Let’s try .1, .25, .5, and .75
Add Each Algorithm to Experimenter Interface
Look at the Results
* Note that optimal setting varies across folds.
Apply the optimized settings on each fold
* Performance on Test1 using optimized settings from Train1
What if the optimization requires work by hand?
Do you see a problem with the following?
Do feature selection over the whole set to see which words are highly ranked
Create user defined features with subsets of these to see which ones look good
Add those to your feature space and do the classification
What if the optimization requires work by hand?
The problem is that this is just like doing feature selection over your whole data set
You will overestimate your performance
So what's a better way of doing that?
What if the optimization requires work by hand?
You could set aside a small subset of data
Using that small subset, do the same process
Then use those user defined features with the other part of the data
Take Home Message
Instance based learning and clustering both make use of similarity metrics
Clustering can be used to help you understand your data or to add new features to your data
Weka provides opportunities to tune all of its algorithms through the object editor
You can use the Experimenter to tune the parameter settings when you are estimating your performance using cross-validation