Transcript Slides

Machine Learning in Practice
Lecture 21
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day

 Announcements
   Questions?
   No quiz and no new assignment today
 Weka helpful hints
 Clustering
 Advanced Statistical Models
 More on Optimization and Tuning

Weka Helpful Hints
Remember SMOreg (for predicting a numeric value) vs SMO (for classification)…
Setting the Exponent in SMO
* Note that an exponent larger than 1.0 means you are using a non-linear kernel.
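As a rough sketch of the same setting in code (assuming a recent Weka release where the kernel is configured through setKernel; in older releases the exponent was an option on SMO itself, and the file name below is only a placeholder):

    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.PolyKernel;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SmoExponentDemo {
        public static void main(String[] args) throws Exception {
            // Load a dataset (placeholder file name); the last attribute is the class
            Instances data = DataSource.read("mydata.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // SMO is the classifier (SMOreg is the separate regression version)
            SMO smo = new SMO();

            // An exponent larger than 1.0 on the polynomial kernel makes it non-linear
            PolyKernel kernel = new PolyKernel();
            kernel.setExponent(2.0);
            smo.setKernel(kernel);

            smo.buildClassifier(data);
            System.out.println(smo);
        }
    }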
Clustering
What is clustering?
 Finding natural groupings of your data
 Not supervised! No class attribute.
 Usually only works well if you have a huge amount of data!

InfoMagnets: Interactive Text Clustering
What does clustering do?
 Finds natural breaks in your data
 If there are obvious clusters, you can do this with a small amount of data
 If you have lots of weak predictors, you need a huge amount of data to make it work

Clustering in Weka
* You can pick which clustering algorithm you want to use and how many clusters you want.
Clustering in Weka
* Clustering is unsupervised, so you want it to ignore your class attribute!
* (Screenshot callouts: click the button for ignoring attributes and select the class attribute)
Clustering in Weka
* You can evaluate the clustering in comparison with class attribute assignments
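A hedged sketch of that comparison in code, assuming SimpleKMeans with two clusters and a placeholder ARFF file; the class attribute is removed before clustering and only consulted afterwards for the comparison:

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClassesToClustersDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder file; the class is assumed to be the last attribute
            Instances data = DataSource.read("mydata.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Clustering is unsupervised: drop the class attribute before clustering
            Remove remove = new Remove();
            remove.setAttributeIndices("" + (data.classIndex() + 1));
            remove.setInputFormat(data);
            Instances dataNoClass = Filter.useFilter(data, remove);

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(2);            // you pick the number of clusters
            kmeans.buildClusterer(dataNoClass);

            // Compare the clusters against the held-out class labels
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kmeans);
            eval.evaluateClusterer(data);        // class is set, so classes-to-clusters
            System.out.println(eval.clusterResultsToString());
        }
    }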
Adding a Cluster Feature
* You should set it explicitly to ignore the class attribute
* Set the pulldown menu to No Class
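A sketch of the same step in code using the AddCluster filter, assuming SimpleKMeans, a placeholder file name, and a class attribute that happens to be last:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddCluster;

    public class AddClusterFeatureDemo {
        public static void main(String[] args) throws Exception {
            // Load the data with no class set, mirroring the "No Class" pulldown
            Instances data = DataSource.read("mydata.arff");   // placeholder file name

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(2);                           // illustrative choice

            // AddCluster appends a nominal attribute holding each instance's cluster
            AddCluster addCluster = new AddCluster();
            addCluster.setClusterer(kmeans);
            // Explicitly ignore the class attribute (here assumed to be the last one)
            addCluster.setIgnoredAttributeIndices("last");
            addCluster.setInputFormat(data);

            Instances withCluster = Filter.useFilter(data, addCluster);
            System.out.println(withCluster.numAttributes() + " attributes after adding the cluster feature");
        }
    }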
Why add cluster features?
 (Figures: example data containing instances from Class 1 and Class 2)
Clustering with Weka
 K-means and FarthestFirst: disjoint flat clusters
 EM: statistical approach
 Cobweb: hierarchical clustering
K-Means
 You choose the number of clusters you want
   You might need to play with this by looking at what kind of clusters you get out
 K initial points chosen randomly as cluster centroids
 All points assigned to the centroid they are closest to
 Once data is clustered, a new centroid is picked based on relationships within the cluster
K-Means
 Then clustering occurs again using the new centroids
 This continues until no changes in clustering take place
 Clusters are flat and disjoint
K-Means
 (Figures: the k-means assignment and centroid-update steps illustrated over several iterations)
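To make the loop concrete, here is a bare-bones sketch of those steps on a tiny made-up set of 2-D points (for simplicity the first k points seed the centroids instead of a random draw; in practice you would normally just use Weka's SimpleKMeans):

    import java.util.Arrays;

    public class KMeansSketch {
        public static void main(String[] args) {
            double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {0.5, 1.5}, {9, 8.5} };
            int k = 2;

            // Step 1: choose k initial centroids (here simply the first k points;
            // the slides describe choosing them at random)
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) centroids[c] = points[c].clone();

            int[] assignment = new int[points.length];
            boolean changed = true;
            while (changed) {                 // Step 4: repeat until nothing changes
                changed = false;
                // Step 2: assign every point to its closest centroid
                for (int i = 0; i < points.length; i++) {
                    int best = 0;
                    for (int c = 1; c < k; c++)
                        if (dist(points[i], centroids[c]) < dist(points[i], centroids[best])) best = c;
                    if (best != assignment[i]) { assignment[i] = best; changed = true; }
                }
                // Step 3: recompute each centroid as the mean of its cluster
                for (int c = 0; c < k; c++) {
                    double sx = 0, sy = 0; int n = 0;
                    for (int i = 0; i < points.length; i++)
                        if (assignment[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                    if (n > 0) centroids[c] = new double[] { sx / n, sy / n };
                }
            }
            System.out.println(Arrays.toString(assignment));   // flat, disjoint clusters
        }

        static double dist(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return dx * dx + dy * dy;   // squared Euclidean distance
        }
    }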
EM: Expectation Maximization
 Does not base clustering on distance from a centroid
 Instead clusters based on probability of cluster assignment
 Overlapping clusters rather than disjoint clusters
 Every instance belongs to every cluster with some probability
EM: Expectation Maximization
 Two important kinds of probability distributions
 Each cluster has an associated distribution of attribute values for each attribute
   Based on the extent to which instances are in the cluster
 Each instance has a certain probability of being in each cluster
   Based on how close its attribute values are to typical attribute values for the cluster
Probabilities of Cluster Membership Initialized
 (Figure: two instances and two cluster centers, A and B; one instance starts at 35% A / 65% B, the other at 75% A / 25% B)
Central Tendencies Computed Based on Cluster Membership
 (Figure: the centers A and B are placed according to those initial memberships)
Cluster Membership Re-Assigned Probabilistically
 (Figure: the memberships shift to 25% A / 75% B and 65% A / 35% B)
Central Tendencies Re-Assigned Based on Membership
 (Figure: the centers A and B move again, based on the new memberships)
Cluster Membership Reassigned
 (Figure: the memberships shift to 40% A / 60% B and 55% A / 45% B)
EM: Expectation Maximization
 Iterative like k-means, but guided by a different computation
 Considered more principled than k-means, but much more computationally expensive
 Like k-means, you pick the number of clusters you want
Advanced Statistical Models
Quick View of Bayesian Networks
 (Figure: nodes for Play, Windy, Humidity, Outlook, and Temperature)
 Normally with Naïve Bayes you have simple conditional probabilities
 P[Play = yes | Humidity = high]
Quick View of Bayesian Networks
 (Figure: the same nodes, now with dependencies between attributes)
 With Bayes Nets, there are interactions between attributes
 P[Play = yes & Temp = hot | Humidity = high]
 Similar likelihood computation for an instance
   You will still have one conditional probability per attribute to multiply together
   But they won’t all be simple
   Humidity is related jointly to temperature and play
Quick View of Bayesian Networks
 (Figure: the network of attribute nodes and Play)
 Learning algorithm needs to find the shape of the network
 Probabilities come from counts
 Two stages, a similar idea to “kernel methods”
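A minimal sketch of training Weka's BayesNet classifier with its default settings, which search for a network structure and then estimate the conditional probability tables from counts (the file name is a placeholder; the cross-validation here is just to print a summary):

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.BayesNet;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import java.util.Random;

    public class BayesNetDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff");   // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);

            // Stage 1: a search algorithm finds the shape of the network;
            // Stage 2: conditional probabilities are estimated from counts
            BayesNet net = new BayesNet();

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(net, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }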
Doing Optimization in Weka
Optimizing Parameter Settings
 (Figure: the data divided into five numbered folds, labeled Train, Validation, and Test)
 Use a modified form of cross-validation:
   Iterate over settings
   Compare performance over the validation set; pick the optimal setting
   Test on the test set
 Still N folds, but each fold has less training data than with standard cross-validation
 Or you can have a hold-out validation set you use for all folds
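A sketch of this idea in code, using an inner cross-validation over each outer training fold rather than a single hold-out validation set (J48's confidenceFactor stands in for whatever parameter you are tuning, and the file name is a placeholder):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import java.util.Random;

    public class NestedCvSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff");   // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);
            double[] settings = { 0.1, 0.25, 0.5, 0.75 };
            System.out.println(tuneAndTest(data, 10, settings) + "% correct (tuned estimate)");
        }

        static double tuneAndTest(Instances data, int outerFolds, double[] settings) throws Exception {
            double sum = 0;
            for (int fold = 0; fold < outerFolds; fold++) {
                Instances train = data.trainCV(outerFolds, fold, new Random(1));
                Instances test = data.testCV(outerFolds, fold);

                // Inner loop: compare the candidate settings by cross-validation
                // on the training data only
                double bestSetting = settings[0], bestAcc = -1;
                for (double s : settings) {
                    J48 tree = new J48();
                    tree.setConfidenceFactor((float) s);
                    Evaluation inner = new Evaluation(train);
                    inner.crossValidateModel(tree, train, 5, new Random(1));
                    if (inner.pctCorrect() > bestAcc) { bestAcc = inner.pctCorrect(); bestSetting = s; }
                }

                // Outer loop: train with the winning setting, test on the held-out fold
                J48 tuned = new J48();
                tuned.setConfidenceFactor((float) bestSetting);
                tuned.buildClassifier(train);
                Evaluation outer = new Evaluation(train);
                outer.evaluateModel(tuned, test);
                sum += outer.pctCorrect();
            }
            return sum / outerFolds;   // average over the outer folds
        }
    }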
Remember!
 Cross-validation is for estimating your performance
 If you want the model that achieves that estimated performance, train over the whole set
 Same principle for optimization
   Estimate your tuned performance using cross-validation with an inner loop for optimization
   When you build the model over the whole set, use the settings that work best in cross-validation over the whole set
Optimization in Weka
 Divide your data into 10 train/test pairs
 Tune parameters using cross-validation on the training set (this is the inner loop)
 Use those optimized settings on the corresponding test set
 Note that you may have a different set of parameter settings for each of the 10 train/test pairs
 You can do the optimization in the Experimenter
Train/Test Pairs
* Use the StratifiedRemoveFolds filter
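A sketch of building one train/test pair with this filter (run it once per fold, changing setFold from 1 to 10); by default the filter outputs the selected fold, and inverting the selection gives its complement:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.StratifiedRemoveFolds;

    public class TrainTestPairs {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff");   // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);

            // Test set for fold 1: the filter outputs the selected fold by default
            StratifiedRemoveFolds testFilter = new StratifiedRemoveFolds();
            testFilter.setNumFolds(10);
            testFilter.setFold(1);
            testFilter.setInputFormat(data);
            Instances test1 = Filter.useFilter(data, testFilter);

            // Training set for fold 1: invert the selection to get the other nine folds
            StratifiedRemoveFolds trainFilter = new StratifiedRemoveFolds();
            trainFilter.setNumFolds(10);
            trainFilter.setFold(1);
            trainFilter.setInvertSelection(true);
            trainFilter.setInputFormat(data);
            Instances train1 = Filter.useFilter(data, trainFilter);

            System.out.println(train1.numInstances() + " training / " + test1.numInstances() + " test instances");
        }
    }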
Setting Up for Optimization
* Prepare to save the results
* Load in training sets for all folds
* We'll use cross-validation within training folds to do the optimization
What are we optimizing?
Let’s optimize the confidence factor.
Let’s try 0.1, 0.25, 0.5, and 0.75
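For reference, these four candidates correspond to four entries in the Experimenter's algorithm list that differ only in the pruning confidence, assuming the classifier being tuned is J48 (its confidenceFactor, -C on the command line):

    import weka.classifiers.trees.J48;

    // Hypothetical helper that builds the four J48 variants from the slide
    public class ConfidenceFactorCandidates {
        public static J48[] candidates() {
            double[] settings = { 0.1, 0.25, 0.5, 0.75 };
            J48[] trees = new J48[settings.length];
            for (int i = 0; i < settings.length; i++) {
                trees[i] = new J48();
                trees[i].setConfidenceFactor((float) settings[i]);   // pruning confidence
            }
            return trees;
        }
    }

Equivalently, the four entries can be added in the Experimenter as weka.classifiers.trees.J48 with -C 0.1, -C 0.25, -C 0.5, and -C 0.75.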
Add Each Algorithm to the Experimenter Interface
Look at the Results
* Note that the optimal setting varies across folds.
Apply the optimized settings on each fold
* Performance on Test1 using optimized settings from Train1
What if the optimization requires work by hand?
 Do you see a problem with the following?
   Do feature selection over the whole set to see which words are highly ranked
   Create user-defined features with subsets of these to see which ones look good
   Add those to your feature space and do the classification
What if the optimization requires work by hand?
 The problem is that this is just like doing feature selection over your whole data set
   You will overestimate your performance
 So what’s a better way of doing that?
What if the optimization requires work by hand?
 You could set aside a small subset of data
 Using that small subset, do the same process
 Then use those user-defined features with the other part of the data

Take Home Message
 Instance-based learning and clustering both make use of similarity metrics
 Clustering can be used to help you understand your data or to add new features to your data
 Weka provides opportunities to tune all of its algorithms through the object editor
 You can use the Experimenter to tune the parameter settings when you are estimating your performance using cross-validation
