Machine Learning Homework
Gaining familiarity with Weka, ML
tools and algorithms
Goals for this homework
1. Learn how to use Weka, a collection of ML
algorithms implemented in Java
2. Apply a few ML techniques to some standard
datasets to see what happens
3. Learn how to connect to Weka’s API, so you
can include calls to ML techniques in your
own code.
Step 1: Install Weka, get data
• Download and install Weka on some machine that you will be able to use
for a while. Lab machines will not have this, but if you don’t have access
to another machine, let me know, and I will try to get this installed for you
on a lab machine.
http://www.cs.waikato.ac.nz/ml/weka/
• Also, download a dataset. For this homework, I will ask you to use the
University of California-Irvine’s repository of machine learning datasets,
and I will focus on this dataset:
http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
• You can browse all of the datasets on UCI’s machine learning repository
here:
http://archive.ics.uci.edu/ml/datasets.html
You won’t need these for this assignment, but you may be interested to
see what datasets people have used in the past for ML research.
Step 2: Format the data
For the Communities and Crime dataset, you should open the file
“communities.names” and scroll down to the section that looks like:
@relation crimepredict
@attribute state numeric
…
@data
Copy this whole section, and paste it into the top of the file called
“communities.data”.
Rename the file “communities.data” to “communities.arff”.
This puts the data into a format that Weka can easily understand.
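If you would rather script the copy-and-paste step, here is a minimal sketch in plain Java (no Weka needed). It assumes both files sit in the current directory and that the header section starts at the line beginning with "@relation" and ends at the line beginning with "@data". The class name ArffBuilder and its methods are made up for this illustration. Unlike the manual steps, it writes a new file and leaves "communities.data" untouched.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds communities.arff from the two downloaded files.
final class ArffBuilder {

    /** Returns the lines from "@relation" through "@data", inclusive. */
    static List<String> extractHeader(List<String> namesLines) {
        List<String> header = new ArrayList<>();
        boolean inHeader = false;
        for (String line : namesLines) {
            String lower = line.trim().toLowerCase(Locale.ROOT);
            if (!inHeader && lower.startsWith("@relation")) {
                inHeader = true;
            }
            if (inHeader) {
                header.add(line);
                if (lower.startsWith("@data")) {
                    return header;
                }
            }
        }
        throw new IllegalArgumentException("no @relation ... @data section found");
    }

    /** Puts the header on top of the raw data rows. */
    static List<String> buildArff(List<String> namesLines, List<String> dataLines) {
        List<String> arff = new ArrayList<>(extractHeader(namesLines));
        arff.addAll(dataLines);
        return arff;
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 is used because it accepts any byte sequence.
        List<String> names = Files.readAllLines(
                Paths.get("communities.names"), StandardCharsets.ISO_8859_1);
        List<String> data = Files.readAllLines(
                Paths.get("communities.data"), StandardCharsets.ISO_8859_1);
        Files.write(Paths.get("communities.arff"),
                buildArff(names, data), StandardCharsets.ISO_8859_1);
    }
}
```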
Step 3: Load data into Weka
• Start Weka. You should see
a menu like the one on the
right.
• Click on “Explorer”.
• In the “Preprocess” tab,
click on “Open file.”
• Find the “communities.arff”
file that you created, and
open it.
How it should look when you load the
data into Weka:
First task: Get familiar with the data
(and Weka)
• Click on the “visualize” tab.
• Use the scatterplots under this tab to try to get a feel
for what this data contains.
• We will use ViolentCrimesPerPop as the Y variable that
we will try to predict in this dataset. What other
variables seem to make a difference for predicting this
variable, based on the plots you see?
Question 1: Write down the names of three different
features that you think each have a strong predictive
relationship with ViolentCrimesPerPop. For each one,
briefly (1 sentence or less) describe the relationship.
Task 2: Determining relationships
between variables
Focus on the variables PctFam2Par and
PctNotHSGrad. Both seem to have some
correlation with ViolentCrimesPerPop.
Question 2: Based on the plots in the visualize
tab, see if you can determine whether there is a
correlation between PctFam2Par and
PctNotHSGrad or not. Explain what evidence
you have found to support your conclusion.
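A scatterplot gives you a visual impression of a relationship; a Pearson correlation coefficient puts a number on it. The sketch below is plain Java, not part of Weka, and the class name Pearson is made up for this illustration. It takes two equal-length arrays of values (for example, two columns you have read out of the dataset yourself) and returns a value between -1 and 1.

```java
// Hypothetical helper: Pearson correlation between two numeric columns.
final class Pearson {

    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;

        // Means of both columns.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of squared deviations and of cross products.
        double sxx = 0.0;
        double syy = 0.0;
        double sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }

        // Result is NaN if either column is constant (zero variance).
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

Values near 1 or -1 suggest a strong linear relationship; values near 0 suggest little or none. Keep in mind that this only measures linear association, so it complements the plots rather than replacing them.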
Prepping for a regression
• First, you will need to remove non-numeric attributes
from the data, since most of Weka’s regression
algorithms can’t handle such attributes.
• Click on the “preprocess” tab.
• You should see a list of all 128 attributes (including
ViolentCrimesPerPop) on the left.
• Click the check box next to “communityname”.
• Click the button called “remove” at the bottom of the
screen.
• You should be all set.
Task 3:
Running a regression experiment
• Click on the “classify” tab.
• At the top, click the button called “choose”.
You will see a list of many classifiers that are built in to Weka. Many of these
are greyed-out, since they can’t do regression. The non-grey ones are
available for our experiment.
• Under “functions”, select “linear regression”.
• Under “test options”, select “percentage split”, and set the percentage to
66.
• Make sure that “(Num) ViolentCrimesPerPop” shows up in the dropdown
list below the test options.
• Click “start”.
Question: When the classifier finishes, copy the results from the “Classifier
output” box to a text file called “linear-regression-results.txt”. You will email
this to your TA when you’re done.
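To make the “percentage split” option concrete: the idea is to shuffle the rows, train on the first 66%, and test on the rest. The sketch below shows that idea in plain Java. It is an illustration only, not Weka's own code; the class name PercentageSplit, the rounding rule, and the seed parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffles rows and splits them into train and test parts.
final class PercentageSplit {

    /** Returns a two-element list: index 0 is the training part, index 1 the test part. */
    static <T> List<List<T>> split(List<T> rows, double trainPercent, long seed) {
        if (trainPercent < 0.0 || trainPercent > 100.0) {
            throw new IllegalArgumentException("trainPercent must be between 0 and 100");
        }
        // Shuffle a copy so the caller's list is left unchanged.
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));

        // Assumed rounding rule: nearest whole number of training rows.
        int trainSize = (int) Math.round(shuffled.size() * trainPercent / 100.0);

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }
}
```

The model never sees the test part during training, which is why the error measured on it is a fairer estimate than the error on the training rows.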
Task 4: Running a more complicated
regression model
• This time we’ll try a Support Vector Machine.
• Click the “choose” box again, and under
functions, select “SMOreg”.
• Use the same “test options” as before.
• Click start. (It may take 20-30 seconds to finish
training.)
Question: Which model performed better in this
experiment? How can you tell? Cite two pieces of
evidence that tell you why SMOreg was better than
linear regression, or vice versa.
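When the class is numeric, Weka's summary includes, among other things, a mean absolute error and a root mean squared error. If you want to see what those two numbers mean, here is a minimal sketch in plain Java that computes them from arrays of actual and predicted values. The class name RegressionMetrics is made up for this illustration; this is not Weka's implementation.

```java
// Hypothetical helper: two common error measures for regression.
final class RegressionMetrics {

    /** Average of |actual - predicted| over all test rows. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        checkLengths(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the average squared difference; penalizes large misses more. */
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        checkLengths(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    private static void checkLengths(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
    }
}
```

For both measures, lower is better, and they are only comparable between models that were evaluated on the same test rows.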
Task 5:
Running a clustering experiment
• Click on the “cluster” tab at the top.
• Click the “choose” button, and select “SimpleKMeans”.
• To the right of the “choose” button is a textbox that says “SimpleKMeans -N 2 …”
• Click anywhere in the textbox. It should bring up a new window.
• In the new window, under “numClusters”, change it from 2 to 10.
• Click “Ok”.
• Set the “cluster mode” to “use training set”.
• Click “start”.
• When this finishes, in the “Result-list” text area, right-click the most recently-appeared line of text.
• Select “Visualize cluster assignments” from the popup menu.
• In the new window, change the “X” variable to “Cluster (Nom)”. Change the “Y”
variable to “ViolentCrimesPerPop (Num)”.
Question: Did the K-means clustering algorithm do a good job of separating the data
into clusters that have different violent crime rates? What evidence from the chart
you just created supports your conclusion? (2 sentences max.)
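One way to back up what you see in the chart is to compare the average Y value within each cluster. The sketch below is plain Java, not Weka code, and the class name ClusterSummary is made up for this illustration. It assumes you already have, for each row, a cluster number from 0 to numClusters - 1 and the corresponding Y value.

```java
// Hypothetical helper: average of a numeric variable within each cluster.
final class ClusterSummary {

    /** Returns one mean per cluster; empty clusters get NaN. */
    static double[] meanPerCluster(int[] clusterIds, double[] values, int numClusters) {
        if (clusterIds.length != values.length) {
            throw new IllegalArgumentException("need one cluster id per value");
        }
        double[] sums = new double[numClusters];
        int[] counts = new int[numClusters];
        for (int i = 0; i < values.length; i++) {
            int c = clusterIds[i];
            if (c < 0 || c >= numClusters) {
                throw new IllegalArgumentException("cluster id out of range: " + c);
            }
            sums[c] += values[i];
            counts[c]++;
        }
        double[] means = new double[numClusters];
        for (int c = 0; c < numClusters; c++) {
            means[c] = counts[c] == 0 ? Double.NaN : sums[c] / counts[c];
        }
        return means;
    }
}
```

If the per-cluster means are all close together, the clusters are not telling you much about the Y variable; if they are spread out, the clustering has picked up on something related to it.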
Task 6: Weka API
• Write a Java class that runs a SMO regression
on the communities dataset. DO NOT write
the code to do the SMO regression; instead,
call Weka’s API to make this happen.
Submit your Java class to your TA.
Extra Credit Task 1 (1 point):
Principal Components Analysis
We’re going to run a PCA on this dataset, and save it as a new dataset.
• Click on the “select attributes” tab.
• Click “choose”, select “PrincipalComponents”.
• In the popup window, click “Yes” to automatically select the
“Ranker” search method.
• Click “start”.
• In the “result-list” window, right-click the most recently-appeared
line, and select “save transformed data…”. Save this data as a file
called “transformed-communities.arff”.
Question: How many of the eigenvectors from this PCA have an
eigenvalue greater than 1? What percentage of the total variance of
the data does this subset of the eigenvectors represent?
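The arithmetic behind this question can be sketched in a few lines of plain Java. The sketch assumes you have the full list of eigenvalues from the PCA, and that the total variance equals the sum of all the eigenvalues. The class name EigenSummary is made up for this illustration; it is not part of Weka.

```java
// Hypothetical helper: summarizes a list of PCA eigenvalues.
final class EigenSummary {

    /** Number of eigenvalues strictly greater than the threshold. */
    static int countAbove(double[] eigenvalues, double threshold) {
        int count = 0;
        for (double value : eigenvalues) {
            if (value > threshold) {
                count++;
            }
        }
        return count;
    }

    /** Fraction (0 to 1) of the total variance carried by eigenvalues above the threshold. */
    static double varianceShareAbove(double[] eigenvalues, double threshold) {
        double total = 0.0;
        double selected = 0.0;
        for (double value : eigenvalues) {
            total += value;
            if (value > threshold) {
                selected += value;
            }
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("eigenvalues sum to zero");
        }
        return selected / total;
    }
}
```

Multiply the returned fraction by 100 to get a percentage. If the output you are reading lists only some of the eigenvalues, the sum over that partial list will understate the total, so check that before relying on the result.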
Extra Credit Task 2 (2 points): Linear
regression over PCA-transformed data
• Load the new dataset (transformed by PCA) into Weka.
• Run a linear regression with this new dataset. Again, use
66% of the data for training.
Question: How do the results for this linear regression
compare with SMO regression and the previous linear
regression? Cite two evaluation metrics in your comparison.
• Analyze the results some more: right-click on the most
recently-produced line under “Result list”, and select
“visualize classifier errors”.
Question: When the regression made errors, did it more
often predict a value that was too high, or one that was too low?
To turn in
You should turn in a single zip archive called <your-name>.zip.
It should contain:
• a text file called “answers.txt” with the answers to all of the
questions in this homework
• your Java source code for Task 6 (don’t include the Weka
jar, just your own code that references Weka). Please
include a comment in the code to explain to the TA how
to get the code to read in data from a file of their own.
• “linear-regression-results.txt”
• If you did the extra credit, “transformed-communities.arff”
Email or otherwise transfer this zip file to your TA.