CS378 Final Project
The Netflix Data Set Class Project Ideas and Guidelines
The Data Set
● 17,770 Movies
● 480,189 Reviewers
● More than 100 Million reviews
  ● Rating of 1 through 5
  ● Review Date
● Uncompressed full dataset is 2 Gigabytes
Netflix Data Properties
● Distribution of Number of Reviews per Reviewer
● X-axis: # of reviews
● Y-axis: P(# of reviews)
Netflix Data Subsets
● You will be given two subsets of the data
● Format:
Project Requirements
● Compute each of the following
  ● Average review score
  ● Top 10 most highly rated movies
  ● Distribution of all review scores
    ● p(rating=1), ..., p(rating=5)
  ● Number of reviews as a function of time
  ● The reviewer whose review score distribution has the largest entropy
● Compute five other properties of the data
  ● These properties should be relevant to your project
  ● You should explain this relevancy
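A minimal sketch of the score distribution and the largest-entropy reviewer. The tuple layout (reviewer_id, movie_id, rating) and the toy rows are assumptions made for illustration; the actual subset format is not given here, so the loading step is left out.

```python
import math
from collections import Counter, defaultdict

# Hypothetical in-memory rows: (reviewer_id, movie_id, rating).
# This layout is an assumption, not the format of the provided subsets.
ratings = [
    ("r1", "m1", 5), ("r1", "m2", 5), ("r1", "m3", 5),
    ("r2", "m1", 1), ("r2", "m2", 3), ("r2", "m3", 5),
    ("r3", "m1", 4), ("r3", "m2", 4), ("r3", "m3", 2),
]

def score_distribution(scores):
    """Return {rating: probability} for ratings 1 through 5."""
    counts = Counter(scores)
    total = len(scores)
    return {r: counts.get(r, 0) / total for r in range(1, 6)}

def entropy(dist):
    """Shannon entropy in bits; zero-probability ratings are skipped."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def max_entropy_reviewer(rows):
    """Return the reviewer whose own score distribution has the largest entropy."""
    by_reviewer = defaultdict(list)
    for reviewer, _movie, rating in rows:
        by_reviewer[reviewer].append(rating)
    return max(
        by_reviewer,
        key=lambda r: entropy(score_distribution(by_reviewer[r])),
    )

average_score = sum(rating for _, _, rating in ratings) / len(ratings)
overall = score_distribution([rating for _, _, rating in ratings])
```

On the full dataset, reviewers with very few reviews will have noisy entropy estimates, so a minimum review count may be worth considering.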
Project Options
● Classification
● Clustering
● Recommendation
● Data Cubes
Project 1: Classification
● Goal: Predict classification scores
  ● 5-class classification problem
● K-Nearest Neighbor
  ● Represent each reviewer by a (sparse) vector of his review scores
    ● How can scores be predicted given a reviewer's nearest neighbors?
  ● Represent each movie by a vector of each reviewer's scores
    ● How can scores be predicted given a movie's nearest neighbors?
  ● Experiment with different distance measures
  ● Experiment with various normalization schemes
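One possible answer to the reviewer-based question, sketched under assumptions: each reviewer is a sparse dict of movie_id to rating, similarity is cosine, and the prediction is a majority vote among the k most similar reviewers who rated the movie. The profiles dict and the function names are hypothetical; other distance measures and normalizations can be swapped in.

```python
import math
from collections import Counter

# Hypothetical sparse representation: reviewer -> {movie_id: rating}.
profiles = {
    "a": {"m1": 5, "m2": 4, "m3": 1},
    "b": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "c": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
    "d": {"m1": 4, "m2": 5, "m4": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def predict(reviewer, movie, k=2):
    """Majority vote over the k most similar reviewers who rated `movie`."""
    target = profiles[reviewer]
    candidates = [
        (cosine(target, vec), name)
        for name, vec in profiles.items()
        if name != reviewer and movie in vec
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        return None  # nobody else rated this movie
    votes = Counter(profiles[name][movie] for _sim, name in neighbors)
    return votes.most_common(1)[0][0]
```

The movie-based variant is the same idea with the dict transposed (movie to {reviewer_id: rating}).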
Project 1: Classification
● Decision Trees and other Parametric Classifiers
  ● Create dense features for each instance
    ● Reviewer's average rating
    ● Movie's average rating
    ● Movie related features
      ● Actors in each movie (collected from IMDB)
    ● Time related features
      ● Number of reviewer's previous scores
  ● Use the WEKA machine learning package
    ● Evaluate performance of various algorithms in the package
      ● Decision Tree, SVM, ...
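A rough sketch of how three of the dense features above could be built. The row layout (reviewer_id, movie_id, rating, review_date) and the feature names are assumptions; the IMDB-derived features and the export to WEKA are not shown.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (reviewer_id, movie_id, rating, review_date).
rows = [
    ("r1", "m1", 4, date(2004, 1, 5)),
    ("r1", "m2", 2, date(2004, 3, 9)),
    ("r2", "m1", 5, date(2004, 2, 1)),
    ("r1", "m3", 3, date(2005, 6, 2)),
]

def mean_by(rows, key_index):
    """Average rating grouped by the column at key_index (0 = reviewer, 1 = movie)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[key_index]] += row[2]
        counts[row[key_index]] += 1
    return {key: sums[key] / counts[key] for key in sums}

def dense_features(rows):
    """One feature dict per review, in chronological order."""
    reviewer_avg = mean_by(rows, 0)
    movie_avg = mean_by(rows, 1)
    seen = defaultdict(int)  # reviews each reviewer has made so far
    features = []
    for reviewer, movie, rating, _when in sorted(rows, key=lambda r: r[3]):
        features.append({
            "reviewer_avg": reviewer_avg[reviewer],
            "movie_avg": movie_avg[movie],
            "previous_scores": seen[reviewer],
            "label": rating,
        })
        seen[reviewer] += 1
    return features
```

Note that the averages here include the review being predicted; for an honest evaluation they should be computed on the training split only.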
Project 1: Classification
● Evaluation of Classification Performance
  ● Accuracy, Confusion Matrices
    ● Analysis: Are 1's harder to predict than 5's?
  ● Cross-validation
    ● Does this make sense when there is a time-series component?
● Extensions
  ● Learning curves
    ● How does accuracy change as the training set size increases?
  ● Distribution of accuracy per reviewer
    ● Are some reviewers harder to predict than others?
    ● Are some movies harder to predict?
  ● ...
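A small sketch of accuracy, a 5x5 confusion matrix, and per-rating recall, which is one way to look at whether 1's are harder to predict than 5's. The actual and predicted lists are made-up toy values, and the function names are hypothetical.

```python
LABELS = (1, 2, 3, 4, 5)

def confusion_matrix(actual, predicted, labels=LABELS):
    """matrix[i][j] = number of reviews with true label i predicted as label j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of predictions on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

def per_class_recall(matrix, labels=LABELS):
    """For each true rating, the fraction predicted correctly (None if unseen)."""
    return {
        label: (row[i] / sum(row) if sum(row) else None)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }

# Toy values for illustration only.
actual = [1, 1, 3, 4, 5, 5, 5, 2]
predicted = [2, 1, 3, 4, 5, 5, 4, 2]
matrix = confusion_matrix(actual, predicted)
```

Grouping the same counts by reviewer or by movie instead of by rating gives the per-reviewer and per-movie accuracy distributions.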
Project 2: Clustering
● Goal: Cluster reviewers and movies
● K-means based methods
  ● Download G-Means
    ● Supports k-means and also other variants
  ● Cluster using both sparse and dense representations
    ● Sparse representation: same as used for KNN classification
    ● Dense representation: same as used for parametric classification
Project 2: Clustering
● Graph-based methods
  ● Compute pairwise similarities between reviewers
    ● Correlation
    ● Your own ad-hoc method
      ● e.g. The Kevin Bacon method
      ● Sim(x, y) = # of Kevin Bacon movies viewed by both x and y
    ● Similarity computation may be too expensive to perform on the full dataset
  ● Software: Graclus
● Results analysis
  ● Quantitative as well as Qualitative
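Both similarity ideas can be sketched in a few lines. The profiles dict, the kevin_bacon_movies set, and the function names are hypothetical; the set of Kevin Bacon movie ids would have to come from external cast data. Feeding the resulting weights to Graclus is not shown.

```python
import math

# Hypothetical inputs for illustration.
kevin_bacon_movies = {"m1", "m4"}  # assumption: ids obtained from cast data
profiles = {
    "x": {"m1": 5, "m2": 3, "m3": 1, "m4": 4},
    "y": {"m1": 4, "m2": 2, "m4": 5},
    "z": {"m2": 1, "m3": 5},
}

def bacon_sim(x, y):
    """Sim(x, y) = # of Kevin Bacon movies viewed by both x and y."""
    return len(set(profiles[x]) & set(profiles[y]) & kevin_bacon_movies)

def pearson(x, y):
    """Pearson correlation over the movies both reviewers rated."""
    shared = sorted(set(profiles[x]) & set(profiles[y]))
    if len(shared) < 2:
        return 0.0  # not enough overlap to correlate
    xs = [profiles[x][m] for m in shared]
    ys = [profiles[y][m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant rater has no defined correlation
    return cov / math.sqrt(var_x * var_y)
```

Computing either measure for all pairs is quadratic in the number of reviewers, which is why it may only be feasible on a subset.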
Project 3: Recommendation
● Goal: Create movie recommendations for each reviewer
● K-Nearest Neighbor
  ● Instance representation
    ● Sparse representation
  ● Find the reviewer's nearest neighbors
    ● Recommend movies scored highly by these neighbors
  ● Try out various distance measures
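A sketch of the neighbor-based recommendation step, assuming sparse reviewer dicts and cosine similarity. The profiles data and the recommend function are hypothetical, and the similarity argument is there so that other distance measures can be tried.

```python
import math
from collections import defaultdict

# Hypothetical sparse representation: reviewer -> {movie_id: rating}.
profiles = {
    "u": {"m1": 5, "m2": 4},
    "v": {"m1": 5, "m2": 5, "m3": 5, "m4": 2},
    "w": {"m1": 4, "m2": 4, "m3": 4, "m5": 5},
    "x": {"m1": 1, "m2": 1, "m6": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(a * a for a in u.values()))
    norm_v = math.sqrt(sum(b * b for b in v.values()))
    return dot / (norm_u * norm_v)

def recommend(reviewer, k=2, top_n=2, similarity=cosine):
    """Rank unseen movies by their average score among the k nearest neighbors."""
    target = profiles[reviewer]
    neighbors = sorted(
        ((similarity(target, vec), name)
         for name, vec in profiles.items() if name != reviewer),
        reverse=True,
    )[:k]
    totals, counts = defaultdict(float), defaultdict(int)
    for _sim, name in neighbors:
        for movie, rating in profiles[name].items():
            if movie not in target:  # only recommend unseen movies
                totals[movie] += rating
                counts[movie] += 1
    averages = {movie: totals[movie] / counts[movie] for movie in totals}
    # Highest average first; movie id breaks ties deterministically.
    return sorted(averages, key=lambda m: (-averages[m], m))[:top_n]
```

With this toy data, reviewer "x" disagrees with "u" on the shared movies, so "m6" is never recommended to "u" when k=2.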
Project 3: Recommendation
● Evaluation
  ● Propose a way of quantifying the quality of your recommendations
    ● e.g. A recommendation is good if a reviewer ended up rating the recommendation with a score of 4 or higher
  ● Is it harder to recommend movies to reviewers who do not watch many movies?
    ● Does your evaluation metric reflect this?
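One way the "score of 4 or higher" idea could be quantified, as a sketch only. It assumes some ratings are held out from training; the hit_rate name and the two input dicts are hypothetical.

```python
def hit_rate(recommended, held_out, threshold=4):
    """Fraction of judged recommendations that were rated >= threshold.

    recommended: reviewer -> list of recommended movie ids
    held_out:    reviewer -> {movie_id: rating} withheld from training
    A recommendation is "judged" only if the reviewer actually rated it.
    """
    hits = 0
    judged = 0
    for reviewer, movies in recommended.items():
        later = held_out.get(reviewer, {})
        for movie in movies:
            if movie in later:
                judged += 1
                if later[movie] >= threshold:
                    hits += 1
    return hits / judged if judged else None

# Toy values for illustration only.
recommended = {"u": ["m5", "m3"], "v": ["m7"]}
held_out = {"u": {"m5": 5, "m3": 3}, "v": {}}
```

Because unrated recommendations are ignored, reviewers who watch few movies contribute few judged recommendations, so this metric on its own does not show whether they are harder to serve. Reporting it per reviewer-activity bucket is one option.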
Project 4: Data Cubes
● Load the data into a data cube
● Find interesting trends in the data
  ● e.g. Relation between average review score and day of week?
    ● Slice on day, aggregate review scores across all reviewers and movies
  ● Find other interesting trends
● Use an open source data cube package (OLAP)
  ● Mondrian – Java based
  ● Must be a proficient coder
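The day-of-week example amounts to a group-by on one dimension. The sketch below is a plain Python stand-in for that single aggregation, not Mondrian and not a data cube; the row layout (reviewer_id, movie_id, rating, review_date) is an assumption.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical rows: (reviewer_id, movie_id, rating, review_date).
rows = [
    ("r1", "m1", 4, date(2005, 1, 3)),   # a Monday
    ("r2", "m1", 2, date(2005, 1, 10)),  # a Monday
    ("r1", "m2", 5, date(2005, 1, 8)),   # a Saturday
]

def average_by_weekday(rows):
    """Average rating per day of week, across all reviewers and movies."""
    sums, counts = defaultdict(int), defaultdict(int)
    for _reviewer, _movie, rating, when in rows:
        day = DAY_NAMES[when.weekday()]  # weekday(): Monday is 0
        sums[day] += rating
        counts[day] += 1
    return {day: sums[day] / counts[day] for day in DAY_NAMES if day in counts}
```

A cube generalizes this by keeping the reviewer, movie, and date dimensions available so the same aggregate can be sliced along any of them.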