CS378 Final Project The Netflix Data Set Class Project Ideas and Guidelines

CS378 Final Project

The Netflix Data Set Class Project Ideas and Guidelines

The Data Set

● 17,770 movies
● 480,189 reviewers
● More than 100 million reviews
  ○ Rating of 1 through 5
  ○ Review date
● Uncompressed full dataset is 2 gigabytes

Netflix Data Properties

● Distribution of the number of reviews per reviewer
● X-axis: # of reviews
● Y-axis: P(# of reviews)

Netflix Data Subsets

● You will be given two subsets of the data
● Format:
● Subset
  ○ Contains 9,000 reviewers
  ○ Restricted to only those movies with at least 5 ratings
    ▪ 12,000 movies
  ○ ~2 million reviews
  ○ ~50 MB

Project Requirements

● Compute each of the following
  ○ Average review score
  ○ Top 10 most highly rated movies
  ○ Distribution of all review scores
    ▪ p(rating=1), ..., p(rating=5)
  ○ Number of reviews as a function of time
  ○ The reviewer whose review score distribution has the largest entropy
● Compute five other properties of the data
  ○ These properties should be relevant to your project
  ○ You should explain this relevance
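Three of the required statistics above (average score, score distribution, and the maximum-entropy reviewer) can be sketched in a few lines. This is only an illustration: the `reviews` list is a hypothetical in-memory toy sample, the function names are made up for this sketch, and parsing of the actual subset files is left out because their layout is not described here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records: (reviewer_id, movie_id, rating).
reviews = [
    ("r1", "m1", 5), ("r1", "m2", 5), ("r1", "m3", 5),
    ("r2", "m1", 1), ("r2", "m2", 3), ("r2", "m3", 5),
    ("r3", "m1", 4), ("r3", "m2", 4),
]

def average_score(reviews):
    """Mean rating over all reviews."""
    return sum(rating for _, _, rating in reviews) / len(reviews)

def rating_distribution(ratings):
    """Return {1: p(rating=1), ..., 5: p(rating=5)} for a list of ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    return {score: counts.get(score, 0) / total for score in range(1, 6)}

def entropy(dist):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def max_entropy_reviewer(reviews):
    """Reviewer whose personal rating distribution has the largest entropy."""
    by_reviewer = defaultdict(list)
    for reviewer, _, rating in reviews:
        by_reviewer[reviewer].append(rating)
    return max(
        by_reviewer,
        key=lambda rv: entropy(rating_distribution(by_reviewer[rv])),
    )

print(average_score(reviews))
print(rating_distribution([rating for _, _, rating in reviews]))
print(max_entropy_reviewer(reviews))
```

A reviewer who always gives the same score has entropy 0, while one who uses all five scores equally often reaches the maximum of log2(5) bits.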

Project Options

● Classification
● Clustering
● Recommendation
● Data Cubes

Project 1: Classification

● Goal: Predict classification scores
  ○ 5-class classification problem
● K-Nearest Neighbor
  ○ Represent each reviewer by a (sparse) vector of the reviewer's scores
    ▪ How can scores be predicted given a reviewer's nearest neighbors?
  ○ Represent each movie by a vector of each reviewer's scores
    ▪ How can scores be predicted given a movie's nearest neighbors?
  ○ Experiment with different distance measures
  ○ Experiment with various normalization schemes
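One possible answer to the reviewer-neighbor question above is a similarity-weighted average of the neighbors' scores. The sketch below assumes cosine similarity on raw scores (one choice among many, with unrated movies treated as zeros) and uses a hypothetical toy `ratings` dictionary; the names `cosine` and `predict` are illustrative only.

```python
import math

# Hypothetical sparse reviewer vectors: {reviewer: {movie: rating}}.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m4": 4},
    "carol": {"m1": 1, "m2": 2, "m4": 2},
    "dave":  {"m1": 5, "m2": 4},
}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def predict(ratings, reviewer, movie, k=2):
    """Predict a 1-5 score from the k most similar reviewers who rated `movie`."""
    target = ratings[reviewer]
    candidates = [
        (cosine(target, vec), vec[movie])
        for other, vec in ratings.items()
        if other != reviewer and movie in vec
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbors)
    if weight == 0:
        return None  # no usable neighbor rated this movie
    score = sum(sim * rating for sim, rating in neighbors) / weight
    return min(5, max(1, round(score)))  # snap to one of the 5 classes

print(predict(ratings, "dave", "m4", k=1))
print(predict(ratings, "dave", "m4", k=2))
```

Swapping `cosine` for another measure, or subtracting each reviewer's mean score before comparing, is a direct way to try the distance and normalization experiments listed on the slide.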

Project 1: Classification

● Decision Trees and other Parametric Classifiers
  ○ Create dense features for each instance
    ▪ Reviewer's average rating
    ▪ Movie's average rating
    ▪ Movie related features
      - Actors in each movie (collected from IMDB)
    ▪ Time related features
      - Number of reviewer's previous scores
  ○ Use the WEKA machine learning package
    ▪ Evaluate performance of various algorithms in the package
      - Decision Tree, SVM, ...
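A minimal sketch of how the dense features above might be built, one row per review. It assumes records of the form (reviewer, movie, rating, ISO date string), which is a hypothetical layout, and it computes the averages only from reviews dated earlier than the current one. That restriction is a design choice of this sketch, made so that a row never contains its own label. Exporting the rows to a format WEKA reads is not shown.

```python
from collections import defaultdict

def dense_features(reviews):
    """Build one feature row per review, using only earlier reviews."""
    ordered = sorted(reviews, key=lambda rec: rec[3])  # oldest first
    reviewer_sum, reviewer_count = defaultdict(int), defaultdict(int)
    movie_sum, movie_count = defaultdict(int), defaultdict(int)
    rows = []
    for reviewer, movie, rating, day in ordered:
        rows.append({
            "reviewer_avg": (reviewer_sum[reviewer] / reviewer_count[reviewer]
                             if reviewer_count[reviewer] else None),
            "movie_avg": (movie_sum[movie] / movie_count[movie]
                          if movie_count[movie] else None),
            "reviewer_prev_count": reviewer_count[reviewer],
            "label": rating,
        })
        # Update the running totals only after the row has been emitted.
        reviewer_sum[reviewer] += rating
        reviewer_count[reviewer] += 1
        movie_sum[movie] += rating
        movie_count[movie] += 1
    return rows

# Hypothetical toy records.
sample = [
    ("r1", "m1", 4, "2005-01-01"),
    ("r1", "m2", 2, "2005-01-02"),
    ("r2", "m1", 5, "2005-01-03"),
]
for row in dense_features(sample):
    print(row)
```

Rows with `None` values correspond to a reviewer's or a movie's first review; how to fill them (for example with the global average) is left open.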

Project 1: Classification

● Evaluation of Classification Performance
  ○ Accuracy, Confusion Matrices
    ▪ Analysis: Are 1's harder to predict than 5's?
  ○ Cross-validation
    ▪ Does this make sense when there is a time-series component?
● Extensions
  ○ Learning curves
    ▪ How does accuracy change as the training set size increases?
  ○ Distribution of accuracy per reviewer
    ▪ Are some reviewers harder to predict than others?
    ▪ Are some movies harder to predict?
  ○ ...

Project 2: Clustering

● Goal: Cluster reviewers and movies
● K-means based methods
  ○ Download G-Means
    ▪ Supports k-means and also other variants
  ○ Cluster using both sparse and dense representations
    ▪ Sparse representation: same as used for KNN classification
    ▪ Dense representation: same as used for parametric classification
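For orientation, the basic k-means loop can be sketched as follows. This is plain Lloyd-style k-means on dense vectors, not the G-Means package or its interface; the initialization (first k points) is chosen only to keep the example deterministic, and the toy points are hypothetical dense reviewer features.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]  # deterministic toy init
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical 2-D dense features, e.g. (average rating, scaled review count).
points = [(1.0, 1.2), (4.8, 4.9), (1.1, 0.9), (5.0, 5.1), (0.9, 1.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

For the sparse representation, the same loop applies in principle, but the distance function would need to handle missing scores, which is one reason a dedicated package is suggested on the slide.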

Project 2: Clustering

● Graph-based methods
  ○ Compute pairwise similarities between reviewers
    ▪ Correlation
    ▪ Your own ad-hoc method
      - e.g. the Kevin Bacon method
      - Sim(x, y) = # of Kevin Bacon movies viewed by both x and y
    ▪ Similarity computation may be too expensive to perform on the full dataset
  ○ Software: Graclus
● Results analysis
  ○ Quantitative as well as qualitative
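The two similarity options named above can be sketched as small functions. The correlation here is Pearson correlation over co-rated movies, which is one common reading of "Correlation" and an assumption of this sketch. The movie ids and the `bacon_movies` set are hypothetical placeholders; a real set would come from external data such as IMDB. Feeding the resulting similarities to Graclus is not shown.

```python
import math

def bacon_similarity(seen_x, seen_y, bacon_movies):
    """Sim(x, y) = number of Kevin Bacon movies viewed by both x and y."""
    return len(seen_x & seen_y & bacon_movies)

def pearson(u, v):
    """Pearson correlation of two reviewers over the movies both rated."""
    shared = sorted(set(u) & set(v))
    if len(shared) < 2:
        return 0.0
    xs = [u[m] for m in shared]
    ys = [v[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant rater has no defined correlation
    return cov / (spread_x * spread_y)

# Hypothetical toy data.
bacon_movies = {"m1", "m2", "m7"}
seen_x = {"m1", "m2", "m3"}
seen_y = {"m1", "m2", "m7"}
print(bacon_similarity(seen_x, seen_y, bacon_movies))

u = {"m1": 1, "m2": 3, "m3": 5}
v = {"m1": 2, "m2": 3, "m3": 4}
print(pearson(u, v))
```

Computing either measure for every pair of the 9,000 reviewers in a subset means roughly 40 million pairs, which illustrates why the slide warns about cost on the full dataset.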

Project 3: Recommendation

● Goal: Create movie recommendations for each reviewer
● K-Nearest Neighbor
  ○ Instance representation
    ▪ Sparse representation
  ○ Find the reviewer's nearest neighbors
    ▪ Recommend movies scored highly by these neighbors
  ○ Try out various distance measures
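The neighbor-based recommendation step above might look like the following sketch. It assumes mean absolute difference over co-rated movies as the distance (only one of the measures worth trying) and treats "scored highly" as a score of at least `min_score`, a hypothetical parameter. The toy `ratings` data and function names are illustrative.

```python
def distance(u, v):
    """Mean absolute rating difference over movies both reviewers rated."""
    shared = set(u) & set(v)
    if not shared:
        return float("inf")  # nothing in common: treat as infinitely far
    return sum(abs(u[m] - v[m]) for m in shared) / len(shared)

def recommend(ratings, reviewer, k=2, min_score=4):
    """Unseen movies rated >= min_score by the k nearest reviewers."""
    target = ratings[reviewer]
    others = [
        (distance(target, vec), name)
        for name, vec in ratings.items()
        if name != reviewer
    ]
    neighbors = [name for d, name in sorted(others)[:k] if d != float("inf")]
    scores = {}
    for name in neighbors:
        for movie, score in ratings[name].items():
            if movie not in target and score >= min_score:
                scores.setdefault(movie, []).append(score)
    # Best average neighbor score first; ties broken by movie id.
    return sorted(scores, key=lambda m: (-sum(scores[m]) / len(scores[m]), m))

# Hypothetical sparse reviewer vectors.
ratings = {
    "ann": {"m1": 5, "m2": 4},
    "ben": {"m1": 5, "m2": 4, "m3": 5, "m4": 2},
    "cat": {"m1": 4, "m2": 4, "m5": 4},
    "dan": {"m1": 1, "m2": 1, "m6": 5},
}
print(recommend(ratings, "ann"))
```

In this toy data, "dan" rates "m6" highly but disagrees with "ann" on every shared movie, so "m6" is not recommended when k is 2.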

Project 3: Recommendation

● Evaluation
  ○ Propose a way of quantifying the quality of your recommendations
    ▪ e.g. a recommendation is good if the reviewer ended up rating the recommended movie with a score of 4 or higher
  ○ Is it harder to recommend movies to reviewers who do not watch many movies?
    ▪ Does your evaluation metric reflect this?
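The example criterion above (a recommendation counts as good if the reviewer later rated it 4 or higher) translates into a simple hit rate. The sketch assumes some reviews were held out as `later_ratings`, and it ignores recommended movies the reviewer never rated, since their quality is unknown; both the name and that choice are assumptions of this sketch.

```python
def recommendation_hit_rate(recommended, later_ratings, threshold=4):
    """Fraction of rated recommendations that scored >= threshold.

    Returns None when the reviewer rated none of the recommended movies.
    """
    rated = [movie for movie in recommended if movie in later_ratings]
    if not rated:
        return None
    hits = sum(1 for movie in rated if later_ratings[movie] >= threshold)
    return hits / len(rated)

# Hypothetical held-out scores for one reviewer.
recommended = ["m1", "m2", "m3", "m4"]
later_ratings = {"m1": 5, "m2": 3, "m4": 4}
print(recommendation_hit_rate(recommended, later_ratings))
```

A reviewer with few held-out reviews will often produce `None` or a value based on one or two movies, which is one way the metric does or does not reflect the difficulty with infrequent reviewers.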

Project 4: Data Cubes

● Load the data into a data cube
  ○ Find interesting trends in the data
    ▪ e.g. relation between average review score and day of week?
      - Slice on day, aggregate review scores across all reviewers and movies
    ▪ Find other interesting trends
● Use an open source data cube package (OLAP)
  ○ Mondrian – Java based
  ○ Must be a proficient coder
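The day-of-week example above amounts to a group-by on one derived dimension. The sketch below shows that aggregation in plain Python to make the idea concrete; it is not Mondrian and does not use any OLAP interface. The (rating, date) records are hypothetical toy data.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def average_by_weekday(reviews):
    """Average rating per day of week, over all reviewers and movies."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [sum, count]
    for rating, day in reviews:
        cell = totals[WEEKDAYS[day.weekday()]]
        cell[0] += rating
        cell[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

# Hypothetical toy records: (rating, review date).
reviews = [
    (5, date(2005, 1, 3)),   # a Monday
    (3, date(2005, 1, 10)),  # a Monday
    (2, date(2005, 1, 8)),   # a Saturday
]
print(average_by_weekday(reviews))
```

A data cube package generalizes this by precomputing such aggregates over several dimensions at once (for example movie, reviewer, and time), so that slices like this one become lookups.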
