Sparsity, Scalability and Distribution in Recommender Systems Doctoral Thesis Proposal



Badrul M. Sarwar
Computer Science & Engineering Dept., University of Minnesota
Advisor: Professor John Riedl

Talk Outline
Introduction to Recommender Systems
Research Challenges
Previous Work
Future Work and Completion Plan
Contributions and Conclusions

Information Overload
News items, books, journals, research papers
Consumer products, e-commerce items
TV programs, music CDs, movie titles
Web pages, Usenet articles, e-mails

Computerized Solution Techniques
Information Retrieval
– Immediate information needs
Information Filtering
– Content-based filtering
– Information filtering agents
Collaborative Filtering (CF)
– Recommender systems (RS) - interface
We'll use the terms CF and RS interchangeably

Collaborative Filtering
Why another filtering technique?
– Problems with content-based filtering
    Limitations due to computer processing
    Lack of aesthetic sense
    Different techniques for different media
CF adds the missing piece into the picture
– Human judgements

Collaborative Filtering Process

CF used successfully in e-commerce

Research Challenges
RC1: How can we improve RS quality and performance by using dimensionality reduction techniques?
RC2: How can we design a better interface for RS?
RC3: How can we design distributed RS to make them widely available?
RC4: How can we utilize clustering algorithms to improve scalability in RS?

RC1: Motivation and Importance
RS performance challenge
– Meet two important goals
– Quality
    Best CF is 77% accurate
– Scalability
    Response time
    Storage space

RC1: Motivation and Importance (contd.)
Stumbling blocks
– High-dimensional data
    Computational complexity
    Noise and data over-fitting
– Sparsity
    Reduced number of predictions
    Inferior quality

RC1: Specific Aims
Select a dimensionality reduction technique
Apply the technique
Evaluate quality
Study performance implications

RC2: Motivation and Importance
Need for explanation interface
– End-user point of view
Explanation of recommendations
– Algorithmic explanation
– Visual explanation
Visual explanation
– Visualization amplifies cognition
Benefits
– Increases usability and confidence

RC2: Specific Aims
Identify techniques
– Use of dimension reduction results
Implementation
Evaluation
– Usability study
– Comparison with text-based system

RC3: Motivation and Importance
Increasing needs for RS services
– Availability challenge
Travelling users
Centralized RS problems
– Problems of scale and robustness
– Privacy concerns

RC3: Specific Aims
Taxonomy of RS application space
Design framework
– Key design issues
– Implementation models
Evaluation criteria
Analysis of different models

RC4: Motivation and Importance
Scalability
Sparsity
Benefits of clustering
– Usenet (newsgroup)
Recent studies
Performance implications

RC4: Specific Aims
Identify clustering algorithms
– Soft cluster
– Hard cluster
Partition the data set
Apply Galaxy algorithm
Evaluate results

Research Approach
[Flow diagram; boxes as extracted:]
Identify problem
Develop hypotheses
Discover algorithm and solution techniques
Validate solution techniques
Create experiment framework
Apply solution techniques on experimental data
Create dataset
Separate training and test data

Dimension Reduction Experiments
Singular Value Decomposition
– Matrix factorization
– Dimension reduction
– Prediction generation by re-constructing matrix
Result highlights
– Quality of prediction improved
– We expect to see improved performance

Applying dimension reduction in RS
We applied an LSI/SVD-based technique
SVD decomposes a matrix into three factors

$R = U \, S \, V'$, where $R$ is $m \times n$; keeping the first $k$ dimensions gives $R_k = U_k \, S_k \, V_k'$.

The reconstructed matrix $R_k = U_k \cdot S_k \cdot V_k'$ is the closest rank-$k$ matrix to the original matrix $R$.
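
The rank-k reconstruction can be sketched with NumPy as follows. This is a minimal illustration, not the thesis implementation: the function name `rank_k_approximation` and the small ratings matrix are hypothetical, and the matrix is assumed to be dense (no missing ratings).

```python
import numpy as np

def rank_k_approximation(R, k):
    """Return U_k, S_k, V_k' and the rank-k reconstruction R_k = U_k S_k V_k'."""
    # Thin SVD: R = U diag(s) Vt, singular values sorted in decreasing order
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    U_k = U[:, :k]          # m x k
    S_k = np.diag(s[:k])    # k x k
    Vt_k = Vt[:k, :]        # k x n
    return U_k, S_k, Vt_k, U_k @ S_k @ Vt_k

# Hypothetical 4-user x 3-item ratings matrix (illustrative values only)
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

U_k, S_k, Vt_k, R_k = rank_k_approximation(R, k=2)
print(R_k.shape)                   # same shape as R
print(np.linalg.norm(R - R_k))     # Frobenius error of the approximation
```

"Closest" here is in the Frobenius-norm sense: the error equals the square root of the sum of the squared discarded singular values.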

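SVD as prediction generator
[Diagram: the prediction for user $i$ on item $j$ is formed from the $i$-th row of $U_k$ ($m \times k$), $S_k$ ($k \times k$), and the $j$-th column of $V_k'$ ($k \times n$).]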
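
A sketch of how a prediction could be read off the reduced factors is given here. Several details are assumptions not stated on the slides: missing ratings are filled with the item average, each user's average is subtracted before the SVD and added back afterwards, and $S_k$ is split evenly (as square roots) between the user and item factors. The name `svd_predict` and the data are hypothetical.

```python
import numpy as np

def svd_predict(R, k):
    """Predict every rating from a rank-k SVD; R uses np.nan for missing ratings."""
    item_mean = np.nanmean(R, axis=0)                  # assumed fill value
    filled = np.where(np.isnan(R), item_mean, R)
    user_mean = filled.mean(axis=1, keepdims=True)     # assumed normalization
    U, s, Vt = np.linalg.svd(filled - user_mean, full_matrices=False)
    root = np.sqrt(s[:k])
    user_factors = U[:, :k] * root             # m x k, row i describes user i
    item_factors = root[:, None] * Vt[:k, :]   # k x n, column j describes item j
    # Entry (i, j) is the dot product of row i and column j, plus the user mean
    return user_mean + user_factors @ item_factors

# Hypothetical ratings with gaps (illustrative values only)
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])

P = svd_predict(R, k=2)
print(P[0, 2])   # predicted rating of user 0 for item 2
```

Splitting $S_k$ this way does not change the product $U_k S_k V_k'$; it only decides how the scaling is shared between the user and item representations.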

Results: SVD as prediction generator
[Figure: ROC and MAE plots for Data set 1 and Data set 2. x-axis: dimension, k. Legend: ROC, MAE, DBLens ROC, DBLens MAE.]
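
For reference, MAE (mean absolute error), one of the two metrics plotted, can be computed as in the sketch below. The function name and the sample values are hypothetical and are not taken from the experiments.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical held-out ratings and predictions (illustrative values only)
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 3]
mae = mean_absolute_error(predicted, actual)
print(mae)  # 0.875
```

Lower MAE means predictions are closer to the ratings users actually gave.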

Visual Interface: Initial Prototype
Used SVD results
Plotted users and items in 2-D feature space
Prototype tested in Spotfire
Problems:
– Distance is non-Euclidean

Design of Visual Interface
Use of LSI/SVD for user-item visualization

Distributed RS: Work Done
Taxonomy of the application space
– Based on
Identification of key design issues
Three implementation models proposed
– Local profile model
– Central profile model
– Geographically distributed profile model

Future Work: Dimension Reduction
Study performance implications
SVD-based prediction
– Offline (model building)
– Online
Offline part is time-consuming
– Incremental SVD
– Fold-in
Online is very promising
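
The fold-in idea listed above can be sketched as follows: a new user's rating row $r$ is projected into the existing $k$-dimensional space as $r \, V_k \, S_k^{-1}$, without recomputing the SVD. This is an illustration of the standard fold-in projection under the assumption of a dense rating row; the function names and data are hypothetical.

```python
import numpy as np

def truncated_svd(R, k):
    """Offline step: factor R and keep the first k dimensions."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_user(r_new, s_k, Vt_k):
    """Project a length-n rating row into the k-dim user space: r V_k S_k^{-1}."""
    return (r_new @ Vt_k.T) / s_k

# Hypothetical 4-user x 3-item ratings matrix (illustrative values only)
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])
U_k, s_k, Vt_k = truncated_svd(R, k=2)

new_user = np.array([5.0, 5.0, 2.0])       # arrives after the model was built
u_hat = fold_in_user(new_user, s_k, Vt_k)  # k-dimensional representation
predicted_row = (u_hat * s_k) @ Vt_k       # back to item space
print(u_hat, predicted_row)
```

Fold-in is cheap because it reuses the existing factors, but the model itself is not updated, so quality may degrade as more users are folded in.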

Future Work: Distributed RS
Evaluation
– Possible approaches
    Identify suitable evaluation criteria
    Select applications from taxonomy
    Analyze using each model (hypothetical)
    Analyze each implementation in terms of the evaluation criteria

Future Work: Visual Interface
Implement visual interface
Perform usability studies
– Set up live user experiment
– Identify usability questionnaires
– Conduct the usability survey
– Analyze results
– Revise/redesign interface

Future Work: Clustering in RS
Identify effective clustering algorithms
– For soft and hard clusters (K-means and E-M)
Partition the dataset
Apply Galaxy algorithm
Test for quality
– Accuracy and coverage
Test for performance
– Response time
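
The partitioning step with a hard-clustering algorithm might look like the K-means sketch below. It covers only the partitioning of users by their rating vectors; the Galaxy algorithm is not shown. The initialization from the first k points, the function names, and the data are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20):
    """Hard-cluster points into k groups; returns (assignments, centroids)."""
    # Deterministic initialization from the first k points (an assumption)
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical user rating vectors over four items (illustrative values only)
users = [
    [5, 4, 1, 1],
    [1, 1, 5, 4],
    [4, 5, 1, 2],
    [1, 2, 4, 5],
]
assignments, centroids = k_means(users, k=2)
print(assignments)  # [0, 1, 0, 1]
```

Each partition is smaller than the full dataset, which is where the expected gain in response time would come from.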

Future Work: Completion Plan
Columns: Research Challenge | Work to be done | Expected completion time
Challenge 1: Performance implications of SVD as prediction generator | 8/1999
Challenge 2: Implementation of the visual interface | 12/1999
Challenge 2: Usability testing
Challenge 3: Evaluation of distributed RS implementation techniques
Challenge 4: Identification and application of clustering algorithm
Challenge 4: Implementation of Galaxy algorithm
Challenge 4: Quality and performance evaluation
Remaining completion dates as extracted (row assignment not recoverable): 2/2000, 10/1999, and three further dates whose digits were interleaved in extraction ("1 2 4 / / 1 2 2 / 1 0 0 9 0 0 9 0 0 9").

Contributions
Use of a dimension reduction technique (SVD) as a high-quality prediction generator
– Submitted to ICDE 2000
Framework design for distributed RS
– Submitted to CIKM'99
Visual interfaces
Clustering to improve scalability

Distributed RS: Local Profile Model
[Diagram labels: Local RS; User; Profile data; Remote RS (two). The user carries his profile to the remote RS.]

Distributed RS: Central Profile Model
[Diagram labels: RS; CPS; Profile storage; User; Remote RS (two).]

Geographically Distributed RS
[Diagram labels: GDPS 1; GDPS 2; GDPS 3; Profile database; RS; Remote RS (two); User (four).]

Problems of high dimensional data
A: 3 4 4 5 5
B: 2 3 1 3 2 4 5
C: 2 2 1 3 5
A is highly correlated with B
B is highly correlated with C
We can't say that C is also highly correlated with A.
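
The point can be illustrated with Pearson correlation computed over co-rated items only. The ratings below are hypothetical (the column alignment of the slide's example is not recoverable), chosen so that A and B overlap, B and C overlap, but A and C share no rated items. The function name `pearson` is likewise an assumption.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation over items both users rated.

    Returns None when fewer than two items are co-rated or a variance is zero.
    """
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if len(pairs) < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)

# Hypothetical ratings over six items; None marks an unrated item
A = [3, 4, 5, None, None, None]
B = [2, 3, 4, 1, 2, 3]
C = [None, None, None, 2, 3, 4]

print(pearson(A, B))  # 1.0
print(pearson(B, C))  # 1.0
print(pearson(A, C))  # None
```

With sparse, high-dimensional data, the third case is common: two users may each resemble a third user yet have no items in common from which to measure their own similarity.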