Sparsity, Scalability and Distribution in Recommender Systems Doctoral Thesis Proposal
Badrul M. Sarwar
Computer Science & Engineering Dept., University of Minnesota
Advisor: Professor John Riedl
Talk Outline
Introduction to Recommender Systems
Research Challenges
Previous Work
Future Work and Completion Plan
Contributions and Conclusions
Information Overload
News items, books, journals, research papers
Consumer products, e-commerce items
TV programs, music CDs, movie titles
Web pages, Usenet articles, e-mails
Computerized Solution Techniques
Information Retrieval
– Immediate information needs
Information Filtering
– Content-based filtering
– Information filtering agents
Collaborative Filtering (CF)
– Recommender systems (RS) - interface
We’ll use the terms CF and RS interchangeably
Collaborative Filtering
Why another filtering technique?
– Problems with content-based filtering
Limitations due to computer processing
Lack of aesthetic sense
Different techniques for different media
CF adds the missing piece into the picture
– Human judgements
Collaborative Filtering Process
CF used successfully in e-commerce
Talk Outline
Introduction to Recommender Systems
Research Challenges
Previous Work
Future Work and Completion Plan
Contributions and Conclusions
Research Challenges
RC1: How can we improve RS quality and performance by using dimensionality reduction techniques?
RC2: How can we design a better interface for RS?
RC3: How can we design distributed RS to make them widely available?
RC4: How can we utilize clustering algorithms to improve scalability in RS?
RC1: Motivation and Importance
RS performance challenge: meet two important goals
– Quality: best CF is 77% accurate
– Scalability: response time, storage space
RC1: Motivation and Importance (contd.)
Stumbling blocks
High-dimensional data
– Computational complexity
– Noise and data over-fitting
Sparsity
– Reduced number of predictions
– Inferior quality
RC1: Specific Aims
Select a dimensionality reduction technique
Apply the technique
Evaluate quality
Study performance implications
Research Challenges
RC1: How can we improve RS quality and performance by using dimensionality reduction techniques?
RC2: How can we design a better interface for RS?
RC3: How can we design distributed RS to make them widely available?
RC4: How can we utilize clustering algorithms to improve scalability in RS?
RC2: Motivation and Importance
Need for explanation interface
– End-user point of view
Explanation of recommendations
– Algorithmic explanation
– Visual explanation
Visual explanation
– Visualization amplifies cognition
Benefits
– Increases usability and confidence
RC2: Specific Aims
Identify techniques
– Use of dimension reduction results
Implementation
Evaluation
– Usability study
– Comparison with text-based system
Research Challenge 3
How can we improve RS quality and performance by using dimensionality reduction techniques?
How can we design a better interface for RS?
How can we design distributed RSs to make them widely available?
How can we utilize clustering algorithms to improve scalability in RS?
RC3: Motivation and Importance
Increasing needs for RS services
– Availability challenge
Travelling users
Centralized RS problems
– Problems of scale and robustness
– Privacy concerns
RC3: Specific Aims
Taxonomy of RS application space
Design framework
– Key design issues
– Implementation models
Evaluation criteria
Analysis of different models
Research Challenge 4
How can we improve RS quality and performance by using dimensionality reduction techniques?
How can we design a better interface for RS?
How can we design distributed RS to make them widely available?
How can we utilize clustering algorithms to improve scalability in RSs?
RC4: Motivation and Importance
Scalability
Sparsity
Benefits of clustering
– Usenet (newsgroup)
Recent studies
Performance implications
RC4: Specific Aims
Identify clustering algorithms
– Soft cluster
– Hard cluster
Partition the data set
Apply Galaxy algorithm
Evaluate results
Talk Outline
Introduction to Recommender Systems
Research Challenges
Previous Work
Future Work and Completion Plan
Contributions and Conclusions
Research Approach
[Flowchart boxes: Identify problem; Develop hypotheses; Discover algorithm and solution techniques; Validate solution techniques; Create experiment framework; Apply solution techniques on experimental data; Create dataset; Separate training and test data]
Dimension Reduction Experiments
Singular Value Decomposition
– Matrix factorization
– Dimension reduction
– Prediction generation by reconstructing the matrix
Result highlights
– Quality of prediction improved
– We expect to see improved performance
Applying dimension reduction in RS
We applied an LSI/SVD-based technique
SVD decomposes a matrix into three factors: $R = U S V'$, where $R$ is $m \times n$
The reconstructed matrix $R_k = U_k S_k V_k'$ is the closest rank-$k$ matrix to the original matrix $R$.
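A minimal sketch of the rank-k reconstruction $R_k = U_k S_k V_k'$, assuming NumPy and a small dense matrix. The matrix `R`, the function name `rank_k_approximation`, and the choice `k = 2` are illustrative assumptions, not data or code from the talk.

```python
import numpy as np

def rank_k_approximation(R, k):
    """Return U_k, S_k, V_k' and the rank-k reconstruction R_k = U_k S_k V_k'."""
    # Thin SVD: R = U diag(s) Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    U_k = U[:, :k]            # m x k
    S_k = np.diag(s[:k])      # k x k
    Vt_k = Vt[:k, :]          # k x n
    return U_k, S_k, Vt_k, U_k @ S_k @ Vt_k

# Hypothetical 4 x 4 user-item ratings matrix (assumption, for illustration only).
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

U_k, S_k, Vt_k, R_k = rank_k_approximation(R, k=2)
frobenius_error = np.linalg.norm(R - R_k)
```

The Frobenius error of `R_k` equals the square root of the sum of the squared discarded singular values, which is what "closest rank-k matrix" means here.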
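SVD as prediction generator
[Figure: the factors $U_k$ ($m \times k$), $S_k$ ($k \times k$), and $V_k'$ ($k \times n$); a prediction combines the $i$-th row of the left factor with the $j$-th column of the right factor.]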
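One way to read the row-and-column picture is as a dot product between the i-th row of an m x k factor and the j-th column of a k x n factor. The sketch below assumes the singular values are split evenly between the two factors by a square root; that split, the names `build_factors` and `predict`, and the matrix `R` are assumptions made for illustration and are not confirmed by the slide.

```python
import numpy as np

def build_factors(R, k):
    """Build an m x k left factor and a k x n right factor from the rank-k SVD."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    sqrt_S = np.diag(np.sqrt(s[:k]))   # assumption: split S_k as sqrt(S_k) * sqrt(S_k)
    left = U[:, :k] @ sqrt_S           # m x k
    right = sqrt_S @ Vt[:k, :]         # k x n
    return left, right

def predict(left, right, i, j):
    """Score for row (user) i and column (item) j: i-th row dot j-th column."""
    return float(left[i, :] @ right[:, j])

# Hypothetical ratings matrix (assumption, for illustration only).
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

left, right = build_factors(R, k=2)
score = predict(left, right, i=0, j=2)
```

Because only one row and one column are touched, a single prediction costs O(k) once the two factors have been built.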
Results: SVD as prediction generator
[Figure: ROC and MAE plotted against dimension k for Data set 1 and Data set 2, each compared with DBLens ROC and DBLens MAE.]
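For reference, MAE (mean absolute error) averages the absolute difference between predicted and actual ratings over the test set. A minimal sketch, assuming NumPy; the three rating pairs are made-up values, not results from the talk.

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

# Hypothetical predictions and held-out ratings (assumption).
mae = mean_absolute_error([4.2, 3.1, 2.0], [4, 3, 3])
```

Lower MAE means predictions sit closer to the held-out ratings.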
Visual Interface: Initial Prototype
Used SVD results
Plotted users and items in 2-D feature space
Prototype tested in Spotfire
Problems:
– Distance is non-Euclidean
Design of Visual Interface Use of LSI/SVD for user-item visualization
Distributed RS: Work done Taxonomy of the application space – Based on
Talk Outline
Introduction to Recommender Systems
Research Challenges
Previous Work
Future Work and Completion Plan
Contributions and Conclusions
Future Work: Dimension Reduction
Study performance implications
SVD-based prediction
– Offline (model building)
– Online
Offline part is time-consuming
– Incremental SVD
– Fold-in
Online is very promising
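Fold-in avoids recomputing the offline SVD by projecting a new user's rating row into the existing k-dimensional space. The sketch below uses the standard LSI-style projection $u = r V_k S_k^{-1}$; the function names, the matrix `R`, and the new user's ratings are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def fit_svd(R, k):
    """Offline step: keep the top-k singular triplets of R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_user(r_new, s_k, Vt_k):
    """Project a new 1 x n rating row into k-space: u = r V_k S_k^{-1}."""
    return (r_new @ Vt_k.T) / s_k

# Hypothetical ratings matrix and new user (assumptions, for illustration only).
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])
U_k, s_k, Vt_k = fit_svd(R, k=2)

r_new = np.array([5.0, 5.0, 1.0, 1.0])
u_new = fold_in_user(r_new, s_k, Vt_k)     # k-dimensional user vector
scores = (u_new * s_k) @ Vt_k              # predicted row for the new user
```

Folding in a row that is already in `R` returns that user's row of `U_k`, which is a quick sanity check. The folded-in vectors are not re-orthogonalized, so the model drifts as more users are added this way.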
Future Work: Distributed RS
Evaluation
– Possible approaches
Identify suitable evaluation criteria
Select applications from taxonomy
Analyze using each model (hypothetical)
Analyze each implementation in terms of the evaluation criteria
Future Work: Visual Interface
Implement visual interface
Perform usability studies
– Set up live user experiment
– Identify usability questionnaires
– Conduct the usability survey
– Analyze results
– Revise/redesign interface
Future Work: Clustering in RS
Identify effective clustering algorithms
– For soft and hard clusters (K-means and E-M)
Partition the dataset
Apply Galaxy algorithm
Test for quality
– Accuracy and coverage
Test for performance
– Response time
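The partitioning step could look like the following hard-clustering sketch, which runs plain K-means (Lloyd's iterations) over user rating rows. It covers only the partitioning; the Galaxy algorithm is not shown. The data, the function name `kmeans`, and the parameters `n_iter` and `seed` are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain K-means: returns a hard cluster label per row and the centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X (fancy indexing returns a copy).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):               # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels, centers

# Hypothetical users x items ratings with two obvious taste groups (assumption).
X = np.array([[5.0, 5.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [5.0, 4.0, 2.0, 1.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0],
              [1.0, 2.0, 5.0, 5.0]])

labels, centers = kmeans(X, k=2)
partitions = [X[labels == c] for c in range(2)]   # one sub-matrix per cluster
```

Each partition can then be handed to a separate prediction step, so neighbor searches run over a fraction of the users.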
Future Work: Completion Plan
Research Challenge | Work to be done | Expected completion time
Challenge 1 | Performance implications of SVD as prediction generator | 8/1999
Challenge 2 | Implementation of the visual interface | 12/1999
Challenge 2 | Usability testing | 2/2000
Challenge 3 | Evaluation of distributed RS implementation techniques | 10/1999
Challenge 4 | Identification and application of clustering algorithm | 12/1999
Challenge 4 | Implementation of Galaxy algorithm | 2/2000
Challenge 4 | Quality and performance evaluation | 4/2000
Contributions
Use of a dimension reduction technique (SVD) as a high-quality prediction generator
– Submitted to ICDE 2000
Framework design for distributed RS
– Submitted to CIKM’99
Visual interfaces
Clustering to improve scalability
Distributed RS: Local Profile Model
[Diagram: a user with profile data at a local RS; the user carries his profile to remote RSs.]
Distributed RS: Central Profile Model
[Diagram: a CPS with user profile storage, an RS, and remote RSs.]
Geographically Distributed RS
[Diagram: GDPS 1, GDPS 2, and GDPS 3, a profile database, an RS, remote RSs, and users.]
Problems of high dimensional data
A: 3 4 4 5 5
B: 2 3 1 3 2 4 5
C: 2 2 1 3 5
A: 3 4 4 5 5
A is highly correlated with B
B is highly correlated with C
We can’t say that C is also highly correlated with A.
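A small sketch of why correlation need not carry over from A and B, and B and C, to A and C when ratings are sparse. The rating vectors below are hypothetical (they are not the values on the slide), and `pearson_on_overlap` is an illustrative helper that computes Pearson correlation over co-rated items only.

```python
import numpy as np

def pearson_on_overlap(x, y):
    """Pearson correlation over items both users rated; None if fewer than 2."""
    mask = ~np.isnan(x) & ~np.isnan(y)
    if mask.sum() < 2:
        return None
    return float(np.corrcoef(x[mask], y[mask])[0, 1])

nan = np.nan
# Hypothetical ratings over six items; nan marks an unrated item (assumption).
a = np.array([3.0, 4.0, 5.0, nan, nan, nan])
b = np.array([2.0, 3.0, 4.0, 1.0, 3.0, 5.0])
c = np.array([nan, nan, nan, 2.0, 3.0, 5.0])

ab = pearson_on_overlap(a, b)   # high: A and B agree on items 1-3
bc = pearson_on_overlap(b, c)   # high: B and C agree on items 4-6
ac = pearson_on_overlap(a, c)   # None: A and C share no rated items
```

Here A and C have no co-rated items at all, so nothing can be said about their correlation even though each correlates strongly with B.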