Collaborative Filtering 101

Adnan Masood
www.AdnanMasood.com
About Me
aka. Shameless Self Promotion
• Sr. Software Engineer / Tech Lead for Green Dot Corp. (Financial
Institution)
• Design and Develop Connected Systems
• Involved with SoCal Dev community, co-founded San Gabriel Valley .NET
Developers Group. Published author and speaker.
• MS. Computer Science, MCPD (Enterprise Developer), MCT, MCSD.NET
• Doctoral Student - Areas of Interest: Machine learning, Bayesian Inference,
Data Mining, Collaborative Filtering, Recommender Systems.
• Contact at [email protected]
• Read my Blog at www.AdnanMasood.com
• Doing a session in IASA 2008 in San Francisco on Aspect Oriented
Programming; for details visit http://www.iasaconnections.com
Agenda
What Does This Presentation Cover?
• Defines Collaborative Filtering and its use in Recommendation Systems.
• Background and current state of the applications of Collaborative Filtering algorithms and their feature sets.
• Illustrative implementation of the algorithms, with examples.
• Results on a large dataset using different algorithms.
• Recommendations on what to use when doing collaborative filtering on large-scale datasets.
• Overview of SQL Server BI and Prediction Engine
Recommender Systems Zeitgeist
What is Collaborative Filtering and What
problem does it solve?
• “Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. Such reactions may be that a document was particularly interesting (or particularly uninteresting). These reactions, more generally called annotations, can be accessed by others’ filters.”
-Communications of the ACM, Dec. 1992
• Collaborative Filtering (CF) finds items of interest to a user based on the preferences of other, similar users. It assumes that human behavior is predictable.
• Recommender Systems (or recommenders) suggest items of interest based on a user’s preferences, behavior, and information about the items themselves.
-Recommenders Everywhere, WikiSym ’07, ACM
• With the large amounts of data generated in e-commerce systems, the classical methods of recommendation are insufficient and cannot handle information overload. Modern automated recommendation systems are built using Collaborative Filtering to help deal with large-scale datasets.
• Information overload problem: 20K movies on Netflix, 250K songs on Yahoo Music. Total number of books on Amazon?
• First ACM Recommender Systems Conference: October 19-20, 2007, Minneapolis, Minnesota, USA, by SIGCHI
Types of Recommendation Systems
• Recommender systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [Resnick and Varian 1997].
Each technique is described by its Background, Input, and Process:
• Collaborative
– Background: Ratings from U of items in I.
– Input: Ratings from u of items in I.
– Process: Identify users in U similar to u, and extrapolate from their ratings of i.
• Content-based
– Background: Features of items in I.
– Input: u’s ratings of items in I.
– Process: Generate a classifier that fits u’s rating behavior and use it on i.
• Demographic
– Background: Demographic information about U and their ratings of items in I.
– Input: Demographic information about u.
– Process: Identify users that are demographically similar to u, and extrapolate from their ratings of i.
• Utility-based
– Background: Features of items in I.
– Input: A utility function over items in I that describes u’s preferences.
– Process: Apply the function to the items and determine i’s rank.
• Knowledge-based
– Background: Features of items in I. Knowledge of how these items meet a user’s needs.
– Input: A description of u’s needs or interests.
– Process: Infer a match between I and u’s need.
Applications
• Search
• Social Networking
• Product Recommendations
• Demographic Targeted Advertisements
• Fraud Detection
– Pattern Detection / Clustering
• Security
– Firewall outlier analysis
– Text Mining Outliers
Applications in Information Security
(AT&T Hancock)
Major Challenges in Recommender System
Design
• Scalability
• Real-time Analysis and Prediction
• Performance
• Accuracy
• Robustness
• Growing Area of Research in KDD, Machine Learning and AI
Issues and Future Research Directions
• K-NN Optimization
• Explainability
(D. Billsus and M. Pazzani, “A Personal News Agent that Talks, Learns and Explains,” Proc. Third Ann. Conf. Autonomous Agents, 1999.)
• Hybrid Algorithms between memory-based and model-based techniques.
[Pennock, David M. and Horvitz, Eric 1999]
• Cold Start Problems
(A.I. Schein, A. Popescul, L.H. Ungar, and D.M. Pennock, “Methods and Metrics for Cold-Start Recommendations,” Proc. 25th Ann. Int’l ACM SIGIR Conf., 2002.)
• Privacy
(N. Ramakrishnan, B.J. Keller, B.J. Mirza, A.Y. Grama, and G. Karypis, “Privacy Risks in Recommender Systems,” IEEE Internet Computing, vol. 5, no. 6, pp. 54-62, Nov./Dec. 2001.)
• Error Method with Look Ahead
• Boltzmann Machines
• Vertical Niche Markets
Popular Recommendation Systems
Recommendation systems, by domain:
• Movies or music: EachMovie, Morse, Firefly, ...
• News or articles: Tapestry, GroupLens, Lotus Notes, ...
• Web pages: Phoaks, GAB, Fab, ...
• Do-I-Care [Turnbull, 1998; Collaborative Filtering workshop, 1996]
• Fab recommendation system [Turnbull, 1998]
• Firefly [Turnbull, 1997 and 1998]
• GAB (group asynchronous browsing) [Wittenburg et al., 1998]
• Grassroots system [Turnbull, 1998]
• Resnick [Resnick et al., 1994]
• Let's Browse / Letizia [Lieberman, 1996; Pryor, 1998]
• Lotus Notes [Turnbull, 1998]
• Mosaic system [Turnbull, 1997]
• PHOAKS (People Helping One Another Know Stuff) [Terveen et al., 1997]
• Pointers [Maltz, 1995]
• Siteseer [Turnbull, 1997]
• Tapestry [Goldberg, 1992]
• Yahoo [Turnbull, 1998]
• The WebWatcher system [Joachims, 1996]
Classification of Collaborative Filtering Algorithms
• A popular classification of CF algorithms was proposed by Breese et al. (Convergent algorithms for collaborative filtering, Proceedings of the 4th ACM Conference on Electronic Commerce), dividing them into Memory-based and Model-based methods.
• Memory-based methods work on the principle of aggregating the labeled data and attempt to match recommenders to those seeking recommendations. The most common memory-based methods are based on the notion of nearest neighbor, using a variety of distance metrics.
– Use the entire database of user ratings to make predictions.
– Find users whose voting histories are similar to the active user’s.
– Use these users’ votes to predict ratings for products not voted on by the active user.
• Model-based methods, on the other hand, try to learn a compact model from the training data, for example by learning the parameters of a parametric posterior distribution. From an operational point of view, memory-based methods potentially work with the entire training set, so their prediction cost scales linearly with the amount of training data, while a model-based method, once trained, predicts in time that does not grow with the training set.
– Construct a model from the vote database.
– Use the model to predict the active user’s ratings.
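The three memory-based steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in this deck: the `ratings` dictionary is made-up toy data, the helper names are hypothetical, and the weighting scheme (Pearson correlation, deviation-from-mean prediction) is one common choice among several.

```python
from math import sqrt

# Hypothetical toy vote database (user -> {item: rating}); not Netflix data.
ratings = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3, "m4": 5},
    "cat": {"m1": 1, "m2": 5, "m3": 2, "m4": 1},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(a, b):
    """Pearson correlation computed over the items both users rated."""
    common = sorted(set(ratings[a]) & set(ratings[b]))
    if len(common) < 2:
        return 0.0
    mean_a = mean(ratings[a][i] for i in common)
    mean_b = mean(ratings[b][i] for i in common)
    dev_a = [ratings[a][i] - mean_a for i in common]
    dev_b = [ratings[b][i] - mean_b for i in common]
    denom = sqrt(sum(x * x for x in dev_a)) * sqrt(sum(y * y for y in dev_b))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom

def predict(active, item):
    """Active user's mean plus the similarity-weighted deviations of others."""
    base = mean(ratings[active].values())
    weighted, norm = 0.0, 0.0
    for other, votes in ratings.items():
        if other == active or item not in votes:
            continue
        w = pearson(active, other)
        weighted += w * (votes[item] - mean(votes.values()))
        norm += abs(w)
    # The result is not clipped to the 1..5 rating scale.
    return base if norm == 0 else base + weighted / norm

prediction = predict("ann", "m4")
print(round(prediction, 4))
```

Every prediction walks the whole vote database, which is the linear scaling mentioned above; a model-based method would pay that cost once, at training time.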
Classification of Collaborative Filtering Algorithms
• Memory-based and Model-based Algorithms (Breese et al., 1998)
• Memory-based Algorithms
– Mean Squared Differences
– Pearson Correlation (Neighborhood based interpolation k-NN)
– Vector Similarity
• Model-Based Algorithms
– Bayesian Network Models
– Neural Network Models (Boltzmann Machines)
• Other / Hybrid Algorithms
– A hybrid memory- and model-based approach [Pennock, David M. and Horvitz, Eric 1999]
– Singular Value Decomposition (SVD)
– Probabilistic Latent Semantic Analysis
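As a rough illustration of two of the memory-based measures listed above, the sketch below computes the mean squared difference and the vector (cosine) similarity of two users over the items they both rated. The two rating dictionaries and the function names are assumptions made for the example.

```python
from math import sqrt

# Hypothetical ratings for two users (item -> rating).
ann = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3, "m4": 5}

def co_rated(a, b):
    """Pairs of ratings for the items both users have rated."""
    return [(a[i], b[i]) for i in sorted(set(a) & set(b))]

def mean_squared_difference(a, b):
    """Lower means more similar; 0 means identical ratings on shared items."""
    pairs = co_rated(a, b)
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

def vector_similarity(a, b):
    """Cosine of the angle between the two co-rated rating vectors."""
    pairs = co_rated(a, b)
    dot = sum(x * y for x, y in pairs)
    norm_a = sqrt(sum(x * x for x, _ in pairs))
    norm_b = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

msd = mean_squared_difference(ann, bob)
cosine = vector_similarity(ann, bob)
print(msd, round(cosine, 4))
```

Note the opposite orientation of the two measures: mean squared difference is a distance, while vector similarity is a similarity, so a neighbor search has to sort them in opposite directions.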
Algorithms and their Performance
Reference: The Netflix Prize, by James Bennett and Stan Lanning, KDDCup’07, August 12, 2007, San Jose, California, USA.
Data Sets
Netflix Database
• There are 17770 movies.
• There are 480189 users.
• CustomerIDs range from 1 to 2649429, with gaps.
• Ratings are on a five star (integral) scale from 1 to 5.
• YearOfRelease ranges from 1890 to 2005.
• Training set consists of 100 million records.
• Qualifying dataset size is 2817131. It contains movie IDs from 1-9999. Predictions need to be submitted on this dataset.
• Probe dataset size is 1408395. It contains movie IDs from 1-9999. This dataset is meant to be used for checking the RMSE before proceeding to the qualifying dataset prediction.
Download Link
http://www.netflixprize.com/download
MovieLens Database
• DataSet 1 consists of 100,000 ratings for 1682 movies by 943 users.
• The second one consists of approximately 1 million ratings for 3900 movies by 6040 users.
• Download Link: http://www.grouplens.org/node/73
• UC Irvine Datasets
Experiment Details and Methodologies
• Hardware
– Cluster of 3 P-IV machines with ~2 GB RAM, along with a remote desktop laptop (controller)
– ~1 TB storage (with backups)
• DataSet
– Netflix DataSet
– “Netflix provides a large movie rating dataset consisting of over 100 million ratings (and their dates) from approximately 480,000 randomly-chosen users and 18,000 movies. The data were collected between October, 1998 and December, 2005 and represent the distribution of all ratings Netflix obtained during this time period. Given this dataset, the task is to predict the actual ratings of over 3 million unseen ratings from these same users over the same set of movies”. [Yew Jin Lim and Yee Whye Teh, “Variational Bayesian Approach to Movie Rating Prediction”, KDDCup’07, August 12, 2007, San Jose, California, USA]
• Benchmarking
– Metrics calculated from time and accuracy (RMSE) results.
Averages and Mean Statistics
Baseline Method                    RMSE
3 stars                            1.313
Global Mean                        1.130
Movie Mean                         1.052
User Mean                          1.043
(Movie Mean + User Mean) / 2       1.004
Netflix “Cinematch” system         0.951
Reference: Gillick et al., 2006, Stanford University
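To make the baselines in the table concrete, here is a small sketch of how each one could be computed and scored by RMSE. The `train` and `probe` lists are invented toy data, so the scores it prints are not comparable to the Netflix figures above; falling back to the global mean for an unseen user or movie is an assumption of this sketch.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user, movie, rating) triples; not taken from the Netflix data.
train = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
         ("u2", "m3", 2), ("u3", "m2", 4), ("u3", "m3", 1)]
probe = [("u1", "m3", 2), ("u2", "m2", 3), ("u3", "m1", 5)]

def group_means(index):
    """Mean training rating per user (index 0) or per movie (index 1)."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in train:
        sums[row[index]][0] += row[2]
        sums[row[index]][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

global_mean = sum(r for _, _, r in train) / len(train)
user_mean = group_means(0)
movie_mean = group_means(1)

predictors = {
    "3 stars": lambda u, m: 3.0,
    "global mean": lambda u, m: global_mean,
    "movie mean": lambda u, m: movie_mean.get(m, global_mean),
    "user mean": lambda u, m: user_mean.get(u, global_mean),
    "(movie mean + user mean) / 2": lambda u, m: (
        movie_mean.get(m, global_mean) + user_mean.get(u, global_mean)) / 2,
}

def rmse(predict):
    """Root mean squared error of a predictor on the probe set."""
    return sqrt(sum((predict(u, m) - r) ** 2 for u, m, r in probe) / len(probe))

scores = {name: rmse(p) for name, p in predictors.items()}
for name, score in scores.items():
    print(f"{name:30s} {score:.3f}")
```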
K-Nearest Neighbor
How does it work?
The technique uses individual user distributions to measure the distance between users, then makes predictions r(ui, mj) based on the ratings given to mj by users near ui. The intuition here is that if many users rate two movies the same, the movies should be considered similar. Conversely, if many users rate two movies differently, the movies should be considered different. (Don Gillick, UC Berkeley)
Calculate the "similarity" between each pair of users by comparing how they have rated common content. If Frank has rated something 4/5 stars and Jane has also rated it 4/5 stars, then these users would be considered similar. These calculations are very time consuming, as this essentially becomes the "handshake" problem, i.e. the calculation has to be performed for each unique pair of users. The number of unique pairs is n (n - 1) / 2. For the Netflix challenge (n = 480,189 users), the number of unique pairs is 115,290,497,766... yes, that's 115 billion.
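The handshake count is easy to verify. The function name below is a hypothetical helper; the user count is the one given for the Netflix dataset.

```python
def unique_pairs(n):
    """Number of distinct unordered pairs among n users: n (n - 1) / 2."""
    return n * (n - 1) // 2

netflix_users = 480_189
pairs = unique_pairs(netflix_users)
print(f"{pairs:,}")
```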
1. LOTR: The Two Towers
2. LOTR: The Return of the King
3. LOTR: The Fellowship of the Ring: Extended Edition
4. LOTR: The Two Towers: Extended Edition
5. Raiders of the Lost Ark
6. LOTR: The Return of the King: Extended Edition
7. Pirates of the Caribbean: The Curse of the Black Pearl
8. The Matrix
9. The Shawshank Redemption: Special Edition
10. Braveheart
1. Monsters
2. Shrek (Full-screen)
3. Shrek 2
4. LOTR: The Two Towers
5. Pirates of the Caribbean: The Curse of the Black Pearl
6. The Incredibles
7. The Sixth Sense
8. The Shawshank Redemption: Special Edition
9. LOTR: The Fellowship of the Ring
10. Forrest Gump
K-Nearest Neighbor
How does it work? An Example
Step 1: Content-based survey classification.
X1 = User’s Rating    X2 = Movie’s Mean Rating    Y = Classification
7                     7                           Likely to be seen by an action fan
7                     4                           Likely to be seen by an action fan
3                     4                           Not likely to be seen by an action fan
1                     4                           Not likely to be seen by an action fan
Now a new user rates a new movie with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new movie is?
1. Determine the parameter K = number of nearest neighbors. Suppose we use K = 3.
K-Nearest Neighbor
How does it work? An Example (cont.)
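Step 2: Calculate the distance between the query instance and all the training samples.
The coordinate of the query instance is (3, 7). Instead of the distance, we compute the squared distance, which is faster to calculate (no square root):
(7, 7): (7 - 3)^2 + (7 - 7)^2 = 16
(7, 4): (7 - 3)^2 + (4 - 7)^2 = 25
(3, 4): (3 - 3)^2 + (4 - 7)^2 = 9
(1, 4): (1 - 3)^2 + (4 - 7)^2 = 13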
K-Nearest Neighbor
How does it work? An Example (cont.)
Step 3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
K-Nearest Neighbor
How does it work? An Example (cont.)
Step 4. Gather the categories of the nearest neighbors. Notice that for the second training sample, (7, 4), the category (Y) is not included, because the rank of this sample is greater than 3 (= K).
Step 5. Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance. We have 2 “Not likely to be seen by an action fan” and 1 “Likely to be seen by an action fan”. Since 2 > 1, we conclude that the new movie with X1 = 3 and X2 = 7 falls in the “Not likely to be seen by an action fan” category.
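The five steps of this example can be written as a short Python sketch. The training samples and the query are the ones from the slides; the function and variable names are assumptions made for illustration.

```python
from collections import Counter

LIKELY = "Likely to be seen by an action fan"
NOT_LIKELY = "Not likely to be seen by an action fan"

# (X1 = user's rating, X2 = movie's mean rating) -> Y, from the survey table.
training = [((7, 7), LIKELY), ((7, 4), LIKELY),
            ((3, 4), NOT_LIKELY), ((1, 4), NOT_LIKELY)]

def squared_distance(p, q):
    """Squared Euclidean distance; skipping the root keeps the same ordering."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_classify(query, samples, k):
    # Steps 2-3: compute the distances and keep the k closest samples.
    ranked = sorted(samples, key=lambda s: squared_distance(query, s[0]))
    # Steps 4-5: gather the neighbors' categories and take the majority.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

distances = [squared_distance((3, 7), point) for point, _ in training]
result = knn_classify((3, 7), training, k=3)
print(distances, result)
```

With K = 3 the sample (7, 4) at squared distance 25 is ranked fourth and takes no part in the vote.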
Singular Value Decomposition (SVD)
• The user rating vectors can be represented by an m × n matrix A, with m users and n products, where a_ij is the rating of user i for product j. [Qu & Yang, 2000]
• Through singular value decomposition, A can be factored into U S V^T, where U and V are orthogonal matrices and S is a zero matrix except for the diagonal entries, which are defined as the singular values of A.
• U is representative of the response of each user to certain features.
• V is representative of the amount of each feature present in each product.
• S is a matrix related to the importance of each feature in the overall determination of the rating. [Pryor, H. Michael, 1998]
How does SVD work?
An Example for inner workings of the Algorithm
Movies
1. Pulp Fiction: The movie has excellent cinematic value and storyline but has long dialogues and conversation sequences.
2. From Dusk Till Dawn: The movie has lots of action, a decent storyline and gets to the point fairly quickly, but isn't cinematic magic.
3. The Big Lebowski: Low budget but with excellent dialogues and quite an artistic niche. Not the best cinema work and continuity.
4. Children of Men: Excellent cinematography but a rather long storyline, sometimes not keeping the viewer captivated. Not of artistic value.
Reviewers
A. Andrea the Action fan - likes action and short, well put together movies. Long stories and artsy stuff do not typically attract her, but she always appreciates good cinematography.
B. Arthur the Art Lover - loves niche movies but also appreciates action; does not mind long movies as long as they have good artistic value.
C. Dave the director - a film school graduate who loves action, good camera work, storyline and dialogues. Not a big art fan.
D. Jim the average movie guy - likes action and thrillers but detests long movies.
How does SVD work?
The Reviewer – Movie - Rating Matrix
(rows: reviewers A to D; columns: movies 1 to 4)
      1    2    3    4
A     5    4    2    6
B     3    7    5    2
C     6    4    1    4
D     ?    ?    ?    ?
The decomposition of the three rated rows (A to C), with W as the singular value matrix:
A = U * W * V^T
U =
-0.60   0.41   0.69
-0.58  -0.81  -0.02
-0.55   0.41  -0.73
W =
14.49   0.00   0.00   0.00
 0.00   4.93   0.00   0.00
 0.00   0.00   1.65   0.00
V =
-0.56   0.42  -0.60  -0.39
-0.60  -0.49  -0.18   0.61
-0.32  -0.57   0.33  -0.68
-0.48   0.50   0.70   0.14
V^T =
-0.56  -0.60  -0.32  -0.48
 0.42  -0.49  -0.57   0.50
-0.60  -0.18   0.33   0.70
-0.39   0.61  -0.68   0.14
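The decomposition can be checked with NumPy. This is a sketch under the assumption that NumPy is available; `numpy.linalg.svd` may return singular vectors with signs flipped relative to the slide, which does not change the product U * W * V^T.

```python
import numpy as np

# Ratings of reviewers A, B, C (rows) for movies 1-4 (columns).
A = np.array([[5, 4, 2, 6],
              [3, 7, 5, 2],
              [6, 4, 1, 4]], dtype=float)

# full_matrices=True gives a 3x3 U and a 4x4 V^T, matching the slide's shapes.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the 3x4 matrix W with the singular values on its diagonal.
W = np.zeros(A.shape)
W[:len(s), :len(s)] = np.diag(s)

reconstructed = U @ W @ Vt
print(np.round(s, 2))
print(np.round(reconstructed, 2))
```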
How does SVD work?
Predicting what a new user would like
W is the main component for the principal components:
14.49   0.00   0.00   0.00
 0.00   4.93   0.00   0.00
 0.00   0.00   1.65   0.00
In the equations below, S_kk is the k-th diagonal entry of W, and Jim (reviewer D) is the fourth user, so his unknown feature responses are U_41, U_42, ...
• Now imagine that Jim rated the first movie 2. Keeping only the first feature:
R_j = U_41 S_11 V_j1
2 = U_41 S_11 V_11
We solve for U_41. To predict R_2, R_3 and R_4, we substitute U_41 into the above equation and get
P = [2 2.1554 1.1577 1.7312]
• Now he has rated the second movie 7. Keeping the first two features:
R_1 = U_41 S_11 V_11 + U_42 S_22 V_12
R_2 = U_41 S_11 V_21 + U_42 S_22 V_22
By solving for both U_41 and U_42, we can recalculate the predictions.
P = [2 7 5.3660 1.0166]
This is similar to B:
[3 7 5 2]
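The two prediction steps for Jim can be reproduced with a short NumPy sketch. It assumes NumPy is available; the function `predict_new_user` and its arguments are hypothetical names. It solves for the combined unknowns U_4k S_kk directly, using as many features as there are known ratings, which is equivalent to the equations above.

```python
import numpy as np

# Ratings of reviewers A, B, C (rows) for movies 1-4 (columns).
A = np.array([[5, 4, 2, 6],
              [3, 7, 5, 2],
              [6, 4, 1, 4]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def predict_new_user(known, n_features):
    """known maps a 0-based movie index to the new user's rating."""
    movies = sorted(known)
    # Rows of V for the rated movies, restricted to the leading features.
    V_known = Vt[:n_features, movies].T
    rated = np.array([known[m] for m in movies], dtype=float)
    # coords[k] plays the role of U_4k * S_kk in the slide's equations.
    coords, *_ = np.linalg.lstsq(V_known, rated, rcond=None)
    return coords @ Vt[:n_features, :]

p_one = predict_new_user({0: 2}, n_features=1)        # first movie rated 2
p_two = predict_new_user({0: 2, 1: 7}, n_features=2)  # second movie rated 7
print(np.round(p_one, 4))
print(np.round(p_two, 4))
```

Because the unknown coordinates absorb any sign flip of the singular vectors, the predictions do not depend on the sign convention of the SVD routine.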
Recommendations for Large Scale
Recommender Systems
• There is no silver bullet. The BellKor solution to the Netflix Prize used modified k-NN, and the final solution (RMSE = 0.8712) consists of blending 107 individual results.
• Occam’s Razor: simplicity is good on a smaller scale.
• Algorithms’ performance on accuracy (low to high)
– Averages, Bayesian, Multinomial Distribution (Co-Variance), k-NN (Pearson Correlation), Singular Value Decomposition, Specialized Hybrid Techniques
• Algorithms’ performance on time-space (low to high)
– Averages, Singular Value Decomposition, Specialized Hybrid Techniques, Multinomial Distribution (Co-Variance), Bayesian, k-NN (Pearson Correlation)
• Algorithms’ performance on scalability (low to high)
– Averages, k-NN (Pearson Correlation), Multinomial Distribution (Co-Variance), Specialized Hybrid Techniques, Bayesian, Singular Value Decomposition
• Regardless of the algorithm, perform offline processing and cache the results for maximum performance and scalability.
• Build a hybrid design to support cold-start, privacy and content control.
• Use adaptive models for progressively better recommendations.
SQL Server Data Mining
What's new in BI for SQL Server 2008
Lynn Langit
Room: 107
• www.SQLServerDataMining.com
• http://www.microsoft.com/sql/technologies/dm/default.mspx
• http://scis.nova.edu/~adnan/
SQL Server DM Algorithms
Microsoft Association Algorithm
Microsoft Clustering Algorithm
Microsoft Decision Trees Algorithm
Microsoft Naive Bayes Algorithm
Microsoft Neural Network Algorithm (SSAS)
Microsoft Sequence Clustering Algorithm
Microsoft Time Series Algorithm
Microsoft Linear Regression Algorithm
Microsoft Logistic Regression Algorithm
SQL Server Prediction Queries
The following query retrieves report data indicating which customers are likely to purchase a bicycle, and the probability that they will do so.
SELECT
  t.FirstName, t.LastName,
  (Predict([Bike Buyer])) AS [PredictedValue],
  (PredictProbability([Bike Buyer])) AS [Probability]
FROM [TM Decision Tree]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT [FirstName], [LastName], [CustomerKey], [MaritalStatus],
       [Gender], [YearlyIncome], [TotalChildren], [NumberChildrenAtHome],
       [HouseOwnerFlag], [NumberCarsOwned], [CommuteDistance]
     FROM [dbo].[DimCustomer]') AS t
ON  [TM Decision Tree].[Marital Status] = t.[MaritalStatus]
AND [TM Decision Tree].[Gender] = t.[Gender]
AND [TM Decision Tree].[Yearly Income] = t.[YearlyIncome]
AND [TM Decision Tree].[Total Children] = t.[TotalChildren]
AND [TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome]
AND [TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag]
AND [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]
AND [TM Decision Tree].[Commute Distance] = t.[CommuteDistance]
WHERE (Predict([Bike Buyer])) = @Buyer
  AND (PredictProbability([Bike Buyer])) > @Probability
SQL Server Support for Prediction
SELECT FLATTENED
  TopCount(Predict([Invoice Detail], INCLUDE_STATISTICS), $AdjustedProbability, 5)
FROM [assoc1]
NATURAL PREDICTION JOIN
( SELECT 'Female' AS [Gender], 25 AS [Age],
    ( SELECT 'Mountain bottle cage' AS [Product Name]
      UNION
      SELECT 'Hydration pack -70oz' AS [Product Name]
      -- specify Gender, Marital Status, Income
    ) AS [Invoice Detail]
) AS t
Questions?