Matrix Factorization
Models, Algorithms and Applications
Shuang Hong Yang
http://www.cc.gatech.edu/~syang46/
[email protected]
May, 2010
Outline
• Problem Definition
• Overview
– Taxonomy by targeted tasks
– Taxonomy by models
– Taxonomy by algorithms
• Representative work
• Summary and Discussion
Problem Definition
• Matrix Factorization: for a given matrix M, find a compact (low-rank)
approximation
(Figure: entry M(a,b) is approximated via a row factor u(a), a column factor v(b), and a core matrix D.)
– M may be partially observed (i.e., entry missing)
– In the simplest form:
(U, V) = arg min ||M − U^T V||_F^2
an identity function f(x) = x is used as the link function
U, V and D interact in a multiplicative fashion
D is assumed to be an identity matrix
Euclidean distance is used as the measure of goodness
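With a fully observed M, the simplest form above is solved exactly by truncated SVD (the Eckart-Young theorem). A minimal numpy sketch, with illustrative names:

```python
import numpy as np

# A minimal sketch of the simplest MF form: given M, find low-rank factors
# U, V minimizing ||M - U^T V||_F^2. With no missing entries, the optimum
# is the truncated SVD (Eckart-Young theorem).
def lowrank_factorize(M, k):
    """Return U (k x m), V (k x n) such that U.T @ V is the best rank-k fit."""
    P, s, Qt = np.linalg.svd(M, full_matrices=False)
    U = (P[:, :k] * np.sqrt(s[:k])).T          # k x m
    V = np.sqrt(s[:k])[:, None] * Qt[:k, :]    # k x n
    return U, V

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))  # exactly rank 3
U, V = lowrank_factorize(M, 3)
assert np.allclose(U.T @ V, M)  # a rank-3 matrix is recovered exactly
```

Splitting the singular values evenly (sqrt on each side) is one convention; any split U^T V = P_k S_k Q_k^T gives the same approximation.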
Problem Definition (cont)
• Matrices Co-Factorization
For a given set of related matrices {M}, find a coupled set of compact (low-rank)
approximations
– Each M represents an interaction observation between two entities
– Multi-View MF:
– Joint MF:
(Figure: entity types a, b, c connected by the related matrices being co-factorized.)
Overview: MF taxonomy
• Targeted tasks:
– Dimensionality reduction
PCA and other spectral algorithms
LSI and other SVD algorithms
NMF and other (convex) optimization algorithms
PPCA and other statistical models
– Clustering
K-means, mean-shift, min-cut, normalized cut, NMF, etc.
Gaussian mixture model, Bi-Gaussian model, etc
– Factor analysis (e.g., profiling, decomposition)
ICA, CCA, NMF, etc.
SVD, MMMF, etc.
– Codebook learning
Sparse coding, k-means, NMF, LDA, etc.
– Topic modeling
LSI, LDA, PLSI, etc.
– Graph mining
Random walk, PageRank, HITS, etc.
– Prediction
Classification, Regression
Link prediction, matrix completion, community detection
Collaborative filtering, recommendation, learning to rank
Domain adaptation, multi-task learning
Overview: MF taxonomy
• Models:
– Computational: by optimization
models differ in objective and regularizer design
Objective:
L2 error minimization (least square, Frobenius in matrix form)
L1 error minimization (least absolute deviation)
Hinge, logistic, log, cosine loss
Huber loss, ε-loss, etc.
Information-theoretic loss: entropy, mutual information, KL-divergence
Exponential family loss and Bregman divergence: logistic, log, etc.
Graph laplacian and smoothness
Joint loss of fitting error and prediction accuracy
Regularizer:
L2 norm, L1 norm, Ky-Fan (e.g., nuclear norm)
Graph Laplacian and smoothness
Lower & upper bound constraints (e.g., positivity constraint)
Other constraints: linear constraint (e.g., probabilistic normalization), quadratic constraint (e.g., covariance), orthogonal constraint
– Statistic: by inference
models differ in factorization, prior and conditional design
Factorization:
Symmetric: p(a,b) = Σzp(z)p(a|z)p(b|z)
Asymmetric: p(a,b) = p(a)Σzp(z|a)p(b|z)
Conditional: usually exponential family
Gaussian, Laplacian, Multinomial, Bernoulli, Poisson
Prior:
Conjugate prior
Popularly picked ones: Gaussian, Laplacian (or exponential), Dirichlet
Non-informative prior: max entropy prior, etc.
Nonparametric: ARD, Chinese restaurant process, Indian buffet process, etc.
Overview: MF taxonomy
• Models:
– Connection between the two lines [Collins et al, NIPS 02; Long et al, KDD 07; Singh et al, ECML 08]

Computational        Statistic
loss function        conditional
  L2                   Gaussian
  L1                   Laplacian
  Logistic             Bernoulli
  Bregman/KL           Exponential family
  …                    …
regularization       prior
  L2                   Gaussian
  L1                   Lap./Exp.
  Lap. smoothness      Gaussian Rand Field
  …                    …
Overview: MF taxonomy
• Algorithms:
– Deterministic:
• Spectral analysis
• Matrix decomposition: SVD, QR, LU
• Solving linear systems
• Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, SDP, etc.
• Alternating coordinate descent
• LARS, IRLS
• EM
• Mean field, Variational Bayesian, Expectation Propagation, collapsed VB
• …
– Stochastic:
• Stochastic gradient descent (back propagation, message passing)
• Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
• Random walk
• Simulated annealing, annealing EM
• Randomized projection
• …
Representative Work
• Spectral dimensionality reduction / clustering:
– PCA:
• L2 loss + orthogonal constraint
min ||M − U^T V||_F^2, subject to: V^T V = I
• Solve by spectral analysis of M^T M
• Analogous to factor-based collaborative filtering
– Laplacian eigenmap:
• Laplacian smoothness + orthogonal constraint
min Σij wij ||ui V − uj V||^2, subject to: V^T V = I
• Graph encodes neighboring info, e.g., heat kernel, kNN
• Analogous to neighbor-based collaborative filtering
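The PCA bullet above can be sketched directly: the orthogonality-constrained problem is solved by the top eigenvectors of M^T M. A minimal sketch, assuming the data is already mean-centered; function names are illustrative:

```python
import numpy as np

# Sketch of PCA as spectral analysis: the L2 + orthogonality problem
# min ||M - U^T V||_F^2 is solved by taking the rows of V to be the top
# eigenvectors of M^T M (equivalently, the top right singular vectors of M).
def pca_components(M, k):
    C = M.T @ M                       # (n x n) scatter matrix
    evals, evecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    return evecs[:, ::-1][:, :k].T    # top-k eigenvectors as rows

rng = np.random.default_rng(1)
M = rng.standard_normal((100, 5))     # assumed mean-centered
V = pca_components(M, 2)
assert np.allclose(V @ V.T, np.eye(2))  # components are orthonormal
```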
Representative Work
• Plain CF:
– Factor based (Ky-Fan):
• L2 loss + L2 regularization
min ||M − U^T V||_F^2 + C1 ||U||_F^2 + C2 ||V||_F^2
• Solve by SVD
• Analogous to LSI, PCA, etc.
– Neighbor based (item or query oriented):
• Does not explicitly perform factorization, but can be viewed equivalently as
min Σij wij ||ui Mi − uj Mi||^2, subject to: Σv uiv = 1, uiv ≥ 0
• Graph encodes neighboring info, e.g., heat kernel, kNN
• Analogous to k-means, Laplacian eigenmap
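The neighbor-based view above can be illustrated with a small item-kNN predictor. This is a generic sketch with illustrative names, not code from any cited paper:

```python
import numpy as np

# Illustrative sketch of neighbor-based CF: predict a missing rating as a
# similarity-weighted average over the k most similar items, where similarity
# is cosine similarity between item rating columns.
def item_knn_predict(M, user, item, k=2):
    target = M[:, item]
    sims = []
    for j in range(M.shape[1]):
        if j == item:
            continue
        a = M[:, j]
        s = a @ target / (np.linalg.norm(a) * np.linalg.norm(target) + 1e-12)
        sims.append((s, j))
    sims.sort(reverse=True)                      # most similar items first
    top = sims[:k]
    num = sum(s * M[user, j] for s, j in top)
    den = sum(abs(s) for s, j in top) + 1e-12
    return num / den

M = np.array([[5., 4., 1.], [4., 5., 2.], [1., 2., 5.]])
pred = item_knn_predict(M, user=0, item=2, k=2)
assert 1.0 <= pred <= 5.0  # prediction stays within the rating scale
```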
Representative Work
• Joint factor-neighbor based CF:
[Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD'08]
– L2 loss + L2 regularization
– Neighbor graph constructed by Pearson correlation
– Solve by stochastic gradient descent
– Analogous to locality (Laplacian) regularized PCA
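The SGD solver mentioned above can be sketched as follows; hyperparameters and names are illustrative, not the cited paper's actual settings:

```python
import numpy as np

# Minimal sketch of MF trained by stochastic gradient descent on observed
# entries only: minimize sum over observed (i,j) of
# (m_ij - u_i^T v_j)^2 + reg * (||u_i||^2 + ||v_j||^2).
def sgd_mf(observed, n_users, n_items, k=4, lr=0.05, reg=0.02, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for i, j, m in observed:
            err = m - U[i] @ V[j]
            U[i] += lr * (err * V[j] - reg * U[i])  # gradient step on u_i
            V[j] += lr * (err * U[i] - reg * V[j])  # gradient step on v_j
    return U, V

obs = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0)]
U, V = sgd_mf(obs, n_users=3, n_items=3)
rmse = np.sqrt(np.mean([(m - U[i] @ V[j]) ** 2 for i, j, m in obs]))
assert rmse < 0.5  # the factors fit the observed entries closely
```

The same loop extends to the joint factor-neighbor model by adding the neighbor term's gradient per observation.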
Representative Work
• Max-margin matrix factorization:
– Max-margin dimensionality reduction: [a lot of work here]
• Hinge loss + L2 regularization
min h(yij − ui^T D vj) + C1 ||D||_F^2 + C2 ||U||_F^2 + C3 ||V||_F^2
• Solve by SDP, cutting plane, etc.
– Max-Margin Matrix Factorization: [Srebro et al, NIPS 2005, ALT 2005]
• Hinge loss + Ky-Fan regularization
min h(mij − ui^T vj) + C1 ||U||_F^2 + C2 ||V||_F^2
• Note: no constraint on the rank of U or V
• Solve by SDP
– CoFi-Rank: [Weimer et al, NIPS 2009]
• NDCG loss + Ky-Fan regularization
min n(mij − ui^T vj) + C1 ||U||_F^2 + C2 ||V||_F^2
• Note: no constraint on the rank of U or V
• Solve by SDP, bundle methods
Representative Work
• Sparse coding:
[Lee et al, NIPS 2007] [Lee et al, IJCAI 2009]
– L2 sparse coding:
• L2 loss + L1 regularization
min ||M − U^T V||_F^2 + C1 ||U||_1 + C2 ||V||_F^2
• Solve by LARS, IRLS, gradient descent with sign searching
– Exponential family sparse coding:
• Bregman divergence + L1 regularization
min D(Mab || g(ua^T vb)) + C1 ||U||_1 + C2 ||V||_F^2
• Solve by gradient descent with sign searching
– Sparsity is good --- my guess:
• A more compact representation usually implies better predictiveness
• Sparsity poses a stronger prior, making local optima more distinguishable
• Shorter description length (the principle of MDL)
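The coding step of the L2 objective (dictionary held fixed) can be sketched with ISTA, i.e. proximal gradient with soft-thresholding; this is one standard solver alongside the LARS/IRLS options listed above, with illustrative names:

```python
import numpy as np

# Sketch of the sparse-coding step min_u 0.5||x - D u||^2 + lam ||u||_1 for a
# fixed dictionary D, solved by ISTA (gradient step + soft-thresholding).
# Full sparse coding alternates this with a least-squares dictionary update.
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(x, D, lam=0.1, steps=500):
    L = np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    u = np.zeros(D.shape[1])
    for _ in range(steps):
        u = soft_threshold(u - (D.T @ (D @ u - x)) / L, lam / L)
    return u

rng = np.random.default_rng(2)
D = rng.standard_normal((20, 10))
u_true = np.zeros(10)
u_true[[1, 7]] = [2.0, -1.5]        # a 2-sparse ground-truth code
x = D @ u_true
u = ista(x, D, lam=0.05)
assert np.count_nonzero(np.abs(u) > 1e-3) <= 5  # the recovered code is sparse
```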
Representative Work
• NMF, LDA, and Exponential PCA
– NMF: [Lee et al, NIPS 2001]
• L2 loss + nonnegative constraint
min ||M − U^T V||_F^2, subject to: U ≥ 0, V ≥ 0
• Solve by multiplicative updates, projected gradient descent, interior point, etc.
– LDA: [Blei et al, NIPS 2002]
Asymmetric + Multinomial conditional + conjugate (Dirichlet) prior
ua ~ Dir(α), zab ~ Disc(ua), Mab ~ Mult(V, zab)
Variational Bayesian, EP, Gibbs sampling, collapsed VB/GS
– Exponential PCA: [Collins et al, NIPS 2002]
Bregman divergence + orthogonal constraint
min D(Mab || g(ua^T vb)), subject to: V^T V = I
Solved by gradient descent
– Essentially, these are equivalent to each other
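The NMF objective above is commonly solved with Lee and Seung's multiplicative updates; a minimal sketch with illustrative names (W, H play the roles of U^T, V):

```python
import numpy as np

# Sketch of NMF with multiplicative updates for the L2 objective
# min ||M - W H||_F^2 subject to W >= 0, H >= 0. Each elementwise update
# keeps the factors nonnegative and never increases the loss.
def nmf(M, k, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((M.shape[0], k)) + 0.1
    H = rng.random((k, M.shape[1])) + 0.1
    eps = 1e-12                                 # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(3)
M = rng.random((8, 2)) @ rng.random((2, 6))     # nonnegative, exactly rank 2
W, H = nmf(M, 2)
err = np.linalg.norm(M - W @ H) / np.linalg.norm(M)
assert err < 0.05 and (W >= 0).all() and (H >= 0).all()
```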
Representative Work
• Link analysis:
– Factor based / bi-clustering:
[a lot of papers in co-clustering and social network analysis]
• L2 loss + L2 regularization
min ||M − U^T D V||_F^2 + C1 ||U||_F^2 + C2 ||V||_F^2 + C3 ||D||_F^2
• To further simplify, assume diagonal or even identity D
• Modern models use logistic regression
– Bayesian Co-clustering [Shan et al ICDM 2008]
Or Mixed membership stochastic block model [Airoldi et al, NIPS 2008]
Symmetric + Bernoulli conditional + Dirichlet prior
ui ~ Dir(α), zi ~ Disc(ui), Mij ~ sigmoid(zi^T D zj)
– Nonparametric feature model: [Miller et al, NIPS 2010]
Symmetric + Bernoulli conditional + Nonparametric prior
zi ~ IBP(α), Mij ~ sigmoid(zi^T D zj)
– In essence, equivalent
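All the models above share the logistic link Mij ~ sigmoid(zi^T D zj); a tiny sketch with illustrative (not fitted) z and D:

```python
import numpy as np

# Sketch of the logistic link shared by the Bayesian co-clustering and
# nonparametric variants: a link M_ij is Bernoulli with probability
# sigmoid(z_i^T D z_j). The z's and D below are illustrative values.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_probability(z_i, z_j, D):
    return sigmoid(z_i @ D @ z_j)

D = np.diag([3.0, -3.0])            # within-community affinity on the diagonal
z_same = np.array([1.0, 0.0])       # two nodes in the first community
z_other = np.array([0.0, 1.0])      # a node in the second community
p_within = link_probability(z_same, z_same, D)   # sigmoid(3), high
p_across = link_probability(z_same, z_other, D)  # sigmoid(0) = 0.5
assert p_within > p_across
```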
Representative Work
• Joint Link & content analysis:
– Collective factorization:
• L2 loss + L2 regularization [Long et al, ICML 2006, AAAI 2008; Zhou et al, WWW 2008]
• Or Laplacian smoothness loss + orthogonal [Zhou et al ICML 2007]
• Shared representation matrix
min ||M − U^T D U||_F^2 + ||F − U^T B||_F^2 + C1 ||U||_F^2 + C2 ||B||_F^2 + C3 ||D||_F^2
– Relational topic model:
[Chang et al, AISTATS 2009, KDD 2009]
For M: Symmetric + Bernoulli conditional + Dirichlet Prior
For F: Asymmetric + Multinomial conditional + Dirichlet Prior
Shared representation matrix
ui ~ Dir(α), zif ~ Disc(ui), Fif ~ Mult(B, zif), Mij ~ sigmoid(zi^T D zj)
– Regression based latent factor model:
[Agarwal et al, KDD 2009]
For M: Symmetric + Gaussian conditional + Gaussian Prior
For F: Linear regression (Gaussian)
zi ~ Gaussian(B Fi, σ I), Mij ~ Gaussian(zi^T zj)
– fLDA model:
[Agarwal et al, WSDM 2009]
LDA content factorization + Gaussian factorization model
– In essence, equivalent
Representative Work
• Tensor factorization/hypergraph mining and personalized CF:
– Two-way (pairwise interaction) model: [Rendle et al, WSDM 2010, WWW 2010]
min ||Mijk − ui^T D vj − ui^T D wk − vj^T D wk||^2 + C(||U||_F^2 + ||V||_F^2 + ||W||_F^2 + ||D||_F^2)
(Figure: Mijk modeled as the sum of the three pairwise interactions among the factors ui, vj, wk.)
– Full factorization: [Symeonidis et al, RecSys 2008; Rendle et al, KDD 2009]
min ||Mijk − <ui, vj, wk>||^2 + C(||U||_F^2 + ||V||_F^2 + ||W||_F^2)
(Figure: Mijk modeled as the three-way product of the factors ui, vj, wk.)
Summary and discussion
• Recipe for designing an MF model:
– Step 1: Understand your task / data:
What is the goal of my task?
What is the underlying mechanism in the task?
Knowledge, patterns, heuristics, clues…
What data are available to support my task?
Are all the available data sources reliable and useful to achieve the goal? Any
preprocessing/aggregation needed?
What is the basic characteristic of my data?
Symmetric, directional
positive, fractional, centralized, bounded
positive definite, triangle inequality
Which distribution is appropriate to interpret my data?
Any special concerns for the task?
Task requirement: is there a need for online operation?
Resources constraint: computational cost, labeled data,…
Summary and discussion
• Recipe for designing an MF model:
– Step 2: Choose an appropriate model:
Computational or statistical?
Computational models are generally efficient, easy to implement, off-the-shelf
black boxes (no need for fancy skills)…
Statistical models are usually interpretable, robust to overfitting, prior-knowledge-friendly,
and promising if properly designed…
If computational:
Which loss function?
L2, most popular, most efficient, generally promising
Evidently heavy noise: L1, Huber, epsilon
Dominant locality: Laplacian smoothness
Specific distribution: Bregman divergence (also uses a link function)
Measurable prediction quality: wrap the prediction objective into the loss
Readily translated knowledge, heuristics, clues
What regularization?
L2, most popular, most efficient
Any constraints to retain?
Sparsity: L1
Dominant locality: Laplacian smoothness
Readily translated knowledge, heuristic, clue
Summary and discussion
• Recipe for designing an MF model:
– Step 2: Choose an appropriate model (cont):
If statistical:
How to decompose the joint pdf?
To reflect the underlying mechanism
To efficiently parameterize
What’s the appropriate model for each pdf factor?
To encode prior knowledge/underlying mechanism
To reflect the data distribution
What’s the appropriate prior for Bayesian treatment?
Conjugate:
Sparsity: Laplacian, exponential
Nonparametric prior
No idea? Choose none or noninformative
Summary and discussion
• Recipe for designing an MF model:
– Step 3: Choose or derive an algorithm:
To meet task requirements and/or resource constraints
To ease implementation
To achieve the best performance
Deterministic:
• Spectral analysis
• Matrix decomposition: SVD, QR, LU
• Solving linear systems
• Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, etc.
• Alternating coordinate descent
• LARS, IRLS
• EM
• Mean field, Variational Bayesian, Expectation Propagation, collapsed VB
• …
Stochastic:
• Stochastic gradient descent (back propagation, message passing)
• Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
• Random walk
• Simulated annealing, annealing EM
• Randomized projection
• …
Summary and discussion
• Other thoughts
– Link propagation:
(Figure: a friendship/correlation matrix with unobserved entries marked "?".)
Preprocessing:
Propagate S (self-propagation, or based on an auxiliary similarity matrix)
S is required to be a stochastic matrix (positive entries, rows summing to 1)
Postprocessing:
Propagate P (using S or an auxiliary similarity matrix)
Both S and P are required to be stochastic matrices
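The preprocessing idea above can be sketched as row-normalizing a similarity matrix into a stochastic S and multiplying once to propagate; an illustrative numpy sketch:

```python
import numpy as np

# Sketch of link propagation as preprocessing: row-normalize a similarity
# matrix into a stochastic matrix S, then propagate one step (S @ S),
# spreading link mass to friends-of-friends before factorization.
def row_normalize(A):
    return A / A.sum(axis=1, keepdims=True)

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]]) + 1e-6      # small smoothing keeps every row nonzero
S = row_normalize(A)
S2 = S @ S                               # one propagation step
assert np.allclose(S2.sum(axis=1), 1.0)  # propagation preserves stochasticity
```

A product of row-stochastic matrices is again row-stochastic, which is why the slide's requirement on S (and P) makes repeated propagation well behaved.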
Summary and discussion
• Other thoughts
– Smoothness:
Friendship / neighborhood
Correlation, same-category
More parameters, but could be parameter free
Applying low-pass filtering
Single parameter
Spectral smoothness
Thanks!
Any comments would be appreciated!