Matrix Completion

IT530 Lecture Notes
Matrix Completion in Practice:
Scenario 1
• Consider a survey of M people where each is
asked Q questions.
• It may not be possible to ask each person all Q
questions.
• Consider a matrix of size M by Q (each row holds the
answers given by one person).
• This matrix is only partially filled (many missing
entries).
• Is it possible to infer the full matrix given just
the recorded entries?
Matrix Completion in Practice:
Scenario 2
• Many online services, such as Amazon, Flipkart, eBay and Netflix,
have recommender systems.
• These websites collect product ratings from users (especially
Netflix).
• Based on user ratings, these websites try to recommend other
products/movies to the user that he/she will like with a high
probability.
• Consider a matrix with the number of rows equal to the number of
users, and number of columns equal to the number of
movies/products.
• This matrix will be HIGHLY incomplete (no user has the patience to
rate too many movies!!) – maybe only 5% of the entries will be
filled up.
• Can the recommender system infer user preferences from just the
defined entries?
• Read about the Netflix Prize, a competition to design a
better recommender system:
http://en.wikipedia.org/wiki/Netflix_Prize
Matrix Completion in Practice:
Scenario 3
• Consider an image or a video with several pixel values
missing.
• This is not uncommon in range imagery or remote
sensing applications!
• Consider a matrix in which each column is a (vectorized)
patch of M pixels. Let the number of columns be K.
• This M by K matrix will have many missing entries.
• Is it possible to infer the complete matrix given just
the defined pixel values?
• If the answer were yes, note the implications for image
compression!
Matrix Completion in Practice:
Scenario 4
• Consider a long video sequence of F frames.
• Suppose I mark out M salient (interesting) points {P_i},
1 <= i <= M, in the first frame,
• and try to track those points in all subsequent frames.
• Consider a matrix F of size M x 2F (in the size, F is the
number of frames) where row j contains the X and Y coordinates
of the points on the motion trajectory of the initial point P_j.
• Unfortunately, many salient points may not be trackable
due to occlusion or errors from the tracking algorithms.
• So F is highly incomplete.
• Is it possible to infer the true matrix from only the
available measurements?
A property of these matrices
• Scenario 1: Many people will tend to give very
similar or identical answers to many survey
questions.
• Scenario 2: Many people will have similar
preferences for movies (only a few factors
affect user choices).
• Scenario 3: Non-local self-similarity!
• This makes the matrices in all these scenarios
low in rank!
• Scenario 4: The true matrix underlying F has been PROVED to
be of low rank under orthographic projection: once the centroid
of the tracked points is subtracted in each frame, its rank is
at most 3 (ref: Tomasi and Kanade, “Shape and Motion from Image
Streams Under Orthography: a Factorization Method”, IJCV 1992).
For a few other, more complex camera models the rank is at most
9 (ref: Irani, “Multiframe correspondence estimation using
subspace constraints”, IJCV 2002).
• In the rank 3 case, the transpose of F (of size 2F x M) can be
expressed as a product of two matrices: a rotation matrix of
size 2F x 3, and a shape matrix of size 3 x M.
• F is useful for many computer vision problems such as
structure from motion, motion segmentation and multiframe point correspondences.
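The low-rank claim for Scenario 4 can be checked numerically. The sketch below is only an illustration under assumed synthetic data: the point cloud, the random camera axes, the per-frame translations and all variable names are hypothetical and do not come from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, num_frames = 40, 25                 # M points, F frames (illustrative)
shape = rng.standard_normal((num_points, 3))    # 3-D coordinates of the points

rows = []
for frame in range(num_frames):
    # Orthonormal 3 x 3 matrix from a QR factorization; its first two rows
    # serve as the image axes of an orthographic camera.
    q, r_factor = np.linalg.qr(rng.standard_normal((3, 3)))
    translation = rng.standard_normal(2)
    rows.append(q[:2] @ shape.T + translation[:, None])   # 2 x M coordinates

tracks = np.vstack(rows).T                  # M x 2F trajectory matrix
centered = tracks - tracks.mean(axis=0)     # subtract the centroid in every frame

rank_raw = np.linalg.matrix_rank(tracks)
rank_centered = np.linalg.matrix_rank(centered)
print(rank_raw, rank_centered)
```

With generic data the centered matrix should have rank 3, while the uncentered one picks up one extra rank from the per-frame translations.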
(Many) low-rank matrices are cool!
• The answer to the four questions/scenarios is
a big NO in the general case.
• But it’s a big YES if we assume that the
underlying matrix has low rank (and which, as
we have seen, is indeed the case for all four
scenarios) and obeys a few more constraints.
Ref: Candes and Recht, “Exact Matrix Completion via Convex Optimization”, 2008.
Theorem 1 (Informal Statement)
• Consider an unknown matrix F of size n1 by n2 having rank r <
min(n1, n2).
• Suppose we observe only a fraction of entries of F in the form
of matrix G, where G(i,j) = F(i,j) for all (i,j) belonging to some
uniformly randomly sampled set W and G(i,j) undefined
elsewhere.
• If (1) F has row and column spaces that are “sufficiently
incoherent” with the canonical basis (i.e. identity matrix), (2) r is
“sufficiently small”, and (3) W is “sufficiently large”, then we can
accurately recover F from G by solving the following rank
minimization problem:
$$F^* = \arg\min_{\hat{F}} \operatorname{rank}(\hat{F}) \quad \text{subject to} \quad \hat{F}(i,j) = G(i,j) \;\; \forall (i,j) \in W$$
Cool theorem, but … 
• The aforementioned optimization problem is
NP-hard (in fact, all known exact algorithms need
time that is doubly exponential in the matrix dimension!)
Theorem 2 (Informal Statement)
• Consider an unknown matrix F of size n1 by n2 having rank r <
min(n1, n2).
• Suppose we observe only a fraction of entries of F in the form
of matrix G, where G(i,j) = F(i,j) for all (i,j) belonging to some
uniformly randomly sampled set W and G(i,j) undefined
elsewhere.
• If (1) F has row and column spaces that are “sufficiently
incoherent” with the canonical basis (i.e. identity matrix), (2) r is
“sufficiently small”, and (3) W is “sufficiently large”, then we can
accurately recover F from G by solving the following “trace-norm”
minimization problem:
$$F^* = \arg\min_{\hat{F}} \|\hat{F}\|_* \quad \text{subject to} \quad \hat{F}(i,j) = G(i,j) \;\; \forall (i,j) \in W$$
What is the trace-norm of a matrix?
• The trace-norm of a matrix is the sum of its singular
values.
• It is also called nuclear norm.
• It is a softened version of the rank of a matrix, just like
the L1-norm of a vector is a softened version of the L0-norm
of the vector.
• Minimization of the trace-norm (even under the given
constraints) is a convex optimization problem and can
be solved efficiently (no local minima issues).
• This is similar to the L1-norm optimization (in
compressive sensing) being efficiently solvable.
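The analogy between rank / trace-norm and L0-norm / L1-norm can be made concrete on the vector of singular values. The following is a minimal sketch with NumPy; the helper name `trace_norm`, the matrix sizes and the rank tolerance are illustrative assumptions.

```python
import numpy as np

def trace_norm(a):
    """Sum of the singular values of a (the trace norm or nuclear norm)."""
    return np.linalg.svd(a, compute_uv=False).sum()

rng = np.random.default_rng(1)
low_rank = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))  # rank 2
s = np.linalg.svd(low_rank, compute_uv=False)

rank = int(np.sum(s > 1e-10 * s[0]))   # "L0-norm" of the singular values
nuc = trace_norm(low_rank)             # "L1-norm" of the singular values
print(rank, nuc, np.linalg.norm(low_rank, "nuc"))
```

The last printed value uses NumPy's built-in nuclear norm and should agree with `trace_norm`.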
More about trace-norm minimization
• The efficient trace-norm minimization procedure
is provably known to give the EXACT SAME result
as the NP-hard rank minimization problem (under
the same constraints and same conditions on the
unknown matrix F and the sampling set W).
• This is analogous to the case where L1-norm
optimization yielded the same result as L0-norm
optimization (under the same set of constraints
and conditions).
• Henceforth we will concentrate only on Theorem
2 (and beyond).
The devil is in the details
• Beware: Not all low-rank matrices can be recovered
from partial measurements!
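• Example: consider a matrix containing zeroes
everywhere except the top-right corner.
• This matrix has rank 1, but it cannot be recovered
from knowledge of only a fraction of its entries unless
the top-right entry happens to be among them: any set of
samples that misses it consists only of zeroes.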
• Many other such examples exist.
• In reality, Theorems 1 and 2 work for low-rank matrices
whose singular vectors are sufficiently spread out, i.e.
sufficiently incoherent with the canonical basis (i.e.
with the identity matrix).
Coherence of a basis
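• The coherence of a subspace U of $R^n$ of dimension r
with respect to the canonical basis {e_i} is defined as:
$$\mu(U) = \frac{n}{r} \max_{1 \le i \le n} \|P_U e_i\|^2,$$
where $P_U$ is the orthogonal projection onto U. It ranges
from 1 (maximally spread out) to n/r.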
Formal definition of key assumptions
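• Consider an underlying matrix M of size n1 by
n2. Let the SVD of M be given as follows:
$$M = \sum_{k=1}^{r} \sigma_k u_k v_k^T$$
• We make the following assumptions about M, where U and V
denote the spans of {u_k} and {v_k}:
1. (A0) There exists $\mu_0$ such that $\max(\mu(U), \mu(V)) \le \mu_0$.
2. (A1) The maximum entry (in absolute value) of the n1 by n2
matrix $\sum_{k=1}^{r} u_k v_k^T$ is upper bounded by
$\mu_1 \sqrt{r/(n_1 n_2)}$, for some $\mu_1 > 0$.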
What do these assumptions mean (in
English)?
• (A0) means that the singular vectors of the
matrix are sufficiently incoherent with the
canonical basis.
• (A1) means that the singular vectors of the
matrix are not spiky (e.g. canonical basis
vectors are spiky signals – the spike has
magnitude 1 and the rest of the signal is 0; a
vector of n elements with all values equal to
1/square-root(n) is not spiky).
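To see what (A0) and (A1) measure, the two quantities can be evaluated for the spiky and non-spiky vectors just mentioned. This is a small sketch under stated assumptions: `coherence` and `mu1` are hypothetical helper names, and each helper expects matrices whose columns are orthonormal singular vectors.

```python
import numpy as np

def coherence(basis):
    """mu(U) = (n / r) * max_i ||P_U e_i||^2 for an n x r orthonormal basis."""
    n, r = basis.shape
    # ||P_U e_i||^2 equals the squared norm of row i of the orthonormal basis.
    return n / r * np.max(np.sum(basis**2, axis=1))

def mu1(u, v):
    """Smallest mu_1 for which (A1) holds, given singular vectors u and v."""
    n1, r = u.shape
    n2 = v.shape[0]
    return np.max(np.abs(u @ v.T)) * np.sqrt(n1 * n2 / r)

n = 64
e_first = np.zeros((n, 1))
e_first[0, 0] = 1.0                      # canonical basis vector e_1 (spiky)
e_last = np.zeros((n, 1))
e_last[-1, 0] = 1.0                      # canonical basis vector e_n (spiky)
flat = np.full((n, 1), 1 / np.sqrt(n))   # all entries 1/sqrt(n) (not spiky)

# e_first @ e_last.T is the matrix with a single 1 in the top-right corner.
print(coherence(e_first), mu1(e_first, e_last))   # worst case: both equal n
print(coherence(flat), mu1(flat, flat))           # best case: both equal 1
```

The corner matrix from the earlier counterexample attains the largest possible values, which is why the theorems exclude it.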
Theorem 2 (Formal Statement)
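• (As stated in Candes and Recht.) Let M be an n1 by n2
matrix of rank r obeying (A0) and (A1), and let n = max(n1, n2).
Suppose we observe m entries of M, with locations sampled
uniformly at random.
• Then there exist constants C and c such that if
$$m \ge C \max(\mu_1^2, \mu_0^{1/2}\mu_1, \mu_0 n^{1/4}) \, n r \, (\beta \log n)$$
for some $\beta > 2$, the trace-norm minimizer (in the informal
statement of Theorem 2) is unique and equal to M with
probability at least $1 - c n^{-\beta}$.
• If in addition $r \le \mu_0^{-1} n^{1/5}$, the condition can be
improved to $m \ge C \mu_0 n^{6/5} r (\beta \log n)$.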
Comments on Theorem 2
• Theorem 2 states that more entries of M must
be known (their number is denoted by m) for accurate
reconstruction if (1) M has larger rank r, (2) the
value of $\mu_0$ in (A0) is greater, or (3) the value of
$\mu_1$ in (A1) is greater.
• Example: If $\mu_0 = O(1)$ and the rank r is small,
the reconstruction is accurate with high
probability provided $m \ge C n^{1.2} r \log(n)$,
where n = max(n1, n2) and C is a constant.
• It turns out that if the singular vectors of
matrix M have bounded values, the condition
(A1) almost always holds for the value
$\mu_1 = O(\log n)$.
Matrix Completion under noise
• Consider an unknown matrix F of size n1 by n2
having rank r < min(n1, n2).
• Suppose we observe only a fraction of entries
of F in the form of matrix G, where G(i,j) =
F(i,j) + Z(i,j) for all (i,j) belonging to some set
W and G(i,j) undefined elsewhere.
• Here Z refers to a white noise process which
obeys the constraint that:
$$\sum_{(i,j) \in W} Z_{ij}^2 \le \delta^2, \quad \delta > 0$$
• In such cases, the unknown matrix F can be
recovered by solving the following
minimization problem (which can be cast as a semidefinite
program):
$$F^* = \arg\min_{\hat{F}} \|\hat{F}\|_* \quad \text{subject to} \quad \sum_{(i,j) \in W} (\hat{F}(i,j) - G(i,j))^2 \le \delta^2$$
Theorem 3 (informal statement)
• The reconstruction result from the previous
procedure is accurate with an error bound
given by:
$$\|F - F^*\|_F \le 4 \sqrt{\frac{(2 + p) \min(n_1, n_2)}{p}} \, \delta + 2\delta,$$
where $p = \frac{|W|}{n_1 n_2} = \frac{m}{n_1 n_2}$ is the
fraction of known entries.
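To get a feel for the size of this bound, it can be evaluated for concrete numbers. The sketch assumes the bound has the form 4 sqrt((2 + p) min(n1, n2) / p) delta + 2 delta, with delta bounding the noise energy on the observed entries; the function name and the sample values are hypothetical.

```python
import math

def noisy_completion_bound(n1, n2, m, delta):
    """Upper bound on the Frobenius-norm reconstruction error (Theorem 3)."""
    p = m / (n1 * n2)    # fraction of known entries
    return 4 * math.sqrt((2 + p) * min(n1, n2) / p) * delta + 2 * delta

# Illustrative values: a 1000 x 1000 matrix, noise level delta = 1.
bound = noisy_completion_bound(1000, 1000, 200_000, 1.0)
bound_more_samples = noisy_completion_bound(1000, 1000, 800_000, 1.0)
print(bound, bound_more_samples)
```

The bound scales linearly with delta and shrinks as the fraction p of known entries grows.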
A Minimization Algorithm
• Consider the minimization problem:
$$F^* = \arg\min_{\hat{F}} \|\hat{F}\|_* \quad \text{subject to} \quad \sum_{(i,j) \in W} (\hat{F}(i,j) - G(i,j))^2 \le \delta^2$$
• There are many techniques to solve this problem
(http://perception.csl.illinois.edu/matrixrank/sample_code.html)
• Out of these, we will study one method called
“singular value thresholding”.
Ref: Cai et al, “A singular value thresholding algorithm for matrix completion”, SIAM Journal on Optimization,
2010.
Singular Value Thresholding (SVT)
F* = SVT(G, tau > 0)
{
    Y(0) = 0 in R^(n1 x n2); k = 1;
    while (convergence criterion not met)
    {
        F(k) = soft-threshold(Y(k-1); tau);
        Y(k) = Y(k-1) + delta_k * (P_W o (G - F(k))); k = k + 1;
    }
    F* = F(k-1);
}
Here tau is the threshold, delta_k is the step size at iteration k,
"o" is the entrywise product, and P_W is the sampling mask:
P_W(i,j) = 1 for all (i,j) in W, else P_W(i,j) = 0.

soft-threshold(Y in R^(n1 x n2); tau)
{
    Y = U S V^T (using svd)
    for (k = 1 : min(n1, n2)) S(k,k) = max(0, S(k,k) - tau);
    return U S V^T;
}
The soft-thresholding procedure obeys the following property
(which we state w/o proof):
$$\text{soft-threshold}(Y; \tau) = \arg\min_X \frac{1}{2} \|X - Y\|_F^2 + \tau \|X\|_*$$
Properties of SVT (stated w/o proof)
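• The sequence {F(k)} converges to the true
solution of the main problem provided the
step-sizes {$\delta_k$} all lie strictly between 0 and 2,
and the value of $\tau$ is large. (For finite $\tau$, SVT
as analysed by Cai et al. minimizes
$\tau \|X\|_* + \frac{1}{2} \|X\|_F^2$ under equality constraints
on W; its solution approaches the trace-norm minimizer as
$\tau$ grows.)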
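The SVT iteration described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the constant step size, the stopping rule (relative residual on the observed entries), the choice tau = 5n and the synthetic test matrix are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(y, tau):
    """Shrink every singular value of y by tau, clipping at zero."""
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def svt(g, mask, tau, step, max_iter=1000, tol=1e-4):
    """Complete g from the entries where mask is True (mask plays the role of P_W)."""
    y = np.zeros_like(g, dtype=float)
    f = y
    g_obs = np.where(mask, g, 0.0)
    for _ in range(max_iter):
        f = soft_threshold(y, tau)
        residual = np.where(mask, g - f, 0.0)     # P_W applied to (G - F)
        # Assumed stopping rule: small relative error on the observed entries.
        if np.linalg.norm(residual) <= tol * np.linalg.norm(g_obs):
            break
        y = y + step * residual                   # constant step size in (0, 2)
    return f

# Synthetic test: a random 60 x 60 matrix of rank 2 with about half the entries known.
rng = np.random.default_rng(0)
n, r = 60, 2
truth = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5
estimate = svt(truth, mask, tau=5 * n, step=1.9)
rel_err = np.linalg.norm(estimate - truth) / np.linalg.norm(truth)
print(f"relative error: {rel_err:.2e}")
```

A full SVD per iteration is affordable only for small matrices; large-scale implementations compute just the leading singular values, which this sketch does not attempt.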
Results
• The SVT algorithm works very efficiently and is
easily implementable in MATLAB.
• The authors report reconstruction of a 30,000
by 30,000 matrix in just 17 minutes on a 1.86
GHz dual-core desktop with 3 GB RAM and
with MATLAB’s multithreading option enabled.
Results (Data without noise)
Results (Noisy Data)
Results on real data
• Dataset consists of a matrix M of geodesic
distances between 312 cities in the
USA/Canada.
• This matrix is approximately low-rank (in
fact, the relative Frobenius error between M
and its rank-3 approximation is 0.1159).
• 70% of the entries of this matrix (chosen
uniformly at random) were blanked out.
• The underlying matrix was estimated using
SVT.
• In just a few seconds and a few iterations, the
SVT produces an estimate that is as accurate
as the best rank-3 approximation of M.
Applications to Video Denoising
• Consider a video I(x,y,t) corrupted by Gaussian
and impulse noise.
• Impulse noise usually has high magnitude and
can be spatially sparse.
• Impulse noise can be “removed” by local median
filtering, but this can also attenuate edges and
corners.
• Instead, the median filtering is used as an
intermediate step for collecting together K
patches that are “similar” to a reference patch.
• The similar patches are assembled in the form
of a matrix Q of size n x K (where n = number
of pixels in each patch).
• If the noise were only Gaussian, we could
perform an SVD on Q, attenuate appropriate
singular values and then reconstruct the
patches. In fact, this is what non-local PCA
does (with minor differences).
• But the entries in Q are also corrupted by
impulse noise, and this adversely affects the
SVD computation.
• Hence we can regard those pixels that are
affected by impulse noise as “incomplete
entries” (how will you identify them?) and
pose this as a matrix completion under noise
problem.
• The formal problem statement is:
$$P^* = \arg\min_P \|P\|_* \quad \text{subject to} \quad \sum_{(i,j) \in W} (P(i,j) - Q(i,j))^2 \le |W| \, \sigma^2$$
where
• P is the matrix of true/clean patches,
• Q is the matrix of noisy patches,
• W is the set of indices of pixels in Q having well-defined
values (i.e. values not corrupted by impulse noise),
• $\sigma$ is the standard deviation of the Gaussian noise.
• We know this can be solved using singular
value thresholding (though other algorithms
also exist).
• This procedure is repeated throughout the
image or the video.
Result: effect of median filter
Denoising results
Matrix Completion in Practice:
Scenario 5
• Consider N points of dimension d each,
denoted as {xi},1 ≤ i ≤ N.
• Consider a matrix D of size N x N, where
$D_{ij} = \|x_i - x_j\|^2 = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j$.
• D can be written in the form
$D = \mathbf{1} z^T + z \mathbf{1}^T - 2 X X^T$, where z is a vector
of length N with $z_i = x_i^T x_i$, X is the N by d matrix whose
i-th row is $x_i^T$, and $\mathbf{1}$ is a vector of length N
containing all ones.
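• Hence rank(D) ≤ 2 + rank(XX^T) ≤ 2 + d, with equality for
points in general position (when N ≥ d + 2). So D has low rank
whenever d is much smaller than N.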
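The decomposition of the squared-distance matrix and the resulting rank can be verified numerically. The sketch below uses randomly drawn points; the sizes N = 30 and d = 3 and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, dim = 30, 3
x = rng.standard_normal((num_points, dim))     # rows are the points x_i

z = np.sum(x**2, axis=1)                       # z_i = x_i^T x_i
ones = np.ones(num_points)
d_formula = np.outer(ones, z) + np.outer(z, ones) - 2 * x @ x.T

# Direct computation of the squared pairwise distances, for comparison.
d_direct = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)

rank_d = np.linalg.matrix_rank(d_formula)
print(np.allclose(d_formula, d_direct), rank_d)
```

For generic points the rank should come out as d + 2, far below N, so a partially observed distance matrix is a natural candidate for matrix completion.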