
On Missing Data Prediction using Sparse Signal Models: A Comparison of Atomic Decompositions with Iterated Denoising

Onur G. Guleryuz
DoCoMo USA Labs, San Jose, CA 95110
[email protected]
(google: onur guleryuz)

(Please view in full screen mode. The presentation tries to squeeze in too much; please feel free to email me any questions you may have.)
Overview
•Problem statement: Prediction of missing data.
•Formulation as a sparse linear expansion over an overcomplete basis.
•AD (l0 regularized) and ID formulations.
•Short simulation results (l1 regularized).
•Why ID is better than AD.
•Adaptive predictors on general data: all methods are mathematically the same.
Key issues are basis selection, and utilizing what you have effectively.
Mini FAQ:
1. Is ID the same as l1? No.
2. Is ID the same as lp, except implemented iteratively? No.
3. Are predictors that yield the sparsest set of expansion coefficients the best? No,
predictors that yield the smallest mse are the best.
4. On images, look for performance over large missing chunks (with edges).
Some results from: Ivan W. Selesnick, Richard Van Slyke, and Onur G. Guleryuz, ``Pixel Recovery via l1 Minimization in the Wavelet Domain,'' Proc. IEEE Int'l Conf. on Image Proc. (ICIP 2004), Singapore, Oct. 2004.
Pretty ID pictures: Onur G. Guleryuz, ``Nonlinear Approximation Based Image Recovery Using Adaptive Sparse Reconstructions and Iterated Denoising: Part II – Adaptive Algorithms,'' IEEE Tr. on IP, to appear.
(Some software available at my webpage.)
Problem Statement
1. Original image: $x = [x_0; x_1]$ (assume zero mean), where $x_0$ holds the available pixels and $x_1$ the lost region pixels, with $x_0 \in \mathbb{R}^{n_0}$, $x_1 \in \mathbb{R}^{n_1}$, and $n_0 + n_1 = N$.
2. Lost region.
3. Derive the prediction $y = [x_0; \hat{x}_1]$.
Available data projection ("mask"): $P_0 y = [x_0; 0]$.
The prediction $y = [x_0; \hat{x}_1]$ is a noisy signal (with noise correlated with the data), refined through type 1 iterations.
[Figure: signal-space illustration of forming the estimate.]
Recipe for $y = [x_0; \hat{x}_1]$
1. Take an $N \times M$ matrix of overcomplete basis functions ($M \ge N$): $H = [h_1\; h_2\; \ldots\; h_M]$.
2. Write $y$ in terms of the basis: $y = Hc = \sum_{i=1}^{M} c_i h_i$.
3. Find "sparse" expansion coefficients (AD vs. ID).
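As a concrete illustration of steps 1 and 2, here is a minimal numpy sketch, assuming a hypothetical overcomplete dictionary built by stacking spikes (the identity) next to an orthonormal DCT; since M > N the expansion is not unique, and the minimum-norm coefficients below are just one valid choice.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its columns are the DCT atoms."""
    t = np.arange(n)[:, None]
    f = np.arange(n)[None, :]
    C = np.cos(np.pi * (2 * t + 1) * f / (2 * n))
    C[:, 0] *= np.sqrt(1.0 / n)
    C[:, 1:] *= np.sqrt(2.0 / n)
    return C

N = 16
H = np.hstack([np.eye(N), dct_matrix(N)])        # N x M overcomplete dictionary, M = 2N
y = np.random.default_rng(1).standard_normal(N)  # any candidate estimate y

c = np.linalg.pinv(H) @ y                        # one valid set of expansion coefficients
print(np.allclose(H @ c, y))                     # True: y = sum_i c_i h_i, but c is not unique
```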
Onur's trivial sparsity theorem: Any y has to be sparse.
Write $y = [x_0; \hat{x}_1]$ with $\hat{x}_1 = A(x_0)\,x_0$, so that
$y = [I; A]\,x_0$, with $n_0 \le N$.
Every estimate therefore lies in an $n_0$-dimensional subspace of $\mathbb{R}^N$ (leaving a null space of dimension $n_1 = N - n_0$), i.e., $y$ has to be sparse: estimation algorithms produce
$y = [x_0; \hat{x}_1] = \sum_{i=1}^{n_0} d_i g_i$,
an expansion in an equivalent basis in which estimates are sparse.
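A quick numeric check of the theorem, assuming for simplicity a fixed linear predictor A (the argument is unchanged for a data-dependent A(x0)); the sizes and the random A below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 6, 4                          # available / missing sample counts (illustrative)
N = n0 + n1

x0 = rng.standard_normal(n0)           # available data
A = rng.standard_normal((n1, n0))      # any (here linear, fixed) predictor: x1_hat = A @ x0
B = np.vstack([np.eye(n0), A])         # y = B @ x0, so every estimate lies in range(B)
y = B @ x0

Q, _ = np.linalg.qr(B)                 # orthonormal basis for range(B), dimension n0
G, _ = np.linalg.qr(np.hstack([Q, rng.standard_normal((N, n1))]))  # extend to a basis of R^N

d = G.T @ y                            # coefficients of y in the equivalent basis G
print(np.sum(np.abs(d) > 1e-12))       # n0: only n0 of the N coefficients are nonzero
```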
Who cares about y, what about the original x?
If successful prediction is possible, x also has to be ~sparse, i.e., if $\|x - y\|^2$ is small, then x is ~sparse.
1. Predictable ⇒ sparse.
2. Sparsity of x is a necessary leap of faith to make in estimation.
•Caveat: Any estimator is putting up a sparse y. Assuming x is sparse, the estimator that wins is the one that matches the sparsity "correctly"!
•Putting up sparse estimates is not the issue, putting up estimates that minimize mse is.
•Can we be proud of the $y = \sum_i c_i h_i$ formulation? Not really. It is honest, but ambitious.
Getting to the heart of the matter:
AD: Find the expansion coefficients that minimize the $l_0$ norm:
$\min_c \sum_{i=1}^{M} |c_i|^0$  subject to  $\|P_0(\sum_{i=1}^{M} c_i h_i - x)\|^2 \le T$
(The $l_0$ norm of the expansion coefficients is the regularization; the constraint is the available data constraint.)
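The l0 problem is combinatorial, so as a hedged sketch of the AD idea (not the talk's solver) one can run a greedy orthogonal-matching-pursuit proxy on the masked dictionary P0 H; the function omp_masked, its arguments, and the stopping rule are illustrative choices.

```python
import numpy as np

def omp_masked(H, x, mask, T, max_atoms=None):
    """Greedy stand-in for AD: pick few atoms so that ||P0(Hc - x)||^2 <= T,
    where P0 keeps only the samples with mask == True."""
    Hm = H[mask]                          # masked atoms P0 h_i (rows restricted to known samples)
    xm = x[mask].astype(float)            # available data P0 x
    N, M = H.shape
    support, c = [], np.zeros(M)
    r = xm.copy()                         # residual on the available samples
    max_atoms = max_atoms or int(mask.sum())
    while r @ r > T and len(support) < max_atoms:
        corr = np.abs(Hm.T @ r) / (np.linalg.norm(Hm, axis=0) + 1e-12)
        support.append(int(np.argmax(corr)))                      # most correlated masked atom
        cs, *_ = np.linalg.lstsq(Hm[:, support], xm, rcond=None)  # refit on the current support
        c[:] = 0.0
        c[support] = cs
        r = xm - Hm[:, support] @ cs
    return c                              # AD-style estimate: y = H @ c
```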
AD with Significance Sets
$\min_S \mathrm{card}(S)$  subject to  $y = \sum_{i \in S} c_i h_i$  and  $\|P_0(y - x)\|^2 \le T$
Finds the sparsest (the most predictable) signal consistent with the available data.
Iterated Denoising with Insignificant Sets
1. Pick $I(T)$.
2. $\min_y \sum_{i \in I(T)} |h_i^T y|^2$  subject to  $P_0 y = P_0 x$
(Once the insignificant set is determined, ID uses well-defined denoising operators to construct mathematically sound equations.)

Progressions: the recipe for using your transform-based image denoiser (to justify progressions, think decaying coefficients) is:
y = denoising_recons([x_0; 0], H, T)
y_1 = denoising_recons(y, H, T - dT)
y_2 = denoising_recons(y_1, H, T - 2dT)
...
y_P = denoising_recons(y_{P-1}, H, T_f)
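A minimal numpy sketch of the recipe above, assuming H is a tight frame (for instance two stacked orthonormal bases) and hard thresholding as the denoiser; the names mirror the slide's denoising_recons notation, but the concrete choices (tight-frame synthesis, linear threshold schedule) are illustrative rather than the talk's exact algorithm.

```python
import numpy as np

def denoising_recons(y, H, T, mask, x_known):
    """One type-1 ID pass: threshold the coefficients seen through H,
    resynthesize, then re-impose the available data P0 y = P0 x."""
    c = H.T @ y                                   # analysis coefficients h_i^T y
    c[np.abs(c) < T] = 0.0                        # zero the insignificant set I(T)
    A = np.linalg.norm(H, 'fro') ** 2 / len(y)    # frame constant, assuming H @ H.T = A * I
    y_new = (H @ c) / A                           # denoised reconstruction
    y_new[mask] = x_known                         # available data constraint
    return y_new

def iterated_denoising(x, mask, H, T0, Tf, dT, sweeps_per_T=1):
    """ID with annealing progressions: decrease the threshold from T0 to Tf."""
    y = np.where(mask, x, 0.0)                    # initial estimate [x0; 0]; only x[mask] is used
    T = T0
    while T >= Tf:
        for _ in range(sweeps_per_T):
            y = denoising_recons(y, H, T, mask, x[mask])
        T -= dT
    return y
```

With a spikes-plus-DCT dictionary, for example, iterated_denoising(x, mask, H, T0=3.0, Tf=0.1, dT=0.1) fills in the samples where mask is False while leaving the known samples untouched.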
Mini Formulation Comparison
No-progression ID: $\min_y \sum_{i \in I} |h_i^T y|^2$  subject to  $P_0 y = P_0 x$
AD: $\min_S \mathrm{card}(S)$  subject to  $y = \sum_{i \in S} c_i h_i$, $\|P_0(y - x)\|^2 \le T$
•If H is orthonormal the two formulations come close.
•Important thing is how you determine the sets/sparsity (ID: Robust DSP, AD: sparsest)
•ID uses progressions, progressions change everything!
Simulation Comparison
AD: $\min_c \sum_{i=1}^{M} |c_i|^0$  subject to  $\|P_0(\sum_{i=1}^{M} c_i h_i - x)\|^2 \le T$
vs.
ID (no layering and no selective thresholding)
l0 → l1: D. Donoho, M. Elad, and V. Temlyakov, ``Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise.''
H: two-times expansive (M = 2N), real, isotropic, dual-tree DWT. Real part of:
N. G. Kingsbury, ``Complex wavelets for shift invariant analysis and filtering of signals,'' Appl. Comput. Harmon. Anal., 10(3):234-253, May 2001.
Simulation Results
[Figure: two recovery examples, each shown as four panels: Original, Missing, l1 reconstruction, ID reconstruction.]
Example 1: l1 23.49 dB, ID 25.39 dB.
Example 2: l1 21.40 dB, ID 30.38 dB.
(The l1 results are doctored!)
What is wrong with AD?
Problems in the l0 → l1 relaxation? Yes and no.
•I will argue that even if we used an "l0 solver", ID will in general prevail.
•Specific issues with l1.
•How to fix the problems with l1-based AD.
•How to do better.
So let's assume we can solve the l0 problem ...
Bottom Up (AD) vs. Top Down (ID)
[Figure: AD as a builder, ID as a sculptor.]
Prediction as signal construction:
•AD is a builder that tries to accomplish constructions using as few bricks as possible. Requires a very good basis.
•ID is a sculptor that removes portions that do not belong in the final construction, using as many carving steps as needed. Requires good denoising (easy).
The application is not compression! ("Where will the probe hit the meteor?", "What is the value of the S&P 500 tomorrow?")
Significance vs. Insignificance: The Sherlock Holmes Principle
•Both ID and AD do well with a very good basis. But ID can also use unintuitive bases for sophisticated results.
E.g.: ID can use the "unsophisticated", "singularity unfriendly" DCT basis to recover singularities. AD cannot!
Secret: DCTs are not great on singularities, but they are very good on everything else, i.e., the non-singularities!
"How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?"
Sherlock Holmes, in "The Sign of the Four"
•DCTs are very good at eliminating non-singularities.
•ID is more robust to basis selection compared to AD (it secretly violates coherency restrictions).
•You can add to the AD dictionary, but solvers won't be able to handle it.
Sherlock Holmes Principle using overcomplete DCTs for elimination
[Figure: predicting missing edge pixels and predicting missing wavelet coefficients over edges; bases: DCT 16x16 and DCT 8x8.]
Do not abandon isotropic *lets; use a framework that can extract the most mileage from the chosen basis ("sparsest").
Onur G. Guleryuz, ``Predicting Wavelet Coefficients Over Edges Using Estimates Based on Nonlinear Approximants,'' Proc. Data Compression Conference, IEEE DCC-04, April 2004.
Progressions
Type 1 iterations of simple denoising (basis: DCT 16x16, best threshold):
y = denoising_recons([x_0; 0], H, T)
"Annealing" progressions (think decaying coefficients):
y_1 = denoising_recons(y, H, T - dT)
...
y_P = denoising_recons(y_{P-1}, H, T_f)
Progressions generate up to tens of dBs. If the data were very sparse with respect to H, and if we were solving a convex problem, why should progressions matter? Modeling assumptions...
More skeptical picture:
Sparse Modeling Generates Non-Convex Problems
[Figure: a "two pixel" image with one available pixel and one missing pixel, shown in pixel coordinates and in transform coordinates $(c_1, c_2)$; the available-pixel constraint line passes through equally sparse solutions.]
How does this affect some "AD solvers", i.e., l1?
Geometry
[Figure: the l1 ball against the available-data constraint line; Cases 1-3 (linear/quadratic program, ...).]
Case 3: not sparse! The magic is gone...
You now have to argue: "Under an i.i.d. Laplacian model for the joint probability of expansion coefficients, $\max p(c_1, c_2, \ldots, c_M) \Leftrightarrow \min l_1$ norm."
Problems with the l1 norm I
What about all the optimality/sparsest results? Results such as D. Donoho et al., ``Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise,'' are very impressive, but they are closely tied to H providing the sparsest decomposition for x. Not every problem has this structure.
There are worst-case noise robustness results, but here the noise is overwhelming:
$w = x - [x_0; 0]$: the effective noise has a modeling-error component $\epsilon_1$ and a component $\epsilon_2$ due to the missing data, with $\sigma_2^2 = n_1 \sigma^2 \gg a N \sigma_1^2$, so that $\mathrm{mse}_{l_1}(x, y) = f(\sigma_1, \sigma_2)$.
Problems with the l1 norm II
$\min_c \sum_{i=1}^{M} |c_i|$  subject to  $\|P_0(\sum_{i=1}^{M} c_i h_i - x)\|^2 \le T$,
i.e., $\|\sum_{i=1}^{M} c_i P_0 h_i - P_0 x\|^2 \le T$  (the problem is due to the masking $P_0$).
$H = [h_1\; h_2\; \ldots\; h_M]$ is a "nice", "decoherent" basis, but
$P_0 H = [\tilde{h}_1\; \tilde{h}_2\; \ldots\; \tilde{h}_M]$ (with the masked rows set to zero) is a "not nice" basis that, due to the masking, may become very "coherent".
Example
$H = \begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{2} & 1/\sqrt{6} \\ 1/\sqrt{3} & 0 & -2/\sqrt{6} \\ 1/\sqrt{3} & -1/\sqrt{2} & 1/\sqrt{6} \end{bmatrix}$: orthonormal, coherency = 0.
$P_0 H = \begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{2} & 1/\sqrt{6} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$: unnormalized coherency = $1/\sqrt{6}$, normalized coherency = 1 (the worst possible).
The optimal solution sometimes tries to make the coefficients of scaling functions zero.
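The coherency numbers above can be verified directly; a small numpy check (the coherence helper below is mine, not from the talk):

```python
import numpy as np

s3, s2, s6 = np.sqrt(3.0), np.sqrt(2.0), np.sqrt(6.0)
H = np.array([[1/s3,  1/s2,  1/s6],
              [1/s3,  0.0,  -2/s6],
              [1/s3, -1/s2,  1/s6]])
P0 = np.diag([1.0, 0.0, 0.0])                  # only the first sample is available

def coherence(B, normalize):
    """Largest off-diagonal |inner product| between columns of B."""
    G = B.T @ B
    if normalize:
        n = np.linalg.norm(B, axis=0)
        G = G / np.outer(n, n)
    return np.max(np.abs(G - np.diag(np.diag(G))))

print(coherence(H, normalize=True))            # ~0    (orthonormal)
print(coherence(P0 @ H, normalize=False))      # ~0.408 = 1/sqrt(6)
print(coherence(P0 @ H, normalize=True))       # 1.0   (worst possible)
```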
Possible fix using Progressions
1. $\min_c \sum_{i=1}^{M} |c_i|$  subject to  $\|\sum_{i=1}^{M} c_i h_i - [x_0; 0]\|^2 \le T$
2. Enforce the available data.

y = l1_recons([x_0; 0], H, T)
y_1 = l1_recons(y, H, T - dT)
...
y_P = l1_recons(y_{P-1}, H, T_f)

•If you pick a large T, maybe you can pretend the first step is a convex problem.
•This is not an l1 problem! No single l1 solution will generate the final result.
•After the first few solutions, you may start hitting l1 issues.
The fix is ID!
y_1 = l1_recons(y, H, T - dT)   vs.   y_1 = denoising_recons(y, H, T - dT)
For denoising_recons you can do soft thresholding, "block descent", or iterative thresholding as in:
Daubechies, Defrise, and De Mol, ``An iterative thresholding algorithm for linear inverse problems with a sparsity constraint'';
Figueiredo and Nowak, ``An EM Algorithm for Wavelet-Based Image Restoration''.
Experience suggests: hard thresholding >> soft thresholding.
•There are many "denoising" techniques that discover the "true" sparsity.
•Pick the technique that is cross-correlation robust.
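The l1_recons above is left abstract on the slides; one plausible stand-in, in the spirit of the Daubechies-Defrise-De Mol and Figueiredo-Nowak iterative thresholding cited above, is an ISTA-style pass followed by re-imposing the available pixels. This is a hedged sketch: the Lagrangian weight (lam, standing in for the constraint level T) and the iteration count are illustrative.

```python
import numpy as np

def l1_recons(y, H, T, mask, x_known, n_iter=200):
    """One l1 pass (sketch): ISTA for min_c 0.5*||Hc - y||^2 + lam*||c||_1,
    with lam standing in for the constraint level T, then step 2: enforce
    the available data."""
    lam = T
    L = np.linalg.norm(H, 2) ** 2          # Lipschitz constant of the quadratic term
    c = np.zeros(H.shape[1])
    for _ in range(n_iter):
        z = c - (H.T @ (H @ c - y)) / L    # gradient step on the data term
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    y_new = H @ c
    y_new[mask] = x_known                  # enforce available data
    return y_new
```

The progression then runs exactly as on the earlier slide: y = l1_recons([x_0; 0], H, T), y_1 = l1_recons(y, H, T - dT), and so on down to T_f.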
Conclusion
•To see its limitations, go ahead and solve the real l1 problem (with or without masking setups; you can even cheat on T) and compare to ID:
$\min_c \sum_{i=1}^{M} |c_i|$  subject to  $\|P_0(\sum_{i=1}^{M} c_i h_i - x)\|^2 \le T$
•Smallest mse is not necessarily the same as sparsest. Somebody putting up really bad estimates may be very sparse (sparser than us) with respect to some basis.
•Good denoisers should be cross-correlation robust (hard thresholding tends to beat soft).
•How many iterations you do within each l1_recons() or denoising_recons() is not very important.
•Progressions!
•Will l1 generate sparse results? In the sense of the trivial sparsity theorem, of course! (Sparsity may not be in terms of your intended basis :). Please check the assumptions for your problem!
The trivial sparsity theorem is true. The prediction problem is all about the basis. ID simply allows the construction of a sophisticated, signal-adaptive basis, starting from a simple dictionary!