Slides - DAIICT Intranet

Download Report

Transcript Slides - DAIICT Intranet

Compression of facial images
using the K-SVD algorithm
Ori Bryt , Michael Elad, JVCIR, Elsevier, 2008
Shrishail S. Gajbhar
201121016
What are the principles behind
compression?



Two fundamental components of compression
are redundancy and irrelevancy reduction.
Redundancy reduction aims at removing
duplication from the signal source
(image/video).
Irrelevancy reduction omits parts of the
signal that will not be noticed by the signal
receiver, namely the Human Visual System
(HVS).
Example of Image Redudancy
Profile of cameraman image along row number 164. The top graph shows
pixel intensity, and the bottom graph shows corresponding normalized
correlation over 128 pixel displacements.
Lossless vs. Lossy Compression
Reconstructed
image
Compression rate
Lossless
Lossy
numerically
identical to
the original
image
2:1 (at most
3:1)
contains
degradation
relative to the
original
high
compression
(visually
lossless)
Image compression steps:
Original image
(reconstructed)
Source encoder
(inverse T)
linear transform to
decorrelate the
image data
(lossless)
Compressed
Entropy Coding
(decoding)
of the resulting
quantized values
(lossless)
Quantization
(dequantization)
of basis functions
coefficients
(lossy)
Facial Image Compression


Very specific class of images, a priori kown and
with additional contraints we can have additional
redudancy and hence more compression. (can
outperform JPEG 2000)
Goal: Compression ratio (154:1 and more)
e.g, 441× 358 size standard digital b/w passport
photo (154 Kbytes, 8bpp)
Compressing to less than 1KB without loss in
visual quality.
State of The Art Techniques


JPEG2000 (Standard general purpose wavelet
based image compression).
Geometric Allignment, VQ coding trained using
Tree K-Means. (Low Bit-Rate Compression of Facial Images, by
Michael Elad, IEEE TIP, 2007 (outperforms JPEG2000 !)).

PCA

ICA along with Matching Persuit type algoritms.
(On the use of independant component analysis for image compression
(outperforms JPEG, nearby to JPEG2000)).

K-SVD and Sparse representation
above Elad’s paper replacing VQ coding by KSVD).
(Extension of the
Background Material



Low Bit-Rate Compression of Facial Images
(Elad, Goldenberg and Kimmel)
Geometrical Canonization : Restricted to
frontal facial mug-shots, the handled images are
geometrically deformed into a canonical form, in
which facial features are located at the same
spatial locations. Using a plain feature detection
procedure, the image is divided to disjoint and
covering set of triangles, each deformed using a
different affine warp.
Leads to better conditioning of input images.
Geometrical Canonization



This information is sent to decoder for inverse
warping. Parameters coded using 20 bytes.
A feature based correspondance
Facial Features Detection and Image Warping:
13 feature points along with six corner points
define triangulation such that every triangle in
i/p image and ‘average’ image uniquely define
an affine transformation.
Allignment Example
(Top) Input images and their (canonical) aligned (bottom) versions.
Coding and Hierarchical
Multiscale Treatment




Coding : VQ dictionaries are trained using tree
K-Means per patch separately using patches
taken from the same location from 5000 images.
Only small patches can be used, which leads to
loss of some redudancy, hence in compression.
The image is scaled down and VQ-coded using
patches of size 8×8. Then it is interpolated back
to the original resolution, and the residual is
coded using VQ on 8×8 pixel patches once
again.
Surpass JPEG2000 visually and also in PSNR.
VQ based compression results
Left:VQ, Right:JPEG200 for
270 Bytes (compression ratio 585:1)
VQ based compression results
Sparse and Redudant
Representations






Underdetermined Linear Systems : consider a matrix A 
with n<m, and Ax  b .
More unnkowns than equations, no solution if b is not in
the span of columns of matrix A.
Infinitely many solutions with full rank assumption of A.
How can we find the proper x from infinitetly many
possible images that “explain” given measurement b.
(we knew b is down sampled and blurred version of x )
J ( x) subject to b  Ax .
Regularization: ( PJ ): min
x
Use of convex and strict functions J (.) that guarantees
uniquness of solution.
nm
Choice of J (.)




L2 norm : Closed form and unique solution
exists (forward and inverse transforms are
linear).
L1 norm : may have more than one solutions,
but has tendency to prefer sparse solutions.
(Fundamental property of Linear Programming)
i.e., L1 minimization can be formed as a LP
problem.
L0 norm gives the sparsest solution but with lot
of troubles! (misleading doesn’t satisfy all
axiomatic requiremets of norm)
P0 problem-Our Main Interest


•
•

subject to Ax  b .
Basic questions:
( P0 ): min x
x
0
Can uniqueness of a solution be claimed? Under what
conditions?
If a candidate solution is available, can we perform a
simple test to verify that the solution is actually the
global minimizer of P0 .
Suppose A is of size 500×2000, we know
sparsest solution to P0 has |S| 20 non-zeros,
then we have 3.9E+47 such |S| column options
then this simple calculation takes 1.2E+31 years
to end!!! (1E-9 seconds for each individual test)
Sparseland Model







A parametric description of signal sources in a way that
adapts to their true nature.
Coding mechanism for image patches.
ˆ  arg min  0 subject to D  x 22   2 .
x
ˆ 0  n
x is image patch D is redudant dictionary (learned by KSVD).
For compression of image patches, equation 1 is solved
using matching and basis persuit algorithms. OMP is
used here for it’s simplicity and efficiency.
The solution of P0 is called sparse coding.
Training A Dictionary using K-SVD

Given a set of image patches of size N×N to be coded,
  {x j }Mj 1 , dictionary D is required using available sparse
coding techniques so D is obtained by minimizing following,
energy functional w.r.t. D and { j }Mj 1
M
 ( D,{ j } )  [  j  j
M
j 1




j 1
2
0
 D j  x j ].
2
and j D j  z j 2   leads to different values of  j .
Adopts a block-coordinate descent idea.
Assuming that D is known, the penalty posed in Eq.
reduces to a set of M sparse coding operations, so OMP can
be used to obtain the near-optimal solutions.
Assuming representation vectors are fixed, the K-SVD
algorithm updates one dictionary column at a time..
j  j
0
L
K-SVD algorithm and RMSE per iteration
A typical representation error (RMSE) as a function of the iterations of the K-SVD
algorithm. This graph corresponds to a gradual increment of atoms in the
representation ðLÞ from 1 to 7.
Proposed Method in Paper






1)
2)
3)
4)
The general scheme :
An offline K-SVD training process.
An online image compression/decompression processes.
K – SVD Training: offline, produces set of K-SVD
dictionaries which are then fixed for image compression.
Learning set is formed for each 15×15 patch. Mean
patch from the set is subtracted from all the examples
from that set.
Then encoder uses following steps for compression :
Pre-processing
Slicing to patches
Sparse coding
Entropy coding and quantization
Block Diagram of proposed
method by authors
Encoding Steps in Detail

Pre-processing :

Geometrical warping
Simple scale down by a factor of 2:1.

Slicing to patches:


15×15 patch is used.
Mean patch images calculated for learning set before training set is
subtracted from relevant patches (so that patch fits the dictionary in
content).

Sparse coding:




Every patch location has a pre-trained dictionary of code-words
(atoms) Dij.
k= 512 is used, dictionaries are generated using K-SVD using 4415
training images.
Coding: assigning a linear combination of few atoms to describe the
patch content (sparse coding).
Encoding Steps in Detail









Sparse coding (cont..):
Patch contents : linear weights and indices
Number of atoms vary from one patch to another, based on its
average complexity, and this varied number is known at the
decoder.
Sparse coding is done using OMP method.
At decoder, building of output patches is done independent of
others.
Entropy coding and quantization:
Huffman tables for coding indices.
Coefficients weights uniformly quantized to 7 bits.
If on incoming input image, facial feature detection fails utterly
that leads to low PSNRs in such case JPEG2000 can be used or
asking user for selecting the feature points.
VQ vs. Sparse Representation

No of codewords k in VQ dictionary poses constraints on image
patch size since k increases exponentially as :
2
N B
P
B  2 .log 2 (k )  k  2 P
N




B: target rate of bits. P is total no of pixels in image.
e.g., P = 39,600 (180×220), B = 500bytes, N = 8, then k = 88.
If N = 12 then k = 24,000 approximately, which cannot be
trained using 6000 examples.
Redundancy in adjacent patches is overlooked.
In case of sparse and redundant dictionaries, we have
B


P
.L.(log 2 (k )  7)
N2
Out of k atoms L are used on average per patch, so effect of
increasing N is absorbed in varying L.
e.g., P = 39,600 pixels, k = 512 atoms and target rate of 5001000 bytes and N = 8-30, leads L in the range of 1-12.
Required average number of atoms
L as a function of the patch size N
Run-time and memory
requirement

Computational complexity :
Training : sparse coding (OMP) and dictionary update requires ( JnkLTr SL )
per patch. LTr is the maximal number of non-zero elements in each
coefficient vector in each patch during the training, and SL is the
number of examples in the learning set. Complexity for training all
patches is ( PJkLTr SL ) .






Compression : encoding and decoding stage.
Encoding : (nkLav ) per patch, Lav is avg no of non-zero elements in
coefficient vector of all patches. So total complexity ( PkLav )
Decoding : Due to simple linear combination requires (nLav ) per
patch. So overall complexity is ( PLav ) .
Memory requirements :
Considering dictionaries per patch, mean patch images, Huffman
tables, coefficient usage tables, quantization levels per patch
memory requirement is O(Pk) (P = 39,600 and k = 512) i.e., 20Mb.
More details and results







100hrs for training dictionaries, 5 seconds for
compression and 1 second for decompression.
4415 images for learning set, 100 images for testing.
Images size is 358×441, downscaled to 179×221.
General scheme applicable to all facial databases with
some changes in parameters.
K-SVD dictionaries:
Stopping criteria : Maximum no of iterations 100 and minimal
representation error. In compression, limit on maximum number of
atoms per patch. Every dictionary has 512 patches of size 15×15
pixels as atoms.
Dictionaries contain images similar in nature to the image patch for
which they were trained for.
K-SVD Dictionaries
The Dictionary obtained by K-SVD for Patch No. 80 (the left eye) using the
OMP method with L = 4.
K-SVD Dictionaries
The Dictionary obtained by K-SVD for Patch No. 87 (the right nostril) using the
OMP method with L = 4.
Reconstructed Images


Coding strategy is able to learn which parts are more
difficult to code by assigning same representation error
threshold to all patches and observing how many atoms
are required per patch on average.
No of atoms increases, RMSE decreases for given image
bit rate. Figure shows the atom allocation map and the
representation error map for 400 bytes.
Reconstructed Images 630 bytes

No smearing artifact over all image but in local small area.
Compression is RMSE oriented not on the bit rate so
increasing the bitrate doesn’t improves the PSNR
drastically as can be seen in results.

Artifacts:






Caused by the slicing to disjoint patches, coding them
independently.
Blockiness effect.
inaccurate representation of high frequency elements in
the image.
inability to represent complicated textures.
inconsistency in the spatial location of several facial
elements comparing to the original image.
‘‘good” 4.04 and a ‘‘bad” 5.9 representation
RMSE, both using 630 bytes
Comparing the visual quality of a reconstructed image from the test set in
several bit-rates. From top left,clockwise: 285 bytes (RMSE 10.6), 459 bytes
(RMSE 8.27), 630 bytes (RMSE 7.61), 820 bytes (RMSE 7.11), 1021 bytes (RMSE
6.8), original image.
Comparing to other techniques
Facial images compression with a bit-rate of 400 bytes. Comparing
results of JPEG2000, the PCA results, and K-SVD method.
Comparing to other methods
Comparing to other methods
Rate Distortion curves and
comparison of PSNRs
Comments



Increased redudancy in dictionaries is
desirable for high compression and better
visual quality.
With additional constraints for increased
redudancy, compression performance can
surpass JPEG2000.
Use of sparse and redudant representation
is a sophisticated frame work for various
image related operations.