PPT - The Hong Kong Polytechnic University


JHU
June 25, 2012
Multimedia Content Analysis via
Computational Human Visual
Model
Shenghua ZHONG
Department of Computing
The Hong Kong Polytechnic University
www.comp.polyu.edu.hk/~csshzhong
1
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
2
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
3
Multimedia Content Analysis

Definition of multimedia content analysis
- Computerized understanding of the semantic meanings of a multimedia document [Wang et al, SPM, 2000]

Difficulty in multimedia content analysis
- The semantic gap is the well-known challenge [Jiang et al, ACMMM, 2009]: the gap between low-level features computable by computers and high-level concepts understandable by humans

Typical multimedia content analysis tasks [Amit, MIT, 2002] [Liu et al, HCI, 2001]
- Quality assessment
- Object detection and recognition
- Indexing and annotation
- Classification and retrieval
4
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
5
Introduction to Human Visual System

Definition of cognitive science
- The science of mind, concisely defined as the study of the nature of intelligence, principally the nature of the human mind [Eckardt, MIT, 1995]

Definition of human visual system
- One of the research foci of cognitive science
- The part of the central nervous system that enables organisms to process visual information

Four processes of the human visual system
- Formation of the image on the retina
- Visual processing in the visual cortex
- Attentional allocation
- Perceptual processing
6
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
7
Moment Invariants to Motion Blur for Water Reflection Detection and Recognition

Fig. The proposed work on water reflection detection and recognition, highlighted with a purple background.
8
Introduction to Water Reflection

Definition
- The change in direction of a wavefront at an interface between two different media, such that the wavefront returns into the original medium
- A special case of imperfect symmetry

Importance
Fig. Example of the influence of the water reflection part. (a) An image with water reflection. (b) The correct segmentation result of (a). (c) The actual segmentation result of (a). (d) The color histogram of the whole image (a). (e) The color histogram of the object part.
9
Ineffectiveness of Existing Symmetry Technology

Failure of the scale-invariant feature transform (SIFT) descriptor [Loy et al, ECCV, 2006]

Fig. Examples of the ineffectiveness of local features in images with water reflection. (a) The correct result; (b) the SIFT descriptor matching result. Red circles denote SIFT detector results; green lines denote matched SIFT descriptor pairs. The SIFT method is clearly ineffective for water reflection detection and recognition.
10
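The matching attempted above can be reproduced in a few lines of OpenCV. A minimal sketch, assuming opencv-python >= 4.4 (the file name is hypothetical), of matching SIFT descriptors between an image and its vertically flipped copy, the usual first step of reflection-symmetry detectors in the spirit of [Loy et al, ECCV, 2006]:

```python
import cv2

img = cv2.imread("water_reflection.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
flipped = cv2.flip(img, 0)  # flip about the horizontal axis

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img, None)
kp2, des2 = sift.detectAndCompute(flipped, None)

# Brute-force matching with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative symmetric matches")
# Motion blur in the reflection part distorts the local gradients that SIFT
# relies on, so few (or wrong) matches survive -- the failure shown above.
```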
Ineffectiveness of Existing Water Reflection Detection Technology

Failure of the flip invariant shape detector [Zhang et al, ICPR, 2010]

Fig. Examples of shape detection results of the flip invariant shape technique.
11
Basic Idea

Definition and influence of motion blur
- Caused by relative motion between the sensor and the scene during the exposure time [Flusser et al, CS, 1996]
- A well-known degradation factor: motion changes the image features needed for feature-based recognition techniques
12
Moment Invariants to Motion Blur

Algorithm of Moment Invariants
Input: image $I(x, y)$.
Output: moment invariants to motion blur $IR_{m1}, IR_{m2}, IR_{m3}, IR_{m4}$.

1. Geometric moment calculation ($m_{00}, m_{01}, m_{10}$):
   $m_{pq} = \sum_x \sum_y x^p y^q I(x, y)$
2. Centroid calculation:
   $\bar{x} = m_{10}/m_{00}$, $\bar{y} = m_{01}/m_{00}$
3. Central moment calculation ($\mu_{00}, \mu_{12}, \mu_{21}, \mu_{30}, \mu_{03}$):
   $\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q I(x, y)$
4. Normalized complex moment calculation ($\eta_{12}, \eta_{21}, \eta_{30}, \eta_{03}$):
   $\eta_{ij} = \mu_{ij} / \mu_{00}^{1 + (i + j)/2}$
5. Moment invariants to motion blur:
   $IR_{m1} = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$
   $IR_{m2} = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$
   $IR_{m3} = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]$
   $IR_{m4} = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]$
13
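The five steps above translate directly into NumPy. A minimal sketch (not the authors' original code) computing IR_m1..IR_m4 for a gray-level image:

```python
import numpy as np

def blur_invariants(img):
    """img: 2-D array of gray-level intensities I(x, y)."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)

    def m(p, q):                      # geometric moment m_pq
        return np.sum(x**p * y**q * img)

    xb, yb = m(1, 0) / m(0, 0), m(0, 1) / m(0, 0)   # centroid

    def mu(p, q):                     # central moment mu_pq
        return np.sum((x - xb)**p * (y - yb)**q * img)

    def eta(p, q):                    # normalized moment eta_pq
        return mu(p, q) / mu(0, 0)**(1 + (p + q) / 2)

    n30, n12, n21, n03 = eta(3, 0), eta(1, 2), eta(2, 1), eta(0, 3)
    a, b = n30 + n12, n21 + n03
    ir1 = (n30 - 3*n12)**2 + (3*n21 - n03)**2
    ir2 = a*a + b*b
    ir3 = (n30 - 3*n12)*a*(a*a - 3*b*b) + (3*n21 - n03)*b*(3*a*a - b*b)
    ir4 = (3*n21 - n03)*a*(a*a - 3*b*b) - (n30 - 3*n12)*b*(3*a*a - b*b)
    return ir1, ir2, ir3, ir4
```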
High Frequency Energy Decay in Water Reflection

(a) Original water reflection image
(b) High frequency information part
Fig. Decay of information and energy in the high-frequency band due to motion blur.
14
Flowchart

Fig. Flowchart of the proposed method. The original image is decomposed by the curvelet transform into low- and high-frequency coefficients; after the inverse curvelet transform, the coefficient difference is calculated, moment features are computed on image subblocks, and the reflection cost (RC) is minimized to locate the reflection axis. If min RC > IR_T, the image is a non-symmetry image; otherwise it is an imperfect symmetry image. If |N1 - N2| > N_T, it is another kind of imperfect symmetry image; otherwise it is a water reflection image, with the object located on the N1 side if N1 >= N2 and on the N2 side otherwise.
15
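A hedged sketch of the thresholding cascade at the bottom of the flowchart; rc_min, n1, n2 and the thresholds IR_T and N_T are assumed to come from the earlier stages, and the comparison directions are inferred from the diagram:

```python
def classify(rc_min, n1, n2, ir_t, n_t):
    """Decision logic of the detection flowchart (a sketch, not the exact code)."""
    if rc_min > ir_t:
        return "non-symmetry image"
    if abs(n1 - n2) > n_t:
        return "other imperfect symmetry image"
    side = "N1" if n1 >= n2 else "N2"
    return f"water reflection image (object on the {side} side)"
```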
Experiments and Results on Detection of Reflection Axis

Database
- 100 nature images with water reflection from Google

Compared algorithms
- Matching of SIFT descriptors to detect the reflection axis [Loy et al, ECCV, 2006]
- Matching based on the flip invariant shape detector [Zhang et al, ICPR, 2010]

Results
- Accuracy of axis detection: 29% [Loy et al, ECCV, 2006]
- Accuracy of axis detection: 46% [Zhang et al, ICPR, 2010]
- Accuracy of axis detection of our algorithm: 87%
16
Detection Results of Two Algorithms

SIFT algorithm result
Shape algorithm result
Our algorithm result
Fig. Thumbnails of some comparison example images with reflection symmetry detection results.
17
Distinguish Object Part and Reflection Part

(a) Reversed water reflection image.
(b) Positive curvelet coefficients of the object part (left) and the reflection part (right).
Fig. Object part and reflection part determined by curvelet coefficients.
18
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
19
Top-down and Bottom-up Saliency Map for No-Reference Image Quality Assessment

Fig. The proposed work on no-reference image quality assessment, highlighted with a purple background.
20
Introduction to No-Reference Image Quality Assessment

Definition of no-reference image quality assessment
- A predefined correct (reference) image is not available
- Mainly aims to measure sharpness/blurriness

Difficulty
- How to assess quality in agreement with human judgement

Limitation of existing work
- Ignores that cognitive understanding influences the perceived quality [Wang et al, TIP, 2004]

Fig. Example of image quality influenced by cognitive understanding. (a) Image without distortion. (b) Blurriness mainly on the girl. (c) Blurriness mainly on the apple.
21
Basic Idea

Combine semantic information from prior knowledge to build the saliency map
- The existing bottom-up saliency map does not match actual eye movements

Measure sharpness based on the top-down and bottom-up saliency map model
22
Target Information Acquisition in Whole Flowchart

Fig. Flowchart of the proposed image sharpness assessment metric. The input image's tag information is filtered through WordNet for target information acquisition; together with eye-tracking data and visual information, a saliency map model is built; saliency regions are calculated, edge block distortion is computed, and a sharpness score is output. The orange part is the target information acquisition.
23
Target Information Acquisition

Example: an input image with the tags "People, New York"
- Remove: "New York" (does not belong to a physical entity in WordNet)
- Remain: "People"

Target information acquisition: the tag information is filtered through WordNet to get the target information.
24
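A minimal sketch of this WordNet filter, assuming NLTK with the WordNet corpus installed; "person" stands in for the slide's "People" tag, and which tags survive depends on WordNet's sense inventory:

```python
from nltk.corpus import wordnet as wn

PHYSICAL = wn.synset("physical_entity.n.01")

def is_physical_entity(tag):
    """True if some noun sense of the tag has physical_entity as a hypernym."""
    for synset in wn.synsets(tag.replace(" ", "_"), pos=wn.NOUN):
        if PHYSICAL in synset.closure(lambda s: s.hypernyms()):
            return True
    return False

tags = ["person", "new york"]
targets = [t for t in tags if is_physical_entity(t)]
# -> ["person"]; the "New York" synsets carry only instance hypernyms,
#    so the plain hypernym closure never reaches physical_entity.
```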
Saliency Map Model Learning in Whole Flowchart

Fig. Flowchart of the proposed image sharpness assessment metric (the same pipeline as above). The orange part is the saliency map model learning.
25
Flowchart of Top-Down & Bottom-Up Saliency Map Model Learning

Fig. Flowchart of the proposed top-down & bottom-up model algorithm. Eye-tracking data supply the training signal; tag information is filtered through WordNet to get the target information for target detection (high-level feature detection); visual information feeds Itti's bottom-up saliency map and a center priority (low-level feature detection); the detected features train the saliency map model by SVM, which outputs the top-down & bottom-up saliency model.
26
Top-Down & Bottom-Up Saliency Map Model Learning

Learning the saliency map model by SVM

Ground truth map
- Created by convolving the function of contrast sensitivity [Wang et al, TIP, 2001] over the fixation locations [Judd et al, ICCV, 2009]:

  $g(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} \psi(x - f_x(i, j),\; y - f_y(i, j))$
  $I(x, y) = g(x, y) / \max_{x, y} g(x, y)$

Function of contrast sensitivity
  $e = \tan^{-1}\!\left( \frac{d(x - u,\, y - v)}{L v} \right)$, where $d(x, y) = \sqrt{x^2 + y^2}$

Notations
- e: half-resolution eccentricity constant
- L: image width
- v: viewing distance
- N: number of fixation locations
- M: number of users
In the example, N = 15 and M = 6.

An example of the ground truth map: (a) original image, (b) eye-fixation locations, (c) ground truth map.
27
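A hedged NumPy sketch of the ground-truth map construction: place a foveated sensitivity kernel at each eye fixation and normalize. The decay kernel here is a simple stand-in for the exact contrast sensitivity function of [Wang et al, TIP, 2001]; fixations is an array of (x, y) positions pooled over all users:

```python
import numpy as np

def ground_truth_map(shape, fixations, L, v):
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    g = np.zeros(shape)
    for fx, fy in fixations:
        d = np.hypot(xx - fx, yy - fy)        # distance to the fixation
        ecc = np.arctan(d / (L * v))          # eccentricity, as on the slide
        g += 1.0 / (1.0 + ecc)                # sensitivity decays with eccentricity
    return g / g.max()                        # I(x, y) = g(x, y) / max g
```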
Image Quality Assessment Based on Proposed Saliency Map

Fig. Flowchart of the proposed image sharpness assessment metric (the same pipeline as above). The orange part is the image quality assessment based on the proposed saliency map.
28
Experiment Setting of Top-Down and Bottom-Up Saliency Map Model

Dataset
- From the eye-tracking database [Judd et al, ICCV, 2009]
- Training: 200 images
- Test: 64 images

Training samples from the ground truth map
- Positively labelled data: randomly choose 30 pixels from the 10% most salient locations
- Negatively labelled data: randomly choose 30 pixels from the 10% least salient locations
29
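A minimal sketch of the training-sample selection above (30 positive pixels from the top 10% of the ground-truth map, 30 negatives from the bottom 10%):

```python
import numpy as np

def sample_pixels(gt_map, n=30, rng=np.random.default_rng(0)):
    flat = gt_map.ravel()
    hi, lo = np.quantile(flat, 0.9), np.quantile(flat, 0.1)
    pos = np.flatnonzero(flat >= hi)   # 10% most salient locations
    neg = np.flatnonzero(flat <= lo)   # 10% least salient locations
    pos = rng.choice(pos, n, replace=False)
    neg = rng.choice(neg, n, replace=False)
    return np.unravel_index(pos, gt_map.shape), np.unravel_index(neg, gt_map.shape)
```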
Comparison Results of Two Saliency Map Models

(a) Original image. (b) Eye-fixation locations. (c) Fixation points covered by the bottom-up saliency model. (d) Fixation points covered by our saliency model.
Fig. A sample example comparing the coverage of fixation points by different saliency models.

Table. Evaluation of the proposed saliency map

| Model | Saliency points inside 20% most salient regions | Non-saliency points outside 80% salient regions |
| Our saliency map model | 84.156% | 76.344% |
| Itti's bottom-up saliency map model | 45.35% | 72.875% |
30
Experiment Setting and Result of Image Quality Assessment

Database
- 160 images downloaded from Flickr, blurred at eight different levels of Gaussian blur

Subjective image quality assessment
- Rated from 1 to 5, corresponding to "very annoying", "annoying", "slightly annoying", "perceptible but not annoying", and "imperceptible", by 14 subjects

Evaluation results

| Metric | Nonlinear Pearson | Spearman | MAE | RMS |
| Proposed metric | 0.914 | 0.86 | 0.173 | 0.25 |
| Classical JNB [Ferzli et al, TIP, 2009] | 0.885 | 0.815 | 0.219 | 0.292 |
| Saliency weighted JNB [Nabil et al, ICIP, 2008] | 0.863 | 0.801 | 0.317 | 0.232 |
| JNB with edge refinement [Varadarajan et al, ICIP, 2008] | 0.618 | 0.466 | 0.387 | 0.494 |
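For reference, these four numbers are typically computed as in the SciPy sketch below, given predicted sharpness scores and mean opinion scores (MOS); the slide's "nonlinear Pearson" is usually computed after a nonlinear regression of the metric onto the MOS, which is omitted here:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, mos):
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    pearson = pearsonr(pred, mos)[0]            # linear correlation
    spearman = spearmanr(pred, mos)[0]          # rank-order correlation
    mae = np.mean(np.abs(pred - mos))           # mean absolute error
    rms = np.sqrt(np.mean((pred - mos) ** 2))   # root-mean-square error
    return pearson, spearman, mae, rms
```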
31
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
32
Fuzzy Based Contextual Cueing for Region Level Annotation

Fig. The proposed work on region level annotation, highlighted with a purple background.
33
Introduction to Region Level Annotation

Definition of region level annotation
- Segment the image into semantic regions
- Assign the given image-level annotations to the precise regions

Annotation: Water, Cow, Grass
(a) An image with given image-level annotations. (b) An image with automatic region-level annotation.

Motivation of automatic region level annotation
- Helpful for achieving reliable content-based image retrieval [Liu, ACM MM 09']
- Substitutes for tedious manual region-level annotation
34
Representative Work of Region Level Annotation

Early work on region level annotation
- Known as simultaneous object recognition and image segmentation
- Unsupervised learning: handles images with a single major object or with a clean background [Cao, ICCV 07']
- Supervised learning: focuses on special object recognition or special domains [Li, CVPR 09']

Latest work for real-world applications
- Label propagation by bi-layer sparse coding [Liu, ACM MM 09']
  - Assumes that common annotations are more likely to have similar visual features in the corresponding regions
  - Shows impressive results on nature images
35
Limitation of Visual Similarity in Region Level Annotation

Fig. Example of the difficulty of distinguishing sky and sea based on visual features. (a) The original image. (b) The original image with 200 data points. (c)-(f) The 128-dimensional local features of four random points selected from the sky and the sea.
36
Contextual Cueing in Perception Processing

Contextual cueing
- Human brains gather information through incidentally learned associations between spatial configurations and target locations [Chun, CP, 1998]
- Spatial invariants [Biederman et al, CoP, 1982]: probability, co-occurrence, size, position, spatial topological relationship
37
Contextual Cueing Modeling by Fuzzy Theory

The difficulty of modeling contextual cueing
- Classical bivalent set theory causes serious semantic loss
- Example: imprecise position and ambiguous topological relationship

(a) Example of an ambiguous topological relationship between regions A and B, each represented by its center of gravity. (b) Example of a topological relationship with region R for object recognition.

Fuzzy theory
- Measures the degree of truth
- Uses fuzzy membership to quantize the degree of truth
- Fuzzy logic allows decision making with imprecise information
38
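A small sketch of the fuzzy-membership idea, with an illustrative piecewise-linear membership function (not the paper's exact definition) for the degree to which region A lies above region B, based on their centers of gravity:

```python
import numpy as np

def membership_above(center_a, center_b, soft=50.0):
    """Degree in [0, 1] to which region A is above region B (image y grows downward)."""
    dy = center_b[1] - center_a[1]
    return float(np.clip(0.5 + dy / (2 * soft), 0.0, 1.0))

# e.g. membership_above((100, 40), (100, 90)) -> 1.0 (clearly above);
#      membership_above((100, 80), (100, 90)) -> 0.6 (weakly above).
```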
Flowchart
39
Illustration of Fuzzy Based Contextual Cueing Label Propagation

Annotation: Sky, Water, Beach, Boat
(a) Original image with given image-level annotations. (b) Over-segmentation result. (c) Label propagation across images. (d) Label propagation using fuzzy based contextual cueing.
40
Experiment on MSRC Dataset

MSRC Dataset
- 380 images with 18 categories: building, grass, tree, cow, boat, sheep, sky, mountain, aeroplane, water, bird, book, road, car, flower, cat, sign, and dog

Comparison methods
- Four baseline methods implemented by binary SVMs with different values of the maximal patch size
  - SVM1: 150 pixels, SVM2: 200 pixels, SVM3: 400 pixels, and SVM4: 600 pixels
- Two latest techniques [Liu et al, ACM MM09']
  - Label propagation with one-layer sparse coding
  - Label propagation with bi-layer sparse coding

Experimental result

Table 1. Label-to-region assignment accuracy comparison.

| Database | SVM1 | SVM2 | SVM3 | SVM4 | One Layer | Bi-Layer | FCLP |
| MSRC | 0.24 | 0.22 | 0.27 | 0.25 | 0.54 | 0.65 | 0.72 |
41
Experiment Analysis on MSRC Dataset

Annotation: Sky, Building, Tree, Road
(a) An image with annotations. (b) Bi-layer result. (c) FCLP result.

Annotation: Sky, Building, Tree, Road, Car
(d) An image with annotations. (e) Bi-layer result. (f) FCLP result.
42
Outline

Introduction to multimedia content analysis
Introduction to human visual system
Moment invariants to motion blur for water reflection detection and recognition
Top-down and bottom-up saliency map for no-reference image quality assessment
Fuzzy based contextual cueing for region level annotation
Proposed deep learning for multimedia content analysis
43
Diagram of Deep Learning

Fig. The proposed work on deep learning for multimedia content analysis, highlighted with a purple background.
44
Outline of Proposed Deep Learning Model

Introduction
Deep learning
Proposed algorithm
- Bilinear deep belief networks
Experiments and results
- Experiment on Handwriting Dataset MNIST
- Experiment on Complicated Object Dataset Caltech 101
- Experiments on the Urban & Natural Scene
- Experiments on Face Dataset CMU PIE
Field effect bilinear deep belief networks
45
Introduction

Image classification is a classical problem
- Aims to understand the semantic meaning of visual information
- Determines the category of an image according to predefined criteria
- Remains a well-known challenge after more than fifteen years of extensive research
- Humans, in contrast, do not have difficulty classifying images

Aim of this paper
- Provide human-like judgment by referencing the architecture of the human visual system and the procedure of intelligent perception
- Deep architecture is a representative paradigm that has achieved notable success in modeling the human visual system
46
Outline of Proposed Deep Learning Model

Introduction
Deep learning
Proposed algorithm
- Bilinear deep belief networks
Experiments and results
- Experiment on Handwriting Dataset MNIST
- Experiment on Complicated Object Dataset Caltech 101
- Experiments on the Urban & Natural Scene
- Experiments on Face Dataset CMU PIE
Field effect bilinear deep belief networks
47
Research on Deep Learning

Definition of deep learning
- Models a learning task using deep architectures composed of multiple layers of nonlinear modules

Deep belief network (DBN)
- Densely connected between layers
- Utilizes the RBM as the basic building block
- Two stages: abstract the input information layer by layer, then fine-tune the whole deep network to the ultimate learning target [Hinton et al, NC, 2006]

Research progress
- Deep architectures are thought to be best exemplified by neural networks [Cottrell, Science, 2006]
- The DBN exhibits notable performance on different tasks, such as dimensionality reduction [Hinton et al, Science, 2006] and classification [Salakhutdinov et al, AISTATS, 2007]
48
Architecture of Deep Belief Network

1. The initial weighted connections are randomly constructed.
2. The size of every layer is determined based on intuition.
3. The parameter space is refined by greedy layer-wise information reconstruction.
4. The first to third stages are repeated until the parameter space in all layers is constructed.
5. The whole model is fine-tuned to minimize the classification error based on backpropagation.

Fig. Structure of the deep belief network (DBN).
49
Outline of Proposed Deep Learning Model

Introduction
Deep learning
Proposed algorithm
- Bilinear deep belief networks
Experiments and results
- Experiment on Handwriting Dataset MNIST
- Experiment on Complicated Object Dataset Caltech 101
- Experiments on the Urban & Natural Scene
- Experiments on Face Dataset CMU PIE
Field effect bilinear deep belief networks
50
Bilinear Deep Learning

Deep architecture
- Human: visual information through the optic tract to a nerve position is transmitted as second-order data
- Proposed technique: a novel bilinear deep belief network

Three-stage learning
- Human: two peaks of activation, the "initial guess" and the "post-recognition", in the visual cortex areas
- Proposed technique: a new bilinear discriminant initialization

Semi-supervised framework
- Human: more practical for long-term daily learning of the visual world
- Proposed technique: a more flexible learning framework
51
Bilinear Deep Belief Network

Fig. Structure of the bilinear deep belief network: a stack of second-order planes H^1, ..., H^n, ..., H^N with a label layer La on top; the input image X^k feeds the input plane, and the label layer outputs (y_1^k, y_2^k, ..., y_C^k).

- A fully interconnected directed belief network is constructed from a set of second-order planes
- The number of units in the input layer equals the resolution of the images
- The number of units in the label layer equals the number of image classes
- The number of units in each hidden layer is determined by bilinear discriminant initialization
- The search for the mapping is transformed into the problem of finding the optimum parameter space for the deep architecture
52
Three-Stage Learning

Bilinear discriminant initialization
- Human: early peak related to the activation of the "initial guess"
- Deep: determine the initial parameters and the sizes of the upper layer

Greedy layer-wise reconstruction
- Human: information propagation between adjacent layers
- Deep: determine the parameters of each pair of layers

Global fine-tuning
- Human: late peak related to the activation of "post-recognition"
- Deep: refine the parameter space for better classification performance
53
Bilinear Discriminant Initialization

Latent representation with projection matrices U and V
- Preserve discriminant information in the projected feature space by optimizing the objective function

  $\arg\max_{U, V} J(U, V) = \sum_{s,t=1}^{K} \| U^T (X_s - X_t) V \|^2 \left( \beta B_{st} - (1 - \beta) W_{st} \right)$, s.t. $U^T U = I_P$, $V^T V = I_Q$

  where $B_{st}$ are between-class weights and $W_{st}$ are within-class weights.

- Obtain the discriminant initial connections in the layer pair and utilize the optimal dimension to define the structure of the next layer:

  $P_2 = row(U^1)$, $Q_2 = column(V^1)$, $A^1_{ij,pq}(0) = (U^1_{ip})^T V^1_{jq}$
54
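A hedged NumPy sketch of the alternating eigen-solution for U and V (the same while-loop as in Algorithm 1 below: fix V and solve D_V u = λu, then fix U and solve D_U v = λv). X is a (K, P, Q) array of second-order samples and E holds the combined weights βB − (1 − β)W:

```python
import numpy as np

def bilinear_discriminant_init(X, E, p, q, iters=10):
    K, P, Q = X.shape
    U = np.linalg.qr(np.random.randn(P, p))[0]   # random orthonormal start
    V = np.linalg.qr(np.random.randn(Q, q))[0]
    for _ in range(iters):
        DV = sum(E[s, t] * (X[s] - X[t]) @ V @ V.T @ (X[s] - X[t]).T
                 for s in range(K) for t in range(K))
        U = np.linalg.eigh(DV)[1][:, -p:]        # top-p eigenvectors
        DU = sum(E[s, t] * (X[s] - X[t]).T @ U @ U.T @ (X[s] - X[t])
                 for s in range(K) for t in range(K))
        V = np.linalg.eigh(DU)[1][:, -q:]        # top-q eigenvectors
    return U, V
```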
Greedy Layer-Wise Reconstruction

A joint configuration $(h^1, h^2)$ of the input layer $H^1$ and the first hidden layer $H^2$ has energy

  $E(h^1, h^2; \theta^1) = -(h^1 A^1 h^2 + b^1 h^1 + c^1 h^2)$, with $\theta^1 = \{A^1, b^1, c^1\}$

Utilize the Contrastive Divergence algorithm to update the parameter space

  $\frac{\partial \log p(h^1(0))}{\partial \theta^1} = \left\langle \frac{\partial E(h^2(0), h^1(0))}{\partial \theta^1} \right\rangle_{p(h^2(0) \mid h^1(0))} - \left\langle \frac{\partial E(h^2(t), h^1(t))}{\partial \theta^1} \right\rangle_{p(h^2(t), h^1(t))}$

  $\Delta A^1_{ij,pq} = \eta_A \left( \langle h^1_{ij}(0) h^2_{pq}(0) \rangle_{data} - \langle h^1_{ij}(1) h^2_{pq}(1) \rangle_{recon} \right)$
  $\Delta b^1_{ij} = \eta_b \left( h^1_{ij}(0) - h^1_{ij}(1) \right)$
  $\Delta c^1_{pq} = \eta_c \left( h^2_{pq}(0) - h^2_{pq}(1) \right)$
55
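A minimal CD-1 sketch of these updates, with the second-order planes flattened to vectors for clarity (standard RBM form, not the authors' exact implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(A, b, c, h1_0, eta=0.01, rng=np.random.default_rng(0)):
    # positive phase: sample the hidden layer from the data
    p2_0 = sigmoid(h1_0 @ A + c)
    h2_0 = (rng.random(p2_0.shape) < p2_0).astype(float)
    # negative phase: one step of Gibbs sampling (reconstruction)
    h1_1 = sigmoid(h2_0 @ A.T + b)
    p2_1 = sigmoid(h1_1 @ A + c)
    # parameter updates, as on the slide
    A += eta * (np.outer(h1_0, p2_0) - np.outer(h1_1, p2_1))
    b += eta * (h1_0 - h1_1)
    c += eta * (p2_0 - p2_1)
    return A, b, c
```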
Global Fine-Tuning

Backpropagation adjusts the entire deep network to find good local optimum parameters

  $\theta^* = \arg\min_{\theta} \left[ -\sum_l y_l \log \hat{y}_l \right]$

Before backpropagation, a good region of the whole parameter space has already been found
- The convergence obtained from backpropagation learning is not slow
- The result generally converges to a good local minimum on the error surface
56
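A small sketch of the fine-tuning objective above: softmax cross-entropy between the label layer's output and the one-hot target, whose gradient is backpropagated through the network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, y):
    """y is a one-hot target; returns loss and gradient w.r.t. the logits."""
    y_hat = softmax(logits)
    loss = -np.sum(y * np.log(y_hat + 1e-12))
    grad_logits = y_hat - y          # gradient backpropagated into the network
    return loss, grad_logits
```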
Proposed Algorithm

Algorithm 1: Bilinear Deep Belief Network
Input: training data set X; labeled samples X_L in X; corresponding label set Y; number of layers N; number of epochs E; number of labeled data L; parameter $\beta$; between-class weights $B_{st}$; within-class weights $W_{st}$; initial bias parameters b and c; momentum $\nu$ and learning rates $\eta_A, \eta_b, \eta_c$.
Output: optimal parameter space $\theta^* = [A^*, b^*, c^*]$.

Bilinear Discriminant Initialization
1. Preserve discriminant information by optimizing the objective function
   $\arg\max_{U, V} J(U, V) = \sum_{s,t=1}^{K} \| U^T (X_s - X_t) V \|^2 (\beta B_{st} - (1 - \beta) W_{st})$
   while not convergent do
     $D_V = \sum_{s,t} E_{st} (T_s^n - T_t^n) V V^T (T_s^n - T_t^n)^T$; fix V, compute U by solving $D_V u = \lambda u$
     $D_U = \sum_{s,t} E_{st} (T_s^n - T_t^n)^T U U^T (T_s^n - T_t^n)$; fix U, compute V by solving $D_U v = \lambda v$
   end while
2. Compute the initial weights of the connections: $A^n_{ij,pq}(0) = (U^n_{ip})^T V^n_{jq}$
3. Determine the structure of the next layer: $P_{n+1} = row(U^n)$, $Q_{n+1} = column(V^n)$

Greedy Layer-Wise Reconstruction
4. Calculate the state of the next layer
   $p(h^{n+1}_{pq} = 1 \mid h^n) = \sigma\left( \sum_{i=1}^{P^n} \sum_{j=1}^{Q^n} h^n_{ij} A^n_{ij,pq} + c^n_{pq} \right)$,
   $p(h^n_{ij} = 1 \mid h^{n+1}) = \sigma\left( \sum_{p=1}^{P^{n+1}} \sum_{q=1}^{Q^{n+1}} A^n_{ij,pq} h^{n+1}_{pq} + b^n_{ij} \right)$
5. Update the weights and biases (with momentum $\nu$)
   $A^n_{ij,pq} \leftarrow A^n_{ij,pq} + \eta_A \left( \langle h^n_{ij}(0) h^{n+1}_{pq}(0) \rangle_{data} - \langle h^n_{ij}(1) h^{n+1}_{pq}(1) \rangle_{recon} \right)$
   $b^1_{ij} \leftarrow b^1_{ij} + \eta_b (h^1_{ij}(0) - h^1_{ij}(1))$, $c^1_{pq} \leftarrow c^1_{pq} + \eta_c (h^2_{pq}(0) - h^2_{pq}(1))$

Global Fine-Tuning
6. Calculate the optimal parameter space: $\theta^* = \arg\min_{\theta} \left[ -\sum_l y_l \log \hat{y}_l \right]$
57
Outline of Proposed Deep Learning Model

Introduction
Deep learning
Proposed algorithm
- Bilinear deep belief networks
Experiments and results
- Experiment on Handwriting Dataset MNIST
- Experiment on Complicated Object Dataset Caltech 101
- Experiments on the Urban & Natural Scene
- Experiments on Face Dataset CMU PIE
Field effect bilinear deep belief networks
58
Experiment Setting

Databases
- Subset of Caltech101
  - A standard dataset for image classification with images of 100 different object categories
  - Frequently used subset including 2,935 images from the first 5 categories
- Urban and Natural Scene
  - 2,688 natural color images with 8 categories
- Standard handwritten digits dataset MNIST
  - 60,000 training images, 10,000 test images
- CMU PIE dataset
  - 11,560 face images of 68 subjects, varying in pose, illumination, and expression

Compared algorithms
- K-nearest neighbor (KNN)
- Support vector machine (SVM)
- Transductive SVM (TSVM) [Collobert et al, JMLR, 2006]
- Neural network (NN)
- EmbedNN [Weston et al, ICML, 2008]
- Semi-DBN [Bengio et al, NIPS, 2006]
- DBN-rNCA [Salakhutdinov et al, AISTATS, 2007]
- DDBN [Liu et al, PR, 2011]
- DCNN [Jarrett et al, ICCV, 2009]
59
Experiments on Caltech101

Sample images from the dataset: Faces_easy, Faces, Motorbikes, Airplanes, Back_google

Two experiments
- Classification accuracy comparison
- Converging time comparison

Experiment setting
- 50 images from each category form the test set
- The rest form the training set
60
Classification Accuracy Comparison on Caltech101

- Deep techniques achieve much better performance than shallow techniques
- The bilinear deep belief network performs best across different numbers of labeled data
61
Converging Time Comparison on Caltech 101

- Iterations in the fine-tuning stage
- The BDBN converges much more quickly because of its better "initial guess"
62
Experiments on Urban and Natural Scene

Sample images from the dataset: Forest, Highway, Street, Open country, Mountain, Tall building, City center, Coast & beach

Two experiments
- Classification accuracy and real running time comparison
- Limitation discussion

Experiment setting
- 50 images from each category form the test set
- The rest form the training set
63
Performance Comparison on Urban and Natural Scene

Classical setting of neuron numbers in hidden layers
- 500, 500, and 2000

BDBN setting by bilinear discriminant initialization
- 24*24, 21*21, 19*20

For compared models
- "_d" means the same size as the BDBN setting
- "_c" means the same size as the classical setting
64
Limitation of Image Classification Only Based on Visual Similarity

Misclassified image: ground truth label "Street"; misclassified label "Highway".

Calculating visual similarity alone is limited
- Humans can give the correct judgment by referencing the buildings and cars along the street
- Contextual cueing: the knowledge about spatial invariants learned from past experience
65
Simulate Primary Visual Cortex on MNIST

Responses of V1 neurons
- Selective spatial information filters
- Similar to spatially local, complex Fourier transforms or Gabor transforms

Weights of the proposed BDBN
- Roughly represent different "strokes" of digits
- Oriented, Gabor-like, and resembling the receptive fields of V1 simple cells

Fig. Samples of first layer weights; examples represent the "strokes" of digits.
66
Experiments on CMU PIE

Sample images from the dataset

Two experiments
- Classification accuracy comparison for noisy data
- Parameter space visualization

Experiment setting
- 50 images from each category form the test set
- The rest form the training set
67
Robustness to Noise on CMU PIE

The reconstruction of the BDBN in every layer
68
Visualization of Parameter Space on CMU PIE

The emphasized regions are identical to the facial feature regions
Facial feature points
69
Outline of Proposed Deep Learning Model

Introduction
Deep learning
Proposed algorithm
- Bilinear deep belief networks
Experiments and results
- Experiment on Handwriting Dataset MNIST
- Experiment on Complicated Object Dataset Caltech 101
- Experiments on the Urban & Natural Scene
- Experiments on Face Dataset CMU PIE
Field effect bilinear deep belief networks
70
Image Recognition with Incomplete Data

Incomplete data
- Data values/features are partially observed
- Results from measurement noise, corruption, or occlusion

(a) Incomplete images due to noise and corruption. (b) Incomplete face images due to occlusion of the important facial feature regions.
71
Learning Stages in FBDBN

Field effect bilinear discriminant projection
- Map the original data into a discriminant bilinear subspace based on the features with high reliability

Bi-direction inference
- Bottom-up inference: construct the model from the available features and the estimated features, based on their reliability
- Top-down inference: estimate the missing features from the higher layer activations of the reference datum

Post activation by backpropagation
- Fine-tune to minimize the recognition error and re-estimate the values of the missing features

Fig. Structure of the field effect bilinear deep belief network: planes H^1, ..., H^N with source/gate/drain-style units (S, G, D), a label layer with outputs (y_1^k, y_2^k, ..., y_C^k) on top, and the input image X^k at the bottom.
72
Output Characteristic of FRBM
Fig. The output characteristic curve and the operating mode of the field effect RBM depend on the voltages VGS, Vth, and VDS.
73
Algorithm of FBDBN
74
Experiment of Block-Incomplete Digits with Fixed Missing Ratio

(a) Original images
(b) Incomplete images after pixels are removed at a fixed missing ratio
(c) Estimated images via FBDBN
Fig. Samples of images estimated by FBDBN from block-missing features with a fixed missing ratio.
75
Unsupervised Auto-encoder Comparison

(a) Auto-encoder results of DBN
(b) Auto-encoder results of FBDBN
Fig. Auto-encoder comparison of DBN and FBDBN.
76
Experiment of Face Image Estimation

(a) Reliability curve with the estimated mouth part
(b) Reliability curve with the estimated eyes part
Fig. The reliability curves with the estimated images.
77
Reference
[1] Hu, M.-K. “Visual pattern recognition by moment invariants,” IRE Transactions on Information Theory, vol. 8(2), 1962.
[2] Gregory, R. L., "Eye and Brain: The Psychology of Seeing", Oxford: Oxford University Press, 1967; Palmer, S.E., "The effects of contextual scenes on the identification of objects", Memory and Cognition, vol. 3, pp. 519-526, 1975.
[3] I. Biederman, R. Mezzanotte, and J. Rabinowitz, "Scene perception: detecting and judging objects undergoing relational violations", In Cognitive Psychology, vol. 14(2), pp. 143-177, 1982.
[4] Koch, C. & Ullman, S., "Shifts in selective visual attention: Towards the Underlying Neural Circuitry," Human Neurobiology. vol. 4 (4),
pp. 219-227, 1985.
[5] Newsome, WT Paré, EB., “A selective impairment of motion perception following lesions of the middle temporal visual area (MT)”, J
Neurosci., vol. 8, pp.2201–2211, 1988.
[6] Victor JD, Purpura K, Katz E, Mao B., “Population encoding of spatial frequency, orientation, and color in macaque V1”, J Neurophysiol.,
vol.72, pp.2151–2166, 1994.
[7] Barbara Von Eckardt, “What is cognitive science?”, Cambridge: MIT Press, Waddington, CH, 1995.
[8] Neil A. Stillings, Steven E. Weisler, Christopher H.Chase, Mark H. Feinstein, Jay L, Garfield and Edwina L. Rissland, “Cognitive Science:
An Introduction,” 2nd Edition, The MIT Press, 1995.
[9] Edwards, M., & Badcock, D., “Global motion perception: Interaction of chromatic and luminance signals”, Vision Research, vol. 36,
pp.2423-2431, 1996.
[10] J. Flusser, T. Suk and S. Saic, "Recognition of images degraded by linear motion blur without restoration", Computing Suppl., vol. 11, pp.
37-51, 1996.
[11] Chun, M. M. & Jiang, Y.. , “Contextual cueing: implicit learning and memory of visual context guides spatial attention”, In Cognit.
Psychol., vol. 36, pp. 28-71, 1998.
[12] D. Shen, H.H.S. Ip, and E.K. Teoh., “Robust detection of skewed symmetries”, In ICPR, vol.3, pp. 1010-1013, 2000
[13] Hegde J, Van Essen DC, “Selectivity for complex shapes in primate visual area v2”, J Neurosci., vol. 20: RC61, 2000.
[14] L. Itti, C. Koch, & E. Niebur, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, vol.
40, pp. 1489-1506, Apr. 2000.
78
Reference
[15] Reynolds, J.H., Pasternak, T. & Desimone, R., “Attention increases sensitivity of V4 neurons”, Neuron., vol. 26, pp. 703–714, 2000.
[16] Y. Wang, Z. Liu, and J. Huang, “Multimedia content analysis using both audio and visual clues”, IEEE Signal Processing Magazine, vol.
17(6), pp. 12-36, 2000.
[17] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope, ” In IJCV, 2001.
[18] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, July 2006.
[19] N. Zahid, O. Abouelala, M. Limouri, A. Essaid, "Fuzzy clustering based on K-nearest-neighbours rule", In Fuzzy Sets and Systems, 2001.
[20] W. Liu, S. Dumais, Y. Sun, and H. Zhang, “Semi-automatic image annotation,” In Proceedings of the International Conference on
Human-Computer Interaction, pp. 326 – 333, 2001.
[21] Z. Wang, A. C. Bovik, L. Lu and J. Kouloheris, "Foveated wavelet image quality index," SPIE’s 46th Annual Meeting, Proc. SPIE,
Application of digital image processing XXIV, vol. 4472, 2001.
[22] Zhou Wang and Alan Conrad Bovik, "Embedded foveation image coding," IEEE Transactions on Image Processing, vol. 10(10), Oct. 2001.
[23] Amit, Y., “2d object detection and recognition: models, algorithms and networks”, MIT Press: Cambridge, Mass, 2002.
[24] D. Walther, L. Itti, M. Riesenhuber, T. Poggio, and C. Koch, “Attentional selection for object Recognition—a gentle way,” Proc. Second
Int’l Workshop Biologically Motivated Computer Vision, 2002.
[25] Stern, I. Kruchakov, E. Yoavi, and N.S. Kopeika, “Recognition of motion-blurred images by use of the method of moments”, Applied
Optics, 2002.
[26] Chen, L. Q., Xie, X., X., Ma, W. Y., Zhang, H. J., Zhou, H. Q., “A visual attention model for adapting images on small displays”,
Multimed. Syst. Vol. 9(4), pp. 353–364, 2003.
[27] Anderson, John R., “Cognitive psychology and its implications”, 6th Edition, Worth Publishers, 2004.
[28] L. Lucchese, “Frequency domain classification of cyclic and dihedral symmetries of finite 2-D Patterns,” Pattern Recognition, 37:2263–
2280, 2004.
[29] Moshe Bar, "Visual objects in context," Nature Reviews Neuroscience, vol. 5, pp. 617-629, Aug. 2004.
[30] Ören, T.I. and L. Yilmaz., “Behavioral Anticipation in Agent Simulation”, Proceedings of WSC 2004 - Winter Simulation Conference, pp.
801-806, 2004.
[31] P. Felzenszwalb and D. Huttenlocher, “Efficient graph-based imagesegmentation”, In IJCV, vol. 59(2), pp. 167–181, 2004.
[32] U. Rutishauser, D. Walther, C. Koch, and P. Perona, “Is bottom-up attention useful for object recognition?” Proc. IEEE CS Conf.
Computer Vision and Pattern Recognition, pp. 37-44, 2004.
79
Reference
[33] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, Eero P. Simoncelli, "Image quality assessment: from error visibility to structural
similarity," IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 600-612, Apr. 2004.
[34] E.A. Styles, “Attention, Perception, and Memory: An Integrated Introduction”, First edition, Psychology Press, 2005.
[35] Zhongkang Lu, Weisi Lin, Xiaokang Yang, EePing Ong, Susu Yao, “Modeling visual attention’s modulatory aftereffects”, IEEE
Transactions on Image Processing, vol. 14(11), pp. 1928-1942, 2005.
[36] G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp.1527-1554, 2006.
[37] G. W. Cottrell, “New life for neural networks,” Science, vol. 313, pp. 454-455, July, 2006.
[38] Rony Ferzli and Lina J. Karam, "A human visual system based no-reference objective image sharpness metric," IEEE International
Conference on Image Processing, pp. 2949-2952, Oct. 2006.
[39] G. Loy and J. Eklundh, "Detecting symmetry and symmetric constellations of features," In European Conference on Computer Vision, Part II, LNCS 3952, pp. 508-521, May 2006.
[40] H. Cornelius and G. Loy, “Detecting bilateral symmetry in perspective,” In Proceedings of International Conference on Computer Vision
and Pattern Recognition Workshop, 2006.
[41] A. Torralba, A. Oliva, M.S. Castelhano and J.M. Henderson., “Contextual guidance of eye movements and attention in real world scenes:
The role of global features in object search”, In Psychological Review., pp. 766-786, 2006.
[42] J. Harel, C. Koch, and P. Perona, "Graph-Based Visual Saliency", In NIPS, 2006.
[43] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, “Greedy layer-wise training of deep networks”, In NIPS, 2006.
[44] R.R. Salakhutdinov, G.E. Hinton, “Learning a nonlinear embedding by preserving class neighbourhood structure”, In AISTATS, 2007.
[45] Rony Ferzli and Lina J. Karam, "A no-reference objective image sharpness metric based on just noticeable blur and probability
summation," IEEE International Conference on Image Processing, vol. 3, pp. 445-448, Sept. 2007.
[46] R. R. Salakhutdinov and G. E. Hinton, “Learning a nonlinear embedding by preserving class neighbourhood structure,” in Proceedings of
Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
[47] R. Ren, P. Punitha, J. M. Jose, and J. Urban., “Attention-based video summarisation in rushes collection”, In TVS ’07: Proceedings of the
international workshop on TRECVID video summarization, New York, NY, USA, pp. 89–93, 2007.
[48] Y. Bengio, and Y. LeCun, “Scaling learning algorithms towards AI,” Large-Scale Kernel Machines, 2007.
[49] Ninassi, O. L. Meur, P. L. Callet, and D. Barbba, “Does where you gaze on an Image affect your perception of quality? Applying Visual
Attention to Image Quality Metric”, in Proc. IEEE Int. Conf. Image Process, vol. 2, pp. 169–172, 2007.
[50] Yuan, J., Li, J., Zhang, B., “Exploiting spatial context constraints for automatic image region annotation”, In ACMMM, pp. 595–604,
2007.
80
Reference
[51] C. Galleguillos, A. Rabinovich and S. Belongie, “Object categorization using co-occurrence, location and appearance”, In CVPR, June. 2008.
[52] J. Weston, F. Ratle, R. Collobert, “Deep learning via semi-supervised embedding”, In ICML, 2008.
[53] E. K. Chen, X. K. Yang, H.Y. Zha, R. Zhang, and W. J. Zhang, “Learning object classes from image thumbnails through deep neural
detworks,” International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, 2008.
[54] Nabil G. Sadaka, Lina J. Karam, Rony Ferzli, and Glen P. Abousleman, "A no-reference perceptual image sharpness metric based on
saliency-weighted foveal pooling," IEEE International Conference on Image Processing, pp. 369-372, Oct. 2008.
[55] Srenivas Varadarajan and Lina J. Karam, "An improved perception-based no-reference objective image sharpness metric using iterative edge refinement," IEEE International Conference on Image Processing, pp. 401-404, Oct. 2008.
[56] J. You, A. Perkis, M. Hannuksela, and M. Gabbouj., “Perceptual quality assessment based on visual attention analysis”, In ACMMM, 2009.
[57] J. Li, R. Socher, and L. Fei-Fei, "Towards total scene understanding: classification, annotation and segmentation in an automatic framework", In CVPR, 2009.
[58] Luhong Liang, Jianhua Chen, Siwei Ma, Debin Zhao and Wen Gao, "A no-reference perceptual blur metric using histogram of gradient profile sharpness," ICIP, pp. 4369-4372, Apr. 2009.
[59] L. Ballan, A. Bazzica, M. Bertini, A. D. Bimbo, and G. Serra, “Deep networks for audio event classification in soccer videos,” IEEE
International Conference on Multimedia & Expo, 2009.
[60] Rabinovich, A., and Belongie, S. “Scenes vs. objects: a comparative study of two approaches to context based recognition,” In ViSU, 2009.
[61] Rony Ferzli and Lina J. Karam, "A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB)," IEEE
Transactions on Image Processing, vol. 18, No. 4, pp. 717-728, Apr.2009.
[62] S. Lee and Y. Liu. “Curved glide-reflection symmetry detection”, In CVPR, pp.1046–1053, 2009.
[63] Tilke Judd, Krista Ehinger, Frédo Durand and Antonio Torralba, "Learning to predict where humans look," IEEE International Conference on Computer Vision, Sep. 2009.
[64] Xiaobai Liu, Bin Cheng, Shuicheng Yan, Jinhui Tang, Tat Seng Chua, Hai Jin, “Label to region by Bi-Layer sparsity priors”, In Proceedings
of ACM Multimedia, pp. 115-124, Oct. 2009.
[65] Y.-G. Jiang, C.-W. Ngo, and S.-F. Chang, “Semantic context transfer across heterogeneous sources for domain adaptive video search”, in
ACM Multimedia, 2009.
[66] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y.L. Cun, “What is the best multi-stage architecture for object recognition?”, In ICCV, 2009.
[67] R. Achanta, S. Hemami, F. Estrada and S. Süsstrunk, “Frequency-tuned salient region detection”, In CVPR, 2009.
[68] Shenghua Zhong, Yan Liu, Yang Liu, and Fu-lai Chung, “Fuzzy based Contextual Cueing for Region Level Annotation”, In Proceeding of
ACM International Conference on Internet Multimedia Computing and Service (ICIMCS’10), 2010.
81
Reference
[69] Shenghua Zhong, Yan Liu, Yang Liu, and Fu-lai Chung, “A semantic no-reference image sharpness metric based on top-down and bottom-up
saliency map modeling”, In Proceedings of 17th IEEE International Conference on Image Processing (ICIP’10), 2010.
[70] Chertok, M., & Keller, Y., “Spectral symmetry analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, July 2010.
[71] M. Wang, J. Li, T. Huang, Y. Tian, L. Duan, and G. Jia, “Saliency detection based on 2D log-gabor wavelets and center bias”, In ACMMM,
2010.
[72] S. Zhou, Q. Cheng, X. Wang, “Discriminative deep belief networks for image classification,” In ICIP, 2010.
[73] Shenghua Zhong, Yan Liu, Yang Liu, and Changsheng Li, "Water reflection detection and recognition based on moment invariants to motion blur using dynamic programming", In Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR'11), 2011.
[74] McMaster University, "Discover psychology", Attention and Memory, Toronto, Ontario: Nelson Education Ltd., ISBN-13: 978-0-17-661396-9, 2011.
[75] Shenghua Zhong, Yan Liu, Yang Liu, “Bilinear deep learning for image classification”, submitted to ACM International Conference on
Multimedia, 2011.
[76] Sheng-hua Zhong, Yan Liu, Ling Shao, Gangshan Wu, “Unsupervised Saliency Detection Based on 2D Gabor and Curvelets Transforms.” In
ACM ICIMCS, 2011.
82
Q&A
Thank You !
83
Query-oriented Multiple Document Summarization

Extractive-style query-oriented multi-document summarization
- Generate the summary by extracting a proper set of sentences from multiple documents based on the pre-given query
- Important in both information retrieval and natural language processing

Humans do not have difficulty with multi-document summarization
- How does the neocortex process the lexical-semantic task?

Contribution
- First paper to utilize deep learning in document summarization
- Provide human-like judgment by referencing the architecture of the human neocortex
84
Flowchart

Query-oriented concept extraction
- Hidden layers h1, h2, h3 (weights A1, A2, A3) abstract the documents from the input tf vector $f^d = [f_1^d, f_2^d, \ldots, f_v^d, \ldots, f_V^d]$ using a greedy layer-wise extraction algorithm, with a query-oriented initial weight setting and a query-oriented penalty process over the query word list; unimportant words are filtered out and key words are discovered

Reconstruction validation for global adjustment
- Reconstruct the data distribution (through (A1)^T, (A2)^T, (A3)^T) by fine-tuning the whole deep architecture globally

Summary generation via dynamic programming
- Candidate sentences are extracted into a candidate sentence pool, and dynamic programming is utilized to maximize the importance of the summary under the length constraint
Query Word List
85