Image2Vec: Learning word and image representations for reasoning

Lerrel J. Pinto, Gunnar A. Sigurdsson
How would you summarize this image in 3 words?
Joint word and image embedding
Given an image, get words that approximate
Tower+Clock+Night
Joint word and image embedding
Given words, find an unlabeled image
Snowy+Young
Current embeddings
Use word similarity to learn a mapping from images to a word space
This does not apply to sums: given an image, the sum of its words cannot be recovered
[Figure: arithmetic on FC7 features (image - image + image = ???) does not yield a meaningful result]
Formulation
Before we start:
If the elements of x are zero except at indices 1, 6, and 15, then
    Dx = d_1 + d_6 + d_15
or, reading the nonzero indices as words,
    Dx = word1 + word2 + ...
Intuition: each column of D is a word
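The intuition above can be sketched in a few lines of NumPy: with a 0/1 indicator vector x, the product Dx is exactly the sum of the selected word columns. The dictionary values and selected indices here are illustrative, not from the trained model.

```python
import numpy as np

# Toy dictionary: each column is the embedding of one word.
# 3-dimensional embeddings, 4-word vocabulary (values are illustrative).
D = np.arange(12.0).reshape(3, 4)

# Indicator vector selecting words 0 and 2 ("word1" and "word2", say).
x = np.array([1.0, 0.0, 1.0, 0.0])

# D @ x equals the sum of the selected columns of D.
assert np.allclose(D @ x, D[:, 0] + D[:, 2])
```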
Formulation
Similar images = similar sums of words:
    sim(image_i, image_j) ≈ sim(D x_i, D x_j)
where x_i is the per-image indicator vector over words
(e.g. x for "A black dog laying on his big dog bed" indicates black, dog, bed, ...)
First solve this for the image-side embeddings, then solve for D
Formulation (Simple model)
This is a non-linear dimensionality reduction
Multidimensional scaling is O(N^3)
Instead we use the JL-transform (adds noise):
    y_i = R f_i, where f_i is the FC7 feature of image i
Thus, given R, find D s.t.
    R f_i ≈ D x_i   (x_i indicating word1, word2, ...)
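A minimal sketch of the JL-style projection: a random matrix R maps high-dimensional features down while approximately preserving pairwise distances, at the cost of some noise. The FC7 dimension (4096), the projected dimension, and the random features are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_fc7, k, n = 4096, 1024, 5          # FC7 dim, projected dim, #images (illustrative)

F = rng.standard_normal((n, d_fc7))  # stand-ins for FC7 features f_i
R = rng.standard_normal((k, d_fc7)) / np.sqrt(k)  # JL random projection

Y = F @ R.T                          # y_i = R f_i

# Pairwise distances survive the projection up to small distortion.
for i in range(n):
    for j in range(i + 1, n):
        orig = np.linalg.norm(F[i] - F[j])
        proj = np.linalg.norm(Y[i] - Y[j])
        assert abs(proj - orig) / orig < 0.2
```

The 1/sqrt(k) scaling makes squared projected norms unbiased estimates of the original squared norms, which is what makes the distortion bound work.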
Models (Learn all model)
To learn a better image representation, learn R too (we could backprop into the deep network)
We need more constraints to do this:
    find D and R s.t. R f_i ≈ D x_i,
    with an extra constraint on words i and j if they occur in the same image
    (D and R column norms = 1)
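One way to read "find D and R s.t. R f ≈ D x" is as an alternating least-squares problem: fix R and solve for D, then fix D and solve for R, renormalizing columns each round so the problem cannot collapse to zero. This is only an illustration on synthetic data, ignoring the co-occurrence constraint; it is not the authors' exact solver.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_emb, n_words, n = 64, 16, 20, 200    # all sizes are illustrative

F = rng.standard_normal((d_img, n))           # columns: image features f_i
X = (rng.random((n_words, n)) < 0.15).astype(float)  # columns: word indicators x_i

D = rng.standard_normal((d_emb, n_words))
R = rng.standard_normal((d_emb, d_img))

def normalize_cols(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)

def residual():
    return np.linalg.norm(R @ F - D @ X)

r0 = residual()
for _ in range(10):
    # Fix R, solve min_D ||R F - D X||_F by least squares, then renormalize.
    D = normalize_cols(np.linalg.lstsq(X.T, (R @ F).T, rcond=None)[0].T)
    # Fix D, solve min_R ||R F - D X||_F the same way.
    R = normalize_cols(np.linalg.lstsq(F.T, (D @ X).T, rcond=None)[0].T)

assert residual() < r0  # the alternating updates reduce the fit error
```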
Models (Weak tag model)
To allow for weak tags (Flickr, e.g. "camping, wilderness"):
Find D and R s.t.
    an extra constraint holds on words i and j if they occur in the same image
    (D and R column norms = 1)
    tag sets describing the same image sum to the same point: desert+trees = camping+wilderness
    sparsity: only a few words per image (< k)
    distinct words stay distinct: desert != trees
Training
COCO dataset: 80k training images, 5 captions each
(e.g. "A black dog laying on his big dog bed")
Build a vocabulary from nouns and adjectives
Number of training examples: N = 400k (80k images x 5 captions)
Words for an image are the vocabulary words in its caption
Models solved with:
Block coordinate descent
Stochastic gradient descent
Orthogonal Matching Pursuit
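Orthogonal Matching Pursuit, one of the solvers listed above, can be sketched in a few lines: greedily pick the dictionary column most correlated with the residual, refit the selected columns by least squares, and repeat. The dictionary and the planted support here are synthetic, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_words, k = 200, 100, 3                # sizes are illustrative

D = rng.standard_normal((m, n_words))
D /= np.linalg.norm(D, axis=0)             # unit-norm word columns

true_support = [7, 42, 91]                 # "words" present in the image
y = D[:, true_support].sum(axis=1)         # image embedding = sum of its words

def omp(D, y, k):
    """Greedy sparse recovery: k columns of D whose sum best explains y."""
    support, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        scores[support] = -np.inf          # never pick a column twice
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    return sorted(support)

assert omp(D, y, k) == true_support
```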
Summarizing image, successes
Given the trained model, find the words that summarize an image (sparse recovery)
pulling+horses+grassy
bedroom+wooden+small
surf+sandy
Summarizing image (word2vec and ImageNet baseline)
trombone+accordion+kimono
ice_cream+vase
poodle
black_bear+cat+guenon
dog+rabbit
canoe+sea_lion
Summarizing image (simple model)
hipster+rabbit
cheesecake+polka
lap+bride
bear+hydrant
hipster+lamb
wave+motorbike
Summarizing image (learn all model)
leaning+player+video
vase
toddler+lying+living
bear+rocky+kneeling
kneeling+outside
wet+rocky
Summarizing image (weak tags model)
These represent successes
Flickr images; uses ImageNet and word2vec
Before: concrete, gbr, london -> After: crane, london
Before: diving, scuba -> After: brain_coral, scuba
Before: jet, lynx -> After: airliner
Why is a visual model different?
Our model's most similar words:
Enter word (break to quit): 'yellow'
Using word: yellow
bus: 0.39
orange: 0.32
station: 0.30
road: 0.28
bananas: 0.27
Word2Vec most similar words:
Using word: yellow
red: 0.75
bright_yellow: 0.69
orange: 0.64
blue: 0.64
purple: 0.63
The visual model retrieves words that co-occur visually with "yellow" (bus, bananas); word2vec retrieves distributionally similar words (other colors)
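The query loop above amounts to ranking dictionary columns by similarity to a given word's column. A minimal sketch with a toy vocabulary and random columns (nothing here comes from the trained model); the summary slide's point that L2 distance beat the inner product only shows up when columns are not unit-norm, since on unit vectors the two metrics rank identically.

```python
import numpy as np

vocab = ["yellow", "bus", "red", "station"]      # toy vocabulary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, len(vocab)))        # toy word columns (not unit norm)

def most_similar(word, metric="l2"):
    """Rank the other words by similarity of their dictionary columns."""
    q = D[:, vocab.index(word)]
    if metric == "l2":
        # Negative distance, so higher score = closer.
        scores = -np.linalg.norm(D - q[:, None], axis=0)
    else:
        scores = D.T @ q                         # inner product
    return [vocab[i] for i in np.argsort(-scores) if vocab[i] != word]

# With non-unit columns the two metrics can disagree: under L2, training can
# shrink a noisy non-visual word toward small norm so it stops ranking highly.
ranking_l2 = most_similar("yellow", "l2")
ranking_ip = most_similar("yellow", "ip")
assert sorted(ranking_l2) == sorted(ranking_ip) == ["bus", "red", "station"]
```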
Summary and takeaways
L2 distance works better than the inner product: it allows scaling down noisy, non-visual words
Open issues: poor local minima and overfitting; needs better optimization, learning rates, and initialization
Learning the word and image embeddings jointly improves summarization (e.g. bear+rocky+kneeling)