Slides1 - Tamara L Berg
Yansong Feng and Mirella Lapata
Ashish Bagate
What this paper is about
Explore the feasibility of automatic caption
generation for images in the news domain
Why the news domain in particular – training data is
easily and abundantly available
Why
Lots of digital images available on the Web
Improved searching
Analysis of the image
Keyword-only searches are ambiguous
Targeted queries using longer search strings
Web accessibility
General Approach
Two-step process
Analyze the image and build a representation of it
Run the text generation engine on the image
representation to produce a natural language
representation
Related Work
Hede et al. – not practical because it relies on a controlled
data set and a manually created database
Yao et al. – based on just the image
Elzer et al. – what the graphic depicts, little
emphasis on graphics generation
These methods use some background information/terminologies
Problem Formulation
Given an image I and its accompanying document D,
generate a caption C
Training data contains document – image –
caption tuples
Caption generation is a difficult task even for
humans
A good caption must be succinct and informative,
clearly identify the subject of the picture, and draw
the reader into the article
Overview of the method
Similar to the headline generation task
Get the training data (it will be noisy)
Follows a two-stage approach
Get the keywords from the image (image annotation
model)
Generate the caption from the given image words
Uses image features to produce faithful and meaningful
descriptions of the images
Image Annotation
Probabilistic model – well suited for noisy data
Calculate SIFT descriptors of images
Visual words by k-means clustering
Get the keywords by LDA
dmix – a single bag of words representing the image,
document, and caption
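The visual-word step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 2-D toy vectors stand in for real 128-D SIFT descriptors, the helper names (nearest, kmeans, build_dmix) and the "vw_" token prefix are hypothetical, and the sketch stops before the LDA step.

```python
from collections import Counter

def nearest(vec, codebook):
    """Index of the codebook centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])))

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_dmix(descriptors, codebook, text_tokens):
    """Bag of words mixing visual-word tokens with document/caption tokens."""
    visual = [f"vw_{nearest(d, codebook)}" for d in descriptors]
    return Counter(visual) + Counter(text_tokens)

# Toy 2-D "descriptors" standing in for real SIFT vectors (assumption).
descriptors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
codebook = kmeans(descriptors, k=2)
dmix = build_dmix(descriptors, codebook, ["president", "speech", "president"])
```

Each cluster index acts as one "visual word", so the image can be counted the same way as text and fed to a topic model alongside the document and caption words.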
Extractive Caption Generation
Not much linguistic analysis is needed
Caption would be the sentence from the document
that is maximally similar to the description
keywords
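As a rough sketch of the selection step, assuming the annotation model has already produced the keywords: score every document sentence against them and keep the best one. The function names, the whitespace tokenization, and the sample sentences are illustrative assumptions; the similarity function is pluggable.

```python
def word_overlap(sentence_tokens, keywords):
    """Number of distinct tokens shared by the sentence and the keyword set."""
    return len(set(sentence_tokens) & set(keywords))

def extract_caption(sentences, keywords, similarity=word_overlap):
    """Return the document sentence that scores highest against the keywords."""
    return max(sentences, key=lambda s: similarity(s.lower().split(), keywords))

# Made-up document and keywords, for illustration only.
document = [
    "The president gave a speech on Tuesday .",
    "Markets fell sharply after the announcement .",
    "Crowds gathered outside the parliament building .",
]
keywords = ["president", "speech", "parliament"]
caption = extract_caption(document, keywords)
```

With these toy inputs the first sentence shares two keywords and the third shares one, so the first sentence is selected.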
Types of Similarities
Word Overlap
Cosine Similarity
Probabilistic Similarity
KL divergence – similarity between an image and a
sentence is measured by the extent to which they share
the same topic distributions
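The cosine and KL-based similarities can be sketched like this. It is a minimal illustration under stated assumptions: the topic distributions are made-up numbers rather than LDA output, the direction KL(image || sentence) is a choice made here, and the epsilon guard is added only to avoid division by zero.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical topic distributions for one image and two candidate sentences.
image_topics = [0.7, 0.2, 0.1]
sentence_topics = {"s1": [0.6, 0.3, 0.1], "s2": [0.1, 0.1, 0.8]}

# Lower divergence means the sentence and the image share more topic mass.
best = min(sentence_topics,
           key=lambda s: kl_divergence(image_topics, sentence_topics[s]))
```

Note that KL divergence is asymmetric and is a dissimilarity, so the best sentence is the one that minimizes it, whereas word overlap and cosine are maximized.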
Issues with Extractive Caption Generation
No single sentence can represent the image
Selected caption sentences might be longer than
the average sentence length
May not be catchy
Abstractive Caption Generation
Word based model
Adapted from headline generation
Caption = the sequence of words that maximizes P
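To make the "sequence of words that maximizes P" idea concrete, here is a toy sketch in the spirit of headline-generation models. It is not the formula from the slide: the factorization into a content-selection term, a bigram term, and a length term is an assumption, as are all the probability values, the FLOOR fallback, and the exhaustive search.

```python
import math
from itertools import permutations

# Toy, made-up probabilities; a real system would estimate them from training data.
content_prob = {"president": 0.6, "gives": 0.4, "speech": 0.5, "markets": 0.05}
bigram_prob = {("<s>", "president"): 0.4,
               ("president", "gives"): 0.5,
               ("gives", "speech"): 0.6}
length_prob = {2: 0.1, 3: 0.6}
FLOOR = 1e-4  # fallback probability for unseen events (assumption)

def log_score(words):
    """Log-probability of a candidate caption under the toy model."""
    score = math.log(length_prob.get(len(words), FLOOR))
    prev = "<s>"
    for w in words:
        score += math.log(content_prob.get(w, FLOOR))         # should w appear at all?
        score += math.log(bigram_prob.get((prev, w), FLOOR))  # does w follow prev well?
        prev = w
    return score

def best_caption(vocab, max_len=3):
    """Exhaustively score every ordering of up to max_len distinct words."""
    candidates = (p for n in range(1, max_len + 1) for p in permutations(vocab, n))
    return max(candidates, key=log_score)

caption = best_caption(list(content_prob))
```

Exhaustive enumeration only works for a tiny vocabulary; a practical decoder would more likely use beam search or a similar approximation.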
Abstractive Caption Generation
Phrase based model
Caption = the sequence of words that maximizes P
Evaluation…
Thanks!