Efficient Visual Search for Objects in Videos JOSEF SIVIC AND ANDREW ZISSERMAN PRESENTERS: ILGE AKKAYA & JEANNETTE CHANG MARCH 1, 2011

Introduction

Generalize text retrieval methods to non-textual information

Text query → Results: documents
Image query → Results: frames

State-of-the-Art before this paper…

- Text-based search for images (Google Images)
- Object recognition
  - Barnard, et al. (2003): “Matching words and pictures”
  - Sivic, et al. (2005): “Discovering objects and their location in images”
  - Sudderth, et al. (2005): “Learning hierarchical models of scenes, objects, and parts”
- Scene classification
  - Fei-Fei and Perona (2005): “A Bayesian hierarchical model for learning natural scene categories”
  - Quelhas, et al. (2005): “Modeling scenes with local descriptors and latent aspects”
  - Lazebnik, et al. (2006): “Beyond bag of features: Spatial pyramid matching for recognizing natural scene categories”

Introduction (cont.)

- Retrieve specific objects vs. categories of objects/scenes (“Camry” logo vs. cars)
- Employ text retrieval techniques for visual search, with images as queries and results
- Why Text Retrieval Approach?
  - Matches are essentially precomputed, so there is no delay at run time
  - Any object in the video can be retrieved without modifying the descriptors originally built for the video

Overview of the Talk

- Visual Search Algorithm
  - Offline Pre-Processing
  - Real-Time Query
  - A Few Implementation Details
- Performance
  - General Results
  - Testing Individual Words
  - Using External Images As Queries
- A Few Challenges and Future Directions
- Concluding Remarks
- Demo of the Algorithm

Pre-Processing (Offline)

Detection of Affine Covariant Regions

- Typically ~1200 regions / frame (720x576)
- Elliptical regions
- Each region represented by a 128-dimensional SIFT vector
- SIFT descriptors computed on the affine-normalized regions provide invariance against affine transformations

Two types of affine covariant regions:

1. Shape-Adapted (SA): Mikolajczyk et al.
   - Elliptical shape adaptation about a Harris interest point
   - Often centered on corner-like features
2. Maximally-Stable (MS): proposed by Matas et al.
   - Intensity watershed image segmentation
   - High-contrast blobs

Pre-Processing (Offline)

Tracking regions through video and rejecting unstable regions

- Any region that does not survive for 3+ frames is rejected
- Such unstable regions are unlikely to be interesting
- Reduces the number of regions per frame by approx. 50% (to ~600/frame)

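The rejection rule can be sketched in a few lines of Python. This is only an illustration: it assumes tracking has already been done and that each track is stored as a list of frame numbers; the name `stable_tracks` and the data layout are hypothetical, not from the paper.

```python
def stable_tracks(tracks, min_frames=3):
    """Keep only region tracks that survive for at least `min_frames` frames.

    `tracks` maps a (hypothetical) track id to the list of frame numbers
    in which the tracked region was detected.
    """
    return {
        track_id: frames
        for track_id, frames in tracks.items()
        if len(set(frames)) >= min_frames
    }


# Example: track "b" only lasts two frames, so it is rejected.
tracks = {"a": [1, 2, 3], "b": [7, 8], "c": [4, 5, 6, 7]}
kept = stable_tracks(tracks)
```
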
Pre-Processing (Offline)

Visual Indexing Using Text-Retrieval Methods

TEXT

- Represent words by their “stems”: ‘write’, ‘writing’, ‘written’ are all mapped to ‘write’
- Stop-list common words (‘a/an/the’)
- Rank search results according to how close the query words occur within the retrieved document

IMAGE

- Cluster similar regions into ‘visual words’
- Stop-list common visual words
- Use spatial information to check retrieval consistency

Visual Vocabulary

- Purpose: cluster regions from multiple frames into fewer groups called ‘visual words’
- Each descriptor: 128-vector
- K-means clustering
- ~300K descriptors (600 regions/frame x ~500 frames) mapped into 16K visual words
- (6000 visual words for SA regions, 10000 for MS regions)

K-Means Clustering

- Purpose: cluster N data points (SIFT descriptors) into K clusters (visual words)
- K = desired number of cluster centers (mean points)
- Step 1: Randomly guess K mean points
- Step 2: Assign each data point to its nearest mean point (cluster center). In this paper, the Mahalanobis distance is used to determine the ‘nearest cluster center’:

  $$d(x_1, x_2) = \sqrt{(x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2)}$$

  where $\Sigma$ is the covariance matrix for all descriptors, $x_2$ is the length-128 mean vector, and the $x_1$'s are the descriptor vectors (i.e. data points)
- Step 3: Recalculate the cluster centers and distances; repeat until stationarity

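A minimal Python sketch of the three steps above. To stay dependency-free it assumes a diagonal covariance matrix, so the Mahalanobis distance reduces to a variance-scaled Euclidean distance; a full covariance would need a matrix inverse. It also compares squared distances, which selects the same nearest center. The function names are illustrative and not taken from the paper.

```python
import random


def diagonal_variances(points):
    """Per-dimension variance over all points (the diagonal of Sigma)."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    # A small floor avoids division by zero for constant dimensions.
    return [
        max(sum((p[j] - means[j]) ** 2 for p in points) / n, 1e-12)
        for j in range(dim)
    ]


def mahalanobis_sq(x1, x2, variances):
    """Squared Mahalanobis distance under the diagonal-covariance assumption."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x1, x2, variances))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    variances = diagonal_variances(points)
    centers = rng.sample(points, k)  # Step 1: random initial mean points
    assignment = None
    for _ in range(iterations):
        # Step 2: assign each point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: mahalanobis_sq(p, centers[c], variances))
            for p in points
        ]
        if new_assignment == assignment:  # stationary, so stop
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Toy data: two well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignment = kmeans(points, k=2)
```

In the setting described on the slides, the points would be 128-dimensional SIFT descriptors and each resulting cluster index would serve as a visual word id.
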
Examples of Clusters of Regions

Samples of normalized affine covariant regions

Pre-Processing (Offline)

Remove Stop-Listed Words

Analogy to text retrieval: ‘a’, ‘and’, ‘the’ … are not distinctive words

- Common words cause mismatches
- The 5-10% most common visual words are stopped
- 800-1600 of the 16000 words are stopped

[Figure: (upper row) matches before stop-listing; (lower row) matches after stop-listing]
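
One way to build such a stop list is sketched below in Python. It assumes each frame is already a list of visual-word ids and takes the cut-off as a fraction of the distinct words observed; the helper names `build_stop_list` and `remove_stopped` are hypothetical.

```python
from collections import Counter


def build_stop_list(frames, fraction=0.05):
    """Return the set of the most frequent visual words.

    `fraction` is the share of distinct words to stop (0.05 to 0.10
    would correspond to the 5-10% range mentioned above).
    """
    counts = Counter(word for words in frames for word in words)
    n_stop = int(len(counts) * fraction)
    return {word for word, _ in counts.most_common(n_stop)}


def remove_stopped(words, stop_list):
    """Drop stop-listed visual words from one frame's word list."""
    return [w for w in words if w not in stop_list]


# Toy example: word 0 appears in every frame and is the single stopped word.
frames = [[0, 1, 2, 3], [0, 4, 5, 6], [0, 7, 8, 9]]
stop_list = build_stop_list(frames, fraction=0.1)
filtered = remove_stopped(frames[0], stop_list)
```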

Pre-Processing (Offline)

tf-idf Weighting

(term frequency-inverse document frequency weighting)

$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$$

- $n_{id}$: number of occurrences of word (visual word) i in document (frame) d
- $n_d$: total number of words in document d
- $N_i$: total number of documents containing term i
- $N$: number of documents in the database
- $t_i$: weighted word frequency

Each document (frame) is represented by the vector

$$v_d = (t_1, \ldots, t_i, \ldots, t_v)^\top$$

where v = number of visual words in the vocabulary and $v_d$ = the tf-idf vector of the particular frame d

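The weighting can be sketched in Python using the standard tf-idf form, (n_id / n_d) * log(N / N_i). The sketch assumes each frame is a list of integer visual-word ids in the range 0 to vocabulary_size - 1; the function name `tfidf_vectors` is illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(frames, vocabulary_size):
    """Compute one tf-idf vector per frame.

    `frames` is a list of lists of visual-word ids, one list per frame.
    """
    n_docs = len(frames)  # N: number of frames in the database
    # N_i: number of frames that contain word i at least once.
    doc_freq = Counter(word for words in frames for word in set(words))
    vectors = []
    for words in frames:
        n_d = len(words)  # total number of words in this frame
        v_d = [0.0] * vocabulary_size
        for i, n_id in Counter(words).items():
            v_d[i] = (n_id / n_d) * math.log(n_docs / doc_freq[i])
        vectors.append(v_d)
    return vectors


# Word 1 occurs in every frame, so its weight is zero.
vectors = tfidf_vectors([[0, 0, 1], [1, 2]], vocabulary_size=3)
```
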
Inverted File Indexing

Visual Word Index | Found in Frames
1                 | 1, 4, 5
2                 | 1, 2, 10
…                 | …
N                 |
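
A small Python sketch of an inverted file, using the toy entries from the table above. The dictionary layout and the names `build_inverted_file` and `candidate_frames` are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_file(frames):
    """Map each visual word to the sorted list of frames containing it.

    `frames` maps a frame number to the list of visual-word ids in it.
    """
    index = defaultdict(list)
    for frame_id in sorted(frames):
        for word in sorted(set(frames[frame_id])):
            index[word].append(frame_id)
    return dict(index)


def candidate_frames(index, query_words):
    """Frames that share at least one visual word with the query."""
    hits = set()
    for word in set(query_words):
        hits.update(index.get(word, []))
    return sorted(hits)


frames = {1: [1, 2], 2: [2], 4: [1], 5: [1], 10: [2]}
index = build_inverted_file(frames)
```

Only the frames listed under the query's visual words need to be scored, which is why the lookup stays fast at query time.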

Real-Time Query

- Determine the set of visual words found within the query region
- Retrieve keyframes based on visual word frequencies (Ns = 500)
- Re-rank retrieved keyframes using spatial consistency

Retrieve keyframes based on visual word frequencies

- $v_q$: the vector of visual word frequencies corresponding to the query region is computed
- The normalized scalar products of $v_q$ with the $v_d$'s are computed:

  $$\frac{v_q^\top v_d}{\lVert v_q \rVert \, \lVert v_d \rVert}$$
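
The ranking step can be sketched as follows in Python. The vectors are plain lists, and the names `cosine` and `rank_frames` are illustrative; `n_s` plays the role of the short-list size Ns.

```python
import math


def cosine(u, v):
    """Normalized scalar product of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_frames(v_q, frame_vectors, n_s=500):
    """Return the ids of the top `n_s` frames, most similar first."""
    scored = [(cosine(v_q, v_d), frame_id) for frame_id, v_d in frame_vectors.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [frame_id for _, frame_id in scored[:n_s]]


frame_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
ranking = rank_frames([1.0, 0.0], frame_vectors)
```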

Spatial Consistency Voting

- Analogy: Google text document retrieval
- Matched covariant regions in the retrieved frames should have a similar spatial arrangement
- Search area: the 15 nearest spatial neighbors of each match
- Each neighboring region that also matches in the retrieved frame votes for the frame

Spatial Consistency Voting

- Matched pair of words (A, B)
- Each region in the defined search area that also matches in both frames casts a vote for the match (A, B)

[Figure: (upper row) matches after stop-listing; (lower row) remaining matches after spatial consistency voting]

[Figure: query frame and sample retrieved frame. Panel 1: query region; 2: close-up version of 1; 3-4: initial matches; 5-6: matches after stop-listing; 7-8: matches after spatial consistency matching]
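
A rough Python sketch of the voting idea, under simplifying assumptions: each region is reduced to an (x, y) center, neighbors are found by brute force, and a match is supported by every other match that falls among the k nearest neighbors in both frames. The names `k_nearest` and `spatial_votes` are hypothetical, and the exact voting rule in the paper may differ in detail.

```python
def k_nearest(index, positions, k):
    """Indices of the k regions nearest to positions[index], excluding itself."""
    x0, y0 = positions[index]
    others = [i for i in range(len(positions)) if i != index]
    others.sort(key=lambda i: (positions[i][0] - x0) ** 2 + (positions[i][1] - y0) ** 2)
    return set(others[:k])


def spatial_votes(query_pos, frame_pos, matches, k=15):
    """Count supporting votes for each match (A, B).

    `matches` is a list of (query_region_index, frame_region_index) pairs.
    """
    votes = []
    for a, b in matches:
        near_a = k_nearest(a, query_pos, k)
        near_b = k_nearest(b, frame_pos, k)
        support = sum(
            1
            for a2, b2 in matches
            if (a2, b2) != (a, b) and a2 in near_a and b2 in near_b
        )
        votes.append(support)
    return votes


# Three mutually consistent matches and one isolated, inconsistent match.
query_pos = [(0, 0), (1, 0), (0, 1), (50, 50), (51, 50), (50, 51)]
frame_pos = [(10, 10), (11, 10), (10, 11), (90, 90), (91, 90), (90, 91)]
matches = [(0, 0), (1, 1), (2, 2), (3, 5)]
votes = spatial_votes(query_pos, frame_pos, matches, k=2)
```

Matches with zero votes can then be discarded, and the remaining vote total used to re-rank the frame.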

Implementation Details

Offline Processing:

- 100-150K frames per typical feature-length film, refined to 4000-6000 keyframes
- Descriptors are computed for stable regions in each frame
- Each region is assigned to a visual word
- Visual words over all keyframes are assembled into an inverted file structure

Algorithm Implementation

Real-Time Process:

- User selects a query region
- Visual words are identified within the query region
- A short list of Ns = 500 keyframes is retrieved based on tf-idf vector similarity
- Similarity is recomputed considering spatial consistency voting

Example Visual Search

Retrieval Examples

[Figure: query image and a few retrieved matches]

Retrieval Examples (cont.)

[Figure: query image and a few retrieved matches]

Performance of the Algorithm

Tried 6 object queries:

(1) Red Clock
(2) Black Clock
(3) “Frame’s” Sign
(4) Digital Clock
(5) “Phil” Sign
(6) Microphone

Performance of the Algorithm (cont.)

- Evaluated on the level of shots rather than keyframes
- Measured using precision-recall plots

  $$\text{Precision} = \frac{\text{\# of correctly retrieved shots}}{\text{Total \# of shots retrieved}}$$

  $$\text{Recall} = \frac{\text{\# of correctly retrieved shots}}{\text{Total \# of shots with object}}$$

- Precision is a measure of fidelity or exactness
- Recall is a measure of completeness

Performance of the Algorithm (cont.)

- Ideally, precision = 1 for all recall values
- Average Precision (AP): ideally, AP = 1

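These measures can be sketched in Python for a ranked list of shots. The AP function assumes the common non-interpolated convention (mean of the precision values at the rank of each relevant shot); the slides do not spell out the exact formula, so treat that choice as an assumption. Function names are illustrative.

```python
def precision_recall(ranked_shots, relevant_shots):
    """(precision, recall) after each position of the ranked list."""
    relevant = set(relevant_shots)
    hits, curve = 0, []
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
        curve.append((hits / rank, hits / len(relevant)))
    return curve


def average_precision(ranked_shots, relevant_shots):
    """Mean of precision at the rank of each relevant shot (assumed convention)."""
    relevant = set(relevant_shots)
    hits, total = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


curve = precision_recall(["a", "x", "b"], {"a", "b"})
perfect = average_precision(["a", "b", "x"], {"a", "b"})
```

A ranking that places every relevant shot first gives AP = 1, matching the ideal case above.
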
Examples of Missed Shots

- Extreme viewing angles

[Figure: original query object; low-ranked shot]

Examples of Missed Shots (cont.)

- Significant changes in scale and motion blurring

[Figure: original query object; low-ranked shot]

Qualitative Assessment of Performance

- General trends
  - Higher precision at low recall levels
  - Bias towards lightly textured regions detectable by SA/MS detectors
- Other difficulties
  - Textureless regions (e.g., mug)
  - Thin or wiry objects (e.g., bike)
  - Highly deformable objects (e.g., clothing)
- Could address these challenges by adding more covariant regions

Quality of Individual Visual Words

- Using a single visual word as query
  - Tests the expressiveness of the visual vocabulary
- Sample query
  - Given an object of interest, select one of the visual words from that object
  - Retrieve all frames that contain the visual word (no ranking)
  - Retrieval is considered correct if the frame contains the object of interest

Examples of Individual Visual Words

Top row: Scale-normalized close-ups of elliptical regions overlaid on query image Bottom row: Corresponding normalized regions

Results of Individual Word Searches

- Individual words are “noisy”
- Intuitively, this is because words occur in multiple objects and do not cover all occurrences of the object

Quality of Individual Visual Words

Unrealistic

- Require each word to occur on only one object (high precision)
- A growing number of objects would result in a growing number of words

Realistic

- Visual words are shared across objects, with objects represented by a combination of words

Searching for Objects From Outside of the Movie

- Used external query images from the internet
- Manually labeled all occurrences of external queries in movies
- Results:

External Query Image | No. of Occurrences | Rankings of Retrieved Occurrences | AP (Average Precision)
Sony logo            | 3                  | 1st, 4th, 35th                    | 0.53
Hollywood sign       | 1                  | 1st                               | 1
Notre Dame           | 1                  | 1st                               | 1
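
As a check on these numbers, the Sony logo AP can be reproduced by hand, assuming AP is the mean of the precision values at the rank of each true occurrence (a common convention; the slides do not state the exact formula):

```latex
\mathrm{AP}_{\text{Sony}}
  = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{4} + \frac{3}{35}\right)
  \approx \frac{1.586}{3}
  \approx 0.53
```

Under the same assumption, a query with a single occurrence ranked 1st gives AP = 1/1 = 1, which agrees with the Hollywood sign and Notre Dame rows.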

Sample External Query Results

- Potential Applications

Challenge I: Visual Vocabularies for Very Large Scale Retrieval

- Current progress: a 150000-frame feature movie is reduced to 6000 keyframes and then processed
- Ultimate goal: indexing billions of online images to build a visual search engine
- Should the vocabulary increase in size as the image archive grows?
- How discriminative should the words be?
- Generalization of images from one movie to an outside database of images?
- Learning a universal visual vocabulary still remains a challenge

[Figure: (a), (c) external images downloaded from the Internet; (b) correct retrieval frame from the movie ‘Pretty Woman’; (d) correct retrieval from the movie ‘Charade’]

Challenge II: Retrieval of 3D Objects

- The current algorithm achieves successful detection despite slight changes in viewpoint, illumination, and partial occlusion, thanks to SIFT features
- However, 3D retrieval is fundamentally a bigger challenge

Proposed approach 1:

Automatic association of images using temporal information

- Grouping the front, side, and back of a car in a video
- Possible on the query side and/or the database side
- Query-side matching: associated query frames are computed and used for 3D image search

[Figure: query-side matching of associated frames]

Proposed approach 1 (cont.)

- Grouping on the database side: a query on a single aspect is expected to retrieve pre-grouped frames associated with the 3D image

[Figure: (top row) query image; (bottom rows) matching frames]

Proposed approach 2:

Building an explicit 3-D model for each 3-D object in the video

- Focus is more on model building than detection
- Only rigid objects considered

Challenge III: Verification using Spatial Structure

- Spatial consistency was helpful, but could be improved
- A few suggestions
  - Caution with using measures for rigid geometry
  - Reduce cost using a hierarchical approach
- Two complementary methods
  - Ferrari et al. (2004): matching deformable objects
  - Rothganger et al. (2003): matching 3D objects

Verification Using Spatial Structure (cont.)

- Method 1 (Ferrari)
  - Based on spatial overlap of local regions
  - Requires regions to match individually and the pattern of intersection between neighboring regions to be preserved
- Performance
  - Pro: Works well with deformations
  - Con: Computationally expensive

Verification Using Spatial Structure (cont.)

- Method 2 (Rothganger)
  - Based on a 3-D object model
  - Requires consistency of local appearance descriptors and geometric consistency
- Performance
  - Pro: Object can be matched in diverse (even novel) poses
  - Con: 3-D model is built offline and requires up to 20 images of the object taken from different viewpoints

Conclusion

- Demonstrated a scalable object retrieval architecture which uses
  - A visual vocabulary based on vector-quantized viewpoint-invariant descriptors
  - Efficient indexing techniques from text retrieval
- A few notable differences between document and image bag-of-words retrieval
  - Spatial information
  - Numbers of “words” in query
  - Matching requirements

Looking forward…

- TinEye (May 2008)
  - Image-based search engine
  - Given a query image, searches for altered versions of that image (scaled or cropped)
  - 1.86 billion images indexed
- Google Goggles (2009)
  - Use phone to take photo, results from the internet
  - Limited categories

Demo of Retrieval Algorithm

- Live demonstration

Main References

- D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
- J. Sivic and A. Zisserman. Efficient visual search for objects in videos. Proc. IEEE, 96(4):548–566, 2008.
- W. Qian. “Video Google: A Text Retrieval Approach to Object Matching in Videos.” www.mriedel.ece.umn.edu/wiki/index.php/Weikang_Qian
