
Toward Consistent Evaluation of Relevance
Feedback Approaches in Multimedia Retrieval
Xiangyu Jin, James French, Jonathan Michel
July, 2005
Outline
 Motivation & Contributions
 RF (Relevance Feedback) in MMIR
 PE (Performance Evaluation) Problems
 Rank Normalization
 Experimental Results
 Conclusions
Motivation
RF in MMIR is a cross-disciplinary research area
(1). CV & PR [Rui 98] [Porkaew 99]
(2). Text IR [Rocchio 71] [Williamson 78]
(3). DB & DM [Ishikawa 98][Wu 00][Kim 03]
(4). HCI & Psychology ...
Groups from these different backgrounds follow different traditions
and apply different standards to evaluation, which makes it hard to:
(1). study relations among them
(2). compare their performance fairly
Motivation
(1). Different testbeds
Dataset: COREL [Muller 03], TRECVid
-”Every evaluation is done on a different image subset thus making comparison impossible”
Groundtruth
Manually judged: [Kim 03], TRECVid (pure human labeling)
Auto-judged: [Rui 98] [Porkaew 99] (MARS as reference system)
Semi-auto-judged: [Liu 01] (MSRA MiAlbum, system-assisted human labeling)
Motivation
(2). Different methodology
System-oriented vs. user-oriented
The user-oriented method is not ideal for comparison since user experience
varies from person to person and from time to time.
Normalized rank vs. non-normalized rank
Rank normalization is generally accepted in text IR [Williamson 78]
but not in MMIR.
Problems & Contributions
Prob 1. It is hard to study the relations among RF approaches
Cont. 1. Briefly summarize RF algorithms according to how they implement
multi-query retrieval, so that each approach can be treated as a special case
of the same framework and their intrinsic relations can be studied.
Prob 2. It is hard to compare RF performance fairly
Cont. 2. Critique the PE methodology of the listed works. Demonstrate how to
fairly compare three typical RF approaches in large-scale testbeds (both text
and image), and show that improper PE methodology can lead to different
conclusions.
Where are we?
 Motivation & Contributions
 RF (Relevance Feedback) in MMIR
 PE (Performance Evaluation) Problems
 Rank Normalization
 Experimental Results
 Conclusions
RF in MMIR (framework)
General RF model in distance-based IR
Both the documents and the queries
can be abstracted as points in
some space.
[Figure: relevant (rel-doc) and irrelevant (irel-doc) document points in feature space]
RF in MMIR (framework)
General RF model in distance-based IR
The distance between each pair of points is
defined by some distance function D(q,d)
(assume D is a metric).
RF in MMIR (framework)
General RF model in distance-based IR
Retrieval can be interpreted as
getting the document points in
the neighborhood of the query
points (nearest neighbor search).
RF in MMIR (framework)
General RF model in distance-based IR
RF can be interpreted as a
process that moves and reshapes
the query region so that it fits
the region of user interest in the
space.
RF in MMIR (framework)
General RF model in distance-based IR
[Diagram: Query Set → Search Engine → Results → Feedback Examples → back to Query Set]
Feedback examples are used to modify the query points in the query set.
The search engine can handle multiple query points, so the search results
are modified by the change of the query set.
RF in MMIR (framework)
General RF model in distance-based IR
In the above discussion, D can only handle a single query point. We need to
extend D(q,d) to D'(Q,d) so that it can handle a query set Q.
Two possible solutions are Combine Queries and Combine Distances.
RF in MMIR (framework)
General RF model in distance-based IR
Assumptions:
(1). Our focus in RF research is on how to handle multiple query points,
i.e., given D, how to construct D'.
(2). The retrieval result is presented as a rank list.
(3). The user selects feedback examples from the retrieval result.
RF in MMIR (framework)
Combine Queries Approach
[Diagram: Query Set → combined into a single query → Search Engine → Results]
A single query point is generated from the query set by some
algorithm. The synthetic query is then issued as the search query.
RF in MMIR (framework)
Combine Queries Approach
D'(Q, d) = D(f(Q), d)
where f is a function that maps Q to a single query point q, e.g.
f(Q) = Σ_{i=1}^{|Q|} w_i · q_i
where q_i is a query point in the query set and w_i is its corresponding weight.
[Figure: query points q1-q4 combined into a single point q, compared against document d]
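A minimal Python sketch of the combine-queries idea, assuming numpy feature vectors and a Euclidean base distance D; the function and variable names are ours, not taken from the cited systems.

import numpy as np

def combine_queries(query_points, weights):
    # f(Q) = sum_i w_i * q_i : map the query set to a single synthetic point.
    Q = np.asarray(query_points, dtype=float)   # shape (|Q|, dim)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                             # normalize the weights
    return w @ Q                                # weighted centroid

def D(q, d):
    # Base distance for a single query point (Euclidean, as a placeholder).
    return np.linalg.norm(np.asarray(q, dtype=float) - np.asarray(d, dtype=float))

def D_prime(query_points, weights, doc):
    # Combine-queries distance: D'(Q, d) = D(f(Q), d).
    return D(combine_queries(query_points, weights), doc)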
RF in MMIR (framework)
Combine Queries Approach
Combine-queries feedback modifies the query region through the following
mechanisms:
(1). Move the query center by f, so that the query region is moved.
(2). Modify the distance function D, so that the query region is reshaped.
Usually the distance function is defined as a squared distance
D(q,d) = (q-d)^T M (q-d), where M is the distance matrix (see the sketch after the list).
• Query-point-movement (QPM) [Rocchio 71]: M is an identity matrix
• Re-weighting (Standard Deviation Approach) [Rui 98]: M is a diagonal matrix
• MindReader [Ishikawa 98]: M is a symmetric matrix with det(M)=1
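A short sketch of how the same quadratic form yields the three query-region shapes; the 2-D matrices below are illustrative values of our own choosing, not parameters from the cited papers.

import numpy as np

def quadratic_distance(q, d, M):
    # Generalized squared distance D(q, d) = (q - d)^T M (q - d).
    diff = np.asarray(q, dtype=float) - np.asarray(d, dtype=float)
    return float(diff @ M @ diff)

M_qpm = np.eye(2)                              # QPM: identity matrix -> circular region
M_reweight = np.diag([4.0, 1.0])               # Re-weighting: diagonal -> axis-aligned ellipse
theta = np.pi / 4                              # MindReader: full symmetric M with det(M) = 1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
M_mindreader = R @ np.diag([4.0, 0.25]) @ R.T  # rotated ellipse; det = 4 * 0.25 = 1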
RF in MMIR (framework)
Combine Distances Approach
[Diagram: each query point in the Query Set → Search Engine → mid-results → Fusion → Results]
Each query point is issued as a separate search. The results are then
combined in post-processing by some merging algorithm.
RF in MMIR (framework)
Combine Distances Approach
D'(Q, d) = ( Σ_{i=1}^{|Q|} w_i · D(q_i, d)^α )^(1/α)
Define D'(Q, d) = 0 if both α < 0 and ∃i such that D(q_i, d) = 0.
The distances are combined using a weighted power mean.
The query-center movement and the modification of the distance
function are implicit in this process.
[Figure: document d and query points q1-q4, each with its own small query region]
RF in MMIR (framework)
Combine Distances Approach
D'(Q, d) = ( Σ_{i=1}^{|Q|} w_i · D(q_i, d)^α )^(1/α)
Query-expansion [Porkaew 99]: α = 1
FALCON [Wu 00]: α > 0 gives a fuzzy AND merge; α < 0 gives a fuzzy OR merge
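A small sketch of the combine-distances merge, implementing the weighted power mean above with the zero-distance special case; the example inputs are ours.

import numpy as np

def power_mean_merge(distances, weights, alpha):
    # D'(Q, d) = (sum_i w_i * D(q_i, d)^alpha)^(1/alpha)
    # alpha = 1: query expansion (arithmetic mean); alpha > 0: fuzzy AND; alpha < 0: fuzzy OR.
    dist = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)
    if alpha < 0 and np.any(dist == 0.0):
        return 0.0                    # special case: some D(q_i, d) = 0 under fuzzy OR
    return float((w * dist ** alpha).sum() ** (1.0 / alpha))

# alpha < 0 is dominated by the closest query point (OR-like),
# alpha > 0 emphasizes the farther ones (AND-like):
# power_mean_merge([2.0, 0.5, 1.0], [1, 1, 1], alpha=-1)  ->  ~0.29
# power_mean_merge([2.0, 0.5, 1.0], [1, 1, 1], alpha=+1)  ->  3.5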
RF in MMIR (framework)
Mixed Approach
Q-Cluster [Kim 03]
Feedback examples are clustered.
(1). The cluster centers (denoted as a set C) are used for combine-distances
feedback (by FALCON's fuzzy OR merge).
(2). Each cluster center c_i uses its own distance function D_i, trained with
MindReader on the query points in cluster i.
D'(Q, d) = ( Σ_{i=1}^{|C|} w_i · D_i(c_i, d)^α )^(1/α)
Extremely complex!
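A rough sketch of the Q-Cluster idea under stated assumptions: k-means stands in for the paper's clustering step and a regularized inverse covariance stands in for the MindReader-trained D_i, followed by the fuzzy OR merge. This is not the exact algorithm of [Kim 03].

import numpy as np
from sklearn.cluster import KMeans

def qcluster_distance(feedback_points, doc, n_clusters=2, alpha=-1.0, eps=1e-6):
    # Cluster the feedback examples, give each cluster its own quadratic distance D_i,
    # then fuzzy-OR merge the per-cluster distances (equal cluster weights w_i = 1).
    X = np.asarray(feedback_points, dtype=float)
    doc = np.asarray(doc, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    merged = 0.0
    for i in range(n_clusters):
        members = X[km.labels_ == i]
        c = km.cluster_centers_[i]
        cov = np.cov(members, rowvar=False, ddof=0) + eps * np.eye(X.shape[1])
        M = np.linalg.inv(cov)                       # stand-in for MindReader's learned M
        M /= np.linalg.det(M) ** (1.0 / X.shape[1])  # rescale so det(M) = 1
        D_i = float((doc - c) @ M @ (doc - c))
        if D_i == 0.0:
            return 0.0                               # fuzzy OR special case
        merged += D_i ** alpha
    return merged ** (1.0 / alpha)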
RF in MMIR (example)
An illustrative example
Suppose we have a 2D database of people's weights and heights.
Name   Weight   Height
Steve  120      180
Mike   180      120
…      …        …
[Figure: Steve and Mike plotted as points in the weight-height plane]
RF in MMIR (example)
Combine Queries Approaches
Query-point-movement: Rocchio’s method
[Figure: weight-height plane; the initial query region (a circle) is moved to the new query region]
Since M is an identity matrix, the query region is a circle.
RF in MMIR (example)
Combine Queries Approaches
Re-weighting: Standard Deviation Approach
[Figure: the region of interest is a narrow band in the weight-height plane]
Band query: we want to find "tall" people, whose height is
around 6 feet. The region of user interest is a narrow band. No
matter how you move the circle, it cannot fit the query region.
RF in MMIR (example)
Combine Queries Approaches
Re-weighting: Standard Deviation Approach
[Figure: an axis-aligned ellipse fits the band-shaped region]
Solution: extend M to a diagonal matrix, so that the query region
is an ellipse aligned to the axes. The larger the variance along an
axis, the smaller the weight given to that axis (see the sketch below).
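A small sketch of the re-weighting idea, assuming the relevant feedback examples are rows of a numpy array; [Rui 98] uses an inverse-standard-deviation weighting of this general form, though the exact formula there may differ.

import numpy as np

def reweighting_matrix(relevant_points, eps=1e-9):
    # Diagonal M: the larger the spread of the relevant examples along an axis,
    # the smaller the weight given to that axis.
    X = np.asarray(relevant_points, dtype=float)
    sigma = X.std(axis=0) + eps   # eps avoids division by zero on constant axes
    return np.diag(1.0 / sigma)   # axis weight ~ 1 / sigma_j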
RF in MMIR (example)
Combine Queries Approaches
MindReader
[Figure: the region of interest is a diagonal band in the weight-height plane]
Diagonal query: we want to find "good shape" people, whose
height/weight ratio varies within a range. Since re-weighting can only form
ellipses aligned to the axes, it cannot fit this region well.
RF in MMIR (example)
Combine Queries Approaches
MindReader
[Figure: a rotated ellipse fits the diagonal region]
Solution: extend M to a symmetric matrix with det(M)=1. Now
the query region can be an arbitrary ellipse (see the sketch below).
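A hedged sketch of a MindReader-style full matrix; [Ishikawa 98] derives M by solving an optimization over the weighted feedback examples, and the regularized inverse covariance rescaled to det(M) = 1 used here is a stand-in that captures the same rotated-ellipse behavior.

import numpy as np

def mindreader_like_matrix(relevant_points, eps=1e-9):
    # Full symmetric M with det(M) = 1, estimated from the relevant examples.
    X = np.asarray(relevant_points, dtype=float)
    cov = np.cov(X, rowvar=False, ddof=0) + eps * np.eye(X.shape[1])
    M = np.linalg.inv(cov)
    return M / np.linalg.det(M) ** (1.0 / X.shape[1])  # rescale so det(M) = 1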
RF in MMIR (example)
Combine Distances Approaches
Query-expansion in MARS
[Figure: the region of interest is an arbitrarily shaped (triangular) region in the weight-height plane]
Triangle query: the region of user interest is arbitrarily shaped.
RF in MMIR (example)
Combine Distances Approaches
Query-expansion in MARS
[Figure: the individual regions around the query points jointly cover the triangular region]
Solution: implicitly change the distance function:
D'(Q, d) = Σ_{i=1}^{|Q|} w_i · D(q_i, d)
This is the special case of the power mean with α = 1 (weighted arithmetic mean).
RF in MMIR (example)
Combine Distances Approaches
FALCON
[Figure: the region of interest consists of two disjoint clusters in the weight-height plane]
Disjoint query: the region of user interest is not contiguous. Suppose the
user is interested in two types of people, either small or large.
This is common in an MMDB, where low-level features cannot reflect high-level semantic clusters.
RF in MMIR (example)
Combine Distances Approaches
FALCON
[Figure: small regions around each query point combine into a non-contiguous region]
Solution: use small circles (not necessarily circles) around each query
point and combine them into a non-contiguous region:
D'(Q, d) = ( Σ_{i=1}^{|Q|} w_i · D(q_i, d)^α )^(1/α) with α negative (fuzzy OR merge)
RF in MMIR (example)
Mixed Approach
Q-Cluster
[Figure: small ellipses around the cluster centers combine into a non-contiguous region]
Idea: use small ellipses and combine them into a non-contiguous region.
Use MindReader to construct each ellipse and FALCON's fuzzy OR merge
to combine them.
Where are we?
 Motivation & Contributions
 RF (Relevance Feedback) in MMIR
 PE (Performance Evaluation) Problems
 Rank Normalization
 Experimental Results
 Conclusions
PE Problems
There are many kinds of PE problems; we list only a few.
(1). Dataset
(2). Comparison
(3). Impractical parameter settings
We give examples only from the previously listed works, but this does not
mean that ONLY these works have PE problems.
PE Problems
Dataset Problems
(1). Unverified assumptions in simulated environment
The algorithm is proposed based on some assumption.
E.g., re-weighting [Rui 98] requires that ellipse queries exist; MindReader
[Ishikawa 98] requires that diagonal queries exist.
In the MARS works [Rui 98] [Porkaew 99], the groundtruth is generated by
their own retrieval system with an arbitrary distance function, so that an
"ellipse" query already exists. It is not surprising that re-weighting
outperforms Rocchio's method in this environment.
We are not arguing that these approaches are not useful, but the PE
tells us very little (since the outcome is a high-probability event).
PE Problems
Dataset Problems
(2). Real data that is not typical of the application:
small scale (1K images), highly structured (strong linear relations),
low dimensional (2D).
E.g., MindReader [Ishikawa 98] is evaluated on a highly structured 2D
dataset (the Montgomery County dataset), where the task favors their
approach.
MMIR usually employs very high-dimensional features, and only
a few dozen examples are available for feedback. In this case, it
is extremely hard to mine the relations among hundreds of
dimensions from so few training examples, and it is risky to
learn a more "free" and "powerful" distance function.
It is quite likely that the user's intention is overwhelmed by
noise and that wrong knowledge is learned.
PE Problems
Comparison Problems
(1). Misused comparison
Analogy: an author proposes a modification to quicksort, but instead of
comparing the new sort with quicksort, he compares it with bubble sort.
For example, Q-Cluster is a modification of FALCON's fuzzy OR merge,
but [Kim 03] compared it to Q-Expansion and QPM.
It is not astonishing that their approach performs much better, since
the COREL database favors any fuzzy-OR-like approach.
PE Problems
Comparison Problems
(2). Unfair comparison
How to treat the training samples (feedback examples) in the
evaluation?
E.g., FALCON shifts them to the head of the list; Rocchio does not. But any
method can do this in post-processing!
Directly comparing approaches that process the feedback examples
inconsistently results in an unfair comparison. The FALCON [Wu 00]
and Q-Cluster [Kim 03] papers both have this problem.
PE Problems
Impractical Parameter Settings
Assume a “diligent” user
(1). Asking the user to judge too many results
[Re-weighting] asks the user to look through the top 1100 retrieved results
to find feedback examples.
(2). Asking the user to click/select too many times
[Kim 03] and [Porkaew 99] feed back all relevant images in the top 100
retrieved results.
Remember, COREL has only 100 relevant images for each query.
This could explain their conclusion that improvement appears mostly
in the FIRST iteration!
(3). Too many feedback iterations
[Wu 00] performs feedback over 30 iterations.
Where are we?
 Motivation & Contributions
 RF (Relevance Feedback) in MMIR
 PE (Performance Evaluation) Problems
 Rank Normalization
 Experimental Results
 Conclusions
Rank Normalization
Rank normalization: re-rank the retrieval result according to the feedback
examples.
Although rank normalization is generally accepted in text IR
[Williamson 78], it seldom receives enough attention in MMIR.
Rank-shifting: shift the feedback examples to the head of the refined
result even if they are not already there. Easy to implement; fair for
cross-system comparison, but unfair for cross-iteration comparison.
Rank-freezing [Williamson 78]: freeze the ranks of the feedback
examples during the refinement process. Hard to implement; fair for
cross-system comparison and fair for cross-iteration comparison.
Rank Normalization
Rank-Shifting
[Figure: previous result, ranks 1-9: documents 3, 1, 4, 5, 7, 9, 6, 2, 8; the relevant documents (feedback examples) are 3, 4, 6]
Rank Normalization
Rank-Shifting
[Figure: previous result: 3, 1, 4, 5, 7, 9, 6, 2, 8 (feedback examples 3, 4, 6); refined result before rank-shifting: 2, 3, 8, 4, 5, 9, 1, 6, 7]
Rank Normalization
Rank-Shifting
[Figure: the feedback examples 3, 4, 6 are pulled out of the refined result (2, 3, 8, 4, 5, 9, 1, 6, 7) and moved to the head]
Rank Normalization
Rank-Shifting
[Figure: previous result: 3, 1, 4, 5, 7, 9, 6, 2, 8; refined result before rank-shifting: 2, 3, 8, 4, 5, 9, 1, 6, 7; after rank-shifting: 3, 4, 6, 2, 8, 5, 9, 1, 7]
Rank Normalization
Rank-Freezing
[Figure: the feedback examples 3, 4, 6 keep their previous ranks (1, 3, 7) while the refined result (2, 3, 8, 4, 5, 9, 1, 6, 7) is re-ranked]
Rank Normalization
Rank-Freezing
[Figure: previous result: 3, 1, 4, 5, 7, 9, 6, 2, 8; refined result before rank-freezing: 2, 3, 8, 4, 5, 9, 1, 6, 7; after rank-freezing: 3, 2, 4, 8, 5, 9, 6, 1, 7]
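Both normalization schemes are easy to state in code. Below is a small Python sketch; the document IDs come from our reconstruction of the example figures above, and the function names are ours.

def rank_shift(refined, feedback):
    # Rank-shifting: move the feedback examples to the head of the refined list,
    # keeping the relative order of the remaining documents.
    head = [d for d in refined if d in feedback]
    tail = [d for d in refined if d not in feedback]
    return head + tail

def rank_freeze(previous, refined, feedback):
    # Rank-freezing: feedback examples keep their ranks from the previous result;
    # the other documents fill the remaining positions in refined order.
    frozen = {rank: d for rank, d in enumerate(previous) if d in feedback}
    rest = iter(d for d in refined if d not in feedback)
    return [frozen[r] if r in frozen else next(rest) for r in range(len(previous))]

previous = [3, 1, 4, 5, 7, 9, 6, 2, 8]   # previous result (ranks 1..9)
refined  = [2, 3, 8, 4, 5, 9, 1, 6, 7]   # refined result before normalization
feedback = {3, 4, 6}                     # relevant documents fed back by the user

print(rank_shift(refined, feedback))             # [3, 4, 6, 2, 8, 5, 9, 1, 7]
print(rank_freeze(previous, refined, feedback))  # [3, 2, 4, 8, 5, 9, 6, 1, 7]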
Where are we?
 Motivation & Contributions
 RF (Relevance Feedback) in MMIR
 PE (Performance Evaluation) Problems
 Rank Normalization
 Experimental Results
 Conclusions
Experimental Results
Testbed
(1). CBIR
Basic CBIR in 3K image DB (COREL)
(2). Text-IR
Lucene in TREC-3
Experimental Results
Feedback approaches
(1). QPM, Query-point-movement
(2). AND, FALCON with α=1 (Q-Expansion)
(3). OR, FALCON with α=-1
Rank-normalization
(1). Without rank-normalization
(2). Rank-shifting
(3). Rank-freezing
CBIR testbed
[Plot: performance across feedback iterations; axis values 0.23-0.31 shown]
(1). OR > QPM = AND (multi-cluster).
(2). Without rank-normalization, we would exaggerate OR merge's
performance. Rank-shifting would exaggerate the performance
improvement across iterations.
TEXT IR testbed
[Plot: performance across feedback iterations; axis values 0.38-0.46 shown]
(1). OR = QPM = AND (single-cluster).
(2). Without rank-normalization, we would exaggerate OR merge's
performance. Rank-shifting would exaggerate the performance
improvement across iterations.
(3). OR merge converges more slowly.
Where are we?
 Motivation & Contributions
 RF (Relevance Feedback) in MMIR
 PE (Performance Evaluation) Problems
 Rank Normalization
 Experimental Results
 Conclusions
Conclusions
(1). For RF approaches
QPM is quite similar to AND merge.
If the relevant documents are scattered across several clusters in the
space, OR merge is preferred; if the relevant documents are clustered
together, QPM is preferred, since it has a smaller computational cost and
converges faster.
(2). For the rank-normalization issue
If cross-system comparison is needed, either approach can be used.
If cross-iteration comparison is needed, rank-freezing is required.
References
[Rui 98] Y Rui and TS Huang and S Mehrotra: Relevance Feedback Techniques in Interactive
Content-Based Image Retrieval. Storage and Retrieval for Image and Video Databases (SPIE
1998). (1998) 25-36.
[Ishikawa 98] Y Ishikawa and R Subramanya and C Faloutsos: MindReader: Querying
Databases Through Multiple Examples. Proc. of VLDB'98. (1998) 218-227.
[Rocchio 71] JJ Rocchio: Relevance Feedback in Information Retrieval. In G Salton ed.: The
SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall (1971)
313-323.
[Wu 00] L Wu and C Faloutsos and K Sycara and TR Payne: Falcon: Feedback adaptive loop
for content-based retrieval. Proc. of VLDB'00. (2000) 297-306.
[Kim 03] D Kim and C Chung: Qcluster: Relevance Feedback Using Adaptive Clustering for
Content-based Image Retrieval. Proc. of ACM SIGMOD'03. (2003) 599-610.
[Porkaew 99] K Porkaew and M Ortega and S Mehrotra: Query reformulation for content based
multimedia retrieval in MARS. Proc. of ICMCS'99. (1999) 747-751.
References
[Williamson 78] RE Williamson: Does relevance feedback improve document retrieval
performance? ACM SIGIR'78. (1978) 151-170.
[Yan 03] R Yan and R Jin and A Hauptmann: Multimedia Search with Pseudo-Relevance
Feedback. Proc. of CIVR'03. (2003)
[Westerveld 03] Thijs Westerveld and Arjen P. de Vries: Experimental result analysis for a
generative probabilistic image retrieval model. Proc. of SIGIR'03. (2003) 135-142.
[Muller 03] Henning Muller and Stephane Marchand-Maillet and Thierry Pun: The Truth about
Corel - Evaluation in Image Retrieval. Proc. of CIVR '02. (2002) 38-49.
[Liu 01] W Liu and Z Su and S Li and Y Sun and H Zhang: Performance Evaluation Protocol
for Content-Based Image Retrieval Algorithms/Systems. CVPR Workshop on Empirical
Evaluation Methods in Computer Vision. (2001).
End
Backup slides
Experimental Results
Testbed
(1). CBIR
DB: 34 COREL categories (3.4 K images)
Groundtruth: COREL groundtruth (images inside the same category are
considered relevant)
Retrieval systems: Basic CBIR (global visual feature based)
Queries: 6 images randomly selected from each category (204 queries in
total).
In [French CIVR'04] we have shown that this testbed is a good representative
of a much larger 60K COREL testbed for RF experiments.
Experimental Results
Testbed
(2). TEXT IR
DB: TREC-3 ad hoc (750K documents)
Groundtruth: TREC’s qrel (by pooling)
Retrieval systems: Lucene’s default setting
Queries: manually created for TREC topics 151-200, one per topic (50
queries in total)
Experimental Results
Other settings
(1). How many documents (from the top of the rank list) are used for
performance evaluation?
For CBIR, 150. For TEXT IR, 1000. Average precision is used as the PE metric.
(2). How many documents (from the top of the rank list) are shown to the
user for feedback selection?
150.
(3). How many documents is the user supposed to feed back (in the
system-oriented approach), assuming the user makes selections in order?
Up to 8 (if there are 8), in sequential order.
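For reference, a minimal sketch of average precision computed over the top of the rank list, as used above; the exact normalization in the paper (e.g., dividing by all relevant documents vs. only those reachable within the cutoff) is not stated, so this is one common variant.

def average_precision(ranked_list, relevant, cutoff):
    # Mean of precision@k taken at the ranks where relevant documents appear,
    # restricted to the top `cutoff` results.
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_list[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    denom = min(len(relevant), cutoff)
    return sum(precisions) / denom if denom else 0.0

# e.g., for the CBIR testbed: average_precision(results, corel_groundtruth, cutoff=150)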