Relevance Feedback - Computer Science Building, Colorado
Recall: Query Reformulation
Approaches:
1. Relevance feedback based (see the Rocchio formula below):
   vector model (Rocchio …)
   probabilistic model (Robertson & Sparck Jones, Croft …)
2. Cluster-based query expansion:
   1. Local analysis: derive information from the retrieved document set
   2. Global analysis: derive information from the corpus
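As a pointer back to that material, the standard Rocchio reformulation for the vector model (a textbook formula, not spelled out on this slide) moves the query toward the centroid of the relevant documents and away from the non-relevant ones:

$$ \vec{q}_{new} = \alpha\,\vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \;-\; \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j $$

where D_r and D_n are the sets of relevant and non-relevant retrieved documents and α, β, γ are tuning constants.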
Local Analysis
“Known relevant documents contain terms which
can be used to describe a larger cluster of
relevant documents.” MIR
In relevance feedback, clusters are built from interaction with the user about documents.
Local analysis instead automatically exploits the retrieved documents, identifying terms related to those in the query.
Term Clusters
Association clusters: model co-occurrence of stems in the retrieved documents and expand the query using co-occurring terms (see the sketch below)
  unnormalized: groups terms by large co-occurrence frequencies
  normalized: groups terms by rarity
Metric clusters: also factor in intra-document distance between occurrences
Problem: expensive to compute on the fly
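A minimal sketch of association-cluster expansion under local analysis; the document stems and function name are illustrative, not from the slides. The unnormalized score is the raw co-occurrence count c_uv; the normalized score c_uv / (c_uu + c_vv − c_uv) favors rarer stems.

```python
from collections import defaultdict

def association_clusters(docs, query_stems, top_n=3, normalized=True):
    """Expand query stems using association clusters built from
    the stems of the locally retrieved documents.

    docs: list of retrieved documents, each a list of stems.
    Returns the top_n co-occurring stems for each query stem."""
    # c[u][v] = sum over retrieved docs of freq(u, d) * freq(v, d)
    c = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        freq = defaultdict(int)
        for stem in doc:
            freq[stem] += 1
        for u in freq:
            for v in freq:
                c[u][v] += freq[u] * freq[v]

    expansion = {}
    for u in query_stems:
        scores = {}
        for v in c[u]:
            if v == u:
                continue
            if normalized:
                # Normalized association: favors rare, strongly
                # associated stems over merely frequent ones.
                scores[v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])
            else:
                # Unnormalized: groups stems by large frequencies.
                scores[v] = c[u][v]
        expansion[u] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return expansion

# Hypothetical retrieved set for a query containing the stem "retriev":
docs = [["retriev", "feedback", "query", "retriev"],
        ["feedback", "user", "query"],
        ["retriev", "index", "term"]]
print(association_clusters(docs, ["retriev"]))
```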
Global Analysis
All documents in the collection are analyzed for term relationships.
Two approaches:
Similarity thesaurus: relates the whole query to new terms. The focus is on the concept underlying the terms: each term is indexed by the documents in which it appears.
Statistical thesaurus: cluster documents into a class hierarchy.
Similarity Thesaurus Basis
The weight of document d_j in the vector for term k_i:

$$ w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N}\left(0.5 + 0.5\,\frac{f_{i,l}}{\max_l(f_{i,l})}\right)^{2} itf_l^{2}}} $$

where the inverse term frequency (itf) for doc d_j is:

$$ itf_j = \log\frac{t}{t_j} $$

N is the number of documents,
t is the number of distinct terms in the collection, and
t_j is the number of distinct terms in document j.
Similarity Thesaurus Creation
The thesaurus is a matrix of correlation factors between indexing terms:

$$ c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j}\, w_{v,j} $$
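A minimal sketch combining the two formulas above, assuming term frequencies arrive as a nested dict and that w_ij = 0 when term k_i does not occur in d_j (the slides leave that case implicit); all names are illustrative.

```python
import math

def build_similarity_thesaurus(f):
    """Build the correlation matrix c[u][v] of a similarity thesaurus.

    f: dict mapping term -> {doc: raw frequency f_ij}.
    Roles are reversed relative to normal indexing: each term k_i
    is a vector over documents, weighted by w_ij."""
    docs = {d for per_doc in f.values() for d in per_doc}
    t = len(f)                                     # distinct terms in collection
    t_j = {d: sum(1 for per_doc in f.values() if d in per_doc) for d in docs}
    itf = {d: math.log(t / t_j[d]) for d in docs}  # inverse term frequency

    # w_ij: normalized tf component times itf_j, then each whole
    # term vector is length-normalized (the denominator above).
    w = {}
    for term, per_doc in f.items():
        max_f = max(per_doc.values())
        raw = {d: (0.5 + 0.5 * per_doc[d] / max_f) * itf[d] for d in per_doc}
        norm = math.sqrt(sum(x * x for x in raw.values())) or 1.0
        w[term] = {d: x / norm for d, x in raw.items()}

    # c_uv = k_u . k_v = sum over shared docs of w_uj * w_vj
    return {u: {v: sum(w[u][d] * w[v][d] for d in w[u].keys() & w[v].keys())
                for v in f}
            for u in f}
```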
Relationship between Terms and Query
[figure omitted; from Qiu & Frei, "Concept Based Query Expansion", SIGIR-93]
Query Expansion w/Similarity Thesaurus
Represent the query in the concept space of the index terms (weight vector).
Based on the global similarity thesaurus, compute a similarity sim(q, k_v):

$$ sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q}\, c_{u,v} $$

Expand the query with the top r ranked terms, weighted with:

$$ w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}} $$
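A minimal sketch of these two steps, reusing a correlation matrix `c` like the one from the previous sketch; `w_q` maps each query term k_u to its weight w_uq (illustrative names).

```python
def expand_query(w_q, c, r=5):
    """Expand a weighted query using a global similarity thesaurus.

    w_q: dict of query term k_u -> weight w_uq.
    c: correlation matrix, c[u][v] = c_uv.
    Returns the top r expansion terms with weights w_vq'."""
    # sim(q, k_v) = sum over query terms k_u of w_uq * c_uv
    sim = {}
    for u, w_uq in w_q.items():
        for v, c_uv in c.get(u, {}).items():
            sim[v] = sim.get(v, 0.0) + w_uq * c_uv

    # Take the top r candidates not already in the query.
    ranked = sorted((v for v in sim if v not in w_q),
                    key=sim.get, reverse=True)[:r]

    # w_vq' = sim(q, k_v) / sum_u w_uq
    total = sum(w_q.values())
    return {v: sim[v] / total for v in ranked}
```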
Global 2: Statistical Thesaurus
Thesaurus construction relies on high-discrimination, low-frequency terms.
Such terms are hard to cluster directly…
So, build classes by clustering similar docs instead.
Cluster similarity is the minimum of the cosine vector-model similarity between any two docs (one from each cluster).
Complete Link Algorithm [Crouch & Yang] (sketched in code below)
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
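A minimal sketch of the algorithm above; `sim` is any pairwise document similarity (cosine in these slides), and using a similarity threshold as the stop criterion is an illustrative choice.

```python
def complete_link(doc_ids, sim, threshold=0.0):
    """Agglomerative complete-link clustering of documents.

    doc_ids: list of document identifiers.
    sim(a, b): similarity between two documents.
    Inter-cluster similarity is the minimum similarity over all
    cross-cluster document pairs (the complete-link criterion).
    Returns the merge history, a flat view of the hierarchy."""
    clusters = [[d] for d in doc_ids]             # 1. one doc per cluster
    history = []
    while len(clusters) > 1:
        best = None                               # 2-3. best cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < threshold:                         # 5. stop criterion
            break
        merged = clusters[i] + clusters[j]        # 4. merge Cu and Cv
        history.append((merged, s))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history                                # 6. hierarchy (merge order)
```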
Hierarchy Example
Doc1 = D,D,A,B,C,A,B,C
Doc2 = E,C,E,A,A,D
Doc3 = D,C,B,B,D,A,B,C,A
Doc4 = A
Dendrogram (from MIR notes): D1 and D3 merge first as C{1,3} at similarity 0.99; D2 joins to form C{1,3,2} at 0.29; D4 joins last to form C{1,3,2,4} at 0.00.
Query Expansion w/Statistical Thesaurus
Select the terms for each class:
  a threshold on similarity determines which clusters are used
  NDC determines the max number of docs in a cluster
  MIDF determines the minimum IDF for any term (i.e., how rare it must be)
Compute the thesaurus class weight for terms:

$$ wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|} \qquad W_C = \frac{wt_C}{0.5 \times |C|} $$
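A minimal sketch of the class-weight computation, assuming the reconstruction above (wt_C is the average member-term weight, damped by 0.5·|C|); the input layout is illustrative.

```python
def thesaurus_class_weight(member_weights):
    """Weight a statistical thesaurus class C.

    member_weights: list of w_iC, one per term selected for class C.
    wt_C is their average; W_C damps it by half the class size,
    so large classes contribute less per term."""
    size = len(member_weights)           # |C|
    wt_c = sum(member_weights) / size    # average member weight
    return wt_c / (0.5 * size)           # W_C

# e.g., a hypothetical 4-term class:
print(thesaurus_class_weight([0.8, 0.6, 0.7, 0.5]))  # 0.325
```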
Global Analysis Summary
The thesaurus approach has been effective for improving queries…
However:
  it requires expensive processing (a static corpus is required)
  statistical generation exploits small frequencies better but is sensitive to parameter settings.
Relevance Feedback/Query Reformulation Summary
Relevance feedback and query expansion approaches have been shown to be effective at improving relevance, sometimes at the expense of precision.
Users resist relevance feedback; it takes time and understanding.
Query reformulation can be costly (expensive computation) for search engines/IR systems.
Search Engine Use of Query Feedback
Relevance feedback
Similar/Related Pages or searches:
  explicit: tried, but mostly abandoned
  indirect: Teoma (ranks documents higher that users look at more often)
  suggest expanded queries or ask to search for related pages (AltaVista and MSN Search used to do this)
  Google: Find Similar
  Teoma
Web log data mining
Web log data mining
Behavior-Based Ranking
AskJeeves used user behavior to change result ranking:
For each query Q, record which URLs are followed.
Use click-through counts to order URLs for subsequent submissions of Q (see the sketch below).
Pseudo-relevance feedback
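A minimal sketch of that behavior-based reordering; the storage layout and names are illustrative, not AskJeeves' actual implementation.

```python
from collections import defaultdict

# clicks[query][url] = how many times users followed url for query
clicks = defaultdict(lambda: defaultdict(int))

def record_click(query, url):
    """Log that a user followed `url` after submitting `query`."""
    clicks[query][url] += 1

def rerank(query, urls):
    """Order candidate URLs for a repeated query by click-through
    counts, most-clicked first; sort stability keeps unclicked
    URLs in their original order."""
    return sorted(urls, key=lambda u: -clicks[query][u])

record_click("jaguar", "https://example.com/cars")
record_click("jaguar", "https://example.com/cars")
record_click("jaguar", "https://example.com/cats")
print(rerank("jaguar", ["https://example.com/cats",
                        "https://example.com/cars"]))  # cars first
```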
Teoma: Indirect Relevance
Combines indirect relevance judgments with its own link analysis.
"Subject-Specific Popularity ranks a site based on the number of same-subject specific pages that reference it." [Teoma.com page]
Clustering Usage:
Refine: Models communities to suggest search
classification
Resources: Suggests authoritative sites within
designated community
Web Log Mining
It is standard operating procedure for large search engines to monitor what people are querying.
Goals:
  learn associations between common terms, based on a large number of queries (see the sketch below)
  identify trends in user behavior that should be addressed by the system
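A minimal sketch of the first goal, learning term associations from query co-occurrence; the log format (one query string per entry) is an assumption.

```python
from collections import Counter
from itertools import combinations

def term_associations(query_log, min_count=2):
    """Count how often pairs of terms co-occur within logged queries.

    query_log: iterable of raw query strings.
    Returns (term pair, count) tuples seen at least min_count times,
    most common first."""
    pairs = Counter()
    for query in query_log:
        terms = sorted(set(query.lower().split()))
        pairs.update(combinations(terms, 2))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

log = ["cheap flights", "cheap flights europe", "flights to europe"]
print(term_associations(log))  # [(('cheap', 'flights'), 2), ...]
```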