Relevance Feedback - Computer Science Building, Colorado
Recall: Query Reformulation
Approaches:
1. Relevance feedback based (see the Rocchio formula below):
   vector model (Rocchio …)
   probabilistic model (Robertson & Sparck Jones, Croft …)
2. Cluster-based query expansion:
   1. Local analysis: derive information from the retrieved document set
   2. Global analysis: derive information from the corpus
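As a pointer back to that material, the standard Rocchio reformulation for the vector model (a textbook formula, not spelled out on this slide) moves the query toward the centroid of the relevant documents and away from the non-relevant ones:

$$ \vec{q}_{new} = \alpha\,\vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \;-\; \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j $$

where D_r and D_n are the sets of relevant and non-relevant retrieved documents and α, β, γ are tuning constants.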
Local Analysis
“Known relevant documents contain terms which
can be used to describe a larger cluster of
relevant documents.” MIR
In relevance feedback, clusters are built from interaction with the user about documents.
Local analysis instead automatically exploits the retrieved documents, identifying terms related to those in the query.
Term Clusters
Association clusters: model co-occurrence of stems in the retrieved documents and expand the query using co-occurring terms (see the sketch below)
  unnormalized: groups terms by large co-occurrence frequencies
  normalized: groups terms by rarity
Metric clusters: also factor in intra-document distance between occurrences
Problem: expensive to compute on the fly
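A minimal sketch of association-cluster expansion under local analysis; the document stems and function name are illustrative, not from the slides. The unnormalized score is the raw co-occurrence count c_uv; the normalized score c_uv / (c_uu + c_vv − c_uv) favors rarer stems.

```python
from collections import defaultdict

def association_clusters(docs, query_stems, top_n=3, normalized=True):
    """Expand query stems using association clusters built from
    the stems of the locally retrieved documents.

    docs: list of retrieved documents, each a list of stems.
    Returns the top_n co-occurring stems for each query stem."""
    # c[u][v] = sum over retrieved docs of freq(u, d) * freq(v, d)
    c = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        freq = defaultdict(int)
        for stem in doc:
            freq[stem] += 1
        for u in freq:
            for v in freq:
                c[u][v] += freq[u] * freq[v]

    expansion = {}
    for u in query_stems:
        scores = {}
        for v in c[u]:
            if v == u:
                continue
            if normalized:
                # Normalized association: favors rare, strongly
                # associated stems over merely frequent ones.
                scores[v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])
            else:
                # Unnormalized: groups stems by large frequencies.
                scores[v] = c[u][v]
        expansion[u] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return expansion

# Hypothetical retrieved set for a query containing the stem "retriev":
docs = [["retriev", "feedback", "query", "retriev"],
        ["feedback", "user", "query"],
        ["retriev", "index", "term"]]
print(association_clusters(docs, ["retriev"]))
```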
Global Analysis
All documents in the collection are analyzed for term relationships.
Two approaches:
Similarity thesaurus: relates the whole query to new terms. The focus is on the concept underlying the terms: each term is indexed by the documents in which it appears.
Statistical thesaurus: cluster documents into a class hierarchy.
Similarity Thesaurus Basis
The weight of document d_j in the vector for term k_i:

$$ w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N}\left(0.5 + 0.5\,\frac{f_{i,l}}{\max_l(f_{i,l})}\right)^{2} itf_l^{2}}} $$

where the inverse term frequency (itf) for doc d_j is:

$$ itf_j = \log\frac{t}{t_j} $$

N is the number of documents,
t is the number of distinct terms in the collection, and
t_j is the number of distinct terms in document j.
Similarity Thesaurus Creation
The thesaurus is a matrix of correlation factors between indexing terms:

$$ c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j}\, w_{v,j} $$
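A minimal sketch combining the two formulas above, assuming term frequencies arrive as a nested dict and that w_ij = 0 when term k_i does not occur in d_j (the slides leave that case implicit); all names are illustrative.

```python
import math

def build_similarity_thesaurus(f):
    """Build the correlation matrix c[u][v] of a similarity thesaurus.

    f: dict mapping term -> {doc: raw frequency f_ij}.
    Roles are reversed relative to normal indexing: each term k_i
    is a vector over documents, weighted by w_ij."""
    docs = {d for per_doc in f.values() for d in per_doc}
    t = len(f)                                     # distinct terms in collection
    t_j = {d: sum(1 for per_doc in f.values() if d in per_doc) for d in docs}
    itf = {d: math.log(t / t_j[d]) for d in docs}  # inverse term frequency

    # w_ij: normalized tf component times itf_j, then each whole
    # term vector is length-normalized (the denominator above).
    w = {}
    for term, per_doc in f.items():
        max_f = max(per_doc.values())
        raw = {d: (0.5 + 0.5 * per_doc[d] / max_f) * itf[d] for d in per_doc}
        norm = math.sqrt(sum(x * x for x in raw.values())) or 1.0
        w[term] = {d: x / norm for d, x in raw.items()}

    # c_uv = k_u . k_v = sum over shared docs of w_uj * w_vj
    return {u: {v: sum(w[u][d] * w[v][d] for d in w[u].keys() & w[v].keys())
                for v in f}
            for u in f}
```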
Relationship between Terms and Query
[figure omitted; from Qiu & Frei, "Concept Based Query Expansion", SIGIR-93]
Query Expansion w/Similarity Thesaurus
Represent the query in the concept space of the index terms (weight vector).
Based on the global similarity thesaurus, compute a similarity sim(q, k_v):

$$ sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q}\, c_{u,v} $$

Expand the query with the top r ranked terms, weighted with:

$$ w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}} $$
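A minimal sketch of these two steps, reusing a correlation matrix `c` like the one from the previous sketch; `w_q` maps each query term k_u to its weight w_uq (illustrative names).

```python
def expand_query(w_q, c, r=5):
    """Expand a weighted query using a global similarity thesaurus.

    w_q: dict of query term k_u -> weight w_uq.
    c: correlation matrix, c[u][v] = c_uv.
    Returns the top r expansion terms with weights w_vq'."""
    # sim(q, k_v) = sum over query terms k_u of w_uq * c_uv
    sim = {}
    for u, w_uq in w_q.items():
        for v, c_uv in c.get(u, {}).items():
            sim[v] = sim.get(v, 0.0) + w_uq * c_uv

    # Take the top r candidates not already in the query.
    ranked = sorted((v for v in sim if v not in w_q),
                    key=sim.get, reverse=True)[:r]

    # w_vq' = sim(q, k_v) / sum_u w_uq
    total = sum(w_q.values())
    return {v: sim[v] / total for v in ranked}
```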
Global 2: Statistical Thesaurus
Thesaurus construction relies on high-discrimination, low-frequency terms.
Such terms are hard to cluster directly…
So, build classes by clustering similar docs instead.
Cluster similarity is the minimum of the cosine vector-model similarity between any two docs (one from each cluster).
Complete Link Algorithm [Crouch & Yang] (sketched in code below)
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
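A minimal sketch of the algorithm above; `sim` is any pairwise document similarity (cosine in these slides), and using a similarity threshold as the stop criterion is an illustrative choice.

```python
def complete_link(doc_ids, sim, threshold=0.0):
    """Agglomerative complete-link clustering of documents.

    doc_ids: list of document identifiers.
    sim(a, b): similarity between two documents.
    Inter-cluster similarity is the minimum similarity over all
    cross-cluster document pairs (the complete-link criterion).
    Returns the merge history, a flat view of the hierarchy."""
    clusters = [[d] for d in doc_ids]             # 1. one doc per cluster
    history = []
    while len(clusters) > 1:
        best = None                               # 2-3. best cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < threshold:                         # 5. stop criterion
            break
        merged = clusters[i] + clusters[j]        # 4. merge Cu and Cv
        history.append((merged, s))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history                                # 6. hierarchy (merge order)
```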
Hierarchy Example
Doc1 = D,D,A,B,C,A,B,C
Doc2 = E,C,E,A,A,D
Doc3 = D,C,B,B,D,A,B,C,A
Doc4 = A
Dendrogram (from MIR notes): D1 and D3 merge first as C{1,3} at similarity 0.99; D2 joins to form C{1,3,2} at 0.29; D4 joins last to form C{1,3,2,4} at 0.00.
Query Expansion w/Statistical Thesaurus
Select the terms for each class:
  a threshold on similarity determines which clusters are used
  NDC determines the max number of docs in a cluster
  MIDF determines the minimum IDF for any term (i.e., how rare it must be)
Compute the thesaurus class weight for terms:

$$ wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|} \qquad W_C = \frac{wt_C}{0.5 \times |C|} $$
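A minimal sketch of the class-weight computation, assuming the reconstruction above (wt_C is the average member-term weight, damped by 0.5·|C|); the input layout is illustrative.

```python
def thesaurus_class_weight(member_weights):
    """Weight a statistical thesaurus class C.

    member_weights: list of w_iC, one per term selected for class C.
    wt_C is their average; W_C damps it by half the class size,
    so large classes contribute less per term."""
    size = len(member_weights)           # |C|
    wt_c = sum(member_weights) / size    # average member weight
    return wt_c / (0.5 * size)           # W_C

# e.g., a hypothetical 4-term class:
print(thesaurus_class_weight([0.8, 0.6, 0.7, 0.5]))  # 0.325
```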
Global Analysis Summary
The thesaurus approach has been effective for improving queries…
However:
  it requires expensive processing (a static corpus is required)
  statistical generation exploits small frequencies better but is sensitive to parameter settings.
Relevance Feedback/Query Reformulation Summary
Relevance feedback and query expansion approaches have been shown to be effective at improving relevance, sometimes at the expense of precision.
Users resist relevance feedback; it takes time and understanding.
Query reformulation can be costly (expensive computation) for search engines/IR systems.
Search Engine Use of Query Feedback
Relevance feedback
Similar/Related Pages or searches:
  explicit: tried, but mostly abandoned
  indirect: Teoma (ranks documents higher that users look at more often)
  suggest expanded queries or ask to search for related pages (AltaVista and MSN Search used to do this)
  Google: Find Similar
  Teoma
Web log data mining
Web log data mining
Behavior-Based Ranking
AskJeeves used user behavior to change result ranking:
For each query Q, record which URLs are followed.
Use click-through counts to order URLs for subsequent submissions of Q (see the sketch below).
Pseudo-relevance feedback
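A minimal sketch of that behavior-based reordering; the storage layout and names are illustrative, not AskJeeves' actual implementation.

```python
from collections import defaultdict

# clicks[query][url] = how many times users followed url for query
clicks = defaultdict(lambda: defaultdict(int))

def record_click(query, url):
    """Log that a user followed `url` after submitting `query`."""
    clicks[query][url] += 1

def rerank(query, urls):
    """Order candidate URLs for a repeated query by click-through
    counts, most-clicked first; sort stability keeps unclicked
    URLs in their original order."""
    return sorted(urls, key=lambda u: -clicks[query][u])

record_click("jaguar", "https://example.com/cars")
record_click("jaguar", "https://example.com/cars")
record_click("jaguar", "https://example.com/cats")
print(rerank("jaguar", ["https://example.com/cats",
                        "https://example.com/cars"]))  # cars first
```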
Teoma: Indirect Relevance
Combines indirect relevance judgments with its own link analysis.
"Subject-Specific Popularity ranks a site based on the number of same-subject specific pages that reference it." [Teoma.com page]
Clustering Usage:
Refine: Models communities to suggest search
classification
Resources: Suggests authoritative sites within
designated community
Web Log Mining
It is standard operating procedure for large search engines to monitor what people are querying.
Goals:
  learn associations between common terms, based on a large number of queries (see the sketch below)
  identify trends in user behavior that should be addressed by the system
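A minimal sketch of the first goal, learning term associations from query co-occurrence; the log format (one query string per entry) is an assumption.

```python
from collections import Counter
from itertools import combinations

def term_associations(query_log, min_count=2):
    """Count how often pairs of terms co-occur within logged queries.

    query_log: iterable of raw query strings.
    Returns (term pair, count) tuples seen at least min_count times,
    most common first."""
    pairs = Counter()
    for query in query_log:
        terms = sorted(set(query.lower().split()))
        pairs.update(combinations(terms, 2))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

log = ["cheap flights", "cheap flights europe", "flights to europe"]
print(term_associations(log))  # [(('cheap', 'flights'), 2), ...]
```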