Transcript of PowerPoint slides

CS 430: Information Discovery
Lecture 23
Query Refinement
Course Administration
Midterm Examination
Grades for Midterm Examination have been sent.
Contact [email protected] if you have not
received a grade.
Assignment 4
This will be an optional assignment for extra
credit. It is due on Friday, December 7 at 5 p.m.
The Human in the Loop
[Diagram: the human in the loop — the user searches the index, which returns hits, and browses the repository, which returns objects]
From the Midterm
Question 2
(a) Why are precision and recall difficult measures
to use for retrieval effectiveness when there is a user
in the loop?
Query Refinement
[Flowchart: query formulation and search → display number of hits → display retrieved information → decide next step; at each stage the user may formulate a new query or reformulate the query, e.g., when there are no hits]
Reformulation of Query
Manual
• Add or remove search terms
• Change Boolean operators
• Change wild cards
Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms
Query Reformulation: Vocabulary Tools
Feedback
• Information about stop lists, stemming, etc.
• Numbers of hits on each term or phrase
Suggestions
• Thesaurus
• Browse lists of terms in the inverted index
• Controlled vocabulary
Query Reformulation: Document Tools
Feedback to user consists of document excerpts or surrogates
• Shows the user how the system has interpreted the query
Effective at suggesting how to restrict a search
• Shows examples of false hits
Less good at suggesting how to expand a search
• No examples of missed items
Example: Tilebars
The figure represents a set of hits
from a text search.
Each large rectangle represents a
document or section of text.
Each row represents a search term or
subquery.
The density of each small square
indicates the frequency with which a
term appears in a section of a
document.
Hearst 1995
Document Vectors as Points on a Surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
From Lecture 8
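The normalization step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture; the term-weight vectors are made up:

```python
import math

def normalize(vec):
    """Scale a document vector to unit length, so its end lies on the unit sphere."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec]

def dot(u, v):
    """Dot product; for unit-length vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

d1 = normalize([3.0, 1.0, 0.0])   # hypothetical term-weight vectors
d2 = normalize([2.0, 1.0, 0.5])

# Similar documents end up as nearby points on the sphere: cosine close to 1.
similarity = dot(d1, d2)
```

After normalization, closeness on the surface and cosine similarity are the same notion, which is why the flat-region picture works locally.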
Theoretically Best Query
[Figure: the optimal query shown among the document points; x = non-relevant documents, o = relevant documents]
Theoretically Best Query
For a specific query, Q, let:
DR be the set of all relevant documents
DN-R be the set of all non-relevant documents
sim (Q, DR) be the mean similarity between query Q and
documents in DR
sim (Q, DN-R) be the mean similarity between query Q and
documents in DN-R
The theoretically best query would maximize:
F = sim (Q, DR) - sim (Q, DN-R)
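When DR and DN-R are known, the objective F can be computed directly. A toy sketch, using cosine similarity and invented document sets (not from the lecture):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def mean_similarity(q, docs):
    """Mean similarity between query q and a set of document vectors."""
    return sum(cosine(q, d) for d in docs) / len(docs)

def objective_F(q, relevant, non_relevant):
    """F = sim(Q, DR) - sim(Q, DN-R); a better query gives a larger F."""
    return mean_similarity(q, relevant) - mean_similarity(q, non_relevant)

q = [1.0, 1.0, 0.0]                       # hypothetical query vector
dr = [[1.0, 0.8, 0.1], [0.9, 1.0, 0.0]]  # relevant documents
dnr = [[0.0, 0.1, 1.0]]                  # non-relevant documents
F = objective_F(q, dr, dnr)               # positive: q is close to DR, far from DN-R
```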
Estimating the Best Query
In practice, DR and DN-R are not known. (The objective is to
find them.)
However, the results of an initial query can be used to estimate
sim (Q, DR) and sim (Q, DN-R).
Relevance Feedback (concept)
[Figure: hits from the original search; x = documents identified as non-relevant, o = documents identified as relevant; markers show the original query and the reformulated query]
Rocchio's Modified Query
Modified query vector
= Original query vector
+ Mean of relevant documents found by original query
- Mean of non-relevant documents found by original query
Query Modification
Q1 = Q0 + (1/n1) Σ(i=1 to n1) Ri - (1/n2) Σ(i=1 to n2) Si
Q0 = vector for the initial query
Q1 = vector for the modified query
Ri = vector for relevant document i
Si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
Rocchio 1971
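Rocchio's update can be written directly from the formula. A minimal sketch with vectors as Python lists; the document vectors are made up for illustration:

```python
def mean_vector(docs):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(docs)
    return [sum(d[k] for d in docs) / n for k in range(len(docs[0]))]

def rocchio(q0, relevant, non_relevant):
    """Q1 = Q0 + mean of relevant vectors - mean of non-relevant vectors."""
    r = mean_vector(relevant)
    s = mean_vector(non_relevant)
    return [q + ri - si for q, ri, si in zip(q0, r, s)]

q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.5, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]

# The modified query moves toward the relevant documents and away from the
# non-relevant ones.
q1 = rocchio(q0, relevant, non_relevant)
```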
Difficulties with Relevance Feedback
[Figure: x = non-relevant documents, o = relevant documents; markers show the optimal query, the original query, and the reformulated query; hits from the initial query are contained in the gray shaded area]
Effectiveness of Relevance Feedback
Best when:
• Relevant documents are tightly clustered (similarities are large)
• Similarities between relevant and non-relevant documents are small
Positive and Negative Feedback
Q1 = α Q0 + (β/n1) Σ(i=1 to n1) Ri - (γ/n2) Σ(i=1 to n2) Si

α, β and γ are weights that adjust the importance
of the three vectors.
If γ = 0, the weights provide positive feedback,
by emphasizing the relevant documents in the
initial set.
If β = 0, the weights provide negative feedback,
by reducing the emphasis on the non-relevant
documents in the initial set.
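The weighted form is the same update with a coefficient on each term. A sketch; the default weights here are illustrative choices, not values from the lecture:

```python
def weighted_rocchio(q0, relevant, non_relevant,
                     alpha=1.0, beta=0.75, gamma=0.25):
    """Q1 = alpha*Q0 + (beta/n1)*sum(Ri) - (gamma/n2)*sum(Si).

    gamma = 0 gives pure positive feedback (relevant documents only);
    beta = 0 gives pure negative feedback (non-relevant documents only).
    """
    dim = len(q0)
    r_sum = [sum(d[k] for d in relevant) for k in range(dim)]
    s_sum = [sum(d[k] for d in non_relevant) for k in range(dim)]
    n1, n2 = len(relevant), len(non_relevant)
    return [alpha * q0[k] + beta * r_sum[k] / n1 - gamma * s_sum[k] / n2
            for k in range(dim)]

# With gamma = 0 the non-relevant documents are ignored entirely.
q1 = weighted_rocchio([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]], gamma=0.0)
```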
When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to
increase recall, i.e., when it is important to find all relevant
documents.
Under these circumstances, users can be expected to put effort
into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with studies of thesauruses and other manual enhancements
Latent Semantic Indexing
A very rough sketch of the basic idea, without any details
or justification.
Objective
Replace indexes that use sets of index terms with indexes
that use concepts.
Approach
Map the index term vector space into a lower-dimensional
space, using singular value decomposition.
The index term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: two document vectors d1 and d2 in a three-dimensional term space with axes t1, t2, t3]
Mathematical concepts
Vector space theory (Singular Value Decomposition)
Define M as the term-document matrix, with t rows (number of
index terms) and n columns (number of documents). There
exist matrices K, S and D, such that:
M = K S D^T
K is the matrix of eigenvectors of M M^T
D is the matrix of eigenvectors of M^T M
S is an r x r diagonal matrix, where r is the rank of M (at most
the smaller of t and n), and every element of S is non-negative.
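The decomposition can be checked numerically. A sketch using NumPy's SVD routine, with a small made-up term-document matrix (t = 4 terms, n = 3 documents):

```python
import numpy as np

# Hypothetical term-document matrix: rows are index terms, columns are documents.
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# np.linalg.svd returns M = K @ diag(s) @ Dt, with the singular values s
# (the diagonal of S) non-negative and in decreasing order.
K, s, Dt = np.linalg.svd(M, full_matrices=False)

# Reconstruction check: K S D^T recovers M up to floating-point error.
reconstructed = K @ np.diag(s) @ Dt
```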
Reduction of dimension
Select the s largest elements of S and the
corresponding columns of K and D. This gives a
reduced matrix:
Ms = Ks Ss Ds^T
It is claimed that the rows of this matrix represent
concepts. Therefore calculation of the similarity
between a query expressed in this space and a
document is more effective than in the index term
vector space.
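The truncation step can be sketched by keeping only the s largest singular values. Again a NumPy illustration with an invented matrix, not the lecture's data:

```python
import numpy as np

M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
K, sigma, Dt = np.linalg.svd(M, full_matrices=False)

s = 2  # number of "concepts" to keep
# Ms = Ks Ss Ds^T: keep the s largest singular values and the
# corresponding columns of K and rows of D^T.
Ms = K[:, :s] @ np.diag(sigma[:s]) @ Dt[:s, :]

# Ms is the best rank-s approximation of M; documents can now be compared
# in the s-dimensional concept space via the columns of diag(sigma[:s]) @ Dt[:s, :].
```

A standard property of this truncation is that the Frobenius-norm error equals the size of the discarded singular values, which is why dropping the smallest ones loses the least information.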