A Risk Minimization Framework for Information Retrieval


Information Retrieval:
Problem Formulation & Evaluation
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Research Process
• Identification of a research question/topic
• Propose a possible solution/answer (formulate a
hypothesis)
• Implement the solution
• Design experiments (measures, data, etc)
• Test the solution/hypothesis
• Draw conclusions
• Repeat the cycle of question-answering or
hypothesis-formulation-and-testing if necessary
(Today’s lecture focuses on the problem formulation and evaluation/experiment design steps of this cycle.)
Part 1: IR Problem Formulation
Basic Formulation of TR (traditional)
• Vocabulary V = {w1, w2, …, wN} of a language
• Query q = q1, …, qm, where qi ∈ V
• Document di = di1, …, dimi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
  – Generally unknown and user-dependent
  – Query is a “hint” on which docs are in R(q)
• Task = compute R’(q), an “approximate R(q)”
  (i.e., decide which documents to return to a user)
Computing R(q)
• Strategy 1: Document selection
  – R’(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an
    indicator function or classifier
  – System must decide if a doc is relevant or not
    (“absolute relevance”)
• Strategy 2: Document ranking
  – R’(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) ∈ ℜ is a relevance
    measure function; θ is a cutoff
  – System must decide if one doc is more likely to be
    relevant than another (“relative relevance”)
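To make the contrast concrete, here is a minimal, hypothetical Python sketch (not from the slides): score() is a made-up word-overlap measure standing in for any relevance measure f(d,q); select() implements the absolute-relevance strategy with a cutoff, while rank() implements the relative-relevance strategy and leaves the stopping point to the user.

```python
def score(doc_words, query_words):
    """Toy relevance measure: fraction of query words present in the doc."""
    return sum(w in doc_words for w in query_words) / max(len(query_words), 1)

def select(collection, query_words, threshold=0.5):
    """Strategy 1: document selection -- an absolute relevant/not-relevant call."""
    return {doc_id for doc_id, words in collection.items()
            if score(words, query_words) >= threshold}

def rank(collection, query_words):
    """Strategy 2: document ranking -- order docs by the relevance measure."""
    return sorted(collection,
                  key=lambda doc_id: score(collection[doc_id], query_words),
                  reverse=True)

if __name__ == "__main__":
    C = {"d1": {"textile", "imports", "industry"},
         "d2": {"sports", "news"},
         "d3": {"textile", "industry", "markets"}}
    q = ["textile", "industry"]
    print(select(C, q))   # e.g. {'d1', 'd3'}
    print(rank(C, q))     # e.g. ['d1', 'd3', 'd2']; the user decides where to stop
```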
Document Selection vs. Ranking
[Figure] The true R(q) contains relevant (+) and non-relevant (–) documents.
• Doc selection, f(d,q) ∈ {0,1}: the system returns a fixed set R’(q), which typically
  includes some non-relevant documents and misses some relevant ones.
• Doc ranking, f(d,q) = score: the system returns a ranked list and the user sets the
  threshold, e.g.
    0.98 d1 +
    0.95 d2 +
    0.83 d3 –
    0.80 d4 +
    0.76 d5 –
    0.56 d6 –
    0.34 d7 –
    0.21 d8 +
    0.21 d9 –
  The user stops anywhere in this list; everything above that point is R’(q).
Problems of Doc Selection/Boolean
model [Cooper 88]
• The classifier is unlikely accurate
– “Over-constrained” query (terms are too specific): no
relevant documents found
– “Under-constrained” query (terms are too general):
over delivery
– It is hard to find the right position between these two
extremes (hard for users to specify constraints)
• Even if it is accurate,
all relevant documents are
not equally relevant; prioritization is needed
since a user can only examine one document at
a time
Ranking is often preferred
• A user can stop browsing anywhere, so the
boundary is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking
Principle [Robertson 77]
Probability Ranking Principle [Robertson 77]
• Seek a more fundamental justification
– Why is ranking based on probability of relevance
reasonable?
– Is there a better way of ranking documents?
– What is the optimal way of ranking documents?
• Theoretical justification for ranking (Probability Ranking
  Principle): returning a ranked list of documents in descending
  order of probability that a document is relevant to the query is the
  optimal strategy under the following two assumptions (do they hold?):
  – The utility of a document (to a user) is independent of the utility of any
    other document
  – A user would browse the results sequentially
Two Justifications of PRP
• Optimization of traditional retrieval effectiveness measures
  – Given an expected level of recall, ranking based on PRP
    maximizes the precision
  – Given a fixed rank cutoff, ranking based on PRP maximizes
    precision and recall
• Optimal decision making
  – Regardless of the tradeoffs (e.g., favoring high precision vs. high
    recall), ranking based on PRP optimizes the expected utility of a
    binary (independent) retrieval decision (i.e., to retrieve or not to
    retrieve a document)
• Intuition: if a user sequentially examines one doc at
  a time, we’d like the user to see the very best ones first
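To make the second justification concrete, here is a brief sketch of the standard expected-utility argument (the notation u_1, u_0 is mine, not from the slide):

```latex
% Let p = p(Rel|q,d), u_1 = utility of retrieving a relevant document,
% and u_0 = utility of retrieving a non-relevant one, with u_1 > u_0.
\[
  \mathbb{E}[\text{utility of retrieving } d] \;=\; p\,u_1 + (1-p)\,u_0
  \;=\; u_0 + p\,(u_1 - u_0),
\]
% which is increasing in p whenever u_1 > u_0. So whatever the particular
% utility tradeoff, retrieving (or ranking higher) the documents with the
% largest p(Rel|q,d) maximizes expected utility, as the PRP prescribes.
```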
According to the PRP, all we need is
“A relevance measure function f”
which satisfies
For all q, d1, d2,
f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)
Most existing research on IR models so far has fallen into
this line of thinking…. (Limitations?)
Modeling Relevance:
Roadmap for Retrieval Models
[Diagram] Relevance, subject to relevance constraints [Fang et al. 04], has been modeled along several branches:
• Similarity between representations (Rep(q), Rep(d)), with different representations and
  similarity measures:
  – Vector space model (Salton et al. 75)
  – Probability distribution model (Wong & Yao 89)
• Probability of relevance P(r=1|q,d), r ∈ {0,1}:
  – Regression model (Fuhr 89)
  – Learning to rank (Joachims 02, Burges et al. 05)
  – Generative models:
    – Document generation: classical probabilistic model (Robertson & Sparck Jones 76)
    – Query generation: language modeling approach (Ponte & Croft 98; Lafferty & Zhai 01a)
• Probabilistic inference, P(d→q) or P(q→d), with different inference systems:
  – Probabilistic concept space model (Wong & Yao 95)
  – Inference network model (Turtle & Croft 91)
• Divergence from randomness (Amati & Rijsbergen 02)
Part 2: IR Evaluation
Evaluation: Two Different Reasons
• Reason 1: So that we can assess how useful an IR
  system/technology would be (for an application)
  – Measures should reflect the utility to users in a real application
  – Usually done through user studies (interactive IR evaluation)
• Reason 2: So that we can compare different systems and
  methods (to advance the state of the art)
  – Measures only need to be correlated with the utility to actual
    users, thus don’t have to accurately reflect the exact utility to users
  – Usually done through test collections (test set IR evaluation)
What to Measure?
• Effectiveness/Accuracy: how accurate are the
search results?
– Measuring a system’s ability to rank relevant
documents above non-relevant ones
• Efficiency: how quickly can a user get the
results? How much computing resources are
needed to answer a query?
– Measuring space and time overhead
• Usability: How useful is the system for real
user tasks?
– Doing user studies
The Cranfield Evaluation Methodology
• A methodology for laboratory testing of system
  components, developed in the 1960s
• Idea: Build reusable test collections & define measures
  – A sample collection of documents (simulates a real document collection)
  – A sample set of queries/topics (simulates user queries)
  – Relevance judgments (ideally made by the users who formulated the
    queries) → Ideal ranked list
  – Measures to quantify how well a system’s result matches the ideal
    ranked list
• A test collection can then be reused many times to
  compare different systems
• This methodology is general and applicable for evaluating
  any empirical task
Test Collection Evaluation
[Figure] A document collection (D1, D2, D3, …, D48, …), a set of queries (Q1, Q2, Q3, …, Q50, …),
and relevance judgments for each query, e.g.
  Q1: D1 +, D2 +, D3 –, D4 –, D5 +, …
  Q2: D1 –, D2 +, D3 +, D4 –, …
  Q50: D1 –, D2 –, D3 +, …
For query Q1, System A returns D2, D1, D4, D5 → Precision = 3/4, Recall = 3/3;
System B returns D1, D4, D3, D5 → Precision = 2/4, Recall = 2/3.
Measures for evaluating a set of retrieved documents

  Doc \ Action    Retrieved                   Not Retrieved
  Relevant        Relevant Retrieved (a)      Relevant Rejected (b)
  Not relevant    Irrelevant Retrieved (c)    Irrelevant Rejected (d)

  Precision = a / (a + c)
  Recall = a / (a + b)

Ideal results: Precision = Recall = 1.0
In reality, high recall tends to be
associated with low precision (why?)
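A small illustrative Python helper (my own sketch, not from the slides) that computes these set-based measures from a retrieved set and the relevant set:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall.

    In the table's notation: a = |retrieved & relevant|, a + c = |retrieved|,
    a + b = |relevant|, so Precision = a/(a+c) and Recall = a/(a+b).
    """
    a = len(retrieved & relevant)
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

# Example using System A's result for Q1 from the earlier figure:
print(precision_recall({"D2", "D1", "D4", "D5"}, {"D1", "D2", "D5"}))  # (0.75, 1.0)
```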
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure] Two PR curves (precision on the y-axis, recall on the x-axis) that cross
each other. Which is better?
Computing Precision-Recall Curve
Total number of relevant documents in collection: 10

  Rank  Doc   Rel  Precision  Recall
  1     D1    +    1/1        1/10
  2     D2    +    2/2        2/10
  3     D3    –    2/3        2/10
  4     D4    –    2/4        2/10
  5     D5    +    3/5        3/10
  6     D6    –    3/6        3/10
  7     D7    –    3/7        3/10
  8     D8    +    4/8        4/10
  9     D9    –    4/9        4/10
  10    D10   –    4/10       4/10

[Figure] The PR curve plots these precision values (y-axis, up to 1.0) against recall
(x-axis: 0.1, 0.2, 0.3, …, 1.0). What precision should be used at recall 1.0 (= 10/10)?
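The same computation, as a hedged Python sketch (my own helper, not from the slides), given only the binary relevance labels of the ranked list and the total number of relevant documents:

```python
def pr_points(rel_flags, total_relevant):
    """Precision/recall after each rank, given binary relevance of the ranked docs.

    rel_flags: 0/1 (or bool) labels in ranked order; total_relevant: number of
    relevant docs in the whole collection (10 in the slide's example).
    """
    points, hits = [], 0
    for k, rel in enumerate(rel_flags, start=1):
        hits += int(rel)
        points.append((hits / k, hits / total_relevant))  # (precision@k, recall@k)
    return points

# Ranked list from the slide: D1+ D2+ D3- D4- D5+ D6- D7- D8+ D9- D10-
for p, r in pr_points([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], total_relevant=10):
    print(f"precision={p:.2f}  recall={r:.2f}")
```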
How to summarize a ranking?
Total number of relevant documents in collection: 10
Ranked result (as above): D1 +, D2 +, D3 –, D4 –, D5 +, D6 –, D7 –, D8 +, D9 –, D10 –
The precision values at the ranks where relevant documents are retrieved are
1/1, 2/2, 3/5, and 4/8; the six relevant documents that are never retrieved each
contribute a precision of 0.

Average Precision = (1/1 + 2/2 + 3/5 + 4/8 + 0 + 0 + 0 + 0 + 0 + 0) / 10 = 0.31
Summarize a Ranking: MAP
• Given that n docs are retrieved
  – Compute the precision (at rank) where each (new) relevant document is
    retrieved => p(1), …, p(k), if we have k rel. docs
  – E.g., if the first rel. doc is at the 2nd rank, then p(1) = 1/2.
  – If a relevant document never gets retrieved, we assume the precision
    corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
  – Average precision = (p(1) + … + p(k)) / k (see the sketch below)
• This gives us an average precision, which captures both precision and
  recall and is sensitive to the rank of each relevant document
• Mean Average Precision (MAP)
  – MAP = arithmetic mean of average precision over a set of topics
  – gMAP = geometric mean of average precision over a set of topics (more
    affected by difficult topics)
  – Which one should be used?
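Here is a small Python sketch of AP, MAP, and gMAP (my own code, not from the slides). It follows the slide's convention that unretrieved relevant documents contribute zero precision; the eps smoothing in gMAP is my own convention to avoid log(0), not something the slide specifies.

```python
import math

def average_precision(rel_flags, total_relevant):
    """AP: mean of precision at each relevant doc's rank; unretrieved rel docs count as 0."""
    hits, precisions = 0, []
    for k, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / total_relevant  # missing relevant docs implicitly add 0

def map_and_gmap(per_topic_aps, eps=1e-6):
    """Arithmetic (MAP) and geometric (gMAP) means over topics; eps guards log(0)."""
    n = len(per_topic_aps)
    map_ = sum(per_topic_aps) / n
    gmap = math.exp(sum(math.log(ap + eps) for ap in per_topic_aps) / n)
    return map_, gmap

# The slide's example: 10 relevant docs overall, 4 retrieved at ranks 1, 2, 5, 8.
print(average_precision([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], total_relevant=10))  # 0.31
print(map_and_gmap([0.31, 0.05, 0.60]))  # gMAP is pulled down by the hard topic
```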
What if we have multi-level relevance judgments?
Relevance level: r = 1 (non-relevant), 2 (marginally relevant), 3 (very relevant)

  Rank  Doc   Gain  Cumulative Gain   Discounted Cumulative Gain
  1     D1    3     3                 3
  2     D2    2     3+2               3 + 2/log 2
  3     D3    1     3+2+1             3 + 2/log 2 + 1/log 3
  4     D4    1     3+2+1+1           …
  5     D5    3     …                 …
  6     D6    1
  7     D7    1
  8     D8    2
  9     D9    1
  10    D10   1

DCG@10 = 3 + 2/log 2 + 1/log 3 + … + 1/log 10
IdealDCG@10 = 3 + 3/log 2 + 3/log 3 + … + 3/log 9 + 2/log 10
Normalized DCG = DCG@10 / IdealDCG@10 = ?
Assume: there are 9 documents rated “3” in total in the collection
Summarize a Ranking: NDCG
• What if relevance judgments are on a scale of [1, r], with r > 2?
• Cumulative Gain (CG) at rank n
  – Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
  – CG = r1 + r2 + … + rn
• Discounted Cumulative Gain (DCG) at rank n
  – DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
  – We may use any base for the logarithm, e.g., base = b
  – For rank positions below b, do not discount
• Normalized Discounted Cumulative Gain (NDCG) at rank n
  – Normalize the DCG at rank n by the DCG value at rank n of the ideal ranking
  – The ideal ranking would first return the documents with the highest
    relevance level, then the next highest relevance level, etc.
    (a computational sketch follows below)
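A minimal Python sketch of DCG/NDCG under these definitions (my own code, not from the slides). The ranked gains are the slide's example; the gain distribution of the rest of the collection, beyond the nine documents rated 3 and at least one rated 2, is made up.

```python
import math

def dcg(gains, base=2):
    """DCG = r1 + r2/log_b(2) + r3/log_b(3) + ...; rank 1 is not discounted."""
    return sum(g if i == 1 else g / math.log(i, base)
               for i, g in enumerate(gains, start=1))

def ndcg(gains, all_gains_in_collection, k=10, base=2):
    """Normalize DCG@k by the DCG@k of the ideal ranking of the whole collection."""
    ideal = sorted(all_gains_in_collection, reverse=True)[:k]
    return dcg(gains[:k], base) / dcg(ideal, base)

# Slide's example: ranked gains for D1..D10; the collection is assumed to contain
# nine documents rated 3 (hypothetical counts for the 2s and 1s).
ranked = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]
collection = [3] * 9 + [2] * 5 + [1] * 20
print(dcg(ranked))                 # DCG@10
print(ndcg(ranked, collection))    # NDCG@10 = DCG@10 / IdealDCG@10
```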
Other Measures
• Precision at k documents (e.g., prec@10doc):
  – easier to interpret than MAP (why?)
  – also called breakeven precision when k is the same as
    the number of relevant documents
• Mean Reciprocal Rank (MRR):
  – Same as MAP when there’s only 1 relevant document
  – Reciprocal Rank = 1/Rank-of-the-relevant-doc
• F-Measure (F1): harmonic mean of precision and recall

  F_β = (β² + 1) P R / (β² P + R) = 1 / [ (β²/(β²+1)) (1/R) + (1/(β²+1)) (1/P) ]
  F_1 = 2 P R / (P + R)

  P: precision, R: recall, β: parameter (often set to 1)
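As a quick check of the formula (the numbers are mine, not from the slide), take P = 0.5 and R = 0.4:

```latex
\[
  F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.5 \times 0.4}{0.5 + 0.4}
      = \frac{0.4}{0.9} \approx 0.44,
\qquad
  F_{\beta=2} = \frac{(\beta^2+1)PR}{\beta^2 P + R}
      = \frac{5 \times 0.5 \times 0.4}{4 \times 0.5 + 0.4}
      = \frac{1.0}{2.4} \approx 0.42 .
\]
```

Setting β > 1 weights recall more heavily, which is why F with β = 2 sits closer to R than F1 does here.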
Challenges in creating early test collections
• Challenges in obtaining documents:
  – Salton had students manually transcribe Time
    magazine articles
  – Not a problem now!
• Challenges in distributing a collection
  – TREC started when CD-ROMs became available
  – Not a problem now!
• Challenge of scale – limited by qrels (relevance judgments)
  – The idea of “pooling” (Sparck Jones & Rijsbergen 75)
Larger collections created in the 1980s
• INSPEC (12,684 docs, 77 queries, 1981): title, authors, source, abstract and
  indexing information from Sep-Dec 1979 issues of Computer and Control Abstracts.
• CACM (3,204 docs, 64 queries, 1983, 2.2 MB): title, abstract, author, keywords and
  bibliographic information from articles of Communications of the ACM, 1958-1979.
• CISI (1,460 docs, 112 queries, 1983, 2.2 MB): author, title/abstract, and co-citation
  data for the 1460 most highly cited articles and manuscripts in information
  science, 1969-1977.
• LISA (6,004 docs, 35 queries, 1983, 3.4 MB): taken from the Library and Information
  Science Abstracts database.
Commercial systems then routinely supported searching over millions of documents
→ Pressure for researchers to use larger collections for evaluation
The Ideal Test Collection Report [Sparck Jones & Rijsbergen 75]
• Introduced the idea of pooling
  – Have assessors judge only a pool of top-ranked
    documents returned by various retrieval systems
• Other recommendations (the vision was later implemented in TREC):
  “1. that an ideal test collection be set up to facilitate and promote research;
  2. that the collection be of sufficient size to constitute an adequate test bed for
     experiments relevant to modern IR systems…
  3. that the collection(s) be set up by a special purpose project carried out by an
     experienced worker, called the Builder;
  4. that the collection(s) be maintained in a well-designed and documented
     machine form and distributed to users, by a Curator;
  5. that the curating (sic) project be encouraged to, promote research via the
     ideal collection(s), and also via the common use of other collection(s) acquired
     from independent projects.”
TREC (Text REtrieval Conference)
• 1990: DARPA funded NIST to build a large test
collection
• 1991: NIST proposed to distribute the data set
through TREC (leader: Donna Harman)
• Nov. 1992: First TREC meeting
• Goals of TREC:
– create test collections for a set of retrieval tasks;
– promote as widely as possible research in those tasks;
– organize a conference for participating researchers to
meet and disseminate their research work using TREC
collections.
The “TREC Vision” (mass
collaboration for creating a pool)
“Harman and her colleagues appear to be the first to realize
that if the documents and topics of a collection were
distributed for little or no cost, a large number of groups
would be willing to load that data into their search systems
and submit runs back to TREC to form a pool, all for no costs
to TREC. TREC would use assessors to judge the pool. The
effectiveness of each run would then be measured and
reported back to the groups. Finally, TREC could hold a
conference where an overall ranking of runs would be
published and participating groups would meet to present
work and interact. It was hoped that a slight competitive
element would emerge between groups to produce the best
possible runs for the pool.” (Sanderson 10)
The TREC Ad Hoc Retrieval Task & Pooling
• Simulate an information analyst (high recall)
• Multi-field topic description
• News documents + Government documents
• Relevance criteria: “a document is judged relevant if any
  piece of it is relevant (regardless of how small the piece
  is in relation to the rest of the document)”
• Each submitted run returns 1000 documents for
  evaluation with various measures
• The top 100 documents from each run were taken to form a pool
  (a small sketch of pool construction follows below)
• All the documents in the pool were judged
• The unjudged documents are often assumed to be non-relevant (problem?)
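Here is a minimal, hypothetical sketch of how such a pool could be assembled from submitted runs (the run data and helper names are made up; TREC's actual tooling differs):

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from every submitted run for one topic.

    runs: dict mapping run_id -> ranked list of doc ids (best first).
    Only documents in the returned pool get judged; anything outside it is
    typically treated as non-relevant when computing measures.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Toy example with depth=2 instead of 100:
runs = {"runA": ["d3", "d1", "d7"], "runB": ["d1", "d9", "d3"]}
print(build_pool(runs, depth=2))  # {'d3', 'd1', 'd9'}
```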
An example TREC topic
<top>
<num> Number:
200
<title> Topic: Impact of foreign textile imports on U.S. textile industry
<desc> Description: Document must report on how the importation of foreign
textiles or textile products has influenced or impacted on the U.S. textile
industry.
<narr> Narrative: The impact can be positive or negative or qualitative.
It may include the expansion or shrinkage of markets or manufacturing volume
or an influence on the methods or strategies of the U.S. textile industry.
"Textile industry“ includes the production or purchase of raw materials;
basic processing techniques such as dyeing, spinning, knitting, or weaving;
the manufacture and marketing of finished goods; and also research in the
textile field.
</top>
Typical TREC Evaluation Result
[Figure] A typical per-run evaluation report includes:
• A precision-recall curve
• Recall: out of 4728 rel docs, we’ve got 3212 → Recall = 3212/4728
• Precision@10docs: about 5.5 docs in the top 10 docs are relevant
• Breakeven Precision (precision when prec = recall)
• Mean Avg. Precision (MAP)

Example of computing average precision:
  Ranked result: D1 +, D2 +, D3 –, D4 –, D5 +, D6 –
  Total # rel docs = 4; the system returns 6 docs
  Average Prec = (1/1 + 2/2 + 3/5 + 0)/4
  Denominator is 4, not 3 (why?)
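Working out that example (my arithmetic, following the slide's rule that the fourth relevant document, which is never retrieved, contributes 0):

```latex
\[
  \mathrm{AP} = \frac{\tfrac{1}{1} + \tfrac{2}{2} + \tfrac{3}{5} + 0}{4}
              = \frac{1 + 1 + 0.6 + 0}{4} = \frac{2.6}{4} = 0.65 .
\]
```

The denominator is 4 because every relevant document in the collection counts, whether or not it was retrieved.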
What Query Averaging Hides
[Figure] Precision-recall curves for the individual queries (precision on the y-axis,
recall on the x-axis, both from 0 to 1) spread widely around the single averaged curve.
Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation
Statistical Significance Tests
• How sure can you be that an observed
  difference doesn’t simply result from the
  particular queries you chose?

  Experiment 1                      Experiment 2
  Query    System A  System B       Query    System A  System B
  1        0.20      0.40           1        0.02      0.76
  2        0.21      0.41           2        0.39      0.07
  3        0.22      0.42           3        0.16      0.37
  4        0.19      0.39           4        0.58      0.21
  5        0.17      0.37           5        0.04      0.02
  6        0.20      0.40           6        0.09      0.91
  7        0.21      0.41           7        0.12      0.46
  Average  0.20      0.40           Average  0.20      0.40

Slide from Doug Oard
Statistical Significance Testing

  Query    System A  System B   Sign Test   Wilcoxon
  1        0.02      0.76       +           +0.74
  2        0.39      0.07       -           -0.32
  3        0.16      0.37       +           +0.21
  4        0.58      0.21       -           -0.37
  5        0.04      0.02       -           -0.02
  6        0.09      0.91       +           +0.82
  7        0.12      0.46       -           -0.38
  Average  0.20      0.40       p = 1.0     p = 0.9375

[Figure: 0 lies within the interval covering 95% of outcomes]
Slide from Doug Oard
Try some out at: http://www.fon.hum.uva.nl/Service/CGI-Inline/HTML/Statistics.html
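For completeness, a hedged sketch of running these two tests in Python. This assumes SciPy is installed (scipy.stats.wilcoxon, and scipy.stats.binomtest since SciPy 1.7); the exact p-values depend on the test variant used and may not match the slide's numbers exactly.

```python
from scipy import stats

a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]  # System A, Experiment 2
b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]  # System B, Experiment 2
diffs = [bi - ai for ai, bi in zip(a, b)]

# Sign test: count positive per-query differences and test against a fair coin.
n_pos = sum(d > 0 for d in diffs)
n_nonzero = sum(d != 0 for d in diffs)
sign_p = stats.binomtest(n_pos, n_nonzero, p=0.5).pvalue

# Wilcoxon signed-rank test on the paired per-query scores.
wilcoxon_p = stats.wilcoxon(a, b).pvalue

print(f"sign test p = {sign_p:.4f}, Wilcoxon p = {wilcoxon_p:.4f}")
# Neither is anywhere near 0.05: the 0.20 vs 0.40 averages in Experiment 2 are
# not statistically significant, unlike the consistent gap in Experiment 1.
```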
Live Labs: Involve Real Users in
Evaluation
• Stuff I’ve Seen [Dumais et al. 03]
– Real systems deployed with hypothesis testing in
mind (different interfaces + logging capability)
– Search logs can then be used to analyze hypotheses
about user behavior
• The “A-B Test”
– Initial proposal by Cutting at a panel [Lest et al. 97]
– First research work published by Joachims
[Joachims 03]
– Great potential, but only a few follow-up studies
What You Should Know
• Why is the retrieval problem often framed as a
  ranking problem?
• The two assumptions of the PRP
• What is the Cranfield evaluation methodology?
• How to compute the major evaluation measures
  (precision, recall, precision-recall curve, MAP,
  gMAP, nDCG, F1, MRR, breakeven precision)
• How does “pooling” work?
• Why is it necessary to do statistical
  significance tests?
Open Challenges in IR Evaluation
• Almost all issues are still open for research!
• What are the best measures for various search tasks
  (especially newer tasks such as subtopic retrieval)?
• What’s the best way of doing statistical significance
  tests?
• What’s the best way to adopt the pooling strategy in
  practice?
• How can we assess the quality of a test collection? Can
  we create representative test sets?
• New paradigms for evaluation? Open IR systems for A-B
  tests?