Evaluation of IR
LIS531H
Why eval?
• When designing and using a system there are decisions to be made:
  - Manual or automatic indexing?
  - Controlled vocabularies or natural language?
  - Stem or not? Use stop lists?
  - What kind of query syntax?
  - How to display the retrieved items to the user?
  - What would be best for your user population and company/library?
• Are human judgments consistent? Reliable?
• Need to demonstrate empirically that a system is good … and find ways to improve it.
How to evaluate
• Consider IR evaluation as any other
research project: need repeatable criteria
to evaluate how effective an IR system is
in meeting the information needs of the
user.
• How do we define the task that the human is attempting?
• Which criteria count as success?
How to evaluate?
• Big question: evaluate a specific component or the whole system?
  - Speed of retrieval?
  - Resources required?
  - Presentation of documents?
  - Market appeal?
• Evaluation is generally comparative: system A vs. system B
• Cost-benefit analysis
• Most commonly: retrieval effectiveness
How to evaluate
• Relevance and test collections
• Metrics:
  - Recall and precision
  - E and F
  - Swets and operating characteristics
  - Expected search length
• Case studies (known item searches)
• Significance tests
• Other issues and problems
Relevance
• A relevant doc is one judged useful in the context of a query, but …
  - Who judges? What's useful?
  - Serendipitous utility
  - Humans aren't consistent in judgments
• With real collections, one never knows the full set of relevant documents in the collection
• Retrieval models incorporate the notion of relevance
  - Probability of (relevance | query, document)
  - How similar the document is to the query
How to find all relevant docs?
• A complete collection needs judgments ("yes", "no", or some score) for every query-document pair!
• For small test collections: can review all docs for all queries
• Not practical for medium or large collections
• Other approaches:
  - Pooling, sampling, search-based
Finding relevant docs (2)
• Search-based:
  - Rather than read every doc, use manually-guided search; read retrieved docs until convinced all relevant documents have been found
• Pooling:
  - Retrieve docs using several, usually automatic, techniques; judge the top n docs for each technique. The relevant set is the union, a subset of the true relevant set
• Sampling:
  - Estimate the size of the true relevant set by sampling
• All are incomplete, so when testing…
  - How should unjudged documents be treated?
Evaluation Fundamentals
D = set of documents
A = set of documents that satisfy some user-based criterion
B = set of documents identified by the search system
Evaluation Fundamentals
recall = retrieved relevant / relevant = |A ∩ B| / |A|

precision = retrieved relevant / retrieved = |A ∩ B| / |B|

fallout = retrieved not-relevant / not-relevant = |B - A ∩ B| / |D - A|
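
As a concrete illustration of the set-based definitions above, here is a minimal Python sketch; the document sets A, B, and D are made-up examples, not data from these slides.

    # Set-based effectiveness measures.
    # A = relevant documents, B = retrieved documents, D = whole collection.
    D = set(range(1, 21))            # collection of 20 documents (illustrative)
    A = {1, 2, 3, 4, 5}              # documents satisfying the user-based criterion (relevant)
    B = {1, 2, 6, 7}                 # documents identified by the search system (retrieved)

    recall = len(A & B) / len(A)         # |A ∩ B| / |A|
    precision = len(A & B) / len(B)      # |A ∩ B| / |B|
    fallout = len(B - A) / len(D - A)    # |B - A ∩ B| / |D - A|
    print(recall, precision, fallout)    # 0.4 0.5 0.133...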
Evaluation Fundamentals
• Recall and precision depend on the concept of relevance.
• Relevance is a context- and task-dependent property of documents. [What's linguistically and philosophically wrong about this approach?]

"Relevance is the correspondence in context between an information requirement statement ... and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement."
F. W. Lancaster, 1979
Evaluation Fundamentals
• Relevancy judgments vary in reliability:
  - For text documents, users knowledgeable about a subject are usually in close agreement in deciding whether a document is relevant to an info requirement (the query)
  - Less consistency with non-textual docs, such as photos
  - Attempts to use a level of relevance, such as a Likert scale, haven't worked. [Do you believe it? Why not? Other tests?]
Effectiveness Measures - Recall & Precision
• Precision
  - Proportion of a retrieved set that's relevant
  - Precision = |relevant ∩ retrieved| ÷ |retrieved|
• Recall
  - Proportion of all relevant docs in the collection included in the retrieved set
  - Recall = |relevant ∩ retrieved| ÷ |relevant|
  - = P(retrieved | relevant)
• Precision & recall are well-defined for sets.
• For ranked retrieval:
  - compute a P/R point for each relevant document; compare values at fixed recall points (e.g., precision at 20% recall); compute values at fixed rank cutoffs (e.g., precision at rank 20)
Another common representation
• Relevant = A+C
• Retrieved = A+B
• Collection size = A+B+C+D

                     Relevant   Not relevant
    Retrieved           A            B
    Not retrieved       C            D

• Precision = A ÷ (A+B)
• Recall = A ÷ (A+C)
• Miss = C ÷ (A+C)
• False alarm (fallout) = B ÷ (B+D)
Precision & Recall example
Average precision of a query
• Often want a single-number effectiveness
measure; average precision is widely
used
• Calculate by averaging the precision values obtained at the points where recall increases (i.e., after each relevant document is retrieved)
Precision and Recall - ex. 2
Evaluation Fundamentals - History
• Cranfield Experiments - Cyril Cleverdon, Cranfield College of Aeronautics, 1957-68
• SMART System - G. Salton, Cornell, 1964-1988
• TREC - Donna Harman, National Institute of Standards and Technology (NIST), 1992
  - Various TREC tracks
Sample test collections

    Collection   Coll size (docs)   Size (Mb)   Year   Unique stems   Stem occurrences   Max within-doc freq   Mean doc length   # of queries
    Cranfield    1,400              1.5         1968   8,226          123,200            27                    88                225
    CACM         3,204              2.3         1983   5,493          11,568             27                    36.7              50
    ISI          1,460              2.2         1983   5,448          98,304             1,309                 67.3              35
    West         11,953             254         1990   196,707        21,798,833                               1,823             44
    Trec2        742,611            2,162       1991   1,040,415      243,800,000                              328               100
Sample Experiment (Cranfield)
Comparative efficiency of indexing systems:
(Universal Decimal Classification, alphabetical subject
index, a special facet classification, Uniterm system
of co-ordinate indexing)
Four indexes were prepared manually for each document, in three batches of 6,000 documents -- a total of 18,000 documents, each indexed four times. The documents were reports and papers in aeronautics.
Indexes for testing were prepared on index cards and other cards.
Very careful control of indexing procedures.
Sample Experiment (Cranfield)
Searching:
• 1,200 test questions, each satisfied by at least one document
• Reviewed by an expert panel
• Searches carried out by 3 expert librarians
• Two rounds of searching to develop the testing methodology
• Subsidiary experiments at English Electric Whetstone Laboratory and Western Reserve University
Sample Experiment (Cranfield)
The Cranfield data was made widely available and
used by other researchers
• Salton used the Cranfield data with the SMART
system (a) to study the relationship between recall
and precision, and (b) to compare automatic
indexing with human indexing
• Spärck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, definition of test corpora, etc.
Sample Experiment (Cranfield)
Cleverdon's work was applied to matching methods. He made extensive use of recall and precision, based on the concept of relevance.

[Scatter plot: precision (%) against recall (%). Each x represents one search. The graph illustrates the trade-off between precision and recall.]
Typical precision-recall graph for different queries

[Graph: precision (0-1.0) against recall (0-1.0), with two labeled curves: "Narrow, specific query (using Boolean-type queries)" and "Broad, general query".]
Some Cranfield Results
• The various manual indexing systems have
similar retrieval efficiency
• Retrieval effectiveness using automatic
indexing can be at least as effective as manual
indexing with controlled vocabularies
-> original results from the Cranfield + SMART
experiments (published in 1967)
-> considered counter-intuitive
-> other results since then have supported this
conclusion
Precision and Recall with Ranked Results
• Precision and recall are defined for a fixed set of
hits, e.g., Boolean retrieval.
• Their use needs to be modified for a ranked list of
results.
Ranked retrieval: recall and precision after retrieval of n documents

    n    relevant?   recall   precision
    1    yes         0.2      1.0
    2    yes         0.4      1.0
    3    no          0.4      0.67
    4    yes         0.6      0.75
    5    no          0.6      0.60
    6    yes         0.8      0.67
    7    no          0.8      0.57
    8    no          0.8      0.50
    9    no          0.8      0.44
    10   no          0.8      0.40
    11   no          0.8      0.36
    12   no          0.8      0.33
    13   yes         1.0      0.38
    14   no          1.0      0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
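
The per-rank values in the table follow directly from cumulative counts. A minimal Python sketch of that computation; the yes/no judgments reproduce the table above, and the total of 5 relevant documents comes from the slide.

    # Recall and precision after each retrieved document (ranked retrieval).
    judgments = [True, True, False, True, False, True,
                 False, False, False, False, False, False, True, False]
    total_relevant = 5

    relevant_so_far = 0
    for n, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            relevant_so_far += 1
        recall = relevant_so_far / total_relevant
        precision = relevant_so_far / n
        print(f"{n:2d}  recall {recall:.1f}  precision {precision:.2f}")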
Evaluation Fundamentals: Graphs

[Graph: the precision/recall points from the table above, labeled by rank (1, 2, 3, 4, 5, 6, 12, 13, ..., 200). Note: some authors plot recall against precision.]
11 Point Precision (Recall Cut Off)
p(n) is the precision at the point where recall first reaches n.
Define 11 standard recall points p(r0), p(r1), ..., p(r10), where p(rj) = p(j/10).
Note: if p(rj) is not an exact data point, use interpolation.
Recall cutoff graph: choice of interpolation points

[Graph: the same precision/recall points (ranks 1, 2, 3, 4, 5, 6, 12, 13, ..., 200) with the recall cutoff graph (the blue line in the original figure) drawn through them.]
Example 2: SMART

    Recall   Precision
    0.0      1.0
    0.1      1.0
    0.2      1.0
    0.3      1.0
    0.4      1.0
    0.5      0.75
    0.6      0.75
    0.7      0.67
    0.8      0.67
    0.9      0.38
    1.0      0.38

Precision values at recall 0.2, 0.4, 0.6, 0.8, and 1.0 are actual data points; the remaining values are obtained by interpolation (by convention equal to the next actual data value).
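
A short Python sketch of this interpolation convention, using the actual (recall, precision) data points from the example above:

    # 11-point interpolated precision: at each standard recall level, take the
    # precision of the next actual data point at equal or higher recall.
    actual = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]

    for j in range(11):
        r = j / 10
        p = next(prec for rec, prec in actual if rec >= r - 1e-9)
        print(f"recall {r:.1f}: precision {p:.2f}")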
Average precision
Average precision for a single topic is the mean of the precision obtained after each relevant document is retrieved.
Example (using the ranked list above):
p = (1.0 + 1.0 + 0.75 + 0.67 + 0.38) / 5 = 0.76
Mean average precision for a run consisting of many topics is the mean of the average precision scores for each individual topic in the run.
Definitions from TREC-8
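
A minimal Python sketch of average precision and mean average precision; the first topic's judgment list reproduces the earlier ranked-retrieval example, while the second topic is a made-up illustration.

    # Average precision (AP) for one topic and mean average precision (MAP)
    # over several topics, following the TREC-8 definitions quoted above.
    def average_precision(judgments, total_relevant):
        # judgments: True/False per rank; total_relevant: relevant docs for the topic
        hits, precisions = 0, []
        for n, rel in enumerate(judgments, start=1):
            if rel:
                hits += 1
                precisions.append(hits / n)   # precision after this relevant document
        return sum(precisions) / total_relevant

    topic1 = [True, True, False, True, False, True,
              False, False, False, False, False, False, True, False]
    topic2 = [False, True, True, False, False]          # hypothetical second topic

    ap1 = average_precision(topic1, total_relevant=5)   # 0.76
    ap2 = average_precision(topic2, total_relevant=2)   # (0.50 + 0.67) / 2 ≈ 0.58
    print(ap1, ap2, (ap1 + ap2) / 2)                    # MAP is the mean of the APs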
Normalized recall measure

[Graph: recall against the ranks of the retrieved documents (ranks 1 to 200, marked at 5, 10, 15, ..., 195, 200), showing three curves: the ideal ranks, the actual ranks, and the worst ranks.]
Normalized recall

Normalized recall = (area between actual and worst) / (area between best and worst)

After some mathematical manipulation:

    Rnorm = 1 - ( Σ ri - Σ i ) / ( n (N - n) )        (sums taken over i = 1, ..., n)

where r1, ..., rn are the ranks at which the n relevant documents are retrieved, out of N documents.
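
A small Python check of this formula, reusing the earlier example in which the 5 relevant documents appear at ranks 1, 2, 4, 6, and 13 out of N = 200:

    # Normalized recall: Rnorm = 1 - (sum of actual ranks - sum of ideal ranks) / (n(N - n))
    def normalized_recall(relevant_ranks, N):
        n = len(relevant_ranks)
        actual = sum(relevant_ranks)
        ideal = sum(range(1, n + 1))       # best case: relevant docs at ranks 1..n
        return 1 - (actual - ideal) / (n * (N - n))

    print(normalized_recall([1, 2, 4, 6, 13], N=200))   # ~0.989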
Combining Recall and Precision: Normalized Symmetric Difference

Symmetric difference: the set of elements belonging to one but not both of two given sets.

D = set of documents; A = relevant documents; B = retrieved documents

Symmetric difference, S = (A ∪ B) - (A ∩ B)

Normalized symmetric difference = |S| / (|A| + |B|)
                                = 1 - 1 / [ ½ (1/recall + 1/precision) ]
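
A quick numerical check that this normalized symmetric difference coincides with 1 minus the harmonic mean of precision and recall (the E measure at b = 1 discussed later); the example sets are made up.

    # Check: |symmetric difference| / (|A| + |B|) == 1 - 2PR / (P + R)
    A = {1, 2, 3, 4, 5}        # relevant (hypothetical)
    B = {2, 3, 6}              # retrieved (hypothetical)

    S = (A | B) - (A & B)      # symmetric difference
    nsd = len(S) / (len(A) + len(B))

    P = len(A & B) / len(B)
    R = len(A & B) / len(A)
    E = 1 - (2 * P * R) / (P + R)
    print(nsd, E)              # both 0.5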
Statistical tests
Suppose that a search is carried out on systems i and j.
System i is superior to system j if, for all test cases,
    recall(i) >= recall(j)
    precision(i) >= precision(j)
In practice, we have data from a limited number of test cases. What conclusions can we draw?
Recall-precision graph

[Graph: recall-precision curves for two systems. The red system appears better than the black, but is the difference statistically significant?]
Statistical tests
• The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to these data.
• The sign test makes no assumptions of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.
• The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes the differences are symmetrically distributed.
Test collections
• Compare retrieval performance using a test collection:
  - a set of documents, a set of queries, and a set of relevance judgments (which docs are relevant to each query)
• To compare the performance of 2 techniques:
  - Each technique is used to evaluate the test queries
  - Results (set or ranked list) are compared using some performance measure; most commonly precision and recall
• Usually multiple measures are used to get different views of performance; usually test with multiple collections (performance is collection-dependent)
Recall/Precision graphs
• Average precision hides information
• Sometimes better to show the tradeoff in a table or graph
Averaging across queries
• Hard to compare P/R graphs or tables for individual queries (too much data)
• Two main types of averaging:
  - Micro-average (each relevant doc is a point in the average)
  - Macro-average (each query is a point in the average)
• Also done with average precision values
  - Average of many queries' average precision values
  - Called mean average precision (MAP)
Averaging Graphs: a false start
• The raw plots of P/R usually require
interpolation
Interpolation
• As recall increases, precision decreases (on average!)
• One approach takes a set of recall/precision points (R', P') in S and makes a step function:

    P(R) = max { P' : R' ≥ R and (R', P') ∈ S }
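
A sketch of this step function in Python; the (recall, precision) points are the actual data points from the SMART example above, and for those data the result matches the "next actual data value" convention used earlier.

    # Interpolated precision as a step function:
    # P(R) = max{ P' : R' >= R and (R', P') in S }
    S = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]

    def interpolated_precision(R):
        candidates = [P for (Rp, P) in S if Rp >= R]
        return max(candidates) if candidates else 0.0

    print(interpolated_precision(0.5))    # 0.75
    print(interpolated_precision(0.95))   # 0.38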
Step graph
Another example
Van Rijsbergen, p. 117 (1979)
Interpolation and Averaging
Interpolated average precision
• Average precision at standard recall points
• For a given query, compute a P/R point for every relevant doc
• Interpolate precision at standard recall levels
  - 11-pt is usually 100%, 90%, ..., 10%, 0%
  - 3-pt is usually 75%, 50%, 25%
• Average over all queries to get average precision at each recall level
• Average across the interpolated recall levels to get a single result
  - "mean average precision" is common
Improvements in IR over the years
Effectiveness Measures: E & F
• E measure (van Rijsbergen)

    E = 1 - 1 / ( a (1/P) + (1 - a) (1/R) )

• Used to emphasize precision (or recall)
  - Essentially a weighted average of precision and recall
  - Large a increases the importance of precision
  - Can transform by a = 1/(b² + 1), b = P/R

    E = 1 - (b² + 1) P R / ( b² P + R )

  - When b = 1 (a = 1/2), equal importance of P and R
  - Normalized symmetric difference of retrieved and relevant sets
Symmetric Difference and E
F measure
• F = 1 - E is often used - good results mean larger values of Fb

    Fb = 1 - E = (b² + 1) P R / ( b² P + R )

• The "F1" measure is popular, especially with classification researchers: F with b = 1

    F1 = 2 P R / ( P + R )
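
A small Python sketch of these formulas; the precision and recall values are arbitrary examples.

    # E and F measures (van Rijsbergen), as defined above.
    def f_measure(P, R, b=1.0):
        return (b * b + 1) * P * R / (b * b * P + R)

    def e_measure(P, R, b=1.0):
        return 1 - f_measure(P, R, b)

    P, R = 0.75, 0.38                 # example values
    print(f_measure(P, R))            # F1 = 2PR/(P+R) ≈ 0.504
    print(f_measure(P, R, b=2.0))     # ≈ 0.422; larger b shifts the emphasis toward recall
    print(e_measure(P, R))            # ≈ 0.496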
F measure as an average
• Harmonic mean of P and R (the reciprocal of the average of their reciprocals); heavily penalizes low values of P or R

F measure, geometric interpretation
Other single-valued measures
• Swets (next)
• Expected search length
• Normalized precision or recall (measures the area between the actual and ideal curves)
• Breakeven point (the point at which precision = recall)
• Utility measures
  - Assign costs to each cell in the contingency table; sum costs over all queries
• Many others!
Swets Model, 1963
• Based on signal detection & statistical decision theory
• "Properties of a desirable measure of retrieval performance"
  - Based on the ability of the system to distinguish between wanted and unwanted items
  - Should express discrimination power independent of "any acceptance criterion" employed by the system or user!
• Characterizes the recall-fallout curve generated by variation of a control variable l, e.g., P(Q|D)
• Assume l is normally distributed on the set of relevant docs and also on the set of non-relevant docs

Probability distributions for relevant and non-relevant documents

[Figure: two normal distributions for l, one on the set of relevant docs, the other on the set of non-relevant documents. Shaded areas represent recall and fallout.]
Swets model, con't.
• Note - we're skipping a lot more about interpreting Swets. Recognize that Swets doesn't use recall/precision but recall/fallout
• From this come the DARPA-style evaluations that plot the probability of missing a document against the probability of a false alarm (false hits)
Expected Search Length Evaluation
• Evaluation is based on the type of info need:
  - Only one relevant doc is required
  - Some arbitrary number n
  - All relevant documents
  - A given proportion of the relevant documents
• Search strategy output is assumed to be a weak ordering
  - A simple ordering means never having 2+ docs at the same level of the ordering
• Search length in a single ordering is the number of non-relevant docs a user must scan before the info need is satisfied (a small sketch follows below)
• Expected search length is the appropriate measure for a weak ordering
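
A minimal Python sketch of search length for a simple (strict) ordering, following the definition above; the ranked judgments and the number of relevant documents needed are made-up examples. For a weak ordering, the expected search length averages this quantity over the possible orderings of tied documents.

    # Search length: number of non-relevant documents scanned, in rank order,
    # before the information need is satisfied.
    def search_length(judgments, needed):
        # judgments: True/False per rank; needed: relevant docs required to satisfy the need
        nonrelevant_seen, found = 0, 0
        for rel in judgments:
            if rel:
                found += 1
                if found == needed:
                    return nonrelevant_seen
            else:
                nonrelevant_seen += 1
        return None   # the need cannot be satisfied by this ranking

    ranking = [False, True, False, False, True, True]   # hypothetical judgments
    print(search_length(ranking, needed=1))   # 1
    print(search_length(ranking, needed=3))   # 3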
Expected Search Length
Evaluation Case Study: Known Item Search
• What if the purpose of the query is to find something you know exists? (e.g., a homepage)
• Effective Site Finding using Anchor Information (Craswell, Hawking, and Robertson, SIGIR 2001)
• Links turn out to be more useful than content!
• Turning this idea into a researchable problem…

Problem statement
• Site finding
  - A site is an organized collection of pages on a specific topic, maintained by a single person or group
  - Not the same as a domain (e.g., aol.com, which has numerous sites)
  - The topic can be extremely broad
• Examples (from a query log)
  - Where can I find the official Harry Potter site?
  - [Not: "Who was Cleopatra", "Where's Paris"]
• Therefore, given a query, find a URL.
Setting up for an evaluation
• Recall: need docs, queries, and relevance judgments
• The doc set is a collection of 18.5 million web pages
• Here, will create <site-finding query, URL> pairs
• Query-first selection:
  - Randomly choose a query from an Excite query log
  - Have a human find the URL that is a match
  - Must ensure that the page exists in the test collection
• Document-first selection:
  - Randomly select a web page
  - Navigate to its "entry page"
  - Create a query that matches it
  - May not be representative, though
Evaluating the Known Item
• Usually there's only 1 possible answer (the site's page)
  - So recall is either 0 or 1
  - Recall/precision graphs are not very interesting, though
• Instead, measure the rank at which the site's URL was found
  - Sometimes scored as 1/rank ("reciprocal rank")
  - When averaged over many queries, called "mean reciprocal rank" (MRR)
• Can also plot it as a graph
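
A small Python sketch of reciprocal rank and mean reciprocal rank; the ranks are made-up examples, not data from the paper.

    # Reciprocal rank (RR) and mean reciprocal rank (MRR) for known-item search.
    def reciprocal_rank(rank_of_answer):
        # rank at which the known item was found, or None if it was not found
        return 0.0 if rank_of_answer is None else 1.0 / rank_of_answer

    ranks = [1, 3, 1, None, 2]            # hypothetical per-query results
    mrr = sum(reciprocal_rank(r) for r in ranks) / len(ranks)
    print(mrr)                            # (1 + 1/3 + 1 + 0 + 1/2) / 5 ≈ 0.57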
Rank where site found
• The anchor method is vastly more likely to get the answer at top rank:
  - Anchor: MRR = .791
  - Content: MRR = .321
• Intuition:
  - On average, the anchor method finds the site at rank 1/.791 ≈ 1.26; content finds it at rank 3.1
• Precision at rank 1:
  - Anchor: 68%; content: 21%
University example
Evaluation: Significance Tests
• System A beats System B on one query…
  - Just lucky? Maybe System B is better for some things
• Systems A and B are identical on all but 1 query
  - If A beats B by enough on that one query, the average will make A look better than B
• System A is .00001% better … so what?
• Significance tests address these issues…
Significance Tests
• Are the observed differences statistically significant? We can't make assumptions about the underlying distribution
  - Most significance tests do make such assumptions
  - See linguistic corpora for distributions, but …
• The sign test or Wilcoxon signed-ranks test are typical (see the sketch below)
  - They don't require normally distributed data; the sign test answers "how often", Wilcoxon answers "how much"
• But are these differences detectable by users?
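
A minimal sketch of both tests on paired per-query scores using SciPy; the scores are fabricated, and this assumes SciPy's scipy.stats.wilcoxon and scipy.stats.binomtest are available.

    # Paired significance tests on per-query effectiveness scores (e.g., average precision).
    from scipy import stats

    system_a = [0.42, 0.31, 0.55, 0.28, 0.61, 0.47, 0.38, 0.50]
    system_b = [0.35, 0.30, 0.48, 0.33, 0.52, 0.41, 0.36, 0.44]

    # Wilcoxon signed-rank test ("how much"): uses the ranks of the paired differences.
    print("Wilcoxon p-value:", stats.wilcoxon(system_a, system_b).pvalue)

    # Sign test ("how often"): count the queries where A beats B, test against a fair coin.
    wins = sum(a > b for a, b in zip(system_a, system_b))
    ties = sum(a == b for a, b in zip(system_a, system_b))
    print("Sign test p-value:", stats.binomtest(wins, n=len(system_a) - ties, p=0.5).pvalue)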
Feedback Evaluation
• Will review relevance feedback later, too
• How should documents that have already been seen be treated?
• Rank freezing:
  - Ranks of relevant docs are fixed for subsequent iterations; compare the new ranking with the original ranking; performance can't get worse
• Residual collection:
  - All previously-seen docs (e.g., the top n) are removed from the collection; compare the reranking with the original over the remaining documents (n+1 … D)
• Both approaches are problematic:
  - Users probably want to see the good documents move to the top of the list
Feedback Eval: user perceptions
• Effectiveness measures try to capture a user's sense of quality - but might also consider:
  - Time to complete a retrieval task
  - User 'satisfaction'
  - How well users believe the system works
• An "intelligent" IR system is one that doesn't look stupid to the users
• User studies are difficult to design and expensive to conduct
• Subjects can't repeat queries
• Hard to isolate the effects of the search engine and the user interface
• Hard to control for individual performance differences
• TREC "interactive" track
Problems with evaluation
• Retrieval techniques are highly collection- and query-specific
  - A single technique must be tested on multiple collections
  - Comparisons of techniques must be on the same collection
  - Isolated tests aren't very useful; scalability matters, too
• Standard methods assume the user knows the right collection
• Usually impossible to control all variables
• Hard to separate the effects of the retrieval model and the interface when the model requires user interaction
• Good test collections are hard and expensive to produce
• Usually can't do cost-benefit analysis
• Haven't satisfied the criteria specified by Swets and others in the 1960s
… in short, the question of how to evaluate retrieval effectiveness is still wide open.