Retrieval Evaluation (Chapter 3)
Introduction
Before the final implementation of an IR system, an evaluation is carried out
An IR system requires an evaluation of how precise its answer set is
this is retrieval performance evaluation
Such an evaluation is based on a test reference collection and on evaluation measures
A test reference collection consists of
a collection of documents
a set of example information requests
a set of relevant documents (provided by specialists) for each example information request
Example test reference collection
TIPSTER/TREC (Section 3.3.1)
CACM, ISI (Section 3.3.2)
Given a retrieval strategy S, the evaluation measure quantifies (for each example query)
the similarity between the set of documents retrieved by S and the set of relevant documents provided by the specialists
Evaluation measures
recall
precision
Retrieval Performance Evaluation
Criteria to consider for retrieval performance evaluation
nature of query request (batch or
interactive)
nature of setting (laboratory or real life
situation)
nature of interface (batch mode or
interactive mode)
Recall and Precision
Consider information request I and its
set R of relevant documents
let |R| be no. of documents in set
assume that given retrieval strategy
(being evaluated) processes I and
generates document answer set A
let |A| be no. of documents in set
let |Ra| be no. of documents in
intersection of sets R and A
Recall and Precision (Cont.)
Recall is the fraction of the relevant documents (set R) which has been retrieved
recall = |Ra| / |R|
Precision is the fraction of the retrieved documents (set A) which is relevant
precision = |Ra| / |A|
This assumes that all documents in set A have been examined
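The two fractions above can be sketched directly with set operations; the document identifiers below are hypothetical, chosen only to illustrate the computation.

```python
# Recall and precision for a single query, assuming the whole answer
# set A has been examined. The set names R, A, and Ra follow the text.
def recall_precision(relevant, answer):
    """relevant = set R, answer = set A; returns (recall, precision)."""
    ra = relevant & answer              # Ra = intersection of R and A
    recall = len(ra) / len(relevant)    # |Ra| / |R|
    precision = len(ra) / len(answer)   # |Ra| / |A|
    return recall, precision

R = {"d3", "d5", "d9"}                  # hypothetical relevant set
A = {"d3", "d9", "d40", "d77"}          # hypothetical answer set
print(recall_precision(R, A))           # Ra = {d3, d9}: recall 2/3, precision 1/2
```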
However, the user is not usually presented with all documents in set A at once
the documents in A are ranked, and the user examines the ranked list starting from the top
Thus, the recall and precision measures vary as the user proceeds with the examination of set A
Proper evaluation requires plotting a precision versus recall curve
Assume a set Rq containing the relevant documents for a query q
Assume Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Hence, according to the group of specialists, there are ten documents relevant to query q
Assume that a retrieval algorithm returns, for query q, a ranking of the documents in the answer set as follows
Ranking for query q
1 d123
2 d84
3 d56
4 d6
5 d8
6 d9
7 d511
8 d129
9 d187
10 d25
11 d38
12 d48
13 d250
14 d113
15 d3
Document d123 (rank 1) is relevant
precision of 100% (1 out of 1 document examined in the answer set is relevant)
10% recall (1 of the 10 relevant documents in set Rq)
Document d56 (rank 3) is the next relevant one
precision of ~66% (2 out of 3 documents examined in the answer set are relevant)
20% recall (2 of the 10 relevant documents in set Rq)
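Walking down the ranked list and recording (recall, precision) at each relevant document can be sketched as follows, using the example Rq and ranking given above:

```python
# (recall, precision) pairs observed as the user proceeds down the
# ranked answer list; a pair is recorded at each relevant document.
def rp_points(relevant, ranking):
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    return points

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
for r, p in rp_points(Rq, ranking):
    print(f"recall {r:.0%}  precision {p:.1%}")
```

The first two pairs printed are (10%, 100%) and (20%, 66.7%), matching the walkthrough above.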
So far, the precision and recall figures are for a single query (see Fig. 3.2)
However, retrieval algorithms are usually evaluated over several distinct queries
To evaluate the retrieval performance of an algorithm over all test queries, the precision at each recall level is averaged
P(r) = Σ_{i=1}^{Nq} Pi(r) / Nq
where P(r) is the average precision at recall level r, Nq is the number of queries used, and Pi(r) is the precision at recall level r for the i-th query
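The averaging step above can be sketched as a one-line mean over the per-query precisions; the Pi(r) values below are hypothetical.

```python
# Average precision over Nq queries at one recall level r:
# P(r) = sum_i Pi(r) / Nq
def average_precision(per_query_precision):
    """per_query_precision: list of Pi(r) values, one entry per query."""
    return sum(per_query_precision) / len(per_query_precision)

# hypothetical precisions of three queries at recall level r = 50%
print(average_precision([0.6, 0.4, 0.5]))  # → 0.5
```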
since the recall levels for each query are distinct from the 11 standard recall levels, interpolation is often necessary
Assume that the relevant document set Rq for query q is changed to
Rq = {d3, d56, d129}
The answer set is still the set of 15 ranked documents
Document d56 (rank 3) is the first relevant one
recall level of 33.3% (1 of the 3 relevant docs in set Rq)
precision is 33.3% (1 of the 3 docs examined is relevant)
Document d129 (rank 8) is the next relevant doc
recall level of 66.6% (2 of the 3 relevant docs in set Rq)
precision is 25% (2 of the 8 docs examined are relevant)
Document d3 (rank 15) is the next relevant doc
recall level of 100% (3 of the 3 relevant docs in set Rq)
precision is 20% (3 of the 15 docs examined are relevant)
Interpolation
Let rj, j ∈ {0, 1, 2, …, 10}, be a reference to the j-th standard recall level (e.g., r5 is a reference to the recall level 50%)
P(rj) = max_{rj ≤ r ≤ rj+1} P(r)
the interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and the (j+1)-th standard recall levels
Interpolation (Cont.)
In our last example, the interpolation rule yields the following precision and recall figures (see Fig. 3.3)
at recall levels 0%, 10%, 20%, and 30%, the interpolated precision is 33.3% (the known precision at recall level 33.3%)
at recall levels 40%, 50%, and 60%, the interpolated precision is 25% (the known precision at recall level 66.6%)
at recall levels 70%, 80%, 90%, and 100%, the interpolated precision is 20% (the known precision at recall level 100%)
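A common way to implement this interpolation, which reproduces the figures above, takes at each standard level the maximum known precision at any recall greater than or equal to that level (the usual "ceiling" reading of the rule); the (recall, precision) points below come from the Rq = {d3, d56, d129} example.

```python
# Interpolated precision at the 11 standard recall levels, taking at
# each level the maximum known precision at any recall >= that level.
def interpolate(points):
    """points: list of (recall, precision) pairs for one query."""
    out = []
    for j in range(11):
        rj = j / 10                       # standard level: 0.0, 0.1, ..., 1.0
        candidates = [p for r, p in points if r >= rj]
        out.append(max(candidates) if candidates else 0.0)
    return out

points = [(1/3, 1/3), (2/3, 2/8), (3/3, 3/15)]    # from the example
print([round(p, 3) for p in interpolate(points)])
# levels 0%-30% → 0.333, levels 40%-60% → 0.25, levels 70%-100% → 0.2
```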
Average precision versus recall figures are used to compare the retrieval performance of distinct retrieval algorithms
this is the standard evaluation strategy for IR systems
Fig. 3.4 illustrates average precision versus recall figures for two distinct retrieval algorithms
one algorithm has higher precision at lower recall levels
the second algorithm is superior at higher recall levels
TREC Collection
Document Collection
TREC-3: ~2 gigabytes
TREC-6: ~5.8 gigabytes
the documents come from the Wall Street Journal, AP, ZIFF, FR, DOE, San Jose Mercury News, US Patents, Financial Times, CR, FBIS, and LA Times (see Table 3.1 for TREC-6)
an example TREC document, numbered WSJ880406-0090, is shown in Fig. 3.7
Example Information Requests (Topics)
each request is a description of an information need in natural language
each test information request is referred to as a topic
the topics prepared for the first six TREC conferences number 350 in total
an example TREC information request, for topic number 168, is shown in Fig. 3.8
converting a topic into a system query (e.g., a Boolean query) is done by the system itself
Relevant Documents
at the TREC conferences, the set of relevant documents for each topic is obtained from a pool of possibly relevant documents
this pool is created by taking the top K (usually, K = 100) documents in the rankings generated by the various retrieval systems
the documents in the pool are then shown to human assessors, who decide on the relevance of each document
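The pooling step can be sketched as the union of the top-K documents from each system's ranking; the system rankings below are hypothetical, and the resulting pool is what would be handed to the human assessors.

```python
# TREC-style pooling: union of the top-K documents from each
# participating system's ranking forms the assessment pool.
def build_pool(rankings, k=100):
    """rankings: list of ranked document-id lists, one per system."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])        # top K from this system
    return pool

system_a = ["d1", "d7", "d3", "d9"]     # hypothetical rankings
system_b = ["d7", "d2", "d1", "d8"]
print(sorted(build_pool([system_a, system_b], k=3)))  # union of top 3 each
```

Duplicates across systems collapse in the pool, which is why pooling stays tractable even with many participating systems.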