Retrieval Evaluation (Chapter 3)

Introduction


Before the final implementation of an IR system, an evaluation is carried out
An IR system requires evaluation of how precise its answer set is
  this is known as retrieval performance evaluation
Such an evaluation is based on a test reference collection and on an evaluation measure

A test reference collection consists of
  a collection of documents
  a set of example information requests
  a set of relevant documents (provided by specialists) for each example information request

Example test reference collections
  TIPSTER/TREC (Section 3.3.1)
  CACM, ISI (Section 3.3.2)


Given a retrieval strategy S, the evaluation measure quantifies (for each example query) the similarity between the set of documents retrieved by S and the set of relevant documents provided by the specialists

Evaluation measures
  recall
  precision

Retrieval Performance Evaluation

Criteria to consider for retrieval performance evaluation
  nature of the query request (batch or interactive)
  nature of the setting (laboratory or real-life situation)
  nature of the interface (batch mode or interactive mode)

Recall and Precision





Consider an information request I and its set R of relevant documents
  let |R| be the number of documents in this set
Assume that the retrieval strategy being evaluated processes I and generates a document answer set A
  let |A| be the number of documents in this set
  let |Ra| be the number of documents in the intersection of sets R and A
Recall and Precision (Cont.)

Recall is the fraction of the relevant documents (set R) which has been retrieved

  recall = |Ra| / |R|

Precision is the fraction of the retrieved documents (set A) which is relevant

  precision = |Ra| / |A|
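A minimal sketch of these two measures in Python, assuming the relevant set R and the answer set A are available as sets of document ids (all names below are illustrative):

```python
def recall_precision(relevant, answer):
    """Compute recall and precision for a single query.

    relevant: set of document ids judged relevant (set R)
    answer:   set of document ids retrieved by the strategy (set A)
    """
    ra = relevant & answer  # Ra: relevant documents that were retrieved
    recall = len(ra) / len(relevant) if relevant else 0.0
    precision = len(ra) / len(answer) if answer else 0.0
    return recall, precision
```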




This assumes that all documents in set A have been examined
However, the user is not usually presented with all documents in set A at once
The documents in set A are ranked and the user examines the ranked list starting from the top
Thus, the recall and precision measures vary as the user proceeds with the examination of set A




Proper evaluation requires plotting a precision versus recall curve
Assume that the set Rq contains the relevant documents for query q
Assume Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Hence, according to the group of specialists, there are ten documents relevant to query q

Assume that a retrieval algorithm returns, for query q, a ranking of the documents in the answer set as follows

Ranking for query q (relevant documents are marked with •)

 1  d123 •
 2  d84
 3  d56 •
 4  d6
 5  d8
 6  d9 •
 7  d511
 8  d129
 9  d187
10  d25 •
11  d38
12  d48
13  d250
14  d113
15  d3 •

Document d123 (rank 1) is relevant
  precision of 100% (1 out of 1 document examined in the answer set is relevant)
  recall of 10% (1 of the 10 relevant documents in set Rq)

Document d56 (rank 3) is the next relevant document
  precision of ~66% (2 out of the 3 documents examined in the answer set are relevant)
  recall of 20% (2 of the 10 relevant documents in set Rq)
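This walk down the ranked list can be sketched as follows, assuming the ranking and the set Rq given above (the function name is illustrative):

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) observed each time a relevant document is
    found while scanning the ranked answer set from the top."""
    points, found = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
for recall, precision in precision_recall_points(ranking, Rq):
    print(f"recall {recall:.0%}  precision {precision:.1%}")
# recall 10%  precision 100.0%
# recall 20%  precision 66.7%
# recall 30%  precision 50.0%
# recall 40%  precision 40.0%
# recall 50%  precision 33.3%
```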




So far, the precision and recall figures are for a single query (see Fig. 3.2)
However, retrieval algorithms are usually evaluated over several distinct queries
To evaluate the retrieval performance of an algorithm over all test queries, the precision at each recall level is averaged

  P(r) = Σ_{i=1}^{Nq} Pi(r) / Nq

where P(r) is the average precision at recall level r, Nq is the number of queries used, and Pi(r) is the precision at recall level r for the i-th query
Since the recall levels for each query might be distinct from the 11 standard recall levels, interpolation is often necessary
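Assuming each query's precision figures have already been interpolated to the 11 standard recall levels (see the interpolation rule below), the averaging step might look like this sketch:

```python
def average_precision_at_standard_levels(per_query):
    """Average precision at the 11 standard recall levels (0%, 10%, ..., 100%).

    per_query: one list of 11 interpolated precision values per query."""
    nq = len(per_query)
    return [sum(p[j] for p in per_query) / nq for j in range(11)]
```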

Assume that the relevant document set Rq for query q is changed to

  Rq = {d3, d56, d129}

The answer set is still the set of 15 ranked documents above
Document d56 (rank 3) is the first relevant document
  recall level of 33.3% (1 of the 3 relevant documents in set Rq)
  precision of 33.3% (1 of the 3 documents examined is relevant)

Document d129 (rank 8) is the next relevant document
  recall level of 66.6% (2 of the 3 relevant documents in set Rq)
  precision of 25% (2 of the 8 documents examined is relevant)

Document d3 (rank 15) is the next relevant document
  recall level of 100% (3 of the 3 relevant documents in set Rq)
  precision of 20% (3 of the 15 documents examined is relevant)
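Re-running the precision_recall_points sketch from above with Rq = {"d3", "d56", "d129"} reproduces these three points: (33.3%, 33.3%), (66.6%, 25%) and (100%, 20%).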
Interpolation

Let rj, j ∈ {0, 1, 2, …, 10}, be a reference to the j-th standard recall level (e.g. r5 is a reference to the recall level of 50%)

  P(rj) = max_{rj ≤ r ≤ rj+1} P(r)

i.e. the interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th recall level and the (j+1)-th recall level

Interpolation (Cont.)

In our last example, the interpolation rule yields the following precision and recall figures (see Fig. 3.3)
  at recall levels 0%, 10%, 20% and 30%, the interpolated precision is 33.3% (the known precision at recall level 33.3%)
  at recall levels 40%, 50% and 60%, the interpolated precision is 25% (the known precision at recall level 66.6%)
  at recall levels 70%, 80%, 90% and 100%, the interpolated precision is 20% (the known precision at recall level 100%)
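A sketch of this interpolation, using the common convention that the interpolated precision at level rj is the maximum known precision at any recall level r ≥ rj, which reproduces the figures above (function and variable names are illustrative):

```python
def interpolate_11pt(points):
    """Interpolate observed (recall, precision) points to the 11
    standard recall levels 0.0, 0.1, ..., 1.0."""
    interpolated = []
    for j in range(11):
        rj = j / 10
        # precision values known at recall levels >= rj (small epsilon
        # guards against floating-point noise)
        candidates = [p for r, p in points if r >= rj - 1e-9]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

# (recall, precision) points for Rq = {d3, d56, d129}
points = [(1/3, 1/3), (2/3, 2/8), (3/3, 3/15)]
print([round(p, 3) for p in interpolate_11pt(points)])
# [0.333, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```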


Average precision versus recall figures are used to compare the retrieval performance of distinct retrieval algorithms
  this is the standard evaluation strategy for IR systems
Fig. 3.4 illustrates average precision versus recall figures for two distinct retrieval algorithms
  one algorithm has higher precision at lower recall levels
  the second algorithm is superior at higher recall levels
TREC Collection

Document Collection
  TREC-3: ~2 gigabytes
  TREC-6: ~5.8 gigabytes
  documents come from the Wall Street Journal, AP, ZIFF, FR, DOE, San Jose Mercury News, US Patents, Financial Times, CR, FBIS and LA Times (see Table 3.1 for TREC-6)
  an example TREC document, numbered WSJ880406-0090, is shown in Fig. 3.7


Example Information Requests (Topics)
  each request is a description of an information need in natural language
  each test information request is referred to as a topic
  the topics prepared for the first six TREC conferences number 350
  an example TREC information request, for topic number 168, is shown in Fig. 3.8
  converting a topic into a system query (e.g. a Boolean query) is done by the system itself


Relevant Documents
  at the TREC conferences, the set of relevant documents for each topic is obtained from a pool of possibly relevant documents (a sketch of this pooling step follows below)
  this pool is created by taking the top K (usually, K = 100) documents in the rankings generated by the various participating retrieval systems
  the documents in the pool are shown to human assessors, who then decide on the relevance of each document
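A minimal sketch of this pooling method, assuming each system's ranking is available as a list of document ids (all names are illustrative):

```python
def build_pool(system_rankings, k=100):
    """Pooling method: take the top-k documents from each system's
    ranking; the union forms the pool judged by the human assessors."""
    pool = set()
    for ranking in system_rankings:
        pool.update(ranking[:k])
    return pool

# The assessors' judgements over the pool then yield the relevant set
# for the topic, e.g. {doc for doc in pool if judged_relevant(doc)},
# where judged_relevant stands for the human assessment (hypothetical).
```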
