Information Retrieval Evaluation
Institute of Network Computing and Information Systems, Peking University
http://net.pku.edu.cn/~wbia, 黄连恩, [email protected]
School of Information Engineering, Peking University, 10/21/2014
IR Evaluation Methodology
Measures for a search engine
• How fast does it index?
  – number of documents/hour
  – document size
• These criteria are measurable; but the key measure is user happiness. How do we quantify it?
• Search speed
  – latency as a function of index size
  – throughput as a function of index size
• Expressiveness of the query language
  – ability to express complex information needs
  – speed on complex queries
Measuring user happiness
Issue: who is the user?
• Web engine: the user finds what they want and returns to the engine
  – can measure the rate of return users
• eCommerce site: the user finds what they want and makes a purchase
  – is it the end-user, or the eCommerce site, whose happiness we measure?
  – measure time to purchase, or the fraction of searchers who become buyers?
• Enterprise (company/govt/academic): care about "user productivity"
  – how much time do my users save when looking for information?
  – many other criteria having to do with breadth of access, secure access, etc.
Happiness: elusive to measure
Most common proxy: the relevance of search results. But how do you measure relevance?
Methodology: test collection (corpus)
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Some work uses more-than-binary judgments, but binary is the standard.
Evaluating an IR system
Note: the information need is translated into a query, and relevance is assessed relative to the information need, not the query.
E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has those words.
Evaluation Corpus
• TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
• Test collections consist of documents, queries, and relevance judgments, e.g.:
  – CACM: titles and abstracts from the Communications of the ACM from 1958-1979. Queries and relevance judgments generated by computer scientists.
  – AP: Associated Press newswire documents from 1988-1990 (from TREC disks 1-3). Queries are the title fields from TREC topics 51-150. Topics and relevance judgments generated by government information analysts.
  – GOV2: web pages crawled from websites in the .gov domain during early 2004. Queries are the title fields from TREC topics 701-850. Topics and relevance judgments generated by government analysts.
Test Collections
TREC Topic Example
Relevance Judgments
• Obtaining relevance judgments is an expensive, time-consuming process
  – who does it?
  – what are the instructions?
  – what is the level of agreement?
• TREC judgments
  – depend on the task being evaluated
  – generally binary
  – agreement good because of the "narrative"
Pooling
• Exhaustive judgments for all documents in a collection are not practical
• Pooling technique is used in TREC:
  – top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool
  – duplicates are removed
  – documents are presented in some random order to the relevance judges
• Produces a large number of relevance judgments for each query, although still incomplete
IR Evaluation Metrics
Unranked retrieval evaluation: Precision and Recall
Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

                 Retrieved    Not Retrieved
Relevant         tp           fn
Not Relevant     fp           tn

Precision P = tp / (tp + fp)
Recall R = tp / (tp + fn)
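To make the definitions concrete, here is a minimal Python sketch; the counts in the example are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of retrieved documents that are relevant.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of relevant documents that were retrieved.
    return tp / (tp + fn)

# Hypothetical counts: 6 relevant retrieved, 14 irrelevant retrieved,
# 2 relevant documents missed entirely.
print(precision(tp=6, fp=14))  # 0.3
print(recall(tp=6, fn=2))      # 0.75
```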
Accuracy
Given a query, an engine classifies each document as "Relevant" or "Irrelevant".
Accuracy of an engine: the fraction of these classifications that are correct (see the contingency table above).
Accuracy = (tp + tn) / (tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget…
Search for:
0 matching results found.
Since almost every document is irrelevant to any given query, an engine that returns nothing is almost always "correct". But people doing information retrieval want to find something, and have a certain tolerance for junk.
Precision and recall when ranked
Extending the set-based definitions to a ranked list: at each document in the ranked list, compute a P/R point (a sketch follows this list). Which of the computed values are useful?
• Consider a P/R point for each relevant document
• Consider the value only at fixed rank cutoffs
  – e.g., precision at rank 20
• Consider the value only at fixed recall points
  – e.g., precision at 20% recall
  – there may be more than one precision value at a recall point
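A minimal sketch of these per-rank P/R points, assuming each ranking is represented as a list of binary relevance flags; the example data are hypothetical.

```python
def pr_points(ranking: list[bool], total_relevant: int) -> list[tuple[float, float]]:
    # One (recall, precision) point at each relevant document in the ranking.
    points, hits = [], 0
    for rank, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevant documents at ranks 1, 3, and 6; 4 relevant docs in the collection.
print(pr_points([True, False, True, False, False, True], total_relevant=4))
# [(0.25, 1.0), (0.5, 0.667), (0.75, 0.5)]  (precision values rounded)
```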
Precision and Recall example
Average precision of a query
Often we want a single-number effectiveness measure. Average precision is widely used in IR: calculate it by averaging the precision at each point where recall increases, i.e., at each relevant document.
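A sketch using the same ranking representation as above; relevant documents that are never retrieved contribute a precision of zero.

```python
def average_precision(ranking: list[bool], total_relevant: int) -> float:
    # Sum the precision at each rank where recall increases (each relevant
    # document), then divide by the total number of relevant documents,
    # so unretrieved relevant documents count as zero.
    hits, total = 0, 0.0
    for rank, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            total += hits / rank
    return total / total_relevant

print(average_precision([True, False, True, False, False, True], 4))
# (1.0 + 0.667 + 0.5) / 4 ≈ 0.542
```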
Recall/precision graphs
Average precision vs. the P/R graph: AP hides information, while the recall/precision graph has an odd saw-tooth shape if plotted directly, and P/R graphs are hard to compare.
Precision and Recall, toward averaging
Averaging graphs: a false start
How can graphs be averaged? Different queries have different recall values. What is precision at 25% recall? We have to interpolate. But how?
Interpolation of graphs
Possible interpolation methods:
• No interpolation: not very useful
• Connect the dots: not a function
• Connect max
• Connect min
• Connect average
How do we handle 0% recall?
• Assume 0?
• Assume the best value?
• A constant start?
How to choose?
A good retrieval system has this property: on average, as recall increases, its precision decreases. This has been verified time and time again. So interpolate in a way that makes the function monotonically decreasing, e.g., moving from left to right, take the maximum of the precision values at or to the right of each recall level:

  P_interp(R) = max { P' : (R', P') ∈ S, R' ≥ R }

where S is the set of observed (R, P) points. The result is a step function.
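A sketch of this max-to-the-right interpolation over a set of observed (R, P) points; the points reuse the hypothetical example above.

```python
def interpolated_precision(points: list[tuple[float, float]], level: float) -> float:
    # Step function: the max precision over all observed points
    # whose recall is at least the requested level.
    return max((p for r, p in points if r >= level), default=0.0)

points = [(0.25, 1.0), (0.5, 0.667), (0.75, 0.5)]
print(interpolated_precision(points, 0.0))  # 1.0  (0% recall handled smoothly)
print(interpolated_precision(points, 0.3))  # 0.667
print(interpolated_precision(points, 0.9))  # 0.0  (past the last observed point)
```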
Our example, interpolated this way: monotonically decreasing, and handles 0% recall smoothly.
Averaging graphs: using interpolation
Asked: what is precision at 25% recall? Interpolate the values.
Averaging across queries
• Micro-average: each relevant document is a point used in computing the average
• Macro-average: each query is a point used in computing the average
  – the average of many queries' average precision values is called mean average precision (MAP); a sketch follows this list
  – "average average precision" sounds weird
  – the most common choice
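MAP is then just the arithmetic mean of the per-query average precision values; a trivial sketch with hypothetical AP values.

```python
def mean_average_precision(ap_per_query: list[float]) -> float:
    # Macro-average: each query contributes one point.
    return sum(ap_per_query) / len(ap_per_query)

print(mean_average_precision([0.542, 1.0, 0.25]))  # ≈ 0.597
```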
Interpolated average precision
Average precision at standard recall points: for a given query, compute a P/R point for every relevant doc, then interpolate precision at standard recall levels (sketch below):
• 11-pt is usually 100%, 90%, 80%, …, 10%, 0% (yes, 0% recall)
• 3-pt is usually 75%, 50%, 25%
Average over all queries to get the average precision at each recall level; average the interpolated recall levels to get a single result. Called "interpolated average precision". Not used much anymore; MAP ("mean average precision") is more common, but values at specific interpolated points are still commonly used.
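A sketch of the 11-point variant, reusing the max-to-the-right interpolation rule from above (same hypothetical points).

```python
def eleven_point(points: list[tuple[float, float]]) -> list[float]:
    # Interpolated precision at recall levels 0%, 10%, ..., 100%.
    def interp(level: float) -> float:
        return max((p for r, p in points if r >= level), default=0.0)
    return [interp(i / 10) for i in range(11)]

points = [(0.25, 1.0), (0.5, 0.667), (0.75, 0.5)]
print(eleven_point(points))
# [1.0, 1.0, 1.0, 0.667, 0.667, 0.667, 0.5, 0.5, 0.0, 0.0, 0.0]
```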
Interpolation and averaging
A combined measure: F
A combined measure of P and R: the F measure, a weighted harmonic mean:

  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),  where β² = (1 − α)/α

The balanced F1 measure (β = 1, i.e., α = ½) is the most common. The harmonic mean is a conservative average: it heavily penalizes low values of P or R.
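A sketch of the weighted formula; the example shows how heavily the harmonic mean punishes an unbalanced P/R pair.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # Weighted harmonic mean of precision and recall.
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

# The arithmetic mean of 0.9 and 0.1 would be 0.5; balanced F1 is only 0.18.
print(f_measure(0.9, 0.1))  # 0.18
```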
Averaging F, example
• Q-bad has 1 relevant document, retrieved at rank 1000: (R, P) = (1, 0.001), an F value of 0.2%, so AvgF = 0.2%
• Q-perfect has 10 relevant documents, retrieved at ranks 1-10: (R, P) = (.1, 1), (.2, 1), …, (1, 1), with F values of 18%, 33%, …, 100%, so AvgF = 66.2%
• Macro average: (0.2% + 66.2%) / 2 = 33.2%
• Micro average: (0.2% + 18% + 33% + … + 100%) / 11 = 60.2%
Focusing on Top Documents
• Users tend to look at only the top part of the ranked result list to find relevant documents
• Some search tasks have only one relevant document
  – e.g., navigational search, question answering
• Recall is not appropriate here
  – instead, need to measure how well the search engine does at retrieving relevant documents at very high ranks
Focusing on Top Documents
• Precision at rank R (also written P@N)
  – R typically 5, 10, 20
  – easy to compute, average, understand
  – not sensitive to rank positions below R
• Reciprocal Rank
  – the reciprocal of the rank at which the first relevant document is retrieved
  – Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries; a sketch follows this list
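A sketch of MRR, again representing each ranking as a list of binary relevance flags; the query data are hypothetical.

```python
def reciprocal_rank(ranking: list[bool]) -> float:
    # 1/rank of the first relevant document; 0 if none is retrieved.
    for rank, is_rel in enumerate(ranking, start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0

queries = [[False, True, False], [True, False], [False, False, True]]
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(mrr)  # (0.5 + 1.0 + 0.333) / 3 ≈ 0.611
```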
Discounted Cumulative Gain
• Popular measure for evaluating web search and related tasks
• Two assumptions:
  – highly relevant documents are more useful than marginally relevant documents
  – the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
Discounted Cumulative Gain
• Uses graded relevance as a measure of the usefulness, or gain, from examining a document
• Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
• Typical discount is 1/log(rank)
  – with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
Discounted Cumulative Gain
• DCG is the total gain accumulated at a particular rank p:

  DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)

• Alternative formulation:

  DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(1 + i)

  – used by some web search companies
  – emphasis on retrieving highly relevant documents
DCG Example
• 10 ranked documents judged on a 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0
• Discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
• DCG: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
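A sketch that reproduces these numbers, assuming the first DCG formulation above.

```python
import math

def dcg(rels: list[int], p: int) -> float:
    # rel_1 plus the discounted gains rel_i / log2(i) for ranks 2..p
    # (index i below is 0-based, so rank i+1 gets discount log2(i+1)).
    return rels[0] + sum(rels[i] / math.log2(i + 1) for i in range(1, p))

rels = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print([round(dcg(rels, p), 2) for p in range(1, 11)])
# [3, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```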
Normalized DCG
• DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking
  – makes averaging easier for queries with different numbers of relevant documents
NDCG Example
• Perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0
• Ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
• NDCG values (divide actual by ideal): 1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
  – NDCG ≤ 1 at any rank position
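The DCG sketch extends directly to NDCG by dividing by the ideal DCG at each rank; sorting the judged relevance values gives the perfect ranking.

```python
import math

def dcg(rels: list[int], p: int) -> float:
    return rels[0] + sum(rels[i] / math.log2(i + 1) for i in range(1, p))

actual = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
ideal = sorted(actual, reverse=True)  # the perfect ranking 3,3,3,2,2,2,1,0,0,0
print([round(dcg(actual, p) / dcg(ideal, p), 2) for p in range(1, 11)])
# [1.0, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88]
```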
Using Preferences
• Two rankings described using preferences can be compared using the Kendall tau coefficient (τ):

  τ = (P − Q) / (P + Q)

  – P is the number of preferences that agree and Q is the number that disagree (a small sketch follows)
• For preferences derived from binary relevance judgments, can use BPREF
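A trivial sketch of the coefficient; the preference counts are hypothetical.

```python
def kendall_tau(P: int, Q: int) -> float:
    # P = number of agreeing preferences, Q = number of disagreeing ones.
    return (P - Q) / (P + Q)

print(kendall_tau(P=8, Q=2))  # 0.6  (identical rankings would give 1.0)
```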
Testing and Statistics
Significance Tests
• Given the results from a number of queries, how can we conclude that ranking algorithm A is better than algorithm B?
• A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A)
  – the power of a test is the probability that the test will correctly reject the null hypothesis
  – increasing the number of queries in the experiment also increases the power of the test
One-Sided Test
• Distribution of the possible values of a test statistic, assuming the null hypothesis
• The shaded area is the region of rejection (figure omitted)
Example Experimental Results
t-Test
• Assumption is that the difference between the effectiveness values is a sample from a normal distribution
• Null hypothesis is that the mean of the distribution of differences is zero
• Test statistic is the standard paired t statistic:

  t = x̄ / (σ_x / √N)

  where x̄ is the mean of the per-query differences B − A, σ_x their standard deviation, and N the number of queries
  – for the example, the statistic is computed from the per-query differences in the results table
Sign Test
• Ignores the magnitude of the differences
• Null hypothesis for this test is that P(B > A) = P(A > B) = ½
  – i.e., the number of pairs where B is "better" than A would be the same as the number of pairs where A is "better" than B
• Test statistic is the number of pairs where B > A
• For the example data (both tests are sketched below):
  – test statistic is 7, p-value = 0.17
  – cannot reject the null hypothesis
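Both tests are available off the shelf in scipy; a sketch with hypothetical per-query effectiveness scores for systems A and B, chosen so that B wins 7 of 10 pairs as in the example.

```python
from scipy import stats

# Hypothetical per-query effectiveness (e.g., average precision) for A and B.
a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.68, 0.73, 0.85, 0.80, 0.50, 0.58, 0.75]

# Paired t-test on the per-query differences (two-sided p-value;
# halve it for the one-sided alternative when t > 0).
t, p_t = stats.ttest_rel(b, a)
print(t, p_t)

# Sign test: wins for B against a Binomial(n, 1/2) null, one-sided.
wins = sum(x > y for x, y in zip(b, a))
n = sum(x != y for x, y in zip(b, a))  # ties are dropped
print(stats.binomtest(wins, n, 0.5, alternative="greater").pvalue)  # ≈ 0.17
```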
Online Testing
• Test (or even train) using live traffic on a search engine
• Benefits:
  – real users, less biased, large amounts of test data
• Drawbacks:
  – noisy data, can degrade the user experience
• Often done on a small proportion (1-5%) of live traffic
Summary of this lecture
IR evaluation: Precision, Recall, F; interpolation; MAP and interpolated AP; P@N and P@R; DCG and NDCG; BPREF
Thank You!
Q&A
Reading Materials
[1] IIR Ch. 1, Ch. 6.2, 6.3, Ch. 8.1, 8.2, 8.3, 8.4
[2] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia: ACM Press, 1998.
Exercise 8-9 [**] In a collection of 10,000 documents, a query has 8 relevant documents in total. Below, R marks relevant and N nonrelevant documents among the top 20 ranked results returned by some system for this query; 6 of the 20 are relevant:
RRNNN NNNRN RNNNR NNNNR
a. What is the precision over the first 20 documents?
b. What is the F1 over the first 20 documents?
c. What is the interpolated precision at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Compute the MAP.

#2. Evaluation. Define a precision-recall graph as follows: for a query's result list, compute a precision/recall point at each returned document; these points form the graph. On this graph, define the breakeven point as the point where precision equals recall. Question: can such a graph have more than one breakeven point? If so, give an example; if not, prove it.