Search Results Need to be Diverse
Mark Sanderson
University of Sheffield
How to have fun while running an evaluation campaign
Mark Sanderson
University of Sheffield
Aim
• Tell you about our test collection work in Sheffield
• How we’ve been having fun building test collections
Organising this is hard
• TREC
• Donna, Ellen
• CLEF
• Carol
• NTCIR
• Noriko
• Make sure you enjoy it
ImageCLEF
• Cross language image retrieval
• Running for 6 years
• Photo
• Medical
• And other tasks
• Imageclef.org
How do we do it?
• Organise and conduct research
• ImageCLEFPhoto 2008
• Study diversity in search results
• Diversity?
SIGIR
ACL
Mark Sanderson
Cranfield model
Operational search engine
• Ambiguous queries
• What is the correct interpretation?
• We don’t know
• So serve as diverse a range of results as possible
Diversity is studied
• Carbonell, J. and Goldstein, J. (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR, 335-336.
• Zhai, C. (2002) Risk Minimization and Language Modeling in Text Retrieval, PhD thesis, Carnegie Mellon University.
• Chen, H. and Karger, D. R. (2006) Less is more: probabilistic models for retrieving fewer relevant documents. In ACM SIGIR, 429-436.
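As a rough illustration of the MMR-style diversity re-ranking cited above (Carbonell and Goldstein, 1998), a minimal sketch; the query_sim and doc_sim similarity functions and the lambda value are illustrative assumptions, not the paper’s exact formulation.

def mmr_rerank(candidates, query_sim, doc_sim, lam=0.7, k=20):
    # Greedy MMR: trade off relevance to the query against
    # redundancy with documents already selected.
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(d):
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim(d) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected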
Cluster hypothesis
• “closely associated documents tend to be relevant to the same requests”
• van Rijsbergen (1979)
Most test collections
• Focussed topic
• Relevance judgments
• Who says what is relevant?
• (almost always) one person
• Consideration of interpretations
• Little or none
• A gap between test collections and operational search
Few test collections
• Hersh, W. R. and Over, P. (1999) TREC-8 Interactive Track Report. In TREC-8.
• Over, P. (1997) TREC-5 Interactive Track Report. In TREC-5, 29-56.
• Clarke, C. L., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., and MacKinnon, I. (2008) Novelty and diversity in information retrieval evaluation. In ACM SIGIR.
Study diversity
• What sorts of diversity are there?
• Ambiguous query words
• How often is it a feature of search?
• How often are queries ambiguous?
• How can we add it into test collections?
Extent of diversity?
• “Ambiguous queries: test collections need more sense”, SIGIR 2008
• How do you define ambiguity?
• Wikipedia
• WordNet
Disambiguation page
Wikipedia stats
• enwiki-20071018-pages-articles.xml
• (12.7 GB)
• Disambiguation pages are easy to spot (see the sketch below)
• “_(disambiguation)” in the title, e.g. Chicago
• “{{disambig}}” template, e.g. George_bush
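A rough sketch of spotting those pages while streaming the dump; the MediaWiki export namespace string is an assumption (it varies by dump version), and the two markers are the ones named above.

import xml.etree.ElementTree as ET

# Namespace of the MediaWiki XML export; adjust to match the dump version.
NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def disambiguation_titles(dump_path):
    # Stream the large dump and yield titles of disambiguation pages.
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag != NS + "page":
            continue
        title = elem.findtext(NS + "title") or ""
        text = elem.findtext(NS + "revision/" + NS + "text") or ""
        if "(disambiguation)" in title or "{{disambig}}" in text.lower():
            yield title
        elem.clear()  # keep memory bounded while streaming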
Conventional source
• Downloaded WordNet v3.0
• 88K words
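A word can be treated as ambiguous when WordNet lists more than one synset for it; a minimal check, assuming NLTK’s WordNet interface (the slide only says WordNet v3.0 was downloaded).

from nltk.corpus import wordnet as wn

def is_ambiguous(word):
    # Ambiguous if WordNet holds more than one sense (synset) for the word.
    return len(wn.synsets(word)) > 1

print(is_ambiguous("bank"))  # True: river bank, financial institution, ...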
Query logs
Log | Unique queries (all) | Most frequent (fr) | Year(s) gathered
Web | 1,000,000            | 8,719              | 2006
PA  | 507,914              | 14,541             | 2006-7
Fraction of ambiguous queries (Wi = Wikipedia, WN = WordNet)
      |      Web      |      PA
Name  | freq  | all   | freq  | all
Wi    | 7.6%  | 2.5%  | 10.5% | 2.1%
WN    | 4.0%  | 0.8%  | 6.4%  | 0.8%
WN+Wi | 10.0% | 3.0%  | 14.7% | 2.7%
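The fractions above could be computed along these lines; the exact matching rule (whole query, lowercased, against each sense inventory) is an assumption for illustration.

def fraction_ambiguous(queries, ambiguous_terms):
    # Share of queries that match a term the sense inventory marks as ambiguous.
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if q.strip().lower() in ambiguous_terms)
    return hits / len(queries)

# wiki_terms would come from the disambiguation titles, wn_terms from the
# WordNet check, and their union would give the WN+Wi row of the table.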
Conclusions
• Ambiguity is a problem
• Ambiguity is present in query logs
• Not just Web search
• Where ambiguity is present, IR systems need to produce diverse results
Test collections
• Don’t test for diversity
• Do search systems deal with it?
ImageCLEFPhoto
• Build a test collection
• Encourage the study of diversity
• Study how others deal with diversity
• Have some fun
Collection
• IAPR TC-12
• 20,000 travel photographs
• Text captions
• 60 existing topics
• Used in two previous studies
• 39 used for diversity study
Diversity needs in topic
• “Images of typical Australian animals”
Types of diversity
• 22 geographical
• “Churches in Brazil”
• 17 other
• “Australian animals”
Relevance judgments
• Clustered existing qrels
• Multiple assessors
• Good level of agreement on clusters
Evaluation
• Precision at 20
• P(20)
• Fraction of the top 20 results that are relevant
• Cluster recall at 20
• CR(20)
• Fraction of the topic’s clusters represented in the top 20
• Both measures are sketched below
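A minimal sketch of the two measures, assuming a ranking is a list of document ids, relevant is the set of relevant ids, and doc_cluster maps each relevant document to its cluster.

def precision_at(ranking, relevant, k=20):
    # P(k): fraction of the top-k results that are relevant.
    return sum(1 for d in ranking[:k] if d in relevant) / k

def cluster_recall_at(ranking, doc_cluster, k=20):
    # CR(k): fraction of the topic's clusters represented in the top-k results.
    clusters = set(doc_cluster.values())
    found = {doc_cluster[d] for d in ranking[:k] if d in doc_cluster}
    return len(found) / len(clusters) if clusters else 0.0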
Track was popular
• 24 groups
• 200 runs in total
Submitted runs
[Scatter plot of the submitted runs: CR(20) on the y-axis (0.0-0.6) against P(20) on the x-axis (0.0-0.6)]
Compare with past years
• Same 39 topics used in 2006, 2007
• But without clustering
• Compare cluster recall on past runs
• Based on identical P(20)
• Cluster recall increased
• Substantially
• Significantly
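One plausible way to back the “significantly” bullet, assuming per-topic CR(20) scores are available for matched pairs of runs with identical P(20); the paired t-test is an illustrative choice, the slides do not say which test was used.

from scipy import stats

def compare_cluster_recall(cr_2008, cr_past):
    # Paired test over per-topic CR(20) scores for matched runs.
    t, p = stats.ttest_rel(cr_2008, cr_past)
    return t, p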
Meta-analysis
• This was fun
• We experimented on participants’ outputs
• Not by design
• Lucky accident
Not first to think of this
• Buckley and Voorhees
• SIGIR 2000, 2002
• Use submitted runs to generate new research
Conduct user experiment
• Do users prefer diversity?
• Experiment
• Build a system to do this
• Show users
• Your system
• A baseline system
• Measure users’ preferences
Why bother…
• …when others have done the work for you
• Pair up randomly sampled runs
• High CR(20)
• Low CR(20)
• Show to users
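A small sketch of how the result pairs might be sampled from the submitted runs; the per-topic CR(20) lookup and the high/low threshold are assumptions for illustration.

import random

def sample_pairs(runs_cr20, topic, n_pairs=10, threshold=0.4, seed=0):
    # Pair one high-CR(20) run with one low-CR(20) run for the same topic.
    rng = random.Random(seed)
    high = [r for r, scores in runs_cr20.items() if scores[topic] >= threshold]
    low = [r for r, scores in runs_cr20.items() if scores[topic] < threshold]
    return [(rng.choice(high), rng.choice(low))
            for _ in range(min(n_pairs, len(high), len(low)))]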
Animals swimming
Numbers
• 25 topics
• 31 users
• 775 result pairs compared
User preferences
• 54.6% preferred the more diversified results
• 19.7% preferred the less diversified results
• 17.4% judged the two equal
• 8.3% preferred neither
Conclusions
• Diversity appears to be important
• Systems don’t do diversity by default
• Users prefer diverse results
• Test collections don’t support diversity
• But can be adapted
and
• Organising evaluation campaigns is rewarding
• And can generate novel research