
Is Relevance Associated with Successful Use of Information Retrieval Systems?

William Hersh
Professor and Head
Division of Medical Informatics & Outcomes Research
Oregon Health & Science University
[email protected]
Goal of talk

- Answer the question of whether relevance-based evaluation measures are associated with successful use of information retrieval (IR) systems
  - By describing two sets of experiments in different subject domains
- Since the focus of the talk is on one question assessed in different studies, I will necessarily provide only partial details of the studies
For more information on these studies…

- Hersh W et al. Challenging conventional assumptions of information retrieval with real users: Boolean searching and batch retrieval evaluations. Information Processing & Management, 2001, 37: 383-402.
- Hersh W et al. Further analysis of whether batch and user evaluations give the same results with a question-answering task. Proceedings of TREC-9, Gaithersburg, MD, 2000, 407-416.
- Hersh W et al. Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association, 2002, 9: 283-293.
Outline of talk

- Information retrieval system evaluation
  - Text REtrieval Conference (TREC)
  - Medical IR
- Methods and results of experiments
  - TREC Interactive Track
  - Medical searching
- Implications
Information retrieval system evaluation
Evaluation of IR systems

- Important not only to researchers but also to users, so we can
  - Understand how to build better systems
  - Determine better ways to teach those who use them
  - Cut through the hype of those promoting them
- There are a number of classifications of evaluation, each with a different focus
Lancaster and Warner (Information Retrieval Today, 1993)

- Effectiveness
  - e.g., cost, time, quality
- Cost-effectiveness
  - e.g., per relevant citation, new citation, document
- Cost-benefit
  - e.g., per benefit to user
Hersh and Hickam (JAMA, 1998)

- Was the system used?
- What was it used for?
- Were users satisfied?
- How well was the system used?
- Why did the system not perform well?
- Did the system have an impact?
Most research has focused on relevance-based measures

- Measure quantities of relevant documents retrieved
- Most common measures of IR evaluation in published research
- Assumptions commonly applied in experimental settings
  - Documents are relevant or not relevant to the user's information need
  - Relevance is fixed across individuals and time
Recall and precision defined

- Recall: R = (# retrieved and relevant documents) / (# relevant documents in collection)
- Precision: P = (# retrieved and relevant documents) / (# retrieved documents)
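To make the definitions concrete, here is a minimal Python sketch (not from the original slides) that computes both measures from a retrieved set and a set of relevance judgments; the document IDs are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

# Hypothetical example: 10 relevant documents exist, 5 are retrieved,
# 4 of the retrieved documents are relevant -> recall 0.4, precision 0.8.
relevant_docs = [f"d{i}" for i in range(10)]
retrieved_docs = ["d0", "d1", "d2", "d3", "d42"]
print(recall(retrieved_docs, relevant_docs), precision(retrieved_docs, relevant_docs))
```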
Some issues with relevance-based measures

- Some IR systems return retrieval sets of vastly different sizes, which can be problematic for "point" measures
- Sometimes it is unclear what a "retrieved document" is
  - Surrogate vs. actual document
  - Users often perform multiple searches on a topic, with changing needs over time
- There are differing definitions of what is a "relevant document"
What is a relevant document?

- Relevance is intuitive yet hard to define (Saracevic, various)
- Relevance is not necessarily fixed
  - Changes across people and time
- Two broad views
  - Topical – document is on topic
  - Situational – document is useful to the user in a specific situation (aka psychological relevance; Harter, JASIS, 1992)
Other limitations of recall and precision

- Magnitude of a "clinically significant" difference is unknown
- Serendipity – sometimes we learn from information not relevant to the need at hand
- External validity of results – many experiments test in "batch" mode without real users; it is not clear that results translate to real searchers
Alternatives to recall and precision

- "Task-oriented" approaches that measure how well the user performs an information task with the system
- "Outcomes" approaches that determine whether the system leads to a better outcome or a surrogate for an outcome
- Qualitative approaches to assessing the user's cognitive state as they interact with the system
Text REtrieval Conference (TREC)

- Organized by the National Institute of Standards and Technology (NIST)
- Annual cycle consisting of
  - Distribution of test collections and queries to participants
  - Determination of relevance judgments and results
  - Annual conference for participants at NIST (each fall)
- TREC-1 began in 1992 and the conference has continued annually
- Web site: trec.nist.gov
TREC goals

- Assess many different approaches to IR with a common large test collection, set of real-world queries, and relevance judgments
- Provide a forum for academic and industrial researchers to share results and experiences
Organization of TREC

- Began with two major tasks
  - Ad hoc retrieval – standard searching
    - Discontinued with TREC 2001
  - Routing – identify new documents with queries developed for known relevant ones
    - In some ways, a variant of relevance feedback
    - Discontinued with TREC-7
- Has evolved to a number of tracks
  - Interactive, natural language processing, spoken documents, cross-language, filtering, Web, etc.
What has been learned in TREC?

- Approaches that improve performance
  - e.g., passage retrieval, query expansion, 2-Poisson weighting
- Approaches that may not improve performance
  - e.g., natural language processing, stop words, stemming
- Do these kinds of experiments really matter?
  - Criticisms of batch-mode evaluation from Swanson, Meadow, Saracevic, Hersh, Blair, etc.
  - Results that question their findings from the Interactive Track, e.g., Hersh, Belkin, Wu & Wilkinson, etc.
The TREC Interactive Track

- Developed out of interest in how real users might search using TREC queries, documents, etc.
- TREC 6-8 (1997-1999) used an instance recall task
- TREC 9 (2000) and subsequent years used a question-answering task
- Now being folded into the Web track
TREC-8 Interactive Track

- Task for searcher: retrieve instances of a topic in a query
- Performance measured by instance recall (see the sketch after this slide)
  - Proportion of all instances retrieved by the user
  - Differs from document recall in that multiple documents on the same instance count only once
- Used
  - Financial Times collection (1991-1994)
  - Queries derived from the ad hoc collection
  - Six 20-minute topics for each user
  - Balanced design: "experimental" vs. "control"
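As a companion to the instance recall definition above, here is a small Python sketch of how the measure could be computed from the documents a searcher saved and a mapping of documents to the instances they cover. The function and identifiers are illustrative assumptions, not the official TREC evaluation code.

```python
def instance_recall(saved_docs, doc_to_instances, all_instances):
    """Instance recall: proportion of ALL instances for a topic that are
    covered by the documents a searcher saved. Multiple documents covering
    the same instance count only once."""
    covered = set()
    for doc in saved_docs:
        covered.update(doc_to_instances.get(doc, set()))
    return len(covered & set(all_instances)) / len(all_instances)

# Hypothetical topic with 4 instances; the searcher saves 3 documents that
# together cover 2 distinct instances -> instance recall = 0.5.
doc_to_instances = {"FT941-001": {"i1"}, "FT941-002": {"i1", "i2"}, "FT942-777": set()}
print(instance_recall(["FT941-001", "FT941-002", "FT942-777"],
                      doc_to_instances, ["i1", "i2", "i3", "i4"]))
```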
TREC-8 sample topic

- Title
  - Hubble Telescope Achievements
- Description
  - Identify positive accomplishments of the Hubble telescope since it was launched in 1991
- Instances
  - In the time allotted, please find as many DIFFERENT positive accomplishments of the sort described above as you can
TREC-9 Interactive Track

- Same general experimental design with
  - A new task
    - Question-answering
  - A new collection
    - Newswire from TREC disks 1-5
  - New topics
    - Eight questions
Issues in medical IR

- Searching priorities vary by setting
  - In a busy clinical environment, users usually want a quick, short answer
  - Outside the clinical environment, users may be willing to explore in more detail
  - As in other scientific fields, researchers are likely to want more exhaustive information
- The clinical searching task has many similarities to the Interactive Track design, so methods are comparable
Some results of medical IR evaluations (Hersh, 2003)

- In large bibliographic databases (e.g., MEDLINE), recall and precision are comparable to those seen in other domains (e.g., around 50% each, with minimal overlap across searchers)
- Bibliographic databases are not amenable to the busy clinical setting, i.e., not used often, and the information retrieved is not preferred
- The biggest challenges are now in the digital library realm, i.e., interoperability of disparate resources
Methods and results

Research question: Is relevance associated with successful use of information retrieval systems?
TREC Interactive Track and our research question

- Do the results of batch IR studies correspond to those obtained with real users?
  - i.e., do term weighting approaches that work better in batch studies do better for real users?
Methodology

- Identify a prior test collection that shows a large batch performance differential over some baseline
- Use the Interactive Track to see if this difference is maintained with interactive searching and a new collection
- Verify that the previous batch difference is maintained with the new collection
TREC-8 experiments

- Determine the best-performing measure
  - Use instance recall data from previous years as a batch test collection, with relevance defined as documents containing >1 instance
- Perform user experiments
  - TREC-8 Interactive Track protocol
- Verify that the optimal measure holds
  - Use TREC-8 instance recall data as a batch test collection, similar to the first experiment
IR system used for our TREC-8 (and 9) experiments

- MG
  - Public domain IR research system
  - Described in Witten et al., Managing Gigabytes, 1999
  - Experimental version implements all "modern" weighting schemes (e.g., TFIDF, Okapi, pivoted normalization) via Q-expressions; cf. Zobel and Moffat, SIGIR Forum, 1998
  - Simple Web-based front end
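For readers unfamiliar with the weighting families being compared, the Python sketch below shows textbook forms of a TFIDF term weight and the Okapi BM25 term weight. It is only an illustration of the general ideas under common default parameters; MG's Q-expressions parameterize many variants, and this simplified code is not the exact formulation used in these experiments.

```python
import math

def tfidf_score(tf, df, n_docs, doc_len):
    """Textbook TF*IDF term weight with crude length normalization.
    Illustrative only -- Q-expressions cover many TFIDF variants."""
    idf = math.log(n_docs / df)
    return (tf * idf) / doc_len

def okapi_bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Classic Okapi BM25 term weight; k1 and b are the usual tuning
    constants, shown here with commonly used default values."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# Hypothetical term statistics: tf=3 in a 100-word document, df=50,
# collection of 10,000 documents with average length 120 words.
print(tfidf_score(3, 50, 10_000, 100))
print(okapi_bm25_score(3, 50, 10_000, 100, 120))
```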
Experiment 1 – Determine best "batch" performance

MG Q-expression      Common name             Average precision   % improvement
BB-ACB-BAA           TFIDF                   0.2129              –
BD-ACI-BCA (0.5)     Pivoted normalization   0.2853              34%
BB-ACM-BCB (0.275)   Pivoted normalization   0.2821              33%
AB-BFC-BAA           Okapi                   0.3612              70%
AB-BFD-BAA           Okapi                   0.3850              81%

Okapi term weighting performs much better than TFIDF.
Experiment 2 – Did the benefit occur with the interactive task?

- Methods
  - Two user populations
    - Professional librarians and graduate students
  - Using a simple natural language interface
    - MG system with Web front end
  - With two different term weighting schemes
    - TFIDF (baseline) vs. Okapi

User interface
Results showed benefit for the better batch system (Okapi)

Weighting approach   Instance recall
TFIDF                0.33
Okapi                0.39

+18%, BUT...
All differences were due to one query

[Bar chart: instance recall (0.0–1.0) by topic (408i, 414i, 428i, 431i, 438i, 446i) for Okapi vs. TFIDF searchers, annotated with the per-topic Okapi batch benefit shown in the table on the next slide]
Experiment 3 – Did batch results hold with TREC-8 data?

Query     Instances   Relevant documents   TFIDF    Okapi    % improvement
408i      24          71                   0.5873   0.6272   6.8%
414i      12          16                   0.2053   0.2848   38.7%
428i      26          40                   0.0546   0.2285   318.5%
431i      40          161                  0.4689   0.5688   21.3%
438i      56          206                  0.2862   0.2124   -25.8%
446i      16          58                   0.0495   0.0215   -56.6%
Average   29          92                   0.2753   0.3239   17.6%

Yes, but still with high variance and without statistical significance.
TREC-9 Interactive Track experiments

- Similar to the approach used in TREC-8
- Determine the best-performing weighting measure
  - Use all previous TREC data, since there was no baseline
- Perform user experiments
  - Follow the protocol of the track
  - Use MG
- Verify that the optimal measure holds
  - Use TREC-9 relevance data as a batch test collection, analogous to the first experiment
Determine best "batch" performance

Query set     Collection           Cosine   Okapi (% improvement)   Okapi+PN (% improvement)
303i-446i     FT91-94              0.2281   0.3753 (+65)            0.3268 (+43)
051-200       Disks 1&2            0.1139   0.1063 (-7)             0.1682 (+48)
202-250       Disks 2&3            0.1033   0.1153 (+12)            0.1498 (+45)
351-450       Disks 4&5 minus CR   0.1293   0.1771 (+37)            0.1825 (+41)
001qa-200qa   Disks 4&5 minus CR   0.0360   0.0657 (+83)            0.0760 (+111)
Average improvement                         (+38)                   (+58)

Okapi+PN term weighting performs better than TFIDF.
Interactive experiments – comparing systems

                     TFIDF                             Okapi+PN
Question   Searches   #Correct   %Correct   Searches   #Correct   %Correct
1          13         3          23.1%      12         1          8.3%
2          11         0          0.0%       14         5          35.7%
3          13         0          0.0%       12         0          0.0%
4          12         7          58.3%      13         8          61.5%
5          12         9          75.0%      13         11         84.6%
6          15         13         86.7%      10         6          60.0%
7          13         11         84.6%      12         10         83.3%
8          11         0          0.0%       14         0          0.0%
Total      100        43         43.0%      100        41         41.0%

Little difference across systems but note wide differences across questions.
Do batch results hold with new data?

Question   TFIDF    Okapi+PN   % improvement
1          0.1352   0.0635     -53.0%
2          0.0508   0.0605     19.1%
3          0.1557   0.3000     92.7%
4          0.1515   0.1778     17.4%
5          0.5167   0.6823     32.0%
6          0.7576   1.0000     32.0%
7          0.3860   0.5425     40.5%
8          0.0034   0.0088     158.8%
Mean       0.2696   0.3544     31.5%

Batch results show improved performance whereas user results do not.
Further analysis (Turpin, SIGIR 2001)

- Okapi searches definitely retrieve more relevant documents
  - Okapi+PN user searches have 62% better MAP
  - Okapi+PN user searches have 101% better precision@5 documents
- But
  - Users do 26% more cycles with TFIDF
  - Users get overall the same results per the experiments
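For reference, MAP (mean average precision) and precision at 5 documents, the two batch measures cited above, can be computed from ranked result lists as in this illustrative Python sketch; the document IDs are hypothetical and this is not the official TREC evaluation software.

```python
def precision_at_k(ranked_docs, relevant, k=5):
    """Precision among the top-k documents of a single ranked list."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

def average_precision(ranked_docs, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical two-query run.
runs = [(["d1", "d9", "d2"], {"d1", "d2"}), (["d7", "d3"], {"d3"})]
print(mean_average_precision(runs))
print(precision_at_k(["d1", "d9", "d2"], {"d1", "d2"}, k=5))
```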
Possible explanations for our TREC Interactive Track results

- Batch searching results may not generalize
  - User data show a wide variety of differences (e.g., search terms, documents viewed) which may overwhelm system measures
- Or we cannot detect that they do
  - Increase task, query, or system diversity
  - Increase statistical power
Medical IR study design

- Orientation to experiment and system
- Brief training in searching and evidence-based medicine (EBM)
- Collect data on factors of users
- Subjects given questions and asked to search to find and justify an answer
- Statistical analysis to find associations among user factors and successful searching

System used – OvidWeb MEDLINE
Experimental design

- Recruited
  - 45 senior medical students
  - 21 second (last) year NP students
- Large-group session
  - Demographic/experience questionnaire
  - Orientation to experiment and OvidWeb
  - Overview of basic MEDLINE and EBM skills
Experimental design (cont.)

- Searching sessions
  - Two hands-on sessions in library
  - For each of three questions, randomly selected from 20, measured:
    - Pre-search answer with certainty
    - Searching and answering with justification and certainty
    - Logging of system-user interactions
  - User interface questionnaire (QUIS)
Searching questions

- Derived from two sources
  - Medical Knowledge Self-Assessment Program (Internal Medicine board review)
  - Clinical questions collection of Paul Gorman
- Worded to have an answer of either
  - Yes with good evidence
  - Indeterminate evidence
  - No with good evidence
- Answers graded by expert clinicians
Assessment of recall and precision

- Aimed to perform a "typical" recall and precision study and determine whether these measures were associated with successful searching
- Designated "end queries" to have a terminal set for analysis
- Half of all retrieved MEDLINE records judged by three physicians each as definitely relevant, possibly relevant, or not relevant
- Also measured reliability of raters
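The slides do not state which reliability statistic was used; as an illustrative stand-in, pairwise chance-corrected agreement (Cohen's kappa) over the three-level judgments could be computed as in this hypothetical Python sketch.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments by three physicians on five MEDLINE records,
# using the study's three-level relevance scale.
judgments = {
    "rater1": ["definite", "possible", "not", "definite", "not"],
    "rater2": ["definite", "not",      "not", "possible", "not"],
    "rater3": ["possible", "possible", "not", "definite", "not"],
}
for (r1, labels1), (r2, labels2) in combinations(judgments.items(), 2):
    print(r1, r2, round(cohen_kappa(labels1, labels2), 2))
```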
Overall results

- Prior to searching, the rate of correctness (32.1%) was about equal to chance for both groups
  - Rating of certainty was low for both groups
- With searching, medical students increased their rate of correctness to 51.6%, but NP students remained virtually unchanged at 34.7%
Overall results

                        Post-search incorrect   Post-search correct
Pre-search incorrect    133 (41%)               87 (27%)
  Medical               81 (36%)                70 (31%)
  NP                    52 (52%)                17 (17%)
Pre-search correct      41 (13%)                63 (19%)
  Medical               27 (12%)                45 (20%)
  NP                    14 (14%)                18 (18%)

Medical students were better able to convert incorrect into correct answers, whereas NP students were hurt as often as helped by searching.
Recall and precision

Variable    Incorrect   Correct   p value
Recall      18%         18%       .61
Precision   28%         29%       .99

Variable    All   Medical   NP
Recall      18%   18%       20%
Precision   29%   30%       26%

Recall and precision were not associated with successful answering of questions and were nearly identical for medical and NP students.
Conclusions from results

- Medical students improved their ability to answer questions with searching; NP students did not
  - Spatial visualization ability may explain this
- Answering questions required >30 minutes whether correct or incorrect
  - This content is not amenable to the clinical setting
- Recall and precision had no relation to successful searching
Implications
Limitations of studies

- Domains
  - Many more besides newswire and medicine
- Numbers of users and questions
  - Small and not necessarily representative
- Experimental setting
  - Real-world users may behave differently
But I believe we can conclude

- Although batch evaluations are useful early in system development, their results cannot be assumed to apply to real users
- Recall and precision are important components of searching but not the most important determiners of success
- Further research should investigate what makes documents relevant to users and helps them solve their information problems
Thank you for inviting me… It's great to be back in the Midwest!

www.irbook.org