IR4QA: An Unhappy Marriage

Mark A. Greenwood
Natural Language Processing Group
Department of Computer Science
University of Sheffield, UK
Outline of Talk
• Background
• ‘Ancient’ History
• Recent Past
• An Uncertain Future
• Possible New Directions
Background
Although QA is not new, the language processing community has
yet to develop a clearly articulated and commonly accepted
guiding framework and research methodology, parallel to that of
IR, MT, or text summarization.
As a result, despite ten years of system evaluations in the TREC
QA track for specific kinds of questions and answers, the
community does not have a clear idea how much progress was
made during that period for QA in general.
OAQA09 Call for Papers
Background
• We will focus here on the selection of promising documents
which can be subjected to further processing in order to
extract exact answers to questions.
• The common approach to this problem has been to employ an
IR engine to retrieve a small set of relevant documents, a field
known as IR4QA.
• The rest of this talk will explain
 How we got to this point
 Why it is fundamentally flawed
 Where we might go from here
‘Ancient’ History
• Traditionally IR and QA were separate research areas
• They had different users and goals
• The inputs and outputs to both systems were radically different
• Both had their own strengths and weaknesses
‘Ancient’ History
• Early QA systems were usually just interfaces to structured
data
 LUNAR (Woods, 1973)
 BASEBALL (Green et al., 1961)
• Those systems which worked over text were usually based
around reading comprehension exercises and used scenario
templates
 SAM (Schank and Abelson, 1977)
• Questions varied in length but were asking for information
which wasn’t known to the user
• Systems were not open-domain, e.g. LUNAR only knew about
moon rocks.
‘Ancient’ History
• In comparison to QA systems, early IR systems could be
applied to any document collection
 Performance varied from collection to collection, but in principle
any collection could be searched
• Queries were usually quite long and described the documents
the user was looking for
 The CACM collection is a good example
• Systems returned full documents not exact answers
 As the user already knew what they were looking for this was OK
 Full documents don’t help when you don’t know what you are looking
for, as you then have to read all the returned documents
Recent Past
• Recent QA research has been guided by the TREC evaluations
• The TREC QA track was originally conceived as a task that
would interest both the IR and IE communities
 Focused IR
 Open-Domain IE
• It was hoped that over time the two communities would work
together to develop new combined approaches
• Unfortunately it would seem that the IR community is not, on
the whole, interested in the QA task
Recent Past
• Most, if not all, modern QA systems have adopted a (roughly)
three stage architecture: question analysis, document retrieval,
and answer extraction.
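The three stages can be sketched as a toy pipeline. Everything below is an illustrative assumption rather than any particular system: the stop word list, the keyword-overlap retrieval, and the "first four-digit number" extractor are all hypothetical stand-ins chosen to keep the example short.

```python
import re

# Hypothetical stop word list, just large enough for the example.
STOPWORDS = {"who", "when", "what", "where", "did", "was", "is",
             "the", "a", "on", "in", "of"}


def toy_analyse(question):
    """Stage 1, question analysis: reduce the question to query keywords."""
    tokens = re.findall(r"\w+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]


def toy_retrieve(query, collection, top_n=2):
    """Stage 2, document retrieval: rank documents by keyword overlap."""
    scored = []
    for doc in collection:
        words = set(re.findall(r"\w+", doc.lower()))
        score = sum(1 for term in query if term in words)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_extract(question, documents):
    """Stage 3, answer extraction: return the first four-digit number found."""
    for doc in documents:
        match = re.search(r"\b\d{4}\b", doc)
        if match:
            return match.group()
    return None


def answer_question(question, collection):
    query = toy_analyse(question)
    documents = toy_retrieve(query, collection)
    if not documents:
        # Nothing retrieved, so extraction has nothing to work with.
        return None
    return toy_extract(question, documents)


collection = [
    "Apollo 11 landed on the Moon in 1969.",
    "Sheffield is a city in England.",
]
print(answer_question("When did Apollo 11 land on the Moon?", collection))
```

Note that the extraction stage only ever sees what retrieval hands it: a question whose keywords match no document gets no answer, however good the extractor is.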
Recent Past
• IR4QA has not been aggressively researched by the
community yet we know that...
 IR performance places an upper-bound on end-to-end performance – a
commonly quoted figure is 60% (Tellex et al., 2003)
 Even if we look at the top 1000 documents no relevant documents are
returned for 8% of the questions (Hovy et al., 2000)
 Most systems use off-the-shelf IR components with little or no tuning to
the task, e.g. Lucene, Okapi...
 Complex multi-query strategies have been tried in an effort to solve the
problem, but they only serve to highlight how bad performance at this
step actually is.
Recent Past
• IR4QA has focused on the development and evaluation of the
document retrieval component in such systems.
• The main problems are
 QA researchers are not IR researchers
 We don’t fully understand the intricate details of IR engines
 QA and IR are fundamentally different tasks
Recent Past
• Commonly accepted evaluation framework
(Roberts and Gaizauskas, 2004) consists of
 Coverage – the proportion of questions for which at least one answer
bearing document is retrieved
 Redundancy – the average number of answer bearing documents
retrieved for a question
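A minimal sketch of the two measures, computed over a set of questions at a rank cut-off n, in the spirit of Roberts and Gaizauskas (2004). The data layout (dictionaries keyed by question id) and the function names are assumptions made for this example.

```python
def coverage(retrieved, answer_bearing, n):
    """Fraction of questions with at least one answer bearing document
    among the top n retrieved documents."""
    hits = sum(
        1 for q, docs in retrieved.items()
        if set(docs[:n]) & answer_bearing.get(q, set())
    )
    return hits / len(retrieved)


def redundancy(retrieved, answer_bearing, n):
    """Average number of answer bearing documents per question
    among the top n retrieved documents."""
    total = sum(
        len(set(docs[:n]) & answer_bearing.get(q, set()))
        for q, docs in retrieved.items()
    )
    return total / len(retrieved)


# Hypothetical ranked results and relevance judgements for two questions.
retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
answer_bearing = {"q1": {"d1", "d3"}, "q2": {"d9"}}

print(coverage(retrieved, answer_bearing, n=3))    # 0.5
print(redundancy(retrieved, answer_bearing, n=3))  # 1.0
```

Here q2 has no answer bearing document in its top three, so coverage is 0.5, while q1 contributes two such documents, giving a redundancy of 1.0 averaged over both questions.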
Recent Past
• There have been two workshops focused on the problem of
IR4QA
 Sheffield, SIGIR 2004
 Manchester, Coling 2008
• The main conclusions of both were that
 IR4QA is very hard
 Approaches that lead to increased IR performance do not necessarily
lead to appreciable increases in end-to-end performance
 Selection of documents shouldn’t be performed in isolation from the rest
of the system
An Uncertain Future
• It seems clear that, on the whole, the IR community are not
interested in QA
• Using off-the-shelf IR components has been shown to
introduce unacceptable caps on performance
• The IR4QA community need to consider radically different
approaches to the problem of selecting relevant documents
from large corpora
Possible New Directions
• Answer extraction requires complex text processing
 Answer extraction techniques don’t scale well
 Some form of text selection component is required
• There are two orthogonal directions we could take
 Continue to use traditional IR techniques but discard the traditional view
of what makes a document (and/or query)
 Continue to work with traditional documents but use a radically different
selection approach
We need approaches that scale – working on AQUAINT-sized
collections is nice for self-contained experiments but shouldn’t be
the end goal!
What Is A Document?
• Topic Indexing and Retrieval (Ahn and Webber, 2008) throws
away the common idea of documents while using a standard IR
engine to directly retrieve answers, not text.
• Topics are entities that answer questions
 People, companies, locations etc.
• Topic documents are built by simply joining together all
sentences from a corpus that contain the topic (or variations
of it, e.g. Bill Clinton and William Clinton)
• QA is then a matter of retrieving the most relevant topic
document using an IR engine and returning the associated
topic as the answer
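The idea can be sketched in a few lines. This is not Ahn and Webber's implementation: the substring matching of topic variants and the simple TF-IDF style scoring below are assumptions standing in for a real entity recogniser and a real IR engine.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_topic_documents(sentences, topic_variants):
    """Join every sentence mentioning a topic (or one of its variants)
    into a single 'topic document' per topic."""
    collected = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for topic, variants in topic_variants.items():
            if any(v.lower() in lowered for v in variants):
                collected[topic].append(sentence)
    return {topic: " ".join(parts) for topic, parts in collected.items()}


def best_topic(question, topic_docs):
    """Score each topic document against the question and return the
    topic whose document scores highest (assumed TF-IDF style weighting)."""
    counts = {t: Counter(tokenize(d)) for t, d in topic_docs.items()}
    n = len(counts)

    def idf(term):
        df = sum(1 for c in counts.values() if term in c)
        return math.log((n + 1) / (df + 1)) + 1

    query = tokenize(question)
    scores = {t: sum(c[w] * idf(w) for w in query) for t, c in counts.items()}
    return max(scores, key=scores.get) if scores else None


sentences = [
    "Bill Clinton was born in Hope, Arkansas.",
    "William Clinton served as the 42nd president of the United States.",
    "Tony Blair served as prime minister of the United Kingdom.",
]
topic_variants = {
    "Bill Clinton": ["Bill Clinton", "William Clinton"],
    "Tony Blair": ["Tony Blair"],
}

topic_docs = build_topic_documents(sentences, topic_variants)
print(best_topic("Who was the 42nd president of the United States?", topic_docs))
```

The answer returned is the topic itself, not a passage: both Clinton sentences end up in one topic document, so evidence scattered across the corpus is pooled before retrieval.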
Let The Data Guide You
• A decade of recent QA research has yielded a lot of useful data
• We have lots of example questions (at least a few thousand
just from TREC) each of which...
 Has a known correct answer
 Is associated with at least one answer bearing document
• We should use this data to guide new selection approaches.
 A simple approach would be to perform query expansion by looking for
terms which are often associated with correct answers to certain
question types (Derczynski et al., 2008)
 Look for patterns in the answer bearing documents and index
collections based on these patterns rather than words
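A rough sketch of the query expansion idea, assuming training data in the form of (question type, question, answer bearing text) triples. The counting scheme here is an assumption for illustration and is not the method of Derczynski et al. (2008); the question type label and example texts are made up.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def learn_expansion_terms(training, top_k=2):
    """For each question type, find the terms that most often occur in
    answer bearing text but not in the question itself."""
    counts = defaultdict(Counter)
    for qtype, question, text in training:
        question_terms = set(tokenize(question))
        # Count each term once per training example.
        for term in set(tokenize(text)) - question_terms:
            counts[qtype][term] += 1
    return {
        qtype: [term for term, _ in counter.most_common(top_k)]
        for qtype, counter in counts.items()
    }


def expand_query(question, qtype, expansions):
    """Append the learned terms for this question type to the query."""
    terms = tokenize(question)
    return terms + [t for t in expansions.get(qtype, []) if t not in terms]


training = [
    ("HEIGHT", "How tall is the Eiffel Tower?",
     "The Eiffel Tower is 330 metres tall."),
    ("HEIGHT", "How tall is Everest?",
     "Everest rises 8848 metres above sea level."),
    ("HEIGHT", "How tall is Big Ben?",
     "Big Ben stands 96 metres high above Westminster."),
]

expansions = learn_expansion_terms(training)
print(expand_query("How tall is the Shard?", "HEIGHT", expansions))
```

On this toy data the learned terms for height questions are "metres" and "above", neither of which appears in the questions, which is exactly the kind of vocabulary gap between questions and answer bearing text that expansion is meant to bridge.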
Answer By Understanding
• I’ve always been of the opinion that QA is intelligent IR
 Where intelligence equates to some level of understanding
• This suggests we should index meaning not just textual
content.
 Take into account co-reference when selecting text passages
 Indexing relations should allow for more focused selection
 ‘Hybrid’ search that uses annotations and text (Bhagdev et al., 2008)
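As one possible reading of "indexing relations", the sketch below indexes passages by (subject, relation, object) triples rather than by words. It assumes the triples have already been extracted (with co-reference resolved) by some upstream component; the class, relation names, and passage ids are all hypothetical.

```python
from collections import defaultdict


class RelationIndex:
    """Maps relation triples, and partial patterns over them, to passage ids."""

    def __init__(self):
        self._by_pattern = defaultdict(set)

    def add(self, passage_id, subject, relation, obj):
        triple = (subject.lower(), relation.lower(), obj.lower())
        # Index the passage under every pattern, where None is a wildcard.
        for s in (triple[0], None):
            for r in (triple[1], None):
                for o in (triple[2], None):
                    self._by_pattern[(s, r, o)].add(passage_id)

    def lookup(self, subject=None, relation=None, obj=None):
        pattern = tuple(
            value.lower() if value is not None else None
            for value in (subject, relation, obj)
        )
        return set(self._by_pattern.get(pattern, set()))


index = RelationIndex()
index.add("p1", "Mozart", "born_in", "Salzburg")
index.add("p2", "Mozart", "died_in", "Vienna")
index.add("p3", "Haydn", "died_in", "Vienna")

# "Where was Mozart born?" becomes a pattern with the object left open.
print(index.lookup(subject="Mozart", relation="born_in"))
```

A keyword query for "Mozart born" would also match passages that merely mention both words; the pattern lookup selects only passages asserting the relation, which is the more focused selection the slide has in mind.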
DISCUSSION
References
• Kisuh Ahn and Bonnie Webber. 2008. Topic Indexing and Retrieval for Factoid QA. In
Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA).
• Ravish Bhagdev, Sam Chapman, Fabio Ciravegna, Vitaveska Lanfranchi and Daniela Petrelli.
2008. Hybrid Search: Effectively Combining Keywords and Semantic Searches. In Proceedings of
the 5th European Semantic Web Conference, ESWC 08, Tenerife.
• Leon Derczynski, Jun Wang, Robert Gaizauskas and Mark A. Greenwood. 2008. A Data Driven
Approach to Query Expansion in Question Answering. In Proceedings of the 2nd Workshop on
Information Retrieval for Question Answering (IR4QA).
• Bert F. Green, Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. BASEBALL: An
Automatic Question Answerer. In Proceedings of the Western Joint Computer Conference,
volume 19, pages 219--224.
• Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk, and Chin-Yew Lin. 2000. Question
Answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference.
• Ian Roberts and Robert Gaizauskas. 2004. Evaluating Passage Retrieval Approaches for Question
Answering. In Proceedings of 26th European Conference on Information Retrieval (ECIR’04),
pages 72--84, University of Sunderland, UK.
• Roger C. Schank and Robert Abelson. 1977. Scripts, Plans, Goals and Understanding. Hillsdale.
• Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative
Evaluation of Passage Retrieval Algorithms for Question Answering. In Proceedings of the
Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 41--47, Toronto, Canada, July.
• William Woods. 1973. Progress in Natural Language Understanding - An Application to Lunar
Geology. In AFIPS Conference Proceedings, volume 42, pages 441--450.