
Information Science Institute
Marina del Rey, December 11, 2009
Evaluating Question Answering Validation
Anselmo Peñas (and Alvaro Rodrigo)
NLP & IR group
UNED
Old friends
Question Answering
 Nothing else than answering a question
Natural Language Understanding
 Something there, if you are able to answer a question
QA: extrinsic evaluation for NLU
Suddenly… (see the track?)
…the QA Track at TREC
Question Answering at TREC
Before:
• Knowledge Base (e.g. semantic networks)
• Specific domain
• Single accurate answer (with explanation)
• More reasoning
After:
• Big document collections (news, blogs)
• Unrestricted domain
• Ranking of answers (linked to documents)
• More retrieval
• Object of evaluation itself
Redefined, roughly speaking, as a highly precision-oriented IR task where NLP was necessary, especially for Answer Extraction.
What’s this story about?
(Timeline) QA tasks at CLEF, 2003-2010: Multiple Language QA Main Task; temporal restrictions and lists; Answer Validation Exercise (AVE); WiQA; QA over Speech Transcriptions (QAST); Real Time; WSD QA; GikiCLEF; ResPubliQA.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Outline of the evaluation cycle (long cycle and short cycle):
1. Analysis of current systems' performance
2. Mid-term goals and strategy
3. Evaluation task definition
4. Analysis of the evaluation cycle
Supporting steps in the cycle: generation of methodology and evaluation resources; task activation and development; result analysis; methodology analysis.
Systems performance, 2003 - 2006 (Spanish)
Overall: best result < 60%
Definitions: best result > 80% (NOT an IR approach)
Pipeline Upper Bounds
Question → Question analysis → Passage Retrieval (0.8) → Answer Extraction (× 0.8) → Answer Ranking (× 1.0) → Answer
0.8 × 0.8 × 1.0 = 0.64
Not enough evidence: SOMETHING is needed to break the pipeline.
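The multiplication on the slide generalizes: in a strict pipeline the best achievable accuracy is the product of the per-stage accuracies. A minimal sketch, using only the illustrative stage names and figures shown above:

```python
from math import prod

# Illustrative per-stage accuracies taken from the slide above.
stage_accuracy = {
    "passage_retrieval": 0.8,
    "answer_extraction": 0.8,
    "answer_ranking": 1.0,
}

# In a strict pipeline, errors compound: the product is an upper bound on accuracy.
upper_bound = prod(stage_accuracy.values())
print(f"pipeline upper bound: {upper_bound:.2f}")  # 0.64
```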
Results in CLEF-QA 2006 (Spanish)
(Chart) Perfect combination: 81%; best system: 52.5%. Different systems were best with ORGANIZATION, PERSON and TIME questions.
Collaborative architectures
Different systems answer different types of questions better:
• Specialization
• Collaboration
(Diagram) The question goes to several QA systems (QA sys1 … QA sysn); their candidate answers feed SOMETHING for combining / selecting, which returns the final answer.
Collaborative architectures
How to select the good answer?
• Redundancy
• Voting
• Confidence score
• Performance history
Why not deeper content analysis?
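As a point of comparison for the shallow strategies listed above, here is a minimal sketch of multi-stream selection by voting with a confidence tie-break; the answers and confidence values are invented, and no AVE system is implied to work exactly this way.

```python
from collections import defaultdict

def select_by_voting(candidates):
    """Pick the most voted answer; break ties by summed confidence."""
    votes = defaultdict(list)
    for answer, confidence in candidates:
        votes[answer].append(confidence)
    return max(votes, key=lambda a: (len(votes[a]), sum(votes[a])))

# One (answer, confidence) pair per QA stream; values are invented.
candidates = [("Rome", 0.9), ("Rome", 0.6), ("Paris", 0.8)]
print(select_by_voting(candidates))  # Rome
```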
Mid Term Goal
Goal: improve QA systems' performance.
New mid-term goal: improve the devices for rejecting / accepting / selecting answers.
The new task (2006): validate the correctness of the answers given by real QA systems... the participants at CLEF QA.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Define Answer Validation
Decide whether an answer is correct or not.
More precisely, the task: given
• a question,
• an answer,
• a supporting text,
decide whether the answer is correct according to the supporting text.
Let's call it the Answer Validation Exercise (AVE).
Wish list
Test collection:
• Questions
• Answers
• Supporting texts
• Human assessments
Evaluation measures
Participants
Evaluation linked to the main QA task: reuse human assessments.
(Diagram) The Question Answering Track provides the questions, the systems' answers and the systems' supporting texts, which form the input of the Answer Validation Exercise. The human judgements (R, W, X, U) from the QA Track results are mapped to ACCEPT / REJECT labels, and the AVE systems' ACCEPT / REJECT decisions are evaluated against them to produce the AVE Track results.
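A sketch of that mapping step, reading the CLEF labels as R = right, W = wrong, X = inexact, U = unsupported. How X and U were actually folded into ACCEPT / REJECT is an assumption here, not something stated on the slide.

```python
# Hypothetical mapping from CLEF QA judgements to AVE gold labels.
JUDGEMENT_TO_GOLD = {
    "R": "ACCEPT",
    "W": "REJECT",
    "X": "REJECT",  # assumption: inexact answers treated as not validated
    "U": "REJECT",  # assumption: unsupported answers treated as not validated
}

def to_gold_label(judgement: str) -> str:
    return JUDGEMENT_TO_GOLD[judgement.upper()]

print(to_gold_label("R"))  # ACCEPT
```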
Answer Validation Exercise (AVE)
(Diagram) Answer Validation: the question and a candidate answer feed an automatic Hypothesis Generation step; the resulting hypothesis and the supporting text feed a Textual Entailment decision. Output: either "answer is correct" or "answer is not correct or not enough evidence".
AVE 2006 covered the entailment decision over text-hypothesis pairs; AVE 2007 - 2008 covered the whole validation process.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
   • Underlying architecture: pipeline
   • Evaluating the validation
   • As an RTE exercise: text-hypothesis pairs
4. AVE 2007 & 2008
5. QA 2009
AVE 2006: an RTE exercise
(Diagram) The QA system takes the question and returns an exact answer plus a supporting snippet. The question and the exact answer form the hypothesis; the supporting snippet is the text. Entailment?
If the text semantically entails the hypothesis, then the answer is expected to be correct.
Is this true? Yes, in 95% of cases with current QA systems (J LOG COMP 2009).
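A minimal sketch of this step. The hypothesis template and the word-overlap "entailment" check are stand-ins for illustration only; real AVE participants used much richer entailment machinery (see the techniques table later in the deck).

```python
def build_hypothesis(question: str, answer: str) -> str:
    """Turn a question plus candidate answer into a rough hypothesis.
    Naive template for illustration; real systems rewrite the question properly."""
    return f"{question.rstrip('?')} is {answer}"

def entails(text: str, hypothesis: str, threshold: float = 0.7) -> bool:
    """Toy entailment check: fraction of hypothesis words covered by the text."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    overlap = len(hyp_words & text_words) / max(len(hyp_words), 1)
    return overlap >= threshold

question = "What is Zanussi?"
answer = "an Italian producer of home appliances"
snippet = "Zanussi was an Italian producer of home appliances that in 1984 was bought"
print(entails(snippet, build_hypothesis(question, answer)))  # True -> expected correct
```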
Collections AVE 2006 (number of pairs, % of entailment)
Language     Testing           Training
English      2088 (10% YES)    2870 (15% YES)
Spanish      2369 (28% YES)    2905 (22% YES)
German       1443 (25% YES)
French       3266 (22% YES)
Italian      1140 (16% YES)
Dutch         807 (10% YES)
Portuguese   1324 (14% YES)
Available at: nlp.uned.es/clef-qa/ave/
Evaluating the Validation
Validation: decide whether each candidate answer is correct or not (YES | NO).
The collections are not balanced.
Approach: detect whether there is enough evidence to accept an answer.
Measures: precision, recall and F over correct answers.
Baseline system: accept all answers.
Evaluating the Validation
                   Correct Answer    Incorrect Answer
Answer Accepted    nCA               nWA
Answer Rejected    nCR               nWR

precision = nCA / (nCA + nWA)
recall    = nCA / (nCA + nCR)
F = 2 · precision · recall / (precision + recall)
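The validation measures transcribed directly; the counts below are made-up numbers used only to exercise the formulas.

```python
def validation_scores(n_ca, n_wa, n_cr):
    """Precision, recall and F over correct answers.
    n_ca: correct answers accepted, n_wa: incorrect answers accepted,
    n_cr: correct answers rejected."""
    precision = n_ca / (n_ca + n_wa) if (n_ca + n_wa) else 0.0
    recall = n_ca / (n_ca + n_cr) if (n_ca + n_cr) else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f

# Made-up counts for illustration.
print(validation_scores(n_ca=50, n_wa=30, n_cr=20))  # (0.625, 0.714..., 0.666...)
```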
Results AVE 2006
Language     Baseline (F)   Best system (F)   Reported techniques
English      .27            .44               Logic
Spanish      .45            .61               Logic
German       .39            .54               Lexical, Syntax, Semantics, Logic, Corpus
French       .37            .47               Overlapping, Learning
Dutch        .19            .39               Syntax, Learning
Portuguese   .38            .35               Overlapping
Italian      .29            .41               Overlapping, Learning
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
   • Underlying architecture: multi-stream
   • Quantify the potential benefit of AV in QA
   • Evaluating the correct selection of one answer
   • Evaluating the correct rejection of all answers
5. QA 2009
AVE 2007 & 2008
(Diagram) The question goes to several QA systems (QA sys1 … QA sysn), the participant systems in a CLEF QA campaign. Their candidate answers, together with the supporting texts, feed an Answer Validation & Selection module, which returns a single answer. The evaluation targets this Answer Validation & Selection step.
Collections
<q id="116" lang="EN">
<q_str> What is Zanussi? </q_str>
<a id="116_1" value="">
<a_str> was an Italian producer of home appliances </a_str>
<t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the
hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home
appliances that in 1984 was bought</t_str>
</a>
<a id="116_2" value="">
<a_str> who had also been in Cassibile since August 31 </a_str>
<t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe
Castellano informed of the additional clauses that had been presented by general Ronald
Campbell to another Italian general, Zanussi, who had also been in Cassibile since August
31.</t_str>
</a>
<a id="116_4" value="">
<a_str> 3 </a_str>
<t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
</a>
</q>
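A small sketch of how such a collection entry can be read programmatically, assuming the `<q>` elements are wrapped in some root element; the exact layout of the AVE collection files may differ from this wrapped sample.

```python
import xml.etree.ElementTree as ET

# Wrapped sample based on the slide; the real collection files may use another root.
sample = """<qa_set>
<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
</q>
</qa_set>"""

root = ET.fromstring(sample)
for q in root.iter("q"):
    question = q.findtext("q_str").strip()
    for a in q.iter("a"):
        answer = a.findtext("a_str").strip()
        support = a.findtext("t_str").strip()
        # Each (question, answer, supporting text) triple is one validation item.
        print(q.get("id"), a.get("id"), question, "|", answer, "|", support[:40])
```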
Evaluating the Selection
Goals:
• Quantify the potential gain of Answer Validation in Question Answering
• Compare AV systems with QA systems
• Develop measures more comparable to QA accuracy

qa_accuracy = n_questions_answered_correctly / n_questions
Evaluating the selection
Given a question with several candidate answers
Two options:

Selection

Select an answer ≡ try to answer the question
• Correct selection: answer was correct
• Incorrect selection: answer was incorrect

Rejection

Reject all candidate answers ≡ leave question unanswered
• Correct rejection: All candidate answers were incorrect
• Incorrect rejection: Not all candidate answers were incorrect
Evaluating the Selection
n questions: n = nCA + nWA + nWS + nWR + nCR

                                                     Question with      Question without
                                                     correct answer     correct answer
Question answered correctly (one answer selected)    nCA                -
Question answered incorrectly                        nWA                nWS
Question unanswered (all answers rejected)           nWR                nCR

qa_accuracy  = nCA / n
rej_accuracy = nCR / n
Evaluating the Selection
qa_accuracy  = nCA / n
rej_accuracy = nCR / n
accuracy = (nCA + nCR) / n
This rewards rejection (the collections are not balanced).
Interpretation for QA: all questions correctly rejected by AV will be answered correctly.
Evaluating the Selection
qa_accuracy  = nCA / n
rej_accuracy = nCR / n
estimated = nCA/n + (nCR/n) · (nCA/n) = (1/n) · (nCA + nCR · nCA/n)
Interpretation for QA: questions correctly rejected count as if they were answered correctly, in the qa_accuracy proportion.
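The selection measures side by side, as a minimal sketch; the counts passed in are invented for illustration.

```python
def selection_scores(n_ca, n_wa, n_ws, n_wr, n_cr):
    """Selection measures from the contingency table above.
    n_ca: answered correctly; n_wa / n_ws: answered incorrectly (with / without a
    correct candidate); n_wr: wrongly rejected; n_cr: correctly rejected."""
    n = n_ca + n_wa + n_ws + n_wr + n_cr
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    accuracy = (n_ca + n_cr) / n
    estimated = (n_ca + n_cr * n_ca / n) / n
    return qa_accuracy, rej_accuracy, accuracy, estimated

# Invented counts for illustration only.
print(selection_scores(n_ca=120, n_wa=40, n_ws=20, n_wr=10, n_cr=10))
# (0.6, 0.05, 0.65, 0.63)
```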
Analysis and discussion (AVE 2007 Spanish)
(Charts) Validation and selection results, comparing AV and QA systems.
Techniques in AVE 2007 (number of systems reporting each)
Generates hypotheses: 6
Syntactic similarity: 4
Wordnet: 3
Functions (sub, obj, etc.): 3
Chunking: 3
Syntactic transformations: 1
n-grams, longest common subsequences: 5
Word-sense disambiguation: 2
Semantic parsing: 4
Phrase transformations: 2
Semantic role labeling: 2
NER: 5
First order logic representation: 3
Num. expressions: 6
Theorem prover: 3
Temp. expressions: 4
Semantic similarity: 2
Coreference resolution: 2
Dependency analysis: 3
Conclusion of AVE
Answer Validation before:
• It was assumed to be a QA module
• But there was no space for its own development
The new devices should help to improve QA; they
• Introduce more content analysis
• Use Machine Learning techniques
• Are able to break pipelines or combine streams
Let's transfer them to the QA main task.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
CLEF QA 2009 campaign
ResPubliQA: QA on European Legislation
GikiCLEF: QA requiring geographical reasoning on Wikipedia
QAST: QA on Speech Transcriptions of European Parliament plenary sessions
CLEF QA 2009 campaign
Task        Registered groups    Participant groups   Submitted runs             Organizing people
ResPubliQA  20                   11                   28 + 16 (baseline runs)    9
GikiCLEF    27                   8                    17 runs                    2
QAST        12                   4                    86 (5 subtasks)            8
Total       59 showed interest   23 groups            147 runs evaluated         19 + additional assessors
ResPubliQA 2009: QA on European Legislation
Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory board: Donna Harman, Maarten de Rijke, Dominique Laurent
Evolution of the task (2003-2009)
Target languages: 3 (2003), 7 (2004), 8 (2005), 9 (2006), 10 (2007), 11 (2008), 8 (2009)
Collections: News 1994; + News 1995; + Wikipedia Nov. 2006; European Legislation (2009)
Number of questions: 200, then 500 in 2009
Type of questions: 200 factoid; then + temporal restrictions, + definitions, + lists, - type of question, + linked questions, + closed lists; in 2009 - linked, + reason, + purpose, + procedure
Supporting information: document, then snippet, then paragraph (2009)
Size of answer: snippet, then exact, then paragraph (2009)
Collection
Subset of JRC-Acquis (10,700 docs per language)
• Parallel at document level
• EU treaties, EU legislation, agreements and resolutions
• Economy, health, law, food, …
• Between 1950 and 2006
500 questions
REASON
• Why did a commission expert conduct an inspection visit to Uruguay?
PURPOSE/OBJECTIVE
• What is the overall objective of the eco-label?
PROCEDURE
• How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph.
500 questions
Also:
FACTOID
• In how many languages is the Official Journal of the Community published?
DEFINITION
• What is meant by “whole milk”?
No NIL questions.
Systems response
1. Decide whether to answer or not [ YES | NO ]
   • No Answer ≠ Wrong Answer
   • Classification problem
   • Machine Learning, provers, etc.
   • Textual Entailment
2. Provide the paragraph (ID + text) that answers the question
Aim: to leave a question unanswered has more value than to give a wrong answer.
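A minimal sketch of that answer / no-answer decision as a thresholded confidence score; the threshold and the idea of a single confidence value are placeholders for illustration, not anything prescribed by the task.

```python
from typing import Optional

def respond(candidate_paragraph: str, confidence: float,
            threshold: float = 0.5) -> Optional[str]:
    """Return the paragraph only when validation confidence is high enough;
    otherwise abstain (NoA), which the c@1 measure rewards over a wrong answer."""
    if confidence >= threshold:
        return candidate_paragraph
    return None  # NoA: leave the question unanswered

print(respond("Article 5 ...", confidence=0.8))  # answers
print(respond("Article 9 ...", confidence=0.3))  # None -> NoA
```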
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA Empty: NoA, and no candidate answer was given
Evaluation measure: c@1
• An extension of traditional accuracy (the proportion of questions answered correctly)
• Considering unanswered questions
Evaluation measure
c@1 = (1/n) · (nR + nU · nR/n)
n: number of questions
nR: number of correctly answered questions
nU: number of unanswered questions
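A direct transcription of c@1; the example numbers are the icia092roro row from the Romanian results table later in the deck (260 correct, 156 unanswered, 500 questions), which reproduces the reported 0.68.

```python
def c_at_1(n_r: int, n_u: int, n: int) -> float:
    """c@1: accuracy extended with a reward for leaving questions unanswered.
    n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

# icia092roro in the Romanian results table: 260 R, 156 NoA out of 500 questions.
print(round(c_at_1(260, 156, 500), 2))  # 0.68
```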
Evaluation measure
c@1 = (1/n) · (nR + nU · nR/n)
If nU = 0 then c@1 = nR/n (accuracy)
If nR = 0 then c@1 = 0
If nU = n then c@1 = 0
• Leaving a question unanswered adds value only if it avoids returning a wrong answer.
• The added value is the performance shown on the answered questions: accuracy.
List of Participants
System   Team
elix     ELHUYAR-IXA, SPAIN
icia     RACAI, ROMANIA
iiit     Search & Info Extraction Lab, INDIA
iles     LIMSI-CNRS-2, FRANCE
isik     ISI-Kolkata, INDIA
loga     U. Koblenz-Landau, GERMANY
mira     MIRACLE, SPAIN
nlel     U. Politécnica Valencia, SPAIN
syna     Synapse Développement, FRANCE
uaic     Al.I.Cuza U. of Iasi, ROMANIA
uned     UNED, SPAIN
Value of reducing wrong answers
System       c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination  0.76   0.76       381   119   0      0        0        0
icia092roro  0.68   0.52       260   84    156    0        0        156
icia091roro  0.58   0.47       237   156   107    0        0        107
UAIC092roro  0.47   0.47       236   264   0      0        0        0
UAIC091roro  0.45   0.45       227   273   0      0        0        0
base092roro  0.44   0.44       220   280   0      0        0        0
base091roro  0.37   0.37       185   315   0      0        0        0
Detecting wrong answers
System       c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination  0.56   0.56       278   222   0      0        0        0
loga091dede  0.44   0.4        186   221   93     16       68       9
loga092dede  0.44   0.4        187   230   83     12       62       9
base092dede  0.38   0.38       189   311   0      0        0        0
base091dede  0.35   0.35       174   326   0      0        0        0

While maintaining the number of correct answers, the candidate answer was not correct for 83% of the unanswered questions: a very good step towards improving the system.
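As a quick check of the c@1 definition against this table, take the loga092dede row (187 correct, 83 unanswered, 500 questions):

c@1 = (187 + 83 · 187/500) / 500 ≈ 0.44

which matches the reported value: the withheld answers are credited only in proportion to the accuracy the system shows overall.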
IR important, not enough
System        c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination   0.9    0.9        451   49    0      0        0        0
uned092enen   0.61   0.61       288   184   28     15       1        –
uned091enen   0.6    0.59       282   190   28     –        –        –
nlel091enen   0.58   0.57       287   211   2      –        –        –
uaic092enen   0.54   0.52       243   204   53     15       13       –
base092enen   0.53   0.53       263   236   1      1        0        0
base091enen   0.51   0.51       256   243   1      0        1        0
elix092enen   0.48   0.48       240   260   0      0        0        0
uaic091enen   0.44   0.42       200   253   47     11       36       0
elix091enen   0.42   0.42       211   289   0      0        0        0
syna091enen   0.28   0.28       141   359   0      0        0        0
isik091enen   0.25   0.25       126   374   0      0        0        0
iiit091enen   0.2    0.11       54    37    409    0        11       398
elix092euen   0.18   0.18       91    409   0      0        0        0
elix091euen   0.16   0.16       78    422   0      0        0        0

The "combination" row is an achievable task: the perfect combination is 50% better than the best system. Many systems score under the IR baselines.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
6. Conclusion
Conclusion
A new QA evaluation setting, assuming that leaving a question unanswered has more value than giving a wrong answer.
This assumption gives space for further development of QA systems, and hopefully improves their performance.
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)