A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
26 October 2008
Outline: Introduction • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion
Natural language inference (NLI), aka RTE
• Does premise P justify an inference to hypothesis H?
• An informal notion of inference; wide variability of linguistic expression

  P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
  H: Gazprom will double Georgia's gas bill.  →  yes

• Like MT, NLI depends on a facility for alignment, i.e., linking corresponding words and phrases in two related sentences
Alignment example
[Figure: alignment grid between P (premise) and H (hypothesis), illustrating:
 • unaligned content: "deletions" from P
 • approximate match: price ~ bill
 • phrase alignment: two-fold increase ~ double]
Approaches to NLI alignment
• Alignment is addressed in various ways by current NLI systems
• In some approaches to NLI, alignments are implicit:
  • NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
• Other NLI systems make the alignment step explicit:
  • Align first, then determine inferential validity [Marsi & Krahmer 05, MacCartney et al. 06]
• What about using an MT aligner?
  • Alignment is familiar in MT, with an extensive literature
    [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  • Can the tools & techniques of MT alignment transfer to NLI?
NLI alignment vs. MT alignment
Doubtful: NLI alignment differs in several respects:
1. Monolingual: can exploit resources like WordNet
2. Asymmetric: P is often longer & has content unrelated to H
3. Cannot assume semantic equivalence
   • An NLI aligner must accommodate frequent unaligned content
4. Little training data available
   • MT aligners use unsupervised training on huge amounts of bitext
   • NLI aligners must rely on supervised training & much less data
Contributions of this paper
In this paper, we:
1. Undertake the first systematic study of alignment for NLI
   • Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
2. Examine the relation between alignment in NLI and in MT
   • How do existing MT aligners perform on the NLI alignment task?
3. Propose a new model of alignment for NLI: MANLI
   • Outperforms existing MT & NLI aligners on the NLI alignment task
The MANLI aligner
A model of alignment for NLI consisting of four components:
1. Phrase-based representation
2. Feature-based scoring function
3. Decoding using simulated annealing
4. Perceptron learning
Phrase-based alignment representation
Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS

  EQ(Gazprom1, Gazprom1)
  INS(will2)
  DEL(today2)
  DEL(confirmed3)
  DEL(a4)
  SUB(two-fold5 increase6, double3)
  DEL(in7)
  DEL(its8)
  …

• One-to-one at the phrase level (but many-to-many at the token level)
• Avoids arbitrary alignment choices; can use phrase-based resources (a sketch of this representation follows)
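
A minimal Python sketch of such a phrase-edit representation; the class and field names are illustrative, not the authors' implementation:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PhraseEdit:
    kind: str                           # one of "EQ", "SUB", "DEL", "INS"
    p_span: Optional[Tuple[int, int]]   # token span in premise P (None for INS)
    h_span: Optional[Tuple[int, int]]   # token span in hypothesis H (None for DEL)

# An alignment is a sequence of edits; each P and H token belongs to exactly
# one edit, so phrases are one-to-one at the phrase level but may be
# many-to-many at the token level.
alignment = [
    PhraseEdit("EQ",  (0, 1), (0, 1)),   # EQ(Gazprom, Gazprom)
    PhraseEdit("INS", None,   (1, 2)),   # INS(will)
    PhraseEdit("DEL", (1, 2), None),     # DEL(today)
    PhraseEdit("SUB", (4, 6), (2, 3)),   # SUB(two-fold increase, double)
]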
A feature-based scoring function
• Score each edit as a linear combination of features, then sum over edits (a sketch follows below):
  • Edit type features: EQ, SUB, DEL, INS
  • Phrase features: phrase sizes, non-constituents
  • Lexical similarity feature: max over similarity scores
    • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
    • Distributional similarity à la Dekang Lin
    • Various measures of string/lemma similarity
  • Contextual features: distortion, matching neighbors
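
To make the scoring idea concrete, here is a rough Python sketch building on the PhraseEdit sketch above. The feature names and the stand-in similarity function are assumptions, not the actual MANLI feature set: each edit is mapped to a feature vector, scored as a dot product with the weight vector, and the alignment score is the sum over edits.

def phrase(tokens, span):
    return tokens[span[0]:span[1]]

def lexical_similarity(p_phrase, h_phrase):
    # Stand-in: the real model takes a max over WordNet, distributional, and
    # string-similarity scores; here we only check case-folded equality.
    return 1.0 if [t.lower() for t in p_phrase] == [t.lower() for t in h_phrase] else 0.0

def edit_features(edit, P, H):
    feats = {"type=" + edit.kind: 1.0}                     # edit type feature
    if edit.p_span is not None:                            # a simple phrase feature
        feats["p_phrase_size"] = edit.p_span[1] - edit.p_span[0]
    if edit.kind in ("EQ", "SUB"):                         # lexical similarity feature
        feats["lex_sim"] = lexical_similarity(phrase(P, edit.p_span),
                                              phrase(H, edit.h_span))
    return feats

def score_alignment(alignment, P, H, w):
    # Alignment score = sum over edits of w . phi(edit), with w a sparse dict.
    return sum(w.get(name, 0.0) * value
               for edit in alignment
               for name, value in edit_features(edit, P, H).items())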
Decoding using simulated annealing
1. Start from an initial alignment
2. Generate successors
3. Score each successor
4. Smooth/sharpen the distribution: P(A) ← P(A)^(1/T)
5. Sample the next alignment
6. Lower the temperature: T ← 0.9 · T
7. Repeat steps 2–6, 100 times
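
A rough Python sketch of this loop. The successor generator and scoring function are assumed to be supplied, and the starting temperature and the exponential conversion of scores to probabilities are illustrative assumptions; only the cooling factor and iteration count come from the slide.

import math
import random

def decode_annealing(initial, successors, score, iterations=100, start_temp=10.0):
    # Simulated-annealing decoding: repeatedly score the neighbours of the
    # current alignment, sharpen the induced distribution by raising it to
    # the power 1/T, sample the next state, and cool T by a factor of 0.9.
    current, T = initial, start_temp
    for _ in range(iterations):
        cands = successors(current)
        scores = [score(a) for a in cands]
        m = max(scores)
        raw = [math.exp(s - m) for s in scores]            # unnormalized P(A)
        sharpened = [p ** (1.0 / T) for p in raw]          # P(A) <- P(A)^(1/T)
        Z = sum(sharpened)
        current = random.choices(cands, weights=[p / Z for p in sharpened])[0]
        T *= 0.9                                           # lower the temperature
    return current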
Perceptron learning of feature weights
We use a variant of averaged perceptron [Collins 2002]:

  Initialize weight vector w = 0, learning rate R0 = 1
  For training epoch i = 1 to 50:
    For each problem ⟨Pj, Hj⟩ with gold alignment Ej:
      Set Êj = ALIGN(Pj, Hj, w)
      Set w = w + Ri · (Φ(Ej) – Φ(Êj))
    Set w = w / ‖w‖2 (L2 normalization)
    Set w[i] = w (store the weight vector for this epoch)
    Set Ri = 0.8 · Ri–1 (reduce the learning rate)
  Throw away the weight vectors from the first 20% of epochs
  Return the average of the remaining weight vectors

Training runs require about 20 hours (on 800 RTE problems); a rough Python sketch follows.
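
In Python, the procedure above might look roughly like this. The dict-based weight vector and the function arguments (the decoder `align` and the feature map `phi`) are illustrative stand-ins, not the authors' code.

import math

def train_perceptron(problems, align, phi, epochs=50):
    # problems: list of (P, H, gold_alignment); phi maps an alignment to a
    # sparse feature vector (dict); align decodes with the current weights.
    w, rate, history = {}, 1.0, []
    for _ in range(epochs):
        for P, H, gold in problems:
            guess = align(P, H, w)
            for f, v in phi(gold).items():                 # w += rate * phi(gold)
                w[f] = w.get(f, 0.0) + rate * v
            for f, v in phi(guess).items():                # w -= rate * phi(guess)
                w[f] = w.get(f, 0.0) - rate * v
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        w = {f: v / norm for f, v in w.items()}            # L2-normalize
        history.append(dict(w))                            # store this epoch's weights
        rate *= 0.8                                        # reduce the learning rate
    keep = history[len(history) // 5:]                     # drop the first 20% of epochs
    avg = {}
    for wi in keep:                                        # average the remaining vectors
        for f, v in wi.items():
            avg[f] = avg.get(f, 0.0) + v / len(keep)
    return avg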
The MSR RTE2 alignment data
• Previously, little supervised data
• Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
• Token-based, but many-to-many
  • allows implicit alignment of phrases
• 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • merged using majority rule
Evaluation on MSR data
• We evaluate several systems on the MSR data
  • A simple baseline aligner
  • MT aligners: GIZA++ & Cross-EM
  • NLI aligners: Stanford RTE, MANLI
• How well do they recover the gold-standard alignments?
  • We report per-link precision, recall, and F1
  • We also report the exact match rate for complete alignments (a small sketch of these metrics follows)
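
For concreteness, a small Python sketch of the per-link metrics, taking gold and guessed alignments as sets of aligned token-index pairs; this is illustrative, not the official scorer.

def link_prf(gold_links, guessed_links):
    # Per-link precision, recall, and F1 over sets of (p_index, h_index) pairs.
    gold, guess = set(gold_links), set(guessed_links)
    correct = len(gold & guess)
    p = correct / len(guess) if guess else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def exact_match(gold_links, guessed_links):
    # Exact match: the complete guessed alignment equals the gold alignment.
    return set(gold_links) == set(guessed_links)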
Baseline: bag-of-words aligner
Match each H token to the most similar P token [cf. Glickman et al. 2005] (see the sketch below)

System            RTE2 dev                        RTE2 test
                  P %    R %    F1 %   E %        P %    R %    F1 %   E %
Bag-of-words      57.8   81.2   67.5   3.5        62.1   82.6   70.9   5.3

• Surprisingly good recall, despite extreme simplicity
• But very mediocre precision, F1, & exact match rate
• Main problem: it aligns every token in H
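
A minimal sketch of this baseline. The similarity function passed in here is a crude placeholder for illustration, not the lexical model of Glickman et al.

def bag_of_words_align(P_tokens, H_tokens, sim):
    # Link every hypothesis token to its most similar premise token.
    # Because every H token gets a link, recall is high but precision suffers.
    return [(max(range(len(P_tokens)), key=lambda i: sim(P_tokens[i], h)), j)
            for j, h in enumerate(H_tokens)]

# Example with a crude similarity: 1.0 for a case-folded exact match, else 0.0.
links = bag_of_words_align(
    "Gazprom today confirmed a two-fold increase in its gas price".split(),
    "Gazprom will double Georgia's gas bill".split(),
    lambda p, h: 1.0 if p.lower() == h.lower() else 0.0)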
MT aligners: GIZA++ & Cross-EM
• Can we show that MT aligners aren't suitable for NLI?
• Run GIZA++ via Moses, with default parameters
  • Train on the dev set, evaluate on dev & test sets
  • Produce asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic (see the sketch below)
• Initial results are very poor: 56% F1
  • It doesn't even align equal words
  • Remedy: add a lexicon of equal words as extra training data
• Do similar experiments with the Berkeley Cross-EM aligner
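
A sketch of the INTERSECTION heuristic over link sets of token-index pairs; this shows the idea only, not Moses' implementation.

def symmetrize_intersection(p_to_h, h_to_p):
    # Keep only links proposed by both asymmetric alignments.
    # p_to_h: links from the P->H run, as (p_index, h_index) pairs
    # h_to_p: links from the H->P run, as (h_index, p_index) pairs
    return set(p_to_h) & {(p, h) for (h, p) in h_to_p}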
Results: MT aligners

System            RTE2 dev                        RTE2 test
                  P %    R %    F1 %   E %        P %    R %    F1 %   E %
Bag-of-words      57.8   81.2   67.5   3.5        62.1   82.6   70.9   5.3
GIZA++            83.0   66.4   72.1   9.4        85.1   69.1   74.8   11.3
Cross-EM          67.6   80.1   72.1   1.3        70.3   81.0   74.1   0.8

• Similar F1, but GIZA++ wins on precision, Cross-EM on recall
• Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
• Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources
The Stanford RTE aligner
• Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments not directly representable
  • (But named entities & collocations are collapsed in preprocessing)
• Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string similarity, …
• Syntax-based features to promote aligning corresponding predicate-argument structures
• Decoding & learning similar to MANLI
Results: Stanford RTE aligner

System            RTE2 dev                        RTE2 test
                  P %    R %    F1 %   E %        P %    R %    F1 %   E %
Bag-of-words      57.8   81.2   67.5   3.5        62.1   82.6   70.9   5.3
GIZA++            83.0   66.4   72.1   9.4        85.1   69.1   74.8   11.3
Cross-EM          67.6   80.1   72.1   1.3        70.3   81.0   74.1   0.8
Stanford RTE      81.1   75.8*  78.4*  0.5        82.7   75.8*  79.1*  0.3

* includes (generous) correction for missed punctuation

• Better F1 than the MT aligners, but recall lags precision
• Stanford does a poor job of aligning function words
  • 13% of links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI only 10%)
• Also, Stanford fails to align multi-word phrases:
  peace activists ~ protestors, hackers ~ non-authorized personnel
Results: MANLI aligner

System            RTE2 dev                        RTE2 test
                  P %    R %    F1 %   E %        P %    R %    F1 %   E %
Bag-of-words      57.8   81.2   67.5   3.5        62.1   82.6   70.9   5.3
GIZA++            83.0   66.4   72.1   9.4        85.1   69.1   74.8   11.3
Cross-EM          67.6   80.1   72.1   1.3        70.3   81.0   74.1   0.8
Stanford RTE      81.1   75.8   78.4   0.5        82.7   75.8   79.1   0.3
MANLI             83.4   85.5   84.4   21.7       85.4   85.3   85.3   21.3

• MANLI outperforms all the others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
• Good balance of precision & recall
• Matched >20% of alignments exactly
MANLI results: discussion
• Three factors contribute to its success:
  1. Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  2. Contextual features enable matching function words
  3. Phrases: death penalty ~ capital punishment, abdicate ~ give up
• But phrases help less than expected!
  • If we set max phrase size = 1, we lose just 0.2% in F1
• Recall errors: room to improve
  • 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
• Precision errors are harder to reduce
  • equal function words (49%), forms of "be" (21%), punctuation (7%)
Can aligners predict RTE answers?
• So far we've been evaluating against gold-standard alignments
• But alignment is just one component of an NLI system
• Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, …
  • But the alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
• Using the alignment score to predict RTE answers (see the sketch below):
  • Predict YES if score > threshold
  • Tune the threshold on development data
  • Evaluate on test data
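
A small sketch of that procedure; the scoring function and the candidate thresholds are assumed inputs, not the paper's exact setup.

def predict_rte(problems, score, threshold):
    # Answer YES exactly when the alignment score clears the threshold.
    return ["yes" if score(P, H) > threshold else "no" for P, H in problems]

def tune_threshold(dev_problems, dev_labels, score, candidates):
    # Pick the candidate threshold with the best accuracy on the dev set.
    def accuracy(t):
        preds = predict_rte(dev_problems, score, t)
        return sum(p == g for p, g in zip(preds, dev_labels)) / len(dev_labels)
    return max(candidates, key=accuracy)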
Results: predicting RTE answers

System                    RTE2 dev             RTE2 test
                          Acc %   AvgP %       Acc %   AvgP %
Bag-of-words              61.3    61.5         57.9    58.9
Stanford RTE              63.1    64.9         60.9    59.2
MANLI                     59.3    69.0         60.3    61.0
RTE2 entries (average)      —       —          58.5    59.1
LCC [Hickl et al. 2006]     —       —          75.4    80.8

• No NLI aligner rivals the best complete RTE system
  • (Most) complete systems do a lot more than just alignment!
• But Stanford & MANLI beat the average entry for RTE2
• Many NLI systems could benefit from better alignments!
Conclusion
• MT aligners are not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
• MANLI succeeds by:
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
• The phrase-based representation shows potential :-)
  • But it is not yet proven: we need better phrase-based lexical resources

Thanks! Questions?
Backup slides follow
Related work
• Lots of past work on phrase-based MT
  • But most systems extract phrases from word-aligned data
  • Despite the assumption that many translations are non-compositional
• Recent work jointly aligns & weights phrases
  [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
• However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren't appropriate
  • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  • MT systems don't model word insertions & deletions