A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
26 October 2008

Outline
Introduction • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion

Natural language inference (NLI) (aka RTE)
• Does premise P justify an inference to hypothesis H?
• An informal notion of inference; variability of linguistic expression
  P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
  H: Gazprom will double Georgia's gas bill.
  Answer: yes
• Like MT, NLI depends on a facility for alignment
• I.e., linking corresponding words/phrases in two related sentences

Alignment example
[Figure: alignment grid linking the tokens of H (hypothesis) to the tokens of P (premise), illustrating:]
• unaligned content: "deletions" from P
• approximate match: price ~ bill
• phrase alignment: two-fold increase ~ double

Approaches to NLI alignment
• Alignment addressed variously by current NLI systems
• In some approaches to NLI, alignments are implicit:
  • NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
• Other NLI systems make the alignment step explicit:
  • Align first, then determine inferential validity [Marsi & Kramer 05, MacCartney et al. 06]
• What about using an MT aligner?
  • Alignment is familiar in MT, with an extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  • Can tools & techniques of MT alignment transfer to NLI?

NLI alignment vs. MT alignment
Doubtful: NLI alignment differs in several respects:
1. Monolingual: can exploit resources like WordNet
2. Asymmetric: P often longer & has content unrelated to H
3. Cannot assume semantic equivalence
   • NLI aligner must accommodate frequent unaligned content
4. Little training data available
   • MT aligners use unsupervised training on huge amounts of bitext
   • NLI aligners must rely on supervised training & much less data

Contributions of this paper
In this paper, we:
1. Undertake the first systematic study of alignment for NLI
   • Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
2. Examine the relation between alignment in NLI and MT
   • How do existing MT aligners perform on the NLI alignment task?
3. Propose a new model of alignment for NLI: MANLI
   • Outperforms existing MT & NLI aligners on the NLI alignment task

The MANLI aligner
A model of alignment for NLI consisting of four components:
1. Phrase-based representation
2. Feature-based scoring function
3. Decoding using simulated annealing
4. Perceptron learning

Phrase-based alignment representation
Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS
  EQ(Gazprom1, Gazprom1)
  INS(will2)
  DEL(today2)
  DEL(confirmed3)
  DEL(a4)
  SUB(two-fold5 increase6, double3)
  DEL(in7)
  DEL(its8)
  …
• One-to-one at the phrase level (but many-to-many at the token level)
• Avoids arbitrary alignment choices; can use phrase-based resources
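To make the representation concrete, here is a minimal, hypothetical sketch (not the authors' code) of how an edit sequence like the one above might be encoded. The EditType, Phrase, and Edit names are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class EditType(Enum):   # the four edit types from the slide
    EQ = "EQ"           # equal phrases
    SUB = "SUB"         # substitution (approximate match)
    DEL = "DEL"         # phrase present only in the premise P
    INS = "INS"         # phrase present only in the hypothesis H

@dataclass
class Phrase:
    tokens: List[str]   # contiguous tokens, e.g. ["two-fold", "increase"]
    start: int          # token offset in its sentence

@dataclass
class Edit:
    type: EditType
    p_phrase: Optional[Phrase] = None   # None for INS
    h_phrase: Optional[Phrase] = None   # None for DEL

# An alignment is a sequence of edits covering P and H, one-to-one at the phrase level
Alignment = List[Edit]

example: Alignment = [
    Edit(EditType.EQ,  Phrase(["Gazprom"], 0), Phrase(["Gazprom"], 0)),
    Edit(EditType.INS, None,                   Phrase(["will"], 1)),
    Edit(EditType.DEL, Phrase(["today"], 1),   None),
    Edit(EditType.SUB, Phrase(["two-fold", "increase"], 4), Phrase(["double"], 2)),
]
```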
A feature-based scoring function
• Score each edit as a linear combination of features, then sum over edits:
• Edit type features: EQ, SUB, DEL, INS
• Phrase features: phrase sizes, non-constituents
• Lexical similarity feature: max over similarity scores
  • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  • Distributional similarity à la Dekang Lin
  • Various measures of string/lemma similarity
• Contextual features: distortion, matching neighbors

Decoding using simulated annealing
1. Start …
2. Generate successors
3. Score
4. Smooth/sharpen: P(A) = P(A)^(1/T)
5. Sample
6. Lower temperature: T = 0.9 T
7. Repeat … 100 times

Perceptron learning of feature weights
We use a variant of averaged perceptron [Collins 2002]:

  Initialize weight vector w = 0, learning rate R0 = 1
  For training epoch i = 1 to 50:
    For each problem (Pj, Hj) with gold alignment Ej:
      Set Êj = ALIGN(Pj, Hj, w)
      Set w = w + Ri (Φ(Ej) − Φ(Êj))   (Φ(E) is the feature vector of alignment E)
    Set w = w / ‖w‖2      (L2 normalization)
    Set w[i] = w          (store weight vector for this epoch)
    Set Ri = 0.8 Ri−1     (reduce learning rate)
  Throw away the weight vectors from the first 20% of epochs
  Return the average of the remaining weight vectors

Training runs require about 20 hours (on 800 RTE problems)
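Below is a short, hedged Python sketch that mirrors the pseudocode above. The align() and phi() callables are placeholders (the real decoder is the simulated-annealing search described earlier), and the exact data layout is an assumption; it is meant as an illustration, not as the authors' implementation.

```python
import numpy as np

def train_perceptron(problems, align, phi, n_features,
                     epochs=50, r0=1.0, decay=0.8, burn_in=0.2):
    """Averaged-perceptron training in the style of the pseudocode above.

    problems  : list of (P, H, gold_alignment) triples
    align     : align(P, H, w) -> predicted alignment (e.g. annealing decoder)
    phi       : phi(alignment) -> feature vector (np.ndarray of length n_features)
    """
    w = np.zeros(n_features)
    stored = []
    rate = r0
    for epoch in range(epochs):
        for P, H, gold in problems:
            guess = align(P, H, w)                   # decode with current weights
            w = w + rate * (phi(gold) - phi(guess))  # perceptron update
        norm = np.linalg.norm(w)
        if norm > 0:
            w = w / norm                             # L2-normalize after each epoch
        stored.append(w.copy())                      # store this epoch's weights
        rate *= decay                                # reduce the learning rate
    keep = stored[int(burn_in * epochs):]            # discard first 20% of epochs
    return np.mean(keep, axis=0)                     # average the rest
```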
The MSR RTE2 alignment data
• Previously, little supervised data
• Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
• Token-based, but many-to-many
  • allows implicit alignment of phrases
• 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • merged using majority rule

Evaluation on MSR data
• We evaluate several systems on the MSR data:
  • A simple baseline aligner
  • MT aligners: GIZA++ & Cross-EM
  • NLI aligners: Stanford RTE, MANLI
• How well do they recover gold-standard alignments?
  • We report per-link precision, recall, and F1
  • We also report the exact match rate for complete alignments

Baseline: bag-of-words aligner
Match each H token to most similar P token [cf. Glickman et al. 2005]

System         RTE2 dev                   RTE2 test
               P%    R%    F1%   E%       P%    R%    F1%   E%
Bag-of-words   57.8  81.2  67.5  3.5      62.1  82.6  70.9  5.3

• Surprisingly good recall, despite extreme simplicity
• But very mediocre precision, F1, & exact match rate
• Main problem: aligns every token in H
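To show just how simple this baseline is, here is a toy sketch in the spirit of the lexical-overlap approach. The token_sim() similarity function is a stand-in (the slide does not specify the measure), so treat this as an assumption-laden illustration rather than the actual baseline.

```python
def bag_of_words_align(p_tokens, h_tokens, token_sim):
    """Toy bag-of-words aligner: link every H token to its most similar P token.

    token_sim(h, p) -> similarity score; a stand-in for whatever lexical
    similarity one chooses (exact match, lemma match, WordNet, ...).
    Every H token gets a link, which is exactly why precision suffers.
    """
    links = []
    for i, h in enumerate(h_tokens):
        j = max(range(len(p_tokens)), key=lambda k: token_sim(h, p_tokens[k]))
        links.append((i, j))   # (H index, P index)
    return links

# Example with a trivial similarity function (1.0 for identical lowercased tokens)
if __name__ == "__main__":
    sim = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
    P = "Gazprom today confirmed a two-fold increase in its gas price".split()
    H = "Gazprom will double the gas price".split()
    print(bag_of_words_align(P, H, sim))
```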
MT aligners: GIZA++ & Cross-EM
• Can we show that MT aligners aren't suitable for NLI?
• Run GIZA++ via Moses, with default parameters
  • Train on dev set, evaluate on dev & test sets
  • Asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic
• Initial results are very poor: 56% F1
  • Doesn't even align equal words
  • Remedy: add a lexicon of equal words as extra training data
• Do similar experiments with the Berkeley Cross-EM aligner

Results: MT aligners

System         RTE2 dev                   RTE2 test
               P%    R%    F1%   E%       P%    R%    F1%   E%
Bag-of-words   57.8  81.2  67.5  3.5      62.1  82.6  70.9  5.3
GIZA++         83.0  66.4  72.1  9.4      85.1  69.1  74.8  11.3
Cross-EM       67.6  80.1  72.1  1.3      70.3  81.0  74.1  0.8

• Similar F1, but GIZA++ wins on precision, Cross-EM on recall
• Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
• Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources

The Stanford RTE aligner
• Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments not directly representable
  • (But named entities & collocations are collapsed in preprocessing)
• Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string similarity, …
• Syntax-based features to promote aligning corresponding predicate-argument structures
• Decoding & learning similar to MANLI

Results: Stanford RTE aligner

System         RTE2 dev                     RTE2 test
               P%    R%     F1%    E%       P%    R%     F1%    E%
Bag-of-words   57.8  81.2   67.5   3.5      62.1  82.6   70.9   5.3
GIZA++         83.0  66.4   72.1   9.4      85.1  69.1   74.8   11.3
Cross-EM       67.6  80.1   72.1   1.3      70.3  81.0   74.1   0.8
Stanford RTE   81.1  75.8*  78.4*  0.5      82.7  75.8*  79.1*  0.3

* includes (generous) correction for missed punctuation

• Better F1 than the MT aligners, but recall lags precision
• Stanford does a poor job aligning function words
  • 13% of links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI only 10%)
• Also, Stanford fails to align multi-word phrases:
  peace activists ~ protestors, hackers ~ non-authorized personnel

Results: MANLI aligner

System         RTE2 dev                    RTE2 test
               P%    R%    F1%   E%        P%    R%    F1%   E%
Bag-of-words   57.8  81.2  67.5  3.5       62.1  82.6  70.9  5.3
GIZA++         83.0  66.4  72.1  9.4       85.1  69.1  74.8  11.3
Cross-EM       67.6  80.1  72.1  1.3       70.3  81.0  74.1  0.8
Stanford RTE   81.1  75.8  78.4  0.5       82.7  75.8  79.1  0.3
MANLI          83.4  85.5  84.4  21.7      85.4  85.3  85.3  21.3

• MANLI outperforms all others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
• Good balance of precision & recall
• Matched >20% exactly

MANLI results: discussion
• Three factors contribute to success:
  1. Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  2. Contextual features enable matching function words
  3. Phrases: death penalty ~ capital punishment, abdicate ~ give up
• But phrases help less than expected!
  • If we set max phrase size = 1, we lose just 0.2% in F1
• Recall errors: room to improve
  • 40%: need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
• Precision errors harder to reduce
  • equal function words (49%), forms of be (21%), punctuation (7%)

Can aligners predict RTE answers?
• We've been evaluating against gold-standard alignments
• But alignment is just one component of an NLI system
• Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, …
  • But the alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
• Using alignment score to predict RTE answers:
  • Predict YES if score > threshold
  • Tune threshold on development data
  • Evaluate on test data

Results: predicting RTE answers

System                    RTE2 dev           RTE2 test
                          Acc%   AvgP%       Acc%   AvgP%
Bag-of-words              61.3   61.5        57.9   58.9
Stanford RTE              63.1   64.9        60.9   59.2
MANLI                     59.3   69.0        60.3   61.0
RTE2 entries (average)    —      —           58.5   59.1
LCC [Hickl et al. 2006]   —      —           75.4   80.8

• No NLI aligner rivals the best complete RTE system
  • (Most) complete systems do a lot more than just alignment!
• But Stanford & MANLI beat the average entry for RTE2
  • Many NLI systems could benefit from better alignments!

Conclusion
• MT aligners not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
• MANLI succeeds by:
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
• Phrase-based representation shows potential :-)
  • But not yet proven: need better phrase-based lexical resources

Thanks! Questions?

Backup slides follow

Related work
• Lots of past work on phrase-based MT
• But most systems extract phrases from word-aligned data
  • Despite the assumption that many translations are noncompositional
• Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
• However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren't appropriate
  • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  • MT systems don't model word insertions & deletions
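As a final illustrative note (not part of the original deck): the score-thresholding procedure from the "Can aligners predict RTE answers?" slide is easy to sketch. The grid search over observed dev scores is an assumption for illustration, not necessarily how the thresholds in the table above were tuned.

```python
def tune_threshold(dev_scores, dev_labels):
    """Pick the alignment-score threshold that maximizes accuracy on dev data.

    dev_scores : list of alignment scores, one per dev problem
    dev_labels : list of booleans (True = YES / entailment)
    Naive search over candidate thresholds taken from the observed scores.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(dev_scores)):
        preds = [s > t for s in dev_scores]
        acc = sum(p == y for p, y in zip(preds, dev_labels)) / len(dev_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict_rte(test_scores, threshold):
    """Predict YES iff the alignment score exceeds the tuned threshold."""
    return ["YES" if s > threshold else "NO" for s in test_scores]
```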