Transcript (PPTX)
Carnegie Mellon

Diversifiable Bootstrapping for Acquiring High-Coverage Paraphrase Resource
Hideki Shima, Teruko Mitamura
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, USA
LREC 2012, May 24th, 2012

Can a machine recognize the meaning similarity?
John killed Mary.
– passivization: Mary was killed by John.
– nominalization: John is the killer of Mary.
– entailment: John assassinated Mary.
– slang: John is the 187 suspect of Mary.
– euphemism: John terminated Mary with extreme prejudice.
187 means: “California penal code for murder, made popular in west coast gangsta rap”. – From The Urban Dictionary dot com
Usage: “This is Gavilan. In pursuit of possible 187 suspects.” – From the movie, Hollywood Homicide
“In military and other covert operations, terminate with extreme prejudice is a euphemism for execution” – Wikipedia
Humans use various expressions to convey the same or similar meaning, which makes it difficult for machines to “read” text.

Can a machine recognize the meaning similarity?
X killed Y.
– passivization: Y was killed by X.
– nominalization: X is the killer of Y.
– entailment: X assassinated Y.
– slang: X is the 187 suspect of Y.
– euphemism: X terminated Y with extreme prejudice.
Goal: automatically acquire paraphrase patterns that are lexically diverse

Paraphrase Recognition / Generation is a common need in various applications
Automatic Evaluation
– In Machine Translation [Kauchak & Barzilay, 2006][Padó et al., 2009]
– In Text Summarization [Zhou et al., 2006]
– In Question Answering [Ibrahim et al., 2003][Dalmas, 2007]
Text Summarization [Lloret et al., 2008][Tatar et al., 2009]
Information Retrieval [Parapar et al., 2005][Riezler et al., 2007]
Information Extraction [Romano et al., 2006]
Question Answering [Harabagiu & Hickl, 2006][Dogdan et al., 2008]
Collocation Error Correction [Dahlmeier and Ng, 2011]

Outline
Motivation
Method: Diversifiable Bootstrapping
Experiment
Related Works
Conclusion

Bootstrap Paraphrase Learning
INPUT: seed instances, monolingual plain corpus
BOOTSTRAP LEARNING ALGORITHM
OUTPUT: more instances, patterns

Seed instances (INPUT):
X (killer)              Y (victim)
John Wilkes Booth       Abraham Lincoln
Mark David Chapman      John Lennon
Nathuram Godse          Mahatma Gandhi
Yigal Amir              Yitzhak Rabin
John Bellingham         Spencer Perceval
Mohammed Bouyeri        Theo van Gogh
Dan White               Mayor George Moscone
Sirhan Sirhan           Robert F. Kennedy
El Sayyid Nosair        Meir Kahane
Mijailo Mijailovic      Anna Lindh

Patterns (OUTPUT):
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
:
Unlike many other bootstrapping works, the goal is to acquire patterns, not instances.

Bootstrap Learning Algorithm
[Diagram: Seed Instances → Sentences → Extracted Patterns → Ranked Patterns → Sentences → Extracted Instances → Ranked Instances → ... (1st iteration, 2nd iteration, ...)]
This framework is based on ESPRESSO [Pantel & Pennacchiotti, 2006]

Bootstrap Learning Algorithm: Search sentences by instances
Edwin Booth was brother of John Wilkes Booth, the assassin of Abraham Lincoln.
John Wilkes Booth, the assassin of Abraham Lincoln, was inspired by Brutus.
In 1969 Berman was part of the defense team of Sirhan Sirhan, the assassin of Robert F. Kennedy.
:
With the instances replaced by slots:
Edwin Booth was brother of X, the assassin of Y.
X, the assassin of Y, was inspired by Brutus.
In 1969 Berman was part of the defense team of X, the assassin of Y.
:

Bootstrap Learning Algorithm: Extract patterns from sentences
… brother of X, the assassin of Y.
X, the assassin of Y, was …
… team of X, the assassin of Y.
Extracted Pattern: Longest Common Substring among retrieved sentences

Bootstrap Learning Algorithm: Score and rank patterns
Rank by reliability of pattern: r(p). r(p) is based on an association measure with each instance in the corpus.
1. 0.422 X, the assassin of Y
2. 0.324 assassination of Y by X
3. 0.312 X assassinated Y
4. 0.231 the assassination of Y by X
5. 0.208 of X, the assassin of Y
:

Bootstrap Learning Algorithm: Search sentences by pattern(s)
Still shot from the CCTV video footage showing Oguen Samast, the assassin of Hrant Dink.
Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.
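
The pattern-extraction step above (replace the instance pair with X and Y, then take the longest common substring among the retrieved sentences) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the regex tokenizer, and the choice to compare sentences pairwise at the token level are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def generalize(sentence, x, y):
    """Replace the instance pair with the slot symbols X and Y."""
    return sentence.replace(x, "X").replace(y, "Y")


def tokenize(text):
    """Split into word tokens and single punctuation marks (assumed tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def longest_common_substring(a, b):
    """Longest contiguous run of tokens shared by token lists a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def render(tokens):
    """Trim punctuation at both ends and re-attach inner punctuation."""
    while tokens and not tokens[0].isalnum():
        tokens = tokens[1:]
    while tokens and not tokens[-1].isalnum():
        tokens = tokens[:-1]
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(tokens))


def extract_patterns(generalized_sentences):
    """Keep pairwise common substrings that contain both slots."""
    token_lists = [tokenize(s) for s in generalized_sentences]
    patterns = set()
    for a, b in combinations(token_lists, 2):
        common = longest_common_substring(a, b)
        if "X" in common and "Y" in common:
            patterns.add(render(common))
    return patterns


# Sentences and instance pairs taken from the slides.
retrieved = [
    ("Edwin Booth was brother of John Wilkes Booth, the assassin of Abraham Lincoln.",
     "John Wilkes Booth", "Abraham Lincoln"),
    ("John Wilkes Booth, the assassin of Abraham Lincoln, was inspired by Brutus.",
     "John Wilkes Booth", "Abraham Lincoln"),
    ("In 1969 Berman was part of the defense team of Sirhan Sirhan, the assassin of Robert F. Kennedy.",
     "Sirhan Sirhan", "Robert F. Kennedy"),
]
patterns = extract_patterns([generalize(s, x, y) for s, x, y in retrieved])
print(sorted(patterns))
```

Comparing sentences pairwise is one simple way to read "longest common substring among retrieved sentences"; a substring common to all sentences at once would yield fewer, shorter patterns.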

Bootstrap Learning Algorithm: Extract instances from sentences
Still shot from the CCTV video footage showing Oguen Samast, the assassin of Hrant Dink.
Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.

Bootstrap Learning Algorithm: Score and rank instances
Rank instances by reliability: r(i) (similar to pattern reliability scoring)

Issue: Lack of Lexical Diversity
Words participating in patterns are skewed:
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
As a solution, we propose the Diversifiable Bootstrapping

Diversifiable Bootstrapping
$$ r'(p) = \lambda \cdot r(p) + (1 - \lambda) \cdot \mathrm{diversity}(p), \qquad 0 \le \lambda \le 1 $$
– r(p): original reliability score of a pattern
– diversity(p): how lexically different a pattern is from the other patterns originally ranked higher than it
– λ: interpolation parameter
Key contribution: by tuning the parameter λ, one can control the degree to which the acquired patterns are diversified.

Experimental Settings
Bootstrapping Algorithm
– Based on ESPRESSO framework [Pantel & Pennacchiotti, 2006]
– Unlike ESPRESSO, we aim to obtain patterns, not instances
Lexical diversity scoring function:
– Based on Shima & Mitamura [2011]
Seed instances: Schlaefer et al. [2006]
Corpus: English Wikipedia

Acquired Paraphrases: killed
λ = 1 (no diversification):
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
X, the man who assassinated Y
Y's assassin, X
of Y's assassin X
of the assassination of Y by X
X shot and killed Y
Y was assassinated by X
named X assassinated Y
Y was shot by X
X to assassinate Y
λ = 0.7:
X, the assassin of Y
X assassinated Y
assassination of Y by X
Y was shot by X
X, who killed Y
the assassination of Y by X
X assassinated Y in
X tells his version of Y
X shoot Y
X murdered Y
Y's killer, X
Y, at the theatre after X
Y, push X to his breaking point
X to assassinate Y
of X, the assassin of Y
λ = 0.3:
X, the assassin of Y
X, who killed Y
Y was shot by X
X tells his version of Y
X shoot Y
X murdered Y
Y's killer, X
Y, at the theatre after X
Y, push X to his breaking point
X assassinated Y
assassination of Y by X
X to assassinate Y
X kills Y
of X shooting Y
X assassinated Y in

Acquired Paraphrases: died-of
λ = 1:
X died of Y
X died of Y in
X died of Y on
X died of lung Y
X died of lung Y in
X died of lung Y on
X died of Y in the
X died of Y at
X died of stomach Y
X died of natural Y
X died of breast Y in
X died of a Y
X died of Y in his
X passed away from Y
X died of a Y in
λ = 0.7:
X died of Y in
X died of Y
X's death from Y
X passed away from Y
Y of X, news
Y of X, a former
that X was suffering from Y
the suspected Y of X
X to breast Y in
X was diagnosed with ovarian Y
X dies of Y
X was dying of Y
X died of lung Y
X died of Y on
X died of lung Y in
λ = 0.3:
X died of Y in
X's death from Y
X passed away from Y
Y of X, news
Y of X, a former
that X was suffering from Y
the suspected Y of X
X succumbed to lung Y
X to breast Y in
X was diagnosed with ovarian Y
X dies of Y
X was dying of Y
X died of Y
X's death from Y in
X died of lung Y

Acquired Paraphrases: was-led-by
λ = 1:
Y came to power in X in
Y came to power in X
Y to power in X
Y came to power in X in the
when Y came to power in X in
when Y came to power in X
Y took power in X
Y rose to power in X
after Y came to power in X
Y became chancellor of X
Y came to power in X and
Y seized power in X
Y gained power in X
to power of Y in X
Y's rise to power in X
λ = 0.7:
Y came to power in X
Y to power in X
regime of Y in X
Y came to power in X in
Y to power in X in
Y became chancellor of X
the rise of Y in X
X's dictator Y
X's president Y
Y took control of X
Y, who ruled X
Y's success and X's saviour Y declared that X had
X's leader Y
government of Y in X
λ = 0.3:
Y came to power in X in
regime of Y in X
X's dictator Y
Y became chancellor of X
X's president Y
the rise of Y in X
X's leader Y
Y, who ruled X
Y took control of X
government of Y in X
X, led by Y
quisling had visited Y in X
to flee X after Y
Y in X the year before
X, under the leadership of Y

Related Works – Use of Thesaurus
E.g., WordNet [Miller, 1995], FrameNet [Baker et al., 1998], Nomlex [Macleod et al., 1998], VerbNet [Kipper et al., 2006]
Synonyms of “lead (v)” in WordNet:
ID     Words                                  Definition
S1     lead, take, direct, conduct, guide     take somebody somewhere
S2     leave, result, lead                    produce as a result or residue
:
S6     run, go, pass, lead, extend            stretch out over a distance, space, time, or scope
:
S14    moderate, chair, lead                  preside over
WEAKNESS: Need word sense disambiguation (WSD) or contexts to avoid false positives.

Related Works – Paraphrase Acquisition
Alignment Approach
– Monolingual Comparable Corpus [Shinyama et al., 2002]
– Bilingual Parallel Corpus [Barzilay & McKeown, 2001][Bannard & Callison-Burch, 2005][Callison-Burch, 2008]
Distributional Approach
– Context as Vector Space [Pasca & Dienes, 2005][Bhagat & Ravichandran, 2008]
– Context as Surface Pattern [Lin & Pantel, 2001][Ravichandran & Hovy, 2002]

Related Works – Paraphrase Acquisition
Columns: [Bannard & Callison-Burch, 2005] | [Callison-Burch, 2008] | [Bhagat & Ravichandran, 2008] | [Pasca & Dienes, 2005]
murdered died beaten been killed are lost were killed kill have died murdered dead death deaths died victims killing been killed killed in killed , that killed killed NN people killed NN killed by were wounded in and wounding dead , including , hundreds used made involved found born done injured seen taken released
Paraphrases acquired by Metzler et al. [2011]

Differences from Related Works
Our work requires just a plain non-parallel corpus
– Language portability:
  • Good news for resource/tool-scarce languages
– There is potential to learn words used in a closed community (slang, technical terms, etc.) by providing a domain-specific corpus
Bootstrapping works iteratively with minimum supervision
– Less human effort is required than with heavily supervised learning methods, or with relying on domain experts to hand-craft patterns.

Conclusion
We proposed Diversifiable Bootstrapping, which can acquire lexically diverse paraphrase patterns.
We gave initial experimental results on a few relations, which look promising.
As future work, we hope to conduct formal evaluations on a larger set of relations and in different languages.

Acknowledgment
This publication was made possible in part by an NPRP grant (No: 09-873-1-129) from the Qatar National Research Fund (a member of The Qatar Foundation). The statements made herein are solely the responsibility of the authors.
We also gratefully acknowledge the support of Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0172. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government.

Questions?
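
As a recap of the Diversifiable Bootstrapping slides, the interpolated score r'(p) = λ · r(p) + (1 − λ) · diversity(p) can be illustrated with a small sketch. The reliability scores are the five values shown on the "Score and rank patterns" slide. The diversity function here is a hypothetical stand-in (one minus the highest Jaccard word overlap with any higher-ranked pattern), not the Shima & Mitamura [2011] function used in the talk, and all function names are assumptions made for the example.

```python
import re


def content_words(pattern):
    """Lowercased words of a pattern, ignoring the X and Y slots."""
    return {w.lower() for w in re.findall(r"[A-Za-z]+", pattern)
            if w not in ("X", "Y")}


def jaccard(a, b):
    """Jaccard overlap of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversity(pattern, higher_ranked):
    """Hypothetical diversity: 1 - max word overlap with higher-ranked patterns."""
    if not higher_ranked:
        return 1.0  # assumption: the top-ranked pattern is maximally diverse
    words = content_words(pattern)
    return 1.0 - max(jaccard(words, content_words(q)) for q in higher_ranked)


def rerank(scored_patterns, lam):
    """Apply r'(p) = lam * r(p) + (1 - lam) * diversity(p) and re-sort."""
    ordered = sorted(scored_patterns, key=lambda pr: pr[1], reverse=True)
    rescored = []
    for i, (pattern, r) in enumerate(ordered):
        higher = [q for q, _ in ordered[:i]]
        rescored.append((pattern, lam * r + (1 - lam) * diversity(pattern, higher)))
    return sorted(rescored, key=lambda pr: pr[1], reverse=True)


# Reliability scores r(p) as listed on the "Score and rank patterns" slide.
scored = [
    ("X, the assassin of Y", 0.422),
    ("assassination of Y by X", 0.324),
    ("X assassinated Y", 0.312),
    ("the assassination of Y by X", 0.231),
    ("of X, the assassin of Y", 0.208),
]

for lam in (1.0, 0.3):
    print(lam, [p for p, _ in rerank(scored, lam)])
```

With λ = 1 the original ranking is unchanged; with a smaller λ, patterns that share few words with higher-ranked ones move up, which is the behaviour the slides describe. The exact reordering depends entirely on the assumed diversity function.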