Knowledge Curation and Knowledge Fusion: Challenges, Models, and Applications Xin Luna Dong (Google Inc.) Divesh Srivastava (AT&T Labs-Research)
Knowledge Is Power
– Many knowledge bases (KBs), e.g., the Google knowledge graph

Using KB in Search

Using KB in Social Media

What Is a Knowledge Base? (1) Entities, entity types
– An entity is an instance of an entity type
– Entity types are organized in a hierarchy, e.g., all > person > politician; all > place

What Is a Knowledge Base? (2) Predicates, subject-predicate-object triples
– An SPO triple is an instance of a predicate, e.g., is-born-in(person, place)
– Other modeling constructs: rules, reification, etc.

What Is a Knowledge Base? (3) A knowledge base is a graph with entity nodes and SPO triple edges
– E.g., edges labeled is-born-in, is-located-in

Knowledge Bases: Scope
– Domain-specific knowledge base: focus is on a well-defined domain
– E.g., IMDB for movies, MusicBrainz for music
– Global knowledge base: covers a variety of knowledge across domains
– Intensional: Cyc, WordNet
– Extensional: Freebase, Knowledge Graph, Yago/Yago2, DeepDive, NELL, Prospera, ReVerb, Knowledge Vault

Knowledge Bases: Comparison [DGH+14]
  Name | # Entity Types | # Entities | # Predicates | # Confident Triples
  Knowledge Vault (KV) | 1100 | 45M | 4469 | 271M
  DeepDive | 4 | 2.7M | 34 | 7M
  NELL | 271 | 5.1M | 306 | 0.435M
  PROSPERA | 11 | N/A | 14 | 0.1M
  Yago2 | 350,000 | 9.8M | 100 | 150M
  Freebase | 1500 | 40M | 35,000 | 637M
  Knowledge Graph | 1500 | 570M | 35,000 | 18,000M

Outline: Motivation; Knowledge extraction; Knowledge fusion; Interesting applications; Future directions

Knowledge Bases: Population
– Approach 1: use well-structured sources and manual curation
– Manual curation is resource intensive, e.g., Cyc, WordNet, Freebase, Yago, Knowledge Graph
– Approach 2: use less structured sources and automated knowledge extraction
– (Possibly) bootstrap from existing knowledge bases
– Extractors extract knowledge from a variety of sources

Knowledge Extraction: Challenges
– Identifying semantically meaningful triples is challenging
– E.g., "Hamas claimed responsibility for the Gaza attack"
– Synonyms: multiple representations of the same entities and predicates
– E.g., (DC, is capital of, United States); (Washington, is capital city of, US)
– Linkage is more challenging than in structured databases, where one can take advantage of schema information
– Polysemy: similar representations of different entities
– E.g., "I enjoyed watching Chicago on Broadway" vs. "Chicago has the largest population in the Midwest"
– Distinguishing them is more challenging than in structured databases, where one can take advantage of schema information
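Before moving to extraction, here is a minimal sketch of the SPO data model introduced above: a toy knowledge base stored as triples with a small type hierarchy, viewed as a graph. The entity, type, and predicate names are illustrative only and do not come from any particular KB.

```python
from collections import defaultdict

# Toy knowledge base: a type hierarchy, typed entities, and SPO triples.
type_parent = {"politician": "person", "person": "all", "place": "all"}
entity_type = {"BarackObama": "politician", "Honolulu": "place", "USA": "place"}
triples = {
    ("BarackObama", "is-born-in", "Honolulu"),
    ("Honolulu", "is-located-in", "USA"),
}

# View the KB as a graph: entity nodes, one labeled edge per SPO triple.
out_edges = defaultdict(list)
for s, p, o in triples:
    out_edges[s].append((p, o))

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the KB."""
    return [o for p, o in out_edges[subject] if p == predicate]

def is_a(entity, typ):
    """Type membership, walking up the type hierarchy."""
    t = entity_type.get(entity)
    while t is not None:
        if t == typ:
            return True
        t = type_parent.get(t)
    return False

print(objects("BarackObama", "is-born-in"))   # ['Honolulu']
print(is_a("BarackObama", "person"))          # True
```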
Knowledge Extraction: Tasks
– Triple identification
– Entity linkage
– Predicate linkage (e.g., to canonical predicates such as /people/person/date_of_birth)

Triple Identification: Paradigms (I)
– Domain-specific knowledge base 1: pattern based [FGG+14] (e.g., for /people/person/date_of_birth)
– Supervised, uses labeled training examples
– High precision, low recall because of linguistic variety
– Domain-specific knowledge base 2: iterative, pattern based
– Strengths: higher recall than domain-specific 1
– Weaknesses: lower precision due to concept drift

Triple Identification: Paradigms (II)
– Global knowledge base: distant supervision [MBS+09]
– Weakly supervised, via a large existing KB (e.g., Freebase)
– Use entities and predicates in the KB to provide training examples
– Extractors focus on learning the ways KB predicates are expressed in text
– Example: (Spielberg, /film/director/film, Saving Private Ryan)
– "Spielberg's film Saving Private Ryan is based on the brothers' story"
– "Allison co-produced Saving Private Ryan, directed by Spielberg"

Triple Identification: Distant Supervision
– Key assumption: if a pair of entities participate in a KB predicate, a sentence containing that pair of entities is likely to express that predicate
– Example: "Spielberg's film Saving Private Ryan is based on the brothers' story"; "Allison co-produced Saving Private Ryan, directed by Spielberg"
– Solution approach: obtains a robust combination of noisy features
– Uses a named entity tagger to label candidate entities in sentences
– Trains a multiclass logistic regression classifier, using all features
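To make the recipe concrete, here is a minimal sketch of how the key assumption turns KB triples into labeled training data for such a classifier. The tiny "KB", the pre-tokenized sentences, and the crude "words between the two mentions" feature are all invented for illustration (the feature is a stand-in for the lexical and syntactic features detailed next), and the sketch assumes scikit-learn is available.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy KB: entity pair -> predicate.
kb = {("Spielberg", "Saving Private Ryan"): "/film/director/film"}

# Pre-tokenized sentences with the two entity mentions already identified.
sentences = [
    ("Spielberg 's film Saving_Private_Ryan is based on the brothers ' story".split(),
     "Spielberg", "Saving_Private_Ryan"),
    ("Allison co-produced Saving_Private_Ryan , directed by Spielberg".split(),
     "Spielberg", "Saving_Private_Ryan"),
    ("McCain fought hard against Obama , but finally lost the election".split(),
     "McCain", "Obama"),
]

def features(tokens, e1, e2):
    """Crude lexical feature: the words between the two mentions."""
    i, j = tokens.index(e1), tokens.index(e2)
    lo, hi = sorted((i, j))
    return {"between=" + " ".join(tokens[lo + 1:hi]): 1}

X, y = [], []
for tokens, e1, e2 in sentences:
    # Key assumption: a sentence mentioning both entities of a KB triple
    # likely expresses that triple's predicate; otherwise label it NONE.
    y.append(kb.get((e1.replace("_", " "), e2.replace("_", " ")), "NONE"))
    X.append(features(tokens, e1, e2))

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform([features(*sentences[0])])))
```

In practice the classifier is multiclass over thousands of KB predicates and relies on exact matching of many conjunctive features, which is why the approach needs web-scale amounts of text.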
Triple Identification: Distant Supervision (features)
– Lexical and syntactic features: sequence of words and POS tags in the [entity1 – K, entity2 + K] window
– Parsed dependency path between the entities, plus adjacent context nodes
– Use exact matching of conjunctive features: rely on big data!
– Example: new triples identified using Freebase
  Predicate | Size | New triples
  /location/location/contains | 253,223 | (Paris, Montmartre)
  /people/person/profession | 208,888 | (Thomas Mellon, Judge)

Triple Identification: Distant Supervision (assessment)
– Strengths: canonical names for predicates in triples
– Robustness: benefits of supervised IE without concept drift
– Scalability: extracts a large number of triples for quite a large number of predicates
– Weaknesses: triple identification is limited by the predicates in the KB

Triple Identification: Paradigms (III)
– Global knowledge base: open information extraction (Open IE)
– No pre-specified vocabulary
– Extractors focus on generic ways to express predicates in text
– Systems: TextRunner (1st gen), ReVerb (2nd gen) [FSE11]
– Example: "McCain fought hard against Obama, but finally lost the election"

Triple Identification: Open IE [EFC+11]
– Architecture for (subject, predicate, object) identification
– Use the POS-tagged and NP-chunked sentence
– First, identify the predicate phrase
– Second, find the pair of NP arguments: subject and object
– Finally, compute a confidence score using a logistic regression classifier
– Example: "Hamas claimed responsibility for the Gaza attack"

Open IE: Predicate Phrase Identification
– Use syntactic constraints: for each verb, find the longest sequence of words expressed as a light verb construction (LVC) regular expression
– Avoids incoherent, uninformative predicates
– Example: "Hamas claimed responsibility for the Gaza attack" yields the predicate "claimed responsibility for" rather than just "claimed"

Open IE: Subject, Object Identification
– Use classifiers to learn argument boundaries; heuristics such as taking simple noun phrases are inadequate
– Subject: learn both left and right boundaries
– Object: need to learn only the right boundary
– Examples: "The cost of the war against Iraq has risen above 500 billion dollars"; "The plan would reduce the number of teenagers who start smoking"

Triple Identification: Open IE (assessment)
– Strengths: no limitation to a pre-specified vocabulary
– Scalability: extracts a large number of triples for a large number of predicates
– Weaknesses: no canonical names for entities or predicates in triples
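A minimal sketch of the predicate-phrase constraint and argument identification described above. The sentence is hand-tagged so the example has no NLP-toolkit dependency, the single regular expression over coarse tag classes is a simplification of ReVerb's actual V | VP | VW*P pattern, and the contiguous-noun-phrase heuristic stands in for the learned argument-boundary classifiers.

```python
import re

# Hand-tagged sentence (word, Penn Treebank POS tag).
tagged = [("Hamas", "NNP"), ("claimed", "VBD"), ("responsibility", "NN"),
          ("for", "IN"), ("the", "DT"), ("Gaza", "NNP"), ("attack", "NN")]

def tag_class(tag):
    """Collapse POS tags into coarse classes: V(erb), N(oun), J(adj),
    A(dverb), D(eterminer/pronoun), I(preposition/particle/inf. marker)."""
    if tag.startswith("VB"): return "V"
    if tag.startswith("NN"): return "N"
    if tag.startswith("JJ"): return "J"
    if tag.startswith("RB"): return "A"
    if tag in ("IN", "TO", "RP"): return "I"
    if tag in ("DT", "PRP", "PRP$"): return "D"
    return "X"

classes = "".join(tag_class(t) for _, t in tagged)   # "NVNIDNN"
PRED = re.compile(r"V[NJAD]*I|V")                    # verb ... ending in a preposition, or bare verb
match = max(PRED.finditer(classes), key=lambda m: m.end() - m.start())
predicate = " ".join(w for w, _ in tagged[match.start():match.end()])

def noun_phrase(start, step):
    """Crude argument heuristic: contiguous det/adj/noun tokens next to the predicate."""
    words, i = [], start
    while 0 <= i < len(tagged) and tag_class(tagged[i][1]) in ("N", "J", "D"):
        words.append(tagged[i][0]); i += step
    return " ".join(reversed(words)) if step < 0 else " ".join(words)

subject = noun_phrase(match.start() - 1, -1)
obj = noun_phrase(match.end(), +1)
print((subject, predicate, obj))
# ('Hamas', 'claimed responsibility for', 'the Gaza attack')
```

Taking the longest match is what avoids the incoherent extraction (Hamas, claimed, the Gaza attack) that a naive verb-only predicate would produce.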
Entity Linkage: Disambiguation to Wikipedia (D2W)
– Goal: output the Wiki titles corresponding to noun phrases in text; identifies canonical entities
– Examples: "I enjoyed watching Chicago on Broadway"; "Chicago has the largest population in the Midwest"

Entity Linkage: D2W Approaches
– Local D2W approaches [BP06, MC07]: each mention in a document is disambiguated independently
– E.g., use the similarity of the text near the mention with the candidate Wiki page
– Global D2W approaches [C07, RRD+11]: all mentions in a document are disambiguated simultaneously
– E.g., utilize the Wikipedia link graph to estimate coherence

Global D2W Approach
– Goal: a many-to-1 matching on a bipartite graph of mentions (e.g., Chicago, Midwest) and Wiki titles
– Approach: solve an optimization problem involving
– Φ(m, t): relatedness between mention m and Wiki title t
– Ψ(t, t'): pairwise relatedness between Wiki titles
– E.g., choose the assignment with Φ(m1, t1) + Φ(m2, t3) + Ψ(t1, t3) > Φ(m1, t2) + Φ(m2, t3) + Ψ(t2, t3)

Global D2W Approach [RRD+11]
– Represent Φ and Ψ as weighted sums of local and global features: Φ(m, t) = Σ_i w_i · Φ_i(m, t); learn the weights w_i using an SVM
– Build an index of (Anchor Text, Wiki Title, Frequency)
– Local features Φ_i(m, t): relatedness between mention and title
– P(t|m): fraction of times title t is the target page of anchor text m
– P(t): fraction of all Wiki pages that link to title t
– Cosine-sim(Text(t), Text(m)), Cosine-sim(Text(t), Context(m)), etc.
– Global features Ψ(t, t'): relatedness between Wiki titles
– I[t–t'] · PMI(InLinks(t), InLinks(t'))
– I[t–t'] · NGD(InLinks(t), InLinks(t'))
– Algorithm (simplified):
– Use shallow parsing and NER to get (additional) potential mentions
– Efficiency: for each mention m, take the top-k most frequent targets t
– Use the objective function to find the most appropriate disambiguation

Predicate Linkage: Using Distant Supervision [MBS+09]
– Weakly supervised, via a large existing KB (e.g., Freebase): the learned extractors output canonical KB predicates directly (see Paradigms (II) above)

Predicate Linkage: Using Redundancy [YE09]
– Goal: given Open IE triples, output synonymous predicates, using an unsupervised method
– Motivation: Open IE creates a lot of redundancy in extractions
– E.g., (DC, is capital of, United States); (Washington, is capital city of, US)
– Top-80 entities had an average of 2.9, and up to about 10, synonyms
– Top-100 predicates had an average of 4.9 synonyms
– Key assumption: the distributional hypothesis: similar entities appear in similar contexts
– Use a generative model to estimate the probability that strings co-refer
– The probability depends on the number of shared and non-shared properties; formalized using a ball-and-urn abstraction
– Scalability: use greedy agglomerative clustering
– Strings with no, or only popular, shared properties are not compared
– O(K·N·log(N)) algorithm, if at most K synonyms and N triples
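A minimal sketch of redundancy-based predicate linkage in this spirit: two predicate strings are candidate synonyms when they share (subject, object) argument pairs, and candidates are merged by greedy agglomerative clustering. The shared-property ratio below is a placeholder for the paper's generative ball-and-urn model, the threshold and triples are invented, and entity strings (e.g., DC vs. Washington) would be clustered analogously.

```python
from collections import defaultdict
from itertools import combinations

triples = [
    ("DC", "is capital of", "United States"),
    ("DC", "is capital city of", "United States"),
    ("Washington", "is capital city of", "United States"),
    ("Paris", "is capital of", "France"),
    ("Paris", "is capital city of", "France"),
    ("Obama", "was born in", "Hawaii"),
]

props = defaultdict(set)            # predicate -> set of (subject, object) "properties"
for s, p, o in triples:
    props[p].add((s, o))

def similarity(p1, p2):
    """Shared-property ratio; 0 when nothing is shared (such pairs are never merged)."""
    shared = props[p1] & props[p2]
    return len(shared) / min(len(props[p1]), len(props[p2])) if shared else 0.0

clusters = {p: {p} for p in props}  # greedy agglomerative clustering
THRESHOLD = 0.5
while True:
    score, a, b = max(((similarity(a, b), a, b) for a, b in combinations(clusters, 2)),
                      default=(0.0, None, None))
    if score < THRESHOLD:
        break
    clusters[a] |= clusters.pop(b)  # merge the most similar pair of clusters
    props[a] |= props[b]            # merged cluster inherits all properties

print(list(clusters.values()))
# [{'is capital of', 'is capital city of'}, {'was born in'}]
```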
Case Study: Knowledge Vault [DGH+14]
– Text and DOM (document object model) sources: use distant supervision (e.g., for /people/person/date_of_birth)
– Web tables and lists: use schema mapping
– Annotations: use semi-automatic mapping

Knowledge Vault: Statistics (as of 11/2013)
– A large knowledge base, with highly skewed data: fat head, long tail
– #Triples/type: 1 to 14M (location, organization, business)
– #Triples/entity: 1 to 2M (USA, UK, CA, NYC, TX)
– 1B+ webpages over the Web; contribution per page is skewed: 1 to 50K triples
– 12 extractors, with high variety

KV: Errors Can Creep In at Every Stage (9/2013)
– Extraction error: (Obama, nationality, Chicago)
– Reconciliation error: (Obama, nationality, North America), from "American President Barack Obama"
– Source data error: (Obama, nationality, Kenya), from a page claiming "Obama born in Kenya"

KV: Quality of Knowledge Extraction
– Gold standard: Freebase LCWA (local closed-world assumption)
– If (s, p, o) exists in Freebase: true
– Else if (s, p) exists in Freebase: false (knowledge is locally complete)
– Else: unknown
– The gold standard covers about 40% of the triples

KV: Statistics for Triple Correctness (as of 11/2013)
– Overall accuracy: 30%
– Random sample of 25 false triples (a triple can have multiple error types):
– Triple identification errors: 11 (44%)
– Entity linkage errors: 11 (44%)
– Predicate linkage errors: 5 (20%)
– Source data errors: 1 (4%)
– Statistics also broken down per extractor: 12 extractors, high variety

Outline: Motivation; Knowledge extraction; Knowledge fusion; Interesting applications; Future directions

Knowledge Fusion: Goal: Judge Triple Correctness
– Input: knowledge triples and their provenance (which extractor extracts from which source)
– Output: a probability in [0, 1] for each triple
– Probabilistic decisions vs. deterministic decisions
– Usage of probabilistic knowledge: negative training examples, and MANY EXCITING APPLICATIONS!! (active learning, probabilistic inference, etc.); upload to the KB

Data Fusion: Definition
– (Figure: input is a source-by-data-item table of conflicting values; output is the resolved true values)

Knowledge Fusion Challenges
– I. The input is three-dimensional: (extractor, source, data item), where a data item is an (S, P) pair
– II. The output probabilities should be well-calibrated
– III. Data are of web scale: three orders of magnitude larger than currently published data-fusion applications
– Size: 1.1TB
– Sources: 170K vs. 1B+
– Data items: 400K vs. 375M
– Values: 18M vs. 6.4B (1.6B unique)
– Data are highly skewed: #Triples/data item: 1 to 2.7M; #Triples/source: 1 to 50K

Approach I.1: Graph-Based Priors: Path Ranking Algorithm (PRA), logistic regression [Lao et al., EMNLP'11]
– (Figure: two example paths linking Michael Jordan to a profession via education/atSchool edges through North Carolina Tar Heels and Sam Perkins)
– Path 1: Prec 1, Rec 0.01, F1 0.03, Weight 2.62
– Path 2: Prec 0.03, Rec 0.33, F1 0.04, Weight 2.19

Approach I.2: Graph-Based Priors: Deep learning [Dean, CIKM'14]
– Hidden layer (100 dimensions); nearest-neighbor predicates:
– children: parents (0.4), spouse (0.5), birth-place (0.8)
– birth-date: children (1.24), gender (1.25), parents (1.29)
– edu-end: job-start (1.41), edu-start (1.61), job-end (1.74)

Approach II: Extraction-Based Learning [Dong et al., KDD'14]
– Features: square root of #sources; mean of extraction confidence
– AdaBoost learning: at each stage t, a new learner is fit on instances reweighted according to the error of the classifier built at stage t-1
– Learning is done separately for each predicate
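A minimal sketch of this extraction-based fusion step, using scikit-learn's AdaBoostClassifier as a stand-in for the paper's boosting setup. The triple_features helper implements the two features listed above; the training triples and their labels are synthetic, whereas in practice the labels come from the Freebase LCWA gold standard described earlier.

```python
import math
from sklearn.ensemble import AdaBoostClassifier

def triple_features(n_sources, confidences):
    """Two features per extracted triple: sqrt(#sources), mean extraction confidence."""
    return [math.sqrt(n_sources), sum(confidences) / len(confidences)]

# (predicate, #sources, extraction confidences, correct?) -- synthetic examples.
training = [
    ("/people/person/date_of_birth", 12, [0.9, 0.8, 0.95], 1),
    ("/people/person/date_of_birth", 1,  [0.3],            0),
    ("/people/person/date_of_birth", 5,  [0.7, 0.6],       1),
    ("/people/person/date_of_birth", 2,  [0.2, 0.4],       0),
    ("/people/person/nationality",   8,  [0.85, 0.9],      1),
    ("/people/person/nationality",   1,  [0.5],            0),
    ("/people/person/nationality",   3,  [0.6, 0.55],      1),
    ("/people/person/nationality",   1,  [0.15],           0),
]

# One classifier per predicate ("learning for each predicate").
models = {}
for pred in {p for p, *_ in training}:
    X = [triple_features(n, c) for p, n, c, _ in training if p == pred]
    y = [label for p, *_, label in training if p == pred]
    models[pred] = AdaBoostClassifier(n_estimators=20).fit(X, y)

# Probability that a new extracted triple is true.
proba = models["/people/person/date_of_birth"].predict_proba(
    [triple_features(4, [0.8, 0.75])])[0, 1]
print(round(proba, 2))
```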
Approach III: Multi-Layer Fusion Intuitions [Dong et al., VLDB'15]
– Leverage source/extractor agreements; trust a source/extractor with high quality
– Graphical model: predict at the same time
– Extraction correctness and triple correctness
– Source accuracy and extractor precision/recall
– Source/extractor hierarchy: break down "large" sources; group similar "small" sources

High-Level Intuition [Dong, VLDB'09]
– Researcher affiliation, as claimed by three sources:
  Researcher | Src1 | Src2 | Src3
  Jagadish | UM | ATT | UM
  Dewitt | MSR | MSR | UW
  Bernstein | MSR | MSR | MSR
  Carey | UCI | ATT | BEA
  Franklin | UCB | UCB | UMD
– Voting: trust the majority
– Quality-based: give higher votes to more accurate sources

What About Extractions? [SIGMOD 2014]
– Extracted Harry Potter actors/actresses, by three extractors (Ext1: high recall, Ext2: high precision, Ext3: medium precision/recall)
– Daniel and Eric: extracted by all three extractors
– Emma, Rupert, and Jonny: each extracted by only one extractor
– Voting: trust the majority
– Quality-based: a triple is more likely to be correct if extracted by high-precision extractors, and more likely to be wrong if not extracted by high-recall extractors

Graphical Model [VLDB 2015]
– Observations: X(e, w, d, v): whether extractor e extracts from source w the (d, v) item-value pair
– Latent variables: C(w, d, v): whether source w indeed provides the (d, v) pair; V(d): the correct value(s) for d
– Parameters: A(w): accuracy of source w; P(e): precision of extractor e; R(e): recall of extractor e

Algorithm [VLDB 2015]
– E-step: compute Pr(w provides t | extractor quality) by Bayesian analysis; compute Pr(t | source quality) by Bayesian analysis
– M-step: compute source accuracy; compute extractor precision and recall

Fusion Performance
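A heavily simplified sketch of the joint-estimation idea behind this E-step/M-step iteration, not the exact [VLDB'15] model: it ignores extractor recall and the single-truth constraint, the noisy-or combination of extractions and the log-odds vote over sources are invented simplifications, and all sources, extractors, and triples below are toy data.

```python
import math
from collections import defaultdict

EXTRACTORS = ["E1", "E2", "E3"]
observed = {                      # (source, triple) -> extractors that extracted it
    ("src1", "t1"): {"E1", "E2", "E3"},
    ("src2", "t1"): {"E1", "E2"},
    ("src1", "t2"): {"E3"},
    ("src3", "t2"): {"E1", "E2", "E3"},
    ("src2", "t3"): {"E3"},
}

A = defaultdict(lambda: 0.8)              # source accuracy
P = {e: 0.7 for e in EXTRACTORS}          # extractor precision

for _ in range(10):
    # E-step 1: Pr(source w really provides triple t): the pair is real if at
    # least one of its extractions is correct (noisy-or over extractors).
    C = {pair: 1 - math.prod(1 - P[e] for e in exts)
         for pair, exts in observed.items()}
    # E-step 2: Pr(triple is true): accuracy-weighted soft vote of the sources
    # believed to provide it, combined in log-odds space.
    logodds = defaultdict(float)
    for (w, t), c in C.items():
        logodds[t] += c * math.log(A[w] / (1 - A[w]))
    truth = {t: 1 / (1 + math.exp(-lo)) for t, lo in logodds.items()}
    # M-step: source accuracy = expected fraction of its provided triples that
    # are true; extractor precision = expected fraction of its extractions
    # that correspond to really-provided pairs.
    num, den = defaultdict(float), defaultdict(float)
    for (w, t), c in C.items():
        num[w] += c * truth[t]
        den[w] += c
    for w in den:
        A[w] = min(max(num[w] / den[w], 0.01), 0.99)
    for e in EXTRACTORS:
        mine = [C[pair] for pair, exts in observed.items() if e in exts]
        P[e] = min(max(sum(mine) / len(mine), 0.01), 0.99)

print({t: round(p, 2) for t, p in truth.items()})
print({w: round(a, 2) for w, a in A.items()})
```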
Outline: Motivation; Knowledge extraction; Knowledge fusion; Interesting applications; Future directions

Usage of Probabilistic Knowledge
– Negative training examples, and MANY EXCITING APPLICATIONS!!
– Source errors: knowledge-based trust
– Extraction errors: data abnormality diagnosis

Application I. Web Source Quality
– What we have now: PageRank (links between websites/webpages); log based (search log and click-through rate); web spam; etc.
– 1. Tail sources can be useful
– Good answer for an award-winning song; missing answer for a not-so-popular song
– Very precise info on guitar players, but low PageRank
– 2. Popular websites may not be trustworthy
– Top gossip websites (http://www.ebizmba.com/articles/gossip-websites): www.eonline.com, perezhilton.com, radaronline.com, www.zimbio.com, mediatakeout.com, gawker.com, www.popsugar.com, www.people.com, www.tmz.com, www.fishwrapper.com, celebrity.yahoo.com, wonderwall.msn.com, hollywoodlife.com, www.wetpaint.com
– 14 out of 15 have a PageRank among the top 15% of websites

Application I. Web Source Quality: From Triples to Source Accuracy
– With deterministic fact labels, a source's accuracy is the fraction of its facts that are correct:
  Fact 1 ✓, Fact 2 ✓, Fact 3 ✘, Fact 4 ✓, Fact 5 ✘, Fact 6 ✓, Fact 7 ✓, Fact 8 ✓, Fact 9 ✓, Fact 10 ✘, ... → Accu 0.7
– With probabilistic triple correctness, accuracy is the average correctness probability:
  Triple 1: 1.0, Triple 2: 0.9, Triple 3: 0.3, Triple 4: 0.8, Triple 5: 0.4, Triple 6: 0.8, Triple 7: 0.9, Triple 8: 1.0, Triple 9: 0.7, Triple 10: 0.2, ... → Accu 0.7
– Taking extraction correctness into account down-weights poorly extracted triples:
  Triple | Triple corr | Extraction corr
  Triple 1 | 1.0 | 1.0
  Triple 2 | 0.9 | 1.0
  Triple 3 | 0.3 | 1.0
  Triple 4 | 0.8 | 1.0
  Triple 5 | 0.4 | 0.9
  Triple 6 | 0.8 | 0.9
  Triple 7 | 0.9 | 0.8
  Triple 8 | 1.0 | 0.2
  Triple 9 | 0.7 | 0.1
  Triple 10 | 0.2 | 0.1
  ... → Accu 0.7 → 0.73

Knowledge-Based Trust (KBT)
– Trustworthiness in [0, 1] for 5.6M websites and 119M webpages

Knowledge-Based Trust vs. PageRank
– The scores are correlated
– But there are often tail sources with high trustworthiness yet low PageRank: among 100 sampled such websites, 85 are indeed trustworthy
– And often sources with high PageRank but low accuracy
– The 15 gossip websites above: 14 out of 15 have a PageRank among the top 15% of websites, yet all have knowledge-based trust in the bottom 50%
– 3. Website recommendation by vertical

Application II. X-Ray for Extracted Data
– Goal: help users analyze errors, changes, and abnormalities in data
– Intuitions: cluster errors by features; return the clusters with the top error rates
– Cluster 1: feature (besoccor.com, date_of_birth, 1986_02_18)
– #Triples: 630; errors: 100%
– Reason: default value
– Cluster 2: feature (ExtractorX, pred: namesakes, obj: "the county")
– #Triples: 4878; errors: 99.8%
– E.g., [Salmon P. Chase, namesakes, The County]
– Context: "The county was named for Salmon P. Chase, former senator and governor of Ohio"
– Reason: unresolved coreference

Outline: Motivation; Knowledge extraction; Knowledge fusion; Interesting applications; Future directions

Future Directions: Extraction Readiness
– Extraction is still very sparse: 74% of URLs contribute fewer than 5 triples each
– Extraction is of low quality: overall accuracy is as low as 11.5%
– Imbalance between texts and semi-structured data
– Combine the strengths of distant supervision and Open IE
– Add new predicates to knowledge bases; a role for crowdsourcing?

Future Directions: Fusion Readiness
– Single-truth assumption: pro: filters a large amount of noise; con: often does not hold in practice
– Value similarity and hierarchy, e.g., San Francisco vs. Oakland
– Copy detection: existing techniques are not applicable because of web scale
– Simple vs. sophisticated facts; identify different perspectives

Future Directions: Customer Readiness
– Head: well-known authorities; long tail: limited support
– Call to arms: Leave NO Valuable Data Behind

Takeaways
– Building high-quality knowledge bases is very challenging; knowledge extraction and knowledge fusion are critical
– Much work on knowledge extraction by the ML and NLP communities, with parallels to work on data integration by the database community
– Knowledge fusion is an exciting new area of research; data fusion techniques can be extended for this problem
– A lot more research needs to be done!

Thank You!