Knowledge Curation and Knowledge Fusion: Challenges, Models, and Applications Xin Luna Dong (Google Inc.) Divesh Srivastava (AT&T Labs-Research)


Knowledge Curation and
Knowledge Fusion: Challenges,
Models, and Applications
Xin Luna Dong (Google Inc.)
Divesh Srivastava (AT&T Labs-Research)
Knowledge Is Power
 Many knowledge bases (KB)
Google knowledge graph
2
Using KB in Search
3
Using KB in Social Media
4
What Is A Knowledge Base?
 Entities, entity types
– An entity is an instance of an entity type
– Entity types organized in a hierarchy
(Example hierarchy: all → person, place; person → politician)
5
What Is A Knowledge Base?
 Entities, entity types
 Predicates, subject-predicate-object triples
– An SPO triple is an instance of a predicate
(Example predicate: is-born-in, from person to place)
 Other modeling constructs: rules, reification, etc.
6
What Is A Knowledge Base?
 Entities, entity types
 Predicates, subject-predicate-object triples
 Knowledge base: graph with entity nodes and SPO triple edges
(Example edges: is-born-in, is-located-in)
7
Knowledge Bases: Scope
 Domain-specific knowledge base
– Focus is on a well-defined domain
– E.g., IMDb for movies, MusicBrainz for music
8
Knowledge Bases: Scope
 Domain-specific knowledge base
 Global knowledge base
– Covers a variety of knowledge across domains
– Intensional: Cyc, WordNet
– Extensional: Freebase, Knowledge Graph, Yago/Yago2, DeepDive, NELL, Prospera, ReVerb, Knowledge Vault
9
Knowledge Bases: Comparison [DGH+14]
Name              # of Entity Types   # Entities   # Predicates   # Confident Triples
Knowledge Vault   1100                45M          4469           271M
DeepDive          4                   2.7M         34             7M
NELL              271                 5.1M         306            0.435M
PROSPERA          11                  N/A          14             0.1M
Yago2             350,000             9.8M         100            150M
Freebase          1500                40M          35,000         637M
Knowledge Graph   1500                570M         35,000         18,000M
10
Outline
 Motivation
 Knowledge extraction
 Knowledge fusion
 Interesting applications
 Future directions
11
Knowledge Bases: Population
 Use well-structured sources, manual curation
– Manual curation is resource intensive
– E.g., Cyc, WordNet, Freebase, Yago, Knowledge Graph
12
Knowledge Bases: Population
 Use well-structured sources, manual curation
 Use less structured sources, automated knowledge extraction
– (Possibly) bootstrap from existing knowledge bases
– Extractors extract knowledge from a variety of sources
13
Knowledge Bases: Population
 Use well-structured sources, manual curation
 Use less structured sources, automated knowledge extraction
14
Knowledge Extraction: Challenges
 Identifying semantically meaningful triples is challenging
– Hamas claimed responsibility for the Gaza attack
15
Knowledge Extraction: Challenges
 Identifying semantically meaningful triples is challenging
 Synonyms: multiple representations of same entities, predicates
– (DC, is capital of, United States); (Washington, is capital city of, US)
17
Knowledge Extraction: Challenges
 Identifying semantically meaningful triples is challenging
 Synonyms: multiple representations of same entities, predicates
– (DC, is capital of, United States); (Washington, is capital city of, US)
– Linkage is more challenging than in structured databases, where one can take advantage of schema information
18
Knowledge Extraction: Challenges
 Identifying semantically meaningful triples is challenging
 Synonyms: multiple representations of same entities, predicates
 Polysemy: similar representations of different entities
– I enjoyed watching Chicago on Broadway
– Chicago has the largest population in the Midwest
– Distinguishing them is more challenging than in structured databases, where one can take advantage of schema information
19
Knowledge Extraction: Challenges
 Identifying semantically meaningful triples is challenging
 Synonyms: multiple representations of same entities, predicates
 Polysemy: similar representations of different entities
20
Knowledge Extraction: Tasks
 Triple identification
 Entity linkage
 Predicate linkage
/people/person/date_of_birth
21
Triple Identification: Paradigms (I)
 Domain-specific knowledge base 1: pattern based [FGG+14]
– Supervised, uses labeled training examples
– High precision, low recall because of linguistic variety
/people/person/date_of_birth
 Domain-specific knowledge base 2: iterative, pattern based
– Strengths: higher recall than domain-specific 1
– Weaknesses: lower precision due to concept drift
22
Triple Identification: Paradigms (II)
 Global knowledge base: distant supervision [MBS+09]
– Weakly supervised, via large existing KB (e.g., Freebase)
– Use entities, predicates in KB to provide training examples
– Extractors focus on learning ways KB predicates are expressed in text
 Example: (Spielberg, /film/director/film, Saving Private Ryan)
– Spielberg’s film Saving Private Ryan is based on the brothers’ story
– Allison co-produced Saving Private Ryan, directed by Spielberg
23
Triple Identification: Distant Supervision
 Key assumption:
– If a pair of entities participates in a KB predicate, a sentence containing that pair of entities is likely to express that predicate
 Example:
– Spielberg’s film Saving Private Ryan is based on the brothers’ story
– Allison co-produced Saving Private Ryan, directed by Spielberg
 Solution approach: obtains robust combination of noisy features (see the sketch below)
– Uses named entity tagger to label candidate entities in sentences
– Trains a multiclass logistic regression classifier, using all features
24
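To make the training-data generation concrete, here is a minimal Python sketch (not the authors' code): a toy KB and a stand-in entity tagger produce weakly labeled examples for every entity pair that co-occurs in a sentence; in the real system these examples become the lexical and syntactic features on the next slide, fed to the multiclass logistic regression classifier.

```python
# Minimal sketch of distant-supervision example generation. The toy KB and the
# stand-in entity tagger are assumptions for illustration, not Freebase APIs.

# Toy "KB": (subject, object) pairs known to participate in a predicate.
KB = {("Spielberg", "Saving Private Ryan"): "/film/director/film"}

def tag_entities(sentence):
    """Stand-in for a named entity tagger: return KB entities found in the sentence."""
    entities = {e for pair in KB for e in pair}
    return [e for e in entities if e in sentence]

def label_sentences(sentences):
    """For each sentence, pair up co-occurring entities and label the pair with the
    KB predicate if one exists (positive example) or 'NONE' (negative example)."""
    examples = []
    for sent in sentences:
        ents = tag_entities(sent)
        for e1 in ents:
            for e2 in ents:
                if e1 != e2:
                    examples.append((sent, e1, e2, KB.get((e1, e2), "NONE")))
    return examples

sentences = [
    "Spielberg's film Saving Private Ryan is based on the brothers' story",
    "Allison co-produced Saving Private Ryan, directed by Spielberg",
]
for example in label_sentences(sentences):
    print(example)
```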
Triple Identification: Distant Supervision
 Lexical and syntactic features:
– Sequence of words, POS tags in the [entity1 – K, entity2 + K] window
– Parsed dependency path between entities + adjacent context nodes
– Use exact matching of conjunctive features: rely on big data!
 Example: new triples identified using Freebase
Predicate                      Size      New triples
/location/location/contains    253,223   Paris, Montmartre
/people/person/profession      208,888   Thomas Mellon, Judge
25
Triple Identification: Distant Supervision
 Strengths:
– Canonical names for predicates in triples
– Robustness: benefits of supervised IE without concept drift
– Scalability: extracts large # of triples for quite a large # of predicates
 Weaknesses:
– Triple identification limited by the predicates in the KB
26
Triple Identification: Paradigms (III)
 Global knowledge base: open information extraction (Open IE)
– No pre-specified vocabulary
– Extractors focus on generic ways to express predicates in text
– Systems: TextRunner (1st gen), ReVerb (2nd gen) [FSE11]
 Example:
– McCain fought hard against Obama, but finally lost the election
27
Triple Identification: Open IE [EFC+11]
 Architecture for (subject, predicate, object) identification
– Use POS-tagged and NP-chunked sentence in text
– First, identify predicate phrase
– Second, find pair of NP arguments – subject and object
– Finally, compute confidence score using logistic regression classifier
 Example:
– Hamas claimed responsibility for the Gaza attack
29

Triple Identification: Open IE [EFC+11]
 Predicate phrase identification: use syntactic constraints
– For each verb, find the longest sequence of words expressed as a light verb construction (LVC) regular expression (see the sketch below)
– Avoids incoherent, uninformative predicates
 Example:
– Hamas claimed responsibility for the Gaza attack
31
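The syntactic constraint can be illustrated with a rough Python sketch of a ReVerb-style POS pattern (V | VP | VW*P); this is my own simplification of the idea in [EFC+11], and the use of NLTK for tokenization and tagging is an assumption, not part of the original system.

```python
# Rough sketch of ReVerb-style syntactic constraints on predicate phrases.
# Assumes nltk with the 'punkt' and 'averaged_perceptron_tagger' data installed.
import re
import nltk

# Map Penn Treebank tags onto the coarse classes used by the pattern:
# V = verb, W = noun/adj/adv/pron/det, P = prep/particle/inf. marker
def coarse(tag):
    if tag.startswith("VB"):
        return "V"
    if tag.startswith(("NN", "JJ", "RB", "PR", "DT")):
        return "W"
    if tag in ("IN", "RP", "TO"):
        return "P"
    return "O"

# ReVerb-like relation pattern: a verb, optionally followed by words ending in a
# preposition/particle (V | VP | VW*P); the regex is greedy, so it prefers the
# longest match.
REL = re.compile(r"V(W*P)?")

def predicate_spans(sentence):
    tokens = nltk.word_tokenize(sentence)
    tags = "".join(coarse(t) for _, t in nltk.pos_tag(tokens))
    return [" ".join(tokens[m.start():m.end()]) for m in REL.finditer(tags)]

print(predicate_spans("Hamas claimed responsibility for the Gaza attack"))
# e.g. ['claimed responsibility for']
```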
Triple Identification: Open IE [EFC+11]
 Subject, object identification: use classifiers to learn boundaries
– Heuristics, e.g., using simple noun phrases, are inadequate
– Subject: learn left and right boundaries
– Object: need to learn only right boundary
 Examples:
– The cost of the war against Iraq has risen above 500 billion dollars
– The plan would reduce the number of teenagers who start smoking
32

Triple Identification: Open IE
 Strengths:
– No limitation to pre-specified vocabulary
– Scalability: extracts large # of triples for large # of predicates
 Weaknesses:
– No canonical names for entities, predicates in triples
34
Knowledge Extraction: Tasks
 Triple identification
 Entity linkage
 Predicate linkage
/people/person/date_of_birth
35
Entity Linkage: Disambiguation to Wikipedia
 Goal: Output Wiki titles corresponding to NPs in text
– Identifies canonical entities
 Examples:
– I enjoyed watching Chicago on Broadway
– Chicago has the largest population in the Midwest
36
Entity Linkage: Disambiguation to Wikipedia
 Local D2W approaches [BP06, MC07]
– Each mention in a document is disambiguated independently
– E.g., use similarity of text near mention with candidate Wiki page
 Global D2W approaches [C07, RRD+11]
– All mentions in a document are disambiguated simultaneously
– E.g., utilize Wikipedia link graph to estimate coherence
37
Global D2W Approach
(Figure: mentions “Chicago”, “Midwest” linked to candidate Wiki titles via Φ edges; candidate titles related to each other via Ψ edges)
 Goal: many-to-1 matching on a bipartite graph
 Approach: solve an optimization problem involving
– Φ(m, t): relatedness between mention m and Wiki title t
– Ψ(t, t’): pairwise relatedness between Wiki titles
38
Global D2W Approach
(Same figure as above)
 Goal: many-to-1 matching on a bipartite graph
 Φ(m1,t1) + Φ(m2,t3) + Ψ(t1,t3) > Φ(m1,t2) + Φ(m2,t3) + Ψ(t2,t3)
39
Global D2W Approach [RRD+11]
 Represent Φ and Ψ as weighted sum of local and global features
– Φ(m,t) = ∑i wi * Φi(m,t)
– Learn weights wi using an SVM
 Build an index (Anchor Text, Wiki Title, Frequency)
40
Global D2W Approach [RRD+11]
 Represent Φ and Ψ as weighted sum of local and global features
 Build an index (Anchor Text, Wiki Title, Frequency)
 Local features Φi(m, t): relatedness between mention and title
– P(t|m): fraction of times title t is the target page of anchor text m (see the sketch below)
– P(t): fraction of all Wiki pages that link to title t
– Cosine-sim(Text(t), Text(m)), Cosine-sim(Text(t), Context(m)), etc.
41
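A minimal sketch of the anchor-text statistic P(t|m), computed from a toy (anchor text, Wiki title, frequency) index; the counts below are made up for illustration.

```python
# Sketch of the local feature P(t|m) from an (anchor text, Wiki title, frequency)
# index; the toy index is an assumption, not real Wikipedia link data.
index = {
    ("Chicago", "Chicago"): 5000,            # the city
    ("Chicago", "Chicago (musical)"): 300,
    ("Chicago", "Chicago Bulls"): 150,
}

def p_title_given_mention(mention, title):
    """P(t|m): fraction of links with anchor text m that point to title t."""
    total = sum(f for (m, _), f in index.items() if m == mention)
    return index.get((mention, title), 0) / total if total else 0.0

print(p_title_given_mention("Chicago", "Chicago"))             # 5000/5450 ≈ 0.92
print(p_title_given_mention("Chicago", "Chicago (musical)"))   # 300/5450 ≈ 0.06
```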
Global D2W Approach [RRD+11]
 Represent Φ and Ψ as weighted sum of local and global features
 Build an index (Anchor Text, Wiki Title, Frequency)
 Local features Φi(m, t): relatedness between mention and title
 Global features Ψ(t, t’): relatedness between Wiki titles
– I[t-t’] * PMI(InLinks(t), InLinks(t’))
– I[t-t’] * NGD(InLinks(t), InLinks(t’))
42
Global D2W Approach [RRD+11]
 Algorithm (simplified)
– Use shallow parsing and NER to get (additional) potential mentions
– Efficiency: for each mention m, take top-k most frequent targets t
– Use objective function to find most appropriate disambiguation (see the sketch below)
(Figure: same bipartite Φ/Ψ graph as on slide 38)
43
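Putting the pieces together, here is a brute-force sketch of the simplified objective: pick one candidate title per mention so that the sum of local scores Φ plus pairwise coherence Ψ is maximized. The candidate sets and scores below are toy assumptions; the real system learns the features and weights and uses faster inference.

```python
# Brute-force sketch of the simplified global D2W objective over top-k candidates.
from itertools import product

candidates = {              # top-k candidate titles per mention (toy values)
    "Chicago": ["Chicago", "Chicago (musical)"],
    "Midwest": ["Midwestern United States"],
}
phi = {("Chicago", "Chicago"): 0.9, ("Chicago", "Chicago (musical)"): 0.4,
       ("Midwest", "Midwestern United States"): 0.8}
psi = {("Chicago", "Midwestern United States"): 0.7,
       ("Chicago (musical)", "Midwestern United States"): 0.1}

def score(assignment):
    """Sum of local relatedness Φ(m,t) plus pairwise coherence Ψ(t,t')."""
    mentions = list(candidates)
    local = sum(phi[(m, t)] for m, t in zip(mentions, assignment))
    coherence = sum(psi.get((t1, t2), psi.get((t2, t1), 0.0))
                    for i, t1 in enumerate(assignment)
                    for t2 in assignment[i + 1:])
    return local + coherence

best = max(product(*candidates.values()), key=score)
print(dict(zip(candidates, best)))   # {'Chicago': 'Chicago', 'Midwest': ...}
```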
Predicate Linkage: Using Distant Supervision
 Global knowledge base: distant supervision [MBS+09]
– Weakly supervised, via large existing KB (e.g., Freebase)
– Use entities, predicates in KB to provide training examples
– Extractors focus on learning ways KB predicates are expressed in text
 Example: (Spielberg, /film/director/film, Saving Private Ryan)
– Spielberg’s film Saving Private Ryan is based on the brothers’ story
– Allison co-produced Saving Private Ryan, directed by Spielberg
44
Predicate Linkage: Using Redundancy [YE09]
 Goal: Given Open IE triples, output synonymous predicates
– Use unsupervised method
 Motivation: Open IE creates a lot of redundancy in extractions
– (DC, is capital of, United States); (Washington, is capital city of, US)
– Top-80 entities had an average of 2.9 synonyms (up to about 10)
– Top-100 predicates had an average of 4.9 synonyms
45
Predicate Linkage: Using Redundancy
 Key assumption: Distributional hypothesis
– Similar entities appear in similar contexts
 Use generative model to estimate probability that strings co-refer
– Probability depends on number of shared, non-shared properties
– Formalize using a ball-and-urn abstraction
 Scalability: use greedy agglomerative clustering (see the sketch below)
– Strings with no, or only popular, shared properties not compared
– O(K*N*log(N)) algorithm, if at most K synonyms, N triples
46
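A rough sketch of the clustering intuition, using Jaccard overlap of shared (subject, object) properties as a stand-in for the generative ball-and-urn model of [YE09]; the triples and the 0.3 threshold are assumptions for illustration.

```python
# Greedy agglomerative clustering of Open IE predicate strings by shared
# (subject, object) properties. A simplified stand-in for the [YE09] model.
triples = [
    ("DC", "is capital of", "United States"),
    ("Washington", "is capital city of", "US"),
    ("DC", "is capital city of", "United States"),
    ("Paris", "is capital of", "France"),
]

def properties(pred):
    """The (subject, object) pairs a predicate string appears with."""
    return {(s, o) for s, p, o in triples if p == pred}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

clusters = [[p] for p in {p for _, p, _ in triples}]
merged = True
while merged:                       # greedy agglomerative merging
    merged = False
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            pi = set().union(*(properties(p) for p in clusters[i]))
            pj = set().union(*(properties(p) for p in clusters[j]))
            if jaccard(pi, pj) > 0.3:           # assumed similarity threshold
                clusters[i] += clusters.pop(j)
                merged = True
                break
        if merged:
            break
print(clusters)   # the two "capital" predicates end up in one cluster
```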
Case Study: Knowledge Vault [DGH+14]
 Text, DOM (document object model): use distant supervision
/people/person/date_of_birth
 Web tables, lists: use schema mapping
 Annotations: use semi-automatic mapping
47
Knowledge Vault: Statistics
 A large knowledge base
As of 11/2013
 Highly skewed data – fat head, long tail
– #Triples/type: 1–14M (location, organization, business)
– #Triples/entity: 1–2M (USA, UK, CA, NYC, TX)
48
Knowledge Vault: Statistics
 1B+ Webpages over the Web
– Contribution is skewed: 1–50K triples per source
As of 11/2013
49
Knowledge Vault: Statistics
 12 extractors; high variety
As of 11/2013
50
KV: Errors Can Creep in at Every Stage
 Extraction error: (Obama, nationality, Chicago)
9/2013
51
KV: Errors Can Creep in at Every Stage
 Reconciliation error: (Obama, nationality, North America)
American President Barack Obama
9/2013
52
KV: Errors Can Creep in at Every Stage
 Source data error: (Obama, nationality, Kenya)
Obama born in Kenya
9/2013
53
KV: Quality of Knowledge Extraction
 Gold standard: Freebase
 LCWA (local closed-world assumption)
– If (s,p,o) exists in Freebase: true
– Else if (s,p) exists in Freebase: false (knowledge is locally complete)
– Else: unknown (see the sketch below)
 The gold standard contains about 40% of the triples
54
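A minimal sketch of LCWA labeling against a toy Freebase-like gold standard; the triples below are stand-ins.

```python
# Local closed-world assumption (LCWA) labeling against a toy gold standard.
freebase = {
    ("Obama", "nationality", "USA"),
    ("Obama", "date_of_birth", "1961-08-04"),
}
known_sp = {(s, p) for s, p, _ in freebase}

def lcwa_label(s, p, o):
    """Return True / False / None (unknown) under the local closed-world assumption."""
    if (s, p, o) in freebase:
        return True            # triple is in the gold standard
    if (s, p) in known_sp:
        return False           # gold standard assumed locally complete for (s, p)
    return None                # no evidence either way

print(lcwa_label("Obama", "nationality", "Kenya"))   # False
print(lcwa_label("Obama", "spouse", "Michelle"))     # None
```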
KV: Statistics for Triple Correctness
 Overall accuracy: 30%
 Random sample of 25 false triples (a triple can exhibit more than one error type)
– Triple identification errors: 11 (44%)
– Entity linkage errors: 11 (44%)
– Predicate linkage errors: 5 (20%)
– Source data errors: 1 (4%)
55
KV: Statistics for Triple Correctness
As of 11/2013
56
KV: Statistics for Extractors
 12 extractors; high variety
As of 11/2013
57
Outline
 Motivation
 Knowledge extraction
 Knowledge fusion
 Interesting applications
 Future directions
58
Goal: Judge Triple Correctness
 Input: Knowledge triples and their provenance
– Which extractor extracts from which source
 Output: a probability in [0,1] for each triple
– Probabilistic decisions vs. deterministic decisions
59
Usage of Probabilistic Knowledge
(Figure showing uses of probabilistic knowledge: upload to KB; negative training examples; active learning, probabilistic inference, etc.; and many exciting applications)
60
Data Fusion: Definition
(Figure: the input and output of data fusion)
61
Knowledge Fusion Challenges
I. Input is three-dimensional
(S, P)
62
Knowledge Fusion Challenges
II. Output probabilities should be well-calibrated
63
Knowledge Fusion Challenges
III. Data are of web scale
 Three orders of magnitude larger than currently published data-fusion applications
– Size: 1.1TB
– Sources: 170K → 1B+
– Data items: 400K → 375M
– Values: 18M → 6.4B (1.6B unique)
 Data are highly skewed
– #Triples / Data item: 1 – 2.7M
– #Triples / Source: 1 – 50K
64
Approach I.1: Graph-Based Priors
Path Ranking Algorithm (PRA): logistic regression over path features [Lao et al., EMNLP’11] (see the sketch below)
(Figure: example paths connecting Sam Perkins and Michael Jordan to the North Carolina Tar Heels via atSchool, education, play, and profession edges)

         Prec   Rec    F1     Weight
Path 1   1      0.01   0.03   2.62
Path 2   0.03   0.33   0.04   2.19
65
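To illustrate the PRA idea, here is a rough sketch (toy graph, assumed weight and bias, not the trained model): each path type contributes one feature, the probability that a random walk following that path reaches the candidate object, and a logistic function combines the features.

```python
# Rough sketch of a Path Ranking Algorithm feature plus logistic scoring.
import math
from collections import defaultdict

edges = defaultdict(list)   # edges[(node, relation)] -> list of neighbors
for s, r, o in [("Sam Perkins", "atSchool", "UNC"),
                ("North Carolina Tar Heels", "education", "UNC"),
                ("Sam Perkins", "plays_for", "North Carolina Tar Heels")]:
    edges[(s, r)].append(o)
    edges[(o, r + "^-1")].append(s)   # inverse edge for walking backwards

def path_prob(start, target, path):
    """Probability of reaching target from start via a uniform random walk
    that follows the given sequence of relations."""
    dist = {start: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, p in dist.items():
            nbrs = edges[(node, rel)]
            for n in nbrs:
                nxt[n] += p / len(nbrs)
        dist = nxt
    return dist.get(target, 0.0)

# One path feature: atSchool followed by education^-1
x = path_prob("Sam Perkins", "North Carolina Tar Heels",
              ["atSchool", "education^-1"])
w, b = 2.62, -1.0            # assumed learned weight and bias
print(1 / (1 + math.exp(-(w * x + b))))   # PRA score for the candidate triple
```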
Approach I.2: Graph-Based Priors
Deep learning [Dean, CIKM’14]
(Figure: network with a 100-dimension hidden layer relating a predicate to its neighbor predicates)

Pred         Neighbor 1        Neighbor 2        Neighbor 3
children     parents 0.4       spouse 0.5        birth-place 0.8
birth-date   children 1.24     gender 1.25       parents 1.29
edu-end      job-start 1.41    edu-start 1.61    job-end 1.74
66
Approach II: Extraction-Based Learning
[Dong et al, KDD’14]
 Features
– Square root of #sources
– Mean of extraction confidence
 AdaBoost learning (see the sketch below)
(Figure: AdaBoost – instance weights at stage t depend on the error of the classifier built at stage t-1)
 Learning for each predicate
67
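A minimal sketch of the per-predicate learner, assuming scikit-learn's AdaBoostClassifier and toy training data; in the paper the labels come from the LCWA gold standard and more features are used.

```python
# Per-predicate AdaBoost classifier over two triple features:
# sqrt(#sources) and mean extraction confidence. Toy data for illustration.
import math
from sklearn.ensemble import AdaBoostClassifier

def features(num_sources, confidences):
    return [math.sqrt(num_sources), sum(confidences) / len(confidences)]

# toy training data for one predicate, e.g. /people/person/date_of_birth
X = [features(12, [0.9, 0.8, 0.95]),   # many sources, high confidence
     features(1, [0.3]),               # single low-confidence extraction
     features(5, [0.7, 0.6]),
     features(1, [0.2])]
y = [1, 0, 1, 0]                       # labels, e.g. from the LCWA gold standard

model = AdaBoostClassifier(n_estimators=50)
model.fit(X, y)
print(model.predict_proba([features(3, [0.85, 0.8])])[:, 1])  # P(triple is true)
```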
Approach III. Multi-Layer Fusion
 Intuitions [Dong et al., VLDB’15]
– Leverage source/extractor agreements
– Trust a source/extractor with high quality
 Graphical model – predict at the same time
– Extraction correctness, triple correctness
– Source accuracy, extractor precision/recall
 Source/extractor hierarchy
– Break down “large” sources
– Group similar “small” sources
68
High-Level Intuition
[Dong, VLDB’09]
 Researcher affiliation

            Src1   Src2   Src3
Jagadish    UM     ATT    UM
Dewitt      MSR    MSR    UW
Bernstein   MSR    MSR    MSR
Carey       UCI    ATT    BEA
Franklin    UCB    UCB    UMD
69
High-Level Intuition
[Dong, VLDB’09]
 Researcher affiliation (same table as above)
 Voting: trust the majority
70
High-Level Intuition
[Dong, VLDB’09]
 Researcher affiliation (same table as above)
 Quality-based: give higher votes to more accurate sources (see the sketch below)
72
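The contrast between the two strategies can be sketched as follows; the log(A/(1-A)) vote weight is a simplified stand-in for the Bayesian analysis in [Dong, VLDB'09], and the source accuracies are assumed.

```python
# Majority voting vs. accuracy-weighted voting for one data item (Carey's affiliation).
import math
from collections import defaultdict

claims = {"Src1": "UCI", "Src2": "ATT", "Src3": "BEA"}     # source -> claimed value
accuracy = {"Src1": 0.9, "Src2": 0.6, "Src3": 0.5}          # assumed source accuracies

def majority_vote(claims):
    counts = defaultdict(int)
    for v in claims.values():
        counts[v] += 1
    return max(counts, key=counts.get)

def weighted_vote(claims, accuracy):
    # each source votes with weight log(A / (1 - A)): accurate sources count more
    scores = defaultdict(float)
    for src, v in claims.items():
        a = accuracy[src]
        scores[v] += math.log(a / (1 - a))
    return max(scores, key=scores.get)

print(majority_vote(claims))             # three-way tie, arbitrary winner
print(weighted_vote(claims, accuracy))   # 'UCI', backed by the most accurate source
```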
What About Extractions
 Extracted Harry Potter actors/actresses
[Sigmod, 2014]
Harry Potter   Ext1 / Ext2 / Ext3
Daniel         ✓ ✓ ✓
Emma           ✓
Rupert         ✓
Jonny          ✓
Eric           ✓ ✓ ✓
73
What About Extractions
 Extracted Harry Potter actors/actresses (same table as above) [Sigmod, 2014]
 Voting: trust the majority
74
What About Extractions
 Extracted Harry Potter actors/actresses [Sigmod, 2014]
(Same table as above, now annotated with extractor quality: Ext1 high recall, Ext2 high precision, Ext3 medium precision/recall)
 Quality-based:
– More likely to be correct if extracted by high-precision extractors
– More likely to be wrong if not extracted by high-recall extractors
75
Graphical Model
[VLDB, 2015]
 Observations
– X_ewdv: whether extractor e extracts the (d,v) item-value pair from source w
 Latent variables
– C_wdv: whether source w indeed provides the (d,v) pair
– V_d: the correct value(s) for d
 Parameters
– A_w: accuracy of source w
– P_e: precision of extractor e
– R_e: recall of extractor e
76
Algorithm
[VLDB, 2015]
 E-step (by Bayesian analysis)
– Compute Pr(W provides T | extractor quality)
– Compute Pr(T | source quality)
 M-step
– Compute source accuracy
– Compute extractor precision and recall
 Iterate until convergence (see the sketch below)
77
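A much-simplified EM sketch in this spirit: it collapses the extractor layer, so it only alternates between estimating triple correctness from source accuracies and re-estimating source accuracies from the triple estimates. The full model in [Dong et al., VLDB'15] also estimates extraction correctness and extractor precision/recall.

```python
# Simplified EM over sources and triples (extractor layer omitted).
from collections import defaultdict

# observations: source -> set of triples it (appears to) provide
provides = {
    "Src1": {("Carey", "affiliation", "UCI"), ("Dewitt", "affiliation", "MSR")},
    "Src2": {("Carey", "affiliation", "ATT"), ("Dewitt", "affiliation", "MSR")},
    "Src3": {("Carey", "affiliation", "UCI"), ("Dewitt", "affiliation", "UW")},
}

accuracy = {s: 0.8 for s in provides}          # initial guess
for _ in range(10):                            # EM iterations
    # E-step: triple probability ~ normalized accuracy-weighted support
    support = defaultdict(float)
    for src, triples in provides.items():
        for t in triples:
            support[t] += accuracy[src]
    prob = {}
    for (subj, pred, _obj), _v in list(support.items()):
        total = sum(v for (s2, p2, _o), v in support.items()
                    if (s2, p2) == (subj, pred))
        for (s2, p2, o2), v in support.items():
            if (s2, p2) == (subj, pred):
                prob[(s2, p2, o2)] = v / total
    # M-step: a source's accuracy = average probability of the triples it provides
    for src, triples in provides.items():
        accuracy[src] = sum(prob[t] for t in triples) / len(triples)

print({t: round(p, 2) for t, p in prob.items()})
print({s: round(a, 2) for s, a in accuracy.items()})
```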
Fusion Performance
78
Outline
 Motivation
 Knowledge extraction
 Knowledge fusion
 Interesting applications
 Future directions
79
Usage of Probabilistic Knowledge
(Figure repeated from slide 60)
 Source errors: knowledge-based trust
 Extraction errors: data abnormality diagnosis
80
Application I. Web Source Quality
 What we have now
– PageRank: links between websites/webpages
– Log based: search log and click-through rate
– Web spam
– etc.
81
I. Tail Sources Can Be Useful
 Good answer for an award-winning song
82
I. Tail Sources Can Be Useful
 Missing answer for a not-so-popular song
83
I. Tail Sources Can Be Useful
 Very precise info on guitar players but low Page Rank
84
II. Popular Websites May Not Be Trustworthy
Domain: Gossip websites (http://www.ebizmba.com/articles/gossip-websites)
www.eonline.com, perezhilton.com, radaronline.com, www.zimbio.com, mediatakeout.com, gawker.com, www.popsugar.com, www.people.com, www.tmz.com, www.fishwrapper.com, celebrity.yahoo.com, wonderwall.msn.com, hollywoodlife.com, www.wetpaint.com
14 out of 15 have a PageRank among the top 15% of websites
85
Application I. Web Source Quality
Fact      1  2  3  4  5  6  7  8  9  10  ...
Correct   ✓  ✓  ✘  ✓  ✘  ✓  ✓  ✓  ✓  ✘   ...
Accuracy: 0.7 (7 of the 10 sampled facts are correct)
86
Application I. Web Source Quality
Triple         1    2    3    4    5    6    7    8    9    10   ...
Correctness    1.0  0.9  0.3  0.8  0.4  0.8  0.9  1.0  0.7  0.2  ...
Accuracy: 0.7 (average correctness probability of the triples)
87
Application I. Web Source Quality
Triple             1    2    3    4    5    6    7    8    9    10   ...
Triple corr.       1.0  0.9  0.3  0.8  0.4  0.8  0.9  1.0  0.7  0.2  ...
Extraction corr.   1.0  1.0  1.0  1.0  0.9  0.9  0.8  0.2  0.1  0.1  ...
Accuracy: 0.7 → 0.73 (triple correctness weighted by extraction correctness; see the sketch below)
88
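The two accuracy numbers on this slide can be reproduced with a small sketch: a plain average of triple correctness gives 0.7, and weighting each triple by its extraction correctness (so poorly extracted triples count less toward the source's accuracy) gives about 0.73. This is an illustrative reading of the idea, not the full knowledge-based trust computation.

```python
# Reproducing the 0.7 and 0.73 accuracy estimates from the slide's numbers.
triple_corr = [1.0, 0.9, 0.3, 0.8, 0.4, 0.8, 0.9, 1.0, 0.7, 0.2]
extract_corr = [1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.8, 0.2, 0.1, 0.1]

plain = sum(triple_corr) / len(triple_corr)
weighted = (sum(t * e for t, e in zip(triple_corr, extract_corr))
            / sum(extract_corr))
print(round(plain, 2), round(weighted, 2))   # 0.7 0.73
```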
Knowledge-Based Trust (KBT)
 Trustworthiness in [0,1] for 5.6M websites and 119M webpages
89
Knowledge-Based Trust vs. PageRank
(Scatter plot of KBT vs. PageRank: scores are largely correlated; a cluster of tail sources has high trustworthiness despite low PageRank)
90
I. Tail Sources w. Low PageRank
 Among 100 sampled websites, 85 are indeed trustworthy
91
Knowledge-Based Trust vs. PageRank
(Same scatter plot as above, with additional annotation: besides the correlated scores and the tail sources with high trustworthiness, there are often sources with low accuracy)
92
II. Popular Websites May Not Be Trustworthy
(Same gossip-website list as slide 85: 14 out of 15 have a PageRank among the top 15% of websites, yet all have knowledge-based trust in the bottom 50%)
93
II. Popular Websites May Not Be Trustworthy
94
III. Website Recommendation by Vertical
95
III. Website Recommendation by Vertical
96
Application II. X-Ray for Extracted Data
 Goal:
– Help users analyze errors, changes and abnormalities in data
 Intuitions:
– Cluster errors by features
– Return clusters with top error rates (see the sketch below)
97
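The intuition above can be sketched in a few lines: group triples by a feature, compute per-group error rates, and report the worst groups. The toy triples and error flags below are assumptions for illustration.

```python
# Sketch of error clustering: group extracted triples by a feature (here the
# (source or extractor, predicate, object) combination), compute the error rate
# per group, and rank groups by error rate.
from collections import defaultdict

triples = [
    # (source_or_extractor, predicate, object, is_error)
    ("besoccor.com", "date_of_birth", "1986_02_18", True),
    ("besoccor.com", "date_of_birth", "1986_02_18", True),
    ("imdb.com", "date_of_birth", "1971-02-18", False),
    ("ExtractorX", "namesakes", "the county", True),
]

groups = defaultdict(lambda: [0, 0])      # feature -> [errors, total]
for src, pred, obj, is_error in triples:
    g = groups[(src, pred, obj)]
    g[0] += is_error
    g[1] += 1

ranked = sorted(groups.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for feature, (errs, total) in ranked:
    print(feature, f"{errs}/{total} = {errs / total:.0%} errors")
```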
Application II. X-Ray for Extracted Data
 Cluster 1
– Feature: (besoccor.com, date_of_birth, 1986_02_18)
– #Triples: 630; Errors: 100%
– Reason: default value
 Cluster 2
– Feature: (ExtractorX, pred: namesakes, obj: the county)
– #Triples: 4878; Errors: 99.8%
– E.g., [Salmon P. Chase, namesakes, The County]
– Context: “The county was named for Salmon P. Chase, former senator and governor of Ohio”
– Reason: unresolved coreference
98
Outline
 Motivation
 Knowledge extraction
 Knowledge fusion
 Interesting applications
 Future directions
99
Extraction Readiness
 Extraction is still very sparse
– 74% of URLs contribute fewer than 5 triples each
 Extraction is of low quality
– Overall accuracy is as low as 11.5%
 Imbalance between texts and semi-structured data
 Combine strengths of distant supervision and Open IE
– Add new predicates to knowledge bases
– Role for crowdsourcing?
100
Fusion Readiness
 Single-truth assumption
– Pro: filters a large amount of noise
– Con: often does not hold in practice
 Value similarity and hierarchy
– E.g., San Francisco vs. Oakland
 Copy detection
– Existing techniques not applicable at web scale
 Simple vs. sophisticated facts; identify different perspectives
101
Customer Readiness
 Head: well-known authorities
 Long tails: limited support
 Call to arms: Leave NO Valuable Data Behind
102
Takeaways
 Building high-quality knowledge bases is very challenging
– Knowledge extraction and knowledge fusion are critical
 Much work in knowledge extraction by the ML and NLP communities
– Parallels to work in data integration by the database community
 Knowledge fusion is an exciting new area of research
– Data fusion techniques can be extended for this problem
 A lot more research needs to be done!
103
Thank You!
104