From Information to Knowledge

Harvesting Entities and Relationships
From Web Sources
Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
Martin Theobald
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~mtb/
Goal: Turn Web into Knowledge Base
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships
Approach: Harvesting Facts from Web
[Figure: example fact tables harvested from Web sources, with the schemas Politician / Political Party, PoliticalParty / Spokesperson, Politician / Position, Company / AcquiredCompany, Company / CEO, Movie / ReportedRevenue and Actor / Award. Sample entries include Angela Merkel, Karl-Theodor zu Guttenberg, Christoph Hartmann, Claudia Roth, CDU, FDP, Die Grünen, Chancellor Germany, Minister of Defense Germany, Minister of Economy Saarland, Google, YouTube, Yahoo, Overture, Facebook, FriendFeed, Software AG, IDS Scheer, Eric Schmidt, Avatar ($ 2,718,444,933), The Reader ($ 108,709,522), Christoph Waltz, Sandra Bullock, Oscar, Golden Raspberry.]
Cyc, TextRunner, YAGO-NAGA, IWP, ReadTheWeb
Knowledge as Enabling Technology
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps
(e.g. deep QA)
• semantic search: precise answers to advanced queries
(by scientists, students, journalists, analysts, etc.)
US president when Barack Obama was born?
Indy 500 winners who are still alive?
Politicians who are also scientists?
Relationship between Angela Merkel, Jim Gray, Dalai Lama?
...
Enzymes that inhibit HIV?
Influenza drugs for teens with high blood pressure?
Knowledge Search (1)
Who was US president when Barack Obama was born?
http://www.wolframalpha.com
Knowledge Search (1)
Who was mayor of Indianapolis when Barack Obama was born?
not enough facts in KB !
http://www.wolframalpha.com
Knowledge Search (2)
Indy500 winners?
http://www.google.com/squared/
Knowledge Search (2)
Indy500 winners from Europe?
no types, no inference !
http://www.google.com/squared/
Related Work
[Figure: map of related systems arranged around three areas: ontologies (Semantic Web), information extraction (Statistical Web), communities (Social Web). Systems shown: Yago-Naga, Text2Onto, EntityRank, Powerset, ReadTheWeb, Cazoodle, Avatar, Hakia, System T, Cyc, UIMA, Kylin, KOG, kosmix, WolframAlpha, sig.ma, SWSE, DBpedia, Answers, START, KnowItAll, TextRunner, StatSnowball, EntityCube, Cimple, PSOX, DBlife, TrueKnowledge, Freebase, WebTables, GoogleSquared, WorldWideTables, IWP.]
Outline

What and Why
Framework
Entities and Classes
Relationships
Temporal Knowledge
Wrap-up
...
Framework: Types of Knowledge
• facts / assertions: bornIn (JohnDillinger, Indianapolis),
hasWon (JimGray, TuringAward), …
• taxonomic: instanceOf (JohnDillinger, bankRobbers),
subclassOf (bankRobbers, criminals), …
• lexical / terminology: means (“Big Apple“, NewYorkCity),
means (“Big Mike“, MichaelStonebraker)
means (“MS“, Microsoft) , means (“MS“, MultipleSclerosis) …
• common-sense properties:
apples are green, red, juicy, sweet, sour … - but not fast, smart …
balls are round, smooth, slippery … - but not square, funny …
• common-sense axioms:
∀x: human(x) ⇒ male(x) ∨ female(x)
∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))
∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x))) …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)),
believes (Copernicus, shape(Earth, sphere)) …
...
Framework: Information Extraction (IE)
Source text:
Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …
Source-centric IE (one source): 1) recall ! 2) precision
instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…
Yield-centric harvesting (many sources): 1) precision ! 2) recall
near-human quality !
hasAdvisor
Student | Advisor
Surajit Chaudhuri | Jeffrey Ullman
Alon Halevy | Jeffrey Ullman
Jim Gray | Mike Harrison
… | …
almaMater
Student | University
Surajit Chaudhuri | Stanford U
Alon Halevy | Stanford U
Jim Gray | UC Berkeley
… | …
Framework: Knowledge Representation
• RDF (Resource Description Framework, W3C):
subject-property-object (SPO) triples, binary relations
structure, but no (prescriptive) schema
• Relations, frames
• Description logics: OWL, DL-lite
• Higher-order logics, epistemic logics
facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
temporal & provenance annotations
can refer to reified facts via fact identifiers
(approx. equiv. to RDF quadruples: “Color“ × Sub × Prop × Obj)
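The idea of facts about facts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fact` class and the `annotations` helper are hypothetical names, not part of RDF or of any system named on these slides; the data are the ten facts listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    fact_id: int
    subject: object   # an entity name, or the integer id of another fact (reification)
    predicate: str
    obj: object

# The ten facts from the slide; facts 5-10 refer to facts 1-4 by identifier.
FACTS = [
    Fact(1, "JimGray", "hasAdvisor", "MikeHarrison"),
    Fact(2, "SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    Fact(3, "Madonna", "marriedTo", "GuyRitchie"),
    Fact(4, "NicolasSarkozy", "marriedTo", "CarlaBruni"),
    Fact(5, 1, "inYear", 1968),
    Fact(6, 2, "inYear", 2006),
    Fact(7, 3, "validFrom", "22-Dec-2000"),
    Fact(8, 3, "validUntil", "Nov-2008"),
    Fact(9, 4, "validFrom", "2-Feb-2008"),
    Fact(10, 2, "source", "SigmodRecord"),
]

def annotations(fact_id):
    """Collect temporal and provenance annotations attached to one base fact."""
    return {f.predicate: f.obj for f in FACTS if f.subject == fact_id}

print(annotations(3))
```

Keeping annotations as ordinary triples whose subject is a fact identifier means the same store and the same lookup code serve both base facts and meta-facts.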
...
KB‘s: Example YAGO
(Suchanek et al.: WWW‘07)
2 Mio. entities, 20 Mio. facts
40 Mio. RDF triples (entity1-relation-entity2, subject-predicate-object)
Accuracy ≈ 95%
[Figure: excerpt of the YAGO graph. Classes connected by subclass edges: Entity; Person, Organization, Location; Scientist, Politician; Biologist, Physicist; State, Country, City. Entities such as Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Max_Planck Society and Nobel Prize, connected by instanceOf, bornIn, bornOn, diedOn, hasWon, FatherOf, citizenOf and locatedIn edges, with dates Apr 23, 1858, Oct 4, 1947 and Oct 23, 1944. Names such as “Max Planck” and “Max Karl Ernst Ludwig Planck”, or “Angela Merkel” and “Angela Dorothea Merkel”, attached to entities by weighted means edges (e.g. 0.9, 0.1).]
http://www.mpi-inf.mpg.de/yago-naga/
KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07)
• 3 Mio. entities,
• 1 billion facts (RDF triples)
• 1.5 Mio. entities mapped to
hand-crafted taxonomy of
259 classes with 1200 properties
http://www.dbpedia.org
Entities & Classes
Which entity types (classes, unary predicates) are there?
scientists, doctoral students, computer scientists, …
female humans, male humans, married humans, …
Which subsumptions should hold
(subclass/superclass, hyponym/hypernym, inclusion dependencies)?
subclassOf (computer scientists, scientists),
subclassOf (scientists, humans), …
Which individual entities belong to which classes?
instanceOf (Surajit Chaudhuri, computer scientists),
instanceOf (BarbaraLiskov, computer scientists),
instanceOf (Barbara Liskov, female humans), …
Which names denote which entities?
means (“Lady Di“, Diana Spencer),
means (“Diana Frances Mountbatten-Windsor”, Diana Spencer), …
means (“Madonna“, Madonna Louise Ciccone),
means (“Madonna“, Madonna (painting by Edvard Munch)), …
...
WordNet Thesaurus [Miller/Fellbaum 1998]
3 concepts / classes &
their synonyms (synset‘s)
http://wordnet.princeton.edu/
WordNet Thesaurus [Miller/Fellbaum 1998]
subclasses
(hyponyms)
superclasses
(hypernyms)
http://wordnet.princeton.edu/
WordNet Thesaurus [Miller & Fellbaum 1998]
> 100 000 classes and lexical relations;
can be cast into
• description logics or
• graph, with weights for relation strengths
(derived from co-occurrence statistics)
but:
only few individual entities
(instances of classes)
scientist, man of science
(a person with advanced knowledge)
=> cosmographer, cosmographist
=> biologist, life scientist
=> chemist
=> cognitive scientist
=> computer scientist
...
=> principal investigator, PI
…
HAS INSTANCE => Bacon, Roger Bacon
http://wordnet.princeton.edu/
…
Tapping on Wikipedia Categories
Mapping: Wikipedia → WordNet
[Suchanek: WWW‘07, Ponzetto & Strube: AAAI‘07]
[Figure: the Wikipedia entity Jim Gray (computer specialist) and its categories (American Computer Scientists, Computer Scientists by Nation, Databases, Database Researcher, Fellows of the ACM, Members of Learned Societies, Engineering Societies, People Lost at Sea), linked by instanceOf and subclassOf edges, next to candidate WordNet classes (Person; American; Scientist; Computer Scientist; Chemist; Artist; Sailor, Crewman; Missing Person; Database; ACM; Fellow (1), Comrade; Fellow (2), Colleague; Fellow (3) (of Society); Member (1), Fellow; Member (2), Extremity). Question marks denote open candidate mappings.]
• name similarity (edit dist., n-gram overlap)
• context similarity (word/phrase level) ?
• machine learning ?
Mapping: Wikipedia → WordNet
[Suchanek: WWW‘07, Ponzetto & Strube:AAAI‘07]
Given:
entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
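Analyzing category names → noun group parser:
American Musicians of Italian Descent
pre-modifier: American | head: Musicians | post-modifier: of Italian Descent
American Folk Music of the 20th Century
pre-modifier: American Folk | head: Music | post-modifier: of the 20th Century
American Indy 500 Drivers on Pole Positions
pre-modifier: American Indy 500 | head: Drivers | post-modifier: on Pole Positions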
Head word is key, should be in plural for instanceOf
Mapping Wikipedia Entities to WordNet Classes
[Suchanek: WWW‘07, Ponzetto & Strube: AAAI‘07]
Given:
entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Heuristic Method:
for each ci do
if head word w of category name ci is plural
{
1) match w against synsets of WordNet classes
2) choose best fitting class c and set e ∈ c
3) expand w by pre-modifier and set ci ⊆ w+ ⊆ c
}
tuned conservatively: high precision, reduced recall
• can also derive features this way
• feed into supervised classifier
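The heuristic above can be sketched roughly as follows. This is a toy illustration, not the YAGO implementation: `TOY_WORDNET` is a hypothetical stand-in for WordNet synset lookup, the preposition list and the plural test (a trailing "s") are crude assumptions, and "best fitting class" is reduced to a dictionary hit.

```python
# Hypothetical stand-in for WordNet: singular head noun -> class name.
TOY_WORDNET = {"musician": "musician", "driver": "driver", "scientist": "scientist"}
PREPOSITIONS = {"of", "on", "in", "by", "from", "at"}

def parse_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    if cut == 0:                      # name starts with a preposition: treat all as noun group
        cut = len(words)
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])

def singular(head):
    """Return the singular form if the head looks plural, else None (crude test)."""
    h = head.lower()
    if h.endswith("s") and not h.endswith("ss"):
        return h[:-1]
    return None

def map_entity(entity, categories):
    facts = []
    for ci in categories:
        pre, head, _post = parse_category(ci)
        sing = singular(head)
        if sing is None:              # head not plural: category is skipped
            continue
        c = TOY_WORDNET.get(sing)     # step 1/2: match head against the toy classes
        if c is None:
            continue
        facts.append(("instanceOf", entity, c))
        w_plus = f"{pre} {head}".strip()          # step 3: head expanded by pre-modifier
        if w_plus != ci:
            facts.append(("subclassOf", ci, w_plus))
        facts.append(("subclassOf", w_plus, c))
    return facts

print(map_entity("Madonna", ["American Musicians of Italian Descent",
                             "American Folk Music of the 20th Century"]))
```

The second category is skipped because its head word "Music" is not plural, which mirrors the conservative tuning mentioned on the slide.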
Learning More Mappings [ Wu & Weld: WWW‘08 ]
Kylin Ontology Generator (KOG):
learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (MLN‘s, SVM‘s)
• rich features from various sources
• category/class name similarity measures
• category instances and their infobox templates:
template names, attribute names (e.g. knownFor)
• Wikipedia edit history:
refinement of categories
• Hearst patterns:
C such as X, X and Y and other C‘s, …
• other search-engine statistics:
co-occurrence frequencies
> 3 Mio. entities
> 1 Mio. w/ infoboxes
> 500 000 categories
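The Hearst patterns mentioned above ("C such as X", "X and other C‘s") can be approximated with regular expressions. The sketch below is a simplification under stated assumptions: instances are taken to be capitalized phrases, only the first instance after "such as" is captured, and the example sentence is made up for illustration.

```python
import re

# Assumption: an instance name is a run of capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
SUCH_AS = re.compile(r"(\w+) such as (" + NAME + ")")
AND_OTHER = re.compile(r"(" + NAME + r"),? and other (\w+)")

def hearst_pairs(text):
    """Return (instance, class word) pairs found by the two patterns."""
    pairs = [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(text)]
    pairs += [(m.group(1), m.group(2)) for m in AND_OTHER.finditer(text)]
    return pairs

text = ("He admired scientists such as Jim Gray. "
        "Alan Turing and other pioneers shaped the field.")
print(hearst_pairs(text))
```

Each hit is only evidence for an instanceOf or subclassOf candidate; a classifier such as the one described above would combine it with the other features.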
Goal: Comprehensive & Consistent !
[Figure: Wikipedia entities (Jeffrey Ullman, Jim Gray (computer specialist), Madonna (entertainer), Bob Dylan, …) with infobox attributes (Known For, Alma Mater, Notable Awards, Doctoral Students, Website, Genres, Also Known As, Born, Years Active, …), their Wikipedia categories (Database Researcher, Fellows of the ACM, American Computer Scientists, Bell Labs, Princeton Alumni, U Michigan Alumni, Telecomm. History, Knuth Prize Laureate, World Record Holders, Hall of Fame Inductees, Guitar Players, American People by Occupation, Databases, Members of Learned Societies, Americans of Italian Descent, American Songwriters, People by Status) and candidate WordNet classes (Academic, Scientist, Fellow(1), Fellow(2), Computer, Data, Award Winner, Athlete, Artist, Musician, Singer, Italian, American).]
Clean up the mess:
• graph algorithms ? (random walk with restart, dense subgraphs, …)
• statistical machine learning ?
• logical consistency reasoning ?
• gigantic schema integration ?
• ontology merging ?
Long Tail of Class Instances
[Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]
State-of-the-Art Approach (e.g. SEAL):
• Start with seeds: a few class instances
• Find lists, tables, text snippets (“for example: …“), …
that contain one or more seeds
• Extract candidates: noun phrases from vicinity
• Gather co-occurrence stats (seed&cand, cand&className pairs)
• Rank candidates
• point-wise mutual information, …
• random walk (PR-style) on seed-cand graph
But:
Precision drops for classes with sparse statistics (DB profs, …)
Harvested items are names, not entities
Canonicalization (de-duplication) unsolved
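Ranking candidates by point-wise mutual information can be sketched as below. The counts are invented toy numbers, and `rank_candidates` is a hypothetical helper, not the scoring used by SEAL or any other system cited here.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information: log( P(x,y) / (P(x) * P(y)) )."""
    return math.log((count_xy * total) / (count_x * count_y))

def rank_candidates(seed, cooc, freq, total):
    """Order candidates by PMI with the seed; candidates never seen with it are dropped."""
    scores = {cand: pmi(n_xy, freq[seed], freq[cand], total)
              for cand, n_xy in cooc.items() if n_xy > 0}
    return sorted(scores, key=scores.get, reverse=True)

# Toy statistics over 1000 snippets (made-up numbers).
freq = {"Jim Gray": 50, "Mike Stonebraker": 40, "San Francisco": 200, "Microsoft": 100}
cooc = {"Mike Stonebraker": 20, "San Francisco": 20, "Microsoft": 5}
print(rank_candidates("Jim Gray", cooc, freq, total=1000))
```

With sparse counts the ratio inside the logarithm becomes unstable, which is one way to read the precision drop for sparsely covered classes noted above.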
Individual Entity Disambiguation
Names: “Penn“, “U Penn“, “Penn State“, “PSU“
Entities: Sean Penn, University of Pennsylvania, Pennsylvania State University, Pennsylvania (US State), Passenger Service Unit
Which name maps to which entity ?
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings:
disambiguation pages, re-directs, inter-wiki links,
anchor texts of href links
Collective Entity Disambiguation
[McCallum 2003, Doan 2005, Getoor 2006, Domingos 2007, Chakrabarti 2009, …]
• Consider a set of names {n1, n2, …} in same context
and sets of candidate entities
E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define joint objective function (e.g. likelihood for prob. model)
that rewards coherence of mappings ni → eij
• Solve optimization problem
Example: names “Stuart Russell“ and “Michael Jordan“ in the same context
candidates for Stuart Russell: Stuart Russell (DJ), Stuart Russell (computer scientist)
candidates for Michael Jordan: Michael Jordan (computer scientist), Michael Jordan (NBA)
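A minimal sketch of the joint optimization, assuming a pairwise coherence score between entities: the scores below are invented, and exhaustive enumeration stands in for whatever optimizer a real system would use.

```python
from itertools import product

candidates = {
    "Stuart Russell": ["Stuart Russell (DJ)", "Stuart Russell (computer scientist)"],
    "Michael Jordan": ["Michael Jordan (computer scientist)", "Michael Jordan (NBA)"],
}

# Hypothetical coherence scores between entity pairs (e.g. from link overlap).
coherence = {
    frozenset({"Stuart Russell (computer scientist)",
               "Michael Jordan (computer scientist)"}): 0.9,
    frozenset({"Stuart Russell (DJ)", "Michael Jordan (NBA)"}): 0.1,
}

def joint_score(entities):
    """Sum of pairwise coherence over all chosen entities."""
    return sum(coherence.get(frozenset({a, b}), 0.0)
               for i, a in enumerate(entities) for b in entities[i + 1:])

def disambiguate(cands):
    """Pick one entity per name so that the joint score is maximal (brute force)."""
    names = list(cands)
    best = max(product(*(cands[n] for n in names)), key=joint_score)
    return dict(zip(names, best))

print(disambiguate(candidates))
```

Enumeration is exponential in the number of names, so it only serves to make the objective concrete.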
Problems and Challenges
Wikipedia categories reloaded
comprehensive & consistent instanceOf and subclassOf
across Wikipedia and WordNet (via consistency reasoning ?)
Long tail of entities
beyond Wikipedia: domain-specific entity catalogs
discover new entities, detect new names for known entities
Tags, tables, topics
tap on other sources: Web2.0, Web tables, directories, etc.
Robust disambiguation
near-real-time mapping of names to entities
with near-human quality
Relationships
Which instances (pairs of individual entities) are there
for given binary relations with specific type signatures?
hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
hasWonPrize (JimGray, TuringAward)
bornOn (JohnLennon, 9Oct1940)
diedOn (JohnLennon, 8Dec1980)
marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there
between given classes of entities?
competedWith(x,y), nominatedForPrize(x,y), …
divorcedFrom(x,y), affairWith(x,y), …
assassinated(x,y), rescued(x,y), admired(x,y), …
Picking Low-Hanging Fruit (First)
Deterministic Pattern Matching
[Kushmerick 97, Califf & Mooney 99, Gottlob 01, …]
• Regular expressions matching
• Wrapper induction
(grammar learning for
restricted regular languages)
• Well understood
...
French Marriage Problem
facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)
new facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
Pattern-Based Harvesting
(Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
Facts & Fact Candidates:
(Hillary, Bill)
(Carla, Nicolas)
(Angelina, Brad)
(Victoria, David)
(Yoko, John)
(Kate, Pete)
(Carla, Benjamin)
(Larry, Google)
Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…
• good for recall
• noisy, drifting
• not robust enough
for high precision
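One round of pattern-based harvesting might look like the sketch below. The sentences are invented, names are assumed to be single capitalized words, and the two helper functions are hypothetical; real harvesters add pattern scoring and many iterations.

```python
import re

sentences = [
    "Hillary and her husband Bill attended the gala.",
    "Carla and her husband Nicolas arrived in Paris.",
    "Victoria and her husband David were seen in London.",
    "Larry loves Google.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def learn_patterns(sentences, seeds):
    """Count the text found between the two arguments of each known fact."""
    patterns = {}
    for s in sentences:
        for x, y in seeds:
            m = re.search(re.escape(x) + r" (.+?) " + re.escape(y), s)
            if m:
                patterns[m.group(1)] = patterns.get(m.group(1), 0) + 1
    return patterns

def apply_patterns(sentences, patterns):
    """Find all name pairs connected by a learned pattern."""
    found = set()
    for s in sentences:
        for p in patterns:
            for m in re.finditer(r"([A-Z]\w+) " + re.escape(p) + r" ([A-Z]\w+)", s):
                found.add((m.group(1), m.group(2)))
    return found

patterns = learn_patterns(sentences, seeds)
new_candidates = apply_patterns(sentences, patterns) - seeds
print(patterns, new_candidates)
```

Had "X loves Y" been learned as a pattern, (Larry, Google) would be harvested too, which is the drift that the reasoning step is meant to catch.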
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates
FOL rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,y) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
ground atoms:
spouse(Hillary,Bill)
spouse(Carla,Nicolas)
spouse(Cecilia,Nicolas)
spouse(Carla,Ben)
spouse(Carla,Mick)
spouse(Carla,Sofie)
f(Hillary)
f(Carla)
f(Cecilia)
f(Sofie)
m(Bill)
m(Nicolas)
m(Ben)
m(Mick)
Rules reveal inconsistencies
Find consistent subset(s) of atoms
(“possible world(s)“, “the truth“)
Rules can be weighted
(e.g. by fraction of ground atoms that satisfy a rule)
→ uncertain / probabilistic data
→ compute prob. distr. of subset of atoms being the truth
Markov Logic Networks (MLN‘s)
(M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates
into probabilistic graph model: Markov Random Field (MRF)
Rules:
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,y) ⇒ ¬s(w,y)
s(x,y) ⇒ f(x)
s(x,y) ⇒ m(y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)
Fact candidates:
s(Carla,Nicolas)
s(Cecilia,Nicolas)
s(Carla,Ben)
s(Carla,Sofie)
…
Grounding:
Literal → Boolean Var
Literal → binary RV
s(Ca,Nic) ⇒ ¬s(Ce,Nic)
s(Ca,Nic) ⇒ ¬s(Ca,Ben)
s(Ca,Nic) ⇒ ¬s(Ca,So)
s(Ca,Ben) ⇒ ¬s(Ca,So)
s(Ca,Nic) ⇒ m(Nic)
s(Ce,Nic) ⇒ m(Nic)
s(Ca,Ben) ⇒ m(Ben)
s(Ca,So) ⇒ m(So)
Markov Logic Networks (MLN‘s)
(M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates
into probabilistic graph model: Markov Random Field (MRF)
Rules:
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,y) ⇒ ¬s(w,y)
s(x,y) ⇒ f(x)
s(x,y) ⇒ m(y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)
Fact candidates:
s(Carla,Nicolas)
s(Cecilia,Nicolas)
s(Carla,Ben)
s(Carla,Sofie)
…
MRF nodes (binary RVs): s(Ce,Nic), s(Ca,Nic), s(Ca,Ben), s(Ca,So), m(Nic), m(Ben), m(So)
RVs coupled by MRF edge if they appear in same clause
MRF assumption: P[Xi|X1..Xn]=P[Xi|N(Xi)]
joint distribution has product form over all cliques
Variety of algorithms for joint inference:
Gibbs sampling, other MCMC, belief propagation,
randomized MaxSat, …
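For a handful of ground atoms the distribution over possible worlds can be computed by brute force, which makes the product form concrete. The sketch below uses the standard MLN scoring (a world's unnormalized weight is the exponential of the summed weights of its satisfied ground clauses); the clause weights are invented, and enumeration replaces the sampling methods listed above.

```python
import math
from itertools import product

atoms = ["s(Ca,Nic)", "s(Ce,Nic)"]

# Weighted ground clauses; a clause is a list of (atom, required truth value) literals.
# Weights are made-up numbers for illustration.
weighted_clauses = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # not both spouses of Nic
    (1.5, [("s(Ca,Nic)", True)]),                          # pattern evidence for Ca
    (0.5, [("s(Ce,Nic)", True)]),                          # weaker evidence for Ce
]

def world_probabilities(atoms, weighted_clauses):
    """Enumerate all truth assignments and normalize exp(sum of satisfied weights)."""
    worlds = [dict(zip(atoms, vals)) for vals in product([True, False], repeat=len(atoms))]
    def score(world):
        return sum(w for w, clause in weighted_clauses
                   if any(world[a] == value for a, value in clause))
    unnorm = [math.exp(score(w)) for w in worlds]
    z = sum(unnorm)
    return [(w, u / z) for w, u in zip(worlds, unnorm)]

dist = world_probabilities(atoms, weighted_clauses)
best_world, best_p = max(dist, key=lambda wp: wp[1])
print(best_world, round(best_p, 3))
```

Taking the maximum over worlds instead of the full distribution corresponds to the weighted MaxSat view of the same problem.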
Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]
log-linear classifiers with constraint-violation penalty
mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination
[A. McCallum et al. 2008]
RV‘s share “factors“ (joint feature functions)
generalizes MRF, BN, CRF, …
inference via advanced MCMC
flexible coupling & constraining of RV‘s
[Figure: factor graph over the RVs s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So), m(Nic), m(Ben), m(So)]
software tools: alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route
(F. Suchanek et al.: WWW‘09)
facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)
new fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Carla, Sofie)
married (Larry, Google)
patterns:
X and her husband Y
X and Y and their children
X has been dating with Y
X loves Y
Direct approach:
• facts are true; fact candidates & patterns → hypotheses
• grounded constraints → clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact cand consistency, pattern goodness, entity disambig.
www.mpi-inf.mpg.de/yago-naga/sofie/
Facts & Patterns Consistency
(F. Suchanek et al.: WWW‘09)
constraints to connect facts, fact candidates, patterns
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping:
¬means(n,e1) ∨ ¬means(n,e2) ∨ …
functional dependencies:
spouse(X,Y): X → Y, Y → X
relation properties:
asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
spouse ⊆ Person × Person
capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
www.mpi-inf.mpg.de/yago-naga/sofie/
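As a small illustration of how such constraints expose conflicts, the sketch below checks the two functional dependencies of spouse(X,Y) on a set of candidate pairs. The helper `fd_violations` is a hypothetical name, and this is plain constraint checking, not the Weighted Max-Sat reasoning of SOFIE.

```python
from collections import defaultdict

def fd_violations(pairs):
    """Return every left-hand value that is paired with more than one right-hand value."""
    images = defaultdict(set)
    for x, y in pairs:
        images[x].add(y)
    return {x: ys for x, ys in images.items() if len(ys) > 1}

spouse_candidates = [
    ("Hillary", "Bill"),
    ("Carla", "Nicolas"),
    ("Carla", "Benjamin"),
    ("Cecilia", "Nicolas"),
]

# X -> Y: each first argument has at most one spouse.
print(fd_violations(spouse_candidates))
# Y -> X: same check on the inverted pairs.
print(fd_violations([(y, x) for x, y in spouse_candidates]))
```

Each reported group is a set of mutually exclusive hypotheses, of which a reasoner may keep at most one (per time point, once temporal scopes are added).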
Soft Rules vs. Hard Constraints
Enforce FD‘s (mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)
combine with weighted constraints
→ no longer MaxSat
→ constrained MaxSat instead
Generalize to other forms of constraints:
hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y)
open issue for arbitrary constraints
→ rethink reasoning !
Problems and Challenges
High precision & high recall at affordable cost
robust pattern analysis & reasoning
parallel processing, lazy / lifted inference, …
Types and constraints
soft rules & hard constraints, rich DL, beyond CWA
explore & understand different families of constraints
Declarative, self-optimizing workflows
incorporate pattern & reasoning steps into IE queries/programs
Scale, dynamics, life-cycle
grow & maintain KB with near-human-quality over long periods
Open-domain knowledge harvesting
turn names, phrase & table cells into entities & relations
Temporal Knowledge
Which facts for given relations hold
at what time point or during which time intervals ?
marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
capitalOf (Berlin, Germany) [ 1990, now ]
capitalOf (Bonn, Germany) [ 1949, 1989 ]
hasWonPrize (JimGray, TuringAward) [ 1998 ]
graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]
How can we query & reason on entity-relationship facts
in a “time-travel“ manner - with uncertain/incomplete KB ?
US president when Barack Obama was born?
students of Hector Garcia-Molina while he was at Princeton?
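A time-travel lookup over interval-annotated facts can be sketched as follows, using the capitalOf facts listed above. The tuple layout, the `NOW` sentinel and the `holds_at` function are assumptions made for this illustration; uncertainty and incomplete intervals are ignored here.

```python
NOW = float("inf")   # open-ended interval ("now")

# (relation, subject, object, valid_from, valid_until), years taken from the slide
FACTS = [
    ("capitalOf", "Berlin", "Germany", 1990, NOW),
    ("capitalOf", "Bonn", "Germany", 1949, 1989),
]

def holds_at(relation, obj, year):
    """Subjects s such that relation(s, obj) is valid in the given year."""
    return [s for (r, s, o, begin, end) in FACTS
            if r == relation and o == obj and begin <= year <= end]

print(holds_at("capitalOf", "Germany", 1961))
```

A query such as "US president when Barack Obama was born?" would chain two such lookups: first the birth date, then the office holder valid at that date.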
French Marriage Problem
facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)
validFrom (2, 2008)
new fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)
validFrom (4, 1996)
validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)
Challenge: Temporal Knowledge
for all people in Wikipedia (100,000‘s) gather all spouses,
incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night
1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency
consistency constraints are potentially helpful:
• functional dependencies: husband, time → wife
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce
Difficult Dating
(Even More Difficult)
Implicit Dating
explicit dates vs.
implicit dates relative to other dates
(Even More Difficult)
Relative Dating
vague dates
relative dates
narrative text
relative order
TARSQI: Extracting Time Annotations
http://www.timeml.org/site/tarsqi/
(M. Verhagen et al.: ACL‘05)
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy
advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to
appear on the ballot to become the territory’s next chief executive. But he acknowledged that
he had no chance of beating the Beijing-backed incumbent, Donald Tsang,
who is seeking reelection. Under electoral rules imposed by Chinese officials, only 796 people on the election
committee – the bulk of them with close ties to mainland China – will be allowed to vote in the
<TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3>
election. It will be the first contested election for chief executive since Britain returned Hong
Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>.
Mr. Tsang, an able administrator who took office during the early stages of a sharp economic
upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is
popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s
people approve of the job he has been doing. It is of course a foregone conclusion – Donald
Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
extraction errors
Representing Time: AI Perspective
[Allen 1984, Allen & Hayes 1989, …]
• Instant
– durationless piece of time
• Period
– potentially unbounded continuum of instants
• Events
– time as a sequence of events ∈ E
– precedence and overlap relations on E × E
Relations between Time Periods
A Before B | B After A
A Meets B | B MetBy A
A Overlaps B | B OverlappedBy A
A Starts B | B StartedBy A
A During B | B Contains A
A Finishes B | B FinishedBy A
A Equal B
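The thirteen relations between two periods can be decided from the four endpoints alone. The sketch below assumes each period is a (start, end) pair with start < end; the function name `allen_relation` is a label chosen for this illustration, and the returned strings follow the relation names on the slide.

```python
def allen_relation(a, b):
    """Name the relation that period a has to period b; periods are (start, end), start < end."""
    a_s, a_e = a
    b_s, b_e = b
    if a_e < b_s:
        return "Before"
    if b_e < a_s:
        return "After"
    if a_e == b_s:
        return "Meets"
    if b_e == a_s:
        return "MetBy"
    if a == b:
        return "Equal"
    if a_s == b_s:
        return "Starts" if a_e < b_e else "StartedBy"
    if a_e == b_e:
        return "Finishes" if a_s > b_s else "FinishedBy"
    if b_s < a_s and a_e < b_e:
        return "During"
    if a_s < b_s and b_e < a_e:
        return "Contains"
    # The periods properly overlap and share no endpoint.
    return "Overlaps" if a_s < b_s else "OverlappedBy"

print(allen_relation((1949, 1989), (1990, 2010)))
```

Because the relations are mutually exclusive and exhaustive, exactly one branch applies for any pair of valid periods.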
Representing Time: DB Perspective
• Time point:
smallest time unit of fixed duration/granularity
(e.g., a day, a year, a second)
• Interval:
finite set of time points
• State relation:
fact holds at every time point within interval
isCapitalOf (Bonn, Germany) [1949, 1989]
• Event relation:
fact holds at exactly one time point within interval
wonCup (United, ChampionsLeague) [1999, 1999]
intervals can also capture
uncertainty of time points
Uncertainty and Time
• Point-probabilities for facts and intervals
playsFor(Beckham, United)[1990, 2005]:0.9
– fact valid in interval [tb, te ] with prob. p
– fact not valid with prob. 1-p
• Continuous distributions
playsFor(Beckham, United)
[1990, 2005]:Gauss(µ=1996,σ²=1)
• Histograms
playsFor(Beckham, United)
[1990, 1992):0.1
[1992, 2004):0.6
[2004, 2005]:0.2
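The histogram representation can be held in a small list of bins. The sketch below reads each bin value as the probability mass attached to that sub-interval, which is an assumption about the notation; under that reading the three bins sum to 0.9, matching the point probability of the whole interval. The tuple layout and `bin_mass` are hypothetical.

```python
import math

# (begin, end, mass, right_closed) for playsFor(Beckham, United), values from the slide
HISTOGRAM = [
    (1990, 1992, 0.1, False),
    (1992, 2004, 0.6, False),
    (2004, 2005, 0.2, True),
]

def bin_mass(hist, year):
    """Mass of the bin that contains the given year, or 0.0 outside the histogram."""
    for begin, end, mass, right_closed in hist:
        if begin <= year < end or (right_closed and year == end):
            return mass
    return 0.0

total = sum(mass for _, _, mass, _ in HISTOGRAM)
print(bin_mass(HISTOGRAM, 1995), round(total, 2))
```

Finer bins trade storage and inference cost for temporal resolution, which is why the number of bins matters in the reasoning that follows.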
Possible Worlds in Time
hasWon (Beckham, ChampionsLeague) ⇐ playsFor (Beckham, United) ∧ wonCup (United, ChampionsLeague)
Base facts: playsFor (Beckham, United) [State], wonCup (United, ChampionsLeague) [Event]
Derived fact: hasWon (Beckham, ChampionsLeague) [Event]
[Figure: time histograms over the years ‘95 to ‘02 for the two base facts and for the derived fact, each bin labeled with its probability]
• #P-complete per histogram bin
• linear in #bins
Joint Reasoning on Facts & Time
Rules:
marriedTo(a,b,T1) ∧ marriedTo(a,c,T2) ∧ different(b,c) ⇒ disjoint(T1,T2)
marriedTo(a,b,T1) ∧ divorcedFrom(a,b,T2) ⇒ before(T1,T2)
marriedTo(a,b,T1) ∧ bornIn(a,c,T2) ⇒ before(T2,T1)
+ more soft rules:
hasChild (a,c) ∧ hasChild (b,c) ∧ different (a,b) ⇒ marriedTo(a,b)
+ recursive rules …
Facts from KB (with confidence weights):
[Figure: graph of the weighted facts marriedTo (Nicolas, Cecilia), marriedTo (Nicolas, Carla), marriedTo (Carla, Ben), marriedTo (Carla, Mick), divorcedFrom (Nicolas, Cecilia), bornIn (Nicolas, Paris), bornIn (Cecilia, Boulogne), bornIn (Carla, Turin), with confidence weights 0.65, 0.91, 0.25, 0.18, 0.78, 0.77, 0.12, 0.43]
Compute most likely possible world !
[Figure: timeline placing m(Carla, Ben), m(Carla, Mick), m(Nicolas, Carla), div(Nicolas, Cecilia), m(Nicolas, Cecilia), bornIn(Carla, Turin), bornIn(Cecilia, Boulogne) and bornIn(Nicolas, Paris) along the time axis]
Problems and Challenges
Temporal Querying (Revived)
query language (T-SPARQL?), no schema
confidence weights & ranking
Gathering Implicit and Relative Time Annotations
biographies & news, relative orderings
aggregate & reconcile observations
Incomplete and Uncertain Temporal Scopes
incorrect, incomplete, unknown begin/end
vague dating
Consistency Reasoning
extended MaxSat, extended Datalog, prob. graph. models, etc.
for resolving inconsistencies on uncertain facts & uncertain time
KB Building: Where Do We Stand?
Entities & Classes
strong success story, some problems left:
• large taxonomies of classes with individual entities
• long tail calls for new methods
• entity disambiguation remains grand challenge
Relationships
good progress, but many challenges left:
• recall & precision by patterns & reasoning
• efficiency & scalability
• soft rules, hard constraints, richer logics, …
• open-domain discovery of new relation types
Temporal Knowledge
widely open (fertile) research ground:
• uncertain / incomplete temporal scopes of facts
• joint reasoning on ER facts and time scopes
Overall Take-Home
Historic opportunity:
revive Cyc vision, make it real & large-scale !
challenging & risky, but high pay-off
Explore & exploit synergies between
semantic, statistical, & social Web methods:
statistical evidence + logical consistency !
For DB researchers (theoreticians & normal ones):
• efficiency & scalability
• constraints & reasoning
• killer app for uncertain data management
• knowledge-base life-cycle: growth & maintenance
...
Thank You !