Interactive Reasoning in Large and Uncertain RDF Knowledge Bases

Martin Theobald
Joint work with:
Maximilian Dylla, Timm Meiser, Ndapa Nakashole,
Christina Teflioudi, Yafang Wang, Mohamed Yahya,
Mauro Sozio, and Fabian Suchanek
Max Planck Institute for Informatics
French Marriage Problem
marriedTo: person × person
∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
marriedTo_French: person × person
...
French Marriage Problem
Facts in KB:
marriedTo(Hillary, Bill)
marriedTo(Carla, Nicolas)
marriedTo(Angelina, Brad)
New facts or fact candidates:
marriedTo(Cecilia, Nicolas)
marriedTo(Carla, Benjamin)
marriedTo(Carla, Mick)
marriedTo(Michelle, Barack)
marriedTo(Yoko, John)
marriedTo(Kate, Leonardo)
marriedTo(Carla, Sofie)
marriedTo(Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
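A minimal sketch of the precision step, assuming the functional constraint on marriedTo is checked only on the first argument, exactly as the constraint is written. The function name `violations` is hypothetical and not part of URDF.

```python
# Facts already in the KB and newly harvested candidates (from the slide).
kb = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Angelina", "Brad")]
candidates = [
    ("Cecilia", "Nicolas"), ("Carla", "Benjamin"), ("Carla", "Mick"),
    ("Michelle", "Barack"), ("Yoko", "John"), ("Kate", "Leonardo"),
    ("Carla", "Sofie"), ("Larry", "Google"),
]

def violations(kb, candidates):
    """Return candidates (x, z) that clash with a KB fact (x, y) where y != z."""
    spouse_of = {x: y for x, y in kb}
    return [(x, z) for x, z in candidates
            if x in spouse_of and spouse_of[x] != z]

conflicts = violations(kb, candidates)
print(conflicts)
```

Under this assumption only the three Carla candidates are flagged. Catching marriedTo(Cecilia, Nicolas) would need the constraint on the second argument as well, and marriedTo(Larry, Google) would need a type constraint on person.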
Agenda
– URDF: Reasoning in Uncertain Knowledge Bases
• Resolving uncertainty at query-time
• Lineage of answers
• Propositional vs. probabilistic reasoning
• Temporal reasoning extensions
– UViz: The URDF Visualization Frontend
• Demo!
URDF: Reasoning in Uncertain KBs
[Theobald,Sozio,Suchanek,Nakashole: MPII Tech-Report‘10]
• Knowledge harvesting from the Web may yield knowledge bases which are
– Incomplete: bornIn(Albert_Einstein,?x) → {}
– Incorrect: bornIn(Albert_Einstein,?x) → {Stuttgart}
– Inconsistent: bornIn(Albert_Einstein,?x) → {Ulm [0.7], Stuttgart [0.2]}
• Combine grounding of first-order logic rules with an additional step of consistency reasoning
– Propositional – Constrained Weighted MaxSat
– Probabilistic – Lineage & Possible Worlds Semantics
→ At query time!
Soft Rules vs. Hard Constraints
(Soft) Inference Rules vs. (Hard) Consistency Constraints
• People may live in more than one place
livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y) [0.6]
livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y) [0.2]
• People are not born in different places/on different dates
bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
• People are not married to more than one person (at the same time, in most countries?)
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
Soft Rules vs. Hard Constraints (ct’d)
Enforce FDs (e.g., mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ hasAdvisor(x,z) ⇒ y=z
Combine soft and hard constraints
– No longer regular MaxSat
– Constrained (weighted) MaxSat instead
Generalize to other forms of constraints:
Hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
Soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q)+5years ⇒ hasAdvisor(x,y) [0.6]
Datalog-style grounding (deductive & potentially recursive soft rules):
livesIn(x,y) ∧ type(y,City) ∧ locatedIn(y,z) ∧ type(z,Country) ⇒ livesIn(x,z)
Deductive Grounding (SLD Resolution/Datalog)
Query
livesIn(Bill, ?x)
Answers (derived facts):
livesIn(Bill, Arkansas)
livesIn(Bill, New_York)
[Figure: AND/OR resolution tree for the query over rules R1 to R3 and base facts F1 to F3; branches that cannot be grounded are marked X]
RDF Base Facts
F1: marriedTo(Bill, Hillary)
F2: represents(Hillary, New_York)
F3: governorOf(Bill, Arkansas)
First-Order Rules (Horn clauses)
R1: livesIn(?x, ?y) :- marriedTo(?x, ?z), livesIn(?z, ?y)
R2: livesIn(?x, ?y) :- represents(?x, ?y)
R3: livesIn(?x, ?y) :- governorOf(?x, ?y)
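The grounding of this example can be sketched in a few lines. Note the swap: the slide describes top-down SLD resolution, while the sketch below uses naive bottom-up (fixpoint) evaluation, which is shorter to write and yields the same answers on this input. All function names are hypothetical.

```python
facts = {
    ("marriedTo", "Bill", "Hillary"),      # F1
    ("represents", "Hillary", "New_York"), # F2
    ("governorOf", "Bill", "Arkansas"),    # F3
}

# Horn rules as (head, body); terms starting with "?" are variables.
rules = [
    (("livesIn", "?x", "?y"), [("marriedTo", "?x", "?z"), ("livesIn", "?z", "?y")]),  # R1
    (("livesIn", "?x", "?y"), [("represents", "?x", "?y")]),                          # R2
    (("livesIn", "?x", "?y"), [("governorOf", "?x", "?y")]),                          # R3
]

def unify(pattern, fact, binding):
    """Extend binding so that pattern matches fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding

def match_body(body, known, binding):
    """Yield all bindings that satisfy every atom of the rule body."""
    if not body:
        yield binding
        return
    for fact in known:
        b = unify(body[0], fact, binding)
        if b is not None:
            yield from match_body(body[1:], known, b)

def ground(facts, rules):
    """Apply all rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for b in list(match_body(body, known, {})):
                derived = (head[0],) + tuple(b.get(t, t) for t in head[1:])
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known

def query(pattern, known):
    return sorted(f for f in known if unify(pattern, f, {}) is not None)

answers = query(("livesIn", "Bill", "?x"), ground(facts, rules))
print(answers)
```

Bottom-up evaluation derives every livesIn fact, including livesIn(Hillary, New_York), whereas SLD resolution would explore only what the query needs.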
URDF: Reasoning Example
KB:
Rules
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z
Base Facts
[Figure: RDF graph over Jeff, Surajit, and David (each type[1.0] Computer Scientist) and Stanford and Princeton (each type[1.0] University), with weighted edges hasAdvisor[0.8], hasAdvisor[0.7], worksAt[0.9], and graduatedFrom edges weighted [0.6], [0.7], and [0.9]; the two derived graduatedFrom edges are marked [?]]
Derived Facts
gradFr(Surajit,Stanford)
gradFr(David,Stanford)
URDF: CNF Construction & MaxSat Solving
[Theobald,Sozio,Suchanek,Nakashole: MPII Tech-Report‘10]
Query
graduatedFrom(?x,?y)
1) Deductive Grounding
– Yields only facts and rules which are relevant for answering the query (dependency graph D)
2) Boolean Formula in CNF consisting of
– Grounded hard rules
– Grounded soft rules (weighted)
– Base facts (weighted)
3) Propositional Reasoning
– Compute truth assignment for all facts in D such that the sum of weights is maximized
→ Compute “most likely” possible world
CNF
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford)) [0.4]
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford)) [0.4]
∧ worksAt(Jeff, Stanford) [0.9]
∧ hasAdvisor(Surajit, Jeff) [0.8]
∧ hasAdvisor(David, Jeff) [0.7]
∧ graduatedFrom(Surajit, Princeton) [0.6]
∧ graduatedFrom(Surajit, Stanford) [0.7]
∧ graduatedFrom(David, Princeton) [0.9]
∧ graduatedFrom(David, Stanford) [0.0]
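To make the constrained weighted MaxSat step concrete, here is a brute-force sketch over the seven facts of this example. It assumes the two mutual-exclusion clauses are hard, the two grounded rule clauses carry weight 0.4, and each base fact contributes its weight when set to true. Exhaustive enumeration is used purely for illustration and is not the URDF solver.

```python
from itertools import product

facts = {
    "worksAt(Jeff, Stanford)": 0.9,
    "hasAdvisor(Surajit, Jeff)": 0.8,
    "hasAdvisor(David, Jeff)": 0.7,
    "graduatedFrom(Surajit, Princeton)": 0.6,
    "graduatedFrom(Surajit, Stanford)": 0.7,
    "graduatedFrom(David, Princeton)": 0.9,
    "graduatedFrom(David, Stanford)": 0.0,
}

# Hard constraints: the two facts of each pair must not both be true.
hard = [
    ("graduatedFrom(Surajit, Stanford)", "graduatedFrom(Surajit, Princeton)"),
    ("graduatedFrom(David, Stanford)", "graduatedFrom(David, Princeton)"),
]

# Grounded soft rules: (body facts, head fact, weight).
soft = [
    (("hasAdvisor(Surajit, Jeff)", "worksAt(Jeff, Stanford)"),
     "graduatedFrom(Surajit, Stanford)", 0.4),
    (("hasAdvisor(David, Jeff)", "worksAt(Jeff, Stanford)"),
     "graduatedFrom(David, Stanford)", 0.4),
]

def score(world):
    """Sum the weights of all satisfied soft clauses in a truth assignment."""
    total = sum(w for f, w in facts.items() if world[f])
    for body, head, w in soft:
        if not all(world[b] for b in body) or world[head]:
            total += w
    return total

def solve():
    """Enumerate all assignments that satisfy the hard constraints."""
    names = list(facts)
    best, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[a] and world[b] for a, b in hard):
            continue
        s = score(world)
        if s > best_score:
            best, best_score = world, s
    return best, best_score

best, best_score = solve()
print(sorted(f for f, v in best.items() if v), round(best_score, 2))
```

With these weights the best world keeps graduatedFrom(Surajit, Stanford) and graduatedFrom(David, Princeton), accepting that the soft rule for David is violated.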
URDF: Lineage & Possible Worlds
Query
graduatedFrom(Surajit,?y)
1) Deductive Grounding
– Same as before, but trace lineage of query answers
2) Lineage DAG (not CNF!) consisting of
– Grounded hard rules
– Grounded soft rules
– Base facts
plus: derivation structure
3) Probabilistic Inference
– Marginalization: aggregate probabilities of all possible worlds where the answer is “true”
– Drop “impossible worlds”
Lineage DAG for the query:
hasAdvisor(Surajit,Jeff)[0.8] ∧ worksAt(Jeff,Stanford)[0.9]: 0.8 × 0.9 = 0.72
(derived) ∨ graduatedFrom(Surajit,Stanford)[0.6]: 1-(1-0.72) × (1-0.6) = 0.888
graduatedFrom(Surajit,Princeton)[0.7]: 0.7
Answers:
graduatedFrom(Surajit,Princeton): 0.7 × (1-0.888) = 0.078
graduatedFrom(Surajit,Stanford): (1-0.7) × 0.888 = 0.266
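The marginalization step can be checked by enumerating all possible worlds over the four base facts of this example. This is a sketch that assumes independent base facts and, like the numbers on the slide, drops the worlds in which both answers hold without renormalizing. The function name `marginals` is hypothetical.

```python
from itertools import product

base = {
    "hasAdvisor(Surajit,Jeff)": 0.8,
    "worksAt(Jeff,Stanford)": 0.9,
    "graduatedFrom(Surajit,Stanford)": 0.6,
    "graduatedFrom(Surajit,Princeton)": 0.7,
}

def marginals(base):
    """Sum the probabilities of the consistent worlds in which each answer holds."""
    names = list(base)
    p_stanford = p_princeton = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for n in names:
            weight *= base[n] if world[n] else 1.0 - base[n]
        # Lineage of the Stanford answer: base fact OR (advisor AND worksAt).
        stanford = world["graduatedFrom(Surajit,Stanford)"] or (
            world["hasAdvisor(Surajit,Jeff)"] and world["worksAt(Jeff,Stanford)"])
        princeton = world["graduatedFrom(Surajit,Princeton)"]
        # Worlds with both answers violate the hard constraint and are dropped.
        if stanford and not princeton:
            p_stanford += weight
        if princeton and not stanford:
            p_princeton += weight
    return p_stanford, p_princeton

p_stanford, p_princeton = marginals(base)
print(round(p_stanford, 3), round(p_princeton, 3))
```

Enumeration is exponential in the number of base facts, which is why sampling is needed for larger lineage formulas.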
Classes & Complexities
[Figure: nested language classes FOL, OWL, OWL-DL/lite, Horn]
Grounding first-order Horn formulas (Datalog)
– Decidable
– EXPTIME-complete, PSPACE-complete
(including recursion, but in P w/o recursion)
Max-Sat (Constrained & Weighted)
– NP-complete
Probabilistic inference in graphical models
– #P-complete
Monte Carlo Simulation (I)
[Karp,Luby,Madras: J.Alg.’89]
Boolean formula:
F = X1X2 ∨ X1X3 ∨ X2X3
Naïve sampling:
cnt = 0
repeat N times
    randomly choose X1, X2, X3 ∈ {0,1}
    if F(X1, X2, X3) = 1
        then cnt = cnt+1
P = cnt/N
return P /* Pr'(F) */
Theorem (zero/one-estimator theorem): If N ≥ (1/Pr(F)) × (4 ln(2/δ)/ε²) then:
Pr[ | P/Pr(F) - 1 | > ε ] < δ
N may be very big for small Pr(F)
Works for any F (not in PTIME)
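A runnable sketch of the naïve sampler for this formula, assuming each variable is true with probability 0.5 (the slide only says "randomly choose"). The sample size and seed are arbitrary choices.

```python
import random

def f(x1, x2, x3):
    """F = X1X2 v X1X3 v X2X3."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_estimate(n_samples, rng):
    """Estimate Pr(F) as the fraction of sampled assignments satisfying F."""
    cnt = 0
    for _ in range(n_samples):
        x1, x2, x3 = (rng.random() < 0.5 for _ in range(3))
        if f(x1, x2, x3):
            cnt += 1
    return cnt / n_samples

estimate = naive_estimate(20000, random.Random(0))
print(estimate)
```

Under the uniform assumption F holds for 4 of the 8 assignments, so the estimate should land near 0.5.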
Monte Carlo Simulation (II)
[Karp,Luby,Madras: J.Alg.’89]
Boolean formula in DNF:
F = C1 ∨ C2 ∨ . . . ∨ Cm
Improved sampling:
cnt = 0;
S = Pr(C1) + … + Pr(Cm)
repeat N times
    randomly choose i ∈ {1,2,…, m}, with prob. Pr(Ci)/S
    randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
    if C1=0 and C2=0 and … and Ci-1 = 0
        then cnt = cnt+1
P = S × cnt/N
return P /* Pr'(F) */
Theorem: If N ≥ m × (4 ln(2/δ)/ε²) then:
Pr[ |P/Pr(F) - 1| > ε ] < δ
Now it’s better: N no longer depends on Pr(F)
Only for F in DNF, but in PTIME
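The improved sampler can be sketched as follows, assuming a monotone DNF (clauses are sets of positive variables, as in lineage formulas) over independent variables with known probabilities. The count ratio is scaled by S so that the returned value estimates Pr(F) itself. Function names are hypothetical.

```python
import random

def clause_prob(clause, p):
    """Probability that all variables of a positive clause are true."""
    prob = 1.0
    for v in clause:
        prob *= p[v]
    return prob

def karp_luby(clauses, p, n_samples, rng):
    """Estimate Pr(C1 v ... v Cm) for a monotone DNF."""
    weights = [clause_prob(c, p) for c in clauses]
    s = sum(weights)
    cnt = 0
    for _ in range(n_samples):
        # Pick clause i with probability Pr(Ci)/S.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on Ci being true.
        world = {v: (v in clauses[i]) or (rng.random() < p[v]) for v in p}
        # Count the sample only if no earlier clause is satisfied.
        if not any(all(world[v] for v in c) for c in clauses[:i]):
            cnt += 1
    return s * cnt / n_samples

p = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
clauses = [{"X1", "X2"}, {"X1", "X3"}, {"X2", "X3"}]
estimate = karp_luby(clauses, p, 20000, random.Random(0))
print(estimate)
```

For this formula S = 0.75 and Pr(F) = 0.5, so about two thirds of the samples are counted.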
Learning “Soft” Rules
Extend Inductive Logic Programming (ILP) techniques to
large and incomplete knowledge bases
Goal: learn livesIn(?x,?y) ← bornIn(?x,?y)
Positive Examples: livesIn(?x,?y) ∧ bornIn(?x,?y)
Negative Examples: ¬livesIn(?x,?y) ∧ bornIn(?x,?y) ∧ livesIn(?x,?z)
Background knowledge: bornIn(x,y)
[Figure: Venn diagram of the livesIn(x,y), livesIn(x,z), and bornIn(x,y) fact sets]
Software tools:
alchemy.cs.washington.edu
http://www.doc.ic.ac.uk/~shm/progol.html
http://dtai.cs.kuleuven.be/ml/systems/claudien
More Variants of Consistency Reasoning
• Propositional Reasoning
– Constrained Weighted MaxSat solver
• Lineage & Possible Worlds (independent base facts)
– Monte Carlo simulations (Luby-Karp)
• First-Order Logic & Probabilistic Graphical Models
– Markov Logic (currently via interface to Alchemy*)
[Richardson & Domingos: ML’06]
– Even more general: Factor Graphs
[McCallum et al. 2008]
– MCMC sampling for probabilistic inference
*Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/
Experiments
• YAGO Knowledge Base: 2 million entities, 20 million facts
• Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic soft rule expansions
• URDF: SLD grounding & MaxSat solving
|C| - # literals in soft rules
|S| - # literals in hard rules
• URDF vs. Markov Logic (MAP inference & MC-SAT)
French Marriage Problem (Revisited)
[Figure: timeline with month axis JAN to DEC]
Facts in KB:
1: marriedTo(Hillary, Bill)
2: marriedTo(Carla, Nicolas), validFrom(2, 2008)
3: marriedTo(Angelina, Brad)
New fact candidates:
4: marriedTo(Cecilia, Nicolas), validFrom(4, 1996), validUntil(4, 2007)
5: marriedTo(Carla, Benjamin), validFrom(5, 2010)
6: marriedTo(Carla, Mick), validFrom(6, 2006)
7: divorced(Madonna, Guy), validFrom(7, 2008)
8: domPartner(Angelina, Brad)
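With validity intervals attached, the marriage constraint becomes a disjointness check on time. The sketch below makes three assumptions that the slides do not state: marriedTo is treated as symmetric, a missing validFrom or validUntil means the interval is open on that side, and intervals are closed at year granularity. Helper names are hypothetical.

```python
import math

# (fact id, person, person, validFrom, validUntil); None = not given on the slide.
married = [
    (1, "Hillary", "Bill", None, None),
    (2, "Carla", "Nicolas", 2008, None),
    (3, "Angelina", "Brad", None, None),
    (4, "Cecilia", "Nicolas", 1996, 2007),
    (5, "Carla", "Benjamin", 2010, None),
    (6, "Carla", "Mick", 2006, None),
]

def interval(fact):
    """Closed year interval; unknown bounds are treated as unbounded."""
    start = -math.inf if fact[3] is None else fact[3]
    end = math.inf if fact[4] is None else fact[4]
    return start, end

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def temporal_conflicts(facts):
    """Pairs of marriages that share a person and overlap in time."""
    out = []
    for i, f in enumerate(facts):
        for g in facts[i + 1:]:
            couple_f, couple_g = {f[1], f[2]}, {g[1], g[2]}
            if couple_f & couple_g and couple_f != couple_g \
                    and overlaps(interval(f), interval(g)):
                out.append((f[0], g[0]))
    return out

conflict_pairs = temporal_conflicts(married)
print(conflict_pairs)
```

Facts 2 and 4 share Nicolas but have disjoint intervals, so they are compatible, while the three Carla facts overlap pairwise under the open-ended assumption.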
Challenge: Temporal Knowledge Harvesting
For all people in Wikipedia (100,000s) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night!
Consistency constraints are potentially helpful:
• functional dependencies: {husband, time} → {wife, time}
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce
Difficult Dating
(Even More Difficult) Implicit Dating
vague dates
relative dates
narrative text
relative order
TARSQI: Extracting Time Annotations
http://www.timeml.org/site/tarsqi/
[Verhagen et al: ACL‘05]
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
Callout on slide: extraction errors!
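For downstream reasoning the inline TIMEX3 annotations have to be turned into records. A minimal sketch using only regular expressions, assuming the tags are well formed and not nested; the sample string is a shortened excerpt of the text above, and `extract_timex3` is a hypothetical helper, not part of TARSQI.

```python
import re

TIMEX3 = re.compile(r'<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>', re.DOTALL)
ATTR = re.compile(r'(\w+)="([^"]*)"')

def extract_timex3(text):
    """Return one dict per TIMEX3 tag: its attributes plus the covered text."""
    annotations = []
    for match in TIMEX3.finditer(text):
        record = dict(ATTR.findall(match.group(1)))
        # Collapse line breaks and repeated spaces inside the annotated span.
        record["text"] = " ".join(match.group(2).split())
        annotations.append(record)
    return annotations

sample = (
    'Alan Leong announced <TIMEX3 tid="t4" TYPE="DATE" '
    'VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations. '
    'Tsang will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" '
    'TYPE="DURATION" VAL="P5Y">another\nfive\nyears\n</TIMEX3>.'
)

annotations = extract_timex3(sample)
for a in annotations:
    print(a["tid"], a["TYPE"], a["VAL"], a["text"])
```

A real pipeline would use an XML parser and still has to cope with the extraction errors the slide points out.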
13 Relations between Time Intervals
[Allen, 1984; Allen & Hayes, 1989]
A Before B / B After A
A Meets B / B MetBy A
A Overlaps B / B OverlappedBy A
A Starts B / B StartedBy A
A During B / B Contains A
A Finishes B / B FinishedBy A
A Equal B
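The 13 relations can be decided by comparing interval endpoints. A small sketch, assuming intervals are (start, end) pairs with start < end; the function name `allen` is hypothetical.

```python
def allen(a, b):
    """Return the Allen relation that holds between intervals a and b."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "Before"
    if b_end < a_start:
        return "After"
    if a_end == b_start:
        return "Meets"
    if b_end == a_start:
        return "MetBy"
    if a_start == b_start and a_end == b_end:
        return "Equal"
    if a_start == b_start:
        return "Starts" if a_end < b_end else "StartedBy"
    if a_end == b_end:
        return "Finishes" if a_start > b_start else "FinishedBy"
    if b_start < a_start and a_end < b_end:
        return "During"
    if a_start < b_start and b_end < a_end:
        return "Contains"
    # Remaining cases: proper overlap with all four endpoints distinct.
    return "Overlaps" if a_start < b_start else "OverlappedBy"

print(allen((1996, 2007), (2008, 2012)))
print(allen((2006, 2009), (2008, 2012)))
```

Exactly one relation holds for any pair of proper intervals, which is what makes the relations usable as constraints.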
Possible Worlds in Time (I)
[Wang,Yahya,Theobald: VLDB/MUD Workshop ‘10]
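Derived Facts (State):
teamMates(Beckham, Ronaldo, T3) ← playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2)
Base Facts (State Relations):
playsFor(Beckham,Real)
playsFor(Ronaldo,Real)
[Figure: confidence histograms over time bins. Base facts: playsFor(Beckham,Real) with bin boundaries ‘03, ‘05, ‘07 and playsFor(Ronaldo,Real) with bin boundaries ‘00, ‘02, ‘04, ‘05, ‘07; bin confidences shown: 0.4, 0.6, 1.0, 0.1, 0.2, 0.4, 0.9, 0.2. Derived fact: bin boundaries ‘03, ‘04, ‘05, ‘07 with confidences 0.36, 0.16, 0.08, 0.12]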
Possible Worlds in Time (II)
[Wang,Yahya,Theobald: VLDB/MUD Workshop ‘10]
Derived Facts (Event):
won(Beckham, ChampionsL, T3) ← playsFor(Beckham, United, T1) ∧ wonCup(United, ChampionsL, T2) ∧ overlaps(T1,T2)
Base Facts:
playsFor(Beckham, United) (State)
wonCup(United, ChampionsLeague) (Event)
[Figure: confidence histograms over time bins between ‘95 and ‘02 for the two base facts and the derived fact, with bins labeled Independent and Non-independent; values shown include 0.54, 0.30, 0.06, 0.12, 0.06, 0.12 (derived) and 0.6, 0.9, 0.5, 1.0, 0.2, 0.2, 0.3, 0.1, 0.3 (base)]
• Closed and complete representation model (incl. lineage)
→ Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete
→ In general requires possible-worlds-based sampling techniques (Luby-Karp, Gibbs sampling, etc.)
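The claim that interval computation stays linear in the number of bins can be illustrated with a merge-style sweep over two histograms. This sketch assumes independent base facts, so the confidence of the derived fact in a bin is the product of the two base confidences; the bin layout below is hypothetical and not read off the figure, and `join_histograms` is not a URDF function.

```python
def join_histograms(h1, h2):
    """Intersect two sorted lists of (start, end, conf) bins in one linear pass."""
    out = []
    i = j = 0
    while i < len(h1) and j < len(h2):
        s1, e1, c1 = h1[i]
        s2, e2, c2 = h2[j]
        start, end = max(s1, s2), min(e1, e2)
        if start < end:
            # Independence assumption: multiply the bin confidences.
            out.append((start, end, c1 * c2))
        # Advance whichever bin ends first (both if they end together).
        if e1 <= e2:
            i += 1
        if e2 <= e1:
            j += 1
    return out

# Hypothetical half-open bins [start, end) with confidences.
beckham = [(2003, 2005, 0.4), (2005, 2007, 0.6)]
ronaldo = [(2002, 2004, 0.9), (2004, 2005, 0.4), (2005, 2007, 0.2)]

team_mates = join_histograms(beckham, ronaldo)
for start, end, conf in team_mates:
    print(start, end, round(conf, 2))
```

When the lineage of the base facts is shared, the product no longer applies and the per-bin confidence has to be computed from the lineage formula, for example by sampling.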
Agenda
– URDF: Reasoning in Uncertain Knowledge Bases
• Resolving uncertainty at query-time
• Lineage of answers
• Propositional vs. probabilistic reasoning
• Temporal reasoning extensions
– UViz: The URDF Visualization Frontend
• Demo!
UViz: The URDF Visualization Engine
• UViz System Architecture
– Flash client
– Tomcat server (JRE)
– Relational backend (JDBC)
– Remote Method Invocation & Object Serialization (BlazeDS)
UViz: The URDF Visualization Engine
Demo!