Towards Scalable Information Integration
with Instance Coreferences
Abir Qasem1, Dimitre Dimitrov2, Jeff Heflin1
1 Lehigh University
2 Tech-X Corporation
07/11/09
U.S. Department of Energy DE-FG02-05ER84171 SBIR grant
The Semantic Web
• Definition
– The Semantic Web is not a separate Web but an extension of
the current one, in which information is given well-defined
meaning, better enabling computers and people to work in
cooperation. (Berners-Lee et al., Scientific American, May 2001)
• Ontology
– a key component of the Semantic Web
– ontologies define the semantics of the terms used in semistructured web pages
• identify context, provide shared definitions
• has a formal syntax and unambiguous semantics
– can be used to describe alignments between heterogeneous
schemas
A Web of Ontologies
[Figure: a graph of ontologies and data sources. Ontology nodes (labels include Foaf, Dublin Core, Region, Congress, Citeseer, DBLP, AIGP, and NSF Awards) are connected by “extends” edges; data sources S1 to S7 are attached to ontologies by “commits to” edges.]
The answer to a user’s query might require the combination of data from S1, S2, S3, and S4.
Semantic Web Standards
World Wide Web Consortium (W3C) Recommendations
• RDF(S) (1999, revised 2004)
– essentially semantic networks with URIs
– XML serialization syntax
[Figure: example RDF graph with nodes rdfs:Class, rdf:Property, g:Person, u:Chair, g:name, and the literal “John Smith”, connected by rdf:type, rdfs:domain, rdfs:subClassOf, and g:name edges]
• OWL (2004)
– extends RDF with more semantic primitives
– based on description logics (DLs)
– has a model theoretic semantics
<owl:Class rdf:ID="Band">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMember" />
      <owl:allValuesFrom rdf:resource="#Musician" />
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
A Band is a subset of the groups which only have Musicians as members
Integrating RDF Sources
QUERY: Find all academic papers written by Marvin Minsky’s advisees.
AIGP - http://aigp.eecs.umich.edu/
DBLP - http://www.informatik.uni-trier.de/~ley/db/
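[Figure: two RDF graphs, one per source, with an unresolved link between them]
AIGP:
aigp:researcher/show/21 aigp:name “Marvin Minsky”
aigp:researcher/show/21 aigp:advisorOf aigp:researcher/show/93
aigp:researcher/show/93 aigp:name “Eugene Charniak”
DBLP:
dblp:c/Charniak:Eugene dblp:name “Eugene Charniak”
dblp:jrnl/aim/Charniak97 dblp:hasAuthor dblp:c/Charniak:Eugene
dblp:jrnl/aim/Charniak97 dblp:title “Statistical Techniques for Natural Language Parsing”
Open question: aigp:researcher/show/93 =? dblp:c/Charniak:Eugene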
Coreference Information
• owl:sameAs
– states that two URIs denote the same individual
• Linking Open Data initiative
– ~100 sources with over 4 billion triples (i.e., facts)
– >100 million explicit owl:sameAs statements
• Many RDF users publish owl:sameAs
statements with their data
• Can use automated coreference resolution
techniques to find others
– allow for the possibility of human correction
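A minimal sketch of how a single owl:sameAs link lets the Minsky query span both sources. The triples follow the AIGP/DBLP example; which AIGP researcher URI carries which name is read off the figure and should be treated as an assumption, as should the helper names same_individuals and papers_by_advisees. Plain Python tuples stand in for an RDF store.

```python
# Triples as (subject, predicate, object), following the AIGP/DBLP example.
# Assumption: show/21 is Minsky and show/93 is Charniak.
AIGP = {
    ("aigp:researcher/show/21", "aigp:name", "Marvin Minsky"),
    ("aigp:researcher/show/21", "aigp:advisorOf", "aigp:researcher/show/93"),
    ("aigp:researcher/show/93", "aigp:name", "Eugene Charniak"),
}
DBLP = {
    ("dblp:c/Charniak:Eugene", "dblp:name", "Eugene Charniak"),
    ("dblp:jrnl/aim/Charniak97", "dblp:hasAuthor", "dblp:c/Charniak:Eugene"),
    ("dblp:jrnl/aim/Charniak97", "dblp:title",
     "Statistical Techniques for Natural Language Parsing"),
}
# The coreference statement that answers the "=?" in the example.
SAME_AS = {("aigp:researcher/show/93", "dblp:c/Charniak:Eugene")}


def same_individuals(uri, same_as):
    """Return uri plus every URI linked to it by owl:sameAs,
    treating the relation as symmetric and transitive."""
    seen, frontier = {uri}, [uri]
    while frontier:
        cur = frontier.pop()
        for a, b in same_as:
            if cur in (a, b):
                other = b if cur == a else a
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return seen


def papers_by_advisees(advisor_name, triples, same_as):
    """Titles of papers written by the advisees of the named advisor."""
    advisors = {s for s, p, o in triples
                if p == "aigp:name" and o == advisor_name}
    advisees = {o for s, p, o in triples
                if p == "aigp:advisorOf" and s in advisors}
    authors = set()
    for advisee in advisees:
        authors |= same_individuals(advisee, same_as)
    papers = {s for s, p, o in triples
              if p == "dblp:hasAuthor" and o in authors}
    return sorted(o for s, p, o in triples
                  if p == "dblp:title" and s in papers)
```

Calling papers_by_advisees("Marvin Minsky", AIGP | DBLP, SAME_AS) finds the DBLP paper; with an empty coreference set the same call finds nothing, because the AIGP and DBLP URIs never meet.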
Scaling
• AIGP and DBLP have about 4000
coreferent instances
• Marvin Minsky has about 20 advisees
• Only a small fragment of coreference
information is relevant to any given query
– Need to be selective about what information
to use
• Quantity of coreference information
– 80K between DBPedia and Geonames
– 100K between CIA factbook and Geonames
OBII
[Figure: OBII architecture, in two phases]
• Semantic Web space
– domain ontologies (O1 … On), OWLII map ontologies (Om1 … Omn), REL set (R1 … Rn), data sources (S1 … Sn)
• System startup or periodic update
– domain and map ontologies are translated to LAV/GAV
– Rcs and Rps are translated to LAV/GAV
– Rs go to the indexed equivalence closure (EQKB)
– IndexKB holds LAV and GAV (REL statements are LAV + URL of data source)
• Query phase
– a SPARQL query is passed to GNS, which finds LAV/GAV matches and asks EQKB for equivalence checks and for all equivalents of a URI
– potentially relevant sources are read from the leaves
– potentially relevant sources are retrieved (http calls) and loaded in a reasoner, which produces the result
Potential Relevance
• A summary of a source’s content that allows us to ignore sources that cannot possibly contribute to a query
• Unless we look inside the source, there is no way to guarantee its relevance
• REL statements have three forms, one for each kind of assertion a source can contain
(In the following, d is the URL of a data source, Cs is a class, CE is a class expression, Ps and Pq are property names, and {u1, …, un} is a set of URIs)
– For classes (Rc), the form is REL(d, Cs, CE)
– For properties (Rp), the form is REL(d, Ps, Pq)
– For owl:sameAs assertions (Rs), the form is REL(d, {u1, …, un})
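The three REL forms can be pictured as three small record types plus a filter that discards sources with no matching statement. This is only an illustrative sketch: the class names ClassRel, PropertyRel, SameAsRel and the function potentially_relevant are hypothetical, the class expression CE is simplified to a single class name, and matching is a direct name lookup rather than the rule expansion the system performs. The source URLs http://U3 and http://U4 and the property O1:rating in the usage note are likewise made up.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ClassRel:
    """REL(d, Cs, CE): source d has instances of Cs, contained in CE."""
    source: str
    source_class: str
    class_expr: str  # simplification: a single class name


@dataclass(frozen=True)
class PropertyRel:
    """REL(d, Ps, Pq): source d uses property Ps, relevant to Pq."""
    source: str
    source_prop: str
    query_prop: str


@dataclass(frozen=True)
class SameAsRel:
    """REL(d, {u1, ..., un}): source d has owl:sameAs facts about these URIs."""
    source: str
    uris: FrozenSet[str]


def potentially_relevant(rels, classes=(), props=(), uris=()):
    """Return the URLs of sources that might contribute to a query
    mentioning the given classes, properties, and URIs."""
    classes, props, uris = set(classes), set(props), set(uris)
    hits = set()
    for rel in rels:
        if isinstance(rel, ClassRel) and rel.class_expr in classes:
            hits.add(rel.source)
        elif isinstance(rel, PropertyRel) and rel.query_prop in props:
            hits.add(rel.source)
        elif isinstance(rel, SameAsRel) and rel.uris & uris:
            hits.add(rel.source)
    return hits
```

For example, given ClassRel("http://U2", "O1:MtnBike", "O1:GreenTransport"), a query over O1:GreenTransport selects http://U2, while a query over an unrelated class selects nothing. A selected source is only potentially relevant; its contents still have to be loaded to find out.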
Information Integration vs. Source Selection
• Data sources
– Information integration: queryable
– Source selection: lightweight
• Query reformulation process
– Information integration: match and expansion of rules with query atoms
– Source selection: match and expansion of rules with query atoms
• Query reformulation result
– Information integration: conjunctive queries over sources
– Source selection: a set of “potentially relevant” sources
• Obtaining the answer
– Information integration: issue the queries and union the results
– Source selection: load the atomic sources into a reasoning engine and issue the original query
Equivalence KB
• Implementation is a variation of the disjoint-set forest algorithm [Cormen et al. 01]
– standard operations: union(x,y) and find-set(x)
• Also supports isEquivalent and getAllEquivalent methods
• The index is built by an update algorithm (with a set of
seed URLs)
• Uses an inverted document index for equivalence
relevance information
[Cormen et al. 01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, Cambridge, MA, 2001.
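The Equivalence KB operations can be sketched with a textbook disjoint-set forest (union by rank, path compression). This is not the authors' implementation: the Python method names are adaptations of those on the slide, and keeping a member set per root to answer getAllEquivalent is an assumption about what the "variation" might look like. The update algorithm and the inverted document index are not modeled.

```python
class EquivalenceKB:
    """Disjoint-set forest over URIs, with a member set kept at each root."""

    def __init__(self):
        self.parent = {}   # URI -> parent URI (a root is its own parent)
        self.rank = {}     # root URI -> upper bound on tree height
        self.members = {}  # root URI -> all URIs in its equivalence class

    def _add(self, uri):
        # Unknown URIs start out in a singleton class.
        if uri not in self.parent:
            self.parent[uri] = uri
            self.rank[uri] = 0
            self.members[uri] = {uri}

    def find_set(self, uri):
        """Return the representative of uri's class, compressing the path."""
        self._add(uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[uri] != root:
            nxt = self.parent[uri]
            self.parent[uri] = root
            uri = nxt
        return root

    def union(self, x, y):
        """Record that x and y denote the same individual (owl:sameAs)."""
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        # Attach the shallower tree (ry) under the deeper one (rx).
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        self.members[rx] |= self.members.pop(ry)

    def is_equivalent(self, x, y):
        return self.find_set(x) == self.find_set(y)

    def get_all_equivalent(self, uri):
        return set(self.members[self.find_set(uri)])
```

After kb.union("ex:DC", "ex:WashingtonDC"), kb.is_equivalent("ex:DC", "ex:WashingtonDC") holds and kb.get_all_equivalent("ex:DC") returns both URIs (the ex: names are illustrative). The per-root member sets trade memory for returning a whole class without scanning every URI.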
Update of EquivalenceKB
Preliminary Tests
• We have used
– 202,383 owl:sameAs statements that align data from the AIGP, DBLP and Citeseer data sources
– Part of the Hawkeye Project
• http://swat.cse.lehigh.edu/resources/index.html
• 166 million facts and several “integration resources”
• PC with 3 GB
– EquivalenceKB is 7 MB
– Buildup time is 3 seconds
– 1000 calls to getAllEquivalents return in less than half a second
Query Answering
[Figure not recoverable; callout: “Needs equivalence information”]
GNS Extension
[Figure not recoverable; callout: “Needs equivalence information”]
GNS Extension
• CONTAINS is used before expansion to avoid cyclic expansion
• To avoid redundancy, we consider syntactic query containment
– E.g., CONTAINS(cl, P(x,a)) is true if P(x,y) is in cl
• Equivalence information is relevant
– If author(X, GNS) is in the closed list,
– we should not expand author(X, GOAL-NODE-SEARCH), assuming GNS = GOAL-NODE-SEARCH
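A sketch of the containment check described above, assuming atoms are tuples of the form (predicate, arg, ...), variables are wrapped in a hypothetical Var class, and the equivalence test is passed in as a function. The names subsumes and contains are illustrative, not taken from the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


def subsumes(general, specific, is_equivalent):
    """True if `general` matches `specific` by binding only the
    variables of `general`; constants may match through coreference."""
    if general[0] != specific[0] or len(general) != len(specific):
        return False
    binding = {}
    for g, s in zip(general[1:], specific[1:]):
        if isinstance(g, Var):
            # A variable must be bound to the same term everywhere.
            if binding.setdefault(g, s) != s:
                return False
        elif isinstance(s, Var):
            # A constant in the closed-list atom cannot cover a variable.
            return False
        elif not (g == s or is_equivalent(g, s)):
            return False
    return True


def contains(closed_list, atom, is_equivalent=lambda a, b: False):
    """CONTAINS(cl, atom): some atom already in cl covers the new one."""
    return any(subsumes(seen, atom, is_equivalent) for seen in closed_list)
```

With P(x, y) in the closed list, contains reports that P(x, a) is covered. With author(X, GNS) in the closed list, author(X, GOAL-NODE-SEARCH) is covered only when the supplied is_equivalent function says the two constants corefer; the default treats distinct constants as different.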
GNS Extension
• unifyEQ is like regular unify except that it accounts for coreferences
• When matching two constants, we use isEqual of the Equivalence KB
• livesIn(X, DC) and livesIn(X, WashingtonDC) will not unify unless we know DC = WashingtonDC
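The difference from ordinary unification can be sketched as follows, assuming flat atoms (a predicate followed by variables or constants, no nested terms), a hypothetical Var wrapper for variables, and an is_equal callback standing in for the Equivalence KB lookup. The function name unify_eq is an adaptation of unifyEQ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


def walk(term, subst):
    """Follow variable bindings until reaching a constant or a free variable."""
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term


def unify_eq(a, b, is_equal, subst=None):
    """Unify two flat atoms; return a substitution dict, or None on failure.
    Two different constants unify only if is_equal says they corefer."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = dict(subst or {})
    for s, t in zip(a[1:], b[1:]):
        s, t = walk(s, subst), walk(t, subst)
        if s == t:
            continue
        if isinstance(s, Var):
            subst[s] = t
        elif isinstance(t, Var):
            subst[t] = s
        elif not is_equal(s, t):
            return None
    return subst
```

With an is_equal that knows DC = WashingtonDC, unifying livesIn(X, DC) with livesIn(X, WashingtonDC) succeeds with an empty substitution; with an is_equal that always returns False, the same call returns None, which is the behavior of regular unify.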
Conclusion and Future Work
• Scalable instance coreference handling is an important issue
• Initial work shows promise
• Two important issues
– Avoid pre-computation of equivalence closure and make the system more dynamic
– Disk-based implementation of EquivalenceKB
• We are currently fine-tuning a dynamic algorithm
– UpdateEqualKB is not seeded with all URIs but rather with URIs from a query
– Equivalence information is updated as new URIs are discovered due to rule expansion
– Coming soon to a conference near you
Backups
OWLII in OWL/RDF
Allowed constructs by axiom type: subject (left-hand side) and object (right-hand side)
• owl:equivalentClass
– Subject: named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
– Object: named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
• rdfs:subClassOf
– Subject: all of the above + owl:unionOf
– Object: all of the above + owl:allValuesFrom
• owl:equivalentProperty, rdfs:subPropertyOf
– Subject: named properties, owl:inverseOf
– Object: named properties, owl:inverseOf
• owl:inverseOf
– Subject: named properties
– Object: named properties
Map example
O1:GreenTransport(X) :- O2:Transport(X), O2:greenRating(X, good)
<owl:Class rdf:about="http://O1#GreenTransport">
  <rdfs:subClassOf rdf:resource="http://O2#Transport"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://O2#greenRating"/>
      <owl:hasValue rdf:resource="http://uri#good"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
REL example
R4: O1:MtnBike(X) ⊑ O1:GreenTransport(X), U2
<meta:RelStatement>
  <meta:source rdf:resource="http://U2"/>
  <meta:contained>
    <owl:Class rdf:about="http://O1#MtnBike" />
  </meta:contained>
  <meta:container>
    <owl:Class rdf:about="http://O1#GreenTransport" />
  </meta:container>
</meta:RelStatement>