A Machine Learning Approach to Linking FOAF Instances
Computing FOAF Co-reference Relations with Rules and Machine Learning
Jennifer Sleeman and Tim Finin
University of Maryland, Baltimore County
The Third International Workshop on Social Data on the Web, November 2010
http://ebiquity.umbc.edu/paper/html/id/506/
FOAF
Friend of a Friend (FOAF) vocabulary describes people and their relationships
One of the oldest and most widely used ontologies
Does not include a globally unique identifier
Inverse functional properties (IFPs) help
Multiple foaf instances referring to the same person are common
Increasingly so with more linked data
introduction foaf co-reference approach methodology evaluation conclusions
Linking data
Data integration requires linking instances from different data sets
Linking foaf instances is a common and typical use case
Sindice reports 23 foaf instances all referring to Sir Tim Berners-Lee
Probably more than my query revealed
Only a handful are linked via owl:sameAs
Automatically linking foaf instances is not always easy
Example 1
Common properties but can we
say this is the same person…
<swivt:Subject rdf:about="http://tw.rpi.edu/wiki/Special:URIResolver/Bijan_Parsia">
<rdfs:label>Bijan Parsia</rdfs:label>
<swivt:page rdf:resource="http://tw.rpi.edu/wiki/Bijan_Parsia"/>
<rdfs:isDefinedBy rdf:resource="http://tw.rpi.edu/wiki/Special:ExportRDF/Bijan_Parsia"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3APerson"/>
<property:Foaf-3Adepiction rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Anonymous.png"/>
<foaf:firstName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Bijan</foaf:firstName>
<foaf:interest rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3ASemantic_Web_Topic"/>
<foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Bijan Parsia</foaf:name>
<foaf:surname rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Parsia</foaf:surname>
<property:Has_affiliation rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Manchester_University"/>
<property:Has_identifier rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Bijan_Parsia"/>
</swivt:Subject>
http://tw.rpi.edu/wiki/Special:ExportRDF/Bijan_Parsia
<foaf:Person rdf:ID="bparsia">
<foaf:mbox_sha1sum>f49a6854842c5fa76dc0edb8e82f8fe04fd56bc9</foaf:mbox_sha1sum>
<foaf:firstName>Bijan</foaf:firstName>
<foaf:surname>Parsia</foaf:surname>
<foaf:name>Bijan Parsia</foaf:name>
<foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=bparsia"/>
<foaf:img rdf:resource="http://www.mindswap.org/~bparsia/talks/uri-use/bijan.jpg"/>
<foaf:depiction rdf:resource="http://www.mindswap.org/~bparsia/talks/uri-use/bijan.jpg"/>
<foaf:nick>bparsia</foaf:nick>
<foaf:holdsAccount>
<foaf:OnlineAccount>
<foaf:accountName>bparsia</foaf:accountName>
<foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
</foaf:OnlineAccount>
</foaf:holdsAccount>
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=bparsia#tt0084827-bparpia
Example 2
Aliases and slight name
variations…
<foaf:Person>
<foaf:name>James A. Hendler</foaf:name>
<foaf:firstName>James</foaf:firstName>
<foaf:surname>Hendler</foaf:surname>
<foaf:publications>http://ebiquity.umbc.edu/papers/select/person/James/Hendler/</foaf:publications>
<foaf:homepage rdf:resource="http://www.cs.umd.edu/~hendler/"/>
<foaf:workInfoHomepage rdf:resource="http://www.cs.umd.edu/~hendler/"/>
http://ebiquity.umbc.edu/person/foaf/James/A./Hendler/foaf.rdf
<foaf:Person rdf:ID="jhendler">
<foaf:mbox_sha1sum>0b62d4242736e64be6138547c79a811b3e82fd52</foaf:mbox_sha1sum>
<foaf:firstName>Jim</foaf:firstName>
<foaf:surname>Hendler</foaf:surname>
<foaf:name>Jim Hendler</foaf:name>
<foaf:title>Tetherless World Constellation Chair</foaf:title>
<foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jhendler"/>
<foaf:homepage rdf:resource="http://www.cs.umd.edu/~hendler"/>
<foaf:depiction rdf:resource="http://www.semanticgrid.org/qiantbljim.jpg"/>
<foaf:workplaceHomepage rdf:resource="http://owl.mindswap.org"/>
<foaf:img rdf:resource="http://www.cs.umd.edu/~hendler/hendler.gif"/>
<foaf:depiction rdf:resource="http://www.cs.umd.edu/~hendler/hendler.gif"/>
<foaf:nick>jhendler</foaf:nick>
<foaf:openID rdf:resource="http://jhendler.pip.verisignlabs.com/" />
<foaf:holdsAccount>
<foaf:OnlineAccount>
<foaf:accountName>jhendler</foaf:accountName>
<foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
</foaf:OnlineAccount>
</foaf:holdsAccount>
http://www.cs.rpi.edu/~hendler/foaf.rdf
Example 3
What if mbox_sha1sums are
different?
<Agent rdf:about="http://identi.ca/user/53505">
<mbox_sha1sum>08445a31a78661b5c746feff39a9db6e4e2cc5cf</mbox_sha1sum>
<name>David Wood</name>
<homepage rdf:resource="http://dw2-0.com"/>
<weblog rdf:resource="http://identi.ca/dw2"/>
<holdsAccount><OnlineAccount rdf:about="http://identi.ca/user/53505#acct">
<accountServiceHomepage rdf:resource="http://identi.ca/"/>
<accountName>dw2</accountName>
<accountProfilePage rdf:resource="http://identi.ca/dw2"/>
<sioc:account_of rdf:resource="http://identi.ca/user/53505"/>
<sioc:follows rdf:resource="http://identi.ca/user/136#acct"/>
</OnlineAccount></holdsAccount>
http://identi.ca/dw2/foaf
<foaf:Person rdf:about="http://zepheira.com/team/dave/#me">
<foaf:name>David Wood</foaf:name>
<foaf:title>Dr.</foaf:title>
<foaf:givenname>David</foaf:givenname>
<foaf:family_name>Wood</foaf:family_name>
<foaf:nick>prototypo</foaf:nick>
<foaf:mbox_sha1sum>37c8d030d4e615d05f31625b3460532a3f4e214e</foaf:mbox_sha1sum>
<foaf:homepage rdf:resource="http://prototypo.blogspot.com/"/>
<foaf:depiction rdf:resource="http://www.itee.uq.edu.au/~dwood/images/dave_w_0.jpg"/>
<foaf:phone rdf:resource="tel:+1-(571)-3313723"/>
<foaf:workplaceHomepage rdf:resource="http://www.zepheira.com/"/>
<foaf:workInfoHomepage rdf:resource="http://www.zepheira.com/team/dave"/>
<foaf:schoolHomepage rdf:resource="http://www.vmi.edu/"/>
<foaf:schoolHomepage rdf:resource="http://www.nps.navy.mil/"/>
<foaf:schoolHomepage rdf:resource="http://www.itee.uq.edu.au/"/>
<foaf:aimChatID>piprototypo</foaf:aimChatID>
http://www.itee.uq.edu.au/~dwood/dave.rdf#me
Example 3 cont.
Which David Wood was a
mindswapper?
<ms:Researcher rdf:ID="David_Wood" rdfs:label="David Wood">
<foaf:name>David Wood</foaf:name>
<foaf:mbox>
<owl:Thing rdf:about="mailto:[email protected]"/>
</foaf:mbox>
<foaf:homepage>
<foaf:Document rdf:about="http://www.mindswap.org/~dwood/"/>
</foaf:homepage>
<foaf:workInfoHomepage>
<foaf:Document rdf:about="http://www.mindswap.org/~dwood/"/>
</foaf:workInfoHomepage>
</ms:Researcher>
http://www.mindswap.org/2004/owl/mindswappers#David.Wood
Example 5
Could jgolbeck and Jennifer Golbeck
be the same person …
<foaf:Person rdf:ID="jgolbeck">
<foaf:mbox_sha1sum>08445a31a78661b5c746feff39a9db6e4e2cc5cf</foaf:mbox_sha1sum>
<foaf:firstName></foaf:firstName>
<foaf:surname></foaf:surname>
<foaf:name> </foaf:name>
<foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jgolbeck"/>
<foaf:img rdf:resource=""/>
<foaf:depiction rdf:resource=""/>
<foaf:nick>jgolbeck</foaf:nick>
<foaf:holdsAccount>
<foaf:OnlineAccount>
<foaf:accountName>jgolbeck</foaf:accountName>
<foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
</foaf:OnlineAccount>
</foaf:holdsAccount>
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jgolbeck
<swivt:Subject rdf:about="http://tw.rpi.edu/wiki/Special:URIResolver/Jennifer_Golbeck">
<rdfs:label>Jennifer Golbeck</rdfs:label>
<swivt:page rdf:resource="http://tw.rpi.edu/wiki/Jennifer_Golbeck"/>
<rdfs:isDefinedBy rdf:resource="http://tw.rpi.edu/wiki/Special:ExportRDF/Jennifer_Golbeck"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3AAssistant_Professor"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3APerson"/>
<property:Foaf-3Adepiction rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Anonymous.png"/>
<foaf:firstName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Jennifer</foaf:firstName>
<foaf:interest rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3ASemantic_Web_Topic"/>
<foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Jennifer Golbeck</foaf:name>
<foaf:surname rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Golbeck</foaf:surname>
http://tw.rpi.edu/wiki/Special:ExportRDF/Jennifer_Golbeck
Example 5 cont.
Which profile is most recent/relevant?
<rdf:RDF>
<foaf:Person>
<foaf:name>Jennifer Golbeck</foaf:name>
<foaf:mbox rdf:resource="mailto:[email protected]"/>
<foaf:mbox rdf:resource="mailto:[email protected]"/>
<owl:sameAs rdf:resource="http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck"/>
<foaf:workplaceHomepage rdf:resource="http://www.cs.umd.edu/~golbeck"/>
<foaf:currentProject rdf:resoruce="http://trust.mindswap.org"/>
<foaf:publications rdf:resource="http://www.mindswap.org/papers"/>
<foaf:knows rdf:resource="#danbri"/>
<rdfs:seeAlso rdf:resource="http://trust.mindswap.org/cgi-bin/getList.cgi"/>
http://www.cs.umd.edu/~golbeck/daml/golbeckFOAF.rdf
<ms:Researcher rdf:ID="Jennifer.Golbeck" rdfs:label="Jennifer Golbeck">
<rdfs:seeAlso rdf:resource="http://www.cs.umd.edu/~golbeck/daml/golbeckFOAF.rdf"/>
<foaf:name>Jennifer Golbeck</foaf:name>
<foaf:mbox><owl:Thing rdf:about="mailto:[email protected]"/></foaf:mbox>
<foaf:homepage><foaf:Document rdf:about="http://www.cs.umd.edu/~golbeck/"/></foaf:homepage>
<foaf:workInfoHomepage><foaf:Document rdf:about="http://www.mindswap.org/~golbeck/"/>
</foaf:workInfoHomepage>
</ms:Researcher>
http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck
Our Contributions
Treating foaf smushing as entity co-reference
Use machine learning to train a classifier for recognizing co-referent foaf instances
Combine this with rule-based evidence
Use of narrower RDF properties to express co-reference, avoiding overuse of owl:sameAs
Use of a greedy algorithm for iteratively clustering co-referent entities and re-evaluating their potential co-reference relations
Co-Reference in FOAF
Approach the problem like cross-document co-reference resolution in text
Match pairs of FOAF agents
Use rules and properties
Assign new properties to represent coref and notCoref relationships
Cluster co-referent pairs
Cross-Document Co-reference Resolution
Determine when two documents mention
the same entity
Are two documents that talk about “George Bush”
talking about the same George Bush?
Is a document mentioning “Mahmoud Abbas”
referring to the same person as one mentioning
“Muhammed Abbas”? What about “Abu Abbas”?
“Abu Mazen”?
Drawing appropriate inferences from multiple documents demands cross-document co-reference resolution
2008 NIST Text Analysis Conference
TAC KBP: Entity Linking
Given an entity mention in an
article, find the link to the
right Wikipedia entity if one
exists.
John Williams
author
1922-1994
J. Lloyd Williams
botanist
1854-1945
John Williams
politician
1955-
John J. Williams
US Senator
1904-1988
John Williams
Archbishop
1582-1650
John Williams
composer
1932-
Jonathan Williams
poet
1929-
Michael Phelps
swimmer
1985-
Michael Phelps
biophysicist
1939-
John Williams
Richard Kaufman goes a long way back with John
Williams. Trained as a classical violinist,
Californian Kaufman started doing session work
in the Hollywood studios in the 1970s. One of his
movies was Jaws, with Williams conducting his
score in recording sessions in 1975...
Michael Phelps
Debbie Phelps, the mother of swimming star
Michael Phelps, who won a record eight gold
medals in Beijing, is the author of a new memoir,
...
Michael Phelps is the scientist most often
identified as the inventor of PET, a technique that
permits the imaging of biological processes in the
organ systems of living individuals. Phelps has ...
2009 NIST TAC Knowledge Base Population Track
Smushing
Smushing is the traditional term used for
recognizing that two “blank nodes” refer to the
same thing and merging them
Past work on smushing has exploited IFPs
(e.g., foaf:mbox), heuristic similarity metrics
and custom SPARQL queries
owl:sameAs is often used to relate smushed
nodes, enabling a reasoner to effect the merging
rdf:seeAlso used to find related foaf data
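The IFP-based smushing described above can be sketched in Python as follows. This is only an illustration, not the tooling used in past work: the triples are plain tuples, the node names and mailbox value are made up, and the choice of foaf:mbox and foaf:mbox_sha1sum as the IFPs to check is an assumption.

```python
from collections import defaultdict

def smush(triples, ifps=("foaf:mbox", "foaf:mbox_sha1sum")):
    """Group subjects that share a value for an inverse functional property."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_subject = {}  # (property, value) -> first subject seen with it
    for s, p, o in triples:
        find(s)  # register every subject, even if it has no IFP value
        if p in ifps:
            if (p, o) in first_subject:
                parent[find(s)] = find(first_subject[(p, o)])
            else:
                first_subject[(p, o)] = s

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical blank nodes; two of them share a mailbox.
triples = [
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:mbox", "mailto:alice@example.org"),
    ("_:c", "foaf:nick", "bar"),
]
merged = smush(triples)
```

Each returned group is a set of nodes that a reasoner would treat as one individual if they were related by owl:sameAs.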
Smushing
[Figure: RDF graph. Labels: foaf:Person, rdf:type, "bar", foaf:nick, owl:sameAs, foaf:mbox (twice), foaf:knows, "[email protected]"]
Smushing
[Figure: the same graph after merging. Labels: foaf:Person, rdf:type, "bar", foaf:nick, foaf:knows, foaf:mbox, "[email protected]"]
owl:sameAs considered harmful
Known problems
– Temporally qualified data (Ding vs. Ding)
– Noisy data (Clinton vs. Clinton)
– Referentially opaque contexts (John likes the
Morning Star beautiful)
Halpin et al. (2010) suggest a vocabulary for similarity relations (similarity.owl)
We use two weaker predicates: coref & notCoref
– Defer the sameAs problem to applications
Co-Reference in FOAF
coref: transitive, symmetric and reflexive; has
sameAs as subproperty
notCoref: symmetric and irreflexive but not
transitive; has differentFrom as subproperty
:coref a owl:TransitiveProperty, owl:SymmetricProperty, owl:ReflexiveProperty.
owl:sameAs rdfs:subPropertyOf :coref.
:notCoref a owl:SymmetricProperty, owl:IrreflexiveProperty.
owl:differentFrom rdfs:subPropertyOf :notCoref.
{?a :notCoref ?b. ?b :coref ?c.} => {?a :notCoref ?c}.
{?a foaf:knows ?b.} => {?a :notCoref ?b}.
The :coref and :notCoref properties that we use instead of owl:sameAs
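To make the interaction of the two properties concrete, here is a minimal Python sketch of the two rules above. It assumes that coref behaves as an equivalence relation and that notCoref propagates across coref classes; the function names and the toy nodes a, b, c, d are hypothetical.

```python
from itertools import product

def coref_classes(nodes, coref_pairs):
    """Equivalence classes of :coref (reflexive, symmetric, transitive)."""
    classes = {n: {n} for n in nodes}
    for a, b in coref_pairs:
        if classes[a] is not classes[b]:
            merged = classes[a] | classes[b]
            for n in merged:
                classes[n] = merged
    return classes

def not_coref_pairs(classes, asserted_not_coref, knows_pairs):
    """Apply {?a foaf:knows ?b} => {?a :notCoref ?b} and propagate
    :notCoref to every member of the two coref classes involved."""
    derived = set()
    for a, b in list(asserted_not_coref) + list(knows_pairs):
        for x, y in product(classes[a], classes[b]):
            if x != y:  # :notCoref is irreflexive
                derived.add(frozenset((x, y)))
    return derived

nodes = ["a", "b", "c", "d"]
classes = coref_classes(nodes, [("b", "c")])
derived = not_coref_pairs(classes, [], [("a", "b")])
```

Here a knows b and b is co-referent with c, so the sketch derives that a is not co-referent with either b or c; d is left undetermined.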
Batch Approach
Given a potentially large set of foaf instances
Generate candidate pairs
Evaluate each pair for co-reference
Using rules and classifier independently
Each results in a {coref, notCoref, unknown}
decision
Trust rules over classifier
Designate pairs as co-referent
Create Clusters
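The "trust rules over classifier" step can be sketched as below. The three decision labels come from the slide; the function names and the toy rule and classifier callables are hypothetical stand-ins, not the system's actual interfaces.

```python
COREF, NOT_COREF, UNKNOWN = "coref", "notCoref", "unknown"

def combine(rule_decision, classifier_decision):
    """Prefer the rule-based decision; fall back to the classifier
    only when the rules draw no conclusion."""
    if rule_decision != UNKNOWN:
        return rule_decision
    return classifier_decision

def evaluate_pairs(pairs, rule_fn, classifier_fn):
    """Evaluate each candidate pair with both models independently."""
    return {pair: combine(rule_fn(*pair), classifier_fn(*pair)) for pair in pairs}

# Toy stand-ins: the rules only know that p1 and p3 differ,
# the classifier calls every pair co-referent.
rules = lambda a, b: NOT_COREF if {a, b} == {"p1", "p3"} else UNKNOWN
classifier = lambda a, b: COREF
decisions = evaluate_pairs([("p1", "p2"), ("p1", "p3")], rules, classifier)
```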
Ingest
Extract triples from FOAF profiles
Add each foaf agent as new entity in
database
Entity URLs followed in foaf:knows graph to
get additional information
Approach: System Architecture
[Figure: system architecture. Components: ingestion; abstract entity generation; candidate pair generation (potential pairs: reduces classifier workload); model generation; rule-based reasoning (deductive decisions); machine learning (predictions); co-referent designation and clustering (clusters form new abstract entities)]
Candidate Pairs
Filter pairs to reduce the matching set
Use simple string matching predicates
Dice score for 3-grams
Apply both to values of common properties and also cross-property values
Experiment 2: ~30% reduction
Reductions vary based on data set
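A minimal sketch of the 3-gram Dice filter follows. The threshold of 0.5, the lowercasing, and the dictionary representation of an instance are assumptions made for illustration; the slides do not give these details.

```python
def trigrams(text):
    """Set of character 3-grams of the lowercased string."""
    s = text.lower().strip()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def dice(a, b):
    """Dice coefficient 2*|A & B| / (|A| + |B|) over 3-gram sets."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def is_candidate(props_a, props_b, threshold=0.5):
    """Keep a pair if any value of one instance resembles any value of
    the other, whether on the same property or across properties."""
    return any(dice(va, vb) >= threshold
               for va in props_a.values()
               for vb in props_b.values())

jim = {"name": "Jim Hendler"}
james = {"name": "James Hendler", "nick": "jhendler"}
bijan = {"name": "Bijan Parsia"}
```

With these values, the two Hendler instances pass the filter and would go on to the rules and classifier, while the Hendler and Parsia pair is dropped early.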
Input data sources
FOAF profiles extracted from Swoogle
Also used URLs extracted from tests conducted in previous work
Distribution of URLs from Experiment 2
Methodology: Rule-based Model
Rules conclude that two instances are co-referent, not co-referent, or draw no conclusion (the most common outcome)
Basic co-reference rule:
{?p a owl:IFP. ?a ?p ?x. ?b ?p ?x.} => {?a :coref ?b}
{?p a owl:FP. ?a ?p ?x. ?a ?p ?y.} => {?x :coref ?y}
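The two rules can be read as simple grouping operations over triples. The sketch below assumes triples are Python tuples and that the sets of inverse functional and functional properties are supplied by the caller; ex:hasMother and the ex: identifiers are hypothetical examples.

```python
from collections import defaultdict
from itertools import combinations

def coref_from_rules(triples, ifps, fps):
    """IFP rule: subjects sharing a value are co-referent.
    FP rule: values of the same subject are co-referent."""
    by_value = defaultdict(set)    # (property, object) -> subjects
    by_subject = defaultdict(set)  # (subject, property) -> objects
    for s, p, o in triples:
        if p in ifps:
            by_value[(p, o)].add(s)
        if p in fps:
            by_subject[(s, p)].add(o)
    coref = set()
    for group in list(by_value.values()) + list(by_subject.values()):
        coref.update(combinations(sorted(group), 2))
    return coref

triples = [
    ("ex:p1", "foaf:mbox_sha1sum", "abc"),
    ("ex:p2", "foaf:mbox_sha1sum", "abc"),
    ("ex:p3", "foaf:mbox_sha1sum", "def"),
    ("ex:p3", "ex:hasMother", "ex:m1"),
    ("ex:p3", "ex:hasMother", "ex:m2"),
]
pairs = coref_from_rules(triples, ifps={"foaf:mbox_sha1sum"}, fps={"ex:hasMother"})
```

Pairs not returned here are simply undetermined by these rules, which matches the "no conclusion" outcome.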
Methodology: Rule-based Model
In text processing, very similar name mentions in a document are more likely to be co-referent
This heuristic is also used in disambiguating name mentions in citations in a single paper or Web page
A similar heuristic is useful for a “knows graph” extracted from a single foaf profile
{?a foaf:knows ?b.
?a foaf:knows ?c.
?b neq ?c} => {?b :notCoref ?c}
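As an illustration of this rule, the sketch below derives notCoref pairs among the distinct people one agent knows. Triples are modeled as tuples and the ex: identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def knows_not_coref(triples):
    """Distinct agents listed in the same agent's knows graph
    are assumed not to be co-referent."""
    known = defaultdict(set)
    for s, p, o in triples:
        if p == "foaf:knows":
            known[s].add(o)
    pairs = set()
    for friends in known.values():
        # combinations over a set never pairs a node with itself (?b neq ?c)
        pairs.update(combinations(sorted(friends), 2))
    return pairs

triples = [
    ("ex:me", "foaf:knows", "ex:b"),
    ("ex:me", "foaf:knows", "ex:c"),
    ("ex:other", "foaf:knows", "ex:d"),
]
distinct = knows_not_coref(triples)
```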
Methodology – Vector Model
Support Vector Machine with a linear kernel
Features:
– Match/nomatch of any IFPs
– Distance measures over common property
values (Levenshtein & 3-gram Dice score)
– Alias and entity mention resolution
– Property specific feature comparison
– Knows graph comparisons: Jaccard coefficient of similarity of foaf names of one-hop neighbors
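A few of these features can be sketched as follows. This is a reduced illustration, not the full feature set: the dictionary keys (name, mbox_sha1sum, knows), the normalization of the edit distance, and the three-element vector are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard(x, y):
    """Jaccard coefficient |X & Y| / |X | Y| of two collections."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 0.0

def features(a, b, ifps=("mbox_sha1sum",)):
    """[IFP match, normalized name similarity, knows-graph overlap]."""
    ifp_match = float(any(a.get(p) and a.get(p) == b.get(p) for p in ifps))
    name_a, name_b = a.get("name", ""), b.get("name", "")
    longest = max(len(name_a), len(name_b)) or 1
    name_sim = 1 - levenshtein(name_a, name_b) / longest
    knows_sim = jaccard(a.get("knows", ()), b.get("knows", ()))
    return [ifp_match, name_sim, knows_sim]

a = {"name": "Jim Hendler", "mbox_sha1sum": "x", "knows": ["Tim", "Dan"]}
b = {"name": "Jim Hendler", "mbox_sha1sum": "x", "knows": ["Tim"]}
vector = features(a, b)
```

Vectors like this one, labeled as co-referent or not, are the kind of input a linear-kernel SVM would be trained on; the slides do not say which SVM toolkit was used.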
Methodology: Clustering
Pairs form clusters
Clusters used as part of system evaluation
Can result in:
– Entity to Entity pairing
– Cluster to Entity pairing
– Cluster to Cluster pairing
Greedy process with a confidence threshold
Use rule-based model to eliminate known
non-coreferent pairs
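One way to read the greedy process is sketched below: pairs are visited in order of decreasing confidence, clusters are merged when the confidence clears a threshold, and a merge is vetoed if the rules mark any cross-cluster pair as not co-referent. The threshold value, input format, and identifiers are assumptions; the slides do not give the actual procedure in this detail.

```python
def greedy_cluster(scored_pairs, not_coref, threshold=0.8):
    """scored_pairs: iterable of (confidence, a, b).
    not_coref: set of frozenset({x, y}) pairs known to be distinct."""
    ordered = sorted(scored_pairs, reverse=True)
    cluster_of = {}
    for _, a, b in ordered:
        cluster_of.setdefault(a, {a})
        cluster_of.setdefault(b, {b})
    for confidence, a, b in ordered:
        if confidence < threshold:
            break  # remaining pairs are all below the threshold
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue
        if any(frozenset((x, y)) in not_coref for x in ca for y in cb):
            continue  # rule-based veto
        merged = ca | cb
        for n in merged:
            cluster_of[n] = merged
    unique = {id(c): c for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique.values())

pairs = [(0.95, "p1", "p2"), (0.9, "p2", "p3"), (0.85, "p3", "p4"), (0.4, "p5", "p1")]
clusters = greedy_cluster(pairs, not_coref={frozenset(("p1", "p4"))})
```

In this toy run, the entity-to-entity pair (p1, p2) forms a cluster, the cluster-to-entity pair adds p3, and p4 is kept out because it is known not to be co-referent with p1.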
Methodology – Clustering
Instance matching can result in new cluster formation and
cluster matching can result in merged clusters.
Evaluation
Two experiments
– E1: 50,000 triples, over 500 entity
mentions, 600 classes used for training
– E2: 250,000 triples, over 3500 entity
mentions, over 1800 classes for training
10-fold cross-validation tests
Evaluation
For E1: 900 pairs non-match, majority
undetermined
E2: Results shown below
Pairs
Rule
Conclusion
differentFrom
Undetermined
47184
inverse functional
Undetermined
2402
inverse functional
Co-referent
8687410
knows graph
Undetermined
9138326
sameAs
Undetermined
1047874
knows
Not Co-referent
9138326
Evaluation
Results promising
During the E2 clustering phase, the first phase reached 90% accuracy
In the second phase, no new relationships among pairs were found; cluster-to-cluster pairing occurred
      TP Rate  FP Rate  Precision  Recall  F-Measure
E1    0.933    0.267    0.93       0.933   0.93
E2    0.959    0.128    0.958      0.959   0.958
Classification Results using 10-fold Validation
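As a quick consistency check on the table, the F-measure can be recomputed as the harmonic mean of precision and recall. This is the standard definition; the reported figures may be averages over classes, so the check below is only approximate.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

e1 = f_measure(0.93, 0.933)
e2 = f_measure(0.958, 0.959)
```

Both values agree with the reported F-Measure column after rounding.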
Evaluation
Retrieving additional FOAF profiles based on
knows graph
Quickly retrieve large number of entities
Tightly linked
– reduced diversity of analyzed data
– more entities that are co-referent
Future experiments: a diversity filter
spanning domains
Future Work
Evaluating the contribution of each rule and
SVM feature to performance
Other ML approaches, e.g., Markov logic, EM
Exploiting better clustering algorithms
Adding more features, e.g. non-foaf vocabulary, non-RDF data (e.g., hosting site)
Applying approach to other RDF instances
Scalability:
Providing a non-batch, streaming service
Offering a coref Web service
Conclusions
We can treat instance linking as co-reference resolution and exploit the in-document vs. cross-document distinction
Good results with an ensemble approach
combining rules and an SVM classifier
Apply clustering to form groups of co-referent
relations and reprocess
Promising initial results
http://ebiquity.umbc.edu/