Linked data: Predicting missing properties

Klemen Simonic, Jan Rupnik, Primoz Skraba
{klemen.simonic, jan.rupnik, primoz.skraba}@ijs.si
Overview
1. Linked Data (Motivation for the work)
2. Problem Definition
3. Approaches
4. Results
An example
Linked Data
- connect related data that was not previously linked
- practice for exposing, sharing, and connecting pieces of data
and information
How:
- URI (Uniform Resource Identifier)
- RDF (Resource Description Framework)
(description of how to model/present the data)
Linked Data, tiny example
Resource | Predicate / Property | Resource / Literal
http://www.w3.org/res/Audi | http://www.w3.org/rel/manufacturer | http://www.w3.org/Audi_A6
http://www.w3.org/res/Audi | http://www.w3.org/rel/name | “Audi”
http://www.w3.org/res/Audi | http://www.w3.org/rel/industry | http://www.w3.org/res/Automotive_industry
http://www.w3.org/res/Claus_Luthe | http://www.w3.org/rel/employer | http://www.w3.org/res/Audi
http://www.w3.org/res/Audi | http://www.w3.org/rel/sameAs | http://en.wikipedia.org/wiki/Audi
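The example triples can be held in memory as plain (subject, predicate, object) tuples. The following is a minimal sketch; the helper names `out_edges` and `properties_of` are illustrative and not part of the original work.

```python
from collections import defaultdict

# The five example triples, stored as (subject, predicate, object).
triples = [
    ("http://www.w3.org/res/Audi", "http://www.w3.org/rel/manufacturer",
     "http://www.w3.org/Audi_A6"),
    ("http://www.w3.org/res/Audi", "http://www.w3.org/rel/name", "Audi"),
    ("http://www.w3.org/res/Audi", "http://www.w3.org/rel/industry",
     "http://www.w3.org/res/Automotive_industry"),
    ("http://www.w3.org/res/Claus_Luthe", "http://www.w3.org/rel/employer",
     "http://www.w3.org/res/Audi"),
    ("http://www.w3.org/res/Audi", "http://www.w3.org/rel/sameAs",
     "http://en.wikipedia.org/wiki/Audi"),
]


def out_edges(triples):
    """Group (predicate, object) pairs by their subject."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return dict(index)


def properties_of(resource, triples):
    """Return the set of predicates on outgoing edges of `resource`."""
    return {p for s, p, _ in triples if s == resource}
```

A resource whose property set lacks an entry that similar resources have is a candidate for a missing property.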
Linked Data, one dataset
- Nodes are resources
- Edges are relations
- Edge Labels are properties
Linked Data cloud diagram
DBpedia
DBpedia extracts information from the infoboxes of Wikipedia pages.
Resource | Property | Resource / Literal
en.wikipedia.org/wiki/University_of_Ljubljana | Location | http://en.wikipedia.org/wiki/Ljubljana
en.wikipedia.org/wiki/University_of_Ljubljana | Established | “1919”
DBpedia
- DBraw: contains all the properties from all the infoboxes in the English
Wikipedia articles.
- DBmapped: the properties are unified (mapped onto the DBpedia ontology),
so properties with the same semantics are merged: PlaceOfBirth = BirthPlace.
The data is much cleaner and better structured than the raw
properties dataset.
Freebase
An entity graph of people, places and things, built by people.
- Collaborative knowledge base
- Property schemas
- Google Knowledge Graph
Scale of Datasets
Dataset    #nodes  #edges  #objects  #properties  avgDeg
DBmapped   5M      17M     2M        1296         5.92
DBraw      11M     47M     3M        44463        8.45
Freebase   141M    607M    23M       19700        8.58
- DBmapped, DBraw: DBpedia 3.7 version (additional properties and resources
may have been added in the meantime)
- Freebase: largest and most structured dataset (large number of edges and
objects, and a relatively small number of properties)
- DBraw: messy and noisy dataset (large number of different properties
because they are not unified)
Missing properties
Problem:
What are the missing
properties for Fiat?
For a given resource, we
want a ranking of its missing
properties by likelihood.
Approach
- Similar objects
- Measure of similarity
- Neighborhood
- Ranking function
Approach
Ranking = weighted average of the k nearest-neighbor objects’
property frequency vectors.
General framework (Kernel smoother):
We can replace d with a normalized kernel function.
(More math on this topic is in the paper.)
The function g(o) depends on the choice of the measure of closeness d(o, o_i).
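A minimal sketch of the weighted-average ranking, assuming the kernel smoother takes the form g(o) = Σ_i d(o, o_i) f(o_i) / Σ_i d(o, o_i), where f(o_i) is a 0/1 indicator vector of the properties of neighbor o_i. This form, the symbol f, and the function name `rank_missing_properties` are assumptions made for illustration; the paper may define the frequency vectors differently.

```python
from collections import defaultdict


def rank_missing_properties(query_props, neighbors):
    """Rank properties that the query object does not have yet.

    query_props: set of properties the query object already has.
    neighbors:   list of (weight, property_set) pairs for the k nearest
                 objects; `weight` plays the role of d(o, o_i).
    Returns (property, score) pairs, highest score first.
    """
    total = sum(weight for weight, _ in neighbors)
    if total == 0:
        return []
    scores = defaultdict(float)
    for weight, props in neighbors:
        for prop in props:
            if prop not in query_props:
                # Each neighbor votes with its normalized weight.
                scores[prop] += weight / total
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_missing_properties(
    {"name", "industry"},
    [
        (2.0, {"name", "industry", "manufacturer"}),
        (1.0, {"name", "foundedBy"}),
        (1.0, {"manufacturer", "foundedBy"}),
    ],
)
```

Any measure of closeness can be plugged in through the weights, which is what makes the framework general.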
Evaluation protocol
The evaluation procedure:
1. For a given object o, we delete one or more of its
properties, denoted (o, {p1, …, pk})
2. Run the recommendation algorithm for the object
3. Compute several evaluation metrics
Evaluation metrics
- Inverse rank (IRank) =
- Top 5 =
- Top 10 =
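A minimal sketch of the protocol and metrics, assuming IRank is the reciprocal of the position at which the deleted property appears in the returned ranking, and that Top 5 and Top 10 indicate whether it appears among the first 5 or 10 entries. These definitions, the single-deletion setting, and all function names are assumptions for illustration.

```python
def inverse_rank(ranking, deleted):
    """Return 1 / (1-based position of `deleted`), or 0.0 if it is absent."""
    for position, prop in enumerate(ranking, start=1):
        if prop == deleted:
            return 1.0 / position
    return 0.0


def top_k_hit(ranking, deleted, k):
    """Return 1 if `deleted` is among the first k entries, else 0."""
    return 1 if deleted in ranking[:k] else 0


def evaluate(cases, recommend):
    """Average IRank, Top 5 and Top 10 over a non-empty list of cases.

    cases:     list of (property_set, deleted_property) pairs.
    recommend: function mapping a property set to a ranked property list.
    """
    totals = {"irank": 0.0, "top5": 0, "top10": 0}
    for props, deleted in cases:
        # Steps 1 and 2: hide the property, then ask for recommendations.
        ranking = recommend(set(props) - {deleted})
        # Step 3: score the position of the hidden property.
        totals["irank"] += inverse_rank(ranking, deleted)
        totals["top5"] += top_k_hit(ranking, deleted, 5)
        totals["top10"] += top_k_hit(ranking, deleted, 10)
    return {name: value / len(cases) for name, value in totals.items()}
```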
Measures of Closeness
- Local Measures: local graph properties
- Baselines:
- Random Objects
- Objects with Common Properties
- Property Co-occurrence
- Global Measures: global graph properties
- Exogenous Measures: external information (text)
Local Graph Measures
We focus on a local description, based on the property distributions:
- PropertyCount
- DirPropertyCount
- NeighbDirPropertyCount
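A minimal sketch of the first two descriptors, assuming PropertyCount counts how often each property labels an edge incident to the object, and DirPropertyCount keeps incoming and outgoing edges apart. These readings of the names are assumptions; the exact definitions are in the paper. NeighbDirPropertyCount is not sketched here.

```python
from collections import Counter


def property_count(node, triples):
    """Count properties on edges touching `node`, ignoring direction."""
    counts = Counter()
    for s, p, o in triples:
        if s == node or o == node:
            counts[p] += 1
    return counts


def dir_property_count(node, triples):
    """Count properties on edges touching `node`, split by direction."""
    counts = Counter()
    for s, p, o in triples:
        if s == node:
            counts[("out", p)] += 1
        if o == node:
            counts[("in", p)] += 1
    return counts


# Toy triples with shortened names, for illustration only.
toy_triples = [
    ("Audi", "manufacturer", "Audi_A6"),
    ("Audi", "industry", "Automotive_industry"),
    ("Claus_Luthe", "employer", "Audi"),
]
```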
Random objects
Choose uniformly at random some number of objects in the
network
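A minimal sketch of this baseline, assuming every sampled object receives the same weight of 1.0 and that the query object itself is excluded. The function name `random_neighbors` and the `seed` parameter are hypothetical.

```python
import random


def random_neighbors(objects, query, k, seed=None):
    """Pick up to k objects uniformly at random, excluding the query."""
    rng = random.Random(seed)
    # Sort first so that a fixed seed gives a reproducible sample.
    candidates = sorted(obj for obj in objects if obj != query)
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [(1.0, obj) for obj in chosen]
```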
Objects with common properties
Take the objects which share a minimum number of properties with
the query object
The number of shared properties is taken as the weight for the
object
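A minimal sketch of this baseline. The name `common_property_neighbors` and the `min_shared` parameter are hypothetical; the caller is assumed to leave the query object out of `object_props`.

```python
def common_property_neighbors(query_props, object_props, min_shared=1):
    """Select objects sharing at least `min_shared` properties with the query.

    object_props: dict mapping each candidate object to its property set.
    Returns (weight, object) pairs, where the weight is the number of
    shared properties, sorted by decreasing weight.
    """
    neighbors = []
    for obj, props in object_props.items():
        shared = len(query_props & props)
        if shared >= min_shared:
            neighbors.append((shared, obj))
    return sorted(neighbors, key=lambda pair: (-pair[0], pair[1]))


candidates = {
    "Fiat": {"name", "industry", "manufacturer"},
    "Ljubljana": {"name", "population"},
    "Sava": {"length"},
}
```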
Property Co-occurrence
Approximate resource similarities through property co-occurrence
patterns
Only pairwise co-occurrences are considered for the purposes of
scalability and feasibility of estimation
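A minimal sketch of a pairwise co-occurrence baseline. The scoring rule used here, the average over the query's properties p of the estimated conditional frequency count(p, q) / count(p), is an assumption chosen for illustration and may differ from the rule used in the paper; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(property_sets):
    """Count single properties and unordered property pairs over all objects."""
    single = Counter()
    pair = Counter()
    for props in property_sets:
        single.update(props)
        for a, b in combinations(sorted(props), 2):
            pair[(a, b)] += 1
    return single, pair


def score_candidates(query_props, single, pair):
    """Score each property missing from the query by average co-occurrence."""
    if not query_props:
        return []
    scores = {}
    for q in single:
        if q in query_props:
            continue
        total = 0.0
        for p in query_props:
            if single[p]:
                key = (p, q) if p < q else (q, p)
                total += pair[key] / single[p]
        scores[q] = total / len(query_props)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


single, pair = cooccurrence_counts([
    {"name", "industry", "manufacturer"},
    {"name", "industry"},
    {"name", "population"},
])
```

Only the pair counts need to be stored, which keeps the estimate feasible on large graphs.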
Our method
Each object is described by a DirPropertyCount vector.
The similarity is determined by computing the
dot product between DirPropertyCount vectors.
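A minimal sketch of the similarity computation, assuming each DirPropertyCount vector is stored sparsely as a mapping from (direction, property) to a count. The storage format, the exhaustive scan over all objects, and the function names are assumptions for illustration.

```python
from collections import Counter


def dot(u, v):
    """Dot product of two sparse count vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(count * v.get(key, 0) for key, count in u.items())


def k_nearest(query_vec, vectors, k):
    """Return the k most similar objects as (similarity, object) pairs."""
    scored = [(dot(query_vec, vec), obj) for obj, vec in vectors.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


query_vec = Counter({("out", "industry"): 1, ("out", "manufacturer"): 2})
vectors = {
    "Fiat": Counter({("out", "industry"): 1, ("out", "manufacturer"): 3}),
    "Ljubljana": Counter({("in", "location"): 5}),
    "BMW": Counter({("out", "industry"): 1}),
}
```

The resulting similarities can serve directly as the neighbor weights in the ranking step.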
Comparison
Other Measures of Closeness
- Local Measures: local graph properties
- Baselines:
- Random Objects
- Objects with Common Properties
- Property Co-occurrence
- Global Measures: global graph properties
- Exogenous Measures: external (non-graph) information
Global Graph Measures
We use two global measures of closeness based on graph geodesics
and graph diffusion:
(We treat the graph as a simple undirected graph. We also remove all the literals and constants from the
set of nodes to remove unintuitive paths.)
- Shortest path length
- The length of a shortest path between two objects
- We calculate the distances corresponding to the k nearest objects
- Exponential diffusion kernel
- Based on computing the matrix exponential of the graph adjacency matrix A
- Parameter α controls how local/global the similarities are
- Takes into account both the total number of paths between nodes as well as their
respective lengths
- Robust measure
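A minimal sketch of both global measures on a small undirected graph, assuming the diffusion kernel has the form exp(αA) and approximating it with a truncated power series. The exact kernel form, the truncation at 20 terms, and the function names are assumptions; a numerical library routine for the matrix exponential would be the practical choice on real data.

```python
from collections import deque
from math import factorial


def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(a, alpha, terms=20):
    """Approximate exp(alpha * A) by the sum of alpha^t / t! * A^t, t < terms."""
    n = len(a)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0
    kernel = [[0.0] * n for _ in range(n)]
    for t in range(terms):
        coeff = alpha ** t / factorial(t)
        for i in range(n):
            for j in range(n):
                kernel[i][j] += coeff * power[i][j]
        power = matmul(power, a)  # A^(t+1): entry (i, j) counts walks of that length
    return kernel


# Path graph 0 - 1 - 2.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = diffusion_kernel(A, alpha=0.5)
```

In this series, the t-th term counts walks of length t and damps them by α^t / t!, which is why the kernel reflects both the number of paths and their lengths.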
Exogenous Measures
- Independent of the graph structure
- Rely on additional external information about the objects
- Helpful for nodes with few connections in the graph
Textual information:
- For some of the objects, we have extended abstracts describing
the objects
- TF-IDF weighting + cosine similarity
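A minimal sketch of TF-IDF weighting followed by cosine similarity, assuming whitespace tokenization, raw term counts, and idf = log(N / df). These choices and the toy abstracts are assumptions for illustration, not the preprocessing used in the original work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each object to a sparse {term: tf * idf} vector.

    docs: dict mapping an object to the text of its abstract.
    """
    tokenized = {obj: text.lower().split() for obj, text in docs.items()}
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # document frequency of each term
    n = len(docs)
    vectors = {}
    for obj, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[obj] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


abstracts = {
    "Audi": "german car manufacturer",
    "Fiat": "italian car manufacturer",
    "Ljubljana": "capital city of slovenia",
}
text_vectors = tfidf_vectors(abstracts)
```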
Results - IRank
Results - Top10
In vs. Out properties
Deleting several properties
Method: DirPropertyCount vector
Dataset: DBraw
We remove a fixed fraction of in and out properties
Degradation – nodes / edges
The negative effect of deleting a fraction of edges or nodes from the
network
Degradation – properties
The effect of deleting K most frequent properties from the network
Conclusion
- Method for predicting missing properties
- Use kernel smoother
- Measure similarity in a number of different ways:
- Local properties
- Global graph structure
- External data (text)
- Extensive experimentation
- Investigate more on combining measures
More details about the research are in the papers:
- Linked data: Predicting missing properties [machine learning]
- Predicting Instance Properties in Linked Data [semantics of data]
Take home message
- Big redundancy / regularity in the data
- Local measures perform well
- Scale changes the structure -> we need different methods
Questions?