Statistical Relational Learning: Theory & Applications
Lise Getoor
University of Maryland, College Park
Why SRL?
Traditional statistical machine learning approaches assume:
• A random sample of homogeneous objects from a single relation
Traditional ILP/relational learning approaches assume:
• No noise or uncertainty in data
Real world data sets:
• Multi-relational, heterogeneous and semi-structured
• Noisy and uncertain
Statistical Relational Learning:
• A newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming
Sample Domains:
• web data, bibliographic data, epidemiological data, communication data, customer networks, collaborative filtering, trust networks, biological data, sensor networks, natural language, vision
SRL Theory
Directed Approaches
Semantics based on Bayesian Networks
Rule-based Directed Models
Frame-based Directed Models
Undirected Approaches
Semantics based on Markov Networks
Frame-based Undirected Models
Rule-based Undirected Models
Reference: Upcoming book on Statistical Relational Learning w/ Ben Taskar
SRL Challenges
Modeling Logical vs. Statistical dependencies
Feature construction
Instances vs. Classes
Collective Classification
Collective Consolidation
Effective Use of Labeled & Unlabeled Data
Link Prediction
Closed vs. Open World
SRL Applications: Link Mining
Data
• Structured Input: Mining graphs and networks
• Structured Output: Extracting entities and relationships from unstructured data
Making use of Links
• For ranking nodes
• For collective classification of nodes
Discovering Links
• Predicting missing links
• Discovering new kinds of links and relationships
Link Mining Tasks
Node Centric
• Labeling/ranking nodes (aka Collective Classification/PageRank)
• Consolidating nodes (aka Entity Resolution)
• Discovering hidden nodes (aka Group Discovery)
Edge Centric
• Labeling/ranking edges
• Predicting the existence of edges
• Predicting the number of edges
• Discovering new relations/paths
Graph/Subgraph Centric
• Discovering frequent subpatterns
• Generative models
• Metadata discovery, extraction, and reformulation
Reference: SigKDD Explorations Special Issue on Link Mining, December 2005.
LINQS Group @ UMD
Members
myself, Indrajit Bhattacharya, Mustafa Bilgic, Rezarta Islamaj,
Louis Licamele, Galileo Namata, John Park, Prithivaraj Sen,
Vivek Senghal, Elena Zheleva
Projects
Link-based Classification
Entity Resolution (ER)
• Algorithms
• Query-time ER
• User Interface
Predictive Models for Social Network Analysis
• Abstraction in Affiliation Networks
• Social Capital in Friendship Event Networks
• Role Discovery
Temporal Analysis of Email Traffic Networks
Feature Generation for Sequences (biological data)
Protein Interaction Prediction (biological data)
Schema Mapping
Entity Resolution
The Problem
Relational Entity Resolution
Algorithms
Graph-based Clustering (GBC)
Probabilistic Model (LDA-ER)
Query-time Entity Resolution
ER User Interface
InfoVis Co-Author Network Fragment
[Figures: network fragment before and after each resolution step]
Hus Su – Hua Su
L. Tweedie – Lisa Tweedie
H. Dawkes – Huw Dawkes
B. Spence – Bob Spence
Bob Spence – Robert Spence
Initial vs. Final
The Entity Resolution Problem
Entities: James Smith, John Smith, Jonathan Smith
References: “John Smith”, “Jim Smith”, “J Smith”, “James Smith”, “Jon Smith”, “J Smith”, “Jonthan Smith”
Issues:
1. Identification
2. Disambiguation
The Entity Resolution Problem
Entities: James Smith, John Smith, Jonathan Smith
References: “John Smith”, “Jim Smith”, “J Smith”, “James Smith”, “Jon Smith”, “J Smith”, “Jonthan Smith”
Unsupervised clustering approach
Number of clusters/entities unknown a priori
Attribute-based Entity Resolution
Pair-wise classification
“J Smith” vs. “James Smith”: ?
“Jim Smith” vs. “James Smith”: 0.8
“J Smith” vs. “James Smith”: ?
“John Smith” vs. “James Smith”: 0.1
“Jon Smith” vs. “James Smith”: 0.7
“Jonthan Smith” vs. “James Smith”: 0.05
1. Inability to disambiguate
2. Choosing threshold: precision/recall tradeoff
3. Perform transitive closure?
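The three issues listed for pair-wise classification can be seen in a small sketch. This is only an illustration under stated assumptions: `name_similarity` uses Python's `difflib.SequenceMatcher` as a stand-in string measure (it is not the classifier behind the scores on the slide), and `resolve_pairwise` and both threshold values are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Stand-in attribute similarity in [0, 1]; an assumption, not the slide's scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_pairwise(references, threshold):
    """Link every pair scoring at or above the threshold, then take the transitive closure."""
    parent = list(range(len(references)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        if name_similarity(references[i], references[j]) >= threshold:
            parent[find(i)] = find(j)  # merging roots is what makes the closure transitive

    clusters = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref)
    return list(clusters.values())


refs = ["John Smith", "Jim Smith", "J Smith", "James Smith", "Jon Smith", "Jonthan Smith"]
strict = resolve_pairwise(refs, threshold=0.99)  # very high precision: nothing is merged
loose = resolve_pairwise(refs, threshold=0.5)    # high recall: every reference collapses into one entity
```

With names alone, no threshold separates the three Smiths: a strict setting leaves every reference on its own, and a loose one merges them all once the transitive closure is applied.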
Relational Entity Resolution
References not observed independently
• Links between references indicate relations between the entities
• Co-author relations for bibliographic data
Use relations to improve identification and disambiguation
Relational Identification
• Very similar names. Added evidence from shared co-authors
Relational Disambiguation
• Very similar names but no shared collaborators
Collective Entity Resolution
• One resolution provides evidence for another => joint resolution
Relational Constraints
• Co-authors are typically distinct
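A minimal sketch of how shared co-authors could add evidence on top of name similarity, and how the "co-authors are typically distinct" constraint could be enforced. The record layout, the equal weights, the fixed name similarity of 0.6 and the function names are all assumptions made for illustration; the slides do not specify them.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relational_similarity(ref_a, ref_b, attr_sim, w_attr=0.5, w_rel=0.5):
    """Combine a given name similarity with co-author overlap (hypothetical weighting)."""
    if ref_a["paper"] == ref_b["paper"]:
        # Relational constraint: two authors of the same paper are treated as distinct.
        return 0.0
    return w_attr * attr_sim + w_rel * jaccard(ref_a["coauthors"], ref_b["coauthors"])


# Hypothetical references; the name similarity is held at 0.6 to isolate the relational effect.
r1 = {"name": "J Smith", "paper": "P1", "coauthors": {"A Ansari"}}
r2 = {"name": "James Smith", "paper": "P2", "coauthors": {"J Smith", "A Ansari"}}
r3 = {"name": "J Smith", "paper": "P3", "coauthors": {"L Li"}}
r4 = {"name": "J Smith", "paper": "P2", "coauthors": {"James Smith", "A Ansari"}}

identified = relational_similarity(r1, r2, attr_sim=0.6)     # shared co-author raises the score
disambiguated = relational_similarity(r3, r2, attr_sim=0.6)  # no shared collaborators
constrained = relational_similarity(r4, r2, attr_sim=0.6)    # same paper, so never merged
```

In this sketch the same name similarity leads to different scores depending on the co-author evidence, which is the distinction the identification and disambiguation slides draw.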
Evaluation Datasets
CiteSeer
• Machine Learning Citations
• Originally created by Lawrence et al.
• 2,892 references to 1,165 true authors
• 1,504 links
arXiv HEP
• Papers from High Energy Physics
• Used for KDD-Cup ‘03 Data Cleaning Challenge
• 58,515 references to 9,200 true authors
• 29,555 links
BioBase
• Biology papers on immunology and infectious diseases
• IBM KDD Challenge dataset constructed at Cornell
• 156,156 publications, 831,991 author references
• Ground truth for only ~1060 references
Comparison: Structural Difference
Percentage of ambiguous references
• 0.5% for Citeseer
• 9% for HEP
• 32% for BioBase
Average number of collaborators per author
• 2.15 for Citeseer
• 4.5 for HEP
Average number of references per author
• 2.5 for Citeseer
• 6.4 for HEP
• 106 for BioBase
Performance
• 2% improvement for Citeseer
• 4% improvement for HEP
• 16% improvement for BioBase
Query-time ER: Motivation
Most publicly available databases do not have resolved entities
• PubMed, CiteSeer have many unresolved authors
Millions of queries every day require resolved entities directly or indirectly
• “I am looking for all papers by Wei Wang”
How do we address this problem?
1. Leave the burden on the user to do the resolution
2. Ask owners to ‘clean’ their databases
3. Develop techniques for query-time resolution
Query-time ER using Relations
1. Use best techniques available
2. Collective resolution using relationships
How can we localize collective resolution?
Two-phase collective resolution for query
• Extract minimal set of relevant records
• Collective resolution on extracted records
Cut-based Evaluation of Relational Clustering
• Vertices embedded in attribute space
• Additional (hyper)edges represent relationships
[Figure: two alternative clusterings of the same vertices into clusters C1, C2, C3, C4]
First clustering
• Good separation of attributes
• Many cluster-cluster relationships: C1-C3, C1-C4, C2-C4
Second clustering
• Worse in terms of attributes
• Fewer cluster-cluster relationships: C1-C3, C2-C4
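A Cut-based Objective Function
Objective over pairs of clusters $(c_i, c_j)$, summed over $i$ and $j$, with components:
• $w_A$: weight for attributes
• $sim_A(c_i, c_j)$: similarity of attributes
• $w_R$: weight for relations
• relational edge indicator on $(c_i, c_j)$: 1 iff relational edge exists between $c_i$ and $c_j$
• $f(c_i, c_j)$: compatibility of $c_i$ and $c_j$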
Greedy clustering algorithm: merge cluster pair with max reduction in objective function
Reduction for merging $c_i$ and $c_j$ is expressed through $w_A \, sim_A(c_i, c_j)$, $w_R \, (N(c_i) \cap N(c_j))$ and $f(c_i, c_j)$
Similarity of attributes
• Jaro, Levenshtein; TF-IDF
Common cluster neighborhood
• Jaccard works better than intersection
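The greedy merge step can be sketched as follows. This is a simplified stand-in, not the objective from the slide: the merge score `w_attr * sim_attr + w_rel * Jaccard(N(ci), N(cj))`, the toy `sim_attr` name rule, the weights, the threshold and the example references are all assumptions chosen so the behaviour can be followed by hand.

```python
from itertools import combinations


def sim_attr(a, b):
    """Toy name similarity: 1.0 if identical, 0.5 if an initial matches a full first name."""
    first_a, last_a = a.split()[0], a.split()[-1]
    first_b, last_b = b.split()[0], b.split()[-1]
    if last_a != last_b:
        return 0.0
    if first_a == first_b:
        return 1.0
    if first_a[0] == first_b[0] and 1 in (len(first_a), len(first_b)):
        return 0.5
    return 0.0


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(refs, w_attr, w_rel, threshold):
    """refs: list of (name, paper_id). Repeatedly merge the best-scoring cluster pair."""
    cluster_of = list(range(len(refs)))  # reference index -> cluster id
    by_paper = {}
    for idx, (_, paper) in enumerate(refs):
        by_paper.setdefault(paper, []).append(idx)

    def members(c):
        return [i for i, cid in enumerate(cluster_of) if cid == c]

    def neighborhood(c):
        # Clusters whose references share a paper (hyper-edge) with a reference in c.
        return {cluster_of[j] for i in members(c)
                for j in by_paper[refs[i][1]] if cluster_of[j] != c}

    def score(a, b):
        if b in neighborhood(a):  # co-authors are typically distinct
            return 0.0
        attr = max(sim_attr(refs[i][0], refs[j][0])
                   for i in members(a) for j in members(b))
        return w_attr * attr + w_rel * jaccard(neighborhood(a), neighborhood(b))

    while True:
        ids = sorted(set(cluster_of))
        best = max(((score(a, b), a, b) for a, b in combinations(ids, 2)),
                   default=(0.0, None, None))
        if best[0] < threshold:
            break
        _, keep, drop = best
        for i in range(len(cluster_of)):
            if cluster_of[i] == drop:
                cluster_of[i] = keep

    groups = {}
    for i, cid in enumerate(cluster_of):
        groups.setdefault(cid, []).append(refs[i])
    return sorted(groups.values())


refs = [("W Wang", "P1"), ("A Ansari", "P1"), ("Wei Wang", "P2"),
        ("A Ansari", "P2"), ("Wen Wang", "P3"), ("L Li", "P3")]
collective = greedy_cluster(refs, w_attr=0.7, w_rel=0.3, threshold=0.6)
attributes_only = greedy_cluster(refs, w_attr=0.7, w_rel=0.0, threshold=0.6)
```

In this toy run the two "A Ansari" references merge first on name alone; that merge gives "W Wang" and "Wei Wang" a common cluster neighborhood, which lifts their score over the threshold, while "Wen Wang" stays separate. With the relational weight set to zero, none of the Wang references merge.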
Extracting Relevant Records
Start with query name or record
Alternate between
1. Name expansion: For any relevant record, include other records with that name
2. Hyper-edge expansion: For any relevant record, include other related records
Terminate at some depth k
[Figure: expansion for the query “W Wang”]
Level 0 (name expansion): P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang
Level 1 (hyper-edge expansion): P1: A Ansari, P1: C Chen, P2: A Ansari, P3: C Chen, P3: L Li, P4: A Ansari
Level 2 (name expansion): P: A Ansari, P: A Ansari, P: C Chen, P: C Chen, P: L Li, P: L Li
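The alternating expansion can be sketched as below. Records P1 to P4 follow the figure; records P5 to P8, the `names_match` rule (same last name and first initial) and the function name `extract_relevant` are hypothetical additions for illustration.

```python
def names_match(a: str, b: str) -> bool:
    """Hypothetical name compatibility: same last name and same first initial."""
    a_parts, b_parts = a.split(), b.split()
    return a_parts[-1] == b_parts[-1] and a_parts[0][0] == b_parts[0][0]


def extract_relevant(records, query_name, depth):
    """records: iterable of (paper_id, author_name). Returns the relevant set up to the given depth."""
    relevant = {r for r in records if names_match(r[1], query_name)}  # level 0: name expansion
    frontier = set(relevant)
    for level in range(1, depth + 1):
        if level % 2 == 1:
            # Hyper-edge expansion: other references on the same papers.
            reached = {r for r in records for f in frontier if r[0] == f[0]}
        else:
            # Name expansion: other records carrying a matching name.
            reached = {r for r in records for f in frontier if names_match(r[1], f[1])}
        frontier = reached - relevant
        relevant |= frontier
    return relevant


records = [
    ("P1", "W Wang"), ("P1", "A Ansari"), ("P1", "C Chen"),
    ("P2", "W Wang"), ("P2", "A Ansari"),
    ("P3", "W Wang"), ("P3", "C Chen"), ("P3", "L Li"),
    ("P4", "W W Wang"), ("P4", "A Ansari"),
    # Hypothetical extra records, not taken from the figure:
    ("P5", "A Ansari"), ("P6", "C Chen"), ("P7", "L Li"), ("P8", "M Miller"),
]

level0 = extract_relevant(records, "W Wang", depth=0)
level1 = extract_relevant(records, "W Wang", depth=1)
level2 = extract_relevant(records, "W Wang", depth=2)
```

Each extra level widens the set that collective resolution has to process, which is why the expansion is cut off at a fixed depth k.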
Adaptive Expansion for a Query
Too many records with unconstrained expansion
Adaptive Name Expansion
Adaptively select records based on ‘ambiguity’
‘Chen’ is more ambiguous than ‘Ansari’
Expand the more ambiguous records
They need extra evidence
Adaptive Hyper-edge expansion
Add fewer ambiguous records
They lead to imprecision
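One way to picture adaptive name expansion is to expand only names whose estimated ambiguity passes a cutoff. The slides do not define the ambiguity measure, so the estimate used here (share of records carrying the same last name), the cutoff and all records are assumptions.

```python
from collections import Counter


def last_name(name: str) -> str:
    return name.split()[-1]


def ambiguity_by_last_name(records):
    """Hypothetical ambiguity estimate: fraction of all records sharing a last name."""
    counts = Counter(last_name(name) for _, name in records)
    return {ln: n / len(records) for ln, n in counts.items()}


def adaptive_name_expansion(records, frontier, min_ambiguity):
    """Expand only frontier names at or above the ambiguity cutoff; returns newly added records."""
    ambiguity = ambiguity_by_last_name(records)
    to_expand = {last_name(name) for _, name in frontier
                 if ambiguity[last_name(name)] >= min_ambiguity}
    return {r for r in records if last_name(r[1]) in to_expand and r not in frontier}


# Hypothetical records: 'Chen' is more common, hence more ambiguous, than 'Ansari'.
records = [("P1", "C Chen"), ("P6", "C Chen"), ("P9", "C Chen"), ("P10", "C Chen"),
           ("P1", "A Ansari"), ("P5", "A Ansari"), ("P3", "L Li")]
frontier = {("P1", "C Chen"), ("P1", "A Ansari")}
added = adaptive_name_expansion(records, frontier, min_ambiguity=0.5)
```

Here only the 'Chen' records are pulled in; the less ambiguous 'Ansari' reference is left unexpanded, which keeps the relevant set small.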
Evaluation Datasets
arXiv High Energy Physics
• 29,555 publications, 58,515 refs to 9,200 authors
• Queries: All ambiguous names (75 in total)
• True authors per name: 2 to 11 (avg. is 2.4)
Elsevier BioBase
• 156,156 publications, 831,991 author refs
• Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
• Queries: 100 most frequent names
• True authors per name: 1 to 100 (avg. is 32)
Query-time ER Results
Unconstrained expansion
Method           arXiv F1   BioBase F1
A                0.721      0.701
A*               0.778      0.687
A+N              0.956      0.710
A+N*             0.952      0.753
RC-ER Depth 1    0.964      0.813
RC-ER Depth 3    0.970      0.821
A: pair-wise attribute similarity; A+N: also neighbors’ attributes; *: transitive closure
Collective resolution more accurate
Accuracy improves beyond depth 1
Adaptive expansion
                 Unconstr   AX-2     AX-1
relv-set size    44,129     5,510    3,743
time (secs)      607        43       31
accuracy (F1)    0.821      0.820    0.818
Minimal loss in accuracy
Dramatic reduction in query processing time
AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
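The trade-off reported for adaptive expansion can be checked with a few lines of arithmetic on the figures above. The dictionary layout and variable names are just a convenient, hypothetical way to hold the numbers.

```python
# Figures copied from the adaptive expansion results above.
results = {
    "Unconstr": {"relv_set_size": 44129, "time_secs": 607, "f1": 0.821},
    "AX-2": {"relv_set_size": 5510, "time_secs": 43, "f1": 0.820},
    "AX-1": {"relv_set_size": 3743, "time_secs": 31, "f1": 0.818},
}

baseline = results["Unconstr"]
summary = {}
for method in ("AX-2", "AX-1"):
    row = results[method]
    summary[method] = {
        # How many times faster than unconstrained expansion.
        "speedup": baseline["time_secs"] / row["time_secs"],
        # How many times smaller the relevant set is.
        "size_reduction": baseline["relv_set_size"] / row["relv_set_size"],
        # Absolute drop in F1.
        "f1_loss": baseline["f1"] - row["f1"],
    }

for method, stats in summary.items():
    print(f"{method}: {stats['speedup']:.1f}x faster, "
          f"{stats['size_reduction']:.1f}x smaller relevant set, "
          f"F1 loss {stats['f1_loss']:.3f}")
```

On these numbers, AX-2 is roughly 14 times faster and AX-1 roughly 20 times faster than unconstrained expansion, for an F1 loss of at most 0.003.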
Current ER Projects
Entity Resolution in Geospatial Data
• Using spatial information, location name information and location type information
D-Dupe Interactive ER Tool
• Simple user-interface for entity resolution
• Accepted to new Visual Analytics conference
Name Reference Resolution in Email
Goal: Allow users to query an unresolved database
• Adaptive strategy constructs set of relevant references and performs collective resolution
• Preliminary adaptive strategy as accurate & 200x faster
Ontology Alignment (work w/ Octavian Udrea)
• Combines relational clustering with logical inference (e.g. equivalence and subsumption)
• Results in a 40% improvement in recall on 30 OWL Lite ontology pairs
Conclusion
SRL is an exciting new research area in ML
Entity resolution is one example of a surprisingly rich SRL topic
You CAN have your theory and apply it too
Thanks!
Questions?
D-Dupe: An Interactive Tool for ER
Tool Integrates
• entity resolution algorithms
• simple visual interface optimized for ER
Case studies on bibliographic datasets
• on two clean datasets we quickly were able to find many duplicates
• on one dataset w/o author keys, we were able to easily clean dataset to construct keys
Currently
• adapting tool for database integration
• geospatial data
• academic genealogy
• email archives