Statistical Relational Learning: Theory & Applications
Lise Getoor
University of Maryland, College Park
Why SRL?

- Traditional statistical machine learning approaches assume:
  • A random sample of homogeneous objects from a single relation
- Traditional ILP/relational learning approaches assume:
  • No noise or uncertainty in the data
- Real-world data sets are:
  • Multi-relational, heterogeneous and semi-structured
  • Noisy and uncertain
- Statistical Relational Learning:
  • Newly emerging research area at the intersection of research in
    social network and link analysis, hypertext and web mining, graph
    mining, relational learning and inductive logic programming
Sample Domains:

- web data, bibliographic data, epidemiological data, communication
  data, customer networks, collaborative filtering, trust networks,
  biological data, sensor networks, natural language, vision
SRL Theory

- Directed Approaches
  • Semantics based on Bayesian networks
  • Rule-based directed models
  • Frame-based directed models
- Undirected Approaches
  • Semantics based on Markov networks
  • Frame-based undirected models
  • Rule-based undirected models

Reference: upcoming book on Statistical Relational Learning w/ Ben Taskar
SRL Challenges

- Modeling logical vs. statistical dependencies
- Feature construction
- Instances vs. classes
- Collective classification
- Collective consolidation
- Effective use of labeled & unlabeled data
- Link prediction
- Closed vs. open world
SRL Applications: Link Mining

- Data
  • Structured input: mining graphs and networks
  • Structured output: extracting entities and relationships from
    unstructured data
- Making use of links
  • For ranking nodes
  • For collective classification of nodes
- Discovering links
  • Predicting missing links
  • Discovering new kinds of links and relationships
Link Mining Tasks

- Node centric
  • Labeling/ranking nodes (aka collective classification/PageRank)
  • Consolidating nodes (aka entity resolution)
  • Discovering hidden nodes (aka group discovery)
- Edge centric
  • Labeling/ranking edges
  • Predicting the existence of edges
  • Predicting the number of edges
  • Discovering new relations/paths
- Graph/subgraph centric
  • Discovering frequent subpatterns
  • Generative models
  • Metadata discovery, extraction, and reformulation

Reference: SIGKDD Explorations special issue on Link Mining, December 2005.
LINQS Group @ UMD

- Members
  • myself, Indrajit Bhattacharya, Mustafa Bilgic, Rezarta Islamaj,
    Louis Licamele, Galileo Namata, John Park, Prithviraj Sen,
    Vivek Sehgal, Elena Zheleva
- Projects
  • Link-based Classification
  • Entity Resolution (ER)
    - Algorithms
    - Query-time ER
    - User Interface
  • Predictive Models for Social Network Analysis
    - Abstraction in Affiliation Networks
    - Social Capital in Friendship Event Networks
    - Role Discovery
  • Temporal Analysis of Email Traffic Networks
  • Feature Generation for Sequences (biological data)
  • Protein Interaction Prediction (biological data)
  • Schema Mapping
Entity Resolution

- The Problem
- Relational Entity Resolution
- Algorithms
  • Graph-based Clustering (GBC)
  • Probabilistic Model (LDA-ER)
- Query-time Entity Resolution
- ER User Interface
InfoVis Co-Author Network Fragment

[Figure: before/after views of resolving references in the InfoVis
co-author network: Hus Su – Hua Su, L. Tweedie – Lisa Tweedie,
H. Dawkes – Huw Dawkes, B. Spence – Bob Spence, Bob Spence – Robert
Spence; plus the initial vs. final network.]
The Entity Resolution Problem

[Figure: references "John Smith", "Jim Smith", "J Smith", "James Smith",
"Jon Smith", "J Smith", "Jonthan Smith" mapping to the underlying
entities James Smith, John Smith, and Jonathan Smith.]

Issues:
1. Identification
2. Disambiguation
The Entity Resolution Problem

[Figure: the same references, now grouped into one cluster per entity.]

- Unsupervised clustering approach
- Number of clusters/entities unknown a priori
Attribute-based Entity Resolution

Pair-wise classification:

[Figure: pairwise similarity scores between references, e.g.
"Jim Smith" – "James Smith": 0.8; "Jon Smith" – "James Smith": 0.7;
"John Smith" – "James Smith": 0.1; "Jonthan Smith" – "James Smith": 0.05;
"J Smith" – "James Smith": ?]

Problems:
1. Inability to disambiguate
2. Choosing a threshold: precision/recall tradeoff
3. Perform transitive closure?
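These pitfalls are easy to see in code. Below is a minimal sketch of
attribute-only pairwise ER, assuming a generic string similarity
(Python's SequenceMatcher stands in for Jaro or Levenshtein) and a
hypothetical record list; it illustrates the problems above and is not
code from the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations

def name_sim(a, b):
    """Crude string similarity in [0, 1]; swap in Jaro or Levenshtein."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def attribute_er(references, threshold=0.75):
    """Pair-wise classification above a threshold, then transitive
    closure via union-find. Returns clusters of co-referent names."""
    parent = {r: r for r in references}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]   # path compression
            r = parent[r]
        return r

    for a, b in combinations(set(references), 2):
        if name_sim(a, b) >= threshold:     # pair-wise match decision
            parent[find(a)] = find(b)       # transitive closure merges chains

    clusters = {}
    for r in references:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# "J Smith" matches both the Jim and John variants, so transitive
# closure can chain distinct entities together; raising the threshold
# instead misses true matches -- the precision/recall tradeoff above.
print(attribute_er(["John Smith", "Jim Smith", "J Smith",
                    "James Smith", "Jon Smith", "Jonthan Smith"]))
```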
Relational Entity Resolution

- References are not observed independently
  • Links between references indicate relations between the entities
  • e.g., co-author relations for bibliographic data
- Use relations to improve identification and disambiguation
Relational Identification
- Very similar names; added evidence from shared co-authors

Relational Disambiguation
- Very similar names, but no shared collaborators

Collective Entity Resolution
- One resolution provides evidence for another => joint resolution

Relational Constraints
- Co-authors are typically distinct
Entity Resolution

- The Problem
- Relational Entity Resolution
- Algorithms
  • Graph-based Clustering (GBC)
  • Probabilistic Model (LDA-ER)
- Query-time Entity Resolution
- ER User Interface
Evaluation Datasets

- CiteSeer
  • Machine learning citations
  • Originally created by Lawrence et al.
  • 2,892 references to 1,165 true authors
  • 1,504 links
- arXiv HEP
  • Papers from high energy physics
  • Used for the KDD Cup '03 data cleaning challenge
  • 58,515 references to 9,200 true authors
  • 29,555 links
- BioBase
  • Biology papers on immunology and infectious diseases
  • IBM KDD Challenge dataset constructed at Cornell
  • 156,156 publications, 831,991 author references
  • Ground truth for only ~1,060 references
Comparison: Structural Difference

                                      CiteSeer   HEP    BioBase
Performance improvement               2%         4%     16%
Percentage of ambiguous references    0.5%       9%     32%
Avg. collaborators per author         2.15       4.5    —
Avg. references per author            2.5        6.4    106
Entity Resolution

- The Problem
- Relational Entity Resolution
- Algorithms
  • Graph-based Clustering (GBC)
  • Probabilistic Model (LDA-ER)
- Query-time Entity Resolution
- ER User Interface
Query-time ER: Motivation

- Most publicly available databases do not have resolved entities
  • PubMed and CiteSeer have many unresolved authors
- Millions of queries every day require resolved entities, directly or
  indirectly
  • "I am looking for all papers by Wei Wang"
- How do we address this problem?
  1. Leave the burden of resolution on the user
  2. Ask owners to 'clean' their databases
  3. Develop techniques for query-time resolution
Query-time ER using Relations

1. Use the best techniques available
   • Collective resolution using relationships
   • How can we localize collective resolution?
2. Two-phase collective resolution for a query
   • Extract a minimal set of relevant records
   • Perform collective resolution on the extracted records
Cut-based Evaluation of Relational Clustering

- Vertices embedded in attribute space
- Additional (hyper)edges represent relationships

[Figure: two candidate clusterings of the same vertices into clusters
C1–C4. Left: good separation of attributes but many cluster–cluster
relationships (C1–C3, C1–C4, C2–C4). Right: worse in terms of
attributes but fewer cluster–cluster relationships (C1–C3, C2–C4).]
A Cut-based Objective Function

  $\sum_{i,j} \big[ w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \delta(c_i, c_j) \, f(c_i, c_j) \big]$

where $w_A$ is the weight for attributes, $\mathrm{sim}_A(c_i, c_j)$
the similarity of attributes, $w_R$ the weight for relations,
$\delta(c_i, c_j) = 1$ iff a relational edge exists between $c_i$ and
$c_j$, and $f(c_i, c_j)$ the compatibility of $c_i$ and $c_j$.

Greedy clustering algorithm: merge the cluster pair with the maximum
reduction in the objective function:

  $\Delta(c_i, c_j) = w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)| \, f(c_i, c_j)$

- Similarity of attributes
  • Jaro, Levenshtein; TF-IDF
- Common cluster neighborhood
  • Jaccard works better than intersection
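As a concrete reading of the merge step, here is a small sketch,
assuming a cluster keeps a set of attribute tokens and a set of
neighboring-cluster ids; the weights, the token-based attribute
similarity, and the default compatibility function are illustrative
stand-ins, not the talk's implementation.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    attrs: set      # attribute tokens of the references in the cluster
    neighbors: set  # ids of neighboring clusters, via relational hyper-edges

def jaccard(s, t):
    """Set overlap; per the slide, works better than raw intersection."""
    return len(s & t) / len(s | t) if (s or t) else 0.0

def merge_score(ci, cj, w_a=0.5, w_r=0.5, compat=lambda ci, cj: 1.0):
    """Delta(ci, cj): greedily merge the pair with the highest score.
    compat can return 0 to enforce relational constraints such as
    'co-authors are typically distinct'."""
    attr_sim = jaccard(ci.attrs, cj.attrs)         # stand-in for Jaro/TF-IDF
    rel_sim = jaccard(ci.neighbors, cj.neighbors)  # common cluster neighborhood
    return w_a * attr_sim + w_r * rel_sim * compat(ci, cj)

# Two "W Wang" clusters that share a co-author cluster combine
# attribute evidence with relational evidence:
c1 = Cluster({"w", "wang"}, neighbors={3, 7})
c2 = Cluster({"wei", "wang"}, neighbors={3, 9})
print(merge_score(c1, c2))
```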
Extracting Relevant Records

- Start with the query name or record (e.g., "W Wang")
- Alternate between (see the sketch below):
  1. Name expansion: for any relevant record, include other records
     with that name
  2. Hyper-edge expansion: for any relevant record, include other
     related records
- Terminate at some depth k

[Figure: expansion from the query "W Wang". Level 0 (name expansion):
P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang. Level 1 (hyper-edge
expansion): P1: A Ansari, P1: C Chen, P2: A Ansari, P3: C Chen,
P3: L Li, P4: A Ansari. Level 2 (name expansion): further records for
A Ansari, C Chen, and L Li.]
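A compact way to read the expansion process: the sketch below
alternates the two steps to depth k, assuming hypothetical indexes
`by_name` (name -> record ids), `by_edge` (record id -> co-occurring
record ids, e.g. on the same paper), and `record_name` (record id ->
name); it is an interpretation of the slide, not released code.

```python
def extract_relevant(query_name, by_name, by_edge, record_name, k=2):
    """Phase 1 of query-time ER: extract a minimal relevant set;
    collective resolution then runs on it (phase 2, not shown)."""
    relevant = set(by_name.get(query_name, ()))   # level 0: name expansion
    for _ in range(k):
        # hyper-edge expansion: records related to anything already relevant
        edge_hop = {r2 for r in relevant for r2 in by_edge.get(r, ())}
        relevant |= edge_hop
        # name expansion: other records carrying the names just pulled in
        relevant |= {r2 for r in edge_hop
                     for r2 in by_name.get(record_name[r], ())}
    return relevant
```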
Adaptive Expansion for a Query

- Unconstrained expansion pulls in too many records
- Adaptive name expansion (sketched below)
  • Adaptively select records based on 'ambiguity'
    ('Chen' is more ambiguous than 'Ansari')
  • Expand the more ambiguous records: they need extra evidence
- Adaptive hyper-edge expansion
  • Add fewer ambiguous records: they lead to imprecision
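Continuing the sketch above, the adaptive variants gate both steps
with an ambiguity estimate; using a name's record count as the
ambiguity proxy and a fixed cutoff are assumptions for illustration.

```python
def ambiguity(name, by_name):
    """Proxy: names borne by many records are treated as ambiguous."""
    return len(by_name.get(name, ()))

def adaptive_name_expand(records, by_name, record_name, cutoff=5):
    # expand only the more ambiguous records: they need extra evidence
    return {r2 for r in records
            if ambiguity(record_name[r], by_name) >= cutoff
            for r2 in by_name.get(record_name[r], ())}

def adaptive_edge_expand(records, by_edge, by_name, record_name, cutoff=5):
    # add fewer ambiguous records: they mostly add imprecision
    return {r2 for r in records for r2 in by_edge.get(r, ())
            if ambiguity(record_name[r2], by_name) < cutoff}
```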
Evaluation Datasets

- arXiv High Energy Physics
  • 29,555 publications, 58,515 references to 9,200 authors
  • Queries: all ambiguous names (75 in total)
    - True authors per name: 2 to 11 (avg. 2.4)
- Elsevier BioBase
  • 156,156 publications, 831,991 author references
  • Keywords, topic classifications, language, country and affiliation
    of corresponding author, etc.
  • Queries: 100 most frequent names
    - True authors per name: 1 to 100 (avg. 32)
Query-time ER Results

Method           arXiv F1    BioBase F1
A                0.721       0.701
A*               0.778       0.687
A+N              0.956       0.710
A+N*             0.952       0.753
RC-ER Depth 1    0.964       0.813
RC-ER Depth 3    0.970       0.821

A: pair-wise attribute similarity; A+N: also neighbors' attributes;
*: transitive closure

With unconstrained expansion:
- Collective resolution is more accurate
- Accuracy improves beyond depth 1
Adaptive expansion

- Minimal loss in accuracy
- Dramatic reduction in query processing time

            relv-set size   time (secs)   accuracy (F1)
Unconstr    44,129          607           0.821
AX-2        5,510           43            0.820
AX-1        3,743           31            0.818

AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
Current ER Projects

- Entity Resolution in Geospatial Data
  • Using spatial information, location name information and location
    type information
- D-Dupe Interactive ER Tool
  • Simple user interface for entity resolution
  • Accepted to the new Visual Analytics conference
- Name Reference Resolution in Email
- Query-time ER
  • Goal: allow users to query an unresolved database
  • Adaptive strategy constructs a set of relevant references and
    performs collective resolution
  • Preliminary adaptive strategy is as accurate & 200x faster
- Ontology Alignment (work w/ Octavian Udrea)
  • Combines relational clustering with logical inference (e.g.,
    equivalence and subsumption)
  • Results in a 40% improvement in recall on 30 OWL Lite ontology pairs
Conclusion

- SRL is an exciting new research area in ML
- Entity resolution is one example of a surprisingly rich SRL topic
- You CAN have your theory and apply it too :)

Thanks!

Questions?
D-Dupe: An Interactive Tool for ER

- Tool integrates
  • entity resolution algorithms
  • a simple visual interface optimized for ER
- Case studies on bibliographic datasets
  • on two clean datasets we were quickly able to find many duplicates
  • on one dataset w/o author keys, we were able to easily clean the
    dataset to construct keys
- Currently
  • adapting the tool for database integration
    - geospatial data
    - academic genealogy
    - email archives