Statistical Relational Learning: Theory & Applications
Lise Getoor
University of Maryland, College Park

Why SRL?
Traditional statistical machine learning approaches assume:
• A random sample of homogeneous objects from single relation
Traditional ILP/relational learning approaches assume:
• No noise or uncertainty in data
Real world data sets:
• Multi-relational, heterogeneous and semi-structured
• Noisy and uncertain
Statistical Relational Learning:
• newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming
Sample Domains:
• web data, bibliographic data, epidemiological data, communication data, customer networks, collaborative filtering, trust networks, biological data, sensor networks, natural language, vision

SRL Theory
Directed Approaches
• Semantics based on Bayesian Networks
• Rule-based Directed Models
• Frame-based Directed Models
Undirected Approaches
• Semantics based on Markov Networks
• Frame-based Undirected Models
• Rule-based Undirected Models
Reference: Upcoming book on Statistical Relational Learning w/ Ben Taskar

SRL Challenges
• Modeling Logical vs. Statistical dependencies
• Feature construction
• Instances vs. Classes
• Collective Classification
• Collective Consolidation
• Effective Use of Labeled & Unlabeled Data
• Link Prediction
• Closed vs.
  Open World

SRL Applications: Link Mining
Data
• Structured Input: Mining graphs and networks
• Structured Output: Extracting entities and relationships from unstructured data
Making use of Links
• For ranking nodes
• For collective classification of nodes
Discovering Links
• Predicting missing links
• Discovering new kinds of links and relationships

Link Mining Tasks
Node Centric
• Labeling/ranking nodes (aka Collective Classification/PageRank)
• Consolidating nodes (aka Entity Resolution)
• Discovering hidden nodes (aka Group Discovery)
Edge Centric
• Labeling/ranking edges
• Predicting the existence of edges
• Predicting the number of edges
• Discovering new relations/paths
Graph/Subgraph Centric
• Discovering frequent subpatterns
• Generative models
• Metadata discovery, extraction, and reformulation
Reference: SigKDD Explorations Special Issue on Link Mining, December 2005.

LINQS Group @ UMD
Members
• myself, Indrajit Bhattacharya, Mustafa Bilgic, Rezarta Islamaj, Louis Licamele, Galileo Namata, John Park, Prithivaraj Sen, Vivek Senghal, Elena Zheleva
Projects
• Link-based Classification
• Entity Resolution (ER)
  • Algorithms
  • Query-time ER
  • User Interface
• Predictive Models for Social Network Analysis
  • Abstraction in Affiliation Networks
  • Social Capital in Friendship Event Networks
  • Role Discovery
• Temporal Analysis of Email Traffic Networks
• Feature Generation for Sequences (biological data)
• Protein Interaction Prediction (biological data)
• Schema Mapping

Entity Resolution
• The Problem
• Relational Entity Resolution
• Algorithms
  • Graph-based Clustering (GBC)
  • Probabilistic Model (LDA-ER)
• Query-time Entity Resolution
• ER User Interface

InfoVis Co-Author Network Fragment
[Figure slides, each showing the network before and after resolution]
• Hus Su – Hua Su
• L. Tweedie – Lisa Tweedie
• H. Dawkes – Huw Dawkes
• B. Spence – Bob Spence
• Bob Spence – Robert Spence
• Initial vs. Final

The Entity Resolution Problem
Entities: James Smith, John Smith, Jonathan Smith
References: “John Smith”, “Jim Smith”, “J Smith”, “James Smith”, “Jon Smith”, “J Smith”, “Jonthan Smith”
Issues:
1. Identification
2. Disambiguation
• Unsupervised clustering approach
• Number of clusters/entities unknown a priori

Attribute-based Entity Resolution
Pair-wise classification
  Reference          Compared with      Score
  “J Smith”          “James Smith”      ?
  “Jim Smith”        “James Smith”      0.8
  “J Smith”          “James Smith”      ?
  “John Smith”       “James Smith”      0.1
  “Jon Smith”        “James Smith”      0.7
  “Jonthan Smith”    “James Smith”      0.05
1. Inability to disambiguate
2. Choosing threshold: precision/recall tradeoff
3. Perform transitive closure?

Relational Entity Resolution
• References not observed independently
  • Links between references indicate relations between the entities
  • Co-author relations for bibliographic data
• Use relations to improve identification and disambiguation

Relational Identification
• Very similar names.
• Added evidence from shared co-authors

Relational Disambiguation
• Very similar names but no shared collaborators

Collective Entity Resolution
• One resolution provides evidence for another => joint resolution

Relational Constraints
• Co-authors are typically distinct

Evaluation Datasets
CiteSeer
• Machine Learning Citations
• Originally created by Lawrence et al.
• 2,892 references to 1,165 true authors
• 1,504 links
arXiv HEP
• Papers from High Energy Physics
• Used for KDD-Cup ‘03 Data Cleaning Challenge
• 58,515 references to 9,200 true authors
• 29,555 links
BioBase
• Biology papers on immunology and infectious diseases
• IBM KDD Challenge dataset constructed at Cornell
• 156,156 publications, 831,991 author references
• Ground truth for only ~1060 references

Comparison: Structural Difference
• Percentage of ambiguous references
  • 0.5% for Citeseer
  • 9% for HEP
  • 32% for BioBase
• Average number of collaborators per author
  • 2.15 for Citeseer
  • 4.5 for HEP
• Average number of references per author
  • 2.5 for Citeseer
  • 6.4 for HEP
  • 106 for BioBase
Performance
• 2% improvement for Citeseer
• 4% improvement for HEP
• 16% improvement for BioBase

Query-time ER: Motivation
• Most publicly available databases do not have resolved entities
• Millions of queries every day require resolved entities directly or indirectly
  • PubMed, CiteSeer have many unresolved authors
  • “I am looking for all papers by Wei Wang”
• How do we address this problem?
  1. Leave the burden on the user to do the resolution
  2. Ask owners to ‘clean’ their databases
  3. Develop techniques for query-time resolution

Query-time ER using Relations
1. Use best techniques available
2.
   Collective resolution using relationships
• How can we localize collective resolution?
• Two-phase collective resolution for query
  1. Extract minimal set of relevant records
  2. Collective resolution on extracted records

Cut-based Evaluation of Relational Clustering
• Vertices embedded in attribute space
• Additional (hyper)edges represent relationships
[Figure: two alternative clusterings of the same vertices into clusters C1 to C4]
• First clustering: good separation of attributes; many cluster-cluster relationships (C1-C3, C1-C4, C2-C4)
• Second clustering: worse in terms of attributes; fewer cluster-cluster relationships (C1-C3, C2-C4)

A Cut-based Objective Function
$$\sum_i \sum_j \left[ w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \delta(c_i, c_j) \right]$$
• w_A: weight for attributes
• sim_A(c_i, c_j): similarity of attributes
• w_R: weight for relations
• \delta(c_i, c_j): 1 iff relational edge exists between c_i and c_j (the slide's symbol is lost in the transcript; \delta is a stand-in)
• f(c_i, c_j): compatibility of c_i and c_j (its position in the formula is not recoverable from the transcript)

Greedy clustering algorithm: merge cluster pair with max reduction in objective function
$$\Delta(c_i, c_j) = w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)|$$
(\Delta is a stand-in for the lost left-hand symbol; the slide also shows f(c_i, c_j) next to this formula.)
Similarity of attributes
• Jaro, Levenshtein; TF-IDF
Common cluster neighborhood
• Jaccard works better than intersection

Extracting Relevant Records
• Start with query name or record
• Alternate between
  1. Name expansion: For any relevant record, include other records with that name
  2. Hyper-edge expansion: For any relevant record, include other related records
• Terminate at some depth k
[Figure: expansion from the query “W Wang”]
• Level 0 (name expansion): P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang
• Level 1 (hyper-edge expansion): P1: A Ansari, P1: C Chen, P2: A Ansari, P3: C Chen, P3: L Li, P4: A Ansari
• Level 2 (name expansion): P: A Ansari, P: A Ansari, P: C Chen, P: C Chen, P: L Li, P: L Li

Adaptive Expansion for a Query
• Too many records with unconstrained expansion
• Adaptive Name Expansion
  • Adaptively select records based on ‘ambiguity’
  • ‘Chen’ is more ambiguous than ‘Ansari’
  • Expand the more ambiguous records
  • They need extra evidence
• Adaptive Hyper-edge expansion
  • Add fewer ambiguous records
  • They lead to imprecision

Evaluation Datasets
arXiv High Energy Physics
• 29,555 publications, 58,515 refs to 9,200 authors
• Queries: All
  ambiguous names (75 in total)
  • True authors per name: 2 to 11 (avg. is 2.4)
Elsevier BioBase
• 156,156 publications, 831,991 author refs
• Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
• Queries: 100 most frequent names
  • True authors per name: 1 to 100 (avg. is 32)

Query-time ER Results
  Method           arXiv F1   BioBase F1
  A                0.721      0.701
  A*               0.778      0.687
  A+N              0.956      0.710
  A+N*             0.952      0.753
  RC-ER Depth 1    0.964      0.813
  RC-ER Depth 3    0.970      0.821
A: pair-wise attributes similarity; A+N: also neighbors’ attributes; *: transitive closure
Unconstrained expansion
• Collective resolution more accurate
• Accuracy improves beyond depth 1
Adaptive expansion
• Minimal loss in accuracy
• Dramatic reduction in query processing time
                   Unconstr   AX-2    AX-1
  relv-set size    44,129     5,510   3,743
  time (secs)      607        43      31
  accuracy (F1)    0.821      0.820   0.818
AX-2: adaptive expansion at depths 2 and beyond; AX-1: adaptive expansion even at depth 1

Current ER Projects
• Entity Resolution in Geospatial Data
  • Using spatial information, location name information and location type information
• D-Dupe Interactive ER Tool
  • Simple user-interface for entity resolution
  • Accepted to new Visual Analytics conference
• Name Reference Resolution in Email
• Goal: Allow users to query an unresolved database
• Adaptive strategy constructs set of relevant references and performs collective resolution
• Preliminary adaptive strategy as accurate & 200x faster
• Ontology Alignment (work w/ Octavian Udrea)
  • Combines relational clustering with logical inference (e.g. equivalence and subsumption)
  • Results in a 40% improvement in recall on 30 OWL lite ontology pairs

Conclusion
• SRL is an exciting new research area in ML
• Entity resolution is one example of a surprisingly rich SRL topic
• You CAN have your theory and apply it too.
Thanks! … Questions?
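
The record-extraction phase described on the "Extracting Relevant Records" slide (alternate name expansion and hyper-edge expansion, stop at depth k) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `extract_relevant`, the `(paper_id, author_name)` representation, and the use of exact string equality for name expansion are all assumptions (the slide's figure pulls in "W W Wang" at level 0, which suggests a looser name match than equality). Papers P1 to P4 follow the figure; P5 is invented so that level 2 has something to find.

```python
from collections import defaultdict


def extract_relevant(references, query_name, depth):
    """Collect references relevant to a query name.

    references: iterable of (paper_id, author_name) tuples.
    Level 0 is name expansion of the query; after that, odd levels do
    hyper-edge expansion and even levels do name expansion, up to `depth`.
    """
    by_name = defaultdict(set)   # author_name -> references with that name
    by_paper = defaultdict(set)  # paper_id -> references on that paper
    for paper, name in references:
        by_name[name].add((paper, name))
        by_paper[paper].add((paper, name))

    relevant = set(by_name.get(query_name, ()))
    frontier = set(relevant)
    for level in range(1, depth + 1):
        expanded = set()
        for paper, name in frontier:
            if level % 2 == 1:
                expanded |= by_paper[paper]  # hyper-edge expansion (co-authors)
            else:
                expanded |= by_name[name]    # name expansion (assumed: exact match)
        frontier = expanded - relevant       # only newly found references expand next
        relevant |= frontier
    return relevant


# Toy data: P1 to P4 follow the slide's figure; P5 is hypothetical.
REFERENCES = [
    ("P1", "W Wang"), ("P1", "A Ansari"), ("P1", "C Chen"),
    ("P2", "W Wang"), ("P2", "A Ansari"),
    ("P3", "W Wang"), ("P3", "C Chen"), ("P3", "L Li"),
    ("P4", "W W Wang"), ("P4", "A Ansari"),
    ("P5", "L Li"),
]

for k in range(4):
    print(k, len(extract_relevant(REFERENCES, "W Wang", k)))
```

On this toy data the relevant set grows from 3 references at depth 0 to 8, 10, and 11 at depths 1 to 3. Even here the set nearly quadruples within three levels, which is the growth that the adaptive expansion slides aim to limit.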

D-Dupe: An Interactive Tool for ER
Tool Integrates
• entity resolution algorithms
• simple visual interface optimized for ER
Case studies on bibliographic datasets
• on two clean datasets we quickly were able to find many duplicates
• on one dataset w/o author keys, we were able to easily clean dataset to construct keys
Currently adapting tool for database integration
• geospatial data
• academic genealogy
• email archives
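
As a companion to the greedy clustering slide (merge the cluster pair with the best combined attribute and neighborhood score), here is a minimal sketch of that idea under several stated assumptions. The helper `name_sim` is a crude hand-made stand-in for Jaro, Levenshtein, or TF-IDF; the weights, the stopping threshold, the single-link (max) cluster similarity, and the rule that pairs with zero name similarity never merge are all illustrative choices, not values or rules taken from the talk. The neighborhood term uses Jaccard, following the slide's remark that it works better than plain intersection.

```python
from itertools import combinations


def name_sim(a, b):
    """Crude stand-in for a string similarity on 'First Last' names."""
    first_a, last_a = a.rsplit(" ", 1)
    first_b, last_b = b.rsplit(" ", 1)
    if last_a != last_b:
        return 0.0
    if first_a == first_b:
        return 1.0
    if first_a.startswith(first_b) or first_b.startswith(first_a):
        return 0.8  # one first name is an initial or prefix of the other
    return 0.0


def jaccard(s, t):
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def resolve(refs, w_attr=0.5, w_rel=0.5, threshold=0.45):
    """Greedy relational clustering over refs = [(paper_id, name), ...].

    Returns clusters as sets of indices into refs, ordered by smallest index.
    """
    cluster_of = list(range(len(refs)))            # ref index -> cluster id
    members = {i: {i} for i in range(len(refs))}   # cluster id -> ref indices

    def neighbors(c):
        # Clusters of references that share a paper with any member of c.
        papers = {refs[r][0] for r in members[c]}
        return {cluster_of[r] for r, (p, _) in enumerate(refs) if p in papers} - {c}

    def score(ci, cj):
        attr = max(name_sim(refs[a][1], refs[b][1])
                   for a in members[ci] for b in members[cj])
        if attr == 0.0:
            return 0.0  # assumption: never merge on relational evidence alone
        return w_attr * attr + w_rel * jaccard(neighbors(ci), neighbors(cj))

    while len(members) > 1:
        best, ci, cj = max((score(a, b), a, b)
                           for a, b in combinations(sorted(members), 2))
        if best < threshold:
            break
        members[ci] |= members.pop(cj)             # merge cj into ci
        for r in members[ci]:
            cluster_of[r] = ci
    return sorted(members.values(), key=min)


# Hypothetical toy data: "Wei Wang" and "Wen Wang" are different people.
REFS = [
    ("P1", "W Wang"), ("P1", "A Ansari"),
    ("P2", "Wei Wang"), ("P2", "A Ansari"),
    ("P3", "Wen Wang"), ("P3", "L Li"),
]

print(resolve(REFS))                          # attributes plus relations
print(resolve(REFS, w_attr=1.0, w_rel=0.0))   # attributes only
```

With both terms active, the two "A Ansari" references merge first, which gives "W Wang" and "Wei Wang" a shared neighbor and lets them merge, while "Wen Wang" stays separate. With attributes only, "W Wang" links "Wei Wang" and "Wen Wang" into one cluster, which is the disambiguation failure listed on the attribute-based ER slide.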