Data Mining:
Concepts and Techniques
— Chapter 9 —
9.3. Multirelational Data Mining
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Xiaoxin Yin
Multi-Relational and Multi-DB Mining

Classification over multiple-relations in databases

Clustering over multi-relations by User-Guidance

Mining across multi-relational databases

Mining across multiple heterogeneous data and
information repositories

Summary
Multirelational Data Mining

Classification over multiple-relations in databases

Clustering over multi-relations by user-guidance

LinkClus: Efficient clustering by exploring the power law
distribution

Distinct: Distinguishing objects with identical names by
link analysis

Mining across multiple heterogeneous data and
information repositories

Summary
Outline
Theme: “Knowledge is power, but knowledge is hidden in
massive links”
 Starting with PageRank and HITS

CrossMine: Classification of multi-relations by link analysis

CrossClus: Clustering over multi-relations by user-guidance

More recent work and conclusions
Traditional Data Mining

Work on single “flat” relations

[Figure: Doctor, Patient, and Contact relations flattened into one table]

Flattening loses the information carried by linkages and relationships

Cannot utilize information in database structures or schemas
Multi-Relational Data Mining (MRDM)


Motivation
 Most structured data are stored in relational
databases
 MRDM can utilize linkage and structural
information
Knowledge discovery in multi-relational
environments
 Multi-relational rules
 Multi-relational clustering
 Multi-relational classification
 Multi-relational linkage analysis
 …
Applications of MRDM




e-Commerce: discovering patterns involving customers,
products, manufacturers, …
Bioinformatics/Medical databases: discovering patterns
involving genes, patients, diseases, …
Networking security: discovering patterns involving hosts,
connections, services, …
Many other relational data sources

Example: Evidence Extraction and Link Discovery
(EELD): a DARPA-funded project that emphasizes
multi-relational and multi-database linkage analysis
Importance of Multi-relational
Classification (from EELD Program Description)



The objective of the EELD Program is to research, develop,
demonstrate, and transition critical technology that will enable
significant improvement in our ability to detect asymmetric threats …,
e.g., a loosely organized terrorist group.
… Patterns of activity that, in isolation, are of limited significance but,
when combined, are indicative of potential threats, will need to be
learned.
Addressing these threats can only be accomplished by developing a
new level of autonomic information surveillance and analysis to
extract, discover, and link together sparse evidence from vast amounts
of data sources, in different formats and with differing types and
degrees of structure, to represent and evaluate the significance of the
related evidence, and to learn patterns to guide the extraction,
discovery, linkage and evaluation processes.
MRDM Approaches




Inductive Logic Programming (ILP)
 Find models that are coherent with background
knowledge
Multi-relational Clustering Analysis
 Clustering objects with multi-relational
information
Probabilistic Relational Models
 Model cross-relational probabilistic distributions
Efficient Multi-Relational Classification
 The CrossMine Approach [Yin et al, 2004]
Inductive Logic Programming (ILP)


Find a hypothesis that is consistent with
background knowledge (training data)
 FOIL, Golem, Progol, TILDE, …
Background knowledge
 Relations (predicates), Tuples (ground facts)
Training examples:
  Daughter(mary, ann)   +
  Daughter(eve, tom)    +
  Daughter(tom, ann)    –
  Daughter(eve, ann)    –

Background knowledge:
  Parent(ann, mary)   Female(ann)
  Parent(ann, tom)    Female(mary)
  Parent(tom, eve)    Female(eve)
  Parent(tom, ian)
Inductive Logic Programming (ILP)

Hypothesis
 The hypothesis is usually a set of rules,
which can predict certain attributes in
certain relations
 Daughter(X,Y) ← female(X), parent(Y,X)
FOIL: First-Order Inductive Learner


Find a set of rules consistent with training data
 E.g. female(X), parent(Y,X) → daughter(X,Y)
A top-down, sequential covering learner
[Figure: the set of all examples, with the examples covered by Rules 1, 2, and 3 marked as overlapping regions]

Build each rule by heuristics
 Foil gain – a special type of information gain
ILP Approaches

Top-down approaches (e.g., FOIL) — see the sketch after this list
  while (enough examples are left):
    generate a rule
    remove the examples satisfying this rule

Bottom-up approaches (e.g., Golem)
  Use each example as a rule
  Generalize rules by merging them

Decision-tree approaches (e.g., TILDE)
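As a concrete illustration of the top-down approach, here is a minimal Python sketch of the sequential-covering loop. It is only a sketch under simplifying assumptions: examples are attribute dictionaries, predicates are attribute = value tests, and rules are scored by a simple precision heuristic rather than FOIL's actual measure (foil gain, shown on a later slide).

```python
# Minimal top-down sequential-covering sketch (illustrative, not FOIL itself).

def covers(rule, ex):
    """A rule is a list of (attribute, value) tests; all must hold for the example."""
    return all(ex.get(a) == v for a, v in rule)

def learn_rules(pos, neg, candidates, min_precision=0.9, max_len=5):
    rules, remaining = [], list(pos)
    while remaining:                                     # while enough examples are left
        rule, cands = [], list(candidates)
        while cands and len(rule) < max_len:             # grow one rule
            scored = []
            for pred in cands:
                p = sum(covers(rule + [pred], e) for e in remaining)
                n = sum(covers(rule + [pred], e) for e in neg)
                if p:
                    scored.append((p / (p + n), p, pred))
            if not scored:
                break
            prec, p, best = max(scored, key=lambda t: (t[0], t[1]))
            rule.append(best)
            cands.remove(best)
            if prec >= min_precision:
                break
        if not rule:
            break
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]   # remove covered examples
    return rules
```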
ILP – Pros and Cons


Advantages
 Expressive and powerful
 Rules are understandable
Disadvantages
 Inefficient for databases with complex schemas
 Not appropriate for continuous attributes
Automatically Classifying Objects
Using Multiple Relations

Why not convert multiple relational data into a single
table by joins?




Relational databases are designed by domain experts
via semantic modeling (e.g., E-R modeling)

Indiscriminate joins may lose essential information

One universal relation is not appealing for efficiency,
scalability, or semantics preservation

Our approach to multi-relational classification:
automatically classify objects using multiple relations
An Example: Loan Applications
A customer applies for a loan; the bank asks the backend database: approve or not?
The Backend Database

Target relation Loan: each tuple has a class label indicating whether the loan is paid on time.

  Loan (loan-id, account-id, date, amount, duration, payment)
  Account (account-id, district-id, frequency, date)
  District (district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000,
            #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)
  Transaction (trans-id, account-id, date, type, operation, amount, balance, symbol)
  Order (order-id, account-id, bank-to, account-to, amount, type)
  Card (card-id, disp-id, type, issue-date)
  Disposition (disp-id, account-id, client-id, type)
  Client (client-id, birth-date, gender, district-id)

How to make decisions on loan applications?
Roadmap

Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study
Rule-based Classification
[Figure: two loan applicants with attributes such as “ever bought a house”, “just applied for a credit card”, and “lives in Chicago”; one is approved, the other rejected]
Rule Generation

Search for good predicates across multiple relations
Loan Applications (target relation; Applicants #1–#4):

  Loan ID | Account ID | Amount | Duration | Decision
  --------|------------|--------|----------|---------
  1       | 124        | 1000   | 12       | Yes
  2       | 124        | 4000   | 12       | Yes
  3       | 108        | 10000  | 24       | No
  4       | 45         | 12000  | 36       | No

Accounts:

  Account ID | Frequency | Open date | District ID
  -----------|-----------|-----------|------------
  128        | monthly   | 02/27/96  | 61820
  108        | weekly    | 09/23/95  | 61820
  45         | monthly   | 12/09/94  | 61801
  67         | weekly    | 01/01/95  | 61822

(Other relations: Orders, Districts, …)
Previous Approaches

Inductive Logic Programming (ILP)
 To build a rule



Repeatedly find the best predicate
To evaluate a predicate on relation R, first join the target relation with R
Not scalable because
  Huge search space (numerous candidate predicates)
  Evaluating each predicate is expensive: e.g., to evaluate the predicate
    Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, ‘monthly’, ?)
  one must first join the Loan relation with the Account relation

CrossMine is more scalable and more than one hundred times
faster on datasets of reasonable size
CrossMine: An Efficient and Accurate
Multi-relational Classifier




Tuple-ID propagation: an efficient and flexible
method for virtually joining relations
Confine the rule search process in promising
directions
Look-one-ahead: a more powerful search strategy
Negative tuple sampling: improve efficiency while
maintaining accuracy
Roadmap






Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study
Tuple ID Propagation
Loan (target relation; Applicants #1–#4):

  Loan ID | Account ID | Amount | Duration | Decision
  --------|------------|--------|----------|---------
  1       | 124        | 1000   | 12       | Yes
  2       | 124        | 4000   | 12       | Yes
  3       | 108        | 10000  | 24       | No
  4       | 45         | 12000  | 36       | No

Account (with propagated IDs and class labels):

  Account ID | Frequency | Open date | Propagated ID | Labels
  -----------|-----------|-----------|---------------|--------
  124        | monthly   | 02/27/93  | 1, 2          | 2+, 0–
  108        | weekly    | 09/23/97  | 3             | 0+, 1–
  45         | monthly   | 12/09/96  | 4             | 0+, 1–
  67         | weekly    | 01/01/97  | Null          | 0+, 0–

Possible predicates:
  Frequency = ‘monthly’: 2 +, 1 –
  Open date < 01/01/95: 2 +, 0 –

Propagate the tuple IDs of the target relation to non-target relations
Virtually join relations to avoid the high cost of physical joins
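A minimal sketch of the idea, using the two small tables above; the data layout and helper names are illustrative and not CrossMine's actual implementation.

```python
from collections import defaultdict

# Propagate loan IDs and class labels to Account along Loan.account-id = Account.account-id,
# without materializing the physical join.

loans = [  # (loan_id, account_id, amount, duration, label)
    (1, 124, 1000, 12, '+'), (2, 124, 4000, 12, '+'),
    (3, 108, 10000, 24, '-'), (4, 45, 12000, 36, '-'),
]
accounts = [  # (account_id, frequency, open_date)
    (124, 'monthly', '1993-02-27'), (108, 'weekly', '1997-09-23'),
    (45, 'monthly', '1996-12-09'), (67, 'weekly', '1997-01-01'),
]

prop = defaultdict(lambda: {'ids': [], 'pos': 0, 'neg': 0})
for loan_id, acct_id, *_, label in loans:
    entry = prop[acct_id]
    entry['ids'].append(loan_id)
    entry['pos' if label == '+' else 'neg'] += 1

for acct_id, freq, open_date in accounts:
    e = prop[acct_id]
    print(acct_id, freq, open_date, e['ids'], f"{e['pos']}+, {e['neg']}-")

# Any predicate on Account can now be evaluated from the propagated labels, e.g.:
pos = sum(prop[a]['pos'] for a, f, _ in accounts if f == 'monthly')
neg = sum(prop[a]['neg'] for a, f, _ in accounts if f == 'monthly')
print("Frequency = 'monthly':", f"{pos}+, {neg}-")     # 2+, 1-
```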
Tuple ID Propagation (cont.)


Efficient
 Only propagate the tuple IDs
 Time and space usage is low
Flexible
 Can propagate IDs among non-target relations
 Many sets of IDs can be kept on one relation, which are
propagated from different join paths
[Figure: IDs propagated from the target relation to R1, R2, and R3 along different join paths]
Roadmap






Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study
Overall Procedure

Sequential covering algorithm
  while (enough target tuples are left):
    generate a rule
    remove the positive target tuples satisfying this rule

[Figure: positive examples progressively covered by Rules 1, 2, and 3]
Rule Generation

To generate a rule
  while (true):
    find the best predicate p
    if foil-gain(p) > threshold then add p to the current rule
    else break

[Figure: a rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, covering more positive and fewer negative examples]
Evaluating Predicates


All predicates in a relation can be evaluated
based on propagated IDs
Use foil-gain to evaluate predicates
 Suppose current rule is r. For a predicate p,
   foil-gain(p) = P(r+p) · [ log( P(r+p) / (P(r+p) + N(r+p)) ) − log( P(r) / (P(r) + N(r)) ) ]

 where P(r) and N(r) are the numbers of positive and negative target tuples satisfying
 the current rule r, and r+p is r with predicate p appended
Categorical Attributes
 Compute foil-gain directly
Numerical Attributes
 Discretize with every possible value
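To make the formula concrete, here is a small sketch (a hypothetical helper, not CrossMine code) that evaluates the two predicates from the tuple-ID-propagation example, assuming the current rule r covers 2 positive and 2 negative loans.

```python
import math

def foil_gain(p_r, n_r, p_rp, n_rp):
    """foil-gain(p) = P(r+p) * [log(P(r+p)/(P(r+p)+N(r+p))) - log(P(r)/(P(r)+N(r)))].
    Natural log is used here; the base only rescales the ranking."""
    if p_rp == 0:
        return 0.0
    return p_rp * (math.log(p_rp / (p_rp + n_rp)) - math.log(p_r / (p_r + n_r)))

# The empty rule covers all 4 loans: 2 positive, 2 negative.
print(foil_gain(2, 2, 2, 1))   # Frequency = 'monthly' -> 2+, 1-  : about 0.575
print(foil_gain(2, 2, 2, 0))   # Open date < 01/01/95  -> 2+, 0-  : about 1.386
```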
Rule Generation



Start from the target relation
 Only the target relation is active
Repeat
 Search in all active relations
 Search in all relations joinable to active relations
 Add the best predicate to the current rule
 Set the involved relation to active
Until
 The best predicate does not have enough gain
 Current rule is too long
Rule Generation: Example
Target relation: Loan

[Figure: the backend database schema (Loan, Account, District, Transaction, Order, Card,
Disposition, Client), annotated with the first and second predicates, the range of search,
and “add the best predicate to the rule”]
Look-one-ahead in Rule Generation


Two types of relations: Entity and Relationship
Often cannot find useful predicates on relations of
relationship
No good predicate
Target
Relation

Solution of CrossMine:
 When propagating IDs to a relation of relationship,
propagate one more step to next relation of entity.
Roadmap






Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study
Negative Tuple Sampling



A rule covers some positive examples
Positive examples are removed after covered
After many rules have been generated, far fewer positive
examples remain than negative ones
[Figure: positive and negative examples scattered in the data space]
Negative Tuple Sampling (cont.)


When there are many more negative examples than positive ones
  Cannot build good rules (low support)
  Still time-consuming (large number of negative examples)
Sample the negative examples
  Improves efficiency without affecting rule quality
[Figure: the remaining examples are mostly negative, with only a few positives]
Roadmap






Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study
Synthetic datasets

[Charts: scalability w.r.t. the number of relations and w.r.t. the number of tuples]
Real Dataset


PKDD Cup 99 dataset – Loan Application

  Approach  | Accuracy | Time (per fold)
  ----------|----------|----------------
  FOIL      | 74.0%    | 3338 sec
  TILDE     | 81.3%    | 2429 sec
  CrossMine | 90.7%    | 15.3 sec

Mutagenesis dataset (4 relations)

  Approach  | Accuracy | Time (per fold)
  ----------|----------|----------------
  FOIL      | 79.7%    | 1.65 sec
  TILDE     | 89.4%    | 25.6 sec
  CrossMine | 87.7%    | 0.83 sec
Multi-Relational Classification: Summary
Classification across multiple relations


Interesting pieces of information often lie across multiple relations

It is desirable to mine across multiple interconnected relations
New methodology in CrossMine (for classification model building)


ID (and class label) propagation leads to efficiency and
effectiveness (by preserving semantics) in CrossMine

Rule generation and negative tuple sampling lead to further
improved performance

Our performance study shows orders-of-magnitude speedups and high
accuracy compared with traditional relational mining approaches

Future work: classification in heterogeneous relational databases
Multirelational Data Mining

Classification over multiple-relations in databases

Clustering over multi-relations by user-guidance

LinkClus: Efficient clustering by exploring the power law
distribution

Distinct: Distinguishing objects with identical names by
link analysis

Mining across multiple heterogeneous data and
information repositories

Summary
Multi-Relational and Multi-DB Mining

Classification over multiple-relations in databases

Clustering over multi-relations by User-Guidance

Mining across multi-relational databases

Mining across multiple heterogeneous data and
information repositories

Summary
Motivation 1: Multi-Relational Clustering
[Schema of the CS Dept database]
  Professor (name, office, position)
  Open-course (course, semester, instructor)
  Course (course-id, name, area)
  Work-In (person, group)
  Group (name, area)
  Advise (professor, student, degree)
  Publish (author, title)
  Publication (title, year, conf)
  Student (name, office, position)   ← target of clustering
  Register (student, course, semester, unit, grade)

Traditional clustering works on a single table
Most data is semantically linked across multiple relations
Thus we need information from multiple relations
Motivation 2: User-Guided Clustering
[Figure: the CS Dept schema as above; the user hint is an attribute (e.g., the research-group
area), and Student is the target of clustering]

A user usually has a goal for clustering, e.g., clustering students by
research area
The user specifies this clustering goal to CrossClus
Comparing with Classification
User hint: the user-specified feature (in the form of an attribute) is used as a
hint, not as class labels

The attribute may contain too many or too few distinct values
  E.g., a user may want to cluster students into 20 clusters instead of 3

Additional features need to be included in the cluster analysis

[Figure: all tuples for clustering]
Comparing with Semi-supervised Clustering


Semi-supervised clustering [Wagstaff, et al’ 01, Xing, et al.’02]
 User provides a training set consisting of “similar” and “dissimilar”
pairs of objects
User-guided clustering
 User specifies an attribute as a hint, and more relevant features
are found for clustering
[Figure: semi-supervised clustering uses pairwise constraints among all tuples for clustering;
user-guided clustering uses an attribute hint]
Semi-supervised Clustering



Much information (in multiple relations) is needed to judge whether
two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint, such as a
student’s research area
[Figure: two tuples to be compared (Tom Smith, SC1211, TA; Jane Chang, BI205, RA) and the
user-hint attribute]
CrossClus: An Overview




Use a new type of multi-relational feature for
clustering
Measure similarity between features by how
they cluster objects into groups
Use a heuristic method to search for pertinent
features
Use a k-medoids-based algorithm for clustering
Roadmap
1. Overview
2. Feature Pertinence
3. Searching for Features
4. Clustering
5. Experimental Results
Multi-relational Features
A multi-relational feature is defined by:

A join path. E.g., Student → Register → OpenCourse → Course

An attribute. E.g., Course.area

(For numerical feature) an aggregation operator. E.g., sum or average

Categorical Feature f = [Student → Register → OpenCourse → Course,
Course.area, null]
f(t) = the distribution of areas among the courses taken by each student

Areas of courses (counts):

  Tuple | DB | AI | TH
  ------|----|----|----
  t1    | 5  | 5  | 0
  t2    | 0  | 3  | 7
  t3    | 1  | 5  | 4
  t4    | 5  | 0  | 5
  t5    | 3  | 3  | 4

Values of feature f:

  Tuple | DB  | AI  | TH
  ------|-----|-----|----
  t1    | 0.5 | 0.5 | 0
  t2    | 0   | 0.3 | 0.7
  t3    | 0.1 | 0.5 | 0.4
  t4    | 0.5 | 0   | 0.5
  t5    | 0.3 | 0.3 | 0.4

Numerical feature, e.g., the average grade of each student
  h = [Student → Register, Register.grade, average]
  E.g., h(t1) = 3.5
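A minimal sketch of computing such a categorical feature by following the join path; the tiny relations and identifiers below are made up for illustration and are not the real CS Dept data.

```python
from collections import Counter

# Feature f = [Student -> Register -> OpenCourse -> Course, Course.area, null]

register = [('t1', 'c1'), ('t1', 'c2'), ('t2', 'c2'), ('t2', 'c3')]   # (student, open-course)
open_course = {'c1': 'db101', 'c2': 'ai201', 'c3': 'th301'}           # open-course -> course
course_area = {'db101': 'DB', 'ai201': 'AI', 'th301': 'TH'}           # course -> area

def categorical_feature(student):
    """f(student): the distribution of areas among the courses the student took."""
    areas = Counter(course_area[open_course[oc]] for s, oc in register if s == student)
    total = sum(areas.values())
    return {a: cnt / total for a, cnt in areas.items()} if total else {}

print(categorical_feature('t1'))   # {'DB': 0.5, 'AI': 0.5}
print(categorical_feature('t2'))   # {'AI': 0.5, 'TH': 0.5}
```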
Representing Features


Most important information of a feature f is how f clusters objects into
groups
f is represented by similarities between every pair of objects indicated
by f
[Chart: the similarity vector Vf — the similarity between each pair of tuples indicated by f.
The horizontal axes are the tuple indices and the vertical axis is the similarity; Vf can be
considered a vector of N x N dimensions.]
Similarity between Tuples

Categorical feature f
  Defined as the probability that t1 and t2 take the same value
  E.g., when each of them selects another course, the probability that they select
  courses of the same area

    sim_f(t1, t2) = Σ_{k=1..L} f(t1).p_k · f(t2).p_k

  E.g., sim_f(t1, t2) = 0.5*0.3 + 0.5*0.3 = 0.3

Numerical feature h

    sim_h(t1, t2) = 1 − |h(t1) − h(t2)| / σ_h,  if |h(t1) − h(t2)| ≤ σ_h
                    0,                           otherwise
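A small sketch of the two measures above, using illustrative feature values (the 0.3 result matches the arithmetic in the slide's example); this is not CrossClus code.

```python
def sim_categorical(f1, f2):
    """sim_f(t1, t2) = sum over values k of f(t1).p_k * f(t2).p_k."""
    return sum(p * f2.get(k, 0.0) for k, p in f1.items())

def sim_numerical(h1, h2, sigma_h):
    """sim_h(t1, t2) = 1 - |h(t1) - h(t2)| / sigma_h if within sigma_h, else 0."""
    d = abs(h1 - h2)
    return 1.0 - d / sigma_h if d <= sigma_h else 0.0

print(sim_categorical({'DB': 0.5, 'AI': 0.5}, {'DB': 0.3, 'AI': 0.3, 'TH': 0.4}))  # 0.3
print(sim_numerical(3.5, 3.1, sigma_h=1.0))                                        # 0.6
```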
Similarity Between Features
Values of features f (course area) and g (research group):

  Tuple | f: DB | f: AI | f: TH | g: Info sys | g: Cog sci | g: Theory
  ------|-------|-------|-------|-------------|------------|----------
  t1    | 0.5   | 0.5   | 0     | 1           | 0          | 0
  t2    | 0     | 0.3   | 0.7   | 0           | 0          | 1
  t3    | 0.1   | 0.5   | 0.4   | 0           | 0.5        | 0.5
  t4    | 0.5   | 0     | 0.5   | 0.5         | 0          | 0.5
  t5    | 0.3   | 0.3   | 0.4   | 0.5         | 0.5        | 0

Similarity between two features: the cosine similarity of their similarity vectors

    Sim(f, g) = (Vf · Vg) / (|Vf| |Vg|)

[Charts: the similarity vectors Vf and Vg over all pairs of tuples]
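A short sketch that builds Vf and Vg from the example values above and compares the two features by cosine similarity (the direct O(N²) computation; the matrix representation is an assumption for illustration).

```python
import numpy as np

f = np.array([[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]])          # course areas (DB, AI, TH)
g = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])          # groups (Info sys, Cog sci, Theory)

Vf = f @ f.T          # Vf[i, j] = sim_f(t_i, t_j); an N x N "vector"
Vg = g @ g.T

cos = (Vf * Vg).sum() / (np.linalg.norm(Vf) * np.linalg.norm(Vg))
print(round(float(cos), 3))   # Sim(f, g)
```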
Computing Feature Similarity

[Figure: the feature values of f (DB, AI, TH) and g (Info sys, Cog sci, Theory) linked
through the objects]

Similarity between feature values w.r.t. the objects:

    sim(fk, gq) = Σ_{i=1..N} f(ti).p_k · g(ti).p_q

Object (tuple-pair) similarities are hard to compute directly, but feature-value
similarities are easy to compute:

    Vf · Vg = Σ_{i=1..N} Σ_{j=1..N} sim_f(ti, tj) · sim_g(ti, tj)
            = Σ_{k=1..l} Σ_{q=1..m} sim(fk, gq)²

Compute the similarity between each pair of feature values with one scan of the data
Similarity between Categorical and Numerical Features

    Vh · Vf = 2 · Σ_{i=1..N} Σ_{j<i} sim_h(ti, tj) · sim_f(ti, tj)

Expanding sim_h and sim_f, the sum splits into parts that depend only on ti and parts
that depend on all tj with j < i, so it can be computed by scanning the objects in the
order of h.

[Figure: objects ordered by h (values 2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9) with their
feature-f values (DB, AI, TH)]
Similarity between Numerical Features


Similarity between numerical features h and g
 Suppose objects are ordered according to h
Computing Vh · Vg
  Scan the objects in the order of h
  When scanning each object t*, maintain the set of objects t with
  0 < h(t*) − h(t) < σh in a binary search tree sorted by g
  Update Vh · Vg using all t with 0 < h(t*) − h(t) < σh and |g(t) − g(t*)| < σg

[Figure: objects ordered by h (2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9); a search tree contains
the objects t with 0 < h(t*) − h(t) < σh, sorted by g(t)]
Roadmap
1. Overview
2. Feature Pertinence
3. Searching for Features
4. Clustering
5. Experimental Results
Searching for Pertinent Features

Different features convey different aspects of information
[Figure: features grouped by the aspect of information they convey — research area
(research group area, advisor, conferences of papers, number of papers), academic
performance (GPA, GRE score), and demographic info (permanent address, nationality)]
Features conveying same aspect of information usually
cluster objects in more similar ways
 research group areas
vs. conferences of publications
Given user specified feature
 Find pertinent features by computing feature similarity
Heuristic Search for Pertinent Features
Overall procedure
  1. Start from the user-specified feature
  2. Search in the neighborhood of existing pertinent features
  3. Expand the search range gradually

[Figure: the CS Dept schema, with the user hint, the target of clustering (Student), and
the numbered search steps marked on it]
Tuple ID propagation [Yin, et al.’04] is used to create multi-relational
features
 IDs of target tuples can be propagated along any join path, from
which we can find tuples joinable with each target tuple
Roadmap
1. Overview
2. Feature Pertinence
3. Searching for Features
4. Clustering
5. Experimental Results
Clustering with Multi-Relational Feature

Given a set of L pertinent features f1, …, fL, similarity
between two objects
    sim(t1, t2) = Σ_{i=1..L} sim_fi(t1, t2) · fi.weight

Weight of a feature is determined in feature search by
its similarity with other pertinent features

For clustering, we use CLARANS, a scalable k-medoids
[Ng & Han’94] algorithm
7/20/2015
Data Mining: Principles and Algorithms
60
Roadmap
1. Overview
2. Feature Pertinence
3. Searching for Features
4. Clustering
5. Experimental Results
Experiments: Compare CrossClus with



Baseline: Only use the user specified feature
PROCLUS [Aggarwal, et al. 99]: a state-of-the-art
subspace clustering algorithm
 Use a subset of features for each cluster
 We convert relational database to a table by
propositionalization
 User-specified feature is forced to be used in every
cluster
RDBC [Kirsten and Wrobel’00]
 A representative ILP clustering algorithm
 Use neighbor information of objects for clustering
 User-specified feature is forced to be used
Clustering Accuracy


To verify that CrossClus captures user’s clustering goal, we define
“accuracy” of clustering
Given a clustering task
 Manually find all features that contain information directly related to
the clustering task – standard feature set
 E.g., Clustering students by research areas
 Standard feature set: research group, group areas, advisors,
conferences of publications, course areas
 Accuracy of clustering result: how similar it is to the clustering
generated by standard feature set

degC  C ' 
n
i 1

max1 j n ' ci  c' j

n
i 1
simC , C ' 
7/20/2015

ci
deg C  C '  deg C '  C 
2
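A tiny sketch of the measure above, treating clusters as sets of tuple IDs (the two clusterings are made-up examples).

```python
def deg(C, Cprime):
    """deg(C -> C') = sum_i max_j |c_i ∩ c'_j| / sum_i |c_i|."""
    num = sum(max(len(ci & cj) for cj in Cprime) for ci in C)
    return num / sum(len(ci) for ci in C)

def clustering_sim(C, Cprime):
    return (deg(C, Cprime) + deg(Cprime, C)) / 2

C1 = [{1, 2, 3}, {4, 5}]
C2 = [{1, 2}, {3, 4, 5}]
print(clustering_sim(C1, C2))   # 0.8
```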
Measure of Clustering Accuracy

Accuracy

Measured by manually labeled data


Accuracy of clustering: Percentage of pairs of tuples in
the same cluster that share common label


We manually assign tuples into clusters according to
their properties (e.g., professors in different
research areas)
This measure favors many small clusters
We let each approach generate the same number of
clusters
CS Dept Dataset
[Chart: clustering accuracy on the CS Dept dataset for CrossClus K-Medoids, CrossClus
K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC, using the user hints Group,
Course, and Group+Course]
(Theory): J. Erickson, S. Har-Peled, L. Pitt, E. Ramos, D. Roth, M. Viswanathan
(Graphics): J. Hart, M. Garland, Y. Yu
(Database): K. Chang, A. Doan, J. Han, M. Winslett, C. Zhai
(Numerical computing): M. Heath, T. Kerkhoven, E. de Sturler
(Networking & QoS): R. Kravets, M. Caccamo, J. Hou, L. Sha
(Artificial Intelligence): G. Dejong, M. Harandi, J. Ponce, L. Rendell
(Architecture): D. Padua, J. Torrellas, C. Zilles, S. Adve, M. Snir, D. Reed, V. Adve
(Operating Systems): D. Mickunas, R. Campbell, Y. Zhou
DBLP Dataset
[Chart: clustering accuracy on the DBLP dataset for CrossClus K-Medoids, CrossClus K-Means,
CrossClus Agglm, Baseline, PROCLUS, and RDBC, using the hints Conf, Word, Coauthor,
Conf+Word, Conf+Coauthor, Word+Coauthor, and all three]
Scalability w.r.t. Data Size and # of Relations
CrossClus: Summary


User guidance, even in a very simple form,
plays an important role in multi-relational
clustering
CrossClus finds pertinent features by
computing similarities between features
Multirelational Data Mining

Classification over multiple-relations in databases

Clustering over multi-relations by user-guidance

LinkClus: Efficient clustering by exploring the power law
distribution

Distinct: Distinguishing objects with identical names by
link analysis

Mining across multiple heterogeneous data and
information repositories

Summary
Link-Based Clustering: Motivation
[Figure: a linked graph of Authors (Tom, Mike, Cathy, John, Mary), Proceedings (sigmod03–05,
vldb03–05, aaai04–05), and Conferences (sigmod, vldb, aaai)]
Questions:
Q1: How to cluster each type of objects?
Q2: How to define similarity between each type of objects?
Link-Based Similarities

Two objects are similar if they are linked with same or
similar objects
[Figure: authors linked to proceedings and conferences, illustrating objects linked with
the same or similar objects]

Jeh & Widom, 2002 – SimRank
  The similarity between two objects x and y is defined as the average similarity
  between the objects linked with x and those linked with y.
  But it is expensive to compute: for a dataset of N objects and M links, it takes
  O(N²) space and O(M²) time to compute all similarities.
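For reference, here is a minimal, unoptimized sketch following that definition; the tiny graph and the decay constant C = 0.8 are illustrative choices, not part of the slide.

```python
# Naive SimRank: s(x, y) is C times the average similarity between the neighbors of x
# and the neighbors of y.  This direct version needs O(N^2) space and O(M^2)-style work.

def simrank(neighbors, C=0.8, iters=5):
    nodes = list(neighbors)
    s = {(x, y): 1.0 if x == y else 0.0 for x in nodes for y in nodes}
    for _ in range(iters):
        new = {}
        for x in nodes:
            for y in nodes:
                if x == y:
                    new[(x, y)] = 1.0
                elif neighbors[x] and neighbors[y]:
                    total = sum(s[(a, b)] for a in neighbors[x] for b in neighbors[y])
                    new[(x, y)] = C * total / (len(neighbors[x]) * len(neighbors[y]))
                else:
                    new[(x, y)] = 0.0
        s = new
    return s

# Tiny illustrative graph: authors linked to the proceedings they published in.
neighbors = {
    'Tom': ['sigmod03', 'sigmod04'], 'Mary': ['sigmod04', 'vldb04'],
    'sigmod03': ['Tom'], 'sigmod04': ['Tom', 'Mary'], 'vldb04': ['Mary'],
}
s = simrank(neighbors)
print(round(s[('Tom', 'Mary')], 3))
```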
Observation 1: Hierarchical Structures

Hierarchical structures often exist naturally among
objects (e.g., taxonomy of animals)
[Figures: a hierarchical structure of products in Walmart (all → grocery, electronics (TV,
DVD, camera), apparel) and the relationships between articles and words (Chakrabarti,
Papadimitriou, Modha, Faloutsos, 2004)]
Observation 2: Distribution of Similarity
[Chart: the distribution of SimRank similarities among DBLP authors — portion of entries
vs. similarity value]

Power law distribution exists in similarities
 56% of similarity entries are in [0.005, 0.015]
 1.4% of similarity entries are larger than 0.1
 Our goal: Design a data structure that stores the
significant similarities and compresses insignificant ones
Our Data Structure: SimTree
Each non-leaf node
represents a group
of similar lower-level
nodes
Each leaf node
represents an object
Similarities between
siblings are stored
[Figure: a SimTree for consumer electronics — leaf objects such as “Canon A40 digital camera”
and “Sony V3 digital camera” grouped under “Digital Cameras”, alongside “TVs” and “Apparels”]
Similarity Defined by SimTree
[Figure: a SimTree with leaf nodes n7–n9 under parents n4–n6 and top-level nodes n1–n3;
the values on the edges are stored similarities between siblings and adjustment ratios]

Similarity between two sibling nodes n1 and n2 is stored directly

Path-based node similarity

    simp(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8)

Similarity between two nodes is the average similarity between the objects linked with
them in other SimTrees

Adjustment ratio for node x =
    (average similarity between x and all other nodes) /
    (average similarity between x’s parent and all other nodes)
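A minimal sketch of path-based similarity in a SimTree; the node structure, field names, and similarity values are made up for illustration and are not LinkClus's actual data structures.

```python
class SimNode:
    def __init__(self, name, parent=None, to_parent=1.0):
        self.name = name
        self.parent = parent          # parent SimNode (None at the top level)
        self.to_parent = to_parent    # adjustment ratio s(node, parent)
        self.sibling_sim = {}         # stored similarities to sibling nodes

def set_sibling_sim(a, b, s):
    a.sibling_sim[b.name] = s
    b.sibling_sim[a.name] = s

def path_sim(x, y):
    """sim_p(x, y) = s(x, x.parent) * s(x.parent, y.parent) * s(y.parent, y)."""
    return x.to_parent * x.parent.sibling_sim[y.parent.name] * y.to_parent

n4, n5 = SimNode('n4'), SimNode('n5')
set_sibling_sim(n4, n5, 0.2)
n7 = SimNode('n7', parent=n4, to_parent=0.9)
n8 = SimNode('n8', parent=n5, to_parent=0.9)
print(path_sim(n7, n8))   # 0.9 * 0.2 * 0.9 = 0.162
```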
Overview of LinkClus

Initialize a SimTree for objects of each type

Repeat

For each SimTree, update the similarities between its
nodes using similarities in other SimTrees


Adjust the structure of each SimTree

Similarity between two nodes x and y is the average
similarity between objects linked with them
Assign each node to the parent node that it is most
similar to
Initialization of SimTrees


Initializing a SimTree
 Repeatedly find groups of tightly related nodes, which
are merged into a higher-level node
Tightness of a group of nodes
 For a group of nodes {n1, …, nk}, its tightness is
defined as the number of leaf nodes in other SimTrees
that are connected to all of {n1, …, nk}
[Figure: nodes n1 and n2 connected to leaf nodes 1–5 in another SimTree; both are connected
to three of them, so the tightness of {n1, n2} is 3]
(continued)

Finding tight groups
Reduced to frequent pattern mining: the tightness of a group of nodes is the support
of a frequent pattern

[Figure: leaf nodes n1–n4 (in groups g1 and g2) and the transactions of connected leaf
nodes in another SimTree]

  Transaction | Nodes
  ------------|--------------
  1           | {n1}
  2           | {n1, n2}
  3           | {n2}
  4           | {n1, n2}
  5           | {n1, n2}
  6           | {n2, n3, n4}
  7           | {n4}
  8           | {n3, n4}
  9           | {n3, n4}
Procedure of initializing a tree
 Start from leaf nodes (level-0)
 At each level l, find non-overlapping groups of similar
nodes with frequent pattern mining
Updating Similarities Between Nodes


The initial similarities can seldom capture the relationships
between objects
Iteratively update similarities
 Similarity between two nodes is the average similarity
between objects linked with them
[Figure: two SimTrees ST1 and ST2; sim(na, nb) is the average similarity between the objects
linked with na and those linked with nb, which takes O(3×2) time in this example]
Aggregation-Based Similarity Computation
[Figure: in ST2, leaf nodes 10–12 link to a (in ST1) through n4, and leaf nodes 13–14 link
to b through n5; s(n4, n5) = 0.2, and the leaf-to-parent similarities are 0.9, 1.0, 0.8 and
0.9, 1.0]

For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is
simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl).

    sim(na, nb) = ( Σ_{k=10..12} s(nk, n4) / 3 ) · s(n4, n5) · ( Σ_{l=13..14} s(nl, n5) / 2 )
                = 0.171

This takes O(3+2) time. After aggregation, the quadratic-time computation is reduced to
linear time.
Computing Similarity with Aggregation
sim(na, nb) can be computed from aggregated (average similarity, total weight) pairs

[Figure: a links to leaves 10–12 under n4 with aggregated (average similarity, weight)
= (0.9, 3); b links to leaves 13–14 under n5 with (0.95, 2); s(n4, n5) = 0.2]
sim(na, nb) = avg_sim(na,n4) x s(n4, n5) x avg_sim(nb,n5)
= 0.9 x 0.2 x 0.95 = 0.171
To compute sim(na,nb):



Find all pairs of sibling nodes ni and nj, so that na linked with ni and nb
with nj.
Calculate similarity (and weight) between na and nb w.r.t. ni and nj.
Calculate weighted average similarity between na and nb w.r.t. all such
pairs.
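A small sketch of the aggregation step just described, using the numbers from the figure; the helper names are illustrative, not LinkClus's implementation.

```python
def aggregate(leaf_sims):
    """Aggregate a node's similarities to the leaves under one parent: (avg_sim, weight)."""
    return sum(leaf_sims) / len(leaf_sims), len(leaf_sims)

def sim_via_aggregation(agg_a, s_parents, agg_b):
    """sim(na, nb) = avg_sim(na, n4) * s(n4, n5) * avg_sim(nb, n5)."""
    return agg_a[0] * s_parents * agg_b[0]

agg_a = aggregate([0.9, 1.0, 0.8])     # na's links to leaves 10-12 under n4 -> (0.9, 3)
agg_b = aggregate([0.9, 1.0])          # nb's links to leaves 13-14 under n5 -> (0.95, 2)
print(sim_via_aggregation(agg_a, 0.2, agg_b))   # 0.9 * 0.2 * 0.95 ≈ 0.171
```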
Adjusting SimTree Structures
[Figure: a SimTree with nodes n1–n9; node n7 is more similar to its parent n4’s sibling n5,
so n7 is moved to become a child of n5]
After similarity changes, the tree structure also needs to be
changed
 If a node is more similar to its parent’s sibling, then move
it to be a child of that sibling
 Try to move each node to its parent’s sibling that it is
most similar to, under the constraint that each parent
node can have at most c children
Complexity
For two types of objects, N of each, and M linkages between them:

                              | Time           | Space
  ----------------------------|----------------|-------
  Updating similarities       | O(M (logN)²)   | O(M+N)
  Adjusting tree structures   | O(N)           | O(N)
  LinkClus (overall)          | O(M (logN)²)   | O(M+N)
  SimRank                     | O(M²)          | O(N²)
Empirical Study



Generating clusters using a SimTree
 Suppose K clusters are to be generated
 Find a level in the SimTree that has number of nodes
closest to K
 Merging most similar nodes or dividing largest nodes
on that level to get K clusters
Accuracy
 Measured by manually labeled data
 Accuracy of clustering: Percentage of pairs of objects
in the same cluster that share common label
Efficiency and scalability
 Scalability w.r.t. number of objects, clusters, and
linkages
Experiment Setup


DBLP dataset: 4170 most productive authors, and 154 well-known
conferences with most proceedings
 Manually labeled research areas of 400 most productive authors
according to their home pages (or publications)
 Manually labeled areas of 154 conferences according to their call for
papers
Approaches Compared:
 SimRank (Jeh & Widom, KDD 2002)
 Computing pair-wise similarities
 SimRank with FingerPrints (F-SimRank)
 Fogaras & R´acz, WWW 2005
 pre-computes a large sample of random paths from each object
and uses samples of two objects to estimate SimRank similarity
 ReCom (Wang et al. SIGIR 2003)
 Iteratively clustering objects using cluster labels of linked objects
Accuracy
[Charts: accuracy vs. number of iterations for LinkClus, SimRank, ReCom, and F-SimRank,
on conferences and on authors]

  Approach  | Accr-Author | Accr-Conf | Average time
  ----------|-------------|-----------|-------------
  LinkClus  | 0.957       | 0.723     | 76.7
  SimRank   | 0.958       | 0.760     | 1020
  ReCom     | 0.907       | 0.457     | 43.1
  F-SimRank | 0.908       | 0.583     | 83.6
(continued)
[Charts: accuracy vs. running time (log scale) for LinkClus, SimRank, ReCom, F-SimRank,
and P-SimRank, on authors and on conferences]
Accuracy vs. Running time
 LinkClus is almost as accurate as SimRank (most
accurate), and is much more efficient
Email Dataset


F. Nielsen. Email dataset.
http://www.imm.dtu.dk/∼rem/data/Email-1431.zip
370 emails on conferences, 272 on jobs, and 789 spam
emails
  Approach  | Accuracy | Total time (sec)
  ----------|----------|-----------------
  LinkClus  | 0.8026   | 1579.6
  SimRank   | 0.7965   | 39160
  ReCom     | 0.5711   | 74.6
  F-SimRank | 0.3688   | 479.7
  CLARANS   | 0.4768   | 8.55
Scalability (1)
Tested on synthetic datasets, with randomly generated
clusters
Scalability w.r.t. number of objects
 Number of clusters is fixed (40)

[Charts: running time (with O(N), O(N (logN)²), and O(N²) reference curves) and accuracy
vs. number of objects per relation, for LinkClus, SimRank, ReCom, and F-SimRank]
Scalability (2)

Scalability w.r.t. number of objects & clusters
 Each cluster has fixed size (100 objects)
[Charts: running time and accuracy vs. number of objects per relation (each cluster fixed
at 100 objects), for LinkClus, SimRank, ReCom, and F-SimRank, with O(N), O(N (logN)²),
and O(N²) reference curves]
Scalability (3)
Scalability w.r.t. number of linkages from each object

[Charts: running time and accuracy vs. selectivity (number of linkages per object), for
LinkClus, SimRank, ReCom, and F-SimRank, with O(S) and O(S²) reference curves]
Multirelational Data Mining

Classification over multiple-relations in databases

Clustering over multi-relations by user-guidance

LinkClus: Efficient clustering by exploring the power law
distribution

Distinct: Distinguishing objects with identical names by
link analysis

Mining across multiple heterogeneous data and
information repositories

Summary
People/Objects Do Share Names

Why distinguish objects with identical names?

Different objects may share the same name



In AllMusic.com, 72 songs and 3 albums
named “Forgotten” or “The Forgotten”
In DBLP, 141 papers are written by at least 14
“Wei Wang”
How to distinguish the authors of the 141
papers?
[Figure: DBLP papers written by four different authors named “Wei Wang” — e.g., “Wei Wang,
Jiong Yang, Richard Muntz” (VLDB 1997), “Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu”
(SIGMOD), “Wei Wang, Xuemin Lin”, “Wei Wang, Jian Pei, Jiawei Han” (CIKM 2002),
“Aidong Zhang, Yuqing Song, Wei Wang” (WWW 2003), and others — grouped into clusters (1)–(4)]

(1) Wei Wang at UNC                  (2) Wei Wang at UNSW, Australia
(3) Wei Wang at Fudan Univ., China   (4) Wei Wang at SUNY Buffalo
Challenges of Object Distinction

Related to duplicate detection, but




Textual similarity cannot be used
Different references appear in different contexts (e.g.,
different papers), and thus seldom share common
attributes
Each reference is associated with limited information
We need to carefully design an approach and use all
information we have
Overview of DISTINCT

Measure similarity between references

Linkages between references


Neighbor tuples of each reference


As shown by self-loop property, references to the
same object are more likely to be connected
Can indicate similarity between their contexts
References clustering

Group references according to their similarities
Similarity 1: Link-based Similarity



Indicate the overall strength of connections between two
references
We use random walk probability between the two tuples
containing the references
Random walk probabilities along different join paths are
handled separately


Because different join paths have different semantic
meanings
Only consider join paths of length at most 2L (L is the
number of steps of propagating probabilities)
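A minimal sketch of a random-walk probability along one join path (paper → author → paper) with uniform transition probabilities; the data, the made-up paper key sigmod02/wwyy, and the helper names are illustrative, not DISTINCT's actual propagation code.

```python
from collections import defaultdict

author_of = {                       # paper -> list of author references
    'vldb/wangym97': ['Wei Wang', 'Jiong Yang', 'Richard Muntz'],
    'sigmod02/wwyy': ['Haixun Wang', 'Wei Wang', 'Jiong Yang', 'Philip S. Yu'],
}
papers_of = defaultdict(list)       # author -> list of papers
for paper, authors in author_of.items():
    for a in authors:
        papers_of[a].append(paper)

def walk_prob(src_paper, dst_paper):
    """Probability of walking src_paper -> some author -> dst_paper."""
    prob = 0.0
    for a in author_of[src_paper]:
        p_author = 1.0 / len(author_of[src_paper])        # pick an author uniformly
        p_back = papers_of[a].count(dst_paper) / len(papers_of[a])
        prob += p_author * p_back
    return prob

print(walk_prob('vldb/wangym97', 'sigmod02/wwyy'))   # 2 authors * (1/3 * 1/2) = 1/3
```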
Example of Random Walk
Publish
Authors
1.0 vldb/wangym97
Wei Wang
0.5 vldb/wangym97
Jiong Yang
0.5
Jiong Yang
0.5 Richard Muntz
0.5 vldb/wangym97 Richard Muntz
Publications
1.0 vldb/wangym97
STING: A Statistical Information Grid
Approach to Spatial Data Mining
vldb/vldb97
Proceedings
vldb/vldb97 Very Large Data Bases 1997 Athens, Greece
7/20/2015
Data Mining: Principles and Algorithms
1.0
98
Similarity 2: Neighborhood Similarity

Find the neighbor tuples of each reference


Weights of neighbor tuples



Neighbor tuples within L joins
Different neighbor tuples have different connections to
a reference
Assign each neighbor tuple a weight, which is the
probability of walking from the reference to this tuple
Similarity: Set resemblance between two sets of neighbor
tuples
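A small sketch of set resemblance between two weighted neighbor sets. The Jaccard-style weighting below is one common formulation chosen for illustration; DISTINCT's exact weighting is defined in the paper, and the example weights are made up.

```python
def set_resemblance(nb1, nb2):
    """Weighted resemblance between two {neighbor_tuple: weight} dictionaries."""
    keys = set(nb1) | set(nb2)
    inter = sum(min(nb1.get(k, 0.0), nb2.get(k, 0.0)) for k in keys)
    union = sum(max(nb1.get(k, 0.0), nb2.get(k, 0.0)) for k in keys)
    return inter / union if union else 0.0

ref1 = {'Jiong Yang': 0.5, 'Richard Muntz': 0.3, 'VLDB': 0.2}
ref2 = {'Jiong Yang': 0.4, 'Philip S. Yu': 0.4, 'SIGMOD': 0.2}
print(set_resemblance(ref1, ref2))   # 0.4 / 1.6 = 0.25
```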
Training with the Same Data Set

Build a training set automatically




Select distinct names, e.g., Johannes Gehrke
The collaboration behavior within the same community
share some similarity
Training parameters using a typical and large set of
“unambiguous” examples
Use SVM to learn a model for combining different join
paths


Each join path is used as two attributes (with link-based similarity and neighborhood similarity)
The model is a weighted sum of all attributes
Clustering References

Why choose agglomerative hierarchical clustering
methods?

We do not know number of clusters (real entities)

We only know similarity between references

Equivalent references can be merged into a cluster,
which represents a single entity
How to Measure Similarity between Clusters?

Single-link (highest similarity between points in two
clusters)?


Complete-link (minimum similarity between them)?


No, because references to different objects can be
connected.
No, because references to the same object may be
weakly connected.
Average-link (average similarity between points in two
clusters)?

A better measure
Problem with Average-link
C2
C1
C3



C2 is close to C1, but their average similarity is low
We use collective random walk probability: Probability of
walking from one cluster to another
Final measure:
Average neighborhood similarity and Collective random
walk probability
Clustering Procedure

Procedure
 Initialization: Use each reference as a cluster
 Keep finding and merging the most similar pair of
clusters
 Until no pair of clusters is similar enough
Efficient Computation


In agglomerative hierarchical clustering, one needs to
repeatedly compute similarity between clusters
 When merging clusters C1 and C2 into C3, we need
to compute the similarity between C3 and any other
cluster
 Very expensive when clusters are large
We invent methods to compute similarity incrementally
 Neighborhood similarity

Random walk probability
Experimental Results



Distinguishing references to authors in DBLP
Accuracy of reference clustering
 True positive: Number of pairs of references to same
author in same cluster
 False positive: Different authors, in same cluster
 False negative: Same author, different clusters
 True negative: Different authors, different clusters
Measures
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
f-measure = 2*precision*recall / (precision+recall)
Accuracy = TP/(TP+FP+FN)
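A one-function sketch of the measures above, where TP, FP, and FN are counts of reference pairs (the example counts are made up).

```python
def evaluate(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = tp / (tp + fp + fn)
    return precision, recall, f_measure, accuracy

print(evaluate(tp=80, fp=10, fn=20))   # roughly (0.889, 0.8, 0.842, 0.727)
```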
Accuracy on Synthetic Tests



Select 1000 authors with at least 5 papers
Merge G (G=2 to 10) authors into one group
Use DISTINCT to distinguish each group of references
[Charts: accuracy vs. min-sim, and precision vs. recall, for group sizes G = 2, 4, 6, 8, 10]
Compare with “Existing Approaches”


Random walk and neighborhood similarity have been used
in duplicate detection
We combine them with our clustering approaches for
comparison
[Charts: max accuracy and max f-measure vs. group size for DISTINCT, unsupervised set
resemblance, unsupervised random walk, and the combined measure]
Real Cases
  Name               | #author | #ref | Accuracy | Precision | Recall | F-measure
  -------------------|---------|------|----------|-----------|--------|----------
  Hui Fang           | 3       | 9    | 1.0      | 1.0       | 1.0    | 1.0
  Ajay Gupta         | 4       | 16   | 1.0      | 1.0       | 1.0    | 1.0
  Joseph Hellerstein | 2       | 151  | 0.81     | 1.0       | 0.81   | 0.895
  Rakesh Kumar       | 2       | 36   | 1.0      | 1.0       | 1.0    | 1.0
  Michael Wagner     | 5       | 29   | 0.395    | 1.0       | 0.395  | 0.566
  Bing Liu           | 6       | 89   | 0.825    | 1.0       | 0.825  | 0.904
  Jim Smith          | 3       | 19   | 0.829    | 0.888     | 0.926  | 0.906
  Lei Wang           | 13      | 55   | 0.863    | 0.92      | 0.932  | 0.926
  Wei Wang           | 14      | 141  | 0.716    | 0.855     | 0.814  | 0.834
  Bin Yu             | 5       | 44   | 0.658    | 1.0       | 0.658  | 0.794
  Average            |         |      | 0.81     | 0.966     | 0.836  | 0.883
Real Cases: Comparison
[Chart: accuracy and f-measure for DISTINCT, supervised set resemblance, supervised random
walk, unsupervised combined measure, unsupervised set resemblance, and unsupervised random
walk]
Distinguishing Different “Wei Wang”s
[Figure: the distinguished “Wei Wang”s and their reference counts — UNC-CH (57), Fudan U,
China (31), UNSW, Australia (19), SUNY Buffalo (5), Harbin U, China (5), NU Singapore (5),
Zhejiang U, China (3), Nanjing Normal, China (3), Beijing Polytech (3), SUNY Binghamton (2),
Ningbo Tech, China (2), Purdue (2), Chongqing U, China (2), and Beijing U Com, China; a few
references are shared between clusters]
Scalability

Agglomerative hierarchical clustering takes quadratic time
 Because it requires computing pair-wise similarity
[Chart: running time (seconds) vs. number of references]
Multirelational Data Mining

Classification over multiple-relations in databases

Clustering over multi-relations by user-guidance

LinkClus: Efficient clustering by exploring the power law
distribution

Distinct: Distinguishing objects with identical names by
link analysis

Mining across multiple heterogeneous data and
information repositories

Summary
Mining Across Multiple Databases


A. Doan, P. Domingos, and A. Halevy, “Reconciling
Schemas of Disparate Data Sources: A Machine Learning
Approach”, SIGMOD'01
Utilize correlations between different attributes of objects
to be matched


E.g. rating on IMDB and rating on Ebert will not differ
much
Use profilers to specify requirements on such correlations


Manually specified
Classification model (such as Bayesian networks) from
training data (matched and non-matched pairs)
Mining Across Multiple Databases

R. Ananthakrishna, S. Chaudhuri, V. Ganti, “Eliminating
Fuzzy Duplicates in Data Warehouses”, VLDB’02

Use related objects in a hierarchy
Joplin → Missouri (or MO) → USA (or United States)
Vancouver → BC (or British Columbia) → Canada (or CA)

When matching two objects, consider related objects
in the hierarchy

This approach can only be used when such a hierarchy
is present
Mining Across Multiple Databases

I. Bhattacharya and L. Getoor (DMKD’04)

Deduplicating objects (tuples) in a database


Use attributes of tuples (as in many previous papers)
Use references to a tuple (a reference is actually a join
to the tuple)



E.g. J. Ullman is referenced by the Author relation
because he wrote papers
When matching two objects, compute their similarity by
their attributes and references
Match objects iteratively, because a pair of matched
objects may lead to more matched objects
Multirelational Data Mining

Classification over multiple-relations in databases

Clustering over multi-relations by user-guidance

LinkClus: Efficient clustering by exploring the power law
distribution

Distinct: Distinguishing objects with identical names by
link analysis

Mining across multiple heterogeneous data and
information repositories

Summary
Summary

Knowledge is power, but knowledge is hidden in massive
links

More stories than Web page rank and search

CrossMine: Classification of multi-relations by link analysis

CrossClus: Clustering over multi-relations by user-guidance



LinkClus: Efficient clustering by exploring the power law
distribution
Distinct: Distinguishing objects with identical names by link
analysis
Much more to be explored!
References (1)
H. Blockeel, L. De Raedt and J. Ramon. Top-down induction of logical decision trees.
ICML’98.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 1998.
L. Dehaspe and H. Toivonen. Discovery of Relational Association Rules. In Relational
Data Mining, Springer-Verlag, 2000.
S. Dzeroski. Multi-relational data mining: an introduction. KDD Explorations, July 2003.
W. Emde, D. Wettschereck. Relational instance-based learning. ICML, 1996.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of
relational structure. ICML’01.
H. A. Leiva. MRDTL: a multi-relational decision tree learning algorithm. M.S. thesis,
Iowa State U., 2002.
T. Mitchell. Machine Learning. McGraw Hill, 1996.
S. Muggleton. Inverse Entailment and Progol. New Generation Computing, Special
issue on Inductive Logic Programming, 1995.
S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of First Conf.
on Algorithmic Learning Theory, Tokyo, Japan, 1990.
M. Kirsten, S. Wrobel. Relational distance-based clustering. ILP, 1998.
References (2)
M. Kirsten, S. Wrobel. Extending k-means clustering to first-order representations. ILP, 2000.
S. Muggleton. Inverse Entailment and Progol. New Generation Computing, Special issue on
Inductive Logic Programming, 1995.
S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of First Conf. on
Algorithmic Learning Theory, Tokyo, Japan, 1990
A. Popescul, L. Ungar, S. Lawrence, and M. Pennock. Towards Structural Logistic Regression:
Combining Relational and Statistical Learning. In Proc. of Multi-Relational Data Mining Workshop,
Alberta, Canada, 2002.
B. Taskar, E. Segal, and D. Koller. Probabilistic Classification and Clustering in Relational Data.
IJCAI’2001
X. Yin, J. Han, J. Yang, and P. S. Yu, “CrossMine: Efficient Classification across Multiple Database
Relations”, ICDE'04
X. Yin, J. Han, and P. S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05
X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”,
VLDB'06
X. Yin, J. Han, and P. S. Yu, “Object Distinction: Distinguishing Objects with Identical Names by Link
Analysis”, ICDE'07
E. P. Xing, A. Y. Ng, M. I. Jordan, S. Russell. Distance metric learning, with application to clustering
with side-information. NIPS, 2002