Transcript Slide 1

School of Computer Science
Carnegie Mellon University
National Taiwan University of Science & Technology
Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms
Danai Koutra
U Kang
Hsing-Kuo Kenneth Pao
Tai-You Ke
Duen Horng (Polo) Chau
Christos Faloutsos
ECML PKDD, 5-9 September 2011, Athens, Greece
Problem Definition: GBA techniques
Given: a graph with N nodes & M edges; a few labeled nodes
Find: the class (red/green) for the rest of the nodes
Assuming: network effects (homophily/heterophily)
© Danai Koutra - PKDD'11
Homophily and Heterophily
Step 1: All methods handle homophily, BUT not all methods handle heterophily.
Step 2: The proposed method does!
Why do we study these methods?
Motivation (1): Law Enforcement
[Tong+ ’06][Lin+ ‘04][Chen+ ’11]…
Motivation (2): Cyber Security
(figure callouts: bot; victims? botnet members?)
[Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ‘11]…
Motivation (3): Fraud Detection
(figure callouts: fraudster; fraudsters? lax controls?)
[Neville+ ‘05][Chau+ ’07][McGlohon+ ’09]…
Motivation (4): Ranking
[Brin+ ‘98][Tong+ ’06][Ji+ ‘11]…
Our Contributions
Theory
☐ correspondence: BP ≈ RWR ≈ SSL
☐ linearization for BP
☐ convergence criteria for linearized BP
Practice
☐ FABP algorithm: fast, accurate, and scalable
☐ Experiments on DBLP, Web, and Kronecker graphs
Roadmap
Background
Belief Propagation
Random Walk with Restarts
Semi-supervised Learning
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Background
Apologies for the diversion…
Background 1: Belief Propagation (BP)
• Iterative message-based method
• “Propagation matrix”: entry = probability of the class of the “receiver” given the class of the “sender”
• Homophily: large values on the diagonal, e.g. homophily factor hh = 0.4
• Heterophily: large values off the diagonal, e.g. hh = -0.4
• Usually “about-half”: homophily factor h, with hh = h - 0.5
(1st round, 2nd round, … until the stop criterion is fulfilled)
Background 1: Belief Propagation Equations

m_ij(x_j) = Σ_{x_i} φ_i(x_i) · ψ_ij(x_i, x_j) · Π_{n∈N(i)\j} m_ni(x_i)

b_i(x_i) = η · φ_i(x_i) · Π_{j∈N(i)} m_ji(x_i)

[Pearl ‘82][Yedidia+ ‘02]…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
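The two update equations above can be sketched in code; the adjacency-list format, the toy names `adj`/`phi`/`psi`, and the fixed iteration count are my own illustration, not the talk's implementation:

```python
import numpy as np

def belief_propagation(adj, phi, psi, n_iter=20):
    """Sum-product BP sketch. adj: dict node -> list of neighbors,
    phi: (n, k) prior beliefs, psi: (k, k) propagation matrix."""
    # msg[(i, j)]: message node i sends to neighbor j, one entry per class
    msg = {(i, j): np.ones(phi.shape[1]) for i in adj for j in adj[i]}
    for _ in range(n_iter):
        new = {}
        for (i, j) in msg:
            # product of prior and incoming messages to i, excluding j's
            prod = phi[i].copy()
            for k in adj[i]:
                if k != j:
                    prod *= msg[(k, i)]
            m = psi.T @ prod              # sum over x_i of psi(x_i, x_j) * prod(x_i)
            new[(i, j)] = m / m.sum()     # normalize for numerical stability
        msg = new
    beliefs = phi.copy()
    for i in adj:
        for k in adj[i]:
            beliefs[i] *= msg[(k, i)]     # b_i ~ phi_i * product of incoming messages
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```

With a homophily propagation matrix (large diagonal), a labeled node pulls its unlabeled neighbors toward its own class.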
Background 2: Semi-Supervised Learning
• graph-based SSL
• uses few labeled data & exploits neighborhood information

[I + a(D - A)] x = y

[Zhou ‘06][Ji, Han ’10]…
(figure: STEP 1 → STEP 2, labeled scores such as 0.8 and -0.3 propagating to unlabeled “?” nodes, e.g. 0.6, -0.1)
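Concretely, graph-based SSL reduces to one linear solve; a minimal dense sketch (the toy graph and the value of `a` are my choices, not from the talk):

```python
import numpy as np

def ssl_labels(A, y, a=0.5):
    """Graph-based SSL sketch: solve [I + a(D - A)] x = y.
    A: (n, n) adjacency matrix; y: priors (+1 / -1 labeled, 0 unlabeled)."""
    D = np.diag(A.sum(axis=1))           # degree matrix
    return np.linalg.solve(np.eye(len(y)) + a * (D - A), y)
```

On a 3-node chain with the endpoints labeled +1 and -1, the middle node lands at 0, as homophily suggests.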
Background 3: Personalized Random Walk with Restarts (RWR)

[I - c A D⁻¹] x = (1 - c) y

[Brin+ ’98][Haveliwala ’03][Tong+ ‘06][Minkov, Cohen ‘07]…
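RWR is likewise a single linear solve; sketch below (the toy seed vector and the value of `c` are mine; `c` plays the role of 1 minus the restart probability):

```python
import numpy as np

def rwr_scores(A, y, c=0.85):
    """Personalized RWR sketch: solve [I - c A D^{-1}] x = (1 - c) y.
    A: (n, n) adjacency; y: restart (seed) distribution."""
    W = A / A.sum(axis=0)                # column-normalized: A D^{-1}
    return np.linalg.solve(np.eye(len(y)) - c * W, (1 - c) * y)
```

Since A D⁻¹ is column-stochastic, the scores sum to 1, i.e. they form a probability distribution over the nodes.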
Qualitative Comparison of GBA Methods

GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FABP         ✓             ✓             ✓
Roadmap
Previous work:
Background
New work:
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Linearized BP (DETAILS!)

Theorem [Koutra+]: BP is approximated by the linear system

[I + aD - c’A] b_h = φ_h

where A is the adjacency matrix, D the degree matrix (d1, …, dn on the diagonal), φ_h the prior (“about-half”) beliefs, b_h the final beliefs, and a, c’ scalar constants.

Sketch of proof:
• odds ratio p_r = p / (1 - p)
• Maclaurin expansions
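Spelled out, the linear system uses only the “about-half” quantities; the closed forms for the scalar constants below are the ones derived in the FaBP paper (restated here from memory, so treat them as a sketch):

```latex
% about-half quantities: b_h = b - \tfrac12,\quad \varphi_h = \varphi - \tfrac12,\quad h_h = h - \tfrac12
\left[\, \mathbf{I} + a\mathbf{D} - c'\mathbf{A} \,\right] \mathbf{b}_h = \boldsymbol{\varphi}_h,
\qquad
a = \frac{4h_h^2}{1 - 4h_h^2},
\qquad
c' = \frac{2h_h}{1 - 4h_h^2}.
```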
Linearized BP vs BP

Original [Yedidia+]: Belief Propagation (non-linear)
m_ij(x_j) = Σ_{x_i} φ_i(x_i) · ψ_ij(x_i, x_j) · Π_{n∈N(i)\j} m_ni(x_i)
b_i(x_i) = η · φ_i(x_i) · Π_{j∈N(i)} m_ji(x_i)

Our proposal: Linearized BP (linear); BP is approximated by
[I + aD - c’A] b_h = φ_h
Our Contributions
Theory
☐ correspondence: BP ≈ RWR ≈ SSL
✓ linearization for BP
☐ convergence criteria for linearized BP
Practice
☐ FABP algorithm: fast, accurate, and scalable
☐ Experiments on DBLP, Web, and Kronecker graphs
Linearized BP: convergence (DETAILS!)

Theorem: Linearized BP converges if
hh ≤ function(d_11, …, d_nn)
where d_nn is the degree of node n.

Sketch of proof: require that the 1-norm < 1 OR the Frobenius norm < 1.
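A numerical check of the two bounds might look like this (my sketch; it uses the constants a, c’ from the FaBP linearization, which are an assumption here, and tests whether the power series for the matrix inverse converges):

```python
import numpy as np

def fabp_converges(A, hh):
    """Linearized-BP convergence check (sketch): the series for
    [I + aD - c'A]^{-1} converges if ||c'A - aD|| < 1 in some
    sub-multiplicative matrix norm."""
    a = 4 * hh**2 / (1 - 4 * hh**2)      # constants assumed from the linearization
    c = 2 * hh / (1 - 4 * hh**2)
    M = c * A - a * np.diag(A.sum(axis=1))
    norm_1 = np.abs(M).sum(axis=0).max() # induced 1-norm (max column sum)
    frob = np.sqrt((M ** 2).sum())       # Frobenius norm
    return norm_1 < 1 or frob < 1
```

Small hh (weak homophily) satisfies the bounds; hh close to 0.5 blows them up, matching the theorem's dependence on the node degrees d_11, …, d_nn.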
Our Contributions
Theory
☐ correspondence: BP ≈ RWR ≈ SSL
✓ linearization for BP
✓ convergence criteria for linearized BP
Practice
☐ FABP algorithm: fast, accurate, and scalable
☐ Experiments on DBLP, Web, and Kronecker graphs
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Correspondence of Methods

Method   Matrix            ×  unknown  =  known
RWR      [I - c AD⁻¹]      ×  x        =  (1 - c) y
SSL      [I + a(D - A)]    ×  x        =  y
FABP     [I + aD - c’A]    ×  b_h      =  φ_h

(figure: the matrix is built from the adjacency matrix A and the degrees d1, d2, d3; the unknown “?” vector holds the final labels/beliefs; the right-hand side holds the prior labels/beliefs)
RWR ≈ SSL (DETAILS!)

Simplification: a = c / ((1 - c) d)
where a = global homophily strength of the nodes (SSL) and c = fly-out probability (RWR).

THEOREM: RWR and SSL are identical if
a_i = c / ((1 - c) d_ii)
where a_i = individual homophily strength of node i (SSL).
RWR ≈ SSL: example
(scatter plot: SSL scores vs RWR scores, points falling on the line y = x)
individual homophily strength: a_i = c / ((1 - c) d_i)
global homophily strength: a = c / ((1 - c) d)
⇒ similar scores and identical rankings
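The theorem is easy to sanity-check on a regular graph, where a_i = c / ((1 - c) d_ii) is the same for every node and the two linear systems coincide exactly (the 4-cycle, the value of c, and the seed vector below are my choices):

```python
import numpy as np

# 4-cycle: 2-regular, so a_i = c / ((1 - c) * d_ii) is one global value
A = np.array([[0., 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
d = A.sum(axis=1)                        # all degrees equal 2
c = 0.7                                  # RWR parameter (1 - restart probability)
a = c / ((1 - c) * d[0])                 # homophily strength from the theorem
y = np.array([1., 0, 0, 0])              # one labeled/seed node

D = np.diag(d)
x_rwr = np.linalg.solve(np.eye(4) - c * A @ np.linalg.inv(D), (1 - c) * y)
x_ssl = np.linalg.solve(np.eye(4) + a * (D - A), y)
# with this a, the SSL matrix equals the RWR matrix scaled by 1/(1-c),
# so the two solutions agree exactly
```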
Our Contributions
Theory
✓ correspondence: BP ≈ RWR ≈ SSL
✓ linearization for BP
✓ convergence criteria for linearized BP
Practice
☐ FABP algorithm: fast, accurate, and scalable
☐ Experiments on DBLP, Web, and Kronecker graphs
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Proposed algorithm: FABP

① Pick the homophily factor at the convergence limit:
   hh = max{1-norm bound, Frobenius-norm bound}
② Solve the linear system
   [I + aD - c’A] b_h = φ_h
③ (opt) If accuracy is low, run BP with prior beliefs b_h.
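A dense-matrix sketch of steps ① and ② (the constants a, c’ are assumed from the paper's linearization; a direct solve stands in for the distributed power-method solver):

```python
import numpy as np

def fabp(A, phi_h, hh):
    """FABP sketch: solve [I + aD - c'A] b_h = phi_h.
    A: (n, n) adjacency; phi_h: "about-half" priors (prior - 0.5);
    hh: homophily factor, chosen inside the convergence bounds."""
    a = 4 * hh**2 / (1 - 4 * hh**2)      # assumed linearization constants
    c = 2 * hh / (1 - 4 * hh**2)
    D = np.diag(A.sum(axis=1))
    return np.linalg.solve(np.eye(len(phi_h)) + a * D - c * A, phi_h)
```

Positive b_h means the first class, negative the second; unlabeled nodes simply get phi_h = 0.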
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Datasets

Dataset       # nodes          # edges
YahooWeb      1,413,511,390    6,636,600,779 (6 billion!)
Kronecker 1   177,147          1,977,149,596
Kronecker 2   120,552          1,145,744,786
Kronecker 3   59,049           282,416,924
Kronecker 4   19,683           40,333,924
DBLP          37,791           170,794

• p% labeled nodes initially (YahooWeb: .edu/others | DBLP: AI/not AI)
• accuracy computed on a hold-out set
Specs
• Hadoop version 0.20.2
• M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5 PB total storage, 3.5 TB of memory
• 100 machines used for the experiments
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
1. Accuracy
2. Convergence
3. Sensitivity
4. Scalability
5. Parallelism
Conclusions
Results (1): Accuracy

Scatter plot of beliefs in FABP vs beliefs in BP for (h, priors) = (0.5±0.002, 0.5±0.001), 0.3% labels; classes AI / non-AI.
All points on the diagonal ⇒ scores near-identical.
Results (2): Convergence

Accuracy wrt hh (priors = ±0.001), 0.3% labels.
(plot: % accuracy vs hh, with convergence bounds marked: 1-norm, Frobenius norm, |e_val| = 1)
FABP achieves maximum accuracy within the convergence bounds.
Results (3): Sensitivity to the homophily factor

Accuracy wrt hh (priors = ±0.001), 0.3% labels.
(plot: % accuracy vs hh, with convergence bounds marked: 1-norm, Frobenius norm, |e_val| = 1)
FABP is robust to the homophily factor hh within the convergence bounds.

(For all plots: y-axis = % accuracy; x-axis = hh or the prior beliefs’ magnitude; averages over 10 runs; error bars are tiny.)
Results (3): Sensitivity to the prior beliefs

Accuracy wrt the priors (hh = ±0.002), for p = 0.1%, 0.3%, 0.5%, 5% labels.
(plot: % accuracy vs the prior beliefs’ magnitude)
FABP is robust to the prior beliefs φ_h.
Results (4): Scalability

(plot: runtime (min) vs # of edges, Kronecker graphs)
FABP is linear on the number of edges.
Results (5): Parallelism

(plots: % accuracy and runtime (min) vs # of steps)
FABP is ~2x faster & wins/ties on accuracy.
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Our Contributions
Theory
✓ correspondence: BP ≈ RWR ≈ SSL
✓ linearization for BP
✓ convergence criteria for linearized BP
Practice
✓ FABP algorithm: fast (~2x faster), accurate (same/better), and scalable (6 billion edges!)
✓ Experiments on DBLP, Web, and Kronecker graphs
Thanks
• Data: ILLINOIS (Ming Ji, Jiawei Han)
• Funding: NSC
Thank you!
[email protected]
Q: Can we have multiple classes?
A: Yes! Example 3-class propagation matrix:

      AI    ML    DB
AI    0.7   0.2   0.1
ML    0.2   0.6   0.2
DB    0.1   0.2   0.7
Q: Which of the methods do you recommend?
A: (Fast) Belief Propagation
Reasons:
• solid Bayesian foundation
• handles heterophily and multiple classes (via the propagation matrix, as above)
Q: Why is FABP faster than BP?
A:
• BP: 2|E| messages per iteration
• FABP: |V| records per “power method” iteration, and |V| < 2|E|
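The “|V| records per power-method iteration” can be sketched as a Jacobi-style recurrence: expand [I + aD - c’A]⁻¹ as a power series and accumulate it with one (sparse) mat-vec per step. Dense here for brevity; the constants are assumed from the linearization:

```python
import numpy as np

def fabp_power(A, phi_h, hh, n_iter=50):
    """Power-series evaluation of FABP (sketch):
    b_h = sum_k M^k phi_h with M = c'A - aD, accumulated as b <- phi_h + M b.
    Each step is one mat-vec producing |V| output records."""
    a = 4 * hh**2 / (1 - 4 * hh**2)
    c = 2 * hh / (1 - 4 * hh**2)
    M = c * A - a * np.diag(A.sum(axis=1))
    b = phi_h.copy()
    for _ in range(n_iter):
        b = phi_h + M @ b
    return b
```

Within the convergence bounds (||M|| < 1) this iteration reaches the same answer as the direct solve.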