Transcript Triangles

FAST COUNTING OF TRIANGLES IN
LARGE NETWORKS:
ALGORITHMS AND LAWS
Charalampos (Babis) Tsourakakis
School of Computer Science
Carnegie Mellon University
http://www.cs.cmu.edu/~ctsourak
RPI Theory Seminar, 24 November 2008
Counting Triangles
Given an undirected, simple graph G(V,E), a triangle is
a set of 3 vertices such that any two of them are connected
by an edge of the graph.
Related Problems
Our focus
a) Decide if a graph is triangle-free.
b) Count the total number of triangles δ(G).
c) Count the number of triangles δ(v) that each vertex v
participates in:
$$\delta(v) = |\{(u, w) \in E : (v, u) \in E,\ (v, w) \in E\}|$$
d) List the triangles that each vertex v participates in.
RPI, November 2008
Why is triangle counting important*?
Social Network Analysis:
“Friends of friends are friends” [WF94]
Web Spam Detection [BPCG08]
Hidden Thematic Structure of the
Web [EM02]
Motif Detection e.g. biological
networks [YPSB05]
*few indicative reasons, from the graph mining perspective
Why is triangle counting important?
Furthermore, two often used metrics are:
Clustering Coefficient
$$CC(G) = \frac{1}{|V'|} \sum_{v \in V'} cc(v) = \frac{1}{|V'|} \sum_{v \in V'} \frac{\delta(v)}{\tau(v)}$$
where $V' = \{v : d(v) \ge 2\}$ and $\tau(v) = \binom{d(v)}{2}$ is the number of triples at node v
Transitivity Ratio
$$TR = \frac{3\,\delta(G)}{\tau(G)}$$
where $\delta(G) = \frac{1}{3} \sum_{v \in V} \delta(v)$ and $\tau(G) = \sum_{v \in V} \tau(v)$
[Figure: a triple at node v, and a triangle]
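The two metrics above can be sketched in a few lines of plain Python. This is only an illustration: the adjacency-set representation, the function names and the toy graph are assumptions made here, and `tau` stands for the number of triples at a node.

```python
from itertools import combinations

def triangles_per_vertex(adj):
    """Return {v: delta(v)}, the number of triangles each vertex participates in."""
    return {
        v: sum(1 for u, w in combinations(sorted(nbrs), 2) if w in adj[u])
        for v, nbrs in adj.items()
    }

def clustering_and_transitivity(adj):
    """Return (CC(G), TR) for an undirected simple graph {vertex: set_of_neighbors}."""
    delta = triangles_per_vertex(adj)
    # tau(v) = C(d(v), 2): number of triples centered at v
    tau = {v: len(nbrs) * (len(nbrs) - 1) // 2 for v, nbrs in adj.items()}
    v_prime = [v for v in adj if len(adj[v]) >= 2]       # vertices with degree >= 2
    cc = sum(delta[v] / tau[v] for v in v_prime) / len(v_prime)
    total_triangles = sum(delta.values()) // 3           # delta(G)
    tr = 3 * total_triangles / sum(tau.values())
    return cc, tr

# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc, tr = clustering_and_transitivity(paw)
```

On this toy graph the two metrics differ (7/9 versus 3/5), which shows that they weight high-degree vertices differently.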
Outline
• Related Work
• Proposed Method
• Experiments
• Triangle-related Laws
• Triangles in Kronecker Graphs
• Future Work & Open Problems
Counting methods
Dense graphs
                   Fast                           Low space
Time complexity    O(n^2.37)                      O(n^3)
Space complexity   O(n^2)                         O(m)
Sparse graphs
                   Fast                           Low space
Time complexity    O(m^0.7 n^1.2 + n^(2+o(1)))    e.g. O(n d_max^2)
Space complexity   Θ(n^2) (eventually)            Θ(m)
Outline of the Proposed Method





EigenTriangle theorem
EigenTriangleLocal theorem
EigenTriangle algorithm
EigenTriangleLocal algorithm
Efficiency & Complexity
Power law degree distributions
 Gershgorin discs
 Real world network spectra

Theorem [EigenTriangle]

Theorem
The number of triangles δ(G) in an undirected,
simple graph G(V,E) is given by:
$$\delta(G) = \frac{1}{6} \sum_{i=1}^{|V|} \lambda_i^3$$
where $\lambda_1, \lambda_2, \dots, \lambda_{|V|}$
are the eigenvalues of the adjacency matrix of
graph G.
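The identity can be checked numerically on a small graph. The sketch below assumes NumPy and a dense adjacency matrix; the function names and the K5 example are made up for illustration and are not part of the talk.

```python
import numpy as np

def eigen_triangle_exact(A):
    """Total triangles via the full spectrum: sum(lambda_i^3) / 6."""
    eigenvalues = np.linalg.eigvalsh(A)      # A is symmetric, so eigenvalues are real
    return float(np.sum(eigenvalues ** 3) / 6.0)

def trace_count(A):
    """Reference count: trace(A^3) / 6."""
    return int(np.trace(np.linalg.matrix_power(A, 3)) // 6)

# Hypothetical example: complete graph on 5 vertices, which has C(5,3) = 10 triangles.
K5 = np.ones((5, 5), dtype=np.int64) - np.eye(5, dtype=np.int64)
spectral = eigen_triangle_exact(K5)
direct = trace_count(K5)
```

The spectral value is a floating-point number, so it agrees with the integer count only up to rounding error.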
Proof

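Call A the adjacency matrix of the graph. Consider
the i-th diagonal element of $A^3$, $\alpha_{ii}$. This element
counts the closed walks of length 3 that start and end at
vertex i, so it equals twice the number of triangles vertex i
participates in (each triangle is traversed as i-j-k and as
i-k-j). Summing over the 3 participating vertices, each
triangle is counted 6 times, so the trace of $A^3$ is 6δ(G).
Furthermore, if $Ax = \lambda x$, then $\lambda^3$ is an eigenvalue of
$A^3$ (*), and vice versa: if $\lambda$ is an eigenvalue of $A^3$, then
$\sqrt[3]{\lambda}$ is an eigenvalue of A. Since the trace equals the
sum of the eigenvalues, the theorem follows.
* $A^3 x = AAAx = AA\lambda x = \lambda AAx = \lambda A \lambda x = \lambda^2 Ax = \lambda^3 x$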
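As a quick sanity check of the argument (a worked example that is not on the original slide), take the complete graph $K_4$, whose adjacency matrix has eigenvalue 3 once and eigenvalue -1 three times:

```latex
\lambda(K_4) = \{3, -1, -1, -1\}
\quad\Longrightarrow\quad
\delta(K_4) = \frac{1}{6}\left(3^3 + 3\cdot(-1)^3\right) = \frac{27 - 3}{6} = 4 = \binom{4}{3}
```

Every choice of 3 vertices out of 4 forms a triangle in $K_4$, so the spectral count matches the combinatorial one.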
Theorem [EigenTriangleLocal]

Theorem
The number of triangles δ(i) vertex i participates in
is equal to:
$$\delta(i) = \frac{1}{2} \sum_{j=1}^{|V|} \lambda_j^3 u_{ij}^2$$
where $u_{ij}$ is the i-th entry of the j-th eigenvector $u_j$
Proof [Sketch]
Follows from the previous theorem and the fact that
A is symmetric, therefore diagonalizable, and also
$A^3 = U \Lambda^3 U^T$
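A minimal numerical sketch of the local formula, assuming NumPy and a dense symmetric adjacency matrix; the function name and the toy graph are illustrative assumptions. The eigenvectors are taken as the columns of `U`, so `U[i, j]` is the i-th entry of the j-th eigenvector.

```python
import numpy as np

def eigen_triangle_local(A):
    """Per-vertex triangle counts: delta(i) = sum_j lambda_j^3 * U[i, j]^2 / 2."""
    eigenvalues, U = np.linalg.eigh(A)       # columns of U are orthonormal eigenvectors
    return (U ** 2) @ (eigenvalues ** 3) / 2.0

# Hypothetical example: triangle 0-1-2 plus a pendant vertex 3 attached to vertex 2.
paw = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
local_counts = eigen_triangle_local(paw)
```

Vertices 0, 1 and 2 each lie on the single triangle and vertex 3 on none, which is what the formula returns up to rounding.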
EigenTriangle Algorithm
EigenTriangleLocal Algorithm
Why are these two algorithms efficient?
Skewed Degree Distributions


Skewed degree distributions are ubiquitous in nature!
They have been termed “the signature of human
activity” [FKP02], but they also appear in all other
kinds of networks, e.g. biological ones.
See [N05][M04] for generative models of power
law distributions.
Typically referred to as power laws (even if
sometimes we abuse the strict definition of a power
law, i.e. log(y) = a log(x) + b).
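The strict definition, a straight line in log-log scale, suggests a simple diagnostic: fit a and b by least squares on the logarithms. The sketch below uses synthetic data and hypothetical names; it is a rough check only, not a rigorous estimator of a power-law exponent.

```python
import math

def fit_log_log_line(xs, ys):
    """Least-squares fit of log(y) = a*log(x) + b; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    a = sxy / sxx                 # slope of the line in log-log scale
    b = mean_y - a * mean_x       # intercept
    return a, b

# Synthetic data (not from the talk): y = 100 * x^(-2), an exact power law.
xs = [1, 2, 4, 8, 16]
ys = [100 * x ** -2 for x in xs]
a, b = fit_log_log_line(xs, ys)
```

For this exact power law the fit recovers a slope of -2 and an intercept of log(100); real degree data would scatter around the line.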
Examples of power laws

Newman [N05] demonstrated
how often power laws appear
using many different types of
data sets, ranging from word
frequencies to populations of
cities.
[Figure annotations: many cities have a small population; few cities have a huge population]
Gershgorin’s Discs

Theorem
Let B be an arbitrary n×n matrix. Then the eigenvalues λ of
B are located in the union of the n discs
$$|\lambda - b_{kk}| \le \sum_{j \ne k} |b_{kj}|, \quad k = 1, \dots, n$$
For a proof see Demmel [D97], p.82.
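A small sketch of the discs for an adjacency matrix, assuming NumPy; the helper names and the star graph are illustrative assumptions. For an adjacency matrix every center is 0 and the radius of disc k is the degree of vertex k.

```python
import numpy as np

def gershgorin_discs(B):
    """Return (centers, radii): disc k has center b_kk and radius sum_{j != k} |b_kj|."""
    B = np.asarray(B, dtype=float)
    centers = np.diag(B)
    radii = np.sum(np.abs(B), axis=1) - np.abs(centers)
    return centers, radii

def in_union_of_discs(value, centers, radii, tol=1e-9):
    """True if value lies in at least one Gershgorin disc."""
    return bool(np.any(np.abs(value - centers) <= radii + tol))

# Hypothetical example: a star with center 0 and three leaves.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)
centers, radii = gershgorin_discs(star)
eigenvalues = np.linalg.eigvalsh(star)
```

Here the discs only say that every eigenvalue has absolute value at most 3 (the maximum degree), while the largest eigenvalue of this star is the square root of 3, so the bound holds but is far from tight.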
Gershgorin Discs
Bounds on the airports network (Observe how loose)
Typical real world spectra
[Figure panels: Political blogs; Airports]
Top Eigenvalues


Zooming in on the top eigenvalues and plotting the
rank vs. the eigenvalue in log-log scale reveals that
the top eigenvalues follow a power law [FFF99].
Some years later, Mihail & Papadimitriou [MP02]
and Chung, Lu and Vu [CLV03] proved this fact.
Our idea


Simple & clear:
Use a low-rank approximation of $A^3$ to estimate the
diagonal elements and the trace.
Suggests also a way of thinking:
Take advantage of special properties (e.g. power
laws) to reduce the complexity of certain
computational tasks in real-world networks.
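The idea can be sketched as follows, assuming NumPy. This is not the talk's algorithm as presented on the slides: for clarity it computes the full eigendecomposition and then keeps only k eigenpairs, whereas a practical version would compute just the top eigenpairs with a Lanczos-type solver. Selecting eigenvalues by absolute value is an assumption made here, and the names are hypothetical.

```python
import numpy as np

def eigen_triangle_low_rank(A, k):
    """Estimate total and per-vertex triangle counts from the k eigenpairs
    of largest absolute value (a rank-k approximation of A^3)."""
    eigenvalues, U = np.linalg.eigh(A)                 # full decomposition, for clarity only
    top = np.argsort(np.abs(eigenvalues))[::-1][:k]    # indices of the k largest |lambda|
    lam3 = eigenvalues[top] ** 3
    total = float(np.sum(lam3) / 6.0)
    local = (U[:, top] ** 2) @ lam3 / 2.0
    return total, local

# Hypothetical example: complete graph K6 (20 triangles), spectrum {5, -1, -1, -1, -1, -1}.
K6 = np.ones((6, 6)) - np.eye(6)
estimate_rank1, _ = eigen_triangle_low_rank(K6, 1)    # uses only lambda = 5
exact, local_exact = eigen_triangle_low_rank(K6, 6)   # full rank recovers the exact counts
```

With a single eigenvalue the estimate is 125/6, about 20.8, against the true 20 triangles, because the cube of the top eigenvalue dominates the sum.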
Summing up: Why does it work?


The first main reason is that the bulk of the spectrum,
excluding the top eigenvalues, is almost symmetric around 0,
so the cubes of those eigenvalues nearly cancel in the sum.
Cubing strongly amplifies this phenomenon!
Complexity Analysis



The main computational bottleneck that determines the
complexity is the Lanczos method.
Lanczos runs in linear time with respect to the non-zero
entries of the matrix, i.e. the edges, assuming that we
compute a small, constant number of eigenvalues.
Convergence of Lanczos is fast due to the
eigenvalue power law (see Kaniel-Paige theory
[GL89])
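To make the "linear in the number of edges" cost concrete, the sketch below uses plain power iteration, a simpler stand-in for Lanczos and not the Lanczos method itself. Like Lanczos, it touches the graph only through matrix-vector products, each of which visits every edge twice. The names, the start vector and the iteration count are assumptions.

```python
import math

def top_eigenpair(adj, iterations=200):
    """Power iteration on an adjacency list {vertex: set_of_neighbors}.
    Each iteration is one sparse matrix-vector product, i.e. O(m) work."""
    x = {v: 1.0 + 0.01 * i for i, v in enumerate(adj)}    # arbitrary positive start vector
    for _ in range(iterations):
        y = {v: sum(x[u] for u in adj[v]) for v in adj}   # y = A x
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    ax = {v: sum(x[u] for u in adj[v]) for v in adj}
    eigenvalue = sum(x[v] * ax[v] for v in adj)           # Rayleigh quotient x^T A x
    return eigenvalue, x

# Hypothetical example: triangle 0-1-2 plus a pendant vertex 3 attached to vertex 2.
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
lam, vec = top_eigenpair(paw)
```

Power iteration converges quickly only when the top eigenvalue is well separated in absolute value from the rest, and it does not converge on bipartite graphs, which is one reason a Lanczos-type method is preferable in practice.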
Datasets
Competitor: Node Iterator



The Node Iterator algorithm considers one node at a
time, looks at its neighbors, and checks how many
pairs of them are connected to each other.
Complexity: O(n d_max^2)
We report the results as the speedup that the
EigenTriangle algorithm gives compared to the
running time of the Node Iterator.
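A minimal sketch of the Node Iterator in plain Python, assuming the graph is stored as a dictionary from each vertex to the set of its neighbors; the function name and the example are illustrative.

```python
from itertools import combinations

def node_iterator(adj):
    """Count triangles with the Node Iterator: for every node, test each pair of
    its neighbors for adjacency. adj maps each vertex to the set of its neighbors."""
    per_vertex = {}
    for v, nbrs in adj.items():
        per_vertex[v] = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    total = sum(per_vertex.values()) // 3    # every triangle is seen from its 3 corners
    return total, per_vertex

# Hypothetical example: complete graph on 4 vertices (4 triangles, 3 per vertex).
K4 = {v: {u for u in range(4) if u != v} for v in range(4)}
total, per_vertex = node_iterator(K4)
```

The inner loop inspects every pair of neighbors of a node, which is where the quadratic dependence on the maximum degree comes from.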
Results: #Eigenvalues vs. Speedup
Results: #Edges vs. Speedup
Main points





Some interesting facts about the two scatterplots:
The mean approximation rank required for at least
95% accuracy is 6.2.
Speedups are between 33.7x and 1159x.
The mean speedup is 250x.
Notice the increasing speedup as the size of the
network grows.
Zooming in
[Figure annotation: zooming in on this point]
Evaluating the Local Counting Method



Pearson’s correlation coefficient ρ
Relative Reconstruction Error
$$RRE = \frac{1}{|V|} \sum_{i=1}^{|V|} \frac{|\delta(i) - \delta'(i)|}{\delta(i)}$$
Political Blogs:
RRE = 7×10^-4
ρ = 99.97%
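Both evaluation measures are easy to compute from the exact and the estimated per-vertex counts. The sketch below is plain Python with made-up numbers; skipping vertices with zero exact triangles, so that the relative error is defined, is an assumption introduced here and is not stated in the talk.

```python
import math

def relative_reconstruction_error(exact, estimated):
    """Mean of |delta(i) - delta'(i)| / delta(i).
    Assumption: vertices with delta(i) = 0 are skipped to avoid dividing by zero."""
    terms = [abs(d - e) / d for d, e in zip(exact, estimated) if d != 0]
    return sum(terms) / len(terms)

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up numbers for illustration only (not results from the talk).
exact = [10, 20, 40, 0]
estimated = [11, 18, 40, 0.5]
rre = relative_reconstruction_error(exact, estimated)
rho = pearson(exact, estimated)
```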
#Eigenvalues vs. ρ for three networks
Observe how a low rank gives almost optimal results.
This holds for surprisingly many real-world networks.
Triangle Participation Law

Plots the number of triangles δ (x-axis) vs. the count
of vertices that participate in δ triangles.
a) EPINIONS, who-trusts-whom network
b) ASN, social network
c) HEP_TH, collaboration network
Degree Triangle Law

Plots the degree d_i (x-axis) vs. the mean number of
triangles that nodes with degree d_i participate in.
[Figure panels: Epinions; ASN]
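The data behind the two law plots can be derived from per-vertex triangle counts and degrees. A minimal sketch in plain Python, with hypothetical function names and a toy input rather than the datasets of the talk:

```python
from collections import Counter, defaultdict

def triangle_participation(per_vertex_triangles):
    """Map each triangle count delta to the number of vertices with that count."""
    return Counter(per_vertex_triangles.values())

def degree_triangle(degrees, per_vertex_triangles):
    """Map each degree d to the mean number of triangles over vertices of degree d."""
    by_degree = defaultdict(list)
    for v, d in degrees.items():
        by_degree[d].append(per_vertex_triangles[v])
    return {d: sum(ts) / len(ts) for d, ts in by_degree.items()}

# Hypothetical toy input: triangle 0-1-2 plus a pendant vertex 3 attached to vertex 2.
degrees = {0: 2, 1: 2, 2: 3, 3: 1}
triangles = {0: 1, 1: 1, 2: 1, 3: 0}
participation = triangle_participation(triangles)
mean_by_degree = degree_triangle(degrees, triangles)
```

Plotting either dictionary on log-log axes would give the kind of scatterplot described on these two slides.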
Kronecker Graphs



This model was introduced in [LCKF05]. It is based
on the simple operation of the Kronecker product to
generate graphs that mimic real world networks.
Deterministic Kronecker Graphs: Kronecker Product
of the adjacency matrix at the current step k with
the initiator adjacency matrix (typically small).
Stochastic Kronecker Graphs: Kronecker Product of
the matrix at the current step k with the initiator
matrix. Initiator matrix contains probabilities.
For more details see [LF07].
Triangles in Kronecker Graphs
Some notation first:
A: n×n initiator adjacency matrix of the undirected,
simple graph G_A
B = A^[k]: the k-th Kronecker product of A
λ = (λ_1, ..., λ_n): the eigenvalues of A
Δ(G_A), Δ(G_B): the number of triangles of G_A, G_B
Theorem [KroneckerTRC]
$$\Delta(G_B) = 6^{k}\, \Delta(G_A)^{k+1}, \quad k \ge 0$$
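The closed form can be checked by brute force on a small initiator. The sketch below is plain Python and assumes the convention that A^[0] = A and that each further step multiplies by A once more, so that the count is 6^k Δ(G_A)^(k+1) and the case k = 0 holds trivially. The helper names and the initiator graph are made up for illustration.

```python
def kron(A, B):
    """Kronecker product of two square 0/1 matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def kronecker_power(A, k):
    """A^[k] with the assumed convention A^[0] = A and A^[k] = A^[k-1] (x) A."""
    B = A
    for _ in range(k):
        B = kron(B, A)
    return B

def count_triangles(M):
    """trace(M^3) / 6 for a symmetric 0/1 matrix with zero diagonal."""
    n = len(M)
    nbrs = [[j for j in range(n) if M[i][j]] for i in range(n)]
    closed_walks = sum(M[k][i] for i in range(n) for j in nbrs[i] for k in nbrs[j])
    return closed_walks // 6

# Hypothetical initiator: K4 minus the edge 0-3 (triangles 0-1-2 and 1-2-3).
A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
counts = [count_triangles(kronecker_power(A, k)) for k in range(3)]
```

For this initiator the brute-force counts for k = 0, 1, 2 are 2, 24 and 288, matching 6^k · 2^(k+1).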
Proof
We use induction on the number of recursion steps k.
For k=0 the theorem trivially holds.
Assume now that KroneckerTRC holds for some
r ≥ 0. Call C = A^[r], D = A^[r+1], and let [μ_i]_{i=1..s}
be the eigenvalues of C. By the assumption,
$$\Delta(G_C) = 6^{r}\, \Delta(G_A)^{r+1}$$
The eigenvalues of D are given by the Kronecker
product $\mu \otimes \lambda$. By the EigenTriangle theorem, the
number of triangles in D is given by:
$$\Delta(G_D) = \frac{1}{6} \sum_{i=1}^{s} \sum_{j=1}^{n} \mu_i^3 \lambda_j^3 = \frac{1}{6} \sum_{i=1}^{s} \mu_i^3 \sum_{j=1}^{n} \lambda_j^3 = \frac{1}{6} \sum_{i=1}^{s} \mu_i^3 \cdot 6\,\Delta(G_A) = 6\,\Delta(G_A)\, \frac{1}{6} \sum_{i=1}^{s} \mu_i^3 = 6\,\Delta(G_A)\,\Delta(G_C) = 6^{r+1}\, \Delta(G_A)^{r+2}$$
Therefore KroneckerTRC holds for all k ≥ 0.
Q.E.D.
Theoretical Challenge I:
Spectra of real world networks


Can we prove things about the distribution of the
eigenvalues, adopting a random graph model such
as the expected degree model G(w) [CLV03]?
An analog to Wigner’s semicircle law for random
Erdos-Renyi graphs (see Furedi-Komlos [FK81])
[Figure: spectrum of G(40, 1/2) over 100000 iterations [S07]]
Empirically, the rest of the spectrum: triangular-like distribution [FDBV01].
Can we prove something about this empirical observation?
Theoretical Challenge II:
Eigenvectors of real world networks

Things are even “worse” than in the case of spectra: very
little is known about the eigenvectors.
Related work:
See [P08] for random graphs.
Theoretical Challenge III:
Degree Triangle Law


Prove the pattern we saw using the expected degree
random graph model G(w) (see [S04]).
Conjecture:
The relationship we observed probably appears
for certain values of the slope of the degree
distribution. Recent further experiments showed
that for some graphs this pattern does not hold.
Experimental Challenge I:
Compare with Streaming Methods


Streaming or semi-streaming methods perform one
or O(1) passes over the graph.
[YKS02]
[BFLSS06]
[BPCG08]
Common Underlying Idea: Sophisticated sampling
methods
Implement and compare.
Practical Challenge I:
Triangles in Large Scale Graph Mining


Many gigabyte- and petabyte-sized graphs.
How to handle these graphs?
HADOOP
EigenTriangle algorithms are based just on simple
matrix-vector multiplications.
Easy to parallelize on all sorts of architectures
(distributed memory, shared memory).
See [DHV93] for the details.
PEGASUS: Peta-Graph Mining
from the Triangle perspective


Soon…
Stay tuned!

On-going work with U Kang and
Christos Faloutsos in collaboration
with Yahoo! Research.
Among others: Implement
EigenTriangle algorithms in
HADOOP and compare to other
methods.
Find outliers in graphs with many
billions of edges wrt triangles.
Curious about:
Acknowledgements

Christos Faloutsos
For the helpful discussions

Yiannis Koutis

Maria Tsiarli
For the PEGASUS logo
References







[WF94] Wasserman, Faust: “Social Network Analysis: Methods and
Applications (Structural Analysis in the Social Sciences)”
[EM02] Eckmann, Moses: “Curvature of co-links uncovers hidden thematic
layers in the World Wide Web”
[BPCG08] Becchetti, Boldi, Castillo, Gionis: “Efficient Semi-Streaming
Algorithms for Local Triangle Counting in Massive Graphs”
[FKP02] Fabrikant, Koutsoupias, Papadimitriou: “Heuristically Optimized
Trade-offs: A New Paradigm for Power Laws in the Internet”
[N05] Newman: “Power laws, Pareto distributions and Zipf's law”
[M04] Mitzenmacher: “A brief history of generative models for power law
and lognormal distributions”
[FK81] Furedi-Komlos: “Eigenvalues of random symmetric matrices”







[S04] Danilo Sergi: “Random graph model with power-law distributed
triangle subgraphs”
[D97] Demmel: “Applied Numerical Linear Algebra”
[LCKF05] Leskovec, Chakrabarti, Kleinberg, Faloutsos: “Realistic,
Mathematically Tractable Graph Generation and Evolution using Kronecker
Multiplication”
[LF07] Leskovec, Faloutsos: “Scalable Modeling of Real Graphs using
Kronecker Multiplication”
[FFF99] Faloutsos, Faloutsos, Faloutsos: “On power-law relationships of the
Internet topology”
[MP02] Mihail, Papadimitriou: “On the Eigenvalue Power Law”
[CLV03] Chung, Lu, Vu: “Spectra of Random Graphs with given expected
degrees”








[YKS02] Bar-Yossef, Kumar, Sivakumar: “Reductions in streaming algorithms,
with an application to counting triangles in graphs”
[GL89] Golub, Van Loan: “Matrix Computations”
[BFLSS06] Buriol, Frahling, Leonardi, Spaccamela, Sohler: “Counting triangles in
data streams”
[DHV93] Demmel, Heath, Vorst: “Parallel Numerical Linear Algebra”
[YPSB05] Ye, Peyser, Spencer, Bader: “Commensurate distances and similar
motifs in genetic congruence and protein interaction networks in yeast”
[P08] Mitra Pradipta: “Entrywise Bounds for Eigenvectors of Random Graphs”
[FDBV01] Farkas, Derenyi, Barabasi, Vicsek: “Spectra of "real-world" graphs:
Beyond the semi-circle law”
[S07] Spielman’s “Spectral Graph Theory and its Applications” class (YALE):
http://www.cs.yale.edu/homes/spielman/eigs/


[F08] Faloutsos’ “Multimedia Databases and Data Mining” class (CMU):
http://www.cs.cmu.edu/~christos/courses/826.S08
For more references, take a look also in the paper:
http://www.cs.cmu.edu/~ctsourak/tsourICDM08.pdf