PPT - Mining of Massive Datasets


Note to other teachers and users of these slides:

We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

http://www.mmds.org

We often think of networks as being organized into modules, clusters, and communities:

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Find micro-markets by partitioning the query-to-advertiser graph.

[Andersen, Lang: Communities from seed sets, 2006]

Clusters in Movies-to-Actors graph:

[Andersen, Lang: Communities from seed sets, 2006]

Discovering social circles, circles of trust:

[McAuley, Leskovec: Discovering social circles in ego networks, 2012]

How to find communities?

We will work with undirected (unweighted) networks.


Edge betweenness: Number of shortest paths passing over the edge

Intuition:

(Figure: edge strengths (call volume) in a real network, next to edge betweenness in the same network; example betweenness values b=16 and b=7.5.)

[Girvan-Newman '02]

Divisive hierarchical clustering based on the notion of edge betweenness: the number of shortest paths passing through the edge.

Girvan-Newman Algorithm (undirected, unweighted networks):

Repeat until no edges are left:
- Calculate the betweenness of every edge
- Remove the edge(s) with the highest betweenness
- The connected components are the communities

This gives a hierarchical decomposition of the network.
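The algorithm above can be sketched in plain Python. This is a minimal illustration, not code from the book: edge betweenness is computed with one BFS pass per source node (counting shortest paths, then pushing flow back up, as the later slides describe), and the two-triangles-plus-bridge graph is a made-up example.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Betweenness of every edge: number of shortest paths over it
    (pairs with multiple shortest paths counted fractionally)."""
    bet = defaultdict(float)
    for s in adj:
        # BFS from s: sigma[v] = number of shortest s-v paths
        dist, sigma, order = {s: 0}, {v: 0 for v in adj}, []
        sigma[s] = 1
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
        # Push flow back up the BFS tree, splitting among parents
        flow = {v: 1.0 for v in adj}
        for w in reversed(order[1:]):
            for v in adj[w]:
                if dist.get(v) == dist[w] - 1:      # v is a parent of w
                    c = flow[w] * sigma[v] / sigma[w]
                    bet[frozenset((v, w))] += c
                    flow[v] += c
    # Each endpoint pair was counted from both sides, so halve
    return {e: b / 2 for e, b in bet.items()}

def girvan_newman_step(adj):
    """Remove the highest-betweenness edge(s); return the components."""
    bet = edge_betweenness(adj)
    top = max(bet.values())
    for e in [e for e, b in bet.items() if b == top]:
        u, v = tuple(e)
        adj[u].remove(v)
        adj[v].remove(u)
    # Connected components are the communities
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    q.append(w)
        comps.append(sorted(comp))
    return comps

# Two triangles joined by the bridge (2, 3): the bridge goes first
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
comps = girvan_newman_step(adj)
print(comps)    # [[0, 1, 2], [3, 4, 5]]
```

Rerunning `girvan_newman_step` on the result repeats the process on the remaining edges, yielding the hierarchical decomposition.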

(Figure: example network with edge betweenness values 1, 12, 33, 49.)

Need to re-compute betweenness at every step.

(Figure: Step 1, Step 2, Step 3 of edge removal, and the resulting hierarchical network decomposition.)

Communities in physics collaborations

Zachary’s Karate club:

Hierarchical decomposition

1. How to compute betweenness?
2. How to select the number of clusters?

Want to compute the betweenness of paths starting at node A.

Breadth-first search starting from A (figure: BFS levels 0, 1, 2, 3, 4):

Count the number of shortest paths from A to all other nodes of the network:

Compute betweenness by working up the tree: if there are multiple shortest paths, count them fractionally.

The algorithm:
- Add edge flows:
  -- node flow = 1 + sum of the child edge flows
  -- split the flow up among the parent edges in proportion to the parents' shortest-path counts
- Repeat the BFS procedure for each starting node U

(Example from the figure: 1+0.5 paths to J, split 1:2; 1 path to K, split evenly; 1+1 paths to H, split evenly.)

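One pass of this credit computation can be sketched as follows. The diamond graph A-B-C-D below is a made-up stand-in for the figure's example, not the actual slide graph.

```python
from collections import deque

def credits_from(adj, source):
    """One pass of the Girvan-Newman betweenness computation:
    BFS from `source` counts shortest paths (sigma), then flow is
    pushed back up, split among parents in proportion to sigma."""
    dist, sigma, order = {source: 0}, {source: 1}, []
    q = deque([source])
    while q:
        v = q.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                q.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]   # every shortest path to v extends to w
    flow = {v: 1.0 for v in adj}       # every node starts with flow 1
    credit = {}
    for w in reversed(order[1:]):      # deepest nodes first, skip the source
        for v in adj[w]:
            if dist[v] == dist[w] - 1:           # v is a parent of w
                c = flow[w] * sigma[v] / sigma[w]
                credit[(v, w)] = c
                flow[v] += c           # node flow = 1 + sum of child edges
    return sigma, credit

# Diamond: two shortest A-D paths, so each edge into D carries 0.5
adj = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'], 'D': ['B', 'C']}
sigma, credit = credits_from(adj, 'A')
print(sigma)     # {'A': 1, 'B': 1, 'C': 1, 'D': 2}
print(credit)    # (B,D) and (C,D) get 0.5; (A,B) and (A,C) get 1.5
```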

1. How to compute betweenness?
2. How to select the number of clusters?

Communities: sets of tightly connected nodes.

Define: Modularity Q
- A measure of how well a network is partitioned into communities
- Given a partitioning of the network into groups s ∈ S:

Q ∝ ∑_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]

Need a null model!

Given a real graph G on n nodes and m edges, construct a rewired network G':
- Same degree distribution, but random connections
- Consider G' as a multigraph

The expected number of edges between nodes i and j of degrees k_i and k_j equals k_i·k_j / 2m.

The expected number of edges in the (multigraph) G' is then:

(1/2) ∑_{i∈N} ∑_{j∈N} k_i·k_j / 2m = (1/2)·(1/2m) ∑_{i∈N} k_i ∑_{j∈N} k_j = (1/4m)·2m·2m = m

(Note: ∑_{u∈N} k_u = 2m.)
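A quick numeric check of this identity, with a made-up degree sequence:

```python
# In the rewired multigraph G', the expected number of edges between i and j
# is k_i * k_j / (2m); summing over all pairs recovers exactly m.
deg = [3, 2, 3, 3, 3, 2]             # a made-up degree sequence
m = sum(deg) / 2
expected_edges = 0.5 * sum(ki * kj / (2 * m) for ki in deg for kj in deg)
print(m, expected_edges)             # 8.0 8.0
```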

Modularity of a partitioning S of graph G:

Q ∝ ∑_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]

Q(G, S) = (1/2m) ∑_{s∈S} ∑_{i∈s} ∑_{j∈s} ( A_ij − k_i·k_j / 2m )

(1/2m is a normalizing constant.) Modularity values take the range [−1, 1]:
- Q is positive if the number of edges within groups exceeds the expected number
- Q between 0.3 and 0.7 means significant community structure
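The definition can be evaluated directly. Below is a hypothetical helper (not code from the book) computing Q on two triangles joined by a single edge:

```python
def modularity(adj, communities):
    """Q(G, S) = (1/2m) * sum over same-community pairs (i, j) of
    A_ij - k_i * k_j / (2m)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    deg = {v: len(adj[v]) for v in adj}
    label = {v: c for c, group in enumerate(communities) for v in group}
    q = 0.0
    for i in adj:
        for j in adj:
            if label[i] == label[j]:
                a_ij = 1 if j in adj[i] else 0
                q += a_ij - deg[i] * deg[j] / (2 * m)
    return q / (2 * m)

# Two triangles joined by the edge (2, 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(modularity(adj, [[0, 1, 2], [3, 4, 5]]))   # 5/14 ~ 0.357: strong structure
print(modularity(adj, [[0, 1, 2, 3, 4, 5]]))     # a single group scores 0
```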

Modularity is useful for selecting the number of clusters. (Figure: Q plotted against the number of clusters.)

Next time: why not optimize modularity directly?

Undirected graph G(V, E):

Bi-partitioning task: divide the vertices into two disjoint groups A and B.

(Figure: a 6-node example graph with nodes 1-6 split into two groups A and B.)

Questions:
- How can we define a "good" partition of G?
- How can we efficiently identify such a partition?

What makes a good partition?
- Maximize the number of within-group connections
- Minimize the number of between-group connections

(Figure: the 6-node example graph with groups A and B.)

Express the partitioning objectives as a function of the "edge cut" of the partition.

Cut: the set of edges with only one vertex in a group.

(Figure: example graph where cut(A, B) = 2.)

Criterion: Minimum-cut
- Minimize the weight of connections between groups: arg min_{A,B} cut(A, B)

Degenerate case: (figure) the minimum cut can slice off a single node, while the "optimal cut" is more balanced.

Problem:
- Only considers external cluster connections
- Does not consider internal cluster connectivity

Criterion: Normalized-cut [Shi-Malik '97]
- Connectivity between groups relative to the density of each group
- vol(A): total weight of the edges with at least one endpoint in A: vol(A) = ∑_{i∈A} k_i
- ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B)

Why use this criterion? It produces more balanced partitions.

How do we efficiently find a good partition? Problem: computing the optimal cut is NP-hard.

A: the adjacency matrix of an undirected graph G
- A_ij = 1 if (i, j) is an edge, else 0

x: a vector in ℝⁿ with components (x_1, …, x_n)
- Think of it as a label/value on each node of G

What is the meaning of A·x?

y_i = ∑_j A_ij x_j = ∑_{(i,j)∈E} x_j

Entry y_i is the sum of the labels x_j of the neighbors of i.
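A tiny numeric illustration of this (a made-up 3-node star graph, using numpy):

```python
import numpy as np

# Made-up 3-node star: node 0 is joined to nodes 1 and 2
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
x = np.array([10.0, 20.0, 30.0])     # a label on each node
y = A @ x                            # y_i = sum of the labels of i's neighbors
print(y)                             # [50. 10. 10.]
```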

The j-th coordinate of A·x:
- the sum of the x-values of the neighbors of j
- make this the new value at node j

Spectral Graph Theory: analyze the "spectrum" of a matrix representing G, via A·x = λ·x.

Spectrum: the eigenvectors x_i of the graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i.

Suppose G is connected and all nodes in G have degree d.

What are some eigenvalues/eigenvectors of G? In A·x = λ·x, what is λ and what is x?

Let's try x = (1, 1, …, 1). Then A·x = (d, d, …, d) = λ·x, so λ = d.

We found an eigenpair of G: x = (1, 1, …, 1), λ = d.

(Remember the meaning of y = A·x: each entry is the sum of the labels of the node's neighbors.)
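A quick check of this eigenpair on a 5-cycle, where every node has degree d = 2 (a made-up example, using numpy):

```python
import numpy as np

# Made-up example: a 5-cycle, so every node has degree d = 2
n, d = 5, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1

print(A @ np.ones(n))                # (d, ..., d): x = (1,...,1) gives lambda = d
print(max(np.linalg.eigvalsh(A)))    # the largest eigenvalue is d = 2
```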

Details!

G is d-regular and connected; A is its adjacency matrix.

Claim: d is the largest eigenvalue of A, and d has multiplicity 1 (there is only one eigenvector associated with eigenvalue d).

Proof: why is there no eigenvalue d′ > d?
- To obtain eigenvalue d we needed x_i = x_j for every i, j, which means x = c·(1, 1, …, 1) for some constant c
- Consider some vector y that is not a multiple of (1, …, 1), and define S = the set of nodes i with the maximum possible value of y_i; then not all nodes are in S
- Take some node j ∈ S with a neighbor i ∉ S; then coordinate j of A·y gets a value strictly less than d·y_j
- So y is not an eigenvector with eigenvalue ≥ d, and hence d is the largest eigenvalue

What if G is not connected? Say G has 2 components, each d-regular.

What are some eigenvectors?
- Put all 1s on A and 0s on B, or vice versa:
  x′ = (1, …, 1, 0, …, 0) with |A| ones; x′′ = (0, …, 0, 1, …, 1) with |B| ones
- Then A·x′ = (d, …, d, 0, …, 0) and A·x′′ = (0, …, 0, d, …, d)
- So in both cases the corresponding eigenvalue is λ = d

A bit of intuition: for a disconnected graph λ_n = λ_{n−1}, and for an "almost disconnected" graph the 2nd largest eigenvalue λ_{n−1} has a value very close to λ_n, i.e. λ_n − λ_{n−1} ≈ 0.
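Numerically, two disjoint triangles (each 2-regular, a made-up example) exhibit the repeated top eigenvalue:

```python
import numpy as np

# Made-up example: two disjoint triangles; each component is 2-regular
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1

w = np.sort(np.linalg.eigvalsh(A))
print(w[-2], w[-1])     # both 2: lambda_n = lambda_{n-1} when G is disconnected
```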

More intuition: λ_n = λ_{n−1} for a disconnected graph, and λ_n − λ_{n−1} ≈ 0 for an almost-disconnected one.

If the graph is connected (the right example), then we already know that x_n = (1, …, 1) is an eigenvector. Since eigenvectors are orthogonal, the components of x_{n−1} sum to 0:
- Why? Because 0 = x_n · x_{n−1} = ∑_i x_n[i] · x_{n−1}[i] = ∑_i x_{n−1}[i]

So we can look at the eigenvector of the 2nd largest eigenvalue and declare nodes with a positive label to be in A and nodes with a negative label to be in B.

But there is still lots to sort out.

Adjacency matrix (A): n × n matrix
- A = [a_ij], a_ij = 1 if there is an edge between nodes i and j, else 0

     1  2  3  4  5  6
  1  0  1  1  0  1  0
  2  1  0  1  0  0  0
  3  1  1  0  1  0  0
  4  0  0  1  0  1  1
  5  1  0  0  1  0  1
  6  0  0  0  1  1  0

Degree matrix (D): n × n diagonal matrix
- D = [d_ii], d_ii = degree of node i

     1  2  3  4  5  6
  1  3  0  0  0  0  0
  2  0  2  0  0  0  0
  3  0  0  3  0  0  0
  4  0  0  0  3  0  0
  5  0  0  0  0  3  0
  6  0  0  0  0  0  2

Laplacian matrix (L): n × n symmetric matrix, L = D − A

     1  2  3  4  5  6
  1  3 -1 -1  0 -1  0
  2 -1  2 -1  0  0  0
  3 -1 -1  3 -1  0  0
  4  0  0 -1  3 -1 -1
  5 -1  0  0 -1  3 -1
  6  0  0  0 -1 -1  2

What is the trivial eigenpair? x = (1, …, 1): then L·x = 0, so λ = λ_1 = 0.

Important properties:
- Eigenvalues are non-negative real numbers
- Eigenvectors are real and orthogonal
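The construction and the trivial eigenpair can be checked numerically on the slides' 6-node example (using numpy):

```python
import numpy as np

# The 6-node example graph from the slides
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
D = np.diag(A.sum(axis=1))           # degrees 3, 2, 3, 3, 3, 2
L = D - A

print(L @ np.ones(n))                # zeros: x = (1,...,1) has lambda = 0
print(np.linalg.eigvalsh(L))         # eigenvalues 0, 1, 3, 3, 4, 5: all >= 0
```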

Details!

(a) All eigenvalues of L are ≥ 0
(b) xᵀLx = ∑_{ij} L_ij x_i x_j ≥ 0 for every x; that is, L is positive semi-definite
(c) L can be written as L = Nᵀ·N

Proof:
- (c) ⇒ (b): xᵀLx = xᵀNᵀNx = (Nx)ᵀ(Nx) ≥ 0, as it is just the squared length of Nx
- (b) ⇒ (a): let λ be an eigenvalue of L with eigenvector x; then by (b), 0 ≤ xᵀLx = xᵀλx = λ·xᵀx, so λ ≥ 0
- (a) ⇒ (c): is also easy! Do it yourself.

Fact: for a symmetric matrix M:

λ_2 = min_x (xᵀMx) / (xᵀx), the minimum taken over vectors x orthogonal to the first eigenvector.

What is the meaning of min xᵀLx on G?

xᵀLx = ∑_{i,j=1}^n L_ij x_i x_j = ∑_{i,j=1}^n (D_ij − A_ij) x_i x_j
     = ∑_i D_ii x_i² − ∑_{(i,j)∈E} 2 x_i x_j
     = ∑_{(i,j)∈E} (x_i² + x_j² − 2 x_i x_j) = ∑_{(i,j)∈E} (x_i − x_j)²

(Node i has degree d_i, so the value x_i² appears d_i times in the first sum; each edge (i, j) has two endpoints, so per edge we need x_i² + x_j².)
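A quick numeric check of the identity xᵀLx = ∑_{(i,j)∈E}(x_i − x_j)², on a small made-up graph:

```python
import numpy as np

# A small made-up graph and an arbitrary labeling x of its nodes
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

x = np.array([0.5, -1.0, 2.0, 0.25])
lhs = x @ L @ x
rhs = sum((x[i] - x[j]) ** 2 for i, j in edges)
print(lhs, rhs)     # equal: x^T L x sums (x_i - x_j)^2 over the edges
```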

Details!

λ_2 = min_x (xᵀMx)/(xᵀx):
- Write x in the basis of the eigenvectors w_1, w_2, …, w_n of M: x = ∑_i α_i w_i
- Then: Mx = ∑_i α_i M w_i = ∑_i α_i λ_i w_i
- So what is xᵀMx?
  xᵀMx = (∑_i α_i w_i)ᵀ(∑_i α_i λ_i w_i) = ∑_{ij} α_i λ_j α_j (w_iᵀw_j) = ∑_i λ_i α_i²
  (since w_iᵀw_j = 0 if i ≠ j, and 1 otherwise)
- To minimize this over all unit vectors x orthogonal to w_1, minimize over choices of (α_1, …, α_n) with ∑_i α_i² = 1 (unit length) and α_1 = 0 (orthogonal to w_1)
- To minimize, set α_2 = 1; then ∑_i λ_i α_i² = λ_2

What else do we know about x?
- x is a unit vector: ∑_i x_i² = 1
- x is orthogonal to the 1st eigenvector (1, …, 1), thus: ∑_i x_i · 1 = ∑_i x_i = 0

Remember:

λ_2 = min over all labelings x of the nodes with ∑_i x_i = 0 of ∑_{(i,j)∈E} (x_i − x_j)² / ∑_i x_i²

We want to assign values x_i to the nodes so that few edges cross 0 (on an edge we want x_i and x_j to nearly cancel each other), and the labels must balance around 0 to satisfy ∑_i x_i = 0.

Back to finding the optimal cut.

Express the partition (A, B) as a vector: y_i = +1 if i ∈ A, y_i = −1 if i ∈ B.

We can minimize the cut of the partition by finding a non-trivial vector y that minimizes f(y) = ∑_{(i,j)∈E} (y_i − y_j)².

Can't solve exactly: let's relax y and allow it to take any real value.

λ_2 = min_y f(y): the minimum value of f(y) is given by the 2nd smallest eigenvalue λ_2 of the Laplacian matrix L.

x = arg min_y f(y): the optimal solution for y is given by the corresponding eigenvector x, referred to as the Fiedler vector.
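On the slides' 6-node example, λ_2 and the Fiedler vector can be read off with numpy (eigh returns eigenvalues in ascending order; the eigenvector's overall sign is arbitrary):

```python
import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
L = np.diag(A.sum(axis=1)) - A

w, V = np.linalg.eigh(L)             # eigenvalues in ascending order
lam2, fiedler = w[1], V[:, 1]        # 2nd smallest eigenvalue + its eigenvector
print(lam2)                          # ~1.0
print(np.round(fiedler, 1))          # +/-(0.3, 0.6, 0.3, -0.3, -0.3, -0.6)
```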

Details!

Suppose there is a partition of G into A and B, with |A| ≤ |B|, and let α = (# edges from A to B) / |A|. Then 2α ≥ λ_2.
- This is the approximation guarantee of spectral clustering: it says the cut spectral finds is at most a factor 2 away from the optimal one of score α.

Proof:
- Let a = |A|, b = |B|, and e = # of edges from A to B
- It is enough to choose some x_i based on A and B such that λ_2 ≤ ∑_{(i,j)∈E}(x_i − x_j)² / ∑_i x_i² ≤ 2α (while also ∑_i x_i = 0); λ_2 is only smaller

Details!

Proof (continued):

1) Let's set: x_i = −1/a if i ∈ A, x_i = +1/b if i ∈ B
- Quickly verify that ∑_i x_i = 0: a·(−1/a) + b·(1/b) = 0 ✓

2) Then, with e = # of edges between A and B (only cut edges contribute to the numerator, since x_i = x_j within a group):

∑_{(i,j)∈E}(x_i − x_j)² / ∑_i x_i² = e·(1/a + 1/b)² / (a·(1/a)² + b·(1/b)²) = e·(1/a + 1/b)² / (1/a + 1/b) = e·(1/a + 1/b) ≤ e·(1/a + 1/a) = 2e/a = 2α

(the inequality uses a ≤ b). This proves that the cost achieved by spectral is no worse than twice the OPT cost.

Details!

Putting it all together: 2α ≥ λ_2 ≥ α² / (2·k_max), where k_max is the maximum node degree in the graph.
- Note we only proved the 1st part: 2α ≥ λ_2
- We did not prove λ_2 ≥ α² / (2·k_max)
- Overall this certifies that λ_2 always gives a useful bound
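A quick numeric check of the first inequality on the slides' 6-node example: the balanced cut A = {1, 2, 3}, B = {4, 5, 6} has e = 2 crossing edges, (1, 5) and (3, 4), so α = 2/3.

```python
import numpy as np

# alpha = (# edges from A to B) / |A| for A = {1,2,3}, B = {4,5,6}
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
L = np.diag(A.sum(axis=1)) - A

lam2 = np.linalg.eigvalsh(L)[1]
alpha = 2 / 3
print(2 * alpha, lam2)     # 4/3 >= 1.0, as the guarantee promises
```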

How to define a "good" partition of a graph?
- Minimize a given graph cut criterion

How to efficiently identify such a partition?
- Approximate using information provided by the eigenvalues and eigenvectors of the graph
- Spectral Clustering

Three basic stages:
1) Pre-processing: construct a matrix representation of the graph
2) Decomposition: compute the eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors
3) Grouping: assign points to two or more clusters, based on the new representation

1) Pre-processing: build the Laplacian matrix L of the graph:

     1  2  3  4  5  6
  1  3 -1 -1  0 -1  0
  2 -1  2 -1  0  0  0
  3 -1 -1  3 -1  0  0
  4  0  0 -1  3 -1 -1
  5 -1  0  0 -1  3 -1
  6  0  0  0 -1 -1  2

2) Decomposition: find the eigenvalues Λ and eigenvectors X of the matrix L:

Λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)

X (one eigenvector per line, in the order of Λ, values rounded to one decimal):

  x_1 =  0.4,  0.4,  0.4,  0.4,  0.4,  0.4
  x_2 =  0.3,  0.6,  0.3, -0.3, -0.3, -0.6
  x_3 = -0.5,  0.4,  0.1,  0.1, -0.5,  0.4
  x_4 = -0.2, -0.4,  0.6, -0.4,  0.4, -0.4
  x_5 =  0.6, -0.2,  0.4,  0.4, -0.4, -0.4
  x_6 = -0.5,  0.0,  0.5, -0.5,  0.5,  0.0

Map the vertices to the corresponding components of x_2:

  node: 1    2    3    4    5    6
  x_2:  0.3  0.6  0.3 -0.3 -0.3 -0.6

How do we now find the clusters?

3) Grouping:
- Sort the components of the reduced 1-dimensional vector
- Identify clusters by splitting the sorted vector in two

How to choose a splitting point?
- Naïve approaches: split at 0 or at the median value
- More expensive approaches: attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector)

Split at 0:
- Cluster A, the positive points: 1 → 0.3, 2 → 0.6, 3 → 0.3
- Cluster B, the negative points: 4 → −0.3, 5 → −0.3, 6 → −0.6
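The sweep mentioned above can be sketched as follows: sort the nodes by their component of x_2, then evaluate the normalized cut of every prefix of that ordering (`ncut` here is a hypothetical helper; the graph is the slides' 6-node example):

```python
import numpy as np

def ncut(A, left):
    """Normalized cut of the split (left nodes vs. the rest)."""
    n = A.shape[0]
    left = list(left)
    right = [i for i in range(n) if i not in set(left)]
    cut = A[np.ix_(left, right)].sum()          # edges crossing the split
    return cut / A[left].sum() + cut / A[right].sum()

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
L = np.diag(A.sum(axis=1)) - A

fiedler = np.linalg.eigh(L)[1][:, 1]            # eigenvector of lambda_2
order = np.argsort(fiedler)                     # sweep over this ordering
best_k = min(range(1, n), key=lambda k: ncut(A, order[:k]))
side = sorted(int(v) + 1 for v in order[:best_k])
print(best_k, side)     # best prefix has 3 nodes: one side of the cut
```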

(Figures: components of x_2 plotted against rank in x_2; components of x_1 and x_3 plotted against rank in x_2.)

How do we partition a graph into k clusters? Two basic approaches:
- Recursive bi-partitioning [Hagen et al., '92]: recursively apply the bi-partitioning algorithm in a hierarchical divisive manner. Disadvantages: inefficient, unstable.
- Cluster multiple eigenvectors [Shi-Malik, '00]: build a reduced space from multiple eigenvectors. Commonly used in recent papers; a preferable approach.

Why cluster multiple eigenvectors?
- Approximates the optimal cut [Shi-Malik, '00]: can be used to approximate the optimal k-way normalized cut
- Emphasizes cohesive clusters: increases the unevenness in the distribution of the data; associations between similar points are amplified, associations between dissimilar points are attenuated; the data begins to "approximate a clustering"
- Well-separated space: transforms the data to a new "embedded space" consisting of k orthogonal basis vectors
- Multiple eigenvectors prevent instability due to information loss

[Kumar et al. '99]

Searching for small communities in the Web graph: what is the signature of a community / discussion in a Web graph?

Use this to define "topics": what the same people on the left talk about on the right. Remember HITS!

Intuition: a dense 2-layer graph; many people all talking about the same things.

A more well-defined problem: enumerate complete bipartite subgraphs K_{s,t}
- K_{s,t}: s nodes on the "left" where each links to the same t other nodes on the "right"

(Figure: a fully connected K_{3,4} with |X| = s = 3 left nodes and |Y| = t = 4 right nodes.)

[Agrawal-Srikant '99]

Market basket analysis. Setting:
- Market: universe U of n items
- Baskets: m subsets of U: S_1, S_2, …, S_m (S_i is a set of items one person bought)
- Support: frequency threshold f

Goal: find all subsets T of U such that T ⊆ S_i for at least f of the sets S_i (the items in T were bought together at least f times).

What's the connection between the itemsets and complete bipartite graphs?

[Kumar et al. '99]

Frequent itemsets = complete bipartite graphs! How?
- View each node i as the set S_i of nodes that i points to (e.g., i → a, b, c, d gives S_i = {a, b, c, d})
- K_{s,t} = a set Y of size t that occurs in s of the sets S_i
- Looking for K_{s,t}: set the frequency threshold to s and look at layer t, i.e. all frequent sets of size t
- (X is the left layer: minimum support, |X| = s; Y is the right layer: itemset size, |Y| = t)

[Kumar et al. '99]

View each node i as the set S_i of nodes that i points to (e.g., S_i = {a, b, c, d}).

Find frequent itemsets (s = minimum support, t = itemset size). Say we find a frequent itemset Y = {a, b, c} of support s. Then there are s nodes (say x, y, z) that each link to all of {a, b, c}: we found a K_{s,t}!

(K_{s,t} = a set Y of size t that occurs in s of the sets S_i; here X = {x, y, z} and Y = {a, b, c}.)

Itemsets of the example graph (each node mapped to the set of nodes it points to):

a = {b, c, d}
b = {d}
c = {b, d, e, f}
d = {e, f}
e = {b, d}
f = {}

Support threshold s = 2:
- {b, d}: support 3 (it occurs in the sets of a, c, and e)
- {e, f}: support 2 (it occurs in the sets of c and d)

And we just found 2 complete bipartite subgraphs: X = {a, c, e} with Y = {b, d}, and X = {c, d} with Y = {e, f}.
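The enumeration can be sketched directly on this example (a minimal brute-force illustration, not the algorithm from the paper):

```python
from itertools import combinations

# Out-links of the example graph: node -> the set of nodes it points to
S = {'a': {'b', 'c', 'd'}, 'b': {'d'}, 'c': {'b', 'd', 'e', 'f'},
     'd': {'e', 'f'}, 'e': {'b', 'd'}, 'f': set()}
s, t = 2, 2                                 # support threshold s, itemset size t

items = sorted(set().union(*S.values()))
found = {}
for Y in combinations(items, t):            # candidate right-hand sides Y
    X = sorted(i for i in S if set(Y) <= S[i])   # nodes linking to all of Y
    if len(X) >= s:
        found[Y] = X                        # X and Y form a K_{|X|, t}

for Y, X in found.items():
    print(Y, X)     # ('b','d') supported by a, c, e; ('e','f') by c, d
```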

Example of a community from a Web graph (figure: nodes on the left, nodes on the right).

[Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]