CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Review of basic probability:
 Today, Thu 9/29
 In Gates B01, 4-6pm

Review of basic linear algebra:
 Tomorrow, Fri 9/30
 Gates B03, 4-6pm

Next week:
 Intro to SNAP (Gates B01, 4-6pm on Thu 10/6)
 Intro to NetworkX (Gates B03, 4-6pm on Fri 10/7)
7/17/2015
Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
2

Recall from the last lecture:
1) We took a real system: the Web
2) We represented it as a directed graph
3) We used the language of graph theory
  • Strongly Connected Components
4) We designed a computational experiment:
  • Find In- and Out-components of a given node v
5) We learned something about the structure of the Web

This class:
• Define basic terminology and properties that you can compute on networks
Undirected
• Links: undirected (symmetrical)
• Undirected links: collaborations, friendship on Facebook

Directed
• Links: directed (arcs)
• Directed links: phone calls, following on Twitter

[Figure: an undirected graph (left) and a directed graph (right) on lettered nodes]
[Figure: a 4-node undirected graph (left) and a 4-node directed graph (right), nodes labeled 1 to 4]

$A_{ij} = 1$ if there is a link between node i and j
$A_{ij} = 0$ if nodes i and j are not connected to each other

Undirected (left):
$A = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$

Directed (right):
$A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix}$

Note that for a directed graph (right) the matrix is not symmetric.
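A minimal Python sketch (not from the slides) of the two 4-node adjacency matrices on this slide, stored as nested lists. The names `A_undirected`, `A_directed`, and `is_symmetric` are illustrative, and node i is assumed to sit at index i-1.

```python
# Adjacency matrices of the two 4-node example graphs (node i -> index i-1).
A_undirected = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
A_directed = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]


def is_symmetric(A):
    """Return True if A[i][j] == A[j][i] for every pair (i, j)."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


print(is_symmetric(A_undirected))  # True: every undirected link appears twice
print(is_symmetric(A_directed))    # False: A[0][3] = 1 but A[3][0] = 0
```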
Undirected
Node degree, $k_i$: the number of links connected to node i (in the figure, $k_i = 4$)
Avg. degree: $\bar{k} = \langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2E}{N}$

Directed
In directed networks we define an in-degree and out-degree.
The (total) degree of a node is the sum of in- and out-degree.
In the figure: $k_C^{in} = 2$, $k_C^{out} = 1$, $k_C = 3$
$\bar{k} = \frac{E}{N}$, $\bar{k}^{in} = \bar{k}^{out}$
Source: A node with $k^{in} = 0$
Sink: A node with $k^{out} = 0$
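As a rough illustration of these degree definitions, the sketch below computes degrees from an adjacency matrix. It assumes (this convention is not stated on the slide) that `A[i][j] = 1` means a link from node i to node j; the function names and the two small matrices are illustrative.

```python
def degrees_undirected(A):
    """Degree k_i of every node: row sums of a symmetric 0/1 matrix."""
    return [sum(row) for row in A]


def degrees_directed(A):
    """Return (k_in, k_out), assuming A[i][j] = 1 means a link i -> j."""
    n = len(A)
    k_out = [sum(A[i]) for i in range(n)]                      # row sums
    k_in = [sum(A[i][j] for i in range(n)) for j in range(n)]  # column sums
    return k_in, k_out


A_undirected = [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
A_directed = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0]]

k = degrees_undirected(A_undirected)
E = sum(k) // 2               # every undirected edge is counted twice
avg_k = sum(k) / len(k)       # equals 2E/N

k_in, k_out = degrees_directed(A_directed)
sources = [i for i, d in enumerate(k_in) if d == 0]   # nodes with k_in = 0
sinks = [i for i, d in enumerate(k_out) if d == 0]    # nodes with k_out = 0
print(k, avg_k, k_in, k_out, sources, sinks)
```

In the directed case the in-degrees and out-degrees both sum to E, which is why the average in-degree equals the average out-degree.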
The maximum number of edges in an undirected graph on N nodes is
$E_{max} = \binom{N}{2} = \frac{N(N-1)}{2}$
A graph with the number of edges $E = E_{max}$ is a complete graph, and its average degree is N-1
Most real-world networks are sparse
$E \ll E_{max}$ (or $\bar{k} \ll N-1$)

Network                       N              avg. degree k
WWW (Stanford-Berkeley)       319,717        9.65
Social networks (LinkedIn)    6,946,668      8.87
Communication (MSN IM)        242,720,596    11.1
Coauthorships (DBLP)          317,080        6.62
Internet (AS-Skitter)         1,719,037      14.91
Roads (California)            1,957,027      2.82
Protein (S. Cerevisiae)       1,870          2.39
(Source: Leskovec et al., Internet Mathematics, 2009)

Consequence: Adjacency matrix is filled with zeros!
(Density ($E/N^2$): WWW = $1.51 \times 10^{-5}$, MSN IM = $2.27 \times 10^{-8}$)

Unweighted (undirected)
[Figure: 4-node graph, nodes labeled 1 to 4]
$A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} A_{ij}$, $\bar{k} = \frac{2E}{N}$
Examples: Friendships, WWW

Weighted (undirected)
[Figure: 4-node graph with edge weights, nodes labeled 1 to 4]
$A_{ij} = \begin{pmatrix} 0 & 2 & 0.5 & 0 \\ 2 & 0 & 1 & 4 \\ 0.5 & 1 & 0 & 0 \\ 0 & 4 & 0 & 0 \end{pmatrix}$
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$
Examples: Call graph, Email graph

Self-edges (undirected)
[Figure: 4-node graph with self-loops, nodes labeled 1 to 4]
$A_{ij} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}$
$A_{ii} \ne 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1, i \ne j}^{N} A_{ij} + \sum_{i=1}^{N} A_{ii}$
Examples: WWW, Email

Multigraph (undirected)
[Figure: 4-node graph with parallel edges, nodes labeled 1 to 4]
$A_{ij} = \begin{pmatrix} 0 & 2 & 1 & 0 \\ 2 & 0 & 1 & 3 \\ 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix}$
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$
Examples: Social networks, collaboration networks
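The edge-count formulas for the four graph types can be checked directly on the example matrices. This is a small sketch with illustrative function names; it follows the formulas as written on the slides, so for the multigraph it counts connected pairs (nonzero entries), not edge multiplicities.

```python
def edges_unweighted(A):
    """E = 1/2 * sum_ij A_ij for a simple undirected graph."""
    return sum(sum(row) for row in A) // 2


def edges_nonzero(A):
    """E = 1/2 * sum_ij nonzero(A_ij): weighted graphs and multigraphs."""
    return sum(1 for row in A for a in row if a != 0) // 2


def edges_with_self_loops(A):
    """E = 1/2 * sum_{i != j} A_ij + sum_i A_ii."""
    n = len(A)
    off_diagonal = sum(A[i][j] for i in range(n) for j in range(n) if i != j)
    diagonal = sum(A[i][i] for i in range(n))
    return off_diagonal // 2 + diagonal


unweighted = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]]
weighted = [[0, 2, 0.5, 0], [2, 0, 1, 4], [0.5, 1, 0, 0], [0, 4, 0, 0]]
self_loops = [[1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
multigraph = [[0, 2, 1, 0], [2, 0, 1, 3], [1, 1, 0, 0], [0, 3, 0, 0]]

print(edges_unweighted(unweighted))       # 4
print(edges_nonzero(weighted))            # 4
print(edges_with_self_loops(self_loops))  # 6 (4 ordinary edges + 2 loops)
print(edges_nonzero(multigraph))          # 4 connected pairs
```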
WWW >> directed multigraph with self-interactions
Facebook friendships >> undirected, unweighted
Citation networks >> unweighted directed acyclic
Collaboration networks >> undirected multigraph or weighted
Mobile phone calls >> directed, (weighted?) multigraph
Protein Interactions >> undirected, unweighted with self-interactions

A bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets.

Examples:
• Authors-to-papers
• Movies-to-Actors
• Users-to-Movies

“Folded” networks:
• Author collaboration networks
• Actor collaboration networks
[Figure: bipartite graph with node sets U and V]
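A minimal sketch of folding a bipartite graph onto one side, under the usual assumption that two U-nodes become linked when they share at least one V-neighbor. The function name `fold` and the toy author/paper data are hypothetical.

```python
from itertools import combinations


def fold(bipartite_edges):
    """Project (u, v) pairs onto the U side.

    Two U-nodes are linked in the folded graph if they share a V-neighbor.
    """
    members = {}  # v -> set of U-nodes attached to v
    for u, v in bipartite_edges:
        members.setdefault(v, set()).add(u)
    folded = set()
    for us in members.values():
        for a, b in combinations(sorted(us), 2):
            folded.add((a, b))
    return folded


# Hypothetical authors-to-papers data.
edges = [("Ann", "P1"), ("Bob", "P1"), ("Bob", "P2"), ("Cat", "P2"), ("Dan", "P3")]
print(sorted(fold(edges)))  # [('Ann', 'Bob'), ('Bob', 'Cat')]
```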

Degree distribution P(k): Probability that a randomly chosen node has degree k
$N_k$ = # nodes with degree k
$P(k) = N_k / N$ ➔ plot
[Figure: histogram of P(k) against k, for k = 1 to 4]
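A short sketch of computing P(k) = N_k / N from an adjacency list. The dictionary `adj` is an illustrative 4-node graph, not the one plotted on the slide.

```python
from collections import Counter


def degree_distribution(adj):
    """adj maps each node to the set of its neighbors; returns {k: P(k)}."""
    n = len(adj)
    counts = Counter(len(neighbors) for neighbors in adj.values())  # N_k
    return {k: counts[k] / n for k in sorted(counts)}


adj = {1: {2, 4}, 2: {1, 4}, 3: {4}, 4: {1, 2, 3}}
print(degree_distribution(adj))  # {1: 0.25, 2: 0.5, 3: 0.25}
```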

A path is a sequence of nodes in which each node is linked to the next one
$P_n = \{i_0, i_1, i_2, \ldots, i_n\}$
$P_n = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\}$
Path can intersect itself and pass through the same edge multiple times
• E.g.: ACBDCDEG
• In a directed graph a path can only follow the direction of the “arrow”
[Figure: example graph on nodes A to H]

Number of paths between nodes u and v:
• Length h=1: If there is a link between u and v, $A_{uv} = 1$, else $A_{uv} = 0$
• Length h=2: If there is a path of length two between u and v, then $A_{uk} A_{kv} = 1$, else $A_{uk} A_{kv} = 0$
  $H_{uv}^{(2)} = \sum_{k=1}^{N} A_{uk} A_{kv} = [A^2]_{uv}$
• Length h: If there is a path of length h between u and v, then $A_{uk} \cdots A_{kv} = 1$, else $A_{uk} \cdots A_{kv} = 0$
  So, the no. of paths of length h between u and v is $H_{uv}^{(h)} = [A^h]_{uv}$
(holds for both directed and undirected graphs)
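The matrix-power statement can be tried out with a few lines of plain Python. This is a sketch with illustrative helper names (`mat_mul`, `mat_pow`) and a small example matrix; as on the path slide, the counted paths may revisit nodes and edges.

```python
def mat_mul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(A, h):
    """A raised to the h-th power; entry [u][v] counts paths of length h."""
    n = len(A)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(h):
        result = mat_mul(result, A)
    return result


A = [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
A2 = mat_pow(A, 2)
print(A2[0][1])  # 1: the only length-2 path from node 1 to node 2 goes via node 4
print(A2[3][3])  # 3: node 4 can step to any of its 3 neighbors and come back
```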
Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes.
• If the two nodes are disconnected, the distance is defined as infinite
In directed graphs paths need to follow the direction of the arrows.
• Consequence: Distance is not symmetric: h(A,C) ≠ h(C,A)
[Figure: an undirected and a directed example graph on nodes A, B, C, D]

Breadth-First Search:
• Start with node u, mark it to be at distance $h_u(u) = 0$, add u to the queue
• While the queue is not empty:
  • Take node v off the queue, put each of its unmarked neighbors w into the queue and mark $h_u(w) = h_u(v) + 1$
[Figure: example graph with every node labeled by its BFS distance (0 to 4) from u]
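The BFS steps above can be sketched as follows. The function name `bfs_distances` and the small adjacency list are illustrative; unreachable nodes are simply absent from the result, which corresponds to infinite distance.

```python
from collections import deque


def bfs_distances(adj, u):
    """Return {node: h_u(node)} for every node reachable from u."""
    dist = {u: 0}              # mark u at distance 0
    queue = deque([u])
    while queue:
        v = queue.popleft()    # take node v off the queue
        for w in adj[v]:
            if w not in dist:  # w is still unmarked
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


adj = {1: {2, 4}, 2: {1, 4}, 3: {4}, 4: {1, 2, 3}}
print(bfs_distances(adj, 3))  # node 4 at distance 1, nodes 1 and 2 at distance 2
```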


Diameter: the maximum (shortest path) distance between any pair of nodes in the graph
Average path length/distance for a connected graph (component) or a strongly connected (component of a) directed graph:
$\bar{h} = \frac{1}{2 E_{max}} \sum_{i, j \ne i} h_{ij}$
where $h_{ij}$ is the distance from node i to node j
• Many times we compute the average only over the connected pairs of nodes (i.e., we ignore “infinite” paths)
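A sketch of both quantities using one BFS per node. It averages over connected ordered pairs only (the variant mentioned in the last bullet); for a connected graph the number of such pairs is N(N-1) = 2 E_max, so it agrees with the formula. The names `path_stats` and `_bfs` are illustrative.

```python
from collections import deque


def _bfs(adj, u):
    """Distances from u to every reachable node."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def path_stats(adj):
    """Return (diameter, average distance) over connected ordered pairs."""
    total, pairs, diameter = 0, 0, 0
    for u in adj:
        for v, h in _bfs(adj, u).items():
            if v != u:
                total += h
                pairs += 1
                diameter = max(diameter, h)
    average = total / pairs if pairs else 0.0
    return diameter, average


adj = {1: {2, 4}, 2: {1, 4}, 3: {4}, 4: {1, 2, 3}}
diameter, avg_h = path_stats(adj)
print(diameter, avg_h)  # diameter 2, average 16/12
```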

Clustering coefficient:
• What portion of i’s neighbors are connected?
• Node i with degree $k_i$
• $C_i \in [0,1]$
• $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $e_i$ is the number of edges between the neighbors of node i
[Figure: three example neighborhoods of node i, with $C_i = 0$, $C_i = 1/3$, $C_i = 1$]
• Average Clustering Coefficient: $C = \frac{1}{N} \sum_{i}^{N} C_i$

Note: consequences of clustering: communities. Clustering is correlated with network density, so we need a NULL model to estimate how surprised we are by the clustering coefficient.

Clustering coefficient:
• What portion of i’s neighbors are connected?
• Node i with degree $k_i$
• $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $e_i$ is the number of edges between the neighbors of node i
[Figure: example graph on nodes A to H]
$k_B = 2$, $e_B = 1$, $C_B = 2/2 = 1$
$k_D = 4$, $e_D = 2$, $C_D = 4/12 = 1/3$
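A minimal sketch of the clustering coefficient C_i = 2 e_i / (k_i (k_i - 1)). Returning 0 for nodes with fewer than two neighbors is an assumed convention (the slides do not say), and the 4-node graph is illustrative, not the A to H graph from the figure.

```python
from itertools import combinations


def clustering(adj, i):
    """C_i = 2 e_i / (k_i (k_i - 1)); defined here as 0.0 when k_i < 2."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # e_i: number of edges among the neighbors of i
    e = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * e / (k * (k - 1))


def average_clustering(adj):
    """C = (1/N) * sum_i C_i."""
    return sum(clustering(adj, i) for i in adj) / len(adj)


adj = {1: {2, 4}, 2: {1, 4}, 3: {4}, 4: {1, 2, 3}}
print(clustering(adj, 4))       # k=3, e=1 (only edge 1-2), so C = 2/6
print(average_clustering(adj))  # (1 + 1 + 0 + 1/3) / 4
```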
Degree distribution: P(k)
Path length: h
Clustering coefficient: C
P(k) = δ(k-4)
• k=4 for each node
C = ½ for each node if N>6
Path length: $h_{max} \sim \frac{N}{2}$
Alternative calculation: $\sum_{h=1}^{h_{max}} 4 \approx N \Rightarrow h_{max} \approx \frac{N}{4}$
• The average path-length is $\bar{h} \sim N$
• Constant degree, constant clustering coefficient.

P(k) = δ(k-6)
• k=6 for each inside node
C = 6/15 for inside nodes
Path length: $\sum_{h=1}^{h_{max}} 6h \approx N \Rightarrow h_{max} \approx \sqrt{N}$
• In general, for lattices: average path-length is $\bar{h} \sim N^{1/D}$
• Constant degree, constant clustering coefficient

Degree: $\bar{k} \approx 2$
• k=3 for non-leaves
• k=1 for leaves
C = 0
Path length: we want to find $h_{max}$ such that the sum equals N:
$\sum_{h=1}^{h_{max}} 3 \cdot 2^{h-1} \approx N \Rightarrow h_{max} \approx \log_k N = \frac{\log N}{\log k}$
Side calculation: $\sum_{h=1}^{h_{max}} 2^h \approx \int_1^{h_{max}} 2^h \, dh \approx 2^{h_{max}} - 2 \approx 2^{h_{max}}$ (up to a constant factor), so $2^{h_{max}} \approx N \Rightarrow h_{max} \approx \log_2 N$
• Distances vary logarithmically with N.
• Constant degree, no clustering.
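The three path-length estimates share one idea: add up the number of nodes first reached at each distance h until the total hits N. The sketch below does exactly that; the helper `h_max` and the choice N = 1,000,000 are illustrative, and the per-distance node counts (4, 6h, and 3·2^(h-1)) are the ones used in the sums on these slides.

```python
def h_max(nodes_at_distance, n):
    """Smallest h with sum_{d=1..h} nodes_at_distance(d) >= n."""
    reached, h = 0, 0
    while reached < n:
        h += 1
        reached += nodes_at_distance(h)
    return h


N = 1_000_000
h_line = h_max(lambda h: 4, N)                 # 4 new nodes per step: grows like N
h_plane = h_max(lambda h: 6 * h, N)            # 6h new nodes per step: grows like sqrt(N)
h_tree = h_max(lambda h: 3 * 2 ** (h - 1), N)  # doubling per step: grows like log N
print(h_line, h_plane, h_tree)
```

With a million nodes, the first case needs hundreds of thousands of steps, the second several hundred, and the tree fewer than twenty, which is the contrast the three slides draw.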

Erdös-Renyi Random Graph [Erdös-Renyi, ‘60]

Two variants:
• $G_{n,p}$: undirected graph on n nodes in which each edge (u,v) appears i.i.d. with probability p
• $G_{n,m}$: undirected graph on n nodes with m edges picked uniformly at random
What kinds of networks does such a model produce?
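Both variants are easy to sample with the standard library. This is a sketch, not an efficient generator: it enumerates all n(n-1)/2 node pairs, and the function names `gnp` and `gnm` are illustrative.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """G(n,p): keep each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def gnm(n, m, rng):
    """G(n,m): pick m distinct edges uniformly at random."""
    return rng.sample(list(combinations(range(n), 2)), m)


rng = random.Random(0)
print(len(gnp(10, 1 / 6, rng)))  # varies between realizations
print(len(gnm(10, 7, rng)))      # always 7
```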


n and p do not uniquely define the graph!
We can have many different realizations. How many?
[Figure: several realizations of $G_{np}$ with n = 10, p = 1/6]
The probability of $G_{np}$ to form a particular graph G(N,E) is
$P(G(N,E)) = p^E (1-p)^{\frac{N(N-1)}{2} - E}$
That is, each concrete graph G(N,E) appears with probability P(G(N,E)).


How likely is a graph on E edges?
P(E): the probability that a given $G_{np}$ generates a graph on exactly E edges:
$P(E) = \binom{E_{max}}{E} p^E (1-p)^{E_{max} - E}$
where $E_{max} = n(n-1)/2$ is the maximum possible number of edges
This is a binomial distribution.


What is the expected degree of a node?
Let $X_v$ be a random variable measuring the degree of node v: $E[X_v] = \sum_{j=0}^{n-1} j \, P(X_v = j)$
• Linearity of expectation:
  • For any random variables $Y_1, Y_2, \ldots, Y_k$
  • If $Y = Y_1 + Y_2 + \ldots + Y_k$, then $E[Y] = \sum_i E[Y_i]$

Easier way:
• Decompose $X_v$ into $X_v = X_{v,1} + X_{v,2} + \ldots + X_{v,n-1}$
  • where $X_{v,u}$ is a {0,1}-random variable which tells if edge (v,u) exists or not
$E[X_v] = \sum_{u=1}^{n-1} E[X_{v,u}] = (n-1) p$
How to think about this?
• Prob. of node u linking to node v is p
• u can link (flips a coin) to each of the other (n-1) nodes
• Thus, the expected degree of node u is: p(n-1)

Degree distribution of $G_{np}$ is Binomial.
Let P(k) denote the fraction of nodes with degree k:
$P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
• $\binom{n-1}{k}$: select k nodes from n-1
• $p^k$: probability of having k edges
• $(1-p)^{n-1-k}$: probability of missing n-1-k edges
$\bar{k} = p(n-1)$
$\sigma_k^2 = p(1-p)(n-1)$
$\frac{\sigma_k}{\bar{k}} = \left[ \frac{1-p}{p} \cdot \frac{1}{n-1} \right]^{1/2} \approx \frac{1}{(n-1)^{1/2}}$
As the network size increases, the distribution becomes increasingly narrow: we are increasingly confident that the degree of a node is in the vicinity of $\bar{k}$.
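The mean and variance of the binomial degree distribution can be checked numerically. This sketch evaluates the formula for P(k) directly; the function name `degree_pmf` and the values n = 50, p = 0.1 are arbitrary choices for illustration.

```python
from math import comb


def degree_pmf(n, p):
    """P(k) = C(n-1, k) p^k (1-p)^(n-1-k) for k = 0 .. n-1."""
    return [comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k) for k in range(n)]


n, p = 50, 0.1
pmf = degree_pmf(n, p)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, p * (n - 1))                 # both close to 4.9
print(variance, p * (1 - p) * (n - 1))   # both close to 4.41
```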
$C_i = \frac{2 e_i}{k_i (k_i - 1)}$
Since edges in $G_{np}$ appear i.i.d. with probability p:
$e_i = p \, \frac{k_i (k_i - 1)}{2}$
$C = p \approx \frac{\bar{k}}{N}$
Clustering coefficient of a random graph is small.
For a fixed degree, C decreases with the graph size N.

Configuration model:
• Assume a degree sequence $k_1, k_2, \ldots, k_N$
• Nodes with spokes → randomly pair up “mini”-nodes → resulting graph
[Figure: nodes A, B, C, D with spokes; the randomly paired “mini”-nodes; the resulting graph]
• Useful as a “null” model of networks
  • We can compare the real network G and a “random” graph G’ which has the same degree sequence as G
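A minimal sketch of the pairing step: give node i as many spokes as its degree, shuffle all spokes, and pair them up. The function name `configuration_model` is illustrative, and this simple version makes no attempt to avoid self-loops or parallel edges, which random pairing can produce.

```python
import random


def configuration_model(degrees, rng):
    """Randomly pair up spokes (half-edges); returns a list of edges."""
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    rng.shuffle(stubs)
    # consecutive stubs in the shuffled list form one edge
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]


degrees = [3, 2, 2, 1]
edges = configuration_model(degrees, random.Random(0))
print(edges)  # 4 edges; the pairing depends on the seed
```

Counting a self-loop twice, every node ends up with exactly its prescribed degree, which is what makes the result usable as a null model with the same degree sequence.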

Assume each node has k spokes (half-edges) and randomly pair them up:
• k=1: Graph is a set of pairs
• k=2: Graph is a set of cycles
• k=3: Arbitrarily complicated graphs


Graph G(V, E) has expansion α if ∀ S ⊆ V:
# of edges leaving S ≥ α · min(|S|, |V\S|)
Or equivalently:
$\alpha = \min_{S \subseteq V} \frac{\#\text{ edges leaving } S}{\min(|S|, |V \setminus S|)}$
[Figure: node set split into S and V\S, with edges crossing between them]
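For very small graphs the definition can be evaluated by brute force over all subsets. This sketch is exponential in the number of nodes and meant only to make the definition concrete; the function name `expansion` and the two test graphs are illustrative.

```python
from itertools import combinations


def expansion(nodes, edges):
    """alpha = min over nonempty S of (# edges leaving S) / min(|S|, |V \\ S|)."""
    nodes = list(nodes)
    best = float("inf")
    # S and its complement give the same ratio, so sizes up to n/2 suffice
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            s = set(subset)
            leaving = sum(1 for u, v in edges if (u in s) != (v in s))
            best = min(best, leaving / min(len(s), len(nodes) - len(s)))
    return best


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
complete = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(expansion(range(4), cycle))     # 1.0: two adjacent nodes have 2 edges leaving
print(expansion(range(4), complete))  # 2.0: any 2 nodes have 4 edges leaving
```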
Note: from a set $S_i$ of nodes (with $|S_i| \le |V|/2$) there are at least $\alpha |S_i|$ edges pointing out; this is really what expansion means. It gives a lower bound on how quickly we expect to expand out ($S_i \to S_{i+1}$) when doing a BFS.

Expansion is a measure of robustness:
• To disconnect l nodes, we need to cut ≥ α · l edges

Low expansion: [Figure]

High expansion: [Figure]

Social networks:
• “Communities”

k-regular graph (every node has degree k):
• Expansion is at most k (when S is 1 node)

Is there a graph on n nodes (n → ∞), of fixed max deg. k, so that expansion α remains const?
Examples:
• n×n grid: k=4: α = 2n/(n²/4) → 0 (S = n/2 × n/2 square in the center)
• Complete binary tree: α → 0 for |S| = (n/2) − 1
• Fact: For a random 3-regular graph on n nodes, there is some const α (α > 0, independent of n) such that w.h.p. the expansion of the graph is ≥ α

Fact: In a graph on n nodes with expansion α, for all pairs of nodes s and t there is a path of O((log n) / α) edges connecting them.

Proof:
• Let $S_j$ be the set of all nodes found within j steps of BFS from s.
• Then: $|S_{j+1}| \ge |S_j| + \frac{\alpha |S_j|}{k}$
  • Expansion: at least $\alpha |S_j|$ edges leave $S_j$
  • Edges can “collide”: up to k of them can point to the same new node, hence the division by k
[Figure: BFS layers $S_0$, $S_1$, $S_2$ growing outward from s]

Proof (continued):
• $|S_{j+1}| \ge |S_j| \left(1 + \frac{\alpha}{k}\right) \ge \left(1 + \frac{\alpha}{k}\right)^{j+1}$
• In how many steps of BFS do we reach > n/2 nodes?
• Need j so that: $\left(1 + \frac{\alpha}{k}\right)^{j} \ge \frac{n}{2}$
• Let’s set: $j = \frac{k \log_2 n}{\alpha}$
• Then: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \ge 2^{\log_2 n} = n > \frac{n}{2}$
• In O(log n) steps from s we reach > n/2 nodes, and in O(log n) steps from t we reach > n/2 nodes, so the two searches must share a node.
• In O(log n) steps $|S_j|$ grows to Θ(n). So, the diameter of G is O(log(n) / α)

Note: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \ge 2^{\log_2 n}$ for the numbers we care about (n >> 0, α ≤ k)
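The key inequality of the proof, (1 + α/k)^(k log₂ n / α) ≥ n, can be spot-checked numerically. The sketch below is only a sanity check for a few hand-picked values of n, k and α with 0 < α ≤ k, not a proof; the function name is illustrative.

```python
from math import log2


def growth_lower_bound(n, k, alpha):
    """(1 + alpha/k)^j evaluated at j = k * log2(n) / alpha."""
    j = k * log2(n) / alpha
    return (1 + alpha / k) ** j


for n in (10, 10 ** 3, 10 ** 6):
    for k, alpha in ((3, 0.1), (3, 1.0), (4, 4.0)):
        print(n, k, alpha, growth_lower_bound(n, k, alpha))
```

When α = k the bound equals n (up to floating-point rounding); for smaller α it is larger, approaching e^(log₂ n) as α/k goes to 0.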
Degree distribution: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
Path length: O(log n)
Clustering coefficient: $C = p = \bar{k}/n$




Remember, expected degree $E[X_v] = (n-1) p$
We want $E[X_v]$ to be independent of n
So let: p = c/(n-1)
Observation: If we build a random graph $G_{np}$ with p = c/(n-1), we have many isolated nodes
Why?
$P[v \text{ has degree } 0] = (1-p)^{n-1} = \left(1 - \frac{c}{n-1}\right)^{n-1} \xrightarrow[n \to \infty]{} e^{-c}$
Use substitution $\frac{1}{x} = \frac{c}{n-1}$:
$\lim_{n \to \infty} \left(1 - \frac{c}{n-1}\right)^{n-1} = \lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{xc} = \left[ \lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} \right]^{-c} = e^{-c}$
By definition: $e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}$
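A quick numerical look at the limit: for p = c/(n-1), the probability that a given node is isolated, (1-p)^(n-1), approaches e^(-c) as n grows. The function name `prob_isolated` and the choice c = 2 are illustrative.

```python
from math import exp


def prob_isolated(n, c):
    """P[v has degree 0] = (1 - p)^(n-1) with p = c / (n - 1)."""
    p = c / (n - 1)
    return (1 - p) ** (n - 1)


c = 2.0
for n in (10, 1000, 10 ** 6):
    print(n, prob_isolated(n, c))
print("limit:", exp(-c))
```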
How big do we have to make p before we are likely to have no isolated nodes?
We know: P[v has degree 0] = $e^{-c}$
Event we are asking about is:
• I = some node is isolated
• $I = \bigcup_{v \in N} I_v$, where $I_v$ is the event that v is isolated
We have:
$P(I) = P\left( \bigcup_{v \in N} I_v \right) \le \sum_{v \in N} P(I_v) = n e^{-c}$
Union bound: $P\left( \bigcup_i A_i \right) \le \sum_i P(A_i)$


We just learned: $P(I) \le n e^{-c}$
Let’s try:
• c = ln n, then: $n e^{-c} = n e^{-\ln n} = n \cdot \frac{1}{n} = 1$
• c = 2 ln n, then: $n e^{-2 \ln n} = n \cdot \frac{1}{n^2} = \frac{1}{n}$
So if:
• p = ln n / (n-1), then: P(I) ≤ 1
• p = 2 ln n / (n-1), then: P(I) ≤ 1/n → 0 as n → ∞

Graph structure of $G_{np}$ as p changes:

p                  Structure
0                  Empty graph
1/(n-1)            Giant component appears
c/(n-1)            Avg. deg const. Lots of isolated nodes.
ln(n)/(n-1)        Fewer isolated nodes.
2*ln(n)/(n-1)      No isolated nodes.
1                  Complete graph

Emergence of a Giant Component:
avg. degree k = 2E/n or p = k/(n-1)
• k = 1-ε: all components are of size O(ln n)
• k = 1+ε: 1 component of size Ω(n), others have size O(ln n)
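The emergence of the giant component can be observed in a small simulation. This sketch samples G(n,p) with p = k/(n-1) and tracks components with a union-find structure; the function name, n = 1000, and the two average degrees are illustrative choices, and individual runs vary with the random seed.

```python
import random
from itertools import combinations


def largest_component(n, avg_degree, rng):
    """Size of the largest connected component of one G(n,p) sample."""
    p = avg_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(range(n), 2):
        if rng.random() < p:         # edge (u, v) is present
            parent[find(u)] = find(v)
    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


rng = random.Random(0)
below = largest_component(1000, 0.5, rng)  # avg. degree below 1
above = largest_component(1000, 2.0, rng)  # avg. degree above 1
print(below, above)
```

Below the threshold the largest component stays tiny compared with n, while above it a single component contains a large fraction of all nodes.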