Transcript Slide 1
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Review of basic probability:
Today, Thu 9/29
In Gates B01, 4-6pm
Review of basic linear algebra:
Tomorrow, Fri 9/30
Gates B03, 4-6pm
Next week:
Intro to SNAP (Gates B01, 4-6pm on Thu 10/6)
Intro to NetworkX (Gates B03, 4-6pm on Fri 10/7)
7/17/2015
Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
2
Recall from the last lecture:
1) We took a real system: the Web
2) We represented it as a directed graph
3) We used the language of graph theory
Strongly Connected Components
4) We designed a computational experiment:
Find In- and Out-components of a given node v
5) We learned something about the
structure of the Web
This class:
Define basic terminology and properties
that you can compute on networks
Undirected
Links: undirected (symmetrical)
Undirected links:
Collaborations
Friendship on Facebook
Directed
Links: directed (arcs)
Directed links:
Phone calls
Following on Twitter
Note (joke): a sex network is undirected.
[Figure: two example graphs on lettered nodes, one undirected and one directed]
Adjacency matrix A:
$A_{ij} = 1$ if there is a link between node i and j
$A_{ij} = 0$ if nodes i and j are not connected to each other
[Figure: two 4-node example graphs (undirected on the left, directed on the right), each shown with its 4x4 adjacency matrix A]
Note that for a directed graph (right) the matrix is not symmetric.
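The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the helper names `adjacency_matrix` and `is_symmetric` and the 4-node edge list are assumptions made up for the example (the graph is not the one drawn on the slide).

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from a list of (i, j) pairs."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1          # link from i to j
        if not directed:
            A[j][i] = 1      # undirected links are symmetric
    return A


def is_symmetric(A):
    """True if A[i][j] == A[j][i] for every pair (i, j)."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


# Hypothetical 4-node example (nodes numbered 0..3).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
A_undirected = adjacency_matrix(4, edges)
A_directed = adjacency_matrix(4, edges, directed=True)
```

Reading the same edge list as undirected gives a symmetric matrix; reading it as directed generally does not.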
Undirected
Node degree, $k_i$: the number of links connected to node i
(in the figure, $k_i = 4$)
Avg. degree: $\bar{k} = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2E}{N}$
Directed
In directed networks we define an in-degree and out-degree.
The (total) degree of a node is the sum of in- and out-degree.
(in the figure, $k_C^{in} = 2$, $k_C^{out} = 1$, $k_C = 3$)
Avg. degree: $\bar{k}^{in} = \bar{k}^{out} = \frac{E}{N}$
Source: A node with $k^{in} = 0$
Sink: A node with $k^{out} = 0$
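As a rough sketch of the degree definitions, the snippet below counts degrees from an edge list. The function names and the small edge list are hypothetical, chosen only for illustration.

```python
def degrees_undirected(n, edges):
    """Degree k_i of every node in an undirected graph."""
    k = [0] * n
    for i, j in edges:
        k[i] += 1
        k[j] += 1
    return k


def degrees_directed(n, edges):
    """In-degree and out-degree of every node; each pair (i, j) is an edge i -> j."""
    k_in, k_out = [0] * n, [0] * n
    for i, j in edges:
        k_out[i] += 1
        k_in[j] += 1
    return k_in, k_out


# Hypothetical example: N = 4 nodes, E = 4 edges.
N = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

k = degrees_undirected(N, edges)
avg_k = sum(k) / N                       # equals 2E/N

k_in, k_out = degrees_directed(N, edges)
sources = [v for v in range(N) if k_in[v] == 0]
sinks = [v for v in range(N) if k_out[v] == 0]
```

Every undirected edge adds to two degrees, which is where the factor 2 in 2E/N comes from; every directed edge adds one in-degree and one out-degree, so both averages equal E/N.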
The maximum number of edges in an undirected graph on N nodes is
$E_{max} = \binom{N}{2} = \frac{N(N-1)}{2}$
A graph with the number of edges $E = E_{max}$ is a complete graph, and its average degree is N-1
Most real-world networks are sparse
$E \ll E_{max}$ (or $\bar{k} \ll N-1$)
WWW (Stanford-Berkeley): N=319,717, k=9.65
Social networks (LinkedIn): N=6,946,668, k=8.87
Communication (MSN IM): N=242,720,596, k=11.1
Coauthorships (DBLP): N=317,080, k=6.62
Internet (AS-Skitter): N=1,719,037, k=14.91
Roads (California): N=1,957,027, k=2.82
Protein (S. Cerevisiae): N=1,870, k=2.39
(Source: Leskovec et al., Internet Mathematics, 2009)
Consequence: Adjacency matrix is filled with zeros!
(Density ($E/N^2$): WWW = $1.51 \times 10^{-5}$, MSN IM = $2.27 \times 10^{-8}$)
Unweighted (undirected)
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} A_{ij}$, $\bar{k} = \frac{2E}{N}$
Examples: Friendships, WWW
Weighted (undirected)
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$
Examples: Call graph, Email graph
[Figure: a 4-node graph drawn unweighted and weighted, each with its adjacency matrix $A_{ij}$; the weighted matrix has entries such as 0.5, 2 and 4]
Self-edges (undirected)
$A_{ii} \neq 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1, i \neq j}^{N} A_{ij} + \sum_{i=1}^{N} A_{ii}$
Examples: WWW, Email
Multigraph (undirected)
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$
Examples: Social networks, collaboration networks
[Figure: a 4-node graph with a self-edge and a 4-node multigraph, each with its adjacency matrix $A_{ij}$; the multigraph matrix has entries such as 2 and 3]
WWW >> directed multigraph with self-interactions
Facebook friendships >> undirected, unweighted
Citation networks >> unweighted directed acyclic
Collaboration networks >> undirected multigraph or weighted
Mobile phone calls >> directed, (weighted?) multigraph
Protein Interactions >> undirected, unweighted with self-interactions
Bipartite graph is a graph whose nodes
can be divided into two disjoint sets U and V
such that every link connects a node in U to one
in V; that is, U and V are independent sets.
Examples:
Authors-to-papers
Movies-to-Actors
Users-to-Movies
“Folded” networks:
Author collaboration networks
Actor collaboration networks
[Figure: bipartite graph with node sets U and V]
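One way to picture "folding" is the sketch below, which projects an authors-to-papers bipartite graph onto the author side. The function name `fold` and the toy data are assumptions for illustration; the rule used here (link two authors if they share at least one paper) is one common choice of projection.

```python
from itertools import combinations


def fold(memberships):
    """Project a bipartite graph onto its U side.

    `memberships` maps each V-node (e.g. a paper) to the U-nodes
    (e.g. authors) it is linked to. Two U-nodes are linked in the
    folded network if they share at least one V-node.
    """
    folded = set()
    for u_nodes in memberships.values():
        for a, b in combinations(sorted(set(u_nodes)), 2):
            folded.add((a, b))
    return folded


# Hypothetical authors-to-papers data.
papers = {
    "p1": ["ann", "bob"],
    "p2": ["bob", "cat", "dan"],
    "p3": ["ann", "bob"],
}
coauthors = fold(papers)
```

Because the result is a set, repeated collaborations (ann and bob on p1 and p3) collapse into one link; keeping a count instead would give a weighted or multigraph version.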
Degree distribution P(k): Probability that
a randomly chosen node has degree k
Nk = # nodes with degree k
P(k) = Nk / N ➔ plot
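A minimal sketch of computing P(k) from a list of node degrees; the function name and the degree list are made up for the example.

```python
from collections import Counter


def degree_distribution(degrees):
    """Return {k: P(k)} where P(k) = N_k / N."""
    n = len(degrees)
    counts = Counter(degrees)            # N_k for every observed k
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical degrees of N = 10 nodes.
P = degree_distribution([1, 2, 2, 3, 2, 1, 3, 2, 2, 4])
```

The values of P sum to 1, and plotting them against k gives the degree-distribution histogram.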
[Figure: histogram of P(k) against k for k = 1 to 4; the P(k) axis runs from 0.1 to 0.6]
A path is a sequence of nodes in which each
node is linked to the next one
As a sequence of nodes: $P_n = \{i_0, i_1, i_2, \ldots, i_n\}$
As a sequence of links: $P_n = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\}$
Path can intersect itself
and pass through the
same edge multiple times
E.g.: ACBDCDEG
In a directed graph a path
can only follow the direction
of the “arrow”
[Figure: example graph on nodes A to H]
Note: the illustration could do a better job of showing that this is really just matrix powering.
Number of paths between nodes u and v:
Length h=1: If there is a link between u and v, $A_{uv} = 1$ else $A_{uv} = 0$
Length h=2: If there is a path of length two between u and v then $A_{uk} A_{kv} = 1$ else $A_{uk} A_{kv} = 0$
$H_{uv}^{(2)} = \sum_{k=1}^{N} A_{uk} A_{kv} = [A^2]_{uv}$
Length h: If there is a path of length h between u and v then $A_{uk} \cdots A_{kv} = 1$ else $A_{uk} \cdots A_{kv} = 0$
So, the no. of paths of length h between u and v is
$H_{uv}^{(h)} = [A^h]_{uv}$
(holds for both directed and undirected graphs)
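The matrix-power formula can be checked on a small case. The sketch below uses plain Python lists and a triangle graph chosen for illustration; `mat_mul` and `mat_pow` are hypothetical helper names. As on the path slide, these paths may revisit nodes and edges.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(A, h):
    """A raised to the integer power h >= 0."""
    n = len(A)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(h):
        result = mat_mul(result, A)
    return result


# Triangle on nodes 0, 1, 2 (undirected).
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]

H2 = mat_pow(A, 2)   # H2[u][v] = number of length-2 paths from u to v
H3 = mat_pow(A, 3)   # H3[u][v] = number of length-3 paths from u to v
```

In the triangle, node 0 returns to itself in two steps via node 1 or node 2, which is why the diagonal of `H2` is 2.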
Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes.
*If the two nodes are disconnected, the distance is defined as infinite
In directed graphs paths need to follow the direction of the arrows.
Consequence: Distance is not symmetric: h(A,C) ≠ h(C,A)
[Figure: two graphs on nodes A, B, C, D, one undirected and one directed]
Breadth-First Search:
Start with node u, mark it to be at distance $h_u(u) = 0$, add u to the queue
While the queue is not empty:
Take node v off the queue, put each of its unmarked neighbors w into the queue and mark $h_u(w) = h_u(v) + 1$
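The steps above can be sketched as follows. The adjacency-list format (a dict from node to list of neighbors) and the example graph are assumptions for illustration; nodes that are never reached are simply left out of the result, which corresponds to infinite distance.

```python
from collections import deque


def bfs_distances(adj, u):
    """Distances h_u(w) from u to every node reachable from u."""
    dist = {u: 0}                 # marking a node = giving it a distance
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:     # w is still unmarked
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


# Hypothetical undirected graph; F is disconnected from the rest.
adj = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],
}
h = bfs_distances(adj, "A")
```

For a directed graph the same code works if `adj[v]` lists only the out-neighbors of v.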
[Figure: example graph with each node labeled by its BFS distance (0 to 4) from the start node u]
Diameter: the maximum (shortest path)
distance between any pair of nodes in the
graph
Average path length/distance for a connected
graph (component) or a strongly connected
(component of a) directed graph
$\bar{h} = \frac{1}{2 E_{max}} \sum_{i, j \neq i} h_{ij}$
where hij is the distance from node i to node j
Many times we compute the average only over the
connected pairs of nodes (i.e., we ignore “infinite” paths)
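A small sketch of both quantities, assuming the "connected pairs only" convention mentioned above: it runs a BFS from every node and averages over the ordered pairs (i, j), j ≠ i, that are reachable. Function names and the 4-node path graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, u):
    """Distances from u to every node reachable from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def path_stats(adj):
    """Return (diameter, average distance) over connected ordered pairs."""
    total, pairs, diameter = 0, 0, 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    average = total / pairs if pairs else 0.0
    return diameter, average


# Hypothetical example: the path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
diameter, avg_h = path_stats(adj)
```

On a connected undirected graph the number of ordered pairs is N(N-1), which is the $2E_{max}$ in the formula. Running one BFS per node costs O(N(N+E)) in total, so this is only practical for small graphs.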
Clustering coefficient:
What portion of i's neighbors are connected?
Node i with degree $k_i$
$C_i = \frac{2 e_i}{k_i (k_i - 1)}$, $C_i \in [0,1]$
where $e_i$ is the number of edges between the neighbors of node i
[Figure: three example neighborhoods of node i with $C_i = 0$, $C_i = 1/3$ and $C_i = 1$]
Average Clustering Coefficient: $C = \frac{1}{N} \sum_{i}^{N} C_i$
Note: Consequence of clustering: communities. Clustering is correlated with network density; we need a NULL model to estimate how surprised we are by the CCF.
Clustering coefficient:
What portion of i’s neighbors are connected?
Node i with degree ki
where ei is the number of edges
between the neighbors of node i
[Figure: example graph on nodes A to H]
kB=2, eB=1, CB=2/2 = 1
kD=4, eD=2, CD=4/12 = 1/3
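The two worked values can be reproduced with a short sketch of $C_i = 2e_i / (k_i(k_i-1))$. The graph below is hypothetical: it is not the graph from the slide, only a small one built so that B and D have the same $k$ and $e$ as in the example. Returning 0 for nodes with fewer than two neighbors is a convention assumed here.

```python
def clustering(adj, i):
    """Clustering coefficient of node i; adj maps node -> set of neighbors."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0                      # assumed convention for k < 2
    # e = number of edges among the neighbors of i (each pair counted once)
    e = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * e / (k * (k - 1))


# Hypothetical graph with edges A-B, A-D, B-D, C-D, C-E, D-E.
adj = {
    "A": {"B", "D"},
    "B": {"A", "D"},
    "C": {"D", "E"},
    "D": {"A", "B", "C", "E"},
    "E": {"C", "D"},
}
C_B = clustering(adj, "B")              # k=2, e=1
C_D = clustering(adj, "D")              # k=4, e=2
C_avg = sum(clustering(adj, v) for v in adj) / len(adj)
```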
Degree distribution:
P(k)
Path length:
h
Clustering coefficient:
C
Note: needs a better explanation of the alternative calculation, with a picture.
[Figure: one-dimensional lattice in which every node has degree 4]
$P(k) = \delta(k-4)$
k=4 for each node
C = ½ for each node if N>6
Path length: $h_{max} \approx \frac{N}{2}$
Alternative calculation: $\sum_{h=1}^{h_{max}} 4 \approx N \Rightarrow h_{max} \approx \frac{N}{4}$
Either way $h_{max}$ grows linearly with N: the average path-length is $\bar{h} \sim N$
Constant degree, constant clustering coefficient.
[Figure: two-dimensional lattice in which every inside node has degree 6]
$P(k) = \delta(k-6)$
k=6 for each inside node
C = 6/15 for inside nodes
Path length: $\sum_{h=1}^{h_{max}} 6h \approx N \Rightarrow h_{max} \sim \sqrt{N}$
In general, for lattices: average path-length is $\bar{h} \sim N^{1/D}$
Constant degree, constant clustering coefficient
Note: use an animation to make better use of the side calculation. Explain the idea that we want to find h_max such that the sum is = N.
[Figure: tree in which every non-leaf node has degree 3]
Degree: $\bar{k} \approx 2$
k=3 for non-leaves
k=1 for leaves
C=0
Path length: $\sum_{h=1}^{h_{max}} 3 \cdot 2^{h-1} \approx N \Rightarrow h_{max} \approx \frac{\log N}{\log k}$
Side calculation: $\sum_{h=1}^{h_{max}} 2^{h} \approx \int_{1}^{h_{max}} 2^{x}\,dx \approx 2^{h_{max}}$ (up to constant factors), so $2^{h_{max}} \approx N \Rightarrow h_{max} \approx \log_2 N$
Distances vary logarithmically with N.
Constant degree, no clustering.
Erdős-Rényi Random Graph [Erdős-Rényi, ‘60]
Two variants:
$G_{n,p}$: undirected graph on n nodes where each edge (u,v) appears i.i.d. with probability p
$G_{n,m}$: undirected graph with n nodes and m edges picked uniformly at random
What kinds of networks does such a model produce?
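Both variants are easy to sample on small graphs. The sketch below is an illustration under stated assumptions: the function names `gnp` and `gnm` are made up, edges are returned as pairs (u, v) with u < v, and every possible pair is enumerated, which is only reasonable for small n.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """G(n,p): keep each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def gnm(n, m, rng):
    """G(n,m): pick m distinct edges uniformly at random."""
    return rng.sample(list(combinations(range(n), 2)), m)


rng = random.Random(0)        # fixed seed so the run is repeatable
edges_p = gnp(10, 1 / 6, rng)
edges_m = gnm(10, 7, rng)
```

In `gnm` the number of edges is fixed at m, while in `gnp` it is itself random and varies from one realization to the next.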
n and p do not uniquely define the graph!
We can have many different realizations. How many?
[Figure: several realizations of $G_{np}$ for n = 10, p = 1/6]
The probability of $G_{np}$ to form a particular graph G(N,E) is
$P(G(N,E)) = p^{E} (1-p)^{\frac{N(N-1)}{2} - E}$
That is, each concrete graph G(N,E) appears with probability P(G(N,E)).
How likely is a graph on E edges?
P(E): the probability that a given $G_{np}$ generates a graph on exactly E edges:
$P(E) = \binom{E_{max}}{E} p^{E} (1-p)^{E_{max} - E}$
where $E_{max} = n(n-1)/2$ is the maximum possible number of edges
This is a Binomial distribution.
What is the expected degree of a node?
Let $X_v$ be a random var. measuring the degree of the node v:
$E[X_v] = \sum_{j=0}^{n-1} j \, P(X_v = j)$
Linearity of expectation:
For any random variables $Y_1, Y_2, \ldots, Y_k$
If $Y = Y_1 + Y_2 + \ldots + Y_k$, then $E[Y] = \sum_i E[Y_i]$
Easier way:
Decompose $X_v$ in $X_v = X_{v1} + X_{v2} + \ldots + X_{v,n-1}$
where $X_{vu}$ is a {0,1}-random variable which tells if edge (v,u) exists or not
$E[X_v] = \sum_{u=1}^{n-1} E[X_{vu}] = (n-1)p$
How to think about this?
• Prob. of node u linking to node v is p
• u can link (flips a coin) for all other (n-1) nodes
• Thus, the expected degree of node u is: p(n-1)
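The result p(n-1) can be sanity-checked by simulation. This is a rough sketch with arbitrarily chosen n, p, seed and number of trials; `average_degree_gnp` is a hypothetical helper that flips one coin per possible edge.

```python
import random
from itertools import combinations


def average_degree_gnp(n, p, rng):
    """Sample one G(n,p) graph and return its average degree."""
    deg = [0] * n
    for u, v in combinations(range(n), 2):
        if rng.random() < p:      # coin flip for edge (u, v)
            deg[u] += 1
            deg[v] += 1
    return sum(deg) / n


rng = random.Random(1)
n, p = 200, 0.05
trials = [average_degree_gnp(n, p, rng) for _ in range(20)]
observed = sum(trials) / len(trials)
expected = p * (n - 1)            # (n-1)p from linearity of expectation
```

With these settings the observed average should land close to `expected`; the gap shrinks as n or the number of trials grows.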
Degree distribution of $G_{np}$ is Binomial.
Let P(k) denote the fraction of nodes with degree k:
$P(k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$
$\binom{n-1}{k}$: select k nodes from n-1
$p^{k}$: probability of having k edges
$(1-p)^{n-1-k}$: probability of missing the other n-1-k edges
$\bar{k} = p(n-1)$
$\sigma_k^2 = p(1-p)(n-1)$
$\frac{\sigma_k}{\bar{k}} = \left[\frac{1-p}{p} \cdot \frac{1}{n-1}\right]^{1/2} \approx \frac{1}{(n-1)^{1/2}}$
As the network size increases, the distribution becomes increasingly narrow; we are increasingly confident that the degree of a node is in the vicinity of $\bar{k}$.
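The mean and variance quoted above can be checked numerically from the binomial formula. A minimal sketch, with n and p picked arbitrarily and `degree_pmf` a hypothetical helper name:

```python
from math import comb


def degree_pmf(n, p):
    """P(k) for k = 0..n-1 in G(n,p): Binomial(n-1, p)."""
    return [comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k) for k in range(n)]


n, p = 50, 0.1
P = degree_pmf(n, p)

mean = sum(k * P[k] for k in range(n))                  # should match p(n-1)
var = sum((k - mean) ** 2 * P[k] for k in range(n))     # should match p(1-p)(n-1)
rel_width = var**0.5 / mean                             # sigma_k / mean degree
```

Increasing n at fixed p makes `rel_width` shrink roughly like $1/\sqrt{n-1}$, which is the narrowing described above.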
Clustering coefficient of $G_{np}$:
$C_i = \frac{2 e_i}{k_i (k_i - 1)}$
Since edges in $G_{np}$ appear i.i.d. with probability p, in expectation
$e_i = p \, \frac{k_i (k_i - 1)}{2}$
$C = p \approx \frac{\bar{k}}{N}$
Clustering coefficient of a random graph is small.
For a fixed degree C decreases with the graph size N.
Configuration model:
Assume a degree sequence $k_1, k_2, \ldots, k_N$
Nodes with spokes
Randomly pair up “mini”-nodes
Resulting graph
[Figure: nodes A, B, C, D drawn with spokes, then with the spokes randomly paired, then as the resulting graph]
Useful as a “null” model of networks
We can compare the real network G and a “random” graph G’ which has the same degree sequence as G
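The spoke-pairing idea can be sketched as below. This is one simple way to do the pairing (shuffle all spokes, then pair neighbors in the shuffled list), assumed here for illustration; it can produce self-loops and repeated edges, which this sketch keeps rather than rejects. The function name and the degree sequence are made up.

```python
import random


def configuration_model(degrees, rng):
    """Randomly pair up spokes (half-edges) to realize a degree sequence."""
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    # One spoke per unit of degree: node v appears degrees[v] times.
    spokes = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(spokes)
    # Pair consecutive spokes in the shuffled order.
    return list(zip(spokes[0::2], spokes[1::2]))


degrees = [3, 2, 2, 1]                      # hypothetical degree sequence
edges = configuration_model(degrees, random.Random(0))

# Recount degrees from the sampled edges (a self-loop counts twice).
realized = [0] * len(degrees)
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
```

Whatever the random pairing, the realized degrees equal the input sequence, which is what makes the model usable as a degree-preserving null model.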
Assume each node has k spokes (half-edges) and randomly pair them up:
k=1: Graph is a set of pairs
k=2: Graph is a set of cycles
k=3: Arbitrarily complicated graphs
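Graph G(V, E) has expansion α: if $\forall S \subseteq V$:
# of edges leaving S $\ge \alpha \cdot \min(|S|, |V \setminus S|)$
Or equivalently:
$\alpha = \min_{S \subseteq V} \frac{\#\text{ edges leaving } S}{\min(|S|, |V \setminus S|)}$
[Figure: node set V split into S and V\S, with the edges between them]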
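For very small graphs the definition can be evaluated by brute force. The sketch below assumes S ranges over non-empty sets with |S| ≤ |V|/2, which is enough because S and V\S have the same cut and the same min(|S|, |V\S|). It tries every such subset, so it takes exponential time and is meant only as an illustration; the function name and example graphs are hypothetical.

```python
from itertools import combinations


def expansion(nodes, edges):
    """Brute-force expansion: min over S of (#edges leaving S) / min(|S|, |V\\S|)."""
    nodes = list(nodes)
    n = len(nodes)
    best = float("inf")
    for size in range(1, n // 2 + 1):        # |S| <= n/2, so min(|S|, |V\S|) = |S|
        for subset in combinations(nodes, size):
            S = set(subset)
            # An edge leaves S if exactly one endpoint is in S.
            cut = sum(1 for u, v in edges if (u in S) != (v in S))
            best = min(best, cut / size)
    return best


cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
complete4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

alpha_cycle = expansion(range(4), cycle4)
alpha_complete = expansion(range(4), complete4)
```

On the 4-cycle the worst set is two adjacent nodes (2 edges leave, |S| = 2); on the complete graph every cut is larger, so its expansion is higher.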
Note: one more figure of expansion would help. Basically, show $S_i \to S_{i+1}$: the idea is that from a set of $S_i$ nodes there are $\alpha |S_i|$ edges pointing out; this is really expansion. It gives a lower bound on how quickly we expect to expand out when doing a BFS.
Expansion is a measure of robustness:
To disconnect l nodes, we need to cut $\ge \alpha \cdot l$ edges
[Figure: a graph with low expansion and a graph with high expansion]
Social networks: “Communities”
k-regular graph (every node has degree k):
Expansion is at most k (when S is 1 node)
Is there a graph on n nodes ($n \to \infty$), of fixed max deg. k, so that expansion α remains const?
Examples:
n×n grid: k=4: $\alpha = 2n / (n^2/4) \to 0$
(S = n/2 × n/2 square in the center)
Complete binary tree: $\alpha \to 0$ for |S| = (n/2)-1
Fact: For a random 3-regular graph on n nodes, there is some const α (α>0, independent of n) such that w.h.p. the expansion of the graph is $\ge \alpha$
Fact: In a graph on n nodes with expansion α
for all pairs of nodes s and t there is a path of
O((log n) / α) edges connecting them.
Proof:
Let $S_j$ be a set of all nodes found within j steps of BFS from s, and let k be the maximum degree.
Then, by expansion (while $|S_j| \le n/2$):
$|S_{j+1}| \ge |S_j| + \frac{\alpha |S_j|}{k}$
(at least $\alpha |S_j|$ edges leave $S_j$; edges can “collide”, but at most k of them end at the same node)
[Figure: BFS layers $S_0$, $S_1$, $S_2$ growing outward from s]
Proof (continued):
$|S_{j+1}| \ge |S_j| \left(1 + \frac{\alpha}{k}\right) \ge \left(1 + \frac{\alpha}{k}\right)^{j+1}$
In how many steps of BFS do we reach >n/2 nodes?
Need j so that: $\left(1 + \frac{\alpha}{k}\right)^{j} \ge \frac{n}{2}$
Let’s set: $j = \frac{k \log_2 n}{\alpha}$
Then: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \ge 2^{\log_2 n} = n > \frac{n}{2}$
Note: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \ge 2^{\log_2 n}$ holds for the numbers we care about (n>>0, α<k)
[Figure: BFS balls around s and around t; in log(n) steps each reaches >n/2 nodes, so the two balls must share a node]
In O(log n) steps $|S_j|$ grows to Θ(n).
So, the diameter of G is O(log(n)/α)
Degree distribution: $P(k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$
Path length: O(log n)
Clustering coefficient: $C = p = \bar{k}/n$
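Remember, expected degree $E[X_v] = (n-1)p$
We want $E[X_v]$ to be independent of n
So let: p = c/(n-1)
Observation: If we build random graph $G_{np}$ with p = c/(n-1) we have many isolated nodes
Why?
$P[v \text{ has degree } 0] = (1-p)^{n-1} = \left(1 - \frac{c}{n-1}\right)^{n-1} \to e^{-c}$ as $n \to \infty$
Use substitution $\frac{1}{x} = \frac{c}{n-1}$:
$\lim_{n \to \infty} \left(1 - \frac{c}{n-1}\right)^{n-1} = \lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{xc} = \left[\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{x}\right]^{c} = e^{-c}$
By definition: $e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}$, and likewise $\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{x} = e^{-1}$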
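The limit can be checked numerically: for p = c/(n-1), the probability $(1-p)^{n-1}$ that a given node is isolated approaches $e^{-c}$ as n grows. A small sketch, with c = 2 and the values of n picked arbitrarily; `prob_isolated` is a hypothetical helper name.

```python
import math


def prob_isolated(n, c):
    """P[v has degree 0] in G(n,p) with p = c/(n-1)."""
    p = c / (n - 1)
    return (1 - p) ** (n - 1)


c = 2.0
limit = math.exp(-c)
# Gap between the finite-n probability and the limit e^{-c}.
gaps = [abs(prob_isolated(n, c) - limit) for n in (10, 100, 10_000, 1_000_000)]
```

The gaps shrink as n increases, so with constant average degree c a constant fraction of about $e^{-c}$ of the nodes stays isolated.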
How big do we have to make p before we are likely to have no isolated nodes?
We know: $P[v \text{ has degree } 0] = e^{-c}$
Event we are asking about is:
I = some node is isolated
$I = \bigcup_{v \in N} I_v$ where $I_v$ is the event that v is isolated
We have:
$P(I) = P\left(\bigcup_{v \in N} I_v\right) \le \sum_{v \in N} P(I_v) = n e^{-c}$
Union bound: $\left|\bigcup_i A_i\right| \le \sum_i |A_i|$
We just learned: $P(I) \le n e^{-c}$
Let’s try:
c = ln n: then $n e^{-c} = n e^{-\ln n} = n \cdot \frac{1}{n} = 1$
c = 2 ln n: then $n e^{-2 \ln n} = n \cdot \frac{1}{n^2} = \frac{1}{n}$
So if:
p = ln n / (n-1): then $P(I) \le 1$
p = 2 ln n / (n-1): then $P(I) \le 1/n \to 0$ as $n \to \infty$
Graph structure of $G_{np}$ as p changes:
p = 0: Empty graph
p = 1/(n-1): Giant component appears
p = c/(n-1): Avg. deg const. Lots of isolated nodes.
p = ln(n)/(n-1): Fewer isolated nodes.
p = 2*ln(n)/(n-1): No isolated nodes.
p = 1: Complete graph
Emergence of a Giant Component:
avg. degree k=2E/n or p=k/(n-1)
k=1-ε: all components are of size O(ln n)
k=1+ε: 1 component of size Ω(n), others have size O(ln n)
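The transition can be seen in a small simulation. The sketch below is an illustration under arbitrary choices (n = 1000, average degrees 0.5 and 2.0, a fixed seed); it samples $G_{np}$ with p = k/(n-1) and tracks components with a simple union-find. The function name is hypothetical.

```python
import random
from itertools import combinations


def largest_component_fraction(n, avg_degree, rng):
    """Sample G(n,p) with p = avg_degree/(n-1); return |largest component| / n."""
    p = avg_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        # Find the component representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            parent[find(u)] = find(v)       # merge the two components

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


rng = random.Random(42)
below = largest_component_fraction(1000, 0.5, rng)   # avg. degree below 1
above = largest_component_fraction(1000, 2.0, rng)   # avg. degree above 1
```

Below the threshold the largest component covers only a small fraction of the nodes; above it a single component spans most of the graph. Values of k very close to 1 need much larger n to show the effect clearly.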
Note: show a few phase transitions.
Note: good to move quickly to social media and web applications; connect the 6 degrees of separation with MSN and edge locality examples.