Transcript Slide 1

The Link Prediction Problem
for Social Networks
David Liben-Nowell, MIT
Jon Kleinberg, Cornell
Saswat Mishra
sxm111131
Summary



The “Link Prediction Problem”
Given a snapshot of a social network, can we infer
which new interactions among its members are likely to
occur in the near future?
Based on “proximity” of nodes in a network
Introduction

Nodes = people/entities
Edges = interaction/collaboration
Natural examples of social networks:
Nodes                          Edges
Scientists in a discipline     Co-authors of a paper
Employees in a large company   Working on a project
Business leaders               Serve together on a board
Motivation

Understanding how social networks evolve

The link prediction problem
Given a snapshot of a social network at time t, we seek
to accurately predict the edges that will be added to the
network during the interval (t, t’)

Why?

To suggest interactions or collaborations that haven’t yet
been utilized within an organization

To monitor terrorist networks - to deduce possible
interaction between terrorists (without direct evidence)

Used by Facebook and LinkedIn to suggest friends

Open Question: How does Facebook do it?
(friends of friends, same school, manually…)
Motivation

[Figure: co-authorship network for scientists, with nodes A, B, C, D]
Scientists who are “close” in the network
will have common colleagues & circles –
likely to collaborate
Caveat: scientists who have never collaborated may still do so
in the future, and such links are hard to predict

Goal: make that intuitive notion precise;
understand which measures of
“proximity” lead to accurate predictions
Goals

Present measures of proximity

Understand relative effectiveness of network proximity
measures (adapted from graph theory, CS, social sciences)

Show that prediction by proximity outperforms random
predictions by a factor of 40 to 50

Show that subtle measures outperform more direct
measures
Data and Experimental Setup


Co-authorship network (G) from “author list” of the
physics e-Print arXiv (www.arxiv.org)
Took 5 such networks from 5 sections of the arXiv
[Figure: example co-authorship graphs for the training and test intervals]
Training interval [1994,1996], Ktraining = 3
Test interval [1997,1999], Ktest = 3
Core: set of authors who have at least 3 papers during both training and test
G[1994,1996] = Gcollab = (A,Eold)
Enew = new collaborations (edges)
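
The split above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name build_split, the (year, authors) input format and the toy paper list are all assumptions, and Enew is restricted here to pairs whose authors both appear in the training graph (also an assumption, made so that every predicted pair can be scored).

```python
from collections import Counter
from itertools import combinations

def build_split(papers, train=(1994, 1996), test=(1997, 1999),
                k_training=3, k_test=3):
    """papers: iterable of (year, [author, ...]) tuples (hypothetical format)."""
    train_counts, test_counts = Counter(), Counter()
    e_old, e_test = set(), set()
    for year, authors in papers:
        authors = set(authors)
        pairs = {frozenset(p) for p in combinations(authors, 2)}
        if train[0] <= year <= train[1]:
            train_counts.update(authors)   # papers per author, training interval
            e_old |= pairs
        elif test[0] <= year <= test[1]:
            test_counts.update(authors)    # papers per author, test interval
            e_test |= pairs
    # New collaborations between authors already present in the training graph.
    e_new = {p for p in e_test - e_old if all(a in train_counts for a in p)}
    # Core: authors with enough papers in BOTH intervals.
    core = {a for a in train_counts
            if train_counts[a] >= k_training and test_counts[a] >= k_test}
    return e_old, e_new, core

# Toy data with thresholds lowered to 1 so the example stays small.
papers = [
    (1994, ["A", "B"]),
    (1995, ["B", "C"]),
    (1997, ["A", "C"]),
    (1998, ["A", "B"]),
    (1999, ["C", "D"]),
]
e_old, e_new, core = build_split(papers, k_training=1, k_test=1)
```

Here the pair (C, D) is dropped from Enew because D has no training-interval papers.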
Data
Methods for Link Prediction

Take the input graph during training period Gcollab
Pick a pair of nodes (x, y)
Assign a connection weight score(x, y)
Make a list in descending order of score

score is a measure of proximity
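
The procedure above can be sketched as a generic ranker that accepts any score function. The names rank_candidate_pairs and common_neighbors, the adjacency-set representation and the five-node graph are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def rank_candidate_pairs(graph, score):
    """Return non-adjacent node pairs sorted by decreasing score(graph, x, y)."""
    candidates = [
        (x, y) for x, y in combinations(sorted(graph), 2)
        if y not in graph[x]            # only pairs that are not yet linked
    ]
    return sorted(candidates, key=lambda p: score(graph, *p), reverse=True)

def common_neighbors(graph, x, y):
    """One possible proximity measure: number of shared neighbors."""
    return len(graph[x] & graph[y])

# Hypothetical training graph, stored as node -> set of neighbors.
graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
ranking = rank_candidate_pairs(graph, common_neighbors)
```

Any of the proximity measures discussed later can be passed in place of common_neighbors, as long as it has the same (graph, x, y) signature.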

Any ideas for measures?



Proximity Measures for Link Prediction
Graph distance & Common Neighbors

[Figure: two example graphs on nodes A, B, C, D, E]
Graph distance: (negated) length of the shortest path between x and y
Pair      Score
(A, C)    -2
(C, D)    -2
(A, E)    -3
Common Neighbors: A and C have
2 common neighbors, more likely to
collaborate
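
A minimal sketch of both measures, assuming an unweighted graph stored as node -> set of neighbors. The example graph is hypothetical and is not the one drawn on the slide.

```python
from collections import deque

def graph_distance_score(graph, x, y):
    """Negated shortest-path length from x to y (-inf if y is unreachable)."""
    seen = {x}
    queue = deque([(x, 0)])
    while queue:                         # breadth-first search from x
        node, dist = queue.popleft()
        if node == y:
            return -dist
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("-inf")

def common_neighbors_score(graph, x, y):
    """Number of nodes adjacent to both x and y."""
    return len(graph[x] & graph[y])

graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
dist_ac = graph_distance_score(graph, "A", "C")
dist_ae = graph_distance_score(graph, "A", "E")
cn_ac = common_neighbors_score(graph, "A", "C")
```

Negating the distance keeps the convention that a higher score means a more likely link.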
Jaccard’s coefficient and Adamic / Adar


[Figure: example graph on nodes A, B, C, D, E]
Jaccard’s coefficient: same as
common neighbors, adjusted for
degree
Adamic / Adar: weighting rarer
neighbors more heavily
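
As a sketch of the two definitions, writing Γ(x) for the set of neighbors of x: Jaccard is |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|, and Adamic/Adar sums 1 / log |Γ(z)| over the common neighbors z. The graph and function names below are illustrative assumptions.

```python
import math

def jaccard(graph, x, y):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = graph[x] | graph[y]
    return len(graph[x] & graph[y]) / len(union) if union else 0.0

def adamic_adar(graph, x, y):
    """Sum of 1 / log(degree) over common neighbors of distinct nodes x, y.

    A common neighbor of two distinct nodes has degree >= 2, so the
    logarithm is always positive.
    """
    return sum(1.0 / math.log(len(graph[z])) for z in graph[x] & graph[y])

graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
jac_ac = jaccard(graph, "A", "C")
aa_ac = adamic_adar(graph, "A", "C")
aa_be = adamic_adar(graph, "B", "E")
```

In this example the single common neighbor of B and E (node C, degree 3) contributes less than each degree-2 common neighbor of A and C, which is the "rarer neighbors weigh more" effect.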
Preferential Attachment

Probability that a new collaboration involves x is
proportional to |Γ(x)|, the number of current neighbors of x

score(x, y) := |Γ(x)| · |Γ(y)|
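
A minimal sketch of the preferential attachment score, taken here as the product of the two nodes' neighbor counts, |Γ(x)| · |Γ(y)|. The graph is a hypothetical example.

```python
def preferential_attachment(graph, x, y):
    """Product of the degrees of x and y."""
    return len(graph[x]) * len(graph[y])

graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
pa_ac = preferential_attachment(graph, "A", "C")
pa_ae = preferential_attachment(graph, "A", "E")
```

Unlike the neighborhood-overlap measures, this score ignores how x and y are positioned relative to each other and depends only on how well connected each one already is.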
Considering all paths: Katz

[Figure: example graph on nodes A, B, C, D, E]
Katz: measure that sums over the collection of paths
between x and y, exponentially damped by length
(so that short paths count more heavily)
β is chosen to be a very small value (for damping)
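
The Katz score can be written as the sum over l = 1, 2, ... of β^l times the number of length-l paths between x and y. The sketch below truncates that sum at a hypothetical max_length and counts paths with powers of the adjacency matrix, which means it counts walks that may revisit nodes. The function name, the default values and the graph are assumptions for illustration.

```python
def katz_scores(graph, beta=0.05, max_length=5):
    """Truncated Katz scores for all ordered pairs of distinct nodes."""
    nodes = sorted(graph)
    n = len(nodes)
    adjacency = [[1 if nodes[j] in graph[nodes[i]] else 0 for j in range(n)]
                 for i in range(n)]
    walks = [row[:] for row in adjacency]    # walks[i][j]: walks of current length
    scores = [[0.0] * n for _ in range(n)]
    damping = beta
    for _ in range(max_length):
        for i in range(n):
            for j in range(n):
                scores[i][j] += damping * walks[i][j]
        # Extend every walk by one more edge (matrix product walks * adjacency).
        walks = [[sum(walks[i][k] * adjacency[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        damping *= beta
    return {(nodes[i], nodes[j]): scores[i][j]
            for i in range(n) for j in range(n) if i != j}

graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
katz = katz_scores(graph, beta=0.1, max_length=3)
```

With a small β the first nonzero term dominates, so the ranking stays close to what shorter paths alone would suggest.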
Hitting time, PageRank


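Hitting time Hx,y: expected number of steps for a random
walk starting at x to reach y
Commute time: Hx,y + Hy,x, the expected number of steps to
go from x to y and back

If y has a large stationary probability πy, Hx,y is small
for every x. To counterbalance, we can normalize:
score(x, y) := -Hx,y · πy

PageRank (rooted at x): to cut down on long random walks,
the walk returns to x with probability α at every step;
score(x, y) is the stationary probability of y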
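
A minimal sketch of the PageRank variant described above, computed by power iteration: at each step the walk jumps back to the root with probability α and otherwise moves to a uniformly random neighbor. The function name, the α value, the iteration count and the graph are assumptions, and every node is assumed to have at least one neighbor.

```python
def rooted_pagerank(graph, root, alpha=0.15, iterations=100):
    """Approximate stationary distribution of a random walk restarting at root."""
    prob = {v: 1.0 if v == root else 0.0 for v in graph}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in graph}
        for v, mass in prob.items():
            nxt[root] += alpha * mass                  # restart at the root
            share = (1 - alpha) * mass / len(graph[v])
            for nbr in graph[v]:                       # step to a random neighbor
                nxt[nbr] += share
        prob = nxt
    return prob

graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
scores_from_a = rooted_pagerank(graph, "A")
```

The value scores_from_a[y] then plays the role of score(A, y): nodes the restarting walk visits often are considered close to A.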
SimRank

Defined recursively: two nodes are similar to the extent
that they are joined to similar neighbors
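
One common way to evaluate this recursion is by fixed-point iteration: similarity(x, x) = 1, and for x ≠ y the similarity is γ times the average similarity over all pairs of neighbors (a, b) with a in Γ(x) and b in Γ(y). The sketch below assumes that form; the name simrank, the decay γ = 0.8, the iteration count and the graph are illustrative choices.

```python
def simrank(graph, gamma=0.8, iterations=10):
    """Iteratively approximate SimRank similarity for all node pairs."""
    nodes = sorted(graph)
    sim = {(x, y): 1.0 if x == y else 0.0 for x in nodes for y in nodes}
    for _ in range(iterations):
        new = {}
        for x in nodes:
            for y in nodes:
                if x == y:
                    new[(x, y)] = 1.0
                elif not graph[x] or not graph[y]:
                    new[(x, y)] = 0.0      # isolated nodes are similar to nothing
                else:
                    total = sum(sim[(a, b)] for a in graph[x] for b in graph[y])
                    new[(x, y)] = gamma * total / (len(graph[x]) * len(graph[y]))
        sim = new
    return sim

graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}
sim_one_step = simrank(graph, iterations=1)
sim = simrank(graph)
```

After a single iteration only pairs with a shared neighbor have nonzero similarity; further iterations propagate similarity along longer chains of similar neighbors.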
Low-rank approximation

Treat the graph as an adjacency matrix
[Figure: example adjacency matrix over nodes A, B, C]
Compute the rank-k approximation Mk of the adjacency matrix (noise reduction)
Each node corresponds to a row of Mk: score(x, y) = inner product of
rows r(x) and r(y)
Unseen bigrams and Clustering



Unseen bigrams: Derived from language modeling
Estimating frequency of unseen bigrams – pairs of
words (nodes here) that co-occur in a test corpus but not
in the training corpus
Clustering: deleting tenuous edges in Gcollab through a
clustering procedure and running predictors on the
“cleaned-up” subgraph
Results

The results are presented as:

1. Factor improvement of proposed predictors over
     Random predictor
     Graph distance predictor
     Common neighbors predictor
2. Relative performance vs. the above predictors
3. Common predictions
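
A factor improvement over the random predictor could be computed roughly as follows. This sketch assumes that each predictor's top n pairs are compared with Enew, where n = |Enew|, and that the random baseline picks uniformly among all candidate pairs; the function name and the toy inputs are hypothetical.

```python
def factor_over_random(ranked_pairs, e_new, num_candidate_pairs):
    """Accuracy of the top-|e_new| predictions divided by random accuracy."""
    n = len(e_new)
    top = ranked_pairs[:n]
    correct = sum(1 for pair in top if frozenset(pair) in e_new)
    accuracy = correct / n
    random_accuracy = n / num_candidate_pairs   # chance a random pair is in e_new
    return accuracy / random_accuracy

# Hypothetical ranked output of some predictor and hypothetical new edges.
ranked = [("A", "C"), ("B", "D"), ("B", "E"), ("D", "E"), ("A", "E")]
e_new = {frozenset({"A", "C"}), frozenset({"B", "E"})}
factor = factor_over_random(ranked, e_new, num_candidate_pairs=len(ranked))
```

The same function can compare against another baseline by dividing two predictors' accuracies instead of using the random accuracy.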
[Figure: Factor improvement of different measures]
[Figure: Factor improvement - meta approaches]
[Figure: Relative performance vs. random predictions, vs. graph distance predictor, and vs. common neighbors predictor]
[Figure: Common predictions]
Conclusions

No single clear winner

Many outperform the random predictor => there is useful
information in the network topology

Katz + clustering + low-rank approximation perform
notably well

Some simple measures, e.g. common neighbors and
Adamic/Adar, perform well
Critique



Even the best predictor (Katz on gr-qc) is correct on only
16% of predictions
How good is that?
All collaborations are treated equally. Perhaps treating recent
collaborations as more important than older ones would
help?
References




Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, July 2003.
A. L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A, 311(3–4):590–614, 2002.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
Rodrigo De Castro and Jerrold W. Grossman. Famous trails to Paul Erdős. Mathematical Intelligencer, 21(3):51–63, 1999.
Questions?
Thank You