Transcript Slide 1
The Link Prediction Problem
for Social Networks
David Liben-Nowell, MIT
Jon Kleinberg, Cornell
Saswat Mishra
sxm111131
Summary
The “Link Prediction Problem”
Given a snapshot of a social network, can we infer
which new interactions among its members are likely to
occur in the near future?
Based on “proximity” of nodes in a network
Introduction
Nodes = people/entities
Edges = interactions/collaborations
Natural examples of social networks:

  Nodes                          Edges
  Scientists in a discipline     Co-authors of a paper
  Employees in a large company   Working on a project
  Business leaders               Serving together on a board
Motivation
Understanding how social networks evolve
The link prediction problem
Given a snapshot of a social network at time t, we seek
to accurately predict the edges that will be added to the
network during the interval (t, t’)
Why?
To suggest interactions or collaborations that haven’t yet
been utilized within an organization
To monitor terrorist networks - to deduce possible
interaction between terrorists (without direct evidence)
Used by Facebook and LinkedIn to suggest friends
Open Question: How does Facebook do it?
(friends of friends, same school, manually…)
Motivation
[Figure: example co-authorship network for scientists]
Scientists who are “close” in the network
will have common colleagues & circles –
likely to collaborate
Caveat: scientists who have never collaborated might still
do so in the future - such links are hard to predict
Goal: make that intuitive notion precise;
understand which measures of
“proximity” lead to accurate predictions
Goals
Present measures of proximity
Understand relative effectiveness of network proximity
measures (adapted from graph theory, CS, social sciences)
Show that prediction by proximity outperforms random
predictions by a factor of 40 to 50
Show that subtle measures can outperform more direct
measures
Data and Experimental Setup
Co-authorship network (G) from “author list” of the
physics e-Print arXiv (www.arxiv.org)
Took 5 such networks from 5 sections of the arXiv
[Figure: collaboration graphs for the training and test intervals]
Training interval [1994, 1996], Ktraining = 3
Test interval [1997, 1999], Ktest = 3
Core: set of authors who have at least 3 papers during both training and test
G[1994,1996] = Gcollab = (A,Eold)
Enew = new collaborations (edges)
Data
Methods for Link Prediction
Take the input graph during training period Gcollab
Pick a pair of nodes (x, y)
Assign a connection weight score(x, y)
Make a list in descending order of score
score is a measure of proximity
Any ideas for measures?
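Before surveying specific measures, the ranking pipeline above can be sketched in a few lines of Python. The toy graph and the common-neighbors score used here are illustrative assumptions, not the paper's data:

```python
def common_neighbors(graph, x, y):
    """Example score: number of shared neighbors of x and y."""
    return len(graph[x] & graph[y])

def rank_candidate_links(graph, score):
    """Score every non-adjacent node pair; list pairs by descending score."""
    nodes = sorted(graph)
    pairs = [(x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]
             if y not in graph[x]]
    return sorted(pairs, key=lambda p: score(graph, *p), reverse=True)

# Toy collaboration graph (undirected, stored as adjacency sets).
G = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
ranking = rank_candidate_links(G, common_neighbors)
```

Any proximity measure below can be dropped in as the `score` function; only the definition of `score(x, y)` changes.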
Proximity Measures for Link Prediction
Graph distance & Common Neighbors
[Figure: example network with nodes A-E]
Graph distance: (negated) length of the shortest path between x and y
  score(A, C) = -2, score(C, D) = -2, score(A, E) = -3
Common Neighbors: A and C have 2 common neighbors, so they are more likely to collaborate
  score(x, y) := |Γ(x) ∩ Γ(y)|
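As a sketch, the graph-distance score can be computed with a breadth-first search; the small graph below is illustrative:

```python
from collections import deque

def shortest_path_length(graph, x, y):
    """BFS distance from x to y; None if y is unreachable from x."""
    seen, frontier, dist = {x}, deque([x]), 0
    while frontier:
        dist += 1
        for _ in range(len(frontier)):
            for nbr in graph[frontier.popleft()]:
                if nbr == y:
                    return dist
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append(nbr)
    return None

def graph_distance_score(graph, x, y):
    """Negated shortest-path length, so that closer pairs rank higher."""
    d = shortest_path_length(graph, x, y)
    return -d if d is not None else float("-inf")

# Toy graph: A and C are two hops apart, A and E are three hops apart.
G = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D", "E"},
     "D": {"A", "C"}, "E": {"C"}}
```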
Jaccard’s coefficient and Adamic / Adar
[Figure: example network with nodes A-E]
Jaccard's coefficient: same as common neighbors, but normalized by degree:
  score(x, y) := |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|
Adamic / Adar: weights rarer neighbors more heavily:
  score(x, y) := Σ over z ∈ Γ(x) ∩ Γ(y) of 1 / log |Γ(z)|
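Both measures are short computations over neighbor sets; this sketch uses an illustrative toy graph:

```python
import math

def jaccard(graph, x, y):
    """|Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|: common neighbors, degree-adjusted."""
    union = graph[x] | graph[y]
    return len(graph[x] & graph[y]) / len(union) if union else 0.0

def adamic_adar(graph, x, y):
    """Sum of 1/log|Γ(z)| over common neighbors z: rare neighbors count more."""
    return sum(1.0 / math.log(len(graph[z])) for z in graph[x] & graph[y])

G = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D", "E"},
     "D": {"A", "C"}, "E": {"C"}}
# jaccard(G, "A", "C") == 2/3: two common neighbors, three nodes in the union.
```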
Preferential Attachment
Probability that a new collaboration involves x is
proportional to |Γ(x)|, the current number of neighbors of x
score(x, y) := |Γ(x)| · |Γ(y)|
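A minimal sketch of this degree-product score, on an illustrative toy graph:

```python
def preferential_attachment(graph, x, y):
    """score(x, y) = |Γ(x)| * |Γ(y)|: the product of current degrees."""
    return len(graph[x]) * len(graph[y])

G = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D", "E"},
     "D": {"A", "C"}, "E": {"C"}}
```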
Considering all paths: Katz
[Figure: example network with nodes A-E]
Katz: a measure that sums over the collection of paths
between x and y, exponentially damped by length
(so that short paths count more heavily)
  score(x, y) := Σ over l ≥ 1 of β^l · (number of length-l paths between x and y)
β is chosen to be a very small value (for damping)
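A truncated version of the Katz sum can be sketched by counting walks with adjacency-matrix powers, stopping at a maximum length rather than summing to infinity; the toy graph, β value, and cutoff are illustrative assumptions:

```python
def katz_score(graph, x, y, beta=0.05, max_len=6):
    """Truncated Katz: sum of beta^l * (#length-l walks from x to y)
    for l = 1 .. max_len, via pure-Python matrix powers."""
    nodes = sorted(graph)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[1 if nodes[j] in graph[nodes[i]] else 0 for j in range(n)]
         for i in range(n)]
    power, total = A, 0.0      # power holds A^l at the start of iteration l
    for l in range(1, max_len + 1):
        total += (beta ** l) * power[idx[x]][idx[y]]
        power = [[sum(power[i][k] * A[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return total

# Path graph A-B-C: adjacent pairs get a large length-1 contribution.
G = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
```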
Hitting time, PageRank
Hitting time: Hx,y = expected number of steps for a random
walk starting at x to reach y; score(x, y) := -Hx,y
Commute time: score(x, y) := -(Hx,y + Hy,x)
If y has a large stationary probability πy, Hx,y is small for
every x. To counterbalance, we can normalize:
score(x, y) := -Hx,y · πy
Rooted PageRank: to cut down on long random walks, the walk
returns to x with probability α at each step; score(x, y) is the
stationary probability of y under this walk
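Hitting time can be estimated by simulating random walks. This Monte Carlo sketch is an approximation, not an exact computation; the path graph below is illustrative, and its exact hitting time H(A, C) is 4:

```python
import random

def estimate_hitting_time(graph, x, y, trials=20000, seed=42):
    """Monte Carlo estimate of H_{x,y}: average number of steps for a
    uniform random walk started at x to first reach y."""
    rng = random.Random(seed)
    nbrs = {v: sorted(ns) for v, ns in graph.items()}  # deterministic order
    total = 0
    for _ in range(trials):
        node, steps = x, 0
        while node != y:
            node = rng.choice(nbrs[node])
            steps += 1
        total += steps
    return total / trials

# Path graph A-B-C: solving h(A) = 1 + h(B), h(B) = 1 + h(A)/2 gives h(A) = 4.
G = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
```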
SimRank
Defined recursively: two nodes are similar to the extent that
they are joined to similar neighbors
  score(x, y) := γ · Σ over a ∈ Γ(x), b ∈ Γ(y) of score(a, b) / (|Γ(x)| · |Γ(y)|),
  with score(x, x) := 1
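The recursion can be approximated by fixed-point iteration. This sketch follows the definition above; the toy graph, γ value, and iteration count are illustrative assumptions:

```python
def simrank(graph, gamma=0.8, iters=10):
    """Fixed-point iteration of SimRank: sim(x, y) is gamma times the
    average similarity of the neighbor pairs of x and y; sim(x, x) = 1."""
    nodes = sorted(graph)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif graph[a] and graph[b]:
                    s = sum(sim[(u, v)] for u in graph[a] for v in graph[b])
                    new[(a, b)] = gamma * s / (len(graph[a]) * len(graph[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Toy graph: a 4-cycle A-B-C-D.
G = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
sim = simrank(G)
```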
Low-rank approximation
Treat the graph as an adjacency matrix M
[Figure: example adjacency matrix over nodes A, B, C]
Compute the rank-k approximation Mk via SVD (noise reduction)
score(x, y) := inner product of the rows of Mk corresponding to x and y
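As a minimal pure-Python stand-in for the rank-k SVD truncation, a rank-1 approximation via power iteration already illustrates the idea; the toy graph and iteration count are illustrative assumptions, not the paper's setup:

```python
def rank1_scores(graph, iters=200):
    """Rank-1 approximation M1 of the adjacency matrix via power iteration;
    score(x, y) = inner product of the rows of M1 for x and y."""
    nodes = sorted(graph)
    n = len(nodes)
    A = [[1.0 if nodes[j] in graph[nodes[i]] else 0.0 for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):  # converge to the dominant eigenvector
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
    # Row i of M1 is lam * v[i] * v; the inner product of rows i and j of M1
    # simplifies to lam^2 * v[i] * v[j] because v has unit norm.
    return {(nodes[i], nodes[j]): (lam ** 2) * v[i] * v[j]
            for i in range(n) for j in range(n)}

# Toy graph: triangle A-B-C with pendant node D (non-bipartite, so the
# power iteration converges to the dominant eigenvector).
G = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
scores = rank1_scores(G)
```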
Unseen bigrams and Clustering
Unseen bigrams: Derived from language modeling
Estimating frequency of unseen bigrams – pairs of
words (nodes here) that co-occur in a test corpus but not
in the training corpus
Clustering: deleting tenuous edges in Gcollab through a
clustering procedure and running predictors on the
“cleaned-up” subgraph
Results
The results are presented as:
1. Factor improvement of proposed predictors over
Random predictor
Graph distance predictor
Common neighbors predictor
2. Relative performance vs. the above predictors
3. Common Predictions
Factor Improvement of different measures
Factor Improvement - meta approaches
Relative performance vs. Random Predictions
vs. graph distance predictor,
vs. common neighbors predictor
Common Predictions
Conclusions
No single clear winner
Many outperform the random predictor => there is useful
information in the network topology
Katz, clustering, and low-rank approximation perform
particularly well
Some simple measures, e.g. common neighbors and
Adamic/Adar, perform well
Critique
Even the best predictor (Katz on gr-qc) is correct on only
16% of predictions
How good is that?
The measures treat all collaborations equally. Perhaps
treating recent collaborations as more important than older
ones would help?
References
Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, July 2003.
A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A, 311(3–4):590–614, 2002.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
Rodrigo De Castro and Jerrold W. Grossman. Famous trails to Paul Erdős. Mathematical Intelligencer, 21(3):51–63, 1999.
Questions?
Thank You