SimRank:
A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom
Stanford University
ACM SIGKDD 2002
January 19, 2011
Taikyoung Kim
SNU IDB Lab.
Outline
Introduction
Basic Graph Model
SimRank
Random Surfer-Pairs Model
Conclusion
Future Work
Introduction
Many applications require a measure of “similarity” between
objects
– “find-similar-document” query in a search engine
– Collaborative filtering in a recommender system
Introduction
Propose a general approach that exploits the object-to-object
relationships in many domains
– An algorithm to compute similarity scores between nodes based on the
structural context
Intuition behind the algorithm
– Similar objects are related to similar objects
– The base case is that objects are similar to themselves
“Two objects are similar if they are referenced by similar objects”
Basic Graph Model
G = (V, E) [vertex, edge]
– Nodes in V: objects in the domain
– Directed edges in E: relationships between objects
– <p, q> : from object p to object q
For a node v, denote:
– I(v): the set of in-neighbors of v
– O(v): the set of out-neighbors of v
– I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
– O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
[Figure: example graph; the labels O(Univ) and I(ProfB) mark the out-neighbors of Univ and the in-neighbors of ProfB]
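The notation above maps directly onto two dictionaries. The following is a minimal sketch in Python, not part of the original slides. The helper name `build_neighbors` and the edge list `EDGES` are assumptions made for illustration; the node names follow the slides, but the edges themselves are hypothetical.

```python
def build_neighbors(edges):
    """Return (in_nbrs, out_nbrs): I(v) and O(v) for every node v."""
    in_nbrs, out_nbrs = {}, {}
    for p, q in edges:  # the edge <p, q> goes from object p to object q
        for node in (p, q):
            # every node gets an entry, even if one of its sets stays empty
            in_nbrs.setdefault(node, set())
            out_nbrs.setdefault(node, set())
        out_nbrs[p].add(q)
        in_nbrs[q].add(p)
    return in_nbrs, out_nbrs


# Hypothetical edge list, chosen only for illustration.
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]

in_nbrs, out_nbrs = build_neighbors(EDGES)
print(sorted(out_nbrs["Univ"]))   # O(Univ)
print(sorted(in_nbrs["ProfB"]))   # I(ProfB)
```

Storing the neighbor sets by node keeps |I(v)| and |O(v)| available as `len(...)`, which is all the SimRank equations need.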
SimRank
Motivation
– Two objects are similar if they are referenced by similar objects
– Consider an object maximally similar to itself (similarity score of 1)
Similar nodes:
{ProfA, ProfB},
{StudentA, StudentB},
{Univ, ProfB},
…
SimRank
Basic SimRank Equation
The similarity between objects a and b: s(a, b) ∈ [0, 1]
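\[
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b)) & \text{if } a \neq b
\end{cases}
\]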
– C is a constant between 0 and 1
Confidence level or decay factor
C gives the rate of decay as similarity flows across edges (since C < 1)
– If a or b has no in-neighbors, s(a,b) = 0
– SimRank scores are symmetric, i.e., s(a,b) = s(b,a)
Similarity between a and b is the average similarity between in-neighbors of a and in-neighbors of b, scaled by C
SimRank
Basic SimRank Equation
Similarity can be thought of as “propagating” from pair to pair
– Consider the derived graph G² = (V², E²), where
Each node in V² = V × V represents an ordered pair (a,b) of nodes in G
An edge from (a,b) to (c,d) exists in E² iff the edges <a,c> and <b,d> exist in G
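The construction of G² can be sketched in a few lines of Python. This is an illustration only: the function name `derived_graph` and the tiny example graph are assumptions, not something given on the slide.

```python
from itertools import product


def derived_graph(nodes, edges):
    """Build G^2 = (V^2, E^2) from G = (V, E).

    V^2 holds every ordered pair (a, b) of nodes of G. E^2 holds an edge
    from (a, b) to (c, d) exactly when <a, c> and <b, d> are edges of G.
    """
    edge_set = set(edges)
    v2 = list(product(nodes, repeat=2))
    e2 = {((a, b), (c, d)) for (a, c) in edge_set for (b, d) in edge_set}
    return v2, e2


# Hypothetical three-node graph: node 1 points to nodes 2 and 3.
V2, E2 = derived_graph([1, 2, 3], [(1, 2), (1, 3)])
print(len(V2), len(E2))
```

Since E² pairs every edge of G with every edge of G, |E²| = |E|², which is why the naïve computation over G² becomes expensive on large graphs.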
SimRank
Bipartite SimRank
Bipartite domains consist of two types of objects
Recommender system
– People are similar if they purchase similar items
– Items are similar if they are purchased by similar people
SimRank
Bipartite SimRank
Bipartite Equation
– Directed edges go from people to items
– s(A,B) denotes the similarity between persons A and B (A ≠ B)
\[
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))
\]
– s(c,d) denotes the similarity between items c and d (c ≠ d)
\[
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))
\]
– The similarity between persons A and B is the average similarity between
the items they purchased
– The similarity between items c and d is the average similarity between the
people who purchased them
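The two bipartite equations feed each other, so one way to evaluate them is to update both score tables in alternation. The sketch below is an assumption-laden illustration rather than the authors' implementation: the function name `bipartite_simrank`, the default values C1 = C2 = 0.8, the fixed iteration count, and the sample purchases are all hypothetical.

```python
def bipartite_simrank(purchases, c1=0.8, c2=0.8, iterations=5):
    """purchases maps each person to the set of items bought (person -> item edges)."""
    people = sorted(purchases)
    items = sorted({item for bought in purchases.values() for item in bought})
    # O(A): items bought by person A; I(c): people who bought item c
    out_nbrs = {p: set(purchases[p]) for p in people}
    in_nbrs = {c: {p for p in people if c in out_nbrs[p]} for c in items}

    s_people = {(a, b): float(a == b) for a in people for b in people}
    s_items = {(c, d): float(c == d) for c in items for d in items}

    def average(scores, xs, ys, decay):
        # decayed average of the scores over all neighbor pairs
        if not xs or not ys:
            return 0.0
        total = sum(scores[x, y] for x in xs for y in ys)
        return decay * total / (len(xs) * len(ys))

    for _ in range(iterations):
        new_people = {
            (a, b): 1.0 if a == b
            else average(s_items, out_nbrs[a], out_nbrs[b], c1)
            for a in people for b in people
        }
        new_items = {
            (c, d): 1.0 if c == d
            else average(s_people, in_nbrs[c], in_nbrs[d], c2)
            for c in items for d in items
        }
        s_people, s_items = new_people, new_items
    return s_people, s_items


# Hypothetical purchase data, for illustration only.
PURCHASES = {"A": {"x"}, "B": {"y"}, "C": {"x", "y"}}
person_scores, item_scores = bipartite_simrank(PURCHASES, iterations=2)
print(person_scores["A", "B"], item_scores["x", "y"])
```

In this toy data, persons A and B share no item, yet they receive a nonzero score after two iterations because items x and y become similar through their common buyer C.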
SimRank
Computing SimRank - Naïve Method
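R_k(a,b) gives the score between a and b on iteration k
\[
R_0(a,b) =
\begin{cases}
0 & \text{if } a \neq b \\
1 & \text{if } a = b
\end{cases}
\]
\[
R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))
\]
for a ≠ b, and R_{k+1}(a,b) = 1 for a = b
The values R_k(*,*) are non-decreasing as k increases
\[
\lim_{k \to \infty} R_k(a,b) = s(a,b)
\]
In experiments, R_k converged rapidly; about K = 5 iterations were sufficient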
Complexity
– Space: O(n²) to store the results R_k
– Time: O(K n² d_2), where d_2 is the average of |I(a)||I(b)| over all node pairs (a,b)
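The naïve iteration can be written almost verbatim in Python. This is a minimal sketch under stated assumptions: the function name `simrank_naive`, the default C = 0.8, and the example in-neighbor table `IN_NBRS` are hypothetical, and every node is assumed to appear as a key of the table.

```python
def simrank_naive(in_nbrs, C=0.8, K=5):
    """Iterate R_0, R_1, ..., R_K over all n^2 node pairs and return R_K."""
    nodes = list(in_nbrs)
    # R_0: 1 on the diagonal, 0 elsewhere
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(K):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[a, b] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    total = sum(R[x, y] for x in in_nbrs[a] for y in in_nbrs[b])
                    nxt[a, b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    nxt[a, b] = 0.0  # a or b has no in-neighbors
        R = nxt
    return R


# Hypothetical in-neighbor sets I(v), for illustration only.
IN_NBRS = {
    "Univ": {"StudentA"},
    "ProfA": {"Univ"},
    "ProfB": {"Univ", "StudentB"},
    "StudentA": {"ProfA"},
    "StudentB": {"ProfB"},
}

scores = simrank_naive(IN_NBRS)
print(scores["ProfA", "ProfB"])
```

For this example, the first iteration already gives R_1(ProfA, ProfB) = 0.8 · (1 + 0) / (1 · 2) = 0.4, because the two nodes share the in-neighbor Univ. The dictionary holds all n² pairs, and each update touches |I(a)||I(b)| entries, which matches the space and time bounds above.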
SimRank
Computing SimRank - Pruning
Pruning the logical graph G²
– In the naïve method,
All n² nodes of G² are considered
Similarity scores are computed for every node-pair
– Nodes far from a node v tend to have lower similarity scores with v than nodes near v
Pruning
– Set the similarity between two nodes far apart to be 0
– Consider node-pairs only for nodes which are near each other, within a
radius r
– Complexity
Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
Time: O(K n d_r d_2)
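One possible reading of the pruning step is sketched below. It assumes that "near" means within r hops when edge directions are ignored, which the slide does not spell out; the function names `within_radius` and `simrank_pruned` and the example graph are also hypothetical.

```python
from collections import deque


def within_radius(adj, start, r):
    """Nodes at undirected distance <= r from start, found by BFS."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == r:
            continue  # do not expand beyond radius r
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen


def simrank_pruned(in_nbrs, r=2, C=0.8, K=5):
    """Like the naive iteration, but only pairs within radius r are stored."""
    nodes = list(in_nbrs)
    adj = {v: set(in_nbrs[v]) for v in nodes}  # undirected adjacency
    for v in nodes:
        for u in in_nbrs[v]:
            adj[u].add(v)
    near = {v: within_radius(adj, v, r) for v in nodes}
    pairs = [(a, b) for a in nodes for b in near[a]]
    R = {(a, b): 1.0 if a == b else 0.0 for (a, b) in pairs}
    for _ in range(K):
        nxt = {}
        for a, b in pairs:
            if a == b:
                nxt[a, b] = 1.0
            elif in_nbrs[a] and in_nbrs[b]:
                # pairs that were pruned away count as similarity 0
                total = sum(R.get((x, y), 0.0)
                            for x in in_nbrs[a] for y in in_nbrs[b])
                nxt[a, b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
            else:
                nxt[a, b] = 0.0
        R = nxt
    return R


# Hypothetical graph: a -> b, a -> c, c -> d, d -> e.
CHAIN = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"c"}, "e": {"d"}}
pruned = simrank_pruned(CHAIN, r=2)
print(len(pruned), pruned["b", "c"])
```

Only the pairs listed in `pairs` are ever stored, so the table grows with n · d_r rather than n². The price is that scores for distant pairs are fixed at 0, so the result approximates the full computation.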
Random Surfer-Pairs Model
An intuitive model for interpreting similarity scores
– Based on “random surfers”
– Shows that the SimRank score s(a,b) measures how soon two random surfers are
expected to meet at the same node if they start at a and b and walk the graph backwards at random
Expected Distance
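– u and v are nodes in a strongly connected graph
– The expected distance d(u,v) from u to v is exactly the expected number of steps a random surfer
would take before first reaching v, starting from u
\[
d(u,v) = \sum_{t:\, u \rightsquigarrow v} P[t]\, l[t]
\]
– Tour t = <w_1, …, w_k>: a walk from u to v that does not touch v before its last step
– l[t]: length of t (k − 1, the number of steps)
– P[t]: probability of traveling t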
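The sum over tours can be evaluated numerically without enumerating tours. The sketch below relies on the standard first-step recurrence d(u,v) = 1 + (1/|O(u)|) Σ_w d(w,v) over w ∈ O(u), with d(v,v) = 0; this recurrence is not stated on the slide and is introduced here as an assumption. The function name `expected_distance` and the example graph are hypothetical as well.

```python
def expected_distance(out_nbrs, target, iterations=200):
    """Approximate d(u, target) for every node u by fixed-point iteration.

    Assumes a strongly connected graph, so every node has an out-neighbor.
    """
    d = {u: 0.0 for u in out_nbrs}
    for _ in range(iterations):
        d = {
            u: 0.0 if u == target
            else 1.0 + sum(d[w] for w in out_nbrs[u]) / len(out_nbrs[u])
            for u in out_nbrs
        }
    return d


# Hypothetical strongly connected graph: u -> v, u -> w, v -> u, w -> u.
OUT_NBRS = {"u": ["v", "w"], "v": ["u"], "w": ["u"]}
dist = expected_distance(OUT_NBRS, target="v")
print(round(dist["u"], 6), round(dist["w"], 6))
```

In this example the tours from u to v have lengths 1, 3, 5, … with probabilities 1/2, 1/4, 1/8, …, so the tour sum gives d(u,v) = 3, and the iteration converges to the same value.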
Random Surfer-Pairs Model
Expected Meeting Distance (EMD)
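– EMD is symmetric
– The EMD m(a,b) is simply the expected distance in G² from (a,b) to any
singleton node (x,x) ∈ V²
\[
m(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, l[t]
\]
[Figure: example graphs annotated with EMD values, including m(*,*) = ∞, m(*,*) = 3, m(v,w) = 1, m(u,v) = ∞, and m(u,w) = ∞]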
Random Surfer-Pairs Model
Expected-f Meeting Distance
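– Our approach to circumvent the “infinite EMD” problem
Map all distances to a finite interval: instead of computing the expected length l[t]
of a tour, compute the expected value of f(l[t]) for a bounded, nonnegative, monotonic
function f, here f(z) = c^z with c ∈ (0, 1)
\[
s'(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, c^{\,l[t]}
\]
Equivalence to SimRank
– s'(*,*) exactly models our original definition of SimRank scores: with c = C and
surfers traversing edges backwards, s'(a,b) = s(a,b)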
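A short derivation sketch shows why the two definitions coincide. It assumes that both surfers walk backwards, each picking an in-neighbor uniformly and independently at every step, and that c = C. Under those assumptions, any tour from (a,b) with a ≠ b starts with a step to some pair (I_i(a), I_j(b)), taken with probability 1/(|I(a)||I(b)|), followed by a tour t' that is one step shorter:

```latex
\begin{align*}
s'(a,a) &= c^{0} = 1, \\
s'(a,b) &= \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} \frac{1}{|I(a)|\,|I(b)|}
           \sum_{t' :\, (I_i(a), I_j(b)) \rightsquigarrow (x,x)} P[t']\, c^{\,l[t'] + 1} \\
        &= \frac{c}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s'(I_i(a), I_j(b))
           \qquad (a \neq b).
\end{align*}
```

If a or b has no in-neighbors, there is no tour at all and s'(a,b) = 0. So s' satisfies the same recurrence and base cases as the basic SimRank equation; the equivalence then follows provided that system of equations has a unique solution.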
Conclusion
Main contribution
– A formal definition for SimRank similarity scoring over arbitrary graphs,
several useful derivatives of SimRank, and an algorithm to compute SimRank
– A graph-theoretic model for SimRank that gives intuitive mathematical
insight into its use and computation
– Experimental results using an in-memory implementation of SimRank over
two real data sets show the effectiveness and feasibility of SimRank
Future Work
Address efficiency and scalability issues
– Including additional pruning heuristics and disk-based algorithms
Consider ternary (or more) relationships in computing structural-context similarity
Explore the combination of SimRank with other domain-specific
similarity measures