Transcript SimRank

SimRank:
A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom
Stanford University
ACM SIGKDD 2002
January 19, 2011
Taikyoung Kim
SNU IDB Lab.
Outline
• Introduction
• Basic Graph Model
• SimRank
• Random Surfer-Pairs Model
• Conclusion
• Future Work
Introduction
• Many applications require a measure of “similarity” between objects
– “Find similar documents” queries in a search engine
– Collaborative filtering in a recommender system
Introduction
• Propose a general approach that exploits the object-to-object relationships found in many domains
– An algorithm that computes similarity scores between nodes based on their structural context
• Intuition behind the algorithm
– Similar objects are related to similar objects
– The base case is that objects are similar to themselves
– “Two objects are similar if they are referenced by similar objects”
Basic Graph Model
• G = (V, E), where V is the set of vertices and E the set of edges
– Nodes in V: objects in the domain
– Directed edges in E: relationships between objects
– <p, q>: an edge from object p to object q
• For a node v, denote:
– I(v): the set of in-neighbors of v
– O(v): the set of out-neighbors of v
– I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
– O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
[Figure: example graph with labels O(Univ) and I(ProfB)]
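The notation above can be sketched in Python. This is only a minimal sketch: the edge list is a hypothetical toy graph that borrows node names appearing in the slides and is not guaranteed to match the slide's figure, and `build_neighbors` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical toy edge list (assumed for illustration); <p, q> means p -> q.
EDGES = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
    ("StudentA", "Univ"),
    ("StudentB", "ProfB"),
]


def build_neighbors(edges):
    """Return (I, O): dicts mapping each node v to sorted in-/out-neighbor lists."""
    in_nb, out_nb = defaultdict(set), defaultdict(set)
    nodes = set()
    for p, q in edges:
        out_nb[p].add(q)  # q is an out-neighbor of p
        in_nb[q].add(p)   # p is an in-neighbor of q
        nodes.update((p, q))
    I = {v: sorted(in_nb[v]) for v in nodes}
    O = {v: sorted(out_nb[v]) for v in nodes}
    return I, O


I, O = build_neighbors(EDGES)
```

With this layout, `I["ProfB"][0]` plays the role of I_1(ProfB); Python lists are 0-indexed, whereas the slides index neighbors from 1.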
SimRank
• Motivation
– Two objects are similar if they are referenced by similar objects
– An object is considered maximally similar to itself (similarity score of 1)
Similar nodes: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …
SimRank
Basic SimRank Equation
• The similarity between objects a and b: s(a, b) ∈ [0, 1]
\[
s(a, b) =
\begin{cases}
1 & \text{if } a = b \\
\dfrac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b)) & \text{if } a \neq b
\end{cases}
\]
– C is a constant between 0 and 1
  • Called the confidence level or decay factor
  • C gives the rate of decay as similarity flows across edges (since C < 1)
– If a or b has no in-neighbors, s(a, b) = 0
– SimRank scores are symmetric, i.e., s(a, b) = s(b, a)
• The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C
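As a small worked example of the basic equation (the nodes x and y, the value C = 0.8, and the score s(y, x) = 0 are all assumptions chosen for illustration): suppose a ≠ b, I(a) = {x, y}, and I(b) = {x}. Then

\[
s(a, b) = \frac{C}{|I(a)|\,|I(b)|} \bigl( s(x, x) + s(y, x) \bigr) = \frac{0.8}{2 \cdot 1} (1 + 0) = 0.4
\]

The shared in-neighbor x contributes s(x, x) = 1, and the result is damped both by averaging over all in-neighbor pairs and by the decay factor C.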
SimRank
Basic SimRank Equation
• Similarity can be thought of as “propagating” from pair to pair
– Consider the derived graph G² = (V², E²), where
  • V² = V × V; each node of G² represents an ordered pair (a, b) of nodes in G
  • An edge from (a, b) to (c, d) exists in E² iff the edges <a, c> and <b, d> exist in G
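The definition of E² can be sketched directly in Python. This is a minimal sketch under stated assumptions: `derived_graph` is a hypothetical helper name, and the two-edge input graph is invented for illustration.

```python
from itertools import product


def derived_graph(edges):
    """Build the edge set E^2 of G^2 from the directed edge list of G.

    An edge (a, b) -> (c, d) is in E^2 iff <a, c> and <b, d> are edges of G.
    """
    return {((a, b), (c, d)) for (a, c), (b, d) in product(edges, repeat=2)}


# Hypothetical three-node graph (assumed for illustration): u -> v and u -> w.
E2 = derived_graph([("u", "v"), ("u", "w")])
```

In this toy graph the node (u, u) has an edge to (v, w), so the score s(u, u) = 1 propagates to the pair (v, w). Materializing E² takes |E|² entries, which is one reason an implementation would normally keep G² implicit.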
SimRank
Bipartite SimRank
• Bipartite domains consist of two types of objects
• Example: a recommender system
– People are similar if they purchase similar items
– Items are similar if they are purchased by similar people
SimRank
Bipartite SimRank
• Bipartite equations
– Directed edges go from people to items
– s(A, B) denotes the similarity between persons A and B (A ≠ B)
\[
s(A, B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))
\]
– s(c, d) denotes the similarity between items c and d (c ≠ d)
\[
s(c, d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))
\]
– The similarity between persons A and B is the average similarity between the items they purchased, scaled by C_1
– The similarity between items c and d is the average similarity between the people who purchased them, scaled by C_2
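The two mutually recursive equations can be evaluated by simple iteration. The following is a minimal sketch, not the authors' implementation: the function name `bipartite_simrank`, the defaults C_1 = C_2 = 0.8, the iteration count, and the purchase data are all assumptions made for illustration.

```python
def bipartite_simrank(purchases, c1=0.8, c2=0.8, iterations=5):
    """Iterate the two bipartite equations.

    purchases maps each person A to the set of items bought, i.e. O(A).
    Returns (sp, si): person-pair scores and item-pair scores.
    """
    people = sorted(purchases)
    buyers = {}  # I(c): the people who bought item c
    for person, bought in purchases.items():
        for item in bought:
            buyers.setdefault(item, set()).add(person)
    items = sorted(buyers)

    # Start from s(x, x) = 1 and 0 for every other pair.
    sp = {(a, b): float(a == b) for a in people for b in people}
    si = {(c, d): float(c == d) for c in items for d in items}

    for _ in range(iterations):
        new_sp, new_si = dict(sp), dict(si)
        for a in people:
            for b in people:
                if a != b and purchases[a] and purchases[b]:
                    total = sum(si[x, y] for x in purchases[a] for y in purchases[b])
                    new_sp[a, b] = c1 * total / (len(purchases[a]) * len(purchases[b]))
        for c in items:
            for d in items:
                if c != d:
                    total = sum(sp[x, y] for x in buyers[c] for y in buyers[d])
                    new_si[c, d] = c2 * total / (len(buyers[c]) * len(buyers[d]))
        sp, si = new_sp, new_si
    return sp, si


# Hypothetical purchase data (assumed for illustration).
PURCHASES = {"A": {"sugar", "frosting"}, "B": {"frosting", "eggs"}}
sp, si = bipartite_simrank(PURCHASES)
```

Both score tables are updated from the previous iteration's values, so person scores and item scores reinforce each other over successive iterations.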
SimRank
Computing SimRank - Naïve Method
• R_k(a, b) gives the score between a and b on iteration k
\[
R_0(a, b) =
\begin{cases}
0 & \text{if } a \neq b \\
1 & \text{if } a = b
\end{cases}
\]
\[
R_{k+1}(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b)) \quad (a \neq b), \qquad R_{k+1}(a, a) = 1
\]
• The values R_k(*, *) are non-decreasing as k increases
• lim_{k→∞} R_k(a, b) = s(a, b)
• In experiments, R_k converges rapidly: K = 5 iterations are sufficient
• Complexity
– Space: O(n²) to store the results R_k
– Time: O(K n² d₂), where d₂ is the average of |I(a)| |I(b)| over all node pairs (a, b)
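The naïve iteration above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `simrank_naive`, the default C = 0.8, and the two-edge example graph are illustrative, and the edge list is assumed to contain no duplicate edges.

```python
def simrank_naive(edges, C=0.8, K=5):
    """Naive iteration: return R_K as a dict keyed by node pairs (a, b)."""
    nodes = sorted({v for edge in edges for v in edge})
    in_nb = {v: [] for v in nodes}
    for p, q in edges:
        in_nb[q].append(p)  # p is an in-neighbor of q

    # R_0(a, b) = 1 if a == b, else 0.
    R = {(a, b): float(a == b) for a in nodes for b in nodes}

    for _ in range(K):
        R_next = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    R_next[a, b] = 1.0
                elif in_nb[a] and in_nb[b]:
                    total = sum(R[x, y] for x in in_nb[a] for y in in_nb[b])
                    R_next[a, b] = C * total / (len(in_nb[a]) * len(in_nb[b]))
                else:
                    R_next[a, b] = 0.0  # a or b has no in-neighbors
        R = R_next
    return R


# Hypothetical graph (assumed for illustration): u -> v and u -> w.
R = simrank_naive([("u", "v"), ("u", "w")], C=0.8, K=5)
```

Here v and w share their only in-neighbor u, so `R["v", "w"]` equals C, while `R["u", "v"]` stays 0 because u has no in-neighbors. The dictionary holds all n² pairs, which matches the O(n²) space bound stated above.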
SimRank
Computing SimRank - Pruning
• Pruning the logical graph G²
– In the naïve method,
  • All n² nodes of G² are considered
  • Similarity scores are computed for every node pair
– Nodes far from a node v tend to have lower similarity scores with v than nodes near v
• Pruning
– Set the similarity between two nodes that are far apart to 0
– Consider only node pairs whose nodes lie within radius r of each other
– Complexity
  • Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
  • Time: O(K n d_r d₂)
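One way to select the node pairs that survive pruning is a breadth-first search of depth r from every node. The sketch below is an assumption-laden illustration: it measures distance while ignoring edge direction (the slide does not say how the radius is measured), and `pairs_within_radius` and the chain graph are hypothetical.

```python
from collections import deque


def pairs_within_radius(edges, r):
    """Return the set of unordered node pairs at undirected distance <= r."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)  # ignore edge direction (assumption)

    pairs = set()
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if dist[v] == r:
                continue  # do not expand beyond radius r
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        pairs.update(frozenset((start, w)) for w in dist if w != start)
    return pairs


# Hypothetical chain a -> b -> c -> d (assumed for illustration).
near = pairs_within_radius([("a", "b"), ("b", "c"), ("c", "d")], r=2)
```

A pruned SimRank computation would then store and update scores only for the pairs in `near`, treating every other pair as having similarity 0.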
Random Surfer-Pairs Model
• Provide an intuitive model to explain similarity scores
– Based on “random surfers”
– Show that the SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node
• Expected Distance (ED)
– u and v are nodes in a strongly connected graph
– The ED from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u
\[
d(u, v) = \sum_{t : u \rightsquigarrow v} P[t]\, l(t)
\]
– Tour t = <w_1, …, w_k>
– l(t): length of t
– P[t]: probability of traveling t
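The expected distance can be approximated numerically without enumerating tours. The sketch below assumes a surfer who picks each out-neighbor with equal probability and uses the standard first-step relation d(v, v) = 0 and d(x, v) = 1 + average of d(w, v) over the out-neighbors w of x; the function name `expected_distance`, the sweep count, and the three-node graph are hypothetical.

```python
def expected_distance(out_nb, u, v, sweeps=200):
    """Approximate d(u, v) by fixed-point iteration.

    Assumes the graph is strongly connected, so every d(x, v) is finite,
    and that the surfer chooses each out-neighbor uniformly at random.
    """
    d = {x: 0.0 for x in out_nb}
    for _ in range(sweeps):
        d = {
            x: 0.0 if x == v
            else 1.0 + sum(d[w] for w in out_nb[x]) / len(out_nb[x])
            for x in out_nb
        }
    return d[u]


# Hypothetical strongly connected graph (assumed for illustration).
OUT = {"u": ["v", "w"], "v": ["u"], "w": ["u"]}
dist_uv = expected_distance(OUT, "u", "v")
```

For this graph the relation gives d(u, v) = 1 + d(w, v)/2 and d(w, v) = 1 + d(u, v), so the iteration settles at d(u, v) = 3.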
Random Surfer-Pairs Model
• Expected Meeting Distance (EMD)
– EMD is symmetric
– The EMD m(a, b) is simply the expected distance in G² from (a, b) to any singleton node (x, x) ∈ V²
\[
m(a, b) = \sum_{t : (a, b) \rightsquigarrow (x, x)} P[t]\, l(t)
\]
[Figure: example graphs annotated with EMD values: m(*, *) = ∞; m(v, w) = 1, m(u, v) = ∞, m(u, w) = ∞; m(*, *) = 3]
Random Surfer-Pairs Model
• Expected-f Meeting Distance
– Our approach to circumvent the “infinite EMD” problem
  • Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of c^{l(t)} for a constant c ∈ (0, 1)
\[
s'(a, b) = \sum_{t : (a, b) \rightsquigarrow (x, x)} P[t]\, c^{l(t)}
\]
• Equivalence to SimRank
– s'(*, *) exactly models our original definition of SimRank scores
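A hedged sketch of why the equivalence holds, assuming c = C and that both surfers step backward along in-edges, each choosing an in-neighbor uniformly at random (as in the original paper's model): a tour that starts at a singleton node has length 0, so s'(a, a) = c^0 = 1. For a ≠ b, conditioning on the first step of the two surfers gives

\[
s'(a, b) = \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} \frac{1}{|I(a)|\,|I(b)|} \cdot c \cdot s'(I_i(a), I_j(b)) = \frac{c}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s'(I_i(a), I_j(b))
\]

which has the same form as the basic SimRank equation with C = c. Each step multiplies the contribution of a tour by c, which is how c^{l(t)} arises.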
Conclusion
• Main contributions
– A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
– A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
– Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank
Future Work
• Address efficiency and scalability issues
– Including additional pruning heuristics and disk-based algorithms
• Consider ternary (or higher-order) relationships in computing structural-context similarity
• Explore the combination of SimRank with other domain-specific similarity measures