Transcript Slide 1

School of Information
University of Michigan
Search in networks
Lada Adamic (U. Michigan)
NetSci Workshop
May 16th, 2006
Outline
• Search in structured networks
  • small world experiments
  • geographical models
  • hierarchical models
  • studies:
    • HP labs email network (simulated)
    • Club Nexus online community (simulated)
    • Phone interview company (survey)
    • LiveJournal (simulated)
• Search in unstructured networks
  • power law networks
  • Erdos-Renyi networks
  • P2P networks (Gnutella example)
Search in structured networks
Small world experiments then
[Figure: map with NE (Nebraska) and MA (Massachusetts) marked]
Milgram’s experiment (1960’s):
Given a target individual and a particular property, pass the message to a
person you correspond with who is “closest” to the target.
Milgram’s small world experiment
 Target person worked in Boston as a stockbroker.
 296 senders from Boston and Omaha.
 20% of senders reached target.
 Typical strategy – if far from target choose someone
geographically closer, if close to target geographically,
choose someone professionally closer
 average chain length = 6.5
“Six degrees of separation”
Small world experiments now
email experiment
Dodds, Muhamad, Watts,
Science 301, (2003)
18 targets
13 different countries
24,163 message chains
384 reached their targets
average path length 4.0
image by Stephen G. Eick
http://www.bell-labs.com/user/eick/index.html
(unrelated to small world experiment…)
Small world experiment at Columbia
Successful chains disproportionately used
• weak ties (Granovetter)
• professional ties (34% vs. 13%)
• ties originating at work/college
• target's work (65% vs. 40%)
. . . and disproportionately avoided
• hubs (8% vs. 1%) (+ no evidence of funnels)
• family/friendship ties (60% vs. 83%)
Strategy: Geography -> Work
Why study small world phenomena?
Curiosity:
Why is the world small?
How are people able to route messages?
Social Networking as a Business:
Friendster, Orkut, MySpace, FaceBook
LinkedIn, Spoke, VisiblePath
Six degrees of separation - to be expected
Pool and Kochen (1978) - average person has 500-1500
acquaintances
Ignoring clustering, other redundancy …
~ $10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors
But networks are clustered: my friends’ friends tend to be my friends
Watts & Strogatz (1998) - a few random links in an otherwise clustered
graph give an average shortest path close to that of a random graph
Is this the whole picture?
Why are small worlds navigable?
How are people able to find short paths?
How to choose among hundreds of acquaintances?
Strategy:
Simple greedy algorithm - each participant chooses the correspondent who is closest to the target with respect to the given property
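The greedy rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ring network, the single shortcut and the helper names `ring_distance` and `greedy_route` are hypothetical and do not come from the slides; the "property" is position on the ring.

```python
def ring_distance(a, b, n):
    """Hypothetical 'property': distance between positions a and b on a ring of n nodes."""
    return min(abs(a - b), n - abs(a - b))


def greedy_route(neighbors, distance, source, target, max_steps=100):
    """Forward greedily; return the path, or None if no acquaintance is closer."""
    path = [source]
    current = source
    for _ in range(max_steps):
        if current == target:
            return path
        if not neighbors[current]:
            return None
        # choose the acquaintance closest to the target
        best = min(neighbors[current], key=lambda v: distance(v, target))
        if distance(best, target) >= distance(current, target):
            return None  # chain fails: nobody is closer than the current holder
        path.append(best)
        current = best
    return path if current == target else None


# Assumed toy network: a ring of 10 nodes plus one long-range tie between 0 and 5.
n = 10
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
neighbors[0].append(5)
neighbors[5].append(0)
dist = lambda a, b: ring_distance(a, b, n)
route = greedy_route(neighbors, dist, 0, 6)
print(route)
```

Each holder only looks at its own acquaintances, which is what makes the search decentralized.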
Models
geography
Kleinberg (2000)
hierarchical groups
Watts, Dodds, Newman (2001), Kleinberg(2001)
high degree nodes
Adamic, Puniyani, Lukose, Huberman (2001), Newman(2003)
Reverse small world experiment
Killworth & Bernard (1978):
Given hypothetical targets (name, occupation, location, hobbies, religion…)
participants choose an acquaintance for each target
Acquaintance chosen based on
(most often) occupation, geography
only 7% because they “know a lot of people”
Simple greedy algorithm: most similar acquaintance
two-step strategy rare
Spatial search
Kleinberg, ‘The Small World Phenomenon, An Algorithmic Perspective’
Proc. 32nd ACM Symposium on Theory of Computing, 2000.
(Nature 2000)
“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”
S. Milgram, ‘The small world problem’, Psychology Today 1, 61, 1967
nodes are placed on a lattice and connect to nearest neighbors
additional links placed with $p_{uv} \sim d_{uv}^{-r}$
No locality ($p \sim p_0$):
When r = 0, links are randomly distributed, ASP ~ log(n), n size of grid
When r = 0, the expected time of any decentralized algorithm is at least $a_0 n^{2/3}$
When r < 2, expected time at least $a_r n^{(2-r)/3}$
Overly localized links on a lattice ($p \sim 1/d^4$):
When r > 2, expected search time ~ $N^{(r-2)/(r-1)}$
Links balanced between long and short range ($p \sim 1/d^2$):
When r = 2, expected time of a DA (decentralized algorithm) is at most $C (\log N)^2$
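As a rough illustration of how the long-range link probability depends on r, the sketch below computes the normalized probabilities for one node on an n x n grid, using Manhattan distance. The function names, the small grid size and the use of Manhattan distance are assumptions made for this example, not details taken from the slides.

```python
import random


def long_range_probabilities(n, u, r):
    """Probability that node u picks each other grid node as its long-range
    contact, proportional to d(u, v) ** -r (Manhattan distance, assumed)."""
    ux, uy = u
    weights = {}
    for x in range(n):
        for y in range(n):
            d = abs(x - ux) + abs(y - uy)
            if d > 0:  # a node does not link to itself
                weights[(x, y)] = d ** (-r)
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def sample_long_range_contact(n, u, r, rng):
    """Draw one long-range contact for u according to the probabilities above."""
    probs = long_range_probabilities(n, u, r)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[v] for v in nodes], k=1)[0]


probs = long_range_probabilities(5, (2, 2), 2.0)   # r = 2: balanced regime
uniform = long_range_probabilities(5, (2, 2), 0.0)  # r = 0: no locality
contact = sample_long_range_contact(5, (2, 2), 2.0, random.Random(0))
```

With r = 2 a node at distance 1 is four times as likely to be chosen as a node at distance 2; with r = 0 every node is equally likely.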
Hierarchical social network models
Kleinberg, ‘Small-World Phenomena and the Dynamics of Information’
NIPS 14, 2001
Hierarchical network models:
Individuals classified into a hierarchy,
$h_{ij}$ = height of the least common ancestor.
$p_{ij} \sim b^{-\alpha h_{ij}}$
e.g. state-county-city-neighborhood
industry-corporation-division-group
[Figure: hierarchy with branching ratio b = 3 and height h]
Theorem: If $\alpha = 1$ and outdegree is polylogarithmic, can achieve search time s ~ O(log n)
Group structure models:
Individuals belong to nested groups
q = size of smallest group that v, w belong to
$f(q) \sim q^{-\alpha}$
Theorem: If $\alpha = 1$ and outdegree is polylogarithmic, can achieve search time s ~ O(log n)
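A minimal sketch of the hierarchical distance, assuming individuals are the leaves of a complete b-ary tree numbered 0, 1, 2, … from left to right (this numbering and the function names are hypothetical). The height of the least common ancestor is the number of times both indices must be integer-divided by b before they coincide, and the unnormalized link weight is b raised to minus alpha times that height.

```python
def lca_height(i, j, b):
    """Height of the least common ancestor of leaves i and j in a complete b-ary tree."""
    h = 0
    while i != j:
        i //= b  # move both leaves up one level
        j //= b
        h += 1
    return h


def link_weight(i, j, b, alpha=1.0):
    """Unnormalized linking probability, proportional to b ** (-alpha * h_ij)."""
    return b ** (-alpha * lca_height(i, j, b))


# With b = 3: leaves 0 and 1 share a parent, leaves 0 and 3 share a grandparent.
print(lca_height(0, 1, 3), lca_height(0, 3, 3), lca_height(0, 9, 3))
print(link_weight(0, 3, 3))
```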
Sketch of proof
$\lambda^2 |R| < |R'| < \lambda |R|$
$k = c \log^2 n$
calculate probability that s fails to have a link in R’
[Figure: nested groups R and R’, with labels S and T]
Identity and search in social networks
Watts, Dodds, Newman (Science,2001)
individuals belong to hierarchically nested groups
$p_{ij} \sim \exp(-\alpha x)$
multiple independent hierarchies h=1,2,..,H
coexist corresponding to occupation,
geography, hobbies, religion…
Identity and search in social networks
Watts, Dodds, Newman (2001)
Message chains fail at each node with probability p
Network is ‘searchable’ if a fraction r of messages reach the target
$q = (1 - p)^{L} \ge r$
[Figure: curves for N = 102400, N = 204800, N = 409600]
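The searchability condition q = (1 - p)^L >= r can be turned around to give the longest chain length that still lets a fraction r of messages arrive: L <= ln(r) / ln(1 - p). The sketch below works this out; the function names are hypothetical, and p = 0.25, r = 0.05 are used purely as example values.

```python
import math


def success_fraction(p, length):
    """Fraction of chains of the given length that survive per-node failure probability p."""
    return (1.0 - p) ** length


def max_chain_length(p, r):
    """Largest chain length L for which (1 - p) ** L is still at least r."""
    return math.log(r) / math.log(1.0 - p)


L_max = max_chain_length(0.25, 0.05)
print(round(L_max, 2))
```

Chains longer than this bound fail the criterion even if a short path exists in the network.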
Small World Model, Watts et al.
Fits Milgram’s data well
Model parameters: $N = 10^8$, z = 300, g = 100, b = 10, $\alpha = 1$, H = 2
$L_{model} = 6.7$
$L_{data} = 6.5$
more slides on this:
http://www.aladdin.cs.cmu.edu/workshops/wsa/papers/dodds-2004-04-10search.pdf
High degree search
Adamic et al. Phys. Rev. E, 64 46135 (2001)
[Figure: people labeled Mary, Bob and Jane, with the question “Who could introduce me to Richard Gere?”]
Small world experiments so far
Classic small world experiment:
Given a target individual, forward to one of your acquaintances
Observe chains but not the rest of the social network
Reverse small world experiment (Killworth & Bernard)
Given a hypothetical individual,
which of your acquaintances would you choose
Observe individual’s social network and possible choices,
but not resulting chains or complete social network
Testing search models on social networks
advantage: have access to entire communication network
and to individual’s attributes
Use a well defined network:
HP Labs email correspondence over 3.5 months
Edges are between individuals who sent
at least 6 email messages each way
450 users
median degree = 10, mean degree = 13
average shortest path = 3
Node properties specified:
degree
geographical location
position in organizational hierarchy
Can greedy strategies work?
Strategy 1: High degree search
Power-law degree distribution of all senders of email passing through HP labs
[Figure: outdegree distribution of senders with α = 2.0 fit, log-log axes; x-axis: outdegree (number of recipients sender has sent email to), y-axis: proportion of senders (frequency)]
Filtered network
(at least 6 messages sent each way)
Degree distribution no longer power-law, but Poisson
[Figure: p(k) vs. number of email correspondents, k; inset: p(k) vs. k with a logarithmic vertical axis]
It would take 40 steps on average (median of 16) to reach a target!
Strategy 2: Geography
Communication across corporate geography
87% of the 4000 links are between individuals on the same floor
[Figure: building floor plan with floors labeled 1U, 1L, 2U, 2L, 3U, 3L, 4U]
Cubicle distance vs. probability of being linked
[Figure: proportion of linked pairs vs. distance in feet, log-log axes; curves: measured, 1/r, 1/r² (optimum for search)]
Strategy 3: Organizational hierarchy
Email correspondence superimposed on the organizational hierarchy
Example of search path
distance 2
distance 1
distance 1
distance 1
hierarchical distance = 5
search path distance = 4
Probability of linking vs. distance in hierarchy
[Figure: probability of linking vs. hierarchical distance h; observed data and fit exp(-0.92*h)]
in the ‘searchable’ regime: 0 < α < 2 (Watts, Dodds, Newman 2001)
Results
distance    hierarchy    geography    geodesic    org    random
median      4            7            3           6      28
mean        5.7 (4.7)    12           3.1         6.1    57.4
[Figure: histograms of number of pairs vs. number of steps in search, for hierarchy and geography]
Expt 2: Searching a social networking website
Profiles:
status (UG or G)
year
major or department
residence
gender
Personality
you
friendship
romance
freetime
support
(choose 3 exactly):
funny, kind, weird, …
honesty/trust, common interests, commitment, …
socializing, getting outside, reading, …
unconditional accepters, comic-relief givers, eternal optimists
Interests (choose as many as apply)
books: mystery & thriller, science fiction, romance, …
movies: western, biography, horror, …
music: folk, jazz, techno, …
social activities: ballroom dancing, barbecuing, bar-hopping, …
land sports: soccer, tennis, golf, …
water sports: sailing, kayaking, swimming, …
other sports: ski diving, weightlifting, billiards, …
Differences between data sets
HP labs email network
• complete image of communication network
• affinity not reflected
Online community
• partial information of social network
• only friends listed
Degree Distribution for Nexus Net
2469 users, average degree 8.2
[Figure: number of users with so many links vs. number of links; inset: the same data on log-log axes]
Problem: how to construct hierarchies?
Probability of linking by separation in years
[Figure: two panels, prob. two undergrads are friends and prob. two grads are friends vs. separation in years; data with fits $(x+1)^{-1.7}$ and $(x+1)^{-1.1}$]
Hierarchies not useful for other attributes:
Geography
[Figure: probability of being friends vs. distance between residences]
Other attributes: major, sports, freetime activities, movie preferences…
Strategy using user profiles
prob. two undergrads are friends (consider simultaneously)
• both undergraduate, both graduate, or one of each
• same or different year
• both male, both female, or one of each
• same or different residences
• same or different major/department
Results
strategy    random    high degree    profile
median      133       39             21
mean        390       137            53
With an attrition rate of 25%, 5% of the messages get through at
an average of 4.8 steps,
=> hence network is barely searchable
The accuracy of small world chains in social networks
Peter D. Killworth, Christopher McCarty, H. Russell Bernard, Mark House
Social Networks, 2006
 First parallel study of
individuals choices vs.
actual shortest paths
 Network
 105 members of an
interviewing bureau
 10,920 shortest path
connections
 who knows whom
 who a person would select as
the next link in a chain to a
particular person x
recent hire
worked a while
old timer
Accuracy of small world chains
 Shortest paths
 use the network of who-knows whom to calculate actual shortest
paths
 compare to paths formed by individuals’ choices
 21.7% fail through reaching missing data
 23.7% reach cycles : i chooses j, j chooses i
 54.6% reach the target, with chains that are 40% longer on average
than the shortest path
Next choice accuracy and a Markov model
 48% of the time, a person chooses a contact who is
closer to the target
 over half of the choices are wrong!
 Markov model:
 terminate chain with probability a (attrition)
 choose someone closer to the target with probability p, otherwise
choose someone at same distance
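A small worked version of this Markov model, under the assumption that attrition is applied before every forwarding step and that a chain starts a given number of hops from the target (the function name and parameter names are hypothetical). The probability of eventually getting one hop closer is (1 - a) p / (1 - (1 - a)(1 - p)), and the chain completes if that happens once per hop of initial distance.

```python
def completion_probability(distance, attrition, p_closer):
    """Probability that a chain starting `distance` hops away reaches the target.

    At every step the chain terminates with probability `attrition`; otherwise
    it moves one hop closer with probability `p_closer` or stays at the same
    distance (assumed model, following the bullets above).
    """
    survive = 1.0 - attrition
    stay = survive * (1.0 - p_closer)
    if stay >= 1.0:
        return 0.0  # never terminates but never makes progress either
    per_level = survive * p_closer / (1.0 - stay)
    return per_level ** distance


print(round(completion_probability(3, 0.2, 0.5), 4))
```

With no attrition the chain always completes eventually, however inaccurate the individual choices are; with attrition, every wrong choice costs another chance of termination.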
LiveJournal
 LiveJournal provides an API to crawl the friendship
network + profiles
 friendly to researchers
 great research opportunity
 basic statistics
 Users
 How many users, and how many of those are active?
 Total accounts: 9980558
 ... active in some way: 1979716
 ... that have ever updated: 6755023
 ... updating in last 30 days: 1300312
 ... updating in last 7 days: 751301
 ... updating in past 24 hours: 216581
Age distribution
Predominantly female
& young demographic
 Male: 1370813 (32.4%)
 Female: 2856360 (67.6%)
 Unspecified: 1575389
13 18483
14 87505
15 211445
16 343922
17 400947
18 414601
19 405472
20 371789
21 303076
22 239255
23 194379
24 152569
25 127121
26 98900
27 73392
28 59188
29 48666
Geographic Routing in Social Networks
 David Liben-Nowell, Jasmine Novak, Ravi Kumar,
Prabhakar Raghavan, and Andrew Tomkins (PNAS
2005)
 data used
 Feb. 2004
 500,000 LiveJournal users with US locations
 giant component (77.6%) of the network
 clustering coefficient: 0.2
Degree distributions
 The broad degree distributions we’ve learned to know
and love
 but more probably lognormal than power law
broader indegree than outdegree distribution
Results of a simple greedy geographical algorithm
 Choose source s and target t randomly
 Try to reach target’s city – not target itself
 At each step, the message is forwarded from the current message holder u
to the friend v of u geographically closest to t
• if d(v,t) > d(u,t), stop: 13% of the chains are completed
• if d(v,t) > d(u,t), pick a neighbor at random in the same city if possible, else stop: 80% of the chains are completed
the geographic basis of friendship
 d = d(u,v) the distance between pairs of people
 The probability that two people are friends given their
distance is equal to
• P(d) = ε + f(d), ε is a constant independent of geography
• ε is $5.0 \times 10^{-6}$ for LiveJournal users who are very far apart
the geographic basis of friendship
 The average user will have ~ 2.5 non-geographic friends
 The other friends (5.5 on average) are distributed according to an
approximate 1/distance relationship
 But 1/d was proved not to be navigable by Kleinberg, so what gives?
Navigability in networks of variable geographical density
 Kleinberg assumed a uniformly populated 2D lattice
 But population is far from uniform
 population networks and rank-based friendship
 probability of knowing a person depends not on absolute
distance but on relative distance (i.e. how many people live
closer): $\Pr[u \to v] \sim 1/\mathrm{rank}_u(v)$
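A minimal sketch of rank-based friendship, assuming people are points in the plane and ties in distance are broken by input order (the names and coordinates below are made up for illustration). The rank of v with respect to u is its position in the list of other people sorted by distance from u.

```python
import math


def rank_based_probabilities(u, positions):
    """Probability that u befriends each other person, proportional to 1 / rank_u(v)."""
    others = sorted(
        (v for v in positions if v != u),
        key=lambda v: math.dist(positions[u], positions[v]),
    )
    weights = {v: 1.0 / rank for rank, v in enumerate(others, start=1)}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


# Hypothetical layout: "a" is close to "u"; "b" and "c" are far away and near each other.
positions = {"u": (0, 0), "a": (1, 0), "b": (5, 0), "c": (6, 0)}
friend_probs = rank_based_probabilities("u", positions)
print(friend_probs)
```

Only the ordering of distances matters, so the same rule applies in dense cities and sparse rural areas.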
Structured search Conclusions
Individuals associate on different levels into groups.
Individuals tend to know others who are ‘close by’
Group structure facilitates decentralized search using social
ties.
Hierarchy search faster than geographical search
Simple strategies are not perfect – but short (rather than
shortest) chains can be found
Weighted shortest paths
 Routes
 shortest route from Chicago to Boston
 vertex: intersection
 edge weights: road distances
 alternative weights: expected time traveled, gas consumed…
 usually sum the weights from each segment
[Figure: road map from start to finish]
surface road: 50 miles at 25 mph = 2 hours
freeway, 70 mph: 30 miles / 70 mph ~ 26 minutes
freeway, 65 mph: 40 miles / 65 mph ~ 37 minutes
Reliable paths through social networks
 The probability of transmitting a message or infectious
agent could be related to the strength of the tie
 e.g. rather than summing the weights, we might multiply the
probabilities of getting through
[Figure: network with edge probabilities p = 1, p = 0.001, p = 0.05, p = 0.5, p = 0.5]
Probability of getting an idea through to the head of labs
via CEO (0.001*1 = 0.001), via direct manager (0.5*0.5 = 0.25)
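One way to find the most reliable path is a Dijkstra-style search that maximizes the product of edge probabilities instead of minimizing a sum. The sketch below is an illustration only: the node names and the graph are assumed from the example above (the p = 0.05 edge is left out because its endpoints are not stated), and `most_reliable_path` is a hypothetical helper.

```python
import heapq


def most_reliable_path(graph, source, target):
    """Return (probability, path) of the path maximizing the product of edge probabilities."""
    best = {source: 1.0}
    heap = [(-1.0, source, [source])]  # max-heap on probability via negation
    while heap:
        neg_p, node, path = heapq.heappop(heap)
        p = -neg_p
        if node == target:
            return p, path
        if p < best.get(node, 0.0):
            continue  # stale entry: a more reliable route to this node is known
        for nbr, q in graph.get(node, {}).items():
            cand = p * q  # probabilities multiply along the path
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                heapq.heappush(heap, (-cand, nbr, path + [nbr]))
    return 0.0, []


# Assumed example graph: probability that a message gets through each tie.
graph = {
    "me": {"CEO": 0.001, "manager": 0.5},
    "CEO": {"head of labs": 1.0},
    "manager": {"head of labs": 0.5},
    "head of labs": {},
}
prob, path = most_reliable_path(graph, "me", "head of labs")
print(prob, path)
```

Because every probability is at most 1, extending a path never increases its reliability, which is what makes the greedy expansion valid here.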
Search in random networks
Motivation
Power-law (PL) networks, social and P2P
Analysis of scaling of search strategies in PL networks
Simulation
artificial power-law topologies, real Gnutella networks
Comparison with existing P2P search strategies
Reflector, Morpheus
Directed Search
Freenet
How do we search?
[Figure: people labeled Mary, Bob and Jane, with the question “Who could introduce me to Richard Gere?”]
AT&T Call Graph
[Figure: # of telephone numbers from which calls were made vs. # of telephone numbers called]
Aiello et al. STOC ‘00
Gnutella network
power-law link distribution
[Figure: proportion of nodes vs. number of neighbors, log-log axes; data and power-law fit, t = 2.07]
summer 2000,
data provided by Clip2
Preferential attachment model
Nodes join at different times
The more connections a node has, the more likely it is to acquire
new connections
Growth process produces power-law network
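The growth process can be sketched as follows. This is one common way to simulate preferential attachment, not necessarily the exact model behind the slide: each new node attaches to m distinct existing nodes chosen with probability proportional to their current degree. The function name, the seed graph (a complete graph on m + 1 nodes) and the parameters are assumptions.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m existing nodes,
    chosen with probability proportional to degree. Returns a list of edges."""
    rng = random.Random(seed)
    # Assumed starting point: a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `stubs` once per edge end, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


edges = preferential_attachment(100, 2)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
print(len(edges), max(degree.values()))
```

Nodes that join early have more time to accumulate links, which is where the heavy tail of the degree distribution comes from.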
[Figure labels: ping, host cache, ping]
Gnutella and the bandwidth barrier
file sharing w/o a central index
queries broadcast to every node within radius ttl
 as network grows, encounter a bandwidth
barrier (dial up modems cannot keep up with
query traffic, fragmenting the network)
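A simplified sketch of a broadcast query with a time-to-live, to show why traffic grows with network size. It assumes each node forwards a query only the first time it sees it and, unlike a real client, also sends it back toward the node it came from; `flood` and the toy path network are hypothetical.

```python
from collections import deque


def flood(neighbors, source, ttl):
    """Broadcast from source to every node within `ttl` hops.

    Returns (set of nodes reached, number of query messages sent).
    """
    hops = {source: 0}
    queue = deque([source])
    messages = 0
    while queue:
        node = queue.popleft()
        if hops[node] == ttl:
            continue  # ttl exhausted: this node does not forward further
        for nbr in neighbors[node]:
            messages += 1  # one query message per link used
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return set(hops), messages


# Toy network: a path 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
reached, messages = flood(path_graph, 0, 2)
print(reached, messages)
```

Every node within the radius handles every query, so the load on each node grows with the number of nodes issuing queries.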
Clip 2 report
Gnutella: To the Bandwidth Barrier and Beyond
http://www.clip2.com/gnutella.html#q17
[Figure: power-law graph; number of nodes found: 94, 67, 63, 54, 2, 6, 1]
[Figure: Poisson graph; number of nodes found: 93, 19, 15, 11, 7, 3, 1]
Search with knowledge of 2nd neighbors
Outline of search strategy
pass query onto only one neighbor at each step
OPTIONS
- avoid passing message onto a node twice (requires that nodes sign query)
- pass to the highest degree node (requires knowledge of one’s neighbors’ degree)
- route to 2nd degree neighbors (requires knowledge of one’s neighbors’ neighbors)
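The first two options can be combined in a short sketch: the query carries the list of nodes that have signed it, each holder checks whether the target is one of its own neighbors, and otherwise passes the query to its highest degree neighbor that has not yet signed. Giving up at a dead end is a simplification assumed here, and the toy network and function name are hypothetical.

```python
def high_degree_search(neighbors, source, target, max_steps=1000):
    """Return the list of nodes that handled the query, or None if the search gets stuck."""
    visited = [source]  # IDs signed onto the query, in order
    current = source
    for _ in range(max_steps):
        # each node knows its own neighbors, so it can answer for them
        if current == target or target in neighbors[current]:
            return visited
        unvisited = [v for v in neighbors[current] if v not in visited]
        if not unvisited:
            return None  # dead end (simplification: no backtracking)
        current = max(unvisited, key=lambda v: len(neighbors[v]))
        visited.append(current)
    return None


# Toy network with one hub.
network = {
    "a": ["b", "c"],
    "b": ["a", "hub"],
    "c": ["a"],
    "hub": ["b", "d", "e", "f"],
    "d": ["hub"],
    "e": ["hub"],
    "f": ["hub"],
}
handled = high_degree_search(network, "a", "f")
print(handled)
```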
Generating functions
• M.E.J. Newman, S.H. Strogatz, and D.J. Watts, ‘Random graphs with arbitrary degree distributions and their applications’, PRE, cond-mat/0007235
• Generating functions for degree distributions
$$G_0(x) = \sum_{k=0}^{\infty} p_k x^k$$
• Useful for computing moments of degree distribution, component sizes, and average path lengths
Fun with generating functions
• normalization condition: probabilities sum to 1
$$G_0(1) = \sum_{k=0}^{\infty} p_k = 1$$
• derivatives: the generating function contains all the information of the degree distribution
$$p_k = \frac{1}{k!} \left. \frac{\partial^k G_0}{\partial x^k} \right|_{x=0}$$
Fun with generating functions (cont’d)
• Expected degree of a randomly chosen vertex
$$\langle k \rangle = \sum_k k p_k = G_0'(1)$$
• Higher moments of degree distribution
$$\langle k^n \rangle = \sum_k k^n p_k = \left[ \left( x \frac{d}{dx} \right)^n G_0(x) \right]_{x=1}$$
Example: Poisson distribution
• Let p = z/N be the probability of an edge existing between two vertices (z is the average degree)
$$G_0(x) = \sum_{k=0}^{N} \binom{N}{k} p^k (1-p)^{N-k} x^k = (1 - p + px)^N \sim e^{z(x-1)} \quad \text{for large } N$$
$$G_0'(x) = z e^{z(x-1)}, \qquad G_0'(1) = z$$
$$p_k = \frac{1}{k!} \left. \frac{\partial^k G_0}{\partial x^k} \right|_{x=0} = \frac{1}{k!} z^k e^{-z}$$
just the regular Poisson distribution
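The Poisson example can be checked numerically. The sketch below builds a truncated Poisson degree distribution and evaluates the generating function and its first derivative as plain sums; the truncation at k = 60 and the function names are assumptions made for the example.

```python
import math


def poisson_pk(z, kmax):
    """Poisson degree distribution p_k = z^k e^{-z} / k!, truncated at kmax."""
    return [z ** k * math.exp(-z) / math.factorial(k) for k in range(kmax + 1)]


def G0(pk, x):
    """Generating function: sum over k of p_k x^k."""
    return sum(p * x ** k for k, p in enumerate(pk))


def G0_prime(pk, x):
    """First derivative: sum over k >= 1 of k p_k x^(k-1)."""
    return sum(k * p * x ** (k - 1) for k, p in enumerate(pk) if k >= 1)


z = 3.0
pk = poisson_pk(z, 60)
print(G0(pk, 1.0))        # normalization, close to 1
print(G0_prime(pk, 1.0))  # mean degree, close to z
print(G0(pk, 0.5), math.exp(z * (0.5 - 1.0)))  # compare with e^{z(x-1)}
```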
Introducing cutoffs
$k_{max} \le N - 1$: a node cannot have more connections than there are other nodes
This is important for exponents close to 2
$$1 = \sum_{k=1}^{\infty} p_k = C \zeta(t), \qquad C = \frac{1}{\zeta(2)} = \frac{6}{\pi^2} \text{ for } t = 2$$
$$p(k \ge 1000, t = 2) = \sum_{k=1000}^{\infty} p_k \sim 0.001$$
Probability that none of the nodes in a 1,000 node graph has 1000 or more neighbors:
$$(1 - p(k \ge 1000, t = 2))^{1000} \sim 0.36$$
without a cutoff, for t = 2, have > 50% chance of observing a node with more neighbors than there are nodes
for t = 2.1, have a 25% chance
Selecting from a variety of cutoffs
1. $k_{max} = N$
2. $p_k = C k^{-t} e^{-k/\kappa}$ (Newman et al.)
3. $p_k = C k^{-t}$ for $k \le (CN)^{1/t}$, $p_k = 0$ otherwise (Aiello et al.)
Generating function (with cutoff 3):
$$G_0(x) = C \sum_{k=1}^{(CN)^{1/t}} k^{-t} x^k$$
[Figure: 1 million websites (~ 1997); proportion of sites w/ so many links vs. # of sites linking to the site]
Aiello’s ‘conservative’ vs. Havlin’s ‘natural’ cutoff
Aiello: cutoff where expected number of nodes of degree k is 1
$$N p_k = 1, \qquad C k^{-t} = N^{-1}, \qquad k \sim N^{1/t}$$
Havlin: cutoff so that expected number of nodes of degree > k is 1
$$N \sum_{k = k_{max}}^{\infty} p_k = 1, \qquad c\, k_{max}^{1-t} \sim N^{-1}, \qquad k_{max} \sim N^{1/(t-1)}$$
[Figure: two plots of n(k) vs. k illustrating each cutoff]
The imposed cutoff can have a dramatic
effect on the properties of the graph
degrees drawn at random, for t = 2, and N = 1000
Generating functions for degree distributions
• Random graphs with arbitrary degree distributions and their applications, by Newman, Strogatz & Watts
$G_0(x) = \sum_{k=0}^{\infty} p_k x^k$ is a generating function
$p_k \sim k^{-t}$ is the probability that a randomly chosen vertex has degree k
$\langle k \rangle = \sum_k k p_k = G_0'(1)$ is the expected degree of a randomly chosen vertex
$G_1(x) = G_0'(x) / G_0'(1)$ is the distribution of remaining outgoing edges following an edge
$z_2 = G_0'(1) G_1'(1)$ is the expected number of second degree neighbors, assuming neighbors don’t share edges
[Figure: example graph with vertices labeled 1 and 2]
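Since G_1'(1) = G_0''(1) / G_0'(1), the product G_0'(1) G_1'(1) reduces to the sum of k(k-1) p_k, so z_1 and z_2 can be computed directly from a degree distribution. The sketch below compares a Poisson distribution with a truncated power law; the helper names, the exponent t = 2.1 and the cutoff of 1000 are example choices, not values from the slides.

```python
import math


def first_and_second_neighbors(pk):
    """Return (z1, z2): z1 = sum k p_k, z2 = sum k (k - 1) p_k = G0'(1) G1'(1)."""
    z1 = sum(k * p for k, p in enumerate(pk))
    z2 = sum(k * (k - 1) * p for k, p in enumerate(pk))
    return z1, z2


def poisson(z, kmax):
    """Poisson degree distribution truncated at kmax."""
    return [z ** k * math.exp(-z) / math.factorial(k) for k in range(kmax + 1)]


def power_law(t, kmax):
    """p_k proportional to k^-t for 1 <= k <= kmax, with p_0 = 0."""
    weights = [0.0] + [k ** (-t) for k in range(1, kmax + 1)]
    total = sum(weights)
    return [w / total for w in weights]


z1_ps, z2_ps = first_and_second_neighbors(poisson(3.0, 80))
z1_pl, z2_pl = first_and_second_neighbors(power_law(2.1, 1000))
print(z1_ps, z2_ps / z1_ps)  # Poisson: following an edge gives about z1 new neighbors
print(z1_pl, z2_pl / z1_pl)  # power law: following an edge gives many more than z1
```

The ratio z2 / z1 is the expected number of new neighbors reached by following an edge, which is why a walk on a power-law graph quickly lands on high degree nodes.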
search with knowledge of first neighbors
Generating function with cutoff:
$$G_0(x) = c \sum_{k=1}^{k_{max}} k^{-t} x^k$$
$$G_0'(x) = \frac{\partial}{\partial x} G_0(x) = c \sum_{k=1}^{k_{max}} k^{1-t} x^{k-1}$$
Average degree of vertex:
$$G_0'(1) = \langle k \rangle = c \sum_{k=1}^{k_{max}} k^{1-t} \sim \int_1^{k_{max}} k^{1-t}\, dk = \frac{1 - k_{max}^{2-t}}{t-2}$$
$$G_1(x) = \frac{G_0'(x)}{G_0'(1)} = \frac{c}{G_0'(1)} \sum_{k=1}^{k_{max}} k^{1-t} x^{k-1}$$
$$G_1'(x) = \frac{\partial}{\partial x} G_1(x) = \frac{c}{G_0'(1)} \sum_{k=2}^{k_{max}} k^{1-t} (k-1) x^{k-2}$$
Average number of neighbors following an edge:
$$G_1'(1) = \frac{1}{G_0'(1)} \cdot \frac{k_{max}^{3-t}(t-2) - 2^{2-t}(t-1) + k_{max}^{2-t}(3-t)}{(t-2)(3-t)}$$
Slide annotations on the terms: constant in N; for 2 < t < 3 and $k_{max} \sim N^a$, decreases with N
search with knowledge of first neighbors (cont’d)
$$z_{1B} = G_1'(1) \sim \frac{1}{G_0'(1)} \frac{k_{max}^{3-t}}{3-t} = \frac{t-2}{1 - k_{max}^{2-t}} \cdot \frac{k_{max}^{3-t}}{3-t} \sim k_{max}^{3-t}$$
In the limit t -> 2,
$$G_1'(1) \sim \frac{k_{max}}{\log(k_{max})}$$
For the moment, let’s ignore the fact that as we do a random walk, we encounter neighbors that we’ve seen before
$$s = \text{number of steps} = \frac{N}{z_{1B}}$$
Search time with different cutoffs
If $k_{max} = N$:
$$s(t) \sim \frac{N}{k_{max}^{3-t}} = \frac{N}{N^{3-t}} = N^{t-2}, \quad 2 < t < 3$$
$$s(2.1) \sim N^{0.1}$$
grow from 1,000 to 1,000,000 nodes, search time increases by a factor of ~2
$$s(2) \sim \frac{N \log(k_{max})}{k_{max}} = \log(N), \quad t = 2$$
If $k_{max} = N^{1/(t-1)}$:
$$s(t) \sim \frac{N}{k_{max}^{3-t}} = \frac{N}{N^{(3-t)/(t-1)}} = N^{2(t-2)/(t-1)}, \quad 2 < t < 3$$
$$s(2.1) \sim N^{0.18}$$
grow 1000x, search time increases 3x
$$s(2) \sim \frac{N \log(k_{max})}{k_{max}} = \log(N)$$
search with knowledge of first neighbors (cont’d)
If $k_{max} = N^{1/t}$:
$$s \sim \frac{N}{k_{max}^{3-t}} = \frac{N}{(N^{1/t})^{3-t}} = N^{2 - 3/t}, \quad 2 < t < 3$$
So the best we can do is for exponents close to 2
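The three cutoffs can be compared with a little arithmetic. Writing the cutoff as k_max = N^c, the estimate s ~ N / k_max^(3-t) gives s ~ N^(1 - c(3-t)) for 2 < t < 3. The sketch below evaluates that exponent for c = 1, c = 1/(t-1) and c = 1/t; the function name and the dictionary labels are hypothetical.

```python
def random_walk_exponent(t, cutoff_exponent):
    """Exponent e in s ~ N^e when k_max = N^cutoff_exponent and s ~ N / k_max^(3 - t)."""
    return 1.0 - cutoff_exponent * (3.0 - t)


t = 2.1
exponents = {
    "k_max = N": random_walk_exponent(t, 1.0),
    "k_max = N^(1/(t-1))": random_walk_exponent(t, 1.0 / (t - 1.0)),
    "k_max = N^(1/t)": random_walk_exponent(t, 1.0 / t),
}
for label, e in exponents.items():
    # growth factor of the search time when the network grows 1000x
    print(label, round(e, 3), round(1000 ** e, 2))
```

A smaller cutoff removes the very highest degree nodes, so the exponent, and with it the search time, rises quickly as the cutoff is tightened.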
2nd neighbor random walk, ignoring overlap:
$$S \sim \frac{N}{z_{2B}}$$
$$z_{2B} = \left[ \frac{\partial}{\partial x} G_1(G_1(x)) \right]_{x=1} = [G_1'(1)]^2 = \left[ \frac{t-2}{1 - k_{max}^{2-t}} \cdot \frac{k_{max}^{3-t}}{3-t} \right]^2$$
$$S(N, t) \sim N^{3(1 - 2/t)}$$
$$S(N, t = 2.1) \sim N^{0.15}$$
Following the degree sequence
Go to highest degree node, then next highest, … etc.
$$z_{1D} = \int_{k_{max} - a}^{k_{max}} N k^{1-t}\, dk \sim N a\, k_{max}^{1-t}$$
a ~ s = # of steps taken
2nd neighbors, ignoring overlap:
$$z_{1D}\, G_1'(1) \sim N a\, k_{max}^{2(2-t)}$$
$$s \sim k_{max}^{2(t-2)} \sim N^{2 - 4/t}$$
$$S_{deg}(N, t = 2.1) \sim N^{0.1}$$
Ratio of the degree of a node to the expected degree of its highest
degree neighbor for 10,000 node power-law graphs of varying exponents
[Figure: (degree of neighbor - 1) / degree of node vs. degree of node; curves for t = 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 3.75]
Exponents t close to 2 required to search effectively
Gnutella
World Wide Web, t ~ 2.0-2.3, high degree nodes: directories, search engines
Social networks:
AT&T call graph t ~ 2.1
Actor collaboration graph (imdb database) t ~ 2.0-2.2
[Figure: number of actors/actresses vs. number of costars, log-log axes; actors, t = 2; actresses, t = 2.1]
Following the degree sequence
[Figure: example network with node labels 17, 18, 10, 5, 1, 6, 9, 8, 50]
Complications
Should not visit same node more than once
Many neighbors of current node being visited
were also neighbors of previously visited
nodes, and there is a bias toward high degree
nodes being ‘seen’ over and over again
Status and degree of node visited
[Figure: degree of node vs. step; legend: not visited, visited, neighbors visited]
Progress of exploration in a 10,000 node graph knowing 2nd degree neighbors
seeking high degree nodes speeds up the search process
about 50% of a 10,000 node graph is explored in the first 12 steps
[Figure: proportion of nodes found at step vs. step, log-log axes; curves: random walk, degree sequence]
[Figure: cumulative nodes found at step vs. step; curves: random walk, degree sequence]
Scaling of search time with size of graph
[Figure: cover time for half the nodes vs. size of graph, log-log axes; random walk with α = 0.37 fit, degree sequence with α = 0.24 fit]
Comparison with a Poisson graph
$$G_0(x) = e^{z(x-1)}$$
$$G_1(x) = \frac{1}{z} \frac{\partial}{\partial x} G_0(x) = G_0(x)$$
expected degree and expected degree following a link are equal
scaling is linear
[Figure: degree of current node vs. step; curves: Poisson, power-law]
[Figure: cover time for 1/2 of graph vs. number of nodes in graph, log-log axes; constant av. deg. = 3.4, γ = 1.0 fit]
Gnutella network
50% of the files in a 700 node network can be found in < 8 steps
[Figure: cumulative nodes found at step vs. step; curves: high degree seeking 1st neighbors, high degree seeking 2nd neighbors]
Required modifications to nodes
• Maintain a list of files in their neighborhood
• Check query against list.
• Periodically contact neighbors to maintain list
• Append ID to each query processed
Tradeoff: storage/cpu (available) for bandwidth (limited)
Theory vs. reality:
• overloading high degree nodes
  but no worse than original scenario where all nodes handle all traffic
  assume high degree -> high bandwidth so can carry the traffic load
• fewer nodes used for routing, system is more susceptible to malicious attack
Partial implementation:
• localized indexing
• traffic routed to high degree nodes
Clip2 Distributed Search Solutions
http://dss.clip2.com
© Clip2.com, Inc.
[Figure legend: Broadband user running Reflector; Broadband user running Gnutella; Dial-up user running Gnutella]
Connection-preferencing rules
LimeWire, BearShare:
drop connections to unresponsive hosts
drives slower hosts to have fewer connections &
move to edge of network
Supernodes
Kazaa, BearShare defender, Morpheus SuperNodes
from Clip2: Morpheus out of the Underworld
http://www.openp2p.com/pub/a/p2p/2001/07/02/morpheus.html
Freenet
Queries are passed to one peer at a time.
Queries routed to high degree nodes.
Has a power-law topology
Theodore Hong, ‘Performance’ chapter in O’Reilly’s “Peer-to-Peer, Harnessing the Power of Disruptive Technologies”
Scales as $N^{0.275}$ with the size of the network, N.
[Figure: Theodore Hong, power-law link distribution of a simulated Freenet network]
[Figure: Theodore Hong, scaling of mean search time on a simulated Freenet network]
Node specialization key to Freenet’s speed
Each node forwards query to node with “closest” hash key
Node passing back a match remembers the address the
data came from
Results in nodes developing a bias towards a part of the
keyspace
[Figure: nodes labeled with keys 112, 659, 356, 340, 388, 396, 135, 214, and a request labeled ?356?]
Queries are naturally routed to high degree nodes
Use keys for orientation
Conclusions
Search is faster and scales in power-law networks
Networks intended to be searched, such as Gnutella,
have a favorable P-L topology
High degree strategy has partially been implemented in existing p2p
clients, such as BearShare, Kazaa & Morpheus
Current research on search
 search in weighted networks
 expertise search
 P2P architectures with ‘friendship’ overlays
 weak ties vs. strong ties and online communication
A PL link distribution shortens the average shortest path
$$z_r = \left( \frac{z_2}{z_1} \right)^{r-1} z_1 = \alpha^{r-1} z_1$$
Poisson: α = z1
PL: α > z1
[Figure: neighbors at radius vs. radius; legend: power-law α = 2.5, Poisson α = 1.0; inset: α vs. N for PL and PS]
What about the shortest path discovered along the way?
B.J. Kim et al. ‘Path finding strategies in scale-free networks’, PRE (65) 027103.
each node passes message to highest degree neighbor it hasn’t passed the message to previously
‘cut off’ loops
[Figure: example path between nodes A and B]
A high degree seeking strategy finds shortest paths whose
average scales logarithmically with the size of the graph
[Figure: av. path length found vs. N (logarithmic N axis); PL high degree with 0.72*ln(N) fit]
Scaling of the path length found using a
• random strategy on a PL graph
• high-degree strategy on a Poisson graph
[Figure: av. path length found vs. N, log-log axes; curves: PL, Poisson; fits $N^{0.46}$ and $N^{0.48}$]
But…
Search costs are prohibitive, might as well do a BFS
[Figure: median search cost vs. N, log-log axes; curves: PL high degree, PL rand, Poisson high degree]