Search and Replication in
Unstructured Peer-to-Peer
Networks
Pei Cao, Christine Lv, Edith
Cohen, Kai Li and Scott Shenker
ICS 2002
Outline
• Brief survey of P2P architectures
• Evaluation Methodology
• Search Methods
• Replication
• Conclusions
Peer-to-Peer Networks
• Peers are connected by an
overlay network.
• Users cooperate to share files
(e.g., music, videos, etc.)
• Dynamic: nodes join or leave
frequently
P2P Network Architectures I
• Centralized:
– Use of central directory server (CDS)
– Peers query the CDS to find other peers
that hold the desired object
Pros: very efficient
Cons: scales poorly;
single point of failure
P2P Network Architectures II
• Decentralized: No
central directory
server
– But structured:
• P2P network topology is
tightly controlled
• Files are placed at
specified locations
– Unstructured:
• No control over network
topology or file
placement
P2P Network Architectures III
Decentralized but Structured
• “Loosely structured”
– Placement of files is based on hints
• “Tightly structured”
– Structure of the P2P network and
file placement are precisely specified
– Use of a distributed hash table
Pros: Efficient satisfaction of queries;
good scaling
Cons: No proof that it works
P2P Network Architectures IV
Decentralized and Unstructured
• Placement of files not based on topology
knowledge
• Finding files
– Node queries neighbors (usually using
flooding)
Pros: extremely resilient to network changes
Cons: extremely unscalable;
generates large loads
Evaluation Methodology I
Terminology
• Network Topology:
the graph formed by the nodes in the network at a given instant
• Query Distribution:
frequency of lookups for files
• Replication Distribution:
percentage of nodes that have a particular
file
Evaluation Methodology II
• Network Topologies
– Power-Law Random Graph (PLRG)
• Max node degree: 1746, median: 1, average: 4.46
– Normal Random Graph (Random)
• Average and median node degree are both 4
– Gnutella graph (Gnutella)
• Oct 2000 snapshot
• Max degree: 136, median: 2, average: 5.5
– Two-dimensional Grid
• 100x100 = 10000 nodes
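The degree statistics quoted for each topology (max, median, average) can be reproduced for any adjacency-list graph. The following is a minimal sketch for the grid case only; the function names and the choice of a grid without wrap-around edges are assumptions, since the slides do not say how the grid boundary is handled.

```python
from statistics import median

def grid_graph(rows, cols):
    """Adjacency lists for a rows x cols grid.

    Nodes are (row, col) tuples. No wrap-around edges (an assumption).
    """
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:  # edge to the node below
                adj[(r, c)].append((r + 1, c))
                adj[(r + 1, c)].append((r, c))
            if c + 1 < cols:  # edge to the node on the right
                adj[(r, c)].append((r, c + 1))
                adj[(r, c + 1)].append((r, c))
    return adj

def degree_stats(adj):
    """Return (max, median, average) node degree."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    return max(degrees), median(degrees), sum(degrees) / len(degrees)

if __name__ == "__main__":
    print(degree_stats(grid_graph(100, 100)))
```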
Evaluation Methodology III
• Object query distribution qi
– Uniform
– Zipf-like
• Object replication density distribution ri
– Uniform
– Proportional: ri ∝ qi
– Square-Root: ri ∝ √qi
Evaluation Methodology IV
• Metrics
– User aspects
• Pr(success)
• #hops
– Load aspects
• Average #messages per node
• #nodes visited
• Peak #messages
Limitation of Flooding I
• Gnutella uses a TTL to
limit the number of hops
a query travels
• Problem:
– Hard to choose TTL:
• For objects that are widely
present in the network,
small TTLs suffice
• For objects that are rare in
the network, large TTLs
are necessary
– Number of query
messages grows
exponentially as TTL grows
Limitation of Flooding II
• A node may receive the
same message more
than once
• Need for duplicate
detection
mechanisms
• Even so, duplication
increases as TTL
increases in flooding
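The two flooding limitations above (message growth with TTL, and duplicates that persist even with duplicate detection) can be seen in a small simulation. This is a minimal sketch, not Gnutella's actual protocol: it assumes a synchronous flood over an adjacency-list graph in which a node forwards a query to every neighbor except the one it came from, and the function name `flood` is hypothetical.

```python
def flood(adj, source, ttl):
    """Flood a query from source with a hop limit.

    Returns (nodes_reached, messages_sent, duplicate_messages).
    A message is a duplicate if its receiver has already seen the query.
    """
    seen = {source}
    frontier = [(source, None)]  # (node, neighbor it received the query from)
    messages = 0
    duplicates = 0
    for _ in range(ttl):
        next_frontier = []
        for node, sender in frontier:
            for neighbor in adj[node]:
                if neighbor == sender:
                    continue  # do not send the query straight back
                messages += 1
                if neighbor in seen:
                    duplicates += 1  # detected and dropped, but still sent
                else:
                    seen.add(neighbor)
                    next_frontier.append((neighbor, node))
        frontier = next_frontier
    return len(seen) - 1, messages, duplicates

if __name__ == "__main__":
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    for ttl in (1, 2, 3):
        print(ttl, flood(ring, 0, ttl))
```

On the 4-node ring in the usage example, raising the TTL from 2 to 3 reaches no new nodes; the one extra message it sends is a duplicate.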
Limitation of Flooding Conclusion
• Flooding increases per-node overhead
• Need for more scalable search methods:
– Expanding Ring
– Random Walks
Expanding Ring
• Adaptively adjust TTL
– Multiple floods: start with TTL=1; increment TTL by 2
each time until search succeeds
• Still produces duplicate messages
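The expanding-ring schedule can be sketched as repeated floods with a growing TTL. The code below is an illustration under assumptions: `holders` is a hypothetical set of nodes that store the object, `max_ttl` is a hypothetical cut-off that the slides do not specify, and each flood restarts from the source, so nodes near the source are visited again on every round.

```python
def flood_search(adj, source, holders, ttl):
    """One flood with hop limit ttl. Returns (found, messages_sent)."""
    seen = {source}
    frontier = [(source, None)]  # (node, neighbor it received the query from)
    messages = 0
    found = source in holders
    for _ in range(ttl):
        next_frontier = []
        for node, sender in frontier:
            for neighbor in adj[node]:
                if neighbor == sender:
                    continue
                messages += 1
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append((neighbor, node))
                    found = found or neighbor in holders
        frontier = next_frontier
    return found, messages

def expanding_ring(adj, source, holders, start_ttl=1, step=2, max_ttl=9):
    """Flood with TTL = 1, 3, 5, ... until the object is found.

    Returns (ttl_at_success or None, total_messages over all floods).
    """
    total = 0
    ttl = start_ttl
    while ttl <= max_ttl:
        found, messages = flood_search(adj, source, holders, ttl)
        total += messages
        if found:
            return ttl, total
        ttl += step
    return None, total
```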
Random Walk
• Simple random walk
– Takes too long to find anything
• Multiple-walker random walk
– After T steps each, K walkers visit roughly as
many nodes as 1 walker after K*T steps
– More messages => more overhead
– When to terminate the search:
• TTL
• Checking: check back with query originator once
every C steps
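A K-walker search with both termination rules (TTL and checking) might look like the sketch below. It is a simplified, synchronous model under assumptions: every node has at least one neighbor, a check-back costs one message per walker, walkers keep no state about visited nodes, and all names (`k_walker_search`, `holders`, `check_every`) are hypothetical.

```python
import random

def k_walker_search(adj, source, holders, k=32, ttl=1024, check_every=4, rng=None):
    """K-walker random walk with TTL and periodic checking.

    Returns (found, hop_of_first_hit or None, messages_sent).
    """
    rng = rng or random.Random()
    if source in holders:
        return True, 0, 0
    active = [source] * k  # positions of walkers that are still walking
    found_at = None
    messages = 0
    for hop in range(1, ttl + 1):
        still_walking = []
        for node in active:
            nxt = rng.choice(adj[node])  # step to a random neighbor
            messages += 1
            if nxt in holders:
                if found_at is None:
                    found_at = hop
            else:
                still_walking.append(nxt)
        active = still_walking
        if hop % check_every == 0:
            messages += len(active)  # each walker checks back with the originator
            if found_at is not None:
                break  # query already satisfied: remaining walkers stop
        if not active:
            break
    return found_at is not None, found_at, messages

if __name__ == "__main__":
    ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
    print(k_walker_search(ring, 0, {5}, k=4, rng=random.Random(7)))
```

In this model, walkers that have not found the object keep walking until their next check, so a larger C means fewer check messages but more wasted steps after the first hit.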
Search Traffic Comparison
[Figure: bar chart of avg. # msgs per node per query for Flood, Ring, and Walk on the Random and Gnutella topologies. Bar labels in the original: 2.85, 1.863, 0.961, 0.053, 0.027, 0.031.]
Search Delay Comparison
[Figure: bar chart of # hops till success for Flood, Ring, and Walk on the Random and Gnutella topologies. Bar labels in the original: 9.12, 7.3, 4.03, 3.4, 2.51, 2.39.]
Lessons Learned about Search
Methods
• Key: Cover the right number of nodes as
quickly as possible and with as little
overhead as possible
• Pay attention to
– Adaptive termination
– Minimizing message duplication
– Small expansion in each step
Replication
• In unstructured P2P systems, search
success is essentially about coverage:
visiting enough nodes to find the object =>
replication density matters
• Goal: minimize average search size
(number of probes till query is satisfied)
• Theoretical optimum: copy everything
everywhere
– Not feasible: node storage is limited
Replication Strategies
• Uniform Replication
– pi = 1/m
– Simple, resources are divided equally
• Proportional Replication
– pi = qi
– “Fair”, resources per item proportional to
demand
– Reflects current P2P practices
Square-Root Replication
• pi is proportional to square-root(qi)
• Lies “In-between” Uniform and Proportional
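The three allocations can be compared numerically. The sketch below assumes a simple random-probing model in which an object with r copies on n nodes needs about n / r probes to find, so the average search size is the sum over objects of qi * n / ri. The function names, the node count, the copy budget, and the Zipf exponent of 1 are illustrative assumptions, not values from the slides.

```python
from math import sqrt

def allocate(q, strategy):
    """Return the fraction pi of all copies given to each object (sums to 1)."""
    if strategy == "uniform":
        weights = [1.0] * len(q)
    elif strategy == "proportional":
        weights = list(q)
    elif strategy == "square_root":
        weights = [sqrt(qi) for qi in q]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(weights)
    return [w / total for w in weights]

def avg_search_size(q, p, n_nodes, total_copies):
    """Expected probes per query, assuming about n / ri probes for object i."""
    return sum(qi * n_nodes / (pi * total_copies) for qi, pi in zip(q, p))

if __name__ == "__main__":
    m = 100  # number of objects (assumed)
    raw = [1.0 / rank for rank in range(1, m + 1)]  # Zipf-like, exponent 1 (assumed)
    q = [x / sum(raw) for x in raw]
    for strategy in ("uniform", "proportional", "square_root"):
        p = allocate(q, strategy)
        print(strategy, round(avg_search_size(q, p, 10000, 20000), 2))
```

Under this model, Uniform and Proportional give the same average search size (n * m / R for m objects and R total copies), while Square-Root gives a smaller one.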
Achieving Square-Root Replication I
• Assume each query keeps track of the
number of probes it needed
• Store an object at a number of nodes that
is proportional to the number of probes
• Two implementations:
– Path replication: store the object along the
path of a successful “walk”
– Random replication: store the object randomly
among nodes visited by the agents
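The two implementations differ only in where the copies go. A minimal sketch, assuming `path` is the list of nodes from the requester to the node that supplied the object, `visited` is the list of all nodes any walker touched, and the number of copies equals the number of nodes on the path excluding the supplier; these names and that counting convention are assumptions made for illustration.

```python
import random

def path_replication(path):
    """Copies go to every node on the successful walk except the supplier."""
    return list(path[:-1])

def random_replication(path, visited, rng=None):
    """Same number of copies as path replication, placed on randomly
    chosen nodes among those the walkers visited."""
    rng = rng or random.Random()
    count = len(path) - 1
    candidates = sorted(set(visited) - {path[-1]})  # supplier already has it
    return rng.sample(candidates, min(count, len(candidates)))

if __name__ == "__main__":
    path = ["a", "b", "c", "d"]  # "d" supplied the object
    visited = ["a", "b", "c", "d", "e", "f", "g"]
    print(path_replication(path))
    print(random_replication(path, visited, random.Random(0)))
```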
Achieving Square-Root Replication II
Evaluation of Replication Methods I
• Metrics
– Overall message traffic
– Search delay
• Dynamic simulation
– Assume Zipf-like object query probability
– Queries arrive as a Poisson process at 5 queries/sec
– Results are measured over the interval 5000 sec to 9000 sec
– Search method: 32-walker random walk with
state keeping, checking every 4 steps
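A workload matching the description above (Poisson arrivals, Zipf-like popularity) can be generated as follows. This is a sketch under assumptions: the slides give neither the number of objects nor the Zipf exponent, so `m` and `alpha` are hypothetical parameters, and object 0 is taken to be the most popular.

```python
import random

def zipf_weights(m, alpha=1.0):
    """Unnormalized Zipf-like weights: the object of rank r gets 1 / r**alpha."""
    return [1.0 / (rank ** alpha) for rank in range(1, m + 1)]

def generate_queries(m, rate, duration, alpha=1.0, rng=None):
    """Yield (arrival_time, object_id) pairs.

    Inter-arrival times are exponential with the given rate, which makes
    the arrivals a Poisson process; object ids follow the Zipf-like weights.
    """
    rng = rng or random.Random()
    weights = zipf_weights(m, alpha)
    objects = list(range(m))
    t = rng.expovariate(rate)
    while t < duration:
        yield t, rng.choices(objects, weights=weights)[0]
        t += rng.expovariate(rate)

if __name__ == "__main__":
    queries = list(generate_queries(m=100, rate=5.0, duration=9000.0,
                                    rng=random.Random(42)))
    measured = [q for q in queries if q[0] >= 5000.0]  # measurement window
    print(len(queries), len(measured))
```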
Evaluation of Replication Methods II
[Figure: avg. # msgs per node (5000-9000 sec) for Owner Rep, Path Rep, and Random Rep; y-axis from 0 to 60000.]
Square-Root Replication reduces search traffic
Evaluation of Replication Methods III
Dynamic simulation: Hop Distribution (5000~9000s)
[Figure: queries finished (%) versus #hops (1 to 256, doubling at each tick) for Owner Replication, Path Replication, and Random Replication.]
Conclusions
• Multi-walker random walk scales much
better than flooding
– Can find data more quickly
– Reduces the traffic overload
• Square-root replication distribution is
desirable
– Minimizes search delay
– Minimizes the overall search traffic