A Quantitative Analysis of the Gnutella Network Traffic


Dept. of Computer Science & Engineering. @ University of California - Riverside
“Information Retrieval in
Peer-to-Peer Systems”
Demetrios Zeinalipour-Yazti
M.Sc. Thesis Defense
Monday, May 5, 2003
Surge 349 12:00-1:00 PM
Thesis Committee:
Dr. Dimitrios Gunopulos, Chairperson
Dr. Vana Kalogeraki
Dr. Chinya V. Ravishankar
http://www.cs.ucr.edu/~csyiazti/msc.html
Presentation Outline
• Introduction & Motivation.
• Search Techniques for P2P systems
• The Intelligent Search Mechanism
• PeerWare Simulation Infrastructure
• Experimental Evaluation.
• Conclusions & Future Work.
Introduction to Peer-to-Peer
• Peer-to-Peer Computing definition: “Sharing of computer resources and information through direct exchange”
• Clients (downloaders) are also servers.
• Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
• Searches are done within the virtual network, while actual downloads are done offline (with HTTP).
[Figure: the virtual P2P topology vs. the physical topology]
Introduction to Peer-to-Peer
• Peer-to-Peer (P2P) systems are becoming increasingly popular.
• P2P file-sharing systems such as Gnutella, Napster and Freenet realize a distributed infrastructure for sharing files.
• Traditionally, files were shared using the Client-Server model (e.g. HTTP), which does not scale because the services are centralized.
• P2P systems offer new advantages in simplicity of use, robustness, self-organization and scalability.
Information Retrieval in P2P
Problem:
“How to efficiently retrieve Information in P2P systems where
each node shares a collection of documents?”
• Documents consist of keywords.
• Resembles Information Retrieval, but resources are now distributed.
• Primary data structures such as Global Inverted Indexes can’t be maintained efficiently.
Solutions for P2P Information Retrieval
1) Centralized Approaches
• Centralized Indexes
• e.g. Napster, SETI@HOME
[Figure: Centralized Index. 1) Upload Index, 2) Query/QueryHit, 3) Download (offline)]
2) Purely Distributed Approaches
• Each node has only local knowledge.
• I.R. is done using brute-force mechanisms.
• e.g. Gnutella, Fasttrack (Kazaa)
[Figure: 1) Connect, 2) Query/QueryHit, 3) Download (offline)]
3) Hybrid Approaches
• One or more peers have partial indexes of the contents of others.
• e.g. Limewire's Ultrapeers
[Figure: 1) Connect, 2) Intelligent Query/QueryHit, 3) Download (offline)]
Motivation
• On 1st June we crawled the Gnutella P2P Network for 5 hours with 17 workstations.
• We analyzed 15,153,524 query messages.
• Observation: High locality of specific queries.
• Can we exploit this property for more efficient searches?
Search Techniques for P2P systems
1. Breadth-First Search (Gnutella)
• Idea: Each Query Message is propagated along all outgoing links of a peer using TTL (time-to-live).
• TTL is decremented on each forward until it becomes 0.
• This is the technique used for I.R. in P2P systems such as Gnutella.
• Highlights
  – The flood of messages brings the physical network to its knees.
  – Long delays for search results.
[Figure: Peer q sends a QUERY (1) through node A into P2P Network N; Peer d answers with a QUERYHIT (2).]
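The flooding idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Gnutella implementation: the `overlay` graph, the function name `bfs_flood` and the rule that a peer does not send the query back to the neighbor it came from are assumptions made for the example.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}                        # peers that already handled the query
    frontier = deque([(source, None, ttl)])
    messages = 0
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue                       # TTL exhausted: stop forwarding
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue                   # assumption: never echo back to the sender
            messages += 1                  # one message per outgoing link
            if neighbor not in seen:       # duplicates are received but dropped
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# Hypothetical 5-peer overlay given as adjacency lists.
overlay = {
    "q": ["a", "b"],
    "a": ["q", "b", "c"],
    "b": ["q", "a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
reached, sent = bfs_flood(overlay, "q", ttl=2)
```

Even in this tiny overlay some messages are duplicates that the receiver drops, which is the overhead the later techniques try to cut.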
Search Techniques for P2P systems
2. Modified Random BFS
[V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti . CIKM2002]
• Idea: Each Query Message is forwarded to only a fraction of outgoing links (e.g. ½ of them).
• TTL is again decremented on each forward until it becomes 0.
• Highlights
  – Fewer messages but possibly fewer results.
  – This algorithm is probabilistic.
  – Some segments may become unreachable.
[Figure: Peer q sends a QUERY (1) through node A into P2P Network N; Peer d answers with a QUERYHIT (2), while the segments through B and C are marked unreachable.]
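A minimal sketch of the random-subset forwarding described above, assuming a hypothetical `overlay` graph and a `fraction` parameter for the share of links used per hop. It is an illustration under these assumptions, not the code evaluated in the thesis.

```python
import random
from collections import deque

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward the query to a random fraction of each peer's links."""
    rng = rng or random.Random()
    seen = {source}
    frontier = deque([(source, None, ttl)])
    messages = 0
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue                                  # TTL exhausted
        candidates = [n for n in graph[peer] if n != sender]
        if not candidates:
            continue
        fanout = max(1, int(len(candidates) * fraction))
        for neighbor in rng.sample(candidates, fanout):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# Hypothetical 5-peer overlay given as adjacency lists.
overlay = {
    "q": ["a", "b"],
    "a": ["q", "b", "c"],
    "b": ["q", "a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
reached, sent = random_bfs(overlay, "q", ttl=2, fraction=0.5, rng=random.Random(0))
```

With `fraction=1.0` the sketch degenerates to plain flooding; with `fraction=0.5` it sends fewer messages but leaves part of the overlay unexplored.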
Search Techniques for P2P systems
3. Searching Using Random Walkers
[Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. ICS 2002]
• Idea: Each Query Message is forwarded to 1 neighbor.
• With k walkers after T steps we expect to reach roughly as many nodes as 1 walker after kT steps. (They use 16-64 walkers.)
• Highlights
  – Network traffic is reduced (compared to BFS) by 2 orders of magnitude.
  – Increases the user-perceived delay (from 2-6 hops to 4-15 hops).
  – This algorithm is probabilistic, and the likelihood of locating the objects depends on the network topology.
[Figure: a 2-walker search; Peer q sends two QUERY messages (1 and 2) that walk independently toward Peer d.]
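The walker idea can be sketched as follows. This is a simplified illustration: the function name, the `has_object` predicate and the fixed step budget are assumptions, and the periodic check-back with the requester used in the cited work is left out.

```python
import random

def random_walk_search(graph, source, has_object, walkers=2, max_steps=10, rng=None):
    """Run `walkers` independent random walks; return (peer found, messages sent)."""
    rng = rng or random.Random()
    messages = 0
    positions = [source] * walkers
    for _ in range(max_steps):
        next_positions = []
        for peer in positions:
            nxt = rng.choice(graph[peer])   # each walker moves to ONE random neighbor
            messages += 1
            if has_object(nxt):
                return nxt, messages        # stop as soon as any walker succeeds
            next_positions.append(nxt)
        positions = next_positions
    return None, messages                   # step budget exhausted

# Hypothetical overlay; peer "d" holds the object.
overlay = {
    "q": ["a", "b"],
    "a": ["q", "b", "c"],
    "b": ["q", "a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
found, sent = random_walk_search(overlay, "q", lambda p: p == "d",
                                 walkers=2, max_steps=10, rng=random.Random(7))
```

The message count is bounded by `walkers * max_steps`, independent of node degree, which is where the traffic saving over flooding comes from.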
Search Techniques for P2P systems
4. Using Randomized Gossiping to Replicate Global State
[F.M Cuenca-Acuna, Thu D. Nguyen HPDC-12]
• Idea: PlanetP uses Bloom Filters to propagate summary indexes of the contents of a Peer.
• Bloom Filters are used for Membership Queries.
• Highlights
  – Not scalable (technique works well for <10000 nodes).
  – No data replication required.
  – False positives are a function of m, n, k and can be kept small.
[Figure: An 8-bit Bloom filter (m = 8) with 4 hash functions. A membership query “d1?” against D = {d1, d2, ..., dn} checks the bits selected by h1(d1), h2(d1), h3(d1) and h4(d1).]
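A small Bloom filter sketch makes the membership query and the role of m, n and k concrete. The class below is illustrative only: deriving the k hash functions by salting SHA-256 is an assumption of this example, not how PlanetP does it, and `false_positive_rate` uses the standard approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m=8, k=4):
        self.m = m                # number of bits
        self.k = k                # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: simulate k hash functions by salting SHA-256 with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

def false_positive_rate(m, n, k):
    """Standard approximation for n inserted items, m bits, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

summary = BloomFilter(m=64, k=4)
summary.add("d1")
```

A peer can gossip `summary.bits` instead of its full word list; other peers then test `might_contain` locally before contacting it.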
Search Techniques for P2P systems
5. Searching using Local Indices
[Arturo Crespo and Hector Garcia-Molina, ICDCS 2002.]
• Idea: Create indices which contain “statistics” that reveal the “direction” towards the documents.
• Types of Proposed Indices
  – Compound Routing Index (CRI): metric = number of documents.
  – Hop-Count Routing Index (HRI): maintains a CRI for k hops.
  – Exponentially Aggregated Index (ERI): applies some cost formula on HRI to shrink HRI’s size.
• Highlights
  – Not scalable, expensive routing updates, but better than replicating data indexes.
  – Assumes a static environment, but no data replication is required.
Search Techniques for P2P systems
6. Directed BFS and the >RES Heuristic 1/2
[Beverly Yang and Hector Garcia-Molina, ICDCS 2002.]
• Proposed Techniques:
  – Directed BFS based on aggregate statistics (e.g. number of results a peer returned, shortest queue, forwarded the most data).
  – Iterative Deepening, until Z results are returned.
  – Local Indexes, where each node maintains the actual index over the data of peers r hops away.
• Their experiments deploy the Directed BFS techniques by attaching nodes to the Gnutella Network.
• The >RES Heuristic is shown to work well.
Search Techniques for P2P systems
6. Directed BFS and the >RES Heuristic 2/2
• The >RES Heuristic is optimized to find Z documents efficiently for some user-defined Z.
• >RES works well because:
  – It captures stable/large network segments.
  – It reaches potentially less overloaded peers.
• >RES is a quantitative approach.
• Drawback: >RES doesn’t route queries to the most relevant content.
[Figure: peer q forwards a QUERY according to the RES values of its neighbors A, B and C (RES=1000, RES=10, RES=1) and receives QUERYHITs.]
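The selection step of >RES can be sketched as a simple top-k choice over past result counts. The neighbor names `n1` to `n3` and the function name are hypothetical; only the RES values 1000, 10 and 1 are taken from the figure.

```python
def select_by_res(results_returned, fanout):
    """Pick the `fanout` neighbors that returned the most results in the past."""
    ranked = sorted(results_returned, key=results_returned.get, reverse=True)
    return ranked[:fanout]

# RES values as in the figure; the neighbor labels are placeholders.
history = {"n1": 1000, "n2": 10, "n3": 1}
targets = select_by_res(history, fanout=2)
```

The ranking depends only on how many results a neighbor returned, never on what the query is about, which is the drawback noted above.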
Search Techniques for P2P systems
7. Depth-First-Search and Freenet
[I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong, LNCS 2009]
Idea: Objects are hashed, and the hash of a query is routed based on “key closeness” in a DFS manner.
Highlights:
  – Uses caching of key/object for future requests.
  – Data replication along the QueryHit path provides availability.
  – Anonymity of searcher and publisher.
  – Drawbacks: i) Searches ONLY based on Object Identifier. ii) The user-perceived delay is high.
[Figure: Peer q issues “Search: A” as QUERY h(A) (1) and receives “result: S” (2); the figure marks the original file:A and a replicated file:A among nodes S, A, B, C and R.]
Search Techniques for P2P systems
8. Consistent Hashing and Chord
[Ion Stoica et al. SIGCOMM 2001]
Idea: Objects/Nodes are hashed to an m-bit identifier and organized in a virtual ring. Object lookup is achieved in O(logN).
Highlights:
  – Consistent Hashing achieves: (i) good load balancing of keys, (ii) little object/key movement in case of node join/leave.
  – Drawbacks: i) Searches ONLY based on Object Identifier. ii) Data movement may be a big overhead.
[Figure: Chord identifier ring with m=3 (identifiers 0-7).
 Nodes: h(A)=2, h(B)=3, h(C)=5, h(D)=7. Objects: h(file1)=7, stored at D.
 Finger table A: successor [2,3)=3, successor [3,5)=5, successor [5,2)=0.
 Finger table C: successor [5,7)=7, successor [7,5)=0.]
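The placement rule of consistent hashing, "a key is stored at the first node whose identifier is equal to or follows the key on the ring", can be sketched with the node identifiers from the figure. This is a plain successor lookup over a sorted list, shown only to illustrate placement; it does not implement Chord's finger-table routing, and the function name is an assumption.

```python
from bisect import bisect_left

M = 3                      # identifier bits, as in the figure
RING = 2 ** M              # identifiers 0..7
nodes = {2: "A", 3: "B", 5: "C", 7: "D"}   # h(node) -> node, from the figure
ids = sorted(nodes)

def successor(key):
    """Return the id of the first node at or after `key`, wrapping around the ring."""
    key %= RING
    i = bisect_left(ids, key)
    return ids[i % len(ids)]   # i == len(ids) wraps to the smallest id

owner_of_file1 = nodes[successor(7)]   # h(file1) = 7
```

When a node joins or leaves, only the keys between it and its predecessor change owner, which is the "little key movement" property listed above.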
Intelligent Search Mechanism ISM
Introduction
• Idea: Each Query Message is forwarded intelligently, based on what queries a peer answered in the past.
• Components of ISM (for each node u)
  a) Profile Mechanism, kept for each neighbor N(u).
  b) Peer Ranking Mechanism, for ranking peers locally and sending a search query only to the ones most likely to answer.
  c) Similarity Function, for finding similar search queries.
  d) Search Mechanism, for propagating queries based on local indexes.
[Figure: Peer q sends a QUERY (1) via node A, which consults its profiles to pick the next hop; Peer d returns a QUERYHIT (2).]
Intelligent Search Mechanism ISM
Components of ISM
a) Profile mechanism.
  – Maintains a list of past queries routed through that host.
  – Every time a QueryHit is received, the table is updated.
  – The profile manager uses a Least Recently Used policy to keep the most recent queries in the repository.
  – Profiles are kept for neighbors only, so the cost of maintaining them is O(Td), where T is the size limit per profile and d is the degree of a node (total size: T*d).
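A per-neighbor profile with Least Recently Used eviction can be sketched with `collections.OrderedDict`. The class and method names are hypothetical, and the sketch stores only the query text; what else the thesis keeps per entry is not shown on the slide.

```python
from collections import OrderedDict

class NeighborProfile:
    """Keeps at most `capacity` (T) recent queries answered through one neighbor."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queries = OrderedDict()       # oldest entry first

    def record_hit(self, query):
        if query in self.queries:
            self.queries.move_to_end(query)    # refresh recency
        self.queries[query] = True
        if len(self.queries) > self.capacity:
            self.queries.popitem(last=False)   # evict the least recently used query

# One profile per neighbor: d neighbors x T entries gives the O(Td) bound.
profile = NeighborProfile(capacity=2)
for q in ["italy disaster", "super bowl", "italy disaster", "elections bush"]:
    profile.record_hit(q)
```

After these four updates the profile holds "italy disaster" and "elections bush"; "super bowl" was evicted because it was the least recently used entry.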
Intelligent Search Mechanism ISM
Components of ISM
b) The RelevanceRank Peer Ranking Metric.
  – Before forwarding a Query Message, a peer performs an on-the-fly ranking of its peers to determine the best paths.
  – We use the Aggregate Weighted Similarity of peer Pi to a query q, computed by a peer Pl from the similarities between q and the past queries recorded for Pi.
Example
Assume host Pl needs to forward a query q=“italy disaster” to two of its peers {P1, P2, P3}. Pl maintains queries {q1, q2, ..., q5} in its profile.
P1 { Sim(q, q1) = 0.8 }                    => RR(P1, q) = 0.8 x 2 = 1.6
P2 { Sim(q, q2) = 0.6, Sim(q, q3) = 0.5 }  => RR(P2, q) = (0.6x2 + 0.5x2) = 2.2
P3 { Sim(q, q4) = 0.4, Sim(q, q5) = 0.3 }  => RR(P3, q) = (0.4x2 + 0.3x2) = 1.4
(The example weights each similarity by 2.)
=> q is forwarded to P2 and P1, the two highest-ranked peers.
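The arithmetic of the example can be reproduced with a short sketch. It follows the numbers on the RR lines of the slide literally (each similarity multiplied by 2, then summed); the exact formula used in the thesis is not shown in this transcript, so the `weight` parameter and the function name are assumptions.

```python
def relevance_rank(similarities, weight=2):
    """Aggregate the similarities of a peer's past queries to the new query."""
    return sum(s * weight for s in similarities)

# Similarities of q = "italy disaster" to the queries each peer answered.
profiles = {
    "P1": [0.8],
    "P2": [0.6, 0.5],
    "P3": [0.4, 0.3],
}
ranks = {peer: relevance_rank(sims) for peer, sims in profiles.items()}

# Forward to the two highest-ranked peers.
targets = sorted(ranks, key=ranks.get, reverse=True)[:2]
```

P2 ranks highest because it answered two fairly similar queries, even though P1 answered the single most similar one.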
Intelligent Search Mechanism ISM
Components of ISM
c) Similarity Function – The cosine similarity.
• Assume that L is the set of all words (in the Profile Manager),
  e.g. L={elections, bush, clinton, super, bowl, san, diego, … ,italy, earthquake, disaster}
• We define an |L|-dimensional space where each query is a vector.
  If q=“italy disaster” => q (vector of q) = [0,0,0,…,1,0,1]
• Recall that we have a vector for each qi stored in the Profile Manager (i.e. qi).
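Cosine similarity between two query vectors is the dot product divided by the product of their lengths. The sketch below builds binary vectors over a vocabulary and compares them; the vocabulary contains only the words listed on the slide (the elided entries are not filled in), and the helper names are assumptions.

```python
import math

def to_vector(query, vocabulary):
    """Binary vector: 1 where the vocabulary word occurs in the query."""
    words = set(query.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0     # empty vectors are treated as dissimilar

# Only the words shown on the slide; the real L is larger.
L = ["elections", "bush", "clinton", "super", "bowl",
     "san", "diego", "italy", "earthquake", "disaster"]

q = to_vector("italy disaster", L)
sim_related = cosine(q, to_vector("italy earthquake", L))
sim_unrelated = cosine(q, to_vector("super bowl", L))
```

Queries that share one of two words score 0.5, and queries with no word in common score 0, so the Peer Ranking Mechanism ignores unrelated past queries.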
Intelligent Search Mechanism ISM
Components of ISM
d) Search Mechanism
• Utilizes the Peer Ranking Mechanism to forward Queries to the nodes that are most likely to contain the information we are looking for.
[Figure: Peer q's QUERY (1) is routed hop by hop toward Peer d by nodes that consult their profiles.]
Intelligent Search Mechanism ISM
Breaking cycles with Random Perturbation
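• Suppose that nodes answer only to the conjunction of all query terms.
• Suppose that query q has no answer at A, B, C or D, but that one of them answered a similar query in the past.
  => Query q keeps cycling among these nodes and fails to explore the segment through E.
• Random Perturbation adds one additional random message, which gives q a chance to break out of the cycle.
[Figure: peers A, B, C, D and E; the QUERY q circulates among A, B, C and D, while the QUERYHIT would come from E.]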
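Random perturbation can be sketched as "take the M best-ranked neighbors, then add one neighbor chosen at random from the rest". The function name, the rank values and the neighbor labels below are illustrative assumptions.

```python
import random

def choose_targets(ranks, m, rng=None):
    """Top-m neighbors by rank plus one random neighbor from the remainder."""
    rng = rng or random.Random()
    ranked = sorted(ranks, key=ranks.get, reverse=True)
    targets = ranked[:m]
    rest = ranked[m:]
    if rest:
        targets.append(rng.choice(rest))   # the random perturbation message
    return targets

# Hypothetical ranks for five neighbors of one peer.
ranks = {"A": 2.2, "B": 1.6, "C": 1.4, "D": 0.0, "E": 0.0}
targets = choose_targets(ranks, m=2, rng=random.Random(3))
```

Without the random pick, a neighbor such as E with no matching history would never receive the query; with it, every neighbor has a nonzero chance on each forward.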
PeerWare Simulation Infrastructure
Introduction
• PeerWare is our distributed middleware
infrastructure that allows us to benchmark various
Query Routing Algorithms.
• It is deployed on a network of 50 workstations.
• It uses Public/Private Keys and SSH to connect to the networked hosts.
• It is implemented in Java and consists of approximately 10000 lines of code.
PeerWare Simulation Infrastructure
Why real middleware and not simulations?
• Many properties such as network failures,
dropped queries may reveal interesting and
unknown patterns.
• In a real middleware we are able to measure the
actual time to satisfy queries.
• Finally there are no assumptions (network delays
etc) which are typical in simulation environments
The Anthill Project (Univ. of Bologna) uses a
similar approach to investigate properties of the
Freenet algorithm.
PeerWare Simulation Infrastructure
PeerWare Components
1. dataGen – The Dataset Generator
2. graphGen – The Network Graph Generator
3. dataPeer – The Data Node
4. searchPeer – The Search Node
Other Administrative Components
• netLaucher – Shell script that launches Network
• netStats – Shell script that provides statistics
• graphPlot – Shell script that plots Graphs based
on generated results.
PeerWare Simulation Infrastructure
1) dataGen Component
• dataGen is the Dataset Generator, which generates documents about specific topics (each peer can have some specialized knowledge).
• It uses the REUTERS News Agency dataset (22,531 documents).
• It groups documents by various properties: {Date, Topics, Places, People, Orgs, Companies}
• In our experiments we use the Places attribute and generate document sets for 104 countries.
PeerWare Simulation Infrastructure
2) graphGen Component
• graphGen is the topology generator.
• Currently it generates Random Topologies given parameters such as {degree, IPs, ports}.
• It uses graphViz to generate visualizations of the generated topologies.
PeerWare Simulation Infrastructure
3) dataPeer Component
• dataPeer is a P2P client that maintains an XML repository of documents.
• It uses the PDOM-XQL engine to query its documents.
• It pre-establishes persistent TCP connections to other peers.
[Figure: a Data-Peer (e.g. usa) with its Routing Structures (Profiles), a PDOM-XML Manager queried through XQL over XML Data Files, a P2P Network Module and the usa.graph file; it is connected to peers such as mexico, argentina, u.k, china, italy, france, india, greece and germany.]
PeerWare Simulation Infrastructure
4) searchPeer Component
• searchPeer is a P2P client that connects to a PeerWare Network and performs unstructured queries.
• Keywords are sampled from within the dataset.
• It logs statistics such as the query response time, the nodes that answered a query, etc.
[Figure: a Search-Peer with its P2P Network Module and the files queries.txt and results.txt; it is connected to peers such as mexico, argentina, u.k, china, italy, france, india, greece and germany.]
Experimental Evaluation
Introduction
• We create a distributed Newspaper application.
• We use a Random Network of 104 peers.
  – Each peer has documents for 1 country.
  – The average degree of a node is 7 ~= log2(100) (connected graph).
• We perform two series of experiments:
  1. 10x10 sequential queries with a delay of 4 sec.
  2. 400 random queries with a delay of 4 sec.
• We compare Doc. Ratio (Recall Rate) vs. Num. of messages for:
  – BFS (Gnutella Message Flooding) (forward to degree nodes).
  – Random BFS (randomly forward to degree/2 nodes).
  – Intelligent Search Mechanism (forward to M=(degree/2)-1 highest RelevanceRank nodes + 1 random).
  – >RES Heuristic (forward to degree/2 nodes that answered >RES).
Experimental Evaluation
Reducing Query Messages (10x10 Experiment)
Recall Rate vs. Num. of messages with TTL=4
• BFS uses ~1050 messages w/ recall rate 100%
• RBFS uses ~220 (20%) msgs w/ recall rate ~50%
• >RES uses ~400 (38%) msgs w/ recall rate ~70%
• ISM uses ~400 (38%) msgs w/ recall rate ~90%
• ISM improves over time since Peer Profiles get more knowledge.
• ISM and >RES start out slow since they use RBFS until they populate their routing structures.
Experimental Evaluation
Digging Deeper by Increasing the TTL (10x10)
• Recall Rate vs. Num. of messages with TTL=5
• BFS uses again ~1050 messages w/ recall rate 100%
• RBFS uses ~450 (43%) msgs w/ recall rate ~82%
• >RES uses ~570 (54%) msgs w/ recall rate ~90%
• ISM uses ~570 (54%) msgs w/ recall rate ~99%
Experimental Evaluation
Reducing Query Response Time (QRT) (10x10 Experiment)
• BFS’s QRT is in the order of 6 seconds.
• RBFS, ISM and >RES use
  – 30-60% of BFS’s QRT for TTL=4
  – 60-80% of BFS’s QRT for TTL=5
• BFS’s unnecessary messages increase the user-perceived delay.
[Figure: The Query Response Time as a percentage of BFS]
Experimental Evaluation
The Discarded Message Problem (DMP)
• A query q is identified by a GUID.
• To avoid cycles a node never forwards a query it
already forwarded.
• DMP occurs if a node has forwarded q with TTL1 and
then receives again q with TTL2, where TTL2>TTL1
• In our experiments approximately 30% of queries were affected by the DMP.
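The Discarded Message Problem can be reproduced with a few lines. In this sketch a node remembers only the GUIDs it has seen, so a later copy of the same query is dropped even when it carries a larger TTL; the class and method names are hypothetical.

```python
class Node:
    def __init__(self):
        self.seen_guids = set()

    def handle(self, guid, ttl):
        """Return the TTL to forward with, or None if the query is not forwarded."""
        if guid in self.seen_guids:
            return None                 # duplicate GUID: discarded regardless of TTL
        self.seen_guids.add(guid)
        remaining = ttl - 1             # TTL is decremented on each forward
        return remaining if remaining > 0 else None

node = Node()
first = node.handle("guid-42", ttl=2)    # arrives over a long path, little TTL left
second = node.handle("guid-42", ttl=4)   # same query over a shorter path
```

The first copy is forwarded with TTL 1, and the second copy, which could have travelled three more hops, is discarded. Peers reachable only through that extra depth never see the query.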
Experimental Evaluation
Improving Recall Rate over Time (400 Experiment)
• The 10x10 queries experiment suited ISM well.
• In this experiment we perform 400 random queries.
• BFS’s overwhelming message load creates two major outbreaks.
• ISM improves over time, achieving a 96% Recall Rate while again using 38% of the messages.
Conclusions
• Efficient Information Retrieval in P2P networks is
not feasible with the current Search Algorithms.
• We propose an Intelligent Search Mechanism
that uses local knowledge to improve
Information Retrieval in P2P.
• We implement PeerWare and evaluate the
performance of various Search Techniques
• The ISM achieves in some cases 100% recall
rate while using only 57% of the BFS
messaging.
Future Work
• Probe different Network Topologies such as ASMap with PowerLaws.
• Deploy larger PeerWares with more queries.
• Probe different Peer-Profile maintenance policies.
• Use Stemming/Stop Words to answer more accurately.
• Compare the performance of our method with new proposed techniques (random gossiping, random walkers, etc).
• 60% of Gnutella belongs to 20% of ISPs. How can we exploit that to provide more efficient query routing schemes?
[Figure: map of sample Gnutella hosts with measured round-trip time, router hops and distance]
  London: pc-62-30-117-83-cr.blueyonder.co.uk, AverageRTT=163ms, 19 Router Hops, 8,747Km
  Seattle: 12-224-0-236.client.attbi.com, AverageRTT=46ms, 13 Router Hops, 1,544Km
  Tokyo: p237-165.yahoo.co.jp, AverageRTT=140ms, 19 Router Hops, 8,806Km
  Rochester: roc-24-169-109-208.rochester.rr.com, AverageRTT=130ms, 22 Router Hops, 3,933Km
  Melbourne: sdcax6-097.dialup.optusnet.com.au, AverageRTT=184ms, 22 Router Hops, 12,764Km
  Riverside: 66-215-0-xx1.oc-nod.charterpipeline.net and 66-215-0-xx2.oc-nod.charterpipeline.net, AverageRTT=9ms, 4 Router Hops
Thank You!