Transcript Document

Peer-to-Peer Systems as
Real-life Instances of Distributed Systems
Adriana Iamnitchi
University of South Florida
[email protected]
http://www.cse.usf.edu/~anda
P2P Systems – 1
Why Peer-to-Peer Systems?
• Wide-spread user experience
• Large-scale distributed application, unprecedented growth and popularity
  – KaZaA: 389 million downloads (1M/week), one of the most popular applications ever!
• Heavily researched in the last 8-9 years, with results in:
  – User behavior characterization
  – Scalability
  – Novel problems (or aspects): reputation, trust, incentives for fairness
Number of users for file-sharing applications (estimate: www.slyck.com, Sept '06):
  eDonkey             3,108,909
  Gnutella            2,899,788
  FastTrack (Kazaa)   2,114,120
  Overnet               691,750
  Filetopia               3,405
Commercial impact
– Do you know of any examples?
P2P Systems – 2
Outline Today: Peer-to-peer Systems
• Background
• Some history
• Unstructured overlays
  – Napster
  – Gnutella (original and new)
  – BitTorrent
  – Exploiting user behavior in distributed file-sharing systems
• Structured overlays ("DHT"s)
  – Basics
  – Chord
  – CAN
P2P Systems – 3
What Is a P2P System?
[Figure: several Nodes connected to each other through the Internet]
• A distributed system architecture:
  – No centralized control (debatable: Napster?)
  – Nodes are symmetric in function (debatable: new Gnutella protocol?)
• Large number of unreliable nodes
• Initially identified with music file sharing
P2P Systems – 4
P2P Definition(s)
A number of definitions coexist:
• Def 1: "A class of applications that takes advantage of resources — storage, cycles, content, human presence — available at the edges of the Internet."
  – Edges are often turned off and lack permanent IP addresses
• Def 2: "A class of decentralized, self-organizing distributed systems, in which all or most communication is symmetric."
• Lots of other definitions fit in between
P2P Systems – 5
The Promise of P2P Computing
• High capacity through parallelism:
  – Many disks
  – Many network connections
  – Many CPUs
• Reliability:
  – Many replicas:
    • of data
    • of network paths
  – Geographic distribution
• Automatic configuration
• Useful in public and proprietary settings
P2P Systems – 6
History
• Early decentralized, P2P solutions: USENET
• As a grass-roots movement: started in 1999 with Napster
  – Objective: (music) file sharing
P2P Systems – 7
Popularity since 2004
[Figure: Google Trends search popularity of "p2p" since 2004, normalized to 1.00, compared to "Britney Spears" at 2.60]
P2P Systems – 8
Napster: History
• Program for sharing files over the Internet
• History:
  – 5/99: Shawn Fanning (freshman, Northeastern U.) founds Napster Online music service
  – 12/99: first lawsuit
  – 3/00: Napster accounts for 25% of UWisc traffic
  – 2000: est. 60M users
  – 2/01: US Circuit Court of Appeals rules Napster knew its users were violating copyright laws
  – 7/01: # simultaneous online users: Napster 160K, Gnutella 40K, Morpheus 300K
P2P Systems – 9
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
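As a minimal illustration (not part of the original slides), the four primitives can be written down as an abstract Python interface; all names and signatures below are hypothetical, not from any real client:

    from abc import ABC, abstractmethod

    class FileSharingPeer(ABC):
        # Hypothetical interface; real clients differ widely.
        @abstractmethod
        def join(self, bootstrap_addr: str) -> None:
            """Begin participating, e.g. by contacting a bootstrap node."""

        @abstractmethod
        def publish(self, filename: str) -> None:
            """Advertise a locally stored file to the network."""

        @abstractmethod
        def search(self, query: str) -> list[str]:
            """Return addresses of peers that appear to have matches."""

        @abstractmethod
        def fetch(self, peer_addr: str, filename: str) -> bytes:
            """Retrieve the file directly from a peer."""

The systems that follow (Napster, Gnutella, BitTorrent, DHTs) differ mainly in how they implement each of these four steps.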
P2P Systems – 10
Napster: How It Works
napster.com
• Client-Server: Use central server to locate files
• Peer-to-Peer: Download files directly from peers
P2P Systems – 11
Napster
1. File list is uploaded (Join and Publish)
[Figure: users send their file lists to the central napster.com server]
P2P Systems – 12
Napster
2. User requests search at server (Search)
[Figure: user sends the request to napster.com; the server returns results]
P2P Systems – 13
Napster
3. User pings hosts that apparently have the data, looking for the best transfer rate
[Figure: user pings candidate peers directly, bypassing napster.com]
P2P Systems – 14
Napster
4. User retrieves file (Fetch)
[Figure: user downloads the file directly from a peer]
P2P Systems – 15
Lessons Learned from Napster
• Strengths: decentralization of storage
  – Every node "pays" for its participation by providing access to its resources
    • physical resources (disk, network), knowledge (annotations), ownership (files)
  – Every participating node acts as both a client and a server ("servent"): P2P
  – Decentralization of cost and administration = avoiding resource bottlenecks
• Weaknesses: centralization of the data access structures (index)
  – Server is a single point of failure
  – Unique entity required for controlling the system = design bottleneck
  – Copying copyrighted material made Napster a target of legal attack
[Figure: spectrum from Centralized System to Decentralized System, with increasing degree of resource sharing and decentralization]
P2P Systems – 16
Gnutella: File-Sharing with No Central Server
P2P Systems – 17
Gnutella: History
• Developed in 14 days as a "quick hack" by Nullsoft (Winamp)
  – Originally intended for the exchange of recipes
• Evolution of Gnutella:
  – Published under the GNU General Public License on the Nullsoft web server
  – Taken down after a couple of hours by AOL (owner of Nullsoft)
    • Too late: this was enough to "infect" the Internet
  – The Gnutella protocol was reverse engineered from downloaded versions of the original client
  – Protocol published
  – Third-party clients were published and Gnutella started to spread
  – Many iterations to fix the poor initial design
• High impact:
  – Many versions implemented
  – Many different designs
  – Lots of research papers/ideas
P2P Systems – 18
Gnutella: Search in an Unstructured Overlay
[Figure: flooding in an unstructured overlay. A Query ("Where is file A?") propagates hop by hop; nodes that have file A send a Reply back]
P2P Systems – 19
Gnutella: Overview
• Join: on startup, a client contacts a few other nodes; these become its "neighbors"
  – Initial list of contacts published at gnutellahosts.com:6346
  – Outside the Gnutella protocol specification
  – Default value for the number of open connections (neighbors): C = 4
• Publish: no need
• Search:
  – Flooding: ask neighbors, who ask their neighbors, and so on...
    • Each forwarding of a request decreases a TTL; default: TTL = 7
    • When/if found, reply to the sender
    • Drop forwarded requests when the TTL expires
    • One request leads to 2 · Σ_{i=0}^{TTL} C·(C−1)^i ≈ 26,240 messages for C = 4, TTL = 7 (a quick check follows below)
  – Back-propagation in case of success (Why?)
• Fetch: get the file directly from the peer (HTTP)
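A quick sanity check of the 26,240 figure, counting each message twice (the request plus the back-propagated response), as the slide's formula does:

    # Each node forwards a query to its other C-1 neighbors; the factor
    # of 2 accounts for the response traveling back along the same path.
    C, TTL = 4, 7
    messages = 2 * sum(C * (C - 1) ** i for i in range(TTL + 1))
    print(messages)  # 26240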
P2P Systems – 20
Gnutella: Protocol Message Types
Type       Description                        Contained information
Ping       Announces availability and         None
           probes for other servents
Pong       Response to a Ping                 IP address and port # of the responding servent;
                                              number and total KB of files shared
Query      Search request                     Minimum network bandwidth of the responding
                                              servent; search criteria
QueryHit   Returned by servents that have     IP address, port # and network bandwidth of the
           the requested file                 responding servent; number of results; result set
Push       File download request for          Servent identifier; index of the requested file;
           servents behind a firewall         IP address and port to send the file to
P2P Systems – 21
What would you ask about
a Gnutella network?
P2P Systems – 22
Gnutella: Tools for Network Exploration
• Eavesdrop on traffic: insert a modified node into the network and log traffic
• Crawler: connect to active nodes and use the membership protocol to discover membership and topology
P2P Systems – 23
Gnutella: Heterogeneity
All peers equal? (1)
[Figure: a Gnutella overlay mixing 10Mbps LAN, 1.5Mbps DSL, and 56kbps modem peers]
P2P Systems – 24
Gnutella Network Structure: Improvement
Gnutella protocol 0.6: a two-tier architecture of ultrapeers and leaves
[Figure: leaves connect to ultrapeers; control messages (search, join, etc.) flow through the ultrapeer tier, while data transfer (file download) happens directly between peers]
P2P Systems – 25
Déjà vu?
[Same two-tier figure as the previous slide, repeated to make the point: does this structure look familiar?]
P2P Systems – 26
Gnutella: Free Riding
All peers equal? (2)
• More than 25% of Gnutella clients share no files; 75% share 100 files or less (Adar and Huberman, Aug '00)
• Conclusion: Gnutella has a high percentage of free riders
  – If only a few individuals contribute to the public good, these few peers effectively act as centralized servers
• Outcome:
  – Significant efforts in building incentive-based systems
  – BitTorrent?
P2P Systems – 27
Flooding in Gnutella: Loops?
[Figure: flooding can revisit a node; a node that has seen the request already drops it instead of forwarding]
P2P Systems – 28
Improvements of Message Flooding
• Expanding ring (sketched below)
  – Start the search with a small TTL (e.g. TTL = 1)
  – If no success, iteratively increase the TTL (e.g. TTL = TTL + 2)
• k random walkers
  – Forward the query to one randomly chosen neighbor only, with a large TTL
  – Start k random walkers
  – Each random walker periodically checks with the requester whether to continue
• Experiences (from simulation):
  – Adaptive TTL is useful
  – Message duplication should be avoided
  – Flooding should be controlled at fine granularity
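A rough sketch of the expanding-ring idea in Python; flood(query, ttl) is an assumed primitive that floods with the given TTL and returns any results found, not a real API:

    def expanding_ring_search(query, flood, max_ttl=7, step=2):
        # Start with a small horizon; widen it only if nothing is found,
        # so popular items never trigger a full-network flood.
        ttl = 1
        while ttl <= max_ttl:
            results = flood(query, ttl)
            if results:
                return results
            ttl += step
        return []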
P2P Systems – 29
Gnutella Topology (Mis)match?
P2P Systems – 30
Gnutella: Network Size?
Explosive growth in 2001, slowly shrinking thereafter
• High user interest
  – Users tolerate high latency and low-quality results
• Better resources
  – DSL and cable modem nodes grew from 24% to 41% over the first 6 months
P2P Systems – 31
Is Gnutella a Power-Law Network?
Power-law networks: the number of links per node follows a power-law distribution; the number of nodes with L links is
  N(L) ∝ L^(−k)
Examples of power-law networks:
  – The Internet at the AS level
  – In/out links to/from HTML pages
  – Airports
  – The US power grid
  – Social networks
[Figure: November 2000 Gnutella snapshot. Log-log plot of the number of nodes vs. the number of links, roughly a straight line]
Implications: high tolerance to random node failure, but low reliability when facing an 'intelligent' adversary
P2P Systems – 32
Network Resilience
[Figure: partial Gnutella topology; effect of 30% of nodes dying at random vs. the 4% best-connected nodes being targeted. From Saroiu et al., MMCN 2002]
P2P Systems – 33
Is Gnutella a Power-Law Network? (Later Data)
Later, larger networks display a bimodal distribution
Implications:
  – High tolerance to random node failures is preserved
  – Increased reliability when facing an attack
[Figure: May 2001 snapshot. Log-log plot of the number of nodes vs. the number of links. From Ripeanu, Iamnitchi, Foster, 2002]
P2P Systems – 34
Discussion Unstructured Networks
• Performance
  – Search latency: low (graph properties)
  – Message bandwidth: high
    • improvements through random walkers, but essentially the whole network needs to be explored
  – Storage cost: low (only the local neighborhood)
  – Update cost: low (only local updates)
  – Resilience to failures: good; multiple paths are explored and data is replicated
• Qualitative criteria
  – Search predicates: very flexible, any predicate is possible
  – Global knowledge: none required
  – Peer autonomy: high
P2P Systems – 35
BitTorrent
P2P Systems – 36
BitTorrent Components
• Torrent file
  – Metadata of the file to be shared
  – Address of the tracker
  – List of pieces and their checksums
• Tracker
  – Lists peers interested in the distribution of the file
• Peers
  – Clients interested in the distribution of the file
  – Can be "seeds" or "leechers"
P2P Systems – 37
A BitTorrent Swarm
• A "seed" node has the complete file
• A "tracker" is associated with the file
• A ".torrent" meta-file is built for the file; it identifies the address of the tracker node
• The .torrent file is published on the web
• The file is split into fixed-size segments (e.g., 256KB)
P2P Systems – 38
Choking Algorithm
• Each connected peer is in one of two states
  – Choked: download requests by a choked peer are ignored
  – Unchoked: download requests by an unchoked peer are honored
  – Choking occurs at the peer level
• Each peer has a certain number of unchoke slots
  – 4 regular unchoke slots (per the BitTorrent standard)
  – 1 optimistic unchoke slot (per the BitTorrent standard)
• Choking algorithm (sketched below)
  – Peers unchoke the connected peers with the best service rate
    • Service rate = rolling 20-second average of a peer's upload bandwidth
  – Optimistically unchoking peers prevents a static set of unchoked peers
  – The choking algorithm runs every 10 seconds
  – Peers are optimistically unchoked every 30 seconds
    • New peers are 3 times more likely to be optimistically unchoked
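A hedged sketch of the unchoke decision described above; peer objects and rate bookkeeping are assumed to exist elsewhere, and the attribute names (service_rate, is_new) are illustrative:

    import random

    REGULAR_SLOTS = 4  # per the BitTorrent standard, plus 1 optimistic slot

    def recompute_unchoked(peers, optimistic_peer):
        # peers: objects with .service_rate, the rolling 20 s average of
        # the upload bandwidth they provide to us (assumed attribute)
        by_rate = sorted(peers, key=lambda p: p.service_rate, reverse=True)
        unchoked = set(by_rate[:REGULAR_SLOTS])
        unchoked.add(optimistic_peer)  # the single optimistic unchoke slot
        return unchoked

    def pick_optimistic(peers):
        # new peers are weighted 3x, as the slide notes
        weights = [3 if p.is_new else 1 for p in peers]
        return random.choices(peers, weights=weights, k=1)[0]

In this sketch, recompute_unchoked would be re-run every 10 seconds and pick_optimistic every 30 seconds, matching the intervals above.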
P2P Systems – 39
Piece Selection
• Random first piece
  – Piece downloaded at random
  – Algorithm used by new peers
• Rarest piece first
  – Ensures > 1 distributed copies of each piece
  – Increases the interest of connected peers
  – Increases scalability
• Random piece vs. rarest piece
  – Rarest-first has a probabilistically higher download time (rare pieces have few sources)
  – New peers want to reduce download time but also to increase their interest to others
P2P Systems – 40
BitTorrent: Overview
• Join: nothing
  – Just find a community ready to host your tracker
• Publish: create a tracker, upload the .torrent metadata file
• Search:
  – For files: nothing
    • the community is supposed to provide search tools
  – For segments: exchange segment-ID maps with other peers
• Fetch: exchange segments with other peers (HTTP)
P2P Systems – 41
Gnutella vs. BitTorrent: Discussion
• Architecture
– Decentralization?
• System properties
– Reliability?
– Scalability?
– Fairness?
– Overheads?
– Quality of service
  • Search coverage for content?
  • Ability to download content fast?
P2P Systems – 42
Distributed Hash Tables:
Design and Performance
P2P Systems – 43
What Is a DHT?
• A building block used to locate key-based objects over millions of hosts on the Internet
• Inspired by the traditional hash table:
  – key = Hash(name)
  – put(key, value)
  – get(key) -> value
  – Service: O(1) storage
• How to do this across millions of hosts on the Internet?
  – Distributed Hash Tables
• What might be difficult?
  – Decentralized: no central authority
  – Scalable: low network traffic overhead
  – Efficient: find items quickly (latency)
  – Dynamic: nodes fail, new nodes join
  – General-purpose: flexible naming
P2P Systems – 44
From Hash Tables to Distributed Hash Tables
Challenge: scalably distributing the index space
  – Scalability issue with plain hash tables: adding a new node means moving many items
  – Solution: consistent hashing (Karger '97)
• Consistent hashing:
  – Circular ID space with a distance metric
  – Objects and nodes are mapped onto the same space
  – A key is stored at its successor: the node with the next-higher ID
[Figure: circular ID space with node IDs N32, N90, N105 (N: node IDs) and object IDs K5, K20, K80 (K: object IDs); e.g. K80 is stored at N90, its successor]
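A minimal, self-contained sketch of consistent hashing on the circular ID space, using SHA-1 identifiers as in the slides (m = 8 keeps the toy circle small; real systems use the full hash width):

    import hashlib, bisect

    def ident(name: str, m: int = 8) -> int:
        # hash a file name or IP address onto a circle of 2^m points
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** m)

    class Ring:
        def __init__(self, node_addrs):
            self.nodes = sorted((ident(a), a) for a in node_addrs)

        def successor(self, key: int) -> str:
            # a key is stored at the first node ID >= key, wrapping around
            ids = [nid for nid, _ in self.nodes]
            i = bisect.bisect_left(ids, key) % len(self.nodes)
            return self.nodes[i][1]

    ring = Ring(["192.178.0.1", "10.0.0.2", "10.0.0.3"])
    print(ring.successor(ident("yellow-submarine.mp3")))

Because only the arc between a joining/leaving node and its neighbor is affected, adding or removing a node moves O(1/n) of the keys instead of rehashing everything.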
P2P Systems – 45
The Lookup Problem
[Figure: nodes N1..N6 connected through the Internet; a publisher calls Put(key="title", value=file data...) and a client calls Get(key="title"). Which node should store the key?]
P2P Systems – 46
DHTs: Main Idea
[Figure: nodes N1..N9; a publisher stores under Key=H(audio data) with Value={artist, album title, track title}; a client issues Lookup(H(audio data)) and the request is routed to the responsible node]
P2P Systems – 47
What Is a DHT?
• Distributed hash table API:
  – key = Hash(data)
  – lookup(key) -> IP address
  – send-RPC(IP address, PUT, key, value)
  – send-RPC(IP address, GET, key) -> value
• The API supports a wide range of applications
  – The DHT imposes no structure/meaning on keys
  – Applications can thus build complex data structures on top (usage sketch below)
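As a sketch of how an application would drive this API: lookup() and send_rpc() below stand in for the slide's lookup and send-RPC primitives and are assumed to be provided by the DHT layer; they are not a real library:

    import hashlib

    def dht_put(name: str, value: bytes) -> None:
        key = hashlib.sha1(name.encode()).digest()  # key = Hash(data)
        addr = lookup(key)                          # -> IP address of the owner
        send_rpc(addr, "PUT", key, value)

    def dht_get(name: str) -> bytes:
        key = hashlib.sha1(name.encode()).digest()
        addr = lookup(key)
        return send_rpc(addr, "GET", key)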
P2P Systems – 48
Approaches
• Different strategies
  – Chord: constructing a distributed hash table
  – CAN: routing in a d-dimensional space
  – Many more...
• Commonalities
  – Each peer maintains a small part of the index information (routing table)
  – Searches are performed by directed message forwarding
• Differences
  – Performance and qualitative criteria
P2P Systems – 49
Example 1: Distributed Hash Tables (Chord)
• Hashing of search keys AND peer addresses onto binary keys of length m
  – Key identifier = SHA-1(key); node identifier = SHA-1(IP address)
  – SHA-1 distributes both uniformly
  – e.g. m=8, key("yellow-submarine.mp3")=17, key(192.178.0.1)=3
• Data keys are stored at the next-larger node key
  – a data item with hashed identifier k is stored at the node p2 such that p2 is the smallest node ID larger than k
[Figure: circular ID space (m=8) with peers p, p2, p3; key k is stored at p2, and the node before k is its predecessor]
• Search possibilities?
  1. Every peer knows every other: O(n) routing table size
  2. Peers know only their successor: O(n) search cost
P2P Systems – 50
Routing Tables
• Every peer knows m peers, with exponentially increasing distance
  – Each peer p stores a routing table: s_i = successor(p + 2^(i-1)) for i = 1, ..., m
    (the first peer whose hashed identifier is at or past p + 2^(i-1); we also write s_i = finger(i, p))
[Figure: peer p and the targets p+1, p+2, p+4, p+8, p+16 on the circle; with peers p2, p3, p4 present, the fingers resolve to s1 = s2 = s3 = p2, s4 = p3, s5 = p4]
  Routing table of p: s1 -> p2, s2 -> p2, s3 -> p2, s4 -> p3, s5 -> p4
• Result: O(log n) routing table size
P2P Systems – 51
Search
search(p, k):
  find in the routing table the entry (s_i = p*) with the largest node ID p* in the interval [p, k]
  if such a p* exists then search(p*, k)
  else return successor(p)   // found: p's successor stores k
[Figure: lookups for keys k1 and k2 starting at p follow the largest finger that does not overshoot the key]
• Search cost: O(log n) (a runnable toy version follows below)
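A runnable toy version of the search() routine, under the assumption that each node's successor and fingers are given as plain dicts, and that every finger list contains at least the node's successor (all data here is illustrative):

    M = 8
    SIZE = 2 ** M

    def dist(a, b):
        # clockwise distance from a to b on the circle; a full turn if a == b
        return ((b - a - 1) % SIZE) + 1

    def chord_search(node, k, fingers, successor):
        while dist(node, k) > dist(node, successor[node]):
            # forward to the finger closest to k without overshooting it
            node = max((f for f in fingers[node] if dist(node, f) <= dist(node, k)),
                       key=lambda f: dist(node, f))
        return successor[node]  # k falls between node and its successor

    # toy ring: nodes 3, 10, 40, 60 on a 256-point circle
    succ = {3: 10, 10: 40, 40: 60, 60: 3}
    fing = {n: [m for m in succ if m != n] for n in succ}  # all-pairs fingers
    print(chord_search(3, 50, fing, succ))  # -> 60, the successor of key 50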
P2P Systems – 52
Finger i Points to the Successor of n + 2^i
[Figure: node N80 on the circle; successive fingers cover 1/2, 1/4, 1/8, ..., 1/128 of the ring; the finger for target 112 points to N120]
P2P Systems – 53
Lookups Take O(log(N)) Hops
[Figure: ring with nodes N5, N10, N20, N32, N50, N60, N80, N99, N110; Lookup(K19) issued at N32 reaches N20, the successor of K19, in O(log N) finger hops]
P2P Systems – 54
Node Insertion (Join)
• A new node q joins the network, landing between p and p2 on the circle
[Figure: q appears near p+2; the finger tables after the join:]
  Routing table of p: s1 -> q,  s2 -> q,  s3 -> p2, s4 -> p3, s5 -> p4
  Routing table of q: s1 -> p2, s2 -> p2, s3 -> p3, s4 -> p3, s5 -> p4
P2P Systems – 55
Load Balancing in Chord
[Figure: number of keys per node; network size n = 10^4, 5·10^5 keys]
P2P Systems – 56
Length of Search Paths
[Figure: distribution of search path lengths; network size n = 2^12, 100·2^12 keys. Average path length ≈ ½·log2(n)]
P2P Systems – 57
Chord Discussion
• Performance
  – Search latency: O(log n) (with high probability, provable)
  – Message bandwidth: O(log n) (selective routing)
  – Storage cost: O(log n) (routing table)
  – Update cost: low (like search)
  – Node join/leave cost: O(log^2 n)
  – Resilience to failures: replication to successor nodes
• Qualitative criteria
  – Search predicates: equality of keys only
  – Global knowledge: key hashing, network origin
  – Peer autonomy: nodes have, by virtue of their address, a specific role in the network
P2P Systems – 58
Example 2: Topological Routing (CAN)
• Based on hashing keys into a d-dimensional space (a torus)
  – Each peer is responsible for the keys in a subvolume of the space (a zone)
  – Each peer stores the addresses of the peers responsible for the neighboring zones, for routing
  – Search requests are greedily forwarded to the peers in the closest zones (see the sketch below)
• The assignment of peers to zones depends on a random selection made by the peer
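A rough sketch of the greedy forwarding step, assuming a unit d-dimensional torus and helper callbacks (neighbor lists, a zone-ownership test, zone centers) that a real CAN node would maintain; all names are illustrative:

    def torus_distance(p, q):
        # per-dimension wrap-around distance on a unit torus
        return sum(min(abs(a - b), 1 - abs(a - b)) ** 2
                   for a, b in zip(p, q)) ** 0.5

    def can_route(node, key_point, neighbors, owns):
        # neighbors(n): peers responsible for zones adjacent to n's zone
        # owns(n, point): True if point falls inside n's zone
        while not owns(node, key_point):
            node = min(neighbors(node),
                       key=lambda n: torus_distance(n.center, key_point))
        return node

Each hop moves the query to the neighbor whose zone center is closest to the key's point, which is what yields the O(d·n^(1/d)) path lengths discussed later.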
P2P Systems – 59
Network Search and Join
[Figure: node 7 joins the network by choosing a random coordinate in the unit coordinate space; the zone containing that point is split with its current owner]
P2P Systems – 60
CAN Refinements
• Multiple realities
  – We can have r different coordinate spaces
  – Nodes hold a zone in each of them
  – Creates r replicas of the (key, value) pairs
  – Increases robustness
  – Reduces path length, as a search can be continued in the reality where the target is closest
• Overloading zones
  – Several peers are responsible for the same zone
  – Splits are performed only if a maximum occupancy (e.g. 4) is reached
  – Nodes know all other nodes in the same zone
  – But only one of the neighbors
P2P Systems – 61
CAN Path Length
P2P Systems – 62
Increasing Dimensions and Realities
P2P Systems – 63
CAN Discussion
• Performance
  – Search latency: O(d·n^(1/d)), depends on the choice of d (with high probability, provable)
  – Message bandwidth: O(d·n^(1/d)) (selective routing)
  – Storage cost: O(d) (routing table)
  – Update cost: low (like search)
  – Node join/leave cost: O(d·n^(1/d))
  – Resilience to failures: realities and overloading
• Qualitative criteria
  – Search predicates: spatial distance of multidimensional keys
  – Global knowledge: key hashing, network origin
  – Peer autonomy: nodes can decide on their position in the key space
P2P Systems – 64
Comparison of (some) P2P Solutions
            Search paradigm                Overlay maintenance cost   Search cost
Gnutella    breadth-first on the           O(1)                       2·Σ_{i=0}^{TTL} C·(C−1)^i
            search graph
Chord       implicit binary search trees   O(log n)                   O(log n)
CAN         d-dimensional space            O(d)                       O(d·n^(1/d))
P2P Systems – 65
DHT Applications
Not only for sharing music anymore...
  – Global file systems [OceanStore, CFS, PAST, Pastiche, UsenetDHT]
  – Naming services [Chord-DNS, Twine, SFR]
  – DB query processing [PIER, Wisc]
  – Internet-scale data structures [PHT, Cone, SkipGraphs]
  – Communication services [i3, MCAN, Bayeux]
  – Event notification [Scribe, Herald]
  – File sharing [OverNet]
P2P Systems – 66
Discussions
P2P Systems – 67
Research Trends: A Superficial History Based on
Articles in IPTPS
• In the early '00s (2002-2004):
  – DHT-related applications, optimizations, reevaluations... (more than 50% of IPTPS papers!)
• 2005-...:
  – System characterization
  – Anonymization
  – BitTorrent: improvements, alternatives, gaming it
  – Security, incentives
• More recently:
  – Live streaming
  – P2P TV (IPTV)
  – Games over P2P
P2P Systems – 68
What’s Missing?
• Very important lessons learned
  – ...but did we move beyond vertically-integrated applications?
• Can we distribute complex services on top of P2P overlays?
P2P Systems – 69
References
• Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. Stoica et al., SIGCOMM 2001
• A Scalable Content-Addressable Network. Ratnasamy et al., SIGCOMM 2001
• Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. Matei Ripeanu, Adriana Iamnitchi, and Ian Foster. IEEE Internet Computing, vol. 6(1), Feb 2002
• Interest-Aware Information Dissemination in Small-World Communities. Adriana Iamnitchi and Ian Foster. HPDC 2005, Raleigh, NC, July 2005
• Small-World File-Sharing Communities. Adriana Iamnitchi, Matei Ripeanu, and Ian Foster. Infocom 2004, Hong Kong, March 2004
• IPTPS paper archive: http://www.iptps.org/papers.html
• Many materials available on the web, including lectures by Matei Ripeanu, Karl Aberer, Brad Karp, and others.
P2P Systems – 70
Exploiting Usage Behavior in
Small-World File-Sharing
Communities
P2P Systems – 71
Context and Motivation
• By the time we did this research, many P2P communities were large, active, and stable
• Characterizations of P2P systems showed particular user behavior
• Our question: instead of building systems without user behavior in mind, could we (learn, observe, and) exploit it in system design?
• Follow-up questions:
  – What user behavior should we focus on?
  – How do we exploit it?
  – Is this pattern particular to one type of file-sharing community, or more general?
P2P Systems – 72
[Figure: users linked by common requests for files such as "Yellow Submarine", "Les Bonbons", "Wood Is a Pleasant Thing to Think About", and "No 24 in B minor, BWV 869"]
New metric: the data-sharing graph G_m^T(V, E):
  – V is the set of users active during an interval T
  – An edge in E connects users that asked for at least m common files within T
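A small sketch of how such a graph could be built from a request trace, following the definition above; the trace is assumed to be a list of (user, file, time) tuples:

    from collections import defaultdict
    from itertools import combinations

    def data_sharing_graph(trace, t_start, T, m):
        files_by_user = defaultdict(set)
        for user, f, t in trace:
            if t_start <= t < t_start + T:   # only requests inside the window
                files_by_user[user].add(f)
        edges = {(u, v) for u, v in combinations(files_by_user, 2)
                 if len(files_by_user[u] & files_by_user[v]) >= m}
        return set(files_by_user), edges     # (V, E) of G_m^T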
P2P Systems – 73
The DØ Collaboration
6 months of traces (January – June 2002)
300+ users, 2 million requests for 200K files
[Figure: average path length over time (7-day windows, 50 files), D0 vs. an equivalent random graph]
[Figure: clustering coefficient over time (7-day windows, 50 files), D0 vs. an equivalent random graph.
  CCoef = (# existing edges) / (# possible edges)]
Small world! Small average path length, large clustering coefficient
P2P Systems – 74
Small-World Graphs
• Small path length, large clustering coefficient
  – Typically compared against random graphs
• Think of:
  – "It's a small world!"
  – "Six degrees of separation"
    • Milgram's experiments in the '60s
    • Guare's play "Six Degrees of Separation"
P2P Systems – 75
Other Small Worlds
[Figure: average path length ratio vs. clustering coefficient ratio (both log scale, relative to random graphs) for known small worlds: food web, power grid, LANL coauthors, film actors, the Web, the Internet, word co-occurrences]
D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440-442, 1998
R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47 (2002)
P2P Systems – 76
Web Data-Sharing Graphs
[Figure: Web data-sharing graphs for several (interval, threshold) pairs, e.g. (300s, 1 file), (1800s, 10 files), (7200s, 50 files), (3600s, 50 files), (1800s, 100 files), plotted on the same path-length-ratio vs. clustering-coefficient-ratio axes as the other small-world graphs]
Data-Sharing Relationships in the Web. Iamnitchi, Ripeanu, and Foster, WWW '03
P2P Systems – 77
DØ Data-Sharing Graphs
[Figure: D0 data-sharing graphs, e.g. (7 days, 1 file) and (28 days, 1 file), added to the same plot alongside the Web data-sharing graphs and other small-world graphs]
P2P Systems – 78
KaZaA Data-Sharing Graphs
[Figure: Kazaa data-sharing graphs, e.g. (2 hours, 1 file), (4h, 2 files), (12h, 4 files), (1 day, 2 files), added to the same plot alongside the D0 and Web data-sharing graphs and other small-world graphs]
Small-World File-Sharing Communities. Iamnitchi, Ripeanu, and Foster, Infocom '04
P2P Systems – 79
Overview
• Small-world file-sharing communities:
  – The data-sharing graph
  – Traces from 3 file-sharing communities: D0, Kazaa, Web
  – It's a small world!
• Exploiting small-world patterns:
  – Overlay construction
  – Cluster identification
  – Information dissemination
P2P Systems – 80
Exploiting Small-World Patterns
• Exploit the small-world properties of the data-sharing graph:
  – Large clustering coefficient
  – (... and small average path length)
• Objective: dynamically identify groups of users with proven common interests in data
• Direct relevant information to groups of interest: interest-aware information dissemination
• Case study: file location
  – Concrete problem
  – Real traces
  – Specific performance metrics
  – Real, new requirements
• Other mechanisms:
  – Reputation mechanisms
  – Replica placement
  – ...
[Figure: three-step cycle: graph construction -> clustering -> dissemination]
Interest-Aware Information Dissemination in Small-World Communities. Iamnitchi and Foster, HPDC '05
P2P Systems – 82
Step 1: Graph Construction
Objective: make nodes aware of their common interests without central control
[Figure: node N downloads file F from node A; N appends the entry (A, T1, F) to its local log, alongside earlier entries such as (X, T0, Fx) and (C, T1, F3), and A records (N, T2, F)]
Log the access when downloading the file (not when requesting its location!)
P2P Systems – 83
Step 2: Clustering
• (extra) Challenge: no global knowledge of the graph
• Idea: label edges
  – Each node labels its edges as 'short' or 'long' based only on local information
  – Multiple ways to define 'short'/'long'. An edge is "short" if:
    • it is a dead end, or
    • it is in a triad
  – "Long" otherwise
[Figure: average nodes per cluster in the Web trace for windows of 2, 5, 15, and 30 minutes]
• Skewed cluster size distribution
• Similar results obtained with a centralized algorithm
• Need solutions to limit cluster size
(an edge-labeling sketch follows below)
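One way the local labeling rule could look in code (a sketch only: neighbors(x) is an assumed callback returning x's neighbor set, and the dead-end test is one possible interpretation of the rule above):

    def label_edges(node, neighbors):
        labels = {}
        my_nbrs = neighbors(node)
        for v in my_nbrs:
            # 'dead end': one endpoint of the edge has no other neighbors
            dead_end = len(neighbors(v)) == 1 or len(my_nbrs) == 1
            # 'in a triad': node and v share some third neighbor
            in_triad = bool(my_nbrs & (neighbors(v) - {node, v}))
            labels[(node, v)] = "short" if (dead_end or in_triad) else "long"
        return labels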
P2P Systems – 84
P2P Systems – 85
Step 3: Information Dissemination (1)
Hit rate due to prior information dissemination within clusters: up to 70% (compared to under 5% for random groups of the same size)
[Figure: total hit rate, and hit rate excluding the largest cluster, for D0 (windows of 3 to 28 days), Web (2 to 30 minutes), and Kazaa (1 to 8 hours)]
P2P Systems – 86
Step 3: Information Dissemination (2)
[Figure: CDF of the percentage of each collection found locally due to information dissemination, for graphs (3 days, 10 files), (3 days, 100 files), (21 days, 100 files), and (21 days, 500 files)]
P2P Systems – 87
The Message:
• There are small-world patterns in file-sharing communities
  – The D0 experiment, Web traces, the KaZaA network: small worlds!
• ...that can be exploited for designing algorithms
  – Information dissemination (graph construction -> clustering -> dissemination)
  – There must be other mechanisms, as well!
P2P Systems – 88
Where Are We?
• We saw the major solutions in unstructured P2P systems:
  – Napster
  – Gnutella
  – BitTorrent
• And a solution that starts from usage patterns to get inspiration for system design
  – Exploiting small-world patterns in file sharing
• There are many other ideas for unstructured P2P networks, but...
• There are also the structured P2P networks!
P2P Systems – 89