ecs251 Spring 2007: Operating System Models #3: Peer-to-Peer Systems

UCDavis, ecs251
Spring 2007
ecs251 Spring 2007:
Operating System Models
#3: Peer-to-Peer Systems
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
05/03/2007
P2P
1
UCDavis, ecs251
Spring 2007
The role of the service provider…
• Centralized management of services
– DNS, Google, www.cnn.com, Blockbuster, SBC/Sprint/AT&T, cable service, Grid computing, AFS, bank transactions…
• Information, computing, and network resources owned by one or very few administrative domains.
– Some with an SLA (Service Level Agreement)
Interacting with the “SP”
• Service providers own the information and the interactions
– Some enhance/establish the interactions
Let’s compare …
• Google
• Blockbuster
• CNN
• MLB/NBA
• LinkedIn
• e-Bay
versus
• Skype
• Bittorrent
• Blog
• Youtube
• BotNet
• Cyber-Paparazzi
Toward P2P
• More participation of the end nodes (or their users)
– More decentralized computing/network resources available
– End-user controllability and interactions
– Security/robustness concerns
Service Providers in P2P
• We might not like SPs, but we still cannot avoid them entirely.
– Who is going to lay the fiber and switches?
– Can we avoid DNS?
– How can we stop “Cyber-Bullying” and similar abuse?
– Copyright enforcement?
– Will the Internet become a junkyard?
We will discuss…
• P2P system examples
– Unstructured, structured, incentive
• Architectural analysis and issues
• Future P2P applications and why
Challenge to you…
• Define a new P2P-related application, service, or architecture.
• Justify why it is practical, useful, and will scale well.
– Example: sharing cooking recipes, or experiences & recommendations about restaurants and hotels
Napster
• P2P file sharing
• “Unstructured”
Napster
[Figure: peers and two Napster index servers]
1. File location request
2. List of peers offering the file
3. File request
4. File delivered
5. Index update
Napster
• Advantages?
• Disadvantages?
• Originally conceived by Justin Frankel, the 21-year-old founder of Nullsoft
• March 2000: Nullsoft posts Gnutella to the web
• A day later, AOL removes Gnutella at the behest of Time Warner
• The Gnutella protocol version 0.4: http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf and version 0.6: http://rfc-gnutella.sourceforge.net/Proposals/Ultrapeer/Ultrapeers.htm
• There are multiple open-source implementations at http://sourceforge.net/, including:
– Jtella
– Gnucleus
• Software released under the Lesser Gnu Public License (LGPL)
• The Gnutella protocol has been widely analyzed
Gnutella Protocol Messages
• Broadcast Messages
– Ping: initiating message (“I’m here”)
– Query: search pattern and TTL (time-to-live)
• Back-Propagated Messages
– Pong: reply to a ping, contains information about the peer
– Query response: contains information about the computer that has the needed file
• Node-to-Node Messages
– GET: return the requested file
– PUSH: push the file to me
Limited Scope Flooding / Reverse Path Forwarding
[Figures: a seven-node overlay in which node 2 searches for file A, held by nodes 5 and 7]
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
• File download
• Note: file transfer between clients behind firewalls is not possible; if only one client, X, is behind a firewall, Y can request that X push the file to Y
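The flooding and reverse-path steps above can be sketched in a few lines of Python. The seven-node topology is an assumption chosen to match the figures (node 2 searching; nodes 5 and 7 holding file A), and the function names are illustrative, not part of the Gnutella protocol:

```python
from collections import deque

def flood_query(graph, origin, ttl):
    """Limited-scope flood: forward the query to all neighbors,
    decrementing the TTL; a seen-set suppresses duplicate copies.
    Returns each reached node's reverse-path parent, i.e. the hop
    a query reply is back-propagated through."""
    parent = {origin: None}
    frontier = deque([(origin, ttl)])
    while frontier:
        node, t = frontier.popleft()
        if t == 0:
            continue                       # TTL expired, stop forwarding
        for nbr in graph[node]:
            if nbr not in parent:          # duplicate suppression
                parent[nbr] = node         # reverse-path pointer
                frontier.append((nbr, t - 1))
    return parent

def reply_path(parent, hit_node):
    """Route of a reply from a node holding the file back to the origin."""
    path = [hit_node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

# Assumed topology; node 2 searches, nodes 5 and 7 have file A.
graph = {1: [2, 4, 7], 2: [1, 3], 3: [2, 5, 6],
         4: [1], 5: [3], 6: [3], 7: [1]}
parent = flood_query(graph, origin=2, ttl=7)
print(reply_path(parent, 5))   # [5, 3, 2]: back-propagated via node 3
```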
Gnutella
• Advantages?
• Disadvantages?
• GUID: short for Global Unique Identifier, a randomized string that is used to uniquely identify a host or message on the Gnutella network. This prevents duplicate messages from being sent on the network.
• GWebCache: a distributed system for helping servants connect to the Gnutella network, thus solving the “bootstrapping” problem. Servants query any of several hundred GWebCache servers to find the addresses of other servants. GWebCache servers are typically web servers running a special module.
• Host Catcher: Pong responses allow servants to keep track of active Gnutella hosts.
• On most servants, the default port for Gnutella is 6346.
[Figure: Gnutella network growth, Nov 2000 to May 2001; y-axis is the number of nodes in the largest network component, in thousands (10 to 50)]
“Limited Scope Flooding”
• Ripeanu reported that Gnutella traffic totals 1 Gbps (or 330 TB/month).
– Compare to 15,000 TB/month in the US Internet backbone (December 2000)
– This estimate excludes actual file transfers
• Reasoning:
– QUERY and PING messages are flooded; they form more than 90% of generated traffic
– Predominant TTL = 7
– >95% of nodes are less than 7 hops away
– Measured traffic on each link is about 6 kbps
– Network with 50k nodes and 170k links
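As a sanity check, the reported per-link rate and link count multiply out to the quoted totals (a rough estimate assuming a 30-day month):

```python
links = 170_000           # measured Gnutella overlay links
per_link_bps = 6_000      # about 6 kbps of query/ping traffic per link

total_bps = links * per_link_bps
print(total_bps)          # 1_020_000_000, i.e. about 1 Gbps

seconds_per_month = 30 * 24 * 3600
tb_per_month = total_bps / 8 * seconds_per_month / 1e12
print(round(tb_per_month))   # 330 (TB/month)
```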
Perfect Mapping
[Figure: eight overlay nodes A–H whose overlay links map one-to-one onto the underlying physical links]
Inefficient Mapping
[Figure: the same eight nodes A–H wired so that several overlay links share one physical link]
• Link D–E needs to support six times higher traffic.
Topology mismatch
• The overlay network topology doesn’t match the underlying Internet infrastructure topology!
• 40% of all nodes are in the 10 largest Autonomous Systems (AS)
• Only 2-4% of all TCP connections link nodes within the same AS
• Largely ‘random wiring’
• Most Gnutella-generated traffic crosses AS borders, making the traffic more expensive
• May cause ISPs to change their pricing scheme
Scalability
• Whenever a node receives a message (ping/query), it sends copies out to all of its other connections.
• Existing mechanisms to reduce traffic:
– TTL counter
– Nodes cache information about messages they received, so that they don’t forward duplicated messages.
• 70% of Gnutella users share no files
• 90% of users answer no queries
• Those who have files to share may limit the number of connections or the upload speed, resulting in a high download failure rate.
• If only a few individuals contribute to the public good, these few peers effectively act as centralized servers.
Anonymity
• Gnutella provides for anonymity by masking the identity of the peer that generated a query.
• However, IP addresses are revealed at various points in its operation: query-hit (HITS) packets include the URL for each file, revealing the IP addresses.
Query Expressiveness
• The format of the query is not standardized
• No standard format or matching semantics for the QUERY string; its interpretation is completely determined by each node that receives it
• String literal vs. regular expression
• Directory name, filename, or file contents
• Malicious users may even return files unrelated to the query
Superpeers
• Cooperative, long-lived peers, typically with significant resources, that handle a very high volume of query-resolution traffic.
• Gnutella is a self-organizing, large-scale P2P application that produces an overlay network on top of the Internet; it appears to work
• Growth is hindered by the volume of generated traffic and inefficient resource use
• Since there is no central authority, the open-source community must commit to making any changes
• Suggested changes have been made by
– Peer-to-Peer Architecture Case Study: Gnutella Network, by Matei Ripeanu
– Improving Gnutella Protocol: Protocol Analysis and Research Proposals, by Igor Ivkovic
Freenet
• Essentially the same as Gnutella:
– Limited-scope flooding
– Reverse-path forwarding
• Difference:
– Data objects (i.e., files) are also delivered via “reverse-path forwarding”
P2P Issues
• Scalability & Load Balancing
• Anonymity
• Fairness, Incentives & Trust
• Security and Robustness
• Efficiency
• Mobility
Incentive-driven Fairness
• P2P means we all should contribute…
– Hopefully fairly, but the majority is selfish…
• “Incentive for people to contribute…”
Bittorrent: “Tit for Tat”
• Equivalent Retaliation (game theory)
– A peer will “initially” cooperate, then respond in kind to an opponent’s previous action. If the opponent previously was cooperative, the agent is cooperative. If not, the agent is not.
Bittorrent
• Fairness of download and upload between a pair of peers
• Every 10 seconds, estimate the download bandwidth from the other peer
– Based on this performance estimate, decide whether or not to continue uploading to the other peer
Client & its Peers
• Client
– Download rate (from the peers)
• Peers
– Upload rate (to the client)
BT Choking by Client
• By default, every peer is “choked”
– Stop “uploading” to them, but the TCP connection is still there.
• Select four peers to “unchoke”
– Best “upload rates” and “interested”.
– Upload to the unchoked ones and monitor the download rate for all the peers
– “Re-choke” every 30 seconds
• Optimistic Unchoking
– Randomly select a choked peer to unchoke
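A minimal sketch of this choking policy, assuming a table of per-peer rate estimates; the peer names and rates are hypothetical, and a real client recomputes rolling rate estimates every 10 seconds rather than using fixed numbers:

```python
import random

def choose_unchoked(peers, n_best=4):
    """peers: name -> (estimated rate in bytes/s, interested flag).
    Unchoke the n_best interested peers with the highest rates, plus
    one random 'optimistic' unchoke from the remaining peers."""
    interested = [p for p, (rate, want) in peers.items() if want]
    best = sorted(interested, key=lambda p: peers[p][0], reverse=True)[:n_best]
    rest = [p for p in peers if p not in best]
    optimistic = [random.choice(rest)] if rest else []
    return best, optimistic

peers = {"a": (50, True), "b": (80, True), "c": (10, True),
         "d": (90, False), "e": (60, True), "f": (70, True)}
best, opt = choose_unchoked(peers)
print(best)   # ['b', 'f', 'e', 'a']: d uploads fastest but is not interested
```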
“Interested”
• A request for a piece (or its sub-pieces)
Becoming a “seed”
• Use the “upload” rate to the peers to decide which peers to unchoke.
[Slide: screenshot of the Bittorrent Wiki]
BT Peer Selection
• From the “Tracker”
– We receive a partial list of all active peers for the same file
– We can get another 50 from the tracker if we want
Piece Selection
• Piece (64K~1M) and sub-piece (16K)
– Piece size: a trade-off between performance and the size of the torrent file itself
– A client might request different sub-pieces of the same piece from different peers.
• Strict priority: finish the sub-pieces of a partially downloaded piece before starting a new piece
• Rarest First
– Exception: “random first”
– Get the stuff out of the seed(s) as soon as possible…
Rarest First
• Exchange bitmaps with 20+ peers
– Initial messages
– “have” messages
• Array of buckets
– The i-th bucket contains “pieces” with i known instances
– Within the same bucket, the client will randomly select one piece.
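The bucket-based choice can be sketched as follows; the swarm bitmaps are hypothetical, and a `Counter` plays the role of the array of buckets:

```python
import random
from collections import Counter

def rarest_first(peer_bitmaps, have):
    """Count how many peers advertise each piece we still lack, then
    pick at random inside the rarest bucket (the random tie-break the
    slides describe)."""
    counts = Counter(piece for bitmap in peer_bitmaps
                     for piece in bitmap if piece not in have)
    if not counts:
        return None                      # nothing left to fetch
    rarest = min(counts.values())
    bucket = [p for p, c in counts.items() if c == rarest]
    return random.choice(bucket)

bitmaps = [{0, 1, 2}, {1, 2, 3}, {2, 3}]    # assumed three-peer swarm
print(rarest_first(bitmaps, have={0}))      # 1 or 3 (each held by 2 peers)
```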
Random-First
• Usually, the rarest pieces really are rare: the client would have to get all their sub-pieces from one or very few peers.
• For the first 4~5 pieces, get some random pieces instead, so that the client quickly has a few pieces to upload.
BitTorrent
• Connect to the Tracker
• Connect to 20+ peers
• Random-first, then rarest-first
• Monitor the download rate from the peers (or upload rate to the client)
• Unchoke and Optimistic Unchoke
Bittorrent
• Advantages
• Disadvantages
Trackerless Bittorrent
• Every BT peer is a tracker!
• But how would they share and exchange information regarding other peers?
• Similar to Napster’s index server or DNS
Pure P2P
• Every peer is a tracker
• Every peer is a DNS server
• Every peer is a Napster index server
• How can this be done?
– We try to remove/reduce the role of “special servers”!
Peer
• What are the requirements of a peer?
Structured Peering
• Peer identity and routability
Structured Peering
• Peer identity and routability
• Key/content assignment
– Which identity owns what? (Google Search?)
Structured Peering
• Peer identity and routability
• Key/content assignment
– Which identity owns what?
– Napster: centralized index service
– Skype/Kazaa: login server & super peers
– DNS: hierarchical DNS servers
• Two problems:
(1) How to connect to the “ring”?
(2) How to prevent failures/changes?
DHT
• Distributed hash tables (DHTs)
– A decentralized lookup service for a hash table
– (name, value) pairs stored in the DHT
– Any peer can efficiently retrieve the value associated with a given name
– The mapping from names to values is distributed among peers
HT as a search table
• Information/content is distributed, and we need to know where.
• Index key:
– Where is this piece of music?
– What is the location of this type of content?
– What is the current IP address of this Skype user?
DHT as a search table
[Figures: the index is partitioned across peers; which peer owns the hash-table entry for a given index key?]
DHT
• Scalable
• Peer arrivals, departures, and failures
• Unstructured versus structured
DHT (Name, Value)
• How can we utilize a DHT to avoid trackers in Bittorrent?
DHT-based Tracker
[Figure: “FreeBSD 5.4 CD images” hashed to an index key]
• Publish the key on the class web site.
• Whoever owns this hash entry is the tracker for the corresponding key!
• The entry stores the seed’s IP address, accessed via PUT & GET.
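A toy sketch of the idea, with an ordinary dictionary standing in for the distributed table; in a real DHT the peer whose identifier owns SHA-1(name) would store the entry, and PUT/GET would be routed to it. The class name and addresses are hypothetical:

```python
import hashlib

class ToyDHT:
    """Dictionary standing in for the distributed table; the key is
    the SHA-1 hash of the torrent's name."""
    def __init__(self):
        self.table = {}

    def put(self, name, peer_addr):           # a peer announces itself
        key = hashlib.sha1(name.encode()).hexdigest()
        self.table.setdefault(key, set()).add(peer_addr)

    def get(self, name):                      # ask for the peer list
        key = hashlib.sha1(name.encode()).hexdigest()
        return sorted(self.table.get(key, set()))

dht = ToyDHT()
dht.put("FreeBSD 5.4 CD images", "10.0.0.1:6881")   # the seed
dht.put("FreeBSD 5.4 CD images", "10.0.0.2:6881")   # another peer
print(dht.get("FreeBSD 5.4 CD images"))   # both peers, no tracker needed
```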
Chord
• Consistent Hashing
• A Simple Key Lookup Algorithm
• Scalable Key Lookup Algorithm
• Node Joins and Stabilization
• Node Failures
Chord
• Given a key (data item), it maps the key onto a peer.
• Uses consistent hashing to assign keys to peers.
• Solves the problem of locating a key in a collection of distributed peers.
• Maintains routing information as peers join and leave the system.
Issues
• Load balance: distributed hash function, spreading keys evenly over peers
• Decentralization: Chord is fully distributed, no node more important than any other, which improves robustness
• Scalability: logarithmic growth of lookup costs with the number of peers in the network; even very large systems are feasible
• Availability: Chord automatically adjusts its internal tables to ensure that the peer responsible for a key can always be found
Example Application
[Figure: a client and two servers, each running the layers File System / Block Store / Chord]
• The highest layer provides a file-like interface to the user, including user-friendly naming and authentication
• This file system maps operations to lower-level block operations
• Block storage uses Chord to identify the node responsible for storing a block, and then talks to the block storage server on that node
Consistent Hashing
• The consistent hash function assigns each peer and key an m-bit identifier.
• SHA-1 is used as a base hash function.
• A peer’s identifier is defined by hashing the peer’s IP address.
• A key identifier is produced by hashing the key (Chord doesn’t define this; it depends on the application).
– ID(peer) = hash(IP, Port)
– ID(key) = hash(key)
Consistent Hashing
• In an m-bit identifier space, there are 2^m identifiers.
• Identifiers are ordered on an identifier circle modulo 2^m.
• The identifier ring is called the Chord ring.
• Key k is assigned to the first peer whose identifier is equal to or follows (the identifier of) k in the identifier space.
• This peer is the successor peer of key k, denoted by successor(k).
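The successor rule can be sketched on the small 3-bit ring with peers 0, 1, and 3 that these slides use as their running example; `successor` here is an illustrative helper, not Chord code:

```python
import bisect

def successor(peers, key_id, m=3):
    """First peer whose identifier equals or follows key_id
    clockwise on a 2**m identifier circle."""
    ring = sorted(peers)
    i = bisect.bisect_left(ring, key_id % (2 ** m))
    return ring[i % len(ring)]     # wrap past the top of the ring

peers = [0, 1, 3]                  # 3-bit ring, identifiers 0..7
print(successor(peers, 1), successor(peers, 2), successor(peers, 6))
# 1 3 0  (successor(1) = 1, successor(2) = 3, successor(6) = 0)
```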
Consistent Hashing – Successor
[Figure: 3-bit identifier circle (identifiers 0 to 7) with peers 0, 1, and 3; keys 1, 2, and 6 map as successor(1) = 1, successor(2) = 3, successor(6) = 0]
Consistent Hashing – Join and Departure
• When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n.
• When node n leaves the network, all of its assigned keys are reassigned to n’s successor.
Node Join
[Figure: a node joins the ring and takes over, from its successor, the keys that now fall between its predecessor and itself]
Node Departure
[Figure: a node leaves the ring and all of its keys are reassigned to its successor]
Technical Issues
• ???
A Simple Key Lookup
• A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
• If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
• Queries for a given identifier can be passed around the circle via these successor pointers until they encounter the node that contains the key.
A Simple Key Lookup
• Pseudo code for finding the successor:

// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
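The pseudo code above can be made runnable; the only subtle part is the interval test, since (n, successor] wraps around the ring. A sketch with three nodes (the identifiers are illustrative):

```python
class Node:
    """Chord node with only a successor pointer; lookups walk the
    ring linearly, as in the pseudo code above."""
    def __init__(self, ident, m=6):
        self.id, self.m = ident, m
        self.successor = self

    def find_successor(self, key):
        if in_interval(key, self.id, self.successor.id, 2 ** self.m):
            return self.successor
        return self.successor.find_successor(key)

def in_interval(x, a, b, size):
    """True if x lies in the circular half-open interval (a, b]."""
    return x != a and (x - a) % size <= (b - a) % size

# Ring 0 -> 1 -> 3 -> 0 on a 2**6 identifier space.
n0, n1, n3 = Node(0), Node(1), Node(3)
n0.successor, n1.successor, n3.successor = n1, n3, n0
print(n0.find_successor(2).id)   # 3
print(n0.find_successor(6).id)   # 0 (wraps past the top of the ring)
```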
A Simple Key Lookup
• The path taken by a query from node 8 for key 54: [figure]
Successor
• Each active node MUST know the IP address of its successor!
– N8 has to know that the next node on the ring is N14.
• On departure of N14: N8 => N21
• But what about a failure or crash?
Robustness
• Successors for R hops
– N8 => N14, N21, N32, N38 (R = 4)
– Periodic pinging along the path to check them, and also to find out whether there are “new members” in between
Is that good enough?
Complexity of the Search
• Time/messages: O(N)
– N: # of nodes on the ring
• Space: O(1)
– We only need to remember R IP addresses
• Stabilization depends on the “period”.
Scalable Key Location
• To accelerate lookups, Chord maintains additional routing information.
• This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.
Scalable Key Location – Finger Tables
• Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
• The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle.
• s = successor(n + 2^(i-1)).
• s is called the i-th finger of node n, denoted by n.finger(i).
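The formula can be checked against the slides' 3-bit example ring with nodes 0, 1, and 3; `finger_table` is an illustrative helper, not Chord API:

```python
def finger_table(n, peers, m=3):
    """Starts and successors of node n's fingers on a 2**m ring;
    finger i (1-based) starts at (n + 2**(i-1)) mod 2**m."""
    size = 2 ** m
    ring = sorted(peers)

    def successor(k):
        k %= size
        for p in ring:             # first peer at or after k, else wrap
            if p >= k:
                return p
        return ring[0]

    starts = [(n + 2 ** (i - 1)) % size for i in range(1, m + 1)]
    return starts, [successor(s) for s in starts]

starts, succs = finger_table(0, [0, 1, 3])
print(starts)   # [1, 2, 4]: node 0's finger starts
print(succs)    # [1, 3, 0]: the fingers themselves
```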
Scalable Key Location – Finger Tables
[Figure: 3-bit ring with nodes 0, 1, and 3 and keys 1, 2, 6. Finger table of node 0: starts 1, 2, 4 (0+2^0, 0+2^1, 0+2^2) with successors 1, 3, 0; node 1: starts 2, 3, 5 with successors 3, 3, 0; node 3: starts 4, 5, 7 with successors 0, 0, 0]
Finger Tables
• A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.
• The first finger of n is the immediate successor of n on the circle.
Scalable Key Location – Example Query
• The path of a query for key 54 starting at node 8: [figure]
Scalable Key Location – A Characteristic
• Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between the node and the target identifier. From this intuition follows a theorem:
• Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
Complexity of the Search
• Time/messages: O(log N)
– N: # of nodes on the ring
• Space: O(log N)
– We need to remember R IP addresses
– We need to remember log N fingers
• Stabilization depends on the “period”.
An Example
• M = 4096 (identifier size), so the ring size is 2^4096
• N = 2^16 (# of nodes)
• How many entries do we need in the finger table?
• Recall: each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table. The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle; s = successor(n + 2^(i-1)).
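Worked out: the finger table has one entry per identifier bit, so m = 4096 entries, even though with only 2^16 nodes roughly log2(N) = 16 of them point at distinct nodes (with high probability, the fingers for small powers all collapse onto the immediate successor). This is what motivates the O(M) bounds on the next slide:

```python
from math import log2

m = 4096             # identifier bits, so the ring has 2**4096 points
n_nodes = 2 ** 16

finger_entries = m                # one finger per identifier bit
distinct = int(log2(n_nodes))    # ~log2(N) distinct fingers, w.h.p.
print(finger_entries, distinct)  # 4096 16
```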
Complexity of the Search
• Time/messages: O(M)
– M: # of bits of the identifier
• Space: O(M)
– We need to remember R IP addresses
– We need to remember M fingers
• Stabilization depends on the “period”.
Structured Peering
• Peer identity and routability
– 2^M identifiers, finger-table routing
• Key/content assignment
– Hashing
• Dynamics/Failures
– Inconsistency??
Node Joins and Stabilization
• The most important thing is the successor pointer.
• If the successor pointer is kept up to date (which is sufficient to guarantee correct lookups), then the finger tables can always be verified.
• Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.
Node Joins and Stabilization
• The “stabilization” protocol contains 6 functions:
– create()
– join()
– stabilize()
– notify()
– fix_fingers()
– check_predecessor()
Node Joins – join()
• When node n first starts, it calls n.join(n’), where n’ is any known Chord node.
• The join() function asks n’ to find the immediate successor of n.
• join() does not make the rest of the network aware of n.
Node Joins – join()

// create a new Chord ring.
n.create()
  predecessor = nil;
  successor = n;

// join a Chord ring containing node n’.
n.join(n’)
  predecessor = nil;
  successor = n’.find_successor(n);
Node Joins – stabilize()
• Each time node n runs stabilize(), it asks its successor for the successor’s predecessor p, and decides whether p should be n’s successor instead.
• stabilize() notifies node n’s successor of n’s existence, giving the successor the chance to change its predecessor to n.
• The successor does this only if it knows of no closer predecessor than n.
Node Joins – stabilize()

// called periodically. verifies n’s immediate
// successor, and tells the successor about n.
n.stabilize()
  x = successor.predecessor;
  if (x ∈ (n, successor))
    successor = x;
  successor.notify(n);

// n’ thinks it might be our predecessor.
n.notify(n’)
  if (predecessor is nil or n’ ∈ (predecessor, n))
    predecessor = n’;
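A toy simulation of a join using the stabilize()/notify() rules above; the 3-bit ring and node identifiers are assumptions for illustration:

```python
class Node:
    """Toy node for the stabilize()/notify() protocol (3-bit ring)."""
    def __init__(self, ident, m=3):
        self.id, self.size = ident, 2 ** m
        self.successor, self.predecessor = self, None

    def between(self, x, a, b):
        """True if x lies strictly inside the circular interval (a, b)."""
        return (x - a) % self.size < (b - a) % self.size and x != a

    def stabilize(self):
        x = self.successor.predecessor
        if x is not None and self.between(x.id, self.id, self.successor.id):
            self.successor = x             # a closer successor exists
        self.successor.notify(self)

    def notify(self, n):
        if self.predecessor is None or self.between(
                n.id, self.predecessor.id, self.id):
            self.predecessor = n

# Existing two-node ring 0 <-> 3; node 1 joins knowing only its successor,
# which join() would have obtained via find_successor on a known node.
n0, n3 = Node(0), Node(3)
n0.successor, n0.predecessor = n3, n3
n3.successor, n3.predecessor = n0, n0
n1 = Node(1)
n1.successor = n3

n1.stabilize()   # n3 learns that n1 is its predecessor
n0.stabilize()   # n0 sees n3's new predecessor and adopts n1 as successor
print([n.successor.id for n in (n0, n1, n3)])   # [1, 3, 0]
```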
Node Joins – Join and Stabilization
[Figure: nodes np → n → ns on the ring; succ(np) changes from ns to n, and pred(ns) changes from np to n]
• n joins
– predecessor = nil
– n acquires ns as successor via some n’
• n runs stabilize
– n notifies ns of being the new predecessor
– ns acquires n as its predecessor
• np runs stabilize
– np asks ns for its predecessor (now n)
– np acquires n as its successor
– np notifies n
– n will acquire np as its predecessor
• All predecessor and successor pointers are now correct
• Fingers still need to be fixed, but old fingers will still work
Node Joins – fix_fingers()
• Each node periodically calls fix_fingers() to make sure its finger table entries are correct.
• It is how new nodes initialize their finger tables.
• It is how existing nodes incorporate new nodes into their finger tables.
Node Joins – fix_fingers()

// called periodically. refreshes finger table entries.
n.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(n + 2^(next-1));

// checks whether predecessor has failed.
n.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
Node Failures
• The key step in failure recovery is maintaining correct successor pointers.
• To help achieve this, each node maintains a successor-list of its r nearest successors on the ring.
• If node n notices that its successor has failed, it replaces it with the first live entry in the list.
• Successor lists are stabilized as follows:
– Node n reconciles its list with its successor s by copying s’s successor list, removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
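The reconciliation rule is a one-liner; a sketch using the earlier N8 → N14 → N21 → N32 → N38 example with r = 4 (the list entries beyond N38 are assumed for illustration):

```python
def reconcile(succ_id, succ_list_of_succ, r=4):
    """Node n's new successor-list: prepend its successor s, copy
    s's list minus the last entry, cap the length at r."""
    return ([succ_id] + succ_list_of_succ[:-1])[:r]

# Node 8's successor is 14, and 14's successor-list is [21, 32, 38, 42].
print(reconcile(14, [21, 32, 38, 42]))   # [14, 21, 32, 38]: node 8's list

# If 14 fails, node 8 falls back to 21 and reconciles with it instead:
print(reconcile(21, [32, 38, 42, 45]))   # [21, 32, 38, 42]
```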
Chord – The Math
• Every node is responsible for about K/N keys (N nodes, K keys)
• When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node)
• Lookups need O(log N) messages
• To re-establish routing invariants and finger tables after a node joins or leaves, only O(log² N) messages are required