Decentralized systems

CIS 455/555: Internet and Web Systems
Decentralized systems
February 15, 2016
© 2016 A. Haeberlen, Z. Ives
University of Pennsylvania
Announcements

HW1 MS2 is due February 19
  Try to finish a few days early (testing/debugging...)
Another Basic Testing Guide will be available
  NOT an exhaustive list of all the features you need to implement!
Some MS1 features will be tested again
  Please use the feedback from your MS1 grade report to improve your server (grade reports should be available later this week)
Reading:
  Stoica et al., "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications", SIGCOMM 2001
  http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf

The road ahead

Remember our goal:
  Understand large web systems like Google, Facebook, ...
So far, we have seen:
  "Frontend" technology
  Data representation, indexing
  Focus was on a single machine
Coming up next:
  How to build large services with lots of machines
  ... such as a crawler
  Main challenges: Scalability, robustness

Photo: Salil S. (F0t0Synth), http://www.flickr.com/photos/ss2001/4531189792/

Plan for the next two lectures

A few words on Java servlets
Decentralization  <- NEXT
Partly centralized systems
  Example: BitTorrent
Consistent hashing
Distributed hashtables
Fully decentralized systems
  KBR; Chord
  Pastry
  Attacks on KBR

How do we distribute a B+ tree?

We need to host the root at one machine and distribute the rest
Implications for scalability?
  Consider building the index as well as searching
  What limits scalability?
Implications for robustness?
  Consider benign faults, rational behavior, and malicious attacks

Problem: Centralized structure

Some systems are fully or partly centralized
  Some nodes maintain important state that only they know
  Some nodes perform functions only they can perform
  Some 'players' contribute most or all of the resources
Is this a good thing or a bad thing?
  Good: Simple, easier to get consistency, ...
  Bad: Single point of failure, load imbalance, ...
Are there alternatives?
  Sometimes centralization is inherent (Example?)
  Sometimes it is a consequence of the system design
  If the latter, we can do something about it!

Approach: Decentralization

How can we make systems less centralized?
Idea #1: Utilize resources of ALL nodes
  ALL the nodes can help the system by contributing storage, bandwidth, computation, ...
Idea #2: Remove centralized components
  Avoid having individual nodes or systems that are crucial to the operation of the system

Spectrum of approaches

[Figure: a spectrum from centralized to decentralized]
Centralized: Client/server
Partly centralized: BitTorrent, Skype, ...
Decentralized: Pastry, Gnutella

Examples of deployed systems

Examples of partly centralized systems:
  Skype (telephony)
  Akamai NetSession (content distribution)
  BitTorrent (content distribution)
  SETI@home/BOINC (volunteer computing)
  Amazon Dynamo (key-value store)
Examples of decentralized systems:
  Freenet (censorship-resistant data store)
  Gnutella (file sharing)
  CoralCDN (content distribution)
  BGP (the Internet's interdomain routing system)
  NNTP and SMTP (news and mail distribution)

"P2P = File sharing"

Some of the early P2P applications were used for file sharing (Napster, Gnutella, ...)
  Some people even believe they are the same
But P2P is not the same as file sharing!
  File sharing: A specific application
  P2P: A design principle for distributed systems
And file sharing is not the only application!
  Other examples: Streaming media, telephony, content distribution, routing, volunteer computing, ...

Recap: Decentralization

Sometimes a single machine is not enough
  Several machines must work together to implement the service
Systems can be centralized to various degrees
  Is there a single machine, or a small set of machines, that does most of the work or is involved in every single operation?
Centralized or decentralized?
  Pro centralized: Simpler, easier to get consistency, ...
  Pro decentralized: No single point of failure, load balance, scalability, ...

Plan for the next two lectures

A few words on Java servlets
Decentralization
Partly centralized systems  <- NEXT
  Example: BitTorrent
Consistent hashing
Distributed hashtables
Fully decentralized systems
  KBR; Chord
  Pastry
  Attacks on KBR

Characteristics of partly centralized systems

Contain some centralized components
  Example: Central controller that maintains a list of participating nodes
But: Centralized component is not involved in resource-intensive operations
  Example: Data is downloaded from, or uploaded to, peers directly

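The split between a small central component and peer-to-peer data transfer can be sketched in a few lines. This is only an illustration under stated assumptions: the `Controller` and `Node` classes, their method names, and the in-memory "network" are all hypothetical and do not come from any real system.

```python
import random


class Controller:
    """Hypothetical central component: it only tracks who participates."""

    def __init__(self):
        self.members = set()

    def join(self, node_id):
        self.members.add(node_id)

    def leave(self, node_id):
        self.members.discard(node_id)

    def sample_peers(self, requester, k, rng=random):
        """Return up to k participating nodes other than the requester."""
        others = sorted(self.members - {requester})
        return rng.sample(others, min(k, len(others)))


class Node:
    """Hypothetical participant that stores data and talks to peers directly."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}  # data held locally by this node

    def fetch_from(self, peer, key):
        """Bulk data moves node-to-node; the controller never sees it."""
        self.store[key] = peer.store[key]


controller = Controller()
nodes = {name: Node(name) for name in ("a", "b", "c")}
for name in nodes:
    controller.join(name)
nodes["a"].store["chunk-0"] = b"x" * 1024

# Control traffic: a short list of node names from the controller.
peers = controller.sample_peers("b", k=2)
# Data traffic: fetched directly from whichever peer has the chunk.
source = next(p for p in peers if "chunk-0" in nodes[p].store)
nodes["b"].fetch_from(nodes[source], "chunk-0")
```

The controller handles only small membership messages, so its load grows with the number of joins and lookups rather than with the volume of data exchanged.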
An example

[Figure: one server (x1) with a 1 Gbps uplink and 10,000 clients (x 10,000) with a 1 Mbps uplink each]

Suppose we want to ship a DVD image to 10,000 clients. How do we do this?
Option #1: Server does all the work
  Example: 1 Gbps upstream -> Need about 190 hours
Option #2: Let the clients help
  1 Mbps upstream x 10,000 = 10 Gbps!
  Even if the server has only 1 Mbps, can finish in 19 hours!

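The numbers on this slide can be checked with a short back-of-the-envelope calculation. The slide does not state the image size; the sketch below assumes a dual-layer DVD image of about 8.5 GB, which is the size that reproduces the 190-hour and 19-hour figures. It is an idealized estimate that ignores protocol overhead and the time pieces need to spread through the swarm.

```python
DVD_BYTES = 8.5e9      # assumption: dual-layer DVD image, about 8.5 GB
CLIENTS = 10_000       # from the slide
SERVER_UP_BPS = 1e9    # 1 Gbps server upstream (from the slide)
CLIENT_UP_BPS = 1e6    # 1 Mbps upstream per client (from the slide)


def hours(bits, bits_per_second):
    """Time in hours to push `bits` through a link of the given rate."""
    return bits / bits_per_second / 3600


image_bits = DVD_BYTES * 8
total_bits = image_bits * CLIENTS  # every client needs a full copy

# Option #1: the server uploads every copy itself.
server_only_h = hours(total_bits, SERVER_UP_BPS)

# Option #2: the clients' combined upstream capacity does the work.
swarm_bps = CLIENTS * CLIENT_UP_BPS
swarm_h = hours(total_bits, swarm_bps)

# A 1 Mbps server still has to inject at least one full copy.
seed_h = hours(image_bits, 1e6)

print(f"server only: {server_only_h:.0f} h")
print(f"swarm ({swarm_bps / 1e9:.0f} Gbps aggregate): {swarm_h:.0f} h")
print(f"1 Mbps server injecting one copy: {seed_h:.0f} h")
```

Under these assumptions the aggregate client bandwidth and the time for a slow server to inject a single copy both come out at roughly 19 hours, which is why the server's own uplink matters so little in Option #2.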
Swarming

[Figure: The file is split into fixed-size pieces. The node that originally has the file hands out pieces, and clients exchange pieces with each other. Once a client has the entire file, it turns into a 'seeder'.]

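A minimal sketch of the fixed-size pieces idea: a file is cut into equal pieces (the last one may be shorter), and a downloader can collect them in any order before reassembling the file. The piece size and the helper name `split_into_pieces` are illustrative assumptions; real systems use much larger pieces.

```python
import random

PIECE_SIZE = 4  # bytes; tiny on purpose (assumption for illustration)


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut data into consecutive pieces of piece_size bytes each."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


data = b"an example file that is shared"
pieces = split_into_pieces(data)

# Pieces may arrive in any order, possibly from different peers.
arrival_order = list(range(len(pieces)))
random.shuffle(arrival_order)

received = {}
for index in arrival_order:
    received[index] = pieces[index]

# The file is complete once every piece index is present.
is_complete = len(received) == len(pieces)
reassembled = b"".join(received[i] for i in range(len(pieces)))
```

Because each piece is identified by its index, a client that holds only some of the pieces can already serve those to others.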
Trackers and torrent files

How do clients find peers to connect to?
  Clients connect to a special tracker node
  Tracker responds with the IP+port of a few other peers who are downloading the same file
  Modern BitTorrent clients can also operate without a tracker and use a DHT instead (more about this later)
How do clients find the tracker?
  Clients begin by downloading a 'torrent file' (e.g., from a web server), which has the URL of the tracker
  Torrent file also contains a SHA1 hash of each file block
    Why is this needed?

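One plausible reason for the per-block hashes is that blocks come from untrusted peers, so a client needs a way to detect corrupted or forged data before accepting it. The sketch below shows that check using Python's standard `hashlib`. The helper names `piece_hashes` and `verify` and the sample blocks are hypothetical, and this is not the actual torrent file format.

```python
import hashlib


def piece_hashes(pieces):
    """SHA-1 digest of each block, as a publisher might list them."""
    return [hashlib.sha1(p).digest() for p in pieces]


def verify(index, block, expected_hashes):
    """Accept a downloaded block only if its digest matches the published one."""
    return hashlib.sha1(block).digest() == expected_hashes[index]


pieces = [b"block-0 data", b"block-1 data", b"block-2 data"]
expected = piece_hashes(pieces)  # assumed to come from the torrent file

good = verify(1, b"block-1 data", expected)  # intact block
bad = verify(1, b"block-1 dat4", expected)   # altered in transit or forged

print(good, bad)
```

Hashing each block separately, rather than the whole file, lets a client reject a single bad block and request it again from another peer.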
BitTorrent

Simplified BitTorrent session:
1. Download the 'torrent file'
2. Connect to the tracker and get a list of peers
3. Connect to the peers - initially as a 'leecher'
4. While file is not yet fully downloaded:
     Advertise to peers which blocks are available locally
     Request blocks from peers
     Compare hash of downloaded blocks to hash in torrent file (why?)
5. Turn into a 'seeder', i.e., continue uploading to peers without downloading

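The five steps of the simplified session can be sketched as a small in-memory simulation. Everything here is a stand-in: the `Peer` class, the `download` function, and the dictionaries used for the torrent and the tracker are hypothetical, and no real networking or real BitTorrent message format is involved.

```python
import hashlib


class Peer:
    def __init__(self, name, pieces=None):
        self.name = name
        self.pieces = dict(pieces or {})  # block index -> bytes

    def advertise(self):
        """Step 4: report which blocks are available locally."""
        return set(self.pieces)

    def serve(self, index):
        return self.pieces[index]


def download(torrent, tracker, me):
    # Step 2: ask the tracker for peers sharing this torrent.
    peers = [p for p in tracker[torrent["name"]] if p is not me]
    # Step 3: join the swarm, initially as a leecher.
    tracker[torrent["name"]].append(me)
    wanted = set(range(len(torrent["hashes"])))
    # Step 4: loop until every block is present locally.
    while wanted - me.advertise():
        progress = False
        for peer in peers:
            missing = wanted - me.advertise()
            for index in sorted(missing & peer.advertise()):
                block = peer.serve(index)
                # Keep the block only if it matches the published hash.
                if hashlib.sha1(block).digest() == torrent["hashes"][index]:
                    me.pieces[index] = block
                    progress = True
        if not progress:
            raise RuntimeError("no peer can supply the missing blocks")
    # Step 5: 'me' stays registered and can now serve every block (seeder).
    return b"".join(me.pieces[i] for i in sorted(me.pieces))


blocks = [b"hello ", b"swarm ", b"world"]
# Step 1: the 'torrent file', reduced here to a name and per-block hashes.
torrent = {"name": "demo",
           "hashes": [hashlib.sha1(b).digest() for b in blocks]}
seeder = Peer("seeder", dict(enumerate(blocks)))
partial = Peer("partial", {2: blocks[2]})
tracker = {"demo": [seeder, partial]}

me = Peer("me")
result = download(torrent, tracker, me)
```

In this sketch the new peer ends up in the tracker's list with all blocks, so a later call to `download` by another peer could be served by it.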
Incentives

Many users would rather not upload content
  Some users pay per byte (e.g., cellular networks)
  Uploading may take bandwidth from other applications
  Upload traffic may introduce jitter or queueing delay (VoIP!)
Danger: Tragedy of the commons
  Everyone wants to download, but nobody uploads
Idea: Provide an incentive for uploading
  Many possible incentives (name a few!)
  BitTorrent's approach is based on reciprocity

Tit for tat

Idea: Upload to the peers that give us the best download rate
  Result: Everyone has an incentive to upload
Instance of an old, successful idea
  Goes back to Axelrod's tournament (iterated prisoner's dilemma)
  Attempts to achieve Pareto optimality
How this is used in BitTorrent:
  At any given time, peer uploads to a fixed # of other peers
  Peers are chosen based on current download rate
  All other peers are 'choked' (no uploads)
  Additionally, one peer is optimistically 'unchoked' (why?)

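The selection rule described above can be sketched as follows. This is a simplified illustration, not the actual client algorithm: the function name `choose_unchoked`, the choice of three upload slots, and the sample rates are assumptions. One commonly cited reason for the optimistic unchoke is that it gives a peer with nothing to offer yet (for example, a newcomer) a chance to receive data, and lets the uploader discover peers that might turn out to be better partners.

```python
import random


def choose_unchoked(download_rates, slots=3, rng=random):
    """Pick the peers to upload to in this round.

    download_rates maps each peer to the rate at which that peer is
    currently uploading to us. Returns (regular, optimistic).
    """
    # Reciprocity: rank peers by how much they are giving us.
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    regular = ranked[:slots]   # fixed number of upload slots (assumed: 3)
    choked = ranked[slots:]    # everyone else gets no uploads
    # Optimistic unchoke: one choked peer is picked at random.
    optimistic = rng.choice(choked) if choked else None
    return regular, optimistic


rates = {"p1": 50, "p2": 400, "p3": 0, "p4": 120, "p5": 0, "p6": 90}
regular, optimistic = choose_unchoked(rates, slots=3)
print("unchoked by rate:", regular)
print("optimistically unchoked:", optimistic)
```

A peer that uploads nothing (rate 0) can only be served through the optimistic slot, which is what gives every peer a reason to upload.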
Other examples

Akamai NetSession
SETI@home
End-system multicast

Recap: Partly centralized systems

Contain a few centralized components
  Example: HDFS namenode, BitTorrent tracker, ...
  However, most of the actual work is done by the peers
Some pros and cons:
  More scalable than centralized systems
  'Organic growth': More peers potentially means more demand, but also more resources
  But: Centralized component is a single point of failure and eventually becomes a bottleneck