RON: Resilient Overlay Networks

David Andersen, Hari Balakrishnan,
Frans Kaashoek, and Robert Morris
MIT Laboratory for Computer Science
http://nms.lcs.mit.edu/ron/
Fault-tolerant networking
[Figure: nodes A, B, C, and D connected through a network]
• Packet switching and route around failures
Internet: network of networks
[Figure: Sites 1 through 5 connected through ISP1, ISP2, and ISP3]
• ISPs peer to forward packets
• ISPs exchange route info using BGP
The Internet is ill suited to mission-critical applications
• Commercial peer architecture
– Performance bottlenecks at peering points
– Ignores many existing alternate paths
– Directly conflicts with robustness
• Internet’s global scale:
– Prevents sophisticated algorithms
– Route selection uses fixed, simple metrics
– Routing isn’t sensitive to path quality
How robust is Internet routing?
Paxson 95-97
• 3.3% of all routes had serious problems
Labovitz 97-00
• 10% of routes available < 95% of the time
• 65% of routes available < 99.9% of the time
• 3-min minimum detection + recovery time; often 15 mins
• 40% of outages took 30+ mins to repair
Chandra 01
• 5% of faults last more than 2.75 hours
Our goal
To improve communication availability for small
groups by at least a factor of 10
• Many applications
– Collaboration and conferencing
– Virtual Private Networks (VPNs) across public Internet
– Overlay Internet Service
Overlay routes around Internet failures
[Figure: network diagram; node labels: Utah, Utah, Company, MIT, Cable Modem]
• Failures:
– Outages: Configuration/operational errors, backhoes, etc.
– Performance failures: Severe congestion, denial-of-service attacks, etc.
Scalability versus recovery
• Internet scalability pays a price:
– Slow recovery
• RON recovers fast by
– Limiting size of overlay
– Exploiting redundancy in underlying Internet
Redundant links
• Multiple paths between all sites
[Figure: network diagram; labels: Utah, Utah, Company, Internet 2, MIT, Cable Modem]
Redundant links
• But many of them are hidden
[Figure: network diagram; labels: Utah, Utah, Company, MIT, Cable Modem]
Resilient overlay networks
• Measure all links between nodes
• Compute path properties
• Determine best route
• Forward traffic over that path
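The four steps above can be sketched in a few lines of Python. This is only an illustration, not the RON implementation: the function name `best_one_hop_path`, the latency-only metric, and the sample measurements are all assumptions made for the example. It compares the direct path against every path through a single intermediate node and picks the cheapest.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def best_one_hop_path(src: str, dst: str, nodes: List[str],
                      latency_ms: Dict[Link, float]) -> Tuple[List[str], float]:
    """Pick the lowest-latency path from src to dst with at most one intermediate node.

    A link missing from latency_ms is treated as unusable (infinite cost).
    """
    inf = float("inf")
    best_path = [src, dst]
    best_cost = latency_ms.get((src, dst), inf)
    for mid in nodes:
        if mid in (src, dst):
            continue
        cost = latency_ms.get((src, mid), inf) + latency_ms.get((mid, dst), inf)
        if cost < best_cost:
            best_path, best_cost = [src, mid, dst], cost
    return best_path, best_cost


# Hypothetical measurements: the direct mit -> cable link has failed (no entry),
# but two detours through other overlay nodes are available.
measured = {
    ("mit", "utah"): 40.0,
    ("utah", "cable"): 35.0,
    ("mit", "company"): 60.0,
    ("company", "cable"): 50.0,
}
path, cost = best_one_hop_path("mit", "cable",
                               ["mit", "utah", "company", "cable"], measured)
print(path, cost)
```

Because the overlay is small, trying every intermediate node is cheap, which is part of why limiting the overlay size makes fast recovery practical.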
RON design
[Figure: nodes in different routing domains (ASes), each running the RON library; components shown: Conduit, Forwarder, Prober, Router, Performance Database, application-specific routing tables, policy routing module]
Routing and path selection
• Path selection at the entry node
– Specialized for routing through one intermediate node
• Router computes the forwarding tables
– Link-state dissemination through RON
• Path evaluation and selection
– Latency minimizer: EWMA of round-trip samples
– Loss-rate minimizer: average of the last k samples
– Throughput optimizer: TCP throughput equation
• Select when estimated throughput improves by 2x
• 5% hysteresis to avoid flapping
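The path metrics listed above can be sketched as small estimators. This is a hedged illustration, not RON's code: the class and function names, the smoothing factor `alpha`, the window size `k`, the loss floor, and the exact form of the throughput score (a commonly used simplified TCP throughput model, sqrt(1.5) / (rtt * sqrt(p))) are assumptions for the example. The hysteresis check shows one possible reading of the 5% rule for a lower-is-better metric.

```python
import math
from collections import deque


class LatencyEstimator:
    """Exponentially weighted moving average (EWMA) of round-trip samples."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha  # weight on the previous estimate (assumed value)
        self.value = None

    def update(self, rtt_ms: float) -> float:
        if self.value is None:
            self.value = rtt_ms
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * rtt_ms
        return self.value


class LossEstimator:
    """Average loss over the last k probe outcomes."""

    def __init__(self, k: int = 100):
        self.samples = deque(maxlen=k)  # old samples fall off automatically

    def update(self, lost: bool) -> float:
        self.samples.append(1.0 if lost else 0.0)
        return sum(self.samples) / len(self.samples)


def tcp_throughput_score(rtt_s: float, loss_rate: float) -> float:
    """Simplified TCP throughput model; higher is better."""
    p = max(loss_rate, 1e-4)  # floor (assumed) to avoid dividing by zero
    return math.sqrt(1.5) / (rtt_s * math.sqrt(p))


def should_switch(current_cost: float, candidate_cost: float,
                  hysteresis: float = 0.05) -> bool:
    """Switch only if the candidate beats the current path by more than the margin."""
    return candidate_cost < current_cost * (1 - hysteresis)
```

The hysteresis margin keeps two paths with nearly equal estimates from trading places on every new sample.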
Policy routing
• Router computes a forwarding table for each
policy
• Two ways of describing policies:
– Exclusive cliques (e.g., educational only)
– General policies
• BPF-like packet matcher, which returns a policy
• Links that are denied by a policy
• Entry node classifies packet with a policy tag
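A minimal sketch of the classification step, assuming a much simpler matcher than a BPF-like one: each rule is a predicate on a packet plus the policy tag it returns, and the router keeps one forwarding table per tag. The `Packet` fields, host names, tag names, and table contents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Packet:
    src: str
    dst: str


Rule = Tuple[Callable[[Packet], bool], str]

# Hypothetical rule: traffic between educational sites gets the "edu-only" policy.
rules: List[Rule] = [
    (lambda p: p.src.endswith(".edu") and p.dst.endswith(".edu"), "edu-only"),
]

# One forwarding table per policy: destination -> next overlay hop (made-up entries).
forwarding_tables: Dict[str, Dict[str, str]] = {
    "edu-only": {"utah.edu": "mit.edu"},
    "default": {"utah.edu": "company.example"},
}


def classify(packet: Packet, rule_list: List[Rule], default: str = "default") -> str:
    """Return the policy tag of the first matching rule, or the default tag."""
    for matches, tag in rule_list:
        if matches(packet):
            return tag
    return default


def next_hop(packet: Packet) -> Tuple[str, str]:
    """Tag the packet at the entry node, then look up the table for that policy."""
    tag = classify(packet, rules)
    return tag, forwarding_tables[tag][packet.dst]


print(next_hop(Packet("nyu.edu", "utah.edu")))
print(next_hop(Packet("cable.example", "utah.edu")))
```

Tagging once at the entry node means later hops only need the tag to pick the right table and do not have to re-run the matcher.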
Responding to failure
• Probe interval: 12 seconds
• Probe timeout: 3 seconds
• Routing update interval: 14 seconds
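The timers above could drive a probe tracker along these lines. This is a sketch under assumptions: the slides give only the interval and timeout values, so the `ProbeTracker` class, its method names, and the rule "one unanswered probe marks a peer as suspect" are illustrative, not the RON prober's actual logic. Times are passed in explicitly so the example is deterministic.

```python
PROBE_INTERVAL_S = 12.0  # from the slide
PROBE_TIMEOUT_S = 3.0    # from the slide


class ProbeTracker:
    """Track outstanding probes and flag peers whose probe timed out."""

    def __init__(self) -> None:
        self.outstanding = {}  # peer -> time the probe was sent
        self.suspect = set()   # peers with an unanswered probe

    def sent(self, peer: str, now: float) -> None:
        self.outstanding[peer] = now

    def replied(self, peer: str) -> None:
        self.outstanding.pop(peer, None)
        self.suspect.discard(peer)

    def check_timeouts(self, now: float) -> set:
        """Move peers whose probe is older than the timeout into the suspect set."""
        for peer, sent_at in list(self.outstanding.items()):
            if now - sent_at >= PROBE_TIMEOUT_S:
                self.suspect.add(peer)
                del self.outstanding[peer]
        return set(self.suspect)


def next_probe_time(last_probe: float) -> float:
    """Time at which the next regular probe to a peer is due."""
    return last_probe + PROBE_INTERVAL_S


tracker = ProbeTracker()
tracker.sent("utah", now=0.0)
tracker.sent("mit", now=0.0)
tracker.replied("mit")
print(tracker.check_timeouts(now=2.0))  # nothing has timed out yet
print(tracker.check_timeouts(now=3.0))  # utah's probe is now overdue
```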
RON overhead
10 nodes    20 nodes    30 nodes    40 nodes    50 nodes
1.8 Kbps    5.9 Kbps    12 Kbps     21 Kbps     32 Kbps
• Probe overhead: 69 bytes
• RON routing overhead: 60 + 20 (N-1) bytes
• 50 nodes: allows recovery times between 12 and 25 s
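As a worked example of the routing-overhead formula, the sketch below evaluates 60 + 20 (N-1) for a few overlay sizes. Two things are assumptions here: that the formula is in bytes (like the probe size), and that each node sends one routing update to each of its N-1 peers every 14 seconds. The rate estimate counts only routing updates under that assumption, so it is not expected to reproduce the table above exactly.

```python
def routing_packet_bytes(n: int) -> int:
    """Size of one routing update for an n-node overlay: 60 + 20 (N-1)."""
    return 60 + 20 * (n - 1)


def routing_rate_bps(n: int, update_interval_s: float = 14.0) -> float:
    """Outgoing routing traffic per node, assuming one update per peer per interval."""
    return (n - 1) * routing_packet_bytes(n) * 8 / update_interval_s


for n in (10, 20, 30, 40, 50):
    print(n, routing_packet_bytes(n), round(routing_rate_bps(n) / 1000, 1))
```

Both the packet size and the number of peers grow linearly with N, so per-node routing traffic grows roughly quadratically, which is one reason each RON is kept small.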
Many research questions
• Does the RON approach work at all?
• Each RON is small in size, no more than 50 or 100
nodes
– How fast can failure detection & recovery happen?
• Policy routing
– Doesn’t RON violate AUPs and other policies?
• Routing behavior
– Can stable routing be achieved?
– Implementing efficient multi-criteria routing
• Is it safe to deploy a large number of (small)
interacting RONs on the Internet?
IP forwarder
• A RON application
• Transparently forwards IP traffic over RON
• Allows comparisons of IP traffic over RON versus over direct Internet
RON deployment (19 sites)
[Figure: map of RON sites; labels: Lulea.se, OR-DSL, CMU, MIT, MA-Cable, Cisco, Cornell, CA-T1, CCI, Aros, Utah, NYU; links to vu.nl, lulea.se, ucl.uk, kaist.kr, .ve]
.com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx)
cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit,
vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela
AS view
Experiments
• Measure loss, latency, and throughput with and
without RON
• RON1: 12 hosts in the US and Europe
– 64 hours of measurements in March 2001
• RON2: 16 hosts
– 85 hours of measurements in May 2001
• 30-minute average loss rates
– A 30 minute outage is very serious!
• Note: Experiments done with “No-Internet2-for-commercial-use” policy
Take home messages
1. RON reduced outages by a factor of 5 to 10,
and routed around all major outages
2. RON takes 18s (average) to route around a
failure, and can do so in the face of
flooding attacks
3. Single route indirection delivers the
majority of RON's benefits
RON improves loss-rate
[Figure: scatter plot of 30-min average loss rate with RON versus 30-min average loss rate on Internet, both axes from 0 to 1; 13,000 samples]
• RON loss rate never more than 30%
An order-of-magnitude fewer failures
30-minute average loss rates

Loss Rate   RON Better   No Change   RON Worse
10%         526 [517]    58 [51]     47 [45]
20%         142 [140]    4 [3]       15 [15]
30%         32 [32]      0           0
50%         20 [20]      0           0
80%         14 [14]      0           0
100%        10           0           0
6,825 “path hours” represented here
12 “path hours” of essentially complete outage
72 “path hours” of TCP outage
RON routed around all of these!
One indirection hop provides almost all the benefit!
Why does one hop work?
P(good path) = (1 - (1-p)^2)^(R+1)
[Figure: source connected to target directly and through each of R RON nodes]
In RON testbed:
– P(direct path is good) is 48.8%
– P(intermediate path is good) is 51%
Resilience Against DoS Attacks
Latency using RON
What’s next for RON?
• Data mining of collected samples
• Applications
• Routing policies (e.g., rate control)
Other progress: Chord
• Chord: a peer-to-peer lookup system
• CFS: a peer-to-peer file sharing application
www.pdos.lcs.mit.edu/chord
Conclusion
• Improved availability of Internet communication
paths using small overlays
– Layered above scalable IP substrate
– RON provides a set of libraries and programs to
facilitate this application-specific routing
• Experimental data suggest that the approach works
– Over 10X availability
– Outage detection and recovery in about 15 seconds
– Able to route around certain denial-of-service attacks
• Many interesting questions remain…
http://nms.lcs.mit.edu/ron/