Robust Internet Routing - Networks and Mobile Systems

Download Report

Transcript Robust Internet Routing - Networks and Mobile Systems

Interdomain Routing
Correctness and Stability
Nick Feamster
Is correctness really that important?
2
Is correctness really that important?
• The Internet is increasingly becoming part of the
mission-critical Infrastructure (a public utility!).
Big problem: Very poor understanding
of how to manage it.
3
Why does routing go wrong?
• Complex policies
– Competing / cooperating networks
– Each with only limited visibility
• Large scale
– Tens of thousands networks
– …each with hundreds of routers
– …each routing to hundreds of
thousands of IP prefixes
4
What can go wrong?
Some things are out of the hands of networking research
But…
Two-thirds of the problems are caused by
configuration of the routing protocol
5
Review: Simple operation…
Autonomous Systems (ASes)
Route Advertisement
MIT
Destination
18.0.0.0/8
18.0.0.0/8
Next-hop
AS Path
192.5.89.89
1 3356 3
66.250.252.44 174 3
Session
6
…but complex configuration!
Flexibility for realizing goals in
complex business landscape
• Which neighboring
networks can send traffic
Traffic
• Where traffic enters and
leaves the network
• How routers within the
network learn routes to
external destinations
Flexibility
Route
No Route
Complexity
7
Configuration Semantics
Filtering: route advertisement
Customer
Competitor
Ranking: route selection
Primary
Backup
Dissemination: internal route advertisement
8
What types of problems does
configuration cause?
•
•
•
•
•
•
Persistent oscillation (today’s reading)
Forwarding loops
Partitions
“Blackholes”
Route instability
…
9
These problems are real
“…a glitch at a small ISP… triggered a major outage in
Internet access across the country. The problem started
when MAI Network Services...passed bad router
information from one of its customers onto Sprint.”
-- news.com, April 25, 1997
UUNet
Sprint
Florida Internet
Barn
10
These problems are real
“…a glitch at a small ISP… triggered a major outage in
Internet access across the country. The problem started
when MAI Network Services...passed bad router
information
one of
its customers
onto
Sprint.”
“Microsoft's from
websites
were
offline for up
to 23
-- news.com,
25, 1997
hours...because
of a [router]April
misconfiguration…it
took
nearly a day to determine what was wrong and undo the
“WorldCom
a widespread
changes.” Inc…suffered
-- wired.com,
January 25,outage
2001 on its
Internet backbone that affected roughly 20 percent of its
U.S. customer base. The network problems…affected
millions of computer users worldwide. A spokeswoman
attributed
outagecustomers
to "a route went
tableout
issue."
"A numberthe
of Covad
from 5pm today
-- cnn.com,
October
3, 2002denial of service
due to, supposedly,
a DDOS
(distributed
attack) on a key Level3 data center, which later was
described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004
11
# Threads over Stated Period
Several “Big” Problems a Week
90
80
70
60
50
40
30
20
10
0
Filtering
Route
Leaks
Route
Hijacks
1994-1997
Route
Routing Blackholes
Instability Loops
1998-2001
2001-2004
12
Why is routing hard to get right?
• Defining correctness is hard
• Interactions cause unintended consequences
– Each network independently configured
– Unintended policy interactions
• Operators make mistakes
– Configuration is difficult
– Complex policies, distributed configuration
13
Correctness Specification
Safety
The protocol converges to a stable
The assignment
protocol does
not oscillate
path
for every
possible
initial state and message ordering
14
Safety: No Persistent Oscillation
130
10
1
0
210
20
2
3
320
30
Varadhan, Govindan, & Estrin, “Persistent Route Oscillations in Interdomain Routing”, 1996
15
Strawman: Global Policy Check
• Require each AS to publish its policies
• Detect and resolve conflicts
Problems:
• ASes typically unwilling to reveal policies
• Checking for convergence is NP-complete
• Failures may still cause oscillations
16
Think Globally, Act Locally
• Key features of a good solution
–
–
–
–
Safety: guaranteed convergence
Expressiveness: allow diverse policies for each AS
Autonomy: do not require revelation/coordination
Backwards-compatibility: no changes to BGP
• Local restrictions on configuration semantics
– Ranking
– Filtering
17
Main Idea of Today’s Paper
• Permit only two business arrangements
– Customer-provider
– Peering
• Constrain both filtering and ranking based on
these arrangements to guarantee safety
• Surprising result: these arrangements
correspond to today’s (common) behavior
Gao & Rexford, “Stable Internet Routing without Global Coordination”, IEEE/ACM ToN, 2001 18
Relationship #1: Customer-Provider
Filtering
– Routes from customer: to everyone
– Routes from provider: only to customers
From other destinations
To the customer
providers
From the customer
To other destinations
providers
advertisements
traffic
customer
customer
19
Relationship #2: Peering
Filtering
– Routes from peer: only to customers
– No routes from other peers or providers
advertisements
peer
peer
traffic
customer
customer
20
Rankings
• Routes from customers over routes from peers
• Routes from peers over routes from providers
provider
peer
customer
21
Additional Assumption: Hierarchy
Disallowed!
22
Safety: Proof Sketch
• System state: the current route at each AS
• Activation sequence: revisit some router’s
selection based on those of neighboring ASes
23
Activation Sequence: Intuition
• Activation: emulates a message ordering
– Activated router has received and processed all
messages corresponding to the system state
• “Fair” activation: all routers receive and
process outstanding messages
24
Safety: Proof Sketch
• State: the current route at each AS
• Activation sequence: revisit some router’s
selection based on those of neighboring ASes
• Goal: find an activation sequence that leads to a
stable state
• Safety: satisfied if that activation sequence is
contained within any “fair” activation sequence
25
Proof, Step 1: Customer Routes
• Activate ASes from customer to provider
– AS picks a customer route if one exists
– Decision of one AS cannot cause an earlier AS to
change its mind
An AS picks a customer
route when one exists
26
Proof, Step 2: Peer & Provider Routes
• Activate remaining ASes from provider to customer
– Decision of one Step-2 AS cannot cause an earlier Step2 AS to change its mind
– Decision of Step-2 AS cannot affect a Step-1 AS
AS picks a peer or provider
route when no customer
route is available
27
Ranking and Filtering Interactions
• Allowing more flexibility in ranking
– Allow same preference for peer and customer routes
– Never choose a peer route over a shorter customer route
• … at the expense of stricter AS graph assumptions
– Hierarchical provider-customer relationship (as before)
– No private peering with (direct or indirect) providers
Peering
28
Some problems
• Requires acyclic hierarchy (global condition)
• Cannot express many business relationships
Sprint
Abovenet
Verio
Customer
PSINet
Question: Can we relax the constraints on filtering? What
happens to rankings?
29
Other Possible Local Rankings
Accept only next-hop rankings
– Captures most routing policies
– Generalizes customer/peer/provider
– Problem: system not safe
1
2
1*, 3*, 0*
3*,2*,0*
3
2*,1*,0*
Accept only shortest hop count rankings
– Guarantees safety under filtering
– Problem: not expressive
Feamster, Johari, & Balakrishnan, “Implications of Autonomy for the Expressiveness of Policy
30
Routing”, SIGCOMM 2005
What Rankings Violate Safety?
Theorem.
Permitting paths of length n+2 over paths of
length n will violate safety under filtering.
Theorem.
Permitting paths of length n+1 over paths of
length n will result in a dispute wheel.
Feamster, Johari, & Balakrishnan, “Implications of Autonomy for the Expressiveness of Policy
31
Routing”, SIGCOMM 2005
What about properties of resulting paths,
after the protocol has converged?
We need additional correctness properties.
32
Correctness Specification
Safety
The protocol converges to a stable
path
for every
possible
The assignment
protocol does
not oscillate
initial state and message ordering
Path Visibility
EveryIfdestination
with
usable
there exists
aa
path,
paththen
has athere
routeexists
advertisement
a route
Example violation: Network partition
Route Validity
EveryIfroute
thereadvertisement
exists a route,
corresponds
to aexists
usablea path
then there
path
Example violation: Routing loop
33
Path Visibility: Internal BGP (iBGP)
Default: “Full mesh” iBGP.
Doesn’t scale.
“iBGP”
Large ASes use “Route reflection”
Route reflector:
non-client routes over client sessions;
client routes over all sessions
Client: don’t re-advertise iBGP routes.
34
iBGP Signaling: Static Check
Theorem.
Suppose the iBGP reflector-client relationship graph
contains no cycles. Then, path visibility is satisfied if,
and only if, the set of routers that are not route
reflector clients forms a clique.
Condition is easy to check with static analysis.
35
How do we guarantee these
additional properties in practice?
36
Today: Reactive Operation
What happens if I
tweak this policy…?
Revert
No
Configure
Observe
Desired
Effect?
Yes
Wait for
Next Problem
• Problems cause downtime
• Problems often not immediately apparent
37
Goal: Proactive Operation
• Idea: Analyze configuration before deployment
rcc
Configure
Detect
Faults
Deploy
Many faults can be detected with static analysis.
Feamster & Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, NSDI 200538
rcc Overview
Distributed router
configurations
(Single AS)
Correctness
Specification
Constraints
“rcc”Normalized
Faults
Representation
Challenges
• Analyzing complex, distributed configuration
• Defining a correctness specification
• Mapping specification to constraints
Feamster & Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, NSDI 200539
rcc Implementation
Preprocessor
Distributed router
configurations
(Cisco, Avici, Juniper,
Procket, etc.)
Parser
Constraints
Relational
Database
(mySQL)
Verifier
Faults
Feamster & Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, NSDI 200540
Summary: Faults across 17 ASes
Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration
10
Path Visibility
8
6
4
Incomplete
Filter
Undefined
Filter
Transit
Between
Peers
Inconsistent
Import
Inconsistent
Export
Incomplete
iBGP
Session
0
Duplicate
Loopback
2
iBGP
Signaling
Partition
Number of ASes
Route Validity
41
rcc: Take-home lessons
• Static configuration analysis uncovers many errors
• Major causes of error:
– Distributed configuration
– Intra-AS dissemination is too complex
– Mechanistic expression of policy
http://nms.csail.mit.edu/rcc/
About 100 downloads (70 network operators)
42
Two Philosophies
• This lecture: Accept the Internet as is. Devise
“band-aids”.
• Another direction: Redesign Internet routing to
guarantee safety, route validity, and path visibility
43
Preventing Errors in the First Place
Before: conventional iBGP
eBGP
iBGP
After: RCP gets “best” iBGP routes (and IGP topology)
RCP
iBGP
Feamster et al., “The Case for Separating Routing from Routers”, SIGCOMM FDNA, 2004
Caesar et al., “Design and Implementation of a Routing Control Platform”, NSDI, 2005
44
45
Configuration Syntax (Example)
router bgp 7018
neighbor 192.0.2.10 remote-as 65000
neighbor 192.0.2.10 route-map IMPORT in
neighbor 192.0.2.20 remote-as 7018
Dissemination
neighbor 192.0.2.20 route-reflector-client
!
route-map IMPORT permit 1
match ip address 199
Ranking
set local-preference 80
!
route-map IMPORT permit 2
match as-path 99
set local-preference 110
!
route-map IMPORT permit 3
set community 7018:1000
!
ip as-path access-list 99 permit ^65000$
access-list 199 permit ip host 192.0.2.0 host 255.255.255.0
46
access-list 199 permit ip host 10.0.0.0 host 255.0.0.0
Why is Routing Hard to Get Right?
• Defining correctness is hard
• Operators make mistakes
– Configuration is difficult
– Complex policies, distributed configuration
• Interactions cause unintended consequences
– Each network independently configured
– Unintended policy interactions
47
Which faults does rcc detect?
Faults found by rcc
Latent faults
Potentially active faults
End-to-end failures
48
Normalizing Router Configuration
Challenge:
• Hundreds of routers distributed across an AS
• Thousands to tens of thousands of lines per router
• Many ways to express identical policy
Solution:
• Express configuration with centralized tables
• Check constraints by issuing queries on tables
Filters
Sessions
Rankings
49
Route Validity: Consistent Export
Possible
NeighborCauses
AS
Export
Export
• Malice/deception
10.1.2.3 456
1
1
• iBGP
signaling
10.4.5.6
456
2 partition
2
• Inconsistent export policy
Clause Prepend
1
123
1
123 123
Policy normalization makes comparison easy.
neighbor 10.1.2.3
route-map PEER permit 10
set prepend 123
neighbor 10.4.5.6
route-map PEER permit 10
set prepend 123 123
50
Percentage of destinations with
inconsistent routes
Inconsistent Export Observed at AT&T
15% of destinations
inconsistent for >4 days
Percentage of time
Feamster et al., “BorderGuard: Detecting Cold Potatoes from Peers”. ACM IMC, October 2004. 51
Example: “Bogon” Routes
52
Feamster et al., “An Empirical Study of ‘Bogon’ Route Advertisements”. ACM CCR, January 2005.
rcc Interface
53
Parsing Configuration
54
List of Faults
55