Online Testing of BGP

Networked Systems Laboratory
Online Testing of BGP
Marco Canini
EPFL, Switzerland
Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar,
Dejan Novaković, Boris Spasojević, Olivier Crameri,
and Dejan Kostić
Work supported by the European Research Council
4/5/2011
Marco Canini, RIPE 62
1
Is it hard to crash the Internet?
• Software bugs in inter-domain routers
[Figure: routers of type A and type B exchange a protocol-compliant but confusing message ("0-length AS4_PATH attribute!"); outcome: "Reset session!"]
At 17:07:26 UTC on August 19, 2009 CNCI (AS9354),
a small network service provider in Nagoya, Japan,
advertised a handful of BGP updates containing an
empty AS4_PATH attribute. [renesys blog]
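To make the incident concrete, here is a minimal Python sketch of how a single BGP path attribute is laid out on the wire (flags, type, length, value) and what an empty AS4_PATH looks like. The type code 17 and the flag bits follow the BGP specifications; the helper names `encode_attr` and `decode_attr` are illustrative only and are not taken from any router implementation.

```python
import struct

AS4_PATH = 17            # attribute type code for AS4_PATH
FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40
FLAG_EXT_LEN = 0x10      # when set, the length field is 2 bytes instead of 1


def encode_attr(flags: int, type_code: int, value: bytes) -> bytes:
    """Encode one path attribute as flags | type | length | value."""
    if len(value) > 255:
        flags |= FLAG_EXT_LEN
        return struct.pack("!BBH", flags, type_code, len(value)) + value
    flags &= ~FLAG_EXT_LEN
    return struct.pack("!BBB", flags, type_code, len(value)) + value


def decode_attr(data: bytes):
    """Decode one attribute; return (flags, type_code, value, remaining bytes)."""
    flags, type_code = data[0], data[1]
    if flags & FLAG_EXT_LEN:
        (length,) = struct.unpack("!H", data[2:4])
        start = 4
    else:
        length, start = data[2], 3
    value = data[start:start + length]
    if len(value) != length:
        raise ValueError("truncated attribute")
    return flags, type_code, value, data[start + length:]


# An AS4_PATH attribute whose value is zero bytes long
empty_as4_path = encode_attr(FLAG_OPTIONAL | FLAG_TRANSITIVE, AS4_PATH, b"")
flags, type_code, value, rest = decode_attr(empty_as4_path)
print(empty_as4_path.hex(), type_code, len(value))   # c01100 17 0
```

The three bytes decode cleanly at this level. Whether a zero-length value is then accepted, ignored, or treated as an error is a decision left to the receiving implementation, which is where the routers in this incident differed.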
Is it hard to crash the Internet?
• What went wrong
[Figure: topology of affected and unaffected routers; parts of the network are marked "Unreachable!"]
Repeated service disruptions: routing instabilities!
BGP not always reliable
• Distributed system behavior
– Aggregate result of interleaved actions of multiple routers
– Federated, heterogeneous and failure-prone environment
• Difficult to reason about all corner cases or combinations of configurations
– Unanticipated interactions, subtle differences in inter-operable implementations, system-wide conflicts, seemingly valid local fault handling
Agenda
• Our system for online testing
– Disclaimer: this is still research work!
– Not going to be an immediate solution
– Hope it will be a tool for this community
• Solicit feedback
– Which faults would you look for?
– What would convince you to deploy our system?
• … discussion
DiCE comes to the rescue
• Key idea: automatically explore system behavior
to detect potential faults
1. Create an isolated snapshot of a BGP neighborhood
2. Subject a router’s BGP process to many inputs that
systematically exercise router actions
3. For each input, check if the snapshot misbehaves
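The three steps can be sketched as a small loop. This is a toy model under stated assumptions, not DiCE or BIRD code: `ToyRouter`, its deliberately planted bug, `explore`, and `no_session_resets` are all hypothetical names, and a deep copy stands in for the isolated snapshot.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Hypothetical stand-in for a BGP process; not BIRD or DiCE code."""
    name: str
    routes: dict = field(default_factory=dict)   # prefix -> AS path (list)
    session_resets: int = 0

    def handle_update(self, prefix: str, as_path: list) -> None:
        if not as_path:                 # planted toy bug: empty path resets the session
            self.session_resets += 1
            return
        self.routes[prefix] = as_path


def explore(production: list, inputs: list, check) -> list:
    """Run every input against a fresh snapshot; return the inputs that fail the check."""
    failing = []
    for prefix, as_path in inputs:
        snapshot = copy.deepcopy(production)      # step 1: isolated snapshot
        for router in snapshot:                   # step 2: exercise the input
            router.handle_update(prefix, as_path)
        if not check(snapshot):                   # step 3: check the snapshot
            failing.append((prefix, as_path))
    return failing


def no_session_resets(snapshot: list) -> bool:
    return all(r.session_resets == 0 for r in snapshot)


production = [ToyRouter("r1"), ToyRouter("r2")]
inputs = [("10.0.0.0/8", [64500]), ("10.1.0.0/16", [])]
failing = explore(production, inputs, no_session_resets)
print(failing)                                   # [('10.1.0.0/16', [])]
print([r.session_resets for r in production])    # [0, 0]
```

The second print is the point of the snapshot: the production routers are never touched, even by the input that triggers the fault.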
[Figure: DiCE, the BGP process, and its BGP neighbors]
Error in the snapshot → Evidence of possible future behavior of production system
BGP snapshot
• Isolate testing from production environment
[Figure labels: special IP prefix; custom attribute; local checkpoint of current state; cloned BGP process; BGP process and configuration; FIB; sockets; BGP peers; checkpoints]
BGP’s federated environment → Each router keeps its local checkpoint → Private state & config stays in the AS
ASes collaborate to detect potential faults
Exploration of behavior
Use a path exploration engine
Concolic (CONCrete + symbOLIC)
execution systematically
exercises code paths
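A toy illustration of the idea, under heavy simplification: the handler is run on a concrete input while its branch decisions are recorded, then each decision is negated in turn and a new input is searched for that follows the flipped path. In a real concolic engine the recording comes from code instrumentation and the search from a constraint solver; here the predicates are recorded by hand and a brute-force search over a tiny domain stands in for the solver. All names (`handler`, `solve`, `explore`) are hypothetical.

```python
from itertools import product

DOMAIN = range(0, 4)    # tiny input domain so brute force can stand in for a solver


def handler(length, flags, trace):
    """Toy update handler. Each branch appends (predicate, outcome) to trace."""
    def branch(pred):
        outcome = pred(length, flags)
        trace.append((pred, outcome))
        return outcome

    if branch(lambda l, f: l == 0):
        if branch(lambda l, f: f == 3):
            return "reset"          # the corner case we hope to reach
        return "ignore"
    return "apply"


def solve(constraints):
    """Brute-force 'solver': find an input matching every (predicate, outcome) pair."""
    for l, f in product(DOMAIN, DOMAIN):
        if all(pred(l, f) == want for pred, want in constraints):
            return (l, f)
    return None


def explore(seed):
    """Run concretely, then flip recorded branch decisions to reach new paths."""
    worklist, seen_paths, results = [seed], set(), {}
    while worklist:
        inp = worklist.pop()
        trace = []
        result = handler(*inp, trace)
        path = tuple(outcome for _, outcome in trace)
        if path in seen_paths:
            continue
        seen_paths.add(path)
        results[path] = (inp, result)
        for i in range(len(trace)):
            # keep the first i decisions, negate decision i
            flipped = trace[:i] + [(trace[i][0], not trace[i][1])]
            new_inp = solve(flipped)
            if new_inp is not None:
                worklist.append(new_inp)
    return results


results = explore(seed=(1, 0))
print(sorted(r for _, r in results.values()))   # ['apply', 'ignore', 'reset']
```

Starting from an input that only exercises the common "apply" path, the loop reaches the "reset" path without anyone having to guess the input (0, 3) in advance.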
[Figure: clones of the BGP process (numbered 1 to 3), each checked with "Is there an error?"; one outcome is "Error!"]
Driving behavior by inputs
[Figure: input generation. Code & current config and symbolic inputs feed the path exploration engine, which collects path constraints, e.g. in route selection and route ranking ("is most preferred route?")]
Inputs: messages, configuration changes, failures, timeouts, random choices
UPDATE message:
Header
Withdrawn Routes
Path Attributes: Attribute Type | Length | Value
Network Layer Reachability Information: NLRI Length | Prefix
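As a rough sketch of the UPDATE fields that become symbolic inputs, the following Python splits an UPDATE body into withdrawn routes, path attributes, and NLRI. The field layout follows the BGP-4 specification; the function names are hypothetical, the code handles IPv4 prefixes only, and path attributes are left as raw bytes.

```python
import struct


def parse_prefixes(data: bytes) -> list:
    """Parse (length-in-bits, prefix bytes) entries into 'a.b.c.d/len' strings."""
    prefixes, i = [], 0
    while i < len(data):
        bits = data[i]
        nbytes = (bits + 7) // 8              # prefix is padded to whole bytes
        raw = data[i + 1:i + 1 + nbytes]
        if bits > 32 or len(raw) != nbytes:
            raise ValueError("malformed prefix")
        octets = raw + bytes(4 - nbytes)      # zero-fill the host part
        prefixes.append(".".join(str(b) for b in octets) + f"/{bits}")
        i += 1 + nbytes
    return prefixes


def parse_update_body(body: bytes) -> dict:
    """Split an UPDATE body (the part after the 19-byte BGP header) into its fields."""
    (wlen,) = struct.unpack("!H", body[0:2])              # withdrawn routes length
    withdrawn = body[2:2 + wlen]
    (alen,) = struct.unpack("!H", body[2 + wlen:4 + wlen])  # total path attribute length
    attrs = body[4 + wlen:4 + wlen + alen]
    nlri = body[4 + wlen + alen:]
    if len(withdrawn) != wlen or len(attrs) != alen:
        raise ValueError("truncated UPDATE")
    return {
        "withdrawn": parse_prefixes(withdrawn),
        "path_attributes": attrs,             # raw bytes in this sketch
        "nlri": parse_prefixes(nlri),
    }


# Illustrative body: no withdrawn routes, one empty AS4_PATH attribute (type 17),
# NLRI 192.0.2.0/24. A real UPDATE would also carry the mandatory attributes.
attrs = bytes([0xC0, 17, 0])
body = struct.pack("!H", 0) + struct.pack("!H", len(attrs)) + attrs + bytes([24, 192, 0, 2])
parsed = parse_update_body(body)
print(parsed["nlri"], parsed["path_attributes"].hex())   # ['192.0.2.0/24'] c01100
```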
Detecting faults
• Check properties that capture desired behavior
• Example: Harmful Global Events (session resets)
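One way to read the session-reset check on this slide: each router in the snapshot reports a local value f() (1 if it hit a BGP error, 0 otherwise), and the DiCE controller sums the reports and compares the total with a threshold, logging inputs that exceed it. The sketch below assumes exactly that reading; the function names, router names, error counts, and the threshold value are illustrative.

```python
def local_check(bgp_errors: int) -> int:
    """Per-router f(): 1 if the router saw a BGP error during exploration, else 0."""
    return 1 if bgp_errors > 0 else 0


def is_harmful_global_event(reports: dict, threshold: int) -> bool:
    """Controller side: sum the per-router reports and compare with a threshold."""
    return sum(reports.values()) > threshold


# Hypothetical snapshot: BGP errors observed per router for one explored input
observed_errors = {"r1": 0, "r2": 1, "r3": 2, "r4": 0, "r5": 1}
reports = {name: local_check(n) for name, n in observed_errors.items()}

harmful_log = []
explored_input = "UPDATE with 0-length AS4_PATH"
if is_harmful_global_event(reports, threshold=1):   # threshold chosen arbitrarily
    harmful_log.append(explored_input)

print(sum(reports.values()), harmful_log)   # 3 ['UPDATE with 0-length AS4_PATH']
```

Only the 0/1 reports leave each router, so the controller learns how widespread the error is without seeing any router's internal state.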
[Figure: routers in the snapshot each report f(): 1 where a BGP error occurs (affected router), 0 otherwise (unaffected router). The DiCE controller compares the error count (∑ errors, 5 in the figure) against a threshold.]
Log inputs that have harmful global behavior
Valid but ambiguous messages
Other properties
• Policy-induced divergence
• Origin misconfiguration
– Check: routing tables polluted in external ASes?
• Route leaks (hijacks) by customer or provider
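A route-leak check could look roughly like the sketch below. It assumes the common definition of a leak (an AS re-exports a route learned from a provider or peer to another provider or peer) and a hypothetical relationship table `REL`; neither the table nor the function name comes from DiCE. The AS names reuse the letters from this slide's example path.

```python
# Hypothetical business relationships: (a, b) -> what b is to a
REL = {
    ("C", "X"): "provider", ("X", "C"): "customer",
    ("C", "Y"): "provider", ("Y", "C"): "customer",
    ("Y", "Z"): "peer",     ("Z", "Y"): "peer",
}


def leaking_ases(as_path: list, rel: dict) -> list:
    """Return ASes on the path that pass a provider/peer route on to a provider/peer."""
    leaks = []
    up_or_side = ("provider", "peer")
    for i in range(1, len(as_path) - 1):
        exporter = as_path[i]
        exported_to = as_path[i - 1]      # AS_PATH lists the most recent AS first
        learned_from = as_path[i + 1]
        if (rel.get((exporter, learned_from)) in up_or_side
                and rel.get((exporter, exported_to)) in up_or_side):
            leaks.append(exporter)
    return leaks


# Paths observed in the snapshot during exploration (illustrative data)
routes = {"d": ["X", "C", "Y", "Z"], "e": ["Y", "C"]}
can_leak = {p: path for p, path in routes.items() if leaking_ases(path, REL)}
print(can_leak)   # {'d': ['X', 'C', 'Y', 'Z']}
```

Here customer C learns prefix d from provider Y and passes it to provider X, so d ends up on the list of prefixes that can leak.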
[Figure: customer C sends provider P an UPDATE with AS_PATH C for prefix d]
List of prefixes that can leak:
Prefix | AS_PATH
d | X C Y Z
Keeping confidential information
• Potential router behavior
– Common code paths already exposed
– Reverse engineering any easier than today?
• Private state or configuration
– Information hiding through randomization
– Avoid inputs driven by confidential data → cannot leak
• Rate limit, refuse certain explorers
• Anonymous property checks
– Secure multi-party computation → no need for trusted 3rd party
Implementation details
• Integrated DiCE in BIRD 1.1.7
– Open source router, coded in C
• Concolic execution instruments code to track
symbolic inputs
– Instrumentation needed only for testing
– Negligible impact on the production environment
Evaluation
• Multiple BIRD instances on a 48-core machine
• Properties checked
– Harmful global events
– Origin misconfiguration
– Policy conflict
Evaluation topology
[Figure: evaluation topology from [Haeberlen et al., NSDI ’09] + annotations. Nodes: AS 1, AS 2, AS 3, AS 4, AS 5, AS 6, AS 8, AS 9, AS 10, AS 165053, and "Rest of the Internet". Legend: customer-provider link; peering link; backup link; router that resets session due to 0-length AS4_PATH]
• Loaded ~300k BGP prefixes
• Replayed 15-min trace
• Policy and filtering
• Installed in ModelNet network emulator [OSDI ’02]
– 30 ms intra-AS
– 5 ms inter-AS
– 620 Mbps
Micro benchmarks
• CPU overhead
• Metric: BGP updates per s
– Stress test during RIB load
• Baseline: 15.1 – W/ exploration: 13.9 – Impact 8%
– Realistic test during trace replay
• Negligible impact
• Memory overhead
– Cloned process has 37% overhead on avg
• Bandwidth
– 8 Kbps avg for exploratory messaging
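The 8% figure for the stress test follows directly from the two throughput numbers on this slide (baseline 15.1, with exploration 13.9, in the slide's unit of BGP updates per s):

```latex
\[
\frac{15.1 - 13.9}{15.1} = \frac{1.2}{15.1} \approx 0.079 \approx 8\%
\]
```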
Results
• Avg: 243 s, 756 explorations
– Max 670 s, 2002 explorations
– Without ModelNet: avg 155 s
– Detected session reset and origin misconfiguration
Explored all paths in the UPDATE handlers + across the Internet-like testbed in ~4 min avg (11 min max)
Deployment option 1
• Convince Cisco, Juniper, Huawei, etc. to
integrate DiCE
Deployment option 2
• Deploy DiCE+BIRD in a server
– Potentially run multiple router instances
– Configure with the AS policy & BGP feed
– Connect with DiCE servers in neighboring ASes
Incentives
• Common infrastructure
• ISP benefits as an exploration target
– Knowing about its faults
• Upstream ISPs can incentivize customer ISPs
to serve as an “explorer”
– Fewer faults, lower operational costs
Conclusion
• We have an online testing system for BGP
• Are you interested in trying out our prototype?
• Do you have suggestions for properties to check?
– Get in touch: [email protected]
• Thank you! Questions?
• More info in our papers
– [LADIS ’10, USENIX ATC ’11]
Backup slides
My Research
• Improving the reliability of distributed systems
• Why?
– Foundation of our society’s infrastructure
– ... but it is difficult to make them reliable
• Produce robust design and implementation
• Deploy and operate reliably
• A prime example: BGP (inter-domain routing)
– Fundamental service for Internet’s operation
– Additional challenges: federation & heterogeneity
DiCE/BGP Prototype in Action
[Figure: Node 1 (explorer) and Node 2, with a ctrl component and the path exploration engine. Step labels as extracted: 1’: annotated message; 1’’: fork(); 1’: fork(); 2’: fork()/run; 2’’’: fork()/run; 3: message]
Input generation code
[Figure: flowcharts of the input generation code and the router update handling code. Starting from an original input x.y.z.w/l, decision nodes "Fuzz?", "Import filter1?" and "Import filter2?" lead to the actions "Fuzz attr", "Apply update", "Drop update" and "Send update".]
Inputs produced by DiCE:
x.y.z.w/l: (fuzz)
a.b.c.d/l (leaked prefix)
x.y.z.w/l: (0-length AS4_PATH)
Property 3: BGP Policy Conflicts
Checking convergence is hard [Varadhan et al., ’96; Griffin et al., ’00]
– Check: Dispute wheel?
• Absence of a dispute wheel: sufficient condition for robust convergence
[Figure: BAD GADGET II [Timothy G. Griffin, Leiden Global Internet talk ’00]. Nodes 1 to 4 around destination 0, with ranked paths 130, 10 (node 1); 210, 20 (node 2); 3420, 30 (node 3); 420, 430 (node 4). Nodes locally prefer not routing directly to 0. Result: Cycle!]
Dispute Wheel Detection with DiCE
• Use symbolic input to change policy
– Can cause a dispute wheel in a single step
[Figure: GOOD GADGET and BAD GADGET II topologies. Nodes 0 to 4 with paths 130, 10; 210, 20; 3420, 30; 420, 430]
Report: list of policy changes that cause oscillations
• Use global precedence metric to detect and resolve conflict [Ee et al., SIGCOMM ’07]
– Metric invoked → DW in the cloned snapshot → Fault