Online Testing of BGP
Networked Systems Laboratory
Marco Canini
EPFL, Switzerland
Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar,
Dejan Novaković, Boris Spasojević, Olivier Crameri,
and Dejan Kostić
Work supported by the European Research Council
4/5/2011
Marco Canini, RIPE 62
Is it hard to crash the Internet?
• Software bugs in inter-domain routers
[Figure: a router of type A sends a protocol-compliant but confusing message (0-length AS4_PATH attribute!) to a router of type B, which responds: reset session!]
At 17:07:26 UTC on August 19, 2009 CNCI (AS9354),
a small network service provider in Nagoya, Japan,
advertised a handful of BGP updates containing an
empty AS4_PATH attribute. [renesys blog]
Is it hard to crash the Internet?
• What went wrong
[Figure: network of affected and unaffected routers; destinations behind affected routers become unreachable.]
Repeated service disruptions: routing instabilities!
BGP not always reliable
• Distributed system behavior
– Aggregate result of interleaved actions of multiple
routers
– Federated, heterogeneous and failure-prone
environment
• Difficult to reason about all corner cases or
combinations of configurations
– Unanticipated interactions, subtle differences in
inter-operable implementations, system-wide
conflicts, seemingly valid local fault handling
Agenda
• Our system for online testing
– Disclaimer: still a research work!
– Not going to be an immediate solution
– Hope it will be a tool for this community
• Solicit feedback
– Which faults would you look for?
– What would convince you to deploy our system?
• … discussion
DiCE comes to the rescue
• Key idea: automatically explore system behavior
to detect potential faults
1. Create an isolated snapshot of a BGP neighborhood
2. Subject a router’s BGP process to many inputs that
systematically exercise router actions
3. For each input, check if the snapshot misbehaves
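The three steps above can be sketched as a small loop. This is a minimal illustration, not DiCE's code: `explore`, `toy_handler`, `toy_check`, and the dictionary-based router state are hypothetical names, and a deep copy stands in for the isolated snapshot.

```python
import copy


def explore(live_state, handler, candidate_inputs, check):
    """Run every candidate input against a private copy of the state."""
    findings = []
    for message in candidate_inputs:
        snapshot = copy.deepcopy(live_state)  # step 1: isolated snapshot
        handler(snapshot, message)            # step 2: exercise the handler
        problem = check(snapshot)             # step 3: check the snapshot
        if problem is not None:
            findings.append((message, problem))
    return findings


# Hypothetical toy router: an empty "as4_path" triggers a session reset.
def toy_handler(state, message):
    if message.get("as4_path") == []:
        state["sessions_reset"] += 1
    else:
        state["routes"][message["prefix"]] = message.get("as4_path")


def toy_check(state):
    return "session reset" if state["sessions_reset"] > 0 else None


live = {"routes": {}, "sessions_reset": 0}
inputs = [
    {"prefix": "192.0.2.0/24", "as4_path": [64500]},
    {"prefix": "192.0.2.0/24", "as4_path": []},
]
findings = explore(live, toy_handler, inputs, toy_check)
```

Because every input runs on a copy, `live` is unchanged after the loop, which mirrors the goal of keeping exploration away from the production state.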
[Figure: DiCE explores the BGP process and its BGP neighbors.]
Error in the snapshot: evidence of possible future behavior of the production system
BGP snapshot
• Isolate testing from production environment
• Local checkpoint of current state
– Cloned BGP process and configuration
• Special IP prefix, custom attribute
• BGP’s federated environment
– Each router keeps its local checkpoint
– Private state & config stays in the AS
[Figure: BGP process with FIB and sockets, the cloned BGP process, and the checkpoints of the BGP peers.]
ASes collaborate to detect potential faults
Exploration of behavior
• DiCE uses a path exploration engine
– Concolic (CONCrete + symbOLIC) execution systematically exercises code paths
[Figure: DiCE runs clones 1, 2, 3 of the BGP process and asks of each: is there an error?]
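A toy version of concolic exploration may help make the idea concrete. In this sketch the program runs on a concrete input while its branch decisions are recorded; each recorded branch is then negated to obtain an input for an unexplored path. Everything here is an assumption for illustration: `run_with_trace`, `solve`, `explore_paths`, and `handle_attr_length` are hypothetical, and a brute-force search over a small domain stands in for the constraint solver a real engine would use.

```python
def run_with_trace(program, value):
    """Run program on a concrete value, recording each branch decision."""
    trace = []

    def branch(label, predicate):
        outcome = predicate(value)
        trace.append((label, predicate, outcome))
        return outcome

    return program(branch), trace


def solve(constraints, domain):
    """Brute-force stand-in for a constraint solver."""
    for candidate in domain:
        if all(pred(candidate) == want for pred, want in constraints):
            return candidate
    return None


def explore_paths(program, seed, domain):
    """Return {path: (input, result)} for every path reached from seed."""
    seen = {}
    worklist = [seed]
    while worklist:
        value = worklist.pop()
        result, trace = run_with_trace(program, value)
        path = tuple((label, outcome) for label, _, outcome in trace)
        if path in seen:
            continue
        seen[path] = (value, result)
        # Negate each branch in turn, keeping the decisions before it fixed.
        for i, (_, pred, outcome) in enumerate(trace):
            constraints = [(p, o) for _, p, o in trace[:i]]
            constraints.append((pred, not outcome))
            new_value = solve(constraints, domain)
            if new_value is not None:
                worklist.append(new_value)
    return seen


# Hypothetical handler with a bug on zero-length attributes.
def handle_attr_length(branch):
    if branch("length > 255", lambda n: n > 255):
        return "reject"
    if branch("length == 0", lambda n: n == 0):
        return "reset session"
    return "accept"


paths = explore_paths(handle_attr_length, seed=4, domain=range(1024))
```

Starting from the unremarkable seed input 4, the exploration reaches the `reset session` path without anyone having to guess that a length of 0 is special.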
Driving behavior by inputs
• Inputs: messages, configuration changes, failures, random choices, timeouts
• UPDATE message
– Header
– Withdrawn Routes
– Path Attributes: Attribute Type | Length | Value
– Network Layer Reachability Information: NLRI Length | Prefix
[Figure: input generation. The path exploration engine takes symbolic inputs and the code & current config and produces path constraints, e.g. from route selection (route ranking: is most preferred route?).]
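The UPDATE fields listed on this slide can be assembled by hand, which shows how small the problematic input is. The sketch below follows the BGP-4 wire format (16-octet marker, 2-octet length, 1-octet type, then withdrawn routes, path attributes, and NLRI) and builds an UPDATE whose only path attribute is a 0-length AS4_PATH (type code 17, optional transitive). The helper names are hypothetical, the sketch deliberately omits the mandatory attributes (ORIGIN, AS_PATH, NEXT_HOP) a real announcement carries, and it is not how DiCE generates inputs.

```python
import struct

BGP_MARKER = b"\xff" * 16
BGP_HEADER_LEN = 19
BGP_TYPE_UPDATE = 2
ATTR_FLAGS_OPTIONAL_TRANSITIVE = 0xC0
ATTR_TYPE_AS4_PATH = 17


def encode_prefix(prefix, length_bits):
    """Encode an IPv4 prefix as <length in bits, minimal number of octets>."""
    octets = bytes(int(part) for part in prefix.split("."))
    return bytes([length_bits]) + octets[: (length_bits + 7) // 8]


def encode_attribute(flags, type_code, value):
    """Encode <flags, type, length, value> with a one-octet length."""
    if len(value) > 255:
        raise ValueError("extended-length attributes are not handled here")
    return struct.pack("!BBB", flags, type_code, len(value)) + value


def build_update(withdrawn, attributes, nlri):
    """Assemble header + withdrawn routes + path attributes + NLRI."""
    body = struct.pack("!H", len(withdrawn)) + withdrawn
    body += struct.pack("!H", len(attributes)) + attributes
    body += nlri
    header = BGP_MARKER + struct.pack(
        "!HB", BGP_HEADER_LEN + len(body), BGP_TYPE_UPDATE
    )
    return header + body


# Illustration only: mandatory attributes are left out on purpose.
empty_as4_path = encode_attribute(
    ATTR_FLAGS_OPTIONAL_TRANSITIVE, ATTR_TYPE_AS4_PATH, b""
)
message = build_update(b"", empty_as4_path, encode_prefix("192.0.2.0", 24))
```

The attribute itself is three octets (flags, type, zero length), which is why such a message can look well-formed to one implementation and malformed to another.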
Detecting faults
• Check properties that capture desired behavior
• Example: Harmful Global Events (session resets)
• Each router evaluates a local check f() on its snapshot: 0 on an unaffected router, 1 or more on an affected router that sees a BGP error
• The DiCE controller sums the error counts (∑errors = 5 in the example) and compares the total against a threshold
• Log inputs that have harmful global behavior
[Figure: valid but ambiguous messages spread through the snapshot; each router reports f() to the DiCE controller.]
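The harmful-global-event check can be sketched as a local count per router plus a sum at the controller. The names `local_error_count` and `is_harmful_global_event`, the log format, the router names, and the threshold value are all hypothetical; the slide leaves the threshold open, so it is a parameter here.

```python
def local_error_count(log_lines):
    """f(): count BGP error events one router saw in its snapshot."""
    return sum(1 for line in log_lines if "BGP error" in line)


def is_harmful_global_event(per_router_counts, threshold):
    """Controller side: sum the local counts and compare to a threshold."""
    total = sum(per_router_counts.values())
    return total >= threshold, total


# Hypothetical snapshot logs for three routers.
logs = {
    "router-a": ["UPDATE received", "BGP error: malformed attribute"],
    "router-b": ["UPDATE received"],
    "router-c": ["BGP error: malformed attribute", "BGP error: session reset"],
}
counts = {name: local_error_count(lines) for name, lines in logs.items()}
harmful, total_errors = is_harmful_global_event(counts, threshold=2)
```

Only the counts leave each router in this sketch, not the logs themselves, which fits the goal of keeping private state inside the AS.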
Other properties
• Policy-induced divergence
• Origin misconfiguration
– Check: routing tables polluted in external ASes?
• Route leaks (hijacks) by customer or provider
[Figure: customer C sends provider P an UPDATE with AS_PATH C for prefix d.]
List of prefixes that can leak:
Prefix | AS_PATH
d | XCY Z
Keeping confidential information
• Potential router behavior
– Common code paths already exposed
– Reverse engineering any easier than today?
• Private state or configuration
– Information hiding through randomization
– Avoid inputs driven by confidential data, so that it cannot leak
• Rate limit, refuse certain explorers
• Anonymous property checks
– Secure multi-party computation: no need for a trusted 3rd party
Implementation details
• Integrated DiCE in BIRD 1.1.7
– Open source router, coded in C
• Concolic execution instruments code to track
symbolic inputs
– Instrumentation needed only for testing
– Negligible impact on the production environment
Evaluation
• Multiple BIRD instances on a 48-core machine
• Properties checked
– Harmful global events
– Origin misconfiguration
– Policy conflict
Evaluation topology
[Figure: topology from [Haeberlen et al., NSDI ’09] + annotations. Nodes: AS 1, AS 2, AS 3, AS 4, AS 5, AS 6, AS 8, AS 9, AS 10, AS 165053, and the rest of the Internet. Legend: customer-provider link, peering link, backup link, router that resets session due to 0-length AS4_PATH.]
• Loaded ~300k BGP prefixes
• Replayed 15-min trace
• Policy and filtering
• Installed in ModelNet network emulator [OSDI ’02]
– 30 ms intra-AS
– 5 ms inter-AS
– 620 Mbps
Micro benchmarks
• CPU overhead
• Metric: BGP updates per s
– Stress test during RIB load
• Baseline: 15.1; with exploration: 13.9; impact: 8%
– Realistic test during trace replay
• Negligible impact
• Memory overhead
– Cloned process has 37% overhead on avg
• Bandwidth
– 8 Kbps avg for exploratory messaging
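The 8% CPU impact follows from the two stress-test figures. A short check of that arithmetic, with hypothetical variable names and the values taken as reported (no unit conversion assumed):

```python
baseline = 15.1          # BGP update rate without exploration, as reported
with_exploration = 13.9  # BGP update rate with exploration, as reported

# Relative slowdown caused by exploration.
impact = (baseline - with_exploration) / baseline
print(f"{impact:.1%}")   # prints 7.9%
```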
Results
• Avg: 243 s, 756 explorations
– Max 670 s, 2002 explorations
– Without ModelNet: avg 155 s
– Detected session reset and origin misconfiguration
Explored all paths in the UPDATE handlers + across the
Internet-like testbed in ~4 min avg (11 min max)
Deployment option 1
• Convince Cisco, Juniper, Huawei, etc. to
integrate DiCE
Deployment option 2
• Deploy DiCE+BIRD in a server
– Potentially run multiple router instances
– Configure with the AS policy & BGP feed
– Connect with DiCE servers in neighboring ASes
Incentives
• Common infrastructure
• ISP benefits as an exploration target
– Knowing about its faults
• Upstream ISPs can incentivize customer ISPs
to serve as an “explorer”
– Fewer faults, lower operational costs
Conclusion
• We have an online testing system for BGP
• Are you interested in trying out our prototype?
• Do you have suggestions for properties to
check?
– Get in touch: [email protected]
• Thank you! Questions?
• More info in our papers
– [LADIS ’10, USENIX ATC ’11]
Backup slides
My Research
• Improving the reliability of distributed systems
• Why?
– Foundation of our society’s infrastructure
– ... but it is difficult to make them reliable
• Produce robust design and implementation
• Deploy and operate reliably
• A prime example: BGP (inter-domain routing)
– Fundamental service for Internet’s operation
– Additional challenges: federation & heterogeneity
DiCE/BGP Prototype in Action
[Figure: message sequence between Node 1 (explorer, running the path exploration engine) and Node 2. Labeled steps: fork() on each node, an annotated message from the explorer, fork()/run of the clones, a further message, and a ctrl channel.]
Input generation code
[Figure: flowchart of the router update handling code (Fuzz?, Fuzz attr, Import filter1?, Import filter2?, ending in Apply update, Drop update, or Send update) for the original inputs x.y.z.w/l and a.b.c.d/l. Inputs produced by DiCE: x.y.z.w/l (fuzz), x.y.z.w/l (0-length AS4_PATH), a.b.c.d/l (leaked prefix).]
Property 3: BGP Policy Conflicts
Checking convergence is hard [Varadhan et al.,‘96, Griffin et al.,’00]
– Check: Dispute wheel?
• Absence of a dispute wheel: sufficient condition for robust convergence
[Figure: BAD GADGET II [Timothy G. Griffin, Leiden Global Internet talk ’00]. Nodes 1 to 4 route to destination 0. Path labels: node 1: 130, 10; node 2: 210, 20; node 3: 3420, 30; node 4: 420, 430. Nodes locally prefer not routing directly to 0. Result: cycle!]
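The path labels in the figure are enough to show why this configuration cannot settle. The sketch below is not the dispute wheel check named on the slide; it swaps in a simpler brute-force search for a stable path assignment, which is feasible only for tiny examples. It assumes each node's labels are listed most-preferred first, as read from the figure, and all function and variable names are hypothetical.

```python
from itertools import product


def best_available(node, ranked_paths, assignment):
    """Return the node's highest-ranked path consistent with its next hop."""
    for path in ranked_paths[node]:
        next_hop = path[1]
        # A direct path to 0 is always usable; otherwise the next hop must
        # currently use exactly the remainder of this path.
        if next_hop == 0 or assignment.get(next_hop) == path[1:]:
            return path
    return None


def stable_assignments(ranked_paths):
    """Enumerate all assignments in which every node uses its best path."""
    nodes = sorted(ranked_paths)
    options = [ranked_paths[n] + [None] for n in nodes]
    stable = []
    for choice in product(*options):
        assignment = dict(zip(nodes, choice))
        if all(
            best_available(n, ranked_paths, assignment) == assignment[n]
            for n in nodes
        ):
            stable.append(assignment)
    return stable


# Path rankings as read from the figure (assumed most-preferred first).
BAD_GADGET = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 4, 2, 0), (3, 0)],
    4: [(4, 2, 0), (4, 3, 0)],
}

# Hypothetical variant: node 3 prefers its direct path instead.
VARIANT = dict(BAD_GADGET)
VARIANT[3] = [(3, 0), (3, 4, 2, 0)]

bad_solutions = stable_assignments(BAD_GADGET)
variant_solutions = stable_assignments(VARIANT)
```

With the rankings from the figure the search finds no stable assignment, while flipping a single preference at node 3 yields exactly one. This is the kind of one-step policy change that can separate a convergent configuration from an oscillating one.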
Dispute Wheel Detection with DiCE
• Use symbolic input to change policy
– Can cause a dispute wheel in a single step
[Figure: a policy change turns GOOD GADGET into BAD GADGET II (nodes 1 to 4 routing to 0; path labels 130, 10, 210, 20, 3420, 30, 420, 430).]
Report: list of policy changes that cause oscillations
• Use global precedence metric to detect and resolve
conflict [Ee et al., SIGCOMM ‘07]
– Metric invoked in the cloned snapshot: dispute wheel (DW) present, reported as a fault