Experience in Black-box OSPF Measurement

Management of Routing
Protocols in IP Networks
Ph.D. Defense
Aman Shaikh
Computer Engineering, UCSC
November 18, 2003
Aman Shaikh
Ph.D. Defense
1
Introduction
• Internet connects millions of computers
– Internet is packet-switched:
• Each packet travels independently of the rest
• Routers provide connectivity
– Routers forward packets so that they reach their
ultimate destination
• Forwarding is destination-based and hop-by-hop
– Router decides next-hop (i.e., neighbor router) for
each packet based on its destination address
• Routing protocols allow routers to determine
next-hop(s) for every destination
Management of Routing Infrastructure
• Management of routing infrastructure is a
nightmare
– “Simple core (= routing infrastructure), smart edge
(= end hosts)” design paradigm
• Internet only provides a best-effort, connectionless,
unreliable service
• Routing is not designed with manageability in mind
– Large distributed system
• Hundreds of routers and thousands of links in big service
provider networks
• Variety of routing protocols
– The infrastructure is evolving
• New services require new protocols and devices
Dissertation Contribution
• Focuses on management of Open Shortest Path
First (OSPF) protocol
– OSPF is widely used to control routing within
service provider and enterprise networks
• Three areas of focus
– Monitoring
– Characterization
– Maintenance
Monitoring
• Motivation:
– Effective management requires sound monitoring
systems
• Contribution:
– Design and implementation of an OSPF monitor
– Deployment in two commercial networks
• Has proved valuable for trouble-shooting and identifying
impending problems in early stage
• Collection and archiving of OSPF data that is used for
performance improvement, post-mortem analysis and
further research
Characterization
• Motivation:
– Need sound simulation and analytical models for
scalability studies, addition of new features, etc.
• How do we parameterize these models?
– Need vendor-independent benchmarking methods
• Contribution:
– Black-box techniques for estimating OSPF
processing delays within a router
• Has become basis for OSPF benchmarking standardization
efforts
– Case study of OSPF dynamics in an enterprise
network
Maintenance
• Motivation:
– Maintenance of routers occurs fairly frequently
• Protocol enhancements, bug fixes, hardware/software
upgrades
– During maintenance, operators have to withdraw
router undergoing maintenance
• Leads to route flapping and instability
– How to perform seamless maintenance?
• Contribution:
– I’ll Be Back (IBB) capability for OSPF
• Allows “router-under-maintenance” to be used for
forwarding
Outline
• Background
– Routing and OSPF overview
– Design of an IP router
• Monitoring
– OSPF Monitor
• Characterization
– Black-box measurements for OSPF
– Case study of OSPF dynamics
• Maintenance
– I’ll Be Back (IBB) Capability for OSPF
• Conclusions and future work
Routing in the Internet
[Figure: five Autonomous Systems (AS1 to AS5) interconnected by BGP; inside each AS an IGP (OSPF, IS-IS or RIP) is used]
• Internet is a collection of Autonomous Systems (ASes)
• Two classes of routing protocols
– IGP (Interior Gateway Protocols)
• Used within an AS
• Example: OSPF, IS-IS, RIP, EIGRP
– EGP (Exterior Gateway Protocols)
• Used across ASes
• Example: BGP
Overview of OSPF
• OSPF is a link-state protocol
– Every router learns entire network topology
• Topology is represented as graph
– Routers are vertices, links are edges
– Every link is assigned weight through configuration
– Every router uses Dijkstra’s single source shortest
path algorithm to build its forwarding table
• Router builds Shortest Path Tree (SPT) with itself as root
• This is the Shortest Path First (SPF) calculation
– Packets are forwarded along shortest paths defined
by link weights
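As a minimal sketch of the SPF step described above, the following Python runs Dijkstra's algorithm from one router and records the first hop toward every destination. The four-router graph, its weights and the name `spf` are illustrative assumptions, not taken from the slides; equal-cost paths, areas and LSA types are ignored.

```python
import heapq

def spf(graph, root):
    """Dijkstra's algorithm: return (distance, first-hop) tables for `root`.

    `graph` maps each router to a dict of {neighbor: link weight}.
    """
    dist = {root: 0}
    next_hop = {}
    # Heap entries: (path cost, router, first hop used to reach it).
    heap = [(0, root, None)]
    done = set()
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if hop is not None:
            next_hop[node] = hop
        for nbr, weight in graph[node].items():
            new_cost = cost + weight
            if nbr not in done and new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                # Neighbors of the root are their own first hop.
                first = nbr if node == root else hop
                heapq.heappush(heap, (new_cost, nbr, first))
    return dist, next_hop

# Hypothetical topology: routers are vertices, weighted links are edges.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, next_hop = spf(graph, "A")
print(dist)      # shortest-path cost from A to every router
print(next_hop)  # forwarding table: destination -> next hop
```

In this example every destination is reached through B, because the path A-B-C-D (cost 4) is cheaper than any path using the direct A-C link.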
Areas in OSPF
• OSPF allows domain to be divided into areas for
scalability
– Areas are numbered 0, 1, 2 …
– Hub-and-spoke with area 0 as hub
– Every link is assigned to exactly one area
– Routers with links in multiple areas are called border
routers
[Figure: area 0 as the hub, connected to area 1 and area 2 through border routers]
Summarization with Areas
• Each router learns
– Entire topology of its attached areas
– Information about subnets in remote areas and their
distance from the border routers
• Distance = sum of link costs from border router to subnet
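A small sketch of how a router could use this summarized information: it adds its own cost to each border router to the distance that border router reports for the remote subnet, and picks the smallest total. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def best_remote_route(cost_to_border, advertised):
    """Pick the border router giving the lowest total cost to a remote subnet.

    cost_to_border: {border router: cost from this router, inside its own area}
    advertised:     {border router: distance that border router reports to the subnet}
    """
    totals = {
        border: cost_to_border[border] + advertised[border]
        for border in advertised
        if border in cost_to_border
    }
    best = min(totals, key=totals.get)
    return best, totals[best]

# Hypothetical values: B1 is farther away but closer to the subnet than B2.
print(best_remote_route({"B1": 300, "B2": 100}, {"B1": 60, "B2": 70}))
```

The router never needs the remote area's topology for this choice, which is the point of summarization.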
[Figure: an OSPF domain with area 0 (R1, R2, R3, B1, B2) and area 1 (C1, C2, subnets 10.10.4.0/24 and 10.10.5.0/24), with link weights, shown next to R1's view, in which area 1 appears only as the two subnets and their distances from B1 and B2]
Link State Advertisements (LSAs)
• Every router describes its local connectivity in Link
State Advertisements (LSAs)
• Router originates an LSA due to…
– Change in network topology
• Example: link goes down or comes up
– Periodic soft-state refresh
• Recommended value of interval is 30 minutes
• LSA is flooded to other routers in the domain
– Flooding is reliable and hop-by-hop
– Includes change and refresh LSAs
– Flooding leads to duplicate copies of LSAs being received
• Every router stores LSAs (self-originated + received) in
link-state database (= topology graph)
Adjacency
• Neighbor routers (i.e., routers connected by a
physical link) form an adjacency
• The purpose is to make sure
– Link is operational and routers can communicate
with each other
– Neighbor routers have consistent view of network
topology
• To avoid loops and black holes
• Link gets used for data forwarding only after
adjacency is established
• Use of periodic Hellos to monitor the status of
link and adjacency
Design of an IP Router
[Figure: router block diagram. Control plane: Route Processor (CPU) running the OSPF, BGP and RIP processes, each with its own routing calculation, and a Route Manager. Data plane: interface cards that forward data packets using the Forwarding Info. Base (FIB), connected by a switching fabric]
Outline
• Background
• Monitoring
– Motivation:
• Effective management requires sound monitoring systems
– Contribution: OSPF monitor
• Design
– Three components and their functionality
• Deployment in two commercial networks
– How OSPF Monitor is being used
– Lessons learnt through deployment
• Characterization
• Maintenance
• Conclusions and future work
OSPF Monitor: Objectives
• Real-time analysis of OSPF behavior
– Trouble-shooting, alerting
– Real-time snapshots of OSPF network topology
• Off-line analysis
– Post-mortem analysis of recurring problems
– Identify anomaly signatures and use them to predict
impending problems
– Allow operators to tune configurable parameters
– Improve maintenance procedures
– Analyze OSPF behavior in commercial networks
Related Work
• Route monitoring
– Commercial IP monitors
• Route Dynamics (IPSUM), Route Explorer (PacketDesign)
– IPMON project at Sprint
• IS-IS and BGP listeners
– RouteViews and RIPE
• Collects BGP updates from several networks
• Topology tracking
– OSPF topology server [shaikh:jsac02]
• Evaluation and comparison of LSA-based versus SNMP-based
approaches
– Rocketfuel project at UW Seattle
• Inference of intra-domain topologies from end-to-end measurements
Components
• Data collection: LSA Reflector (LSAR)
– Passively collects OSPF LSAs from network
– “Reflects” streams of LSAs to LSAG
– Archives LSAs for analysis by OSPFScan
• Real-time analysis: LSA aGgregator (LSAG)
– Monitors network for topology changes, LSA storms,
node flaps and anomalies
• Off-line analysis: OSPFScan
– Tools for analysis of LSA archives
• Post-mortem analysis of recurring problems, performance
improvement, what-if analysis, OSPF dynamics
Example
[Figure: an OSPF network with areas 0, 1 and 2. LSAR 1 and LSAR 2 receive LSAs from the network, “reflect” them to LSAG for real-time monitoring, and store them in LSA archives; the archives are replicated to OSPFScan for off-line analysis]
How LSAR attaches to Network
• Host mode
– Join multicast group
– Adv: completely passive
– Disadv: not reliable, delayed initialization of LSDB
• Full adjacency mode
– Form full adjacency with a router
– Adv: reliable, immediate initialization of LSDB
– Disadv: LSAR’s instability can impact entire network
• Partial adjacency mode
– Keep adjacency in a state that allows LSAR to receive LSAs,
but does not allow data forwarding over link
– Adv: reliable, LSAR’s instability does not impact entire
network, immediate initialization of LSDB
– Disadv: can raise alarms on the router
LSA aGgregator (LSAG)
• Analyzes “reflected” LSAs from LSARs over TCP
connections in real-time
• Generates console messages:
– Changes in OSPF network topology
• ADJACENY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2)
-> rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0
– Node flaps
• RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec
– LSA storms
• LSA STORM: lstype 3 lsid 10.1.0.0 advrt 10.0.0.3 area
0.0.0.0 no_lsas 7 storm_window 470 sec
– Anomalous behavior
• TYPE-3 ROUTE FROM NON-BORDER RTR: ntw
10.3.0.0/24 rtr 10.0.0.6 area 0.0.0.0
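To illustrate how a message such as RTR FLAP could be produced, here is a hedged sketch of a sliding-window flap counter. The class name, the threshold, the window length and the reading of `flap_window` as the time spanned by the counted flaps are all assumptions; the slides do not describe LSAG's actual logic. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque

class FlapDetector:
    """Counts router up/down transitions inside a sliding time window."""

    def __init__(self, threshold=5, window=600.0):
        self.threshold = threshold          # flaps needed to raise a message
        self.window = window                # window length in seconds
        self.events = defaultdict(deque)    # router id -> recent timestamps

    def record(self, router_id, timestamp):
        """Record one transition; return a message if the router is flapping."""
        q = self.events[router_id]
        q.append(timestamp)
        # Drop transitions that fell out of the sliding window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        if len(q) >= self.threshold:
            span = int(q[-1] - q[0])
            return (f"RTR FLAP: rtr {router_id} "
                    f"no_flaps {len(q)} flap_window {span} sec")
        return None

detector = FlapDetector(threshold=3, window=600)
for t in (0, 100, 570):
    message = detector.record("10.0.0.12", t)
print(message)
```

The same pattern (per-key queue plus window) would also fit the LSA STORM message, keyed by LSA identity instead of router id.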
OSPFScan
• Tools for off-line analysis of LSA archives
– Parse, select (based on queries), and analyze
• Derivation and analysis of auxiliary information from
LSA archives
– LSAs indicating network topology changes
– Routing table entries
• How OSPF routing tables evolved in response to network changes
• What an end-to-end path within the OSPF domain looked like at any instant
– Topology changes as graph-based abstraction
• Vertex addition/deletion and link addition/deletion/change_weight
• Playback of topology change events
– Essentially an LSAG playback
Deployment
• Deployed in two commercial networks
– Enterprise network
• 15 areas, 500+ routers; Ethernet-based LANs
• Deployed since February, 2002
• LSA archive size: 10 MB/day
• LSAR connection: host mode
– ISP network
• Area 0, 100+ routers; Point-to-point links
• Deployed since January, 2003
• LSA archive size: 8 MB/day
• LSAR connection: partial adjacency mode
LSAG in Day-to-day Operations
• Generation of alarms by feeding messages into higher
layer network management systems
– Correlation and grouping of messages into a single alarm
– Prioritization of messages
• Validation of maintenance steps and monitoring the
impact of these steps on network-wide OSPF behavior
– Example:
• Operators change link weights to carry out maintenance
activities
• A “link-audit” web-page allows operators to keep track of
link weights in real-time
Problems Caught by LSAG
• Equipment problem
– Detected internal problems in a crucial router in
enterprise network
• Problem manifested as episodes of OSPF adjacency
flapping
• Configuration problem
– Identified assignment of same router-ids to two
routers in enterprise network
• OSPF implementation bug
– Caught a bug in refresh algorithm of routers from a
particular vendor in ISP network
• Bug resulted in a much faster refresh of LSAs than
standards-mandated rate
Long Term Analysis by OSPFScan
• LSA traffic analysis
– Identified excessive duplicate LSA traffic in some
areas of the enterprise network
• Led to root-cause analysis and preventative steps
• Generation of statistics
– Inter-arrival time of change LSAs in the ISP network
• Fine-tuning configurable timers related to SPF calculation
– Mean down-time and up-time for links and routers in
the ISP network
• Assessment of reliability and availability as ISP network
gears up for deployment of new services
Lessons Learnt through Deployment
• New tools reveal new failure modes
• Real networks exhibit significant activity
– Maintenance and genuine problems
• Archive all LSAs
– LSA volume is manageable
• Stability and reliability of monitor is extremely
important
• Keep data collection separate from its analysis
– Keep data collector as simple as possible
• Add functionality incrementally and through interaction
with users
Summary
• Three-component architecture
– LSAR: LSA capture from the network
– LSAG: real-time analysis of LSA stream
• Detection and trouble-shooting of problems
– OSPFScan: off-line analysis tools for LSA archives
• Post-mortem analysis of recurring problems, performance
improvement, what-if analysis, OSPF dynamics
• Deployed in two commercial networks
– Has proven a valuable network management tool
– “OSPF Monitor was a lifesaver”
• VP of Networking, Enterprise network
• Said when the monitor caught an impending failure at an early
stage
Outline
• Background
• Monitoring
• Characterization
– Motivation:
• Simulation and analytical models, benchmarking
– Contributions:
• Black-box techniques for estimating OSPF processing
delays on a router
– Tasks we measure, methodology, results for Cisco and GateD
• Case study of OSPF dynamics in an enterprise network
• Maintenance
• Conclusions and future work
Black-box Measurements for OSPF
• OSPF processing delays within a router matter!
– Add up to impact convergence and stability
– Guidance in tuning configurable parameters, head-to-head
vendor comparisons, simulation models
• Instrumenting routing code for measuring delays is
challenging
– Commercial implementations are proprietary
– May involve grappling with
• Numerous code versions, hardware platforms, and developers
• Use black-box measurements
– Measure the timing delays using external observations
– Applied to Cisco and GateD OSPF implementations
Related Work
• White-box measurements for IS-IS [alaettinoglu]
– SPF delays reported are comparable to results
obtained by us
• Empirical analysis of router behavior under large
BGP routing tables [chang:imw02]
– Cisco and Juniper routers
• Benchmarking Methodology working group
(bmwg) at IETF
– Drafts related to OSPF benchmarking
• Our black-box methods are the basis for some benchmark tests
What tasks did we measure?
[Figure: router block diagram marking the four measured tasks inside the OSPF process on the Route Processor (CPU): LSA processing, LSA flooding (LSA in, LS Ack and LSA out), SPF calculation on the topology view, and FIB update]
Methodology
[Figure: testbed in which TopTracker sends LSAs describing an emulated topology to the target router]
• Load emulated topology on target router
• Initiate task of interest
• Measure the time for task
Measuring Task Time
1. Use a black-box method to bracket task start and finish
times
2. Subtract out intervals that precede and exceed these
times
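[Figure: timeline. A is the interval between the top bracket event and the bottom bracket event; B is the interval from the top bracket event to the task start time; X is the task itself; C is the interval from the task finish time to the bottom bracket event]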
X = A - (B + C)
Measuring SPF Calculation
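[Figure: timeline between TopTracker and the target router. TopTracker loads the desired topology and sends the initiator LSA; the initiator LSA arrives and the SPF calculation starts; TopTracker sends a duplicate LSA; the SPF calculation ends and the router sends an ack for the duplicate LSA; the ack arrives at TopTracker. A is the total bracketed interval, X the SPF calculation, and B, C, D and E the remaining intervals]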
• X = A – (B + C + D + E)
• Estimate the overhead = B + C + D + E
Estimating the Overhead
• Remove SPF calculation from bracket
– spf_delay = 60 seconds
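[Figure: the same timeline with the SPF calculation moved outside the bracket. TopTracker sends the initiator LSA and the duplicate LSA; both arrive at the target router; initiator LSA processing is done; duplicate LSA processing is done and the ack is sent; the ack arrives at TopTracker; the SPF calculation starts only afterwards. The bracketed interval is now the overhead alone, made up of B, C, D and E]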
overhead = B + C + D + E
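The arithmetic behind the two experiments can be sketched as follows: one set of runs measures the full bracket A, a second set (with SPF delayed out of the bracket) measures the overhead alone, and the difference estimates X. Averaging over repeated runs and the sample values are assumptions made for illustration; they are not measurements from the slides.

```python
from statistics import mean

def estimate_task_time(bracket_with_task, bracket_without_task):
    """Estimate X = A - overhead from externally observed intervals.

    bracket_with_task:    samples of A (task plus overhead), in seconds
    bracket_without_task: samples of the overhead alone (B + C + D + E),
                          taken with the task moved out of the bracket
    """
    a = mean(bracket_with_task)
    overhead = mean(bracket_without_task)
    return a - overhead

# Hypothetical samples, in seconds.
with_spf = [0.030, 0.032, 0.034]
without_spf = [0.010, 0.012, 0.014]
print(f"estimated SPF time: {estimate_task_time(with_spf, without_spf):.3f} s")
```

The estimate is only as good as the assumption that the overhead is the same in both experiments.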
Results
• Results for Cisco GSR, 7513 and GateD
– For GateD, comparison of black-box results with
those obtained using instrumentation (white-box)
– Route processors
• Cisco: 200 MHz R5000 processor
• GateD: 500 MHz AMD-K6 processor
• Topology: full n x n mesh with random OSPF
edge weights
– n in range 10, 20, …, 100
Results for Cisco Routers
[Figure: two plots, “Mean SPF Time (Cisco 7513)” and “Mean SPF time (Cisco GSR)”; x-axis: number of nodes (n), 0 to 100; y-axis: time (seconds), 0 to 0.03]
• Observations
– Similar results for two models
– SPF calculation time is O(n^2)
Results for GateD
[Figure: “Mean SPF Time (GateD)”, black-box versus white-box; x-axis: number of nodes (n), 0 to 100; y-axis: time (seconds), 0 to 0.018]
• Observations:
– Black-box over-estimates white-box measurement
– Black-box captures the characteristics very well
Summary
• Black-box methods for estimating OSPF processing
delays
– Work across wide range of time delays
– Work for pure CPU bound tasks
– Effective in capturing scaling
– Match with white-box measurements
• Applied methods to Cisco GSR and 7513
– LSA Processing: 100-800 microseconds
– LSA flooding: 30-40 milliseconds
• Pacing timer is the determining factor
– SPF calculation: 1-40 milliseconds
• O(n^2) behavior for full n x n mesh
– FIB update time: 100-300 milliseconds
• No dependence on topology size
Outline
• Background
• Monitoring
• Characterization
– Motivation:
• Simulation and analytical models, benchmarking
– Contributions:
• Black-box techniques for estimating OSPF processing delays on a
router
• Case study of OSPF dynamics in an enterprise network
– Enterprise network topology, categorization of LSA traffic, results
• Maintenance
• Conclusions and future work
Case Study of OSPF Dynamics
• OSPF behavior in commercial networks is not
well understood
• Understanding dynamics of LSA traffic is key to
better understanding of OSPF
– Bulk of OSPF processing is due to LSAs
– Big impact on OSPF convergence, (in)stability
• Analysis of LSA archives collected by OSPF
monitor in enterprise network
– Focus on April, 2002 data
Related Work
• Several studies focusing on BGP dynamics in
the Internet
– Relatively easy to collect BGP data
– BGP is more complicated
• OSPF dynamics in a regional service provider
network (MichNet) [watson:icdcs03]
– One year's worth of data
– Several findings are similar to our observations
• Analysis of OSPF stability through simulations
[basu:sigcomm01]
Enterprise Network
• Provides customers with connectivity to
applications and databases residing in data
center
• OSPF network
– 15 areas, 500 routers
• This case study covers 8 areas, 250 routers
• One month: April, 2002
– Ethernet-based LANs
• Customers are connected via leased lines
– Customer routes are injected via EIGRP into OSPF
• The routes are propagated via external LSAs
Enterprise Network Topology
[Figure: enterprise OSPF domain with area 0 and areas A, B and C. Customers connect over EIGRP, which is external to the OSPF domain. Border routers B1 and B2 sit between area A and area 0, with LAN 1 and LAN 2; the servers (database, applications) are in the data center. The monitor uses host mode to receive LSAs]
Categorizing LSA Traffic
• Refresh LSA traffic
– Originated due to periodic soft-state refresh
– Forms base-line LSA traffic
– Can be predicted using configuration information
• Change LSA traffic
– Originated due to changes in network topology
• E.g., link goes down/comes up
– Allows detection of anomalies and problems
• Duplicate LSA traffic
– Received due to redundancy in flooding
– Overhead -- wastes resources
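One plausible way to sort arriving LSAs into these three categories is to compare each LSA with the last instance stored for the same key. The sketch below is an assumption about how such a classifier might look, not the monitor's actual code: the key layout, the use of the sequence number, and treating any older instance as a duplicate are simplifications, and sequence-number wraparound is ignored.

```python
def classify_lsa(lsdb, key, seq, body):
    """Classify an arriving LSA and update `lsdb` in place.

    lsdb: {key: (seq, body)}
    key:  (LS type, link-state ID, advertising router)
    """
    stored = lsdb.get(key)
    if stored is not None and seq <= stored[0]:
        return "duplicate"      # copy of an instance already seen
    lsdb[key] = (seq, body)
    if stored is not None and body == stored[1]:
        return "refresh"        # new instance, same contents
    return "change"             # first sighting or changed contents

lsdb = {}
key = (1, "10.0.0.1", "10.0.0.1")
for seq, body in [(1, "link up"), (1, "link up"), (2, "link up"), (3, "link down")]:
    print(seq, body, "->", classify_lsa(lsdb, key, seq, body))
```

Counting the returned labels per area and per day would give the three traffic curves discussed on the following slides.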
LSA Traffic in Different Areas
[Figure: number of refresh, change and duplicate LSAs per day over the month, in four panels: area 0, area 2, area 3 and area 4. Annotations: two events marked “Genuine Anomaly” and one marked “Artifact: 23 hr day (Apr 7)”]
Baseline LSA Traffic: Refresh LSAs
• Refresh LSA traffic can be reliably predicted
using router configuration files
– Important for workload generation
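A back-of-the-envelope version of this prediction, assuming every LSA is refreshed once per refresh interval (30 minutes is the recommended value mentioned earlier): the expected daily count is the number of LSAs derived from the configuration times the number of intervals in the day. The LSA count of 100 is hypothetical, and the `hours` parameter is only there to show why a 23-hour day yields a lower count.

```python
def expected_refresh_lsas(lsa_count, refresh_interval_min=30, hours=24):
    """Expected refresh LSAs in one day if each LSA refreshes once per interval."""
    return lsa_count * (hours * 60) / refresh_interval_min

print(expected_refresh_lsas(100))            # ordinary 24-hour day
print(expected_refresh_lsas(100, hours=23))  # 23-hour day
```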
[Figure: refresh LSAs per day, expected from configuration versus actual, for area 2 and area 3]
Refresh process is not synchronized
• No evidence of synchronization
– Contrary to simulation-based study [basu:sigcomm01]
• Reasons
– Changes in the topology help break synchronization
– LSA refresh at one router is not coupled with LSA refresh at other routers
– Drift in the refresh interval of different routers
Change LSAs
[Figure: change LSAs per day, external versus internal; y-axis from 1 to 10000 on a logarithmic scale]
• Internal to OSPF domain versus external
– Change LSAs due to external events dominated
– Not surprising due to large number of leased lines
and import of customer routes into OSPF
• Customer volatility → network volatility
Root Causes of Change LSAs
• Persistent problem → flapping → numerous change LSAs
– Internal LSA spikes → hardware router problems
• OSPF monitor identified a problem (not visible to other network management
tools) early and led to preventive maintenance
– External LSA spikes → customer route volatility
• Overload of an external link to a customer between 9 PM – 3 AM
caused EIGRP session to flap
[Figure: total LSAs in area 2 versus total LSAs due to the flapping link, plotted per day in April, 2002 and per hour on April 11, 2002, with the link flaps marked]
Overhead: Duplicate LSAs
[Figure: duplicate LSAs per day in area 2 and in area 3]
• Why do some areas witness substantial duplicate LSA
traffic, while other areas do not witness any?
– OSPF flooding over LANs leads to control plane asymmetries
and to imbalances in duplicate LSA traffic
Summary
• Refresh LSAs: constituted bulk of overall LSA traffic
– No evidence of synchronization between different routers
– Refresh LSA traffic predictable from configuration
information
• Change LSAs: mostly indicated persistent yet partial
failure modes
– Internal LSA spikes → hardware router problems →
preventive router maintenance
– External LSA spikes → customer congestion problems →
“preventive” customer care
• Duplicate LSAs: arose from control plane asymmetries
– Simple configuration changes could eliminate duplicate LSAs
and improve performance
Outline
• Background
• Monitoring
• Characterization
• Maintenance
– Motivation:
• Seamless maintenance and upgrades of routers
– Minimal instability and flaps
– Contribution:
• I’ll Be Back (IBB) capability for OSPF
– What IBB capability provides, how capability is implemented,
performance analysis
• Conclusions and future work
Maintenance is a Pain
• Maintenance of routers is a way of life in
commercial networks
– Extensions to routing protocols, new functionality,
hardware and software upgrades, bug fixes
• Maintenance is a painful exercise
– During maintenance, operators withdraw “router-under-maintenance” from forwarding service
• Leads to route flaps, traffic disruption and instability
– Operators have to carefully schedule maintenance
• Schedule them during night when load is moderate
• Stagger maintenance of different routers across time
We can do better
• Observation: router can continue forwarding
even while its routing process is inactive, at least
for a while
– Current routers have separate routing and forwarding
paths
• Routing in software (CPU)
• Forwarding in hardware (switching)
• Need to extend routing protocols since they
always try to route around inactive router
– Our proposal: IBB (I’ll Be Back) extensions to OSPF
IBB Proposal in a Nutshell
• OSPF process on router R needs to be shut down
• Before shutdown, R informs other routers that
it is going to be inactive for a while
• R specifies a time period (IBB Timeout) by which it
expects to become operational again
• Other routers continue using R for forwarding during
IBB Timeout period
• If R comes back within IBB Timeout period,
no routing instability or flaps
• Else other routers start forwarding packets around R
Related Work
• Graceful restart proposals for various routing
protocols at IETF
– Graceful restart proposal for OSPF by John Moy
• Alex Zinin’s proposal to avoid flaps upon restart
of OSPF process
– Process has to come up before other routers notice it
was shut down
– Provides small window of opportunity
• Use of redundant route processors and seamless
transfer of control
– NSR (Avici), High Availability Initiative (Cisco)
What if topology changes
• R cannot update its forwarding table to reflect the
change
– Can lead to loops or black holes
[Figure: two weighted topologies with routers A, B and R: (a) topology when R went down; (b) topology changes while R is inactive]
Handling Changes: Three Options
• Don’t do anything
• Stop using R: John Moy’s proposal
– Inadvertent changes during upgrade are likely
• Example: flapping due to a bad interface somewhere
– But not all changes are bad
• Do not always lead to loops or black holes
• Stop using R only when a loop or black hole gets formed: our approach
– And only for destinations for which there is a
problem
Roadmap of Algorithm
• Single area, single inactive router case
– Loop formation
– Black hole formation
• Single area, multiple inactive routers case
– Loop formation
• Multiple areas
– Black hole formation and area partitions
Single Area, Single Inactive Router
• Problem Formulation
– Inactive Router = R
– All routers other than R have the same image of the
topology graph
– R’s image is that of the past (the time at which it went
down)
– Source = S, Destination = D
– Next hop(R, D) = Y
– Actual path a packet takes from S to D = P(S → D)
Loop Detection
P(S → D) has a loop
iff S and Y have R on their paths to D in their SPTs
[Figure: three panels with routers S, R, Y and destination D and their link weights: the topology when R went down; the topology after it changes while R is inactive; and the resulting SPTs, in which S and Y have R on their paths to D]
If there is a loop, neighbor can always detect it
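The detection condition can be checked mechanically. The sketch below computes the current shortest paths of S and Y to D and reports a loop when R lies on both, where Y is next hop(R, D) from the table R held when it went down. The helper names are hypothetical, and the link weights are an assumption loosely modeled on the numbers visible in the figure (the Y-D link changes from 3 to 10); equal-cost ties are not handled.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra shortest path from src to dst as a list of routers, or None."""
    heap = [(0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nbr, weight in graph[node].items():
            if nbr not in done:
                heapq.heappush(heap, (cost + weight, nbr, path + [nbr]))
    return None

def has_loop(graph, s, r, y, d):
    """True if both S and Y currently route to D through the inactive router R."""
    path_s = shortest_path(graph, s, d)
    path_y = shortest_path(graph, y, d)
    return bool(path_s and path_y and r in path_s and r in path_y)

def topology(y_d_weight):
    """Assumed four-router topology; only the Y-D weight varies."""
    return {
        "S": {"R": 1, "Y": 20},
        "R": {"S": 1, "Y": 2, "D": 6},
        "Y": {"S": 20, "R": 2, "D": y_d_weight},
        "D": {"R": 6, "Y": y_d_weight},
    }

print(has_loop(topology(3), "S", "R", "Y", "D"))   # when R went down
print(has_loop(topology(10), "S", "R", "Y", "D"))  # after the weight change
```

After the change, Y prefers Y-R-D while R, still using its old table, forwards packets for D back to Y, which is exactly the loop the condition flags.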
Loop Prevention
Every router needs to calculate a path to D
such that R does not appear on it
[Figure: the changed topology while R is inactive, next to the paths that S and Y calculate to D without R on them]
Loop Avoidance Procedure
• R sends forwarding table to neighbors before shutdown
- Thus, Y knows that next hop(R, D) is Y
• Detection: during SPF calculation neighbors detect loops
- Y checks if R exists on the path to D or not
• Upon detection, neighbors send avoid messages
to other routers in the domain
- avoid(R, D) = avoid using R for reaching D
• Prevention: upon receiving avoid(R, D) message,
other routers calculate a new path to D without R on it
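The prevention step can be sketched as a shortest-path computation that skips the routers named in avoid messages for the destination in question. Function names, the representation of avoid messages as (router, destination) pairs, and the example weights are assumptions for illustration only.

```python
import heapq

def shortest_path(graph, src, dst, avoid=frozenset()):
    """Dijkstra shortest path from src to dst that never enters `avoid`."""
    heap = [(0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nbr, weight in graph[node].items():
            if nbr not in done and nbr not in avoid:
                heapq.heappush(heap, (cost + weight, nbr, path + [nbr]))
    return None

def next_hop(graph, router, dest, avoid_msgs):
    """Next hop toward dest, honoring every avoid(R, dest) message received."""
    forbidden = {r for (r, d) in avoid_msgs if d == dest}
    path = shortest_path(graph, router, dest, avoid=forbidden)
    return path[1] if path and len(path) > 1 else None

# Assumed topology after the change, with R inactive.
graph = {
    "S": {"R": 1, "Y": 20},
    "R": {"S": 1, "Y": 2, "D": 6},
    "Y": {"S": 20, "R": 2, "D": 10},
    "D": {"R": 6, "Y": 10},
}
print(next_hop(graph, "S", "D", set()))          # before any avoid message
print(next_hop(graph, "S", "D", {("R", "D")}))   # after avoid(R, D)
```

Because the avoid message is per destination, R keeps being used for every destination that has no avoid message against it.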
Performance
• Maximum effect on SPF calculation
– Quantify overhead
– Impact of topology size
• Prototype Implementation
– IBB extensions incorporated into GateD 4.0.7
Testbed Setup
[Figure: testbed. Physical topology: the SUT (System Under Test, where IBB overhead is measured) and TopTracker share a LAN, and TopTracker sends LSAs to the SUT. SUT’s view of the topology: the SUT, R (the router under maintenance), a router X, and an emulated complete graph with n nodes (M1, …), joined by weighted links]
Experiment Sequence
Time (mins), for GateD on SUT and for IBB-GateD on SUT:
• T=0: Bring R down (GateD); Bring R down in IBB mode (IBB-GateD)
– Case A: inactive rtr
• T=4: Send avoid(R, Mj) messages to SUT (1 ≤ j ≤ n)
– Case B: inactive rtr, avoid it
• T=8: Bring R up (both)
Overhead = (mean SPF time in Case B) / (mean SPF time in Case A)
Result
[Figure: mean SPF time in Case A and in Case B; x-axis: number of nodes in fully connected component (n), 50 to 100; y-axis: time (seconds), 0 to 0.035]
• Overhead remains constant at roughly 2.0 as n increases
• Sources of overhead:
– Second SPF calculation
– Graph in case B is larger than graph in case A
Summary
• IBB proposal: extend OSPF so that a router can
be used for forwarding even while its OSPF
process is inactive
• Main contribution: algorithm that gracefully
handles topology changes
– Stops using the inactive router for a destination if
using the router can lead to loops or black holes
– Overhead of the algorithm is modest
• Shows good scaling behavior in terms of topology size
Outline
• Background
• Monitoring
• Characterization
• Maintenance
• Conclusions and future work
Conclusions
• Monitoring
– Design and implementation of an OSPF monitor
– Deployment in two commercial networks
• Characterization
– Black-box techniques for estimating OSPF
processing delays within a router
– Case study of OSPF dynamics in enterprise network
• Maintenance
– I’ll Be Back (IBB) capability for OSPF that allows a
“router-under-maintenance” to be used for
forwarding
Future Work
• Three principal directions for future work
– Application of this work to other routing protocols
• IS-IS is very similar to OSPF
• EIGRP, RIP and BGP bring their own set of challenges
– Distance-vector nature of the protocols
– BGP also brings scalability issues
– Other areas related to routing and network
management
• Security, network design, configuration management,
simulation & modeling
• How performance of routing infrastructure affects user-perceived performance
– More work in each of three focus areas
Future Work for Monitoring
• Real-time analysis
– More meaningful alerting
• Correlation with other fault and performance data
• Learn from past events
– Prioritization of alerts
• Off-line analysis
– Correlation with other data sources
• Work already underway: BGP, fault, performance
– Identification of problem signatures and feeding
them into real-time component for problem
prediction
Future Work for Characterization
• Expand measurements to cover other router
vendors and commercial networks
• Use results to build simulation and analytical
models
– Validation of models
Future Work for Maintenance
• Improvements to IBB scheme
– Incremental deployment
– Reduction in overhead
• How to use IBB-like schemes in conjunction
with other approaches
– Routing software that can be upgraded without
bringing the process down
– Use of redundant route processors and seamless
transfer of control
– Scheduling maintenance tasks such that they have
minimal impact
Holy Grail
Networks that manage themselves!
Grill me ...
Probably your last chance… :-)
Q and A
Backups
Partial Adjacency for LSAR
[Figure: LSAR and router R with the adjacency held in a partial state. LSAR: “I have LSA L”; R: “I need LSA L from LSAR”; R to LSAR: “Please send me LSA L”]
• Router R does not advertise a link to LSAR
• Routers (except R) not aware of the presence of LSAR
• Does not trigger SPF calculations in network
• LSAR’s going up/down does not impact the network
• LSAR does not originate any LSAs
• LSAR ↔ R link not used for data forwarding
• LSAR does not install any routes in forwarding table
Multiple Inactive Routers for IBB
• Loop Avoidance
– Change in loop detection conditions
– Simplification for loop prevention
• No change in black-hole detection
Loop Avoidance
• Set of inactive routers: R1, R2, …, Rn
• Loop avoidance procedure applies for each
inactive router
– Detection
• Router detects loops for all its inactive neighbors
– Prevention
• A router can get avoid(Ri, D) messages for j inactive
routers (j <= n)
• The router avoids these j forbidden routers on its path to D
• Problem: Set of forbidden routers can be different for
different destinations
– O(n) shortest path calculations
• n = number of vertices
Simplification
• Router avoids all inactive routers if it has some
forbidden routers on its path to D
– Calculate two SPTs:
– SPT with all inactive routers on it
– SPT w/o any inactive router on it
– If the path to D does not contain any forbidden
routers on it,
• Pick next hop for D from the first SPT
– Else,
• Pick next hop for D from the second SPT
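A hedged sketch of this simplification: compute the path with all inactive routers allowed, and fall back to the path that excludes every inactive router only when the first path touches a forbidden one. For brevity the code computes one path per destination rather than two full SPTs; the function names and the example topology are assumptions.

```python
import heapq

def shortest_path(graph, src, dst, avoid=frozenset()):
    """Dijkstra shortest path from src to dst that never enters `avoid`."""
    heap = [(0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nbr, weight in graph[node].items():
            if nbr not in done and nbr not in avoid:
                heapq.heappush(heap, (cost + weight, nbr, path + [nbr]))
    return None

def next_hop_simplified(graph, router, dest, inactive, forbidden):
    """inactive: all inactive routers; forbidden: those named in avoid(Ri, dest)."""
    first = shortest_path(graph, router, dest)  # inactive routers allowed
    if first is None:
        return None
    if not set(forbidden) & set(first):
        return first[1] if len(first) > 1 else None
    # Some forbidden router is on the path: avoid ALL inactive routers.
    second = shortest_path(graph, router, dest, avoid=frozenset(inactive))
    return second[1] if second and len(second) > 1 else None

# Assumed topology with two inactive routers, R1 and R2.
graph = {
    "S": {"R1": 1, "R2": 2, "Y": 20},
    "R1": {"S": 1, "D": 5},
    "R2": {"S": 2, "D": 5},
    "Y": {"S": 20, "D": 10},
    "D": {"R1": 5, "R2": 5, "Y": 10},
}
inactive = {"R1", "R2"}
print(next_hop_simplified(graph, "S", "D", inactive, forbidden={"R1"}))
print(next_hop_simplified(graph, "S", "D", inactive, forbidden={"R2"}))
```

With R1 forbidden, S routes via Y even though a path through R2 would be shorter; that loss of optimality is the price of needing only two shortest-path computations.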
Multiple Inactive Routers: Loop Detection
• Loop detection condition for single inactive
router cannot detect all loops when multiple
routers are inactive
• Two new conditions for loop detection by
neighbors
– Generalization of loop detection for single inactive router
• Conditions can result in false positives
• Evaluation using realistic OSPF topology graphs
with two inactive routers
– Using the two conditions together eliminates most false positives
(90% hit-rate), but not all
Publications
• Aman Shaikh, Mukul Goyal, Albert Greenberg, Raju Rajan and K.K.
Ramakrishnan, An OSPF Topology Server: Design and Evaluation, IEEE JSAC, 20(4), May 2002.
• Aman Shaikh and Albert Greenberg, OSPF Monitoring: Architecture,
Design, and Deployment Experience, submitted to NSDI, 2004.
• Aman Shaikh and Albert Greenberg, Experience in Black-box OSPF
Measurement, In Proc. ACM SIGCOMM IMW, pp. 113-125, November 2001.
• Aman Shaikh, Chris Isett, Albert Greenberg, Matthew Roughan and Joel
Gottlieb, A Case Study of OSPF Behavior in a Large Enterprise Network,
In Proc. ACM SIGCOMM IMW, pp. 217-230, November 2002.
• Aman Shaikh, Rohit Dube and Anujan Varma, Avoiding Instability during
Graceful Shutdown of OSPF, In Proc. IEEE INFOCOM, June 2002.
• Aman Shaikh, Rohit Dube and Anujan Varma, Avoiding Instability during
Graceful Shutdown of Multiple OSPF Routers, submitted to IEEE/ACM
Transactions on Networking (ToN).