Network Routing - Yale University

Download Report

Transcript Network Routing - Yale University

Routing and Forwarding
4/17/2012
1
Recap
 Internet routing is divided into intra-AS
(intradomain) and inter-AS (interdomain)
routing
 Technical reasons
 Non-technical reasons
 BGP (Border Gateway Protocol), a path-
vector protocol, is the de facto standard
interdomain protocol
AS D
Routing: Example
E
d1
d
d2
AS A
(OSPF)
a2
F
i
AS C
a1
How to specify?
AS B
(OSPF intra
routing)
b
AS I
3
IP Addressing Scheme: Requirements
 We need an address to uniquely identify
each destination
 Routing scalability needs flexibility in
aggregation of destination addresses
 we
should be able to aggregate a set of
destinations as a single routing unit
 Preview: the unit of routing in the Internet
is a network---the destinations in the routing
protocols are networks
4
IP Address: An IP Address Identifies an Interface
 IPv4 address: 32-bit
identifier for an
interface

interface:


routers typically
have multiple
interfaces
host may have
multiple interfaces
223.1.1.1
223.1.2.1
223.1.1.2
223.1.1.4
223.1.1.3
223.1.2.9
223.1.3.27
223.1.2.2
223.1.3.2
223.1.3.1
%/sbin/ifconfig -a
223.1.3.2 = 11011111 00000001 00000011 00000010
223
1
3
2
5
IP Addressing
 IP address:
 network part
 host part

223.1.1.2
223.1.1.1
223.1.1.3
223.1.9.2
What’s a network
? (from IP address
perspective)
 is a unit of routing:
can be routed
together (depend
on the routing
protocol)
223.1.1.4
223.1.7.0
223.1.9.1
223.1.7.1
223.1.8.1
223.1.2.6
223.1.2.1 223.1.2.2
223.1.8.0
223.1.3.27
223.1.3.1
223.1.3.2
6
IP Addressing: Class-ful Addressing
given notion of “network”, let’s re-examine IP
addresses:
“class-ful” addressing in the original IP design:
1.0.0.0 to
127.255.255.255
128.0.0.0 to
191.255.255.255
192.0.0.0 to
223.255.255.255
224.0.0.0 to
239.255.255.255
240.0.0.0 to
255.255.255.255
Problem of class-ful addressing?
7
IP Addressing: CIDR
 (Static)classful addressing:

inefficient use of address space, address space exhaustion

not flexible for aggregation
• e.g., a class A net allocated enough addresses for 16 million
hosts; a class B address may also be too big
 CIDR: Classless InterDomain Routing


network portion of address of arbitrary length
address format: a.b.c.d/x, where x is # bits in network
portion of address
network
part
host
part
11001000 00010111 00010000 00000000
200.23.16.0/23
Some systems use mask (1’s to indicate network bits), instead of the /x format
8
CIDR Address Aggregation
AS D
d
d1
AS A
(OSPF)
130.132.1/24
i
i->a1: I can reach
130.132/22; my
path: I
a2
a1
130.132.2/24
AS I
intradomain
routing uses /24
130.132.3/24
9
CIDR Address Aggregation
B
x00/24: B
A
x01/24: C
C
x10/24: E
G
E
x11/24: F
F
F
10
Routing Table Size of BGP (number of globally
advertised, aggregated networks)
Active BGP Entries (http://bgp.potaroo.net/as1221/bgp-active.html)
Internet Growth
(http://www.caida.org/research/topology/as_core_network/historical.xml)
11
Routing Table Prefix Length Distr.
180000
160000
140000
120000
100000
80000
60000
40000
20000
0
8
13
18
23
28
12
IP Addressing: How to Get One?
Q: How does an ISP get its block of
addresses?
A: ICANN: Internet Corporation for Assigned
Names and Numbers
 allocates addresses
 manages DNS
 assigns domain names, resolves disputes
Use
%whois –h whois.arin.net “n <org>”
to check addresses allocated to <org>
13
IP addresses: How to Get One?
Q: How does a host get an IP address?
 Static configured
wintel: control-panel->network->configuration>tcp/ip->properties
 unix:
%/sbin/ifconfig eth0 inet 192.168.0.10 netmask
255.255.255.0

 DHCP: Dynamic Host Configuration Protocol:
dynamically get address from as server
 “plug-and-play”
14
DHCP: Dynamic Host Configuration Protocol
Goal: allow host to dynamically obtain its IP address
from network server when it joins network



can renew its lease on address in use
allows reuse of addresses (only hold address while
connected)
support for mobile users who want to join network
DHCP msgs:
 host broadcasts “DHCP discover” msg
 DHCP server responds with “DHCP offer” msg
 host requests IP address: “DHCP request” msg
 DHCP server sends address: “DHCP ack” msg
15
Network Address Translation: Motivation
 A local network uses just one public IP address as far as outside
world is concerned
 Each device on the local network is assigned a private IP address
rest of
Internet
local network
(e.g., home network)
192.168.1.0/24
192.168.1.1
192.168.1.2
192.168.1.3
138.76.29.7
192.168.1.4
All datagrams leaving local
network have same single source
NAT IP address: 138.76.29.7,
different source port numbers
Datagrams with source or
destination in this network
have 192.168.1/24 address for
source, destination (as usual)
16
Private IP
17
NAT: Network Address Translation
Implementation: NAT router must:

outgoing datagrams: replace (source IP address, port

remember (in NAT translation table) every (source

incoming datagrams: replace (NAT IP address, new
#) of every outgoing datagram to (NAT IP address,
new port #)
. . . remote clients/servers will respond using (NAT
IP address, new port #) as destination addr.
IP address, port #) to (NAT IP address, new port #)
translation pair
port #) in dest fields of every incoming datagram
with corresponding (source IP address, port #)
stored in NAT table
18
NAT: Network Address Translation
NAT translation table
WAN side addr
LAN side addr
1: host 192.168.1.2
2: NAT router
sends datagram to
changes datagram
138.76.29.7, 5001 192.168.1.2, 3345
128.119.40.186, 80
source addr from
……
……
192.168.1.2, 3345 to
138.76.29.7, 5001,
S: 192.168.1.2, 3345
updates table
D: 128.119.40.186, 80
2
S: 138.76.29.7, 5001
D: 128.119.40.186, 80
138.76.29.7
S: 128.119.40.186, 80
D: 138.76.29.7, 5001
3: Reply arrives
dest. address:
138.76.29.7, 5001
3
192.168.1.2
1
192.168.1.1
192.168.1.3
S: 128.119.40.186, 80
D: 192.168.1.2, 3345
4
192.168.1.4
4: NAT router
changes datagram
dest addr from
138.76.29.7, 5001 to 192.168.1.2, 3345
19
Network Address Translation: Advantages
 No need to be allocated range of addresses
from ISP: - just one public IP address is
used for all devices
16-bit port-number field allows 60,000
simultaneous connections with a single LAN-side
address !
 can change ISP without changing addresses of
devices in local network
 can change addresses of devices in local network
without notifying outside world

 Devices inside local net not explicitly
addressable, visible by outside world (a
security plus)
20
Network Address Translation: Problems
 If both hosts are behind NAT, they will have
difficulty establishing connection
 NAT is controversial:
 routers
should process up to only layer 3
 violates end-to-end argument
• NAT possibility must be taken into account by app
designers, e.g., P2P applications
 address
shortage should instead be solved by
having more addresses --- IPv6 !
21
Outline
 Admin. and recap
 BGP
 IP addressing

IP forwarding
22
IP Datagram Format
IP protocol version
number
header length
(bytes)
“type” of data
max number
remaining hops
(decremented at
each router)
upper layer protocol
to deliver payload to
how much overhead
with TCP?
 20 bytes of TCP
 20 bytes of IP
 = 40 bytes + app
layer overhead
32 bits
type of
ver head.
len service
length
fragment
16-bit identifier flgs
offset
time to upper
Internet
layer
live
checksum
total datagram
length (bytes)
for
fragmentation/
reassembly
32 bit source IP address
32 bit destination IP address
Options (if any)
data
(variable length,
typically a TCP
or UDP segment)
E.g. timestamp,
record route
taken, specify
list of routers
to visit.
23
Data Forwarding: Steps
 Error checking, e.g., check header checksum;
if error, set up error flag
 Decrement TTL; if TTL == 0, set error flag
 If error, drop the packet, and generate
ICMP report
24
The Network Layer
Host, router network layer functions:
Transport layer: TCP, UDP
Network
layer
The IP protocol
•addressing
•datagram format
Routing protocols
•path selection
•RIP, OSPF, BGP
forwarding
ICMP protocol
•error reporting
•router “signaling”
Link layer
Physical layer
25
ICMP: Internet Control Message Protocol
 communicate network-level
information
 error reporting:
unreachable host, network,
port, protocol
 echo request/reply (used by
ping)
 network-layer “above” IP:
 ICMP msgs carried in IP
datagrams
 ICMP message: type, code plus
first 8 bytes of IP datagram
causing error
type
code
checksum
ICMP message body
Type
0
3
3
3
3
3
3
4
Code
0
0
1
2
3
6
7
0
8
9
10
11
12
0
0
0
0
0
description
echo reply (ping)
dest network unreachable
dest host unreachable
dest protocol unreachable
dest port unreachable
dest network unknown
dest host unknown
source quench (congestion
control - not used)
echo request (ping)
route advertisement
router discovery
TTL expired
bad IP header
traceroute is developed by a clever use
of ICMP
26
Data Forwarding: Steps
 If no error, look up packet destination
address in forwarding table:


if datagram for a host on directly attached
network, it is the job of the link layer now
otherwise,
• lookup: find next-hop router, and its outgoing interface
• if needed, do fragmentation
• forward packet to outgoing interface (to the next hop
neighbor)
try %netstat –rn to see the forwarding table
27
Forwarding Look up
default: -
#
prefix
a)
b)
c)
d)
e)
f)
g)
h)
i)
j)
k)
00001
00010
00011
001
0101
011
10
100
1010
1011
1100
interface
0
1
a
g
f
d
e
abc
h
i
jk
The networks are represented by
a decision tree, e.g., a Patricia Trie
to look for the longest match of
the destination address
28
Content Addressable Memory (CAM)
 Standard computer memory (Random Access
Memory or RAM)
Word  Read(addr)
 Write(addr, Word)

 CAM
Word  Read(addr)
 Write(addr, Word)
 AddrList  Search(Word)

29
Ternary CAM
 Binary CAM

It searches words consisting entirely of 1s and
0s.
 TCAM
 It
allows a third matching state of “X” or “Don’t
Care” for one or more bits in the stored data
word.
 “10XX0” will match “10000”, “10010”, “10110”,
“10100”.
30
Prototype of TCAM Based IP Table Lookup
Physical Address
IP Prefix (4 Bytes)
Next Hop (4 Bytes)
0x00000000
0x010203XX (1.2.3.0/24) 1
0x00000008
0x050607XX(5.6.7.0/24) 2
0x00000010
0x0102XXXX(1.2.0.0/16)
0
Keep an eye on hardware revolutions
Solid State Drive (SDD)
Optical Switching or Routing?
31
Example 1 (same network): A->B
src
forwarding table in A
dst
misc
data
fields 223.1.1.1 223.1.1.3
 Look up dest address
 find dest is on same net
 link layer will send the
datagram directly inside a
link-layer frame
Dest. Net. next router
Nhops
223.1.1/24
1
223.1.2/24 223.1.1.4
2
223.1.3/24 223.1.1.4
2
0.0.0.0/0 223.1.1.4
To Internet
A 223.1.1.1
223.1.4.1
223.1.2.1
B
223.1.1.2
223.1.1.4
223.1.2.9
223.1.2.2
223.1.1.3
223.1.3.1
223.1.3.27
223.1.3.2
32
Example 2 (Different Networks): A-> E
forwarding table in A
misc
data
fields 223.1.1.1 223.1.2.3
 look up dest address in
forwarding table
 routing table: next hop
router to dest is 223.1.1.4
 link layer sends datagram to
router 223.1.1.4 inside a linklayer frame

the dest. of the link layer
frame is 223.1.1.4
Dest. Net. next router
Nhops
223.1.1/24
1
223.1.2/24 223.1.1.4
2
223.1.3/24 223.1.1.4
2
0.0.0.0/0 223.1.1.4
To Internet
A 223.1.1.1
223.1.4.1
223.1.2.1
B
223.1.1.2
223.1.1.4
223.1.2.9
223.1.2.3
223.1.1.3
223.1.3.1
223.1.3.27
E
223.1.3.2
33
Example 2 (Different Networks): A-> E
misc
data
fields 223.1.1.1 223.1.2.3
Arriving at 223.1.1.4,
destined for 223.1.2.2
 look up dest address in
router’s forwarding table
 E on same network as
router’s interface 223.1.2.9
 router, E directly attached
 link layer sends datagram to
223.1.2.2 inside link-layer
frame via interface 223.1.2.9
 datagram arrives at
223.1.2.2!! (hooray!)
forwarding table in router
Dest. Net router Nhops interface
223.1.1/24
223.1.2/24
223.1.3/24
0.0.0.0/0
A
B
-
223.1.1.1
1
223.1.1.4
1
223.1.2.9
1
223.1.3.27
223.1.4.1
To Internet
223.1.4.1
223.1.2.1
223.1.1.2
223.1.1.4
223.1.2.9
223.1.2.3
223.1.1.3
223.1.3.1
223.1.3.27
E
223.1.3.2
34
What A Router Looks Like: Outside
35
Look Inside a Router
Two key router functions:
 run routing algorithms/protocol (RIP, OSPF, BGP)

switching datagrams from incoming to outgoing ports
36
Input Port Functions
physical layer:
bit-level reception
data link layer:
e.g., Ethernet
network layer:
lookup output port
using forwarding table
37
Switching: Low End
38
Switching Via An
Interconnection Network
 Overcome bus bandwidth limitations
 fragmenting datagram into fixed length
cells, switch cells through the fabric.
 Crossbar, Banyan networks, and others
 Cisco 12416: switches 320 Gbps (upgradeable to 1.28
Tbps) with 16 slots (each 10G full-duplex) through the
crossbar interconnection network
39
New Potential Bottleneck: Output Ports
 Due to output port contention and head-of-the-Line
(HOL) blocking (i.e., queued datagram at front of
queue prevents others in queue from moving
forward)
40
Head-of-Line Blocking Limits Thrput
 Due to output-port contention and HOL blocking, the stable throughput
is only around 2 - sqrt(2) = 0.586 of line speed !
41
Avoiding Port Contention and HOB
 Virtual output queueing
 Input/output ports matching algorithm
 Switch fabric speedup, e.g., two cells to one
output port
For more details: http://www.cisco.com/warp/public/63/arch12000-swfabric.html
42
Output Ports

Buffering required when

Queueing (delay) and
loss due to output port
buffer overflow !

Scheduling and
queue/buffer
management choose
datagrams arrive from
fabric faster than the
transmission rate
among queued
datagrams for
transmission
43
Backup Slides
44
Another Method for Checking
Convergence: Dispute Wheels
 A dispute
wheel
prefers
R1Q2 over Q1
 u2 prefers
R2 Q3 over
Q2
 etc
 u1
2
210
20
R1
Q2
Q1
u1
130
10
1
3
Q3
d
u3
Qn
Rn
un
0
R2
u2
R3
320
30
45
Patterns of Valid Routes
 Consider how a path is
extended according to the
export policies

case 1 (format: Destination,
link type, …)
• Dest, …CP  Dest, … CP CP
• Dest, …CP  Dest, … CP PC
• Dest, …CP  Dest, … CP PP

case 2
 Valid consecutive link types
(starting from source to
destination, i.e., reserve)





PC PC
CP PC
PP PC
CP CP
CP PP
• Dest, …PC  Dest, … PC PC

case 3
• Dest, …PP  Dest, … PP PC
46
Case 1: The first link of Q1 is a PC link
 Then the first link of R1Q2 must be a PC
link

because u1 chooses R1Q2 and C > E/P
 Thus all remaining links along R1Q2 are PC
links

because only PC follows PC
u1
 Thus the first link of Q2 is
a PC link
 ……
 All links along R1, …, Rn are
PC links
 The network has a PC loop.
contradiction !
PC
Rn
R1
PC
Q1
u2
Qn
un
d
Rn-1
Q2
R2
47
Case 2: The first link of Q1 is a CP/PP link
 All links along Rn are CP links because all
links before CP/PP are CP
 The first link of Qn must be a CP/PP link

because C > E/P
 ……
CP
Rn
u1
CP/PP
Q1
 All links along R1, …, Rn are CP links
 The network has a PC loop.
contradiction !
u2
Qn
un
CP/PP
Rn-1
d
Q2
R2
48
Backup: IP Multicast
49
IP Fragmentation & Reassembly
 Network links have MTU
(max.transfer size) largest possible link-level
frame.
 different link types,
different MTUs, e.g.
Ethernet MTU is 1500
bytes
 Large IP datagram divided
(“fragmented”)
 one datagram becomes
several datagrams
 “reassembled” only at
final destination
 IP header bits used to
identify, order related
fragments
fragmentation:
in: one large datagram
out: 3 smaller datagrams
reassembly
50
IP Fragmentation and Reassembly
Example
 4000 byte
datagram
 MTU = 1500 bytes
length ID fragflag offset
=4000 =x
=0
=0
One large datagram becomes
several smaller datagrams
length ID fragflag offset
=1500 =x
=1
=0
length ID fragflag offset
=1500 =x
=1
=1480
length ID fragflag offset
=1040 =x
=0
=2960
51
IP Multicast: Service Model
128.59.16.12
128.119.40.186
multicast
group
226.17.30.197
128.34.108.63
128.34.108.60

Multicast group concept: use of indirection


Open group model



A group is identified by a location-independent
logical address (class D IP address: prefix 1110)
Anyone can send packets to the “logical” group address
Anyone can join a group and receive packets
Normal, best-effort delivery semantics of IP
Needed: infrastructure to deliver mcast-addressed datagrams to
all hosts that have joined that multicast group
52
Multicast Across LANs

Goal: find a tree (or trees) connecting routers having
local mcast group members

source-based: different tree from sender to each receiver
– Distance-vector multicast routing protocol (DVMRP)
– Protocol-independent multicast-dense mode (PIM-DM)

shared-tree: same tree used by all group members
– Core-Based Tree (CBT)
– Protocol-independent multicast-sparse mode (PIM-SM)
shared tree
source-based trees
53
Source Tree:
Reverse Path Flooding (RPF)
 A router x forwards a packet from source (S)
iff it arrives via neighbor y, and y is on the
shortest path from x back to S
 A packet is replicated to all but the incoming
interface
1
S
1
y
x
1
1
z
t
1
a
54
Reverse Path Forwarding:
Improvement
 Basic idea: forward a packet from S only on
child links for S
 A child link of router x for source S
a link that has x as parent on
the shortest path from the
link to S
 a child x notifies its parent y
(through the routing protocol)
that it has selected y as its
parent

S
y
x
z
t
a
55
Reverse Path Forwarding: Pruning
 No
need to forward datagrams down subtree
with no mcast group members
 “prune” msgs sent upstream by router with
no downstream group members
LEGEND
S: source
R1
router with attached
group member
R4
R2
P
R5
R3
R6
P
R7
P
router with no attached
group member
prune message
links with multicast
forwarding
56
Pruning
 Prune (Source, Group) at a leaf router if no members
 send No-Membership Report (NMR) up tree
 If all children of router R prune (S,G)
 propagate prune for (S,G) to its parent
 What do you do when a member of a group (re)joins?
 send a Graft message to upstream parent
 How to deal with failures?
 prune dropped
 flow is reinstated
 down stream routers re-prune
 Note: again a soft-state approach
57
Implementation of
Source Trees in the Internet
 Multicast OSFP (MOSFP)
 Membership is part of the link state distribution; calculate
source specific, pre-pruned trees
 Reverse Path Forwarding
 Distance Vector Multicast Routing Protocol (DVMRP)
 Protocol Independent Multicast – Dense Mode (PIM-DM)
• very similar to DVMRP


Difference: PIM uses any unicast routing algorithm to
determine the path from a router to the source; DVMRP
uses distance vector
Question: the state requirement of Reverse Path
Forwarding
58
Building a Shared Tree
 Steiner Tree: minimum cost




tree connecting all routers
with attached group members
A Steiner tree is not a
spanning tree because you
do not need to connect all
nodes in the network
Problem is NP-hard
Excellent heuristics exists
Not used in practice:



computational complexity
information about entire network needed
monolithic: rerun whenever a router needs to join/leave
59
Center (Core) based Shared Tree
 Single delivery tree shared by all
 One router identified as
“center” of tree
 Tree construction is receiver-based
 edge router sends unicast join-msg addressed to center
router
 join-msg “processed” by intermediate routers and
forwarded towards center
 join-msg either hits existing tree branch for this center,
or arrives at center
 path taken by join-msg becomes new branch of tree for
this router
 A sender unicasts a packet to center
 The packet is distributed on the tree when it hits the tree
60
Example: M3 Joins
 Group members: M1, M2
core
M1
M2
M3
shared tree
join message
S1
Discussion: what is property of the constructed tree?
61
Example: M1 Sends Data
 Group members: M1, M2, M3
 M1 sends data
core
M1
M2
control (join) messages
data
M3
S1
62
Shared Tree Protocols in the Internet
 Core Based Tree
 Protocol Independent Multicast (PIM)
Sparse mode
 The catch: how do you know the center?

session announcement
63
Mbone: Tunneling
Q: How to connect “islands” of multicast
routers in a “sea” of unicast routers?
physical topology
logical topology
 mcast datagram encapsulated inside “normal” (non-multicast-
addressed) datagram
 normal IP datagram sent thru “tunnel” via regular IP unicast
to receiving mcast router
 receiving mcast router unencapsulates to get mcast datagram
64