Transcript 07-seattle

SEATTLE
- A Scalable Ethernet Architecture
for Large Enterprises
M.Sc. Pekka Hippeläinen
IBM
phippela@gmail
1.10.2012
T-110.6120 – Special Course in Future Internet Technologies
1
SEATTLE
 Based on, and pictures borrowed from: Kim, C.; Caesar, M.; Rexford, J. Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises
 Is it possible to build a protocol that maintains the
same configuration-free properties as Ethernet
bridging, yet scales to large networks?
Contents
 Motivation: network management challenge
 Ethernet features: ARP and DHCP broadcasts
 1) Ethernet Bridging
 2) Scaling with Hybrid networks
 3) Scaling with VLANs
 Distributed Hashing
 SEATTLE approach
 Results
 Conclusions
1.10.2012
3
Network management challenge
 IP networks require massive effort to configure and manage
 As much as 70% of an enterprise network's cost goes to maintenance and configuration
 Ethernet is much simpler to manage
 However, Ethernet does not scale well beyond small LANs
 The SEATTLE architecture aims to provide the scalability of IP with the simplicity of Ethernet management
Why is Ethernet so wonderful?
 Easy to setup, easy to manage
 DHCP server, some hubs, plug’n play
Flooding query 1: DHCP
requests
 Let's say node A joins the Ethernet
 To get an IP address / confirm its IP, node A sends a DHCP request as a broadcast
 The request floods through the broadcast domain
Flooding query 2: ARP
 In order for node A to communicate with node B in the same broadcast domain, the sender needs the MAC address of node B
 Let's assume that node B's IP is known
 Node A sends an Address Resolution Protocol (ARP) broadcast to find out the MAC address of node B
 Similarly to the DHCP broadcast, the request is flooded through the whole broadcast domain
 This is basically an {IP -> MAC} mapping
Why is flooding bad?
 Large Ethernet deployments contain a vast number of hosts and thousands of bridges
 Ethernet was not designed for such a scale
 Virtualization and mobile deployments can cause many dynamic events, generating control traffic
 Broadcast messages need to be processed by the end hosts, interrupting the CPU
 The bridges' forwarding tables grow roughly linearly with the number of hosts
1) Ethernet bridging
 Ethernet consists of segments, each comprising a single physical layer
 Ethernet bridges are used to interconnect segments into a multi-hop network, i.e. a LAN
 This forms a single broadcast domain
 A bridge learns how to reach a host by inspecting incoming frames and associating the source MAC address with the incoming port
 A bridge stores this information in a forwarding table, using the table to forward frames in the correct direction
Bridge spanning tree
 One bridge is configured to be the root bridge
 Other bridges collectively compute a spanning
tree based on the distance to the root
 Thus traffic is not forwarded along the shortest path but along the spanning tree
 This approach avoids broadcast storms
2) Hybrid IP/Ethernet
 In this approach multiple LANs are interconnected with IP routing
 In hybrid networks each LAN contains at most a few hundred hosts that form an IP subnet
 Each IP subnet is associated with an IP prefix
 Assigning IP prefixes to subnets and associating subnets with router interfaces is a manual process
 Unlike a MAC address, which is a host identifier, an IP address denotes the host's current location in the network
Drawbacks of Hybrid approach
 The biggest drawback is the configuration overhead
 Router interfaces must be configured
 Hosts must have IP addresses corresponding to the subnet they are located in (DHCP can be used)
 Networking policies are usually defined per network prefix, i.e. by topology
 When the network changes, the policies must be updated
 Limited mobility support
 Mobile users & virtualized hosts in datacenters
 If the IP address is to stay constant, the user must stay on the same subnet
3) Virtual LANs
 Overcomes some problems of Ethernet and IP networks
 Administrators can logically group hosts into the same broadcast domain
 VLANs can be configured to overlap – configuring the bridges, not the hosts
 Broadcast overhead is reduced by the isolated domains
 Mobility is simplified – the IP address can be retained while moving between bridges
Virtual LANs
 Traffic from B1 to B2 can be ‘trunked’ over
multiple bridges
 Inter-domain traffic needs to be routed
Drawbacks of VLANs
 Trunk configuration overhead
 Extending a VLAN across multiple bridges requires the VLAN to be configured at each participating bridge – often manual work
 Limited control plane scalability
 Bridges maintain forwarding table entries and broadcast traffic for every active host in every VLAN visible to them
 Insufficient data plane efficiency
 Single spanning tree is still used within each VLAN
 Inter-VLAN traffic must be routed via IP gateways
Distributed Hash Tables
 Hash tables are used to store {key -> value} pairs
 With multiple nodes, there is a nice way to
 Keep the nodes symmetric
 Distribute the hash table entries evenly among the nodes
 Keep reshuffling of entries small when adding or removing nodes
 The idea is to calculate H(key), which is mapped to a host – one can visualize this as mapping to an angle (or to a point on a circle)
Distributed Hash Tables
 Each node is mapped to randomly distributed points on the circle
 Thus each node owns multiple buckets
 One calculates H(key) and stores the entry at the node owning that bucket
 If a node is removed, its values are reassigned to the next buckets
 If a node is added, some entries move to the new buckets
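The bucket mechanics above can be sketched as a small consistent-hash ring. This is a minimal illustration, not the paper's implementation; the class and names are invented, and SHA-1 stands in for whatever hash function an implementation would choose:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes via points on a hash circle.
    Each node owns several randomly placed virtual points (buckets),
    so load spreads evenly and only neighbouring entries move when
    a node joins or leaves."""

    def __init__(self, replicas=8):
        self.replicas = replicas   # virtual points per node
        self.ring = []             # sorted hash positions
        self.owner = {}            # position -> node name

    def _hash(self, value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}:{i}")
            bisect.insort(self.ring, pos)
            self.owner[pos] = node

    def remove_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}:{i}")
            self.ring.remove(pos)
            del self.owner[pos]

    def resolve(self, key):
        """Return the node owning H(key): the first bucket clockwise."""
        pos = self._hash(key)
        idx = bisect.bisect(self.ring, pos) % len(self.ring)
        return self.owner[self.ring[idx]]
```

Removing a node only deletes that node's points, so any key whose owning bucket belonged to another node keeps the same owner – exactly the "small reshuffling" property.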
SEATTLE approach 1/2
 1) Switches calculate shortest paths among themselves
 This is a link-state protocol – basically Dijkstra
 A switch-level discovery protocol – Ethernet hosts do not respond
 The switch topology is much more stable than the host-level topology
 Much more scalable than at the host level
 Each switch has an ID – one MAC address of the switch's interfaces
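The switch-level shortest-path computation is essentially Dijkstra's algorithm run over the link-state database. A minimal sketch, with an invented adjacency-map representation of the topology:

```python
import heapq

def shortest_paths(links, source):
    """Dijkstra's algorithm over the switch-level topology.
    links: {switch: {neighbour: link_cost}}.
    Returns the distance from source to every reachable switch."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry, skip
        for v, w in links.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd              # found a shorter path to v
                heapq.heappush(heap, (nd, v))
    return dist
```

Because only switches participate, the graph stays small and stable compared with tracking every host.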
SEATTLE approach 2/2
 2) A DHT is used in the switches
 {IP -> MAC} mapping
 This essentially answers ARP requests, avoiding flooding
 {MAC -> location} mapping
 Once the destination switch is located, routing along the shortest path can be used
 The DHCP service location can also be stored
 SEATTLE thus reduces flooding, allows the use of shortest paths, and offers a nice way to locate the DHCP service
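The two mappings can be sketched together as unicast directory lookups that replace an ARP flood. This is a simplified illustration, not SEATTLE's actual implementation: `F` is a crude hash-mod stand-in for the consistent hash, and `directory` models each resolver switch's store:

```python
import hashlib

def F(key, switches):
    """Hypothetical stand-in for SEATTLE's consistent hash:
    deterministically map a key to a resolver switch."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return switches[h % len(switches)]

def publish(directory, switches, key, value):
    """Store {key -> value} at the resolver switch F(key)."""
    directory.setdefault(F(key, switches), {})[key] = value

def lookup(directory, switches, key):
    return directory.get(F(key, switches), {}).get(key)

def resolve_destination(directory, switches, dst_ip):
    """Two unicast lookups replacing an ARP flood:
    first {IP -> MAC}, then {MAC -> location} for the egress switch."""
    mac = lookup(directory, switches, dst_ip)
    return mac, lookup(directory, switches, mac)
```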
SEATTLE
 Control overhead is reduced with consistent hashing
 When the set of switches changes due to network failure or recovery, only some entries must be moved
 Load balancing with virtual switches
 If some switches are more powerful, a switch can represent itself as many virtual switches – receiving more load
 Enabling flexible service discovery
 This is mainly DHCP – but could be something like {"PRINTER" -> location}
Topology changes
 Adding and removing switches/links can alter the topology
 Switch/link failures and recoveries can also lead to partitioning events (rarer)
 Non-partitioning link failures are easy to handle – the resolver for a hash entry does not change
Switch failures
 If a switch fails or recovers, hash entries need to be moved
 The switch that published a value monitors the liveness of its resolver, republishing the entry when needed
 The entries have a TTL
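The TTL mechanism can be sketched as a resolver-side store whose entries expire unless the publisher republishes them. A minimal illustration with invented names; real timing and liveness monitoring would of course be more involved:

```python
import time

class ResolverStore:
    """Resolver-side hash table whose entries expire unless the
    publishing switch republishes them (a sketch, names invented)."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.entries = {}          # key -> (value, time published)

    def publish(self, key, value, now=None):
        self.entries[key] = (value, time.time() if now is None else now)

    def lookup(self, key, now=None):
        now = time.time() if now is None else now
        item = self.entries.get(key)
        if item is None or now - item[1] > self.ttl:
            self.entries.pop(key, None)   # expired – publisher must republish
            return None
        return item[0]
```

The `now` parameter only exists so the expiry behaviour can be demonstrated deterministically.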
Partitioning events
 Each switch also has to keep track of its locally stored location entries
 If switch s_old is removed or unreachable, all switches need to remove its location entries
 This approach correctly handles partitioning events
Scaling:
location
 Hosts use a directory service to publish and maintain {mac -> location} mappings
 When host a with mac_a arrives, it accesses switch S_a (steps 1-3)
 Switch S_a publishes {mac_a, location} by calculating the correct bucket F(mac_a), i.e. the resolver switch
 When node b wants to send a message to node a
 F(mac_a) is calculated to fetch the location
 'Reactive resolution' – even cache misses do not lead to flooding
Scaling:
ARP
 When node b makes an ARP request, SEATTLE converts it into a lookup of IP_a at resolver F(IP_a), which returns mac_a
 The resolver switch for F(IP_a) is usually different from the one for F(mac_a)
 Optimization for hosts making ARP requests
 The F(IP_a) address resolver can also store mac_a and S_a
 When node b makes the F(IP_a) ARP request, the mac_a -> S_a mapping is also cached at S_b
 The shortest path (-> path 10) can now be used
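The optimization above can be sketched as a single lookup that returns both answers: the F(IP_a) resolver stores the location alongside the MAC, so the querying switch learns both at once and caches the MAC-to-location entry. A simplified illustration with invented names (`F` is a hash-mod stand-in for the consistent hash):

```python
import hashlib

def F(key, switches):
    """Hypothetical consistent-hash stand-in mapping a key to a switch."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return switches[h % len(switches)]

def arp_resolve(dst_ip, switches, resolvers, location_cache):
    """One lookup answers the ARP request and the location query:
    the F(IP_a) resolver stores (mac_a, S_a), and the querying
    switch caches mac_a -> S_a so later frames take the shortest
    path without another lookup."""
    mac, location = resolvers[F(dst_ip, switches)][dst_ip]
    location_cache[mac] = location    # S_b caches mac_a -> S_a
    return mac, location
```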
Handling host dynamics
 Location change
 Wireless handoff
 VM moved but retaining MAC
 Host MAC address changes
 NIC card replaced
 Failover event
 VM migration forcing MAC change
 Host changes IP
 DHCP lease expires
 Manual reconfiguration
Insert, delete and update
 Location change
 Host h moves from s_old to s_new
 s_new updates the existing mac-to-location entry
 MAC change
 IP-to-MAC update
 MAC-to-location deletion (old) and insertion (new)
 IP change
 S_h deletes old IP-to-MAC and inserts new IP-to-MAC
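The three cases above map directly onto insert, delete, and update operations against the two directories. A minimal sketch, treating each directory as a plain dict:

```python
def location_change(loc_dir, mac, s_new):
    """Host moves: s_new overwrites the MAC-to-location entry."""
    loc_dir[mac] = s_new

def mac_change(ip_dir, loc_dir, ip, old_mac, new_mac, switch):
    """MAC change: update IP-to-MAC, delete the old MAC-to-location
    entry and insert the new one."""
    ip_dir[ip] = new_mac
    loc_dir.pop(old_mac, None)
    loc_dir[new_mac] = switch

def ip_change(ip_dir, old_ip, new_ip, mac):
    """IP change: delete the old IP-to-MAC entry and insert the new one."""
    ip_dir.pop(old_ip, None)
    ip_dir[new_ip] = mac
```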
Ethernet: Bootstrapping
hosts
 Hosts are discovered by their access switches
 SEATTLE switches snoop ARP requests
 Most OSes generate an ARP request at boot-up or when an interface comes up
 Also DHCP messages, or the host going down, can be used
 Host configuration without broadcast
 The DHCP server's access switch hashes the string "DHCP_SERVER" and stores the location at the resolver switch
 The "DHCP_SERVER" string is used to locate the service
 No need to broadcast for ARP or DHCP
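The service-discovery trick is the same DHT pattern with a well-known string as the key. A minimal sketch with invented names (`F` again stands in for the consistent hash):

```python
import hashlib

def F(key, switches):
    """Hypothetical consistent-hash stand-in mapping a key to a switch."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return switches[h % len(switches)]

def publish_service(directory, switches, name, location):
    """The service's access switch publishes {name -> location}
    at the resolver switch F(name)."""
    directory.setdefault(F(name, switches), {})[name] = location

def locate_service(directory, switches, name):
    """Clients hash the same well-known string and unicast the
    query to F(name) instead of broadcasting."""
    return directory.get(F(name, switches), {}).get(name)
```

Any agreed-upon string works the same way, which is how something like {"PRINTER" -> location} would be supported.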
Scalable and flexible VLANs
 To support broadcasts, the authors suggest using groups
 Similar to a VLAN, a group is defined as a set of hosts sharing the same broadcast domain
 Groups are not limited to layer-2 reachability
 Multicast-based group-wide broadcasting
 A multicast tree with a broadcast root for each group
 F(group_id) is used to locate the broadcast root
Simulations
 1) Campus ~40 000 students
 517 routers and switches
 2) AP-Large (Access Provider)
 315 routers
 3) Datacenter (DC)
 4 core routers with 21 aggregation switches
 Routers were converted to SEATTLE switches
Cache timeout and AP-large
with 50k hosts
 The shortest-path cache timeout has an impact on the number of location lookups
 Even with a 60 s timeout, 99.98% of packets were forwarded without a lookup
 Control overhead (blue) decreases very fast, whereas the table size increases only moderately
 The shortest path is used for the majority of routing in these simulations
Table size increase in DC
 Ethernet bridges store an entry for each destination – ~O(sh) behaviour across the network
 SEATTLE requires only ~O(h) state, since only access and resolver switches need to store location information for each host
 With this topology the table size was reduced by a factor of 22
 In the AP-large case the factor increased to 64
Control overhead in AP-large
 The number of control messages over all links in the topology, divided by the number of switches and the duration of the trace
 SEATTLE significantly reduces control overhead in the simulations
 This is mainly because Ethernet generates network-wide floods for a significant number of packets
Effect of switch failure in
DC
 Switches were allowed to fail randomly
 The average recovery time was 30 seconds
 SEATTLE can use all the links in the topology, whereas Ethernet is restricted to the spanning tree
 Ethernet must recompute the tree, causing outages
Effect of host mobility in
Campus
 Hosts were randomly moved between access switches
 For high mobility rates, SEATTLE's loss rate was lower than Ethernet's
 On Ethernet it takes some time for switches to evict the stale location information and relearn the new location
 SEATTLE provided low loss and broadcast overhead
What was omitted
 The authors suggest multi-level one-hop DHTs
 In large dynamic networks it can be beneficial to store entries close by
 This is achieved with regions and a backbone – border switches connect to the backbone switches
 With topology changes
 An approach to seamless mobility is described in the paper
 Remote host caches are updated with switch-based MAC revocation lists
 Some simulation results
 The authors also made a sample implementation
Conclusions
 Operators today face challenges in managing and configuring large networks. This is largely due to the complexity of administering IP networks.
 Ethernet is not a viable alternative
 Poor scaling and inefficient path selection
 SEATTLE promises scalable, self-configuring routing
 Simulations suggest efficient routing and low latency with quick recovery
 Host mobility is supported with low control overhead
 Ethernet stacks at end hosts are not modified
Thank you for your attention!
Questions? Comments?