Control Update - Applied Research Lab

Substrate Control: Overview
Fred Kuhns
[email protected]
Applied Research Laboratory
Washington University in St. Louis
Defining Terms and Models
Fred Kuhns - 7/17/2015
The SPP Node
• Slice instantiation:
– Allocate virtual machine (VM) instance on a GPE
– may request code option instance, NPE resources and bandwidth
• Share a common set of (global) IP addresses
– UDP/TCP port space shared across GPE/NPEs
• Line card TCAM filters direct traffic
– unregistered traffic originating outside the node is sent to the CP.
– unregistered traffic originating within the node uses NAT (on line card)
– application may register server ports. This causes a filter to be inserted in the line card directing traffic to a specific GPE
– application must register ports (or tunnels) associated with fast path instances
• It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes.
– Currently we only support UDP tunnels but will extend to include GRE and possibly others.
[Figure: SPP node block diagram. Labels: Internet, Ingress, Egress, LC, TCAM, IP route table and ARP, SCD (ARP, nat), map flow to internal destination, NPE, code option, FPx, mi-mux, SCD, SRAM, GPE, app, vmx, RMP, planetlab OS, NMP. Note on the NPE to GPE path: local delivery/exceptions, uses an internal UDP tunnel.]
Meta-Interfaces and Tunnels
• Slice fast path (code option instance, allocated resources) is assumed to sit at one end of a tunnel
– currently only UDP tunnels are supported.
– A UDP tunnel is defined by the 4-tuple
UDP tunnel: {peer ipaddr, peer port, local ipaddr, local port}
– Meta-interface or MI: Represents a tunnel endpoint as viewed by a slice's fast path router. A meta-interface is defined by the local endpoint's address
Meta-Interface: {local ipaddr, local UDP port}
• The encapsulated packet is processed by the fast path.
– packet is always encapsulated within a tunnel by the substrate
– code option instance processes the encapsulated frame
• In the SPP context, the slice registers the MI and the substrate manages encapsulation headers:
– Guard against forging source address
– A filter is installed in the corresponding line card's TCAM to send matching packets to the correct NPE
– NPE's decap module verifies the encapsulation header and provides isolation between slices (based on local IP and port number values in the tunnel header)
– Fabric VLANs are used to provide link level isolation between slice instances. The VLAN label is also used by the substrate to associate packets with slice fast paths.
[Figure: fast path (FPx) with meta-interfaces 0 to 6. MI: local tunnel endpoint (UDP), {external ipaddr, udp_port}]
MI   IP Address    UDP Port
0    192.168.1.2   6060
1    192.168.1.3   6060
2    192.168.1.2   6061
3    192.168.1.2   6062
4    192.168.1.3   6061
5    192.168.1.3   6062
6    192.168.1.3   6063
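The relationship between a UDP tunnel and a meta-interface can be sketched in a few lines of Python. This is only an illustration: the type names, the `rx_mi` helper and the peer address used in the usage note are hypothetical; only the MI table values come from the slide.

```python
from typing import NamedTuple, Optional

class MetaInterface(NamedTuple):
    """Local tunnel endpoint: {local ipaddr, local UDP port}."""
    local_ip: str
    local_port: int

class UdpTunnel(NamedTuple):
    """UDP tunnel 4-tuple: {peer ipaddr, peer port, local ipaddr, local port}."""
    peer_ip: str
    peer_port: int
    local_ip: str
    local_port: int

# Meta-interface table taken from the slide (MI id -> local endpoint).
MI_TABLE = {
    0: MetaInterface("192.168.1.2", 6060),
    1: MetaInterface("192.168.1.3", 6060),
    2: MetaInterface("192.168.1.2", 6061),
    3: MetaInterface("192.168.1.2", 6062),
    4: MetaInterface("192.168.1.3", 6061),
    5: MetaInterface("192.168.1.3", 6062),
    6: MetaInterface("192.168.1.3", 6063),
}

def rx_mi(tunnel: UdpTunnel) -> Optional[int]:
    """Return the MI id whose local endpoint matches the tunnel, else None.

    Only the local half of the 4-tuple identifies the meta-interface;
    the peer half is ignored (hypothetical helper, not a substrate API).
    """
    for mi, endpoint in MI_TABLE.items():
        if endpoint == (tunnel.local_ip, tunnel.local_port):
            return mi
    return None
```

For example, a packet arriving from a made-up peer `10.0.0.9:7000` on local endpoint `192.168.1.3:6061` would be associated with MI 4, whatever the peer address is.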
Lookup Table, TCAM, Use
Lookup filters: Key, Action and Result
• A lookup key is created from the packet's header fields and the receiving meta-interface
– code option extracts fields from the encapsulated packet
– substrate adds the receiving meta-interface identifier
• If no entry is found then the packet's no_route exception attribute is set; otherwise a result is returned containing an action field and forwarding information (output meta-interface and next hop address)
– a code option may define additional exception attributes
• The complete filter specification: {lookup_key, result_vector}
• lookup_key : {RxMI, *copt_key}
– RxMI : Meta interface ID on which the packet was received.
– copt_key : Lookup key defined by the code option. The IPv4 key:
{daddr(32),saddr(32),sport(16),dport(16),tcp_flgs(8),proto(8)}
• result_vector : {sindx, action[, qid, TxMI, nexthop]}
– sindx : stats index
– action: Packet disposition, one of {drop, fwd, ld}
• drop : drop packet;
• fwd : forward packet using next hop value (fwdkey)
• ld : local delivery, code option instance has local address information??
– qid : packet Queue
– TxMI : Meta-interface used for sending packet, corresponds to a previously registered local tunnel
endpoint. Used to fill in the local address of the outgoing packet tunnel header.
– nexthop : Tunnel endpoint for the next hop. For UDP tunnels, this is the IP address and UDP
port number of the next hop device.
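The filter specification above can be modelled as a list of {lookup_key, result_vector} pairs searched in order. The sketch below is a simplification under stated assumptions: each key field either matches exactly or is a wildcard (a real TCAM uses per-bit masks), and the names `Result`, `matches` and `lookup` are illustrative, not substrate functions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

WILDCARD = None  # assumed convention: a None field matches any value

@dataclass(frozen=True)
class Result:
    """result_vector: {sindx, action[, qid, TxMI, nexthop]}."""
    sindx: int
    action: str                                # one of "drop", "fwd", "ld"
    qid: Optional[int] = None
    tx_mi: Optional[int] = None
    nexthop: Optional[Tuple[str, int]] = None  # (ip address, UDP port)

def matches(pattern, key):
    """True if every non-wildcard pattern field equals the key field."""
    return all(p is WILDCARD or p == k for p, k in zip(pattern, key))

def lookup(filters, key):
    """Return (result, exceptions) for the first matching filter.

    key = (RxMI, daddr, saddr, sport, dport, tcp_flgs, proto).
    If nothing matches, the no_route exception attribute is set.
    """
    for pattern, result in filters:
        if matches(pattern, key):
            return result, set()
    return None, {"no_route"}
```

A caller would build `filters` in priority order and treat a non-empty exception set as the signal to hand the packet to the slice's exception path.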
Slice view of the Lookup Key
User specified lookup key (4 32-bit words):
xsid' (12 bits) | xmi (N bits) | slice defined fields (128-N bits)
• When a packet is received the substrate creates a lookup key using the target slice's xsid and the receiving meta-interface. The remaining bits are defined by the code option.
– xsid’ : represents the internal slice ID and may differ from the value of xsid.
For implementation efficiency, this is the VLAN identifier assigned to the slice.
– xmi : Internal representation of the meta-interface (MI), encoding of the received
tunnel endpoint.
• For UDP tunnels this field includes a 4-bit interface id and the 16 bit local UDP port
number. The 4-bit id is used as an index into a table of local IP addresses.
• The IPv4 code option defined fields are shown below where pr is the IP
protocol field and tcp is the TCP header flags.
IPv4 TCAM Filter Formats (on NPE)
Key, substrate defined:
Field     Bits  Notes
T         1     T = 0: normal lookup; T = 1: substrate only lookup
if        4     with RX port, represents the input meta-interface
vlan      11
RX port   16
Key, defined by the IPv4 Code Option, 112 bits:
Field      Bits  Notes
daddr      32
saddr      32
sport      16
dport      16
tcp/proto  16    TCP: 01 (2 bits), 00 (2 bits), flags (12 bits); !TCP: 00 (2 bits), RSV (6 bits), proto (8 bits)
Result, 64 bits:
Field        Bits  Notes
rsv          3
D            1     Drop packet
L            1     Local delivery
rsv          11
sindx        16    global stats index (SCD maps slice's sindx to global value)
TX IP daddr  32
TX dport     16
TX sport     16
rsv          12
QM           2
Sch          3
qid          15    QM, Sch and qid form the 20-bit internal qid
TX IP address and sport represent the output meta-interface. The dport is provided by the slice. (RMP maps miid to tx tunnel params, use dport provided by slice.)
SCD maps slice's miid to QM and Sch. SCD also maps slice's qid to global qid value.
Slice parameters:
Key: Input miid, IPv4 fltr {daddr, saddr, sport, dport, tcp/proto}
Result: Flags {Drop, GPE}, sindx, Output miid, QID
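To make the key layout concrete, the following sketch packs the substrate-defined fields and the IPv4 code option fields into one integer. The field widths are the ones listed on the slide (32 + 112 = 144 bits); the most-significant-first packing order, the `FIELDS` table and the `pack_key` helper are assumptions made for illustration, and the 16-bit tcp/proto word is passed through as an opaque value.

```python
import ipaddress

# (name, width in bits), in the order the slide lists them.
# Packing the first field into the most significant bits is an assumption.
FIELDS = [
    ("T", 1), ("ifid", 4), ("vlan", 11), ("rx_port", 16),   # substrate defined
    ("daddr", 32), ("saddr", 32), ("sport", 16),            # IPv4 code option
    ("dport", 16), ("tcp_proto", 16),
]
KEY_BITS = sum(width for _, width in FIELDS)  # 144

def ip(addr: str) -> int:
    """Dotted-quad IPv4 address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))

def pack_key(**values: int) -> int:
    """Concatenate the fields into a single KEY_BITS-wide integer."""
    key = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        key = (key << width) | value
    return key

# Example values are made up: interface id 1, VLAN 0x10, local UDP port 6060.
example = pack_key(T=0, ifid=1, vlan=0x10, rx_port=6060,
                   daddr=ip("10.5.2.7"), saddr=ip("10.1.1.1"),
                   sport=1234, dport=80, tcp_proto=17)
```

Keeping the layout in a single table makes it easy to change field widths or order if the actual hardware format differs from this reading of the slide.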
Lookup
• The parse block makes the copt_key.
• The substrate adds the xsid and xmi fields.
• The substrate uses the TxMI and nexthop fields to construct the encapsulation header.
[Figure: packet -> decap -> parse block, with annotations {xsid, RxMI}. The lookup key {xsid', xmi, slice defined fields} is presented to Lookup A, which maps xsid:RxMI:copt_key to sindx:action:qid:TxMI:nexthop. TxMI:nexthop is passed on.]
Version 2 and Multicast
• In version 2 there will be 2 stages to the lookup
• Add fanout (count) to lookup B. If fanout > 1 then address of fanout, else result vector; chain fanout blocks
• TxMI includes an interface vector: a 4-bit field that is used to look up the interface IP address and MAC address.
[Figure: packet -> decap -> parse block, with annotations {xsid, RxMI}. The lookup key {xsid', xmi, slice defined fields} is presented to LookupA, which maps lookup_key to action:sindx:rindx. rindx (result_index, overloaded with fanout address) selects an entry in LookupB, sindx:qid:TxMI:nexthop (sindex passed from side A), or in the fanout Table, qid:TxMI:nexthop. Also noted: VLAN table in header format and VLAN table in Decap/Parse.]
Lookup Example
• When a code option is requested the slice is allocated the requested number of TCAM entries; fid ∈ {0,..., Nf-1}
– all TCAM operations accept a TCAM entry ID (fid)
– Entries are listed in priority order with fid=0 the highest priority and entry Nf-1 the lowest.
• It is up to the slice control path to order the lookup entries.
– For example if we have the simple routing database:
10.10.2.1/32 Local delivery (GPE)
10.5.2.0/24 NH A
10.5.1.0/24 NH B
10.5.0.0/16 NH C
• Then the control software could use the following:
write_fltr(fid, rxmi, {prefix,mask}, action, {qid,TxMI,nexthop})
write_fltr(0, *, {10.10.2.1, 0xFFFFFFFF}, LD)
write_fltr(1, *, {10.5.2.0, 0xFFFFFF00}, fwd, {1, 1, NHA})
write_fltr(2, *, {10.5.1.0, 0xFFFFFF00}, fwd, {2, 2, NHB})
write_fltr(3, *, {10.5.0.0, 0xFFFF0000}, fwd, {3, 3, NHC})
Slice BW Allocations
Interface  BW       ipAddr
0*         BE       192.168.1.2
1          100Mbps  10.50.10.2
2          10Mbps   10.1.1.1
Slice Meta-Interfaces
MI  IP Address   UDP Port
0   192.168.1.2  6060
1   10.50.10.2   6061
2   10.50.10.2   6062
3   10.1.1.1     6060
Slice Queue Bindings
QID  Interface  BW      max Bytes
0    0*         Local*
1    1          40%     1024
2    1          60%     1024
3    2          100%    1024
Desired Route Table (LPM)
prefix        TxMI  nexthop
10.10.2.1/32  0*    Local
10.5.2.0/24   1     NH A
10.5.1.0/24   2     NH B
10.5.0.0/16   3     NH C
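The effect of ordering the entries by fid can be checked with a small software model of the filter table. This is a sketch only: `write_fltr` here is a Python stand-in that mirrors the argument order used on the slide, matching is reduced to the destination address and receive MI, and the next hops are kept as the symbolic names NHA, NHB and NHC.

```python
import ipaddress

ANY = "*"        # wildcard receive meta-interface, as in the slide's calls
filters = {}     # fid -> (rxmi, prefix, mask, action, result)

def ip(addr):
    """Dotted-quad IPv4 address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))

def write_fltr(fid, rxmi, match, action, result=None):
    """Install a filter at priority fid (0 is the highest priority)."""
    prefix, mask = match
    filters[fid] = (rxmi, ip(prefix) & mask, mask, action, result)

def lookup(rxmi, daddr):
    """Return (fid, action, result) of the first match in fid order, or None."""
    addr = ip(daddr)
    for fid in sorted(filters):
        f_rxmi, prefix, mask, action, result = filters[fid]
        if f_rxmi in (ANY, rxmi) and addr & mask == prefix:
            return fid, action, result
    return None

# The four entries from the example; result is {qid, TxMI, nexthop}.
write_fltr(0, ANY, ("10.10.2.1", 0xFFFFFFFF), "LD")
write_fltr(1, ANY, ("10.5.2.0", 0xFFFFFF00), "fwd", (1, 1, "NHA"))
write_fltr(2, ANY, ("10.5.1.0", 0xFFFFFF00), "fwd", (2, 2, "NHB"))
write_fltr(3, ANY, ("10.5.0.0", 0xFFFF0000), "fwd", (3, 3, "NHC"))
```

Because the /16 entry sits at fid 3, an address such as 10.5.2.9 hits the more specific /24 at fid 1 first, which is what gives the first-match table longest-prefix-match behavior.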
Example IPv4 LPM
• In general for longest prefix match a good strategy is to divide
allocated filters into 32 sets
• For example assume 1024 TCAM entries have been allocated
and we are using LPM.
– Divide the filters into 32 sets of 32 filters each and associate a prefix
length with each:
Prefix Width  Filter ID Range
32            0 - 31
31            32 - 63
w             (32-w)*32 + (0...31)
1             992 - 1023
– Then for a particular prefix width add it to the appropriate set.
– Entries within a set are non-overlapping so their order doesn’t matter.
– This is the scheme used by software written by IDT, the manufacturer
of the TCAM we currently use.
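The mapping from prefix width to filter ID range is a one-line computation. The sketch below assumes the 1024-entry, 32-filters-per-set layout of the example; `fid_range` and `alloc_fid` are illustrative helper names, not part of the IDT software mentioned above.

```python
def fid_range(width, set_size=32):
    """Filter IDs reserved for prefixes of the given width (1..32).

    Longer prefixes get lower (higher priority) filter IDs:
    start = (32 - width) * set_size.
    """
    if not 1 <= width <= 32:
        raise ValueError("prefix width must be in 1..32")
    start = (32 - width) * set_size
    return range(start, start + set_size)

def alloc_fid(used, width, set_size=32):
    """Pick the first unused filter ID in the set for this prefix width."""
    for fid in fid_range(width, set_size):
        if fid not in used:
            used.add(fid)
            return fid
    raise RuntimeError(f"no free filter for /{width} prefixes")
```

Since entries of the same width never overlap, any free slot within a set will do; the first free one is chosen here only for simplicity.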
Keeping track of TCAM entries
• Substrate will have to manage the mapping of VM
TCAM filter IDs to the actual filter ID.
• VM control software will use a normalized filter index list (starts at 0 and has the requested number of filter entries).
• The SCD (xscale daemon) must map the per-VM index into the actual TCAM index.
• NPU A and B share a common TCAM and index range so this must be managed across the two xscales.
• Source for managing TCAM entries:
– See C++ implementation of the RangeMap class in $WUSRC/range
– Class will also be used for managing the QID name space.
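The per-VM to real index mapping can be illustrated with a small allocator. This is a hypothetical Python sketch and not the RangeMap class referenced above, whose interface is not shown in these slides; the class and method names are invented for the example.

```python
class IndexAllocator:
    """Maps slice-local indices (0..n-1) onto a shared pool of real indices."""

    def __init__(self, size):
        self.free = list(range(size))  # real indices not yet assigned
        self.maps = {}                 # xsid -> list of real indices

    def alloc(self, xsid, count):
        """Reserve count real indices for a slice; they need not be contiguous."""
        if xsid in self.maps:
            raise ValueError("slice already has an allocation")
        if count > len(self.free):
            raise MemoryError("not enough free entries")
        self.maps[xsid] = self.free[:count]
        del self.free[:count]

    def to_real(self, xsid, local):
        """Translate a slice-local index into the real index."""
        table = self.maps[xsid]
        if not 0 <= local < len(table):
            raise IndexError("index outside the slice's allocation")
        return table[local]

    def release(self, xsid):
        """Return a slice's entries to the free pool."""
        self.free.extend(self.maps.pop(xsid))
        self.free.sort()
```

The same structure would serve for any shared index space, which fits the note that one class is intended to manage both TCAM entries and QIDs. Sharing one allocator instance between the two xscales would need coordination that this sketch does not attempt.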
Control Software:
Resource Management
System Resource Manager
[Figure: System Resource Manager block diagram. Labels: CP (SNM, SRM, Resource DB), GPE (vmx, control, root context, RMP, NMP, vnet, planetlab OS), NPE (SCD, TCAM, SRAM, FP k, FP kx, FP, SP), LC (SCD, MUX, TCAM).]
Figure notes:
– node components not in hub (switch, GPEs, Development Hosts)
– Support fast path configuration via the PLC
– Exception and local delivery traffic includes a shim header with RxMI.
Partitioning of (substrate) Responsibilities
• Virtual Machine (Slice control SW): Application logic, code option specific control and data operations.
– traditional PlanetLab slice operations
– manage code option specific lookup tables, stats, memory and configuration blocks
– implements interface with fast path for exception and local delivery traffic
• vnet
– flow isolation: filtering traffic through the linux kernel
– add support for VLAN-based filtering and port reservation
• Resource Manager Proxy (aka Local Resource Manager)
– all VM commands are issued to the RMP
    • the RMP is able to validate command sender (authenticate)
    • enforce access restrictions (authorize)
    • decouples VMs from substrate control entities. That is, maps exported abstractions and interfaces to specific hardware and software interfaces.
– verifies (or inserts) substrate message header slice IDs to prevent deliberate or accidental masquerading - part of ensuring isolation and security.
– in tandem with SRM implements device independent logic
• System Resource Manager
– device independent logic
– responsible for implementing and enforcing
    • system resource abstractions
    • resource isolation and allocation policies
    • facilitating SNM: implementing PlanetLab compatible behavior and abstractions
• Substrate Control Daemon
– intermediary between VM and code option instances (vouches for VM)
– enforces policies on resource allocations and isolation in the control plane
– implements device dependent logic
Responsibilities
[Figure: tables held by the SRM (the "Decider"), the RMP and the SCD (NPE), linked by "request allocation" and "make allocation". Recoverable labels are listed below.]
SRM system tables:
– Interfaces ifn:{type,ipaddr,linkBW,availBW}
– endpoints id:{type,ipaddr,port,proto,board,bw}
– NPE Table id:{addr,BW/Port,copts,fltrs,sram,Qs}
– xsidMap vlanid:xsid (plab sliceID)
– VLAN maps range:{start,end}
– endpoint (port) maps: resvMap, availMap, usedMaps
Per slice tables (xsid):
– NPE (allocated): sram {start,size}, #flts, #Qs, #Stats, BW
– GPE: controlIP, board ID, vlan, BW
– meta-ifaces mi:endpoint
– endpoint (port) maps: servMap, resvMap
– BWmaps??
SCD (NPE):
– Per slice data: Slice Maps xsid: {qidMap,FidMap,statsMap}; Slice SRAM Assignments xsid: {sram_start,sram_size}; Interface BW
– Per interface scheduler and rate limits
– Tables in data path (base SRAM): Lookup Table (fid), Stats Table (sid), Queue Params (qid), VLAN Table (vlan, copt:sram_addr), each reached through per-slice entries (xsid:offset, xsid:range, xsid:size) that yield the "real" index
– HF Control Block? code option control blocks?
– ranges are not required to be contiguous
RMP Responsibilities
• Translate slice MI to local endpoint. Either call SRM or cache mappings.
• Add xsid to subMsg header
• Pass through identifiers mapped by SCD: qid, fid and stats.
• Pass through relative queue weights, SCD maps to global weight.
SCD Responsibilities
• Translate slice specific indices to global indices: qid, fid and stats.
• Knows the location of all tables
• Interprets commands to add, remove and modify entries in data path tables.
• Knows per slice interface BW allocation and maps relative queue weight to global weight.
• Each interface scheduler is assigned a max rate (by the SRM).
Queuing and allocating
Interface Bandwidth
Simple Queuing Example
Slice Interface and Queue Allocations:
{Port, BW, QList}, QList = {{qid, weight, threshold},...}
Physical Port (Interface)
Attributes:
{ifn, type, ipaddr, linkBW, availBW}
ifn : Interface number
type: {Internet, Peering}
Operations:
get_interfaces()
get_ifattrs(ifn)
get_ifpeer(ifn)
alloc_ifbw(ifn,xsid,bw)
[Figure: on the NPE, FP slice1 has queues q10, q11, ..., q1n' (qid in 0...n-1) served by a wrr scheduler with allocation BW11, and FP slice2 has queues q20, q21, ..., q2m' (qid in 0...m-1) with allocation BW21. On the LC, a wrr scheduler serves FP1 and FP2 on an interface with attributes ipAddr and linkBW and bandwidth BW1, where BW11 + BW21 = BW1. GPEs are also shown.]
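The bandwidth bookkeeping implied by this example can be sketched as follows. The semantics are assumptions: `alloc_ifbw` is modelled as a method that succeeds only while the request fits in availBW, and `queue_rates` splits a slice's interface allocation among its queues in proportion to their relative weights. The 100 Mbps allocation and the 40%/60% weights are the figures from the earlier Lookup Example slide; the 1000 Mbps link rate is made up.

```python
class Interface:
    """Physical port with attributes {ifn, linkBW, availBW} (Mbps)."""

    def __init__(self, ifn, link_bw):
        self.ifn = ifn
        self.link_bw = link_bw
        self.avail_bw = link_bw
        self.allocations = {}  # xsid -> allocated bandwidth

    def alloc_ifbw(self, xsid, bw):
        """Reserve bw for a slice; refuse requests that exceed availBW."""
        if bw <= 0 or bw > self.avail_bw:
            return False
        self.allocations[xsid] = self.allocations.get(xsid, 0) + bw
        self.avail_bw -= bw
        return True

def queue_rates(slice_bw, weights):
    """Split a slice's interface bandwidth by relative queue weight."""
    total = sum(weights.values())
    return {qid: slice_bw * w / total for qid, w in weights.items()}

port = Interface(ifn=1, link_bw=1000)   # hypothetical 1000 Mbps link
ok = port.alloc_ifbw(xsid=0x10, bw=100) # slice gets 100 Mbps on interface 1
rates = queue_rates(100, {1: 40, 2: 60})
```

With these inputs queue 1 is entitled to 40 Mbps and queue 2 to 60 Mbps, and the interface still has 900 Mbps available to other slices.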
Substrate Message Format
Substrate Message
[Figure: message layout. msg header fields mlen, cid, mid, cmd (bits 15..0 each), followed by body: 0-N(B).]
mlen: Total message length, including the header.
mid: Message ID, used to support synchronous message processing.
cid: Context identifier. Specifies the context within which the message is processed. A value of 0 indicates substrate context.
cmd: Command to execute or a return code.
The 4 header fields are each 16 bits.
body: 0 or more bytes of command data.
• Assume a simple command response (two-way) messaging framework, but will support one-way schemes.
• Supports asynchronous communications using a message ID.
• The command field is overloaded for the return code.
• Every server is expected to implement a simple Version command (cmd == 0) which returns the server's ID and Version number as two 32-bit fields.
– primary use is for monitoring health of servers and debugging.
– All other command values are unique only to a particular server.
• Uses UDP as the transport protocol.
• All commands are expected to be idempotent
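A header of four 16-bit fields followed by an opaque body maps directly onto Python's `struct` module. The sketch below assumes the wire order mlen, mid, cid, cmd (the order in which the fields are described; the slide does not state it explicitly), big-endian encoding, and a return code of 0 for success; the function names are illustrative.

```python
import struct

# Four unsigned 16-bit fields in network byte order.
# ASSUMPTION: wire order is mlen, mid, cid, cmd.
HEADER = struct.Struct("!HHHH")
CMD_VERSION = 0

def pack_msg(mid, cid, cmd, body=b""):
    """Build a substrate message; mlen covers header plus body."""
    return HEADER.pack(HEADER.size + len(body), mid, cid, cmd) + body

def unpack_msg(data):
    """Split a message into (mid, cid, cmd, body), checking mlen."""
    mlen, mid, cid, cmd = HEADER.unpack_from(data)
    if mlen != len(data):
        raise ValueError("mlen does not match the datagram length")
    return mid, cid, cmd, data[HEADER.size:]

def version_reply(mid, cid, server_id, version):
    """Reply to the Version command: two 32-bit fields, return code 0."""
    return pack_msg(mid, cid, 0, struct.pack("!II", server_id, version))
```

Echoing the request's mid in the reply is what lets a client match responses to outstanding requests when several are in flight over UDP.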
Overview
• In the interface specifications I provide a C-like description of the operations and results.
• The descriptions are only intended to describe the actual message format, data fields and returned results. They are not meant to specify an application level library.
• The arguments are to be encoded into the message body in the order they are given, using network byte order (Big Endian) and without padding.
• All commands result in one of:
1. No return response: one-way call semantics.
2. An error occurs processing the message, or the command encounters an unexpected condition or error. In this case the return message will have the error return code in the cmd field.
3. The command completes and does not indicate an error to the message framework. The message result code then indicates success, and the message body contains any result data.
Example Message
• A slice with xsid 0x10 requests the allocation of a global UDP port (UDP is IP protocol 17 decimal) for the local IP address 128.252.130.34 (hex 0x80FC8222), and lets the system assign the next available port number.
– Assume the alloc_port command ID is 4.
port = alloc_port(0x80FC8222, 0, 17)
• The resource manager allocates port 5050 (0x13BA); the return code of 0 indicates success.
Command Message (hex): header {mid 1, mlen F, cid 10, cmd 4}, body {80 FC 82 22, 00 00, 11}
Reply Message (hex): header {mid 1, mlen F, cid 10, cmd 0}, body {80 FC 82 22, 13 BA, 11}
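The byte values in this example can be reproduced by encoding the three alloc_port arguments in order, big-endian and unpadded, behind the 8-byte header. As a sketch it assumes the header order mlen, mid, cid, cmd and that the slice's xsid travels in the cid field; `encode_alloc_port` is an illustrative name, not a library call.

```python
import struct

def encode_alloc_port(mid, xsid, ipaddr, port, proto, cmd):
    """Encode an alloc_port command or its reply.

    Body: 32-bit IP address, 16-bit port, 8-bit protocol, with no padding.
    ASSUMPTION: header order is mlen, mid, cid, cmd and cid carries the xsid.
    """
    body = struct.pack("!IHB", ipaddr, port, proto)
    header = struct.pack("!HHHH", 8 + len(body), mid, xsid, cmd)
    return header + body

# Command: alloc_port(0x80FC8222, 0, 17) with command ID 4.
command = encode_alloc_port(1, 0x10, 0x80FC8222, 0, 17, cmd=4)
# Reply: return code 0 (success), allocated port 5050 (0x13BA).
reply = encode_alloc_port(1, 0x10, 0x80FC8222, 5050, 17, cmd=0)
```

The total length comes out as 15 bytes (8 header + 7 body), which matches the 0xF shown in the example messages.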
NAT
• Problem:
– UDP, TCP: 2 or more GPEs attempt to use same global IP,
Port and Proto
– ICMP: ???
[Equations relating BW_{i,j}, w_{i,j}, W_j, BW_j, BW_{j,min} and MTU; the formulas are not recoverable from this transcript.]