Transcript slides

OFED TCP Port Mapper Proposal
June 15, 2011
Overview
• Current NE020 Linux OFED driver uses host TCP/IP stack MAC and IP address for RDMA connections
  • Hardware tags packets used for RDMA connection management for easy identification
  • Host TCP/IP stack services used for address resolution and neighbor updates
  • RDMA CM claims TCP port creating a kernel socket when the unified portspace patch is applied and support is enabled via module option:
    http://git.openfabrics.org/git?p=~amirv/ofed_1_5.git;a=blob;f=kernel_patches/fixes/cma_0100_unified_tcp_ports.patch;h=cfe1288041929f2940252de9b8ba15f2e35b2997;hb=ofed_kernel_1_5
• Unified portspace kernel patch is applied only when OFED distribution is used intact
• At least one OSV is moving to a model where OFED kernel patches will not be applied
  • Red Hat starting with RHEL 6.0
• iSCSI hardware acceleration has moved to a separate MAC/IP address that is not visible to the Linux TCP/IP stack (private interface)
  • Linux community has rejected previous push for including the portspace patch rather violently
  • Suggestion from Linux community is to do what iSCSI did
• Goal of this presentation is to …
  • Describe a solution to the iWARP TCP portspace issue using the Sockets Direct Protocol Port Mapper and Netlink sockets
Current OFED iWARP CM Flows
(Listen)
[Diagram: listen flow through Userspace Verbs, UCM, Userspace Provider Library, Sockets, Kernel CM, OFED Kernel Verbs, Kernel Provider (Mini-cm), Linux TCP/IP Stack (L2, MAC0/IP0) and RNIC. Steps: 1. Rdma_listen(Local IP0, Local Port0); 2. Transition to Kernel CM; 3. Interface Selected; 4. Port Selected; 5. Kern_socket, bind; 6. create_listen; 7. Setup Hardware]
• Application issues rdma_listen
  • In case of userspace application, kernel transition occurs
  • Local IP address is the Linux IP address (IP0)
• OFED CM selects an interface and selects a local port from the appropriate portspace
  • Simple case (IP0 and TCP Port0)
  • Local port can be ANY; CM picks a port
  • Local IP can be ANY; CM issues listen to all interfaces
  • If local IP and Port are ANY, port must be accepted on all interfaces
• Portspace patch issues Socket and Bind for iWARP providers
  • This portion has not been accepted to the kernel
  • Patch exists in the OFED package
  • Default just has kernel CM picking a port independent of the host TCP/IP stack
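The effect of the Socket and Bind step can be illustrated from userspace: once a socket is bound in the host stack, no other sockets application can claim the same address and port. This is only a sketch on a loopback address. The portspace patch itself does this with a kernel socket inside the RDMA CM, and the helper name `reserve_port` is hypothetical.

```python
import socket

def reserve_port(ip="127.0.0.1", port=0):
    """Bind a TCP socket so the host stack marks the port as in use.

    port=0 mirrors the "local port can be ANY" case: the stack picks one.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((ip, port))
    return s, s.getsockname()[1]

holder, claimed = reserve_port()

# A second bind to the same address and port fails while the first is held.
other = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    other.bind(("127.0.0.1", claimed))
    conflict = False
except OSError:
    conflict = True
finally:
    other.close()
```

Without the patch nothing in the host stack records the port the CM picked, so the second bind in this sketch would succeed and the two users of the port would collide.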
Current OFED iWARP CM Flows
(Connect)
[Diagram: connect flow through Userspace Verbs, UCM, Userspace Provider Library, Sockets, Kernel CM, OFED Kernel Verbs, Kernel Provider (Mini-cm), Linux TCP/IP Stack (L2, MAC0/IP0) and RNIC. Steps: 1. Rdma_connect(Local IP0, Local Port0, Remote IP2, Remote Port2); 2. Transition to Kernel CM; 3. Interface Selected; 4. Port Selected; 5. Kern_socket, bind; 6. connect; 7. Setup Hardware; 8. Neighbour Update; 9. CM Packets]
• Application issues rdma_connect
  • In case of userspace application, kernel transition occurs
  • Local and remote IP addresses are the Linux IP addresses (IP0, IP2)
• OFED CM selects an interface and selects a local port from the appropriate portspace
  • Local IP can be ANY
  • CM uses the Linux stack to pick an interface; this usually handles the Neighbour update before getting to the provider
• Portspace patch issues Socket and Bind for iWARP providers
• Kernel provider is informed of (and can trigger) Neighbour updates to stay in sync with the Linux TCP/IP stack
• Kernel provider mini-cm handles the TCP/IP three-way handshake and MPA exchange through dev_queue_xmit and a private receive path
New OFED iWARP CM Architecture
[Diagram: proposed architecture. Userspace: Userspace Verbs, UCM, iWARP Port Mapper Daemon, Userspace Provider Library, Userspace Mini-cm or Private TCP, Sockets, Netlink Sockets. Kernel: Netlink Sockets, CM, OFED Core, OFED Kernel Verbs, Linux TCP/IP Stack (L2, MAC0/IP0), Kernel Provider w/netlink Client (MAC1/IP1), Kernel Mini-cm, RNIC]
• Similar to current flow for CM
• OFED has new iWARP Port Mapper Daemon in userspace
• OFED has new netlink interface between user and kernel
  • Introduced for statistics
  • Extended for iWARP providers and new Port Mapper Daemon
• Netlink interface roughly modeled after iSCSI
• Supports (but does not require) second MAC/IP addresses on local and on remote peer (soft iWARP)
• Netlink Messages:
  • Port Mapper Netlink Upcalls: Query PID, Add/Remove Mapping, Query Mapping
  • Provider Netlink Upcalls: Query PID, Connect, Listen, Resolve
  • Provider Netlink Downcalls: Inbound Connect, Operation Complete for upcalls, Interface Down
• Three RNIC models supported
  • RNICs with CM in Kernel/Adapter
  • RNICs with CM in userspace
  • Hybrid RNICs with userspace CM that requires adapter assistance
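The three netlink message groups named on this slide can be written down as plain enumerations. This is a sketch only: the identifier names and numeric values are hypothetical, and the direction labels reflect one reading of "upcall" (kernel to userspace) and "downcall" (userspace to kernel), not a definition from the proposal.

```python
from enum import Enum, auto

class PortMapperUpcall(Enum):
    """Messages sent to the iWARP Port Mapper Daemon."""
    QUERY_PID = auto()
    ADD_MAPPING = auto()
    REMOVE_MAPPING = auto()
    QUERY_MAPPING = auto()

class ProviderUpcall(Enum):
    """Messages sent to the userspace provider library."""
    QUERY_PID = auto()
    CONNECT = auto()
    LISTEN = auto()
    RESOLVE = auto()

class ProviderDowncall(Enum):
    """Messages sent from the userspace provider library to the kernel provider."""
    INBOUND_CONNECT = auto()
    OPERATION_COMPLETE = auto()
    INTERFACE_DOWN = auto()

def direction(message):
    """Return the assumed direction of travel for a message."""
    if isinstance(message, (PortMapperUpcall, ProviderUpcall)):
        return "kernel->user"
    if isinstance(message, ProviderDowncall):
        return "user->kernel"
    raise TypeError("not a known netlink message")
```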
iWARP Port Mapper Concept
• Port Mapper concept was introduced by the RDMA Consortium as part of the Sockets Direct Protocol specification
  • http://www.rdmaconsortium.org/home/draft-pinkerton-iwarp-sdp-v1.0.pdf
• Provides a mechanism to have an iWARP port space separate from the Linux TCP port space
  • iWARP port space can be on an independent IP address or single IP address
• Port Mapper service runs over TCP on a well known port (3935) on Linux IP addresses
  • Listen issued at service startup
• Port Mapper service rdma_listen steps:
  • Register a mapping between Linux IP Address/TCP Port and iWARP IP Address/TCP Port with the Port Mapper service
• Port Mapper service rdma_connect steps:
  • Receive a query request from a Port Mapper service client
  • Connect to remote peer on well known port
  • Query RDMA peer’s iWARP IP Address/TCP port using the SDP Port Mapper protocol (PMRequest)
  • Return information from the PMAccept message to the client of the Port Mapper service
• Port Mapper service peer query steps:
  • Accept Port Mapper connection (port 3935 to Linux IP address) from node issuing the query
  • Receive the PMRequest message
  • Look up the IP address and Port from the PMRequest in the local database from the rdma_listen step
  • Return the mapped IP address and port information in a PMAccept message
• iWARP provider issues iWARP connect using an iWARP local and remote IP Address/TCP port “quad” after receiving the PMAccept message
  • Later slides show more detail
Pending Netlink Patch for OFED
• A patch has been submitted recently to query RDMA connection information via netlink
  • Roland has rolled this patch into the linux-next patch set for late May
  • This patch introduces a single OFED netlink port and an Infiniband netlink infrastructure in ib_core
• Components interested in adding netlink capabilities to OFED can register with Infiniband netlink infrastructure
  • Support for 32 clients within OFED and 1024 operations for each client
  • Only a single client is currently defined (rdma_cm)
  • The Port Mapper daemon consumes one client
  • Each iWARP provider consumes an additional client
• The dump netlink operation is used to provide data back to the netlink client
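The client and operation limits above can be pictured as a two-level dispatch table. The following is a userspace model only, not the ib_core interface: the class, its methods and the client index used in the usage lines are hypothetical.

```python
MAX_CLIENTS = 32            # clients supported within OFED, per the slide
MAX_OPS_PER_CLIENT = 1024   # operations supported for each client

class NetlinkRegistry:
    def __init__(self):
        self._clients = {}  # client index -> {operation number: handler}

    def add_client(self, index, handlers):
        """Register one client (for example a provider) with its operation handlers."""
        if not 0 <= index < MAX_CLIENTS:
            raise ValueError("client index out of range")
        if index in self._clients:
            raise ValueError("client already registered")
        if any(not 0 <= op < MAX_OPS_PER_CLIENT for op in handlers):
            raise ValueError("operation number out of range")
        self._clients[index] = dict(handlers)

    def dispatch(self, index, op, payload):
        """Route a message to the handler registered for (client, operation)."""
        return self._clients[index][op](payload)

# Hypothetical client 2 with a single operation 0.
registry = NetlinkRegistry()
registry.add_client(2, {0: lambda payload: ("listen", payload)})
result = registry.dispatch(2, 0, ("IP0", "Port0"))
```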
New OFED iWARP CM Flows
(Listen: Userspace provider CM)
[Diagram: listen flow in the new architecture, with the iWARP Port Mapper Daemon and Userspace Provider Library (Mini-cm or Private TCP) in userspace and the Kernel Provider w/netlink Client (MAC1/IP1) in the kernel. Steps: 1. Rdma_listen(Local IP0, Local Port0); 2. Transition to Kernel CM; 3. Interface Selected; 4. Port Selected; 5. create_listen; 6. Netlink: Listen; 7. Netlink: Register Port Map (IP0, Port0 -> IP1, Port1); 8. Netlink: Complete; 9. Setup Hardware (IP1, Port1)]
• Similar to current flow for CM
• CM can now independently reserve ports since the Port Mapper allows providers to use any provider managed port number to represent CM port number
• Netlink message used to issue listen to userspace library
  • Mini-cm or userspace TCP stack manages provider “port space” to get Local TCP port1 that is related to the CM local Port0
• Userspace library registers local IP1, Port1
• For compatibility, bind could also be made on the existing MAC/IP stack. Soft iWARP requires this, as do some customers.
• If userspace provider library issues socket/bind to Linux TCP/IP stack (like soft iWARP would do), then IP0 = IP1 and Port0 != Port1
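A provider-managed port space for the listen flow might look like the sketch below. It hands out a Port1 for each CM Port0 and, when the iWARP address is the Linux address (the soft iWARP case), makes sure Port1 differs from Port0. The class name, the starting port 40000 and the sequential allocation policy are assumptions for illustration.

```python
import itertools

class ProviderPortSpace:
    """Hands out provider-managed ports (Port1) for CM ports (Port0)."""

    def __init__(self, iwarp_ip, first_port=40000):
        self.iwarp_ip = iwarp_ip                  # IP1
        self._next = itertools.count(first_port)  # assumed allocation policy
        self.listens = {}                         # (IP0, Port0) -> (IP1, Port1)

    def listen(self, ip0, port0):
        port1 = next(self._next)
        # On a shared IP address the mapped port must differ from the CM port.
        while self.iwarp_ip == ip0 and port1 == port0:
            port1 = next(self._next)
        self.listens[(ip0, port0)] = (self.iwarp_ip, port1)
        return self.listens[(ip0, port0)]

# Separate iWARP address (iSCSI-like model): IP0 != IP1.
separate = ProviderPortSpace("10.0.1.1")
mapped_separate = separate.listen("10.0.0.1", 5000)

# Shared address (soft iWARP model): IP0 == IP1, so Port0 != Port1.
shared = ProviderPortSpace("10.0.0.1")
mapped_shared = shared.listen("10.0.0.1", 40000)
```

The returned pair is what step 7 would register with the Port Mapper as the target of the (IP0, Port0) mapping.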
New OFED iWARP CM Flows
(Connect: Userspace provider CM)
[Diagram: connect flow in the new architecture, with the iWARP Port Mapper Daemon and Userspace Provider Library (Mini-cm or Private TCP) in userspace and the Kernel Provider w/netlink Client (MAC1/IP1) in the kernel. Numbered steps: 1. Rdma_connect(Local IP0, Local Port0, Remote IP2, Remote Port2); 2. Transition to Kernel CM; 3. Interface Selected; 4. Port Selected; 8. SDP Port Mapper Protocol (IP0 <-> IP2); 10. Setup Hardware. Unnumbered labels: connect; Netlink: Connect; Netlink: Resolve Remote Port (IP2, Port2 -> IP3, Port3); Netlink: Connect Complete; Connect Reply Event]
• Similar to current flow for CM
• Netlink used to issue connect to userspace library
  • Mini-cm or userspace TCP stack manages provider “portspace” to get Local TCP port1 that is related to the CM local Port0
• Userspace library resolves remote IP2, Port2 through the Port Mapper and gets remote IP and port number IP3, Port3
• Userspace provider CM issues iWARP connect to IP3, Port3, including MPA handshake
• Userspace Mini-cm sends Netlink Connect Complete call to the kernel provider indicating the new connection information: IP1:Port1, IP3:Port3
• The kernel driver sets up the RNIC hardware including transitioning the QP to RTS
• Kernel CM Issues Connect Reply Event
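The address bookkeeping in this flow can be sketched as a function that keeps two views of one connection: the pair the application asked for (IP0:Port0, IP2:Port2) and the pair actually used on the wire (IP1:Port1, IP3:Port3). All names are hypothetical, and the two callables stand in for the provider port space and the Port Mapper query. Keeping the application-facing pair unchanged is an assumption based on the proposal's transparency goal.

```python
from typing import Callable, NamedTuple, Tuple

Endpoint = Tuple[str, int]  # (IP address, TCP port)

class ConnectResult(NamedTuple):
    wire_local: Endpoint   # IP1:Port1, used for the iWARP connection
    wire_remote: Endpoint  # IP3:Port3, returned by the Port Mapper
    cm_local: Endpoint     # IP0:Port0, what the application asked for
    cm_remote: Endpoint    # IP2:Port2, what the application asked for

def connect(cm_local: Endpoint,
            cm_remote: Endpoint,
            map_local: Callable[[Endpoint], Endpoint],
            resolve_remote: Callable[[Endpoint], Endpoint]) -> ConnectResult:
    wire_local = map_local(cm_local)         # provider "portspace" lookup
    wire_remote = resolve_remote(cm_remote)  # Netlink: Resolve Remote Port
    # The Connect Complete message would carry wire_local and wire_remote.
    return ConnectResult(wire_local, wire_remote, cm_local, cm_remote)

# Stand-in mappings for illustration only.
result = connect(
    ("10.0.0.1", 5000), ("10.0.0.2", 6000),
    map_local=lambda ep: ("10.0.1.1", 40000),
    resolve_remote=lambda ep: ("10.0.1.2", 41000),
)
```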
New OFED iWARP CM Flows
(Accept: Userspace provider CM)
[Diagram: accept flow in the new architecture, with the iWARP Port Mapper Daemon and Userspace Provider Library (Mini-cm or Private TCP) in userspace and the Kernel Provider w/netlink Client (MAC1/IP1) in the kernel. Numbered steps: 3. Transition to Userspace CM; 4. Rdma_accept(Local IP0, Local Port0, Remote IP3, Remote Port3); 5. Transition to Kernel CM; 7. Setup Hardware. Unnumbered labels: Netlink: Connect Request; Connect Request Event; CM Accept; Established Event]
• Userspace provider CM receives a connect request on IP1, port1
  • TCP three-way handshake and MPA request from peer received
• Userspace library issues Connect Request netlink downcall to kernel provider library
  • Remote iWARP: IP3, Port3 (Port Mapped)
  • Remote TCP: Unknown (use Port Mapped IP3, Port3)
  • Local iWARP: IP1, Port1 (Port Mapped)
  • Local TCP: IP0, Port0 (from listen)
• Kernel Mini-cm sends Netlink Connect Request event to the iWARP indicating the new connection information: IP0:Port0, IP3:Port3
• Application is notified of the connection request and responds with an rdma_accept call
• The kernel CM issues an accept call to the kernel provider
• The kernel provider then sets up the RNIC hardware, including sending the MPA response and transitioning the QP to RTS
• The kernel provider issues an Established CM event
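The four address fields of the Connect Request downcall can be filled in as sketched below. The local TCP endpoint is recovered by reversing the mapping recorded at listen time, and the remote TCP endpoint reuses the port-mapped remote endpoint because the peer's original address is unknown. The record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (IP address, TCP port)

@dataclass(frozen=True)
class ConnectRequest:
    remote_iwarp: Endpoint  # IP3, Port3 (port mapped)
    remote_tcp: Endpoint    # unknown, so the port-mapped IP3, Port3 is reused
    local_iwarp: Endpoint   # IP1, Port1 (port mapped)
    local_tcp: Endpoint     # IP0, Port0 (from listen)

def build_connect_request(listens: Dict[Endpoint, Endpoint],
                          local_iwarp: Endpoint,
                          remote_iwarp: Endpoint) -> ConnectRequest:
    # listens maps (IP0, Port0) -> (IP1, Port1); invert it to find the listener.
    reverse = {iwarp: tcp for tcp, iwarp in listens.items()}
    local_tcp = reverse[local_iwarp]
    return ConnectRequest(remote_iwarp, remote_iwarp, local_iwarp, local_tcp)

# Mapping recorded by an earlier listen, then an inbound connection on IP1:Port1.
listens = {("10.0.0.1", 5000): ("10.0.1.1", 40000)}
request = build_connect_request(listens, ("10.0.1.1", 40000), ("10.0.1.2", 41000))
```

The event passed up to the application then carries `request.local_tcp` and `request.remote_tcp`, which matches the IP0:Port0, IP3:Port3 pair on the slide.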
New OFED iWARP CM Flows
(kernel provider CM)
• Changes to RNICs that support kernel only connection management drivers are minimal
• On listen requests, the kernel provider CM must issue the Register Port Map request to the iWARP Port Mapper Daemon using netlink sockets
• On connect requests, the kernel provider CM must:
  • Issue the Resolve Remote Port netlink message to the iWARP Port Mapper Daemon
  • On completion, use the local and remote iWARP IP addresses and Port numbers to issue the iWARP connect request (instead of the Linux IP addresses and Port numbers from the connect request)
• On Connect Request event and accept request handling, map the local iWARP IP address and Port number to the original listen IP address and port number
New OFED iWARP CM Flows
(hybrid provider CM)
• A hybrid RNIC has a userspace Connection Manager or Private
TCP stack that manages the iWARP IP address and port
space, but does not get involved with connection setup
• The Listen flow for a hybrid RNIC is the same as the flow for
the userspace stack
• The Accept flow is the same as the flow for a kernel provider
• The Connect flow is slightly different and depicted on the
following slide.
New OFED iWARP CM Flows
(Connect: Hybrid CM)
[Diagram: hybrid connect flow in the new architecture, with the iWARP Port Mapper Daemon and Userspace Provider Library (Mini-cm or Private TCP) in userspace and the Kernel Provider w/netlink Client (MAC1/IP1) in the kernel. Numbered steps: 1. Rdma_connect(Local IP0, Local Port0, Remote IP2, Remote Port2); 2. Transition to Kernel CM; 3. Interface Selected; 4. Port Selected; 8. SDP Port Mapper Protocol (IP0 <-> IP2); 10. Setup Hardware. Unnumbered labels: connect; Netlink: Resolve; Netlink: Resolve Remote Port (IP2, Port2 -> IP3, Port3); Netlink: Resolve Complete; Connect Reply Event]
• Similar to current flow for CM
• Netlink used to issue resolve message to userspace library
  • Mini-cm or userspace TCP stack manages provider “portspace” to get Local TCP port1 that is related to the CM local Port0
• Userspace library resolves remote IP2, Port2 through the Port Mapper and gets remote IP and port number IP3, Port3
  • This information is returned to the kernel provider CM in a resolve complete netlink message
• Kernel provider CM issues iWARP connect to IP3:Port3 from IP1:Port1, including MPA handshake
• The kernel driver sets up the RNIC hardware including transitioning the QP to RTS
• Kernel CM Issues Connect Reply Event indicating IP0:Port0 and IP2:Port2 as the connection information
Conclusions/Next Steps
• This proposal supports moving iWARP traffic to an independent port space from TCP/IP sockets applications transparently to the RDMA verbs consumer
• The iWARP port space can remain on the same IP address (like soft iWARP) or on a separate IP address (like iSCSI)
• Three different RNIC connection management models are supported
• The RDMA Consortium published the wire protocol for mapping TCP port numbers to iWARP port numbers
• This proposal also resolves a port space issue with iSER targets and iWARP in OFED
• Backward compatibility can be ensured by using timeouts on the port mapper protocol to fall back to the current behavior
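The timeout-based fallback in the last bullet could be structured as below: the Port Mapper query runs with a deadline, and if no answer arrives the connection proceeds to the original Linux IP address and port, as it does today. The function name, the one-second default and the treatment of a `None` reply as "no mapping" are assumptions. The `query` callable stands in for the real Port Mapper exchange.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as QueryTimeout

def resolve_with_fallback(query, cm_remote, timeout_s=1.0):
    """Return the port-mapped remote endpoint, or cm_remote if the query times out."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(query, cm_remote)
        mapped = future.result(timeout=timeout_s)
        # A peer with no mapping for this endpoint is treated like an old peer.
        return mapped if mapped is not None else cm_remote
    except QueryTimeout:
        # No Port Mapper answered in time: fall back to the current behavior.
        return cm_remote
    finally:
        # Do not block the caller on a query that is still outstanding.
        pool.shutdown(wait=False)
```

A peer that runs the Port Mapper yields the mapped endpoint, while a peer that never answers yields the unchanged `(IP2, Port2)` after `timeout_s` seconds.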