
100 Gb/s InfiniBand Transport over up to 100 km

Klaus Grobe and Uli Schlegel, ADVA Optical Networking, and David Southwell, Obsidian Strategics, TNC2009, Málaga, June 2009

Agenda

InfiniBand in Data Centers

InfiniBand Distance Transport


InfiniBand in Data Centers


Connectivity performance

[Chart: interface bit rate vs. time (1990–2010, log scale from 100M to 10T) for InfiniBand, 100G Ethernet, and FC, with an InfiniBand roadmap inset showing QDR (2008/09), EDR (2010), and HDR (2011) at x1/x4/x12 lane widths. Adapted from: Ishida, O., “Toward Terabit LAN/WAN” Panel, iGRID2005.]

• Bandwidth requirements follow Moore’s Law (# transistors on a chip)
• So far, both Ethernet and InfiniBand outperform Moore’s growth rate


InfiniBand Data Rates

                         IBx1        IBx4        IBx12
Single Data Rate, SDR    2.5 Gb/s    10 Gb/s     30 Gb/s
Double Data Rate, DDR    5 Gb/s      20 Gb/s     60 Gb/s
Quad Data Rate, QDR      10 Gb/s     40 Gb/s     120 Gb/s

IB uses 8B/10B coding, e.g., IBx1 DDR has 4 Gb/s throughput
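These throughput figures follow directly from the 8B/10B line coding: only 8 of every 10 transmitted bits carry payload, so the usable data rate is 80% of the signaling rate. A minimal Python sketch of that arithmetic (the helper name effective_throughput is ours, purely illustrative):

```python
# Effective InfiniBand throughput under 8B/10B line coding:
# 8 payload bits travel in every 10 line bits, i.e. 80% efficiency.

LINE_CODING_EFFICIENCY = 8 / 10  # 8B/10B

def effective_throughput(signaling_rate_gbps: float, lanes: int = 1) -> float:
    """Usable data rate in Gb/s for a link with the given number of lanes."""
    return signaling_rate_gbps * lanes * LINE_CODING_EFFICIENCY

print(effective_throughput(5.0, lanes=1))    # IBx1 DDR  -> 4.0 Gb/s, as stated above
print(effective_throughput(2.5, lanes=4))    # IBx4 SDR  -> 8.0 Gb/s
print(effective_throughput(10.0, lanes=12))  # IBx12 QDR -> 96.0 Gb/s
```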

Copper

• Serial (x1, not much seen on the market)
• Parallel copper cables (x4, x12)

Fiber Optic

• Serial for x1 and SDR x4 LX (serialized I/F)
• Parallel for x4, x12

Converged Architectures

[Figure: SCSI transport stacks compared, with an arrow indicating latency performance. Each option carries SCSI from the operating system / application:
• iFCP:       FCP / iFCP / TCP / IP / Ethernet (lossy)
• FCIP:       FCP / FCIP / TCP / IP / Ethernet (lossy)
• iSCSI:      iSCSI / TCP / IP / Ethernet (lossy)
• FCoE:       FCP / FCoE / DCB (lossless)
• InfiniBand: SRP / IB (lossless)]

SRP – SCSI RDMA Protocol

HPC Networks today

[Figure: Typical HPC data center today – server cluster equipped with FC and GbE HBAs and IB HCAs, attached to separate FC SAN, Ethernet LAN, and IB interconnect fabrics.]

Relevant parameters:
• LAN HBAs based on GbE/10GbE
• SAN HBAs based on 4G/8G-FC
• HCAs based on IBx4 DDR/QDR

Typical HPC data center today:
• Dedicated networks / technologies for LAN, SAN, and CPU (server) interconnect
• Consolidation required (management complexity, cables, cost, power)

InfiniBand Distance Transport


Generic NREN

[Figure: Generic NREN – a large, dispersed metro campus (or cluster of campuses) with many data centers (DC), core (backbone) routers, Layer-2 switches, and OXCs / ROADMs; connections to the NREN backbone plus dedicated (P2P) connections to large data centers.]


InfiniBand-over-Distance Difficulties and solution considerations

Technical difficulties:

• IB-over-copper – limited distance (<15 m)
• IB-to-XYZ conversion – high latency
• No IB buffer credits in today’s switches for distance transport
• High-speed serialization and E-O conversion needed

Requirements:

• Lowest latency, and hence highest throughput, is a must
• Interworking must be demonstrated

InfiniBand Flow Control

• InfiniBand is credit-based per virtual lane (16)
• On initialization, each fabric end-point declares its capacity to receive data
• This capacity is described as its buffer credit
• As buffers are freed up, end-points post messages updating their credit status
• InfiniBand flow control happens before transmission, not after it – lossless transport
• Optimized for short signal flight time; small buffers are used inside the ICs: limits effective range to ~300 m

[Figure: Credit flow between HCA A and HCA B – (1) data is read from system memory, (2) sent across the IB link, (3) written into system memory at the receiver, (4) a credit update is returned to the sender.]
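As a rough illustration of this credit mechanism (a toy model, not the actual HCA implementation), the following Python sketch shows a sender that may only transmit while it holds receive credits and a receiver that returns a credit each time it frees a buffer:

```python
from collections import deque

class CreditedLink:
    """Toy model of credit-based (lossless) flow control on one virtual lane."""

    def __init__(self, receiver_buffers: int):
        # On initialization the receiver declares its capacity to receive data
        # (its buffer credit).
        self.credits = receiver_buffers
        self.in_flight = deque()

    def try_send(self, packet) -> bool:
        # Flow control happens BEFORE transmission: with no credit available,
        # the sender waits instead of transmitting and risking a drop.
        if self.credits == 0:
            return False
        self.credits -= 1
        self.in_flight.append(packet)
        return True

    def receiver_frees_one_buffer(self):
        # The receiver moves a packet into system memory, frees the buffer,
        # and posts a message updating its credit status.
        if self.in_flight:
            self.in_flight.popleft()
            self.credits += 1

# With few credits and a slow credit return, the sender stalls -- which is
# exactly the throughput-vs-distance problem discussed on the next slide.
link = CreditedLink(receiver_buffers=4)
sent_immediately = sum(link.try_send(i) for i in range(10))
print(sent_immediately)  # -> 4: the remaining packets must wait for credits
```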

InfiniBand Throughput vs. Distance

[Figure: Throughput vs. distance, with and without B2B credits.]

• Only sufficient buffer-to-buffer credits (B2B credits), in conjunction with error-free optical transport, can ensure maximum InfiniBand performance over distance
• Without additional B2B credits, throughput drops significantly after several tens of meters; this is caused by an inability to keep the pipe full by restoring receive credits fast enough
• Buffer credit size depends directly on the desired distance
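The required credit size can be estimated from the bandwidth-delay product: the receive buffers must absorb everything that can be transmitted during one credit-update round trip. A back-of-the-envelope Python sketch, assuming ~4.9 µs/km propagation delay in standard single-mode fibre (a textbook value, not a figure from the slides):

```python
# Buffer-credit requirement estimated from the bandwidth-delay product.

FIBER_DELAY_US_PER_KM = 4.9  # assumed propagation delay in single-mode fibre

def required_buffer_bytes(distance_km: float, line_rate_gbps: float) -> float:
    """Bytes in flight during one credit round trip over the given fibre length."""
    rtt_s = 2 * distance_km * FIBER_DELAY_US_PER_KM * 1e-6
    return line_rate_gbps * 1e9 * rtt_s / 8

for km in (0.3, 10, 50, 100):
    print(f"{km:6.1f} km -> {required_buffer_bytes(km, 10) / 1e3:8.1f} kB of credits")
# ~4 kB suffices for in-room distances, but >1 MB is needed at 100 km and 10 Gb/s --
# far more than the small on-chip buffers, hence the need for external reach extenders.
```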

InfiniBand-over-Distance Transport

[Figure: Two data centers (DC), each with a CPU/server cluster, IB HCAs, an InfiniBand switch fabric (IB SF), FC SAN, LAN, and gateway, interconnected over a redundant 80 x 10G DWDM link and attached to the NREN via 10GbE…100GbE.]

• Point-to-point
• Typically <100 km, but can be extended to any arbitrary distance
• Low latency (distance!)
• Transparent infrastructure (should support other protocols)

IB SF – InfiniBand Switch Fabric

IB Transport Demonstrator Results

N x 10G InfiniBand Transport over >50 km Distance demonstrated

[Figure: Demonstrator setup – B2B credits and SerDes at each end, connected through DWDM terminals over an 80 x 10G DWDM link.]

Obsidian Campus C100

• 4x SDR copper to serial 10G optical
• 840 ns port-to-port latency
• Buffer credits for up to 100 km (test equipment ready for 50 km)

ADVA FSP 3000 DWDM

• Up to 80 x 10 Gb/s transponders
• <100 ns latency per transponder
• Max. reach 200/2000 km

[Plots: SendRecv throughput vs. message length (0–4000 kB) at 0.4, 25.4, 50.4, 75.4, and 100.4 km, and SendRecv throughput vs. distance (0–100 km) for 32 kB, 128 kB, 512 kB, and 4096 kB messages.]
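Using the component figures quoted above, a one-way latency budget for such a link can be sketched as follows; the ~4.9 µs/km propagation delay is again an assumed textbook value, and the composition of the real test path may differ:

```python
# Rough one-way latency budget for the N x 10G demonstrator link (assumptions noted).

FIBER_DELAY_US_PER_KM = 4.9       # assumed propagation delay in single-mode fibre

def one_way_latency_us(distance_km: float) -> float:
    c100_us        = 2 * 0.840    # one Campus C100 per end, 840 ns port-to-port each
    transponder_us = 2 * 0.100    # one FSP 3000 transponder per end, <100 ns each
    propagation_us = distance_km * FIBER_DELAY_US_PER_KM
    return c100_us + transponder_us + propagation_us

for km in (1, 50, 100):
    print(f"{km:3d} km -> ~{one_way_latency_us(km):6.1f} us one way")
# At 100 km the ~490 us of fibre propagation dwarfs the <2 us added by the
# reach extenders and transponders -- the equipment itself is not the bottleneck.
```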


Solution Components

WCA-PC-10G WDM Transponder
• Bit rates: 4.25 / 5.0 / 8.5 / 9.95 / 10.0 / 10.3 / 10.5 Gb/s
• Applications: IBx1 DDR/QDR, IBx4 SDR, 10GbE WAN/LAN PHY, 4G-/8G-/10G-FC
• Dispersion tolerance: up to 100 km w/o compensation
• Wavelengths: DWDM (80 channels) and CWDM (4 channels)
• Client port: 1 x XFP (850 nm MM, or 1310/1550 nm SM)
• Latency: <100 ns

Campus C100 InfiniBand Reach Extender
• Optical bit rate: 10.3 Gb/s (850 nm MM, 1310/1550 nm SM)
• InfiniBand bit rate: 8 Gb/s (4x SDR v1.2 compliant port)
• Buffer credit range: up to 100 km (depending on model)
• InfiniBand node type: 2-port switch
• Small-packet port-to-port latency: 840 ns
• Packet forwarding rate: 20 Mp/s


Solution 8x10G InfiniBand Transport

FSP 3000 DWDM System (~100 km, dual-ended)
  Chassis, PSUs, Controllers      ~€10.000
  10G DWDM Modules                ~€100.000
  Optics (Filters, Amplifiers)    ~€10.000
  Sum (budgetary)                 ~€120.000

16 x Campus C100 (100 km)         ~€300.000

System total (budgetary)          ~€420.000

An Example…


NASA's largest supercomputer uses 16 Longbow C102 devices to span two buildings, 1.5 km apart, at a link speed of 80 Gb/s and a memory-to-memory latency of just 10 µs.

Thank you

[email protected]

IMPORTANT NOTICE

The content of this presentation is strictly confidential. ADVA Optical Networking is the exclusive owner or licensee of the content, material, and information in this presentation. Any reproduction, publication or reprint, in whole or in part, is strictly prohibited. The information in this presentation may not be accurate, complete or up to date, and is provided without warranties or representations of any kind, either express or implied. ADVA Optical Networking shall not be responsible for and disclaims any liability for any loss or damages, including without limitation, direct, indirect, incidental, consequential and special damages, alleged to have been caused by or in connection with using and/or relying on the information contained in this presentation.

Copyright © for the entire content of this presentation: ADVA Optical Networking.