100 Gb/s InfiniBand Transport over up to 100 km
Klaus Grobe and Uli Schlegel, ADVA Optical Networking, and David Southwell, Obsidian Strategics, TNC2009, Málaga, June 2009
Agenda
InfiniBand in Data Centers
InfiniBand Distance Transport
© 2009 ADVA Optical Networking. All rights reserved. ADVA confidential.
InfiniBand in Data Centers
Connectivity performance
[Figure: interface bandwidth (100M to 10T) vs. time (1990–2011) for InfiniBand (QDR/EDR/HDR at x1/x4/x12), 100G Ethernet, and FC; adapted from Ishida, O., “Toward Terabit LAN/WAN” panel, iGRID2005]
Bandwidth requirements follow Moore’s Law (# of transistors on a chip)
So far, both Ethernet and InfiniBand outperform Moore’s growth rate
InfiniBand Data Rates
Signaling rates by lane width:

                            IBx1       IBx4      IBx12
Single Data Rate (SDR)      2.5 Gb/s   10 Gb/s   30 Gb/s
Double Data Rate (DDR)      5 Gb/s     20 Gb/s   60 Gb/s
Quad Data Rate (QDR)        10 Gb/s    40 Gb/s   120 Gb/s

IB uses 8B/10B coding, e.g., IBx1 DDR has 4 Gb/s throughput
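The 8B/10B overhead can be checked with a short calculation (a sketch; function and constant names are illustrative, rates are those from the table):

```python
# Effective InfiniBand throughput under 8B/10B line coding:
# every 8 data bits are carried in 10 line bits, so throughput = 0.8 x signaling rate.
LANE_RATES_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

def effective_throughput_gbps(rate: str, width: int) -> float:
    """Per-lane signaling rate x lane count x 8/10 coding efficiency."""
    return LANE_RATES_GBPS[rate] * width * 8 / 10

print(effective_throughput_gbps("DDR", 1))   # IBx1 DDR -> 4.0 Gb/s, as on the slide
print(effective_throughput_gbps("QDR", 4))   # IBx4 QDR -> 32.0 Gb/s
```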
Copper:
Serial (x1, not much seen on the market)
Parallel copper cables (x4, x12)

Fiber optic:
Serial for x1 and SDR x4 LX (serialized I/F)
Parallel for x4, x12
Converged Architectures
[Diagram: protocol stacks beneath the Operating System / Application and Small Computer System Interface (SCSI) layers]
iFCP:       FCP / iFCP / TCP / IP / Ethernet (lossy)
iSCSI:      iSCSI / TCP / IP / Ethernet (lossy)
FCIP:       FCP / FCIP / TCP / IP / Ethernet (lossy)
FCoE:       FCP / FCoE / DCB Ethernet (lossless)
InfiniBand: SRP / IB (lossless)
Latency Performance
SRP – SCSI RDMA Protocol
HPC Networks today
FC and GbE HBAs and IB HCAs
[Diagram: server cluster; each server with IB, FC, and Ethernet adapters, connected to an FC SAN and an Ethernet LAN]
Relevant parameters:
LAN HBAs based on GbE/10GbE
SAN HBAs based on 4G/8G-FC
HCAs based on IBx4 DDR/QDR
Typical HPC Data Center today
Dedicated networks/technologies for LAN, SAN, and CPU (server) interconnect
Consolidation required (management complexity, cables, cost, power)
InfiniBand Distance Transport
Generic NREN
Large, dispersed Metro Campus, or Cluster of Campuses
[Diagram: large, dispersed metro campus with multiple data centers (DC) attached via Layer-2 switches and OXCs/ROADMs to core (backbone) routers]
Connection to backbone (NREN)
Dedicated (P2P) connection to large data centers
InfiniBand-over-Distance Difficulties and solution considerations
Technical difficulties:
IB-over-copper: limited distance (<15 m)
IB-to-XYZ conversion: high latency
No IB buffer credits in today’s switches for distance transport
High-speed serialization and E-O conversion needed
Requirements:
Lowest latency, and hence highest throughput, is a must
Interworking must be demonstrated
InfiniBand Flow Control
InfiniBand is credit-based per virtual lane (16 lanes)
On initialization, each fabric end-point declares its capacity to receive data
This capacity is described as its buffer credit
As buffers are freed up, end-points post messages updating their credit status
InfiniBand flow control happens before transmission, not after it – lossless transport
Optimized for short signal flight time; small buffers are used inside the ICs, which limits effective range to ~300 m
[Diagram: HCA A → HCA B: (1) from system memory, (2) across IB link, (3) into system memory, (4) update credit]
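The credit mechanism above can be sketched in a few lines (a minimal illustration, not the actual wire protocol; class and method names are hypothetical):

```python
# Minimal sketch of credit-based flow control: the sender may only transmit
# while it holds credits; the receiver returns a credit as it drains each
# buffer slot, so nothing is ever sent into a full buffer (lossless by design).
from collections import deque

class CreditedLink:
    def __init__(self, initial_credits: int):
        self.credits = initial_credits   # advertised by the receiver at init
        self.in_flight = deque()         # packets buffered at the receiver

    def send(self, packet) -> bool:
        if self.credits == 0:
            return False                 # back-pressure: sender waits, no drop
        self.credits -= 1
        self.in_flight.append(packet)
        return True

    def receiver_drain(self):
        """Receiver frees a buffer slot and posts a credit update."""
        pkt = self.in_flight.popleft()
        self.credits += 1
        return pkt

link = CreditedLink(initial_credits=2)
print(link.send("p1"), link.send("p2"), link.send("p3"))  # True True False
link.receiver_drain()                                      # credit returned
print(link.send("p3"))                                     # True
```

With only two credits, the third send blocks until a credit comes back, which is exactly why small on-chip buffers cap the usable link length.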
InfiniBand Throughput vs. Distance
[Figure: throughput vs. distance, with and without B2B credits]
Only sufficient buffer-to-buffer credits (B2B credits), in conjunction with error-free optical transport, can ensure maximum InfiniBand performance over distance
Without additional B2B credits, throughput drops significantly after several tens of meters; this is caused by an inability to keep the pipe full by restoring receive credits fast enough
Buffer credit size depends directly on desired distance
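The distance dependence is the bandwidth-delay product: to keep the pipe full, the receiver must be able to buffer everything in flight during one round trip. A sketch (function name and ~2e8 m/s fiber propagation speed are illustrative assumptions):

```python
# Minimum receive buffering to keep a link full over distance:
# buffer >= line rate x round-trip time (bandwidth-delay product).
C_FIBER_M_PER_S = 2.0e8  # approx. speed of light in fiber (n ~ 1.5)

def min_buffer_bytes(rate_gbps: float, distance_km: float) -> float:
    rtt_s = 2 * distance_km * 1e3 / C_FIBER_M_PER_S
    return rate_gbps * 1e9 * rtt_s / 8

print(min_buffer_bytes(10, 100))   # 100 km at 10 Gb/s -> 1.25e6 bytes (~1.25 MB)
print(min_buffer_bytes(10, 0.3))   # ~300 m (on-chip buffer regime) -> ~3.75 kB
```

The three-orders-of-magnitude jump from ~300 m to 100 km is why distance transport needs dedicated external buffer credits rather than the small buffers inside the switch ICs.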
InfiniBand-over-Distance Transport
[Diagram: two data centers, each with CPU/server cluster, IB switch fabric (IB SF), IB HCAs, FC SAN, and LAN, interconnected via gateways over redundant 80 x 10G DWDM, which also carries FC and 10GbE…100GbE toward the NREN]
Point-to-point; typically <100 km, but can be extended to any arbitrary distance
IB SF – InfiniBand Switch Fabric
Low latency (distance!)
Transparent infrastructure (should support other protocols)
IB Transport Demonstrator Results
N x 10G InfiniBand Transport over >50 km Distance demonstrated
[Diagram: C100 (B2B credits, SerDes) – DWDM terminal – 80 x 10G DWDM link – DWDM terminal – C100 (B2B credits, SerDes)]
Obsidian Campus C100
4x SDR copper to serial 10G optical
840 ns port-to-port latency
Buffer credits for up to 100 km (test equipment ready for 50 km)
ADVA FSP 3000 DWDM
Up to 80 x 10 Gb/s transponders
<100 ns latency per transponder
Max. reach 200/2000 km
[Plots: normalized SendRecv throughput vs. message length (0–4000 kB) at 0.4/25.4/50.4/75.4/100.4 km, and vs. distance (0–100 km) for 32/128/512/4096 kB messages]
Solution Components
WCA-PC-10G WDM Transponder
Bit rates: 4.25 / 5.0 / 8.5 / 9.95 / 10.0 / 10.3 / 10.5 Gb/s
Applications: IBx1 DDR/QDR, IBx4 SDR, 10GbE WAN/LAN PHY, 4G-/8G-/10G-FC
Dispersion tolerance: up to 100 km w/o compensation
Wavelengths: DWDM (80 channels) and CWDM (4 channels)
Client port: 1 x XFP (850 nm MM, or 1310/1550 nm SM)
Latency: <100 ns

Campus C100 InfiniBand Reach Extender
Optical bit rate: 10.3 Gb/s (850 nm MM, 1310/1550 nm SM)
InfiniBand bit rate: 8 Gb/s (4x SDR v1.2-compliant port)
Buffer credit range: up to 100 km (depending on model)
InfiniBand node type: 2-port switch
Small-packet port-to-port latency: 840 ns
Packet forwarding rate: 20 Mp/s
Solution 8x10G InfiniBand Transport
FSP 3000 DWDM system (~100 km, dual-ended):
Chassis, PSUs, controllers       ~€10,000
10G DWDM modules                 ~€100,000
Optics (filters, amplifiers)     ~€10,000
Sum (budgetary)                  ~€120,000

16 x Campus C100 (100 km)        ~€300,000
System total (budgetary)         ~€420,000
An Example…
NASA's largest supercomputer uses 16 Longbow C102 devices to span two buildings, 1.5 km apart, at a link speed of 80 Gb/s and a memory-to-memory latency of just 10 µs.
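A rough latency budget makes the ~10 µs figure plausible (a sketch: only the 1.5 km distance and the ~840 ns port-to-port switch latency come from these slides; the per-HCA overhead and fiber propagation speed are assumed illustrative values):

```python
# Rough one-way memory-to-memory latency budget for a 1.5 km IB link.
C_FIBER_M_PER_S = 2.0e8  # approx. speed of light in fiber (assumption)

def one_way_latency_us(distance_km: float, switch_hops: int = 2,
                       switch_ns: float = 840.0, hca_us: float = 1.0) -> float:
    propagation_us = distance_km * 1e3 / C_FIBER_M_PER_S * 1e6
    return propagation_us + switch_hops * switch_ns * 1e-3 + 2 * hca_us

# ~7.5 us of fiber flight time + ~1.7 us of switching + HCA overheads,
# i.e., the same order as the ~10 us quoted for the NASA deployment.
print(round(one_way_latency_us(1.5), 2))
```

Note that at this distance the fiber flight time alone dominates the budget, so the electronics contribute comparatively little.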
Thank you
[email protected]
IMPORTANT NOTICE
The content of this presentation is strictly confidential. ADVA Optical Networking is the exclusive owner or licensee of the content, material, and information in this presentation. Any reproduction, publication or reprint, in whole or in part, is strictly prohibited. The information in this presentation may not be accurate, complete or up to date, and is provided without warranties or representations of any kind, either express or implied. ADVA Optical Networking shall not be responsible for and disclaims any liability for any loss or damages, including without limitation, direct, indirect, incidental, consequential and special damages, alleged to have been caused by or in connection with using and/or relying on the information contained in this presentation.
Copyright © for the entire content of this presentation: ADVA Optical Networking.