InfiniBand: Today and Tomorrow
Jamie Riotto
Sr. Director of Engineering
Cisco Systems (formerly Topspin Communications)
Session Number
Presentation_ID
© 2005 Cisco Systems, Inc. All rights reserved.
Cisco Public
1
Agenda
• InfiniBand Today
– State of the market
– Cisco and InfiniBand
– InfiniBand products available now
– Open source initiatives
• InfiniBand Tomorrow
– Scaling InfiniBand
– Future Issues
• Q&A
InfiniBand Maturity Milestones
• High adoption rates
– Currently shipping > 10,000 IB ports / Qtr
• Cisco acquisition will drive broader market
adoption
• End-to-end price points of <$1000.
• New Cluster scalability proof-points
– 1000 to 4000 nodes
Cisco Adopts InfiniBand
• Cisco acquired Topspin on May 16, 2005
• Adds InfiniBand to Switching Portfolio
– Network Switches, Storage Switches,
now Server Switches
– Creates independent Business Unit to promote InfiniBand
& Server Virtualization
• New Product line of Server Fabric Switches (SFS)
– SFS 7000 Series InfiniBand Server Switches
– SFS 3000 Series Multifabric Server Switches
Cisco and InfiniBand
The Server Fabric Switch
[Diagram labels: Network Switch, Storage Switch, Server Switch; Clients; Network Resources (Internet, Printer, Server); Storage (SAN); Servers; Network; Storage]
Cisco HPC Case Studies
Real Deployments Today: Wall Street Bank with 512 Node Grid
• Fibre Channel and GigE connectivity built seamlessly into the cluster
• Grid I/O to existing networks (SAN, LAN): 2 TS-360 w/ Ethernet and Fibre Channel Gateways
• Core Fabric: 2 96-port TS-270
• Edge Fabric: 23 24-port TS-120
• 512 Server Nodes
NCSA (National Center for Supercomputing Applications)
Tungsten 2: 520 Node Supercomputer
• Core Fabric: 6 72-port TS270
• 174 uplink cables
• Edge Fabric: 29 24-port TS120
• 512 1m cables
• 520 Dual CPU Nodes, 1,040 CPUs
• Parallel MPI codes for commercial clients
• Point to point 5.2us MPI latency
• Deployed: November 2004
[Diagram label, repeated: 18 Compute Nodes]
D.E. Shaw Bio-Informatics: 1,066 Node Super Computer
1,066-node Fully Non-Blocking, Fault Tolerant IB Cluster
• Fault Tolerant Core Fabric: 12 96-port TS-270
• 1,068 5m/7m/10m/15m uplink cables
• Edge Fabric: 89 24-port TS-120
• 1,066 1m cables
[Diagram label, repeated: 12 Compute Nodes]
Large Government Lab
World's Largest Commodity Server Cluster – 4096 nodes
• Application:
– High Performance Super Computing Cluster
• Environment:
– 4096 Dell Servers
– Core Fabric: 8 SFS TS-740s, 288 ports each
– Edge: 256 TS-120s, 24 ports each
– 2048 uplinks (7m/10m/15m/20m)
– 50% Blocking Ratio
• Benefits:
– Compelling Price/Performance
– Largest Cluster Ever Built (by approx. 2X)
– Expected to be 2nd Largest Supercomputer in the world by node count
– 8192 Processor, 60TFlop SuperCluster
[Diagram label, repeated: 18 Compute Nodes]
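The edge and core counts on the three cluster slides above (NCSA, D.E. Shaw, and the government lab) can be cross-checked with a little arithmetic. The following Python sketch is illustrative only: the `TwoTierFabric` class and its field names are made up for this example, and it assumes nodes and uplinks are spread evenly across the edge switches, which the slides do not state. The Wall Street grid is left out because its uplink count is not given.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class TwoTierFabric:
    """Edge/core fabric described by the counts printed on a slide (hypothetical helper)."""
    name: str
    nodes: int
    edge_switches: int
    edge_ports: int      # ports per edge switch
    uplinks: int         # total edge-to-core cables
    core_switches: int
    core_ports: int      # ports per core switch

    def report(self) -> dict:
        # Assumption: nodes and uplinks are divided evenly among edge switches.
        nodes_per_edge = ceil(self.nodes / self.edge_switches)
        uplinks_per_edge = self.uplinks / self.edge_switches
        return {
            "nodes_per_edge": nodes_per_edge,
            "uplinks_per_edge": uplinks_per_edge,
            # 1.0 means non-blocking; 2.0 means hosts have twice the uplink bandwidth
            "oversubscription": nodes_per_edge / uplinks_per_edge,
            "edge_ports_fit": nodes_per_edge + uplinks_per_edge <= self.edge_ports,
            "core_ports_fit": self.uplinks <= self.core_switches * self.core_ports,
        }


# Counts copied from the slides.
FABRICS = [
    TwoTierFabric("NCSA Tungsten 2", 520, 29, 24, 174, 6, 72),
    TwoTierFabric("D.E. Shaw", 1066, 89, 24, 1068, 12, 96),
    TwoTierFabric("Government lab", 4096, 256, 24, 2048, 8, 288),
]

for fabric in FABRICS:
    print(fabric.name, fabric.report())
```

Under those assumptions, the government lab works out to 16 nodes and 8 uplinks per 24-port edge switch, a 2:1 ratio that is consistent with the 50% blocking figure on the slide, while the D.E. Shaw numbers come out at 1:1 (non-blocking).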
InfiniBand Products Available Today
InfiniBand Switches and HCAs
• Fully non-blocking switch building blocks available in sizes from 24 up to 288 ports.
• Blade servers offer integrated switches and pass-through modules
• HCAs available in PCI-X and PCI-Express
• IP & Fibre-Channel Gateway Modules
Integrated InfiniBand for Blade Servers
Create “wire-once” fabric
• Integrated 10Gbps InfiniBand switches provide unified “wire-once” fabric
• Optimize density, cooling, space, and cable management.
• Option of integrated InfiniBand switch (ex: IBM BC) or pass-through module (ex: Dell 1855)
• Virtual I/O provides shared Ethernet and Fibre Channel ports across blades and racks
[Diagram: Blade Chassis with InfiniBand Switches. Labels: HCA, IB Switch (two), 10Gbps, 30Gbps]
Ethernet and Fibre Channel Gateways
Unified “wire-once” fabric
• Single InfiniBand link for:
- Storage
- Network
• Fibre Channel to InfiniBand gateway for storage access
• Ethernet to InfiniBand gateway for LAN access
[Diagram labels: Server Cluster, Server Fabric, SAN, LAN/WAN]
InfiniBand Price / Performance

                                  InfiniBand PCI-Express   10GigE     GigE        Myrinet D   Myrinet E
Data Bandwidth (Large Messages)   950MB/s                  900MB/s    100MB/s     245MB/s     495MB/s
MPI Latency (Small Messages)      5us                      50us       50us        6.5us       5.7us
HCA Cost (Street Price)           $550                     $2K-$5K    Free        $535        $880
Switch Port                       $250                     $2K-$6K    $100-$300   $400        $400
Cable Cost (3m Street Price)      $100                     $100       $25         $175        $175

* Myrinet pricing data from Myricom Web Site (Dec 2004)
** InfiniBand pricing data based on Topspin avg. sales price (Dec 2004)
*** Myrinet, GigE, and IB performance data from June 2004 OSU study
• Note: MPI Processor to Processor latency – switch latency is less
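The per-node, end-to-end cost implied by the table above is the sum of the HCA, one switch port, and one cable. The Python sketch below does that sum for each interconnect. It is only an illustration: the dictionary layout and function names are invented here, "Free" is treated as $0, price ranges are carried as (low, high) pairs, and the value-to-column mapping assumes the flattened slide listed each row in column order.

```python
# (low, high) street prices in USD taken from the table; a single price is (p, p).
PRICES = {
    "InfiniBand PCI-Express": {"hca": (550, 550), "port": (250, 250), "cable": (100, 100)},
    "10GigE": {"hca": (2000, 5000), "port": (2000, 6000), "cable": (100, 100)},
    "GigE": {"hca": (0, 0), "port": (100, 300), "cable": (25, 25)},  # "Free" HCA taken as $0
    "Myrinet D": {"hca": (535, 535), "port": (400, 400), "cable": (175, 175)},
    "Myrinet E": {"hca": (880, 880), "port": (400, 400), "cable": (175, 175)},
}

# Large-message data bandwidth in MB/s, from the same table.
BANDWIDTH_MB_S = {
    "InfiniBand PCI-Express": 950,
    "10GigE": 900,
    "GigE": 100,
    "Myrinet D": 245,
    "Myrinet E": 495,
}


def per_node_cost(name):
    """Return the (low, high) cost of HCA + switch port + cable for one node."""
    parts = list(PRICES[name].values())
    return (sum(lo for lo, _ in parts), sum(hi for _, hi in parts))


def dollars_per_mb_s(name):
    """Return (low, high) cost per MB/s of large-message bandwidth."""
    lo, hi = per_node_cost(name)
    bandwidth = BANDWIDTH_MB_S[name]
    return (lo / bandwidth, hi / bandwidth)


for name in PRICES:
    lo, hi = per_node_cost(name)
    cost = f"${lo}" if lo == hi else f"${lo}-${hi}"
    print(f"{name}: {cost} per node, "
          f"{dollars_per_mb_s(name)[0]:.2f} to {dollars_per_mb_s(name)[1]:.2f} $/(MB/s)")
```

With these figures the InfiniBand column sums to $900 per node, which agrees with the "<$1000" end-to-end price point quoted earlier in the deck.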
InfiniBand Cabling
• CX4 Copper (15m)
• Flexible 30-Gauge Copper (3m)
• Fiber Optics up to 150m
Host Drivers for Standard Protocols
• Open source strategy = reliability at low cost
• IPoIB: legacy TCP/IP applications
• SDP: reliable socket connections (optional RDMA)
• MPI: leading edge HPCC applications (RDMA)
• SRP: block storage access (RDMA)
• uDAPL: User level RDMA
OS Support
• Operating Systems Available:
– Linux (Red Hat, SuSE, Fedora, Debian, etc.)
– Windows 2000 and 2003
– HP-UX (Via HP)
– Solaris (Via Sun)
The InfiniBand Driver Architecture
[Diagram: host driver stack and server fabric. Labels: APPLICATION, INFINIBAND, NETWORK, SAN, User, Kernel, BSD Sockets, FS API, FILE SYSTEM, UDAPL, NFS-RDMA, TCP, IP, IPoIB, SDP, TS, DAT, SCSI, API, SRP, FCP, Drivers, VERBS, ETHER, INFINIBAND HCA, ETHER SWITCH, INFINIBAND SWITCH, ETH GW, FC GW, FC SWITCH, SERVER FABRIC, LAN/WAN]
Open Software Initiatives
• OpenIB.org
– Topspin is the primary author of major portions, including IPoIB, SDP, SRP and TS-API. Cisco will continue to invest.
– Current protocol development is nearing production-quality code. Expect release by end of year.
– Charter has been expanded to include Windows and iWARP
– MPI will be available in the near future (MVAPICH 0.96)
• OpenSM
• OpenMPI
InfiniBand Tomorrow
Looking into the future
• Cost
• Speed
• Distance Limitations
• Cable Management
• Scalability
• IB and Ethernet
Speed: InfiniBand DDR / QDR, 4X / 12X
• DDR Available end of 2005
Doubles wire speeds to ? (ok, still working on this one)
PCI-Express DDR
Distances of 5-10m using copper
Distances of 100m using fiber
• QDR Available WHEN?
• 12X (30 Gb/s) available for over one year!!
– Not interesting until 12X HCA
• Not interesting until > 16X PCIe
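The relationship between link width, signaling generation, and host bus bandwidth on this slide can be worked through numerically. The Python sketch below is a rough illustration, not vendor data: the per-lane signaling rates (2.5, 5 and 10 Gb/s for SDR, DDR and QDR) and the 8b/10b encoding overhead are assumptions drawn from general knowledge of the InfiniBand and first-generation PCI Express specifications, not from this deck, and the function names are made up.

```python
# Assumed per-lane signaling rates in Gb/s (not stated on the slide).
LANE_SIGNALING_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rates(speed, width):
    """Return (signaling Gb/s, data Gb/s) for an InfiniBand link of `width` lanes."""
    signaling = LANE_SIGNALING_GBPS[speed] * width
    data = signaling * 8 / 10  # 8b/10b encoding: 8 data bits per 10 line bits
    return signaling, data


def pcie_gen1_data_gbps(lanes):
    """Per-direction data rate of a first-generation PCIe slot (2.5 GT/s, 8b/10b)."""
    return 2.5 * lanes * 8 / 10


for speed in ("SDR", "DDR", "QDR"):
    for width in (4, 12):
        signaling, data = link_rates(speed, width)
        print(f"{width}X {speed}: {signaling:g} Gb/s signaling, {data:g} Gb/s data")

for lanes in (8, 16):
    print(f"PCIe x{lanes} (gen 1): {pcie_gen1_data_gbps(lanes):g} Gb/s data per direction")
```

Under these assumptions a 12X SDR link carries more data than a first-generation x8 PCIe slot can deliver, and a 12X DDR link would exceed even an x16 slot, which is one way to read the slide's remark about 12X not being interesting until wider PCIe is available.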
Future InfiniBand Cables
• InfiniBand over CAT5 / CAT6 / CAT7
Shielded cable distances up to ???
Leverage existing 10-GigE cabling
10-GigE too expensive?
IB Distance Scaling
• IB Short Haul
– New Copper drivers
– 25 – 50 Meters (KeyEye)
– 75 - 100 Meters (IEEE 10Ge)
• IB WAN
– Same Subnet over distance (300 km target)
– Buffer / Credit / Timeout issues
– Applications: Disaster Recovery, Data Mirroring
• IB Long Haul
– IB over IP (over SONET?)
– utilizes existing public plant (WDM, Debugging, etc)
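The buffer and credit issue listed under IB WAN above follows from credit-based link-level flow control: a sender can only transmit into buffer space the receiver has advertised, so keeping a long link full needs roughly one round trip's worth of data in receive buffers. The Python sketch below estimates that figure. It is a back-of-the-envelope illustration: the 5 microseconds per kilometre propagation delay in fiber and the 8 Gb/s data rate (a 4X SDR link after encoding) are assumptions, and the function name is hypothetical.

```python
FIBER_US_PER_KM = 5.0  # assumed one-way propagation delay in optical fiber


def credit_buffer_bytes(distance_km, data_rate_gbps):
    """Bytes in flight over one round trip, i.e. the buffering needed to keep the link busy."""
    rtt_us = 2 * distance_km * FIBER_US_PER_KM
    # Gb/s times microseconds gives kilobits; convert to bytes.
    return rtt_us * data_rate_gbps * 1000 / 8


for label, km in (("100 m fiber run", 0.1), ("300 km WAN target", 300)):
    needed = credit_buffer_bytes(km, 8)
    print(f"{label}: about {needed:,.0f} bytes of receive buffering")
```

The same link that needs on the order of a kilobyte of buffering inside a machine room needs megabytes at the 300 km target, which is why distance extension raises buffer, credit, and timeout questions rather than just cabling ones.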
Scaling InfiniBand
• Subnet Management
• Host-side Drivers
MPI
IPoIB
SRP
• Memory Utilization
IB Subnet Manager
• Subnets are getting bigger
– 4,000 -> 10,000 nodes
– Topology convergence times
• Topology disturbance times
• Topology disturbance minimization
Subnet Management Challenges
• Cluster Cold Start times
– Template Routing
– Persistent Routing
• Cluster Topology Change Management
– Intentional Change - Maintenance
– Unintentional Change – Dealing with Faults
• How to impact minimum number of connections
• Predetermine fault reaction strategy?
• Topology Diagnostic Tools
– Link/Route Verification
– Built-in BERT testing
• Partition Management
Multiple Routing Models
• Minimum Latency Routing:
– Load-Balanced Shortest-Path Routing
• Minimum Contention Routing:
– Lowest-Interference Divergent-Path Routing
• Template Driven Routing:
– Supports Pre-Determined Routing Topology
– For example: Clos Routing, Matrix Row/Column, etc
– Automatic Cabling Verification for Large Installations
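As a rough illustration of the first model above, load-balanced shortest-path routing, the Python sketch below builds per-switch forwarding tables on a tiny two-leaf, two-spine fabric. It is not the algorithm used by any particular subnet manager: the topology, node names, and functions are invented for the example, hosts are assumed to be single-homed, and the tables map a destination name to a next-hop neighbor rather than a LID to a port number.

```python
from collections import deque


def hop_counts(links, target):
    """Breadth-first search: number of hops from every node to `target`."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nbr in links[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def build_forwarding_tables(links, switches, hosts):
    """For each switch and destination host, pick the least-loaded shortest-path neighbor."""
    tables = {sw: {} for sw in switches}
    load = {sw: {nbr: 0 for nbr in links[sw]} for sw in switches}
    for dest in hosts:
        dist = hop_counts(links, dest)
        for sw in switches:
            # Neighbors one hop closer to the destination lie on a shortest path.
            candidates = [n for n in links[sw] if dist[n] == dist[sw] - 1]
            choice = min(candidates, key=lambda n: load[sw][n])
            tables[sw][dest] = choice
            load[sw][choice] += 1
    return tables


# Hypothetical fabric: two hosts per leaf, each leaf cabled to both spines.
links = {
    "h1": ["leaf1"], "h2": ["leaf1"], "h3": ["leaf2"], "h4": ["leaf2"],
    "leaf1": ["h1", "h2", "s1", "s2"],
    "leaf2": ["h3", "h4", "s1", "s2"],
    "s1": ["leaf1", "leaf2"],
    "s2": ["leaf1", "leaf2"],
}
switches = ["leaf1", "leaf2", "s1", "s2"]
hosts = ["h1", "h2", "h3", "h4"]

tables = build_forwarding_tables(links, switches, hosts)
for sw in switches:
    print(sw, tables[sw])
```

In this small case each leaf ends up sending traffic for one remote host through each spine, so the two equal-length paths share the load evenly.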
IB Routing Challenges
• Static / Dynamic Routing
– IB implements Static Routing through Linear Forwarding Tables at each chip
– Multi-LID Routing enables Dynamic Routing
• Credit Loops
• Cost-Based Routing
– Speed mismatches cause Store & Forward (vs. cut through)
– SDR <> DDR <>QDR
– 4X <> 12X
– Short Haul <> Long Haul
Multi-LID Source-Based Routing Support
• Applications can implement “Dynamic” Routing for Contention Avoidance, Failover, Parallel Data Transfer
[Diagram: Leaf Switches connected through Spine Switches; label: 1,2,3,4]
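One way to picture multi-LID source-based routing: each destination port answers to several LIDs, the static forwarding tables send each of those LIDs through a different spine, and the sending application chooses a path simply by choosing which destination LID to use. The Python sketch below models that idea. It is a simplified illustration, assuming an LMC (LID Mask Control) value of 2 so that each port has four LIDs; the base LID, port numbers, and function names are hypothetical and no real verbs API is used.

```python
LMC = 2  # assumed LID Mask Control: each end port answers to 2**LMC consecutive LIDs


def lids_for(base_lid, lmc=LMC):
    """All LIDs that address the port whose base LID is `base_lid`."""
    return [base_lid + i for i in range(2 ** lmc)]


def leaf_forwarding_table(dest_base_lids, spine_ports, lmc=LMC):
    """Static table on a leaf switch: each LID of a destination uses a different uplink."""
    table = {}
    for base in dest_base_lids:
        for offset, lid in enumerate(lids_for(base, lmc)):
            table[lid] = spine_ports[offset % len(spine_ports)]
    return table


def pick_dlid(base_lid, failed_ports, table, lmc=LMC):
    """Source-side choice: the first destination LID whose uplink has not failed."""
    for lid in lids_for(base_lid, lmc):
        if table[lid] not in failed_ports:
            return lid
    raise RuntimeError("no usable path to destination")


# Hypothetical leaf switch with uplink ports 21-24, one per spine.
table = leaf_forwarding_table(dest_base_lids=[16, 32], spine_ports=[21, 22, 23, 24])

print(pick_dlid(16, failed_ports=set(), table=table))
print(pick_dlid(16, failed_ports={21}, table=table))
```

The switch tables never change in this model. Failover and contention avoidance happen at the source, which also means an application could open one connection per LID to move data over all four paths in parallel.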
New IB Peripherals
• CPUs?
• Storage
– SAN
– NFS-RDMA
• Memory (coherent / non-coherent)
• Purpose built Processors?
– Floating Point Processors
– Graphics Processors
– Pattern Matching Hardware
– XML Processor
THANK YOU!
• Questions & Answers