SCI SOCKET: The fastest socket on earth?
Atle Vesterkjær
[email protected]
http://www.dolphinics.com
Olaf Helsets vei 6 NO-0619 Oslo, Norway Phone: +47 23 16 70 00 Fax: +47 23 16 71 80
LCSC 2004
SCI SOCKET - Outline
• The fastest socket on earth and the impact on storage and applications
• SCI technology
• SCI SOCKET for storage and applications
• SCI SOCKET benchmarks

Highlights of the Dolphin SCI Technology
• Ultra Low Latency
  ▪ CPU has direct access to remote memory
  ▪ No protocol overhead
    • 1.4 µs 4 bytes write
    • < 3 µs 512 bytes write
    • 0.2 µs pipelined write
  ▪ Fast failover for HA systems
• Highly efficient bus bridging
  ▪ Bus Requests and Responses (CPU load/store operations) are translated directly in hardware to Request and Response Packets
  ▪ Point-to-point links give bus performance and latency over distance
    • High data throughput: ~346 MByte/s
    • 0.2 µs pipelined write

Highlights of the Dolphin SCI Technology
• Wide Application Area - Common Mode: Multiprocessing
  ▪ Storage, Clustering, Multiprocessing, Embedded Systems, Telecommunication, Defense, Medical Imaging
• Choice of Topologies: Ring, Torus, Switched
• Shipping in Critical Applications for more than 10 years
• Based on the ANSI/IEEE 1596-1992 Scalable Coherent Interface (SCI) Standard

Linköping University - NSC - SCI Clusters
• Monolith: 200 nodes, 2x Xeon, 2.2 GHz, 3D SCI
• INGVAR: 32 nodes, AMD 900 MHz, 2D SCI
• Otto: 48 nodes, P4 2.26 GHz, 2D SCI
• Maxwell: 40 nodes, 2x Xeon, 2D SCI
• Bris: 16+2 nodes, 2x Xeon
• Total: 336 SCI nodes
Also in Sweden: Umeå University, 120 Athlon nodes

Applications, Database Clustering
Ultra Enterprise Cluster
• SUN’s high-end servers are clustered with Dolphin cards
  ▪ Money transaction and database applications
  ▪ High availability and performance
  ▪ Dolphin ships: cards and switches
  ▪ 7th year of shipments
  ▪ Oracle 9i performance and scalability
  ▪ SCI runs natively on SUN’s RSM (Remote Shared Memory) API

Mirage 2000 Upgrade, First Test Flight January 2001
Thales uses Dolphin’s technology as the main interconnect in the on-board multiprocessor.
Offered with systems like Mirage 2000-9, Mirage 2000-5, Rafale and more.

Space Mission Application
Dolphin’s technology is chosen for evaluation
http://sim.jpl.nasa.gov/
Dolphins in Space!

SCI Adapter Cards - 64 bit 66 MHz
• PCI, PMC (VME) and CompactPCI™ SCI Adapter Cards
• Industry-best latency
  ▪ 1.4 microseconds 4 bytes write
  ▪ < 3 microseconds 512 bytes write
  ▪ 0.2 microseconds pipelined write
• High data throughput ~346 MBytes/s
• Supports:
  ▪ Direct Memory Access (DMA)
  ▪ Remote Memory Access (RMA)
  ▪ Remote Interrupt
• Hot-pluggable cabling
• Redundant SCI adapters can be used for fault tolerance
[Figure: adapter block diagram with PCI, PSB, LC and SCI. Uses: Cluster Adapter, PCI to PCI Bridge, PCI Extension, Reflected Memory]

Dolphin Products: Switches, Chips and Cards
Torus Topology
• 1D Topology (Ring): to 10 nodes
• 2D Torus Topology: to 100+ nodes
• 3D Torus Topology: to 1000s of nodes
[Figure: adapter block diagrams (PCI, PSB, SCI) with one, two and three LCs for the 1D, 2D and 3D topologies]

Dolphin SW
• All Dolphin SW is free open source (GPL or LGPL)
• SISCI – shared memory interface
• SCI-Sockets
  ▪ Low Latency Socket Library
  ▪ TCP and UDP Replacement
  ▪ User and Kernel level support
  ▪ Release 2.3 available
• SCI-MPICH (RWTH Aachen)
  ▪ MPICH 1.2 and some MPICH 2 features. MPICH 2 in development.
  ▪ New release is being prepared, beta available
• SCI Interconnect Manager
  ▪ Automatic failover recovery.
  ▪ No single point of failure in 2D and 3D networks.
• Other
  ▪ SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster Cray-compatible shmem and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprimes X1 Database Performance Cluster for Microsoft SQL Servers, ClusterFrame from Qlusters, SunCluster 3.1 (Oracle 9i) and MySQL Cluster

Latency vs SW
SW                   Latency (1/2 ping-pong round trip)
SISCI (Direct HW)    1.4 µs
SCI-Sockets          2.3 µs
Scali MPI Connect    3.5 µs
SCI-MPICH            3.8 µs

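The latencies above are quoted as half of a ping-pong round trip. As a rough illustration of how such a number is obtained, here is a minimal sketch in C. It is not the benchmark used for the slides: it bounces one byte between two descriptors owned by a single process, and the helper names `make_local_pair` and `half_round_trip_us` are made up for this example. A real measurement would run the two ends on different nodes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: a connected local socket pair to measure on.
 * (Assumption: AF_UNIX stands in for a real network connection.) */
int make_local_pair(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
}

/* Bounce one byte back and forth `iterations` times and return the
 * average one-way latency in microseconds (round trip divided by 2).
 * Returns a negative value on an I/O error. */
double half_round_trip_us(int fd_a, int fd_b, int iterations)
{
    char byte = 'x';
    struct timespec start, end;

    assert(iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        /* ping: a -> b */
        if (write(fd_a, &byte, 1) != 1 || read(fd_b, &byte, 1) != 1)
            return -1.0;
        /* pong: b -> a */
        if (write(fd_b, &byte, 1) != 1 || read(fd_a, &byte, 1) != 1)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total_us = (double)(end.tv_sec - start.tv_sec) * 1e6
                    + (double)(end.tv_nsec - start.tv_nsec) / 1e3;
    return total_us / iterations / 2.0;
}
```

Calling `make_local_pair(sv)` and then `half_round_trip_us(sv[0], sv[1], 10000)` gives the local figure. Averaging over many iterations matters because a single exchange is shorter than the timer overhead on fast transports.
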
Motivation
• Link-level speeds of interconnects are increasing
  ▪ The communication bottleneck has moved to protocol software
  ▪ High-speed networks provide their own efficient interfaces
• On the other hand:
  ▪ A large number of applications are built around legacy protocols such as the TCP/IP suite
  ▪ De facto standard: Berkeley Sockets API
  ▪ Porting to hardware-specific APIs is unprofitable in many cases
• SCI SOCKET aims to bring together:
  ▪ Legacy socket applications
  ▪ The low-latency SCI interconnect

Berkeley Sockets over SCI
• High-speed, low-latency replacement for Gigabit Ethernet for critical applications
• Bypasses traditional network stacks like TCP/UDP/IP
  ▪ Eliminates protocol overhead and reduces latency
• Transparent to applications, no modifications or recompilation required
• Ultra low latency
  ▪ 2.27 µs socket send/receive latency

Berkeley Sockets over SCI
• Data transfer through remote shared memory
• Offers new socket transport family AF_SCI
• Flexible configuration using configuration files
  ▪ Specifying cluster nodes
  ▪ Specifying ports

LD_PRELOAD
• Standard mechanism to preload C library functions
• User-defined library functions are called instead of the C library
• AF_INET selects the traditional TCP/IP path
• AF_SCI selects SCI SOCKET

    /* Pseudocode: socket_lib() stands for the C library's own socket() */
    int socket(int family, int type, int protocol) {
        /* TCP (SOCK_STREAM) and UDP (SOCK_DGRAM) sockets are redirected to SCI */
        if ((family == AF_INET) && (type == SOCK_STREAM || type == SOCK_DGRAM))
            return socket_lib(AF_SCI, type);
        else
            return socket_lib(family, type);
    }

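The redirection idea on the LD_PRELOAD slide can be sketched in compilable C as follows. This is an illustration only, not Dolphin's library: `HYPOTHETICAL_AF_SCI` is a placeholder value (the real AF_SCI constant comes from the SCI SOCKET headers), and `sci_pick_family` and `sci_socket_shim` are names invented here. The shim takes the real `socket()` as a function pointer so the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, SOCK_STREAM, SOCK_DGRAM */

/* ASSUMPTION: placeholder only; not the real AF_SCI value. */
#define HYPOTHETICAL_AF_SCI 9999

/* Signature of socket(), so the real implementation can be passed in. */
typedef int (*socket_fn)(int family, int type, int protocol);

/* Decide which address family a socket() call should really use:
 * TCP and UDP sockets over AF_INET are redirected, everything else
 * keeps the family the application asked for. */
int sci_pick_family(int family, int type)
{
    if (family == AF_INET && (type == SOCK_STREAM || type == SOCK_DGRAM))
        return HYPOTHETICAL_AF_SCI;
    return family;
}

/* Hypothetical shim: forwards to the real socket() with the chosen family. */
int sci_socket_shim(socket_fn real_socket, int family, int type, int protocol)
{
    assert(real_socket != NULL);
    return real_socket(sci_pick_family(family, type), type, protocol);
}
```

In an actual preload library on Linux, the exported function would be named `socket`, and the underlying C library function would typically be looked up once with `dlsym(RTLD_NEXT, "socket")`. The library is then activated by listing it in the `LD_PRELOAD` environment variable, which is why the application needs no recompilation.
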
SCI SOCKET
• Easy installation of the SCI socket library
[Figure: Application → SCI Socket library (with configuration file) → SCI, or → Standard Socket library → Ethernet]

Configuration File /etc/sci/scisock.conf
• Selects which machines can be reached using SCI
• Optionally, /etc/sci/scisock_opt.conf selects which ports can be reached using SCI

/etc/sci/scisock.conf:

    #This is a SCI socket config file
    #Should be placed in /etc/sci
    #
    #hostname        SCI NodeId
    nodeA            4
    193.71.152.89    8
    Mailhost         16
    File-serv        20

/etc/sci/scisock_opt.conf:

    #This is a SCI socket_opt config file
    #Should be placed in /etc/sci directory
    #
    #-key                  -Type      -value
    EnablePortsByDefault              -yes/no
    EnablePort             tcp|udp    ’portnumber’
    DisablePort            tcp|udp    ’portnumber’
    EnablePortRange        tcp|udp    ’start_port end_port’
    DisablePortRange       tcp|udp    ’start_port end_port’

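To make the host table format concrete, the following sketch parses one line of a file shaped like the scisock.conf example on the slide. It assumes only what the slide shows: `#` starts a comment and each entry is a host name (or IP address) followed by a numeric SCI NodeId, separated by whitespace. The struct and function names are invented for this example and are not part of the SCI SOCKET software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SCI_HOST_MAX 64   /* assumption: host names shorter than 64 bytes */

/* One "hostname  SCI NodeId" entry. */
struct sci_host_entry {
    char hostname[SCI_HOST_MAX];
    unsigned node_id;
};

/* Parse a single configuration line.
 * Returns 1 if an entry was stored in *out,
 *         0 for a comment or blank line,
 *        -1 for a malformed line. */
int parse_scisock_line(const char *line, struct sci_host_entry *out)
{
    assert(line != NULL && out != NULL);

    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* Comments and empty lines carry no entry. */
    if (*line == '#' || *line == '\n' || *line == '\0')
        return 0;

    /* %63s leaves room for the terminating NUL in hostname. */
    if (sscanf(line, "%63s %u", out->hostname, &out->node_id) == 2)
        return 1;
    return -1;
}
```

A caller would read the file with `fgets` and feed each line to `parse_scisock_line`, collecting the entries that return 1. A library doing the redirection could then check whether a destination host appears in the table before choosing SCI over Ethernet.
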
Linux Kernel Socket Switch
[Figure: layered diagram. User space: User App, Socket lib. Kernel space: Cluster File System and iSCSI on top of the Linux Kernel Socket Switch, which leads either to SCI SOCKET and SCI HW, or to the native socket path (TCP/UDP, IP, Ethernet driver, Ethernet HW)]

Small Message Latency
[Figure]
TCP STREAM
[Figure]
TCP-RR SCI SOCKET vs Gigabit Ethernet
[Figure]
Scali MPI over SCI SOCKET
• SCI SOCKET is 1.6 - 6.0 times faster than TCP/GigE

Why is SCI SOCKET so fast?
• Small messages are sent using basic CPU instructions
  ▪ Data is normally located in the CPU cache
  ▪ Low-cost posted write to a local memory address
  ▪ A single CPU store instruction sends 8 bytes
  ▪ Raw send latency for 8 bytes is approximately 210 nanoseconds
  ▪ No need to lock down or register memory
• Large messages are sent using DMA
• Streamlined and lock-free messaging protocol on top of shared memory
• Combination of polling and interrupts
• Receiving a message causes the received message to be cached
  ▪ No additional memory access

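The combination of 8-byte stores, a lock-free protocol and polling can be illustrated with a generic single-producer, single-consumer ring in C11. This is a sketch of the general technique, not the SCI SOCKET wire protocol. All names are invented, and the ring lives in ordinary process memory here, whereas on SCI the slots would sit in a remotely mapped memory segment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* must be a power of two */

/* One sender and one receiver share this structure; no locks are used. */
struct msg_ring {
    atomic_uint head;            /* next slot to write; advanced by the sender   */
    atomic_uint tail;            /* next slot to read; advanced by the receiver  */
    uint64_t slot[RING_SLOTS];   /* each message is one 8-byte word              */
};

void ring_init(struct msg_ring *r)
{
    assert(r != NULL);
    atomic_init(&r->head, 0u);
    atomic_init(&r->tail, 0u);
}

/* Sender side: one 8-byte store plus an index update.
 * Returns 1 on success, 0 if the ring is full. */
int ring_send(struct msg_ring *r, uint64_t word)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return 0;
    r->slot[head % RING_SLOTS] = word;
    /* Release: the payload becomes visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return 1;
}

/* Receiver side: poll for a message.
 * Returns 1 and stores it in *word, or 0 if nothing has arrived. */
int ring_poll(struct msg_ring *r, uint64_t *word)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0;
    *word = r->slot[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return 1;
}
```

Because each index is written by only one side, neither side ever waits on a lock. A receiver would typically spin on `ring_poll` for a short time and fall back to waiting for an interrupt, which matches the mix of polling and interrupts listed above.
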
Cluster File Systems
• SCI SOCKET: A typical cluster file system will run out of the box
• PVFS
  ▪ Open Source / GPL software
  ▪ http://www.parl.clemson.edu/pvfs/desc.html
• Lustre
  ▪ Open Source / GPL software
  ▪ http://www.lustre.org/
• GFS
  ▪ Global File System
  ▪ Commercial file system available from Sistina
  ▪ www.sistina.com/products_gfs.htm

iSCSI
• SCSI over IP
  ▪ Protocol for encapsulating SCSI commands into IP packets
  ▪ I/O block data transport over IP networks
• iSCSI and SCI SOCKET can be used to build scalable SAN / NAS solutions
[Figure: iSCSI Driver → TCP/IP → NIC → IP network → NIC → TCP/IP → SCSI Driver]

iSCSI over SCI SOCKET
• Latency is approximately 10x better than Gigabit Ethernet
  ▪ Latency is reported by Intel’s ’ktest’

                  Gigabit Ethernet    SCI SOCKET
SCSI op 0x28      250 us              29 us
SCSI op 0x2A      250 us              31 us
SCSI op 0x25      250 us              27 us

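The per-operation improvement can be checked with simple arithmetic on the reported numbers. The sketch below uses only the values from the slide. The opcode names in the comments are the standard SCSI meanings of those opcodes, and the struct and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the reported ktest latencies, in microseconds. */
struct iscsi_latency {
    unsigned char opcode;
    double gige_us;   /* Gigabit Ethernet */
    double sci_us;    /* SCI SOCKET */
};

/* Values copied from the slide. */
static const struct iscsi_latency ktest_results[] = {
    { 0x28, 250.0, 29.0 },   /* READ(10) */
    { 0x2A, 250.0, 31.0 },   /* WRITE(10) */
    { 0x25, 250.0, 27.0 },   /* READ CAPACITY(10) */
};

/* How many times lower the SCI SOCKET latency is for one operation. */
double latency_speedup(const struct iscsi_latency *row)
{
    assert(row != NULL && row->sci_us > 0.0);
    return row->gige_us / row->sci_us;
}
```

Applied to the three rows, the ratio is 250/29, 250/31 and 250/27, so between roughly 8 and 9.3, which is the basis for the "approximately 10x" figure.
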
iSCSI over SCI SOCKET
• Throughput is 2-4 times Gigabit Ethernet
SCI SOCKET comparison
Technology       Latency    Throughput    Reference
SCI              2.26 us    2016 Mbps     www.dolphinics.com
Myrinet          12 us      1818 Mbps     www.myrinet.com
Gbit Ethernet    23 us      936 Mbps      www.dolphinics.com
Infiniband       28 us      3768 Mbps     IEEE Symposium IPASS 2004

SCI vs other interconnects
• As reported by Ameslab (Iowa State University, USA)
  ▪ Netpipe benchmark
[Figure]

Applications running SCI SOCKET
• Intel iSCSI
• PVFS
• LUSTRE
• MySQL Cluster
• LAM-MPI
• MPICH2
• PVM
• Oracle (Client/Server sqlplus)
• TerraGrid (tm) by Terrascale
• Scali MPI Connect™
• Latency_bench
• Netpipe TCP/PVM
• Netperf

Current Development
• Available on X86 and X86_64, Linux 2.4 and 2.6
• Itanium beta release is ready
• Porting to Windows in progress
• Support for multiple adapters in progress
  ▪ Data striping multiplies throughput with no latency penalty or extra CPU load
  ▪ Redundancy and transparent failover to another SCI adapter or to Ethernet

http://www.gria.org/
• Would you like your computers to earn you extra money?
• Would you like to have cheap access to tons of computing power?
  ▪ The GRIA project will take Grid technology into the real world, enabling industrial users to trade computational resources on a commercial basis to meet their needs more cost effectively.
• GRIA enables organizations to:
  ▪ Outsource computation.
    • If you need short-term computation, and cannot justify the expense of the hardware purchase, GRIA provides a mechanism to discover, negotiate and utilize other organizations' spare computing resources.
  ▪ Rent out spare CPU cycles.
    • GRIA provides a mechanism allowing you to commercially offer your spare computing resources on the Grid.

Acknowledgement
• The SCI SOCKET kernel module has been developed in the IST-33240 project GRIA (http://www.gria.org)
• The SCI SOCKET user space software library has been developed in the ITEA project HYADES (http://www.hyades-itea.org)
• The SCI SOCKET software is open source and available under GPL/LGPL. Dolphin strongly appreciates the contributions to the code and the testing done by volunteer programmers and partners.
• More information about SCI SOCKET can be found at http://www.dolphinics.com/products/software/sci_sockets.html