
Reliable Datagram Sockets and InfiniBand

Hanan Hit, NoCOUG Staff, 2010


Agenda

• Infiniband Basics
• What is RDS (Reliable Datagram Sockets)?
• Advantages of RDS over InfiniBand
• Architecture Overview
• TPC-H over 11g Benchmark
• InfiniBand vs. 10GE

Value Proposition - Oracle Database RAC

• Oracle Database Real Application Clusters (RAC) provides the ability to build an application platform from multiple systems clustered together
• Benefits
  – Performance: increase the performance of a RAC database by adding servers to the cluster
  – Fault Tolerance: a RAC database is constructed from multiple instances, so the loss of one instance does not bring down the entire database
  – Scalability: scale a RAC database by adding instances to the cluster database

Some Facts

• High-end OLTP database applications range from 10 to 20 TB in size, with 2-10K IOPS.
• High-end DW applications fall into the 20-40 TB category, with I/O bandwidth requirements of around 4-8 GB per second.
• Two-socket x86_64 servers currently seem to offer the best price point.
• The major limitations of these servers are the limited number of slots available for external I/O cards and the CPU cost of processing I/O in conventional kernel-based I/O mechanisms.
• The main challenge in building cluster databases that run across multiple servers is providing low-cost, balanced I/O bandwidth.
• Conventional Fibre Channel storage arrays, with their expensive plumbing, do not scale well enough to create the balance at which these database servers can be optimally utilized (see the rough arithmetic below).
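As a rough illustration of that plumbing cost, assuming the 4Gb Fibre Channel links used later in this deck: sustaining 8 GB/s of DW bandwidth is roughly 64 Gb/s, which takes on the order of sixteen 4Gb FC links, plus matching HBAs competing for the servers' limited I/O slots.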

IBA/Reliable Datagram Sockets (RDS) Protocol

What is IBA

InfiniBand Architecture (IBA) is an industry-standard, channel-based, switched-fabric, high-speed interconnect architecture with low latency and high throughput. The InfiniBand architecture specification defines the connection between processor nodes and high-performance I/O nodes such as storage devices.


What is RDS

• A low-overhead, low-latency, high-bandwidth, ultra-reliable, supportable Inter-Process Communication (IPC) protocol and transport system
• Matches Oracle's existing IPC models for RAC communication
• Optimized for transfers from 200 bytes to 8 MB
• Based on the socket API (a minimal usage sketch follows)
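To make the socket-API point concrete, here is a minimal sketch of an RDS sender on Linux. The addresses and port are placeholders, and it assumes a kernel with the rds module loaded (plus rds_rdma for the InfiniBand transport):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    #ifndef AF_RDS
    #define AF_RDS 21                   /* Linux address family for RDS */
    #endif

    int main(void)
    {
        /* RDS is datagram-oriented but reliable: SOCK_SEQPACKET */
        int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
        if (fd < 0) { perror("socket(AF_RDS)"); return 1; }

        /* An RDS socket must be bound to a local IP (e.g. the IPoIB
           address of the HCA) before it can send. */
        struct sockaddr_in laddr = {0};
        laddr.sin_family = AF_INET;
        laddr.sin_addr.s_addr = inet_addr("192.168.10.1");  /* placeholder */
        laddr.sin_port = htons(18634);                      /* placeholder */
        if (bind(fd, (struct sockaddr *)&laddr, sizeof(laddr)) < 0) {
            perror("bind"); close(fd); return 1;
        }

        /* Each sendto() is one datagram; delivery is guaranteed and
           in order, unlike UDP, so no retransmit logic is needed here. */
        struct sockaddr_in raddr = {0};
        raddr.sin_family = AF_INET;
        raddr.sin_addr.s_addr = inet_addr("192.168.10.2");  /* placeholder */
        raddr.sin_port = htons(18634);
        const char msg[] = "hello over RDS";
        if (sendto(fd, msg, sizeof(msg), 0,
                   (struct sockaddr *)&raddr, sizeof(raddr)) < 0)
            perror("sendto");

        close(fd);
        return 0;
    }

The receive side mirrors this with recvfrom() on a bound socket. A design point worth noting: RDS multiplexes all sockets between a pair of nodes over one underlying fabric connection, which is how it stays low-overhead even with thousands of endpoints.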

Reliable Datagram Sockets (RDS) Protocol

Leverages InfiniBand's built-in high availability and load balancing features:

• Port failover on the same HCA
• HCA failover on the same system
• Automatic load balancing

Open source in OpenFabrics / OFED:
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/

Advantages of RDS over InfiniBand


• Lowering data center TCO requires efficient fabrics

• Oracle RAC 11g will scale for database-intensive applications only with the proper high-speed protocol and efficient interconnect

RDS over 10GE

• 10 Gb/s is not enough to feed the I/O needs of a multi-core server
• Each core may require > 3 Gb/s (see the arithmetic below)
• Packets can be lost and require retransmits
• Statistics are not an accurate indication of throughput
• Efficiency is much lower than reported
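To put rough numbers on the first two bullets, using the 24-core database servers from the POC configuration later in this deck: 24 cores × 3 Gb/s ≈ 72 Gb/s of potential I/O demand, several times what a single 10 Gb/s port can carry.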

RDS over InfiniBand

• The network efficiency is effectively 100%: the fabric is lossless, so no bandwidth is wasted on retransmits
• 40 Gb/s today
• Uses InfiniBand delivery capabilities that offload end-to-end checking to the InfiniBand fabric
• Integrated in the Linux kernel
• More tools will be ported to support RDS (e.g. netstat)
• Shows a significant real-world application performance boost
  – Decision support systems
  – Mixed batch/OLTP workloads

InfiniBand considerations

Why does Oracle use InfiniBand?

• High bandwidth (1x SDR = 2.5 Gb/s, 1x DDR = 5.0 Gb/s, 1x QDR = 10.0 Gb/s; see the link arithmetic below)
  – The V2 DB machine uses 4x QDR links (40 Gb/s in each direction, simultaneously)
• Low latency (a few µs end-to-end, 160 ns per switch hop)
• RDMA capable
  – Exadata cells receive/send large transfers using RDMA, saving CPU for other operations
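On the link arithmetic: a 4x QDR link is 4 lanes × 10 Gb/s = 40 Gb/s of signaling per direction; since SDR, DDR, and QDR links use 8b/10b encoding, the usable data rate of a 4x QDR link is 40 × 8/10 = 32 Gb/s each way.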

Architecture Overview


#1 Price/Performance TPC-H over 11g Benchmark

• 11g over DDR
  – Servers: 64 x ProLiant BL460c (CPU: 2 x Intel Xeon X5450, quad-core)
  – Fabric: Mellanox DDR InfiniBand
  – Storage: native InfiniBand storage (6 x HP Oracle Exadata)

[Bar chart: Price / QphH @ 1000 GB, 11g over 1GE vs. 11g over DDR]

World record clustered TPC-H performance and price/performance

POC Hardware Configuration

Application Servers
• 2x HP BL480c: 2 processors / 8 cores, Xeon X5460 3.16 GHz
• 64 GB RAM, 4x 72 GB 15K drives
• NIC: HP NC373i 1Gb

Concurrent Manager Servers
• 6x HP BL480c: 2 processors / 8 cores, Xeon X5460 3.16 GHz
• 64 GB RAM, 4x 72 GB 15K drives
• NIC: HP NC373i 1Gb

Database Servers
• 6x HP DL580 G5: 4 processors / 24 cores, Xeon X7460 2.67 GHz
• 256 GB RAM, 8x 72 GB 15K drives
• NIC: Intel 10GbE XF SR 2-port PCIe
• Interconnect: Mellanox 4x PCIe InfiniBand

Storage Array
• HP XP24000: 64 GB cache / 20 GB shared memory
• 60 array groups of 4 spindles, 240 spindles total, 146 GB 15K Fibre Channel disk drives

Networks: 1 GbE, 10 GbE, InfiniBand, 4Gb Fibre Channel

CPU Utilization

InfiniBand maximizes CPU efficiency, enabling more than 20% higher efficiency than 10GE.

[Charts: CPU utilization, InfiniBand interconnect vs. 10GigE interconnect]

Disk IO Rate

InfiniBand maximizes disk utilization, delivering 46% higher I/O traffic than 10GE.

[Charts: disk I/O rate, InfiniBand interconnect vs. 10GigE interconnect]

InfiniBand delivers 63% more TPS vs. 10GE

• TPS rates for the invoice load use case

Oracle RAC workload:
• Nodes 1 through 4: batch processing
• Node 5: extra node, not used
• Node 6: EBS other activity
• Database size: 2 TB (ASM, 5 LUNs @ 400 GB)

Interconnect  Activity                     Start Time     End Time       Duration  Records    TPS
InfiniBand    Invoice Load - Load File     6/17/09 7:48   6/17/09 7:54   0:06:01   9,899,635  27,422.81
InfiniBand    Invoice Load - Auto Invoice  6/17/09 8:00   6/17/09 9:54   1:54:21   9,899,635  1,442.89
InfiniBand    Invoice Load - Total         N/A            N/A            2:00:22   9,899,635  1,370.76
10 GigE       Invoice Load - Load File     6/25/09 17:15  6/25/09 17:20  0:05:21   7,196,171  22,417.98
10 GigE       Invoice Load - Auto Invoice  6/25/09 18:22  6/25/09 20:39  2:17:05   7,196,171  874.91
10 GigE       Invoice Load - Total         N/A            N/A            2:22:26   7,196,171  842.05

[Bar chart: TPS, 10GE vs. InfiniBand]
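Reading the totals: TPS is records divided by duration, e.g. 9,899,635 records / 7,222 s (2:00:22) ≈ 1,371 TPS for InfiniBand and 7,196,171 records / 8,546 s (2:22:26) ≈ 842 TPS for 10GE; the ratio 1,370.76 / 842.05 ≈ 1.63 is the 63% in the slide title.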

InfiniBand needs only 6 servers vs. the 10 servers needed by 10GE


Sun Oracle Database Machine

• Clustering is the architecture of the future: highest performance, lowest cost, redundant, incrementally scalable
• The Sun Oracle Database Machine, based on 40 Gb/s InfiniBand, delivers a complete clustering architecture for all data management needs

Sun Oracle Database Server Hardware

• 8 Sun Fire X4170 DB servers per rack
• 8 CPU cores each
• 72 GB memory
• Dual-port 40 Gb/s InfiniBand card
• Fully redundant power and cooling

Exadata Storage Server Hardware

• Building block of the massively parallel Exadata Storage Grid
  – Up to 1.5 GB/sec raw data bandwidth per cell
  – Up to 75,000 IOPS with Flash
• Sun Fire X4275 Server
  – 2 quad-core Intel Xeon E5540 processors
  – 24 GB RAM
  – Dual-port 4X QDR (40 Gb/s) InfiniBand card
• Disk options
  – 12 x 600 GB SAS disks (7.2 TB total), or
  – 12 x 2 TB SATA disks (24 TB total)
  – 4 x 96 GB Sun Flash PCIe cards (384 GB total)
• Software pre-installed
  – Oracle Exadata Storage Server Software
  – Oracle Enterprise Linux
  – Drivers, utilities
• Single point of support from Oracle
  – 3-year, 24 x 7, 4-hour on-site response

Mellanox 40Gbps InfiniBand Networking

Highest bandwidth and lowest latency

• Sun Datacenter InfiniBand Switch: 36 QSFP ports
• Fully redundant, non-blocking I/O paths from servers to storage
• 2.88 Tb/sec bisectional bandwidth per switch (see the arithmetic below)
• 40 Gb/s QDR, dual ports per server
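For reference, the quoted switch bandwidth follows from the port count: 36 ports × 40 Gb/s × 2 directions = 2,880 Gb/s = 2.88 Tb/s of aggregate bandwidth per switch.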

DB machine protocol stack

[Stack diagram: iDB and Oracle IPC (RAC) run over RDS; SQL*Net, CSS, etc. run over TCP/UDP and IPoIB; both paths terminate at the InfiniBand HCA]

RDS provides:
• Zero loss
• Zero copy (ZDP)

What's new in V2

V1 DB machine:
• 2 managed, 2 unmanaged switches
• 24-port DDR switches
• 15 second min. SM failover timeout
• CX4 connectors
• SNMP monitoring available
• Cell HCA in x4 PCIe slot

V2 DB machine:
• 3 managed switches
• 36-port QDR switches
• 5 second min. SM failover timeout
• QSFP connectors
• SNMP monitoring coming soon
• Cell HCA in x8 PCIe slot

Infiniband Monitoring

• SNMP alerts on Sun IB switches are coming
• EM support for the IB fabric is coming
  – A Voltaire EM plugin is available (at extra cost)
• In the meantime, customers can and should monitor using
  – IB commands from the host (e.g. ibstat, perfquery)
  – The switch CLI, to monitor the various switch components
• Self-monitoring exists
  – Exadata cell software monitors its own IB ports
  – The bonding driver monitors local port failures
  – The SM monitors all port failures on the fabric

Scale Performance and Capacity

• Scalable
  – Scales to an 8-rack database machine by just adding wires
  – Scales further with external InfiniBand switches: hundreds of storage servers, multi-petabyte databases
• Redundant and fault tolerant
  – Failure of any component is tolerated
  – Data is mirrored across storage servers

Competitive Advantage

“…everybody is using Ethernet, we are using InfiniBand, 40Gb/s InfiniBand”

Larry Ellison, keynote at Oracle OpenWorld introducing Exadata-2 (the Sun Oracle DB machine), October 14, 2009, San Francisco
