slides - Yufei Ren

Design and Performance Evaluation of
NUMA-Aware RDMA-Based End-to-End
Data Transfer System
Yufei Ren, Tan Li, Dantong Yu, Shudong Jin,
and Thomas Robertazzi
Massive Data Output
• Big data at peta-/exa-byte scale
Climate simulation
High-energy physics
Systems biology
• Data transfer and data synchronization
LAN
WAN
SC 13
End-to-End Data Transfer
[Figure labels: SAN, Gateways, 1~10 Gbps Ethernet]
• TCP-based data transfer software: GridFTP, scp, BBCP
• SAN storage: iSCSI
RDMA is a Game-Changer
[Figure labels: SAN, Gateways, 100 Gbps Ethernet; RoCE, iWARP, InfiniBand]
• Q1: How scalable and efficient is TCP-based software at high speed?
• Q2: How can advanced RDMA technology be used to transfer data with high bandwidth and low cost?
A Preliminary Experiment
3 x 40 Gbps RoCE
[Figure: Throughput (Gbps, axis 0 to 120), TCP vs. RDMA; data labels on chart: 117.6, 91.8, 91.8, 83.5]
• TCP was not able to saturate this fat link.
• RDMA achieved 98% of bare-metal throughput.
• TCP: 35% of CPU time is spent on memory copy
• copy_user_generic_string()
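As a rough check of the 98% figure, the sketch below divides an achieved throughput by the aggregate line rate. It assumes that 117.6 Gbps is the RDMA data label and that the ceiling is the full 3 x 40 Gbps; both are readings of the chart, not statements from the slide, and the function name is hypothetical.

```python
def link_utilization(achieved_gbps: float, links: int, gbps_per_link: float) -> float:
    """Return achieved throughput as a fraction of the aggregate line rate."""
    return achieved_gbps / (links * gbps_per_link)


# Assumed reading of the chart: 117.6 Gbps over three 40 Gbps RoCE links.
rdma_fraction = link_utilization(117.6, links=3, gbps_per_link=40.0)
print(f"{rdma_fraction:.0%}")
```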
Goals
• Better practice in high-speed data transfer
• High throughput
• Achieve line speed
• Low cost
• CPU utilization
• Memory footprints
• Scalability
• 100 Gbps and Beyond
• Wide Area Networks
Hardware Development and Zero-copy
Hardware Development
[Figure: processor timeline, 1990 to 2010: single-core, then multi-core behind one memory controller, then NUMA with multiple memory nodes]
Single-core limits: Memory Wall, Power Wall, Frequency Wall
Processor-Memory Performance Gap: grows 50% / year (© John D. McCalpin)
Hardware Development
[Figure: processor and network timeline, 1990 to 2010]
Single-core: Gigabit Ethernet
Multi-core: 10 Gigabit Ethernet/InfiniBand
NUMA: 40/56/100 Gigabit Ethernet/InfiniBand
Therefore …
High-speed data transfer should pay close attention to data copies and use more efficient zero-copy techniques to address performance bottlenecks.
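The difference between copying data and handing out references to it can be sketched in user space. This is only an analogy for the copy-versus-zero-copy contrast, not RDMA itself, and the function names are hypothetical: slicing a buffer into new byte strings duplicates the data, while `memoryview` slices point at the original buffer.

```python
def split_with_copies(buf: bytearray, chunk: int) -> list:
    """Each chunk is a new bytes object, so every byte is copied."""
    return [bytes(buf[i:i + chunk]) for i in range(0, len(buf), chunk)]


def split_zero_copy(buf: bytearray, chunk: int) -> list:
    """Each chunk is a memoryview slice that references the original buffer."""
    view = memoryview(buf)
    return [view[i:i + chunk] for i in range(0, len(buf), chunk)]


payload = bytearray(b"A" * 8)
copies = split_with_copies(payload, 4)
views = split_zero_copy(payload, 4)

payload[0] = ord("B")            # mutate the source after splitting
print(copies[0])                 # the copy still holds the old bytes
print(bytes(views[0]))           # the view reflects the live buffer
```

The view-based chunks cost no extra memory traffic, which is the property the slide asks high-speed transfer software to aim for.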
Cost: Data Copy vs. Zero Copy
40 Gbps RoCE
[Figure: RFTP (RDMA) and iperf (TCP), each reading from /dev/zero and writing to /dev/null; stages: Loading, Transmission, Offloading]
Cost Breakdown
[Figure: cost (axis 0% to 400%) for RDMA data source, RDMA data sink, TCP data source, and TCP data sink across Loading, Transmission, and Offloading; components: loading, protocol processing, data copy, offloading; data labels on chart: 111%, 152%, 101%, 158%]
RDMA-Based End-to-End Data Transfer
RDMA-based End-to-End Solution
iSER: iSCSI Extensions for RDMA
RFTP: RDMA-enabled FTP service
[Figure labels: SAN, Gateways, 100 Gbps Ethernet]
End-to-End: TCP vs. RDMA
[Figure: Loading, Transmission, and Offloading stages for each stack]
RDMA: iSER for loading and offloading, RFTP for transmission
TCP: iSCSI for loading and offloading, GridFTP and SCP for transmission
Other figure labels: User Buffer, Kernel Buffer, NIC, IB, RoCE, iWARP
SAN: NUMA-Agnostic iSER
[Figure labels: iSER, SAN, Initiator (x2), tgtd (x2), cores, Memory (x2)]
SAN: NUMA-Aware iSER
[Figure labels: iSER, SAN, Initiator (x2), numactl tgtd (x2), cores, Memory (x2)]
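The slide shows only "numactl tgtd", without options. A minimal sketch of one way to bind a process to a single NUMA node follows, assuming numactl's `--cpunodebind` and `--membind` options and a node number picked for illustration. The helper names are hypothetical; `parse_cpulist` expands the range format that Linux reports in `/sys/devices/system/node/node<N>/cpulist`.

```python
def parse_cpulist(text: str) -> list:
    """Expand a Linux cpulist string such as "0-3,8-11" into CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numactl_command(node: int, argv: list) -> list:
    """Prefix a command so its CPUs and memory both come from one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *argv]


print(parse_cpulist("0-3,8-11"))
# Node 0 is an assumption; the slides do not say which node each tgtd used.
print(numactl_command(0, ["tgtd"]))
```

The resulting list could be handed to `subprocess.run` on a host where numactl and tgtd are installed.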
RFTP: RDMA-based FTP Service
• RDMA Pros
• Saves CPU and memory resources
• Low latency and high throughput
• RDMA Cons
• Explicit memory management
• Asynchronous, event-driven programming interfaces
• The application has to pipeline RDMA operations itself and manage in-flight memory status.
Front-end: RFTP Software
More in: Protocols for Wide-Area Data-Intensive Applications: Design and Performance Issues, SC '12
• One dedicated Reliable Connection queue pair for exchanging control messages, and one or more for actual data transfer
• Multiple memory blocks in flight
• Multiple reliable queue pairs for data transfer
• Proactive feedback
[Figure: data source (Load Data process) and data sink (Offload Data process) connected by a Control Msg QP and Bulk Data Transfer QPs; block operations shown: get_free_blk, put_ready_blk, get_ready_blk, put_free_blk]
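The block operations named on this slide suggest a pool of buffers cycling between a free list and a ready list. The sketch below is one possible reading, not RFTP's implementation: the four method names come from the slide, while `BlockPool`, `transfer`, the sentinel, and the in-memory "send" are assumptions made for illustration.

```python
import queue
import threading


class BlockPool:
    """Fixed set of buffers cycling between a free queue and a ready queue."""

    def __init__(self, count: int, size: int):
        self.free = queue.Queue()
        self.ready = queue.Queue()
        for _ in range(count):
            self.free.put(bytearray(size))

    def get_free_blk(self):
        return self.free.get()            # blocks until a buffer is recycled

    def put_ready_blk(self, blk, length):
        self.ready.put((blk, length))

    def get_ready_blk(self):
        return self.ready.get()

    def put_free_blk(self, blk):
        self.free.put(blk)


def transfer(data: bytes, pool: BlockPool, size: int) -> bytes:
    """Load data into blocks in one thread while another drains them."""
    received = bytearray()

    def loader():
        for off in range(0, len(data), size):
            blk = pool.get_free_blk()
            piece = data[off:off + size]
            blk[:len(piece)] = piece      # same-length slice, buffer not resized
            pool.put_ready_blk(blk, len(piece))
        pool.put_ready_blk(None, 0)       # sentinel: no more data

    worker = threading.Thread(target=loader)
    worker.start()
    while True:
        blk, length = pool.get_ready_blk()
        if blk is None:
            break
        received += blk[:length]          # stand-in for posting an RDMA operation
        pool.put_free_blk(blk)            # recycle once the "send" has completed
    worker.join()
    return bytes(received)


pool = BlockPool(count=4, size=8)
print(transfer(b"hello world" * 3, pool, size=8))
```

Because the loader blocks when no free buffer is left, the pool size bounds how many blocks can be in flight at once.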
End-to-End Performance Evaluation
Testbed Setup
• Testbed
• LAN: 3 x 40 Gbps RoCE, 2 x 56 Gbps InfiniBand
• WAN: 40 Gbps RoCE
• 384 GB of memory used as the storage medium to simulate a real high-performance storage system
• GridFTP vs. RFTP
• Bandwidth
• CPU utilization
• Load data from the storage server and dump data to the storage server
• TCP tuning
• Jumbo frames, IRQ affinity, TCP buffer sizes, etc.
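Of the TCP tuning items listed, socket buffer sizing is the one an application can request directly. The sketch below asks for larger send and receive buffers and reads back what the kernel granted. The 4 MiB request and the helper name are assumptions; the slides do not state the values used on the testbed.

```python
import socket


def request_tcp_buffers(sock: socket.socket, size_bytes: int) -> tuple:
    """Ask for larger send/receive buffers and report what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_snd, granted_rcv


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    snd, rcv = request_tcp_buffers(s, 4 * 1024 * 1024)
    print(snd, rcv)  # the kernel may cap or adjust the requested size
```

The granted values depend on system-wide limits, so they can differ from the request.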
Storage Performance
[Figure labels: HP DL380, IBM X3650, tgtd, fio, Mellanox FDR Switch InfiniBand SX6018]
• NUMA-aware tgtd vs. NUMA-agnostic tgtd
Storage Performance - Read
[Figure: Read Throughput (GB/s) and tgtd CPU Utilization (%) vs. I/O Size (64 to 8192 KB), OS default vs. NUMA-aware tuning; chart annotation: 7.6%]
Storage Performance - Write
[Figure: Write Throughput (GB/s) and tgtd CPU Utilization (%) vs. I/O Size (64 to 8192 KB), OS default vs. NUMA-aware tuning; chart annotations: 19%, 300%]
• Cache-coherence traffic in NUMA architecture
• Read: cached/shared
• Write: modified
LAN Testbed
[Figure: HP DL380 and IBM X3650 servers running iSER, RFTP, and GridFTP; Mellanox QDR Switch Ethernet SX1036; Mellanox FDR Switch InfiniBand SX6018; Mellanox ConnectX 3 VPI 56Gbps FDR; Mellanox ConnectX 3 Ethernet 40Gbps QDR]
LAN: End-to-End Performance
[Figure: Bandwidth Comparison, Bandwidth (Gbps) over 25 minutes, RFTP vs. GridFTP, with the Storage Threshold marked]
[Figure: CPU Comparison, CPU Utilization (%) split into user, sys, and wait for RFTP source, GridFTP source, RFTP sink, and GridFTP sink]
Chart annotation: 3x
End-to-End Performance: Bi-directional
[Figure: Bi-directional Bandwidth (Gbps, axis 0 to 200) over 30 minutes, RFTP vs. GridFTP]
• RFTP: 83% improvement vs. unidirectional
• GridFTP: 33% improvement vs. unidirectional
40 Gbps WAN Testbed
[Figure labels: NERSC, ANL]
• 40 Gbps RoCE WAN
• 4,000 miles
• RTT: 95 milliseconds
• BDP: 500 MB
• Will RFTP scale in the WAN?
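The bandwidth-delay product (BDP) quoted for this link follows from the bandwidth and round-trip time. A minimal sketch, assuming decimal units (1 Gbps = 10^9 bits per second, 1 MB = 10^6 bytes) and a hypothetical helper name:

```python
def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes: bits per second * seconds / 8."""
    return bandwidth_gbps * 1e9 * (rtt_ms / 1e3) / 8


bdp = bdp_bytes(40, 95)
print(f"{bdp / 1e6:.0f} MB")
```

Under these assumptions the result is about 475 MB, consistent with the roughly 500 MB stated on the slide. This is the amount of data that must be in flight to keep the link full.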
RFTP Bandwidth in 40 Gbps WAN
[Figure: Bandwidth (Gbps, axis 35 to 40) vs. Block Size (1M to 16M) for 1, 2, 4, 8, and 16 streams]
Scale RDMA to WAN
• RoCE and iWARP
• RoCE requires a complicated layer-2 configuration for lossless operation.
• iWARP: TOE (TCP offload engine)
• iWARP operates with standard switches
[Figure: End-to-End Performance over 40 Gbps iWARP in LAN, Bandwidth (Gbps, axis 0 to 40)]
Conclusion
• HPC data transfer
• Hardware advances need advanced software
• Efficient memory usage in HPC
• RDMA-based design
• NUMA-aware tuning
• Testbed in LAN and WAN validated our design
Q&A
RFTP Software
http://ftp100.cewit.stonybrook.edu/rftp
RFTP runs on
Caltech booth
Stony Brook University
http://ftp100.cewit.stonybrook.edu/ganglia