High Performance Data Transfer over TransPAC


High Performance Data Transfer
over TransPAC
The 3rd International HEP DataGrid Workshop
August 26, 2004
Kyungpook National Univ., Daegu, Korea
Masaki Hirabaru
[email protected]
NICT
Acknowledgements
•NICT Kashima Space Research Center
Yasuhiro Koyama, Tetsuro Kondo
•MIT Haystack Observatory
David Lapsley, Alan Whitney
•APAN Tokyo NOC
•JGN II NOC
•NICT R&D Management Department
•Indiana U. Global NOC
Contents
•e-VLBI
•Performance Measurement
•TCP test over TransPAC
•TCP test in the Laboratory
Motivations
• MIT Haystack – NICT Kashima e-VLBI Experiment
on August 27, 2003 to measure UT1-UTC in 24 hours
– 41.54 GB CRL => MIT 107 Mbps (~50 mins)
41.54 GB MIT => CRL 44.6 Mbps (~120 mins)
– RTT ~220 ms, UDP throughput 300-400 Mbps
However TCP ~6-8 Mbps (per session, tuned)
– BBFTP with 5 x 10 TCP sessions to gain performance
• HUT – NICT Kashima Gigabit VLBI Experiment
- RTT ~325 ms, UDP throughput ~70 Mbps
However TCP ~2 Mbps (as is), ~10 Mbps (tuned)
- Netants (5 TCP sessions with ftp stream restart extension)
These experiments need high-speed, real-time, reliable transfer of huge data
volumes over long-haul, high-performance networks.
VLBI (Very Long Baseline Interferometry)
•e-VLBI: geographically distributed observation, interconnecting radio antennas over the world
•Gigabit / real-time VLBI: multi-gigabit rate sampling
[Figure: radio signals from a star are sampled by A/D converters with reference clocks at each antenna and sent over the Internet to a correlator; data rate 512 Mbps and up; a high bandwidth-delay product network issue]
(NICT Kashima Radio Astronomy Applications Group)
Recent Experiment of UT1-UTC Estimation
between NICT Kashima and MIT Haystack (via Washington DC)
•July 30, 2004, 4am-6am JST
Kashima was upgraded to 1G through the JGN II 10G link.
All processing was done in ~4.5 hours (last time: ~21 hours).
Average ~30 Mbps transfer by bbftp (under investigation).
Network Diagram for e-VLBI and test servers
[Figure: topology of the "test" and "experiment" paths. Japan: Kashima (e-vlbi server, 1G) – ~100 km – Koganei – ~250 km – Tokyo XP (perf server, 1G (10G)) over JGN II. Korea: Seoul XP – Daejon (bwctl server) – Taegu – Kwangju – Busan over KOREN / APII/JGNII (2.5G SONET, up to 10G), connecting to Genkai XP / Fukuoka / Kitakyushu (~1,000 km from Tokyo). US: TransPAC / JGN II (2.4G x2) across ~9,000 km to Los Angeles and Chicago, then Abilene (10G) via Indianapolis and Washington DC (~4,000 km) to MIT Haystack (1G (10G)).]
e-VLBI:
– Done: 1 Gbps upgrade at Kashima
– On-going: 2.5 Gbps upgrade at Haystack
– Experiments using 1 Gbps or more
– Using real-time correlation
*Info and key exchange page needed, like:
http://e2epi.internet2.edu/pipes/ami/bwctl/
APAN JP Maps
(written in perl and fig2dev)
Purposes
• Measure, analyze and improve end-to-end performance in high bandwidth-delay product networks
– to support networked science applications
– to help operations in finding a bottleneck
– to evaluate advanced transport protocols
(e.g. Tsunami, SABUL, HSTCP, FAST, XCP, [ours])
• Improve TCP under easier conditions
– with a single TCP stream
– memory to memory
– a bottleneck but no cross traffic
Goal: consume all the available bandwidth
Path
a) Without a bottleneck: Sender – Access (B1) – Backbone (B2) – Access (B3) – Receiver,
with B1 <= B2 and B1 <= B3; the sender's access link is the slowest, so no queue builds up inside the network.
b) With a bottleneck: same path, but B1 > B2 or B1 > B3; a queue builds up in front of the bottleneck link.
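
(For reference, the bottleneck condition above expressed as a small Python check; the function name and sample rates are illustrative, not from the slides.)

def has_bottleneck(b1_bps, b2_bps, b3_bps):
    """True if the sender's access rate B1 exceeds a downstream link
    (backbone B2 or receiver access B3), i.e. case b) above."""
    return b1_bps > b2_bps or b1_bps > b3_bps

# Case a): 1G access, 10G backbone, 1G receiver access -> no internal bottleneck
print(has_bottleneck(1e9, 10e9, 1e9))   # False
# Case b): 1G access into an 800 Mbps link -> queueing at the bottleneck
print(has_bottleneck(1e9, 800e6, 1e9))  # True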
TCP on a path with bottleneck
[Figure: the sender's bursts overflow the queue at the bottleneck, causing packet loss.]
The sender may generate burst traffic.
The sender recognizes the overflow only after a feedback delay of up to one RTT.
The bottleneck may change over time.
Limiting the Sending Rate
a) The sender transmits at 1 Gbps into a 100 Mbps bottleneck: congestion, and the receiver sees ~20 Mbps throughput.
b) The sender limits its rate to ~90 Mbps over the same 100 Mbps bottleneck: the receiver sees ~90 Mbps throughput. Better!
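
(As a rough, illustrative sketch of rate limiting by pacing, not the exact mechanism used in these experiments: the inter-packet gap needed to hold a given rate with 1500-byte packets.)

def pacing_gap_us(target_rate_bps, packet_bytes=1500):
    """Inter-packet gap (in microseconds) that holds back-to-back
    packets of packet_bytes to target_rate_bps."""
    return packet_bytes * 8 / target_rate_bps * 1e6

# Limiting a GbE sender to ~90 Mbps with 1500-byte packets
# needs roughly a 133 us gap between packets.
print(round(pacing_gap_us(90e6), 1))  # 133.3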
Web100 (http://www.web100.org)
• A kernel patch for monitoring/modifying TCP
metrics in Linux kernel
• We need to know TCP behavior to identify a
problem.
• Iperf (http://dast.nlanr.net/Projects/Iperf/)
– TCP/UDP bandwidth measurement
• bwctl (http://e2epi.internet2.edu/bwctl/)
– Wrapper for iperf with authentication and scheduling
1st Step: Tuning a Host with UDP
• Remove any bottlenecks on a host
– CPU, Memory, Bus, OS (driver), …
• Dell PowerEdge 1650 (*not enough power)
– Intel Xeon 1.4GHz x1(2), Memory 1GB
– Intel Pro/1000 XT onboard PCI-X (133 MHz)
• Dell PowerEdge 2650
– Intel Xeon 2.8GHz x1(2), Memory 1GB
– Intel Pro/1000 XT PCI-X (133 MHz)
• Iperf UDP throughput 957 Mbps (checked in the sketch below)
– GbE wire rate: per-packet headers: UDP(8B)+IP(20B)+Ethernet II framing (38B)
– Linux 2.4.26 (RedHat 9) with web100
– PE1650: TxIntDelay=0
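
(A back-of-the-envelope check of the 957 Mbps figure, assuming a 1500 B IP MTU and the per-packet overheads listed above, where the 38 B of Ethernet framing covers preamble, header, FCS and inter-frame gap.)

def udp_goodput_mbps(line_rate_mbps=1000, ip_mtu=1500,
                     udp_hdr=8, ip_hdr=20, eth_overhead=38):
    """Theoretical UDP payload rate on GbE for full-size packets."""
    payload = ip_mtu - ip_hdr - udp_hdr   # 1472 B of UDP payload
    wire = ip_mtu + eth_overhead          # 1538 B on the wire per packet
    return line_rate_mbps * payload / wire

print(round(udp_goodput_mbps(), 1))  # 957.1, matching the iperf result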
2nd Step: Tuning a Host with TCP
• Maximum socket buffer size (TCP window size) (see the sketch after this list)
– net.core.wmem_max, net.core.rmem_max (64MB)
– net.ipv4.tcp_wmem, net.ipv4.tcp_rmem (64MB)
• Driver descriptor length
– e1000: TxDescriptors=1024, RxDescriptors=256 (default)
• Interface queue length
– txqueuelen=100 (default)
– net.core.netdev_max_backlog=300 (default)
• Interface queue descriptor
– fifo (default)
• MTU
– mtu=1500 (IP MTU)
• Iperf TCP throughput 941 Mbps
– GbE wire rate: per-packet headers: TCP(32B)+IP(20B)+Ethernet II framing (38B)
– Linux 2.4.26 (RedHat 9) with web100
• Web100 (incl. High Speed TCP)
– net.ipv4.web100_no_metric_save=1 (do not store TCP metrics in the route cache)
– net.ipv4.WAD_IFQ=1 (do not send a congestion signal on buffer full)
– net.ipv4.web100_rbufmode=0, net.ipv4.web100_sbufmode=0 (disable auto tuning)
– net.ipv4.WAD_FloydAIMD=1 (HighSpeed TCP)
– net.ipv4.web100_default_wscale=7 (default)
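
(For reference: a minimal, illustrative sketch, not the experiment's actual code, of sizing a TCP socket buffer from the bandwidth-delay product; the 64 MB kernel limits above comfortably cover a 1 Gbps, ~220 ms path.)

import socket

def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: the TCP window needed to fill the path."""
    return int(rate_bps * rtt_s / 8)

# 1 Gbps at ~220 ms RTT needs a window of roughly 27.5 MB,
# so the 64 MB limits set via sysctl above leave headroom.
window = bdp_bytes(1e9, 0.220)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request large buffers; the kernel caps them at net.core.wmem_max/rmem_max.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, window)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, window)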
Network Diagram for TransPAC/I2 Measurement (Oct. 2003)
[Figure: sender at Koganei / Tokyo XP (PE1650: Linux 2.4.22 (RH 9), Xeon 1.4 GHz, 1 GB memory, Intel Pro/1000 XT GbE), ~100 km from Kashima, with general and e-VLBI servers (0.1G / 1G); path over TransPAC (2.5G, 1G x2) across ~9,000 km to Los Angeles, then Abilene (10G) via Indianapolis and Washington DC (~4,000 km); receiver at the I2 venue (Mark5: Linux 2.4.7 (RH 7.1), P3 1.3 GHz, 256 MB memory, SK-9843 GbE, 1G), near MIT Haystack. Iperf UDP: ~900 Mbps with no loss.]
TransPAC/I2 #1: High Speed (60 mins) [throughput plot]
TransPAC/I2 #2: Reno (10 mins) [throughput plot]
TransPAC/I2 #3: High Speed (Win 12MB) [throughput plot]
Test in a laboratory – with bottleneck
[Setup: PE2650 sender and PE1650 receiver connected through an L2 switch (FES12GCF) and a Packet Sphere network emulator over GbE/T and GbE/SX links; emulated bottleneck: bandwidth 800 Mbps, buffer 256 KB, delay 88 ms, loss 0; 2*BDP = 16 MB]
• #1: Reno => Reno
• #2: High Speed TCP => Reno
Laboratory #1,#2: 800M bottleneck [throughput plots: Reno vs. HighSpeed TCP]
Laboratory #3,#4,#5: High Speed (Limiting) [throughput plots]
• Window size limited (16 MB), with limited slow-start (1000)
• Rate control: 270 us pause every 10 packets, with limited slow-start (1000)
• Cwnd clamp (95%), with limited slow-start (100)
How to know when the bottleneck changed
• The end host probes periodically (e.g. with a packet train)
• The router notifies the end host (e.g. XCP)
Another approach: enough buffer on the router
• At least 2 x BDP (bandwidth-delay product),
e.g. 1 Gbps x 200 ms x 2 = 400 Mb ~ 50 MB (worked out in the sketch below)
• Replace fast SRAM with DRAM in order to reduce space and cost
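
(A quick check of the 2 x BDP sizing above, with a small helper; the values are the slide's example numbers.)

def router_buffer_mb(rate_bps, rtt_s, factor=2.0):
    """Bottleneck buffer under the 'factor x BDP' rule of thumb, in MB."""
    return rate_bps * rtt_s * factor / 8 / 1e6

# 1 Gbps x 200 ms x 2 = 400 Mbit = 50 MB of buffer
print(router_buffer_mb(1e9, 0.200))  # 50.0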
Test in a laboratory – with bottleneck (2)
[Setup: same PE2650 sender and PE1650 receiver through the L2 switch (FES12GCF) and network emulator over GbE/T and GbE/SX links; emulated bottleneck: bandwidth 800 Mbps, buffer 64 MB, delay 88 ms, loss 0; 2*BDP = 16 MB]
• #6: High Speed TCP => Reno
Laboratory #6: 800M bottleneck [throughput plot: HighSpeed TCP]
Report on MTU
• Increasing the MTU (packet size) results in better performance (compared in the sketch below). The standard MTU is 1500 B; a 9 KB MTU is available throughout the Abilene, TransPAC, and APII backbones.
• On Aug 25, 2004, the remaining 1500 B link in Tokyo XP was upgraded to 9 KB. A 9 KB MTU is now available from Busan to Los Angeles.
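
(A rough comparison of per-packet header overhead at the two MTU sizes, reusing the header sizes from the tuning slides; 9000 B stands in for the "9 KB" jumbo MTU.)

def tcp_payload_fraction(ip_mtu, tcp_hdr=32, ip_hdr=20, eth_overhead=38):
    """Fraction of the line rate left for TCP payload with full-size packets."""
    return (ip_mtu - ip_hdr - tcp_hdr) / (ip_mtu + eth_overhead)

# 1500 B MTU: ~94.2% of line rate; 9000 B MTU: ~99.0%.
# Larger packets also reduce per-packet interrupt and ACK load,
# which often matters more than the header savings alone.
for mtu in (1500, 9000):
    print(mtu, round(tcp_payload_fraction(mtu) * 100, 1))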
Current and Future Plans of e-VLBI
• KOA (Korean Observatory of Astronomy) has one existing radio telescope, but it operates in a different band from ours. They are building another three radio telescopes.
• Using a dedicated light path from Europe to Asia through the US is being considered.
• An e-VLBI demonstration at SuperComputing 2004 (November) is being planned, interconnecting radio telescopes from Europe, the US, and Japan.
• The gigabit A/D converter is ready; a 10G version is now being implemented.
• Our performance measurement infrastructure will be merged into a framework of Global (Network) Observatory maintained by NOC people (Internet2 piPEs, APAN CMM, and e-VLBI).
Questions?
• See
http://www2.nict.go.jp/ka/radioastro/index.html
for VLBI