TCP/IP on High Bandwidth Long Distance Paths or
So TCP works … but still the users ask: Where is my throughput?
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks" and look for Haystack
5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory
Layers & IP
The Network Layer 3: IP
IP Layer properties:
Provides best effort delivery – it is unreliable:
Packet may be lost
Duplicated
Out of order
Connectionless
Provides logical addresses
Provides routing
Demultiplexes data on protocol number
The Internet datagram
[Figure: IP datagram format – Frame header | IP header | Transport | FCS. The 20-byte IP header: Vers | Hlen | Type of service | Total length; Identification | Flags | Fragment offset; TTL | Protocol | Header checksum; Source IP address; Destination IP address; IP Options (if any) + Padding]
IP Datagram Format (cont.)
Type of Service (TOS): now being used for QoS
Time to Live (TTL): specifies how long the datagram is allowed to remain in the internet; routers decrement it by 1; when TTL = 0 the router discards the datagram – prevents infinite loops
Protocol: specifies the format of the data area; protocol numbers administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
Source & destination IP address: (32 bits each) contain the IP address of the sender and intended recipient
Total length: length of datagram in bytes, includes header and data
Options: (variable length) mainly used to record a route, or timestamps, or to specify routing
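As a concrete illustration of the fields above, here is a minimal sketch (not from the talk) that unpacks a raw 20-byte IPv4 header with Python's struct module; the sample header bytes are made up for illustration.

import struct

def parse_ipv4_header(raw: bytes) -> dict:
    # Unpack the fixed 20-byte IPv4 header laid out above (network byte order)
    (vers_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": vers_hlen >> 4,               # should be 4
        "header_bytes": (vers_hlen & 0x0F) * 4,  # Hlen is in 32-bit words
        "tos": tos,
        "total_length": total_len,               # header + data, in bytes
        "ttl": ttl,
        "protocol": proto,                       # 1 = ICMP, 6 = TCP, 17 = UDP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }

# Hypothetical header: version 4, Hlen 5, TTL 64, protocol 17 (UDP), 192.168.0.1 -> 192.168.0.2
sample = bytes.fromhex("45000054abcd40004011f00bc0a80001c0a80002")
print(parse_ipv4_header(sample))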
The Transport Layer 4: UDP
UDP Provides :
Connection less service over IP
No setup teardown
One packet at a time
Minimal overhead – high performance
Provides best effort delivery – it is unreliable:
Packet may be lost
Duplicated
Out of order
Application is responsible for
Data reliability
Flow control
Error handling
UDP Datagram format
[Figure: UDP datagram format – Frame header | IP header | UDP header | Application data | FCS. The 8-byte UDP header: Source port | Destination port | UDP message length | Checksum (optional)]
Source/destination port: port numbers identify the sending & receiving processes
Port number & IP address allow any application on the Internet to be uniquely identified
Ports can be static or dynamic
Static (< 1024): assigned centrally, known as well-known ports
Dynamic
Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
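A minimal sketch (not from the talk) of the connectionless service described above, using Python's standard socket API; the localhost address, port 5001 and payload are made-up illustrative values.

import socket

# Receiver: bind and wait for one datagram – no connection setup or teardown
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 5001))                 # hypothetical port

# Sender: each sendto() is one self-contained datagram
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"one packet at a time", ("127.0.0.1", 5001))

data, addr = rx.recvfrom(65535)              # max UDP message length
print(len(data), "bytes from", addr)         # delivery, ordering and duplication are not guaranteed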
The Transport Layer 4: TCP
TCP (RFC 793, RFC 1122) provides:
Connection orientated service over IP
During setup the two ends agree on details
Explicit teardown
Multiple connections allowed
Reliable end-to-end byte-stream delivery over an unreliable network
It takes care of:
Lost packets
Duplicated packets
Out of order packets
TCP provides
Data buffering
Flow control
Error detection & handling
Limits network congestion
The TCP Segment Format
[Figure: TCP segment format – Frame header | IP header | TCP header | Application data | FCS. The 20-byte TCP header (plus options): Source port | Destination port | Sequence number | Acknowledgement number | Hlen | Resv | Code | Window | Checksum | Urgent ptr | Options (if any) + Padding]
TCP Segment Format – cont.
Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
Sequence number: position in the byte stream of the first byte in this segment, from the sender's point of view
Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
Window: advertises how much data this station is willing to accept; can depend on buffer space remaining
Options: used for window scaling, SACK, timestamps, maximum segment size etc.
TCP – providing reliability
Positive acknowledgement (ACK) of each received segment
Sender keeps a record of each segment sent
Sender awaits an ACK – "I am ready to receive byte 2048 and beyond"
Sender starts a timer when it sends a segment – so it can re-transmit
[Figure: time-sequence diagram – segment n (sequence 1024, length 1024) is sent, one RTT later ACK 2048 arrives; segment n+1 (sequence 2048, length 1024), one RTT later ACK 3072]
Inefficient – sender has to wait
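A rough back-of-the-envelope sketch of why one segment per round trip is inefficient; the 1024-byte segment size comes from the figure above, the RTT values are illustrative.

def stop_and_wait_throughput_bit_s(segment_bytes: int, rtt_s: float) -> float:
    # One segment per round trip: throughput = segment size / RTT
    return segment_bytes * 8 / rtt_s

for rtt_ms in (1, 20, 150):                  # LAN, Manchester-CERN, transatlantic (illustrative)
    rate = stop_and_wait_throughput_bit_s(1024, rtt_ms / 1000)
    print("rtt %3d ms -> %.2f Mbit/s" % (rtt_ms, rate / 1e6))
# at 150 ms the sender achieves only ~0.05 Mbit/s while it waits for each ACK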
Flow Control: Sender – Congestion Window
Uses Congestion window, cwnd, a sliding window to control the data flow
Byte count giving the highest byte that can be sent without an ACK
Transmit buffer size and advertised receive buffer size are important
ACK gives the next sequence number to receive AND the available space in the receive buffer
Timer kept for each packet
[Figure: sender's sliding window – regions: data sent and ACKed | sent data buffered awaiting ACK | unsent data that may be transmitted immediately | data waiting for the window to open; a received ACK advances the trailing edge, the sending host advances the marker as data is transmitted, the receiver's advertised window advances the leading edge, the application writes at the far end]
Flow Control: Receiver – Lost Data
[Figure: receiver's window – data given to the application | ACKed but not yet given to the user | lost data | received but not ACKed; the next byte expected / expected sequence number marks the last ACK given; the receiver's advertised window advances the leading edge and the window slides as the application reads]
If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number
How it works: TCP Slowstart
Probe the network – get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time
Exponentially increase the congestion window size until a packet is lost
cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
Send 1st packet, get 1 ACK: increase cwnd to 2; send 2 packets, get 2 ACKs: increase cwnd to 4
Time to reach cwnd size W: T_W = RTT * log2(W) (not exactly slow!)
Rate doubles each RTT
[Figure: cwnd vs time – slow start: exponential increase; congestion avoidance: linear increase; packet loss / timeout; retransmit: slow start again]
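A short sketch (my own, not from the talk) that simulates the per-RTT doubling described above and compares it with the T_W = RTT * log2(W) estimate; the 200 ms RTT and target window are illustrative values.

import math

def slow_start_rtts(target_segments: int) -> int:
    # cwnd starts at 1 segment and gains 1 segment per ACK, i.e. it doubles every RTT
    cwnd, rtts = 1, 0
    while cwnd < target_segments:
        cwnd *= 2
        rtts += 1
    return rtts

rtt = 0.2                                    # 200 ms, illustrative long-distance RTT
target = 8000                                # illustrative window (in segments) needed to fill the pipe
print("RTTs:", slow_start_rtts(target), " log2(W):", math.ceil(math.log2(target)))
print("time to reach W: %.1f s" % (slow_start_rtts(target) * rtt))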
TCP Slowstart Animated
Toby Rodwell Dante
Growth of CWND related to RTT
(Most important in Congestion Avoidance phase)
[Animation: packet flow between Source and Sink]
How it works: TCP Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
cwnd increased by 1 segment per RTT, i.e. cwnd increased by 1/cwnd for each ACK – a linear increase in rate
TCP takes packet loss as an indication of congestion!
Multiplicative decrease: if a packet is lost, aggressively cut the congestion window size; standard TCP reduces cwnd by a factor of 0.5
Slow start to congestion avoidance transition determined by ssthresh
[Figure: cwnd vs time – slow start: exponential increase; congestion avoidance: linear increase; packet loss / timeout; retransmit: slow start again]
TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments out of order.
Fast Retransmit: if the receiver sends 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender re-transmits the missing segment
Set ssthresh to 0.5*cwnd – so enter the congestion avoidance phase
Set cwnd = 0.5*cwnd + 3 – the 3 dup ACKs
Increase cwnd by 1 segment for each further duplicate ACK
Keep sending new data if allowed by cwnd
Set cwnd to half the original value on a new ACK
No need to go into "slow start" again
At the steady state, cwnd oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again – exponential increase
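A simplified sketch of the fast retransmit / fast recovery steps listed above; it ignores many real-stack details (SACK, timers, the receiver window) and the 100-segment window is an illustrative value.

def on_triple_dup_ack(cwnd: float):
    # 3 duplicate ACKs: retransmit, remember half the window, inflate by the 3 segments
    ssthresh = 0.5 * cwnd
    return ssthresh + 3, ssthresh

def on_extra_dup_ack(cwnd: float) -> float:
    return cwnd + 1                          # each further dup ACK: another segment left the network

def on_new_ack(ssthresh: float) -> float:
    return ssthresh                          # deflate to half the original window, no slow start

cwnd = 100.0                                 # illustrative window of 100 segments
cwnd, ssthresh = on_triple_dup_ack(cwnd)     # cwnd 53.0, ssthresh 50.0
cwnd = on_extra_dup_ack(cwnd)                # 54.0
cwnd = on_new_ack(ssthresh)                  # 50.0 – continue in congestion avoidance
print(cwnd, ssthresh)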
TCP: Simple Tuning - Filling the Pipe
Remember, TCP has to hold a copy of the data in flight
Optimal (TCP buffer) window size depends on:
Bandwidth end to end, i.e. min(BW of the links), AKA bottleneck bandwidth
Round Trip Time (RTT)
The number of bytes in flight to fill the entire path: Bandwidth*Delay Product, BDP = RTT*BW
Segment time on wire = bits in segment / BW
Can increase bandwidth by orders of magnitude
Windows also used for flow control
[Figure: sender-receiver time-sequence diagram showing segments and ACKs filling one RTT of the pipe]
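A small worked example of the BDP = RTT * BW rule above; the 1 Gbit/s rate and the UK/Europe/USA RTTs echo figures used elsewhere in the talk, and the buffer sizes are simply the arithmetic result.

def bdp_bytes(bw_bit_s: float, rtt_s: float) -> float:
    # Bandwidth-delay product: bytes that must be in flight (and buffered) to fill the path
    return bw_bit_s * rtt_s / 8

for name, rtt_ms in [("UK", 6), ("Europe", 25), ("USA", 150)]:
    mbyte = bdp_bytes(1e9, rtt_ms / 1000) / 2**20
    print("%-7s rtt %3d ms @ 1 Gbit/s -> BDP ~ %5.1f MByte" % (name, rtt_ms, mbyte))
# e.g. 150 ms at 1 Gbit/s needs ~18 MByte of TCP buffer / window to keep the pipe full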
Standard TCP (Reno) – What’s the problem?
TCP has 2 phases: Slowstart
Probe the network to estimate the Available BW Exponential growth
Congestion Avoidance
Main data transfer phase – transfer rate grows "slowly"
AIMD and High Bandwidth – Long Distance networks
Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.
For each ACK in an RTT without loss:
cwnd -> cwnd + a/cwnd    (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd - b*cwnd    (Multiplicative Decrease, b = 1/2)
Packet loss is a killer !!
TCP (Reno) – Details of problem #1
Time for TCP to recover its throughput from 1 lost 1500-byte packet is given by:
τ = C * RTT² / (2 * MSS)
[Plot: recovery time vs rtt (0-200 ms), log scale from 0.0001 s to 100000 s, for line rates of 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit/s]
e.g. at 1 Gbit/s, recovery takes 1.6 s for a UK rtt of 6 ms, 26 s for a European rtt of 25 ms, and ~28 min for a transatlantic (USA) rtt of 150-200 ms
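A quick sketch (not from the talk) evaluating the recovery-time formula above; the MSS of 1460 bytes for a 1500-byte packet is my assumption, and the exact RTT used shifts the answer (150 ms gives ~16 min, while ~200 ms gives the ~28 min quoted).

def recovery_time_s(line_rate_bit_s: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    # Time for standard TCP to climb back to full rate after one loss: C * RTT^2 / (2 * MSS)
    return line_rate_bit_s * rtt_s ** 2 / (2 * mss_bytes * 8)

for name, rtt_ms in [("UK", 6), ("Europe", 25), ("USA", 150)]:
    t = recovery_time_s(1e9, rtt_ms / 1000)
    print("%-7s rtt %3d ms -> %7.1f s (%.1f min)" % (name, rtt_ms, t, t / 60))
# ~1.5 s (UK), ~27 s (Europe), ~16 min at 150 ms; an rtt of ~200 ms gives the ~28 min on the slide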
Investigation of new TCP Stacks
The AIMD Algorithm – Standard TCP (Reno)
For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd    (Additive Increase, a = 1)
For each window experiencing loss: cwnd -> cwnd - b*cwnd    (Multiplicative Decrease, b = 1/2)
High Speed TCP: a and b vary depending on the current cwnd, using a table
a increases more rapidly with larger cwnd for the network path – returns to the ‘optimal’ cwnd size sooner
b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd (see the sketch below)
a = 1/100 – the increase is greater than TCP Reno
b = 1/8 – the decrease on loss is less than TCP Reno
Scalable over any link speed.
Fast TCP: uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
HSTCP-LP, H-TCP, BiC-TCP
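A hedged sketch (not from the talk) contrasting the per-ACK and per-loss window updates of standard TCP and Scalable TCP as listed above; HighSpeed TCP's table-driven a and b are not reproduced.

def reno_update(cwnd, acked=True, a=1.0, b=0.5):
    # Standard TCP (Reno): cwnd + a/cwnd per ACK, cwnd - b*cwnd on loss (cwnd in segments)
    return cwnd + a / cwnd if acked else cwnd * (1 - b)

def scalable_update(cwnd, acked=True, a=0.01, b=0.125):
    # Scalable TCP: fixed increase of a per ACK, smaller decrease of b*cwnd on loss
    return cwnd + a if acked else cwnd * (1 - b)

cwnd_reno = cwnd_scal = 1000.0               # illustrative window of 1000 segments
for _ in range(1000):                        # roughly one RTT's worth of ACKs at this window
    cwnd_reno = reno_update(cwnd_reno)
    cwnd_scal = scalable_update(cwnd_scal)
print("after one lossless RTT: Reno %.1f, Scalable %.1f" % (cwnd_reno, cwnd_scal))
print("after a loss:           Reno %.1f, Scalable %.1f"
      % (reno_update(1000.0, acked=False), scalable_update(1000.0, acked=False)))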
Let's check out this theory about new TCP stacks – does it matter?
Does it work?
Problem #1: Packet Loss – is it important?
Packet Loss with new TCP Stacks
TCP Response Function: throughput vs loss rate – further to the right means faster recovery
Drop packets in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
Packet Loss and new TCP Stacks
TCP Response Function: UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel (sculcc1-chi-2, iperf, 13 Jan 05)
Agreement with theory good
Some new stacks good at high loss rates
[Plots: achievable throughput vs packet drop rate (1 in n), log scale, for standard TCP (1500 byte MTU), HSTCP, Scalable, HTCP, BICTCP, Westwood and Vegas, with theory curves for standard and Scalable TCP]
High Throughput Demonstrations
[Diagram: London (lon01, dual Xeon 2.2 GHz) – Cisco 7609 – Cisco GSR – 2.5 Gbit SDH MB-NG core – Cisco GSR – Cisco 7609 – Manchester (man03, dual Xeon 2.2 GHz), 1 GEth at each end; rtt 6.2 ms. Equivalent (Chicago)-(Geneva) path: rtt 128 ms]
Send data with TCP; drop packets; monitor TCP with Web100
High Performance TCP – MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
[Plots: cwnd and throughput vs time for Standard, HighSpeed and Scalable TCP]
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
FAST demo via OMNInet and DataTAG
FAST Demo: Cheng Jin, David Wei (Caltech); A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN)
[Diagram: San Diego workstations and FAST display – 2 x GE – Nortel Passport 8600 – 10GE photonic switch – OMNInet layer 2 path – NU-E (Leverone) – StarLight Chicago (Nortel Passport 8600, 10GE photonic switch) – layer 2/3 path – CalTech Cisco 7609 – DataTAG OC-48, Alcatel 1670s, ~7,000 km – CERN Cisco 7609 – CERN Geneva workstations, 2 x GE]
FAST TCP vs newReno
Traffic flow Channel #1 : newReno Utilization: 70%
Traffic flow Channel #2: FAST Utilization: 90%
Problem #2 Is TCP fair?
Look at Round Trip Times & Maximum Transmission Unit (MTU)
MTU and Fairness
[Diagram: Host #1 and Host #2 at Starlight (Chi), each 1 GE into a GbE switch, sharing a 1 GE bottleneck onto a 2.5 Gbps POS link between routers to CERN (GVA)]
Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
Link utilization: 70.7 %
[Plot: throughput of the two streams with different MTU sizes sharing a 1 Gbps bottleneck, 0-7000 s, with per-stream averages over the life of the connection; Sylvain Ravot, DataTAG 2003]
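For a rough feel for why the larger MTU wins, the sketch below uses the well-known Mathis et al. approximation Throughput ≈ MSS / (RTT * sqrt(p)); this formula is not from the talk, and the loss probability p is a made-up illustrative value.

import math

def mathis_throughput_bit_s(mss_bytes: float, rtt_s: float, loss_prob: float) -> float:
    # Approximate steady-state TCP throughput (Mathis et al.): MSS / (RTT * sqrt(p))
    return mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))

p = 1e-5                                     # illustrative loss probability
for mtu in (3000, 9000):
    mss = mtu - 40                           # minus IP + TCP headers
    print("MTU %4d -> ~%5.0f Mbit/s" % (mtu, mathis_throughput_bit_s(mss, 0.117, p) / 1e6))
# the 9000-byte MTU roughly triples the achievable rate for the same loss rate and RTT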
RTT and Fairness
[Diagram: Host #1 at Starlight (Chi) and Host #2 at Sunnyvale, each 1 GE, sharing a 1 GE bottleneck at CERN (GVA); the paths run over 2.5 Gb/s and 10 Gb/s POS links between routers, with a 10GE hop towards Sunnyvale]
Two TCP streams share a 1 Gb/s bottleneck, MTU = 9000 bytes
CERN <-> Sunnyvale: RTT = 181 ms; avg. throughput over a period of 7000 s = 202 Mb/s
CERN <-> Starlight: RTT = 117 ms; avg. throughput over a period of 7000 s = 514 Mb/s
Link utilization = 71.6 %
[Plot: throughput of the two streams with different RTT sharing a 1 Gbps bottleneck, 0-7000 s, with per-stream averages over the life of the connection; Sylvain Ravot, DataTAG 2003]
Problem #n Do TCP Flows Share the Bandwidth ?
Test of TCP Sharing: Methodology (1Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic; ping (1/s) ICMP traffic alongside, through the same TCP/UDP bottleneck
Each run was 16 minutes, in 7 regions (2 mins / 4 mins)
[Diagram: SLAC sending iperf or UDT plus ping traffic across the bottleneck to Caltech/UFL/CERN]
Les Cottrell, PFLDnet 2005
TCP Reno single stream
Les Cottrell PFLDnet 2005
Low performance on fast long distance paths
AIMD (add a = 1 pkt to cwnd per RTT, decrease cwnd by factor b = 0.5 on congestion)
Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput
Unequal sharing
Increase recovery rate
Remaining flows do not take up the slack when a flow is removed
RTT increases when it achieves best throughput
Congestion has a dramatic effect
Recovery is slow
[Plots: SLAC to CERN – throughput and RTT vs time for TCP Reno streams]
Fast
As well as packet loss, FAST uses RTT to detect congestion
RTT is very stable: σ(RTT) ~ 9 ms vs 37 ±0.14 ms for the others
2nd flow never gets an equal share of bandwidth
Big drops in throughput which take several seconds to recover from
[Plots: SLAC-CERN, FAST flows]
Hamilton TCP
One of the best performers:
Throughput is high
Big effects on RTT when it achieves best throughput
Two flows share equally; > 2 flows appears less stable
Appears to need > 1 flow to achieve best throughput
[Plots: SLAC-CERN, Hamilton TCP flows]
Problem #n+1 To SACK or not to SACK ?
The SACK Algorithm
SACK Rationale:
Non-contiguous blocks of data can be ACKed
Sender transmits just the lost packets
Helps when multiple packets are lost in one TCP window
The SACK processing is inefficient for large bandwidth-delay products:
Sender write queue (linked list) walked for: each SACK block, to mark lost packets, to re-transmit
Processing takes so long that the input queue becomes full – get timeouts
[Plots: cwnd vs time, rtt 150 ms – standard SACKs vs updated SACKs, HS-TCP; Dell 1650, 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000; Doug Leith, Yee-Ting Li]
SACK …
Look into what’s happening at the algorithmic level with web100:
Scalable TCP on MB-NG with 200 Mbit/s CBR background (Yee-Ting Li)
Strange hiccups in cwnd – the only correlation is SACK arrivals
Real Applications on Real Networks
Disk-2-disk applications on real networks
Memory-2-memory tests
Transatlantic disk-2-disk at Gigabit speeds
HEP & VLBI at SC|05
Remote Computing Farms
The effect of TCP The effect of distance
Radio Astronomy e-VLBI
Leave for the talk later in the meeting
iperf Throughput + Web100
SuperMicro on MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs ? < 10 (expect ~400)
BaBar on production network, standard TCP: 425 Mbit/s; DupACKs 350-400 – re-transmits
Applications: Throughput Mbit/s
HighSpeed TCP 2 GByte file RAID5 SuperMicro + SuperJANET
bbcp
bbftp
Apache
Gridftp
Previous work used RAID0 (not disk limited)
Transatlantic Disk to Disk Transfers With UKLight SuperComputing 2004
bbftp: What else is going on?
Scalable TCP
SuperMicro + SuperJANET
Instantaneous 0 - 550 Mbit/s
Congestion window – duplicate ACK
Throughput variation not TCP related?
Disk speed / bus transfer
Application architecture
BaBar + SuperJANET
Instantaneous 200 – 600 Mbit/s
Disk-mem ~590 Mbit/s – remember the end host
SC2004
SC2004 UKLIGHT Overview
[Diagram: SC2004 UKLight overview – SLAC booth and Caltech booth (UltraLight IP, Cisco 6509, Caltech 7600) on the show floor; NLR lambda NLR-PITT-STAR-10GE-16 to Chicago Starlight; UKLight 10G (four 1GE channels) via ULCC UKLight to Manchester (MB-NG 7600 OSR) and the UCL network / UCL HEP; SURFnet / EuroLink 10G (two 1GE channels) to Amsterdam; K2 and Ci switches along the paths]
Transatlantic Ethernet: TCP Throughput Tests
Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
1500 byte MTU, 2.6.6 Linux kernel
Memory-memory TCP throughput, standard TCP
Wire rate throughput of 940 Mbit/s
[Plots: instantaneous BW, average BW and CurCwnd vs time – full run and first 10 s]
Work in progress to study:
Implementation detail
Advanced stacks
Effect of packet loss
Sharing
SC2004 Disk-Disk bbftp
bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Move a 2 GByte file – disk-TCP-disk at 1 Gbit/s
Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
[Plots: instantaneous BW, average BW and CurCwnd vs time for the two stacks]
Network & Disk Interactions
Hosts: (work in progress)
Supermicro X5DPE-G2 motherboards
dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
six 74.3 GByte Western Digital Raptor WD740 SATA disks 64k byte stripe size
Measure memory to RAID0 transfer rates with & without UDP traffic:
Disk write alone: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s – drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s – drop of 19%
[Plots: RAID0 6-disk 1 GByte write rate (64k stripe, 3ware 8506-8) vs trial number for the three cases, and write rate vs % CPU in kernel/system mode (fit y = 178 - 1.05x), 07 Jan 05]
Transatlantic Transfers With UKLight SuperComputing 2005
ESLEA and UKLight
6 * 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR
Disk-to-disk transfers with bbcp, Seattle to UK
Set TCP buffer and application to give ~850 Mbit/s
One stream of data, 840-620 Mbit/s
Stream UDP VLBI data UK to Seattle, 620 Mbit/s
Reverse TCP
[Plots: per-host throughput vs time (16:00-23:00) for sc0501-sc0504 at SC|05, and aggregate UKLight traffic]
SC|05 – SLAC 10 Gigabit Ethernet
2 Lightpaths:
Routed over ESnet
Layer 2 over Ultra Science Net
6 Sun V20Z systems per λ, 3 transmit, 3 receive
dCache remote disk data access; 100 processes per node; node sends or receives
One data stream 20-30 Mbit/s
Used Neterion NICs & Chelsio TOE
Data also sent to StorCloud using fibre channel links
Traffic on the 10 GE link for 2 nodes: 3-4 Gbit per node, 8.5-9 Gbit on the trunk
Remote Computing Farms in the ATLAS TDAQ Experiment
ATLAS Remote Farms – Network Connectivity
ATLAS Application Protocol
[Diagram: message sequence between SFI, the Event Filter Daemon (EFD) and SFO – request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed]
Event request: EFD requests an event from SFI; SFI replies with the event (~2 Mbytes)
Processing of the event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication.
tcpmon: TCP Activity Manc-CERN Req-Resp
Round trip time 20 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window gets re-set on each request – the TCP stack follows RFC 2581 & RFC 2861, reduction of cwnd after inactivity
Even after 10 s, each response takes 13 rtt or ~260 ms
Transfer achievable throughput 120 Mbit/s
[Plots: data bytes in/out, CurCwnd and achievable throughput vs time]
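A rough estimate (mine, not from the talk) of why a 1 Mbyte response costs many round trips when cwnd restarts from slow start; the initial cwnd of 2 segments and the clean doubling per RTT are simplifications, so it only approximates the ~19 rtt observed.

def rtts_to_send(response_bytes: int, mss: int = 1460, init_cwnd_segments: int = 2) -> int:
    # Round trips needed to push a response through slow start, cwnd doubling each RTT
    remaining, cwnd, rtts = response_bytes, init_cwnd_segments, 0
    while remaining > 0:
        remaining -= cwnd * mss              # one congestion window of data per round trip
        cwnd *= 2
        rtts += 1
    return rtts

print(rtts_to_send(1_000_000), "rtts of 20 ms =", rtts_to_send(1_000_000) * 20, "ms")
# ~9-10 rtts with these assumptions; delayed ACKs, the request itself and the cwnd reset on each
# request push the observed figure towards the ~19 rtt / ~380 ms seen in the measurements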
tcpmon: TCP Activity Manc-CERN Req-Resp TCP stack tuned
Round trip time 20 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window grows nicely; response takes 2 rtt after ~1.5 s
Rate ~10/s (with 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
Data transferred WHEN the application requires the data
[Plots: data bytes in/out, packets in/out, CurCwnd and achievable throughput vs time]
tcpmon: TCP Activity Alberta-CERN Req-Resp TCP stack tuned
Round trip time 150 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: 1st event takes 11 rtt or ~1.67 s
TCP congestion window in slow start to ~1.8 s, then congestion avoidance
Response in 2 rtt after ~2.5 s
Rate 2.2/s (with 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Plots: data bytes in/out, packets in/out, CurCwnd and achievable throughput vs time]
Summary & Conclusions
Standard TCP not optimum for high throughput long distance links
Packet loss is a killer for TCP
Check on campus links & equipment, and access links to backbones
Users need to collaborate with the Campus Network Teams
Dante Pert
New stacks are stable and give better response & performance
Still need to set the TCP buffer sizes !
Check other kernel settings e.g. window-scale maximum
Watch for “TCP Stack implementation Enhancements”
TCP tries to be fair
Large MTU has an advantage
Short distances, small RTT, have an advantage
TCP does not share bandwidth well with other streams
The End Hosts themselves
Plenty of CPU power is required
for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power
Interaction between HW, protocol processing, and disk sub-system complex
Application architecture & implementation are also important
The TCP protocol dynamics strongly influence the behaviour of the Application.
Users are now able to perform sustained 1 Gbit/s transfers
More Information Some URLs 1
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004: http://www.hep.man.ac.uk/~rich/
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
More Information Some URLs 2
Lectures, tutorials etc. on TCP/IP: www.nv.cc.va.us/home/joney/tcp_ip.htm
www.cs.pdx.edu/~jrb/tcpip.lectures.html
www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
www.jbmelectronics.com/tcp.htm
Encyclopaedia: http://www.freesoft.org/CIE/index.htm
TCP/IP Resources www.private.org.il/tcpip_rl.html
Understanding IP addresses http://www.3com.com/solutions/en_US/ncs/501302.html
Configuring TCP (RFC 1122) ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
Assigned protocols, ports etc (RFC 1010) http://www.es.net/pub/rfcs/rfc1010.txt
& /etc/protocols
Any Questions?
Backup Slides
Latency Measurements
UDP/IP packets sent between back-to-back systems
Processed in a similar manner to TCP/IP
Not subject to flow control & congestion avoidance algorithms
Used the UDPmon test program
Latency
Round trip times measured using Request-Response UDP frames
Latency as a function of frame size
Slope is given by: dt/db = Σ over the data paths of 1/bandwidth, i.e. mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
Intercept indicates: processing times + HW latencies
Histograms of 'singleton' measurements tell us about:
Behaviour of the IP stack
The way the HW operates
Interrupt coalescence
Throughput Measurements
UDP Throughput
Send a controlled stream of UDP frames spaced at regular intervals
[Diagram: UDPmon sender-receiver exchange – zero stats (OK done); send data frames at regular intervals (n bytes, wait time); time to send / time to receive; inter-packet time histogrammed; signal end of test; get remote statistics: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. of interrupts, 1-way delay]
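A tiny sketch of the arithmetic behind these udpmon plots: the wire rate implied by a given frame size and inter-packet spacing; the 38-byte Ethernet overhead (preamble, header, FCS, inter-frame gap) is my assumption.

def udp_wire_rate_mbit_s(payload_bytes: int, spacing_us: float) -> float:
    # Rate on the wire for one UDP frame every spacing_us microseconds
    # (assumed overheads: 20 IP + 8 UDP + 38 Ethernet preamble/header/FCS/gap bytes)
    frame_bits = (payload_bytes + 20 + 8 + 38) * 8
    return frame_bits / spacing_us           # bits per microsecond == Mbit/s

print(udp_wire_rate_mbit_s(1472, 12))        # ~1025 Mbit/s requested – above GigE wire rate
print(udp_wire_rate_mbit_s(1472, 13))        # ~946 Mbit/s – close to the 940 Mbit/s user rate seen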
PCI Bus & Gigabit Ethernet Activity
PCI Activity
Logic analyser with PCI probe cards in the sending PC, a Gigabit Ethernet fibre probe card, and PCI probe cards in the receiving PC
[Diagram: CPU – chipset – memory – PCI bus – Gigabit Ethernet NIC at each end, with probes feeding the logic analyser display; possible bottlenecks marked]
Network switch limits behaviour
End-to-end UDP packets from udpmon
Only 700 Mbit/s throughput
Lots of packet loss shows the throughput is limited
[Plots: w05gva-gig6, 29 May 04, UDP – received wire rate vs spacing between frames (0-40 µs) for packet sizes 50-1472 bytes; % packet loss vs spacing; packet loss distribution vs packet number at a 12 µs wait time]
“Server Quality” Motherboards
SuperMicro P4DP8-2G (P4DP6) Dual
Xeon, 400/522 MHz front-side bus
6 PCI / PCI-X slots
4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
Dual Gigabit Ethernet
Adaptec AIC-7899W dual channel SCSI
UDMA/100 bus master/EIDE channels – data transfer rates of 100 MB/sec burst
“Server Quality” Motherboards
Boston/Supermicro H8DAR
Two Dual Core Opterons
200 MHz DDR memory – theory BW: 6.4 Gbit
HyperTransport
2 independent PCI buses: 133 MHz PCI-X
2 Gigabit Ethernet
SATA (PCI-e)
10 Gigabit Ethernet: UDP Throughput
1500 byte MTU gives ~2 Gbit/s; used 16144 byte MTU, max user length 16080
DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire rate throughput of 2.9 Gbit/s
CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s
SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s
[Plot: an-al 10GE Xsum 512kbuf MTU16114, 27 Oct 03 – received wire rate vs spacing between frames (0-40 µs) for packet sizes 1472-16080 bytes]
10 Gigabit Ethernet: Tuning PCI-X
16080 byte packets every 200 µs; Intel PRO/10GbE LR adapter; PCI-X bus occupancy vs mmrbc
Measured times, and times based on PCI-X transactions from the logic analyser
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
[Plots: measured rate, rate from expected time, and max PCI-X throughput vs max memory read byte count (mmrbc), for DataTAG Xeon 2.2 GHz and HP Itanium (kernel 2.6.1#17, Intel 10GE, Feb 04); PCI-X traces for mmrbc 512, 1024, 2048 and 4096 bytes showing CSR access, PCI-X sequence, data transfer, interrupt & CSR update – 5.7 Gbit/s at mmrbc 4096]
Congestion control: ACK clocking
End Hosts & NICs CERN-nat-Manc.
Use UDP packets to characterise host, NIC & network
SuperMicro P4DP8 motherboard, dual Xeon 2.2 GHz CPU, 400 MHz system bus, 64 bit 66 MHz PCI / 133 MHz PCI-X bus
Measured: request-response latency, throughput, packet loss, re-ordering
The network can sustain 1 Gbps of UDP traffic
The average server can lose smaller packets
Packet loss caused by lack of power in the PC receiving the traffic
Out-of-order packets due to WAN routers
Lightpaths look like extended LANs – no re-ordering
[Plots: pcatb121-nat-gig6, 13 Aug 04 – received wire rate and % packet loss vs spacing between frames for packet sizes 50-1472 bytes; latency histograms for 256, 512 and 1400 byte frames]
tcpdump / tcptrace
tcpdump: dump all TCP header information for a specified source/destination – ftp://ftp.ee.lbl.gov/
tcptrace: format tcpdump output for analysis using xplot – http://www.tcptrace.org/
NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools – http://www.ncne.nlanr.net/TCP/testrig/
Sample use:
tcpdump -s 100 -w /tmp/tcpdump.out host hostname
tcptrace -Sl /tmp/tcpdump.out
xplot /tmp/a2b_tsg.xpl
tcptrace and xplot
X axis is time
Y axis is sequence number
the slope of this curve gives the throughput over time.
The xplot tool makes it easy to zoom in
Zoomed In View
Green Line: ACK values received from the receiver
Yellow Line: tracks the receive window advertised by the receiver
Green Ticks: track the duplicate ACKs received
Yellow Ticks: track the window advertisements that were the same as the last advertisement
White Arrows: represent segments sent
Red Arrows (R): represent retransmitted segments
TCP Slow Start