TCP/IP on High Bandwidth Long Distance Paths or
So TCP works … but still the users ask: Where is my throughput?
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks" and look for Haystack
5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory
Layers & IP
The Network Layer 3: IP
IP Layer properties:
Provides best effort delivery – it is unreliable:
Packet may be lost
Duplicated
Out of order
Connectionless
Provides logical addresses
Provides routing
Demultiplexes data on protocol number
The Internet datagram
[Figure: IP datagram format – Frame header | IP header | Transport | FCS. The 20-byte IP header: Vers | Hlen | Type of service | Total length; Identification | Flags | Fragment offset; TTL | Protocol | Header checksum; Source IP address; Destination IP address; IP Options (if any) + Padding]
IP Datagram Format (cont.)
Type of Service (TOS): now being used for QoS
Time to Live (TTL): specifies how long the datagram is allowed to remain in the internet; routers decrement it by 1; when TTL = 0 the router discards the datagram – prevents infinite loops
Protocol: specifies the format of the data area; protocol numbers administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
Source & destination IP address: (32 bits each) contain the IP address of the sender and intended recipient
Total length: length of datagram in bytes, includes header and data
Options: (variable length) mainly used to record a route, or timestamps, or to specify routing
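As a concrete illustration of the fields above, here is a minimal sketch (not from the talk) that unpacks a raw 20-byte IPv4 header with Python's struct module; the sample header bytes are made up for illustration.

import struct

def parse_ipv4_header(raw: bytes) -> dict:
    # Unpack the fixed 20-byte IPv4 header laid out above (network byte order)
    (vers_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": vers_hlen >> 4,               # should be 4
        "header_bytes": (vers_hlen & 0x0F) * 4,  # Hlen is in 32-bit words
        "tos": tos,
        "total_length": total_len,               # header + data, in bytes
        "ttl": ttl,
        "protocol": proto,                       # 1 = ICMP, 6 = TCP, 17 = UDP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }

# Hypothetical header: version 4, Hlen 5, TTL 64, protocol 17 (UDP), 192.168.0.1 -> 192.168.0.2
sample = bytes.fromhex("45000054abcd40004011f00bc0a80001c0a80002")
print(parse_ipv4_header(sample))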
The Transport Layer 4: UDP
UDP Provides :
Connection less service over IP
No setup teardown
One packet at a time
Minimal overhead – high performance
Provides best effort delivery – it is unreliable:
Packet may be lost
Duplicated
Out of order
Application is responsible for
Data reliability
Flow control
Error handling
UDP Datagram format
[Figure: UDP datagram format – Frame header | IP header | UDP header | Application data | FCS. The 8-byte UDP header: Source port | Destination port | UDP message length | Checksum (optional)]
Source/destination port: port numbers identify the sending & receiving processes
Port number & IP address allow any application on the Internet to be uniquely identified
Ports can be static or dynamic
Static (< 1024): assigned centrally, known as well-known ports
Dynamic
Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
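A minimal sketch (not from the talk) of the connectionless service described above, using Python's standard socket API; the localhost address, port 5001 and payload are made-up illustrative values.

import socket

# Receiver: bind and wait for one datagram – no connection setup or teardown
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 5001))                 # hypothetical port

# Sender: each sendto() is one self-contained datagram
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"one packet at a time", ("127.0.0.1", 5001))

data, addr = rx.recvfrom(65535)              # max UDP message length
print(len(data), "bytes from", addr)         # delivery, ordering and duplication are not guaranteed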
The Transport Layer 4: TCP
TCP (RFC 793, RFC 1122) provides:
Connection orientated service over IP
During setup the two ends agree on details
Explicit teardown
Multiple connections allowed
Reliable end-to-end byte-stream delivery over an unreliable network
It takes care of:
Lost packets
Duplicated packets
Out of order packets
TCP provides
Data buffering
Flow control
Error detection & handling
Limits network congestion
The TCP Segment Format
[Figure: TCP segment format – Frame header | IP header | TCP header | Application data | FCS. The 20-byte TCP header (plus options): Source port | Destination port | Sequence number | Acknowledgement number | Hlen | Resv | Code | Window | Checksum | Urgent ptr | Options (if any) + Padding]
TCP Segment Format – cont.
Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
Sequence number: position in the byte stream of the first byte in this segment, from the sender's point of view
Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
Window: advertises how much data this station is willing to accept; can depend on buffer space remaining
Options: used for window scaling, SACK, timestamps, maximum segment size etc.
TCP – providing reliability
Positive acknowledgement (ACK) of each received segment
Sender keeps a record of each segment sent
Sender awaits an ACK – "I am ready to receive byte 2048 and beyond"
Sender starts a timer when it sends a segment – so it can re-transmit
[Figure: time-sequence diagram – segment n (sequence 1024, length 1024) is sent, one RTT later ACK 2048 arrives; segment n+1 (sequence 2048, length 1024), one RTT later ACK 3072]
Inefficient – sender has to wait
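A rough back-of-the-envelope sketch of why one segment per round trip is inefficient; the 1024-byte segment size comes from the figure above, the RTT values are illustrative.

def stop_and_wait_throughput_bit_s(segment_bytes: int, rtt_s: float) -> float:
    # One segment per round trip: throughput = segment size / RTT
    return segment_bytes * 8 / rtt_s

for rtt_ms in (1, 20, 150):                  # LAN, Manchester-CERN, transatlantic (illustrative)
    rate = stop_and_wait_throughput_bit_s(1024, rtt_ms / 1000)
    print("rtt %3d ms -> %.2f Mbit/s" % (rtt_ms, rate / 1e6))
# at 150 ms the sender achieves only ~0.05 Mbit/s while it waits for each ACK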
Flow Control: Sender – Congestion Window
Uses Congestion window, cwnd, a sliding window to control the data flow
Byte count giving the highest byte that can be sent without an ACK
Transmit buffer size and advertised receive buffer size are important
ACK gives the next sequence number to receive AND the available space in the receive buffer
Timer kept for each packet
[Figure: sender's sliding window – regions: data sent and ACKed | sent data buffered awaiting ACK | unsent data that may be transmitted immediately | data waiting for the window to open; a received ACK advances the trailing edge, the sending host advances the marker as data is transmitted, the receiver's advertised window advances the leading edge, the application writes at the far end]
Flow Control: Receiver – Lost Data
[Figure: receiver's window – data given to the application | ACKed but not yet given to the user | lost data | received but not ACKed; the next byte expected / expected sequence number marks the last ACK given; the receiver's advertised window advances the leading edge and the window slides as the application reads]
If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number
How it works: TCP Slowstart
Probe the network – get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time
Exponentially increase the congestion window size until a packet is lost
cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
Send 1st packet, get 1 ACK: increase cwnd to 2; send 2 packets, get 2 ACKs: increase cwnd to 4
Time to reach cwnd size W: T_W = RTT * log2(W) (not exactly slow!)
Rate doubles each RTT
[Figure: cwnd vs time – slow start: exponential increase; congestion avoidance: linear increase; packet loss / timeout; retransmit: slow start again]
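A short sketch (my own, not from the talk) that simulates the per-RTT doubling described above and compares it with the T_W = RTT * log2(W) estimate; the 200 ms RTT and target window are illustrative values.

import math

def slow_start_rtts(target_segments: int) -> int:
    # cwnd starts at 1 segment and gains 1 segment per ACK, i.e. it doubles every RTT
    cwnd, rtts = 1, 0
    while cwnd < target_segments:
        cwnd *= 2
        rtts += 1
    return rtts

rtt = 0.2                                    # 200 ms, illustrative long-distance RTT
target = 8000                                # illustrative window (in segments) needed to fill the pipe
print("RTTs:", slow_start_rtts(target), " log2(W):", math.ceil(math.log2(target)))
print("time to reach W: %.1f s" % (slow_start_rtts(target) * rtt))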
TCP Slowstart Animated
Toby Rodwell Dante
Growth of CWND related to RTT
(Most important in Congestion Avoidance phase)
[Animation: packet flow between Source and Sink]
How it works: TCP Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
cwnd increased by 1 segment per RTT, i.e. cwnd increased by 1/cwnd for each ACK – a linear increase in rate
TCP takes packet loss as an indication of congestion!
Multiplicative decrease: if a packet is lost, aggressively cut the congestion window size; standard TCP reduces cwnd by a factor of 0.5
Slow start to congestion avoidance transition determined by ssthresh
[Figure: cwnd vs time – slow start: exponential increase; congestion avoidance: linear increase; packet loss / timeout; retransmit: slow start again]
TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments out of order.
Fast Retransmit: if the receiver sends 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender re-transmits the missing segment
Set ssthresh to 0.5*cwnd – so enter the congestion avoidance phase
Set cwnd = 0.5*cwnd + 3 – the 3 dup ACKs
Increase cwnd by 1 segment for each further duplicate ACK
Keep sending new data if allowed by cwnd
Set cwnd to half the original value on a new ACK
No need to go into "slow start" again
At the steady state, cwnd oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again – exponential increase
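A simplified sketch of the fast retransmit / fast recovery steps listed above; it ignores many real-stack details (SACK, timers, the receiver window) and the 100-segment window is an illustrative value.

def on_triple_dup_ack(cwnd: float):
    # 3 duplicate ACKs: retransmit, remember half the window, inflate by the 3 segments
    ssthresh = 0.5 * cwnd
    return ssthresh + 3, ssthresh

def on_extra_dup_ack(cwnd: float) -> float:
    return cwnd + 1                          # each further dup ACK: another segment left the network

def on_new_ack(ssthresh: float) -> float:
    return ssthresh                          # deflate to half the original window, no slow start

cwnd = 100.0                                 # illustrative window of 100 segments
cwnd, ssthresh = on_triple_dup_ack(cwnd)     # cwnd 53.0, ssthresh 50.0
cwnd = on_extra_dup_ack(cwnd)                # 54.0
cwnd = on_new_ack(ssthresh)                  # 50.0 – continue in congestion avoidance
print(cwnd, ssthresh)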
TCP: Simple Tuning - Filling the Pipe
Remember, TCP has to hold a copy of the data in flight
Optimal (TCP buffer) window size depends on:
Bandwidth end to end, i.e. min(BW of the links), AKA bottleneck bandwidth
Round Trip Time (RTT)
The number of bytes in flight to fill the entire path: Bandwidth*Delay Product, BDP = RTT*BW
Segment time on wire = bits in segment / BW
Can increase bandwidth by orders of magnitude
Windows also used for flow control
[Figure: sender-receiver time-sequence diagram showing segments and ACKs filling one RTT of the pipe]
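A small worked example of the BDP = RTT * BW rule above; the 1 Gbit/s rate and the UK/Europe/USA RTTs echo figures used elsewhere in the talk, and the buffer sizes are simply the arithmetic result.

def bdp_bytes(bw_bit_s: float, rtt_s: float) -> float:
    # Bandwidth-delay product: bytes that must be in flight (and buffered) to fill the path
    return bw_bit_s * rtt_s / 8

for name, rtt_ms in [("UK", 6), ("Europe", 25), ("USA", 150)]:
    mbyte = bdp_bytes(1e9, rtt_ms / 1000) / 2**20
    print("%-7s rtt %3d ms @ 1 Gbit/s -> BDP ~ %5.1f MByte" % (name, rtt_ms, mbyte))
# e.g. 150 ms at 1 Gbit/s needs ~18 MByte of TCP buffer / window to keep the pipe full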
Standard TCP (Reno) – What’s the problem?
TCP has 2 phases: Slowstart
Probe the network to estimate the Available BW Exponential growth
Congestion Avoidance
Main data transfer phase – transfer rate grows "slowly"
AIMD and High Bandwidth – Long Distance networks
Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.
For each ACK in an RTT without loss:
cwnd -> cwnd + a/cwnd    (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd - b*cwnd    (Multiplicative Decrease, b = 1/2)
Packet loss is a killer !!
TCP (Reno) – Details of problem #1
Time for TCP to recover its throughput from 1 lost 1500-byte packet is given by:
τ = C * RTT² / (2 * MSS)
[Plot: recovery time vs rtt (0-200 ms), log scale from 0.0001 s to 100000 s, for line rates of 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit/s]
e.g. at 1 Gbit/s, recovery takes 1.6 s for a UK rtt of 6 ms, 26 s for a European rtt of 25 ms, and ~28 min for a transatlantic (USA) rtt of 150-200 ms
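A quick sketch (not from the talk) evaluating the recovery-time formula above; the MSS of 1460 bytes for a 1500-byte packet is my assumption, and the exact RTT used shifts the answer (150 ms gives ~16 min, while ~200 ms gives the ~28 min quoted).

def recovery_time_s(line_rate_bit_s: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    # Time for standard TCP to climb back to full rate after one loss: C * RTT^2 / (2 * MSS)
    return line_rate_bit_s * rtt_s ** 2 / (2 * mss_bytes * 8)

for name, rtt_ms in [("UK", 6), ("Europe", 25), ("USA", 150)]:
    t = recovery_time_s(1e9, rtt_ms / 1000)
    print("%-7s rtt %3d ms -> %7.1f s (%.1f min)" % (name, rtt_ms, t, t / 60))
# ~1.5 s (UK), ~27 s (Europe), ~16 min at 150 ms; an rtt of ~200 ms gives the ~28 min on the slide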
Investigation of new TCP Stacks
The AIMD Algorithm – Standard TCP (Reno)
For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd    (Additive Increase, a = 1)
For each window experiencing loss: cwnd -> cwnd - b*cwnd    (Multiplicative Decrease, b = 1/2)
High Speed TCP: a and b vary depending on the current cwnd, using a table
a increases more rapidly with larger cwnd for the network path – returns to the ‘optimal’ cwnd size sooner
b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd (see the sketch below)
a = 1/100 – the increase is greater than TCP Reno
b = 1/8 – the decrease on loss is less than TCP Reno
Scalable over any link speed.
Fast TCP: uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
HSTCP-LP, H-TCP, BiC-TCP
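A hedged sketch (not from the talk) contrasting the per-ACK and per-loss window updates of standard TCP and Scalable TCP as listed above; HighSpeed TCP's table-driven a and b are not reproduced.

def reno_update(cwnd, acked=True, a=1.0, b=0.5):
    # Standard TCP (Reno): cwnd + a/cwnd per ACK, cwnd - b*cwnd on loss (cwnd in segments)
    return cwnd + a / cwnd if acked else cwnd * (1 - b)

def scalable_update(cwnd, acked=True, a=0.01, b=0.125):
    # Scalable TCP: fixed increase of a per ACK, smaller decrease of b*cwnd on loss
    return cwnd + a if acked else cwnd * (1 - b)

cwnd_reno = cwnd_scal = 1000.0               # illustrative window of 1000 segments
for _ in range(1000):                        # roughly one RTT's worth of ACKs at this window
    cwnd_reno = reno_update(cwnd_reno)
    cwnd_scal = scalable_update(cwnd_scal)
print("after one lossless RTT: Reno %.1f, Scalable %.1f" % (cwnd_reno, cwnd_scal))
print("after a loss:           Reno %.1f, Scalable %.1f"
      % (reno_update(1000.0, acked=False), scalable_update(1000.0, acked=False)))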
Let's check out this theory about new TCP stacks – does it matter?
Does it work?
Problem #1: Packet Loss – is it important?
Packet Loss with new TCP Stacks
TCP Response Function: throughput vs loss rate – further to the right means faster recovery
Drop packets in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
Packet Loss and new TCP Stacks
TCP Response Function: UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel (sculcc1-chi-2, iperf, 13 Jan 05)
Agreement with theory good
Some new stacks good at high loss rates
[Plots: achievable throughput vs packet drop rate (1 in n), log scale, for standard TCP (1500 byte MTU), HSTCP, Scalable, HTCP, BICTCP, Westwood and Vegas, with theory curves for standard and Scalable TCP]
High Throughput Demonstrations
[Diagram: London (lon01, dual Xeon 2.2 GHz) – Cisco 7609 – Cisco GSR – 2.5 Gbit SDH MB-NG core – Cisco GSR – Cisco 7609 – Manchester (man03, dual Xeon 2.2 GHz), 1 GEth at each end; rtt 6.2 ms. Equivalent (Chicago)-(Geneva) path: rtt 128 ms]
Send data with TCP; drop packets; monitor TCP with Web100
High Performance TCP – MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
[Plots: cwnd and throughput vs time for Standard, HighSpeed and Scalable TCP]
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
FAST demo via OMNInet and DataTAG
FAST Demo: Cheng Jin, David Wei (Caltech); A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN)
[Diagram: San Diego workstations and FAST display – 2 x GE – Nortel Passport 8600 – 10GE photonic switch – OMNInet layer 2 path – NU-E (Leverone) – StarLight Chicago (Nortel Passport 8600, 10GE photonic switch) – layer 2/3 path – CalTech Cisco 7609 – DataTAG OC-48, Alcatel 1670s, ~7,000 km – CERN Cisco 7609 – CERN Geneva workstations, 2 x GE]
FAST TCP vs newReno
Traffic flow Channel #1 : newReno Utilization: 70%
Traffic flow Channel #2: FAST Utilization: 90%
Problem #2 Is TCP fair?
Look at Round Trip Times & Maximum Transmission Unit (MTU)
MTU and Fairness
[Diagram: Host #1 and Host #2 at Starlight (Chi), each 1 GE into a GbE switch, sharing a 1 GE bottleneck onto a 2.5 Gbps POS link between routers to CERN (GVA)]
Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
Link utilization: 70.7 %
[Plot: throughput of the two streams with different MTU sizes sharing a 1 Gbps bottleneck, 0-7000 s, with per-stream averages over the life of the connection; Sylvain Ravot, DataTAG 2003]
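For a rough feel for why the larger MTU wins, the sketch below uses the well-known Mathis et al. approximation Throughput ≈ MSS / (RTT * sqrt(p)); this formula is not from the talk, and the loss probability p is a made-up illustrative value.

import math

def mathis_throughput_bit_s(mss_bytes: float, rtt_s: float, loss_prob: float) -> float:
    # Approximate steady-state TCP throughput (Mathis et al.): MSS / (RTT * sqrt(p))
    return mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))

p = 1e-5                                     # illustrative loss probability
for mtu in (3000, 9000):
    mss = mtu - 40                           # minus IP + TCP headers
    print("MTU %4d -> ~%5.0f Mbit/s" % (mtu, mathis_throughput_bit_s(mss, 0.117, p) / 1e6))
# the 9000-byte MTU roughly triples the achievable rate for the same loss rate and RTT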
RTT and Fairness
[Diagram: Host #1 at Starlight (Chi) and Host #2 at Sunnyvale, each 1 GE, sharing a 1 GE bottleneck at CERN (GVA); the paths run over 2.5 Gb/s and 10 Gb/s POS links between routers, with a 10GE hop towards Sunnyvale]
Two TCP streams share a 1 Gb/s bottleneck, MTU = 9000 bytes
CERN <-> Sunnyvale: RTT = 181 ms; avg. throughput over a period of 7000 s = 202 Mb/s
CERN <-> Starlight: RTT = 117 ms; avg. throughput over a period of 7000 s = 514 Mb/s
Link utilization = 71.6 %
[Plot: throughput of the two streams with different RTT sharing a 1 Gbps bottleneck, 0-7000 s, with per-stream averages over the life of the connection; Sylvain Ravot, DataTAG 2003]
Problem #n Do TCP Flows Share the Bandwidth ?
Test of TCP Sharing: Methodology (1Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic; ping (1/s) ICMP traffic alongside, through the same TCP/UDP bottleneck
Each run was 16 minutes, in 7 regions (2 mins / 4 mins)
[Diagram: SLAC sending iperf or UDT plus ping traffic across the bottleneck to Caltech/UFL/CERN]
Les Cottrell, PFLDnet 2005
TCP Reno single stream
Les Cottrell PFLDnet 2005
Low performance on fast long distance paths
AIMD (add a = 1 pkt to cwnd per RTT, decrease cwnd by factor b = 0.5 on congestion)
Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput
Unequal sharing
Increase recovery rate
Remaining flows do not take up the slack when a flow is removed
RTT increases when it achieves best throughput
Congestion has a dramatic effect
Recovery is slow
[Plots: SLAC to CERN – throughput and RTT vs time for TCP Reno streams]
Fast
As well as packet loss, FAST uses RTT to detect congestion
RTT is very stable: σ(RTT) ~ 9 ms vs 37 ±0.14 ms for the others
2nd flow never gets an equal share of bandwidth
Big drops in throughput which take several seconds to recover from
[Plots: SLAC-CERN, FAST flows]
Hamilton TCP
One of the best performers:
Throughput is high
Big effects on RTT when it achieves best throughput
Two flows share equally; > 2 flows appears less stable
Appears to need > 1 flow to achieve best throughput
[Plots: SLAC-CERN, Hamilton TCP flows]
Problem #n+1 To SACK or not to SACK ?
The SACK Algorithm
SACK Rationale:
Non-contiguous blocks of data can be ACKed
Sender transmits just the lost packets
Helps when multiple packets are lost in one TCP window
The SACK processing is inefficient for large bandwidth-delay products:
Sender write queue (linked list) walked for: each SACK block, to mark lost packets, to re-transmit
Processing takes so long that the input queue becomes full – get timeouts
[Plots: cwnd vs time, rtt 150 ms – standard SACKs vs updated SACKs, HS-TCP; Dell 1650, 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000; Doug Leith, Yee-Ting Li]
SACK …
Look into what’s happening at the algorithmic level with web100:
Scalable TCP on MB-NG with 200 Mbit/s CBR background (Yee-Ting Li)
Strange hiccups in cwnd – the only correlation is SACK arrivals
Real Applications on Real Networks
Disk-2-disk applications on real networks
Memory-2-memory tests
Transatlantic disk-2-disk at Gigabit speeds
HEP & VLBI at SC|05
Remote Computing Farms
The effect of TCP The effect of distance
Radio Astronomy e-VLBI
Leave for the talk later in the meeting
iperf Throughput + Web100
SuperMicro on MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs ? < 10 (expect ~400)
BaBar on production network, standard TCP: 425 Mbit/s; DupACKs 350-400 – re-transmits
Applications: Throughput Mbit/s
HighSpeed TCP 2 GByte file RAID5 SuperMicro + SuperJANET
bbcp
bbftp
Apache
Gridftp
Previous work used RAID0 (not disk limited)
Transatlantic Disk to Disk Transfers With UKLight SuperComputing 2004
bbftp: What else is going on?
Scalable TCP
SuperMicro + SuperJANET
Instantaneous 0 - 550 Mbit/s
Congestion window – duplicate ACK
Throughput variation not TCP related?
Disk speed / bus transfer
Application architecture
BaBar + SuperJANET
Instantaneous 200 – 600 Mbit/s
Disk-mem ~590 Mbit/s – remember the end host
SC2004
SC2004 UKLIGHT Overview
[Diagram: SC2004 UKLight overview – SLAC booth and Caltech booth (UltraLight IP, Cisco 6509, Caltech 7600) on the show floor; NLR lambda NLR-PITT-STAR-10GE-16 to Chicago Starlight; UKLight 10G (four 1GE channels) via ULCC UKLight to Manchester (MB-NG 7600 OSR) and the UCL network / UCL HEP; SURFnet / EuroLink 10G (two 1GE channels) to Amsterdam; K2 and Ci switches along the paths]
Transatlantic Ethernet: TCP Throughput Tests
Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
1500 byte MTU, 2.6.6 Linux kernel
Memory-memory TCP throughput, standard TCP
Wire rate throughput of 940 Mbit/s
[Plots: instantaneous BW, average BW and CurCwnd vs time – full run and first 10 s]
Work in progress to study:
Implementation detail
Advanced stacks
Effect of packet loss
Sharing
SC2004 Disk-Disk bbftp
bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Move a 2 GByte file – disk-TCP-disk at 1 Gbit/s
Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
[Plots: instantaneous BW, average BW and CurCwnd vs time for the two stacks]
Network & Disk Interactions
Hosts: (work in progress)
Supermicro X5DPE-G2 motherboards
dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
six 74.3 GByte Western Digital Raptor WD740 SATA disks 64k byte stripe size
Measure memory to RAID0 transfer rates with & without UDP traffic:
Disk write alone: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s – drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s – drop of 19%
[Plots: RAID0 6-disk 1 GByte write rate (64k stripe, 3ware 8506-8) vs trial number for the three cases, and write rate vs % CPU in kernel/system mode (fit y = 178 - 1.05x), 07 Jan 05]
Transatlantic Transfers With UKLight SuperComputing 2005
ESLEA and UKLight
6 * 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR
Disk-to-disk transfers with bbcp, Seattle to UK
Set TCP buffer and application to give ~850 Mbit/s
One stream of data, 840-620 Mbit/s
Stream UDP VLBI data UK to Seattle, 620 Mbit/s
Reverse TCP
[Plots: per-host throughput vs time (16:00-23:00) for sc0501-sc0504 at SC|05, and aggregate UKLight traffic]
SC|05 – SLAC 10 Gigabit Ethernet
2 Lightpaths:
Routed over ESnet
Layer 2 over Ultra Science Net
6 Sun V20Z systems per λ, 3 transmit, 3 receive
dCache remote disk data access; 100 processes per node; node sends or receives
One data stream 20-30 Mbit/s
Used Neterion NICs & Chelsio TOE
Data also sent to StorCloud using fibre channel links
Traffic on the 10 GE link for 2 nodes: 3-4 Gbit per node, 8.5-9 Gbit on the trunk
Remote Computing Farms in the ATLAS TDAQ Experiment
ATLAS Remote Farms – Network Connectivity
ATLAS Application Protocol
[Diagram: message sequence between SFI, the Event Filter Daemon (EFD) and SFO – request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed]
Event request: EFD requests an event from SFI; SFI replies with the event (~2 Mbytes)
Processing of the event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication.
tcpmon: TCP Activity Manc-CERN Req-Resp
Round trip time 20 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window gets re-set on each request – the TCP stack follows RFC 2581 & RFC 2861, reduction of cwnd after inactivity
Even after 10 s, each response takes 13 rtt or ~260 ms
Transfer achievable throughput 120 Mbit/s
[Plots: data bytes in/out, CurCwnd and achievable throughput vs time]
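A rough estimate (mine, not from the talk) of why a 1 Mbyte response costs many round trips when cwnd restarts from slow start; the initial cwnd of 2 segments and the clean doubling per RTT are simplifications, so it only approximates the ~19 rtt observed.

def rtts_to_send(response_bytes: int, mss: int = 1460, init_cwnd_segments: int = 2) -> int:
    # Round trips needed to push a response through slow start, cwnd doubling each RTT
    remaining, cwnd, rtts = response_bytes, init_cwnd_segments, 0
    while remaining > 0:
        remaining -= cwnd * mss              # one congestion window of data per round trip
        cwnd *= 2
        rtts += 1
    return rtts

print(rtts_to_send(1_000_000), "rtts of 20 ms =", rtts_to_send(1_000_000) * 20, "ms")
# ~9-10 rtts with these assumptions; delayed ACKs, the request itself and the cwnd reset on each
# request push the observed figure towards the ~19 rtt / ~380 ms seen in the measurements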
tcpmon: TCP Activity Manc-CERN Req-Resp TCP stack tuned
Round trip time 20 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window grows nicely; response takes 2 rtt after ~1.5 s
Rate ~10/s (with 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
Data transferred WHEN the application requires the data
[Plots: data bytes in/out, packets in/out, CurCwnd and achievable throughput vs time]
tcpmon: TCP Activity Alberta-CERN Req-Resp TCP stack tuned
Round trip time 150 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: 1st event takes 11 rtt or ~1.67 s
TCP congestion window in slow start to ~1.8 s, then congestion avoidance
Response in 2 rtt after ~2.5 s
Rate 2.2/s (with 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Plots: data bytes in/out, packets in/out, CurCwnd and achievable throughput vs time]
Summary & Conclusions
Standard TCP not optimum for high throughput long distance links
Packet loss is a killer for TCP
Check on campus links & equipment, and access links to backbones
Users need to collaborate with the Campus Network Teams
Dante Pert
New stacks are stable and give better response & performance
Still need to set the TCP buffer sizes !
Check other kernel settings e.g. window-scale maximum
Watch for “TCP Stack implementation Enhancements”
TCP tries to be fair
Large MTU has an advantage
Short distances, small RTT, have an advantage
TCP does not share bandwidth well with other streams
The End Hosts themselves
Plenty of CPU power is required
for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power
Interaction between HW, protocol processing, and disk sub-system complex
Application architecture & implementation are also important
The TCP protocol dynamics strongly influence the behaviour of the Application.
Users are now able to perform sustained 1 Gbit/s transfers
More Information Some URLs 1
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004: http://www.hep.man.ac.uk/~rich/
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
More Information Some URLs 2
Lectures, tutorials etc. on TCP/IP: www.nv.cc.va.us/home/joney/tcp_ip.htm
www.cs.pdx.edu/~jrb/tcpip.lectures.html
www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
www.jbmelectronics.com/tcp.htm
Encyclopaedia: http://www.freesoft.org/CIE/index.htm
TCP/IP Resources www.private.org.il/tcpip_rl.html
Understanding IP addresses http://www.3com.com/solutions/en_US/ncs/501302.html
Configuring TCP (RFC 1122) ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
Assigned protocols, ports etc (RFC 1010) http://www.es.net/pub/rfcs/rfc1010.txt
& /etc/protocols
Any Questions?
Backup Slides
Latency Measurements
UDP/IP packets sent between back-to-back systems
Processed in a similar manner to TCP/IP
Not subject to flow control & congestion avoidance algorithms
Used the UDPmon test program
Latency
Round trip times measured using Request-Response UDP frames
Latency as a function of frame size
Slope is given by: dt/db = Σ over the data paths of 1/bandwidth, i.e. mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
Intercept indicates: processing times + HW latencies
Histograms of 'singleton' measurements tell us about:
Behaviour of the IP stack
The way the HW operates
Interrupt coalescence
Throughput Measurements
UDP Throughput
Send a controlled stream of UDP frames spaced at regular intervals
[Diagram: UDPmon sender-receiver exchange – zero stats (OK done); send data frames at regular intervals (n bytes, wait time); time to send / time to receive; inter-packet time histogrammed; signal end of test; get remote statistics: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. of interrupts, 1-way delay]
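A tiny sketch of the arithmetic behind these udpmon plots: the wire rate implied by a given frame size and inter-packet spacing; the 38-byte Ethernet overhead (preamble, header, FCS, inter-frame gap) is my assumption.

def udp_wire_rate_mbit_s(payload_bytes: int, spacing_us: float) -> float:
    # Rate on the wire for one UDP frame every spacing_us microseconds
    # (assumed overheads: 20 IP + 8 UDP + 38 Ethernet preamble/header/FCS/gap bytes)
    frame_bits = (payload_bytes + 20 + 8 + 38) * 8
    return frame_bits / spacing_us           # bits per microsecond == Mbit/s

print(udp_wire_rate_mbit_s(1472, 12))        # ~1025 Mbit/s requested – above GigE wire rate
print(udp_wire_rate_mbit_s(1472, 13))        # ~946 Mbit/s – close to the 940 Mbit/s user rate seen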
PCI Bus & Gigabit Ethernet Activity
PCI Activity
Logic analyser with PCI probe cards in the sending PC, a Gigabit Ethernet fibre probe card, and PCI probe cards in the receiving PC
[Diagram: CPU – chipset – memory – PCI bus – Gigabit Ethernet NIC at each end, with probes feeding the logic analyser display; possible bottlenecks marked]
Network switch limits behaviour
End-to-end UDP packets from udpmon
Only 700 Mbit/s throughput
Lots of packet loss shows the throughput is limited
[Plots: w05gva-gig6, 29 May 04, UDP – received wire rate vs spacing between frames (0-40 µs) for packet sizes 50-1472 bytes; % packet loss vs spacing; packet loss distribution vs packet number at a 12 µs wait time]
“Server Quality” Motherboards
SuperMicro P4DP8-2G (P4DP6) Dual
Xeon, 400/522 MHz front-side bus
6 PCI / PCI-X slots
4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
Dual Gigabit Ethernet
Adaptec AIC-7899W dual channel SCSI
UDMA/100 bus master/EIDE channels – data transfer rates of 100 MB/sec burst
“Server Quality” Motherboards
Boston/Supermicro H8DAR
Two Dual Core Opterons
200 MHz DDR memory – theory BW: 6.4 Gbit
HyperTransport
2 independent PCI buses: 133 MHz PCI-X
2 Gigabit Ethernet
SATA (PCI-e)
10 Gigabit Ethernet: UDP Throughput
1500 byte MTU gives ~2 Gbit/s; used 16144 byte MTU, max user length 16080
DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire rate throughput of 2.9 Gbit/s
CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s
SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s
[Plot: an-al 10GE Xsum 512kbuf MTU16114, 27 Oct 03 – received wire rate vs spacing between frames (0-40 µs) for packet sizes 1472-16080 bytes]
10 Gigabit Ethernet: Tuning PCI-X
16080 byte packets every 200 µs; Intel PRO/10GbE LR adapter; PCI-X bus occupancy vs mmrbc
Measured times, and times based on PCI-X transactions from the logic analyser
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
[Plots: measured rate, rate from expected time, and max PCI-X throughput vs max memory read byte count (mmrbc), for DataTAG Xeon 2.2 GHz and HP Itanium (kernel 2.6.1#17, Intel 10GE, Feb 04); PCI-X traces for mmrbc 512, 1024, 2048 and 4096 bytes showing CSR access, PCI-X sequence, data transfer, interrupt & CSR update – 5.7 Gbit/s at mmrbc 4096]
Congestion control: ACK clocking
End Hosts & NICs CERN-nat-Manc.
Use UDP packets to characterise host, NIC & network
SuperMicro P4DP8 motherboard, dual Xeon 2.2 GHz CPU, 400 MHz system bus, 64 bit 66 MHz PCI / 133 MHz PCI-X bus
Measured: request-response latency, throughput, packet loss, re-ordering
The network can sustain 1 Gbps of UDP traffic
The average server can lose smaller packets
Packet loss caused by lack of power in the PC receiving the traffic
Out-of-order packets due to WAN routers
Lightpaths look like extended LANs – no re-ordering
[Plots: pcatb121-nat-gig6, 13 Aug 04 – received wire rate and % packet loss vs spacing between frames for packet sizes 50-1472 bytes; latency histograms for 256, 512 and 1400 byte frames]
tcpdump / tcptrace
tcpdump: dump all TCP header information for a specified source/destination – ftp://ftp.ee.lbl.gov/
tcptrace: format tcpdump output for analysis using xplot – http://www.tcptrace.org/
NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools – http://www.ncne.nlanr.net/TCP/testrig/
Sample use:
tcpdump -s 100 -w /tmp/tcpdump.out host hostname
tcptrace -Sl /tmp/tcpdump.out
xplot /tmp/a2b_tsg.xpl
tcptrace and xplot
X axis is time
Y axis is sequence number
the slope of this curve gives the throughput over time.
The xplot tool makes it easy to zoom in
Zoomed In View
Green Line: ACK values received from the receiver
Yellow Line: tracks the receive window advertised by the receiver
Green Ticks: track the duplicate ACKs received
Yellow Ticks: track the window advertisements that were the same as the last advertisement
White Arrows: represent segments sent
Red Arrows (R): represent retransmitted segments
TCP Slow Start