Related Projects and TCP Kernel


Towards Gigabit
David Wei
Netlab@Caltech
For FAST Meeting, July 2
Potential Problems

 Hardware / Driver / OS
 Protocol Stack Overhead
 Scalability of the protocol specification
 TCP Stability / Utilization (new congestion control algorithm)
 Related Experiments & Measurements
Hardware / Drivers / OS

 NIC Driver
 Device Management (Interrupts)
 Redundant Copies
 Device Polling (http://info.iet.unipi.it/~luigi/polling/)
 Zero-Copy TCP (www.cs.duke.edu/ari/publications/talks/freebsdcon)
 …
Device Polling
Current process for a NIC driver in FreeBSD:
1. Packet arrives at the NIC
2. NIC raises a hardware interrupt
3. CPU jumps to the interrupt handler for that NIC
4. MAC-layer processing reads the data from the NIC into a queue
5. Upper layers process the data in the queue (lower priority)

Drawback:
The CPU checks the NIC for every packet -- context switching.
Frequent interrupts for a high-speed device.

Live-Lock:
The CPU is too busy servicing NIC interrupts to process the data in the queue.

Device Polling
Device Polling:
 Polling: the CPU checks the device when it has time.
 Scheduling: the user specifies a time ratio for the CPU to spend on device work vs. non-device processing (see the sketch below).
Advantages:
 Balance between device service and non-device processing
 Improved performance with fast devices
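To make the scheduling idea concrete, here is a minimal, illustrative sketch in C of a time-budgeted polling loop. The simulated device, the queue and the 50% share are made-up placeholders; this is not the actual FreeBSD polling code (see Luigi Rizzo's page above for that).

/*
 * Minimal, illustrative sketch of a time-budgeted polling scheduler.
 * The device model, the queue and the 50% share are placeholders;
 * this is NOT the actual FreeBSD polling implementation.
 */
#include <stdio.h>

#define TICK_US      1000   /* one scheduling tick, in microseconds      */
#define DEVICE_SHARE 50     /* user-chosen % of each tick spent polling  */

static int nic_backlog = 500;   /* packets waiting in the (pretend) NIC  */
static int queue_depth = 0;     /* packets queued for upper layers       */

static int poll_nic(int budget_us)
{
    int pulled = budget_us / 10;          /* pretend: 10 us per packet   */
    if (pulled > nic_backlog)
        pulled = nic_backlog;
    nic_backlog -= pulled;
    queue_depth += pulled;
    return pulled;
}

static int process_queue(int budget_us)
{
    int done = budget_us / 20;            /* pretend: 20 us per packet   */
    if (done > queue_depth)
        done = queue_depth;
    queue_depth -= done;
    return done;
}

int main(void)
{
    for (int tick = 0; tick < 20 && (nic_backlog || queue_depth); tick++) {
        int dev_us = TICK_US * DEVICE_SHARE / 100;   /* bounded NIC work */
        int in  = poll_nic(dev_us);
        int out = process_queue(TICK_US - dev_us);   /* upper layers always
                                                        get CPU: no live-lock */
        printf("tick %2d: polled %3d, processed %3d, queued %3d\n",
               tick, in, out, queue_depth);
    }
    return 0;
}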
Protocol Stack Overhead
Per-packet overhead:
 Ethernet header / checksum
 IP header / checksum
 TCP header / checksum
 Copying / interrupt processing
Solution: increase packet size
 Optimal packet size = min{MTU along the path}
(Fragmentation results in low performance too.)
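As a rough illustration (counting only the 40 bytes of IP and TCP headers and ignoring Ethernet framing and per-packet CPU cost), header overhead is about 40/1500 ≈ 2.7% at a 1500-byte MTU versus 40/9000 ≈ 0.4% at a 9000-byte jumbo MTU, and for the same data volume the number of per-packet interrupts and copies drops by the same factor of six.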
Path MTU Discovery (RFC 1191)
Current Method:
 "Don't Fragment" bit (Router: Drop/Fragment; Host: Test/Enforce)
 MTU = min{576, first-hop MTU}
 MSS = MTU - 40
 MTU <= 65535 (architecture)
 MSS <= 65495 (IP sign-bit bugs…)
 Drawback: usually too small
Path MTU Discovery
How to Discover PMTU?
Current:
 Search (proportional decrease / binary)
 Update (periodic increase – set to the MTU of the first hop)
Proposed:
 Search/Update with typical MTU values
 Routers: suggest an MTU in the "Datagram Too Big" (DTB) message indicating the DF packet drop.

Path MTU Discovery
Implementation
Host:
 Packetization layer (TCP / connection over UDP): sets DF / packet size
 IP: store PMTU for each known path (routing table)
 ICMP: "Datagram Too Big" message
Router:
 Send an ICMP packet when a datagram is too big.
Implementation problems:
 RFC 2923
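For a sense of what the host side looks like in practice, here is a minimal sketch of asking the kernel to perform PMTU discovery and reading back the current estimate. It assumes a Linux host (IP_MTU_DISCOVER / IP_MTU are Linux-specific socket options) and a placeholder destination address; it is not the BSD code path discussed above.

/*
 * Minimal sketch of host-side PMTU discovery via Linux socket options
 * (IP_MTU_DISCOVER / IP_MTU are Linux-specific).  The destination
 * address is a placeholder.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    int mode = IP_PMTUDISC_DO;      /* set DF on everything we send      */
    int mtu;
    socklen_t len = sizeof(mtu);

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(9);                       /* discard port     */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder host */

    /* connect() pins the destination so the kernel tracks a PMTU for it */
    connect(s, (struct sockaddr *)&dst, sizeof(dst));
    setsockopt(s, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode));

    /* the kernel's estimate starts at the route MTU and shrinks whenever
     * an ICMP "Datagram Too Big" message arrives for this path */
    if (getsockopt(s, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
        printf("current path MTU estimate: %d bytes\n", mtu);

    close(s);
    return 0;
}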
Scalability of Protocol Specifications

 Window Size Space (<=64K)
 Sequence Number Space (wrap-around, <=2G)
 Inadequate frequency of RTT sampling (1 sample per window)

[Diagram: packets 1-5 in flight with cumulative ACKs, illustrating the window and sequence number spaces.]
Sequence Number Space

[Animation: segments 1-5 are sent and cumulatively ACKed; with a small sequence space the numbers wrap around (…, 6, 7, 0, 1, …). After wrap-around a delayed old segment can carry the same sequence number as a new one, so a segment is accepted only when its delay <= Max Segment Life.]
Sequence Number Space

 MSL (Max Segment Life) > variance of IP delay
 MSL < 8*|Sequence Number Space| / Bandwidth
 |SN Space| = 2^31 = 2GB
 Bandwidth = 1Gbps
 MSL <= 16 sec
 Variance of IP delay <= 16 sec
 Current TCP: 3 min.
 Not scalable with bandwidth growth
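Spelling out the arithmetic on this slide: with a 2^31-byte sequence space sent at 1 Gbit/s,

MSL < 8 * 2^31 bits / 10^9 bit/s ≈ 17 s (≈ 16 s if a gigabit is counted as 2^30 bits),

so any segment older than roughly 16 seconds could alias a live sequence number, while TCP's assumed maximum segment lifetime is measured in minutes.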
TCP Extensions (RFC 1323)

 Window Scaling: scale factor S carried in the SYN; Win = [16-bit Win] * 2^S
 RTT Measurement: a timestamp on each packet (generated by the sender, echoed by the receiver)
 PAWS (Protect Against Wrapped Sequence numbers): use the timestamp to extend the sequence space. (So the timestamp clock should be neither too fast nor too slow: 1 ms ~ 1 sec per tick)
 Header Prediction: simplify processing of the common case
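A small worked example of the window-scale option: how large a scale factor S a given bandwidth-delay product needs. The 1 Gbit/s bandwidth and 100 ms RTT below are illustrative values, not figures from the talk.

/*
 * Worked example for the window-scale option: the smallest scale
 * factor S with 65535 * 2^S >= bandwidth * RTT.
 */
#include <stdio.h>

int main(void)
{
    double bdp_bytes = 1e9 * 0.100 / 8.0;   /* 1 Gbit/s * 100 ms = 12.5 MB */
    int scale = 0;

    while (65535.0 * (1 << scale) < bdp_bytes && scale < 14)  /* RFC cap: 14 */
        scale++;

    printf("BDP = %.1f MB -> window scale S = %d (window up to %.1f MB)\n",
           bdp_bytes / 1e6, scale, 65535.0 * (1 << scale) / 1e6);
    return 0;
}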
High Speed TCP
Floyd ’02. Goals:
 Achieve a large window size with a realistic loss rate (use the current window size in the AIMD parameters)
 High speed in a single connection (10Gbps)
 Easy to achieve a high sending rate for a given loss rate. How to achieve TCP-friendliness?
 Incrementally deployable (no router support required)
High Speed TCP
Problem in Steady State:
 TCP response function: a large congestion window requires a very low loss rate.
Problem in Recovery:
 Congestion Avoidance takes too long to recover (consecutive time-outs)

Consecutive Time-out

 First time-out: SS-Threshold = cwnd/2; Slow Start restarts at cwnd = 1
 The retransmission times out again: SS-Threshold = 2; Slow Start restarts at cwnd = 1
 Slow Start immediately reaches the threshold, so the connection enters Congestion Avoidance at cwnd = 2 and must grow back linearly from there.
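A rough, illustrative estimate of how slow that linear recovery is (the 1000-segment window and 100 ms RTT are placeholder values): Congestion Avoidance adds about one segment per RTT, so climbing from cwnd = 2 back to a window of W segments takes roughly W round trips:

T_recover ≈ W * RTT = 1000 * 0.1 s = 100 s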
High Speed TCP

ACK:  w <- w + a(w)/w
Drop: w <- w - b(w)*w

Change the TCP response function:
 p is high (higher than the maxP corresponding to the default cwnd size W): standard TCP
 p is low (cwnd >= W): use a(w), b(w) instead of the constants a, b in the adjustment of cwnd.
 For a given loss rate P1 and desired window size W1 at P1: derive a(w) and b(w). (Keep the linearity on a log-log scale: ∆logW ∝ ∆logP)
Change TCP Function

[Plot: log w vs. log p. Standard TCP is the line log w = -(log p)/2 + (log 1.5)/2 with intercept log(1.5)/2. High Speed TCP keeps this line down to the point (log P, log W) and, for lower loss rates, follows a new linear segment through (log P, log W) and (log P1, log W1).]
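The log-log interpolation can be sketched directly, as below. The endpoint values follow the defaults in Floyd's proposal (deviation starts around w = 38 segments, with W1 = 83000 segments at P1 = 10^-7); treat them as illustrative rather than authoritative.

/*
 * Sketch of the High Speed TCP response function as a log-log
 * interpolation between two points, per the idea on the slides.
 */
#include <math.h>
#include <stdio.h>

#define W   38.0              /* window where HSTCP starts to deviate      */
#define P   (1.5 / (W * W))   /* loss rate of standard TCP at w = W        */
#define W1  83000.0           /* desired window at loss rate P1            */
#define P1  1e-7

/* standard TCP: w = sqrt(1.5/p), i.e. log w = -(log p)/2 + (log 1.5)/2 */
static double w_standard(double p)
{
    return sqrt(1.5 / p);
}

/* HSTCP: linear in (log p, log w) through (P, W) and (P1, W1) */
static double w_highspeed(double p)
{
    if (p >= P)                  /* high loss: behave like standard TCP */
        return w_standard(p);
    double slope = (log(W1) - log(W)) / (log(P1) - log(P));
    return exp(log(W) + slope * (log(p) - log(P)));
}

int main(void)
{
    double probes[] = { 1e-2, 1e-3, 1e-5, 1e-7 };
    for (int i = 0; i < 4; i++)
        printf("p = %.0e  standard w = %8.0f  highspeed w = %8.0f\n",
               probes[i], w_standard(probes[i]), w_highspeed(probes[i]));
    return 0;
}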
Expectations
 Achieve a large window with a realistic loss rate
 Relative fairness between standard TCP and High Speed TCP (acquired bandwidth ∝ cwnd)
 Moderate decrease instead of halving the window when congestion is detected (factor 0.33 at cwnd = 1000)
 Pre-computed look-up table to implement a(w) and b(w).

Slow Start
Modification of Slow Start:
 Problem: doubling cwnd every RTT is too aggressive for large cwnd
 Proposal: limit ∆cwnd per RTT during Slow Start.

[Plot: sending rate vs. time; unrestricted Slow Start overshoots the available bandwidth and causes loss.]
Limited Slow Start
For each ACK:
 cwnd <= max_ssthresh: ∆cwnd = MSS (standard TCP Slow Start)
 cwnd > max_ssthresh: ∆cwnd = 0.5*max_ssthresh/cwnd (at most 0.5*max_ssthresh per RTT; see the sketch below)

[Plot: sending rate vs. time; growth slows once cwnd exceeds max_ssthresh and approaches the bandwidth more gently.]
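A minimal sketch of the per-ACK rule above, with cwnd counted in segments; max_ssthresh = 100 segments is an arbitrary example value, not a recommended setting.

/*
 * Sketch of the Limited Slow Start per-ACK update described above,
 * with cwnd counted in segments.
 */
#include <stdio.h>

#define MAX_SSTHRESH 100.0     /* segments; example value */

/* one ACK arrives: grow cwnd (in segments) and return the new value */
static double on_ack(double cwnd)
{
    if (cwnd <= MAX_SSTHRESH)
        return cwnd + 1.0;                      /* standard slow start        */
    return cwnd + 0.5 * MAX_SSTHRESH / cwnd;    /* <= max_ssthresh/2 per RTT  */
}

int main(void)
{
    double cwnd = 1.0;
    int rtt = 0;

    /* simulate slow start round by round: about cwnd ACKs arrive per RTT */
    while (cwnd < 1000.0 && rtt < 200) {
        int acks = (int)cwnd;
        for (int i = 0; i < acks; i++)
            cwnd = on_ack(cwnd);
        rtt++;
        printf("RTT %3d: cwnd = %7.1f segments\n", rtt, cwnd);
    }
    return 0;
}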
Related Projects

 Cray Research (’92)
 CASA Testbed (’94)
 Duke (’99)
 Pittsburgh Supercomputing Center
 Portland State Univ. (’00)
 Internet 2 (’01)
 Web100
 Net100 (built on Web100)
Cray Research ’92
TCP/IP Performance at Cray Research (Dave Borman)
Configuration:
 HIPPI between two dedicated Y-MPs with Model E IOS and Unicos 8.0
 Memory-to-memory transfer
Results:
 Direct channel-to-channel:
MTU 64K - 781 Mbps
 Through a HIPPI switch:
MTU 33K - 416 Mbps
MTU 49K - 525 Mbps
MTU 64K - 605 Mbps

CASA Testbed ’94
Applied Network Research of the San Diego Supercomputer Center + UCSD
 Goal: delay and loss characteristics of a HIPPI-based gigabit testbed
 Link feature: blocking (HIPPI); tradeoff between high loss rate and high delay
 Conclusion: avoiding packet loss is more important than reducing delay
 Performance (Delay*Bandwidth = 2MB; RFC 1323 on; Cray machines): 500Mbps sustained TCP throughput (TTCP/Netperf)
Trapeze/IP (Duke)
Goal:
 What optimization is most useful to reduce host overheads for fast TCP?
 How fast does TCP really go, and at what cost?
Approaches:
 Zero-Copy
 Checksum offloading
Result:
 >900Mbps for MTU>8K
Trapeze/IP (Duke)

[Figures: zero-copy data path and related slides from www.cs.duke.edu/ari/publications/talks/freebsdcon]
Enabling High Performance Data Transfers on Hosts
By the Pittsburgh Supercomputing Center
 Enable RFC 1191 MTU Discovery
 Enable RFC 1323 Large Windows
 OS kernel: large enough socket buffers
 Application: set its own send and receive socket buffer sizes (see the sketch below)
Detailed methods to tune various OSes.
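A minimal sketch of the application-side step (requesting large send and receive socket buffers); the 4 MB figure is an example, and the kernel may clamp the request to its own limits.

/*
 * Sketch of the application-side tuning step: request large send and
 * receive socket buffers.  4 MB is an example value.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int req = 4 * 1024 * 1024;           /* ask for 4 MB each way */
    int got;
    socklen_t len = sizeof(got);

    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &req, sizeof(req));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));

    /* read back what the kernel actually granted */
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &got, &len);
    printf("send buffer: %d bytes\n", got);
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("recv buffer: %d bytes\n", got);

    close(s);
    return 0;
}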
PSU Experiment
Goal:
 Round-trip delay and TCP throughput with different window sizes
 Influence of different devices (Cisco 3508/3524/5500) and different NICs
Environment:
 OS: FreeBSD 4.0/4.1 (without RFC 1323?), Linux, Solaris
 WAN: 155Mbps OC-3 over SONET MAN
 Measurement tools: Ping + TTCP
PSU Experiment
Results:
 "Smaller" switches and low-level routers can easily muck things up.
 Bugs in Linux 2.2 kernels
 Different NICs have different performance.
 A fast PCI bus (64 bits * 66MHz) is necessary.
 Switch MTU size can make a difference (giant packets are better).
 Bigger TCP window sizes can help, but there seems to be a knee around 4MB that is not remarked upon in the literature.
Internet-2 Experiment
Goal: a single TCP connection with 700-800Mbps over WAN; relations among window size, MTU and throughput
 OS: FreeBSD 4.3 release
 Architecture: 64bit-66MHz PCI + …
 Configuration: sendspace=recvspace=102400
 Setup: direct connection (back-to-back) and WAN
 WAN: symmetric path: host1-Abilene-host2
 Measurement: Ping + IPerf
Internet-2 Experiment
Back-to-Back:
 No loss
 Found some bugs in FreeBSD 4.3

Window | 4KB MTU (Mbps) | 8KB MTU (Mbps)
512K   | 690            | 855-986
1M     | 658            | 986
2M     | 562            | 986
4M     | 217            | 987
8M     | 93             | 987
16M    | 86             | 985

WAN:
 <=200Mbps
 Asymmetry in different directions (cache of MTU…)
Web100
 Goal: make it easy for non-experts to achieve high bandwidth
 Method: get more information out of TCP
 Software:
Measurement: embedded into the kernel TCP
Application layer: diagnostics / auto-tuning
 Proposal:
RFC 2012 (MIB)

Net100

 Built on Web100
 Auto-tunes parameters for non-experts
 Network-aware OS
 Bulk file transfer for ORNL
 Implementation of Floyd’s High Speed TCP
Floyd’s TCP SS on Net100
www.csm.ornl.gov/~dunigan/net100/floyd.html
[Plot: cwnd over time as reported by Web100; RTT 80ms, 1MB send window, 2MB receive window.]
Floyd’s TCP AIMD on Net100
www.csm.ornl.gov/~dunigan/net100/floyd.html
[Plot annotations: RTT 87ms; window 1000 segments; max_ssthresh 100 segments; slow start takes 1.8 sec; multiplicative decrease at cwnd 1000: *0.33 per timeout; additive increase at cwnd 700: +8 per RTT; old TCP would need ~45 sec to recover.]
Trend (Mathis: Oct 2001)

TCP over long paths:

Year | Wizard  | Non-Wizard | Ratio
1988 | 1Mbps   | 300kbps    | 3:1
1991 | 10Mbps  |            |
1995 | 100Mbps |            |
1999 | 1Gbps   | 3Mbps      | 300:1
Related Tools
Measurement:
 IPerf
 TCP Dump
 Web100
Emulation:
 Dummynet
NLANR Iperf
Features:
 Sends test data from user space
 Support: IPv4/IPv6
 Support: TCP/UDP/Multicast…
 Similar software: Auto-Tuning Enabled FTP Client/Server
Concern:
 Preemption by other processes in a Gigabit test? (observed in the Internet2 experiment)
Dummynet
Embedded in FreeBSD now
 Delay: delay at the IP layer
 Loss: random loss at the IP layer
Concerns:
 Overhead
 Pattern of packet loss

Current Status in Netlab@Caltech

100Mbps Testbed in netlab:
[Diagram: two hosts (TCP / IP+dummynet / driver, and TCP / IP / driver) connected by UTP cables through a 100M hub, with a monitor attached.]
Next Step
1Gbps Testbed in lab:
[Diagram: two hosts (TCP / IP+dummynet / driver, and TCP / IP / driver) connected by fiber through optical splitters, with a monitor tapping the split fibers.]
Q&A