Document 7834382

Transcript Document 7834382

TCP
TCP – purpose
• TCP provides reliable data transmission over an unreliable network.
• TCP provides congestion control.
• TCP provides flow control.
• TCP passes messages.
– Inputs
• Destination address
• Destination port
• Source port (socket)
• Message
– Outputs
• Message
• Error reporting
If TCP reports that the message has been delivered then we can rest assured that the receiving
application has received the data. What the application does with it is another story.
At least 85% of all traffic uses TCP… but I heard that 50% of traffic in S. Korea uses UDP (gaming).
UDP
– No flow control
– No error reporting (little error reporting)
[Figure: protocol stack. Top row: BGP, FTP, HTTP, SMTP, telnet. Middle row: ICMP, UDP, OSPF, TCP. Bottom row: IP.]
TCP header
• IP header is 20 bytes (source IP, destination IP, protocol, TTL, …)
• TCP header is 20 bytes
[Figure: TCP header layout, one 32-bit row per line.
Source port (16) | Destination port (16)
Sequence # (32)
ACK # (32)
Header length (4 bits) | Reserved (6) | URG ACK PSH RST SYN FIN | REC WIN (16)
CHECK SUM (16) | Urgent ptr (16)
Options and padding]
• Ports – used so a single host can have many connections at the same time. When a packet arrives, it is distinguished by the source IP, source port, and destination port. More or less, the IPs and ports define an application.
• Sequence number – indicates the 1st byte of the data.
• ACK# is the next expected sequence number.
• Header length is in 32-bit words. 4 bits means the max size is 15 words = 60 bytes. 20 bytes are used by the fixed header, so up to 40 bytes more could be in options.
• Flags
– URG – urgent ptr (urgent data and valid urgent ptr, e.g., ctrl-c)
– ACK – ACK number is valid
– PSH – push: the receiver should pass this data to the application as soon as possible, as opposed to waiting for a full buffer. This should be set when this packet will empty the sender's outgoing buffer, so the receiver should not wait for more data before passing data to the app. Just send it now.
– RST – reset connection (something went wrong, good for detecting attacks).
– SYN – synchronize sequence number
– FIN – sender is finished sending data
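The field layout above can be sketched as a small decoder. This is only an illustration under stated assumptions: the function name `parse_tcp_header` and the returned dictionary keys are made up for this example, and the sample SYN reuses seq#=2197 from the connection-establishment slide with arbitrary port numbers.

```python
import struct

# Flag bit positions, least significant bit first.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (hypothetical helper)."""
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    # The 4-bit header length counts 32-bit words, so multiply by 4 for bytes.
    header_len = (offset_flags >> 12) * 4
    flags = [name for bit, name in enumerate(FLAG_NAMES) if offset_flags & (1 << bit)]
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],
    }

# Example: a SYN with seq#=2197, header length 5 words, SYN bit set.
# Ports 40000 and 80 and the window value are arbitrary illustration values.
syn = struct.pack("!HHIIHHHH", 40000, 80, 2197, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
```

The largest header length the 4-bit field can express is 15 words, which is where the 60-byte limit comes from.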
Connection establishment
Node A initiates a connection with node B
=> Node A performs an active open, node B passive open (listen)
1. Source sends SYN: SYN=1, seq#=2197, ACK=0
2. Dest sends SYN/ACK: SYN=1, seq#=197, ACK=1, ack#=2198
3. Source sends ACK (for SYN): ACK flag=1, ack#=198, seq#=2198
The initial sequence number in the SYN depends on the implementation…
Connection establishment
• If the first SYN is dropped, then it is resent 3 seconds later. If this is dropped, it is resent 6 seconds later. And so on. The maximum waiting time is 64 seconds. The maximum time can be as high as 180 seconds. But this depends on the implementation.
• If the listener doesn’t get an ACK, it will retransmit in 3 seconds and back off in the same way.
• But if the listener gets a data packet, the ACK flag will be set and this will end the connection establishment.
• Often during connection establishment, connection setup data is included in the options.
– E.g., the segment size is included in the options.
– More options discussed later
Connection termination
• FIN flag implies no more data will be sent from that host.
• A FIN from each side closes the connection.
• A FIN from only one side puts the connection in the half-close state.
• Example
– Node A sends first
• A sends pkt with FIN=1 and seq#=U (A enters FIN_WAIT_1)
• B responds with ACK and ack#=U+1 (B enters CLOSE_WAIT)
• A receives ACK (A enters FIN_WAIT_2)
• Now B closes
• B sends pkt with FIN set and seq#=V (B enters LAST_ACK)
• A responds with ACK and ack#=V+1 (A enters TIME_WAIT, stays there for 120 seconds, and then enters CLOSED)
• B receives ACK and enters CLOSED.
• Use netstat to determine the state of the TCP connections.
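The termination example can be traced with a tiny transition table. This is a sketch, not a full TCP state machine: the event names (`send_fin`, `recv_ack`, …) are hypothetical, only the transitions used in the example are included, and FIN_WAIT_1/FIN_WAIT_2 are the standard spellings of the two FIN_WAIT states.

```python
# Only the transitions exercised by the example; simultaneous close is omitted.
TRANSITIONS = {
    ("ESTABLISHED", "send_fin"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("TIME_WAIT", "timer_expired"): "CLOSED",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send_fin"): "LAST_ACK",
    ("LAST_ACK", "recv_ack"): "CLOSED",
}

def run(events, state="ESTABLISHED"):
    """Return the list of states visited while applying the events in order."""
    trace = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        trace.append(state)
    return trace

# Node A closes first (active close); node B closes second (passive close).
a_trace = run(["send_fin", "recv_ack", "recv_fin", "timer_expired"])
b_trace = run(["recv_fin", "send_fin", "recv_ack"])
```

Between A's FIN and B's FIN the connection is half closed: B is in CLOSE_WAIT and may still send data.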
Sending data
• Either side can send data. The sequence number indicates where the first byte is placed in the receiver buffer.
• The receiver responds with an ACK; the ack# indicates the next empty byte location in the buffer.
[Figure: example exchanges. The SYN had seq#=14 and the receiver buffer already holds ‘Steve’ at seq# 15–19.
– Sender: Seq#=20, Ack#=1001, Data = ‘Hi’, size = 2 bytes. ‘Hi’ lands at seq# 20–21. Receiver: Seq#=1001, Ack#=22, data size = 0.
– Sender: Seq#=22, Ack#=1001, Data = ‘Bye’, size = 3 bytes. ‘Bye’ lands at seq# 22–24. Receiver: Seq#=1001, Ack#=25, data size = 0.
– One panel also shows a receiver reply with Seq#=1001, Ack#=20, data size = 0.]
Note: here the receiver is not sending data, so its seq num is never changing and the ack# in the sender's packets is never changing. But the definitions of the ack# and seq# remain valid. Note that SYN and FIN packets are special cases: they carry no data, but each consumes one sequence number, so the ack# increments.
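A minimal sketch of how a receiver could compute the cumulative ack#, using the ‘Hi’/‘Bye’ numbers from the example. The class name `ReceiveBuffer` is hypothetical, and the sketch assumes segments never partially overlap.

```python
class ReceiveBuffer:
    """Tracks the next expected sequence number (the ack# to send)."""

    def __init__(self, next_expected: int):
        self.next_expected = next_expected
        self.pending = {}        # seq# -> data received out of order
        self.delivered = b""

    def on_segment(self, seq: int, data: bytes) -> int:
        """Store a segment and return the ack# for the reply."""
        if seq >= self.next_expected:
            self.pending[seq] = data
        # Drain every segment that is now contiguous with the delivered data.
        while self.next_expected in self.pending:
            chunk = self.pending.pop(self.next_expected)
            self.delivered += chunk
            self.next_expected += len(chunk)
        return self.next_expected

# In-order arrival: 'Hi' at 20, then 'Bye' at 22.
in_order = ReceiveBuffer(20)
acks_in_order = [in_order.on_segment(20, b"Hi"), in_order.on_segment(22, b"Bye")]

# 'Bye' arrives first: the ack# stays at 20 until 'Hi' fills the gap.
reordered = ReceiveBuffer(20)
acks_reordered = [reordered.on_segment(22, b"Bye"), reordered.on_segment(20, b"Hi")]
```

The second case shows why a cumulative ACK alone cannot tell the sender that ‘Bye’ has already arrived.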
Retransmission time-out
• How to decide when a packet should be retransmitted?
• Two methods. Here we talk about the first: when the ACK has not been received in a long time, TCP assumes that the packet was dropped.
• How long is a long time…..? No good solution.
RTT is the round-trip time
SRTT is a smoothed (filtered) version of RTT
RTTMD accounts for the variance of RTT
Van Jacobson’s algorithm:
$$\mathrm{SRTT}_{k+1} = \alpha\,\mathrm{SRTT}_k + (1-\alpha)\,\mathrm{RTT}_k$$
$$\mathrm{RTTMD}_{k+1} = \beta\,\mathrm{RTTMD}_k + (1-\beta)\,\lvert \mathrm{SRTT}_k - \mathrm{RTT}_k \rvert$$
$$\alpha = 0.9 \text{ or } 7/8, \qquad \beta = 0.25$$
$$\mathrm{RTO}_k = \max\left(\mathrm{SRTT}_k + 4\,\mathrm{RTTMD}_k,\ \mathrm{MinRTO}\right)$$
(RFC 6298 weights the new deviation sample by 1/4, which corresponds to β = 0.75 in the form written here.)
MinRTO is 200ms in Linux, 500ms in BSD; the RFCs say it should be 1 second.
This does not work all that well. Really, it is MinRTO that controls when time-outs occur. Van Jacobson’s algorithm does not work well. But more analysis is required.
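The estimator can be sketched in a few lines. Assumptions, all labeled in the code: the class name `RtoEstimator` is made up, the initial deviation of half the first RTT sample is borrowed from RFC 6298, the update uses the form "gain times old value plus (1 - gain) times new sample" with the slide's values 7/8 and 0.25, and MinRTO is the 200 ms Linux figure.

```python
class RtoEstimator:
    """Smoothed RTT / mean deviation tracker (illustrative sketch)."""

    def __init__(self, first_rtt, alpha=7 / 8, beta=0.25, min_rto=0.2):
        self.srtt = first_rtt
        self.rttmd = first_rtt / 2   # assumed initial deviation (RFC 6298 choice)
        self.alpha = alpha           # weight on the old SRTT
        self.beta = beta             # weight on the old RTTMD (slide's value)
        self.min_rto = min_rto       # 200 ms, the Linux figure

    def rto(self):
        return max(self.srtt + 4 * self.rttmd, self.min_rto)

    def update(self, rtt):
        # The deviation uses the SRTT from before this sample is folded in.
        self.rttmd = self.beta * self.rttmd + (1 - self.beta) * abs(self.srtt - rtt)
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * rtt
        return self.rto()

est = RtoEstimator(first_rtt=0.1)     # 100 ms first sample
initial_rto = est.rto()               # SRTT + 4*RTTMD = 0.1 + 4*0.05
steady_rto = est.update(0.1)          # a second, identical 100 ms sample
```

With a steady 100 ms RTT the computed value drops below MinRTO after one update, so the timer is pinned at MinRTO, which matches the remark that MinRTO is what really controls time-outs.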
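RTO analysis
Suppose that the pdf of RTT is $\lambda e^{-\lambda R}$ (exponentially distributed, e.g., M/M/1 queue). The mean is $1/\lambda$.
Mean deviation is
$$\int_0^{\infty} \left| r - \tfrac{1}{\lambda} \right| \lambda e^{-\lambda r}\,dr = \int_0^{1/\lambda} \left(\tfrac{1}{\lambda} - r\right) \lambda e^{-\lambda r}\,dr + \int_{1/\lambda}^{\infty} \left(r - \tfrac{1}{\lambda}\right) \lambda e^{-\lambda r}\,dr = \frac{2}{\lambda} e^{-1}$$
$$P(\text{timeout}) = P\left(R > \tfrac{1}{\lambda} + 4 \cdot \tfrac{2}{\lambda} e^{-1}\right) = \int_{\frac{1}{\lambda}\left(1 + 8e^{-1}\right)}^{\infty} \lambda e^{-\lambda r}\,dr = e^{-\left(1 + 8e^{-1}\right)} \approx 0.019 \approx 2\%$$
Using the July 25, 2001 snapshot of round-trip times from the NLANR data set, we computed the empirical probability of spurious timeouts. The total data set consists of nearly 13000 connections between 122 sites and 17.5 million round-trip time measurements. This data consisted of time series of round-trip times for each connection, with each time series containing 1440 round-trip times (one sample per minute over the entire day).
[Figure: empirical P(RTT>RTO), from 0 to 0.07, plotted against K, from 0 to 20.]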
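A quick numerical check of the exponential-RTT analysis. For an exponential distribution with rate λ the mean is 1/λ and the mean absolute deviation is 2/(eλ); the function names below are hypothetical, and the midpoint-rule integration (with an arbitrary truncation point) is only there to confirm the closed form.

```python
import math

def spurious_timeout_probability(k: float = 4.0) -> float:
    """P(RTT > mean + k * mean deviation) for exponentially distributed RTT."""
    # Threshold times lambda is 1 + k * 2/e, so lambda cancels out.
    return math.exp(-(1.0 + k * 2.0 / math.e))

def mean_deviation_numeric(lam: float = 1.0, steps: int = 200_000, upper: float = 40.0) -> float:
    """Midpoint-rule estimate of E|R - 1/lam|, truncated at upper/lam."""
    dr = upper / (lam * steps)
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        total += abs(r - 1.0 / lam) * lam * math.exp(-lam * r) * dr
    return total

p_timeout = spurious_timeout_probability()   # k = 4, as in RTO = SRTT + 4*RTTMD
md_check = mean_deviation_numeric()          # should be close to 2/e for lam = 1
```

Under this model roughly 2% of round trips would exceed the timer even with no loss at all.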
Detecting drops with triple Dup ACKs
[Figure: the receiver buffer holds ‘Steve’ at seq# 15–19.
1. Sender: Seq#=20, Ack#=1001, Data = ‘Hi’, size = 2 bytes. Receiver: Seq#=1001, Ack#=22, data size = 0.
2. Sender: Seq#=22, Ack#=1001, Data = ‘Bye’, size = 3 bytes. This packet is dropped.
3. Sender: Seq#=25, Ack#=1001, Data = ‘Wazup’, size = 5 bytes. Receiver stores it at seq# 25–29 and replies Seq#=1001, Ack#=22, data size = 0 (first dup ACK).
4. Sender: Seq#=30, Ack#=1001, Data = ‘Give’, size = 4 bytes. Receiver: Seq#=1001, Ack#=22, data size = 0 (second dup ACK).
5. Sender: Seq#=34, Ack#=1001, Data = ‘Me’, size = 2 bytes. Receiver: Seq#=1001, Ack#=22, data size = 0 (third dup ACK).
6. Sender retransmits Seq#=22, Ack#=1001, Data = ‘Bye’, size = 3 bytes. The buffer now holds ‘SteveHiByeWazupGiveMe’ and the receiver replies Seq#=1001, Ack#=36, data size = 0.]
Why triple dup ACK?
• Why not one DUP ACK?
1. Bennett and Partridge, Packet Reordering Is Not Pathological Network Behavior, 1999. This paper showed that packet reordering can/does occur. Further research into this could be a project.
   1. The reason for the packet reordering is that the routers have parallel paths through them. So, depending on the order of arrival and the packet sizes, the incoming order will be different from the outgoing order.
   2. Supposedly this was only a problem with older model Juniper routers. There are many of these routers out there. Cisco field day!
   3. Reordering only happens when the packets arrive at nearly the same time. This might not happen that much in TCP (see ACK clocking later).
   4. However, this is an active research area.
   5. Load balancing can cause packets to take different paths. This can cause reordering. Load balancing is a good project topic.
   6. Route flap can also cause reordering.
2. Why not a larger DUPThres (larger than 3)?
   1. This causes other problems.
   2. Limited transmit can help. See my papers on TCP-PR for details.
• Using triple DUP ACKs instead of RTO is called fast retransmit because the drop is detected faster.
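A sketch of the sender-side duplicate ACK counter. The function name `fast_retransmit_points` is hypothetical; the default threshold of 3 is the DUPThres discussed above, and the sample ack# values come from the ‘Bye’/‘Wazup’ example.

```python
def fast_retransmit_points(acks, dup_thresh=3):
    """Return the ack# values at which a fast retransmit would be triggered."""
    last_ack, dups, triggers = None, 0, []
    for ack in acks:
        if ack == last_ack:
            dups += 1
            # Fire exactly once, when the count reaches the threshold.
            if dups == dup_thresh:
                triggers.append(ack)
        else:
            last_ack, dups = ack, 0
    return triggers

# One original ACK for 22, three duplicates, then the ACK after the retransmission.
lost_segment = fast_retransmit_points([22, 22, 22, 22, 36])

# Mild reordering: only two duplicates arrive before the ACK advances.
reordered = fast_retransmit_points([22, 22, 22, 25])
```

With a threshold of 1 the reordered trace would also have triggered a retransmission, which is the argument against reacting to a single DUP ACK.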
Flow control – so the receiver doesn’t get overwhelmed.
• The amount of unacknowledged data must not exceed the receiver window.
• As the receiver's buffer fills, the receiver decreases the receiver window.
[Figure: the SYN had seq#=14 and the receiver buffer already holds ‘Steve’ at seq# 15–19.
1. Sender: Seq#=20, Ack#=1001, Data = ‘Hi’, size = 2 bytes. Receiver: Seq#=1001, Ack#=22, data size = 0, Rwin=2.
2. Sender: Seq#=22, Ack#=1001, Data = ‘By’, size = 2 bytes. Receiver: Seq#=1001, Ack#=24, data size = 0, Rwin=0. The buffer (‘SteveHiBy’) is full.
3. Application reads buffer. Receiver: Seq#=1001, Ack#=24, data size = 0, Rwin=9. Sender: Seq#=24, Ack#=1001, Data = ‘e’, size = 1 byte.]
Flow control – window probe
[Figure: same start, but the sender does not learn that the window has opened.
1. After Rwin=0, the application reads the buffer and the receiver sends Seq#=1001, Ack#=24, data size = 0, Rwin=9.
2. After 3s the sender sends a window probe: Seq#=24, Ack#=1001, data size = 0.
3. Receiver: Seq#=1001, Ack#=24, data size = 0, Rwin=9.
4. Sender: Seq#=24, Ack#=1001, Data = ‘e’, size = 1 byte.]
Flow control – repeated probes
[Figure: same start, but the application does not read the buffer.
1. After 3s the sender probes: Seq#=24, Ack#=1001, data size = 0. Receiver: Seq#=1001, Ack#=24, data size = 0, Rwin=0.
2. After 6s the sender probes again: Seq#=24, Ack#=1001, data size = 0.]
Max time between probes is 60 or 64 seconds.
Receiver window
• The receiver window field is 16 bits.
• Default receiver window
– By default, the receiver window is in units of bytes.
– Hence 64KB is the max receiver window for any (default) implementation.
– Ethernet frames carry 1500 bytes (TCP data = 1460).
– So that would give 44 packets.
– If the bit-rate was 10Mbps, what is the RTT so that this window size is equal to the bandwidth delay product?
• Receiver window scale
– During SYN, one option is receiver window scale.
– This option provides the amount to shift the receiver window.
– E.g., if rec win scale = 4 and rec win = 10, then the real receiver window is 10<<4 = 160 bytes.
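The arithmetic on this slide can be checked directly. The helper names below are made up for this sketch; the window is taken as 65535 bytes (the largest 16-bit value) and 10Mbps as 10,000,000 bits per second.

```python
MAX_WINDOW_BYTES = 2**16 - 1   # largest value a 16-bit window field can hold
MSS_BYTES = 1460               # TCP data in a 1500-byte Ethernet payload

def full_segments(window_bytes: int, mss: int = MSS_BYTES) -> int:
    """How many full-size segments fit in the window."""
    return window_bytes // mss

def rtt_matching_window(window_bytes: int, bit_rate: float) -> float:
    """RTT (seconds) at which the window equals the bandwidth delay product."""
    return window_bytes * 8 / bit_rate

def scaled_window(window_field: int, scale: int) -> int:
    """Apply the window scale option as a left shift."""
    return window_field << scale

packets = full_segments(MAX_WINDOW_BYTES)
rtt_seconds = rtt_matching_window(MAX_WINDOW_BYTES, 10_000_000)
real_window = scaled_window(10, 4)
```

So without window scaling, a 10Mbps path with an RTT above roughly 52 ms cannot be kept full by a single connection.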
Congestion Control
• Make sure not to overwhelm the network
• How much data to put into the network?
• The sender maintains a congestion window (cwnd) that is the maximum number of unacknowledged packets.
• InFlight is the number of unacked packets.
• If InFlight < cwnd, then a packet can be sent.
• When an ACK arrives, InFlight decreases so another packet can be sent.
[Figure: suppose that cwnd = 4*MSS and MSS=1000. MSS is the maximum segment size = min of the segment sizes of sender and receiver. It is negotiated during SYN.
1. Sender: Seq#=20, Ack#=1001, size = 1 MSS. InFlight=1MSS
2. Sender: Seq#=1020, Ack#=1001, size = 1 MSS. InFlight=2MSS
3. Sender: Seq#=2020, Ack#=1001, size = 1 MSS. InFlight=3MSS
4. Sender: Seq#=3020, Ack#=1001, size = 1 MSS. InFlight=4MSS, so the sender must wait.
5. Receiver: Seq#=1001, Ack#=1020, data size = 0. InFlight=3MSS
6. Sender: Seq#=4020, Ack#=1001, size = 1 MSS. InFlight=4MSS
7. Receiver: Seq#=1001, Ack#=2020, data size = 0. InFlight=3MSS
8. Sender: Seq#=5020, Ack#=1001, size = 1 MSS. InFlight=4MSS]
ACK clocking
What is the maximum rate that ACKs can arrive at the sender?
[Figure: sender, 100Mbps link, 10Mbps bottleneck link, 100Mbps link, receiver.]
• Packets can leave the sender at 100Mbps.
• Packets leave the bottleneck link at a rate of 10Mbps.
• What rate do packets leave the last 100Mbps link? Ans: 10Mbps, they arrive at 10Mbps.
• What about the ACKs? What rate do ACKs leave the receiver? Ans: 40/1040 * 10Mbps (a 40-byte ACK for each 1040-byte packet). Or at a rate so that if a packet is sent for each ACK, then the rate that the packets are sent is 10Mbps.
• What about the packets the sender transmits in response? 10Mbps. Perfect!!!
Congestion control
• ACK clocking makes the sender not send any faster than the bottleneck link speed.
• But how to “fill the pipe?”
[Figure: timeline in which the sender alternates between sending at the “burst” rate of 10Mbps and not sending packets (wasted bandwidth); one burst per RTT.]
• We only send cwnd packets in a burst. How big should cwnd be?
• The number of packets sent in one RTT is the cwnd. In order to not waste bandwidth, how many packets should be sent?
• Cwnd (bytes) = Link byte-rate (byte/s) * RTT (s), where the link is the bottleneck link.
• Bandwidth delay product = Link byte-rate (byte/s) * RTT (s)
Congestion control
• Ideally cwnd = bandwidth delay product.
• This ignores fairness. If there are N flows that also use the same link, then ideally cwnd = bandwidth delay product/N.
• But how to find this value???
TCP congestion control
• Theme: probe the system.
– Slowly increase cwnd until there is a packet drop. That must imply that the cwnd size (or sum of window sizes) is larger than the BWDP.
– Once a packet is dropped, then decrease the cwnd. And then continue to slowly increase.
• Two phases:
– Slow start (to get to the ballpark of the correct cwnd)
– Congestion avoidance, to oscillate around the correct cwnd size.
[Figure: state diagram. Connection establishment leads to slow-start. Slow-start moves to congestion avoidance when cwnd > ssthresh or on a triple dup ack. A timeout returns to slow-start. The connection ends with connection termination.]
Slow start
• Used when the connection first starts (and after a timeout for today’s TCPs).
• Cwnd starts at 1 or 2 MSS.
• For each non-dup ACK received, the window size increases by one.
• This increase continues until the window reaches the value of ssthresh.
• The initial value of ssthresh is often large (taken as infinite). So the Rwin limits the growth of the window.
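A sketch of the window growth described above, counted in MSS per RTT. The function name and parameter defaults are assumptions, and the model is idealized: no losses, no delayed ACKs, and every segment sent in one RTT is ACKed before the next RTT starts.

```python
def slow_start_windows(rtts, initial_cwnd=1, ssthresh=float("inf"), rwin=32):
    """Number of segments sent in each RTT during slow start (idealized)."""
    cwnd = initial_cwnd
    sent_per_rtt = []
    for _ in range(rtts):
        window = min(cwnd, rwin)       # the receiver window also caps sending
        sent_per_rtt.append(window)
        # One non-dup ACK per segment sent, each adding one MSS, up to ssthresh.
        cwnd = min(cwnd + window, ssthresh)
    return sent_per_rtt

unlimited = slow_start_windows(7)                 # ssthresh taken as infinite
with_threshold = slow_start_windows(5, ssthresh=8)
```

Because each ACK adds one segment and a full window of ACKs arrives per RTT, the window doubles every RTT until either ssthresh or Rwin stops it.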
Slow start
[Figure: slow-start timeline. After the handshake (SYN: Seq#=20, Ack#=X; SYN/ACK: Seq#=1000, Ack#=21; ACK: Seq#=21, Ack#=1001), the sender transmits 1000-byte segments starting at Seq#=21. Each returning ACK increases cwnd by one (1, 2, 3, …, 8), so cwnd doubles every RTT!! Eventually the pipe is full and the spacing between ACKs no longer shrinks.]
What is happening here? Now the queue is filling. Either it will fill and drop a packet or the RecWin will stop cwnd from increasing.
• If RecWin != inf and RecWin < bandwidth delay product + queue size, and there are no other packets, then there will never be a drop. Lots of conditions, but a large number of flows do not experience drops.
• If RecWin = ssthresh = inf and the outgoing link of the sender is not the bottleneck, then eventually there will be a drop. If the drop is detected with triple dup ACK, then cwnd = cwnd/2 and congestion avoidance is entered.
• If the drop(s) is (are) detected with timeout, then ssthresh = cwnd/2, cwnd = 1, and slow start is continued.
• If ssthresh < bandwidth delay product + queue size and RecWin > ssthresh, then congestion avoidance is entered.
Congestion Avoidance
Basics: additive increase multiplicative decrease (AIMD)!!
Rough view
For every cwnd’s worth of packets, cwnd is incremented by one.
When there is a drop, cwnd=cwnd/2.
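The rough view of AIMD can be sketched per RTT. The function name and the idea of passing the set of RTTs in which a drop is detected are assumptions for illustration; cwnd is counted in whole segments and halving rounds down.

```python
def aimd(cwnd, rtts, drop_rtts):
    """Return cwnd at the start of each RTT under additive increase, multiplicative decrease."""
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in drop_rtts:
            cwnd = max(cwnd // 2, 1)   # multiplicative decrease on a drop
        else:
            cwnd += 1                  # one more segment per cwnd's worth of ACKs
    return history

# Start at cwnd = 4 and detect a drop during the third RTT.
trace = aimd(cwnd=4, rtts=6, drop_rtts={2})
```

The resulting sawtooth (4, 5, 6, then back to 3) is the oscillation around the correct window that congestion avoidance is meant to produce.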
[Figure: congestion avoidance timeline with columns cwnd and Seq# (MSS). cwnd grows 4, 5, 6 as segments 1–21 are sent and ACKed; after a drop (repeated ACKs for 15) cwnd falls from 6 to 3 and then grows to 4 while segments 22–24 are sent.]
Rough view of TCP congestion control
[Figure: cwnd versus time. Slow start runs until cwnd = ssthresh, then congestion avoidance takes over. A single drop halves cwnd and congestion avoidance continues; multiple drops send the connection back to slow start.]
TCP - more detailed view
• Delayed ACKs
– The worry was that the network was going to be all jammed up
with ACKs.
– So instead of sending an ACK for every packet, delay the ACK and
maybe ACK two packets at once
• Generate an ACK for at least every other packet.
• Don’t delay an ACK by more than 500ms. (exact number depends on
implementation.)
• If packets are out of order, generate an ACK for every packet.
• Also, immediately send an ACK when a “gap” in the buffer is filled.
– Delayed ACKs can greatly slow down a connection.
• Eg., the first packet is delayed by 500ms
• Depending on the implementation, cwnd will grow more slowly.
Details - Fast recovery
• cwnd after a drop
• Recall, TCP only sends packets when InFlight < cwnd.
• InFlight only decreases when a new ACK is received, i.e., a DUP ACK does not cause InFlight to change.
– If a DUP ACK arrives, then it means that a packet arrived at the receiver and an ACK was sent. So the number of packets in the network has decreased. So InFlight should decrease.
– But maybe the network has duplicated the ACK. To be conservative, leave InFlight as is (I guess).
Fast recovery
• Upon the first two DUP ACKs, do nothing. Don’t send any packets (InFlight is the same).
• Upon the third DUP ACK,
– set ssthresh = cwnd/2.
– cwnd = cwnd/2+3
– Retransmit the requested packet.
• Upon every further DUP ACK, cwnd = cwnd+1.
• If InFlight < cwnd, send a packet and increment InFlight.
• When a new ACK arrives, set cwnd = ssthresh (RENO).
• When an ACK arrives that ACKs all packets that were outstanding when the first drop was detected, cwnd = ssthresh (NEWRENO).
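The Reno rules above can be sketched as a small window tracker. The class name `RenoWindow` and the returned action strings are hypothetical; cwnd is in segments, and InFlight bookkeeping and the actual retransmission are left out.

```python
class RenoWindow:
    """cwnd/ssthresh bookkeeping for Reno fast recovery (sketch)."""

    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.ssthresh = float("inf")
        self.dup_acks = 0
        self.in_recovery = False

    def on_dup_ack(self):
        self.dup_acks += 1
        if self.dup_acks < 3:
            return "wait"                      # first two DUP ACKs: do nothing
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.cwnd / 2 + 3
            self.in_recovery = True
            return "retransmit"
        self.cwnd += 1                         # every further DUP ACK inflates cwnd
        return "inflate"

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh          # Reno: deflate on the first new ACK
            self.in_recovery = False
        self.dup_acks = 0

win = RenoWindow(cwnd=6)
actions = [win.on_dup_ack() for _ in range(5)]
cwnd_during_recovery = win.cwnd
win.on_new_ack()
cwnd_after_recovery = win.cwnd
```

Starting from cwnd = 6, the third DUP ACK gives 6/2+3 = 6, two more DUP ACKs inflate it to 8, and the new ACK leaves it at 3.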
Fast recovery
[Figure: single drop, with columns cwnd, Seq# (MSS), and Inflight. With Inflight = cwnd = 6 a segment is dropped. On the third DUP ACK cwnd = 6/2+3 = 6; further DUP ACKs raise cwnd to 7 and 8 so that segments 22–24 can be sent; the new ACK sets cwnd = 3.]
Fast recovery – multiple drops - RENO
[Figure: multiple drops under Reno. The first drop takes cwnd from 6 to 6/2+3 = 6 and then to 3 when the new ACK arrives. The second drop is then detected separately: cwnd becomes 5 = 2+3 and finally 2.]
Why is this bad?
The first drop told us that we were sending too fast.
The second drop tells us the same thing (already).
So why react to the same news twice….NewReno
Fast Recovery – multiple drops - NewReno
• The problem was that one of the packets that was outstanding when the drop was detected was also dropped.
• Solution (NewReno)
– When a drop is detected,
• ssthresh = cwnd/2
• cwnd = cwnd/2+3
• Recover = seq# of largest byte sent.
• Retransmit the dropped packet
– Upon a DUP ACK, increment cwnd and send if Inflight < cwnd
– If the ACK is larger than the previous ACK, but smaller than Recover (partial ACK)
• Suppose that previous ack#=X and now ack#=Y<Recover
• Retransmit the dropped packet
• cwnd = cwnd – (Y-X)+1
• Of course, Inflight = Inflight-(Y-X)
• So transmit another packet (that makes two transmissions)
– If ACK>Recover,
• cwnd = ssthresh
• Exit fast recovery
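The partial-ACK rule can be sketched as one function. The function name and return convention are assumptions; cwnd, ack numbers, and Recover are in segments, and an ACK exactly equal to Recover is treated as partial because the slide only lists the cases below and above it.

```python
def newreno_on_ack(cwnd, ssthresh, prev_ack, ack, recover):
    """Return (new_cwnd, action) for a new ACK received during fast recovery."""
    if ack > recover:
        # Everything outstanding when the drop was detected is now ACKed.
        return ssthresh, "exit_fast_recovery"
    # Partial ACK: deflate by the amount newly ACKed, then add one back
    # and retransmit the next missing segment.
    return cwnd - (ack - prev_ack) + 1, "retransmit"

# Numbers from the multiple-drop figure: cwnd 19, ack# moves from 17 to 21.
partial = newreno_on_ack(cwnd=19, ssthresh=7, prev_ack=17, ack=21, recover=29)

# A later ACK beyond Recover ends fast recovery.
full = newreno_on_ack(cwnd=16, ssthresh=7, prev_ack=21, ack=30, recover=29)
```

Unlike Reno, the partial ACK does not end recovery, so the second drop does not halve the window again.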
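Fast Recovery – single drops - NewReno
[Figure: single drop under NewReno with columns Inflight and cwnd. Inflight = cwnd = 14 when the drop is detected and Recover = 29. cwnd is set to 14/2+3 = 10 and grows by one per DUP ACK up to 17 while new segments are sent; when the ACK passes Recover, cwnd = 7.]
Note how the actual number outstanding is always = 7
Fast Recovery – multiple drops - NewReno
[Figure: multiple drops under NewReno, again with Inflight = cwnd = 14 and Recover = 29. A partial ACK moves ack# from 17 to 21: Inflight becomes 15 = 19-4 and cwnd becomes 16 = 19-(21-17)+1. Fast recovery exits once the ACK passes Recover, with cwnd = 7.]
NewReno sends two packets for every ACK indicating a multiple drop.
2 drops take 2 RTT to recover.
N drops take N RTT to recover.
If N*RTT>RTO, then
– slow-and-steady (retransmit timer reset during recovery) => no TO
– impatient (timer not reset) => TO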
Other things
• Idle restart
– If no packet has been sent in RTO seconds
• ssthresh = cwnd
• cwnd = 1
• Slow-start
– Avoids big bursts after idle times
• E.g., get data from disk
• http 1.1
• Timeout – exponential back off
– If no ACK arrives before the RTO timer expires, then time-out
• ssthresh = cwnd/2; cwnd = 2; slow-start
• RTO = min(2*RTO, 64s)
– If the next packet is dropped, then the wait is longer
– Gives up after 9-12 tries. But implementation dependent (ns never stops)
• If a retransmitted packet is dropped, TCP times out.
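The exponential back-off rule RTO = min(2*RTO, 64s) produces the following schedule. The function name is hypothetical, and the 3-second starting value is just the figure used earlier for SYN retransmission.

```python
def rto_backoff(rto, tries, max_rto=64.0):
    """Timer value used for each successive retransmission attempt."""
    schedule = []
    for _ in range(tries):
        schedule.append(rto)
        rto = min(2 * rto, max_rto)   # double after every time-out, capped
    return schedule

waits = rto_backoff(3.0, 7)
total_wait = sum(waits)
```

After five consecutive time-outs the timer is already pinned at the 64-second cap.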
Dup ACKs after timeout
[Figure: NewReno recovery with multiple drops (Inflight = cwnd = 14, Recover = 29; partial ACK giving 15 = 19-4 and 16 = 19-(21-17)+1) that eventually times out. After the timeout, retransmitted segments that the receiver already holds generate DUP ACKs.]
• At a timeout, set send_high to the maximum seq# sent.
• If DUP ACKs are received for segments less than send_high, assume they do not indicate a drop. In case there was a drop, then there will be a time out.
Selective Acknowledgment – SACK
The latest widespread congestion control
• Problem: when multiple packets are dropped, the cumulative ACK does not give information as to which packets were dropped. As a result, fast recovery is not so fast; it takes one RTT per lost packet.
• Solution: embed into the ACK some information about which packets have successfully arrived.
• TCP-SACK allows ACKs to contain information about received packets.
• If the packets are received in order, then the ACK looks the same as TCP-RENO or TCP-NEWRENO. But if the packets arrive out of order, then the ACK contains SACK blocks.
• A SACK block indicates a sequence of segments that have been received.
[Figure: sequence-number line from 15 to 35. Segments up to the highest ACK are ACKed (A), two later runs of segments are SACKed (S), each delimited by a left edge and a right edge, and the remaining segments are not sent (N).]
TCP-SACK
• SACK blocks are 8 bytes long (4 bytes for each edge).
• The SACK option includes 1 byte to specify that it is a SACK option (kind=5) and one byte for the option length.
– 1 SACK block = 10 bytes + 2 bytes padding -> 52 bytes header
– 2 SACK blocks = 18 bytes + 2 bytes padding -> 60 bytes header
– 3 SACK blocks = 26 bytes + 2 bytes padding -> 68 bytes header
– 4 SACK blocks = 34 bytes + 2 bytes padding -> 76 bytes header
• Max ACK is 80 bytes.
• If the time stamp option is used, then the max number of SACK blocks is 3.
• Example SACK option: kind=5, length=18 (2 bytes + 8 bytes per block); left edge of 1st block = 20, right edge of 1st block = 23; left edge of 2nd block = 26, right edge of 2nd block = 30.
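The header-size arithmetic for SACK can be reproduced as follows. The function names are hypothetical; the sketch assumes a 20-byte IP header, a 20-byte fixed TCP header, at most 40 bytes of TCP options, options padded to a multiple of 4 bytes, and a 10-byte timestamp option.

```python
IP_HEADER = 20
TCP_HEADER = 20
MAX_TCP_OPTIONS = 40   # 60-byte maximum TCP header minus the fixed 20 bytes

def pad_to_word(n: int) -> int:
    """Round up to a multiple of 4 bytes (32-bit words)."""
    return (n + 3) // 4 * 4

def sack_option_bytes(blocks: int) -> int:
    """Kind byte + length byte + 8 bytes per SACK block."""
    return 2 + 8 * blocks

def ack_header_bytes(blocks: int) -> int:
    """IP + TCP header size of an ACK carrying the given number of SACK blocks."""
    return IP_HEADER + TCP_HEADER + pad_to_word(sack_option_bytes(blocks))

def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Largest number of SACK blocks that fit next to the other options."""
    return (MAX_TCP_OPTIONS - pad_to_word(other_option_bytes) - 2) // 8

sizes = [ack_header_bytes(n) for n in (1, 2, 3, 4)]
blocks_alone = max_sack_blocks()
blocks_with_timestamps = max_sack_blocks(other_option_bytes=10)
```

The 10-byte timestamp option (12 with padding) leaves 28 option bytes, which is why only 3 blocks fit when time stamps are in use.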
Generation of SACKs
1. No SACK blocks if no out of order packets.
2. No delayed ACK if out of order packets (send an ACK for every received packet).
3. When an out of order packet arrives, the first SACK block contains the segment that just arrived.
4. The ACK should contain as many SACK blocks as fit and are required (no skimping to save bit-rate).
5. The SACK blocks included should be those that have most recently been reported (see 3). So if there are at most 3 SACK blocks, then each continuous block of segments will be reported at least 3 times.
6. If the packet that arrived has already been received (a duplicate reception), then the first SACK block should identify this packet. (This is the DSACK extension to SACK.) In this case, the next SACK block should indicate the continuous sequence of segments that contains the segment received in duplicate.
[Figure: the same sequence-number line (15 to 35) with ACKed, SACKed, and not-sent segments. Now suppose that segment 21 arrives for a second time. SACK option: kind=5; left edge of DUP packet = 21, right edge of DUP packet = 22; left edge of 1st block = 20, right edge of 1st block = 23; left edge of 2nd block = 26, right edge of 2nd block = 30.]
DSACK
• DSACK is to identify packets that have been needlessly retransmitted.
• The primary source of such retransmissions is packet reordering.
• If such a retransmission occurs, it likely means that cwnd was divided
by 2 needlessly.
• DSACK helps identify these needless divides by two.
• It is not clear what can be done once they are identified.
• Many ideas have been suggested, but it remains to be seen if they
actually improve things
– Ethan Blanton, Mark Allman, On Making TCP More Robust to Packet
Reordering (2002): show that some improvement is possible
– Bohacek et al shows that if there is persistent reordering, more drastic
measures are required.
– Neither paper includes analysis of the current situation in the Internet.
• The current situation is not completely known.
• The homework provides backbone traces with rampant reordering.
• In my opinion (on 2/20/04) some sort of timer-based approach is necessary.
The DUPACK threshold approach is not appropriate because a burst of packets
(as can be seen in the homework) can be very reordered. But reordering by
more than a few milliseconds is very rare.
• A project could examine this.
Eifel Detection
• DSACK is only useful after the arrival of the second copy
of the packet.
• Eifel uses time-stamps to inform the sender that a packet
that was thought to have been lost has actually arrived.
TCP-SACK (Sender side)
• Slow start and the linear increase part of SACK are the same as TCP-RENO/NEWRENO. The fast recovery part is different.
• SACK provides more information about which packets have been lost. The sender can use this to determine
– which packets to send
– when to send packets
• When to assume that a packet is lost
1. If DupThresh continuous SACK blocks have been SACKed that have larger sequence number. The idea is that DupThresh packets have been SACKed with larger sequence number, but continuous SACK blocks are used instead.
2. If DupThresh*MSS bytes have been SACKed that have larger sequence number.
[Figure: example with MSS=5 bytes and DupThresh=3 on a line of packets (packet num 3 at seq num 15:19, 8 at 40:44, 13 at 65:69, little packets 14–17 at 70:71, 72:73, 74:75, 76:77, then 78:82 and 83:87), marked ACKed, SACKed, or Not Sent.
– First unSACKed packet: assumed dropped because of reasons 1 and 2. Number of continuous SACK blocks with higher seq num = 4 ≥ DupThresh. Number of SACKed bytes with larger seq num = 25 ≥ MSS*DupThresh.
– Second unSACKed packet: assumed dropped because of reason 1 only. Number of continuous SACK blocks with higher seq num = 3 ≥ DupThresh. Number of SACKed bytes with larger seq num = 9 < MSS*DupThresh.
– Third unSACKed packet: not assumed dropped.]
Number in “pipe” or InFlight
• If a packet has been sent, not lost, and not SACKed, then this packet is assumed to be in the pipe.
• Any packet that has been retransmitted and not SACKed is also in the pipe.
– Retransmissions happen in order (smallest seq num first, why?)
– Let HighRX denote the highest segment that has been retransmitted.
– Any lost packet that has not been SACKed and has seq num less than HighRX has been retransmitted, so it is in the pipe.
Which packet to send next? (during fast recovery)
• The next to transmit is the segment with the smallest seq num that satisfies all of
1. The segment has seq num greater than HighRX.
2. The segment has seq num less than the largest segment in a SACK block.
3. The segment is assumed to be lost.
• If the above is an empty set, then the next to be sent is the smallest segment that has not yet been sent.
• If the above is also empty (because there are no more packets to be sent), …
[Figure: three sequence-number lines (15 to 35) marking ACKed, SACKed, and not-sent segments, HighRX, the segments already retransmitted, and the segment that is next to be sent in each of the three cases, including one where the end of file has been reached.]
TCP-SACK congestion control
• When a loss is detected:
– set RecoveryPoint=Seq num of highest segment sent. Fast recovery
ends when this seq num is ACKed (SACKed is not good enough).
– ssthresh = cwnd=Inflight
– Retransmit lost packet with smallest seq num.
– Set HighRX equal to the retransmitted packet
• During recovery (until RecoveryPoint is ACKed)
– If pipe<cwnd, then send next to be sent.
TCP-SACK notes
• After RTO, the TCP-SACK sender starts fresh and erases
SACK info from prior to the RTO (some of it might be
regained in retransmissions of SACK blocks).
• Like NEWRENO, the highest seq sent before an RTO is
recorded and a dupack from a packet with seq num less
than this highest seq does not cause fast
recovery/retransmit.
• Like NEWRENO, the retransmit timer can be reset during
recovery (slow and steady) or not (impatient).
TCP-SACK timeout
[Figure: timeline with columns Inflight, cwnd (newReno), and pkt sent after multiple drops. Eventually no more packets are sent and the connection times out.]
• SACK, NewReno, etc. will time-out if a retransmission is lost.
• If SACK uses the same technique to increase cwnd as NewReno (i.e., cwnd=inflight/2+3…) and if more than cwnd/2 packets are lost, SACK will time-out.
• The ns implementation has this problem.
TCP-SACK burst
[Figure: timeline with columns pipe, cwnd, and pkt sent for SACK. After multiple drops the sender loses ACK clocking and sends a burst of packets before recovery ends.]
• SACK, NewReno, etc. will time-out if a retransmission is lost.
• Multiple drops lead to a burst of packets being sent.
Limited Transmit
• When a packet is dropped and the window size is less than 4, TCP will always timeout (not enough ACKs arrive to get triple DUP).
• If, upon receiving a DUP ACK, a packet is transmitted, then there might be enough DUP ACKs to cause fast retransmit and avoid time-out.
• Limited transmit allows a packet to be sent when the second DUP ACK is received. (In general, for every other DUP ACK.)
• Even if a packet is lost, sending a packet for every other ACK is sending at half the bit-rate.
• While this helps TCP avoid time-outs, it also makes this version of TCP far more aggressive for loss probability greater than about 1% (where time-outs become quite prevalent for non-limited transmit TCP).
[Figure: two timelines with cwnd = 3 MSS and columns Seq# and cwnd (MSS); segment 2 is dropped. Without limited transmit too few DUP ACKs arrive and the sender times out. With limited transmit, segments 4 and 5 are sent on DUP ACKs, a triple dup ack is reached, and there is no time out.]
Limited Transmit
[Figure: two further timelines, with cwnd = 4 and cwnd = 5 MSS. In both, segment 2 is dropped and a triple dup ack is reached.]
ECN
• Sometimes the router will have a large enough queue to accept the packet, but the queue occupancy is beyond a threshold, so in order to try to get the TCP flows to send at a slower rate, the router would drop packets (even though there is room in the queue).
• It’s funny to drop packets when there is room in the queue, so another option is to mark the packets. The receiver should indicate in the ACK that the packet being ACKed has been marked, and the sender should react to this marking as it would to a drop, except that there is no reason to retransmit the marked packet.
• This approach has little impact in general, except, like limited transmit, when the loss probability is very high, it can reduce timeouts.
