Transcript 0 - net

How to use it
• Press Space to advance through the slide animation
• Don’t hurry to press Space again. Wait for the end of the
animation
• If you want to go back, use the «PgUp» key.
Version 08 June 1999
Come later - presentation is under construction now
Encapsulation of data into an Ethernet frame
User data
Application
header
TCP
header
User data
Application data
TCP segment
IP
header
TCP
header
Application data
IP datagram
Ethernet
header
IP
header
TCP
header
Application data
46 to 1500 bytes
Ethernet frame
Ethernet
trailer
IEEE 802.2/802.3 Encapsulation (RFC 1042)
802.3 MAC
802.2 LLC
802.2 SNAP
Destination
address
Source
address
length
DSAP
0xAA
SSAP
0xAA
cntl
03
Org code
00
type
6
6
2
1
1
1
3
2
LENGTH contains the length of the packet from the next byte up to the CRC (the CRC
isn’t included)
DATA
38-1492
Type
0800
IP Datagram
2
DSAP (Destination Service Access Point) and SSAP (Source
Service Access Point) both are set to 0xAA.
CRC
38-1492
or
CNTL (Control field) is set to 3.
ORG CODE is always 0 in all 3 bytes.
The TYPE field identifies the data that follows. For example, type 0x0800
(hex) indicates that an IP datagram follows.
Type
0806
2
ARP request/reply
28
PAD
18
or
Type
8035
2
RARP request/reply
28
PAD
18
4
Ethernet Encapsulation (RFC 894)
46-1500 bytes
Destination
address
Source
address
type
6
6
2
DATA
46-1500
Type
0800
IP Datagram
2
46-1500
or
Type
0806
2
ARP request/reply
28
PAD
18
or
Type
8035
2
RARP request/reply
28
PAD
18
CRC
4
IP packet structure
Bit 0 ... 15 | 16 ... 31
4-bit ver | 4-bit IHL | TOS | 16-bit total packet length
16-bit identification | 3-bit flags | 13-bit Fr offset
TTL | Protocol | Header checksum
Source address
Destination address
Options (+padding)
Version. The current protocol version is 4.
IHL - IP header length. IHL is the number of
32-bit words in the IP header. This field is 4 bits long, so the maximum header length is 15 words = 60
bytes.
TOS - type of service consists of 3
precedence bits (ignored), 4 TOS bits, and
one unused bit which must be 0. The 4 TOS bits are:
minimize delay
maximize throughput
maximize reliability
minimize monetary cost
Only 1 of these 4 bits can be turned on.
TPL - total packet length is the total length of the IP
packet in bytes. Since the field is 16 bits wide, the maximum
length of an IP packet is 65535 bytes.
DATA
Continue...
IDENTIFICATION - this field is used when
IP needs to fragment datagrams. Identification
identifies each datagram and is
incremented each time a datagram is sent.
We’ll see the meaning of this field when we
talk about fragmentation.
FLAGS and FRAGMENT OFFSET we’ll
also see when we talk about fragmentation.
IP packet structure
TTL - time-to-live sets an upper limit on the number of
routers through which a datagram can pass.
This field is decremented each time the
datagram passes through a router. When this field
reaches 0, the datagram is dropped by the router
and an ICMP message is sent to the datagram’s
sender.
PROTOCOL - this field identifies the DATA
portion of the datagram (which protocol is
encapsulated in the IP datagram).
HEADER CHECKSUM is calculated over the
IP header only.
SOURCE and DESTINATION addresses
are the sender’s and receiver’s IP addresses.
OPTIONS is a variable-length field which
contains some options. We’ll discuss some
of them later. The options field always ends
on a 32-bit boundary. PAD bytes (value
0) are added if necessary.
DATA is the payload carried by the datagram.
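The field layout described above can be unpacked with a few lines of code. This is a minimal sketch, assuming a raw IPv4 header in network byte order; the names `internet_checksum` and `parse_ipv4_header` and the sample addresses are illustrative and do not come from the slides.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words (used for the IP header)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def parse_ipv4_header(packet: bytes) -> dict:
    """Split the fixed 20-byte part of an IPv4 header into its fields."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, _cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = ver_ihl & 0x0F                     # header length in 32-bit words
    return {
        "version": ver_ihl >> 4,
        "header_bytes": ihl * 4,
        "tos": tos,
        "total_length": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,           # upper 3 bits
        "fragment_offset": flags_frag & 0x1FFF,  # lower 13 bits
        "ttl": ttl,
        "protocol": proto,
        # A header with a valid checksum sums to zero.
        "checksum_ok": internet_checksum(packet[:ihl * 4]) == 0,
        "source": ".".join(map(str, src)),
        "destination": ".".join(map(str, dst)),
    }

# Hypothetical sample header: version 4, IHL 5, TTL 64, protocol 6 (TCP).
header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 1, 0, 64, 6, 0,
                     bytes([1, 1, 1, 1]), bytes([1, 1, 1, 2]))
header = header[:10] + struct.pack("!H", internet_checksum(header)) + header[12:]
fields = parse_ipv4_header(header)
```

The checksum is computed with the checksum field set to 0 and then written into bytes 10-11, which is why verification over the finished header yields 0.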
Special case IP addresses
(-1 means all bits set to 1)

netID   subnetID   hostID     Can appear as          Description
                              source   destination
0                  0          OK       never         this host on this net
0                  hostid     OK       never         specified host on this net
127                anything   OK       OK            loopback address
-1                 -1         never    OK            limited broadcast (never forwarded)
netid              -1         never    OK            net-directed broadcast to netid
netid   subnetid   -1         never    OK            subnet-directed broadcast to netid, subnetid
netid   -1         -1         never    OK            all-subnets-directed broadcast to netid
IP address classes
Class   Range
A       0.0.0.0 to 127.255.255.255
B       128.0.0.0 to 191.255.255.255
C       192.0.0.0 to 223.255.255.255
D       224.0.0.0 to 239.255.255.255 (Multicast)
E       240.0.0.0 to 247.255.255.255
•
ARP and RARP
ARP
For example, suppose we are working on an Ethernet network. The Ethernet driver and
adapter use MAC addresses. TCP/IP uses IP addresses. When a host
wants to send data to another host, it knows only the receiver’s IP address and
passes this information to the TCP/IP stack. The TCP/IP stack then needs a mechanism
to map between MAC and IP addresses. IP has two
protocols to solve this.
32-bit IP address
ARP
RARP
48-bit Ethernet address
•
RARP
If a system has no hard or floppy drive and must boot from the network, it
can’t take its IP address from local resources. Such a system has only a MAC address. RARP is a protocol which allows the system to obtain its IP address from
the network
ARP
Send IP datagram
Host
to IP address
ARP
IP
Resolve IP address to
hardware address
No
Do I know
hardware address?
Yes
Ethernet driver
ARP request
Host
Host
Ethernet driver
ARP
Is somebody looking
for my address?
No
Ignore request
Ethernet driver
Is somebody looking
for my address?
ARP
Yes
Send ARP reply
RARP
Diskless workstation
Boot
Read own
hardware
network address
I have a IP
address!!!
Send RARP request
Send RARP reply
Somebody wants
to have IP
address!
Give to somebody
IP address from
my table
RARP server
ARP packet
Dest
address
Source
address
type
Hard
type
Prot
type
Hard
size
Prot
size
op
Sender
Ethernet
address
Sender IP
address
Target
Ethernet
address
Target IP
address
6
6
2
2
2
1
1
2
6
4
6
4
type - 0x0806 for ARP
hardware type - specifies the type of hardware address. 1 for Ethernet
protocol type - 0x0800 for IP
hardware size - size of the hardware address. 6 for Ethernet
protocol size - size of the protocol address. 4 for IP
op - type of operation (request or reply). ARP request - 1, ARP reply - 2, RARP request - 3, RARP reply - 4.
Dest address - broadcast (for an ARP request)
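The field sizes listed above (6, 6, 2, then 2, 2, 1, 1, 2, 6, 4, 6, 4 bytes) map directly onto a packed structure. This is a minimal sketch of building an ARP request frame, assuming Ethernet and IPv4; the function name and the sample MAC address are hypothetical.

```python
import struct

def build_arp_request(sender_mac: bytes, sender_ip: bytes, target_ip: bytes) -> bytes:
    """Return a 42-byte Ethernet frame (without CRC or padding) carrying an ARP request."""
    ethernet = struct.pack(
        "!6s6sH",
        b"\xff" * 6,      # destination address: broadcast
        sender_mac,       # source address
        0x0806,           # frame type: ARP
    )
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                # hardware type: Ethernet
        0x0800,           # protocol type: IP
        6,                # hardware size
        4,                # protocol size
        1,                # op: 1 = ARP request
        sender_mac,
        sender_ip,
        b"\x00" * 6,      # target Ethernet address: unknown, this is what we ask for
        target_ip,
    )
    return ethernet + arp

# Hypothetical addresses, for illustration only.
frame = build_arp_request(bytes.fromhex("020000000001"),
                          bytes([1, 1, 1, 2]), bytes([1, 1, 1, 1]))
```

The 28-byte ARP body is shorter than the 46-byte Ethernet minimum, which is why the frame diagrams show 18 bytes of PAD after it.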
ICMP - Internet Control Message Protocol
RFC 792
packet structure
IP header
ICMP message
20
The same for all
type of messages
8-bit type
8-bit code
16-bit checksum
(for entire ICMP message)
Contents depend on type and code
ICMP address mask request and reply
Type 17-request
18 - reply
16-bit checksum
(for entire ICMP message)
Code - 0
identifier (anything)
sequence number (anything)
12 bytes
Subnet mask
ICMP timestamp request and reply
Type 13-request
14 - reply
Code - 0
identifier (anything)
16-bit checksum
(for entire ICMP message)
sequence number (anything)
32-bit originate timestamp
32-bit receive timestamp
32-bit transmit timestamp
20 bytes
ICMP port unreachable error
IP datagram
ICMP message
Data portion of ICMP message
Ethernet
header
ICMP
header
IP header
14
20
8
IP header of datagram
that generated error
20
UDP
header
8
Must include:
The IP header of the datagram that generated the error
At least the 8 bytes that followed this IP header. In this example it is the UDP header
General format of the ICMP unreachable message
type 3
code 0-15
16-bit checksum
(for entire ICMP message)
Unused (must be 0)
IP header including options + first 8 bytes of original IP datagram data
8 bytes
Client
I want to know
is server alive
ICMP echo request and echo
reply (PING)
I received
“ping” to my
address
Server is
alive
Server
Answer to
client
Send echo request
Send echo reply
Packets:
type 0 - reply
8 - request
code 0
identifier
16-bit checksum
(for entire ICMP message)
sequence number
Optional data
identifier - process ID of the sending process
sequence number - starts at 0 and is incremented every time a new echo request is sent
The server must echo the identifier and sequence number fields in its reply. Historically, ping
has operated in a mode where it sends an echo request once a second.
8 bytes
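The echo request and reply layout above (type, code, checksum over the entire ICMP message, identifier, sequence number, optional data) can be sketched as follows. This is an illustration under stated assumptions, not a ping implementation: the function names are hypothetical and nothing is sent on the network.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over the entire ICMP message."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Type 8, code 0; the checksum is computed with the checksum field set to 0."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload

def build_echo_reply(request: bytes) -> bytes:
    """Type 0, code 0; identifier, sequence number and data are echoed unchanged."""
    body = request[4:]                       # identifier + sequence + optional data
    checksum = internet_checksum(struct.pack("!BBH", 0, 0, 0) + body)
    return struct.pack("!BBH", 0, 0, checksum) + body

request = build_echo_request(0x1234, 0, b"ping")
reply = build_echo_reply(request)
```

A receiver can validate either message by running `internet_checksum` over all of it; a correct message yields 0.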
IP record option (-r option)
Send echo reply
Send echo request with -r option
Client
Router 1
Router 2
Server
Router 3
Packet IP option:
Routers put into RR packet IP addresses of their outgoing interfaces
code
len
ptr
IP addr R1
IP addr R3
IP addr R2
IP addr of
server
IP addr R2
IP addr R1
Incoming
interface
1
1
1
4
4
4
4
4
4
4
Ptr: =
20
16
12
28
24
8
4
Code
1-byte field specifying the type of IP option. For the RR option its value is 7
Len
total number of bytes of the RR option. Ping always provides a 39-byte option (3 bytes for code, len and ptr plus 36 bytes of addresses), to record
up to 9 IP addresses - the maximum
There is limited room in the IP header for the list of IP addresses, because the entire IP header is limited to 15 32-bit words (60 bytes). Since the fixed part takes 20 bytes, at most 40 bytes remain for the options field in the IP header
BROADCASTING
Four types of IP broadcast
Limited
  Address: 255.255.255.255
  Description: limited broadcast, never forwarded by a router.
Net-directed
  Address: netid.255.255.255
  Description: routers forward this kind of broadcast. It is addressed to all hosts on the IP network netid.
Subnet-directed (knowledge of mask is required)
  Address: host ID all 1 bits
  Description: broadcast for a specific subnet. For example, 172.19.128.255 is the broadcast for subnet 172.19.128.x with subnet mask 255.255.255.0.
All-subnets-directed (knowledge of mask is required)
  Address: subnet ID all 1, host ID all 1
  Description: if the network is subnetted, this is an all-subnets-directed broadcast. If the network isn’t subnetted, this is a net-directed broadcast.
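The subnet-directed case needs the mask because the host ID bits are only known once the mask is applied. A minimal sketch using Python's standard `ipaddress` module; the function name is hypothetical.

```python
import ipaddress

def subnet_directed_broadcast(address: str, mask: str) -> str:
    """Set every host ID bit to 1, given any address on the subnet and its mask."""
    # strict=False lets us pass a host address rather than the network address.
    network = ipaddress.IPv4Network(f"{address}/{mask}", strict=False)
    return str(network.broadcast_address)

example = subnet_directed_broadcast("172.19.128.7", "255.255.255.0")
```

With the slide's example, any host on subnet 172.19.128.x with mask 255.255.255.0 gives 172.19.128.255.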
MULTICASTING
!Note!
On Ethernet, a multicast address has the low-order bit of the first byte set: 01:00:00:00:00:00
Addressing
Do you remember?
Class D
224.0.0.0 to 239.255.255.255 Multicast
Here is format of a class D IP address
First four bit for class D:
1110 0000 = 224
1110 1111 = 239
1 1 1 0 0 0 0 0
28 bit multicast group ID
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
IP address
The set of hosts listening to a particular IP multicast address is called a host group. A host group
can span multiple networks. Membership in a host group is dynamic - hosts may join and leave
host groups at will. There is no restriction on the number of hosts in a host group, and a host does not
have to belong to a group to send a message to that group.
MULTICASTING
Converting Multicast Group addresses to Ethernet Addresses
The Ethernet addresses corresponding to IP multicasting are in the range
01:00:5e:00:00:00 through 01:00:5e:7f:ff:ff
We have 23 bits in the Ethernet address to correspond to the IP multicast group ID. The mapping
places the low-order 23 bits of the multicast group ID into these 23 bits of the Ethernet address.
These 5 bits in the multicast group ID are not used
to form the Ethernet address
Class D IP address
1 1 1 0
5e
Low-order 23 bits of multicast group ID is
copied to Ethernet address
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0
48-bit Ethernet address
Since the upper 5 bits of the multicast group ID are ignored in this mapping, it is not unique. 32
different multicast group IDs map to the same Ethernet address (2^5 = 32). The device driver or the IP
software must perform filtering, since the interface card may receive multicast frames in which the host
is really not interested.
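The mapping above can be sketched in a few lines: keep the low-order 23 bits of the group address and place them under the fixed prefix 01:00:5e. The function name is hypothetical; the second address in the example was chosen only to show the collision caused by the 5 ignored bits.

```python
import ipaddress

def multicast_ip_to_mac(group: str) -> str:
    """Map a class D IP address to its Ethernet multicast address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a class D (multicast) address")
    low23 = int(addr) & 0x7FFFFF             # low-order 23 bits of the group ID
    mac = 0x01005E000000 | low23             # fixed prefix 01:00:5e, next bit 0
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))

mac_a = multicast_ip_to_mac("224.8.8.1")
mac_b = multicast_ip_to_mac("225.136.8.1")   # differs only in the 5 ignored bits
```

Both groups produce the same Ethernet address, which is why the driver or the IP software still has to filter received frames.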
IGMP reports and queries
(Internet Group Management Protocol)
[Animated figure: a host running processes 1, 2 and 3, and a multicast router attached through interface 1. Group 1 has address 224.8.8.1 and group 2 has address 224.8.8.2.
When a process joins a group, the host sends an IGMP report with both the destination IP address and the group address set to the address of that group (for example 224.8.8.1). The router records which multicast groups are alive on interface 1.
When its timer expires, the router sends an IGMP query with destination IP 224.0.0.1 and group address 0. The host answers with one IGMP report per group that still has participants (destination IP and group address 224.8.8.1 or 224.8.8.2).
When the last process leaves a group, the host doesn’t report that group next time.]
IGMP packet
IP datagram
IGMP
message
IP header
20
8
IGMP message
0
4
IGMP
version (1)
8
IGMP type
(1-2)
16
unused
16-bit checksum
8 bytes
32-bit group address (class D IP address)
Version - 1
Type - 1 - multicast router query
2 - response sent by a host
Group address - class D IP address. For a query the address is set to 0
UDP
UDP packet
0
16
31
Source port
Destination port
UDP length
UDP checksum
DATA (if any)
TFTP
Trivial File Transfer Protocol
Packet types
IP datagram: IP header (20 bytes) | UDP header (8 bytes) | TFTP message
Requests: opcode (2 bytes; 1=RRQ, 2=WRQ) | filename (N bytes) | 0 (1 byte) | mode (N bytes) | 0 (1 byte)
Data packet: opcode (2 bytes; 3=data) | block number (2 bytes) | data (0-512 bytes)
ACK packet: opcode (2 bytes; 4=ACK) | block number (2 bytes)
Error packet: opcode (2 bytes; 5=error) | error number (2 bytes) | error message (N bytes) | 0 (1 byte)
Mode: netascii or octet
TFTP operations
File transfer
[Animated figure: the client reads “File” from the server.
1. The client sends a read request (opcode 1) for “File” to destination UDP port 69.
2. The server sends data (opcode 3), block number 1, 512 bytes. The source UDP port is a new port number, assigned for this file transfer by the TFTP server. These port numbers will be used during the file transfer.
3. The client receives block 1 and sends an ACK (opcode 4), block number 1.
4. The server sends data (opcode 3), block number 2, 356 bytes (the last of “File”). Data size < 512 bytes means the last block of the file.
5. The client receives block 2 and sends an ACK (opcode 4), block number 2.]
When writing a file, the client sends a WRQ. If all is OK, the server responds with an ACK for
block number 0, and the transfer proceeds in the same way.
Error messages. The server responds with this type of packet if a read request or write request
can’t be processed. A read or write error during file transmission can also cause this message to
be sent, and the transmission is then terminated.
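The TFTP packet types are simple enough to build by hand. This is a minimal sketch of the request, data and ACK formats, using the opcodes from the slides; the function names are hypothetical and no network I/O is performed.

```python
import struct

def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Read request: opcode 1, filename, 0 byte, mode, 0 byte."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")

def build_data(block: int, data: bytes) -> bytes:
    """Data packet: opcode 3, block number, 0-512 bytes of data."""
    if len(data) > 512:
        raise ValueError("a TFTP data block carries at most 512 bytes")
    return struct.pack("!HH", 3, block) + data

def build_ack(block: int) -> bytes:
    """ACK packet: opcode 4, block number."""
    return struct.pack("!HH", 4, block)

def is_last_block(data_packet: bytes) -> bool:
    """A data packet with fewer than 512 data bytes ends the transfer."""
    return len(data_packet) - 4 < 512

rrq = build_rrq("File")
first = build_data(1, b"x" * 512)
last = build_data(2, b"x" * 356)
```

The block sizes 512 and 356 mirror the transfer shown in the TFTP operations slide.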
Server
BOOTP: Bootstrap Protocol
BOOTP Packet Format
IP datagram
UDP datagram
IP header
20
UDP
header
8
BOOTP request/reply
300
BOOTP datagram (300 bytes), fields in order:
opcode (bits 0-7) | hardware type (bits 8-15) | hardware address length (bits 16-23) | hop count (bits 24-31)
Transaction ID
number of seconds | unused
client IP address
your IP address
server IP address
gateway IP address
client hardware address (16 bytes)
server hostname (64 bytes)
boot filename (128 bytes)
vendor-specific information (64 bytes)

Opcode - 1 - request, 2 - reply
H type - 1 for Ethernet
H addr length - 6 for Ethernet
Hop count - set to 0 by the client
Trans ID - set by the client and returned by the server
Number of seconds - set by the client
Client IP - set by the client. If the client doesn’t have an address yet, it is 0
Your IP - filled in by the server with the client’s IP address
Server IP - filled in by the server
Gateway IP - filled in by a proxy server, if there is one
Client H address - must be set by the client
Server hostname - null-terminated string that is optionally filled in by the server
Boot filename - fully qualified, null-terminated pathname of a file to bootstrap from
BOOTP
Port numbers
Server
67
Client
68
Vendor-Specific information
Pad
0
255
1
1
End of the items. Any bytes after
this should be set to 255
Examples
Subnet mask
Gateway
0
4
1
1
0
N
1
1
subnet mask
4
IP address of
preferred gateway
many fields ...
4
IP address of
preferred gateway
4
If information in the vendor-specific field is provided, the first 4 bytes of this area are set to the IP address
99.130.83.99. This is called the magic cookie.
tag
length
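The vendor-specific area is a sequence of tag/length/value items after the magic cookie, with tag 0 as a one-byte pad and tag 255 as the end marker. This is a minimal sketch of walking that area, assuming well-formed input; the function name is hypothetical.

```python
MAGIC_COOKIE = bytes([99, 130, 83, 99])

def parse_vendor_area(area: bytes) -> dict:
    """Return {tag: value bytes} for the items in a BOOTP vendor-specific area."""
    if area[:4] != MAGIC_COOKIE:
        return {}                            # no vendor information provided
    items = {}
    i = 4
    while i < len(area):
        tag = area[i]
        if tag == 0:                         # pad: a single byte, no length field
            i += 1
            continue
        if tag == 255:                       # end of the items
            break
        length = area[i + 1]
        items[tag] = area[i + 2:i + 2 + length]
        i += 2 + length
    return items

# Subnet mask item (tag 1, length 4), one pad byte, then the end tag.
sample = MAGIC_COOKIE + bytes([1, 4, 255, 255, 255, 0]) + bytes([0, 255])
sample = sample.ljust(64, b"\xff")           # bytes after the end tag are set to 255
vendor_items = parse_vendor_area(sample)
```

Only the pad, end and subnet mask tags named on the slide are used in the sample; other tags are returned as raw bytes without interpretation.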
BOOTP operations
[Animated figure: a diskless client (UDP port 68) boots from a BOOTP server (UDP port 67, IP 1.1.1.1).
1. Client’s BOOTP request: source IP 0.0.0.0, destination IP 255.255.255.255, destination UDP port 67.
2. Server’s BOOTP reply: source IP 1.1.1.1, your IP 1.1.1.2, server IP 1.1.1.1, gateway IP 1.1.1.1, boot file name BFILE. The client now has an IP address.
3. Is my IP address unique? The client sends an ARP request to see if anyone on the network has the same address (target IP 1.1.1.2). Nobody answers. The client sends a second ARP request 0.5 second later, and a third ARP request 0.5 second after it. In the third ARP request the source address is 1.1.1.2 (the client’s address). My IP address is unique!
4. Client’s ARP request “who is the server” (target IP 1.1.1.1). The server’s ARP reply carries the server’s hardware address.
5. Client’s TFTP request: the client reads the boot file BFILE from the server. After receiving the loadable image, the boot process can start.]
TCP
TCP packet
Bit 0 ... 15 | 16 ... 31
Source port | Destination port
Sequence number
Acknowledgment number
Header length (4) | Reserved (6) | flags (6) | Window
TCP checksum | Urgent pointer
Options (+padding)
DATA
The MSS option is used only in SYN packets
TCP sequence and acknowledgement
[Animated figure: client and server exchange data.
1. The client sends 10 bytes of data with SEQ 10.
2. The server received the client’s data, so its ACK = 10 + 10 = 20. It sends 20 bytes of its own data with its own SEQ 30 and ACK 20.
3. The client received the data, so its ACK = 30 + 20 = 50. Its current SEQ = previous SEQ plus data sent = 10 + 10 = 20. It sends its next segment with SEQ 20 and ACK 50.
And so on.]
TCP connection establishment
Client (active open, ISN = 145):
1. Sends a packet with the S (SYN) flag, SEQ 145. The packet contains the port number of the server that the client wants to connect to.
Server (passive open, ISN = 348):
2. Receives the packet. Responds with its own SYN segment containing its own SN and an ACK for the client’s SYN plus one (a SYN consumes one sequence number): flags SA, SEQ 348, ACK = 145 + 1 = 146.
Client:
3. Receives the server’s response (SYN segment). The server’s response contains the correct ACK. Acknowledges the server’s SYN with flags A, ACK = server’s SN + 1 = 348 + 1 = 349.
The connection establishment is completed.
ISN - initial sequence number
These three segments complete the connection establishment. This is often called the three-way handshake.
TCP connection termination
Receiving FIN packet.
Receiving FIN packet.
User types “quit”, for example
Respond with corresponding ACK
Next ACK should be, for
example, 426 and my own SN
must be 658
Send FIN - packet with FIN flag
Client
Active close
SEQ
ACK
658
427
659
426
ACK
Flags659
426
Flags
A
FA
Respond with corresponding
ACK
I should close the second direction
Now the connection is «half-closed». Some
data can still be sent by the server
to the client, with corresponding
ACKs. Then the server closes
the other direction of the connection
The connection is closed
Server
Passive close
A TCP connection is full duplex, and each direction must be shut down independently
TCP states for connection establishment and
termination
active open
Client
Server
passive open
SYN J
SYN_SENT
SYN_RCVD
SYN K, ack J+1
ESTABLISHED
ack K+1
ESTABLISHED
active close
FIN_WAIT_1
passive close
FIN M
ack M+1
CLOSE_WAIT
FIN_WAIT_2
FIN N
LAST_ACK
TIME_WAIT
ack N+1
Client stays in this
state for twice the
MSL
CLOSED
2 MSL state
• Any delayed segments that arrive for the connection are discarded
• It is impossible to open another connection for this socket pair (IP addresses and port numbers)
Quiet Time
A host in the 2MSL wait may crash, reboot within MSL seconds and immediately establish new
connections using the same local and foreign IP addresses and port numbers. To protect against this scenario,
RFC 793 states that TCP should not create any connections for MSL seconds after rebooting. This is
called the quiet time.
Reset Segments
Reset segment - “reset” bit in TCP header is set to 1.
Any queued data is thrown away and the reset is sent immediately. The receiver of the RST can tell
that the other end did an abort instead of a normal close.
Example
We try to connect to a server port number that is not in use on the destination. UDP sends a
“port unreachable” message in this case. TCP sends a reset segment.
SEQ
400
0
ACK
Flags
401
S
Flags
port 10000
RA
Client
FIN - orderly release. RST - abortive release.
Server doesn’t have
process with port
10000
Server
Half-Open
Packet
All is fine!
But sometimes something can crash.
The live computer doesn’t know that its peer has died.
The peer hasn’t sent a FIN or RST segment.
The connection is Half-Open
Simultaneous Open
Usual connection open
active open
passive open
SYN J
SYN_SENT
SYN_RCVD
SYN K, ack J+1
ESTABLISHED
ack K+1
ESTABLISHED
Simultaneous Open
active open
active open
SYN_SENT
SYN_RCVD
SYN J
SYN J, ack K+1
SYN K
SYN K, ack J+1
ESTABLISHED
SYN_SENT
SYN_RCVD
ESTABLISHED
Result - one connection, not two.
Simultaneous Close
Usual connection close
active close
passive close
FIN M
FIN_WAIT_1
CLOSE_WAIT
ack M+1
FIN_WAIT_2
FIN N
TIME_WAIT
LAST_ACK
ack N+1
CLOSED
Simultaneous Close
active close
active close
FIN_WAIT_1
CLOSING
TIME_WAIT
FIN J
ack K+1
FIN K
ack J+1
FIN_WAIT_1
CLOSING
TIME_WAIT
TCP options (RFC 793 and 1323)
(examples)
End of option list: kind=0 (1 byte)
No operation: kind=1 (1 byte)
These two options don’t have a length field. The others do:
length is the total length, including the kind
and len bytes.
Maximum segment size: kind=2 (1 byte) | len=4 (1 byte) | MSS (2 bytes)
Window scale factor: kind=3 (1 byte) | len=3 (1 byte) | shift count (1 byte)
Timestamp: kind=8 (1 byte) | len=10 (1 byte) | timestamp value (4 bytes) | timestamp echo reply (4 bytes)
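The kind/length layout above lends itself to a small parser. This is a minimal sketch that understands only the five option kinds shown on the slide and returns anything else as raw bytes; the function name and the tuple format are illustrative choices.

```python
def parse_tcp_options(options: bytes) -> list:
    """Walk the options area of a TCP header and decode the known kinds."""
    parsed = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                        # end of option list: no length field
            parsed.append(("end", None))
            break
        if kind == 1:                        # no operation: no length field
            parsed.append(("nop", None))
            i += 1
            continue
        length = options[i + 1]              # total length, including kind and len
        if length < 2:
            raise ValueError("malformed option length")
        value = options[i + 2:i + length]
        if kind == 2 and length == 4:
            parsed.append(("mss", int.from_bytes(value, "big")))
        elif kind == 3 and length == 3:
            parsed.append(("window_scale", value[0]))
        elif kind == 8 and length == 10:
            parsed.append(("timestamp", (int.from_bytes(value[:4], "big"),
                                         int.from_bytes(value[4:], "big"))))
        else:
            parsed.append((kind, value))     # unknown kind: keep the raw bytes
        i += length
    return parsed

# MSS 1460, a NOP for alignment, window scale shift count 7, end of list.
decoded = parse_tcp_options(bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 0]))
```

The NOP between the MSS and window scale options is the usual way to pad an option to a convenient boundary.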
Delayed Acknowledgment (delayed ACK)
For example, the delayed ACK timer here is 200 ms. Watch the client.
[Animated figure: client and server timelines with 200 ms “delayed ACK” intervals and the segments PSH 2:6 (4) ack 11, PSH 6:12 (4) ack 11 and PSH 11:15 (4) ack 12.]
The client doesn’t send an ACK immediately. It delays the ACK, hoping to have data to
send in the same direction as the ACK, so that the ACK can piggyback on the data. It can wait until the next “delayed
ACK” boundary.
Here the delayed ACK flag is turned off: TCP has decided to send a data packet.
Nagle algorithm
Client
APPLICATION
TCP
TCP
TCPhas
doesn’t
hasdata
received
for
send
send
packet.
packet.
entireWe
Now
packet.
are
it
Send
packet
waiting
can send
And
for first
data
TCPpacket’s
from
does buffer.
it. ACK.
PSH 2:3 (1) ack 2
ack 3
PSH 3:5 (2) ack 2
mss (20
bytes)
20 bytes
PSH 5:25 (20) ack 2
ack 5
TCP
buffer
1
1 byte
byte
ack 25
bla.., bla... bla… bla… tume has passed
PSH 8:10 (2) ack 55
PSH 55:56 (1) ack 10
ack 56
ACK is receiving, I have data,
preparing and send packet
Now I have data for sending
again. And I have “free” ACK
from server (packet *)
PSH 10:12 (2) ack 56
Befor packet was pushed into
PSH 56:58 (2) ack 10
physical media another packet
PSH 56:58
ackreceived
12
from server
had (2)
been
*
TCP timers
• Retransmission timer. This timer is used when
expecting an acknowledgment from the other end.
• Persist timer keeps window size information flowing
even if the other end closes its receive window.
• Keepalive timer detects when the other end of an
otherwise idle connection crashes or reboots.
• 2MSL timer measures the time a connection has been
in the TIME_WAIT state.
Round-Trip Time
PSH 2:3 (1) ack 2
Measured RTT (M)
ack 3
Send bytes
Receive ACK for
that bytes
The following formulas are used to calculate the
retransmission timeout value (RTO).
Err = M - A
A <- A + g*Err
D <- D + h*(|Err| - D)
A - smoothed RTT (an estimator of the average)
D - smoothed mean deviation
g - 0.125 (1/8)
h - 0.25
RTO = A + 4D
Karn’s algorithm.
The algorithm specifies that when a retransmission occurs, we cannot update the RTT estimator when the
acknowledgement for the retransmitted data finally arrives.
RTT example. Measurement.
Most implementations measure only one RTT value per connection at any time. If the timer for a
given connection is already in use when a data segment is transmitted, that segment is not timed.
1:257 (256) ack 1 1
start timer
RTT №1
1.061 sec
2 ack 257
stop timer
257:513 (256) ack 1 3
513:769 (256) ack 1 4
start timer
RTT №2
0.808 sec
5 ack 513
8 ack 769
stop timer
769:1025 (256) ack 1 6
1025:1281 (256) ack 1 7
start timer
10 ack 1025
12 ack 1281
1281:1537 (256) ack 1 9
RTT №3
1.015 sec
stop timer
1537:1793 (256) ack 1 11
...
RTT example. Measurement.
1:257 (256) ack 1 1
RTT №1
1.061 sec
The timing is done by
incrementing a counter
every time the 500-ms TCP timer
routine is invoked. The figure
shows the relationship in
our example between the
actual RTT that we can
determine with a network
analyzer and the
counted clock ticks.
2 ack 257
257:513 (256) ack 1 3
513:769 (256) ack 1 4
RTT №2
0.808 sec
5 ack 513
8 ack 769
769:1025 (256) ack 1 6
1025:1281 (256) ack 1 7
10 ack 1025
12 ack 1281
1281:1537 (256) ack 1 9
RTT №3
1.015 sec
...
1537:1793 (256) ack 1 11
2.53
RTT №3.
2 ticks
3.03
stop timer
RTT
№2.
1 tick
2.03
start timer
3 ticks
1.53
stop timer
RTT №1.
1.03
start timer
0.53
stop timer
start timer
0.03
RTT example. Calculation.
Err = M - A
A <- A + g*Err
D <- D + h*(|Err| - D)
RTO = A + 4D
RTT №1
1.061 sec
(3 ticks)
RTT №1 = 3 ticks
RTT №2 = 1 tick
RTT №3 = 2 ticks
RTT №2
0.808 sec
1:257 (256) ack 1 1
2 ack 257
257:513 (256) ack 1 3
513:769 (256) ack 1 4
5 ack 513
8 ack 769
769:1025 (256) ack 1 6
1025:1281 (256) ack 1 7
RTT №3
1.015 sec
A is initialized to 0
D is initialized to 3
Initial RTO = A + 2D = 0 + 2*3 = 6 seconds
(Factor 2 is used only for the initial calculation)
When the ACK for the first data segment arrives
(segment 2), the measured RTT is 3 ticks (M = 1.5 seconds) and our estimators
are initialized as
A = M + 0.5 = 1.5 + 0.5 = 2
D = A/2 = 1
RTO = A + 4D = 2 + 4*1 = 6 seconds
1281:1537 (256) ack 1 9
10 ack 1025
12 ack 1281
...
1537:1793 (256) ack 1 11
When the ACK for the second data segment arrives
(segment 5), the measured RTT is 1 tick (M = 0.5 seconds) and the update is
Err = M - A = 0.5 - 2 = -1.5
A = A + g*Err = 2 - 0.125*1.5 = 1.8125
D = D + h*(|Err| - D) = 1 + 0.25*(1.5 - 1) = 1.125
RTO = A + 4D = 1.8125 + 4*1.125 = 6.3125
But most implementations keep the RTO as a multiple of
500 ms. In our case the RTO will be 6 seconds.
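The update step of this worked example is easy to check in code. This is a minimal sketch of one estimator update with g = 0.125 and h = 0.25; the function name is hypothetical, and the rounding to a 500-ms multiple that real implementations apply is left out.

```python
def update_rto(a: float, d: float, m: float, g: float = 0.125, h: float = 0.25):
    """Apply one RTT measurement m (seconds) to the estimators A and D."""
    err = m - a
    a = a + g * err                  # smoothed RTT
    d = d + h * (abs(err) - d)       # smoothed mean deviation
    return a, d, a + 4 * d           # new A, new D, RTO = A + 4D

# State after the first measurement (A = 2, D = 1), then M = 0.5 s for RTT No. 2.
a, d, rto = update_rto(2.0, 1.0, 0.5)
```

Running it with the values from the slide gives A = 1.8125, D = 1.125 and RTO = 6.3125 seconds.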
Congestion example.
There is normal
data flow
6401:6657 (256) ack 1
6657:6913 (256) ack 1
ack 6657
6913:7169 (256) ack 1
7169:7425 (256) ack 1
ack 6913
Congestion. For example,
router lost packet
7425:7681 (256) ack 1
7681:7937 (256) ack 1
First duplicate ACK
7937:8193 (256) ack 1
Second duplicate ACK
There is third
6913:7169 (256) ack 1
duplicate
ACKs
retransmission
3rd ACK
to appl
to appl
The host knows that the
previous packet is
missing. The host then
sends an ACK for the
previous received
packet and saves the
received packet.
ack 6913 (save 256)
ack 6913 (save 256)
ack 6913 (save 256)
ack 6913 (save 256)
all saved to appl
ack 8193
to appl
Received
ack
8449 missed
8193 :8449 (256) ack 1
TCP counts the number of duplicate ACKs received,
and when the third one is received it assumes that a
segment has been lost. TCP retransmits only one
segment, starting with that sequence number. We
discuss the fast retransmit algorithm later.
packet. Now this
host has all data
bytes 6913-8192.
Slow start.
Slow start works with the congestion
window - CWND. CWND is
initialized to 1 (one) segment and is
increased by one segment each time
an ACK is received.
At some point the capacity of the
network can be reached and some
packets can be discarded. This
situation tells the sender that its
CWND is too large. We’ll see the
mechanism for adjusting CWND later.
The sender can transmit up to the
minimum of the congestion window
and the advertised window. CWND is
flow control imposed by the sender.
CWND is maintained in bytes
cwnd = 1
1:513 (512) ack 1
ack 513
cwnd = 2
513:1025 (512) ack 1
1025:1537 (512) ack 1
ack 1025
cwnd = 3
1537:2049 (512) ack 1
2049:2561 (512) ack 1
The sender sends only two segments
because the ACK for segment
1025:1537 hasn’t been received.
Result: we have CWND = 3
and 3 segments sent (without ACK).
ack 1537
cwnd = 4
2561:3073 (512) ack 1
3073:3585 (512) ack 1
And so on
Congestion avoidance algorithm.
Congestion avoidance and slow start are different algorithms. But in practice congestion
avoidance and slow start are implemented together. When congestion occurs,
TCP slows down the transmission rate of packets into the network and then
invokes slow start to get things going again.
Congestion avoidance and slow start require that two variables be maintained for each connection:
• CWND
• A slow start threshold size, ssthresh
There are two indications of packet loss:
• a timeout occurs
• the receipt of duplicate ACKs
Congestion avoidance algorithm.
How the combined algorithm works.
No
Yes
Initialization:
CWND = 1 segment
SSTHRESH = 65535 bytes
Is congestion indicated by
timeout?
Yes
Normal data flow, CWND is
growing
No
CWND = 1 segment
Congestion occur!
Retransmission , bla-bla-bla..
At least: ACK is received
SSTRESH = CWS/2
TCP increase CWND, but
the way it increases depends
on whether we TCP performs
slow start or congestion
avoidance
CWND =< SSTHRESH?
CWS - current
window size
TCP’s doing
SLOW START
Slow start has CWND start at
one segment and be
incremented by one
segment every time an ACK is
received. (Do you remember
the previous slide?)
Slow start continues until we
are halfway to where
congestion occurred (since
we recorded half of the
window size that got us into
trouble), and then congestion
avoidance takes over.
CONGESTION AVOIDANCE
Congestion avoidance dictates that
CWND be incremented by 1/CWND
each time an ACK is received. This
increases CWND by at most one
segment each RTT, whereas slow start
increments CWND by the number of
ACKs received in an RTT
Congestion avoidance algorithm. Illustration.
Starting point:
We assume that congestion has just occurred
when CWND had a value of 32 segments.
Congestion was indicated by a timeout, so
SSTHRESH = 32 / 2 = 16
CWND = 1
1 segment is sent at time 0
At time 1 the ACK is returned
and CWND is incremented
to 2 segments
At time 2 two ACKs are returned and
CWND is incremented to 4 segments
(CWND was 2 and two ACKs were
received)
And so on
CWND = SSTHRESH. Slow start
stops and congestion avoidance
starts
Now congestion avoidance is
working. The increase of CWND is
linear, with a maximum increase of
one segment per round-trip time
[Plot: CWND in segments (0 to 20, with SSTHRESH = 16 marked) versus round-trip times (1 to 7) after the congestion moment]
Fast retransmit and Fast recovery algorithms.
[Animated figure: a TCP host is able to send 3 packets into the network, starting with 1:513 (512) ack 1 and 513:1025 (512) ack 1. It receives ack 513, followed by the 1st, 2nd and 3rd duplicate ACK, each of them ack 513.]
A duplicate ACK may also be generated by reordering of segments. After the third duplicate ACK the host concludes that the segment is lost.
The host doesn’t wait for the retransmission timer
to expire. It sends
the lost segment. This is the
FAST RETRANSMIT
ALGORITHM
Slow start isn’t performed, but the
congestion avoidance algorithm is working.
This is the
FAST RECOVERY
ALGORITHM
Fast retransmit and Fast recovery algorithms.
How the combined algorithm works.
1. The 3rd duplicate ACK is received:
SSTHRESH = CWS/2
Retransmit the missing segment
CWND = SSTHRESH + 3 * segment size
2. If another duplicate ACK arrives:
INC(CWND; segment size);
transmit a packet (if CWND allows)
3. An ACK is received which acknowledges all data
segments sent between the lost packet and the 1st duplicate ACK:
CWND = SSTHRESH
Congestion avoidance is now working
Slow start and congestion avoidance example
CWND
Segment #
Send
Action
Receive
Comment
initialize
CWND
256
Variable
SSTRESH
65535
timeout
retransmit
256
512
SYN
SYN
SYN, ACK
ACK 257
slow start
512
512
257:513(256)
513:769 (256)
ACK 513
slow start
768
512
ACK 769 cong. avoid.
885
512
ACK 1025 cong. avoid.
991
512
ACK 1281 cong. avoid.
1089
512
769:1025(256)
1025:1281(256)
1281:1537(256)
1537:1793(256)
Initialize:
CWND = MSS = 256
SSTHRESH = 65535
Timeout occurs
SSTHRESH = CWS/2 = minimum value = 512
CWND = 1 segment = 256
There is no change here because new data is not
being acknowledged
Here is an ACK for data!
CWND <= SSTHRESH, so we are in slow start
1 segment = 256
CWND = CWND + 256 = 512
[Plot: CWND in bytes (200 to 1100) and SEQ x 1000 (0,2 to 1,8) versus segment numbers 1 to 12 from the table; the first segments are SYN, SYN, (S, A) and ACK]
ACK
1:257 (256)
DATA GO
CWND <=SSTRESH slow start
CWND = CWND + 1 segment
CWND = 512 + 256 = 768
CWND > SSTRESH cong.avoid.
CWND <-768 + 256*256/768 + 256/8
We are using integer arithmetic.
CWND = 885
1 2 3 4 5 6 7 8 9 10 11 12
numbers (from table)
CWND > SSTRESH cong.avoid.
CWND <-991 + 256*256/991 + 256/8
We are using integer arithmetic.
CWND = 1089
CWND > SSTRESH cong.avoid.
CWND <- 885 + 256*256/885 + 256/8
We are using integer arithmetic.
CWND = 991
The real formula for the 1/CWND increase is
cwnd <- cwnd + (segsize*segsize)/cwnd + segsize/8
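The CWND column of the table can be reproduced by applying the slow start rule while CWND <= SSTHRESH and the integer formula above otherwise. This is a minimal sketch under those assumptions; the function name is hypothetical, and the starting values (CWND = 256, SSTHRESH = 512, segment size 256) are the ones in effect after the timeout in this example.

```python
def on_ack(cwnd: int, ssthresh: int, segsize: int) -> int:
    """Return the new congestion window (bytes) after an ACK for new data."""
    if cwnd <= ssthresh:
        return cwnd + segsize                               # slow start
    # Congestion avoidance, integer arithmetic as in the slides.
    return cwnd + (segsize * segsize) // cwnd + segsize // 8

cwnd, ssthresh, segsize = 256, 512, 256
history = []
for _ in range(5):                  # five ACKs that acknowledge new data
    cwnd = on_ack(cwnd, ssthresh, segsize)
    history.append(cwnd)
```

The resulting sequence is 512, 768, 885, 991, 1089, matching the values shown for the slow start and congestion avoidance rows.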
Slow start and congestion avoidance example
Action (Send, Receive, Comment) and Variable (CWND, SSTHRESH):

Seg #  Send              Receive     Comment           CWND   SSTHRESH
58                       ACK 6657    ACK new data      2426        512
59     8705:8961(256)
60                       ACK 6657    dup ACK #1        2426        512
61                       ACK 6657    dup ACK #2        2426        512
62                       ACK 6657    dup ACK #3        1792       1024
63     6657:6913(256)                retransmission
64                       ACK 6657    dup ACK #4        2048       1024
65                       ACK 6657    dup ACK #5        2304       1024
66                       ACK 6657    dup ACK #6        2560       1024
67     8961:9217(256)
68                       ACK 6657    dup ACK #7        2816       1024
69     9217:9473(256)
70                       ACK 6657    dup ACK #8        3072       1024
71     9473:9729(256)
72                       ACK 8961    ACK new data      1280       1024

[Figure: SEQ x 1000 (8.7 to 9.7) and CWND (1200 to 3400) plotted against the segment numbers from the table (58 to 72)]

Duplicate ACKs #1 and #2:
The first two duplicate ACKs are received and counted, and CWND is left alone.
Duplicate ACK #3 arrives:
SSTHRESH = CWND/2 = 2426/2 = 1213, rounded down to a multiple of the segment size = 1024
CWND = SSTHRESH + number of duplicate ACKs x segment size =
1024 + 3 * 256 = 1792
The retransmission is sent.
NOTE: here we have 2304 bytes of unacknowledged data from previous segments.
Duplicate ACK #4 is received.
CWND = CWND + 1 segment = 1792 + 256 = 2048
But CWND is not big enough to send new data.
Duplicate ACK #5 is received.
CWND = CWND + 1 segment = 2048 + 256 = 2304
But CWND is still not big enough to send new data.
Duplicate ACK #6 is received.
CWND = CWND + 1 segment = 2304 + 256 = 2560
We can send data.
Data is sent.
The same happens for the next segments (duplicate ACKs #7 and #8).
ACK for new data is received:
CWND is set back to SSTHRESH
CWND <= SSTHRESH
slow start!!!
CWND = SSTHRESH + segment size = 1024 + 256 = 1280
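The CWND and SSTHRESH columns of this table can be reproduced with a small state machine. The sketch below is illustrative only: the class and method names are invented, the two-segment lower bound on SSTHRESH is an assumption carried over from the previous slide, and the behaviour outside fast recovery is reduced to the slow start and congestion avoidance rules already shown.

```python
SEGSIZE = 256  # segment size used in the slide's example


class FastRecovery:
    """Toy model of fast retransmit and fast recovery (names are illustrative)."""

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_dup_ack(self):
        """Process one duplicate ACK; return True if a retransmission is due."""
        self.dup_acks += 1
        if self.dup_acks < 3:
            return False  # just count, leave cwnd alone
        if self.dup_acks == 3:
            half = self.cwnd // 2
            # Round down to a multiple of the segment size, at least 2 segments.
            self.ssthresh = max(2 * SEGSIZE, half - half % SEGSIZE)
            self.cwnd = self.ssthresh + 3 * SEGSIZE
            return True  # fast retransmit
        self.cwnd += SEGSIZE  # each further duplicate ACK inflates cwnd
        return False

    def on_new_ack(self):
        """Process an ACK that acknowledges new data."""
        if self.dup_acks >= 3:
            self.cwnd = self.ssthresh  # leave fast recovery
        self.dup_acks = 0
        if self.cwnd <= self.ssthresh:
            self.cwnd += SEGSIZE  # slow start
        else:
            self.cwnd += (SEGSIZE * SEGSIZE) // self.cwnd + SEGSIZE // 8


tcp = FastRecovery(cwnd=2426, ssthresh=512)
trace = []
for _ in range(8):  # duplicate ACKs #1 to #8
    tcp.on_dup_ack()
    trace.append((tcp.cwnd, tcp.ssthresh))
tcp.on_new_ack()  # ACK 8961
trace.append((tcp.cwnd, tcp.ssthresh))
print(trace)
```

Each tuple in `trace` corresponds to one received ACK in the table, from duplicate ACK #1 down to the final ACK of new data.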
TCP keepalive timer
A TCP implementation may use the keepalive option. This option is used to find out:
Is my peer alive?
One example is a half-open connection. One peer has died, but the other end
doesn't know about it. It keeps the socket (IP address + port number) for the dead
peer. But that peer no longer needs anything...
And the live one must find that out!
Usually the keepalive timer is 2 hours.
There are 4 scenarios when there is no activity on the connection and one peer sends
a keepalive probe to the other.
TCP keepalive timer
Scenario 1. Peer is alive and reachable.
Client and Server
• That's all... The peers don't have any data to send to each other, but the connection is established.
• 2 (two) hours passed...
• Client: My keepalive timer has expired. Is my peer alive? But I forgot his MAC address...
• ARP request, ARP reply
• Packet: keepalive probe. The keepalive probe has a SEQ one less than it should be (for example, the receiver waits for SEQ = 14, but the keepalive probe has SEQ = 13). The receiver receives a packet with an incorrect SEQ and is forced to respond with an ACK which contains the next SEQ that the server is expecting.
• Packet: ACK
• The client received an answer from the server. It knows that the server is alive and resets its keepalive timer.
Scenario 2. Peer has crashed and is down or still rebooting.
Client and Server
• That's all... The peers don't have any data to send to each other, but the connection is established.
• 2 hours have passed.
• Client: My keepalive timer has expired. Is my peer alive?
• Packet: keepalive probe. TCP sends the request. Don't look at the lower level (ARP) now. We only want to know whether the peer is alive or not.
• But the peer has crashed. No answer.
• 75 seconds… No answer. 75 seconds… No answer.
• The client sends 10 keepalive probes. If it doesn't receive a response, it considers the peer's host down and terminates the connection.
TCP keepalive timer
Scenario 3. Peer has crashed and rebooted.
I'll be brief…
Client and Server
• Server: The host has crashed and rebooted. It has a working TCP stack but doesn't have a socket for that connection.
• Once again, 2 hours have passed.
• Client: My keepalive timer has expired. Is my peer alive?
• Packet: keepalive probe
• Server: Are they crazy? I don't have such a socket!
• Packet: reset. The client terminates the connection.
Scenario 4. Peer is running, but unreachable.
In this scenario the situation is the same as in scenario 2 from the client's point of view: the probes get no answer, and TCP cannot tell the two cases apart. This situation
may be caused by a failure of an intermediate router.
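The client's side of the four scenarios comes down to one small decision, sketched below. The function name, the reply labels (`"ack"`, `"rst"`, `None`) and the returned strings are invented for illustration; the constants are the values quoted on the slides.

```python
KEEPALIVE_IDLE = 2 * 60 * 60  # seconds of inactivity before the first probe
PROBE_INTERVAL = 75           # seconds between unanswered probes
MAX_PROBES = 10               # probes sent before giving up


def keepalive_outcome(replies):
    """Decide what the client does, given the reply to each probe in order.

    Each reply is "ack", "rst" or None (no answer within PROBE_INTERVAL).
    Probes beyond MAX_PROBES are never sent, so extra replies are ignored,
    and a missing entry is treated as "no answer".
    """
    for reply in list(replies)[:MAX_PROBES]:
        if reply == "ack":
            return "peer alive: reset keepalive timer"   # scenario 1
        if reply == "rst":
            return "peer rebooted: terminate connection"  # scenario 3
    # Scenarios 2 and 4 look identical from here: nothing ever came back.
    return "peer down or unreachable: terminate connection"


print(keepalive_outcome(["ack"]))
print(keepalive_outcome([None] * MAX_PROBES))
print(keepalive_outcome(["rst"]))
```

A probe that is answered late still counts, which is why the loop keeps going after a `None`: only ten unanswered probes in a row end the connection.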
Path MTU Discovery
• Connection established.
Starting segment size = MIN (MTU of my interface less the IP and TCP headers; MSS announced by
the other end)
If the other end doesn't specify an MSS, it defaults to 536.
It is possible to save the path MTU on a per-route basis.
• We send datagrams with the DF (don't fragment) bit set.
• But things change… For example, a router went down and the route was changed.
Another router needs to fragment our datagram, but the datagram has the DF bit
set. The router sends an ICMP error to our host.
• We have received the ICMP error "can't fragment". Decrease the MTU:
If the router generates the newer form of the ICMP error message, the message contains the MTU of the next hop, and we use it.
If the router generates the older form of the ICMP error message, we take the next smaller MTU.
MSS = MTU - IP header - TCP header
• Things keep changing… After a timeout we can try a bigger MTU (depending on the
implementation). RFC 1191 recommends 10 minutes.
TCP packet with MSS option
TCP packet
Source port | Destination port
Sequence number
Acknowledgment number
Data offset | Reserved | Flags | Window
Checksum | Urgent pointer
Options (+padding)
DATA
Maximum segment size option
kind=2 (1 byte) | len=4 (1 byte) | MSS (2 bytes)
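As a rough illustration of the option layout, the sketch below packs and unpacks the three fields with Python's `struct` module. The helper names are made up; the length byte is 4 because it counts the whole option (1 + 1 + 2 bytes), and the fields are in network byte order.

```python
import struct

MSS_KIND = 2
MSS_LEN = 4  # kind (1) + len (1) + MSS (2)


def build_mss_option(mss):
    """Encode a maximum segment size option: kind, len, 16-bit MSS."""
    return struct.pack("!BBH", MSS_KIND, MSS_LEN, mss)


def parse_mss_option(data):
    """Decode an MSS option and return the MSS value."""
    kind, length, mss = struct.unpack("!BBH", data[:4])
    if kind != MSS_KIND or length != MSS_LEN:
        raise ValueError("not a maximum segment size option")
    return mss


option = build_mss_option(1460)
print(option.hex(), parse_mss_option(option))
```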
Path MTU Discovery. Example.
Host 1, Host 2, Router 1
MTU values shown on the diagram: 552, 296, 1500, 1500
• SYN and SYN, ACK carry the MSS options: mss = 512 and mss = 1460
• ACK
• Sender: MTU is 552! I can send a datagram with 512 bytes of data.
1:513(512)
• Router: I can't send such a big datagram without fragmentation. But the DF bit is set => an error occurs!
ICMP error message: Host 1 unreachable, need to frag, mtu = 296
(newer ICMP implementation on the router)
• Sender: My MSS is now 256 (MTU = 296).
1:257(256)
• ACK
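The sender's reaction to the ICMP error in this example can be sketched as follows. This is a simplified illustration, not any particular stack's code: the function names are invented, the headers are assumed to be 20 bytes each (no IP or TCP options), and the plateau list holds the common MTU values tabulated in RFC 1191, used here only for the older form of the error that carries no MTU.

```python
IP_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20  # bytes, assuming no TCP options

# Common MTU values to fall back on, largest first (plateau table of RFC 1191).
PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]


def mss_from_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented datagram."""
    return mtu - IP_HEADER - TCP_HEADER


def on_cant_fragment(current_mtu, next_hop_mtu=None):
    """Return the new path MTU after an ICMP "can't fragment" error."""
    if next_hop_mtu:
        return next_hop_mtu  # newer form: the router tells us the MTU
    for plateau in PLATEAUS:  # older form: guess the next smaller MTU
        if plateau < current_mtu:
            return plateau
    return PLATEAUS[-1]


path_mtu = 552
print(mss_from_mtu(path_mtu))                      # segment size before the error
path_mtu = on_cant_fragment(path_mtu, next_hop_mtu=296)
print(path_mtu, mss_from_mtu(path_mtu))            # after "need to frag, mtu = 296"
```

With the older form of the error the sender in this example would have had to guess: `on_cant_fragment(552)` picks 508 first and needs another round trip to get down to 296.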
Window Scale Option
• Networks are growing and buffers are getting bigger, so a window size of 65535 (the maximum
allowed by the 16-bit window field in the TCP header) is no longer enough.
• Newer implementations use the WINDOW SCALE OPTION.
• Newer implementations can still work with older implementations.
TCP header
Source port | Destination port
Sequence number
Acknowledgment number
Data offset | Reserved | Flags | Window
Checksum | Urgent pointer
Options (+padding)
DATA
The Window field has only 16 bits.
The Options field can contain the WINDOW SCALE OPTION:
kind=3 (1 byte) | len=3 (1 byte) | shift count (1 byte)
The WINDOW SCALE OPTION can be advertised only in a SYN segment.
The scale factor is fixed in each direction
when the connection is established.
Shift count: 0 - 14
0 - no scaling performed
Window Scale Option. Setting.
To enable window scaling both ends must have this option in
their SYN segments.
Active Open: SYN, wscale
I think my window scale should be 1.
Other end: SYN, ACK, wscale
The active peer is going to use window scale! I understand it and choose my
window scale = 0. I must set this option to 0.
How the scale works.
The window scale is used to shift the
value from the window field to get the
real window size.
For example, the window scale was
set to 1 and the window size in the
received packet is 4 (it's only an
example):
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Using the window scale, shift the value left by 1 bit...
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
The real advertised window is 8.
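The shift and the negotiation rule can be written down in a few lines. The sketch is illustrative: the function names are invented, and `negotiate` models only the rule stated above that scaling is enabled when both SYN segments carry the option (an absent option is represented by `None`).

```python
import struct

WSCALE_KIND = 3
WSCALE_LEN = 3
MAX_SHIFT = 14


def build_wscale_option(shift):
    """Encode a window scale option: kind=3, len=3, shift count."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return struct.pack("!BBB", WSCALE_KIND, WSCALE_LEN, shift)


def real_window(window_field, shift):
    """Real window size: the 16-bit window field shifted left by the shift count."""
    return window_field << shift


def negotiate(active_shift, passive_shift):
    """Return the shift counts in use for (active end's, passive end's) windows.

    Each argument is the shift count sent in that end's SYN, or None if the
    option was absent. Scaling is used only if both SYNs carried the option.
    """
    if active_shift is None or passive_shift is None:
        return (0, 0)
    return (active_shift, passive_shift)


print(real_window(4, 1))        # the slide's example
print(real_window(65535, 14))   # largest possible window
print(negotiate(1, 0), negotiate(1, None))
```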
Timestamp option
kind=8 (1 byte) | len=10 (1 byte) | timestamp value (4 bytes) | timestamp echo reply (4 bytes)
The timestamp option is used for a better calculation of the RTT.
The sender places a 32-bit value in the first field and the receiver echoes this back in the reply field.
To use this option, both ends must be able to work with it.
To enable this option, the active peer must set the timestamp option in its SYN and the other
(passive) end must answer with the option too.
Only one timestamp value is kept per connection.
How does TCP do it?
• Receiver's TCP keeps:
the timestamp value to be sent in the next ACK (tsrecent);
the ACK number from the last ACK which was sent, which is the next sequence number we are waiting for (lastack).
• Segment arrives:
If the segment contains the byte numbered lastack, tsrecent = timestamp value from the segment.
• ACK is sent:
tsrecent is placed in the timestamp echo reply field, and lastack is set to the ACK number of the ACK being sent.
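The receiver's bookkeeping fits in a small class. This is a minimal sketch with invented class and method names, built around the two variables the receiver keeps (called `tsrecent` and `lastack` here); sequence number wraparound is ignored to keep the comparison readable.

```python
class TimestampEcho:
    """Receiver-side choice of the timestamp to echo (illustrative sketch)."""

    def __init__(self, lastack):
        self.tsrecent = 0       # timestamp value to echo in the next ACK
        self.lastack = lastack  # ACK number of the last ACK we sent

    def on_segment(self, seq, length, tsval):
        """Remember the timestamp only if the segment holds the byte we await."""
        if seq <= self.lastack < seq + length:
            self.tsrecent = tsval

    def send_ack(self, ack):
        """Return (ACK number, timestamp echo reply) for the outgoing ACK."""
        self.lastack = ack
        return ack, self.tsrecent


rx = TimestampEcho(lastack=1)
rx.on_segment(seq=1, length=256, tsval=100)
print(rx.send_ack(257))
rx.on_segment(seq=513, length=256, tsval=120)  # out of order: byte 257 missing
print(rx.send_ack(257))                        # still echoes the old timestamp
rx.on_segment(seq=257, length=256, tsval=110)  # the hole is filled
print(rx.send_ack(769))
```

Echoing the timestamp of the segment that fills the hole, rather than the newest one seen, keeps the sender's RTT estimate from being understated when segments arrive out of order.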
PAWS: Protection Against Wrapped Sequence Numbers
Consider a TCP connection using the window scale option with the largest possible window, 1
gigabyte (2^30). (The largest window is just smaller than this, 65535 x 2^14, not 2^16 x 2^14, but that
doesn't affect this discussion.) Also assume the timestamp option is being used and that the
timestamp value assigned by the sender increments by one for each window that is sent. (This is
conservative. Normally the timestamp increments faster than this.) Figure 24.8 shows the
possible data flow between the two hosts, when transferring 6 gigabytes. To avoid lots of 10-digit
numbers, we use the notation G to mean a multiple of 1,073,741,824. We also use the
notation from tcpdump that J:K means byte J through and including byte K-1.

Time   Bytes sent   Send SEQ #   Send timestamp   Receive
A      0G:1G        0G:1G        1                OK
B      1G:2G        1G:2G        2                OK but one segment lost and retransmitted
C      2G:3G        2G:3G        3                OK
D      3G:4G        3G:4G        4                OK
E      4G:5G        0G:1G        5                OK
F      5G:6G        1G:2G        6                OK but retransmitted segment reappears

The 32-bit sequence number wraps between times D and E. We assume that one segment
gets lost at time B and is retransmitted. We also assume that this lost segment reappears at
time F.
This assumes that the time difference between the segment getting lost and reappearing
is less than the MSL; otherwise the segment would have been discarded by some router
when its TTL expired. As we mentioned earlier, it is only with high-speed connections that
this problem appears, where old segments can reappear and contain sequence numbers
currently being transmitted.
We can also see from Figure 24.8 that using the timestamp prevents this problem. The
receiver considers the timestamp as a 32-bit extension of the sequence number. Since the lost
segment that reappears at time F has a timestamp of 2, which is less than the most recent
valid timestamp (5 or 6), it is discarded by the PAWS algorithm.
The PAWS algorithm does not require any form of time synchronization between the
sender and receiver. All the receiver needs is for the timestamp values to be monotonically
increasing, and to increase by at least one per window.
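The check made at time F can be sketched as a single comparison. The function names are invented for illustration, and comparing the timestamps modulo 2^32 (so that the test keeps working when the timestamp itself wraps) is an assumption added here; the text above only requires the timestamp to be less than the most recent valid one.

```python
def ts_older(a, b):
    """True if 32-bit timestamp a is older than b, comparing modulo 2**32."""
    return ((a - b) & 0xFFFFFFFF) > 0x7FFFFFFF


def paws_accept(seg_tsval, tsrecent):
    """Accept a segment unless its timestamp is older than the most recent valid one."""
    return not ts_older(seg_tsval, tsrecent)


# Replay the table: the timestamp grows by one per window, A through F.
tsrecent = 0
for tsval in [1, 2, 3, 4, 5, 6]:
    if paws_accept(tsval, tsrecent):
        tsrecent = tsval

# The segment lost at time B reappears at time F, still carrying timestamp 2.
print(paws_accept(2, tsrecent))  # rejected, although its SEQ 1G:2G is in the window
```

Sequence numbers alone cannot reject that segment, because after the wrap its bytes 1G:2G fall inside the window in use at time F.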