Transcript 0 - net
How to use it
• Press Space to advance the slide animation.
• Don't hurry to press Space again. Wait for the end of the animation.
• If you want to go back, use the «PgUp» key.
Version 08 June 1999. Come back later: the presentation is under construction now.

Encapsulation of data into an Ethernet packet
User data
Application data = Application header + User data
TCP segment = TCP header + Application data
IP datagram = IP header + TCP header + Application data
Ethernet frame = Ethernet header + IP header + TCP header + Application data + Ethernet trailer
The part between the Ethernet header and the Ethernet trailer is 46 to 1500 bytes long.

IEEE 802.2/802.3 Encapsulation (RFC 1042)
Part         Field                 Bytes   Value
802.3 MAC    Destination address   6
802.3 MAC    Source address        6
802.3 MAC    Length                2
802.2 LLC    DSAP                  1       0xAA
802.2 LLC    SSAP                  1       0xAA
802.2 LLC    cntl                  1       03
802.2 SNAP   Org code              3       00
802.2 SNAP   Type                  2
             DATA                  38-1492
             CRC                   4
LENGTH contains the length of the packet from the next byte up to the CRC (the CRC isn't included).
DSAP (Destination Service Access Point) and SSAP (Source Service Access Point) are both set to 0xAA.
CNTL (Control field) is set to 3.
ORG CODE is always 0 in all bytes.
The TYPE field identifies the data that follows. For example, type 0x0800 (hex) identifies that an IP datagram follows.
Type 0800 (2 bytes): IP datagram, 38-1492 bytes
Type 0806 (2 bytes): ARP request/reply, 28 bytes, plus PAD, 10 bytes
Type 8035 (2 bytes): RARP request/reply, 28 bytes, plus PAD, 10 bytes

Ethernet Encapsulation (RFC 894)
Field                 Bytes
Destination address   6
Source address        6
Type                  2
DATA                  46-1500
CRC                   4
Type 0800 (2 bytes): IP datagram, 46-1500 bytes
Type 0806 (2 bytes): ARP request/reply, 28 bytes, plus PAD, 18 bytes
Type 8035 (2 bytes): RARP request/reply, 28 bytes, plus PAD, 18 bytes

IP packet structure
Bits 0-15                                     Bits 16-31
4-bit version | 4-bit IHL | 8-bit TOS         16-bit total packet length
16-bit identification                         3-bit flags | 13-bit fragment offset
8-bit TTL | 8-bit protocol                    16-bit header checksum
32-bit source address
32-bit destination address
Options (+ padding)
DATA

VERSION. The current protocol version is 4.
IHL - IP header length. IHL is the number of 32-bit words in the IP header. This field is 4 bits long, so the maximum header length is 60 bytes.
TOS - type of service. It consists of a 3-bit precedence field (ignored), 4 TOS bits, and an unused bit which must be 0.
The 4 TOS bits:
• minimize delay
• maximize throughput
• maximize reliability
• minimize monetary cost
Only 1 of these 4 bits can be turned on.
TPL - total packet length is the total length of the IP packet in bytes. The maximum length of an IP packet is therefore 65535 bytes.
IDENTIFICATION - this field is used when IP needs to fragment datagrams. It identifies each datagram and is incremented each time a datagram is sent. We'll see the meaning of this field when we talk about fragmentation.
FLAGS and FRAGMENT OFFSET - we'll also see these when we talk about fragmentation.

IP packet structure (continued)
TTL - time-to-live sets an upper limit on the number of routers through which a datagram can pass. This field is decremented each time the datagram passes through a router. When this field becomes 0, the datagram is dropped by the router and an ICMP message is sent to the datagram's sender.
PROTOCOL - this field identifies the DATA portion of the datagram (which protocol is encapsulated in the IP datagram).
HEADER CHECKSUM is calculated over the IP header only.
SOURCE and DESTINATION addresses are the sender's and receiver's IP addresses.
OPTIONS is a variable-length field which contains some options. We'll discuss some of them later. The options field always ends on a 32-bit boundary. PAD bytes (value 0) are added if necessary.
DATA is data.
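The header layout above can be sketched in code. This is a minimal Python sketch, not a full implementation: it unpacks only the fixed 20-byte part of the header and ignores options. The names `parse_ipv4_header`, `header_checksum` and the sample header bytes are illustrative assumptions, not something from the slides. The checksum routine uses the standard 16-bit one's-complement sum over the header.

```python
import struct

# Fixed 20-byte part of the IP header, network byte order (assumed layout name).
IP_HEADER_FORMAT = "!BBHHHBBH4s4s"


def parse_ipv4_header(raw: bytes) -> dict:
    """Unpack the fixed part of an IPv4 header (hypothetical helper)."""
    if len(raw) < 20:
        raise ValueError("an IP header is at least 20 bytes")
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack(IP_HEADER_FORMAT, raw[:20])
    return {
        "version": ver_ihl >> 4,               # upper 4 bits
        "header_len": (ver_ihl & 0x0F) * 4,    # IHL counts 32-bit words
        "tos": tos,
        "total_len": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,             # upper 3 bits
        "frag_offset": flags_frag & 0x1FFF,    # lower 13 bits
        "ttl": ttl,
        "protocol": protocol,
        "checksum": checksum,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }


def header_checksum(header: bytes) -> int:
    """One's-complement sum of the 16-bit words of the header only."""
    words = struct.unpack("!%dH" % (len(header) // 2), header)
    total = sum(words)
    while total >> 16:                         # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Illustrative header: version 4, IHL 5, TTL 64, protocol 6 (TCP), checksum 0.
sample = struct.pack(IP_HEADER_FORMAT, 0x45, 0, 40, 1, 0, 64, 6, 0,
                     bytes([1, 1, 1, 2]), bytes([1, 1, 1, 1]))
fields = parse_ipv4_header(sample)
```

A receiver can check a header by running `header_checksum` over it with the checksum field filled in; the result is 0 for an undamaged header.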
Special case IP addresses
netID    subnetID   hostID     Source?   Destination?   Description
0                   0          OK        never          this host on this net
0                   hostid     OK        never          specified host on this net
127                 anything   OK        OK             loopback address
-1                  -1         never     OK             limited broadcast (never forwarded)
netid               -1         never     OK             net-directed broadcast to netid
netid    subnetid   -1         never     OK             subnet-directed broadcast to netid, subnetid
netid    -1         -1         never     OK             all-subnets-directed broadcast to netid

IP address classes
Class   Range
A       0.0.0.0 to 127.255.255.255
B       128.0.0.0 to 191.255.255.255
C       192.0.0.0 to 223.255.255.255
D       224.0.0.0 to 239.255.255.255 (Multicast)
E       240.0.0.0 to 247.255.255.255

ARP and RARP
• ARP. For example, we are working on an Ethernet network. The Ethernet driver and adapter use MAC addresses. TCP/IP uses IP addresses. When a host wants to send data to another host, it knows only the receiver's IP address and passes this information to the TCP/IP stack. The TCP/IP stack then needs a mechanism to map between MAC and IP addresses. IP has two algorithms to solve this: ARP maps a 32-bit IP address to a 48-bit Ethernet address, and RARP maps a 48-bit Ethernet address to a 32-bit IP address.
• RARP. If a system doesn't have a hard or floppy drive and must boot from the network, it can't take its IP address from local resources. Such a system has only a MAC address. RARP is an algorithm which allows a system to obtain its IP address from the network.

ARP
Sending host: IP wants to send an IP datagram to a host with a given IP address. ARP must resolve the IP address to a hardware address.
Do I know the hardware address?
• Yes: the datagram is passed to the Ethernet driver.
• No: an ARP request is sent.
Every other host receives the ARP request through its Ethernet driver and asks: is somebody looking for my address?
• No: ignore the request.
• Yes: send an ARP reply.

RARP
Diskless workstation: boot, read own hardware network address, send RARP request.
RARP server: somebody wants to have an IP address! Give somebody an IP address from my table. Send RARP reply.
Diskless workstation: I have an IP address!

ARP packet
Field                     Bytes   Description
Dest address              6       Ethernet destination address. Broadcast for an ARP request.
Source address            6       Ethernet source address.
Type                      2       Frame type, 0x0806.
Hard type                 2       Specifies the hardware type. 1 for Ethernet.
Prot type                 2       Protocol type. 0x0800 for IP.
Hard size                 1       Size of the hardware address. 6 for Ethernet.
Prot size                 1       Size of the protocol address. 4 for IP.
op                        2       Type of operation: ARP request - 1, ARP reply - 2, RARP request - 3, RARP reply - 4.
Sender Ethernet address   6
Sender IP address         4
Target Ethernet address   6
Target IP address         4

ICMP - Internet Control Message Protocol (RFC 792)
Packet structure: IP header (20 bytes) followed by the ICMP message.
The first 4 bytes are the same for all types of messages: 8-bit type, 8-bit code, 16-bit checksum (for the entire ICMP message). The contents that follow depend on type and code.

ICMP address mask request and reply (12 bytes)
Type: 17 - request, 18 - reply
Code: 0
16-bit checksum (for the entire ICMP message)
Identifier (anything)
Sequence number (anything)
Subnet mask

ICMP timestamp request and reply (20 bytes)
Type: 13 - request, 14 - reply
Code: 0
16-bit checksum (for the entire ICMP message)
Identifier (anything)
Sequence number (anything)
32-bit originate timestamp
32-bit receive timestamp
32-bit transmit timestamp

ICMP port unreachable error
Ethernet header (14) | IP header (20) | ICMP header (8) | IP header of the datagram that generated the error (20) | UDP header (8)
Everything after the Ethernet header is the IP datagram, everything after the first IP header is the ICMP message, and everything after the ICMP header is the data portion of the ICMP message.
The data portion must include the IP header of the datagram that generated the error and at least 8 bytes that followed this IP header.
In this example it is the UDP header.

General format of the ICMP unreachable message
type 3 | code 0-15 | 16-bit checksum (for the entire ICMP message)
Unused (must be 0)
(8 bytes up to here)
IP header including options + first 8 bytes of the original IP datagram data

ICMP echo request and echo reply (PING)
Client: I want to know whether the server is alive. Send echo request.
Server: I received a "ping" to my address. Answer the client: send echo reply.
Client: The server is alive.
Packets:
type: 0 - reply, 8 - request | code 0 | 16-bit checksum (for the entire ICMP message)
identifier | sequence number
(8 bytes up to here)
Optional data
identifier - process ID of the sending process
sequence number - starts at 0 and is incremented every time a new echo request is sent
The server must echo the identifier and sequence number fields.
Historically ping has operated in a mode where it sends an echo request once a second.

IP record route option (ping -r option)
The client sends an echo request with the -r option through Router 1, Router 2 and Router 3 to the server. The server sends an echo reply.
Routers put the IP addresses of their outgoing interfaces into the RR option of the packet.
Option layout: code (1 byte) | len (1 byte) | ptr (1 byte) | recorded IP addresses (4 bytes each)
Code - 1-byte field specifying the type of IP option. For the RR option its value is 7.
Len - total number of bytes of the RR option. Ping always provides a 39-byte option, to record up to 9 IP addresses, which is the maximum.
Ptr - starts at 4 and grows by 4 with every recorded address: 4, 8, 12, 16, 20, 24, 28, ...
There is limited room in the IP header for the list of IP addresses, because the entire IP header is limited to 15 32-bit words (60 bytes). Only up to 40 bytes are available for the options field in the IP header.

BROADCASTING
Four types of IP broadcast
Name                   Address                                   Description
Limited                255.255.255.255                           A limited broadcast is never forwarded by a router.
Net-directed           netid.255.255.255                         Routers forward this kind of broadcast. The broadcast is addressed to the whole IP network netid.
Subnet-directed        host ID all 1 bits                        Broadcast for a specific subnet. Knowledge of the subnet mask is required. For example, 172.19.128.255 is the broadcast for subnet 172.19.128.x with subnet mask 255.255.255.0.
All-subnets-directed   subnet ID all 1 bits, host ID all 1 bits  Knowledge of the subnet mask is required. If the network is subnetted, this is an all-subnets-directed broadcast. If the network isn't subnetted, this is a net-directed broadcast.

MULTICASTING
Note! On an Ethernet a multicast address has the low-order bit of the first byte set: 01:00:00:00:00:00.
Addressing
Do you remember? Class D: 224.0.0.0 to 239.255.255.255, Multicast.
Here is the format of a class D IP address. The first four bits for class D are 1110 (1110 0000 = 224, 1110 1111 = 239). They are followed by the 28-bit multicast group ID.
The set of hosts listening to a particular IP multicast address is called a host group. A host group can span multiple networks. Membership in a host group is dynamic: hosts may join and leave host groups at will. There is no restriction on the number of hosts in a host group, and a host does not have to belong to a group to send a message to that group.

MULTICASTING
Converting Multicast Group addresses to Ethernet Addresses
The Ethernet addresses corresponding to IP multicasting are in the range 01:00:5e:00:00:00 through 01:00:5e:7f:ff:ff.
We have 23 bits in the Ethernet address to correspond to the IP multicast group ID. The mapping places the low-order 23 bits of the multicast group ID into these 23 bits of the Ethernet address. The upper 5 bits of the multicast group ID are not used to form the Ethernet address.
Class D IP address: 1110 | 5 unused bits | low-order 23 bits of the multicast group ID, copied to the Ethernet address
48-bit Ethernet address: 0000 0001 0000 0000 0101 1110 0 | 23 bits copied from the group ID
Since the upper 5 bits of the multicast group ID are ignored in this mapping, it is not unique. 32 different multicast group IDs map to the same Ethernet address (5 bits, 11111 = 31, so 32 values). The device driver or the IP software must perform filtering, since the interface card may receive multicast frames in which the host is really not interested.
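The mapping described above can be sketched in a few lines. This is a minimal Python sketch under the rules stated on the slide (fixed prefix 01:00:5e, low-order 23 bits copied); the function name `multicast_ip_to_mac` and the second sample address are illustrative assumptions.

```python
import ipaddress


def multicast_ip_to_mac(group: str) -> str:
    """Map a class D IP address to its Ethernet multicast address (hypothetical helper)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("not a class D (multicast) address: %s" % group)
    low23 = int(addr) & 0x7FFFFF          # keep the low-order 23 bits only
    mac = 0x01005E000000 | low23          # fixed prefix 01:00:5e, 24th bit stays 0
    return ":".join("%02x" % ((mac >> shift) & 0xFF)
                    for shift in range(40, -8, -8))


# Two group addresses that differ only in the 5 ignored bits
# collide on the same Ethernet address.
mac_a = multicast_ip_to_mac("224.8.8.1")
mac_b = multicast_ip_to_mac("225.136.8.1")
```

Both calls yield the same Ethernet address, which is why the driver or the IP software still has to filter received multicast frames.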
IGMP reports and queries (Internet Group Management Protocol)
Group     Address
Group 1   224.8.8.1
Group 2   224.8.8.2
Host side: the host runs Process 1, Process 2 and Process 3 and keeps a count of participants for each multicast group.
• When a process joins group 1, the host sends an IGMP report with Dest IP 224.8.8.1 and Group IP 224.8.8.1. When another process joins group 1, only the participant count changes; the group has already been reported.
• When a process joins group 2, the host sends an IGMP report with Dest IP 224.8.8.2 and Group IP 224.8.8.2.
Router side: the router keeps a list of multicast groups alive on interface 1.
• On a timer the router sends an IGMP query with Dest IP 224.0.0.1 and Group IP 0.
• The host answers with one IGMP report per group it still belongs to.
• When the last process leaves group 2, the host doesn't report group 2 next time and reports group 1 only.

IGMP packet
IP datagram = IP header (20 bytes) + IGMP message (8 bytes)
IGMP message:
Bits 0-3     4-bit IGMP version (1)
Bits 4-7     4-bit IGMP type (1-2)
Bits 8-15    unused
Bits 16-31   16-bit checksum
32-bit group address (class D IP address)
Version - 1
Type - 1: query sent by a multicast router, 2: response sent by a host
Group address - class D IP address. For a query the address is set to 0.

UDP
UDP packet
Bits 0-15             Bits 16-31
Source port           Destination port
UDP length            UDP checksum
DATA (if any)

TFTP - Trivial File Transfer Protocol
Packet types
IP datagram = IP header (20 bytes) + UDP header (8 bytes) + TFTP message
Requests:       opcode 1=RRQ, 2=WRQ (2 bytes) | filename (N bytes) | 0 (1 byte) | mode (N bytes) | 0 (1 byte)
Data packet:    opcode 3=data (2 bytes) | block number (2 bytes) | data (0-512 bytes)
ACK packet:     opcode 4=ACK (2 bytes) | block number (2 bytes)
Error packet:   opcode 5=error (2 bytes) | error number (2 bytes) | error message (N bytes) | 0 (1 byte)
Mode: netascii or octet

TFTP operations
File transfer
• The client sends a read request for "File": opcode 1, Dest UDP port 69, Source UDP port chosen by the client application.
• The server answers with data: opcode 3, block number 1, 512 bytes. The Source UDP port is a new port number, appointed for this file transfer by the TFTP server. Those port numbers will be used during the file transfer.
• The client receives block 1 and sends an ACK: opcode 4, block number 1.
• The server sends block number 2, 512 bytes. The client receives block 2 and sends an ACK: opcode 4, block number 2.
• The server sends the last block of "File". Data size < 512 bytes means that this is the last block of the file.
Can "File" be read by the client? Yes: the client process receives the whole file from the server.
When writing a file, the client sends the WRQ. If all is OK, the server responds with an ACK and block number 0. And so on.
Error messages. The server responds with this type of packet if a read request or write request can't be processed. A read or write error during file transmission can also cause this message to be sent, and the transmission is then terminated.

BOOTP: Bootstrap Protocol
BOOTP Packet Format
IP datagram = IP header (20 bytes) + UDP header (8 bytes) + BOOTP request/reply (300 bytes)
Field                         Bytes   Description
opcode                        1       1 - request, 2 - reply
hardware type                 1       1 for Ethernet
hardware address length       1       6 for Ethernet
hop count                     1       Set to 0 by the client
transaction ID                4       Set by the client and returned by the server
number of seconds             2       Set by the client
unused                        2
client IP address             4       Set by the client. If the client doesn't have an address, 0
your IP address               4       Filled in by the server with the client's IP address
server IP address             4       Filled in by the server
gateway IP address            4       Filled in by a proxy server, if there is one
client hardware address       16      Must be set by the client
server hostname               64      Null-terminated string that is optionally filled in by the server
boot filename                 128     Fully qualified, null-terminated pathname of a file to bootstrap from
vendor-specific information   64

BOOTP port numbers: Server 67, Client 68

Vendor-Specific information
If information in the vendor-specific field is provided, the first 4 bytes of this area are set to the IP address 99.130.83.99. This is called the magic cookie.
Pad: tag = 0 (1 byte)
End: tag = 255 (1 byte). End of the items. Any bytes after this should be set to 255.
Examples
Subnet mask: tag (1 byte) | len = 4 (1 byte) | subnet mask (4 bytes)
Gateway: tag (1 byte) | len = N (1 byte) | IP address of preferred gateway (4 bytes) | ... | IP address of another gateway (4 bytes)

BOOTP operations
Client. Port 68. Server. Port 67. Server IP - 1.1.1.1. Address for the client - 1.1.1.2.
• Client's request: Source IP - 0.0.0.0 (nobody), Dest IP - 255.255.255.255, Dest UDP port 67.
• Server's reply: Source IP - 1.1.1.1, Dest IP - 255.255.255.255, Your IP - 1.1.1.2, Server IP - 1.1.1.1, Gateway IP - 1.1.1.1, Boot file name - BFILE. The client's BOOTP process receives the information on UDP port 68.
• Is my IP address unique? The client sends an ARP request to see if anyone else on the network has the same address: Target IP - 1.1.1.2. The client sends a second ARP request 0.5 second later, and a third ARP request 0.5 second after it. In the third ARP request the sender address is 1.1.1.2 (the client's address). Nobody answers: my IP address is unique!
• Client's ARP request "who is the server": Sender IP - 1.1.1.2, Target IP - 1.1.1.1. Server's ARP reply: the answer is the server's hardware address.
• Client's TFTP request: the client reads the boot file BFILE from the server.
• I have a loadable image. The boot process can start!

TCP
TCP packet
Bits 0-15                                               Bits 16-31
Source port                                             Destination port
Sequence number
Acknowledgment number
Header length (4) | Reserved (6) | Flags (6)            Window
TCP checksum                                            Urgent pointer
Options (+ padding)
DATA
The MSS option is used only in SYN packets.

TCP sequence and acknowledgement
• The client sends 10 bytes with SEQ 10.
• The server receives SEQ 10 and 10 bytes of data. My ACK = 10 (SEQ) + 10 bytes = 20. The server sends its own data with its own SEQ and ACK = 20: SEQ 30, DATA 20, ACK 20.
• The client receives SEQ 30, DATA 20, ACK 20. The server received my data, its ACK = 20. My ACK = 30 + 20 = 50. My current SEQ = previous SEQ plus data sent = 10 + 10 = 20. The client sends SEQ 20, ACK 50.
• The server receives ACK 50. The client received my data, its ACK = 50. My current SEQ = previous SEQ plus data sent = 30 + 20 = 50.
And so on.

TCP connection establishment
ISN - initial sequence number. Client ISN = 145 (active open). Server ISN = 348 (passive open).
Segment              SEQ   ACK   Flags
Client -> Server     145         S
Server -> Client     348   146   SA
Client -> Server           349   A
• The client sends a packet with the S (SYN) flag. The packet contains the port number of the server that the client wants to connect to.
• The server receives the packet and responds with its own SYN segment containing its own SN and an ACK for the client's SYN plus one (a SYN consumes one sequence number): ACK = 145 + 1 = 146.
• The client receives the server's response (SYN segment). The server's response contains the correct ACK. The client acknowledges the server's SYN with ACK = server's SN + 1 = 348 + 1 = 349.
The connection establishment is completed. The three segments described complete the connection establishment. This is often called the three-way handshake.

TCP connection termination
Segment              SEQ   ACK   Flags
Client -> Server     658   426   FA
Server -> Client     426   659   A
Server -> Client     426   659   FA
Client -> Server     659   427   A
• The user types "quit", for example. The client (active close) sends a FIN, a packet with the FIN flag. The next ACK should be, for example, 426 and my own SN must be 658.
• The server (passive close) receives the FIN packet and responds with the corresponding ACK. I should close the second direction.
• Now the connection is «half-closed». The server may still be sending data to the client, with the corresponding ACKs.
Then the server closes the other direction of the connection. The client receives the FIN packet and responds with the corresponding ACK. The connection is closed.
A TCP connection is full duplex, and each direction must be shut down independently.

TCP states for connection establishment and termination
Client                          Segment              Server
active open                                          passive open
SYN_SENT                        SYN J ->             SYN_RCVD
ESTABLISHED                     <- SYN K, ack J+1
                                ack K+1 ->           ESTABLISHED
active close                                         passive close
FIN_WAIT_1                      FIN M ->             CLOSE_WAIT
FIN_WAIT_2                      <- ack M+1
TIME_WAIT                       <- FIN N             LAST_ACK
                                ack N+1 ->           CLOSED
The client stays in the TIME_WAIT state for twice the MSL, then goes to CLOSED.

2MSL state
• All received datagrams are discarded.
• It is impossible to open another connection for this socket pair (IP addresses and port numbers).

Quiet Time
Suppose a host in the 2MSL wait crashes, reboots within MSL seconds and immediately establishes new connections using the same local and foreign IP addresses and port numbers. Delayed segments from the old connections could then be misinterpreted as belonging to the new ones. To protect against this scenario RFC 793 states that TCP should not create any connection for MSL seconds after rebooting. This is called the quiet time.

Reset Segments
Reset segment - the "reset" bit in the TCP header is set to 1. Any queued data is thrown away and the reset is sent immediately. The receiver of the RST can tell that the other end did an abort instead of a normal close.
FIN - orderly release. RST - abortive release.
Example
We are trying to connect to a server port number that's not in use on the destination. UDP sends a "port unreachable" message in this case. TCP sends a reset segment.
Segment              SEQ   ACK   Flags
Client -> Server     400         S      (to port 10000)
Server -> Client     0     401   RA
The server doesn't have a process on port 10000.

Half-Open
All is fine! But sometimes something can crash. The computer that is still alive doesn't know that its peer has died. The peer hasn't sent FIN or RST segments. The connection is Half-Open.

Simultaneous Open
Usual connection open
active open                     Segment              passive open
SYN_SENT                        SYN J ->             SYN_RCVD
ESTABLISHED                     <- SYN K, ack J+1
                                ack K+1 ->           ESTABLISHED
Simultaneous Open
active open                     Segment              active open
SYN_SENT                        SYN J ->             SYN_SENT
                                <- SYN K
SYN_RCVD                        SYN J, ack K+1 ->    SYN_RCVD
                                <- SYN K, ack J+1
ESTABLISHED                                          ESTABLISHED
Result - one connection, not two.
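The state transitions from the slides above, including the simultaneous open, can be sketched as a lookup table. This is a minimal Python sketch, not a full TCP state machine: the event names (`active_open`, `recv_syn`, and so on) and the helper `run` are illustrative assumptions, and the LISTEN state comes from RFC 793 rather than from the slides. In the simultaneous open the final segment is a SYN with an ACK, which the sketch treats as `recv_ack`.

```python
# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    ("CLOSED", "active_open"): "SYN_SENT",       # client sends SYN J
    ("CLOSED", "passive_open"): "LISTEN",        # LISTEN is from RFC 793, not the slides
    ("LISTEN", "recv_syn"): "SYN_RCVD",          # server answers SYN K, ack J+1
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED", # client answers ack K+1
    ("SYN_SENT", "recv_syn"): "SYN_RCVD",        # simultaneous open
    ("SYN_RCVD", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # active close, send FIN M
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # passive close, send ack M+1
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",     # send ack N+1
    ("CLOSE_WAIT", "close"): "LAST_ACK",         # send FIN N
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",     # after twice the MSL
}


def run(events, state="CLOSED"):
    """Apply a list of events and return the final state (hypothetical helper)."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state


usual_open = run(["active_open", "recv_syn_ack"])
simultaneous_open = run(["active_open", "recv_syn", "recv_ack"])
active_close = run(["close", "recv_ack", "recv_fin", "timeout_2msl"],
                   state="ESTABLISHED")
```

Both the usual and the simultaneous open end in ESTABLISHED, which matches the slide's remark that the result is one connection, not two.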
Simultaneous Close
Usual connection close
active close                    Segment              passive close
FIN_WAIT_1                      FIN M ->             CLOSE_WAIT
FIN_WAIT_2                      <- ack M+1
TIME_WAIT                       <- FIN N             LAST_ACK
                                ack N+1 ->           CLOSED
Simultaneous Close
active close                    Segment              active close
FIN_WAIT_1                      FIN J ->             FIN_WAIT_1
                                <- FIN K
CLOSING                         ack K+1 ->           CLOSING
                                <- ack J+1
TIME_WAIT                                            TIME_WAIT

TCP options (RFC 793 and 1323) (examples)
End of option list:     kind=0 (1 byte)
No operation:           kind=1 (1 byte)
Those two options don't have a length field. The others do. The length is the total length, including the kind and len bytes.
Maximum segment size:   kind=2 (1 byte) | len=4 (1 byte) | MSS (2 bytes)
Window scale factor:    kind=3 (1 byte) | len=3 (1 byte) | shift count (1 byte)
Timestamp:              kind=8 (1 byte) | len=10 (1 byte) | timestamp value (4 bytes) | timestamp echo reply (4 bytes)

Delayed Acknowledgment (delayed ACK)
For example, the delayed ACK timer here is 200 ms. Look at the client.
Segments on the slide: PSH 2:6 (4) ack 11, PSH 6:12 (4) ack 11, PSH 11:15 (4) ack 12.
The client doesn't send an ACK immediately. It delays the ACK, hoping to have data to send in the same direction as the ACK ("piggyback ACK"). It can wait until the next 200 ms "delay" boundary of the kernel timer. When TCP decides to send a data packet, the ACK goes with it and the delayed ACK flag is turned off.

Nagle algorithm
Client side: APPLICATION, TCP and the TCP buffer. In this example the MSS is 20 bytes.
• The application writes 1 byte. TCP sends the packet: PSH 2:3 (1) ack 2.
• TCP has data to send but doesn't send a packet. We are waiting for the first packet's ACK.
• ack 3 is received. Now TCP can send the data from its buffer: PSH 3:5 (2) ack 2.
• TCP has collected an entire packet (20 bytes). It can send the packet, and TCP does: PSH 5:25 (20) ack 2. The server answers ack 5 and ack 25.
• Time has passed. PSH 8:10 (2) ack 55, PSH 55:56 (1) ack 10, ack 56. The ACK is received, I have data, I prepare and send a packet: PSH 10:12 (2) ack 56.
• Now I have data to send again, and I have a "free" ACK from the server (packet *). Before the packet was pushed onto the physical media another packet had been received from the server: PSH 56:58 (2) ack 10 (*), PSH 56:58 (2) ack 12.

TCP timers
• Retransmission timer. This timer is used when expecting an acknowledgment from the other end.
• Persist timer keeps window size information flowing even if the other end closes its receive window.
• Keepalive timer detects when the other end of an otherwise idle connection crashes or reboots.
• 2MSL timer measures the time a connection has been in the TIME_WAIT state.

Round-Trip Time
Send bytes: PSH 2:3 (1) ack 2. Receive the ACK for those bytes: ack 3. The time between them is the measured RTT (M).
The following formulas are used to calculate the retransmission timeout value (RTO):
Err = M - A
A <- A + g*Err
D <- D + h*(|Err| - D)
RTO = A + 4D
A - smoothed RTT (an estimator of the average)
D - smoothed mean deviation
g - 0.125 (1/8)
h - 0.25
Karn's algorithm. The algorithm specifies that when a retransmission occurs, we cannot update the RTT estimator when the acknowledgement for the retransmitted data finally arrives.

RTT example. Measurement.
Most implementations measure only one RTT value per connection at any time. If the timer for a given connection is already in use when a data segment is transmitted, that segment is not timed.
Seg #   Segment                    Timer
1       1:257 (256) ack 1          start timer
2       ack 257                    stop timer, RTT №1 = 1.061 sec
3       257:513 (256) ack 1        start timer
4       513:769 (256) ack 1
5       ack 513                    stop timer, RTT №2 = 0.808 sec
6       769:1025 (256) ack 1       start timer
7       1025:1281 (256) ack 1
8       ack 769
9       1281:1537 (256) ack 1
10      ack 1025                   stop timer, RTT №3 = 1.015 sec
11      1537:1793 (256) ack 1
12      ack 1281
...

RTT example. Measurement.
The timing is done by incrementing a counter every time the 500-ms TCP timer routine is invoked. The figure shows the relationship in our example between the actual RTT, which we can determine with a network analyzer, and the counted clock ticks.
[Figure: the segments from the table above against clock ticks at 0.03, 0.53, 1.03, 1.53, 2.03, 2.53 and 3.03 sec. RTT №1 is counted as 3 ticks, RTT №2 as 1 tick, RTT №3 as 2 ticks.]

RTT example. Calculation.
Err = M - A
A <- A + g*Err
D <- D + h*(|Err| - D)
RTO = A + 4D
RTT №1 = 1.061 sec = 3 ticks
RTT №2 = 0.808 sec = 1 tick
RTT №3 = 1.015 sec = 2 ticks
A is initialized to 0.
D is initialized to 3.
Initial RTO = A + 2D = 0 + 2*3 = 6 seconds (the factor 2 is used only for the initial calculation).
When the ACK for the first data segment arrives (segment 2), the measured RTT is 3 ticks (M = 1.5 sec) and our estimators are initialized as
A = M + 0.5 = 1.5 + 0.5 = 2
D = A/2 = 1
RTO = A + 4D = 2 + 4*1 = 6 seconds
When the ACK for the second data segment arrives (segment 5), the measured RTT is 1 tick (M = 0.5 sec) and the update is
Err = M - A = 0.5 - 2 = -1.5
A = A + g*Err = 2 - 0.125*1.5 = 1.8125
D = D + h*(|Err| - D) = 1 + 0.25*(1.5 - 1) = 1.125
RTO = A + 4D = 1.8125 + 4*1.125 = 6.3125
But most implementations use an RTO that is a multiple of 500 ms. In our instance the RTO will be 6 seconds.

Congestion example.
Normal data flow: the sender transmits 6401:6657 (256) ack 1 and 6657:6913 (256) ack 1. The receiver answers ack 6657 and ack 6913 and passes the data to the application. The sender continues with 6913:7169 (256) ack 1 and 7169:7425 (256) ack 1.
Congestion. For example, a router loses packet 6913:7169.
The sender goes on with 7425:7681 (256) ack 1, 7681:7937 (256) ack 1 and 7937:8193 (256) ack 1.
For every packet that arrives after the missing one the receiver answers ack 6913 (save 256). The host knows that the previous packet is missing. So the host sends an ACK for the previously received packet and saves the packet it has just received.
The sender receives the first duplicate ACK, the second duplicate ACK and the third duplicate ACK. On the third duplicate ACK it retransmits 6913:7169 (256) ack 1.
When the retransmission arrives, the receiver passes all the saved data to the application and answers ack 8193. The sender continues with 8193:8449 (256) ack 1 and receives ack 8449.
TCP counts the number of duplicate ACKs received, and when the third one is received it assumes that a segment has been lost.
TCP retransmits only one segment, starting with that sequence number. We discuss the fast retransmit algorithm later. After the retransmitted packet arrives, this host has all data bytes 6913-8192.

Slow start.
Slow start works with the congestion window - CWND. CWND is initialized to 1 (one) segment and is increased by one segment each time an ACK is received.
cwnd = 1: send 1:513 (512) ack 1. Receive ack 513.
cwnd = 2: send 513:1025 (512) ack 1 and 1025:1537 (512) ack 1. Receive ack 1025.
cwnd = 3: send 1537:2049 (512) ack 1 and 2049:2561 (512) ack 1. The sender sends only two segments because the ACK for segment 1025:1537 hasn't been received. Result: we have CWND = 3 and 3 segments sent (without ACK). Receive ack 1537.
cwnd = 4: send 2561:3073 (512) ack 1 and 3073:3585 (512) ack 1. And so on.
At some point the capacity of the network can be reached and some packets can be discarded. This situation tells the sender that its CWND is too large. We'll see the mechanism of CWND adjustment later.
The sender can transmit up to the minimum of the congestion window and the advertised window. CWND is flow control imposed by the sender.
CWND is maintained in bytes.

Congestion avoidance algorithm.
Congestion avoidance and slow start are different algorithms. But in practice congestion avoidance and slow start are implemented together. When congestion occurs, TCP slows down the transmission rate of packets into the network and then invokes slow start to get things going again.
Congestion avoidance and slow start require that two variables be maintained for each connection:
• CWND
• A slow start threshold size, SSTHRESH
There are two indications of packet loss:
• a timeout occurs
• the receipt of duplicate ACKs

Congestion avoidance algorithm. Combined algorithm's work.
1. Initialization: CWND = 1 segment, SSTHRESH = 65535 bytes.
2. Normal data flow, CWND is growing.
3. Congestion occurs! SSTHRESH = CWS/2 (CWS - current window size).
4. Is congestion indicated by a timeout? Yes: CWND = 1 segment. No: CWND is left as it is.
5. Retransmission, and so on. At last an ACK is received.
6. TCP increases CWND, but the way it increases depends on whether TCP performs slow start or congestion avoidance. CWND <= SSTHRESH? Yes: TCP is doing slow start. No: TCP is doing congestion avoidance.
SLOW START
Slow start has CWND start at one segment and be incremented by one segment every time an ACK is received. (Do you remember the slide before?) Slow start continues until we are halfway to where congestion occurred (since we recorded half of the window size that got us into trouble), and then congestion avoidance takes over.
CONGESTION AVOIDANCE
Congestion avoidance dictates that CWND be incremented by 1/CWND each time an ACK is received. So we want to increase CWND by at most one segment each RTT, whereas slow start will increment CWND by the number of ACKs received in an RTT.

Congestion avoidance algorithm. Illustration.
Starting point: we assume that congestion has just occurred when CWND had a value of 32 segments. Congestion was indicated by a timeout.
SSTHRESH = 32 / 2 = 16
CWND = 1
• 1 segment is sent at time 0.
• At time 1 the ACK is returned and CWND is incremented to 2 segments.
• At time 2 two ACKs are returned and CWND is incremented to 4 segments (CWND was 2 and two ACKs were received).
• And so on, until CWND = SSTHRESH. Slow start is stopped and congestion avoidance is started.
• Now congestion avoidance is working. The increase of CWND is linear, with a maximum increase of one segment per round-trip time.
[Figure: CWND in segments (0 to 20, SSTHRESH = 16 marked) against round-trip times 1 to 7, starting at the congestion moment.]

Fast retransmit and Fast recovery algorithms.
TCP host: I am able to send 3 packets.
The host sends 1:513 (512) ack 1 and 513:1025 (512) ack 1 into the network.
The host receives ack 513, then ack 513 again (1st duplicated ACK), ack 513 (2nd duplicated ACK) and ack 513 (3rd duplicated ACK).
After the 1st and 2nd duplicated ACK: a duplicated ACK may also be generated by reordering of segments.
After the 3rd duplicated ACK: I think a segment is lost.
The host doesn't wait for the retransmission timer to expire. It sends the lost segment.
This is the FAST RETRANSMIT ALGORITHM.
Slow start isn't performed, but the congestion avoidance algorithm is working. This is the FAST RECOVERY ALGORITHM.

Fast retransmit and Fast recovery algorithms. Combined algorithm's work.
1. The 3rd duplicate ACK is received: SSTHRESH = CWS/2. Retransmit the missing segment. CWND = SSTHRESH + 3 * segment size.
2. If another duplicate ACK arrives: INC(CWND; segment size); transmit a packet (if CWND allows).
3. An ACK is received which acknowledges all data segments sent between the lost packet and the 1st duplicate ACK: CWND = SSTHRESH. Congestion avoidance is now working.

Slow start and congestion avoidance example
Action                                  Comment         CWND   SSTHRESH
initialize                                              256    65535
send SYN, timeout, retransmit SYN                       256    512
receive SYN, ACK; send ACK                              256    512
receive ACK 257                         slow start      512    512
receive ACK 513                         slow start      768    512
receive ACK 769                         cong. avoid.    885    512
receive ACK 1025                        cong. avoid.    991    512
receive ACK 1281                        cong. avoid.    1089   512
Data segments sent: 1:257 (256), 257:513 (256), 513:769 (256), 769:1025 (256), 1025:1281 (256), 1281:1537 (256), 1537:1793 (256).
• Initialize: CWND = MSS = 256, SSTHRESH = 65535.
• Timeout occurs: SSTHRESH = CWS/2, limited to the minimum value = 512. CWND = 1 segment = 256.
• SYN, ACK and ACK: there are no changes here because new data is not being acknowledged.
• ACK 257: here is an ACK for data! CWND <= SSTHRESH, so we are in slow start. 1 segment = 256. CWND = CWND + 256 = 512.
• ACK 513: CWND <= SSTHRESH, slow start. CWND = CWND + 1 segment = 512 + 256 = 768.
• ACK 769: CWND > SSTHRESH, congestion avoidance. CWND <- 768 + 256*256/768 + 256/8. We are using integer arithmetic. CWND = 885.
• ACK 1025: CWND > SSTHRESH, congestion avoidance. CWND <- 885 + 256*256/885 + 256/8. We are using integer arithmetic. CWND = 991.
• ACK 1281: CWND > SSTHRESH, congestion avoidance. CWND <- 991 + 256*256/991 + 256/8. We are using integer arithmetic. CWND = 1089.
The real formula for 1/CWND is
cwnd <- cwnd + (segsize*segsize)/cwnd + segsize/8
[Figure: CWND and send sequence number (SEQ x 1000) against the segment numbers from the table.]

Slow start and congestion avoidance example
Seg #   Send                Receive    Comment           CWND   SSTHRESH
58                          ACK 6657   ACK new data      2426   512
59      8705:8961 (256)
60                          ACK 6657   dupl ACK #1       2426   512
61                          ACK 6657   dupl ACK #2       2426   512
62                          ACK 6657   dupl ACK #3       1792   1024
63      6657:6913 (256)                retransmission
64                          ACK 6657   dupl ACK #4       2048   1024
65                          ACK 6657   dupl ACK #5       2304   1024
66                          ACK 6657   dupl ACK #6       2560   1024
67      8961:9217 (256)
68                          ACK 6657   dupl ACK #7       2816   1024
69      9217:9473 (256)
70                          ACK 6657   dupl ACK #8       3072   1024
71      9473:9729 (256)
72                          ACK 8961   ACK new data      1280   1024
• The first two duplicated ACKs are received and counted, and CWND is left alone.
• The third duplicated ACK arrives. SSTHRESH = CWND/2 = 2426/2 = 1213, rounded down to a multiple of the segment size: 1024. CWND = SSTHRESH + number of duplicate ACKs * segment size = 1024 + 3 * 256 = 1792. The retransmission is sent. NOTE: here we have 2304 bytes of unacknowledged data from previous segments.
• Duplicated ACK #4 is received. CWND = CWND + 1 segment = 1792 + 256 = 2048. But CWND is not big enough to send data.
• Duplicated ACK #5 is received. CWND = CWND + 1 segment = 2048 + 256 = 2304. But CWND is not big enough to send data.
• Duplicated ACK #6 is received. CWND = CWND + 1 segment = 2304 + 256 = 2560. We can send data. Data is sent.
• Some more segments follow in the same situation.
• The ACK for new data is received. CWND is set to SSTHRESH = 1024. CWND <= SSTHRESH, so this is slow start: CWND = SSTHRESH + segment size = 1024 + 256 = 1280.
[Figure: CWND and send sequence number (SEQ x 1000) against segment numbers 58 to 72 from the table.]

TCP keepalive timer
A TCP implementation may use the keepalive option. This option is used to answer the question: is my peer alive? One example is a half-open connection. One peer has died but the other end doesn't know about it. It keeps a socket (IP address + port number) for the dead peer. But the peer doesn't need anything any more... And the alive one must know it! Usually the keepalive timer is 2 hours.
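On most systems an application asks for keepalive probes per socket. The following is a minimal Python sketch that only turns the option on with the standard `SO_KEEPALIVE` socket option; the helper name `make_keepalive_socket` is an illustrative assumption, and the probe timing (the 2-hour default mentioned above) is left to the operating system.

```python
import socket


def make_keepalive_socket() -> socket.socket:
    """Create a TCP socket with keepalive probes enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the TCP stack to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

The socket is not connected here; the option only matters once a connection exists and stays idle.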
There are 4 scenarios if there is no activity on a connection and one peer sends a keepalive probe to the other.

TCP keepalive timer

Scenario 1. Peer is alive and reachable.
Packets: ARP request, ARP reply, keepalive probe, ACK.
• That's all... The peers don't have any data to send to each other, but the connection is established.
• 2 (two) hours have passed...
• Client: My keepalive timer has expired. Is my peer alive? But I forgot his MAC address...
• The keepalive probe has a SEQ one less than it should be (for example, the receiver waits for SEQ = 14, but the keepalive probe has SEQ = 13). The receiver receives a packet with an incorrect SEQ and is forced to respond with an ACK which contains the next SEQ that the server is expecting.
• The client has received an answer from the server. It knows that the server is alive and resets its keepalive timer.

Scenario 2. Peer has crashed (it is down or in the process of rebooting).
Packets: keepalive probe.
• That's all... The peers don't have any data to send to each other, but the connection is established.
• 2 hours have passed.
• Client: My keepalive timer has expired. Is my peer alive?
• But the peer has crashed. No answer. 75 seconds… No answer. 75 seconds…
• TCP sends the request. We are not looking at the lower level (ARP) now. We need to know whether the peer is alive or not.
• The client sends 10 keepalive probes. If it doesn't receive a response, it considers the peer's host to be down and terminates the connection.

TCP keepalive timer

Scenario 3. Peer has crashed and rebooted.
Packets: keepalive probe, reset.
• I'll be laconic… 2 hours have passed.
• Client: Once again... My keepalive timer has expired. Is my peer alive?
• The host has crashed and rebooted. It has a working TCP stack but doesn't have a socket for that connection.
• Server: Are they crazy? I don't have such a socket! It answers with a reset, and the client terminates the connection.

Scenario 4. Peer is running, but unreachable. In this scenario the situation is the same as in scenario 2, from the client's point of view: no answer to the probes is received.
This situation may be caused by a failure of an intermediate router.

Path MTU Discovery

• Connection established: MTU = MIN (my interface MTU; MSS announced by the other end). If the other end doesn't specify an MSS, it defaults to 536. It is possible to save the path MTU on a per-route basis.
• We send datagrams with the DF (don't fragment) bit set.
• But things are changing… For example, a router has failed and the route has changed. Another router needs to fragment our datagram, but the datagram has the DF bit set. The router sends an ICMP error to our host.
• We have received the ICMP error "can't fragment": decrease the MTU.
  - The router generates the newer form of the ICMP error message, which contains its MTU: we use that MTU, and MSS = MTU - IP header - TCP header.
  - The router generates the older form of the ICMP error message: we take the next smaller MTU.
• Things keep changing… After a timeout we can try a bigger MTU (depending on the implementation). RFC 1191 recommends 10 minutes.

TCP packet with MSS option

TCP header fields: Source port, Destination port, Sequence number, Acknowledgment number, Data offset, Reserved, Flags, Window, Checksum, Urgent pointer, Options (+padding), DATA.

Maximum segment size option:
kind=2 (1 byte) | len=4 (1 byte) | MSS (2 bytes)

Path MTU Discovery. Example.

Hosts: Host 1, Router 1, Host 2. MTUs shown on the slide: 552, 296, 1500, 1500.
Packets: SYN and SYN, ACK carrying the MSS options (mss = 1460 and mss = 512); ACK; 1:513(512); ICMP error message: Host 1 unreachable, need to frag, mtu = 296 (newer implementation of the router's TCP); 1:257(256); ACK.
• Sender: MTU is 552! I can send a datagram with 512 bytes of data.
• Router: I can't send such a big datagram without fragmentation. But the DF bit is set => an error occurs!
• Sender: My MSS is now 256 (MTU = 296).

Window Scale Option
• Networks are growing and buffers are getting bigger, and a window size of 65535 (the maximum window size allowed by the window field in the TCP header) is not enough.
• Newer implementations use the WINDOW SCALE OPTION.
• Newer implementations can work with older implementations.
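To make the effect of the window scale option concrete, here is a minimal sketch of how the real window is derived from the 16-bit window field: the field is shifted left by the negotiated shift count. The function name `real_window` is hypothetical; the 0-14 limit on the shift count comes from the option's definition.

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value the 16-bit window field can hold
MAX_SHIFT = 14             # largest shift count allowed by the option


def real_window(window_field, shift_count):
    """Return the window in bytes advertised by a scaled window field."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift_count


print(real_window(4, 1))       # 8
print(real_window(65535, 0))   # 65535 (no scaling performed)
print(real_window(65535, 14))  # 1073725440, just under 1 gigabyte
```

A shift count of 0 leaves the field unchanged, which is how an end that does not want scaling can still take part in the negotiation.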
TCP header fields: Source port, Destination port, Sequence number, Acknowledgment number, Data offset, Reserved, Flags, Window, Checksum, Urgent pointer, Options (+padding), DATA.
• Window: there are only 16 bits.
• The option field can contain the WINDOW SCALE OPTION.

Window scale option:
kind=3 (1 byte) | len=3 (1 byte) | shift count (1 byte)

The WINDOW SCALE OPTION can be advertised only in a SYN segment. The scale factor is fixed in each direction when the connection is established.
Shift count: 0 - 14. 0 means no scaling is performed.

Window Scale Option. Setting.

To enable window scaling, both ends must have this option in their SYN segments.
Packets: SYN with wscale, then SYN, ACK with wscale (the slide shows the values 1 and 3).
• Active open: I think my window scale should be 1.
• Passive end: The active peer is going to use window scale! I understand it and choose my window scale = 0. I must set this option to 0.

How the scale works. The window scale is used to shift the value from the window field to get the real window size. For example, the window scale was set to 1 and the window size in the received packet is 4 (it's only an example):
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Using the window scale to shift the value left by 1 bit...
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
The real advertised window is 8.

Timestamp option

kind=8 (1 byte) | len=10 (1 byte) | timestamp value (4 bytes) | timestamp echo reply (4 bytes)

The timestamp option is used for better RTT calculation. The sender places a 32-bit value in the first field and the receiver echoes this back in the reply field. To use this option, both ends must be able to work with it. To establish this option, the active peer must set the timestamp option in its SYN and the other (passive) end must answer with the option too. Only one timestamp value is kept per connection.

How does TCP do it?
• Receiver's TCP keeps: the timestamp value to place in the next ACK (tsrecent), and the ACK number from the last ACK which was sent (lastback). The ACK number is the next sequence number which we are waiting for.
• Segment arrived: if the SEQ from the segment is lastback, tsrecent = timestamp value from the segment.
• When an ACK is sent: tsrecent is placed in the timestamp echo reply field, and lastback is set to the ACK value of the ACK being sent.

PAWS: Protection Against Wrapped Sequence Numbers

Consider a TCP connection using the window scale option with the largest possible window, 1 gigabyte (2^30). (The largest window is just smaller than this, 65535 x 2^14, not 2^16 x 2^14, but that doesn't affect this discussion.) Also assume the timestamp option is being used and that the timestamp value assigned by the sender increments by one for each window that is sent. (This is conservative. Normally the timestamp increments faster than this.) Figure 24.8 shows the possible data flow between the two hosts, when transferring 6 gigabytes. To avoid lots of 10-digit numbers, we use the notation G to mean a multiple of 1,073,741,824. We also use the notation from tcpdump that J:K means byte J through and including byte K-1.

Time   Bytes sent   Send SEQ #   Send timestamp   Receive
A      0G:1G        0G:1G        1                OK
B      1G:2G        1G:2G        2                OK but one segment lost and retransmitted
C      2G:3G        2G:3G        3                OK
D      3G:4G        3G:4G        4                OK
E      4G:5G        0G:1G        5                OK
F      5G:6G        1G:2G        6                OK but retransmitted segment reappears

The 32-bit sequence number wraps between times D and E. We assume that one segment gets lost at time B and is retransmitted. We also assume that this lost segment reappears at time F. This assumes that the time difference between the segment getting lost and reappearing is less than the MSL; otherwise the segment would have been discarded by some router when its TTL expired. As we mentioned earlier, it is only with high-speed connections that this problem appears, where old segments can reappear and contain sequence numbers currently being transmitted. We can also see from Figure 24.8 that using the timestamp prevents this problem.
The receiver considers the timestamp as a 32-bit extension of the sequence number. Since the lost segment that reappears at time F has a timestamp of 2, which is less than the most recent valid timestamp (5 or 6), it is discarded by the PAWS algorithm. The PAWS algorithm does not require any form of time synchronization between the sender and receiver. All the receiver needs is for the timestamp values to be monotonically increasing, and to increase by at least one per window.
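The PAWS check itself is small. The sketch below replays the timestamps from the table (one segment per time A-F, then the old segment from time B reappearing). It is a simplification under stated assumptions: the function names are invented for the example, every accepted segment updates the remembered timestamp, and timestamps are compared with 32-bit wraparound arithmetic so that the check still works after the timestamp counter itself wraps.

```python
MOD32 = 1 << 32
HALF32 = 1 << 31


def ts_older(ts_val, ts_recent):
    """True if ts_val is older than ts_recent in 32-bit wraparound order."""
    return (ts_val - ts_recent) % MOD32 >= HALF32


def paws_replay(arrivals):
    """Replay (label, timestamp) pairs and return the labels PAWS discards."""
    ts_recent = None
    discarded = []
    for label, ts_val in arrivals:
        if ts_recent is not None and ts_older(ts_val, ts_recent):
            discarded.append(label)  # old duplicate: drop it
            continue
        ts_recent = ts_val  # simplification: remember every accepted timestamp
    return discarded


arrivals = [("A", 1), ("B", 2), ("C", 3), ("D", 4), ("E", 5), ("F", 6),
            ("old B", 2)]
print(paws_replay(arrivals))  # ['old B']
```

The reappearing segment carries sequence numbers that are valid again after the wrap, but its timestamp of 2 is older than the most recent one, so it is the only segment discarded.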