R2D2: Scalable, Reliable and Rapid Data Delivery for Data Centers


Berk Atikoglu, Mohammad Alizadeh, Tom Yue,
Balaji Prabhakar, Mendel Rosenblum

Unreliable packet delivery due to:
• Corruption
  ▪ Dealt with via retransmission
• Congestion
  ▪ Particularly bad due to incast (fan-in) congestion

These losses make reliable transmission harder:
• Loss of throughput
• Increased flow transfer times
[Figure: incast scenario, a client (C) connected through a switch to multiple servers (S)]
1. The client sends a request to several servers.
2. The responses travel to the switch simultaneously.
3. The switch buffer overflows from the amount of data; some packets are dropped.

High-resolution timers
• Reduce retransmission timeouts (RTO) to hundreds of µs
  ▪ Proposed in Vasudevan et al. (SIGCOMM 2009); see also Chen et al. (WREN 2009)
• Cost a large number of CPU cycles for rapid interrupts or timer programming
• In virtualized environments, the high cost of processing hardware interrupts means even higher overhead

Large switch buffers
• Reduce incast occurrences by buffering enough packets
• Increased packet latency
• Complex implementation
• Large buffers are expensive
• Increased power usage

R2D2: collapse all flows into a single “meta-flow”
• A single wait queue holds packets sent by the host that are not yet ACKed
• A single retransmission timer; no per-flow state
• Provides reliable packet delivery
• Resides in Layer 2.5, a shim layer between Layer 2 and Layer 3

Key observation: exploit the uniformity of data center environments
• Path lengths between hosts are small (3–5 hops)
• RTTs are small (100–400 µs)
• Path bandwidths are uniformly high (1 Gbps, 10 Gbps)
• Therefore, the amount of data “in flight” from a 1G/10G source is less than 64/640 KB (bandwidth-delay product: 1 Gbps × 400 µs ≈ 50 KB; 10 Gbps × 400 µs ≈ 500 KB)
• Store source packets in R2D2 on the fly and rapidly retransmit dropped or corrupted packets (see the sketch below)
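The single-queue design amounts to a small amount of per-host state. Below is a minimal C sketch of what that state might look like; the names and fields are hypothetical illustrations, not the authors' implementation.

```c
/* Hypothetical per-host R2D2 state: one wait queue and one retransmission
 * timer shared by all flows (the "meta-flow"); no per-flow state. */
#include <stdint.h>
#include <stddef.h>

struct r2d2_pkt {
    uint32_t flow_id;       /* 4-tuple hash, used only to match returning ACKs */
    uint32_t seq_end;       /* last TCP sequence number covered by the packet */
    void    *clone;         /* stored copy used for retransmission */
    size_t   len;
    struct r2d2_pkt *next;  /* singly linked FIFO: oldest un-ACKed at the head */
};

struct r2d2_state {
    struct r2d2_pkt *wq_head;  /* the single wait queue for all flows */
    uint64_t timeout_ns;       /* one timer; doubled on every timeout */
    uint64_t min_timeout_ns;   /* e.g. the 3 ms minimum used in the evaluation */
    unsigned retries;          /* give up after a fixed maximum */
};
```

Because the data in flight per host is bounded by roughly 64/640 KB, this queue holds at most a few dozen packets at 1 Gbps.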
[Figure: standard protocol stacks, Layer 3 over Layer 2, on the communicating hosts]

[Figure: the same stacks with the Layer 2.5 shim inserted between Layer 3 and Layer 2]
[Figure: the R2D2 sender, a Layer 2.5 shim between Layer 3 and Layer 2]
1. An outbound packet is intercepted by R2D2.
2. A timer is started.
3. A copy of the packet is placed in the wait queue.
4. The returned TCP ACK removes all ACKed packets held in the wait queue.

When a flow times out (sketched below):
• Retransmit the first un-ACKed packet (fill the hole).
• Back off: double the flow's timeout value.

When an ACK comes in:
• Reset the timeout back-off.
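A minimal C sketch of the two paths just described, reusing the hypothetical state from the earlier sketch; retransmit(), rearm_timer(), free_pkt() and r2d2_give_up() are assumed helpers, shown as prototypes only.

```c
#define R2D2_MAX_RETRIES 10   /* assumed limit, matching the evaluation setup */

void retransmit(struct r2d2_pkt *p);                      /* assumed helpers */
void rearm_timer(struct r2d2_state *st, uint64_t ns);
void free_pkt(struct r2d2_pkt *p);
void r2d2_give_up(struct r2d2_state *st);

/* On timeout: retransmit the oldest un-ACKed packet and back off. */
void r2d2_on_timeout(struct r2d2_state *st)
{
    if (!st->wq_head)
        return;                           /* nothing outstanding */
    if (++st->retries > R2D2_MAX_RETRIES) {
        r2d2_give_up(st);                 /* reliable, but not guaranteed */
        return;
    }
    retransmit(st->wq_head);              /* fill the hole */
    st->timeout_ns *= 2;                  /* double the timeout value */
    rearm_timer(st, st->timeout_ns);
}

/* On ACK: drop every packet the ACK covers, then reset the back-off.
 * (Sequence-number wraparound is ignored for brevity.) */
void r2d2_on_ack(struct r2d2_state *st, uint32_t flow_id, uint32_t ack_seq)
{
    struct r2d2_pkt **p = &st->wq_head;
    while (*p) {
        if ((*p)->flow_id == flow_id && (*p)->seq_end <= ack_seq) {
            struct r2d2_pkt *done = *p;
            *p = done->next;              /* unlink the ACKed packet */
            free_pkt(done);
        } else {
            p = &(*p)->next;
        }
    }
    st->retries = 0;
    st->timeout_ns = st->min_timeout_ns;  /* reset the back-off */
}
```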

Reliable, but not guaranteed, delivery
• A maximum number of retransmissions before giving up

State sharing
• Only one wait queue; all packets go into the same queue

No change to the network stack
• Kernel module in Linux; driver in Windows
• A hardware version is OS-independent

Incremental deployability
• Possible to protect only a subset of flows

Implemented as a Linux kernel module on kernel 2.6.*
• No need to modify the kernel
• Can be loaded/unloaded easily

Incoming/outgoing TCP/IP packets are captured using Netfilter
• Captured packets are put into a queue
  ▪ Only meta-data is kept in the queue; the packet itself is cloned

An L2.5 thread processes the packets in the queue periodically (see the hook sketch below).
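A minimal sketch of the outbound capture path on a 2.6-series kernel, using Netfilter's POST_ROUTING hook (the hook signature changed in later kernels); r2d2_enqueue() is a hypothetical helper that records the meta-data and hands the clone to the L2.5 thread. This is an illustration of the approach, not the authors' module.

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

void r2d2_enqueue(struct sk_buff *copy);   /* assumed helper */

static unsigned int r2d2_out_hook(unsigned int hooknum, struct sk_buff *skb,
                                  const struct net_device *in,
                                  const struct net_device *out,
                                  int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph = ip_hdr(skb);

    if (iph->protocol == IPPROTO_TCP) {
        /* Clone the packet so it can be retransmitted later; only the
         * meta-data plus the clone go into the single wait queue. */
        struct sk_buff *copy = skb_clone(skb, GFP_ATOMIC);
        if (copy)
            r2d2_enqueue(copy);   /* processed periodically by the L2.5 thread */
    }
    return NF_ACCEPT;             /* never block the original packet */
}

static struct nf_hook_ops r2d2_ops = {
    .hook     = r2d2_out_hook,
    .pf       = PF_INET,
    .hooknum  = NF_INET_POST_ROUTING,  /* after routing, just before Layer 2 */
    .priority = NF_IP_PRI_LAST,
};

static int __init r2d2_init(void)  { return nf_register_hook(&r2d2_ops); }
static void __exit r2d2_exit(void) { nf_unregister_hook(&r2d2_ops); }

module_init(r2d2_init);
module_exit(r2d2_exit);
MODULE_LICENSE("GPL");
```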
[Figure: testbed, one rack of 48 servers on 1 GbE / 10 GbE]

48 Dell PowerEdge 2950 servers
• 2 × Intel Core 2 Quad Q9550
• 16 GB ECC DRAM
• Broadcom NetXtreme II 5708 1 GbE NIC
• CentOS 5.3 Final; Linux 2.6.28-10

Switches
• Netgear GS748TNA (48 ports, GbE)
• Cisco Catalyst 4948 (48 ports, GbE)
• BNT RackSwitch G8124 (24 ports, 10 GbE)

R2D2
• Minimum timeout: 3 ms
• Max retransmissions: 10
• Delayed ACK disabled

TCP: CUBIC
• minRTO: 200 ms
• Segmentation offloading: disabled
• TCP timestamps: disabled

Number of servers (N): 1, 2, 4, 8, 16, 32, 46

File size (S): 1 MB, 20 MB

Client (sketched below):
• Requests S/N MB from each server
• Issues a new request once all servers have responded

Measurements:
• Goodput
• Retransmission ratio = retransmitted packets / total packets sent by TCP
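A minimal sketch of such an incast client, assuming hypothetical server addresses 10.0.0.1..N on port 5001 and a request protocol in which the client simply sends the desired byte count; error handling is elided.

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define N 8                       /* number of servers in this round */
#define S (1 << 20)               /* total transfer size: 1 MB */
#define ROUNDS 100                /* number of request rounds to run */

int main(void)
{
    int socks[N];
    char buf[65536];
    size_t per_server = S / N;    /* client asks each server for S/N bytes */

    for (int r = 0; r < ROUNDS; r++) {
        /* 1. Fire a request at every server. */
        for (int i = 0; i < N; i++) {
            struct sockaddr_in a = { .sin_family = AF_INET,
                                     .sin_port   = htons(5001) };
            char ip[32];
            snprintf(ip, sizeof ip, "10.0.0.%d", i + 1);  /* assumed addresses */
            inet_pton(AF_INET, ip, &a.sin_addr);
            socks[i] = socket(AF_INET, SOCK_STREAM, 0);
            connect(socks[i], (struct sockaddr *)&a, sizeof a);
            write(socks[i], &per_server, sizeof per_server);
        }
        /* 2. The responses converge on the switch simultaneously: incast.
         *    Drain every server's reply before starting the next round. */
        for (int i = 0; i < N; i++) {
            size_t got = 0;
            while (got < per_server) {
                ssize_t n = read(socks[i], buf, sizeof buf);
                if (n <= 0) break;
                got += (size_t)n;
            }
            close(socks[i]);
        }
    }
    return 0;
}
```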
[Figure: Goodput (Mbps) vs. number of servers (1–46), R2D2 vs. TCP, for 1 MB and 20 MB file sizes]
[Figure: Retransmission ratio vs. number of servers (1–46) for 1 MB and 20 MB file sizes]
• 6 clients (instead of 1)
• 32 servers
• Each client requests a file from each of the 32 servers

[Figure: Goodput (Mbps) in the 6-client test, R2D2 vs. TCP, for 1 MB and 20 MB file sizes]
[Figure: Goodput (Mbps) vs. number of servers (1–46), R2D2 vs. TCP, for 1 MB and 20 MB file sizes]
[Figure: Retransmission ratio vs. number of servers (1–46) for 1 MB and 20 MB file sizes]
[Figure: Per-client goodput (Mbps) for clients 1–6, R2D2 vs. TCP, for 1 MB and 20 MB file sizes]
File size: 10 MB
Number of servers: 1, 5, 9, 13, 17, 21

[Figure: Goodput (Mbps, up to ~9000) and retransmission ratio vs. number of servers, R2D2 vs. TCP]
R2D2 is scalable and fast, and provides reliable delivery
• No need to modify the kernel
• Can be loaded/unloaded easily
• Improves reliability in data center networks

A hardware implementation in the NIC can be much faster
• Works well with TCP offload options such as segmentation and checksum offloading
• An FPGA implementation is in development