R2D2: Scalable, Reliable and Rapid Data Delivery for Data Download
Transcript
Berk Atikoglu, Mohammad Alizadeh, Tom Yue, Balaji Prabhakar, Mendel Rosenblum
Unreliable packet delivery is due to:
Corruption
▪ Dealt with via retransmission
Congestion
▪ Particularly bad due to incast (fan-in) congestion
These losses make reliable transmission harder:
Loss of throughput
Increase in flow transfer times
[Figure: incast scenario — a client (C) reaches several servers (S) through a single switch]
1. The client sends a request to several servers.
2. The responses travel to the switch simultaneously.
3. The switch buffer overflows from the amount of data; some packets are dropped.
High-resolution timers
Reduce retransmission timeouts (RTO) to hundreds of µs
▪ Proposed by Vasudevan et al. (SIGCOMM 2009); see also Chen et al. (WREN 2009)
Cost: a large number of CPU cycles spent on rapid interrupts and timer programming
In virtualized environments, the high cost of processing hardware interrupts means even higher overhead
Large switch buffers
Reduce incast occurrences by buffering enough packets
Increased packet latency
Complex implementation
Large caches are expensive
Increased power usage
R2D2: collapse all flows into a single “meta-flow”
A single wait queue holds packets sent by the host that are not yet ACKed
A single retransmission timer; no per-flow state
Provides reliable packet delivery
Resides in Layer 2.5, a shim layer between Layer 2 and Layer 3
Key observation: exploit the uniformity of data center environments
Path lengths between hosts are small (3 – 5 hops)
RTTs are small (100 – 400 µs)
Path bandwidths are uniformly high (1Gbps, 10Gbps)
Therefore, the amount of data from a 1G/10G source “in flight” is less than 64/640 KB
Store source packets in R2D2 on the fly; rapidly retransmit dropped or corrupted packets
[Figure: network stack before and after — R2D2 inserts Layer 2.5 as a shim between Layer 3 (L3) and Layer 2 (L2)]
[Figure: R2D2 sender at Layer 2.5, between Layer 3 and Layer 2]
1. An outbound packet is intercepted by R2D2.
2. A timer is started.
3. A copy of the packet is placed in the wait queue.
4. The returned TCP ACK removes all ACKed packets held in the wait queue.
When a flow times out:
Retransmit the first un-ACKed packet (fill the hole).
Back off: double the flow’s timeout value.
When an ACK comes in:
Reset the timeout back-off.
Reliable, but not guaranteed, delivery
A maximum number of retransmissions before giving up
State sharing
Only one wait queue; packets from all flows go in the same queue
No change to the network stack
Kernel module in Linux; driver in Windows
A hardware version is OS-independent
Incremental deployability
Possible to protect only a subset of flows
Implemented as a Linux kernel module on kernel 2.6.*
No need to modify the kernel
Can be loaded/unloaded easily
Incoming/outgoing TCP/IP packets are captured using Netfilter
Captured packets are put into a queue
▪ Only meta-data is kept in the queue; the packet is cloned
The L2.5 thread processes the packets in the queue periodically
Testbed: 1 rack, 48 servers, 1GbE / 10GbE
48 Dell PowerEdge 2950 servers
Intel Core 2 Quad Q9550 × 2
16GB ECC DRAM
Broadcom NetXtreme II 5708 1GbE NIC
CentOS 5.3 Final; Linux 2.6.28-10
Switches
Netgear GS748TNA (48 ports, GbE)
Cisco Catalyst 4948 (48 ports, GbE)
BNT RackSwitch G8421 (24 ports, 10GbE)
R2D2 settings
Minimum timeout: 3ms
Max retransmissions: 10
Delayed ACK: disabled
TCP settings (CUBIC TCP)
minRTO: 200ms
Segmentation offloading: disabled
TCP timestamps: disabled
Number of servers (N): 1, 2, 4, 8, 16, 32, 46
File size (S): 1MB, 20MB
Client:
Requests (S/N) MB from each server
Issues a new request when all servers respond
Measurements:
Goodput
Retransmission ratio = retransmitted packets / total packets sent by TCP
[Figure: Goodput (Mbps) vs. number of servers (1, 2, 4, 8, 16, 32, 46), R2D2 vs. TCP, for 1MB and 20MB files]
[Figure: Retransmission ratio vs. number of servers, R2D2 vs. TCP, for 1MB and 20MB files]
6 clients (instead of 1 client)
32 servers
Each client requests a file from each of the 32 servers
[Figure: Goodput (Mbps) per test, R2D2 vs. TCP, with 6 clients, for 1MB and 20MB files]
[Figure: Goodput (Mbps) vs. number of servers, R2D2 vs. TCP, for 1MB and 20MB files]
[Figure: Retransmission ratio vs. number of servers, R2D2 vs. TCP, for 1MB and 20MB files]
[Figure: Per-client goodput (Mbps) for clients 1–6, R2D2 vs. TCP, for 1MB and 20MB files]
10GbE experiment — file size: 10MB; number of servers: 1, 5, 9, 13, 17, 21
[Figure: Goodput (Mbps) and retransmission ratio vs. number of servers, R2D2 vs. TCP]
R2D2 is scalable and fast, and provides reliable delivery
No need to modify the kernel
Can be loaded/unloaded easily
Improves reliability in data center networks
A hardware implementation in the NIC can be much faster
Works well with TCP offload options such as segmentation and checksum offloading
An FPGA implementation is under development