
Curbing Delays in Datacenters:
Need Time to Save Time?
Mohammad Alizadeh
Sachin Katti, Balaji Prabhakar
Insieme Networks
Stanford University
Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency
Datacenter Networks

• Message latency is King: need very high throughput, very low latency
– 10-40Gbps links
– 1-5μs fabric latency
– 1000s of server ports
• Workloads: web, app, cache, db, MapReduce, HPC, monitoring
Transport in Datacenters

• TCP is widely used, but has poor performance
– Buffer hungry: adds significant queuing latency
• Queuing latency vs. the baseline fabric latency of 1-5μs:
– TCP: ~1-10ms
– DCTCP: ~100μs
– Goal: ~zero latency. How do we get there?
Reducing Queuing: DCTCP vs TCP

Experiment: 2 flows (Win 7 stack), Broadcom 1Gbps switch; DCTCP uses an ECN marking threshold of 30KB.

[Figure: switch queue length (packets / KBytes) over time (seconds) for TCP with 2 flows vs. DCTCP with 2 flows]
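The slides compare DCTCP against TCP but do not spell out DCTCP's mechanism. As a hedged sketch of the standard DCTCP window update (ECN-marked packets drive an EWMA estimate α of the congestion fraction, and the window is cut by α/2 once per window), with illustrative class and parameter names:

```python
# Sketch of the DCTCP congestion-window update, written as a small,
# self-contained helper. Class and method names are illustrative, not from
# the talk; the update rule itself is the one from the DCTCP algorithm,
# simplified to one call per window of ACKs.

class DctcpSender:
    def __init__(self, cwnd_pkts=10.0, g=1.0 / 16):
        self.cwnd = cwnd_pkts   # congestion window, in packets
        self.alpha = 0.0        # EWMA estimate of the fraction of marked packets
        self.g = g              # EWMA gain

    def on_window_acked(self, acked_pkts, marked_pkts):
        """Call once per window of ACKs with how many packets were ECN-marked."""
        frac_marked = marked_pkts / max(acked_pkts, 1)
        # DCTCP estimator: alpha <- (1 - g)*alpha + g*F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac_marked
        if marked_pkts > 0:
            # Cut the window in proportion to the measured congestion extent
            # (instead of TCP's halving on every loss).
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # No marks: additive increase, as in TCP congestion avoidance.
            self.cwnd += 1.0


if __name__ == "__main__":
    s = DctcpSender()
    for _ in range(20):
        # A mild, persistent marking signal (2 of 10 packets marked).
        s.on_window_acked(acked_pkts=10, marked_pkts=2)
        print(f"cwnd={s.cwnd:5.2f}  alpha={s.alpha:.3f}")
```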
Towards Zero Queuing

ns2 simulation: 10 DCTCP flows (senders S1...Sn), 10Gbps switch, ECN marking at 9Gbps (90% utilization).
[Figure: queueing latency and total latency (μs), and throughput (Gbps) vs. the 9Gbps target, plotted against round-trip propagation time (0-40μs); the latency floor is ≈23μs.]
Window-based Rate Control

Sender (Cwnd = 1) → Receiver, link capacity C = 1
RTT = 10 → C×RTT = 10 pkts
Throughput = 1/RTT = 10%
Window-based Rate Control

Sender (Cwnd = 1) → Receiver, link capacity C = 1
RTT = 2 → C×RTT = 2 pkts
Throughput = 1/RTT = 50%
Window-based Rate Control

Sender (Cwnd = 1) → Receiver, link capacity C = 1
RTT = 1.01 → C×RTT = 1.01 pkts
Throughput = 1/RTT = 99%
Window-based Rate Control

Sender 1 (Cwnd = 1) and Sender 2 (Cwnd = 1) → Receiver
RTT = 1.01 → C×RTT = 1.01 pkts
As propagation time → 0: queue buildup is unavoidable
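To make the arithmetic on the preceding slides concrete, here is a minimal sketch (the helper names are illustrative, not from the talk) of the window-bounded throughput and of the standing queue that must form once the total outstanding window exceeds the bandwidth-delay product:

```python
# Window-based rate control arithmetic from the slides above.
# A single sender's throughput is bounded by cwnd / (C * RTT); once the
# bandwidth-delay product (BDP) drops below the total outstanding window,
# the excess packets can only sit in the switch queue.

def throughput_fraction(cwnd_pkts: float, bdp_pkts: float) -> float:
    """Fraction of link capacity achieved with a fixed window."""
    return min(1.0, cwnd_pkts / bdp_pkts)

def standing_queue(num_senders: int, cwnd_pkts: float, bdp_pkts: float) -> float:
    """Packets that cannot fit 'in flight' and therefore queue at the switch."""
    return max(0.0, num_senders * cwnd_pkts - bdp_pkts)

if __name__ == "__main__":
    for bdp in (10, 2, 1.01):   # C*RTT in packets, matching the three slides
        print(f"BDP = {bdp:5.2f} pkts -> throughput = {throughput_fraction(1, bdp):.0%}")
    # Two senders, cwnd = 1 each, BDP barely above one packet:
    print("standing queue with 2 senders:",
          standing_queue(2, 1, 1.01), "pkts")
```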
So What?

Window-based rate control needs lag in the loop. A near-zero latency transport must:
1. Use timer-based rate control / pacing (see the sketch below)
2. Use small packet sizes

Both increase CPU overhead (not practical in software). They are possible in hardware, but complex (e.g., HULL, NSDI'12).

Or… change the problem!
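As a hedged illustration of item 1 above, this is roughly what timer-based pacing looks like in software. The function and parameter names are made up for this sketch, and a real implementation would avoid sleeping in user space precisely because of the CPU overhead the slide points out:

```python
import time

# Minimal sketch of timer-based pacing: instead of releasing a window of
# packets per RTT, send each packet at a fixed interval so the sending rate
# tracks a target rate directly.

def pace_packets(send_fn, num_packets: int, pkt_bytes: int, rate_bps: float):
    """Send num_packets, spacing them so the average rate is rate_bps."""
    interval = pkt_bytes * 8 / rate_bps          # seconds between packets
    next_send = time.monotonic()
    for seq in range(num_packets):
        now = time.monotonic()
        if now < next_send:
            time.sleep(next_send - now)          # a busy-polling NIC/driver would spin instead
        send_fn(seq)
        next_send += interval

if __name__ == "__main__":
    sent = []
    # 1500B packets at 12Mbps -> one packet every ~1ms.
    pace_packets(sent.append, num_packets=5, pkt_bytes=1500, rate_bps=12e6)
    print(f"sent {len(sent)} packets at ~1ms spacing")
```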
Changing the Problem…

• Switch port with a FIFO queue: queue buildup is costly → need precise rate control
• Switch port with a priority queue: queue buildup is irrelevant → coarse rate control is OK

[Figure: the same packets (priorities 5, 9, 4, 3, 7, 1) buffered at a switch port, first in a FIFO queue, then in a priority queue]
pFabric
DC Fabric: Just a Giant Switch

[Figure: hosts H1-H9 attached to the fabric, redrawn as a single giant switch with every host appearing on both the transmit (TX) and receive (RX) side]
DC transport = flow scheduling on a giant switch

Objective? → Minimize average flow completion time (FCT), subject to the ingress & egress capacity constraints of the hosts (H1-H9) on the TX and RX sides.
“Ideal” Flow Scheduling

• The problem is NP-hard [Bar-Noy et al.]
– A simple greedy algorithm gives a 2-approximation

[Figure: flows scheduled across ingress ports 1-3 and egress ports 1-3]
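The slide only names the greedy algorithm. Below is a hedged sketch of one greedy pass in the giant-switch model: grant flows in increasing order of remaining size, subject to each ingress and egress port carrying at most one flow at a time. The flow tuples and function name are illustrative; the formal algorithm and its 2-approximation guarantee are in Bar-Noy et al.:

```python
# Hedged sketch of a greedy scheduling pass for the "giant switch" model:
# walk the flows in increasing order of remaining size and grant each one
# whose ingress and egress ports are still free at this instant.

def greedy_schedule(flows):
    """flows: list of (flow_id, src_port, dst_port, remaining_bytes).
    Returns the flow_ids scheduled for this instant."""
    busy_src, busy_dst, scheduled = set(), set(), []
    for fid, src, dst, _size in sorted(flows, key=lambda f: f[3]):
        if src not in busy_src and dst not in busy_dst:
            scheduled.append(fid)
            busy_src.add(src)
            busy_dst.add(dst)
    return scheduled

if __name__ == "__main__":
    flows = [("A", 1, 2, 3_000), ("B", 1, 3, 500), ("C", 2, 2, 800)]
    # "B" (smallest) wins ingress port 1, so "A" must wait;
    # "C" uses ports 2->2, which are still free, so it is also scheduled.
    print(greedy_schedule(flows))   # -> ['B', 'C']
```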
pFabric in 1 Slide

Packets carry a single priority #
• e.g., prio = remaining flow size

pFabric Switches
• Very small buffers (~10-20 pkts for a 10Gbps fabric)
• Send highest priority / drop lowest priority pkts

pFabric Hosts
• Send/retransmit aggressively
• Minimal rate control: just prevent congestion collapse
Key Idea

Decouple flow scheduling from rate control:
• Switches implement flow scheduling via local mechanisms.
• Hosts use simple window-based rate control (≈TCP) to avoid high packet loss.
• Queue buildup does not hurt performance → window-based rate control is OK.
pFabric Switch

• Priority Scheduling: send the highest priority packet first
• Priority Dropping: drop the lowest priority packets first
• prio = remaining flow size
• Small “bag” of packets per port

[Figure: hosts H1-H9 sending through a switch port that holds packets with priorities 5, 9, 4, 3, 7, 1]
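A hedged sketch of the per-port behavior this slide describes, with illustrative class and method names. Since priority = remaining flow size, a smaller number means higher priority; a full buffer evicts its lowest-priority packet on enqueue, and dequeue always sends the highest-priority packet:

```python
# Sketch of a pFabric-style switch port buffer. The class and field names are
# illustrative, not from the talk or paper; the scheduling/dropping rules
# follow the slide above.

class PFabricPort:
    def __init__(self, capacity_pkts: int = 20):   # the small "bag" of packets
        self.capacity = capacity_pkts
        self.buf = []                               # list of (prio, payload)

    def enqueue(self, prio: int, payload) -> bool:
        """Returns False if this packet (or a displaced one) was dropped."""
        self.buf.append((prio, payload))
        if len(self.buf) > self.capacity:
            worst = max(range(len(self.buf)), key=lambda i: self.buf[i][0])
            dropped = self.buf.pop(worst)           # drop the lowest priority packet
            return dropped[1] is not payload
        return True

    def dequeue(self):
        """Send the highest priority (smallest remaining-flow-size) packet."""
        if not self.buf:
            return None
        best = min(range(len(self.buf)), key=lambda i: self.buf[i][0])
        return self.buf.pop(best)[1]

if __name__ == "__main__":
    port = PFabricPort(capacity_pkts=3)
    for prio in (5, 9, 4, 3):       # the 4th enqueue evicts the prio-9 packet
        port.enqueue(prio, f"pkt(prio={prio})")
    print(port.dequeue())           # -> pkt(prio=3)
```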
pFabric Switch Complexity

• Buffers are very small (~2×BDP per port)
– e.g., C = 10Gbps, RTT = 15μs → buffer ~ 30KB
– Today’s switch buffers are 10-30x larger
• Priority scheduling/dropping
– Worst case: minimum-size packets (64B) → 51.2ns to find the min/max of ~600 numbers
– A binary comparator tree does this in 10 clock cycles; current ASICs have a ~1ns clock
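A quick back-of-the-envelope check of the numbers on this slide (illustrative script; the link speed, RTT, packet size, and the ~600-entry buffer figure are taken from the slide, everything else is derived from them):

```python
import math

# Re-derive the complexity figures quoted on the slide.
C_bps   = 10e9          # 10 Gbps link
rtt_s   = 15e-6         # 15 microsecond RTT
pkt_B   = 64            # minimum-size packet, in bytes
entries = 600           # approximate buffer occupancy in 64B packets (from the slide)

bdp_bytes = C_bps * rtt_s / 8
print(f"BDP = {bdp_bytes/1e3:.1f} KB, 2xBDP = {2*bdp_bytes/1e3:.1f} KB "
      "(the slide quotes ~30KB per port)")

t_min_pkt = pkt_B * 8 / C_bps
print(f"64B packet time = {t_min_pkt*1e9:.1f} ns")      # the 51.2ns budget

tree_depth = math.ceil(math.log2(entries))
print(f"binary comparator tree: {tree_depth} levels -> ~{tree_depth} cycles at a ~1ns clock")
```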
Why does this work?

Invariant for ideal scheduling: at any instant, have the highest priority packet (according to the ideal algorithm) available at the switch.

• Priority scheduling → high priority packets traverse the fabric as quickly as possible
• What about dropped packets?
– Lowest priority → not needed until all other packets depart
– Buffer > BDP → enough time (> RTT) to retransmit
Evaluation

(144-port fabric; Search traffic pattern)

[Figure: FCT normalized to optimal in an idle fabric vs. load (0.1-0.8) for Ideal, pFabric, PDQ, DCTCP, and TCP-DropTail]

Recall: “Ideal” is REALLY idealized!
• Centralized with a full view of flows
• No rate-control dynamics
• No buffering
• No pkt drops
• No load-balancing inefficiency
Mice FCT (<100KB)

[Figure: average and 99th-percentile normalized FCT vs. load (0.1-0.8) for Ideal, pFabric, PDQ, DCTCP, and TCP-DropTail]
Conclusion

• Window-based rate control does not work at near-zero round-trip latency
• pFabric: simple, yet near-optimal
– Decouples flow scheduling from rate control
– Allows use of coarse window-based rate control
• pFabric is within 10-15% of “ideal” for realistic DC workloads (SIGCOMM’13)
Thank You!