Scaling Collective Multicast Fat-tree Networks


Scaling Collective Multicast Fat-tree Networks
Sameer Kumar
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
ICPADS'04

Collective Communication

• Communication operation in which all processors, or a large subset, participate
  • For example, broadcast
• Often a performance impediment
• All-to-all communication
  • All-to-all personalized communication (AAPC)
  • All-to-all multicast (AAM)

Communication Model

• Overhead of a point-to-point message:

  Tp2p = α + mβ

  • α is the total software overhead of sending the message
  • β is the per-byte network overhead
  • m is the size of the message

• Direct all-to-all multicast overhead:

  TAAM = (P − 1) × (α + mβ)

  • α dominates when m is small
  • β dominates when m is large

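As a concrete illustration, a minimal Python sketch of this α–β cost model, using the α = 9 µs and 294 MB/s figures quoted later on the predicted-performance slide (P and m below are example values):

```python
# Latency/bandwidth (alpha-beta) cost model for direct all-to-all multicast.
# Constants follow the values quoted on the prediction slide; P and m are
# example inputs, not measurements.

ALPHA = 9e-6        # per-message software overhead (seconds)
BETA = 1 / 294e6    # per-byte network cost (seconds/byte, ~294 MB/s)

def t_p2p(m: int) -> float:
    """Cost of one point-to-point message of m bytes."""
    return ALPHA + m * BETA

def t_aam_direct(p: int, m: int) -> float:
    """Direct all-to-all multicast: each node sends to the other P-1 nodes."""
    return (p - 1) * t_p2p(m)

# Small messages are alpha-dominated; large messages are beta-dominated.
print(t_aam_direct(128, 64))       # ~1.2 ms, mostly software overhead
print(t_aam_direct(128, 100_000))  # ~44 ms, mostly transmission time
```
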
Optimization Strategies

• Short messages
  • Parameter α dominates
  • Message combining: reduce the total number of messages
  • Multistage algorithms send combined messages along a virtual topology
• Large messages
  • Parameter β dominates
  • Network contention becomes the bottleneck
  • Network-topology-specific optimizations that minimize contention

Direct Strategies

• Direct strategies optimize all-to-all multicast for large messages
  • Minimize network contention
  • Topology-specific optimizations that take advantage of contention-free schedules

Fat-tree Networks

• Popular network topology for clusters
• Bisection bandwidth O(P)
• Scales to several thousands of nodes
• Topology: k-ary n-tree

k-ary n-trees

[Figure: (a) 4-ary 1-tree, (b) 4-ary 2-tree, (c) 4-ary 3-tree]

Contention Free Permutations

• Fat-trees have a nice property: some processor permutations are contention free
• Prefix permutation k
  • Processor i sends data to i XOR k
• Cyclic shift by k
  • Processor i sends a message to (i + k) % P
  • Contention free if k = a × 4^j, a ∈ {1, 2, 3}, j ≥ 0
• Contention-free permutations were presented by Heller et al. for the CM-5

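These patterns are easy to state in code; a small sketch (the helper names are mine), including the contention-free test for the 4-ary fat-trees used here:

```python
def prefix_permutation(p: int, k: int) -> list[int]:
    """Destination of each processor i under prefix permutation k: i XOR k."""
    return [i ^ k for i in range(p)]

def cyclic_shift(p: int, k: int) -> list[int]:
    """Destination of each processor i under cyclic shift by k: (i + k) % P."""
    return [(i + k) % p for i in range(p)]

def shift_is_contention_free(k: int) -> bool:
    """Cyclic shift by k is contention free on a 4-ary fat-tree
    when k = a * 4**j with a in {1, 2, 3} and j >= 0."""
    while k and k % 4 == 0:
        k //= 4
    return k in (1, 2, 3)

print(prefix_permutation(8, 1))   # [1, 0, 3, 2, 5, 4, 7, 6]
print(cyclic_shift(8, 2))         # [2, 3, 4, 5, 6, 7, 0, 1]
print([k for k in range(1, 17) if shift_is_contention_free(k)])
# [1, 2, 3, 4, 8, 12, 16]
```
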
Prefix Permutation 1

[Figure: nodes 0–7; prefix permutation by 1, processor p sends to p XOR 1]

Prefix Permutation 2

[Figure: nodes 0–7; prefix permutation by 2, processor p sends to p XOR 2]

Prefix Permutation 3

[Figure: nodes 0–7; prefix permutation by 3, processor p sends to p XOR 3]

Prefix Permutation 4 …

[Figure: nodes 0–7; prefix permutation by 4, processor p sends to p XOR 4]

Cyclic Shift by k

[Figure: nodes 0–7; cyclic shift by 2]

Quadrics: HPC Interconnect

• Popular interconnect
  • Several machines in the Top500 use Quadrics
  • Used by Pittsburgh's Lemieux (6 TF) and ASCI-Q (20 TF)
• Features
  • Low latency (5 μs for MPI)
  • High bandwidth (320 MB/s/node)
  • Fat-tree topology
  • Scales to 2K nodes

Effect of Contention on Throughput

[Figure: node bandwidth (MB/s) of the k-th permutation, for k = 0 to 256, comparing Cyclic Shift, Prefix Send, and Cyclic Shift (Main Memory)]

• Drop in bandwidth at k = 4, 16, 64
• Sending data from main memory is much slower

Performance Bottlenecks

• 320-byte packet size
  • The packet protocol restricts bandwidth to faraway nodes
• PCI/DMA bandwidth is restrictive
  • Achievable bandwidth is only 128 MB/s

Quadrics Packet Protocol

[Figure: sender/receiver timeline. To send the first packet: send header, ack header, send payload, receive ack. The next packet is sent only after the first has been acked. For nearby nodes this achieves full link utilization.]

Far Away Messages

[Figure: sender/receiver timeline for faraway nodes. The next packet is sent only after the ack returns, so the longer round trip leaves the link at low utilization.]

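The effect can be illustrated with a toy stop-and-wait calculation (the link rate is the 320 MB/s figure from the Quadrics slide; the round-trip delays are assumed values, not measurements):

```python
# Toy stop-and-wait model: each 320-byte packet must be acknowledged before
# the next is sent, so achievable bandwidth drops as the ack round trip grows
# with distance. Delays here are illustrative assumptions.

PACKET_BYTES = 320
LINK_RATE = 320e6   # bytes/s, peak per-node bandwidth

def effective_bandwidth(ack_round_trip_s: float) -> float:
    """Throughput when each packet waits for its predecessor's ack."""
    send_time = PACKET_BYTES / LINK_RATE
    return PACKET_BYTES / (send_time + ack_round_trip_s)

print(effective_bandwidth(0.0) / 1e6)    # nearby node: ~320 MB/s, full utilization
print(effective_bandwidth(2e-6) / 1e6)   # faraway node: ~107 MB/s, low utilization
```
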
AAM on Fat-tree Networks

• Overcome the bottlenecks
  • Messages sent from NIC memory have 2.5 times better performance
  • Avoid sending messages to faraway nodes
• Use contention-free permutations
  • Permutation: every processor sends a message to a different destination

AAM Strategy: Ring

• Performs all-to-all multicast by sending messages along a ring formed by the processors
  • Equivalent to P−1 cyclic-shift-by-1 operations
  • Congestion free
  • Has appeared in the literature before
• Drawback
  • Processors send a different message in each step

[Figure: ring of processors 0, 1, 2, …, i, i+1, …, P−1]

Prefix Send Strategy

• P−1 prefix permutations
  • In stage j, processor i sends a message to processor (i XOR (j+1))
  • Congestion free
  • Can send messages from Elan memory
• Bad performance on large fat-trees
  • Sends P/2 messages to faraway nodes at distance P/2 or more
  • Wire/switch delays restrict performance

K-Prefix Strategy

• Hybrid of the ring and prefix send strategies
  • Prefix send used within partitions of size k
  • Ring used between the partitions
  • Our contribution!

[Figure: prefix send within each fat-tree partition of size k; ring across the partitions, over processors 0, 1, 2, …, i, i+1, …, P−1]

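A minimal sketch of the three schedules, assuming only the stage rules stated above; schedules are lists of (source, destination) pairs per stage, and the function names are mine:

```python
def ring_schedule(p: int):
    """P-1 stages; in each stage processor i forwards to (i + 1) % P.
    Note that every processor sends a *different* message in each step."""
    return [[(i, (i + 1) % p) for i in range(p)] for _ in range(p - 1)]

def prefix_send_schedule(p: int):
    """P-1 prefix permutations; in stage j, processor i sends to i XOR (j+1)."""
    return [[(i, i ^ (j + 1)) for i in range(p)] for j in range(p - 1)]

def k_prefix_schedule(p: int, k: int):
    """Hybrid: prefix send within each partition of size k (a power of two
    here), with a cyclic-shift-by-k ring step between rounds to move data
    across partitions. One possible interleaving of the stages."""
    stages = []
    rounds = p // k
    for r in range(rounds):
        for j in range(k - 1):  # prefix send inside every partition
            stages.append([(i, (i & ~(k - 1)) | ((i % k) ^ (j + 1)))
                           for i in range(p)])
        if r < rounds - 1:      # ring step across partitions
            stages.append([(i, (i + k) % p) for i in range(p)])
    return stages

# 8 processors, partitions of size 4: P-1 = 7 stages, like ring and prefix send.
print(len(k_prefix_schedule(8, 4)))   # 7
```
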
Performance

[Figure: collective multicast performance on 128 nodes; completion time (ms) vs message size (bytes, 10000–100000) for MPI, ring, prefix-send, and k-prefix]

Node bandwidth (MB/s) each way:

Nodes   MPI   Prefix   K-Prefix
 64     123    260       265
128      99    224       259
144      94     -        261
256      95    215       256

• Our strategies send messages from Elan memory

Cost Equation

  Tk-prefix = (P − 1)(α + αb + m·βem) + (P/k)·m·δ

• α, host and network software overhead
• αb, cost of a barrier (barriers are needed to synchronize the nodes)
• βem, per-byte network transmission cost
• δ, per-byte copying overhead to NIC memory
• P, number of processors
• k, size of the partition in k-Prefix

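A sketch of this cost equation in Python, plugging in the parameter values quoted on the predicted-performance slide; the partition size k = 8 is an assumed example, not the paper's setting:

```python
ALPHA = 9e-6          # alpha: software overhead per message (s)
ALPHA_B = 15e-6       # alpha_b: barrier cost per stage (s)
BETA_EM = 1 / 294e6   # beta_em: per-byte network transmission cost (s/byte)
DELTA = 1 / 294e6     # delta: per-byte copy cost into NIC memory (s/byte)

def t_k_prefix(p: int, k: int, m: int) -> float:
    """T_k-prefix = (P - 1)(alpha + alpha_b + m*beta_em) + (P/k)*m*delta."""
    return (p - 1) * (ALPHA + ALPHA_B + m * BETA_EM) + (p / k) * m * DELTA

# 128 nodes, 100 KB messages, partitions of size 8 (example values)
print(t_k_prefix(128, 8, 100_000) * 1e3, "ms")   # ~52 ms
```
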
K-Prefixlb Strategy

[Figure: AAM performance on 128 nodes; completion time (ms) vs message size (bytes) for MPI, k-prefix, k-prefixlb, and k-prefixlb-cpu]

• The k-Prefixlb strategy synchronizes the nodes only after every few steps

CPU Overhead

• Strategies should also be evaluated on their compute overhead
• Asynchronous, non-blocking primitives are needed
• A data-driven system like Charm++ supports this automatically

Predicted vs Actual Performance

[Figure: k-Prefix performance on 128 nodes; completion time (ms) vs message size (bytes), measured k-prefix vs K-Prefix predicted]

• The predicted plot assumes α = 9 µs, αb = 15 µs, β = δ = 294 MB/s

Missing Nodes

• Missing nodes arise when nodes in the fat tree are down
• Prefix-Send and K-Prefix do badly in this scenario

Node bandwidth (MB/s) with 1 missing node:

Nodes   MPI   Prefix-Send   K-Prefix
128      72      158           169
240      69       -            173

K-Shift Strategy

• Processor i sends data to the consecutive nodes [i−k/2+1, …, i−1, i+1, …, i+k/2] and to i+k
• Contention free, with good performance on non-contiguous nodes when k = 8
• K-Shift gains because most of the destinations for each node do not change in the presence of missing nodes
• Our contribution!

[Figure: node i sending to neighbors i−k/2+1 … i−1, i+1 … i+k/2, and to i+k, along the node sequence 0 … P−1]

Node bandwidth (MB/s) with one missing node:

Nodes   K-Shift   K-Prefix
128       196       169
240       197       173

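A minimal sketch of the K-Shift destination set, directly encoding the rule above (the helper name is mine):

```python
def k_shift_destinations(i: int, k: int, p: int) -> list[int]:
    """Destinations of processor i under K-Shift: the consecutive nodes
    [i-k/2+1, ..., i-1, i+1, ..., i+k/2] plus i+k (all modulo P)."""
    dests = [(i + d) % p for d in range(-k // 2 + 1, k // 2 + 1) if d != 0]
    dests.append((i + k) % p)
    return dests

print(k_shift_destinations(0, 8, 128))
# [125, 126, 127, 1, 2, 3, 4, 8]
```
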
Conclusion

• We optimize AAM for Quadrics QsNet
  • Copying a message to the NIC and sending it from there gives higher bandwidth
  • K-Prefix avoids sending messages to faraway nodes
  • Missing nodes are handled by the K-Shift strategy
• Cluster interconnects other than Quadrics also have such problems
• Impressive performance results
• CPU overhead should be a metric for evaluating AAM strategies