Using Packet Information for Efficient Communication in NoCs


Using Packet Information for Efficient Communication in NoCs
Prasanna Venkatesh R, Madhu Mutyam
PACE Lab
IIT Madras
21-07-2015
Agenda
• Motivation
• Existing techniques to handle multicasts at NoC
• Dynamic Multicast Tree
• VC as Cache
• Packet Concatenation
• IPC Results
• Energy Analysis
• Conclusion
Motivation
[Figure: maximum and mean number of sharers per multicast for Barnes, Cholesky, Fmm, Ocean, Radiosity, Radix, and Swaptions, plus the mean across benchmarks]
In SPLASH and PARSEC benchmarks, up to 87% of the nodes participate in a multicast, but the average is only 7.5%.
Motivation
[Figure: percentage of sharers and percentage of multicasts with more than 0.5 × the maximum sharer count, for Barnes, Cholesky, Fmm, Ocean, Radiosity, Radix, Swaptions, and the mean]
In SPLASH and PARSEC benchmarks, up to 87% of the nodes participate in a multicast, but this maximum-sized communication occurs less than 4% of the time.
Multicasts: Solutions in the literature
• Separate injections flood the network with redundant copies
• Multicasts: a single copy travels along the common path and forks into multiple copies where the destinations diverge
• Simplifies routing logic
Dynamic Multicast routing can make use of idle paths to avoid congestion.
But is it possible to meet timing constraints?
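To make the fork-at-the-common-path idea concrete, here is a minimal C++ sketch of tree-based multicast forking at a router. It assumes an 8×8 mesh and plain XY routing for the per-destination next hop (not the Odd-Even routing used later in the talk); all names are illustrative, not the authors' implementation.

```cpp
// Sketch: tree-based multicast forking at a router (illustrative;
// routing here is plain XY, not the talk's Odd-Even scheme).
#include <map>
#include <vector>
#include <iostream>

constexpr int MESH_X = 8, MESH_Y = 8;            // assumed 8x8 mesh (64 nodes)
enum Port { LOCAL, EAST, WEST, NORTH, SOUTH };

Port xyNextPort(int cur, int dst) {
    int cx = cur % MESH_X, cy = cur / MESH_X;
    int dx = dst % MESH_X, dy = dst / MESH_X;
    if (dx > cx) return EAST;
    if (dx < cx) return WEST;
    if (dy > cy) return NORTH;
    if (dy < cy) return SOUTH;
    return LOCAL;
}

// Group the remaining destinations by output port: one copy of the packet
// leaves per group, carrying only that group's destination subset.
std::map<Port, std::vector<int>> forkMulticast(int cur, const std::vector<int>& dests) {
    std::map<Port, std::vector<int>> branches;
    for (int d : dests) branches[xyNextPort(cur, d)].push_back(d);
    return branches;
}

int main() {
    // A multicast at router 27 heading to three sharers.
    auto branches = forkMulticast(27, {3, 30, 59});
    std::cout << "packet forks into " << branches.size() << " copies\n";
}
```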
Our Proposals to achieve multicast efficiency
• Dynamic multicast tree construction using redundant route-computation units
  • Will it penalize unicasts and cause starvation?
• Three optimizations on unicasts to enhance dynamic multicasting:
  • VC as cache
  • Packet concatenation
  • Critical word first
Critical Word First
• Borrowed from the critical-word-first cache data-transfer optimization technique
• Make efficient use of the flit-level split of a packet containing a cache block
• Send the requested word with the header flit
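A minimal sketch of critical-word-first packing, assuming a 64-byte block split into eight 8-byte words with two words per flit payload; the function and field names are hypothetical, not the authors' implementation. The block is rotated so the requested word lands in the first (header) flit's payload.

```cpp
// Sketch: critical-word-first flit packing (illustrative only; block and
// flit sizes are assumptions).
#include <cstdint>
#include <vector>
#include <iostream>

constexpr int WORDS_PER_BLOCK = 8;   // assumed 64 B block, 8 B words
constexpr int WORDS_PER_FLIT  = 2;   // assumed flit payload width

// Rotate the block so the requested word is sent first, then pack into flit
// payloads. The header flit (routing info + critical word) is built from
// flits[0]; body flits carry the remaining words in wrap-around order.
std::vector<std::vector<uint64_t>> packCriticalWordFirst(
        const uint64_t (&block)[WORDS_PER_BLOCK], int criticalWordIdx) {
    std::vector<uint64_t> rotated(WORDS_PER_BLOCK);
    for (int i = 0; i < WORDS_PER_BLOCK; ++i)
        rotated[i] = block[(criticalWordIdx + i) % WORDS_PER_BLOCK];

    std::vector<std::vector<uint64_t>> flits;
    for (int i = 0; i < WORDS_PER_BLOCK; i += WORDS_PER_FLIT)
        flits.emplace_back(rotated.begin() + i, rotated.begin() + i + WORDS_PER_FLIT);
    return flits;   // flits[0] holds the critical word
}

int main() {
    uint64_t block[WORDS_PER_BLOCK] = {10, 11, 12, 13, 14, 15, 16, 17};
    auto flits = packCriticalWordFirst(block, 5);   // requestor asked for word 5
    std::cout << "header flit payload starts with word " << flits[0][0] << "\n"; // 15
}
```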
Dynamic Multicast Tree
Method
• Compute the Odd-Even route at each router for all multicast destinations
  • Takes one RC cycle per destination
• Add a redundant RC unit to speed up this process
  • No extra chip area needed, because of the unit's simplicity
Caveats
• Bottlenecks unicasts
• Slow when there is no congestion
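A sketch of how the redundant route-computation unit speeds up a multicast, under the assumption that each RC unit resolves one destination per cycle so two units resolve two per cycle; the Odd-Even turn-model logic itself is abstracted into a placeholder and the names are illustrative.

```cpp
// Sketch: a redundant RC unit lets a multicast's destinations be resolved
// two per cycle instead of one per cycle (illustrative; the actual
// Odd-Even route logic is abstracted away).
#include <algorithm>
#include <cstddef>
#include <vector>
#include <iostream>

struct RouteRequest { int current; int destination; };

// Placeholder for the per-destination Odd-Even route computation the slide
// describes; here it just returns a dummy output-port id.
int computeOddEvenRoute(const RouteRequest& r) {
    return (r.destination > r.current) ? 1 : 2;   // stand-in result
}

// With numRcUnits = 1 this models the baseline (one RC cycle per
// destination); with 2 it models the added redundant RC unit.
int cyclesToRouteMulticast(const std::vector<RouteRequest>& dests, int numRcUnits) {
    int cycles = 0;
    for (std::size_t i = 0; i < dests.size(); i += numRcUnits) {
        std::size_t end = std::min(dests.size(), i + (std::size_t)numRcUnits);
        for (std::size_t j = i; j < end; ++j)
            (void)computeOddEvenRoute(dests[j]);   // RC units work in parallel
        ++cycles;                                  // one RC cycle per batch
    }
    return cycles;
}

int main() {
    std::vector<RouteRequest> dests(7, {27, 3});   // 7 multicast destinations
    std::cout << "1 RC unit:  " << cyclesToRouteMulticast(dests, 1) << " cycles\n"; // 7
    std::cout << "2 RC units: " << cyclesToRouteMulticast(dests, 2) << " cycles\n"; // 4
}
```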
VC as Cache: Scenario
• A shared cache block is requested by more than one node within a given time frame
• The owner sends a multicast of the block to all the requestors
• Another request for the same block arrives after this multicast
• The owner resends the block after processing this request
Solution – Add the new requestor to the already-processed multicast midway!
• Compare up to five multicast packets with an incoming request packet at the router
• If matched:
  • Forward the request to the owner for coherence and book-keeping, time-stamped with the previous message
  • Add this requestor to the multicast destinations
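A sketch of the compare-and-piggyback step, assuming each in-flight multicast carries its block address, issue timestamp, and destination list, and that the comparison is done on the block address; the struct and field names are assumptions, not the actual router microarchitecture.

```cpp
// Sketch: "VC as cache" matching at a router (illustrative; field names and
// the address-based match are assumptions about the mechanism).
#include <algorithm>
#include <cstdint>
#include <vector>
#include <iostream>

struct MulticastPacket {
    uint64_t blockAddr;
    uint64_t timestamp;            // when the owner issued this multicast
    std::vector<int> destinations;
};

struct RequestPacket {
    uint64_t blockAddr;
    int requestor;
    bool piggybacked;              // true once served by an in-flight multicast
    uint64_t matchedTimestamp;
};

constexpr std::size_t MAX_COMPARED = 5;   // compare at most five buffered multicasts

// Compare the incoming request against multicasts buffered in the router's VCs.
// On a match: add the requestor to that multicast and still forward the request
// to the owner (for coherence book-keeping), tagged with the multicast's timestamp.
bool tryPiggyback(RequestPacket& req, std::vector<MulticastPacket>& bufferedMcasts) {
    std::size_t n = std::min(bufferedMcasts.size(), MAX_COMPARED);
    for (std::size_t i = 0; i < n; ++i) {
        if (bufferedMcasts[i].blockAddr == req.blockAddr) {
            bufferedMcasts[i].destinations.push_back(req.requestor);
            req.piggybacked = true;
            req.matchedTimestamp = bufferedMcasts[i].timestamp;
            return true;
        }
    }
    return false;
}

int main() {
    std::vector<MulticastPacket> vcs = {{0xABC0, 100, {5, 9}}, {0xDEF0, 120, {2}}};
    RequestPacket req{0xABC0, 14, false, 0};
    if (tryPiggyback(req, vcs))
        std::cout << "requestor 14 added; owner notified with timestamp "
                  << req.matchedTimestamp << "\n";
}
```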
Packet Concatenation
• A request is a single-flit packet
• When the RC units are busy, we can club single-flit packets headed to the same destination into a "superpacket"
• From there on, one RC cycle computes the routes for multiple packets
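A sketch of packet concatenation, assuming the only grouping key is the destination node and that a superpacket is simply the set of queued single-flit packets sharing that destination; the names are illustrative.

```cpp
// Sketch: packet concatenation (illustrative; a "superpacket" here is just a
// container of single-flit packets that share a destination).
#include <map>
#include <vector>
#include <iostream>

struct SingleFlitPacket { int source; int destination; };

struct SuperPacket {
    int destination;
    std::vector<SingleFlitPacket> members;   // all share one computed route
};

// While the RC units are busy, single-flit packets queued for the same
// destination are merged; each superpacket then needs only one RC cycle.
std::vector<SuperPacket> concatenate(const std::vector<SingleFlitPacket>& waiting) {
    std::map<int, SuperPacket> byDest;
    for (const auto& p : waiting) {
        auto& sp = byDest[p.destination];
        sp.destination = p.destination;
        sp.members.push_back(p);
    }
    std::vector<SuperPacket> out;
    for (auto& kv : byDest) out.push_back(std::move(kv.second));
    return out;
}

int main() {
    std::vector<SingleFlitPacket> waiting = {{1, 40}, {7, 40}, {3, 12}, {9, 40}};
    auto supers = concatenate(waiting);
    std::cout << waiting.size() << " packets -> " << supers.size()
              << " route computations\n";            // 4 packets -> 2 RC cycles
}
```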
Configuration for simulations
• Simulators: Multi2sim 4.0.1, Booksim 2.0, Orion 2.0
• Real-time simulation
• 64 nodes: 32 core + L1 nodes and 32 shared, distributed L2 cache banks
• 1 flit for request and coherence packets, 5 flits for a cache block
• Benchmarks:
  • SPLASH2 and PARSEC workloads with 32 threads
  • All high-injection workloads, picked after an initial study of their injection rates
IPC Results
[Figure: Radix – % improvement over Odd-Even (base). Abbreviations: C – Critical Word First, V – VC as Cache, D – Dynamic Multicast Tree, P – Packet Concatenation]
IPC Results
[Figure: Geomean – % improvement over Odd-Even (base). Abbreviations: C – Critical Word First, V – VC as Cache, D – Dynamic Multicast Tree, P – Packet Concatenation]
Scaling to 512 Nodes: IPC Results
[Figure: MultiPrograms – % improvement over Odd-Even (base). Abbreviations: C – Critical Word First, V – VC as Cache, D – Dynamic Multicast Tree, P – Packet Concatenation]
Fine Grained Energy Footprint of Barnes
[Figure: % energy improvement over the base configuration vs. clock cycles, for Base+D, Base+CD, Base+CVD, and Base+CVDP. Abbreviations: C – Critical Word First, V – VC as Cache, D – Dynamic Multicast Tree, P – Packet Concatenation]
Conclusion and future extensions
• A scalable solution for multicasts
• Can fit with existing techniques
• Easy to implement
• Energy efficient
• Packet concatenation can be switched on selectively, depending on the load requirements
• Other architecture-level inputs can also be used for further performance gains
  • Examples: number of waiting instructions, memory-level parallelism
Thank you