Multicast within a Router for High Performance Network-on


McRouter: Multicast within a Router for High Performance NoCs

Yuan He, Hiroshi Sasaki*, Shinobu Miwa, Hiroshi Nakamura
The University of Tokyo and *Kyushu University

Executive Summary

• Like other networks, NoCs are latency critical. But through evaluations, we also observed that they can be quite bandwidth plentiful (within the routers).
• We propose to have packets multicast within a router (routed to all possible outputs), so that route computation is completely hidden and is only required to acknowledge the ONE correctly routed packet in a multicast.
• Results show that
  – McRouter makes more productive use of its internal bandwidth
  – It outperforms the Prediction Router (the best router so far) with nearly all application traffic we evaluated

Outline

• Scope of the Work
• Motivation
• Proposal: Multicast within a Router
• Evaluations and Results
• Conclusion

Scope

• On-chip routers
• Standalone router designs
  – So not based on look-ahead routing
  – Conventional Router
  – Prediction Router (HPCA 2009, Matsutani et al.)
• Mesh topology
  – But the idea should be applicable to other topologies as well

Motivation

• Modern On-chip Networks
  – Latency Critical
    • NoCs affect cache/memory access latency
  – Let us look at two router designs
    • Conventional Router (4-cycle)
    • Prediction Router (1-cycle when prediction succeeds)

Conventional Router (CR)

[Figure: Conventional Router block diagram: route computation, VC allocator, switch allocator, credits in/out, input VCs 1..n, and outputs 1..n with pipeline registers]

• Conventional Virtual Channel Router
  – BW/RC -> VA -> SA -> ST
• Problem -> 4 cycles

BW: Buffer Write
RC: Route Computation
VA: Virtual Channel Allocation
SA: Switch Allocation
ST: Switch Traversal

Prediction Router (PR, Hit)

[Figure: Prediction Router block diagram (prediction hit): predictors and kill signals per input, alongside route computation, VC allocator, switch allocator, credits in/out, input VCs 1..n, and outputs 1..n with pipeline registers]

• Prediction Router (HPCA 2009, Matsutani et al.)
  – If the prediction hits (and VA/SA succeeds with this predicted RC), only ST is needed (1 cycle)

Prediction Router (PR, Miss)

[Figure: Prediction Router block diagram (prediction miss): predictors and kill signals per input, alongside route computation, VC allocator, switch allocator, credits in/out, input VCs 1..n, and outputs 1..n with pipeline registers]

• Prediction Router
  – If the prediction misses, misrouted packets get killed and the conventional data path is then used
  – Problem -> prediction accuracy is around 65% in our evaluation
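To put the 65% figure in perspective, a rough estimate of the Prediction Router's average per-hop latency can be sketched as follows. It assumes a hit costs 1 cycle and a miss costs the full 4-cycle conventional pipeline. The slides do not state the exact miss penalty, so the 4-cycle value and the function name are assumptions made for illustration only.

```python
def expected_router_latency(hit_rate, hit_cycles=1, miss_cycles=4):
    """Average per-hop router latency under a simple hit/miss model.

    miss_cycles=4 is an assumption: the slides only say that a miss
    falls back to the conventional data path.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


pr_latency = expected_router_latency(0.65)
print(f"{pr_latency:.2f} cycles per hop")
```

Under these assumptions the estimate comes to about 2.05 cycles per hop, roughly halfway between the 1-cycle best case and the 4-cycle conventional pipeline.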

Motivation (cont…)

• Modern On-chip Networks
  – Bandwidth Plentiful
  – Observations

Observation 1: Average Link Utilization

[Figure: average link utilization; y-axis from 0 to 0.05]

Observation 1: Average Link Utilization

[Figure: Conventional Router block diagram: route computation, VC allocator, switch allocator, credits in/out, input VCs 1..n, and outputs 1..n with pipeline registers]

• 0.031 flits/link/cycle for the worst case (FT)
  – About 0.2 flits/crossbar/cycle, assuming a radix-6 router
• Little contention internally
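The step from the per-link figure to the per-crossbar figure is a simple multiplication by the router radix. A minimal sketch follows, assuming every port of the router sees the same average utilization. The function name is hypothetical.

```python
def crossbar_load(link_utilization, radix):
    """Flits entering one router's crossbar per cycle, summed over its ports.

    Assumes each of the `radix` ports carries the same average link
    utilization, which is a simplification.
    """
    return link_utilization * radix


load = crossbar_load(0.031, 6)
print(f"{load:.3f} flits/crossbar/cycle")
```

The product is 0.186, which the slide rounds to 0.2 flits/crossbar/cycle.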

Observation 2: Concurrent Flits to a Router

[Figure: share of cycles with 0, 1, or >=2 flits arriving concurrently at a router; y-axis from 80% to 100%]

Observation 2: Concurrent Flits to a Router

[Figure: Conventional Router block diagram: route computation, VC allocator, switch allocator, credits in/out, input VCs 1..n, and outputs 1..n with pipeline registers]

• Taking the worst case workload (FT)
  – 83% of the time -> no incoming flits
  – 15% of the time -> 1 flit only
  – 2% of the time -> 2+ flits

Very few chances of encountering concurrent flits
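Another way to read these numbers is to ask how often a router that receives anything at all receives exactly one flit. The sketch below derives that share from the rounded percentages on the slide. It is an estimate only, and the dictionary keys and function name are illustrative.

```python
# Arrival distribution for FT, taken from the slide (fractions of cycles).
arrivals = {"0 flits": 0.83, "1 flit": 0.15, "2+ flits": 0.02}


def single_flit_share(dist):
    """Fraction of non-idle cycles in which exactly one flit arrives."""
    busy = dist["1 flit"] + dist["2+ flits"]
    if busy == 0:
        raise ValueError("no flits ever arrive")
    return dist["1 flit"] / busy


share = single_flit_share(arrivals)
print(f"{share:.0%} of busy cycles carry a single flit")
```

With the slide's figures this is 0.15 / 0.17, or about 88% of the cycles in which any flit arrives.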

Proposal: Multicast within a Router

• Or McRouter for short
  – Single-cycle router when having enough bandwidth
  – Is based on multicast operation inside a router
  – A multicast is like an always-correct prediction
    • No predictors

[Figure: comparison of the Conventional Router, Prediction Router, and McRouter]

McRouter: Conditions to Invoke A Multicasting

[Figure: McRouter block diagram: multicast unit, route computation, VC allocator, switch allocator, credits in/out, input VCs 1..n, and outputs 1..n, each output with ACK, Valid, and VCID signals]

1) Only 1 flit arrives at the router (which means no concurrent flits)
2) Within this router, no flit is waiting to undertake ST (switch traversal)
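The two conditions can be written as a single predicate. This is a minimal sketch, not the router's actual control logic: the `RouterState` snapshot and its field names are hypothetical and only stand in for whatever signals the multicast unit really observes.

```python
from dataclasses import dataclass


@dataclass
class RouterState:
    """Hypothetical per-cycle snapshot of a router; field names are illustrative."""
    arriving_flits: int        # flits arriving at the router's inputs this cycle
    flits_waiting_for_st: int  # buffered flits already waiting for switch traversal


def can_multicast(state: RouterState) -> bool:
    """Return True when both conditions listed above hold."""
    only_one_arrival = state.arriving_flits == 1
    nothing_waiting = state.flits_waiting_for_st == 0
    return only_one_arrival and nothing_waiting


print(can_multicast(RouterState(arriving_flits=1, flits_waiting_for_st=0)))
print(can_multicast(RouterState(arriving_flits=2, flits_waiting_for_st=0)))
print(can_multicast(RouterState(arriving_flits=1, flits_waiting_for_st=1)))
```

Only the first call satisfies both conditions. In the other two cases the flit would presumably go through the regular pipeline instead.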

Multicasting Operation

[Figure: McRouter during a multicast: the multicast unit forwards the flit from an input to outputs 1..n, each with ACK, Valid, and VCID signals, alongside route computation, VC allocator, switch allocator, and credits in/out]
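As a rough behavioral sketch of a multicast, the flit is copied to every candidate output while route computation runs alongside. The one output that matches the computed route is acknowledged and the other copies are killed. Several things here are assumptions rather than details from the slides: XY dimension-order routing, the port names, the coordinate convention (north means increasing y), and excluding the arrival port from the candidates.

```python
def xy_route(cur, dst):
    """Dimension-order (XY) route computation on a mesh.

    XY routing is an assumption for this sketch; the slides do not say
    which routing algorithm the router uses.
    """
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "LOCAL"


def multicast(cur, dst, in_port, ports=("N", "E", "S", "W", "LOCAL")):
    """Send a flit to every candidate output, then ACK one and kill the rest.

    Returns (acked_port, killed_ports). Route computation is modelled as
    running alongside the multicast, so it adds no extra cycle here.
    """
    candidates = [p for p in ports if p != in_port]  # assumption: no U-turns
    correct = xy_route(cur, dst)
    if correct not in candidates:
        raise ValueError("route points back at the input port")
    killed = [p for p in candidates if p != correct]
    return correct, killed


acked, killed = multicast(cur=(1, 1), dst=(3, 1), in_port="W")
print(acked, killed)
```

In this example the flit enters from the west and is headed east, so the east copy is acknowledged and the north, south, and local copies are killed.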

A Summary on McRouter

• Pros
  – A single-cycle router when internal bandwidth allows
  – No predictors
• Cons
  – More complex control over the crossbar switch
  – Killing of more misrouted flits

Evaluation Methodology

• CPU Model: Simics 3.0.31
  – 16 cores, in-order
• Memory Model: GEMS 2.1.1
  – 32KB L1 I/D Caches
  – 256KB L2 Cache X 16 Banks
  – 4 Memory Controllers, 4GB main memory
• NoC Model: GARNET
  – 4 X 4 Mesh with virtual channel routers
• NoC Power Model: Orion 2
  – 32nm process and 1V Vdd
• Synthetic Traffic: Uniform Random
• Benchmarks: 13 workloads
  – From SPLASH-2 and NPB-3
• Counterparts: CR and PR

[Figure: chip layout showing routers, links, L2$, core/L1$s, and memory controllers]

Evaluations with Synthetic Traffic

[Figure: Conventional Router, Prediction Router (LPM), Prediction Router (FCM), and McRouter plotted against injection rate (flits/node/cycle, 0.025 to 0.305); y-axis from 30 to 55; annotations at 0.07 flits/link/cycle and 0.34 flits/link/cycle]

Evaluations with Application Traffic: Normalized System Speed-up

[Figure: normalized system speed-up for Conventional Router, Prediction Router (LPM), Prediction Router (FCM), and McRouter; y-axis from 0.9 to 1.5]

Sensitivity Study with Network Parameter Downscaling

[Figure: two panels (workloads raytrace and FT) comparing CR, PR(LPM), PR(FCM), and McRouter under three configurations: 128-bit, 4 VCs; 64-bit, 4 VCs; 128-bit, 1 VC; y-axis from 0.9 to 1.5]

• Parameters downscaled
  – Link width halved
  – # of VCs minimized
• McRouter still works with thinned bandwidth
  – Its advantages over CR/PR are not from over-designing

Conclusion

• A new low-latency router
  – It successfully hides route computation and arbitration delays while still being a standalone design
  – It outperforms PR (the best router so far) in practice
  – We uncover an insight that with more aggressive utilization of remaining internal bandwidth, a router can have its latency dramatically shortened with simple architectural changes

Thank you so much for your attention!