Design Tradeoffs for Hard and Soft FPGA-Based Networks-on-Chip


Mohamed Abdelfattah
Vaughn Betz
1 Why NoCs on FPGAs?
2 Embedded NoCs
3 Area & Power Analysis
4 Comparison Against P2P/Buses
1. Why NoCs on FPGAs?
[Figure: the FPGA fabric: logic blocks, switch blocks, and interconnect wires]
1. Why NoCs on FPGAs?
[Figure: the FPGA fabric with hard blocks added]
Hard Blocks:
• Memory
• Multiplier
• Processor
1. Why NoCs on FPGAs?
Hard interfaces (DDR/PCIe, ..): 1600 MHz
Hard blocks (memory, multiplier, processor): 800 MHz
Interconnect wires: 200 MHz (the interconnect is still the same)
1. Why NoCs on FPGAs?
[Figure: DDR3 PHY and controller (1600 MHz), PCIe controller, and Gigabit Ethernet at the chip edge; hard blocks at 800 MHz, wires at 200 MHz]
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
– Huge CAD problem
– Slow compilation
– Power/area utilization
4. Wire speed not scaling:
– Delay is interconnect-dominated
[Figure: road networks of Barcelona and Los Angeles (source: Google Earth), alongside an FPGA's logic clusters and hard blocks]
Keep the "roads", but add "freeways".
1. Why NoCs on FPGAs?
[Figure: an NoC overlaid on the FPGA: routers joined by links. A router forwards each data packet across the chip, and the destination router moves the data onto the local interconnect.]
1. Why NoCs on FPGAs?
Problems (continued):
5. Abstraction favours modularity:
– Parallel compilation
– Partial reconfiguration
– Multi-chip interconnect
What an embedded NoC offers:
→ High-bandwidth endpoints known
→ Pre-design NoC to requirements
→ NoC links are "re-usable"
→ NoC is heavily "pipelined"
→ NoC abstraction favors modularity
1. Why NoCs on FPGAs?
NoCs can simplify FPGA design:
→ Latency-tolerant communication
→ NoC abstraction favors modularity
Open questions:
• How do we integrate NoCs in FPGAs?
• Does the NoC abstraction come at a high area/power cost?
• How do embedded NoCs compare to current interconnects?
1 Why NoCs on FPGAs?
2 Embedded NoCs
• Mixed NoCs
• Hard NoCs
3 Area & Power Analysis
4 Comparison Against P2P/Buses
2. Embedded NoCs
[Figure: an FPGA with an embedded NoC: routers (hard or soft) joined by links (hard or soft); each compute module attaches through a fabric port, and the NoC reaches the PCIe and DDRx interfaces]
Soft Routers + Soft Links = "Soft" NoC
Hard Routers + Soft Links = "Mixed" NoC
Hard Routers + Hard Links = "Hard" NoC
Methodology:
• Soft NoC: FPGA CAD tools give area, speed, and power
• Mixed and Hard NoCs: ASIC CAD tools (Design Compiler, HSPICE) give area and speed; power comes from toggle rates in gate-level simulation
2. Embedded NoCs
[Figure: a hard router embedded among the logic blocks, reaching them through the programmable "soft" interconnect]
Baseline router: width 32 bits, 2 VCs, 5 ports, buffer depth 10/VC
Hard Routers + Soft Links = "Mixed" NoC
2. Embedded NoCs
[Figure: two hard routers connected to each other through the FPGA's soft interconnect]
Special feature of the mixed NoC: configurable topology. A mesh is assumed, but the soft links can form any topology.
Hard Routers + Soft Links = "Mixed" NoC
2. Embedded NoCs
[Figure: hard routers connected by dedicated "hard" interconnect, independent of the logic blocks' programmable "soft" interconnect]
Hard Routers + Hard Links = "Hard" NoC
2. Embedded NoCs
Special feature of the hard NoC: low-V mode. Running the NoC at 0.9 V instead of 1.1 V saves 33% of dynamic power and is ~15% slower.
Hard Routers + Hard Links = "Hard" NoC
2. Embedded NoCs
The fabric port bridges the NoC and the FPGA fabric:
• Frequency adaptation
• Voltage adaptation
• Width adaptation
• Bus protocol, e.g. AXI
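The width and frequency adaptations are easiest to see as a bandwidth-balance check. A minimal sketch, where the function and its parameters are hypothetical illustrations rather than the talk's actual interface:

    # A fabric port presents the module's (width, clock) on one side and
    # the NoC's (width, clock) on the other. Illustrative feasibility check:
    def port_feasible(module_bits, module_hz, noc_bits, noc_hz):
        # Width adaptation wants an integer (de)serialization ratio;
        # the async FIFO handles the clock crossing as long as the NoC
        # side offers at least the module's raw bandwidth.
        wide, narrow = max(module_bits, noc_bits), min(module_bits, noc_bits)
        return wide % narrow == 0 and noc_bits * noc_hz >= module_bits * module_hz

    # e.g. a 100 MHz, 64-bit module feeding a 900 MHz, 32-bit NoC port
    print(port_feasible(64, 100e6, 32, 900e6))  # True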
1 Why NoCs on FPGAs?
2 Embedded NoCs
3 Area & Power Analysis
• Soft vs. Mixed vs. Hard
• System Area/Power
4 Comparison Against P2P/Buses
3. Area/Power Analysis
[Figure: router microarchitecture: 5 input modules, a virtual channel (VC) allocator, a switch allocator, a 5x5 crossbar switch, and 5 output modules]
1. The NoC community has excelled at building on-chip routers:
→ State-of-the-art router architecture from Stanford; we just use it
2. To meet FPGA bandwidth requirements:
→ High-performance router
3. Complex functionality such as virtual channels:
→ Assigning traffic priority could be useful
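For readers unfamiliar with those blocks, the sketch below walks flits through the canonical virtual-channel router pipeline (route computation, VC allocation, switch allocation, switch traversal). It is textbook structure in the style the figure implies, not the talk's RTL, and every name in it is hypothetical:

    # One cycle of a virtual-channel mesh router, heavily simplified.
    def route_compute(here, dest):
        # Dimension-order (XY) routing: correct X first, then Y.
        if dest[0] != here[0]:
            return "E" if dest[0] > here[0] else "W"
        if dest[1] != here[1]:
            return "N" if dest[1] > here[1] else "S"
        return "LOCAL"  # arrived: eject through the fabric port

    def router_step(here, flits_in, free_vcs):
        granted_ports, traversals = set(), []
        for flit in flits_in:                         # flit = {"dest": (x, y)}
            port = route_compute(here, flit["dest"])  # RC
            if not free_vcs.get(port):                # VA: need a free output VC
                continue
            if port in granted_ports:                 # SA: one grant per output
                continue
            free_vcs[port] -= 1
            granted_ports.add(port)
            traversals.append((flit, port))           # ST: crossbar traversal
        return traversals

    # Two flits contend for different outputs, so both win this cycle.
    print(router_step((1, 1), [{"dest": (3, 1)}, {"dest": (1, 0)}],
                      {"E": 2, "W": 2, "N": 2, "S": 2, "LOCAL": 2}))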
3. Area/Power Analysis
Hard router vs. soft router: 30X smaller, 6X faster, 14X lower power
Hard links vs. soft links: 9X smaller, 2.4X faster, 1.4X lower power
3. Area/Power Analysis
64-node NoC on Stratix III [65 nm]:

              Soft            Mixed          Hard (Low-V)
Area          ~12,500 LBs     576 LBs        448 LBs
              (33% of FPGA)   (~1.5% of FPGA for Mixed/Hard)
Speed         166 MHz         730-940 MHz    730-940 MHz
Bisection BW  ~10 GB/s        ~50 GB/s       ~50 GB/s

Provides ~50 GB/s peak bisection bandwidth
Very cheap: less than the cost of 3 soft nodes
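The bisection numbers can be sanity-checked from the router parameters given earlier (32-bit links) and the clock rates in the table. A back-of-the-envelope sketch, assuming an 8x8 mesh whose bisection cut crosses 8 bidirectional links:

    # Bisection bandwidth of an 8x8 mesh with 32-bit links: the cut
    # severs 8 links, i.e. 16 one-way channels of 4 bytes per cycle.
    links_cut, directions, bytes_per_flit = 8, 2, 32 // 8
    for f_ghz in (0.166, 0.73, 0.94):
        bw = links_cut * directions * bytes_per_flit * f_ghz
        print(f"{f_ghz * 1000:.0f} MHz -> {bw:.1f} GB/s")
    # 166 MHz -> ~10.6 GB/s (soft NoC, the slide's ~10 GB/s)
    # 730-940 MHz -> ~47-60 GB/s (hard NoC, the slide's ~50 GB/s)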
3. Area/Power Analysis
Dynamic power to supply the NoC's full 250 GB/s of total bandwidth, as a share of a typical FPGA's dynamic power (17.4 W, largest Stratix III device):

Soft NoC: 123% | Mixed NoC: 15% | Hard NoC: 11% | Hard NoC (Low-V): 7%

How much of that bandwidth is actually used for system-level communication?
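Converting those shares into watts (assuming each percentage is measured against the same 17.4 W baseline) also recovers the soft-to-hard power gap quoted elsewhere in the talk:

    # NoC dynamic power at 250 GB/s, in watts, from the slide's shares.
    baseline_w = 17.4
    shares = {"Soft": 1.23, "Mixed": 0.15, "Hard": 0.11, "Hard (Low-V)": 0.07}
    for noc, share in shares.items():
        print(f"{noc:13s} {share * baseline_w:5.2f} W")
    # Soft ~21.4 W, Mixed ~2.6 W, Hard ~1.9 W, Hard (Low-V) ~1.2 W
    print(f"Soft vs. Hard: {shares['Soft'] / shares['Hard']:.1f}x")  # ~11x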
3. Area/Power Analysis
Example system: DDR3 → Module 1 and PCIe → Module 2, each stream running at its full theoretical bandwidth (14.6 GB/s) and crossing the whole chip.
Aggregate bandwidth: 126 GB/s
NoC power budget: 3.5%
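That 3.5% budget is consistent with the previous slide if NoC power scales linearly with the bandwidth actually moved. A quick check, assuming the system uses the Low-V hard NoC (7% share at its full 250 GB/s):

    # Scale the Low-V hard NoC's power share from 250 GB/s to 126 GB/s.
    share_at_full, used_bw, full_bw = 0.07, 126, 250
    print(f"power budget: {share_at_full * used_bw / full_bw:.1%}")  # 3.5%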
1 Why NoCs on FPGAs?
2 Embedded NoCs
3 Area & Power Analysis
4 Comparison Against P2P/Buses
• Point-to-point links
• Qsys buses
4. Comparison
Compare "wires" interconnect to NoCs (a rough structural sketch follows the list):
• Point-to-point links: interconnect = just wires
• Multiple masters, one slave (broadcast, then a mux + arbiter): interconnect = wires + logic
• Multiple masters, multiple slaves (a mux + arbiter in front of each of the n slaves): interconnect = wires + logic
• Embedded NoC: interconnect = NoC
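To make the list concrete, here is a rough resource count for n masters talking to n slaves. The formulas are illustrative assumptions for this sketch, not measurements from the talk:

    # Structural cost of each interconnect style (illustrative only).
    def p2p(n):
        # Dedicated wires: one link per master-slave pair, no shared logic.
        return {"links": n * n, "arbiters": 0}

    def shared_bus(n):
        # Per the slide: each of the n slaves fronts a mux + arbiter
        # that merges requests from all n masters.
        return {"links": n * n, "arbiters": n}

    def mesh_noc(n):
        # 2n endpoints on a mesh: a router per node plus a few short
        # nearest-neighbour links, shared by every traffic flow.
        nodes = 2 * n
        return {"links": 2 * nodes, "routers": nodes}

    for n in (4, 16):
        print(n, p2p(n), shared_bus(n), mesh_noc(n))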
4. Comparison
[Figure: raw efficiency of one NoC link: soft interconnect of the same length runs at 200 MHz; the hard NoC link runs at 730-943 MHz]
• High performance, packet switched
• 1% area overhead on Stratix V
• Power on par with the simplest FPGA interconnect
Hard and Mixed NoCs → area/power efficient
4. Comparison
Qsys bus: a logical bus built from the fabric
Embedded NoC: 16 nodes, hard routers & links
4. Comparison
• Steps to close timing using Qsys, first with the communicating modules placed close together on the FPGA, then far apart
Timing closure can be simplified with an embedded NoC
4. Comparison
Entire NoC smaller than the bus for 3 modules!
4. Comparison
With only 1/8 of the hard NoC's bandwidth in use, it already takes less area than a bus for most systems
4. Comparison
Hard NoC saves power for even the simplest systems
1 Why NoCs on FPGAs?
• A big city needs freeways to handle its traffic
2 Embedded NoCs: Mixed & Hard
• Area: 20-23X, Speed: 5-6X, Power: 9-15X (vs. a soft NoC)
3 Area & Power Analysis
• Area budget for 64 nodes: ~1%
• Power budget for 100 GB/s: 3-7%
4 Comparison Against P2P/Buses
• Raw efficiency close to the simplest P2P links
• NoC more efficient & lower design effort
eecg.utoronto.ca/~mohamed/noc_designer.html
2. Embedded NoCs (backup)
How do we connect a 200 MHz, 128-bit module to a 900 MHz, 32-bit router?
→ Configurable time-domain mux/demux: match bandwidth
→ Asynchronous FIFO: cross clock domains
→ Full NoC bandwidth, without clock restrictions on modules
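A minimal sketch of the bandwidth matching this slide describes; the 4:1 serialization factor follows directly from the 128-bit and 32-bit widths:

    # Bridge a 200 MHz, 128-bit module to a 900 MHz, 32-bit router with
    # a configurable time-domain mux/demux plus an asynchronous FIFO.
    module_hz, module_bits = 200e6, 128
    router_hz, router_bits = 900e6, 32

    tdm_factor = module_bits // router_bits           # 4 flits per word
    module_gbs = module_hz * module_bits / 8 / 1e9    # 3.2 GB/s produced
    router_gbs = router_hz * router_bits / 8 / 1e9    # 3.6 GB/s available
    assert module_gbs <= router_gbs                   # bandwidth matches

    def tdm_demux(word128):
        # Serialize one 128-bit word into four 32-bit flits, LSB first.
        return [(word128 >> (32 * i)) & 0xFFFFFFFF for i in range(4)]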
1. Why NoCs on FPGAs? (backup)
FPGA speedups (vs. GPU, vs. CPU):
• Maxeler: geoscience (14x, 70x); financial analysis (5x, 163x)
• Altera OpenCL: video compression (3x, 114x); information filtering (5.5x)