PPT - FPGA BUS Designer


Mohamed ABDELFATTAH
Vaughn BETZ
1 Why NoCs on FPGAs?
  – Motivation
  – Previous Work
2 Hard/soft efficiency gap
3 Integrating hard NoCs with FPGA
1. Why NoCs on FPGAs?
[Figure: FPGA fabric — logic blocks, switch blocks, and interconnect wires]
1. Why NoCs on FPGAs?
[Figure: FPGA fabric (logic blocks, switch blocks, wires) with hard blocks: memory, multiplier, processor]
1. Why NoCs on FPGAs?
[Figure: hard interfaces (DDR/PCIe, ..) run at 1600 MHz and hard blocks (memory, multiplier, processor) at 800 MHz, but the interconnect is still the same — wires at 200 MHz]
1. Why NoCs on FPGAs?
[Figure: FPGA with DDR3 PHY and Controller (1600 MHz), PCIe Controller, and Gigabit Ethernet; hard blocks at 800 MHz, soft interconnect at 200 MHz]
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect
[Figure: city maps of Barcelona and Los Angeles. Source: Google Earth]
Keep the “roads”, but add “freeways”.
1. Why NoCs on FPGAs?
[Figure: hard NoC embedded in the FPGA — routers and links overlay the logic clusters and hard blocks (DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet). A router forwards data packets across the NoC; the destination router moves data onto the local interconnect.]
1. Why NoCs on FPGAs?
The NoC addresses these challenges:
✓ High bandwidth endpoints known → pre-design NoC to requirements
✓ NoC links are “re-usable”
✓ Latency-tolerant communication
✓ NoC abstraction favors modularity
1. Why NoCs on FPGAs?
Hard NoC (efficiency):
• Faster, smaller
• Must build the whole thing
• Must be general enough for any application
Soft NoC (configurability):
• Slower, bigger
• Build as needed out of LUTs
• Tailor to application
Implementation options:
• Soft logic (LUTs, ..)
• Hard logic (unchangeable)
• Mixed soft/hard
→ Investigate the hard vs. soft tradeoff for NoCs (area/delay)
1. Why NoCs on FPGAs?
 FPGA-tuned Soft NoCs:
– LiPar (2005), NoCeM (2008), Connect (2012)
 Hard NoCs:
– Francis and Moore (2008): Exploring Hard and Soft
Networks-on-Chip for FPGAs
 Applications that leverage NoCs:
– Chung et al. (2011): CoRAM: An In-Fabric Memory
Architecture for FPGA-based Computing
Our Contributions:
1. Quantify area/performance gap of hard and soft NoCs
2. Investigate how this impacts NoC design (hard/soft)
3. Integrate hard NoC with FPGA fabric
1 Why NoCs on FPGAs?
2 Hard/soft efficiency gap
  – NoC Architecture
  – Methodology
  – Area/Speed Results
  – Efficiency Gap
  – Soft NoC design
3 Integrating hard NoCs with FPGA
2. Hard/Soft Efficiency
• NoC = Routers + Links
• State-of-the-art router architecture from Stanford:
1. Acknowledge that the NoC community has excelled at building a router: we just use it
2. To meet FPGA bandwidth requirements: a high-performance router
3. A complex router includes a superset of NoC components that may be used: a more complete analysis
→ Split the router into 5 components
2. Hard/Soft Efficiency
[Figure: router microarchitecture — 5 input modules feed a crossbar switch into 5 output modules, governed by a virtual channel (VC) allocator and a switch allocator]
2. Hard/Soft Efficiency
Input Modules: multi-queue buffer = memory + control logic
Parameters: port width • buffer depth • number of VCs
2. Hard/Soft Efficiency
Crossbar Switch: multiplexers = logic + crowded interconnect
Parameters: port width • number of ports
2. Hard/Soft Efficiency
Output Modules: retiming register = registers + a little control logic
Parameters: port width • number of VCs
2. Hard/Soft Efficiency
Allocators (VC allocator, switch allocator): arbiters = logic + registers
Parameters: number of ports • number of VCs
2. Hard/Soft Efficiency
5 components: Input Module • Crossbar • Output Module • VC Allocator • SW Allocator
4 parameters: Port Width • Number of Ports • Number of VCs • Buffer Depth
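The study sweeps these four parameters across the five router components. As a minimal sketch of what such a sweep looks like (the ranges below are illustrative only, not the exact values used in the study):

```python
from itertools import product

# Router design parameters (ranges are illustrative, not the paper's sweep)
port_widths = [16, 32, 64]    # bits
num_ports = [5, 7, 9]         # e.g. mesh router ports + local ports
num_vcs = [1, 2, 4]           # virtual channels
buffer_depths = [5, 10, 20]   # words per VC queue

# Every combination is one design point, measured once in soft (FPGA)
# and once in hard (ASIC) to obtain the per-component efficiency gap
design_points = [
    {"width": w, "ports": p, "vcs": v, "depth": d}
    for w, p, v, d in product(port_widths, num_ports, num_vcs, buffer_depths)
]
print(len(design_points))  # 81 design points for these ranges
```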
2. Hard/Soft Efficiency
• Post-routing FPGA (soft) area and delay
• Post-synthesis ASIC (hard) area and delay
• Both in TSMC 65 nm technology (Stratix III)
• Verify results against the previous FPGA:ASIC comparison by Kuon and Rose
Measured per router component.
2. Hard/Soft Efficiency
• Relatively small memories
• Critical component in router design
• 3 options on the FPGA:
  – Registers: one per LUT
  – LUTRAM: 640 bits
  – Block RAM: 9 Kbits
→ Area of each implementation option
2. Hard/Soft Efficiency
[Plot: buffer area vs. depth at a width of 32 bits — the register-based buffer steps up each time another logic cluster is used]
2. Hard/Soft Efficiency
• Relatively small memories
• 3 options for implementation on the FPGA:
  – Registers (one per LUT): 0.77 Kbit/mm²
  – LUTRAM (640 bits): 23 Kbit/mm²
  – Block RAM (9 Kbits): 142 Kbit/mm²
• A BRAM that is only 16% utilized is more area-efficient than fully used LUTRAM (valid for Stratix III)
• LUTRAM could win for some points in other FPGAs
→ Use BRAM for the FPGA (soft) implementation
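The 16% break-even point follows directly from the measured densities: a partially filled BRAM still pays for the whole block, so its effective density scales with utilization. A quick check of the slide's numbers:

```python
# Storage densities from the slide (Stratix III, TSMC 65 nm)
lutram_density = 23.0   # Kbit/mm^2, fully used 640-bit LUTRAM
bram_density = 142.0    # Kbit/mm^2, fully used 9-Kbit block RAM

# A BRAM at utilization u delivers u * 142 Kbit/mm^2 effectively.
# It beats fully used LUTRAM whenever u * 142 > 23:
break_even = lutram_density / bram_density
print(f"{break_even:.0%}")  # ~16%, matching the slide
```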
2. Hard/Soft Efficiency
[Plot: soft:hard area ratio vs. number of ports — ratios range from 24X–94X up to 60X–170X]
→ High port count is inefficient in soft
2. Hard/Soft Efficiency
[Plot: soft:hard area ratio vs. port width — from 72X down to a 26X–17X range as width grows]
→ High port count is inefficient in soft → width scales better
2. Hard/Soft Efficiency
[Plot: soft buffer area vs. buffer depth — nearly flat when using BRAM]
→ Buffer depth is free on FPGAs when using BRAM
2. Hard/Soft Efficiency
Soft
Use BRAM for FPGA (soft) implementation
High port count inefficient in soft  Width scales better
Soft
Soft
 Design recommendations based on FPGA silicon area
 Supported by delay measurements
Buffer depth is free on FPGAs when using BRAM
32
2. Hard/Soft Efficiency

Router Component          | Mean Area Ratio | LUT:REG
Input Module (memory)     | 17              | --
Crossbar                  | 85              | 8:1
VC Allocator              | 48              | 20:1
Switch Allocator          | 56              | 20:1
Output Module             | 39              | 0.6:1
Router                    | 30              |

(The crossbar, allocators, and output module are logic + registers.)
2. Hard/Soft Efficiency

Router Component | Mean Delay Ratio
Input Module     | 2.9
Crossbar         | 4.4
VC Allocator     | 3.9
Switch Allocator | 3.3
Output Module    | 3.4
Router           | 3.6
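Combining this delay table with the area table on the previous slide gives a rough per-component area-delay efficiency gap. This is only a sketch using the mean ratios above; it ignores how area and delay trade off within any single design point:

```python
# Mean soft:hard ratios from the two tables
area_ratio = {"Input Module": 17, "Crossbar": 85, "VC Allocator": 48,
              "Switch Allocator": 56, "Output Module": 39, "Router": 30}
delay_ratio = {"Input Module": 2.9, "Crossbar": 4.4, "VC Allocator": 3.9,
               "Switch Allocator": 3.3, "Output Module": 3.4, "Router": 3.6}

# Area-delay product gap: how much a soft implementation loses overall
ad_gap = {c: area_ratio[c] * delay_ratio[c] for c in area_ratio}
print(round(ad_gap["Router"]))          # ~108X for the whole router
print(max(ad_gap, key=ad_gap.get))      # the crossbar is the worst to leave soft
```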
1 Why NoCs on FPGAs?
2 Hard/soft efficiency gap
3 Integrating hard NoCs with FPGA
  – Hard NoC + FPGA Wiring
  – Conclusion
  – Future Work
3. Hard NoC with FPGA

Router Component | Area Ratio | Delay Ratio
Input Module     | 17         | 2.9
Crossbar         | 85         | 4.4
VC Allocator     | 48         | 3.9
Switch Allocator | 56         | 3.3
Output Module    | 39         | 3.4
Router           | 30         | 3.6

Input modules: 40% of total area; allocators: 50%; output modules: 10%. The switch allocator is on the critical path.
→ Results suggest hardening the crossbar and allocators → a mixed hard/soft implementation
3. Hard NoC with FPGA
For a typical router (5 ports, 32 bits wide, 2 VCs, 10 buffer words):

      | Area           | Speed
Soft  | 4.1 mm² (1X)   | 150 MHz (1X)
Hard  | 0.14 mm² (30X) | 810 MHz (5X)
Mixed | 2.3 mm² (1.8X) | 390 MHz (2.5X)

How to connect hard and soft? How efficient is mixed/hard after doing that?
→ Mixed not worth hardening
[Figure: mixed router — hard crossbar, VC allocator, and switch allocator; soft input and output modules]
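The table explains why mixed loses: hardening the crossbar and allocators still leaves the area-dominant soft buffers in place. A quick area-delay comparison using only the slide's numbers:

```python
# Typical router (5 ports, 32 bits, 2 VCs, 10 buffer words), from the slide
impls = {
    "soft":  {"area_mm2": 4.1,  "freq_mhz": 150},
    "hard":  {"area_mm2": 0.14, "freq_mhz": 810},
    "mixed": {"area_mm2": 2.3,  "freq_mhz": 390},
}

# Area-delay cost (lower is better): area / frequency
cost = {k: v["area_mm2"] / v["freq_mhz"] for k, v in impls.items()}

gain_vs_soft = {k: round(cost["soft"] / cost[k], 1) for k in impls}
print(gain_vs_soft)
# mixed is only ~4.6X better than soft, while fully hard is ~158X better:
# the remaining soft buffers dominate, so partial hardening buys little
```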
3. Hard NoC with FPGA
[Figure: hard routers embedded among the logic clusters, connected through the programmable interconnect]
• Same I/O mux structure as a logic block – 9X the area
• Conventional FPGA interconnect between routers
3. Hard NoC with FPGA
[Figure: hard router in the FPGA running at 730 MHz; an inter-router link spans 1/9th of the FPGA vertically (~2.5 mm)]
3. Hard NoC with FPGA
[Figure: routers placed across the FPGA]
Assumed a mesh → can form any topology
3. Hard NoC with FPGA
Soft
Hard
Hard (+ interconnect)
Area
4.1 mm2 (1X)
0.14 mm2 (30X)
0.18 mm2 = 9 LABs (22X)
Speed
150 MHz (1X)
810 MHz
(5X)
730 MHz
(4.7X)
64-node NoC on Stratix V
Router
Soft
Hard (+ interconnect)
Area
~12,500 LABs
576 LABs
%LABs
33 %
1.6 %
%FPGA
12 %
0.6 %
Provides 47 GB/s peak bisection bandwidth
Very Cheap! Less than cost of 3 soft nodes
Hard NoC + Soft Interconnect is very compelling
41
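The 47 GB/s figure is consistent with a 64-node (8x8) mesh of 32-bit links at the 730 MHz hard-router speed. The topology and the counting of both link directions across the cut are assumptions here, taken from the mesh slide above:

```python
# 64-node NoC assumed to be an 8x8 mesh of hard routers
rows = 8
link_width_bits = 32   # router port width from the typical-router slide
freq_hz = 730e6        # hard router + FPGA-interconnect speed

# Cutting the mesh in half crosses one link per row; each link carries
# traffic in both directions, so count two unidirectional channels each
crossing_channels = rows * 2
bisection_gbps = crossing_channels * link_width_bits * freq_hz / 8 / 1e9
print(round(bisection_gbps))  # ~47 GB/s, matching the slide
```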
1 Why NoCs on FPGAs?
• A big city needs freeways to handle traffic
• Solve communication problems for a large/heterogeneous FPGA:
  Timing closure – Interconnect scaling – Modular design
2 Hard/soft efficiency gap
• A hard NoC is on average 30X smaller and 3.6X faster than soft
• Crossbars and allocators are worst – the input buffer is best
• An efficient soft NoC uses BRAMs – large width, low port count – deep buffers
3 Integrating hard NoCs with FPGA
• A mixed implementation does not make sense
• Integrated a fully hard NoC with the FPGA fabric (FPGA interconnect for NoC links)
• 22X area improvement over soft
• Reaches max. FPGA frequency (4.7X faster than soft)
• 64-node NoC = 0.6% of total FPGA area (Stratix V)
3. Hard NoC with FPGA — Future Work
• Power analysis
• More hardening:
  – Dedicated inter-router links (hard wires)
  – Clock domain crossing hardware
• How do traffic hotspots (DDR/PCIe) influence NoC design?
• A latency-insensitive design methodology that uses the NoC
• CAD tool changes for a NoC-based FPGA