
NoC Physical Implementation
Federico Angiolini
[email protected]
DEIS Università di Bologna

Physical Implementation and NoCs

NoCs and physical implementation flows are closely related topics:
- On the one hand, NoCs are designed to alleviate back-end issues (structured wiring)
- On the other hand, back-end properties critically affect NoC behaviour and effectiveness

ASIC Synthesis Flow

A Typical ASIC Design Flow

[Flow diagram: Design Space Exploration → RTL Coding → Logic Synthesis → Placement → Routing]

- Ideally, a one-shot linear flow
- In practice, iterations are needed to fix issues:
  - Validation failures
  - Bad quality of results
  - No timing closure

Basics of a Back-End Flow

[Flow diagram, with the Tech Libs feeding the synthesis steps:
 RTL code (circuit description)
 → Analysis → GTech (connected network of logic blocks)
 → Logic Synthesis → Netlist (connected network of gates)
 → Placement → Placed Netlist (placed network of gates)
 → Routing → Layout (placed and routed network of gates)]

Major vendors: Synopsys, Mentor, Magma, Cadence

Notes on Tech Libraries

- Encapsulate foundry capabilities
- Typical content: boolean gates, flip-flops, simple gates
- But in lots of variations: fan-in, driving strength, speed, power...
- Describe: function, delay, area, power, physical shape...
- Often many libraries per process: high-perf/low-power; best/worst case; varying VDD; varying VT

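To make the "lots of variations" point concrete, here is a minimal Python sketch of the kind of record a library cell boils down to. Real libraries use the Liberty format and characterize much more (per-pin timing arcs, slew/load tables); all cell names and numbers below are invented for illustration.

```python
# Toy model of a tech library: same logic function, several drive-strength
# variants with very different speed/area/leakage trade-offs (invented data).

TECH_LIB = {
    # cell:   function, delay (ns), area (um^2), leakage (nW), drive strength
    "INVX1":  {"fn": "!A",     "delay": 0.060, "area": 1.2,  "leak": 2.1,  "drive": 1},
    "INVX4":  {"fn": "!A",     "delay": 0.035, "area": 3.1,  "leak": 7.9,  "drive": 4},
    "INVX32": {"fn": "!A",     "delay": 0.022, "area": 19.0, "leak": 55.0, "drive": 32},
    "NOR2X2": {"fn": "!(A|B)", "delay": 0.095, "area": 2.6,  "leak": 4.4,  "drive": 2},
}

for name, c in TECH_LIB.items():
    print(f"{name:7s} {c['fn']:8s} {c['delay']*1000:3.0f} ps  {c['area']:4.1f} um^2")
```
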
Analysis of the Hardware Description

- Normally a very swift step
- Input: Verilog/VHDL description
- Output: circuit description in terms of "adders", "muxes", "registers", "boolean gates", etc. (GTech = Generic Technology)
- Output is not optimized by any metric:
  - Just translates the specification into an abstract circuit

Logic Synthesis

- Takes minutes to hours
- Input: GTech description
- Output: circuit description in terms of "HSFFX4", "LPNOR2X2", "LLINVX32", etc. (i.e., specific gates of a specific tech library)
- Output is:
  - Compliant with timing specs (e.g. "at 500 MHz")
  - Optimized for area and power

...How Does This Work?

- Based on the GTech description, paths are identified:
  - register-to-register
  - input-to-register
  - register-to-output
  - input-to-output
- Along each path, GTech blocks are replaced with actually available gates from a technology library
- The outcome is called a netlist
- Delay is analyzed first, and some paths are detected as critical (see the sketch below)

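Here is a minimal Python sketch of this delay analysis step, assuming invented gate delays and a toy three-gate netlist; real static timing analysis also models slew, wire load, setup times and false paths.

```python
# Illustrative sketch of how a tool flags critical paths: sum gate delays
# along a register-to-register path and compare against the clock period.

GATE_DELAY_NS = {"NAND2X1": 0.12, "INVX2": 0.06, "AOI21X1": 0.18}  # assumed

# Netlist as a DAG: each gate lists its fan-in ("REG_*" nodes are endpoints).
NETLIST = {
    "g1": ("NAND2X1", ["REG_A", "REG_B"]),
    "g2": ("AOI21X1", ["g1", "REG_C"]),
    "g3": ("INVX2",   ["g2"]),            # g3 drives a capture register
}

def arrival_time(node):
    """Longest combinational delay from any register/input to this node."""
    if node.startswith(("REG", "IN")):
        return 0.0
    cell, fanin = NETLIST[node]
    return GATE_DELAY_NS[cell] + max(arrival_time(f) for f in fanin)

CLOCK_PERIOD_NS = 2.0  # e.g. a 500 MHz timing spec
delay = arrival_time("g3")
print(f"reg-to-reg delay: {delay:.2f} ns,",
      "CRITICAL" if delay > CLOCK_PERIOD_NS else "meets timing")
```
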
Example: Critical Paths

[Figure: gate-level netlist with numbered nodes, from "Adventures in ASIC Digital Design"]

Based on the chosen library gates and the netlist, path 1 → 6 is the longest and violates constraints.

Netlist Optimization

- The synthesis process optimizes critical paths until timing constraints are met, e.g.:
  - Use faster gates instead of lower-power ones
  - Play with driving strength (as in buffering)
  - Refactor combinational logic to minimize the number of gates to be traversed
- Once timing is met, non-critical paths are analyzed:
  - Optimized for area and power, even if slower

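The first of these tactics, swapping in faster gate variants, can be sketched as a simple loop; the cell variants and delays below are invented, and real tools interleave this with logic restructuring and placement-aware timing.

```python
# Toy sketch of timing-driven gate upsizing: walk a critical path and upsize
# the slowest still-upgradable cell until the path meets the clock period.

VARIANTS = {  # base cell -> (name, delay_ns, relative area), fastest last
    "NAND2": [("NAND2X1", 0.12, 1.0), ("NAND2X2", 0.09, 1.6), ("NAND2X4", 0.07, 2.8)],
}

def upsize(path, period_ns):
    """path: list of (base_cell, variant_index); returns the tuned path."""
    path = list(path)
    while sum(VARIANTS[c][i][1] for c, i in path) > period_ns:
        cands = [k for k, (c, i) in enumerate(path) if i + 1 < len(VARIANTS[c])]
        if not cands:
            raise RuntimeError("no timing closure on this path")
        # upsize the candidate with the largest current delay
        k = max(cands, key=lambda k: VARIANTS[path[k][0]][path[k][1]][1])
        path[k] = (path[k][0], path[k][1] + 1)
    return path

print(upsize([("NAND2", 0)] * 4, period_ns=0.35))
```
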
Placement

- Step 1: Floorplanning
  - Place macro-blocks onto a "rectangle" (→ the chip)
  - e.g. processors, memories...
- Step 2: Detailed placement
  - Align the single gates of macro-blocks into "rows"
  - Typically aiming at 85% row utilization

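The 85% target directly sizes the placement region, as in this one-line calculation (the 0.52 mm² cell area is just an example figure):

```python
# Back-of-the-envelope: sizing a placement fence from total standard-cell
# area at the slide's ~85% row utilization target.

cell_area_mm2 = 0.52
target_utilization = 0.85
fence_area_mm2 = cell_area_mm2 / target_utilization
print(f"fence area: {fence_area_mm2:.2f} mm^2")   # ~0.61 mm^2
```
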
Example: xpipes Placement Approach

- Floorplan = mix of:
  - hard macros for IP cores
  - soft macros for NoC blocks

Routing

- Step 1: Clock tree insertion
  - Bring the clock to all flip-flops
- Step 2: Power network insertion
  - Bring the VDD and GND nets across the chip
  - Typically over the top metal layers
  - Either as a ring (small designs) or a grid (bigger designs)
- Step 3: Logic routing
  - Actually connect gates to each other
  - Typically over the bottom metal layers

Example: Binary Clock Tree

[Figure: binary clock tree, courtesy of Shobha Vasudevan]

- Issue: minimizing skew
- Critical at high frequencies
- Consumes a large amount of power

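Skew is simply the spread of clock arrival times across the flip-flops, which a tree traversal makes explicit; the tree shape and per-segment insertion delays below are invented for illustration.

```python
# Sketch: clock skew = max - min arrival time at the leaf flip-flops of a
# binary clock tree (hypothetical per-segment delays in ns).

TREE = {"root": ["b0", "b1"], "b0": ["ff0", "ff1"], "b1": ["ff2", "ff3"]}
SEG_DELAY = {"b0": 0.10, "b1": 0.12, "ff0": 0.05, "ff1": 0.06,
             "ff2": 0.05, "ff3": 0.09}

def arrivals(node, t=0.0):
    if node not in TREE:                  # a leaf flip-flop
        return [t]
    out = []
    for child in TREE[node]:
        out += arrivals(child, t + SEG_DELAY[child])
    return out

a = arrivals("root")
print(f"skew = {max(a) - min(a):.2f} ns")  # 0.21 - 0.15 = 0.06 ns
```
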
Issue with Traditional Flow

- Major problem with the traditional flow: wiring is not considered during synthesis!
- Outdated assumption: wiring delay is negligible
- Partial fix: wireload models
  - Consider the fan-out of each gate
  - If small, assume short wiring at the outputs, and a bit of extra delay
  - If large, assume long wiring at the outputs, and noticeable extra delay
  - Still grossly inaccurate

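A wireload model is essentially a fan-out-indexed lookup table, as in this sketch; the table values are invented, and real models are library- and design-size-specific (which is exactly why they are so inaccurate).

```python
# Sketch of a fanout-based wireload model: estimated extra wire delay grows
# with fan-out via a simple lookup (illustrative numbers only).

WIRELOAD_NS = [(2, 0.01), (4, 0.03), (8, 0.08), (16, 0.20)]  # (max fanout, delay)

def wire_delay(fanout):
    for max_fo, delay in WIRELOAD_NS:
        if fanout <= max_fo:
            return delay
    return 0.50  # beyond the table: assume a long, slow net

for fo in (1, 3, 10, 30):
    print(f"fanout {fo:2d} -> estimated extra delay {wire_delay(fo):.2f} ns")
```
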
Physical Synthesis

- Currently envisioned solution: physical synthesis
- Merge placement with logic synthesis:
  - Initial, quick logic synthesis
  - Coarse-grained placement
  - Incremental synthesis & placement until convergence
- Drastically better results (more predictable)
- Still may not suffice... also integrate the routing step?

[Flow diagram: RTL → quick logic synthesis → initial netlist → quick placement → initial placed netlist → incremental synthesis & placement → final placed netlist]

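The iterate-until-convergence structure can be caricatured as follows; the "0.2 ns shaved per pass" is a made-up stand-in for what the real incremental synthesis and placement engines accomplish.

```python
# Toy sketch of the physical-synthesis loop above: alternate incremental
# synthesis and placement passes until the placement-aware timing estimate
# meets the clock period.

def physical_synthesis(path_delay_ns, period_ns, shave_ns=0.2, max_iters=10):
    for it in range(1, max_iters + 1):
        slack = period_ns - path_delay_ns
        print(f"iteration {it}: delay {path_delay_ns:.2f} ns, slack {slack:+.2f} ns")
        if slack >= 0:
            return path_delay_ns                     # timing closure reached
        path_delay_ns = round(path_delay_ns - shave_ns, 3)  # one incremental pass
    raise RuntimeError("no convergence: revisit floorplan or constraints")

physical_synthesis(path_delay_ns=2.6, period_ns=2.0)  # converges at iteration 4
```
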
Advanced Back-End Flow

[Flow diagram, with the Tech Libs feeding the synthesis steps:
 RTL code (circuit description)
 → Analysis → GTech (connected network of logic blocks)
 → Physical Synthesis → Placed Netlist (placed network of gates)
 → Routing → Layout (placed and routed network of gates)]

Major vendors: Synopsys, Mentor, Magma, Cadence

Some Observations on the Physical Implementation of NoCs

Study 1: Cross-Benchmarking NoCs vs. Traditional Interconnects

- Study performance, area and power of a NoC implementation as opposed to traditional bus interconnects:
  - Plain shared bus
  - Hierarchical bus
- 130nm technology
- Note: based on an old, unoptimized version of the NoC architecture

AMBA AHB Shared Bus

[Diagram: masters M0-M9 and traffic generators T0-T4, private slaves P0-P9 and shared slaves S10-S14, all on a single AMBA AHB bus]

- Baseline architecture
- Ten ARM cores, five traffic generators, fifteen slaves (fully populated bus)
- ARM cores: running a pipelined multimedia benchmark
- Traffic generators:
  - Streaming traffic towards a memory (DSP-like)
  - Periodically querying some slaves (IOCtrl-like)

AMBA AHB Multilayer

[Diagram: five AHB layers, each hosting two masters, one traffic generator and two private slaves (e.g. M0, M1, T0, P0, P1 on Layer 0), all connected to the shared slaves S10-S14 through an AMBA AHB crossbar]

- Dramatically improves performance:
  - Intra-cluster traffic to private slaves (P0-P9) is bound within each layer, reducing congestion
  - Shared slaves (S10-S14) can be accessed in parallel
- Representative 5x5 multilayer configuration (up to 8x8 allowed)

xpipes (Quasi-)Mesh

[Diagram: quasi-mesh floorplan of the same cores (M0-M9, P0-P9, T0-T4, S10-S14) in 130nm, with 1 mm² tiles]

- Excellent bandwidth
- Balanced architecture, no max-frequency bottlenecks
- Very regular topology: easy to floorplan
- Area & power overhead due to the many links and buffers

NoCs vs. Traditional Interconnects - Performance

[Chart: execution time (ms) vs. cache size (256 B, 1 kB, 4 kB) for the AMBA AHB shared bus, the AMBA AHB multilayer, and the xpipes meshes (21-bit and 38-bit, 3 buffers)]

- Time to complete the functional benchmark
- Shared buses totally collapse
- NoCs are 10-15% faster than hierarchical buses

Observation #1: NoCs are much more scalable and can provide better performance under severe load.

NoCs vs. Traditional Interconnects - Summary

Cross-benchmarking layout results, AMBA vs. NoCs:

                            AMBA Multilayer   xpipes 21-bit qmesh   xpipes 38-bit qmesh
Frequency                   370 MHz           793 MHz               793 MHz
Frequency predictability    -23%              -6%                   -6%
Bandwidth                   24 GB/s           87 GB/s               158 GB/s
Benchmark execution time    baseline          ~10% faster           ~15% faster
Cell area                   0.52 mm²          1.7 mm²               2.1 mm²
Power                       75 mW             376 mW                473 mW
Energy (NoC + 5W cores)     5.08 mJ           5.17 mJ               4.96 mJ

Observation #2: NoCs are dramatically more predictable than traditional interconnects.

Observation #3: NoCs are better in performance and physical design, but be careful about area and power!

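A quick sanity check on the energy row explains why the power-hungry NoC still roughly ties on energy: the cores dominate the budget, so finishing faster compensates. Execution times are back-solved here for illustration, and the constant 5 W core budget is the slide's own simplification.

```python
# Energy ≈ (interconnect power + 5 W of core power) × benchmark execution
# time. Small mismatches vs. the slide's ~10-15% speedups are expected.

CORE_POWER_W = 5.0
designs = {                    # (interconnect power in W, energy in mJ)
    "AMBA Multilayer":     (0.075, 5.08),
    "xpipes 38-bit qmesh": (0.473, 4.96),
}
for name, (p_ic, e_mj) in designs.items():
    t_ms = e_mj / (p_ic + CORE_POWER_W)     # mJ / W = ms
    print(f"{name}: implied execution time ~ {t_ms:.2f} ms")
```
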
Bandwidth or Latency?

[Chart: overall bandwidth (GB/s) vs. cache size; xpipes 38-bit qmesh: 158 GB/s, xpipes 21-bit qmesh: 87 GB/s, AMBA Multilayer: 24 GB/s]
[Chart: processor-perceived latency (ns) for short reads and posted writes vs. cache size (256 B, 1 kB, 4 kB), for the AMBA AHB multilayer and the two xpipes meshes]

- NoC bandwidth is much higher (44 links, ~1 GHz; see the reconstruction below)
- But this is only an indirect clue of performance
- NoC latency penalty/gain depends on the transaction:
  - Penalty on short reads
  - Gain on posted writes

Observation #4: Latency matters more than raw bandwidth. NoCs have to be careful about some transaction types.

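The aggregate-bandwidth figures can be roughly reconstructed as links × flit width × link frequency. Using the slide's 44 links at the layout's 793 MHz lands within a few percent of the reported numbers; the residual gap is plausibly flow-control overhead (this is our reconstruction, not the slide's arithmetic).

```python
# Rough reconstruction of the aggregate NoC bandwidth quoted above.

LINKS, FREQ_GHZ = 44, 0.793
for flit_bits, reported in ((21, 87), (38, 158)):
    bw = LINKS * (flit_bits / 8) * FREQ_GHZ       # GB/s
    print(f"{flit_bits}-bit qmesh: computed {bw:.0f} GB/s (reported {reported})")
```
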
Area, Power Budget Analysis

[Pie charts for the 38-bit qmesh:
 a. Area: switches 48%, NI initiators 24%, NI targets 24%, clock trees & spare cells 4%
 b. Power: xpipes clock tree 35%, switches 31%, NI initiators 14%, NI targets 11%, OCP clock tree 9%]

Observation #5: Clock trees are negligible in area, but eat up almost half of the power budget.

Study 2: Implementation of NoCs in 90 and 65nm

- Study the behaviour of NoCs as they are implemented in cutting-edge technologies
- Observe the behaviour of tech libraries, tools, architecture and links as they are scaled from one technology node to another

Link Design Constraints

[Chart: power to drive a 38-bit (plus flow control) unidirectional link, for 65nm lowest-power and 65nm power/performance libraries]

Observation #6: Long links (unless custom designed) become either infeasible or too power-hungry. Keep them segmented.

Link Repeaters/Relay Stations

- Wire segmentation by topology design:
  - Put more switches, closer together
  - Adds a lot of overhead
- Wire segmentation by repeater insertion:
  - Flops/relay stations to break links
  - Details are tightly coupled to the flow control scheme (see the sketch after the next observation)

[Diagram: sender-to-receiver links segmented by repeaters, with VALID/(N)ACK handshakes in one scheme and VALID/STALL handshakes in the other]

Observation #7: Architectural provisions may be needed to tackle physical-level issues. These may impact performance, so they should be accounted for in advance.

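To make this concrete, here is a toy Python model of a STALL-based relay station (the lower diagram's scheme); the key architectural provision is the two-slot buffer, since one slot must absorb the flit already in flight when STALL is raised. Timing is simplified to one event per cycle, and all names are ours, not the xpipes implementation's.

```python
# Minimal sketch of a STALL-based relay station breaking a long link.

class RelayStation:
    def __init__(self):
        self.slots = []            # at most 2 buffered flits

    def stall_upstream(self):
        """Raise STALL towards the sender when one slot is already full:
        the second slot absorbs the flit still in flight."""
        return len(self.slots) >= 1

    def cycle(self, flit_in, downstream_stall):
        """One clock edge: accept an incoming flit, emit one if not stalled."""
        if flit_in is not None:
            assert len(self.slots) < 2, "overflow: sender ignored STALL"
            self.slots.append(flit_in)
        if not downstream_stall and self.slots:
            return self.slots.pop(0)   # forward one flit downstream
        return None

rs = RelayStation()
print(rs.cycle("flit0", downstream_stall=True))   # None: flit0 buffered
print(rs.stall_upstream())                        # True: sender must pause
print(rs.cycle("flit1", downstream_stall=False))  # "flit0" forwarded
```
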
Wireload Models and 65nm

- Wireload models to guesstimate propagation delay during logic synthesis are inaccurate:
  - As seen, at 130nm they are 6 to 23% off from the actually achievable post-placement timing
- In 65nm, the problem is dramatically worse:
  - No timing closure after placement (-50% frequency, huge runtimes...)
  - Traditional logic synthesis tools (e.g. Synopsys Design Compiler) are insufficient
- Physical synthesis, however, works great

Observation #8: Physical synthesis is compulsory for next-generation nodes.

Placement in Soft Macros

- In our experiments, placement & routing is extremely sensitive to soft macro area:
  - Fences too tight: the flow fails
  - Fences too wide: the tool produces bad results
- Solution: accurate component area models (a sketch follows below)
  - Involves work, since area depends on architectural parameters (cardinality, buffering...)

Observation #9: Thorough characterization of the components may be key to the convergence of the flow for a whole topology.

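Such an area model might take the form of a fit in the architectural parameters the slide mentions. The structure below (crossbar term plus buffering term) is a plausible shape, but every coefficient is invented; a real model would be back-annotated from layouts, as the slide says.

```python
# Hypothetical switch area model: linear fit in port cardinality, flit width
# and buffer depth (all coefficients invented for illustration).

def switch_area_mm2(ports_in, ports_out, flit_bits, buf_depth,
                    k_xbar=2.0e-5, k_buf=1.2e-5, k_fixed=0.004):
    crossbar = k_xbar * ports_in * ports_out * flit_bits
    buffers = k_buf * (ports_in + ports_out) * buf_depth * flit_bits
    return k_fixed + crossbar + buffers

# Sizing a placement fence for a 6x6, 38-bit, 6-buffer switch at 85% util:
area = switch_area_mm2(6, 6, 38, 6)
print(f"model area {area:.3f} mm^2, fence {area / 0.85:.3f} mm^2")
```
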
65nm Degrees of Freedom

[Bar chart: relative frequency and relative power for 90nm HP, 90nm LP, 65nm HP and 65nm LP libraries; spreads across library choices range from 2.7X to 11X]

- LP and HP libraries differ in gate design, VT, VDD...

Observation #10: There is no such thing as a "65nm library". Power/performance degrees of freedom span across one order of magnitude. It is the designer's (or the tools') responsibility to pick the right library.

Technology Scaling within Modules

[Chart: relative frequency, area and power of a 6x6 switch (38 bits, 6 buffers), 90nm HP vs. 65nm HP]

- Within modules, scaling looks great:
  - +25% frequency
  - -46% area
  - -52% power

Technology Scaling on Topologies

- Three designs, each tuned for maximum frequency:
  - 90 nm, 1 mm² cores
  - 65 nm, 1 mm² cores
  - 65 nm, 0.4 mm² cores

Mesh Scaling

Scaling of meshes (max performance corner):

                      Max Layout Frequency   Max Bandwidth   Cell Area   Power/MHz
90nm, 1 mm² cores     1 GHz                  228 GB/s        1.31 mm²    0.785 mW/MHz
65nm, 1 mm² cores     1.25 GHz               285 GB/s        0.64 mm²    0.416 mW/MHz
65nm, 0.4 mm² cores   1.25 GHz               285 GB/s        0.63 mm²    0.396 mW/MHz

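Multiplying the Power/MHz column by the maximum layout frequency (a small worked check on the table, using only the slide's numbers) shows the 65nm meshes running 25% faster while burning about a third less total power:

```python
# Total NoC power at the maximum layout frequency, from the table above.

meshes = {                      # (freq in MHz, mW per MHz)
    "90nm, 1 mm^2 cores":   (1000, 0.785),
    "65nm, 1 mm^2 cores":   (1250, 0.416),
    "65nm, 0.4 mm^2 cores": (1250, 0.396),
}
for name, (f, p) in meshes.items():
    print(f"{name}: {f * p:.0f} mW at {f} MHz")   # 785, 520, 495 mW
```
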
Links

- Always short (<1.2 mm) → non-pipelined
- However, link power differs:
  - 90 nm, 1 mm² cores: 3.1 mW
  - 65 nm, 1 mm² cores: 3.6 mW (tightest fit → more buffering)
  - 65 nm, 0.4 mm² cores: 2.2 mW
- Power is shifting from switches/NIs to links (buffering)

High-Radix Switch Feasibility

[Chart: maximum frequency (MHz) vs. switch radix (2x2 up to 30x30), estimated after synthesis vs. measured after P&R]

- High-radix switches become too slow
- 10x10 is the maximum realistic size
- For sizes 26x26 and 30x30, P&R is unfeasible!

Clock Skew in High-Radix Switches

[Chart: absolute clock tree skew (ns) and relative skew (% of clock period) vs. switch radix, 2x2 to 30x30]

- A single switch is still a small entity
- Skew can be confined to <10%, typically <5%

A Complete NoC Synthesis Flow

Design of a NoC-Based System

- Software services: mapping, QoS, middleware...
- Architecture: packeting, buffering, flow control...
- Physical implementation: synchronization, wires, power...
- CAD tools

- All of these items are key opportunities and challenges
- Strict interaction/feedback is mandatory!
- CAD tools must guide designers to the best results

The Design Tool Dilemma

- Automatically find a topology and architectural parameters such that:
  - Design constraints are satisfied
  - Area, power and latency are minimized

"A hypercube? A torus? Or do I want a custom topology?"

Custom Topology & Mapping

- Objectives:
  - Design fully application-specific custom topologies
  - Generate deadlock-free networks
  - Optimize architectural parameters of the NoC (frequency, flit size), tuning them based upon application requirements
- Physical design awareness:
  - Leverage accurate analytical models for area and power, back-annotated from layouts
  - Integrated floorplanner to achieve design closure while also considering wiring complexity

The xpipes NoC Design Flow

[Flow diagram:
 - Inputs: application traffic (task graph), user objectives (power, hop delay), constraints (area, power, hop delay, wire length), NoC area/power models, IP core models, NoC component library
 - Topology Synthesis (SunFloor, which includes a floorplanner and a NoC router) → system specs
 - Platform Generation (xpipesCompiler) → SystemC code → architectural simulation / FPGA emulation, and RTL synthesis
 - Synthesis → Placement & Routing (with floorplanning specifications) → to fab
 - Area and power characterization is fed back into the NoC models]

Example: Task Graph

[Task graph with nodes VLD, INV SCAN, ACDC PRED, IQUANT, RLD, IDCT, UP SAMP, PAD, VOP REC, VOP MEM, STRIPE MEM, ARM]

- Captures communication among system cores (see the sketch below):
  - Source/destination pairs
  - Required bandwidth

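Such a graph is easy to capture as a set of bandwidth-annotated directed edges, which is all the topology synthesis step needs as traffic input. Node names come from the slide's example; the MB/s values and the helper function are ours, purely for illustration.

```python
# Sketch of a task graph as input to topology synthesis: directed
# (source, destination) pairs annotated with required bandwidth.

TASK_GRAPH = {
    # (source, destination): required bandwidth in MB/s (illustrative)
    ("VLD", "INV SCAN"):       70,
    ("INV SCAN", "ACDC PRED"): 362,
    ("ACDC PRED", "IQUANT"):   362,
    ("IQUANT", "IDCT"):        357,
    ("IDCT", "VOP REC"):       353,
    ("VOP REC", "VOP MEM"):    300,
}

def required_port_bandwidth(core):
    """Total bandwidth a core's network interface must sustain."""
    return sum(bw for (s, d), bw in TASK_GRAPH.items() if core in (s, d))

print(required_port_bandwidth("IQUANT"), "MB/s")   # 362 + 357 = 719
```
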
Measuring xpipes Performance

[Tool-flow diagram:
 - topology specs + xpipes library → xpipesCompiler → topology SystemC (fabric instantiation)
 - topology SystemC + traffic generators → cycle-accurate simulation platform → architectural statistics, traffic logs (architectural simulation)
 - topology SystemC → RTL SystemC Converter → topology HDL (HDL translation)
 - topology HDL + tech library → Synopsys Physical Compiler → topology netlist (fabric synthesis)
 - topology netlist → Synopsys Astro → topology floorplan, area figures (place & route)
 - Mentor ModelSim + Synopsys PrimePower → performance figures, power figures (verification, power modeling)]

Example Layout

- Floorplan is automatically generated
- Black areas = IP cores
- Colored areas = NoC
- Over-the-cell routing allowed in this example
- 65nm design