sFPGA2 Architecture


The Optimization of Interconnection Networks in FPGAs
Dr. Yajun Ha
Assistant Professor
Department of Electrical & Computer Engineering
National University of Singapore
© NUS 2010
Dagstuhl Seminar
Outline
• Background and Motivation
• Time-multiplexed interconnects in FPGAs
• sFPGA2 architecture
• Conclusion
FPGA Research Challenges
Research challenges for FPGA architectures and tools are closely linked. These challenges come from the underlying semiconductor technologies. Scaling semiconductor technologies brings the following new challenges:
• Leakage power: dual-Vt, dual-Vdd, or subthreshold architectures;
• Process variations: reconfigurability for variability, fault tolerance;
• Substantially more transistors: scalable, multi-core, secure architectures and system-level design (SLD).
Motivation
Logic and interconnect are unbalanced in FPGAs.
Qualitatively:
• "PLDs are 90% routing and 10% logic." (Prof. Jonathan Rose, Design of Interconnection Networks for Programmable Logic, Kluwer Academic Publishers, 2004, p. xix)
• "…(in FPGAs) programmable interconnect comes at a substantial cost in area, performance and power." (Prof. Jan Rabaey, Digital Integrated Circuits, 2nd Edition, Prentice-Hall, 2003, p. 413)
Quantitatively:
• Area: logic area vs. routing area;
• Delay: logic delay vs. net delay;
• Power: dynamic power consumed by logic vs. by interconnect.
Unbalance: Area
[Bar chart: routing area / logic area ratio, roughly 5x to 25x, for each benchmark (alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584.1, seq, spla, tseng).]
Relative weight of routing area vs. logic area for the 20 largest MCNC benchmark circuits, assuming a PTM 90nm CMOS process. Data produced by VPR v5.0.2.
Unbalance: Delay
[Bar chart: logic delay vs. net delay shares of the critical path for each benchmark; net delay dominates, typically 40%-90%.]
Delay breakdown along the critical path for the 20 largest MCNC benchmarks, assuming a PTM 90nm CMOS process. Data produced by VPR v5.0.2.
Unbalance: Power
Dynamic power breakdown for a real circuit [1], implemented on a Xilinx Virtex-II FPGA.
Note:
• Double: the length-2 wires;
• Hex: the length-6 wires;
• Long: the long wires spanning the whole chip;
• IXbar & OXbar: the crossbars at the input and output pins of the logic blocks.
[1] L. Shang, A. Kaviani and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," ACM FPGA, 2002.
Outline
• Background and Motivation
• Time-multiplexed interconnects in FPGAs
• sFPGA2 architecture
• Conclusion
Intra-Clock Cycle Idleness
The clock cycle is constrained by the critical-path delay, so many wires are idle for a significant fraction of each clock cycle.
An example:
• clma: the largest circuit (~8400 4-input LUTs) in the MCNC benchmark suite;
• implemented with VPR v5.0.2 on an island-style FPGA (ten 4-input LUTs per CLB, 100% length-4 wires), assuming the PTM 90nm CMOS process.
Timing results after placement and routing:
• Critical path delay = 9.50 ns;
• Most nets (~96.5%) have delays of less than 1 ns.
Expensive wires are often less utilized.
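The idleness argument can be sketched in a few lines (the sample net delays below are hypothetical placeholders; the real figures come from the VPR timing report):

```python
# Sketch of intra-clock-cycle idleness. The net delays are made-up
# placeholders; actual values come from VPR v5.0.2 timing reports.

def idle_fraction(net_delay_ns, clock_ns):
    """Fraction of the clock cycle during which a net's wire sits idle."""
    return max(0.0, 1.0 - net_delay_ns / clock_ns)

clock = 9.50                       # clma critical path delay (ns)
net_delays = [0.4, 0.7, 0.9, 1.2]  # hypothetical sample of net delays (ns)

for d in net_delays:
    print(f"net {d:.1f} ns -> wire idle {idle_fraction(d, clock):.0%} of the cycle")
```

A net with sub-1 ns delay leaves its wire idle for about 90% of the 9.50 ns cycle, which is exactly the slack time-multiplexing exploits.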
Time-Multiplexing
[Figure: with a conventional switch, two nets N1 and N2 between CLBs occupy two routing wires; with multi-context switches, the two nets share one wire.]
• Use switches with multiple contexts to time-multiplex wires and keep wires busy;
• This can potentially save wire area and achieve better timing performance.
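The sharing idea can be sketched with a toy microcycle schedule (the microcycle model and net names here are illustrative assumptions, not the actual tool flow):

```python
# Toy model: a multi-context switch stores one routing configuration per
# microcycle, so a single physical wire can carry several nets as long
# as their assigned microcycles differ.

def assign(wire_schedule, net, microcycle):
    """Put `net` on the wire in `microcycle`; fail if the slot is taken."""
    if microcycle in wire_schedule:
        return False
    wire_schedule[microcycle] = net
    return True

wire = {}                     # microcycle -> net, for one physical wire
print(assign(wire, "N1", 0))  # True:  N1 drives the wire in microcycle 0
print(assign(wire, "N2", 1))  # True:  N2 reuses the wire in microcycle 1
print(assign(wire, "N3", 0))  # False: microcycle 0 is already occupied
```

A real router must also check that each net's delay fits inside its microcycle; this sketch only captures the slot-sharing constraint.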
Preliminary Results
We bring time-multiplexing enhancements to existing CAD tools. Preliminary studies show positive results:
• For 16 MCNC benchmark circuits, ~11.5% saving in the minimum required number of wires, at (only) ~1.5% timing overhead;
• For the same 16 circuits, ~8.2% reduction in critical path delay when using the same number of wires;
• See [1][2] for details.
[1] H. Liu et al., "An Area-Efficient Timing-Driven Routing Algorithm for Scalable FPGAs with Time-Multiplexed Interconnects," FCCM 2008.
[2] H. Liu et al., "An Architecture and Timing-Driven Routing Algorithm for Area-Efficient FPGAs with Time-Multiplexed Interconnects," FPL 2008.
TM FPGA Challenges and Ongoing Work
• The TM rate cannot be too high if the TM clock rate is to remain reasonable; we are currently targeting a rate of 2-4.
• The nets that qualify for TM are limited, since most nets finish within the first microcycle.
• Dual-Vt architectures are proposed to adjust net delays, achieving low power and more TM opportunities.
Outline
• Background and Motivation
• Time-multiplexed interconnects in FPGAs
• sFPGA2 architecture
• Conclusion
Motivation
In current FPGAs, the switching requirement grows superlinearly with the number of logic resources; in other words, the current architecture scales poorly.
To address this, we organize the FPGA interconnect wires hierarchically to achieve scalability [3].
[3] Rizwan Syed et al., "sFPGA2 - A Scalable GALS FPGA Architecture and Design Methodology," FPL 2009.
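The superlinear trend can be illustrated with a Donath-style wiring estimate derived from Rent's rule (the Rent exponent p = 0.7 is an assumed typical value for logic circuits, not a measurement of any particular FPGA):

```python
# Donath-style estimate: with Rent exponent p > 0.5, total wiring demand
# grows roughly as N**(p + 0.5), i.e. superlinearly in logic size N.
# p = 0.7 is an assumed typical value, used only for illustration.

def total_wiring(n_blocks, p=0.7):
    return n_blocks ** (p + 0.5)

for n in (1_000, 10_000, 100_000):
    print(f"N={n:6d}: wiring per logic block ~{total_wiring(n) / n:5.2f}x")
```

The per-block ratio keeps rising as N grows, which is the flat architecture's scaling problem; a hierarchy caps the demand seen at each level.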
How Are Multiple FPGAs Connected?
[Figure: board-level interconnect examples, e.g. MGT-based serial switch interconnect and PCI Express.]
Serial, switch-based interconnects are the future of peripheral interconnect!
sFPGA2 Is an On-Chip Version
sFPGA2 is a scalable FPGA architecture that uses a hierarchical routing network, employing high-speed serial links and switches to route multiple nets simultaneously [3].
It consists of two levels:
• Base level (e.g., A0…A7, S0);
• Higher levels (e.g., X0).
[Figure: architecture block diagram.]
[3] Rizwan Syed et al., "sFPGA2 - A Scalable GALS FPGA Architecture and Design Methodology," FPL 2009.
sFPGA2 Architecture (Contd)
A0…A7 are FPGA tiles (similar to current FPGAs). [Figure courtesy of Xilinx (Virtex-II Pro).]
S0 contains very high-speed transceivers capable of aggregating multiple high-speed serial links into one very high-speed link.
sFPGA2 (Contd)
Routing is done using one of the two methodologies shown in the figure:
• Intra-cluster routing uses only the switch blocks and channels at that level;
• Inter-cluster routing uses the very high-speed links and switches.
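As a sketch, the two-level decision reduces to comparing cluster membership (the tile-to-cluster mapping and names below are illustrative assumptions, not the sFPGA2 tool flow):

```python
# Toy two-level router: nets inside one base-level cluster stay on local
# switch blocks and channels; nets crossing clusters go over the serial
# links. Cluster membership here is an illustrative assumption.

CLUSTER = {f"A{i}": i // 8 for i in range(16)}  # A0-A7 -> cluster 0, A8-A15 -> 1

def route_level(src, dst):
    if CLUSTER[src] == CLUSTER[dst]:
        return "intra-cluster: base-level switch blocks and channels"
    return "inter-cluster: high-speed serial links and switches"

print(route_level("A0", "A7"))  # stays within cluster 0
print(route_level("A0", "A8"))  # crosses clusters, uses serial links
```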
Design Methodology
[Figure: a dataflow graph (NOP source v0, nodes v1…v11 with *, +, < operations, NOP sink vn) partitioned across tiles; one edge is an inter-tile net.]
The new step is to deal with inter-tile nets!
Preliminary Results
We successfully implemented a JPEG engine and demonstrated transporting groups of nets on an emulation platform built from three Xilinx Virtex-II Pro FPGA boards; serial communication was emulated by MGTs.
Preliminary studies show that the transport latency is very high, mainly due to high-latency transceivers, which limits the application domain to GALS designs. However, as transceivers advance, this can be extended to purely synchronous designs as well.
Conclusion
The logic/interconnect unbalance in FPGAs makes optimization of the interconnection network important.
• Significant intra-clock-cycle idleness exists in FPGA routing wires. Time-multiplexing increases resource utilization, and can potentially save area and achieve better timing.
• The current FPGA interconnection network is not scalable. An on-chip network consisting of switches and serial links can improve scalability.
Promising preliminary results justify our approaches. Future work needs to thoroughly investigate the impact of the architecture changes.
Multi-FPGA or Multi-Core?
[Figure: two NoC-based layouts side by side, one connecting six FPGA tiles and one connecting six uP (processor) tiles.]
1. Building multi-FPGA or multi-core chips will not be difficult given the development of semiconductor technology.
2. We (hardware engineers) know more about programming multi-FPGA systems than about programming multi-core processors.
3. Should we use VHDL/Verilog as the (intermediate) programming language for both multi-FPGA and multi-core?
See also
• VPR v5.0.2 - Versatile Placement & Routing tool for heterogeneous FPGAs: http://www.eecg.utoronto.ca/vpr/
• Predictive Technology Model (PTM): http://www.eas.asu.edu/~ptm