
Synthesizing Datapath Circuits for FPGAs With Emphasis on Area Minimization
Andy Ye, David Lewis, Jonathan Rose
Department of Electrical and Computer Engineering,
University of Toronto
{yeandy, lewis, jayar}@eecg.utoronto.ca
1
Motivation: Datapath Regularity
• Larger FPGAs
– Larger applications on FPGAs
– More datapath logic in larger applications
– Datapath logic is highly regular
• Utilize regularity to improve logic density
2
Utilizing Datapath Regularity
• A new datapath-oriented FPGA
• New CAD tools supporting the new FPGA
– Synthesis
– Packing
– Placement
– Routing
• This talk focuses on synthesis
3
Background: Datapath-oriented FPGA
• Architected to utilize datapath regularity
• Architectural features
– Capture regularity using special logic blocks
– Increase logic density by coarse grain routing
4
Background: FPGA Overview
[Figure: FPGA overview. Logic clusters (L) and switch boxes (S) are connected by routing channels that contain coarse grain routing tracks and fine grain routing tracks.]
5
Background: Logic Cluster
[Figure: a logic cluster made of four subclusters (Subcluster 1 to Subcluster 4), each holding four BLEs. Inset "A Subcluster": four BLEs and a local routing network. Inset "A Basic Logic Element (BLE)": LUT, DFF, and MUX.]
6
Background: Coarse Grain Routing Tracks
[Figure: a logic cluster with four subclusters, connected through multiplexers (M) to fine grain routing and coarse grain routing, alongside a switch box.]
8
Datapath Synthesis
• Synthesis
– The first step in a fully automated CAD flow
– Transforms high level descriptions into logic
• Conventional synthesis (flat synthesis)
– Minimizes area and delay metrics
– Destroys datapath regularity
• Datapath synthesis
– Preserves datapath regularity
– Supports downstream CAD tools
9
Datapath Representation
• Datapath circuits are represented by netlists of
datapath components (VHDL or Verilog)
• Datapath component library
– Multiplexers
– Adders/subtracters
– Shifters
– Comparators
– Registers
• Each component consists of identical bit-slices
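To make the bit-slice idea concrete, here is a minimal sketch (not taken from the talk) of an adder built by repeating one identical bit-slice. The function names `full_adder_slice` and `ripple_add` are illustrative assumptions.

```python
def full_adder_slice(a, b, cin):
    """One adder bit-slice: returns (sum_bit, carry_out) for 1-bit inputs."""
    total = a + b + cin
    return total & 1, total >> 1


def ripple_add(x, y, width):
    """Add two width-bit words by instantiating the same bit-slice width times."""
    carry = 0
    result = 0
    for bit in range(width):
        sum_bit, carry = full_adder_slice((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= sum_bit << bit
    return result, carry


print(ripple_add(5, 6, 4))
```

Every iteration of the loop uses the same slice function. This is the regularity that datapath synthesis tries to preserve.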
10
Hard Boundary Hierarchical Synthesis
• Optimize within the boundaries of bit-slices
• Keep identical bit-slices identical
• Optimized 15 datapath circuits from the picoJava processor using Synopsys [sun]
– Good regularity
– Poor area: 38% area inflation
• The FPGA architecture aims to increase logic density
– A better synthesis tool is needed
11
Causes of Area Inflation
• Examined circuits to determine the causes
• Constraint of preserving bit-slice boundaries
– Common sub-expressions exist across bit-slices
– Harder to discover in datapath synthesis
• Constraint of preserving datapath regularity
– Identical bit-slices have different external connections
– Some bit-slices have more optimization opportunities
– Optimization opportunities are missed if all bit-slices
must be kept identical
12
Enhanced Module Compaction
[Flow diagram, top to bottom:]
• Input: Netlist of Datapath Components
• Word-level Optimization (Manual Operation)
• Module Compaction
• Bit-slice Netlist I/O Optimization
• Flat Synthesis & Optimization Within Bit-slice Boundaries
• Output: Netlist of Synthesized Bit-slices
13
Word-level Optimization
• Currently done manually; will be automated
• Optimizes across bit-slice boundaries
• Uses the functionality of each datapath
component to create optimization opportunities
• Two optimizations are performed
– Multiplexer tree collapsing
– Operation reordering
• More in the future
14
Multiplexer Tree Collapsing
• Datapath circuits contain multiplexers in a
tree topology
• Collapses several multiplexers in a
multiplexer tree into a single multiplexer
• The collapsing operation creates common sub-expressions
• Extracts the common sub-expressions out of
multiple bit-slices to save area
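The following is a minimal sketch of the collapsing idea, not the authors' tool. It assumes a two-level tree of 2:1 multiplexers and assumes that the shared sub-expression is the select decoding. All function names are hypothetical. The decoded select lines depend only on S1 and S2, so under these assumptions they can be computed once and reused by every bit-slice.

```python
from itertools import product


def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux_tree(s1, s2, a, b, c):
    """Original tree: mux1 picks between a and b, mux2 picks between that and c."""
    return mux2(s2, mux2(s1, a, b), c)


def decode_select(s1, s2):
    """One-hot select decoding; identical for all bit-slices (assumed shared logic)."""
    pick_c = s2
    pick_b = (1 - s2) & s1
    pick_a = (1 - s2) & (1 - s1)
    return pick_a, pick_b, pick_c


def collapsed_mux(onehot, a, b, c):
    """Single collapsed multiplexer driven by the one-hot select lines."""
    pick_a, pick_b, pick_c = onehot
    return (pick_a & a) | (pick_b & b) | (pick_c & c)


def equivalent():
    """Exhaustively check that the collapsed form matches the original tree."""
    for s1, s2, a, b, c in product((0, 1), repeat=5):
        if mux_tree(s1, s2, a, b, c) != collapsed_mux(decode_select(s1, s2), a, b, c):
            return False
    return True


print(equivalent())
```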
15
An Example
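[Figure: multiplexer tree collapsing example with signals A and R, select signals S1 and S2, multiplexers mux1 and mux2, flip-flops (FF), and random logic (rl).]
rl – random logic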
16
Operation Reordering
• Transforms result selection into operand
selection
• Accepts the transformation if it results in a
smaller area
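As a hedged illustration of the transformation, the sketch below checks that selecting between two sums (two adders, one multiplexer) gives the same value as selecting the operands first and adding once (two multiplexers, one adder). The pairing of operands as a + b and c + d, the word width, and the function names are assumptions made for the example.

```python
from itertools import product

WIDTH = 3  # word width for the exhaustive check (assumed for illustration)
MASK = (1 << WIDTH) - 1


def result_selection(s, a, b, c, d):
    """Two adders followed by one multiplexer that selects the result."""
    sum_ab = (a + b) & MASK
    sum_cd = (c + d) & MASK
    return sum_cd if s else sum_ab


def operand_selection(s, a, b, c, d):
    """Two multiplexers that select the operands, followed by one adder."""
    x = c if s else a
    y = d if s else b
    return (x + y) & MASK


def equivalent():
    """Compare both forms over every select value and operand combination."""
    values = range(1 << WIDTH)
    return all(
        result_selection(s, a, b, c, d) == operand_selection(s, a, b, c, d)
        for s in (0, 1)
        for a, b, c, d in product(values, repeat=4)
    )


print(equivalent())
```

The check only establishes functional equivalence. Whether the reordered form is actually smaller still has to be decided by comparing synthesized area.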
17
An Example
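[Figure: operation reordering example with inputs a, b, c, d, select s, and output e. Before: two adders feed one multiplexer; bit-slice 0 contains two sum/carry cells (cin0a/cout0a and cin0b/cout0b). After: two multiplexers feed one adder; bit-slice 0 contains one sum/carry cell (cin0/cout0).]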
18
Module Compaction
• Merges bit-slices into larger bit-slices
• Based on connectivity between datapath
components
• Larger bit-slices have more optimization
opportunities for flat synthesis
• Avoids merging based on carry chains
• Similar to the algorithm proposed by Koch
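The grouping step can be sketched with a generic union-find pass. This is a simplified illustration only, not Koch's algorithm and not the authors' implementation. The `compact` function, the component names, and the "data"/"carry" connection labels are all assumptions.

```python
def compact(components, connections):
    """Group components whose bit-slices would be merged.

    components  -- list of component names
    connections -- list of (src, dst, kind) tuples; kind is "data" for
                   ordinary links between components or "carry" for
                   carry-chain links, which are ignored when merging
    """
    parent = {name: name for name in components}

    def find(name):
        # Follow parent links to the group root, halving the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for src, dst, kind in connections:
        if kind == "carry":
            continue  # carry chains do not drive merging
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    groups = {}
    for name in components:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


example = compact(
    ["adder", "mux", "reg", "cmp"],
    [("adder", "mux", "data"), ("mux", "reg", "data"), ("adder", "adder", "carry")],
)
print(example)
```

Each resulting group stands for one larger bit-slice that flat synthesis can then optimize as a unit.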
19
An Example
[Figure: module compaction example with full adders FA0 to FA4 and multiplexers mux0 to mux3.]
20
Bit-slice I/O Optimization
• Granularity of bit-slice I/O optimization, m
• Breaks datapath components into m-bit
wide chunks
• The m bit-slices in each chunk are kept identical to each other
• Allows some bit-slices in a datapath
component to be optimized more than
others
21
Bit-slice I/O Optimization
• Converts bit-slice I/O signals into internal signals
if all m bit-slices meet an optimization criterion
• More optimization opportunities for flat synthesis
• Four types of I/O optimizations
– Constant absorption
– Feedback absorption
– Duplicated input absorption
– Unused output absorption
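A minimal sketch of how the granularity m might gate one of these optimizations, using constant absorption as the example. The criterion used here (every bit-slice in an m-bit chunk sees the same constant on a given input) is an assumption for illustration, as are the function names and the input encoding.

```python
def chunks(width, m):
    """Split bit indices 0..width-1 into consecutive chunks of up to m bits."""
    return [list(range(start, min(start + m, width))) for start in range(0, width, m)]


def absorbable_constant(input_drivers, chunk):
    """Return the shared constant if every bit-slice in the chunk sees the
    same constant on this input, otherwise None.

    input_drivers -- per-bit driver: 0 or 1 for a constant, or a signal name
    """
    seen = {input_drivers[bit] for bit in chunk}
    if len(seen) == 1:
        (value,) = seen
        if value in (0, 1):
            return value
    return None


# Hypothetical 8-bit component input: bit 4 is driven by a signal, the rest by 0.
drivers = [0, 0, 0, 0, "x4", 0, 0, 0]
for chunk in chunks(8, 4):
    print(chunk, absorbable_constant(drivers, chunk))
```

With m = 4 the lower chunk can absorb the constant while the upper chunk cannot. With m = 1 seven of the eight bit-slices could absorb it, at the cost of regularity.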
22
Experimental Results
• Fifteen benchmark circuits
– From the picoJava processor
– Synthesized into 4-LUTs and DFFs
• Experiments
– Area
– Regularity
– Area against m (the granularity of bit-slice I/O
optimization)
23
Area
• m (granularity of bit-slice I/O optimization) = 4
• Compare datapath synthesis with flat
synthesis
24
Post-synthesis Area (LUT Count)
Circuit        Flat Synthesis Area   Datapath Synthesis Area   Inflation
icu_dpath      3120                  3235                      3.7%
ex_dpath       2530                  2553                      0.91%
multmod_dp     1558                  1634                      4.9%
ucode_dat      1243                  1304                      4.9%
imdr_dpath     1182                  1219                      3.1%
dcu_dpath      960                   966                       0.63%
mantissa_dp    846                   878                       3.8%
incmod_dp      779                   865                       11%
smu_dpath      490                   493                       0.61%
exponent_dp    477                   501                       5.0%
pipe_dpath     443                   471                       6.3%
prils_dp       377                   388                       2.9%
rsadd_dp       346                   305                       -12%
code_seq_dp    218                   223                       2.3%
ucode_reg      78                    82                        5.1%
Total Area     14647                 15117                     3.2%
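The inflation figures are consistent with computing (datapath area − flat area) / flat area. The short sketch below recomputes three of them from the LUT counts above. The formula is inferred from the numbers rather than stated on the slide, and the function name is an assumption.

```python
def inflation_percent(flat_area, datapath_area):
    """Area inflation of datapath synthesis relative to flat synthesis, in percent."""
    return 100.0 * (datapath_area - flat_area) / flat_area


# (flat synthesis LUTs, datapath synthesis LUTs) copied from the table above
rows = {
    "icu_dpath": (3120, 3235),
    "rsadd_dp": (346, 305),
    "Total Area": (14647, 15117),
}

for name, (flat, datapath) in rows.items():
    print(f"{name}: {inflation_percent(flat, datapath):.1f}%")
```

A negative value, as for rsadd_dp, means the datapath-synthesized circuit came out smaller than the flat one.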
25
Regularity
• m (granularity of bit-slice I/O optimization) = 4
• Two terminal connections captured by
– 4-bit wide buses
– 4-bit wide control groups
26
Regularity
[Figure: "A 4-bit wide bus" linking bit-slices S1 to S4 of one group to bit-slices S1 to S4 of another, and "A 4-bit wide control group" spanning bit-slices S1 to S4.]
27
Regularity Results
Circuit        Two Terminal Connections   4-bit Wide Buses   4-bit Wide Control Groups
dcu_dpath      2232                       49%                43%
ex_dpath       6547                       52%                39%
icu_dpath      8047                       47%                36%
imdr_dpath     3100                       50%                36%
pipe_dpath     1049                       48%                42%
smu_dpath      1167                       48%                25%
ucode_data     3143                       52%                41%
ucode_reg      194                        72%                21%
code_seq_dp    799                        58%                18%
exponent_dp    1362                       32%                23%
incmod_dp      2013                       42%                33%
mantissa_dp    2533                       47%                36%
multmod_dp     3380                       39%                25%
prils_dp       864                        41%                32%
rsadd_dp       722                        52%                27%
Total          37152                      48%                35%
• 94% of LUTs remain in regular datapath components
28
Granularity (m) Vs. Area
• Higher m (the granularity of bit-slice I/O
optimization)
– Keeps more bit-slices identical
– Preserves more regularity
– Higher area cost
29
Granularity Vs. Area Inflation
[Figure: plot of area inflation (%, y-axis from 0 to 8) against granularity m (x-axis ticks: 1, 4, 8, 12, 16, 20, 24, 28, 32).]
30
Conclusion
• Presented a datapath-oriented FPGA
architecture
• Presented an enhanced module compaction
algorithm
• Empirically demonstrated the area efficiency of
the algorithm
– 3%-8% area inflation
• Good regularity
– 48% of two-terminal connections are in 4-bit wide buses
– 35% of two-terminal connections are in 4-bit wide control
groups
31