Clustering of Large Designs for Channel Width Constrained

Un/DoPack: Re-Clustering of Large
System-on-Chip Designs with
Interconnect Variation for Low-Cost
FPGAs
Marvin Tom*
Xilinx Inc.
San Jose, CA, USA
*Work performed at University of British Columbia
David Leong
University of British Columbia
Vancouver, BC, Canada
Guy Lemieux
University of British Columbia
Vancouver, BC, Canada
Overview
• Introduction, Goals and Motivation
– Reduce channel width, lower cost, make circuits “routable”
• Benchmark Circuits
– Varying amount of interconnect variation
• Un/DoPack CAD Tool:
– Iterative channel width reduction by whitespace insertion
• Results
• Conclusion
Mesh-Based FPGA Architecture
• 3 × 3 array: 9 logic blocks
– 4 wires per channel
– 3*4=12 total horizontal tracks
• 4 × 4 array: 16 logic blocks
– 4 wires per channel
– 4*4=16 total horizontal tracks
[Figure: a 3 × 3 mesh and a 4 × 4 mesh of logic blocks (L) separated by routing channels]
• Larger FPGAs have more “aggregate” interconnect
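The track arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch that follows the slide's own counting (one horizontal channel per row of logic blocks); the function name `total_horizontal_tracks` is an assumption made for illustration and does not come from any CAD tool.

```python
import math

def total_horizontal_tracks(num_logic_blocks: int, wires_per_channel: int) -> int:
    """Aggregate horizontal tracks of a square mesh, counted as on the slide:
    one horizontal channel per row of logic blocks."""
    side = math.isqrt(num_logic_blocks)
    if side * side != num_logic_blocks:
        raise ValueError("expected a square array of logic blocks")
    return side * wires_per_channel

for blocks in (9, 16):
    print(blocks, "logic blocks:", total_horizontal_tracks(blocks, 4), "horizontal tracks")
```

With the channel width held at 4, the aggregate track count grows with the side of the array, which is the sense in which a larger FPGA has more interconnect.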
Motivation: Area of FPGA Devices
[Figure: “MCNC Circuits Mapped onto an FPGA”, a scatter plot of Routed Channel Width (0 to 90) against CLB Count (0 to 300) for frisc, ex1010, pdc, spla, apex4, des, elliptic, dsip, bigkey, ex5p, diffeq, seq, apex2, s38584, s38417, s298, misex3, tseng and alu4]
• Channel width axis: SIZE of Layout Tile
• CLB count axis: Number of Layout Tiles
• Total Layout AREA = SIZE * Number
Motivation: Channel Width Demand
[Figure: “MCNC Circuits Mapped onto an FPGA”, a scatter plot of Routed Channel Width (0 to 90) against CLB Count (0 to 300) for the 19 MCNC circuits]
• Interconnect Range: User has no choice!
– Devices built for worst-case channel width (fixed width)
– Interconnect dominates area (>70%)
• Logic Range: User buys bigger device.
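The area argument (total layout area = tile size × number of tiles, with a fixed worst-case channel width) can be made concrete with a toy model. This is only a sketch: the linear tile model and the constants `logic_area` and `area_per_track` are hypothetical, chosen so that interconnect is a little over 70% of the tile at 80 tracks.

```python
def tile_size(channel_width: int, logic_area: float = 30.0,
              area_per_track: float = 1.0) -> float:
    """Hypothetical tile area: fixed logic area plus area proportional to the tracks."""
    return logic_area + area_per_track * channel_width

def interconnect_share(channel_width: int) -> float:
    """Fraction of the tile spent on interconnect under the toy model."""
    return 1.0 - 30.0 / tile_size(channel_width)

def total_layout_area(clb_count: int, channel_width: int) -> float:
    """Total Layout AREA = SIZE * Number (of layout tiles)."""
    return tile_size(channel_width) * clb_count

# A circuit that needs only 40 tracks still pays for 80 on a fixed-width device.
print(total_layout_area(100, 80), "vs", total_layout_area(100, 40))
print(round(interconnect_share(80), 2))
```

The point of the sketch is the gap between the two areas: a low-channel-width circuit on a worst-case device carries wiring it never uses.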
Goal: Reduce Channel Width
[Figure: scatter plot of Routed Channel Width (0 to 90) against CLB Count (0 to 300) for the 19 MCNC circuits]
• Altera Cyclone
– Channel width constraint of 80 routing tracks
• Constrained FPGA
– Channel width constraint of 60 routing tracks
– Smaller area, lower cost for low-channel-width circuits
But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable.
Can we make them routable in a Constrained FPGA?
Possible Solution
• Trade-off logic utilization for channel width
– User can always buy more logic…. (not more wires)
[Figure: Routed Channel Width (0 to 90) against CLB Count (50 to 700) for the MCNC circuits, now including clma. The circuits ex1010, apex4, frisc, spla, pdc and elliptic are each plotted twice. Two mesh diagrams of logic blocks, FPGA 1 and FPGA 2, illustrate the trade-off of CLB count for channel width.]
What about area??
Features and Costs of Two FPGA Families
Altera Device    LEs      Memory      Mult.   Routing   Cost
Cyclone 1C12     12,060   239,616     0       80        $56
Stratix 1S10     10,570   920,448     48      232       $190
Cyclone 1C20     20,060   294,912     0       80        $100
Stratix 1S20     18,460   1,669,249   80      232       $350
• Sample Benchmark Circuit
– 10,000 LEs
– 150 Routing Tracks
– No Multipliers
– 100 K Memory
• Sample Benchmark Circuit
– 20,000 LEs
– 75 Routing Tracks
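The comparison this slide invites can be sketched as a cheapest-fit search over the four devices. The table values are taken from the slide; everything else is an assumption made for illustration: the Memory column is read as bits, “100 K Memory” as 100,000 bits, the Routing column as tracks per channel, and the second sample circuit is assumed to keep the memory and multiplier needs of the first.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Device:
    name: str
    les: int
    memory_bits: int
    multipliers: int
    routing_tracks: int
    cost_usd: int

@dataclass(frozen=True)
class Circuit:
    les: int
    routing_tracks: int
    multipliers: int = 0
    memory_bits: int = 0

DEVICES = [
    Device("Cyclone 1C12", 12_060, 239_616, 0, 80, 56),
    Device("Stratix 1S10", 10_570, 920_448, 48, 232, 190),
    Device("Cyclone 1C20", 20_060, 294_912, 0, 80, 100),
    Device("Stratix 1S20", 18_460, 1_669_249, 80, 232, 350),
]

def fits(circuit: Circuit, device: Device) -> bool:
    """A device fits when it meets or exceeds every resource the circuit needs."""
    return (device.les >= circuit.les
            and device.memory_bits >= circuit.memory_bits
            and device.multipliers >= circuit.multipliers
            and device.routing_tracks >= circuit.routing_tracks)

def cheapest_device(circuit: Circuit) -> Optional[Device]:
    """Cheapest device that fits, or None when nothing fits."""
    candidates = [d for d in DEVICES if fits(circuit, d)]
    return min(candidates, key=lambda d: d.cost_usd) if candidates else None

dense = Circuit(les=10_000, routing_tracks=150, memory_bits=100_000)
depopulated = Circuit(les=20_000, routing_tracks=75, memory_bits=100_000)  # memory assumed unchanged
for circuit in (dense, depopulated):
    print(circuit, "->", cheapest_device(circuit))
```

Under these assumptions the 150-track circuit only fits the Stratix parts, the cheaper of which costs $190, while the 20,000-LE, 75-track version fits the Cyclone 1C20 at $100.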
GNL Circuit Benchmark Suite
• Create benchmark circuits with variation
– SoC <==> Randomly integrate/stitch together “IP Blocks”
– IP Blocks have varied interconnect needs
• Generate Netlist (GNL)
– Stroobandt @ Ghent University
– Synthetic benchmark generator
• GNL circuits generated hierarchically
– Root: # I/Os, # IP blocks
– Second Level: 20 IP blocks, # LEs, Rent parameter
Rent Linear Interpolation
[Figure: Rent Parameter (0.35 to 0.80) of the IP Blocks for the benchmarks Stdev000, Stdev002, Stdev004, Stdev006, Stdev008 / meta clone, Stdev010 and Stdev012; the tick labels name the MCNC circuits ex1010, ex5p, pdc, misex3, alu4, s298, diffeq, bigkey, s38584.1 and elliptic]
• 7 benchmark circuits
• Average Rent = 0.62, Stdev Rent = 0 to 0.12
• 240/120 primary inputs/outputs
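One way to obtain per-block Rent parameters with a chosen mean and standard deviation is to space them evenly, which matches the “linear interpolation” in the slide title. This is a minimal sketch under assumptions: the slide does not give the spacing rule or say whether the standard deviation is the population or the sample value, and the function name `rent_parameters` is hypothetical.

```python
import math
import statistics
from typing import List

def rent_parameters(num_blocks: int = 20, mean: float = 0.62,
                    stdev: float = 0.0) -> List[float]:
    """Evenly spaced Rent parameters with the requested mean and population stdev."""
    if num_blocks < 2 or stdev == 0.0:
        return [mean] * num_blocks
    # n evenly spaced values with step d have population stdev d * sqrt((n^2 - 1) / 12).
    step = stdev / math.sqrt((num_blocks ** 2 - 1) / 12)
    start = mean - step * (num_blocks - 1) / 2
    return [start + i * step for i in range(num_blocks)]

for target in (0.0, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12):
    values = rent_parameters(stdev=target)
    print(f"stdev {target:.2f}: min {min(values):.3f}, max {max(values):.3f}, "
          f"measured stdev {statistics.pstdev(values):.3f}")
```

Each of the seven benchmarks keeps the same average Rent parameter; only the spread across its 20 IP blocks changes.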
Un/DoPack Flow
• Iterative non-uniform cluster depopulation tool
• Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint
• Step 1: Traditional SIS/VPR
– Synthesize and Technology Map (SIS/Flowmap)
– Cluster (iRAC Replica)
– Placement (VPR)
– Routing (VPR)
– Channel Width Constraint Met? Yes: Success! No: go to Step 2
• Step 2: UnPack:
– Congestion Calculator
– Array Size Limits Reached? Yes: Failure. No: go to Step 3
• Step 3: DoPack:
– Incremental Re-Cluster
• Step 4,5: Fast Place/Route
– Fast Placement (Incremental or VPR)
– Fast Routing (VPR)
– Channel Width Constraint Met? Yes: Success! No: return to Step 2
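The control flow of the flowchart can be sketched as a short loop. This is not the tool itself: Steps 2 to 5 are collapsed into one caller-supplied function, and the names `un_do_pack_loop` and `toy_iteration`, as well as the numbers in the toy run, are made up for illustration.

```python
from typing import Callable, Optional, Tuple

State = Tuple[int, int]  # (array side M, routed channel width)

def un_do_pack_loop(initial: State, width_constraint: int, max_side: int,
                    one_iteration: Callable[[State], State]) -> Optional[State]:
    """Iterate until the channel width constraint is met or the array limit is hit."""
    side, width = initial                       # result of Step 1 (SIS/VPR)
    while width > width_constraint:             # "Channel Width Constraint Met?" -> No
        if side >= max_side:                    # "Array Size Limits Reached?" -> Yes
            return None                         # Failure
        side, width = one_iteration((side, width))  # Steps 2-5: UnPack, DoPack, fast P&R
    return side, width                          # Success!

def toy_iteration(state: State) -> State:
    """Stand-in for one pass: the array grows by one row/col, the width drops a little."""
    side, width = state
    return side + 1, width - 3

print(un_do_pack_loop((30, 72), width_constraint=60, max_side=40, one_iteration=toy_iteration))
print(un_do_pack_loop((30, 72), width_constraint=60, max_side=32, one_iteration=toy_iteration))
```

The two exits of the loop correspond to the “Success!” and “Failure” boxes: the flow stops as soon as the routed channel width fits the constraint, or gives up when the array may not grow any further.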
Un/DoPack Flow: SIS/VPR
• Step 1: Traditional SIS/VPR
– Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint
– Synthesize and Technology Map (SIS/Flowmap)
– Cluster (iRAC Replica)
– Placement (VPR)
– Routing (VPR)
– Channel Width Constraint Met? If yes: Success!
[Figure: Un/DoPack flowchart, with the Step 1 boxes highlighted over three consecutive slides]
Un/DoPack Flow: UnPack
• Step 2: UnPack:
– Congestion Calculator
– Array Size Limits Reached? If yes: Failure
[Figure: Un/DoPack flowchart with the Congestion Calculator (UnPack) and Array Size Limits Reached? boxes highlighted]
Un/DoPack Flow: UnPack
• Step 2: UnPack
– Generate Congestion Map
– CLB Label = largest channel width occupancy in the 4 adjacent channels
[Figure: two congestion maps plotting CLB Label (0 to 120) against CLB X-Location and CLB Y-Location, shown next to the Un/DoPack flowchart]
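The labelling rule (each CLB takes the largest occupancy among its 4 adjacent channels) can be sketched as follows. The grid layout is an assumption: for an M × M array, `chan_x` holds M + 1 rows of horizontal channel segments and `chan_y` holds M + 1 columns of vertical ones, so every CLB has a channel on each side. The names are hypothetical and are not VPR data structures.

```python
from typing import List

Grid = List[List[int]]

def clb_labels(chan_x: Grid, chan_y: Grid) -> Grid:
    """Label each CLB with the largest occupancy among its four adjacent channels.

    chan_x[row][col]: horizontal channel segments, rows 0..M, cols 0..M-1;
                      rows `row` and `row + 1` border CLB (row, col).
    chan_y[row][col]: vertical channel segments, rows 0..M-1, cols 0..M;
                      cols `col` and `col + 1` border CLB (row, col).
    """
    m = len(chan_y)
    labels = []
    for row in range(m):
        labels.append([
            max(chan_x[row][col], chan_x[row + 1][col],
                chan_y[row][col], chan_y[row][col + 1])
            for col in range(m)
        ])
    return labels

# Toy 2 x 2 array with made-up occupancies.
chan_x = [[1, 2], [3, 4], [5, 6]]
chan_y = [[7, 1, 2], [0, 9, 3]]
print(clb_labels(chan_x, chan_y))
```

Plotting the resulting labels over the CLB X and Y locations gives a congestion map of the kind shown on the slide.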
Un/DoPack Flow: UnPack
• Step 2: UnPack:
– Depop Center = Largest CLB label
[Figure: M X M Array, shown with the Un/DoPack flowchart]
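Choosing the depopulation center as the CLB with the largest label is a plain arg-max over the congestion map. A minimal sketch, assuming the labels are stored as a list of rows; the tie-breaking rule (first hit in row-major order) is an assumption, since the slide does not say how ties are resolved.

```python
from typing import List, Tuple

def depop_center(labels: List[List[int]]) -> Tuple[int, int]:
    """Return (row, col) of the CLB with the largest label.

    Ties go to the first CLB found in row-major order (an assumption)."""
    best = (0, 0)
    for r, row in enumerate(labels):
        for c, value in enumerate(row):
            if value > labels[best[0]][best[1]]:
                best = (r, c)
    return best

print(depop_center([[7, 4], [9, 9]]))
```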
Un/DoPack Flow: UnPack
• Step 2: UnPack:
– Option 1 Coarse Grain:
• Depop Radius = M/4
• Depop Amt: 1 new row/col in array
[Figure: M X M Array, shown with the Un/DoPack flowchart]
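The coarse-grain option can be sketched as picking every CLB within M/4 of the depopulation center and lowering the cluster capacity there so that the array grows by about one row and one column. The slide does not say how the distance is measured or how the new row/column is turned into a cluster size; the Euclidean distance and the capacity rule below are guesses made for illustration, and both function names are hypothetical.

```python
import math
from typing import List, Tuple

def depop_region(m: int, center: Tuple[int, int], radius: float) -> List[Tuple[int, int]]:
    """CLBs of an m x m array within `radius` of `center` (Euclidean distance, assumed)."""
    cr, cc = center
    return [(r, c) for r in range(m) for c in range(m)
            if (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2]

def reduced_cluster_size(m: int, region_clbs: int, cluster_size: int) -> int:
    """Cluster capacity inside the region so that its logic spreads over roughly
    one extra row and column of CLBs: (m + 1)^2 - m^2 = 2m + 1 more CLBs (assumed rule)."""
    extra = 2 * m + 1
    return max(1, math.floor(cluster_size * region_clbs / (region_clbs + extra)))

m = 40
region = depop_region(m, center=(20, 20), radius=m / 4)
print(len(region), "CLBs in region; new cluster size:",
      reduced_cluster_size(m, len(region), cluster_size=16))
```

Only the clusters inside the region lose capacity, which is what keeps the depopulation non-uniform.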
Un/DoPack Flow: UnPack
• Step 2: UnPack:
– Option 2 Fine Grain:
• Depop Radius = M/4, M/5, M/6, M/8
• Depop Amt: 1 new row/col in region
[Figure: M X M Array, shown with the Un/DoPack flowchart]
Un/DoPack Flow: DoPack
• Step 3: DoPack:
– Incremental Re-Cluster
[Figure: Un/DoPack flowchart with the Incremental Cluster (DoPack) box highlighted]
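The incremental part of the re-clustering can be illustrated by touching only the clusters inside the depopulated region and leaving the rest of the packing alone. This is a rough sketch, not the iRAC replica: it splits each affected cluster in list order, whereas a real clusterer would choose which logic elements to move based on connectivity. All names are hypothetical.

```python
from typing import Dict, List, Set

def incremental_recluster(clusters: Dict[str, List[str]], region: Set[str],
                          max_size: int) -> Dict[str, List[str]]:
    """Split clusters in `region` that exceed `max_size`; keep every other cluster as packed."""
    result: Dict[str, List[str]] = {}
    for name, les in clusters.items():
        if name not in region or len(les) <= max_size:
            result[name] = list(les)            # untouched: keeps its old packing
            continue
        for part, start in enumerate(range(0, len(les), max_size)):
            result[f"{name}.{part}"] = les[start:start + max_size]
    return result

clusters = {"c0": ["a", "b", "c", "d"], "c1": ["e", "f", "g", "h"]}
print(incremental_recluster(clusters, region={"c1"}, max_size=3))
```

Because clusters outside the region are unchanged, their placement can be reused, which is what makes a fast incremental placement step possible.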
Un/DoPack Flow: Fast P&R
• Step 4,5: Fast Place/Route
• Fast Placement
– UBC Incremental Placer (under development)
– VPR -fast
• Fast Router
– Use illegal PathFinder solution from first iterations
• Unsuccessful so far
– Use full routed solution
• Slow but reliable
[Figure: Un/DoPack flowchart with the Fast Placement, Fast Routing and Channel Width Constraint Met? boxes highlighted]
Un/DoPack: Baseline Flow
• UnPack: Coarse grained congestion calculator
• DoPack: iRAC replica
• Fast Place: UBC Incremental Placer
• Fast Route: None
• FPGA Architecture:
– LUT size (k) = 6
– Cluster size (N) = 16
– Inputs per cluster (I) = 51
– Wires of length (L) = 4
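The value I = 51 agrees with the commonly used cluster sizing rule I = (k/2)(N + 1). The slide does not state that this rule was used, so the link is an assumption; the sketch below only checks the arithmetic.

```python
def cluster_inputs(lut_size: int, cluster_size: int) -> int:
    """Cluster inputs from the rule I = (k / 2) * (N + 1)."""
    return lut_size * (cluster_size + 1) // 2

# Baseline architecture parameters from the slide.
ARCH = {"k": 6, "N": 16, "I": 51, "L": 4}

print(cluster_inputs(ARCH["k"], ARCH["N"]))
```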
Area of GNL Benchmarks
[Figure: Normalized Area (0.90 to 2.00) against % of Maximum Channel Width (0.5 to 1.05) for stdev0, stdev002, stdev004, stdev006, stdev008 / meta clone, stdev010 and stdev012]
Interconnect Variation: Impact on FPGA Architecture Design
[Figure: Minimum Routed Channel Width (70 to 140) for stdev000, stdev002, stdev004, stdev006, stdev008, stdev010 and stdev012, with series Baseline, 10% Area Increase, 20% Area Increase and 25% Area Increase]
• High Variation Circuits Require Wide Channel Width
Critical Path of GNL Benchmarks
[Figure: Normalized Critical Path (0.95 to 1.25) against % of Max Channel Width (0.5 to 1.05)]
Un/DoPack Congestion Map
[Figure: two congestion maps plotting CLB Label (0 to 120) against CLB X-Location and CLB Y-Location, one Before and one After Un/DoPack]
Multi-Region Un-Pack
• Depopulate multiple regions at once
– Depopulate each region separately
– Smaller radius = M/10
• Handle overlapping regions
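Multi-region selection can be sketched as a greedy pick of several congestion peaks, each with a region of radius M/10. The slide does not explain how overlapping regions are handled; in this sketch a CLB that falls in two regions stays with the region chosen first, so it is depopulated only once. That rule, the Euclidean distance and the function name are all assumptions.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]

def multi_region_unpack(labels: List[List[int]], num_regions: int) -> List[Set[Cell]]:
    """Pick up to `num_regions` disjoint regions of radius M/10 around the highest labels."""
    m = len(labels)
    radius_sq = (m / 10) ** 2
    # Visit CLBs from most to least congested (stable sort keeps row-major order on ties).
    ranked = sorted(((labels[r][c], r, c) for r in range(m) for c in range(m)),
                    key=lambda item: -item[0])
    claimed: Set[Cell] = set()
    regions: List[Set[Cell]] = []
    for _, r, c in ranked:
        if len(regions) == num_regions:
            break
        if (r, c) in claimed:
            continue                      # already inside an earlier region
        cells = {(rr, cc) for rr in range(m) for cc in range(m)
                 if (rr - r) ** 2 + (cc - c) ** 2 <= radius_sq} - claimed
        claimed |= cells                  # overlap goes to the earlier region
        regions.append(cells)
    return regions

# Toy 20 x 20 congestion map with three made-up peaks, two of them close together.
toy = [[0] * 20 for _ in range(20)]
toy[5][5], toy[5][8], toy[15][15] = 100, 90, 80
print([len(region) for region in multi_region_unpack(toy, num_regions=3)])
```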
Normalized Area
[Figure: Normalized Area (0.80 to 3.00) against Channel Width Constraint (% of max MRCW, 0.5 to 1) for stdev000, stdev008 / clone and stdev010]
Normalized Critical Path
[Figure: Normalized Critical Path Delay (0.95 to 1.25) against Channel Width Constraint (% of max MRCW, 0.5 to 1) for stdev000, stdev008 / clone and stdev010]
Run-Time Comparisons
[Figure: Log Run Time (in hours) against Channel Width Constraint (% of max MRCW, 0.5 to 1) for stdev000, stdev008, stdev010, MR stdev000, MR stdev008 / clone and MR stdev010]
Conclusion
• Un/DoPack: FPGA CAD flow
– Find “local” congestion → depopulate → reduced interconnect demand
• FPGA benchmark circuit “suite”
– Stdev: Used to vary interconnect demand
• Discoveries…
– “Non-uniform” depopulation limits area inflation
– “Interconnect variation” important for area inflation and FPGA architecture design
– “Routing closure” achieved by re-clustering and incremental place & route
• UNROUTABLE circuits made ROUTABLE → buy an FPGA with MORE LOGIC!!!
End of Talk