Northwestern VLSI CAD Group Physical Design for Reconfigurable Computing Systems using Firm Templates K.

Northwestern VLSI CAD Group
Physical Design for
Reconfigurable Computing Systems
using Firm Templates
K. Bazargan
R. Kastner
M. Sarrafzadeh
Department of Electrical &
Computer Engineering
Northwestern University
Outline
• FPGA: What and why?
• What is Reconfigurable Computing System (RCS)?
• Application example
• RCS: System components
• Online placement: problem definition and our approach
• Offline placement and scheduling
• Flexible modules and firm templates
• Conclusion and future work
Sep 10, 99
The Architecture of a Reconfigurable System
[Figure: block diagram with the blocks Data Memory, CPU, RFU and Instruction Memory (Program); the connections are labeled Data, Control, RFUOPs and CPU instructions]
Execution of a Sample Program
Code:
   …
=> x = 3*a - b;          (on CPU)
=> c = RFUOP1(x,5);      (on RFU)
=> y = 4*x - c;
=> for (i=0;i<3;i++){
       x+=RFUOP2(y);
       ++y;
   }
   (No room on RFU to run all in parallel ==> run in sequence)
=> z = RFUOP1(x,3);
=> a = z - y;
=> b = RFUOP3(a,b);      (in parallel)
=> c = a - b;
=> …
[Figure: DFG of the code and the RFU contents over time (axes x, y, t)]
Application Example: Image Restoration
The value of the center pixel in the next iteration:
x_{k+1} = β*y + x_k - β*(d**x_k)
y: the pixel value from the original degraded image
x_k: the pixel value from the previous iteration
β: the coefficient applied to both y and the weighted sum
d**x_k denotes the weighted sum
r1 * (sum of the eight neighbor pixels) + r0 * center pixel
Weights on the 3 x 3 neighborhood:
r1 r1 r1
r1 r0 r1
r1 r1 r1
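A minimal Python sketch of one iteration of the pixel update above, under stated assumptions: the coefficient that multiplies both y and the weighted sum is called `beta` here (a hypothetical name), images are plain lists of rows, and border pixels, which lack eight neighbors, are simply copied. The function names are illustrative and not taken from the slides.

```python
def weighted_sum(img, i, j, r0, r1):
    """d ** x at pixel (i, j): r1 * (sum of eight neighbors) + r0 * center."""
    neighbors = sum(
        img[i + di][j + dj]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    return r1 * neighbors + r0 * img[i][j]


def restore_step(x, y, beta, r0, r1):
    """One iteration over interior pixels; border pixels are copied unchanged."""
    rows, cols = len(x), len(x[0])
    nxt = [row[:] for row in x]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # x_{k+1} = beta*y + x_k - beta*(d ** x_k)
            nxt[i][j] = (beta * y[i][j] + x[i][j]
                         - beta * weighted_sum(x, i, j, r0, r1))
    return nxt
```

Each interior pixel depends only on its 3 x 3 neighborhood from the previous iteration, which is why all pixels of a segment can be updated in parallel in hardware.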
Image Restoration (cont.)
• Incentive:
– Processing of large images using FPGAs with limited resources
• Strategy:
– Segmentation of the image into smaller sized images suitable for the FPGA
– Segments of size m x n are surrounded by an overlap of o.
[Figure: an m x n segment surrounded by an overlap of width o]
Image Restoration: Data Flow Strategy
• Data flow strategy
– Pixels of individual segments are restored in parallel by hardware.
– Restored segments are written back after the overlap is discarded
[Figure: MEMORY holding the image, an m x n segment with overlap o, and the RFU]
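The segmentation and write-back strategy can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: m is taken as the segment height and n as its width, the overlap is clipped at the image border, and `restore_segment` is a hypothetical callback standing in for the hardware that restores one padded segment.

```python
def segments(height, width, m, n, o):
    """Yield (core, padded) boxes as (top, left, bottom, right), end-exclusive.

    core   : the m x n tile whose restored pixels are written back
    padded : the core grown by o pixels on each side, clipped to the image
    """
    for top in range(0, height, m):
        for left in range(0, width, n):
            bottom = min(top + m, height)
            right = min(left + n, width)
            core = (top, left, bottom, right)
            padded = (max(top - o, 0), max(left - o, 0),
                      min(bottom + o, height), min(right + o, width))
            yield core, padded


def restore_image(image, m, n, o, restore_segment):
    """Restore each padded segment independently and keep only its core."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for (t, l, b, r), (pt, pl, pb, pr) in segments(height, width, m, n, o):
        patch = [row[pl:pr] for row in image[pt:pb]]
        restored = restore_segment(patch)
        for i in range(t, b):
            for j in range(l, r):
                # Discard the overlap: copy back core pixels only.
                result[i][j] = restored[i - pt][j - pl]
    return result
```

The overlap gives border pixels of a core tile the neighbors they need, so segments can be processed independently of one another.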
Image Restoration Example
[Figure: degraded image and restored image]
System Components
[Figure: block diagram of the system. Blocks: CPU, RFU, Data Memory, Configuration Memory, Instruction Mem. (Prog.), Program Manager, Cache Control Manager, Prefetch/Branch Prediction Unit, Placement Engine and RFU Manager. Connections are labeled CPU instructions, Data, Config. Bits and RFUOPs.]
Online Placement: Problem Definition
• Input:
– RFU dimensions (W, H)
– List of RFUOP events: (w, h, arrival, departure)
• Output:
– For each module, either
  • Rejected (not able to place) [penalty?]
  • Accepted: (x, y)
[Figure: modules on a timeline from arrival to departure, marked accepted or rejected]
Online Placement
[Figure: current placement + new module to be inserted = ?]
• When a new RFUOP arrives,
– Is there enough room?
– If yes, which location is best?
• Previous work
– Bin-packing heuristics (1-D) - O(n^2)
  • First Fit, Best Fit, Shelf, Look ahead, …
– [Chazelle’83] The Bottom-Left heuristic. O(n^2)
– [Healy-Creavin’97] O(n^2 lg n)
Our Online Placement
• Our approach:
– Divide the empty space into explicit “empty rectangles”
• When a new RFUOP arrives
– Is there enough room? (any ER large enough?)
– If yes, which location is best? (which ER is best?)
• Packing rule
– Best Fit, Bottom Left, First Fit
Heuristics for Choosing an Empty Rectangle
[Figure: current placement with empty rectangles A and B, whose bottom-left corners are P1 and P2, + new module to be inserted = ?]
• BF (Best Fit)
– Places the new module in the empty rectangle which causes less wasted space.
– Wasted area in A < wasted area in B => Choose A
• FF (First Fit)
– Any of A or B could be chosen for placing the new module.
• BL (Bottom Left)
– Chooses the empty rect which is more to the bottom left
– y(P2) < y(P1) => Choose B
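The three packing rules can be sketched in Python as a selection over a list of empty rectangles. This is a sketch under assumptions: wasted space for BF is taken as the empty rectangle's area minus the module's area, BL compares the bottom-left corner by y first and then x, and FF takes the first fitting rectangle in list order. The `Rect` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int  # bottom-left corner
    y: int
    w: int
    h: int


def fits(er, w, h):
    """True when a w x h module fits inside empty rectangle er."""
    return er.w >= w and er.h >= h


def choose_er(empty_rects, w, h, rule):
    """Return the empty rectangle picked by the rule, or None (module rejected)."""
    candidates = [er for er in empty_rects if fits(er, w, h)]
    if not candidates:
        return None
    if rule == "FF":   # First Fit: any fitting rectangle will do
        return candidates[0]
    if rule == "BF":   # Best Fit: least wasted area (assumed definition)
        return min(candidates, key=lambda er: er.w * er.h - w * h)
    if rule == "BL":   # Bottom Left: lowest corner, ties broken by leftmost
        return min(candidates, key=lambda er: (er.y, er.x))
    raise ValueError(f"unknown rule: {rule}")
```

For example, with `A = Rect(0, 5, 4, 4)` and `B = Rect(6, 0, 10, 10)` and a 3 x 3 module, BF picks A (less waste) while BL picks B (lower corner), mirroring the situation on the slide.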
Our Online Placement
• Our approach:
– Divide the empty space into explicit “empty rectangles”
• When a new RFUOP arrives
– Is there enough room? (any ER large enough?)
– If yes, which location is best? (which ER is best?)
• Managing the empty space
– Keep empty rectangles explicitly, use “range tree” to store/access empty rects.
– Efficient use of RFU real estate
  • KAMER: Keep all O(n^2) maximal empty rectangles
Keeping All Empty Rectangles
Our Online Placement
• Our approach:
– Divide the empty space into explicit “empty rectangles”
• When a new RFUOP arrives
– Is there enough room? (any ER large enough?)
– If yes, which location is best? (which ER is best?)
• Managing the empty space
– Keep empty rectangles explicitly, use “range tree” to store/access empty rects.
– Efficient use of RFU real estate
  • KAMER: Keep all O(n^2) maximal empty rectangles
– Fast but sub-optimal
  • Keep only O(n) empty rectangles
  • Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
Keeping O(n) Empty Rectangles - SSEG
Heuristics for Choosing a Segment
[Figure: the two candidate segments S1 and S2; one choice yields empty rectangles A and B, the other yields C and D]
• SSEG (Shorter Seg)
– Chooses the shorter of the two segments.
– Example shown: S1 < S2
• LSEG (Longer Seg)
– Chooses the longer of the two segments.
– Example shown: S1 < S2
• BER (Balanced Empty Rects)
– Chooses the segment which creates less area difference.
– Example shown: Area(B) - Area(A) > Area(D) - Area(C)
• LSQR (Larger Rect Square)
– Chooses the segment which creates the larger rectangle closer to square.
– Example shown: AspectRatio(B) > AspectRatio(D)
• LER (Large Empty Rects)
– Chooses the segment which creates the larger empty rectangle.
– Example shown: Area(B) > Area(D)
• SQR (Square Rects)
– Chooses the segment which creates empty rectangles closer to squares.
– Example shown: Max{AR(A),AR(B)} < Max{AR(C),AR(D)}
AR = AspectRatio
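The segment heuristics can be illustrated with a small Python sketch. It rests on an interpretation that the slides do not spell out: a w x h module is placed in the bottom-left corner of a W x H empty rectangle, and the leftover L-shaped area is cut either along the horizontal segment extending the module's top edge or along the vertical segment extending its right edge. Aspect ratio is taken as longer side over shorter side. All names are hypothetical.

```python
def split_candidates(W, H, w, h):
    """The two ways to cut the free area; rectangles are (x, y, width, height)
    relative to the bottom-left corner of the empty rectangle."""
    horizontal = {"seg": W - w, "rects": [(0, h, W, H - h), (w, 0, W - w, h)]}
    vertical = {"seg": H - h, "rects": [(w, 0, W - w, H), (0, h, w, H - h)]}
    return horizontal, vertical


def area(r):
    return r[2] * r[3]


def aspect_ratio(r):
    """Longer side over shorter side; infinite for a degenerate rectangle."""
    lo, hi = sorted((r[2], r[3]))
    return float("inf") if lo == 0 else hi / lo


# Each rule maps a candidate cut to a score; the lower score wins.
SCORES = {
    "SSEG": lambda c: c["seg"],
    "LSEG": lambda c: -c["seg"],
    "LER": lambda c: -max(area(r) for r in c["rects"]),
    "BER": lambda c: abs(area(c["rects"][0]) - area(c["rects"][1])),
    "SQR": lambda c: max(aspect_ratio(r) for r in c["rects"]),
    "LSQR": lambda c: aspect_ratio(max(c["rects"], key=area)),
}


def split(W, H, w, h, rule):
    """Return the non-empty rectangles produced by the cut the rule prefers."""
    best = min(split_candidates(W, H, w, h), key=SCORES[rule])
    return [r for r in best["rects"] if area(r) > 0]
```

With a 4 x 2 module in a 10 x 6 empty rectangle, the horizontal segment has length 6 and the vertical one length 4, so SSEG keeps the vertical cut while LSEG and LER keep the horizontal one. Since every placement adds at most one extra rectangle, the number of empty rectangles stays linear in the number of modules.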
How Good is a Placement?
• Acceptance rate
– percentage of modules accepted (placed)
• Volume penalty
– Area reflects complexity
– Time-span in the system reflects loop iterations
– Penalty of rejecting a module: penalty = volume = area * time
• Input data
– Randomly generated dimensions
– Randomly generated enter/leave time
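Both quality metrics are simple to compute. The Python sketch below assumes each module is a tuple (w, h, arrival, departure), as in the problem definition, and that a parallel list of booleans records which modules were accepted; the function names are illustrative.

```python
def volume(w, h, arrival, departure):
    """Volume of a module: area times the time it spends in the system."""
    return w * h * (departure - arrival)


def evaluate(events, accepted):
    """Return (acceptance rate in percent, total volume penalty).

    events   : list of (w, h, arrival, departure)
    accepted : list of bools, one per event
    """
    rate = 100.0 * sum(accepted) / len(events)
    penalty = sum(volume(*e) for e, ok in zip(events, accepted) if not ok)
    return rate, penalty
```

For instance, rejecting a 4 x 4 module that would stay for 2 time units and a 2 x 2 module that would stay for 2 time units gives a penalty of 32 + 8 = 40.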
Program Snapshot
[Figure: snapshot of the program]
Online Placement Results
BinPack  Data set  KAMER  SSEG   BER    LSQR   LSEG   LER    SQR
FF       ra2048    79.25  74.26  61.52  70.36  52.83  73.87  70.36
FF       ra4096    84.59  79.1   66.84  74.39  58.37  79.49  74.73
FF       ra8192    79.71  73.39  63.23  69.87  55.87  74.88  68.11
FF       ra16384   81.35  75.08  63.59  70.42  55.73  76.13  69.38
FF       Avg(FF)   81.23  75.46  63.80  71.26  55.70  76.09  70.65
BF       ra2048    82.52  77.49  67.18  75.05  58.93  76.46  74.66
BF       ra4096    87.06  81.76  73.22  80.32  64.57  81.66  79.78
BF       ra8192    82.28  77.57  67.85  73.91  59.04  76.12  73.77
BF       ra16384   84.04  78.81  68.5   75.36  60.92  78.25  75.44
BF       Avg(BF)   83.97  78.91  69.19  76.16  60.86  78.12  75.91
BL       ra2048    81.84  76.22  61.72  73.29  55.57  76.07  71.83
BL       ra4096    86.18  81.93  70.29  78.56  62.33  81.42  78.54
BL       ra8192    81.17  75.71  65.04  72.9   59.71  76.54  72.18
BL       ra16384   83.46  77.39  64.97  74.53  58.23  78.29  73.25
BL       Avg(BL)   83.16  77.81  65.50  74.82  58.96  78.08  73.95
Percentage of accepted modules using different bin-packing and empty space partitioning rules
Online Placement Results (cont.)
Penalties for different partitioning heuristics when BF is used
[Figure: bar chart. x-axis: partitioning heuristic (KAMER, SSEG, BER, LSQR, LSEG, LER, SQR); y-axis: penalty, from 0.0E+00 to 1.8E+08; series: A2048, A4096, A8192, A16384]
Online Placement Results (cont.)
Running Time Comparison (Time to place "A16384" file)
[Figure: bar chart of time (sec.) for KAMER and SSEG under the BF, FF and BL packing rules. The KAMER bars are labeled 35.77, 34.27 and 34.74; the SSEG bars are labeled 2.23, 2.12 and 2.24.]
3-D Floorplanning
[Figure sequence (five frames): the DFG, its schedule on the CPU and the RFU, and the RFU contents drawn in 3-D, with the RFU area on the x and y axes and time on the t axis. Frame annotations:]
– By deleting this RFUOP (CPU performs the operation)...
– This RFUOP can be moved on the RFU
– These RFUOPs can be performed earlier...
Our Current 3-D Floorplanners
• No change in the schedule
– Fixed insertion and deletions of RFUOPs
• Annealing based.
– Move set
  • Move operation from CPU set to RFU set
  • Move operation from RFU set to CPU set
  • Displace an already placed RFUOP on the RFU
– Cost function
  • Penalty in rejecting modules (sum of volumes of the RFUOPs in the CPU set)
  • No overlap allowed during annealing
• Greedy
– Sort the modules on decreasing vol., apply KAMER
Our Current 3-D Floorplanners (cont.)
• KAMER-BF-Decreasing
– Sort the modules on their volumes
– Use KAMER to find a fast placement of the modules
• Low-temp. annealing (LTSA)
– Similar to KAMER-BFD, but use KAMER to place only the X% largest modules
– Use low-temp annealing to place the rest
• Zero-temp. annealing (ZTSA) -- Greedy
– Use KAMER to place as many modules as you can
– Use only displace and move from CPU to RFU annealing moves.
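The greedy idea shared by these floorplanners (sort by decreasing volume, place what fits, send the rest to the CPU) can be sketched in Python. This is a simplified stand-in, not KAMER: it scans a cell-by-cell occupancy grid instead of maintaining maximal empty rectangles, it takes the first free position rather than the best fit, and it assumes a fixed schedule with integer time steps. All names are hypothetical.

```python
def greedy_offline(W, H, modules):
    """modules: list of (w, h, start, end), fixed schedule, end exclusive.

    Returns (placements, penalty): placements maps module index -> (x, y);
    penalty is the summed volume of the modules left to the CPU.
    """
    horizon = max(end for _, _, _, end in modules)
    # occupied[t][y][x] is True when RFU cell (x, y) is in use at time step t.
    occupied = [[[False] * W for _ in range(H)] for _ in range(horizon)]

    def free(x, y, w, h, start, end):
        return all(not occupied[t][yy][xx]
                   for t in range(start, end)
                   for yy in range(y, y + h)
                   for xx in range(x, x + w))

    def vol(i):
        w, h, start, end = modules[i]
        return w * h * (end - start)

    placements, penalty = {}, 0
    # Largest volume first, as in KAMER-BF-Decreasing.
    for i in sorted(range(len(modules)), key=lambda k: -vol(k)):
        w, h, start, end = modules[i]
        spot = next(((x, y)
                     for y in range(H - h + 1)
                     for x in range(W - w + 1)
                     if free(x, y, w, h, start, end)), None)
        if spot is None:
            penalty += vol(i)  # rejected: the CPU performs this operation
            continue
        x, y = spot
        for t in range(start, end):
            for yy in range(y, y + h):
                for xx in range(x, x + w):
                    occupied[t][yy][xx] = True
        placements[i] = spot
    return placements, penalty
```

The returned penalty is the same quantity the annealing cost function minimizes: the sum of volumes of the RFUOPs that end up in the CPU set.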
Our Current 3-D Floorplanners (cont.)
• BFOP - Best Fit Online Placement
– Sort the RFUOPs on volume (decreasing)
– For each RFUOP, find candidate “corners”
– Choose the corner which results in min wasted area
(similar to well-studied 2-D Bin Packing problem)
[Figure: 3-D view (axes x, y, t) showing a floor corresponding to time t1 and the candidate corners on it]
Annealing-Based Offline vs. Online
Algorithm     Data set  Offline    Online     Ratio     Offline   Online    Ratio
                        acc. rate  acc. rate            penalty   penalty
LTSA X=100%   T50       70         84         83.33%    147287    213153    69.10%
LTSA X=100%   T100      72         83         86.75%    253566    307879    82.36%
LTSA X=100%   S100      86         84         102.38%   464049    508923    91.18%
LTSA X=100%   S200      81         89.5       90.50%    539435    612623    88.05%
LTSA X=100%   S1024     84.5       84.6       99.88%    4468662   4643786   96.23%
LTSA X=100%   A1024     87         89         97.75%    427761    456627    93.68%
LTSA X=100%   Avg       80.08      85.68      93.43%    1050126   1123831   86.77%
LTSA X=20%    T50       76         84         90.48%    148975    213153    69.89%
LTSA X=20%    T100      82         83         98.79%    225603    307879    73.28%
LTSA X=20%    S100      81         84         96.43%    287153    508923    56.42%
LTSA X=20%    S200      85.5       89.5       95.53%    359980    612623    58.76%
LTSA X=20%    A1024     81         89         91.01%    213036    456627    46.65%
LTSA X=20%    Avg       81.10      85.90      94.45%    246949    419841    61.00%
Percentage of accepted modules and penalties using two offline parameters.
The higher the RFU acceptance rate and the lower the penalty, the better the algorithm.
Offline Placement Results - All
Comparison of different offline algorithms
[Figure: bar chart. x-axis: data files (Tiny50, Tiny100, Small100, Small200, A100); y-axis: penalty of placement, from 0 to 700000; series: KAMER-BFD, LTSA, ZTSA, BFOP]
Flexible Modules
• Library of soft templates
– Flexible shapes
  • Constant area, different width, height
  • Problem? Hard to build (PD should be done for each shape)
– Median
  • Use the same area, but square shape
– Rotation
• Placement method
– Use best shape (min wasted area)
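A short Python sketch of how a library of shapes could be searched for the best fit. It is only an illustration under assumptions: shapes are integer (w, h) pairs with w * h equal to the module area (rotations included), the median shape rounds the square's side up so no area is lost, and wasted area is the empty rectangle's area minus the shape's area. Empty rectangles are given as (w, h) pairs and all function names are hypothetical.

```python
import math


def flexible_shapes(area):
    """All integer (w, h) with w * h == area; rotations are included."""
    return [(w, area // w) for w in range(1, area + 1) if area % w == 0]


def median_shape(area):
    """Square of (at least) the same area; the side is rounded up."""
    side = math.isqrt(area)
    if side * side < area:
        side += 1
    return (side, side)


def best_fit(shapes, empty_rects):
    """Return the (shape, empty_rect) pair with minimum wasted area, or None."""
    options = [
        (er[0] * er[1] - w * h, (w, h), er)
        for er in empty_rects
        for (w, h) in shapes
        if w <= er[0] and h <= er[1]
    ]
    if not options:
        return None
    _, shape, er = min(options)
    return shape, er
```

With empty rectangles (5, 3), (2, 7) and (10, 10), a fixed 3 x 4 module only fits the large 10 x 10 rectangle, whereas the flexible shapes of area 12 allow a 2 x 6 variant to go into the 2 x 7 rectangle with a waste of only 2 cells.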
Using Flexible Modules in BFOP
Quality improvement when using flexible modules
[Figure: bar chart. x-axis: data files (Tiny50, Tiny100, Small100, Small200, Small1024, A100, A2048, avg); y-axis: improvement (percentage), from 0.00% to 120.00%; series: Median, Median/Rotation]
Median uses a square module with the same area
Flexible Modules (cont.)
• “Firm” templates
– Slice the module into x horizontal or vertical strips
– If cannot place the module, use the 2-split, 3-split, … until you can fit.
• Problem?
– Routing!
– Limited module types can be split (like carry chains, etc. with min communication between stages)
[Figure: vertical 3-split]
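The retry loop behind firm templates can be sketched in Python. The sketch makes simplifying assumptions that are not in the slides: only vertical splits are tried, strips are of nearly equal width, each strip claims one whole empty rectangle, and strips are matched first-fit (so a feasible assignment can be missed). Routing between strips is ignored entirely. Names are hypothetical.

```python
def vertical_strips(w, h, k):
    """Slice a w x h module into k vertical strips of nearly equal width."""
    base, extra = divmod(w, k)
    return [(base + (1 if i < extra else 0), h) for i in range(k)]


def place_with_splits(w, h, empty_rects, max_split):
    """Try no split first, then the 2-split, 3-split, ... up to max_split.

    empty_rects: list of (w, h). Returns (k, assignment) for the first k
    that fits, where assignment pairs each strip with its empty rectangle,
    or None when even max_split strips cannot be placed.
    """
    for k in range(1, min(max_split, w) + 1):
        free = list(empty_rects)
        assignment = []
        for sw, sh in vertical_strips(w, h, k):
            match = next((er for er in free
                          if er[0] >= sw and er[1] >= sh), None)
            if match is None:
                break  # this k does not fit; try one more split
            free.remove(match)
            assignment.append(((sw, sh), match))
        else:
            return k, assignment
    return None
```

A 6 x 4 module that fits none of the empty rectangles (3, 4), (2, 5) and (4, 4) as a whole is placed as a 2-split, with one 3 x 4 strip in (3, 4) and the other in (4, 4).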
Quality Improvements Using Firm Templates
Placement improvement when using firm templates (in OBFD)
[Figure: bar chart. x-axis: data files (Tiny50, Tiny100, Small100, Small200, Small1024, A100, A1024, avg); y-axis: percentage improvement over nosplit, from 0.00% to 100.00%; series: Split-2, Split-3, Split-4, Split-5, Split-6]
Conclusion
• Which online algorithm?
– If speed is an issue, SSEG; otherwise KAMER
• Online or offline?
– If you have the schedule => offline
• Which offline algorithm?
– BFOP is the best (faster + better quality)
• Median? Flexibility? Firm templates?
– Surprisingly, median gives little improvement
– If flexible shapes are available, they are better than splitting (no additional routing problem)
– How many splits?
  • no-split → 2-split: 23% improvement
  • 5-split → 6-split: 3% improvement