Application-Specific Customization of Parameterized FPGA Soft-Core Processors

David Sheldon^a, Rakesh Kumar^b, Roman Lysecky^c, Frank Vahid^a*, Dean Tullsen^b

a) Department of Computer Science and Engineering, University of California, Riverside
   *Also with the Center for Embedded Computer Systems at UC Irvine
b) Department of Computer Science and Engineering, University of California, San Diego
c) Department of Electrical and Computer Engineering, University of Arizona
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.
FPGA Soft Core Processors
Soft-core processor: an HDL description of a processor
Flexible implementation: FPGA or ASIC
Technology independent
[Figure: the same HDL description can be mapped to different FPGAs (Spartan 3, Virtex 2, Virtex 4) or to an ASIC]
FPGA Soft Core Processors
Soft-core processors can have configurable options:
Datapath units
Cache
Bus architecture
Current commercial FPGA soft-core processors:
Xilinx MicroBlaze
Altera Nios
[Figure: a soft-core μP on an FPGA with optional FPU, MAC, and cache units]
Goal
Goal: Tune FPGA soft-core microprocessor for a given application
[Figure: the application and chosen parameter values feed μP synthesis, producing a configured μP on the FPGA; different parameter values trade off runtime against size]
MicroBlaze – Xilinx FPGA Soft-Core
Instantiating all units is not necessarily the fastest configuration, due to critical path lengthening
Instantiatable units: Multiplier, Barrel Shifter, Divider, FPU, Cache
[Chart: speedup of the base MicroBlaze, full MicroBlaze (all units), and optimal MicroBlaze for the benchmarks aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk, and their average]
[Chart: application runtime (ms) versus size (equivalent LUTs) for the base, bs, mul, mul+bs, bs+cache, and mul+bs+cache configurations]
Significant tradeoffs
Problem
Need fast exploration
Synthesis runs can take an hour
This talk
Two approaches
Approach 1: Using traditional CAD techniques
Approach 2: Synthesis-in-the-loop
Results
[Figure: exploration selects parameter values for μP synthesis (~20-60 minutes per run), producing a configured μP]
Constraints on Configurations
Size constraints may prevent use of all possible units
[Figure: MicroBlaze with candidate Multiplier, Barrel Shifter, Divider, FPU, and Cache units; not all of them fit within the maximum area]
Approach 1: Traditional CAD Techniques
Create a model of the problem
Solve the model with extensive search heuristics
We will model this problem as a 0-1 knapsack problem
[Figure: creating the model is slow because it includes synthesis; exploring the model is fast and considers 1000s of configurations; the chosen units (Multiplier, FPU, Cache, ...) must fit within the maximum area]
Approach 1: Traditional CAD Techniques
Creating the model
[Figure: for the given application, the base MicroBlaze and one configuration per unit (Multiplier, Barrel Shifter, Divider, FPU, Cache) are synthesized and run to measure each unit's performance and size]

                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80
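Below is a minimal Python sketch of how such a per-unit model could be assembled. The `synthesize_and_run` helper is hypothetical (a wrapper around the vendor synthesis flow plus an execution or simulation run of the application), and the exact definitions of the increments (speedup over base, relative size growth) are assumptions for illustration, not taken from the slides.

```python
# Sketch only: build the per-unit model for one application.
# `synthesize_and_run` is a hypothetical wrapper around the vendor
# synthesis flow plus an execution/simulation run of the application.

UNITS = ["BS", "FPU", "MUL", "DIV", "CACHE"]

def synthesize_and_run(app, units):
    """Hypothetical: synthesize a MicroBlaze with `units` enabled, run `app`,
    and return (runtime_ms, size_luts)."""
    raise NotImplementedError("placeholder for the real synthesis/run flow")

def build_model(app):
    # One base run plus one run per unit = 6 synthesis runs total.
    base_time, base_size = synthesize_and_run(app, [])
    model = {}
    for unit in UNITS:
        time_ms, size_luts = synthesize_and_run(app, [unit])
        perf_inc = base_time / time_ms      # assumed: speedup over base
        size_inc = size_luts / base_size    # assumed: relative size growth
        model[unit] = {"perf": perf_inc, "size": size_inc,
                       "perf_per_size": perf_inc / size_inc}
    return model
```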
Approach 1: Traditional CAD Techniques
0-1 knapsack model
Object’s benefit = Unit’s performance increment / size increment
Object’s weight = Unit’s Size
Knapsack’s size constraint = FPGA size constraint
                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80
Approach 1: Traditional CAD Techniques
Solved the 0-1 knapsack problem using established methods
Toth, P., Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing, 1980.
Running time
6 MicroBlaze configuration synthesis runs to create the model
O(n*p) to solve the model, where n is the number of factors and p is the available area
Negligible (seconds) compared to synthesis runtimes (~1 hour)
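As a concrete illustration, here is a minimal Python sketch of an O(n*p) dynamic-programming 0-1 knapsack solver in the spirit of the cited approach. The benefits are the example perf/size ratios from the model table; the integer unit weights and the area budget are made-up values for illustration only.

```python
# Minimal 0-1 knapsack sketch, O(n * p) time for n units and area budget p.

def knapsack(units, area_budget):
    """units: list of (name, benefit, weight); returns the chosen unit names."""
    # best[w] = (total benefit, chosen units) achievable within area w
    best = [(0.0, frozenset())] * (area_budget + 1)
    for name, benefit, weight in units:
        new_best = list(best)
        for w in range(weight, area_budget + 1):
            b, chosen = best[w - weight]
            if b + benefit > new_best[w][0]:
                new_best[w] = (b + benefit, chosen | {name})
        best = new_best
    return best[area_budget][1]

# Benefits = example perf/size ratios; weights and budget are made up.
units = [("BS", 0.96, 4), ("FPU", 0.34, 27), ("MUL", 0.63, 8),
         ("DIV", 0.93, 6), ("CACHE", 0.80, 16)]
print(sorted(knapsack(units, area_budget=30)))   # ['BS', 'CACHE', 'DIV']
```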
Approach 1: Traditional CAD Techniques
Problems
100's of target FPGAs, with different hard core resources (multipliers, block RAM)
Model approach must estimate size and performance when two or more units are combined
e.g., MUL speedup 1.3 and DIV speedup 1.6 give an additive estimate of 1 + 0.3 + 0.6 = 1.9 for MUL+DIV, but the real speedup may be only 1.7
Device      LUTs    PPCs
XC2V2000    21504   0
XC2VP2      2816    0
XC4VLX80    71680   0
XC4VLX15    12288   0
XC2S300E    6140    0
XC2V4000    46080   0
XC2VP40     38784   2
XC4VSX25    20480   0
XC4VSX35    30720   0
XC4VFX20    17088   1
XC2S150E    3456    0
XC2VP30     27392   2
XC4VLX60    53248   0
XC2S600E    13824   0
XC2VP20     18560   2
XC2V500     6144    0
XC2VPX70    66176   2
XC4VLX40    36864   0
XC2V6000    67584   0
XC4VFX60    50560   2
XC4VFX100   84352   2
XC2VP4      6016    1
XC2VP70     66176   2
Model inaccuracies may be large
Approach 2: Synthesis-in-the-Loop
Problem with traditional CAD approach
100’s of target FPGAs
Model approach estimates size and performance for two or more units
Model inaccuracies may be large
Solution – synthesis in the loop
No abstract model
Guided by actual size and performance data
But slow – can only explore a few configurations
[Figure: synthesis-in-the-loop flow: exploration drives synthesis and execution (10's of minutes per iteration), which feed actual size and performance back to the exploration]
Approach 2: Synthesis-in-the-Loop
First, pre-analyze the units to guide the heuristic
Same calculations as when creating the model for the knapsack approach
[Figure: each unit (Multiplier, Barrel Shifter, Divider, Floating Point, Cache) is synthesized to obtain its performance and size for the application]

                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80
Approach 2: Synthesis-in-the-Loop
Build an “impact-ordered tree” structure
The tree is specific to the given application
Sort units by Perf/Size to obtain the application-specific impact ordering:

        Impact (Perf/Size)
BS      0.96
DIV     0.93
CACHE   0.80
MUL     0.63
FPU     0.34
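A short Python sketch of that sort, using the example perf/size values from the table (the real values are application-specific and come from the pre-analysis synthesis runs):

```python
# Sort candidate units by their measured perf/size impact for this application.
perf_per_size = {"BS": 0.96, "FPU": 0.34, "MUL": 0.63, "DIV": 0.93, "CACHE": 0.80}
impact_order = sorted(perf_per_size, key=perf_per_size.get, reverse=True)
print(impact_order)   # ['BS', 'DIV', 'CACHE', 'MUL', 'FPU']
```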
Approach 2: Synthesis-in-the-Loop
Run tree-based search heuristic
[Figure: at each node of the impact-ordered tree the heuristic decides "include" or "not include" for the next unit]

        Perf/Size   Useful?
BS      0.96        Yes
DIV     0.93        No
CACHE   0.80        No
MUL     0.63        Yes
FPU     0.34        No
[Figure: each decision is evaluated with an actual synthesis and execution run that returns real size and performance]
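Here is a minimal Python sketch of this impact-ordered descent, reusing the hypothetical `synthesize_and_run` helper from the earlier model-building sketch. The include/not-include rule shown (keep a unit only if the synthesized configuration fits the area budget and actually runs faster) is one plausible reading of the heuristic, not necessarily the authors' exact algorithm.

```python
# Synthesis-in-the-loop descent over the impact-ordered units (sketch).
# Performs at most one synthesis run per unit, i.e. 5 exploration runs here.

def explore(app, impact_order, base_time, base_size, max_area):
    """base_time/base_size come from the pre-analysis; returns the chosen
    units plus the measured runtime and size of that configuration."""
    chosen, best_time, best_size = [], base_time, base_size
    for unit in impact_order:
        time_ms, size_luts = synthesize_and_run(app, chosen + [unit])
        if size_luts <= max_area and time_ms < best_time:
            chosen.append(unit)                   # "include" branch
            best_time, best_size = time_ms, size_luts
        # otherwise take the "not include" branch and try the next unit
    return chosen, best_time, best_size
```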
Comparison of Approaches
Approach 1 – Traditional CAD
6 synthesis runs to build the model
O(n*p) knapsack solution
Examines thousands of configurations during exploration
Approach 2 – Synthesis in the loop
11 synthesis runs (6 pre-analysis, 5 exploration)
Examines at most 5 configurations during exploration
Results
10 EEMBC and Powerstone benchmarks: aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
Average results shown, on Virtex 2 Pro, for a particular size constraint
[Chart: tool run time (min) versus speedup for the Exhaustive, App-Spec, and Knapsack approaches]
Knapsack sub-optimality is due to multi-unit estimation inaccuracy
The application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime
Results
Obtained results for six different size constraints
Results shown for a second size constraint
Similar findings for all six constraints
[Chart: tool run time (min) versus speedup for the Exhaustive, App-Spec, and Knapsack approaches, under a second size constraint]
Results
Also ran for a different FPGA: Xilinx Spartan2
Similar findings
[Chart: tool run time (min) versus speedup for the Exhaustive, App-Spec, and Knapsack approaches on the Spartan2]
Conclusions
Synthesis-in-the-loop approach outperformed the traditional CAD approach
Better results
Slightly longer runtime
Application-specific impact-ordered tree heuristic served well for the synthesis-in-the-loop approach
Future
Extend for highly-configurable soft-core processors, and for multiple processors competing for and/or sharing resources
Questions?