DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE APPLICATIONS IN FLUID DYNAMICS,
MOLECULAR MODELING AND IMAGE/VIDEO PROCESSING
A. Akoglu, A. Dasu, S. Panchanathan
Center for Ubiquitous Computing, Arizona State University, Tempe, AZ
{akoglu,dasu,panch}@asu.edu
Abstract
This work presents results obtained with the hybrid FPGA architecture
design methodology proposed in earlier work. The hybrid architecture is
formed of ASIC units and LUT based processing elements. The ASIC units
represent tasks or core clusters, that is, recurring computation patterns
obtained through common sub-graph analysis between basic blocks within
and across routines of computation intensive applications. Results show
that partial reconfiguration with computation cores embedded in a sea of
LUTs offers potential for massive savings in gate density by eliminating
the need for redundant sub-circuit pattern configurations. Since the
ASICs cover only parts of the data flow graphs, the remaining
computations are implemented on LUT based reconfigurable hardware. A new
packing cost function is proposed to form the LUT based processing
elements; it prioritizes reduction of the input/output pins of the
clusters being formed. Results show that the proposed method yields
significant savings in the number of nets to be routed.
Introduction
Numerical simulations in Computational Fluid Dynamics and Molecular
Modeling share common computation features and are performed
iteratively. Similarly, video and image processing applications involve
tasks (mosaic building to compress video into images, image compression
such as DCT, DWT, etc.) which also share common computation features and
require iterative processing. Several researchers have shown that these
applications are well suited to execution on spatially parallel
processor architectures. FPGAs in particular offer large numbers of
on-chip spatially parallel units and are thus capable of performing
orders of magnitude faster than regular serial processors. However,
FPGAs are application agnostic and hence incur penalties: clock cycles
lost in redundant reconfigurations, generic routing and poor memory
architectures, all of which impact speed, power and silicon area. These
factors have led us to explore the reconfigurable architecture design
space with priority given to the application domain.
[Diagram labels: Application source code (three inputs); Compiler (Lance, SUIF, gcc); Analysis tools]
The methodology (Figure 1) involves extraction of tasks or core clusters
from the Control Data Flow Graphs (CDFGs) of applications, followed by
design of an architecture that embeds them in Hybrid-FPGA environments
(Figures 2a, 2b). By Hybrid, we mean that the proposed FPGA
architectures involve both LUT and ASIC regions. The tasks or core
clusters obtained through common sub-graph analysis between basic blocks
within and across routines are recurring computation patterns,
implemented as ASICs on a non-reconfigurable area (CIPEs). Designing
processing elements by identifying correlated compute intensive regions
within each application and between applications results in large
amounts of processing in localized regions of the chip. This reduces the
number of reconfigurations and the on-chip communication, and hence
results in faster application switching and reduced power consumption.
The task consists of finding the common sub-graphs (Figure 3), which is
closely related to the Largest Common Sub-Graph problem (a proven
NP-complete problem). Core reusable regions detected as common within or
across applications by peer research efforts have been either at the
granularity of MAC units (2 nodes) or at the granularity of entire
function modules. There has been no reported work that detects core
reusable regions consisting of several operation nodes (multiply, add,
divide, etc.) between basic blocks in applications, with emphasis on
accelerating data flow in hardware. Our method generates ASIC cores of
higher granularity by focusing specifically on the data flow graphs of
hardware computations and taking advantage of the restrictions that they
offer. In the Hybrid-FPGA model, we propose that the CIPE region be
either constrained and mapped onto a slab (a region of LUTs isolated by
a MacroBus, as in the Virtex architecture) or implemented as gates in
ASIC technology. Even though a large on-chip ASIC increases the cost of
mask design, it offers the maximum gate savings. The remaining slabs are
implemented on LUT based reconfigurable hardware. To the best of our
knowledge, no known technology currently maps regions within a single
DFG into multiple slabs for partial reconfiguration.
Figure 1. Methodology (granularity of extraction: function level, BB
level, or something in between?)
Figure 2b. Hybrid-FPGA
Finding Common Sub-Graph
We have conducted experiments on several complex routines from the
target applications: MPEG-4 VVM, the GNU Scientific Library (GSL) and
the NAMD molecular modeling library. A map report for the Spartan 2E
architecture was obtained from the synthesis report. Results show that
partial reconfiguration with computation cores embedded in a sea of LUTs
offers the potential for massive savings in gate density by eliminating
the need for redundant sub-circuit pattern configurations (Table 1).
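The recurring-pattern idea behind common sub-graph extraction can be illustrated with a small sketch. This is not the authors' algorithm: it is a simple tree-shape heuristic that assumes each basic block's data flow graph is given as a dictionary mapping a node name to its operation and operand list. The names `shape`, `recurring_patterns`, `block_a` and `block_b` are hypothetical, and shared nodes in a DAG are simply expanded as separate subtrees.

```python
from collections import Counter

COMMUTATIVE = {"add", "mul"}  # operand order is irrelevant for these ops

def shape(dfg, node, depth):
    """Canonical string for the operation tree rooted at `node`, cut after `depth` levels."""
    op, operands = dfg[node]
    if op == "in":
        return "in"   # block input
    if depth == 0:
        return "_"    # deeper operations are ignored
    parts = [shape(dfg, child, depth - 1) for child in operands]
    if op in COMMUTATIVE:
        parts.sort()  # so that a*b and b*a yield the same string
    return f"{op}({','.join(parts)})"

def recurring_patterns(blocks, depth=2, min_ops=2):
    """Patterns with at least `min_ops` operations that occur two or more times."""
    counts = Counter()
    for dfg in blocks:
        for node, (op, _) in dfg.items():
            if op != "in":
                counts[shape(dfg, node, depth)] += 1
    return {p: n for p, n in counts.items()
            if n >= 2 and p.count("(") >= min_ops}

# Two hypothetical basic blocks: (a*b + c) / d  and  (z + x*y) - w
block_a = {"a": ("in", []), "b": ("in", []), "c": ("in", []), "d": ("in", []),
           "t1": ("mul", ["a", "b"]), "t2": ("add", ["t1", "c"]),
           "t3": ("div", ["t2", "d"])}
block_b = {"x": ("in", []), "y": ("in", []), "z": ("in", []), "w": ("in", []),
           "u1": ("mul", ["x", "y"]), "u2": ("add", ["z", "u1"]),
           "u3": ("sub", ["u2", "w"])}

common = recurring_patterns([block_a, block_b])
print(common)
```

In this example the only pattern of two or more operations shared by both blocks is the multiply-accumulate shape, which corresponds to the MAC granularity mentioned above; raising `depth` lets the heuristic look for larger recurring shapes.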
Figure 6. How to prioritize net reduction
$$\mathrm{Gain} = \sum_{i=1}^{7} \mathrm{Case}(i)$$
Figure 7. Packing Cost Function
Figure 4. LUT Based Architecture Methodology
Since CIPEs cover only parts of the DFGs, the remaining computations
(reconfigurable data flow computations) are implemented on LUT based
reconfigurable hardware. The methodology in Figure 4 aims to provide
optimum interconnection pathways between different hierarchy levels with
variable size processing elements, allocating just enough switching and
wiring resources based on profiling the computational characteristics of
the application domains. In existing approaches, packing treats the
number of intersecting nets as positive gain and does not address how
the wiring requirement grows after an LUT is included in a cluster. We
also argue that the cost function should give priority to nets that
decrease the number of input or output pins of the target cluster. A
Rent's rule based packing mechanism (Figure 5), designed to improve the
routing architecture by reducing the number of nets to be routed, has
been implemented. This mechanism prioritizes (Figures 6, 7) nets that
lead to a reduction in the number of input/output pins during packing,
in addition to the routability driven cost metrics defined by other
researchers. Table 2 presents the performance of the packing compared to
V-Pack (Rose et al.) and R-Pack (Sarrafzadeh et al.).
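The difference between counting shared nets and counting cluster pins can be made concrete with a small sketch. It does not reproduce the seven-case cost function of Figure 7; it only assumes that a netlist is a dictionary mapping a net name to its driver and sink set, and ranks candidate LUTs first by the change in cluster pin count and then by shared nets. All names (`cluster_pins`, `shared_nets`, `pick_next`, the example netlist) are hypothetical.

```python
def cluster_pins(cluster, nets):
    """Number of input plus output pins the cluster exposes to the routing fabric."""
    pins = 0
    for driver, sinks in nets.values():
        if driver in cluster:
            if sinks - cluster:      # driven inside, used outside: output pin
                pins += 1
        elif sinks & cluster:        # driven outside, used inside: input pin
            pins += 1
    return pins

def shared_nets(cluster, lut, nets):
    """Nets that touch both the candidate LUT and the cluster."""
    count = 0
    for driver, sinks in nets.values():
        terminals = sinks | {driver}
        if lut in terminals and terminals & cluster:
            count += 1
    return count

def pick_next(cluster, candidates, nets):
    """Prefer the candidate that reduces cluster pins most; break ties by shared nets."""
    before = cluster_pins(cluster, nets)
    def key(lut):
        pin_reduction = before - cluster_pins(cluster | {lut}, nets)
        return (pin_reduction, shared_nets(cluster, lut, nets))
    return max(candidates, key=key)

# Hypothetical netlist: net -> (driver, sinks); pi*/po* are primary inputs/outputs.
nets = {
    "n1": ("A", {"B"}),
    "n2": ("A", {"C"}),
    "n3": ("pi0", {"A", "C"}),
    "n4": ("pi1", {"C"}),
    "n5": ("pi2", {"C"}),
    "n6": ("B", {"po0"}),
    "n7": ("C", {"po1"}),
}
cluster = {"A"}
choice = pick_next(cluster, ["B", "C"], nets)
print(choice)
```

Here LUT C shares two nets with the cluster and B only one, so a pure shared-net gain would favor C. Adding C, however, raises the cluster's pin count from 3 to 5, while adding B leaves it at 3, so the pin-aware ranking selects B.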
Figure 3. Dominant Sub-Graph
Table 1. Configuration bits and clock cycles savings
Table 2. Amount of nets and tracks savings
Figure 2a. Hybrid-FPGA
Figure 5. Rent's Rule based Packing Parameters
[Figure content: a five-step procedure (Step-1 to Step-5) using the
Rent's rule terms $P_m$, $C$, $B_m$, $r$ and an average $\bar{R}$ formed
from sums of $T_i$ over $i = 0 \ldots L$. Step-1 gives $W$ in terms of
$C$, $\bar{R}$ and 2. Step-2 defines $S_{size}$ from $L_s$ and $C_s$,
$C_{size}$ from $C_s$, the values $S_{size}'$ and $C_{size}'$ offset by
1, the derived $L_1$, $N_1$, $L_1'$, $N_1'$, and two quantities indexed
1 and 2 that combine $T_1 L_1$ with $T_1' L_1'$ and $K_1' N_1'$ with
$K_1 N_1$. Steps 3 and 4 relate $W_{in}$ and $W_{out}$ to $L_1'$ and
$L_1$. The operators did not survive extraction.]
Conclusion
From this research effort we believe that partial reconfiguration with
computation cores embedded in a sea of LUTs offers the potential for
massive savings in gate density by eliminating the need for unnecessary
and redundant sub-circuit pattern configurations. We believe that this
direction will lead to next generation FPGA devices geared towards
computationally intensive applications such as bio-chemical algorithms
and scientific applications.
Selected Publications
1. A. Dasu et al., “Reconfigurable Media Processing”, Parallel
Computing, Vol. 28, August 2002, pp. 1111-1139.
2. A. Akoglu, A. Dasu and S. Panchanathan, “A Framework for the Design
of the Heterogeneous Hierarchical Routing Architecture of a Dynamically
Reconfigurable Application Specific Media Processor”, Workshop on
Embedded Systems for Media Processing, Dec 17, 2003, Hyderabad, India.
3. A. Dasu, A. Akoglu and S. Panchanathan, “An Analysis Tool Set for
Reconfigurable Media Processing”, The International Conference on
Engineering of Reconfigurable Systems and Algorithms (ERSA'03), June
2003, Las Vegas.
4. 3 patents pending under US and International Protection in the
technologies for “Designing High Performance Reconfigurable
Processing”.