Transcript part-1

DSP in FPGA
Topics

Signal Processing
Example: FIR Filter
State of the Art
Flexibility
Multi-Channel Friendly
IP Block Example: FIR Filter
Simulink Equalizer
Routing Challenge
DSP Slice
Multiplication Modes
IP Blocks
Data-path with Constant Cache
FFT Example
Other Examples
Resources
When not to use Floating Point
Example FP: Adder Hardware Circuit
Constant Cache
DSP on FPGA
Considerations
FPGA Applications with DSP
DSP Milestones
PDSP Architecture
PDSP vs. FPGA
Routing Resources: Altera vs. Xilinx
Example: Matrix Multiplication
Hypothesis and Rules of Thumb
Results
Paper Analysis
Signal Processing
- Transform or manipulate analog or digital signals.
- Most frequent application: filtering.
- DSP has replaced traditional analog signal processing systems in many applications.
FPGA Applications
DSP Milestones
- Cooley and Tukey, 1965: efficient algorithm to compute the discrete Fourier transform (DFT).
- PDSP, 1970s: compute a (fixed-point) "multiply-and-accumulate" in only one clock cycle.
- Today's PDSPs: floating-point multipliers, barrel shifters, memory banks, zero-overhead interfaces to A/D and D/A converters.
PDSP Architecture
- Single-DSP implementations have insufficient processing power for the complexity of today's systems.
- Multiple-chip systems: more costly, more complex, and higher power requirements.
- Solution: FPGAs.
Managing Resources & Design Reliability
FPGA vs. PDSPs
PDSPs
- RISC paradigm with MAC.
- Advantage: multistage pipeline architectures can achieve MAC rates limited only by the speed of the array multiplier.
- Dominate applications that require complicated algorithms (e.g. several if-then-else constructs).
FPGA
- Implements MAC at higher cost.
- High-bandwidth SP applications through multiple MAC cells on one chip.
- Algorithms: CORDIC (a shift-and-add sketch follows below), NTT, or error-correction algorithms.
- Dominates more front-end (sensor) applications: FIR filters, CORDIC algorithms, FFTs.
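As an aside, the algorithms listed for FPGAs are shift-and-add friendly. A minimal CORDIC rotation sketch in C (illustrative, not from the slides) shows why: the vector is rotated using only shifts, additions and a small arctangent table, so no multiplier is needed at all.

    #include <math.h>
    #include <stdint.h>

    #define CORDIC_ITER 16

    /* arctan(2^-i) table in Q16 radians, filled once at start-up. */
    static int32_t atan_tab[CORDIC_ITER];

    static void cordic_init(void) {
        for (int i = 0; i < CORDIC_ITER; i++)
            atan_tab[i] = (int32_t)(atan(ldexp(1.0, -i)) * 65536.0 + 0.5);
    }

    /* Rotate (x, y) by angle z (Q16 radians, |z| < pi/2) with shifts and adds only.
       The result is scaled by the CORDIC gain K ~ 1.6468, which the caller removes. */
    static void cordic_rotate(int32_t *x, int32_t *y, int32_t z) {
        for (int i = 0; i < CORDIC_ITER; i++) {
            int32_t dx = *x >> i, dy = *y >> i;
            if (z >= 0) { *x -= dy; *y += dx; z -= atan_tab[i]; }
            else        { *x += dy; *y -= dx; z += atan_tab[i]; }
        }
    }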
FPGA Advantages
1. Ability to tailor the implementation to match system requirements.
2. Multiple-channel or high-speed systems: take advantage of the parallelism within the device to maximize performance.
3. Control logic implemented in hardware.
FIR Filter Example
State of the Art (Xilinx)
Flexibility
- How many MACs do you need?
- For example, in a FIR filter, FPGAs can meet various throughput requirements (see the sketch below).
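To make the MAC-count question concrete, here is a minimal Q15 FIR sketch in C (illustrative, not vendor code): an N-tap filter costs N multiply-accumulates per output sample, so the throughput target fixes how many DSP slices run in parallel, or how fast a time-shared slice must be clocked.

    #include <stdint.h>

    /* One output sample of an N-tap FIR in Q15: each tap is one multiply-accumulate.
       A long filter would use a wider (e.g. 64-bit) accumulator, as a DSP slice does. */
    static int16_t fir_q15(const int16_t *coeff, const int16_t *delay, int taps) {
        int32_t acc = 0;
        for (int k = 0; k < taps; k++)
            acc += (int32_t)coeff[k] * delay[k];   /* one MAC per tap */
        return (int16_t)(acc >> 15);               /* rescale Q30 accumulator to Q15 */
    }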
Multi-Channel Friendly
- Parallelism enables efficient implementation of multiple channels in a single FPGA.
- Many low-sample-rate channels can be multiplexed and processed at a higher rate.
Resources
- Challenge: how to make the most efficient use of the available resources?
DSP48E1 Slice
- 2 DSP48E1 slices per tile.
- Column structure to avoid routing delay.
- Pre-adder, 25x18-bit multiplier, accumulator (data path sketched below).
- Pattern detect, logic operations, convergent/symmetric rounding.
- 638 MHz Fmax.
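The listed data path corresponds roughly to P = P + (A + D) x B. A short behavioural C sketch of that operation (my reading of the slide, not vendor documentation) shows how, for example, a symmetric FIR can fold two taps that share a coefficient into one slice via the pre-adder.

    #include <stdint.h>

    /* Behavioural sketch of one DSP48E1-style operation: pre-add two operands,
       multiply by a coefficient, and add into a wide accumulator. */
    static int64_t dsp_mac(int64_t p, int32_t a, int32_t d, int32_t b) {
        int32_t pre  = a + d;                /* pre-adder (25-bit port in hardware) */
        int64_t prod = (int64_t)pre * b;     /* 25x18-bit multiplier */
        return p + prod;                     /* 48-bit accumulator (64-bit here) */
    }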
Multiplication Modes
- Each DSP block in a Stratix device can implement:
  - four 18x18-bit multiplications,
  - eight 9x9-bit multiplications, or
  - one 36x36-bit multiplication (decomposition sketched below).
- While configured in the 36x36 mode, the DSP block can also perform floating-point arithmetic.
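The mode counts follow from how a wide product decomposes into 18x18 partial products. Here is a generic C sketch of a 36x36 multiply built from four 18x18 multiplies (not Altera code; the 128-bit type is a GCC/Clang extension used only to hold the full 72-bit result).

    #include <stdint.h>

    /* Split a = aH*2^18 + aL and b = bH*2^18 + bL; then
       a*b = aH*bH*2^36 + (aH*bL + aL*bH)*2^18 + aL*bL, i.e. four 18x18 products. */
    static unsigned __int128 mul36x36(uint64_t a, uint64_t b) {  /* a, b < 2^36 */
        uint32_t aL = (uint32_t)(a & 0x3FFFF), aH = (uint32_t)(a >> 18);
        uint32_t bL = (uint32_t)(b & 0x3FFFF), bH = (uint32_t)(b >> 18);
        unsigned __int128 ll  = (uint64_t)aL * bL;
        unsigned __int128 mid = (uint64_t)aL * bH + (uint64_t)aH * bL;
        unsigned __int128 hh  = (uint64_t)aH * bH;
        return ll + (mid << 18) + (hh << 36);    /* full 72-bit product */
    }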
DSP IP Portfolio
- Comprehensive
- Constraint-driven
IP Block Example
- Overclocking automatically used to reduce DSP slice count.
- Quick estimates provided by the IP compiler GUI.
- Ensures the best results for your design requirements.
Altera: DFPAU
- DFPAU: Floating-Point Arithmetic Coprocessor.
- Replaces C software functions with fast hardware operations, accelerating system performance.
- Uses specialized algorithms to compute arithmetic functions.
Altera: DFPAU
Hardware circuit for FP adder
- Breaking a number up into exponent and mantissa requires pre- and post-processing.
- Comprises (the four stages are sketched in software below):
  - Alignment (100 ALMs)
  - Operation (21 ALMs)
  - Normalization (81 ALMs)
  - Rounding (50 ALMs)
- Normalization and rounding together occupy half of the circuit area.
- How to improve this?
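To see why normalization and rounding take so much area, here is a heavily simplified single-precision adder in C that walks the same four stages. It assumes same-sign, normal operands and ignores special cases; it is a sketch of the principle, not the DFPAU implementation.

    #include <stdint.h>

    /* Add two IEEE-754 single-precision bit patterns of the same sign. */
    static uint32_t fp_add_sketch(uint32_t a, uint32_t b) {
        uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
        uint32_t ma = (a & 0x7FFFFF) | 0x800000;   /* restore implicit leading 1 */
        uint32_t mb = (b & 0x7FFFFF) | 0x800000;
        if (ea < eb) { uint32_t t = ea; ea = eb; eb = t; t = ma; ma = mb; mb = t; }

        /* 1. Alignment: shift the smaller mantissa right by the exponent gap. */
        uint32_t shift = ea - eb;
        uint64_t mas = (uint64_t)ma << 24;
        uint64_t mbs = shift < 48 ? (((uint64_t)mb << 24) >> shift) : 0;

        /* 2. Operation: fixed-point add of the aligned mantissas. */
        uint64_t sum = mas + mbs;
        uint32_t exp = ea;

        /* 3. Normalization: put the leading 1 back at bit 47. */
        if (sum & (1ULL << 48)) { sum >>= 1; exp++; }

        /* 4. Rounding: round to nearest (ties away from zero, for brevity). */
        uint32_t mant = (uint32_t)((sum + (1ULL << 23)) >> 24);
        if (mant & 0x1000000) { mant >>= 1; exp++; }

        return (a & 0x80000000) | (exp << 23) | (mant & 0x7FFFFF);
    }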
When not to use Floating Point?
- Algorithms designed for fixed point:
  - Greater precision and dynamic range are not helpful, because the algorithms are bit exact.
  - E.g. the transform used to go to the frequency domain in video codecs is some form of DCT (Discrete Cosine Transform), designed to be performed on a fixed-point processor and bit exact.
- Also, when precision is not as important as speed (a Q15 multiply is sketched below).
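A small example of the arithmetic such bit-exact algorithms rely on: a Q15 fixed-point multiply with rounding and saturation (illustrative, not from the slides) produces the same bit pattern on a PDSP, a CPU or an FPGA.

    #include <stdint.h>

    /* Q15 multiply: round the Q30 product, rescale to Q15, and saturate the
       single overflow case (-1 * -1). Pure integer ops, hence bit exact everywhere. */
    static int16_t q15_mul(int16_t a, int16_t b) {
        int32_t p = ((int32_t)a * b + (1 << 14)) >> 15;
        if (p > 32767) p = 32767;
        return (int16_t)p;
    }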
Constant Cache
- Some applications (e.g. the FFT) load data from memory once and reuse it frequently.
  - This could pose a performance bottleneck. What can we do?
- Copying the data to local memory may not be enough, as each work-group would have to perform the copy operation.
- Solution: create a constant cache that only loads data when it is not already present, regardless of which work-group requires the data (see the sketch below).
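A minimal direct-mapped constant cache sketched in C (names and structure are illustrative, not an OpenCL or vendor API): a read first checks the tag and only fetches from external memory on a miss, no matter which work-group issued the request.

    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINES 256

    typedef struct {
        uint32_t tag[CACHE_LINES];
        float    data[CACHE_LINES];
        bool     valid[CACHE_LINES];
    } const_cache_t;

    extern float external_mem_read(uint32_t addr);   /* stand-in for the DRAM port */

    static float cache_read(const_cache_t *c, uint32_t addr) {
        uint32_t line = addr % CACHE_LINES;
        if (!c->valid[line] || c->tag[line] != addr) {   /* miss: fetch once */
            c->data[line]  = external_mem_read(addr);
            c->tag[line]   = addr;
            c->valid[line] = true;
        }
        return c->data[line];                            /* hit: served on-chip */
    }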
Datapath with a Constant Cache
Example: FFT
- Large computation that can be precomputed (see the twiddle-table sketch below).
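Assuming the precomputed quantity is the table of FFT twiddle factors W_N^k = e^(-j*2*pi*k/N) (an assumption; the slide only says the computation can be precomputed), filling it once and then serving it from the constant cache could look like this in C:

    #include <complex.h>
    #include <math.h>

    /* Fill the N/2 twiddle factors once; afterwards every butterfly only reads
       the table, which is exactly the reuse pattern a constant cache serves well. */
    static void precompute_twiddles(float complex *w, int n) {
        const float two_pi = 6.28318530717958647692f;
        for (int k = 0; k < n / 2; k++)
            w[k] = cexpf(-I * (two_pi * (float)k / (float)n));
    }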
Equalizer Example
Routing Challenge
Routing Challenge
- Designed performance is achieved only when the data sets can be readily accessed from fast on-chip SRAMs.
- For large data sets, the main performance bottleneck is the off-chip memory bandwidth.
- With DRAM, you can process the data in stages, with only the portion of the data set that fits on chip operated on at a time (see the staged-processing sketch below).
- The available memory bandwidth determines performance.
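A sketch of that staged processing in C (dram_read, dram_write and process_tile are placeholders, not a real API): the data set is streamed through an on-chip-sized buffer one tile at a time, so the sustained rate is set by the DRAM bandwidth rather than by the compute.

    #define TILE 4096                      /* elements that fit in on-chip SRAM (assumed) */

    extern void dram_read(float *dst, long off, int n);
    extern void dram_write(const float *src, long off, int n);
    extern void process_tile(float *buf, int n);   /* the on-chip kernel (placeholder) */

    static void process_large(long total) {
        static float sram[TILE];           /* stands in for the on-chip buffer */
        for (long off = 0; off < total; off += TILE) {
            int n = (int)(total - off < TILE ? total - off : TILE);
            dram_read(sram, off, n);
            process_tile(sram, n);
            dram_write(sram, off, n);
        }
    }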
Routing Resources
- Xilinx: more local routing resources.
  - Synergistic with DSP, because most DSP algorithms process data locally.
- Altera: wide buses.
  - Also has value, because normally wide data vectors of 16 to 32 bits must be moved to the next DSP block.
Example: Matrix Multiplication
- Double-precision FP cores (64 bits).
- Matrix operations require all matrix-element calculations to complete at the same time.
- These parallelized or "vector" operations will run at the slowest clock speed of all the FP functions in the FPGA (see the dot-product sketch below).
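One such "vector" operation sketched in C (the 40-wide mapping echoes the design in the results below and is my illustration, not the paper's code): a group of multipliers works in parallel and an adder tree reduces their outputs, so the whole group is paced by the slowest FP core in the chain.

    #define VEC_K 40    /* parallel double-precision multipliers (assumed mapping) */

    /* One partial dot product: VEC_K products in parallel, then a reduction that
       hardware implements as a log2(VEC_K)-deep adder tree, one result per clock. */
    static double dot_step(const double *a_row, const double *b_col) {
        double prod[VEC_K];
        for (int i = 0; i < VEC_K; i++)
            prod[i] = a_row[i] * b_col[i];
        double sum = 0.0;
        for (int i = 0; i < VEC_K; i++)
            sum += prod[i];
        return sum;
    }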
Routing Challenge
- Hypothesis (constrained performance prediction):
  - Estimated 15% of logic unusable (due to data-path routing, routing constraints, etc.).
  - Estimated 33% decrease in FP function clock speed.
  - Extra 24,000 ALUs for the local SRAM memory controller and processor interface.
- Resulting design: 39 adders ("+") and 39 multipliers ("X").
- Clock speed: 200 MHz; performance: 15.7 GFLOPS (peak: 300 MHz, 25.5 GFLOPS).
Routing Challenge
- Considerations:
  - Latency of transferring the A and B matrices from the microprocessor to local FPGA SRAM is not included in the benchmark time.
  - Challenge when using all double-precision FP cores: feeding them with data on every clock cycle.
- With double-precision 64-bit data and many FP arithmetic cores in parallel, wide internal memory interfaces are needed.
Routing Challenge: Results
- Average sustained throughput: 88 percent.
  - 40 multiplier and 40 adder-tree cores: a result every clock cycle.
  - Five additional adder cores used for the blocking implementation: one value per clock cycle.
- The GFLOPS calculation then is 200 MHz * 81 operators * 88 percent duty cycle = 14.25 GFLOPS.
  - Lower than expected, due to the time needed to read and write values to the external SRAM.
  - With multiple SRAM banks providing higher memory bandwidth, the GFLOPS would be closer to the 15.7 GFLOPS number.
- Power: the expected 15 GFLOPS performance of the Stratix EP2S180 FPGA running at 30 W is close to the sustained performance of a 60-W 3-GHz Intel Woodcrest CPU.
Paper Analysis
"FPGA implementations of fast Fourier transforms for real-time signal & image processing", I.S. Uzun, A. Amira and A. Bouridane, IEE Proc.-Vis. Image Signal Process., Vol. 152, No. 3, June 2005.
Functional block diagram of the 1-D FFT processor architecture (figure).
AGU: Radix-2 DIF FFT

    w_s := 1
    for stage := log2(N) to 1 step -1 {            // stage loop
        m := 2^stage
        is := m/2
        w_index0 := 0
        for group := 0 to N-m step m {             // group loop
            for bfi := 0 to is-1 {                 // butterfly loop
                Index0 := r + j
                Index1 := Index0 + is
            }
            w_index0 := w_index0 + w_s
        }
        w_s := w_s << 1
    }
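For reference, here is a plausible C reading of the AGU listing above (radix-2 DIF, N a power of two). The index pair per butterfly follows the listing; the butterfly routine bfly and the exact placement of the twiddle-index update are my interpretation of the garbled source, not the paper verbatim.

    #include <stddef.h>

    /* Placeholder for the butterfly datapath fed by the AGU (not in the paper's listing). */
    extern void bfly(size_t index0, size_t index1, size_t w_index);

    /* Emit the data and twiddle addresses for an N-point radix-2 DIF FFT. */
    static void radix2_dif_agu(size_t n, int log2n) {
        size_t w_s = 1;                              /* twiddle stride, doubles each stage */
        for (int stage = log2n; stage >= 1; stage--) {
            size_t m  = (size_t)1 << stage;          /* butterfly span in this stage */
            size_t is = m / 2;
            for (size_t group = 0; group + m <= n; group += m) {
                size_t w_index = 0;
                for (size_t bfi = 0; bfi < is; bfi++) {
                    size_t index0 = group + bfi;     /* "r + j" in the listing */
                    size_t index1 = index0 + is;
                    bfly(index0, index1, w_index);
                    w_index += w_s;                  /* twiddle index advances by the stride */
                }
            }
            w_s <<= 1;
        }
    }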