Transcript: RCS for DSP
ENG6530 Reconfigurable Computing Systems
Digital Signal Processing using FPGAs
ENG6530 RCS 1
Topics
• Digital Signal Processing (DSP): definition, advantages and disadvantages, applications, …
• DSP vs. GPP vs. ASIC vs. FPGA: why use reconfigurable computing
• Xilinx System Generator
References
I. http://www.xilinx.com
II. "Reconfigurable Computing for DSP: A Survey", by R. Tessier and W. Burleson, 2001
III. "Optimization Techniques for Efficient Implementation of DSP in FPGAs", by J. Wang
IV. "Reconfigurable Computing: The Theory and Practice of FPGA Based Computing", Chapter 24: Distributed Arithmetic
Introduction
The term Digital Signal Processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form.
Applications include:
i. Telecommunication (switches, …)
ii. Medical (images, equipment, ..)
iii. Military (radar, missiles, ..)
iv. Consumers (cell phones, TVs, ..)
DSP Flow
The data to be processed starts out as a signal in the real (analog) world.
This analog signal is then sampled by means of an analog to digital converter.
These samples are then processed in the digital domain.
The digital samples are subsequently converted into an analog equivalent by means of a digital to analog converter.
[Figure: analog input signal → A/D → digital input samples → DSP → modified output samples → D/A → analog output signal. The input and output signals are in the analog domain; the DSP operates in the digital domain.]
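The flow above can be sketched in software. This is a minimal illustration, not a model of any particular converter: the 8-bit resolution, the 8 kHz sample rate, the 1 kHz test tone, and the function names `adc`, `dac`, and `attenuate` are all assumptions chosen for the example.

```python
import math

def adc(signal, bits, full_scale=1.0):
    """Quantize samples in [-full_scale, full_scale) to signed integer codes."""
    levels = 2 ** (bits - 1)
    codes = []
    for v in signal:
        code = round(v / full_scale * levels)
        codes.append(max(-levels, min(levels - 1, code)))  # clip to the code range
    return codes

def dac(codes, bits, full_scale=1.0):
    """Map signed integer codes back to analog-equivalent values."""
    levels = 2 ** (bits - 1)
    return [c / levels * full_scale for c in codes]

def attenuate(codes):
    """Example digital-domain operation: halve each sample with a shift."""
    return [c >> 1 for c in codes]

fs, f0, n = 8000, 1000, 8                      # assumed sample rate, tone, length
analog_in = [0.9 * math.sin(2 * math.pi * f0 * k / fs) for k in range(n)]
digital_in = adc(analog_in, bits=8)            # analog -> digital samples
digital_out = attenuate(digital_in)            # processing in the digital domain
analog_out = dac(digital_out, bits=8)          # digital -> analog equivalent
```

The output tracks half the input to within one quantization step, which is the price paid for representing the signal with a finite word length.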
DSP Flow
[Figure: ADC (sampling + quantization) → 1010.. → digital system → 1001.. → DAC. Digital system: signal analysis, system analysis, filter design. DSP architecture: fixed-point arithmetic, architecture types, selection criteria.]
Transition from Analog to Digital
The transition from analog to digital techniques has been driven by the many advantages of DSP:
• The main advantage of digital signals over analog signals is that the precise signal level of the former is not vital (they are immune to imperfections).
• Digital signals can be saved in memory and then recalled.
• Digital signals can convey information with greater noise immunity.
• Digital signals can be processed by digital circuit components, which are cheap and easily produced.
• Digital signals can be encrypted so that only the intended receiver can decode them.
• Precision is flexible: word lengths and/or number representation (e.g., fixed point vs. floating point) can be changed.
• A single processing element can process multiple incoming signals through multiplexing.
• Signals can be transmitted over long distances and at higher rates.
• Digital approaches can easily adjust their processing parameters, as in adaptive filtering.
Transition from Analog to Digital
The main disadvantages of DSP:
i. Increased system complexity: DSP requires that signals be converted between analog and digital forms using a sample-and-hold circuit, analog-to-digital converters (ADCs), digital-to-analog converters (DACs), and analog filtering.
ii. Power consumption: DSP tends to require more power since a dedicated processor is used.
iii. Frequency range limitation: analog hardware will naturally be able to work with higher frequency signals than is possible with DSP hardware, due to the limitations of performing analog-to-digital conversion.
For many applications, the advantages of DSP far outweigh these disadvantages.
DSP: Common Operations
Some of the most common operations performed on signals using digital or analog techniques include:
• Elementary time-domain operations: amplification, attenuation, integration, differentiation, addition of signals, multiplication of signals, etc.
• Filtering (FIR, IIR)
• Transforms (FFT, IFFT)
• Convolution (integral of the product of two functions)
• Error correction (transmission)
• Compression and decompression (audio, video)
• Modulation and demodulation (BPSK, QAM, FSK, ASK, …)
• Multiplexing and de-multiplexing
• Signal generation
DSP Applications
• Audio applications: MPEG audio, portable audio
• Photography: digital cameras, CAM
• Wireless applications: WiFi, WiMax, Bluetooth
• Networking: switches, classifiers
• Medical equipment: hearing aids, heart pacers
• Cable modems, ADSL, VDSL
• Cellular phones: base stations, GSM, LTE
• Military applications: radar
Main DSP Operations
DSP is the arithmetic processing of digital signals sampled at regular intervals.
DSP can be reduced to three trivial operations:
• Delay
• Add
• Multiply
Accumulate = Add + Delay
MAC = Multiply + Accumulate
The MAC is the engine behind DSP.
More MACs = higher performance, better signal quality.
MACs vs. MIPS: not always equal.
[Figure: filter responses using 3 MACs, 50* MACs, and 100 MACs]
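A minimal software sketch of how the three operations combine into a MAC. The class name `Mac` and the helper `dot_product` are hypothetical names used only for illustration.

```python
class Mac:
    """Multiply-accumulate unit: acc <- acc + a * b on every step."""

    def __init__(self):
        self.acc = 0  # accumulator register: this is the "delay" element

    def step(self, a, b):
        product = a * b            # multiply
        self.acc += product        # add to the delayed (previous) sum
        return self.acc

def dot_product(coeffs, samples):
    """An N-term sum of products takes N steps on a single MAC unit."""
    mac = Mac()
    for c, x in zip(coeffs, samples):
        mac.step(c, x)
    return mac.acc

result = dot_product([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```

Each filter output sample is one such sum of products, which is why the number of available MAC units largely determines throughput.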
Alternative DSP Implementations
DSP tasks can be implemented in a number of different ways.
i. A general purpose processor (GPP): The processor can perform DSP by running an appropriate DSP algorithm.
ii. A digital signal processor (PDSP): This is a specialized form of microprocessor chip that has been designed to perform DSP tasks much faster and more efficiently than a GPP.
iii. Dedicated ASIC hardware: Custom hardware implementation that executes the DSP task.
iv. Dedicated FPGA hardware: Similar to ASIC except that it offers:
   • Flexibility in terms of reconfiguration.
   • Embedded microprocessor cores on the FPGA.
The Performance Gap
Algorithmic complexity increases as application demands increase.
In order to process these new algorithms, higher performance signal processing engines are required.
Traditional DSP Approaches
Digital Signal Processor IC
• Software programmable, like a microprocessor
• Single MAC unit
• All processing done sequentially
• Fit the algorithm to the architecture
[Figure: analog input → ADC → 'traditional' DSP processor (MAC, memory, data controller) → DAC → analog output, plus a digital output]
ASIC (gate array)
• Fit the architecture to the algorithm
• Significantly higher performance than DSP processor
• High cost and high risk to develop
• Usually only for high-volume applications
The Promise of Programmable Logic
ASIC
• Pros: high performance, high density, one-chip solution
• Cons: high design risk, long design cycle
DSP Processor
• Pros: high flexibility, good adaptability, short design cycle, low design risk
• Cons: performance, hardware complexity
FPGA: best from both worlds, plus:
• Efficient IC architecture
• System features
• Automatic migration to low-cost HardWire
Why FPGAs?
The most commonly used DSP functions are: FIR (Finite Impulse Response) filters, IIR (Infinite Impulse Response) filters, FFT (Fast Fourier Transform), DCT (Discrete Cosine Transform), encoder/decoder, and error correction/detection functions.
All of these blocks perform intensive arithmetic operations (data-path-intensive operations) such as add, subtract, multiply, multiply-add, or multiply-accumulate.
Why Use FPGAs in DSP Applications?
• 10x more DSP throughput than DSP processors: parallel vs. serial architecture
• Cost-effective for multi-channel applications
• Flexible hardware implementation
• Single-chip solution
• System (hardware/software) integration benefits
[Figure: a DSP system built from software on a DSP plus an FPGA, compared with software on an embedded processor inside the FPGA]
DSP-related embedded FPGA resources
Many FPGAs incorporate dedicated multiplier blocks (Virtex-5/6/7).
Similarly, some FPGAs offer dedicated adder blocks.
One operation that is very common in DSP-type applications is the multiply-and-accumulate (MAC).
To make it easier to implement DSP on FPGAs, some devices provide an entire MAC as an embedded function (Virtex-4).
[Figure: MAC built from a multiplier, an adder, and an accumulator, with inputs A[n:0] and B[n:0] and output Y[(2n - 1):0]]
DSP Functions are Parallel in Nature
8-Bit, 16-Tap Finite Impulse Response (FIR) Filter
[Figure: data input X[7:0] feeds a chain of 16 registers (filter taps 0 to 15). Because the coefficients are symmetrical, the taps are paired (0 with 15, 1 with 14, 2 with 13, 3 with 12, 4 with 11, 5 with 10, 6 with 9, 7 with 8) and each pair shares one of the filter coefficients C0 to C7. The products are accumulated to form the data output Y[9:0].]
Equation:
$$Y_j = \sum_{k=0}^{n-1} c_k \, x_{j-k}$$
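The benefit of symmetrical coefficients can be checked with a short sketch: adding the two samples of each tap pair before multiplying halves the number of multiplications (8 instead of 16 here). The function names and the test coefficients below are assumptions for illustration, not values from the slide.

```python
def fir_direct(coeffs, window):
    """One output sample: one multiplication per tap."""
    return sum(c * x for c, x in zip(coeffs, window))

def fir_symmetric(half_coeffs, window):
    """Same output for symmetric coefficients, using a pre-adder per tap pair."""
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must hold two samples per coefficient")
    # Tap k and tap n-1-k share coefficient half_coeffs[k].
    return sum(c * (window[k] + window[n - 1 - k])
               for k, c in enumerate(half_coeffs))

half = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical C0..C7
full = half + half[::-1]             # 16 symmetric taps
window = list(range(16))             # 16 most recent samples
y_direct = fir_direct(full, window)
y_folded = fir_symmetric(half, window)
```

Both functions return the same value; the folded form simply trades eight multipliers for eight extra adders.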
DSP and FPGA
The FPGA's parallel approach to DSP enables higher computational throughput. Consider a 256-tap FIR filter:
• Conventional DSP processor: serial implementation
• FPGA: fully parallel implementation
Multiply Accumulate: Multiple Engines
• Parallel processing maximizes data throughput
• Supports any level of parallelism
• Optimal performance/cost tradeoff
• Flexible architecture
• Distributed DSP resources (LUT, registers, multipliers, and memory)
256-Tap FIR Filter
• 256 multiply and accumulate (MAC) operations per data sample
• One output every clock cycle
[Figure: Data In passes through Reg0, Reg1, Reg2, ..., Reg255, each multiplied by its coefficient C0, C1, C2, ..., C255; all 256 MAC operations complete in 1 clock cycle to produce Data Out]
FPGAs Outperform ‘Traditional’ DSP Processors
8-bit, 16-tap FIR filter performance comparison (external performance). Each entry gives the sample rate and the relative performance plotted on the chart:
• 133 MHz Pentium™ processor: 750 kHz, 0.24
• Single 50 MHz DSP: 3 MHz, 1.00
• XC4003E-3 FPGA (68% util.), Serial Distributed Arithmetic (SDA): 8 MHz, 2.60
• Four 50 MHz DSPs (MCM): 12 MHz, 4.00
• XC4010E-3 FPGA (98% util.), Parallel Distributed Arithmetic (PDA): 56 MHz, 16.00
• XC4013E-2 FPGA (75% util.), Parallel Distributed Arithmetic (PDA): 66 MHz, 22.00 (est.)
Case Study: Viterbi Decoder
(FPGA-based DSP co-processor)
[Figure: Viterbi decoder datapath on the FPGA. Old_1, Old_2, and INC arrive over the I/O bus and feed adders with optional pipelining registers; the MSBs of Diff_1 and Diff_2 control multiplexers that select New_1 and New_2; 24-bit data paths; 1-bit output to the prestate buffer.]
Performance: 2.67 times better with FPGA-assisted DSP.
• DSP-only (8 devices): two 66 MHz DSPs, six 15 ns SRAMs, system logic: 360 ns
• DSP + FPGA (4 devices): one 66 MHz DSP, XC4013E-3 FPGA (44%), three 15 ns SRAMs: 135 ns
What to Look for in Your DSP Application
• Identify parallel data paths
• Find operations that require multiple clock cycles
• Processor bottlenecks
• Flexibility
[Figure: checklist marked YES/NO for parallel data paths, scalable bandwidth, design modification, and device expansion]
When to Use FPGAs for DSP
[Figure: chart with arithmetic operations per sample (1 to 48) on the horizontal axis and a vertical axis from 0 to 50, with curves for 1, 2, 3, and 4 DSPs separating a DSP region from an FPGA region]
• High sample rates: up to 500 MHz with Virtex 5/6/7
• Low sample rates: integrate DSP + system logic in a low-cost DSP using serial sequential algorithm
• Short word lengths: DA algorithm gets faster with shorter word length
• Lots of filter taps: FPGA processes all taps in parallel, faster than DSP
• Fast correlators
• Single-chip solution required
• HardWire gate array migration path for high-volume designs
Co-processing with an FPGA
FPGA co-processors are an extremely cost-effective means of off-loading computationally intensive algorithms from a DSP processor.
[Figure: FPGA coprocessor for WiMAX baseband processing]
[Figure: FPGA coprocessor for high-definition H.264 encoding]
Digital Filters
Digital filters are one of the main elements of DSP and are built almost entirely from MAC operations.
A digital filter performs a filtering function on data by attenuating or reducing bands of frequencies. Examples:
• Remove high-frequency noise from a speech signal
• Remove low-frequency noise for some sensors
• Emphasize a particular frequency in a music signal
• Remove 50 Hz mains hum from an ECG signal
Low Pass Digital Filter
An example of the operation of a low-pass filter is:
[Figure: low-pass filter example]
The weights W_0 to W_{N-1} must be appropriately chosen.
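As a rough illustration of why the choice of weights matters, the sketch below uses four equal weights of 0.25 (a moving average, an assumed choice and not necessarily the weights on the slide). A slowly varying input passes through, while an input alternating at the highest possible frequency is cancelled.

```python
def weighted_sum_filter(weights, samples):
    """y[n] = sum over k of W_k * x[n-k], with x treated as 0 before the start."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, w in enumerate(weights):
            if n - k >= 0:
                acc += w * samples[n - k]
        out.append(acc)
    return out

weights = [0.25, 0.25, 0.25, 0.25]        # assumed weights W_0..W_3
slow = [1.0] * 8                          # constant (zero-frequency) input
fast = [1.0, -1.0] * 4                    # alternating (highest-frequency) input
slow_out = weighted_sum_filter(weights, slow)
fast_out = weighted_sum_filter(weights, fast)
```

After the first three start-up samples, `slow_out` settles at 1.0 and `fast_out` at 0.0, which is the low-pass behavior.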
Digital Filters: Types
• Finite Impulse Response (FIR): non-recursive linear filter (i.e. no feedback present).
• Infinite Impulse Response (IIR): recursive linear filter (i.e. with feedback).
• Adaptive Digital Filter (ADF): a self-learning filter that adapts itself to a desired signal.
• Non-linear filters: filters that can perform non-linear operations, e.g. median filters and min/max filters.
FIR Filters
A Finite Impulse Response (FIR) filter performs a weighted average (convolution) on a window of N data samples:
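In the usual notation the weighted average is y[n] = w_0 x[n] + w_1 x[n-1] + ... + w_{N-1} x[n-N+1]. A minimal streaming sketch of that structure follows; the class name `FirFilter` and the example coefficients are assumptions for illustration.

```python
from collections import deque

class FirFilter:
    """Direct-form FIR: a window of the N most recent samples, weighted and summed."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        n = len(self.coeffs)
        self.window = deque([0] * n, maxlen=n)   # delay line, initially zero

    def step(self, x):
        self.window.appendleft(x)   # newest sample first; the oldest is dropped
        return sum(c * s for c, s in zip(self.coeffs, self.window))

f = FirFilter([1, 2, 3])            # hypothetical 3-tap filter
outputs = [f.step(x) for x in [1, 1, 1, 1]]
```

The `deque` plays the role of the register chain in a hardware implementation, and the `sum` plays the role of the multipliers and adder tree.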
FIR FILTERS
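[Figure: finite impulse response filter. The input passes through a chain of registers (Z^-1 delays); the tap outputs are scaled by multipliers with coefficients C_1, C_2, ..., C_{N-1}, C_N and combined by adders.]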
Frequency Response
The frequency/phase response of a digital filter is found by taking the Discrete Fourier Transform (DFT) of the impulse response.
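A small sketch of that calculation for an FIR filter, whose impulse response is simply its coefficient list. The 4-tap averaging filter and the 8-point DFT size are assumptions chosen so the result is easy to check by hand.

```python
import cmath
import math

def frequency_response(h, num_points):
    """DFT of impulse response h, zero-padded to num_points frequency bins."""
    return [sum(h[n] * cmath.exp(-2j * math.pi * m * n / num_points)
                for n in range(len(h)))
            for m in range(num_points)]

h = [0.25, 0.25, 0.25, 0.25]                 # assumed impulse response
H = frequency_response(h, 8)
magnitude = [abs(v) for v in H]              # gain at each frequency bin
phase = [cmath.phase(v) for v in H]          # phase shift in radians
```

Bin 0 (DC) has a gain of 1, while bins 2 and 4 (a quarter and a half of the sample rate) have a gain of 0: the filter is low-pass.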
FPGA Implementations
1. Hardware Description Language: VHDL, Verilog
2. Electronic System Level: Handel-C, Vivado HLS (Lab #7), Impulse-C
3. Core Generator (IP selection)
4. System Generator (Lab #6): MATLAB, Simulink, System Generator
FIR FILTER: VHDL Implementation
Simple VHDL design example of an 8-tap FIR filter.
Hardware Description Languages
Full VHDL/Verilog (RTL code)
Advantages:
• Portability and efficient implementation
• Complete control of the design implementation and tradeoffs
• Easier to debug and understand code that you own
Disadvantages:
• Can be time consuming
• You don't always have control over the synthesis tool
• You need to be familiar with the algorithm and how to write it
Abstraction: Advantages
CORE Generator
Instantiate optimized IP within the HDL code.
[Figure: design flow. HDL and COREGen feed Synthesis → Implementation → Download, verified by behavioral simulation, functional simulation, timing simulation, and in-circuit verification.]
Xilinx CORE Generator
[Figure: CORE Generator window showing the list of available IP; cores are fully parameterizable]
Xilinx IP Solutions
Legend: $ - License Fee, P - Parameterized, S - Project License Available, BOLD – Available in the Xilinx Blockset for the System Generator for DSP
IP CENTER: http://www.xilinx.com/ipcenter

DSP Functions
• $ P Reed Solomon
• $ 3GPP Turbo Code
• $ P Viterbi Decoder
• $ P Convolution Encoder
• $ P Interleaver/De-interleaver
• P LFSR
• P 1D DCT
• P DA FIR
• P MAC
• P MAC-based FIR filter
• Fixed FFTs: 16, 64, 256, 1024 points
• P FFT - 32 Point
• P Sine Cosine
• P Direct Digital Synthesizer
• P Cascaded Integrator Comb
• P Bit Correlator
• P Digital Down Converter

Math Functions
• P Multiplier Generator
  – Parallel Multiplier
  – Dyn Constant Coefficient Mult
  – Serial Sequential Multiplier
  – Multiplier Enhancements
• P Divider
• P CORDIC

Base Functions
• P Binary Decoder
• P Two's Complement
• P Shift Register RAM/FF
• P Gate modules
• P Multiplexer functions
• P Registers, FF & latch based
• P Adder/Subtractor
• P Accumulator
• P Comparator
• P Binary Counter

Memory Functions
• P Asynchronous FIFO
• P Block Memory modules
• P Distributed Memory
• P Distributed Mem Enhance
• P Sync FIFO (SRL16)
• P Sync FIFO (Block RAM)
• P CAM (SRL16)

PCI
• $ P PCI 64/66
• $ PS PCI 32/33
• $ P PCI-X 64/66

Networking
• 8B/10B Encoder/Decoder
• $ POS-PHY L3
• $ POS-PHY L4
• $ Flexbus 4
• $ RapidIO PHY Layer
• $ S HDLC 1 and 32 channel
• $ S G.711 PCM Cores
• $ S ADPCM 32 & 64 channel
Core Generator: Summary
CORE Generator advantages:
• Can quickly access and generate existing functions
• No need to reinvent the wheel and re-design a block if it meets specifications
• IP is optimized for the specified architecture
Disadvantages:
• IP doesn't always do exactly what you are looking for
• Need to understand signals and parameters and match them to your specification
• You are dealing with a black box and have little information on how the function is implemented
Xilinx System Generator for DSP
• Industry's first system-level design environment (IDE) for FPGAs
• Simulink library of arithmetic, logic operators, and DSP functions (Xilinx blockset)
• Arithmetic abstraction
• VHDL code generation for most Spartan-based FPGAs and Virtex 4/5/6/7 FPGAs
• Enables hardware-in-the-loop co-simulation
MATLAB
• MATLAB™, one of the most popular system design tools, is a programming language, interpreter, and modeling environment
  – Extensive libraries for math functions, signal processing, DSP, communications, and much more
  – Visualization: large array of functions to plot and visualize your data and system/design
  – Open architecture: software model based on a base system and domain-specific plug-ins
System Level Evaluation
Irrespective of the final implementation technology (GPP, DSP, ASIC, FPGA), if one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment.
The de facto industry standard for DSP algorithmic verification is MATLAB.
[Figure: Original concept → algorithmic verification → auto C/C++ generation, handcrafted C/C++, or handcrafted assembly → compile/assemble → machine code]
System/Algorithmic level to RTL
Many DSP design teams commence by performing their system-level evaluation and algorithmic validation in MATLAB using floating-point (FP) representation.
At this point, many design teams jump directly into hand-coding fixed-point RTL equivalents of the design in VHDL.
Alternatively, they may first transition the FP representations into their fixed-point counterparts at the system level.
[Figure: Original concept → system/algorithmic verification (floating-point) → either (a) directly, or (b) via system/algorithmic verification (fixed-point) → handcraft Verilog/VHDL RTL (fixed-point) → to standard RTL-based simulation and synthesis]
Simulink
• Simulink™: visual data flow environment for modeling and simulation of dynamical systems
  – Fully integrated with the MATLAB engine
  – Graphical block editor
  – Event-driven simulator
  – Models parallelism
  – Extensive library of parameterizable functions
    • Simulink Blockset: math, sinks, sources
    • DSP Blockset: filters, transforms, etc.
    • Communications Blockset: modulation, DPCM, etc.
Traditional Simulink FPGA Flow
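[Figure: the system architect performs system verification in Simulink; across a GAP, the FPGA designer works through HDL → Synthesis → Implementation → Download, with functional simulation, timing simulation, and in-circuit verification, and must verify equivalence with the Simulink model]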
System Generator
[Figure: MATLAB/Simulink system verification → System Generator, which produces VHDL, IP, a testbench, and a constraints file → HDL → Synthesis → Implementation → Download, with functional simulation, timing simulation, and in-circuit verification]
Creating a System Generator Design
• The Xilinx blockset is listed in the Simulink Library Browser
• Create a design by dragging and dropping components from the Xilinx blockset onto your new sheet
Finding Blocks
• Use the Find feature to search ALL Simulink libraries
• The Xilinx blockset has nine major sections
  – Basic elements: counters, delays
  – Communication: error correction blocks
  – Control Logic: MCode, Black Box
  – Data Types: Convert, Slice
  – DSP: FDATool, FFT, FIR
  – Index: all Xilinx blocks (a quick way to view all blocks)
  – Math: multiply, accumulate, inverter
  – Memory: Dual Port RAM, Single Port RAM
  – Tools: ModelSim, Resource Estimator
Configure Your Blocks
• Double-click or go to Block Parameters to view a block's configurable parameters
  – Arithmetic Type: unsigned or twos complement
  – Implement with Xilinx Smart-IP Core (if possible) / Generate Core
  – Latency: specify the delay through the block
  – Overflow and Quantization: users can saturate or wrap on overflow, and truncate or round on quantization
  – Override with Doubles: simulation only
  – Precision: full, or the user can define the number of bits and where the binary point is for the block
  – Sample Period: can be inherited with a "-1" or must be an integer value
• Note: While all parameters can be simulated, not all are realizable
Values Can Be Equations
• You can also enter equations in the block parameters, which can aid calculation and your own understanding of the model parameters
• The equations are calculated at the beginning of a simulation
• Useful MATLAB operators
  – + add
  – - subtract
  – * multiply
  – / divide
  – ^ power
  – pi (3.1415926535897…)
  – exp(x) exponential (e^x)
Important Concept 1: The Numbers Game
• Simulink uses a "double" to represent numbers in a simulation. A double is a 64-bit IEEE 754 floating-point number (sign and magnitude, not twos complement)
  – Because the binary point can move, a double can represent magnitudes up to about 1.8 x 10^308 with roughly 15 to 17 significant decimal digits: a wide, desirable range, but not efficient or realistic for FPGAs
• The Xilinx Blockset uses n-bit fixed-point numbers (twos complement optional)

Bit weight: -2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7 2^-8 2^-9 2^-10 2^-11 2^-12 2^-13
Bit value:     1   0   1 .    1    0    1    1    1    1    0    1    0     0     1     0     1
(3 integer bits, 13 fraction bits)
Value = -2.261108…
Format = Sign_Width_Binary point from the LSB
Format = Fix_16_13 (Sign: Fix = signed value, UFix = unsigned value)

Design Hint: Always try to maximize the dynamic range of the design by using only the required number of bits.
Thus, a conversion is required when Xilinx blocks communicate with Simulink blocks (Xilinx blockset / MATLAB I/O: Gateway In/Out).
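The Fix_16_13 example can be checked with a few lines of arithmetic. This is a sketch of the number format only, not of any Xilinx API; the function name `fix_to_float` is hypothetical.

```python
def fix_to_float(bits, frac_bits, signed=True):
    """Interpret a bit string as a fixed-point number with frac_bits fraction bits."""
    raw = int(bits, 2)
    if signed and bits[0] == "1":
        raw -= 1 << len(bits)          # twos complement: the MSB has negative weight
    return raw / (1 << frac_bits)      # place the binary point

# The 16-bit pattern from the slide: 101.1011110100101
value = fix_to_float("1011011110100101", frac_bits=13)

resolution = 2 ** -13                                   # weight of the LSB
smallest = fix_to_float("1" + "0" * 15, frac_bits=13)   # most negative value
largest = fix_to_float("0" + "1" * 15, frac_bits=13)    # most positive value
```

With 3 integer bits and 13 fraction bits, the representable range is only -4 to just under +4, in steps of 2^-13; this limited range is the reason word length and binary point position must be chosen per signal.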
What About All Those Other Bits?
• The Gateway In and Out blocks support parameters to control the conversion from double precision to N-bit fixed-point precision

DOUBLE:
Bit weight: … -2^6 2^5 2^4 2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7 2^-8 2^-9 2^-10 2^-11 2^-12 2^-13 …
Bit value:  …    1   1   1   1   1   0   1 .    1    0    1    1    1    1    0    1    0     0     1     0     1 …

FIX_12_9:
Bit weight: -2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7 2^-8 2^-9
Bit value:     1   0   1 .    1    0    1    1    1    1    0    1    0

OVERFLOW (bits dropped on the left): Wrap, Saturate, or Flag Error
QUANTIZATION (bits dropped on the right): Truncate or Round
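The overflow and quantization choices can be modeled in a few lines. This is a behavioral sketch only: the function name `to_fixed`, its parameter names, and the round-half-up rule are assumptions, and the actual Gateway In block may round differently.

```python
import math

def to_fixed(value, width, frac_bits, overflow="saturate", quantization="round"):
    """Convert a double to a signed fixed-point value with the given format."""
    scaled = value * (1 << frac_bits)
    if quantization == "truncate":
        raw = math.floor(scaled)            # drop the extra fraction bits
    else:
        raw = math.floor(scaled + 0.5)      # assumed rule: round half up
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    if raw < lo or raw > hi:
        if overflow == "saturate":
            raw = max(lo, min(hi, raw))     # clamp to the nearest extreme
        elif overflow == "wrap":
            raw = (raw - lo) % (1 << width) + lo   # drop the extra MSBs
        else:
            raise OverflowError(f"{value} does not fit the format")
    return raw / (1 << frac_bits)

truncated = to_fixed(-2.2611083984375, 12, 9, quantization="truncate")
saturated = to_fixed(5.0, 12, 9, overflow="saturate")
wrapped = to_fixed(5.0, 12, 9, overflow="wrap")
```

Truncating the Fix_16_13 value -2.2611083984375 to 12 bits with 9 fraction bits gives -2.26171875, which matches the FIX_12_9 bit pattern 101.101111010. Saturation clamps 5.0 to the largest representable value, while wrapping silently turns it into -3.0.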
Creating a System Generator Design
I/O blocks are used as the interface between the Xilinx blockset and other Simulink blocks.
[Figure: Simulink sources → SysGen blocks realizable in hardware → Simulink sinks and library functions]
Using the Scope
• Click Properties to change the number of axes displayed and the time range value (X axis)
• Use Data history to control how many values are stored and displayed on the scope
• Click Autoscale to quickly let the tools configure the display to the correct axis values
• Right-click on the Y-axis to set its value
Design & Simulate in Simulink
Simulate the design by pushing “play.” Go to “Simulation Parameters” under the “Simulation” menu to control the length of simulations
Resource Estimator
• The block provides fast estimates of the FPGA resources required to implement the subsystem
• Most of the blocks in the System Generator Blockset carry resource information
  – LUTs
  – FFs
  – BRAM
  – Embedded multipliers
  – 3-state buffers
  – I/Os
Resource Estimator
• Three types of estimation
  – Estimate Area
    • This option computes resources for the current level and all sub-levels
  – Quick Sum
    • Uses the resources stored in each block directly and sums them up (no sub-level functions are invoked)
  – Post-Map Area
    • Opens a file browser and lets the user select a map report file. The design should have been generated and gone through the synthesis, translate, and mapping phases.
The Black Box
Use the Black Box when:
• You need a function that cannot be created with the Xilinx Blockset
• You already have a piece of VHDL you wish to use for a section of the design
The Black Box creates a placeholder in the generated VHDL. Use the Black Box parameters to control the VHDL placeholder's features.
Generate the VHDL Code
Once complete, double-click the System Generator token:
• Select the target device
• Select to generate the testbench
• Set the desired system clock period
• Generate the VHDL
Hardware-in-the-Loop Reduces Design Time & Cost
• Configure any development board for hardware-in-the-loop using the JTAG header in < 20 minutes
• Automatically create the FPGA bit-stream from Simulink
• Transparent use of FPGA implementation tools
• Accelerate and verify the Simulink design using FPGA hardware
• Mirrors traditional DSP processor design flows
• Combine with the black box to simulate HDL & EDIF
Create Bit-stream
Step 1: Select the target H/W platform.
Step 2: Generate the bit-stream.
Co-Simulate in Hardware
Step 3 (contd.): A post-generation script creates a new library containing a parameterized run-time co-simulation block.
Step 4: Copy the run-time co-simulation block into the original model.
Step 5: Simulate for verification.
Hardware in the Loop Performance Results
Single Step Clock Mode (bit and cycle accurate)

Application                            Software simulation time (s)   Hardware simulation time (s)   Speed-up
Image Filtering                        676                            6                              112X
QAM Demodulator + Extension            1203                           18                             67X
5 x 5 Image Filter                     170                            4                              43X
Cordic Arc Tangent                     187                            27                             7X
Additive White Gaussian Noise Channel  600                            80                             7.5X
Free Running Clock Mode
A free-running clock is provided to the design, so the hardware no longer runs in lockstep with the software. The test is started, and after some time a 'done' flag is set to read the results from the FPGA and display them in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of magnitude performance enhancement over the original software simulation.
DSP System Generator: Summary
• System Generator for DSP
  – Advantages
    • Ability to simulate the design at a system level
    • High level of abstraction: very attractive for FPGA novices
    • Optimize area, speed, or a combination
    • Estimate resources easily
    • Hardware co-simulation (FPGA in the loop)
    • Test-bench and golden data written automatically
  – Disadvantages
    • Cost of abstraction: doesn't always give the best result in terms of area usage
    • Only as good as the IP support
FPGAs versus DSP
• FPGAs can outperform DSP processors on certain DSP tasks: computation-intensive, highly parallelizable tasks
• DSP processors have the advantage in development infrastructure, time-to-market, and developer familiarity
• DSP processors are still easier to use
• Many engineers possess DSP processor development skills
• Ultimate speed is not always the first priority
• A combination of FPGA and DSP processor is an excellent solution if performance requirements cannot be met by the processor alone
• The "best" architecture depends on the requirements of the application
Problem with this flow?
There is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representation in VHDL.
Manual translation from one to another is time consuming and prone to error.
Any changes made to the original specs during the course of the project will be a painful and time consuming process to translate again to RTL.
[Figure: Original concept → system/algorithmic verification (floating-point) → either (a) directly, or (b) via system/algorithmic verification (fixed-point) → handcraft Verilog/VHDL RTL (fixed-point) → to standard RTL-based simulation and synthesis]
Direct RTL Generation
Some system/algorithmic level design environments offer direct VHDL code generation.
An example of this type of environment is offered by AccelChip Inc., whose environment can accept floating-point MATLAB M-files, output their fixed-point equivalents for verification, and then use these new M-files to auto-generate RTL.
[Figure: starting from the original concept, two flows lead to standard RTL-based simulation and synthesis. (a) Within one system/algorithmic environment: system/algorithmic verification (floating-point) → system/algorithmic verification (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point). (b) System/algorithmic verification (floating-point) in the system/algorithmic environment, then in a third-party environment: auto-interactive quantization (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point).]
Transposed FIR with Multiplier Block
DSP Processors vs. FPGAs
High-Speed DSP Processor
• 1-8 multipliers
• Needs looping for more than 8 multiplications
• Needs multiple clock cycles because of serial computation
  – A 200-tap FIR filter would need 25+ clock cycles per sample with an 8-MAC-unit processor
High Level of Parallel Processing in FPGA
[Figure: array of MAC units]
• Can implement hundreds of MAC functions in an FPGA
• Parallel implementation allows for faster throughput
  – A 200-tap FIR filter would need 1 clock cycle per sample
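The cycle counts quoted above follow from dividing the number of taps by the number of MAC units and rounding up. A small sketch of that arithmetic; the function names and the 100 MHz clock in the usage lines are assumptions for illustration, and loop or memory overhead (the "+" in "25+") is ignored.

```python
def cycles_per_sample(taps, mac_units):
    """Minimum clock cycles to compute one FIR output (ceiling division)."""
    return -(-taps // mac_units)

def max_sample_rate_hz(clock_hz, taps, mac_units):
    """Upper bound on the input sample rate for a given clock frequency."""
    return clock_hz / cycles_per_sample(taps, mac_units)

dsp_cycles = cycles_per_sample(200, 8)       # 8-MAC DSP processor
fpga_cycles = cycles_per_sample(200, 200)    # one MAC per tap in an FPGA
serial_cycles = cycles_per_sample(256, 1)    # single time-shared MAC unit

# With an assumed 100 MHz clock:
dsp_rate = max_sample_rate_hz(100e6, 200, 8)      # 4 MHz
fpga_rate = max_sample_rate_hz(100e6, 200, 200)   # 100 MHz
```

At the same clock frequency, the fully parallel implementation sustains a sample rate 25 times higher than the 8-MAC processor.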
Multiply Accumulate Single Engine
• Sequential processing limits data throughput: time-shared MAC unit
• Data width is fixed
• High clock frequency creates a difficult system challenge
256-Tap FIR Filter
• 256 multiply and accumulate (MAC) operations per data sample
• One output every 256 clock cycles
[Figure: Data In → Reg → MAC unit (loop algorithm 256 times) → Data Out]
Filters: Applications
Impulse Response
The Impulse Response of an FIR filter is obtained from the output of a filter when a single unit impulse is input:
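For an FIR filter, feeding in a single unit impulse reproduces the coefficients in order, after which the output returns to zero (hence "finite"). A minimal check; the function name `fir` and the 3-tap coefficients are assumptions.

```python
def fir(coeffs, samples):
    """y[n] = sum over k of coeffs[k] * x[n-k], with x = 0 before the start."""
    return [sum(c * samples[n - k]
                for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]

coeffs = [3, 2, 1]                     # hypothetical filter coefficients
impulse = [1, 0, 0, 0, 0, 0]           # single unit impulse
response = fir(coeffs, impulse)
```

Here `response` is `[3, 2, 1, 0, 0, 0]`: the impulse response has exactly as many non-zero terms as the filter has taps.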
Solution: Building a MAC with System Generator
MAC using Slice-Based Multiplier
• Slice count: 70 slices
• Performance: ~130 MHz (2v1000 -4)
MAC using Embedded Multiplier
• Slice count: 22 slices, 1 embedded multiplier
• Performance: ~126 MHz (2v1000 -4)
[Figure: MAC block diagram with inputs a and b, a multiplier, an adder, and accumulated output c]
FIR: Cont … VHDL Implementation
For convenience the selected coefficients are powers of 2.
To operate, the filter must have eight register stages, each of which is eight bits wide.
Therefore, for the register or memory portion of the design, 64 flip flops are required.
At each clock cycle, each coefficient is multiplied by the eight-bit value in the appropriate register.
Due to the selection of "powers of two" coefficients, multiplication is achieved by a simple shifting operation.
The coefficient values may be stored as constants.
The coefficients used in the example are given below:
a0 = 2^-3, a1 = 2^-2, a2 = 2^-1, a3 = 1, a4 = 1, a5 = 2^-1, a6 = 2^-2, a7 = 2^-3
VHDL Description of FIR Filter
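library ieee;
use ieee.std_logic_1164.all;

entity FIR1 is
  port (clk : in  std_logic;
        x   : in  integer range 0 to 255;
        -- worst case output: 2 * (255/8 + 255/4 + 255/2 + 255) = 952,
        -- so the output needs 10 bits (a range of 0 to 511 would overflow)
        y   : out integer range 0 to 1023);
end entity FIR1;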
VHDL Description of FIR Filter
architecture arch1 of FIR1 is
begin
  process (clk)
    -- eight register stages, each eight bits wide
    type RegType is array (7 downto 0) of integer range 0 to 255;
    variable Reg : RegType := (others => 0);
  begin
    if (clk'event and clk = '1') then
      -- multiply/accumulate (MAC) operation;
      -- division by a power of two is implemented as a shift
      y <= Reg(0)/8 + Reg(1)/4 + Reg(2)/2 + Reg(3) +
           Reg(4) + Reg(5)/2 + Reg(6)/4 + Reg(7)/8;
      -- update register values by shifting
      Reg(0) := Reg(1);
      Reg(1) := Reg(2);
      Reg(2) := Reg(3);
      Reg(3) := Reg(4);
      Reg(4) := Reg(5);
      Reg(5) := Reg(6);
      Reg(6) := Reg(7);
      Reg(7) := x;
    end if;
  end process;
end architecture arch1;
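A software reference model of the process above can help check the arithmetic before synthesis. The sketch below mirrors the VHDL statement order (compute y from the current registers, then shift in x); the names `SHIFTS`, `fir1_step`, and `simulate` are hypothetical, and this is a behavioral model, not a replacement for HDL simulation.

```python
# Shift amounts for a0..a7 = 2^-3, 2^-2, 2^-1, 1, 1, 2^-1, 2^-2, 2^-3
SHIFTS = [3, 2, 1, 0, 0, 1, 2, 3]

def fir1_step(reg, x):
    """One clock edge: output from the current registers, then shift in x."""
    y = sum(r >> s for r, s in zip(reg, SHIFTS))  # Reg(k) / 2^s as a right shift
    new_reg = reg[1:] + [x]                       # Reg(0) := Reg(1) ... Reg(7) := x
    return y, new_reg

def simulate(samples):
    reg = [0] * 8                                 # registers reset to zero
    outputs = []
    for x in samples:
        y, reg = fir1_step(reg, x)
        outputs.append(y)
    return outputs

impulse_out = simulate([8, 0, 0, 0, 0, 0, 0, 0, 0])
peak = simulate([255] * 9)[-1]                    # all registers hold 255
```

An impulse of height 8 produces 0, 1, 2, 4, 8, 8, 4, 2, 1, which traces out the symmetric coefficients one clock after the input arrives. With every register at 255 the output reaches 952, so the output port needs a 10-bit range to hold the worst case.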