RCS for DSP

ENG6530 Reconfigurable Computing Systems

Digital Signal Processing using FPGAs

ENG6530 RCS 1

Topics

• Digital Signal Processing (DSP):
  – Definition, advantages and disadvantages
  – Applications, …
• DSP vs. GPP vs. ASIC vs. FPGA
• Why use reconfigurable computing
• Xilinx System Generator

References

I. http://www.xilinx.com
II. "Reconfigurable Computing for DSP: A Survey", by R. Tessier and W. Burleson, 2001
III. "Optimization Techniques for Efficient Implementation of DSP in FPGAs", by J. Wang
IV. "Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation", Chapter 24: Distributed Arithmetic

Introduction

• The term Digital Signal Processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form.
• Applications include:
  i. Telecommunication (switches, …)
  ii. Medical (images, equipment, ...)
  iii. Military (radar, missiles, ...)
  iv. Consumer (cell phones, TVs, ...)

DSP Flow

   

The data to be processed starts out as a signal in the real (analog) world.

This analog signal is then sampled by means of an analog to digital converter.

These samples are then processed in the digital domain.

The digital samples are subsequently converted into an analog equivalent by means of a digital to analog converter.
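The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the 1 kHz sine, the 8 kHz sample rate, the 8-bit word length, and the function names are all assumptions, and the converters are idealized.

```python
import math

def sample(signal, fs, n):
    """Idealized sampling: evaluate the analog signal at n instants spaced 1/fs apart."""
    return [signal(k / fs) for k in range(n)]

def quantize(x, bits=8, full_scale=1.0):
    """Idealized A/D: map a value in [-full_scale, full_scale] to a signed integer code."""
    levels = 2 ** (bits - 1)
    code = int(round(x / full_scale * levels))
    return max(-levels, min(levels - 1, code))  # clamp to the available codes

def process(codes):
    """Digital-domain processing: attenuate by 2 using an arithmetic right shift."""
    return [c >> 1 for c in codes]

def reconstruct(code, bits=8, full_scale=1.0):
    """Idealized D/A: map an integer code back to an analog level."""
    return code / 2 ** (bits - 1) * full_scale

# Hypothetical input: a 1 kHz sine sampled at 8 kHz.
analog = lambda t: math.sin(2 * math.pi * 1000 * t)
codes = [quantize(v) for v in sample(analog, 8000, 8)]
processed = process(codes)
output = [reconstruct(c) for c in processed]
```

Everything between `quantize` and `reconstruct` operates on integers only, which is the part of the flow that a DSP processor or an FPGA implements.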

[Figure: Analog input signal → A/D → digital input samples → DSP → modified output samples → D/A → analog output signal. The A/D input and D/A output lie in the analog domain; everything between the converters lies in the digital domain.]

DSP Flow

[Figure: ADC (sampling + quantization) → 1010.. → digital system → 1001.. → DAC. Digital system topics: signal analysis, system analysis, filter design. DSP architecture topics: fixed-point arithmetic, architecture types, selection criteria.]

Transition from Analog to Digital

The transition from analog to digital techniques has been driven by the many advantages of DSP:

• The main advantage of digital signals over analog signals is that the precise signal level of the former is not vital (immune to imperfections).
• Digital signals can be saved in memory and then recalled.
• Digital signals can convey information with greater noise immunity.
• Digital signals can be processed by digital circuit components, which are cheap and easily produced.
• Digital signals can be encrypted so that only the intended receiver can decode them.
• Precision is flexible: word lengths and/or number representation (e.g., fixed point vs. floating point) can be changed.
• A single processing element can process multiple incoming signals through multiplexing.
• Signals can be transmitted over long distances and at higher rates.
• Digital approaches can easily adjust their processing parameters, as in adaptive filtering.

Transition from Analog to Digital

• The main disadvantages of DSP:
  i. Increased system complexity: DSP requires that signals be converted between analog and digital forms using a sample-and-hold circuit, analog-to-digital converters (ADCs), digital-to-analog converters (DACs), and analog filtering.
  ii. Power consumption: DSP tends to require more power since a dedicated processor is used.
  iii. Frequency range limitation: analog hardware will naturally be able to work with higher-frequency signals than is possible with DSP hardware, due to the limitations of performing analog-to-digital conversion.
• For many applications, the advantages of DSP far outweigh these disadvantages.

DSP: Common Operations

Some of the most common operations performed on signals using digital or analog techniques include:

• Elementary time-domain operations: amplification, attenuation, integration, differentiation, addition of signals, multiplication of signals, etc.
• Filtering (FIR, IIR)
• Transforms (FFT, IFFT)
• Convolution (integral of the product of two functions)
• Error correction (transmission)
• Compression and decompression (audio, video)
• Modulation and demodulation (BPSK, QAM, FSK, ASK, …)
• Multiplexing and de-multiplexing
• Signal generation

DSP Applications

• Audio applications: MPEG audio, portable audio
• Photography: digital cameras, CAM
• Wireless applications: WiFi, WiMAX, Bluetooth
• Networking: switches, classifiers
• Medical equipment: hearing aids, heart pacers
• Cable modems: ADSL, VDSL
• Cellular phones: base stations, GSM, LTE
• Military applications: radar

    

Main DSP Operations

• DSP is the arithmetic processing of digital signals sampled at regular intervals.
• DSP can be reduced to three basic operations:
  – Delay
  – Add
  – Multiply
• Accumulate = Add + Delay
• MAC = Multiply + Accumulate
• The MAC is the engine behind DSP
  – More MACs = higher performance, better signal quality
  – MACs vs. MIPS, not always equal

[Figure: filter examples using 3 MACs, 50* MACs, and 100 MACs]
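As a rough illustration of how the three operations combine, the sketch below models an accumulator as an adder plus a one-sample delay (a register) and builds a MAC on top of it. The class and function names are hypothetical and chosen only for this example.

```python
class Accumulator:
    """Accumulate = Add + Delay: a register holds the previous sum."""

    def __init__(self):
        self.register = 0  # the delay element

    def step(self, value):
        self.register = self.register + value  # the adder
        return self.register


def mac_sum(coeffs, samples):
    """Run one MAC (multiply, then accumulate) per coefficient/sample pair."""
    acc = Accumulator()
    result = 0
    for c, x in zip(coeffs, samples):
        result = acc.step(c * x)  # multiply, then accumulate
    return result


total = mac_sum([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```

A filter with N coefficients needs N such MAC steps per output sample, which is why the MAC count is a useful measure of DSP workload.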

Alternative DSP Implementations

• DSP tasks can be implemented in a number of different ways:
  i. A general purpose processor (GPP): The processor can perform DSP by running an appropriate DSP algorithm.
  ii. A digital signal processor (PDSP): This is a specialized form of microprocessor chip that has been designed to perform DSP tasks much faster and more efficiently than a GPP.
  iii. Dedicated ASIC hardware: Custom hardware implementation that executes the DSP task.
  iv. Dedicated FPGA hardware: Similar to ASIC except that it offers:
    – Flexibility in terms of reconfiguration.
    – Embedded microprocessor cores on the FPGA.

The Performance Gap

 

Algorithmic complexity increases as application demands increase.

In order to process these new algorithms, higher-performance signal processing engines are required.

Traditional DSP Approaches

• Digital Signal Processor IC
  – Software programmable, like a microprocessor
  – Single MAC unit
  – All processing done sequentially
  – Fit the algorithm to the architecture
• ASIC (gate array)
  – Fit the architecture to the algorithm
  – Significantly higher performance than DSP processor
  – High cost and high risk to develop
  – Usually only for high-volume applications

[Figure: Analog input → ADC → 'traditional' DSP processor (MAC, memory, data controller) → DAC → analog output, with a separate digital output]

The Promise of Programmable Logic

ASIC
• Pros
  – High performance
  – High density
  – One chip solution
• Cons
  – High design risk
  – Long design cycle

DSP Processor
• Pros
  – High flexibility
  – Good adaptability
  – Short design cycle
  – Low design risk
• Cons
  – Performance
  – Hardware complexity

FPGA: best from both worlds, plus:
  – Efficient IC architecture
  – System features
  – Automatic migration to low-cost HardWire

Why FPGAs?

• The most commonly used DSP functions are:
  – FIR (Finite Impulse Response) filters,
  – IIR (Infinite Impulse Response) filters,
  – FFT (Fast Fourier Transform),
  – DCT (Discrete Cosine Transform),
  – Encoder/Decoder and Error Correction/Detection functions.
• All of these blocks perform intensive arithmetic operations (data-path-intensive operations) such as:
  – add, subtract,
  – multiply, multiply-add, or
  – multiply-accumulate.

Why Use FPGAs in DSP Applications?

• 10x more DSP throughput than DSP processors
  – Parallel vs. serial architecture
• Cost-effective for multi-channel applications
• Flexible hardware implementation
• Single-chip solution
  – System (hardware/software) integration benefits

[Figure: a DSP system built from software on a DSP plus an FPGA, compared with software on an embedded processor inside an FPGA]

DSP-related embedded FPGA resources

• Many FPGAs incorporate dedicated multiplier blocks (Virtex-5/6/7).
• Similarly, some FPGAs offer dedicated adder blocks.
• One operation that is very common in DSP-type applications is the multiply-and-accumulate (MAC).
• To make it easier to implement DSP on FPGAs, some devices provide an entire MAC as an embedded function (Virtex-4).

[Figure: MAC built from a multiplier, an adder, and an accumulator; inputs A[n:0] and B[n:0], output Y[(2n - 1):0]]

DSP Functions are Parallel in Nature

• 8-Bit, 16-Tap Finite Impulse Response (FIR) Filter
• Equation:

$$Y_j = \sum_{k=0}^{n-1} c_k \, x_{j-k}$$

[Figure: Data Input X[7:0] feeds a chain of 16 registers (filter taps 0 to 15). The coefficients are symmetrical, so the taps are paired (0 with 15, 1 with 14, ..., 7 with 8), each pair is multiplied by one of the filter coefficients C0 to C7, and the values are accumulated into Data Output Y[9:0].]
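The benefit of symmetrical coefficients can be checked with a short sketch: adding each pair of taps before multiplying halves the number of multiplications (8 instead of 16) without changing the result. The coefficient and sample values below are made up for illustration, and the function names are hypothetical.

```python
def fir_direct(coeffs, window):
    """One output sample: one multiplication per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_symmetric(half_coeffs, window):
    """One output sample for symmetric coefficients: add paired taps, then multiply."""
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must hold twice as many taps as half_coeffs")
    return sum(c * (window[k] + window[n - 1 - k])
               for k, c in enumerate(half_coeffs))


half = [1, 2, 3, 4, 5, 6, 7, 8]   # C0..C7 (illustrative values)
full = half + half[::-1]          # 16 symmetric coefficients
window = list(range(16))          # contents of the 16 tap registers

y_direct = fir_direct(full, window)
y_symmetric = fir_symmetric(half, window)
```

Both calls produce the same output sample; only the number of multipliers differs.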

DSP and FPGA

The FPGA's parallel approach to DSP enables higher computational throughput. Consider a 256-tap FIR filter:

• Conventional DSP processor: serial implementation
• FPGA: fully parallel implementation

Multiply Accumulate Multiple Engines

• Parallel processing maximizes data throughput
  – Support any level of parallelism
  – Optimal performance/cost tradeoff
• 256 Tap FIR Filter
  – 256 multiply and accumulate (MAC) operations per data sample
  – One output every clock cycle
• Flexible architecture
  – Distributed DSP resources (LUT, registers, multipliers, & memory)

[Figure: Data In passes through registers Reg0 to Reg255 with coefficients C0 to C255; all 256 MAC operations are performed in 1 clock cycle to produce Data Out.]

FPGAs Outperform 'Traditional' DSP Processors

[Figure: bar chart, "8-Bit, 16-Tap FIR Filter Performance Comparisons (External Performance)", vertical axis 0 to 25. Devices and sample rates: 133 MHz Pentium™ processor (750 KHz); single 50 MHz DSP (3 MHz); XC4003E-3 FPGA, 68% util. (8 MHz); MCM with four 50 MHz DSPs (12 MHz); XC4010E-3 FPGA, 98% util. (56 MHz); XC4013E-2 FPGA, 75% util. (66 MHz). The FPGA bars are labeled Serial Distributed Arithmetic (SDA) and Parallel Distributed Arithmetic (PDA). Bar values in increasing order: 0.24, 1.00, 2.60, 4.00, 16.00, 22.00 (est.).]

Case Study: Viterbi Decoder

(FPGA-based DSP Co-Processor)

[Figure: Viterbi decoder datapath built from adders, an incrementer, registers, and multiplexers, with inputs Old_1 and Old_2, outputs New_1 and New_2 selected by the MSBs of Diff_1 and Diff_2, optional pipelining registers, a prestate buffer, and 24-bit I/O buses.]

2.67 times better performance with FPGA-assisted DSP:

• DSP-only (8 devices): 360 ns
  – Two 66 MHz DSPs
  – Six 15 ns SRAMs
  – System logic
• DSP + FPGA (4 devices): 135 ns
  – One 66 MHz DSP
  – XC4013E-3 FPGA (44%)
  – Three 15 ns SRAMs

What to Look for in Your DSP Application

• Identify parallel data paths
• Find operations that require multiple clock cycles

[Figure: YES/NO checklist under the headings "Processor Bottlenecks" and "Flexibility"; rows: parallel data paths, scalable bandwidth, design modification, device expansion.]

When to Use FPGAs for DSP

[Figure: chart with curves for 1, 2, 3, and 4 DSPs ("Number of DSPs") plotted against "Arithmetic Operations Per Sample" (1 to 48), vertical axis 0 to 50, divided into an FPGA region and a DSP region.]

• High sample rates
  – Up to 500 MHz with Virtex 5/6/7
• Low sample rates
  – Integrate DSP + system logic in a low-cost DSP using serial sequential algorithm
• Short word lengths
  – DA algorithm gets faster with shorter word length
• Lots of filter taps
  – FPGA processes all taps in parallel, faster than DSP
• Fast correlators
• Single-chip solution required
• HardWire gate array migration path for high-volume designs

Co-processing with an FPGA

FPGA co-processors are an extremely cost-effective means of off-loading computationally intensive algorithms from a DSP processor.

[Figure: FPGA coprocessor for WiMAX baseband processing]

[Figure: FPGA coprocessor for high-definition H.264 encoding]

Digital Filters

• Digital filters are one of the main elements of DSP and are implemented using only MAC operations.
• A digital filter performs a filtering function on data by attenuating or reducing bands of frequencies. Examples:
  – Remove high-frequency noise from a speech signal
  – Remove low-frequency noise from some sensors
  – Emphasize a particular frequency in a music signal
  – Remove 50 Hz mains hum from an ECG signal

Low Pass Digital Filter

• An example of the operation of a low pass filter is:

[Figure: low pass filter example]

• The weights W_0 to W_{N-1} must be appropriately chosen
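One simple (and by no means the only) choice of weights is to set every W_k to 1/N, which gives a moving average. The sketch below, with hypothetical names and made-up test signals, shows that this choice passes a constant (low-frequency) input and removes an input that alternates every sample (the highest frequency).

```python
def weighted_filter(weights, samples):
    """Apply weights W_0..W_{N-1} to the N most recent samples (newest first)."""
    history = [0.0] * len(weights)
    out = []
    for x in samples:
        history = [x] + history[:-1]  # shift in the newest sample
        out.append(sum(w * h for w, h in zip(weights, history)))
    return out


N = 4
weights = [1.0 / N] * N                # moving average: W_k = 1/N

slow = [1.0] * 8                       # constant input
fast = [1.0, -1.0] * 4                 # alternates every sample

slow_out = weighted_filter(weights, slow)
fast_out = weighted_filter(weights, fast)
```

Once the window has filled (after N samples), `slow_out` settles at 1.0 while `fast_out` settles at 0.0.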

Digital Filters: Types

• Finite Impulse Response (FIR):
  – Non-recursive linear filter (i.e. no feedback present).
• Infinite Impulse Response (IIR):
  – Recursive linear filter (i.e. with feedback).
• Adaptive Digital Filter (ADF):
  – A self-learning filter that adapts itself to a desired signal.
• Non-Linear Filters:
  – A filter that can perform non-linear operations
  – e.g. median filter
  – min/max filters
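To make the non-linear case concrete, here is a minimal sketch of a median filter. The window length of 3 and the shortened windows at the edges are assumptions made for this example only.

```python
import statistics


def median_filter(samples, window=3):
    """Replace each sample by the median of its neighbourhood.

    Near the edges the window is shortened rather than padded (an assumption).
    """
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(statistics.median(samples[lo:hi]))
    return out


spiky = [1, 1, 9, 1, 1]          # a single impulse-noise spike
cleaned = median_filter(spiky)
```

The spike is removed completely, which a linear (FIR or IIR) filter could only smear across neighbouring samples.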

FIR Filters

• A Finite Impulse Response (FIR) filter performs a weighted average (convolution) on a window of N data samples:
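In standard notation, and assuming weights $w_0, \dots, w_{N-1}$, input samples $x[n]$, and output samples $y[n]$ (symbols chosen here for illustration), the weighted average can be written as:

$$y[n] = \sum_{k=0}^{N-1} w_k \, x[n-k]$$

Each output sample therefore costs N multiply-accumulate operations.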

FIR FILTERS

[Figure: finite-impulse response filter built from a chain of registers, multipliers with coefficients up to C_{N-1} and C_N, and adders]

Frequency Response

• The frequency/phase response of a digital filter is found by taking the Discrete Fourier Transform (DFT) of the impulse response.
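A small sketch of this calculation follows, using a direct (slow) DFT so that it needs only the standard library. The 4-tap moving-average impulse response and the 8-point transform length are assumptions for illustration.

```python
import cmath
import math


def dft(x, n_points):
    """Direct DFT of x, zero-padded to n_points."""
    padded = list(x) + [0.0] * (n_points - len(x))
    return [
        sum(padded[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def frequency_response(impulse_response, n_points=8):
    """Return (magnitudes, phases) at n_points frequencies from 0 up to the sample rate."""
    H = dft(impulse_response, n_points)
    return [abs(h) for h in H], [cmath.phase(h) for h in H]


h = [0.25, 0.25, 0.25, 0.25]        # impulse response of a 4-tap moving average
mag, phase = frequency_response(h)
```

Bin 0 is DC and bin 4 (of 8) is half the sample rate; for this filter the magnitude is 1 at DC and 0 at half the sample rate, the expected low-pass shape.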

FPGA Implementations

1. Hardware Description Language:
  – VHDL
  – Verilog
2. Electronic System Level:
  – Handel-C
  – Vivado HLS (Lab #7)
  – Impulse-C
3. Core Generator (IP Selection)
4. System Generator (Lab #6)
  – Matlab, Simulink, System Generator

FIR FILTER: VHDL Implementation

• Simple VHDL design example of an 8-tap FIR filter.

Hardware Description Languages

• Full VHDL/Verilog (RTL code)
  – Advantages:
    • Portability and efficient implementation
    • Complete control of the design implementation and tradeoffs
    • Easier to debug and understand code that you own
  – Disadvantages:
    • Can be time consuming
    • Don't always have control over the synthesis tool
    • Need to be familiar with the algorithm and how to write it

Abstraction: Advantages

CORE Generator

[Figure: design flow HDL + COREGen → synthesis → implementation → download, with behavioral simulation, functional simulation, timing simulation, and in-circuit verification alongside the flow. Optimized IP is instantiated within the HDL code.]

Xilinx CORE Generator

[Figure: CORE Generator window showing the list of available IP; cores are fully parameterizable]

Xilinx IP Solutions

IP CENTER: http://www.xilinx.com/ipcenter

Key: $ - License Fee, P - Parameterized, S - Project License Available, BOLD – Available in the Xilinx Blockset for the System Generator for DSP

DSP Functions
  $ P  Reed Solomon
  $    3GPP Turbo Code
  $ P  Viterbi Decoder
  $ P  Convolution Encoder
  $ P  Interleaver/De-interleaver
  P    LFSR
  P    1D DCT
  P    DA FIR
  P    MAC
  P    MAC-based FIR filter
       Fixed FFTs 16, 64, 256, 1024 points
  P    FFT - 32 Point
  P    Sine Cosine
  P    Direct Digital Synthesizer
  P    Cascaded Integrator Comb
  P    Bit Correlator
  P    Digital Down Converter

Math Functions
  P    Multiplier Generator
       - Parallel Multiplier
       - Dyn Constant Coefficient Mult
       - Serial Sequential Multiplier
       - Multiplier Enhancements
  P    Divider
  P    CORDIC

Base Functions
  P    Binary Decoder
  P    Two's Complement
  P    Shift Register RAM/FF
  P    Gate modules
  P    Multiplexer functions
  P    Registers, FF & latch based
  P    Adder/Subtractor
  P    Accumulator
  P    Comparator
  P    Binary Counter

Memory Functions
  P    Asynchronous FIFO
  P    Block Memory modules
  P    Distributed Memory
  P    Distributed Mem Enhance
  P    Sync FIFO (SRL16)
  P    Sync FIFO (Block RAM)
  P    CAM (SRL16)

PCI
  $ P  PCI 64/66
  $ PS PCI 32/33
  $ P  PCI-X 64/66

Networking
       8B/10B Encoder/Decoder
  $    POS-PHY L3
  $    POS-PHY L4
  $    Flexbus 4
  $    RapidIO PHY Layer
  $ S  HDLC 1 and 32 channel
  $ S  G.711 PCM Cores
  $ S  ADPCM 32 & 64 channel

Core Generator: Summary

• CORE Generator
  – Advantages
    • Can quickly access and generate existing functions
    • No need to reinvent the wheel and re-design a block if it meets specifications
    • IP is optimized for the specified architecture
  – Disadvantages
    • IP doesn't always do exactly what you are looking for
    • Need to understand signals and parameters and match them to your specification
    • Dealing with a black box, with little information on how the function is implemented

Xilinx System Generator for DSP

• Industry's first system-level design environment (IDE) for FPGAs
• Simulink library of arithmetic, logic operators and DSP functions (Xilinx blockset)
• Arithmetic abstraction
• VHDL code generation for most Spartan-based FPGAs and Virtex 4/5/6/7 FPGAs
• Enables hardware-in-the-loop co-simulation

MATLAB

• MATLAB™, the most popular system design tool, is a programming language, interpreter, and modeling environment
  – Extensive libraries for math functions, signal processing, DSP, communications, and much more
  – Visualization: large array of functions to plot and visualize your data and system/design
  – Open architecture: software model based on base system and domain-specific plug-ins

System Level Evaluation

 

Irrespective of the final implementation technology (GPP, DSP, ASIC, FPGA), if one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment.

The de facto industry standard for DSP algorithmic verification is MATLAB.

[Figure: Original concept → algorithmic verification → auto C/C++ generation, handcrafted C/C++, or handcrafted assembly → compile / assemble → machine code]

System/Algorithmic level to RTL

• Many DSP design teams commence by performing their system-level evaluation and algorithmic validation in MATLAB using floating-point representation.
• Alternatively, they may first convert the floating-point representations into their fixed-point counterparts at the system level.
• At this point, many design teams move directly to hand-coding fixed-point RTL equivalents of the design in VHDL.

[Figure: two flows from original concept to standard RTL-based simulation and synthesis. (a) System/algorithmic verification (floating-point) → handcraft Verilog/VHDL RTL (fixed-point). (b) System/algorithmic verification (floating-point) → system/algorithmic verification (fixed-point) → handcraft Verilog/VHDL RTL (fixed-point).]

Simulink

• Simulink™: visual data flow environment for modeling and simulation of dynamical systems
  – Fully integrated with the MATLAB engine
  – Graphical block editor
  – Event-driven simulator
  – Models parallelism
  – Extensive library of parameterizable functions
    • Simulink Blockset - math, sinks, sources
    • DSP Blockset - filters, transforms, etc.
    • Communications Blockset - modulation, DPCM, etc.

Traditional Simulink FPGA Flow

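[Figure: the system architect performs system verification in Simulink; the FPGA designer works through HDL → synthesis → implementation → download, with functional simulation, timing simulation, and in-circuit verification. A GAP separates the two sides, and equivalence between them must be verified.]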

System Generator

[Figure: MATLAB/Simulink with System Generator performs system verification and produces HDL (VHDL, IP, testbench, constraints file) → synthesis → implementation → download, with functional simulation, timing simulation, and in-circuit verification.]

Creating a System Generator Design

• Xilinx Blockset is listed in the Simulink Library Browser
• Create a design by dragging and dropping components from the Xilinx Blockset onto your new sheet

Finding Blocks

• Use the Find feature to search ALL Simulink libraries
• Xilinx blockset has nine major sections
  – Basic elements
    • Counters, delays
  – Communication
    • Error correction blocks
  – Control Logic
    • MCode, Black Box
  – Data Types
    • Convert, Slice
  – DSP
    • FDATool, FFT, FIR
  – Index
    • All Xilinx blocks – quick way to view all blocks
  – Math
    • Multiply, accumulate, inverter
  – Memory
    • Dual Port RAM, Single Port RAM
  – Tools
    • ModelSim, Resource Estimator

Configure Your Blocks

• Double-click or go to Block Parameters to view a block's configurable parameters
  – Arithmetic Type: unsigned or twos complement
  – Implement with Xilinx Smart-IP Core (if possible) / Generate Core
  – Latency: specify the delay through the block
  – Overflow and Quantization: users can saturate or wrap on overflow, and truncate or round on quantization
  – Override with Doubles: simulation only
  – Precision: full, or the user can define the number of bits and where the binary point is for the block
  – Sample Period: can be inherited with a "-1" or must be an integer value
• Note: While all parameters can be simulated, not all are realizable

Values Can Be Equations

• You can also enter equations in the block parameters, which can aid calculation and your own understanding of the model parameters
• The equations are calculated at the beginning of a simulation
• Useful MATLAB operators
  – + add
  – - subtract
  – * multiply
  – / divide
  – ^ power
  – pi (3.1415926535897…)
  – exp(x) exponential (e^x)

Important Concept 1: The Numbers Game

• Simulink uses a "double" to represent numbers in a simulation. A double is a 64-bit IEEE 754 floating-point number.
  – Because the binary point can move, a double covers magnitudes up to about 1.8 x 10^308 with roughly 15 to 17 significant decimal digits: a wide, desirable range, but not efficient or realistic for FPGAs
• The Xilinx Blockset uses n-bit fixed-point numbers (twos complement optional)

Format = Sign_Width_Binary point position counted from the LSB

Format = Fix_16_13
(Sign: Fix = signed value, UFix = unsigned value)

[Figure: 16-bit word with bit weights -2^2, 2^1, 2^0 (integer part) and 2^-1 to 2^-13 (fraction part); Value = -2.261108…]

Design Hint: Always try to maximize the dynamic range of the design by using only the required number of bits.

Thus, a conversion is required when Xilinx blocks communicate with Simulink blocks (Xilinx blockset → MATLAB I/O → Gateway In/Out).
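The Fix_16_13 format can be modeled in a few lines: the stored value is a signed 16-bit integer, and the real value is that integer divided by 2^13. The helper names below are hypothetical, and rounding to nearest is an assumption made for this sketch; it is not a statement about what the Xilinx tools do by default.

```python
def to_fixed(value, width=16, frac_bits=13):
    """Signed integer stored for `value` in Fix_<width>_<frac_bits> (round to nearest)."""
    raw = round(value * (1 << frac_bits))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= raw <= hi:
        raise OverflowError(f"{value} does not fit in Fix_{width}_{frac_bits}")
    return raw


def from_fixed(raw, frac_bits=13):
    """Real value represented by the stored integer."""
    return raw / (1 << frac_bits)


def bit_pattern(raw, width=16):
    """Twos complement bit string of the stored integer."""
    return format(raw & ((1 << width) - 1), f"0{width}b")


raw = to_fixed(-2.261108)        # the value from the slide, to six decimals
value = from_fixed(raw)
bits = bit_pattern(raw)
```

With 3 integer bits (including the sign) and 13 fraction bits, Fix_16_13 covers -4 up to just under +4 in steps of 2^-13.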

What About All Those Other Bits?

• The Gateway In and Out blocks support parameters to control the conversion from double precision to N-bit fixed-point precision

[Figure: a DOUBLE, shown with bit weights from beyond -2^6 down to beyond 2^-13, is converted to FIX_12_9 (bit weights -2^2 down to 2^-9). Bits above the retained range are handled by the OVERFLOW setting (Wrap, Saturate, or Flag Error); bits below it are handled by the QUANTIZATION setting (Truncate or Round).]
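The two settings can be modeled as follows. This is only a behavioral sketch with hypothetical function names: truncation is modeled as rounding toward minus infinity (dropping low-order twos complement bits) and rounding as round-half-up, which are assumptions rather than a description of the exact Gateway In implementation.

```python
import math


def quantize(value, frac_bits, mode="truncate"):
    """Drop the bits below 2**-frac_bits, by truncation or by rounding."""
    scaled = value * (1 << frac_bits)
    if mode == "truncate":
        return math.floor(scaled)
    if mode == "round":
        return math.floor(scaled + 0.5)
    raise ValueError(f"unknown quantization mode: {mode}")


def handle_overflow(raw, width, mode="wrap"):
    """Deal with a stored integer that needs more than `width` bits."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if lo <= raw <= hi:
        return raw
    if mode == "saturate":
        return max(lo, min(hi, raw))
    if mode == "wrap":
        return (raw - lo) % (1 << width) + lo
    if mode == "error":
        raise OverflowError("value out of range")
    raise ValueError(f"unknown overflow mode: {mode}")


def gateway_in(value, width=12, frac_bits=9,
               quantization="truncate", overflow="wrap"):
    """Model of a double -> Fix_<width>_<frac_bits> conversion."""
    raw = handle_overflow(quantize(value, frac_bits, quantization), width, overflow)
    return raw / (1 << frac_bits)


truncated = gateway_in(-2.2611083984375)               # Fix_16_13 value to Fix_12_9
rounded = gateway_in(1.001, quantization="round")
wrapped = gateway_in(5.0, overflow="wrap")
saturated = gateway_in(5.0, overflow="saturate")
```

Wrapping 5.0 in FIX_12_9 gives -3.0 (the sign flips), while saturation gives the largest representable value, which is why saturation is usually the safer choice for signal paths at the cost of extra logic.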

Creating a System Generator Design

[Figure: Simulink sources → I/O (gateway) blocks → SysGen blocks realizable in hardware → I/O (gateway) blocks → Simulink sinks & library functions. The I/O blocks are used as the interface between the Xilinx blockset and other Simulink blocks.]

Using the Scope

• Click properties to change the number of axes displayed and the time range value (X axis)
• Use Data history to control how many values are stored and displayed on the scope
• Click autoscale to quickly let the tools configure the display to the correct axis values
• Right click on the Y-axis to set its value

Design & Simulate in Simulink

Simulate the design by pushing “play.” Go to “Simulation Parameters” under the “Simulation” menu to control the length of simulations

Resource Estimator

• The block provides fast estimates of the FPGA resources required to implement the subsystem
• Most of the blocks in the System Generator Blockset carry resource information
  – LUTs
  – FFs
  – BRAM
  – Embedded multipliers
  – 3-state buffers
  – I/Os

Resource Estimator

• Three types of estimation
  – Estimate Area
    • This option computes resources for the current level and all sub-levels
  – Quick Sum
    • Uses the resources stored in the blocks directly and sums them up (no sub-level functions are invoked)
  – Post-Map Area
    • Opens a file browser and lets the user select a map report file. The design should have been generated and gone through the synthesis, translate, and mapping phases.

The Black Box

Use the Black Box when:

• You need a function that cannot be created with the Xilinx Blockset
• You already have a piece of VHDL you wish to use for a section of the design

The Black Box creates a placeholder in the generated VHDL. Use the Black Box parameters to control the VHDL placeholder's features.

Generate the VHDL Code

Once complete, double-click the System Generator token:

• Select the target device
• Select to generate the testbench
• Set the system clock period desired
• Generate the VHDL

Hardware-in-the-Loop Reduces Design Time & Cost

• Configure any development board for hardware-in-the-loop using the JTAG header in < 20 minutes
  – Automatically create FPGA bit-stream from Simulink
  – Transparent use of FPGA implementation tools
• Accelerate and verify the Simulink design using FPGA hardware
  – Mirrors traditional DSP processor design flows
  – Combine with black box to simulate HDL & EDIF

Create Bit-stream

Step 1: Select target H/W platform
Step 2: Generate bit-stream

Co-Simulate in Hardware

Step 3 (contd.): Post-generation script creates a new library containing a parameterized run-time co-simulation block.
Step 4: Copy the co-simulation run-time block into the original model.
Step 5: Simulate for verification

Hardware in the Loop Performance Results

Single Step Clock Mode (bit and cycle accurate)

Application                              Software Simulation   Hardware Simulation   Speed-up
                                         Time (seconds)        Time (seconds)
Image Filtering                          676                                         112X
QAM Demodulator + Extension              1203                  18                    67X
5 x 5 Image Filter                       170                   4                     43X
Cordic Arc Tangent                       187                   27                    7X
Additive White Gaussian Noise Channel    600                   80                    7.5X

Free Running Clock Mode

A free running clock is provided to the design, thus the hardware is no longer running in lockstep with the software. The test is started, and after some time a 'done' flag is set to read the results from the FPGA and display them in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of magnitude performance enhancement over original software simulation.

DSP System Generator: Summary

• System Generator for DSP
  – Advantages
    • Ability to simulate the design at a system level
    • High level of abstraction: very attractive for FPGA novices
    • Optimize for area, speed, or a combination
    • Estimate resources easily
    • Hardware co-simulation (FPGA in the loop)
    • Test-bench and golden data written automatically
  – Disadvantages
    • Cost of abstraction: doesn't always give the best result from an area usage point of view
    • Only as good as the IP support

FPGAs versus DSP

• FPGAs can outperform DSP processors on certain DSP tasks:
  – computation-intensive,
  – highly parallelizable tasks
• DSP processors have the advantage for:
  – development infrastructure,
  – time-to-market,
  – developer familiarity
• DSP processors are still easier to use
  – Many engineers possess DSP processor development skills
  – Ultimate speed is not always the first priority
• A combination of FPGA and DSP processor is an excellent solution if performance requirements cannot be met by the processor alone
• The "best" architecture depends on the requirements of the application

Problem with this flow?

• There is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representation in VHDL.
• Manual translation from one to another is time consuming and prone to error.
• Any changes made to the original specs during the course of the project will be painful and time consuming to translate again to RTL.

[Figure: two flows from original concept to standard RTL-based simulation and synthesis. (a) System/algorithmic verification (floating-point) → handcraft Verilog/VHDL RTL (fixed-point). (b) System/algorithmic verification (floating-point) → system/algorithmic verification (fixed-point) → handcraft Verilog/VHDL RTL (fixed-point).]

Direct RTL Generation

• Some system/algorithmic level design environments offer direct VHDL code generation.
• An example of this type of environment is offered by AccelChip Inc., whose environment can accept floating-point MATLAB M-files, output their fixed-point equivalents for verification, and then use these new M-files to auto-generate RTL.

[Figure: two flows from original concept to standard RTL-based simulation and synthesis. (a) Inside one system/algorithmic environment: system/algorithmic verification (floating-point) → system/algorithmic verification (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point). (b) System/algorithmic environment: system/algorithmic verification (floating-point); then a third-party environment: auto-interactive quantization (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point).]

Transposed FIR with Multiplier Block
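As a rough behavioral sketch of the transposed structure (assuming the usual textbook form, since the slide's diagram is not reproduced here): the current input sample is multiplied by every coefficient at once (the multiplier block), and the products are added into a chain of registers that carries partial sums toward the output. Function names and test values are made up for illustration.

```python
def transposed_fir(coeffs, samples):
    """Transposed-form FIR: one shared input, a register chain of partial sums."""
    n = len(coeffs)
    state = [0] * (n - 1)                     # partial-sum registers
    out = []
    for x in samples:
        products = [c * x for c in coeffs]    # the multiplier block
        y = products[0] + (state[0] if state else 0)
        # Each register takes its product plus the partial sum behind it.
        state = [
            products[k + 1] + (state[k + 1] if k + 1 < n - 1 else 0)
            for k in range(n - 1)
        ]
        out.append(y)
    return out


def direct_fir(coeffs, samples):
    """Direct-form reference: delay line followed by a sum of products."""
    taps = [0] * len(coeffs)
    out = []
    for x in samples:
        taps = [x] + taps[:-1]
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


coeffs = [3, 1, 4, 1]
samples = [1, 2, 3, 4, 5]
y_transposed = transposed_fir(coeffs, samples)
y_direct = direct_fir(coeffs, samples)
```

Both forms give identical outputs. Because every multiplier in the transposed form sees the same input sample, multiplications by constant coefficients can share intermediate terms in hardware.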


DSP Processors vs. FPGAs

High Speed DSP Processor

• 1-8 multipliers
• Needs looping for more than 8 multiplications
• Needs multiple clock cycles because of serial computation
  – 200 Tap FIR Filter would need 25+ clock cycles per sample with an 8 MAC unit processor

High Level of Parallel Processing in FPGA

• Can implement hundreds of MAC functions in an FPGA
• Parallel implementation allows for faster throughput
  – 200 Tap FIR Filter would need 1 clock cycle per sample

[Figure: a DSP processor with a few MAC units next to an FPGA filled with an array of MAC units]

Multiply Accumulate Single Engine

• Sequential processing limits data throughput:
  – Time-shared MAC unit
  – Data width is fixed!!
  – High clock frequency creates a difficult system challenge
• 256 Tap FIR Filter
  – 256 multiply and accumulate (MAC) operations per data sample
  – One output every 256 clock cycles

[Figure: Data In → Reg → MAC unit (loop algorithm 256 times) → Data Out]
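The cycle counts quoted on these slides follow from dividing the number of taps by the number of MAC units and rounding up. The sketch below reproduces them; the function names are hypothetical and loop, load, and store overheads are ignored (which is why the slide says "25+").

```python
import math


def cycles_per_sample(taps, mac_units):
    """Minimum clock cycles needed to compute one output sample."""
    return math.ceil(taps / mac_units)


def max_sample_rate(clock_hz, taps, mac_units):
    """Upper bound on the sample rate for a given clock frequency."""
    return clock_hz / cycles_per_sample(taps, mac_units)


dsp_200 = cycles_per_sample(200, 8)        # 8-MAC DSP processor
fpga_200 = cycles_per_sample(200, 200)     # fully parallel FPGA
serial_256 = cycles_per_sample(256, 1)     # single time-shared MAC
parallel_256 = cycles_per_sample(256, 256)
```

For example, with an assumed 100 MHz clock, `max_sample_rate(100e6, 256, 1)` gives about 390 kHz for the single-MAC engine, while the fully parallel version could accept a new sample on every clock.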

Filters: Applications


Impulse Response

• The Impulse Response of an FIR filter is obtained from the output of the filter when a single unit impulse is input:
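A quick way to see this is to feed a unit impulse through a small FIR model: the output is the list of coefficients followed by zeros, which is also why the response is "finite". The coefficient values and function names below are arbitrary choices for this sketch.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: delay line followed by a sum of products."""
    taps = [0] * len(coeffs)
    out = []
    for x in samples:
        taps = [x] + taps[:-1]          # shift the newest sample in
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


def impulse_response(coeffs, length):
    """Filter output for a single unit impulse followed by zeros."""
    impulse = [1] + [0] * (length - 1)
    return fir_filter(coeffs, impulse)


h = impulse_response([3, 1, 4, 1], 6)
```

The first four outputs reproduce the coefficients in order, and every later output is zero.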

Solution: Building a MAC with System Generator

$$\sum_i a_i b_i$$

MAC using Slice-Based Multiplier

Slice count: 70 slices
Performance: ~130 MHz (2v1000 -4)

MAC using Embedded Multiplier

Slice count: 22 slices, 1 embedded multiplier
Performance: ~126 MHz (2v1000 -4)

FIR: Cont … VHDL Implementation

• For convenience the selected coefficients are powers of 2.
• To operate, the filter must have eight register stages, each of which is eight bits wide.
  – Therefore, for the register or memory portion of the design, 64 flip-flops are required.
• At each clock cycle, each coefficient is multiplied by the eight-bit value in the appropriate register.
  – Due to the selection of "powers of two" coefficients, multiplication is achieved by a simple shifting operation.
• The coefficient values may be stored as constants.
• The coefficients used in the example are given below:

  a_0 = 2^-3, a_1 = 2^-2, a_2 = 2^-1, a_3 = 1, a_4 = 1, a_5 = 2^-1, a_6 = 2^-2, a_7 = 2^-3
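The shift-based multiplication can be checked with a small software model of one output sample. This is a sketch only: the names are hypothetical, the registers are assumed to hold unsigned 8-bit values, and the fractional bits lost in each shift are simply discarded.

```python
# a_k = 2**-SHIFTS[k], matching the coefficients listed above
SHIFTS = [3, 2, 1, 0, 0, 1, 2, 3]


def fir_output(regs):
    """One output sample: each multiplication is a right shift."""
    return sum(r >> s for r, s in zip(regs, SHIFTS))


def fir_output_div(regs):
    """Same computation using integer division by a power of two."""
    return sum(r // (2 ** s) for r, s in zip(regs, SHIFTS))


worst_case = fir_output([255] * 8)         # all registers at their maximum
example = fir_output([8, 4, 2, 1, 1, 2, 4, 8])
```

The worst-case result shows how many output bits the filter needs: 952 does not fit in 9 bits (maximum 511), so a 10-bit output is required for the full input range.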

VHDL Description of FIR Filter

library ieee;
use ieee.std_logic_1164.all;

entity FIR1 is
  port (clk : in  std_logic;
        x   : in  integer range 0 to 255;
        -- worst case 31+63+127+255+255+127+63+31 = 952, so y needs 10 bits
        y   : out integer range 0 to 1023);
end entity FIR1;

VHDL Description of FIR Filter

architecture arch1 of FIR1 is
begin
  process (clk)
    type RegType is array (7 downto 0) of integer range 0 to 255;
    variable Reg : RegType := (others => 0);
  begin
    if rising_edge(clk) then
      -- multiply/accumulate (MAC) operation; each division by a
      -- power of two is a right shift
      y <= Reg(0)/8 + Reg(1)/4 + Reg(2)/2 + Reg(3) +
           Reg(4) + Reg(5)/2 + Reg(6)/4 + Reg(7)/8;
      -- update register values by shifting
      Reg(0) := Reg(1);
      Reg(1) := Reg(2);
      Reg(2) := Reg(3);
      Reg(3) := Reg(4);
      Reg(4) := Reg(5);
      Reg(5) := Reg(6);
      Reg(6) := Reg(7);
      Reg(7) := x;
    end if;
  end process;
end architecture arch1;