A High-Performance Radix-2 FFT
in ANSI C for RTL Generation
John Ardini
MAPLD 2005
Motivation
• Implementations of algorithms in ANSI C for
– Rapid prototyping
– Incorporation into reconfigurable platform with runtime partitioning or binding (same FFT mapped to HW or SW)
• Establish a method for software engineers to generate IP
Goals
• Show drastic reduction in IP development
time
• Beat DSP performance in throughput and
area while maintaining energy
consumption
• Allow production of coprocessor IP with
small learning curve: weeks not months
FFT Test Algorithm
• Well understood and studied
• Frequently used
• Standard DSP benchmark
• Standard software implementation available (Numerical Recipes in C)
• Radix-2 is standard in C and DSP implementations and is used for this study
RTL Generator
• ImpulseC chosen for this study
– ANSI C
– Simple modifications to algorithm to compile for
processor
• Data I/O path
• Word types as simple #defines
– High level of abstraction
• Small learning curve
• Give up low-level control of registers/signals
• Some control over max gate delay using #pragma
– Desktop simulation for fast algorithm debug
Test Environment
• Alpha-Data VirtexII Pro
card on PCI bus
• Simple bus wrapper also
counts clocks to execute
FFT algorithm
• Use Visual C++ to write
high level application
code
[Figure: block diagram with labels: PC, local bus to PCI bridge, FPGA, wrapper, IP]
FFT Structure
• Classic DIT radix 2 structure requires
(N/2)log2(N) butterfly computations
– 5120 for our 1024 test case
• Butterflies evaluated with 3 nested loops:
– Outer walks the stages
– Middle walks the butterflies for each branch
– Inner walks the branches
Butterfly Loop Structure
// nested inside the outer, middle, and inner loops:
// butterfly operation
CMPLX_RD( i, cmplxI );
CMPLX_RD( j, cmplxJ );
tempr = (wr*cmplxJ[REAL] - wi*cmplxJ[IMAG]) >> FAC_SHIFT;
tempi = (wr*cmplxJ[IMAG] + wi*cmplxJ[REAL]) >> FAC_SHIFT;
cmplxJ[REAL] = cmplxI[REAL] - tempr;
cmplxJ[IMAG] = cmplxI[IMAG] - tempi;
cmplxI[REAL] += tempr;
cmplxI[IMAG] += tempi;
CMPLX_WR( i, cmplxI );
CMPLX_WR( j, cmplxJ );
General IP Structure
• Written as FFT coprocessor block with
input / output “stream” model
// stream in N points
// butterfly computation loops (prior page)
// stream out N points
DSP Benchmark
• Clock cycles to complete FFT calculation (time from last data in to first data available): 23848
– Ref “TMS320C55x DSP Library Programmer’s Reference,” TI SPRU422H, Oct 2004
Implementation A
• Direct mapping of classic Decimation in
Time (DIT) algorithm to fixed point code
• Calculation in place using single data
buffer for complex numbers
• Use 2 word arrays for internal
representation of complex numbers
Implementation A Results
• Implementation effort: about 1 week
– About 100 SLOC
• Clocks to complete FFT: 48162, about 2x DSP
• Inner butterfly loop takes 9 clocks
• I/O loops take 4 clocks per point
• Slices: 536 (includes simple bus wrapper)
• Multipliers: 8
• Block RAMs: 2
Implementation B
• Scalarize internal complex number representation to eliminate memory contention:
// int16 cmplxI[2]
// int16 cmplxJ[2]
// becomes
int16 cmplxIReal, cmplxIImag;
int16 cmplxJReal, cmplxJImag;
• Allows simultaneous assignments to real and imaginary parts of complex working variables
• Reads and writes of working variables done with #defines to hide implementation:
// e.g.
#define CMPLX_RD(ofst,dest) dest##Real = dataBuf[ofst]; \
                            dest##Imag = dataBuf[ofst+1]
//
CMPLX_RD( i, cmplxI );
Implementation B Results
• Clocks to complete FFT: 32802, about
1.4x DSP
• Inner butterfly loop takes 6 clocks
– Savings is 3 clocks * 5120 flies = 15360
clocks
• I/O loops take 4 clocks per point
• Slices: 398
• Multipliers: 8
• Block RAMs: 2
Implementation C
• Replace single input data buffer with imag and real buffers
• Allows simultaneous access to re, im parts of data buffer
[Figure: separate realBuf and imagBuf data buffers]
Implementation C Results
• Clocks to complete FFT: 17442, about
0.7x DSP
• Inner butterfly loop takes 3 clocks
– Savings is 3 clocks * 5120 flies = 15360
clocks
• I/O loops now take 3 clocks per point
• Slices: 425
• Multipliers: 8
• Block RAMs: 2
Implementation D
• Examine the decimation-in-frequency (DIF) structure
• After first stage, the data splits into two independent halves that 2 parallel engines can handle
• Could also be DIT
Implementation D
• Note first stage
calculations can
be handled as
data arrives
• Also note last
stage could be
handled as data
leaves
[Figure: pipeline diagram with labels: Input Stage (simple data input, input with butterfly), Main fly Engines, Output stage (add/sub)]
Implementation D
• Implement 2 butterflies in parallel
– double up code, tool worries about parallelism
• Hide first and last butterfly stages by performing butterflies as data arrives/leaves
– Note that last stage is trivial multiplications, so
no FPGA multipliers are required
• Also places twiddles in ROM to lower use
of FPGA multiplier resources
Implementation D Results
• Clocks to complete FFT: 7186, about 0.3x DSP
• Inner butterfly loop still takes 3 clocks
• I/O loops still take 3 clocks per point
• Savings due to parallelism:
– 8 stages*(512/2) flies*3 clocks = 6144 clocks, inner loop
– 2 clocks * (2^n, n=1,2…8) times through loop = 1024 clocks, middle loop
• Savings due to I/O stage butterflies
– 2*512*3 = 3072 clocks
• Slices: 859 (813 w/o bus wrapper)
• Multipliers: 12
• Block RAMs: 8
• Max clock rate: 76MHz, VirtexII Pro
On Size and Power
• Effective area when placed into VirtexII or
Virtex4 FPGAs is on the order of 1/2 to 1/3
that of a DSP based on package sizes and
resource utilization
• Power on the order of 200-250mW for
Virtex4 device (estimated)
• Energy for 1024 point FFT: estimated 42 µJ
– Estimated 32 µJ for DSP
Conclusions / Future Work
• Implementation time extremely short
– 1-2 weeks vs. estimated 3+ months with HDL
– SW approach without need for understanding reg vs wire,
pipelining
• For clock rates to 75MHz, this design is 3x faster than a
DSP
– Trade gate delay for clock rate with available #pragma for
designs in excess of 75MHz
– Use two clock domains: I/O, core
• Other optimizations
– Radix4
– ImpulseC parallel processes
– I/O rate can be improved with 32-bit bus