Rader's FFT algorithm acceleration using Maxeler
Author: Tadej Matek
Fourier Transform
● Fourier transform decomposes
a signal into its frequency components
● Used in telecommunications, data
compression, digital signal processing, fast
multiplication of polynomials ...
Source: http://fweb.wallawalla.edu/class-wiki/index.php/DFT_example_using_MATLAB_-_HW11
Fourier Transform and computers
● Transformation: Discrete Fourier Transform
Time: O(n^2)
● Algorithm(s): Fast Fourier Transform (FFT)
(Cooley-Tukey, Bruun’s FFT, Rader’s FFT, Bluestein’s FFT …)
Time: O(n log n)
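To make the O(n^2) figure concrete, here is a minimal Python sketch of the direct DFT evaluation: each of the n outputs is a sum over all n inputs. The function name `naive_dft` and the sign convention exp(-2*pi*i*j*k/n) are assumptions chosen for illustration; they are not taken from the slides.

```python
import cmath

def naive_dft(x):
    """Direct DFT: n outputs, each a sum over n inputs, so O(n^2) work."""
    n = len(x)
    return [
        # 2j is the complex literal; the loop variable j is the input index
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

A unit impulse such as [1, 0, 0, 0] transforms to four components equal to 1 (up to rounding), which is a quick sanity check for any faster implementation.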
Why is the FFT faster than the direct DFT?
● Divide & conquer + properties
of primitive roots
● Primitive root of unity:
Source: http://mathworld.wolfram.com/images/gifs/rootsu.gif
● Conquer step (butterfly):
Rader’s FFT algorithm overview
● Primitive root defined as:
● Bit reversal rev_k(i): reverse the order of the k bits of i
rev_4(3): 3 (base 10) = 0011 (base 2) → 1100 (base 2) = 12 (base 10)
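The bit reversal can be sketched in a few lines of Python. The helper name `rev` and its argument order are assumptions; the slide only gives the notation for reversing the k-bit representation of i.

```python
def rev(i, k):
    """Return the integer whose k-bit binary form is that of i, reversed."""
    r = 0
    for _ in range(k):
        r = (r << 1) | (i & 1)  # append the lowest bit of i to r
        i >>= 1                 # drop that bit from i
    return r

print(rev(3, 4))  # 12, since 0011 reversed is 1100
```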
Example of calculation
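n = 4, k = log2(n) = 2, z = 5, p = 13
Input: 8, 2, 2, 4
Level 1:
i = 0, s = rev_k(i) = 0: (8 + z^0 * 2) % 13 = 10, (2 + z^0 * 4) % 13 = 6
i = 1, s = rev_k(i) = 2: (8 + z^2 * 2) % 13 = 6, (2 + z^2 * 4) % 13 = 11
After level 1: 10, 6, 6, 11
Level 2:
i = 0, s = 0: 3
i = 1, s = 2: 4
i = 2, s = 1: 9
i = 3, s = 3: 3
Output: 3, 4, 9, 3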
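The worked example can be reproduced with the sketch below. It is reconstructed from the numbers on this slide rather than from the author's source code, so the function names (`rev`, `transform`) and the loop structure are assumptions: at each level the array is split into 2^level segments, and segment i combines the two halves of its parent block using z raised to the bit-reversed index of i, modulo p. The result comes out in bit-reversed order, as on the slide.

```python
def rev(i, k):
    """Reverse the k low-order bits of i."""
    r = 0
    for _ in range(k):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def transform(a, z, p):
    """Level-by-level transform modulo p; len(a) must be a power of two."""
    n = len(a)
    k = n.bit_length() - 1              # k = log2(n)
    a = list(a)
    for level in range(1, k + 1):
        seg = n >> level                # length of each output segment
        b = [0] * n
        for i in range(1 << level):     # one segment per value of i
            w = pow(z, rev(i, k), p)    # z^s with s = rev_k(i)
            parent = (i >> 1) * 2 * seg # start of the block being split
            for j in range(seg):
                top = a[parent + j]
                bottom = a[parent + seg + j]
                b[i * seg + j] = (top + w * bottom) % p
        a = b                           # next level needs the updated data
    return a

print(transform([8, 2, 2, 4], z=5, p=13))  # [3, 4, 9, 3]
```

Each level reads the complete output of the previous one, which is why the intermediate row 10, 6, 6, 11 has to exist before the final row can be computed.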
Example: fast multiplication
● How to multiply two large polynomials?
● Basic approach: multiply each
coefficient of the first with each
coefficient of the second -> O(n^2)
● Using FFT: compute the DFT
of both polynomials, multiply them
pointwise in O(n) time and apply the
inverse FFT -> O(n log n)
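The three steps (transform, pointwise product, inverse transform) can be sketched in Python with modular arithmetic. Everything specific here is an assumption for illustration: the prime modulus 257, its primitive root 3, the names `ntt` and `poly_mul`, and the use of a plain recursive radix-2 transform instead of the variant shown on the other slides. The result is exact only while every coefficient of the product stays below the modulus.

```python
P = 257  # assumed prime modulus; P - 1 = 256 is a power of two
G = 3    # a primitive root modulo 257

def ntt(a, z):
    """Recursive radix-2 transform modulo P; z must have order len(a)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], z * z % P)
    odd = ntt(a[1::2], z * z % P)
    out = [0] * n
    w = 1
    for j in range(n // 2):
        t = w * odd[j] % P
        out[j] = (even[j] + t) % P
        out[j + n // 2] = (even[j] - t) % P  # uses z^(n/2) = -1 (mod P)
        w = w * z % P
    return out

def poly_mul(a, b):
    """Multiply coefficient lists a and b (lowest degree first) modulo P."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:                 # pad to the next power of two
        n *= 2
    assert (P - 1) % n == 0, "transform length must divide P - 1"
    z = pow(G, (P - 1) // n, P)     # element of order n
    fa = ntt(a + [0] * (n - len(a)), z)
    fb = ntt(b + [0] * (n - len(b)), z)
    fc = [x * y % P for x, y in zip(fa, fb)]   # O(n) pointwise product
    inv_z = pow(z, P - 2, P)        # modular inverses via Fermat's little theorem
    inv_n = pow(n, P - 2, P)
    c = ntt(fc, inv_z)              # inverse transform, up to the factor 1/n
    return [v * inv_n % P for v in c[:size]]

print(poly_mul([1, 2, 3], [4, 5]))  # [4, 13, 22, 15]
```

The printed list corresponds to (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3.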
Dataflow implementation (1)
[Figure: diagram of the n = 4 example, 8, 2, 2, 4 → 10, 6, 6, 11 → 3, 4, 9, 3, annotated "Data dependency!"]
Kernel needs updated data for each level!
Solution: LMem
Dataflow implementation (2)
[Diagram: CPU, Manager, LMem and Kernel blocks; the input sequence enters and the output sequence leaves; steps labelled (1), (2), (3)]
Call kernel k times
Manager streams data in and out of Kernel
Dataflow implementation (3)
● LMem is accessed in bursts
(for example 384 B; the size
depends on the DFE)
● Good for consecutive calculations
● The powers of z are calculated on the CPU and
written to LMem
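The host/kernel split described on these slides can be mimicked in plain Python. This is only a simulation under assumptions, not MaxCompiler code: `lmem`, `kernel_pass` and `simulate_dataflow` are hypothetical names, a dictionary stands in for LMem, and the per-level update is reconstructed from the n = 4 example (z = 5, p = 13). The powers of z are prepared once on the "CPU" side, and the "kernel" is then called k times over a buffer holding many consecutive sequences.

```python
def rev(i, k):
    """Reverse the k low-order bits of i."""
    r = 0
    for _ in range(k):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def kernel_pass(data, zs, n, level, p):
    """One streamed pass: every length-n sequence in data advances one level."""
    seg = n >> level
    out = [0] * len(data)
    for base in range(0, len(data), n):        # consecutive sequences
        for i in range(1 << level):
            parent = base + (i >> 1) * 2 * seg
            for j in range(seg):
                top = data[parent + j]
                bottom = data[parent + seg + j]
                out[base + i * seg + j] = (top + zs[i] * bottom) % p
    return out

def simulate_dataflow(sequences, z, p):
    """Hypothetical host code: fill 'LMem', call the kernel k times, read back."""
    n = len(sequences[0])
    k = n.bit_length() - 1
    lmem = {
        "data": [x for seq in sequences for x in seq],    # sequences back to back
        "zs": [pow(z, rev(i, k), p) for i in range(n)],   # powers of z from the CPU
    }
    for level in range(1, k + 1):                         # one kernel call per level
        lmem["data"] = kernel_pass(lmem["data"], lmem["zs"], n, level, p)
    flat = lmem["data"]
    return [flat[b:b + n] for b in range(0, len(flat), n)]

print(simulate_dataflow([[8, 2, 2, 4], [8, 2, 2, 4]], z=5, p=13))
```

Storing the sequences back to back means each pass walks the buffer linearly, which is the access pattern that suits burst-oriented memory and consecutive calculations.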
Performance & results (1)
● CPU used for testing: Intel Core 2
Quad Q9400, 2.86 GHz
● DFE used for testing: Maxeler
MAX2336B card
Performance & results (2)
● Conditions: big data, 95% of
run time spent in loops
● Type of experiments: consecutive
calculations, from 10K
up to 10M
● Consecutive calculations for
input sequences of length 32,
64, 128 and 256
Performance & results (3)
Execution time, N = 32, for CPU and DFE
Performance & results (4)
Speedup according to the number of consecutive
calculations for N = 32
Performance & results (5)
Speedup according to the number of consecutive calculations
for N = 64
Performance & results (6)
Speedup according to the number of consecutive
calculations for N = 256
Performance & results (7)
Speedup according to the size of input sequence (for
100K calculations)
Conclusion
● The FFT is one of the most
widely used algorithms today
● A massive speedup is possible,
but it requires many
consecutive calculations
● Power usage: reduced due to
lower frequency (200 MHz vs 2.86 GHz)