Document 7601927

Download Report

Transcript Document 7601927

AHPCC/NCSA WORKSHOP
Fast Fourier Transform Using FFTw
Guobin Ma1 ([email protected]),
Sirpa Saarinen2 ([email protected]),
and Paul M. Alsing1 ([email protected]),
1AHPCC, 2NCSA
4/18/00
Spring 2000 FFTw workshop
1
Contents
FFT basic
(Paul)
What is FFT and why FFT
FFTw
Outline of FFTW
(Guobin)
Characteristics
C routines
Performance and C example codes
(Sirpa)
Fortran wrappers and example codes
(Guobin)
Exercises
4/18/00
(skipped)
Spring 2000 FFTw workshop
2
FFT Basic
What is FFT and why FFT
by Paul Alsing
4/18/00
Spring 2000 FFTw workshop
3
Fourier Transform: frequency analysis of time series data.
DFT: Discrete Fourier Transform (N time/freq points)
FFT: Fast Fourier Transform: efficient implementation ~O(Nlog2N)

H  f    ht  e
2 i f t

1
ht  
2



N 1
H n   hk e 2 i k n / N
dt
k 0
H  f e
 2 i f t
d
1
hk 
N
N 1
H e
n 0
 2 i k n / N
n
hk  ht k , t k  kt , k  0,1, , N  1
fn 
4/18/00
n
, n   N / 2, , N / 2  1
Nt
Spring 2000 FFTw workshop
4
Aliasing issues:
Let fc = Nyquist Frequency
= 1/(2t). A sine wave
sampled at fc will be sampled at
2 points, the peak and the trough.
Frequency components f > | fc |
will be falsely folded back into
the range -fc < f < fc.
4/18/00
Spring 2000 FFTw workshop
5
Fourier Transform: radix 2, Danielson-Lanczos
N 1
H n   e 2 i k n / N hk
k 0

N / 2 1
e
2 i  2 k  n / N
k 0

N / 2 1
e
hk 
2 i k n /  N / 2 
k 0
N / 2 1
2 i  2 k 1 n / N
e
hk

k 0
hk  W
N / 2 1
n
2 i k n /  N / 2 
n
2 i / N
e
h
,
W

e

k
k 0
 H ne  W n H no
where H ne is the nth component of the FT of length N/2 formed from the
even components of the original hk ' s; H no is the nth component of the FT
of length N/2 formed from the odd components of the original hk ' s
4/18/00
Spring 2000 FFTw workshop
6
Fourier Transform: radix 2, Danielson-Lanczos (cont.)
H n  H ne  W n H no ,




 H nee  W 2 n H neo  W n H noe  W 2 n H noo ,

W n
H
H
Hn
H
H
H
e
n
, H no
ee
n
 W 4 n H neeo  W 2 n H neoe  W 4 n H neoo
oee
n
 W 4 n H noeo
, H neo , H noe , H noo
,H
2n
ooe
n

,
 W 4 n H noooo
log 2  N  8 steps
is of length N

eee
eoe
n
n
4/18/00
 
 W H
eee
n
are of length N / 2

are of length N / 4
, H noee , H nooe , H neeo , H neoo , H noeo , H nooo
Spring 2000 FFTw workshop

are of length N / 8
7
Fourier Transform: radix 2, butterfly Cooley-Tukey algorithm
We finally get down to 1-point transforms such as
H neoeeoeooee  hm (some input valu e) for some 0  m  N - 1
The question is: which value of m corresponds to which pattern
of e’s and o’s?
The answer is:
Let {e=0,o=1}. Reverse the pattern of e’s and o’s and you will
have the value of m in binary.
4/18/00
Spring 2000 FFTw workshop
8
Bit reversal:
The Cooley-Tukey
algorithm first
rearranges the data
in bit reversed form,
then builds up the
transform in
N log2N iterations
(decimation in time).
4/18/00
eee
eeo
eee
eeo
eoe
eoe
eoo
eoo
oee
oee
oeo
oeo
ooe
ooe
eee
eee
Spring 2000 FFTw workshop
9
Ordering of
time series
(coord space)
and frequencies
in fourier
(momentum)
space.
4/18/00
Spring 2000 FFTw workshop
10
Example Application: Quantum Mechanics
Propagation of (dimensionless) Schrodinger Wave Function
  x, t 
 2  x, t 
i

 V  x, t   x, t   H  x, t   0
2
t
x
 x, t  t   e i H  t x, t ,
H  T V
 e iT  t / 2 e iV  t e i T  t / 2  x, t 
 e iV  x1 ,t t     x1 , t  

 

iV  t
In coordinate space, e
  x, t   

  
 iV  x N ,t t  



e

x
,
t
N




 e i1/ 2 k1 t / 2   ˆ k1 , t  

 

iT  t
  
In fourier (momentum) space, e
ˆ k , t   

 i1/ 2 k N2 t / 2  
e
 ˆ k N , t 


2
11
Serial FFTs
y
x
y
x
4/18/00
Transpose data to keep
x data is contiguous in
memory (Fortran) Spring 2000 FFTw workshop y transforms continguous
12
in memory.
transpose
Parallel FFTs
P0
x
P1 y
P2
P3
P0
P1
x P2
P3
y
In parallel, all x transforms
transpose
are local operations on each
processor (no communication)
In performing the transpose
processors must perform an
All-to-All communication.
Outline of FFTw
By Guobin Ma
4/18/00
Spring 2000 FFTw workshop
14
Characteristics of FFTw
C routines generated by Caml-Light ML
1D/nD, real/complex data
Arbitrary input size, not necessary 2n
Serial/Parallel, Share/Distributed Memory
Faster than all others, high performance
Portable, automatically adapt to machine
4/18/00
Spring 2000 FFTw workshop
15
Two Phases of FFTw
Hardware dependent algorithm
Planner
‘Learn’ the fast way on your machine
Produce a data structure --‘plan’
Reusable
Executor
Compute the transform
Apply to all FFTw operation modes
1D/nD, complex/real, serial/parallel
4/18/00
Spring 2000 FFTw workshop
16
C Routines of FFTw
Routines
1D/nD complex
1D/nD real
Corresponding parallel (MPI) ones
Arguments
Special notes
Data formats
4/18/00
Spring 2000 FFTw workshop
17
1D Complex Transform
Typical call
#include <fftw.h>
…
{
fftw_complex in[N], out[N];
fftw_plan p;
…
p = fftw_create_plan(int n, fftw_direction dir, int flags);
…
fftw_one(p, in, out);
…
fftw_destroy_plan(p);
}
4/18/00
Spring 2000 FFTw workshop
18
1D Complex Transform (cont.)
 Routines
 fftw_plan fftw_create_plan(int n,
fftw_direction dir, int flags);
 void fftw_one(fftw_plan plan, fftw_complex *in,
fftw_complex *out);
 fftw_plan fftw_create_plan_specific(int n,
fftw_direction dir, int flags,
fftw_complex *in, int istride,
fftw_complex *out, int ostride);
4/18/00
Spring 2000 FFTw workshop
19
1D Complex Transform (cont.)
 Routines (cont.)
 void fftw(fftw_plan plan, int howmany,
fftw_complex *in, int istride, int idist ,
fftw_complex *out, int ostride, int odist);
 fftw_destroy_plan(fftw_plan plan);
4/18/00
Spring 2000 FFTw workshop
20
1D Complex Transform (cont.)
 Arguments
 plan: data structure containing all the information
 n: data size
 dir: FFTW_FORWARD (-1), FFTW_BACKWORD (+1)
 flags: FFTW_MESURE, FFTW_ESTIMATE, FFTW_OUT_PLACE,
FFTW_IN_PLACE, FFTW_USE_WISDOM, separated by |
 howmany: number of transforms / input arrays
 in, istride, idist: input arrays, in[i*istride+j*idist]
 out, ostride, odist: output arrays, ...
4/18/00
Spring 2000 FFTw workshop
21
1D Complex Transform (cont.)
 Notes
 out of place (default), in[N], out[N]
 in place, save memory, cost more time
ignore ostride and odist; ignore out
 in-order output, 0 frequency at out[0]
 unnormalized, factor of N
4/18/00
Spring 2000 FFTw workshop
22
nD Complex Transform
Routines, similar to 1D case, except …
fftwnd_plan fftwnd_create_plan(int rank,
const *int n, fftw_direction dir, int flags);
void fftwnd_one(fftwnd_plan plan, , );
fftwnd_plan fftw_create_plan_specific(int rank,
const *int n, fftw_direction dir, , , , , );
void fftwnd(fftwnd_plan plan, , , , , , , );
fftwnd_destroy_plan(fftwnd_plan plan);
4/18/00
Spring 2000 FFTw workshop
23
nD Complex Transform (cont.)
Arguments
rank: dimensionality of the arrays to be transformed
n: pointer to an array of rank - size of each dimension, e.g. n[8,4,5]
row-major for C, column-major for Fortran
Special routines for 2D and 3D cases
nd -> 2d, 3d
n_dim -> nx, ny
4/18/00
or nx, ny, nz
Spring 2000 FFTw workshop
24
1D Real Transform
Routines, similar to 1D complex case, except …
rfftw_plan rfftw_create_plan( , , );
void rfftw_one(rfftw_plan plan, fftw_real *in,
fftw_real *out);
rfftw_plan rfftw_create_plan_specific(int n,
fftw_direction dir, int flags, fftw_real *in,
int istride, fftw_real *out, int ostride);
void rfftw(rfftwnd_plan plan, int howmany,
fftw_real *in, int istride, int idist,
fftw_real *out, int ostride, int odist);
rfftw_destroy_plan(rfftw_plan plan);
4/18/00
Spring 2000 FFTw workshop
25
1D Real Transform (cont.)
Arguments
dir: FFTW_REAL_TO_COMPLEX = FFTW_FORWARD = -1
FFTW_COMPLEX_TO_REAL = FFTW_BACK_WARD = 1
others have the same meaning as before
4/18/00
Spring 2000 FFTw workshop
26
nD Real Transform
Routines, similar to 1D real case, but …
rfftwnd_plan rfftwnd_create_plan(int rank,
const *int n, fftw_direction dir, int flags);
void rfftwnd_one_real_to_complex(rfftwnd_plan plan,
fftw_real *in, fftw_complex *out);
void rfftwnd_one_complex_to_real(rfftwnd_plan plan,
fftw_complex *in, fftw_real *out);
void rfftwnd_real_to_complex(rfftwnd_plan plan,
int howmany, fftw_real *in, int istride,
int idist, fftw_complex *out, int ostride,
int odist);
4/18/00
Spring 2000 FFTw workshop
27
nD Real Transform (cont.)
Routines (cont.)
void rfftwnd_complex_to_real(rfftwnd_plan plan,
int howmany, fftw_complex *in, int istride,
int idist, fftw_real *out, int ostride,
int odist);
rfftwnd_destroy_plan(rfftwnd plan);
Special 2D and 3D routines
4/18/00
Spring 2000 FFTw workshop
28
nD Array Format
nD arrays stored as a single contiguous block
C order, Row-major order
First index most slowly, last most quickly
Fortran order, Column-major order
First index most quickly, last most slowly
Static Array - no problem
Dynamic Array - may have problem in nD case
4/18/00
Spring 2000 FFTw workshop
29
Parallel FFTw
Multi-thread
Skipped
MPI
nD complex
Routines
Notes
Data Layout
1D complex
nD real
4/18/00
Spring 2000 FFTw workshop
30
nD Complex MPI FFTw
Routines, similar to uniprocessor case, except mpi…
fftwnd_mpi_plan fftwnd_create_plan(mpi_comm comm,
int rank, const *int n, fftw_direction dir, int flags);
void
int
int
int
int
fftwnd_mpi_local_size(fftwnd_mpi_plan p,
*local_first, int *local_first_start,
*local_second_after_transpose,
*local_second_start_after_transpose,
*total_local_size);
local_data = (fftw_complex*) malloc(sizeof(fftw_complex)
* total_local_size);
work = (fftw_complex*) malloc(sizeof(fftw_complex)
* total_local_size);
4/18/00
Spring 2000 FFTw workshop
31
nD Complex MPI FFTw (cont.)
Routines (cont.)
void fftwnd_mpi(fftwnd_mpi_plan p, int n_fields,
fftw_complex *local_data, fftw_complex *work,
fftw_mpi_output_order output_order);
void fftw_mpi_destroy_plan(fftwnd_mpi_plan p);
4/18/00
Spring 2000 FFTw workshop
32
nD Complex MPI FFTw (cont.)
Notets
First argument: comm - MPI communicator
Data layout
All fftw_mpi are in-place
work:
Optional,
Same size as local_data,
great efficiency by extra storage
output_order: normal/transposed
transposed: performance improvements, need to reshape the
data manually, may have problem sometimes
4/18/00
Spring 2000 FFTw workshop
33
nD Complex MPI FFTw (cont.)
Data layout
Distributed data
Divided according to row (1st dimension) in C
Divided according to column (last dimension) in Fortran
Given plan, all other parameters regarding to data layout are
determined by fftwnd_mpi_local_size
total_local_size = n1/np*n1*n2…*nk*n_fields
transposed_order: n2 will be the 1st dimension in output
inverse transform n[n2,n1,n3,...,nk]
4/18/00
Spring 2000 FFTw workshop
34
1D Complex MPI FFTw
Routines, similar to nD case, except no nd…
 fftw_mpi_plan fftw_create_plan(mpi_comm comm,
int n, fftw_direction dir, int flags);
 void
int
int
int
int
4/18/00
fftw_mpi_local_size(fftw_mpi_plan p,
*local_n, int *local_n_start,
*local_n_after_transpose,
*local_start_after_transpose,
*total_local_size);
Spring 2000 FFTw workshop
35
1D Complex MPI FFTw (cont.)
Routines (cont.)
 void fftw_mpi(fftw_mpi_plan p, int n_fields,
fftw_complex *local_data, fftw_complex *work,
fftw_mpi_output_order output_order);
 void fftw_mpi_destroy_plan(fftw_mpi_plan p);
Generally worse speedup than nD, fit large size
4/18/00
Spring 2000 FFTw workshop
36
nD Real MPI FFTw
Similar to that for uniprocessor and complex MPI
Speedup 2, save 1/2 space at the expense of more
complicated data format
Can have transposed-order output data
No 1D Real MPI FFTw
4/18/00
Spring 2000 FFTw workshop
37
FFTw Fortran-Callable Wrappers
Routine names, append _f77 in C routine names
 fftw/fftwnd/rfftw/rfttwnd ->
fftw_f77/fftwnd_f77/rfftw_f77/rfttwnd_f77
fftw_mpi/fftwnd_mpi ->
fftw_f77_mpi/fftwnd_f77_mpi
e.g. fftwnd_create_plan(3, n_dim,
FFTW_FORWARD, FFTW_ESTIMATE | FFTW_IN_PLACE)
-> fftwnd_f77_create_plan(plan, 3, n_dim,
FFTW_FORWARD, FFTW_ESTIMATE + FFTW_IN_PLACE)
4/18/00
Spring 2000 FFTw workshop
38
Break
4/18/00
Spring 2000 FFTw workshop
39
FFTw Performance
By Sirpa Saarinen
4/18/00
Spring 2000 FFTw workshop
40
C Example Codes
By Sirpa Saarinen
4/18/00
Spring 2000 FFTw workshop
41
FFTW Fortran Wrappers
and Example Codes
By Guobin Ma
4/18/00
Spring 2000 FFTw workshop
42
FFTw Fortran-Callable Wrappers
Notes
Any function that returns a value is converted into a subroutines with an
additional (first) parameter.
 No null in Fortran, must allocate and pass an array for out.
nD arrays, column-major, Fortran order
plan variables: be declared as integer
Constants
FFTW_FORWARD, FFTW_BACKWARD, FFTW_IN_PLACE, …
separated by ‘+’ instead of ‘|’
In file fortran/fftw_f77.i, fftw_f90.i, fftw_f90_mpi.i
4/18/00
Spring 2000 FFTw workshop
43
Fortran Examples
Source codes at AHPCC (tested on Turing, BB, SGI):
~gbma/workshop/fftw/codes or
http://www.arc.unm.edu/~gbma/Workshop/FFTW/codes
Complex data
1D serial, fftw_1d.f90
1D parallel, fftw_1d_p.f90
nD serial, fftw_3d.f90
nD Parallel
Normal order, fftw_3d_p_n.f90
Transposed order, fftw_3d_p_t.f90
4/18/00
Spring 2000 FFTw workshop
44
Fortran Examples (cont.)
1D case
Input
f ( x)  
N 2
keikx dk
N 2
Forward output F (k )  (0,1,2,..., k ,..., N  1, N , N  1,...  1)
2
2
2
Inverse output f (x )
nD case
Input
f ( x, y, z )   k x k y k z e
i(kx xk y ykz z )
dk x dk y dk z
Forward output F (k x , k y , k z )  (0,1,2,..., k x k y k z ,...,1)
Inverse output
f ( x, y , z )
4/18/00
Spring 2000 FFTw workshop
45
1D Serial Fortran Example
FFTw codes
...
call fftw_f77_create_plan(plan_forward,N,
FFTW_FORWARD, FFTW_ESTIMATE)
call fftw_f77_create_plan(plan_reverse,N,
FFTW_BACKWARD,FFTW_ESTIMATE)
...
call fftw_f77_one(plan_forward,in,out)
...
call fftw_f77_one(plan_reverse,out,in)
...
call fftw_f77_destroy_plan(plan_forward)
call fftw_f77_destroy_plan(plan_reverse)
4/18/00
Spring 2000 FFTw workshop
&
&
46
1D Parallel Fortran Example
FFTw codes
...
call fftw_f77_mpi_create_plan(p_fwd,MPI_COMM_WORLD,N, &
FFTW_FORWARD,FFTW_ESTIMATE)
...
call fftw_f77_mpi_local_sizes(p_fwd, local_n, local_start, &
local_n_after_trans, local_start_after_trans,
total_local_size)
...
allocate( psi_local(0:total_local_size-1) )
...
allocate( work(0:total_local_size-1) )
4/18/00
Spring 2000 FFTw workshop
47
1D Parallel Fortran Example (cont.)
FFTw codes (cont.)
...
call fftw_f77_mpi(p_fwd,1,psi_local,work,USE_WORK)
...
call fftw_f77_mpi_destroy_plan(p_fwd)
...
call fftw_f77_mpi_create_plan(p_rvs,MPI_COMM_WORLD,N, &
FFTW_BACKWARD,FFTW_ESTIMATE)
...
call fftw_f77_mpi(p_rvs,1,psi_local,work,USE_WORK)
...
call fftw_f77_mpi_destroy_plan(p_rvs)
4/18/00
Spring 2000 FFTw workshop
48
nD Serial Fortran Example
FFTw codes
call fftwnd_f77_create_plan(p_fwd,nd,n_dim, &
FFTW_FORWARD,FFTW_ESTIMATE + FFTW_IN_PLACE)
call fftwnd_f77_one(p_fwd,psi,0)
call fftwnd_f77_destroy_plan(p_fwd)
call fftwnd_f77_create_plan(p_rvs,nd,n_dim, &
FFTW_BACKWARD,FFTW_ESTIMATE + FFTW_IN_PLACE)
call fftwnd_f77_one(p_rvs,psi,0)
call fftwnd_f77_destroy_plan(p_rvs)
4/18/00
Spring 2000 FFTw workshop
49
nD Parallel Fortran Example
FFTw codes, normal order, nD local array
n_dim(1)=nx; n_dim(2)=ny; n_dim(3)=nz
call fftwnd_f77_mpi_create_plan(p_fwd,MPI_COMM_WORLD,&
nd,n_dim,FFTW_FORWARD,FFTW_ESTIMATE)
call fftwnd_f77_mpi_local_sizes(p_fwd, local_nlast,
&
local_last_start, local_nlast2_after_trans,
&
local_last2_start_after_trans, total_local_size)
allocate( psi_local(0:nx-1,0:ny-1,0:local_nlast-1) )
allocate( work(0:nx-1,0:ny-1,0:local_nlast-1) )
4/18/00
Spring 2000 FFTw workshop
50
nD Parallel Fortran Example (cont.)
FFTw codes, normal order, nD local array (cont.)
call fftwnd_f77_mpi(p_fwd,1,psi_local,work,USE_WORK,order)
call fftwnd_f77_mpi_destroy_plan(p_fwd)
call fftwnd_f77_mpi_create_plan(p_rvs,MPI_COMM_WORLD, &
nd,n_dim,FFTW_BACKWARD,FFTW_ESTIMATE)
call fftwnd_f77_mpi(p_rvs,1,psi_local,work,USE_WORK,order)
call fftwnd_f77_mpi_destroy_plan(p_rvs)
4/18/00
Spring 2000 FFTw workshop
51
nD Parallel Fortran Example (cont.)
FFTw codes, transposed order, 1D local array
n_dim(1)=nx; n_dim(2)=ny; n_dim(3)=nz
call fftwnd_f77_mpi_create_plan(p_fwd,MPI_COMM_WORLD,&
nd,n_dim,FFTW_FORWARD,FFTW_ESTIMATE)
call fftwnd_f77_mpi_local_sizes(p_fwd, local_nlast,
&
local_last_start, local_nlast2_after_trans,
&
local_last2_start_after_trans, total_local_size)
allocate( psi_local(0:total_local_size-1) )
allocate( work(0:total_local_size-1) )
4/18/00
Spring 2000 FFTw workshop
52
nD Parallel Fortran Example (cont.)
FFTw codes, transposed order, 1D local array (cont.)
call fftwnd_f77_mpi(p_fwd,1,psi_local,work,USE_WORK,order)
call fftwnd_f77_mpi_destroy_plan(p_fwd)
n_dim(1)=nx; n_dim(2)=nz; n_dim(3)=ny
call fftwnd_f77_mpi_create_plan(p_rvs,MPI_COMM_WORLD, &
nd,n_dim,FFTW_BACKWARD,FFTW_ESTIMATE)
call fftwnd_f77_mpi(p_rvs,1,psi_local,work,USE_WORK,order)
call fftwnd_f77_mpi_destroy_plan(p_rvs)
4/18/00
Spring 2000 FFTw workshop
53
nD Parallel Fortran Example (cont.)
Notes
Normal order
Easy to code, ‘low’ performance
Transposed order
‘High’ performance, complicated to code, user reorder data
Use-work
High efficiency, large memory space
4/18/00
Spring 2000 FFTw workshop
54
Run the Examples at AHPCC
Copy files to your directory
cp ~gbma/workshop/fftw/codes/*.* .
Compile
make filename.tur
make filename.bb
make filename.sgi
with link specification -lfftw -lfftw_mpi (only for MPI)
Run
BB: qsub -I -l nodes=2
mpirun -np 2 -machinefile $PBS_NODEFILE filename.bb
Turing: filename.tur
SGI: mpirun -np 2 filename.sgi
4/18/00
Spring 2000 FFTw workshop
55
References
Numerical Recipe (FOTRAN)
by / William T. Vetterling et al., New York : Cambridge
University Press, 1992
Numerical integration
by P. J. Davis & P. Rabinowitz, Waltham, Mass., Blaisdell
Pub. Co. 1967
www.fftw.org
FFTW User’s manual
by M. Frigo & S. G. Johnson
4/18/00
Spring 2000 FFTw workshop
56
Acknowledgement
Brain Baltz
installation of FFTw at AHPCC
running MPI at AHPCC
John Greenfield
setting up the grid access
Andrew Pineda
computer work environment at AHPCC
Brain Smith & Susan Atlas
many stimulated discussions
Many others ...
4/18/00
Spring 2000 FFTw workshop
57